From 79720610b473a03e9893704c2eb053b2daa5f146 Mon Sep 17 00:00:00 2001 From: "baptiste.coudray" <baptiste.coudray@etu.hesge.ch> Date: Mon, 16 Aug 2021 13:35:30 +0200 Subject: [PATCH] Updated doc --- src/text/01-references.md | 5 ++--- src/text/02-introduction.md | 4 ++-- src/text/03-programmation-parallele.md | 2 ++ src/text/07-automate-elementaire.md | 13 +++++++++---- src/text/08-jeu-de-la-vie.md | 20 +++++++++++++++----- src/text/09-lattice-boltzmann.md | 26 +++++++++++++++++--------- src/text/ZZ-glossaire.tex | 1 + 7 files changed, 48 insertions(+), 23 deletions(-) diff --git a/src/text/01-references.md b/src/text/01-references.md index 9483bd7..1a72da8 100644 --- a/src/text/01-references.md +++ b/src/text/01-references.md @@ -14,9 +14,8 @@ \multicolumn{1}{l}{URL01} & \multicolumn{1}{l}{\url{https://commons.wikimedia.org/wiki/File:AmdahlsLaw.svg}} \\ \multicolumn{1}{l}{URL02} & \multicolumn{1}{l}{\url{https://commons.wikimedia.org/wiki/File:Gustafson.png}} \\ \multicolumn{1}{l}{URL03} & \multicolumn{1}{l}{\url{https://commons.wikimedia.org/wiki/File:Elder_futhark.png}} \\ -\multicolumn{1}{l}{URL04} & \multicolumn{1}{l}{\url{https://futhark-lang.org/images/mss.svg}} \\ -\multicolumn{1}{l}{URL05} & \multicolumn{1}{l}{\url{https://commons.wikimedia.org/wiki/File:Gol-blinker1.png}} \\ -\multicolumn{1}{l}{URL06} & \multicolumn{1}{l}{\url{https://commons.wikimedia.org/wiki/File:Gol-blinker2.png}} \\ +\multicolumn{1}{l}{URL04} & \multicolumn{1}{l}{\url{https://commons.wikimedia.org/wiki/File:Gol-blinker1.png}} \\ +\multicolumn{1}{l}{URL05} & \multicolumn{1}{l}{\url{https://commons.wikimedia.org/wiki/File:Gol-blinker2.png}} \\ \end{tabular} \pagebreak diff --git a/src/text/02-introduction.md b/src/text/02-introduction.md index bdae309..a85ee21 100644 --- a/src/text/02-introduction.md +++ b/src/text/02-introduction.md @@ -4,7 +4,7 @@ Today, most computers are equipped with (+^GPU). 
They provide more and more computing cores and have become fundamental embedded high-performance computing tools. In this context, the number of applications taking advantage of these tools seems low at first glance. The problem is that the development tools are heterogeneous, complex, and strongly dependent on the (+GPU) running the code. Futhark is an experimental, functional, and architecture agnostic language; that is why it seems relevant to study it. It allows generating code allowing a standard sequential execution (on a single-core processor), on (+GPU) (with (+CUDA) and (+OpenCL) backends), on several cores of the same processor (shared memory). To make it a tool that could be used on all high-performance platforms, it lacks support for distributed computing. This work aims to develop a library that can port any Futhark code to an (+MPI) library with as little effort as possible.
 
-To achieve that, we introduce the meaning of distributed high-performance computing, then what is (+MPI) and Futhark. We decide to implement a library that can parallelize cellular automaton in, one, two or three dimensions. By adding Futhark on top of (+MPI), the programmer will have the possibilities to compile his code in:
+To achieve that, we introduce the meaning of distributed high-performance computing, then what (+MPI) and Futhark are. The (+MPI) specification allows doing distributed computing, and the Futhark programming language allows doing high-performance computing by compiling our program with the (+OpenCL), (+CUDA), multicore, or sequential backends. We decide to implement a library that can parallelize a cellular automaton in one, two, or three dimensions. 
By adding Futhark on top of (+MPI), the programmer will have the possibility to compile his code in:
 
 * parallelized-sequential mode,
 * parallelized-multicore mode,
@@ -23,7 +23,7 @@ The leading resources we used to carry out this project were Futhark and (+MPI)
 
 ## Working method {-}
 
-During this project, we use Git and put the source code on the Gitlab platform of HEPIA:
+During this project, we use Git and put the source code on the GitLab platform of (+HEPIA):
 
 * Source code of the library with usage examples
 * https://gitedu.hesge.ch/baptiste.coudray/projet-de-bachelor
diff --git a/src/text/03-programmation-parallele.md b/src/text/03-programmation-parallele.md
index 70e1a67..bf47a26 100644
--- a/src/text/03-programmation-parallele.md
+++ b/src/text/03-programmation-parallele.md
@@ -8,4 +8,6 @@ Oracle's multithreading programming guide [@oracle_chapter_2010], define *concur
 Finally, parallel computing exploits the computing power of a graphics card thanks to the thousands of cores it has. This allows a gain in performance compared to the previously described calculation methods because the operations are performed simultaneously on the different cores.
 
+So, _Distributed High-Performance Computing_ means distributing a program across multiple networked computers and programming the software using sequential, concurrent, or parallel computing.
+
 \pagebreak
diff --git a/src/text/07-automate-elementaire.md b/src/text/07-automate-elementaire.md
index 5eb4f45..c04bedd 100644
--- a/src/text/07-automate-elementaire.md
+++ b/src/text/07-automate-elementaire.md
@@ -42,7 +42,8 @@ entry next_chunk_elems [n] (chunk_elems :[n]i8) :[]i8 =
 ```
 
 Therefore, the `elementary.fut` file contains only a function that applies the rules on the cellular automaton.
-Note that the function returns the cellular automaton without the envelope.
+As we can see, `next_chunk_elems` is the primary function that takes a chunk of the cellular automaton as a parameter. 
+This function applies the rules defined before on every cell and returns the new value of each cell without the envelope. ```c void init_chunk_elems(chunk_info_t *ci) { @@ -115,7 +116,7 @@ The sequential and multicore benchmarks are performed as follows: | 128 | 5.316 [s] | ± 0.191 [s] | x123.7 | 15 | Table: Results for the parallelized-sequential version of (+SCA) -This table contains the results obtained by using the backend `sequential C` of Futhark. +This table contains the results obtained by using the backend `c` of Futhark. \pagebreak @@ -155,7 +156,9 @@ The (+OpenCL) and (+CUDA) benchmarks are performed as follows: | 2 | 2 | 83.339 [s] | ± 0.099 [s] | x2.0 | 15 | | 4 | 4 | 42.122 [s] | ± 0.078 [s] | x3.9 | 15 | | 8 | 8 | 21.447 [s] | ± 0.031 [s] | x7.7 | 15 | -Table: Results for the parallelized-OpenCL version of (+SCA) +Table: Results for the parallelized-(+OpenCL) version of (+SCA) + +This table contains the results obtained by using the backend `opencl` of Futhark. | Number of tasks | Number of GPUs | Average [s] | Standard Derivation [s] | Speedup | Number of measures | |:---:|:---:|:---:|:---:|:---:|:---:| @@ -163,7 +166,9 @@ Table: Results for the parallelized-OpenCL version of (+SCA) | 2 | 2 | 80.434 [s] | ± 0.094 [s] | x2.0 | 15 | | 4 | 4 | 40.640 [s] | ± 0.073 [s] | x3.9 | 15 | | 8 | 8 | 20.657 [s] | ± 0.046 [s] | x7.8 | 15 | -Table: Results for the parallelized-CUDA version of (+SCA) +Table: Results for the parallelized-(+CUDA) version of (+SCA) + +This table contains the results obtained by using the backend `cuda` of Futhark. \pagebreak diff --git a/src/text/08-jeu-de-la-vie.md b/src/text/08-jeu-de-la-vie.md index 8b61dcf..3a6d3c3 100644 --- a/src/text/08-jeu-de-la-vie.md +++ b/src/text/08-jeu-de-la-vie.md @@ -9,7 +9,7 @@ The Game of Life is a zero-player game designed by John Horton Conway in 1970. 
I ## Example with the blinker -\cimg{figs/gol_blinker1.png}{scale=0.40}{First state of blinker}{Source: Taken from \url{https://commons.wikimedia.org/}, ref. URL05. Re-created by Baptiste Coudray} +\cimg{figs/gol_blinker1.png}{scale=0.40}{First state of blinker}{Source: Taken from \url{https://commons.wikimedia.org/}, ref. URL04. Re-created by Baptiste Coudray} A basic example is a blinker: @@ -17,7 +17,7 @@ A basic example is a blinker: * the cell (zero, two) and (two, two) are born because they have three living neighbors (rule n°3), * the cell (one, two) stays alive because it has two living neighbors (rule n°4). -\cimg{figs/gol_blinker2.png}{scale=0.40}{Second state of blinker}{Source: Taken from \url{https://commons.wikimedia.org/}, ref. URL06. Re-created by Baptiste Coudray} +\cimg{figs/gol_blinker2.png}{scale=0.40}{Second state of blinker}{Source: Taken from \url{https://commons.wikimedia.org/}, ref. URL05. Re-created by Baptiste Coudray} Thus, after the application of the rules, the horizontal line becomes a vertical line. Then, at the next iteration, the vertical line becomes a horizontal line again. @@ -85,6 +85,8 @@ The sequential and multicore benchmarks are performed as follows: | 128 | 28.111 [s] | ± 0.263 [s] | x123.5 | 15 | Table: Results for the parallelized-sequential version of Game of Life +This table contains the results obtained by using the backend `c` of Futhark. + | Number of tasks | Average [s] | Standard Derivation [s] | Speedup | Number of measures | |:---:|:---:|:---:|:---:|:---:| | 1 | 2154.686 [s] | ± 198.122 [s] | x1.0 | 15 | @@ -97,6 +99,8 @@ Table: Results for the parallelized-sequential version of Game of Life | 128 | 14.008 [s] | ± 0.335 [s] | x153.8 | 15 | Table: Results for the parallelized-multicore version of Game of Life +This table contains the results obtained by using the backend `multicore` of Futhark. 
+ \cimgl{figs/gol_result_and_speedup_cpu.png}{width=\linewidth}{Benchmarks of the game of life in parallelized-sequential/multicore}{Source: Realized by Baptiste Coudray}{fig:bench-cpu-gol} We notice an apparent difference between the parallelized-sequential and multicore version when there is only one task. The multicore version is $1.6$ times faster than the sequential version. Nevertheless, both versions have a perfect speedup. The multicore version even gets a maximum speedup of x154 with 128 tasks. This performance can be explained by the caching of data in the processor and the use of threads. @@ -117,7 +121,9 @@ The (+OpenCL) and (+CUDA) benchmarks are performed as follows: | 2 | 2 | 115.400 [s] | ± 0.070 [s] | x2.0 | 15 | | 4 | 4 | 58.019 [s] | ± 0.104 [s] | x4.0 | 15 | | 8 | 8 | 29.157 [s] | ± 0.061 [s] | x7.9 | 15 | -Table: Results for the parallelized-OpenCL version of Game of Life +Table: Results for the parallelized-(+OpenCL) version of Game of Life + +This table contains the results obtained by using the backend `opencl` of Futhark. | Number of tasks | Number of GPUs | Average [s] | Standard Derivation [s] | Speedup | Number of measures | |:---:|:---:|:---:|:---:|:---:|:---:| @@ -125,10 +131,14 @@ Table: Results for the parallelized-OpenCL version of Game of Life | 2 | 2 | 109.598 [s] | ± 0.109 [s] | x2.0 | 15 | | 4 | 4 | 55.039 [s] | ± 0.100 [s] | x4.0 | 15 | | 8 | 8 | 27.737 [s] | ± 0.050 [s] | x7.9 | 15 | -Table: Results for the parallelized-CUDA version of Game of Life +Table: Results for the parallelized-(+CUDA) version of Game of Life + +This table contains the results obtained by using the backend `cuda` of Futhark. + +\pagebreak \cimgl{figs/gol_result_and_speedup_gpu.png}{width=\linewidth}{Benchmarks of the game of life in parallelized-OpenCL/CUDA}{Source: Realized by Baptiste Coudray}{fig:bench-gpu-gol} -With this performance test (\ref{fig:bench-gpu-gol}), we notice that the computation time is essentially the same in OpenCL as in CUDA. 
Moreover, the parallelization follows the ideal speedup curve. Furthermore, we notice that parallel computation is up to 15 times faster than sequential/concurrent computation when executing with a single task/graphical card. +With this performance test (\ref{fig:bench-gpu-gol}), we notice that the computation time is essentially the same in (+OpenCL) as in (+CUDA). Moreover, the parallelization follows the ideal speedup curve. Furthermore, we notice that parallel computation is up to 15 times faster than sequential/concurrent computation when executing with a single task/graphical card. \pagebreak diff --git a/src/text/09-lattice-boltzmann.md b/src/text/09-lattice-boltzmann.md index 746edb7..4fde788 100644 --- a/src/text/09-lattice-boltzmann.md +++ b/src/text/09-lattice-boltzmann.md @@ -26,26 +26,30 @@ The sequential and multicore benchmarks are performed as follows: |:---:|:---:|:---:|:---:|:---:| | 1 | 716.133 [s] | ± 5.309 [s] | x1.0 | 15 | | 2 | 363.166 [s] | ± 3.482 [s] | x2.0 | 15 | -| 4 | 185.43 [s] | ± 0.847 [s] | x3.9 | 15 | +| 4 | 185.430 [s] | ± 0.847 [s] | x3.9 | 15 | | 8 | 93.994 [s] | ± 0.566 [s] | x7.6 | 15 | | 16 | 81.266 [s] | ± 8.947 [s] | x8.8 | 15 | -| 32 | 41.04 [s] | ± 1.590 [s] | x17.4 | 15 | +| 32 | 41.040 [s] | ± 1.590 [s] | x17.4 | 15 | | 64 | 22.188 [s] | ± 0.321 [s] | x32.3 | 15 | | 128 | 17.415 [s] | ± 4.956 [s] | x41.1 | 15 | Table: Results for the parallelized-sequential version of (+LBM) +This table contains the results obtained by using the backend `c` of Futhark. 
+ | Number of tasks | Average [s] | Standard Derivation [s] | Speedup | Number of measures | |:---:|:---:|:---:|:---:|:---:| | 1 | 695.675 [s] | ± 8.867 [s] | x1.0 | 15 | | 2 | 352.925 [s] | ± 4.293 [s] | x2.0 | 15 | | 4 | 181.736 [s] | ± 0.695 [s] | x3.8 | 15 | | 8 | 237.983 [s] | ± 0.271 [s] | x2.9 | 15 | -| 16 | 79.36 [s] | ± 2.185 [s] | x8.8 | 15 | +| 16 | 79.360 [s] | ± 2.185 [s] | x8.8 | 15 | | 32 | 46.285 [s] | ± 0.138 [s] | x15.0 | 15 | | 64 | 24.059 [s] | ± 0.061 [s] | x28.9 | 15 | | 128 | 16.614 [s] | ± 1.088 [s] | x41.9 | 15 | Table: Results for the parallelized-multicore version of (+LBM) +This table contains the results obtained by using the backend `multicore` of Futhark. + \pagebreak \cimgl{figs/lbm_result_and_speedup_cpu.png}{width=\linewidth}{Benchmarks of the LBM in parallelized-sequential/multicore}{Source: Realized by Baptiste Coudray}{fig:bench-cpu-lbm} @@ -66,20 +70,24 @@ The (+OpenCL) and (+CUDA) benchmarks are performed as follows: |:---:|:---:|:---:|:---:|:---:|:---:| | 1 | 1 | 210.347 [s] | ± 0.096 [s] | x1.0 | 15 | | 2 | 2 | 99.677 [s] | ± 0.038 [s] | x2.1 | 15 | -| 4 | 4 | 40.71 [s] | ± 0.076 [s] | x5.2 | 15 | -| 8 | 8 | 20.8 [s] | ± 0.031 [s] | x10.1 | 15 | -Table: Results for the parallelized-OpenCL version of (+LBM) +| 4 | 4 | 40.710 [s] | ± 0.076 [s] | x5.2 | 15 | +| 8 | 8 | 20.800 [s] | ± 0.031 [s] | x10.1 | 15 | +Table: Results for the parallelized-(+OpenCL) version of (+LBM) + +This table contains the results obtained by using the backend `opencl` of Futhark. 
| Number of tasks | Number of GPUs | Average [s] | Standard Derivation [s] | Speedup | Number of measures | |:---:|:---:|:---:|:---:|:---:|:---:| | 1 | 1 | 207.683 [s] | ± 0.249 [s] | x1.0 | 15 | | 2 | 2 | 99.177 [s] | ± 0.056 [s] | x2.1 | 15 | -| 4 | 4 | 40.24 [s] | ± 0.074 [s] | x5.2 | 15 | +| 4 | 4 | 40.240 [s] | ± 0.074 [s] | x5.2 | 15 | | 8 | 8 | 20.459 [s] | ± 0.037 [s] | x10.2 | 15 | -Table: Results for the parallelized-CUDA version of (+LBM) +Table: Results for the parallelized-(+CUDA) version of (+LBM) + +This table contains the results obtained by using the backend `cuda` of Futhark. \cimgl{figs/lbm_result_and_speedup_gpu.png}{width=\linewidth}{Benchmarks of the LBM in parallelized-OpenCL/CUDA}{Source: Realized by Baptiste Coudray}{fig:bench-gpu-gol} -Like the other benchmarks (\ref{fig:bench-gpu-elem}, \ref{fig:bench-gpu-gol}), there is very little difference between the OpenCL and CUDA versions (computation time and speedup). We get a more than ideal speedup with 2, 4, and 8 tasks/GPUs (x2.1, x5.2, and x10.2, respectively). Finally, we notice that parallel computation is up to 3 times faster than sequential/concurrent computation when executing with a single task/graphical card. +Like the other benchmarks (\ref{fig:bench-gpu-sca}, \ref{fig:bench-gpu-gol}), there is very little difference between the (+OpenCL) and (+CUDA) versions (computation time and speedup). We get a more than ideal speedup with 2, 4, and 8 tasks/(+^GPU) (x2.1, x5.2, and x10.2, respectively). Finally, we notice that parallel computation is up to 3 times faster than sequential/concurrent computation when executing with a single task/graphical card. 
\pagebreak diff --git a/src/text/ZZ-glossaire.tex b/src/text/ZZ-glossaire.tex index 4de72d9..b1c2a88 100644 --- a/src/text/ZZ-glossaire.tex +++ b/src/text/ZZ-glossaire.tex @@ -13,6 +13,7 @@ \newacronym{CUDA}{CUDA}{Compute Unified Device Architecture} \newacronym{OpenCL}{OpenCL}{Open Computing Language} \newacronym{HES-GE}{HES-GE}{Haute École Spécialisée de GEnève} +\newacronym{HEPIA}{HEPIA}{Haute École du Paysage, d'Ingénierie et d'Architecture de Genève} \newacronym{IO}{I/O}{Input/Output} \newacronym{MSS}{MSS}{Maximum Segment Sum} \newacronym{SCA}{SCA}{Simple Cellular Automaton} -- GitLab