diff --git a/src/text/02-introduction.md b/src/text/02-introduction.md
index a85ee21fd2bb4434d96fbfba74f537962ae5829f..03730e9fd09de31db2a8a776a8bf955e02f2f72e 100644
--- a/src/text/02-introduction.md
+++ b/src/text/02-introduction.md
@@ -4,12 +4,12 @@
Today, most computers are equipped with (+^GPU). They provide more and more computing cores and have become fundamental embedded high-performance computing tools. In this context, the number of applications taking advantage of these tools seems low at first glance. The problem is that the development tools are heterogeneous, complex, and strongly dependent on the (+GPU) running the code. Futhark is an experimental, functional, and architecture agnostic language; that is why it seems relevant to study it. It allows generating code allowing a standard sequential execution (on a single-core processor), on (+GPU) (with (+CUDA) and (+OpenCL) backends), on several cores of the same processor (shared memory). To make it a tool that could be used on all high-performance platforms, it lacks support for distributed computing. This work aims to develop a library that can port any Futhark code to an (+MPI) library with as little effort as possible.
-To achieve that, we introduce the meaning of distributed high-performance computing, then what is (+MPI) and Futhark. The (+MPI) specification allows doing distributed computing, and the programming language Futhark allows doing high-performance computing by compiling our program in (+OpenCL), (+CUDA), multicore, and sequential. We decide to implement a library that can parallelize cellular automaton in, one, two or three dimensions. By adding Futhark on top of (+MPI), the programmer will have the possibilities to compile his code in:
+To achieve that, we introduce the meaning of distributed high-performance computing, then present (+MPI) and Futhark. The (+MPI) specification enables distributed computing, and the Futhark language enables high-performance computing by compiling our program to the (+OpenCL), (+CUDA), multicore, or sequential backends. We decide to implement a library that can distribute cellular automata in one, two, or three dimensions. By adding Futhark on top of (+MPI), the programmer can compile his code in:
-* parallelized-sequential mode,
-* parallelized-multicore mode,
-* parallelized-(+OpenCL) mode,
-* parallelized-(+CUDA) mode.
+* distributed-sequential mode,
+* distributed-multicore mode,
+* distributed-(+OpenCL) mode,
+* distributed-(+CUDA) mode.
Finally, we used this library by implementing a cellular automata in each dimension:
diff --git a/src/text/06-mpi-x-futhark.md b/src/text/06-mpi-x-futhark.md
index bce10448785a97b518623ac80c7a08f63d6b63a6..e27fc776b8fe372e7ae547b0301c5047eaee5aa3 100644
--- a/src/text/06-mpi-x-futhark.md
+++ b/src/text/06-mpi-x-futhark.md
@@ -19,11 +19,11 @@ These values are valid for a cellular automaton of dimension two and a Chebyshev
## MPI x Futhark
-Our library allows parallelizing cellular automata automatically so that the programmer only has to write the Futhark function to update his cellular automaton. Our library supports cellular automata of one, two, and three dimensions and with any types of data. The use of the Futhark language allows to quickly update the state of the cellular automaton thanks to the different backend available. Therefore, several modes are available:
+Our library distributes cellular automata automatically so that the programmer only has to write the Futhark function that updates his cellular automaton. Our library supports cellular automata of one, two, and three dimensions, with any type of data. Thanks to the different backends available, the Futhark language makes it possible to update the state of the cellular automaton quickly. Therefore, several modes are available:
-* parallelized-sequential, the Futhark code executes sequentially,
-* parallelized-multicore, the Futhark code executes concurrently to (+POSIX) threads,
-* parallelized-OpenCL/(+CUDA), the Futhark code executes on the graphics card.
+* distributed-sequential, the Futhark code executes sequentially,
+* distributed-multicore, the Futhark code executes concurrently on (+POSIX) threads,
+* distributed-OpenCL/(+CUDA), the Futhark code executes on the graphics card.
### Communication
diff --git a/src/text/07-automate-elementaire.md b/src/text/07-automate-elementaire.md
index f16f82a1dd87c26cb54831074ec0641b686b9767..7e3aef2c3eab518cab064735a412721aecf67129 100644
--- a/src/text/07-automate-elementaire.md
+++ b/src/text/07-automate-elementaire.md
@@ -28,7 +28,7 @@ Iteration 0 is the initial state and only cell two is alive. To perform the next
* the cell (two) stays alive because of rule n°3,
* the cell (three) stays alive because of rule n°6.
-## Parallelized version
+## Distributed version
With the created library, we implement this (+SCA) previously described. To do this, we create a Futhark `elementary.fut` file, which is used to calculate the next state of a part of the cellular automaton.
@@ -96,7 +96,7 @@ The entire cellular automaton is retrieved on the root node by calling the funct
## CPU Benchmark
-We perform benchmarks to validate the scalability of our one-dimensional parallelization when compiling in sequential, multicore, (+OpenCL), or (+CUDA) mode. The benchmarks are performed on the (+HES-GE) cluster (Baobab/Yggdrasil).
+We perform benchmarks to validate the scalability of our one-dimensional distribution when compiling in sequential, multicore, (+OpenCL), or (+CUDA) mode. The benchmarks are performed on the (+HES-GE) cluster (Baobab/Yggdrasil).
The sequential and multicore benchmarks are performed as follows:
* the cellular automaton is $300,000,000$ cells in size,
@@ -114,7 +114,7 @@ The sequential and multicore benchmarks are performed as follows:
| 32 | 20.938 [s] | ± 0.007 [s] | x31.4 | 15 |
| 64 | 11.071 [s] | ± 0.024 [s] | x59.4 | 15 |
| 128 | 5.316 [s] | ± 0.191 [s] | x123.7 | 15 |
-Table: Results for the parallelized-sequential version of (+SCA)
+Table: Results for the distributed-sequential version of (+SCA)
This table contains the results obtained by using the backend `c` of Futhark.
@@ -130,14 +130,14 @@ This table contains the results obtained by using the backend `c` of Futhark.
| 32 | 25.776 [s] | ± 0.725 [s] | x27.5 | 15 |
| 64 | 12.506 [s] | ± 0.554 [s] | x56.7 | 15 |
| 128 | 5.816 [s] | ± 0.045 [s] | x121.8 | 15 |
-Table: Results for the parallelized-multicore version of (+SCA)
+Table: Results for the distributed-multicore version of (+SCA)
This table contains the results obtained by using the backend `multicore` of Futhark.
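The `elementary.fut` update function mentioned above is not reproduced in this diff; the following is only a toy sketch of what such a one-dimensional update rule could look like in Futhark. The entry-point name, the choice of Wolfram rule 110, and the handling of the chunk borders are assumptions for illustration, not the repository's actual code:

```futhark
-- Hypothetical sketch of a 1-D update rule, NOT the repository's elementary.fut.
-- Cells are assumed to hold 0 (dead) or 1 (alive); out-of-range neighbours are
-- treated as dead here, whereas the real library fills them with the values
-- received from the adjacent MPI tasks.
entry next_chunk [n] (chunk: [n]u8) : [n]u8 =
  -- Output of Wolfram rule 110 for the neighbourhood patterns 000 .. 111.
  let rule: [8]u8 = [0, 1, 1, 1, 0, 1, 1, 0]
  in tabulate n (\i ->
       let l = if i == 0     then 0 else chunk[i - 1]
       let c = chunk[i]
       let r = if i == n - 1 then 0 else chunk[i + 1]
       in rule[i64.u8 (4 * l + 2 * c + r)])
```

The same entry point can then be compiled with the `c`, `multicore`, `opencl`, or `cuda` backend of Futhark, which is what gives the distributed modes listed earlier.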
-\cimgl{figs/elem_result_and_speedup_cpu.png}{width=\linewidth}{Benchmarks of the SCA in parallelized-sequential/multicore}{Source: Realized by Baptiste Coudray}{fig:bench-cpu-sca}
+\cimgl{figs/elem_result_and_speedup_cpu.png}{width=\linewidth}{Benchmarks of the SCA in distributed-sequential/multicore}{Source: Realized by Baptiste Coudray}{fig:bench-cpu-sca}
-We compare the average computation time for each number of tasks and each version (sequential and multicore) on the left graph. On the right graph, we compare the ideal speedup with the parallelized-sequential and multicore version speedup.
-The more we increase the number of tasks, the more the execution time is reduced. Thus, the parallelized-sequential or multicore version speedup follows the curve of the ideal speedup. We can see that concurrent computing does not provide a significant performance gain over sequential computing because of the overhead of creating threads.
+We compare the average computation time for each number of tasks and each version (sequential and multicore) on the left graph. On the right graph, we compare the ideal speedup with the speedups of the distributed-sequential and multicore versions.
+As the number of tasks increases, the execution time decreases, and the speedups of the distributed-sequential and multicore versions follow the ideal speedup curve. We can see that concurrent computing does not provide a significant performance gain over sequential computing because of the overhead of creating threads.
\pagebreak
@@ -157,7 +157,7 @@ The (+OpenCL) and (+CUDA) benchmarks are performed as follows:
| 2 | 2 | 83.339 [s] | ± 0.099 [s] | x2.0 | 15 |
| 4 | 4 | 42.122 [s] | ± 0.078 [s] | x3.9 | 15 |
| 8 | 8 | 21.447 [s] | ± 0.031 [s] | x7.7 | 15 |
-Table: Results for the parallelized-(+OpenCL) version of (+SCA)
+Table: Results for the distributed-(+OpenCL) version of (+SCA)
This table contains the results obtained by using the backend `opencl` of Futhark.
@@ -167,14 +167,14 @@ This table contains the results obtained by using the backend `opencl` of Futhar
| 2 | 2 | 80.434 [s] | ± 0.094 [s] | x2.0 | 15 |
| 4 | 4 | 40.640 [s] | ± 0.073 [s] | x3.9 | 15 |
| 8 | 8 | 20.657 [s] | ± 0.046 [s] | x7.8 | 15 |
-Table: Results for the parallelized-(+CUDA) version of (+SCA)
+Table: Results for the distributed-(+CUDA) version of (+SCA)
This table contains the results obtained by using the backend `cuda` of Futhark.
\pagebreak
-\cimgl{figs/elem_result_and_speedup_gpu.png}{width=\linewidth}{Benchmarks of the SCA in parallelized-OpenCL/CUDA}{Source: Realized by Baptiste Coudray}{fig:bench-gpu-sca}
+\cimgl{figs/elem_result_and_speedup_gpu.png}{width=\linewidth}{Benchmarks of the SCA in distributed-OpenCL/CUDA}{Source: Realized by Baptiste Coudray}{fig:bench-gpu-sca}
-With this performance test (\ref{fig:bench-gpu-sca}), we compare the average computation time for each number of tasks/(+^GPU) and each version ((+OpenCL) and (+CUDA)) on the left graph. On the right graph, we compare the ideal speedup with the parallelized-opencl and cuda version speedup. We notice that the computation time is essentially the same in (+OpenCL) as in (+CUDA). Moreover, the parallelization follows the ideal speedup curve. Finally, we notice that parallel computation is up to four times faster than sequential/concurrent computation when executing with a single task/graphical card.
+With this performance test (\ref{fig:bench-gpu-sca}), we compare the average computation time for each number of tasks/(+^GPU) and each version ((+OpenCL) and (+CUDA)) on the left graph. On the right graph, we compare the ideal speedup with the speedups of the distributed-(+OpenCL) and (+CUDA) versions. We notice that the computation time is essentially the same in (+OpenCL) as in (+CUDA). Moreover, the distribution follows the ideal speedup curve. Finally, we notice that parallel computation is up to four times faster than sequential/concurrent computation when executing with a single task/graphics card.
\pagebreak
diff --git a/src/text/08-jeu-de-la-vie.md b/src/text/08-jeu-de-la-vie.md
index 5905b8be5098bdfb3779f5ffe45eefbc3abdfd31..7bdb9a09ebf0b15f41127ce46ba88422ef028660 100644
--- a/src/text/08-jeu-de-la-vie.md
+++ b/src/text/08-jeu-de-la-vie.md
@@ -21,7 +21,7 @@ A basic example is a blinker:
Thus, after the application of the rules, the horizontal line becomes a vertical line. Then, at the next iteration, the vertical line becomes a horizontal line again.
-## Parallelized version
+## Distributed version
We create the game of life with our library to test it with a two-dimensional cellular automaton. The code is relatively the same as the previous example; therefore, it is not explained, but you can find it in the Git repository.
@@ -64,7 +64,7 @@ int main(int argc, char *argv[]) {
## CPU Benchmarks
-We perform benchmarks to validate the scalability of our two-dimensional parallelization when compiling in sequential, multicore, (+OpenCL), or (+CUDA) mode. The benchmarks are performed on the (+HES-GE) cluster (Baobab/Yggdrasil).
+We perform benchmarks to validate the scalability of our two-dimensional distribution when compiling in sequential, multicore, (+OpenCL), or (+CUDA) mode. The benchmarks are performed on the (+HES-GE) cluster (Baobab/Yggdrasil).
The sequential and multicore benchmarks are performed as follows:
@@ -83,7 +83,7 @@ The sequential and multicore benchmarks are performed as follows:
| 32 | 100.422 [s] | ± 0.068 [s] | x34.6 | 15 |
| 64 | 55.986 [s] | ± 1.587 [s] | x62.0 | 15 |
| 128 | 28.111 [s] | ± 0.263 [s] | x123.5 | 15 |
-Table: Results for the parallelized-sequential version of Game of Life
+Table: Results for the distributed-sequential version of Game of Life
This table contains the results obtained by using the backend `c` of Futhark.
@@ -97,13 +97,13 @@ This table contains the results obtained by using the backend `c` of Futhark.
| 32 | 71.463 [s] | ± 0.485 [s] | x30.2 | 15 |
| 64 | 39.116 [s] | ± 0.489 [s] | x55.1 | 15 |
| 128 | 14.008 [s] | ± 0.335 [s] | x153.8 | 15 |
-Table: Results for the parallelized-multicore version of Game of Life
+Table: Results for the distributed-multicore version of Game of Life
This table contains the results obtained by using the backend `multicore` of Futhark.
-\cimgl{figs/gol_result_and_speedup_cpu.png}{width=\linewidth}{Benchmarks of the game of life in parallelized-sequential/multicore}{Source: Realized by Baptiste Coudray}{fig:bench-cpu-gol}
+\cimgl{figs/gol_result_and_speedup_cpu.png}{width=\linewidth}{Benchmarks of the game of life in distributed-sequential/multicore}{Source: Realized by Baptiste Coudray}{fig:bench-cpu-gol}
-We notice an apparent difference between the parallelized-sequential and multicore version when there is only one task. The multicore version is $1.6$ times faster than the sequential version. Nevertheless, both versions have a perfect speedup. The multicore version even gets a maximum speedup of x154 with 128 tasks. This performance can be explained by the caching of data in the processor and the use of threads.
+We notice a clear difference between the distributed-sequential and multicore versions when there is only one task: the multicore version is $1.6$ times faster than the sequential version. Nevertheless, both versions have a perfect speedup. The multicore version even gets a maximum speedup of x154 with 128 tasks. This performance can be explained by the caching of data in the processor and the use of threads.
## GPU Benchmarks
@@ -121,7 +121,7 @@ The (+OpenCL) and (+CUDA) benchmarks are performed as follows:
| 2 | 2 | 115.400 [s] | ± 0.070 [s] | x2.0 | 15 |
| 4 | 4 | 58.019 [s] | ± 0.104 [s] | x4.0 | 15 |
| 8 | 8 | 29.157 [s] | ± 0.061 [s] | x7.9 | 15 |
-Table: Results for the parallelized-(+OpenCL) version of Game of Life
+Table: Results for the distributed-(+OpenCL) version of Game of Life
This table contains the results obtained by using the backend `opencl` of Futhark.
@@ -131,14 +131,14 @@ This table contains the results obtained by using the backend `opencl` of Futhar
| 2 | 2 | 109.598 [s] | ± 0.109 [s] | x2.0 | 15 |
| 4 | 4 | 55.039 [s] | ± 0.100 [s] | x4.0 | 15 |
| 8 | 8 | 27.737 [s] | ± 0.050 [s] | x7.9 | 15 |
-Table: Results for the parallelized-(+CUDA) version of Game of Life
+Table: Results for the distributed-(+CUDA) version of Game of Life
This table contains the results obtained by using the backend `cuda` of Futhark.
\pagebreak
-\cimgl{figs/gol_result_and_speedup_gpu.png}{width=\linewidth}{Benchmarks of the game of life in parallelized-OpenCL/CUDA}{Source: Realized by Baptiste Coudray}{fig:bench-gpu-gol}
+\cimgl{figs/gol_result_and_speedup_gpu.png}{width=\linewidth}{Benchmarks of the game of life in distributed-OpenCL/CUDA}{Source: Realized by Baptiste Coudray}{fig:bench-gpu-gol}
-With this performance test (\ref{fig:bench-gpu-gol}), we notice that the computation time is essentially the same in (+OpenCL) as in (+CUDA). Moreover, the parallelization follows the ideal speedup curve. Furthermore, we notice that parallel computation is up to 15 times faster than sequential/concurrent computation when executing with a single task/graphical card.
+With this performance test (\ref{fig:bench-gpu-gol}), we notice that the computation time is essentially the same in (+OpenCL) as in (+CUDA). Moreover, the distribution follows the ideal speedup curve. Furthermore, we notice that parallel computation is up to 15 times faster than sequential/concurrent computation when executing with a single task/graphics card.
\pagebreak
diff --git a/src/text/09-lattice-boltzmann.md b/src/text/09-lattice-boltzmann.md
index 4fde788a04fd32c84d59b00401b49677cfd6a393..e5dc2e4741fe91a2cc410aa4c5aa6433a36de397 100644
--- a/src/text/09-lattice-boltzmann.md
+++ b/src/text/09-lattice-boltzmann.md
@@ -8,13 +8,13 @@ method originates in a molecular description of a fluid, based on the Boltzmann
physical terms stemming from a knowledge of the interaction between molecules. It is therefore an invaluable tool in fundamental research, as it keeps the cycle between the elaboration of a theory and the formulation of a corresponding numerical model short._" [@latt_palabos_2020]
-## Parallelized version
+## Distributed version
We implement the Lattice-Boltzmann Method with our library to test it with a three-dimensional cellular automaton. Each cell is containing an array of 27 floats.
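To make that data layout concrete, here is a minimal, hypothetical Futhark sketch of such a cell and of a three-dimensional update entry point. It is not the repository's code: the type name, the `omega` parameter, and the placeholder relaxation stand in for the real (+LBM) collision and streaming steps.

```futhark
-- Hypothetical sketch only: each lattice cell carries 27 distribution values
-- (a D3Q27 lattice), matching the "array of 27 floats" mentioned above.
type cell = [27]f32

-- Placeholder update over a 3-D chunk: every distribution is relaxed toward
-- the cell average. A real LBM kernel would instead compute equilibrium
-- distributions and perform the streaming step.
entry next_lbm_chunk [x][y][z] (omega: f32) (chunk: [x][y][z]cell) : [x][y][z]cell =
  map (map (map (\c ->
    let avg = reduce (+) 0f32 c / 27f32
    in map (\f -> f + omega * (avg - f)) c))) chunk
```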
## CPU Benchmark
-We perform benchmarks to validate the scalability of our three-dimensional parallelization when compiling in sequential, multicore, (+OpenCL), or (+CUDA) mode. The benchmarks are performed on the (+HES-GE) cluster (Baobab/Yggdrasil).
+We perform benchmarks to validate the scalability of our three-dimensional distribution when compiling in sequential, multicore, (+OpenCL), or (+CUDA) mode. The benchmarks are performed on the (+HES-GE) cluster (Baobab/Yggdrasil).
The sequential and multicore benchmarks are performed as follows:
* the cellular automaton is $27'000'000$ cells in size,
@@ -32,7 +32,7 @@ The sequential and multicore benchmarks are performed as follows:
| 32 | 41.040 [s] | ± 1.590 [s] | x17.4 | 15 |
| 64 | 22.188 [s] | ± 0.321 [s] | x32.3 | 15 |
| 128 | 17.415 [s] | ± 4.956 [s] | x41.1 | 15 |
-Table: Results for the parallelized-sequential version of (+LBM)
+Table: Results for the distributed-sequential version of (+LBM)
This table contains the results obtained by using the backend `c` of Futhark.
@@ -46,13 +46,13 @@ This table contains the results obtained by using the backend `c` of Futhark.
| 32 | 46.285 [s] | ± 0.138 [s] | x15.0 | 15 |
| 64 | 24.059 [s] | ± 0.061 [s] | x28.9 | 15 |
| 128 | 16.614 [s] | ± 1.088 [s] | x41.9 | 15 |
-Table: Results for the parallelized-multicore version of (+LBM)
+Table: Results for the distributed-multicore version of (+LBM)
This table contains the results obtained by using the backend `multicore` of Futhark.
\pagebreak
-\cimgl{figs/lbm_result_and_speedup_cpu.png}{width=\linewidth}{Benchmarks of the LBM in parallelized-sequential/multicore}{Source: Realized by Baptiste Coudray}{fig:bench-cpu-lbm}
+\cimgl{figs/lbm_result_and_speedup_cpu.png}{width=\linewidth}{Benchmarks of the LBM in distributed-sequential/multicore}{Source: Realized by Baptiste Coudray}{fig:bench-cpu-lbm}
Contrary to the previous benchmarks, the speedups do not follow the ideal speedup curve. Indeed, whether in sequential or multicore, we obtain a maximum speedup with 128 tasks of x41 when we were hoping to have a speedup of x128.
@@ -72,7 +72,7 @@ The (+OpenCL) and (+CUDA) benchmarks are performed as follows:
| 2 | 2 | 99.677 [s] | ± 0.038 [s] | x2.1 | 15 |
| 4 | 4 | 40.710 [s] | ± 0.076 [s] | x5.2 | 15 |
| 8 | 8 | 20.800 [s] | ± 0.031 [s] | x10.1 | 15 |
-Table: Results for the parallelized-(+OpenCL) version of (+LBM)
+Table: Results for the distributed-(+OpenCL) version of (+LBM)
This table contains the results obtained by using the backend `opencl` of Futhark.
@@ -82,11 +82,11 @@ This table contains the results obtained by using the backend `opencl` of Futhar
| 2 | 2 | 99.177 [s] | ± 0.056 [s] | x2.1 | 15 |
| 4 | 4 | 40.240 [s] | ± 0.074 [s] | x5.2 | 15 |
| 8 | 8 | 20.459 [s] | ± 0.037 [s] | x10.2 | 15 |
-Table: Results for the parallelized-(+CUDA) version of (+LBM)
+Table: Results for the distributed-(+CUDA) version of (+LBM)
This table contains the results obtained by using the backend `cuda` of Futhark.
-\cimgl{figs/lbm_result_and_speedup_gpu.png}{width=\linewidth}{Benchmarks of the LBM in parallelized-OpenCL/CUDA}{Source: Realized by Baptiste Coudray}{fig:bench-gpu-gol}
+\cimgl{figs/lbm_result_and_speedup_gpu.png}{width=\linewidth}{Benchmarks of the LBM in distributed-OpenCL/CUDA}{Source: Realized by Baptiste Coudray}{fig:bench-gpu-lbm}
Like the other benchmarks (\ref{fig:bench-gpu-sca}, \ref{fig:bench-gpu-gol}), there is very little difference between the (+OpenCL) and (+CUDA) versions (computation time and speedup).
We get a more than ideal speedup with 2, 4, and 8 tasks/(+^GPU) (x2.1, x5.2, and x10.2, respectively). Finally, we notice that parallel computation is up to 3 times faster than sequential/concurrent computation when executing with a single task/graphical card.
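For the speedup columns reported in the tables above, we assume the usual definition (this is our reading of the tables, not a formula stated in this excerpt): the reference time $t_1$ is the mean time of the single-task run of the same backend, and $t_n$ is the mean time with $n$ tasks,

$$ S_n = \frac{t_1}{t_n}, \qquad S_n^{\text{ideal}} = n. $$

Under this definition, a value above the ideal curve, such as the x10.2 measured with eight tasks/(+^GPU), is a super-linear speedup.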