The simplest non-trivial cellular automaton that can be conceived consists of a one-dimensional grid of cells that can take only two states ("0" or "1"), with a neighborhood consisting, for each cell, of itself and the two cells adjacent to it [@noauthor_automate_2021].
There are $2^3 = 8$ possible configurations (or patterns, rules) of such a neighborhood. For the cellular automaton to work, it is necessary to define, for each of these patterns, what the state of the cell must be at the next generation. The eight rules/configurations are defined as follows:
| Rule n° | East neighbour state | Cell state | West neighbour state | Cell next state |
|:---:|:---:|:---:|:---:|:---:|
Table: Evolution rules for a cell in a one-dimensional cellular automaton
## Example
\cimg{figs/simple_automate.png}{scale=0.5}{First and second state of a SCA}{Source: Created by Baptiste Coudray}
Iteration 0 is the initial state and only cell two is alive. To perform the next iteration:
## Parallelized version
With the created library, we implement the (+SCA) described previously. To do this, we create a Futhark file, `elementary.fut`, which computes the next state of a part of the cellular automaton.
```
let compute_next_elems [n] (chunk_elems :[n]i8) :[]i8 = ...
```
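The body of `compute_next_elems` is elided above. As a purely illustrative sketch (not the project's actual code), the function could be filled in as follows for Wolfram's rule 90, assuming the chunk carries one ghost cell from each neighbouring rank at its ends; the rule number and the `next_state` helper are our own choices for this example:

```
-- Illustrative sketch only: Wolfram rule 90, assuming one ghost cell at each
-- end of the chunk.
let rule : i32 = 90

-- Encode the neighbourhood as a 3-bit pattern and look up the matching bit
-- of the rule number.
let next_state (west: i8) (cell: i8) (east: i8) : i8 =
  let pattern = ((i32.i8 west) << 2) | ((i32.i8 cell) << 1) | (i32.i8 east)
  in i8.i32 ((rule >> pattern) & 1)

-- Compute the next state of every interior cell of the chunk.
let compute_next_elems [n] (chunk_elems: [n]i8) : []i8 =
  tabulate (n - 2) (\i ->
    next_state chunk_elems[i] chunk_elems[i + 1] chunk_elems[i + 2])
```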
Finally, a C file `main.c` is needed to create the program's entry point.
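As an illustration only (not the project's actual `main.c`), a minimal MPI entry point could be structured as shown below; the calls specific to the distribution library are only hinted at in comments, since we do not know their exact names:

```
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Start the MPI runtime and find out who we are. */
    MPI_Init(&argc, &argv);

    int rank, nb_tasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nb_tasks);

    /* Hypothetical outline: the library would split the cellular automaton
       among the ranks, exchange ghost cells with the neighbouring ranks at
       each iteration, and call the Futhark-generated compute_next_elems on
       the local chunk. */

    printf("rank %d of %d done\n", rank, nb_tasks);

    MPI_Finalize();
    return 0;
}
```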
## CPU Benchmark
We perform benchmarks to validate the scalability of our one-dimensional parallelization when compiling in sequential, multicore, (+OpenCL), or (+CUDA) mode. The benchmarks are performed on the (+HES-GE) cluster (Baobab/Yggdrasil).
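Throughout the benchmark tables that follow, we read the reported speedup as the single-task execution time divided by the $n$-task execution time (an assumption on our part, but consistent with the reported values):

$$S(n) = \frac{T_1}{T_n}$$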
The sequential and multicore benchmarks are performed as follows:
* the cellular automaton is $300,000,000$ cells in size,
| Number of tasks | Average [s] | Standard Deviation [s] | Speedup | Number of measurements |
|:---:|:---:|:---:|:---:|:---:|
| 32 | 20.938 [s] | ± 0.007 [s] | x31.4 | 15 |
| 64 | 11.071 [s] | ± 0.024 [s] | x59.4 | 15 |
| 128 | 5.316 [s] | ± 0.191 [s] | x123.7 | 15 |
Table: Results for the parallelized-sequential version of (+SCA)
| Number of tasks | Average [s] | Standard Deviation [s] | Speedup | Number of measurements |
|:---:|:---:|:---:|:---:|:---:|
| 32 | 25.776 [s] | ± 0.725 [s] | x27.5 | 15 |
| 64 | 12.506 [s] | ± 0.554 [s] | x56.7 | 15 |
| 128 | 5.816 [s] | ± 0.045 [s] | x121.8 | 15 |
Table: Results for the parallelized-multicore version of (+SCA)
\pagebreak
We compare the average computation time for each number of tasks and each version (sequential and multicore) on the left graph. On the right graph, we compare the ideal speedup with the speedup of the parallelized-sequential and multicore versions.
The more we increase the number of tasks, the more the execution time is reduced; thus, the speedup of the parallelized-sequential and multicore versions follows the ideal speedup curve. We can also see that concurrent computing does not provide a significant performance gain over sequential computing.
## GPU Benchmark
The (+OpenCL) and (+CUDA) benchmarks are performed as follows:
* the cellular automaton has $300'000'000$ cells,
* the number of tasks varies between $2^0$ and $2^6$.
| Number of tasks | Number of GPUs | Average [s] | Standard Deviation [s] | Speedup | Number of measurements |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 16 | 8 | 31.675 [s] | ± 0.056 [s] | x5.2 | 15 |
| 32 | 8 | 43.65 [s] | ± 0.102 [s] | x3.8 | 15 |
| 64 | 8 | 67.096 [s] | ± 0.118 [s] | x2.5 | 15 |
Table: Results for the parallelized-OpenCL version of (+SCA)
| Number of tasks | Number of GPUs | Average [s] | Standard Deviation [s] | Speedup | Number of measurements |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 16 | 8 | 30.749 [s] | ± 0.069 [s] | x5.2 | 15 |
| 32 | 8 | 42.352 [s] | ± 0.117 [s] | x3.8 | 15 |
| 64 | 8 | 65.228 [s] | ± 0.042 [s] | x2.5 | 15 |
Table: Results for the parallelized-CUDA version of (+SCA)
\pagebreak
\cimg{figs/elem_result_and_speedup_gpu.png}{width=\linewidth}{Benchmarks of the SCA in parallelized-OpenCL/CUDA}{Source: Realized by Baptiste Coudray}
With this performance test, we notice that the computation time is essentially the same in (+OpenCL) as in (+CUDA). The parallelization follows the ideal speedup curve when the number of processes equals the number of graphics cards; however, when the eight graphics cards are shared, the (+OpenCL)/(+CUDA) speedup collapses and the computation time increases. We also notice that parallel computation is up to four times faster than sequential/concurrent computation when executing with a single task and a single graphics card.
\pagebreak
# Game of Life

The Game of Life is a zero-player game designed by John Horton Conway in 1970.
## Example
\cimg{figs/gol_blinker1.png}{scale=0.40}{First state of blinker}{Source: Taken from \url{https://commons.wikimedia.org/}, ref. URL05. Re-created by Baptiste Coudray}
\pagebreak
\cimg{figs/gol_result_and_speedup_gpu.png}{width=\linewidth}{Benchmarks of the Game of Life in parallelized-OpenCL/CUDA}{Source: Realized by Baptiste Coudray}
With this performance test, we notice that the computation time is essentially the same in OpenCL as in CUDA. Moreover, the parallelization follows the ideal speedup curve when the number of processes equals the number of graphics cards. However, when the eight graphics cards are shared, the OpenCL/CUDA speedup stabilizes, and the computation time increases ($+7\:[s]$ between eight tasks and 64 tasks).
\pagebreak
# Lattice-Boltzmann Method
"_The lattice Boltzmann method (LBM) has established itself in the past decades as a valuable approach to Computational
Fluid Dynamics (CFD). It is commonly used to model time-dependent, incompressible or compressible flows in a
......@@ -10,11 +10,11 @@ fundamental research, as it keeps the cycle between the elaboration of a theory
## Parallelized version
We implement the Lattice-Boltzmann Method with our library to test it with a three-dimensional cellular automaton.
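For context, the update that each lattice cell performs in the standard BGK formulation of the method combines streaming and collision; this is the textbook form of the equation, given here for reference only, and it does not necessarily match the exact collision model used in the implementation:

$$f_i(\vec{x} + \vec{c}_i \Delta t,\ t + \Delta t) = f_i(\vec{x}, t) - \frac{\Delta t}{\tau}\Big(f_i(\vec{x}, t) - f_i^{eq}(\vec{x}, t)\Big)$$

where $f_i$ are the distribution functions along the discrete velocities $\vec{c}_i$ and $\tau$ is the relaxation time.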
## CPU Benchmark
We perform benchmarks to validate the scalability of our three-dimensional parallelization when compiling in sequential, multicore, (+OpenCL), or (+CUDA) mode. The benchmarks are performed on the (+HES-GE) cluster (Baobab/Yggdrasil).
The sequential and multicore benchmarks are performed as follows:
* the cellular automaton is $27'000'000$ cells in size,
| Number of tasks | Average [s] | Standard Deviation [s] | Speedup | Number of measurements |
|:---:|:---:|:---:|:---:|:---:|
| 32 | 41.04 [s] | ± 1.59 [s] | x17.4 | 15 |
| 64 | 22.188 [s] | ± 0.321 [s] | x32.3 | 15 |
| 128 | 17.415 [s] | ± 4.956 [s] | x41.1 | 15 |
Table: Results for the parallelized-sequential version of (+LBM)
| Number of tasks | Average [s] | Standard Deviation [s] | Speedup | Number of measurements |
|:---:|:---:|:---:|:---:|:---:|
| 32 | 46.285 [s] | ± 0.138 [s] | x15.0 | 15 |
| 64 | 24.059 [s] | ± 0.061 [s] | x28.9 | 15 |
| 128 | 16.614 [s] | ± 1.088 [s] | x41.9 | 15 |
Table: Results for the parallelized-multicore version of (+LBM)
\pagebreak
\cimg{figs/lbm_result_and_speedup_cpu.png}{width=\linewidth}{Benchmarks of the LBM in parallelized-sequential/multicore}{Source: Realized by Baptiste Coudray}
Contrary to the previous benchmarks, the speedups do not follow the ideal speedup curve. Indeed, whether in sequential or multicore mode, we obtain a maximum speedup of x41 with 128 tasks, whereas we were hoping for a speedup of x128.
## GPU Benchmark
The (+OpenCL) and (+CUDA) benchmarks are performed as follows:
* the cellular automaton has $27'000'000$ cells,
* the number of tasks varies between $2^0$ and $2^6$.
* the iteration is computed $3'000$ times.
* From $2^0$ to $2^3$ tasks, an NVIDIA GeForce RTX 3090 is allocated for each task; beyond that, the eight graphics cards are shared equally among the ranks.
| Number of tasks | Number of GPUs | Average [s] | Standard Deviation [s] | Speedup | Number of measurements |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 1 | 210.347 [s] | ± 0.096 [s] | x1.0 | 15 |
| 2 | 2 | 99.677 [s] | ± 0.038 [s] | x2.1 | 15 |
| 4 | 4 | 40.71 [s] | ± 0.076 [s] | x5.2 | 15 |
| 8 | 8 | 20.8 [s] | ± 0.031 [s] | x10.1 | 15 |
| 16 | 8 | 22.88 [s] | ± 0.064 [s] | x9.2 | 15 |
| 32 | 8 | 22.47 [s] | ± 0.036 [s] | x9.4 | 15 |
| 64 | 8 | 23.848 [s] | ± 0.035 [s] | x8.8 | 15 |
Table: Results for the parallelized-OpenCL version of (+LBM)
| Number of tasks | Number of GPUs | Average [s] | Standard Deviation [s] | Speedup | Number of measurements |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 1 | 207.683 [s] | ± 0.249 [s] | x1.0 | 15 |
| 2 | 2 | 99.177 [s] | ± 0.056 [s] | x2.1 | 15 |
| 4 | 4 | 40.24 [s] | ± 0.074 [s] | x5.2 | 15 |
| 8 | 8 | 20.459 [s] | ± 0.037 [s] | x10.2 | 15 |
| 16 | 8 | 22.837 [s] | ± 0.037 [s] | x9.1 | 15 |
| 32 | 8 | 22.361 [s] | ± 0.024 [s] | x9.3 | 15 |
| 64 | 8 | 23.688 [s] | ± 0.051 [s] | x8.8 | 15 |
Table: Results for the parallelized-CUDA version of (+LBM)
\cimg{figs/lbm_result_and_speedup_gpu.png}{width=\linewidth}{Benchmarks of the LBM in parallelized-OpenCL/CUDA}{Source: Realized by Baptiste Coudray}
As in the other benchmarks, there is very little difference between the OpenCL and CUDA versions, in both computation time and speedup. We get a better-than-ideal speedup with 2, 4, and 8 tasks/GPUs (x2.1, x5.2, and x10.2, respectively). When several tasks share the same graphics card, the computation time stabilizes at around 22 seconds and the speedup stops increasing.
\pagebreak
# Conclusion
In this project, we created a library that distributes a one-, two-, or three-dimensional cellular automaton over several computation nodes via MPI. Thanks to the different Futhark backends, the cellular automaton can be updated with sequential, concurrent, or parallel computation. We compared these modes by implementing a cellular automaton in one dimension ((+SCA)), in two dimensions (Game of Life), and in three dimensions ((+LBM)). Benchmarks were performed for each backend to verify the scalability of the library. With the sequential and multicore Futhark backends, we obtained ideal speedups for the one- and two-dimensional cellular automata; for the three-dimensional cellular automaton, the maximum speedup was x41 with 128 tasks. The OpenCL and CUDA backends show no performance difference between them, and for all three cellular automata the speedup is ideal only when the number of tasks equals the number of GPUs.
Finally, the library could be improved to obtain an ideal speedup in three dimensions with the CPU backends. Future work also includes load balancing of the graphics cards, to obtain better performance when there are more tasks than GPUs, and support for the von Neumann neighborhood, to handle other cellular automata.
\pagebreak
\newacronym{IO}{I/O}{Input/Output}
\newacronym{MSS}{MSS}{Maximum Segment Sum}
\newacronym{SCA}{SCA}{Simple Cellular Automaton}
\newacronym{LBM}{LBM}{Lattice-Boltzmann Method}