From 3b7a7d26fef41573a52895ebb8e5555e59584a6f Mon Sep 17 00:00:00 2001
From: "baptiste.coudray" <baptiste.coudray@etu.hesge.ch>
Date: Mon, 16 Aug 2021 16:39:34 +0200
Subject: [PATCH] Updated doc

---
 src/text/07-automate-elementaire.md | 29 +++++++++++++++++++----------
 src/text/08-jeu-de-la-vie.md        |  6 ++++--
 src/text/09-lattice-boltzmann.md    |  4 ++--
 src/text/ZZ-glossaire.tex           | 15 ++++++++-------
 4 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/src/text/07-automate-elementaire.md b/src/text/07-automate-elementaire.md
index 7e3aef2..3fac0db 100644
--- a/src/text/07-automate-elementaire.md
+++ b/src/text/07-automate-elementaire.md
@@ -90,11 +90,24 @@ int main(int argc, char *argv[]) {
 }
 ```
 
-Finally, a C file `main.c` is needed to create the program's entry point. We initialize the MPI and Futhark environment. Then, our library, via the function `dispatch_context_new`, by specifying the size of the cellular automaton (600), its data type by using a predefined type in MPI (`MPI_INT8_T`), and the number of dimensions (one in this case). The function `get_chunk_info` is called to get the chunk of the cellular automaton attributed to the current rank. Thus, we initiate it with the values that we want with the function `init_chunk_elems` We call our Futhark function `futhark_entry_next_chunk_elems` to compute the new values. Given that the function is asynchronous, the function `futhark_context_sync` waits for the action to finish. Thus, we retrieve the new values with `futhark_values_i8_1d`.
+Finally, a C file `main.c` is needed to create the program's entry point. We initialize the MPI and Futhark environments. Then, we initialize our library via the function `dispatch_context_new`, specifying the size of the cellular automaton (600), its data type as a predefined MPI type (`MPI_INT8_T`), and the number of dimensions (one in this case). It returns a dispatch context that must be provided when calling the other functions of our library.
-The entire cellular automaton is retrieved on the root node by calling the function `get_data`. The variable `sca` is not `NULL` if it is the root node.
+The function `get_chunk_info` is called to get the chunk of the cellular automaton attributed to the current rank. Then, we initialize it with the desired values using the function `init_chunk_elems`.
 
-## CPU Benchmark
+In the temporal loop, we update our chunk with the function `compute_next_chunk_elems` that we created. In this function, we call the (+API) function `get_chunk_with_envelope` from our library with the following parameters:
+
+* the dispatch context, obtained from `dispatch_context_new`,
+* the Futhark context, obtained from `futhark_context_new`,
+* the Chebyshev distance,
+* and a pointer to a Futhark function that converts a C array to a Futhark array.
+
+Our (+API) function handles the (+MPI) communication that exchanges each chunk's missing neighbors between processes in order to build the envelope of the current process's chunk.
+
+Then, we call our Futhark function `futhark_entry_next_chunk_elems` to compute the new values. Since the call is asynchronous, the function `futhark_context_sync` waits for it to finish. Finally, we retrieve the new values with `futhark_values_i8_1d`.
+
+Once the temporal loop is done, the entire cellular automaton can be retrieved on the root node by calling the function `get_data` on every task. The variable `sca` is not `NULL` only on the root node, where it points to the entire cellular automaton.
+
+## CPU Benchmarks
 
 We perform benchmarks to validate the scalability of our one-dimensional distribution when compiling in sequential, multicore, (+OpenCL), or (+CUDA) mode. The benchmarks are performed on the (+HES-GE) cluster (Baobab/Yggdrasil). The sequential and multicore benchmarks are performed as follows:
@@ -118,8 +131,6 @@
 
 Table: Results for the distributed-sequential version of (+SCA)
 
 This table contains the results obtained by using the backend `c` of Futhark.
 
-\pagebreak
-
 | Number of tasks | Average [s] | Standard Derivation [s] | Speedup | Number of measures |
 |:---:|:---:|:---:|:---:|:---:|
 | 1 | 708.689 [s] | ± 16.036 [s] | x1.0 | 15 |
@@ -134,14 +145,14 @@
 
 Table: Results for the distributed-multicore version of (+SCA)
 
 This table contains the results obtained by using the backend `multicore` of Futhark.
 
+\pagebreak
+
 \cimgl{figs/elem_result_and_speedup_cpu.png}{width=\linewidth}{Benchmarks of the SCA in distributed-sequential/multicore}{Source: Realized by Baptiste Coudray}{fig:bench-cpu-sca}
 
 We compare the average computation time for each number of tasks and each version (sequential and multicore) on the left graph. On the right graph, we compare the ideal speedup with the distributed-sequential and multicore version speedup. The more we increase the number of tasks, the more the execution time is reduced. Thus, the distributed-sequential or multicore version speedup follows the curve of the ideal speedup. We can see that concurrent computing does not provide a significant performance gain over sequential computing because of the overhead of creating threads.
 
-\pagebreak
-
-## GPU Benchmark
+## GPU Benchmarks
 
 The (+OpenCL) and (+CUDA) benchmarks are performed as follows:
 
@@ -171,8 +182,6 @@
 
 Table: Results for the distributed-(+CUDA) version of (+SCA)
 
 This table contains the results obtained by using the backend `cuda` of Futhark.
 
-\pagebreak
-
 \cimgl{figs/elem_result_and_speedup_gpu.png}{width=\linewidth}{Benchmarks of the SCA in distributed-OpenCL/CUDA}{Source: Realized by Baptiste Coudray}{fig:bench-gpu-sca}
 
 With this performance test (\ref{fig:bench-gpu-sca}), we compare the average computation time for each number of tasks/(+^GPU) and each version ((+OpenCL) and (+CUDA)) on the left graph. On the right graph, we compare the ideal speedup with the `distributed-opencl` and cuda version speedup. We notice that the computation time is essentially the same in (+OpenCL) as in (+CUDA).
 Moreover, the distributed follows the ideal speedup curve. Finally, we notice that parallel computation is up to four times faster than sequential/concurrent computation when executing with a single task/graphical card.
diff --git a/src/text/08-jeu-de-la-vie.md b/src/text/08-jeu-de-la-vie.md
index 7bdb9a0..b0a4025 100644
--- a/src/text/08-jeu-de-la-vie.md
+++ b/src/text/08-jeu-de-la-vie.md
@@ -36,7 +36,7 @@ entry next_chunk_board [n][m] (chunk_board :[n][m]i8) :[][]i8 =
   let next_board = compute_next_board chunk_board neighbours
   in next_board[1:n-1, 1:m-1]
 ```
-This is a sneak peek of the `gol.fut` file. Like the (+SCA), we only update our cellular automaton.
+This is a sneak peek of the `gol.fut` file. Like the (+SCA), we only update our cellular automaton and return it without the envelope.
 
 ```c
 void compute_next_chunk_board(struct dispatch_context *dc,
@@ -46,7 +46,7 @@ void compute_next_chunk_board(struct dispatch_context *dc,
     struct futhark_i8_2d *fut_next_chunk_board;
     futhark_entry_next_chunk_board(fc, &fut_next_chunk_board, fut_chunk_with_envelope);
 
-    /* ... */
+    /* ... Sync, Get values & Free resources ... */
 }
 
 int main(int argc, char *argv[]) {
@@ -62,6 +62,8 @@ int main(int argc, char *argv[]) {
 }
 ```
 
+In the C file `main.c`, the code is almost the same as for the (+SCA) example, but when calling `dispatch_context_new`, we specify that the cellular automaton is two-dimensional. In the `compute_next_chunk_board` function, we call `get_chunk_with_envelope` from our library with a different conversion function, which transforms a two-dimensional C array into a two-dimensional Futhark array.
+
 ## CPU Benchmarks
 
 We perform benchmarks to validate the scalability of our two-dimensional distribution when compiling in sequential, multicore, (+OpenCL), or (+CUDA) mode. The benchmarks are performed on the (+HES-GE) cluster (Baobab/Yggdrasil).
diff --git a/src/text/09-lattice-boltzmann.md b/src/text/09-lattice-boltzmann.md
index e5dc2e4..547a991 100644
--- a/src/text/09-lattice-boltzmann.md
+++ b/src/text/09-lattice-boltzmann.md
@@ -12,7 +12,7 @@ fundamental research, as it keeps the cycle between the elaboration of a theory
 We implement the Lattice-Boltzmann Method with our library to test it with a three-dimensional cellular automaton. Each cell is containing an array of 27 floats.
 
-## CPU Benchmark
+## CPU Benchmarks
 
 We perform benchmarks to validate the scalability of our three-dimensional distribution when compiling in sequential, multicore, (+OpenCL), or (+CUDA) mode. The benchmarks are performed on the (+HES-GE) cluster (Baobab/Yggdrasil). The sequential and multicore benchmarks are performed as follows:
@@ -56,7 +56,7 @@ This table contains the results obtained by using the backend `multicore` of Fut
 
 Contrary to the previous benchmarks, the speedups do not follow the ideal speedup curve. Indeed, whether in sequential or multicore, we obtain a maximum speedup with 128 tasks of x41 when we were hoping to have a speedup of x128.
 
-## GPU Benchmark
+## GPU Benchmarks
 
 The (+OpenCL) and (+CUDA) benchmarks are performed as follows:
 
diff --git a/src/text/ZZ-glossaire.tex b/src/text/ZZ-glossaire.tex
index b1c2a88..8996acb 100644
--- a/src/text/ZZ-glossaire.tex
+++ b/src/text/ZZ-glossaire.tex
@@ -5,16 +5,17 @@
 % Insérez les termes pour la table des acronymes ici.
 % Ne mettez pas de points à la fin d'une entrée, ils sont mis pour vous !
+\newacronym{API}{API}{Application Programming Interface}
 \newacronym{CPU}{CPU}{Central Processing Unit}
-\newacronym{GPU}{GPU}{Graphics Processing Unit}
-\newacronym{MPI}{MPI}{Message Passing Interface}
-\newacronym{GCC}{GCC}{GNU Compiler Collection}
-\newacronym{POSIX}{POSIX}{Portable Operating System Interface uniX}
 \newacronym{CUDA}{CUDA}{Compute Unified Device Architecture}
-\newacronym{OpenCL}{OpenCL}{Open Computing Language}
-\newacronym{HES-GE}{HES-GE}{Haute École Spécialisée de GEnève}
+\newacronym{GCC}{GCC}{GNU Compiler Collection}
+\newacronym{GPU}{GPU}{Graphics Processing Unit}
 \newacronym{HEPIA}{HEPIA}{Haute École du Paysage, d'Ingénierie et d'Architecture de Genève}
+\newacronym{HES-GE}{HES-GE}{Haute École Spécialisée de GEnève}
 \newacronym{IO}{I/O}{Input/Output}
+\newacronym{LBM}{LBM}{Lattice-Boltzmann Method}
+\newacronym{MPI}{MPI}{Message Passing Interface}
 \newacronym{MSS}{MSS}{Maximum Segment Sum}
+\newacronym{OpenCL}{OpenCL}{Open Computing Language}
+\newacronym{POSIX}{POSIX}{Portable Operating System Interface uniX}
 \newacronym{SCA}{SCA}{Simple Cellular Automaton}
-\newacronym{LBM}{LBM}{Lattice-Boltzmann Method}
-- 
GitLab