Verified commit 378dba8c, authored by baptiste.coudray

Updated doc

parent 96d9415a
src/figs/dispatch_1d.png (image updated: 7.97 KiB → 7.95 KiB)
src/figs/distributed_systems.png (file added, 157 KiB)

src/figs/ring.png (image updated: 27.7 KiB → 27.3 KiB)
@@ -10,12 +10,18 @@ I would like to thank the people who helped me during this project:
# Abstract {-}
Today, most computers are equipped with (+^GPU). They provide more and more computing cores and have become fundamental embedded high-performance computing tools. In this context, the number of applications taking advantage of these tools seems low at first glance. The problem is that the development tools are heterogeneous, complex, and strongly dependent on the (+GPU) running the code. Futhark is an experimental, functional, and architecture-agnostic language, which is why it seems relevant to study it. It can generate code for standard sequential execution (on a single-core processor), for (+GPU) (with the (+CUDA) and (+OpenCL) backends), and for several cores of the same processor (shared memory). To make it a tool usable on all high-performance platforms, it lacks support for distributed computing with (+MPI). We create a library that distributes a cellular automaton over multiple compute nodes through (+MPI). The update of the cellular automaton is computed via the Futhark language using one of the four available backends (sequential, multicore, (+OpenCL), and (+CUDA)). To test our library, we implement a cellular automaton in one dimension ((+SCA)), in two dimensions (Game of Life), and in three dimensions ((+LBM)). Finally, the performance tests show an ideal speedup in one and two dimensions with the sequential and multicore backends, whereas with (+LBM) we obtain a maximum of x42 with 128 tasks. With the GPU backends, we obtain an ideal speedup for all three cellular automata. Parallel computing shows better performance than sequential or concurrent computing; for example, with the Game of Life, we are up to 15 times faster.
\begin{figure} \vspace{.1cm} \begin{center} \includegraphics[scale=0.22]{figs/front-logo.png}
\end{center} \end{figure}
\begin{table}[ht!]
\begin{tabular}{ll}
Candidate: & Referent teacher: \\
\textbf{Baptiste COUDRAY} & \textbf{Dr. Orestis MALASPINAS} \\
Field of study: Information Technologies Engineering & \\
\end{tabular}
\end{table}
\pagebreak
# Distributed High-Performance Computing
\cimgl{figs/distributed_systems.png}{width=\linewidth}{A distributed high-performance computing system}{Source: Created by Baptiste Coudray}{fig:dd-sys}

As we can see in \ref{fig:dd-sys}, distributed systems are groups of networked computers that share a common goal. They are used to increase computing power and to solve a complex problem faster than a single machine could [@noauthor_distributed_2021]. To do so, the problem's data are divided among the computers, which communicate with each other via message passing. Each computer executes the same (distributed) program, but on different data. The algorithm is applied using one of these three computing methods:

1. sequential computing,
2. concurrent computing,
3. or parallel computing.

With sequential computing, the algorithm is executed step by step: each operation is triggered only when the previous one is completed, even when the two operations are independent.\newline
With concurrent computing, the problem's data are split into smaller parts that are shared among the threads available on the processor. Each thread applies the algorithm independently, with time-slicing, to its own set of data. A performance gain is noticeable when the tasks are mostly independent of one another, because they do not have to wait for the progress of another task (thread) [@oracle_chapter_2010].\newline
With parallel computing, the data are split as well, and the algorithm is applied simultaneously on the multiple processors available. Generally, we use the (+GPU), because it contains thousands of cores while a (+CPU) contains only about a hundred. Thus, the primary goal of parallel computing is to increase the available computing power for faster application processing and problem-solving.

So, _Distributed High-Performance Computing_ means distributing a program on multiple networked computers and executing the algorithm using sequential, concurrent, or parallel computing; the sketch below illustrates the message-passing part.
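To make the data division and message passing concrete, here is a minimal sketch in C of how a one-dimensional cellular automaton could be scattered over (+MPI) tasks arranged in a ring, with one ghost cell exchanged with each neighbor. This illustrates the general technique only, not the library's actual API; the cell count `N` and all names are hypothetical.

```c
/* Hypothetical sketch: distributing a 1D cellular automaton with MPI.
 * Not the library's actual API; for illustration only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024 /* total number of cells, assumed divisible by the task count */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;            /* cells owned by each task */
    int *cells = NULL;
    if (rank == 0) {                 /* the root initializes the full automaton */
        cells = malloc(N * sizeof(int));
        for (int i = 0; i < N; ++i) cells[i] = rand() % 2;
    }

    /* Divide the data among the tasks: each rank receives its own chunk,
     * padded with one ghost cell on each side for the neighbors' borders. */
    int *local = malloc((chunk + 2) * sizeof(int));
    MPI_Scatter(cells, chunk, MPI_INT, local + 1, chunk, MPI_INT,
                0, MPI_COMM_WORLD);

    /* Message passing: exchange border cells with the two neighboring
     * ranks (periodic boundary, i.e. a ring topology). */
    int left = (rank - 1 + size) % size, right = (rank + 1) % size;
    MPI_Sendrecv(local + 1, 1, MPI_INT, left, 0,          /* send my left border */
                 local + chunk + 1, 1, MPI_INT, right, 0, /* recv right ghost    */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(local + chunk, 1, MPI_INT, right, 1,     /* send my right border */
                 local, 1, MPI_INT, left, 1,              /* recv left ghost      */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Here each task would update its chunk, e.g. by calling a
     * Futhark-generated step function (sequential, multicore,
     * OpenCL, or CUDA backend). */

    MPI_Gather(local + 1, chunk, MPI_INT, cells, chunk, MPI_INT,
               0, MPI_COMM_WORLD);
    if (rank == 0) { printf("cell 0 after exchange: %d\n", cells[0]); free(cells); }
    free(local);
    MPI_Finalize();
    return 0;
}
```

Such a sketch would be compiled with `mpicc` and launched with, e.g., `mpirun -np 4 ./automaton`; each task then applies the update rule only to its own chunk, reading the ghost cells it received from its neighbors.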
\pagebreak
@@ -117,7 +117,7 @@ Table: Results for the distributed-(+CUDA) version of (+LBM)
This table contains the results obtained with Futhark's `cuda` backend.
\cimgl{figs/lbm_result_and_speedup_gpu.png}{width=\linewidth}{Benchmarks of the LBM in distributed-OpenCL/CUDA}{Source: Realized by Baptiste Coudray}{fig:bench-gpu-lbm}
Like the other benchmarks (\ref{fig:bench-gpu-sca}, \ref{fig:bench-gpu-gol}), there is very little difference between the (+OpenCL) and (+CUDA) versions (computation time and speedup). We even get a better-than-ideal speedup with 2, 4, and 8 tasks/(+^GPU) (x2.1, x5.2, and x10.2, respectively). Finally, we notice that parallel computation is up to 3 times faster than sequential/concurrent computation when executing with a single task/graphics card.
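For reference, these figures assume the usual definition of speedup, where $T_1$ is the execution time with a single task and $T_n$ the execution time with $n$ tasks:

$$S(n) = \frac{T_1}{T_n}$$

An ideal speedup corresponds to $S(n) = n$; a better-than-ideal (superlinear) result, such as the x10.2 measured with 8 tasks, means $S(n) > n$, which typically stems from cache and memory effects once the data are split into smaller chunks.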
# Conclusion
In this project, we created a library that distributes a one-, two-, or three-dimensional cellular automaton over several compute nodes via (+MPI). Thanks to the different Futhark backends, the cellular automaton can be updated with sequential, concurrent, or parallel computation. We compared these modes by implementing a cellular automaton in one dimension ((+SCA)), in two dimensions (Game of Life), and in three dimensions ((+LBM)). Benchmarks were performed for each backend to verify the scalability of the library. We obtained ideal speedups with the cellular automata in one and two dimensions when using the sequential and multicore Futhark backends; with these two backends and a three-dimensional cellular automaton, we had a maximum speedup of x41 with 128 tasks. The (+OpenCL) and (+CUDA) backends show no performance difference between them, and the speedup is ideal for all three cellular automata. Parallel computing consistently showed better performance than sequential or concurrent computing; for example, with the Game of Life, we are up to 15 times faster.
During this work, I learned the importance of writing unit tests to validate my implementation. Indeed, they allowed me to narrow down multiple bugs and to make sure that the library still worked when I added the cellular automata in two and three dimensions.
The library could be further improved to obtain an ideal speedup in three dimensions with the CPU backends, and to support the Von Neumann neighborhood in order to handle other cellular automata.
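For context, the Von Neumann neighborhood of a cell contains only its orthogonally adjacent cells (four in two dimensions), unlike the Moore neighborhood (eight in two dimensions) used by the Game of Life. A hypothetical sketch of the corresponding index offsets:

```c
/* Hypothetical illustration: index offsets of the two neighborhoods in 2D.
 * Von Neumann keeps only the orthogonal neighbors; Moore adds the diagonals. */
static const int VON_NEUMANN_2D[4][2] = {
    {-1, 0}, {1, 0}, {0, -1}, {0, 1}
};
static const int MOORE_2D[8][2] = {
    {-1, -1}, {-1, 0}, {-1, 1},
    { 0, -1},          { 0, 1},
    { 1, -1}, { 1, 0}, { 1, 1}
};
```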
\pagebreak