
# Abstract {-}
Today, most computers are equipped with GPUs. They provide more and more computing cores and have become fundamental embedded high-performance computing tools. In this context, the number of applications taking advantage of these tools seems low at first glance. The problem is that the development tools are heterogeneous, complex, and strongly dependent on the GPU running the code. Futhark is an experimental, functional, and architecture-agnostic language; that is why it seems relevant to study it. It can generate code for standard sequential execution (on a single-core processor), for GPUs (with the CUDA and OpenCL backends), and for several cores of the same processor (shared memory). To make it a tool that could be used on all high-performance platforms, it lacks support for distributed computing. This work aims to develop a library that can port any Futhark code to an MPI library with as little effort as possible.
\begin{figure} \vspace{.1cm} \begin{center} \includegraphics[width=3.72cm,height=2.4cm]{figs/front-logo.png}
\end{center} \end{figure} \begin{tabular}{ p{3cm} p{1cm} p{1cm} p{6cm} } \multicolumn{1}{l}{Candidate:}& & &
\multicolumn{1}{l}{Referent teacher:}\\ \multicolumn{1}{l}{\textbf{Baptiste Coudray}} & & &
\multicolumn{1}{l}{\textbf{Dr. Orestis Malaspinas}} \\ \multicolumn{1}{l}{Field of study: Information technologies engineering} & & &
\multicolumn{1}{l}{} \\ \end{tabular}
\pagebreak
# Introduction {-}
Today, most computers are equipped with GPUs. They provide more and more computing cores and have become fundamental embedded high-performance computing tools. In this context, the number of applications taking advantage of these tools seems low at first glance. The problem is that the development tools are heterogeneous, complex, and strongly dependent on the GPU running the code. Futhark is an experimental, functional, and architecture-agnostic language; that is why it seems relevant to study it. It can generate code for standard sequential execution (on a single-core processor), for GPUs (with the CUDA and OpenCL backends), and for several cores of the same processor (shared memory). To make it a tool that could be used on all high-performance platforms, it lacks support for distributed computing. This work aims to develop a library that can port any Futhark code to an MPI library with as little effort as possible.
## Motivation {-}
To achieve that, we introduce the interest of parallelization by reviewing Amdahl's law and Gustafson-Barsis's law, and then present MPI and Futhark. We decided to implement a library that can parallelize cellular automata in one, two, or three dimensions. By adding Futhark on top of MPI, the programmer has the possibility to compile their code in:

* parallelized-sequential mode,
* parallelized-multicore mode,
* parallelized-OpenCL mode,
* parallelized-CUDA mode.

Finally, we use this library to implement a cellular automaton in each dimension, and we perform a benchmark to ensure that each cellular automaton scales correctly in these four modes.
## Presentation of the project {-}
The leading resources we used to carry out this project were the Futhark and MPI user guides. We also exchanged with Troels Henriksen, the creator of Futhark.
## Working method {-}
During this project, we used Git and hosted the source code on the GitLab platform of HEPIA:
* https://gitedu.hesge.ch/baptiste.coudray/projet-de-bachelor
* Source code of the library with usage examples.
# Message Passing Interface
To enable parallel programming, the (+^MPI) standard was created in 1993-1994 to standardize message passing between several computers, or within a computer with several processors/cores [@noauthor_message_2020]. (+^MPI) is, therefore, a communication protocol and not a programming language. Currently, the latest version of (+^MPI) is 4.0, which was approved in 2021. There are several implementations of the standard:
* MPICH, which currently supports MPI 3.1,
* Open MPI, which currently supports MPI 3.1.
We used Open MPI throughout this project, both on our computer and on the cluster.
## Example
To understand the basics of (+^MPI), let us look at an example mimicking a *token ring* network [@kendall_mpi_nodate]. In this type of network, a process may emit a message (to the console, for example) only if it holds the token. Once it has emitted its message, the process must pass the token to its neighbor.
\cimg{figs/ring.png}{scale=0.4}{Imitation of a network in \textit{token ring}}{Source : Baptiste Coudray}
In this example, the node with identifier zero initially holds the token, which it passes to node one; node one then gives it to node two, and so on. The program ends when the token is back in the possession of process zero: node four sends the token back to node zero.
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(&argc, &argv);
    int world_rank, world_size, token;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    if (world_rank != 0) {
        // Wait for the token from the left neighbor
        MPI_Recv(&token, 1, MPI_INT, world_rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", world_rank,
               token, world_rank - 1);
    } else {
        token = -1; // process zero creates the token
    }
    // Pass the token to the right neighbor; the last process wraps around to zero
    MPI_Send(&token, 1, MPI_INT, (world_rank + 1) % world_size, 0, MPI_COMM_WORLD);
    if (world_rank == 0) {
        // Process zero finally receives the token back from the last process
        MPI_Recv(&token, 1, MPI_INT, world_size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", world_rank,
               token, world_size - 1);
    }
    return MPI_Finalize();
}
```
Any parallel program using (+^MPI) must call the `MPI_Init` function to initialize the environment; otherwise, an error message is displayed. Then, `MPI_Comm_rank` retrieves our identifier (the number of the node we are running on), and `MPI_Comm_size` gives the number of nodes on which our program is running.
Thanks to the node number, the node with identifier zero sends the token to its neighbor via the `MPI_Send` function. Once the token is sent, it waits for node four to send it back via the function `MPI_Recv`. The other nodes wait to receive the token from their neighbor in order to pass it on in turn.
The nodes communicate through the communicator `MPI_COMM_WORLD`, a macro-constant designating all nodes associated with the current program.
Finally, every program must terminate with the `MPI_Finalize()` function; otherwise, the execution ends with an error message.
```bash
mpicc ring.c -o ring
mpirun -n 5 ./ring
```
To compile an (+^MPI) program, you have to go through the `mpicc` program, which is a wrapper around (+^GCC); `mpicc` automatically adds the correct compilation parameters for (+^MPI). The compiled program must then be run through `mpirun`, which distributes it to the compute nodes. The `-n` parameter specifies the number of processes to run.
```
Process 1 received token -1 from process 0
Process 2 received token -1 from process 1
Process 3 received token -1 from process 2
Process 4 received token -1 from process 3
Process 0 received token -1 from process 4
```
Thus, we can see that the processes exchange the token in turn until node zero receives it again.
# Parallel programming
Parallel programming makes it possible to perform operations on data simultaneously, with the aim of increasing processing speed compared to sequential programming. According to Flynn's taxonomy, there are several types of parallelism:

* SISD
  * *Single Instruction Single Data*: the machine performs one instruction on one piece of data at a time. This is typically the personal computer one could buy until the end of the 1990s.
* SIMD
  * *Single Instruction Multiple Data*: the machine performs one instruction on several pieces of data at a time. Today, most processors can perform these operations.
* MISD
  * *Multiple Instruction Single Data*: several computing units perform an operation on a single piece of data.
* MIMD
  * *Multiple Instruction Multiple Data*: several computing units perform operations on several pieces of data. [@noauthor_parallelisme_2021]
\pagebreak
In parallelism, there are two important laws:
1. Amdahl's law
2. Gustafson-Barsis's law
\cimg{figs/amdahls-law.png}{scale=0.6}{Amdahl's law}{Source : Taken from https://commons.wikimedia.org/, ref. URL02}
Amdahl's law states that the overall speed of a program is limited by the code that cannot be parallelized. Indeed, in a program there will almost always be a sequential part that cannot be parallelized. There is therefore a relation between the ratio of parallelizable code and the overall execution speed of the program.
In the graph above, we can see that if:

* 50% of the code is parallelized, we obtain a maximal theoretical speedup of x2, reached from 16 processors onwards.
* 75% of the code is parallelized, we obtain a maximal theoretical speedup of x4, reached from 128 processors onwards.
* 90% of the code is parallelized, we obtain a maximal theoretical speedup of x10, reached from 512 processors onwards.
* 95% of the code is parallelized, we obtain a maximal theoretical speedup of x20, reached from 4096 processors onwards.
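These ceilings follow directly from Amdahl's formula, where $p$ is the parallelizable fraction of the code and $n$ the number of processors:

$$S(n) = \frac{1}{(1 - p) + \frac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}.$$

For example, with $p = 0.95$ the speedup can never exceed $\frac{1}{1 - 0.95} = 20$, which is the x20 ceiling listed above.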
\pagebreak
\cimg{figs/gustafson-law.png}{scale=0.75}{Gustafson-Barsis's law}{Source : Taken from https://commons.wikimedia.org/, ref. URL03}
Gustafson's law says that the larger the amount of data to be processed, the more advantageous it is to use a large number of processors. Thus, the speedup is linear, as can be seen on the graph.

On the graph, we notice, for example, that with a code that is 90% parallelized, we get a *speedup* of at least x100 with 120 processors, where Amdahl's law estimated a maximal *speedup* of x10 with 512 processors. Gustafson's law is therefore much more optimistic in terms of performance gain.
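In the same notation, Gustafson-Barsis's law gives the scaled speedup

$$S(n) = (1 - p) + p \, n,$$

so with $p = 0.9$ and $n = 120$ processors we get $0.1 + 0.9 \times 120 = 108.1$, consistent with the speedup of at least x100 read off the graph.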
\pagebreak
# Introduction to the language Futhark
\cimg{figs/futhark.png}{scale=0.60}{Futhark}{Source : Taken from https://commons.wikimedia.org/, ref. URL04}
Futhark is a purely functional programming language for producing parallelizable code on (+^CPU) or (+^GPU). It was designed by Troels Henriksen, Cosmin Oancea, and Martin Elsman at the University of Copenhagen.
The main goal of Futhark is to write generic code that can be compiled into either:
* (+^OpenCL),
* (+^CUDA),
* multicore,
* sequential C,
* sequential Python.
Although a Futhark code can be compiled into an executable, this feature is reserved for testing purposes because there is no (+^IO). Thus, the main interest is to write very specific functions that you would like to speed up through parallel programming, and to compile them in library mode so that they can be used in a C program.
\pagebreak
## Example 1
To better understand Futhark, here is a simple example: calculating the factorial of a number [@noauthor_basic_nodate].
```
let fact (n: i32): i32 = reduce (*) 1 (1...n)
let main (n: i32): i32 = fact n
```
Futhark does not handle recursion, so the factorial of a number is defined as the successive multiplication of the numbers from one to `n`. In Futhark, this operation is expressed as the reduction of an array, using multiplication as the operator. The program's entry point, `main`, takes a number as parameter and calls the function `fact`.
```bash
futhark opencl fact.fut
echo 12 | ./fact
```
To compile the Futhark code, we have to specify a backend, which allows us to compile the code to:
* (+^OpenCL) (opencl, pyopencl),
* (+^CUDA) (cuda),
* multicore (multicore),
* sequential C (c),
* Python sequential (python).
Here, we compile with the (+^OpenCL) backend to run the program on the graphics card, and we run the program with the number 12 as parameter.
```
479001600i32
```
The program calculates the factorial of 12 and therefore returns 479 001 600.
\pagebreak
## Example 2
In this other example, we use Futhark in a C program to perform a very specific operation, in this case to calculate the factorial of a number.
Then you have to compile the Futhark code in library mode and specify the backend; this generates a C source file and a header that our program can include.

```c
#include <stdio.h>
#include <stdlib.h>
#include "fact.h" // header generated by the Futhark compiler in library mode
int main(int argc, char **argv) {
int number = atoi(argv[1]);
/* Futhark Init */
struct futhark_context_config *futcfg = futhark_context_config_new();
struct futhark_context *futctx = futhark_context_new(futcfg);
/* Main Code */
int result;
futhark_entry_fact(futctx, &result, number);
printf("%d\n", result);
/* Futhark Free */
futhark_context_config_free(futcfg);
futhark_context_free(futctx);
return 0;
}
```
The program initializes a Futhark configuration and a Futhark context; then it calls the `fact` function, which has been compiled to C. It is called via the function `futhark_entry_fact`, which takes as arguments the Futhark context, an integer pointer to store the result, and the number whose factorial is desired.
```bash
futhark opencl --library fact.fut
gcc fact.c -c
gcc main.c -o fact fact.o -lOpenCL -lm
./fact 12
479001600
```
The program's execution with the factorial of 12 returns the correct value, i.e., 479 001 600.
\pagebreak
# MPI x Futhark
Our library parallelizes cellular automata automatically, so that the programmer only has to write the Futhark function that updates their cellular automaton. Our library supports cellular automata of one, two, or three dimensions.
## Communication
To ease communication between the different ranks, we create a virtual Cartesian topology with MPI.
A virtual topology is a mechanism for naming the processes in a communicator in a way that fits the communication pattern better. The main aim of this is to make subsequent code simpler. It may also provide hints to the run-time system which allow it to optimise the communication, or even hint to the loader how to configure the processes. The virtual topology might also gain us some performance benefit.
### One dimension
\cimg{figs/communication_1d.png}{scale=0.6}{Virtual Cartesian topology in one dimension}{Source : Baptiste Coudray}
In a one-dimensional Cartesian topology, we notice that the ranks can communicate directly with their left and right neighbors even if they are at the ends of the network. Indeed, the MPI communicator is defined to be cyclic, which avoids having to go through the $n - 2$ neighbors that separate them.
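As an illustration, here is a minimal sketch (our own, not the library's actual code) of how such a cyclic one-dimensional topology can be created with `MPI_Cart_create` and `MPI_Cart_shift`:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_size, dims[1] = {0}, periods[1] = {1}; // periods[0] = 1 -> cyclic ring
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Dims_create(world_size, 1, dims);
    MPI_Comm ring;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &ring);
    int rank, west, east;
    MPI_Comm_rank(ring, &rank);
    // Ranks at the ends of the network wrap around thanks to the periodicity.
    MPI_Cart_shift(ring, 0, 1, &west, &east);
    printf("Rank %d: west neighbor = %d, east neighbor = %d\n", rank, west, east);
    MPI_Comm_free(&ring);
    return MPI_Finalize();
}
```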
### Two dimensions
\cimg{figs/communication_2d.png}{scale=0.6}{Virtual Cartesian topology in two dimensions}{Source : Baptiste Coudray}
In a two-dimensional Cartesian topology, we notice that the ranks can communicate directly with their left, right, top, and bottom neighbors. When a rank has to communicate with a diagonal neighbor, we use the default communicator (`MPI_COMM_WORLD`) so that the two ranks communicate directly with each other, without going through an intermediate neighbor.
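Similarly, here is a minimal sketch (again our own illustration, not the library's API) of a periodic two-dimensional topology, including how the rank of a diagonal neighbor can be computed:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_size, dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Dims_create(world_size, 2, dims); // factor the ranks into a 2D grid
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
    int rank, north, south, west, east, coords[2];
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_shift(grid, 0, 1, &north, &south); // direct vertical neighbors
    MPI_Cart_shift(grid, 1, 1, &west, &east);   // direct horizontal neighbors
    MPI_Cart_coords(grid, rank, 2, coords);
    // Diagonal neighbor (north-west): periodic dimensions allow out-of-range
    // coordinates, which MPI wraps around automatically.
    int nw_coords[2] = {coords[0] - 1, coords[1] - 1}, north_west;
    MPI_Cart_rank(grid, nw_coords, &north_west);
    printf("Rank %d: N=%d S=%d W=%d E=%d NW=%d\n", rank, north, south, west, east, north_west);
    MPI_Comm_free(&grid);
    return MPI_Finalize();
}
```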
### Three dimensions
\cimg{figs/communication_3d.png}{scale=0.6}{Virtual Cartesian topology in three dimensions}{Source : Baptiste Coudray}
In a three-dimensional Cartesian topology, we notice that the ranks have the same communication capabilities as in a two-dimensional topology, but they can additionally communicate with their front and back neighbors.
## Data dispatching
The cellular automaton is shared as fairly as possible among the available ranks, so that each rank performs roughly the same amount of work. Each rank thus has a chunk, which is a part of the cellular automaton. This chunk can be of dimension one, two, or three.
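As a sketch of the idea (the library's actual distribution code may differ), a one-dimensional automaton of `n` cells can be split among `size` ranks so that chunk sizes differ by at most one cell:

```c
#include <stdio.h>

// Fair 1D split: every rank gets n / size cells, and the first n % size
// ranks get one extra cell, so chunk sizes differ by at most one.
static void chunk_bounds(int n, int size, int rank, int *start, int *count) {
    int base = n / size, extra = n % size;
    *count = base + (rank < extra ? 1 : 0);
    *start = rank * base + (rank < extra ? rank : extra);
}

int main(void) {
    // Example: 10 cells over 4 ranks -> chunks of 3, 3, 2, 2 cells.
    for (int rank = 0; rank < 4; ++rank) {
        int start, count;
        chunk_bounds(10, 4, rank, &start, &count);
        printf("rank %d: cells [%d, %d)\n", rank, start, start + count);
    }
    return 0;
}
```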
### One dimension
### Two dimensions
### Three dimensions
## Envelope
The cellular automaton must use the Moore neighborhood, which means that each cell has:

* two neighbors in one dimension,
* eight neighbors in two dimensions,
* 26 neighbors in three dimensions.

These values hold for a Chebyshev distance of one. The number of neighbors of a cell can be generalized with the formula $(2r + 1)^d - 1$, where $r$ is the Chebyshev distance and $d$ the dimension.

Thus, the envelope contains the missing Moore neighborhood, at a Chebyshev distance of $r$, of the cells located at the edges of the chunk.
### One dimension
\cimg{figs/futhark.png}{scale=0.60}{Futhark}{Source : Taken from https://commons.wikimedia.org/, ref. URL04}
The one-dimensional Moore neighborhood of a cell comprises the left neighbor (west neighbor) and the right neighbor (east neighbor).
### Two dimensions
\cimg{figs/futhark.png}{scale=0.60}{Futhark}{Source : Taken from https://commons.wikimedia.org/, ref. URL04}
### Three dimensions
\cimg{figs/futhark.png}{scale=0.60}{Futhark}{Source : Taken from https://commons.wikimedia.org/, ref. URL04}
\pagebreak
# Simple Cellular Automaton
The simplest non-trivial cellular automaton that can be conceived consists of a one-dimensional grid of cells that can take only two states ("0" or "1"), with a neighborhood consisting, for each cell, of itself and the two cells adjacent to it.
There are $2^3 = 8$ possible configurations (or patterns, rules) of such a neighborhood. For the cellular automaton to work, it is necessary to define, for each of these patterns, what the state of a cell must be at the next generation. The 8 rules/configurations are defined as follows:
| Rule n° | Neighbour left state | State | Neighbour right state | Next state |
|:---:|:---:|:---:|:---:|:---:|
| 1 | 0 | 0 | 0 | 0
| 2 | 0 | 0 | 1 | 1
| 3 | 0 | 1 | 0 | 1
| 4 | 0 | 1 | 1 | 1
| 5 | 1 | 0 | 0 | 1
| 6 | 1 | 0 | 1 | 0
| 7 | 1 | 1 | 0 | 0
| 8 | 1 | 1 | 1 | 0
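To make the table concrete, here is a minimal C sketch (our own illustration, not part of the project's code) that encodes the eight rules as a lookup table indexed by the three-cell pattern:

```c
#include <stdio.h>

int main(void) {
    // next[p] is the next state for pattern p = left*4 + self*2 + right;
    // p = 0 corresponds to rule n°1 of the table above, p = 7 to rule n°8.
    const int next[8] = {0, 1, 1, 1, 1, 0, 0, 0};
    int cells[5] = {0, 0, 1, 0, 0}, nxt[5] = {0};
    for (int i = 1; i < 4; ++i) {
        int p = cells[i - 1] * 4 + cells[i] * 2 + cells[i + 1];
        nxt[i] = next[p];
    }
    for (int i = 0; i < 5; ++i) printf("%d", nxt[i]); // prints 01110
    printf("\n");
    return 0;
}
```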
## Example
\cimg{figs/simple_automate.png}{scale=0.5}{Two iterations of the simple cellular automaton}{Source : Taken from
\url{https://commons.wikimedia.org/}, ref. URL05. Re-created by Baptiste Coudray}
Iteration 0 is the initial state, and only cell two is alive. To perform the next iteration:

* cell one is born because of rule n°2,
* cell two stays alive because of rule n°3,
* cell three is born because of rule n°5.
## Parallelized version
With the library we created, we implemented the cellular automaton described above. To do so, we create a Futhark file, `elementary.fut`, which computes the next state of a part of the cellular automaton.
```
entry next_chunk_elems [n] (chunk_elems :[n]i8) (envelope: envelope_1d_i8) :[n]i8 =
let augmented_elems = augment_chunk_elems chunk_elems envelope
let next_elems = compute_next_elems augmented_elems
in next_elems[1:n+1] :> [n]i8
```
Thus, the file `elementary.fut` contains a function that updates a part of the cellular automaton.
```c
void compute_next_chunk_elems(struct dispatch_context *dc, struct futhark_context *fc, chunk_info_t *ci) {
envelope_t *outer_envelope = get_outer_envelope(dc, fc, 1);
struct futhark_opaque_envelope_1d_i8 *fut_outer_envelope = futhark_outer_envelope_new(dc, fc, outer_envelope, futhark_restore_opaque_envelope_1d_i8, FUTHARK_I8);
struct futhark_i8_1d *fut_chunk_elems = futhark_new_i8_1d(fc, ci->data, ci->dimensions[1]);
struct futhark_i8_1d *fut_next_chunk_elems;
futhark_context_sync(fc);
futhark_entry_next_chunk_elems(fc, &fut_next_chunk_elems, fut_chunk_elems, fut_outer_envelope);
futhark_context_sync(fc);
futhark_values_i8_1d(fc, fut_next_chunk_elems, ci->data);
futhark_context_sync(fc);
/* ... Free resources ... */
}
int main(int argc, char *argv[]) {
/* ... MPI & Futhark Init ... */
const int N_ITERATIONS = 100;
int elems_dimensions[1] = {600};
struct dispatch_context *disp_context = dispatch_context_new(elems_dimensions, MPI_INT8_T, 1);
chunk_info_t ci = get_chunk_info(disp_context);
init_chunk_elems(&ci);
for (int j = 0; j < N_ITERATIONS; ++j) {
compute_next_chunk_elems(disp_context, fut_context, &ci);
}
/* ... Free resources ... */
}
```
Finally, a C file, `main.c`, is needed to create the program's entry point. The code is relatively simple: the programmer must initialize the MPI and Futhark environments. Then, they must initialize our library via the function `dispatch_context_new`, specifying the size of the cellular automaton, its data type, and its number of dimensions.
## CPU Benchmark
| Number of tasks | Average [s] | Standard Deviation [s] | Speedup | Number of measures |
|:---:|:---:|:---:|:---:|:---:|
| 1 | 710.678 [s] | ± 1.689 [s] | x1.0 | 15 |
| 2 | 390.93 [s] | ± 24.671 [s] | x1.8 | 15 |
| 4 | 209.194 [s] | ± 0.443 [s] | x3.4 | 15 |
| 8 | 104.686 [s] | ± 0.273 [s] | x6.8 | 15 |
| 16 | 51.996 [s] | ± 0.285 [s] | x13.7 | 15 |
| 32 | 26.201 [s] | ± 0.087 [s] | x27.1 | 15 |
| 64 | 13.104 [s] | ± 0.049 [s] | x54.2 | 15 |
| 128 | 6.653 [s] | ± 0.032 [s] | x106.8 | 15 |
*Table of results for the parallelized-sequential version*
| Number of tasks | Average [s] | Standard Deviation [s] | Speedup | Number of measures |
|:---:|:---:|:---:|:---:|:---:|
| 1 | 823.08 [s] | ± 9.366 [s] | x1.0 | 15 |
| 2 | 417.564 [s] | ± 13.221 [s] | x2.0 | 15 |
| 4 | 213.967 [s] | ± 9.582 [s] | x3.8 | 15 |
| 8 | 121.354 [s] | ± 0.037 [s] | x6.8 | 15 |
| 16 | 56.497 [s] | ± 0.231 [s] | x14.6 | 15 |
| 32 | 29.848 [s] | ± 0.081 [s] | x27.6 | 15 |
| 64 | 14.502 [s] | ± 0.064 [s] | x56.8 | 15 |
| 128 | 7.774 [s] | ± 0.053 [s] | x105.9 | 15 |
*Table of results for the parallelized-multicore version*
We notice that the multicore version is slower than the sequential version; this is due to the lack of optimization for one-dimensional arrays in the multicore backend. Indeed, this feature is not yet implemented in the Futhark compiler.

\cimg{figs/sca_result_and_speedup.png}{width=\linewidth}{Results and speedup of the parallelized-sequential and parallelized-multicore versions}{Source : Baptiste Coudray}

On the left, the graph shows the computation time of 100 generations of the cellular automaton with $30\,000^2 = 900\,000\,000$ cells, for the sequential and multicore versions. On the right, we have the ideal speedup as well as the speedup obtained by the sequential and multicore versions.

On the execution-time graph, we notice that the time decreases in the order of $\frac{1}{x}$ for the sequential and multicore executions. Note that the sequential version is faster than the multicore version.
## GPU Benchmark
\pagebreak
# Game of Life
The Game of Life is a zero-player game designed by John Horton Conway in 1970. It is also one of the best-known cellular automata. A cellular automaton consists of a regular grid of cells, each containing a state chosen among a finite set, which can evolve over the course of time. The game does not require the interaction of a player to evolve; it evolves thanks to these extremely simple rules (a small code sketch follows the list):
1. a cell has eight neighbors,
2. a cell can be either alive or dead,
3. a dead cell with exactly three living neighbors becomes alive,
4. a living cell with two or three living neighbors stays alive; otherwise, it dies. [@noauthor_jeu_2020]
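As a small illustration (our own sketch, not the project's code), rules n°3 and n°4 reduce to a single expression on a cell's number of living neighbors:

```c
#include <stdio.h>

// Next state of one cell, given its current state and the number of living
// cells among its eight neighbors (rules n°3 and n°4 above).
static int next_state(int alive, int living_neighbors) {
    if (alive)
        return living_neighbors == 2 || living_neighbors == 3;
    return living_neighbors == 3;
}

int main(void) {
    printf("%d\n", next_state(0, 3)); // a dead cell with three living neighbors is born: 1
    printf("%d\n", next_state(1, 1)); // a living cell with one living neighbor dies: 0
    return 0;
}
```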
\pagebreak
## Example
\cimg{figs/gol_blinker1.png}{scale=0.40}{First state of blinker}{Source : Taken from
\url{https://commons.wikimedia.org/}, ref. URL05. Re-created by Baptiste Coudray}
A basic example is the blinker:

* the cells (one, one) and (one, three) die because they have seven dead neighbors and one living neighbor (rule n°4),
* the cells (zero, two) and (two, two) are born because they have three living neighbors (rule n°3),
* the cell (one, two) stays alive because it has two living neighbors (rule n°4).
\cimg{figs/gol_blinker2.png}{scale=0.40}{Second state of blinker}{Source : Taken from
\url{https://commons.wikimedia.org/}, ref. URL06. Re-created by Baptiste Coudray}
Thus, after the application of the rules, the horizontal line becomes a vertical line. Then, at the next iteration, the vertical line becomes a horizontal line again.
\pagebreak
## Parallelized version
With the library we created, we implemented the Game of Life. To do so, we create a Futhark file, `gol.fut`, which computes the next state of a part of the cellular automaton.
```
entry next_chunk_board [n][m] (chunk_board :[n][m]i8) (envelope: envelope_2d_i8) :[n][m]i8 =
let augmented_board = augment_board chunk_board envelope
let next_chunk_board = compute_next_chunk_board augmented_board
in next_chunk_board[1:n+1, 1:m+1] :> [n][m]i8
```
In this file, the function `next_chunk_board` takes as parameters the portion of the cellular automaton assigned to the rank, as well as the envelope.
This function rebuilds the cellular automaton by adding the neighbors of the cells located at the edges; then another function computes the next state of each cell. Finally, the result of the function is returned without the envelope, because it is no longer needed.
```c
void compute_next_chunk_board(struct dispatch_context *dc, struct futhark_context *fc, chunk_info_t *ci) {
envelope_t *outer_envelope = get_outer_envelope(dc, fc, 1);
struct futhark_opaque_envelope_2d_i8 *fut_outer_envelope = futhark_outer_envelope_new(dc, fc, outer_envelope,
futhark_restore_opaque_envelope_2d_i8, FUTHARK_I8);
struct futhark_i8_2d *fut_chunk_board = futhark_new_i8_2d(fc, ci->data, ci->dimensions[0], ci->dimensions[1]);
struct futhark_i8_2d *fut_next_chunk_board;
futhark_context_sync(fc);
futhark_entry_next_chunk_board(fc, &fut_next_chunk_board, fut_chunk_board, fut_outer_envelope);
futhark_context_sync(fc);
futhark_values_i8_2d(fc, fut_next_chunk_board, ci->data);
futhark_context_sync(fc);
/* ... Free resources ... */
}
int main(int argc, char *argv[]) {
/* ... MPI & Futhark Init ... */
const int N_ITERATIONS = 100;
int board_dimensions[2] = {800, 600};
struct dispatch_context *dc = dispatch_context_new(board_dimensions, MPI_INT8_T, 2);
chunk_info_t ci = get_chunk_info(dc);
init_chunk_board(&ci);
for (int i = 0; i < N_ITERATIONS; ++i) {
compute_next_chunk_board(dc, fc, &ci);
}
/* ... Free resources ... */
}
```
In addition to the Futhark file, a C file, `main.c`, is needed to use our library. This file is the program's entry point; it takes care of initializing the MPI environment, Futhark, and our library.

Initializing our library via the function `dispatch_context_new` requires specifying the dimensions of the cellular automaton, the type of the data it contains, and the number of dimensions. Then, we retrieve the chunk of the cellular automaton assigned to us, initialize its values, and compute the next state of the cellular automaton via the function `compute_next_chunk_board`. In this function, we have to retrieve the envelope, convert it to a Futhark type, and call our Futhark function that updates the state of the cells (`futhark_entry_next_chunk_board`). Finally, with the function `futhark_values_i8_2d`, we retrieve the new values of the cells.
## CPU Benchmarks
We run a benchmark to validate the scalability of our two-dimensional parallelization when compiling in sequential, multicore, OpenCL, or CUDA mode. The benchmarks are performed on the HES-GE cluster (Baobab2).
The sequential and multicore benchmarks are performed as follows:

* the cellular automaton has $30\,000^2 = 900\,000\,000$ cells,
* the number of tasks varies between $2^0$ and $2^7$,
* 15 measurements are made,
* one measurement corresponds to 100 generations.
| Number of tasks | Average [s] | Standard Deviation [s] | Speedup | Number of measures |
|:---:|:---:|:---:|:---:|:---:|
| 1 | 4207.341 [s] | ± 85.994 [s] | x1.0 | 15 |
| 2 | 1834.217 [s] | ± 53.498 [s] | x2.3 | 15 |
| 4 | 1015.535 [s] | ± 17.84 [s] | x4.1 | 15 |
| 8 | 517.61 [s] | ± 7.355 [s] | x8.1 | 15 |
| 16 | 261.422 [s] | ± 9.078 [s] | x16.1 | 15 |
| 32 | 116.66 [s] | ± 0.263 [s] | x36.1 | 15 |
| 64 | 65.678 [s] | ± 0.31 [s] | x64.1 | 15 |
| 128 | 33.348 [s] | ± 0.208 [s] | x126.2 | 15 |
*Table of results for the parallelized-sequential version*
| Number of tasks | Average [s] | Standard Deviation [s] | Speedup | Number of measures |
|:---:|:---:|:---:|:---:|:---:|
| 1 | 3207.645 [s] | ± 55.361 [s] | x1.0 | 15 |
| 2 | 1354.403 [s] | ± 98.014 [s] | x2.4 | 15 |
| 4 | 718.864 [s] | ± 23.786 [s] | x4.5 | 15 |
| 8 | 369.332 [s] | ± 0.098 [s] | x8.7 | 15 |
| 16 | 184.933 [s] | ± 0.041 [s] | x17.3 | 15 |
| 32 | 94.574 [s] | ± 0.122 [s] | x33.9 | 15 |
| 64 | 44.917 [s] | ± 1.667 [s] | x71.4 | 15 |
| 128 | 23.774 [s] | ± 0.038 [s] | x134.9 | 15 |
*Table of results for the parallelized-multicore version*
\cimg{figs/gol_result_and_speedup_cpu.png}{width=\linewidth}{Results and speedup of the parallelized-sequential and parallelized-multicore versions}{Source : Baptiste Coudray}
On the left, the graph shows the computation time of 100 generations of the Game of Life for a cellular automaton of $30\,000 \times 30\,000$ cells, for the sequential and multicore versions. On the right, we have the ideal speedup as well as the speedup obtained by the sequential and multicore versions.

On the execution-time graph, we notice that the time decreases in the order of $\frac{1}{x}$ for the sequential and multicore executions. Note that the multicore version is faster than the sequential version.

On the speedup graph, we notice that the parallelization of the cellular automaton is ideal, even better than ideal. This performance can be explained by the presence of caches in the CPU, which allow data to be fetched faster than from RAM.
\pagebreak
## GPU Benchmarks
The OpenCL and CUDA benchmarks are performed as follows:

* the cellular automaton has $60\,000^2 = 3\,600\,000\,000$ cells,
* the number of tasks varies between $2^0$ and $2^7$,
* 15 measurements are made,
* one measurement corresponds to 100 generations,
* from $2^0$ to $2^3$ tasks, one NVIDIA GeForce RTX 3090 is allocated to each task; beyond that, the tasks share the graphics cards equally.
| Number of tasks | Number of GPUs | Average [s] | Standard Deviation [s] | Speedup | Number of measures |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 1 | 107.849 [s] | ± 0.213 [s] | x1.0 | 15 |
| 2 | 2 | 53.843 [s] | ± 0.085 [s] | x2.0 | 15 |
| 4 | 4 | 43.714 [s] | ± 0.024 [s] | x2.5 | 15 |
| 8 | 8 | 43.403 [s] | ± 0.038 [s] | x2.5 | 15 |
| 16 | 8 | 43.499 [s] | ± 0.257 [s] | x2.5 | 15 |
| 32 | 8 | 43.777 [s] | ± 0.281 [s] | x2.5 | 15 |
| 64 | 8 | 20.917 [s] | ± 0.183 [s] | x5.2 | 15 |
| 128 | 8 | 13.583 [s] | ± 0.112 [s] | x7.9 | 15 |
*Table of results for the parallelized-OpenCL version*
| Number of tasks | Number of GPUs | Average [s] | Standard Deviation [s] | Speedup | Number of measures |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 1 | 107.215 [s] | ± 0.193 [s] | x1.0 | 15 |
| 2 | 2 | 53.434 [s] | ± 0.041 [s] | x2.0 | 15 |
| 4 | 4 | 43.609 [s] | ± 0.065 [s] | x2.5 | 15 |
| 8 | 8 | 43.375 [s] | ± 0.041 [s] | x2.5 | 15 |
| 16 | 8 | 43.111 [s] | ± 0.088 [s] | x2.5 | 15 |
| 32 | 8 | 43.525 [s] | ± 0.415 [s] | x2.5 | 15 |
| 64 | 8 | 20.747 [s] | ± 0.113 [s] | x5.2 | 15 |
| 128 | 8 | 13.909 [s] | ± 0.118 [s] | x7.7 | 15 |
*Table of results for the parallelized-CUDA version*
\cimg{figs/gol_result_and_speedup_gpu.png}{width=\linewidth}{Results and speedup of the parallelized-OpenCL and parallelized-CUDA versions}{Source : Baptiste Coudray}
\pagebreak