diff --git a/presentations/pasc/pres.md b/presentations/pasc/pres.md
index 064a7b1f0137ef09a5edde337590e8e66b334a03..f5a415d765ca1527c4aefd08d0d0eda2ecfb5d52 100644
--- a/presentations/pasc/pres.md
+++ b/presentations/pasc/pres.md
@@ -1,14 +1,23 @@
-% Ludwig Boltzmann and the Raiders of the Futhark
+% High performance computing and the Raiders of the Futhark
 % O. Malaspinas
-% 15 novembre 2019
+% July 7, 2021
+
+# How I experienced HPC code development
+
+* 2005: experimental sequential Fortran 90 fluid flow solver.
+* 2006-2021: Palabos, a massively parallel C++ fluid flow solver.
+* Circa 2015: Palabos performance compared to GPU codes becomes limited.
+* Still, Palabos features are far more advanced than those of any GPU code.
+* Porting of certain features underway by J. Latt with modern C++: still needs
+  A LOT of rewriting.
 
 # What is Futhark
 
 - Statically typed, data-parallel, purely functional, array language ...
 - with limited functionalities (no I/O for example) ...
-- that compiles to C, OpenCL, and Cuda backends...
+- that compiles to C, multi-core, OpenCL, and Cuda backends...
 - very efficiently ...
-- without the pain of actually writing GPU code.
+- without the pain of actually writing sequential, multi-core, or GPU code.
 
 . . .
 
@@ -16,10 +25,6 @@
 - Very friendly and eager to help newcomers ...
 - Still a very experimental project (high truck factor).
 
-. . .
-
-**Spoiler:** a toy d3q27 recursive regularized model does 1.5 GLUPS (single precision).
-
 # Why use Futhark?
 
 \tiny
@@ -63,14 +68,24 @@
 int main()
 }
 ```
 
-# How to use Futhark
+# Why really use Futhark?
+
+* In most codes:
+  * sequential, multi-core, and GPU codes need a completely **different**
+    memory layout.
+* Multi-backend codes require complete rewrites:
+  * Difficult to maintain.
+  * Difficult to add features.
+  * Difficult to optimize.
+
+# How to use Futhark?
 
 - Not intended to replace existing generic-purpose languages.
-- But aims at being easily integrated into non Futhark code:
+- But aims at being easily integrated into non-Futhark code:
   - Used into python code: `PyOpenCL`,
   - Conventional `C` code,
-  - Several others (`C#`, `Haskell`, `F#`, ..., and soon `Rust`).
+  - Several others (`C#`, `Haskell`, `F#`, and `Rust`, for example).
 - Futhark produces `C` code so it's accessible from any language through a
   FFI.
 
@@ -301,7 +316,7 @@ tabulate_4d nx ny nz q (\x y z ipop ->
 
 # Summary
 
-* D3Q27 fully periodic lattice Boltzmann library.
+* A simple yet complete fluid flow simulator.
 
 * Lines of readable and easy to debug Futhark code: 110.
 
@@ -315,8 +330,9 @@ tabulate_4d nx ny nz q (\x y z ipop ->
 * For small dimensions it is usually not worth.
 * Replace length 3, or length 27 arrays by tuples: better use of GPU architecture.
   * `[](a, b, c, ..) -> ([]a, []b, []c, ...)`{.ocaml} automatically by the compiler.
-* Result: with a code of 150 lines, we go to 1.5 GLUPS.
-* All results are the same with CUDA and OpenCL backends (room for improvement?).
+* Result: with a code of 150 lines, we go to 1.5 GLUPS on GPU, 11 MLUPS on a
+  single core, and 400 MLUPS on a multi-core machine.
+* All results are the same with CUDA and OpenCL backends.
 
 # Conclusion
 
@@ -333,14 +349,13 @@ tabulate_4d nx ny nz q (\x y z ipop ->
 
 ## From Troels Henriksen himself
 
-* Incremental flattening (experimental).
 * Multi-GPU: only on a single motherboard (very experimental).
 
 ## What is missing? (IMHO)
 
 * Good compiler errors.
 * A way to profile the code.
-* Distributed GPU (MPI).
+* WIP: Distributed GPU (MPI), in the compiler or via a library.
 * Bonus: A cool rendering tool directly from the GPU.
 
 # Questions?