diff --git a/presentations/pasc/pres.md b/presentations/pasc/pres.md
index 3ef9394aa5639b807b159b055935658cd59f0c6c..5ee6ecb7b55ae5c7aabbd1c3be7598fa31430f17 100644
--- a/presentations/pasc/pres.md
+++ b/presentations/pasc/pres.md
@@ -6,18 +6,22 @@
 
 ## Palabos: a massively parallel high performance fluid flow solver
 
+. . .
+
+## **Do not hesitate to interrupt me at any time**
+
 # What is Futhark
 
-- Statically typed, data-parallel, purely functional, array language ...
-- with limited functionalities (no I/O for example) ...
-- that compiles to sequencial, multi-core, OpenCL, and Cuda backends...
-- very efficiently ...
+- Statically typed, data-parallel, purely functional, array language,
+- with limited functionalities (no I/O for example),
+- that compiles to sequencial, multi-core, OpenCL, and Cuda backends,
+- very efficiently,
 - without the pain of actually writing sequential, multi-code, or GPU code.
 
 . . .
 
-- Developed in Copenhagen by Troels Henriksen ...
-- Very friendly and eager to help newcomers ...
+- Developed in Copenhagen by T. Henriksen,
+- Very friendly and eager to help newcomers,
 - Still a very experimental project.
 
 # Why use Futhark?
@@ -73,6 +77,10 @@ int main()
 * Difficult to add features.
 * Difficult to optimize.
 
+. . .
+
+**Futhark handles that for you!**
+
 # How to use Futhark?
 
 - Not intended to replace existing generic-purpose languages.
@@ -82,19 +90,37 @@ int main()
     - Conventional `C` code,
     - Several others (`C#`, `Haskell`, `F#`, and `Rust` for example).
 
-- Futhark produces `C` code so it's accessible from any language through a FFI.
+- Futhark produces `C` code: FFI for most languages.
 
-# An example: the dot product
+# Basic syntax
 
-::: dotprod
+## Differences with classical HPC languages
 
-## `dotprod.fut`
+* Functional language $\Rightarrow$ functions always return a something (unlike
+  C/C++ for example).
+* Function arguments cannot be modified in place.
 
+## `let` ... `in`
+
+```ocaml
+-- The addition of 3 doubles:
+-- Signature f64 -> f64 -> f64 -> f64
+
+let add (a: f64) (b: f64) (c: f64) : f64 =
+    let d = a + b
+    in c + d
 ```
+
+This should be read: **let** `d` be equal to `a + b`, **in** `c + d`.
+
+# An example: the dot product
+
+## `dotprod.fut`
+
+```ocaml
 entry dotprod (xs: []i32) (ys: []i32): i32 =
     reduce (+) 0 (map (\(x, y) -> x * y) (zip xs ys))
 ```
-:::
 
 Intrinsics: SOAC (Second order array combinators)
 
@@ -149,14 +175,19 @@ int main() {
 
 # The lattice Boltzmann method
 
-* Cellular automaton-like algorithm for fluid flow simulation.
+## The algorithm
 
-::: Simulation
+* Cellular automaton-like algorithm for fluid flow simulation.
+* Cartesian grid with $q$ variables per grid point, $f_i(\bm{x}, t)$.
+* Each time step is made of:
+    1. Collision (local operations only).
+    2. Propagation (non-local operations).
+* Notoriously straightforward to parallelize.
 
 ## Simulation
 
 0. Initialization (no Futhark here).
-1. Collision.
+1. Collision (on every grid point at the same time).
     - Compute $\rho(f_i)$,
     - Compute $\bm{j}(f_i)$,
     - Compute $f_i^\mathrm{eq}(\rho,\bm{j})$,
@@ -166,139 +197,131 @@ int main() {
 Repeat 1-2 a certain amount of times
 
 3. Get the data and process it.
-:::
 
-# Computation of macroscopic moments (1/2)
+# The LBM pseudo-code
+
+```ocaml
+let time_step [nx][ny][nz][q] (f: [nx][ny][nz][q]f32)
+    -> [nx][ny][nz][q]f32 =
+    let rho = compute_rho f
+    let j = compute_j f
+    let feq = compute_feq rho u
+    let fout = collide f feq omega
+    in stream fout
+```
+
+# Data structures and intrinsics
+
+## Only arrays
 
-::: Simulation
+```ocaml
+f: [nx][ny][nz][q]f32 -- 4d array
+feq: [nx][ny][nz][q]f32 -- 4d array
+rho: [nx][ny][nz]f32 -- 3d array
+j: [nx][ny][nz][d]f32 -- 4d array
+```
+
+## Functions
+
+```ocaml
+let map3d 'a [nx][ny][nz]'b
+    (foo: a -> b) (xs: [nx][ny][nz]a) -> [nx][ny][nz]b =
+    map (\xs1 ->
+        map (\xs2 ->
+            map (\xs3 -> foo xs3) xs2
+        ) xs1
+    ) xs
+```
+
+# Computation of macroscopic moments (1/2)
 
 ## LBM equations
 
+On **each** grid point:
+
 \begin{equation}
 \rho=\sum_{i=0}^{q-1}f_i, \forall\ \bm{x}.
 \end{equation}
-:::
-
-::: Futhark
 
 ## Futhark code
 
 ```ocaml
-map (\fx ->
-    map (\fxy ->
-        map (\fxyz ->
-            reduce (+) 0 fxyz
-        ) fxy
-    ) fx
+let compute_rho [nx][ny][nz][q]
+    (f: [nx][ny][nz][q]f32) -> [nx][ny][nz]f32 =
+map3d (\fxyz ->
+    reduce (+) 0 fxyz
 ) f
 ```
-:::
 
 # Computation of macroscopic moments (2/2)
 
-::: Simulation
-
 ## LBM equation
 
 \begin{equation}
 \rho\bm{u}=\sum_{i=0}^{q-1}f_i \bm{c}_i, \forall\ \bm{x}.
 \end{equation}
-:::
-
-::: Futhark
 
 ## Futhark code
 
 ```ocaml
-map (\fx ->
-    map (\fxy ->
-        map (\fxyz ->
-            map(\ci ->
-                dotprod ci fxyz
-            ) (transpose c) -- intrinsic
-        ) fxy
-    ) fx
+let compute_j [nx][ny][nz][q]
+    (f: [nx][ny][nz][q]f32) -> [nx][ny][nz][3]f32 =
+map3d (\fxyz ->
+    map(\ci ->
+        dotprod ci fxyz
+    ) (transpose c) -- intrinsic
 ) f
 ```
-:::
 
 # Computation of the equilibrium distribution
 
-::: Simulation
-
 ## LBM equation
 
 \begin{equation}
 f_i^\mathrm{eq}=w_i\rho\left(1+\frac{\bm{c}_i\cdot \bm{u}}{c_s^2}+\frac{1}{2c_s^4}(\bm{c}_i\cdot \bm{u})^2-\frac{1}{2c_s^2}\bm{u}^2\right),\ \forall \bm{x},i
 \end{equation}
-:::
-
-::: Futhark
 
 ## Futhark code
 
 ```ocaml
-map2(\rho_x j_x ->
-    map2(\rho_xy j_xy ->
-        map2(\rho_xyz j_xyz ->
-            let u = map(\j_xyzi -> j_xyzi / rho_xyz ) j_xyz
-            let u_sqr = dotprod u u
+map_3d(\rho_xyz j_xyz
+    let u = map(\j_xyzi -> j_xyzi / rho_xyz ) j_xyz
+    let u_sqr = dotprod u u
 
-            in map2(\wi ci ->
+    in map2(\wi ci ->
                 let c_u = dotprod ci u
-                in rho_xyz * wi *
-                    (1 + 3 * c_u + 4.5 * c_u * c_u - 1.5 * u_sqr)
-            ) w c
-        ) rho_xy j_xy
-    ) rho_x j_x
-) rho j
+        in rho_xyz * wi *
+            (1 + 3 * c_u + 4.5 * c_u * c_u - 1.5 * u_sqr)
+    ) w c
+) (zip rho j)
 ```
-:::
 
 # Collision
 
-::: Simulation
-
 ## LBM equation
 
 \begin{equation}
 f^\mathrm{out}_i=f_i\left(1-\omega\right)+\omega f_i^\mathrm{eq}.
 \end{equation}
-:::
-
-::: Futhark
-
 ## Futhark code
 
 ```ocaml
-map2(\f_x feq_x ->
-    map2(\f_xy feq_xy ->
-        map2(\f_xyz feq_xyz ->
-            map2(\f_i feq_i->
-                f_i * (1.0 - omega) + feq_i * omega
-            ) f_xyz feq_xyz
-        ) f_xy feq_xy
-    ) f_x feq_x
-) f feq
+map2_3d(\f_xyz feq_xyz ->
+    map2(\f_i feq_i->
+        f_i * (1.0 - omega) + feq_i * omega
+    ) f_xyz feq_xyz
+) (zip f feq)
 ```
-:::
-
 # Streaming
 
-::: Simulation
-
 ## LBM equation
 
 \begin{equation}
 f_i(\bm{x}+\bm{c}_i,t+1)=f^\mathrm{out}_i(\bm{x},t).
 \end{equation}
-:::
-
-::: Futhark
-
 ## Futhark code
 
 ```ocaml
@@ -311,26 +334,25 @@ tabulate_4d nx ny nz q (\x y z ipop ->
     in f[next_x, next_y, next_z, ipop]
 )
 ```
-:::
 
 # Summary
 
 * A simple yet complete fluid flow simulator.
-* Lines of readable and easy to debug Futhark code: 110.
-* Single precision, periodic, D3Q27 only arrays: 250 MLPUS.
+* Lines of readable and "easy" to debug Futhark code: 110.
+* Single precision, periodic, only arrays: 250 MLPUS.
 
 *Not bad: but we can do better.*
 
 # How can we go faster?
 
 * Arrays are aggressively parallelized: each dimension is flattened.
-* For small dimensions it is usually not worth.
+* For small dimensions it is usually not worth it.
 * Replace length 3, or length 27 arrays by tuples: better use of GPU
   architecture or use `INCREMENTAL_FLATTENING`.
 * `[](a, b, c, ..) -> ([]a, []b, []c, ...)`{.ocaml} automatically by the compiler.
 * Result: with a code of 150 lines, we go to 1.5 GLUPS on GPU, 11 MLPUS
   on a single core, 400 MLUPS on a multi-core machine.
-* All results are the same with CUDA and OpenCL backends.
+* Within 10-20\% of state of the art optimized GPU codes.
 
 # Conclusion
 
@@ -345,7 +367,7 @@ tabulate_4d nx ny nz q (\x y z ipop ->
 
 # Current and Future Futhark planned developments
 
-## From Troels Henriksen himself
+## Currently worked on by the core team
 
 * Multi-GPU: only on a single motherboard (very experimental).
 
@@ -358,9 +380,10 @@ tabulate_4d nx ny nz q (\x y z ipop ->
 * Distributed CPU/GPU (MPI) backend.
 * A cool rendering tool directly from the GPU (OpenGL).
 
-
 # Acknowledgments
 
+## By alphabetical order
+
 * V. Berset,
 * B. Coudray.
 * M. El Kharroubi,
@@ -368,5 +391,8 @@ tabulate_4d nx ny nz q (\x y z ipop ->
 
 # Questions?
 
+## Futhark webpage: <https://futhark-lang.org/>
+
 ## Thank you for your attention
+
 
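
Review note: the snippets added to the slides read as pseudo-code (for instance, return types are introduced with `->` and `compute_feq` is called with an undefined `u`). As a rough illustration only, the collision step they describe could be condensed into the sketch below, which is intended to type-check with a recent Futhark compiler. It assumes top-level definitions are spelled `def` (the slides use the older top-level `let`), that $c_s^2 = 1/3$ as implied by the constants 3, 4.5 and 1.5 on the equilibrium slide, and that the lattice velocities `c` and weights `w` are passed in as arguments. The helper `feq_point` and every signature here are hypothetical; only the names `map3d`, `dotprod` and `collide` come from the slides.

```ocaml
-- Sketch under the assumptions stated above, not the code behind the talk.

-- Apply foo to every grid point of a 3D array (same role as the slides' map3d).
def map3d [nx][ny][nz] 'a 'b
    (foo: a -> b) (xs: [nx][ny][nz]a) : [nx][ny][nz]b =
  map (\xs1 -> map (\xs2 -> map foo xs2) xs1) xs

-- f32 dot product, the counterpart of the i32 dotprod.fut example.
def dotprod [n] (xs: [n]f32) (ys: [n]f32) : f32 =
  reduce (+) 0.0 (map2 (*) xs ys)

-- Hypothetical helper: equilibrium populations on one grid point,
-- with 3 = 1/c_s^2, 4.5 = 1/(2 c_s^4) and 1.5 = 1/(2 c_s^2).
def feq_point [q][d]
    (w: [q]f32) (c: [q][d]f32) (rho: f32) (j: [d]f32) : [q]f32 =
  let u = map (\ja -> ja / rho) j
  let u_sqr = dotprod u u
  in map2 (\wi ci ->
             let c_u = dotprod ci u
             in rho * wi * (1.0 + 3.0 * c_u + 4.5 * c_u * c_u - 1.5 * u_sqr))
          w c

-- Collision on every grid point: rho, j, f^eq, then relaxation with omega.
def collide [nx][ny][nz][q][d]
    (omega: f32) (w: [q]f32) (c: [q][d]f32)
    (f: [nx][ny][nz][q]f32) : [nx][ny][nz][q]f32 =
  map3d (\fxyz ->
           let rho = reduce (+) 0.0 fxyz
           let j = map (\ca -> dotprod ca fxyz) (transpose c)
           let feq = feq_point w c rho j
           in map2 (\fi feqi -> fi * (1.0 - omega) + omega * feqi) fxyz feq)
        f
```

Computing `rho`, `j` and `feq` inside a single `map3d` keeps the step purely local and avoids zipping several 3D arrays; the slides instead build `rho`, `j` and `feq` as separate arrays, which the compiler may well fuse on its own. Streaming is left out because it is the only non-local step.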