Available online at www.sciencedirect.com

ELSEVIER

ScienceDirect

Procedia Computer Science

Procedía Computer Science 1 (2010) 1083-1091

www.elsevier.com/locate/procedia

International Conference on Computational Science, ICCS 2010

Solving Boltzmann equation on GPU

Yu.Yu. Klossa'b, P.V. Shuvalova'b'*, F.G. Tcheremissinea-c

aMoscow Institute of Physics and Technology bRussian Research Center "Kurchatov Institute cDorodnicyn Computing Centre of RAS

Abstract

The paper describes the implementation of conservative projection method of solving the Boltzmann equation on GPU. The NVIDIA CUDA technology is used as the computational platform. Methods for optimal implementation of the solver for two-dimensional geometry and methods for setting boundary conditions, implementation of the geometry and storage of integrating grids were developed. The developed methodology was used to perform calculations for the problem of slow gas flow in a rectangular cavity and that of shock wave propagation in a narrow channel. The efficiency of using GPU for analysis of slow and high-speed flows of rarefied gas is demonstrated.

Keywords: Boltzmann equation, Rarefied gas, GPGPU, CUDA, Slow flows, Shock Waves © 2010 Published by Elsevier Ltd.

1. Introduction

Simulation of rarefied gas flows has many applications, including flows in micro- and nanodevices, and studies of shock wave structure and phenomena in the Knudsen boundary layer. Widely used simulation methods are those built around the Boltzmann equation relaxation model [1] and simulation methods of statistical modelling [2].

In the work presented herein the conservative projection method of solving the Boltzmann equation is used [3, 4, 5, 6]. The equation is solved for two molecular potential models the hard-sphere model and the Lennard-Jones potential model. Its solution is a distribution function defined in a six-dimensional phase space. The problem is complex in terms of computations and requires a highly efficient algorithm. Massively parallel computation technologies that have been appearing in recent years show promise for solution of such complex problems. Recent studies in different physical modelling areas, such as [7, 8, 9, 10], have demonstrated high efficiency of using GPU. This paper describes methods of solving two-dimensional problems in planar geometry; however, the developed methodology can be extended to deal with three-dimensional problems and problems with unstructured grids [11].

* Corresponding author

Email addresses: kloss@mnpt.kiae.ru (Yu.Yu. Kloss), shuvalov.pavel@gmail.com (P.V. Shuvalov), tcherem@ccas.ru (F.G. Tcheremissine)

1877-0509 © 2010 Published by Elsevier Ltd. doi:10.1016/j.procs.2010.04.120

2. Mathematical and theoretical framework

In the process of simulation, gas macroparameters, such as density n(x,t), temperature T(x,t), velocity U(x,t), and others are calculated by numerical integration on velocity £ of distribution function f (£,x,t), determined from solution of the Boltzmann equation:

df/dt + £ • df/dx = I(f,f) (1)

Equation (1) is solved by the finite-difference method on a fixed grid in velocity and coordinate space. The computational domain in the coordinate space is presented by a set of linked domains of rectangular grids. This allows calculating geometries that include channels and obstacles, while using a rectangular grid. A spherical domain Q on a uniform grid with a spacing hv = 2Vmax/No is chosen as the velocity space. The radius of the sphere bounding the velocity space is Vmax, and the sphere centre may be displaced relative to zero depending on the problem. The maximum velocity Vmax was taken equal to 4.8^/kT/m. In the most part of the calculations, the number of nodes No along one coordinate axis was equal to 20. For No = 20 the number of velocity space nodes is 4224.

Equation (1) splits into two parts - the transport equation representing the left side, and the relaxation problem. The splitting takes the form:

St = At/2Ct At/2 (2)

where ST - is the operator that transforms the solution at to to the solution at to + t; AT - is the operator of solving the transport equation df/dt + £ • df/dx = 0 from steps t; and CT - is the operator of solving the relaxation problem df/dt = I (f, f).

The left side of equation (1) is solved by an explicit first-order difference scheme:

f n + l _ f n f n _ f n f n _ f n

-T--+ --h--+ 1--h--(3)

T hx hy

For monatomic gas, the collision integral can be represented as follows:

CC 2n Om

I (f>f)=// / (f 'f '* - ff*)gbdbdipd£t

By introducing the following notation: ) = S(£ — ) + S(£t — ) — S(£' — ) — S(£' — ), where S denotes the Dirac delta function, we can present the value of the collision integral in point as follows:

CG CG 2n bm

I(£y) = 4 / / / / )(f'f'. — ff*)gbdbdipd£d£* (4)

— C —C 0 0

In equation (4), integration is performed numerically on the grid , ,bv,tp„ having Nv nodes in the 8-dimensional domain Q x Q x [0, 2n] x [0,bmax]. It should be noted that in the general case post-collision velocities v and do not get into the velocity grid nodes. For each velocity v there is determined a pair of nodes and that approximates the velocity on the velocity grid, and for velocity ^ there are chosen nodes +Sv and —Sv that are symmetric about the total velocity gv = + , after which the delta function in formula (4) is represented in the following form: — ) = (1 — rv)S(£\v — ) + rvS(£Mv — ),

where factor rv is found from the energy conservation law: (£'av )2+(£'3v )2 = (1 — rv)(£2v + ) + rv (£2v+Sv +

—Sv). The following power interpolation is used for approximation of the product of distribution functions in the nodes after the collision:

fav fj3v = (fAv fMv ) V + (f Av +Sv fMv —S v)rv (5)

This interpolation provides strict equality of the collision integral of the Maxwell distribution function to zero. This property is especially important for simulation of slow flows and flows containing slightly disturbed regions, where it essentially improves the computational accuracy.

Let the solution take the form f = fM+sf(1), where sf(1) - is the deviation from the Maxwell distribution, and s ^ 1. Substituting this solution in the collision integral and considering (5), we arrive at:

I(f, f) = I(Jm,fM) + 2sI(fM, f(1)) + s2/(f(1),f(1)) (6)

Since the principal part of integral I(Jm ,fM) is calculated exactly, with some natural assumptions about the integral evaluation error it follows from (6) that the collision integral calculation error reduces by (2s)-1 times.

The collision integral calculation can be broken down into a preparation step performed once before the calculation, and calculation of contribution of the integral to each node at each time step. The preparation step can be divided into a number of substeps:

i Generation of 8-dimensional integrating grid , ,bv ,tp„ on the basis of Korobov grids [12]. Korobov grids ensure higher accuracy of multidimensional integral evaluation compared to random node grids. At this step, several integrating grids are created, each having the same displacement of all nodes, and one of them is chosen in a random manner at each time step of the calculation. Points that had not initially got into domain Q are excluded from the integrating grid.

ii Post-collision velocities £'av and ^v are calculated for each point of the integrating grid ,bv,tp„. The fly-away velocities are calculated on the basis of the chosen potential as shown in [5].

iii Factors rv are calculated using the method described above for each velocity pair ^ and .

3. Implementation on GPU

The algorithm described was implemented in MPI-based parallel solver NSOLVER [13]. To implement the algorithm on modern GPU, the NVIDIA CUDA technology [14] was used, that provides a developer with C++ language extension set for running the device code on GPU and for GPU memory management. CUDA Toolkit version 2.3 was used and all computations were performed with single precision as it is quite sufficient for all problems.

All launched GPU threads are grouped in a thread block, and thread blocks in their turn are grouped in a block grid. At the moment the maximum number of threads that can be run in a block is 512. It is not necessary that all threads in a block are executed at the same time, but they can use shared memory where data is stored during the lifetime of the block. The built-in variable threadldx is used for addressing threads in a block. An index of a thread in a block may represent two- or three-dimensional vector comprising unsigned integers: threadldx.x, threadldx.y, threadldx.z. Such addressing is convenient for running the kernel on data representing a grid function in a physical space. For addressing a block in a block grid the built-in variable blockldx is used, that may have two-dimensional addressing using variables blockldx.x and blockldx.y. Each thread also has access to constants defining the size of the block and block grid -blockDim and gridDim.

To achieve a speedup memory access patterns must be optimized. Let us consider the mapping in a rectangular domain of size Sx xSy, such that Sx < Sy. To avoid collisions, each thread calculates a particular point in the coordinate space. For convenience of calculation of boundary conditions cells with indices —1 and Sx are added to the computational domain (in a similar way this is done for coordinate Y). Before launching the calculation, constant Xmax, that determines the number of threads in the block is defined. The computational kernel for the advection step calculation is launched in the following configuration: the block size is [Sx/Xmax] x 1 x 1, and the block grid size is (Sx mod Xmax)Sy x Sv, where Sv - is the size of the array of distribution function values in one space cell. The same configuration is used when running the computational kernel for the collision integral, except for the block size along the z axis - in this case it equals 1 since it is not required to go round all values in the velocity space. The most obvious order of addressing a value of the distribution function f (x,y,Vi) is coordinate addressing followed by velocity index addressing. However, with this approach neighbouring threads will refer to memory cells located at distance Sv which produce uncoalesced access to the GPU memory. Following conditions shall be met for optimizing the memory access:

i Thread number n refers to word number n in the memory.

ii The minimum address of a word in block shall be aligned to 16 times the size of word.

If values for neighbouring cells along the X axis having the same velocity are located in neighbouring cells, the first condition will be met. The minimum word address in a block will depend on the block sizes and in the general case may not be divisible by 16 (the size of the computational domain may be arbitrary, especially if dummy cells are considered). To eliminate this drawback, empty cells used only for alignment are introduced. The aforesaid is implemented in the code of function getIdx(x, y, v) that maps spatial coordinates and velocity index into a linear memory array as follows:

(x, y,v) ^ x + 16 + (y + 1)(Sx + 2 + A) + v(Sx + 2 + A)(Sy + 2) (7)

,where A is governed by condition (Sx + 2 + A) = 0 (mod 16).

No limitations on indices in the velocity space have been mentioned by now. The use of a cubical velocity space would allow the velocity value to be easily determined from known indices for each coordinate. This approach, however, cannot be applied because the method being described uses a spherical velocity space. It is required to determine velocity from the index in order to solve the transport equation and calculate macroparameters. To this end, auxiliary array speedMap[] is used, that stores values of velocity projection on each coordinate. This array is initialized before running the calculation and is stored in the GPU constant memory (having the size of 64KB) that suffices for storage of such data.

3.1. Transport equation

Solution of the transport equation is broken down into two steps calculation of boundary conditions, and calculation of values for a new layer using formula (3). Boundary conditions represent mirror or diffuse reflection from the computational domain walls, and are recorded in dummy cells. Thereafter the calculation is run for a new time step using the pseudocode from listing 1(global memory reads and writes are marked as and =^).

Algorithm 1 GPU pseudocode of the advection step Require: ft is a linear memory array with time step t values Require: ft+i is a linear memory array with time step t +1 values Require: x = threadldx.x + blockDim.x * (blockldx.x/Sy) Require: y = blockldx.x (mod Sy) Require: v = blockIdx.y 1: vx = getSpeed(speedMapX[v],Nv,Vmax) 2: vy = getSpeed(speedMapY[v],Nv,Vmax) 3: fo ft[getIdx(x,y, v)] 4: fx ft[getIdx(x + sgn(vx),y, v)] 5: fy ft[getIdx(x,y + sgn(vy),v)]

6: fo + T/h(|vx|(fx — fo) + |vy|(fy — fo)) ft+i[getIdx(x,y, v)]

3.2. Collision integral

The integrating grid was prepared on CPU using a code from the NSOLVER solver described in [13]. Pseudocode of the computational kernel for the collision integral is presented on listing 2. One random Korobov grid from the set prepared is selected on each time step. All threads operate with this integrating grid, it is accessed via the shared memory in order to speed up the calculation.

Algorithm 2 GPU pseudocode of the collision step Require: x = threadldx.x + blockDim.x * blockldx.x Require: y = blockldx.x

Require: K is an array of a current Korobov grid values Require: Nv is a number of points in the Korobov grid K

Require: g[Xmax] is a shared memory buffer for points from the Korobov grid K 1: for i = 0 to Nv/Xmax - 1 do

2: K[i * Xmax + threadldx.x] ^ g[threadIdx.x] {Filling shared memory} 3: for j = 0 to Xmax - 1 do

4: av,/3v,\v,nv,sv,gv,bv,rv ^ g[j] {Reading from shared memory} 5: fav ft[getIdx(x,y, av)], f^ ft[getIdx(x,y, )] 6: fxv ft[getIdx(x,y,\v)], fMv ft[getIdx(x,y,Hv)] 7: fxv+sv ft[getIdx(x,y,\v + Sv)], fMv-sv ft[getIdx(x,y,Hv - sv)]

8: A = Bgv bv (fav fl3v - (fAv fMv )1-rv + (fXv+Sv Uv-sv )rv )

9: fav - A ft[getIdx(x,y,av)], fi3v - A ft[getIdx(x,y,Pv)]

10: fxv + (1 - rv)A ft[getIdx(x,y,K)], f^ + (1 - rv)A ft[getIdx(x,y, ^)]

11: f\v+sv + rv A ft[getIdx(x,y,\v + Sv)]

12: f^v-sv + rv A ft[getIdx(x,y,y,v - Sv)]

13: end for 14: end for

Figure 2: Velocity profiles

4. Calculation examples 4-1. Driven cavity

The algorithm was tested on the driven cavity problem. The problem geometry is shown in Figure 1. Gas confined in a two-dimensional square domain is considered. Due to the uniform motion of the upper wall with velocity Vw, a flow is induced in the gas. The cavity walls have the same temperature To, that is equal to that of the gas at the initial instant. The gas is diffusely scattered on the walls;on the upper wall molecules acquire velocity Vw, and the reflected molecules distribution function takes the following form:

"t Nf . m x3/2 -m(j-VW)2 , s

f(^ = Nnorm • W)3/2 •e (8)

In formula (8) the value of Nf is equal to the flow to the wall J £yfdand the value of Nnorm is determined

from condition of equality of flows to the wall and flows from the wall. Calculation of this problem for a model equation at Vw = 0.01^/2kTo/m is presented in [10, 15, 16]. We solved the Boltzmann equation at Vw = 0.003 ^/2kTo/m on a two-dimensional spatial grid with 160 x 160 nodes and a three-dimensional velocity grid with 20 nodes along each direction. The cavity size is L = 10A, where A = (\f2nno<J2)-1 -is the free path length, which gives the Knudsen number of 0.1. At the collision step, the Korobov grid size was 50000 nodes(^ 11200 effective nodes lying inside the sphere bounding the velocity space), which ensured satisfactory accuracy. One time step was t = 0.0186to, where To = A/sj2kTo/m - is the mean free time. On GPU NVidia GTX285, the calculation of 4000 steps took 28 minutes, while a similar calculation on one core CPU Intel Core Quad 2.66GHz took about 2.5 days, so the total speedup is about 135 times.

Figure 1 shows steady-state streamlines and velocity module. Calculation results are also graphically presented on Figure 2 as profiles of velocity in vertical section passing through the centre of the cavity, and profiles of velocity in horizontal section passing through the vortex centre. Note that there is no random noise in the figure showing streamlines and in the plots.

4-2. Performance analysis

Calculation speedup was determined separately for the transport equation and for the collision integral. The value of advection time contain a summary of two AT/2 steps from one time step as shown in formula 2. Summary time stands for the time needed for one time step calculation without copying data from GPU to host memory and calculation macroparameters, i.e. ST calculation time. A higher speedup was achieved for the transport equation due to higher occupancy of the GPU, since independence of solutions for different velocities allowed running a separate thread for each velocity. Also the transport equation requires less

Grid 64 x 64 96 x 96 128 x 128 160 x 160

Advection speedup 210 347 395 422

Collision speedup 32.5 35.6 39.5 33

Summary speedup 105 139 150 135

Table 1: Comparison of speedup for different grid sizes on GPU Nvidia GTX 285

Device_CPU(2.66GHz) 9600GT GTS250 GTX285

Cores 1 64 128 240

Advection Time(m) 1469.14 10.436 6.21 3.81

Collision Time(m) 437.43 45.61 20.1 11.47

Summary time(m) 1906.57 56.05 26.32 15.28

Integral/Summary 23% 81% 76% 75%

Advection speedup 1 140.8 246.6 385

Collision speedup 1 9.8 21.8 38.1

Summary speedup 1 34 72.4 124.8

Table 2: Performance on CPU and different GPU for the driven cavity problem with a 128 X 128 grid

global memory operations then the collision step (4 operations per one iteration for the transport equation and 12 ones for the collision step and additional shared memory operations with Korobov grid).

NSOLVER [13], parallel solver previously used for calculation of a number of problems [17, 18], was used as the code for CPU. The CPU code was compiled with the gcc with -O3 optimization level and with single precision as well as GPU code. Time consuming part of the CPU implementation of the collision step was especially optimized with SSE2 instructions. NSOLVER was run as one process on one core of processor Intel Core2 Quad 2.66GHz. When using GPU, main calculations were performed on GPU NVidia GTX285. Values of speedup for different grid sizes are listed in Table 1. It is seen from the table that a growth of the grid increases the speed in case of the transport equation, but has little effect on the speedup for the collision integral. The data in the table shows that with the CPU it takes less time to calculate the integral than the transport equation which is accounted for by the algorithm efficiency, namely high accuracy at small deviation from the equilibrium solution (as shown in formula (6) above), and for advantage of SSE2 optimization. It should be noted that the CPU version makes no use of dummy cells for boundary conditions that produces additional code branching though it is not so critical as for GPU.

Performance on different GPU was also compared. The tests were performed for the problem described above with the coordinate grid of size 128 x 128 and No = 18. Table 2 presents summary data on the number of cores and memory size for the abovementioned GPU and on the speedup achieved on the latter. It is seen from the table that in case of GPU, performance improves with increasing number of processors, and this is true both for the integral and the transport equation.

Speedup results may be compared with ones presented in the paper [10]. In that paper GPU implementation of the BGKW model equation was introduced and authors obtain comparable time(7 minutes for 4100 time steps) of solving the driven cavity problem in case of Vw = 0.01x/2kTo/m and a two-dimensional spatial grid with 160 x 160 nodes.

4.3. Shock waves

The algorithm was also tested on a problem of shock wave (SW) propagation in a narrow channel that demonstrates applicability of the algorithm to nonsteady and supersonic flows. Figure 3(a) shows geometry of a shock tube consisting of a wide driving section (section A) and a narrow test section (section B). At the initial instant the pressure in the driving section is 10 times higher than in the test one, and temperatures are equal in the both sections. Decay of the initial pressure discontinuity produces a shock wave that moves inside the test section. The theoretical value of the SW Mach number calculated from Rankine-Hugoniot relations based on the one-dimensional theory of decay of an arbitrary discontinuity is M = 1.55. This

Figure 3: (a)SW problem geometry, (b) SW decelereation

-80 -40 0 40 80 120

1 1.6 2.2 2.8 3.4 4 4.6 5.2 5.8 6.4 7 7.6 8.2 8.8 9.4 10 -80 -40 0 40 80 120

0.7 0.77 0.84 0.91 0.98 1.05 1.12 1.19 1.26 1.33 1.4 1.47 1.5

Figure 4: Density and temperature fields at t =50

problem is studied in detail in [19], through simulation of forward and backward SW on the MIPT-60 cluster.

In the work presented herein the computational domain was divided into two domains containing sections A and B. Calculations for each domain were done separately, and at the boundary conditions preparation step data from adjacent cells of the neighbouring domain were copied into dummy cells. This approach allows scaling up solution of the problem to several GPU. In the calculation, the size of domain A was 160 x 80, the size of domain B was 320 x 40, coordinate increment h = 0.5A and the velocity space had the following parameters: No = 18, Vmax = 6.8^/kTo/m, the integrating grid consisted of 2 • 105 Korobov points, and the time step was t = 0.0111to. The simulation was performed using the Lennard-Jones interaction potential with parameters for argon at 300K. Figure 4 shows density and temperature fields at t = 50to, after calculations for 4500 time steps. It took 94 minutes to calculate 4500 steps on GPU, while a similar

calculation on the MIPT-60 cluster using 15 nodes (four 3.00GHz cores) took more than 7 hours. Thus the calculations speeded up by about 74 times. Figure 3(b) presents the curve of deceleration of the SW velocity (expressed as the Mach number) as the SW propagates along the channel.

5. Conclusion

The use of GPU allows achieving a significant speedup in solution of rarefied gas dynamics problems on the basis of the Boltzmann equation. In spite of obvious memory limitations, it is quite sufficient for modelling most of the two-dimensional problems both for slow flows and fast nonsteady ones. The method of dividing the computational domain into rectangular domains described in the paper allows scaling up the solver to a number of nodes containing GPU; in this case the ratio of the boundary region size to the domain size (the memory size order of magnitude) will be small, so the costs for transferring boundary conditions should be low. The significant speedup in solution of the transport equation compared to the collision integral allows the use of schemes with higher order of approximation without loss in performance.

References

[1] P. L. Bhatnagar, E. P. Gross, M. Krook, A model for collision processes in gases. I. Small amplitude processes in charged and neutral one-component systems, Phys. Rev. 94 (1954) 211-225.

[2] G. A. Bird, Molecular Gas Dynamics and the Direct Simulation of Gas Flows, Clarendon Press, Oxford University Press, Oxford, 1994.

[3] F. G. Tcheremissine, A conservative method for calculation of the Boltzmann collision integral, Doklady Physics 42 (11) (1997) 607-610.

[4] F. G. Tcheremissine, Direct numerical solution of the Boltzmann equation, in: 24-th Intern. Symp, Rarefied Gas Dynamics, AIP Conference Proceedings, 2005, pp. 667-685.

[5] F. G. Tcheremissine, Solution of the Boltzmann kinetic equation for high-speed flows, Comp.Math. and Math.Physics

46 (2) (2006) 315-329.

[6] F. G. Tcheremissine, Solution of the Boltzmann kinetic equation for low speed flows, Transport Theory and Statistical Physics 37 (5) (2008) 564-575.

[7] J. A. Anderson, C. D. Lorenz, A. Travesset, General purpose molecular dynamics simulations fully implemented on graphics processing units, J. Comput. Phys. 227 (10) (2008) 5342-5359.

[8] F. Molnar Jr., T. Szakaly, R. Meszaros, I. Lagzi, Air pollution modelling using a Graphics Processing Unit with CUDA, Computer Physics Communications 180 (12) (2010) 105-112.

[9] T. Brandvik, G. Pullan, Acceleration of a 3D Euler solver using commodity graphics hardware, in: 46th AIAA Aerospace Sciences Meeting and Exhibit, Nevada, 2008.

[10] A. Frezzotti, G. P. Ghiroldi, L. Gibelli, Solving kinetic equations on GPUs I: Model kinetic equations (2009). URL http://arxiv.org/abs/0903.4044v1

[11] A. Corrigan, F. Camelli, R. Lohner, J. Wallin, Running unstructured grid CFD solvers on modern graphics hardware, in: 19th AIAA Computational Fluid Dynamics Conference, no. AIAA 2009-4001, 2009.

[12] N. M. Korobov, Trigonometric Sums and Their Applications, Moscow. Nauka, 1989, in russian.

[13] Y. Y. Kloss, F. G. Tcheremissine, N. I. Khokhlov, B. A. Shurygin, Programming and modelling environment for studies of gas flows in micro- and nanostructures based on solving the Boltzmann equation, Atomic Energy 105 (4) (2008) 211-217.

[14] NVIDIA corporation, NVIDIA CUDA programming guide, version 2.3.1 (Aug. 2009). URL http://www.nvidia.com/object/cuda_develop.html

[15] D. Valougeorgis, S. Vauritis, F. Sharipov, Application of the integro-moment method to steady-state two-dimensional rerafeied gas flows subject to boundary induced discontinuities, J. Comput. Phys. 227 (2008) 6272-6287.

[16] S. Naris, D. Valougeorgis, The driven cavity flow over the whole range of the knudsen number, Phys. Fluids 17 (9) (2005) 097106.

[17] N. I. Khoklov, Y. Y. Kloss, B. A. Shurygin, F. G. Tcheremissine, Simulation of the temperature driven micro pump by solving the Boltzmann equation, in: 26-th International Symposium on Rarefied Gas Dynamics, Books of abstract, Kyoto University, 2008.

[18] N. I. Khokhlov, Y. Y. Kloss, B. A. Shurygin, F. G. Tcheremissine, Application of MPI technology for solving the Boltzmann equation, in: 20-th International Conference on Transport Theory, Book of Abstracts, Obninsk, 2007, p. 189.

[19] Y. Y. Kloss, F. G. Tcheremissine, P. V. Shuvalov, Solving Boltzmann equation for unsteady flows with shock waves in narrow chanels, Comp.Math.and Math.Physics.To be published.