
Procedia Engineering 61 (2013) 76 - 80


Parallel Computational Fluid Dynamics Conference (ParCFD2013)

OpenCL implementation of basic operations for a high-order finite-volume polynomial scheme on unstructured hybrid meshes

S.A. Soukov a, A.V. Gorobets a,*, P.B. Bogdanov b

a Keldysh Institute of Applied Mathematics of RAS, 4A, Miusskaya Sq., Moscow, 125047, Russia
b Scientific Research Institute of System Development of RAS, 36-1, Nahimovskiy prosp., Moscow, 117218, Russia

Abstract

A parallel finite-volume algorithm based on a cell-centered high-order polynomial scheme for unstructured hybrid meshes is under consideration. The work is focused on the adaptation and optimization of the basic operations of the algorithm for different architectures of massively-parallel accelerators, including GPUs of AMD and NVIDIA. Such an algorithm is especially problematic for GPU architectures since it has a very low FLOP-per-byte ratio, meaning that performance is dominated by the memory bandwidth rather than the computing performance of a device. At the same time, it has an irregular memory access pattern since unstructured meshes are used. The calculation of polynomial coefficients and the calculation of convective fluxes through faces of cells are the most interesting and time-consuming operations of the algorithm. Implementations of these operations for accelerators using OpenCL are considered here in detail. Ways to improve the computational efficiency are proposed, and performance measurements reaching up to 160 GFLOPS on a single GPU device are demonstrated.

© 2013 The Authors. Published by Elsevier Ltd.

Selection and peer-review under responsibility of the Hunan University and National Supercomputing Center in Changsha (NSCC)

Keywords: OpenCL, MPI, OpenMP, GPU, Parallel CFD, finite-volume, unstructured mesh

Nomenclature

ρ density

u velocity component in x direction

v velocity component in y direction

w velocity component in z direction

p pressure

E total energy

Q vector of conservative variables

Φ vector of fluxes of conservative variables

Greek symbols
γ adiabatic index
ε internal energy
τ time step

* Corresponding author. E-mail address: andrey@cttc.upc.edu

doi: 10.1016/j.proeng.2013.07.096

1. Introduction

Hybrid supercomputers based on massively-parallel accelerators, such as graphics processing units (GPU) of AMD and NVIDIA or Intel Many Integrated Core (MIC) Xeon Phi accelerators, are becoming widely used in CFD for many different applications. An overview of GPU applications can be found, for instance, in [1]. Hybrid computing is widely used, for example, in numerical approaches based on the lattice Boltzmann method (LBM), which fits the GPU architecture well. Examples of LBM applications can be found in [2,3]. Many GPU applications can be found as well for finite-difference and finite-volume approaches on structured meshes, including in particular [4].

This work deals with a kind of algorithm that is perhaps one of the most difficult from the point of view of computations on massively-parallel accelerators. A finite-volume algorithm based on a cell-centered high-order polynomial scheme for unstructured hybrid meshes is under consideration. Its main complicating factors are, on the one hand, a very low FLOP-per-byte ratio, which makes performance dominated by the memory bandwidth of the device, and, on the other hand, the irregular memory access pattern that comes from unstructured hybrid meshes.

The rest of the paper is organized as follows. A summary of the mathematical model and the spatial discretization is given in Section 2. Implementation details for the most interesting and time-consuming operations of the algorithm are presented in Section 3. Finally, performance measurement results are summarized in Section 4.

2. Mathematical model and discretization

A compressible flow governed by the Euler equations is considered:

∂Q/∂t + ∂F1(Q)/∂x + ∂F2(Q)/∂y + ∂F3(Q)/∂z = 0,    (1)

where Q = (ρ, ρu, ρv, ρw, E)^T is the vector of the five conservative variables and

F1(Q) = (ρu, ρu² + p, ρuv, ρuw, u(E + p))^T,
F2(Q) = (ρv, ρuv, ρv² + p, ρvw, v(E + p))^T,
F3(Q) = (ρw, ρuw, ρvw, ρw² + p, w(E + p))^T,

are the convective fluxes. The total energy is defined as E = ρ(u² + v² + w²)/2 + ρε. Equations (1) are closed with the ideal gas state equation p = ρε(γ − 1).
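As a small numerical illustration (not from the paper), the closure relations and the x-direction flux can be evaluated for an arbitrary sample state:

```python
# Ideal-gas closure for the Euler system: a numerical check.
# All state values below are arbitrary illustrative numbers.
rho, u, v, w = 1.2, 10.0, 0.0, 0.0    # density and velocity components
gamma = 1.4                            # adiabatic index
eps = 2.0e5                            # specific internal energy

p = rho * eps * (gamma - 1.0)                        # state equation p = rho*eps*(gamma - 1)
E = rho * (u**2 + v**2 + w**2) / 2.0 + rho * eps     # total energy

# Conservative variables Q = (rho, rho*u, rho*v, rho*w, E)^T
Q = [rho, rho * u, rho * v, rho * w, E]

# Convective flux in x: F1(Q) = (rho*u, rho*u^2 + p, rho*u*v, rho*u*w, u*(E + p))^T
F1 = [rho * u, rho * u**2 + p, rho * u * v, rho * u * w, u * (E + p)]
print(p, E, F1)
```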

The equations are discretized with a cell-centered scheme for unstructured hybrid meshes that consist of tetrahedrons, hexahedrons, quadrangular pyramids and triangular prisms. The mean integral values of the conservative variables are defined in each cell: f_ID = (∫_{V_ID} f(x, y, z) dV) / V_ID, where the subindex ID denotes the identification number of the cell, V_ID is the volume of the cell and f_ID denotes one of the five conservative variables. Details on a similar discretization approach can be found in [5], for instance.

With the first-order Runge-Kutta time integration scheme the values on the new temporal layer are Q_ID^{n+1} = Q_ID^n + Φ_ID τ / V_ID, where τ is the value of the time step. The convective flux, Φ_ID, in the ID-th cell is computed as a sum of the fluxes through the faces of the cell:

Φ_ID = Σ_{I=0}^{NS_ID − 1} Φ_{S_ID/I_ID},

where NS_ID is the number of faces of the cell. The flux Φ_{S_ID/I_ID} through the face with the identification number ID/I_ID (that divides the ID-th and I_ID-th cells) is computed using the Roe scheme [6]: Φ_{S_ID/I_ID} = F_Roe(Q^L_{ID/I_ID}, Q^R_{ID/I_ID}, n_{ID/I_ID}). This flux depends on the normal vector, n_{ID/I_ID}, and on the values of the conservative variables on the "left" and "right" sides of the face at the position of its center of mass, Q^L_{ID/I_ID} and Q^R_{ID/I_ID}, respectively. To obtain these left and right values with a higher order of accuracy a polynomial

reconstruction is used. The variation of a variable f inside the ID-th cell is defined by a quadratic polynomial of the form

P^f_ID(x, y, z) = C^0_fID ((x − xc_ID)² − g^0_ID) + C^1_fID ((y − yc_ID)² − g^1_ID) + C^2_fID ((z − zc_ID)² − g^2_ID) + C^3_fID ((x − xc_ID)(y − yc_ID) − g^3_ID) + C^4_fID ((x − xc_ID)(z − zc_ID) − g^4_ID) + C^5_fID ((y − yc_ID)(z − zc_ID) − g^5_ID) + C^6_fID (x − xc_ID) + C^7_fID (y − yc_ID) + C^8_fID (z − zc_ID) + f_ID,    (2)

where g^h_ID, h ∈ [0, 5], are six constant geometric coefficients that depend on the shape of the ID-th cell, and xc_ID, yc_ID, zc_ID are the coordinates of the center of mass of the cell.
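A direct transcription of polynomial (2) can be sketched as follows; the coefficient and geometry values passed in below are hypothetical placeholders, not taken from the paper:

```python
def eval_poly(C, f_mean, xc, yc, zc, g, x, y, z):
    """Evaluate the quadratic reconstruction polynomial (2) of one cell.

    C      : nine polynomial coefficients C^0..C^8
    f_mean : mean integral value f_ID of the variable in the cell
    xc, yc, zc : coordinates of the center of mass of the cell
    g      : six geometric coefficients g^0..g^5 of the cell
    """
    dx, dy, dz = x - xc, y - yc, z - zc
    return (C[0] * (dx * dx - g[0]) + C[1] * (dy * dy - g[1]) +
            C[2] * (dz * dz - g[2]) + C[3] * (dx * dy - g[3]) +
            C[4] * (dx * dz - g[4]) + C[5] * (dy * dz - g[5]) +
            C[6] * dx + C[7] * dy + C[8] * dz + f_mean)

# With all coefficients zero the polynomial reduces to the cell mean value.
print(eval_poly([0.0] * 9, 1.5, 0.0, 0.0, 0.0, [0.0] * 6, 0.3, 0.2, 0.1))  # 1.5
```

The subtracted g^h terms are what make the reconstruction conservative: they are chosen so that the cell average of the polynomial equals f_ID.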

3. Implementation of basic operations

The algorithm is originally parallelized using a combined two-level MPI+OpenMP approach. On the first level, MPI is used within a classical domain decomposition approach. Further, the workload is shared between CPUs and accelerators using a second-level domain decomposition. OpenMP is used to parallelize across multiple CPU cores and OpenCL [7] is used to compute on accelerators. With this approach, the time-step algorithm is as follows:

1. compute the value of the time step according to the current velocity field;

2. update halos within the multi-level domain decomposition approach;

3. compute polynomial coefficients;

4. compute fluxes through faces of control volumes;

5. Runge-Kutta integration step to the new temporal layer.
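The steps above can be sketched as a driver loop; every function name here is a hypothetical stand-in for the actual MPI/OpenCL routines, and the stubs only record the call order:

```python
# Hypothetical driver for one time step of the algorithm.
# Each helper is a placeholder; real code would invoke MPI calls and OpenCL kernels.
log = []

def compute_time_step(state):   log.append("dt")      # step 1: dt from current velocity field
def update_halos(state):        log.append("halos")   # step 2: multi-level halo exchange
def compute_poly_coeffs(state): log.append("poly")    # step 3: P_ID = A_ID x B_ID per cell
def compute_fluxes(state):      log.append("fluxes")  # step 4: Roe fluxes through faces
def runge_kutta_update(state):  log.append("rk")      # step 5: advance to the new layer

def time_step(state):
    compute_time_step(state)
    update_halos(state)
    compute_poly_coeffs(state)
    compute_fluxes(state)
    runge_kutta_update(state)

time_step({})
print(log)  # ['dt', 'halos', 'poly', 'fluxes', 'rk']
```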

Implementations of the most interesting and time-consuming operations, namely the calculation of polynomial coefficients and the calculation of convective fluxes through faces of cells, are described in detail below.

3.1. Calculation of polynomial coefficients

At the geometry initialization stage a stencil of the spatial scheme is built for each mesh cell. For the ID-th cell the stencil consists of NG_ID neighbor elements with indices G^m_ID, m ∈ [0, NG_ID − 1]. Each cell can have a different number of elements in the stencil, depending on the geometry and position of the neighboring elements. At the preprocessing stage matrices of geometric coefficients, A_ID, are formed for each cell, and the polynomial coefficients C^h_fID can be computed by a matrix-vector product for each of the five conservative variables:

(C^0_fID, C^1_fID, …, C^8_fID)^T = A_ID × (f_{G^0_ID} − f_ID, f_{G^1_ID} − f_ID, …, f_{G^{NG_ID−1}_ID} − f_ID)^T,

where A_ID is a (9 × NG_ID) matrix.

The matrices A_ID stay constant during the whole simulation. Joining the five matrix-vector products for the five conservative variables into one operation gives a product of matrices of sizes (9 × NG_ID) and (NG_ID × 5), resulting in a (9 × 5) matrix of polynomial coefficients for each cell: P_ID = A_ID × B_ID.
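The joined operation can be sketched with NumPy; the stencil size and all values below are illustrative, and the check confirms that the single (9 × NG) by (NG × 5) product equals the five separate matrix-vector products:

```python
import numpy as np

rng = np.random.default_rng(0)
NG = 12                                     # stencil size NG_ID of the cell (illustrative)

A = rng.standard_normal((9, NG))            # constant geometric matrix A_ID, (9 x NG)
f_stencil = rng.standard_normal((NG, 5))    # stencil values of the 5 conservative variables
f_cell = rng.standard_normal(5)             # mean values in the cell itself

B = f_stencil - f_cell                      # B_ID: differences f_G - f_ID, (NG x 5)
P = A @ B                                   # (9 x 5) matrix of polynomial coefficients

# The joined product is equivalent to five separate matrix-vector products,
# one per conservative variable:
P_cols = np.stack([A @ B[:, k] for k in range(5)], axis=1)
print(np.allclose(P, P_cols))  # True
```

Fusing the five products raises arithmetic intensity slightly, since A_ID is loaded once instead of five times, which matters for a bandwidth-bound operation.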

The OpenCL implementation for a massively-parallel accelerator is as follows. Each local work-group processes 4 cells. Cells are grouped according to their stencil sizes for better load balancing. Each cell is processed by a warp of 32 threads (with 2 threads idle). The calculation of polynomial coefficients is cyclic: on each cycle 6 columns of A_ID are multiplied by 6 rows of the matrix B_ID. The corresponding coefficients of A_ID are read by the warp threads into local memory. Then the threads of the first and second halves of the warp simultaneously compute the results of multiplying 3 columns by 3 rows, as shown in Fig. 1. Each thread is responsible for the calculation of only 3 of the 45 coefficients in total. Then the two intermediate parts of P_ID are summed and the results are written to global memory.

Fig. 1. Calculation of polynomial coefficients in a cell.

3.2. Calculation of fluxes

In order to improve performance and avoid write conflicts between parallel threads when summing fluxes from faces to the centers of cells, the calculation of fluxes is decomposed into three stages:

1. reconstruct values on the "left" and "right" sides of cell faces;

2. compute fluxes through faces of cells using Roe scheme;

3. sum fluxes from faces to the centers of cells.

This implementation requires additional arrays to store the intermediate values on faces and to store the topology that provides, for each cell, the list of its faces. The first and third stages are performed over the set of cells; the second stage is performed over the set of faces.
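The data flow of the three stages can be sketched on a toy 1D chain of cells; the central-average "flux" below is only a stand-in for the Roe solver, so what corresponds to the scheme described above is the face-based intermediate storage and the race-free per-cell gather:

```python
# Toy 1D mesh: 4 cells, 3 interior faces; face i divides cells i and i+1.
n_cells = 4
faces = [(i, i + 1) for i in range(n_cells - 1)]

# Topology: for each cell, the list of (face id, outward-normal sign).
cell_faces = {c: [] for c in range(n_cells)}
for fid, (left, right) in enumerate(faces):
    cell_faces[left].append((fid, +1.0))
    cell_faces[right].append((fid, -1.0))

q = [1.0, 2.0, 4.0, 8.0]                       # cell mean values

# Stage 1: reconstruct "left"/"right" face values (here 0th order: cell means).
qL = [q[l] for (l, r) in faces]
qR = [q[r] for (l, r) in faces]

# Stage 2: one independent flux per face (central average as a Roe stand-in),
# stored in an intermediate face array.
face_flux = [0.5 * (a + b) for a, b in zip(qL, qR)]

# Stage 3: race-free summation - each cell gathers only its own faces,
# so no two threads would ever write to the same cell.
cell_flux = [sum(s * face_flux[fid] for fid, s in cell_faces[c])
             for c in range(n_cells)]
print(cell_flux)  # [1.5, 1.5, 3.0, -6.0]
```

Because each face flux enters its two cells with opposite signs, the total over all cells sums to zero, i.e. the scheme is conservative.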

Let us consider in detail the implementation parameters for the NVIDIA C2050 GPU device. The first stage is carried out by a set of local work-groups of 320 threads that process 24 cells each. First, all the threads read the values of mesh functions, polynomial coefficients, etc., from global to local memory. Then only 128 threads compute the intermediate left and right values (5 threads per cell; the 16-th and 32-nd threads in each warp are idle). This configuration is forced by the very limited shared-memory resources of the device. On the second stage each thread of the global work-group processes one face, since the Roe scheme cannot be decomposed into smaller operations. On the third stage local work-groups of 512 threads process 96 cells each (5 threads per cell; the 16-th and 32-nd threads in each warp are idle).

4. Performance measurement results

The following hardware was used for tests with 8-byte double-precision numerics: 6-core CPU Intel Xeon X5670, 2.93 GHz, 70 GFLOPS, 32 GB/s, 95 W; GPU NVIDIA Fermi C2050, 515 GFLOPS, 144 GB/s, 238 W; GPU AMD Radeon HD 7970, 947 GFLOPS, 264 GB/s, 250 W. The power consumption ratio of GPU vs. CPU for these devices is around 2.5. This value has been chosen as a criterion to evaluate the resulting speedups. OpenMP shared-memory parallelization is used for computations on the multi-core CPU. The implementation for GPU uses OpenCL 1.1 [7]. A simulation of a flow around a sphere at Mach 0.1 on a mesh with 679339 cells is used as a model problem.

Table 1 reports the performance results for the calculation of polynomial coefficients. The achieved net performance and the corresponding percentage of the peak performance of the devices are indicated for a single CPU core, for the whole 6-core CPU, and for the GPUs of AMD and NVIDIA. The speedup is given relative to the time on a single CPU core. However, the main limiting factor for such an operation is the memory bandwidth rather than the peak computing performance. From this point of view, taking the peak memory bandwidth into account, the efficiency of the usage of GPU resources is evaluated as at least 46% for the NVIDIA C2050 and 79% for the AMD 7970, compared to the maximal theoretically possible performance. Similarly, performance results for the calculation of convective fluxes are shown in Table 2.

Finally, the total speedup of the GPU vs. the 6-core CPU for both operations is 3.3 and 10.4 for the NVIDIA C2050 and the AMD 7970, respectively. Therefore, the resulting speedups exceed the minimal level of 2.5 set by the power consumption ratio.

Table 1. Performance on computing polynomial reconstruction coefficients.

Device          Time, sec.   Performance, GFLOPS   Percent of peak   Speedup
1 CPU core      0.7078       3.1                   26%               1
6-core CPU      0.1450       15.2                  21%               4.88
NVIDIA C2050    0.045        50                    9.7%              15.6
AMD 7970        0.014        160                   17%               50.6

Table 2. Performance on computing convective fluxes.

Device          Time, sec.   Performance, GFLOPS   Percent of peak   Speedup
1 CPU core      0.261        2.5                   21%               1
6-core CPU      0.059        11                    16%               4.42
NVIDIA C2050    0.0168       38                    7.4%              15.5
AMD 7970        0.0056       115                   12%               46.6

Comparing the performance of the AMD 7970 and NVIDIA C2050 GPUs, we can conclude that the AMD accelerator outperforms the NVIDIA one by around 3.1 times on average. However, the ratio of peak performance is only 1.8. This shows that the AMD accelerator is used more efficiently, even though the code has been particularly optimized for the NVIDIA GPU. This can perhaps be explained by some drawbacks of the NVIDIA architecture in general and an insufficiency of register memory in particular.

5. Conclusion

An OpenCL implementation of the basic operations of a high-order parallel finite-volume CFD algorithm on unstructured meshes for modeling compressible flows on hybrid supercomputers has been presented. The computations were ported and optimized for different architectures of massively-parallel accelerators, including GPUs of AMD and NVIDIA. The resulting performance appeared to be near the theoretical limits for such memory-bound operations with an irregular memory access pattern. The resulting speedups achieved on the GPUs compared to the 6-core CPU notably exceed the ratio of the power consumption of the devices.

Acknowledgements

The K-100 hybrid supercomputer, Moscow, Russia, has been used for the tests. The work has been financially supported by RFBR, Russia, project No. 12-01-33022. The authors thankfully acknowledge these institutions.

References

[1] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, V. Volkov, Parallel computing experiences with CUDA, IEEE Micro 28 (4) (2008) 13-27.

[2] C. Obrecht, F. Kuznik, B. Tourancheau, J.-J. Roux, The TheLMA project: A thermal lattice Boltzmann solver for the GPU, Computers & Fluids 54 (2012) 118-126.

[3] P.R. Rinaldi, E.A. Dari, M.J. Venere, A. Clausse, A Lattice-Boltzmann solver for 3D fluid simulation on GPU, Simulation Modelling Practice and Theory 25 (2012) 163-171.

[4] P. Zaspel, M. Griebel, Solving incompressible two-phase flows on multi-GPU clusters, Computers & Fluids (2012), in press, doi:10.1016/j.compfluid.2012.01.021.

[5] T. Barth, Numerical methods for conservation laws on structured and unstructured meshes, VKI Lecture Series 2003-03, von Karman Institute for Fluid Dynamics, 2003.

[6] P.L. Roe, Approximate Riemann solvers, parameter vectors, and difference schemes, J. Comput. Phys. 43 (2) (1981) 357-372.

[7] Khronos OpenCL Working Group, The OpenCL Specification, Version: 1.1, 2010, http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf