

gpuSPHASE—A shared memory caching implementation for 2D SPH using CUDA

Daniel Winkler *, Michael Meister, Massoud Rezavand, Wolfgang Rauch

Unit of Environmental Engineering, University of Innsbruck, Technikerstrasse 13, 6020, Innsbruck, Austria

article info

abstract

Article history: Received 12 November 2015 Received in revised form 8 November 2016 Accepted 29 November 2016 Available online xxxx

Keywords: Computational fluid dynamics, CUDA, GPGPU, GPU, Smoothed particle hydrodynamics

Smoothed particle hydrodynamics (SPH) is a meshless Lagrangian method that has been successfully applied to computational fluid dynamics (CFD), solid mechanics and many other multi-physics problems. Using the method to solve transport phenomena in process engineering requires the simulation of several days to weeks of physical time. Given the high computational demand of CFD, such simulations in 3D would require years of computation time, so that a reduction to a 2D domain is inevitable. In this paper gpuSPHASE, a new open-source 2D SPH solver implementation for graphics devices, is developed. It is optimized for simulations that must be executed with thousands of frames per second to be computed in reasonable time. A novel caching algorithm for Compute Unified Device Architecture (CUDA) shared memory is proposed and implemented. The software is validated and the performance is evaluated for the well-established dambreak test case.

Program summary

Program title: gpuSPHASE Catalogue identifier: AFBO_v1_0

Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AFBO_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU GPLv3

No. of lines in distributed program, including test data, etc.: 128288

No. of bytes in distributed program, including test data, etc.: 1350326

Distribution format: tar.gz

Programming language: C++, CUDA.

Computer: Nvidia CUDA capable devices.

Operating system: Linux, Windows, Mac OS.

Classification: 5, 12.

External routines: Qt5 Core, HDF5, H5Part

Nature of problem:

Free surface fluid dynamics simulations of long running physical phenomena that must be calculated in the order of real-time.

✩ This paper and its associated computer program are available via the Computer Physics Communication homepage on ScienceDirect (http://www.sciencedirect.com/science/journal/00104655).

* Corresponding author.

E-mail address: danielwinkler@uibk.ac.at (D. Winkler).

http://dx.doi.org/10.1016/j.cpc.2016.11.011

0010-4655/© 2016 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Solution method:

gpuSPHASE is a 2D SPH solver for CUDA capable devices that is optimized for the computation of real-time simulations.

Running time: Depending on the simulated problem the running time varies from seconds to weeks.

© 2016 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Smoothed particle hydrodynamics (SPH) is a meshless particle based method that was initially introduced independently by Lucy [1] and Gingold and Monaghan [2] for solving astrophysical problems. The general concept is the discretization of a continuous medium into interpolation points known as particles to transform partial differential equations into an integral form. Using integral interpolation theory the particles' properties are evolved in time based on the properties of neighboring particles [3]. Following this concept, the main advantages over mesh based approaches arise from the inherent Lagrangian nature, because the method follows the particles as they move through space and time. This allows the accurate simulation of free surface evolution without topological restrictions, multi-phase interfaces in highly dynamic flows, fluid-structure interactions and multi-phase phenomena. Given the numerous advantages over grid based methods, SPH has been applied to a wide range of application domains, such as magnetohydrodynamics [4], solid mechanics [5-8], computational fluid dynamics [9,3] and many other multi-physics problems [10-12].

A well known property of simulating transient flows in engineering problems is the high computational demand that arises in solving the Navier–Stokes equations, which applies to Eulerian as well as to Lagrangian methods. The computational effort can be reduced in SPH by limiting the number of neighbors used for interpolation, but the cost still requires high performance computing (HPC) for practical applications. In recent years general purpose computation on graphics processing units (GPGPU) emerged as a technique to use the high floating point arithmetic performance of graphics devices for general purpose calculations. This paradigm makes it possible to execute simulations on GPU computing workstations that previously required HPC clusters. For this reason many SPH solvers have been developed that harness the computational power of GPUs. Two established open-source GPU implementations are GPUSPH [13] and DualSPHysics [14]. Both implementations originated from the serial CPU based SPHysics code [15] that was initially released in 2007. The main difference between the two implementations is that GPUSPH supports 3D simulations only and reuses a neighbor list for each particle for several iterations. It is shown in [16] that this approach results in a significantly lower maximum number of particles and yields no performance improvement over DualSPHysics for several test cases. Since both implementations require Nvidia GPUs, AQUAgpusph [16] has been implemented using the open standard OpenCL so that accelerator cards from vendors such as Nvidia, AMD, Intel and IBM can be used. While its maximum performance is also lower than that of DualSPHysics, the versatility in hardware allows for the best performance to cost ratio. It has to be noted that DualSPHysics additionally provides an optimized CPU version that does not require an accelerator card and can thus be executed on any desktop or workstation computer. That implementation uses OpenMP to achieve high performance on many-core shared memory architectures.

A major challenge when simulating transport phenomena prevalent in biological and chemical engineering is the difference in the temporal scales. To generate meaningful results on the effects of hydraulics on environmental processes, several weeks of physical time need to be simulated [17,18]. In contrast, simulating fluids with weakly compressible SPH at a coarse resolution requires time steps in the order of 10⁻⁴ s, such that 10⁸–10¹⁰ iterations are needed. Every such iteration involves calculating the force exerted on each particle within the simulation domain. Performing these simulations in 3D requires decades of computation time on high end graphics devices [14]. Employing the multi-GPU capabilities of solvers like DualSPHysics [19] and GPUSPH [20] reduces the computational time, but the time necessary for synchronization of problems that require billions of iterations still prevents computation in feasible time. The compromise to cope with such simulations is to reduce the complexity of the simulation by constraining it to two dimensions and reducing the spatial resolution [17]. From this restriction arise possibilities for algorithmic optimizations in the implementation.

This paper describes gpuSPHASE, a GPU accelerated general purpose SPH solver created for solving two dimensional hydrodynamic problems. The implementation reduces the computational effort by implementing well established techniques like space filling curves and a neighborhood list. Generic methods to optimize the execution pipeline and data layout are described and evaluated to improve the performance of gpuSPHASE. A novel caching algorithm to accelerate the calculation of unstructured neighborhood problems using the CUDA shared memory cache is presented and analyzed. Validation of the implementation is performed and evaluated for different established test cases. The simulation performance for the well established dambreak test case is compared to the high performance SPH solver DualSPHysics. It is shown that the caching algorithm produces a very good speedup compared to a hardware caching approach and that the overall performance of the solver is very good.

2. Numerical method

SPH was introduced independently by Lucy [1] and Gingold and Monaghan [2] as a method for solving astrophysical problems. Since then the method has been applied to other fields like solid mechanics, magnetohydrodynamics and fluid mechanics, where reliable results were achieved. Key advantages of the method are the straightforward handling of multi-phase flows and the natural computation of the free surface [3].

2.1. Kernel function

As the name smoothed particle hydrodynamics indicates the method is based on particles. These particles are interpolation points that represent the continuum and as such move with the simulated fluid. The force calculation that imposes the advection of the particles is based on the properties of neighboring particles. By using a kernel function W the influence of the neighbor properties is weighted depending on the distance r. The kernel has to fulfill


the property

$$\int W(\mathbf{r}, h)\, \mathrm{d}\mathbf{r} = 1 \qquad (1)$$

but due to computational properties it is preferable to add another constraint on the support of the kernel rc = kh so that

$$W(|\mathbf{r}| > r_c, h) = 0. \qquad (2)$$

The combination of a suitable kernel function with a limited support ensures that the number of neighbors incorporated in the calculation is low enough to make correct but computationally feasible calculations [21].
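As an illustration of a kernel with compact support, the sketch below implements a 2D Wendland C2 (quintic) kernel with r_c = 2h, the kernel later used for the performance evaluation in Section 5. The function name and the explicit normalization constant are ours and not taken from the gpuSPHASE source.

// Illustrative 2D Wendland C2 (quintic) kernel with compact support r_c = 2h.
// The normalization 7/(4*pi*h^2) is the standard 2D value; the name is ours.
__host__ __device__ inline float wendland_c2_2d(float r, float h)
{
    const float q = r / h;
    if (q >= 2.0f)                      // outside the compact support: W = 0
        return 0.0f;
    const float alpha = 7.0f / (4.0f * 3.14159265358979f * h * h);
    const float t = 1.0f - 0.5f * q;
    return alpha * t * t * t * t * (2.0f * q + 1.0f);
}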

2.2. Continuity equation

The first governing equation for fluid dynamics is the continuity equation, which relates the change of density to the advection of the continuum.

$$\frac{\mathrm{d}\rho}{\mathrm{d}t} = -\rho\, \nabla \cdot \mathbf{v}. \qquad (3)$$

The variables t, p and v represent the time, density and velocity, respectively. When applied to SPH the equation can be discretized for a reference particle i using the kernel function W

$$\frac{\mathrm{d}\rho_i}{\mathrm{d}t} = \sum_j m_j\, \mathbf{v}_{ij} \cdot \nabla_i W_{ij} \qquad (4)$$

where $m_j$ denotes the mass, $\mathbf{v}_{ij} = \mathbf{v}_i - \mathbf{v}_j$ expresses the relative velocity and $\nabla_i W_{ij} = \nabla_i W(\mathbf{r}_i - \mathbf{r}_j, h)$ the gradient of the kernel weight function for the neighbor particles $j$ [3].
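To make the discretization concrete, the following sketch accumulates the density rate of Eq. (4) for one particle from a precomputed neighbor list; the array layout and the kernel gradient helper gradW are hypothetical names, not the gpuSPHASE API.

// Illustrative accumulation of Eq. (4) for particle i; gradW is a hypothetical helper.
__device__ float2 gradW(float2 rij, float h);   // returns nabla_i W_ij (assumed elsewhere)

__device__ float continuity_rate(int i, const float2* r, const float2* v, const float* m,
                                 const int* nbr, int n_nbr, float h)
{
    float drho = 0.0f;
    for (int k = 0; k < n_nbr; ++k) {
        const int j = nbr[k];
        const float2 rij = make_float2(r[i].x - r[j].x, r[i].y - r[j].y);
        const float2 vij = make_float2(v[i].x - v[j].x, v[i].y - v[j].y);
        const float2 dW  = gradW(rij, h);
        drho += m[j] * (vij.x * dW.x + vij.y * dW.y);   // m_j (v_ij . grad_i W_ij)
    }
    return drho;
}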

2.3. Momentum equation

The second governing equation is the momentum equation

$$\rho\, \frac{\mathrm{d}\mathbf{v}}{\mathrm{d}t} = -\nabla p + \mathbf{F}^{(v)} + \rho\, \mathbf{g} \qquad (5)$$

with $p$ denoting the pressure, $\mathbf{F}^{(v)}$ the viscous force and $\mathbf{g}$ the body force. gpuSPHASE uses the weakly compressible formulation of SPH (WCSPH) and we refer to [3] for further details on the equations. The momentum equation, including artificial viscosity and the acceleration caused by shear forces, is discretized to [22]

$$\begin{aligned}
\frac{\mathrm{d}\mathbf{v}_i}{\mathrm{d}t} = {} & -\frac{1}{m_i}\sum_j \left(V_i^2 + V_j^2\right) \frac{\rho_j p_i + \rho_i p_j}{\rho_i + \rho_j}\, \nabla_i W_{ij} && \text{(momentum)} \\
& - \sum_j m_j\, \alpha\, \bar{h}_{ij}\, \bar{c}_{ij}\, \frac{\mathbf{v}_{ij} \cdot \mathbf{r}_{ij}}{\bar{\rho}_{ij}\left(|\mathbf{r}_{ij}|^2 + \epsilon\, \bar{h}_{ij}^2\right)}\, \nabla_i W_{ij} && \text{(viscosity)} \\
& + \frac{1}{m_i}\sum_j \frac{2\, \eta_i\, \eta_j}{\eta_i + \eta_j} \left(V_i^2 + V_j^2\right) \frac{\mathbf{v}_{ij}}{r_{ij}}\, \frac{\partial W}{\partial r_{ij}} && \text{(shear forces)}
\end{aligned} \qquad (6)$$

where $\mathbf{r}_{ij} = \mathbf{r}_i - \mathbf{r}_j$, $r_{ij} = |\mathbf{r}_{ij}|$ and $\partial W / \partial r_{ij} = \nabla_i W_{ij} \cdot \mathbf{e}_{ij}$. The coefficients $\bar{h}_{ij}$, $\bar{c}_{ij}$ and $\bar{\rho}_{ij}$ are the smoothing length, speed of sound and density averaged between two particles. Variables $V_i$ and $\eta_i$ represent the volume and dynamic viscosity of particle $i$, the quantifier $\epsilon = 0.01$ is included to ensure a non-zero denominator.

2.4. Time stepping

The time integration uses the velocity-Verlet scheme [23]

$$\mathbf{v}_i^{\,n+\frac{1}{2}} = \mathbf{v}_i^{\,n} + \frac{\Delta t}{2}\left(\frac{\mathrm{d}\mathbf{v}_i}{\mathrm{d}t}\right)^{\!n} \qquad (7)$$

$$\mathbf{r}_i^{\,n+\frac{1}{2}} = \mathbf{r}_i^{\,n} + \frac{\Delta t}{2}\, \mathbf{v}_i^{\,n+\frac{1}{2}} \qquad (8)$$

$$\rho_i^{\,n+1} = \rho_i^{\,n} + \Delta t \left(\frac{\mathrm{d}\rho_i}{\mathrm{d}t}\right)^{\!n+\frac{1}{2}} \qquad (9)$$

$$\mathbf{r}_i^{\,n+1} = \mathbf{r}_i^{\,n+\frac{1}{2}} + \frac{\Delta t}{2}\, \mathbf{v}_i^{\,n+\frac{1}{2}} \qquad (10)$$

$$\mathbf{v}_i^{\,n+1} = \mathbf{v}_i^{\,n+\frac{1}{2}} + \frac{\Delta t}{2}\left(\frac{\mathrm{d}\mathbf{v}_i}{\mathrm{d}t}\right)^{\!n+1} \qquad (11)$$

which is a second order accurate symplectic mid-step scheme. The continuity equation (9) and the force computation (11) are the computationally most expensive parts and need only be performed once per step. The time step is limited by several conditions, where the CFL condition [3] is usually the most restrictive

$$\Delta t \leq \frac{1}{4}\, \frac{h}{c_{\max} + |\mathbf{v}_{\max}|}. \qquad (12)$$

Additionally the viscous condition

$$\Delta t \leq \frac{1}{8}\, \frac{h^2}{\nu} \qquad (13)$$

and the body force condition

$$\Delta t \leq \frac{1}{4}\, \sqrt{\frac{h}{|\mathbf{g}|}} \qquad (14)$$

need to be fulfilled.
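A small sketch of how the three constraints could be combined into a single time step follows; the 1/4 and 1/8 prefactors of Eqs. (12)-(14) follow common WCSPH practice and are assumptions here, not values verified against the gpuSPHASE source.

#include <algorithm>
#include <cmath>

// Illustrative time step selection from the CFL, viscous and body force conditions.
// The prefactors are assumed; nu is the kinematic viscosity, g_mag = |g|.
float compute_dt(float h, float c_max, float v_max, float nu, float g_mag)
{
    const float dt_cfl  = 0.25f  * h / (c_max + v_max);    // Eq. (12)
    const float dt_visc = 0.125f * h * h / nu;             // Eq. (13)
    const float dt_body = 0.25f  * std::sqrt(h / g_mag);   // Eq. (14)
    return std::min(dt_cfl, std::min(dt_visc, dt_body));
}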

2.5. Solid boundary conditions

Solid boundary conditions based on boundary particles can handle arbitrary shapes and geometries in two and three dimensions. The particles are statically positioned and evolve the pressure based on the surrounding fluid particles. This pressure is included in the force computation of the fluid particles to enforce the impermeability condition of the wall. The discretized calculation of the pressure of a wall particle w is defined by the neighboring fluid particles f [22]

$$p_w = \frac{\sum_f p_f\, W_{wf} + (\mathbf{g} - \mathbf{a}_w) \cdot \sum_f \rho_f\, \mathbf{r}_{wf}\, W_{wf}}{\sum_f W_{wf}} \qquad (15)$$

A computational advantage of this approach is its straightforward implementation, as it solely adds another calculation step following the particle principle. From the implementation point of view the existing data structures are reused and Eq. (15) is applied such that the parallel nature of the algorithm is retained.
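The following sketch illustrates how Eq. (15) can be evaluated for a single wall particle from its fluid neighbors; the kernel helper W(), the array layout and all names are illustrative assumptions, not the gpuSPHASE implementation.

// Illustrative wall pressure extrapolation (Eq. (15)) for wall particle w.
__device__ float W(float r, float h);   // kernel weight (assumed elsewhere)

__device__ float wall_pressure(float2 r_w, float2 a_w, float2 g, const float2* r,
                               const float* p, const float* rho,
                               const int* fluid_nbr, int n_nbr, float h)
{
    float num = 0.0f, den = 0.0f;
    for (int k = 0; k < n_nbr; ++k) {
        const int f = fluid_nbr[k];
        const float2 r_wf = make_float2(r_w.x - r[f].x, r_w.y - r[f].y);
        const float  w_wf = W(sqrtf(r_wf.x * r_wf.x + r_wf.y * r_wf.y), h);
        num += p[f] * w_wf
             + rho[f] * ((g.x - a_w.x) * r_wf.x + (g.y - a_w.y) * r_wf.y) * w_wf;
        den += w_wf;
    }
    return (den > 0.0f) ? num / den : 0.0f;   // guard for wall particles without fluid neighbors
}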

3. Implementation

The main goal in the gpuSPHASE implementation is to have a fast and versatile two dimensional SPH solver that can be reused for different fields of application in engineering. The solver is designed as a standalone component that can be executed on any CUDA capable platform. It is optimized for long running simulations but can be applied to solve any problem with 2D WCSPH.

Although there are already a number of language bindings available for CUDA, we use C++ as it is the most common one. C++ has been used in HPC for decades as it provides a good compromise between abstraction and flexibility. The open-source Qt5 framework facilitates platform independent abstractions for data input and output. The program uses a custom input format encoded in JSON, which is easy to read and edit in a standard text editor. To store particle configurations to disk the H5Part format and library are used. These interfaces allow users to easily configure the scene that has to be simulated and use the results in other software applications.

3.1. CUDA function execution pipeline

Since the implementation of this new solver aims at performance improvements this section elaborates on the implementation of the CUDA kernels. Due to the name conflict of the SPH kernel with CUDA kernel we will use the term CUDA function to refer to a CUDA kernel from here on.

Fig. 1 shows the dependency graph for the CUDA functions used in gpuSPHASE. It visualizes data nodes with ellipses, CUDA functions with rectangles and dependencies with arrows. The light green coloring depicts CUDA functions with less strict execution dependencies, which means that those functions may be executed in parallel to the main pipeline. Some data nodes are duplicated to avoid arrows cluttering the graph but equal names refer to the same memory locations. The suffixes _L and _R denote two different memory arrays for the same data. E.g. the CUDA function accelerate needs the velocity and force data v_L and f_L as input and writes the new velocity data to v_R as output.

Some of the functions strictly depend on the finished execution of the previous function, thus it is not possible to merge them into a single function. This restriction needs to be satisfied because of the data dependency to neighbor data, i.e. in order to calculate the continuity equation for a particle i, the position of all neighboring particles j have to be updated (cf. Section 2.4). That requirement can only be fulfilled by a global synchronization, which boils down to the separation of the computation into several CUDA functions.

A major drawback of this splitting is that all data that has already been transferred to the GPU is discarded and has to be transferred again for the next CUDA function using the same data. Caching could help in this case, but since the amount of data necessary for most simulations is much bigger than the cache size it is unlikely that there is a significant influence. Discarding all data requires calculating the particle distances and kernel weights multiple times per iteration, but due to memory and bandwidth restrictions this is favored over storing the data explicitly.

Because of the data dependency, functions must store the new data values to a different location. gpuSPHASE solves this problem with double buffers that are swapped as soon as all new data is written. The use of two separate buffers decouples functions that are not data dependent such that they can be executed in parallel. E.g. particle positions may already be advanced to r_i^{n+1} in parallel to the continuity equation that uses r_i^{n+1/2}. Exception to this restriction is the extrapolation function since it writes boundary particle data and reads fluid particle data. This is marked with the asterisk symbol (*) in the data nodes, which indicates that it is still the same array but some data has been updated. Reusing the same data location avoids bare copying of fluid particle data, which is beneficial for the performance of the function. Although some degree of execution parallelism is possible, performance is worse when leveraging it with the usage of CUDA streams for rather small simulations. This is caused by the synchronization overhead combined with the high number of function executions per second.
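A minimal sketch of the double-buffer idea under these assumptions: every quantity exists twice on the device, functions read the _L arrays and write the _R arrays, and the host swaps the pointers once a stage has completed. The struct and field names are ours.

#include <utility>

// Illustrative double buffering of per-particle device arrays (_L is read, _R is written).
struct ParticleBuffers {
    float2 *r_L, *r_R;   // positions
    float2 *v_L, *v_R;   // velocities
    float2 *f_L, *f_R;   // forces
};

// Called on the host after all functions of a stage have written their output,
// so that the next stage reads the freshly computed data.
void swap_buffers(ParticleBuffers& b)
{
    std::swap(b.r_L, b.r_R);
    std::swap(b.v_L, b.v_R);
    std::swap(b.f_L, b.f_R);
}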

A promising optimization approach is the combination of several distinct data arrays into a single one [24]. CUDA provides vector types that combine up to four single precision floating point values into a single one. Since x and y positions are mainly used in combination, it is obvious to combine the two elements into a single value of type float2. This allows the CUDA run-time to load both values at once instead of retrieving them from two different memory locations. Less obvious but similar is the combination of particle properties into a vector type, as those properties depend on one another and often all of them are required as input and output to a CUDA function.

Fig. 2 shows an optimized workflow where data arrays have been vectorized such that fewer data nodes are necessary. Apart from the force f and the auxiliary c_max and v_max arrays, the data has been combined into float4 data arrays (a minimal sketch of this layout is given after the list below):

• motion consisting of position r and velocity v

• traits consisting of pressure p, density rho, volume vol and speed of sound c.
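A sketch of what the vectorized layout might look like in an update function; the packing into motion and traits follows the list above, while the kernel name, signature and the component ordering inside the float4 are assumptions.

// Illustrative use of the vectorized layout: motion = (r.x, r.y, v.x, v.y).
// One float4 load/store replaces several scattered accesses; names are ours.
__global__ void accelerate(const float4* __restrict__ motion_in,
                           const float2* __restrict__ force,
                           float4* __restrict__ motion_out,
                           float dt, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 m = motion_in[i];   // one coalesced load retrieves position and velocity together
    m.z += dt * force[i].x;    // update v.x
    m.w += dt * force[i].y;    // update v.y
    motion_out[i] = m;         // positions are written back unchanged (cost of the packing)
}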

Analyzing the dependencies enforced by the time stepping scheme described in Section 2.4, steps (7), (8) and (10) are combined into a single function. Using r_i^{n+1/2} and v_i^{n+1/2} it is possible to calculate the continuity equation, while the force calculation in step (11) requires the positions r_i^{n+1}. Thus the input data array is used to store r_i^{n+1} and v_i^{n+1/2}, which is the data required for calculating the extrapolation and momentum functions. Furthermore the acceleration (11) is calculated in the momentum function, which removes another CUDA function call.

Apart from the reduced overhead of launching CUDA functions and removing global synchronization, increasing the tasks that are performed in every function reduces the amount of data that has to be transferred. While the momentum equation in the non-optimized workflow stores the calculated force to memory and the accelerate function has to load it, merging the two functions no longer requires the data to be loaded from global memory. The implementation of vector types makes this optimization even more important.

3.2. Shared memory caching algorithm

While modern GPUs provide peak computing performance of several trillion floating point operations per second (TFLOPS) the efficiency of real world applications is often much lower. For many applications the memory bandwidth and latency are a bottleneck, which can be improved with intelligent design and performance profiling. For details we refer to the CUDA C Programming Guide [24].

The main problem for memory bound applications is that the CUDA global memory is not fast enough to provide data to the compute units. CUDA tries to compensate for this deficiency by executing several threads per processor in parallel. Based on hardware limitations and the complexity of the CUDA function the number of threads per processor is limited, which is referred to as occupancy [24].

Another important aspect of memory is the order in which it is accessed. Ideally all data is loaded in a coalesced fashion, which means that every consecutive thread loads data from a consecutive memory address. While this seems easy for methods that access ordered data, there is no obvious way to do so in an unordered scheme like SPH. No efficient data structure exists to arrange the neighbor data in memory so that it can be read coalesced unless all neighbor data is redundantly stored for every particle.

gpuSPHASE handles this issue with a novel algorithm that uses a two step loading mechanism. In a first step the relevant particle data is loaded to a very fast programmable cache in a coalesced fashion and after that the data is accessed in an irregular way, which has a much lower implication on the performance. Due to the unstructured nature of the SPH method, it is not possible to efficiently load all necessary data to the programmable cache. The following section outlines an approach that combines several techniques to optimize the data access.


Fig. 1. CUDA function pipeline. Rectangles, circles and arrows denote CUDA functions, data arrays and dependencies, respectively. The bright green color indicates that the functions can be executed in parallel to the main pipeline. The asterisk marks data arrays that are read and written by a function. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

3.2.1. Data layout using space filling curves

Most existing implementations mention neighbor search as a key component of the SPH method implementation [19,16]. Referring to those works the most efficient implementation for this problem is to partition the particles into a unigrid [25]. This means that the entire domain is split up into equally sized cells and the particles are classified by the unique cell identifier they geometrically belong to. By choosing the cell dimension equal to - or greater than - the kernel support r_c the grid gains the property that all neighbors j of a particle i are contained in the adjacent cells. This property holds for simulations in any geometric dimension d, resulting in 3^d cells to check for neighbor particles.

In order to find the particles for a given cell in memory, the particle data needs to be sorted according to the cell identifier. After sorting a cell linked list is generated, which points to the beginning and end of every cell's particles [25]. While the amount of particles to check is reduced drastically by this approach, those particles typically are spread across memory. This is an evident result as a higher dimensional problem is mapped to one dimensional memory. Space filling curves (SFC) are functions with the property to traverse higher dimensional space in a continuous fashion [26]. The Morton order [27] or Hilbert curve [28] are SFCs with good data locality characteristics, which means that for any two cells c1 and c2 that are geometrically close the function produces one dimensional indices H(x_i, y_i) = n_i that are ordinally close.


Fig. 2. Optimized CUDA function pipeline. Rectangles, circles and arrows denote CUDA functions, data arrays and dependencies, respectively. The bright green color indicates that the functions can be executed in parallel to the main pipeline. The asterisk marks data arrays that are read and written by a function. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Ordering the particle data according to the mapped cell index compensates the computational overhead with better data locality. Data locality has positive effects on hardware caching, so ideally the overall performance of this approach is better than the straightforward mapping to memory.

Fig. 3 shows a Hilbert space filling curve where every line is a connection of two points that have to be interpreted as grid cells containing particle data. Two particles have been chosen to visualize the concept of data locality, the center points mark the particle abstraction into a grid cell, the surrounding circles visualize the relevant neighborhood. Neglecting the curve coloring and type, one can see that the two dimensional neighborhood is clustered into few coherent memory regions, which is beneficial for batch loading and caching performed by hardware caches.
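As an illustration of a space filling curve with good locality properties, a 2D Morton (Z-order) index can be computed by interleaving the bits of the two cell coordinates; this is a generic example of the SFC concept and not necessarily the curve used in gpuSPHASE, which may rely on the Hilbert curve instead.

// Illustrative 2D Morton (Z-order) index: interleave the bits of the cell coordinates
// so that geometrically close cells tend to receive close one dimensional indices.
__host__ __device__ inline unsigned int part1by1(unsigned int x)
{
    x &= 0x0000ffffu;
    x = (x | (x << 8)) & 0x00ff00ffu;
    x = (x | (x << 4)) & 0x0f0f0f0fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

__host__ __device__ inline unsigned int morton2d(unsigned int cell_x, unsigned int cell_y)
{
    return part1by1(cell_x) | (part1by1(cell_y) << 1);
}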

3.2.2. CUDA shared memory

Interpreting the solid blue line in Fig. 3 as the amount of data that fits into the software managed cache visualizes the advantage of the approach. The light gray particle can be calculated with all necessary data already cached and as such completely neglecting memory latencies. While the black dashed circle does not exhibit this property, the ratio of particles that can be calculated from cached data may still be high. This section describes the memory and caching implementation of gpuSPHASE that provides the memory advantages explained above.

Computer memory design is always a trade-off between size and speed. While hard disks provide vast amounts of storage the speed is rather slow, caches and registers are several orders of magnitude faster but for economic reasons limited in size. For this reason, modern computer hardware design relies on a hierarchy of different storage units to compensate for this problem. A typical computer consists of a hard disk, random access memory (RAM) and several levels of on-chip caches. This hierarchy is abstracted by the hardware and operating system so that a standard application does not need to take care where the data is stored.

HPC applications are typically designed and optimized to leverage the most performance of the described hierarchy. This means that the program needs to be profiled, a process where dedicated tools are used to evaluate how the data is cached by the hardware. Since the hardware caching cannot be controlled, the program must be reverse engineered to make the caching perform as expected. The CUDA execution model also employs this hierarchy on graphics devices but additionally provides the possibility to manually manage a very fast but small cache. By enabling developers to manage the data cache CUDA provides a mechanism to use domain knowledge for the optimization of the program execution. The downside is that programs that do not use shared memory lose a portion of fast hardware cache.

Particle computations mostly depend on data of neighboring particles, which means that efficient caching is important for maximum compute performance. The unstructured nature of SPH offers no obvious implementation of shared memory (SM) caching, even in combination with the memory layout optimization discussed in Section 3.2.1. Motivated by the property that SFCs allow for data locality, we propose an algorithm that makes extensive use of the SM concept. The idea is to organize the data in global memory sorted by a SFC. Size limitations of the shared memory require partitioning the data into smaller sets; given the properties of SFCs these chunks are selected to be as big as possible.


Fig. 3. Hilbert space filling curve. Every line is a connection of two points that have to be interpreted as grid cells, the solid blue line denotes the amount of data that fits into CUDA shared memory. Particles are visualized with bold points, the kernel support is indicated by the surrounding circle. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

The CUDA architecture abstracts the execution of millions of threads. These threads are grouped into blocks that are distributed to the Streaming Multiprocessors (SMP) where they are executed isolated from other blocks. Within one block all threads can be synchronized and access CUDA shared memory. The size of the blocks must be selected depending on hardware restrictions and performance characteristics, and a CUDA grid is used to organize and index the individual blocks. At this abstraction level the block size for gpuSPHASE functions is maximized with the requirement that at least two blocks must fit on an SMP. For the most recent CUDA architectures Kepler and Maxwell the resulting block sizes for the momentum function are 768 and 1024 threads, respectively. The limiting factors are the size of shared memory and the maximum number of concurrent threads per SMP.

Algorithm 1: Shared memory caching algorithm
Input: Thread index
Input: Particle data pointer
Input: Neighbor list
Output: Interaction value

 1  Compute assigned particle index;
 2  Copy particle data to shared memory;
 3  Synchronize thread block;
 4  while neighbors in list do
 5      Choose neighbor from list;
 6      if neighbor in thread block then
 7          Initialize neighbor pointer to shared memory;
 8      else
 9          Initialize neighbor pointer to global memory;
10      end
11      Calculate distance;
12      if distance <= cutoff then
13          Calculate local interaction;
14          Update interaction value;
15      end
16      Remove neighbor from list;
17  end

Algorithm 1 describes a novel approach that fosters the SPH computation with the low memory latency of shared memory. The algorithm outlines pseudo code in a generic way so it can be used as a skeleton for CUDA functions where every particle is assigned to a dedicated thread that computes its interaction value. Each thread copies the particle data into the shared memory and waits for all other threads in the same block to do so. The code computing the interaction value is not different from a naive implementation that reads all data from global memory (lines 11-15). The only prerequisite for this property is the pointer arithmetic on lines 6-10, which allows each thread to use manually cached data if available. This optimization is possible starting with the Nvidia Fermi architecture, where uniform addressing of shared and global memory space makes it possible to decide at run-time which memory type is requested. Based on this feature gpuSPHASE limits branch divergence to the computation of memory addresses, but executes the physical calculations and data retrieval from different memory spaces in parallel. The result is that branch divergence is reduced to a minimum and the hardware hides the latency from global memory with occupancy.
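A condensed CUDA sketch of Algorithm 1 is given below. The fixed-width neighbor list, the array layout and the placeholder interaction are assumptions for illustration; only the structure (coalesced copy to shared memory, block-wide synchronization, per-neighbor pointer selection between shared and global memory) follows the algorithm.

// Illustrative skeleton of Algorithm 1; launch with dynamic shared memory of
// blockDim.x * sizeof(float4) bytes. Names and the interaction are placeholders.
__global__ void interaction_kernel(const float4* __restrict__ motion,   // (r.x, r.y, v.x, v.y)
                                   const int* __restrict__ nbr_list,    // fixed-width neighbor list
                                   const int* __restrict__ nbr_count,
                                   float2* __restrict__ result,
                                   int max_nbr, int n, float r_cut)
{
    extern __shared__ float4 s_motion[];                  // one cached entry per thread
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        s_motion[threadIdx.x] = motion[i];                // coalesced copy to shared memory
    __syncthreads();                                      // whole block is cached now
    if (i >= n) return;

    const int block_lo = blockIdx.x * blockDim.x;         // particle range resident in this block
    const int block_hi = block_lo + blockDim.x;
    const float4 mi = s_motion[threadIdx.x];
    float2 acc = make_float2(0.0f, 0.0f);

    for (int k = 0; k < nbr_count[i]; ++k) {
        const int j = nbr_list[i * max_nbr + k];
        // Uniform addressing: the same pointer may reference shared or global memory.
        const float4* mj = (j >= block_lo && j < block_hi)
                               ? &s_motion[j - block_lo]  // neighbor is already cached
                               : &motion[j];              // fall back to global memory
        const float dx = mi.x - mj->x;
        const float dy = mi.y - mj->y;
        if (dx * dx + dy * dy <= r_cut * r_cut) {
            acc.x += dx;                                   // placeholder for the physical
            acc.y += dy;                                   // interaction (e.g. Eq. (6))
        }
    }
    result[i] = acc;
}

Note that only the address computation diverges between the two branches; the subsequent distance test and interaction are identical for cached and uncached neighbors, which is the limited branch divergence described above.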

3.3. Neighbor list

To satisfy the requirement of maximum simulation speed for long-term engineering applications gpuSPHASE is designed with computational performance in mind. In cases where memory consumption has to be weighed against computational efficiency we opt for higher memory usage, which results in a considerably lower number of particles that can be simulated [16]. This has no negative impacts on the simulations described in Section 1, since the number of particles that fit into memory is still much higher than the number that is computable in reasonable time.

gpuSPHASE implements the Verlet list (VL) concept [23], which is an a priori generated list of neighbors for each particle. Without this list a particle has to be tested against every particle in all adjacent cells, which can be implemented efficiently with a cell linked list (CLL) [25]. The benefit of the VL is that particles are tested for neighborhood only once and the resulting list can be reused for several iterations. Dominguez et al. [25] evaluate the performance of various algorithms for VL and CLL. They show that the implementation of a VL is in the best case minimally faster than the CLL, but requires much more memory. This evaluation has been performed on a CPU implementation and the authors conclude that new research and methods have to be developed for GPU implementations. Due to the linear iteration of neighbor lists, the implementation of gpuSPHASE follows the assumption that VL fits better to the architecture constraints of GPUs than the nested loops required in CLL implementations. However, as stated above, a comparison of VL and CLL for 2D and 3D SPH on GPUs needs to be performed to give a definite answer as in [25] for CPUs.

For the computation of particle interactions the updated distances must be computed in each iteration, thus the VL stores only the neighbor particle indices. Since particles move in every step the position updates over several iterations need to be accounted for when generating the list. The number of iterations that the list is reused is in direct relationship with the radius that determines neighboring particles that can possibly interact. Therefore the number of iterations cannot be chosen too large, as the memory consumption would be too high and the particle list would contain too many false positives. For an a priori approximation of the neighborhood radius the following estimation can be used. Assuming a compressibility of 1% a ratio c_0/|v_max| = 10 needs to be maintained. Since c_0 has to be provided as input to the simulation this relationship can be applied to the CFL condition (12). In the worst case two particles approach each other with |v_max| such that the cutoff radius r_c must be increased by a factor z to be valid for N_it number of iterations.

$$z\, r_c = r_c + 2\, N_{it}\, |\mathbf{v}_{\max}|\, \Delta t$$

$$z = 1 + \frac{2\, N_{it}\, |\mathbf{v}_{\max}|\, \Delta t}{r_c} = 1 + \frac{2\, N_{it}\, \frac{c_0}{10}\, \frac{h}{4 \cdot 1.1\, c_0}}{k\, h} = 1 + \frac{N_{it}}{22\, k} \qquad (16)$$

This estimation marks the upper boundary when c_0 is chosen correctly. The 3D SPH solver GPUSPH also uses the neighbor list concept to speed up the computation [20]. The list is rebuilt every 10 iterations, which causes a negligible error but decreases the time required for neighborhood search significantly. While Eq. (16) allows the calculation of a cell size that correctly captures the particle interactions for a given number of iterations (e.g. z = 1.23 for 10 iterations with a kernel using k = 2), some drawbacks mentioned in [25] persist. Firstly, the method relies on the fact that c_0 is chosen correctly and is set 10 times bigger than the maximum velocity of any particle throughout the simulation. Secondly, if the maximum velocity is only attained for a short part of the simulation the performance is significantly limited by the assumption of a constant maximum velocity.

For this reason gpuSPHASE implements the dynamic Verlet list (VLX) introduced in [25]. This allows setting the cell size according to Eq. (16) such that the number of iterations per generation of the Verlet list is dynamically determined. Small simulations with a high number of iterations per second will suffer from the device-to-host communication overhead needed for bookkeeping, when compared to an approach that executes several iterations in batch without communication to the host. However, this overhead is negligible for bigger simulations so that only the advantages of the dynamic iteration count persist.
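A small helper illustrating Eq. (16) and the resulting enlarged search radius; the function name and interface are ours, not part of gpuSPHASE.

// Illustrative Verlet list search radius for a list kept over n_it iterations (Eq. (16)),
// assuming c_0 = 10 |v_max| and the CFL time step of Eq. (12).
float verlet_search_radius(float r_c, float kappa /* kernel factor k */, int n_it)
{
    const float z = 1.0f + static_cast<float>(n_it) / (22.0f * kappa);
    return z * r_c;   // e.g. kappa = 2 and n_it = 10 give z ~= 1.23
}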

3.4. Accuracy

Modern GPGPU programming allows choosing between single and double precision. While the advantage of double precision is evident, the downside is a reduced performance. Choosing single precision allows the software to increase the maximum number of particles, reduce the memory latency and profit from a higher computational peak performance. For the Nvidia Maxwell architecture the floating point arithmetic peak performance of double precision is 32 times lower than for single precision [24].

While the reduced accuracy of single precision does affect all properties of a simulation, for many simulations the truncation errors are negligible compared to the intrinsic numerical errors [29]. As a result, GPU accelerated SPH solvers are usually implemented

using single precision, although some implementations use double precision for selected properties like position [30]. This is especially important when simulating domains of large spatial extent compared to the particle sampling distance, as the resolution of representable floating point values decreases the higher the absolute value. Using double precision compensates this problem even for problems of sizes beyond the scope of SPH [30], but as [31] points out absolute positions must be replaced by relative cell positions to avoid the problem in general. Cell relative positioning has been implemented in gpuSPHASE and thus the maximum position and distance values are limited by the cell size, maintaining a sufficiently high accuracy.

Summation over a large number of values is furthermore problematic for limited precision floating point arithmetic. As summation of neighboring particles is at the core of the SPH method, [32] identify the Kahan summation algorithm as a countermeasure. This algorithm uses an additional correction value that keeps track of the error that occurs during the summation. The final accuracy can only be improved to a certain extent, limited by the condition number [31]. gpuSPHASE implements the algorithm but it needs to be enabled specifically at compile time as the higher accuracy is usually not needed.
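The following sketch shows compensated (Kahan) summation as it could be applied to a neighbor sum; it is only an illustration of the algorithm named above, not the compile-time option of gpuSPHASE, and it assumes that aggressive floating point reassociation (e.g. --use_fast_math) is disabled.

// Illustrative Kahan summation: the correction c tracks the rounding error of every
// addition and re-injects it in the next step, improving long neighbor sums.
__host__ __device__ inline void kahan_add(float value, float& sum, float& c)
{
    const float y = value - c;   // apply the stored correction
    const float t = sum + y;     // low-order bits of y may be lost here
    c = (t - sum) - y;           // recover exactly what was lost
    sum = t;                     // new running sum
}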

4. Validation

The gpuSPHASE solver implements the numerical method described in [22], which has been validated for a variety of two dimensional flows, hydrostatic and dynamic examples. To validate the code implementation, we compare the simulation results of gpuSPHASE to a diverse set of representative validation test cases.

4.1. Hydrostatic tank

The generalized wall boundary conditions have been extensively validated in [22]. As a static example we validate the test case used by [33] to compare different boundary conditions. The non-dimensionalized setup consists of a rectangular tank with the dimensions Lx = 2 and Ly = 1 filled with water of depth D = 0.9. The particle discretization is performed with a sampling distance Δx = 0.02 resulting in 4500 fluid particles. The speed of sound is calculated as c_s = 10√(gD) with the gravitational acceleration g = 9.81 in negative y-direction. The artificial viscosity of α = 0.24 can be transformed into an equivalent kinematic viscosity [22], which results in a Reynolds number of Re = 100. A quintic spline kernel with k = 3 and a support of r_c = 2√(2Δx²) is used [21]. To avoid pressure oscillations the test is initialized with the damping technique proposed in [33].

Fig. 4 visualizes the non-dimensionalized results for the hydrostatic tank test case. According to the test case definition the pressure and time are defined as p_ref = ρgD and t_ref = D/√(gD). Fig. 4(a) compares the analytical bottom pressure that results from the damped gravity initialization with the results from the simulation at the bottom center of the tank. The simulated values are in close agreement with the analytical value except for the regions where the slope of the damping function changes most. Fig. 4(b) shows that there is a smooth pressure gradient throughout the tank and that the particles maintain the initial positioning on a Cartesian lattice. Due to the absence of any density correction in the simulation this structured particle positioning will not prevail for long term simulations at the free surface.

4.2. Lid-driven cavity flow without gravity

An established test case for the validation of viscous flows is the lid-driven cavity problem.


Fig. 4. Validation figures for a hydrostatic rectangular tank. The blue line in figure (a) represents the hydrostatic bottom pressure resulting from the damped gravity, the black dots indicate the simulated bottom pressure for time t* = 0–2. The color in figure (b) indicates the hydrostatic pressure ranging from 0 to 1. Pressure values and time are non-dimensionalized. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 5. Streamline (a) and u-velocity profile plot (b) for the lid-driven cavity test case with a Reynolds number of Re = 1000.

The 2D problem consists of a quadratic cavity filled with a fluid that is covered by a horizontally moving lid. Figs. 5(a) and 6(a) show the non-dimensional test case geometry of size L = 1 covered by a lid moving with velocity U = 1, that contains a fluid with ρ = 1. As specified in the SPHERIC validation test 3, for the validation of gpuSPHASE two different viscosities resulting in Reynolds numbers Re = UL/ν of 1000 and 10,000 are investigated. Following the literature [34], the sound speed is increased by a factor of 10 compared to the standard approach of c_s = 10 v_max to c_s = 100 U to avoid the formation of a void in the center of the vortex. A quintic spline kernel with k = 3 and a support of r_c = 3Δx is used and density correction using the Shepard filter is applied every 50 iterations.

For the validation of the problem a particle sampling of Δx = 0.025 is chosen, which results in 400 × 400 fluid particles across the domain. After a simulation time of t = 80 streamline plots and the centerline u-velocity profiles are generated for both cases. The results are compared to the reference data from [35,36].

Fig. 5 shows the streamlines and u-velocity plots for the case Re = 1000. The comparison of the streamline plot Fig. 5(a) with the provided data from [36] shows a very similar flow profile, containing a central vortex and two smaller vortices in the lower corners. Fig. 5(b) furthermore allows a direct comparison of the velocities to the ones published in [35], which are in very good agreement.

The streamline and u-velocity plots in Fig. 6 summarize the results for the less viscous test case with a Reynolds number of Re = 10,000. The streamlines of the present SPH simulation in Fig. 6(a) show another vortex in the upper left corner, which is of identical shape as in the reference plots from [36]. Fig. 6(b) plots the vertical centerline u-velocity profile and shows that the calculated values are very close to the reference data points.

4.3. Sloshing wave impact problem

The sloshing wave impact problem is a SPHERIC test case that has been developed by the Model Basin Research Group (CEHINAV) [37,38]. In this experimental benchmark the impacts of fluids in a rectangular tank were recorded and are used for comparison to numerical simulations. The test case contains data for water and sunflower oil at 19 °C. The closed tank is filled with either liquid and air at atmospheric pressure at different levels.

For the validation of gpuSPHASE we follow the literature and choose to validate the most violent case, which is water at the low filling level of 93 mm [38]. In this case we simulate turbulent flow (Re = 97,546) with overturning and breaking waves due to the sloshing phenomena. The sloshing period is obtained from the shallow water dispersion relation.


Fig. 6. Streamline (a) and u-velocity profile plot (b) for the lid-driven cavity test case with a Reynolds number of Re = 10,000.


Fig. 7. Tank geometry of the wave impact problem.

For the investigated case T1 = 1.9191 s. The tank is rotated around the center of the tank's bottom line and the oscillation time has been defined as T/T1 = 0.85. Liquid properties for fresh water are provided as ρ = 998 kg/m³ and ν = 8.96 × 10⁻⁷ m²/s.

Fig. 7 visualizes the geometry of the tank, which is a rectangular tank built of Plexiglas with a length of L = 900 mm and a height of H = 508 mm. For this validation we use the pressure data recorded by sensor 1, which is located at the left wall at a height of 93 mm, identical to the initial filling. The roll angle φ is provided together with the current velocity and acceleration in a time resolution of 5 × 10⁻⁵ s.

The density and viscosity of the physical model are used in the simulation, no artificial viscosity term is added. A quintic spline kernel with k = 3 and a support of r_c = 3Δx is used and as in [38] the moving least squares density correction is applied every 25 iterations. The problem is modeled with a sampling distance Δx = 0.0015 resulting in 37,200 fluid particles. Based on the maximum velocities prevalent in the breaking waves the sound speed is set to c_s = 35 m/s. The pressure is calculated based on the extrapolated pressure values of the boundary particles. For this we integrate over the sensor size of 24 × 10⁻⁶ m² by averaging the pressure values of 7 interpolation points along the vertical of the sensor.

Fig. 8 shows the evolution of the pressure values of the physical experiment and the SPH simulation. The pressure of the first roll movement is very well predicted by the simulation in time and magnitude. Similar to the experiments in [38,39], the pressure values during the lateral impact are overestimated by the simulation.

Fig. 8. Pressure evolution for the sloshing wave impact problem.

Fig. 9. Particle rendering of pressure values during impact.

During the peak of the impact the pressure oscillations are significant but reduce shortly thereafter to a minor oscillation around the experimental values. The overall shape of the pressure curve is well estimated by gpuSPHASE. This observation applies to both impacts; the peak pressures have been cut in the figure and are around 10,000 Pa and 9000 Pa for the first and second impact, respectively.

Fig. 9 shows the pressure values of the individual particles during the first impact. Comparing to the results of [36] that implement boundary integrals, it can be observed that in our results the numerical noise is also restricted to the impact area and does not propagate to the whole domain.

Fig. 10. Sketch of the dambreak geometry.

5. Performance evaluation

For the performance evaluation of gpuSPHASE an established dambreak test case has been chosen with the specifications as provided in [10]. While the original simulation is performed as a two-phase water air model, the simulation is repeated with the boundary conditions discussed in Section 2.5 for a single phase [22]. The single-phase simulation is shown to be in good agreement with the two-phase simulation and will be used throughout this section for the evaluation of the compute performance of the gpuSPHASE solver. The scene is altered only in the SPH particle resolution. All simulations are performed with single precision as the peak performance on GPUs is much higher [24] and the memory restrictions would halve the maximum block size for double precision as described in Section 3.2. For comparison to other solvers, the simulation is performed with the quintic Wendland kernel [21] with k = 2 using a smoothing length h = √(2Δx²).

Fig. 10 sketches the evaluated scene with the fluid dimensions of L = 2.0 and H = 1.0. The tank boundaries are defined as Ltank = 5.366 and Htank = 3.0, depending on the particle sampling distance the actual lengths are rounded. The snapshots visualize a particle sampling distance of Δx = 0.02 resulting in 5000 fluid particles.

All evaluations are performed on a workstation with 32 GB RAM main memory and two Intel(R) Xeon(R) CPU E5-2695 v2 CPUs with 12 cores each. Depending on the evaluation the system is equipped with an Nvidia GeForce GTX TITAN, Nvidia GeForce GTX TITAN X or Nvidia GeForce GTX 750, the CUDA devices are performing no other tasks during evaluation. The operating system is Xubuntu 14.04 (Trusty Tahr), the C++ compiler is gcc 4.8.4 and the CUDA compiler is nvcc v7.0.27. Executables are compiled explicitly for the highest CUDA architecture supported by the device with all optimizations enabled.

5.1. Pipeline optimization

The computationally most expensive functions are related to the calculation of the continuity and momentum equation. The CUDA best practices guide [24] advises to optimize the most expensive functions first, which is the approach applied to the gpuSPHASE implementation. This section will discuss the performance evaluation results of the CUDA function momentum, which calculates the force that is exerted on the particles.

As discussed in Section 3.2 the Nvidia Maxwell architecture increases the amount of shared memory for every SMP from 48 KB to 96 KB. This extension removes the limiting restriction on the block size due to limited memory availability. Orthogonal to the hardware changes we evaluate the performance gain achieved by the merging of distinct memory arrays into mini structs that combine data that is often used together.

For simulations with a low number of particles and several thousand frames per second the launching overhead of CUDA functions and the time step calculation confine the maximum performance. To evaluate the performance of the function and the solver in the general case, we use a scene size where this overhead is negligible. Thus the sampling distance is fixed to Δx = 0.003 m (≈5 × 10⁵ particles) and no density correction is applied.

The performance of the algorithm benefits from the structured alignment of the initial sampling. To get an objective result the simulation is executed for 10⁴ iterations before the profiling is performed. The nvprof tool - shipped with the CUDA toolkit - is used to measure the execution time of the CUDA functions. For the evaluation the average execution time of 10 succeeding iterations is used, although it has to be noted that the deviation from this value is very small (< 1%).

Fig. 11 visualizes the effect of the optimization to the memory layout and the CUDA function pipeline.

• No optimization refers to the execution time of an implementation corresponding to the data layout and workflow depicted in Fig. 1.

• Vectorized refers to the same function pipeline with data arrays merged to use CUDA vectors of type float4.

• Pipeline refers to the combination of vectorized data-types and the optimization to the CUDA function pipeline as shown in Fig. 2.

It is interesting to note that the upgrade from the Kepler based GTX Titan to the Maxwell based GTX Titan X increases the performance by a factor of 1.92 for the unoptimized and fully optimized versions. This value is noteworthy as the peak performance of the devices - measured in billion floating point operations per second (GFLOPS) - differs by a factor of 6144/4494 = 1.37, which indicates that the implemented algorithm is well suited for architecture updates. For the momentum function the performance increase of the optimized version is even higher with 2.49, which is a remarkable value that supports the discussion in Section 3.2.

On the other hand the figure shows that the vectorization of the data arrays has a negative impact on the performance of the Titan X card while the performance of the Titan card increases. Analysis of the individual functions shows that the performance of the momentum function increases for both devices but the performance of the other functions decreases. The performance gain for the Titan X card is too low to achieve a positive overall result of the optimization. The reason for the worse performance of the other functions is the overhead that results from merging data into vector types if not all values of the vector are needed. E.g. because of the merge a CUDA function updating the velocity of particles (accelerate) also involves transferring the positions to the GPU. Due to the concept of double-buffering this data must also be written back to device memory (see Section 3.1).

This overhead vanishes with the combination of the vectorization and the merging of CUDA functions, shown by the performance values of the pipeline optimized version (see Fig. 11). Referring to the example above, the CUDA functions move and accelerate have been combined into a single function that performs both tasks at once. As a result the execution time of this function is equal to the sum of the non-optimized functions, additionally reducing the CUDA function launch overhead that becomes relevant for lower particle numbers. The overall performance gain resulting from the optimizations is 1.31 for both devices, although it is interesting to notice that the intermediate results differ significantly on the different architectures.

Fig. 11. Optimization speedup chart. The rectangles show the execution time of an SPH iteration for different degrees of optimization on the Nvidia GTX Titan and Nvidia GTX Titan X cards.

5.2. Shared memory caching algorithm

Section 3.2 proposes a novel method to leverage the software-controllable shared memory cache to accelerate the SPH computation on CUDA devices. The key idea of moving the data to a faster memory allows reducing the access times but also has drawbacks that may reduce the performance. Overhead is involved in copying data to shared memory and the execution logic is more complex than for a naïve implementation. For this reason we compare the performance of two momentum functions, one implementing the proposed algorithm and one implementing the straightforward approach using global memory. Since the caching algorithm benefits from the structured alignment in the beginning of the simulation the performance difference is sampled over the entire simulation time.

The evaluation is based on the scene described in Section 5. While varying only the resolution the speedup has been sampled every fixed number of iterations for a simulation of 10 s physical time. Reducing the sampling distance directly influences the CFL condition and thus the number of iterations necessary for the simulation of a fixed physical time varies. For this reason it is not possible to sample the simulation at exactly the same points in time, but that does not affect the visualization of the overall trend during the simulation.

Fig. 12 shows that the performance of the caching algorithm depends on the number of particles involved in the simulation. The performance for a low number of particles (Δx = 0.02 m) is worse than the performance of the naïve implementation. The reason for this behavior is that the hardware managed cache is big enough to efficiently handle the low number of particles. Increasing the particle number forces the cache to discard data more often, so the overhead of copying particle data to shared memory starts to pay off. Besides offering much higher bandwidth and much lower latency than local or global memory [24], shared memory has the additional benefit of taking load off the hardware managed cache. The software controlled cache manages all data that is known to be used during the function execution, while the heuristics controlled hardware cache only needs to handle data that is requested on demand.

Although the algorithm works well for unstructured data, it is evident that the performance is best when the data is structured. This can be observed from the initial peak in the diagram, which shows the performance gain for the initially grid aligned scene. By design, hardware caches store not only the requested data but entire cache lines containing the requested addresses. When the particles are distributed on a grid, the probability that a cache line contains data that is needed for other computations in the same block is higher than for an SPH distributed scene.
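
For readers unfamiliar with the technique, the following heavily simplified CUDA sketch shows the general idea of staging particle data in shared memory before pairwise interactions are evaluated. It assumes particles sorted along a space-filling curve so that a block's particles are spatially close, uses a placeholder pairwise term instead of the real momentum equation, and is an illustration rather than the algorithm proposed in Section 3.2.

// Illustrative shared memory staging; launch with BLOCK threads per block.
#include <cuda_runtime.h>
#define BLOCK 128

__global__ void momentumShared(const float2* __restrict__ pos,
                               const float2* __restrict__ vel,
                               const float*  __restrict__ rho,
                               float2* acc, int n, float h)
{
    __shared__ float2 sPos[BLOCK];
    __shared__ float2 sVel[BLOCK];
    __shared__ float  sRho[BLOCK];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Cooperative load: every thread stages one particle of the block's tile.
    if (i < n) {
        sPos[threadIdx.x] = pos[i];
        sVel[threadIdx.x] = vel[i];
        sRho[threadIdx.x] = rho[i];
    }
    __syncthreads();

    if (i >= n) return;

    float2 a = make_float2(0.f, 0.f);
    int tileSize = min(BLOCK, n - (int)(blockIdx.x * blockDim.x));

    // Interactions with particles cached in shared memory; neighbors outside
    // the tile would still have to be fetched from global memory (omitted).
    for (int j = 0; j < tileSize; ++j) {
        if (j == threadIdx.x) continue;
        float dx = sPos[threadIdx.x].x - sPos[j].x;
        float dy = sPos[threadIdx.x].y - sPos[j].y;
        float r2 = dx * dx + dy * dy;
        if (r2 > 0.f && r2 < h * h) {
            // placeholder pairwise term standing in for the momentum equation
            float w = 1.f - sqrtf(r2) / h;
            a.x += w * (sVel[j].x - sVel[threadIdx.x].x) / sRho[j];
            a.y += w * (sVel[j].y - sVel[threadIdx.x].y) / sRho[j];
        }
    }
    acc[i] = a;
}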

5.3. Verlet list size

As discussed in Section 3.3, the number of iterations for which a Verlet list remains valid depends on the maximum velocity in the simulation. Furthermore, the factor z that increases the search radius of a particle beyond the kernel support rc to include possible future neighbors is important. To keep the Verlet list for 7 iterations, which has been identified as the optimum for CPU based implementations [25], z = 1 + 7/44 ≈ 1.16 when based on the overall maximum velocity (see (16)).
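
The relation between z and the number of reusable iterations can be sketched as follows; the code assumes a simple displacement budget of (z - 1) * rc that is consumed at the overall maximum velocity, which may differ in detail from Eq. (16), and the numerical values are purely illustrative.

// Hedged sketch of estimating how many iterations a Verlet list built with
// search radius z * rc stays valid under a simple worst-case drift criterion.
#include <cmath>
#include <cstdio>

int verletListLifetime(double z, double rc, double vMax, double dt)
{
    double budget = (z - 1.0) * rc;      // allowed worst-case drift since rebuild
    double driftPerStep = vMax * dt;     // worst-case displacement per iteration
    return static_cast<int>(std::floor(budget / driftPerStep));
}

int main()
{
    // illustrative numbers only, not taken from the dambreak setup
    const double rc = 0.04, vMax = 2.0, dt = 1e-4;
    const double zs[] = {1.05, 1.10, 1.16, 1.20};
    for (double z : zs)
        std::printf("z = %.2f -> list valid for ~%d iterations\n",
                    z, verletListLifetime(z, rc, vMax, dt));
    return 0;
}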

Table 1
Overall frames per second for dambreak simulations at different resolutions Δx (m) while varying the factor z that increases the neighbor search radius.

z \ Δx     0.02    0.01    0.005   0.003   0.002   0.001
1.01       1979    1815     893     376     197    55.3
1.05       1973    3060    1317     564     277    73.7
1.10       3696    3269    1328     580     289    70.8
1.15       3769    3234    1321     566     280    60.4
1.20       2543    3167    1285     548     270    70.0


Fig. 12. Shared memory speedup chart. The chart visualizes the speedup of the shared memory version over the naïve implementation during a simulation of 10 s for different numbers of particles, executed on an Nvidia GTX Titan X.

Table 1 shows the average number of frames per second for dambreak simulations with different resolutions while varying the factor z that increases the neighbor search radius. From this data it is evident that a very small z = 1.01, which causes the list to be rebuilt every iteration, results in much worse performance than higher values. As previously discussed, when choosing a large z = 1.20 the number of false positives in the list becomes too high and the overall performance drops. For z = 1.05, 1.10 and 1.15 the performance is very similar, with 1.10 being slightly better in most cases.

gpuSPHASE implements the dynamic Verlet list (VLX) so that z can be fixed based on the discussed estimations while the list is kept as long as possible. Consequently, the number of iterations the list can be reused decreases when the dynamics of the simulation increase. Fig. 13(a) shows the average number of iterations the list is reused throughout the simulation time. It is clearly visible that the number decreases during the start of the simulation, when the fluid collapses and the maximum speed increases. When the water builds up at the downstream wall the maximum velocity decreases again, which leads to a higher number of iterations. As expected, the higher the value of z, the more iterations the list can be kept. Fig. 13(b) shows the respective performance for the exact same simulations, expressed as frames per second. As observed in Table 1, z = 1.10 shows slightly better values than the other configurations, but a significant advantage can only be observed in the time interval t = 4-6.
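
A minimal host-side sketch of this dynamic reuse logic is given below; the displacement-budget criterion and the names are assumptions for illustration, not the VLX implementation itself.

// Hedged sketch of the dynamic reuse idea: the list built with radius z * rc
// is kept until the accumulated worst-case displacement exhausts the safety
// budget, so calm phases reuse the list for more iterations than dynamic ones.
struct DynamicVerletList {
    double budget;          // (z - 1) * rc, the assumed displacement budget
    double used = 0.0;      // worst-case drift accumulated since the last rebuild

    DynamicVerletList(double z, double rc) : budget((z - 1.0) * rc) {}

    // Called once per iteration with the current maximum velocity and timestep;
    // returns true when the neighbor list must be rebuilt before this step.
    bool needsRebuild(double vMax, double dt) {
        used += vMax * dt;
        if (used >= budget) {
            used = 0.0;     // budget is restored by the rebuild
            return true;
        }
        return false;
    }
};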

5.4. Simulation performance

In this section we evaluate the simulation of the dambreak test case and compare gpuSPHASE against the existing SPH software DualSPHysics [14] on an Nvidia GTX Titan X and a GTX 750.

Fig. 13. Effects of different increase factors z during a full simulation: (a) number of iterations the Verlet list is reused, (b) performance as frames per second. Both panels show the identical simulations so that the number of iterations a Verlet list is reused can be compared to the simulation performance.

Both cards are built on the Maxwell architecture, but as of 2016 the GTX 750 is approximately 10 times cheaper than the GTX Titan X, which translates into fewer CUDA cores as well as lower memory capacity and bandwidth. The latest available version of DualSPHysics (version 4.0) is compiled for the Maxwell architecture with all compiler optimizations enabled. The evaluation uses the dambreak problem discussed in Section 5, starting with relatively few particles and continuously increasing the resolution. As evaluated in [16], DualSPHysics has the best performance of all three implementations tested on CUDA capable graphics devices. As in the evaluation performed in [16], the quintic Wendland kernel [40] is used for the simulation. Contrary to the definition of the smoothing length as h/Δx = 4, the definition h = √(2 Δx²) = √2 Δx is used. While gpuSPHASE is capable of handling a large number of neighbors, the performance of the algorithm outlined in Section 3.2 clearly suffers in that case. This is because the shared memory then covers only a small portion of the data that is actually needed, so the manual caching results in computational and data overhead.

Fig. 14 compares the average execution time for a single iteration at different resolutions. DualSPHysics is executed with two different time stepping algorithms, a simple first order Verlet integration and a symplectic second order predictor-corrector scheme. The symplectic algorithm is especially useful for long term simulations [23]. In this comparison the values are no longer profiled for individual parts of the computation but are calculated by dividing the overall computation time by the number of iterations. For low particle numbers gpuSPHASE is more than 2 times faster than DualSPHysics and 3 times faster than DualSPHysics using the symplectic time integration. The advantage decreases with higher particle numbers, as the positive effect of the gpuSPHASE optimizations for low numbers of particles becomes less important. For particle numbers in the order of 10^4, gpuSPHASE executed on the GTX 750 is faster than DualSPHysics on a GTX Titan X.

It has to be noted that a key difference responsible for the much better performance is the usage of a Verlet list, which is reused over several iterations. As described in Section 3.3, reusing the list removes the burden of performing the neighborhood search in every iteration. In the executed simulation gpuSPHASE was configured to create a list based on an increase factor z = 1.10, which allows the neighborhood to be reused for approximately 10 iterations depending on the present dynamics in the scene. While the memory requirements of this approach are a limiting factor for simulations with a huge number of particles, this is of no practical relevance for comparatively small simulations.

Fig. 14. Simulation performance chart. Performance comparison of gpuSPHASE and DualSPHysics (Verlet and symplectic time integration) on an Nvidia GTX Titan X and GTX 750. The particle numbers result from the resolutions 0.02, 0.015, 0.01, 0.008, 0.006, 0.005, 0.003, 0.002 and 0.001.

The results in Fig. 14 cannot be used to judge the efficiency of the software implementations. While both solvers use the concept of WCSPH, DualSPHysics is a very versatile solver optimized for three dimensional SPH simulations, which means that 3D interactions have to be computed even when they are reduced to 2D forces. Furthermore, the cell linked list implemented in DualSPHysics has a much lower memory consumption, so that simulations with billions of particles can be executed in multi-GPU environments [19]. It has been included in the performance benchmark to provide a reference to an established and highly optimized SPH software, but it has to be acknowledged that the aims and optimizations of DualSPHysics differ from those of gpuSPHASE.


Fig. 15. Sketch of the AIZ egg-shaped digester.

6. Practical application

An example for the usage of the gpuSPHASE solver is the domain of anaerobic digestion (AD), a combination of hydraulic phenomena and biological processes that deals with the breakdown of organic waste. The goal of operating anaerobic digestion plants is to provide hydraulic conditions that facilitate the biokinetic processes at minimal cost. These processes are described by the Anaerobic Digestion Model No. 1 (ADM1) [41] and require a sufficient mixing of the fluid, which is achieved by gas mixing, pumped recirculation or mechanical mixing. It has been shown that the latter is the most efficient [42], which makes it interesting for further investigation regarding mixing efficiency and time.

For this purpose a real world AD reactor from the Wastewater Management Association Mid-Unterinntal-Zillertal (AIZ) has been modeled in a 2D SPH representation (see Fig. 15). The AIZ operates an egg-shaped digester (ESD), which is among the most efficient shapes regarding heat loss, maintenance and operation expenses [42]. The mixing is induced by a draft tube that is located vertically in the center of the digester, where an impeller is installed to drive the fluid up- or downwards. Considering the digester's axisymmetrical construction the flow is well suited for 2D numerical simulation.

Based on the exact treatment of advection in SPH, this approach has several benefits over existing CFD approaches to this problem. Biological compounds are advected with the flow, and as such a spatial resolution of biological concentrations is an intrinsic property of the simulation. This allows improving the computation of the biokinetic processes from a global model based on a fully mixed assumption to a local model that is sensitive to the spatially resolved concentrations [17].

Table 2
Mode of operation of the draft tube. The tube alternates repeatedly between driving the fluid upwards and downwards.

Type       Direction   Time [min]
Increase   Up          5
Steady     Up          120
Decrease   Up          1
Increase   Down        5
Steady     Down        20
Decrease   Down        1

An interesting property of such simulations is the mixing time of a reactor, which measures the time required for added organic material to distribute across the digester. For this purpose we emulate a physical batch feeding process by defining a subset of particles as feeding matter. Those particles are tracked so that the distribution of the material over the reactor becomes visible. A significant advantage of these SPH simulations over the existing mesh based approaches is that the results are not based on steady state solutions; instead, the dynamics of the draft tube are modeled as a function of time. This is an important modeling aspect since in real world operation the driving direction of the draft tube is regularly changed for cleaning purposes. Changing the direction of the impeller is required to avoid blockage by the sludge that is driven through the tube. Table 2 shows the operational modes of the draft tube that are executed repeatedly during the simulation. For the change of flow direction we choose the damping function defined in [33] to model the transition. The velocity during the steady operation mode is set to 1.5 m/s, which is enforced by an external force applied to particles inside the draft tube area.
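
To make the schedule of Table 2 concrete, the following sketch returns a target tube velocity as a function of simulation time; linear ramps stand in for the damping function of [33], and all names and the sign convention are illustrative assumptions rather than the gpuSPHASE configuration.

// Hedged sketch of the Table 2 schedule: target vertical velocity inside the
// draft tube over one repeating cycle (positive = upwards, negative = downwards).
#include <cmath>

double draftTubeVelocity(double tSeconds)
{
    const double vSteady = 1.5;                    // m/s during steady operation
    const double phase[6] = {5, 120, 1, 5, 20, 1}; // phase durations in minutes (Table 2)

    double cycle = 0.0;
    for (double p : phase) cycle += p * 60.0;      // full cycle length in seconds

    double edges[7] = {0.0};
    for (int k = 0; k < 6; ++k) edges[k + 1] = edges[k] + phase[k] * 60.0;

    double t = std::fmod(tSeconds, cycle);
    if (t < edges[1]) return  vSteady * (t - edges[0]) / (phase[0] * 60.0);         // increase, up
    if (t < edges[2]) return  vSteady;                                              // steady, up
    if (t < edges[3]) return  vSteady * (1.0 - (t - edges[2]) / (phase[2] * 60.0)); // decrease, up
    if (t < edges[4]) return -vSteady * (t - edges[3]) / (phase[3] * 60.0);         // increase, down
    if (t < edges[5]) return -vSteady;                                              // steady, down
    return -vSteady * (1.0 - (t - edges[5]) / (phase[5] * 60.0));                   // decrease, down
}

In a simulation loop this target velocity would drive the external force applied to the particles inside the tube region.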

Fig. 16 visualizes the mixing effectiveness of the reactor, which has been simulated for 6 h of physical time. The simulation has been performed with a viscosity that is based on a typical total solids concentration (TS) of the sludge. Following [42] we used η = 0.007 Pa s for TS = 2.5%.

As described above, the analysis of the mixing effectiveness is calculated in post-processing, solely based on the advection of the SPH particles. For this purpose a fraction of the 95,822 fluid particles is marked as feed according to the operational data of the real world plant. The total volume of the ESD is 2500 m³ with an average daily biomass feeding rate of 105 m³. As the simulation only lasts a quarter of a day, the feeding biomass is reduced accordingly. This ratio results in approximately 1000 feeding particles that are marked in the top right corner of the ESD at t = 0 to imitate a batch feeding. For better visualization of the mixing, a circular area around each tracked particle is highlighted in dark blue (the radius of this area is chosen as 9 Δx so that in the perfectly mixed case the visualization is entirely dark blue).
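
The visual criterion described above can also be expressed as a simple post-processing metric; the following hypothetical sketch computes the fraction of fluid particles lying within 9 Δx of at least one tracked feed particle, where a value of 1 corresponds to the fully dark blue, i.e. perfectly mixed, picture. It is a brute-force illustration and not part of gpuSPHASE.

// Hedged post-processing sketch of a mixing indicator: fraction of fluid
// particles within 9 * dx of at least one feed particle (O(N * M) brute force).
#include <vector>
#include <cstddef>

struct Particle { double x, y; };

double mixingCoverage(const std::vector<Particle>& fluid,
                      const std::vector<Particle>& feed, double dx)
{
    const double r2 = 81.0 * dx * dx;   // (9 * dx)^2
    std::size_t covered = 0;
    for (const Particle& f : fluid) {
        for (const Particle& t : feed) {
            double ddx = f.x - t.x, ddy = f.y - t.y;
            if (ddx * ddx + ddy * ddy <= r2) { ++covered; break; }
        }
    }
    return fluid.empty() ? 0.0 : static_cast<double>(covered) / fluid.size();
}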

Fig. 16 shows the mixing profiles for the time instants t = 300, 900 and 3600 s. From the figures it is evident that after 5 min virtually no mixing has occurred. After 15 min the right side of the digester is almost entirely dark blue, except for a dead zone in the boundary area at the vertical center. The reactor mixing is not very good yet, as the left side still shows a large light green area, indicating that too many tracked particles reside on the right side. After 1 h the highlighting is evenly distributed across both sides, with only the boundary area at the vertical center still light green. After 6 h of simulation the picture is mostly identical: the dark blue area is slightly better distributed than at t = 1 h but still shows the dead zones identified previously. By adding diffusion to the simulation, it is possible to model the actual distribution of the individual biological compounds' concentrations. Based on this data and the intrinsic support for particle tracking, the spatially resolved calculation of biokinetic processes is viable [17].

Fig. 16. Mixing profile visualizations for time instants t = 300, 900 and 3600 [s].

The simulation of the ESD with a resolution of Δx = 0.05 m and a speed of sound of c = 99 results in a dynamic timestep of Δt ≈ 10^-4. The simulation of 6 h thus results in approximately 2 × 10^7 time steps, which were computed by gpuSPHASE in approximately 63 h on a GTX Titan X.

7. Conclusions

gpuSPHASE is a general purpose open-source 2D SPH solver that is optimized to simulate long running hydraulic phenomena. It uses standard input and output formats for compatibility with existing software. Based on the restriction to two dimensional scenes, the solver is optimized for the memory hierarchy of the CUDA architecture. In gpuSPHASE a dynamic and configurable neighborhood list is implemented to reduce the effort for neighbor search. Since real-time simulations of transport processes using WCSPH require several thousand iterations per second, this paper discusses the optimization of the execution pipeline. The combination of the discussed optimization methods results in very good performance compared to existing state-of-the-art SPH solvers for the discussed problem domain.

To reduce the performance penalties of CUDA global memory, a generic caching algorithm for neighborhood data using shared memory is presented. Comparing the implementation of the algorithm for the momentum equation to a naïve version shows that significant performance improvements can be achieved. The approach is not restricted to SPH but is applicable to any unstructured problem that uses neighbor lists, such as molecular dynamics or point cloud reconstruction.

Acknowledgments

This research is part of projects that are funded by the Austrian Science Fund (FWF): [P26768-N28] and Austrian Research Promotion Agency (FFG): [850738]. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GTX Titan X GPU used for this research. The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper. We would like to thank Dr. Salvatore Marrone for providing reference data for the validation of gpuSPHASE.

References

[1] L.B. Lucy, A numerical approach to the testing of the Fission hypothesis, 1977, http://dx.doi.org/10.1086/112164.

[2] R.A. Gingold, J.J. Monaghan, Mon. Not. R. Astron. Soc. 181 (3) (1977) 375-389. http://dx.doi.org/10.1093/mnras/181.3.375.

[3] J.J. Monaghan, Rep. Progr. Phys. 68 (8) (2005) 1703-1759. http://dx.doi.org/10.1088/0034-4885/68/8/R01, arXiv:0507472v1.

[4] D.J. Price, J. Comput. Phys. 231 (3) (2012) 759-794. http://dx.doi.org/10.1016/j.jcp.2010.12.011.

[5] L.D. Libersky, A.G. Petschek, T.C. Carney, J.R. Hipp, F.A. Allahdadi, J. Comput. Phys. 109 (1) (1993) 67-75. http://dx.doi.org/10.1006/jcph.1993.1199, URL http://www.sciencedirect.com/science/article/pii/S002199918371199X.

[6] J. Brackbill, J. Eastwood, W. Benz, E. Asphaug, Comput. Phys. Comm. 87 (1) (1995) 253-265. http://dx.doi.org/10.1016/0010-4655(94)00176-3, URL http://www.sciencedirect.com/science/article/pii/0010465594001763.

[7] J. Bonet, S. Kulasegaram, Internat. J. Numer. Methods Engrg. 47 (6) (2000) 1189-1214. http://dx.doi.org/10.1002/(SICI)1097-0207(20000228)47:6<1189::AID-NME830>3.0.CO;2-I.

[8] P.W. Randles, L.D. Libersky, Comput. Methods Appl. Mech. Engrg. 139 (1-4) (1996) 375-408. http://dx.doi.org/10.1016/S0045-7825(96)01090-0.

[9] J. Monaghan, Simulating free surface flows with SPH, 1994, http://dx.doi.org/10.1006/jcph.1994.1034, URL http://www.sciencedirect.com/science/article/pii/S0021999184710345.

[10] A. Colagrossi, M. Landrini, J. Comput. Phys. 191 (2) (2003) 448-475. http://dx.doi.org/10.1016/S0021-9991(03)00324-3.

[11] N. Grenier, M. Antuono, A. Colagrossi, D. Le Touzé, B. Alessandrini, J. Comput. Phys. 228 (22) (2009) 8380-8393.

[12] J. Monaghan, Annu. Rev. Fluid Mech. 44 (2012) 323-346.

[13] A. Hérault, J. Hydraul. Res. 48 (extra) (2009) 000. http://dx.doi.org/10.3826/jhr.2010.0005.

[14] A. Crespo, J. Domínguez, B. Rogers, M. Gómez-Gesteira, S. Longshaw, R. Canelas, R. Vacondio, A. Barreiro, O. García-Feal, Comput. Phys. Comm. 187 (2015) 204-216. http://dx.doi.org/10.1016/j.cpc.2014.10.004, URL http://www.sciencedirect.com/science/article/pii/S0010465514003397.

[15] M. Gomez-Gesteira, B.D. Rogers, A.J. Crespo, R.A. Dalrymple, M. Narayanaswamy, J.M. Dominguez, Comput. Geosci. 48 (2012) 289-299.

[16] J. Cercos-Pita, Comput. Phys. Comm. 192 (2015) 295-312. http://dx.doi.org/10.1016/j.cpc.2015.01.026, URL http://www.sciencedirect.com/science/article/pii/S0010465515000909.

[17] M. Meister, W. Rauch, Environ. Model. Softw. 75 (2016) 206-211.

[18] M. Henze, Activated Sludge Models ASM1, ASM2, ASM2d and ASM3, Vol. 9, IWA publishing, 2000.

[19] J.M. Domínguez, A.J.C. Crespo, D. Valdez-Balderas, B.D. Rogers, M. Gómez-Gesteira, Comput. Phys. Comm. 184 (8) (2013) 1848-1860. http://dx.doi.org/10.1016/j.cpc.2013.03.008.

[20] E. Rustico, G. Bilotta, A. Herault, C. Del Negro, G. Gallo, IEEE Trans. Parallel Distrib. Syst. 25 (1) (2014) 43-52. http://dx.doi.org/10.1109/TPDS.2012.340, URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6392827.

[21] W. Dehnen, H. Aly, Mon. Not. R. Astron. Soc. 425 (2) (2012) 1068-1082.

[22] S. Adami, X.Y. Hu, N.A. Adams, J. Comput. Phys. 231 (21) (2012) 7057-7075. http://dx.doi.org/10.1016/j.jcp.2012.05.005.

[23] L. Verlet, Phys. Rev. 159 (1) (1967) 98-103. http://dx.doi.org/10.1103/PhysRev.159.98.

[24] NVIDIA Corporation, CUDA C Programming Guide, 2016.

[25] J.M. Domínguez, A.J.C. Crespo, M. Gómez-Gesteira, J.C. Marongiu, Internat. J. Numer. Methods Fluids 67 (12) (2011) 2026-2042. http://dx.doi.org/10.1002/fld.2481.

[26] B. Moon, H.V. Jagadish, C. Faloutsos, J.H. Saltz, IEEE Trans. Knowl. Data Eng. 13 (1) (2001) 124-141. http://dx.doi.org/10.1109/69.908985.

[27] G.M. Morton, A Computer Oriented Geodetic Data Base and A New Technique in File Sequencing, International Business Machines Company, 1966.


[28] D. Hilbert, Math. Ann. 38 (3) (1891) 459-460.

[29] K. Szewc, Smoothed particle hydrodynamics modeling of granular column collapse, arXiv preprint arXiv:1602.07881.

[30] J. Domínguez, A. Crespo, A. Barreiro, M. Gómez-Gesteira, B. Rogers, 9th International SPHERIC Workshop, Paris, France, 2014, pp. 140-145.

[31] A. Hérault, G. Bilotta, R. Dalrymple, Proceedings of the 9th International SPHERIC Workshop, Paris, France, 2014, pp. 134-139.

[32] V. Titarenko, B.D. Rogers, R. Alistair, Proceedings of the 8th International SPHERIC Workshop, Trondheim, Norway, 2013.

[33] J. Monaghan, J. Kajtar, Comput. Phys. Comm. 180 (10) (2009) 1811-1820. http://dx.doi.org/10.1016/j.cpc.2009.05.008, URL http://www.sciencedirect.com/science/article/pii/S0010465509001544.

[34] E.-S. Lee, C. Moulinec, R. Xu, D. Violeau, D. Laurence, P. Stansby, J. Comput. Phys. 227 (18) (2008) 8417-8436.

[35] U. Ghia, K.N. Ghia, C. Shin, J. Comput. Phys. 48 (3) (1982) 387-411.

[36] M. Chern, A. Borthwick, R. Eatock Taylor, Internat. J. Numer. Methods Heat Fluid Flow 15 (6) (2005) 517-554.

[37] A. Souto-Iglesias, E. Botia-Vera, A. Martín, F. Pérez-Arribas, Ocean Eng. 38 (16) (2011) 1823-1830.

[38] L. Delorme, A. Colagrossi, A. Souto-Iglesias, R. Zamora-Rodriguez, E. Botia-Vera, Ocean Eng. 36 (2) (2009) 168-178.

[39] F. Macia, L.M. González, J.L. Cercos-Pita, A. Souto-Iglesias, Progr. Theoret. Phys. 128 (3) (2012) 439-462.

[40] H. Wendland, Adv. Comput. Math. 4 (1) (1995) 389-396. http://dx.doi.org/10.1007/BF02123482.

[41] D.J. Batstone, J. Keller, I. Angelidaki, S. Kalyuzhnyi, S. Pavlostathis, A. Rozzi, W. Sanders, H. Siegrist, V. Vavilin, Water Sci. Technol. 45 (10) (2002) 65-73.

[42] B. Wu, Water Res. 44 (5) (2010) 1507-1519.