Contents lists available at ScienceDirect

Computer Physics Communications

journal homepage: www.elsevier.com/locate/cpc

An adaptable parallel algorithm for the direct numerical simulation of incompressible turbulent flows using a Fourier spectral/hp element method and MPI virtual topologies

A. Bolis, C.D. Cantwell *, D. Moxey, D. Serson, S.J. Sherwin

Department of Aeronautics, Imperial College London, South Kensington Campus, London, UK

CrossMark

article info

Article history: Received 29 July 2015 Received in revised form 7 January 2016 Accepted 19 April 2016 Available online 28 April 2016

Keywords:

Spectral/hp element method High-order methods Incompressible flows MPI parallelisation Virtual topologies

abstract

A hybrid parallelisation technique for distributed memory systems is investigated for a coupled Fourier-spectral/hp element discretisation of domains characterised by geometric homogeneity in one or more directions. The performance of the approach is mathematically modelled in terms of operation count and communication costs for identifying the most efficient parameter choices. The model is calibrated to target a specific hardware platform after which it is shown to accurately predict the performance in the hybrid regime. The method is applied to modelling turbulent flow using the incompressible Navier-Stokes equations in an axisymmetric pipe and square channel. The hybrid method extends the practical limitations of the discretisation, allowing greater parallelism and reduced wall times. Performance is shown to continue to scale when both parallelisation strategies are used.

© 2016 The Author(s). Published by Elsevier B.V.

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Direct Numerical Simulation (DNS) is used to simulate complex laminar and turbulent flow problems both to gain an understanding of the fundamental flow physics and for industrial applications [1]. Turbulent flows inherently require high spatial and temporal resolutions in order to resolve the spectrum of scales within the flow, which depends on the Reynolds number Re. The number of grid points needed to resolve a fully turbulent three-dimensional flow scales as Re9/4 [2], meaning that even at modest Re, the computational demands are significant. Since practical CFD applications involve Reynolds numbers on the order of 103-106 and higher, this inevitably makes serial computation impossible, necessitating the use of parallel clusters of computers.

The most prevalent form of high-performance computer systems are distributed-memory clusters consisting of an interconnected collection of processors, each with their own local memory hierarchies. Traditionally, the capacity of these systems has been broadly increased through faster processor clock speeds and improved lower-latency network interconnects. However, in recent years, HPC facilities have evolved around increasingly parallel systems as clock speeds have saturated and energy-usage concerns become a motivating factor [3]. Consequently, algorithms

* Corresponding author.

E-mail address: c.cantwell@imperial.ac.uk (C.D. Cantwell).

have been required to adapt to the changing hardware landscape in order to maintain efficiency.

The spectral/hp element method [4], whilst being used to simulate fluid flow for many years in an academic setting, is now emerging as an attractive alternative to many traditional numerical discretisations on modern HPC hardware. As opposed to the classical finite element method, spectral/hp elements use high-order polynomial expansions on each element. Numerically, this has the advantage of low dispersion and diffusion alongside exponential convergence in the polynomial order. Additionally however, discretised operators are dense and have a far richer structure compared to linear expansions, meaning that they can more effectively utilise caching on modern HPC hardware. The tensor product of one-dimensional basis functions on each element also admits a rich fabric of implementation strategies [5-7].

However, variations of this method mean that we can further improve computational performance whilst preserving the accuracy of the simulation. Many studies of fundamental flow physics are posed on domains which are characterised by geometric homogeneity in one or more coordinate directions [8-10]. Instead of dis-cretising the domain using a 3D spectral/hp element method, one can combine a 2D spectral/hp element discretisation with a pure spectral expansion to significantly reduce the computational cost of these problems. This approach is known as a Fourier-spectral/hp element method [11]. In this study we are specifically interested in the case where only one coordinate direction possesses geometric homogeneity. Therefore, the 3D domain is decomposed into a

http://dx.doi.org/10.1016/j.cpc.2016.04.011

0010-4655/© 2016 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

sequence of spectral/hp element planes, coupled using a Fourier expansion in the third coordinate directions.

The approach typically used when parallelising this type of discretisation is to either use mesh-decomposition in the spectral/hp element planes, or apply a modal decomposition in the Fourier direction. The latter takes advantage of the orthogonality property of the Fourier basis for linear operators. The optimal choice of parallelisation strategy typically depends on the size of the problem, the ratio of Fourier planes to spectral elements, alongside the hardware and interconnect of the parallel system. Moreover, the fast development of computer systems forces software designers to make a continuous effort to maintain algorithms to be able to exploit all the benefits exposed by the latest generation of hardware. There is therefore benefit to be gained from a code supporting both types of parallelism, but predicting the performance of these algorithms on a specific architecture is not trivial [12].

The performance of a parallel 3D incompressible Navier-Stokes solver using the Fourier-spectral/hp element method has been benchmarked previously [13,14]. Spectral modes were distributed across the processes, requiring the transposition of data using the MPI all-to-all technique to compute derivatives in that direction. Their performance model assumed a flat communication topology and the maximum number of processes was limited to the number of Fourier planes. Conversely, parallelisation of a spectral element discretisation has been explored [15-19], for which the upper limit on the number of processes is the number of mesh elements.

Performance of a mixed-parallelism case for 3D turbulence simulations has previously been investigated [20], specifically for a 1D spectral/hp element discretisation, coupled with a 2D spectral expansion. Parallel communication was implemented across processes as a Cartesian topology and a performance model was constructed which suggested improved strong scaling could be achieved on specific architectures. Solver performance depends on hardware characteristics such as memory bandwidth and processor cache size, but also on network capabilities in terms of latency, and bandwidth. Therefore prudent choice of parallelism strategy can enable improved overall performance by structuring the computation and communication pattern to better match the available hardware.

The present study is distinguished from this previous work by the choice of discretisation. We use a two-dimensional spectral/hp element method, coupled with a one-dimensional spectral expansion. This permits the investigation of flow problems on geometries of significantly greater complexity than earlier works. We first outline the discretisation and parallelisation strategies and quantify their comparative performance. For large runs with many processes the number of possible hybrid parallelism strategies may be significant. We construct a mathematical model to characterise the expected performance of any given single or hybrid paralleli-sation strategy which can be used to predict the optimal strategy for a given problem. We calibrate the model against the individual mesh-decomposition and Fourier parallelisation techniques and demonstrate its accuracy in predicting performance of the hybrid approach.

Fig. 1. Structure of a three-dimensional expansion using a Fourier spectral/hp element method.

where p is the kinematic pressure, v is the kinematic viscosity and u = [u, v, w]T is the velocity.

2.1. Spatial discretisation

The three-dimensional domain is decomposed into NZ two-dimensional spectral/hp element planes spanning the x and y coordinate directions, coupled with a Fourier expansion in the third homogeneous direction, as illustrated in Fig. 1. The spectral/hp element discretisation is described elsewhere [4] and only a brief summary is given here.

Each two-dimensional plane ak is partitioned into a set of Ne\ subdomains a£( such that,

ak = U aek

a£ n a£ = 0 Ve = f.

In this study the meshes consist of quadrilateral elements only, but the approach may be equally applied when using triangular elements. Numerical integration and differential operators are constructed on a standard reference element ast which is mapped to each a£ usingabijective map, xe ■ ast ^ ae, as x = /e(|).On each element, the solution u may be approximated as

us(x, y) =J2 &n(x, y)Un = $p(x)$q(y)uepq,

p=0 q=0

where îiepq are elemental coefficients. These correspond to the tensor-product of nodal expansion bases, <fip(x) and $q(y), of order P defined as Lagrange polynomials through Gauss-Lobatto-Legendre points and have the form

2. Methods

Three-dimensional incompressible, isothermal flow with constant density and viscosity is governed by the incompressible Navier-Stokes equations which, in terms of the primitive variables (u, p), are expressed as

--+ u • Vu = -Vp + vV2u,

V u = 0,

<Pm£) =

n të - ïi)

l=0,l=m

n (ïm - ïl)

l=0,l=m

This is synonymous with the original spectral element method, giving a total of (P + 1)2 degrees of freedom (DOF) per element and Nxy = Nel x (P + 1)2 local degrees of freedom per plane. Gaussian quadrature is used for numerical integration, for which the solution u is represented on the same set of P + 1 points fm.

The connectivity of elements in a plane is represented by an assembly mapping A which maps the concatenated vector of elemental degrees of freedom to their global counterparts and enforces a C0-continuity constraint. The global degrees of freedom are assembled using the relation ue = Aug, where A is the matrix equivalent of A. This matrix is in general highly sparse and so is in practice not constructed explicitly.

Operators in the spectral/hp element method are constructed elementally and applied using the sum-factorisation technique [21] as this has been demonstrated to be more efficient when operating on elements with higher-order bases [7,6,5]. The tensor-product nature of the elemental expansion bases allows matrix-vector operations to be decomposed into a sequence of smaller, more computationally efficient matrix-matrix operations, performed in each coordinate direction separately.

In the z-direction, the solution is expressed using a Fourier basis of Nz/2 complex modes, <fik(z) = elzk, to give an expansion of the three-dimensional solution on an element as

US (X, y, Z) = Y^ y, Z)Un = ^ $pq(x, y)$k(z)upqk.

The total number of degrees of freedom is therefore Ntot = NXYNZ.

2.2. Temporal discretisation

A stiffly stable splitting scheme [22] is adopted which decouples the velocity and pressure fields, leading to an explicit treatment of the advection term and an implicit treatment of the pressure and the diffusion terms. The key steps are

Ü-J2 ŒqUn

= -£ Pq[(U ■V )U]n-q,

V 2pn+1=V{ A),

= -V p

y0un+1 - U

= V V 2U

To maintain the order of the scheme, a modified Neumann pressure boundary condition is used,

d pn+1 dn

dU n+1 st^ n

-¡t + v£>q(Vx «r

+ J2 Pq[(U ■ V)U]n-

■ n.

The coefficients aq, and y0 can be found in [22] for first-, second- and third-order implicit-explicit (IMEX) time-integration schemes. Fig. 2 illustrates the implementation of the timeintegration section ofthe algorithm, where we ignore input/output and set-up costs. For short time-integration, these may be significant.

2.3. Parallelisation

In this section we describe the parallelisation approaches used in this study. We provide an overview of the two orthogonal approaches: parallel decomposition of the Fourier modes (modal parallelisation) and parallel decomposition of the spectral/hp element planes (elemental parallelisation). We then outline the hybrid approach which combines both techniques, and describe its

Fig. 2. Incompressible Navier-Stokes solution algorithm. Details of the building blocks ofthe time-integration process. The most expensive routines are highlighted, i.e. the advection term calculation and the elliptic solvers for pressure and velocity (Poisson and Helmholtz).

implementation. In each case we partition the simulation across a total of R processes. The Message Passing Interface (MPI) library is used for communication in all three methods.

In modal parallelisation the NZ planes, corresponding to NZ/2 complex Fourier modes, are distributed equally among the processes. Elliptic solves are decoupled in the Fourier-transformed space and can be performed independently on each plane using either a direct Cholesky factorisation with reverse Cuthill-McKee algorithm (LAPACK) [23], or through an iterative conjugate gradient algorithm. The non-linear advection term is more efficiently computed in non-modal space. To perform the inverse and forward Fourier transforms, used before and after the advection calculation respectively, the data to be transformed must reside on the same process. In practice, this requires a transposition of the data using an MPI all-to-all operation. To support efficient differentiation in the z-coordinate direction, we additionally impose the constraint that both the real and imaginary components of each complex Fourier mode reside on the same process, since, in the Fourier space, derivatives are calculated through the multiplication ük ^ -ikuk. This restricts the maximum number of useable processes to NZ/2.

In contrast, elemental parallelism distributes the Nel elements of each plane among the processes. The partitioning of the 2D plane is implemented using the METIS graph partitioning library [24] and an identical partitioning and distribution amongst processes is used for each plane in the domain. The natural limit on the number of useable processes is therefore Ne¡. The dual-graph of the mesh is partitioned among the R processes to equally distribute the number of degrees of freedom, whilst minimising the edge-cut, and therefore the inter-process communication. Elliptic solves are performed iteratively, with communication being required to exchange boundary information between adjacent elements residing on different processes at each iteration. This data exchange is implemented using the gather-scatter algorithm from Nek5000 [25] which uses a global numbering of the DOFs in the plane to efficiently summate process-local contributions and distribute the result back to the participating processes.

Hybrid parallelisation combines both modal and elemental approaches by organising the available processes in a Cartesian

grid [20], as illustrated in Fig. 3. In this arrangement, the world communicator is split into a series of row communicators which support elemental parallelisation, while column communicators enable modal parallelisation. Each process belongs to precisely one row communicator and one column communicator and nominally operates on a fixed subset of elements in a fixed subset of planes. As in modal parallelism, elliptic solves are performed in Fourier-transformed space, but due to the elemental parallelism the iterative conjugate gradient solver must be used. The limit on the number of viable processes is now increased substantially to Nel X Nz/2.

2.4. Test environment

All simulations are performed on an SGI Altix ICE 8200 EX system with up to 512 cores (64 eight-core nodes). Each node contains Nehalem CPUs running at 2.93 GHz and 24 GB RAM. Communication is through a dual-rail Infiniband interconnect. The system runs Redhat Enterprise Linux with kernel version 3.0.580.6.6. Intel MPI was used for parallel message exchange and FFTW 3.2.2 for performing fast Fourier transforms.

The software used for the spectral/hp element discretisation in this study was Nektar++ v3.3.0 [26,27]. In summary, the framework provides scope for constructing high-order polynomial expansions on both fully two-dimensional and three-dimensional domains. It also supports a coupled spectral/hp element—Fourier approach for domains with geometric homogeneity. The specific operators and time integration necessary for solving the incompressible Navier-Stokes equations are built upon this framework. As with any numerical timing study, the presented results are specific to the implementation used, although should provide useful generic guidance.

3. Performance model

Identifying the optimal strategy and the distribution of processes between elemental and modal parallelism is non-trivial, since algorithmic complexity and specific system architecture affects performance. We therefore design a performance model, calibrated through the use of the two parallelisation strategies independently, to help select the best approach before the start of a given simulation.

3.1. General model assumptions

To ensure the model remains simple enough for predictive use, yet sufficiently complex to provide reasonable accuracy, we make a number of assumptions regarding the nature of the computational problem and hardware when evaluating the computational and communication costs.

The computational cost of an algorithmic unit is evaluated using the floating-point operation count of basic routines such as matrix-vector multiplications, inner products and vector-vector summations. This implicitly disregards hardware characteristics, such as caching and memory throughput limits, although our testing has shown that these aspects can be reasonably captured using scalar constants, determined during the calibration process for a specific platform. Operations are evaluated at the element level, and their total computational cost across the domain is therefore assumed to be predominantly independent of the parallelisation strategy.

Communication costs are generally more complex to model and strongly depend on the hardware configuration. Different cluster configurations, such as mesh, hypercube or ring interconnect topologies, have a significant effect on the measured

Fig. 3. illustrative MPI Cartesian communicator for a hybrid parallelisation of a Fourier spectral/hp element discretisation using 4 elements per plane and 4 planes, on 16 MPi processes. Row communicators handle the communication between mesh partitions for elemental parallelisation while column communicators handle communication between planes for modal parallelisation.

communication time. In this study we follow the most common approach when estimating communication costs [20,14], which is to assume a "flat" topology supporting direct communication between nodes and no interconnect contention. Operations are assumed to be performed using double-precision floating point numbers, occupying eight bytes on the test system described above.

3.2. Model construction

The general structure of our performance model is as follows. Let Oi be the operation count of the ith operation in the algorithm. Let Cj be the time required for the jth communication, then we can define the total parallel execution time, T, as

E Oi + E с

where RXY and RZ are the number of processes used for elemental and Fourier parallelism, respectively, and R = RXYRZ. The size of a computational problem is generally measured by the number of degrees of freedom. In the spectral/hp discretisation, this corresponds to the number of elemental modes which, for two-dimensional quadrilateral elements, is (P + 1)2. For some matrix-vector operations the sum-factorisation technique, which exploits the tensorial nature of the expansion, can be used that requires (4P3 + 18P2 + 26P + 12) operations per element [7].

The communication times TCj can be further modelled as

Cj = Nmsgs X [tL + Nop X Tb]

where Nmsgs is the number of messages transmitted during an operation, Nop is the number of floating-point values per message, tl is the latency and tb the inverse of the bandwidth. Note that tb is quantified using s/DOF, rather than the conventional s/MB, to facilitate the modelling. Latency and bandwidth are sampled for the test system using the MPI benchmarking application IBM-MP1. Bandwidth is measured for a number of MPI routines and for messages of size 8 bytes up to 4 MB and averaged. For the test system considered, the average bandwidth measured was 1.64 • 103 MB/s. Bandwidth and latency for the test system was determined to be

tb = 4.87 • 10-9 s/DOF, tl = 2.09 • 10-6 s.

input : u0, ui, u2 output: N(u0), N(u1), N(u2)

// Transformation back to physical space for i = 0 to 2 do

1 | ui = IDFT (ui); end

for i = 0 to 2 do

// Derivatives in the 2D spectral/hp

element plane d ui/dx = Dxui d ui/dy = Dy ui;

// Derivatives in the spectral direction

d u i/dz = Dz u i;

// Transformation back to physical space d ui/d z = IDFT (d u i/d z);

// Construction of the i — th advection component

N (ui) = u0d ui/d X + u1d ui/dy + u2d ui/dz;

// Transformation to Fourier space

N (ui) = DFT (N (ui)); end

Algorithm 1: Non-linear advection term procedure. Terms with a tilde are forward Fourier-transformed.

3.3. Advection term

We first model the advection term u • Vu, which is computed in physical space and can be expanded as

N (u) = ud u/dx + vd u/dy + wd u/dz, N (v) = udv/dx + vdv/dy + wdv/dz, N (w) = udw/dx + vdw/dy + wdw/dz,

where u, v and w denote the three components of the velocity u. The numerical implementation of this is shown in Algorithm 1. Computational costs arise from FFTs (lines 1,4 and 6), derivatives (lines 2 and 3) and vector-vector operations (line 5), while communication is only required to compute the FFTs.

Inverse FFTs are required for each of the velocity components and the z-derivatives of each of the velocity components. A forward FFT is used to transform the result of the advection calculation. This gives a total of nine FFTs, each consisting of a set of independent 1D FFTs. The number of 1D FFTs is given by the number of quadrature points associated with the local spectral/hp element mesh partition. Assuming the mesh is evenly partitioned, we can quantify the total number of 1D FFTs as Ne¡(P + 1)2, each costing O(NZ log2(NZ)), giving a cumulative cost of

Of = 9Nel(P + 1)2 • CfftNz log2 Nz,

where CFFT is a const. Derivatives in the z-direction are a BLAS level

1 operation with total cost

02 = NelNz (P + 1)2.

In-plane physical derivatives will be proportional to the cost of executing a matrix-vector multiplication using the general derivative matrix. A total of six in-plane derivatives are computed. The total cost of derivative operations is therefore,

03 = 6Nel(P + 1)4.

Finally, there are five level 2 BLAS operations in calculating each component of N(ui). Vectors are of size Ne¡(P + 1)2, resulting in a total cost of

04 = 15Nel(P + 1)2.

input : initial guess x0

output: final solution x

// calculate initial residual r0 ro = b — Axo;

// solve for w0 where K is the preconditioner

Kwo = ro;

// set parameters

q—i = P— i = o P— i = 0 So = Aw o;

Po = (ro, wo) Mo = (So, wo) ao = Po/^o; for i = o to NMf do

1 Pi = Wi + Pi—iPi—i;

2 qi = Si + Pi—iqi—i;

3 Xi+i = Xi + aiPi;

4 ri+i = ri — aiqi; if (ri+i, ri+i) < tolerance then

break;

5 Solve Kwi+i = ri+i;

6 Si+i = AWi+i;

7 Pi+i = (ri+i, Wi+i);

8 Mi+i = (Si+i, Wi+i);

pi = Pi+iM;

ai+i = Pi+i/(Mi+i — Pi+i Pi/a);

Algorithm 2: Preconditioned Conjugate Gradient Method.

For the advection term, communication is required during the 9 FFTs to shuffle data between processes so that the data for each iD FFT, previously spanning RZ processes, is colocated on the same process. We apply the communication model described in (i). For each of the 9 FFTs two MPI All-to-all calls are required (shuffling and unshuffling), each of which formally requires Nmsgs = (RZ — i) messages [2o,i4]. Message size is based on the assumption that the iD FFTs will be evenly distributed across the participating processes. This gives a communication cost of

C'A = 18(Rz - 1)1 tl +

Nel RzRxy

(P + 1)2Tb

Combining the above contributions and distributing the computational cost amongst the processes gives a parallel execution time of

E °A + E CA.

3.4. Elliptic solver

Algorithm 2 shows the basic steps to solve the linear systems using a preconditioned conjugate gradient method [28]. To simplify the analysis, we do not perform static condensation of the elliptic systems, evaluating them using a block-diagonal matrix system, where each block contains a full elemental matrix.

The daxpy operations on lines i-4, each comprising one scalar-vector multiplication and one vector-vector summation giving a total cost of

Oi = 8Nel(P + i)2.

Fig. 4. Overview of how a partition containing NO can be cast. The different groupings suggest that the maximum number of edges which may require communication is a 2(NO + 1).

The application of the diagonal preconditioner in step 5 can be considered a vector-vector multiplication and has cost

02 = Nel(P + 1)2.

The most computationally expensive step is the evaluation of the matrix system in step 6. Applying the sum-factorisation operation count defined earlier in this section we quantify the number of operations as

Of = Nei(4P3 + 18P2 + 26P + 12).

Finally, the two inner products in steps 7 and 8 evaluate the stopping criteria of the iterative algorithm. Each consist of a vector-vector multiplication and a sum reduction. The vector-vector multiplication requires Ne/(P + 1)2 operations per plane while the sum reduction Ne/(P +1)2 — 1 operations per plane. In order to maintain simplicity in the model we approximate the sum reduction to Nei(P + 1)2 operations, leading to

04 = 08 = 6Ne,(P + 1)2.

Communication appears during the inner product reductions and during the matrix-vector multiplication. The inner product reduction can be modelled using the All-gather model [20]. The number of messages is (RXY — 1) for each inner product, since the local reductions need to be composed into a global reduction, which happens on one processor. Since the reduction operations for each iteration can be collated into a single message, the size is three floating-point values. This results in the communication time

ce = (Rxy - 1)(tL + 3rB).

To estimate communication during matrix-vector multiplication, we assume the structure of the mesh decomposition leads to a tree-like graph of communication, arising from recursive bisection. The number of communications will therefore be proportional to log2(Rxy). Furthermore, we can assume data needs to be exchanged in both directions for each edge of the tree. Message size is far more challenging to estimate, since this is dependent on the size of the boundary between any two partitions. We therefore choose the maximum message size, which can be estimated at 2 (Neoc+1 ), as illustrated in Fig. 4. Here we are also assuming that all partitions are interior to the domain and therefore all boundaries participate in communication. These estimates lead to a prediction for the matrix-vector multiplication communication costs as

CE = 2Cgs log2 (Rxy)

Tl + 2

Nel rzrxy

+ 1 (P + 1)TB

where Ces is a constant relating to the implementation of the gather-scatter algorithm.

Combining these contributions gives the cumulative cost of a single iteration of the elliptic solver as

E OE + E j.

Table 1

Turbulent test cases discretisation features.

Test case P NPlane el Nz Nel Nxy Ntot

Pipe Channel 7 6 64 450 128 64 8192 28800 5184 28800 663552 1843200

Fig. 5. Diagram illustrating the pipe geometry and its discretisation. Spectral/hp elements are used in the cross-plane with a Fourier expansion used in the streamwise direction.

3.5. Incompressible Navier-Stokes model

We now combine the above components of the model to elicit a full model for the incompressible Navier-Stokes algorithm described in Fig. 2. The total parallel execution time for one timestep can be expressed as

TNS = a- T

+ b ■ N

iter + 3Niter

which captures the costs associated with the advection term, Poisson solve for the pressure and the three Helmholtz solves for the velocity components. Ne and NiHer are the number of iterations of the Poisson and Helmholtz solves, respectively. These will vary depending on the nature of the problem and the choice of precon-ditioner plays an important role in the efficiency of the iterative solver. The diagonal preconditioner was chosen for modelling simplicity and is not necessarily the most efficient choice. Typical values are NPer ~ 80 and NiHer ~ 10 for the problems considered in this study. However, these are problem-specific and are largely independent of the parallelisation strategy. The coefficients a and b capture the characteristics of specific hardware and are determined during the calibration process discussed next.

4. Results

We consider two prototype turbulent flow problems to quantify the performance of the different parallelisation regimes. These examples highlight the benefits of the hybrid parallelisation approach for increasing parallelism in a scalable way and therefore reducing parallel execution time. Table 1 lists the performance model properties of the two domains considered. Since this study concerns only the parallelisation aspect of these simulations, we consider a fixed discretisation in each case. The discretisation is chosen to be numerically converged for capturing turbulent flow in the given geometry and at the prescribed Reynolds number, based on previously published studies [10,29,30].

4.1. Test problems

The pipe geometry is illustrated in Fig. 5, where the streamwise direction is geometrically homogeneous. Lengths and velocities are non-dimensionalised by the diameter D and the bulk velocity ubuik, respectively. The length of the pipe is 5D. The flow is driven by a constant body-force of

fz = 0.5 * 0.3164/Re1

Fig. 6. Diagram illustrating the channel geometry and its discretisation. A Fourier expansion is used in the spanwise direction.

Modal Bottleneck

64 128

Processors

Modal 0terative) —v-Modal (Direct) —B-Elemental (Iterative) O Linear Hybrid A Model Calibration Model

Fig. 7. Parallel efficiency of the four parallel approaches for the pipe flow problem on the system detailed in Section 2.4. The vertical dotted black lines indicate the practical limit imposed by the modal and elemental discretisations in isolation. The solid purple line shows ideal efficiency, relative to the 16-core case using modal parallelism with iterative elliptic solver. The dotted orange lines show the model prediction forthe cases used for calibration. The solid orange lines show predictions for the hybrid regime. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

for Re = 3000 to instil a turbulent flow regime. The pipe is discretised using spectral elements in the cross-section of the pipe and a Fourier expansion in the streamwise direction. A total of 64 spectral elements at polynomial order 7 are used in the x-y plane, while 128 modes are used in the Fourier expansion. No-slip boundary conditions are imposed on the wall of the pipe. The timestep used for simulations is At = 0.002 non-dimensional time units with a second-order IMEX scheme.

For the channel, shown in Fig. 6, lengths are non-dimen-sionalised by the channel half-height and velocities by ubuik. The length of the channel is 4n, and the spanwise dimension is 4n/3. The flow is driven by a body force of/x = 0.0036 for Re = 3000 and no-slip boundary conditions are imposed on the top and bottom of the channel. The channel was discretised using 64 Fourier modes and 450 spectral elements with a polynomial order of 6. The timestep used for channel flow simulations was 0.0001 with a second-order IMEX scheme.

4.2. Hybrid parallelism performance

The efficiency of the various parallel strategies is assessed through the strong scaling tests for both problems. The results for the pipe are shown in Fig. 7. Modal parallelism (triangle and square symbols) scales well using either direct or iterative elliptic solvers. Elemental parallelism (circle symbols) scales poorly due to the large ratio of communication to computation, since even for only 32 processes, there are only 2 elements per process. The dotted lines indicate theoretical bottlenecks on the number of useable processes due to there being an insufficient number of elements or Fourier modes. In both cases, this limit is 64 processes. However, in

Modal (Iterative) V Modal (Direct) -H-Elemental (Iterative) O Linear Hybrid A ▲ A

Mo Bottl dal 'neck Elctn Bottl - II

64 128

Processors

Fig. 8. Parallel efficiency of the four parallel approaches for the channel flow problem on the system detailed in Section 2.4. The vertical dotted black lines indicate the practical limit imposed by the modal and elemental discretisations in isolation. The solid purple line shows ideal efficiency, relative to the 16-core case using modal parallelism with iterative elliptic solver.

the case of a Fourier-dominated discretisation, modal parallelism is clearly preferably over elemental parallelism.

Fig. 8 shows a comparison of efficiency for the different parallelism strategies in the channel problem. Here the modal bottleneck is reached at 32 cores and the modal approach with direct solver has the greatest performance in this regime. Elemental parallelism is possible up to 450 cores but, above 128 cores, the ratio of computation to communication is low and at 256 cores, the distribution of computation becomes significantly unequal, resulting in poor parallel performance. However, elemental parallelism outperforms modal parallelism when using the iterative solver. This observation is intuitive since only four nodes are used and most of the communication between partitions will be intra-node. Recent versions of the OpenMPI libraries allow processes on the same node to use shared memory, rather than using the network interface to send messages. Within a node latency between processes is therefore low and sending a large number of small messages becomes the most effective method.

Hybrid parallelism extends these limits substantially, enabling simulations up to and beyond 512 processes. The distribution of modal and elemental parallelism will lead to different performance. In Figs. 7 and 8 the solid triangles indicate the minimum execution time achievable using hybrid parallelism for a prescribed total number of cores. Efficiency is less than the ideal case in general, but still reduces runtime significantly as the number of processes increase. In particular, for 512 processes, the performance approaches the ideal case.

Figs. 9 and 10 show the parallel efficiency of the various parallelisations of the pipe and channel problem, respectively, normalised against the 16-core modal case with iterative solver. Efficiency of both modal and elemental parallelism reduces with increasing core counts, however, the use of the hybrid approach recovers a significant portion of the lost efficiency. For the pipe the combined approach enables 80% of ideal parallel efficiency to be attained on 512 cores, while for the channel there is an order of magnitude increase in efficiency at 256 cores, compared to the elemental approach.

4.3. Model calibration

Calibration is the process whereby we identify values for the machine-specific constants a and b in Eq. (2). To simplify the calibration process, we combine the costs of the elliptic solves

Modal (Iterative) Modal (Direct)

|_| Elemental (Iterative)

| Hybrid

<PZ.P*Y>

(32,4) (16,16)

(32,16)

64 128

Processors

Fig. 9. Turbulent pipe flow parallel simulation—efficiency of parallelisation approaches on a cluster of 8-core nodes. The histograms show the efficiency E of different parallel simulations defined as E = S/P where S is the speed-up and P is the total number of processors used for the simulation. The speed-up is based on the 16-core (2 nodes) run using the Modal (iterative) approach.

Modal (Iterative) Modal (Direct)

|_| Elemental (Iterative)

M Hybrid

(16,16)

(32,16)

16 32 64 128 256 512

Processors

Fig. 10. Turbulent channel flow parallel simulation—efficiency of parallelisation approaches on a cluster of 8-core nodes. The histograms show the efficiency E of different parallel simulations defined as E = S/P where S is the speed-up and P is the total number of processors used for the simulation. The speed-up is based on the 16-core (2 nodes) run using the Modal (iterative) approach.

and split the timings into those for computation and those for communication as

TNS = a1 ■ TA + a2 ■ TA + b1 ■ TO + b2 ■ T^.

To illustrate the use of the model, six measurements were taken of the time taken to solve the pipe problem using the elemental and Fourier parallel decomposition. The model TNS was implemented in MATLAB and the required coefficients were calculated using a least-squares algorithm as

a1 = 0.45 • 10-

a2 = 0.2 b1 = 3.15 • 10-

1400, if NpJme/PxY < 4,

110, otherwise.

The coefficient b2 is multi-valued since performance of the elemental parallel decomposition typically degrades sharply when there are fewer than four elements per process, often due to imbalance in the mesh distribution amongst the processes.

4.4. Model validation

To quantify the accuracy of predictions using the calibrated performance model in the hybrid regime, we apply the model to the turbulent pipe flow example. Results presented are obtained by averaging 1000 measurements of the timings for each of the components of the time-stepping algorithm. The choice of elliptic solver has a significant impact on performance, and as in the modal parallelism case, the timings for the elliptic solves are measured using both a direct method using LAPACK and an iterative conjugate gradient approach. The four parallelisation types used throughout the remainder of this section are therefore modal (iterative and direct), elemental (iterative) and hybrid (iterative). Strong scaling is performed for the different parallelism strategies and results are normalised by the timings for the 16-core modal approach using the iterative solver.

The model, outlined in Section 3, is calibrated for the turbulent pipe flow example using simulations executed using both modal parallelism and elemental parallelism in isolation. These timings are consequently accurately reproduced by the model, as shown by the dashed orange lines in Fig. 7. The solid orange line in the same figure shows predicted runtimes using the performance model in the hybrid regime. Good agreement is observed and the model correctly identifies the best performing hybrid case, timings of which are shown by the blue triangles.

5. Discussion

In this paper we presented a technique to parallelise a 3D incompressible Navier-Stokes algorithm discretised using a 2D spectral/hp element mesh coupled with a Fourier expansion in a third geometrically homogeneous direction. The implementation enables a flexible mixture of both elemental parallelism and modal parallelism. We have illustrated the hybrid parallelism technique on two prototype problems: turbulent flow in an axisymmetric pipe and turbulent flow in a channel. Both problems enjoy increased parallelism through the approach and consequently improved runtimes and greater energy efficiency. The optimal weighting of these strategies can be systematically chosen through the construction of a performance model, calibrated to a specific system through measurements of the two parallelism approaches independently. This enables rapid selection of the highest-performing combination of the strategies without costly trial-and-error testing.

In the modern HPC environment, where energy is of increasing concern, selecting the optimal implementation to maximise performance is becoming increasingly important. Although experience and intuition can generally suggest the most suitable paral-lelisation approach for a specific problem, the decision is in general highly challenging, particularly when moving between HPC systems or when tackling problems on a range of different domains with the same algorithm. Even from a purely theoretical perspective, it can be appreciated that a single parallel approach cannot be optimal in all situations and this has been confirmed through numerical experimentation.

The number of degrees of freedom in the xy-plane and the number of Fourier modes are the first indicators of which technique is more appropriate. We recognise that problems with a high number of modes compared to the number of elements per plane appear to benefit from the modal parallelisation approach. On the other hand, a domain discretisation with a larger number of elements than Fourier modes generally benefits from an elemental decomposition technique. However, the relative performance of different approaches cannot be determined entirely through operation counts. Accounting for specific hardware characteristics and the latency and bandwidth available on the communication

pattern is essential to accurately predict the optimal strategy. Parallelisation approaches requiring a large number of messages, such as the mesh-decomposition parallelisation, can suffer from poor performance if the interconnect latency is high. These types of parallel techniques are therefore efficient on shared memory machines or low-latency interconnects.

In general we wish to be able to tackle a range of problems where those quantities can vary, potentially reaching extreme values. It is therefore clearly beneficial to have both parallelisation strategies available within a single codebase and this study illustrates the advantages of being able to combine them in a flexible manner to achieve lower runtimes on a fixed number of processors. A further benefit is the extension of strong scalability possible through the use of the hybrid parallel implementation. The potential of new machines are often exploited by investigating even larger problems than previously explored and in finer detail, or using larger Reynolds numbers, and to capitalise on weak scalability. Good weak scalability generally follows from good strong scalability and by increasing an algorithms strong scalability, simulations can be run faster and on larger machines.

It should be noted that not all the hybrid parallel approaches we tested provided good performance. Depending on the mesh topology, partitioning and the number of Fourier modes on each processor, some strategies may not perform efficiently. We note, for example, that the optimal 128-core hybrid parallel case for the turbulent pipe in Fig. 7 has a similar runtime to the modal parallelism using the direct solver with only 64 cores. Conversely, we showed that certain choices may result in reduced computational time without increasing the number of processors. This is the case of the modal approach when using a direct solver, which generally performs as well as an elemental parallelisation method with twice the number of CPUs. Minimising the energy consumption when running a simulation is a point of interest in current high performance computing research. Implementation flexibility plays an important role in addressing these goals.

In terms of limitations, we have disregarded some pieces of the algorithm in order to focus on the two main routines, namely the advection and the elliptic operators. This is a typical approach when creating a scalability model [20], although it may introduce some errors. The calibration has been carried out by monitoring the solution time on the SGI Altix ICE 8200 EX system, therefore the coefficients presented here must be considered specific for that machine. Finally, the model, and therefore the results presented in this paper, are specific to Nektar++. However, it still provides valuable insight and can suggest overall guidelines on typical choices of parallelisation strategy on large HPC resources.

Acknowledgements

AB and SJS acknowledge support from EPSRC under grant EP/H000208/1. CDC acknowledges support of the British Heart Foundation under grant FS/11/22/28745. DS acknowledges support from CNPq under grant 231787/2013-8 and FAPESP under grant 2012/23493-0. DM acknowledges support under the Laminar Flow Control Centre funded by Airbus/EADS and EPSRC under grant EP/I037946.

References

[1] N. Kroll, C. Hirsch, F. Bassi, C.Johnston, K. Hillewaert, IDIHOM: industrialization of High-Order Methods—A Top-Down Approach: Results of a Collaborative Research Project Funded by the European Union, 2010-2014, Vol. 128, Springer, 2015.

[2] S. Pope, Turbulent Flows, Cambridge University Press, 2000.

[3] H. Meuer, E. Strohmaier, J. Dongarra, H. Simon, Top500 supercomputer sites, 2013, http://www.top500.org.

[4] G. Karniadakis, S. Sherwin, Spectral/hp Element Methods for Computational Fluid Dynamics, second ed., Numer. Math. Sci. Comp., Oxford University Press, Oxford, 2005.

[5] C. Cantwell, S. Sherwin, R. Kirby, P. Kelly, From h to p efficiently: selecting the optimal spectral/hp discretisation in three dimensions, Math. Model. Nat. Phenom. 6(3) (2011) 84-96.

[6] C. Cantwell, S. Sherwin, R. Kirby, P. Kelly, From h to p efficiently: Strategy selection for operator evaluation on hexahedral and tetrahedral elements, Comput. & Fluids 43 (1) (2011) 23-28.

[7] P. Vos, S. Sherwin, R. Kirby, From h to p efficiently: implementing finite and spectral/hp element methods to achieve optimal performance for low-and high-order discretisations, J. Comput. Phys. 229 (13) (2010) 5161-5181.

[8] H. Blackburn, D. Barkley, S.J. Sherwin, Convective instability and transient growth in flow over a backward-facing step, J. Fluid Mech. 603 (2008) 271-304.

[9] C. Cantwell, D. Barkley, H. Blackburn, Transient growth analysis of flow through a sudden expansion in a circular pipe, Phys. Fluids (1994-present) 22 (3) (2010) 034101.

[10] K. Avila, D. Moxey, A. de Lozar, M. Avila, D. Barkley, B. Hof, The onset of turbulence in pipe flow, Science 333 (6039) (2011) 192-196.

[11] G. Karniadakis, Spectral element-fourier methods for incompressible turbulent flows, Comput. Methods Appl. Mech. Engrg. 80 (1990) 367-380.

[12] A. Grama, A. Gupta, V. Kumar, isoefficiency: measuring the scalability of parallel algorithms and architectures, iEEE Parallel Distrib. Technol., Syst. Appl. 1 (3) (1993) 12-21.

[13] C. Crawford, C. Evangelinos, D. Newman, G. Karniadakis, Parallel benchmarks of turbulence in complex geometries, Comput. & Fluids 25 (1996) 677.

[14] C. Evangelinos, G. Karniadakis, Communication performance models in prism: A spectral element-fourier parallel Navier-Stokes solver, in: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, SC'96, 1996.

[15] P. Fischer, Analysis and application of a parallel spectral element method for the solution of the Navier-Stokes equations, Comput. Methods Appl. Mech. Engrg. 80 (1990) 483-491.

[16] P. Fischer, E. Ronquist, Spectral element methods for large scale parallel Navier-Stokes calculations, Comput. Methods Appl. Mech. Engrg. 116 (1-4) (1994) 69-76.

[17] P. Fischer, Parallel domain decomposition for incompressible fluid dynamics, Contemp. Math. 157 AMS (1994) 313-322.

[18] P. Fischer, A. Patera, Parallel simulation of viscous incompressible flows, Ann. Rev. Fluid Mech. 26 (1994) 483-528.

[19] P. Fischer, An overlappingSchwarz method for spectral element solution of the incompressible Navier-Stokes equations, J. Comput. Phys. 133 (1997) 84-101.

[20] C. Hamman, R. Kirby, M. Berzin, Parallelization and scalability of a spectral element channel flow solver for incompressible Navier-Stokes equations, Concurrency, Pract. Exp. (2007) 1-7.

[21] S. Orszag, Spectral methods for problems in complex geometries, J. Comput. Phys. 37(1) (1980) 70-92.

[22] G. Karniadakis, M. israeli, S. Orszag, High-order splitting methods for the incompressible Navier-Stokes equations, J. Comput. Phys. 97 (2) (1991) 414-443.

[23] E.Anderson, Z. Bai, C. Bischof, S. Blackford,J. Demmel,J. Dongarra,J. Du Croz,A. Greenbaum, S. Hammarling, A. McKenney, D. Sorensen, LAPACK Users' Guide, third ed., Society for industrial and Applied Mathematics, Philadelphia, PA, 1999.

[24] G. Karypis, METlS's Manual, Department of Computer Science and Engineering, fifith ed., University of Minnesota, Minneapolis, MN 55455, March, 2013.

[25] P. Fischer, J. Lottes, D. Pointer, A. Siegel, Petascale algorithms for reactor hydrodynamics, J. Phys.: Conf. Ser. 125 (1) (2008).

[26] R. Kirby, S. Sherwin, The Nektar++ project, 2006, http://www.nektar.info.

[27] C.D. Cantwell, D. Moxey, A. Comerford, A. Bolis, G. Rocco, G. Mengaldo, D. de Grazia, S. Yakovlev,J.-E. Lombard, D. Ekelschot, B.Jordi, H.Xu, Y. Mohamied, C. Eskilsson, B. Nelson, P. Vos, C. Biotto, R.M. Kirby, S.J. Sherwin, Nektar++: An open-source spectral/hp element framework, Comput. Phys. Comm. 192 (2015) 205-219. http://dx.doi.org/10.1016/jxpc.2015.02.008.

[28] J. Demmel, M. Heat, H. van der Vorst, Parallel numerical linear algebra, Acta Numer. (1993) 111-197.

[29] H. Koberg, Turbulence control for drag reduction with active deformation (Ph.D. thesis), imperial College London, University of London, 2007.

[30] J. Kim, P. Moin, R. Moser, Turbulence statistics in fully developed channel flow at low reynolds number, J. Fluid Mech. 177 (1987) 133-166.