Accepted Manuscript

Parallel iterative solution of the incompressible Navier-Stokes equations with application to rotating wings

Jakub Sistek, Fehmi Cirak

PII: DOI:

Reference:

S0045-7930(15)00302-3 10.1016/j.compfluid.2015.08.026 CAF 2992

To appear in:

Computers and Fluids

Received date: Revised date: Accepted date:

19 May 2015 23 August 2015 27 August 2015

Please cite this article as: Jakub Sistek, Fehmi Cirak, Parallel iterative solution of the incompressible Navier-Stokes equations with application to rotating wings, Computers and Fluids (2015), doi: 1Q.1Q16/j.compfluid.2Q15.08.026

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

Parallel performance of iterative solution techniques applied to three-dimensional low Reynolds number flows is studied.

Almost perfect scaling is achieved for Krylov methods wi Jacobi preconditioner and incomplete LU factorisation.

Largest problem with 3.3 billion unknowns solved on up 55.5 thousand processors in around thirty seconds.

New auxiliary algorithms for parallel mesh g and system matrix

assembly are introduced.

New highly resolved computations : flow around a rotating insect wing, its vortex structure and aerodynaic loading are provided.

Parallel iterative solution of the incompressible Navier-Stokes equations with application to rotating

Jakub Sisteka, Fehmi Cirakb

aInstitute of Mathematics, Academy of Sciences of the Czech Rep Zitna 25, 115 67 Prague 1, Czech Republic b Department of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, U.l

Abstract

We discuss aspects of implementation and performance of parallel iterative solution techniques applied to low Reynolds number flows around fixed and moving rigid bodies. The incompressible Navier-Stokes equations are discre-tised with Taylor-Hood finite elements in combination with a semi-implicit pressure-correction method. The resulting sequence of convection-diffusion and Poisson equations are solved with preconditioned Krylov subspace methods. To achieve overall scalability we consider new auxiliary algorithms for mesh handling and assembly of the system matrices. We compute the flow around a translating plate and a rotating insect wing to establish the scaling properties of the developed solver. The largest meshes have up to 132 x 106 hexahedral finite elements leading to around 3.3 x 109 unknowns. For the scalability runs the maximum core count is around 65.5 x 103. We find that ost perfect scaling can be achieved with a suitable Krylov subspace itera-

Email addresses: sistek@math.cas.cz (Jakub Sistek), f.cirak@eng.cam.ac.uk (Fehmi Cirak)

Preprint submitted to Elsevier

September 2, 2015

tive method, like conjugate gradients or GMRES, and a block Jacobi precon-ditioner with incomplete LU factorisation as a subdomain solver. In addition to parallel performance data, we provide new highly-resolved computations of flow around a rotating insect wing and examine its vortex structure and aerodynamic loading.

1. Introduction

The implicit computation of three-dimensional flow problems often requires parallel computing. In presence of highly resolved no-slip boundaries

the discretisation of the incompressible Navier-Stokes equations can lead to linear systems of equations with several hundred millions to a few billion

unknowns. In the course of a transient simulation this systems of equations have to be solved several thousand times. Hence, in order to achieve reasonable simulation turnaround times each system has to be solved within a few minutes. In combination with this computing time requirement, the large memory needs make it essential to use domain decomposition techniques and distributed-memory parallel computing platforms. As known, Krylov subspace iteration methods with efficient preconditioners are the only viable solvers on parallel computers with large processor counts [21, 48, 56]. In practice, efficient parallel algorithms for mesh handling and system matrix assembly are also relevant. The most efficient combination of iterative and preconditioning techniques usually depends on the specific application at hand. Finding a suitable combination can be greatly facilitated through the use of parallel linear algebraic solver libraries such as PETSc [4] or Trilinos [30]. In this work we make use of the PETSc library and compare techniques for the scalable solution of large-scale low Reynolds number flow problems with up veral billion unknowns.

to sevei

For computing the flow around a moving rigid body, such as a rotating insect wing, the Navier-Stokes equations can be expressed either in a non-inertial body-fixed frame or in an inertial frame using the arbitrary

Lagrangian-Eulerian (ALE) formulation [5, 34]. In both approaches a fixed body-fitted finite element mesh is used and there is no need to update the mesh. In our computations we use the ALE formulation and relate the prescribed wing velocity to the ALE mesh velocity. For the considered range of problems the solution of the Navier-Stokes equations with pressure-correction methods can be very efficient. Such methods reduce the solution of the original discretised Navier-Stokes equations to the solution of several smaller subproblems that are solved instead of the original equations [9, 28, 36, 45]. For instance, in the case of Taylor-Hood Q2-Q1 elements used in this work a mesh with ne elements leads to a system size of approximately (25ne x 25ne). With the pressure-correction method, three systems of convection-diffusion type of size (8ne x 8ne), one system of Poisson type of size (ne x ne) and one L2-projection of size (ne x ne) are solved. Moreover, the preconditioning of this smaller system matrices is more straightforward and easier to implement than the preconditioning of the original indefinite system matrix. In this work, we solve each of the five systems with a suitable Krylov subspace method and investigate the performance of additive Schwarz and block Ja-cobi preconditioners with complete and incomplete LU factorisations as local

solvers.

As m of inse per

itioners

mentioned, the driving application for the present work is the study

ct flight aerodynamics. See the textbooks [14, 50] and review pas [15, 16, 17, 18, 19] for an introduction to flapping flight. The relevant Reynolds numbers range from about 100 for a fruit fly to about 105 for large insects, such as dragon flies. In order to create sufficient lift insects crucially

rely on wings which flap with very high angles of attack (around 35°). This leads to separated flows with periodic vortex generation and shedding, which are exploited by insects to increase lift. The study of translating and rotating wings serves as a stepping-stone towards the understanding the more complex three-dimensional flapping flight. Both types of wing motions lead to the formation of a leading-edge, a trailing-edge and two tip vortices. However, this vortex structure is not stable for a translating wing, and it is periodically formed and shed, see [54] and references therein. Consequently, there are large fluctuations in the lift and drag coefficients of the wing. As first corroborated by the experiments of Ellington et al. [20] the leading-edge vortex for a rotating wing is stable and remains attached to the wing throughout the rotation. The low pressure zone at the vortex core immediately above the leading edge leads to a sustained large lift force. It is believed that the leading-edge vortex is stabilised by centrifugal and Coriolis accelerations, which create spanwise flow advecting vorticity from the leading-edge vortex. The exact mechanisms are however not yet well understood and there is an extensive amount of experimental research [2, 7, 35, 44] and some recent computational studies on the topic [24, 25, 29]. In this paper we present several new highly-resolved computations corroborating previous experimental and numerical findings.

The outline of this paper is as follows. Section 2 begins reviewing the impressible Navier-Stokes equations in ALE form for computing the flow around moving rigid bodies. Subsequently, their solution with the incremental pressure-correction method and their finite element discretisation are

introduced. Specifically, in Section 2.3 all the discretised subproblem sizes and types are given. In Section 3 the solution of the obtained discrete problems with parallel preconditioned iterative solvers is discussed. Efficient and scalable algorithms for partitioning of large meshes and assembly of large matrices are given. Section 4 is dedicated to numerical performance studies and presents several new highly resolved computations. First, in Section 4.1 the developed computational approach is validated and verified with the widely studied flow around an inclined flat plate. Subsequently, in Section 4.2 the flow around a rotating insect wing is used to investigate the mathematical and parallel scalability of various preconditioned iterative methods. Finally, the identified most efficient iterative methods are used to study the Reynolds

number dependence of the vortex struc

und a rotating wing.

Ts,u = w

Too, u = 0

Figure 1: Rotating insect wing (fig) with a body fitted fluid mesh. In the used ALE formulation the mesh and the wing rotate with the prescribed angular velocity w.

2. Pressure-correction method for Navier-Stokes equations in ALE

In this section we briefly review the arbitrary Lagrangian-Eulerian (ALE) formulation of the incompressible Navier-Stokes equations and their finite element discretisation. The discussion is specialised to the simulation of flows around rotating rigid bodies, see Figure 1. For time discretisation we use the implicit Euler scheme in combination with the semi-implicit pressure-correction technique. At each time step the solution of the Navier-Stokes

uations is reduced to the solution of a sequence of convection-diffusion and

equatio

sson problems.

2.1. Governing equations

We consider the rotation of a rigid body with domain and boundary rS embedded in a fluid domain Q. The boundary of the fluid domain r is comprised of two disjoint parts rS and rTO, r = rS U rTO. The boundary rS is the common interface between fluid and rigid body and is the free-stream boundary. The rotation of the rigid body is prescribed with the angular velocity vector w, the centre of rotation xO and the corresponding velocity

w = w X (x - Xo). (1)

A computationally efficient approach for simulating the flow field generated by the rigid body is to consider the Navier-Stokes equations in a domain moving with velocity w, i.e., Dau

+ ((u - w) ■ V)u - vAu + Vp = 0, (2a)

V ■ u = 0, (2b)

where u is the fluid>velocity v is the kinematic viscosity and p is the normalised pressure. The time derivative in (2) is the ALE derivative

DAu du „

sr = St +w ^ (3)

For further details on the ALE formulation of Navier-Stokes equations see, e.g., [8, 13]. The Navier-Stokes equations are complemented by the following boundary conditions and the initial condition

u(t, x) = 0 on rœ,

u(t, x) = w on rS, (4)

u(t = 0, x) = 0 in^.

2.2. Incremental pressure-correction method

For discretising the Navier-Stokes equations (2) in time, we use the backward Euler method with constant interval length At. In addition, we linearise the nonlinear convective term (2a) with a semi-implicit approach leading to the discretised equations

where the index n indicates the variables associated with the time step tn.

With a view to parallelisation, the time-discretised semi-implicit Navier-Stokes system (5) can be efficiently solved with a pressure-correction, or a fractional-step, method [9, 36]. A review and mathematical analysis of some of the prevalent pressure-correction approaches can be found, e.g., in [28, 45]. The specific method we use is the incremental pressure-correction method in rotational form as discussed in [47] and summarised below. To begin with, we define the pressure increment ^n+1 in order to keep the derivations compact

un+1 + ((un - w) ■ V)un+1 - vAun+1 + Vp

V ■ u

To compute the __ and pressure fields (un+1,pn+1) at time tn+1 three

successive subproblems are considered.

1 = pn+1 - pn + v V ■ un+1.

2. Next, the pressure increment is obtained by solving the Poisson problem

-A^n+1 = --1V ■ un+1 for x e Q, A-

At (8)

d%LU+1 (8)

= 0 for x e u rs.

3. Finally, the pressure field pn+1 is updated with

pn+1 = + - vV ■ un+1. (9)

Note that there is only the velocity field u at times tn+ and tn in these three equations. The intermediate and the end-of-step velocities familiar from conventional pressure-correction met hods have been consolidated to one velocity field, see [28] for details. Furthermore, we do not apply any subiterations within each time step. As discussed in [47] it is possible to employ subiterations in order to improve accuracy of the projection scheme.

The weak forms of the three subproblems (7), (8) and (9) are needed for their finite element discretisation, see e.g. [21, 27]. To this end, we introduce the function spaces

V := {v e [H 1(Q)]3, v = 0 on v = w on rs} ,

CV := {v e [H 1(Q)]3, v = 0 on r} ,

where H 1(Q)

is the usual Sobolev space.

The weak form of the convection-diffusion equation (7) reads: find un+1 e V such that

(un+1, v) + c(un, un+1, w, v)+a(un+1, v) =

l-(un, v) - (V(pn + v) Vv E

(u, v) = / u • v dx,

a(un+1, v) = v / Vun~ J n

c(un, un+1, w, v) = J ((un - w) ■ V)un+1 ■ v dx.

Notice that the Cartesian components of the momentum equation (10) are decoupled, and (10) reduces to three independent scalar convection-diffusion equations. The weak form of the Poisson equation (8) for the pressure increment reads: find ^n+1 G H 1(H) such that

(V^n+1, Vq) = - A (V- un+1,q) for all q G H 1(Q). (11)

This is a pure Neumann problem and has a one-dimensional nullspace consisting of constant functions, which has implications for its numerical solution, see Section 3.1. Finally, for updating the new pressure field with (9) we use the L2-projection: find pn+1 G H1 (Q) such that

(pn+1,q) = (pn + - vV • un+1,q) for all q G H^fi). (12)

This projection is only relevant in the finite element context because the divergence of the discrete velocity field V ■ u is in general discontinuous and the discrete pressure p and pressure increment fields ^ are continuous.

2.3. Finite element discretisation

The time-discretised Navier-Stokes equations in the incremental pressure-correction form are discretised in space with hexahedral finite elements. Although we use the ALE description of the Navier-Stokes equations, there is no need to solve for mesh position and velocity since both are prescribed. As known, the basis functions for discretising the velocity and pressure fields have to be carefully chosen so that they satisfy the inf-sup, or Babuska-Brezzi stability, condition, see [21, 27]. We use the Taylor-Hood Q2-Q1 elements discretising the velocity and pressure fields with tri-quadratic and tri-linear basis functions, respectively. In the resulting finite element mesh Th, there are nu velocity nodes and np pressure nodes with their ratio being nu/np ~ 8. Notably, in our computations we do not employ any convection stabilisation so that, in effect, performing a direct numerical simulation. The element sizes are chosen such that the boundary layers appearing in the flow are resolved.

Let us now investigate the systems of linear equations resulting from the discretisation of the weak forms (10), (11) and (12) closer. The approximation of the velocity and pressure fields with the Taylor-Hood elements reads

Her Her

U2,i \U3,i J

^ = E ph = E (13)

Here and ^ are the tri-quadratic and tri-linear basis functions, respectively, associated to the finite element node with index i. Moreover, the nodal

unknowns are assembled into the global arrays

ud = (Ud,1, . . .,Md,n„)T G

^ = (^1 )T G Rnp,

P = (P1,...,Pnp)T G

With these definitions at hand the weak forms (10), (11), spond, respectively, to the linear equation systems

AMu + N + vAu) un+1 = AMuud - Pd (p

Ap<0n+1 = - —

with d G {1, 2, 3}, (15)

,pn+1 = Mp(pn + ^n+1) - v ^ Bdun+1,

with the matrices

M Mp N A

.....5.

dx with i, j = 1,..., n.

p — / dx with i, j = 1,..., np, Jn

/ ((un - w) ■ V^j) ^ dx with i, j = 1,

V^i ■ V^j dx with i, j = 1,..., nu

Ap = / V& -V£j dx with i,j = 1,

,n„,

Pd = ^tt^ dx with i = 1,..., nu, j = 1,..., np, d =1, 2, 3, Jn dxd

Bd = f ^dx with i = 1,..., np, j = 1,..., nu, d =1, 2, 3. Jn dxd

(16) (17)

(18a) 18b) (18c) 18d) 18e) 18f) (18g)

velocities pressure increment pressure

a o number (10), (15) (11), (16) (12), (17)

& type convection-diffusion Poisson L2-projection

CD unknown(s) Ml, u2, u3 .o P y

matrix size properties nu x nu nonsymmetric np x np sym. pos. semidefinite ^O1 V np x np sym. pos. definite

Table 1: Summary of the properties of the five linear systems of equations solved in each time step.

Note that for each velocity component one independent equation (15) is solved. Some properties of the linear equation systems (15), (16) and (17) relevant to the selection of suitable iterative solution methods are summarised in Table 1.

n of su

3. Parallel iterative solvers and implementation

Next we introduce the solution of the linear systems of equations resulting from the finite element discretisation of the incremental pressure-correction method. The considered class of fluid problems have up to several billions unknowns and the target parallel architectures have up to hundred thousand processors. For such problems Krylov subspace methods with efficient preconditioners are the only suitable solution technique. In practice, the scalability of the overall finite element technique also depends on the efficiency of the data structures and algorithms for mesh decomposition and handling, and assembly of the systems matrices. In our finite element software open-FTL we make extensive use of the C++ STL [32, 53, 10], METIS [33] and PETSc [4] libraries in order to achieve efficiency and scalability. Specifically, the use of PETSc enables us to perform a number of numerical experiments to identify the most suitable combinations of Krylov subspace methods and preconditioners.

3.1. Parallel preconditioned iterative solvers

We first provide a brief review of the parallel preconditioned Krylov subspace methods in order to fix terminology and notations. For details we refer to standard textbooks, e.g., [21, 40, 48]. Our discussion is restricted to iterative solvers and preconditioning techniques that are available in PETSc and that we use in our numerical computations.

The linear systems of equations introduced in Section 2.3 are of the generic form

Au = f. (19)

for form

The symmetry and the specific entries of the system matrix A and the right-hand side vector f depend on the considered problem. We use GMRES [49] or BiCGstab [58] for systems with a nonsymmetric matrix A and the conjugate gradient method [11] for systems with a symmetric matrix A. Moreover, a preconditioning technique is necessary in order to improve the performance of the iterative solvers. To this end, we consider the (left-)preconditioned equation system

PAu = Pf, (2°)

where P is a suitable preconditioning matrix that approximates A-1 (in some sense). The specific choices of preconditioners will be discussed in the following. For the subsequent discussions, it is also relevant to recall that for implementing preconditioned Krylov subspace methods only matrix-vector products with the system matrix A and the preconditioning matrix P are needed.

On a (distributed-memory) parallel computer the equation systems (19) and (2°) are only available in a distributed format. The partitioning of both equation systems results from the partitioning of the computational domain Q (and the corresponding triangulation) into nd possibly overlapping subdomains Qi, with i = 1,..., nd. The overlap is a prescribed layer of elements between the subdomains. In our computations the number of subdomains nd is equal to the number of available processors. The matrix-vector product with the distributed system matrix A is straightforward and can be assembled from subdomain contributions. The matrix-vector product with P depends on the specific form of the preconditioner.

In this work, we consider as parallel preconditioners the block Jacobi

and the overlapping additive Schwarz methods available in PETSc, see, e.g., [46, 48, 51, 57] for details. These one-level methods are not mathematically scalable for elliptic problems, such as the Poisson problem for the pressure increment [57]. It is necessary to use a two-level method in order to achieve mathematical scalability, i.e. convergence independent of the number of subdomains in a weak scaling test. Nevertheless, in our experience, the considered one-level methods perform reasonably well for the linear systems introduced in Section 2.3, with the most critical being the Poisson problem for the pressure increment. The state-of-the-art two-level methods include BDDC and FETI [12, 22, 23]. In these methods the challenge is the scalable solution of the coarse problem, which is an active area of research, see e.g. [3]. A possible solution is offered by the multi-level extension of BDDC [41, 52]. Nevertheless, the multi-level methods should be avoided as long as a one-level method performs well, especially in a massively parallel environment.

In both the block Jacobi and the overlapping additive Schwarz methods, the preconditioner P^is defined as the sum of local subdomain matrices p, i.e.,

ERJPiR- (21)

where Rj is a 1 — 0 matrix which maps the local degrees of freedom in the interior of the subdomain Q to the global degrees of freedom. The subdomain

matrix Pi is an approximation to the inverse of the local system matrix Ai and is defined as

Pi « A-1 = (RiART) . (22)

Applying the inverse of the local system matrix A-1 represents solving a local Dirichlet problem because the matrix Ri does not include the degrees

of freedom at domain boundaries. The multiplication of a vector r with the preconditioner P can now be established using

«d nd

Pr = ^ RjPiRr = J2 RTP^ Y, RTwi,

i=1 i=l i=l

with the local subdomain specific vectors ri = R^r and wi = Piri in mind Pi ~ A-1, the local vector wi is the (possibly approxima of the local problem

AiWi = ri.

This problem is solved independently on each subdomain. The local solves are performed with a direct method, so t hat an LU or LLT factorisation is performed during the set-up of the preconditioner, and only backsubstitu-tions are performed during the subsequent Krylov iterations. Performing a complete LU (or LLT) factorisation is called exact factorisation. Often one can save both time and memory by using an incomplete LU factorisation with prescribed allowed fill-in, see, e.g., [48]. In this regard, ILU(0) is the basic approach which discards all entries of the factors which fall outside the sparsity pattern of the original matrix. While ILU(1) and ILU(2) improve the approximation of the inexact factorisation, they require new analysis of the sparsity structure of the factors and lead to longer times for both factorisation and back substitution.

A final remark concerns the solution of the pressure-correction problem (16) which is a pure Neumann problem for the considered fluid flow problems with only Dirichlet boundary conditions. The corresponding symmetric matrix is singular with the nullspace spanned by constant vectors. In this case, the problem is solved only in the orthogonal complement of the

lem pro

nullspace. Namely, if we denote with z = (1,1,..., 1)T the basis vector of null (A), we can construct the orthogonal projection on its complement as

Q = I--T~zzT. If this matrix is applied after every multiplication with

A and P, the iterations are confined to the subspace orthogonal to and the following modified system is solved

tion with ) null(A),

QPQAu = QPQf. (25)

The preconditioned conjugate gradient method in this subspace is referred to as deflated PCG.

3.2. Implementation details 3.2.1. Domain partitioning

As elucidated in preceding Section 3.1, the parallel solution of the discre-tised finite element equations relies on the partitioning of the domain into subdomains and assignin^the domains to different processors. In general, the number of subdomains is chosen equal to the number of available processors. In the computations presented in this paper the discretised domain is a block-structured hexahedral mesh and is generated with the GMSH mesh generator [26]. The subdomains are obtained by partitioning the mesh with METIS. The size and shape of each subdomain is chosen such that interpro-cessor communication is minimised and each processor is equally utilised.

Our METIS -based mesh partitioning algorithm is shown in Algorithm 1, ee also Figure 2. As the first step we construct the dual graph of the finite element mesh. In the dual graph each finite element is a vertex and the edges of the graph represent two adjacent finite elements. Subsequently, we partition the dual graph with METIS. The partitioned graph gives a partitioning

Figure 2: Partitioning of a mesh into two nonoverlapping subdomain meshes. The nodes on the subdomain boundaries are uniquely assigned to one of the subdomains.

of the finite element mesh into nonoverlapping subdomains so that each finite element is uniquely assigned to a particular subdomain. Next the finite element nodes are assigned to subdomains. First, the nodes inside a subdomain are assigned to the respective subdomain. Subsequently, the nodes at subdomain boundaries are randomly assigned to the attached subdomains so that each has a similar number of nodes. In the last step we assign to each node a unique (global) ID by sequentially looping over the subdomains and consecutively numbering the nodes. Finally, for performance considerations during assembly it is necessary to form overlapping partitions so that the system matrices can be assembled without interprocessor communication.

Algorithm 1 Partitioning of the mesh into subdomains

1. Create the dual graph of the computational mesh.

2. Create a non-overlapping partitioning of elements into subdomains by partitioning the dual graph (using METIS).

3. Derive a partitioning of nodes such that all nodes present in a single subdomain are local to the processor and distribute shared nodes to neighbouring subdomains based on balancing their size.

4. Assign each node a unique global ID by looping over all subdomains and all nodes in each subdomain.

5. Build overlapping clusters of elements as a union of all elements contributing to local nodes.

3.2.2. Overlapping partitions for fast assembly

The partitioning of the finite elements and nodes into subdomains implies a partitioning of the system matrices and vectors into processors. Recall that each row of the global system matrix represents a node in the mesh, precisely one of its degrees of freedom. Moreover, the domain introduced in Section 3.2.1 ensures that the degrees of freedom associated to a domain lie all within a certain range. Hence, consecutive blocks of rows of the system matrix can be assigned to processors. In PETSc this is achieved with the MPIAIJ matrix format.

The rows of the global system matrix corresponding to finite element nodes at the subdomain boundaries receive contributions from several subdomains. During the assembly this requires frequent interprocessor communication. In practice, the associated time overhead for assembly turns out to be excessively time consuming and presents a major performance bottleneck for large problems. In order to resolve this issue it is possible to eliminate any interprocessor communication during assembly. This can be achieved by providing each processor all the elements and nodes necessary for independently assembling its rows of the global system matrix. Therefore, we modify the partitioning introduced in Section 3.2.1 so that each subdomain stores in addition to its elements also elements of the neighbouring subdomains that contribute to local matrix rows, see Figure 3. Evidently this leads to the tion of overlapping partitions. This can be accomplished using the parti-ed dual graph provided by METIS and the global transposed matrix of element connectivity, i.e. for each node, the list of elements surrounding it.

notion tioned elemen

Part 1 Part 2

PL, (M

£0 Ph

6x6 6x8

8x6 8x8

Partition 1

Partition 2

Figure 3: Partitioning of a mesh with 14 nodes into two overlapping subdomain meshes and a sketch of the corresponding matrix. The first six rows of the system matrix are assigned to part one and the remaining eight to part two. The overlapping layer of elements ensures that the system matrix corresponding to each part can be assembled independently.

coli std

3.2.3. On-the-fly assembly and intermediary data container

In our finite element software openFTL, sparse matrices are directly assembled to an intermediary data container without determining the sparsity pattern of the matrix beforehand. The matrix is assembled on-the-fly while looping over the elements in the mesh and copying their contributions to the intermediary data container.

In the implementation of the intermediary data container, we make extensive use of the C++ STL library, specifically the std::pair object, std::vector container and std::sort algorithm. Similar to the coordinate sparse-matrix storage format we represent the sparse matrix as a vector of triplets (i, j, Aj), where Aj is an entry with the row index i and column index j. In C++ STL the type of each entry is chosen to be std::pair< std::pair<int, int>, double>. The key idea of the on-the-fly assembly is that matrix entries are first one after the other appended to the end of the vector. The vector is subsequently sorted and triplets with the

sorted entries

allowed overhead

new entry (h3>Aij)

unsorted

Figure 4: Schematic illustration of an insertion into the sparse matrix forma entry is appended at the end of the vector until the limit of allowed and a re-sorting is performed.

same row and column index (i and j) are combined to one triplet by summing the values A, of the matrix entries. Note that we could use instead of the std::vector container the sorted std::multi_map container. Although this would eliminate the sorting step, the insertion into an std::multi_map is substantially slower than into an std::vector. See also [42] for a discussion on the use of std::vector versus std::multi_map .

A step-by-step description of the on-the-fly assembly algorithm is given in Algorithm 2. This algorithm is to be read in conjunction with Figure 4. Step 1 of the algorithm is straightforward in the sense that the matrix entries are appended to the vector with a simple push_back operation. In step 2 we sort the triplets with the std::sort which must have O(n log n) complexity in C++11, with n being the length of the vector [32, 53]. Subsequently, in step 3 the entries are combined, i.e. assembled, in linear time, and they are stored

fsnd of the assembled part of the same vector, hence the assembly is ormed in-place.

Due to memory restrictions, it is usually not possible to process all matrix contributions in one go. The vector is sorted and assembled in fixed

Algorithm 2 The in-place assembly of the matrix in the coordinate format

1. Loop over elements while appending each contribution to the global matrix as a new triplet (i,j,Aj) at the end of the vector.

2. Sort the vector with the std::sort algorithm primarily accor the row index i and seeondari, aeeord.ng to the eo.umn index j (with

dex j (wit

the standard std::pair comparison functor).

3. Loop over the vector, sum all entries with the same index pair (i,j) and store them at the end of the already processed part of the vector.

4. Truncate the allocated memory to the actualSsize of the assembled vector.

prescribed intervals. In this way, we control both the memory overhead and number of sortings. The frequency of the intervals depends on a user prescribed allowed memory overhead. In Figure 5 a numerical timing study for the assembly of the system matrix of a 2D and 3D elasticity problem are reported. The study was performed on a single core of the Intel Core i7 CPU with frequency 2.7 GHz. We can observe that the time spent by sorting grows very slowly with increasing allowed memory overhead. In contrast, the number of sortings decreases linearly with increasing allowed memory overhead. Hence, the total time is clearly dictated by the number of sortings used during the assembly. Therefore, for achieving good performance the allowed memory overhead should be chosen as large as possible. Furthermore, it can be seen in Figure 5 that the time for insertion of entries is mostly lower than the total sorting times and it is independent of the allowed overhead.

After all the finite element contributions are processed with Algorithm 2,

time for all insertions time for all sortings time per sorting -* total time

allowed memory overhead [MB]

(a) 2D case

Figure 5: The runtime of the on-the-fly assembly in dependence of the prescribed allowed memory overhead. (a) Bi-quadratic quadrilateral element system matrix of size 722 x 103 with 23.1 x 106 nonzeros (^445 MB) and 29.2 x 106 insertions. (b) Tri-quadratic hexahedral element system matrix of size 207 x 103 with 37.5 x 106 nonzeros (~801 MB) and 52.5 x 106 insertions.

we obtain a vector of the assembled matrix in the coordinate format, sorted primarily by rows and secondarily by columns. This allows us to quickly determine the structure of the PETSc matrix on each processor and to perform an exact pre-allocation of memory. Subsequently, all the vector entries are copied into the PETSc MPIAIJ matrix (using MatSetValue function with the INSERT_VALUES mode). Moreover, due to the overlapping partitions discussed in Section 3.2.2, there is no need to transfer stiffness matrix data between the processors. Therefore, in our computations the assembly of the PETSc matrix (by MatAssemblyBegin and MatAssemblyEnd functions) takes

ligible time.

4. Numerical performance studies and results

In this section we first validate and verify our computational framework by analysing the flow around a low-aspect-ratio inclined flat plate. Subsequently, we consider the flow around a rotating insect wing to compare the performance of various Krylov solvers and their parallel scalability in combination with block Jacobi and additive Schwarz preconditioners. At the same time, we elucidate and compare the flow structures and aerodynamic forces for translating and rotating wings, especially the formation and persistence of leading-edge vortices. We generate all our finite element meshes using the GMSH [26] as block-structured boundary-conforming meshes. However, during the solution process the meshes are considered as unstructured. As finite elements we use the Taylor-Hood Q2-Q1 elements. In order to resolve the flow field without the need for convection stabilisation sufficiently fine grids are used.

The considered flows eir numerical solution is strongly dependent

on the Reynolds nur

Re = UvL, (26)

where uv is the characteristic fluid speed (e.g., free-stream velocity), L is the characteristic length of the wing and v is the kinematic viscosity. In our computations we always choose uv = 1 and the Reynolds number is altered by modifying the kinematic viscosity.

for for

by m.....

the result of our computations we provide plots of the flow fields in the form of isosurfaces of the Q-value. The Q-value is the second invariant of the velocity gradient tensor Vu, and it is widely used to visualise vortices [31].

In incompressible flows the second invariant can also be expressed as

Q = 2 (||n||2 -||S||2) , (27)

with the antisymmetric vorticity tensor Q = 1 (Vu — (Vu)T), the symmetric strain-rate tensor S = 2 (Vu + (Vu)^ and || ■ || denoting the Frobenius norm. Informally, in flow regions with Q > 0 the vorticity is larger than the strain rate, which indicates the presence of vortices (i.e., regions with swirling-type motion). More in-depth discussion of applicability of the Q-criterion can be found in, e.g., [37, 38, 39].

We also report the aerodynamic forces acting on the wing in the form of non-dimensionalised force coefficients

where F is a component of the force resultant vector F, p is the density of the flow and S is the planform of the wing. In all our computations the flow density is p = 1. The coefficient C represents usually the drag CD or lift CL depending on the component of the considered force vector F. The force resultant F is the integral of the boundary tractions, i.e.,

CF = a(u,p) ■ ndr, (29)

h is equal to the sum of the reaction forces of all the finite element nodes ed on the wing.

All performance and scalability studies are performed on the Cray XE6 supercomputer HECToR1 (Phase 3). This computer is composed of 2816

2 o> (28)

1http://www.hector.ac.uk

XE6 compute nodes, each of which has 32 GB memory. A compute node contains two AMD 2.3 GHz 16-core processors giving a total of 90 112 cores, with 65 536 cores being the maximum handled by the job scheduler. Cray Gemini chips are used for communication through a high-bandwidth : network.

4-1. Flow around an inclined flat plate 4-1.1. Problem definition and discretisation

The inclined flat plate represents some of the flow features typical for animal locomotion in air and water and has been experimentally and numerically studied by a number of authors, including Taira and Colonius [54] and Wang and Zhang [59]. As in these two references we consider a rectangular thin plate with the chord-length c =1 and the span 2c resulting in the aspect ratio A =2, see Figure 6. The thickness of the plate is 0.01875c. The bounding box of the computational flow domain is a rectangular box with dimensions [-10, 21] x [-10,10.01875] x [-5, 7]. From this rectangular box, a smaller axis-aligned cuboid representing the thin plate is subtracted in order to obtain the computational fluid domain. The position of the flat plate in the fluid domain is [0,1] x [0, 0.01875] x [0, 2].

The outer rectangular box is discretised by 210 x 110 x 120 elements along its length, height and width, respectively. Each of the cuboidal finite elements in the domain are axis-aligned with the domain boundaries and become progressively smaller close to the flat plate. The inner fluid boundary representing the wing is discretised by 100 x 10 x 80 elements along its chord, thickness and span directions, respectively. This discretisation leads to approximately 2.5 million elements and 20.6 million nodes.

The boundary condition at the plate surface is set to no-slip, u(t, x) =

Figure 6: Flow around an inclined plate. Computational mesh with the wing and its wake at Re = 300 and an angle of attack of 30°. All the mesh lines and the flat plate aligned with the coordinate system.

f the box

0, and at the outer surface of the box to free-stream velocity u(t, x) = cos a, usin a, 0)T. The free-stream velocity magnitude is = 1 and the angle of attack is a. Starting with an impulsive at t = 0, the time dependent problem is solved for t G [0,18]. The time-step size is chosen with At = 0.02 resulting in 901 time steps in a typical run. The Courant number based on the free stream velocity is At/hx ~ 1.64, where is the smallest g the chord. Moreover, the change of the angle of attack is achieved through changing the direction of the flow rather than changing the mesh in the computations.

in t trai

4-1-2. Flow characteristics and forces

Before proceeding to validation and verification, we present the results of our computations for (chord-length based) Re = 100, 300 and 1200 and the angle of attack of 30°. Our aim is to illustrate the Reynolds number dependence of the flow characteristics and forces, which in turn determine the spatial and temporal resolution needs of the discretisation. In Figure 7 the isosurfaces of the Q-value are shown. At time t = 1.0, in all plots a leading-edge and two tip vortices can be identified. Furthermore, for Re = 300 and 1200 also a convected trailing-edge vortex (starting vortex) is visible, which is for Re = 100 not strong enough to be shown by the Q = 2 isosurface. For later times, as a general trend the complexity of the observed vortex structures becomes more pronounced with increasing Reynolds number due to the decrease in the diffusivity of the flow. At time t = 3.0 for Re = 300 and 1200, there are two columnar tip vortices and an already pinching off leading-edge vortex is visible. This process continues with consecutive formation and shedding of leading-edge vortices as visible for time t = 5.0 and t = 8.0 for Re = 300. In order to be conclusive about the vortex structures observed for Re = 1200 at t = 5.0 and t = 8.0 computations with finer meshes are needed.

It is instructive to consider the Q-value plots in Figure 7 in conjunction with the history of drag and lift coefficients in Figure 8. As an artefact of the impulsive start, the coefficients have a large peak in the immediate vicinity of t = 0 which is not physically relevant. The subsequent sustained increase he lift coefficient occurs while the leading-edge vortex is formed and the trailing-edge vortex is advected. This is due to the low pressure zone created by the leading-edge vortex above the wing. The obtained maximum lift co-

efficient becomes larger with increasing Reynolds number. The difference in the drag coefficients corresponding to the maximum lift coefficients is far less pronounced. As a result the aerodynamic efficiency (lift coefficient divided by the drag coefficient) is proportional to the Reynolds number. After the leading-edge vortex detaches for Re = 100 the lift and drag coefficients reach a steady state. In contrast, for Re = 300 and Re = 1200 both coefficients continue oscillating in line with shedding of vortices. For a more in-depth discussion of the relevant flow characteristics we refer to [54].

4-1.3. Validation and verification

We compare our results with the experimental and computational results of Taira and Colonius [54]. They report for Re = 300 only computational and

for Re = 100 computational and experimental drag and lift coefficients for

rimental

various angles of attack. The same set-up is also computationally investigated by Wang and Zhang [59] for Re = 100. In [54] and [59] the discretisation is based on the immersed boundary method and the plate is assumed to have zero thickness. Since for Re = 100 the flow quickly reaches a steady state for all considered angles of attack, it is meaningful to compare the steady state coefficients. In Figure 9 our steady state drag and lift coefficients at time t =13 and angles of attack a = 10°, 30°, 50°, and 70° are compared with the ones presented in [54] and [59].

One can observe that for a = 10° and 30° our values are in excellent agreement with the experimental data and other computational results. The difference in CD can be attributed to the thickness of the plate, which is about a half of that used in the experiment, while it is ignored in other computations. The agreement is slightly worse for a = 50 and 70, where

Re = 100 Re = 300

Re = 1200

Figure 7: Flow around an inclined plate. Isosurfaces of Q = 2 at different time instants, = 30°.

1.4 1.2 1 0.8

0.4 0.2 0

0 2 4 6 8 10 12 14 16 18 t

Figure 8: Flow around an inclined flat plate. Comparison of drag (left) and lift (right) coefficients for an angle of attack of 30°, Re = 100, 300, and 1200.

Taira and Colonius, experiment Taira and Colonius, comp. Wang and Zhang, comp.

present comp.

0 10 20 30 40 50 60 70 80 90 a [deg]

igure 9: Comparison of steady drag CD (left) and lift CL (right) coefficients at Re = 100 for different angles of attack a with experimental and computational results in Taira and Colonius [54], and computational results by Wang and Zhang [59].

a (deg) reference Re = 300 max. CL t Re = 1200 max. CL t

10 [54] our result 0.46 1.63 0.43 1.4 0.48 2.68

30 [54] our result 1.29 1.68 1.25 1.66 — — 1.46 2.88

Table 2: Flow around an inclined flat plate. Maximum lift coefficients and corresponding times for two different Reynolds numbers and angles of attack. For Re = 300 results are compared to [54].

the computation seems to be affected by the interaction of the wake with coarser mesh above the plate. As mentioned, the change of angle of attack is achieved through changing the direction of the flow. At last, in Table 2 the maximal lift coefficients and the times at which they are attained are given for Re = 300 and Re = 1200. Our results are in good agreement with the computational results in [54] for Re = 300.

4.2. Flow around a rotating insect wing 4.2.1. Problem definition and discretisation

It is well-known that a rotating wing generates a leading-edge vortex ich remains attached to the wing. The attendant sustained lift forces ieved to be crucial for the success and efficiency of flapping flight in nature. In order to study the vortex formation and forces generated by rotating wings, we consider a fruit fly (Drosophila melanogaster) wing, as experimentally studied in [43], at an angle of attack of 40° rotating around

a vertical axis near its root, see Figure 10. The length of the wing is R; its aspect ratio is A = R2/S = 3.1, where S is the planform area; and its thickness is 0.01 R.

As shown in Figure 11, the computational fluid domain consis cylinder with radius 2.9 R and an axial hole with radius 0.016 R. The height of the cylinder is 4.7 R. The block structured finite element mesh is generated with GMSH [26] with the block topology depicted in Figure 12, although it is handled as unstructured in the solver. The mesh is refined towards the wing, resulting in 30 x 5 x 30 elements along the chord, thickness, and span of the wing, respectively, see Figure 11. The whole fluid domain contains 266 x 102 x 77 Taylor-Hood elements along circumference, height, and radius of the cylinder, resulting in approximately 2.1 x 106 elements and 16.8 x 106 nodes.

----------------

t G [0,1], the wing rotates with a constant angular velocity until t = 7.7. At t = 7.7 one full revolution is completed. With the uniform time-step size At = 0.002 the whole computation requires 3850 time steps. The velocity of the tip of the wing is prescribed with |utip| = 1, resulting in |uiip|At/hiip ~ 0.1, with denoting the smallest element size near the tip of the wing. The angular velocity is u = |utip|/R. An important length is the radius of

gyration Rg = a/S2/S, where S2 is the second moment of the wing with spect to the axis of rotation |16'291'For the considered wing geometry the

ius of gyration is Rg = 0.5255 R.

The reported Reynolds numbers are based on the velocity at the radius

respe radiu T

(b) Isometric view

Figure 10: Flow around a rotating insect wing. Geometry and prescribed motion of the )rosophila wing (with r = 0.0625 R and a = 40°).

Figure 11: Flow around a rotating insect wing. Computational mesh with wing and its wake at Re = 526 for angle of attack of 40° with approximately 2.1 x 106 finite elements

and 16.8

16.8 106 nodes.

Figure 12: Flow around a rotating insect wing. Topology of the 27 bloc uctured

mesh used for meshing the computational domain. The Drosoph g geometry is

mapped to the inner block (at the intersection of the dark The mesh is connected into a ring at the lightly shaded faces.

of gyration urg, i.e.,

which is altered by choosing a suitable kinematic viscosity v. In order to study the flow in laminar and transient regimes we consider four different Reynolds numbers Re G {105, 263, 526,1051}2.

4-2.2. Performance comparison of the iterative solvers

To begin with, we aim to identify the most efficient combination of Krylov subspace methods and preconditioners suitable for solving the systems of equations resulting from the discretisation of the Navier-Stokes equations. As summarised in Table 1, in the incremental pressure-correction approach five linear systems of equations are solved at every time step. Namely three equations for the update of the velocity components, one equation for the

2 The selected values of Reynolds numbers correspond to the tip velocity based Reynolds numbers Retip = R € {200, 500,1000, 2000}.

the the

update of the pressure increment and one equation for the pressure update. We only consider iterative solvers and preconditioners that are available in PETSc (version 3.2).

For all numerical studies in this section we use the block-structured mesh described in the foregoing section and shown in Figure 11 (with 2.1 x 106 elements and 16.8 x 106 nodes), unless stated otherwise. The number of subdomains and utilised processors is chosen to be 2048 and the Reynolds number for the flow is Re = 1051. The reported iteration counts and times are averaged over the initial 400 time steps, 200 of which are in the acceleration stage.

First, we consider the velocity update which involves three independent discrete convection-diffusion type equations (15). Each of the three equations has the same system matrix. Since these are nonsymmetric we use the GMRES [49] and BiCGstab [58] methods. In Table 3 the performance of both methods with no preconditioner and with block Jacobi precondi-tioner are compared. For solving the subproblems of the diagonal blocks in the block Jacobi preconditioner (see Section 3.1), we use ILU preconditioner with no fill-in ILU(0), with prescribed additional fill-in ILU(1) and ILU(2), and a complete sparse LU factorisation. In the last case, the MUMPS direct solver [1] is used. In all computations, the preconditioner is set up in every time step once and used for all three velocity components. The time for the set-up is included in the computational time for the velocity component u1.

In Table 3 we can see that GMRES and BiCGstab perform similarly well, the former being marginally faster. In terms of number of iterations, we recall that two actions of the system matrix as well as of the preconditioner are per-

formed within each iteration of BiCGstab. Hence, the number of iterations for BiCGstab should be about one half of those by GMRES for comparable accuracy. Moreover, it is evident from Table 3 that a preconditioner is crucial. It is interesting that the convergence of the preconditioned methods does not significantly improve with better approximation of the incomplete factors, moving from ILU(0) to ILU(2), while it increases the cost of the solve for the first velocity component drastically. Surprisingly the full LU factorisation of the diagonal blocks with MUMPS is faster than ILU(1) and ILU(2), but still more expensive than ILU(0). As a conclusion, the ILU(0) appears to be the best local solver in combination with the block Jacobi preconditioner for the velocity components. However, our most recent studies indicate that even a diagonal Jacobi preconditioner appears to be competitive in terms of computing time.

Continuing with the velocity update, we also investigate the algebraic versions of the additive Schwarz method (ASM) with one or two elements of overlap. In the PCASM preconditioner of PETSc the clusters of overlapping elements are reconstructed from the graph of the local matrices without overlaps. The ASM simplifies to the block Jacobi preconditioner when no overlap is used. The corresponding results are presented in Table 4. It can be observed that an overlap is capable of improving the preconditioner (cf. results for block Jacobi preconditioner in Table 4) by significantly reducing ber of iterations. However, in terms of computational time, the time nt on the set-up of the preconditioner is too high to be amortised by the slightly faster solution times of the ASM method with one element overlap. Next, we consider the update of the incremental pressure field ^ by solv-

results f the num spe slig

method prec. local sol. number of iter. min-max(avg.) u1 u2 u3 avg. sol. time (s) u1 «2 u3

GMRES no o o a ILU(0) ILU(1) ILU(2) LU 1-313(259.0) 0-10k(9.8k) 2-1.1k(0.9k) 7-14(12.6) 0-13(11.9) 7-15(13.7) 7-14(12.6) 0-13(11.7) 7-14(13.4) 7-14(12.0) 0-13(11.0) 7-14(12.7) 7-14(12.6) 0-13(11.7) 7-14(13.4) 1.49 55.67 5.26 0.20 0.11 0.12 3.10 0.32 0.36 29.67 0.72 0.82 1.01 0.27 0.30

BiCGstab no bi o o a 3 ILU(0) ILU(1) ILU(2) LU 1-183(151.4) 0-3.5k(2.5k) 1-440(342.5) 4-9(7.5) 0-7(6.9) 4-9(8.0) 4-8(7.3) 0-7(6.9) 4-9(8.6) 4-8(7.2) 0-7(6.8) 4-9(8.1) 4-8(7.6) 0-7(6.9) 4-9(8.6) 1.34 21.49 2.99 0.22 0.12 0.14 3.12 0.37 0.45 29.86 0.87 1.03 1.06 0.31 0.38

Table 3: Flow around a rotating insect wing. Iteration counts and timings for velocity update using two different Krylov solvers and a block Jacobi preconditioner with different subdomain solvers. All computations are using the mesh shown in Figure 11 and 2048 subdomains. For each case the reported numbers are average values over the initial 400 time steps (of which 200 are in acceleration phase). The minimal time is emphasised in boldface.

method prec. local sol. num. iter. min-max(avg.) Ui U2 u3 avg. sol. time (s) u1 u2 u3

GMRES bl. Jacobi ILU(0) 7-14(12.6) 0-13(11.9) 7-15(13.7) 0.20 0.11 0.12

GMRES i—i S < ILU(0) ILU(1) ILU(2) LU 5-8(7.7) 0-8(7.4) 5-9(8.6) 3-4(3.7) 0-4(3.7) 3-4(3.9) 3-3(3.0) 0-3(2.9) 3-3(3.0) 3-3(3.0) 0-4(3.1) 3-4(3.5) 1.80 0.10 0.11 5.54 0.19 0.20 57.01 0.40 0.41 13.05 0.16 0.18

GMRES CN S < ILU(0) ILU(1) ILU(2) LU 5-8(7.7) 0-8(7.4) 5-9(8.6) 3-4(3.6) 0-4(3.6) 3-4(3.6) 2-2(2.0) 0-2(2.0) 2-2(2.0) 2-3(2.5) 0-3(2.1) 2-3(2.0) 1.15 0.15 0.17 8.75 0.31 0.31 74.76 0.48 0.49 5.08 0.25 0.25

Table 4: Flow around a rotating insect wing. Iteration counts and timings for velocity update using GMRES, different preconditioners and subdomain solvers. The considered pre-conditioners and their abbreviations are: block Jacobi (bl. Jacobi) and additive Schwarz method with algebraic overlap 1 (ASM-1) and 2 (ASM-2). All computations are using the mesh shown in Figure 11 and 2048 subdomains. For each case the reported numbers are average values over the initial 400 time steps (of which 200 are in acceleration phase). The minimal time is emphasised in boldface.

method prec. local sol. num. iter. min-max(avg.)

O bi ILU(0) 144-249(195.9)

ü CL, o o a J ILU(1) 127-216(172.9)

d 3 ILU(2) 151-207(188.6)

LU 122-194(163.4)

avg. sol. time (s)

Table 5: Flow around a rotating insect wing. Iteration counts and timings for pressure increment ^ update using deflated PCG with block Jacobi preconditioner and different subdomain solvers. All computations are using the mesh'shown in Figure 11 and 2048 subdomains. For each case the reported numbers are average values over the initial 400 time steps (of which 200 are in acceleration phase). The minimal time is emphasised in boldface.

ing the Poisson problem (16) with pure Neumann boundary conditions, see Table 5 for results. Our study is restricted to the deflated preconditioned conjugate gradient (PCG) method using the block Jacobi preconditioner and different local solvers. As Table 5 suggests, the solution times are comparable to those necessary for solving for one component of velocity. However, the equation system for velocity component is around eight times larger than the one for the incremental pressure update, cf. Section 2.3. Overall, similar to velocity problems, the lowered number of iterations by better (or even exact) LU factorisation is, for this case, not sufficient to save computational time. The simplest ILU(0) factorisation of the local problems remains the most ient method.

For the sake of completeness, we also perform a similar study for the L2-projection of pressure p, cf. (17), see Table 6 for results. It should be stressed

method prec. local sol. num. iter. min-max(avg.) avg. sol. time (s)

PCG bl. Jacobi ILU(0) ILU(1) ILU(2) LU 11-12(11.5) 11-12(11.0) 11-12(11.1) 11-12(11.1) 0.009 0.010 0.014

Table 6: Flow around a rotating insect wing. Iteration counts and timings for the L2-projection of the pressure p using PCG with block Jacobi preconditioner and different subdomain solvers. All computations are using the mesh'shown in Figure 11 and 2048 subdomains. For each case the reported numbers are average values over the initial 400 time steps (of which 200 are in acceleration phase). The minimal time is emphasised in boldface.

that this problem is by an order of magnitude faster to solve than the velocity update (15) and the auxiliary pressure update (16). Consequently, savings for this problem do not lead to any significant gain in the overall algorithm. The fast convergence of all considered methods is reported in Table 6.

4.2.3. Parallel scalability of the iterative solvers

We now investigate the parallel scalability of the preconditioned iterative / v ^

solvers identified as most efficient in the foregoing section, see Table 7. The velocity components ui, u2 and u3 are updated with GMRES, the pressure in-rement ^ is updated with deflated PCG, and PCG is used for L2-projection pressure p. In each case the block Jacobi preconditioner with ILU(0) local solver is used. The Reynolds number of the flow is Re = 1051 for all computations. We study both the weak and strong scalability of the iterative solvers. During the weak scaling runs the problem size grows with

cremen of the

problem velocity components pressure increment pressure u1,u2,u3 ^ p

Krylov method preconditioner subdomain solver GMRES deflated PCG PCG block Jacobi block Jacobi block Jacobi ILU(0) ILU(0) ILU(0)

îcient com

t efficieni

ombination

Table 7: Flow around a rotating insect wing. The identified mos of Krylov solver, preconditioner and subdomain solver.

the number of processors while keeping the load on each processor approximately constant. In contrast, during the strong scaling runs the problem size is fixed and only the number of processors is increased. An algorithm is optimally scalable when the solution time is constant during a weak scaling test and when the solution time is halved each time the number of processors is doubled during a strong scaling test.

For the weak scaling runs, we use two additional meshes generated by octasecting the hexahedral elements of the computational mesh described in Section 4.2.2, see also Figure 11. During each refinement the problem size increases approximately by factor eight. The sizes of the considered three meshes are given in Table 8.

Figures 13 and 14 show the average number of iterations and average solution times per time step in dependence on number of utilised processors, respectively. The reported numbers are averaged over 400 time steps, 200 of which are in the initial acceleration phase. In addition, in Figure 14 we also

reference description # elements # velocity nodes # pressure nodes

Mesh A Mesh B Mesh C mesh described in Sec. 4.2.2 uniform refinement of Mesh A uniform refinement of Mesh B 2.1x106 16.8x10e 2.1x106 16.6x10e 134x106 16.8x 106 133 x 106 1.07x 109 134 x 106

Table 8: Flow around a rotating insect wing. Description of the for the scaling runs.

report the parallel speed-up

computa

tational meshes

where nPref is the lowest number of

. utilised processors, tref is the corresponding

time, and tnp is the time on np processors.

In Figure 13 one can see that for each given mesh the number of iterations is almost independent of the number of processors, only for the pressure increment ^ the number of iterations increases slightly with increasing processor numbers. Moreover, the number of iterations for the pressure increment are significantly higher than the ones for the other problems. Therefore, in Figure 14 the pressure increment update requires on average a comparable time like the update of the velocity components, despite having considerably less degrees of freedom. The worsening strong scalability of the pressure increment update suggests that the local work on subdomains is too small to balance the cost of communication. The scalability begins to deviate slightly from optimal going from 2000 elements to 1000 elements per core and is lost

when computing with 500 elements per core. A similar effect is seen for the L2-projection of the pressure p, although we again emphasise, that this problem is by an order of magnitude quicker to solve than the others. In Figure 14, we also present the average total time for one time step, including the time for computing and assembling the matrices, computing aerodynamic forces and output of results. Although these operations are embarrassingly parallel, as the plots indicate they can take significant amount of time. Hence, optimisation of these parts of the computation should be performed next.

Finally, in Figure 15 the number of iterations and computational times for Meshes A, B and C are combined in order to illustrate weak scalability. It can be inferred from Figure 15 that with increasing mesh size the number of iterations approximately doubles for the velocity problem. The increase is even higher for the pressure increment. This behaviour is common to all one-level domain decomposition techniques including the block Jacobi method. Therefore, weak scalability cannot be expected, although the solution times remain acceptable. Looking at the computational times in Figure 15 we can see that the time required for the update of the pressure increment ^ is now comparable to that of the update of the first velocity component u\. For 32 768 and 65 536 subdomains, the pressure corrector problem already dominates the solution process, and a better two-level preconditioner may become beneficial.

4-2.4- Reynolds number dependence of the iterative solvers

We compute one entire revolution of the Drosophila wing for four different Reynolds numbers (based on the velocity at the radius of gyration) Re E {105, 263, 526,1051}. The time step increment is chosen constant such

Figure 13: Flow around a rotating insect wing. Iteration counts for meshes A, B and C (see Table 8) and different number of subdomains. Problems for velocity components with GMRES, problems for pressure increment and for pressure with CG, all with cobi preconditioner and ILU(0) local solver. See Table A.10 in the Appendix for erical values.

10° 102 101 100

— -- optimal

x.^ _ u1

-- step

- ........'--«4. .,..........

5 _ optimal

10 10 number of processors

10 10 number of processors

optimal

—^ u3

--■ p

103 10 number of processors

103 10" number of processors

(c) Mesh C

Figure 14: Flow around a rotating insect wing. Timings (left) and speed-ups (right) for meshes A, B and C (see Table 8) and different number of subdomains. Problems for velocity components solved with GMRES, problems for pressure increment and for

pressure with CG, all with block Jacobi preconditioner and ILU(0) local solver. See Table A.11 in the Appendix for numerical values. The line 'step' presents the average total time for a time step.

o 10°

Figure 15: Flow around a rotating insect wing. Weak scaling plots. Iteration counts (left) and timings (right) for^meshes A, B and C (see Table 8) and different numbers of subdomains. Lines from the lower right to the upper left corner correspond to approximately 1000, 2000, 4000, 8000 and 16000 finite elements per subdomain. Problems for velocity component ui solved with GMRES, problems for pressure increment with CG, all with block Jacobi preconditioner and ILU(0) local solver. See Table A.10 in the Appendix for numeric

that one entire revolution requires 3000 time steps. In accordance with Table 7, the velocity components ui,u2 and u3 are updated with GMRES, the pressure increment ^ is updated with deflated PCG, and PCG is used for L2-projection of the pressure p. In each case the block Jacobi preconditioner with ILU(0) local solver is used. For all computations we use Mesh A, see Table 8, and the number of utilised processors is 2048.

In Table 9 the nun™, maximum and average ^rnta, of toati,™ over an entire revolution are given. The prescribed tolerance of the relative residual is 10-6. We can see that the number of iterations for the update of velocity components decreases significantly with increasing Reynolds number. We recall here that the velocities are updated by solving a convection-diffusion equation and it is known that the ILU preconditioner performs best for convection-dominated problems, see [21, 51]. Interestingly, the number of iterations for obtaining the pressure increment ^ also becomes smaller with higher Reynolds number. The number of iterations for the L2-projection of pressure is very low and independent of the Reynolds number.

4.2.5. Flow characteristics and forces

The vortex structures exhibited by a rotating wing and the corresponding aerodynamic forces are very different from the ones for a translating wing. At high angles of attack both trajectories generate a large leading-edge vortex. As it is known, whereas the leading-edge vortex of a translating wing is periodically shed (cf. Section 4.1.2), it remains attached for a rotating wing (under insect-like flight conditions). However, for higher aspect ratios and larger Reynolds numbers the leading-edge vortex may not be so well defined and breaks down near the tip [24, 29]. The presence of the attached

105 263 526 1051

23-29(26.6) 18-22(19.9) 14-17(16.0) 11-15(13.9)

21-27(25.8) 15-21(19.8) 12-17(16.6) 10-15(14.8)

24-29(27.2) 18-22(20.3) 14-18(16.3) 13-17(14.3)

100-242(168.6) 7-7(7.0)

93-229(150.4) 7-7(7.0) 61-181(106.5) 33-178(73.9)

Table 9: Flow around a rotating insect wing. Number of iterations in dependence on the Reynolds number for solving for velocity components (u1,u2,u3), pressure increment ^ and pressure p. For each Reynolds number the number of iterations over an entire revolution of the plate (3000 time steps) is reported. The format of the entries is 'minimum-maximum (average)' number of iterations. All computations are using the mesh shown in Figure 11 and 2048 subdomains.

leading-edge vortex and the associated low lift force. Although this has been extensively discussed in the animal flight

ressure zone leads to a sustained

and engineering literature, its computational study became feasible only recently. Due to the truly three-dimensional nature of the flow, the size of the discretised problem becomes very large, especially when a direct numerical simulation is performed.

In Figures 16 and 17 the Q-criterion isosurfaces with Q = 3 are used to visualise the flow. All snapshots show a vortex loop consisting of a leading-edge, trailing-edge and two tip vortices, see also the experimental results in [35, 44]. The segments of the loop not connected to the wing have a down-d velocity. Importantly, throughout the rotation the leading-edge vortex remains attached to the wing. For all the considered Reynolds numbers, the leading-edge vortex is compact close to the wing root and becomes larger

in [35, ward v remain

sma prev

and then lifts up nearer to the wing tip. Moreover, with increasing Reynolds number it is slightly tighter in the vicinity of the wing root. As to be expected the complexity of the observed vortex structures becomes more pronounced with increasing Reynolds number due to the decrease in the diffusivity of the flow. In particular, for Re = 526 and Re = 1051 we can see finger-like subvortices emanating from the wing tip and spiralling around the leading-edge vortex. As also observed in [25] the orientational sense of this spiralling appears to be opposite to the rotation of the vortex. Although not shown in Figures 16 and 17 there is a significant spanwise flow with its maximum comparable to the radial velocity from the root to the tip of the wing. As initially postulated in [20] the spanwise flow and the corresponding vorticity flux limits an unbounded growth of the leading-edge vortex and prevents its shedding.

Finally, we present the history of aerodynamic coefficients in Figure 18. The lift, drag, and spanwise force coefficients are computed according to (28) with the planform S = R2/A and = Rgu, the final velocity at the radius of gyration. It is helpful to recall that the wing is initially accelerated until t =1 and then rotates with constant angular velocity u until one full revolution is completed at t = 7.7. After a slight dip immediately after the acceleration phase, the coefficients are steady throughout the revolution even for the highest Reynolds number. Moreover, towards the end of the simulation when a full rotation is completed, the coefficients become slightly Her. This is most likely due to the interaction of the tip vortices with the previously generated vortices. As can be seen in Figure 18a and 18b both drag and lift coefficients become larger with increasing Reynolds number.

However, the lift coefficient is much more sensitive to the Reynolds number than the drag coefficient. Especially, for an increase from Re = 105 to Re = 263 there is a sudden jump in the lift forces. The obtained drag and lift coefficients show similar trends as recently reported by Harbig et al. [29]. The lift-to-drag ratio (or, aerodynamic efficiency) in Figure 18d indicates also a sudden increase going from Re = 105 to Re = 263. The force in the spanwise (root-to-tip) direction plotted in Figure 18c is by an order of magnitude lower than drag and lift, and it becomes lower with growing Reynolds numbers.

ng Reyn

Re = 105

Re = 263

t>-i>-

Figure 16: Flow around a rotating insect wing. Isosurfaces of Q = 3 at different time instants, a = 40°, Re = 105 (left) and Re = 263 (right).

Re = 526 Re = 1051

Figure 17: Flow around a rotating insect wing. Isosurfaces of Q = 3 at different time instants, a = 40°, Re = 526 (left) and Re = 1051 (right).

Re = 105 Re = 263 — Re = 526 -Re = 1051

2 3 4 5 6 7 t

(a) Drag coefficient

Re = 10. Re = 21 Re = 526 -Re

^ 0.15 -

ise force coefficient

Re = 105 Re = 263 — Re = 526 -Re = 1051

012345678 t

(d) Lift-to-drag ratio

Figure 18: Flow around a rotating insect wing. Aerodynamic coefficients during one evolution for different Reynolds numbers.

5. Conclusions

We considered the implicit parallel computation of three-dimensional low Reynolds number flows on stationary and rotating domains. The scalability of the overall finite element technique depends on the suitable formulation of the physical problem, the data structures and algorithms for data handling and the preconditioned iterative solver. To this end, we introduced efficient and scalable techniques for domain partitioning, system matrix assembly and preconditioned iterative solution. In our implementation we make extensive use of open source libraries METIS, C++ STL and PETSc. The resulting software is very efficient and it is able to solve systems with « 3.34 x 109 unknowns (of the corresponding coupled Navier-Stokes system) in less than a few seconds on 65 536 processors.

The discretisation of the Navier-Stokes equations with the pressure-correction method leads to one convection-diffusion problem with three different right-hand sides, one Poisson problem and one L2-projection. We find that, in the given set-up, it is most efficient to precondition each subproblem with the block Jacobi preconditioner using ILU(0) on subdomains and to solve it with a suitable Krylov subspace iterative method, namely GMRES for nonsymmetric and PCG for symmetric problems. This gives an overall solution technique with almost perfect scaling even for very large problems and processor counts. The block Jacobi preconditioner with ILU(0) exhibits perfect scalability in case of the convection-diffusion problems for the considered discretisations with up to 1.07 x 109 velocity nodes and 65 536 processors. Moreover, it was observed that the number of iterations is inversely proportional to the Reynolds number of the flow. Although for larger processor

fect disc

counts the solution of the Poisson problem for the pressure increment becomes a bottleneck, it has only limited influence on overall solution time. The size of the discretised Poisson problem is only about 1/8 of the size o

the discretised convection-diffusion problem. The discretised L2-projectio: problem is of the same size as the discretised Poisson problem. However, it can be solved with much fewer iterations due to its better conditioning. In order to compute on larger processor counts, especially with a view to peta-scale computing platforms, it appears to be crucial to improve the scalability of the solution of the Poisson problem. To this end, the most promising techniques appear to be two-level or multilevel BDDC and FETI domain decomposition methods, see e.g. [12, 22, 23, 52].

In order to make the developed approach truly useful for the study of flapping-flight aerodynamics, it is necessary to consider flexible wings. The used ALE approach with rigidly rotating meshes is however only suitable for rigid bodies and not applicable to flexible wings. In case of flexible wings it is in principle possible to use either non-boundary-fitting immersed grids [47], boundary-fitted meshes [55] or a combination of both [6]. When a partitioned (or, block Gauss-Seidel) approach is used for solving the resulting fluid-structure interaction problem, the approach developed here can be used for solving the fluid problem.

wledgement

We are grateful to Charles P. Ellington and Holger Babinsky for valuable discussions on aerodynamics of insect flight. This research was supported by the Engineering and Physical Sciences Research Council (EPSRC) through

grant # EP/G008531/1. Additional support was provided by the Czech Science Foundation through grant 14-02067S, and by the Czech Academy of Sciences through RV0:67985840. The presented computations were formed on HECToR at the Edinburgh Parallel Computing Centre PRACE-2IP (FP7 RI-283493).

per-tre through

Appendix A. Appendix

The following two tables give the data for the Figures 13 and 14 introduced in Section 4.2.3.

num. processors

number of iterations

7-12(11.0) 7-13(12.1) 7-13(11.6) 7-13(12.4)

7-14(12.6)

8-14(13.4)

0-11(10.4

0-13(11.3 0-13(11.0 0-13(12.2 0-13(11.9 0-13(12.2

119-198(157.2) 125-205(164.6) 30-217(175.4) 138-234(186.5) 144-249(195.9) 157-269(213.5)

11-11(11.0) 11-11(11.0) 11-11(11.0) 11-12(11.7)

11-12(11.5)

12-12(12.0)

1024 2048 4096 8192 16 384

11-20(18.4) 10-22(19.6)

10-22(19.

11-27(23. 10-24(21.7)

0-18(16.4 0-19(17.5 20(17.6

24(20.9 0-22(18.9

11-24(21.9) 11-25(23.0)

10-25(22.9)

11-30(26.5) 11-28(24.7)

369-680(529.7) 373-688(534.4) 382-696(543.9) 385-704(550.5) 394-712(558.4)

12-12(12.0) 12-12(12.0) 12-12(12.0) 12-12(12.0) 12-12(12.0)

8192 16 384 32 768 65 536

18-43(36.4)

19-42(37.0) 19-44(38.1) 19-44(36.9)

0-38(31.1 0-38(32.4 0-39(32.7) 0-38(30.9)

19-48(40.9)

20-47(41.6)

21-49(42.8) 20-49(42.1)

820-1425(1175.9) 832-1437(1150.6) 836-1452(1163.2) 855-1472(1254.6)

12-12(12.0) 12-12(12.0) 12-12(12.0) 12-12(12.0)

0: Flow around a rotating insect wing. Iteration counts for meshes A, B and see Table 8) and different number of subdomains. Problems for velocity components u2 and u3 solved with GMRES, problems for pressure increment ^ and for pressure p with CG, all with block Jacobi preconditioner and ILU(0) local solver.

num. proc.

16 384

loc. size (x103)

16 8 4 2 1

0.5 16

8 4 2 1

solution time (s)

U2 «3 ^

5.198 1.552 1.952 1.238 0.092

2.703 0.832 0.971 0.621 0.045

0.789 0.413 0.479 0.304 0.023

90 0.014

29 0.009

0.101 0.055 0.064) 0.134 0.009 5.139 2.577 3.538 4.286 0.105 2.585 1.337 1.810 2.087 0.051 1.152 0.656 0.867 0.907 0.026

.630 0.377 0.487 0.518 0.014

89 0.169 0.222 0.368 0.010

time per step (s)

220.640 86.915 39.801 19.127 9.876 6.286 231.137 91.739 42.604 21.857 12.439

8192 16 384 32 768 65 536

5.188 6.972 10.085 0.113 251.728

2.614 3.423 5.013 0.066 100.716

1.268 1.688 2.625 0.039 50.968

0.587 0.820 1.721 0.029 33.112

Jacc is d

Table A.11: Flow around a rotating insect wing. Average times for solving the linear systems and the total time per step for meshes A, B and C (see Table 8) with different number of subdomains. Problems for velocity components ui, u2 and u3 solved with GMRES, problems for pressure increment ^ and for pressure p with CG, all with block obi preconditioner and ILU(0) local solver. The number of processes (the same as cores) is denoted with 'num. proc.' and the average number of finite elements per subdomain with 'loc. size' .

References

[1] Amestoy, P. R., Duff, I. S., L'Excellent, J.-Y., 2000. Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput. Methods Appl. Mech. Engrg. 184, 501-520.

[2] Ansari, S. A., Phillips, N., Stabler, G., Wilkins, P. C., Zbi , R., Knowles, K., 2009. Experimental investigation of some aspects of insect-

implementation of Balancing Domain Decomposition by Constraints. SIAM J. Sci. Comput. 36 (2), C190-C218.

[4] Balay, S., Abhyankar, S., Adams, M. F., Brown, J., Brune, P., Buschel-man, K., Eijkhout, V., Gropp, W. D., Kaushik, D., Knepley, M. G., McInnes, L. C., Rupp, K., Smith, B. F., Zhang, H., 2014. PETSc Web page. http://www.mcs.anl.gov/petsc.

[5] Bazilevs, Y., Hughes, T. J. R., 2008. NURBS-based isogeometric analysis for the computation of flows about rotating components. Computa-

Lg, X., Yan, J., 2015. Novel structural

modeling and mesh moving techniques for advanced fluid-structure interaction simulation of wind turbines. International Journal for Numerical Methods in Engineering 102 (3-4), 766-783.

2014. A highly Decomposition

[7] Birch, J., Dickinson, M., 2001. Spanwise flow and the attachment of the leading-edge vortex on insect wings. Nature 412, 729-733.

[8] Cesenek, J., Feistauer, M., Horacek, J., Kucera, V., Prokopova, J., 2013. Simulation of compressible viscous flow in time-dependent domains. Applied Mathematics and Computation 219 (13), 7139-7150.

[9] Chorin, A. J., 1968. Numerical solution of the Navier-Stokes equations.

:r-Stokes equa

Mathematics of Computation 22, 745-762.

[10] Cirak, F., Cummings, J., 2008. Generic programming techniques for parallelizing and extending procedural finite element programs. Engineering with Computers 24, 1-16.

[11] Concus, P., Golub, G. H., O'Leary, D. P., 1976. A generalized conjugate gradient method for the numerical solution of elliptic PDE. In: Bunch, J. R., Rose, D. J. (Eds.), Sparse Matrix Computations. Academic Press, New York, pp. 309-33

[12] Dohrmann, C. R., 2003. A preconditioner for substructuring based on constrained energy minimization. SIAM J. Sci. Comput. 25 (1), 246-258.

[13] Donea, J., Huerta, A., 2003. Finite element methods for flow problems. John Wiley & Sons.

[14] Dudley, R., 2000. The biomechanics of insect flight: form, function, evolution. Princeton University Press, Princeton, New Jersey.

[15] Ellington, C., 1984. The aerodynamics of hovering insect flight. I. The

quasi-steady analysis. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 305, 1-15.

[16] Ellington, C., 1984. The aerodynamics of hovering insect flight. II. Morphological parameters. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 305, 17-40.

[17] Ellington, C. P., 1984. The aerodynamics of hovering insect flight. III. Kinematics. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 305, 41-78.

[18] Ellington, C. P., 1984. The aerodynamics of hovering insect flight. IV. Aerodynamic mechanisms. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 305, 79-113.

[19] Ellington, C. P., 1984. The aerodynamics of hovering insect flight. V. A vortex theory. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 305, 115-144.

[20] Ellington, C. P., van den Berg, C., Willmott, A. P., Thomas, A. L. R., 1996. Leading-edge vortices in insect flight. Nature 384, 626-630.

[21] Elman, H. C., Silvester, D. J., Wathen, A. J., 2005. Finite elements and fast iterative solvers: with applications in incompressible fluid dynamics. Numerical Mathematics and Scientific Computation. Oxford University Press, New York.

[22] Farhat, C., Lesoinne, M., Le Tallec, P., Pierson, K., Rixen, D., 2001. FETI-DP: a dual-primal unified FETI method. I. A faster alternative to

the two-level FETI method. Internat. J. Numer. Methods Engrg. 50 (7), 1523-1544.

[23] Farhat, C., Roux, F.-X., 1991. A method of finite element tearing and interconnecting and its parallel solution algorithm. Internat. J. Numer. Methods Engrg. 32 (6), 1205-1227.

wings for v

[24] Garmann, D., Visbal, M., 2014. Dynamics of revolving wings for various aspect ratios. Journal of Fluid Mechanics 748, 932-956.

[25] Garmann, D. J., Visbal, M. R., Orkwis, P. D., 2013. Three-dimensional flow structure and aerodynamic loading on a revolving wing. Physics of Fluids (1994-present) 25 (3), 034101.

[26] Geuzaine, C., Remacle, J.-F., 2009. Gmsh: A 3-D finite element mesh generator with built-in pre- and post-processing facilities. International Journal for Numerical Methods in Engineering 79 (11), 1309-1331.

[27] Girault, V., Raviart, P.-A., 1986. Finite element methods for Navier-Stokes equations. Springer-Verlag, Berlin.

Guermond, J. L., Minev, P., Shen, J., 2006. An overview of projection methods for incompressible flow. Computer Methods in Applied Mechanics and Engineering 195, 6011-6045.

[29] Harbig, R., Sheridan, J., Thompson, M., 2013. Reynolds number and aspect ratio effects on the leading-edge vortex for rotating insect wing planforms. Journal of Fluid Mechanics 717, 166-192.

[30] Heroux, M. A., Bartlett, R. A., Howle, V. E., Hoekstra, R. J., Hu, J. J., Kolda, T. G., Lehoucq, R. B., Long, K. R., Pawlowski, R. P., Phipps, E. T., et al., 2005. An overview of the Trilinos project. Transactions on Mathematical Software (TOMS) 31 (3), 397-42

[31] Hunt, J. C. R., Wray, A. A., Moin, P., 1988. Eddies, stream, and convergence zones in turbulent flows. Tech. rep., Center for Turbulence Research.

m, and iter for Turbu

[32] Josuttis, N. M., 2012. The C++ Standard Library: A Tutorial and Reference, 2nd Edition. Addison-Wesley.

[33] Karypis, G., Kumar, V., 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20 (1), 359-392.

[34] Kim, D., Choi, H., 2006. Immersed boundary method for flow around an arbitrarily moving body. Journal of Computational Physics 212 (2), 662-680.

[35] Kim, D., Gharib, M., 2010. Experimental study of three-dimensional vortex structures in translating and rotating plates. Experiments in Fluids 49 (1), 329-339.

[36] Kim, J., Moin, P., 1985. Application of a fractional-step method to incompressible Navier-Stokes equations. Journal of Computational Physics 59 (2), 308-323.

[37] Kolâr, V., 2009. Compressibility effect in vortex identification. AIAA Journal 47 (2), 473-475.

[38] Kolar, V., Sistek, J., 2015. Corotational and compressibility aspects leading to a modification of the vortex-identification Q-criterion. AIAA Journal 53 (8), 2406-2410.

[39] Kolar, V., Sistek, J., Cirak, F., Moses, P., 2013. Average m of line segments near a point and vortex identification. AI al 51 (11), 2678-2694.

[40] Liesen, J., Strakos, Z., 2013. Krylov Subspace Metho s. Oxford University Press.

[41] Mandel, J., Sousedik, B., Dohrmann, C. ., 2008. Multispace and multilevel BDDC. Computing 83 (2-3), 55-

[42] Meyers, S., 2001. Effective STL: 50 specific ways to improve your use of

[43] Nolan, G. R., 2004. Aerodynamics of vortex lift in insect flight. Ph.D. thesis, University of Cambridge, Department of Zoology.

[44] Ozen, C. A., Rockwell, D., 2012. Three-dimensional vortex structure on a rotating wing. Journal of Fluid Mechanics 707, 541-550.

Quarte for the

[45] Quarteroni, A., Saleri, F., Veneziani, A., 2000. Factorization methods numerical appproximation of Navier-Stokes equations. Computer ethods in Applied Mechanics and Engineering 188, 505-526.

Quarteroni, A., Valli, A., 1999. Domain Decomposition Methods for Partial Differential Equations. Oxford Science Publications.

[47] Ruberg, T., Cirak, F., 2014. A fixed-grid b-spline finite element technique for fluid-structure interaction. International Journal for Numerical Methods in Fluids 74 (9), 623-660.

[48] Saad, Y., 2003. Iterative Methods for Sparse Linear Systems,>2nd edition. SIAM, Philadelpha, PA.

[49] Saad, Y., Schultz, M. H., 1986. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput. 7 (3), 856-869.

[50] Shyy, W., Lian, Y., Tang, J., Viieru, D., Liu, H., 2008. Aerodynamics of low Reynolds number flyers. Cambridge University Press.

[51] Smith, B. F., Bj0rstad, P. E., Gropp, W. D., 1996. Domain decomposition: parallel multilevel methods for elliptic partial differential equations. Cambridge University Press, Cambridge.

niversi \

[52] Sousedik, B., Sistek, J., Mandel, J., 2013. Adaptive-Multilevel BDDC and its parallel implementation. Computing 95 (12), 1087-1119.

[53] Stroustrup, B., 2013. The C++ Programming Language, 4th Edition. Addison-Wesley.

[54] Taira, K., Colonius, T., 2009. Three-dimensional flows around low-aspect-ratio flat-plate wings at low Reynolds numbers. J. Fluid Mech. 623, 187-207.

[55] Takizawa, K., Henicke, B., Puntel, A., Kostov, N., Tezduyar, T. E., 2012. Space-time techniques for computational aerodynamics modeling

of flapping wings of an actual locust. Computational Mechanics 50 (6), 743-760.

[56] Tezduyar, T. E., Sameh, A., 2006. Parallel finite element computations in fluid mechanics. Computer methods in applied mechanics and engineering 195 (13), 1872-1884.

ition Meth

[57] Toselli, A., Widlund, O. B., 2005. Domain Decomposition Methods-Algorithms and Theory. Vol. 34 of Springer Series in Computational Mathematics. Springer-Verlag, Berlin.

[58] van der Vorst, H. A., 1992. Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Statist. Comput. 13 (2), 631-

[59] Wang, S., Zhang, X., 2011. An immersed boundary method based on discrete stream function formulation for two- and three-dimensional incompressible flows. J. Comput. Phys. 230, 3479-3499.