Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

Ioan Hadade, Luca di Mare

Academic journal: Computer Physics Communications
Field of science: Computer and information sciences
Keywords: computational fluid dynamics, code optimisation, SIMD, SandyBridge, Haswell, Xeon Phi, parallel performance

Abstract: Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied to two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices, including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques, together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency on each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.

Accepted Manuscript

Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

Ioan Hadade, Luca di Mare

PII: S0010-4655(16)30095-9


Reference: COMPHY 5918

To appear in: Computer Physics Communications

Received date: 3 October 2015
Revised date: 4 April 2016
Accepted date: 13 April 2016

Please cite this article as: I. Hadade, L. di Mare, Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code, Computer Physics Communications (2016),

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Ioan Hadade*, Luca di Mare
Whole Engine Modelling Group
Rolls-Royce Vibration UTC
Mechanical Engineering Department
Imperial College London
South Kensington, SW7 2AZ, London, United Kingdom


* Corresponding author
Email address: (Ioan Hadade)

Preprint submitted to Computer Physics Communications, April 4, 2016


1. Introduction

Modern research and engineering rely heavily on numerical simulations. In research, improvements in the speed and accuracy of scientific computations can lead to new discoveries or facilitate the exploitation of recent breakthroughs [1]. In engineering, improvements in solution time can lead to better designs and to a reduction of the development phase - and costs - of new products. There is a constant need, therefore, for sustained improvements in the speed of execution of scientific and engineering codes.

Historically, performance gains in scientific and engineering applications have been obtained through advances in hardware engineering which required little or no change to the programming paradigms. Examples of such innovations were out-of-order execution, branch prediction, instruction pipelining, deeper memory hierarchies and, of course, the increase in clock frequency [2], all of which guaranteed improvements in serial performance with every new CPU generation and limited code intervention. Those days are now gone, partly due to the recognition that clock frequency cannot be scaled indefinitely because of power consumption, and partly because circuitry density on the chip is approaching the limit of existing technologies, which is problematic as innovations in sequential execution require a high fraction of die real estate [3].

The current trend in CPU design is parallelism [4], which has led to the rise of multicore and manycore processors whilst effectively ending the so-called "free lunch" era [5, 6] in performance scaling. Modern multicore and manycore processors now consist of double-digit core counts integrated on the same die, vector units with associated instruction set extensions, multiple backend execution ports and deeper and more complex memory hierarchies with features such as Uniform-Memory-Access (UMA) and Non-Uniform-Memory-Access (NUMA), to name a few. Consequently, achieving any reasonable performance on these architectures mandates the exploitation of all architectural features and their intrinsic parallelism across all granularities (core, data and instruction) [6].

Core and data parallelism are exposed in modern applications through two principal mechanisms: threads and Single Instruction Multiple Data (SIMD) constructs. In order for core parallelism to be exploitable, the computations need to be arranged in such a way as not to contain dependencies among data being processed concurrently. Additionally, good data locality and temporality are needed to achieve efficient execution on a thread's local working set. The same considerations apply for efficient data parallelism, where SIMD computations require data to be packed in sets of size matching the width of the underlying vector registers whilst being aligned at correct memory address boundaries.
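This packing and alignment requirement can be sketched as follows. The helper below is a minimal illustration, not code from the solver: it assumes C++17's std::aligned_alloc, and the 32-byte boundary matches one 256bit AVX register (the Xeon Phi's 512bit registers would need a 64-byte boundary).

```cpp
#include <cstddef>
#include <cstdlib>

// Allocate n doubles starting at a 32-byte boundary so that packed
// 256bit vector loads/stores can use aligned instructions.
// std::aligned_alloc requires the size to be a multiple of the alignment,
// so the byte count is rounded up accordingly.
double* alloc_avx(std::size_t n) {
    std::size_t bytes = ((n * sizeof(double) + 31) / 32) * 32; // round up
    return static_cast<double*>(std::aligned_alloc(32, bytes));
}
```

The returned pointer must be released with std::free.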

Optimising existing applications to make full use of all available levels of parallelism can result in considerable speedups - up to one order of magnitude - in many fields of scientific computing [7, 8]. However, extracting performance from SIMD and thread-level parallelism is not trivial and a careless implementation can easily obliterate the advantages of modern processors [9, 10, 6].

Firstly, the compiler may detect data dependencies, even when none really exist, unless the source code is made unambiguous through directives. In some cases, the data layout of the original application is not suitable for efficient execution of vector load and store operations. This happens if the data items are not aligned within the boundaries of vector registers. An example of this situation occurs in stencil computations on discretized PDEs: values at several neighbouring points on a lattice are used simultaneously, but values that would naturally be loaded in the same vector register appear in different operands. Algorithms which simultaneously access data stored at addresses far from each other, or access the same data repeatedly with separate load or store operations, also do not perform well on modern multicore and manycore architectures. This is due to memory bandwidth limitations and Translation-Lookaside-Buffer (TLB) misses. Furthermore, on the majority of these platforms, not all addresses in memory can be accessed by all cores with the same latency. This effect, known as Non-Uniform-Memory-Access (NUMA), can limit scalability on multicore platforms. Finally, scalability deteriorates as single-core execution is optimised, because the latency of synchronization and communication operations becomes more and more difficult to hide and because resources such as memory bandwidth become increasingly saturated.

A number of techniques have been used in the literature to alleviate these issues. These techniques aim at hiding the latency of memory transfers, or at arranging the data in a way that is better suited for vector load/store operations. Henretty [11] applied data transposition for SIMD optimisations of stencil operations in PDE-based applications. Rostrup [12] applied similar techniques, with the addition of SIMD shuffles, for hyperbolic PDE solutions on Cell processors and NVIDIA GPUs. Rosales et al. [13] studied the effect of data transformations for SIMD execution and thread scalability in a Lattice Boltzmann code for the manycore Intel Xeon Phi (Knights Corner) processor. Datta [14], Jaeger [15] and Schafer [16] studied the problem of determining and applying automatically the necessary architectural optimisations for stencil computations on modern multicore and manycore processors.

Incorporating these techniques in existing codes is no small task: accessing some of the features of the hardware may require very specialised, non-portable code. On occasions, sections of code may need to invoke platform-specific intrinsic functions or to be written in assembler. For certain application fields, active research is under way towards automatic tuning [17].

For the general practitioner, however, the issue remains of finding a trade-off between programming effort and gains in performance, both in the short and in the long term [1]. For this purpose it is essential to expose domain scientists to the whole range of techniques available, and to their impact on realistic applications and up-to-date hardware.

This paper addresses a number of optimisation techniques and compares their effect on three different leading platforms. The practicalities of SIMD/threading optimisation in a solver of complexity and structure representative of industrial CFD applications are discussed in detail. A fair and meaningful assessment of the gains afforded by each technique and on each platform is ensured by the use of a performance model.


2. Numerical algorithm

The test vehicle for this study is a fast Euler solver designed for quick calculations of transonic turbomachinery flows. The solver computes inviscid flow solutions in the $m'$-$\theta$ coordinate system [18]¹.

A typical computational domain is shown in Figure 1 and consists of a number of blocks. The blocks are connected by stitch lines.

Figure 1: Computational domain and topology. (a) Computational grid with block boundaries highlighted; (b) block and stitch line coordinate systems.

The Euler equations are solved in semi-discrete form

$$\frac{d}{dt}\left(W_{i,j}\,U_{i,j}\right) = F_{i-1/2,j} - F_{i+1/2,j} + G_{i,j-1/2} - G_{i,j+1/2} + S_{i,j} = RHS_{i,j} \qquad (1)$$

In equation (1), $W_{i,j}$ is the volume of cell $i,j$, $U_{i,j}$ is the vector of conserved variables, and the vectors $F_{i-1/2,j}$, $G_{i,j-1/2}$ and $S_{i,j}$ denote fluxes through i-faces, fluxes through j-faces and source terms, respectively. They are evaluated as follows:

¹$\theta$ is the angular position around the annulus. $m$ is the arclength evaluated along a stream surface, $dm = \sqrt{dx^2 + dr^2}$, and $m'$ is a normalized (dimensionless) curvilinear coordinate defined from the differential relation $dm' = dm/r$. The $m'$-$\theta$ system is used in turbomachinery computations because it preserves aerofoil shapes and flow angles.

$$F_{i-1/2,j} = \begin{pmatrix} \rho w_\xi \\ \rho w_\xi u_m + p\,n_{\xi m} \\ \rho w_\xi u_\theta + p\,n_{\xi\theta} \\ \rho w_\xi h \end{pmatrix}, \qquad G_{i,j-1/2} = \begin{pmatrix} \rho w_\eta \\ \rho w_\eta u_m + p\,n_{\eta m} \\ \rho w_\eta u_\theta + p\,n_{\eta\theta} \\ \rho w_\eta h \end{pmatrix}, \qquad S_{i,j} = W_{i,j}\,\rho \begin{pmatrix} 0 \\ u_\theta^2 \sin\phi \\ -\,u_m u_\theta \cos\phi \\ 0 \end{pmatrix} \qquad (2\text{-}4)$$
100 For the purpose of flux and source term evaluation, the contravariant velocities , the normals n^ g and n^ g and the radial flow angle 0 are also needed. Further details are reported in Appendix A.

Convergence to a steady state is achieved by a matrix free, implicit algorithm. At each iteration, a correction to the primitive variable vector V is determined as solution to the linear problem [19]

HWjjUjj) St

+Jhj ™h,k

or, equivalently

(WijKi j — Jhj)SVi,j = — (SWj) IUi,j + RHSij

where Vjj is the vector of primitive variables at the cell i,j and Kjj is the transformation jacobian from primitive variables to conserved variables. The linear problem in equation (6) can be approximated by a diagonal problem if the assembled flux Jacobian J j is replaced by a matrix bearing on the main diagonal the sum of the spectral radii |A| of the flux Jacobian contributions for each cell


At a fixed Courant number $\alpha$, this approximation yields the following update:

$$\delta V_{i,j} = \frac{K_{i,j}^{-1}}{W_{i,j}\,(1+\alpha)}\left(-U_{i,j}\,\delta W_{i,j} + RHS_{i,j}\right) \qquad (8)$$

The solver has been validated against MISES [20] for turbine testcases. For the purpose of implementation, the baseline solver stores the primitive variables $V_{i,j}$ at each cell. Auxiliary quantities such as the speed of sound and total enthalpy at each cell are also stored.


Inspection of equations (1)-(8) reveals the existence of two types of operations. The first type evaluates cell attributes and is performed by looping over the cells. The evaluation of $S_{i,j}$ in equation (1) or the block-diagonal inversion in equation (8) fall within this category. The second type evaluates stencil operators and needs to be performed by looping over the neighbours of each cell or by looping over the cell boundaries. The evaluation of the numerical fluxes $F_{i\pm 1/2,j}$ and $G_{i,j\pm 1/2}$ in equation (1) is an example of such stencil kernels.
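In simplified one-dimensional form, the two patterns look as follows. This is an illustrative sketch with hypothetical helper names, not the solver's actual code:

```cpp
#include <cstddef>

// Type 1: cell-wise operation. Loop over cells; each cell is updated
// independently of its neighbours (cf. the source term evaluation or
// the block-diagonal inversion in equation (8)).
void cell_update(double* u, const double* rhs, double scale, std::size_t n) {
    for (std::size_t c = 0; c < n; ++c)
        u[c] += scale * rhs[c];            // no neighbour data required
}

// Type 2: stencil operation. Loop over i-faces; each face combines the
// two adjacent cells (cf. the numerical fluxes in equation (1)).
void face_average(const double* u, double* flux, std::size_t ncells) {
    for (std::size_t f = 0; f + 1 < ncells; ++f)
        flux[f] = 0.5 * (u[f] + u[f + 1]); // neighbouring values used together
}
```

The stencil loop is the pattern for which values that naturally belong in one vector register appear in different operands, as discussed in the introduction.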

Figure 2: Isentropic Mach number solutions for first stage stator (a) and first stage rotor (b).

An example of the solver's output, in terms of computed isentropic Mach number distributions for first stage turbine blade configurations, can be seen in Figure 2. The application is implemented as a set of C++ classes. Equations (1)-(8) are grouped in the methods of a class embodying an abstraction of a gas model. The methods of this class are called at appropriate points in the execution of the program to compute fluxes, updates etc. All the remaining elements of the discretization, e.g. block geometry and layout, are handled by a different set of classes.

All floating point operations are carried out in double precision. The solver spends 75% of its time computing the Roe fluxes and approximately 15% updating the primary flow variables after each main iteration. The aim of this work is to evaluate and compare the optimisation process of both the quasi-stencil-based Roe flux computations and the cell-wise primary flow variable update on the three different hardware platforms.

3. Hardware and Performance Model

The three architectures chosen for this study are representative of modern multicore and manycore processors in terms of SIMD features and core count, namely the Intel® Xeon® Sandy Bridge, Intel® Xeon® Haswell and Intel® Xeon Phi™ Knights Corner. The following sections delve deeper into the architectural characteristics of the selected processors, including details of the system configuration and the applied performance model.

3.1. Intel® Xeon® Sandy Bridge

The SandyBridge microarchitecture introduces 256bit vector registers and an extension to the instruction set known as Advanced Vector Extensions (AVX) [21]. The AVX extensions are backward compatible with SSE and, to this effect, the AVX registers on SandyBridge are implemented as two fused 128bit SSE lanes. This complicates somewhat cross-lane elemental operations such as shuffle and permute. AVX supports three non-destructive operands for vector operations, e.g. x = y + z, as well as functions for merging and masking registers.

On SandyBridge, vector loads are performed using two ports simultaneously. This prevents the overlap of store and load operations when operating on whole 256bit registers and has an impact on instruction level parallelism [22]. For arithmetic purposes, SandyBridge can execute one vector multiplication and one vector addition in the same clock cycle. The memory subsystem comes with an improved 32KB L1D cache compared to Nehalem, which is required by AVX and can sustain two 128bit loads and a 128bit store every cycle. The 256KB L2 cache is similar to the previous generation implementation, providing 8-way associativity and a 12 cycle load-to-use latency with a write-back design. The L3 cache is shared by all cores within a ring interconnect and is sliced and distributed across the ring, allowing for better proximity to cores.

The SandyBridge-EP node evaluated in this paper holds a two-socket Xeon E5-2650 configuration with 16 physical cores and 32 GB of DDR3 main memory (ECC on). The CPUs are clocked at 2.0 GHz with HyperThreading and TurboBoost disabled. The node runs Scientific Linux 6.7 (Carbon), kernel version 2.6.32-504. Code compilation has been performed with Intel icpc 15.0 with the following flags: -O3 -xAVX -fargument-noalias -restrict -openmp.

3.2. Intel® Xeon® Haswell

The Haswell microarchitecture is based on a 22nm design and is a successor to IvyBridge. Whereas IvyBridge did not contain any major architectural changes compared to its SandyBridge predecessor, Haswell provides a number of modifications through a new core design, an improved execution unit based on the AVX2 extensions and a revamped memory subsystem [23]. The major improvements to the execution unit in the Haswell microarchitecture regard the addition of two ports, one for memory and one for integer operations, which aids instruction level parallelism as the remaining five can target floating point arithmetic. To that end, there are two execution ports for performing FMA operations as part of AVX2, with a peak of 16 double precision FP operations per cycle, double that of SandyBridge and IvyBridge. Further improvements provided by AVX2 are gather operations, which are particularly useful for packing non-contiguous elements within a vector register, and whole register permutations and shuffles with the implementation of the vpermd instruction. From a memory standpoint, Haswell fixes the backend port width issue by providing 64 byte load and 32 byte store per cycle functionality, therefore alleviating the stalls in memory bound codes which occurred on SandyBridge and IvyBridge.

The Haswell-EP node consists of 2x 10-core Xeon E5-2650 CPUs (20 physical cores in total) clocked at 2.3 GHz and 64 GB of DDR4 main memory (ECC on). The node runs Oracle Linux Server 6.6, kernel 2.6.32-504. Code compilation has been done via the Intel icpc 15.0 compiler with the following flags: -O3 -xCORE-AVX2 -fargument-noalias -restrict -openmp.


3.3. Intel® Xeon Phi™ Knights Corner

Knights Corner is the first commercial iteration of the Intel Xeon Phi manycore processor. The device can be described as an x86-based Shared-Memory-Multiprocessor-on-a-chip [24] with over 50 cores on the die and up to four hardware threads per core. The main computational device in each core is a Vector Processing Unit (VPU). The VPU operates on 32 512bit wide registers and 8 mask registers. The VPU offers functionalities geared towards scientific computations such as FMA, gather and scatter, horizontal reductions and support for cross-lane permutations similar to AVX2 [25]. In theory, the VPU can execute one FMA (512bit) operation every cycle, giving it a 16 DP FLOP per cycle ratio similar to the Haswell microarchitecture. However, due to the in-order execution nature of the core design, one must run at least two hardware threads per core in order to hide memory latency and achieve peak FLOPS [24]. The cores are connected to a high-speed bidirectional ring interconnect providing full cache coherence. Communication with the host CPU is done via the PCI-Express bus, in a similar fashion to Graphics Processing Units (GPUs). For this study the Xeon Phi card was used in native mode, i.e. the algorithm was executed by a process started by the card's own operating system. The card can alternatively be used in offload mode or in symmetric mode.

The Intel Xeon Phi Knights Corner coprocessor evaluated in this study is based on the 5110P product series and contains 60 physical cores (4 hardware threads each) clocked at 1.053 GHz and 8GB of GDDR5 memory. The coprocessor runs a bespoke version of Linux, kernel 2.6.32-504.8.1, with Intel Manycore Platform Software Stack (MPSS) version 3.4.3 used for host and coprocessor communication. Compilation has been performed via Intel's icpc 2015 compiler with the following flags: -O3 -mmic -fargument-noalias -restrict -openmp.

3.4. The Roofline Model

In order to determine the effect of interventions on the code rigorously, performance models are built and correlated with each platform. A good performance model can highlight how well the implementation of an algorithm uses the available hardware, both in terms of in-core and intra-core performance. The Roofline [26, 17, 27] performance model is used in this work. The model is based on the assumption that the principal vectors of performance in numerical algorithms are computation, communication and locality [17]. The Roofline model defines computation as floating point operations, communication as the units of data required from memory by the computations, and locality as the distance in memory from where the data is retrieved, i.e. cache, DRAM etc. However, because modelling the performance of the entire cache system is a complex task for one architecture, let alone three, the locality component in this work is set to DRAM main memory, similar to previous work in the literature [17]. The correlation between computation and communication is defined by the model as the arithmetic intensity of a given kernel, which can be expressed as the ratio of useful FLOPS to the corresponding number of bytes requested from the data source, in our case main memory. As the units of measurement for both flops and byte transfers in current CPU architectures are given as GigaFLOPS/sec and GigaBYTES/sec, the maximum attainable performance of a computational kernel, measured in GigaFLOPS/sec, can be calculated as:

$$\text{Max. GFLOPS/sec} = \min\left(\text{Peak Floating Point Performance},\ \text{Max. Memory Bandwidth} \times \text{Kernel Flops/Byte}\right) \qquad (9)$$
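Equation (9) amounts to taking the lesser of two ceilings, as in the following sketch; the numeric values in the usage note below are illustrative, not the measured peaks of the test systems:

```cpp
#include <algorithm>

// Roofline model, equation (9): attainable performance is the minimum of
// the compute ceiling and the bandwidth ceiling at a given arithmetic
// intensity (flops per byte of DRAM traffic).
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double flops_per_byte) {
    return std::min(peak_gflops, bandwidth_gbs * flops_per_byte);
}
```

For example, a kernel at 0.14 flops/byte on a node sustaining 70 GB/s would be capped at 9.8 GFLOPS/sec however high the compute peak, i.e. it sits on the bandwidth slope well to the left of the ridge.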


Peak floating point performance is obtained from the processor's documentation manual, whilst maximum DRAM memory bandwidth is obtained from the STREAM [28, 29] benchmark. The model visualises the metrics in a diagram [26] where the attainable performance of a kernel is plotted as a function of its arithmetic intensity. This acts as the main roofline of the model and exhibits a slope followed by a plateau when peak FLOPS is reached. The position at which the arithmetic intensity coupled with the available memory bandwidth equals peak FLOPS is called the "ridge" and signifies the transition from memory to compute bound. However, achieving either maximum memory bandwidth or peak FLOPS on modern multicore and manycore processors requires the exploitation of a plethora of architectural features, even though the algorithm might exhibit a favourable arithmetic intensity. For example, achieving peak FLOPS requires a balance of additions and multiplications for fully populating all functional units, data level parallelism through SIMD, and loop unrolling for instruction level parallelism. Reaching the memory roofline requires that memory accesses are performed at unit stride, some form of prefetching, and that NUMA effects are catered for when running at full chip concurrency [17], [30]. To this extent, subsequent rooflines can be added which need to be "pierced" through by specific optimisations in order to reach maximum attainable performance.

Roofline diagrams for all three architectures can be seen in Figure 3. Memory bandwidth saturation effects [31], i.e. the inability of a single core to utilise all available memory controllers, mandate that separate diagrams are built for single-core and full node configurations. The primary flow variable update kernel (cell) and the flux computations (stencil) are plotted at their respective arithmetic intensities in order to visualise their optimisation space and maximum attainable performance across all platforms. Rooflines for in-core performance optimisations are added with respect to the particularities of each processor for both single-core and full node configurations. Rooflines for memory optimisations, horizontal to peak memory bandwidth, are added only in the full node diagrams as most of them, such as NUMA, apply only at full-chip and full-node concurrency. The order of the rooflines is set according to how likely it is for the compiler to apply such optimisations automatically, without external intervention [17].

For the SandyBridge Xeon E5-2650 core, a balance between additions and multiplications is required due to the design of the execution ports, where one multiplication and one addition can be performed every clock cycle. An imbalanced code will fully utilise the addition or multiplication execution unit whilst the other sits idle, therefore reducing throughput by a factor of two. Furthermore, lack of SIMD computations via AVX (256bit) further decreases performance by a factor of four from the previous roofline. Poor use of instruction level parallelism brings forth another factor of four in performance due to the reduced number of in-flight instructions, from four to one in the worst case scenario.

On the Haswell Xeon E5-2650 core, lack of FMA operations produces a factor of four performance drop. This is the case for kernels that contain a large number of additions and a minimal number of multiplications, such as PDE-based stencil operations, and as such is similar to the application presented in this work. The reason for such a drastic performance drop is that the two FMA units sit idle whilst the separate single ADD unit gets overloaded. A possible solution would be the manual insertion of a 1.0 multiplier; however, such an optimisation would suffer from portability issues on architectures that do not implement FMA, i.e. SandyBridge. Furthermore, absence of AVX/AVX2 execution on Haswell leads to another factor of four decrease in performance, similar to SandyBridge.
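The 1.0-multiplier workaround can be sketched as follows. The function names are illustrative, and whether the compiler actually fuses the second expression into an FMA instruction depends on the target flags (e.g. -xCORE-AVX2) and on floating point contraction settings:

```cpp
// A bare addition can only be issued to Haswell's single ADD port.
double add_plain(double a, double b) {
    return a + b;
}

// Multiplying one operand by 1.0 turns the expression into the form
// a*x + b, which is eligible for either of the two FMA ports.
double add_fma_eligible(double a, double b) {
    return a * 1.0 + b;
}
```

Both functions compute the same value; only the instruction selection may differ.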


On the Knights Corner 5110P core, the in-order execution mandates the use of multiple threads to hide memory latency and fully occupy the VPU. In theory, the VPU can be filled by running between two and three threads on the core. Running with only one thread in flight reduces the attainable peak performance by a factor of two, as the VPU would then be capable of performing an FMA operation only every other cycle. Furthermore, similarly to Haswell, an imbalanced kernel that cannot fully exploit FMA operations on Knights Corner will suffer a factor of four drop in performance when also running in single thread mode. Lack of SIMD brings forth an eight-fold performance decrease due to the wide SIMD nature of the VPU (512bit).

The node Roofline diagrams extend their core counterparts by presenting memory optimisations such as NUMA on SandyBridge and Haswell and prefetching for Knights Corner. The optimisations applied to both the flow variable update and Roe numerical flux kernels, presented in the following section, aim at piercing through the in-core and intra-core rooflines and achieving a high percentage of the predicted maximum attainable performance for serial and full concurrency execution. The aim is also to highlight the fact that certain optimisations, such as efficient SIMD execution, have a high impact on performance across all architectures for kernels with moderate to high arithmetic intensity, as is the case for the stencil-based flux computations. For kernels which exhibit a low arithmetic intensity, such as the cell-wise flow variable update, the model predicts that SIMD would only affect runs on the Knights Corner coprocessor. However, the nature of the optimisations required for vector computations usually leads to a higher degree of instruction parallelism due to loop unrolling and instruction dispatch reduction. Consequently, these might also provide speedups on SandyBridge and Haswell.
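The NUMA optimisation referred to above commonly relies on first-touch page placement: pages are mapped to the memory of the socket whose thread first writes them, so arrays should be initialised with the same OpenMP decomposition later used by the compute loops. A minimal sketch, illustrative rather than the solver's actual allocation code:

```cpp
#include <cstddef>

// First-touch initialisation: with a static schedule, each thread writes
// the index range it will later compute on, so the corresponding pages
// are mapped to that thread's local NUMA node. Compiled without OpenMP,
// the pragma is ignored and the loop runs serially.
void first_touch_init(double* a, std::size_t n, double value) {
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < static_cast<long long>(n); ++i)
        a[i] = value;
}
```

Note that this only helps for memory that has not been touched yet; containers that zero-initialise their storage on construction have already placed the pages.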


Figure 3: Roofline diagrams plotting attainable performance against arithmetic intensity (1/16 to 4 flops/byte). Top row: single core (SNB E5-2650, Haswell E5-2650, KNC 5110P); bottom row: full node (2x SNB E5-2650, 16 cores; 2x Haswell-EP E5-2650, 20 cores; Knights Corner 5110P, 60 cores).


9 4. Results and Discussions

11 This section presents in detail a number of optimisation techniques that have been

12 applied to the selected computational kernels introduced in Section 2. Results are

13 285 reported as GFLOPS/sec so as to validate them against the performance model and

14 have been obtained as average wall clock times spent in the kernels over 102 main

15 iterations, in order to mimic the natural utilisation of the application.

4.1. Flow Variable Update

The cell-wise kernel computes the primitive variable updates, based on the residuals of the discretized Euler equations, as defined in equation (8). The kernel accesses the arrays storing the flow variables, the residuals, an array storing auxiliary variables and an array storing the spectral radii of the flux Jacobians. The arrays are passed to the function through their base pointers. The flow and auxiliary variables are used to compute the entries of the transformation Jacobian K between conserved variables and primitive variables. The kernel performs 40 FLOPS per cell and accesses 35 double precision values, giving it a 0.14 flops/byte ratio. Figure 4 presents the results derived from applying SIMD optimisations. Figure 6 provides the results from applying SIMD and memory optimisations, whilst results for full node concurrency (thread-level parallelism) performance and speedup can be examined in Figures 7 and 8 respectively.

4.1.1. Data-level Parallelism

Autovectorization (autovect+unaligned - Figure 4) has been achieved through the use of the #pragma omp simd directive (OpenMP 4.0). The use of OpenMP 4.0 directives is preferable to compiler-specific directives because it is more portable. In order to achieve efficient vectorization, the restrict qualifier needs to be added to the function arguments. This guarantees to the compiler that the arrays represented by the arguments do not overlap. In the absence of further provisions, the compiler generates code for unaligned loads and stores. This is a safety precaution, as aligned SIMD load/store instructions on unaligned addresses lead to bus errors on the Xeon Phi coprocessor. SandyBridge-EP and Haswell-EP processors can deal with accesses to unaligned addresses, albeit with some performance penalty due to the inter-register data movements that are required. On SandyBridge-EP and Haswell-EP, even the auto-vectorized kernel performs two and three times faster, respectively, than the baseline kernel. On Knights Corner, the improvement is almost one order of magnitude, as the VPU and therefore the wide SIMD lanes were not previously exploited.
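A minimal sketch of this pattern follows; the function and variable names are illustrative, not the solver's actual interface:

```c
#include <assert.h>

/* Hypothetical cell-wise update loop: restrict guarantees the arrays do not
 * overlap, and the OpenMP 4.0 simd directive requests (portable) vector
 * code generation; without further hints, the compiler emits unaligned
 * SIMD loads and stores. */
void update_cells(double *restrict q, const double *restrict rhs,
                  const double *restrict dt, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        q[i] += dt[i] * rhs[i];
}
```

The pragma is a hint, so the loop remains valid C even when compiled without OpenMP support.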

Aligned vector accesses (autovect+aligned - Figure 4) can be achieved by issuing the aligned qualifier to the original simd directive and by allocating all the relevant arrays using the _mm_malloc function. _mm_malloc takes an extra argument representing the required alignment (32 bytes for AVX/AVX2 and 64 bytes for the KNC VPU). A number of additional directives may be needed to persuade the compiler that aligned loads can be issued safely, as shown in the snippet below for a pointer storing four variables linearly in a Structures of Arrays (SoA) arrangement, at offsets id0, id1:

60 61 62

9 10 11 12

20 21 22

60 61 62

__assume_aligned(rhs, 64);
__assume(id0 % 8 == 0);
__assume(id1 % 8 == 0);
__assume((id0 * sizeof(rhs[0])) % 64 == 0);
__assume((id1 * sizeof(rhs[0])) % 64 == 0);

Listing 1: Example of extra compiler hints needed for generating SIMD aligned load/stores on KNC.

The __assume_aligned construct indicates that the base pointer is aligned with the SIMD register boundary. The following __assume statements indicate that the subscripts and indices accessing the four sections of the array are also aligned on the SIMD register boundary. Stack-resident variables have to be aligned on the vector register boundary through the aligned attribute:

double du, dv, dp, dt __attribute__((aligned(64)));

Listing 2: Declaring stack variables on aligned boundaries on KNC
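The allocation side can be sketched portably; the helper below uses C11 aligned_alloc as a stand-in for the Intel-specific _mm_malloc, with the 64-byte KNC alignment (the function name is ours):

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate n doubles on a 64-byte boundary (the KNC cache-line/SIMD width).
 * aligned_alloc requires the size to be a multiple of the alignment, so the
 * request is rounded up. Free with free(), unlike _mm_malloc/_mm_free. */
double *alloc_aligned64(size_t n)
{
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 63) & ~(size_t)63;   /* round up to a 64-byte multiple */
    return aligned_alloc(64, bytes);
}
```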

Aligned accesses yield no benefit over unaligned ones on SandyBridge-EP, and only a marginal improvement on Haswell-EP for smaller problem sizes, due to better L1 cache utilisation. The aligned version outperforms its unaligned counterpart on Knights Corner, as a 512-bit vector register maps onto an entire cache line, allowing for efficient load and store operations to/from the L1 cache.

SIMD execution with aligned loads and stores can also be achieved by invoking compiler intrinsics (intrinsics aligned - Figure 4) or by using a vector class (vector-class aligned - Figure 4). For this work, we have used Agner Fog's Vector Class Library (VCL) [32]. Its main advantage is that the unwieldy compiler intrinsics are encapsulated away, allowing for more readable code. On Knights Corner, an extension of the VCL library developed by Przemysław Karpiński at CERN [33] was used. The utilisation of intrinsics and of a vector class did not bring forth any speedups over the directive-based vectorization for this particular kernel, indicating that the compiler was able to vectorize the code efficiently.

4.1.2. Memory Optimisations

A simple way to improve the performance of a memory-bound kernel is to replace stored values with re-computed ones. This technique is applied to the data set memory-reduction in Figure 6. The technique is beneficial on Haswell-EP, but holds no palpable improvements on SandyBridge-EP or Knights Corner. A reason for this is that re-computing the conservative variables requires a number of divisions and square root operations with a large latency penalty. Although the overall flop/byte ratio is improved, there is no commensurate performance gain.

4.1.3. Software Prefetching

Software prefetching (intrinsics-aligned-prefetch - Figure 6) can also be used to improve the performance of memory-bound codes. A prefetching kernel was built upon the previous intrinsics-aligned implementation, by using the compiler directive #pragma prefetch on SandyBridge-EP and Haswell-EP. On the Knights Corner coprocessor, the available _mm_prefetch intrinsic was evaluated. This intrinsic takes as an argument the level of cache the prefetch is to target. When using _mm_prefetch, it is advisable to disable the compiler prefetcher using the -no-opt-prefetch compilation flag (Intel compilers). Prefetching is marginally beneficial on SandyBridge-EP



Figure 4: Primary Variable Update SIMD optimisations with Rooflines (data sets: baseline, autovect+unaligned, autovect+aligned, intrinsics+aligned, vecclass+aligned; x-axis: number of grid cells).

and positively beneficial, for smaller problem sizes, on Knights Corner. This is because the compiler may not be able to estimate the overall loop count (the input depends on the geometry file), which leads to a less aggressive prefetching regime compared to manual intervention.
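As an illustration of the idea (not the paper's kernel), the sketch below inserts manual prefetches with the GCC/Clang __builtin_prefetch builtin rather than the Intel-specific _mm_prefetch; the 64-element distance is an arbitrary placeholder that would be tuned per architecture:

```c
#include <stddef.h>

/* Sum with an explicit software prefetch a fixed distance ahead, mimicking
 * the manual prefetch insertion described in the text. Arguments of
 * __builtin_prefetch: address, rw (0 = read), temporal locality (0-3). */
double sum_with_prefetch(const double *a, size_t n)
{
    const size_t dist = 64;   /* illustrative prefetch distance, in elements */
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0, 1);
        s += a[i];
    }
    return s;
}
```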

4.1.4. Data Layout Transformations

A version using the Array of Structures Structures of Arrays (AoSSoA) data layout (intrinsics-aligned-aossoa - Figure 6) has also been studied. Four sub-vector lengths were tested: 4, 8, 16 and 32 double precision values on the multicore processors and 8, 16, 32 and 64 on the coprocessor. The best performing sub-vector length of AoSSoA proved to be the one equal to the SIMD vector width of the processor (i.e. 4 and 8 respectively). The difference between the AoS, SoA and hybrid AoSSoA formats can be seen in Figure 5. Although AoS exhibits very good intra- and inter-structure locality by grouping all elements together, it is inefficient from a SIMD perspective, as operands of the same type would have to be manually packed within a single vector register. The generic data layout advisable for SIMD is SoA, where large contiguous arrays are fused together, allowing for contiguous SIMD loads and stores. This was also the initial format of our test vehicle application. However, SoA can lead to performance penalties such as TLB misses for very large arrays. A hybrid approach such as AoSSoA can alleviate these by offering the recommended SIMD grouping in sub-vectors, coupled with improved intra- and inter-structure locality similar to the AoS layout.
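The addressing arithmetic behind the hybrid layout can be sketched as follows; the function name and the choice of nvar below are illustrative, not taken from the solver:

```c
#include <stddef.h>

/* AoSSoA addressing: cells are grouped in tiles of V (the SIMD width);
 * inside a tile, each of the nvar variables is stored as a contiguous
 * sub-vector of length V, so SIMD lanes read consecutive cells while the
 * tile keeps all variables of nearby cells close together in memory. */
size_t aossoa_index(size_t cell, size_t var, size_t nvar, size_t V)
{
    size_t tile = cell / V;   /* which tile of V cells     */
    size_t lane = cell % V;   /* position in the sub-vector */
    return tile * nvar * V + var * V + lane;
}
```

With V equal to the SIMD width, a vector load at the returned index (for lane 0) fetches the same variable for V consecutive cells, exactly as in SoA, but the working set of one tile stays compact as in AoS.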


Figure 5: Array of Structures, Structures of Arrays and hybrid Array of Structures Structures of Arrays format.


Figure 6: Primary Variable Update SIMD+Memory Optimisations with Rooflines (data sets: baseline, mem-reduction, intrinsics+aligned+prefetch, intrinsics+aligned+aossoa; x-axis: number of grid cells).



Figure 7: Primary Variable Update Kernel Scalability Performance on 10^6 grid cells (data sets: baseline, baseline+numa, autovect, autovect+numa, vecclass/intrinsics+aligned, vecclass/intrinsics+aligned+numa, vecclass/intrinsics+aossoa, vecclass/intrinsics+aossoa+numa, vecclass/intrinsics+aligned+prefetch, vecclass/intrinsics+aligned+prefetch+numa, vecclass/intrinsics+unaligned; x-axis: number of cores).

4.1.5. Thread Parallelism

Thread parallelism of the primary variable update kernel was achieved using the #pragma omp parallel for simd OpenMP 4.0 construct for the autovectorized version, which complements the initial #pragma omp simd statement and further decomposes the loop iterations across all of the available threads in the environment. The intrinsics-based versions used the generic #pragma omp parallel for construct. The work decomposition has been carried out via a static scheduling clause, whereby each thread receives a fixed-size chunk of the entire domain. Additionally, the first touch policy has been applied to all data structures utilised in this method in order to alleviate NUMA effects when crossing socket boundaries. First touch is a basic technique which consists in performing trivial operations - e.g. initialisation to zero - on the data involved in parallel loops before the actual computations. This forces the relevant sections of the arrays to be mapped onto memory pages local to the core and socket where the thread resides. The first touch needs to be performed with the same scheduling and thread pinning as the subsequent computations. In terms of thread pinning, the compact process affinity was used for SandyBridge-EP and Haswell-EP, running only one thread per physical core. On Knights Corner, the scatter affinity brought better results than compact. The kernel under consideration is memory-bound on all three architectures and is therefore very sensitive to NUMA effects. On Haswell-EP, not using the first touch technique can lead to a large performance degradation as soon as the active threads spill outside a single socket. This can be seen from the behaviour of the data sets autovect and autovect+numa in Figure 7.
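A minimal sketch of the first touch initialisation described above, assuming the later compute loops also use schedule(static) with the same thread pinning:

```c
#include <stdlib.h>

/* First-touch placement: zero the array in a parallel loop with the same
 * static schedule as the subsequent compute loops, so each page is faulted
 * in on the NUMA node of the thread that will later use it. Compiled
 * without OpenMP the pragma is ignored and the loop runs serially. */
void first_touch(double *a, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] = 0.0;
}
```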



Figure 8: Primary Variable Update Kernel Scalability Speedup on 10^6 grid cells (same data sets as Figure 7; x-axis: number of cores).

On the SandyBridge-EP node, all single core optimisations coupled with first touch (Fig. 7) managed to pierce through the predicted memory roofline. The same applies to the Haswell-EP runs, where all versions utilising NUMA optimisations exceeded the model's performance prediction, achieving good scaling. A reason why the discrepancies between NUMA and non-NUMA runs are larger on Haswell-EP than on SandyBridge-EP is the larger core count on the former, which means that the QPI interconnect linking the two sockets gets saturated more quickly due to the increase in cross-socket transfer requests. On Knights Corner there are no NUMA effects, due to the ring-based interconnect and the interleaved memory controller placement. However, aligned loads play an important role on this architecture: the SIMD length is 64 bytes (8 doubles), which maps to the size of an L1 cache line. If unaligned loads are issued, the thread is required to cross cache line boundaries, loading two or more cache lines and performing inter-register movements to pack the necessary data. This wastes memory bandwidth and forms a very damaging bottleneck, visible in the drop in performance above 60 cores in the intrinsics+unaligned and autovect data sets in Figure 7. The best performing version was based on the hybrid AoSSoA format and obtained the predicted Roofline performance when run on 96, 128 and 240 threads. Aligned intrinsics combined with manual prefetching performed second best, followed by aligned intrinsics with compiler-issued prefetches.

Finally, Figure 8 presents scalability results plotted as parallel speedup (performance on p processors divided by the corresponding p=1 run).

4.2. Roe Numerical Fluxes

The flux computation kernel loops over a set of cell interfaces and computes numerical fluxes using Roe's approximate Riemann solver. The kernel accepts pointers to the left and right states, to the corresponding residuals, and to the normals of the interfaces. The computations performed at each interface are: the evaluation of the Euler fluxes for the left and right states - equations (2,3); the evaluation of the Roe averages and corresponding eigenvectors and eigenvalues - equations (A.2)-(A.10); the assembly of the numerical fluxes; and the accumulation of residuals. The kernel computes two external products (left and right Euler flux) and one 4x4 GAXPY. A judicious implementation requires 191 FLOPS and accesses 38 double precision values per direction sweep. The pointers to the left and right states, and the corresponding residuals, are accessed through indirect references in the baseline implementation. This makes the kernel directly reusable in an unstructured solver, but at the same time complicates the generation of auto-vectorized code. In fact, it is not possible to guarantee, at compile time, the absence of race conditions unless special arrangements are made with regard to the order in which the faces are visited. Furthermore, the arguments passed as left and right state are in most situations aliases for the same pointer. This is a source of ambiguity and it too prevents auto-vectorization.

4.2.1. Data-level Parallelism

Code vectorization has been achieved by evaluating a number of avenues such as autovectorization via compiler directives, intrinsics and the VCL vector class, similar to the cell-wise flow variable update kernel.

In order for the code to be vectorized by the compiler, a number of modifications had to be performed. First of all, the indirect references for obtaining the neighbours' indices were removed and their positions explicitly computed for each iteration. As these increase linearly with a stride of 1 once their offset is computed, the compiler was able to replace gather and scatter instructions with unaligned vector loads and stores, albeit with the help of the OpenMP 4.0 linear clause. This had a particularly large effect on Knights Corner, where the performance of unaligned loads and stores was a factor of two higher than that of gather and scatter operations. Furthermore, the restrict qualifier was also necessary for any degree of vectorization to be attempted by the compiler. The autovectorized version (autovect - Figure 9) brought forth a 2.5X improvement on SandyBridge-EP, 2.2X on Haswell-EP and 13X on Knights Corner when compared to their respective baselines.
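A simplified sketch of the linear-clause pattern (the face/cell indexing is reduced to a one-dimensional difference, unlike the actual Roe kernel):

```c
/* Face sweep with explicitly computed neighbour positions: once the offset
 * is known, the left/right indices advance with unit stride, which the
 * OpenMP 4.0 linear clause communicates to the compiler so it can replace
 * gathers/scatters with contiguous (unaligned) vector loads and stores. */
void face_diff(const double *q, double *flux, int offset, int nfaces)
{
    int il = offset;        /* left state index  */
    int ir = offset + 1;    /* right state index */
    #pragma omp simd linear(il, ir : 1)
    for (int f = 0; f < nfaces; ++f, ++il, ++ir)
        flux[f] = q[ir] - q[il];
}
```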

In multiblock codes, it is natural to group the faces according to the transformed coordinate that stays constant on their surface, e.g. i-faces and j-faces. Thus fluxes can be evaluated in two separate sweeps visiting the i-faces or the j-faces, respectively. Each operation consists of nested i- and j-loops. When visiting i-faces, if the left and right states are stored with unit stride, they cannot be simultaneously aligned on a vector register boundary, which prevents efficient SIMD execution. When visiting j-faces this problem does not appear, as padding is sufficient to guarantee that the left and right states of all j-faces can be simultaneously aligned on a vector register boundary. The intrinsics-unaligned and vector-unaligned versions use unaligned loads and stores when visiting i-faces, but aligned load/store instructions when visiting j-faces, for the reasons discussed above. The penalty for unaligned access on the i-faces is better tolerated on SandyBridge-EP and Haswell-EP, where inter-register movements can be performed with small latency, but is very damaging on Knights Corner, as already seen. Furthermore, the VPU ISA does not contain native instructions for


unaligned vector loads and stores such as those found in AVX/AVX2. To this extent, the programmer has to rely upon the unpacklo, unpackhi, packstorelo and packstorehi instructions and their corresponding intrinsics. Examples of the unaligned load and store functions used in this work for Knights Corner can be seen in Listings 3 and 4; each requires two aligned loads/stores plus register manipulation, hence the performance penalty compared to their aligned counterparts.

inline __m512d loadu(double *src) {
    __m512d ret;
    ret = _mm512_loadunpacklo_pd(ret, src);
    ret = _mm512_loadunpackhi_pd(ret, src + 8);
    return ret;
}

Listing 3: Unaligned SIMD loads on KNC

inline void storeu(double *dest, __m512d val) {
    _mm512_packstorelo_pd(dest, val);
    _mm512_packstorehi_pd(dest + 8, val);
}

Listing 4: Unaligned SIMD stores on KNC

The lack of alignment for the arguments of the i-face loop can be resolved by two techniques: shuffling, or dimensionally lifted transposition. Shuffling consists in performing the loads using register-aligned addresses, and then modifying the vector registers to position all the operands correctly. On AVX, shuffling requires three instructions, as register manipulations can only be performed within 128-bit lanes. The kernel using intrinsics and register shuffles (intrinsics+shuffle - Figure 9) issues aligned load and store operations combined with vector register shuffling when handling i-faces. The use of register shuffling also saves a load at each iteration, as the current right state can be re-used as the left state in the next iteration, therefore allowing for register blocking.

The optimisation of the vector register rotations and shifts, however, is critical. On AVX, register rotation was achieved by using permute2f128 followed by a shuffle. AVX2 supports the vpermd instruction, which can perform cross-lane permutations, so a more efficient approach can be used. The Intel Xeon Phi also supports vpermd, although the values have to be cast from epi32 (integers) to the required format (double in the present case). The choice of the intrinsic for each operation is based on its latency and throughput: instructions with a reciprocal throughput smaller than one can be executed by more than one execution port, if these are available, leading to improved instruction-level parallelism. As an example, the AVX/AVX2 blend intrinsic has a latency of one cycle and a throughput of 0.3 on Haswell-EP, and a one cycle latency and throughput of 0.5 on SandyBridge-EP; it is therefore preferable to the AVX shuffle instruction, which has one cycle latency but unit throughput on both chips. The code snippets below show how shuffling can be achieved on the three architectures.

// vl = 3 2 1 0
// vr = 7 6 5 4
__m256d t1 = _mm256_permute2f128_pd(vl, vr, 33); // t1 = 5 4 3 2
__m256d t2 = _mm256_shuffle_pd(vl, t1, 5);       // t2 = 4 3 2 1

Listing 5: Register shuffle for aligned accesses on the i-face sweep with AVX.


// vl = 4,3,2,1
// vr = 8,7,6,5
// blend = 4,3,2,5
__m256d blend = _mm256_blend_pd(vl, vr, 0x1);
// res = 5,4,3,2
__m256d res = _mm256_permute4x64_pd(blend, _MM_SHUFFLE(0,3,2,1));

Listing 6: Register shuffle for aligned accesses on the i-face sweep with AVX2.

// vl = 8,7,6,5,4,3,2,1
// vr = 16,15,14,13,12,11,10,9
__m512i idx = {2,3,4,5,6,7,8,9,10,11,12,13,14,15,0,1};
// blend = 8,7,6,5,4,3,2,9
__m512d blend = _mm512_mask_blend_pd(0x1, vl, vr);
// res = 9,8,7,6,5,4,3,2
__m512d res = _mm512_castsi512_pd(_mm512_permutevar_epi32(idx, _mm512_castpd_si512(blend)));

Listing 7: Register shuffle for aligned accesses on the i-face sweep with the MIC VPU.

Generating aligned loads/stores with the VCL library (vecclass-aligned - Figure 9) can be done using the blend4d method, which requires as an argument an index shuffle map. However, this was found to be less efficient than the authors' intrinsics functions, since an optimised implementation targeting each architecture was not available.

The unaligned intrinsics version delivers a 4X speedup over the baseline implementation on SandyBridge-EP and Haswell-EP, and 13.5X on Knights Corner. Comparison with the autovectorized version shows a 60% speedup on SandyBridge-EP and 50% on Haswell-EP, although no noticeable improvement on the coprocessor. The increase in performance of the intrinsics version on the multicore processors is due to the manual inner loop unrolling when assembling fluxes, which allows for more efficient instruction-level parallelism. The intrinsics and shuffle kernel performs similarly to the unaligned version on SandyBridge-EP, due to AVX cross-lane shuffle limitations, whilst performing 30% and 10% faster on Haswell-EP and Knights Corner respectively. As mentioned earlier, the vector class versions perform worse on both multicore CPUs due to the inefficient implementation of shuffling in the blend4d routine; on Knights Corner, however, these were replaced with the intrinsics-based functions of Listings 3 and 4, as they were not implemented in the VCLKNC library. For this reason, the aligned vector class version on the manycore processor delivers very similar performance to the hand-tuned intrinsics+shuffle.

Another technique for addressing the alignment conflict is Henretty's dimension-lifted transposition [11] (intrinsics-transp - Figure 10). Transposition solves the alignment conflict for i-faces by changing the stride of the left and right states. There is, however, a penalty in that the data has to be transposed before the i-face loop and re-transposed prior to the j-face sweep. Furthermore, since the result of the transposition is an array with a number of columns equal to the SIMD length (i.e. a rectangular matrix transposition), the transposition kernel can degrade performance as the working set increases and can no longer be held in cache. This can be seen across all architectures, especially SandyBridge-EP, where the drop in performance is significant. On problems of small size, transposition is more efficient than shuffling (SandyBridge-EP), because of the high cost of shuffling. For larger problems, however, the shuffling version was preferable.
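A scalar sketch of the transposition itself (the SIMD load/store side is omitted); V stands for the SIMD length and n is assumed to be a multiple of V:

```c
#include <stddef.h>

/* Dimension-lifted transposition (after Henretty et al.): view the
 * n-element array as a (n/V) x V matrix read down the columns and store it
 * row-major. Neighbouring elements a[i], a[i+1] within a column then land
 * exactly V slots apart, so the "left" and "right" streams of the stencil
 * are both vector-aligned, at the cost of this extra copy. */
void dlt_transpose(const double *a, double *b, size_t n, size_t V)
{
    size_t rows = n / V;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < V; ++c)
            b[r * V + c] = a[c * rows + r];
}
```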

4.2.2. Cache Blocking

In the kernels tested up to this point, the loops visiting i-faces and j-faces have been kept separate. Data locality and temporality can be improved by fusing (blocking) the evaluation of fluxes across the j-face and the i-face of each cell. This technique is known as cache blocking, and the corresponding results are shown as the data set intrinsics-shuffle-cacheblocked in Figure 10. The cache blocking kernel fuses both i- and j-passes and is based on the shuffling kernels described above, allowing aligned loads and stores on the i-faces. The cache blocking kernel further benefits from the fact that it can save an extra load/store operation when writing back results for each cell. It should be noted that in a multiblock code, a first level of cache blocking can be achieved by visiting the blocks in succession, rather than sweeping through the whole domain. The second level is obtained by nesting loops visiting i- and j-faces, but without re-arranging the data. The nesting of i- and j-face loops allows data to be reused before being evicted from the cache. The blocking factor should be a multiple of the SIMD vector length so as to allow for efficient vectorization. Across all three platforms, cache blocking did not bring forth any palpable benefits for single core runs. The reason for this is that, although data cache reuse is increased, fusing both loops increases register pressure and degrades performance, as the fused kernel performs 382 FLOPS for every iteration.
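The two-level loop structure can be sketched as below; the grid and tile sizes and the toy "flux" are illustrative only:

```c
#define GRID_N 8   /* grid size (illustrative)                 */
#define TILE_B 4   /* tile size: a multiple of the SIMD length */

/* Two-level blocked sweep: visit the grid in TILE_B x TILE_B tiles and,
 * inside each tile, evaluate the i-face and j-face contributions of a cell
 * together, so its data is reused while still resident in cache. The
 * "flux" here is a toy forward difference, not the Roe flux. */
void blocked_sweep(const double q[GRID_N][GRID_N], double out[GRID_N][GRID_N])
{
    for (int jb = 0; jb < GRID_N; jb += TILE_B)
        for (int ib = 0; ib < GRID_N; ib += TILE_B)
            for (int j = jb; j < jb + TILE_B; ++j)
                for (int i = ib; i < ib + TILE_B; ++i) {
                    double fi = (i + 1 < GRID_N) ? q[j][i + 1] - q[j][i] : 0.0;
                    double fj = (j + 1 < GRID_N) ? q[j + 1][i] - q[j][i] : 0.0;
                    out[j][i] = fi + fj;  /* fused i- and j-face update */
                }
}
```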

4.2.3. Data Layout Transformations

The discussion so far for the flux computations has assumed a SoA data layout. A kernel using the AoSSoA data layout (intrinsics-shuffle-aossoa - Figure 10) and based on the shuffling kernel was tested. The best performing sub-vector size was the size of the SIMD register, just like for the primitive variable update kernel. Similarly, the intrinsics-shuffle-aossoa-cacheblocked kernel (Figure 10) uses the new hybrid data layout on top of the fused kernel.

4.2.4. Thread Parallelism

For the purpose of studying performance at full chip and node concurrency, domain decomposition is performed within every block in order to avoid load imbalance due to discrepancies in block size (see Figure 11). Decomposition is performed in two ways. One option is to split each block into slabs along the i- or j-planes depending on the sweep, the number of slabs being equal to the number of available threads. The second option is to split each block into square tiles. This method increases locality and complements the cache-blocked processing of the flux reconstruction scheme discussed above. For the non-fused kernels, where separate plane sweeps are performed, the first decomposition method was chosen due to the elimination of possible race conditions. For the fused kernels, the latter decomposition was required, together with handling race conditions at the tile boundaries.

Thread and process affinity has been applied similarly to the flow variable update kernel, i.e. compact for the multicore CPUs and scatter for the coprocessor. Task and chunk allocation have been performed manually, as required by the custom domain decomposition within the OpenMP parallel region. Another reason for not relying on the OpenMP runtime is the high overhead it caused on the Knights Corner processor when running on more than 100 threads. Furthermore, all runs imply NUMA-aware placement via the first touch technique.



Figure 9: Flux Computations SIMD optimisations with Rooflines (data sets: baseline, autovect, intrinsics+unaligned, vecclass+unaligned, intrinsics+shuffle, vecclass+aligned; x-axis: number of grid cells).

The performance on both SandyBridge-EP and Haswell-EP at full concurrency beats the roofline prediction, attaining 53 GFLOPS and 101 GFLOPS respectively. On Knights Corner, the best performing version achieves 95% performance efficiency (94 GFLOPS) relative to its corresponding roofline when running on 180 threads. While on the multicore CPUs the best performing kernels were based on the intrinsics cacheblocked and hybrid data layout optimisations, on the coprocessor the scaling of this particular kernel was only second best. This can be attributed to register pressure, as more threads scheduled on the same core compete for the available resources. This assumption can be validated by examining the scaling behaviour of the cacheblocked kernel without the hybrid AoSSoA data layout, which suffers from poor scalability as soon as more than one thread is allocated per core.

A key consideration is that piercing through, or achieving near parity with, the performance roof on all architectures required both the advanced SIMD optimisations for efficient load/store operations and memory optimisations such as cache blocking and hybrid data layout transformations. The only exception is Haswell-EP, where kernels implementing aligned SIMD with the SoA arrangement attained a speedup similar to the model's prediction, although without piercing through it. This can be attributed to the new core design, which focused on high SIMD throughput via the increase in back-port bandwidth and in the number of functional units.

Comparing the best performing results to their respective baselines reveals a 3.1X speedup on SandyBridge-EP and Haswell-EP, and 24X on the Knights Corner coprocessor, at full concurrency.

If we were to analyse speedup metrics (Figure 12), a curious observation could be



Figure 10: Flux Computations SIMD+Memory Optimisations with Rooflines (data sets: baseline, intrinsics+transp, intrinsics+shuffle+cacheblocked, intrinsics+shuffle+aossoa, intrinsics+shuffle+cacheblocked+aossoa; x-axis: number of grid cells).

made. Perhaps surprisingly, the baseline kernel scales best, although it performs significantly worse than all the others in terms of actual attained compute performance. This further highlights the importance of judging computational performance on modern architectures with the aid of performance models and the correct metrics, and not solely on the basis of scalability. The slowest kernels will undoubtedly achieve superior speedup if efficient domain decomposition and load balancing are performed, as was the case here, since there is ample headroom in available resources such as memory bandwidth and FLOPS. In reality, however, the baseline kernel utilises approximately 30-35% of the attainable performance on the multicore CPUs and only 5% on the coprocessor, while the best optimised versions match or exceed it on all architectures.

4.3. Roofline Visualisation

A validation of the Roofline model based on the results of the applied optimisations can be seen in Figure 13. For brevity, the main classes of optimisations were grouped as: baseline; autovect for the autovectorized versions; intrinsics+aligned, taken as the best performing intrinsics version for that particular kernel and platform (i.e. shuffle, transposition, etc.); and intrin.+memopt, which encapsulates the best performing aligned intrinsics together with the specific memory optimisation (cacheblocked, AoSSoA, etc.).

Observations on the single core diagrams highlight interesting aspects of the two computational kernels and their evolution in the optimisation space. For the cell-wise flow variable update, applying autovectorization managed to pierce through the ILP wall on the multicore CPUs and the SIMD roofline on the coprocessor. This is due to



Figure 11: Flux Computations Scalability Performance on 10^6 grid cells (data sets: baseline, autovect, intrinsics+unaligned, vecclass+unaligned, vecclass+aligned, intrinsics+transp, intrinsics+shuffle, intrinsics+shuffle+aossoa, intrinsics+shuffle+cacheblocked, intrinsics+shuffle+cacheblocked+aossoa; x-axis: number of cores).

the fact that SIMD execution automatically brings improvements in instruction-level parallelism on the SandyBridge and Haswell CPUs due to loop unrolling, as mentioned earlier. On Knights Corner, autovectorization allowed instructions to be routed towards the more efficient VPU. The reason autovect performed as well as the hand-tuned kernels on all three platforms is also evidenced by the fact that there is little to no optimisation space remaining: the kernel obtains 80-90% efficiency and is limited by the reduced available memory bandwidth, since only one memory channel is utilised by the core. For the flux computations, the baseline sits above the ILP roof on the multicore CPUs due to the considerably higher FLOP count of the kernel and its better arithmetic intensity. Autovectorization pierces through the respective SIMD walls, whilst hand-tuned intrinsics and subsequent memory optimisations deliver close to the attainable performance prediction. On Haswell-EP and Knights Corner, however, speedup is limited by the fact that the FMA units are not fully utilised, due to the algorithmic nature of PDE-based stencils and their inherent lack of fused multiply/add operations. Even so, for single core runs, our best optimised version in the intrin.+memopt class achieves between 80-90% of the available performance, compared to the baseline, which delivers 20% on the multicore CPUs and 1.67% on the coprocessor.

At full node and chip concurrency, the sizeable increase in memory bandwidth permits the baseline implementation to obtain a larger degree of speedup across all architectures. For the cell-wise flow variable update kernel, NUMA first-touch optimisations applied to all versions allow them to bypass the NUMA optimisation wall. On the coprocessor, prefetching only works when coupled with SIMD due to the

[Figure 12: Flux Computations Scalability Speedup on 10^6 grid cells. Panels show speedup versus number of cores on the two multicore CPUs and Knights Corner; series: baseline, autovect, intrinsics+unaligned, vecclass+unaligned, vecclass+aligned, intrinsics+transp, intrinsics+shuffle, intrinsics+shuffle+aossoa, intrinsics+shuffle+cacheblocked, intrinsics+shuffle+cacheblocked+aossoa.]

poor performance of the scalar processing unit compared to the VPU. As more memory bandwidth becomes available, subsequent optimisations such as the hybrid data layout transformation, which prevents TLB misses and provides better data locality, perform better than the autovectorized and SoA-based hand-tuned intrinsics versions. For flux computations, vectorization through compiler directives and intrinsics delivers close to peak roof performance, whilst subsequent memory optimisations bypass the model's prediction, as stated earlier. On Haswell-EP, the baseline is placed above the SIMD roof due to the significantly higher in-core bandwidth and the fact that the compiler is still able to vectorize some inner loops of the kernel, i.e. the flux assembly. The highest speedup can be seen on Knights Corner, bearing in mind the logarithmic axes, where applying further memory optimisations such as data layout transformations to already efficient SIMD kernels (i.e. via shuffling) delivers close to peak attainable performance.
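The NUMA first-touch optimisation mentioned above can be sketched as follows, assuming a Linux-style first-touch page placement policy and OpenMP threading; the function name is illustrative:

```c
#include <assert.h>
#include <stdlib.h>

/* First-touch sketch: physical pages are mapped on the NUMA node of the
 * thread that first writes them.  Initialising with the same static
 * schedule as the later compute loops keeps each thread's slice of the
 * array local to its socket.  Without -fopenmp the pragma is ignored
 * and the initialisation simply runs serially. */
double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);   /* pages not yet mapped */
    if (!a)
        return NULL;
#pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        a[i] = 0.0;                      /* first write maps the page locally */
    return a;
}
```

The key design point is that `malloc` alone does not place pages; placement happens at the first write, which is why a serial initialisation loop would concentrate all pages on one node.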

[Roofline diagrams for single-core and full-concurrency execution: 1 core SNB E5-2650, 1 core Haswell E5-2650, 1 core KNC 5110P, 2x SNB E5-2650 (16 cores), 2x Haswell-EP E5-2650 (20 cores) and Knights Corner 5110P (60 cores). Each panel plots attainable performance against arithmetic intensity (1/16 to 4) beneath the Peak FP roof, with markers for baseline, autovect, intrinsics+aligned and intrin.+memopt.]

5. Conclusions

We have applied a wide variety of optimisation techniques on two distinct computational kernels in a multiblock CFD code and evaluated their performance on three modern architectures. A detailed description was given of the exploitation of all levels of parallelism available in modern multicore and manycore processors through efficient code SIMDization and thread parallelism. Memory optimisations described in this work included software prefetching, data layout transformations through hybrid data structures such as Array-of-Structures Structures-of-Arrays (AoSSoA) and multi-level cache blocking for the numerical fluxes.

The practicalities of enabling efficient vectorization were discussed at great length. We have established that for relatively simple kernels such as the primary flow variable update, the compiler can generate efficient SIMD code with the aid of portable OpenMP 4.0 directives. This approach, however, does not fully extend to more complex kernels such as flux computations involving a stencil-based access pattern, where best SIMD performance is mandated through the use of aligned load and store operations

made possible via inter-register shuffles and permutations. Implementations of such operations were performed using compiler intrinsics and the VCL vector library and included bespoke optimisations for each architecture. Vectorized and non-vectorized computations exhibit a 2X performance gap for the flow variable kernel and up to 5X in the flux computations on the SandyBridge-EP and Haswell-EP multicore CPUs. The difference in performance is significantly higher on the manycore Knights Corner coprocessor, where vectorized code outperforms the non-vectorized baseline by 13X in updating the flow variables and 23X in computing the numerical fluxes. These figures, correlated with projections that future multicore and manycore architectures will bring further improvements to the width and latency of SIMD units, mandate efficient code SIMDization as a crucial avenue for attaining performance in numerical computations on current and future architectures.

Modifying the data layout from Structures-of-Arrays to a hybrid Array-of-Structures Structures-of-Arrays brought improvements for the vectorized kernels by minimizing TLB misses when running on large grids and by increasing locality. Performance gains were particularly noticeable when running at full concurrency and were best on the Knights Corner coprocessor. Cache blocking on the flux computations coupled with the hybrid data layout delivered the best results on the multicore CPUs when running across all available cores; however, it performed second best on the coprocessor due to increased register pressure.
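A hybrid Array-of-Structures Structures-of-Arrays layout of the kind described above can be sketched as follows; the block width, the set of conserved variables and all names are assumptions for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4  /* SIMD width in doubles, e.g. one AVX register; an assumption */

/* AoSSoA sketch: cells are packed in blocks of VLEN, and within each
 * block every conserved variable is a short contiguous array.  A SIMD
 * load of one variable for VLEN neighbouring cells is then a single
 * aligned, unit-stride access, while consecutive blocks keep all
 * variables of a cell group within a few pages (fewer TLB misses). */
typedef struct {
    double rho[VLEN], rhou[VLEN], rhov[VLEN], rhoE[VLEN];
} cell_block;

/* Map a flat cell index to its (block, lane) position. */
static inline double get_rho(const cell_block *b, size_t cell)
{
    return b[cell / VLEN].rho[cell % VLEN];
}
```

Compared with a plain SoA layout, the working set touched by one stencil sweep spans far fewer distinct pages, which is the TLB benefit the text refers to.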

Core parallelism has been achieved through the use of OpenMP 4.0 directives for the flow variable update kernel, which offer mixed SIMD and thread granularity through common constructs. For the numerical fluxes, the domain was decomposed into slabs on the i and j faces for implementations that performed separate plane sweeps, and into tiles for the fused sweep implementation with cache blocking. For the primary flow variable update we achieved a 2X speedup on SandyBridge-EP and Haswell-EP and 20X on the Knights Corner coprocessor when compared to their respective baselines.

Computations of the numerical fluxes at full concurrency also obtained a 3X speedup on the multicore CPUs and 24X on the coprocessor over the baseline implementation.

The Roofline performance model has been used to appraise and guide the optimisation process with respect to the algorithmic nature of the two computational kernels and the three compute architectures. For single-core execution, the optimised flow variables update kernel achieved approximately 80% efficiency on the multicore CPUs and 90% on the coprocessor. Flux computations obtained 90% efficiency on SandyBridge-EP, 80% on Haswell-EP and approximately 60% on Knights Corner. The relatively low efficiency on the coprocessor is due to the in-order core design, which requires at least two threads to fully populate the coprocessor VPU. The computational efficiency at full concurrency outperforms the model's predictions for both kernels across all three platforms, the only exception being Knights Corner, where flux computations deliver a maximum of 94% efficiency when running on 180 threads.

On a core-to-core comparison among the three processors, the Haswell-based Xeon E5-2650 core performs on average 2X faster than a single SandyBridge Xeon E5-2650 core and 4-5X faster than a Xeon Phi Knights Corner 5110P core across both kernels. However, at full node and chip concurrency, the two-socket Haswell Xeon E5-2650 node is approximately on par with the Knights Corner coprocessor for flux computations and 25% faster for updating the primary flow variables. The Xeon Phi Knights Corner coprocessor and the Haswell node outperform the two-socket SandyBridge Xeon E5-2650 node by approximately a factor of two on flux computations and by 50% to 2X on the flow variable update. However, on flops-per-watt and flops-per-dollar metrics, the Knights Corner 5110P coprocessor delivers superior performance compared to both multicore CPUs, at the cost of the higher development and optimisation time needed for exploiting its underlying features in numerical simulations. The increase in time spent fine-tuning the code on the coprocessor is attenuated by the fact that the majority of optimisations targeting fine and coarse grained levels of parallelism, such as SIMD and threads, are transferable when porting from multicore to manycore, including GPGPUs.

Acknowledgements

This work has been supported by EPSRC and Rolls-Royce plc through the Industrial CASE Award 13220161. We are also very much indebted to Christopher Dahnken of Intel Corporation for his invaluable help and guidance on intrinsics optimisations and Agner Fog for the support provided in using the VCL library. Furthermore, we wish to thank Sophoclis Patsias and Paolo Adami of Rolls-Royce plc for approving publication clearance and Simon Burbidge of Imperial College London for providing access to Haswell-EP nodes.

References

[1] P. Prabhu, T. B. Jablin, A. Raman, Y. Zhang, J. Huang, H. Kim, N. P. Johnson, F. Liu, S. Ghosh, S. Beard, T. Oh, M. Zoufaly, D. Walker, D. I. August, in: State of the Practice Reports, SC '11, ACM, New York, NY, USA, 2011, pp. 19:1-19:12.

[2] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, K. Yelick, Commun. ACM 52 (2009) 56-67.

[3] D. A. Patterson, J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Newnes, 2013.


[4] D. S. McFarlin, V. Arbatov, F. Franchetti, M. Püschel, in: Proceedings of the International Conference on Supercomputing, ICS '11, ACM, New York, NY, USA, 2011, pp. 265-274.


[5] H. Sutter, J. Larus, Queue 3 (2005) 54-62.

[6] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, P. Dubey, in: Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA '12, IEEE Computer Society, Washington, DC, USA, 2012, pp. 440-451.

[7] M. Bader, A. Breuer, W. Hölzl, S. Rettenberger, in: High Performance Computing Simulation (HPCS), 2014 International Conference on, pp. 193-201.

[8] S. J. Pennycook, C. J. Hughes, M. Smelyanskiy, S. A. Jarvis, in: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, IPDPS '13, IEEE Computer Society, Washington, DC, USA, 2013, pp. 1085-1097.

[9] X. Tian, H. Saito, S. V. Preis, E. N. Garcia, S. S. Kozhukhov, M. Masten, A. G. Cherkasov, N. Panchenko, in: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, IPDPSW '13, IEEE Computer Society, Washington, DC, USA, 2013, pp. 1149-1158.

[10] N. G. Dickson, K. Karimi, F. Hamze, J. Comput. Phys. 230 (2011) 5383-5398.

[11] T. Henretty, K. Stock, L.-N. Pouchet, F. Franchetti, J. Ramanujam, P. Sadayappan, in: Proceedings of the 20th International Conference on Compiler Construction: Part of the Joint European Conferences on Theory and Practice of Software, CC'11/ETAPS'11, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 225-245.

[12] S. Rostrup, H. D. Sterck, Computer Physics Communications 181 (2010) 2164-2179.

[13] C. Rosales, in: Extreme Scaling Workshop (XSW), 2013, pp. 1-7.

[14] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, K. Yelick, in: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, IEEE Press, Piscataway, NJ, USA, 2008, pp. 4:1-4:12.

[15] J. Jaeger, D. Barthou, in: High Performance Computing (HiPC), 2012 19th International Conference on, pp. 1-10.

[16] A. Schäfer, D. Fey, in: High Performance Computing for Computational Science - VECPAR 2012, volume 7851 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2013, pp. 451-466.

17] S. Williams, L. Oliker, J. Carter, J. Shalf, in: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, ACM, New York, NY, USA, 2011, pp. 55:1-55:12.

18] M. Vavra, Aero-Thermodynamics and Flow in Turbomachines, John Wiley, Los Alamitos, CA, USA, 1960.

19] F. Grasso, C. Meola, Handbook of Computational Fluid Mechanics, Academic Press, London, 1996.

[20] MIT, Mises, 2009. Accessed: 13-05-2015.

[21] P. Gepner, V. Gamayunov, D. L. Fraser, Procedia Computer Science 4 (2011) 452-460. Proceedings of the International Conference on Computational Science, ICCS 2011.

[22] W. Eckhardt, A. Heinecke, R. Bader, M. Brehm, N. Hammer, H. Huber, H.-G. Kleinhenz, J. Vrabec, H. Hasse, M. Horsch, M. Bernreuther, C. Glass, C. Niethammer, A. Bode, H.-J. Bungartz, in: J. Kunkel, T. Ludwig, H. Meuer (Eds.), Supercomputing, volume 7905 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2013, pp. 1-12.

[23] T. Piazza, H. Jiang, P. Hammarlund, R. Singhal, Technology Insight: Intel(R) Next Generation Microarchitecture Code Name Haswell, Technical Report, Intel Corporation, 2012.

[24] J. Jeffers, J. Reinders, Intel Xeon Phi Coprocessor High Performance Programming, Morgan Kaufmann, Boston, United States, 2013.

[25] Intel Developer Zone, AVX-512 instructions, http://software.intel.com/en-us/blogs/2013/avx-512-instructions, 2013. Accessed: 04-03-2014.

[26] S. Williams, A. Waterman, D. Patterson, Commun. ACM 52 (2009) 65-76.

[27] G. Ofenbeck, R. Steinmann, V. Caparros, D. Spampinato, M. Püschel, in: Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on, pp. 76-85.

[28] J. D. McCalpin, IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter (1995) 19-25.

[29] J. D. McCalpin, STREAM: Sustainable Memory Bandwidth in High Performance Computers, Technical Report, University of Virginia, Charlottesville, Virginia, 1991-2007. A continually updated technical report.


[30] D. H. Bailey, R. F. Lucas, S. W. Williams, Performance Tuning of Scientific Applications, CRC Press, New York, United States, 2011.

[31] J. Treibig, G. Hager, in: Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics: Part I, PPAM'09, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 615-624.

[32] A. Fog, C++ vector class library, 2015. Accessed: 10-05-2015.

[33] VCLKNC, 2015. Accessed: 10-08-2015.

[34] G. van Albada, B. van Leer, W. W. Roberts, in: M. Hussaini, B. van Leer, J. Van Rosendale (Eds.), Upwind and High-Resolution Schemes, Springer Berlin Heidelberg, 1997, pp. 95-103.


[35] C. Hirsch, Numerical Computation of Internal and External Flows, John Wiley and Sons, Chichester, West Sussex, UK, 1990.

[36] P. Roe, Journal of Computational Physics 43 (1981) 357-372.

[37] A. Harten, P. D. Lax, B. van Leer, SIAM Review 25 (1983) 35-61.

Appendix A. Further details

The physical fluxes are approximated with second order TVD-MUSCL [34] [35] [36] numerical fluxes

$$F_{i-1/2,j} = \tfrac{1}{2}\left(F_{i-1,j} + F_{i,j}\right) - \tfrac{1}{2}\,R\,|\Lambda|\,L\,\psi\left(U_{i,j} - U_{i-1,j}\right) \qquad \mathrm{(A.1)}$$

The term $R\,|\Lambda|\,L\,\psi\,\Delta U_{i-1/2,j}$ represents the second order contribution to the numerical fluxes. $\psi$ is the limiter and the flux eigenvectors and eigenvalues $R$, $\Lambda$, $L$ are evaluated at the Roe-average [36] state.

$L = R^{-1}$ is the matrix of the left eigenvectors of the Roe-averaged flux Jacobian. $\qquad \mathrm{(A.2)}$

The evaluation of the numerical fluxes also requires the Roe-averaged eigenvalues and right eigenvectors of the flux Jacobian, which are evaluated as follows

$$\Lambda = \mathrm{diag}\left(\hat{U}_n,\;\hat{U}_n,\;\hat{U}_n + \hat{a},\;\hat{U}_n - \hat{a}\right) + \epsilon \qquad \mathrm{(A.3)}$$

$$R = \begin{pmatrix} 1 & 0 & 1 & 1 \\ 0 & t_m & (\hat{U}_n + \hat{a})\,n_m & (\hat{U}_n - \hat{a})\,n_m \\ 0 & t_\eta & (\hat{U}_n + \hat{a})\,n_\eta & (\hat{U}_n - \hat{a})\,n_\eta \\ \hat{k} & \hat{U}_t & \hat{h} + \hat{U}_n \hat{a} & \hat{h} - \hat{U}_n \hat{a} \end{pmatrix} \qquad \mathrm{(A.4)}$$

where $\epsilon$ is Harten's entropy correction [37]. The Roe-averaged state is defined by

$$\hat{U}_m = \omega\,U_{m,i+1,j} + (1-\omega)\,U_{m,i,j} \qquad \mathrm{(A.5)}$$

$$\hat{U}_n = \omega\,U_{n,i+1,j} + (1-\omega)\,U_{n,i,j} \qquad \mathrm{(A.6)}$$

$$\hat{h} = \omega\,h_{i+1,j} + (1-\omega)\,h_{i,j} \qquad \mathrm{(A.7)}$$

$$\hat{a}^2 = (\gamma - 1)\left(\hat{h} - \hat{k}\right) \qquad \mathrm{(A.8)}$$

$$\hat{\rho} = \sqrt{\rho_{i+1,j}\,\rho_{i,j}} \qquad \mathrm{(A.9)}$$

$$\omega = \frac{\sqrt{\rho_{i+1,j}}}{\sqrt{\rho_{i+1,j}} + \sqrt{\rho_{i,j}}} \qquad \mathrm{(A.10)}$$

Similar definitions are applied for the $G_{i,j-1/2}$ numerical flux vector.