Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 4 (2011) 372-381

International Conference on Computational Science, ICCS 2011

Parallel application benchmarks and performance evaluation of the

Intel Xeon 7500 family processors

Piotr Koptaa, Michal Kulczewskia, Krzysztof Kurowskia, Tomasz Pionteka, Pawel Gepnerb, Mariusz Puchalskic,

Jacek Komasac

aPoznan Supercomputing and Networking Center, Noskowskiego 10 Street, 61-704 Poznan, Poland

bIntel Corporation, Pipers Way, Swindon Wiltshire SN3 1RJ, United Kingdom cFaculty of Chemistry, Adam Mickiewicz University, Grunwaldzka 6 Street, 60-780 Poznan, Poland

Abstract

With the recent advent of novel multi- and many-core hardware architectures, application programmers have to deal with many hardware-specific implementation details and have to be familiar with software optimization techniques to benefit from new high-performance computing machines. Highly efficient parallel application design is in fact an interdisciplinary process involving domain specific and IT experts. Therefore, this paper aims to present early experiences with computationally demanding applications, development efforts and evaluation of their performance on the new family of Intel Xeon 7500 processors. We selected two application benchmarks applicable to real quantum chemistry and Computational Fluid Dynamics (CFD) problems as they can potentially take advantage of parallel processing on novel hardware architectures and built-in new features. Additionally, we discuss various parallel software improvements to mentioned applications, including appropriate changes to data structures as well as to communication and synchronization routines to deal with multi-level parallelism and hybrid hardware architectures. The obtained results confirmed that new hardware solutions can improve the overall application performance. However, in order to obtain a high level of parallel scalability various application modifications and tuning procedures are required as hardware configurations, including processors characteristics, interconnects and topologies, and they have a great influence on large-scale simulations.

Keywords: Intel Xeon 7500, parallel applications, high performance computing, quantum chemistry, CFD, petascale computing

1. Introduction

With the recent advent of many-core processors we observed a significant shift in computational science as well as in many classical commercial software packages. Today, to run the code efficiently on a certain massively parallel hardware architecture a scientist has to be familiar not only with domain-specific algorithms and parallel data structures [15, 14]. In this paper we present two complementary benchmarking experiments on scientific applications representing numerical algorithms related to large scale quantum-chemical and computational fluid dynamics (CFD) calculations. The first legacy application was redesigned and implemented from scratch to take advantage of new high performance features built-in in new Intel processors. In contrast, the second application was not optimized and we used it mostly to benchmark the legacy scientific code. In comparison to its predecessors, a new Intel Xeon 7500 family proved to be successful during benchmarks presented in this paper. With the use of a set of well-defined testing

1877-0509 © 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of Prof. Mitsuhisa Sato and Prof. Satoshi Matsuoka doi:10.1016/j.procs.2011.04.039

Table 1: A short summary of key features provided by Intel Xeon processors

Intel Xeon processor type L7555 L5640 L5518 E5345

Technology (nm) 45 32 45 65

Cores per socket 8 6 4 4

Intel Turbo Mode Yes No

Intel HT Yes No

Core frequency (MHz) 1866 2267 2133 2333

L1 cache size 32 KB Inst / 32 KB Data per core

L2 cache size 256 KB per core 4096 KB per core

L3 cache size 24 MB 12 MB 8MB No

Integrated memory controller Yes 2xSMI Yes No

Front side bus FSB No 1333 MHz

Memory transfer rate/socket 34 (GB/s) 32 (GB/s) 25.6 (GB/s) 21 (GB/s)

Intel Quick Path Interconnect 4 Links(5.86 or 4.80 GT/s) 2 Links(5.86 GT/s) 2 Links(5.86 GT/s) No

Number of threads/core 2 2 2 1

TLB Page Size 4KB, 2MB, 1GB 4KB, 2MB, 1GB 4KB, 2MB 4KB, 2MB

TDP(W) 95 60 60 80

benchmarks we successfully examined a new Intel Xeon 7500 family member in contrast to its predecessors. The obtained performance results are discussed in details for two applications representing in fact two different computing intensive benchmarking procedures. The rest of this paper is organized as follows. In Section 2 we briefly describe main hardware characteristics of different Intel Xeon processors widely adopted in the high performance computing area. Section 3 presents main assumptions and challenges for computationally demanding simulations for quantum chemistry. Additionally, in this section we discuss various parallel algorithm design issues and improvements that we implemented to take advantage of novel hardware architectures for advanced simulations in quantum chemistry. In order to verify the efficiency of new Intel processors, the impact in particular of memory size, bandwidth and latency on the overall application performance, we decided to experiment with Lattice Boltzmann methods (LBM) simulations that are discussed in Section 4. Section 3 and Section 4 also include detailed descriptions of application benchmarks and the obtained experimental results on different Intel Xeon processors and their large-scale configurations. Section 5 concludes our paper and proposes future directions for application experiments as well as new challenges we envision for peta- or even exascale computing in the near future, from both software development and hardware perspectives.

2. Hardware architecture

The Intel Xeon 7500 family is the first generation of eight core Intel CPUs dedicated to more than dual socket servers, it is also the first eight core processor with integrated memory controller and QPI interface. Intel Xeon 7500 family processors have been designed with all the advantages of 45nm Hi-k metal gate silicon technology. This process technology uses a combination of Hi-k gate dielectrics, conductive materials, and a 45 nm lithography process to improve transistor size and properties such as reduce electrical leakage, chip size, power consumption, and manufacturing costs [7]. Intel Xeon L7555 is a 45 nm eight core monolithic die with 24MB of L3 cache, 4 channel integrated memory controller and integrated 4 ports of Quick Path Interconnect interface build out of 2.3 billion transistors. The Intel Xeon L7555 cache hierarchy has three levels, L1 and L2 staying fairly small and private to each core, while the L3 cache is much larger. Each core in Nehalem-EX has a private 32KB first level L1 Instruction and 32kB Data Cache. In addition the unified 256KB L2 cache is 8 way associative and provides extremely fast access to data and instructions; the latency is typically smaller than 11 clock cycles. The Intel Xeon 7500 family has 24MB L3 cache, which gives 3 MB per core. This ratio is better than Intel Xeon 5600 family offers. Intel Xeon L7555 has many RAS features and technology dedicated for Mission Critical computing. Some of these features might also be useful in HPC scenarios especially to enhance the reliability of the system and implement more sophisticated checkpoint mechanisms. All these enhancements may have different performance implications depending on the fields and areas of application. For Intel Xeon L7555 theoretical performance we have = 1.86 GHz x 4 operations per clock x 8 cores = 59.52 GFLOPS. This is theoretical performance only, and does not fully reflect the real life scenario. The main processors characteristics are surmised in Table 1.

3. Quantum chemistry application

3.1. Motivations

The Schrodinger equation provides the foundation for the calculation of the electronic structure of atoms and molecules. Unfortunately, it is solvable exactly only for the simplest one-electron systems. For many-electron systems approximation schemes are mandatory and significant computational effort is usually involved. Despite the general success of the commonly used orbital methods, in which each electron is assumed to move in the average field of the other electrons, the number of quantum-chemical problems that may be solved to a high accuracy by performing such calculations is rather restricted. This is particularly observable for light atoms and molecules, where the orbital methods, such as the Hartree-Fock or configuration interaction, are not capable of yielding results of spectroscopic accuracy. In such a case, specialized techniques with qualitatively correct description of electron correlation are indispensable.

Over the years, a number of variational methods of solving the Schrodinger equation have been proposed and developed, taking into account the electron correlation explicitly in the construction of trial wave function. It was first demonstrated many years ago by Hylleraas for atoms [8] and by James and Coolidge for molecules [9]. Their wave function represented in the basis set of exponential functions is known to describe asymptotics of the exact wave function properly. However, the area of the application of the Hylleraas and the James-Coolidge wave functions is strictly limited to the low-lying states up to three-electron atoms and diatomic two-electron molecules respectively, due to inherent difficulties in the accurate and efficient calculations of many-electron and multicenter integrals.

In this paragraph we introduce a minimum physical underpinnings needed to understand the origins of our algorithm. We shall refer to the formulas given here while describing particular elements of the algorithm. We attempt to solve variationally the Schrodinger equation

H Y = E Y, (1)

where Y is the wave function corresponding to a particular state of an atom or a molecule, H is the clamped nuclei Hamiltonian describing all the Coulomb interactions between electrons and nuclei as well as the kinetic energy of the electrons, and E is the electronic energy of the system. In solving this equation for a given Hamiltonian we search for both the energy and wave function. The trial wave function will be represented as a ^-term linear combination of N-electron ECG basis functions <

T = £ c ft, (2)

4>i = AP

nN=igik(zk)exp £ J] #

^ p=1 q=p+1

0 . (3)

In the above definition, rk are electron spatial coordinates, rpq are interelectronic distances, A is the antisymmetrizer working on space and spin coordinates, P is the spatial symmetry projector, 0 is the N-electron spin function and the Cartesian Gaussian functions gik are given by

gk(rk) = X - Akx)1 (yk - Aky)m (zk - Akz)n exp(-a ft - Ak|2). (4)

The antisymmetry projector A ensures that the wave function takes properly into account the indistinguishability of the particles - one of the most fundamental physical property of systems with identical particles. When acting on the electron coordinates it generates a sum N! terms differing by electron coordinate permutations. The N! explosion affects all the matrix elements and, in highly accurate energy calculations, becomes the main factor limiting the size of the studied systems to a few electrons. The non-linear variational parameters a,¡3 and A are variables of optimization process. The total number of the nonlinear parameters depends on the size of the atom or molecule (expressed by the number of electrons N and nuclei) and on the size of the expansion (2). In the most demanding cases the wave function (and so the energy) depends on over 100 000 non-linear parameters. Finding an energy minimum in a space with so many dimensions is a highly nontrivial and computationally very demanding task. For a given set of the nonlinear parameters, the optimal vector of the linear parameters ci and the corresponding approximation e to the exact

electronic energy can be found by solving the Schrodinger equation represented in the matrix form of the general symmetric eigenvalue problem (GSEP)

HC = e Sc, (5)

where the matrix of the Hamiltonian H and the overlap matrix S are built of elements expressed by 4N-dimensional integrals over all coordinates dVi... dVN of the electrons [2, 24, 3]

Hij = J (piH(p j dVi ...dVN, (6)

Sij = J(itjdVi ...dVN . (7)

The above method is, in principle, general and can be applied to any N-electron atomic or molecular system. Routinely, such calculations are performed for the systems with up to four electrons yielding state-of-the-art accuracy for the energy and other properties [3, 5, 21]. To move further on to larger systems, while keeping the accuracy at analogous level, is our present day challenge. Fortunately, currently accessible computational resources give chance for a progress in calculations towards the systems with five, six and even seven electrons. The most natural way it can be achieved is a transformation of the existing sequential algorithms to parallel versions [4, 11]. In the next step, we need to perform large scale tests to compare different strategies for optimization and to select critical sections of the implemented programs, which depend on high-performance computing architecture employed. Part of our experience in this field is discussed in the next sections.

3.2. New algorithm

The general calculation schemes based on ECG functions seems to be well established, but the application to a system with each additional electron involves reanalysis of the algorithms. In order to achieve the spectroscopic accuracy for a few-electron system, we need to use the basis set built of ~ 103 - 104 functions (i. Next, we have to perform large scale energy minimization in a multidimensional space spanned by the variational parameters. During the optimization the energy e has to be evaluated 106 - 107 times in order to get closer and closer to the exact energy E. The main problem of this large scale optimization is the omnipresence of local minima which potentially prevents from getting close to the global minimum and, what is more, the number of such minima tends to increase rapidly with the size of the problem. There are plenty of different methods for the optimization, but the deterministic optimization with the moves downwards on the slope of the function according to a certain well-defined strategy (i.e. methods with or without energy gradient) seems to be the best choice for a given number of basis functions, K.

Based on the theory presented above, we implemented computational procedures in a hybrid parallel computing model combining the Message Passing Interface (MPI), to allow processes to communicate over the network, together with the shared memory model (multithreading) on local nodes. In order to present main functional blocks, a simplified version of the core algorithm is presented in Fig. 1.

The most computationally demanding sections of the program are named: scalar and vector parts. The scalar part computes the matrix elements of the Hamiltonian H, Eq. (6), and the overlap matrix S, Eq. (7). As mentioned above, the complexity of the algorithm depends factorially on the number of electrons of the quantum system due to permutation of the electrons (generated by the antisymmetrizer A of Eq. (3)) in the basis function (i. The vector part consists of the solution of GSEP, Eq. (5). This time-consuming component of the algorithm is based on a Cholesky decomposition of the matrix H - s, S with a trial value s, assumed close to the desired energy e. Next, we use the inverse iteration procedure [20] to improve the energy s, by solving in each iteration a triangular system of equations - the energy converges to e in a few iterations. In total, the vector part can be evaluated in O(K3) operations. Below we describe also an updating procedure which enables diminishing the execution time to O(K2). The next critical element of the algorithm is the optimization of the very large number of non-linear variational parameters a,¡3 and A. Our experience from sequential version of the algorithm suggests to use iterative optimization procedure from i = 1 to i = K tuning in i-th step the parameters of the basis function (i. Thus, it implies changes of variational parameters defining objective energy function during the optimization process being controlled by the Powell's conjugate directions method [19]. If new parameters do not cause further energy improvements, the algorithm is stopped.

Figure 1: Generic steps representing main quantum-chemical computational procedures

Figure 2: Functional and parallel decompositions proposed for the leg quantum chemistry application

A single execution of scalar and vector parts for given parameters set constitutes what we call the optimization shot and it gives a single value of the energy e. In practice, a number of required shots to reach the convergence turned out to be of the order of 100 times the number of the nonlinear parameters. For example, in a real case (LiH molecule) of a 4-electron wave function with 2400 terms and 14 nonlinear parameters per term, the following task has to be completed more than 106 times:

• scalar part - evaluation of the matrix elements Hij , Sij, with thousands of floating-point operations per each element,

• vector part - solution of the GSEP for matrices with K = 2400 rows and columns.

3.3. Application improvements

In this section we briefly discuss all the parallelized parts of the algorithm and improvements we introduced to the legacy code. Please note that the scalar part is built on a distributed shared memory model. All elements of the matrices H and S can be calculated independently, so there are no synchronization points in the execution and the scalability of our algorithm on many cores we took for granted. Potentially, it would be possible to split the single matrix element evaluation on N! independent tasks related to the permutation of electrons. We are going to utilize this feature in the future versions. It will be particularly valuable for the systems with 6, 7 electrons, where there are hundreds of permutations. With the 4 electron system (N = 4), the scalability barrier can be achieved for both the 64 Intel Xeon 5345 and Xeon 7500 processors.

The second part implemented in parallel is the vector part based on the shared memory model. The GSEP algorithm contains the following dominant operations: the Cholesky decomposition, a solution of the triangular system of equations, and matrix-vector multiplication. We decided to use the highly efficient BLAS/LaPACK implementation called GotoBLAS2 to accelerate these operations [28]. In a nutshell, GotoBLAS2 uses new algorithms and memory

techniques for optimal performance of the BLAS routines. The GotoBLAS2 library takes advantages of new architecture features in processors and interprocessor communication techniques and enhance multi-threaded execution of BLAS routines on node.

The most complex function is the Cholesky decomposition (O(K3)). The full Cholesky decomposition is calculated once per (average) tens iterations of Powell's optimization of nonlinear parameters (see Fig. 2). In the remaining inter steps the updating of the Cholesky decomposition is invoked related to single row/column change O(K2). So, the most time-consuming operation in the GSEP algorithm is a solution of the triangular system of equations with biggest constant at O(K2). Unfortunately, it's nature does not warrant linear scalability with the increase in the number of processors. The performance tests we conducted shown clearly, that with the matrices of the size K > 4000 increasing a number of cores to more than 8 brings no performance gain in the execution of the GSEP algorithm. The overall architecture of the GSEP algorithm is as follows. At the beginning, the master MPI process reads all input parameters, and subsequently it broadcasts tasks to the other MPI processes. If a single MPI process have received the data, it starts calculation of the elements in H and S. Each process calculates a different part of the matrices, and returns results to the master process. Then, if all requests are completed and available to the master process, it calculates the initial energy by the GSEP algorithm. At this stage, a new multi-threaded GotoBLAS2 library is used. Once the initial energy is calculated, the nonlinear optimization process for subsequent rows of the H and S matrices starts. Optimization parameters for the single row consist of (on average) tens of optimization shots - calculation of the elements of single row and then GSEP algorithm call. During the optimization shot, all worker processes calculate row elements, while the master process is using parallel threads performs the GSEP algorithm.

3.4. Performance tests

Our performance and scalability tests were performed for a 6-electron molecule (CH+, metylidenium cation) with the wave function expanded in 1024 basis functions (K = 1024). In all performed experiments we have measured the execution time to optimize the first 10 rows of H and S matrices. Tests were successfully performed on the following Intel Xeon processors available in three different computing clusters (see Table 1 for more detailed processors characteristics):

• stella.hpclab.net - 16 nodes, each node consists of 4 x Intel Xeon L7555

• rumex2.man.poznan.pl - 1 node, 2 x Intel Xeon L5518

• reef.man.poznan.pl - 8 nodes, each node consists of 2 x Intel Xeon E5345

Fig. 3 contains the execution time of a new version of the parallel version of quantum chemistry application tested on 1, 8,16, 32,64, 96, 128,256 and 512 cores respectively on the stella.hpclab.net cluster. Additionally, we measured an average execution time of both the scalar and vector parts in single optimization shots. Once can easily note that for the considered problem in quantum chemistry, 6-electron molecule, after the performance tuning procedures we finally achieved a very good speedup ratio. Adding more Intel Xeon 7555 cores to our benchmarking experiments did not affect the overall speed up as it is presented in Fig. 3. However, we should take into account that the speed up ratio of both scalar and vector parts differs significantly. As it was discussed in previous paragraphs, our benchmarks confirmed that the scalar part scales well, whereas the vector part does not as its execution time is the same no matter how many cores we use. Therefore, it is important to adjust steering parameters of the considered parallel application to keep a good balance between the execution of scalar and vector procedures. For the tested problem the ratio between scalar and vector parts for 8 cores was about 275, whereas for 512 cores was only 4. Therefore, performance improvements introduced to the application allowed us to run this application efficiently, especially for a large number of cores what was one of the key motivations for our research.

Additionally, we wanted to compare the efficiency of different Intel Xeon processors using the considered application. The new Intel Xeon L7555 processor is about 20% faster than E5345, and about 8% faster than L5518 for the 8 core setup as it is presented in Fig. 4. For a larger number of cores we observed even a bigger impact of new built-in hardware features on the overall application efficiency, in particular better memory transfer rate and a bigger L3 cache size (see Table 1). We executed the application on 64 cores processors and it turned out that despite the fact that the new Intel Xeon L7555 processor uses lower core voltage and runs at much lower frequency than E5345 the execution time of the application was reduced by 60%. We also wanted to compare the efficiency of the latest

Figure 3: The overall speedup achieved for the parallel version of the quantum chemistry application

Figure 4: Performance results obtained for the quantum chemistry ap cation on different Intel Xeon processors

Intel Xeon processor to its predecessors for other data or memory intensive applications. As the application is not a good example for such experiments we decided to use other application benchmarks which are described in the next section.

4. Computational fluid dynamics application

4.1. Motivations

Computational Fluid Dynamics (CFD), a part of fluid mechanics class, uses various numerical methods to analyze and solve fluid flows. In this paper we focused on one of such alternatives - the Lattice Boltzmann Method (LBM) [26]. LBM is a numerical technique for the simulation of fluid flows. However, it could be treated as an alternative to classical solvers of partial differential equations, as it can solve the incompressible, time dependent Navier-Stokes equations. LBM is based on the microscopic models and mesoscopic kinetic equations. The macroscopic quantities, such as velocity and density, are obtained through moment integration of the distribution function. LBM allows to model the fluid flow consisting of particles, which perform propagation and collision processes over a discrete lattice mesh. The collision step occurs locally, while a propagation process is responsible for moving updated distribution functions to neighboring nodes.

4.2. Architecture

The lid-driven cavity problem has been used as a benchmark and validation case for a long time [26], [16]. It is easy to implement, it can be solved with different numerical approaches, and the boundary conditions of the simulation in many cases are quite easy to be represented. In our case, where LBM is implied, a considered fluid is contained in a cubic (D3Q19) with Dirichlet velocity boundary conditions on all walls and with the Bhatnagar-Gross-Krook (BGK) collision operator [22]. The top lid is driven with constant velocity in a direction parallel to one of the two diagonals. The LBM equation with the BKG collision operator is given by

fi(-x + est,t+st) = fi&) - - f cxo - .№0] (8)

where (eiSt,t + St) and St are space and time increments respectively. f is the single particle velocity distribution along the i direction, while t is the single relaxation time, and ffq is the equilibrium distribution function. The left side of equation describes the propagation process, while the right side describes the collision step. The discrete velocities — of a 3-dimensional 19-speed lattice model D3Q19 are expressed as

( (0,0,0)c i=0

ei = \ (±1,0,0)c, (0, ±1,0)c, (0,0, ±1)c i=1.....6 (9)

( (±1, ±1,0)c, (±1,0, ±1)c, (0, ±1, ±1)c i=7.....18

where c = , St is the time step and Sx is the adjustable lattice size.

We consider the 3-dimensional 19-speed lattice model. Each node has a crystal shape and can deliver particles to each of the six neighboring nodes which share surface, the eight neighboring nodes sharing an edge, the four neighboring nodes sharing corners, and itself. The lid-driven cavity problem was prepared with the help of Palabos - a general purpose open software written in C++ for computational fluid dynamics simulations [29]. Palabos offers many physics models, including incompressible Navier-Stokes equations, flows with body-force term or static Smagorinsky model for fluid turbulence. It covers various grid models (including D2Q9, D3Q19), boundary conditions (e.g. bounce back, periodic, simple equilibrium, Dirichlet condition) and some basic fluid models (e.g. BGK: Bhatnagar-Gross-Krook, MRT: Multiple-Relaxation-Time). The software is seen as a robust and reliable one, which scales up well up to thousands of processors. A typical approach to parallelize LBM is a domain decomposition. In this approach domain is divided fairly between the cores into blocks, so that each block could be processed by different core. For example, a LBM problem of domain size 128 by 128 by 128 run on 8-cores machine could be divided into 8 block of 16 by 16 by 16 points each.

As far as data storage is concerned, Palabos handles two classes of data. Data such as block-lattices, scalar and tensor fields are spatially distributed - the data is partitioned into smaller blocks that are distributed between the cores (MPI paradigm). Data that require small amount of memory are duplicated on every core. Standard C++ and STL types are duplicated using the Single Program Multiple Data approach.

4.3. Performance tests

The LBM lid-driven cavity 3D problem was prepared in Palabos as a benchmark suite, measuring number of lattice sites updates per second. We run the test suitcase using the BGK collision operator and D3Q19 grid. The top lid is driven with constant velocity of 1e-2 in a direction parallel to one of the two diagonals. The benchmark is challenging because of the velocity discontinuities on corner nodes. The domain size was attuned to meet the memory requirements of Palabos' LBM implementation, hence we restricted the mesh size to contain from 200 to 1,000 points in each dimension (with no grid refinement). The Reynolds number has been set to 1. The test has been performed on two different clusters, each equipped with different processor type:

• stella.hpclab.net (stella) - 3 nodes used, each node consists of 4 x Intel Xeon L7555 1.87 GHz

• ui.cyf-kr.edu.pl (zeus) - 8-16 nodes used, each node consists of Intel Xeon L5640 2.27 GHz

Figure 5: Palabos lid-driven cavity 3D performance comparison for different domain size

Figure 6: Palabos lid-driven cavity 3D performance comparison for main size 400 over a different number of cores

The obtained performance results are presented in Fig. 5. As we can see, the new Intel Xeon L7555 processors gave the best performance. It is worth to note that increasing number of cores does not always result in better performance. This is because the Palabos tries to divide the grid fairly between the available cores. When the ratio grid size to processors lowers greatly, the performance start decreasing. That was for us the main motivation to run additional benchmarks to check if and to what level we can speedup the the test suitcase for the constant problem size when adding more computation resources. The domain size was set to 400 points in each dimension; other parameters remained the same. Our benchmarking tests were performed on different setups from 8 to 96 cores (with step of 8) on the following three different clusters:

• stella.hpclab.net (stella) - 3 nodes used, each node consists of 4 x Intel Xeon L7555 1.87 GHz

• ui.cyf-kr.edu.pl (zeus) - 8 nodes used, each node consists of Intel Xeon L5640 2.27 GHz

• reef.man.poznan.pl (reef) - 8 nodes used, each node consists of 2 x Intel Xeon E5345 2.33 GHz

It is interesting that both clusters, reef and zeus, reach optimum performance at 32 cores (see Fig. 6). Starting with 40 cores the performance decreases. This is however not the case for stella cluster, which increases application performance constantly up to 96 cores being tested, although the clock frequency of stella processors are worse than those reef and zeus are equipped with. One of the main difference between stella processors and the other ones is the L3 cache size. The reef and zeus processors are equipped with 8MB and 12MB of L3 cache respectively, while stella processors have 24MB of L3 cache built in. In the LBM, the array to store distribution functions, that need to be sent in propagation step as described in previous sections, consumes most of the memory, especially for 3D problems. Passing a large array can dramatically slow calculations because cache misses and page faults may appear. This problem appears to be valid for the Palabos version of lid-driven cavity problem, as it was proofed by some additional tests we performed. It appeared that a lot of L3 cache misses take place, as well as iTLB (instruction Translation Look-aside Buffer) load misses, probably because of the Palabos application code that does not fit to the iTLB size and of many far-instruction jumps that take place. The new and forthcoming Intel processors could dramatically increase performance of many applications which requires a lot of memory or are struggling with many misses in cache. This is because there is a significant increase in size of L3 cache in the new Intel processors. When the size of cache is doubled, adding more cores results in impressive increase in the performance.

5. Conclusions and future work

The performance tests in our study inspected the Intel Xeon L7555 series from the parallel application development perspective. However, if we target future exa-scale class of the system then we will have to be ready for new challenges in software design. The exa-scale class system will require billions of threads and it will consists of an enormous number of small scalar-core with massive multi-threading or few less for bigger vector cores. In principle, high performance computing hardware architectures will involve 25 million large vector cores or order of 100- 200 million small scalar cores. Today, many scientists are looking for new numerical algorithms and applications to be able to use in parallel thousands or million cores efficiently. As we demonstrated is this paper to program efficiently even a few hundreds of cores it is not a trivial task and it does require multidisciplinary skills. However, in our opinion all efforts we performed to modify, redesign and implement a new highly parallel software will bring many added values in the future as the roadmap for many-core hardware, in particular Intel processors, has been already defined for the next few years.

In fact, the excellent scaling of the scalar part of our quantum chemical algorithm is very promising and will be utilized extensively in production calculations. Our nearest target are the boron atom (5 electrons) and CH+ cation (6 electrons), for which benchmark energies are expected to be obtained. In the future work, we are aiming a further tuning of the code to the Intel architecture as a basis of calculations. Additional work on the rearrangement of the granularity of the parallelization within the N! loop to adapt the code to the growing size of the molecules is expected to improve efficiency of multi-core architecture usage. Considering implemented algorithm, a further analysis of code parallelization of the vector part of the code should be completed. Particularly, we hope to improve the speedup of GSEP, Eq. (5), by adjusting the existing parallel algorithms to the new architectures and available network layout. New optimization methods, particularly with the gradient calculation, are interesting alternative to the one currently in use.

The presented CFD-based benchmarks encouraged us to future work with other parallel CFD-based applications in higher resolution. Taking Numerical Weather Prediction (NWP) as an example, the problem can be considered at a global level (for whole planetary atmosphere) or at mesoscale level as well, where higher resolution and complex boundary conditions are imposed. This requires higher accuracy fluid solvers using efficient CFD numerical procedures. Today, the most powerful supercomputers enable scientists to run the NWP models at about 50km spatial resolution for global models, and 5-2km at mesoscale level, often reaching 1km resolution. Such a great spatial resolution not only needs a lot of computational power, but also an appropriate refinement of models being used, due to numerical stability demand. A related work has been performed with the EULAG - a robust and highly parallel

model for fluid flows across wide range of scales. Authors presented preliminary results of simulations of the flow over realistic topography of the Alps, demonstrating performance of the code on the IBM BlueGene/L machine in [17]. The tested CFD-based code scaled well up to the 1000 of cores, achieving almost a linear speedup with hopes for running simulations at even greater resolution in a reasonable amount of time.

6. Acknowledgments

We gratefully acknowledge the help and support provided by Jamie Wilcox from Intel EMEA Technical Marketing HPC Lab. This research was partially supported by the POIG PL-Grid and POWIEW projects.

References

[1] Bachorz, R., Komasa, J. "Variational calculations on using exponentially correlated Gaussian wave functions.", Computational Methods in Science and Technology, Vol. 11(1), Scientific Publishers OWN, Poznan 2005, p. 5-9.

[2] Boys, S.F. "The Integral Formulae for the Variational Solution of the Molecular Many-Electron Wave Equations in Terms of Gaussian Functions with Direct Electronic Correlation". Proc. R. Soc. A 258, 402 (1960).

[3] Cencek, W. and Rychlewski, J. 'Many-electron explicitly correlated Gaussian functions. I. General theory and test results". J. Chem. Phys. 98, 1252. (1993).

[4] Cencek, W., Komasa, J., Rychlewski, J. "High-performance computing in molecular sciences.", Handbook on Parallel and Distributed Processing (edited by J.Blazewicz, K.Ecker, B.Plateau, D.Trystram) Springer-Verlag (2000), p. 505-551.

[5] Cencek, W., Szalewicz, K. Ultra-high accuracy calculations for hydrogen molecule and helium dimer", International Journal of Quantum Chemistry 108, 2191-2198 (2008).

[6] Gepner, P., Kowalik, M.F.: Multi-Core Processors: New Way to Achieve High System Performance. PARELEC 2006, pp. 9-13 (2006)

[7] Gepner, P., Fraser, D.L., Kowalik, M.F.: Multi-Core Processors: Second generation Quad-Core Intel Xeon processors bring 45nm technology and a new level of performance to HPC applications. ICCS 2008, pp. 417-426 (2008).

[8] Hylleraas, E.A. "Neue Berechnung der Energie des Heliums im Grundzustande, sowie des tiefsten Terms von Ortho-Helium". Z. Phys. 54, 347 (1929).

[9] James, H.M., Coolidge, A.S. "Wave Functions and Potential Curves for Excited H2". Journal of Chemical Physics 1, 825 (1933)

[10] Komasa, J., Cencek, W., Rychlewski, J. "Application of Explicitly Correlated Gaussian Functions to Large Scale Calculations on Small Atoms and Molecules", Computational Methods in Science and Technology, Vol. 2, Scientific Publishers OWN, Poznan 1996, p. 87-100.

[11] Komasa, J., Rychlewski, J. "Solving quantum-mechanical problems on parallel Systems.", Parallel Computing, 26, 999-1009 (2000).

[12] Komasa, J., Cencek, W. "Exponentially Correlated Gaussian Functions in Variational Calculations. The EF 1Z+ State of Hydrogen Molecule.", Computational Methods in Science and Technology, Vol. 9(1-2), Scientific Publishers OWN, Poznan 2003, p. 79-92.

[13] Komasa, J. "The Z states of the molecular hydrogen", Physical Chemistry Chemical Physics, 10, 3383-3389 (2008).

[14] Kurowski, K., Back, W., Dubitzky, W., Gulyas, L, Kampis, G., Mamonski, M., Szemes, G., Swain, M. Complex System Simulations with QosCosGrid, In Proc of 9th International Conference on Computational Science (ICCS 2009), pp. 387-396, Baton Rouge, USA, (2009).

[15] Kurowski, K., Piontek, T., Kopta, P., Mamonski, M., Bosak, B. Parallel large scale simulations in the PL-Grid environment. Computational Methods in Science and Technology, Volume special issue 2010, pp. 47-56, (2010).

[16] Ni, J., Lin, C-L, Zhang, Y., He, T., Wang, S., Knosp, B. M. 2004. Parallelism of Lattice Boltzmann Method (LBM) for Lid-driven Cavity Flows. In: Proceedings of High Performance Computing and Applications (HPCA2004), August 8-10, 2004, Shanghai, P. R. China

[17] Piotrowski, Z.P., Kurowski, M.J., Rosa, B., Ziemianski M. EULAG Model for Multiscale Flows - Towards the Petascale Generation of Mesoscale Numerical Weather Prediction, PPAM 2009, Part II, LNCS 6068, pp. 380-387, 2010

[18] Piszczatowski, K., Lach, G., Przybytek, M., Komasa, J., Pachucki, K., Jeziorski, B. "Theoretical Determination of the Dissociation Energy of Molecular Hydrogen". Journal of Chemical Theory and Computation 5, 3039 (2009),

[19] Powell, M.J.D. "An eeficient method for finding the minimum of a function of several variables without calculating derivatives". Computer Journal, Vol. 7, 2 (1964).

[20] Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.F. Numerical Recipes in FORTRAN 77, 2nd ed. (Cambridge University Press, Cambridge, 1992).

[21] Przybytek, M., Cencek, W., Komasa, J., Lach, G., Jeziorski, B., Szalewicz, K. "Relativistic and Quantum Electrodynamics Effects in the Helium Pair Potential". Phys. Rev. Lett. 104, 18303 (2010).

[22] Qian, Y,H., DHumieres, D., Lallemand, P. Lattice BGK models for Navier-Stokes equations, Europhysics Letter, 17, 479-484, (1992)

[23] Rychlewski, J., Komasa, J. "Explicitly correlated functions in variational calculations.", Explicitly Correlated Wave Functions in Chemistry and Physics (edited by J.Rychlewski) Kluwer Academic Publishers, Dordrecht (2003), p. 91-147.

[24] Singer, K. "The Use of Gaussian (Exponential Quadratic) Wave Functions in Molecular Problems. I. General Formulae for the Evaluation of Integrals". Proc. R. Soc. A 258, 412 (1960).

[25] Stanke, M., Komasa, J., Bubin, S., Adamowicz, L. "Five lowest 1S states of the Be atom calculated with a finite-nuclear-mass approach and with relativistic and QED corrections". Phys. Rev. A 80, 022514 (2009).

[26] Succi, S. The Lattice Boltzmann Equation for Fluid Dynamics and Beyond. Oxford University Press, USA (2001)

[27] Torchala, M., Komasa, J. "Efficiency of matrix elements computations on parallel systems.", Computational Methods in Science and Technology, Vol. 9(1-2), Scientific Publishers OWN, Poznan 2003, p. 137-145.

[28] GotoBLAS2 library - http://www.tacc.utexas.edu/tacc-projects/gotoblas2/

[29] http://www.lbmethod.org/palabos