CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2013; 25:1170-1182

Published online 10 October 2012 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.2933

SPECIAL ISSUE PAPER

Matrix inversion on CPU-GPU platforms with applications

in control theory

Peter Benner1, Pablo Ezzatti2, Enrique S. Quintana-Ortí3 and Alfredo Remón3'*'^

1Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany 2 Centro de Cálculo-Instituto de Computación, Univ. de la República, Montevideo, Uruguay 3Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I (UJI), 12.071-Castellón, Spain

SUMMARY

In this paper, we tackle the inversion of large-scale dense matrices via conventional matrix factorizations (LU, Cholesky, and ) and the Gauss-Jordan method on hybrid platforms consisting of a mul-

ticore CPU and a many-core graphics processor (GPU). Specifically, we introduce the different matrix inversion algorithms by using a unified framework based on the notation from the FLAME project; we develop hybrid implementations for those matrix operations underlying the algorithms, alternative to those in existing libraries for single GPU systems; and we perform an extensive experimental study on a platform equipped with state-of-the-art general-purpose architectures from Intel (Santa Clara, CA, USA) and a 'Fermi' GPU from NVIDIA (Santa Clara, CA, USA) that exposes the efficiency of the different inversion approaches. Our study and experimental results show the simplicity and performance advantage of the Gauss-Jordan elimination-based inversion methods and the difficulties associated with the symmetric indefinite case. Copyright © 2012 John Wiley & Sons, Ltd.

Received 31 January 2012; Revised 9 July 2012; Accepted 30 July 2012

KEY WORDS: matrix inversion; matrix equations; dense linear algebra; control theory; high performance computing; graphics processors

1. INTRODUCTION

Explicit matrix inversion is required in a few selected applications. One important case is the computation of the 'sign function' of a matrix A 2 Rsxs via the Newton iteration [1]:

Ao ^ A, Aj+i ^ 1 (A; + A"1), j = 0,1,.... (1)

Provided A has no eigenvalues on the imaginary axis, the sequence fA; }1L0 converges to sign (A), at an asymptotically quadratic rate. Diverse variants of (1), all requiring the explicit inversion of a matrix at each iteration, can be used to efficiently solve several types of matrix equations arising in numerically robust methods for model reduction and linear-quadratic optimal control (LQOC) [2,3]. Large-scale instances of these problems involving dense matrices arise, for example, in VLSI circuit simulation [4] and, when tackled via the matrix sign function, require the inversion of a dense matrix of dimension s ! 0(10,000 - 100,000).

Efficient and numerically stable methods for the inversion of a dense matrix A first compute some type of factorization of the matrix to then obtain the inverse from the resulting factors. Depending on the properties of A, the initial stage is usually performed via the LU factorization with partial

Correspondence to: Alfredo Remón, Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071-Castellón, Spain.

^E-mail: remon@icc.uji.es

pivoting (if A has no special properties), the Cholesky decomposition (when A is symmetric positive definite, s.p.d.), or the LDLT factorization (when A is symmetric indefinite). If A is of size s, these three types of decompositions as well as the computation of the inverse from the triangular factors require O(s3) flops (floating-point arithmetic operations). Alternatively, in the former two cases, one can use Gauss-Jordan elimination (GJE), also at a cost of O(s3) flops. Thus, these methods clearly ask for high-performance computing techniques when s is large.

Research in the framework of the MAGMA and FLAME projects has demonstrated the potential of graphics processors (GPUs) to accelerate the computation of dense linear algebra operations [5,6]. These two projects, together with the commercial development CULA [7], provide hybrid CPU-GPU kernels for the LU factorization and the Cholesky decomposition, among others. In [8-11], we have reported the results from our own effort towards the computation of dense linear algebra operations, with special focus on matrix inversion and its application to the solution of Lyapunov matrix equations and algebraic Riccati equations (AREs) arising in control theory. In this paper, we extend those results as follows:

• We provide a unified framework, based on the FLAME notation [12,13], for the presentation of conventional matrix inversion algorithms via the LU and Cholesky factorizations (for general and s.p.d. matrices, respectively), and alternative approaches based on the GJE (for both cases).

• We include new multicore and hybrid algorithms for the inversion of symmetric indefinite matrices via the LDLT factorization, a case not tackled in previous work, accommodating them into the same framework/notation.

• We perform an experimental study of the performance and overheads of our implementations of the inversion algorithms, comparing them with those available in the MAGMA and CULA libraries, using a state-of-the-art platform equipped with Intel Xeon-based multicore processors and a NVIDIA 'Fermi' GPU.

The rest of the paper is structured as follows. In the next subsection, we introduce an algorithmic skeleton that will allow a unified presentation of the matrix inversion algorithms. In Section 2, we briefly review how to use the matrix sign function to solve the Lyapunov equation and the ARE, and the application of these matrix equations in model reduction and LQOC. The three algorithmic approaches and the hybrid CPU-GPU implementations for the inversion of general, s.p.d., and symmetric indefinite matrices are presented, respectively, in Sections 3, 4, and 5. In Section 6, we describe several optimization techniques, applicable to most hybrid algorithms. Experimental results on a hybrid CPU-GPU platform are reported in Section 7, and a few remarks close the paper in Section 8.

1.1. Dense linear algebra algorithms

High-performance algorithms for dense linear algebra operations organize the computations by blocks, to hide the memory latency by amortizing the cost of data transfers between the main memory and the floating-point units with a large number of flops. Figure 1 shows two algorithmic frameworks (skeletons), using the FLAME notation, that illustrate the usual organization of blocked dense linear algebra algorithms. In the figure, m(-) stands for the number of rows of a matrix. The partitioning ('Partition') before the 'while' loop sets ABR : = A (left) or ATL := A (right). The repartitionings inside the loop body ('Repartition' and 'Continue with') and the thick lines capture how the algorithms progress through the operands. Thus, the algorithm on the left processes the matrix from the top-left corner to the bottom-right one (TL!BR), whereas the algorithm on the right proceeds in the opposite direction. At each iteration, both algorithms identify a b x b diagonal block An, with b usually known as the (algorithmic) block size. Examples of the first type of algorithms (TL!BR2 x 2) include the three key factorizations for the solution of dense linear systems: LU, Cholesky, and LDLT. Examples of the second (BL!TL2 x 2) include the solution of an upper triangular linear system [14].

In the following sections, when presenting an algorithm for a dense linear algebra operation, we will indicate whether it proceeds TL!BR2x2 or BL!TL2x2, and specify the computations that are performed during the 'current' iteration (to be included in the box marked as 'Update' in one of the

( ATL I ATR \ V Abl I ABR )

Algorithm: A := tl—>br2 x 2(A) Partition A -» ^ L

where Atl is 0 x 0 while tti^Atl) < m(A) do Determine block size b Repartition

/ ATL I ATR \ V Abl I Abr j where An is 6 x 6

Update

endwhile

Continue with

( Atl I ATr \ V Abl I Abr J

Aoo A0i A02 \

Aio An Al2

A2o A2i A 22

Algorithm: A := br—>tl2 x 2(A)

/ Atl I ath \ v ^bi i -4-h« /

Partition A - . , —T-

where Abr is 0 x 0 while m(ABR') < 771(A) do Determine block size b Repartition

Atl I ATr

Abl I Abr where A11 is 6 x 6

Aoo a01 a02

aio a11 a12

a20 a21 a22

Update

Continue with

/ Atl I ATr \ V abl i J

endwhile

Figure 1. Blocked algorithmic skeletons for dense linear algebra operations.

algorithms in Figure 1). In our notation, triu(-)/tril(-) stands for the upper/lower triangular part of its argument and trils(-) for its strictly lower triangular part; trilu(-) is analogous to trils(-) with the diagonal entries of the matrix replaced by ones.

For simplicity, we will not include pivoting in the algorithmic descriptions, although this is crucial for the numerical stability of the inversion of general as well as symmetric indefinite matrices. Our hybrid implementations for the inversion of general matrices use partial pivoting. Our inversion algorithms for symmetric indefinite matrices incorporate pivoting akin that of the LDLT factorization.

In the hybrid algorithms, we assume that the matrix to be inverted is in the GPU memory. This implies that there exists a preliminary transfer of the data from the (host) main memory to the device (GPU) memory. When presenting the update performed during the iteration, we will specify in which architecture is carried out each computation. Therefore, depending on where initially each operand is located, the exact data transfers between the host and device memory spaces can be easily derived.

2. MATRIX SIGN FUNCTION SOLVERS WITH APPLICATION IN CONTROL

In state space, a dynamic linear time-invariant (LTI) system is represented as

x(t) = Fx(t)C Bu(t), t> 0, x(0) = x0,

y(t) = Cx(t) C Du(t), t > 0, (2)

where x0 2 stands for the initial state of the system; and x(t) 2 Rn, u(t) 2 Rm, and y(t) 2 Rm contain, respectively, the states, inputs, and outputs of the system. Here, F 2 Mnxn, B 2 Rnxm, C 2 Rpxn, D 2 Rpxm, and, usually, m, p « n.

Given a system of order n as in (2), 'model order reduction' aims at finding an alternative LTI system of order r n that can accurately replace the original system in subsequent operations [4]. Balanced truncation is an efficient numerical method for model reduction of large-scale LTI systems (see, e.g., [2]). The major computational step in balanced truncation requires the solutions Y, Z 2 Mnxn of the coupled Lyapunov equations FY C YFT = BBT, FTZ C ZF = C TC, which can be obtained from sign (F) and a few minor other computations [1].

Given a pair of weight matrices R 2 Rmxm and Q 2 Rpxp, with R s.p.d. and Q symmetric positive semidefinite, the solution of the 'LQOC problem' associated with (2) is given by the stabilizing feedback control law u(t) = -R~rBTXx(t) = -Kx(t), t > 0, with X 2 Mnxn positive semidefinite. Under certain conditions [15], this feedback is given by the solution of the ARE

FTX + XF - XBR~1BTX + CTQC = 0. In [1], it is shown that the solution X can be easily obtained from sign (H), with

( F BR_1BT !

H = T T . (3)

CTQC -FT

The two problems defined earlier illustrate the need to compute the sign function of a matrix (F or H). When the number of states of the LTI system (2) is large, and the state matrix F is dense, it is therefore necessary to compute the inverse of a sequence of large-scale dense matrices. Note also that, when F is symmetric negative definite, all the matrices inverted during the computation of sign (F) inherit this property. On the other hand, a simple reorganization of the blocks in (3) yields a symmetric indefinite matrix, which can be leveraged to invert only symmetric indefinite matrices during the Newton iteration for the matrix sign function [16].

3. INVERSION OF GENERAL MATRICES

In this section, we review the inversion of general matrices via the LU factorization and GJE [14]. In both cases, the contents of A are overwritten with its inverse (i.e., they are in-place procedures). Besides, the two approaches require 2s3 flops.

3.1. LU factorization

Matrix inversion of a general matrix A e Rsxs via the LU factorization can be accomplished in three steps. First, the matrix is decomposed as A = LU, where L e Rsxs and U e Rsxs are lower unit and upper triangular factors, respectively; the upper triangular factor is then explicitly inverted: U ! U_1 = U, and finally, the (lower unit) triangular system A_1L = U is solved for the sought-after inverse A"1.

Figure 2 illustrates the updates performed at each one of the three steps for the inversion via the LU factorization. The complete algorithms are obtained by including these operations into the update box of the appropriate algorithmic skeleton in Figure 1. For each operation, we indicate in which architecture it is performed (host/CPU or device/GPU) and the type of numerical kernels involved (using the LAPACK/BLAS interface). Consider the LU factorization first (left). This algorithm proceeds TL!BR, and the resulting factors L and U overwrite, respectively, the strictly lower and the upper triangular parts of A. The invocation to the kernel LU, in the first operation of the update, obtains the factorization of the 'panel' (A^, A^) , of (column) width b. In our hybrid algorithm, this operation is performed in the CPU (via kernel GETRF). Because we assume that the matrix is initially stored in the GPU, this implies that the panel has to be transferred to the host memory prior to this computation. The other two operations involved in the factorization, a triangular system solve (trsm) and a matrix-matrix product (gemm), are performed in the GPU. Thus,

TL—^br2 x 2. LU factorization tl-»br2 x 2. U U'1 = U br-»tl2 x 2. Solve A_1L = U

Ai2 := trilu(ah)-1ai2 a22 := a22 - a21a12 CPU/ getrf GPU/ trsm GPU/ gemm a01 := triu(aoo)aoi A01 := —A01 triu(aii)-1 triu(ah) := triu(ah)-1 GPU/ trmm GPU/ trsm CPU/ trtri /A0i\ / A02\ an - := a12 W2 \ a21 1 \a22 / /aoa / Aoi\ Au := An Wf1 \A2i / \A2i / GPU/ lacpy GPU/ laset GPU/ gemm GPU/ trsm

Figure 2. Operations (updates) to invert a general matrix A via the LU factorization. Left: LU factorization. Center: Inversion of the triangular factor U. Right: Solution of lower triangular system.

the triangular matrices resulting from the panel factorization have to be transferred back to the GPU before these two operations commence.

Upon termination of the algorithm for the inversion of the triangular matrix U, in the center of the figure, the upper triangular part of A (which initially contained U) is overwritten with U = U_1. At each iteration, the algorithm computes the product of a triangular matrix times a general one (trmm), the solution of an upper triangular linear system, and the inversion of the triangular b x b block A11 (trtri). The former two operations are performed in the GPU, whereas the last one is carried out by the CPU. Thus, A11 is transferred from the device to the host before its inversion, and the result is moved back to the GPU. Finally, the solution of the triangular system A_1L = U for A"1 (right) performs a matrix-matrix product and a triangular system solve per iteration, both computed in the GPU. Thus, the algorithm requires no data transfers. For high performance, though, this operation uses a small workspace, of size s x b, in general with b > 128.

The previous algorithms illustrate a pattern common to most blocked procedures for dense linear algebra operations. The matrix is processed by blocks of a certain dimension (block size) and, at each iteration, only a part of the matrix is updated. In the hybrid CPU-GPU algorithms, we use the GPU for the computationally intensive, data-parallel operations, leaving the CPU for the kernels with strong control dependencies. This in turn dictates the data movements during the algorithm's iterations. For high performance, our hybrid algorithms will pursue two goals: (i) amortize the amount of data that are transferred between CPU and GPU with a large number of flops (per iteration) and (ii) cast most of the flops in terms of a reduced number of high-performance GPU kernels operating on large pieces of data. Consider, for example, the algorithm for the LU factorization. At each iteration, a panel of k x b elements is transferred between CPU and GPU (k initially equals s and decreases with each iteration by a factor of b), whereas this overhead has to be amortized roughly by the 2(k — b)2b flops corresponding to the update of A22. Also, the major part of the computation is cast in terms of the matrix-matrix product for the update of A22, an operation that is well known to attain high performance on most current architectures and, in particular, on GPUs.

3.2. Gauss-Jordan elimination

An alternative blocked procedure to invert a matrix via GJE is shown in Figure 3 (left) [17]. Compared with the three sweeps of the inversion approach based on the LU factorization (one per step), the algorithm computes the inverse in a single sweep of the matrix and is richer in large matrix-matrix products. All operations are performed on the GPU, except for the factorization of the current panel (A^, A^, A^,) (kernel GJE), which is performed in the CPU. Thus, at the beginning of each iteration, this panel of dimension s x b is transferred to the CPU, processed there, and the results are sent back to the GPU. Note the regularity of the algorithm: All iterations perform the same number of flops (approximately, 2sb(s — b) flops) and the same amount of data is transferred between host and device (except for the last iteration if the problem size is not an exact multiple of b).

tl—»br2 X 2. Invert general A via GJE tl—»br2 x 2. Invert s.p.d. A via GJE

/ Aoi /Aoi\

An := GJE An cpu/- triu(ah) := chol(ah) cpu/potrf

Aoo ■■= ¿oo + AoiAio gpu/gemm triu(ah) := triu(aii)-1 cpu/trtri

A2o := A20 + a21a10 gpu/gemm A0i := A0itriu(An)T gpu/trmm

Aio := A11A10 gpu/gemm triu(aoo) := triu(a0o + a0iaji) gpu/syrk

aO2 := A02 + a01a12 gpu/gemm Aoi := AoiTRiu(Aii) gpu/trmm

A 22 := A22 + a21a12 gpu/gemm A12 := triu(Au)-tAi2 gpu/trsm

A12 := A11A12 gpu/gemm triu(A22) := triu(A22 - af2Al2) gpu/syrk

A02 := A02 — A01A12 gpu/gemm

A12 := -triu(All)Al2 gpu/trmm

triu(ah) := triu(ah)thiu(aii)t cpu/lauum

Figure 3. Operations (updates) to invert a general or an s.p.d. matrix A via GJE (left or right, respectively).

4. INVERSION OF SYMMETRIC POSITIVE DEFINITE MATRICES

We next describe the specialized counterparts of the LUpp-based and GJE-based methods for the inversion of s.p.d. matrices. These alternatives exploit the symmetric structure of the matrix A 2 Rsxs to halve its computational cost: from the 2s3 flops of the general case to roughly s3 flops in this case. The variants presented next compute the inverse in-place, requiring no additional workspace. In fact, in the algorithms, the upper triangular part of the matrix is replaced by the upper triangle of its inverse, whereas the strictly lower part is not referenced/modified. Also, the symmetry combined with the positive definiteness render unnecessary the application of pivoting for numerical stability in the algorithms.

4.1. Cholesky factorization

This algorithm performs the inversion in three steps (sweeps). First, the s.p.d. matrix is decomposed into the product A = UT U, where the upper triangular matrix U 2 Rsxs is the Cholesky factor. Next, the Cholesky factor is explicitly inverted U ! U'1 = U, and, finally, the inverse is obtained as A'1 : = UUT = U~1U~T.

Figure 4 reports the updates performed in the first and third steps (Cholesky factorization and matrix product A'1 : = U'1 U'T, respectively) of matrix inversion via the Cholesky factorization. The inversion of the triangular factor U (second step) can be obtained using the same algorithm described in the previous section. At each iteration of the Cholesky factorization (left column of the figure), the b x b block A11 is first transferred to the CPU, factorized there (via a call to kernel potrf), and the triangular factor is moved back to the GPU. This factor is then used to solve a triangular system and perform a symmetric rank-b update (syrk) of the trailing submatrix A22 in the GPU. It is important to realize that only the upper triangular part of A22 is updated, which in practice halves the cost of this factorization with respect to that of the general case. This is attained by using the appropriate kernel for the update of A22, which requires only (k — b)2b flops (with k := s initially, and decreasing by a factor of b per iteration).

At the beginning of the third step (Figure 4 (right)), U = U'1 overwrites the upper triangular part of A, and, therefore, the matrix product that is computed corresponds to A'1 := U UT = U'1 U'T. The algorithm consists of four operations: the product of a general matrix times a triangular one, a call to compute the (in-place) inverse of the b x b block A11 (lauum), a general matrix-matrix product, and a symmetric rank-b update. All these operations, except the second one, are performed in the GPU. This dictates the need to transfer the corresponding b x b block between the GPU and the CPU before the second operation, and in the opposite direction immediately after it.

4.2. Gauss-Jordan elimination

The procedure for matrix inversion of s.p.d. matrices based on GJE is illustrated in Figure 3 (right) [18]. The algorithm consists of a sequence of 10 operations of different types: Cholesky factorization, triangular matrix inversion, matrix multiplications with operands of diverse structures, triangular system solve, and symmetric rank-b update. All operations except the first two and the last one are performed in the GPU. Therefore, the algorithm requires a couple of transfers of the b x b block A11 between GPU and CPU and back per iteration.

tl—>br2 x 2. Cholesky factorization tl->br2 x 2. A-1 := UUT

triu(-ah) := chol(ah) Al2 ^triucaii)"1^ triu(a22) := triu(a22 - Af2Ai2) CPU/potrf gpu/trsm gpu/syrk Aoi ■■= aoitriu(ah)t triu(ah) := triu(aii)triu(ah)t Aoi ■■= + A02Af2 triu(aii) :=triu(j4ii + Ai2Af2) gpu/trmm cpu/lauum gpu/gemm gpu/syrk

Figure 4. Operations (updates) to invert a s.p.d. matrix A via the Cholesky factorization. Left: Cholesky factorization. Right: Triangular matrix-matrix multiplication A~1 : = U U T.

TL—>br2 x 2. LDLT factorization br—>tl2 x 2. Invert symmetric indefinite A

tril(a22) := tril(a22 - A21W) cpu/- cpu/gemv,gemm W2 := D2A21 tril(wi) := tril(ôiah) Wi := tril(au)tv7i tril(ah) := tril(wi) w-i := A^W2 tril(aii) := TRIL(aii + Wl) A21 •■= TRIL(i422)rW2 cpu/cpu/-gpu/trmm gpu/laopy gpu/gemm gpu/- gpu/trmm

Figure 5. Operations (updates) to invert a symmetric indefinite matrix A via the LDLT factorization. Left: LDLT factorization. Right: Inversion A_1 := L~TD~1 L~1.

5. INVERSION OF SYMMETRIC INDEFINITE MATRICES

Symmetric indefinite matrices can be inverted in-place, at a cost of s3 flops, following a three-step (-sweep) procedure (although, for high performance, an additional s x b array is required.) First, the matrix is decomposed as the product A = LDLT, where L e Rsxs is lower triangular and D e Rsxs is diagonal with 1 x 1 or 2 x 2 diagonal blocks [14]; then, the lower triangular and diagonal factors are explicitly inverted: L ! L_1 = L and D ! D_1 = D; and finally, the inverse is obtained from A_1 := LTDL = L_TD_1L_1. In practice, the LDLT factorization includes pivoting to attain numerical stability close to that of partial pivoting (not included in the following presentation, for simplicity). Also, only the contents of the lower triangular part of A are overwritten with the lower triangle of the inverse.

Figure 5 introduces a simplified version of the updates performed in the algorithms for the first and third steps. The second step is performed following a transposed variant of the algorithm to invert an upper triangular matrix (see Section 3). The procedure in the left of the figure computes the factorization. To do this, at each iteration, it obtains the entries of the lower triangular factor L corresponding to the panel (A^, A^) while, simultaneously, building the product of this factor times the corresponding diagonal blocks into W. Because of the symmetric structure of the matrix, and the need to reference only its lower triangle, the computation of this panel factorization is quite elaborate. The update of the trailing submatrix A22 also needs to be carefully performed. In particular, to reduce the arithmetic cost, to avoid modifying the contents in the strictly upper triangular part of the array A22, and to still deliver fair performance, the update proceeds by panels of b columns. At each iteration, the upper b x b block of this panel is updated column-wise, using (repeated invocations to) kernel gemv (one per column), whereas the subdiagonal block of the panel is performed with a single invocation to kernel gemm and comprises the major part of the computational cost of the update. The algorithm in the right of the figure computes the inverse of A from its LDLT factorization. Assume that the lower triangular part of A already contains L = L_1 and consider a partitioning of D = D_1 = diag.Do, D1, D2) conformal with that of A. Then, at each iteration, the algorithm performs two triangular matrix-matrix multiplications (to update the workspace W1 and A21) and a matrix-matrix product (to update A11) on the GPU plus a few additional minor operations.

6. OPTIMIZING THE HYBRID EXECUTION OF MATRIX INVERSION KERNELS

In this section, we describe several techniques which, in general, render higher performance if incorporated into the basic hybrid CPU-GPU algorithms for matrix inversion presented in the previous three sections. For clarity, when necessary we will illustrate the corresponding technique using the GJE approach for matrix inversion of general matrices in Figure 3 (left).

6.1. Padding

Previous studies [19] have demonstrated the performance benefits of utilizing GPU kernels that operate with arrays of dimension(s) that are integer multiples of 32. This is due to the number of

arithmetic units in current GPUs and, when using dense linear algebra libraries of GPU kernels, to the routines in these libraries being specially optimized for such problem sizes. For other problem dimensions, one simple solution is to 'pad' the arrays to the next integer multiple of 32. For the inversion of A 2 Rsxs, this can be easily performed by embedding the data of the matrix into an array of dimension v x v = (s C r) x (s C r), with 0 6 r < 32 and v an exact multiple of 32, initialized to the identity matrix. When s is moderately large, the overhead incurred by padding is negligible while the benefits clearly pay off.

6.2. Blocked kernels for the CPU

Most of the algorithms presented earlier involved a sort of 'recursive' call to process a block or a panel of the matrix in the CPU. In the case of the GJE inversion algorithm, for example, this is the call to GJE to factorize the current s x b panel. To enhance performance of GPU kernels, in practice, b is often chosen to be large, around 5\2 or even larger. Therefore, it is usually recommended to perform the execution of the CPU operation using a blocked algorithm, with a smaller block size, let us say b & 128 < b, and a highly tuned, possibly multithreaded CPU implementation for the operation, for example, from LAPACK or BLAS.

6.3. Overlapping communication and computation (double-buffering)

In the presentation of the inversion algorithms, we assumed that the matrix was already stored on the GPU, which required an initial transfer of the matrix contents from the main memory to the device. This overhead can certainly be hidden by using 'double-buffering' so that the transfer of the data proceeds concurrently with the arithmetic operations. However, given that our matrix inversion algorithms are repeatedly invoked from the Newton iteration (1), we can neglect the costs of the initial transfer of the data between the main memory (in the host) and the GPU memory and the final retrieval of the result from device to the host. Although double-buffering can also be used to hide the latency of the data transfers that occur inside the iteration loop, in most cases, the cost of this transfer is negligible compared with the actual number of operations performed during the iteration.

6.4. Look-ahead

The algorithms for the LU factorization, the GJE for general matrices, and the LDLT factorization operate on a large-and-narrow panel, which stands in the critical path of the computation and limits the concurrency of the algorithm. For example, in the GJE, the current panel has to be factorized before any of the matrix products can proceed. The next panel then presents a critical point, which marks a new synchronization point. As the number of computational resources grows, the factorization (which is basically sequential or, at least, offers a much more limited degree of parallelism than the rest of the operations) becomes a bottleneck. One partial solution to this is to overlap the factorization of the k C 1 -th panel with the updates associated with the factorization of the current one. Specifically, consider the factorization of the k-th panel is done resulting in b 'Gauss-Jordan transforms'. We can then update only the columns in the k C 1-st panel and perform the factorization of this panel concurrently with the update of the remaining part of the matrix with the previous Gauss-Jordan transforms. This technique is known as 'look-ahead' and can be applied with different degrees of depth [20].

6.5. Concurrent CPU-GPU execution

Most hybrid platforms are equipped not only with a powerful GPU but also an appreciable computational power associated with one or more general-purpose multicore processors. Clearly, more performance can be attained if both resources are used for the computation of the algorithms involved in matrix inversion, and specially, if the two resources can operate in parallel. Given that GPU kernels are asynchronous, this can be, for example, combined with the look-ahead technique just described.

7. EXPERIMENTAL RESULTS

In the following, we perform an experimental evaluation of several 'ad-hoc' routines for matrix inversion based on the hybrid CPU-GPU algorithms introduced in Sections 3-5, optimized with the techniques described in Section 6. For comparison, our study also includes numerical kernels from diverse dense linear algebra libraries for multicore processors, many-core GPUs, and hybrid architectures:

• Intel's MKL (v11.1) implements LAPACK and BLAS for multicore architectures (results were obtained using eight threads). For those LAPACK-level routines that are not in MKL, we used the legacy codes available at www.netlib.org/lapack (v3.4).

• NVIDIA's CUBLAS (v4.0) provides BLAS for single GPU systems.

• EM Photonics' CULA (R13a) is a commercial product that partially implements LAPACK and BLAS for hybrid platforms equipped with a single GPU.

• (University of Tennessee, Knoxville) UTK's MAGMA (v1.1.0), like the previous case, implements LAPACK and BLAS partially, targeting the same class of hybrid platforms.

Hereafter, we will refer to the ad-hoc optimized implementations as the basic variants. Routines from CUBLAS, CULA, or MAGMA can be expected to include advanced optimization techniques to deliver similar or higher performance than the basic ones. The experiments do not evaluate UT's FLAME, because the runtime in the libflame library is tailored for multi-GPU platforms.

All experiments were performed on a platform consisting of two Intel Xeon E5520 quad-core processors (eight cores) at 2.27 GHz, connected to an NVIDIA Tesla C2050, using ieee single-precision and double-precision arithmetic. Because of space limitations, only results for double-precision arithmetic are showed. The experiments demonstrated that in practice, in the target platform, the performance obtained using single-precision doubles that of double-precision arithmetic. All operands were always square, with their dimensions being integer multiples of 32 as padding ensures similar efficiency for other problem sizes. Performance is compared in terms of GFLOPS (109 flops/secs.), using the standard flops count for each operation. The cost of data communication between the CPU and the GPU memories is always counted, except for the initial transfer of the original matrix to the GPU and the final retrieval of the result (see Section 6). Although for smaller matrices this should change, for the matrices evaluated this time is negligible.

Let us start with the inversion of general matrices (see Section 3). Our first experiment, with results in Figure 6, evaluates the performance attained by the operations corresponding to the three steps for the inversion of general matrices based on the LU factorization with partial pivoting (LUpp) and the complete matrix inversion procedures via LUpp and GJE. Specifically, the plot in the top-left corner compares the performance attained by the implementations for LUpp in MKL, MAGMA, and CULA; and a basic variant that uses kernels from CUBLAS and MKL. The routine from MAGMA is the fastest one for larger matrices, whereas the CULA variant attains higher performance for matrices of dimension up to 2000. Similarly, the plot in the top-right corner displays the results obtained by four implementations for the triangular matrix inversion, with the basic variant clearly outperforming the routines in MKL, MAGMA, and CULA and reporting acceleration factors over 2x. The plot in the bottom-left corner illustrates the performance of the kernels for the solution of lower triangular systems provided by CUBLAS, CULA, MAGMA, and MKL, reporting MAGMA as the best choice, although for large matrices, the three GPU routines offer similar results. Finally, the plot in the bottom-right corner compares the efficiency obtained by the fastest implementations of the complete LUpp-based and GJE-based matrix inversion algorithms including, for reference, MKL (routines dgetrf+dgetri). In this case, the GJE-based implementation is a hybrid variant that uses the GEMM kernel from MAGMA for the matrix-matrix products on the GPU, whereas all other computations are performed on the CPU using kernels from MKL. The LUpp-based routine is also an ad-hoc implementation that uses MAGMA for the LUpp and the solution of the lower triangular system, and the basic variant for the inversion of the triangular factor U. The last plot demonstrates the superior performance of the GJE-based inversion algorithm on the hybrid CPU-GPU architecture reducing the inversion time approximately a 10% for the larger matrices.

LUpp factorization

U->U = U

1000 2000 3000 4000 5000 6000 7000 8000 Matrix dimension

1000 2000 3000 4000 5000 6000 7000 8000 Matrix dimension

Solve AL = U

Inversion of general matrices

о 150

1000 2000 3000 4000 5000 6000 7000 8000 Matrix dimension

1000 2000 3000 4000 5000 6000 7000 8000 Matrix dimension

Figure 6. Inversion of general matrices. Top-left: LUpp; top-right: triangular matrix inversion; bottom-left: solution of lower triangular systems; bottom-right: complete algorithm.

Consider next the inversion of s.p.d. matrices (see Section 4). Figure 7 summarizes the results obtained for the two inversion algorithms of this class of matrices and the operations involved in the procedure based on the Cholesky factorization. The results in the top-left plot correspond to the computation of the Cholesky factorization, using routines from CULA, MAGMA, and MKL, as well as an ad-hoc basic variant that uses kernels from CUBLAS. The best results are obtained by the MAGMA routine for matrices larger than 2000, whereas for smaller problems, the CULA and MKL implementations are faster. The second step of the algorithm, as in the general matrix case, requires the inversion of an upper triangular matrix. The results in the top-right plot thus replicate those in the figure for the general case. The plot in the bottom-left corner rehearses the performance of the implementation in MAGMA for the computation A'1 : = U UT, including MKL for reference. (No other implementation is available, at the moment, for this particular operation.) MAGMA is considerably faster than MKL and reports acceleration factors between 2 and 5x. The plot in the bottom-right corner shows the GFLOPS obtained with the Cholesky-based and the GJE-based approaches. In particular, the first variant utilizes the MAGMA kernels for the computation of the Cholesky factorization and the product U UT, and the basic routine for the inversion of the triangular factor. The alternative procedure, based on the GJE method, uses kernels from CUBLAS for the major GPU operations, and MKL for other minor operations, such as the factorization or the inversion of small blocks. As in the general case, the best performance is obtained with the GJE algorithm. It reports execution times below 25% than those obtained with the best Cholesky-based implementation.

Let us finally move to the third case, the inversion of symmetric indefinite matrices (see Section 5). Among the libraries considered, only MKL implements the full procedure for the

Cholesky factorization

Matrix dimension

A"1 := Ü ÜT

Matrix dimension

Figure 7. Inversion of s.p.d. matrices. Top-left: inversion; bottom-left: multiply A_ 1 : =

Ü Ü 1 = Ü

Matrix dimension

Inversion of s.p.d. matrices

Matrix dimension

Cholesky factorization; top-right: triangular matrix UU UU T; bottom-right: complete algorithm.

inversion of this class of matrices. The algorithm for the first stage (computation of the factorization A = LTDL) is hardly suitable for many-core architectures, because of the intricacies of the application of pivoting in the symmetric case. This also explains the low performance obtained by MKL in the multicore architecture: less than 30 GFLOPS for the larger matrices (see left plot in Figure 8). For this particular operation, we have developed an optimized version for multithreaded CPUs, which merges the updates performed to each panel of A22 into a single call to gemm, and extracts parallelism by distributing the iteration space among the cores using a simple OpenMP pragma omp parallel for, yielding a higher GFLOPS rate. The second stage is primarily dedicated to the inversion of the (lower) triangular factor and can be accomplished using a transposed variant of our basic routine for the inversion of an upper triangular factor. (The inversion of D, also carried out in this stage, is straightforward.) The third stage (A_1 := LTDL = L_TD_1L_1) exhibits higher concurrency and can be executed in a GPU with moderate efficiency. The left-hand side plot in the figure also collects the performance obtained for the second+third stages (lines labeled as A_1 := L_TD_1L_1) by the routine from MKL and a basic implementation that uses kernels from MKL and CUBLAS. The hybrid variant clearly outperforms MKL for larger matrices, but it hardly attains 70 GFLOPS. The right-hand side plot shows the results for the inversion of symmetric indefinite matrices. There, the BASIC variant uses our tuned CPU factorization (DSYTRF) and the ad-hoc hybrid implementation to obtain the inverse from the factors. This optimized implementation outperforms the MKL+LAPACK code (DSYTRF from MKL and DSYTRI2X from LAPACK) for matrices of dimension larger than 2000, resulting in acceleration factors near 2x for the larger matrices. However, its performance is poor if we compare it, for example, with that observed for the

t -1 -t -1 -1

LDL fact./A := L D L Inversion of symmetric indefinite matrices

Matrix dimension Matrix dimension

Figure 8. Inversion of symmetric indefinite matrices (right) and the computation of the inverse of a matrix

from its LDLT factorization (left).

inversion of general matrices (Figure 6). Consequently, with the implementations available at the moment, it is faster to calculate the inverse of a symmetric indefinite matrix by using the general matrix inversion algorithm, even if that implies doubling the number of flops.

8. CONCLUDING REMARKS

In this paper, we have studied the inversion of large-scale, dense matrices, on hybrid CPU-GPU platforms, with application to the solution of matrix equations arising in control theory. Borrowing the notation of the FLAME project, we have presented several matrix inversion algorithms, based on common matrix factorizations and the GJE method, for general, s.p.d., and symmetric indefinite problems. The performance of our practical implementations of these algorithms has been compared with analogous kernels from CPU and hybrid CPU-GPU dense linear algebra libraries, on a representative high-performance platform.

The results for the general and s.p.d. cases on this architecture expose the performance advantage of the GJE-based matrix inversion methods over their LU-based and Cholesky-based counterparts. Combined with their extreme simplicity, the result is a wide-appeal GJE-based approach to leverage the hardware concurrency of current hybrid platforms. As for the symmetric indefinite case, the need to preserve the structure of the problem during the application of pivoting stands in the way of an efficient concurrent implementation of the method so that further work remains to be done to turn this into a competitive method for tackling this class of problems.

ACKNOWLEDGEMENTS

The researchers from the UJI were supported by project TIN2011-23283 and FEDER, and project P1B-2011-18 of the 'Fundación Caixa-Castellón/Bancaixa' and UJI. This study was supported by Publishing Arts Research Council (grant number 98-1846389).

REFERENCES

1. Roberts JD. Linear model reduction and solution of the algebraic Riccati equation by use of the sign function. International Jorunal of Control 1980; 32:677-687. (Reprint of Technical Report No. TR-13, CUED/B-Control, Cambridge University, Engineering Department, 1971).

2. Benner P, Quintana-Orti ES, Quintana-Orti G. State-space truncation methods for parallel model reduction of large-scale systems. Parallel Computing 2003; 29:1701-1722.

3. Benner P, Quintana-Orti ES, Quintana-Orti G. Solving linear-quadratic optimal control problems on parallel computers. Optimization Methods Software 2008; 23(6):879-909.

4. Antoulas A. Approximation of Large-Scale Dynamical Systems. SIAM Pub.: Philadelphia, PA, 2005.

5. MAGMA project home page. (Available from: http://icl.cs.utk.edu/magma/ [Accessed on 21 September 2012]).

6. University of Texas. (Available from: http://www.cs.utexas.edu/~flame/ [Accessed on 21 September 2012]).

7. CULA project home page. (Available from: http://www.culatools.com/ [Accessed on 21 September 2012]).

8. Benner P, Ezzatti P, Kressner D, Quintana-Orti E, Remon A. A mixed-precision algorithm for the solution of Lyapunov equations on hybrid CPU-GPU platforms. Parallel Computing 2011; 37(8):439-450.

9. Ezzatti P, Quintana-Orti E, Remon A. Using graphics processors to accelerate the computation of the matrix inverse. The Journal of Supercomputing 2011; 58(3):1-9. DOI: 10.1007/s11227-011-0606-4.

10. Benner P, Ezzatti P, Kressner D, Quintana-Orti ES, Remon A. Accelerating model reduction of large linear systems with graphics processors. In Lecture Notes in Computer Science, State of the Art in Scientific and Parallel Computing, Vol. 7134. Springer: Reykjavik, 2012; 88-97. (on-line version available).

11. Benner P, Ezzatti P, Quintana-Orti ES, Remon A. High performance matrix inversion of SPD matrices on graphics processors. In Workshop on Exploitation of Hardware Accelerators-WEHA 2011. IEEE: Istanbul, 2011; 640-646.

12. Gunnels JA, Gustavson FG, Henry GM, van de Geijn RA. FLAME: formal linear algebra methods environment. ACM Transactions on Mathematical Software December 2001; 27(4):422-455.

13. Bientinesi P, Gunnels JA, Myers ME, Quintana-Orti ES, van de Geijn RA. The science of deriving dense linear algebra algorithms. ACM Transactions on Mathematical Software March 2005; 31(1):1-26.

14. Golub GH, Van Loan CF. Matrix Computations, 3rd edition. The Johns Hopkins University Press: Baltimore, 1996.

15. Lancaster P, Rodman L. The Algebraic Riccati Equation. Oxford University Press: Oxford, 1995.

16. Gardiner JD, Laub AJ. A generalization of the matrix-sign-function solution for algebraic Riccati equations. International Jorunal of Control 1986; 44:823-832.

17. Quintana-Orti ES, Quintana-Orti G, Sun X, van de Geijn RA. A note on parallel matrix inversion. SIAM Journal on Scientific Computing 2001; 22:1762-1771.

18. Bientinesi P, Gunter B, van de Geijn RA. Families of algorithms related to the inversion of a symmetric positive definite matrix. ACM Transactions on Mathematical Software July 2008; 35:3:1-3:22.

19. Barrachina S, Castillo M, Igual FD, Mayo R, Quintana-Orti ES, Quintana-Orti G. Exploiting the capabilities of modern GPUs for dense matrix computations. Concurrency and Computation: Practice and Experience 2009; 21:2457-2477.

20. Strazdins P. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Technical Report TR-CS-98-07, Department of Computer Science, The Australian National University, Canberra 0200 ACT, Australia, 1998.