A High Performance QDWH-SVD Solver Using Hardware Accelerators

DALAL SUKKARI, HATEM LTAIEF, and DAVID KEYES,

Extreme Computing Research Center, KAUST

This article describes a new high performance implementation of the QR-based Dynamically Weighted Halley Singular Value Decomposition (QDWH-SVD) solver on multicore architecture enhanced with multiple GPUs. The standard QDWH-SVD algorithm was introduced by Nakatsukasa and Higham (SIAM SISC, 2013) and combines three successive computational stages: (1) the polar decomposition calculation of the original matrix using the QDWH algorithm, (2) the symmetric eigendecomposition of the resulting polar factor to obtain the singular values and the right singular vectors, and (3) the matrix-matrix multiplication to get the associated left singular vectors. A comprehensive test suite highlights the numerical robustness of the QDWH-SVD solver. Although it performs up to two times more flops when computing all singular vectors compared to the standard SVD solver algorithm, our new high performance implementation on single GPU results in up to 4x improvements for asymptotic matrix sizes, compared to the equivalent routines from existing state-of-the-art open-source and commercial libraries. However, when only singular values are needed, QDWH-SVD is penalized by performing more flops by an order of magnitude. The singular value only implementation of QDWH-SVD on single GPU can still run up to 18% faster than the best existing equivalent routines.

Categories and Subject Descriptors: G.4 [Mathematical Software]: Performance

General Terms: Design, Algorithms

Additional Key Words and Phrases: Singular value decomposition, polar decomposition, symmetric eigen-solver, mixed precision algorithms, GPU-based scientific computing

ACM Reference Format:

Dalal Sukkari, HatemLtaief, and David Keyes. 2016. A high performance QDWH-SVD solver using hardware accelerators. ACM Trans. Math. Softw. 43, 1, Article 6 (August 2016), 25 pages. DOI: http://dx.doi.org/10.1145/2894747

1. INTRODUCTION

Computing the Singular Value Decomposition (SVD) is a critical operation for solving least-square problems, determining the pseudoinverse of a matrix, or calculating low-rank matrix approximations, with direct application to signal processing, pattern recognition, and statistics [Golub and Reinsch 1970; Golub and Van Loan 1996; Trefethen and Bau 1997]. This article proposes a new high performance implementation SVD solver featuring the QR-based Dynamically Weighted Halley (QDWH) on mul-ticore architecture equipped with GPU accelerators. First introduced by Nakatsukasa and Higham [Nakatsukasa and Higham 2013], this new spectral divide and conquer algorithm for SVD is composed of three successive computational stages: (1) computing the polar decomposition of the original matrix using the QDWH algorithm, (2) calculating the symmetric eigendecomposition of the resulting polar factor to obtain the

Authors' addresses: D. Sukkari, H. Ltaief, andD. Keyes, Extreme Computing Research Center, King Abdullah University of Science and Technology, 4700 King Abdullah Blvd, 23955 Thuwal Jeddah, Saudi Arabia; emails: {Dalal.Sukkari, Hatem.Ltaief, David.Keyes}@kaust.edu.sa.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org. © 2016 ACM 0098-3500/2016/08-ART6 $15.00 DOI: http://dx.doi.org/10.1145/2894747

eigenvalues (which correspond to the singular values) and their associated eigenvectors (which correspond to the right singular vectors), and (3) applying the matrix-matrix multiplication to get the remaining left singular vectors. The QDWH-SVD framework presents interesting concepts compared to previous approaches [Nakatsukasa and Higham 2013]. It achieves backward stability thanks to the polar decomposition, reveals communication-reducing properties, and is implemented with conventional linear algebra building blocks for which quality vendor tunings are often available. However, QDWH-SVD exhibits overhead factors up to twofold in terms of number of extra floating-point operations (flops) compared to the standard SVD algorithm, as implemented in LAPACK [Anderson et al. 1999]. Although the total number of flops is very sensitive to the matrix conditioning and the convergence rate can greatly improve when the matrix is well conditioned, the overhead due to the extra flops can still make the algorithm unattractive, especially if only singular values are to be calculated.

On the other hand, when it comes to high performance parallel implementations of numerical algorithms, flops are often no longer a sound proxy for execution time or energy expenditure. One must examine what are the limiting factors with respect to the hardware for a given execution, that is, cache memory sizes, bus bandwidth, available cores (which may otherwise be idle), number of floating-point units per core, etc. This step is paramount because it can help to assess the best implementation on specific hardware for given objectives, for example, time to solution, energy to solution, data distribution for the next step of an overall procedure, etc. Most of the linear algebra operations of QDWH-SVD are expressed through Level 3 BLAS [BLAS 2013]. This level of BLAS deals with highly compute-intensive and parallel operations (mostly based on matrix-matrix operations), which stress the floating-point units of the underlying hardware and presents a high data reuse rate for the processor caches. As highlighted in the International Exascale Software Project [Dongarra et al. 2011], to exploit opportunities for energy-efficient performance, algorithms must be designed to boost concurrency and reduce data motion. Over the past decade, accelerators (e.g., GPUs) epitomize such hardware evolution. Indeed, today, because of the PCIe bottleneck, GPUs can process data beyond an order of magnitude faster than the speed with which they are fed. In other words, numerical algorithms performing extra flops should not usually be looked at as "show-stoppers" as long as data motion is confined.

This article proposes a new, efficient implementation of the QDWH-SVD algorithm on multicore and GPUs hardware architecture operating at high flop/s rates. Using the high performance Matrix Algebra on GPU and Multicore Architectures (MAGMA) library [Agullo et al. 2009; MAGMA 2009], all three computational stages of QDWH-SVD are accelerated using GPUs, while limiting and hiding data transfers between the host and the device and therefore, favoring in situ GPU processing. The resulting high performance GPU-based QDWH-SVD implementation outperforms by up to fourfold, and up to threefold for asymptotic random matrix sizes, the equivalent routines (DGESVD) from existing state-of-the-art commercial (Intel MKL [Intel 2015]) and open-source (MAGMA) libraries, respectively, when computing all singular values and vectors. If only singular values are required, the algorithm we implement is excessively penalized by performing up to 14 times more flops, depending on the matrix conditioning, compared to the standard SVD solver. The singular value-only QDWH-SVD implementation is still able to run up to 18% faster than the best existing equivalent routine. Moreover, if the end user's application is tolerant to loss of few digits, the singular value only implementation using mixed precision techniques can execute up to 50% faster than the standard full double precision floating point arithmetic SVD solver. A comprehensive testing framework checks on the numerical accuracy of singular values, the orthogonality of singular left/right vectors, and the accuracy of the computed singular values and the associated left/right vectors on various matrix conditioning types, which

highlights the robustness of the overall method. Finally, a multi-GPU implementation is introduced as a first attempt to study the scalability of the QDWH-SVD framework.

The rest of the article is organized as follows. Section 2 highlights our contributions. Section 3 presents related work. Section 4 briefly recalls the standard SVD algorithm. Section 5 describes the QDWH-SVD framework and presents its different computational stages. Section 6 introduces the algorithmic complexity. Section 7 illustrates the QDWH-SVD numerical accuracy with various matrix conditioning. Section 8 shows the QDWH-SVD performance results and compares against the state-of-the-art commercial and open-source high performance SVD solver implementations, and we conclude in Section 9.

2. CONTRIBUTIONS

We here delineate the research contributions of this article for developing a new high performance SVD solver on x86 multicore and GPUs hardware architecture:

—We present a high performance implementation of the polar decomposition (QDWH), which is one of the main components for the symmetric eigensolver and the SVD solver.

—A high performance implementation of the symmetric eigensolver based on the spectral divide and conquer approach (QDWH-EIG) is proposed and performance compared against the two-stage symmetric eigensolver [Haidar et al. 2011, 2014; Luszczek et al. 2011], previously implemented by some of the authors of this article. —A high performance implementation of the overall QDWH-SVD solver framework is described, which integrates the three computational stages: polar decomposition, symmetric eigensolver, and matrix-matrix multiplication. —A multiple GPU implementation of the resulting QDWH-SVD solver has been developed to study its scalability, which corresponds to the first SVD solver on multiple hardware accelerators (to our knowledge). —A mixed precision version of the algorithm is developed for the case where only approximate singular values are required, enabling a trade-off between accuracy and time complexity.

—To show the numerical robustness of the newly implemented QDWH-SVD solver, the article considers a comprehensive test suite for testing the numerical accuracy of the singular values, the orthogonality of the left/right singular vectors, and the accuracy of the computed singular values and the associated left/right vectors.

The combination of highly optimized dense linear algebra operations from MAGMA [Agullo et al. 2009; MAGMA 2009] to form the new high performance QDWH-SVD solver represents the crux of the research work presented in this article.

3. RELATED WORK

The standard SVD algorithm first reduces a general dense matrix to bidiagonal form, then extracts the singular values from the condensed form using the QR [Demmel and Kahan 1990; Fernando and Parlett 1994] or divide-and-conquer [Gu and Eisenstat 1995] algorithms and, finally, accumulates the subsequent orthogonal transformations if singular vectors are also required. While the back transformation phase to calculate the corresponding singular vectors is rich in Level 3 BLAS operations, the reduction to bidiagonal form is often seen as a major bottleneck for parallel performance due to the inefficient Level 2 BLAS kernels predominance.

The authors of the Successive Band Reductions (SBR) software package [Bischof et al. 2000; Lang 1999] introduced an intermediary computational stage before getting to the final bidiagonal form. Originally applied to the symmetric eigensolver [Haidar et al. 2011; Luszczek et al. 2011], the main ideas have been extended to SVD solver.

The matrix is first reduced to band bidiagonal form, where most of the original Level 2 BLAS operations are now cast into Level 3 BLAS, thus increasing the level of concurrency, as highlighted in Ltaief et al. [2010]. The second stage annihilates the extra offdiagonal elements using an efficient bulge chasing technique, until the final condensed structure is obtained. Although operations in the latter stage are memory-bound, data reuse at the high level of caches is possible, thanks to some locality scheduling heuristics [Haidar et al. 2013]. The standard column-major data format of the matrix has to be translated into a tile data layout. Indeed, the matrix is split into tiles, where elements of a given tile are now contiguous in memory. The algorithm of the SVD solver operates now on tiles and exposes fine-grained tasks to be scheduled by a dynamic runtime system [Agullo et al. 2009]. The overall tile SVD solver also generates substantial extra flops [Ballard et al. 2010], especially during the back transformation, in case singular vectors are desired. All the aforementioned high performance SVD solver implementations based on a two-stage matrix reduction run on x86 architecture only (accelerators are currently not supported). There are mainly two reasons for this. Similarly to the two-stage eigensolver, porting the two-stage SVD solver on such complex platforms requires new nonconventional computational kernels and increases software maintenance, while jeopardizing code sustainability across hardware generations. For solving large SVD or even eigenproblems on distributed-memory systems, it is not clear at this moment whether a two-stage reduction is the most efficient, due to the excessive number of communications.

Last but not least, a previous framework for computing the SVD via the polar decomposition and the eigendecomposition has already been presented in Higham and Papadimitriou [1993]. This algorithm requires three building blocks: matrix multiplication, matrix inversion, and solution of the Hermitian eigenproblem. The QDWH-SVD algorithm follows the core idea of this framework by implementing a new SVD solver replacing the expensive matrix inversion with a QDWH-based algorithm [Nakatsukasa and Higham 2013], which increases parallel concurrency and is rich in Level 3 BLAS.

The high performance QDWH-SVD implementation presented in this article is inverse free and communication friendly, and therefore suitable for multicore and hybrid high performance computing systems. It operates on column-major format and uses building blocks (QR/Cholesky factorizations and matrix-matrix multiplication) well known by the scientific community and often well optimized by the vendor community on shared-memory as well as distributed-memory systems, which makes the portability of such codes possible across various architectures.

4. THE STANDARD SVD ALGORITHM

This section recalls the standard algorithm of the SVD solver for dense matrices, as implemented in LAPACK [Anderson et al. 1999] and MAGMA [MAGMA 2009] through the API function DGESVD.

4.1. Block Algorithms

Introduced in the 1990's [Anderson et al. 1999] with the emergence of hierarchical cache-based shared-memory systems, LAPACK block algorithms allow one to recast vector operations to matrix operations so that efficient data reuse can be achieved on the memory subsystem. Matrix computations are split into two successive phases: (1) panel factorization, in which Level 2 BLAS operations are accumulated within a block of columns (called a panel) for later usage and (2) update of the trailing submatrix, in which the accumulated transformations from the panel are applied by means of Level 3 BLAS operations on the unreduced part of the matrix. These two phases represent the core algorithmic methodology to solve linear systems ofequations as well as eigenvalue problems and SVDs.

Fig. 1. Computational stages of the standard SVD algorithm: (a) bidiagonal reduction, (b) bidiagonal SVD solver, and (c) back transformation.

4.2. Three Computational Stages

The SVD of A e Rmxre is A = U £ VT, where £ is the matrix containing all the singular values, and U and V are the orthogonal matrices containing the left and right singular vectors, respectively.

The standard SVD algorithm (a) reduces the dense matrix to condensed bidiagonal form through a series of orthogonal transformations, (b) applies an iterative solver (QR or divide-and-conquer algorithms) to calculate the singular values and its associated left/right singular vectors of the bidiagonal matrix, and (c) multiplies the freshly computed singular vectors with the accumulated subsequent orthogonal transformations during the reduction phase (also called back transformation), as shown in Figure 1.

4.3. Algorithmic Drawbacks

To understand the algorithmic drawbacks of the standard SVD solver, it is important to distinguish two classes of transformations. One-sided transformations are typically needed for solving linear systems of equations and least-square problems: the matrix is reduced and factorized during the panel factorization phase and the latter does not require accessing matrix data outside of the current panel. Resulting transformations are then applied to the trailing submatrix on the left side. On the other hand, two-sided transformations are necessary for eigenvalue and SVD solvers. Their panel factorization is much more expensive than the one-sided transformations because it requires reading the entire trailing submatrix at each column update of the current panel. Once the panel is reduced, the accumulated transformations are applied from both sides, left and right, of the unreduced submatrix, which make, for instance, look-ahead techniques for increasing concurrency difficult to apply, as opposed to one-sided transformations.

To illustrate the bottleneck of the panel factorization, Figure 2 describes the profiling of the SVD solver DGESVD, using MAGMA implementation (single GPU only). Although only 50% of the total number of operations of the bidiagonal reduction are expressed through Level 2 BLAS operations, they still count toward 90% of the elapsed time, as shown in Figure 2(a). Furthermore, when the overall SVD solver is executed, the trend represented in Figure 2(b) highlights that even though Level 2 BLAS operations count only for 6% of the total number of flops, these still translate into 30% of the entire execution time.

BRD: Level 2 BLAS BRD: Level 3 BLAS BSV

(a) Computing the singular values.

Matrix size

(b) Computing additionally the singular vectors.

Fig. 2. Profiling of MAGMA-DGESVD: Bidiagonal Reduction (BRD), Bidiagonal SVD Solver (BSV), and Back Transformation (BT).

5. THE QDWH-SVD FRAMEWORK

This section recalls the general QDWH-SVD algorithm and describes the three computational stages: the polar decomposition, the symmetric eigensolver, and the matrixmatrix multiplication.

5.1. The Polar Decomposition

The polar decomposition is an important numerical algorithm for various applications, including aerospace computations [Bar-Itzhack 1975], chemistry [Goldstein and Levy 1991], computation of block reflectors in numerical linear algebra [Schreiber and Parlett 1988], factor analysis [Schnemann 1966], and signal processing [Arun 1992]. In this article, the polar decomposition is used as a first computational phase toward computing the SVD of a general dense matrix.

5.1.1. Background. The polar decomposition of the matrix A e Rmxn (m > n) is written A = UpH, where Up is an orthogonal matrix and H = VAT A is a symmetric positive semidefinite matrix. To find the polar decomposition, the original DWH iteration can be derived as follows:

Xc = A/a, Xk+1 = Xk (akI + bkXTXk)(I + cXXk)-1, (1)

where a = || A||2. The scalars (ak, bk, ck) are critical parameters since they drive the convergence speed at each iterate.

5.1.2. Deriving the Formulas for(ak, bk, ck). Assuming square matrices, let the SVD of Xk = Uk^kVk , such that [amin(Xk), oWx(Xk)] c [4, 1] c (0, 1]. amin(Xc) = P/a = lc and p =

1/11 A-

Hence,

Xk+i = UkZkV^ (akI + bkVkV2 Vi) (I + OkVk^l V/

= UkZk(akI + bkZ2)( I + ck^D 1Vk

= Uk^k+i vk.

Therefore, the singular values at (Xk+1) are given by

at ( Xk+i) = gk(at (Xk)),

ak + bkx2

gk(x) = x-

1 + CkX2

Therefore, [amin^+i), ffmax№+i)] ^ [min4<x<i(gA(:r)),

of Xk+1 to the polar factor can be measured by the maximum distance of its singular values and 1. A suboptimal choice of the parameters (ak, bk, Ck) should make gk bounded and satisfy the max-min maxak,bk Ck [minik<x<1gk(x)}. The solution of this max-min problem can be found in Nakatsukasa et al. [2010], which gives the following formulas to calculate (ak, bk, ck):

ak = h(lk), bk = (ak - 1)2/4, Ck = ak + bk - 1,

P lk-1(ak-1 + bk-1ll-1)

l0 = —, lk = -:-^-, k = 1, 2,...,

a 1 + Ck-1l2-1 (3)

h(l) = y/1+d +

8 - 4d + . d = ^ - l2)

l24\+d v l4 '

and a = ||A||2, P = 1/11 A-1 b.

5.1.3. Convergence within Six Iterations. The maximum number of DWH iterations for convergence can be calculated by determining the first k such that 11 - lk | < e (machine epsilon). Assuming double precision arithmetic with e = 10-16, unrolling Equations (3) with l0 = 1/k2(Xo) and a fairly large condition number k2(X0) = 1016, the number of DWH iterations is six. Therefore, DWH converges within at most six iterations for any matrix with condition number k2 < 1016.

5.1.4. Matrix Inversion Free QDWH. Equation (1) can be reformulated using the mathematically equivalent Q^-based implementation [Nakatsukasa and Higham 2013]:

Xo = A/a,

R, Xk+1 = bbkXk + -L (ak - ^ Q1QT, k > 0. (4)

Ck JCkK Ck)

If the condition number of the matrix Xk improves during execution and Xk becomes well conditioned, it is possible to replace Equation (4) using a Cholesky-based implementation as follows:

VCkXk 'Qi'

Xk+i = -Xk + (au - -) (XkW-1)Wk

ck V ck /

Wk = chol(Zk), Zk = I + CkXTXk.

This algorithmic switch at runtime allows one to further speed up the overall computation, thanks to a lower algorithmic complexity, while still maintaining numerical stability. In particular, in the subsequent experiments, our implementations switch from Equation (4) to Equation (5) if Ck is smaller than 100, as suggested in Section 5.6 of Nakatsukasa and Higham [2013].

5.2. Computing the Symmetric Eigendecomposition

The second computational stage of QDWH-SVD consists in computing the symmetric eigendecomposition of the matrix H = V £ VT, where V corresponds to the eigenvectors of H as well as to the right singular vectors of the original dense matrix A. This section describes two different eigensolver algorithms: a symmetric eigensolver based (again) on the QDWH procedure and a two-stage symmetric eigensolver.

5.2.1. The QDWH-Based Symmetric Eigensolver. The QDWH-based symmetric eigen-solver (QDWH-EIG) [Nakatsukasa and Higham 2013] is a spectral divide-and-conquer

algorithm to solve the symmetric eigenvalue problem. It is built on the polar decomposition (see previous Section 5.1) and operates by recursively decoupling the problem into independent subproblems through finding invariant subspaces. It is noteworthy to mention that this QDWH series of iterations operates in the inner loop of the QDWH-SVD algorithm.

Algorithm 1 outlines the overall QDWH-EIG algorithm. In order to provide a balanced division during the recursions, QDWH-EIG shifts the diagonal of the matrix by an estimate of the median of the eigenvalues. After computing the polar decomposition, the polar factor Up permits one to eventually generate two orthogonal matrices V1 and V2 through a subspace iteration procedure. Applying these orthogonal matrices to the original matrix permits one to discriminate among eigenvalues with respect to the shift a and consequently, to split the eigenvalue spectrum across subsequent subproblems. The algorithm proceeds then recursively on each new subproblem until the submatrices become diagonalized. Figure 3 shows the recursive QDWH-EIG algorithm unrolled with four levels of recursion. The eigenvalues of the eight subproblems (from A4,1 to A^8) correspond then to the eigenvalues of the original matrix A11. Finally, all subsequent orthogonal transformations need to be accumulated in order to generate the eigenvectors corresponding to the eigenvalues.

ALGORITHM 1: The QDWH-Based Symmetric Eigensolver 1: Choose a, an estimate of the median of eig(A).

2: Compute the orthogonal polar factor Up of A — a I by the QDWH algorithm. 3: Use subspace iteration to compute an orthogonal V = [V1 V2](V1 e Rnxk) such that

1(Up + I) = Vi VT. 4: Compute A1 = V1TAV1 e RkAk and A2 = V2TAV2 e Rn—kkMn—'k). 5: Repeat steps 1 ^ 4 with A A1 and A A2 until A is diagonalized.

5.2.2. Two-Stage Symmetric Eigensolver. Previous works from some of the authors have shown a two-stage tridiagonal reduction phase necessary for implementing an efficient symmetric eigensolver on standard x86 architecture [Haidar et al. 2011; Ltaief et al. 2011; Luszczek et al. 2011] and later on hardware accelerators [Haidar et al. 2014], in the context of the PLASMA and MAGMA [Agullo et al. 2009] numerical libraries, respectively. Similarly to the bidiagonal reduction for the SVD solver, the tridiagonal reduction phase can take up to 90% of the overall elapsed time of the symmetric eigensolver when only eigenvalues are needed and up to 50% when additional eigenvectors are required. As sketched in Figure 4, since the matrix is symmetric, the upper (or lower) part is not referenced (colored in gray). The first stage reduces the original symmetric dense matrix to a symmetric band form. The second stage applies the bulge chasing procedure, which annihilates all the extra off-diagonal entries. The two-stage approach casts expensive memory operations occurring during the panel factorization into faster compute-intensive ones. Once the tridiagonal reduction is achieved, a divide-and-conquer eigensolver calculates the eigenvalues and its associated eigenvectors of the condensed matrix structure. These eigenvectors need to be further aggregated against all subsequent orthogonal transformations, which occurred during the two-stage tridiagonal reduction phase. This back transformation procedure (similar to the SVD solver; see previous Section 4.2) produces the final eigenvectors of the original symmetric dense matrix.

In fact, the overall two-stage symmetric eigensolver algorithm can be expressed into fine-granularity tasks to further expose parallelism, which eventually generates a Directed Acyclic Graph (DAG), where nodes represent tasks and edges describe the data dependencies between them. An efficient dynamic runtime system named QUARK

Fig. 3. The recursive QDWH-EIG algorithm: the matrix Ai, j corresponds to the submatrix indexed j at the ith level of recursion.

Fig. 4. Reduction of a symmetric dense matrix to tridi-agonal form using a two-stage approach.

[YarKhan et al. 2011] is then employed to schedule the different tasks from both stages in an out-of-order fashion, as long as data dependencies are not violated to ensure numerical correctness.

5.3. Matrix-Matrix Multiplication

At this stage, the polar factor Up and the right singular vectors VT have already been calculated from the polar decomposition in Section 5.1 and from the symmetric eigensolver in Section 5.2, respectively. A = UpH (with H = V £VT) can then be rewritten as A = Up(V£VT) = (UpV)£VT. The last stage simply invokes a matrixmatrix multiplication kernel to calculate the remaining left singular vectors U = UpV.

5.4. Pseudocode for QDWH-SVD

As discussed previously, the basic blocks of QDWH-SVD correspond to high-level dense linear algebra operations (QR/Cholesky factorization, symmetric eigensolver, matrix multiplication, etc.). Our QDWH-SVD implementation relies on high performance numerical libraries supporting these operations, such as Intel MKL or MAGMA for targeting CPU and GPU, respectively, and makes successive calls to the corresponding APIs to perform the required operations. Our QDWH-SVD code can also rely on other libraries such as cuSOLVER and any vendor optimized BLAS/LAPACK libraries (e.g., AMD ACML, IBM ESSL, etc.). However, only MAGMA is an open-source code. MAGMA has been initially developed using LAPACK as a reference code. Basically, its main concepts consist in offloading to the GPU the compute-intensive kernels (Level 3 BLAS) to achieve high performance. Algorithm 2 gives the pseudocode for the single/multiple GPU implementations of QDWH-SVD. hA designates a pointer to a data structure located on the host (i.e., CPU) memory and dA targets a data structure located on the device (i.e., GPU) memory.

All in all, the iterative QDWH-SVD procedure relies primarily upon communication friendly and compute-intensive matrix operations, such as the QR/Cholesky factorization and the matrix-matrix multiplication, which are very often further optimized by the vendors (Intel MKL, AMD ACML, IBM ESSL, etc.). These types of operations also

{Computing the polar decomposition A = UpH}

ALGORITHM 2: Pseudo-Code for MAGMA-QDWH-SVD on [Single || Multiple] GPU(s).

dX ■ dscal(dA, a), a ^ ||A||2 k = 1, Li = в/a, в ^ 1/NA-1|2, conv = 100 while (conv > ^5eps || |Li — 1|> 5eps) do

L2 = Li2, dd = ^(4(1 — L2)/L22)_

sqd = V1 + dd, a1 = sqd + ^8 — 4 x dd + 8(2 — L2)/(L2 x sqd))/2 a = real(a1), b = (a — 1)2/4, c = a + b — 1 Li = Li (a + b x L2)/(1 + cL2) dB ■ magma dlacpy(dXk—1, dB) 10: if c > 100 then л/е dB' I

12: dC ■ [magma.dgeqrf.gpu || magmadgeqrf mgpu](dC, tau, dV) 13: dC ■ [magma dorgqr -gpu || magma dorgqr „mgpu](dC, tau, dV ) 14: dXk ■ [cublasDgemm(dC (1 : m, :), dC (m : m + n, :), dXk—1) ||

11: dC

20 21 22

kblas dgemm_mgpu(hC(1 : m, :), hC(m : m + n, :), dXk—1)]

dC — [magmablasdlasetddentity(dC) || lapackf77dlaset(hC)] [dC — cublasDgemmidB, dB, dC) || hC — kblasdgemmjmgpu(dB, dB, hC)] [dC — magmadpotrf\gpu(dC) || hC — magmadpotrfmgpu(hC)] dB — [cublasDtrsm(dC, dBT) || magmadtrsmmgpu(hC, dBT)] dB — [cublasDtrsm(dC, dB) || magmadtrsmmgpu(hC, dB)] dXk — cublasDgeam(dB, dXk—i) end if

conv — ||dXk — dXk-iHF k = k + 1 end while

dH — [cublasDgemm(dXk, dA) || kblasdgemmjmgpu(hXk, hA)]

{Computing the singular values and the right singular vectors H = V £ VT}

if Standard QDWH-SVD then

(£, dV) — magma — QDWHeig(dH) else

(£, hV) — [magma J2stage JDSY EVD(hH) || magma.2stage.DSYEVDmgpu(hH)] end if

{Computing the left singular vectors}

dU — cublasDgemm(dXk, dV), hU — kblasdgemm„mgpu(Xk, hA)

exhibit a high degree ofparallelism. However, although the algorithm performs at most six iterations from the outer loop of the QDWH-SVD framework, it is still important to assess the algorithmic complexity of the overall QDWH-SVD and to compare it against the standard SVD approach (Section 4).

6. ALGORITHMIC COMPLEXITY

In this section, we present the algorithmic complexity of the standard SVD algorithm (DGESVD/DGESDD) and the two variants of the QDWH-SVD procedure, that is, using the QDWH-EIG framework or the two-stage approach as the symmetric eigensolver. We investigate the complexity of all aforementioned SVD algorithms in terms of number of floating-point operations (flops). We assume that the matrix A is square of size n.

6.1. The Standard SVD Solver

The standard approach to compute the SVD of a dense matrix is to first reduce it to bidiagonal form A = U1BV1T. This reduction can be done using Householder reflectors for a cost of 3n3 and is a common step for both standard DGESVD and DGESDD functions, as implemented in LAPACK. If only singular values are needed, computing them from the bidiagonal form using the QR (DGESVD) or the divide-and-conquer (DGESDD) algorithms requires O(n2) flops. If singular vectors are additionally required from B = U2£V2, the LAPACK subroutines DBDSQR and DBDSDC implement an iterative method and a recursive approach based on the QR algorithm and divide-and-conquer, respectively. The subsequent left and right singular vectors are then accumulated during the back transformation phase, that is, U = U1U2 and V = V2 V1, to calculate the singular vectors of the original matrix A. The final estimated flop count to calculate the SVD is 22n3 for either DGESVD or DGESDD [Hansen 1998].

Even though both standard SVD implementations operate at the same flop counts, DGESDD is usually faster than DGESVD thanks to the recursion formulation. However, it necessitates larger workspaces to cast most of operations in terms of Level 3 BLAS, which permit one to achieve further performance optimizations thanks to a higher rate of data reuse.

6.2. The QDWH-SVD Solver

The QDWH-SVD flop count includes the cost of the polar decomposition, the solution of the symmetric eigenvalue problem (which produces the singular values and the right singular vectors), and the matrix-matrix multiplication kernel (which computes the left singular vectors).

As shown in Equation (4), the QDWH flops using QR-based iteration includes the QR decomposition of 2n x n matrix for a cost of (3 + 1)n3 flops. Then, forming [ Q ] explicitly, needs (3 + 1)n3 flops. The product Q1QT needs 2n3 flops. Therefore, the arithmetic cost of each QR-based iteration is (8 + |)n3 flops. For the Cholesky-based iteration in Equation (5), matrix-matrix multiplication involves 2n3, the Cholesky factorization needs 1 n3, and solving two linear systems requires 2n3. Therefore, the arithmetic cost of Cholesky-based iteration is (4 + 1)n3. Computing the positive semidefinite matrix H = UjA requires 2n3. Hence, the overall cost of QDWH is (8 + 3)n3 x #itQR + (4 +

1)n3 x #itChol + 2n3, where #itQR and #itChol correspond to the number of QR-based and Cholesky-based iterations, respectively. As discussed in Nakatsukasa and Higham [2013], the flop count of QDWH depends on the matrix condition number k2, which involves the QDWH iteration. The total flop count for QDWH ranges then from (10 + 3)n3 (for k2(A) « 1 with #itchoi = 2) to (36 + 3)n3 (for k2(A) » 1 with typically #itQR = 2 and #itchoi = 4).

This is followed by finding the eigenvalue decomposition of the resulting matrix H. In the original QDWH-SVD, the default variant of the eigensolver is QDWH-EIG (see Section 5.2.1). The flop count of a single execution of QDWH-EIG includes the cost of the polar decomposition, the subspace iteration, and forming the submatrices A1; A2. It is assumed, following Nakatsukasa and Higham [2013], that during the QDWH-EIG execution, the shift is always taken so that A1 and A2 are both approximately of dimension n. Since one spectral division results in two submatrices of size «n/2 and the arithmetic cost scales cubically with the matrix size, the overall arithmetic cost is approximately Y,¿=0(2 ■ 2-3)p = 4p, where p is the number of flops needed for one run

Table I. Algorithmic Complexity Ranges for Various SVD Solvers

£ U £ V T

DGES{VD,DD} (2 + §)n3 22n3

w/QDWH-EIG w/ two-stage-EIG (26 + 7)n3 <• • • <(87 + 9)n3 12n3 < • • • < 38n3 (30 + 1)n3 <• • • <(90 + 9)n3 (20 + |)n3 < • • • < (46 + |)n3

of QDWH-EIG for n x n matrix. The cost of the invariant subspace is 7n3 to obtain | Householder reflectors. Applying them to A needs (| + |)n3 = |n3. The final step of one recursion of QDWH-EIG is to update the eigenvectors V1 = V1 V21 and V2 = V2 V22, each requiring 1 n3 flops. The flop count of QDWH-EIG depends also on the matrix condition number k2 and ranges from (16 + 9)n3 (for k2(A) « 1) to (50 + |)n3 (for k2(A) > 1) if only the eigenvalues are needed and (17 + |)n3 (for k2(A) « 1) to (52 + 9)n3 (for k2(A) > 1) if the eigenvectors are additionally needed.

The other variant of the symmetric eigensolver is based on a two-stage procedure (two-stage-EIG), as previously explained in Section 5.2.2. For finding the eigenvalues only, it needs | n3 flops to reduce the symmetric matrix into a band matrix (with the matrix bandwidth nb ^ n) and a lower-order amount for the subsequent bulge-chasing step (bandwidth reduction) and for solving the resulting tridiagonal eigenvalue problem. If the eigenvectors are additionally needed, the reduction phase costs 3 n3 due to the accumulation overhead of the intermediary orthogonal transformations, and solving the tridiagonal eigenvalue problem using divide and conquer algorithm D&C needs 3 n3. Due to the two-stage tridiagonal reduction, recovering the original eigenvectors needs 4n3. This complexity very often overestimates the computational costs of the divide-and-conquer due to significant deflation that is often observed. Finally, forming the left singular vectors U = UpV requires an additional 2n3 flops. Thus, the total arithmetic cost of QDWH-SVD with two-stage-EIG is up to 38n3.

6.3. Flops Comparison

Table I summarizes the flop counts of the standard DGESVD and DGESDD LAPACK functions and QDWH-SVD with its two variants of symmetric eigensolvers (QDWH-EIG and two-stage-EIG) for finding the singular values only and all singular vectors, additionally. QDWH-SVD with QDWH-EIG seems to be prohibitive due to the extra flops (33x and 4x more flops for singular values and additionally singular vectors, respectively), compared to the standard DGES{VD,DD} SVD algorithms. Although still expensive, QDWH-SVD with the two-stage-EIG seems to be more reachable with factors up to 14x and2x in terms of the number of extra flops, compared to DGES{VD,DD}, when singular values and additionally singular vectors are needed, respectively.

Square matrices are assumed throughout the article, however, QDWH-SVD is very general and applicable to rectangular matrices. Reducing the flop count for the rectangular matrices can be done by calculating an initial QR factorization whenever m > 1.15n [Nakatsukasa and Higham 2013], whereas the standard SVD methods of bidiagonalization minimize the flop count by performing an initial QR factorization when m > 2n.

Before looking into performance results and how the GPU architecture allows QDWH-SVD to respond from the substantial extra flops, the next section evaluates its numerical robustness against several synthetic matrices with various condition numbers.

Table II. List of Ill-conditioned Synthetic Matrices with Different Singular Value Distributions

Type Description

01 = 1, at = 1/cond, i = 2, ..., n at = 1, i = 1,..., n — 1,an = 1/cond 1-i .

ai = cornin—1, i = 1,...,n ai = 1 - (n—1)(1 - 1/cond), i = 1,...,n n random numbers s [1/cond, 1] whose logarithms are uniformly distributed n random numbers following a uniform distribution

Type 1 Type 2 Type 3 Type 4

Type 5

Type 6

7. NUMERICAL ACCURACY

In this section, the orthogonality of the corresponding right and left singular vectors, the accuracy of the singular values, and the backward error of the computed overall SVD for different matrix types are analyzed for the SVD algorithms (DGESVD, DGESDD, and QDWH-SVD).

7.1. Synthetic Matrix Types

The numerical robustness of the SVD algorithms is assessed against a series of matrix types: (1) random matrices generated from DLARNV function in LAPACK, (2) well-conditioned synthetic matrices (cond equal to 1), and (3) ill-conditioned synthetic matrices. Each dense synthetic matrix A = QDQT of the latter matrix type is generated using the LAPACKroutine DLATMS, by initially setting a diagonal matrix D = diag(£) containing the singular values, which follows a certain distribution and from an orthogonal matrix Q generated from random entries. Table II enumerates the matrix types based on six different singular value distributions, which are all ill-conditioned with cond equal to the 1/(ulp), where ulp is the machine double precision. LAPACK uses these synthetic matrices to test the whole library during the nightly builds, especially the eigensolver and the SVD algorithms.

7.2. Environment Settings

The numerical accuracy assessments as well as the performance experiments have been run on a shared-memory multicore architecture composed of a dual-socket 10-core Ivy Bridge Intel(R) Xeon(R) CPU E5-2680 v2 (20 cores total), operating at 2.8GHz. The system has 2.5MB of L2 cache, MB of L3 cache, and 256GB of DDR3 main memory. There are three NVIDIA Tesla K40 GPUs (ECC off) with 12GB of main memory, each connected to the CPU through a PCIe bus 16x. The different implementations (all written in C) have been compiled using the Intel Compiler Suite v13.0.1 and NVIDIA CUDA compilation tools v6.0. The codes have been linked against the following numerical libraries: NVIDIA CUBLAS 6.0, Intel BLAS/LAPACK MKL, and MAGMA v1.4.1 for CPU only and GPU only implementations, respectively.

7.3. Definition of Accuracy Metrics

For a given general matrix A s Rnxn, the computed singular values are £ = diag(a1, a2,..., an). The corresponding left and right singular vectors are given by U and V. The norm || . ||F denotes the Frobenius norm. The various accuracy tests are then expressed as follows:

||I - UUTUF ||I - VVTHF

-and-, (6)

MKL-DGESDD ■ MAGMA-DGESDD

MKL-DGESDD MAGMA-DGESDD

Matrix size

Well-conditioned matrix.

Matrix size

Ill-conditioned matrix (type 2).

MKL-DGESDD -fl

...... 1

-•?> % % % %% ■ Matrix size

Well-conditioned matrix.

Matrix size

Ill-conditioned matrix (type 2).

Fig. 5. Orthogonality of the left (a-c) and right (b-d) singular vectors.

for the orthogonality of the left and right singular vectors (U and V),

|ai - Si |

for the accuracy of the ith computed singular value ai, where Si is the ith exact singular value (analytically known or computed using LAPACK-DGESVD), and

|| A - U£Vt||f (8)

n|| A||f ,

for the accuracy of the overall computed SVD. These standard accuracy metrics permit one to numerically evaluate the stability of these new SVD algorithms compared to the existing state-of-the-art SVD implementations.

7.4. Accuracy Issues of DGESDD

This section highlights some of the issues encountered specifically when calling the SVD solver DGESDD. Figure 5 shows how the orthogonality of the left and right singular vectors (Equation (6)) when using MKL-DGESDD and MAGMA-DGESDD fails to meet the double precision floating-point arithmetic for well- and ill-conditioned matrices (type 2) for certain sizes. This precision loss obviously propagates to the accuracy of the singular values (Equation (7)) and the overall computed SVD (Equation (8)), as shown in Figure 6, due to convergence failure during the divide-and-conquer SVD solver. In particular, DGESDD may not converge and shows some inaccuracy when dealing with

CO Oi 3

I 1e-06 rt

I 1e-09

S 1e-12

MKL-DGESDD MAGMA-DGESDD

'n. /j. >A

0.1 0.0001 P 1e-07

& 1e-10

CO □

«I 18-13 1e-16

Matrix size

Well-conditioned matrix.

MKL-DGESDD -MAGMA-DGESDD

Matrix size

Well-conditioned matrix.

0.0001

1 1e-07

S 1e-13

Matrix size

Ill-conditioned matrix (type 2).

MKL-DGESDD -MAGMA-DGESDD

Matrix size

Ill-conditioned matrix (type 2). Fig. 6. Accuracy of the singular values (a-c) and the overall SVD (b-d).

small singular values and/or multiplicity, as is the case for the two matrix types studied herein. The accuracy issue seems also to depend on the software implementation. All in all, the main cause of the accuracy issue (numerical sensitivity and/or software bug) is not clear and remains unidentified.

7.5. Accuracy Assessments of SVD Solvers

This section compares the numerical accuracy of DGESVD and DGESDD, as implemented in MKL (CPU) and MAGMA (GPU) libraries, against MAGMA-QDWH-SVD (GPU) on different matrix types. Since from an accuracy point of view MAGMA-QDWH-SVD with QDWH-EIG or two-stage-EIG gives a similar order of accuracy, only a single curve (MAGMA-QDWH-SVD with two-stage-EIG) is shown on the various graphs, for clarity purposes. By the same token, except for DGESDD and matrix type 2 (see previous Section 7.4), type 4 has been selected as a representative matrix type from Table II since the trend of the accuracy curves for all SVD solvers turns out to be quasi identical.

7.5.1. Orthogonality ofthe Singular Vectors. Figure 7 shows the orthogonality of all singular vectors. In particular, Figures 7(a), 7(c), 7(e) and Figures 7(b), 7(d), 7(f) present the orthogonality of the left and right singular vectors, respectively, for an ill-conditioned matrix of type 4, a well-conditioned matrix, and a random matrix. All SVD solvers pass the orthogonality test (Equation (6)) and present comparable orders of accuracy. Figures 7(c) and 7(d) do not show the orthogonality of the left and right singular vectors obtained from DGESDD, due to its accuracy issues revealed in Section 7.4.

MKL-DGESVD Q

MAGMA-DGESVD —9—

" MAGMA-QDWH-SVD —m— '

MAGMA-DGESDD —4,—

MKL-DGESDD

Matrix size

(a) Ill-conditioned matrix (type 4).

MKL-DGESVD -MAGMA-DGESVD MAGMA-QDWH-SVD MAGMA-DGESDD -MKL-DGESDD -

Matrix size

(b) Ill-conditioned matrix (type 4).

MKL-DGESVD MAGMA-DGESVD MAGMA-QDWH-SVD

Matrix size

(c) Well-conditioned matrix.

Z 1e-14 o

? 1e-15

o 1s"16

1e-17 1e-18

MKL-DGESVD MAGMA-DGESVD MAGMA-QDWH-SVD

Matrix size

(d) Well-conditioned matrix.

MAGMA-DGESVD „

MKL-DGESVD —e—

" MAGMA-QDWH-SVD ■

MAGMA-DGESDD —a,—

MKL-DGESDD

Matrix size

(e) Random matrix.

MAGMA-DGESVD 0

MKL-DGESVD -6-

MAGMA-QDWH-SVD

MAGMA-DGESDD -4.-

MKL-DGESDD

Matrix size

(f) Random matrix.

Fig. 7. Orthogonality of the left (a-c-e) and right (b-d-f) singular vectors.

7.5.2. Accuracy of the Singular Values and SVD Backward Error. Figure 8 presents the accuracy of the singular values (Equation (7)) and the backward error of the overall SVD (Equation (8)). In particular, Figures 8(a), 8(c), 8(e) and Figures 8(b), 8(c), 8(f) present the accuracy of the singular values and the backward error of the overall SVD solver, respectively, for an ill-conditioned matrix of type 4, a well-conditioned matrix, and a random matrix. The accuracy curves for all SVD solvers show that all numerical accuracy tests pass, thanks to an order of accuracy close to the double precision

1e-12 -1—i—i—i—i-1— 1e-14

MKL-DGESVD -Ô-

1e-13 MAGMA-DGESVD 0 18-15

MKL-DGESDD —v— "

® MAGMA-DGESDD -A.—

MAGMA-QDWH-SVD -■-

£ CO Z3 1e-14 « 18-16

c 1e-15 *—* ' ' 1 ' & 1e-17

s 1e-16 ■S 1e-18

1e-17 1e-19

1e-18 ...... 18-20

Matrix size

(a) Ill-conditioned matrix (type 4).

MKL-DGESVD „

MAGMA-DGESVD —$—

MAGMA-QDWH-SVD ■

MAGMA-DGESDD —a—

MKL-DGESDD

. ; ; ; ; ;.

Matrix size

(b) Ill-conditioned matrix (type 4).

1e-12 1e-13

I 1e-14 ®

w 1e-15

1e-16 1e-17 1e-18

MKL-DGESVD ■ MAGMA-DGESVD MAGMA-QDWH-SVD

Matrix size

(c) Well-conditioned matrix.

1e-14 1e-15

1 19"16

1 1B"17 S 1e-18

1e-19 1e-20

MKL-DGESVD -MAGMA-DGESVD MAGMA-QDWH-SVD

% % % ■%> % Matrix size

(d) Well-conditioned matrix.

MAGMA-DGESVD MAGMA-QDWH-SVD MKL-DGESVD ■ MKL-DGESDD -MAGMA-DGESDD -

16-14 1e-15 ■g 1e-16

& 1e-17 «

MAGMA-DGESVD 0

MKL-DGESVD —6—

MAGMA-QDWH-SVD —X—

MAGMA-DGESDD —A—

MKL-DGESDD

Matrix size

(e) Random matrix.

H?'%W>'- «

Matrix size

(f) Random matrix.

Fig. 8. Accuracy of the singular values (a-c-e) and backward error of the overall SVD (b-d-f).

floating-point arithmetic. Similarly, Figures 8(c) and 8(c) do not show the accuracy of the singular values and the backward error of the overall SVD solver obtained from DGESDD, due to its accuracy issues described in Section 7.4.

7.6. Accuracy Results Using Mixed Precision

Since the convergence speed of the QDWH-SVD framework is remarkable, a mixed precision technique has been integrated into the original QDWH-SVD algorithm so

1e-11 1e-12

■I 1e-13 £ iS

ä 1е-14

° 1e-15 Д 1e-16

1e-17 1e-18

MAGMA-QDWH-SVD-S5D1 ■ MAGMA-QDWH-SVD-S3D3 ■ MAGMA-QDWH-SVD-S1D5 MAGMA-DGESVD MKL-DGESVD ■ MAGMA-QDWH-SVD-S0D6 MKL-DGESDD -MAGMA-DGESDD -

Matrix size

(a) Ill-conditioned matrix (type 4).

ï 1e-12

g 1e-14

MAGMA-QDWH-SVD-S4D1 —*— MAGMA-QDWH-SVD-S3D2 —к— MAGMA-QDWH-SVD-S2D3 MAGMA-QDWH-SVD-S0D5

MAGMA-DGESVD -♦— MAGMA-DGESDD —»— MKL-DGESDD — MKL-DGESVD —e—

(b) Random matrix.

Fig. 9. Accuracy of the mixed precision MAGMA-QDWH-SVD solver. SiDj means i iterations are performed in single precision and the remaining j iterations in double precision.

that the first iterations are done in single precision and the subsequent ones in double precision in order to be able to recover some of the lost digits. This type of precision play, to maintain most of the value of a strictly high-precision implementation while doing a significant fraction of the work in faster and cheaper arithmetic, is typical of algorithmic adaptations to extreme architectures to come. The authors provide empirical evidence that mixed precision QDWH-SVD implementation can still recover some of the lost digits. Unfortunately, this novel contribution does not work when singular vectors are additionally needed, even though their orthogonality is preserved. The loss of digits in the singular values and their corresponding singular vectors directly affects the backward error of the overall SVD solver (Equation (8)), which makes the mixed precision technique suitable only for calculating singular values. Figure 9 shows the impact of the mixed precision technique on the numerical accuracy of the MAGMA-QDWH-SVD singular values on an ill-conditioned matrix (type 4) and a random matrix only, since for a well-conditioned matrix, the QDWH-SVD framework rapidly converges after two iterations. The condition number for the random matrix is less than the ill-conditioned matrix and, therefore, the polar decomposition needs fewer iterations to converge, as shown in Figure 9(b). By increasing the number of iterations in single precision over double precision, the accuracy of the singular values of MAGMA-QDWH-SVD degrades and it loses two to three digits at most, compared to the other SVD solvers, including the full double precision MAGMA-QDWH-SVD-S0D6. This can still be of interest for applications that show tolerance to a loss of a few digits. Although we empirically observed this numerical behavior, formulating it mathematically remains an open problem.

8. PERFORMANCE RESULTS

The performance results have been obtained on the system described in Section 7.2. All subsequent performance graphs show the total elapsed time (seconds) in logarithmic scale against the matrix size, unless mentioned otherwise. Similar to the accuracy study in Section 7.5, type 4 has been selected as a representative matrix type from Table II since the trend of the performance curves for all SVD solvers turns out to be quasi identical. Although the range of all experiments are of power of 2, the QDWH-SVD code still supports odd matrix sizes through the MAGMA implementation. Because MAGMA operates on vertical panels of width of power of 2, odd matrix sizes will generate clean up section, which are isolated from the main computation since only the

Fig. 10. Performance of both symmetric eigen- Fig. 11. Performance of various QDWH-SVD imple-solvers. mentations.

last columns/rows of the matrix are concerned. Therefore, the poor performance of the GPU kernels in the clean up section does not interfere with the overall performance.

8.1. Performance Comparisons of QDHW-SVD Variants on Single GPU

Figure 10 shows the performance of both symmetric eigensolvers (MAGMA-QDWH-EIG and MAGMA-two-stage-EIG) on random matrices, when calculating all singular vectors. Because MAGMA-QDWH-EIG performs around 6.5x more flops than MAGMA-two-stage-EIG, there is almost an order of magnitude difference in terms of elapsed time between them, when looking at asymptotic matrix sizes. Figure 11 highlights the overall performance of the two QDWH-SVD variants, after plugging in QDWH-EIG or two-stage-EIG, and shows an overall twofold speedup when selecting two-stage-EIG over QDWH-EIG as the symmetric eigensolver. For the subsequent performance graphs, MAGMA-QDWH-SVD will always refer to QDWH-SVD with the symmetric eigensolver two-stage-EIG. The authors did not include the performance graph of QDWH-SVD solely based on Intel MKL library [Intel 2015]. Although this variant stands as a natural reference implementation, the extra flops required by QDWH-SVD prohibit getting a competitive CPU implementation, even against other standard SVD solvers on CPUs. This reinforces the need for GPU high computational power in order to overcome these extra flops.

8.2. Performance Comparisons of All SVD Solvers on Single GPU

Figures 12 and 13 present the performance comparisons of all SVD solvers (i.e., DGESVD, DGESDD, and QDWH-SVD) for an ill-conditioned matrix of type 4, a well-conditioned matrix, and a random matrix on x86 (using Intel MKL) and GPU, respectively. In particular, Figures 12(a), (c), (e), 13(a), (c), (e) and Figures 12(b), (d), (f), 13(b), (d), (f) depict the performance comparisons when computing the singular values and additionally all singular vectors, respectively. Figures 12(c), (d), 13(c), (d) do not draw the performance curves of MKL/MAGMA-DGESDD, again because of the accuracy issued mentioned in Section 7.4. When calculating only the singular values, the MKL variants (CPU) of DGESVD and DGESDD outperform all MAGMA-DGESVD/DGESDD/QDWH-SVD versions on small matrix sizes, due to the low arithmetic complexity that unveils the overhead of moving data to the device. The same analysis also applies when singular vectors are required additionally for MKL-DGESVD (CPU), only for small well-conditioned matrix sizes as in Figure 12(d), thanks to a faster convergence of the SVD solvers. For large matrix sizes, when only singular values are calculated, MAGMA-QDWH-SVD shows competitive performance compared

(a) Ill-conditioned matrix (type 4).

(b) Ill-conditioned matrix (type 4).

Matrix size

(с) Well-conditioned matrix.

к <?>•_ 4). 'n„

Matrix size

(d) Well-conditioned matrix.

MKL-DGESVD MKL-DGESDD MAGMA-QDWH-SVD

(e) Random matrix.

(f) Random matrix.

Fig. 12. Performance comparisons of MAGMA-QDWH-SVD (GPU) against Intel MKL (CPU) for £ (a-c-e) and additionally U£VT (b-d-f).

to other SVD solvers on single GPU for ill-conditioned and random matrices, although MAGMA-QDWH-SVD performs more flops than other SVD solvers. It even achieves a twofold and 3.5x speedups against MAGMA-DGESVD (GPU) and MKL-DGESVD (CPU) for well-conditioned matrices, thanks to a reduced number of flops. When singular vectors are required additionally, MAGMA-QDWH-SVD outperforms the SVD

MAGMA-DGESVD MAGMA-DGESDD -MAGMA-QDWH-SVD

Matrix size

(a) Ill-conditioned matrix (type 4).

(b) Ill-conditioned matrix (type 4).

MAGMA-DGESVD MAGMA-QDWH-SVD

Matrix size

(c) Well-conditioned matrix.

MAGMA-DGESVD MAGMA-QDWH-SVD

Matrix size

(d) Well-conditioned matrix.

MAGMA-DGESVD MAGMA-DGESDD -MAGMA-QDWH-SVD

'n. fy.

Matrix size

(e) Random matrix.

MAGMA-DGESVD MAGMA-DGESDD MAGMA-QDWH-SVD

Matrix size

(f) Random matrix.

Fig. 13. Performance comparisons of MAGMA SVD solvers (GPU) for £ (a-c-e) and additionally U£VT (b-d-f).

solvers on all matrix types, with up to 18% and 30% compared to MAGMA-DGESDD, for ill-conditioned and random matrices, respectively. In fact, the maximum speedup for MAGMA-QDWH-SVD has been achieved against the matrix of type 6 II, which has similar condition number to the random matrix: it outperforms by up to fourfold, and up to threefold for asymptotic random matrix sizes, the equivalent MKL-DGESVD

(a) Ill-conditioned matrix (type 4). (b) Random matrix.

Fig. 14. Performance comparisons (E only) using mixed precision techniques. SiDj means i iterations are performed in single precision and the remaining j iterations in double precision.

and MAGMA-DGESVD routines, respectively, when computing all singular values and vectors.

Figure 14 assesses the performance for computing singular values using the mixed-precision technique. The condition number for the random matrix is less than the ill-conditioned matrix (type 4) and, therefore, the polar decomposition needs less iteration to converge, as shown in Figure 14(b). As expected, time to solution of MAGMA-QDWH-SVD decreases when executing more QDWH iterations in single precision floating-point arithmetic. The mixed precision technique allows one to further reduce time to solution and brings an additional 40% performance improvement compared to the full double precision floating-point arithmetic, at the expense of losing two to three digits of accuracy at most.

Finally, Figure 15 shows the time breakdown between the main phases of MAGMA-QDWH-SVD for random matrices. It is clear from this profiling figure that the polar decomposition (QDWH) is the most time-consuming stage and its elapsed time increases substantially as the matrix size gets larger.

Another interesting performance assessment, which goes beyond the scope of the article, will be to compare MAGMA-QDWH-SVD against the future MAGMA two-stage SVD solvers. Indeed, although the latter has not been implemented yet due to its coding complexity with new computational kernel implementations, it should perform less flops than the former and therefore, it should be a competitive SVD solver. However, porting the future MAGMA two-stage SVD code to other architectures will not be a trivial task and may increase the software complexity, hence this initial study conducted herein. It is also not clear at this stage which SVD implementations will win on distributed-memory systems, although the recursive formulation of QDWH-SVD would engender only local communications within the network subsystem, as opposed to the two-stage SVD. A distributed-memory QDWH-SVD implementation can always fall back to a two-stage approach once the subproblems would fit on a single shared-memory node.

8.3. Scalability of MAGMA-QDWH-SVD on Multiple GPUs

Figure 16 shows the performance scalability of the multiple GPU implementation of MAGMA-QDWH-SVD. In fact, this corresponds to the first multiple GPU implementation of an SVD solver (to our knowledge). The curves show the speedup of MAGMA-QDWH-SVD on up to three K40 GPUs. This first attempt presents a speedup of roughly 1.7 and 2.1 on two and three GPUs, respectively. The scalability is somewhat limited

Matrix size

Fig. 15. Profiling the computational stages of Fig. 16. Performance scalability of MAGMA-MAGMA-QDWH-SVD. QDWH-SVD on multiple GPUs.

mainly because of the overhead of moving data between the host and the three GPUs. There is obviously still room for improvement by doing more GPU computations and further hiding the communication overhead by computations, which are doable thanks to the nature of operations (compute-bound) occurring in the polar decomposition step.

9. CONCLUSIONS AND FUTURE WORK

A new high performance implementation of theQDWH-SVD solver has been presented on multicore architecture enhanced with GPUs. Based on three successive computational stages: (1) the polar decomposition calculation of the original matrix using the QDWH algorithm, (2) the symmetric eigendecomposition of the resulting polar factor to obtain the singular values and the right singular vectors, and (3) the matrix-matrix multiplication to get the associated left singular vectors, the GPU accelerated QDWH-SVD framework outperforms the equivalent routines from existing state-of-the-art commercial (Intel MKL) and open-source GPU libraries (MAGMA) by up to 3.5 x and twofold speedups for asymptotic matrix sizes, respectively, although it performs up to two times more flops when computing all singular vectors. When only singular values are needed, the number of extra flops is further exacerbated (up to 14x), depending on the matrix conditioning, compared to the standard methods. The singular value only implementation of QDWH-SVD can still recover and run up to 18% faster than the best existing equivalent routines. Integrating mixed precision techniques in the solver can further provide up to 40% improvement at the price of losing two to three digits of accuracy, compared to the full double precision floating-point arithmetic version. These numbers reported herein show an unprecedented gain for solving SVD problems using QDWH and may leverage applications' performance for which the SVD solver plays an important role. The comprehensive numerical testing confirms the robustness of the QDWH-SVD solver. The provided multi-GPU implementation shows a decent scalability of the overall framework and constitutes the first multi-GPU SVD solver.

The gains in execution time of QDWH-SVD over the current state of the art are poised to become more significant in computational environments that are yet more austere in memory access. Thanks to the conventional kernels (QR/Cholesky factorizations, GEMM, etc.) upon which QDWH-SVD relies, porting the framework to other architectures such as Intel Xeon Phi, AMD APU, or even ARM processors seems to be a realistic challenge and should not require a significant effort in terms of software maintenance, as long as an optimized vendor BLAS/LAPACK library is available on the underlying system. Another research direction will be to replace current block

algorithms (coarse-grained) used in the QDWH-SVD framework by tile DAG-scheduled algorithms (fine-grained), which will permit one to take into account the matrix structure of the identity during QDWH iterations, hence, further reducing the overall number of flops. This will allow one to pipeline the various computational stages and to maintain useful computing in all processing cores. Finally, the authors are currently designing an implementation of QDWH-SVD for distributed memory systems.

ACKNOWLEDGMENTS

This work was supported by the Extreme Computing Research Center at KAUST. The authors would like to thank Ahmad Abdelfattah for his help to integrate KBLAS into QDWH-SVD, NVIDIA for the hardware donations, and KAUST IT for the hardware support.

REFERENCES

Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. 2009. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In J. Phys.: Conf. Sen, 180 (2009). Edward Anderson, Zhaojun Bai, Christian Heinrich Bischof, Laura Susan Blackford, James Weldon Demmel, Jack J. Dongarra, Jeremy J. Du Croz, Anne Greenbaum, Sven Hammarling, A. McKenney, and Danny C. Sorensen. 1999. LAPACK User's Guide (3rd ed.). Society for Industrial and Applied Mathematics, Philadelphia.

K. S. Arun. 1992. A unitarily constrained total least squares problem in signal processing. SIAM J. Matrix

Anal.Appl. 13, 3 (1992), 729-745. D0I:http://dx.doi.org/10.1137/0613046 Grey Ballard, James Demmel, and Ioana Dumitriu. 2010. Minimizing communication for eigenproblems and

the singular value decomposition. CoRR abs/1011.3077 (2010). http://arxiv.org/abs/1011.3077. I. Y. Bar-Itzhack. 1975. Iterative optimal orthogonalization of the strapdown matrix. IEEE Trans. Aerospace

Electron. Syst. AES-11, 1 (Jan. 1975), 30-37. D0I:http://dx.doi.org/10.1109/TAES.1975.308025 Christian H. Bischof, Bruno Lang, and Xiaobai Sun. 2000. Algorithm 807: The SBR toolbox—Software for successive band reduction. ACM Trans. Math. Software 26, 4 (2000), 602-616. D0I:http://dx.doi.org/ 10.1145/365723.365736

BLAS. 2013. Basic Linear Algebra Subprograms v3.5. (Nov 2013). Available at http://www.netlib.org/blas/. James Demmel and W. Kahan. 1990. Computing small singular values of bidiagonal matrices with guaranteed high relative accuracy. SIAM J. Sci. Statist. Comput. 5 (1990), 873-912. Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean-Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braunschweig, Franck Cappello, Barbara Chapman, Xuebin Chi, Alok Choudhary, Sudip Dosanjh, Thom Dunning, Sandro Fiore, Al Geist, Bill Gropp, Robert Harrison, Mark Hereld, Michael Heroux, Adolfy Hoisie, Koh Hotta, Zhong Jin, Yutaka Ishikawa, Fred Johnson, Sanjay Kale, Richard Kenway, David Keyes, Bill Kramer, Jesus Labarta, Alain Lichnewsky, Thomas Lippert, Bob Lucas, Barney Maccabe, Satoshi Matsuoka, Paul Messina, Peter Michielse, Bernd Mohr, Matthias S. Mueller, Wolfgang E. Nagel, Hiroshi Nakashima, Michael E. Papka, Dan Reed, Mitsuhisa Sato, Ed Seidel, John Shalf, David Skinner, Marc Snir, Thomas Sterling, Rick Stevens, Fred Streitz, Bob Sugar, Shinji Sumimoto, William Tang, John Taylor, Rajeev Thakur, Anne Trefethen, Mateo Valero, Aad Van Der Steen, Jeffrey Vetter, Peg Williams, Robert Wisniewski, and Kathy Yelick. 2011. The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 25, 1 (Feb. 2011), 3-60. D0I:http://dx.doi.org/10.1177/1094342010391989 K. Vince Fernando and Beresford N. Parlett. 1994. Accurate singular values and differential QD algorithms.

Num. Math. 67 (1994), 191-229. Jerome A. Goldstein and Mel Levy. 1991. Linear algebra and quantum chemistry. Am. Math. Monthly 98, 10

(Oct. 1991), 710-718. D0I:http://dx.doi.org/10.2307/2324422 Gene H. Golub and C. Reinsch. 1970. Singular value decomposition and least squares solutions. Num. Math. 14(1970), 403-420.

Gene H. Golub and Charles F. Van Loan. 1996. Matrix Computations (3rd ed.). Johns Hopkins University

Press, Baltimore, Maryland. Ming Gu and Stanley C. Eisenstat. 1995. A divide-and-conquer algorithm for the bidiagonal SVD. SIAM J.

Matrix Anal. Appl. 16, 1(1995), 79-92. Azzam Haidar, Jakub Kurzak, and Piotr Luszczek. 2013. An improved parallel singular value algorithm and its implementation for multicore hardware. Proceedings of the International Conference

on High Performance Computing, Networking, Storage and Analysis, Article 90 (2013), 12 pages. DOI: http://dx.doi.org/10.1145/2503210.2503292 Azzam Haidar, Hatem Ltaief, and Jack Dongarra. 2011. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. In Proceedings ofSC'11 Conference on High Performance Computing Networking, Storage and Analysis. ACM, 8. Azzam Haidar, Stanimire Tomov, Jack Dongarra, Raffaele Solc, and Thomas C. Schulthess. 2014. A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine-grained memory aware tasks. IJHPCA (2014), 196-209. Per Christian Hansen. 1998. Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion. Society for Industrial and Applied Mathematics, Philadelphia. http://books.google.com. sa/books?id=A5XWG\_PFFdcC. Nicholas J. Higham and Pythagoras Papadimitriou. 1993. Parallel Singular Value Decomposition via the Polar Decomposition. Numerical Analysis Report No. 239. University of Manchester, England. ftp://vtx.ma.man.ac.uk/pub/narep/narep239.dvi.Z. Intel. 2015. Math Kernel Library. (2015). Available at http://software.intel.com/en-us/articles/intel-mkl/. Bruno Lang. 1999. Efficient eigenvalue and singular value computations on shared memory machines.

Parallel Comput. 25, 7 (1999), 845-860. Hatem Ltaief, Jakub Kurzak, and Jack Dongarra. 2010. Parallel band two-sided matrix bidiagonalization for multicore architectures. IEEE Transactions on Parallel and Distributed Systems 21, 4 (April 2010). Hatem Ltaief, Piotr Luszczek, Azzam Haidar, and Jack Dongarra. 2011. Solving the generalized symmetric eigenvalue problem using tile algorithms on multicore architectures. In PARCO (Advances in Parallel Computing), Koen De Bosschere, Erik H. D'Hollander, Gerhard R. Jou-bert, David A. Padua, Frans J. Peters, and Mark Sawyer (Eds.), Vol. 22. IOS Press, 397-404. http://dx.doi.org/10.3233/978-1-61499-041-3-397 Piotr Luszczek, Hatem Ltaief, and Jack Dongarra. 2011. Two-stage tridiagonal reduction for dense symmetric

matrices using tile algorithms on multicore architectures. In Proceedings of IPDPS 2011. ACM. MAGMA. 2009. Matrix Algebra on GPU and Multicore Architectures. Innovative Computing Laboratory,

University of Tennessee. (2009). Available at http://icl.cs.utk.edu/magma/. Yuji Nakatsukasa, Zhaojun Bai, and Franois Gygi. 2010. Optimizing Halley's iteration for computing the

matrix polar decomposition. SIAM J. Matrix Anal. Appl. (2010), 2700-2720. Yuji Nakatsukasa and Nicholas J. Higham. 2013. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comput. 35, 3 (2013), A1325-A1349. DOI:http://dx.doi.org/10.1137/120876605 Robert Schreiber and Beresford Parlett. 1988. Block reflectors: Theory and computation. SIAM J. Numer.

Anal. 25, 1 (1988), 189-205. D0I:http://dx.doi.org/10.1137/0725014 Peter H. Schnemann. 1966. A generalized solution of the orthogonal procrustes problem. Psychometrika 31,

1 (1966), 1-10. DOI:http://dx.doi.org/10.1007/BF02289451 Lloyd N. Trefethen and David Bau. 1997. Numerical Linear Algebra. SIAM, Philadelphia, PA. http://

www.siam.org/books/0T50/Index.htm. Asim YarKhan, Jakub Kurzak, and Jack Dongarra. 2011. QUARK Users' Guide: QUeueing and Runtime for Kernels. University of Tennessee Innovative Computing Laboratory Technical Report ICL-UT-11-02.

Received June 2015; revised February 2016; accepted February 2016