Available online at www.sciencedirect.com

ScienceDirect Procedia

Computer Science

Procedia Computer Science 4 (2011) 126-135

Efficient Solution of Evolution Models for Virus Populations

Gerhard Niederbruckera, Wilfried N. Gansterera'*

a University of Vienna, Research Lab Computational Technologies and Applications (Austria)

ELSEVIER

Abstract

The computation of the quasispecies in Eigen's quasispecies model requires the solution of a very large scale eigenvalue problem. Since the problem dimension is of an exponential growing nature the well known methods for dealing with such a problem run out of resources already far away from practically relevant cases. We propose the use of an implicit matrix vector product using the special problem structure as building block for eigenvalue solvers to partially overcome the exponential growth, which let us reach unexpected large problem sizes. As we will show our implicit matrix vector product is a prime example for an algorithm perfectly matching the requirements of GPU computing since it has low space and high parallel computation requirements. Therefore we will also present an GPU implementation delivering a speedup factor of about 100 compared to a standard implementation.

Keywords: evolution models for virus populations, quasispecies, eigenvector computation, structured eigenvalue problem

1. Introduction

We present a GPU-based approach for numerically computing the largest eigenvector of a large specially structured eigenproblem arising in modelling the evolution of virus populations. By exploiting the specific problem properties, our new approach is much more efficient than standard methods and is an important step towards handling the exponential growth of the underlying eigenproblem.

Under some assumptions on the environment the evolution of RNA molecules can be modeled by a system of ODEs [1]. In this model, RNA molecules are represented as strings over the binary alphabet with a fixed length v which we call the chain length. The other important model parameter is the error rate p which represents the possibility of single point mutations in an RNA sequence. Given a fixed v we will always have to consider all 2v possible sequences since any RNA molecule can mutate into any other even if the possibility might be very low. Our overall goal is to numerically compute the so called quasispecies [2] which describes an ordered stationary distribution in the model. In practice, cases of interest start with chain lengths of 40, and for realistic viruses, chain lengths are 100 and above. Thus, an important objective of our work is to raise the bar of problem instances which can be solved in reasonable time as high as possible. For some kinds of problems with special structure efficient methods are known which reduce

* Corresponding author

Email addresses: gerhard.niederbrucker@univie.ac.at (Gerhard Niederbrucker), wilfried.gansterer@univie.ac.at (Wilfried N. Gansterer)

1877-0509 © 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of Prof. Mitsuhisa Sato and Prof. Satoshi Matsuoka doi:10.1016/j.procs.2011.04.014

the computational complexity by an order of magnitude [3, 4]. Our methods are designed to deal with a more general and therefore more realistic setting were no efficient computations were performed yet [5].

The ODE system leads to a non-negative matrix W e R2"x2 which basically describes the constitution and the possibilities for mutations of the 2V RNA molecules with chain length v. Our problem is to compute the eigenvector corresponding to the largest eigenvalue of W up to a defined degree of precision. This vector contains the information about the concentration of the quasispecies. The hardness of this problem comes clearly from the exponential growth of the dimension of W with the chain length v. So we need significant improvements in our methods to see an improvement in the maximal tractable chain length v. In this paper, we consider applying the power iteration as a numerical method for computing the eigenvector we are looking for.

Given this eigenvalue problem and the knowledge that the matrix dimension grows exponentially with the problem size we develop algorithms to partially tackle this exponential growth. The first idea how to deal with such huge matrices is to use thresholding together with sparse computations. From sparse algorithms we expect that their complexity depends only on the number of non-zero matrix elements and not on the dimension of the matrix. Together with a low percentage of non-zero elements this strategy leads to a first reduction of time and space consumption. As an improvement to this straightforward approach we will show a method for creating the sparse matrix implicitly without touching the whole matrix which will give us already a significant improvement. Due to the exponential growth of the dimension even the sparse computations lead to serious problems in the space consumption. To solve this issue we will present an implicit matrix vector product for our matrix W where we do not need to store the elements of W in the memory. Since we do not lose the basic shape of the usual matrix vector product we are able to massively parallelize our computations. Together with the low memory consumption this will lead us to OpenCL and especially GPU computing where programs should fulfill exactly those requirements to get a great benefit from the usage of GPU devices.

The rest of the paper is structured as follows: In Section 2 we precisely define the eigenvalue problem we are interested in and show its properties and applications. In Section 3 we will show basic strategies to tackle the problem and afterwards we present our special purpose implicit matrix vector product for the power iteration. Section 4 is devoted to concrete implementations and experimental results showing how our matrix vector product benefits from the usage of GPUs as computing devices. Throughout this paper we will use zero-based indexing when dealing with vectors and matrices to get a consistent and more readable notation.

2. Problem Statement

2.1. Biochemical Background

We briefly introduce the model which leads to the eigenvalue problem we are dealing with. For a more detailed introduction and further references see [5,6]. As already mentioned, we use a binary alphabet for our RNA molecules. When talking about a molecule Xi with 0 < i < 2v we mean the sequence Xi := (bo,..., bv-1) where the vector (b0,..., bv-1) is the binary encoding of i with length v. As a measure for the difference between a species Xi and a species Xj we will use the well known Hamming distance dH(Xi, Xj) which delivers the minimal number of changes (mutations) one has to apply to get Xj out of Xi.

Due to Eigen [1] the following system of ODEs models the evolution of RNA molecules:

dx, dt

J]fj • Qij • Xj(t) - Xi(t) • ®(t), i = 0,...,N = 2v - 1

®(0 = 2 fj • Xj(t), X Xj(t) = 1

j=0 j=0

where xi is the relative concentration of the molecular species Xi. The fitness values fj describe the constitution of the molecular species Xj in terms of a single number and the (i, j) entry of the mutation matrix Q describes the possibility for a mutation from sequence Xi to sequence Xj and is defined as

Quj = pdH(X'Xj) • (1 - p)v-iH(X'Xj) (1)

By definition the values Q;, j depend only on dH(X;, Xj) and therefore the whole matrix Q consists of only v +1 different values. This also illustrates that p has to be understood as an average error rate over all possible mutations since there is no dependence on the positions and the overall number of the mutations. Equation 1 also shows that an index i (row or column) corresponds to the sequence X,. It would also be possible to apply some permutation n such that an index i corresponds to sequence Xn(i)1. When using a permutation one has to apply the permutation to Q as well as to F to get the same results.

Since the x; represent a relative concentration we are only interested in solutions where all entries of the computed eigenvector are non-negative. Other solutions where one or more entries are negative are not interesting for us since there is no physically meaningful interpretation of a negative concentration.

Since our ODE system is a Bernoulli system we get by the proper change of variables a linear system with constant coefficients Z = W • z with W = Q • F where Q is defined by Equation (1) and F is the diagonal matrix with the fitness values fj on the diagonal. The search of the quasispecies now reduces to the computation of the eigenvector corresponding to the dominating eigenvalue of W [7]. Moreover, the Perron-Frobenius theorem [8] guarantees that the eigenvector we are searching for only has non-negative entries since W clearly satisfies the conditions for applying the theorem.

As mentioned before, the entries in xi represent the relative concentration of the species Xi in the stationary distribution. Out of the computed eigenvector containing the relative concentrations one can compute the cummulative concentrations of so called error classes. An error class rk,; is defined via some fixed sequence i and a fixed Hamming distance k

rk,i := {j I 0 < j < 2v A dH(Xi, Xj) = k} (2)

In our model we call X0 the master sequence because at time t = 0 this is the only species which exists. Since we are especially interested in the error classes with respect to the master sequence we call them rk := rk,0. By our binary interpretation we can easily determine that rk consists of k elements. A change of variables from treating single molecules to variables for the classes rk is also the idea behind the computations for large v in [3, 4] since this transforms the 2v x 2v problem to a v x v problem. But as already mentioned this approach is only valid for very special structured fitness landscapes.

By computing the cumulative concentrations |rk| = £ jeTk Xj of the error classes rk in the stationary distribution for different p one can create graphs as the ones shown in Figure 1 where we vary values of p on the x-axis and show the relative concentrations |rk| on the y-axis. This application of our computations is used to get a better understanding on how the chosen fitness landscape influences the evolution. Such computations are even more interesting on the granularity of single molecules but up to now they are very rare in the literature due to the limited chain lengths people are able to deal with. Depending on the concrete fitness values the error threshold phenomenon may occur or not. When this phenomenon occurs there is an ordered stationary distribution up to a critical value pmax for p. For error rates above pmax the structure of the population changes immediately into the uniform distribution which represents an entire random replication. Typical values for pmax on certain fitness landscapes are around 0.01 - 0.1 [5, 7] depending on the concrete fitness values and the chain length. Those small values for pmax are quite surprising since we get the random replication as exact solution of the ODE system only for p = 0.5 [7].

This sudden change from an ordered distribution to random replication may also serve as building block for new antiviral strategies [9] because the error rates of RNA viruses are usually close to this critical value [10] and an increase of the error rates is possible by the use of pharmaceutical drugs.

In the style of the definition of the error classes we will us Qrk for the v + 1 different values of Q, which means that Qrk = pk • (1 - p)v-k for 0 < k < v. In this paper (in contrast to [3,4]) the only assumption on F is that it is a diagonal matrix. Later on we will use randomly generated landscapes when presenting our results to highlight this assertion. We should also point out the case where all values in F are equal. In this case the problem reduces to the computation of the dominating eigenvector of a bistochastic matrix2 which is trivial and leads to an eigenvector where all entries are equal. From the practical point of view this is not at all surprising since for equally fit sequences we clearly expect the uniform distribution as result.

1For example using the Gray code as permutation would deliver a matrix Q where the first diagonal above and below the main diagonal are constant. This comes from the basic definition of the Gray code which states that dH(X;, Xi+1) = 1 for all i.

2 Q is symmetric and we have by the binomial theorem for all 0 < i < N that 2N=0 Qi,j = 2k=0 (k)' p • (1 - p)v-k = (p + 1 - p) = 1

17 9 11

' r 16 10

0.04 Error rate p

0.04 Error rate p

Figure 1: On the left we see an example for the visualization of the error threshold phenomenon. We see an ordered stationary distribution up to Pmax ~ 0.035 and the sudden change to the uniform distribution for p > pmax. Error classes with the same number of elements (Tk, rv-k) have the same color and therefore there curves meet when the uniform distribution is reached above the threshold. We used v = 20, F0,0 = 2 and F= 1 for all 1 < i < N. This simple fitness landscape is known as the single peak fitness landscape in the literature. On the right we used again v = 20 and the so called linear landscape defined as F= f0 - (f0 - fv) ■ dH(i, 0)/v for all 0 < i < N with f0 = 2 and fv = 1. For this landscape we observe a smooth transition into the uniform distribution and no error threshold.

r r 0 20

2.2. Solution Strategies

Before we consider concrete methods for finding the desired eigenvector we study the structure of the matrices occurring in the model. Remember that Q, F and W have dimension 2v x 2v since this is the number of different RNA molecules with length v over a binary alphabet. The exponential growth of the problem size N with the chain length v is the central challenge in this context. Therefore we are looking for approaches which deliver speedups by orders of magnitude because, as already stated, our overall goal is to raise the number of tractable chain lengths. The central idea is to exploit the special structure of the matrices Q and F.

As already mentioned the whole matrix Q consists of just v + 1 different values which is a first hint that we may find an improved matrix product which explicitly uses this property resulting in a drastic reduction of the space consumption. Based on this observation, we will consider the well known power iteration for the task of finding the eigenvector corresponding to the dominating eigenvalue of W, because we can use our understanding of the problem structure best by using the power iteration. Moreover we expect a more accurate result compared to other existing methods for approximating dominating eigenvectors [11, 12]. We can also expect by the special structure of Q and F that we will be able to find highly efficient algorithms for realizing exactly the required matrix vector product. As a measure for the accuracy of the approximated dominating eigenvector X and as termination condition for the power iteration we will use the residual R(1, X) = \\WX - 1x||2 where 1 is the approximation of the corresponding eigenvalue calculated via the Rayleigh quotient.

In the following, we first study the behavior of some basic strategies like sparsification based on thresholding and afterwards we present an efficient method for implicitly computing matrix vector products with the matrix W.

3. Algorithmic Strategies

With increasing chain length v we quickly run out of memory due to the exponentially growing dimension of W. In the following we develop several strategies for coping with this problem.

3.1. Sparsification

A first approach is based on using thresholding which sparsifies the problem. Since Q is defined via powers of p and the interesting values of p are usually not significantly larger than 10-2 [3] we immediately see that there are many elements below machine precision which we can eliminate already for small v.

As already mentioned, we may even eliminate further elements, since less than full precision in the desired eigenvector is usually sufficient. When thinking of a bound for the elements we want to consider we should have in mind the uniform distribution as possible solution where we have xi = 2-v for all i. Therefore we should not take a limit above this value and try to take a significant smaller value as long as we do not run out of time. Actually, we do not

Figure 2: On the left, we see a graphical representation of the matrix Q for v = 7 and p = 0.001 where the elements are colored according to their order of magnitude. The color scheme ranges from 100 to 10-16. On the right we show the percentage of non-zero elements (x-axis) we have to consider for different chain lengths (y-axis) and Hamming distance limits. This graph shows that a decrease of d^31 has a stronger influence on the number of non-zero elements as an increase of the chain length. This is an important property for us which will help to treat large chain lengths. The experimental results in Table 1 also show this property in terms of concrete run times.

use a concrete numerical bound to decide which elements we use and which we remove from W. Instead, we use a bound dm* for the maximal Hamming distance between elements to be considered. This delivers also a natural stopping criterion for the power iteration, namely when we reach R(l, X) ~ Qrdmax+1. Theory tells us just the accuracy

of the approximated eigenvalue which is related to R(1, X). How the numerical error translates to the eigenvector is discussed in Section 3.4.

This restriction can be easily applied to the matrix Q since the values of Q are defined via the Hamming distance between the sequences corresponding to the row and column index. By multiplying this sparsified version of Q with F we get again a sparse version of W because F is diagonal.

In the model, this approach corresponds to restricting the possible mutants of an RNA molecule to near relatives. A positive side effect of this approach is that the Perron-Frobenius theorem stays valid since it holds for primitive matrices [8]. The sparsified W is indeed primitive even for dmax = 1 because in this minimal case the non-zero values of Q correspond to the non-zero values of the adjacency matrix of the v-dimensional hypercube. Therefore, there exists a mutation path between any two sequences Xi and Xj which guarantees the condition for primitivity, namely that there exists a value of m such that Wm contains only positive elements. The border case d^ = 0 is useless since in this case no mutations are possible and therefore we have to compute the dominating eigenvector of a diagonal matrix which is trivial and gives us the expected result that only the master sequence (more precisely, the sequence with the dominating fitness value) appears in the stationary distribution.

The fact that the entries of F are not considered in the thresholding is not a problem since they usually do not vary by the same magnitude as it is the case for the entries of Q. Thus, our simple strategy of restricting the maximal Hamming distance considered will deliver almost the same results as specifying some fixed numerical lower bound on the entries of W.

Obviously, the thresholding technique implies a certain loss of accuracy in the eigenvector computed. We know that the we can expect an accuracy of the order of the largest deleted entry of W [13]. This exactly explains why we choose R(l, X) ~ Qrdmax+1 as stopping criterion. As already told this delivers just the accuracy of the eigenvalue but as we discuss in Section 3.4 we also expect a similar behavior for the entries in the eigenvector.

3.2. A Memory-Efficient Hierarchical Algorithm for Generating W

In the sparsified version of W the percentage of non-zero values after restricting the maximal Hamming distance easily goes down to 10-2 and below (especially for small p) as illustrated in Figure 2. Thus, in order to achieve reductions in execution time it is extremely important not to explicitly compute every single entry of W before deciding whether it can be eliminated or not.

We will split up generating W in two steps. First, we use a smart method for generating Q and afterwards we perform the multiplication with the values of F as needed (since F is diagonal, this can be done separately for each element). For generating Q, we use an algorithm which visits the elements of Q according to their Hamming distance.

Note that for 0 < p < 0.5 Equation (1) shows that the elements of Q decrease with increasing Hamming distance. Consequently, the algorithm only computes the elements which are needed.

First remember that the dimension of the matrix Q is a power of two which implies that we can evenly divide Q into four blocks

Q J Q00 Q01 ), Q € R2"x2\ Qoo, Q01, Q10, Q11 € R2^-1. \ Q10 Q11 /

The dimension of each of these blocks is again a power of two and therefore we can split them up again and again until we reach 1 x 1 blocks. The second important observation for our efficient algorithm is that the matrix Q has a blockwise symmetry for each power of two as block size. This is an immediate consequence of our ordering of the columns and of the interpretation of indices as binary sequences. Due to this blockwise symmetry we have Q11 = Q00 and Q10 = Q01. Putting together these observations leads to the recursive Algorithm 1 which visits the elements of Q during the generation of Q ordered by their size and therefore only touches the elements we really want to consider, thus leading to a very efficient method for generating the matrices Q and W, respectively. By using Algorithm 1 we

reduced the runtime complexity for setting up W from O(22v) to O^T ■ X^o (k)). The right plot in Figure 2 exactly

shows (the inverse of) the speedup using Algorithm 1 compared to the trivial approach. This reduction also implies that now the creation and the the matrix vector product have the same runtime complexity.

Algorithm 1 Sparse creation of Q

1: function CreateQ(Q, du)

2: if Q € R1x1 then

3: Q = pdH ■ (1 - p)v-du

4: else

5: CreateQ(Qoo, du)

6: if du < dmax then

7: CreateQ(Q10, du + 1)

8: end if

9: Q11 ^ Q00

10: Q01 ^ Q10

11: end if

12: end function

3.3. An Efficient Implicit Matrix Vector Product

The principal idea to gain further improvements is to give up saving the matrix W explicitly in memory. Instead, we design an implicit matrix vector product where the elements of W are computed while building the product. We will then discuss the successful implementation of our algorithm with OpenCL on GPUs.

Remember that the size of the elements in Q decreases with the Hamming distance corresponding to the sequence for the row and the column, respectively. Also remember that in Section 3.1 we did not specify a specific numerical bound for the elements but instead we used a bound dumax for the maximal Hamming distance between elements we want to take into account. The only adaption we have to make in the standard matrix vector product is the order how the elements in a row are traversed during the multiplication. Usually we run through a row from left to right. Instead, we will run through the elements sorted by their Hamming distance with respect to the row index. In the notation of Section 2 this means that we run through the different error classes (see Equation (2)) of the row index. More specifically, for computing y = W ■ x we get

yi = 2j Qi,j ■ Fj,j ■ xj = Tj Tj Qi'j ■

2>k ■ X fj,J

\JtTkj

j=0 k=0 j€rw

We get now again a "sparse" matrix vector product by replacing v as upper bound for the sum by some limit dmax for the Hamming distance we want to consider. For sure the sorting according du has to happen implicitly during the multiplication since otherwise we would get a significant increase of the time complexity.

So the first problem which arises is to compute the sequence of indices sorted according dH to a fixed element Xi. For simplicity we will first use the element X0 = (0,..., 0). Based on this element computing the Hamming distance just means counting bits. We also know in advance the number of elements with a certain Hamming distance so we can use a bucket sort like procedure to create the sorted sequence we are interested in in linear time without any additional memory needed.

Up to now we just have a properly sorted sequence for one base element (0,..., 0) but we need such a sequence for every possible base element. Computing these sequences separately is not possible when we want to get an efficient method. Instead of that we will use the single sequence with base (0,..., 0) and transform this sequence into the one we actually need during the multiplication. For this purpose let us remember the truth table of the binary XOR ©:

© 0 1

An informal interpretation of A © B would be that the bit A gets flipped if and only if the bit B is true. As we know the extension of © : {0,1} i—> {0,1} for bit vectors to a bitwise XOR © : {0,1}v — {0,1}v is one of the commonly available operations for basic data types like integers on todays digital computers. By our informal interpretation of © we can conclude that for two binary sequences Xi and Xj it holds that

dH (Xi, Xi © Xj) = dH (Xj, Xq)

And now we have all what we need for an efficient (implicit) matrix vector product for W. We can compute the sorted sequence with base element (0,..., 0) efficiently once as a preprocessing step and the © operator lets us transform this sequence to the sorted sequence for an arbitrary base element during the multiplication. Using Equation (3) we get

yi = Qrk • Yj Fj,j ' Xj =X Qrk ' 2 Fi©j

JzTkj k=0 vjerk

i©j • xi©j

as new formula for computing the elements of the result vector y. Putting everything together leads to Algorithm 2 as an efficient procedure for multiplying any vector with W using a specified accuracy dHmax. Obviously Algorithm 2 is

Algorithm 2 Sparse matrix vector multiplication for W • v = Q • F • v

1: for all i €{0,...,N} do

2: vout [i] ^ 0 > Vin is the input vector, vout the output vector

3: j ^ 0

4: for k ^ 0 to d^ do

5: sum ^ 0

6: for l ^ 0 to (k) - 1 do

7: sum ^ sum + F[i © S [j]] • vin[i © S [j]] > S contains sorted sequences according dH to (0,..., 0)

8: j ^ j + 1

9: end for

10: voUt [i] ^ voUt [i] + Qr[k] • sum

11: end for

12: end for

just an algorithmic encoding of Equation (4). The occurring fixed values S [k], (k) and Qr[k] should be precomputed once before starting the power iteration.

3.4. Relative Error in the Computed Concentrations

In this section we want to briefly address the question how the thresholding translates into an error in the computed eigenvector. By (i, x) and (A, X) we mean the exact and approximated dominating eigenpair of W. First one can easily

verify that the largest element we delete from W when using the threshold dm3* is bounded by e := pd!?ax+1 ■ Fmax where Fmax = max; Fi i since the largest deleted element is Qr™ , ■ Fmax. By estimating the error in our single concentrations

we run into the common problem that there are quite a lot of bounds for eigenvalue approximations but there are only a few results on the estimation of errors in eigenvectors. What we observed in practice for small examples was that ||x - X||2 ~ R(1, X) which is a very tight and good bound which we can not yet confirm by some formal arguments. Instead of that we want to present some formal bounds for the eigenvalue and the condition number which are not at all tight but they show how the parameter v,dmax, p and the landscape F influence the error. And this precise formal connections correspond to observations we made for small examples.

Following [13] we get as upper bound for the eigenvalue \1 -11 < t = 2v ■ e. To see more clearly how the parameter effect the error we may write t = 2v+lo®2(p) (di?ai+1) ■ Fmax. Now we see that the influence of p and dmax is much bigger than the influence of v and F since a change of p is multiplied by dmax + 1 and vice versa.

To get a feeling for the concrete value of t we use the example v = 20, p = 0.01, Fmax = 10, d^ = 5 from Section 4 and observe a value of t ~ 10-5 which should be a satisfying bound. Since one can easily show that the row and column sums of Q-1 are also one we can estimate the condition by cond1 (W) = ||W||1 ■ ||W-1||1 < Fmax ■ F-1n. By looking at the common landscapes in the literature we can expect that neither Fmax nor F-^ will be big. So in principal we can expect that our problem is not at all ill conditioned.

4. Efficient GPU Implementation

4.1. An OpenCL Implementation

Up to now using Algorithm 2 provokes a great reduction of the space complexity, but also a slight increase of the time complexity due to the transformations we apply during the multiplication. Observe that the iterations of the outer i-loop are entirely independent, like in the case of the standard matrix vector product. Thus, the only natural limit of the number of concurrent processes is the matrix dimension. Algorithm 2 perfectly fits to the GPU philosophy having low space and high (parallel) computation requirements since we just need to store the four vectors.

The pseudo code shown in Algorithm 2 just shows a single threaded implementation of our efficient matrix vector product. Nevertheless, the algorithm can be adapted straightforwardly to an OpenCL kernel function. This is done by removing the outer loop and instead of that the current value of i is computed out of the id3 of the calling thread. Since the computations of the single rows are independent we can invoke our kernel with one thread for each row. For the number of local threads we also use the maximum number supported by the hardware the kernel is running on.

Like the algorithm itself also this implementation is just a basic one which should show the principal properties of the problem and our solutions. Several other refinements can be applied to gain further improvements. Especially the computations required between two successive iterations are in our implementation only parallelized in a rather simple way. But our implementation still reaches a level of efficiency such that the matrix vector product is the dominating part of the overall runtime, which suffices for the moment to illustrate the benefits of our approach.

4.2. Experimental Results

As mentioned earlier, we do not want to make assumptions on the concrete values of F, and therefore we use the following definitions to create landscapes:

F0,0 = c, F;,; = a ■ (nrnd(i) + 0.5) with c > 0, 0 < a < c/2 (5)

where nmd(i) is as in [5] a call to some uniform random number generator on the interval [0,1].

The reason for this definition is that we want to ensure that the fitness of the master sequence is somewhat separated from the second largest fitness value. This is due to the fact that the separation of the dominating fitness values correspond to the separation of the dominating eigenvalues of W which is the important factor for the convergence speed of the power iteration. The reason for this is that F scales the well understandable spectrum of Q by the fitness values. That the master sequence has a somewhat dominating fitness value is a common feature of the landscapes discussed in the literature. Therefore, our restriction is not at all an artificial condition. For the power iteration we always use (1,...,1)T (the dominating eigenvector of Q) as start vector. In our experiments we compare three methods for numerically computing the dominating eigenvector of W:

3In OpenCL threads are numbered from 0,1,...,T - 1 when T is the number of threads which should be started.

1. An OpenCL implementation of Algorithm 2 running on a Nvidia Tesla C2050 using Nvidia Cuda

2. An OpenCL implementation of Algorithm 2 running on a Intel I5 750 @ 2.67Ghz core using Ati Stream

3. A standard single threaded CPU implementation of Algorithm 2 running on a Intel I5 750 @ 2.67Ghz.

We miss here a comparison with well known eigenvalue solvers for the simple reason that the matrices are to huge for the reported chain lengths to put them into those methods. In Figure 3 we present the running times4 of the different

Figure 3: The execution times for finding the dominating eigenvector up to the expected accuracy (» ftr™* 1)) are shown. On the x-axis we

increase the input chain length v from v = 10 to v = 24 and on the (logarithmic) y-axis we show the runtime needed in seconds for the different implementations. In the upper row we used c = 10, <x = 1 and d^31 = 3 (left), = 5 (right). In the lower row we used c = 3, <x = 1 and again d™ = 3 (left), d^31 = 5 (right). We can observe that the decrease in the domination of the fitness of the master sequence from c = 10 to c = 3 results, as expected, in an increase of the runtime due to the need of more iterations. Obviously the increase of d^ also causes an increase of the runtime. The change of the used implementation does of course not change the principal problem complexity but nevertheless one should notice the speedup factor of more than 100 by using the GPU instead of the standard implementation. So the important observation should be that our algorithm perfectly scales with the underlying hardware.

implementations for two example landscapes and increasing chain lengths. Additionally we show in Table 1 some example test cases together with their resulting space complexity (assuming that all floating point vectors and matrices are of type double precision requiring 8 bytes per element) and time complexity. The space and time requirements in Table 1 clearly show the importance and the benefit of our implicit matrix vector product which gives us the ability to deal with unexpected large (compared with the expectations given in [5]) chain lengths up to5 v « 25, depending on the values of p and d^3*. Obviously our algorithm also offers the ability of a distributed implementation which would deliver a further significant raise of the chain length on large scale hardware.

In Figure 3 we see that our algorithm also already delivers good results from the point of runtime and we see a great benefit compared to standard methods.

5. Conclusion

In this paper we studied a very large scale eigenvalue problem occurring in a model for the evolution of RNA sequences. We got a non-negative matrix W with dimension 2v x 2v as an input and our aim was to compute the

4All times have to be understood as overall execution times, so the GPU times also include the data transfer to and from the device.

5On a modern GPU device like the Nvidia Tesla C2050 we used here.

v p jmax UH Full W Sparse W Algorithm 2 Time GPU [s] Time CPU [s]

19 0.01 5 2Tb 65.1 Gb 14 Mb 6.4 260

20 0.01 5 8Tb 169.5 Gb 28 Mb 17.5 746

21 0.01 4 32 Tb 117.9 Gb 56 Mb 13.7 506

22 0.01 4 128 Tb 284.7 Gb 112 Mb 35.9 1438

23 0.01 3 512 Tb 128.0 Gb 224 Mb 24.2 726

24 0.01 3 2048 Tb 290.6 Gb 448 Mb 71.1 1844

Table 1: Model parameters and resulting space and time requirements for some concrete test cases. We have to specify the chain length v, the error rate p and the maximal Hamming distance between to elements d^* to fully define a test case. Additionally the space requirements for handling the full matrix W, for handling the sparse matrix W (as in Section 3) and for using Algorithm 2 are reported. For the sparse matrix we only report the space required by the non-zero elements without the overhead of a specific sparse storage format. In the last two columns we show the runtimes for computing the dominating eigenvector with our OpenCL implementations on a Nvidia Tesla C2050 and on a Intel I5 750 @ 2.67Ghz for the landscape defined by c = 10, a = 1. It is clearly visible that Algorithm 2 has the typical properties of a sparse matrix vector product since the execution times are proportional to the size (number of non-zero elements) of the sparsified W. The stopping criterion in all reported cases was that R(1, x) reaches Qdmx+1.

dominating eigenvector of W. Since the dimension of W grows exponentially with the chain length v we had to find methods which perform by magnitudes better than the standard methods such that we are able see improvements also in terms of (significantly) larger tractable chain lengths. Using the fact that often the results are not needed in full precision, we started with some basic thresholding and sparsification techniques together with the power iteration as numerical method. Our main result was the development of an implicit matrix vector product which exploits the special problem structure and thus reduced the memory requirements drastically. Nevertheless, we did not lose the parallelization potential as it is available in the standard matrix vector product. We also showed the great benefit one receives when mapping our special purpose matrix vector product onto OpenCL and GPUs as computing devices.

As already stated, the presented methods have to be understood as a first approach to solving the presented problem efficiently. There is a wide range of possible future improvements. We will consider ideas like an adaptive power iteration, smart choices for the start vector depending on the input parameters, computing the right out of the (somewhat easier computable) left eigenvector, methods for improving the convergence speed, etc.

Acknowledgments. This work was partly supported by the Austrian Science Fund (FWF) under contract S10608 (NFN SISE). We would like to thank Peter Schuster for many fruitful discussions.

References

[1] M. Eigen, Selforganization of matter and the evolution of biological macromolecules, Naturwissenschaften 58 (1971) 465-523.

[2] M. Eigen, P. Schuster, A principle of natural self-organization, Naturwissenschaften 64 (1977) 541-565.

[3] J. Swetina, P. Schuster, Self-replication with errors : A model for polvnucleotide replication, Biophysical Chemistry 16 (4) (1982) 329 - 345.

[4] M. Nowak, P. Schuster, Error thresholds of replication in finite populations mutation frequencies and the onset of muller's ratchet, Journal of Theoretical Biology 137 (4) (1989) 375 - 395.

[5] P. Schuster, Mathematical modeling of evolution. solved and open problems, Theory in Biosciences 130 (2011) 71-89.

[6] P. Schuster, Prediction of RNA secondary structures: from theory to models and real molecules, Reports on Progress in Physics 69 (2006) 1419-1477.

[7] P. Schuster, The mathematics of Darwinian systems, 2008, appendix for the book: Manfred Eigen, 'From Strange Simplicity to Complex Familiarity, Vol.I'.

[8] E. Seneta, Non-negative Matrices and Markov Chains (Springer Series in Statistics), Springer, 2006.

[9] M. Eigen, Error catastrophe and antiviral strategy, Proceedings of the National Academy of Sciences of the United States of America 99 (21) (2002) 13374-13376.

[10] J. Drake, Rates of spontaneous mutation among rna viruses., Proc Natl Acad Sci U S A 90 (9) (1993) 4171-5.

[11] E. Drinea, P. Drineas, P. Huggins, A randomized singular value decomposition algorithm for image processing applications, in: Image Processing, Panhellenic Conference on Informatics, PCI 2001, 2001, p. 279.

[12] M. Mascagni, A. Karaivanova, A parallel quasi-monte carlo method for computing extremal eigenvalues, in: Lecture Notes in Statistics, Springer, 2000.

[13] Y. Bai, W. N. Gansterer, R. C. Ward, Block tridiagonalization of "effectively" sparse symmetric matrices, ACM Trans. Math. Softw. 30 (2004) 326-352.