
Procedia Computer Science, Volume 80, 2016, Pages 2438-2442

ICCS 2016. The International Conference on Computational Science

© The Authors. Published by Elsevier B.V. (CC BY-NC-ND). doi:10.1016/j.procs.2016.05.544. Selection and peer-review under responsibility of the Scientific Programme Committee of ICCS 2016.

Accelerating BWA Aligner Using Multistage Data Parallelization on Multicore and Manycore Architectures

Shaolong Chen and Miquel A. Senar

Department of Computer Architecture & Operating Systems, Universitat Autònoma de Barcelona, Spain.

shaolong.chen@caos.uab.es, miquelangel.senar@uab.es

Abstract

Nowadays, rapid progress in next generation sequencing (NGS) technologies has drastically decreased the cost and time required to obtain genome sequences. A series of powerful computing accelerators, such as GPUs and the Xeon Phi MIC, are becoming a common platform to reduce the computational cost of the most demanding processes when genomic data is analyzed. GPUs have received more attention in the literature so far. However, the Xeon Phi constitutes a very attractive approach to improve performance because applications don't need to be rewritten in a different programming language specifically oriented to it. Sequence alignment is a fundamental step in any variant analysis study, and there are many tools that cope with this problem. We have selected BWA, one of the most popular sequence aligners, and studied different data management strategies to improve its execution time on hybrid systems made of multicore CPUs and Xeon Phi accelerators. Our main contributions focus on designing new strategies that combine data splitting and index replication in order to achieve a better balance in the use of system memory and reduce latency penalties. Our experimental results show significant speed-up improvements when such strategies are executed on our hybrid platform, taking advantage of the combined computing power of a standard multicore CPU and a Xeon Phi accelerator.

Keywords: Xeon Phi, Sequence alignment, Data parallelization, Multicore processors, Manycore processors

1. Introduction

With NGS technology burgeoning, the cost of sequencing a human genome has decreased from $1 million in 2007 to $1,000 in 2012 [1], which has led to an explosive increase in sequence data from the gigabyte to the terabyte scale. Therefore, new advances in software tools are needed to handle this challenge by efficiently exploiting the technological features present in modern parallel architectures. This paper focuses on sequence alignment, which involves the accurate positioning of genome reads in a reference genome sequence [2]. BWA is one of the most prevalent and widespread sequence alignment tools; in particular, this paper focuses on the BWA-backtrack algorithm, which is designed for Illumina sequence reads of up to 100 bp. We have used and modified a basic implementation of BWA, named bwa-aln-xeon-phi-0.5.10 [3], which optimizes the performance of the BWA 0.5.10 ALN module and supports execution in symmetric mode on Xeon Phi based systems.

The high performance computing community has been paying growing attention to hybrid systems that include manycore accelerators. Intel's Xeon Phi is one of these accelerators; it is based on the x86 architecture and constitutes a convenient platform for standard technologies such as MPI, OpenMP or POSIX threads. A Xeon Phi board has many cores, varying from 57 to 61, and 8 GB or 16 GB of memory, providing up to 1 TFlops in double precision or 2 TFlops in single precision. From a programmer's perspective, the Xeon Phi constitutes an attractive choice because a parallel application that runs on a standard multicore CPU can be ported to a Xeon Phi system with minimal reprogramming effort [4].

Our contributions in this paper focus on the design and evaluation of data management strategies that significantly improve the speed-up of BWA when running on hybrid systems made of a multicore Xeon equipped with one Xeon Phi coprocessor. We have evaluated data splitting and index replication mechanisms, both independently and combined, and our experimental results show that the combined strategy is the best choice, achieving a 4-fold speed-up over the parallel version of the BWA aligner running with 24 threads on the Intel Xeon E5 processor.

2. Data Management Strategies and Execution Methodology

In order to mitigate the memory locality and congestion problems exhibited by BWA [5], we designed a multistage data parallelization strategy that can be divided into two main mechanisms, data splitting and index replication, which can be applied independently or combined. The application is logically organized as a collection of groups of threads that work on independent sets of data and share the reference index. For a total number of available threads Nt and n groups, each group consists of Nt/n threads. In our experiments we used all hardware cores and report results according to the number of thread groups. For a given value of n, groups have a different size depending on the computing resource: Nt is 24 in the Xeon case (12 cores with 2 threads per core) and 240 in the Xeon Phi case (60 cores with 4 threads per core). Accordingly, when two groups are used, each group contains 12 threads on the Xeon processor (24 divided by 2) and 120 threads on the Xeon Phi. Our data management strategies, which have been implemented using a combination of MPI (at the higher level of program decomposition) and OpenMP (at the level of threads that share a common set of data structures), are the following:

- Data Splitting: short-read data is split into subsets based on the number of thread groups (subset 0, subset 1, etc.). Only thread group 0 loads the reference index into memory, which is then shared with the other groups (when Xeon and Xeon Phi are used together, one copy of the index is present in each device). The alignment process runs in every thread group with the shared reference index and the group's own subset of short reads. Subset results (from groups 0, 1, ...) are finally gathered and merged from all thread groups.

- Index Replication: the alignment process in each thread group is carried out with shared short-read data and a private copy of the reference index in memory. Short-read data is shared by all thread groups and processed in chunks of a fixed size following a round-robin pattern, while the reference index is replicated privately for every thread group. When Xeon and Xeon Phi are used together, the same number of index replicas is used in each device. Each thread group then executes an independent alignment process with the shared short reads and its own copy of the reference index. Finally, results are merged as in the previous case.

- Data+Index Parallelization: this strategy combines the previous two. Short-read data is divided into subsets according to the number of thread groups, and the reference index is replicated for all thread groups in memory. The alignment process is executed in every thread group with its own short-read subset and its private copy of the reference index (a sketch of the two-level decomposition follows this list).
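
As an illustration, the following minimal sketch shows how the two-level decomposition described above could be organized: one MPI rank per thread group, with Nt/n OpenMP threads inside each group working on that group's contiguous subset of reads. This is our reconstruction, not the authors' code; the read count and the commented align_reads() call are hypothetical placeholders.

    #include <mpi.h>
    #include <omp.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int group, n_groups;
        MPI_Comm_rank(MPI_COMM_WORLD, &group);    /* thread group id (0..n-1) */
        MPI_Comm_size(MPI_COMM_WORLD, &n_groups); /* n thread groups          */

        /* Nt/n threads per group: Nt = 24 on the Xeon (12 cores x 2 HT),
         * Nt = 240 on the Xeon Phi (60 cores x 4 HT).                        */
        int Nt = omp_get_max_threads();
        omp_set_num_threads(Nt / n_groups);

        /* Data splitting: group g takes one contiguous subset of the reads. */
        uint64_t n_reads = 1000000;               /* illustrative count       */
        uint64_t per_group = (n_reads + n_groups - 1) / n_groups;
        uint64_t first = (uint64_t)group * per_group;
        uint64_t last  = (first + per_group < n_reads) ? first + per_group
                                                       : n_reads;

        #pragma omp parallel
        {
            /* Each OpenMP thread of this group would align reads taken from
             * [first, last) against the shared (or replicated) index, e.g.
             * align_reads(idx, first, last);  -- hypothetical call          */
            if (omp_get_thread_num() == 0)
                printf("group %d: %d threads, reads [%llu, %llu)\n",
                       group, omp_get_num_threads(),
                       (unsigned long long)first, (unsigned long long)last);
        }

        /* Subset results would be gathered and merged here (omitted).       */
        MPI_Finalize();
        return 0;
    }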

In our experiments, we have evaluated two execution modes: native and symmetric. When running in symmetric mode, obtaining a balanced distribution of work between the Xeon processor and the Xeon Phi accelerator is a very important factor. We distributed the reads so that the running times on the Xeon and on the Xeon Phi were very similar, basing this workload optimization on the running times measured in native mode. For instance, if the Intel Xeon Phi coprocessor's performance in native mode is 0.7-fold that of the same problem running on the Intel Xeon processor in native mode, then the workload ratio in symmetric mode is 10 for the Intel Xeon processor to 7 for the Intel Xeon Phi coprocessor.
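
A minimal sketch of this proportional split, assuming the 0.7 relative-performance factor from the example above (the total read count and variable names are illustrative, not taken from the paper):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t total_reads = 170000000ULL; /* illustrative total          */
        double   phi_factor  = 0.7;          /* Phi native perf / Xeon perf */

        /* Split reads proportionally to native throughput (10:7 here), so
         * both devices finish at roughly the same time in symmetric mode.  */
        uint64_t xeon_reads = (uint64_t)(total_reads / (1.0 + phi_factor));
        uint64_t phi_reads  = total_reads - xeon_reads;

        printf("Xeon: %llu reads, Xeon Phi: %llu reads\n",
               (unsigned long long)xeon_reads, (unsigned long long)phi_reads);
        return 0;
    }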

3. Experiments and Performance Evaluation

3.1. Experimental Environment

We used a heterogeneous system equipped with one Intel Xeon Phi 7120 coprocessor and one Intel Xeon E5-2620 v2 CPU running at 2.1 GHz, with 64 GB of memory and hyper-threading enabled. The Xeon Phi 7120 consists of 61 cores and 16 GB of on-card memory. Our data management strategies were implemented on an existing version of BWA (bwa-aln-xeon-phi-0.5.10), which supports execution in native and symmetric modes. We used sample data from the 1000 Genomes Project: the reference genome is hs37d5, about 3 GB in size, and our set of reads is SRR766060, about 17.2 GB in size.

3.2. Performance Evaluation

Performance results of our three strategies are summarized in Figures 1, 2 and 3, which show the speed-up achieved in each case with respect to the execution of the original BWA running on the Xeon platform with 1 thread. Each strategy was evaluated in native mode on both the Xeon and the Xeon Phi, and in symmetric mode on both platforms simultaneously. The number of thread groups ranged, in general, from 1 to 12 in the case of the Xeon. Memory limitations on the Xeon Phi restricted the maximum number of groups that could be used in all strategies involving index replication (three was the maximum).
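
Concretely, the speed-up plotted in all figures is the usual ratio (our notation, not the paper's):

    speed-up(strategy) = T(original BWA, 1 thread, Xeon) / T(strategy)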

Data splitting (Data-s), see Figure 1, achieves its best performance when two and three groups are used in native mode on the Xeon and the Xeon Phi, respectively (17.5-fold and 15-fold speed-up over the baseline). In symmetric mode, the best results are obtained with three groups (31.5-fold speed-up). In general, data splitting performs well when a limited number of groups is used; when more groups are added, application performance starts to degrade. This behavior can be explained because, when data is divided among independent groups, the completion time of each group strongly depends on the particular reads assigned to it. Different groups show different completion times, and load imbalance appears due to the different composition of the read groups. When a small number of groups is used, the load is kept fairly balanced; the more groups are used, the more imbalance is produced, which is consistent with the observations reported in [6].

The index replication strategy (Index-r), see Figure 2, achieves its best results when two groups are used in native mode on both the Xeon and the Xeon Phi. In symmetric mode, the best results are obtained with three groups. The corresponding speed-ups are 15-fold, 12.5-fold and 31.8-fold, respectively. These results confirm that using multiple copies of the reference index attenuates the memory problems exhibited by the original BWA application; in particular, the effects of memory latency and congestion are reduced because accesses to the reference index are distributed among the instances. The maximum number of instances on the Xeon Phi was limited to three due to its 16 GB of on-card memory. Several copies of the reference index are beneficial to the application, but results on the Xeon platform also show that the ideal number of copies is closely related to the number of memory banks of the NUMA architecture (2 in the case of the Xeon). The maximum benefit of index replication is achieved when memory accesses can be equally distributed among all NUMA banks. If more instances are created, they are allocated in the same NUMA banks, and memory contention is produced by different groups of threads accessing the same bank.
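
The paper does not describe how index replicas are mapped to NUMA banks. As a hedged illustration of one way to obtain the even distribution discussed above, each thread group could bind itself to a node and allocate its index copy there with libnuma; the function name and group-to-node assignment below are our assumptions, not the authors' implementation.

    #include <numa.h>    /* libnuma; link with -lnuma */
    #include <stdlib.h>

    /* Allocate one reference-index replica on the NUMA node assigned to a
     * given thread group, so replicas end up on distinct memory banks.    */
    static void *alloc_index_copy(size_t bytes, int group)
    {
        if (numa_available() < 0)        /* no NUMA support: plain heap    */
            return malloc(bytes);

        int node = group % numa_num_configured_nodes();
        numa_run_on_node(node);          /* keep this group's threads local */
        return numa_alloc_onnode(bytes, node); /* free with numa_free()    */
    }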

[Bar charts omitted: Figures 1-3 plot speed-up versus the number of thread groups, with series for symmetric mode, Xeon and Xeon Phi; Figure 4 plots the best speed-up achieved by each strategy.]

Figure 1: Data splitting strategy

Figure 2: Index replication strategy

Figure 3: Data+index parallelization strategy

Figure 4: Acceleration with respect to the Xeon running bwa-aln with 24 threads

The best results were obtained, in general, when both mechanisms were combined in the data+index parallelization strategy (Data-Index-p), as shown in Figure 3. A speed-up of 44-fold is achieved in symmetric mode (with three groups), and speed-ups of 12-fold and 14-fold are achieved in native Xeon mode and native Xeon Phi mode (with three and two thread groups, respectively). This strategy improved performance on the Xeon Phi, both in symmetric and in native mode, because memory congestion was reduced by index replication while the total load was kept balanced. Since the number of groups was small (as observed in the data splitting case), the same benefits in memory usage were not obtained on the Xeon platform, though; further analysis will be needed to understand the reasons.

Finally, Figure 4 summarizes the best performance achieved by all strategies proposed in this work compared to the performance of the original version of BWA, which was also executed in native and symmetric mode for the sake of completeness. For each strategy, Figure 4 shows the best case observed in all the experiments carried out with that strategy. The execution of the original BWA on the Xeon platform with 24 threads is shown as bwa-aln in Figure 4, with a speed-up of 1. A slight loss of performance was observed for the original application when it was executed on the Xeon Phi in native mode (13% compared to the standard Xeon), while a speed-up of 2.7 was achieved when it was executed in symmetric mode. The remaining cases exhibited different rates of improvement, which we summarize by execution mode. On the one hand, data splitting is the best choice when the systems are used in native mode: speed-ups of 1.6-fold and 1.4-fold were obtained on the Xeon and the Xeon Phi, respectively, compared to the original application running on the Xeon processor. On the other hand, the data+index parallelization strategy is the best choice in symmetric mode, where we obtained the largest speed-up (4-fold) compared to the base application running with 24 threads on our Xeon system. This strategy shows interesting characteristics that could be used to design parallel applications that make maximal use of all available hardware resources on hybrid systems.

4. Conclusion

In this paper, different data management strategies have been designed and analyzed to improve the performance of BWA on hybrid systems that consist of CPUs and accelerators. We have proposed simple strategies that mitigate the memory congestion problems exhibited by BWA. As shown in our study, achieving good performance from a Xeon Phi is a challenging task: when used in native mode, performance on the Xeon Phi was lower than on a standard Xeon. However, this specialized hardware achieved reasonable results when used in symmetric mode. In particular, a strategy that replicates the reference index among independent groups of threads and also splits the read data between all groups proved to be the best one when the Xeon and the Xeon Phi collaborate in the execution of the application. The strategy requires some extra memory to allocate all data structures, but the performance benefits justify this increase in memory footprint: the application ran 4 times faster on the hybrid system than on the Xeon alone. Although our study has focused on BWA, as a representative example of BWT-based methods, our results could be easily extrapolated to other programs based on the BWT. In general, this strategy may be a valuable approach to reduce the penalty of memory-bound applications that incur high penalties due to repeated accesses to data structures shared by all the threads of a parallel application.

5. Acknowledgment

This research has been supported by Project TIN2014-53234-C2-1-R and partially supported by the China Scholarship Council (CSC) under reference number 201406890007.

6. References

[1] O'Driscoll A., et al. 'Big data', Hadoop and cloud computing in genomics. Journal of Biomedical Informatics, 2013, 46(5): 774-781.

[2] Cui Y., Liao X., Zhu X., et al. mBWA: A massively parallel sequence reads aligner. Proc. of the 8th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2014). Springer International Publishing, 2014: 113-120.

[3] bwa-aln-xeon-phi-0.5.10. https://github.com/intel-mic/bwa-aln-xeon-phi-0.5.10.

[4] Stein L. Genome annotation: from sequence to biology. Nature Reviews Genetics, 2001, 2(7): 493-503.

[5] Lenis J. and Senar M. A. On the performance of BWA on NUMA architectures. Proc. of IEEE Int. Conf. on Trustcom/BigDataSE/ISPA, 2015, Vol. 3: 236-241.

[6] Herzeel C., Ashby T. J., Costanza P., et al. Resolving load balancing issues in BWA on NUMA multicore architectures. Parallel Processing and Applied Mathematics. Springer Berlin Heidelberg, 2014: 227-236.