Available online at www.sciencedirect.com

SciVerse ScienceDirect

Procedia Computer Science 18 (2013) 319 - 328

International Conference on Computational Science, ICCS 2013

Optimization techniques for 3D-FWT on systems with manycore GPUs and multicore CPUs

Gregorio Bernabe a,*, Javier Cuenca a, Domingo Gimenez b

a Computer Engineering Department, University of Murcia (Spain)
b Computer Science and Systems Department, University of Murcia (Spain)

Abstract

Programming manycore GPUs or multicore CPUs for high performance requires a careful balance of several hardware-specific factors, which is typically achieved by expert users through trial and error. To reduce the amount of hand-made optimization required to achieve optimal performance, general guidelines can be followed or different metrics can be considered to predict performance, but ultimately a trial-and-error process is still prevalent. In this paper, we present an optimization method to run the 3D Fast Wavelet Transform (3D-FWT) on hybrid systems. The optimization engine detects the different platforms found in a system and executes the appropriate kernel, implemented in CUDA and OpenCL for GPUs and programmed with pthreads for a CPU. Moreover, the proposed method automatically selects parameters such as the block size, the work-group size or the number of threads to reduce the execution time, obtaining the optimal performance in many cases. Finally, the optimization engine proportionally sends different parts of a video sequence to run concurrently on all platforms of the system. Speedups with respect to a normal user, who sends all frames to a GPU with a version of the 3D-FWT implemented in CUDA or OpenCL, reach average gains of up to 7.93.

Keywords: Optimization techniques, 3D-FWT, manycore GPUs, multicore CPUs, CUDA, OpenCL, pthreads.

1. Introduction

Nowadays multicore architectures are omnipresent and can be found in all market segments. In particular, they constitute the CPU of many embedded systems (for example, video game consoles, network processors or GPUs), personal computers (for example, the latest developments from Intel and AMD), servers (the IBM Power6 or Sun UltraSPARC T2 among others) and even supercomputers (for example, the CPU chips used as building blocks in the IBM Blue-Gene/L and Blue-Gene/P systems). This market trend towards CMP (or chip-multiprocessor) architectures has given rise to platforms with a great potential for scientific computing such as the GPGPUs [1][2].

GPUs are massively parallel devices with parallel compute capacity exceeding that of other single-chip devices, and they are still the best devices for high-performance graphics [3]. There are currently many APIs for programming GPUs, all with their respective advantages and disadvantages, but getting optimal performance from the GPU is still a challenging task that requires repetitive manual tuning.

* Corresponding author. Tel.: +34-868-887-570 ; fax: +34-868-884-151 .

Email addresses: gbernabe@ditec.um.es (Gregorio Bernabe), javiercm@ditec.um.es (Javier Cuenca), domingo@um.es (Domingo Gimenez)

1877-0509 © 2013 The Authors. Published by Elsevier B.V.

Selection and peer review under responsibility of the organizers of the 2013 International Conference on Computational Science doi:10.1016/j.procs.2013.05.195

Nvidia developed CUDA [4] and ATI launched Stream Computing [5] in order to exploit the GPU computational power beyond a traditional graphic processor and simplify the programming through a simple block-based API. Therefore, CUDA and ATI Stream programs can run only on Nvidia's GPUs and AMD's GPUs, respectively, and they are not compatible.

More recently, the Open Computing Language (OpenCL) [6] has emerged as a framework that attempts to unify those two models. It provides parallel computing using task-based and data-based parallelism, and it is an open standard, adopted so far by Intel, AMD, Nvidia and ARM, among others. It allows several architectures from these manufacturers to be programmed, and hence it is not specialized for any particular compute device.

In previous works [7][8], we contributed a CUDA implementation of the 2D-FWT running more than 20 times faster than a sequential C version on a CPU, and more than twice as fast as optimized OpenMP and pthreads versions implemented on a quad-core CPU. We extended the analysis to the 3D-FWT scenario [9][10], where speed-up factors were improved using a new set of optimization techniques. We presented different alternatives and programming techniques for an efficient parallelization of the 3D Fast Wavelet Transform on multicore CPUs and manycore GPUs. OpenMP and pthreads were used on the CPU to build different implementations in order to maximize parallelism, whereas CUDA and OpenCL were selected for data-parallelism exploitation on the GPU with explicit memory handling. Speedups of the CUDA version on the Fermi architecture were the highest obtained, improving the execution times on the CPU by factors from 5.3x to 7.4x for different image sizes, and up to 81 times faster when communications are neglected. Meanwhile, OpenCL obtains solid gains in the range from 2x for small frame sizes to 3x for larger ones.

The work necessary to optimize a real code can take several weeks or months, and it requires deep knowledge of several disciplines, such as computer architecture, programming and debugging tools, numerical analysis and mathematical software. Furthermore, the optimization performed for a specific platform is not suitable, in principle, for other platforms. Such a diverse working environment has promoted important changes in the traditional way of optimizing software for scientific calculations, with the goal of keeping pace with both user necessities and new hardware developments. In this context, different automatic optimization techniques have emerged as valuable tools that provide scientific software with the capacity to adapt to its environment. In this way, optimized code for each specific platform can be generated automatically by means of software models and/or extensive experimentation. Some projects where these techniques have been applied are: ATLAS [11], with optimized computational kernels for dense linear algebra; SPARSITY [12], with kernels for sparse linear algebra; FFTW [13], with an optimized implementation of the discrete Fourier transform; and ABCLib-DRSSED [14], with routines for obtaining eigenvalues. All these works are oriented to standard processors, without graphics accelerators. However, recent activities of chip manufacturers make it evident that future designs of microprocessors and large HPC systems will be hybrid/heterogeneous in nature, relying on the integration (in varying proportions) of multicore CPUs and special-purpose hardware and accelerators, especially GPUs. The relative balance between these components in future designs is not clear, and will likely vary over time. In this way, an early project on tuning linear algebra routines for Nvidia GPUs was carried out by Volkov and Demmel [15], and similar efforts were followed by Kurzak, Tomov and Dongarra [16] and in MAGMA [17].
In [18], a variety of techniques toward autotuning algorithms on the GPU, independent of the hardware architecture, are introduced by Davidson and Owens.

In this paper, we present a method to automatically compute the parameters of the 3D-FWT running on systems with multicore CPUs and manycore GPUs. The best implementation on each platform of the system is selected in the following way: the CUDA implementation for an Nvidia-based GPU, the kernel programmed in OpenCL for an ATI GPU, and an implementation with pthreads on a multicore CPU. Once the suitable kernel for each subsystem has been selected, the method automatically selects the parameters for running each kernel. Finally, the optimization engine proportionally sends different parts of a video sequence to be processed concurrently. The proportions among the different platforms depend on the computing performance of the 3D-FWT kernel on each of them.

The rest of this paper is organized as follows. Section 2 summarizes the background on wavelets. Section 3 outlines the implementation of the 3D-FWT in CUDA and OpenCL. In section 4, we propose optimization techniques for systems with an Nvidia or an ATI GPU, and present experiments to select the suitable parameters of the 3D-FWT kernel for these GPUs. The proposed method is extended to multi-GPU systems, and the optimization engine is completed for hybrid systems with manycore GPUs and multicore CPUs, in section 5.

Finally, section 6 summarizes the work, concludes the paper and introduces the future work.

2. Wavelet Transform Foundations

The basic idea of the wavelet transform is to represent any arbitrary function f as a weighted sum of functions, referred to as wavelets. Each wavelet is obtained from a mother wavelet function by conveniently scaling and translating it. The result is equivalent to decomposing f into different scale levels (or layers), where each level is then further decomposed with a resolution adapted to that level.

In a multiresolution analysis, there are two functions: the mother wavelet and its associated scaling function. Therefore, the wavelet transform can be implemented by quadrature mirror filters (QMF), G = g(n) and H = h(n), n ∈ Z. H corresponds to a low-pass filter, and G is a high-pass filter. For a more detailed analysis of the relationship between wavelets and QMF see [19].

The filters H and G correspond to one step in the wavelet decomposition. Given a discrete signal, s, with a length of 2^n, at each stage of the wavelet transformation the G and H filters are applied to the signal, and the filter output is downsampled by two, thus generating two bands, G and H. The process is then repeated on the H band to generate the next level of decomposition, and so on. This procedure is referred to as the 1D Fast Wavelet Transform (1D-FWT).
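One decomposition level can be sketched in C as follows. The filter taps below are the standard Daubechies-4 coefficients and the periodic boundary handling is our assumption for the sketch; the paper's filter choice is discussed in the following paragraphs.

```c
#include <math.h>

/* One level of the 1D-FWT using a QMF pair (sketch, not the paper's
   code). Taps are the standard Daubechies-4 coefficients; boundaries
   are handled periodically. Outputs: n/2 low-pass (H) and n/2
   high-pass (G) coefficients. */
static void fwt_step(const double *s, int n, double *H, double *G)
{
    const double h[4] = {
        (1.0 + sqrt(3.0)) / (4.0 * sqrt(2.0)),
        (3.0 + sqrt(3.0)) / (4.0 * sqrt(2.0)),
        (3.0 - sqrt(3.0)) / (4.0 * sqrt(2.0)),
        (1.0 - sqrt(3.0)) / (4.0 * sqrt(2.0))
    };
    /* Quadrature mirror relation: g[k] = (-1)^k * h[3-k] */
    const double g[4] = { h[3], -h[2], h[1], -h[0] };

    for (int i = 0; i < n / 2; i++) {      /* downsample by two */
        double lo = 0.0, hi = 0.0;
        for (int k = 0; k < 4; k++) {
            int idx = (2 * i + k) % n;     /* periodic wrap-around */
            lo += h[k] * s[idx];
            hi += g[k] * s[idx];
        }
        H[i] = lo;
        G[i] = hi;
    }
}
```

For a constant signal, the high-pass band G is zero and the low-pass band H carries all the energy, which is the behaviour the decomposition relies on.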

It is not difficult to generalize the one-dimensional wavelet transform to the multi-dimensional case [19]. The wavelet representation of an image, f (x, y), can be obtained with a pyramid algorithm. It can be achieved by first applying the 1D-FWT to each row of the image and then to each column, that is, the G and H filters are applied to the image in both the horizontal and vertical directions. The process is repeated several times, as in the one-dimensional case. This procedure is referred to as the 2D Fast Wavelet Transform (2D-FWT).

As in 2D, we can generalize the one-dimensional wavelet transform for the three-dimensional case. Instead of one image, there is now a sequence of images. Thus, a new dimension has emerged, the time (t). The 3D-FWT can be computed by successively applying the 1D wavelet transform to the value of the pixels in each dimension.

Based on previous work [20], we consider Daubechies' W4 mother wavelet [21] an appropriate baseline function. This selection determines the memory access pattern for the entire 3D-FWT process. Let us assume an input video sequence consisting of a number of frames (3rd dimension), each composed of a certain number of rows and columns (1st and 2nd dimensions). The 1D-FWT is performed across all frames for each row and column, that is, we apply the 1D-FWT rows x cols times in the third dimension. The first 1D-FWT instance requires four elements to calculate the first output element of the reference video and the detailed video, these elements being the first pixel of each of the first four frames. The second output element of the reference and detailed videos is calculated using the first pixel of the third, fourth, fifth and sixth video frames. We continue this way until the entire reference and detailed videos are calculated, and these data are the input used for the next stage.

The 2D-FWT is performed frames times, i.e., once per frame. This transform is performed by first applying the 1D-FWT on each row (horizontal filtering) of the image, followed by the 1D-FWT on each column (vertical filtering). The fact that vertical filtering computes each column entirely before advancing to the next one forces the cache lines belonging to the first rows to be replaced before the algorithm moves on to the next column. Meerwald et al. [22] propose two techniques to overcome this problem: row extension and aggregation (or tiling).

Other studies [23][24] also report remarkable improvements when applying the tiling technique to the 2D-FWT algorithm. Our experience implementing the sequential 2D-FWT algorithm on a CPU revealed a reduction of almost an order of magnitude in the overall execution time with respect to a baseline version. This technique can straightforwardly be applied to the 3D case, where it also reports solid gains on execution times, ranging from 3x to 7x. From now on, only the tiled 3D-FWT version is taken for parallelization purposes, either on CPU or GPU.
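As a minimal illustration of the aggregation/tiling idea (a sketch, not the authors' code), the following C fragment performs a vertical pass over only a tile of columns before moving down a row, so the cache lines of recently touched rows are reused instead of evicted. A running column sum stands in for the wavelet filter to keep the example self-contained.

```c
/* Vertical (per-column) pass with column tiling: a sketch of the
   aggregation/tiling technique. A running column sum replaces the
   actual wavelet filter; the traversal order is the point. */
static void vertical_pass_tiled(const float *in, float *out,
                                int rows, int cols, int tile)
{
    for (int c0 = 0; c0 < cols; c0 += tile)        /* pick a tile of columns */
        for (int r = 0; r < rows; r++)             /* sweep the rows once    */
            for (int c = c0; c < c0 + tile && c < cols; c++)
                out[r * cols + c] = in[r * cols + c]
                    + (r > 0 ? out[(r - 1) * cols + c] : 0.0f);
}
```

The result is identical to the naive column-by-column order; only the memory traversal, and hence the cache behaviour, changes.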

3. Parallelization on a manycore GPU in CUDA and OpenCL

Fig. 1. The OpenCL program for computing the FWT on each video frame when a work-group of 128 work-items is used. The diagram is tailored to a frame size of 128 x 128 pixels, which forces a topology of 128 x 1 work-groups for the NDRange. The i-th work-group loads 4 x 128 pixels into local memory and computes 128 pixels for H and G, storing them in its i-th row.

Our 3D-FWT implementations in OpenCL and CUDA are based on the same CUDA algorithm described in [7][8][9]. We use a simple source-to-source translation to convert the kernels of the CUDA implementation of the 3D-FWT to OpenCL. This effort was smaller than that required for the CUDA version: although there are some differences between CUDA and OpenCL in terminology, the models are similar and it was easy to transform the kernels. Our 3D-FWT implementation in OpenCL and CUDA consists of the following three major steps:

1. The host (CPU) allocates in memory the first four video frames coming from a .pgm file.

2. The first four images are transferred from main memory into video memory. The 1D-FWT is then applied to the first four frames over the third dimension to obtain a pair of frames for the detailed and reference videos. Figure 1 illustrates this procedure for a block size or work-group of 128 threads or work-items, respectively. In CUDA and OpenCL, each thread or work-item loads four pixels into local memory and computes an output H and G pair. The grid or NDRange is composed of rows x cols/128 blocks or work-groups.

3. The 2D-FWT is applied to the frame belonging to the detailed video, and, subsequently, to the reference video. Results are then transferred back to main memory.

The whole procedure is repeated for all the remaining input frames, taking two additional frames on each new iteration. Figure 2 summarizes the way the entire process is implemented in OpenCL and CUDA. On each new iteration, two frames are copied, either at the beginning or at the second half of the buffer, depending on the iteration number. In particular, the first iteration copies frames 0, 1, 2 and 3 to obtain the first detailed and reference video frames, the second iteration involves frames 2, 3, 4 and 5 to obtain the second detailed and reference video frames, and so on. Note that frames 4 and 5 occupy the memory formerly assigned to frames 0 and 1, which requires an interleaved access to frames in the second iteration.
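The bookkeeping behind this interleaving can be captured by two small helper functions (a sketch; the names are ours, not the actual code): frame f always lives in slot f mod 4 of the four-frame buffer, and iteration i works on the window of frames 2i to 2i+3.

```c
/* Ring-buffer bookkeeping for the sliding four-frame window
   (illustrative sketch; helper names are ours). */
static int slot_of(int frame) { return frame % 4; }

/* The four frames processed by iteration 'it'; the window slides
   by two frames per iteration. */
static void window_of(int it, int frames[4])
{
    for (int k = 0; k < 4; k++)
        frames[k] = 2 * it + k;
}
```

For iteration 1 the window is frames 2, 3, 4 and 5, with slot_of(4) == 0 and slot_of(5) == 1: the two new frames overwrite the slots of frames 0 and 1, which is exactly why the kernel must read the buffer in an interleaved order.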

Memory objects have been allocated in page-locked (also known as pinned) host memory (as opposed to regular pageable host memory allocated by malloc()), in order to achieve higher performance for data transfers between host and device.

[Fig. 2 diagram. 1st step: host-device copies (4 frames are copied in global memory): FRAME 0, FRAME 1, FRAME 2, FRAME 3; 1D-FWT kernel in the time dimension; device-host copies (2 frames are copied in a buffer in host memory). 2nd step: host-device copies (2 frames are copied in global memory): FRAME 4, FRAME 5, FRAME 2, FRAME 3; 1D-FWT kernel in the time dimension (interleaving); device-host copies (2 frames are copied in a buffer in host memory).]

Fig. 2. The way the 3D-FWT is implemented in OpenCL and CUDA using interleaved accesses to video frames.

4. Optimization techniques for 3D-FWT on a single GPU system

We present an optimization engine for automatically obtaining the parameters of the 3D-FWT running on Nvidia and ATI GPUs. Our method consists mainly of three stages:

1. Detect automatically the available GPU in the system.

2. If the GPU is based on an Nvidia or ATI architecture, the method uses the implementation of the 3D-FWT in CUDA or OpenCL, respectively.

3. The routine selects automatically the key parameter value of block or work-group size needed to execute the 3D-FWT kernel in CUDA or OpenCL. From this value, the remaining parameters are also calculated automatically.

In the first step, the method automatically detects the available GPUs in the system with the OpenCL function clGetDeviceInfo [6]. This function returns the main features of an OpenCL device, such as the number of parallel compute cores, the device type, the vendor name, the maximum number of work-items that can be specified in each dimension of a work-group, the maximum number of work-items in a work-group executing a kernel using the data-parallel execution model, etc. All this information is used in the following steps.

The second step decides between the available kernels of the 3D-FWT implemented in CUDA and OpenCL. If the GPU vendor is Nvidia, the method uses an implementation in CUDA because this implementation of the 3D-FWT achieves better speedups than the kernel programmed in OpenCL [10]. If the GPU is an ATI model, the routine uses the OpenCL implementation.

In the third stage, the method selects the values of the different parameters to run the 3D-FWT kernel implemented in CUDA or OpenCL. In both implementations, the block size or the work-group size, respectively, is the key parameter, and the selection of this value automatically determines the other parameters, such as the grid size or the occupation of the shared memory. In this way, we define one-dimensional thread blocks or work-groups of the selected size and two-dimensional grids of m x n/(2 * block size) thread blocks for an image of m rows by n columns. In [8], we empirically determined the optimal configuration using the CUDA occupancy calculator and following a simple set of heuristics [25]. Such a configuration assigns a different thread to compute every pair of G and H values in the 1D-FWT. Each thread requires 13 registers, and a block size of n needs 8n + 8 bytes of shared memory. Thus, the number of active thread blocks per multiprocessor requires a number of registers and an amount of shared memory which must not exceed the maximum allowed values for

Table 1. Summary of execution times (msecs.) for the 3D-FWT

                                                Frame size
                               512 x 512                  1K x 1K                      2K x 2K
GPU - Code Version          64    128    192    256     64    128    192    256      64     128     192     256
Tesla C870 - Cuda        58.68  56.28  53.51  58.68  225.74 214.36 209.01 217.21  889.83  841.47  840.14  850.26
Tesla C2050 - Cuda       35.33  53.17  32.13  33.59  122.12 115.02 110.88 113.32  467.50  438.46  427.69  433.84
FirePro V5800 - OpenCL  130.06 135.87 131.29 114.87  452.95 346.29 313.35 307.54 2123.60 1496.27 1284.56 1217.59

the Nvidia GPUs. Our optimization engine is based on the CUDA occupancy calculator: the routine computes, for different block sizes, the active thread blocks per multiprocessor and the occupancy of each multiprocessor, in order to select the block size that maximizes the occupancy. If two or more block sizes obtain the same occupancy, the method chooses the configuration with the maximum number of active thread blocks per multiprocessor. The routine contains a table with the physical limits of the GPUs, and the limit on active thread blocks per multiprocessor is the minimum imposed by the maximum warps, registers and shared memory per multiprocessor. From now on, the function that selects the block size for Nvidia GPUs is called f_block. On the other hand, for the OpenCL codes on ATI GPUs, the proposed automatic method is based on the CL_DEVICE_MAX_WORK_GROUP_SIZE value, obtained through the function clGetDeviceInfo. This value is the maximum number of work-items in a work-group executing a kernel using the data-parallel execution model, and our method uses it as the work-group size.
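The selection rule of f_block can be sketched as follows. This is a simplified model (real occupancy calculations also account for allocation granularities): active blocks per multiprocessor are limited by threads, registers, shared memory and a hard block cap, and the best block size maximizes active threads, breaking ties by active blocks. The per-multiprocessor limits used in the test below are those of the Tesla C870; the 13 registers per thread and 8n + 8 bytes of shared memory come from the kernel analysis above.

```c
/* Per-multiprocessor physical limits (to be filled per GPU). */
typedef struct {
    int max_threads;   /* e.g. 768 on the Tesla C870   */
    int max_blocks;    /* e.g. 8                        */
    int max_regs;      /* e.g. 8192 registers           */
    int max_shmem;     /* e.g. 16384 bytes              */
} sm_limits;

/* Sketch of f_block: the kernel needs 13 registers per thread and
   8n + 8 bytes of shared memory for a block of n threads; active
   blocks = min over all limits; pick the block size with the most
   active threads (occupancy), then the most active blocks. */
static int f_block(sm_limits lim)
{
    const int sizes[4] = { 64, 128, 192, 256 };
    int best = 0, best_threads = -1, best_blocks = -1;
    for (int i = 0; i < 4; i++) {
        int n = sizes[i];
        int blocks = lim.max_threads / n;              /* thread limit */
        int by_regs  = lim.max_regs / (13 * n);        /* register limit */
        int by_shmem = lim.max_shmem / (8 * n + 8);    /* shared-mem limit */
        if (by_regs  < blocks) blocks = by_regs;
        if (by_shmem < blocks) blocks = by_shmem;
        if (lim.max_blocks < blocks) blocks = lim.max_blocks;
        int active = blocks * n;                       /* occupancy proxy */
        if (active > best_threads ||
            (active == best_threads && blocks > best_blocks)) {
            best = n;
            best_threads = active;
            best_blocks = blocks;
        }
    }
    return best;
}
```

With the C870 limits, block size 192 admits 3 blocks (register-bound, 576 active threads, 75% occupancy), while 64, 128 and 256 only reach 512 active threads (67%), reproducing the selection discussed below.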

Experiments with the 3D-FWT parameters have been performed on three different GPUs: an Nvidia Tesla C870, an Nvidia Fermi Tesla C2050 and an ATI FirePro V5800 DVI. The 3D-FWT kernels in CUDA and OpenCL are applied to a video of 64 frames of different sizes (512 x 512, 1024 x 1024 and 2048 x 2048 pixels). Table 1 summarizes the execution times for different block sizes, including computation and communication times. The bold format denotes the optimal configuration for each GPU, code version and frame size.

For the Tesla C870, the optimization engine studies the problem for the block sizes 64, 128, 192 and 256. The block size of 192 yields 75% occupancy of each multiprocessor, higher than the 67% obtained by the 64, 128 and 256 block sizes; therefore, 192 is the size selected. For the Fermi GPU, the optimization engine performs its study for the same block sizes. The sizes 192 and 256 obtain 100% occupancy of each multiprocessor with 8 and 6 active thread blocks, respectively, so the block size of 192 is again the value selected. For the ATI GPU, the automatic routine obtains 256 as the value of CL_DEVICE_MAX_WORK_GROUP_SIZE through the function clGetDeviceInfo.

In a few words, the optimization engine selects a block size of 192 for the Tesla C870 and the Fermi C2050, and 256 for the ATI FirePro. Therefore, our automatic method finds the optimum block size for the three GPUs, allowing non-expert users to obtain the highest performance in the different execution environments.

5. Optimization techniques for 3D-FWT on hybrid systems

Firstly, we propose how to extend our methodology to systems with more than one GPU. Now, the optimization engine automatically detects the number and type of GPUs in the system with the functions clGetPlatformIDs and clGetDeviceIDs. For each GPU in the system, the method described in section 4 is applied. Now, each GPU can execute the 3D-FWT for a video sequence independently, overlapping the computations. A combined effort is also feasible in applications performing 3D-FWTs over a list of queued videos on a batch-processing basis: videos are mapped to the GPUs for an even workload balance on each platform to maximize task parallelism.

Our optimization engine automatically computes the 3D-FWT parameters for a multi-GPU system with an Nvidia Fermi Tesla C2050 and an ATI FirePro V5800 DVI, given a list of video sequences of different sizes. The routine automatically detects both GPUs. A kernel programmed in CUDA is selected for the Fermi Tesla, whereas a program written in OpenCL is chosen for the ATI GPU. The block and work-group sizes are obtained for each platform. Finally, the execution of the two 3D-FWT sequences is carried out simultaneously.

Algorithm 1 Automatic algorithm for systems with manycore GPUs and multicore CPUs
1: Detect automatically the available GPUs and CPUs in the system.
2: for each platform (GPU or CPU) do
3:   if the GPU is based on Nvidia then
4:     Select the CUDA implementation of the 3D-FWT.
5:     f_block calculates automatically the block size.
6:   end if
7:   if the GPU is based on ATI then
8:     Select the OpenCL version of the 3D-FWT.
9:     The work-group size is equal to CL_DEVICE_MAX_WORK_GROUP_SIZE.
10:  end if
11:  if CPU then
12:    Select the implementation with tiling and pthreads.
13:    The number of threads is equal to the number of compute units in the CPU.
14:  end if
15:  Send one sequence to this platform to obtain the computing performance of the 3D-FWT kernel.
16: end for
17: Send sequences in a proportion equal to the 3D-FWT kernel computing performance on each GPU and CPU.
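Step 17 can be sketched as follows. The rule of rounding each platform's measured Mflops ratio against the slowest platform to the nearest integer is our reading of how the proportions reported in the text are obtained, not code from the paper.

```c
/* Sketch of step 17 of Algorithm 1: per-platform share of sequences,
   rounded from the measured 3D-FWT Mflops relative to the slowest
   platform (the rounding rule is an assumption). */
static void shares(const double *mflops, int n, int *out)
{
    double slowest = mflops[0];
    for (int i = 1; i < n; i++)
        if (mflops[i] < slowest)
            slowest = mflops[i];
    for (int i = 0; i < n; i++)
        out[i] = (int)(mflops[i] / slowest + 0.5);  /* round to nearest */
}
```

With the 1K x 1K figures of Table 2 (605.24, 438.73 and 218.21 Mflops for the C2050, the Xeon E5620 and the FirePro V5800), this yields the 3 : 2 : 1 split used in section 5.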

Table 2. Summary of computer performance (Mflops) for the 3D-FWT kernel and speedups between platforms

Frame size

Platform                       512 x 512   1K x 1K   2K x 2K

Nvidia C2050 GPU 522.17 605.24 627.64

ATI FirePro V5800 146.05 218.21 220.46

Intel Xeon E5620 CPU 410.80 438.73 445.54

Intel Core 2 Quad Q6700 CPU 92.85 87.63 88.31

Nvidia C870 GPU 313.53 321.08 319.51

Speedups between platforms

Nvidia C2050 GPU - ATI FirePro V5800 3.58 2.77 2.85

Intel Xeon E5620 CPU - ATI FirePro V5800 2.81 2.01 2.02

Nvidia C870 GPU - Intel Core 2 Quad Q6700 CPU 3.38 3.66 3.62

Our aim is to obtain the maximum performance in a hybrid system with several manycore GPU and multicore CPU components, so an extension of our methodology is necessary. Now, the optimization engine detects the number and type of GPUs, and the number of compute units in the CPU. Moreover, a new step is added to the algorithm described in section 4, where a proportional shipment of sequences to each device in the system is performed to process the 3D-FWT concurrently. In a first attempt, the proportion was computed using the computing performance of a matrix multiplication kernel. However, this kernel was discarded because it does not represent well the operations used in the wavelet transform. Therefore, in a second attempt, our optimization engine computes the proportion depending on the computing performance of the 3D-FWT kernel. The top part of Table 2 shows the computing performance of the 3D-FWT on several platforms. The complete and final optimization procedure can be observed in Algorithm 1. In this way, the method can quickly decide the proportions among the different platforms.

In a CPU, the automatic routine uses the best implementation of the 3D-FWT obtained previously [9], with tiling and pthreads. When a CPU is selected to execute the 3D-FWT program, the optimization engine selects a number of threads equal to the number of available processors or compute units in the CPU. In Table 3, we can observe the execution times for the 3D-FWT on the Intel Xeon E5620 CPU. The bold format denotes the number of threads that obtains the optimal configuration implemented with pthreads for this CPU. The Intel Xeon E5620 has four processors with hyperthreading, so eight compute units are detected and the 3D-FWT program is executed with eight threads. Therefore, the number of threads selected by the optimization method matches the optimal number of threads.
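A minimal sketch of how the work could be split among the selected number of threads (the even block-partition scheme and the names are ours; the actual pthreads implementation is described in [9]):

```c
/* Even block partition of 'rows' rows among 'nthreads' workers;
   thread 'tid' processes rows [*lo, *hi). Remainder rows go to the
   first threads. Illustrative sketch, not the paper's code. */
static void rows_for(int tid, int nthreads, int rows, int *lo, int *hi)
{
    int base  = rows / nthreads;
    int extra = rows % nthreads;
    *lo = tid * base + (tid < extra ? tid : extra);
    *hi = *lo + base + (tid < extra ? 1 : 0);
}
```

For a 512-row frame and the eight threads selected on the Xeon E5620, each worker receives 64 rows.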

The optimization engine is executed for a system with an Intel Xeon E5620 CPU, a Nvidia Fermi Tesla C2050 and an ATI FirePro V5800 DVI. From now on, this system is called system 1. The engine detects automatically the Intel Xeon CPU and both GPUs. All the information is collected with the function clGetDeviceInfo. The CUDA implementation for the Nvidia GPU, the OpenCL program for the ATI GPU, and the 3D-FWT tiling

Table 3. Summary of execution times (msecs.) for the 3D-FWT on Intel Xeon E5620 CPU

Number of threads                Frame size
                     512 x 512     1K x 1K     2K x 2K

1 134.15 518.28 2032.57

2 78.68 278.83 1082.31

4 47.66 169.08 656.95

8 40.84 152.96 602.50

16 41.16 154.36 603.03

32 42.77 154.89 604.27

Table 4. Execution times (secs.) for a two-hour video sequence at 25 frames per second at different resolutions in system 1. Speedups among the optimization engine for hybrid systems, the optimization techniques for a single GPU system and the results obtained by a normal user

Frame size

3D-FWT on a single GPU system            512 x 512   1K x 1K   2K x 2K

Nvidia C2050 GPU (normal user) 134.21 324.27 1242.33

Nvidia C2050 GPU (optimization engine) 90.33 311.74 1202.45

ATI FirePro V5800 (normal user) 359.93 998.17 4303.02

ATI FirePro V5800 (optimization engine) 322.96 864.65 3423.25

3D-FWT on hybrid systems

Nvidia C2050 GPU 45.17 133.60 515.34

Intel Xeon E5620 CPU 43.06 122.87 483.98

ATI FirePro V5800 40.37 123.52 489.04

Speedups

Hybrid systems - Nvidia C2050 GPU (normal user) 2.97 2.43 2.41

Hybrid systems - Nvidia C2050 GPU (optimization engine) 2.00 2.33 2.33

Hybrid systems - ATI FirePro V5800 (normal user) 7.97 7.47 8.35

Hybrid systems - ATI FirePro V5800 (optimization engine) 7.15 6.47 6.64

version with pthreads for the quad-core with hyperthreading are selected, respectively. Next, f_block automatically calculates the block size, 192, for the CUDA implementation; the work-group size is set to 256 in the OpenCL program; and the number of threads is set to 8 in the CPU program. A sequence of 64 frames is sent to each platform to measure the computing performance of the 3D-FWT kernel. These results can be observed for the three platforms in the first, second and third rows of Table 2. The proportions for sending sequences are the speedups between the different platforms; the sixth and seventh rows of Table 2 summarize these speedups for the 3D-FWT kernel in system 1. Therefore, the optimization engine sends 3, 2 and 1 sequences of 64 frames of 1024 x 1024 and 2048 x 2048 pixels to be processed by the Nvidia C2050 GPU, the Intel Xeon CPU and the ATI FirePro, respectively. In the same way, for a resolution of 512 x 512 pixels, 4, 3 and 1 sequences are sent to the Nvidia GPU, the Intel Xeon and the ATI GPU, respectively. We therefore have a tantalizing scenario with two GPUs, one Nvidia and one ATI, and one Intel CPU, executing different implementations of the 3D-FWT in CUDA, OpenCL and pthreads, respectively: a tri-processor platform where each piece of hardware contributes to speeding up the application, whether with a few CPU cores or with an army of streaming processors. Overall, the optimization engine's efforts on multicore CPUs and manycore GPUs provide multiple chances for a cooperative scenario with an additional performance boost. For example, for a two-hour video sequence at 25 frames per second at different resolutions, split into groups of 64 frames, the optimization engine divides the sequences in a proportion equal to the speedup of the 3D-FWT kernel computing performance on each platform of the system.
In the top of the table 4, a summary of the 3D-FWT execution times for different platforms in the system 1 is presented. In the first and the third row, an averaged of the execution times obtained by a normal user, who sends all sequences to one of the GPUs of the system, are presented. We consider this normal user has not knowledge to obtain the block or work-group size for the Nvidia or ATI GPU, respectively. Therefore, an averaged execution times among the different results presented in table 1, obtained by this user, can be observed. In the second and fourth row, we can observe the execution times of the method proposed in section 4, which processes all sequences of 64 frames in the Nvidia C2050 or in the ATI GPU. In the next rows, the different 3D-FWT execution times for the optimization engine on hybrid systems in each platform of the system 1 are presented. The proportions, computed automatically in the algorithm 1, to process sequences in each platform produce similar execution times in each platform. In the low part of the table 4, we can observe speedups among the optimization engine for hybrid systems, the method proposed in section 4 and results obtained by a normal user. In summary, 180.000 frames are processed simultaneously in the three platforms obtaining an averaged speedups of 2.60 and 7.93, versus a nomal user, who sends all sequences to the

Table 5. Execution times (secs.) for a two-hour video sequence at 25 frames per second, at different resolutions, on system2. Speedups among the optimization engine for hybrid systems, the optimization techniques for a single-GPU system and the results obtained by a normal user.

                                                   Frame size
                                            512 x 512   1K x 1K   2K x 2K
3D-FWT on a single-GPU system
  Nvidia C870 GPU (normal user)                159.66    608.92   2405.02
  Nvidia C870 GPU (optimization engine)        150.44    587.63   2362.05
3D-FWT on hybrid systems
  Nvidia C870 GPU                              112.83    470.11   1889.64
  Intel Core 2 Quad Q6700 CPU                  127.00    430.64   1709.20
Speedups
  Hybrid systems vs. C870 (normal user)          1.26      1.30      1.27
  Hybrid systems vs. C870 (optim. engine)        1.18      1.25      1.25

Nvidia or ATI GPU, respectively. In the same way, our optimization engine for hybrid systems obtains average speedups of 2.22 and 6.75 with respect to a system with a single GPU, Nvidia or ATI, using the method presented in Section 4.

For another system with an Intel Core 2 Quad Q6700 CPU and an Nvidia Tesla C870, the optimization engine automatically detects the Nvidia GPU and the quad-core CPU. From now on, this system is called system2. The block size, 192, is computed with the function f_block, and the number of threads is set to 4 for the CPU program. On this system, the optimization engine dispatches 3 sequences of 512 x 512, or 4 sequences of 1K x 1K and 2K x 2K, to the Nvidia C870 GPU for every sequence sent to the Intel Core 2 Quad. These proportions depend on the computing performance presented in Table 2 (fourth and fifth rows), or equivalently on the speedups presented in the last row of that table. For a two-hour video sequence at 25 frames per second, at different resolutions and split into groups of 64 frames, Table 5 summarizes the 3D-FWT execution times on the different platforms of system2 (top of the table) and the speedups among the optimization engine for hybrid systems, the optimization techniques proposed in Section 4 and the results achieved by a normal user (lower part of the table). In summary, the two platforms process all frames simultaneously with an average speedup of 1.23 across the different image resolutions, versus a single-GPU system with the optimization techniques presented in Section 4, which dispatches all sequences to the Nvidia C870 GPU. In the same way, the optimization engine for hybrid systems presents an average speedup of 1.28 with respect to a normal user.

6. Conclusions and future work

We have proposed an optimization engine to automatically run the 3D-FWT kernel on systems that integrate different platforms, such as multicore CPUs and manycore GPUs. The optimization techniques detect the different platforms in the system and execute the best implementation for each one as follows: for a GPU based on the Nvidia architecture, the 3D-FWT kernel written in CUDA is chosen; for a GPU based on the ATI parallel compute model, the kernel programmed in OpenCL is selected; and for a CPU, an implementation with pthreads is used. The proposed method automatically computes the number of threads on a multicore CPU, the block size or the work-group size on an Nvidia or ATI GPU, respectively, and the rest of the parameters needed to execute the kernel. Moreover, it distributes the video sequences proportionally, according to the computing performance of the 3D-FWT kernel on each platform of the system, so that the sequences are processed concurrently.

We have tested our optimization engine on several different hybrid systems. On a system with three platforms, an Nvidia Fermi Tesla C2050, an Intel Xeon E5620 CPU and an ATI FirePro V5800 DVI, our proposed method selects the CUDA implementation with the optimal block size, 192, for the Nvidia GPU, the version with the optimal number of eight threads for the CPU, and the OpenCL 3D-FWT program with the optimal work-group size of 256 for the ATI GPU. The automatic computation of the block size on Nvidia architectures is based on the CUDA occupancy calculator: the block size that maximizes the occupancy of each multiprocessor and yields the maximum number of active thread blocks per multiprocessor is selected. On ATI GPU architectures, the work-group size is obtained from the value of CL_DEVICE_MAX_WORK_GROUP_SIZE returned by the function clGetDeviceInfo. On a CPU, the number of threads is set to the number of available compute units. The optimization engine presents average gains of up to 7.93x with respect to a normal user who sends all groups of frames to the platform with the 3D-FWT implemented in OpenCL on the ATI GPU.

On another system with an Nvidia Tesla C870 GPU and an Intel Core 2 Quad Q6700 CPU, our optimization engine detects the GPU and the CPU properly, and the routines implemented in CUDA and with pthreads are selected, respectively. The optimal block size, 192, and the optimal number of threads, 4, are chosen automatically. Moreover, the proposed automatic method divides the groups of frames of a video sequence in proportion to the speedup of the computing performance of the 3D-FWT kernel on each platform of the system, obtaining average speedups of up to 1.28x with respect to a normal user.

The methodology applied to design the optimization engine described in this paper should be applicable to other computationally complex applications, so that non-expert users can obtain very good performance in other applications as well. Our work is part of the development of an image processing library oriented to biomedical applications, allowing users to execute different routines efficiently and automatically.

7. Acknowledgements

This work was supported by the Spanish MINECO under grant TIN2012-38341-C04-03. We are grateful to the reviewers for their valuable comments.

References

[1] D. Manocha, General-Purpose Computation Using Graphic Processors, IEEE Computer 38 (8) (2005) 85-88.

[2] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. E. Lefohn, T. J. Purcell, A Survey of General-Purpose Computation on Graphics Hardware, Computer Graphics Forum 26 (1) (2007) 80-113.

[3] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, J. C. Phillips, GPU computing, Proceedings of the IEEE 96 (5) (2008) 879-889.

[4] Nvidia, CUDA Zone maintained by Nvidia, http://www.nvidia.com/object/cuda.html (2009).

[5] AMD, AMD Stream Computing, http://ati.amd.com/technology/streamcomputing/index.html (2009).

[6] The Khronos Group, The Opencl core API specification, http://www.khronos.org/registry/cl (2011).

[7] J. Franco, G. Bernabe, J. Fernandez, M. Ujaldon, The 2D Wavelet Transform on emerging architectures: GPUs and multicores, Journal of Real-Time Image Processing 7 (3) (2012) 145-152. http://dx.doi.org/10.1007/s11554-011-0224-7

[8] J. Franco, G. Bernabe, J. Fernandez, M. E. Acacio, A Parallel Implementation of the 2D Wavelet Transform Using CUDA, in: 17th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, Weimar, Germany, 2009.

[9] J. Franco, G. Bernabe, J. Fernandez, M. Ujaldon, Parallel 3D Fast Wavelet Transform on manycore GPUs and multicore CPUs, in: 10th International Conference on Computational Science, Amsterdam, Netherlands, 2010.

[10] G. Bernabe, G. D. Guerrero, J. Fernandez, CUDA and OpenCL Implementations of 3D Fast Wavelet Transform, in: 3rd IEEE Latin American Symposium on Circuits and Systems, Playa del Carmen, Mexico, 2012.

[11] R. C. Whaley, A. Petitet, J. Dongarra, Automated empirical optimizations of software and the ATLAS project, Parallel Computing 27 (1-2) (2001) 3-35.

[12] E. J. Im, K. Yelick, R. Vuduc, Optimization framework for sparse matrix kernels, International Journal of High Performance Computing and Applications 18 (1) (2004) 135-158.

[13] M. Frigo, S. G. Johnson, The design and implementation of FFTW3, Proceedings of the IEEE, Special issue on Program Generation, Optimization, and Platform Adaptation 93 (2) (2005) 216-231.

[14] T. Katagiri, K. Kise, H. Honda, T. Yuba, ABCLib DRSSED: a parallel eigensolver with an auto-tuning facility, Parallel Computing 32 (3) (2006) 231-250.

[15] V. Volkov, J. W. Demmel, Benchmarking GPUs to tune dense linear algebra, in: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing SC'08, Austin, TX, U.S.A., 2008.

[16] J. Kurzak, S. Tomov, J. Dongarra, Autotuning gemms for Fermi, in: Proceedings of the 2011 ACM/IEEE conference on Supercomputing SC'11, Seattle, Washington, U.S.A., 2011.

[17] L. Yinan, J. Dongarra, S. Tomov, A note on auto-tuning gemm for GPUs, in: Proceedings of the 9th International Conference on Computational Science: Part I, Baton Rouge, Louisiana, U.S.A., 2009, pp. 884-892.

[18] A. Davidson, J. Owens, Toward techniques for auto-tuning GPU algorithms, Applied Parallel and Scientific Computing, Lecture Notes in Computer Science 7134 (2012) 110-119.

[19] S. Mallat, A Theory for Multiresolution Signal Decomposition: The Wavelet Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (7) (1989) 674-693.

[20] G. Bernabe, J. Gonzalez, J. M. Garcia, J. Duato, A New Lossy 3-D Wavelet Transform for High-Quality Compression of Medical Video, in: Proceedings of IEEE EMBS International Conference on Information Technology Applications in Biomedicine, 2000, pp. 226-231.

[21] I. Daubechies, Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics, 1992.

[22] P. Meerwald, R. Norcen, A. Uhl, Cache Issues with JPEG2000 Wavelet Lifting, in: Proceedings of Visual Communications and Image Processing Conference, 2002, pp. 626-634.

[23] A. Shahbahrami, B. Juurlink, S. Vassiliadis, Improving the Memory Behavior of Vertical Filtering in the Discrete Wavelet Transform, in: Proceedings of ACM Conference in Computing Frontiers, 2006, pp. 253-260.

[24] J. Tao, A. Shahbahrami, B. Juurlink, R. Buchty, W. Karl, S. Vassiliadis, Optimizing Cache Performance of the Discrete Wavelet Transform Using a Visualization Tool, Procs. of IEEE Intl. Symposium on Multimedia (2007) 153-160.

[25] Nvidia Tutorial at PDP'08, CUDA: A New Architecture for Computing on the GPU (February 2008).