

A uniform approach for programming distributed heterogeneous computing systems

Ivan Grasso a,b,*, Simone Pellegrini a, Biagio Cosenza a, Thomas Fahringer a


a Institute of Computer Science, University of Innsbruck, Austria
b Barcelona Supercomputing Center, Barcelona, Spain

Highlights

• libWater programming model, which extends OpenCL with a simplified interface.
• A lightweight distributed runtime system based on asynchronous command execution.
• A powerful representation that collects and arranges dependencies between commands.
• Dynamic Collective Replacement and Device-Host-Device Copy Removal optimizations.
• A study of the performance of the library on three compute clusters.

article info

Article history:

Received 15 July 2013

Received in revised form 15 April 2014

Accepted 14 August 2014

Available online 26 August 2014

Keywords:

OpenCL

Distributed computing
Heterogeneous computing
Programming model
Runtime system

abstract

Large-scale compute clusters of heterogeneous nodes equipped with multi-core CPUs and GPUs are getting increasingly popular in the scientific community. However, such systems require a combination of different programming paradigms, making application development very challenging.

In this article we introduce libWater, a library-based extension of the OpenCL programming model that simplifies the development of heterogeneous distributed applications. libWater consists of a simple interface, which is a transparent abstraction of the underlying distributed architecture, offering advanced features such as inter-context and inter-node device synchronization. It provides a runtime system which tracks dependency information enforced by event synchronization to dynamically build a DAG of commands, on which we automatically apply two optimizations: collective communication pattern detection and device-host-device copy removal.

We assess libWater's performance in three compute clusters available from the Vienna Scientific Cluster, the Barcelona Supercomputing Center and the University of Innsbruck, demonstrating improved performance and scaling with different test applications and configurations.

© 2014 The Authors. Published by Elsevier Inc.

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

1. Introduction

Ease of programming and best performance exploitation are two conflicting goals when designing programming models and abstractions for high performance computing (HPC). For instance, when programming a compute cluster, better performance can be obtained by directly using low-level and error-prone communication layers like MPI [27]. Alternatively, high-level models like domain-specific languages and frameworks can be employed to simplify the programmability and portability of the code. However, this simplification also brings a loss of performance, because the level of abstraction is too far removed from the underlying hardware.

* Corresponding author at: Institute of Computer Science, University of Innsbruck, Austria.

E-mail address: grasso@dps.uibk.ac.at (I. Grasso).

The recent rise of multi- and many-core CPUs, alongside special-purpose hardware and accelerators such as GPUs, has made this trade-off even more challenging. Heterogeneous architectures require an intricate and complex mix of programming models such as CUDA, OpenMP and pthreads in order to handle the diversity of execution environments.

The Open Computing Language (OpenCL [21]) is a partial solution to the problem. It introduces an open standard for general-purpose parallel programming of heterogeneous systems, which has been implemented by many vendors such as Adapteva, Altera, AMD, ARM, Intel, Imagination Technologies, NVIDIA, Qualcomm, Vivante and Xilinx. An OpenCL program comprises a host program and a set of kernels intended to run on a compute device. It also includes a language for kernel programming, and an API for transferring data between host and device memory and for executing kernels. OpenCL is therefore a big leap forward in assuring portability between different hardware, potentially replacing standards like OpenMP and CUDA, but it also has some limitations. A first problem is that it does not allow interactions between different platforms; for example, it is not possible to use event synchronization between devices from different vendors. Secondly, OpenCL host applications are rather verbose, as the API spans different levels of abstraction (platform, device and context). Moreover, when writing an application targeting e.g. a cluster of heterogeneous nodes, we still require an intricate mix of OpenCL with a communication layer like MPI. Although OpenCL can easily be extended to support remote, distributed devices (attempts in this direction are [22,1,20,11]), the host-device paradigm forces the use of a centralized communication pattern, which is a strong limitation for scaling on large-scale compute clusters.

In this article, we introduce libWater, a library-based extension of the OpenCL programming paradigm that simplifies the development of applications for distributed heterogeneous architectures. libWater aims to improve both productivity and implementation efficiency by addressing all the problems listed above. libWater does not alter the logic of OpenCL kernels, but replaces the host-side API with a new, simpler and transparent interface which abstracts the underlying distributed architecture.

The main contributions of this article are:

• The libWater programming model, which extends the OpenCL standard by replacing the host code with a simplified and concise interface. It defines a novel device query language (DQL) for OpenCL device management and discovery, and introduces new features such as inter- and intra-context synchronization.

• A lightweight distributed runtime environment, which dispatches the work among remote devices, based on asynchronous execution of both communications and OpenCL commands. The libWater runtime also collects and arranges dependencies between commands in the form of a powerful representation called the command DAG.

• Two effective uses of the command DAG to improve scalability: (a) a Dynamic Collective Replacement (DCR) optimization, which identifies collective communication patterns and replaces them with MPI collective operations; (b) a Device-Host-Device Copy Removal (DHDCR) optimization, where device-device communications supersede device-host-device ones. Both optimizations overcome the limitations of the OpenCL host-device semantics, improving scalability on large-scale compute clusters.

• A study of the scalability of libWater on two real production clusters using up to 64 devices. Results show high efficiency and demonstrate the suitability of the presented command DAG optimizations for seven computational application codes. Finally, we demonstrate the suitability of libWater for a heterogeneous cluster using two codes.

Our approach expands on previous work [13] by adding a new optimization (DHDCR, Section 6), new test cases (Section 6), new scalability studies on an additional target architecture, the MinoTauro GPU cluster of the Barcelona Supercomputing Center (Section 7.2), and new studies that test the suitability of libWater for exploiting the computational capabilities of a heterogeneous cluster configuration (Section 7.3). With a wider range of applications, test platforms and optimizations, we show how libWater effectively improves the overall performance and scalability on large-scale compute clusters while easing programmability.

The rest of the article is organized as follows. Sections 2 and 3 introduce the OpenCL and libWater programming models. Section 4 describes the distributed runtime system and the underlying command DAG representation. The runtime optimizations are treated in Sections 5 and 6. The experimental evaluation is presented in Section 7. Sections 8 and 9 discuss related work and conclusions.

2. The OpenCL programming model

OpenCL is an open industry standard for programming heterogeneous systems. The language is designed to support devices with different capabilities such as CPUs, GPUs and accelerators. The platform model comprises a host connected to one or more compute devices. Each device logically consists of one or more compute units (CUs), which are further divided into processing elements (PEs). Within a program, the computation is expressed through the use of special functions called kernels, which are, for portability reasons, compiled at runtime by an OpenCL driver. Interaction with the devices is possible by means of command-queues, which are defined within a particular OpenCL context. Once enqueued, commands - such as the execution of a kernel or the movement of data between host and device memory - are managed by the OpenCL driver, which schedules them on the actual physical device.

Commands can be enqueued in a blocking or non-blocking way. A non-blocking call places a command on a command-queue and returns immediately to the host, while a blocking-mode call does not return to the host until the command has been executed on the device. For synchronization purposes, within a context, event objects are generated when kernel and memory commands are submitted to a queue. These objects are used to coordinate execution between commands and enable decoupling between the host and device control flows.
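As a brief illustration of this mechanism (a minimal sketch using the standard OpenCL host API; context, queue, kernel and buffer setup are omitted), the event returned by a non-blocking write can be used to delay a dependent kernel launch:

cl_event write_done;
/* Non-blocking write (CL_FALSE): returns immediately; 'write_done'
   signals completion of the transfer. */
clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, size, host_ptr,
                     0, NULL, &write_done);
/* The kernel launch waits on 'write_done' before running on the device. */
size_t gws = n;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL,
                       1, &write_done, NULL);
/* Blocking read (CL_TRUE): returns only when the result is available. */
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host_ptr, 0, NULL, NULL);
clReleaseEvent(write_done);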

Despite being a well designed language that allows access to the compute power of heterogeneous devices from a single, multi-platform source code base, OpenCL has some drawbacks and limitations. One of the major drawbacks is that, because it was created as a low-level API, a significant amount of boilerplate code is required even for the execution of simple programs. Developers have to be familiar with numerous concepts (i.e. platform, device, context, queue, buffer and kernel), which makes the language less attractive to novice programmers. Another important limitation is that, although OpenCL was designed to address heterogeneous systems, in the case of devices from different vendors, objects belonging to the context of one vendor are not valid for other vendors. This limitation clearly becomes a problem when synchronization of command queues across different contexts is needed.

3. The libWater programming interface

libWater is a C/C++ library-based extension of the OpenCL programming paradigm that simplifies the development of distributed heterogeneous applications. It inherits the main principles of the OpenCL programming model while trying to overcome its limitations. While maintaining the notion of host and device code, libWater exposes a very simple programming interface based on four key concepts: device, buffer, kernel and event. A device represents a compute device but, differently from the original paradigm, this single object is an abstraction of the OpenCL platform, device, queue and context concepts. This simplification reduces the number of source code lines necessary for the initialization of the devices, and thus avoids the boilerplate configuration code that is usually present in every OpenCL program. Furthermore, the library is not restricted to a single node: by internally taking advantage of the message passing model, it provides access to devices on remote nodes as if they were locally available.

Since libWater can grant access to a large number of distinct devices, the selection of a particular one can be cumbersome.

Fig. 1. libWater's distributed runtime system architecture.

In order to simplify this important aspect, libWater introduces a novel domain specific language for querying devices. A device query language (DQL) statement follows an SQL-like structure, composed of 4 basic clauses with the following syntax:

SELECT [ALL | TOP k | POS i]
FROM NODE [n [, ...]]
WHERE [restrictions on attribute values]
ORDER BY [attribute [, ...]]

The SELECT clause (the only mandatory one) allows the selection of all the devices, the first top k, or a particular device from the device list generated under the restrictions of the following clauses. With FROM NODE, a single node or a list of nodes can be specified, narrowing the range of selectable devices to those particular nodes. The WHERE and ORDER BY clauses control the restrictions on device attribute values and the order in which the devices are returned. The possible attribute values are currently those exposed by the OpenCL clGetDeviceInfo function. A DQL use case is shown and discussed in Section 4.2. DQL queries can be used for both device initialization and device selection. The latter must be a subset of the former and, since libWater's device concept represents a single device only, the function wtr_get_device only accepts queries that make use of the POS clause.
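For illustration, two hedged examples following this grammar (the routines are those used in Listing 1; the attribute names are placeholders patterned after clGetDeviceInfo properties):

/* Initialize all CPU devices on nodes 0 and 1. */
wtr_init_devices("SELECT ALL FROM NODE 0,1 WHERE type = cpu");

/* Select a single device (POS is mandatory for wtr_get_device):
   the first GPU of node 2, preferring devices with more compute units. */
wtr_device* dev = wtr_get_device(
    "SELECT POS 1 FROM NODE 2 WHERE type = gpu ORDER BY max_compute_units");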

Once a device is created, it is possible to allocate data and execute computations on it. In libWater, this is done through the buffer and kernel concepts. These two objects are similar to their OpenCL counterparts, with the main difference that, at creation time, they are bound to a specific device. For this reason, no device must be specified for buffer- and kernel-related functions. The fourth concept in libWater is the event object. Most kernel and buffer functions have one or two parameters called wait_evt and evt. The latter is an output argument used by the invoked command to generate an event object; if it is not specified, libWater assumes blocking semantics for the routine. The former specifies the event object on which the execution of the command depends; if it is not present, the command has no dependencies and can thus be executed immediately.

The last major difference between libWater and the OpenCL model is that initialization and release of buffers and kernels can be invoked with non-blocking semantics. The main reason for this is to increase the number of operations that the runtime system can overlap. Due to space constraints, we omit the complete libWater API, which can be found in [13]. In the next section we explain how the dependency information enforced by events is exploited by libWater's runtime system.

4. The libWater distributed runtime system

While the main focus of the libWater programming interface is on simplicity and productivity, the underlying runtime system aims at low resource utilization and high scalability. Calls to libWater routines are forwarded to a distributed runtime system which is responsible for dispatching the OpenCL commands to the addressed devices and for transparently and efficiently moving data across the cluster nodes. The libWater distributed runtime is written in C++ and internally uses several paradigms, such as pthreads, OpenMP and MPI, for parallelization.

4.1. Runtime system architecture

Fig. 1 shows the organization of the libWater distributed runtime system. The host code, which directly interacts with libWater's routines, runs on the so-called root node, which by default is the cluster node with rank 0. This thread is referred to as the host thread. In the background, a second thread, the scheduler thread, is allocated to execute an instance of the WTRScheduler. On the remaining cluster nodes, a single scheduler thread is spawned independently of the number of available devices (only one MPI process is allocated per node). This thread executes an instance of the WTRScheduler, which represents the backbone of libWater's distributed runtime system.

Each WTRScheduler continuously dequeues wtr_commands from its local command queue. wtr_commands are generated in two ways: either (i) by libWater's routines (step 1), or (ii) by delegation from the root scheduler (step 3). Calls to libWater's interface are converted into command descriptors (i.e. the command design pattern) and immediately enqueued into the root node's local command queue (step 1 in Fig. 1). Since all wtr_commands are generated by the root node itself, we refer to its queue as the runtime global command queue.

wtr_commands are either wrappers for OpenCL commands or data transfer jobs (i.e. send_job or recv_job), which are generated by the library routines whenever the device addressed by a read or write buffer operation is located in a remote (i.e. rank ≠ 0) compute node. The descriptor of a wtr_command is self-contained, since it carries all the information necessary for its execution. To be portable across cluster nodes, OpenCL objects such as kernels, buffers and events are identified, within the wtr_command object, by a unique ID. The root scheduler continuously fetches wtr_commands from the global command queue, decodes their content and - depending on the targeted device - dispatches each command to the correct node.

 1  wtr_init_devices(
 2      "SELECT ALL WHERE (type = gpu AND vendor = nvidia)");
 3  wtr_event* evts[2];
 4  for (int i = 0; i < 2; ++i) {
 5      size_t offset = size/2 * i;
 6      wtr_device* dev = wtr_get_device(
 7          "SELECT POS 1 FROM NODE %d WHERE global_memory > 1024MB", i);
 8      assert(dev != NULL && "Device does not exist!");
 9      wtr_event* e[8]; wtr_init_event_array(7, e);
10      wtr_kernel* kern = wtr_create_kernel(dev, "kernel.cl", "fun", "", WTR_SOURCE, e+0);
11      wtr_buffer* buff = wtr_create_buffer(dev, WTR_MEM_READ_WRITE, size/2, e+1);
12      wtr_write_buffer(buff, size/2, ptr + offset, e+1, e+2);
13      e[7] = wtr_merge_events(2, e+0, e+2);
14      wtr_run_kernel(kern, 1, (size_t[1]){size/2}, NULL, e+7, e+3,
15                     2, 0, buff,
16                     sizeof(size_t), &offset);
17      wtr_read_buffer(buff, size/2, ptr + offset, e+3, e+4);
18      wtr_release_buffer(buff, e+4, e+5);
19      wtr_release_kernel(kern, e+3, e+6);
20      evts[i] = wtr_merge_events(2, e+5, e+6);
21      wtr_release_event_array(8, e);
22  }
23  /* Blocks until buffers and kernels are released */
24  wtr_wait_for_events(2, evts+0, evts+1);
25  wtr_release_event_array(2, evts);

Listing 1: A complete multi-device program example using libWater's routines.

When a wtr_command addresses one of the local OpenCL devices, the corresponding OpenCL command is created and enqueued into the device command queue (step 2). When a remote OpenCL device is addressed, an MPI message is generated - serializing the content of the wtr_command descriptor - and dispatched to the cluster node hosting the requested device. The WTRScheduler of the target node then de-serializes the wtr_command and, instead of immediately executing it, enqueues it into the local command queue (step 3). The same WTRScheduler is then responsible for dispatching the corresponding OpenCL command into one of its local device queues (step 2).

The heartbeat of the WTRScheduler is an advanced event system which allows the management of an entire compute node - hosting multiple OpenCL devices - using only a single application thread. Because one instance of the WTRScheduler runs on every cluster node, keeping the resource usage as low as possible is of paramount importance in order to avoid wasting CPU cycles that could be used to run OpenCL kernels. Unlike related work, e.g. the SnuCL runtime system [22], which exclusively reserves an entire cluster node and a physical CPU core in each compute node for scheduling purposes, our system does not exclusively reserve any user resources for scheduling. Furthermore, using a single thread both for executing local wtr_commands and for performing scheduling decisions reduces the amount of synchronization, since accesses to the event and command queues do not need to be synchronized.

Relying on a single thread can, however, easily become a performance bottleneck. An interesting example is the interaction with MPI routines. By default, many MPI implementations realize blocking behavior with a spin-lock mechanism in order to minimize latency. This means, for example, that a blocking receive waiting for a message from the communication channel continuously checks for incoming data, usually saturating the cycles of a CPU core. In an environment like ours, where CPU cores may be used to run OpenCL kernels, this behavior must be avoided. Our solution is to avoid, in every event handler routine, any call to blocking MPI or OpenCL routines and to always use the non-blocking semantics. The main idea is the creation of periodic events, handled by the event system using a priority queue based on timestamps, to check for the completion of pending operations. For OpenCL routines, we exploit the OpenCL event system and the associated callback mechanism. In this way, the WTRScheduler is able to dispatch several commands on the OpenCL devices, or MPI data transfers, which, although issued sequentially (by the single flow of execution), are concurrently executed by the available resources (i.e. OpenCL devices and the network controller). The same event-based technique used to manage multiple OpenCL devices in a single node is also exploited at large scale across cluster nodes.
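The following sketch illustrates the polling idea in isolation (our simplified reconstruction, not libWater's actual internals): a non-blocking receive is posted with MPI_Irecv and registered with the event system, whose periodic handler tests it with MPI_Test, so no CPU core ever spins inside a blocking MPI call:

#include <mpi.h>
#include <stdbool.h>

/* A pending transfer: the MPI request plus a completion callback. */
typedef struct {
    MPI_Request req;
    void (*on_done)(void* arg);
    void* arg;
} pending_op;

/* Posting side: start a non-blocking receive instead of MPI_Recv (which
   would typically spin-wait inside the MPI library). The pending_op is
   then registered as a periodic event in the scheduler's
   timestamp-ordered priority queue (registration omitted here). */
void post_recv(void* buf, int count, int src, int tag, pending_op* op) {
    MPI_Irecv(buf, count, MPI_BYTE, src, tag, MPI_COMM_WORLD, &op->req);
}

/* Periodic handler: test the transfer without blocking; fire the
   callback once it has completed. Returns true when done. */
bool check_pending(pending_op* op) {
    int done = 0;
    MPI_Test(&op->req, &done, MPI_STATUS_IGNORE);
    if (done) op->on_done(op->arg);
    return done != 0;
}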

4.2. Event-based command scheduling

As already explained in the previous section, libWater puts a strong emphasis on events. Following the semantics of OpenCL, the dependency information enforced by programmers is used to select wtr_commands that can be safely enqueued on one of the cluster nodes. libWater provides an event object, the wtr_event. Internally, wtr_events are mapped either to an OpenCL cl_event object or to a wtr_command identifier, which is automatically generated for each wtr_command enqueued into the system. These dependencies allow the runtime system to organize the enqueued wtr_commands into a DAG.

A complete multi-device libWater-based host program is shown in Listing 1. This code initializes all the available NVIDIA GPU devices. It then selects two devices, belonging respectively to node ranks 0 and 1, with a global memory larger than 1024 MB.

Fig. 2. DAG of wtr_commands generated during the execution of the code snippet in Listing 1.

For each device, the code in Listing 1 does the following: it creates a kernel (i.e. kern, line 10) and a read/write buffer (i.e. buff, line 11). The contents of the host memory are then written into the device buffer by the wtr_write_buffer command (line 12), and the wtr_run_kernel command is issued providing buff as an input argument (lines 14-16). The computed result is then retrieved by the wtr_read_buffer command (line 17), which moves data from the device memory back to the host memory. From the runtime system's point of view, the execution of the previous code generates a set of dependent commands structured as the DAG depicted in Fig. 2. The DAG G(V, E) is composed of vertices, i.e. wtr_commands ∈ V, interconnected through directed edges (a, b) ∈ E with a, b ∈ V, or events, which guarantee that the correct order of execution, and therefore the semantics of the input program, is maintained. The set of dependencies associated with a command c ∈ V is defined as c.deps = {v ∈ V | (v, c) ∈ E}. It is worth mentioning that not all libWater library routines generate a corresponding wtr_command. For example, creation, merging and release of events are only meaningful in the root node, therefore there is no need to serialize them. In Fig. 2, each wtr_command carries a descriptor of the form x|y, where x represents the node rank, c.node_id, on which the targeted device, c.dev_id, is hosted, and y is the unique command identifier assigned by the runtime system. As already mentioned, for buffer operations on remote devices (i.e. devices on node 1) explicit data transfers are automatically inserted by the libWater library (e.g. wtr_commands 10 and 14).

Events determine when a wtr_command can be scheduled for execution. The scheduler uses a just-in-time strategy to select the next wtr_command from the local command queue. The logic works as follows: enqueued wtr_commands are analyzed in FIFO order and, for each command, the scheduler checks whether its dependencies - explicitly specified by event objects - are satisfied. If a command has no dependencies, it can be executed. Since the host program generates all commands solely on the root node, scheduling starts at this node. However, a centralized scheduler on a single node is not an effective strategy, since it limits command throughput and thus the overall scalability of the system.

In order to solve this problem, we rely on the fact that the OpenCL runtime system already has the capability of scheduling commands and handling dependencies by using events.

Algorithm 1 Event-based command scheduling

1:  Input: cmd_queue ▷ local FIFO wtr_command queue
2:  Input: my_rank ▷ MPI process rank
3:  while true do
4:      cmd ← cmd_queue.pop()
5:      if cmd.node_id ≠ my_rank then
6:          if ∀ d ∈ cmd.deps : d.node_id = cmd.node_id then
7:              send(cmd, cmd.node_id, SCHED) ▷ delegates cmd to node
8:              continue
9:          end if
10:     else
11:         if ∀ d ∈ cmd.deps : d.dev_id = cmd.dev_id then
12:             issue(cmd.cl_cmd, cmd.deps) ▷ delegates to corresponding device
13:             continue
14:         end if
15:     end if
16:     cmd_queue.push(cmd) ▷ failed to schedule cmd due to pending dependencies
17: end while

It is worth noting that in OpenCL this mechanism is limited, since events cannot be used to perform command synchronization across different contexts. libWater unifies event handling through WTRScheduler instances, which manage inter-context synchronization and offload intra-context synchronization to the OpenCL driver.

We implemented a three-level hierarchical scheduling approach, as described in Algorithm 1. At the top level, the root node of the libWater runtime system pro-actively schedules wtr_commands from the global queue to the targeted cluster nodes. A command cmd, fetched from the command queue, is sent to the target node (i.e. cmd.node_id) only if each of its dependent commands (i.e. the set cmd.deps) is to be executed on the same remote node (lines 6-9). The second-level scheduling is local to each node (lines 11-14). The scheduler checks whether cmd only depends on wtr_commands addressing the same OpenCL device. In that case, the command is enqueued into the corresponding device queue (i.e. cmd.dev_id) and its dependencies are mapped to local OpenCL events. Alternatively, if a wtr_command C1 depends on a second wtr_command C2 scheduled in another context (of the same node), the local WTRScheduler ensures that C1 is not enqueued into the OpenCL device queue before C2 has completed. The third-level scheduling is implemented by the OpenCL runtime system itself, which is responsible for managing the single device queues. If cmd cannot be scheduled, due to unsatisfied dependencies, it is pushed back into the command queue.

Command dependencies are automatically updated when a wtr_command c completes. Locally, a command completion event is generated, and the associated callback function removes, for every command in the local queue, any dependence on c. Additionally, nodes notify the root scheduler with a message triggering a similar completion event internally at node 0. In this way, commands in the global queue waiting for the completion of c can be scheduled - depending on the targeted device - either to a local device or to a remote node. The detailed algorithm can be found in [13].
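A minimal sketch of this completion handling, with hypothetical data structures (the detailed bookkeeping is in [13]): when command done_id completes, it is erased from the dependency set of every queued command, possibly making some of them ready:

#define MAX_DEPS 8

/* Hypothetical queued command: the IDs it still waits for. */
typedef struct {
    int dep_ids[MAX_DEPS];
    int num_deps;           /* command is ready when this reaches 0 */
} wtr_cmd;

/* Completion callback for command 'done_id': drop it from the
   dependency set of every command still sitting in the queue. */
void on_command_completed(wtr_cmd* queue, int queue_len, int done_id) {
    for (int i = 0; i < queue_len; ++i) {
        wtr_cmd* c = &queue[i];
        for (int d = 0; d < c->num_deps; ++d) {
            if (c->dep_ids[d] == done_id) {
                /* Swap-remove, then re-check the swapped-in entry. */
                c->dep_ids[d--] = c->dep_ids[--c->num_deps];
            }
        }
    }
}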

This multi-level scheduling allows the runtime system to hide the costs of scheduling, as well as of data transfers, behind the actual work being done by the devices in the background. The main idea is to use non-blocking semantics when OpenCL commands are scheduled on the corresponding devices. In this way, the WTRScheduler can continuously dispatch commands to other devices or move data from and to the root node. In the example in Fig. 2, commands 0|1 and 0|2 can be executed in parallel. Events at addresses e+0 and e+1 are handled by the root WTRScheduler, since the OpenCL standard does not allow non-blocking semantics for these operations. The remaining commands (i.e. 0|3, 0|4 and 0|5) are inserted asynchronously into the OpenCL device queue of node 0, upon completion of commands 0|1 and 0|2. Events e+2 and e+3 are therefore handled directly by the OpenCL runtime system. Following the same logic, wtr_commands addressing the second OpenCL device (i.e. 1|*) are sent to the node with rank 1. The blocking function wtr_wait_for_events stops the execution of the host until the release operations on both nodes have completed.

5. The Dynamic Collective Replacement (DCR) optimization

The underlying architecture of the libWater runtime system, and the emphasis on events promoted by its interface, enable several runtime optimizations which are transparent to the user. This capability is a direct consequence of adhering to the OpenCL queuing semantics. Indeed, while commands are being enqueued into the system, a command DAG (as shown in Fig. 2) is internally created. Since OpenCL issues commands to the appropriate device only when an explicit flush is invoked by the programmer, the runtime system can analyze large portions of the application DAG and optimize it to improve scalability.

An optimization implemented in the libWater runtime system is the dynamic detection and replacement of collective communication patterns (DCR). Whenever the addressed device is not hosted in the root node, a call to wtr_write_buffer or wtr_read_buffer generates an MPI send or receive operation, respectively. When an OpenCL application is distributed among all available devices, input buffers are usually either split or replicated between compute nodes. This parallelization strategy is common, and it results in a DAG containing several send/receive transfer operations for every device of the cluster. An example is depicted in Fig. 3(a), which represents a realistic DAG resulting from the splitting of an input and an output buffer among a set of N OpenCL devices.

Point-to-point data transfers performed by the libWater runtime system imply an increased latency compared with the native MPI send or receive routines. The reason is the polling mechanism implemented by the libWater runtime system - mainly employed to save node resources - which replaces the spin-lock mechanism commonly used by MPI libraries. Additionally, the number of required data transfers is directly proportional to the number of cluster nodes (and thus devices). This results in a large number of commands being dispatched by the runtime system, which consequently impacts the overall scalability negatively. MPI offers a large set of communication patterns called collective operations [27]. These routines are highly efficient, since nearly all modern supercomputers and high-performance networks provide specialized hardware support for collective operations [25]. Additionally, implementations of such collective operations employ dynamic runtime tuning techniques which choose, among a set of semantically equivalent algorithms, the one that best fits the underlying network topology and architecture [7,32,33].

Related work has analyzed the problem of automatically detecting collective patterns from a set of point-to-point communications. This technique is common in MPI performance tools, which are capable of detecting such patterns via post-mortem analysis of program traces [23]. The general problem of collective communication pattern detection is NP-hard; however, under particular restrictions the problem can be solved in polynomial time. A more recent work [16] proposed a fast solution, with a complexity of O(n log n), which makes the approach more suitable for runtime systems.

The goal of our DCR optimization algorithm is to analyze the command DAG, isolating point-to-point data transfers, and to detect whether a subset of those resembles one of the collective patterns supported by MPI. This is possible since - if the application is carefully written using events for command synchronization - the command DAG is available to the runtime system scheduler before the first blocking command (e.g. wtr_wait_for_event(s)) is invoked. Since all data transfers in our environment have the same root (node 0), the pattern analysis is simplified.

The optimization algorithm consists of two phases. First, the command DAG is traversed and all transfer commands are collected into N separate lists, one per device. Second, pattern analysis is performed on the extracted N lists. The collective pattern check considers elements having the same position within the transfer job lists. Furthermore, the check is simplified by the fact that every send and receive wtr_command carries information about the buffer location (buf) and the amount of bytes being transferred (size). The pattern analysis starts by taking the first transfer wtr_command from each of the N lists and checking it against a supported pattern, i.e. broadcast, scatter or gather. For instance, in a broadcast N send operations are expected where ∀ i, 0 ≤ i < N − 1 : buf_i = buf_{i+1} ∧ size_i = size_{i+1}. If the check fails, the transfer jobs are tested against a scatter or gather pattern: ∀ i, 0 ≤ i < N − 1 : buf_i + size_i = buf_{i+1}.
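As a sketch, the two checks can be written as follows (our illustrative reconstruction; transfer holds the buf and size fields carried by each send/receive wtr_command):

#include <stdbool.h>
#include <stddef.h>

typedef struct { const char* buf; size_t size; } transfer;

/* Broadcast: all N transfers reference the same host buffer region. */
bool is_broadcast(const transfer* t, int n) {
    for (int i = 0; i < n - 1; ++i)
        if (t[i].buf != t[i+1].buf || t[i].size != t[i+1].size)
            return false;
    return true;
}

/* Scatter/gather: the N regions are contiguous, each transfer
   starting exactly where the previous one ends. */
bool is_scatter_or_gather(const transfer* t, int n) {
    for (int i = 0; i < n - 1; ++i)
        if (t[i].buf + t[i].size != t[i+1].buf)
            return false;
    return true;
}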

Once a pattern is recognized, single point-to-point transfers are removed from the command DAG and replaced by the corresponding collective communication operation. A visual example of this optimization is depicted in Fig. 3(a), where multiple send operations are collapsed into a single scatter operation and correspondingly, receives are rewritten as a gather operation. By doing so, dependencies between successive commands are updated in order to keep the semantics of the input program unchanged.

Since collective operations must involve all processes in a communicator, the current implementation of the DCR optimization works only when all the initialized devices participate in the computation. The analysis is therefore limited to regular applications which involve all OpenCL devices in data transfers. This keeps the pattern recognition algorithm simple and fast, which is important since the optimization is applied at runtime.

6. The Device-Host-Device Copy Removal (DHDCR) optimization

Another optimization implemented as part of the libWater runtime system is the detection and optimization of device-host-device copy patterns. As the libWater API closely matches the OpenCL host-device model, it does not include any device-device communication. This limitation stems from the OpenCL platform model, which does not include functions operating on contexts belonging to different platforms. On distributed computing environments, however, this limitation imposes the use of centralized host-device communication instead of more efficient distributed device-device communication.

An example of this problem arises when a buffer which has been distributed over N devices as the output of a first kernel is later used as the input of one or multiple devices of a second kernel. For instance, let us consider the matrix chain multiplication ABCD. As matrix multiplication is associative, we can compute first AB, then CD, and finally the product (AB)(CD). While the first two multiplications work normally, the latter requires device-host-device communications that drastically affect scalability.

To address this issue, we implemented a new optimization which attempts to replace such device-host-device communications with direct device-device data transfers. This optimization, called Device-Host-Device Copy Removal (DHDCR), is implemented as follows. Whenever an application contains calls to wtr_write_buffer and wtr_read_buffer involving devices not belonging to the root node, libWater generates MPI send and receive operations. If a sequence of write, read, write occurs on the same buffer (or on part of the same buffer), then this sequence is a candidate for optimization.

(a) DCR optimization. (b) DHDCR optimization.

Fig. 3. libWater runtime DAG optimizations. On the left, N single point-to-point transfers are replaced by one corresponding collective communication operation (scatter or gather); on the right, two consecutive device-to-host and host-to-device transfers are replaced by a device-to-host and a device-to-device transfer that can be completely overlapped.

Table 1
Application codes used for libWater evaluation.

Application   OpenCL LOC   libWater LOC   Input size                          Input/Output buffers (splittable)   Short description
PerlinNoise   412          301            20K x 20K                           0(0)/1(1)                           Gradient noise generator
NBody         450          324            600K bodies                         2(0)/2(2)                           N-body simulation
kNN           234          101            ref: 8M, query: 80K                 2(1)/2(2)                           k-nearest neighbor
Floyd         222          113            Vertices 8K, adjacency matrix 64K   1(0)/1(1)                           Floyd-Warshall
MatrixMul     219          104            7K x 7K (A = B = C)                 2(1)/1(1)                           Matrix multiplication
LinReg        298          149            1000K                               4(2)/1(1)                           Linear regression

Once the pattern is recognized, the two consecutive device-to-host and host-to-device transfers are removed from the command DAG and replaced by a single device-to-device transfer. A visual example of this optimization is depicted in Fig. 3(b). The TransJob generated by the DHDCR optimization is a wtr_command which the root scheduler dispatches on both nodes involved in the data transfer (nodes 1 and 2 in the example); the other nodes are not involved. However, in order to keep the host semantics of the program unchanged, the updated value of the buffer (generated by node 1) must also be copied back to the host node. Therefore a RecvJob command is generated to collect the buffer. The main difference with the original code is that this operation can be completely overlapped with the execution of the second kernel on node rank 2.

Note that simple applications, such as the ones listed in Table 1, only exhibit a simple pattern (write, run kernel, read) and offer no opportunity to apply DHDCR. More complex applications, however, usually consist of several kernels with non-trivial inter-node data transfers and are more suitable for this optimization (e.g. matrix chain multiplication).
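A minimal sketch of the pattern detection (hypothetical types; the real pass operates on the command DAG and also handles partial buffer overlaps): a read from a remote device immediately followed by a write of the same buffer to a different remote device is a candidate for a direct TransJob:

typedef enum { XFER_RECV /* device-to-host (read) */,
               XFER_SEND /* host-to-device (write) */ } xfer_kind;

typedef struct { xfer_kind kind; int buf_id; int node; } xfer;

/* Scan consecutive transfers collected from the command DAG and count
   read/write pairs that qualify for a device-to-device TransJob. Each
   such pair would be replaced by a TransJob from ops[i].node to
   ops[i+1].node; a RecvJob to the root is kept so the host copy of the
   buffer stays consistent (see Fig. 3(b)). */
int count_dhd_candidates(const xfer* ops, int n) {
    int candidates = 0;
    for (int i = 0; i + 1 < n; ++i) {
        if (ops[i].kind == XFER_RECV && ops[i+1].kind == XFER_SEND &&
            ops[i].buf_id == ops[i+1].buf_id &&
            ops[i].node != ops[i+1].node)
            ++candidates;
    }
    return candidates;
}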

7. Experimental evaluation

We used libWater to encode 6 computational kernels, some of them taken from OpenCL benchmarking suites (i.e. AMD and IBM), and studied their scalability. Four of the kernels are optimized for local memory: PerlinNoise (from IBM), NBody (from AMD), and Floyd and kNN, which we wrote manually. For the remaining two codes, MatrixMul and LinReg, we used naive implementations that are not optimized for local memory. Table 1 shows, for each kernel, the number of input and output buffers used by the kernel. We define a buffer as splittable when its content can be distributed among the devices. The nature of a buffer is strictly related to the algorithm implemented within the OpenCL kernel, and thus to the application. Non-splittable buffers are always replicated on every device. None of the six applications used in our study contains unsplittable output buffers. In the presence of such buffers, merging the results coming from different devices would generate memory consistency issues that libWater is currently not able to handle. Table 1 also shows the reduction, in terms of lines of code, achieved when the application is written using our library. It is worth mentioning that while the original OpenCL applications were single-device codes, the libWater-based implementations are multi-device codes. On average, we were able to reduce the lines of host code by approximately a factor of 2, due to the higher-level abstractions provided by libWater.

For the scalability analysis we used two large-scale production clusters, the Vienna Scientific Cluster VSC2 [38] and the MinoTauro GPU cluster of the Barcelona Supercomputing Center [37]. A second study was conducted to test the suitability of libWater for exploiting the computational capabilities of a heterogeneous cluster configuration. For this purpose we used the Ortler cluster at the University of Innsbruck, composed of three heterogeneous compute nodes (i.e. mc1, mc2 and mc3). The hardware details of the clusters are given in Table 2.

7.1. VSC2 CPU cluster

The applications shown in Table 1 were executed on the VSC2 CPU cluster. We were able to access up to 64 nodes with a total of 1024 CPU cores. Since the 2 AMD CPUs hosted in each node are considered by the OpenCL driver as a single device, the speedup was computed based on the number of compute nodes (and thus OpenCL devices) instead of single CPU cores.

Table 2
The experimental target architectures.

Site               Vienna Scientific Cluster   BSC
Cluster            VSC2                        MinoTauro GPU Cluster
Max # of nodes     1314                        128
Processors         2 x AMD Opteron 6132 HE     2 x Intel Xeon E5649
Cores per node     2 x 8                       2 x 6
Clock frequency    2.2 GHz                     2.5 GHz
Memory per node    32 GB DDR3                  24 GB DDR3
GPUs               -                           2 x Nvidia M2090
Interconnection    Infiniband 4x QDR           Infiniband 4x QDR
Open MPI version   1.6.1                       1.6.1
OpenCL version     AMD APP 2.6                 CUDA 4.1

Site                   University of Innsbruck (cluster Ortler)
Node                   mc1                        mc2                        mc3
Processors             2 x Intel Xeon E5-2690v2   2 x Intel Xeon E5-2690v2   2 x Intel Xeon E5-2690v2
Cores per node         2 x 10                     2 x 10                     2 x 10
Clock frequency        3.0 GHz                    3.0 GHz                    3.0 GHz
Memory per node        128 GB DDR3                128 GB DDR3                128 GB DDR3
GPUs or accelerators   2 x AMD FirePro S9000      2 x NVIDIA Tesla K20m      2 x Intel Xeon Phi 7120P
Interconnection        Infiniband 4x QDR          Infiniband 4x QDR          Infiniband 4x QDR
Open MPI version       1.6.5                      1.6.5                      1.6.5
OpenCL version         AMD APP 2.9                CUDA 5.5                   XE 2013 R3

The workload partitioning is implemented, for each test case, by assigning an equal amount of work to each OpenCL device.

The scalability tests were performed in the following way: the original OpenCL version of each application was executed on a single node, and its execution time was used as a reference measurement. libWater was then used for node counts ranging from 2 to 64. The differences between the original application codes and the versions written using libWater are mainly in the host code. The kernel code was slightly modified only to forward the offset value used by the workload partitioning (as shown in Listing 1). We computed the ideal scaling for each application by dividing the reference execution time by the number of nodes. We conducted experiments with libWater using two different settings: the first, named baseline, uses the runtime system without dynamic optimizations enabled; the second, DCR, uses the collective pattern replacement mechanism described in Section 5. The results of our experiments are depicted in Fig. 4.

For each of the six applications, we show the execution time (in seconds) for up to 64 nodes and the corresponding speedup with respect to a single node. Overall, we observe that our approach scales almost linearly, especially for codes using few input/output buffers. PerlinNoise, Fig. 4(a), is one example, since it has no dependencies on input buffers and the data produced by the kernel is distributed between the devices. For this code, the baseline configuration of our runtime system achieves a speedup of 53 on 64 nodes, and thus an efficiency of 83%. When the number and size of the input/output buffers increase, the efficiency of our system decreases. The worst case is represented by the LinReg application, Fig. 4(f), which stops scaling after 32 nodes. This kernel has 4 input buffers, 2 of which are not splittable (because of dependencies within the kernel code) and must therefore be replicated on every node. The remaining 2 input and output buffers are splittable. For this code we observe an immediate decrease of efficiency (to 75% on two nodes), because the kernel execution is delayed by the several wtr_commands that are executed (and transferred to the target nodes) to create and initialize the input/output buffers. However, this delay is constant, and system efficiency remains almost unchanged up to 16 nodes. On 32 and 64 nodes the efficiency of the baseline runtime system starts to decrease significantly.
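Throughout the evaluation, efficiency is speedup normalized by the number of devices, i.e. (standard definition, our notation) E(N) = S(N)/N = T(1)/(N · T(N)); for PerlinNoise above, E(64) = 53/64 ≈ 0.83.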

This problem is largely addressed by the dynamic collective pattern replacement (DCR) optimization introduced in Section 5. This optimization reduces the load on the scheduler, since it replaces several single transfer jobs with one collective operation. In LinReg, this optimization improves the scalability of the system by a factor of 2, achieving an efficiency of 55%. Only a small effect of this optimization can be observed for smaller node configurations, because collective operations are optimized for large numbers of nodes. An interesting result is the effect of the DCR optimization on the PerlinNoise test case, where it fails to improve performance over the baseline. The reason is that collective operations are blocking, while point-to-point communications in the runtime system are non-blocking, thereby allowing multiple transfers to overlap. The synchronization costs introduced by the gather operation are therefore not properly compensated by the amount of exchanged data. We believe that this problem can be eliminated by using the non-blocking collective routines which have been introduced in the latest MPI standard [27] and will soon be available in mainstream MPI libraries. Additionally, since this optimization is applied dynamically, and the amount of data being transferred is therefore known to the scheduler, heuristics can be integrated to decide when the optimization should be applied.

On average, libWater achieves an efficiency of 80% on 32 nodes and 64% when 64 nodes are used. Without the DCR optimization, the system has an efficiency of 47% on 64 nodes. The DCR optimization therefore improves the system efficiency on 64 nodes by 17% (from 47% to 64%), and we expect this gain to increase proportionally with the number of nodes.

To show the effectiveness of the Device-Host-Device Copy Removal (DHDCR) optimization, we conducted another experiment on the VSC2 cluster. Using the libWater library, we manually coded a multi-device version of the matrix chain multiplication ABCD. We ran the experiment using two different settings: the first (baseline) uses the runtime system without the optimization, while the second (DHDCROpt) uses the device-host-device copy removal mechanism described in Section 6. Notice that in both cases the DCR optimization is also performed. When both runtime optimizations are enabled, the optimizer first tries to rewrite indirect data transfers into direct ones (using the TransJob command); then, in a second pass, DCR is applied. To optimize the execution even further, the DCR analysis has been updated to also take TransJob commands into account during the collective pattern analysis phase.

The results of our experiments are depicted in Fig. 5.

Fig. 4. Strong scaling of libWater on the VSC2. (a) PerlinNoise. (b) NBody. (c) Floyd. (d) kNN. (e) MatrixMul. (f) LinReg. Each panel plots execution time (in secs.) and speedup against the number of nodes, for the baseline and DCROpt configurations, together with the ideal scaling.

Fig. 5. Strong scaling of matrix chain multiplication on the VSC2 Cluster: execution time (in secs.) and speedup against the number of nodes.

For this application, we show the execution time (in seconds) for up to 16 nodes and the corresponding speedup with respect to a single node. The baseline approach scales almost linearly up to 8 nodes, with an efficiency of 87%. On 16 nodes the runtime system efficiency decreases significantly, reaching 48%. The main reason is the high communication overhead caused by the unnecessary copies of intermediate buffers to the root node. Before proceeding with the (AB)(CD) operation, the results of AB and CD have to be gathered by the root scheduler and then distributed again to the remaining nodes. While the buffer containing AB can be directly reused, the result of CD could be copied to the remaining nodes using a more efficient collective pattern, MPI_Allgather. In this paper, only the former redundant copy is automatically detected and removed; the latter is replaced by an MPI_Gather and an MPI_Bcast by the DCR optimization.

The benefits of this optimization start to show with a large number of nodes, because of the increased pressure on the root scheduler. For smaller node counts, the data movement of AB is completely overlapped with computation, so that by the time AB is distributed to the nodes, CD is also available and the final computation can start without any delay. For larger node counts, the execution of the last kernel is delayed, since there is not enough computation (kernel execution becomes shorter as more devices are used) to overlap the communication overhead. This causes a noticeable decrease in efficiency. By avoiding this communication, the DHDCR optimization improves the speedup from 7.6 to 10, achieving an efficiency improvement of 15%.

7.2. MinoTauro GPU cluster

Another scalability study was conducted by executing the N-body simulation described in Table 1, line 2, on a GPU cluster. We were able to access up to 32 nodes of the MinoTauro cluster, with a total of 64 GPU devices. In all experiments, the workload was equally partitioned between the available devices. The optimization of N-body simulations on GPUs is an active research problem [39,6,15,18]. The problem is well known to be suitable for the GPU architecture and, for high numbers of particles, for clusters of GPUs.

Fig. 6. Strong scaling of NBody on the MinoTauro BSC GPU Cluster. (a) NBody 2400K. (b) NBody 4800K. (c) NBody 9600K. Each panel plots execution time (in secs.) and speedup against the number of devices (1 to 64).

Table 3
Performance of NBody and LinReg on the heterogeneous cluster for different combinations of GPUs and accelerators.

                         Workload partition configurations
         Device          C1      C2      C3      C4      C5      C6      C7      C8
NBody    mc1-GPU1        100%    50%     -       -       -       -       23%     22.5%
         mc1-GPU2        -       50%     -       -       -       -       23%     22.5%
         mc2-GPU1        -       -       100%    50%     -       -       27%     26.5%
         mc2-GPU2        -       -       -       50%     -       -       27%     26.5%
         mc3-ACL1        -       -       -       -       100%    50%     -       1%
         mc3-ACL2        -       -       -       -       -       50%     -       1%
         Ex. time (s)    42.2    21.2    35.9    18.2    659.6   335.8   9.9     9.7

LinReg   mc1-GPU1        100%    50%     -       -       -       -       15%     11%
         mc1-GPU2        -       50%     -       -       -       -       15%     11%
         mc2-GPU1        -       -       100%    50%     -       -       -       14%
         mc2-GPU2        -       -       -       50%     -       -       -       14%
         mc3-ACL1        -       -       -       -       100%    50%     35%     25%
         mc3-ACL2        -       -       -       -       -       50%     35%     25%
         Ex. time (s)    15.5    7.8     11.8    6.0     6.9     3.9     3.2     2.8

We ran the NBody test case using 3 different input sizes, which show the benefit of using a high number of GPUs for large numbers of bodies. The results of our experiments are depicted in Fig. 6. The 3 tests were conducted with input sizes of 2.4 (Fig. 6(a)), 4.8 (Fig. 6(b)) and 9.6 (Fig. 6(c)) million bodies. With the smallest input size, the application scales almost linearly up to 16 GPUs and stops scaling after 32 GPUs. Doubling the input size increases the execution time by a factor of 4, due to the quadratic complexity of the implemented algorithm. With input sizes of 4.8 and 9.6 million bodies the application becomes more suitable for a GPU cluster, and with the largest tested input size it achieves a speedup of around 49 on 64 GPUs, with an efficiency of 77%. It is worth mentioning that in such an environment it is important, from a user perspective, to find a trade-off between the number of devices and the desired efficiency.

7.3. Ortler heterogeneous cluster

Since OpenCL allows access to heterogeneous devices, we conducted a second experiment which demonstrates libWater on a heterogeneous cluster as described in Table 2. In order to run applications in such an environment, the input code was rewritten so that the workload distribution could be controlled via command-line arguments. It is worth mentioning that workload partitioning for heterogeneous architectures is still an active research problem [12,24,19,14]. However, this aspect is completely orthogonal to our library and, for the sake of this experiment, we derived workload partitionings empirically. We ran the NBody and the LinReg test cases using different combinations of devices. For each device configuration, several workload splittings were tested and the fastest one was chosen. The partitionings and their corresponding execution times are shown in Table 3. For example, in NBody, configuration C1 assigns the entire workload to the first GPU of node mc1; the execution time for this configuration is 42.2 s. By splitting the workload equally between the two GPUs on the same node (C2), we double the performance. Among the single devices, the NVIDIA Tesla K20m is the fastest, requiring 35.9 s to complete the work. However, libWater can be used to improve the execution time even further: the overall execution time can be reduced by about 70% using the workload partition described by configuration C8, which assigns 22.5% to each GPU in mc1, 26.5% to each GPU in mc2 and the remaining 1% to each accelerator in mc3. For LinReg the results are different, since the execution times of the individual devices are more balanced. The best performance is achieved in this case by splitting the workload between the nodes, assigning 11% to each GPU in mc1, 14% to each GPU in mc2 and 25% to each accelerator in mc3.
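As an illustration of how such a static split can be wired into a host program, the following minimal C sketch partitions the iteration space according to per-device percentages passed on the command line. All identifiers and the split logic are our own illustration under these assumptions; they are not part of the libWater API.

/* Minimal sketch: split n_bodies across devices according to the
 * percentages given on the command line, e.g. "22.5 22.5 26.5 26.5 1 1".
 * Illustrative only; not part of the libWater API. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const size_t n_bodies = 9600000;     /* total problem size        */
    int n_devices = argc - 1;            /* one percentage per device */
    size_t offset = 0;

    for (int d = 0; d < n_devices; ++d) {
        double pct = atof(argv[d + 1]);  /* share for device d        */
        size_t chunk = (size_t)((double)n_bodies * pct / 100.0);
        if (d == n_devices - 1)          /* last device absorbs any   */
            chunk = n_bodies - offset;   /* rounding residue          */
        printf("device %d: bodies [%zu, %zu)\n", d, offset, offset + chunk);
        /* here the kernel for this sub-range would be enqueued, e.g.
         * via clEnqueueNDRangeKernel with a matching global offset   */
        offset += chunk;
    }
    return 0;
}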

7.4. Results summary

In this section, we analyzed the performance of libWater on three different compute clusters. On the VSC2 CPU cluster, the library achieves on average an efficiency of 80% on 32 nodes and 64% on 64 nodes. These results include the DCR optimization, which in the case of 64 nodes improves the system efficiency by 17%. On the same cluster, we also tested the DHDCR optimization, showing an efficiency improvement of 15% over the baseline matrix chain multiplication implementation. On the MinoTauro GPU cluster we executed the NBody application with different numbers of bodies, achieving a speedup of 49 on 64 GPUs with an efficiency of 77% for the largest tested input size. This result shows that the hierarchical scheduling approach described in Algorithm 1 is able to handle multiple devices per node without compromising the overall scalability of the system. Finally, we executed the NBody and the LinReg applications using different combinations of devices on the Ortler heterogeneous cluster. The results of this experiment demonstrate that, despite the higher latencies caused by additional data transfers between host and device memory, non-blocking communication yields good scalability even for heterogeneous architectures.

8. Related work

In recent years, heterogeneous systems have received a great amount of attention from the research community. Although several projects have recently been proposed to facilitate the programming of clusters with heterogeneous nodes [22,5,9,3,1,20,11,29,40,30], none of them combines support for high-performance inter-node data transfers, support for a wide range of different devices, and a simplified programming model. Our work takes all of these aspects into account through the development of the libWater library.

Kim et al. [22] proposed the SnuCL framework, which extends the original OpenCL semantics to heterogeneous cluster environments. Their work is closely related to ours. SnuCL relies on the OpenCL language with a few extensions to directly support the collective patterns of MPI. Indeed, in SnuCL it is the programmer's responsibility to take care of efficient data transfers between nodes. In that sense, end users of the SnuCL platform need to understand the semantics of MPI collective calls in order to write scalable programs. This deeply differs from our system, where such optimizations are transparently applied by the libWater runtime system.

Other works have also investigated the problem of extending the OpenCL semantics to access a cluster of nodes. The Many GPUs Package (MGP) [5] is a library and runtime system that, using the MOSIX VCL layer, enables unmodified OpenCL applications to be executed on clusters. Hybrid OpenCL [3] is based on the FOXC OpenCL runtime and extends it with a network layer that allows access to devices in a distributed system. The clOpenCL [1] platform comprises a wrapper library and a set of user-level daemons: every call to an OpenCL primitive is intercepted by the wrapper, which redirects its execution to a specific daemon at a cluster node or to the local runtime. dOpenCL [20] extends the OpenCL standard such that arbitrary compute devices installed on any node of a distributed system can be used together within a single application. Distributed OpenCL [11] is a framework that allows the distribution of computing processes to many resources connected via a network, using JSON RPC as the communication layer. OpenCL Remote [29] is a framework which extends both OpenCL's platform model and memory model with a network client-server paradigm. Virtual OpenCL [40], based on the OpenCL programming model, exposes physical GPUs as decoupled virtual resources that can be transparently managed independently of the application execution.

While the objectives of these approaches are similar to ours, none of them provides an abstraction layer that reduces the complexity associated with OpenCL development and, furthermore, they show very limited scalability on clusters of 4-8 compute nodes. In particular, none of them employs dynamic communication optimizations as we do.

Besides OpenCL-based approaches, CUDA-based solutions have also been proposed to simplify the programming of distributed systems. CUDASA [35] is an extension of the CUDA programming language which extends parallelism to multi-GPU systems and GPU-cluster environments. rCUDA [10] is a distributed implementation of the CUDA API that enables shared remote GPGPU acceleration in HPC clusters. cudaMPI [26] is a message passing library for distributed-memory GPU clusters that extends the MPI interface to work with data stored on the GPU using the CUDA programming interface. All of these approaches are limited to devices that support CUDA, i.e. NVIDIA GPU accelerators, and therefore they cannot be used to address heterogeneous systems which combine CPUs and accelerators from different vendors.

Other projects have investigated how to simplify the OpenCL programming interface. Sun et al. [36] proposed a task queueing extension for OpenCL that provides a high-level API based on the concepts of work pools and work units. Intel CLU [17], OCL-MLA [28] and SimpleOpenCL [34] are lightweight APIs designed to help programmers rapidly prototype heterogeneous programs. DIANA [30] provides a common interface that hides the complexity of managing different application programming interfaces (APIs) and libraries for different many-core devices. OmpSs [8] relies on user directives to avoid boilerplate OpenCL host code configuration and generates a DAG for task scheduling purposes. FastFlow [2] is a structured parallel programming framework targeting clusters of multi-core workstations. StarPU [4] provides a runtime system and programming language extensions to support a task-based programming model on clusters. Besides its simplified interface, libWater, differently from these approaches, provides fine-grained control over device selection (i.e., DQL) and improved device synchronization based on events.

9. Conclusions

In this paper, we introduced libWater, a library for simplifying the programming of heterogeneous distributed systems.

The proposed interface demonstrates that raising the abstraction level of the OpenCL programming model is possible without losing control over performance. We showed, with an example, how a multi-device distributed host program can be written in approximately 25 lines of code. By defining a simple but powerful device query language (DQL), libWater simplifies the management and discovery of a large number of OpenCL devices. The simple API also makes the library a convenient target for automatic code generation tools, so it can be easily integrated into compilers.
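To convey the flavor of such query-based device discovery, the following self-contained C toy filters a mock cluster table with a predicate that corresponds, conceptually, to a query such as "node < 8 AND type = GPU". The query syntax and all identifiers here are our own assumptions for illustration; the actual DQL grammar and host API are defined in [13].

/* Toy illustration of declarative device selection in the spirit of
 * DQL [13]; the query syntax and identifiers are hypothetical. */
#include <stdio.h>
#include <string.h>

typedef struct { int node; const char *type; } device_t;

/* a mock cluster: a few devices spread over several nodes */
static const device_t cluster[] = {
    {0, "CPU"}, {0, "GPU"}, {1, "GPU"}, {1, "GPU"},
    {2, "ACC"}, {2, "GPU"}, {9, "GPU"},
};

int main(void) {
    /* conceptually: SELECT devices WHERE node < 8 AND type = GPU */
    for (size_t i = 0; i < sizeof cluster / sizeof *cluster; ++i)
        if (cluster[i].node < 8 && strcmp(cluster[i].type, "GPU") == 0)
            printf("matched: node %d, %s\n", cluster[i].node, cluster[i].type);
    return 0;
}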

libWater's interface is tightly bound to a lightweight distributed runtime system which is designed from scratch for high scalability and low resource usage. Because of the non-blocking semantics promoted by the library interface, commands can be organized by the runtime system into a DAG to be used for dynamic analysis and optimizations.
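Conceptually, each command in this DAG carries the dependency edges induced by its event waits; a command becomes eligible for execution once all of its predecessors have completed, and DAG-level rewrites such as collective pattern detection and device-host-device copy removal operate on this structure. The following simplified C declarations are our own sketch of such a node and do not reflect libWater's internal data structures.

/* Simplified sketch of a command-DAG node; field names are our own
 * illustration, not libWater's internal representation. */
#include <stddef.h>

typedef enum { CMD_WRITE_BUF, CMD_RUN_KERNEL, CMD_READ_BUF } cmd_kind;

typedef struct cmd_node {
    cmd_kind          kind;       /* what the command does            */
    int               device_id;  /* target device in the cluster     */
    struct cmd_node **deps;       /* commands that must finish first  */
    size_t            n_deps;     /* (edges induced by event waits)   */
    int               completed;  /* set by the runtime on completion */
} cmd_node;

/* A command may be issued once all of its dependencies completed. */
int cmd_is_ready(const cmd_node *c) {
    for (size_t i = 0; i < c->n_deps; ++i)
        if (!c->deps[i]->completed)
            return 0;
    return 1;
}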

We studied the performance of the library on three compute clusters, demonstrating the high efficiency that the system can achieve.

libWater will be released as an open-source project with the goal of becoming a research platform to investigate performance aspects of heterogeneous and distributed HPC architectures.

Acknowledgments

This project was funded by the FWF Austrian Science Fund as part of project TRP 220-N23 "Automatic Portable Performance for Heterogeneous Multi-cores" and by the FWF Doctoral School CIM Computational Interdisciplinary Modelling under contract W01227. The computational results presented have been achieved in part using the Vienna Scientific Cluster 2 (VSC2). We would also like to thank the Barcelona Supercomputing Center for the availability of the MinoTauro GPU cluster.

References

[1] A. Alves, J. Rufino, A. Pina, L.P. Santos, clOpenCL—supporting distributed heterogeneous computing in HPC clusters, in: 10th International Workshop HeteroPar, 2012.

[2] M. Aldinucci, S. Campa, M. Danelutto, P. Kilpatrick, M. Torquati, Targeting distributed systems in FastFlow, in: Euro-Par Workshops, 2012, pp. 47-56.

[3] R. Aoki, S. Oikawa, T. Nakamura, S. Miki, Hybrid OpenCL: enhancing OpenCL for distributed processing, in: ISPA, 2011, pp. 149-154.

[4] C. Augonnet, O. Aumage, N. Furmento, R. Namyst, S. Thibault, StarPU-MPI: task programming over clusters of machines enhanced with accelerators, in: EuroMPI, 2012, pp. 298-299.

[5] A. Barak, T. Ben-Nun, E. Levy, A. Shiloh, A package for OpenCL based heterogeneous computing on clusters with many GPU devices, in: Workshop PPAC, 2010, pp. 224-231.

[6] J. Bédorf, E. Gaburov, S.P. Zwart, A sparse octree gravitational N-body code that runs entirely on the GPU processor, J. Comput. Phys. 231 (7) (2012) 2825-2839.

[7] J. Bruck, C.-T. Ho, S. Kipnis, D. Weathersby, Efficient algorithms for all-to-all communications in multi-port message-passing systems, in: SPAA, 1994, pp. 298-309.

[8] J. Bueno, J. Planas, A. Duran, R.M. Badia, X. Martorell, E. Ayguade, J. Labarta, Productive programming of GPU clusters with OmpSs, in: IPDPS, 2012, pp. 557-568.

[9] T. Diop, S. Gurfinkel, J. Anderson, N.E. Jerger, DistCL: a framework for the distributed execution of OpenCL kernels, in: MASCOTS, 2013.

[10] J. Duato, A.J. Peña, F. Silla, R. Mayo, E.S. Quintana-Ortí, rCUDA: reducing the number of GPU-based accelerators in high performance clusters, in: HPCS, 2010, pp. 224-231.

[11] B. Eskikaya, D.T. Altilar, Distributed OpenCL: distributing OpenCL platform on network scale, in: IJCA, Vol. ACCTHPCA 2, 2012, pp. 26-30.

[12] I. Grasso, K. Kofler, B. Cosenza, T. Fahringer, Automatic problem size sensitive task partitioning on heterogeneous parallel systems, in: PPoPP, 2013.

[13] I. Grasso, S. Pellegrini, B. Cosenza, T. Fahringer, libWater: heterogeneous distributed computing made easy, in: ICS, 2013, pp. 161-172.

[14] D. Grewe, M.F. O'Boyle, A static task partitioning approach for heterogeneous systems using OpenCL, in: CC, 2011.

[15] T. Hamada, K. Nitadori, 190 TFlops astrophysical N-body simulation on a cluster of GPUs, in: SC, 2010.

[16] T. Hoefler, T. Schneider, Runtime detection and optimization of collective communication patterns, in: PACT, 2012, pp. 263-272.

[17] Intel Corporation, Computing Language Utility, http://software.intel.com/.

[18] P. Jetley, L. Wesolowski, F. Gioachin, L.V. Kalé, T.R. Quinn, Scaling hierarchical N-body simulations on GPU clusters, in: SC, 2010.

[19] K. Ma, X. Li, W. Chen, C. Zhang, X. Wang, GreenGPU: a holistic approach to energy efficiency in GPU-CPU heterogeneous architectures, in: ICPP, 2012.

[20] P. Kegel, M. Steuwer, S. Gorlatch, dOpenCL: towards a uniform programming approach for distributed heterogeneous multi-/many-core systems, in: IPDPS Workshops, 2012, pp. 174-186.

[21] Khronos OpenCL Working Group, The OpenCL 1.2 specification, http://www.khronos.org/opencl, 2013.

[22] J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, J. Lee, SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters, in: ICS, 2012, pp. 341-352.

[23] A. Knüpfer, D. Kranzlmüller, W.E. Nagel, Detection of collective MPI operation patterns, in: PVM/MPI, 2004, pp. 259-267.

[24] K. Kofler, I. Grasso, B. Cosenza, T. Fahringer, An automatic input-sensitive approach for heterogeneous task partitioning, in: ICS, 2013.

[25] S. Kumar, G. Dozsa, G. Almasi, P. Heidelberger, D. Chen, M.E. Giampapa, M. Blocksome, A. Faraj, J. Parker, J. Ratterman, B. Smith, C.J. Archer, The deep computing messaging framework: generalized scalable message passing on the Blue Gene/P supercomputer, in: ICS, 2008, pp. 94-103.

[26] O.S. Lawlor, Message passing for GPGPU clusters: cudaMPI, in: CLUSTER, 2009, pp. 1-8.

[27] MPI Forum, MPI: A Message-Passing Interface Standard, Version 3, http://www.mpi-forum.org, 2012.

[28] OCL-MLA, http://tuxfan.github.com/ocl-mla/.

[29] R. Özaydin, D.T. Altilar, OpenCL Remote: extending OpenCL platform model to network scale, in: HPCC-ICESS, 2012, pp. 830-835.

[30] A. Panagiotidis, D. Kauker, S. Frey, T. Ertl, DIANA: a device abstraction framework for parallel computations, in: PARENG, 2011.

[31] A. Panagiotidis, D. Kauker, F. Sadlo, T. Ertl, Distributed computation and large-scale visualization in heterogeneous compute environments, in: ISPDC, 2012, pp. 87-94.

[32] J. Pjesivac-Grbovic, T. Angskun, G. Bosilca, G.E. Fagg, E. Gabriel, J. Dongarra, Performance analysis of MPI collective operations, Cluster Comput. 10 (2) (2007) 127-143.

[33] P. Sanders, J. Speck, J.L. Träff, Two-tree algorithms for full bandwidth broadcast, reduction and scan, Parallel Comput. 35 (12) (2009) 581-594.

[34] Simple OpenCL, http://code.google.com/p/simple-opencl/.

[35] M. Strengert, C. Müller, C. Dachsbacher, T. Ertl, CUDASA: compute unified device and systems architecture, in: EGPGV, 2008, pp. 49-56.

[36] E. Sun, D. Schaa, R. Bagley, N. Rubin, D. Kaeli, Enabling task-level scheduling on heterogeneous platforms, in: Workshop GPGPU, 2012, pp. 84-93.

[37] The MinoTauro GPU Cluster, http://www.bsc.es/marenostrum-support-services/other-hpc-facilities/nvidia-gpu-cluster, 2013.

[38] The Vienna Scientific Cluster 2, http://www.vsc.ac.at/, 2013.

[39] W. Wang, H. Wang, D. Guo, H. Wei, G. Zeng, Parallel time-space processing model based fast N-body simulation on GPUs, in: PMAM, 2013.

[40] S. Xiao, W.-c. Feng, Generalizing the utility of GPUs in large-scale heterogeneous computing systems, in: IPDPS Workshops, 2012, pp. 2554-2557.

Ivan Grasso is a Ph.D. student at the University of Innsbruck, Austria. He received an M.Sc./diploma degree in Computer Science from the University of Pisa, Italy, in 2010. His research includes high-performance heterogeneous distributed computing, multi-/many-core architectures and compilers. The main aspects of his research are performance, portability and productivity.

Simone Pellegrini is a Ph.D. student at the University of Innsbruck, Austria. He received an M.Sc./diploma degree in Software Engineering from the University of Bologna, Italy, in 2006. In 2007 and 2008 he was a researcher at the National Institute of Nuclear Physics (INFN) in Bologna, where he was involved in the development of a workflow management system for the EGEE/gLite GRID middleware. During his Ph.D. his research focused on message passing optimizations for HPC. His contributions include compiler static analysis and transformations of MPI programs and an automated tool for tuning Open MPI's runtime parameters.

Biagio Cosenza is a Post-Doctoral Researcher at the University of Innsbruck, Austria. He received an M.Sc./diploma degree in Computer Science from the University of Salerno, Italy, in 2007 and a Ph.D. in Computer Science in 2011. He has been the recipient of two HPC-Europa grants in 2008 and 2009, a DAAD scholarship in 2010, and a Cineca ISCRA grant in 2010. His research includes compilers, high-performance computing, parallel computing, and applications to computer graphics and visualization.

Thomas Fahringer is a professor of computer science at the Institute of Computer Science of the University of Innsbruck (UIBK). His research interests include programming languages, performance tools, debuggers, compiler analysis and program optimization based on advanced technologies such as genetic algorithms and machine learning. He has a long research history in parallel and distributed high-performance systems and their applications, starting in 1988. He has published over 160 papers in the area of performance-oriented parallel and distributed high-performance computing, including 30 journal articles and 5 books.