Advances in Engineering Software
Research paper
Parallelisation of an interactive lattice-Boltzmann method on an Android-powered mobile device
Adrian R.G. Harwood a,*, Alistair J. Revell b
a Research Associate, School of Mechanical, Aerospace and Civil Engineering, The University of Manchester, Sackville Street, M1 3BB, United Kingdom
b Senior Lecturer, School of Mechanical, Aerospace and Civil Engineering, The University of Manchester, Sackville Street, M1 3BB, United Kingdom
ARTICLE INFO
Article history: Received 16 July 2016; Revised 18 November 2016; Accepted 22 November 2016; Available online 6 December 2016
Keywords: Android; Mobile computing; Interactive simulation; Lattice Boltzmann method; Java concurrency
ABSTRACT
Engineering simulation is essential to modern engineering design, although it is often a computationally demanding activity which can require powerful computer systems to conduct a study. Traditionally the remit of large desktop workstations or off-site computational facilities, potential is now emerging for mobile computation, whereby the unique characteristics of portable devices are harnessed to provide a novel means of engineering simulation. Possible use cases include emergency service assistance, teaching environments, augmented reality or indeed any such case where large computational resources are unavailable and a system prediction is needed. This is particularly relevant if the required accuracy of a calculation is relatively low, such as cases where only an intuitive result is required. In such cases the computational resources offered by modern mobile devices may already be adequate. This paper proceeds to discuss further the possibilities that modern mobile devices might offer to engineering simulation and describes some initial developments in this direction. We focus on the development of an interactive fluid flow solver employing the lattice Boltzmann method, and investigate both task-based and thread-based parallel implementations. The latter is more traditional for high performance computing across many cores while the former, native to Android, is simpler to implement and returns a slightly higher performance. The performance of both saturates when the number of threads/tasks equals three on a quad-core device. Execution time is improved by a further 20% by implementing the kernel in C++ and cross-compiling using the Android NDK.
© 2016 The Authors. Published by Elsevier Ltd.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction
Modelling and simulation is an integral part of modern engineering as it allows the user to improve their understanding of physical scenarios and complex systems. Depending on the context, this knowledge may be used in a variety of ways; e.g., to inform a design decision, to aid in the education of key concepts, or to identify the level of risk for a given scenario. Due, in part, to both improved understanding and perpetually increasing computational power, we have become accustomed to a regular increase in the accuracy of these simulations. The calculations themselves are generally conducted on high-end computer facilities, either housed locally or accessed via a high-bandwidth interconnect to a high performance computing (HPC) facility. Due to the nature of the software and the skills required to manage and process the data, there are well defined processes in place to assure the quality of the simulation results. For these reasons, and others, the running of computer simulations tends to fall under the remit of an experienced engineer and is typically orchestrated from a desk-based computer.
* Corresponding author. E-mail address: adrian.harwood@manchester.ac.uk (A.R.G. Harwood).
However, in the era of big data and pervasive computing, it is no longer impractical to envisage the coordination, and indeed the running, of simulations via or on-board a mobile device. There is no question that mobile devices, be they tablet computers or mobile phones, are lighter, more portable and often cheaper than the laptops, desktops and servers currently being used for engineering simulations. Having simulation results presented directly to an individual using this platform can allow qualitative analysis to be performed in situations where such information has previously been unavailable. For example, in emergency scenario analysis, the mobile device may be used to capture surroundings using the built-in camera and contaminant sources using the touch screen. A local simulation is then used to provide the user with an immediate safe route of navigation. Alternatively, an interactive wind tunnel can be effectively given to a class of students to enhance education and learning. Mobile devices may also be given to physicians and used in combination with patient-derived imagery to provide improved diagnostic information at the point of care [1].
Over recent years, the prevalence of desktop computing has reduced, and the use of laptop and tablet devices has grown to fill this gap. Leveraging the many-core graphics processing units (GPUs) typically available on mobile devices can deliver a significant boost in processing power. However, it is unlikely that a single device will reach the power available in current HPC clusters in the near future. Instead of matching the accuracy of "conventional" modelling and simulation methods, which is likely to always require significant computing power, there is arguably a role for a faster simulation tool that trades accuracy for speed in order to attain a level of human-interactivity. This is particularly true where such a tool is used to complement and enhance human decision making, or even to provide fast approximations to inform an automated decision-making system as one of many environmental data streams available to that system.
In order to assess the suitability of mobile platforms for performing local, interactive engineering simulation, this article reports the development of two different parallel design patterns for performing interactive, grid-based fluid dynamics simulations on an Android-powered mobile device. The Android operating system is selected as the development platform, due to the availability and affordability of suitable hardware and the fact that it currently has the largest market share for mobile devices [2].
Although mobile devices often feature both a GPU and a CPU, the present study explores only the use of the CPU. The development of custom software for mobile GPUs is not yet widely supported and is left for a future publication. Our simulations use the lattice-Boltzmann method (LBM) to simulate flow physics [3]; the method is introduced in the next section.
The primary aim of the present study is to propose different approaches to interactive flow simulation using LBM on an Android device. This includes the implementation and cross-comparison of several candidate frameworks in order to assess the potential for mobile devices, either used alone or for multi-device parallelism. The completion of these aims provides baseline data and a design template on which other types of interactive engineering simulation on a range of devices may be built.
2. Use of mobile devices for engineering simulation
A survey of device ownership in the US in 2015 [2] revealed that 68% of the US population owned a smartphone and 45% owned a tablet. Globally, smartphone ownership hit 1000 million in 2012 and is set to exceed 2500 million by 2020. Ownership is spread across all continents with South Korea leading the way, where 88% of the population own a smartphone. Ownership in so-called advanced economies, including the US and much of Europe, is approximately 68% on average [4].
Modern mobile devices are designed to include a multi-core CPU and a GPU, providing similar versatility to a desktop computer. The computational power of these chips has increased approximately ten-fold since 2009 [5] and available RAM has also increased, with more than 3 GB typical of current high-end Samsung Android devices (Fig. 1). Current engineering workstations may have 16 cores and 64 GB RAM with which to perform a local simulation - a 4x increase in CPU cores and a 16x increase in memory. Furthermore, due to active cooling mechanisms on desktop computers, clock speeds are often much higher, increasing computing power further.
Halpern et al. [5] show that power consumption for mobile chips has increased, although further increases in power consumption are yielding less of a gain in performance, causing a power saturation at about 1.5 W. Instead, the most recent smartphone designs have increased the number of cores rather than increasing the power per core. One may conclude that mobile hardware development is hence limited by its power-conserving motivation.
However, in defence of mobile devices, their widespread ownership, combined with the clear, albeit restricted, increases in power and capability, makes the platform a potential candidate for running smaller-scale engineering simulations on site without reliance on external resources or connectivity. Although an individual device may not be able to offer the power of an HPC facility, simulations could be performed on a network of devices connected by a local network implemented via Bluetooth or Wi-Fi Direct.
High-end HPC facilities, at present, typically offer O(10^5) cores and O(10^3) GB of RAM according to the TOP500 list. However, in practice, these resources are shared amongst many users with individual jobs using much smaller allocations. Access to such facilities is also generally restricted. Theoretically, if all 2 x 10^9 smartphones globally are assumed to be quad-core with 2 GB RAM (c.2013), a global P2P smartphone computer could offer 8 x 10^9 cores with 4 x 10^9 GB of memory. This is purely hypothetical but illustrates the compute potential of even the mobile devices in a single office block or city.
In light of hardware limitations for individual mobile devices, it is expected that in order to run a simulation locally, there will be a trade-off between the level of model complexity (and hence simulation accuracy) and the speed with which a result can be obtained. However, mobile platforms have the potential to provide sufficient computing power for rapid simulation to an acceptable and situation-appropriate degree of accuracy. This can only be realised with the development of a suitable framework for engineering simulation in this context.
2.1. Integration with existing infrastructure
In our increasingly connected world, mobile devices also have the option to off-load tasks with a high resource demand to more suitable systems [6]. In the case of engineering simulation, mobile devices may provide input data, such as local wind speed and direction, measured structural loads or geometry and materials, all recorded locally on the device, to a remote HPC facility which performs a potentially demanding calculation using these system data. A data-reduced result may then be returned for the user to inspect. It may also be possible to perform some part of the simulation locally as a coarse approximation to the problem physics while simultaneously performing a more detailed analysis remotely, which may be viewed or incorporated into the local platform at a later time (Fig. 2).
However, presently HPC facilities are expensive to build and maintain and access is typically restricted. Furthermore, network connectivity is not available in every location and may also suffer from reduced bandwidth or unreliability. The common interconnect between HPC facilities and external terminals is of the order of 1 Gb/s, which gives a maximum theoretical throughput of 125 MB/s. The fastest mobile data connections in the UK at present use the LTE-A (4G+) standard and will theoretically support such a transfer rate [7]. However, this service is, at present, only available in select areas and at a premium subscription cost to the user. Typically, connection speeds may be as low as 12 MB/s depending on the infrastructure available. A sensible alternative may therefore be to develop approaches to performing the calculation locally on one or more available devices [8].
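As a point of reference, the throughput figure above follows from a simple unit conversion (an idealised sketch, ignoring protocol overheads):

\[ 1\,\mathrm{Gb/s} = \frac{10^{9}\,\mathrm{bit/s}}{8\,\mathrm{bit/byte}} = 125\,\mathrm{MB/s}. \]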
Fig. 1. Illustration of the increase in CPU cores and memory on smartphone devices since 2009 [5].
Fig. 2. Illustration of the concept of interactive engineering simulation and the possible roles of mobile devices. Simulation may be performed locally on a single device, on a network of devices or off-loaded to High Performance Computers (HPC).
3. The lattice Boltzmann method
Unlike conventional CFD techniques, which aim to solve the Navier-Stokes equations, the lattice-Boltzmann method solves the Boltzmann equation, Eq. (1), in order to obtain the statistical behaviour of the fluid:

\[ \left( \frac{\partial}{\partial t} + \mathbf{c} \cdot \nabla \right) f = \Omega \tag{1} \]

Physical space is modelled as a series of discrete nodes linked by a finite number of lattice links. At each lattice node, fluid is represented neither as finite volumes nor as microscopic particles but as groups (or distributions) of particles f. The lattice links represent a discrete velocity vector c along which particles within a given group are permitted to move. As the simulation evolves, microscopic particle collisions are modelled by the application of a collision operator Ω. A commonly-used collision operator, known as the BGK approximation [9], relaxes the momenta of the particles at a particular lattice site towards a local equilibrium as

\[ f^{\mathrm{new}} = f^{\mathrm{old}} - \frac{1}{\tau} \left( f^{\mathrm{old}} - f^{\mathrm{eq}} \right) \tag{2} \]

where τ is the relaxation time and f^eq, the equilibrium distribution, is a function of the local macroscopic quantities. The redistributed momenta are then convected to neighbouring grid sites along the lattice links. At the end of each time step, the macroscopic quantities are updated by computing the statistical moments of the distribution functions.
Fig. 3. Graphical illustration of a time step using LBM: collide, stream, then update the macroscopic and equilibrium quantities.
The main steps in the LBM are illustrated graphically in Fig. 3. No-slip boundaries are implemented using a bounce-back technique which simply reflects distributions with a component oriented normal to the wall back along suitable lattice links. The lattice-Boltzmann equation is known to recover the Navier-Stokes equations with second-order accuracy in both space and time [10].
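To make these steps concrete, the sketch below shows a minimal serial D2Q9 collide-stream update in Java, the language used for the implementations later in this paper. It is illustrative only: the class name, array layout and use of periodic wrapping (in place of bounce-back walls) are simplifying assumptions rather than the implementation developed in this work.

// Minimal serial D2Q9 BGK collide-stream sketch (illustrative names and
// layout; periodic boundaries are used for brevity instead of bounce-back).
public final class LbmSketch {
    static final int[] CX = { 0, 1, 0, -1, 0, 1, -1, -1, 1 };  // lattice velocities
    static final int[] CY = { 0, 0, 1, 0, -1, 1, 1, -1, -1 };
    static final double[] W = { 4.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9,
                                1.0 / 36, 1.0 / 36, 1.0 / 36, 1.0 / 36 };
    final int nx, ny;
    final double tau;            // relaxation time in Eq. (2)
    double[][][] f, fNew;        // particle distributions f[x][y][i]

    LbmSketch(int nx, int ny, double tau) {
        this.nx = nx; this.ny = ny; this.tau = tau;
        f = new double[nx][ny][9];
        fNew = new double[nx][ny][9];
        for (double[][] col : f)
            for (double[] site : col)
                System.arraycopy(W, 0, site, 0, 9);   // rho = 1, u = 0 everywhere
    }

    void step() {
        for (int x = 0; x < nx; x++)
            for (int y = 0; y < ny; y++) {
                double rho = 0, ux = 0, uy = 0;       // macroscopic moments
                for (int i = 0; i < 9; i++) {
                    rho += f[x][y][i];
                    ux += CX[i] * f[x][y][i];
                    uy += CY[i] * f[x][y][i];
                }
                ux /= rho; uy /= rho;
                for (int i = 0; i < 9; i++) {
                    double cu = 3 * (CX[i] * ux + CY[i] * uy);
                    double feq = W[i] * rho * (1 + cu + 0.5 * cu * cu
                                 - 1.5 * (ux * ux + uy * uy));       // equilibrium
                    double post = f[x][y][i] - (f[x][y][i] - feq) / tau; // collide, Eq. (2)
                    int xn = (x + CX[i] + nx) % nx;   // stream to neighbour site
                    int yn = (y + CY[i] + ny) % ny;
                    fNew[xn][yn][i] = post;
                }
            }
        double[][][] tmp = f; f = fNew; fNew = tmp;   // swap buffers
    }
}

A parallel implementation divides the outer loops over x and y between threads or tasks, which is precisely the decomposition explored in Section 6.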
LBM is an increasingly widely-used method for accelerated fluid mechanics simulation due to its suitability for parallelisation, particularly on GPUs [11]. Massively parallel execution is possible because typical collision operators operate on site-local data. Although the convection process requires propagation of information to neighbouring lattice sites, it is possible to order the operations such that data reads and writes are atomic [12]. A single instruction may therefore be carried out by many threads on a GPU in parallel. The memory requirements for an LBM application will inevitably increase with resolution as more lattice sites will need allocated storage. The link-wise artificial compressibility method (LW-ACM) [13] is one potential solution to the increasing memory requirement, reducing the data stored per lattice site. Alternatively, multiple devices may be used in parallel as discussed later.
4. Introducing an interactive workflow
In conventional engineering simulation, the workflow is typically split into three distinct activities: (1) pre-processing - to prepare the problem geometry and physical parameters; (2) simulation - to solve the governing equations and obtain a result; (3) post-processing - to visualise or interpret the results. Conversely, in interactive engineering simulation, all these activities must take place within the same loop with only a single (or a small number) of iterations of the solver being performed in each loop, i.e., the geometry must be created (or modified), a numerical iteration performed and the results (and possibly visualisation) be updated immediately after one another. A general workflow is presented in Fig. 4.
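At its simplest, this loop structure can be expressed in a few lines. The sketch below uses hypothetical Solver and Renderer interfaces as stand-ins for the application components described later; none of these names come from the actual implementation.

// Hypothetical sketch of the interactive workflow loop of Fig. 4.
interface Solver { void step(); }      // advances the simulation by one iteration
interface Renderer { void draw(); }    // refreshes the on-screen visualisation

final class InteractiveLoop {
    volatile boolean running = true;   // cleared when the user exits

    void run(Solver solver, Renderer renderer, Runnable applyUserEdits) {
        while (running) {
            applyUserEdits.run();      // consume queued touch events, update geometry
            solver.step();             // a single (or a few) solver iteration(s)
            renderer.draw();           // display results immediately
        }
    }
}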
A data transfer step is usually required for traditional simulation, where input data is read into the solver through input files moved to the computational facility. This could be a local machine but is often a remote high performance cluster. The simulation is left to run with data being written out for later inspection at user-specified intervals. When computation is off-loaded in this manner, the simulation may run for hours, days or even weeks depending on the scale of the problem. Once the computation is complete, the results are then transferred once more back to a local machine for inspection. The amount of data to be transmitted can be many gigabytes for large problem sizes and data-transfer times are usually high enough to prevent interactive update rates for visualisation and user interaction.

If a result is required within a shorter amount of time, the size of the simulation may be reduced. If only a qualitative level of accuracy is sought, simulation size may be reduced and algorithms designed such that iterations become very rapid (<1 s, for example). Useful results may be obtained in a matter of minutes. Such speed of simulation lends itself to user-in-the-loop interaction in terms of input parameter adjustments and results visualisation. Results may be retrieved and visualised as the simulation evolves, providing insight into the simulation as it progresses. The user may adjust input parameters to explore a parameter space, or a machine may use the results to perform decision-making. Such use cases are simply not possible using a traditional approach due to the lack of integration of the pre-processing, solution and post-processing activities. Conversely, the close integration of these same activities in the interactive simulation workflow is an enabler for new and exciting applications of engineering simulation. However, careful design is required to maintain use case flexibility in these cases as there is an inevitable close coupling of simulation and data.
5. Available infrastructure
The Android operating system is built on a modified Linux kernel and affords a high degree of control over the underlying hardware. A high level Application Programming Interface (API) is available in the Java programming language which allows the programmer to access operating system components to create applications for the platform. Software is compiled just-in-time for a virtual machine which is part of the Android Run-time (ART) environment. The use of a virtual machine allows Android to run on different physical architectures while the software remains platform-independent.
Engineering software, especially simulation software, is usually written in either Fortran or C/C++ and based on legacy libraries. These libraries are mature and widely available and the languages themselves allow a more flexible, fine-grain control of memory which helps optimisation. Fortunately, Android offers the Native Development Kit (NDK) which allows portions of an application to be written in C/C++ and compiled with the rest of the application. This facility means that existing libraries written in C/C++ may be reused. Furthermore, accelerator libraries may be used to write software for devices with a dedicated GPU, which can potentially provide a boost to performance for algorithms which can be massively parallelised. A schematic of the high-level architecture is shown in Fig. 5.
Fig. 4. A comparison between the generalised workflows of traditional simulation (design in CAD & generate mesh; geometry capture & boundary conditions; fluid/structural solver; data visualisation; post-process & visualise) and interactive simulation (capture/edit geometry interactively; iterate the fluid/structural solution; visualise data in real time; user modification) in engineering.
Fig. 5. The Android software infrastructure including the Java Native Interface (JNI) supplied with the NDK: Java and C/C++ application code sit above the application framework, libraries and Android Runtime, which in turn sit above the Linux kernel.
5.1. Concurrency support in Android
Work on the CPU is a set of instructions to be executed, typically organised into threads, each of which is executed by a CPU core. Work is said to be multi-threaded if it is divided into multiple threads. If those threads are executed at the same time then they are executing in parallel. If those threads are progressing at the same time but do not necessarily execute at the same time (i.e., work is performed on one thread, then the other, then more work is performed on the first thread again) then the threads are working concurrently [14]. Android provides API elements to aid the deployment of work concurrently, with different mechanisms affording different levels of control over how the work is distributed.
The availability of the human-device interface on a mobile device is crucial for the user experience. As such, an application written for the platform must maintain resources to handle user interface (UI) events such as touching the screen to select an option or open a menu. Therefore, every application in Android must be designed with a basic level of concurrency such that demanding operations are executed on a separate background thread. According to Android Developer best practice [15], the operating system (OS) will mark the application as non-responsive if an input event is not handled within 5 s. The UI thread, responsible for drawing the view and handling UI events, is held at the highest priority within the OS and all other threads used for background work are set to a lower priority. The implication is that if a simulation is performed on an Android device, it will need to be performed on a background thread. As such, it will always be granted fewer resources than may be available, to ensure a reserve is held for the main thread should it require them. It is possible to deploy the application as a background process without a user interface, which would not be subject to the user response checking, although in the context of interactive simulation the presence of an interface is desirable.
API components for designing parallel software are provided in Android in the form of extensions of the Java concurrency package. The most important of these are the Thread and the AsyncTask classes which can be used to package work in a Thread-based and Task-based sense, respectively. These helper classes are used in the patterns presented in this work.
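As a brief illustration, the two packaging styles differ as sketched below; the method and variable names are our own placeholders, not the application's API.

import android.os.AsyncTask;

// Sketch only: contrasting the two ways of packaging the same unit of work.
final class PackagingSketch {

    // Thread-based: the programmer owns the thread's lifecycle explicitly.
    static void runOnThread(Runnable gridWork) throws InterruptedException {
        Thread worker = new Thread(gridWork);
        worker.start();        // begins execution immediately on a new thread
        worker.join();         // block until the worker completes
    }

    // Task-based: the work is wrapped in a task and handed to a thread pool
    // whose scheduling is managed by the Android run-time.
    static void runAsTask(final Runnable gridWork) {
        new AsyncTask<Void, Void, Void>() {
            @Override protected Void doInBackground(Void... unused) {
                gridWork.run();
                return null;
            }
        }.executeOnExecutor(AsyncTask.THREAD_POOL_EXECUTOR);
    }
}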
Despite the availability of multi-core devices and associated API elements, multi-threading within Android applications is often under-utilised. Gao et al. [16,17] conclude that, on average, only 1.4 cores of quad-core devices are used by current applications. This suggests one of two things: either the desired performance is achieved without the use of a high degree of concurrency; or the applications are under-performing due to poor use of available resources. The former implies that vendors are over-provisioning in terms of hardware whereas the latter implies that programmers are not designing applications to best match the available hardware. Designing concurrent software is understood to be more difficult than serial software and hence requires more programmer effort. If an application does not require a high level of concurrency to function as desired, the return on the invested effort may not be worthwhile. However, the need for concurrency in a given application is likely to depend on the purpose of the software.
6. Implementation
For engineering simulation, many-core parallelisation is crucial for timely simulation and HPC software often makes use of the Message Passing Interface (MPI) [18] to allow portions of a numerical grid which reside in distributed memory to communicate with one another. In addition, it may also include OpenMP [19] directives for multi-threaded execution within a shared memory space. Similar behaviour can be emulated using the Thread and AsyncTask classes by combining them with distributed and shared memory arrangements. This hybrid parallelism of engineering software provides a template for similar design on alternative infrastructure such as Android. The mobile analogue is therefore to explore the two forms of on-device parallelism presented in this work as well as multi-device parallelism as discussed in Section 10.1.

Fig. 6. Schematics of (a) Task-based and (b) Thread-based designs for implementation in Java for an Android device.
Two CPU implementations have been developed in Java. Both solve the LBM on a uniform Cartesian mesh and use the concurrency support of Android according to Android developer best practice guidelines. The LBM kernel consists of a local collision operator followed by a non-local streaming operation. This latter operation is non-local as quantities must be read from one grid site and written to a neighbour site in each direction allowed by the grid structure.
6.1. Memory access
It should be noted that two main design decisions have been made for each solution:
1. Whether the Thread or AsyncTask infrastructure is used to package and execute grid operations.
2. Whether a single, shared grid is created or the grid is partitioned into sub-grids for each task/thread.
The task-based approach divides the work up into tasks. These tasks are executed asynchronously, operating on different parts of the grid simultaneously. Since the tasks themselves have a finite lifetime, are asynchronously executed and their resources are released once the task is completed, a shared memory approach is preferred. Thread-safety is easily enforced by chunking the grid operations into thread-safe tasks. This precludes the need for explicit synchronisation instructions.
The thread-based design affords more control over thread creation and destruction, allowing the programmer to submit work to threads when necessary without first having to package the work into tasks. Threads are created at the creation of the application and joined to the main thread on application destruction. A partitioned memory model was chosen for this pattern, which splits the grid into smaller stand-alone grids managed by each thread. This makes enforcing thread-safety a much more explicit intention of the programmer and also emulates an MPI architecture, which uses distributed memory in a similar way. This design pattern may also be easily extended to a multi-device approach in the same way typical engineering applications use MPI across multiple nodes.
6.2. Task-based, shared memory design
The Android development kit provides the AsyncTask class, which is abstract and hence must be extended. The work to be performed by each instance of the subclass must then be implemented. These task instances are constructed as required then posted to a ThreadPoolExecutor, which executes each task on an available thread within a ThreadPool owned by the OS. The thread scheduling is automatically handled by the run-time, with tasks queued and executed when threads are available. This concept is illustrated in Fig. 6(a). Given that the Android run-time is responsible for scheduling this work across a number of threads, this implementation represents a concurrent model rather than a parallel model.
Each task is designed to execute a portion of the LBM work pertaining to a certain block of the grid, which itself is stored in shared memory. Execution is completely asynchronous and hence non-local operations on the grid quantities may take place at different times and on adjacent grid sites. The macroscopic flow quantities near the block boundaries may therefore be computed from incomplete information if the tasks are not synchronised. In order to force synchronisation between tasks, the thread manager is first set to monitor a generic Java object as a lock. An AtomicInteger is incremented by each task once it completes its non-local operations on its portion of the grid. The final task to complete its work will increment the integer such that its value equals the total number of tasks created. This is the trigger to notify the thread manager waiting on the lock. The thread manager will then proceed to use the grid object again now that the concurrent work is completed and all sites are up-to-date. The thread manager manages the sequential drawing of the bitmap to the view as well as the capture of touch events using the same lock-based synchronisation as the task synchronisation; each successful completion of an activity notifies the thread manager thread, which continues to the next part of the workflow (Fig. 4). A pseudo-Unified Modelling Language (UML) sequence diagram is shown in Fig. 7.
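A minimal sketch of this completion protocol is given below, assuming hypothetical class and method names; the real application wires this logic into its AsyncTask subclasses.

import java.util.concurrent.atomic.AtomicInteger;

final class TaskSync {
    private final Object lock = new Object();           // monitored by the manager
    private final AtomicInteger completed = new AtomicInteger(0);
    private final int numTasks;

    TaskSync(int numTasks) { this.numTasks = numTasks; }

    /** Called by each task after finishing the work on its block of the grid. */
    void taskFinished() {
        if (completed.incrementAndGet() == numTasks) {
            synchronized (lock) { lock.notify(); }      // wake the waiting manager
        }
    }

    /** Called by the manager thread after posting all tasks for this time step. */
    void awaitAll() throws InterruptedException {
        synchronized (lock) {
            while (completed.get() < numTasks) {
                lock.wait();                            // lock released while waiting
            }
        }
        completed.set(0);                               // reset for the next time step
    }
}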
6.3. Thread-based, distributed memory design
In this design, threads can be created and managed by extending the Thread class. This allows the programmer a greater degree of control of thread behaviour and threads may be easily synchronised by using a reusable CyclicBarrier object. A manager thread creates the worker threads and decomposes the problem into separate grid blocks with overlapping halo regions.
Table 1
Specification of the Google Project Tango Development Kit device used in this work.

Aspect: Details
Screen: 7 in. 1900 x 1200 (323 ppi)
Cameras: 4 Mpx RGB-IR pixel sensor (rear); 1 Mpx RGB (front)
OS: Android 4.4.2
RAM: 4 GB
CPU + GPU: nVidia Tegra K1 (quad-core CPU + 192-core GPU)
Battery: 4960 mAh
Fig. 7. Sequence diagram illustrating a Task-based, shared memory design pattern for an interactive flow simulation application.
The worker threads are then started by the manager and operate on their own blocks. The concept is illustrated in Fig. 6(b). Contrary to the previous design, the programmer may distribute work to all of these threads at the same time, and the threads execute in parallel.
In order to implement message passing between concurrent threads, the Android Handler framework is used. Each thread has a message queue and a Looper, managed by the OS, which implements message retrieval from the queue in the order messages arrive. After barrier synchronisation, each thread constructs and posts a message to its neighbours in the decomposition topology. Messages are picked up by the Looper and then processed by a message handler attached to each thread queue, in much the same way that halo exchange is performed using MPI in other scientific software. The thread manager is again notified by one of the background threads once the last message has been handled, by using an AtomicInteger counter to log the processed message count. The behaviour is summarised in Fig. 8.
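The sketch below shows how the two primitives named above fit together for a single worker block. It is a simplified illustration under assumed names and payload sizes; the wiring of neighbours, the macroscopic update and the manager notification are omitted.

import android.os.Handler;
import android.os.HandlerThread;
import android.os.Message;
import java.util.concurrent.CyclicBarrier;

final class WorkerBlock {
    static final int MSG_HALO = 1;

    final CyclicBarrier barrier;    // shared by all worker blocks; reusable each step
    final HandlerThread thread;     // owns this block's Looper and message queue
    final Handler inbox;            // neighbours post halo data here
    Handler neighbourInbox;         // set once all workers have been created

    WorkerBlock(String name, CyclicBarrier barrier) {
        this.barrier = barrier;
        thread = new HandlerThread(name);
        thread.start();
        inbox = new Handler(thread.getLooper()) {
            @Override public void handleMessage(Message msg) {
                if (msg.what == MSG_HALO) {
                    double[] halo = (double[]) msg.obj;
                    // ... copy the received edge sites into this block's halo ...
                }
            }
        };
    }

    /** Posted to this worker's queue once per time step. */
    void timeStep() throws Exception {
        // ... collide and stream on this block's sub-grid ...
        barrier.await();   // wait until every block has finished streaming
        double[] edge = new double[93 * 9];   // e.g. one column of sites, 9 directions
        neighbourInbox.obtainMessage(MSG_HALO, edge).sendToTarget();
        // Returning to the Looper allows incoming halo messages to be handled.
    }
}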
6.4. Interactive elements
A mobile device is unique in that it provides both a visualisation and interactive interface as well as raw processing power. In the fields of education and design, the ability to stop, resume and manipulate a simulation during execution provides a significant advantage [20,21]. Therefore, any framework for engineering simulation ought to include both a visualisation and an interactive element and incorporate those components of the mobile platform which enable this capability [22].
In both solutions, the 2D velocity field computed by the LBM is mapped to a rainbow colour range and interpolated onto an RGB bitmap texture. This texture is then passed to the Android view manager where it is scaled and drawn to the screen. Although texture-based visualisation is a simple means of discerning useful information from the simulation, many more techniques for data visualisation and rendering are available, with Android supporting OpenGL ES.
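A sketch of such a field-to-texture mapping is given below; the colour scale (an HSV hue sweep from blue to red) and method names are illustrative assumptions rather than the application's exact implementation.

import android.graphics.Bitmap;
import android.graphics.Color;

final class FieldRenderer {
    /** Maps a velocity-magnitude field onto an ARGB bitmap, one pixel per site. */
    static Bitmap velocityToBitmap(double[][] uMag, double uMax) {
        int nx = uMag.length, ny = uMag[0].length;
        Bitmap bmp = Bitmap.createBitmap(nx, ny, Bitmap.Config.ARGB_8888);
        float[] hsv = { 0f, 1f, 1f };
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                float t = (float) Math.min(1.0, uMag[x][y] / uMax); // normalise to [0,1]
                hsv[0] = 240f * (1f - t);   // hue: 240 deg (blue) down to 0 deg (red)
                bmp.setPixel(x, y, Color.HSVToColor(hsv));
            }
        }
        return bmp;   // subsequently scaled and drawn by the view manager
    }
}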
The touch-screen may also be used to add solid boundaries anywhere within the flow domain, allowing the user to study a range of external flows. Single-point gestures are recognised and event methods implemented to impose solid boundary conditions at touched grid sites during the calculation. This allows the user to dynamically add solid boundaries to the fluid flow.
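The essence of this gesture handling is sketched below: the touch position is scaled from screen coordinates to lattice indices and the corresponding site is flagged as solid so that bounce-back is applied there. The field names (nx, ny, solid) are illustrative.

import android.view.MotionEvent;
import android.view.View;

final class TouchSketch {
    static void attachTouchHandler(View view, boolean[][] solid, int nx, int ny) {
        view.setOnTouchListener((v, event) -> {
            int gx = (int) (event.getX() / v.getWidth() * nx);   // lattice x index
            int gy = (int) (event.getY() / v.getHeight() * ny);  // lattice y index
            if (gx >= 0 && gx < nx && gy >= 0 && gy < ny) {
                solid[gx][gy] = true;   // impose a no-slip (bounce-back) site here
            }
            return true;               // consume the gesture
        });
    }
}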
The software used for testing is configured at compile time to model a channel flow at a fixed Reynolds number. However, a suitable user interface could be provided with the application framework to allow different base flows to be initialised and parameters changed at run-time in a similar way to existing interactive LBM software [23,24]. Furthermore, the application may also read input from photographs taken using the device camera which undergo a pre-processing step to recognise relevant contours within the image. Details of this module are outside the scope of this article and will be published separately.
7. Performance test case
Project Tango is an Android-powered mobile device with built-in depth-sensing technology for capturing external geometry [25]. This device has all the hardware necessary to execute each stage of the workflow (Fig. 4) and was used as the hardware platform for this work. Its specifications are given in Table 1.
The test simulations used a 192 x 93 grid, with this aspect ratio dictated by the view size. The Reynolds number is set to 100. This combination of resolution and Reynolds number was chosen in order to keep the simulation stable while allowing a fast enough solution that the device can update the view at a rate of > 20 frames per second. Evolution of the simulation thus appears smooth, with the user unable to discern the drawing of individual frames at each time step. As this study focusses on software design strategies, validation of the LBM is not performed. However, the accuracy of even coarsely resolved LBM for low Reynolds number, laminar flows is remarkably good [26] and the resolution used here is more than sufficient for the Reynolds number being examined.
The development of the boundary layer from solid walls applied at the top and bottom of the domain was visible with the flow driven by a uniform inlet boundary condition at the left hand edge of the domain. A screenshot of the flow as evolved on the device is shown in Fig. 9.
In reality, Reynolds numbers of interest will be much higher than simulated here. In order to maintain both stability and accuracy, it is expected that the resolution of the lattice will need to be increased. Additional modelling may also be required to capture turbulence or to increase the stability of the simulation. There will inevitably be a trade-off between accuracy and performance. However, the motivation for interactive CFD in the short term is not to target the level of accuracy already offered by conventional CFD, but simply to target a sufficient level of accuracy for the application.
Fig. 8. Sequence diagram illustrating a Thread-based, distributed memory design pattern for an interactive flow simulation application.
Fig. 9. Screenshot of converged 2D channel flow simulated using the lattice-Boltzmann method on an Android tablet.
Both designs were implemented as separate applications and included timers to time the kernel function responsible for completing an LBM time step. In addition, an extra timer was added to the thread-based design to measure the performance of the process responsible for the preparation, passing and dissemination of messages. As illustrated in Figs. 7 and 8, the main differences in the designs are found in the execution of the LBM kernel rather than gesture processing or view drawing. Hence, although interaction is possible in both implementations, tests were conducted without touching the display to ensure only the kernel performance is measured.
Timing data was recorded at each iteration of the LBM kernel and an average value dynamically updated. After 1000 time steps, the simulation was stopped and the data recorded. This number of time steps was sufficient to ensure that updates to the average time were less than 1 ms per 100 time steps, i.e., typically less than a 1% change. The problem size is kept fixed, with a grid size of 192 x 93 used in each case, and the number of threads/tasks increased from 1 to 6.
Battery usage for any mobile application is an important consideration. Although it is likely that the device would in practice be used to run many short simulations rather than a single long calculation, as only intuition and qualitative results are required, we detail the power consumption of the application in any case. It should be noted that the details of power consumption will vary from device to device as hardware such as CPUs, memory and screens will all have different power requirements across devices. Nevertheless, the battery usage was noted from the internal Android battery monitor application, which was reset when the LBM application was loaded. The simulation was run for 60 min without interaction and the data from the battery monitor recorded every 10 min.

Fig. 10. Timing data for Thread-based (Multi-thread) and Task-based (AsyncTask) implementations. Ideal strong scaling normalised to the serial data is indicated by the square and circle markers.
8. Results
Fig. 10 shows the timing results obtained during the test. The horizontal axis indicates the number of tasks/threads used to parallelise an LBM time step. The vertical axis shows the measured time in ms. The yellow bars indicate the time taken for the thread-based, distributed memory implementation and include a green region, which indicates the time taken to complete message-passing. The blue bars illustrate the time taken for the task-based, shared memory approach. The ideal strong scaling (Amdahl's Law) is represented by the red markers in the figure. This scaling is normalised to the serial case in Fig. 10. Considering that the problem size does not change, the idealised scaling is computed by simply dividing the serial time by the number of threads/tasks used. The two implementations used are capable of performing 0.33 / 0.40 million lattice updates per second (MLUPS) in serial with this increasing to a maximum of 0.62 / 0.69 MLUPS for the thread-based, distributed memory solution and the task-based, shared memory solution, respectively. These translate to performance increases of factors of 1.88 and 1.73.
There is a difference in serial execution time. This is expected as the kernel classes themselves are slightly different to facilitate the different memory structures of each design. The additional managerial software, including barrier synchronisation and manual control of the worker threads, in the thread-based design adds additional load to the LBM kernel which is reflected in an increase in execution time. However, the effect of this is less pronounced during parallel execution with the LBM execution time similar in both implementations. In line with expectations, the message passing cost associated with the thread-based design increases with the number of threads used. This is due to an increased CPU load associated with message routing and the use of a finite set of resources for executing the handlers.
Both implementations exhibit a clear performance saturation. This point occurs when the number of requested threads/tasks equals three.
Table 2
Peak memory usage for each of the cases. The bottom row indicates the percentage increase in memory usage for the Thread-based design.

Tasks/Threads: 1, 2, 3, 4, 5, 6
Task-based (MB): 5.85, 5.87, 5.87, 5.88, 5.89, 5.89
Thread-based (MB): 5.90, 5.96, 6.01, 6.09, 6.25, 6.41
Thread-based (% increase): 0.0, 1.1, 1.9, 3.2, 5.9, 8.6
Considering that the device contains a quad-core CPU, this point represents when the application has spawned the same number of threads as there are additional cores beyond the first. The scheduling of threads across available cores is handled by the OS. Each thread is given a priority. Background threads are always set to a lower priority than user interface threads to preserve responsiveness, as discussed in Section 5. They are therefore allocated fewer resources, including CPU time. Android also takes into consideration how many background threads exist with work to run and assigns them a thread group. The resource requirements of each thread group are also controlled by the thread scheduler to ensure that each thread group makes progress, i.e., runs concurrently with other thread groups. Given that each thread in our implementations will have a similar workload, the thread scheduler is expected to run all threads concurrently using a similar set of limited resources across available CPUs. As background threads increase beyond three, the workload per thread falls with the local grid size on each thread and the scheduler will simply run these threads less often in a limited resource pool. The execution time, therefore, stays roughly the same.
8.1. Memory usage
The memory usage of each case is recorded in Table 2. First, the allocation of memory for the application is relatively low given that the device has 4096 MB of memory in total, with perhaps 75% of that usable. Therefore, given the execution times in Fig. 10, the current implementations are limited by the speed with which an iteration of LBM can be computed.
Fig. 11. Line (right-hand axis) indicates the battery percentage for the first 60 min of running the simulation. Bars (left-hand axis) indicate how the power consumption is apportioned between the screen, the LBM application and the Android OS, as recorded by the Android battery monitor.
However, if in future the GPU is used, the problem may become limited by either memory capacity or potentially memory bandwidth, given that mobile GPUs share memory with the CPU and would be required to read and write large amounts of data in parallel.
As expected, the shared memory model ensures an almost constant amount of memory allocated in each case with a small increase observed due to overhead associated with the creation of multiple tasks. Given that the LBM grid data is by far the largest allocation of memory, this observation is consistent.
For the thread-based approach, there is a larger memory overhead associated with the creation of more threads. This is due to the addition of halo regions required by the distributed memory approach, where an interior thread carries the cost of two halo regions. The domain contains 192 x 93 = 17,856 grid sites and is divided into vertical strips, one for each thread. Each halo is thus 93 sites in total, so the addition of one more thread costs 186 sites, approximately 1.1% of the total grid size. The percentage increases in memory usage for each of the thread-based cases are also given in Table 2. The increase in memory is initially lower than this 1.1%. This is because the grid, although a large proportion of the allocated memory, is not the only component of the allocated memory; there will be an application overhead. Hence a 1.1% increase in the grid memory is seen as a smaller increase overall. As the number of threads increases further, the actual increase begins to exceed the 1.1% as thread management memory and grid memory begin to represent a larger and larger proportion of the total memory allocated.
8.2. Battery usage
During the first 10 min, the battery level reduced from 100 to 95%, a total of 5%. For the remainder of the 60 min test, the battery drain over each 10 min period recorded never exceeded this value, with drain approximately linear. As might be expected, the majority of the energy used during the test is spent keeping the screen illuminated for the duration of the simulation (cf. Fig. 11). The proportions of power consumed over each 10 min interval remain approximately constant with, on average, the simulation consuming 38% of the battery and the screen consuming roughly 1.5 times that amount of power. Projecting the available data linearly, it is expected that a full battery would therefore be exhausted after running the simulation for approximately 3.4 h. If a long-time-averaged result is required rather than a dynamic window into the flow behaviour, the screen could be turned off for the duration of the simulation. In these circumstances, battery life would be extended to 8.4 h.
Fig. 12. Pseudo-UML sequence diagram to illustrate practical two-task coupling of the non-local and LBM local operations when using a shared memory model and asynchronous tasks.
8.3. Complexity
The previous discussion highlights the differences in execution time and memory usage of the two designs. There are, however, additional practical considerations with regard to the complexity of the implementation. The Handler framework is efficient at passing messages between the objects in distributed memory. However, the programmer is required to design and instantiate a suitable container for the grid data, use countdown triggers (i.e., an AtomicInteger) to track the completion of background tasks, and notify the thread manager to continue at appropriate times in the algorithm. These elements are necessary for a distributed memory, thread-safe update of regions of the LBM grid common to more than one thread, although at the cost of added complexity. These parts of the framework amount to approximately 20% of the application software, which requires significant additional effort to implement compared with a sequential version. However, the thread-based system is easy to synchronise using CyclicBarrier, as this higher-level construct hides its underlying complexity.
In contrast, the shared memory model used by the task-based design presents a different challenge when it comes to synchronisation and thread-safety. The task-based approach actually required the kernel to be broken into two steps. The local collision and non-local stream operations are performed by one set of tasks. Once complete, the thread manager then concatenates the non-local results by copying the post-stream grid quantities onto the pre-stream grid. Once the grid is up-to-date, another task is launched to complete the LBM kernel by updating the macroscopic quantities concurrently (cf. Fig. 12). There is added complexity due to this implementation but less so than managing halo data when using distributed memory. There is also an increase in task instantiation due to the creation of tasks twice per execution of the LBM kernel rather than once. AsyncTask is naturally asynchronous and does not provide barrier-type synchronisation, hence synchronisation must be enforced explicitly using other constructs. The use of atomic flags to allow each concurrent task to indicate its completion to the thread manager is a simple and effective thread-safe solution.
Fig. 13. Schematic illustrating the hybrid Java-C++ implementation where the kernel and the methods called by the kernel are all implemented natively in C++ and precompiled.
In summary, the task-based design performs better and is easier to implement than the thread-based design although the latter offers a familiarity to scientific programmers by sharing concepts with OpenMP and MPI.
9. Native implementation using the Android NDK
Having investigated the effects of two different strategies of parallel design in Java, the Android native development kit (NDK) was used to improve the performance of the LBM kernel by implementing it as a pre-compiled C++ library. The channel flow simulation was performed using a serial execution of the task-based Java design and compared to a hybrid Java-C++ design. The task-based design was chosen for modification due to its simplicity although similar modifications could easily be performed to the thread-based design.
The LBM kernel module of the application was previously written as a Java class which was instantiated with all the data corresponding to the lattice and its properties. The LBM kernel is a method of the class which performs a complete LBM time step. This kernel is repeatedly called in the previous two designs either by a looping runnable object (thread-based design) or by continuous posting of asynchronous tasks to the background threads. In this modified case, the tasks are posted to a single background thread within a loop when the simulation is started as per Fig. 7. The hybrid application uses this same design but a portion of the LBM class is implemented as a C++ library.
How much of the application to implement in C++ and how much to implement in Java is a design decision, but these proportions are generally chosen such that highly reused or computationally demanding portions of the application are implemented natively to improve performance. A key component of the NDK is the Java Native Interface (JNI), which is an environment in which software can be developed to provide a bridge between the Java side of the application and the C++ side. The NDK provides C++ support for Android capabilities and it is possible to write an entire application in native software with no Java implementation at all. However, as Java already offers a clear, well-supported framework for thread management, which is used extensively in the designs presented above, the simplicity of the implementation was preserved by reusing this part of the application. The LBM object is instantiated from the Java class definition but the methods (including the LBM kernel) are implemented in C++ with the native kernel being launched from a single AsyncTask. At run time, the JNI allows the C++ kernel implementation to obtain handles to the data arrays created in the Java run-time environment and to release these arrays after kernel execution finishes. This arrangement is depicted schematically in Fig. 13.
In order for native methods to access the Java arrays, the interface software must search for the Java class and its fields. These references are then used to obtain pointers to the fields themselves. This process is achieved using JNI API calls. Repeated calls to the JNI API can be very slow [27] if calls are inside loops. To increase performance, it is advisable to perform as many required calls as possible in a native initialisation method when the C++ library is loaded. Fig. 13 illustrates this initialisation method which performs searches and caches field references in variables, publicly visible on the C++ side, for quick retrieval by native methods.
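On the Java side, this arrangement reduces to a class that loads the native library and declares the kernel entry points, as sketched below. The library and method names are illustrative assumptions; the corresponding C++ definitions, compiled with the NDK, would perform the field look-ups once in the initialisation call and cache the references.

// Java side of the hybrid design sketched in Fig. 13 (illustrative names).
public class NativeLbmGrid {
    static {
        System.loadLibrary("lbmkernel");   // loads liblbmkernel.so built by the NDK
    }

    // Grid data remains in ordinary Java arrays; the C++ side obtains (and
    // releases) handles to them through the JNI around each kernel call.
    float[] f;   // distribution functions
    float[] u;   // macroscopic velocities

    // One-off native set-up: caches jclass/jfieldID references on the C++
    // side so they are not re-queried inside loops.
    public native void initNative();

    // Executes one complete LBM time step in pre-compiled C++.
    public native void stepNative();
}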
9.1. Test results
The same channel simulation is again run for 1000 time steps. The resulting speed-up in run-time performance, by loading the LBM kernel from a pre-compiled C++ implementation, is approximately 20% with respect to the original Java implementation of the LBM kernel (cf. Table 3).
Fig. 14. Illustration of the behaviour of a multi-device implementation of an interactive LBM simulation.
Table 3
Loop time and speed-up associated with the hybrid design versus a full Java implementation.
Implementation Loop time (ms) MLUPS Speed-up (%)
Java 37 0.48 0.0
Java <-> JNI <-> C++ 29 0.62 21.6
If this performance were to scale by the same factor as the full Java implementation then the theoretical performance of the pre-compiled C++ version would be 1.1 MLUPS, a vast improvement on the 0.4 MLUPS of a serial implementation.
This performance increase may be viewed as being specific to a given implementation, or specifically, a given choice of JNI placement. However, the benefits of using pre-compiled native implementations as a replacement for potentially slower, just-in-time-compiled Java implementations are really only significant when computationally intensive software is optimised in this way. There is little benefit from re-implementing native Android features that already require little CPU effort to execute. In Fig. 13, the choice of boundary is such that the part of the application which represents >90% of the CPU effort (as determined using trace profiling) is allocated to the C++ side of the application. Adjustment of this boundary further toward the Java side would therefore yield little additional gain.
10. Future work
The Project Tango device features an nVidia Tegra GPU. Theoretically, the use of the NDK should allow an application to be written
in CUDA C/C++ and managed through the JNI. This would require cross-compilation of a mixture of Java, standard C/C++ and CUDA C/C++ API calls. This is achieved on other platforms using a specialist compiler supplied by nVidia. Although nVidia provide some development support [28], documentation is limited as to how to deploy software written using the CUDA API for Android platforms. Examples at present are limited to the deployment of native Android applications that only link to a pre-compiled CUDA library. The necessary cross-compilation appears to be difficult to achieve within the CodeWorks environment without the need for custom build profiles. Nevertheless, it is expected that documentation will be written in due course to simplify the process and enable programmers to leverage the power of mobile GPUs directly.
10.1. Multi-device framework
Parallel design patterns, such as those described in this article, for engineering simulations on mobile devices will only guarantee performance gains up to a point. Beyond this, the limited memory capacity and the number and clock speed of CPU cores will impose restrictions on problem size and computational throughput. One way to circumvent the limits of memory and number of computing cores is to parallelise the work across more than one device. This is akin to a multi-node configuration in conventional HPC. Coupling grids distributed across multiple devices is precisely the purpose of the distributed memory model used by the thread-based design in this article. Halo data on device edges is this time passed to a unique device in a group of devices contributing to the same calculation. Results may be then shared throughout the many-device collection or simply collected into a master device for
output to the screen. Interactive elements too may be added to this model by either sharing input across the collection or just accepting input from a master device.
Java offers the tools for implementation of this approach in the Android API. The Socket class allows the construction of buffered, serialised data streams from one machine to another. Mobile devices invariably have a Wi-Fi adapter for connectivity, and peer-to-peer connections using Wi-Fi Direct are supported by Android 4+ (API level 14). Wi-Fi has greater range, bandwidth and speed than Bluetooth and is therefore the best available choice for direct wireless communication between Android mobile devices. The existing parallel frameworks may be modified to incorporate a second P2P communication step between a group of devices. One device in each pair will construct a ServerSocket and the other a client Socket over which two-way communication can take place. This behaviour is illustrated in Fig. 14. Such socket protocols make blocking method calls such that execution on one device will not continue until the communicating device has acknowledged the connection request. This provides an implicit means of synchronisation across the collection of devices.
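A minimal sketch of this pairing is given below; the port number, message layout and class names are illustrative assumptions rather than a tested protocol.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.ServerSocket;
import java.net.Socket;

final class P2pHaloLink {
    static final int PORT = 8888;   // assumed port for the simulation service

    /** "Server" device: blocks until the peer connects (implicit synchronisation). */
    static Socket waitForPeer() throws Exception {
        try (ServerSocket server = new ServerSocket(PORT)) {
            return server.accept();
        }
    }

    /** "Client" device: connects to the group owner's address. */
    static Socket connectToPeer(String hostAddress) throws Exception {
        return new Socket(hostAddress, PORT);
    }

    /** Symmetric halo exchange over the established stream, once per time step. */
    static double[] exchangeHalo(Socket peer, double[] outgoing) throws Exception {
        DataOutputStream out = new DataOutputStream(peer.getOutputStream());
        DataInputStream in = new DataInputStream(peer.getInputStream());
        out.writeInt(outgoing.length);
        for (double v : outgoing) out.writeDouble(v);   // send this device's edge sites
        out.flush();
        int n = in.readInt();                           // receive the neighbour's halo
        double[] incoming = new double[n];
        for (int i = 0; i < n; i++) incoming[i] = in.readDouble();
        return incoming;
    }
}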
Performance increases offered by a multi-device arrangement will be offset by the additional overhead required to communicate between the distributed grids with each device required to pass halo data to a fixed number of adjacent devices. The design of this message passing protocol will be crucial in maintaining scalability of this approach as more and more devices are linked and concurrent message passing will be essential. This implementation is not tested here but left for a future publication.
11. Conclusions
In this work, two parallel design patterns for performing grid-based flow simulation on an Android-powered mobile device have been presented. Implementations of the two designs in Java have been compared in terms of the update performance of a lattice-Boltzmann grid for a varying number of threads/tasks. The task-based design is simpler to implement and to synchronise using atomic data types. Furthermore, this design performs better than the thread-based implementation due to its shared memory configuration where message packing, passing and unpacking is not required. It has also been demonstrated that the performance of the LBM kernel can be improved through use of the Android NDK; when implementing the LBM kernel in C++, application performance improved by approximately 20% when compared with the initial Java implementations. Although tests were limited to a specific Android device, general trends and conclusions are expected to hold considering the similar requirements of all mobile devices and operating systems.
Acknowledgements
This work was supported by Engineering and Physical Science Research Council Impact Accelerator Account (grant number: EP/K503782/1).
References
[1] Ventola CL. Mobile devices and apps for health care professionals: uses and benefits. Pharm Ther 2014;39(5):356-64.
[2] Anderson M. Technology device ownership: 2015. Technical report. Pew Research Centre; 2015.
[3] Succi S. The lattice Boltzmann equation: for fluid dynamics and beyond. New York: Oxford University Press; 2001.
[4] Poushter J. Smartphone ownership and internet usage continues to climb in emerging economies. Technical report. Pew Research Centre; 2015.
[5] Halpern M, Zhu Y, Reddi VJ. Mobile CPU's rise to power: quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction. In: Proceedings of the 2016 IEEE international symposium on high performance computer architecture (HPCA); 2016. p. 64-76.
[6] Iida Y, Hirabayashi M, Azumi T, Nishio N, Kato S. Connected smartphones and high-performance servers for remote object detection. In: Proceedings of the 2014 IEEE international conference on cyber-physical systems, networks, and applications (CPSNA); 2014. p. 71-6.
[7] Wang CX, Haider F, Gao X, You XH, Yang Y, Yuan D, et al. Cellular architecture and key technologies for 5G wireless communication networks. IEEE Commun Mag 2014;52(2):122-30.
[8] Patera AT, Urban K. High performance computing on smartphones. Snapshots Mod Math (MFO) 2016(6). doi:10.14760/SNAP-2016-006-EN.
[9] Bhatnagar PL, Gross EP, Krook M. A model for collision processes in gases. I. Small amplitude processes in charged and neutral one-component systems. Phys Rev 1954;94:511-25.
[10] Chen S, Doolen GD. Lattice Boltzmann method for fluid flows. Annu Rev Fluid Mech 1998;30(1):329-64.
[11] Schönherr M, Kucher K, Geier M, Stiebler M, Freudiger S, Krafczyk M. Multi-thread implementations of the lattice Boltzmann method on non-uniform grids for CPUs and GPUs. Comput Math Appl 2011;61(12):3730-43.
[12] Mawson MJ, Revell AJ. Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs. Comput Phys Commun 2014;185(10):2566-74.
[13] Asinari P, Ohwada T, Chiavazzo E, Rienzo AFD. Link-wise artificial compressibility method. J Comput Phys 2012;231(15):5109-43.
[14] Oracle. Multithreaded programming guide. http://docs.oracle.com/cd/E19455-01/806-5257/index.html; 2016 [accessed 08.09.16].
[15] Google. Best practices for performance. https://developer.android.com/training/best-performance.html; 2016 [accessed 01.07.16]. Android Developers.
[16] Gao C, Gutierrez A, Dreslinski RG, Mudge T, Flautner K, Blake G. A study of thread level parallelism on mobile devices. In: Proceedings of the 2014 IEEE international symposium on performance analysis of systems and software (ISPASS); 2014. p. 126-7.
[17] Gao C, Gutierrez A, Rajan M, Dreslinski RG, Mudge T, Wu CJ. A study of mobile device utilization. In: Proceedings of the 2015 IEEE international symposium on performance analysis of systems and software (ISPASS); 2015. p. 225-34.
[18] Gropp W, Lusk E, Doss N, Skjellum A. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Comput 1996;22(6):789-828.
[19] Dagum L, Menon R. OpenMP: an industry-standard API for shared-memory programming. IEEE Comput Sci Eng 1998;5(1):46-55.
[20] Wenisch P, van Treeck C, Borrmann A, Rank E, Wenisch O. Computational steering on distributed systems: indoor comfort simulations as a case study of interactive CFD on supercomputers. Int J Parallel Emergent Distrib Syst 2007;22(4):275-91.
[21] Linxweiler J, Krafczyk M, Tölke J. Highly interactive computational steering for coupled 3D flow problems utilizing multiple GPUs. Comput Vis Sci 2010;13(7):299-314.
[22] Hassan H. An interactive fluid dynamics game on the iPhone [Master's thesis]. Technische Universität München; 2009.
[23] Mawson M. Interactive fluid-structure interaction with many-core accelerators [Ph.D. thesis]. School of Mechanical, Aerospace & Civil Engineering, The University of Manchester; 2013.
[24] Koliha N, Janßen CF, Rung T. Towards online visualization and interactive monitoring of real-time CFD simulations on commodity hardware. Computation 2015;3(3):444.
[25] Google. Project Tango. https://developers.google.com/tango/; 2016 [accessed 01.07.16]. Google Developers.
[26] Rohde M, Kandhai D, Derksen JJ, van den Akker HEA. A generic, mass conservative local grid refinement technique for lattice-Boltzmann schemes. Int J Numer Methods Fluids 2006;51:439-68. doi:10.1002/fld.1140.
[27] Dawson M, Johnson G, Low A. Best practices for using the Java Native Interface. Technical report. IBM developerWorks; 2009. https://www.ibm.com/developerworks/library/j-jni/ [accessed 08.09.16].
[28] NVIDIA. NVIDIA CodeWorks for Android. https://developer.nvidia.com/codeworks-android [accessed 08.09.16].