Advances in Engineering Software xxx (xxxx) xxx-xxx

Contents lists available at ScienceDirect

Advances in Engineering Software

Journal homepage: www.elsevier.com/locate/advengsoft

Research paper

Interactive flow simulation using Tegra-powered mobile devices

Adrian R.G. Harwood*, Alistair J. Revell

School of Mechanical, Aerospace and Civil Engineering, The University of Manchester, Sackville Street, M1 3BB, United Kingdom

ABSTRACT

The ability to perform interactive CFD simulations on mobile devices allows the development of portable, affordable simulation tools that can have a significant impact in engineering design as well as teaching and learning. This work extends existing work in the area by developing and implementing a GPU-accelerated, interactive simulation framework suitable for mobile devices. The accuracy, throughput, memory usage and battery consumption of the application are established for a range of problem sizes. The current GPU implementation is found to be over 300× more efficient in terms of combined throughput and power consumption than a comparable CPU implementation. The usability of the simulation is examined through a new 'interactivity' metric which identifies the ratio of simulated convection to real-world convection of the same problem. This real-time ratio illustrates that large resolutions may increase throughput through parallelisation on the GPU, but this only partially offsets the decrease in simulated flow rate due to the necessary shrinking of the time step in the solver with increasing resolution. Therefore, targeting higher-throughput configurations offers GPU solvers little additional benefit for interactive applications, as simulations ultimately evolve at too slow a rate to facilitate interaction. The trade-off between accuracy, speed and power consumption is explored, with the choice of problem resolution ultimately characterised by the desired accuracy, flow speed and endurance of a given simulation. With current rates of growth in mobile compute power expected to continue, real-time simulation is expected to be possible at higher resolutions with a reduced energy footprint in the near future.


ARTICLE INFO

Keywords: Android, Mobile computing, Interactive simulation, Lattice Boltzmann Method, CUDA, Embedded computing

1. Introduction

The modelling of fluid flow is an essential part of engineering design and analysis. As a complementary aid to experimental testing, computer modelling offers flexible and efficient tools that can significantly accelerate the exploration and development of a design. In order to achieve a higher level of accuracy, increased degrees of discretisation are required, which increase the demand on computational resources. To date, the literature on the use of GPUs in flow simulation has tended to focus on demonstrating increased performance by using higher resolution. However, the analysis presented here demonstrates that this is accompanied by a reduction in simulated flow rate with respect to the real-time flow rate, due to the necessary shrinking of the time step in the solver with increasing resolution. Therefore, targeting higher throughput offers GPU solvers little additional benefit for interactive applications due to the reduction in the rate of interactivity.

However, with ever increasing computing power, steady improvements in accuracy for a given set of computing resources have been realised in the past 2-3 decades. Reliable and informative simulation results are attainable in a matter of hours or days depending on the specific trade-off between resources, accuracy and complexity dictated by the problem in question.

The hardware infrastructure required to run these computational fluid dynamics (CFD) studies usually includes a combination of local and remote devices. Local workstations are used to construct the numerical model and to analyse results, while remote hardware optimised for long-duration computation, known as high performance computing (HPC), is used to compute the solution. The amount of computing resource required scales with the accuracy and complexity of the simulation, and compromises are always made by the programmer and the user in this regard to manage the time and effort required to obtain a suitable solution.

This paper builds on previous work by the authors which proposes a conceptual framework for integrating mobile devices into existing CFD infrastructure [1]. In this work, a mobile device is defined as one that is not merely transportable but can be used while moving about; a laptop, for example, would not fall into this category. The motivation for integrating battery-powered devices like phones and tablets is to investigate means of harnessing the growing potential offered by mobile and mobile-connected cloud compute resources, in order to develop a different mode of usage for the computational modelling of how physical systems evolve in time. Specifically, we target the potential for interaction with a simulation offered by the integrated hardware of mobile devices. The scope of the current investigation is to explore the potential for local compute.

* Corresponding author. E-mail address: adrian.harwood@manchester.ac.uk (A.R.G. Harwood).

http://dx.doi.org/10.1016/j.advengsoft.2017.10.005

Received 10 July 2017; Received in revised form 17 August 2017; Accepted 18 October 2017

0965-9978/ © 2017 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Please cite this article as: Harwood, A., Advances in Engineering Software (2017), http://dx.doi.org/10.1016/j.advengsoft.2017.10.005

Although mobile devices offer significant potential for human-computer interaction through their integrated hardware, the raw compute power needed to run accurate engineering simulation locally is limited. In order for a simulation to be interactive, the solver must advance at a rate capable of presenting a perceived physical evolution of information to the user. Hence, the throughput of the device is an important capability, as is device endurance while running on a battery. Meeting the goals of both accuracy and speed is a significant challenge within the constraints of mobile device resources, and a balance must be struck between the two.

Graphics Processing Units (GPUs) offer massive parallelism of compute instructions and are already leveraged to accelerate simulations to real-time speeds in video games [2] and increasingly in engineering [3,4]. The ability to execute instructions with such a high degree of parallelism results in GPUs having a large amount of raw computing power, much more than a CPU [5,6]. However, the dramatic increase of processing cores on a given die means that GPU architectures are forced to sacrifice on-chip memory, control units and core complexity [3]. Hence, GPUs are mostly suited to performing operations where limited control is required. Especially suitable are operations which perform the same set of instructions on a small amount of highly local data. Thus, not all numerical simulation methods are capable of achieving efficient GPU-acceleration. A comparison between a typical CPU and GPU architecture is shown in Fig. 1.

The desire to increase the size and pixel density of mobile device screens has fuelled the development of dedicated mobile GPUs which are, more recently, also being leveraged for the rapidly-growing industry of mobile games. Their capability compared with their desktop counterparts is generally limited by physical size and passive cooling mechanisms, although the number of pixels to be updated is significantly lower and so the chip is sized accordingly. The implication of this for running GPU-accelerated simulation is an imposed ceiling on throughput and memory capacity.

The NVIDIA Tegra integrates a mobile GPU and CPU based on the same Kepler architecture as high-end desktop cards, although scaled down to fit into a mobile device. A number of power-saving optimisations have been implemented such that peak power consumption is typically less than 2 W [8], two orders of magnitude lower than desktop cards. Nevertheless, this chip powers mobile devices designed for gaming and may be considered one of the most powerful mobile GPUs presently available.

A comparison of technical data between the Tegra K1 and the high-end Kepler-based Tesla K40 desktop card used in typical HPC configurations is shown in Table 1. This comparison reveals an unsurprising difference in capacity which serves to manage the expectations of any GPU-accelerated mobile simulation. Regardless, the Tegra K1 GPU has significant compute potential when compared with a mobile CPU.

This work improves upon the original work of Harwood and Revell [1] in a number of ways. First, the requirements for an interactive simulation tool suitable for a mobile device are formalised and a generalised framework designed. Second, a robust implementation of this framework is developed. Finally, solver performance is improved by designing and implementing a solver module to run on a Tegra GPU incorporating best practice. The result is a proof of concept which is then examined for suitability. Section 2 introduces the framework, incorporating all the elements required for implementing local interactive simulation on a mobile device. This framework is then implemented for an Android tablet with a mobile GPU to form a benchmark implementation. Section 3 presents details of how a GPU-accelerated solver may be integrated into the framework as part of the solver component. The availability of a suitable toolchain for this process is a key enabler for implementing interactive solvers for mobile operating systems and architectures in general and is discussed in detail. Section 4 presents the results of a series of performance tests of the implemented framework exploring the key drivers of accuracy, speed and power consumption. Finally, Section 5 reinterprets these results in the form of a new metric which provides a much clearer indication of suitability of the application for interactive simulation.

Table 1
Specifications of the NVIDIA Tegra K1 and Tesla K40 GPUs.

Aspect                       Tegra K1    Tesla K40
Chip                         GK20A       GK110B
Compute capability           3.2         3.5
SMs                          1           15
CUDA cores                   192         2,880
Clock (GHz)                  0.96        0.875
Memory bus width             64-bit      384-bit
L2 cache (kB)                128         1,536
Memory type                  DDR3L       GDDR5
Memory bandwidth (GB/s)      17          288
Memory capacity (GB)         4           12
Max power consumption (W)    2           235

2. An interactive simulation framework for mobile devices

Mobile devices such as smartphones and tablets are designed with interactivity in mind, and incorporate elements of visualisation and human-device interaction into portable hardware. The present work demonstrates a software framework for developing interactive simulation based on the combination of three core components: an interaction interface, a numerical solver and a visualisation environment. Available mobile APIs offer all the tools required to leverage these capabilities.

Mobile operating systems (OS) offer functionality through apps and services. Since user interaction is an essential component, the framework is best designed around an app. However, the design must find a flexible means of integrating all three components to establish an extensible framework for a wide range of use cases. As mobile technology is changing rapidly, design for modularity is essential; different implementations of each component should be admissible without the need for significant changes to the app structure. To this end, we propose a framework as depicted in Fig. 2 which features a fixed app front-end and defines interfaces for each component to specify their basic behaviour. Interface definitions will vary as new requirements for interactive simulation apps emerge. The framework of Fig. 2 indicates some of the more obvious interfaces between components but should not be taken as a definitive list. The implementation of each component is expected to be operating-system specific. Furthermore, the simulation component should be geared towards maximum performance given that fast engineering simulation will require significant resources.

Fig. 1. Comparison between general CPU architecture (left) and GPU architecture (centre) (adapted from NVIDIA [7]) with a typical mobile architecture incorporating both with a shared memory component (right).

Fig. 2. An interactive simulation framework for mobile devices. The three key components are integrated through the implementation of prescribed interfaces.

The general framework in Fig. 2 is implemented in the rest of this paper and the resulting application tested to obtain the data needed to characterise a speed-accuracy-power relationship. The interaction and weighting between these requirements is a key metric for research into the potential for mobile devices to provide a fast and accurate simulation tool. Note that any data obtained will be influenced by hardware and the software implementation, but given the lack of existing data in the area, this work will be able to serve as a benchmark for future studies.

2.1. Framework implementation for Android devices

As the most widely-used mobile operating system [9], Android is selected as the development target. The Android software development kit (SDK) [10] allows developers to access hardware and OS components through a high-level API written in Java. Although Android runs on a wide range of devices, the Google Project Tango Development Kit is selected as the development device in this work. This device is a 7-inch Android tablet powered by the NVIDIA Tegra K1. The device is based on an ARMv7 architecture and uses a 4 + 1 Cortex-A15 R3 32-bit CPU arrangement. The 4 GB of memory on the device is shared between the CPU and GPU with up to 2 GB available for the GPU. The principal advantage of using a mobile device powered by NVIDIA is the ease of programming offered by a CUDA-capable GPU. However, the general framework in Fig. 2 supports a numerical solver implemented in a hardware-independent manner and does not rely on a CUDA-capable device.

The device runs Android 4.4.2 and the Tegra GPU supports CUDA 6.0. The Android SDK can be used to cross-compile for deployment. A portion of an application may be developed in C/C++ with an interface to the Java part of the application written using the Java Native Interface (JNI) [11]. The Android Native Development Kit (NDK) [12], in combination with the Android SDK, can then be used to compile the application.

The layout of the application layer consists of a central View widget which, on creation, instantiates an associated visualisation class. This occupies the majority of the screen and is designed to be the central mechanism for user interaction. Additional capabilities are offered through a series of icons as well as a more detailed menu. A text feed overlay is also offered and can be used by the components to display relevant information to the user. Fig. 3 illustrates the arrangement of these components on screen.

In order to be able to respond to user events, an app must have a main thread which remains idle during periods of user inactivity. Therefore the application layer is instantiated on the main thread but manages a background thread directly through which the numerical solver is accessed and iterated continuously. In Android, the main thread is also responsible for updating visualisation elements and hence the visualisation component must be instantiated on this thread as well. However, methods must run to completion in a short amount of time to allow the view hierarchy to be updated by the thread every 16 ms [10]. While the simulation is running, the solver thread calls the iterate() method of the solver repeatedly until the user requires the simulation to be paused. Before each iteration is performed, the solver thread also processes any requests which require solver data to be adjusted. A description of the implemented application layer behaviour is shown in Fig. 4.
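The threading arrangement described above can be sketched in outline. The following is a minimal, portable C++ illustration rather than the paper's Android Java code; the class and method names (Solver, SolverThread) are ours:

```cpp
#include <atomic>
#include <thread>

// Illustrative stand-in for the numerical solver component.
struct Solver {
    int steps = 0;
    void iterate() { ++steps; }          // one LBM time step
    void processPendingRequests() {}     // apply queued user changes
};

// The application layer owns the solver and drives it on a background
// thread so the main (UI) thread stays idle and responsive.
class SolverThread {
public:
    explicit SolverThread(Solver& s) : solver(s) {}
    void start() {
        running = true;
        worker = std::thread([this] {
            while (running) {
                solver.processPendingRequests(); // before each iteration
                solver.iterate();
            }
        });
    }
    void stop() {                        // called when the user pauses
        running = false;
        if (worker.joinable()) worker.join();
    }
private:
    Solver& solver;
    std::atomic<bool> running{false};
    std::thread worker;
};
```

On Android the same structure holds, with the addition that only the main thread may touch the view hierarchy, which is why the visualisation component lives there rather than on the solver thread.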

2.1.1. Flow solver component

The lattice-Boltzmann method (LBM) is used to simulate the fluid flow in terms of the discrete Boltzmann transport equation. The high spatial locality of the LBM makes it especially suitable for mass-parallelisation and execution on devices such as GPUs [3,13,14]. Rather than the transport quantity being a macroscopic variable such as density, or a microscopic property such as particle velocity, this equation models the transport of a set of mesoscopic particle distribution functions f. Nodal locations are represented as sites on an inter-connected lattice. The number of lattice links i dictates the admissible directions along which individual f_i values may be convected. A set of velocity vectors {c_i} is associated with the set of lattice links. The distribution function to be convected is subject to a particle collision operator Ω. The Boltzmann equation is summarised as Eq. (1).

∂f/∂t + c · ∇f = Ω    (1)

Fig. 3. A visualisation of the Android implementation of the application layer components of the framework (text feed overlay, interaction menu, and the combined visualisation and interaction interface).

Following spatial and temporal discretisation, the Boltzmann equation may be evaluated in two steps. The left-hand side of Eq. (1) represents the convection of f along lattice links to adjacent lattice sites - termed the streaming step. The values to be convected are computed by performing a collision operation on the current set of f values at a given site. The collision operation is carried out by relaxing the current distribution functions towards a local equilibrium function based on the Maxwell-Boltzmann distribution [15] - termed the collision step. Solid wall boundaries are implemented as simple bounce-back boundary conditions [16] for efficiency. For more details on the lattice-Boltzmann method see [4] and references therein.

The present implementation uses a simple BGK collision operator [17] and a uniform, single-grid arrangement. A Smagorinsky turbulence model [18] is available in the implementation but not enabled in this work to improve performance. Flow through the domain may be introduced using either the forcing scheme of Guo et al. [19] or simply a forced-equilibrium inlet/outlet boundary. A high-level description of the solver module is shown in Fig. 5 with details provided in the next section.
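The stream-collide update described above can be made concrete with a minimal CPU sketch of a D2Q9 BGK step on a fully periodic lattice. This is a simplification of the paper's GPU solver: boundaries, forcing and the turbulence model are omitted, and all names here are ours:

```cpp
#include <vector>

// D2Q9 lattice constants: link velocities and quadrature weights.
constexpr int Q = 9;
constexpr int cx[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
constexpr int cy[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
constexpr double w[Q] = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                         1.0/36, 1.0/36, 1.0/36, 1.0/36};

struct Lattice {
    int nx, ny;
    double tau;                      // BGK relaxation time
    std::vector<double> f, ftmp;     // particle distribution functions
    Lattice(int nx_, int ny_, double tau_)
        : nx(nx_), ny(ny_), tau(tau_), f(nx_*ny_*Q), ftmp(nx_*ny_*Q) {
        for (int n = 0; n < nx*ny; ++n)
            for (int i = 0; i < Q; ++i) f[n*Q + i] = w[i]; // rho=1, u=0
    }
    // One combined collide (BGK relaxation towards equilibrium) and
    // stream (shift along lattice links, periodic wrap) step.
    void step() {
        for (int y = 0; y < ny; ++y)
        for (int x = 0; x < nx; ++x) {
            double rho = 0, ux = 0, uy = 0;
            const double* fc = &f[(y*nx + x)*Q];
            for (int i = 0; i < Q; ++i) {
                rho += fc[i]; ux += fc[i]*cx[i]; uy += fc[i]*cy[i];
            }
            ux /= rho; uy /= rho;
            for (int i = 0; i < Q; ++i) {
                double cu = cx[i]*ux + cy[i]*uy;
                double feq = w[i]*rho*(1 + 3*cu + 4.5*cu*cu
                                         - 1.5*(ux*ux + uy*uy));
                int xs = (x + cx[i] + nx) % nx;  // destination site
                int ys = (y + cy[i] + ny) % ny;
                ftmp[(ys*nx + xs)*Q + i] = fc[i] - (fc[i] - feq)/tau;
            }
        }
        f.swap(ftmp);
    }
    double mass() const {
        double m = 0;
        for (double v : f) m += v;
        return m;
    }
};
```

Because the BGK collision relaxes towards an equilibrium with the same density, total mass is conserved to round-off, which makes a useful sanity check on any implementation.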

2.1.2. Interaction component

The interaction component implements methods to process the data collected by the application layer during a touch event or menu item selection. The Android API manages touch events through event methods that are triggered when the user touches the screen. In the present implementation, a single-point touch event handler is implemented in the application layer which passes the location of the touch event to the interaction component. The location data is converted to a discrete index and the region of the numerical domain marked as being a solid wall instead of fluid. When the application layer calls for the next iteration of the numerical solver, the effect of the boundary is processed by the solver thread before iterating proceeds as normal.
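The conversion from a touch location to a marked lattice site can be sketched as follows. This is a hypothetical helper rather than the paper's code, and it assumes the visualisation exactly fills the view:

```cpp
#include <vector>

// Marks the lattice cell under a touch point as a solid wall.
// (px, py): touch position in pixels; (viewW, viewH): view size in
// pixels; (nx, ny): lattice resolution.
struct Domain {
    int nx, ny;
    std::vector<bool> solid;
    Domain(int nx_, int ny_) : nx(nx_), ny(ny_), solid(nx_*ny_, false) {}

    void markSolidAtTouch(float px, float py, float viewW, float viewH) {
        int ix = static_cast<int>(px / viewW * nx);
        int iy = static_cast<int>(py / viewH * ny);
        if (ix < 0) ix = 0; if (ix >= nx) ix = nx - 1;  // clamp to domain
        if (iy < 0) iy = 0; if (iy >= ny) iy = ny - 1;
        solid[iy*nx + ix] = true;  // processed before the next iteration
    }
};
```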

Fig. 4. A description of the implementation of the application layer.

Fig. 5. A description of the implementation of the solver module and its interaction with the other components of the framework.

Fig. 6. A description of the implementation of the interaction module and its inclusion of the OpenCV library and the Android camera sub-system.

It is also possible in our implementation to capture geometry from the device's camera. The application layer handles the user request to use the camera and the Android API is invoked to switch to the OS camera extension. On returning from the camera application, a captured bitmap is passed to the interaction component to process. The OpenCV [20] libraries are linked to the application and are used to perform edge detection on the image. This is returned to the interaction component as an array of indices corresponding to areas of the simulation domain which should be marked as solid before the next iteration begins. In this way, the interaction component may be used to capture 2D geometry from the surroundings of the device. As with the other components, a sequence diagram of the interaction module behaviour is shown in Fig. 6.
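The mapping from a detected edge image to solid lattice sites might look like the following sketch. The edge detection itself, performed with OpenCV in the paper, is replaced here by a precomputed binary mask so the example stands alone; the function name is ours:

```cpp
#include <vector>

// Converts a binary edge mask (1 = edge pixel) of size w x h into a
// list of lattice indices to mark as solid on an nx x ny lattice.
std::vector<int> edgeMaskToSolidIndices(const std::vector<unsigned char>& mask,
                                        int w, int h, int nx, int ny) {
    std::vector<char> hit(nx*ny, 0);
    std::vector<int> indices;
    for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x) {
        if (!mask[y*w + x]) continue;
        int ix = x * nx / w;            // nearest-cell downsampling
        int iy = y * ny / h;
        int idx = iy*nx + ix;
        if (!hit[idx]) { hit[idx] = 1; indices.push_back(idx); } // dedupe
    }
    return indices;
}
```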

2.1.3. Visualisation component

Data may be presented on mobile devices using a wide variety of visual elements from simple flat bitmaps to full 3D navigable game-engine-driven worlds with fully lit components. This is possible through Android supporting solutions implemented using mid-level graphics APIs such as OpenGL ES and Vulkan as well as 3D environments built in Unity and Unreal Engine. Many recent mobile devices support virtual and augmented reality via peripheral headsets to afford the user an immersive experience if required.

In the general framework, the application layer should allow for variety by simply managing data requests from the visualisation component, however that component may be implemented. In our implementation, a simple flat bitmap is used to take 2D velocity information and represent magnitude as a range-adjusted colour map. Each magnitude, normalised with respect to the instantaneous maximum and minimum in the domain, is converted to an 8-bit number in the range 0-255. The Android API provides BitmapFactory classes to efficiently map these data to a 3-channel bitmap which is then rendered to the screen. A description of the visualisation module is shown in Fig. 7.
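The range-adjusted mapping from velocity magnitude to an 8-bit value can be sketched as below. This is an illustrative helper only; the final conversion to a 3-channel bitmap is handled in the paper by Android's BitmapFactory classes:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Normalises each velocity magnitude against the instantaneous
// min/max of the field and quantises to the range 0-255.
std::vector<std::uint8_t> magnitudesToBytes(const std::vector<double>& mag) {
    auto [mnIt, mxIt] = std::minmax_element(mag.begin(), mag.end());
    double mn = *mnIt, mx = *mxIt;
    double range = (mx > mn) ? (mx - mn) : 1.0;  // avoid divide-by-zero
    std::vector<std::uint8_t> out(mag.size());
    for (std::size_t i = 0; i < mag.size(); ++i)
        out[i] = static_cast<std::uint8_t>((mag[i] - mn) / range * 255.0);
    return out;
}
```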

3. Solver implementation and compilation

Early implementations of LBM on GPUs had to be written in shader language [21], a particularly tedious process. Fortunately, in recent years, a number of suitable APIs for programming GPUs have been developed. Other than graphics-oriented APIs such as OpenGL [22] or Microsoft DirectX, there are also more general-purpose APIs such as OpenCL [23] and NVIDIA CUDA [7]. The latter is a high-level API in the C language suitable for use with CUDA-capable GPUs. Although it offers ease of programming, it relies on a proprietary compiler wrapper to facilitate the build process and the resulting machine code may only be executed on NVIDIA GPUs at present. Since we are using a device with a Tegra GPU, we choose to implement the LBM using CUDA. The solver component of the framework then takes requests from the application layer and calls the appropriate CUDA run-time methods to execute time stepping on the GPU.

Fig. 7. A description of the implementation of the visualisation module.

Various implementations of LBM on a GPU using CUDA are in use [6,14,24-26]. Ours is adapted from an in-house solver originally developed for desktop GPUs. This adapted version wraps the CUDA components in another class which implements the required methods for integration into the framework. In order to fully leverage the capability of the GPU for parallel computation, the solver kernel is written with knowledge of the hardware limitations: best practices are used to mitigate such limitations and maximise performance of the solver. These include: limiting host-device communication; storing frequently-used data in cache memory rather than DRAM; using constant memory for run-time constants; organising data for coalesced memory access; minimising branch divergence to maximise instruction-level parallelism; and using pitched memory for aligned memory read/write. A thorough explanation of these practices can be found in [4,24,25].
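Of these practices, organising data for coalesced memory access is particularly important for LBM. The standard contrast is between an array-of-structures (AoS) layout and the structure-of-arrays (SoA) layout typically preferred on GPUs; the index helpers below are illustrative and not taken from the solver:

```cpp
// Two layouts for storing Q distribution functions at N lattice sites.

// AoS: f[site][i] - adjacent threads (adjacent sites) reading the same
// direction i touch addresses Q elements apart: uncoalesced on a GPU.
inline int idxAoS(int site, int i, int /*N*/, int Q) { return site*Q + i; }

// SoA: f[i][site] - adjacent threads reading the same direction i touch
// adjacent addresses: coalesced into few memory transactions.
inline int idxSoA(int site, int i, int N, int /*Q*/) { return i*N + site; }
```

With SoA indexing, consecutive threads access consecutive addresses for each lattice direction, which is the access pattern the GPU memory controller can service in the fewest transactions.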

As discussed in Section 2.1 as well as [1], cross-compilation of native C/C++ code is straightforward using the Android development tools. However, if part of the native code is also written in CUDA C, this standard toolchain configuration is not sufficient, as the compilation of CUDA code requires the NVIDIA wrapper as well as the CUDA run-time headers. In addition, at run-time, the application must dynamically link to the run-time libraries, which must be pre-compiled for the target architecture and available on the device.

As the use of CUDA is a programming decision, the implementation of the general framework at the level of abstraction described by Fig. 2 should be agnostic to the language or API selected. In order to separate the CUDA implementation from the solver component our solution groups all CUDA API calls into a sub-set of classes which may be precompiled as a static library at the command line using the wrapper. This library is then linked to the rest of the Android application through the native portion of the solver component. In order to call the methods within the library, the native part of the solver component merely includes a standard C++ header that defines an interface class of outward-facing methods with a CUDA-enabled inward-facing implementation. This design is illustrated in Fig. 8.
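The separation described above, a plain C++ header whose implementation hides all CUDA calls, follows the familiar pimpl pattern. A sketch is given below; the names are ours and the CUDA body is replaced by a CPU stub so the example stands alone:

```cpp
#include <memory>

// Outward-facing interface: a standard C++ header with no CUDA types,
// so the Android/NDK side can compile against it without nvcc.
class GPUGrid {
public:
    GPUGrid(int nx, int ny);
    ~GPUGrid();
    void iterate();              // would launch the stream-collide kernel
    int iterationCount() const;
private:
    struct Impl;                 // defined in the nvcc-compiled source
    std::unique_ptr<Impl> impl;
};

// Inward-facing implementation: in the real library this part is
// compiled by nvcc into a static library; here it is a CPU stub.
struct GPUGrid::Impl {
    int nx, ny;
    int iterations = 0;
};

GPUGrid::GPUGrid(int nx, int ny) : impl(new Impl{nx, ny}) {}
GPUGrid::~GPUGrid() = default;
void GPUGrid::iterate() { ++impl->iterations; }  // kernel launch in reality
int GPUGrid::iterationCount() const { return impl->iterations; }
```

Because the header exposes no CUDA types, swapping the static library for a different backend (OpenCL, or a CPU fallback) requires no change to the rest of the application.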

Within the library, a simulation is initialised on instantiation of the library interface. This populates host and device memory with simulation grid data. A call to perform an iteration results in the launching of a CUDA kernel by the host (CPU) which spawns the specified number of worker threads. These threads complete a stream-collide operation on the GPU by manipulating DRAM arrays. Data may then be copied back to the host at defined intervals as required which in turn is accessible through the library interface.

CUDA C cross-compilation is only possible where NVIDIA compilation tools are available for the combination of target and development platform. Cross-compilation of CUDA for ARMv7 is, at present, only possible within an Ubuntu Linux x64 v14.04 development environment [7]. The Tegra Android Development Pack (TADP) v3.0R4 includes the CUDA Development Kit, associated compilers, the Android SDK and the Android NDK which enables all compilation necessary for this work.

Fig. 8. A breakdown of the solver component which implements the flow solver as a custom CUDA C/C++ library component.

Compilation is a two-step process: first the CUDA toolkit is used to compile the library; then the Android NDK + SDK toolchain is invoked to build the app and link to the library. Run-time dependencies are packaged along with the app and deployed to the device.

4. Performance: accuracy, speed and power consumption

The implementation of the complete framework described above yields an interactive, GPU-accelerated flow simulation around user-defined geometry. The implementation is tested to ascertain performance and suitability, and the trade-off between speed and accuracy is explored. The performance of the present solver is also compared to the performance of previous work by Harwood and Revell [1], who implemented a similar LBM solver in Android to run on the CPU.

The usual performance measure of an LBM calculation is given in terms of the number of lattice sites that can be updated per second, which also indicates the throughput of the processor(s). The measure used in the literature is that of million lattice updates per second (MLUPS). The results of Harwood and Revell find a serial CPU performance of 0.4 MLUPS for a Java-based solution of a problem of 192 × 93 = 17,856 lattice sites. However, they indicate the potential for up to 1.1 MLUPS using both multi-threading and a native library implementation on the same problem size. The mass parallelism of the GPU is expected to raise performance significantly beyond this.
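The MLUPS metric itself is straightforward to compute from a timed run; as a small illustrative helper:

```cpp
// Million Lattice Updates Per Second: the standard LBM throughput
// measure. sites = lattice sites updated per time step.
double mlups(long long sites, long long iterations, double seconds) {
    return static_cast<double>(sites) * iterations / seconds / 1.0e6;
}
```

For example, a 17,856-site problem advanced 1000 steps in 44.64 s corresponds to 0.4 MLUPS, matching the serial CPU figure quoted above.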

In this study, the same metric is used to assess performance characteristics of a GPU implementation of the LBM on the same Android mobile device as used previously. In addition, memory and battery consumption are measured and a new metric is introduced to quantify the utility of a mobile CFD application.

4.1. Accuracy test

The accuracy of the LBM solver is established for a Poiseuille flow (a 2D flow between parallel plates). It is an incompressible, laminar flow with a steady solution and a typical benchmark case for assessing the accuracy of low Reynolds number LBM simulations [27]. This case also represents the background flow on which the app is based.

The application is configured to simulate this flow at a Reynolds number Re = 50 by applying a periodic boundary condition on the left and right boundaries and a no-slip boundary condition on the top and bottom walls. A body force is used to drive the flow from an initially stationary condition. The body force, Eq. (2a) [4], is selected based on the desired centreline velocity in the channel, umax = 2uref. The converged, span-wise velocity profile, Eq. (2b) [27], is used to assess accuracy.

F = (12ν / H²) u_ref    (2a)

U(y) = (F / 2ν) y (H − y)    (2b)
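As a sanity check, the standard plane Poiseuille relations can be evaluated numerically: for a channel of height H and kinematic viscosity ν, a body force F = 8 ν u_max / H² produces the parabolic profile U(y) = (F / 2ν) y (H − y) with centreline velocity u_max. A minimal Python sketch with illustrative parameter values (not the paper's lattice units):

```python
def body_force(nu, u_max, H):
    """Force density giving centreline velocity u_max in a channel of height H."""
    return 8.0 * nu * u_max / H**2

def poiseuille_u(y, F, nu, H):
    """Analytic span-wise velocity profile for force-driven channel flow."""
    return F / (2.0 * nu) * y * (H - y)

nu, H, u_max = 0.01, 1.0, 0.2                     # illustrative values only
F = body_force(nu, u_max, H)
print(round(poiseuille_u(H / 2, F, nu, H), 6))    # centreline -> 0.2
print(poiseuille_u(0.0, F, nu, H))                # no-slip wall -> 0.0
```

The profile is zero at both walls and recovers the prescribed centreline velocity at y = H/2, which is the property exploited by the error measure below.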

The case is simulated using a range of increasing resolutions to demonstrate the capability of the current LBM implementation to accurately predict the background flow. Parameters are initialised as listed in Table 2 and the resolution is increased by factors of two; time step sizes are refined in the same manner. The base time step is chosen as the largest possible value while limiting the lattice Mach number to 17% of the lattice sound speed.

The solution is deemed converged when the span-wise velocity profile, measured using five probes evenly spaced across the channel width, changes by less than 0.1% between time steps. The convergence of the centreline velocity relative error is shown in Fig. 9. The results clearly illustrate that the LBM implementation demonstrates second-order convergence and is capable of simulating the Poiseuille flow with centreline errors of less than 0.5% even at the coarsest resolution used.
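The observed order of convergence can be estimated from the errors at two successive resolutions as log₂(e_coarse / e_fine) when the resolution doubles. The error values below are synthetic, generated to follow the second-order trend reported here; they are not measured data from the paper:

```python
import math

def observed_order(e_coarse, e_fine, refinement=2.0):
    """Convergence order from errors at two resolutions differing by `refinement`."""
    return math.log(e_coarse / e_fine) / math.log(refinement)

# Synthetic centreline errors decaying as 1/L^2 across doubled resolutions
errors = {24: 4.0e-3, 48: 1.0e-3, 96: 2.5e-4}
print(observed_order(errors[24], errors[48]))  # close to 2, i.e. second order
```

Applying this estimator to each adjacent pair of resolutions is the usual way the "second-order" label on Fig. 9 would be verified.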

4.2. Speed test

In order to assess the performance of the LBM library, a set of tests is designed which vary kernel launch parameters as well as grid size. For a given problem resolution, a one-to-one match between threads and lattice sites is assumed, herein referred to as 100% thread occupancy. Kernel launch parameters are set by the programmer and specify the number and grouping of worker threads to be spawned on the GPU when a kernel is launched. Specifically, the number of threads in a block for a fixed block aspect ratio is the free parameter chosen here. The kernel, excluding any host-device memory transfer operations, is timed and the throughput for each configuration quantified in terms of MLUPS.
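For reference, the mapping from a chosen warps-per-block count to launch parameters can be sketched as below. The 32-thread warp size is the NVIDIA standard; the block aspect ratio and grid dimensions shown are illustrative and not necessarily those used in the paper:

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def launch_params(nx, ny, warps_per_block, block_x=32):
    """Block and grid dimensions giving one thread per lattice site (100% occupancy)."""
    threads_per_block = warps_per_block * WARP_SIZE
    block_y = threads_per_block // block_x
    # Ceiling division so the grid of blocks covers every lattice site
    grid_x = (nx + block_x - 1) // block_x
    grid_y = (ny + block_y - 1) // block_y
    return (block_x, block_y), (grid_x, grid_y)

block, grid = launch_params(192, 96, warps_per_block=3)
print(block, grid)  # -> (32, 3) (6, 32)
```

When the grid dimensions divide evenly, as here, every spawned thread maps to exactly one lattice site, which is the 100% occupancy condition described above.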

The performance of the kernel is typically sensitive to these settings from the perspective of warp scheduling and memory coalescence [3].

Table 2

Base LBM parameters for Poiseuille flow benchmark.

Parameter                    Initial value
Resolution (lu/L)            24
Domain size (W × H)          2L × L
Time step (Δt)               0.00409
Reynolds number (Re)         50
Reference velocity (u_ref)   0.0981
Mach number (Ma)             0.17

Fig. 9. Illustration of the second-order convergence of the LBM for the background channel flow. The highest resolution cases have forcing terms that approach the floating-point limit, which degrades accuracy slightly.

If too few warps are assigned to a given block, processing core utilisation can fall below acceptable limits, limiting throughput. The specification of multiple warps within a given block allows the time taken to fetch data from memory to be at least partially disguised. This is referred to as the hiding of memory latency and can be achieved if the SM executes a second warp while the memory transaction of an earlier warp is still pending.

The grid sizes and flow parameters are varied in exactly the same way as for the accuracy tests. Block sizes are computed such that, where possible, each block contains an integer number of warps, from 1 to 6. For the case of 5 warps per block, the thread occupancy was less than 100%, so these tests are omitted for consistency. Times are recorded using the system clock each time step and averaged over 20,000 time steps. Since GPU execution is asynchronous, the clock is only stopped once the call to cudaDeviceSynchronize() returns, which indicates that kernel execution has finished.

Memory usage is recorded by using the cudaMemGetInfo() method in the CUDA API to state GPU memory usage before and after simulation initialisation. The actual memory usage is then estimated as twice this value to account for the host copy of the grid information.

Results of the performance tests are quantified in terms of throughput in Fig. 10. A peak performance of 76.68 MLUPS is obtained for the largest grid tested. This is not quite the peak throughput of the GPU itself. As GPU implementations of the LBM are limited by memory bandwidth [3,4], the MLUPS is expected to saturate once this hard limit imposes itself. This saturation is not yet visible for the largest grid tested. However, further testing in increments of 16 in L up to a grid size of 2048 × 1024 indicates that the peak throughput is ~80 MLUPS.

Fig. 10. Performance in MLUPS (left-hand scale) for different numbers of warps per block for each problem size. Memory usage is represented by the solid line (right-hand scale).
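An order-of-magnitude check of this bandwidth ceiling is straightforward: each site update must, at minimum, read and write all nine distribution functions of the D2Q9 lattice. The sketch below assumes double precision and a nominal ~15 GB/s memory bandwidth for the Tegra K1; both are assumptions on my part, not figures from the paper:

```python
def bandwidth_limited_mlups(bandwidth_gbs, q=9, bytes_per_value=8):
    """Upper bound on MLUPS if each update reads and writes q distributions."""
    bytes_per_update = 2 * q * bytes_per_value  # one read + one write per distribution
    return bandwidth_gbs * 1e9 / bytes_per_update / 1e6

limit = bandwidth_limited_mlups(15.0)  # assumed Tegra K1 bandwidth in GB/s
print(round(limit, 1))
```

This crude bound of roughly 100 MLUPS sits plausibly above the ~80 MLUPS observed, since a real kernel also moves macroscopic fields and boundary data and rarely achieves full coalescence.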

The throughput of the kernel is relatively insensitive to the number of warps used, the exception being the case where only a single warp is used per block. The reason is the inability of warp scheduling to hide the memory access latency: if there are no other warps which can be executed while the first warp is waiting for memory access, then the memory latency is exposed.

Memory usage is also indicated in Fig. 10. The Tegra GPU has 2 GB of video memory, which is a subset of the overall device memory. The largest case tested in this work uses only 207 MB of memory (~10% of the GPU memory). The memory allocation per lattice site works out at 184 bytes, including both the host and device copies of the information. This footprint could be reduced if the application made use of the zero-copy API call cudaHostGetDevicePointer(), which removes the need for redundant host-device copies where memory is shared between the CPU and GPU. For maximum re-usability of the LBM library, the zero-copy pattern was not used.
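The quoted 184 bytes per site (host plus device copies) can be turned into a simple footprint estimator. The paper does not give the breakdown of those bytes, so only the headline figure is used here:

```python
BYTES_PER_SITE = 184  # host + device copies, as measured above

def footprint_mb(n_sites):
    """Estimated total memory in MB for a lattice of n_sites sites."""
    return n_sites * BYTES_PER_SITE / 1e6

def sites_for_budget(budget_mb):
    """How many lattice sites fit within a given memory budget in MB."""
    return int(budget_mb * 1e6 // BYTES_PER_SITE)

# The 207 MB largest case corresponds to roughly 1.1 million sites
print(sites_for_budget(207))
```

The same estimator shows that even the full 2 GB of video memory would accommodate only around ten times the largest grid tested, so memory remains a secondary constraint here.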

4.3. Power consumption test

Power consumption is tested by recording the battery level, starting from full, every 10 minutes over an hour of simulation at each grid resolution used for the accuracy and speed tests. All radio and Wi-Fi systems are disabled and the screen brightness set to maximum to ensure a consistent background level of power consumption. Two identical devices are used to run all the tests in case of any device-specific variation and the results are averaged.

The amount of energy used in one hour for each case is shown in Fig. 11, along with the energy required to simulate a physical second. The first metric indicates how long the device will last before the battery drains, whereas the second indicates how much energy is required to simulate one physical second of the real problem. The latter is a more practical interpretation of the data as it takes into account the time step used for a given problem configuration.

There is a clear trend: the larger the problem size, the more battery is used. This is expected, as more threads execute concurrently as resolution increases, and hence GPU occupancy and memory usage increase accordingly. Assuming a constant rate of energy usage beyond the one-hour measurement period, the associated endurance of the device can be calculated. Thus, the smallest simulation is projected to run for 7.69 h and the largest simulation for 2.85 h before exhausting the battery.
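The endurance projection is a simple extrapolation of the measured drain rate over the one-hour test. The drain percentages below are back-calculated from the quoted 7.69 h and 2.85 h figures, so they are illustrative rather than reported measurements:

```python
def projected_endurance_hours(percent_drained_per_hour):
    """Hours until a full battery is exhausted at a constant drain rate."""
    return 100.0 / percent_drained_per_hour

# Drain rates consistent with the quoted endurance figures
print(round(projected_endurance_hours(13.0), 2))   # smallest case, ~7.69 h
print(round(projected_endurance_hours(35.1), 2))   # largest case, ~2.85 h
```

The constant-rate assumption is optimistic near the end of discharge, so these figures are best read as upper bounds on endurance.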

The amount of energy used to simulate a physical second increases at a variable rate through the resolution range. This increase is driven by the smaller and smaller time step used for the simulation as the resolution increases. The increased throughput of the device as resolution increases limits the rate at which the energy consumption per physical second grows. As the resolution increases beyond L = 96, the number of iterations completed in an hour begins to fall more rapidly as the throughput gains diminish. Ultimately, once the power required to complete iterations reaches the peak power consumption of the GPU, further increases in resolution only shrink the time step further, and the energy cost of simulating a physical second is then expected to scale in line with the time step and resolution.

5. New performance metrics for interactive CFD

Although Fig. 10 provides an indication of lattice site throughput, and hence the raw computing power of the GPU, for an interactive application there are more appropriate metrics of performance. The user of an interactive simulation perceives performance as the rate at which flow convects through the virtual environment. If the flow moves too slowly, the user must wait some time for the flow to respond to interaction by an appreciable amount, reducing its usefulness. Conversely, if the flow moves too quickly, it is harder for the user to see the behaviour, which has a similar effect on its usefulness.

Fig. 11. Variation in energy usage in kWh with resolution illustrated by the blue line (left-hand scale). The amount of energy used to simulate a physical second is illustrated by the red line (right-hand scale). (For interpretation of the references to colour in this figure caption, the reader is referred to the web version of this article.)

An important metric is the ratio of simulated time to wall-clock time, Tr(L, Δt). This real-time ratio indicates how fast the simulation evolves with respect to reality and takes into account the combination of spatial resolution L and time step size Δt. A ratio equal to 1 means that the simulation is evolving in real time, i.e. 1 physical second takes 1 second of processor wall-clock time to simulate.
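In terms of measured quantities, the real-time ratio can be expressed as simulated seconds per wall-clock second: the number of time steps completed per second multiplied by the physical time step. The check below uses the 14.83 MLUPS and 192 × 96 figures reported in Table 3, together with a time step of 0.00409/4 obtained by assuming the base step halves with each doubling of resolution; that time step value is an assumption on my part:

```python
def real_time_ratio(mlups, nx, ny, dt):
    """Simulated seconds per wall-clock second: steps per second x dt."""
    steps_per_second = mlups * 1e6 / (nx * ny)
    return steps_per_second * dt

# 192 x 96 grid at 14.83 MLUPS with an assumed dt = 0.00409 / 4
tr = real_time_ratio(14.83, 192, 96, 0.00409 / 4)
print(round(tr, 2))  # close to the 0.832 reported in Table 3
```

The formula makes explicit why Tr can fall even as MLUPS rises: doubling the resolution quadruples the site count and shrinks dt, so throughput gains must outpace both to hold Tr constant.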

The real-time ratio, energy to simulate a physical second and accuracy are plotted against resolution in Fig. 12a. The accuracy metric is computed as the reciprocal of the centreline error of Fig. 9, normalised with respect to the minimum error achieved during the test. It is important to note that Tr decreases with resolution despite the throughput (MLUPS) of the GPU increasing. Further increases in grid resolution, even up to the bandwidth-limited maximum throughput, offer little practical gain for interactive applications. As the GPU throughput begins to saturate, the decrease in the real-time ratio accelerates. For the problem studied here, real-time simulation is achieved at a resolution of approximately L = 102. The energy used to simulate one second of the simulation is slightly over 10⁻⁴ kWh and the centreline error of the Poiseuille flow at this resolution is of the order of 10⁻⁴ %. Further increases in resolution improve the accuracy but use more energy and slow the rate of time evolution such that the simulation runs slower than real time.

In Fig. 12b, the normalised values of Tr and energy consumption are plotted on a log-log scale. Clearly identifiable is a sharp decrease in efficiency at a normalised accuracy of 0.9, signifying that further gains in accuracy through resolution increases incur a significant extra energy cost and reduce the real-time ratio dramatically. In practice, this level of accuracy would not be achievable or practical on the current generation of mobile GPU hardware.

5.1. Comparison to CPU performance

It is useful to visualise these new results alongside those published in [1] for the same application with the kernel running on the CPU. Harwood and Revell ran a channel flow with resolution 192 × 93. The nearest resolution in this work is 192 × 96, which is sufficiently close that comparisons remain relevant. The comparison in Table 3 uses the peak theoretical performance of the 6-task, native CPU implementation as the reference case and the best-performing warps-per-block configuration of the current GPU implementation.

The comparison illustrates a speed-up in terms of throughput for the GPU implementation by a factor of 13.4, with power consumption per physical second improved by a factor of 22.5. This equates to an improvement in computational efficiency of over 300 ×. Memory consumption is slightly higher for the GPU implementation. Crucially, the real-time ratio is 21 × higher than for the CPU implementation, with the GPU simulation approaching the real-time threshold at this combination of resolution and time step; the CPU implementation evolved at only 4% of real time.
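Recomputing the combined efficiency factor directly from the Table 3 values, as the throughput gain multiplied by the energy-per-physical-second gain, gives just over 300 ×:

```python
# Values from Table 3: CPU (6 tasks) vs GPU (3 warps/block)
cpu_mlups, gpu_mlups = 1.1, 14.83
cpu_energy, gpu_energy = 5.23, 0.232    # kWh x 10^-3 per physical second

throughput_gain = gpu_mlups / cpu_mlups   # roughly 13.5x
energy_gain = cpu_energy / gpu_energy     # roughly 22.5x
combined = throughput_gain * energy_gain  # just over 300x overall
print(round(throughput_gain, 1), round(energy_gain, 1), round(combined))
```

Treating efficiency as the product of the two gains assumes throughput and energy improvements are independent, which is reasonable here since both were measured for the same grid and configuration.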

6. Future work

Much of the focus of this work has been on developing an efficient implementation of the general interactive simulation framework suitable for GPU-enabled Android devices. Particular attention has been paid to the performance of the LBM implementation in order to give credence to the proof of concept presented here. There are elements of the interactive framework which may be extended and improved to enhance capability.

Implementation may exploit emerging work from other fields such as computer science and computer graphics to improve upon this concept demonstrator. For instance, the visualisation of the flow field is performed using a bitmap, populated by a CPU-based loop. Since Android devices support OpenGL ES, such elements of the application may also leverage the GPU to generate more impressive-looking and efficient visualisation techniques. Other commonly-used techniques for visualising fluid flow such as stream- and streak-line visualisation [28] or surface texture techniques require more complicated post-processing and are not usually attempted on-the-fly [29] but simplified versions could be implemented in 2D if desired.

The interactive components too are limited simply to touch response to add solid boundaries. More time invested in implementing gesture handlers could allow dynamic adjustment of flow configuration such as velocity. Finally, although at present only a single 2D boundary can be extracted from a single image, a sequence of photos may be used in future extensions to infer 3D geometry using photogrammetric techniques [30].

The concept of linking multiple devices together in a local network, as discussed in [1], may be developed for this GPU-oriented design. Ignoring the cost of data communication, multiplying the number of GPUs used for the LBM simulation kernel described in this work will unlock a significant amount of additional throughput. For a given problem this will allow the resolution to be increased while maintaining the real-time ratio. In order to keep the real-time ratio at an acceptable level, a higher number of GPUs running at reduced capacity will be required to scale the solution to higher resolutions. Such a configuration will not be limited by memory, as is typical of GPU-accelerated simulation, but rather by the number of devices in the network and the network connection speed. It may also be useful to investigate the combination of CPU and GPU implementations along similar lines to Valero-Lara and Jansson [31].

Fig. 12. (a) Variation in real-time ratio, accuracy and energy consumption with resolution. (b) Variation in real-time ratio and energy consumption with accuracy. Each quantity has been normalised with respect to its maximum to facilitate comparison. In the case of the accuracy data, the relative error has been normalised with respect to the minimum and inverted to ensure a lower error is interpreted as a higher accuracy.

Table 3

Comparison between the CPU implementation of [1] and the present GPU implementation for a similarly sized grid.

Measure                                CPU (6 tasks)    GPU (3 warps/block)
MLUPS                                  1.1              14.83
Memory usage (MB)                      5.9              6.0
Energy/physical second (kWh × 10⁻³)    5.23             0.232
Real-time ratio (Tr)                   0.0395           0.832

Finally, the utility of the application may be extended by including structural modelling. This could be coupled to the LBM fluid solver to allow simulation of fluid-structure interaction problems. Care will need to be taken over how the workload is managed between the two solvers [32].

7. Conclusions

In this work a general framework for implementing an interactive simulation on a mobile device has been proposed. A 2D, low Reynolds number interactive channel flow simulation has been implemented within the framework protocols, with a back-end which targets a mobile NVIDIA processor. A vertical integration of CUDA, C++, JNI and Java has been used to incorporate the accelerated simulation kernel into a typical Android app. The LBM-based simulation kernel has been built as a library and linked to an Android application which supports interaction and visualisation. The performance of the kernel has been measured for different grid sizes and kernel launch configurations. Results have also been compared to a similar CPU implementation where data is available.

The accuracy of the simulation is established for the background Poiseuille flow on which the interactive environment is based. A modest resolution of 192 × 96 achieves a maximum centreline error of less than 0.1%, and hence it can be concluded that this resolution offers a good balance between accuracy and performance. The second-order convergence of the LBM is observed.

The performance of the GPU kernel demonstrates a factor-13 improvement in lattice site throughput compared with the best multi-threaded CPU implementation of Harwood and Revell [1]. At the same time, the power used to simulate a physical second is a factor of 22 lower. Memory usage for the GPU implementation is higher, since the current design requires both host and device copies of the data for easy transfer. However, the memory usage even for the largest case tested is insignificant compared with the total memory available.

The practical limitation of the application is not the hard limit of memory bandwidth responsible for capping the maximum lattice site throughput. Instead, the need to minimise the compressibility error of the LBM solver forces the time step to shrink at least at the same rate as the grid resolution for a given problem. Consequently, the ratio of simulated time to wall-clock time is computed as a more suitable measure of the performance and utility of a GPU-LBM simulation in the current context. This ratio depends on the lattice throughput as well as on the spatial and temporal discretisation of the simulation.

Results show that the gains in lattice throughput due to increased problem size do not offset the slow-down due to the decreasing time step: the simulated flow rate with respect to real time decays at approximately second order up to the memory bandwidth saturation limit. Once this saturation point is reached, the simulated flow rate perceived by the user decays at the expected fourth order. This imposes a significant restriction on the way in which the LBM can be used to achieve real-time or interactive simulations. Other methods may not face the same issue and should be explored for their potential.

Overall, the framework proposed offers a great deal of flexibility for future work. The specific implementation on an Android device is accurate at the low Reynolds numbers tested and performance is acceptable at mid to lower resolutions for interactive applications. The GPU implementation offers significant benefits over the CPU implementation with minimal additional complexity. Although the current implementation uses NVIDIA CUDA, and is thus limited to NVIDIA GPUs, the results of this study demonstrate the importance of developing the tools required to implement a hardware-independent solution in the future.

Acknowledgement

This work was supported by the Engineering and Physical Sciences Research Council Impact Accelerator Account (grant number: EP/K503782/1).

References

[1] Harwood ARG, Revell AJ. Parallelisation of an interactive lattice-Boltzmann method on an Android-powered mobile device. Adv Eng Softw 2017;104(C):38-50. http://dx.doi.org/10.1016/j.advengsoft.2016.11.005.

[2] Macklin M, Müller M, Chentanez N, Kim T-Y. Unified particle physics for real-time applications. ACM Trans Graph 2014;33(4). 153:1-153:12.

[3] Mawson M. Interactive fluid-structure interaction with many-core accelerators. School of Mechanical, Aerospace & Civil Engineering, The University of Manchester; 2013. Ph.D. thesis.

[4] Delbosc N. Real-time simulation of indoor air flow using the lattice Boltzmann method on Graphics Processing Units. School of Mechanical Engineering, The University of Leeds; 2015. Ph.D. thesis.

[5] Lee VW, Kim C, Chhugani J, Deisher M, Kim D, Nguyen AD, et al. Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. SIGARCH Comput Archit News 2010;38(3):451-60.

[6] Kuznik F, Obrecht C, Rusaouen G, Roux J-J. LBM based flow simulation using GPU computing processor. Comput Math Appl 2010;59(7):2380-92.

[7] NVIDIA. CUDA C programming guide. http://docs.nvidia.com/cuda/cuda-c-programming-guide, Accessed: 12th May 2017.

[8] NVIDIA Corporation. Whitepaper: NVIDIA Tegra K1. Tech. Rep.; 2013.

[9] Anderson M. Technology device ownership: 2015. Tech. Rep.. Pew Research Centre; 2015.

[10] Google. Android SDK API guide. https://developer.android.com/guide/index.html, Android Developers, Accessed: 12th May 2017.

[11] Dawson M, Johnson G, Low A. Best practices for using the Java Native Interface. Tech. Rep. IBM developerWorks; 2009. URL https://www.ibm.com/developerworks/library/j-jni/.

[12] Google. Android NDK API guide. https://developer.android.com/ndk/index.html, Android Developers, Accessed: 12th May 2017.

[13] Linxweiler J, Krafczyk M, Tölke J. Highly interactive computational steering for coupled 3D flow problems utilizing multiple GPUs. Comput Vis Sci 2010;13(7):299-314.

[14] Koliha N, Janßen CF, Rung T. Towards online visualization and interactive monitoring of real-time CFD simulations on commodity hardware. Computation 2015;3(3):444.

[15] Sharp K, Matschinsky F. Translation of Ludwig Boltzmann's paper "On the relationship between the second fundamental theorem of the mechanical theory of heat and probability calculations regarding the conditions for thermal equilibrium", Sitzungsberichte der Kaiserlichen Akademie der Wissenschaften, Mathematisch-Naturwissenschaftliche Classe, Abt. II, LXXVI 1877, pp. 373-435 (Wien. Ber. 1877, 76:373-435). Reprinted in Wiss. Abhandlungen, Vol. II, Reprint 42, pp. 164-223, Barth, Leipzig, 1909. Entropy 2015;17(4):1971-2009.

[16] Ziegler DP. Boundary conditions for lattice Boltzmann simulations. J Stat Phys 1993;71(5):1171-7.

[17] Bhatnagar PL, Gross EP, Krook M. A model for collision processes in gases. I. Small amplitude processes in charged and neutral one-component systems. Phys Rev 1954;94:511-25.

[18] Yu H, Girimaji SS, Luo L-S. DNS and LES of decaying isotropic turbulence with and without frame rotation using lattice Boltzmann method. J Comput Phys 2005;209(2):599-616.

[19] Guo Z, Zheng C, Shi B. Discrete lattice effects on the forcing term in the lattice Boltzmann method. Phys Rev E 2002;65:046308.

[20] OpenCV Development Team. OpenCV 2.4.13.2 documentation - feature detection. http://docs.opencv.org/2.4/modules/imgproc/doc/feature_detection.html, Accessed: 10th June 2017.

[21] Li W, Wei X, Kaufman A. Implementing lattice Boltzmann computation on graphics hardware. Vis Comput 2003;19(7):444-56.

[22] Segal M, Akeley K. The design of the OpenGL graphics interface. Tech. Rep. Silicon Graphics Computer Systems; 1994.

[23] Stone JE, Gohara D, Shi G. OpenCL: a parallel programming standard for heterogeneous computing systems. Comput Sci Eng 2010;12(3):66-73.

[24] Mawson MJ, Revell AJ. Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs. Comput Phys Commun 2014;185(10):2566-74.

[25] Delbosc N, Summers J, Khan A, Kapur N, Noakes C. Optimized implementation of the Lattice Boltzmann Method on a graphics processing unit towards real-time fluid simulation. Comput Math Appl 2014;67(2):462-75. Mesoscopic Methods for Engineering and Science (Proceedings of ICMMES-2012, Taipei, Taiwan, 23-27 July 2012).

[26] Khan MAI, Delbosc N, Noakes CJ, Summers J. Real-time flow simulation of indoor environments using lattice Boltzmann method. Build Simul 2015;8(4):405-14. http://dx.doi.org/10.1007/s12273-015-0232-9.

[27] He X, Zou Q, Luo L-S, Dembo M. Analytic solutions of simple flows and analysis of nonslip boundary conditions for the lattice Boltzmann BGK model. J Stat Phys 1997;87(1-2):115-36.

[28] van Wijk JJ. Image based flow visualization for curved surfaces. IEEE Visualization, 2003. 2003. p. 123-30.

[29] Laramee RS, Hauser H, Doleisch H, Vrolijk B, Post FH, Weiskopf D. The state of the art in flow visualization: dense and texture-based techniques. Comput Graphics Forum 2004;23(2):203-21.

[30] Sellers WI, Hirasaki E. Markerless 3d motion capture for animal locomotion studies. Biol Open 2014.

[31] Valero-Lara P, Jansson J. Heterogeneous CPU + GPU approaches for mesh refinement over lattice-Boltzmann simulations. Concurr Comput 2017;29(7).

[32] Valero-Lara P. Accelerating solid-fluid interaction based on the immersed boundary method on multicore and GPU architectures. J Supercomput 2014;70(2):799-815.