Scholarly article on topic 'A Survey and Measurement Study of GPU DVFS on Energy Conservation'

A Survey and Measurement Study of GPU DVFS on Energy Conservation Academic research paper on "Electrical engineering, electronic engineering, information engineering"

CC BY-NC-ND
Keywords
{"Graphics processing unit" / "Dynamic voltage and frequency scaling" / "Energy efficiency"}

Abstract of research paper on Electrical engineering, electronic engineering, information engineering, author of scientific article — Xinxin Mei, Qiang Wang, Xiaowen Chu

Abstract Energy efficiency has become one of the top design criteria for current computing systems. Dynamic voltage and frequency scaling (DVFS) has been widely adopted by laptop computers, servers, and mobile devices to conserve energy, while GPU DVFS is still at an early stage. This paper aims at exploring the impact of GPU DVFS on application performance and power consumption, and furthermore, on energy conservation. We survey the state-of-the-art GPU DVFS characterizations, and then summarize recent research works on GPU power and performance models. We also conduct real GPU DVFS experiments on NVIDIA Fermi and Maxwell GPUs. According to our experimental results, GPU DVFS has significant potential for energy saving. The effect of scaling core voltage/frequency and memory voltage/frequency depends not only on the GPU architecture, but also on the characteristics of the GPU applications.


Author's Accepted Manuscript

A Survey and Measurement Study of GPU DVFS on Energy Conservation

Xinxin Mei, Qiang Wang, Xiaowen Chu

www.elsevier.com/locate/dcan

PII: S2352-8648(16)30073-6

DOI: http://dx.doi.org/10.1016/j.dcan.2016.10.001

Reference: DCAN55

To appear in: Digital Communications and Networks

Cite this article as: Xinxin Mei, Qiang Wang and Xiaowen Chu, A Survey and Measurement Study of GPU DVFS on Energy Conservation, Digital Communications and Networks, http://dx.doi.org/10.1016/j.dcan.2016.10.001

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


journal homepage: www.elsevier.com/locate/dcan


Xinxin Mei (a), Qiang Wang (a), Xiaowen Chu (a, b, *)

(a) Department of Computer Science, Hong Kong Baptist University, Hong Kong S.A.R., China

(b) Institute of Research and Continuing Education, Hong Kong Baptist University, Shenzhen, 518057, China


© 2016 Published by Elsevier Ltd.

KEYWORDS: Graphics processing unit, dynamic voltage and frequency scaling, energy efficiency

1. Introduction

Graphics processing units (GPUs) have become prevalent accelerators in current high-performance clusters. They substantially boost the performance of a great number of applications in many commercial and scientific fields, such as bioinformatics [1, 2], computer communications [3, 4, 5], and machine learning [6, 7, 8], especially the emerging deep learning [9, 10, 11]. In the TOP500 supercomputer list [12] as of June 2016, 94 systems are equipped with accelerators, and 69 of them are equipped with GPUs [13]. CPU-GPU hybrid computing is more energy efficient than traditional many-core parallel computing [13, 14]. However, such high-performance clusters still consume a lot of energy, and powering them remains a great expense. For example, the Titan supercomputer, 3rd in the TOP500 list as of this writing, is accelerated by 18,688 NVIDIA Tesla K20X GPUs with a power supply of 8.21 million Watts, which costs about 23 million dollars per year [15]. Given that saving even a few percent of energy can substantially reduce the electricity cost, efficient GPU power management becomes indispensable for GPU-accelerated data centers and supercomputers.

* Corresponding author. Email: chxw@comp.hkbu.edu.hk.

One of the promising power management strategies is dynamic voltage and frequency scaling (DVFS) [16, 17], which refers to changing the processor voltage/frequency during task processing. It is effective in either saving energy or improving performance. The CPU DVFS technology is well developed and has been adopted in both personal computing devices and large-scale clusters [18]. Despite the maturity of CPU DVFS, the study of GPU DVFS started only a few years ago. According to existing studies, simply transplanting the CPU DVFS strategy to GPU platforms could be ineffective [19, 20]. For example, scaling up the processor frequency (described as "racing" in [21]) has proven to be energy efficient for CPUs but not always for GPUs [21, 22]. We summarize some challenges of the GPU DVFS study as follows. First, the available information on GPU hardware and power management is very limited. Second, accurate quantitative tools for estimating GPU DVFS performance and power are lacking. Lastly, the GPU architecture design is advancing so fast that the same DVFS strategy may have different outcomes on different generations of GPUs.

In [23], Mittal et al. surveyed the research work on analyzing and improving the energy efficiency of GPUs, including GPU DVFS. In contrast to their broad scope, this paper focuses on the current status of GPU DVFS studies. We aim at understanding the impact of GPU DVFS on performance and power consumption, especially for recent NVIDIA GPU products. Our contributions are twofold. First, we summarize the most up-to-date GPU DVFS studies and the GPU performance and power modeling techniques. Our work provides state-of-the-art investigation and observation on GPU DVFS. Second, we conduct DVFS measurement experiments on recent NVIDIA Fermi and Maxwell platforms. Our experimental results can serve as GPU DVFS benchmarks, and our experimental findings reveal the similarities and the differences of GPU DVFS effects on two generations of GPU platforms.

The rest of this paper is organized as follows. Section 2 presents the GPU architecture and voltage/frequency scaling interface across five generations of NVIDIA GPUs. Section 3 characterizes the impact of GPU voltage and frequency scaling. Section 4 reviews the latest GPU DVFS power modeling work, including both empirical and statistical methods. The GPU DVFS performance modeling is discussed in Section 5. In Section 6, we conduct real frequency scaling on the Maxwell GPU and voltage/frequency scaling on the Fermi GPU. We analyze the scaling effects and summarize the findings. We conclude our work in the last section.

2. Background

In this section, we introduce some fundamental knowledge on the GPU architecture and the GPU voltage/frequency scaling interface.

2.1. GPU Architecture

Fig. 1 shows a brief block diagram of the NVIDIA Maxwell GTX980 GPU. A GPU board contains the GPU chipset and the GPU memory. The GPU consists of the L2 cache and multiple stream multiprocessors (SMs). Fig. 2 illustrates the block diagram of the SM of a Maxwell GPU. In the literature, a floating point (FP) processing unit is usually referred to as a GPU core. The number of FP units and other micro-units on an SM varies across GPU products. The SMs and the L2 cache are connected via memory controllers to the GPU memory module, which includes multiple GDDR RAM chips or the more recent stacked HBM.

[Figure 1 diagram omitted. Labeled blocks: stream multiprocessor (*8), texture/L1 cache, shared memory, register, instruction cache, GPU core, memory controller, GPU memory.]

Fig. 1: The block diagram of NVIDIA GTX980 GPU board.

[Figure 2 diagram omitted. Labeled blocks: stream multiprocessor, instruction cache, texture/L1 cache, shared memory, instruction buffer, warp scheduler, dispatch unit, FP (core).]

Fig. 2: The stream multiprocessor of NVIDIA GTX980 GPU.

As of 2016, NVIDIA has launched five generations of GPUs, as listed in Table 1. The compute capability (CA) is used by NVIDIA to distinguish the different architectures of its GPU products. While the microarchitectures of the Fermi to Pascal GPUs are based on similar designs, there is a big difference between the Tesla GPUs and the later GPUs: the cache system. For the first-generation Tesla GPUs, normal data access is not cached [24, 25]. The introduction of the cache system has further boosted the performance of a great number of applications. It is vital to consider the influence of the GPU caches on the application performance as well as on energy consumption [26, 27].

There are some popular benchmarks to appraise the performance of GPUs: CUDA SDK [28], Rodinia [29], Parboil [30], etc. Developers also use some simulators for thorough GPU-based study. Some of the prevalent simulators include: gem5 [31] on CPU-GPU hybrid systems, GPGPUSim [32] on the compiling and execution of GPU codes, and the subsequent GPUWattch [33] on the GPU energy footprint. These software tools can help understand the behavior of different applications in terms of performance or power consumption.

Table 1: Generations of NVIDIA GPUs

Micro-architecture | Year | CA | Feature size
Tesla | 2006 | 1.x | >55 nm
Fermi | 2009 | 2.x | 45 nm
Kepler | 2012 | 3.x | 28 nm
Maxwell | 2014 | 5.x | 28 nm
Pascal | 2016 | 6.x | 16 nm

Table 2: The P-states of NVIDIA GTX980 GPU

State | P0 | P2 | P5 | P8
Core voltage (V), default | 0.987 | 0.987 | 0.85 | 0.85
Core voltage (V), range | [0.987, 1.187] | [0.987, 1.187] | [0.85, 1.187] | -
Core freq. (MHz), default | 540 | 540 | 405 | 135
Core freq. (MHz), range | [380, 1310] | [380, 1310] | [380, 1310] | -
Mem. freq. (MHz), default | 3500 | 3000 | 810 | 324
Mem. freq. (MHz), range | [2100, 3600] | [2100, 3600] | [380, 1080] | -

2.2. GPU DVFS

For CMOS circuits, the total power consumption, denoted by $P_{total}$, is decomposed into dynamic and static parts. We list the power partition in Equation (1), where $P_{dynamic}$ stands for the dynamic power, which is generated when transistors switch their states; $P_{leakage}$, $P_{short\text{-}circuit}$ and $P_{DC}$ together stand for the static power. A detailed elaboration of the power decomposition is given in [34].

$$P_{total} = P_{dynamic} + P_{leakage} + P_{short\text{-}circuit} + P_{DC} \qquad (1)$$

Equation (2) gives the general form of the dynamic power, where a denotes the average utilization ratio, C denotes the total capacitance, V denotes the chip supply voltage and f denotes the operating frequency [16].

$$P_{dynamic} = a C V^2 f \qquad (2)$$

DVFS changes the runtime supply voltage and the frequency, and it mainly affects the dynamic power. For the early processing units, the dynamic power accounts for the majority of power consumption, but nowadays the static power is also contributing considerably [35, 36].
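As a rough illustration of Equation (2), the following minimal sketch compares the dynamic power at two voltage/frequency settings. It assumes that the utilization ratio a and the capacitance C stay unchanged between the two settings, which is a simplification; the function name is hypothetical, and the two voltages are simply the ends of the P0 core-voltage range in Table 2.

```python
def dynamic_power_ratio(v_new, f_new, v_ref, f_ref):
    """Ratio P_dynamic(new) / P_dynamic(ref) under P = a*C*V^2*f.

    Assumes a (utilization) and C (capacitance) are identical at both
    settings, so they cancel out of the ratio.
    """
    return (v_new / v_ref) ** 2 * (f_new / f_ref)


# Voltage-only scaling at a fixed frequency: 1.187 V down to 0.987 V.
ratio_voltage = dynamic_power_ratio(0.987, 1000.0, 1.187, 1000.0)

# Frequency-only scaling at a fixed voltage: 1000 MHz down to 500 MHz.
ratio_frequency = dynamic_power_ratio(1.0, 500.0, 1.0, 1000.0)

print(f"voltage-only ratio:   {ratio_voltage:.3f}")
print(f"frequency-only ratio: {ratio_frequency:.3f}")
```

Under these assumptions, lowering the voltage reduces the dynamic power quadratically, whereas lowering only the frequency reduces it linearly. If the runtime of a compute-bound task grows in inverse proportion to the frequency, frequency-only scaling would leave the dynamic energy roughly unchanged, which is one way to see why voltage scaling matters for energy conservation.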

In general, GPU boards have two sets of adjustable voltage/frequency: the core voltage/frequency, and the memory voltage/frequency. The core and memory voltage refer to the supply voltage of the GPU SMs and the DRAM, respectively. The core frequency affects the SM execution speed, while the memory frequency affects the DRAM I/O throughput.

On some NVIDIA products, the GPU core/memory voltage and frequency are almost continuously scalable within a wide range, with the help of proper over-clocking tools [37, 38]. Recently, NVIDIA has introduced the concept of P-states. A P-state defines a combination of GPU voltage and frequency settings. For example, on our ASUS Strix GeForce GTX 980 (OC edition), there are at least 4 P-states: P0, P2, P5 and P8, whose default voltage/frequency settings as well as the allowed scaling ranges are listed in Table 2. P8 is the idle state, which consumes little energy but cannot run any tasks. P5 offers a wide scaling range for core voltage and frequency, but it can only support very low memory frequency. P2 is a powerful state which can provide the highest voltage and frequency. P0 has the same scaling ability as P2 except for its higher default memory frequency. Notice that the scaling range can be vendor-dependent, and the GPU may not work reliably if the voltage is set too high.

NVIDIA offers the NVIDIA Management Library (NVML) [39] and the NVIDIA System Management Interface (nvidia-smi) [40] to monitor its GPU P-states. Some third-party software tools, like the NVIDIA Inspector [41] and the Afterburner [42], can manually adjust the voltage/frequency with a certain level of flexibility. NVIDIA has also launched GPU Boost [43], an embedded thermal-constrained DVFS system, which makes manual voltage scaling rather difficult. In contrast, AMD and some SoC platforms provide more user-friendly GPU voltage/frequency scaling interfaces.

3. GPU DVFS Characterization

There have been a number of recent studies on GPU DVFS, which are conducted through either real experiments or computer simulations. The experimental studies refer to those scaling the voltage or frequency of GPUs in reality, and in this paper, we mainly focus on the NVIDIA GPUs. The simulation studies refer to those scaling on simulators, like GPUWattch, or those lacking practical experimental results. Most of the experimental studies applied the GPU frequency scaling only, due to the limited support of GPU voltage scaling tools. The simulation studies discuss various scaling approaches, including the GPU core number scaling, per-core DVFS, etc., benefitting from the more flexible scaling interfaces. Both the experimental and simulation results suggest that the GPU DVFS is effective in conserving energy.

3.1. Experimental Studies

Jiao et al. scaled the core frequency and the memory frequency of an NVIDIA Tesla GTX280 GPU with three typical applications: the compute-intensive dense matrix multiply, the memory-intensive dense matrix transpose, and the hybrid fast Fourier transform (FFT) [44]. The three applications showed different performance and energy efficiency curves under the same core-memory frequency settings: the dense matrix multiply was insensitive to memory frequency scaling, FFT benefited from low core frequency and high memory frequency, while dense matrix transpose needed both high core and high memory frequency. They also found that the energy efficiency was largely determined by the instructions per cycle (IPC) and the ratio of the number of global memory transactions to the number of computation transactions.

Ma et al. designed an online management system that integrated the GPU dynamic core and memory frequency scaling and the CPU-GPU workload division [45]. On their testbed, an NVIDIA GeForce 8800, the GPU dynamic frequency scaling alone saved about 6% of system energy and 14.5% of GPU energy.

Ge et al. applied fine-grained GPU core frequency scaling and coarse-grained GPU memory frequency scaling on a Kepler K20c GPU, and compared the effects to those of CPU frequency scaling [19]. They found that for dense matrix multiply, both the GPU power and the GPU performance were linear in the GPU core frequency, and the GPU energy consumption was insensitive to frequency scaling. For their three tested applications, the highest GPU frequency always resulted in the best energy efficiency, unlike with CPU DVFS.

In our previous work, we scaled the core voltage, the core frequency and the memory frequency of the Fermi GTX560Ti GPU, with a set of 37 GPU applications [38]. We found that the effect of GPU DVFS depends on the application characteristics. The optimal setting to consume the least energy was a combination of appropriate GPU memory frequency and core voltage/frequency. We observed an average of 20% reduction of energy consumption with only 4% of performance loss.

Abe et al. combined GPU core frequency, GPU memory frequency and CPU core frequency scaling on the NVIDIA Fermi GTX480 GPU [20]. They performed the frequency scaling with dense matrix multiply of various matrix sizes. They could save as much as 28% of the system energy with a small matrix size, low GPU memory frequency and high GPU core frequency. They then extensively scaled the GPU core and memory frequency of 4 GPU products, including the Tesla GTX285, Fermi GTX460/GTX480, and Kepler GTX680, with 33 popular applications [22]. They set both the core and the memory frequency to low, medium and high values, and searched for the optimal core-memory frequency combination that offered the best power efficiency. Surprisingly, they found that for the Kepler GTX680 the default frequency configuration was never optimal, while the opposite held for the Tesla GTX285. They could reduce as much as 75% of system energy within 30% of performance loss, for a compute-intensive workload on the Kepler GPU. Their results suggested that DVFS was even more appealing for recent GPUs.

You and Chung designed a performance-guaranteed DVFS algorithm for the Mali-400MP GPU on a SoC platform [46]. They found that the GPU utilization ratio was not tightly correlated with the GPU performance, and that the on-demand DVFS provided by the SoC system was inadequate because it wasted a certain amount of power.

Jiao et al. studied the GPU core and memory frequency scaling for two concurrent kernels on the Kepler GT640 GPU [47]. They took a set of kernels from the CUDA SDK and Rodinia benchmark and measured their energy efficiency (GFlops/Watt) with different core-memory frequency settings. They demonstrated that the concurrent kernel execution in combination with GPU DVFS can improve energy efficiency by up to 34.5%.

The above measurement studies offer the ground truth that the GPU DVFS is effective in saving energy, and meanwhile does not sacrifice much performance. For recent Kepler GPUs, DVFS is even more promising in energy-efficient computing.
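Several of the studies above search for the core-memory frequency setting that consumes the least energy while bounding the performance loss. A minimal sketch of such a search is given below, assuming that energy is the product of average power and runtime. The function name and all measurement values are hypothetical and serve only as an illustration; they are not taken from any of the cited studies.

```python
def best_setting(measurements, baseline, max_slowdown):
    """Pick the frequency setting with the lowest energy.

    measurements: {(core_mhz, mem_mhz): (avg_power_watts, runtime_seconds)}
    baseline:     key of the default setting, used as the runtime reference
    max_slowdown: tolerated relative runtime increase, e.g. 0.05 for 5%
    Returns (setting, energy_joules).
    """
    base_runtime = measurements[baseline][1]
    best = None
    for setting, (power, runtime) in measurements.items():
        if runtime > base_runtime * (1.0 + max_slowdown):
            continue  # violates the performance constraint
        energy = power * runtime  # E = P * t
        if best is None or energy < best[1]:
            best = (setting, energy)
    return best


# Hypothetical measurements, for illustration only.
measurements = {
    (1000, 3500): (150.0, 10.0),
    (1000, 2100): (120.0, 10.3),
    (800, 3500): (125.0, 12.0),
    (800, 2100): (95.0, 12.4),
}

strict = best_setting(measurements, (1000, 3500), max_slowdown=0.05)
relaxed = best_setting(measurements, (1000, 3500), max_slowdown=0.30)
print("5% slowdown budget: ", strict)
print("30% slowdown budget:", relaxed)
```

With these made-up numbers, relaxing the tolerated slowdown lets the search move to a lower-power setting, which mirrors the trade-off between energy saving and performance loss reported in the measurement studies.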

3.2. Simulation Studies

In [48], Lee et al. simulated GPU DVFS as well as core number scaling in GPGPUSim, based on the 32nm predictive technology model [49], with the objective of improving the throughput. Their scaling scheme can provide an average of 20% higher throughput than the baseline GPU.

Leng et al. developed GPUWattch, which could simulate cycle-level GPU core voltage/frequency scaling, based on the Fermi GTX480 GPU [33]. They configured the various GPU voltage/frequency settings according to the 45nm predictive technology model [49, 50], and simulated both slow off-chip and prompt on-chip DVFS. They gained an average of 13.2% energy saving with off-chip DVFS and 14.4% energy saving with on-chip DVFS, both within 3% performance loss. For either scaling scheme, they found that the memory-bound kernels benefited a lot, but the purely compute-bound kernels did not take much advantage of DVFS.

Sethia et al. designed a dynamic runtime system that scales the GPU core number and the core and memory frequency, to either conserve energy or improve performance [51]. They categorized the GPU applications into 3 types: compute-intensive, memory-intensive, and cache-sensitive, according to GPUWattch characterizations. For each application category and scaling objective, they designed different scaling strategies. Their system reduced energy consumption by 15% in the energy-saving mode.

Sen et al. applied the fine-grained per-core DVFS in GPUWattch, in view of the diverse execution time and workload of different GPU cores [52]. They found the per-core DVFS had good potential to save more power than the overall DVFS.

Motivated by the fact that scaling down the core voltage was vital to conserve energy, Gopireddy et al. designed a new architecture that enabled a lower operating voltage in the energy-efficiency mode, in addition to the normal voltage in the high-performance mode [53]. Their simulation results showed that the new hardware could reduce as much as 48% of energy consumption, compared to the conventional hardware with normal DVFS.

In summary, GPU DVFS has proven to be effective in energy conservation for a variety of applications, but its impact varies widely across applications. Researchers need to design specific scaling algorithms based on the application characteristics.

4. GPU DVFS Runtime Power Modeling

In this section, we survey the runtime GPU power modeling work. We classify the studies as either empirical or statistical, where the former rely on binary code analysis and the latter rely on program performance counters. The empirical method is a bottom-up approach and requires a breakdown of the GPU micro-architecture, while the statistical method treats the GPU hardware as a black box and seeks statistical relationships between the GPU power and the runtime performance counters.

4.1. Empirical Methods

The empirical power modeling method was first presented by Isci and Martonosi to measure the power consumption of a Pentium 4 processor [54]. They manually decomposed the whole board into separate hardware components. For each component, they estimated the maximum power consumption and computed the access rate. The total power consumption was the summation over these components. Equation (3) shows the mathematical form of the empirical power model, where $P_1, P_2, \dots, P_n$ are the maximum power consumption of the n sub-components, $r_1, r_2, \dots, r_n$ are the access rates of the sub-components, and $P_0$ is a constant parameter.

$$P = P_0 + P_1 r_1 + P_2 r_2 + \dots + P_n r_n \qquad (3)$$
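Equation (3) can be evaluated directly once the per-component maximum power values and access rates are known. The sketch below is a minimal illustration; the function name, the component names and all numeric values are hypothetical and do not come from [54] or from any GPU measurement.

```python
def empirical_power(p0, max_power, access_rate):
    """Evaluate P = P0 + sum_i P_i * r_i for named hardware components.

    p0:          constant power term in watts
    max_power:   {component: maximum power P_i in watts}
    access_rate: {component: access rate r_i in [0, 1]}
    """
    if set(max_power) != set(access_rate):
        raise ValueError("both dictionaries must describe the same components")
    return p0 + sum(max_power[name] * access_rate[name] for name in max_power)


# Hypothetical component values, for illustration only.
max_power = {"fp_unit": 40.0, "global_memory": 20.0}
access_rate = {"fp_unit": 0.5, "global_memory": 0.25}

total = empirical_power(80.0, max_power, access_rate)
print(f"estimated power: {total:.1f} W")
```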

Hong and Kim utilized this method for a GTX280 GPU [35]. They estimated the access rates based on the dynamic number of instructions and the execution cycles of separate GPU units, where the number of instructions was based on the binary PTX code analysis, and the execution cycles were based on the pipeline analysis. They then designed a suite of micro-benchmarks to search for the values of P1,...,Pn that gave the minimum error between the measured power and the computed power. With the above two approaches, they obtained the baseline runtime GPU power consumption. They also built a power/temperature increase model to account for the fact that the GPU runtime power increases as the chip temperature rises. The final GPU power consumption was the sum of the baseline power consumption and the increment. They achieved 2.5% of prediction error when evaluating the micro-benchmarks, and 9.2% of error when evaluating the integrated GPGPU kernels. Besides, the model also considered the influence of the number of active SMs, and the authors used the model to study GPU energy conservation with fewer SMs. The authors also extended the study to the Fermi GPU, by involving cache-stressing micro-benchmarks and adjusting the model parameters [36].

Leng et al. integrated Hong and Kim's power modeling with GPGPUSim to form GPUWattch, which could estimate the runtime GPU power with different voltage/frequency settings at cycle level [33]. The authors refined Hong and Kim's model with a large number of micro-benchmarks, to overcome the power uncertainties brought by the new Fermi hardware. On the Fermi GTX480 GPU, the prediction error was 15% for the micro-benchmarks, and 9.9% for the general GPU kernels. Besides, the model could capture the runtime power phase behaviours. They validated the model at a rate of 2 mega-samples per second. GPUWattch provided a convenient online scalable simulation platform, and was widely used in recent GPU DVFS studies.

Both Hong's and Leng's models had outstanding performance and were widely adopted by subsequent researchers [51, 52, 55, 56]. On the other hand, some researchers also pointed out that the models were product-specific and that it was difficult to tune the parameters when applying them to other GPUs [22]. Sen and Wood derived a simple power model that mainly relied on the processing time of each core [52]. Their model achieved high qualitative similarity with GPUWattch. In [27], Rhoo et al. also stated that the power estimation by the simple IPC-based model [57] had more than 90% agreement with that by GPUWattch.

4.2. Statistical Methods

Some researchers built statistical models for the GPU runtime power consumption. They used software to monitor the runtime signals of the GPU-accelerated applications, and fitted or trained the power model based on the observed signals. This approach treats the GPU micro-architecture as a black box, and seeks relationships between the GPU runtime power consumption and micro-architecture events. We summarize the related studies in Table 3, including their target devices, statistical methods, studied benchmarks, etc.

In Table 3, SVR, SLR, RF and GAM stand for support vector regression, square linear regression, random forest and generalized additive models, respectively. These traditional statistical models fit the linear relationship tightly. They give the contribution of each input variable directly. Equation (4) gives the general form of the traditional regression model, where $x_1, x_2, \dots, x_n$ are the n input variables, P is the power consumption, and $a_0, a_1, \dots, a_n$ are the output contributions. The mathematical representation is similar to that of the empirical methods.

$$P = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_n x_n \qquad (4)$$
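To illustrate how the coefficients of Equation (4) can be obtained from observations, the sketch below fits them by ordinary least squares using the normal equations. This is only one possible fitting procedure and is not claimed to be the one used in the surveyed studies; the function name is hypothetical, and the counter values and powers are synthetic, generated from a known linear model so that the fit can be checked.

```python
def fit_linear_power_model(samples, powers):
    """Least-squares fit of P = a0 + a1*x1 + ... + an*xn.

    samples: list of rows [x1, ..., xn] (e.g. performance counter values)
    powers:  list of measured power values, one per row
    Returns the coefficients [a0, a1, ..., an].
    """
    rows = [[1.0] + [float(x) for x in row] for row in samples]
    k = len(rows[0])
    # Normal equations (X^T X) a = X^T y, stored as an augmented matrix.
    aug = []
    for i in range(k):
        lhs = [sum(r[i] * r[j] for r in rows) for j in range(k)]
        rhs = sum(r[i] * y for r, y in zip(rows, powers))
        aug.append(lhs + [rhs])
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("input variables are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Synthetic data generated from P = 70 + 0.5*x1 + 0.2*x2 (illustration only).
samples = [[10, 5], [20, 15], [30, 5], [40, 25], [50, 10]]
powers = [70.0 + 0.5 * x1 + 0.2 * x2 for x1, x2 in samples]

coefficients = fit_linear_power_model(samples, powers)
print([round(c, 3) for c in coefficients])
```

Because the synthetic data are exactly linear, the fit recovers the generating coefficients. On real measurements the residual error would indicate how well a linear model describes the power behaviour.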

Some modern techniques such as artificial neural networks (ANN) and K-means clustering are also applied in the literature. Fig. 3 demonstrates an ANN, where x1, x2,..., xn denote the n input variables and P denotes the power consumption. Every arrow in the figure represents a model parameter. The researchers configure the ANN structure and train it with a set of training data; after the training, the system obtains all model parameters that can achieve a certain level of prediction accuracy. Compared to the traditional models, the neural network approach addresses the nonlinear dependency of the input variables.

Fig. 3: An example of ANN. Each circle represents a node. A number of vertically aligned nodes form a layer. If an ANN has many nodes or layers, the topology would be very complicated and the computation/training process would consume much time.

Table 3: Summary of statistical GPU power modeling studies

Study | Year | Device | Method | Input variables | Benchmarks | Software
[58] | 2009 | NVIDIA 8800GT | SVR | 5 busy signals | 10 OpenGL benchmarks | NVIDIA PerfKit [59]
[37] | 2010 | NVIDIA Tesla GTX285 | SLR | 13 CUDA performance counters | 41 kernels in CUDA SDK and Rodinia | NVIDIA CUDA Profiler [60]
[61] | 2011 | NVIDIA Tesla GTX280 | RF | 22 GPGPUSim characteristics | 52 kernels in CUDA SDK, Rodinia and Parboil | GPGPUSim
[62] | 2011 | AMD Radeon™ HD5870 | RF | 23 Stream Profiler counters | 78 OpenCL kernels in ATI Stream SDK [63] | ATI Stream Profiler [64]
[65] | 2013 | NVIDIA Fermi GTX480 | SLR | 12 CUDA performance counters | 20 OpenCL applications in CUDA SDK and Rodinia | NVIDIA CUDA Profiler
[66] | 2013 | A cluster with 4 NVIDIA Tesla M2050 cards | Transformed SLR, GAM | 8 CUDA performance counters | 4 scientific CUDA applications | NVIDIA CUDA Profiler
[22] | 2014 | NVIDIA Tesla GTX285; Fermi GTX460, GTX480; Kepler GTX680 | SLR | 10 performance counters, 3 core frequencies and 3 memory frequencies | 33 kernels in CUDA SDK, Rodinia and Parboil | NVIDIA CUDA Profiler
[67] | 2013 | NVIDIA Fermi C2075 | BP-ANN | 10 CUDA performance events | 20 kernels in CUDA SDK | NVIDIA CUDA Profiler, NVML
[68] | 2015 | AMD Radeon™ HD 7970 | K-means, ANN | 22 CodeXL performance counters | 108 OpenCL kernels in Rodinia, Parboil, etc. | AMD CodeXL [69]

Ma et al. applied support vector regression to build a GPU power model based on five signals [58]. Notice that the software, variables and benchmarks in [58] were based on graphics applications, while the others in Table 3 were based on general-purpose GPU applications. In [37], Nagasaka et al. found that, apart from the constant part (70% of the contribution), the instruction count and the global memory accesses contributed the most to the GPU runtime power. In [61], Chen et al. simulated the runtime GPU characteristics in the cycle-level GPU simulator, GPGPUSim, which could decode the kernels into separate hardware instructions. Their random forest model suggested that the register, single-precision floating-point, global memory, integer and arithmetic logic instructions were the most influential variables. Zhang et al. applied similar techniques to an AMD GPU [62]. Karami et al. measured the power consumption of a Fermi GPU with OpenCL applications. They used principal component analysis to select only a subset of the performance counters as the input [65]. Ghosh et al. extended the study to a multi-GPU system [66]. They applied some nonlinear transformations to the collected instruction counts, such as logarithm, division, etc., and found that the transformed SLR worked better than the traditional SLR, which might suggest some nonlinear relationships between the power and the input variables. The above regression models all highlight the contribution of the computation instruction counts and the memory (especially register and global memory) instructions.

Abe et al. built DVFS regression models for the NVIDIA Tesla, Fermi and Kepler GPUs [22]. In particular, they regarded the 3 different core/memory frequency settings as model inputs. They also chose the 10 most relevant performance counters, namely those that gave the best fitting results. The prediction error varied from 15% to 23.5%, depending on the GPU generation, and the newer hardware had a larger prediction error.

Song et al. trained the GPU runtime power with an ANN of two hidden layers [67]. Their model achieved better prediction accuracy than that of the SLR in [37]. Wu et al. extensively studied the GPU power and performance with different settings of GPU frequency, memory bandwidth and core number [68]. They applied K-means clustering and ANN. In the ANN modeling process, they first used K-means to cluster the set of kernels into groups with similar scaling behaviours. Then for each group, they trained an ANN with two hidden layers. The average power prediction error over all frequency/unit configurations was 10%.

In general, the traditional regression-based methods are easier to implement, but they fail to capture the nonlinearity. For the recent generations of GPUs, the prediction errors of the regression models tend to be large. In contrast, the advanced neural network approaches suit the complicated data dependencies better, but require a much larger amount of training data, and the output models are of high complexity. For the power modeling work with frequency scaling, the prediction accuracy is relatively low, which might call for more effective modeling methods.
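To make the complexity of the ANN approach more concrete, the following minimal sketch builds a small fully connected network with two hidden layers, evaluates one forward pass, and counts the weights, i.e., the arrows in a diagram such as Fig. 3. The layer sizes, the tanh activation, the random initialization and the function names are assumptions made for this illustration; the network is untrained, and this is not the architecture of any surveyed study.

```python
import math
import random


def init_network(layer_sizes, seed=0):
    """Create random weights and zero biases for a fully connected network."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, inputs):
    """Forward pass: tanh on hidden layers, linear output node."""
    activations = list(inputs)
    for index, (weights, biases) in enumerate(layers):
        sums = [sum(w * a for w, a in zip(row, activations)) + b
                for row, b in zip(weights, biases)]
        is_output = index == len(layers) - 1
        activations = sums if is_output else [math.tanh(s) for s in sums]
    return activations[0]


def count_weights(layers):
    """Number of connection weights (one per arrow between nodes)."""
    return sum(len(row) for weights, _ in layers for row in weights)


# 10 inputs, two hidden layers of 8 nodes each, one output (assumed sizes).
network = init_network([10, 8, 8, 1])
prediction = forward(network, [0.1] * 10)
print("weights:", count_weights(network))
print("untrained output:", prediction)
```

Even this small topology has far more parameters than a linear regression over the same 10 inputs, which is consistent with the observation that ANN models need more training data and are harder to interpret.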

5. GPU DVFS Performance Modeling

In this section, we introduce the GPU performance modeling studies, a number of which consider GPU frequency scaling. We classify them into two categories: pipeline analysis and statistical methods. The pipeline analysis is a bottom-up empirical method which requires knowledge of the GPU execution principles, while the statistical methods rely purely on the GPU performance counters.

5.1. Pipeline Analysis

Many GPU performance modeling studies were based on the GPU pipeline analysis [35, 36, 55, 56, 67, 70]. They assembled the GPU execution pipeline and analyzed the memory/computation parallelism. We list some popular metrics used to evaluate the pipeline parallelism as below:

MWP (memory warp parallelism [55, 70]): the maximum number of warps that can access the memory simultaneously on one SM during the memory waiting period, i.e., the period between the launching and returning of a memory request by a warp.

CWP (computation warp parallelism [70]): the number of warps one SM can execute during the memory waiting period plus one.

LCP (load critical path [56]): the longest sequence of dependent memory loads possibly overlapped with computations from parallel warps.

We also give an example of the GPU pipeline in Fig. 4. The demonstrated pipeline is stalled due to the limited MWP. In real applications, the pipeline involves more types of instructions and various pipeline stalls.

Hong et al. used MWP and CWP to approximate the GPU execution pipeline in [70]. They computed MWP and CWP according to the global memory latency, the memory bandwidth, the number of warps, etc. They then divided the pipeline status into three categories: CWP>MWP, MWP>CWP and CWP=MWP (caused by an insufficient number of concurrent warps). For each category, they derived the rough total execution cycles. In [36], the authors refined the model by considering the cache access, shared memory bank conflicts and other related issues. Their model was widely adopted and extended in the literature.
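The three-way classification above can be illustrated with a minimal Python sketch. It assumes the simplified CWP definition given in the Fig. 4 caption (memory waiting period over the length of a computation instruction, plus one) rather than the full derivation in [70], and the function names are hypothetical. The cycle estimates that [70] derives for each category are not reproduced here.

```python
def cwp(mem_wait_cycles: float, comp_cycles: float) -> float:
    """Simplified CWP from the Fig. 4 caption: waiting period / computation length + 1."""
    return mem_wait_cycles / comp_cycles + 1


def pipeline_category(cwp_value: float, mwp_value: float) -> str:
    """Return which of the three pipeline categories a (CWP, MWP) pair falls into."""
    if cwp_value > mwp_value:
        return "CWP>MWP"
    if mwp_value > cwp_value:
        return "MWP>CWP"
    return "CWP=MWP"


# Values taken from the Fig. 4 example: a 6-cycle memory waiting period,
# 2-cycle computation instructions, and MWP defined as 2.
EXAMPLE_CWP = cwp(6, 2)                                # 6 / 2 + 1 = 4.0
EXAMPLE_CATEGORY = pipeline_category(EXAMPLE_CWP, 2)   # "CWP>MWP"
```

With the Fig. 4 values, CWP exceeds MWP, which agrees with the caption's remark that the demonstrated pipeline is stalled mainly by memory loads.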

Chen et al. presented a much simpler MWP computation method in [55], based on the average memory access latency, which considered both cache hit and cache miss cases. The parameters of their model were obtained by the PTX code analysis in GPGPUSim.

Song et al. proposed a comprehensive pipeline analysis by assembling the full execution process in [67]. They drew the complete execution pipeline for their 12 tested GPU kernels. Their average prediction error rate was as low as 6.7%. However, for this method, the low prediction error was at the cost of being very application-specific and hardware-specific.

Baghsorkhi et al. built a performance model based on the GPU work flow graph, which was a graphical abstraction of the GPU execution pipeline [71]. They estimated the GPU execution time by calculating the total weight of the work flow graph. In their model, the memory latency was variable, depending on the warp execution pattern. The advantage of this model is that it could predict the execution time of diverse warp scheduling patterns in one run.

Nath et al. built a GPU performance model considering the core frequency scaling [56]. They divided the whole GPU execution pipeline into portions either sensitive or insensitive to GPU core frequency scaling, and studied how the sensitive portion changed with frequency. This model achieved impressively high accuracy for all the frequency settings. In addition, it unambiguously highlighted the nonlinear effect of GPU frequency on performance. They also proposed a simplified model that approximates the LCP length with the memory load stall cycles, and the simplified model showed competitive prediction accuracy.

5.2. Statistical Methods

Abe et al. built statistical linear regression performance models with respect to the core and memory frequency scaling, on four NVIDIA GPUs across the Tesla, Fermi and Kepler platforms [22]. They chose variables from the CUDA performance counters, just as they did for the power modeling. However, their average performance prediction errors were large, varying from 33.5% to 67.9% on different generations of GPUs. This may be due to sparse sampling: they only performed the experiments with 3 different core/memory frequencies.

Wu et al. trained a performance model for an AMD GPU, with respect to varying both the core and memory frequency [68]. They used K-means clustering and ANN modeling and achieved an average performance prediction error of 15% across the frequency ranges. As far as we know, this is the only statistical GPU performance modeling work involving advanced ANN techniques.

[Figure 4 labels: different warps, memory load stall, memory waiting period, load critical path.]

Fig. 4: An example of GPU execution pipeline, where we define MWP=2. C1 and C2 denote two different compute instructions, and M1 denotes a memory instruction. The instructions are launched at every clock cycle. Assume this pipeline is taken from the 'reduction' kernel, thus the compute instruction C2 depends on the data returned by not only the current warp but also its subsequent warp. We assume a computation instruction takes 2 cycles, and the memory waiting period is 6 cycles. CWP equals the memory waiting period over the length of a computation instruction, plus 1 [35, 70], i.e., CWP=6/2+1=4. The demonstrated pipeline is stalled mainly by memory loads. The latency of all memory instructions and the overlapped computation instruction forms the load critical path [56].

Table 4: Performance modeling considering frequency scaling

| Study | Formula |
| --- | --- |
| [22] | $t = a_1 + a_2 / f_{Gc} + a_3 / f_{Gm}$ |
| [56] | $t = \max(a_1, a_2 / f_{Gc}) + \max(a_3, a_4 / f_{Gc})$ |
| [68] | ANN model |

Ardalani et al. also used machine learning to train GPU performance models [72]. Their modeling included two techniques: forward feature selection stepwise regression and bootstrap aggregating. Unlike plain linear regression, their regression automatically applied certain transformations to the input variables, so that the output model could capture some nonlinearity. The authors trained the model with a Maxwell GPU, obtaining a prediction error of 27%. They then validated the model with a Kepler GPU, where the prediction error rose to 36%. This up-to-date model showed some robustness across different generations of GPU platforms.

In Table 4, we summarize the formulas in the literature that describe the impact of frequency scaling on the GPU execution time. fGc and fGm denote the GPU core frequency and memory frequency, respectively. t denotes the execution time, and a1,...,a4 denote the coefficients defined by both the hardware and the application characteristics. In [22], the authors modeled the execution time as the summation of three parts: a static part (a1) which is insensitive to frequency scaling, a dynamic part (a2 / fGc) that is sensitive to GPU core frequency only, and another dynamic part (a3 / fGm) that is sensitive to GPU memory frequency only. Nath et al. considered the GPU core frequency scaling only. They divided the execution time into two segments: the LCP and the compute/store path (CSP) [56]. For each segment, there is a static part (a1, a3) and a dynamic part (a2 / fGc, a4 / fGc), where the two parts may overlap. When fGc varies, the length of each segment equals the larger of its two parts. This model stresses the nonlinear relationship between t and 1/fGc. In [68], the authors modeled the DVFS performance with an ANN, in which fGc and fGm are regarded as ANN inputs.
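To make the difference between the two closed-form models of Table 4 concrete, the following Python sketch evaluates both. The function names and all coefficient values are hypothetical and chosen only so that the arithmetic is easy to follow; they are not fitted to any measurement in [22] or [56].

```python
def time_linear(f_core, f_mem, a1, a2, a3):
    """Form attributed to [22] in Table 4: a static part plus a core-sensitive
    and a memory-sensitive part."""
    return a1 + a2 / f_core + a3 / f_mem


def time_overlap(f_core, a1, a2, a3, a4):
    """Form attributed to [56] in Table 4: two segments (LCP and CSP), each as
    long as the larger of its static and core-frequency-sensitive parts."""
    return max(a1, a2 / f_core) + max(a3, a4 / f_core)


# Hypothetical coefficients (illustration only, not measured values).
COEFFS = dict(a1=1.0, a2=800.0, a3=0.5, a4=300.0)

if __name__ == "__main__":
    for f_core in (400, 800, 1600):
        print(f_core, time_overlap(f_core, **COEFFS))
```

With these made-up coefficients the predicted time of the second form drops from 2.75 at 400 MHz to 1.5 at 800 MHz and then stays at 1.5 at 1600 MHz, because both static parts dominate once the core is fast enough. This saturation is the kind of nonlinearity in 1/fGc that the first, purely additive form cannot express.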

The listed mathematical forms in turn support the diverse DVFS effects for different GPU applications. Among the models, [56] depends on the pipeline analysis and the other two depend on statistical methods. [56] shows the best accuracy, yet it considers the core frequency scaling only. The other two models consider both the core and memory frequency scaling, but the overall prediction accuracy is still low. Even the advanced ANN technique does not improve the accuracy much.

6. GPU Voltage and Frequency Scaling Effects

We present our DVFS measurement study in this section. We scale the GPU core voltage/frequency and memory frequency of the Maxwell GTX980. We also scale the core voltage, core frequency and the memory frequency of the Fermi GTX560Ti. For simplicity, we use their architecture codenames (Maxwell and Fermi) to refer to these two GPUs.

6.1. Experimental Methodology

In our previous work [38], we introduced the methodology to scale the core and memory voltage/frequency of the Fermi GPU, with the help of a series of third-party software tools. For the Maxwell platform, we use the NVIDIA Inspector [41] to scale the core/memory frequency. In addition, we disable GPU Boost to fix the GPU core/memory frequency at the selected level. The NVIDIA Inspector reports the power data every second. Since a GPU kernel may take less than one second to finish, we run sufficient rounds of the same kernel to guarantee a total running time of at least 20 minutes for each kernel, which results in more than 1200 power samples. To verify the

Table 5: Target GPU voltage/frequency configurations

| Platform | Maxwell DFS | Fermi DVFS |
| --- | --- | --- |
| Core scaling | [VGc, fGc]: [0.987V, 480MHz], [0.987V, 580MHz], [0.987V, 680MHz], [0.987V, 780MHz], [0.987V, 880MHz], [0.987V, 980MHz], [0.987V, 1080MHz]; fGm=3000MHz | [VGc, fGc]: [0.849V, 880MHz], [0.857V, 896MHz], [0.881V, 932MHz], [0.912V, 952MHz], [0.949V, 972MHz], [0.981V, 982MHz], [1.012V, 990MHz], [1.049V, 995MHz], [1.099V, 1000MHz]; fGm=2100MHz |
| Default setting (core scaling) | [VGc, fGc]=[0.987V, 950MHz], fGm=3000MHz | [VGc, fGc]=[1.049V, 950MHz], fGm=2100MHz |
| Memory scaling | fGm: [2100, 2400, 2700, 3000, 3300, 3600] MHz; [VGc, fGc]=[0.987V, 980MHz] | fGm: [1500, 1700, 1900, 2100, 2300] MHz; [VGc, fGc]=[1.049V, 990MHz] |
| Default setting (memory scaling) | fGm=3000MHz, [VGc, fGc]=[0.987V, 950MHz] | fGm=2100MHz, [VGc, fGc]=[1.049V, 990MHz] |


repeatability of our experiments, we also conduct a significance test with the t-distribution on the power samples of each kernel at the 95% confidence level. The energy consumption is then calculated as the average power multiplied by the total running time.
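The energy calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tooling: the power samples are synthetic, the function names are hypothetical, and the interval uses the standard normal quantile from the standard library in place of the t-distribution quantile, which is a close approximation for the 1200 or more samples collected per kernel.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_power_ci(samples, confidence=0.95):
    """Return (average power, half-width of the confidence interval).

    Assumption: with more than 1200 samples the t quantile is practically
    equal to the normal quantile, so NormalDist is used here.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(samples) / sqrt(len(samples))
    return mean(samples), half_width


def energy_joules(samples, total_time_s):
    """Energy = average power (W) multiplied by total running time (s)."""
    return mean(samples) * total_time_s


# Synthetic one-per-second power readings in watts (not measured data):
# 1200 samples, i.e. 20 minutes of running time.
SAMPLES = [100.0, 102.0, 98.0] * 400

if __name__ == "__main__":
    avg, half = mean_power_ci(SAMPLES)
    print(f"average power: {avg:.2f} W +/- {half:.3f} W")
    print(f"energy: {energy_joules(SAMPLES, 1200.0):.0f} J")
```

A narrow interval around the average power indicates that the repeated kernel runs draw a stable power, so the product of average power and running time is a reasonable energy estimate.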

On the Maxwell platform, we choose the P2 state, which allows us to reliably scale the core voltage in the range [0.987V, 1.087V]. According to Equation (2), for an application run under different voltage settings, a higher voltage usually results in more energy consumption. We verify this in Fig. 5: for the six kernels taken from the Rodinia benchmark suite, the energy consumption at the higher core voltage is always larger. In the following, we therefore fix the core voltage at the lower bound of 0.987V and scale the core frequency only. We use the Maxwell platform as an example to investigate the impact of dynamic frequency scaling (DFS) on energy consumption.

We denote the GPU core voltage, core frequency and memory frequency as VGc, fGc and fGm, respectively. Table 5 lists the target DFS/DVFS settings of our two GPU platforms. Similar to [38], we investigate the core scaling effect and memory scaling effect separately. Namely, when we change the GPU core settings, we fix the memory settings, and vice versa. In total, we study 7 core and 6 memory settings for the Maxwell DFS, and 9 core and 5 memory settings for the Fermi DVFS. We also list the default core and memory settings in Table 5. For the Maxwell platform, we use the same default setting for the core and memory scaling, while for Fermi, the default settings of the core and memory scaling are different.

Fig. 6: The relationship between the maximum allowed core frequency and the core voltage on the Maxwell platform. [Horizontal axis: core voltage (V).]

We denote the default energy consumption as E, and the minimum and maximum energy consumption under different voltage/frequency settings as Emin and Emax, respectively. We use two metrics to evaluate the DVFS effect: R and Rmax, where R quantifies how much energy could be saved compared to the default configuration, and Rmax indicates the maximum energy saving capability. We give their definitions in Equations (5)-(6).

$$R = 1 - \frac{E_{min}}{E} \qquad (5)$$

$$R_{max} = 1 - \frac{E_{min}}{E_{max}} \qquad (6)$$
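As a small worked example of the two metrics, the sketch below computes R and Rmax from a table of per-setting energy values and also reports the setting with the minimum energy. The function name and the energy numbers are hypothetical and serve only to illustrate Equations (5)-(6); they are not measurements from this study.

```python
def energy_saving_metrics(energy_by_setting, default_setting):
    """Compute R (Eq. 5), Rmax (Eq. 6) and the minimum-energy setting.

    energy_by_setting maps a frequency setting (e.g. MHz) to measured energy.
    """
    e_default = energy_by_setting[default_setting]
    e_min = min(energy_by_setting.values())
    e_max = max(energy_by_setting.values())
    best_setting = min(energy_by_setting, key=energy_by_setting.get)
    r = 1 - e_min / e_default      # saving relative to the default setting
    r_max = 1 - e_min / e_max      # saving relative to the worst setting
    return r, r_max, best_setting


# Hypothetical energy values (J) keyed by core frequency (MHz).
EXAMPLE = {480: 120.0, 780: 90.0, 950: 100.0, 1080: 110.0}

if __name__ == "__main__":
    r, r_max, best = energy_saving_metrics(EXAMPLE, default_setting=950)
    print(f"R = {r:.2%}, Rmax = {r_max:.2%}, best setting = {best} MHz")
```

In this made-up case the default setting (950 MHz) uses 100 J while the best setting (780 MHz) uses 90 J, so R is about 10%, and the worst setting uses 120 J, so Rmax is 25%.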

Note that for the Fermi platform, we measure the power consumption of the whole desktop using an off-board power meter, so R and Rmax for the Fermi refer to the system-level energy saving, including the energy savings of the CPU, the interconnect, the motherboard, etc. This is because the early GPUs have no power management interfaces, and it is difficult to obtain the GPU power consumption directly. On the other hand, R and Rmax for the Maxwell refer to the GPU energy savings only: thanks to the on-chip power sensors of the Maxwell GPU, the GPU runtime power can be read without an external meter.

We study the voltage/frequency scaling effect on the two platforms with the same suite of 37 applications, taken from Rodinia and the CUDA SDK. Note that there is a small difference between the two sets of experiments: for the Fermi experiments, the studied applications are based on CUDA SDK 4.1, whereas for the Maxwell, the applications are based on CUDA SDK 6.5.

6.2. Experimental Results

6.2.1. Core voltage-frequency relationship

Fig. 6 shows the relationship between the maximum allowed core frequency and the core voltage on the Maxwell platform. It is widely believed that the maximum core frequency increases linearly with the core voltage. In [38], we found that the relationship between the maximum allowed core frequency and the core voltage is sublinear on the Fermi platform. As shown in Fig. 6, we observe a similar sublinear relationship on the Maxwell platform. The conservative default setting helps to protect the GPU board and also leaves some room for over-clocking.

6.2.2. Maxwell DFS

We first present the experimental results of Maxwell core frequency scaling. Fig. 7 summarizes R and Rmax

Fig. 5: The effect of scaling the core voltage alone on the Maxwell platform. [Horizontal axis: core frequency (MHz).]

of all benchmark applications on the Maxwell platform. Note that for the applications with multiple kernels, we compute R and Rmax for each kernel. The average R is 5.24%, and the average Rmax is 10.87%, at the single-GPU level. Applications saving the most energy include eigenvalues, gaussian, hotspot, nn, etc. In the best case (nn), up to 34% of the GPU energy can be saved.

We also show the best core frequency, i.e., the frequency that leads to the minimum energy consumption among all the tested settings, for each kernel in Fig. 7. Among all the kernels, 12 benefit from scaling up the core frequency (fGc > 950 MHz) while the other 30 benefit from scaling down the core frequency (fGc < 950 MHz). In particular, half of the kernels achieve their minimum energy at core frequencies between 680 MHz and 880 MHz, 11 of them at 780 MHz. In [22], the authors state that a low or medium core frequency setting clearly improves the energy efficiency on the Kepler GPU platform, where their medium setting corresponds to 680-880 MHz on our Maxwell platform. Therefore, for modern GPUs, scaling down the core frequency to some extent is an effective approach to conserving energy, even if it is difficult to scale down the core voltage. This also reveals that, for many applications, the performance does not scale linearly with the GPU core frequency.

We plot the normalized energy of some kernels under different core frequency settings in Fig. 8. The kernels exhibit diverse energy consumption behaviours. The energy consumption of vectorAddDrv increases linearly with the core frequency. MC:EstimatePiInlineP and binomialOptions are insensitive to core frequency scaling, while eigenvalues and nn have their optimal frequency setting in a very narrow range. The diverse behaviours coincide with those described in [38, 20, 51], and they confirm the complexity of the relationship between the runtime energy and the processor frequency, which calls for a more in-depth investigation.

We demonstrate the memory frequency scaling effect in Fig. 9. The average Rmax is as high as 35.87% whereas the average R is 0.72%. In this case, Rmax denotes the energy increase caused by scaling down the memory frequency. 24 kernels suffer an energy increase of more than 30% when the memory frequency is simply scaled down by 30%. In the worst case (bfs), up to 53.9% of the energy can be wasted. The major reason is that the execution time of these kernels increases significantly when the memory frequency decreases.

In Fig. 9, 34 kernels reach their minimum energy consumption at the vendor's default setting, which suggests that the default setting is optimal for most cases. Note that even for the special applications which do not benefit from the default memory frequency, such as matrixMulDrv and vectorAddDrv, the difference is no more than 6%. A few kernels, such as scalarProd and sortingNetworks, benefit from a slightly higher frequency of 3300 MHz. This is because the energy saved by the reduced running time exceeds the extra energy consumed at the higher memory frequency.

6.2.3. Fermi DVFS

We demonstrate the Fermi core voltage and frequency scaling effect in Fig. 10. Note that each time we change the voltage and frequency together. By scaling down both the core voltage and core frequency, DVFS can reduce a considerable percent of system energy. The energy savings on our Fermi platform at

Fig. 7: Core frequency scaling effects on the Maxwell platform.

Fig. 8: Kernels benefit from low core frequency on the Maxwell platform. [Legend: core frequencies of 480, 580, 680, 780, 880, 980 and 1080 MHz; kernels labelled: MC:EstimatePiInlineP, binomialOptions, eigenvalues, vectorAddDrv.]

Fig. 9: Memory frequency scaling effects on the Maxwell platform.

Fig. 10: Core voltage and frequency scaling effects on the Fermi platform.

Fig. 11: Memory frequency scaling effects on the Fermi platform.

whole system level are 18.91% on average and 24.4% at maximum. In general, the energy savings are larger than those of the Maxwell platform, which stresses the importance of scaling down the core voltage, even marginally, for energy conservation. Almost all the applications reach the minimum energy consumption at the lowest voltage/frequency.

Fig. 11 shows the memory frequency scaling effect on the Fermi platform. The average Rmax is 10.2% and the average R is 3.5%. The energy saving is low, and the default memory frequency works well for many applications. The best memory frequencies for different applications are diverse, depending on the application characteristics, which differs markedly from the Maxwell platform. There is no linear relationship between the energy consumption/performance and the memory frequency on the Fermi platform.

To summarize, our experimental results on both platforms suggest the following findings:

1. An appropriate core frequency setting is effective for energy saving. Both platforms expose the "pacing [21]" feature. The relationship between the performance and the GPU core frequency is very complex and a simple linear model is inadequate;

2. In terms of memory frequency scaling, the early platform exposes the "pacing" feature, while the modern platform exposes the "racing [21]" feature. The performance scales almost linearly with the GPU memory frequency on our Maxwell platform.

7. Conclusions and Future Work

In this paper, we survey the GPU DVFS for energy conservation. We focus on the most up-to-date GPU DVFS technologies and their influence on the performance and power consumption. We summarize the methodology and the performance of existing GPU DVFS models. Generally speaking, nonlinear modeling techniques, such as the ANN and the transformed SLR, have better estimation accuracy.

In addition, we conduct real-world DFS/DVFS measurement experiments on the NVIDIA Fermi and Maxwell GPUs. The experimental results suggest that both the core and memory frequency influence the energy consumption significantly. Using the highest memory frequency would always conserve energy for the Maxwell GPU, which is not the case on the Fermi platform. According to the Fermi DVFS experiments, scaling down the core voltage is vital to conserve energy.

Both the survey and the measurements spotlight the challenge of building an accurate DVFS performance model, and furthermore, of applying appropriate voltage/frequency settings to conserve energy. We leave these for our future study. Another important direction is to integrate the GPU DVFS into large-scale cluster-level power management. It will be interesting to explore how to effectively combine GPU DVFS with other energy conservation techniques such as task scheduling [73], VM consolidation [74], power-performance arbitrating [75], and runtime power monitoring [76, 77, 78].

Acknowledgements

This work is partially supported by HKBU FRG2/14-15/059 and Shenzhen Basic Research Grant SCI-2015-SZTIC-002.

References

[1] C.-M. Liu, T. Wong, E. Wu, R. Luo, S.-M. Yiu, Y. Li, B. Wang, C. Yu, X. Chu, K. Zhao, et al., SOAP3: ultra-fast GPU-based parallel alignment tool for short reads, Bioinformatics 28 (6) (2012) 878-879.

[2] K. Zhao, X. Chu, G-BLASTN: accelerating nucleotide alignment by graphics processors, Bioinformatics 30 (10) (2014) 1384-1391.

[3] X. Chu, C. Liu, K. Ouyang, L. S. Yung, H. Liu, Y.-W. Leung, Perasure: a parallel cauchy reed-solomon coding library for GPUs, in: Proceedings of the IEEE International Conference on Communications (ICC), IEEE, London, UK, 2015, pp. 436-441.

[4] Q. Li, C. Zhong, K. Zhao, X. Mei, X. Chu, Implementation and analysis of AES encryption on GPU, in: Proceedings of the third International Workshop on Frontier of GPU Computing, IEEE, Liverpool, UK, 2012.

[5] X. Chu, K. Zhao, Practical random linear network coding on GPUs, in: Proceedings of the 8th International Conferences on Networking, IFIP, Aachen, Germany, 2009.

[6] Y. Li, K. Zhao, X. Chu, J. Liu, Speeding up K-means algorithm by GPUs, in: Proceedings of the 10th International Conference on Computer and Information Technology (CIT), IEEE, Bradford, UK, 2010, pp. 115-122.

[7] R. Raina, A. Madhavan, A. Y. Ng, Large-scale deep unsupervised learning using graphics processors, in: Proceedings of the 26th annual international conference on machine learning, ICML '09, ACM, 2009, pp. 873-880.

[8] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, A. Ng, Deep learning with COTS HPC systems, in: Proceedings of the 30th international conference on machine learning, ICML '13, 2013, pp. 1337-1345.

[9] Q. V. Le, Building high-level features using large scale unsupervised learning, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2013, pp. 8595-8598.

[10] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature 529 (7587) (2016) 484-489.

[11] S. Shi, Q. Wang, P. Xu, X. Chu, Benchmarking state-of-the-art deep learning software tools, in: Proceedings of the 7th International Conference on Cloud Computing and Big Data, IEEE, Macau, China, 2016.

[12] E. Strohmaier, J. Dongarra, H. Simon, M. Meuer, H. Meuer, TOP500, [Online] http://www.top500.org.

[13] E. Strohmaier, J. Dongarra, H. Simon, M. Meuer, H. Meuer, TOP500 highlights, November, 2015, [Online] http://www.top500.org/lists/2015/11/highlights/.

[14] A. Gharaibeh, E. Santos-Neto, L. B. Costa, M. Ripeanu, The energy case for graph processing on hybrid CPU and GPU systems, in: Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms, ACM, 2013, pp. 2:1-2:8.

[15] Introducing Titan: advancing the era of accelerating computing, [Online] https://www.olcf.ornl.gov/titan/.

[16] R. Gonzalez, B. M. Gordon, M. A. Horowitz, Supply and threshold voltage scaling for low power CMOS, IEEE Journal of Solid-State Circuits 32 (8) (1997) 1210-1216. doi: 10.1109/4.604077.

[17] G. Semeraro, G. Magklis, R. Balasubramonian, D. H. Albonesi, S. Dwarkadas, M. L. Scott, Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling, in: Proceedings of IEEE Eighth International Symposium on High Performance Computer Architecture, 2002, HPCA'02, IEEE, 2002, pp. 29-40.

[18] Intel, Intel Turbo Boost Technology 2.0, [Online] http://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html.

[19] R. Ge, R. Vogt, J. Majumder, A. Alam, M. Burtscher, Z. Zong, Effects of dynamic voltage and frequency scaling on a K20 GPU, in: IEEE 42nd International Conference on Parallel Processing (ICPP), IEEE, 2013, pp. 826-833.

[20] Y. Abe, H. Sasaki, M. Peres, K. Inoue, K. Murakami, S. Kato, Power and performance analysis of GPU-accelerated systems, in: Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems, HotPower'12, USENIX Association, Berkeley, CA, USA, 2012, pp. 10-14.

[21] D. H. Kim, C. Imes, H. Hoffmann, Racing and pacing to idle: Theoretical and empirical analysis of energy optimization heuristics, in: IEEE 3rd International Conference on Cyber-Physical Systems, Networks, and Applications (CPSNA), IEEE, 2015, pp. 78-85.

[22] Y. Abe, H. Sasaki, S. Kato, K. Inoue, M. Edahiro, M. Peres, Power and performance characterization and modeling of GPU-accelerated systems, in: IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS), 2014, pp. 113-122. doi:10.1109/IPDPS.2014.23.

[23] S. Mittal, J. S. Vetter, A survey of methods for analyzing and improving GPU energy efficiency, ACM Computing Survey 47 (2) (2014) 19:1-19:23. doi:10.1145/2636342.

[24] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, A. Moshovos, Demystifying GPU microarchitecture through microbenchmarking, in: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2010, IEEE, 2010, pp. 235-246.

[25] X. Mei, X. Chu, Dissecting GPU memory hierarchy through microbenchmarking, preprint, IEEE Transactions on Parallel and Distributed Systems (2016). doi:10.1109/TPDS.2016.2549523.

[26] W. Jia, K. A. Shaw, M. Martonosi, Characterizing and improving the use of demand-fetched caches in GPUs, in: Proceedings of the 26th ACM International Conference on Supercomputing, ICS '12, ACM, New York, NY, USA, 2012, pp. 15-24. doi:10.1145/2304576.2304582.

[27] M. Rhu, M. Sullivan, J. Leng, M. Erez, A locality-aware memory hierarchy for energy-efficient GPU architectures, in: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, ACM, 2013, pp. 86-98.

[28] NVIDIA, GPU computing SDK, [Online] https://developer.nvidia.com/gpu-computing-sdk.

[29] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, K. Skadron, Rodinia: A benchmark suite for heterogeneous computing, in: IEEE International Symposium on Workload Characterization, 2009. IISWC 2009., IEEE, 2009, pp. 44-54.

[30] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, W.-m. W. Hwu, Parboil: A revised benchmark suite for scientific and commercial throughput computing, Center for Reliable and High-Performance Computing 127.

[31] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, D. A. Wood, The gem5 simulator, SIGARCH Computer Architecture News 39 (2) (2011) 1-7. doi:10.1145/2024716.2024718.

[32] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, T. M. Aamodt, Analyzing CUDA workloads using a detailed GPU simulator, in: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE, 2009, pp. 163-174.

[33] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, V. J. Reddi, GPUWattch: Enabling energy optimizations in GPGPUs, in: Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA'13, ACM, New York, NY, USA, 2013, pp. 487-498. doi:10.1145/2485922.2485964.

URL http://www.gpgpu-sim.org/gpuwattch/

[34] V. Kursun, Supply and threshold voltage scaling techniques in CMOS circuits, Ph.D. thesis, University of Rochester (2004).

[35] S. Hong, H. Kim, An integrated GPU power and performance model, in: ACM SIGARCH Computer Architecture News, Vol. 38, ACM, 2010, pp. 280-289.

[36] S. Hong, Modeling performance and power for energy-efficient GPGPU computing, Ph.D. thesis, Georgia Institute of Technology (2012).

[37] H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, S. Matsuoka, Statistical power modeling of GPU kernels using performance counters, in: International Green Computing Conference (IGCC), IEEE, 2010, pp. 115-122.

[38] X. Mei, L. S. Yung, K. Zhao, X. Chu, A measurement study of GPU DVFS on energy conservation, in: Proceedings of the Workshop on Power-Aware Computing and Systems, HotPower '13, ACM, New York, NY, USA, 2013, pp. 1-5. doi: 10.1145/2525526.2525852.

[39] NVIDIA, NVIDIA Management Library, [Online] https://developer.nvidia.com/nvidia-management-library-nvml.

[40] NVIDIA, NVIDIA System Management Interface (nvidia-smi), [Online] https://developer.nvidia.com/nvidia-system-management-interface.

[41] Orbmu2k, NVIDIA Inspector, [Online] http://blog.orbmu2k.de/tools/nvidia-inspector-tool.

[42] MSI, Afterburner, graphics card performance booster, [Online] http://event.msi.com/vga/afterburner/download.htm.

[43] NVIDIA, GPU Boost 2.0, [Online] http://www.geforce.com/hardware/technology/gpu-boost-2/technology.

[44] Y. Jiao, H. Lin, P. Balaji, W. Feng, Power and performance characterization of computational kernels on the GPU, in: IEEE/ACM International Conference on Green Computing and Communications (GreenCom) and Int'l Conference on Cyber, Physical and Social Computing (CPSCom), IEEE, 2010, pp. 221-228.

[45] K. Ma, X. Li, W. Chen, C. Zhang, X. Wang, GreenGPU: A holistic approach to energy efficiency in GPU-CPU heterogeneous architectures, in: IEEE 41st International Conference on Parallel Processing (ICPP), 2012, pp. 48-57. doi: 10.1109/ICPP.2012.31.

[46] D. You, K. S. Chung, Quality of service-aware dynamic voltage and frequency scaling for embedded GPUs, IEEE Computer Architecture Letters 14 (1) (2015) 66-69. doi:10.1109/LCA.2014.2319079.

[47] Q. Jiao, M. Lu, H. P. Huynh, T. Mitra, Improving GPGPU energy-efficiency through concurrent kernel execution and DVFS, in: Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '15, IEEE Computer Society, Washington, DC, USA, 2015, pp. 1-11.

[48] J. Lee, V. Sathisha, M. Schulte, K. Compton, N. S. Kim, Improving throughput of power-constrained GPUs using dynamic voltage/frequency and core scaling, in: International Conference on Parallel Architectures and Compilation Techniques (PACT), IEEE, 2011, pp. 111-120.

[49] W. Zhao, Y. Cao, New generation of predictive technology model for sub-45 nm early design exploration, IEEE Transactions on Electron Devices 53 (11) (2006) 2816-2823.

[50] A. Balijepalli, S. Sinha, Y. Cao, 45nm predictive technology model for metal gate/high-k CMOS, [Online] http://ptm.asu.edu/.

[51] A. Sethia, S. Mahlke, Equalizer: Dynamic tuning of GPU resources for efficient execution, in: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, IEEE Computer Society, 2014, pp. 647-658.

[52] R. Sen, D. Wood, GPGPU footprint models to estimate per-core power, preprint, IEEE Computer Architecture Letters (2015). doi:10.1109/LCA.2015.2456909.

[53] B. Gopireddy, C. Song, J. Torrellas, N. S. Kim, A. Agrawal, A. Mishra, ScalCore: Designing a core for voltage scalability, in: IEEE 22nd International Symposium on High Performance Computer Architecture (HPCA), 2016, pp. 1-13.

[54] C. Isci, M. Martonosi, Runtime power monitoring in high-end processors: Methodology and empirical data, in: Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, MICRO-36, IEEE Computer Society, 2003, pp. 93-104.

[55] X. Chen, Y. Wang, Y. Liang, Y. Xie, H. Yang, Run-time technique for simultaneous aging and power optimization in GPGPUs, in: 51st ACM/EDAC/IEEE Design Automation Conference (DAC), IEEE, 2014, pp. 1-6.

[56] R. Nath, D. Tullsen, The CRISP performance model for dynamic voltage and frequency scaling in a GPGPU, in: Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48, ACM, 2015, pp. 281-293.

[57] J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, R. S. Schreiber, Future scaling of processor-memory interfaces, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009, pp. 1-12. doi:10.1145/1654059.1654102.

[58] X. Ma, M. Dong, L. Zhong, Z. Deng, Statistical power consumption analysis and modeling for GPU-based computing, in: Proceedings of the Workshop on Power-Aware Computing and Systems, HotPower '09, ACM, 2009, pp. 1-5.

[59] NVIDIA, PerfKit, [Online] https://developer.nvidia.com/nvidia-perfkit.

[60] NVIDIA, CUDA Visual Profiler, [Online] https://developer.nvidia.com/nvidia-visual-profiler.

[61] J. Chen, B. Li, Y. Zhang, L. Peng, J.-k. Peir, Statistical GPU power analysis using tree-based methods, in: International Green Computing Conference and Workshops (IGCC), IEEE, 2011, pp. 1-6.

[62] Y. Zhang, Y. Hu, B. Li, L. Peng, Performance and power analysis of ATI GPU: A statistical approach, in: IEEE 6th International Conference on Networking, Architecture and Storage (NAS), 2011, pp. 149-158. doi:10.1109/NAS.2011.51.

[63] AMD, AMD Stream SDK, [Online] http://developer.amd.com/gpu/amdappsdk/pages/default.aspx.

[64] B. Purnomo, N. Rubin, M. Houston, ATI Stream Profiler: A tool to optimize an OpenCL kernel on ATI Radeon GPUs, in: ACM SIGGRAPH 2010 Posters, SIGGRAPH '10, ACM, New York, NY, USA, 2010, pp. 54:1-54:1. doi:10.1145/1836845.1836904.

[65] A. Karami, S. A. Mirsoleimani, F. Khunjush, A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs, in: CSI 17th International Symposium on Computer Architecture and Digital Systems (CADS), 2013, pp. 15-22. doi:10.1109/CADS.2013.6714232.

[66] S. Ghosh, S. Chandrasekaran, B. Chapman, Statistical modeling of power/energy of scientific kernels on a multi-GPU system, in: International Green Computing Conference (IGCC), IEEE, 2013, pp. 1-6.

[67] S. Song, C. Su, B. Rountree, K. W. Cameron, A simplified and accurate model of power-performance efficiency on emergent GPU architectures, in: IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS), IEEE, 2013, pp. 673-686.

[68] G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, D. Chiou, GPGPU performance and power estimation using machine learning, in: IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), IEEE, 2015, pp. 564-576.

[69] AMD, CodeXL: Powerful debugging, profiling and analysis, [Online] http://developer.amd.com/tools-and-sdks/opencl-zone/codexl/.

[70] S. Hong, H. Kim, An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, in: ACM SIGARCH Computer Architecture News, Vol. 37, ACM, 2009, pp. 152-163.

[71] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, W.-m. W. Hwu, An adaptive performance modeling tool for GPU architectures, in: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10, ACM, New York, NY, USA, 2010, pp. 105-114. doi:10.1145/1693453.1693470.

[72] N. Ardalani, C. Lestourgeon, K. Sankaralingam, X. Zhu, Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance, in: Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48, ACM, New York, NY, USA, 2015, pp. 725-737. doi:10.1145/2830772.2830780.

[73] X. Kong, C. Lin, Y. Jiang, W. Yan, X.-W. Chu, Efficient dynamic task scheduling in virtualized data centers with fuzzy prediction, Journal of Network and Computer Applications 34 (4).

[74] L. Ma, H. Liu, Y.-W. Leung, X.-W. Chu, Joint VM-switch consolidation for energy efficiency in data centers, in: Proceedings of the IEEE Globecom 2016, IEEE, Washington, USA.

[75] F. Liu, Z. Zhou, H. Jin, B. Li, B. Li, H. Jiang, On arbitrating the power-performance tradeoff in SaaS clouds, IEEE Transactions on Parallel and Distributed Systems 25 (10) (2014) 2648-2658.

[76] G. Tang, W. Jiang, Z. Xu, F. Liu, K. Wu, Zero-cost, fine-grained power monitoring of datacenters using non-intrusive power disaggregation, in: Proceedings of the 16th Annual Middleware Conference, ACM, 2015, pp. 271-282.

[77] G. Tang, W. Jiang, Z. Xu, F. Liu, K. Wu, NIPD: Non-intrusive power disaggregation in legacy datacenters, IEEE Transactions on Computers (preprint).

[78] J. Gao, Machine learning applications for data center optimization (2014).