Contents lists available at ScienceDirect

J. Parallel Distrib. Comput.

journal homepage: www.elsevier.com/locate/jpdc


Joint scheduling of MapReduce jobs with servers: Performance bounds and experiments

Xiao Ling (a), Yi Yuan (b), Dan Wang (b), Jiangchuan Liu (c), Jiahai Yang

a Tsinghua National Laboratory for Information Science and Technology, Institute for Network Sciences and Cyberspace, Tsinghua University, Beijing, China

b Department of Computing Science, The Hong Kong Polytechnic University, Hong Kong

c Department of Computing Science, Simon Fraser University, British Columbia, Canada


highlights

• We investigate a scheduling problem for MapReduce-like frameworks that takes server assignment into consideration.

• We formulate the MapReduce server-job organizer problem (MSJO) and show that it is NP-complete.

• We propose a 3-approximation algorithm and a fast heuristic design to address the MSJO problem.

• We implement our algorithms and several state-of-the-art algorithms as schedulers in Hadoop and deploy them on Amazon EC2.

• Comprehensive simulations and experiments show that our algorithm outperforms other classical strategies.

article info

Article history: Received 29 August 2015; Received in revised form 31 December 2015; Accepted 21 February 2016; Available online 2 March 2016

Keywords: MapReduce; Scheduling; Server assignment; NP-complete; Fast heuristic

abstract

MapReduce-like frameworks have achieved tremendous success for large-scale data processing in data centers. A key feature distinguishing MapReduce from previous parallel models is that it interleaves parallel and sequential computation. Past schemes on general parallel models, and especially their theoretical bounds, are therefore unlikely to apply to MapReduce directly. There are many recent studies on MapReduce job and task scheduling, but these studies assume that the servers are assigned in advance. In current data centers, multiple MapReduce jobs of different importance levels run together. In this paper, we investigate a scheduling problem for MapReduce that takes server assignment into consideration as well. We formulate a MapReduce server-job organizer problem (MSJO) and show that it is NP-complete. We develop a 3-approximation algorithm and a fast heuristic. We further propose a novel fine-grained practical algorithm for the general MapReduce-like task scheduling problem. Finally, we evaluate our algorithms through both simulations and experiments on Amazon EC2 with an implementation in Hadoop. The results confirm the superiority of our algorithms.

© 2016 The Authors. Published by Elsevier Inc.

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Recently, the amount of data produced by various applications has grown beyond the processing capability of single machines. To cope with such data, scale-out parallel processing is widely accepted. MapReduce [11], the de facto standard framework for parallel processing in big data applications, has become widely adopted. Nevertheless, the MapReduce framework is also criticized for its inefficiency in performance and as "a major step backward" [14]. This is partly because, performance-wise, the MapReduce framework has not been studied as deeply as other conventional systems, which have benefited from decades of study and fine-tuning. As a consequence, there are many recent studies on improving MapReduce performance.

* Corresponding author.

E-mail addresses: lxcernet@gmail.com (X. Ling), robertyi@163.com (Y. Yuan), csdwang@comp.polyu.edu.hk (D. Wang), jcliu@cs.sfu.ca (J. Liu), yang@cernet.edu.cn (J. Yang).

MapReduce breaks down a job into map tasks and reduce tasks. These tasks are parallelized across server clusters.1 Although map tasks and reduce tasks partly overlap in the real Hadoop scheduling mechanism, researchers [18,19] generally assume that reduce tasks will not start until the map tasks are complete, because reduce tasks rely on the intermediate data produced by the map tasks. This is a parallel-sequential structure. In current practice, multiple MapReduce jobs are scheduled simultaneously to efficiently utilize the computation resources of the data center servers. Finding a good schedule for multiple MapReduce jobs and tasks running on different servers is a nontrivial task. There have been a great number of studies on general parallel processing scheduling in the past decades. Nevertheless, it is not clear whether these techniques can be applied directly to the MapReduce framework; in particular, their results on theoretical bounds are unlikely to carry over.

1 The server clusters here are meant to be general; they can be either data center servers or cloud virtual machines.

http://dx.doi.org/10.1016/j.jpdc.2016.02.002

0743-7315/© 2016 The Authors. Published by Elsevier Inc.

X. Ling et al. / J. Parallel Distrib. Comput. 90-91 (2016) 52-66

[Figure 1: Gantt charts of panels (a) and (b) showing the map (M) and reduce (R) tasks of job 1 and job 2 on Machines 1-3; time axis from 0 to 200 s.]

Fig. 1. Impact of server assignment. (a) Without server assignment, Hadoop default strategy. (b) Joint consideration of server assignment.

[Figure 2: taxonomy tree rooted at "MR Performance". System branch: Framework (e.g. YARN, SUDO, Camdoop), Resource Management (e.g. PROTEUS, Elastisizer), Heterogeneity (e.g. LATE, Paragon). Algorithm branch: Fast Completion Time, split into Without Server Assignment (e.g. OFFA, MARES) and Server Assignment; Fairness (e.g. Quincy, Delay Scheduling); Data Locality (e.g. ActCap, JSQ, Purlieus).]

Fig. 2. Design space of MarS.

In this paper, we conduct research in this direction. There are recent studies on MapReduce scheduling [7,8]. As an example, an algorithm is developed in [8] for the joint scheduling of the processing and shuffle phases, and it achieves an 8-approximation. To the best of our knowledge, previous studies commonly assume that the servers are already assigned. That is, they assume that the tasks of MapReduce jobs are first assigned to servers, and scheduling then only decides the sequence of the map and reduce tasks of each job. It is not clear how the scheduling of map and reduce tasks is affected when the server assignment is "less good". We illustrate this impact by a toy example in Fig. 1. There are three machines and two jobs. Job 1 has 4 map tasks and 2 reduce tasks. Job 2 has 1 map task and 1 reduce task. Assume the processing time is 75 s for every map task and 100 s for every reduce task. If server assignment is not considered, the result is Fig. 1(a), which follows the default FIFO strategy of Hadoop [3]. However, if we jointly consider server assignment, we can achieve the schedule shown in Fig. 1(b). The completion time of job 2 is 250 s in Fig. 1(a) and 175 s in Fig. 1(b), a 30% improvement.
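The numbers in this toy example can be reproduced with a small list-scheduling simulation. The sketch below is only an illustration: the two task orders are our reading of Fig. 1(a) and Fig. 1(b), the function name `list_schedule` is hypothetical, and shuffle delay is assumed to be zero.

```python
from collections import defaultdict

MAP_TIME, REDUCE_TIME = 75, 100  # seconds, as assumed in the toy example


def list_schedule(order, num_machines=3):
    """Place tasks, in the given order, on the machine that becomes idle first.

    order: list of (job, kind) pairs, kind is "map" or "reduce"; all map tasks
    of a job must appear before its reduce tasks.
    A reduce task starts only after every map task of its job has finished.
    Returns {job: completion time in seconds}.
    """
    idle = [0] * num_machines          # earliest idle time of each machine
    maps_done = defaultdict(int)       # latest map finish time per job
    completion = defaultdict(int)
    for job, kind in order:
        m = min(range(num_machines), key=lambda i: idle[i])
        ready = maps_done[job] if kind == "reduce" else 0
        start = max(idle[m], ready)
        finish = start + (MAP_TIME if kind == "map" else REDUCE_TIME)
        idle[m] = finish
        if kind == "map":
            maps_done[job] = max(maps_done[job], finish)
        completion[job] = max(completion[job], finish)
    return dict(completion)


# Assumed FIFO order: all of job 1's maps, then job 2's map, then the reduces.
FIFO_ORDER = [(1, "map")] * 4 + [(2, "map")] + [(1, "reduce")] * 2 + [(2, "reduce")]
# Assumed joint order: job 2 is placed first and gets a machine early.
JOINT_ORDER = [(2, "map"), (1, "map"), (1, "map"), (2, "reduce"),
               (1, "map"), (1, "map"), (1, "reduce"), (1, "reduce")]

if __name__ == "__main__":
    print(list_schedule(FIFO_ORDER))
    print(list_schedule(JOINT_ORDER))
```

Under these assumptions job 1 finishes at the same time in both orders, so the gain for job 2 comes at no cost to job 1.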

In this paper, we fill this gap by jointly considering server assignment and MapReduce jobs (and the associated tasks). To systematically study this problem, we formulate a unique MapReduce server-job organizer problem (MSJO). Note that the MSJO we discuss is the general case where the jobs can have different weights. We show that MSJO is NP-complete and we develop a 3-approximation algorithm. This approximation algorithm, though polynomial, incurs considerable complexity in solving an LP subroutine. Therefore, we further develop a fast heuristic. We evaluate our algorithm through extensive simulations. Our algorithm can outperform the state-of-the-art algorithms by 40% in terms of total weighted job completion time. We further implement our algorithm in Hadoop and evaluate it using experiments on Amazon EC2 [1]. The experiment results confirm the advantage of our algorithm.

The rest of the paper is organized as follows. We discuss background in Section 2. We formulate the MSJO problem and analyze its complexity in Section 3. In Section 4, we present several algorithms. We evaluate our algorithms in Section 5. In Section 6, we show an implementation of our scheme in Hadoop and our evaluation in Amazon EC2. Finally, Section 7 summarizes the paper.

2. Related work

Due to the wide usage of MapReduce systems, there is a flourishing body of studies on understanding MapReduce performance, and many of them develop improvement schemes. One classification divides these studies into a system viewpoint and an algorithm viewpoint. Our work belongs to algorithm research; we show this categorization in Fig. 2.

From the system point of view: (1) Framework. There are many valuable advances in improving the MapReduce framework. For example, Apache YARN [5] is a new kind of Hadoop resource manager, which provides unified resource management and scheduling for the big data applications running on top of it. Mesos [4] is built using the same principles as the Linux kernel and provides applications (e.g., Hadoop, Spark, Kafka) with APIs for resource management and scheduling across entire datacenter and cloud environments. However, these open source programs focus on cloud resource management and system implementation, not on optimal job/task scheduling. Zhang et al. [40] identified useful functional properties for user-defined functions and proposed an optimization framework SUDO that reasons about data-partition properties, functional properties, and data shuffling. Costa et al. [9] built a MapReduce-like system Camdoop to decrease traffic by using a direct-connect network topology with servers. Zhang et al. [39] designed MIMP, a minimal interference, maximal progress scheduling system which manages both VM CPU scheduling and Hadoop job scheduling to reduce interference and increase overall efficiency. Li et al. [24] presented an efficient scheduling framework WOHA for deadline-aware MapReduce workflows. The work in [23] introduces the packing server technique to convert schedulability bounds for independent task sets into schedulability bounds for MapReduce workflows in real-time analytic applications. Our attention is focused on another aspect: we try to optimize the MapReduce performance bound in terms of the total weighted job completion time based on LP relaxation. (2) Resource Management. For example, Xie et al. [35] observed that MapReduce applications have different network bandwidth requirements at different stages of job execution, and proposed a model of the network bandwidth requirements of MapReduce jobs and a system PROTEUS to maximize the number of accommodated jobs. Herodotou et al. [20] developed Elastisizer, to which users can express their cluster sizing problems as queries in a declarative fashion. It provides reliable answers to these queries using an automated technique and gives nonexpert users a good combination of cluster resource and job configuration settings to meet their needs. In [13], Quasar, a cluster management system, is proposed to increase resource utilization while providing consistently high application performance. It uses fast classification techniques to determine the impact of different resource allocations and assignments on workload performance. (3) Heterogeneity. Zaharia et al. [38] designed a robust scheduling algorithm LATE to address the performance issues incurred by speculative execution in heterogeneous environments. In [12], Paragon, an online and QoS-aware scheduler proposed for heterogeneous clusters, includes a greedy server scheduler which minimizes interference and increases server utilization. Besides, outliers are considered in [2], which proposed Mantri to monitor tasks and cull outliers using cause- and resource-aware techniques. However, none of the above-mentioned studies considers the MapReduce job/task scheduling problem.

From the algorithmic point of view: researchers are looking into MapReduce job/task scheduling with various considerations in different scenarios. (1) Fairness. Isard et al. [21] introduced Quincy for scheduling concurrent distributed jobs with fine-grain resource sharing; Quincy achieves better fairness. Delay scheduling [37] is proposed to divide resources using max-min fair sharing to achieve statistical multiplexing. (2) Data locality. Wang et al. [33] proposed ActCap, which uses a method based on Markov chains to do node-capability-aware data placement for continuously incoming data in ever-growing heterogeneous clusters. Wang et al. [34] presented a new queueing architecture and proposed a map task scheduling algorithm constituted by the JSQ policy together with the MaxWeight policy under heavy traffic. Tan et al. [31] formulated a stochastic optimization framework to improve the data locality of reduce tasks, with the optimal placement policy exhibiting a threshold-based structure. Purlieus [25] allocates VMs for a MapReduce cluster in a data-locality-aware manner to optimize the performance of data access in the MapReduce system. Besides, Omega [30] is proposed to support cooperation of multiple schedulers in large computer clusters. Zheng et al. [41] proposed a MapReduce scheduler with provable efficiency on total flow time. Sandholm et al. [28] provided automatic application-independent optimization strategies by prioritizing users, stages in a job, and bottleneck components within a stage. However, these studies do not aim at finding an optimal solution, or even at understanding how close to the optimal their schemes are.

For the objective of fast completion time, the two works most closely related to ours are [7,8]. In [7], Chang et al. focused on a theoretical model for determining which MapReduce jobs to schedule at what times, and formulated a linear program and several approximation algorithms such as OFFA for minimizing the total job completion time. In [8], Chen et al. investigated precedence constraints between map tasks and reduce tasks in MapReduce jobs, and proposed an 8-approximation algorithm MARES to solve the joint scheduling problem; it currently has the best theoretical performance upper bound. However, both works assume that tasks are assigned to processors/servers in advance. As shown in Fig. 1, scheduling jobs without considering server assignment may result in suboptimal solutions. We fill this gap in this paper.

General scheduling on parallel machines has been studied for decades. There are many works with inspiring ideas and analytical techniques [22,29,17,10]. In particular, the polyhedron of necessary conditions for the single machine problem was derived in [26], and [29] used this result to derive approximation algorithms for single machine weighted completion time with additional side constraints. For minimizing total weighted job completion time with precedence constraints on multiple machines, the best known work is a 4-approximation algorithm in [27]. However, these works focus on the general case.

3. MapReduce server-job organizer: the problem and complexity analysis

3.1. Problem formulation

Let $J$ be a set of MapReduce jobs. Let $M$ be a set of identical machines. Let the release time of job $j$ be $r_j$. This release time is the time at which a job enters the system; note that it differs from the job start time, since the job scheduler can schedule a job to start later than its release time. Let $T_j^{(M)}$ and $T_j^{(R)}$ be the sets of map tasks and reduce tasks of each job $j$. Let $T$ be the set of all tasks of $J$. For each task $u \in T$, let $p_u$ be its processing time. We assume that a task cannot be preempted. In our assumption, MapReduce jobs can be preempted; only tasks cannot. In the current operation of MapReduce job scheduling, a job with high priority does not preempt running tasks. It waits until processors are released. This is because the number of processors is large and the tasks are relatively small. Thus, the high priority job can be put into execution quickly. We also assume that for any job $j$, the processing times of its map tasks are smaller than those of its reduce tasks. We admit that this is a key assumption for our bounding development. Yet it holds in current practice. Every map task simply scatters a chunk of data, while the reduce tasks need to gather, reorganize and process the data produced by map tasks. Moreover, the number of reduce tasks is usually configured to be much smaller than the number of map tasks. As a result, the processing times of reduce tasks are much longer than those of map tasks. We also validate this assumption in our experiment.

Let $d_{uv}$ be the delay between a map task $u \in T_j^{(M)}$ and a reduce task $v \in T_j^{(R)}$ (e.g., introduced by the shuffle phase). Let $S_u$ be the start time of task $u$. Let $S$ be the set of $S_u$, $\forall u \in T$. Let $C_j$ be the completion time of job $j$, which is the time when all reduce tasks $v \in T_j^{(R)}$ finish. Let $C$ be the set of $C_j$, $\forall j \in J$. There is a weight $w_j$ associated with job $j$, and our objective is to find a feasible schedule that minimizes the total weighted job completion time $\sum_{j \in J} w_j C_j$ subject to the following constraints for every job $j$. We can write the linear program model as follows (see Table 1):

$$S_u \ge r_j, \quad \forall u \in T_j^{(M)} \tag{1}$$

$$S_v \ge S_u + d_{uv} + p_u, \quad \forall u \in T_j^{(M)},\ v \in T_j^{(R)} \tag{2}$$

$$C_j \ge S_v + p_v, \quad \forall v \in T_j^{(R)} \tag{3}$$

Table 1

Summary of key notations.

| Notation | Definition | Notation | Definition |
| $J$ | Set of all jobs | $S_u$ | Start time of task $u$ |
| $T$ | Set of all tasks in $J$ | $p_u$ | Processing time of task $u$ |
| $|T|$ | Task number in $T$ | $d_{uv}$ | Delay between task $u$ and task $v$ |
| $M$ | Set of machines | $S$ | Set of start times of tasks |
| $|M|$ | Machine number in $M$ | $M_u$ | Middle finish time of task $u$ |
| $C_j$ | Completion time of job $j$ | $G$ | Precedence graph of all tasks |
| $C$ | Set of job completion times | $B$ | Any subset of all tasks set $T$ |
| $w_j$ | Weight of job $j$ | Cfu | Completion time of task $u$ |
| $r_j$ | Release time of job $j$ | Cf | Set of task completion time |
| $T_j^{(M)}$ | Map task set of job $j$ | W u | Weight of task $u$ |
| $T_j^{(R)}$ | Reduce task set of job $j$ | rfu | Release time of task $u$ |
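To make the formulation concrete, the following sketch evaluates the objective for a given set of start times and checks Eqs. (1) and (2). It is a minimal illustration under stated assumptions: the dictionary layout and the function name `total_weighted_completion` are hypothetical, missing delays default to zero, and machine capacity is deliberately not checked because the three constraints above do not encode it.

```python
def total_weighted_completion(jobs, start):
    """Return sum_j w_j * C_j, or raise ValueError if Eq. (1) or (2) is violated.

    jobs:  {job: {"w": weight, "r": release time,
                  "maps": {task: p_u}, "reduces": {task: p_v},
                  "delay": {(u, v): d_uv}}}   # missing pairs default to 0
    start: {task: S_u}
    """
    total = 0
    for name, job in jobs.items():
        for u, p_u in job["maps"].items():
            if start[u] < job["r"]:                       # Eq. (1)
                raise ValueError(f"{u} starts before the release of job {name}")
            for v in job["reduces"]:
                d_uv = job.get("delay", {}).get((u, v), 0)
                if start[v] < start[u] + d_uv + p_u:      # Eq. (2)
                    raise ValueError(f"{v} starts before {u} has finished")
        # Eq. (3) is tight when minimizing: C_j is the last reduce finish time.
        c_j = max(start[v] + p_v for v, p_v in job["reduces"].items())
        total += job["w"] * c_j
    return total


if __name__ == "__main__":
    jobs = {"a": {"w": 2, "r": 0, "maps": {"m1": 3},
                  "reduces": {"r1": 5}, "delay": {("m1", "r1"): 1}}}
    print(total_weighted_completion(jobs, {"m1": 0, "r1": 4}))
```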

3.2. Problem complexity

Theorem 1. MSJO is NP-complete.

Proof. It is easy to verify that the total weighted job completion time of a given schedule can be calculated in polynomial time. Therefore, MSJO is in NP. To show that MSJO is NP-complete, we reduce a job scheduling problem (problem SS13 in [15]) to it. Problem SS13, which asks for a feasible schedule minimizing the total weighted job completion time, is proven NP-complete. Given any instance $(J, M)$ of problem SS13, where $J$ is a set of jobs and $M$ is a set of identical machines, we can construct an instance $(J^M, M^M)$ of MSJO. $M$ and $M^M$ are the same. For every job $j \in J$, there is a job $j^M \in J^M$; $j$ and $j^M$ have the same job weight. $j^M$ has one map task with processing time 0 and one reduce task whose processing time is that of job $j$. The release time of $j^M$ is 0. Thus, if MSJO can be solved optimally by a polynomial algorithm, problem SS13 can be solved by this algorithm. Because problem SS13 is NP-complete, MSJO is NP-complete. □
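The construction used in the proof is mechanical and can be sketched in a few lines. The representation below (tuples in, dictionaries out) and the name `ss13_to_msjo` are assumptions made for illustration only; the machine set is passed through unchanged, as in the proof.

```python
def ss13_to_msjo(jobs):
    """Map each SS13 job (processing_time, weight) to an MSJO job.

    Each MSJO job keeps the weight, is released at time 0, and has one map
    task of length 0 and one reduce task with the original processing time.
    """
    return [
        {"weight": weight, "release": 0,
         "map_tasks": [0], "reduce_tasks": [processing_time]}
        for processing_time, weight in jobs
    ]


if __name__ == "__main__":
    print(ss13_to_msjo([(5, 2), (3, 1)]))
```

Because the map task has zero length and there is no delay, the completion time of each constructed job is determined by its single reduce task, so both instances have the same objective value for corresponding schedules.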

4. Algorithm development and theoretical analysis

We outline our approach, which is described in the next three subsections: (1) We introduce a linear programming relaxation to give a lower bound on the optimal solution of MSJO. This LP-based solution may not be a feasible solution. (2) Although there is, in theory, a polynomial time algorithm for solving this LP-relaxed problem, its high complexity makes it impractical when the problem size is large. Therefore, inspired by this classic linear programming relaxation, we develop a novel constraint generation algorithm to produce another relaxed solution which also provides a lower bound for MSJO. (3) We develop algorithm MarS to generate a feasible solution from this relaxed solution. We prove that this solution is within a factor of 3 of the optimal solution of MSJO.

4.1. Classical linear programming relaxation

Since MSJO is NP-complete, we adopt a linear programming relaxation of the problem to give a lower bound on the optimal solution value. The constraints of this LP relaxation are necessary conditions that the task start times of any feasible schedule have to satisfy. The relaxation constraints are as follows:

$$\sum_{u \in B} p_u S_u \ge \frac{1}{2|M|} \Big( \sum_{u \in B} p_u \Big)^2 - \frac{1}{2} \sum_{u \in B} p_u^2, \quad \forall B \subseteq T \tag{4}$$

where $|M|$ is the number of machines in $M$ and $B$ is any subset of $T$.

Then our linear programming relaxation problem is minimizing $\sum_{j \in J} w_j C_j$ subject to the constraints in Eqs. (1)-(4). We call this problem the Classical LP Relaxation Problem (CLS-LPP). Note that the decision variables in CLS-LPP are $S_u$ and $C_j$, so a solution can be presented as $(S, C)$.

Constraints in Eq. (4) describe a polyhedron in which the task start times of any feasible schedule lie. We give a simple example to explain the intuition. Consider 3 machines and 6 tasks $t_1, t_2, \ldots, t_6$ whose processing times are $p_1, p_2, \ldots, p_6$ respectively. Consider an assignment where $t_1$ and $t_2$ are assigned to machine 1, $t_3$ and $t_4$ to machine 2, and $t_5$ and $t_6$ to machine 3. The start times of $t_1$ and $t_2$ can be $S_1 = 0$ and $S_2 = p_1$, or $S_1 = p_2$ and $S_2 = 0$ if $t_2$ is scheduled first. In either case, we have

$$p_1 S_1 + p_2 S_2 = p_1 p_2 = \frac{1}{2} \left( (p_1 + p_2)^2 - (p_1^2 + p_2^2) \right).$$

Tasks on the other machines have similar equations. Adding these equations together, we have

$$\sum_{i=1}^{6} p_i S_i = \frac{1}{2}(p_1 + p_2)^2 + \frac{1}{2}(p_3 + p_4)^2 + \frac{1}{2}(p_5 + p_6)^2 - \frac{1}{2} \sum_{i=1}^{6} p_i^2 \ge \frac{1}{2} \times \frac{1}{3} \Big( \sum_{i=1}^{6} p_i \Big)^2 - \frac{1}{2} \sum_{i=1}^{6} p_i^2,$$

where equality holds when $p_1 + p_2 = p_3 + p_4 = p_5 + p_6 = \frac{1}{3} \sum_{i=1}^{6} p_i$. Note that this argument can be extended to any feasible task schedule. When additional constraints are added, the task start times can only increase. As a result, the left side of Eq. (4) increases and the relation still holds.
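The intuition above can be checked numerically on small instances by enumerating every subset $B$. The sketch below is illustrative only: the helper names `start_times` and `eq4_holds` and the sample processing times are assumptions, and the exhaustive enumeration is exponential in the number of tasks, which is exactly why the full constraint set is impractical.

```python
from itertools import combinations


def start_times(machine_queues):
    """machine_queues: one list of (task, p) per machine, run back-to-back from 0.

    Returns (S, p): dicts mapping each task to its start and processing time.
    """
    S, p = {}, {}
    for queue in machine_queues:
        t = 0
        for task, duration in queue:
            S[task], p[task] = t, duration
            t += duration
    return S, p


def eq4_holds(S, p, num_machines, tol=1e-9):
    """Check Eq. (4) for every non-empty subset B of the tasks."""
    tasks = list(S)
    for size in range(1, len(tasks) + 1):
        for B in combinations(tasks, size):
            lhs = sum(p[u] * S[u] for u in B)
            total = sum(p[u] for u in B)
            rhs = total ** 2 / (2 * num_machines) - sum(p[u] ** 2 for u in B) / 2
            if lhs < rhs - tol:
                return False
    return True


if __name__ == "__main__":
    # Three machines with equal load 6, so the full set B = T meets Eq. (4) with equality.
    queues = [[("t1", 2), ("t2", 4)], [("t3", 3), ("t4", 3)], [("t5", 1), ("t6", 5)]]
    S, p = start_times(queues)
    print(eq4_holds(S, p, num_machines=3))
```

Start times that overlap on a single machine, such as two tasks of length 2 both starting at 0, fail the check, which shows how Eq. (4) cuts off infeasible points.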

4.2. Conditional LP relaxation and constraint generation algorithm

Note that there is an exponential number of constraints in Eq. (4) due to the exponential number of subsets $B$. To the best of our knowledge, algorithms for solving LP problems need to handle all constraints. Their computing complexity is at least linear in $n$, where $n$ is the number of constraints in the problem, because the algorithms need to check whether all constraints are satisfied. In our LP-relaxed problem, the exponential number of constraints makes the computing complexity unacceptable. We derive a new LP-relaxation problem which contains only a small subset of the constraints in Eq. (4). The optimal result of this new LP-relaxation problem also leads to the 3-approximation algorithm to be developed in Section 4.3. Because this new problem is built by iteratively adding constraints based on checking a certain property of its solution, we call it the Conditional LP-relaxation problem (CND-LPP).

Before developing our algorithm for building CND-LPP, we introduce a property of the solutions that satisfy Eq. (4). Given the start time $S_u$ of task $u$, let $M_u = S_u + \frac{1}{2} p_u$ be the middle finish time of task $u$. The property is described as follows:

Property 1. Given $S$ satisfying Eq. (4), we sort tasks in non-descending order of their middle finish times. We use a permutation $\pi$ to represent the sorting result, where $M_{\pi(1)} \le M_{\pi(2)} \le \cdots \le M_{\pi(|T|)}$. We have the following inequality for all $i \in [2 \ldots |T|]$:

$$\frac{1}{|M|} \sum_{k=1}^{i-1} p_{\pi(k)} \le 2 M_{\pi(i)}. \tag{5}$$

Proof. For a given permutation $\pi$ and task $\pi(i)$, we create a task set $B = \{\pi(1), \pi(2), \ldots, \pi(i-1)\}$. We rewrite Eq. (4) as:

$$\sum_{k=1}^{i-1} p_{\pi(k)} \Big( M_{\pi(k)} - \frac{p_{\pi(k)}}{2} \Big) \ge \frac{1}{2|M|} \Big( \sum_{k=1}^{i-1} p_{\pi(k)} \Big)^2 - \frac{1}{2} \sum_{k=1}^{i-1} p_{\pi(k)}^2.$$

Then we have

$$\sum_{k=1}^{i-1} p_{\pi(k)} M_{\pi(k)} \ge \frac{1}{2|M|} \Big( \sum_{k=1}^{i-1} p_{\pi(k)} \Big)^2.$$

Because $M_{\pi(1)} \le M_{\pi(2)} \le \cdots \le M_{\pi(|T|)}$, we have:

$$M_{\pi(i)} \sum_{k=1}^{i-1} p_{\pi(k)} \ge \sum_{k=1}^{i-1} p_{\pi(k)} M_{\pi(k)} \ge \frac{1}{2|M|} \Big( \sum_{k=1}^{i-1} p_{\pi(k)} \Big)^2.$$

Finally, we have Eq. (5) by eliminating $\sum_{k=1}^{i-1} p_{\pi(k)}$. □

Given a solution of CLS-LPP, if we schedule tasks in non-descending order of their middle finish times, Property 1 gives us a basic relation between the total processing time of previously scheduled tasks and the middle finish time of the unscheduled tasks. We will use this property to prove the theoretical bound of our algorithm MarS. Note that this property holds for any solution satisfying the constraints in Eq. (4). If we can build a CND-LPP whose optimal solution also has this property, MarS has the same theoretical bound based on this CND-LPP.

Recall that the intuition behind Eq. (4) is to describe a polyhedron in which the task start times of a feasible schedule lie. We denote this polyhedron as $P_L$. Given a specific CLS-LPP, its constraints define a polyhedron, denoted as $P_{CLS}$. All precedence constraints in Eqs. (1)-(3) define another polyhedron, denoted as $P_{PC}$. We know that $P_{CLS} = P_L \cap P_{PC}$ (see Fig. 3). The objective function of the problem defines a hyperplane. Solving CLS-LPP is searching for a point in $P_{CLS}$ which has the smallest distance to this hyperplane. Instead of finding the optimal point in $P_{CLS}$ (point A in Fig. 3), we search for a solution in $P_{PC}$ (point B in Fig. 3) which has Property 1. We start with an initial CND-LPP which only contains the precedence constraints in Eqs. (1)-(3). By iteratively adding carefully chosen constraints from Eq. (4), we approach the desired solution.

[Figure 3: the figure labels the hyperplane defined by the objective function.]

Fig. 3. Illustration of optimal solutions of CLS-LPP and CND-LPP. Fine black lines represent constraints in Eq. (4). Thick black lines represent constraints in Eqs. (1)-(3). Blue lines represent the boundary of $P_L$. Red lines represent the boundary of $P_{PC}$. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

To check whether an optimal solution satisfies Property 1, we can check whether the constraints in Eq. (6) are satisfied by this optimal solution for every task $\pi(i)$. We formally define these constraints as follows:

Definition. The Performance Guarantee Constraint for a given $\pi$ and index $i$, denoted as $PGC(\pi, i)$, is defined as follows:

$$\sum_{k=1}^{i-1} p_{\pi(k)} S_{\pi(k)} \ge R(\pi, i) \tag{6}$$

where $R(\pi, i) = \frac{1}{2|M|} \Big( \sum_{k=1}^{i-1} p_{\pi(k)} \Big)^2 - \frac{1}{2} \sum_{k=1}^{i-1} p_{\pi(k)}^2$.

We call them performance guarantee constraints because if an optimal LP solution satisfies $PGC(\pi, i)$ for $\forall i \in [2 \ldots |T|]$, MarS can produce a feasible solution with guaranteed performance.

We first describe the main process of our constraint generation algorithm (COGE) (see Algorithm 1). First we build an initial CND-LPP. In this initial CND-LPP, all precedence constraints in Eqs. (1)-(3) are included. Its objective function is the same as that of MSJO. Then, there are 3 main steps. In step 1, we solve CND-LPP and get an optimal solution $(S^{LP}, C^{LP})$ for the current CND-LPP. In step 2, we use $S^{LP}$ to produce $\pi$ and build PGCs based on $\pi$. In step 3, we check whether $S^{LP}$ satisfies $PGC(\pi, i)$ for $\forall i \in [1 \ldots |T|]$. If $S^{LP}$ satisfies all PGCs, we are done. Otherwise, we add the violated PGCs to CND-LPP and repeat steps 1-3 until we produce an optimal solution that satisfies all PGCs.

Because finding an optimal solution satisfying all PGCs may still involve large computation complexity in large problems, given a threshold $\epsilon$ we can terminate the computation if the current solution $(S^{LP}, C^{LP})$ is within $(1 - \epsilon)$ of the optimal solution satisfying all PGCs. Unfortunately, this optimal solution is hard to compute. Instead, we construct a feasible solution $(S^{NV}, C^{NV})$ which satisfies all PGCs (NV means no-violation). When $PGC(\pi, i)$ is not satisfied, we calculate an offset which indicates how much $S_{\pi(i-1)}$ should increase to satisfy $PGC(\pi, i)$. Function $ViolationOffset(PGC(\pi, i), S^{LP})$ calculates this offset as follows:

$$\mathit{offset} = \frac{1}{p_{\pi(i-1)}} \Big( R(\pi, i) - \sum_{k=1}^{i-1} p_{\pi(k)} S^{LP}_{\pi(k)} \Big).$$

Thus, we can build a feasible solution by adding this offset to $S^{NV}_{\pi(k)}$ for $k \in [i-1 \ldots |T|]$. Finally, after all PGCs are checked, we check whether $(S^{LP}, C^{LP})$ reaches the stop threshold. COGE is a heuristic-based solution for the LP-relaxed problem. In practice, COGE can produce a satisfactory result in fewer than 10 iterations.
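The PGC check and the offset computation do not need an LP solver and can be sketched on their own. The code below is a minimal illustration under assumptions: the function name `pgc_violations` is hypothetical, start and processing times are plain dictionaries, and processing times are assumed strictly positive so the division in the offset is defined.

```python
def pgc_violations(S, p, num_machines, tol=1e-9):
    """Return (pi, violations) for start times S and processing times p.

    pi is the task order by non-descending middle finish time S_u + p_u / 2.
    violations is a list of (i, offset) pairs, one per violated PGC(pi, i),
    with i counted from 1 as in the text.
    """
    pi = sorted(S, key=lambda u: S[u] + p[u] / 2)
    violations = []
    lhs = total = squares = 0.0        # running sums over pi(1) .. pi(i-1)
    for i in range(1, len(pi) + 1):
        rhs = total ** 2 / (2 * num_machines) - squares / 2    # R(pi, i)
        if lhs < rhs - tol:
            prev = pi[i - 2]                                   # task pi(i-1)
            violations.append((i, (rhs - lhs) / p[prev]))
        u = pi[i - 1]
        lhs += p[u] * S[u]
        total += p[u]
        squares += p[u] ** 2
    return pi, violations


if __name__ == "__main__":
    p = {"a": 2, "b": 2, "c": 2}
    print(pgc_violations({"a": 0, "b": 0, "c": 0}, p, num_machines=1))
    print(pgc_violations({"a": 0, "b": 2, "c": 4}, p, num_machines=1))
```

In the first call all three tasks overlap on one machine, so the constraint for $i = 3$ is violated; the second call uses a feasible back-to-back schedule and reports no violation.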

Algorithm 1 COGE()

Input: 1) Job set $J$; 2) Machine set $M$; 3) Stop threshold $\epsilon$
Output: A solution $(S^{LP}, C^{LP})$.
1: Build initial CND-LPP with Eqs. (1)-(3);
2: repeat
3:     Solve CND-LPP with a linear programming solver;
4:     Let $(S^{LP}, C^{LP})$ be optimal solution of current CND-LPP;
5:     Let violated PGC number $vn$ be 0;
6:     $(S^{NV}, C^{NV}) \leftarrow (S^{LP}, C^{LP})$;
7:     $\pi \leftarrow$ Sort tasks by middle finish time according to $S^{LP}$;
8:     for all $i \in [1 \ldots |T|]$ do
9:         if $PGC(\pi, i)$ is satisfied by $S^{LP}$ then
10:             continue;
11:         Add $PGC(\pi, i)$ to CND-LPP;
12:         $vn = vn + 1$;
13:         $\mathit{offset} = ViolationOffset(PGC(\pi, i), S^{LP})$
14:         for all $k \in [i-1 \ldots |T|]$ do
15:             $S^{NV}_{\pi(k)} = S^{NV}_{\pi(k)} + \mathit{offset}$;
16:     Update $C^{NV}$ according to $S^{NV}$;
17:     if $\sum_{j \in J} w_j C^{LP}_j \ge (1 - \epsilon) \sum_{j \in J} w_j C^{NV}_j$ then
18:         return $(S^{LP}, C^{LP})$;
19: until $vn$ is 0
20: return $(S^{LP}, C^{LP})$;

4.3. MarS algorithm

In this section, we describe MarS. MarS is a heuristic algorithm which derives a feasible schedule from the optimal solution of the linear relaxation problem. Let $(S^{LP}, C^{LP})$ denote the optimal result of our LP relaxation problem. Let $M_u^{LP}$ denote the middle finish time of task $u$ in the LP optimal result. We have $M_u^{LP} = S_u^{LP} + p_u / 2$, $\forall u \in T$. Let $S^H$ be the set of task start times in the final schedule.

Our algorithm MarS is shown in Algorithm 2. We first produce $\pi$ based on $S^{LP}$. Then we schedule tasks from $\pi(1)$ to $\pi(|T|)$, meaning we schedule tasks in non-descending order of their middle finish times. In each iteration $i$, we first compute the earliest possible start time $S_{earliest}$ of task $\pi(i)$ to make sure that the precedence constraints are satisfied. Then we choose a machine $m^*$ which has the earliest idle time among all machines. We schedule $\pi(i)$ on machine $m^*$ with start time $\max\{S_{earliest}, e_{m^*}\}$ and update $e_{m^*}$ to the finish time of $\pi(i)$.

Because MarS schedules tasks in non-descending order of their middle finish times, Property 1 gives us a basic relation between the total processing time of previous scheduled tasks and middle finish time of unscheduled tasks. When release time constraints and precedence constraints exist, there may be idle time intervals between tasks. Thus, we introduce Lemma 2 to give an upper bound to the total length of these intervals.
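The scheduling loop described above can be sketched as follows. This is an illustration based on the prose description only, not a transcription of Algorithm 2: the function name `mars_sketch`, the task dictionary layout, and the tie-breaking rule (tasks without predecessors first on equal middle finish times) are all assumptions.

```python
def mars_sketch(tasks, s_lp, num_machines):
    """List-schedule tasks in non-descending order of LP middle finish time.

    tasks: {u: {"p": processing time,
                "release": release time (optional, default 0),
                "preds": {predecessor task: delay} (optional)}}
    s_lp:  {u: start time of u in the LP solution}
    Returns {u: (machine index, start time)}.
    """
    # Assumed tie-break: on equal middle finish times, tasks without
    # predecessors (map tasks) come before tasks with predecessors.
    order = sorted(
        tasks,
        key=lambda u: (s_lp[u] + tasks[u]["p"] / 2, bool(tasks[u].get("preds"))),
    )
    idle = [0.0] * num_machines        # earliest idle time e_m of each machine
    schedule = {}
    for u in order:
        info = tasks[u]
        earliest = info.get("release", 0.0)
        for pred, delay in info.get("preds", {}).items():
            _, s_pred = schedule[pred]            # predecessor already placed
            earliest = max(earliest, s_pred + tasks[pred]["p"] + delay)
        m = min(range(num_machines), key=lambda i: idle[i])
        start = max(earliest, idle[m])
        schedule[u] = (m, start)
        idle[m] = start + info["p"]
    return schedule


if __name__ == "__main__":
    tasks = {
        "m1": {"p": 2, "release": 0},
        "m2": {"p": 2, "release": 0},
        "r1": {"p": 4, "preds": {"m1": 1, "m2": 1}},
    }
    print(mars_sketch(tasks, {"m1": 0, "m2": 0, "r1": 3}, num_machines=2))
```

The sketch relies on every predecessor appearing earlier in the order, which holds when the LP start times satisfy the precedence constraints and reduce tasks are at least as long as map tasks.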

Lemma 2. Given the schedule result after tasks $\pi(k)$, $k \in [1 \ldots (i-1)]$, are scheduled by MarS, if we choose any machine $m$ and start $\pi(i)$ as soon as possible after machine $m$ is idle, it results in a start time $S^H_{\pi(i)}$ for $\pi(i)$. Let $g(m, \pi(i))$ be the total length of idle time intervals on machine $m$ before $S^H_{\pi(i)}$. We have:

$$g(m, \pi(i)) \le M^{LP}_{\pi(i)}. \tag{7}$$

Proof. We outline our idea first. In our schedule result, there is an idle time interval before a task because this task cannot start earlier due to certain precedence constraints. These constraints are tight in our schedule result but may not be tight in the LP optimal result. For example, there is an idle time interval between tasks $u_2$ and $v_3$ in Fig. 4 because $S^H_{u_2} = S^H_{u_1} + d_{u_1 u_2} + p_{u_1}$. Otherwise, task $u_2$ could start earlier. However, we only know $S^{LP}_{u_2} \ge S^{LP}_{u_1} + d_{u_1 u_2} + p_{u_1}$. Our idea is to prove $S^H_{u_2} - (S^H_{v_3} + p_{v_3}) \le M^{LP}_{u_2} - M^{LP}_{v_3}$ by analyzing these tight precedence constraints in our schedule result. The left side of this inequality is the maximum length of the idle time interval between $v_3$ and $u_2$. We iteratively develop similar inequalities for the idle time intervals before task $v_3$. Summing up both sides of these inequalities, we have Eq. (7).

Next, we start our formal proof. For a scheduling problem, we can build a precedence graph $G = (V, L)$ to describe all precedence constraints with delay. $V$ is the set of tasks. For two tasks $u$ and $v$, a directed link $(u, v) \in L$ with length $d_{uv}$ indicates that the execution of $v$ waits at least a time interval $d_{uv}$ after $u$ finishes. For our problem, we introduce a dummy initial task $t_I$ to represent the start point of the schedule. Then, the constraints $S_u \ge r_j$, $\forall u \in T_j^{(M)}$ in Eq. (1) can be expressed in the form of a precedence constraint with delay: $S_u \ge S_{t_I} + d_{t_I u} + p_{t_I}$, $\forall u \in T_j^{(M)}$, where $S_{t_I} = 0$, $d_{t_I u} = r_j$, $p_{t_I} = 0$. In the remaining part of this proof, we only mention precedence constraints with delay.

Given a schedule result, the delay between $u$ and $v$ may be longer than $d_{uv}$. We say a link $(u, v) \in L$ is tight in this schedule result if $S^H_v = S^H_u + d_{uv} + p_u$. In a schedule result, there may be a tight-link path $\{u_1 \to u_2 \to \cdots \to u_s\}$ where link $(u_k, u_{k+1}) \in L$ is tight for all $k \in [1 \ldots s-1]$, $s \ge 2$. Fig. 4 demonstrates a three-node tight-link path in a schedule result. If $u_s = u$, we call $\{u_1 \to u_2 \to \cdots \to u_s\}$ a tight-link path of task $u$. Among all tight-link paths of task $u$, there is a tight-link path which has the maximal number of nodes. We call it the longest tight-link path of task $u$, denoted as $LTLP(u)$. In our problem, for a map task $u \in T_j^{(M)}$, $LTLP(u)$ can be empty or $LTLP_1 = \{t_I \to u\}$. For a reduce task $v \in T_j^{(R)}$, $LTLP(v)$ can be $LTLP_2 = \{t_I \to u \to v\}$ or $LTLP_3 = \{u \to v\}$ where $u \in T_j^{(M)}$.

[Figure 4: schedule on Machine 1 and Machine 2.]

Fig. 4. Demo of tight-link paths $v_1 \to v_2 \to v_3$ and $u_1 \to u_2$ in a schedule result for the proof of Lemma 2.

For a tight link $(u, v)$, the following inequality holds for the optimal result of the LP relaxation problem because precedence constraints are satisfied:

$$S^{LP}_v \ge S^{LP}_u + d_{uv} + p_u.$$

Then we have:

$$M^{LP}_v \ge M^{LP}_u + d_{uv} + p_u + \frac{1}{2}(p_v - p_u).$$
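The step from start times to middle finish times can be spelled out as follows. This is a worked derivation under the assumption that the middle finish time is $M_u = S_u + \frac{1}{2}p_u$, which is the definition implied by the identity $3M^{LP}_{\pi(i)} + \frac{3}{2}p_{\pi(i)} = 3(S^{LP}_{\pi(i)} + p_{\pi(i)})$ used in the proof of Theorem 3.

```latex
\begin{align*}
M^{LP}_v &= S^{LP}_v + \tfrac{1}{2}p_v \\
         &\ge S^{LP}_u + d_{uv} + p_u + \tfrac{1}{2}p_v \\
         &= \bigl(M^{LP}_u - \tfrac{1}{2}p_u\bigr) + d_{uv} + p_u + \tfrac{1}{2}p_v \\
         &= M^{LP}_u + d_{uv} + p_u + \tfrac{1}{2}(p_v - p_u).
\end{align*}
```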

For a tight-link path $\{u_1 \to u_2 \to \cdots \to u_s\}$, we have:

$$M^{LP}_{u_s} \ge M^{LP}_{u_1} + \sum_{k=1}^{s-1}\left(d_{u_ku_{k+1}} + p_{u_k}\right) + \frac{1}{2}\left(p_{u_s} - p_{u_1}\right).$$

For $LTLP_1$, $LTLP_2$ and $LTLP_3$, we always have $p_{u_s} \ge p_{u_1}$ because in a specific job, processing times of its reduce tasks are longer than those of its map tasks. Then we have:

$$M^{LP}_{u_s} \ge M^{LP}_{u_1} + \sum_{k=1}^{s-1}\left(d_{u_ku_{k+1}} + p_{u_k}\right). \tag{8}$$

Because all links in a tight-link path are tight, we also have:

$$S^H_{u_s} - S^H_{u_1} = \sum_{k=1}^{s-1}\left(d_{u_ku_{k+1}} + p_{u_k}\right). \tag{9}$$

Based on Eqs. (8) and (9), we analyze lengths of idle time intervals on machine $m$. After scheduling $\pi(i)$ to machine $m$, there are idle time intervals before $S^H_{\pi(i)}$. Considering a task $u$ right after an idle time interval, we find LTLP$(u) = \{u_1 \to u_2 \to \cdots \to u_s\}$. Because $u_1$ is the first task in LTLP$(u_s)$, there must be a task $v$ scheduled on machine $m$ where $S^H_v \le S^H_{u_1} \le S^H_v + p_v$ (see task $v_3$ in Fig. 4) and $M^{LP}_v \le M^{LP}_{u_1}$. Otherwise, $u_1$ can start earlier. With Eqs. (8) and (9), we have:

$$S^H_{u_s} - (S^H_v + p_v) \le M^{LP}_{u_s} - M^{LP}_v.$$

The left side of this inequality is the maximum length of the idle time interval between $u$ and $v$. We repeat developing this inequality for the idle time interval before $v$. Finally, we end up with $t_1$, whose middle finish time is 0. By adding all these inequalities together, we have Eq. (7). □

Lemma 2 gives an upper bound for the total idle time intervals in the scheduling of task $\pi(i)$. Note that this result only holds when tasks from $\pi(1)$ to $\pi(i-1)$ are scheduled according to MarS. Next, we introduce Theorem 3.

Algorithm 2 MarS()

Input: 1) Job set $J$; 2) Machine set $M$; 3) LP optimal result $S^{LP}$
Output: Scheduling result $S^H$.

$S^H \leftarrow \emptyset$;
$\pi \leftarrow$ Sort tasks by middle finish time according to $S^{LP}$;
Let $e_m$ be the earliest idle time of machine $m$;
$e_m \leftarrow 0, \forall m \in M$;
for all $i \in [1 \ldots |T|]$ do
    Find job $j$ where $\pi(i) \in T_j^{(M)}$ or $\pi(i) \in T_j^{(R)}$;
    if $\pi(i) \in T_j^{(M)}$ then
        $S_{earliest} = r_j$;
    if $\pi(i) \in T_j^{(R)}$ then
        $S_{earliest} = \max_{u \in T_j^{(M)},\, S^H_u \in S^H} \{S^H_u + p_u + d_{u\pi(i)}\}$;
    Find $m^*$ where $e_{m^*} = \min_{m \in M}\{e_m\}$;
    $S^H_{\pi(i)} = \max\{S_{earliest}, e_{m^*}\}$;
    $e_{m^*} = S^H_{\pi(i)} + p_{\pi(i)}$;
    $S^H \leftarrow S^H \cup \{S^H_{\pi(i)}\}$;
return $S^H$;
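The list-scheduling loop of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the `Task` record, the `release` and `delay` dictionaries, and the function name `mars_schedule` are hypothetical, the LP start times are assumed to be given, and the middle finish time is assumed to be the LP start time plus half the processing time.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    job: str
    kind: str      # "map" or "reduce"
    p: float       # processing time
    s_lp: float    # start time in the LP optimal result (assumed given)


def mars_schedule(tasks, release, delay, num_machines):
    """Schedule tasks in order of LP middle finish time (sketch of Algorithm 2).

    release: job name -> release time r_j
    delay:   (map task name, reduce task name) -> delay d_uv (0 if missing)
    Returns (start times, machine assignment), both keyed by task name.
    """
    # Assumption: middle finish time M_u = S_u^LP + p_u / 2.
    order = sorted(tasks, key=lambda t: t.s_lp + t.p / 2.0)
    idle = [0.0] * num_machines          # e_m: earliest idle time per machine
    start, machine = {}, {}
    for t in order:
        if t.kind == "map":
            earliest = release[t.job]
        else:
            # A reduce task waits for every already-scheduled map task of its job.
            earliest = max(
                (start[u.name] + u.p + delay.get((u.name, t.name), 0.0)
                 for u in tasks
                 if u.job == t.job and u.kind == "map" and u.name in start),
                default=release[t.job],  # assumption: fall back to r_j
            )
        m = min(range(num_machines), key=idle.__getitem__)   # machine m*
        start[t.name] = max(earliest, idle[m])
        machine[t.name] = m
        idle[m] = start[t.name] + t.p
    return start, machine
```

For example, with two machines, one job released at time 0, two map tasks of length 2, one reduce task of length 3, and a delay of 1 on each map-to-reduce link, the two map tasks start at time 0 on different machines and the reduce task starts at time 3.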

Theorem 3. MarS is a 3-approximation algorithm for MSJO.

Proof. Consider that $\pi(i)$ is scheduled to machine $m$. Let $U(m,\pi(i))$ be the set of tasks scheduled to machine $m$ before $\pi(i)$. Combining Eqs. (5) and (7), we have:

$$\frac{1}{|M|}\sum_{k=1}^{i-1} p_{\pi(k)} + g(m,\pi(i)) \le 3M^{LP}_{\pi(i)}.$$

In our algorithm, we choose the machine that starts task $\pi(i)$ as early as possible, so we have:

$$S^H_{\pi(i)} \le \sum_{u \in U(m,\pi(i))} p_u + g(m,\pi(i)), \quad \forall m \in M.$$

There must exist a machine $m'$ where

$$\sum_{u \in U(m',\pi(i))} p_u \le \frac{1}{|M|}\sum_{k=1}^{i-1} p_{\pi(k)}.$$

Then, $S^H_{\pi(i)} \le 3M^{LP}_{\pi(i)}$ and we have:

$$S^H_{\pi(i)} + p_{\pi(i)} \le 3M^{LP}_{\pi(i)} + \frac{3}{2}p_{\pi(i)} = 3\left(S^{LP}_{\pi(i)} + p_{\pi(i)}\right). \tag{10}$$

Eq. (10) holds for every task $\pi(i)$, $i \in [1 \ldots |T|]$, so $C^H_j \le 3C^{LP}_j, \forall j \in J$. Finally, we have:

$$\sum_{j \in J} w_j C^H_j \le 3\sum_{j \in J} w_j C^{LP}_j.$$

Because $\sum_{j \in J} w_j C^{LP}_j$ is a lower bound of the optimal, MarS is a 3-approximation algorithm for MSJO. □

4.4. Extension: a fine-grained MapReduce task scheduling algorithm (T-MarS)

In this section, we propose a fine-grained task-level MapReduce scheduler. Our optimization objective is to minimize $\sum_{u \in T} w_u C_u$, where $w_u$ denotes the weight of task $u$ and $C_u$ denotes the completion time of task $u$ (note that $C_u = S_u + p_u$). Instead of considering the coarse-grained submission time of a job, we assume every task has a release time and let $r_u$ denote the release time of task $u$, which indicates the earliest time that the task can be scheduled. We can rewrite the LP model as follows:

$$\min \sum_{u \in T} w_u C_u$$

$$C_u \ge r_u + p_u, \quad \forall u \in T \tag{11}$$

$$C_v \ge C_u + p_v, \quad \forall (u, v) \in L. \tag{12}$$

Following the proof in Section 3.2, a similar method shows that this problem is also NP-complete. For minimizing total weighted job completion time with precedence constraints, the best known work is an 8-approximation algorithm in [8], which does not consider the impact of processor/machine assignment. Here, we develop a fast 7-approximation heuristic algorithm, T-MarS.

Before developing our heuristic algorithm, we first adopt a linear programming relaxation of the NP-hard problem to give a lower bound on the optimal solution value. The relaxation constraints are necessary conditions that task completion times in a feasible schedule result have to satisfy. They are shown as follows:

$$\sum_{u \in B} p_u C_u \ge \frac{1}{2|M|}\left(\sum_{u \in B} p_u^2 + \Bigl(\sum_{u \in B} p_u\Bigr)^2\right), \tag{13}$$

where $|M|$ is the number of machines in $M$, and $B$ is any subset of $T$.

Proof. To prove Eq. (13) we give a brief explanation. Consider a subset $I \subseteq T$ and sort tasks in non-descending order of their completion times. Let $I = \{1, 2, \ldots, u\}$, then task $u$ is the last task to be finished. Assume the performances of all the processors/machines are identical. If task $u$ is scheduled on machine $m^*$ ($m^* \in M$), then $m^*$ is the most heavily loaded machine. So the load on machine $m^*$ is at least $(\sum_{i \in I} p_i)/|M|$, and $C_u \ge (\sum_{i \in I} p_i)/|M|$. We have:

$$\sum_{u \in T} p_u C_u \ge \frac{1}{|M|}\sum_{u \in T}\Bigl(p_u \sum_{i \in I} p_i\Bigr).$$

Because

$$\frac{1}{|M|}\sum_{u \in T}\Bigl(p_u \sum_{i \in I} p_i\Bigr) = \frac{1}{|M|}\sum_{u \in T}\sum_{i \in I} p_i p_u = \frac{1}{2|M|}\left(\sum_{u \in T} p_u^2 + \Bigl(\sum_{u \in T} p_u\Bigr)^2\right),$$

we have:

$$\sum_{u \in T} p_u C_u \ge \frac{1}{2|M|}\left(\sum_{u \in T} p_u^2 + \Bigl(\sum_{u \in T} p_u\Bigr)^2\right).$$

Therefore, Eq. (13) holds when $B = T$. The same argument extends to the general case, so Eq. (13) also holds when $B$ is any subset of $T$. □
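The algebraic identity used in the proof above, namely that the sum over tasks of each processing time multiplied by the prefix sum up to and including that task equals half of the sum of squares plus the squared sum, can be checked numerically. The sketch below is only a sanity check with hypothetical function names; tasks are assumed to be listed in non-descending order of completion time.

```python
from itertools import accumulate


def prefix_product_sum(p, num_machines):
    """(1/|M|) * sum over u of p_u * (p_1 + ... + p_u)."""
    prefix = list(accumulate(p))
    return sum(pu * pre for pu, pre in zip(p, prefix)) / num_machines


def eq13_right_side(p, num_machines):
    """(1/(2|M|)) * (sum of squares + square of the sum)."""
    return (sum(x * x for x in p) + sum(p) ** 2) / (2 * num_machines)
```

With processing times 3, 1, 4, 1, 5 and two machines, both functions return 62.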

Next, we introduce a property of the solutions that satisfy Eq. (13). The property is described as follows:

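Property 2. Given $C_1, C_2, \ldots, C_{|T|}$ satisfying Eq. (13), we sort tasks in non-descending order of their completion times. Without loss of generality, assume $C_1 \le C_2 \le \cdots \le C_{|T|}$. Let $B = \{1, 2, \ldots, i\}$ and $j \in B$, then we have:

$$C_i \ge \frac{1}{2|M|}\sum_{j \in B} p_j.$$

Proof. Because $C_1, C_2, \ldots, C_{|T|}$ satisfy Eq. (13), we have:

$$\sum_{j \in B} p_j C_j \ge \frac{1}{2|M|}\left(\sum_{j \in B} p_j^2 + \Bigl(\sum_{j \in B} p_j\Bigr)^2\right).$$

According to the hypothesis, task $j$ is finished no later than task $i$, so $C_j \le C_i$. Then

$$C_i \sum_{j \in B} p_j \ge \sum_{j \in B} p_j C_j \ge \frac{1}{2|M|}\Bigl(\sum_{j \in B} p_j\Bigr)^2.$$

Consequently, we have $C_i \ge \frac{1}{2|M|}\sum_{j \in B} p_j$. □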

Let $(S^{LP}, C^{LP})$ denote the optimal result of our LP relaxation problem. Let $S^{H'}$ be the set of task start times in the final schedule result. Our algorithm T-MarS is shown in Algorithm 3. First, we produce $\pi$ based on $C^{LP}$. Second, note that the upper bound on the length of any feasible schedule without unforced idle time is $\max_{u \in T}\{r_u\} + \sum_{u \in T} p_u$. We divide the time line into intervals $[1, 1], (1, 2], (2, 4], \ldots, (2^{N-2}, 2^{N-1}]$, where $N$ is the smallest integer such that $2^{N-1}$ is at least $\max_{u \in T}\{r_u\} + \sum_{u \in T} p_u$. Let $t_0 = 1$ and $t_n = 2^{n-1}$, $n \in [1, N]$. We use $B_n$ to denote all the tasks whose $C^{LP}_{\pi(i)}$ lie in interval $(t_{n-1}, t_n]$, that is, $t_{n-1} < C^{LP}_{\pi(i)} \le t_n$. Third, we define $a_n$ as the average load on a machine for all the tasks in $B_n$, so $a_n = (\sum_{u \in B_n} p_u)/|M|$. Let $\bar{t}_n = 1 + \sum_{k=1}^{n}(t_k + a_k)$. For any $B_n$, $n \in [1, N]$, we schedule the tasks which belong to $B_n$ using the Graham list-scheduling algorithm [16] in the interval $(\bar{t}_{n-1}, \bar{t}_n]$. The main idea of the Graham algorithm can be stated as follows: all the tasks (without release times, or with release times equal to zero) are ordered in some list, and whenever one machine becomes idle, the next available task on the list is allocated to start on that machine, where a task is available if all its predecessors have been finished. Let $S_j^{available}$ denote the start time of the next available task $j$ and $pre(j)$ be its precedence parent node.

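Algorithm 3 T-MarS()

Input: 1) Task set $T$; 2) Machine set $M$; 3) LP optimal result $C^{LP}$
Output: Scheduling result $S^{H'}$.

$S^{H'} \leftarrow \emptyset$;
$\pi \leftarrow$ Sort tasks by middle finish time according to $C^{LP}$;
Let $e_m$ be the earliest idle time of machine $m$;
$e_m \leftarrow 0, \forall m \in M$;
$N \leftarrow \lceil \log_2(\max_{u \in T}\{r_u\} + \sum_{u \in T} p_u) \rceil + 1$;
$B_n \leftarrow \emptyset$, $t_n \leftarrow 2^{n-1}$, $\forall n \in [1, \ldots, N]$;
for all $i \in [1 \ldots |T|]$ do
    $n = \lceil \log_2 C^{LP}_{\pi(i)} \rceil + 1$;
    $B_n = B_n \cup \pi(i)$;
for all $n \in [1 \ldots N]$ do
    if $B_n \ne \emptyset$ then
        $a_n = (\sum_{u \in B_n} p_u)/|M|$;
        $\bar{t}_n = 1 + \sum_{k=1}^{n}(t_k + a_k)$;
        Graham($B_n$, $\bar{t}_{n-1}$, $\bar{t}_n$, $M$, $S^{H'}$);
return $S^{H'}$;

function Graham($B_n$, $\bar{t}_{n-1}$, $\bar{t}_n$, $M$, $S^{H'}$)
    for all $j \in [1 \ldots |B_n|]$ do
        $e_{m^*} = \min_{m \in M}\{e_m\}$;
        $S_j^{available} = \max_{S^{H'}_{pre(j)} \in S^{H'}} \{S^{H'}_{pre(j)} + p_{pre(j)}\}$;
        $S_j^{H'} = \max\{S_j^{available}, e_{m^*}, \bar{t}_{n-1}\}$;
        $S^{H'} \leftarrow S^{H'} \cup \{S_j^{H'}\}$;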
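The interval bookkeeping used by T-MarS (bucketing tasks by LP completion time into the geometric intervals, then computing the shifted end point of each interval as one plus the running sum of interval lengths and average loads) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the names `bucket_index`, `tmars_intervals` and `t_bar` are hypothetical, the number of intervals is derived here from the largest LP completion time rather than from the schedule-length bound, and LP completion times of at most 1 are placed in the first interval.

```python
import math


def bucket_index(c):
    """Index n with 2**(n-2) < c <= 2**(n-1); values c <= 1 map to n = 1."""
    return math.ceil(math.log2(c)) + 1 if c > 1 else 1


def tmars_intervals(c_lp, p, num_machines):
    """Bucket tasks by LP completion time and compute shifted end points.

    c_lp: task -> LP completion time, p: task -> processing time.
    Returns (buckets, t_bar) where t_bar[n] = 1 + sum_{k<=n} (2**(k-1) + a_k).
    """
    last = max(bucket_index(c) for c in c_lp.values())
    buckets = {n: [] for n in range(1, last + 1)}
    for task, c in sorted(c_lp.items(), key=lambda kv: kv[1]):
        buckets[bucket_index(c)].append(task)
    t_bar = [1.0]                                  # t_bar[0] = 1
    for n in range(1, last + 1):
        a_n = sum(p[u] for u in buckets[n]) / num_machines   # average load
        t_bar.append(t_bar[-1] + 2 ** (n - 1) + a_n)
    return buckets, t_bar
```

Tasks in bucket `n` would then be list-scheduled inside the window from `t_bar[n-1]` to `t_bar[n]`, whose length is the interval length plus the average load of that bucket.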

Lemma 4. Let $B$ be a subset of the tasks in $T$. We use the Graham algorithm to schedule the tasks in $B$. Let $\Phi$ denote the set of tasks that form the longest chain of precedence-constrained tasks ending with the task that completes last in the schedule. Let $C_{max}$ denote the maximum time length of the resulting schedule. We have:

$$C_{max} \le \frac{1}{|M|}\sum_{u \in (B - \Phi)} p_u + \sum_{u \in \Phi} p_u.$$

Lemma 4 can be proved using the pigeonhole principle. From the above description and analysis of Algorithm 3, we note that: (1) regularly dividing the time line into different intervals keeps the scheduling of tasks in different intervals relatively independent; (2) scheduling the tasks in each interval with the Graham algorithm guarantees performance (a theoretical tight upper bound). Next, by applying both Property 2 and Lemma 4, we derive Theorem 5.

Theorem 5. T-MarS is a 7-approximation algorithm for minimizing total weighted task completion times.

Proof. We first show that Algorithm 3 gives a feasible scheduling result. Eq. (12) ensures that the precedence constraints are enforced, since for each task $i \in B_n$, each of its predecessors is assigned to $B_k$ for some $k \in \{1, 2, \ldots, n\}$. We also need to show that the schedule respects the release time constraints. For any $i \in B_n$ and $n \in \{1, 2, \ldots, N\}$, we have:

$$r_i \le C^{LP}_i \le t_n.$$

According to the definition of $\bar{t}_n$ above, we can write

$$\bar{t}_{n-1} = 1 + \sum_{k=1}^{n-1}(t_k + a_k) \ge 1 + \sum_{k=1}^{n-1} t_k = t_n.$$

Hence $r_i \le \bar{t}_{n-1}$, which means that the scheduling rule in each interval $(\bar{t}_{n-1}, \bar{t}_n]$ reduces to the case without release times. According to Lemma 4, the length of the schedule constructed for $B_n$ can be bounded by the average load on each machine plus the maximum length of any precedence chain. The average load in any interval is exactly $a_n$. Let $P_i$ denote the maximum length of a chain that ends with task $i$. $P_i$ is at most $C^{LP}_i$, that is, $P_i \le C^{LP}_i$. Therefore,

$$C^{H'}_i \le \bar{t}_{n-1} + a_n + P_i = 1 + \sum_{k=1}^{n-1}(t_k + a_k) + a_n + P_i.$$

Because $1 + \sum_{k=1}^{n-1}(t_k + a_k) + a_n + P_i = 1 + \sum_{k=1}^{n-1} t_k + \sum_{k=1}^{n} a_k + P_i$ and $1 + \sum_{k=1}^{n-1} t_k = t_n$, we have

$$C^{H'}_i \le t_n + \sum_{k=1}^{n} a_k + P_i \le t_n + \sum_{k=1}^{n} a_k + C^{LP}_i.$$

According to the definition of $a_n$, we can rewrite $\sum_{k=1}^{n} a_k$ as follows:

$$\sum_{k=1}^{n} a_k = \sum_{k=1}^{n}\Bigl(\frac{1}{|M|}\sum_{u \in B_k} p_u\Bigr) = \frac{1}{|M|}\sum_{u \in (B_1 \cup B_2 \cup \cdots \cup B_n)} p_u.$$

Let $C^{LP}_{i(n)}$ denote the largest LP completion time among tasks $i(n) \in B_1 \cup B_2 \cup \cdots \cup B_n$. Applying Property 2, we have

$$\frac{1}{|M|}\sum_{u \in (B_1 \cup B_2 \cup \cdots \cup B_n)} p_u \le 2C^{LP}_{i(n)} \le 2t_n.$$

Since $t_{n-1} < C^{LP}_i$ and $2t_{n-1} = t_n$, we have $t_n < 2C^{LP}_i$. Thus, $C^{H'}_i \le t_n + 2t_n + C^{LP}_i < 2C^{LP}_i + 4C^{LP}_i + C^{LP}_i = 7C^{LP}_i$. This result holds for every task $i \in T$. Finally, we have

$$\sum_{i \in T} w_i C^{H'}_i \le 7\sum_{i \in T} w_i C^{LP}_i.$$

Because $\sum_{i \in T} w_i C^{LP}_i$ is a lower bound of the optimal, T-MarS is a 7-approximation algorithm for minimizing total weighted task completion times. □

5. Simulation

5.1. Simulation setup

5.1.1. Background

We use synthetic workloads to study the performance of our algorithm, following a simulation setup similar to [7,8]. We generate jobs as follows: (1) Job release times are randomly generated following a Bernoulli distribution with probability 1/2. (2) The number of tasks in a job is generated (a) uniformly or (b) randomly. For a job with uniformly generated tasks, the number of map tasks is set to 30 and the number of reduce tasks is set to 10. For a job with randomly generated tasks, the number of map tasks follows a Poisson distribution with a mean of 30 and the number of reduce tasks is uniformly chosen between 1 and the number of map tasks. (3) Task processing times are generated (a) uniformly or (b) randomly. For tasks with uniform processing time, processing times of map tasks are 10 and processing times of reduce tasks are 15. For tasks with random processing time, processing times of map tasks are normally distributed with a mean of 10 and a standard deviation of 5, and processing times of reduce tasks are normally distributed with a mean of 15 and a standard deviation of 5. (4) Weights of jobs/tasks are generated randomly from a normal distribution with a mean of 30 and a standard deviation of 10. (5) Delays between map tasks and reduce tasks are proportional to the processing time of map tasks. This reflects that a long map task generates more data, which needs a longer time to be transmitted to the reduce tasks. The default number of machines is set to 50.
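A workload generator along these lines might look like the sketch below. It is only an illustration: the function names, the `delay_factor` constant and the clamping of normal draws to a small positive value are assumptions (the text does not state the proportionality constant or how non-positive draws are handled), and release times are left out because the exact Bernoulli arrival process is not specified here.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's method (fine for small means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def generate_job(rng, uniform_tasks=False, uniform_times=False, delay_factor=0.1):
    """Generate one synthetic job; delay_factor is a hypothetical constant."""
    n_map = 30 if uniform_tasks else max(1, poisson(rng, 30))
    n_red = 10 if uniform_tasks else rng.randint(1, n_map)
    if uniform_times:
        map_p = [10.0] * n_map
        red_p = [15.0] * n_red
    else:
        # Assumption: clamp non-positive normal draws to a small positive value.
        map_p = [max(0.1, rng.gauss(10, 5)) for _ in range(n_map)]
        red_p = [max(0.1, rng.gauss(15, 5)) for _ in range(n_red)]
    weight = max(0.1, rng.gauss(30, 10))
    # Delay from each map task to the reduce tasks, proportional to its length.
    delays = [delay_factor * p for p in map_p]
    return {"map_p": map_p, "red_p": red_p, "weight": weight, "delays": delays}
```

Passing a seeded `random.Random` instance keeps the generated job set reproducible across algorithm runs.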

5.1.2. Evaluation metric

The key metric for comparing different algorithms is the total weighted job completion time (TWJCT) of their results. To make comparison under different configurations more illustrative, we would like to compare the TWJCT of different algorithms with the optimal solution. However, computing the optimal solution requires exponential time. Therefore, we use the LP lower bound as a substitute. We define the TWJCT ratio as our evaluation metric:

$$\text{TWJCT ratio} = \frac{\text{TWJCT of Algorithm X result}}{\text{LP lower bound}}.$$

TWJCT ratio indicates how close the schedule result is to the theoretical lower bound. The smaller the TWJCT ratio is, the better the schedule result is. All results in our simulation are measured by TWJCT ratio.

Similarly, to compare performance of different algorithms designed for minimizing total weighted task completion times (TWTCT), we use TWTCT ratio as our evaluation metric:

$$\text{TWTCT ratio} = \frac{\text{TWTCT of Algorithm X result}}{\text{LP lower bound}}.$$
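Both ratios are computed the same way. A minimal sketch, with a hypothetical function name and with the LP lower bound assumed to be supplied by the LP solver:

```python
def weighted_completion_ratio(weights, completion, lp_lower_bound):
    """Total weighted completion time divided by the LP lower bound.

    weights and completion are keyed by job (TWJCT) or by task (TWTCT).
    """
    total = sum(weights[k] * completion[k] for k in weights)
    return total / lp_lower_bound
```

For instance, two jobs with weights 2 and 1 that complete at times 5 and 10, against an LP lower bound of 16, give a ratio of 1.25.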

5.1.3. Comparisons strategies

We compare performance of our algorithms with the following scheduling strategies:

MARES: MARES [8] is an LP-based algorithm considering precedence constraints in MapReduce jobs. In the evaluation of [8], MARES outperforms other algorithms with a factor of 1.5 to the lower bound. To offer a fair comparison, we adopt a workload-based assignment strategy where tasks are evenly allocated in order to balance the total processing time of tasks on every processor. According to [29], when there are no release times or precedence constraints, the LP relaxation constraints in Eq. (4) and the constraints in [8] have the same lower bound if workloads on every processor are the same. Moreover, the workload-based assignment strategy is also widely adopted in practice.

High unit weight first (HUWF): Unit weight (UW) of a job is calculated by dividing the weight of the job by the total processing time of tasks in the job. All tasks are sorted in descending order of the unit weights of the jobs they belong to. We also maintain an available task list where all tasks in the list do not have any unscheduled precedent task. In each iteration, we choose the task with the highest unit weight and assign it to a machine where it can start as early as possible. Then we check whether there is any unscheduled task whose precedent tasks are all scheduled and put these tasks into the available task list. We iterate until all tasks are scheduled.

Fig. 5. Stop threshold vs. TWJCT ratio. (x-axis: Stop Threshold; curve: MarS.)

Fig. 6. Stop threshold vs. iteration number. (x-axis: Stop Threshold.)

High job weight first (HJWF): This algorithm works similar to HUWF. The difference is that tasks are sorted according to weight of the jobs they belong to.
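The two job-priority baselines differ only in the sort key, which the following sketch makes explicit. It is a simplified illustration with hypothetical names and data layout: each job is a dictionary with `name`, `weight`, `release`, `maps` and `reduces` (lists of processing times), map-to-reduce delays are ignored, and a reduce task is assumed to start no earlier than the finish of all map tasks of its job.

```python
def unit_weight(job):
    """HUWF key: job weight divided by total processing time of its tasks."""
    return job["weight"] / (sum(job["maps"]) + sum(job["reduces"]))


def job_weight(job):
    """HJWF key: the job weight itself."""
    return job["weight"]


def schedule_by_job_priority(jobs, num_machines, priority):
    """Greedy list scheduling by job priority; returns job completion times."""
    idle = [0.0] * num_machines
    order = sorted(jobs, key=priority, reverse=True)
    maps_left = {j["name"]: list(j["maps"]) for j in jobs}
    reds_left = {j["name"]: list(j["reduces"]) for j in jobs}
    map_finish = {j["name"]: j["release"] for j in jobs}
    completion = {}
    while any(maps_left[j["name"]] or reds_left[j["name"]] for j in jobs):
        # Highest-priority job that still has an available task.
        job = next(j for j in order
                   if maps_left[j["name"]] or reds_left[j["name"]])
        name = job["name"]
        # The machine that frees up first gives the earliest possible start.
        m = min(range(num_machines), key=idle.__getitem__)
        if maps_left[name]:
            p = maps_left[name].pop(0)
            start = max(job["release"], idle[m])
            map_finish[name] = max(map_finish[name], start + p)
        else:
            p = reds_left[name].pop(0)
            start = max(map_finish[name], idle[m])
        completion[name] = max(completion.get(name, 0.0), start + p)
        idle[m] = start + p
    return completion
```

Calling the function with `priority=unit_weight` gives the HUWF ordering and with `priority=job_weight` the HJWF ordering, so the two strategies can be compared on the same job set.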

In the fine-grained task-level scenario, we add the following three scheduling strategies to compare their performance with T-MarS:

H-MARES: A simple heuristic implementation of MARES which we call H-MARES schedules tasks in the order of LP completion time without waiting for it to become available. The only reason for a task to wait is if some of its predecessors have not been completed. In H-MARES, it is assumed that tasks are assigned to machines in advance. We use H-MARES to evaluate effect of release time and precedence constraints.

Shortest task first (STF): When a machine is idle, we consider the available tasks set where all tasks have arrived and their predecessors have been finished, then we assign the task whose processing time is the shortest to the idle machine first. This strategy appears to be a reasonable greedy algorithm minimizing the completion time, so we take it into account.

High task weight first (HTWF): Similar to HJWF algorithm, the difference is that tasks are sorted according to their own weights.

In MarS, we need to choose a stop threshold e for COGE. We run MarS with 100 jobs and change value of stop thresholds (see Figs. 5 and 6). We see that when e changes from 0.1 to 0.5, there is a small performance degradation for MarS while iteration number decreases rapidly. Thus, we choose e = 0.3. MARES also needs a stop threshold in solving its LP-relaxation problem. To offer a fair comparison, we choose e = 0.3 for MARES instead of 0.5 in the original paper [8].

In our simulation, we assume that the task processing time and delays between a map task and a reduce task are known to the scheduler. There are many studies on accurate processing time prediction [32,36], and a trace study [6] shows that the majority of map tasks and reduce tasks are highly recurrent, making prediction feasible. We leave this to future work.

Fig. 7. Randomized scenario. (x-axis: Job Number; curves: MarS, HUWF, MARES, HJWF.)

Fig. 8. Uniform task number. (x-axis: Job Number; curves: MarS, HUWF, MARES, HJWF.)

5.2. Simulation result

We first discuss the most randomized scenario where all parameters are generated randomly. Based on this scenario, we evaluate the impact of different job parameters when our optimization objective is to minimize total weighted job completion times.

Performances in the most randomized scenario. Fig. 7 shows results where all job parameters are in the random category. We see that MarS constantly offers efficient solutions when the job number changes. In theory, we proved that MarS represents a 3-approximation of the theoretical lower bound, meaning that TWJCT ratios of MarS are at most 3. In practice, MarS represents an increase of less than 0.4 in terms of TWJCT ratio, compared with the theoretical lower bound. More specifically, starting at 1.32, MarS increases to 1.39 when the job number is 20 and stays at about 1.38 with a variance less than 0.02 when the job number further grows. This is because LP relaxation always produces an optimized task schedule order. MARES shows stable performance after the job number reaches 20. However, we see a constant performance difference between MARES and MarS. We consider it the improvement from joint scheduling.

We see that MarS outperforms other algorithms by over 40% in most cases. The only exception happens when the job number is 10, where MarS is 1.32 while MARES is 1.58, HUWF is 1.77 and HJWF is 1.86. After the job number rises to 20, MARES increases to 1.96, and HUWF and HJWF jump to 2.17 and 2.25 respectively. This is because when there are fewer jobs, machines are not heavily loaded. Thus, different schedule algorithms can achieve similar performance. However, when the job number increases and machines are fully utilized, algorithm results differ from each other. We also notice that MARES outperforms HUWF and HJWF in all cases. This is because workload-based allocation performs well and MARES benefits from the optimized task order generated by LP relaxation conditions.

Impact of task number in job. Next, we examine the effect of task number in a job. We generate jobs with uniform task number category while other job parameters are in random category. The results are shown in Fig. 8. Compared with results in Fig. 7, we see that all algorithms gain better performance. MarS stays at about 1.15 which is 0.17 lower in terms of TWJCT ratio. MARES gradually increases from 1.48 to 1.88. HUWF and HJWF still have larger performance variance but TWJCT ratio of both algorithms decreases by 0.2 on average. It is also shown that there is only a tiny performance difference between HUWF and HJWF.

Fig. 9. Uniform task processing time. (x-axis: Job Number; curves: MarS, HUWF, MARES, HJWF.)

Fig. 10. Impact of machine number. (x-axis: Job Number; curves: MarS, HUWF, MARES, HJWF.)

Impact of task processing time. We generate jobs with uniform task processing time category while other job parameters are in random category. MarS is very close to the theoretical lower bound. Its maximum distance to the lower bound is 0.08 when job number is 10. When job number rises, this distance decreases to less than 0.02. The gap between MarS and MARES is reduced to 0.35 on average.

It is worth noticing that comparing results in Figs. 8 and 9, all algorithms gain better performance in uniform task processing time category. This result indicates that it is more effective to have uniform task processing time than to have same task number in all jobs. It also shows the importance of solving skewed task processing time problem in a parallel computing framework.

Impact of machine number. We change the machine number to 100 and examine all algorithms in the most randomized scenario (see Fig. 10). We see that MarS performs extremely well. Starting at 1.09 when the job number is 10, its TWJCT ratio gradually declines to 1.01, which can be considered as optimal. Other algorithms also gain better performance than in the same scenario when the machine number is 50 (see Fig. 7). Different from MarS, TWJCT ratios of other algorithms increase with job number. When the job number reaches 100, MarS outperforms MARES, HUWF and HJWF by 0.42, 0.75 and 0.79 respectively.

Fig. 11. Increasing UW category. (x-axis: Job Number.)

Fig. 12. Decreasing UW category. (x-axis: Job Number.)

Impact of job weight. We investigate impact of job weight by generating jobs with different unit weight distribution. We first generate jobs with increasing UW category (see Fig. 11). We see that MarS and MARES do not have obvious change while both HUWF and HJWF show a 10% performance degradation. This is because in increasing UW category, later-released jobs have higher unit weight than jobs released earlier. Intuitively, tasks in later jobs should preempt the processing of tasks in earlier jobs. In order to gain better TWJCT ratio, sequence of tasks must be planned more carefully. Otherwise, a performance degradation is introduced.

Next, we generate jobs with decreasing UW category (see Fig. 12). Performances of MarS and MARES are similar in Figs. 7 and 12. However, HUWF and HJWF show opposite trends. HUWF outperforms MARES while HJWF suffers a performance degradation. Because the schedule order of tasks in HUWF is the same as the release time order of their jobs, HUWF does not need to schedule tasks in later-released jobs to preempt tasks in earlier-released jobs. HJWF needs to take care of this because a later-released job may have higher weight due to the total processing time of its tasks.

Impact of prediction error. In order to investigate the impact of prediction error, we inject errors into the processing times of tasks. All algorithms schedule jobs with error-injected information while we calculate the LP lower bound based on no-error information. The result is shown in Fig. 14. The x-axis is the maximum prediction error relative to the no-error processing time. We see that no algorithm shows a large performance degradation. TWJCT ratios of the four algorithms increase slightly. The maximum performance degradation is 0.09 in terms of TWJCT ratio when the maximum prediction error is 100%.

Next, in the fine-grained task-level MapReduce scheduling scenario, which means the optimization objective is to minimize total weighted task completion times, we further compare our T-MarS scheduler with other algorithms.

Varying total task number. To evaluate the impact of total task number on performance change under different algorithms, we vary the total task number from 100 to 1000 and we generate tasks randomly each time. The results are shown in Fig. 15. We note that when total task number is 100, TWTCT ratios of T-MarS, H-MARES, STF and HTWF are 1.03,1.19,1.31 and 1.42 respectively, T-MarS is 0.16 less than H-MARES in terms of TWTCT ratio. However, with the growth of total task number, T-MarS is at most 1.21, while H-MARES increases to 1.6 when task number is 1000, STF and HTWF are both more than 2.2. We observe that T-MarS is closer to the theoretical lower bound than other algorithms practically, because it has strict performance guarantee constraints.

Relative waiting time. We define the waiting time of a task to be the time interval between the submission and completion time. And the relative waiting time is the ratio of the task waiting time to the longest waiting time. As shown in Fig. 16, particularly, we note that T-MarS finishes 71% of all tasks at median relative waiting time, whereas H-MARES, STF and HTWF only finish 62%, 58% and 51% respectively. The results also explain the reason that T-MarS outperforms other classical algorithms from a different perspective.

5.3. Discussion: semi-online MarS algorithm

In our above assumptions, the release time (arrival time) of all the jobs is known in advance. This is clearly not true in practice, where jobs arrive one at a time and the arrival times cannot be known in advance in running MapReduce platforms. In the single processor case there are some competitive algorithms for scheduling jobs online. These algorithms use the fact that scheduling the job with the highest ratio of weight to processing time minimizes the weighted completion time when all jobs are available at time zero. This is not true for our problem. However, to turn our algorithm into an online version, we can use the following approach: during a batch interval, when a new job arrives into the system, we collect it together with all the jobs currently in the system. When the batch interval is finished, we have the information of all the arrived jobs and run the MarS scheduler to give a solution. In the next batch interval, the schedule is produced in the same manner. We refer to this approach as the Semi-Online MarS Algorithm (SO-MarS). Since we have designed a fast heuristic algorithm and run the LP model periodically, the overhead is small as long as we set an appropriate batch interval. To demonstrate that this approach works well and is robust to varying job characteristics, we compare our algorithm with other strategies using the same simulation setup as in Section 5.1. The batch interval time is set from 10 to 100 s. As shown in Fig. 17, when the batch interval is 50 s, the average TWJCT ratios of SO-MarS, MARES, HUWF and HJWF are 1.09, 1.35, 1.63 and 1.75 respectively. When the batch interval increases to 100 s, the average TWJCT ratio of SO-MarS is still less than 1.19, while that of other strategies is over 1.53. According to the results, we conclude that our Semi-Online MarS algorithm still outperforms other classical algorithms.
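The batching step of this semi-online approach can be sketched as follows. The function name and the input layout are hypothetical, and the sketch is a simplification: it only groups arrivals by batch interval, whereas the approach described above also carries jobs still in the system into the next scheduling round.

```python
def batch_arrivals(arrivals, batch_interval):
    """Group (job, arrival_time) pairs by the batch interval they arrive in.

    Returns a list of job lists, one per non-empty batch, in time order.
    Each batch would be handed to the offline scheduler when its interval ends.
    """
    batches = {}
    for job, t in arrivals:
        idx = int(t // batch_interval)
        batches.setdefault(idx, []).append(job)
    return [batches[k] for k in sorted(batches)]
```

With a 50 s batch interval, jobs arriving at 3 s, 12 s and 49 s fall into the first batch and a job arriving at 50 s falls into the second.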

6. Implementation and results of experiment

6.1. Implementation

We implement an MSJO framework in Hadoop-1.2.0 and run the implementation on Amazon EC2. The implementation framework is described in Fig. 13. To run in Hadoop, MSJO needs to cooperate with two components of Hadoop: JobTracker and TaskTracker. JobTracker manages all jobs in a Hadoop cluster and, as jobs are split into tasks, TaskTracker is used to manage tasks on every machine.

Fig. 13. Processing of a job in MarS implement with Hadoop. (Diagram components: Job Set Generator; MSJO Scheduler with Processing Time Predictor and algorithm modules MarS, MARES, HJWF, HUWF, T-MarS, H-MARES, HTWF, STF; Hadoop JobTracker and TaskTracker. Steps: (1) Submit job, (2) Assign task, (3) Event: Task is done, (4) Event: Job is done.)

We register MSJO to JobTracker so that JobTracker can call MSJO to make schedule decisions. MSJO makes schedule decisions according to different algorithm modules. Currently, we have implemented four job-level algorithm modules (MarS, MARES, HUWF, HJWF) and four task-level algorithm modules (T-MarS, H-MARES, STF, HTWF) in our experiments. Other algorithm modules can be added to our implementation. When a job is submitted to Hadoop, JobTracker notifies MSJO that a job is added. MSJO puts the job into a queue. The MSJO scheduler is event driven from JobTracker. When Hadoop is running, JobTracker keeps notifying MSJO of TaskTracker status. If a machine is idle, MSJO assigns a task to the TaskTracker of this machine. After a task is finished, the TaskTracker tells JobTracker, which further notifies MSJO, and MSJO updates the job information. Accordingly, if all tasks in a job finish, MSJO removes the job and JobTracker sends a job-completion event to the user application.

Here we also have a Processing Time Predictor module. This module can be based on a prediction algorithm or history recording of the completion time of past jobs. We leave such a prediction module to our future work. In this experiment, we run jobs one by one with default scheduler of Hadoop and collect data to train our predictor.

In current implementation, all algorithms are offline algorithms. They need full information about the job set before scheduling. To fulfill this requirement, we develop a job set generator. In each experiment, job set generator submits a set of jobs to Hadoop at the beginning of the experiment. These jobs carry release time and weight information with them. After MSJO collects all information for the job set, MSJO calls an algorithm module to make a schedule. After that, MSJO schedules jobs accordingly.

6.2. Experiment setup

We evaluate the algorithms with experiments on a 16-node cluster. This cluster is built on Amazon EC2. We choose virtual machines of type m1.small which have a 1-ECU cpu (1 ECU roughly equals to 1.0 GHz), 1.7 GB memory and 160 GB disk. According to our measurement, the inter-node network bandwidth is 400 Mbps.

We employ Wordcount as the MapReduce program in our experiments. Wordcount aims to count the frequency of words appearing in a data set. It is a benchmark MapReduce job and serves as a basic component of many Internet applications (e.g. document clustering, searching, etc.). In addition, many MapReduce jobs have aggregation statistics close to Wordcount [9]. We use a document package from Wikipedia as input data of jobs. This package contains all English documents in Wikipedia since 30th January 2010 with an uncompressed size of 43.7 GB. In this package, there are 27 individual files, of which the sizes range from 149 MB to 9.98 GB. For every file, we create a MapReduce job to process it. The number of map tasks is determined by input data size. One map task is created for every 64 MB of input data. We set the number of reduce tasks to half of the number of map tasks. The release time and job weights are generated in the same way as in the simulation.

Fig. 14. Impact of prediction error.

Fig. 15. Impact of total task number.

We build two job sets: (1) Job set 1. It contains 10 jobs where the input data size of every job is less than 1 GB. We use this job set to evaluate the performance of our algorithms when jobs are small. (2) Job set 2. This job set contains all 27 jobs.

6.3. Experiment result

Performance of different algorithms. For the job-level scenario, the result is shown in Fig. 18. In job set 1, we see that MarS outperforms the other algorithms. MarS increases 0.416 to the lower bound while MARES, HUWF and HJWF increase 0.539, 0.81 and 0.672 respectively. In job set 2, we see that MarS still outperforms rest of the algorithms. Compared with results in job set 1, we see TWJCT ratios of all algorithms increase. The trend is also reflected in simulation results. We notice that HJWF suffers more performance degradation than other algorithms. The reason may be that in job set 2, these jobs process data as large as 9.98 GB. Most of them are bigger than the jobs in job set 1. HJWF scheduled big jobs first because their weights are big. However, these jobs have big weights but their unit weights are small because they take a long time to process these data. As a result, small jobs with big unit weights are delayed. By considering the relation between weight and processing time, MarS, MARES and HUWF do not suffer from this mixture of different size jobs.

Furthermore, for the task-level scenario, as shown in Fig. 19, we observe that T-MarS outperforms the other algorithms in both job sets 1 and 2. In particular, in job set 1, the difference in TWTCT ratio between T-MarS and H-MARES is small, and the performance of STF is close to that of HTWF. In job set 2, T-MarS still outperforms the rest of the algorithms. Compared with the results in job set 1, all algorithms suffer performance degradation. However, the TWTCT ratio of T-MarS increases by only 3.9%, which is less than the increase of the other algorithms. Moreover, the performance gap between T-MarS and each of H-MARES, STF and HTWF is larger under job set 2 than under job set 1.

Fig. 16. Relative waiting time. (x-axis: relative wait time; curves: T-MarS, STF, H-MARES, HTWF.)

Fig. 17. Semi-Online MarS. (x-axis: batch interval (s); curves: SO-MarS, HUWF, MARES, HJWF.)

Fig. 18. TWJCT ratio in job level. (Bars: MarS, MARES, HUWF, HJWF; groups: Job Set 1, Job Set 2.)

Fig. 19. TWTCT ratio in task level. (Groups: Job Set 1, Job Set 2.)

Fig. 20. Task processing time in experiment. (x-axis: task processing time (s); curves: map tasks in training data, reduce tasks in training data, map tasks in MarS result, reduce tasks in MarS result.)

Fig. 21. Local server cluster experiment. (Bars: MarS, MARES, HUWF, HJWF; groups: Job Set 1, Job Set 2.)

Processing times of map tasks and reduce tasks. We now show the processing times of map tasks and reduce tasks. The training data and the MarS results are shown in Fig. 20. Both contain 950 tasks. In the training data, there is a clear difference in processing time between reduce tasks and map tasks: the processing times of map tasks stay around 80 s, while most reduce tasks take over 150 s. We also see gaps between the training data and the MarS results. The main reason is that the training data is produced by running the jobs in turn, so the cluster is not fully utilized, whereas MarS schedules multiple jobs simultaneously to fully utilize the cluster. Intensive utilization of the cluster introduces costs from contention for resources such as disk I/O and network bandwidth.

6.4. Real server cluster experiment

To rule out the possible impact of the virtual environment on the results, in addition to the above experiments on Amazon EC2, in this section we repeat the experiment in a local physical server cluster and perform evaluations to verify the advantage and reliability of our algorithm. The implementation is the same as in Section 6.1, and we evaluate the algorithms on a 12-server cluster. All the servers are connected by a 1 Gbps switch. Every server has 2 CPU cores, 80 GB RAM and a 1 TB disk. We use the same workload and job sets as in Section 6.2. As shown in Fig. 21, in job set 1, the TWJCT ratios of MarS, MARES, HUWF and HJWF are 1.36, 1.45, 1.68 and 1.59, respectively. In job set 2, MarS still outperforms the rest of the algorithms. In particular, compared to MARES, the gain percentage of the TWJCT ratio of MarS is 6.7%. This experiment indicates that our MarS scheduler still works well when the MapReduce framework runs on a real server cluster.
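
For reference, the job set 1 ratios reported above can be turned into relative reductions with simple arithmetic. The sketch assumes that the gain of MarS over a baseline is measured as (baseline ratio - MarS ratio) / baseline ratio. The text does not spell out how the gain percentage is defined, so this definition and the helper name `relative_gain` are assumptions. The job set 2 ratios are not listed in the text, so the 6.7% figure is not recomputed here.

```python
# TWJCT ratios for job set 1 on the local cluster, as reported in the text.
ratios = {"MarS": 1.36, "MARES": 1.45, "HUWF": 1.68, "HJWF": 1.59}


def relative_gain(baseline: float, ours: float) -> float:
    """Relative reduction of `ours` with respect to `baseline` (assumed definition)."""
    return (baseline - ours) / baseline


# Gain of MarS over each of the other schedulers.
gains = {
    name: relative_gain(ratio, ratios["MarS"])
    for name, ratio in ratios.items()
    if name != "MarS"
}
for name, gain in gains.items():
    print(f"{name}: {gain:.1%}")
```

Under this assumed definition, the job set 1 ratio of MarS is roughly 6%, 19% and 14% lower than those of MARES, HUWF and HJWF, respectively.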

7. Conclusion

In this paper, we studied MapReduce job scheduling with consideration of server assignment. We showed that without such joint consideration, there can be great performance loss. We formulated a MapReduce server-job organizer problem (MSJO). This problem is NP-complete, and we developed a 3-approximation algorithm, MarS. Moreover, we proposed a novel fine-grained practical algorithm for the general MapReduce task scheduling problem. We evaluated our algorithms through extensive simulations. The results show that MarS can outperform state-of-the-art strategies by as much as 40% in terms of total weighted job completion time. We also implemented a prototype of MarS in Hadoop and tested it with experiments on Amazon EC2. The experimental results confirm the advantage of our algorithm.

Acknowledgments

This work is supported by the National Basic Research Program of China under Grant No. 2012CB315806, the National Natural Science Foundation of China under Grant Nos. 61432009, 61170211, 61161140454, Specialized Research Fund for the Doctoral Program of Higher Education under Grant No. 20130002110058, and Joint Research Fund of MOE-China Mobile under Grant No. MCM20123041.

References

Amazon EC2, 2012. http://aws.amazon.com/cn/ec2.

G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, E. Harris, Reining in the outliers in map-reduce clusters using Mantri, in: Proc. of USENIX OSDI, 2010.

Apache Hadoop, 2012. http://hadoop.apache.org.

Apache Mesos, 2015. http://mesos.apache.org.

Apache YARN, 2015. https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.

E. Bortnikov, A. Frank, E. Hillel, S. Rao, Predicting execution bottlenecks in map-reduce clusters, in: Proc. of USENIX HotCloud, 2012.

H. Chang, M. Kodialam, R.R. Kompella, T.V. Lakshman, M. Lee, S. Mukherjee, Scheduling in MapReduce-like systems for fast completion time, in: Proc. of IEEE INFOCOM, 2011.

F. Chen, M. Kodialam, T.V. Lakshman, Joint scheduling of processing and shuffle phases in MapReduce systems, in: Proc. of IEEE INFOCOM, 2012.

P. Costa, A. Donnelly, A. Rowstron, G. O'Shea, Camdoop: Exploiting in-network aggregation for big data applications, in: Proc. of USENIX NSDI, 2012.

M.E. Crovella, M. Harchol-Balter, C.D. Murta, Task assignment in a distributed system: Improving performance by unbalancing load, in: Proc. of ACM SIGMETRICS, 1998.

J. Dean, S. Ghemawat, MapReduce: Simplified data processing on large clusters, in: Proc. of USENIX OSDI, 2004.

C. Delimitrou, C. Kozyrakis, Paragon: QoS-aware scheduling for heterogeneous datacenters, in: Proc. of ACM ASPLOS, 2013.

C. Delimitrou, C. Kozyrakis, Quasar: Resource-efficient and QoS-aware cluster management, in: Proc. of ACM ASPLOS, 2014.

D. DeWitt, M. Stonebraker, MapReduce: A major step backwards, 2008. http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html.

M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., 1990.

R.L. Graham, Bounds on multiprocessing timing anomalies, SIAM J. Appl. Math. 17 (2) (1969) 416-429.

M. Harchol-Balter, Task assignment with unknown duration, J. ACM 49 (2) (2002) 260-288.

C. He, Y. Lu, D. Swanson, Matchmaking: A new MapReduce scheduling technique, in: Proc. of IEEE CloudCom, 2011.

C. He, Y. Lu, D. Swanson, Real-time scheduling in MapReduce clusters, in: Proc. of IEEE HPCC and EUC, 2013.

H. Herodotou, F. Dong, S. Babu, No one (cluster) size fits all: Automatic cluster sizing for data-intensive analytics, in: Proc. of ACM SoCC, 2011.

M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, A. Goldberg, Quincy: Fair scheduling for distributed computing clusters, in: Proc. of ACM SOSP, 2009.

B.W. Lampson, A scheduling philosophy for multiprocessing systems, Commun. ACM 11 (5) (1968) 347-360.

S. Li, S. Hu, T. Abdelzaher, The packing server for real-time scheduling of MapReduce workflows, in: Proc. of IEEE RTAS, 2015.

S. Li, S. Hu, S. Wang, L. Su, T. Abdelzaher, I. Gupta, R. Pace, WOHA: Deadline-aware Map-Reduce workflow scheduling framework over hadoop clusters, in: Proc. of IEEE ICDCS, 2014.

B. Palanisamy, A. Singh, L. Liu, B. Jain, Purlieus: Locality-aware resource allocation for mapreduce in a cloud, in: Proc. of ACM SC, 2011.

M. Queyranne, Structure of a simple scheduling polyhedron, Math. Program. 58 (2) (1993) 263-285.

M. Queyranne, A. Schulz, Approximation bounds for a general class of precedence constrained parallel machine scheduling problems, SIAM J. Comput. 35 (5) (2006) 1241-1253.

T. Sandholm, K. Lai, Mapreduce optimization using regulated dynamic prioritization, in: Proc. of ACM SIGMETRICS, 2009.

A.S. Schulz, et al. Polytopes and scheduling, Technical University of Berlin, 1996.

M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, J. Wilkes, Omega: Flexible, scalable schedulers for large compute clusters, in: Proc. of ACM EuroSys, 2013.

J. Tan, S. Meng, X. Meng, L. Zhang, Improving reducetask data locality for sequential mapreduce jobs, in: Proc. of IEEE INFOCOM, 2013.

A. Verma, L. Cherkasova, R.H. Campbell, ARIA: Automatic resource inference and allocation for mapreduce environments, in: Proc. of ACM ICAC, 2011.

B. Wang, J. Jiang, G. Yang, ActCap: Accelerating mapreduce on heterogeneous clusters with capability-aware data placement, in: Proc. of IEEE INFOCOM, 2015.

W. Wang, K. Zhu, L. Ying, J. Tan, L. Zhang, Map task scheduling in mapreduce with data locality: Throughput and heavy-traffic optimality, in: Proc. of IEEE INFOCOM, 2013.

D. Xie, N. Ding, Y.C. Hu, R. Kompella, The only constant is change: Incorporating time-varying network reservations in data centers, in: Proc. of ACM SIGCOMM, 2012.

Y. Yuan, H. Wang, D. Wang, J. Liu, On interference-aware provisioning for cloud-based big data processing, in: Proc. of IEEE/ACM IWQoS, 2013.

M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, I. Stoica, Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling, in: Proc. of ACM EuroSys, 2010.

M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz, I. Stoica, Improving MapReduce performance in heterogeneous environments, in: Proc. of USENIX OSDI, 2008.

W. Zhang, S. Rajasekaran, T. Wood, M. Zhu, MIMP: Deadline and interference aware scheduling of hadoop virtual machines, in: Proc. of IEEE/ACM CCGrid, 2014.

J. Zhang, H. Zhou, R. Chen, X. Fan, Z. Guo, H. Lin, J.Y. Li, W. Lin, J. Zhou, L. Zhou, Optimizing data shuffling in data-parallel computation by understanding user-defined functions, in: Proc. of USENIX NSDI, 2012.

Y. Zheng, N. Shroff, P. Sinha, A new analytical technique for designing provably efficient mapreduce schedulers, in: Proc. of IEEE INFOCOM, 2013.

Xiao Ling received the B.Sc. degree from Beijing University of Posts and Telecommunications. He is now a Ph.D. candidate at the Department of Computer Science and Technology in Tsinghua University. He was a visiting Ph.D. student at The Hong Kong Polytechnic University between 2014 and 2015. His major research interests include cloud computing, distributed systems and big data processing applications. He is a student member of IEEE.

Yi Yuan received the B.Sc. degree and the M.Sc. degree from University of Electronic Science and Technology of China, Chengdu, China. He received the Ph.D. degree from The Hong Kong Polytechnic University, Hong Kong. He is currently a technical engineer in Cloud Computing Department, Tencent Company. His major research interests include cloud computing, distributed system and green building. He is a student member of IEEE.

Dan Wang received the B.Sc. degree from Peking University, Beijing, China, the M.Sc. degree from Case Western Reserve University, Cleveland, Ohio, USA, and the Ph.D. degree from Simon Fraser University, Burnaby, B.C., Canada; all in computer science. He is an Assistant Professor of Department of Computing, The Hong Kong Polytechnic University, Hong Kong. His research interests include wireless sensor networks, Internet routing and cloud computing. He is a senior member of IEEE.

Jiangchuan Liu received the B.Sc. degree from Tsinghua University, Beijing, China, and the Ph.D. degree from The Hong Kong University of Science and Technology, Hong Kong. From 2003 to 2004, he was an Assistant Professor in the Department of Computer Science and Engineering at The Chinese University of Hong Kong. He was a Microsoft Research Fellow, and worked at Microsoft Research Asia (MSRA) in the summers of 2000, 2001, 2002, 2007, and 2011. He is a Full Professor in the School of Computing Science at Simon Fraser University, Canada, and an EMC-Endowed Visiting Chair Professor of Tsinghua University, Beijing, China. His research interests include multimedia communications, peer-to-peer networking, cloud computing, online gaming, social networking, big data networking, and wireless sensor/mesh networking. He is a senior member of IEEE.

Jiahai Yang received the B.Sc. degree from Beijing Technology and Business University, and the M.Sc. degree and Ph.D. degree from Tsinghua University, Beijing, China; all in Computer Science. He is a Professor of the Institute for Network Sciences and Cyberspace, Tsinghua University, Beijing, China. His research interests include network management, network measurement, Internet routing and applications, cloud computing and big data applications. He is a member of IEEE & ACM.