
Proceedings of the Workshops of the EDBT/ICDT 2014 Joint Conference

K. Selçuk Candan
Arizona State University, USA

Sihem Amer-Yahia
CNRS – LIG, France

Nicole Schweikardt
University of Frankfurt, Germany

Vassilis Christophides
University of Crete & FORTH-ICS, Greece

Vincent Leroy
University of Grenoble – CNRS, France

March 28th, 2014


Contents

Message from the Chairs . . . iii

Algorithms for MapReduce and Beyond (BeyondMR) . . . 1

Scheduling MapReduce Jobs on Unrelated Processors . . . 2

Binary Theta-Joins using MapReduce: Efficiency Analysis and Improvements . . . 6

On the design space of MapReduce ROLLUP aggregates . . . 10

Determining the k in k-means with MapReduce . . . 19

Tagged Dataflow: a Formal Model for Iterative Map-Reduce . . . 29

Processing Regular Path Queries on Giraph . . . 37

Graph-Parallel Entity Resolution using LSH & IMM . . . 41

Modular Data Clustering - Algorithm Design beyond MapReduce . . . 50

Bidirectional Transformations (BX) . . . 60

Preface to the Third International Workshop on Bidirectional Transformations . . . 61

Implementing a Bidirectional Model Transformation Language as an Internal DSL in Scala . . . 63

Towards a Framework for Multidirectional Model Transformations . . . 71

Formalizing Semantic Bidirectionalization with Dependent Types . . . 75

BenchmarX . . . 82

Towards a Repository of Bx Examples . . . 87

Intersection Schemas as a Dataspace Integration Technique . . . 92

Bidirectional Transformations in Database Evolution: A Case Study “At Scale” . . . 100

Entangled State Monads . . . 108

Spans of lenses . . . 112

Energy Data Management (EnDM) . . . 119

Pipeline Production Data Model . . . 120

Renewable Energy Data Sources in the Semantic Web with OpenWatt . . . 128

A Generic Ontology for Prosumer-Oriented Smart Grid . . . 134

Computing Electricity Consumption Profiles from Household Smart Meter Data . . . 140

ECAST: A Benchmark Framework for Renewable Energy Forecasting Systems . . . 148

Energy Data Management: Where Are We Headed? (panel) . . . 156

Exploratory Search in Databases and the Web (ExploreDB) . . . 157

Exploratory Search in Databases and the Web . . . 158

Exploring Big Data using Visual Analytics . . . 160

On the Suitability of Skyline Queries for Data Exploration . . . 161

Hippalus: Preference-enriched Faceted Exploration . . . 167

The DisC Diversity Model . . . 173

Exploring RDF/S Evolution using Provenance Queries . . . 176

Skyline Ranking à la IR . . . 182

Multi-Engine Search and Language Translation . . . 188

Querying Graph Structured Data (GraphQ) . . . 191

An Event-Driven Approach for Querying Graph-Structured Data Using Natural Language . . . 192

GraphMCS: Discover the Unknown in Large Data Graphs . . . 200


Graph-driven Exploration of Relational Databases for Efficient Keyword Search . . . 208

Implementing Iterative Algorithms with SPARQL . . . 216

A Map-Reduce algorithm for querying linked data based on query decomposition into stars . . . 224

Performance optimization for querying social network data . . . 232

Frequent Pattern Mining from Dense Graph Streams . . . 240

Linked Web Data Management (LWDM) . . . 248

Quantifying the Connectivity of a Semantic Warehouse . . . 249

Scalable Numerical SPARQL Queries over Relational Databases . . . 257

Similarity Recognition in the Web of Data . . . 263

Mining of Diverse Social Entities from Linked Data . . . 269

TripleGeo: an ETL Tool for Transforming Geospatial Data into RDF Triples . . . 275

Multimodal Social Data Management (MSDM) . . . 279

Social Data and Multimedia Analytics for News and Events Applications . . . 280

Event Identification and Tracking in Social Media Streaming Data . . . 282

Recommendation of Multimedia Objects for Social Network Applications . . . 288

Estimating Completeness in Streaming Graphs . . . 294

Mining Urban Data (MUD) . . . 300

Mining Trajectory Data for Discovering Communities of Moving Objects . . . 301

Mobile Sensing Data for Urban Mobility Analysis: A Case Study in Preprocessing . . . 309

Crowd Density Estimation for Public Transport Vehicles . . . 315

Traffic Incident Detection Using Probabilistic Topic Model . . . 323

Predictive Trip Planning – Smart Routing in Smart Cities . . . 331

Addressing the Sparsity of Location Information on Twitter . . . 339

Efficient Dissemination of Emergency Information using a Social Network . . . 347

Crowdsourcing turning restrictions for OpenStreetMap . . . 355

Big data analytics for smart mobility: a case study . . . 363

Smart Applications for Smart City: a Contribution to Innovation . . . 365

Analysis of Relationships Between Road Traffic Volumes and Weather: Exploring Spatial Variation . . . 367

SiCi Explorer: Situation Monitoring of Cities in Social Media Streaming Data . . . 369

A Cascading Wavelet-Feed Forward Neural Network Approach for Forecasting Traffic Flow . . . 371

Combining a Gauss-Markov model and Gaussian process for traffic prediction in Dublin city center . . . 373

Sensing Urban Soundscapes . . . 375

Privacy and Anonymity in the Information Society (PAIS) . . . 383

A Hybrid Approach for Privacy-preserving Record Linkage . . . 384

Clustering-based Multidimensional Sequence Data Anonymization . . . 385

Efficient Multi-User Indexing for Secure Keyword Search . . . 390

Community Detection in Anonymized Social Networks . . . 396

Secure Multi-Party linear Regression . . . 406

Data Anonymization: The Challenge from Theory to Practice . . . 415

A Privacy Preserving Model for Ownership Indexing in Distributed Storage Systems . . . 416


Message from the Chairs

We are delighted to present to you, on behalf of the entire conference organizing committee and the workshop organizers, the proceedings of the Workshops of the EDBT/ICDT 2014 Joint Conference, held on March 28, 2014, in Athens, Greece.

The International Conference on Extending Database Technology (EDBT) and the International Conference on Database Theory (ICDT) are two prestigious forums for the exchange of the latest research results in data management and the theoretical foundations of database systems. While sharing the overarching goal of presenting cutting-edge results, ideas, techniques, and theoretical advances in databases, the workshops of the EDBT/ICDT joint conference are specifically tasked with focusing on emerging topics that complement the areas covered by the main technical program.

This year, our program includes workshops focusing on eight exciting topics:

• Algorithms for MapReduce and Beyond (BeyondMR) workshop, aiming to explore algorithms and computational models for systems that need large-scale parallelization and systems designed to support efficient parallelization and fault tolerance,

• Bidirectional Transformations (BX) workshop, bringing together researchers and practitioners, established and new, interested in bidirectional transformations from different perspectives,

• Energy Data Management (EnDM) workshop, focusing on conceptual and system architecture issues related to the management of very large-scale data sets specifically in the context of the energy domain,

• Exploratory Search in Databases and the Web (ExploreDB) workshop, aiming to promote novel discovery methods that provide highly expressive discovery capabilities over large amounts of entity-relationship data, yet remain intuitive for end-users,

• Linked Web Data Management (LWDM) workshop, aiming to stimulate discussion of data management issues related to Linked Data and its relationships with other Semantic Web technologies, while also offering a glance at new issues,

• Multimodal Social Data Management (MSDM) workshop, bringing together experts in social network analysis, natural language processing, multimodal data management and integration, scalable data analysis, and machine learning, to discuss how research contributions in different computer science areas can help better explain social data and build new applications,

• Privacy and Anonymity in the Information Society (PAIS) workshop, which provides a platform for researchers and practitioners from computer science and other fields that interact with computer science in the privacy area, such as statistics, healthcare informatics, and law, to discuss and present current research challenges and advances in data privacy and anonymity research, and

• Querying Graph Structured Data (GraphQ) workshop, which aims to encourage discussions about how to efficiently and effectively support graph queries in different application domains and seeks to provide the opportunity for cross-fertilization among teams working on graph-structured data, with a particular focus on querying issues.


Before concluding, we would like to acknowledge those who have contributed to the success of the workshops program. First of all, we would like to thank all workshop organizers, who have put together an exciting program, as well as all authors who submitted their work to the workshops.

We especially thank the authors of the accepted papers and the invited speakers who presented their work in the workshops program. We are grateful to the members of the workshop program committees and the external reviewers, who have helped put together a high-quality workshops program, and we would like to acknowledge the conference organizers and the many student volunteers for their invaluable help at various stages of the process. We would also like to thank the sponsors who have financially supported the workshops and the editors of the CEUR Workshop Proceedings (CEUR-WS.org), who have agreed to host these proceedings.

Sincerely,

K. Selçuk Candan, Workshops Chair

Sihem Amer-Yahia and Nicole Schweikardt, EDBT and ICDT Program Chairs

Vassilis Christophides, General Chair


Algorithms for MapReduce and Beyond (BeyondMR)

Foto N. Afrati (National Technical University of Athens, Greece)

Phokion G. Kolaitis (UC Santa Cruz & IBM Research, USA)

Jeffrey D. Ullman (Stanford University, USA)


Scheduling MapReduce Jobs on Unrelated Processors

D. Fotakis
National Technical University of Athens

I. Milis
Athens University of Economics and Business

E. Zampetakis
National Technical University of Athens

G. Zois
Université Pierre et Marie Curie and Athens University of Economics and Business

ABSTRACT

The MapReduce framework is established as the standard approach for parallel processing of massive amounts of data. In this work, we extend the model of MapReduce scheduling on unrelated processors (Moseley et al., SPAA 2011) and deal with the practically important case of jobs with any number of Map and Reduce tasks. We present a polynomial-time $(32+\epsilon)$-approximation algorithm for minimizing the total weighted completion time in this setting. To the best of our knowledge, this is the most general setting of MapReduce scheduling for which an approximation guarantee is known. Moreover, this is the first time that a constant approximation ratio is obtained for minimizing the total weighted completion time on unrelated processors under a nontrivial class of precedence constraints.

Keywords

MapReduce, Scheduling, Unrelated Processors

1. INTRODUCTION

Scheduling in MapReduce environments has become increasingly important in recent years, as MapReduce has been established as the standard programming model to implement massive parallelism in large data centers [5]. Applications of MapReduce, such as search indexing, web analytics and data mining, involve the concurrent execution of several MapReduce jobs on a system like Google's MapReduce or Apache Hadoop. When a MapReduce job is executed, a number of Map and Reduce tasks are created.

∗This work was supported by the project Handling Uncertainty in Data Intensive Applications, co-financed by the European Union (European Social Fund - ESF) and Greek national funds, through the Operational Program "Education and Lifelong Learning", under the program THALES, and by the project Heracleitus II.

(c) 2014, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2014 Joint Conference (March 28, 2014, Athens, Greece) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

Each Map task operates on a portion of the input elements, translating them into a number of key-value pairs. Next, all key-value pairs are transmitted to the Reduce tasks, so that all pairs with the same key are available together at the same task. The Reduce tasks operate on the key-value pairs, combine the values associated with a key, and generate the final result. In addition to the many practical applications of MapReduce, there has been significant interest in developing appropriate cost models and a computational complexity theory for MapReduce computation (see e.g., [3, 6]), in understanding the basic principles underlying the design of efficient MapReduce algorithms (see e.g., [1, 7]), and in obtaining upper and lower bounds on the performance of MapReduce algorithms for some fundamental computational problems (see e.g., [2] and the references therein).
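The Map, shuffle and Reduce phases described above can be illustrated with a minimal single-machine sketch. It is not taken from the paper: the word-count job and the function names `map_task`, `shuffle`, `reduce_task` and `run_job` are assumptions chosen only for illustration.

```python
from collections import defaultdict


def map_task(chunk):
    """Map: translate a portion of the input into (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Shuffle: bring all pairs with the same key together."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: combine the values associated with one key."""
    return key, sum(values)


def run_job(chunks):
    # Map tasks are independent of each other and could run in parallel.
    mapped = [map_task(chunk) for chunk in chunks]
    # The shuffle needs the output of every Map task, which is the source
    # of the Map-before-Reduce precedence constraint studied in the paper.
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())
```

Here each element of `chunks` plays the role of the input portion of one Map task, and each key of the grouped data plays the role of one Reduce task.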

Motivation and Previous Work. Many important advantages of MapReduce are due to the fact that the Map tasks or the Reduce tasks can be executed in parallel and essentially independently of each other. However, to best exploit the massive parallelism available in typical MapReduce systems, one has to carefully allocate and schedule Map and Reduce tasks to actual processors (or computational resources, in general). This important and delicate task is performed in a centralized manner, by a process running in the master node. A major concern of the scheduler, among others, is to satisfy task dependencies among the tasks of the same MapReduce job: all the Map tasks must finish before the execution of any Reduce task of the same job. During the assignment and scheduling process, a number of different needs must be taken into account, e.g., the transfer of the intermediate data (shuffle), data locality, and data skew, which give rise to the study of new scheduling problems.

Despite the importance and the challenging nature of scheduling in MapReduce environments, and despite the extensive investigation of a large variety of scheduling problems in parallel computing systems (see e.g., [13]), less attention has been paid to MapReduce scheduling problems. In fact, most of the previous work on scheduling in MapReduce systems concerns the experimental evaluation of scheduling heuristics, mostly from the viewpoint of finding good trade-offs between different objectives (see e.g., [14]). From a theoretical viewpoint, only a few results on MapReduce scheduling have appeared so far [11, 4]. These are based on simplified abstractions of MapReduce scheduling, closely related to some variants of the classical Open Shop and Flow Shop scheduling models, that capture issues such as task dependencies, data locality, shuffle, and task assignment, under the key objective of minimizing the total weighted completion time of a set of MapReduce jobs.

In this direction, the theoretical model of Moseley et al. [11] generalizes a variant of the Flow Shop scheduling model, referred to as 2-stage Flexible Flow Shop (FFS), which is known to be strongly NP-hard, even for jobs with a single Map and a single Reduce task and a single Map and a single Reduce processor (see [11]). They consider the cases of both identical and unrelated processors, and the goal is to minimize the total completion time of the jobs. For identical processors, they present a 12-approximation algorithm and an $O(1/\epsilon^2)$-competitive online algorithm, for any $\epsilon \in (0,1)$, under the assumption that the processors used by the online algorithm are $1+\epsilon$ times faster than the processors used by the optimal schedule. Since the identical processors setting fails to capture issues such as data locality and to model communication costs between the Map and the Reduce tasks, Moseley et al. also consider the case of unrelated processors, which provides a more expressive theoretical model of scheduling in MapReduce environments. Nevertheless, they only consider the very restricted (and practically not so interesting) case where each job has a single Map and a single Reduce task, and present a 6-approximation algorithm and an $O(1/\epsilon^5)$-competitive online algorithm, for any $\epsilon \in (0,1)$, under the assumption that the processors of the online algorithm are $1+\epsilon$ times faster.

A similar model of MapReduce scheduling, also with the goal of minimizing the total completion time, was proposed by Chen et al. [4]. In contrast with the model of [11], they assume that tasks are preassigned to processors and, in this restricted setting, they present an LP-based 8-approximation algorithm. Moreover, they deal with the shuffle phase in MapReduce systems and present a 58-approximation algorithm.

Contribution and Results. We adopt the theoretical model of [11] and consider MapReduce scheduling on unrelated processors. However, departing from [11], we deal with the general (and practically interesting) case where each job has any number of Map and Reduce tasks and we succeed in obtaining a polynomial-time constant approximation algorithm for minimizing the total weighted completion time.

More specifically, we consider a set of MapReduce jobs to be executed on a set of unrelated processors. Each job consists of a set of Map tasks, which can be executed only on Map processors, and a set of Reduce tasks, which can be executed only on Reduce processors. Each task has a different processing time for each processor and is associated with a positive weight, representing its importance. All jobs are available at time zero. Map or Reduce tasks can run simultaneously on different processors and, for each job, every Reduce task can start its execution only after the completion of all the job's Map tasks. The goal is to find an assignment of the tasks to processors and schedule them non-preemptively so as to minimize their total weighted completion time.

In terms of classical scheduling, the model we consider in this work is a special case of total weighted completion time minimization on unrelated processors under precedence constraints. Despite its importance and generality, only a few results are known for this problem. These results concern only the case of treelike precedence constraints [8]. More specifically, in [8], Kumar et al. propose a polylogarithmic approximation algorithm for the case where the undirected graph underlying the precedence constraints is a forest (a.k.a. treelike precedences). Their algorithm is based on a reduction from total weighted completion time minimization to an appropriate collection of makespan minimization problems. Based on ideas of [8], we present a $(32+\epsilon)$-approximation algorithm for this problem that operates in two steps. In the first step, our algorithm computes an $(8+\epsilon)$-approximate schedule for the Map tasks (resp. Reduce tasks) by combining a time-indexed LP relaxation of the problem with a well-known approximation algorithm for the makespan minimization problem on unrelated processors [9]. In fact, the makespan minimization algorithm runs on each time interval of the LP solution and computes an assignment of the Map (resp. Reduce) tasks to processors.

In the second step, based on an idea from [11], we merge the two schedules, produced for the Map tasks and the Reduce tasks, into a single schedule that respects the precedence constraints. Using techniques from [11], we show that the merging step increases the approximation ratio by a factor of at most 4.

On the practical side, the theoretical model of [11] for MapReduce scheduling on unrelated processors deals with most of the important aspects of the problem. So, considering jobs with any number of Map and Reduce tasks in this model is particularly important for practical applications, since the basic idea behind MapReduce computation is that each job is split into a large number of Map and Reduce tasks that can be executed in parallel (see e.g., [3, 6, 1, 2]). On the theoretical side, to the best of our knowledge, this is the first time that a constant approximation ratio is obtained for the problem of minimizing the total weighted completion time on unrelated processors under a nontrivial class of precedence constraints.

Notation. We consider a set $\mathcal{J} = \{1, 2, \ldots, n\}$ of $n$ MapReduce jobs to be executed on a set $\mathcal{P} = \{1, 2, \ldots, m\}$ of $m$ unrelated processors. Each job is available at time zero, is associated with a positive weight $w_j$ and consists of a set $\mathcal{M}$ of Map tasks and a set $\mathcal{R}$ of Reduce tasks. Each task is denoted by $T_{k,j} \in \mathcal{M} \cup \mathcal{R}$, where $k \in \mathbb{N}$ is the task index of job $j \in \mathcal{J}$, and is associated with a vector of non-negative processing times $\{p_{i,k,j}\}$, one for each processor $i \in \mathcal{P}_b$, where $b \in \{\mathcal{M}, \mathcal{R}\}$. Let $\mathcal{P}_{\mathcal{M}}$ and $\mathcal{P}_{\mathcal{R}}$ be the sets of Map and Reduce processors, respectively. Each job has at least one Map and one Reduce task. Tasks can run simultaneously on different processors, and every Reduce task can start its execution only after the completion of all Map tasks of the same job.

For a given schedule, we denote by $C_j$ and $C_{k,j}$ the completion times of each job $j \in \mathcal{J}$ and each task $T_{k,j} \in \mathcal{M} \cup \mathcal{R}$, respectively. Note that, due to the precedence constraints between Map and Reduce tasks, $C_j = \max_{T_{k,j} \in \mathcal{R}} \{C_{k,j}\}$. By $C_{\max} = \max_{j \in \mathcal{J}} \{C_j\}$ we denote the makespan of the schedule, i.e., the completion time of the job which finishes last.

Our goal is to schedule non-preemptively all Map tasks on processors of $\mathcal{P}_{\mathcal{M}}$ and all Reduce tasks on processors of $\mathcal{P}_{\mathcal{R}}$, with respect to their precedence constraints, so as to minimize the total weighted completion time of the schedule, i.e., $\sum_{j \in \mathcal{J}} w_j C_j$. We refer to this problem as the MapReduce scheduling problem.
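As a small illustration of the objective, the following sketch evaluates the total weighted completion time of a given schedule. The dictionary layout and the function names are assumptions made for this example, not notation from the paper.

```python
def job_completion_times(reduce_completion):
    """C_j is the largest completion time C_{k,j} over the Reduce tasks of job j.

    reduce_completion maps a job id to the list of completion times of its
    Reduce tasks (a hypothetical input layout).
    """
    return {job: max(times) for job, times in reduce_completion.items()}


def total_weighted_completion_time(weights, reduce_completion):
    """Objective value: the sum over all jobs j of w_j * C_j."""
    completion = job_completion_times(reduce_completion)
    return sum(weights[job] * completion[job] for job in weights)
```

For example, with weights `{1: 2, 2: 1}` and Reduce completion times `{1: [5, 7], 2: [3]}`, job 1 completes at time 7 and job 2 at time 3, so the objective is 2·7 + 1·3 = 17.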

2. A CONSTANT APPROXIMATION ALGORITHM

In this section, we present a $(32+\epsilon)$-approximation algorithm, for $\epsilon \in (0,1)$, executed in the following two steps: (i) it computes an $(8+\epsilon)$-approximate schedule for assigning and scheduling all Map tasks (resp. Reduce tasks) on processors of the set $\mathcal{P}_{\mathcal{M}}$ (resp. $\mathcal{P}_{\mathcal{R}}$) and (ii) it merges the two schedules into one, with respect to the precedence constraints between Map and Reduce tasks of each job, increasing the approximation ratio by a factor of 4.

2.1 Scheduling Map and Reduce Tasks

Next, we propose an algorithm for the problem of minimizing the total weighted completion time of all Map (resp. Reduce) tasks on processors of the set $\mathcal{P}_{\mathcal{M}}$ (resp. $\mathcal{P}_{\mathcal{R}}$). For notational convenience, we use a variable $b \in \{\mathcal{M}, \mathcal{R}\}$ to refer to either the Map or the Reduce set of tasks.

We define $(0, t_{\max}]$, with $t_{\max} = \sum_{T_{k,j} \in b} \max_{i \in \mathcal{P}_b} p_{i,k,j}$, to be the time horizon of potential completion times, where $t_{\max}$ is an upper bound on the makespan of a feasible schedule. We discretize the time horizon into intervals $(1, 1], (1, (1+\delta)], ((1+\delta), (1+\delta)^2], \ldots, ((1+\delta)^{L-1}, (1+\delta)^L]$, where $\delta \in (0,1)$ is a small constant, and $L$ is the smallest integer such that $(1+\delta)^{L-1} \ge t_{\max}$. Let $I_\ell = ((1+\delta)^{\ell-1}, (1+\delta)^\ell]$, for $0 \le \ell \le L$, and $\mathcal{L} = \{0, 1, 2, \ldots, L\}$. Note that the number of intervals is polynomial in the size of the instance and in $1/\delta$. For each processor $i \in \mathcal{P}_b$, task $T_{k,j} \in b$ and $\ell \in \mathcal{L}$, we introduce a variable $y_{i,k,j,\ell}$ that denotes the fraction of task $T_{k,j}$ assigned to processor $i$ in time interval $I_\ell$. Furthermore, for each task $T_{k,j} \in \mathcal{T}$, we introduce a variable $C_{k,j}$ corresponding to its completion time, and a variable $z_{k,j}$ corresponding to its fractional processing time. For every job $j \in \mathcal{J}$, we also introduce a dummy task $D_j$, with zero processing time on every processor, which has to be processed after the completion of every other task $T_{k,j} \in b$.
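The discretization can be sketched in a few lines. This is an illustrative sketch under the definitions above: the input layout (a dictionary from a task to its processing times on the processors of $\mathcal{P}_b$) and the function names are assumptions, and the intervals are generated from the formula for $I_\ell$.

```python
def time_horizon(proc_times):
    """t_max: sum over tasks of the largest processing time of each task.

    proc_times maps a task to the list of its processing times, one per
    processor (a hypothetical input layout).
    """
    return sum(max(times) for times in proc_times.values())


def geometric_intervals(t_max, delta):
    """Return the intervals I_l = ((1+delta)^(l-1), (1+delta)^l] for l = 0..L.

    L is the smallest non-negative integer with (1+delta)^(L-1) >= t_max.
    Each interval is returned as an (open lower end, closed upper end) pair.
    """
    L = 0
    while (1 + delta) ** (L - 1) < t_max:
        L += 1
    return [((1 + delta) ** (l - 1), (1 + delta) ** l) for l in range(L + 1)]
```

Because the interval endpoints grow geometrically, the number of intervals is logarithmic in `t_max` for a fixed `delta`, which matches the remark that the number of intervals is polynomial in the instance size and in $1/\delta$.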

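$LP(b)$ is an interval-indexed linear programming relaxation of our problem.

\begin{align}
LP(b): \quad \text{minimize} \quad & \sum_{j \in \mathcal{J}} w_j C_{D_j} \notag \\
\text{subject to:} \quad & \sum_{i \in \mathcal{P}_b,\, \ell \in \mathcal{L}} y_{i,k,j,\ell} = 1, && \forall\, T_{k,j} \in b \tag{1} \\
& z_{k,j} = \sum_{i \in \mathcal{P}_b} p_{i,k,j} \sum_{\ell \in \mathcal{L}} y_{i,k,j,\ell}, && \forall\, T_{k,j} \in b \tag{2} \\
& C_{D_j} \ge C_{k,j} + z_{k,j}, && \forall\, j \in \mathcal{J},\ T_{k,j} \in b \tag{3} \\
& \sum_{i \in \mathcal{P}_b} \sum_{\ell \in \mathcal{L}} (1+\delta)^{\ell-1} y_{i,k,j,\ell} \le C_{k,j} \le \sum_{i \in \mathcal{P}_b} \sum_{\ell \in \mathcal{L}} (1+\delta)^{\ell} y_{i,k,j,\ell}, && \forall\, T_{k,j} \in b \tag{4} \\
& \sum_{T_{k,j} \in b} p_{i,k,j} \sum_{t \le \ell} y_{i,k,j,t} \le (1+\delta)^{\ell}, && \forall\, i \in \mathcal{P}_b,\ \ell \in \mathcal{L} \tag{5} \\
& p_{i,k,j} > (1+\delta)^{\ell} \Rightarrow y_{i,k,j,\ell} = 0, && \forall\, i \in \mathcal{P}_b,\ T_{k,j} \in b,\ \ell \in \mathcal{L} \tag{6} \\
& y_{i,k,j,\ell} \ge 0, && \forall\, i \in \mathcal{P}_b,\ T_{k,j} \in b,\ \ell \in \mathcal{L} \tag{7}
\end{align}

Our objective is to minimize the sum of weighted completion times of all jobs. Constraint (1) ensures that each task is entirely assigned to processors of the set $\mathcal{P}_b$ and constraint (2) defines its fractional processing time. Constraint (3) ensures that, for each job $j \in \mathcal{J}$, the completion of each task $T_{k,j}$ precedes the completion of task $D_j$. Constraint (4) imposes a lower and an upper bound on the completion time of each task. For each $\ell \in \mathcal{L}$, constraints (5) and (6) are validity constraints which state, respectively, that the total fractional processing time on each processor is at most $(1+\delta)^{\ell}$, and that if it takes more than $(1+\delta)^{\ell}$ time to process a task $T_{k,j}$ on a processor $i \in \mathcal{P}_b$, then $T_{k,j}$ should not be scheduled on $i$.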

Assignment and Scheduling. Let $(\bar{y}_{i,k,j,\ell}, \bar{z}_{k,j}, \bar{C}_{k,j})$ be an optimal (fractional) solution to $LP(b)$. For each $2 \le \ell \le L$, we define the set of tasks $S(\ell) = \{T_{k,j} \in b \mid (1+\delta)^{\ell-2}/2 \le \bar{C}_{k,j} \le (1+\delta)^{\ell-1}/2\}$ that complete their execution within the interval $I_\ell$. By definition, for each task $T_{k,j} \in S(\ell)$, it must hold that $2(1+\delta)\bar{C}_{k,j} \le (1+\delta)^{\ell}$.

We will assign all tasks of each set $S(\ell)$ to processors in $\mathcal{P}_b$ according to the following algorithm.

Algorithm Makespan
1: Compute a basic feasible solution $(\bar{x}_{i,k,j})$ to $LP(b, T^*)$.
2: Assign all tasks having integral values to processors of $\mathcal{P}_b$ as in $(\bar{x}_{i,k,j})$.
3: Let $G = (A \cup \mathcal{P}_b, E)$ be a graph, where $A = \{T_{k,j} \mid 0 < \bar{x}_{i,k,j} < 1\}$ and $E = \{\{T_{k,j}, i\} \mid T_{k,j} \in A,\ i \in \mathcal{P}_b \text{ and } 0 < \bar{x}_{i,k,j} < 1\}$. Compute a perfect matching $M$ on $G$.
4: Assign each $T_{k,j} \in A$ to $i \in \mathcal{P}_b$, as indicated by $M$.
5: for each assigned task $T_{k,j}$ do
6:     Schedule $T_{k,j}$ as early as possible, non-preemptively, with processing time $p_{i,k,j}$ on the processor $i \in \mathcal{P}_b$ that it is assigned to. Let $C_{k,j}$ be the completion time of $T_{k,j}$.

Algorithm Makespan was proposed in a seminal paper by Lenstra et al. [9] and is based on the so-called parametric pruning technique in an LP setting. More specifically, if $T$ is an estimate of the optimal makespan of a schedule of the tasks in $S(\ell)$, then by pruning away all task-processor pairs for which $p_{i,k,j} > T$, we are able to define a set of variables corresponding only to triples of the set $Q_T = \{(i, k, j) \mid p_{i,k,j} \le T\}$; note that this pruning has already been taken into account by constraints (6) of $LP(b)$. Since $T \in \bigcup_{\ell' \le \ell} I_{\ell'}$, using binary search on $\bigcup_{\ell' \le \ell} I_{\ell'}$ with $T$ as the search variable, we can find the minimum value of $T$ such that the following system of linear constraints is feasible.

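\begin{align}
LP(b, T): \quad & \sum_{i : (i,k,j) \in Q_T} x_{i,k,j} = 1 && \forall\, T_{k,j} \in b \tag{8} \\
& \sum_{T_{k,j} : (i,k,j) \in Q_T} x_{i,k,j}\, p_{i,k,j} \le T && \forall\, i \in \mathcal{P}_b \tag{9} \\
& x_{i,k,j} \ge 0 && \forall\, (i,k,j) \in Q_T \notag
\end{align}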

Each variable $x_{i,k,j}$ denotes the fractional processor assignment of each task $T_{k,j} \in S(\ell)$. Now, if $T^*$ is the minimum value for which $LP(b, T)$ is feasible, then $T^*$ is a lower bound on the optimal integral makespan.

As in [9], it can be proved that a basic feasible solution to $LP(b, T)$ has at most $|b| + |\mathcal{P}_b|$ non-zero variables, of which at least $|b| - |\mathcal{P}_b|$ must be set integrally. Then, the number of fractional $x_{i,k,j}$ values must be at most $2|\mathcal{P}_b|$. If we formulate a bipartite graph $G = (A \cup \mathcal{P}_b, E)$, where $A$ is the set of tasks having fractional $x_{i,k,j}$ values and $E = \{\{T_{k,j}, i\} \mid T_{k,j} \in A,\ i \in \mathcal{P}_b \text{ and } 0 < x_{i,k,j} < 1\}$, then, according to the latter property, we deduce that $G$ is a connected graph with at most $2|\mathcal{P}_b|$ vertices and at most $2|\mathcal{P}_b|$ edges. However, this means that $G$ has the special topology of a pseudo-forest (a collection of trees, each with one possible extra edge), which enables the computation of a perfect matching on it. Hence, by executing steps 2-6 of Algorithm Makespan, a non-preemptive schedule of tasks in $S(\ell)$ can be found.
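The rounding in steps 2-4 of Algorithm Makespan can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes the fractional solution is given as a dictionary from a task to its per-processor fractions (a hypothetical layout), and it swaps in a generic augmenting-path bipartite matching instead of a procedure that exploits the pseudo-forest structure. The matching covers every fractional task and gives each processor at most one of them.

```python
def round_assignment(x, tol=1e-9):
    """Round a fractional assignment x: task -> {processor: fraction}.

    Tasks with a single non-zero fraction are assigned directly (step 2).
    The remaining tasks are matched to distinct processors from their own
    support (steps 3-4) using augmenting paths.
    """
    assignment, fractional = {}, {}
    for task, shares in x.items():
        support = [proc for proc, value in shares.items() if value > tol]
        if len(support) == 1:
            assignment[task] = support[0]
        else:
            fractional[task] = support  # edges of the bipartite graph G

    owner = {}  # processor -> fractional task currently matched to it

    def augment(task, seen):
        # Try to match `task`, re-matching earlier tasks if necessary.
        for proc in fractional[task]:
            if proc in seen:
                continue
            seen.add(proc)
            if proc not in owner or augment(owner[proc], seen):
                owner[proc] = task
                return True
        return False

    for task in fractional:
        if not augment(task, set()):
            raise ValueError(f"no matching covers task {task!r}")
    assignment.update({task: proc for proc, task in owner.items()})
    return assignment
```

Since each processor receives at most one fractional task on top of its integrally assigned ones, and pruning guarantees that no single task takes longer than $T$ on its processor, the load of every processor stays within $2T$, which is the intuition behind the factor 2 in the lemma that follows.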

The following lemma provides a tight upper bound on the makespan of the schedule computed by Algorithm Makespan.

Lemma 1. Algorithm Makespan is a 2-approximation algorithm for scheduling the tasks of the set $S(\ell)$ so as to minimize their makespan.

In the next lemma, using filtering [10], we modify the $y_{i,k,j,\ell}$ values of the solution to $LP(b)$ to find an upper bound on the value of $T$.

Lemma 2. Consider a feasible solution to $LP(b, T)$. For each set of tasks $S(\ell)$ that complete their execution within the interval $I_\ell$, it holds that $T^* \le 2(1+\delta)^{\ell}$, for $\delta \in (0,1)$.

As a consequence of filtering in Lemma 2, the completion time of each task in $S(\ell)$ is increased by a factor of 4; this result has already been proven to be tight (see Section 2 in [12]).

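Algorithm TaskScheduling(b)
1: Compute an optimal solution $(\bar{y}_{i,k,j,\ell}, \bar{z}_{k,j}, \bar{C}_{k,j})$ to $LP(b)$.
2: for each $\ell \in \mathcal{L}$ do
3:     compute $S(\ell) = \{T_{k,j} \in b \mid (1+\delta)^{\ell-2}/2 \le \bar{C}_{k,j} \le (1+\delta)^{\ell-1}/2\}$
4: for each $\ell$ such that $S(\ell) \ne \emptyset$ do
5:     Schedule all tasks in $S(\ell)$ by running Algorithm Makespan.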
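Steps 2-3 of Algorithm TaskScheduling(b) amount to bucketing tasks by their fractional completion times. The sketch below assumes the values $\bar{C}_{k,j}$ are already available in a dictionary (solving $LP(b)$ is not shown), and the function name is hypothetical. Since both bounds in the definition of $S(\ell)$ are inclusive, a value lying exactly on a boundary would qualify for two sets; placing it in the smaller index is an assumption of this sketch.

```python
def group_by_interval(c_bar, delta, L):
    """Bucket tasks into the sets S(l), for l = 2..L.

    S(l) holds the tasks whose fractional completion time c satisfies
    (1+delta)^(l-2) / 2 <= c <= (1+delta)^(l-1) / 2.
    c_bar maps a task to its fractional completion time in the LP solution.
    """
    groups = {}
    for task, c in c_bar.items():
        for l in range(2, L + 1):
            lower = (1 + delta) ** (l - 2) / 2
            upper = (1 + delta) ** (l - 1) / 2
            if lower <= c <= upper:
                groups.setdefault(l, []).append(task)
                break  # boundary values go to the smallest matching l
    return groups
```

Only the non-empty sets are returned, which mirrors step 4, where Algorithm Makespan is run only for those $\ell$ with $S(\ell) \ne \emptyset$.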

Running Algorithm TaskScheduling(b), we compute a schedule for all Map (resp. Reduce) tasks such that:

Theorem 1. TaskScheduling(b) is an $(8+\varepsilon)$-approximation algorithm for scheduling a set of Map (resp. Reduce) tasks on a set of unrelated processors $\mathcal{P}_{\mathcal{M}}$ (resp. $\mathcal{P}_{\mathcal{R}}$) so as to minimize their total weighted completion time, for $\varepsilon \in (0,1)$.

Proof Sketch. Let $C_{k,j}$ be the completion time of a task $T_{k,j} \in S(\ell)$ in the schedule of Algorithm TaskScheduling(b) and let $C_{\max}(\ell)$ be the makespan of the schedule of Algorithm Makespan on the tasks in $S(\ell)$. Since $C_{k,j} \le C_{\max}(\ell)$ for all $T_{k,j} \in b$, it suffices to prove that $C_{k,j} \le 8(1+\delta)^2 \bar{C}_{k,j}$: we combine Lemma 1 and Lemma 2 with the definition of the set $S(\ell)$. Then, as we can select an $\varepsilon$ such that $(1+\delta)^2 \le (1+\varepsilon)$, the theorem follows. Note that this ratio is tight.
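The chain of inequalities behind the bound $8(1+\delta)^2 \bar{C}_{k,j}$ can be written out explicitly. This is a sketch of the argument, assuming that Lemma 1 is applied in the form $C_{\max}(\ell) \le 2T^*$ (the usual guarantee of the rounding of [9] relative to the LP lower bound $T^*$); the remaining steps use Lemma 2 and the lower bound in the definition of $S(\ell)$.

\begin{align*}
C_{k,j} &\le C_{\max}(\ell) \le 2T^* && \text{(Lemma 1)} \\
&\le 4(1+\delta)^{\ell} && \text{(Lemma 2)} \\
&= 8(1+\delta)^{2} \cdot \frac{(1+\delta)^{\ell-2}}{2} \\
&\le 8(1+\delta)^{2}\, \bar{C}_{k,j} && \text{(definition of } S(\ell)\text{)}
\end{align*}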

2.2 Merging Task Schedules

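Let $\sigma_{\mathcal{M}}, \sigma_{\mathcal{R}}$ be two schedules computed by two runs of Algorithm TaskScheduling(b), for $b = \mathcal{M}$ and $b = \mathcal{R}$, respectively. Let also $C_j^{\sigma_{\mathcal{M}}} = \max_{T_{k,j} \in \mathcal{M}} \{C_{k,j}\}$ and $C_j^{\sigma_{\mathcal{R}}} = \max_{T_{k,j} \in \mathcal{R}} \{C_{k,j}\}$ be the completion times of all Map and all Reduce tasks of a job $j \in \mathcal{J}$ within these schedules, respectively. Depending on these completion time values, we assign each job $j \in \mathcal{J}$ a width equal to $\omega_j = \max\{C_j^{\sigma_{\mathcal{M}}}, C_j^{\sigma_{\mathcal{R}}}\}$. The following algorithm computes a feasible schedule.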

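Algorithm MRS. At each time instant where a processor $i \in \mathcal{P}_b$ becomes available, it processes either the Map task with the minimum width among those assigned to $i \in \mathcal{P}_{\mathcal{M}}$ in $\sigma_{\mathcal{M}}$, or the available (w.r.t. its precedence constraints) Reduce task with the minimum width among those assigned to $i \in \mathcal{P}_{\mathcal{R}}$ in $\sigma_{\mathcal{R}}$.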
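The merging rule can be simulated with a short sketch. It is an illustration under assumptions, not the authors' code: tasks are given as hypothetical `(job, processor, processing_time)` triples that already carry the processor chosen in $\sigma_{\mathcal{M}}$ or $\sigma_{\mathcal{R}}$, the widths $\omega_j$ are supplied in a dictionary, and every job is assumed to have at least one Map task, as in the Notation paragraph.

```python
def mrs(map_tasks, reduce_tasks, width):
    """Merge the Map and Reduce assignments into one feasible schedule.

    map_tasks / reduce_tasks: lists of (job, processor, processing_time).
    width: job -> omega_j. Returns job -> completion time of its last
    Reduce task in the merged schedule.
    """
    # Map processors never wait: each runs its tasks in order of job width.
    busy_until, map_done = {}, {}
    for job, proc, p in sorted(map_tasks, key=lambda t: width[t[0]]):
        busy_until[proc] = busy_until.get(proc, 0) + p
        map_done[job] = max(map_done.get(job, 0), busy_until[proc])

    # Group the Reduce tasks by the processor they were assigned to.
    pending = {}
    for job, proc, p in reduce_tasks:
        pending.setdefault(proc, []).append((width[job], job, p))

    done = {}
    for proc, tasks in pending.items():
        t = 0
        while tasks:
            # A Reduce task is available once all Map tasks of its job finished.
            ready = [task for task in tasks if map_done[task[1]] <= t]
            if not ready:
                # Idle until the Map phase of some pending job completes.
                t = min(map_done[task[1]] for task in tasks)
                continue
            task = min(ready)  # minimum width first
            tasks.remove(task)
            t += task[2]
            done[task[1]] = max(done.get(task[1], 0), t)
    return done
```

For example, with Map tasks `[("a", "m1", 2), ("b", "m1", 3)]`, Reduce tasks `[("a", "r1", 4), ("b", "r1", 1)]` and widths `{"a": 6, "b": 5}`, job b runs first on both processors: its Map task ends at time 3 and its Reduce task at time 4, while job a finishes its Map task at time 5 and its Reduce task at time 9.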

By an analysis similar to that in [11], we can prove that:

Theorem 2. Algorithm MRS is a $(32+\epsilon)$-approximation for the MapReduce scheduling problem, for $\epsilon \in (0,1)$.

Proof Sketch. The feasibility of the schedule produced by Algorithm MRS can be easily verified. To prove the theorem, it suffices to prove that in such a schedule, $\sigma$, all tasks of a job $j \in \mathcal{J}$ are completed by time $2\max\{C_j^{\sigma_{\mathcal{M}}}, C_j^{\sigma_{\mathcal{R}}}\}$. Let $C_j^{\sigma}$ be the completion time of a job $j \in \mathcal{J}$ in $\sigma$. Note that, for each of the Map tasks of $j$, their completion time is upper bounded by $\omega_j$. On the other hand, the completion time of each Reduce task is upper bounded by a quantity equal to $r + \omega_j$, where $r$ is the earliest time when the task is available to be scheduled in $\sigma$. However, $r = C_j^{\sigma_{\mathcal{M}}} \le \omega_j$ and thus $C_j^{\sigma} \le 2\omega_j = 2\max\{C_j^{\sigma_{\mathcal{M}}}, C_j^{\sigma_{\mathcal{R}}}\}$. By applying Theorem 1, and as we can select an $\epsilon$ such that $\epsilon \le 4\varepsilon$, the theorem follows.

3. REFERENCES

[1] F. Afrati, D. Fotakis, and J. Ullman. Enumerating subgraph instances using MapReduce. IEEE-ICDE: 62-73, 2013.

[2] F. Afrati, A. D. Sarma, S. Salihoglu, and J. Ullman. Upper and Lower Bounds on the Cost of a MapReduce Computation. VLDB: 6(4):277-288, 2013.

[3] F. Afrati and J. Ullman. Optimizing multiway joins in a map-reduce environment. IEEE-TKDE: 23(9):1282-1298, 2011.

[4] F. Chen, M. S. Kodialam, and T. V. Lakshman. Joint scheduling of processing and shuffle phases in mapreduce systems. INFOCOM: 1143-1151, 2012.

[5] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. OSDI: 137-150, 2004.

[6] H. Karloff, S. Suri, and S. Vassilvitskii. A Model of Computation for MapReduce. SODA: 938-948, 2010.

[7] R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in MapReduce and streaming. ACM-SPAA: 1-10, 2013.

[8] V. S. A. Kumar, M. V. Marathe, S. Parthasarathy, and A. Srinivasan. Scheduling on unrelated machines under tree-like precedence constraints. Algorithmica: 55(1):205-226, 2009.

[9] J. K. Lenstra, D. B. Shmoys, and É. Tardos. Approximation algorithms for scheduling unrelated parallel machines. Mathematical Programming: 46:259-271, 1990.

[10] J. Lin and J. S. Vitter. epsilon-approximations with minimum packing constraint violation. SODA: 771-782, 1992.

[11] B. Moseley, A. Dasgupta, R. Kumar, and T. Sarlós. On scheduling in map-reduce and flow-shops. ACM-SPAA: 289-298, 2011.

[12] J. R. Correa, M. Skutella, and J. Verschae. The Power of Preemption on Unrelated Machines and Applications to Scheduling Orders. Math. Oper. Res.: 379-398, 2012.

[13] M. Pinedo. Scheduling: theory, algorithms, and systems. Springer, 2012.

[14] D.-J. Yoo and K. M. Sim. A comparative review of job scheduling for mapreduce. IEEE-ICCIS: 353-358, 2011.


Binary Theta-Joins using MapReduce: Efficiency Analysis and Improvements

Ioannis K. Koumarelas
Dept. of Informatics, Aristotle University, Thessaloniki, Greece

Athanasios Naskos
Dept. of Informatics, Aristotle University, Thessaloniki, Greece

Anastasios Gounaris
Dept. of Informatics, Aristotle University, Thessaloniki, Greece

ABSTRACT

We deal with binary theta-joins in a MapReduce environment, and we make two contributions. First, we show that the best known algorithm to date for this problem can reach the optimal trade-off between the size of the input a reducer can receive and the incurred communication cost when the join selectivity is high. Second, when the join selectivity is low, we present improvements upon the state-of-the-art with a view to decreasing the communication cost and the maximum load a reducer can receive, taking also into account the load imbalance across the reducers.

1. INTRODUCTION

Data analysis on voluminous data, such as clickstream data or data derived from scientific experiments and simulations, has given rise to the establishment of MapReduce as the most popular framework for large-scale processing. Analytical database queries remain a useful tool for big data analyses; however, such queries are being investigated in the MapReduce context rather than within a traditional DBMS environment. Analytical query processing in MapReduce has attracted a lot of interest, and the relevant work has investigated several issues, including indexing, data placement and layouts, optimizations, iterative processing, fair load allocation, and interactive processing, to name a few [5]. In this work, we focus on improving the efficiency of join queries executed in MapReduce, for which several proposals already exist [7, 2, 9]. More specifically, we target binary theta-joins, where the join condition between two datasets is arbitrarily complex rather than a simple equality.

Nevertheless, most of the proposals to date tend to be developed on a best-effort basis, without systematically analyzing the inherent trade-offs. Two recent remedies have been proposed in [1, 8]. The work in [8] introduces the notion of minimal MapReduce algorithms, which are algorithms accompanied by guarantees (up to a small constant) regarding several aspects, such as memory consumption and communication cost. The number of MapReduce rounds may be bounded, but it can be more than one. The work in [1] is complementary and presents a way to compute lower bounds on the communication cost as a function of the maximum input a reducer is allowed to receive for specific problems. This makes it possible to define the trade-off between the load on the reducer side and the replication rate. The replication rate is defined as the average ratio of output to input key-value pairs on the map side, and is used as a metric of the communication cost. Further, the work in [1] examines whether known algorithms for those problems can match the lower bounds, provided that they consist of a single MapReduce round.

(c) 2014, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2014 Joint Conference (March 28, 2014, Athens, Greece) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

The algorithms 1-Bucket-Theta and M-Bucket in [7] form the basis of our work. Our first contribution is that we analyze the lower bounds for the binary theta-join problem and we show that the worst-case behaviour of 1-Bucket-Theta matches those bounds. However, such behaviour is expected only when the join selectivity is high. For low selectivities, and with the help of histograms, the more efficient M-Bucket-I and M-Bucket-O algorithms are presented in [7], which aim at minimizing the maximum reducer input and output, respectively. Our second contribution is that we enhance those algorithms through the clustering of histogram buckets. In that way, we can achieve more efficient partitioning of histogram buckets to reducers. The efficiency is measured in terms of the replication rate, the maximum reducer input, and the imbalance across reducers. We show that we can improve the replication rate (i.e., reduce the communication cost) and the maximum reducer input (i.e., reduce the longest running time and the space requirements of reducers) with insignificant impact on load imbalance.

The remainder of this extended abstract is structured as follows. In Sec. 2 we briefly present the 1-Bucket-Theta and M-Bucket algorithms, which we analyze in Sec. 3 and enhance in Sec. 4, respectively. In Sec. 5, we conclude and describe next steps.

2. BACKGROUND

In [7], the problem of performing binary theta-joins $S \bowtie_\theta T$ on MapReduce is studied. The core of the approach lies in how the workload is partitioned across reducers. To represent the workload, a join matrix (JM) is used. In JMs, each cell corresponds to a pair of tuples, one from each input dataset, to be processed. The JM is split into several regions, where each region is mapped to a reducer. For each region, we can compute the number of tuples that belong to it, which is the input cost of that region and is directly related to the computation and memory load of the associated reducer. For perfect load balancing, we want these regions to have equal input cost. To accomplish this objective, two main algorithms are presented: 1-Bucket-Theta and M-Bucket-I (and its variation M-Bucket-O).

Figure 1: Partitioning the JM in 1-Bucket-Theta (left) and M-Bucket (right).

2.1 1-Bucket-Theta

1-Bucket-Theta is the most generic algorithm, since it examines all tuple pairs (as in the Cartesian product), and requires minimal statistical information, namely just the cardinalities of the input. The strong point of the algorithm is the principled way in which it partitions the JM, so that all JM cells are covered and, at the same time, the maximum reducer input is minimized. The algorithm is shown to be more suitable for high join selectivities (e.g., above 50%). Fig. 1(left) shows an example partitioning across 3 reducers, where there are 6 tuples from each of S and T, and the input cost of each reducer is 7 (4 tuples from S and 3 from T), 7 (4 from S and 3 from T) and 8 (2 from S and 6 from T), respectively.
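The input costs quoted for Fig. 1(left) can be reproduced with a short sketch. The region boundaries below are an assumption chosen to be consistent with the counts in the text (the figure itself is not reproduced here), and `regions` and `input_cost` are illustrative names rather than anything defined in [7].

```python
# A region is a pair (rows of S covered, columns of T covered).
# ASSUMPTION: this layout is one partitioning consistent with the counts
# stated in the text; the actual layout in Fig. 1(left) may differ.
S_SIZE, T_SIZE = 6, 6
regions = [
    (range(0, 4), range(0, 3)),  # reducer 1: 4 tuples from S, 3 from T
    (range(0, 4), range(3, 6)),  # reducer 2: 4 tuples from S, 3 from T
    (range(4, 6), range(0, 6)),  # reducer 3: 2 tuples from S, 6 from T
]


def input_cost(region):
    """Number of input tuples a reducer receives for its region."""
    rows, cols = region
    return len(rows) + len(cols)


costs = [input_cost(region) for region in regions]

# 1-Bucket-Theta must cover every cell of the join matrix.
covered = sorted((i, j) for rows, cols in regions for i in rows for j in cols)
all_cells = [(i, j) for i in range(S_SIZE) for j in range(T_SIZE)]
assert covered == all_cells

# Replication rate: total reducer input divided by total input size.
replication_rate = sum(costs) / (S_SIZE + T_SIZE)
print(costs, round(replication_rate, 2))
```

Under this assumed layout, the 12 input tuples are sent to reducers 22 times in total, which is the communication cost that the replication rate captures.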

2.2 M-Bucket-I

In cases where there are histograms, so that we can safely reason as to whether a specific combination of tuples can satisfy the join condition, and the join selectivity is low, M-Bucket-I outperforms 1-Bucket-Theta. The histograms are equi-depth ones and are produced in a separate MapReduce phase, as explained in [7]. Then, the JM is constructed, where each cell corresponds to a pair of histogram buckets rather than a pair of tuples. As such, the size of a JM need not grow as the size of the input data increases, at the expense of histogram buckets of greater depth. From the JM and the join condition, it is straightforward to identify pairs that do not contribute to the result (depicted as white cells in Fig. 1(right)). During the partitioning step, a heuristic method is followed, which is not accompanied by guarantees as in 1-Bucket-Theta but yields better results, since it benefits from the fact that most of the JM cells are not valid candidate pairs.

The difference between M-Bucket-I and M-Bucket-O is that the former targets the minimization of the maximum reducer input, whereas the latter targets the minimization of the maximum reducer output. Note that estimating the reducer output based on histograms is prone to significant errors, even when the histograms are accurate.

3. ON THE OPTIMALITY OF 1-BUCKET-THETA

First we define the lower bound on the communication cost of any 1-round MapReduce algorithm for binary theta-joins. As already mentioned, the communication cost is measured using the replication rate metric. Let us examine the steps of the short version of the generic recipe for deriving such bounds from [1]. Given two relations S and T, with sizes |S| and |T|, respectively, we have:

• Size (Number) of Inputs and Outputs:

– Inputs: $|I| = |S| + |T|$

– Outputs: $|O| = |S||T|$ (accounting for the worst case, which is the Cartesian product)

• Deriving g(q): The upper bound on the number of outputs a reducer can produce given q inputs, denoted as g(q), occurs when q is equally divided into input from S and T, i.e., $q/2$ tuples from S and $q/2$ tuples from T. The maximum result of applying the theta-join to these two quantities is obtained when we have a Cartesian product, thus $g(q) = q^2/4$.

• Replication Rate r(q): The quantity $g(q)/q$ equals $(q^2/4)/q = q/4$, which is monotonically increasing in q. Therefore, the replication rate can be bounded using the formula $r(q) \ge \frac{q\,|O|}{g(q)\,|I|} = \frac{4|S||T|}{q(|S|+|T|)}$. So, the lower bound on r, $r_{lb}$, is $\frac{4|S||T|}{q(|S|+|T|)}$.

The above formula illustrates the exact trade-off between parallelism and communication cost in binary theta-joins. By increasing the degree of parallelism in order to decrease the input q each reducer receives, the communication cost increases, since, for the lower bound, q and r are inversely proportional to each other.
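The inverse proportionality between q and the lower bound can be seen numerically. The following is a minimal sketch of the bound $4|S||T| / (q(|S|+|T|))$ derived above; the function name and the relation sizes are hypothetical values picked for illustration.

```python
def replication_lower_bound(q, s_size, t_size):
    """Lower bound on the replication rate: 4|S||T| / (q (|S| + |T|))."""
    if q <= 0:
        raise ValueError("q must be positive")
    return 4 * s_size * t_size / (q * (s_size + t_size))


# Hypothetical relation sizes, for illustration only.
s_size = t_size = 1_000_000

# Doubling the reducer input bound q halves the lower bound.
bounds = {q: replication_lower_bound(q, s_size, t_size)
          for q in (10_000, 20_000, 40_000)}
for q, bound in bounds.items():
    print(f"q = {q:>6}: r_lb = {bound}")
```

Halving q (more parallelism, smaller reducers) doubles the amount of communication that any single-round algorithm must incur in the worst case.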

The next step is to find the upper bound on the replication rate of 1-Bucket-Theta. In [7], three partitioning cases are presented, based on the sizes |S| and |T| and the number of available reducer processors p. Due to limited space, we will examine only the first case in detail.

The first case corresponds to the scenario where the JM can be exactly covered by $c_S \times c_T$ squares of side-length $\sqrt{|S||T|/p}$. This means that the following conditions hold: $|S| = c_S\sqrt{|S||T|/p}$ and $|T| = c_T\sqrt{|S||T|/p}$, where $c_S$, $c_T$ are positive integers. For example, if $p = 4$, then the JM in Fig. 1(left) can be exactly covered by 4 squares of side-length 3.

Then we have:

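• Replication rate of 1-Bucket-Theta ($r_{1BT}$): $r_{1BT} \le \frac{|S|c_T + |T|c_S}{|S|+|T|} = \frac{\frac{|S||T|}{\sqrt{|S||T|/p}} + \frac{|T||S|}{\sqrt{|S||T|/p}}}{|S|+|T|} = \frac{2|S||T|}{\sqrt{|S||T|/p}\,(|S|+|T|)}$

• Reducer input: $q_{1BT} = 2\sqrt{|S||T|/p}$

• Combining $r_{1BT}$ and $q_{1BT}$: $r_{1BT}\,q_{1BT} \le \left(\frac{2|S||T|}{\sqrt{|S||T|/p}\,(|S|+|T|)}\right)\left(2\sqrt{|S||T|/p}\right) = \frac{4|S||T|}{|S|+|T|}$, which implies that $r_{1BT}(q_{1BT}) \le r_{lb}$.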

So, the upper bound of the first case of 1-Bucket-Theta is at most as high as the lower bound of the problem, which means that, for that case, the algorithm is optimal.
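The first case can be checked numerically on the example given above (|S| = |T| = 6, p = 4). This is a sketch under the formulas of this section only; `one_bucket_theta_case1` is a hypothetical helper name, and it does not implement the partitioning of [7] itself.

```python
import math


def one_bucket_theta_case1(s_size, t_size, p):
    """Return (replication rate, reducer input) for the exact-cover case."""
    side = math.sqrt(s_size * t_size / p)
    c_s, c_t = s_size / side, t_size / side
    # Case 1 requires the JM to be covered exactly by c_S x c_T squares.
    if abs(c_s - round(c_s)) > 1e-9 or abs(c_t - round(c_t)) > 1e-9:
        raise ValueError("case 1 does not apply: c_S and c_T must be integers")
    r = (s_size * c_t + t_size * c_s) / (s_size + t_size)
    q = 2 * side
    return r, q


s_size, t_size, p = 6, 6, 4
r, q = one_bucket_theta_case1(s_size, t_size, p)

# Lower bound from the previous derivation, evaluated at the same q.
r_lb = 4 * s_size * t_size / (q * (s_size + t_size))
print(r, q, r_lb)
```

For this instance the replication rate of 1-Bucket-Theta and the lower bound evaluated at the same reducer input coincide, which is the optimality claim for the first case.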


Following the same reasoning, the other two cases (Theorems 2 and 3 in [7], respectively), which correspond to different formulas for $c_S$ and $c_T$, can be examined, for which we have:

• Case 2: $r_{1BT} \le \frac{4}{q}\cdot\frac{|T||S|}{|S|+|T|} = r_{lb}$

• Case 3: $r_{1BT} \le \frac{8}{q}\cdot\frac{|S||T|}{|S|+|T|} = 2\,r_{lb}$

Overall, the upper bound of the replication rate is at most two times the lower bound, and as such is optimal up to a constant factor. In [3], it is shown that the lower bound can be met for self-joins, which is a special case of binary joins.

4. REDUCING THE REPLICATION RATE IN M-BUCKET

The partitioner of the M-Bucket-I algorithm operates on a join matrix (JM), where each cell corresponds to a pair of histogram buckets. It tries to fit the cells in rectangular regions; each region is associated with a single reducer. The rationale of our approach is to permute the JM's rows and columns, in order to improve the quality of the partitioning phase.

The problem of cell rearrangement can be addressed with several algorithm families, such as clustering (e.g., hierarchical, array-based, and so on), combinatorial optimization (e.g., bin packing, knapsack) and bandwidth reduction. Here, we examine the impact of array-based clustering algorithms and, more specifically, we employ the Bond Energy clustering algorithm (BEA) [6], due to its efficiency [4]. The purpose of BEA is to identify natural clusters that occur in complex data arrays, such as JMs. This task is accomplished by permuting the rows and columns of the JM in a way that the numerically larger array elements are clustered together. As the JM of our interest comprises a two-dimensional bitmap array, i.e., the cell values are either 0 or 1 to indicate whether the processing of the corresponding pairs is meaningful or not, we expect all the non-zero values to be grouped as close as possible. The intuition is that, if the JM contains more empty sub-matrices, the mapping of the remaining sub-matrices to reducers will improve.
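To make the rearrangement idea concrete, here is a minimal sketch of a greedy, BEA-style permutation on a bitmap JM. It is a simplified illustration in the spirit of [6], not necessarily the variant used in the experiments: the seeding with the first column, the tie-breaking rule, and all function names are assumptions.

```python
def bond(matrix, a, b):
    """Bond between columns a and b: sum over rows of their products."""
    return sum(row[a] * row[b] for row in matrix)


def order_columns(matrix):
    """Greedily place columns so that strongly bonded columns are adjacent."""
    placed = [0]  # ASSUMPTION: seed with the first column
    remaining = set(range(1, len(matrix[0])))
    while remaining:
        best = None  # (gain, column, position)
        for col in sorted(remaining):
            for pos in range(len(placed) + 1):
                left = placed[pos - 1] if pos > 0 else None
                right = placed[pos] if pos < len(placed) else None
                gain = 0
                if left is not None:
                    gain += bond(matrix, left, col)
                if right is not None:
                    gain += bond(matrix, col, right)
                if left is not None and right is not None:
                    gain -= bond(matrix, left, right)  # bond that gets broken
                if best is None or gain > best[0]:
                    best = (gain, col, pos)
        _, col, pos = best
        placed.insert(pos, col)
        remaining.remove(col)
    return placed


def rearrange(matrix):
    """Permute columns, then rows (by ordering the transposed matrix)."""
    col_order = order_columns(matrix)
    row_order = order_columns([list(col) for col in zip(*matrix)])
    return [[matrix[r][c] for c in col_order] for r in row_order]


def measure_of_effectiveness(matrix):
    """Sum of products over horizontally and vertically adjacent cells."""
    n, m = len(matrix), len(matrix[0])
    horiz = sum(matrix[i][j] * matrix[i][j + 1]
                for i in range(n) for j in range(m - 1))
    vert = sum(matrix[i][j] * matrix[i + 1][j]
               for i in range(n - 1) for j in range(m))
    return horiz + vert


jm = [[1, 0, 1, 0],
      [0, 1, 0, 1],
      [1, 0, 1, 0],
      [0, 1, 0, 1]]
clustered = rearrange(jm)
print(measure_of_effectiveness(jm), measure_of_effectiveness(clustered))
```

On this toy JM the scattered non-zero cells end up in two dense blocks on the diagonal, leaving two empty sub-matrices that need not be assigned to any reducer.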

Our work adds a step of beforehand analysis to the M-Bucket-I/O algorithm, just after the histograms are built and the initial JM is produced. It thus takes place before the actual execution on a MapReduce platform. The quality of a JM is assessed with the help of the following three metrics:

1. replication rate (rep), defined as in the Introduction and [1];

2. maximum reducer input (mri); and

3. input imbalance (imb), defined as the ratio of mri to the average reducer input, considering only the non-idle reducers.

Note that the metrics above can be accurately computed from the JM, without requiring the real execution to be completed. Thus, if the JM rearrangement is not considered beneficial, the execution can switch back to the original JM. That is, it is straightforward to add a post-processing phase, in order to guarantee that we choose the best partitioning between the one based on the original and the one based on the rearranged JM. Consequently, our proposal never leads to performance degradation; in fact, it can lead to significant improvements according to our experiments.
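The three metrics and the fallback step can be sketched as follows. The region representation (lists of row and column bucket indices per reducer), the uniform `bucket_depth`, and the choice of rep as the selection criterion are assumptions made for illustration; the partitionings in the example are hypothetical and are not produced by M-Bucket-I.

```python
def partition_metrics(regions, n_row_buckets, n_col_buckets, bucket_depth=1):
    """Compute rep, mri and imb for a partitioning of a bucket-level JM.

    regions: one (row_buckets, col_buckets) pair per reducer.
    ASSUMPTION: equi-depth buckets, each holding bucket_depth tuples.
    """
    inputs = [(len(rows) + len(cols)) * bucket_depth for rows, cols in regions]
    non_idle = [x for x in inputs if x > 0]
    if not non_idle:
        raise ValueError("at least one reducer must receive input")
    total_input = (n_row_buckets + n_col_buckets) * bucket_depth
    mri = max(inputs)
    return {
        "rep": sum(inputs) / total_input,           # replication rate
        "mri": mri,                                 # maximum reducer input
        "imb": mri / (sum(non_idle) / len(non_idle)),  # input imbalance
    }


def pick_partitioning(original, rearranged):
    """Fallback step: keep the original unless rep strictly improves.

    ASSUMPTION: rep is used as the selection criterion.
    """
    return rearranged if rearranged["rep"] < original["rep"] else original


# Hypothetical partitionings of a 4x4 JM across two reducers.
original = partition_metrics(
    [([0, 1, 2, 3], [0, 1]), ([0, 1, 2, 3], [2, 3])], 4, 4)
rearranged = partition_metrics(
    [([0, 1], [0, 1]), ([2, 3], [2, 3])], 4, 4)
chosen = pick_partitioning(original, rearranged)
print(original, rearranged)
```

Because all three numbers depend only on the JM and the region boundaries, the comparison can be made before any MapReduce job is launched.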

Figure 2: Example JMs before (left) and after (right) applying BEA. (Six panels in three rows; in each panel, the horizontal axis is T and the vertical axis is S, both ranging from 0 to 100.)

As an example, we extracted a sample of 64M tuples from the Cloud dataset in http://cdiac.ornl.gov/ftp/ndp026c/ndp026c.pdf. Fig. 2(top) shows the initial and rearranged JM for a self-join query that retrieves record pairs for which the absolute difference of the sea level is between 0 and 2, or between 22 and 24, or between 50 and 52, or between 80 and 82, as an example of a complex range query. The rearranged JM yields 21% lower rep and 19% lower mri at the expense of 4% higher imb. Next, we proceed to more systematic experiments on synthetic data.

4.1 Experimental Evaluation

We focus on band joins, a type of theta-join that can significantly benefit from M-Bucket. In band joins, the condition has the form $R.A - \varepsilon \le S.A \le R.A + \varepsilon$.
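For a band condition, deciding whether a pair of histogram buckets is a candidate cell reduces to an interval test. The sketch below assumes each bucket is summarized by the closed interval of attribute values it contains; the function name, the bucket boundaries, and the value of ε are illustrative and do not correspond to the synthetic JMs used in the experiments.

```python
def band_join_matrix(r_buckets, s_buckets, eps):
    """Bucket-level JM for the condition R.A - eps <= S.A <= R.A + eps.

    Each bucket is a closed interval (lo, hi) of attribute values.
    A cell is 1 if some pair of values from the two buckets can satisfy
    the band condition, and 0 if no pair can.
    """
    return [
        [1 if (s_lo <= r_hi + eps and s_hi >= r_lo - eps) else 0
         for (s_lo, s_hi) in s_buckets]
        for (r_lo, r_hi) in r_buckets
    ]


# Hypothetical buckets over an integer attribute.
buckets = [(0, 9), (10, 19), (20, 29), (30, 39)]
jm = band_join_matrix(buckets, buckets, eps=2)
for row in jm:
    print(row)
```

With a single band condition the candidate cells form one diagonal stripe, which is why a single condition leaves little room for rearrangement, whereas several conditions produce several stripes separated by empty cells.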

The experimental setup is as follows. We randomly generate synthetic JMs so that the produced JMs vary in the following aspects: join selectivity, number of band conditions, and size of JMs. Then, we compute the statistics of the resulting partitioning to reducers both when we cluster the JM and when we do not. In the first experiment, we assume that the dimensions of the JM are 100×100. We vary the number of available reducers from 10 to 40. Also, the numbers of band conditions examined are 1, 3 and 5. For each band condition, we examined selectivity values of 1%, 5% and 10%.

Fig. 2 shows two more examples of JM rearrangement. From the left column of the middle and bottom rows, we can see the typical form of the original synthetic JMs. For each band condition, there is a diagonal stripe of cells for which the join condition holds. The gaps between such stripes are randomly shifted, so that the JMs are not symmetric; for each condition the selectivity is set to 1%. As we can observe, the visual effect of the BEA algorithm differs widely between the two examples, but in both cases there were significant improvements, which we discuss below.

The average impact of BEA on the metrics examined is summarized in Table 1. The rightmost column of the table shows the percentage of the times that the rearranged JM has led to improvements in the replication rate. Table 2 refers to the maximum improvements in the replication rate observed. From these two tables, we can draw the following conclusions. On average, our proposal improves the partitioning in approximately 59% of the cases. In those cases, the average decrease in the replication rate is 15%, but it can reach 37%. The improvements become more significant as the number of band conditions increases and the selectivity becomes lower. On average, when the band selectivity is 1%, the replication rate drops by 28%, while the maximum reducer input decreases by 26%. There is a slight increase in the relative imbalance though. Similarly, we can observe that, when the number of band conditions is 5, there are improvements in all the cases examined.

| | rep | mri | imb | coverage |
|---|---|---|---|---|
| Overall | 0.846 | 0.880 | 1.029 | 59.26% |
| Band selectivity: 1% | 0.717 | 0.735 | 1.028 | 66.67% |
| Band selectivity: 5% | 0.920 | 0.949 | 1.014 | 66.67% |
| Band selectivity: 10% | 0.928 | 0.996 | 1.056 | 44.45% |
| Number of band conditions: 1 | 0.987 | 0.967 | 0.964 | 33.34% |
| Number of band conditions: 3 | 0.821 | 0.835 | 1.010 | 44.45% |
| Number of band conditions: 5 | 0.810 | 0.873 | 1.058 | 100% |

Table 1: Average ratio of the BEA-produced JM metrics to the original JM metrics.

| | rep | mri | imb |
|---|---|---|---|
| Overall | 0.634 | 0.649 | 1.023 |
| Band selectivity: 1% | 0.634 | 0.649 | 1.023 |
| Band selectivity: 5% | 0.833 | 0.875 | 1.050 |
| Band selectivity: 10% | 0.848 | 0.900 | 1.050 |
| Number of band conditions: 1 | 0.979 | 1 | 0.988 |
| Number of band conditions: 3 | 0.737 | 0.733 | 0.995 |
| Number of band conditions: 5 | 0.634 | 0.649 | 1.023 |

Table 2: Ratio of the BEA-produced JM metrics to the original JM metrics for the maximum rep drop observed.

We also investigated the impact of the number of reducers, but this was not found to be significant. Finally, note that we considered only the cases where the replication rate is strictly less than that with the original JMs in order to compute mri and imb. The average values of these two metrics are slightly different if all the measurements are considered.

We conducted an additional experiment, where we increased the dimensions of the JM to 1000×1000 and we further decreased the minimum selectivity of each band condition to 0.1%. The main purpose was to verify our hypothesis that our proposal is more suitable for band joins with multiple conditions, each having a low selectivity. Indeed, in 100% of the cases examined when the selectivity was 0.1% and the number of band conditions was 3 and 5, there was a significant decrease in the replication rate (28.1% on average). The maximum reducer input was also decreased by the same amount, whereas the imbalance remained similar. Overall, when the selectivity is low, there is more space for BEA to yield empty sub-matrices; whereas, when there are fewer band conditions, the differences from the original JMs are less significant.

5. CONCLUSIONS AND FURTHER WORK

We investigate the execution of binary theta-joins using MapReduce. First, we analyze the efficiency of the state-of-the-art and, second, we propose the use of a pre-processing clustering algorithm in order to help the partitioning of the map output to reducers. Our proposal was shown to yield significant reductions in the communication cost and the maximum input received by each reducer when the theta clause comprises several conditions, each of low selectivity.

A strong point of our approach is that it is not intrusive, in the sense that it can be easily incorporated into the current state-of-the-art proposal in [7], as a pre-processing phase before the actual execution on a MapReduce platform begins. In addition, it is straightforward to assess whether our approach is beneficial for a specific setting, and thus our proposal does not lead to overall performance degradation.

In the future, we plan to focus on more elaborate types of array rearrangement algorithms. Scalability is also an issue, since algorithms such as BEA do not scale to matrices with very large dimensions. Another avenue for further work is to investigate more sophisticated partitioning algorithms to be coupled with JM rearrangement. Harder problems include the investigation of provably optimal techniques for multi-way theta-joins and efficient histogram construction when there are multiple attributes participating in the theta-join condition.

Acknowledgments. This research has been co-financed by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF) - Research Funding Program: Thales. Investing in knowledge society through the European Social Fund.

6. REFERENCES

[1] F. N. Afrati, A. D. Sarma, S. Salihoglu, and J. D. Ullman. Upper and lower bounds on the cost of a map-reduce computation. PVLDB, 6(4):277–288, 2013.

[2] F. N. Afrati and J. D. Ullman. Optimizing multiway joins in a map-reduce environment. IEEE Trans. Knowl. Data Eng., 23(9):1282–1298, 2011.

[3] F. N. Afrati and J. D. Ullman. Matching bounds for the all-pairs mapreduce problem. In IDEAS, pages 3–4, 2013.

[4] S. Climer and W. Zhang. Rearrangement clustering: Pitfalls, remedies, and applications. Journal of Machine Learning Research, 7:919–943, 2006.

[5] C. Doulkeridis and K. Nørvåg. A survey of large-scale analytical query processing in mapreduce. The VLDB Journal, pages 1–26, 2013.

[6] W. T. McCormick, P. J. Schweitzer, and T. W. White. Problem decomposition and data reorganization by a clustering technique. Operations Research, 20(5):993–1009, 1972.

[7] A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In SIGMOD, pages 949–960, 2011.

[8] Y. Tao, W. Lin, and X. Xiao. Minimal mapreduce algorithms. In SIGMOD, pages 529–540, 2013.

[9] X. Zhang, L. Chen, and M. Wang. Efficient multi-way theta-join processing using mapreduce. PVLDB, 5(11):1184–1195, 2012.


On the design space of MapReduce ROLLUP aggregates

Duy-Hung Phan
EURECOM

Matteo Dell'Amico
EURECOM

Pietro Michiardi
EURECOM

ABSTRACT

We define and explore the design space of efficient algorithms to compute ROLLUP aggregates, using the MapReduce programming paradigm. Using a modeling approach, we explain the non-trivial trade-off between parallelism and communication costs that is inherent to a MapReduce implementation of ROLLUP. Furthermore, we design a new family of algorithms that, through a single parameter, make it possible to find a "sweet spot" in the parallelism vs. communication cost trade-off. We complement our work with an experimental approach, wherein we overcome some limitations of the model we use. Our results indicate that efficient ROLLUP aggregates require striking the right balance between parallelism and communication for both one-round and chained algorithms.

1. INTRODUCTION

Online analytical processing (OLAP) is a fundamental approach to study multi-dimensional data involving the computation of, for example, aggregates on data that are accumulated in traditional data warehouses. When operating on massive amounts of data, it is typical for business intelligence and reporting applications to require data summarization, which is achieved using standard SQL operators such as GROUP BY, ROLLUP, CUBE, and GROUPING SETS.

Despite the tremendous amount of work carried out in the database community to come up with efficient ways of computing data aggregates, little work has been done to extend these lines of work to cope with massive scale. Indeed, the main focus of prior works in this domain has been on single-server systems or small clusters executing a distributed database, providing efficient implementations of CUBE and ROLLUP operators, in line with the expectations of low-latency access to data summaries [6, 8, 11, 13, 14, 19]. Only recently has the community devoted attention to solving the problem of computing data aggregates at massive scales using data-intensive, scalable computing engines such as MapReduce [10]. In support of the growing interest in computing data aggregates on batch-oriented systems, several high-level languages built on top of MapReduce, such as PIG [3] and HIVE [2], support simple implementations of, for example, the ROLLUP operator.

The endeavor of this work is to take a systematic approach to study the design space of the ROLLUP operator: besides being widely used on its own, ROLLUP is also a fundamental building block used to compute CUBE and GROUPING SETS [7]. We study the problem of defining the design space of algorithms to implement ROLLUP through the lens of a recent model of MapReduce-like systems [4]. The model explains the trade-offs that exist between the degree of parallelism that is possible to achieve and the communication costs that are inherently present when using the MapReduce programming model. In addition, we overcome current limitations of the model we use (which glosses over important aspects of MapReduce computations) by extending our analysis with an experimental approach. We present instances of algorithmic variants of the ROLLUP operator that cover several points in the design space, and we implement and evaluate them using a Hadoop cluster.

In summary, our contributions are the following:

• We study the design space that exists to implement ROLLUP and show that, while it may appear deceivingly simple, it is not a straightforward embarrassingly parallel problem. We use modeling to obtain bounds on parallelism and communication costs.

• We design and implement new ROLLUP algorithms that can match the bounds we derived, and that sweep the design space we were able to define.

• We pinpoint the essential role of combiners (an optimization allowing pre-aggregation of data, which is available in real instances of the MapReduce paradigm, such as Hadoop [1]) for the practical relevance of some algorithm instances, and proceed with an experimental evaluation of several variants of ROLLUP implementations, both in terms of their performance (runtime) and their efficient use of cluster resources (total amount of work).

• Finally, our ROLLUP implementations exist in Java MapReduce and have been integrated in our experimental branch of PIG, which are available in a public repository.¹

¹ https://bitbucket.org/bigfootproject/rollupmr
