UNIVERSITÉ LIBRE DE BRUXELLES, UNIVERSITÉ D'EUROPE
FACULTÉ DES SCIENCES

Stochastic Approach to Brokering Heuristics for Computational Grids

Thesis presented by Vandy Berten in candidacy for the degree of Docteur en Sciences (Doctor of Science)
Thesis advisor: Joël Goossens
Academic year 2006–2007
Thesis publicly defended in Brussels on 8 June 2007.
Acknowledgements

A work of this scale is never a personal achievement. Even if only one person puts their name on the cover, they are far from being its sole author. I could therefore not begin this work without thanking those who, as much as myself, are at its foundation.

Among all who took part in these four years of labour, one has certainly contributed more than any other: Joël Goossens, my thesis advisor. He chose to trust me nearly five years ago, first by agreeing to supervise my master's thesis, and then my doctoral thesis. Our weekly meetings, his regular readings of my writings even when I strayed from his main research topics, his insistence on having me meet foreign researchers, his rigour and his experience steered my work in a direction which, I hope, has proven worthy of the trust he placed in the first "thésard" I had the privilege of being.

A few months after the start of my thesis, I had the good fortune to meet the AlGorille team in Nancy, and more particularly Emmanuel Jeannot, who welcomed me on several occasions over these past years. Our wide-ranging discussions allowed me to broaden my interests, and I can only be grateful to him.

Later on, I had the pleasure of collaborating with several researchers from IMAG in Grenoble, and especially with Bruno Gaujal, to whom I owe the ideas underlying the second part of this work. He did me the honour of hosting me several times, for a total of nearly four months. Beyond the delight of life at the foot of the mountains, these stays with his team taught me an enormous amount about the world of research and collaboration. I would also like to thank Jean-Marc Vincent and Jérôme Vienne, for their perfect-simulation tool, which they adapted many times to the needs of our experiments. It is moreover thanks to Bruno that I was able to access Grid'5000 and consume nearly 40,000 CPU hours there.

All my gratitude also goes to Raymond Devillers, for his many meticulous re-readings of my writings, and all the fascinating discussions that followed. As many members of the department can attest, he has quite evidently contributed greatly to the scientific rigour, the precision, and the handling of corner cases. There is the thesis before Devillers, and after Devillers...

I must certainly thank all those who, besides Messrs Goossens, Jeannot, Gaujal and Devillers, agreed to be part of the jury: Olivier Markowitch and Guy Louchard of the ULB, and Pierre Manneback of the FPMs. I particularly thank Mr Louchard, to whom I largely owe the proof in Appendix B of this work.

Outside academia, it goes without saying that I owe much (not to say everything) to my parents, my two brothers and my sister; curiosity, the taste for discovery, the interest in science, the quest for precision, the pleasure of shared knowledge: these are all values I owe them, and without which this work could not have come to fruition.

Of course, I do not forget those who have surrounded me these past years: bémelois, colonnards, cousins, taizéens, members of the DI, and many others!
Contents
1 Introduction to Grid Brokering 13
1.1 Motivations and Context . . . . 14
1.1.1 Outline . . . . 16
1.2 Definitions and Grid Modeling . . . . 17
1.2.1 Queuing Model . . . . 19
1.2.2 Scheduling Model . . . . 25
1.2.3 System Load . . . . 28
1.2.4 System State . . . . 30
1.2.5 Underlying Markov Chain . . . . 31
1.3 Brokering . . . . 33
1.3.1 Open-loop and Closed-loop . . . . 33
1.3.2 Memoryless and Historical Information . . . . 34
1.3.3 Deterministic and Probabilistic . . . . 34
1.3.4 Mathematical Model . . . . 35
Probabilistic Memoryless Open-loop Brokering . . . . 35
Deterministic Memoryless Closed-loop Brokering . . . . 36
1.3.5 Bernoulli Brokerings . . . . 36
1.3.6 LCB Policies . . . . 38
1.4 Examples . . . . 41
1.4.1 EGEE . . . . 41
1.4.2 NorduGrid . . . . 43
1.4.3 Grid’5000 . . . . 43
1.4.4 GridBus . . . . 44
1.5 Cost Function . . . . 45
1.6 Concepts . . . . 48
1.6.1 Resolving Markov Decision Processes Using Dynamic Programming . . 48
Minimizing the Policy . . . . 50
Value Iteration . . . . 50
Policy Iteration . . . . 51
1.6.2 Perfect Simulation . . . . 51
2 Random Brokering 53
2.1 Introduction and Model . . . . 54
2.1.1 Dispatching the Jobs . . . . 55
2.1.2 Numerical Simulations . . . . 56
2.2 Sequential Systems . . . . 58
2.2.1 System Load . . . . 58
2.2.2 Queue Size . . . . 59
Case ν = 1 . . . . 59
Case ν < 1 . . . . 60
Case ν > 1 . . . . 64
Arrivals. . . . 64
Departures. . . . . 65
Number of Jobs in the System. . . . . 67
Experimental Results . . . . 69
2.2.3 Used CPUs . . . . 74
2.2.4 Resorption Time . . . . 75
2.2.5 Slowdown . . . . 76
Job Length Distributions . . . . 77
Case ν < 1 . . . . 79
Case ν > 1 . . . . 81
Slowdown for a Job Submitted at Time θ. . . . . 82
Average Slowdown Until the Last Finished Job With ν > 1. . . 85
Experimental Results . . . . 89
2.3 Fully-synchronous Parallel Systems . . . . 93
2.3.1 Queue Size . . . . 95
Case ν < ν̃_i . . . . 95
Case ν > ν̃_i . . . . 95
Experimental Results . . . . 97
2.3.2 Used CPUs . . . . 99
Case ν < ν̃_i . . . . 99
Case ν > ν̃_i . . . . 99
2.3.3 Resorption Time . . . 102
2.3.4 Slowdown . . . 103
Experimental Results . . . 106
2.4 Conclusion . . . 108
2.4.1 Summary of Contribution . . . 108
Queue Size . . . 108
Average Number of Used CPUs . . . 109
Resorption Time . . . 109
Measured Slowdown . . . 109
3 Index Based Brokering 111
3.1 Introduction . . . 112
3.1.1 Mathematical model . . . 114
3.1.2 Optimal Brokering . . . 115
3.1.3 Intuitive Justification of the Index Strategy . . . 117
3.1.4 Cost Optimization Problem . . . 122
3.1.5 Mathematical Justification of the Whittle-Gittins Index . . . 123
3.2 Threshold Policy on a Single Queue with Rejection . . . 127
3.2.1 Mathematical Formulation . . . 131
3.2.2 Properties of Optimal Policy . . . 133
3.2.3 Computing the Optimal Threshold . . . 141
3.3 Algorithmic Improvements . . . 144
3.3.1 Algebraic Simplifications . . . 144
Local Gain . . . 144
Admissibility Condition . . . 144
Total Gain . . . 145
3.3.2 Improvement of the Admissibility Check . . . 146
3.3.3 Reducing the Problem Size . . . 147
3.3.4 Computing the Index Function . . . 149
3.3.5 Parameters Dependence and Numerical Issues . . . 150
Discount Cost (α) . . . 151
Arrival Rate (λ) . . . 152
Number of Servers (s) . . . 152
Θ Precision (ε) . . . 153
3.4 Complexity and Benchmarks . . . 155
3.4.1 Value-Determination Operation (Solving J_θ) . . . 155
3.4.2 Policy-Improvement Routine . . . 155
Maximal Complexity . . . 155
Average Complexity . . . 155
3.4.3 Finding Θ(R) . . . 156
Maximal Complexity . . . 156
Average Complexity . . . 156
Improving the Maximal Complexity . . . 156
3.4.4 Dichotomy . . . 157
First Phase . . . 158
Second Phase . . . 158
Improved Maximal Complexity . . . 158
3.4.5 Space Complexity . . . 159
3.4.6 Benchmarks . . . 159
3.5 Numerical Experiments . . . 161
3.5.1 Strategies . . . 161
3.5.2 Uniprocessor Systems . . . 162
3.5.3 Multiprocessors Systems . . . 164
3.5.4 Robustness . . . 164
3.5.5 Sojourn Time Distribution . . . 169
3.6 Realistic Experiments . . . 172
3.6.1 SimGrid Software . . . 172
3.6.2 A Grid Model Using SimGrid . . . 172
3.6.3 Traces . . . 174
3.6.4 Experimental Scenarios . . . 175
Several Inputs . . . 176
The Effect of Heterogeneity . . . 177
Information Delays . . . 178
3.6.5 Sojourn Time Distribution . . . 180
3.7 Conclusions . . . 181
4 Index Brokering of Batch Jobs 183
4.1 Introduction . . . 184
4.2 Batch Arrivals with Known Distribution . . . 185
4.2.1 Mathematical Model . . . 185
4.2.2 Threshold Policy: Differences with Sequential Jobs . . . 186
Properties of Optimal Threshold Policy . . . 188
4.2.3 Computing the Optimal Threshold: Algorithm and Optimizations . . . 190
Algebraic Simplifications . . . 191
Local Gain. . . 191
Admissibility Condition. . . 191
Global Gain. . . 191
Improvement of the Admissibility Check . . . 192
Reducing the Problem Size . . . 193
4.2.4 Computing the Index Function . . . 194
4.2.5 Complexity . . . 194
Value-Determination Operation (Solving J_θ) . . . 194
Policy-Improvement Routine . . . 194
Finding Θ(R) . . . 194
Dichotomy . . . 195
Benchmarks . . . 196
4.2.6 Numerical Experiments . . . 197
Impact of the Architecture . . . 197
Impact of the Job Width Distribution . . . 198
Robustness on Load Variations . . . 199
Robustness on Job Width Distribution . . . 200
4.3 Batch Arrivals with Known Sizes . . . 203
4.3.1 Mathematical Model . . . 203
4.3.2 Bellman’s Equation . . . 203
4.3.3 Algorithms . . . 206
Computing the Best Thresholds . . . 206
Computing the Index . . . 207
4.3.4 Simulations . . . 208
4.4 Batch Arrivals with Synchronous Departures . . . 214
4.4.1 State Transitions . . . 215
Job Arrival . . . 215
Process Departure . . . 215
4.4.2 Alternative Formulation . . . 216
Case X > 0 . . . 216
Case X = 0 . . . 216
4.4.3 Bellman’s Equation . . . 216
4.4.4 Algorithm . . . 217
4.5 Parallelization . . . 219
4.5.1 Interval Division . . . 220
Avoiding Duplication . . . 222
Improving the Interval Division . . . 222
4.5.2 Pool of Rejection Cost Values . . . 223
Computing Processes . . . 223
Coordinator Process . . . 224
Pool of Values . . . 224
Distributed Dichotomy . . . 225
Cleaning the Pool . . . 225
Filling the Pool . . . 226
4.6 Conclusions . . . 228
5 Contribution, Future Works and Conclusion 229
5.1 Summary of Contribution . . . 229
5.1.1 Random Brokering . . . 229
5.1.2 Index Brokering . . . 230
5.2 Open Questions and Future Work . . . 232
5.2.1 Random Brokering (Chapter 2) . . . 232
Standard Deviation and Error . . . 232
Missing Experimentations . . . 232
Other Distributions . . . 233
Non-Saturated Parallel Systems . . . 233
Asynchronous and Semi-Synchronous Systems . . . 233
Slowdown on Submitted Jobs . . . 233
Missing Formal Proofs . . . 233
5.2.2 Index Brokering (Chapter 3) . . . 234
Alternative Cost Strategy . . . 234
Advanced Realistic Simulations . . . 235
Implementation in Real/Production Environment . . . 235
Proof of Conjecture 3.12 . . . 235
5.2.3 Batch Jobs (Chapter 4) . . . 235
Improvement of BatchK Algorithm . . . 235
Optimal Strategy . . . 236
Semi-Synchronous Systems (BatchS) . . . 236
Parallelization Implementation . . . 236
Full Proof of Convexity . . . 236
5.3 Final Conclusion . . . 237
A Splitting of Stochastic Process 241
B Used CPUs: Parallel Case 243
B.1 Case s = 2 . . . 243
B.2 Case s = 3 . . . 245
B.3 General Case . . . 246
B.4 Worst Case . . . 249
B.5 Equidistributed Case . . . 251
C Convexity 259
C.1 Sequential Case . . . 262
C.1.1 Case x < B − 2 . . . 263
C.1.2 Case x = B − 2 . . . 265
C.2 Batch Case . . . 266
C.2.1 Case x < B − 2 . . . 266
C.2.2 Case x = B − 2 . . . 268
C.2.3 Case x = B − 1 . . . 269
C.2.4 Case B ≤ x ≤ B + K − 3 . . . 270
D Solving a K + 2-diagonal System 271
References 273
Webography . . . 273
Personal Bibliography . . . 273
General Bibliography . . . 274
Symbols 281
Index 284
Chapter 1
Introduction to Grid Brokering
Chapter Abstract. This chapter introduces the framework in which our work has been undertaken. We present general definitions about grid computing, and more specifically about grid brokering. We establish our mathematical model of a computational grid and give the definitions needed for the remainder of this document.
Chapter Contents
1.1 Motivations and Context
1.2 Definitions and Grid Modeling
1.3 Brokering
1.4 Examples
1.5 Cost Function
1.6 Concepts
1.1 Motivations and Context
When a scientist needs to perform some small computations, he typically launches a dedicated application on a personal computer, which gives the required results. We might say that a user sends a job to a server, and gets back the results of its computation. This model applies as long as the need for computational power is not too high. Once usage reaches the maximal power of the machine, we may consider two ways of tackling the increasing need for computational power. First, buy a new machine, with better performance, more powerful processors, or more processors (e.g. Massively Parallel Processors). Of course, this is limited by the rate at which engineers improve CPU performance or memory access. The second way consists in gathering several machines and making them work together.
Infrastructures grouping several (often homogeneous) machines, such as simple personal computers, with some coordination mechanisms, are generally named clusters. According to Buyya [30]:
a cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone computers working together as a single, integrated computing resource.
Clusters are generally owned by organisations such as labs, universities, or industrial companies, and are shared by several users. They contain between a few and several thousand machines. The network interconnecting cluster machines is usually a high-performance LAN (Local Area Network), which allows fast communications and data transfers.
Even if there is no theoretical limitation on the number of machines, clusters have two main disadvantages: first, large clusters are very expensive, and very difficult to manage. Secondly, as the experiments for which a cluster has been set up evolve, such a parallel system is often underused in some periods and saturated in others. Those reasons, as well as many others, led scientists to the idea of Grids: if computers can be put together to form a cluster, clusters could be put together to form some kind of cluster of clusters, or meta-cluster, or Grid.
A Grid is typically an infrastructure coordinating several clusters spread around a country, or around the world, hosted by partners of a common project, and communicating through a WAN (Wide Area Network), such as the Internet or some rented links.
As grids are often oriented towards scientific applications, they generally contain not only clusters, but also mass storage devices for storing experimental results or various kinds of data. Storing huge amounts of data in grid systems requires solving transfer, duplication, security, scheduling and other management issues. Due to the slowness and
the unreliable nature of WAN communication channels, those fields are often far more difficult than they were in clusters, with their fast and controllable networks.
Grids have become prevalent infrastructures for intensive computational tasks. The word "Grid" is probably now one of the top five terms in computer science, and the whole scientific community agrees that physicists, biologists, and even mathematicians will not be able to tackle tomorrow's problems without such computational systems.
But what exactly is a Grid? Ian Foster, one of the so-called "fathers of the grid" [39], describes on his personal webpage a Grid as:
a system that coordinates resources that are not subject to centralized con- trol, using standard, open, general-purpose protocols and interfaces, to de- liver nontrivial qualities of service.
With that definition, a lot of distributed systems qualify as grids. In this work, we mainly focus on Grids as a set of resources, such as clusters, massively parallel processor machines, or mass storage devices, linked together through some wide area network and managed at a high level by some middleware.
There are mainly two kinds of grids. In the first, the large majority of the work is number crunching (or CPU-bound), and requires neither large amounts of data nor heavy network communication. Such grids are usually called Computational Grids. The second class regroups systems where data is the center of everything; jobs are data-intensive (or IO-bound). They require storing huge amounts of data (often Petabytes), which of course regularly need to be transferred between computational or storage resources. These grids are named Data Grids. This work mainly focuses on the first family, Computational Grids.
Ideally, Grids should be as easy to use as a simple computer, in the same way that the Internet is as easy to use as if the data were locally available, or that the electric power grid (from which the Grid takes its name) is as easy to access as if the power station were just behind the wall.
Nevertheless, even if Grids are on their way to becoming efficient and user-friendly systems, computer scientists and engineers still have a huge amount of work to do to improve their efficiency. Among the large number of problems to solve or improve upon, the problem of scheduling the work and balancing the load is of prime importance.
The main subject of this thesis is this last problem: how to efficiently or fairly distribute the work across a Computational Grid? This task is usually called Meta-Scheduling (as opposed to Scheduling, which treats the same subject at a local level) or Brokering. Even if some authors consider the brokering task not to be essential, we will show that the brokering policy can drastically change system performance, and not necessarily in an expected way.
1.1.1 Outline
This work is mainly split into two parts. After introducing the mathematical framework on which the remainder of the manuscript is based, we study in Chapter 2 systems where the grid brokering is done without any feedback information, i.e. without knowing the current state of the clusters when the resource broker (the grid component performing the brokering) makes its decision. We show how a computational grid behaves if the brokering is done in such a way that each cluster receives a quantity of work proportional to its computational capacity.
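As a rough illustration (a sketch of our own, not the thesis's exact formulation), such a proportional policy can be seen as a Bernoulli routing whose probabilities follow a capacity measure; the measure s_i·µ_i and all names below are illustrative assumptions:

```python
import random

def proportional_probabilities(sizes, speeds):
    """Routing probability of each cluster, proportional to an assumed
    capacity measure s_i * mu_i (number of CPUs times CPU rate)."""
    capacities = [s * mu for s, mu in zip(sizes, speeds)]
    total = sum(capacities)
    return [c / total for c in capacities]

def route(probabilities, rng=random):
    """Draw the index of the cluster receiving the next incoming job."""
    return rng.choices(range(len(probabilities)), weights=probabilities)[0]

# Three hypothetical clusters: sizes s_i and rates mu_i chosen so that
# all capacities equal 4, hence uniform routing probabilities.
sizes, speeds = [4, 8, 2], [1.0, 0.5, 2.0]
p = proportional_probabilities(sizes, speeds)
print(p)  # each cluster gets probability 1/3
```

Note that this routing uses no feedback at all: the probabilities depend only on the static cluster parameters, never on queue states.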
This part is based on several publications, in collaboration with Joël Goossens (ULB) and Emmanuel Jeannot (Loria, Nancy): my DEA thesis [15], a conference paper summarizing my DEA thesis (ISPA04, Hong Kong) [19], a journal paper extending our model to heterogeneous systems (IEEE Transactions on Parallel and Distributed Systems, 2006) [22], as well as some technical reports (INRIA and ULB) [21, 20]. Notice that, with the same authors, we also have a contribution on fault-tolerant Real-Time systems [23], but as this research is out of the scope of the present work, we have not included it here.
The second part of this work (Chapters 3 and 4) is rather independent from the first one, and presents a brokering strategy, based on Whittle's indices, that aims to minimize the average sojourn time. We show how efficient the proposed strategy is for computational grids, compared to the strategies commonly used in production systems. We also show its robustness to several parameter changes, and provide several very efficient algorithms for the computations required by this index policy.
This second part is the fruit of a collaboration with Bruno Gaujal (IMAG, Grenoble). On this subject, started by my three-month stay at IMAG in April 2005, we have one journal paper evaluating the performance of index brokering on realistic workloads (Parallel Computing, Elsevier) [17], one conference paper extending our first model to batch jobs (NetCoop07, Avignon) [18], as well as an INRIA research report [16].
1.2 Definitions and Grid Modeling
We will now formalize several concepts and give definitions about Grids.
In this work, we consider a simple yet fairly realistic model of a computational grid. Our analysis is mainly focused on computational grids: we assume that network latencies and transfer times are negligible compared to computation times, and we thus do not take data localization into account.
The model we now present may seem in some ways quite far from reality. However, we believe that models which represent reality exactly are, at least in the case of Grids, mathematically intractable: analyses, predictions, or strategies cannot be provided in such models due to their complexity. This is why we prefer to consider a simpler model, even if we lose some realism, on which some theoretical analysis is possible. Nevertheless, there are plenty of examples in the literature where theoretical models make it possible to improve the performance of real systems, or to predict their behavior, even when the model simplifies the real environment. Indeed, in this work we will give such an example, where a simple model allows us to find a very efficient strategy for a realistic model.
The aim of most computational systems is to run jobs, and this is true for Grids as well. As the literature gives a lot of definitions around this term (job, process, task, subprocess, program, thread, application, operation...), we first need to clarify the concepts which will be at the center of our concern in this work.
Definition 1.1 (Process)
A Process is a sequence of computing operations running on a single processor.
We do not consider in this work the subdivision of a process into several threads, as is sometimes done.
Definition 1.2 (Job)
A Job is a set of one or several process(es).
A job is said to be uniprocess (or sequential) when it is composed of one process, and multiprocess (or parallel) when it is composed of several (possibly one) processes.
As a computational grid is generally described as a set of Computing Elements, we need to define such a component.
Definition 1.3 (Computing Element, Cluster)
A Computing Element (also called a Cluster) is a set of CPUs or servers, and a single queue, both managed by a scheduler using a specific scheduling policy.
A Computing Element receives jobs, and runs processes on its CPUs.
In this work, we will only consider homogeneous Computing Elements, composed of identical CPUs.
Definition 1.4 (Client)
A Client is a grid user who sends jobs to a grid, in order to run them on some CPU, and to get back results.
One may consider that a grid has only one client, as a single client can emulate the work of several clients.
While CPUs need to be managed within a Computing Element (CE), CEs need to be managed within a grid. This is done by a kind of orchestra conductor, usually named the Resource Broker.
Definition 1.5 (Resource Broker, Router, Meta-scheduler)
A Resource Broker (also called a Router or a Meta-scheduler) is a grid component receiving jobs from Clients and sending each of them to a chosen Computing Element.
Definition 1.6 (Brokering, Routing, Meta-scheduling)
Brokering is, in a computational grid, the action of choosing the Computing Element to which an incoming job is sent. The Brokering policy is the way of choosing such an action.
We now have every element we need to provide a first definition of a Grid.
Definition 1.7 (Grid)
A Grid is a set of Computing Elements, linked together through a Resource Broker, receiving jobs from one or several clients.
Definition 1.8 (Computational Grid)
A Computational Grid, in contrast to a Data Grid, is a Grid whose usage is computation-oriented. Data transfers, data duplication, cache coherency management, large database accesses, etc. are considered negligible with respect to the time spent in actual computations.
In the remainder of this work, we will only focus on Computational Grids.
The next two sections give formal definitions of the concepts described more verbosely above. In the next section (1.2.1), we present a model inspired by the queuing theory community [51, 38, 62]. Then, we adapt this model in Section 1.2.2, following a point of view coming from the scheduling community [56, 30].
We finish by comparing those two models, showing that in some respects, they are not that different.
1.2.1 Queuing Model
We can now give more formal definitions of several concepts evoked in the previous section. Figure 1.1 shows a general overview of the grid structure model we consider here. The left figure gives a rather high-level point of view, while the right one goes deeper into the queuing model.
Figure 1.1: Two models of a Computational Grid with Resource Broker. The Grid is composed of N Computing Elements (or queues), the i-th queue being composed of s_i CPUs (or servers) of speed (or rate) µ_i. The system input, with rate λ, is routed amongst the N queues.
Definition 1.9 (CPU, Server)
A CPU or a server is a component able to perform computing operations, at a speed given by its service rate µ, with 0 < µ < ∞.
Definition 1.10 (Process)
A Process p is a set of computing operations which, when performed by a CPU of rate µ, will use it continuously for, on average, 1/µ units of time.
As is classically done in queuing theory, processes do not really have a pre-defined execution time; the execution time is fully determined by the server on which the process runs. For instance, if a server provides an exponentially distributed service time, one may consider that at each infinitesimal period of time, the server decides whether the process continues or stops its execution, following an exponential distribution.
From now on, servers are assumed to run processes without preemption (a process cannot be interrupted in order to run another process on the same server), and without migration (once a process has started on a server, it stays on this server until the end of its execution).
As we shall study stochastic workloads, the execution time is obtained from a random variable, by rolling a (continuous) die. As long as we use the same random variable, and if the process lengths are not taken into account for any scheduling decision, it is equivalent to assume the die is rolled by the server when the job starts, or by the client at submission time. In the first model, inspired by the queuing theory community, we assume the execution time is chosen by the server starting the process, or even during its execution. In the second model, the execution time is chosen by the client submitting the job.
From the last definition, a process thus does not have a true (absolute) execution time. Its effective execution time is determined by the server (or CPU) on which it runs. If a server has a rate of µ, this means that the average (effective) execution time of processes running on this server is 1/µ. For instance, in the case of exponentially distributed execution times, the rate of the distribution will be µ.
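For instance (a minimal sketch, with an arbitrary rate µ = 2), sampling exponentially distributed execution times "rolled by the server" gives an empirical mean close to 1/µ:

```python
import random

mu = 2.0                      # illustrative service rate of the server
rng = random.Random(42)       # fixed seed for reproducibility

# The server "rolls the die": execution times are exponentially
# distributed with rate mu, hence with mean 1/mu.
samples = [rng.expovariate(mu) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(mean)  # close to 1/mu = 0.5
```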
Definition 1.11 (Job)
A Job is a set of one or several processes, arriving together at a given submission time, and having to be sent to the same CE. The number of processes composing a job is named its width.
Definition 1.12 (Sequential Job, Uniprocess Job)
A Sequential Job (or Uniprocess Job) is a job composed of only one single process (job width=1).
In the case of sequential jobs, we will not differentiate jobs and processes, and we will consider jobs as having their own execution time, since a job has the execution time of its unique process.
In the case of parallel jobs, the execution time of a job can also be defined, but needs some convention. It could be, for instance, the sum or the maximum of its processes' execution times, or the total time during which at least one of its processes was running. Notice that in this last case, the job execution time depends upon the scheduling, which is not always convenient. In this work, we do not consider the execution time of jobs.
Figure 1.2: Execution time of parallel jobs. Left: Asynchronous, Center: Semi-Synchronous, Right: Fully-Synchronous.
Definition 1.13 (Parallel Job, Multiprocess Job)
A Parallel Job (or Multiprocess Job) is a job with one or several independent processes.
By independent processes, we mean that there are neither precedence dependencies nor common resources (and thus no communication) between processes.
We do not require parallel jobs to have more than one process; a sequential job is thus a particular case of a parallel job.
Definition 1.14 (Asynchronous Parallel Job)
An Asynchronous Parallel Job is a parallel job for which, once in a cluster queue, every process is independent from other processes.
Definition 1.15 (Semi-Synchronous Parallel Job)
A Semi-Synchronous Parallel Job is a parallel job with the constraint that its processes have to start their execution simultaneously, but are independent afterwards.
Definition 1.16 (Fully-Synchronous Parallel Job)
A Fully-Synchronous Parallel Job is a parallel job with the constraint that its processes have to start their execution simultaneously, and release the CPUs only when all its processes are completed.
Definition 1.17 (Synchronous Parallel Job)
A Synchronous Parallel Job is a parallel job which is either semi-synchronous or fully-synchronous.
In this work, we assume that jobs are independent, i.e., they do not share common resources except CPUs, and there are no precedence constraints between jobs, as we assumed for processes.
Definition 1.18 (Job Flow)
A Job Flow F_{A_λ, W_K} is an infinite set of jobs, for which:
• The inter-arrival delay follows the probability density function¹ A_λ, with 0 < λ < ∞ and E[A_λ] = λ⁻¹;
• The job width is an integer between 1 and K, distributed according to the law W_K.
Notice that another, very similar model could have been considered: we could have K arrival processes, process k following a law A_{λ_k} and submitting only jobs of width k, the K arrival processes being merged before entering the system. In order to make both models comparable, the λ_k need to be chosen accordingly: if w_k = P[W_K = k], we should have λ_k = λ·w_k. In the case of Poissonian arrivals, where merging K processes of arrival rates λ_k is equivalent to a Poissonian process of arrival rate Σ_k λ_k, and as a given arrival in the second model has a probability w_k of corresponding to a job of width k, both models are equivalent.
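The Poissonian case can be checked numerically. This sketch (all parameter values are illustrative choices of ours) merges K = 3 streams of rates λ_k = λ·w_k and recovers the global rate λ:

```python
import random

rng = random.Random(1)
lam = 3.0                     # global arrival rate lambda (illustrative)
w = [0.5, 0.3, 0.2]           # W_K: distribution of the K = 3 job widths
lam_k = [lam * p for p in w]  # per-width rates, lambda_k = lambda * w_k

def poisson_arrivals(rate, horizon, rng):
    """Arrival instants of a Poisson process of the given rate on [0, horizon]."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)   # exponential inter-arrival delays
        if t > horizon:
            return times
        times.append(t)

horizon = 10_000.0
merged = sorted(t for r in lam_k for t in poisson_arrivals(r, horizon, rng))
rate_estimate = len(merged) / horizon
print(rate_estimate)  # close to lam = 3.0, since sum(lam_k) = lam
```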
Definition 1.19 (Computing Element)
A Computing Element C_i is
• a set of s_i homogeneous servers (or CPUs) having a service rate µ_i;
• with a queue (or buffer) having a capacity B_i − s_i, where jobs are stored while there are not enough free servers to run them; the capacity of C_i will be B_i;
• running jobs following a scheduling policy S_i.
A Computing Element C_i is then identified by the tuple {s_i, µ_i, B_i, S_i}.
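A computing element is thus a plain record; here is a minimal sketch of Definition 1.19 (field and class names are our own, not the thesis's notation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComputingElement:
    """C_i, identified by the tuple {s_i, mu_i, B_i, S_i}."""
    s: int         # number of homogeneous servers (CPUs)
    mu: float      # service rate of each server
    B: int         # total capacity: running + queued jobs
    policy: str    # local scheduling policy, e.g. "FCFS"

    @property
    def queue_capacity(self) -> int:
        # The queue (buffer) alone holds B_i - s_i jobs.
        return self.B - self.s

ce = ComputingElement(s=16, mu=1.5, B=64, policy="FCFS")
print(ce.queue_capacity)  # 48
```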
The capacity of a computing element (CE) means that if the CE already contains as many jobs as its capacity, any incoming job sent to it will be rejected. Of course, as much as possible, a resource broker (RB) should avoid sending a job to a CE that has already reached its capacity, but, as we will see later, an RB may not be informed of such a situation.
In the following, the local scheduling policy S will be by default FCFS (First Come First Served, see [49] for instance), also known as FIFO (First In First Out), and will be the same for every Computing Element. The principle of the FCFS scheduling policy is to only
¹ The probability density function f : R → R⁺ of a random variable expresses its density of probability; the area under the curve of f between two abscissas a and b is the probability that a drawing of the random variable falls between those two numbers.
consider the job at the front of the queue. If there are enough available servers for this job to be executed, the job starts. If not, the CE waits until enough servers are freed (other jobs are not considered meanwhile).
Figure 1.3: Example of scheduling, with seven jobs arriving successively, having 3, 2, 2, 2, 4, 2 and 2 processes. Top: FCFS scheduling, Bottom: FF scheduling.
Here are some definitions allowing us to characterize a scheduling policy.
Definition 1.20 (Eligible)
A synchronous parallel job of width n is said to be Eligible if at least n CPUs are free in the same CE.
A process belonging to a sequential or an asynchronous parallel job is said to be Eligible if at least one CPU is free.
Definition 1.21 (Scheduling Policy)
A Scheduling Policy (or a Scheduling Strategy) is a function which gives, for any cluster configuration and if the queue is not empty, the next job (in the case of parallel synchronous jobs) or process (in the case of sequential or parallel asynchronous jobs) to start, eligible or not. The chosen job/process is named the highest priority job or the highest priority process.
Here, the cluster configuration contains any information about the cluster that is available to the scheduler. This can simply be the number of running and waiting jobs or, in more complex systems, the arrival, start and/or (expected) end time of any job.
Notice that this definition requires the scheduler to be unambiguous: for any cluster configuration, there is exactly one highest priority job/process (unless the queue is empty). But this does not prevent the scheduler from starting several jobs/processes at the same time: once a job is launched, the cluster configuration changes, and another job can get the highest priority at the same instant.
Moreover, the job/process returned by the scheduling policy is not necessarily the next one to effectively start: it could happen that, at some time, the highest priority job/process is not eligible, and that the configuration changes in such a way that the policy gives the highest priority to another job/process before the previous highest priority job/process has started.
Definition 1.22 (Greedy)
A scheduling policy is said to be Greedy (also called expedient) if it never leaves any resource idle intentionally. If a system runs a greedy policy, a resource is idle only if there is no eligible job waiting for that resource.
Notice that, in the parallel case, FCFS or Backfilling [61] are not greedy, while FF (First Fit, see below) is. In the sequential case, FCFS, which corresponds to FF, is greedy.
Definition 1.23 (ASAP)
An ASAP scheduler, standing for as soon as possible, is a scheduler which starts any job/process at the first instant at which this job/process is both the highest priority job/process and eligible.
Remark that, in the case of non-preemptive systems, non-ASAP schedulers could be more efficient than ASAP ones. For instance, a strategy could choose to delay the scheduling of a job, hoping for the imminent arrival of a "better" job. However, in the remainder of this work, we only consider ASAP schedulers.
In the case of synchronous parallel jobs, another scheduling policy will be considered in this document: FF, standing for First Fit [40]. Under this policy, the first eligible job in the queue starts as soon as possible. FF and FCFS are of course identical in the case of sequential jobs or asynchronous parallel jobs. A strategy based on FF and using advance reservation is used in OAR, the batch scheduler of Grid'5000 (see [6] and Section 1.4.3). An example of both scheduling policies is given in Figure 1.3 for fully synchronous parallel jobs.
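The difference between the two queue-inspection rules can be sketched as follows. This is a minimal Python illustration with hypothetical job widths, not the scheduler used in the experiments:

```python
def next_job_fcfs(queue, free_cpus):
    """FCFS: only the job at the front of the queue is considered."""
    if queue and queue[0] <= free_cpus:
        return 0
    return None  # the head blocks the whole queue

def next_job_ff(queue, free_cpus):
    """FF: the first job narrow enough to fit the free CPUs starts."""
    for i, width in enumerate(queue):
        if width <= free_cpus:
            return i
    return None

queue = [3, 2, 2]  # widths (in CPUs) of the waiting jobs
free = 2           # only 2 CPUs are currently free
print(next_job_fcfs(queue, free))  # None: the 3-wide head waits
print(next_job_ff(queue, free))    # 1: the first 2-wide job starts
```

As the example shows, FF is greedy (it never leaves the 2 free CPUs idle while a 2-wide job is waiting), while FCFS is not.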
It is because we look at ASAP systems (see Definition 1.23) that a synchronous system is not a particular case of an asynchronous system. Indeed, if we do not impose the system to be ASAP, an asynchronous policy can choose not to start the processes of a job as soon as possible. In particular, the policy could choose to start every process of the same job simultaneously.
Very often, today's cluster schedulers do not use FF or FCFS, but more sophisticated techniques, using preemption (a process can be interrupted and resumed later), migration (a process can be interrupted on a server and resumed on another one, either in the same cluster, intra-cluster migration, or in another cluster, inter-cluster migration), dynamic priority systems, etc. See for instance [49] for more details. In this thesis, we do not focus on cluster performance, since our aim is to compare various brokering strategies. It is reasonable to assume that the comparison between strategies will not be much affected by a better low-level scheduling strategy, because every cluster would improve its performance by a similar ratio. This assumption can of course only be made if the local strategies are not drastically different from FCFS and show only marginal divergences.
For instance, comparing brokering strategies when the local scheduling policy is Last Come First Served will probably not lead to the same conclusions as when the local scheduling policy is FCFS. This is why we only consider simple scheduling strategies and non-preemptive systems.
Based on the last definition, we can now formally define the mathematical object we name a Grid System, formalizing the notion of a Computational Grid.
Definition 1.24 (Grid System)
A Grid System G is a tuple {{C_i}_{i∈[1,...,N]}, F^{A_λ,W_K}, β}, where
• {C_i}_{i∈[1,...,N]} is a set of N Computing Elements C_i = {s_i, µ_i, B_i, S_i};
• F^{A_λ,W_K} is the arrival job flow;
• β is the Brokering policy.
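The tuples of Definitions 1.19 and 1.24 translate directly into data structures. The sketch below uses illustrative field names that are not taken from the thesis:

```python
from dataclasses import dataclass

@dataclass
class ComputingElement:
    servers: int           # s_i: number of homogeneous CPUs
    service_rate: float    # mu_i: service rate of each CPU
    capacity: int          # B_i: jobs held at once (queued + running)
    policy: str = "FCFS"   # S_i: local scheduling policy

    def accepts(self, jobs_present: int) -> bool:
        """A job arriving when the CE is at capacity is rejected."""
        return jobs_present < self.capacity

@dataclass
class GridSystem:
    elements: list         # {C_i}: the N Computing Elements
    arrival_rate: float    # lambda of the job flow F^{A_lambda, W_K}
    broker: str            # beta: the brokering policy

ce = ComputingElement(servers=4, service_rate=1.0, capacity=10)
print(ce.accepts(9), ce.accepts(10))  # True False
```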
A formal definition of brokering will be given later in this document.
Figure 1.1 (page 19) gives two visions of a computational grid. On the left-hand side, the Resource Broker (RB) and the Computing Elements (C_i) are seen as black boxes. Clients send jobs to the RB, which dispatches them to some C_i according to its routing policy. The right figure gives a more "queuing theory" oriented view: a stream of jobs with rate λ enters the system and is split in such a way that each job is sent to one Computing Element, which can be seen as a buffer (waiting queue) associated with some servers (CPUs). The s_i servers of queue C_i have a service rate of µ_i.
1.2.2 Scheduling Model
From a scheduling point of view, each incoming job j is composed of one or several processes p, each having a virtual execution time ℓ_p (or virtual execution length), chosen by the client before submitting the job, possibly from a probability distribution. This virtual execution time means that, on a processor of relative speed µ = 1, the process p would take ℓ_p units of time. Then, on a CPU of any relative speed µ, a process p will have an effective execution time of ℓ_p/µ. Or, if a process runs during ℓ units of time on a CPU of speed µ₁, it will run during ℓ·µ₁/µ₂ units of time on a µ₂ CPU.
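The speed-scaling rule can be written down directly; this is a trivial sketch whose function names are ours:

```python
def effective_time(virtual_length: float, speed: float) -> float:
    """Effective execution time l_p / mu on a CPU of relative speed mu."""
    return virtual_length / speed

def convert(run_time: float, mu1: float, mu2: float) -> float:
    """Time on a mu2 CPU for a run that took run_time on a mu1 CPU."""
    return run_time * mu1 / mu2

print(effective_time(6.0, 2.0))  # 3.0: twice the speed, half the time
print(convert(3.0, 2.0, 1.0))    # 6.0: back on a unit-speed CPU
```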
If the broker's choices are made without taking job and/or process lengths into account, choosing the process length at submission time (scheduling model) or when the job starts (queuing model) is equivalent. And if the local scheduling decision (choosing the CPU) is also made without any information about lengths, the process length can even be chosen when the process starts.
These points of view are therefore not incompatible; from a "macroscopic" vision, the scheduling model with an average virtual execution time of 1, say with a distribution D such that E[D] = 1, will behave the same way as the queuing model with a rate µ, if the execution distribution is D′ such that f_D(m) = f_{D′}(m/µ) ∀m ∈ R⁺, where f_D and f_{D′} are the probability density functions of D and D′. Notice that compelling the average virtual execution time to be equal to 1 is not restrictive: it only constrains the choice of the time unit so as to reach a unitary average. If the virtual execution times are scaled, relative speeds are scaled accordingly.
We assume in our model that the execution time (or its average) does not depend on environmental factors, such as data transfers or communication costs, nor on local factors such as boot time, migrations, preemption or other scheduling costs.
As for the service rate µ, one can give two interpretations of the arrival rate λ.
From a queuing theory point of view, we have the concept of job stream, and λ can be seen as the inverse of the inter-arrival average delay. For instance, if arrivals follow a Poisson process, λ is the rate of this process.
From a scheduling point of view, we have an infinite set of job ids J (J ≜ N* = {1, 2, ...})², and for each j ∈ J, a_j is the submission (or arrival) time of job j. The arrival rate λ is considered as the inverse of the average inter-arrival delay. In other words, {a_j | j ∈ J} is such that

lim_{t→∞} [ Σ_{j∈J, a_j<t} (a_j − a_{j−1}) ] / ‖{j ∈ J | a_j < t}‖ = λ⁻¹, with a_0 ≜ 0.
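This limit can be observed empirically. The sketch below (hypothetical rate, Poissonian arrivals) checks that the average inter-arrival delay approaches λ⁻¹:

```python
import random

random.seed(1)
lam = 4.0                 # arrival rate lambda (illustrative value)
a = [0.0]                 # a_0 = 0
for _ in range(200_000):  # a_j = a_{j-1} + exponential inter-arrival
    a.append(a[-1] + random.expovariate(lam))

deltas = [a[j] - a[j - 1] for j in range(1, len(a))]
mean_delta = sum(deltas) / len(deltas)
print(mean_delta)         # close to 1/lam = 0.25
```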
As λ < ∞ (and thus λ⁻¹ > 0), we can assume³ without loss of generality that J is sorted by submission time, meaning that ∀i < j, a_i ≤ a_j. This schema could help to
² The symbol ≜ means "is by definition" or "is defined as".
3