Université libre de Bruxelles Institutional Repository
PhD Thesis, APA citation:
Berten, V. (2007). Stochastic approach to brokering heuristics for computational grids (Unpublished doctoral dissertation). Université libre de Bruxelles, Faculté des Sciences – Informatique, Bruxelles.
Available at permalink: https://dipot.ulb.ac.be/dspace/bitstream/2013/210707/4/31518d53-cb07-4a2a-82e8-2e89b820024c.txt
This PhD thesis has been digitized by the Université libre de Bruxelles. Any author who objects to its being made available online in DI-fusion is invited to contact the University (di-fusion@ulb.be).
If a native electronic version of the thesis exists, the University can guarantee neither that the present digitized version is identical to the native electronic version, nor that it is the definitive official version of the thesis.
DI-fusion is the Institutional Repository of Université libre de Bruxelles; it collects the research output of the University, available on open access as much as possible. The works included in DI-fusion are protected by the Belgian legislation relating to authors’ rights and neighbouring rights.
Any user may, without prior permission from the authors or copyright owners, for private use or for educational or scientific research purposes, to the extent justified by the non-profit aim pursued, read, download or reproduce on paper or on any other medium the articles or fragments of other works available in DI-fusion, provided:
The authors, title and full bibliographic details are credited in any copy;
The unique identifier (permalink) for the original metadata page in DI-fusion is indicated;
The content is not changed in any way.
It is not permitted to store the work in another database in order to provide access to it; the unique identifier (permalink) indicated above must always be used to provide access to the work. Any other use not mentioned above requires the authors’ or copyright owners’ permission.
ULB
FACULTÉ DES SCIENCES
Stochastic Approach to Brokering Heuristics for Computational Grids
Thesis presented for the degree of Doctor of Sciences
Thesis supervisor: Joël Goossens
Université Libre de Bruxelles
Vandy BERTEN
Academic year 2006-2007
Thesis publicly defended in Brussels, 8 June 2007.
Acknowledgements
A work of such scope is never a personal achievement. Even if a single person puts his name on the cover, he is very far from being its only author. I could therefore not begin this work without thanking those who, as much as myself, are at its origin.
Among all those who took part in these four years of labour, one has certainly contributed more than any other: Joël Goossens, my thesis supervisor. He chose to trust me nearly five years ago, by agreeing first to supervise my master's thesis, and then my doctoral thesis. Our weekly meetings, his regular readings of my writings even when I strayed from his main research topics, his insistence on having me meet foreign researchers, his rigour and his experience steered my work in a direction which, I hope, has proved worthy of the trust he placed in the first "PhD student" I had the privilege of being.
A few months after the beginning of my thesis, I had the chance to meet the AlGorille team in Nancy, and more particularly Emmanuel Jeannot, who welcomed me on several occasions over these last years. Our varied discussions allowed me to broaden my interests, and I can only be grateful to him.
Later, I had the pleasure of collaborating with several researchers of the IMAG in Grenoble, and more especially with Bruno Gaujal, to whom I owe the ideas underlying the second part of this work. He did me the honour of hosting me several times, for a total of nearly four months. Besides the delight of life at the foot of the mountains, these stays in his team taught me an enormous amount about the world of research and collaboration. I would also like to thank Jean-Marc Vincent and Jérôme Vienne, for their perfect simulation tool, which they adapted many times to the needs of our experiments. It is moreover thanks to Bruno that I was able to access Grid'5000, and to consume nearly 40,000 CPU hours there.
All my gratitude also goes to Raymond Devillers, for his many meticulous re-readings of my writings, and all the fascinating discussions that followed. As many members of the department can testify, he obviously contributed greatly to the scientific rigour, the precision, and the handling of corner cases. There is the thesis before Devillers, and after Devillers...
I certainly have to thank all those who, besides Messrs Goossens, Jeannot, Gaujal and Devillers, agreed to be part of the jury: Olivier Markowitch and Guy Louchard of the ULB, and Pierre Manneback of the FPMs. I thank more particularly Mr Louchard, to whom I owe a large part of the proof in Appendix B of this work.
Outside the academic world, it goes without saying that I owe a great deal, not to say everything, to my parents, my two brothers and my sister; curiosity, the taste for discovery, the interest in science, the pursuit of precision, the pleasure of shared knowledge: these are all values I owe to them, and without which this work could not have come to fruition.
Of course, I do not forget either those who surrounded me during these last years: bémelois, colonnards, cousins, taizéens, members of the DI, and many others!
CONTENTS 5
Contents
1 Introduction to Grid Brokering 13
1.1 Motivations and Context... 14
1.1.1 Outline... 16
1.2 Definitions and Grid Modeling... 17
1.2.1 Queuing Model... 19
1.2.2 Scheduling Model... 25
1.2.3 System Load... 28
1.2.4 System State... 30
1.2.5 Underlying Markov Chain... 31
1.3 Brokering... 33
1.3.1 Open-loop and Closed-loop... 33
1.3.2 Memoryless and Historical Information... 34
1.3.3 Deterministic and Probabilistic... 34
1.3.4 Mathematical Model... 35
Probabilistic Memoryless Open-loop Brokering... 35
Deterministic Memoryless Closed-loop Brokering... 36
1.3.5 Bernoulli Brokerings... 36
1.3.6 LCB Policies... 38
1.4 Examples... 41
1.4.1 EGEE... 41
1.4.2 NorduGrid... 43
1.4.3 Grid'5000... 43
1.4.4 GridBus... 44
1.5 Cost Function... 45
1.6 Concepts... 48
1.6.1 Resolving Markov Decision Processes Using Dynamic Programming... 48
Minimizing the Policy... 50
Value Iteration... 50
Policy Iteration... 51
1.6.2 Perfect Simulation... 51
2 Random Brokering 53
2.1 Introduction and Model... 54
2.1.1 Dispatching the Jobs... 55
2.1.2 Numerical Simulations... 56
2.2 Sequential Systems... 58
2.2.1 System Load... 58
2.2.2 Queue Size... 59
Case v = 1... 59
Case v < 1... 60
Case v > 1... 64
Arrivals... 64
Departures... 65
Number of Jobs in the System... 67
Experimental Results... 69
2.2.3 Used CPUs... 74
2.2.4 Resorption Time... 75
2.2.5 Slowdown... 76
Job Length Distributions... 77
Case v < 1... 79
Case v > 1... 81
Slowdown for a Job Submitted at Time θ... 82
Average Slowdown Until the Last Finished Job With v > 1... 85
Experimental Results... 89
2.3 Fully-synchronous Parallel Systems... 93
2.3.1 Queue Size... 95
Case v < v̂1... 95
Case v > v̂1... 95
Experimental Results... 97
2.3.2 Used CPUs... 99
Case v < v̂1... 99
Case v > v̂1... 99
2.3.3 Resorption Time... 102
2.3.4 Slowdown... 103
Experimental Results... 106
2.4 Conclusion... 108
2.4.1 Summary of Contribution... 108
Queue Size... 108
Average Number of Used CPUs... 109
Resorption Time... 109
Measured Slowdown... 109
3 Index Based Brokering 111
3.1 Introduction... 112
3.1.1 Mathematical Model... 114
3.1.2 Optimal Brokering... 115
3.1.3 Intuitive Justification of the Index Strategy... 117
3.1.4 Cost Optimization Problem... 122
3.1.5 Mathematical Justification of the Whittle-Gittins Index... 123
3.2 Threshold Policy on a Single Queue with Rejection... 127
3.2.1 Mathematical Formulation... 131
3.2.2 Properties of Optimal Policy... 133
3.2.3 Computing the Optimal Threshold... 141
3.3 Algorithmic Improvements... 144
3.3.1 Algebraic Simplifications... 144
Local Gain... 144
Admissibility Condition... 144
Total Gain... 145
3.3.2 Improvement of the Admissibility Check... 146
3.3.3 Reducing the Problem Size... 147
3.3.4 Computing the Index Function... 149
3.3.5 Parameters Dependence and Numerical Issues... 150
Discount Cost (α)... 151
Arrival Rate (λ)... 152
Number of Servers (s)... 152
Precision (ε)... 153
3.4 Complexity and Benchmarks... 155
3.4.1 Value-Determination Operation (Solving Jθ)... 155
3.4.2 Policy-Improvement Routine... 155
Maximal Complexity... 155
Average Complexity... 155
3.4.3 Finding θ(R)... 156
Maximal Complexity... 156
Average Complexity... 156
Improving the Maximal Complexity... 156
3.4.4 Dichotomy... 157
First Phase... 158
Second Phase... 158
Improved Maximal Complexity... 158
3.4.5 Space Complexity... 159
3.4.6 Benchmarks... 159
3.5 Numerical Experiments... 161
3.5.1 Strategies... 161
3.5.2 Uniprocessor Systems... 162
3.5.3 Multiprocessor Systems... 164
3.5.4 Robustness... 164
3.5.5 Sojourn Time Distribution... 169
3.6 Realistic Experiments... 172
3.6.1 SimGrid Software... 172
3.6.2 A Grid Model Using SimGrid... 172
3.6.3 Traces... 174
3.6.4 Experimental Scenarios... 175
Several Inputs... 176
The Effect of Heterogeneity... 177
Information Delays... 178
3.6.5 Sojourn Time Distribution... 180
3.7 Conclusions... 181
4 Index Brokering of Batch Jobs 183
4.1 Introduction... 184
4.2 Batch Arrivals with Known Distribution... 185
4.2.1 Mathematical Model... 185
4.2.2 Threshold Policy: Differences with Sequential Jobs... 186
Properties of Optimal Threshold Policy... 188
4.2.3 Computing the Optimal Threshold: Algorithm and Optimizations... 190
Algebraic Simplifications... 191
Local Gain... 191
Admissibility Condition... 191
Global Gain... 191
Improvement of the Admissibility Check... 192
Reducing the Problem Size... 193
4.2.4 Computing the Index Function... 194
4.2.5 Complexity... 194
Value-Determination Operation (Solving Jθ)... 194
Policy-Improvement Routine... 194
Finding θ(R)... 194
Dichotomy... 195
Benchmarks... 196
4.2.6 Numerical Experiments... 197
Impact of the Architecture... 197
Impact of the Job Width Distribution... 198
Robustness on Load Variations... 199
Robustness on Job Width Distribution... 200
4.3 Batch Arrivals with Known Sizes... 203
4.3.1 Mathematical Model... 203
4.3.2 Bellman's Equation... 203
4.3.3 Algorithms... 206
Computing the Best Thresholds... 206
Computing the Index... 207
4.3.4 Simulations... 208
4.4 Batch Arrivals with Synchronous Departures... 214
4.4.1 State Transitions... 215
Job Arrival... 215
Process Departure... 215
4.4.2 Alternative Formulation... 216
Case x > 0... 216
Case x = 0... 216
4.4.3 Bellman's Equation... 216
4.4.4 Algorithm... 217
4.5 Parallelization... 219
4.5.1 Interval Division... 220
Avoiding Duplication... 222
Improving the Interval Division... 222
4.5.2 Pool of Rejection Cost Values... 223
Computing Processes... 223
Coordinator Process... 224
Pool of Values... 224
Distributed Dichotomy... 225
Cleaning the Pool... 225
Filling the Pool... 226
4.6 Conclusions... 228
5 Contribution, Future Works and Conclusion 229
5.1 Summary of Contribution... 229
5.1.1 Random Brokering... 229
5.1.2 Index Brokering... 230
5.2 Open Questions and Future Work... 232
5.2.1 Random Brokering (Chapter 2)... 232
Standard Deviation and Error... 232
Missing Experimentations... 232
Other Distributions... 233
Non-Saturated Parallel Systems... 233
Asynchronous and Semi-Synchronous Systems... 233
Slowdown on Submitted Jobs... 233
Missing Formal Proofs... 233
5.2.2 Index Brokering (Chapter 3)... 234
Alternative Cost Strategy... 234
Advanced Realistic Simulations... 235
Implementation in Real/Production Environment... 235
Proof of Conjecture 3.12... 235
5.2.3 Batch Jobs (Chapter 4)... 235
Improvement of BatchK Algorithm... 235
Optimal Strategy... 236
Semi-Synchronous Systems (BatchS)... 236
Parallelization Implementation... 236
Full Proof of Convexity... 236
5.3 Final Conclusion... 237
A Splitting of Stochastic Process 241
B Used CPUs: Parallel Case 243
B.1 Case s = 2... 243
B.2 Case s = 3... 245
B.3 General Case... 246
B.4 Worst Case... 249
B.5 Equidistributed Case... 251
C Convexity 259
C.1 Sequential Case... 262
C.1.1 Case x < B - 2... 263
C.1.2 Case x = B - 2... 265
C.2 Batch Case... 266
C.2.1 Case x < B - 2... 266
C.2.2 Case x = B - 2... 268
C.2.3 Case x = B - 1... 269
C.2.4 Case B < x < B + K - 3... 270
D Solving a (K+2)-diagonal System 271
References 273
Webography... 273
Personal Bibliography... 273
General Bibliography... 274
Symbols 281
Index 284
Chapter 1
Introduction to Grid Brokering
Chapter Abstract. This chapter introduces the framework in which our work has been undertaken. We present general definitions about grid computing, and more specifically about grid brokering. We establish our mathematical model of a computational grid and give the definitions needed in the remainder of this document.
Chapter Contents
1.1 Motivations and Context
1.2 Definitions and Grid Modeling
1.3 Brokering
1.4 Examples
1.5 Cost Function
1.6 Concepts
1.1 Motivations and Context
When a scientist needs to perform some small computations, he typically launches a dedicated application on a personal computer, which gives the required results. We might say that a user sends a job to a server, and gets back the results of its computation. This model applies as long as the need for computational power is not too high. Once usage reaches the maximal power of the computer at hand, we may consider two ways of tackling the increasing need for computational power. First, buy a new machine, with better performance, more powerful processors, or more processors (e.g. Massively Parallel Processors). Of course, this is limited by the speed at which engineers improve CPU performance or memory accesses. The second way consists in gathering several machines and making them work together.
Infrastructures regrouping several (often homogeneous) machines such as simple personal computers, with some coordination mechanisms, are generally named clusters. According to Buyya [30]:
a cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone computers working together as a single, integrated computing resource.
Clusters are generally owned by organisations such as labs, universities, or companies, and are shared by several users. They contain between a few and several thousand machines. The network interconnecting cluster machines is usually a high-performance LAN (Local Area Network), which allows fast communications and data transfers.
Even if there are no theoretical limitations in terms of the number of machines, clusters have two main disadvantages: first, large clusters are very expensive and very difficult to manage. Secondly, following the evolution of the experiments for which the cluster has been set up, such a parallel system is often underused during some periods, and saturated during others. Those reasons, as well as many others, lead scientists to the idea of Grids: if computers can be put together in order to form a cluster, clusters could be put together in order to form some kind of cluster of clusters, or meta-cluster, or Grid.
A Grid is typically an infrastructure coordinating several clusters spread around a country, or around the world, hosted by partners of a common project, and communicating through a WAN (Wide Area Network), such as the Internet, or some rented links.
As grids are often oriented towards scientific applications, they generally do not only contain clusters, but also mass storage devices allowing experimental results, or various kinds of data, to be stored. Storing huge amounts of data in grid systems requires solving transfer, duplication, security, scheduling and other management issues. Due to the slowness and
the unreliable nature of WAN communication channels, those fields are often far more difficult than they were in clusters, with fast and controllable networks.
Grids have become prevalent infrastructures for intensive computational tasks. The word "Grid" is probably now one of the top five terms in computer science, and the whole scientific community agrees that physicists, biologists, or even mathematicians, will not be able to tackle tomorrow's problems without such computational systems.
But what exactly is a Grid? Ian Foster, one of the so-called "fathers of the grid" [39], describes on his personal webpage a Grid as:
a system that coordinates resources that are not subject to centralized control, using standard, open, general-purpose protocols and interfaces, to deliver nontrivial qualities of service.
With that definition, a lot of distributed systems correspond to grids. In this work, we are mainly going to focus on Grids as being a set of resources, such as clusters, massively parallel processor machines, or mass storage devices, linked together through some wide area network, and managed at a high level by some middleware.
There are mainly two kinds of grids. The first one concerns grids in which the large majority of the work is number crunching (or CPU-bound), and requires neither large amounts of data nor heavy network communication. This kind of grid is usually called a Computational Grid. The second class regroups systems where data is the center of everything; jobs are data-intensive (or IO-bound). They require storing huge amounts of data (often petabytes), which, of course, regularly need to be transferred between computational or storage resources. These grids are named Data Grids. This work mainly focuses on the first family, Computational Grids.
Ideally, Grids should be as easy to use as a simple computer, in the same way that the Internet is as easy to use as if the data were locally available, or the electric power grid (from which the Grid takes its name) is as easy to access as if the power station were just behind the wall.
Nevertheless, if Grids are on the way to becoming efficient and user-friendly systems, computer scientists and engineers still have a huge amount of work to do in order to improve their efficiency. Amongst a large number of problems to solve or to improve upon, the problem of scheduling the work and balancing the load is of first importance.
The main subject of this thesis will be this last problem: how to efficiently or fairly distribute the work across a Computational Grid? This task is usually called Meta-Scheduling (as opposed to Scheduling, which addresses the same subject at a local level) or Brokering. Even if some authors consider the brokering task as not being of prime importance, we will show that the brokering policy can drastically change the system performance, and not necessarily in an expected way.
1.1.1 Outline
This work will be mainly split in two parts. After introducing the mathematical framework on which the remainder of the manuscript is based, we will in Chapter 2 study systems where the grid brokering is done without any feedback information, i.e. without knowing the current state of the clusters when the resource broker (the grid component performing the brokering) makes its decision. We show here how a computational grid behaves if the brokering is done in such a way that each cluster receives a quantity of work proportional to its computational capacity.
This part is based on several publications, in collaboration with Joël GOOSSENS (ULB) and Emmanuel JEANNOT (Loria, Nancy): my DEA thesis [15], a conference paper summarizing my DEA thesis (ISPA04, Hong Kong) [19], a journal paper extending our model to heterogeneous systems (IEEE Transactions on Parallel and Distributed Systems, 2006) [22], as well as some technical reports (INRIA and ULB) [21, 20]. Notice that, with the same authors, we also have a contribution in fault-tolerant real-time systems [23], but as this research is out of the scope of the present work, we have not included it here.
The second part of this work (Chapters 3 and 4) is rather independent from the first one, and consists in the presentation of a brokering strategy, based on Whittle's indices, which tries to minimize the average sojourn time as much as possible. We show how efficient the proposed strategy is for computational grids, compared to the ones popularly used in production systems. We also show its robustness to several parameter changes, and provide several very efficient algorithms for performing the computations required by this index policy.
This second part is the fruit of a collaboration with Bruno GAUJAL (IMAG, Grenoble). On this subject, started by my 3-month sojourn at IMAG in April 2005, we have one journal paper giving the performance of index brokering on realistic workloads (Parallel Computing, Elsevier) [17], one conference paper extending our first model to batch jobs (NetCoop07, Avignon) [18], as well as an INRIA research report [16].
1.2 Definitions and Grid Modeling
We will now formalize several concepts and give definitions about Grids.
In this work, we will consider a quite simple yet fairly realistic model of a computational grid. Our analysis is mainly focused on computational grids: we assume that network latencies and transfer times are negligible compared to computation times, and we thus do not take data localization into account.
The model we now present may seem in some ways quite far from reality. However, we believe that models which are an exact representation of reality are, at least in the case of Grids, mathematically intractable. Analyses, predictions, or strategies cannot be provided in such models due to their complexity. This is why we prefer to consider a simpler model, even if we lose some realism, on which some theoretical analysis is possible. Nevertheless, there are plenty of examples in the literature where theoretical models allow improving the performance of real systems, or predicting their behavior, even when the model simplifies the real environment. Indeed, in this work we will give such an example, where a simple model allows finding a very efficient strategy for a realistic model.
The aim of most computational systems is to run jobs, and this is true for Grids as well. As the literature gives a lot of definitions around this term (job, process, task, subprocess, program, thread, application, operation...), we first need to clarify the concepts which will be the center of our concern in this work.
Definition 1.1 (Process)
A Process is a sequence of computing operations running on a single processor.
We do not consider in this work the subdivision of a process into several threads, as is sometimes done.
Definition 1.2 (Job)
A Job is a set of one or several process(es).
A job is said to be uniprocess (or sequential) when it is composed of one process, and multiprocess (or parallel) when it is composed of one or several processes.
As a computational grid is generally described as a set of Computing Elements, we need to define such a component.
Definition 1.3 (Computing Element, Cluster)
A Computing Element (also called a Cluster) is a set of CPUs or servers, and a single queue, both managed by a scheduler using a specific scheduling policy.
A Computing Element receives jobs, and runs processes on its CPUs.
In this work, we will only consider homogeneous Computing Elements, composed of identical CPUs.
Definition 1.4 (Client)
A Client is a grid user who sends jobs to a grid, in order to run them on some CPU, and to get back results.
One may consider that a grid has only one client, as a client can emulate the work of several clients.
While CPUs need to be managed in a Computing Element (CE), CEs need to be managed in a grid. This is done by a kind of orchestra conductor, usually named the Resource Broker.
Definition 1.5 (Resource Broker, Router, Meta-scheduler)
A Resource Broker (also called a Router or a Meta-scheduler) is a grid component receiving jobs from Clients and sending each of them to a chosen Computing Element.
Definition 1.6 (Brokering, Routing, Meta-scheduling)
Brokering is, in a computational grid, the action of choosing the Computing Element to which an incoming job is sent. The Brokering policy is the way of choosing such an action.
We now have every element we need to provide a first definition of a Grid.
Definition 1.7 (Grid)
A Grid is a set of Computing Elements, linked together through a Resource Broker, receiving jobs from one or several clients.
Definition 1.8 (Computational Grid)
A Computational Grid, in contrast to a Data Grid, is a Grid for which the usage is computation-oriented. Data transfers, data duplication, cache coherency management, large database access, etc. are considered to be negligible with respect to the time spent in actual computations.
In the remainder of this work, we will only focus on Computational Grids.
The next two sections will consist of formal definitions of the concepts defined more verbosely here above. In the next section (1.2.1), we will present a model inspired by the queuing theory community [51, 38, 62]. Then, we will adapt this model in Section 1.2.2, following a point of view coming from the scheduling community [56, 30].
We will finish by comparing those two models, showing that in some respects they are not that different.
1.2.1 Queuing Model
We can now give more formal definitions of several concepts evoked in the previous section. Figure 1.1 shows a general overview of the grid-structure model we consider here. The left figure gives a rather high-level point of view, while the right one goes deeper into the queuing model.
Figure 1.1: Two models of a Computational Grid with Resource Broker. The Grid is composed of M Computing Elements (or queues), queue i being composed of s_i CPUs (or servers) of speed (or rate) μ_i. The system input, with rate λ, is routed amongst the M queues.
Definition 1.9 (CPU, Server)
A CPU or server is a component able to perform computing operations, at a speed given by its service rate μ, with 0 < μ < ∞.
Definition 1.10 (Process)
A Process p is a set of computing operations which, when performed by a CPU of rate μ, will use it continuously during, on average, 1/μ units of time.
As is classically done in queuing theory, processes do not really have a pre-defined execution time; the execution time is fully determined by the server on which the process runs. For instance, if a server provides a service time following an exponentially distributed random variable, one may consider that at each infinitesimal period of time, the server decides whether the process continues or stops its execution, following an exponential distribution.
From now on, servers are assumed to run processes without preemption (a process cannot be interrupted in order to run another process on the same server), and without migration (once a process has started on a server, it stays on this server until the end of its execution).
As we shall study stochastic workloads, the execution time is obtained from a random variable, by rolling a (continuous) die. As long as we use the same random variable, and if the process lengths are not taken into account in any scheduling decision, it is equivalent to assume the die is rolled by the server when the job starts, or by the client at submission time. In the first model, inspired by the queuing theory community, we assume the execution time is chosen by the server starting the process, or even during its execution. In the second model, the execution time is chosen by the client submitting the job.
From the last definition, a process thus does not have a true (absolute) execution time. Its effective execution time is determined by the server (or CPU) on which this process runs. If a server has a rate of μ, this means that the average (effective) execution time of processes running on this server is 1/μ. For instance, in the case of exponentially distributed execution times, the rate of the distribution will be μ.
Definition 1.11 (Job)
A Job is a set of one or several processes, arriving together at a given submission time, and having to be sent to the same CE. The number of processes composing a job is named its width.
Definition 1.12 (Sequential Job, Uniprocess Job)
A Sequential Job (or Uniprocess Job) is a job composed of one single process (job width = 1).
In the case of sequential jobs, we will not differentiate jobs and processes. We will consider jobs as having their own execution time, because a job will have the execution time of its unique process.
In the case of parallel jobs, the execution time of a job can also be defined, but needs some convention. It could be, for instance, the sum or the maximum of its processes' execution times, or the total time during which at least one of its processes was running. Notice that in this last case, the job execution time depends upon the scheduling, which is not always convenient. In this work, we do not consider the execution time of jobs.
Figure 1.2: Execution time of parallel jobs. Left: Asynchronous; Center: Semi-Synchronous; Right: Fully-Synchronous.
Definition 1.13 (Parallel Job, Multiprocess Job)
A Parallel Job (or Multiprocess Job) is a job with one or several independent processes.
By independent processes, we mean that there are neither precedence dependencies nor common resources (and hence no communication) between processes.
We do not enforce parallel jobs to have more than one process; a sequential job is thus a particular case of a parallel job.
Definition 1.14 (Asynchronous Parallel Job)
An Asynchronous Parallel Job is a parallel job for which, once in a cluster queue, every process is independent from the other processes.
Definition 1.15 (Semi-Synchronous Parallel Job)
A Semi-Synchronous Parallel Job is a parallel job with the constraint that its processes have to start their execution simultaneously, but are independent afterwards.
Definition 1.16 (Fully-Synchronous Parallel Job)
A Fully-Synchronous Parallel Job is a parallel job with the constraint that its processes have to start their execution simultaneously, and release the CPUs only when all its processes are completed.
Definition 1.17 (Synchronous Parallel Job)
A Synchronous Parallel Job is a parallel job which is either semi-synchronous or fully-synchronous.
In this work, we assume that jobs are independent, i.e., they do not share common resources except CPUs, and there are no precedence constraints between jobs, as we assumed for processes.
Definition 1.18 (Job Flow)
A Job Flow F_{A_λ, W_K} is an infinite set of jobs, for which:
• The inter-arrival delay follows the probability density function¹ A_λ, with 0 < λ < ∞ and E[A_λ] = λ⁻¹;
• The job width is an integer between 1 and K distributed according to the law W_K.
Notice that another very similar model could have been considered: we can have K arrival processes, the process k following a law A_{λ_k} and submitting only jobs of width k. Those K arrival processes are merged before entering the system. In order to make both models comparable, λ_k needs to be chosen accordingly: if w_k = P[W_K = k], we should have λ_k = λ·w_k. In the case of Poissonian arrivals, merging K processes of arrival rate λ_k is equivalent to a Poissonian process of arrival rate Σ_k λ_k, and as a given arrival in the second model has a probability w_k of corresponding to a job of width k, both models are equivalent.
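The equivalence of the two job-flow models can be checked numerically. The sketch below (our own illustration, not code from this thesis; all names are ours) draws a single Poisson stream whose jobs pick a width from W_K, and separately merges K Poisson streams of rates λ·w_k, then compares empirical rates and width frequencies:

```python
import random

def model_1(lam, w, horizon, rng):
    """One Poisson stream of rate lam; width k chosen with probability w[k-1]."""
    t, jobs = 0.0, []
    while True:
        t += rng.expovariate(lam)
        if t >= horizon:
            return jobs
        jobs.append((t, rng.choices(range(1, len(w) + 1), weights=w)[0]))

def model_2(lam, w, horizon, rng):
    """K independent Poisson streams, stream k of rate lam*w[k-1] emitting
    only width-k jobs; the streams are merged and sorted by arrival time."""
    jobs = []
    for k, wk in enumerate(w, start=1):
        t = 0.0
        while True:
            t += rng.expovariate(lam * wk)
            if t >= horizon:
                break
            jobs.append((t, k))
    return sorted(jobs)

rng = random.Random(0)
w = [0.5, 0.3, 0.2]                 # w_k = P[W_K = k], here K = 3
a = model_1(2.0, w, 10_000, rng)
b = model_2(2.0, w, 10_000, rng)
# Both empirical rates should be close to lam = 2, and both width
# distributions close to w.
print(len(a) / 10_000, len(b) / 10_000)
```

Over a long horizon, both models produce statistically indistinguishable merged flows, as the text argues for the Poissonian case.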
Definition 1.19 (Computing Element)
A Computing Element C_i is
• A set of s_i homogeneous servers (or CPUs) having a service rate μ_i;
• With a queue (or buffer) having a capacity B_i − s_i, where jobs are stored while there are not enough free servers to run them. The capacity of C_i will be B_i;
• Running jobs following a scheduling policy S_i.
A Computing Element C_i is then identified by the tuple {s_i, μ_i, B_i, S_i}.
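As an illustrative sketch only (field and method names are ours, not the thesis'), the tuple {s_i, μ_i, B_i, S_i} of Definition 1.19, together with the capacity-based rejection rule described next, could be carried in code as:

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class ComputingElement:
    servers: int          # s_i: number of homogeneous servers (CPUs)
    rate: float           # mu_i: service rate of each server
    capacity: int         # B_i: total capacity (running + waiting jobs)
    policy: str = "FCFS"  # S_i: local scheduling policy
    queue: deque = field(default_factory=deque)  # all jobs present in the CE

    def accepts(self) -> bool:
        """An incoming job is rejected once the CE already holds B_i jobs."""
        return len(self.queue) < self.capacity

ce = ComputingElement(servers=4, rate=1.5, capacity=10)
print(ce.accepts())   # True while fewer than B_i jobs are present
```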
The capacity of a computing element (CE) means that if this CE already contains as many jobs as its capacity, any incoming job sent to this CE will be rejected. Of course, as much as possible, a resource broker (RB) should avoid sending a job to a CE having already reached its capacity, but, as we will see later, an RB might not be informed of such a situation.
In the following, the local scheduling policy S will by default be FCFS (First Come First Served, see [49] for instance), also known as FIFO (First In First Out), and will be the same for every Computing Element. The principle of the FCFS scheduling policy is to only consider the job at the front of the queue. If there are enough available servers for this job to be executed, the job starts. If not, the CE waits until enough servers are freed (other jobs are not considered meanwhile).

¹The probability density function f : ℝ → ℝ⁺ of a random variable expresses its density of probability; the area under the curve f between two abscissas a and b is the probability that a drawing of the random variable will fall between those two numbers.
Figure 1.3: Example of scheduling, with seven jobs arriving successively, having 3, 2, 2, 2, 4, 2 and 2 processes. Top: FCFS scheduling, Bottom: FF scheduling.
Here are some definitions allowing us to characterize a scheduling policy.
Definition 1.20 (Eligible)
A synchronous parallel job of width n is said to be Eligible if at least n CPUs are free in the same CE.
A process belonging to a sequential or an asynchronous parallel job is said to be Eligible if at least one CPU is free.
Definition 1.21 (Scheduling Policy)
A Scheduling Policy (or a Scheduling Strategy) is a function which gives, for any cluster configuration and if the queue is not empty, the next job (in the case of parallel synchronous jobs) or process (in the case of sequential or parallel asynchronous jobs) to start, eligible or not. The chosen job/process is named the highest priority job or the highest priority process.
Here, the cluster configuration contains any information about the cluster available to the scheduler. This can be simply the number of running and waiting jobs, or, in more complex systems, the arrival, the start and/or the (expected) end time of any job.
Notice that this definition requires the scheduler to be unambiguous: for any cluster configuration, there is exactly one highest priority job/process (except if the queue is empty). But this does not prevent the scheduler from starting several jobs/processes at the same time, because once a job is launched, the cluster configuration changes, and another job can get the highest priority at the same time.
Moreover, the job/process returned by the scheduling policy is not necessarily the next one to effectively start: it could happen that, at some time, the highest priority job/process is not eligible, and that the configuration changes in such a way that the policy gives the highest priority to another job/process before the previous highest priority job/process has started.
Definition 1.22 (Greedy)
A scheduling policy is said to be Greedy (also called expedient) if it never leaves any resource idle intentionally. If a system runs a greedy policy, a resource is idle only if there is no eligible job waiting for that resource.
Notice that, in the parallel case, FCFS or Backfilling [61] are not greedy, while FF (First Fit, see below) is. In the sequential case, FCFS, which corresponds to FF, is greedy.
Definition 1.23 (ASAP)
An ASAP scheduler, standing for as soon as possible, is a scheduler which starts any job/process at the first time this job/process becomes the highest priority job/process and is eligible.
Remark that, in the case of non-preemptive systems, non-ASAP schedulers could be more efficient than ASAP ones. For instance, a strategy could choose to delay the schedule of a job, hoping for the imminent arrival of a "better" job. However, in the remainder of this work, we only consider ASAP schedulers.
In the case of synchronous parallel jobs, another scheduling policy will be considered in this document: FF, standing for First Fit [40]. With this policy, the first eligible job starts when possible. FF and FCFS are of course identical in the case of sequential jobs or asynchronous parallel jobs. A strategy based on FF and using advance reservation is used in OAR, the batch scheduler of Grid'5000 (see [6] and Section 1.4.3). An example of such scheduling policies is given in Figure 1.3 for fully-synchronous parallel jobs.
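The difference between the two policies for fully-synchronous parallel jobs can be sketched in a few lines. The function below is our own minimal illustration (not the thesis' code): FCFS only ever considers the job at the front of the queue, while FF scans for the first eligible job, i.e. the first whose width fits in the free CPUs (Definition 1.20). The widths used are a hypothetical example:

```python
def next_to_start(queue, free_cpus, policy):
    """Return the index in `queue` (a list of job widths) of the job the
    policy would start now, or None if no job can start."""
    if policy == "FCFS":
        # Only the front job is considered; jobs behind it must wait.
        return 0 if queue and queue[0] <= free_cpus else None
    if policy == "FF":
        # Scan the queue and take the first job that fits.
        for i, width in enumerate(queue):
            if width <= free_cpus:
                return i
        return None
    raise ValueError(policy)

queue = [3, 2, 2]                       # widths of queued jobs
print(next_to_start(queue, 2, "FCFS"))  # front job (width 3) does not fit
print(next_to_start(queue, 2, "FF"))    # the width-2 job at index 1 fits
```

With 2 free CPUs, FCFS starts nothing (the front job blocks the queue) while FF starts the first job of width 2, mirroring the behavior shown in Figure 1.3.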
It is because we look at ASAP systems (see Definition 1.23) that a synchronous system is not a particular case of an asynchronous system. Indeed, if we do not impose the system to be ASAP, an asynchronous policy can choose not to start the processes of a job as soon as possible. In particular, the policy could choose to start every process of the same job simultaneously.
Very often, today's cluster schedulers do not use FF or FCFS, but more sophisticated techniques, using preemption (a process can be interrupted and resumed later), migration (a process can be interrupted on a server, and resumed on another server, either on the same cluster — intra-cluster migration — or on another cluster — inter-cluster migration), dynamic priority systems, etc. See for instance [49] for more details. In this thesis, we do not focus on cluster performance, since our aim is to compare various brokering strategies. It is reasonable to assume that the comparison between strategies will not be too much impacted by a better low-level scheduling strategy, because every cluster would improve its performance in a similar ratio. This assumption can of course only be made if strategies are not drastically different from FCFS, and show only marginal divergences.
For instance, comparing brokering strategies when the local scheduling policy is Last Come First Served will probably not give the same conclusion as when the local scheduling policy is FCFS. This is why we only consider simple scheduling strategies, and non-preemptive systems.
Based on the last definition, we can now give a formal definition of the mathematical object we name a Grid System, formalizing the definition of a Computational Grid.
Definition 1.24 (Grid System)
A Grid System G is a tuple {{C_i}_{i∈[1,...,N]}, F_{A_λ, W_K}, B} where
• {C_i}_{i∈[1,...,N]} is a set of N Computing Elements C_i = {s_i, μ_i, B_i, S_i};
• F_{A_λ, W_K} is the arrival job flow;
• B is the Brokering policy.
A formal definition of brokering will be given further in this document.
Figure 1.1 (page 19) gives two visions of a computational grid. On the left-hand side, the Resource Broker (RB) and the Computing Elements (C_i) are seen as black boxes. Clients send jobs to the RB, which dispatches them to some C_i according to its routing policy. The right figure gives a more "queuing theory" oriented approach. A stream of jobs having a rate λ comes into the system. This stream is split in such a way that each job is sent to one Computing Element, which can be seen as a buffer (waiting queue) associated with some servers (CPUs). The s_i servers of the queue C_i have a service rate of μ_i.
1.2.2 Scheduling Model
From a scheduling point of view, each incoming job j is composed of one or several processes p, each having a virtual execution time ℓ_p (or virtual execution length), chosen by the client before submitting the job, possibly from a probability distribution. This virtual execution time means that, on a processor of relative speed ρ = 1, the process p would take ℓ_p units of time. Then, on a CPU with any relative speed ρ, a process p will have an effective execution time of ℓ_p/ρ. Or, if a process runs during ℓ units of time on a ρ_1 CPU, it will run ℓ·ρ_1/ρ_2 units of time on a ρ_2 CPU.
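The speed model above amounts to two one-line conversions. A minimal sketch (function names are ours, purely illustrative):

```python
def effective_time(virtual_length: float, speed: float) -> float:
    """Effective execution time of a process of virtual length l_p
    on a CPU of relative speed rho: l_p / rho."""
    return virtual_length / speed

def convert(elapsed: float, speed_from: float, speed_to: float) -> float:
    """Time that a run of `elapsed` units on a speed_from CPU
    would take on a speed_to CPU: elapsed * rho_1 / rho_2."""
    return elapsed * speed_from / speed_to

print(effective_time(6.0, 2.0))   # a virtual length of 6 takes 3 units at speed 2
print(convert(3.0, 2.0, 1.0))     # and 6 units on a reference (speed-1) CPU
```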
If the broker's choices are made without taking into account job and/or process lengths, choosing the process length at submission time (scheduling model) or when the job starts (queuing model) is equivalent. And if the local scheduling decision (choosing the CPU) is also made without information about lengths, the process length can even be chosen when the process starts.
These points of view are thus not incompatible: from a "macroscopic" vision, the scheduling model with an average virtual execution time of 1 (say, a distribution D with E[D] = 1) will behave the same way as the queuing model with a rate μ, if the execution distribution is D' such that f_{D'}(x) = μ·f_D(μ·x), ∀x ∈ ℝ⁺, where f_D and f_{D'} are the probability density functions of D and D'. Notice that compelling the average virtual execution time to be equal to 1 is not restrictive: this only constrains the choice of the time unit in order to reach a unitary average. If the virtual execution times are scaled, relative speeds are scaled accordingly.
We assume in our model that the execution time (or its average) does not depend on environmental factors, such as data transfer or communication costs, nor on local factors such as boot time, migrations, preemption or other scheduling costs.
As for the service rate μ, one can give two interpretations of the arrival rate λ.
From a queuing theory point of view, we have the concept of job stream, and λ can be seen as the inverse of the average inter-arrival delay. For instance, if arrivals follow a Poisson process, λ is the rate of this process.
From a scheduling point of view, we have an infinite set of job ids J (J ≝ ℕ* = {1, 2,...})², and for each j ∈ J, a_j is the submission (or arrival) time of the job j. The arrival rate λ is considered as the inverse of the average inter-arrival delay. In other words, {a_j | j ∈ J} is such that

    lim_{t→∞} ( Σ_{{j∈J | a_j<t}} (a_j − a_{j−1}) ) / ||{j ∈ J | a_j < t}|| = λ⁻¹,    with a_0 = 0.
As λ < ∞ (and then λ⁻¹ > 0), we can without loss of generality assume³ that J is sorted by submission time, meaning that ∀i < j, a_i ≤ a_j. This schema could help to visualize our system:

²The symbol ≝ means "is by definition" or "is defined as".

³If λ⁻¹ can be null, it is not always possible to re-order jobs. For instance, we could have infinitely many jobs arriving at some time t, and one job arriving after t. There is no possible numbering allowing to sort this scenario by submission time. Or we could have a_1 = 2 and a_i = 1 − 1/i ∀i ≥ 2, which does not allow a re-ordering either.
    0      a_1      a_2      ···      a_{j−1}      a_j      ···      → t

We then sum up every interval between 0 and a_{k(t)} (the last arrival before t), and divide this value by the number of intervals (k(t)). Of course, the sum equals a_{k(t)}: it telescopes, since every term but the first and the last cancels. Then,
    lim_{t→∞} a_{k(t)} / k(t) = λ⁻¹    (1.1)

where k(t) ≝ max{j ∈ J | a_j < t}. We need the set {j ∈ J | a_j < t} to be finite, or at least its maximum to exist for any t. But we have λ < ∞ (or λ⁻¹ > 0), which is a sufficient condition for that.
We introduce here a new notation.
Definition 1.25 (Asymptotic behavior)
A function f_1(t) behaves asymptotically like f_2(t) (denoted f_1(t) ~ f_2(t)) iff

    lim_{t→∞} f_1(t) / f_2(t) = 1.

Equivalently, we have

    f_1(t) = f_2(t) + ε(t),    with lim_{t→∞} ε(t) / f_2(t) = 0.
According to this definition, we can derive some results about the asymptotic behavior of the inter-arrival delay.
Lemma 1.1
a_{max{j∈J | a_j<t}} = a_{k(t)} ~ t.

Proof
We have to show that a_{k(t)} ~ t, or that lim_{t→∞} t / a_{k(t)} = 1. By definition of k(t), we have

    a_{k(t)} ≤ t < a_{k(t)+1}.

Then,

    1 ≤ t / a_{k(t)} < a_{k(t)+1} / a_{k(t)} = ( a_{k(t)+1} / (k(t) + 1) ) · ( (k(t) + 1) / a_{k(t)} ).

Taking the limit when t → ∞, we have (with Equation (1.1))

    1 ≤ lim_{t→∞} t / a_{k(t)} ≤ λ⁻¹ · ( λ + lim_{t→∞} 1 / a_{k(t)} ) = 1.

We then have lim_{t→∞} t / a_{k(t)} = 1. □
Now we have:

Theorem 1.2
max{j ∈ J | a_j < t} ~ λ·t.

Proof
We have lim_{t→∞} a_{k(t)} / k(t) = λ⁻¹ (from Equation (1.1)), and lim_{t→∞} t / a_{k(t)} = 1 (from Lemma 1.1). Then,

    lim_{t→∞} t / k(t) = lim_{t→∞} ( t / a_{k(t)} ) · ( a_{k(t)} / k(t) ) = λ⁻¹,

and hence k(t) = max{j ∈ J | a_j < t} ~ λ·t. □
This theorem will be useful later.
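Equation (1.1) and Theorem 1.2 are easy to check numerically for Poissonian arrivals. The sketch below (our own sanity check, not from the thesis) simulates a Poisson stream of rate λ and verifies that the number of arrivals before t, k(t), behaves like λ·t, while a_{k(t)}/k(t) approaches λ⁻¹:

```python
import random

def arrivals(lam, horizon, rng):
    """Arrival times of a Poisson process of rate lam, truncated at horizon."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(lam)
        if t >= horizon:
            return times
        times.append(t)

rng = random.Random(42)
lam, t = 3.0, 10_000.0
a = arrivals(lam, t, rng)
k_t = len(a)            # k(t) = max{j : a_j < t}
print(k_t / t)          # should be close to lam (Theorem 1.2)
print(a[-1] / k_t)      # a_{k(t)} / k(t), close to 1/lam (Equation (1.1))
```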
1.2.3 System Load
When studying computational systems, it is generally required to be able to measure the load of the system, giving an evaluation of the amount of work the system is processing. Again, we will give two approaches to this measurement, the first one from the scheduling world, the second one from queuing theory.
First, we need to define the concept of computational capacity:
Definition 1.26 (Computational capacity)
The Computational capacity of C_i is defined as μ_i·s_i. The Computational capacity of a grid G is defined as Σ_{i=1}^{N} μ_i·s_i, and is denoted M.
The computational capacity can be seen as the number of virtual CPUs, or the number of homogeneous CPUs of rate μ = 1 equivalent to the original system, in a perfect world where one can perfectly take advantage of a larger number of CPUs.
From the queuing theory point of view, it can also be seen as the total rate, or the rate that a unique server would need to reach to be virtually equivalent to the whole system. Indeed, the total service rate of a set of servers is the sum of the service rates of the individual servers, which is denoted by M. This number M is of course not necessarily a natural number.
From the scheduling point of view, we define now the system load ν(t_1, t_2) as the total amount of virtual work received in [t_1, t_2[, divided by the product of the total number of virtual CPUs (M) and the total duration (t_2 − t_1). Or, denoting by ℓ_j the total virtual length of job j,

    ν(t_1, t_2) ≝ ( Σ_{{j∈J | t_1 ≤ a_j < t_2}} ℓ_j ) / ( M · (t_2 − t_1) ).

Obviously, we have that if ν(t_1, t_2) > 1, some jobs received in [t_1, t_2[ cannot be completed. The system is then not schedulable on [t_1, t_2[. Here, not schedulable on [t_1, t_2[ means that there is no brokering and/or scheduling decision allowing to finish before t_2 all jobs received in [t_1, t_2[. In other words, if the arrival pattern on [t_1, t_2[ is indefinitely repeated with a period t_2 − t_1, at least one queue will increase indefinitely, or, if queue sizes are bounded, an unbounded number of jobs will be rejected.
Of course, being not schedulable on [t_1, t_2[ does not mean that jobs still waiting or running at time t_2 will not be completed after t_2. The system could for instance become schedulable if we extend the range.
The condition ν(t_1, t_2) ≤ 1 is a necessary condition of schedulability, but usually not a sufficient one. For instance, if a job arrives at time t' < t_2, but with a running time on the fastest CE greater than t_2 − t', it cannot complete before t_2. By definition, if at least one job arrives in [0, ∞[, it is always possible to find a non-schedulable interval: t_1 just before a job arrival, and t_2 just after this arrival.
However, in general, we are interested in long intervals, such as lim_{t_1→0, t_2→∞} [t_1, t_2[.
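The scheduling-side load is directly computable from a trace of arrivals. A small sketch (our own naming, under the convention that each job carries its total virtual length):

```python
def system_load(jobs, t1, t2, capacity):
    """nu(t1, t2): virtual work arriving in [t1, t2[ divided by M * (t2 - t1).
    jobs: list of (arrival_time, virtual_length) pairs.
    capacity: M = sum_i mu_i * s_i (Definition 1.26)."""
    work = sum(length for a, length in jobs if t1 <= a < t2)
    return work / (capacity * (t2 - t1))

# Two hypothetical CEs: 4 CPUs at rate 1.0 and 2 CPUs at rate 0.5 -> M = 5.0.
M = 4 * 1.0 + 2 * 0.5
jobs = [(0.5, 10.0), (1.0, 20.0), (3.5, 5.0)]
print(system_load(jobs, 0.0, 2.0, M))   # (10 + 20) / (5 * 2) = 3.0: not schedulable
print(system_load(jobs, 0.0, 10.0, M))  # 35 / 50 = 0.7: necessary condition holds
```

As in the text, extending the interval can bring the load back below 1 even though a sub-interval is overloaded.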
Now, from the queuing point of view, we would like to have a similar definition. The load is then classically defined as the arrival rate divided by the total service rate, or:

    ν ≝ λ / M.
It is easy to see that those two definitions are asymptotically equivalent, or that ν(t) ~ ν, where ν(t) stands for ν(0, t). Indeed, as mentioned above, to go from the first model to the second one, we fix ℓ_j = 1 on average (or lim_{n→∞} (1/n) Σ_{j=1}^{n} ℓ_j = 1), and keep the same μ and λ. We then have

    ν(t) = ( Σ_{{j∈J | a_j<t}} ℓ_j ) / (M · t)
         ~ ||{j ∈ J | a_j < t}|| / (M · t)
         = ( max{j ∈ J | a_j < t} / t ) · (1/M)
         ~ λ · (1/M)    (see Theorem (1.2))
         = ν.    (1.2)

The necessary condition of schedulability now becomes ν < 1.
1.2.4 System State
In order to broker jobs, it is often required to have some knowledge about the current state of the system. A first system state model — applicable only for sequential or parallel asynchronous jobs — would consist in knowing the number of processes present in each CE, running or waiting. Each CE can then be characterized by an integer. We will use in this case the following notations:
• x_i is the C_i state, or the number of processes currently present in the queue (waiting and running). We have x_i ∈ {0,..., B_i}.
• x = (x_1,..., x_N) is the system state.
In some cases, it will be useful to enrich the Computing Elements' state. We will come back to that point later on (cf. Chapter 4).
Generally, this information does not fully characterize the state of the system, because knowing the system state does not allow to predict the future. The system state could for instance include the elapsed time since the start of the jobs (or processes) currently running, or even their end time. However, it is often not realistic to have such information: knowing the end time of jobs is generally very difficult, or impossible. And in most cases, any information about start times will not give more information about the end time.
In the case of parallel jobs, it could be required to know the width of each job waiting in the queue.
We denote by S the state space of the system. In the first simple model we presented here above, we have S = {0,..., B_1} × ··· × {0,..., B_N}.
Notice that if, as will often be the case in this work, we focus on systems having the memorylessness property, such as Poissonian arrivals with exponential services, the system state we define here fully characterizes the system; knowing the time processes have already spent in the system does not give more information.
As we assume our system to be ASAP (see Definition 1.23), and as we only consider independent processes for now, we know that if x_i ≤ s_i, there are no jobs pending in the queue. If x_i > s_i, then s_i processes are running, and x_i − s_i processes are waiting.
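This consequence of the ASAP assumption makes the integer state x_i fully interpretable. A tiny illustration (our own naming, purely a sketch):

```python
def running_waiting(x_i: int, s_i: int):
    """Under an ASAP scheduler with independent processes: if x_i <= s_i all
    processes run; otherwise s_i run and x_i - s_i wait in the queue."""
    return min(x_i, s_i), max(0, x_i - s_i)

print(running_waiting(3, 4))   # fewer processes than CPUs: all running
print(running_waiting(7, 4))   # CPUs saturated: 4 running, 3 waiting
```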
In Chapter 4, this model will be refined to take parallel jobs into account. In that case, there exist situations where there are idle CPUs while some jobs are waiting in the queue.
1.2.5 Underlying Markov Chain
In several cases we are going to analyze in this document, we will focus on state transitions. If the brokering is only state-dependent, we can consider the underlying Markov chain.
This transition Markov chain gives the probability to go from one state to another, knowing that a transition has occurred. The probability that a transition goes from state x = (x_1,..., x_N) to y = (y_1,..., y_N) is denoted

    T((x_1,..., x_N), (y_1,..., y_N)).
For instance, in the case of sequential jobs, with no simultaneous departures or arrivals, T will be defined as follows. Let D_i(x) be the probability that an event (or transition) is a departure from C_i when the state is x, A(x) the probability that an event is an arrival when the state is x, and B_i(x) the probability that an incoming job is sent to C_i by the broker when the state is x. T is then defined as:

    T(x, x − 1_i) = D_i(x)          ∀i ∈ {1,..., N}
    T(x, x + 1_i) = A(x) · B_i(x)   ∀i ∈ {1,..., N}
    T(x, y) = 0                     otherwise,

where 1_i is the vector composed only of 0's, except a 1 at position i. The sum Σ_i B_i(x)