
UNIVERSITÉ LIBRE DE BRUXELLES, UNIVERSITÉ D'EUROPE

FACULTÉ DES SCIENCES

Stochastic Approach to Brokering Heuristics for Computational Grids

Thesis presented by Vandy BERTEN in view of obtaining the degree of Docteur en Sciences

Thesis supervisor: Joël GOOSSENS

Academic year 2006–2007


Thesis publicly defended in Brussels on 8 June 2007.


Acknowledgements

A work of this scale is never a personal achievement. Even if a single person writes his name on the cover, he is very far from being its only author. I could therefore not begin this work without thanking those who, as much as myself, are at its foundation.

Among all the participants in these four years of labor, there is certainly one who contributed to it more than any other: Joël GOOSSENS, my thesis supervisor, of course. He chose to put his trust in me nearly 5 years ago, by accepting first to supervise my master's thesis, and then my doctoral thesis. Our weekly meetings, his regular readings of my writings even when I strayed from his main research topics, his insistence on having me meet foreign researchers, his rigor and his experience steered my work in a direction which, I hope, has proven worthy of the trust he placed in the first "PhD student" I had the privilege of being.

A few months after the beginning of my thesis, I had the chance to meet the AlGorille team in Nancy, and more particularly Emmanuel JEANNOT, who welcomed me on several occasions over the last few years. Our diverse and varied discussions allowed me to broaden my interests, and I can only be grateful to him for that.

Later on, I had the pleasure of collaborating with a few researchers from the IMAG in Grenoble, and more especially with Bruno GAUJAL, to whom I owe the ideas at the basis of the second part of this work. He did me the honor of welcoming me on several occasions, for a total of nearly four months. Besides the delight of life at the foot of the mountains, these stays in his team taught me an enormous amount about the world of research and collaboration. I would also like to thank Jean-Marc VINCENT and Jérôme VIENNE, for their perfect simulation tool, which they adapted many times to the needs of our experiments. It is moreover thanks to Bruno that I was able to access Grid'5000, and to consume nearly 40,000 hours of CPU time on it.

All my gratitude also goes to Raymond DEVILLERS, for his many meticulous proofreadings of my writings, and all the fascinating discussions that followed. As many members of the department can testify, he quite obviously contributed strongly to the scientific rigor, the precision, and the consideration of corner cases. There is the thesis before DEVILLERS, and after DEVILLERS...

I certainly must thank all those who, besides MM. GOOSSENS, JEANNOT, GAUJAL and DEVILLERS, accepted to be part of the jury: Olivier MARKOWITCH and Guy LOUCHARD of the ULB, and Pierre MANNEBACK of the FPMs. I thank more particularly Mr. LOUCHARD, to whom I owe a large part of the proof in Appendix B of this work.

Outside the academic world, it goes without saying that I owe a lot, not to say everything, to my parents, my two brothers and my sister; curiosity, the taste for discovery, the interest in science, the quest for precision, the pleasure of shared knowledge: these are all values I owe to them, and without which this work could not have come to fruition.

And of course, I do not forget those who surrounded me during these last years: bémelois, colonnards, cousins, taizéens, members of the DI, and many others!


Contents

1 Introduction to Grid Brokering
  1.1 Motivations and Context
    1.1.1 Outline
  1.2 Definitions and Grid Modeling
    1.2.1 Queuing Model
    1.2.2 Scheduling Model
    1.2.3 System Load
    1.2.4 System State
    1.2.5 Underlying Markov Chain
  1.3 Brokering
    1.3.1 Open-loop and Closed-loop
    1.3.2 Memoryless and Historical Information
    1.3.3 Deterministic and Probabilistic
    1.3.4 Mathematical Model (Probabilistic Memoryless Open-loop Brokering; Deterministic Memoryless Closed-loop Brokering)
    1.3.5 Bernoulli Brokerings
    1.3.6 LCB Policies
  1.4 Examples
    1.4.1 EGEE
    1.4.2 NorduGrid
    1.4.3 Grid'5000
    1.4.4 GridBus
  1.5 Cost Function
  1.6 Concepts
    1.6.1 Resolving Markov Decision Processes Using Dynamic Programming (Minimizing the Policy; Value Iteration; Policy Iteration)
    1.6.2 Perfect Simulation

2 Random Brokering
  2.1 Introduction and Model
    2.1.1 Dispatching the Jobs
    2.1.2 Numerical Simulations
  2.2 Sequential Systems
    2.2.1 System Load
    2.2.2 Queue Size (Case ν = 1; Case ν < 1; Case ν > 1: Arrivals, Departures, Number of Jobs in the System; Experimental Results)
    2.2.3 Used CPUs
    2.2.4 Resorption Time
    2.2.5 Slowdown (Job Length Distributions; Case ν < 1; Case ν > 1: Slowdown for a Job Submitted at Time θ, Average Slowdown Until the Last Finished Job With ν > 1; Experimental Results)
  2.3 Fully-synchronous Parallel Systems
    2.3.1 Queue Size (Case ν < ν̃_i; Case ν > ν̃_i; Experimental Results)
    2.3.2 Used CPUs (Case ν < ν̃_i; Case ν > ν̃_i)
    2.3.3 Resorption Time
    2.3.4 Slowdown (Experimental Results)
  2.4 Conclusion
    2.4.1 Summary of Contribution (Queue Size; Average Number of Used CPUs; Resorption Time; Measured Slowdown)

3 Index Based Brokering
  3.1 Introduction
    3.1.1 Mathematical Model
    3.1.2 Optimal Brokering
    3.1.3 Intuitive Justification of the Index Strategy
    3.1.4 Cost Optimization Problem
    3.1.5 Mathematical Justification of the Whittle-Gittins Index
  3.2 Threshold Policy on a Single Queue with Rejection
    3.2.1 Mathematical Formulation
    3.2.2 Properties of Optimal Policy
    3.2.3 Computing the Optimal Threshold
  3.3 Algorithmic Improvements
    3.3.1 Algebraic Simplifications (Local Gain; Admissibility Condition; Total Gain)
    3.3.2 Improvement of the Admissibility Check
    3.3.3 Reducing the Problem Size
    3.3.4 Computing the Index Function
    3.3.5 Parameters Dependence and Numerical Issues (Discount Cost (α); Arrival Rate (λ); Number of Servers (s); Θ Precision)
  3.4 Complexity and Benchmarks
    3.4.1 Value-Determination Operation (Solving J_θ)
    3.4.2 Policy-Improvement Routine (Maximal Complexity; Average Complexity)
    3.4.3 Finding Θ(R) (Maximal Complexity; Average Complexity; Improving the Maximal Complexity)
    3.4.4 Dichotomy (First Phase; Second Phase; Improved Maximal Complexity)
    3.4.5 Space Complexity
    3.4.6 Benchmarks
  3.5 Numerical Experiments
    3.5.1 Strategies
    3.5.2 Uniprocessor Systems
    3.5.3 Multiprocessor Systems
    3.5.4 Robustness
    3.5.5 Sojourn Time Distribution
  3.6 Realistic Experiments
    3.6.1 SimGrid Software
    3.6.2 A Grid Model Using SimGrid
    3.6.3 Traces
    3.6.4 Experimental Scenarios (Several Inputs; The Effect of Heterogeneity; Information Delays)
    3.6.5 Sojourn Time Distribution
  3.7 Conclusions

4 Index Brokering of Batch Jobs
  4.1 Introduction
  4.2 Batch Arrivals with Known Distribution
    4.2.1 Mathematical Model
    4.2.2 Threshold Policy: Differences with Sequential Jobs (Properties of Optimal Threshold Policy)
    4.2.3 Computing the Optimal Threshold: Algorithm and Optimizations (Algebraic Simplifications: Local Gain, Admissibility Condition, Global Gain; Improvement of the Admissibility Check; Reducing the Problem Size)
    4.2.4 Computing the Index Function
    4.2.5 Complexity (Value-Determination Operation (Solving J_θ); Policy-Improvement Routine; Finding Θ(R); Dichotomy; Benchmarks)
    4.2.6 Numerical Experiments (Impact of the Architecture; Impact of the Job Width Distribution; Robustness on Load Variations; Robustness on Job Width Distribution)
  4.3 Batch Arrivals with Known Sizes
    4.3.1 Mathematical Model
    4.3.2 Bellman's Equation
    4.3.3 Algorithms (Computing the Best Thresholds; Computing the Index)
    4.3.4 Simulations
  4.4 Batch Arrivals with Synchronous Departures
    4.4.1 State Transitions (Job Arrival; Process Departure)
    4.4.2 Alternative Formulation (Case X > 0; Case X = 0)
    4.4.3 Bellman's Equation
    4.4.4 Algorithm
  4.5 Parallelization
    4.5.1 Interval Division (Avoiding Duplication; Improving the Interval Division)
    4.5.2 Pool of Rejection Cost Values (Computing Processes; Coordinator Process; Pool of Values; Distributed Dichotomy; Cleaning the Pool; Filling the Pool)
  4.6 Conclusions

5 Contribution, Future Works and Conclusion
  5.1 Summary of Contribution
    5.1.1 Random Brokering
    5.1.2 Index Brokering
  5.2 Open Questions and Future Work
    5.2.1 Random Brokering (Chapter 2): Standard Deviation and Error; Missing Experimentations; Other Distributions; Non-Saturated Parallel Systems; Asynchronous and Semi-Synchronous Systems; Slowdown on Submitted Jobs; Missing Formal Proofs
    5.2.2 Index Brokering (Chapter 3): Alternative Cost Strategy; Advanced Realistic Simulations; Implementation in Real/Production Environment; Proof of Conjecture 3.12
    5.2.3 Batch Jobs (Chapter 4): Improvement of BatchK Algorithm; Optimal Strategy; Semi-Synchronous Systems (BatchS); Parallelization Implementation; Full Proof of Convexity
  5.3 Final Conclusion

A Splitting of Stochastic Process

B Used CPUs: Parallel Case
  B.1 Case s = 2
  B.2 Case s = 3
  B.3 General Case
  B.4 Worst Case
  B.5 Equidistributed Case

C Convexity
  C.1 Sequential Case (C.1.1 Case x < B − 2; C.1.2 Case x = B − 2)
  C.2 Batch Case (C.2.1 Case x < B − 2; C.2.2 Case x = B − 2; C.2.3 Case x = B − 1; C.2.4 Case B ≤ x ≤ B + K − 3)

D Solving a (K + 2)-diagonal System

References (Webography; Personal Bibliography; General Bibliography)

Symbols

Index


Chapter 1

Introduction to Grid Brokering

Chapter Abstract. This chapter introduces the framework in which our work has been undertaken. We present general definitions about grid computing, and more specifically about grid brokering. We establish our mathematical model of a computational grid and give the definitions needed for the remainder of this document.

Chapter Contents

1.1 Motivations and Context
1.2 Definitions and Grid Modeling
1.3 Brokering

1.4 Examples

1.5 Cost Function

1.6 Concepts


1.1 Motivations and Context

When a scientist needs to perform some small computations, he typically launches a dedicated application on a personal computer, which gives the required results. We might say that a user sends a job to a server, and gets back the results of its computation. This model applies as long as the need for computational power is not too high. Once the machine usage reaches the maximal power of the computer in use, we may consider two ways of tackling the increasing need for computational power. First, buy a new machine, with better performance, more powerful processors, or more processors (e.g. Massively Parallel Processors). Of course, this is limited by the speed at which engineers improve CPU performance or memory accesses. The second way consists in gathering several machines and making them work together.

Infrastructures grouping several (often homogeneous) machines such as simple personal computers, with some coordination mechanisms, are generally named clusters. According to Buyya [30]:

a cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone computers working together as a single, integrated computing resource.

Clusters are generally owned by organisations such as labs, universities, or industries, and are shared by several users. They contain between a few and several thousand machines. The network interconnecting cluster machines is usually a high performance LAN (Local Area Network), which allows fast communications and data transfers.

Even if there are no theoretical limitations in terms of the number of machines, clusters have mainly two disadvantages: first, large clusters are very expensive and very difficult to manage. Secondly, following the evolution of the experiments for which the cluster has been set up, such a parallel system is often underused at some periods, and saturated at others. Those reasons, as well as many others, led scientists to the idea of Grids: if computers can be put together in order to form a cluster, clusters could be put together in order to form some kind of cluster of clusters, or meta-cluster, or Grid.

A Grid is typically an infrastructure coordinating several clusters spread around a country, or around the world, hosted by partners of a common project, and communicating through a WAN (Wide Area Network), such as the Internet, or some rented links.

As grids are often oriented toward scientific applications, they generally contain not only clusters, but also mass storage devices for storing experimental results or various other kinds of data. Storing huge amounts of data in grid systems requires solving transfer, duplication, security, scheduling and other management issues. Due to the slowness and the unreliable nature of WAN communication channels, those problems are often far more difficult than they were in clusters, with their fast and controllable networks.

Grids have become prevalent infrastructures for intensive computational tasks. The word "Grid" is probably now one of the most popular terms in computer science, and the scientific community widely agrees that physicists, biologists, or even mathematicians will not be able to tackle tomorrow's problems without such computational systems.

But what exactly is a Grid? Ian Foster, one of the so-called "fathers of the grid" [39], describes on his personal webpage a Grid as:

a system that coordinates resources that are not subject to centralized control, using standard, open, general-purpose protocols and interfaces, to deliver nontrivial qualities of service.

With that definition, a lot of distributed systems correspond to grids. In this work, we are mainly going to focus on Grids as being a set of resources, such as clusters, massively parallel processor machines, or mass storage devices, linked together through some wide area network, and managed at a high level by some middleware.

There are mainly two kinds of grids. The first one concerns grids in which the large majority of the work is number crunching (or CPU-bound), and requires neither large amounts of data nor heavy network communication. This kind of grid is usually called a Computational Grid. The second class regroups systems where data is the center of everything; jobs are data intensive (or IO-bound). They require storing huge amounts of data (often petabytes), which, of course, regularly need to be transferred between computational or storage resources. These grids are named Data Grids. This work mainly focuses on the first family, Computational Grids.

Ideally, Grids should be as easy to use as a simple computer, in the same way that the Internet is as easy to use as if the data were locally available, or the electric power grid (from which the Grid takes its name) is as easy to access as if the power plant were just behind the wall.

Nevertheless, if Grids are on their way to becoming efficient and user-friendly systems, computer scientists and engineers still have a huge amount of work to do in order to improve their efficiency. Amongst a large number of problems to solve or to improve upon, the problem of scheduling the work and balancing the load is of first importance.

The main subject of this thesis will be this last problem: how to efficiently or fairly distribute the work across a Computational Grid? This task is usually called Meta-Scheduling (as opposed to Scheduling, which treats the same subject at a local level) or Brokering. Even if some authors consider the brokering task to be of minor importance, we will show that the brokering policy can drastically change the system performance, and not necessarily in the expected way.


1.1.1 Outline

This work is mainly split into two parts. After introducing the mathematical framework on which the rest of the manuscript is based, we will in Chapter 2 study systems where the grid brokering is done without any feedback information, i.e. without knowing the current state of the clusters when the resource broker (the grid component performing the brokering) makes its decision. We show there how a computational grid behaves if the brokering is done in such a way that each cluster receives a quantity of work proportional to its computational capacity.

This part is based on several publications, in collaboration with Joël GOOSSENS (ULB) and Emmanuel JEANNOT (Loria, Nancy): my DEA thesis [15], a conference paper summarizing my DEA thesis (ISPA04, Hong Kong) [19], a journal paper extending our model to heterogeneous systems (IEEE Transactions on Parallel and Distributed Systems, 2006) [22], as well as some technical reports (INRIA and ULB) [21, 20]. Notice that, with the same authors, we also have a contribution on fault-tolerant Real-Time systems [23], but as this research is outside the scope of the present work, we have not included it here.

The second part of this work (Chapters 3 and 4) is rather independent from the first one, and consists in the presentation of a brokering strategy, based on Whittle's indices, trying to minimize as much as possible the average sojourn time. We show how efficient the proposed strategy is for computational grids, compared to the ones popularly used in production systems. We also show its robustness to several parameter changes, and provide several very efficient algorithms to perform the computations required by this index policy.

This second part is the fruit of a collaboration with Bruno GAUJAL (IMAG, Grenoble). On this subject, initiated by my three-month stay at IMAG in April 2005, we have one journal paper giving the performance of index brokering on realistic workloads (Parallel Computing, Elsevier) [17], one conference paper extending our first model to batch jobs (NetCoop07, Avignon) [18], as well as an INRIA research report [16].


1.2 Definitions and Grid Modeling 17

1.2 Definitions and Grid Modeling

We will now formalize several concepts and give definitions about Grids.

In this work, we will consider a quite simple but fairly realistic model of a computational grid. Our analysis is mainly focused on computational grids, as we assume that network latencies and transfer times are negligible compared to computation times; we thus do not take data localization into account.

The model we now present could seem in some ways quite far from reality. However, we believe that models which are an exact representation of reality are, at least in the case of Grids, mathematically intractable. Analyses, predictions, or strategies cannot be provided in such models due to their complexity. This is why we prefer to consider a simpler model, even if we lose some realism, but on which some theoretical analysis is possible. Nevertheless, there are plenty of examples in the literature where theoretical models allow improving the performance of real systems, or predicting their behavior, even when the model simplifies the real environment. Indeed, in this work we will give such an example, where a simple model allows finding a very efficient strategy for a realistic model.

The aim of most computational systems is to run jobs. This is true for Grids as well. As the literature gives a lot of definitions around this term (job, process, task, subprocess, program, thread, application, operation...), we first need to clarify the concepts which will be at the center of our concern in this work.

Definition 1.1 (Process)

A Process is a sequence of computing operations running on a single processor.

We do not consider in this work the subdivision of a process into several threads, as it is sometimes done.

Definition 1.2 (Job)

A Job is a set of one or several process(es).

A job is said to be uniprocess (or sequential) when it is composed of one process, and multiprocess (or parallel) when it is composed of several (possibly just one) processes.

As a computational grid is generally described as a set of Computing Elements, we need to define such a component.

Definition 1.3 (Computing Element, Cluster)

A Computing Element (also called a Cluster) is a set of CPUs or servers, and a single queue, both managed by a scheduler using a specific scheduling policy.

A Computing Element gets jobs, and runs processes on its CPUs.


In this work, we will only consider homogeneous Computing Elements, composed of identical CPUs.

Definition 1.4 (Client)

A Client is a grid user who sends jobs to a grid, in order to run them on some CPU, and to get back results.

One may consider that a grid has only one client, as a client can emulate the work of several clients.

While CPUs need to be managed in a Computing Element (CE), CEs need to be managed in a grid. This is done by a kind of orchestra conductor, usually named the Resource Broker.

Definition 1.5 (Resource Broker, Router, Meta-scheduler)

A Resource Broker (also called a Router or a Meta-scheduler) is a grid component receiving jobs from Clients and sending each of them to a chosen Computing Element.

Definition 1.6 (Brokering, Routing, Meta-scheduling)

The Brokering is, in a computational grid, the action of choosing the Computing Element to which an incoming job is sent. The Brokering policy is the way of choosing such an action.

We now have every element we need to provide a first definition of a Grid.

Definition 1.7 (Grid)

A Grid is a set of Computing Elements, linked together through a Resource Broker, receiving jobs from one or several clients.

Definition 1.8 (Computational Grid)

A Computational Grid, in contrast to a Data Grid, is a Grid for which the usage is computation oriented. Data transfers, data duplication, cache coherency management, large database accesses, etc. are considered to be negligible with respect to the time spent in actual computations.

In the remainder of this work, we will focus only on Computational Grids.

The next two sections consist of formal definitions of the concepts introduced more verbosely above. In the next section (1.2.1), we present a model inspired by the queuing theory community [51, 38, 62]. Then, we adapt this model in Section 1.2.2, following a point of view coming from the scheduling community [56, 30]. We will finish by comparing those two models, showing that in some respects, they are not that different.

1.2.1 Queuing Model

We can now give more formal definitions of several concepts evoked in the previous section. Figure 1.1 shows a general overview of the model of grid structure we consider here. The left figure gives a rather high-level point of view, while the right one goes deeper into the queuing model.

Figure 1.1: Two models of a Computational Grid with Resource Broker. The Grid is composed of N Computing Elements (or queues), the i-th queue being composed of s_i CPUs (or servers) of speed (or rate) µ_i. The system input, with rate λ, is routed amongst the N queues.

Definition 1.9 (CPU, Server)

A CPU or a server is a component able to perform computing operations, at a speed given by its service rate µ, with 0 < µ < ∞.

Definition 1.10 (Process)

A Process p is a set of computing operations which, when performed by a CPU of rate µ, will use it continuously during, on average, 1/µ units of time.

As is classically done in queuing theory, processes do not really have a pre-defined execution time; the execution time is fully determined by the server on which the process runs. For instance, if a server provides a service time following an exponentially distributed random variable, one may consider that at each infinitesimal period of time, the server decides whether the process continues or stops its execution, following an exponential distribution.

(22)

From now on, servers are assumed to run processes without preemption (a process cannot be interrupted in order to run another process on the same server), and without migration (once a process has started on a server, it stays on this server until the end of its execution).

As we shall study stochastic workloads, the execution time is obtained from a random variable, by rolling a (continuous) die. As long as we use the same random variable, and if the process lengths are not taken into account for any scheduling decision, it is equivalent to assume the die to be rolled by the server when the job starts, or by the client at submission time. In the first model, inspired by the queuing theory community, we assume the execution time is chosen by the server starting the process, or even during its execution. In the second model, the execution time is chosen by the client submitting the job.

From the last definition, a process thus does not have a true (absolute) execution time. Its effective execution time is determined by the server (or CPU) on which the process runs. If a server has a rate of µ, this means that the average (effective) execution time of processes running on this server is 1/µ. For instance, in the case of exponentially distributed execution times, the rate of the distribution will be µ.
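To make the rate interpretation concrete, here is a minimal Python sketch (the rate values are illustrative, not taken from this thesis) drawing exponentially distributed effective execution times on servers of different rates; the empirical mean approaches 1/µ, as stated above.

    import random

    def effective_execution_time(mu):
        """Draw one effective execution time on a server of rate mu.

        With exponentially distributed service times, the mean is 1/mu:
        the faster the server (larger mu), the shorter the process.
        """
        return random.expovariate(mu)

    # Empirical check: the average over many draws approaches 1/mu.
    for mu in (0.5, 1.0, 2.0):  # hypothetical server rates
        samples = [effective_execution_time(mu) for _ in range(100_000)]
        print(f"mu = {mu}: empirical mean = {sum(samples) / len(samples):.3f}, "
              f"expected 1/mu = {1 / mu:.3f}")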

Definition 1.11 (Job)

A Job is a set of one or several processes, arriving together at a given submission time, and having to be sent to the same CE. The number of processes composing a job is named its width.

Definition 1.12 (Sequential Job, Uniprocess Job)

A Sequential Job (or Uniprocess Job) is a job composed of only one single process (job width=1).

In the case of sequential jobs, we will not differentiate jobs and processes. We will consider jobs as having their own execution time, because a job will have the execution time of its unique process.

In the case of parallel jobs, the execution time of a job can also be defined, but needs some convention. It could be, for instance, the sum or the maximum of the execution times of its processes, or the total time during which at least one of its processes was running. Notice that in this last case, the job execution time depends upon the scheduling, which is not always convenient. In this work, we do not consider the execution time of jobs.


Figure 1.2: Execution time of parallel jobs. Left: Asynchronous, Center: Semi-Synchronous, Right: Fully-Synchronous.

Definition 1.13 (Parallel Job, Multiprocess Job)

A Parallel Job (or Multiprocess Job) is a job with one or several independent processes.

By independent processes, we mean that there are neither precedence dependencies nor common resources (and thus no communication) between processes.

We do not require parallel jobs to have more than one process; a sequential job is then a particular case of a parallel job.

Definition 1.14 (Asynchronous Parallel Job)

An Asynchronous Parallel Job is a parallel job for which, once in a cluster queue, every process is independent of the other processes.

Definition 1.15 (Semi-Synchronous Parallel Job)

A Semi-Synchronous Parallel Job is a parallel job with the constraint that its processes have to start their execution simultaneously, but are independent afterwards.

Definition 1.16 (Fully-Synchronous Parallel Job)

A Fully-Synchronous Parallel Job is a parallel job with the constraint that its processes have to start their execution simultaneously, and release the CPUs only when all its processes are completed.

Definition 1.17 (Synchronous Parallel Job)

A Synchronous Parallel Job is a parallel job which is either semi-synchronous or fully-synchronous.

In this work, we assume that jobs are independent, i.e., they do not share common resources except CPUs, and there are no precedence constraints between jobs, as we assumed for processes.


Definition 1.18 (Job Flow)
A Job Flow F_{A_λ, W_K} is an infinite set of jobs, for which:

• the inter-arrival delay follows the probability density function¹ A_λ, with 0 < λ < ∞ and E[A_λ] = λ⁻¹;

• the job width is an integer between 1 and K, distributed according to the law W_K.

Notice that another, very similar model could have been considered: we can have K arrival processes, the process k following a law A_{λ_k} and submitting only jobs of width k. Those K arrival processes are merged before entering the system. In order to make the two models comparable, the λ_k need to be chosen accordingly: if w_k = P[W_K = k], we should have λ_k = λ w_k. In the case of Poissonian arrivals, where merging K processes of arrival rates λ_k is equivalent to a Poissonian process of arrival rate Σ_k λ_k, and as a given arrival in the second model has a probability w_k of corresponding to a job of width k, both models are equivalent.
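As a quick sanity check of this equivalence, the following sketch (a minimal illustration assuming Poissonian arrivals, with hypothetical values for λ and for the width law W_K) simulates both constructions and compares, for each width, the number of arrivals over a long horizon.

    import random

    LAMBDA = 2.0                           # hypothetical total arrival rate
    WIDTH_LAW = {1: 0.5, 2: 0.3, 3: 0.2}   # hypothetical W_K, here K = 3
    HORIZON = 50_000.0

    def poisson_arrival_times(rate, horizon):
        """Arrival times of a Poisson process of the given rate on [0, horizon[."""
        times, t = [], 0.0
        while True:
            t += random.expovariate(rate)
            if t >= horizon:
                return times
            times.append(t)

    # Model 1: one stream per width k, of rate lambda * w_k, then merged.
    merged = {k: len(poisson_arrival_times(LAMBDA * w, HORIZON))
              for k, w in WIDTH_LAW.items()}

    # Model 2: a single stream of rate lambda, each job drawing its width from W_K.
    single = dict.fromkeys(WIDTH_LAW, 0)
    for _ in poisson_arrival_times(LAMBDA, HORIZON):
        k = random.choices(list(WIDTH_LAW), weights=list(WIDTH_LAW.values()))[0]
        single[k] += 1

    for k, w in WIDTH_LAW.items():
        print(f"width {k}: merged {merged[k]}, single {single[k]}, "
              f"expected {LAMBDA * w * HORIZON:.0f}")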

Definition 1.19 (Computing Element)
A Computing Element C_i is

• a set of s_i homogeneous servers (or CPUs) having a service rate µ_i;

• with a queue (or buffer) having a capacity B_i − s_i, where jobs are stored while there are not enough free servers to run them; the capacity of C_i will be B_i;

• running jobs following a scheduling policy S_i.

A Computing Element C_i is then identified by the tuple {s_i, µ_i, B_i, S_i}.

The capacity of a computing element (CE) means that if this CE already contains as many jobs as its capacity, any incoming job sent to this CE will be rejected. Of course, as much as possible, a resource broker (RB) should avoid sending a job to a CE that has already reached its capacity, but, as we will see later, an RB might not be informed of such a situation.

In the following, the local scheduling policy S will be by default FCFS (First Come First Served, see [49] for instance), also known as FIFO (First In First Out), and will be the same for every Computing Element. The principle of the FCFS scheduling policy is to only consider the job at the front of the queue. If there are enough available servers for this job to be executed, the job starts. If not, the CE waits until enough servers are freed (other jobs are not considered meanwhile).

¹ The probability density function f : R → R⁺ of a random variable expresses its density of probability; the area under the curve f between two abscissas a and b is the probability that a drawing of the random variable will be between those two numbers.

Figure 1.3: Example of scheduling, with seven jobs arriving successively, having 3, 2, 2, 2, 4, 2 and 2 processes. Top: FCFS scheduling, Bottom: FF scheduling.

Here are some definitions that help characterize a scheduling policy.

Definition 1.20 (Eligible)

A synchronous parallel job of width n is said to be Eligible if at least n CPUs are free in the same CE.

A process belonging to a sequential or an asynchronous parallel job is said to be Eligible if at least one CPU is free.

Definition 1.21 (Scheduling Policy)

A Scheduling Policy (or a Scheduling Strategy) is a function which gives, for any cluster configuration and if the queue is not empty, the next job (in the case of parallel synchronous jobs) or process (in the case of sequential or parallel asynchronous jobs) to start, eligible or not. The chosen job/process is named the highest priority job or the highest priority process.

Here, the cluster configuration contains any information about the cluster available to the scheduler. This can simply be the number of running and waiting jobs or, in more complex systems, the arrival, start and/or (expected) end time of any job.


Notice that this definition requires the scheduler to be non-ambiguous: for any cluster configuration, there is exactly one highest priority job/process (except if the queue is empty). But this does not prevent the scheduler from starting several jobs/processes at the same time, because once a job is launched, the cluster configuration changes, and another job can get the highest priority at the same time.

Moreover, the job/process returned by the scheduling policy is not necessarily the next one to effectively start: it could happen that, at some time, the highest priority job/process is not eligible, and that the configuration changes in such a way that the policy gives the highest priority to another job/process before the previous highest priority job/process has started.

Definition 1.22 (Greedy)

A scheduling policy is said to be Greedy (also called expedient) if it never leaves any resource idle intentionally. If a system runs a greedy policy, a resource is idle only if there is no eligible job waiting for that resource.

Notice that, in the parallel case, FCFS or Backfilling [61] are not greedy, while FF (First Fit, see below) is. In the sequential case, FCFS, which corresponds to FF, is greedy.

Definition 1.23 (ASAP)

An ASAP scheduler, standing for as soon as possible, is a scheduler which starts any job/process at the first time this job/process is both the highest priority job/process and eligible.

Remark that, in the case of non-preemptive systems, non-ASAP schedulers could be more efficient than ASAP ones. For instance, a strategy could choose to delay the scheduling of a job, hoping for the imminent arrival of a "better" job. However, in the remainder of this work, we only consider ASAP schedulers.

In the case of synchronous parallel jobs, another scheduling policy will be considered in this document: FF, standing for First Fit [40]. With this policy, the first eligible job starts when possible. FF and FCFS are of course identical in the case of sequential jobs or asynchronous parallel jobs. A strategy based on FF and using advance reservation is used in OAR, the batch scheduler of Grid'5000 (see [6] and Section 1.4.3). An example of these scheduling policies is given in Figure 1.3 for fully-synchronous parallel jobs, and a small sketch contrasting them follows below.
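The difference between the two policies can be illustrated with a deliberately minimal sketch (a queue of job widths and a count of free CPUs, nothing more; this is not a production scheduler). The diverging case below is the typical one: a wide job at the head of the queue blocks FCFS, while FF lets a narrower job jump ahead.

    from collections import deque

    def next_job_fcfs(queue, free_cpus):
        """FCFS: only the job at the head of the queue may start."""
        if queue and queue[0] <= free_cpus:
            return 0
        return None  # the head job does not fit: every job waits

    def next_job_ff(queue, free_cpus):
        """First Fit: the first eligible job (width <= free CPUs) starts."""
        for i, width in enumerate(queue):
            if width <= free_cpus:
                return i
        return None

    # The policies diverge when the head job does not fit but a later one does.
    waiting = deque([4, 2, 2])   # widths of waiting fully-synchronous jobs
    print(next_job_fcfs(waiting, free_cpus=2))  # None: the width-4 job blocks the queue
    print(next_job_ff(waiting, free_cpus=2))    # 1: the first width-2 job starts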

It is because we look at ASAP systems (see Definition 1.23) that a synchronous system is not a particular case of an asynchronous system. Indeed, if we do not impose the system to be ASAP, an asynchronous policy can choose not to start the processes of a job as soon as possible. In particular, the policy could choose to start every process of the same job simultaneously.


Very often, today's cluster schedulers do not use FF or FCFS, but more sophisticated techniques, using preemption (a process can be interrupted and resumed later), migration (a process can be interrupted on a server, and resumed on another server, either on the same cluster (intra-cluster migration) or on another cluster (inter-cluster migration)), dynamic priority systems, etc. See for instance [49] for more details. In this thesis, we do not focus on cluster performance, since our aim is to compare various brokering strategies. It is reasonable to assume that the comparison between strategies will not be much impacted by a better low-level scheduling strategy, because every cluster would improve its performance in a similar ratio. This assumption can of course only be made if the local strategies are not drastically different from FCFS, and show only marginal divergences.

For instance, comparing brokering strategies when the local scheduling policy is Last Come First Served will probably not lead to the same conclusions as when the local scheduling policy is FCFS. This is why we only consider simple scheduling strategies, and non-preemptive systems.

Based on the last definition, we can now give a formal definition of the mathematical object we name a Grid System, formalizing the definition of a Computational Grid.

Definition 1.24 (Grid System)
A Grid System G is a tuple {{C_i}_{i∈[1,...,N]}, F_{A_λ, W_K}, β}, where

• {C_i}_{i∈[1,...,N]} is a set of N Computing Elements C_i = {s_i, µ_i, B_i, S_i};

• F_{A_λ, W_K} is the arrival job flow;

• β is the Brokering policy.

A formal definition of brokering will be given further in this document.

Figure 1.1 gives two visions of a computational grid. On the left-hand side, the Resource Broker (RB) and the Computing Elements (C_i) are seen as black boxes. Clients send jobs to the RB, which dispatches them to some C_i according to its routing policy. The right figure gives a more "queuing theory" oriented approach. A stream of jobs having a rate λ comes into the system. This stream is split in such a way that each job is sent to one Computing Element, which can be seen as a buffer (waiting queue) associated with some servers (CPUs). The s_i servers of the queue C_i have a service rate of µ_i.

1.2.2 Scheduling Model

From a scheduling point of view, each incoming job j is composed of one or several processes p, each having a virtual execution time ℓ_p (or virtual execution length), chosen by the client before submitting the job, possibly from a probability distribution. This virtual execution time means that, on a processor of relative speed µ = 1, the process p would take ℓ_p units of time. Then, on a CPU with any relative speed µ, a process p will have an effective execution time of ℓ_p/µ. Or, if a process runs during ℓ units of time on a CPU of speed µ_1, it will run ℓ · µ_1/µ_2 units of time on a CPU of speed µ_2.

If the broker choices are made without taking into account job and/or process lengths, choosing the process length at submission time (scheduling model) or when the job starts (queuing model) is equivalent. And if the local scheduling decision (choosing the CPU) is also made without information about lengths, the process length can even be chosen when it starts.

These points of view are thus not incompatible; from a "macroscopic" vision, the scheduling model with an average virtual execution time of 1 and, let us say, a distribution D with E[D] = 1, will behave the same way as the queuing model with a rate µ, if the execution distribution is D′ such that f_D(m) = f_{D′}(m/µ) ∀m ∈ R⁺, where f_D and f_{D′} are the probability density functions of D and D′. Notice that compelling the average virtual execution time to be equal to 1 is not restrictive: this only constrains the choice of the time unit in order to reach a unitary average. If the virtual execution times are scaled, relative speeds are scaled accordingly.
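The equivalence between the two points of view can be observed numerically. The following minimal sketch assumes D is exponential with E[D] = 1 and takes µ = 2 (both choices are illustrative): drawing a virtual length and dividing it by µ, or drawing the effective time directly from the scaled distribution D′, yields the same distribution.

    import random

    MU = 2.0          # hypothetical CPU relative speed / service rate
    N = 100_000

    # Scheduling model: the client draws a virtual length from D (here
    # exponential with E[D] = 1); a CPU of speed MU executes it in l / MU.
    sched = [random.expovariate(1.0) / MU for _ in range(N)]

    # Queuing model: the server draws the effective execution time directly
    # from the scaled distribution D' (here exponential with rate MU).
    queue = [random.expovariate(MU) for _ in range(N)]

    # Both samples come from the same distribution: compare means and quantiles.
    for name, xs in (("scheduling", sched), ("queuing", queue)):
        xs = sorted(xs)
        print(f"{name:10s} mean={sum(xs)/N:.3f} median={xs[N//2]:.3f} "
              f"p90={xs[int(0.9 * N)]:.3f}")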

We assume in our model that the execution time (or its average) depends neither on environmental factors, such as data transfers or communication costs, nor on local factors such as boot time, migrations, preemption or other scheduling costs.

As for the service rate µ, one can give two interpretations of the arrival rate λ.

From a queuing theory point of view, we have the concept of a job stream, and λ can be seen as the inverse of the average inter-arrival delay. For instance, if arrivals follow a Poisson process, λ is the rate of this process.

From a scheduling point of view, we have an infinite set of job ids J (J ≜ N = {1, 2, . . . })², and for each j ∈ J, a_j is the submission (or arrival) time of job j. The arrival rate λ is considered as the inverse of the average inter-arrival delay. In other words, {a_j | j ∈ J} is such that

\[
\lim_{t \to \infty} \frac{\sum_{\{ j \in J \mid a_j < t \}} (a_j - a_{j-1})}{\| \{ j \in J \mid a_j < t \} \|} = \lambda^{-1}, \qquad \text{with } a_0 \triangleq 0.
\]

As λ < ∞ (and then λ⁻¹ > 0), we can without loss of generality assume³ that J is sorted by submission time, meaning that ∀i < j, a_i ≤ a_j. This schema could help to visualize our system:

    time axis: 0 → a_1 → a_2 → ⋯ → a_{j−1} → a_j → ⋯ → a_{k(t)} → t

We then sum up every interval between 0 and a_{k(t)} (the last arrival before t), and divide this value by the number of intervals (k(t)). Of course, the sum equals a_{k(t)}.

² The symbol ≜ means "is by definition" or "is defined as".

³ If λ can be null, it is not always possible to re-order the jobs. For instance, we could have infinitely many jobs arriving at some time t, and one job arriving after t: there is no possible numbering allowing this scenario to be sorted by submission time. Or we could have a_1 = 2 and a_i = 1 − 1/i ∀i > 2, which does not allow a re-ordering either.

This expression can be simplified: the sum telescopes, since every term but the first and the last cancels out. Then,

\[
\lim_{t \to \infty} \frac{a_{k(t)}}{k(t)} = \lambda^{-1} \tag{1.1}
\]

where k(t) = max{j ∈ J | a_j < t}. We need the set {j ∈ J | a_j < t} to be finite, or at least its maximum to exist for any t; λ < ∞ (that is, λ⁻¹ > 0) is a sufficient condition for that.

We introduce here a new notation.

Definition 1.25 (Asymptotic behavior)

A function f_1(t) behaves asymptotically like f_2(t) (denoted f_1(t) ∼_t f_2(t)) iff

\[
\lim_{t \to \infty} \frac{f_1(t)}{f_2(t)} = 1.
\]

Equivalently, we have f_1(t) ∼_t f_2(t) ⇔ f_1(t) = f_2(t) + ε(t), with lim_{t→∞} ε(t)/f_2(t) = 0.

According to this definition, we can derive some results about the asymptotic behavior of the inter-arrival delay.

Lemma 1.1

\[
a_{\max\{ j \in J \mid a_j < t \}} = a_{k(t)} \sim_t t.
\]

Proof
We have to show that a_{k(t)} ∼_t t, or that lim_{t→∞} t/a_{k(t)} = 1. By definition of k(t), we have a_{k(t)} ≤ t ≤ a_{k(t)+1}.


Then,

\[
1 \le \frac{t}{a_{k(t)}} \le \frac{a_{k(t)+1}}{a_{k(t)}} = \frac{a_{k(t)+1}}{k(t)+1} \cdot \frac{k(t)+1}{a_{k(t)}}.
\]

Taking the limit when t → ∞, we have (with Equation (1.1))

\[
1 \le \lim_{t \to \infty} \frac{t}{a_{k(t)}} \le \lambda^{-1} \cdot \left( \lambda + \lim_{t \to \infty} \frac{1}{a_{k(t)}} \right) = 1.
\]

We then have lim_{t→∞} t/a_{k(t)} = 1.

Now we have:

Theorem 1.2

\[
\frac{t}{\max\{ j \in J \mid a_j < t \}} \sim_t \lambda^{-1}.
\]

Proof
We have lim_{t→∞} a_{k(t)}/k(t) = λ⁻¹ (from Equation (1.1)), and lim_{t→∞} a_{k(t)}/t = 1 (from Lemma 1.1). Then,

\[
\lim_{t \to \infty} \frac{a_{k(t)}/k(t)}{a_{k(t)}/t} = \frac{\lambda^{-1}}{1}
\qquad \text{and} \qquad
\lim_{t \to \infty} \frac{t}{k(t)} = \lambda^{-1}.
\]

This theorem will be useful later.
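Theorem 1.2 lends itself to a quick numerical check. The sketch below (assuming Poissonian arrivals, with an illustrative rate λ = 4) counts k(t), the number of arrivals before t, and shows t/k(t) approaching λ⁻¹ as t grows.

    import random

    LAMBDA = 4.0   # hypothetical arrival rate, so 1/lambda = 0.25

    random.seed(0)
    arrivals, t = [], 0.0
    while t < 100_000.0:
        t += random.expovariate(LAMBDA)   # inter-arrival delays A_lambda
        arrivals.append(t)

    for horizon in (10.0, 1_000.0, 100_000.0):
        k = sum(1 for a in arrivals if a < horizon)   # k(t) = max{j : a_j < t}
        print(f"t = {horizon:>9}: t / k(t) = {horizon / k:.4f} "
              f"(1/lambda = {1 / LAMBDA})")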

1.2.3 System Load

When studying computational systems, it is generally required to be able to measure the load of the system, giving an evaluation of the amount of work the system is processing. Again, we will give two approaches to this measurement, the first one from the scheduling world, the second one from queuing theory.

First, we need to define the concept of computational capacity:


Definition 1.26 (Computational capacity)
The Computational capacity of C_i is defined as µ_i s_i; the Computational capacity of a grid G is defined as Σ_{i=1}^{N} µ_i s_i, and is denoted M.

The computational capacity can be seen as the number of virtual CPUs, or the number of homogeneous CPUs of rate µ = 1 equivalent to the original system, in a perfect world where one can perfectly take advantage of a larger number of CPUs.

From the queuing theory point of view, it can also be seen as the total rate, or the rate that a unique server would need in order to be virtually equivalent to the whole system. Indeed, the total service rate of a set of servers is the sum of the service rates of the individual servers, which is denoted by M. This number M is of course not necessarily a natural number.

From the scheduling point of view, we now define the system load ν(t_1, t_2) as the total amount of virtual work received in [t_1, t_2[, divided by the product of the total number of virtual CPUs (M) and the total duration (t_2 − t_1). That is,

\[
\nu(t_1, t_2) \triangleq \frac{\sum_{\{ j \in J \mid t_1 \le a_j < t_2 \}} \ell_j}{M \cdot (t_2 - t_1)}.
\]

Obviously, if ν(t_1, t_2) > 1, some jobs received in [t_1, t_2[ cannot be completed. The system is then not schedulable on [t_1, t_2[. Here, not schedulable on [t_1, t_2[ means that there is no brokering and/or scheduling decision allowing all jobs received in [t_1, t_2[ to finish before t_2. In other words, if the arrival pattern on [t_1, t_2[ is indefinitely repeated with a period t_2 − t_1, at least one queue will grow indefinitely, or, if queue sizes are bounded, an unbounded number of jobs will be rejected.

Of course, being not schedulable on [t 1 , t 2 [ does not mean that jobs still waiting or running at time t 2 will not be completed after t 2 . The system could for instance become schedulable if we extend the range.

The condition ν(t_1, t_2) ≤ 1 is a necessary condition of schedulability, but usually not a sufficient one. For instance, if a job arrives at time t_0 < t_2 with a running time on the fastest CE greater than t_2 − t_0, it cannot complete before t_2. By definition, if at least one job arrives in [0, ∞[, it is always possible to find a non-schedulable interval: t_1 just before a job arrival, and t_2 just after this arrival.

However, in general, we are interested in long intervals, such as [t_1, t_2[ with t_1 → 0 and t_2 → ∞.

Now, from the queuing point of view, we would like to have a similar definition. The load is then classically defined as the arrival rate divided by the total service rate, or:

\[
\nu \triangleq \frac{\lambda}{M}.
\]

(32)

It is easy to see that those two definitions are asymptotically equivalent, i.e. that ν(t) ∼_t ν, where ν(t) stands for ν(0, t). Indeed, as mentioned above, to go from the first model to the second one, we fix ℓ_j = 1 on average (that is, lim_{n→∞} (1/n) Σ_{j=1}^{n} ℓ_j = 1), and keep the same µ_i and λ. We then have

\[
\nu(t) \triangleq \frac{\sum_{\{ j \in J \mid a_j < t \}} \ell_j}{M \cdot t} \tag{1.2}
\]
\[
\sim_t \frac{\| \{ j \in J \mid a_j < t \} \|}{M \cdot t}
= \frac{\max\{ j \in J \mid a_j < t \}}{t} \cdot \frac{1}{M}
\sim_t \frac{\lambda}{M} \;\; \text{(by Theorem 1.2)} \;\; \triangleq \nu.
\]

The necessary schedulability condition now becomes ν < 1.
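The asymptotic equivalence of the two load definitions can be observed on a synthetic trace. In the sketch below (all parameter values are illustrative), virtual lengths ℓ_j are drawn with unit mean, ν(0, t) is computed from the scheduling definition, and it approaches ν = λ/M as t grows.

    import random

    LAMBDA, M = 3.0, 4.0   # hypothetical arrival rate and capacity sum_i mu_i * s_i
    random.seed(1)

    arrivals, t = [], 0.0
    while t < 200_000.0:
        t += random.expovariate(LAMBDA)                 # arrival time a_j
        arrivals.append((t, random.expovariate(1.0)))   # (a_j, l_j), with E[l_j] = 1

    for horizon in (100.0, 10_000.0, 200_000.0):
        work = sum(l for a, l in arrivals if a < horizon)
        print(f"nu(0, {horizon:>9}) = {work / (M * horizon):.4f}")
    print(f"nu = lambda / M = {LAMBDA / M}")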

1.2.4 System State

In order to broker jobs, it is often required to have some knowledge about the current state of the system. A first system state model, applicable only to sequential or parallel asynchronous jobs, would consist in knowing the number of processes in each CE, running or waiting. Each CE can then be characterized by an integer. We will use in this case the following notations:

• x_i is the state of C_i, i.e. the number of processes currently present in the queue (waiting and running). We have x_i ∈ {0, . . . , B_i}.

• x = (x_1, . . . , x_N) is the system state.

In some cases, it will be useful to enrich the Computing Elements' state. We will come back to that point later on (cf. Chapter 4).

Generally, this information does not fully characterize the state of the system, because knowing the system state does not allow predicting the future. The system state could for instance include the elapsed time since the start of the jobs (or processes) currently running, or even their end time. However, it is often not realistic to have such information: knowing the end time of jobs is generally very difficult, or impossible. And in most cases, any information about start times will not give more information about the end time. In the case of parallel jobs, it could be required to know the width of each job waiting in the queue.

(33)

1.2 Definitions and Grid Modeling 31

We denote by S the state space of the system. In the first simple model presented here above, we have S = {0, . . . , B_1} × · · · × {0, . . . , B_N}.

Notice that if, as will often be the case in this work, we focus on systems having the memorylessness property, such as Poissonian arrivals with exponential services, the system state we define here fully characterizes the system; knowing the time processes have already spent in the system does not give more information.

As we assume our system to be ASAP (see Definition 1.23), and as we only consider independent processes for now, we know that if x_i ≤ s_i, there are no jobs pending in the queue. If x_i > s_i, then s_i processes are running, and x_i − s_i processes are waiting.

In Chapter 4, this model will be refined to take parallel jobs into account. In that case, there exist situations where there are idle CPUs while some jobs are waiting in the queue.

1.2.5 Underlying Markov Chain

In several cases we are going to analyze in this document, we will focus on state transitions. If the brokering is only state-dependent, we can consider the underlying Markov chain.

This transition Markov chain gives the probability of going from one state to another, knowing that a transition has occurred. The probability that a transition goes from state x = (x_1, . . . , x_N) to y = (y_1, . . . , y_N) is denoted T((x_1, . . . , x_N), (y_1, . . . , y_N)).

For instance, in the case of sequential jobs, with no simultaneous departures or arrivals, T will be defined as follows. Let D_i(x) be the probability that an event (or transition) is a departure from C_i when the state is x, A(x) the probability that an event is an arrival when the state is x, and B_i(x) the probability that an incoming job is sent to C_i by the broker when the state is x. T is then defined as:

\[
\begin{aligned}
T(x, x - 1_i) &= D_i(x) && \forall i \in \{1, \dots, N\} \\
T(x, x + 1_i) &= A(x) \cdot B_i(x) && \forall i \in \{1, \dots, N\} \\
T(x, y) &= 0 && \text{otherwise,}
\end{aligned}
\]

where 1_i is the vector composed only of 0's, except a 1 at position i. The sum Σ_i D_i(x) represents the probability of having a departure in the system when the state is x, or more exactly the probability that the next event is a departure. As A(x) is the probability that the next event is an arrival, and as we assume for now that there are no events other than arrivals and departures, we have A(x) + Σ_i D_i(x) = 1 ∀x ∈ S. Similarly, we have Σ_i B_i(x) = 1, if we assume that there is no possibility of rejecting a job.
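As an illustration of this kernel, the sketch below builds the matrix T for a small sequential-job grid with two hypothetical Computing Elements and Bernoulli brokering (B_i(x) = β_i, see Section 1.3.4). The departure probabilities use the fact that, in an ASAP system, min(x_i, s_i) servers of C_i are busy in state x. One liberty is taken with respect to the equations above, where rejections were excluded: an arrival routed to a full CE is modelled here as a self-loop (the job is lost).

    import itertools

    LAM = 2.0            # arrival rate lambda (hypothetical)
    MU = (1.0, 0.5)      # service rates mu_i
    S = (2, 4)           # numbers of servers s_i
    B = (4, 6)           # capacities B_i
    BETA = (0.6, 0.4)    # Bernoulli brokering probabilities beta_i

    states = list(itertools.product(range(B[0] + 1), range(B[1] + 1)))
    index = {x: n for n, x in enumerate(states)}
    T = [[0.0] * len(states) for _ in states]

    for x in states:
        # Departure rate of C_i in state x: min(x_i, s_i) busy servers of rate mu_i.
        dep = [MU[i] * min(x[i], S[i]) for i in range(2)]
        total = LAM + sum(dep)
        for i in range(2):
            if x[i] > 0:   # departure from C_i: x -> x - 1_i
                y = tuple(x[j] - (j == i) for j in range(2))
                T[index[x]][index[y]] = dep[i] / total
            # Arrival routed to C_i with probability beta_i: x -> x + 1_i,
            # or a self-loop (lost job) if C_i is already full.
            y = tuple(min(x[j] + (j == i), B[j]) for j in range(2))
            T[index[x]][index[y]] += (LAM / total) * BETA[i]

    assert all(abs(sum(row) - 1.0) < 1e-9 for row in T)  # each row is a distribution
    # In state (0,0) the only possible event is an arrival, so this prints beta_1 = 0.6:
    print(len(states), "states; T((0,0) -> (1,0)) =", T[index[(0, 0)]][index[(1, 0)]])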


1.3 Brokering

As mentioned above, brokering consists in choosing a CE to which to send a job coming to the RB. This is then simply a selection problem: the RB has to choose one CE amongst N, and forward the job to the chosen CE. A job thus never waits in the RB, except for the time required to make the decision (assumed to be negligible in our model), but will eventually be queued in the elected CE.

A job will transit between several states, some of which are assumed to be instantaneous in our model. First, once the client has created a job, the job is said to be submitted. Then the RB gets the job, and chooses a CE to send it to. The job then becomes ready, and turns into queued once actually sent to the CE. The job now waits for one (or several) server(s) to be free, and is said to be running as soon as it starts its execution. When the job has completed, it becomes done.

In our model, we assume that the states queued and running are the only ones to have a (potentially) non-null duration. The durations of the other states are considered negligible in comparison. Notice that, if there are enough free servers when the job is queued, it starts directly, and the queued state then has a null duration.

Notice that, as we assume our system to be non-preemptive, a job remains running until it finishes, and does not go back to the queued state, or any other waiting state.

Brokering policies can be categorized into several families. Mainly, a brokering policy can be:

• Open-loop or closed-loop;

• Memoryless or using historical information;

• Deterministic or probabilistic.

All of these characterization pairs are exclusive: a brokering policy is either open-loop or closed-loop, but never both, and either deterministic or probabilistic, but not both.

1.3.1 Open-loop and Closed-loop

In the first family (open-loop, also known as static), the brokering is done without taking into account dynamic information, such as the current state, or information about running jobs. This kind of policy is only based on static information, such as the number of CEs (N), the number of CPUs (s_i), and/or the speed (µ_i). This kind of model does not apply to most of today's grids, because modern RBs have access to feedback from the CEs. Such strategies can however be used as a comparison point, with the aim of showing the advantage of information feedback. Chapter 2 will be devoted to that kind of system.


The second family (closed-loop, also known as dynamic) allows using static and dynamic information about the system. Typically, a closed-loop broker will use information such as the current queue size, or the number of busy CPUs.

Closed-loop strategies can be sub-categorized, according to which information is available:

• The current state of each CE (x i );

• The width of the job entering the system;

• The number of free CPUs (in case of synchronous jobs);

• The arrival time of jobs;

• The (estimated) end time of running processes;

• Some historical information about any of these values;

• Etc.

In this work, we will only consider strategies using the first three pieces of information: the current state, the width of the entering job (in the case of parallel jobs), and the number of free CPUs (in the case of synchronous parallel jobs).

1.3.2 Memoryless and Historical Information

In many cases, the broker only has information about the current state of the grid, and about static information. The brokering does not depend on information about the past of the system, such as the arrival times of running (or finished) jobs, the state of the system at some time in the past, or previous broker decisions.

A variation on the Round-Robin strategy as used in [41] could for instance be an open-loop strategy (the queue state is not considered) using historical information (past decisions of the resource broker).

1.3.3 Deterministic and Probabilistic

Starting from a given state s, a job coming to a deterministic broker has only one possible destination d. There is then an unambiguous mapping between the state and the destination. Here, the definition of state depends on the case: open- or closed-loop, memoryless or not.

With a probabilistic broker, several destinations could be possible, each of them associated with a probability.


Possibly, a broker could be deterministic in some states, and probabilistic in others. This is of course a particular case of a probabilistic broker.

The various kinds of broker can be combined, but one of the 8 combinations does not make any sense: a broker cannot be open-loop, memoryless and deterministic, since such a broker would send every job to the same CE.

Let us remark that we may consider deterministic strategies as a generalization of (pseudo-)probabilistic ones, or at least that they can emulate them: in such a case, it suffices to add the state of a pseudo-random generator to the state $x$.

1.3.4 Mathematical Model

This work will only consider memoryless brokers. Chapter 2 will be devoted to probabilistic open-loop brokers. Chapter 3 will address deterministic closed-loop systems using only the current state as feedback information. In Chapter 4 we will extend this model and use the job width and/or the number of free CPUs.

Formally, we can then consider a brokering policy as a function returning an integer between 1 and $N$. As it may happen that the whole system is full, the RB needs to be able to reject a job; we associate this rejection with the value 0 of the function.

In the case of a closed-loop broker, this function takes the state and/or some other information as parameter. A brokering policy function for (memoryless) open-loop brokers does not take any parameter.

Deterministic brokers correspond to usual functions (giving one result for a given parameter). Probabilistic brokers depend on a random variable and can, for the same state, give different outputs.

Probabilistic Memoryless Open-loop Brokering

We can now give more formal definitions of several kinds of brokering. The function $B_i(x)$ is defined in Section 1.2.5.

Definition 1.27 (Probabilistic Memoryless Open-loop Brokering)

A Probabilistic Memoryless Open-loop Brokering Policy $\beta$ is a set of $N$ values $\beta_i$ such that $0 \leq \beta_i \leq 1$ and $\sum_{i=1}^{N} \beta_i = 1$, and where transitions in the underlying Markov chain are defined with $B_i(x) = \beta_i$.

Notice that, as there is no feedback information, the brokering function does not know whether a queue is full or not, and is then unable to reject jobs when the elected CE is full; hence there is no $\beta_0$. This then assumes that a CE is itself able to reject an incoming job if its queue is full.
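A minimal Python sketch of such a policy follows; the names beta and choose_ce are hypothetical, and, as just noted, the broker returns an index in 1..N without ever rejecting, leaving the rejection of jobs arriving at a full queue to the CE itself.

```python
import random

def choose_ce(beta):
    """Probabilistic memoryless open-loop brokering (Definition 1.27).

    beta: list of N probabilities beta_1..beta_N summing to 1.
    Returns a CE index in 1..N; there is no value 0 (no rejection).
    """
    u, acc = random.random(), 0.0
    for i, b in enumerate(beta, start=1):
        acc += b
        if u < acc:
            return i
    return len(beta)  # guard against floating-point rounding

# Example: three CEs chosen with probabilities 0.5, 0.3 and 0.2.
print(choose_ce([0.5, 0.3, 0.2]))
```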

Deterministic Memoryless Closed-loop Brokering

If we consider the case where the RB only uses $x$ as input parameter (other cases can easily be derived from this one), the brokering function $\beta$ takes its input from the state space $\{0, \ldots, B_1\} \times \cdots \times \{0, \ldots, B_N\}$, and gives results in $\{0, \ldots, N\}$.

Definition 1.28 (State-based Brokering Policy)

A State-based Brokering Policy $\beta$ is, in the case of sequential jobs, a function $\beta : S \longrightarrow \{0, \ldots, N\}$ where
$$\beta(x) \neq i \text{ if } x_i = B_i, \qquad \beta(x) = 0 \text{ iff } x_i = B_i\ \forall i, \quad \text{and}$$
$$B_i(x) = \begin{cases} 1 & \text{if } \beta(x) = i \\ 0 & \text{otherwise.} \end{cases}$$

The first condition indicates that a job cannot be sent to a full queue; the second one imposes that a job is rejected only if every queue is full; the last one says that the brokering is deterministic.

Here, the RB is able to know whether the system is full, and must then reject any incoming job as long as no process (or not enough processes, in the case of parallel jobs) has left a CE.
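As a concrete instance of Definition 1.28, the sketch below implements one particular valid state-based policy, join-the-shortest-queue; it is merely an example satisfying the three conditions above, not the policies studied in Chapter 3.

```python
def beta_jsq(x, B):
    """Join-the-shortest-queue, one valid state-based brokering policy.

    x: list of N current CE states (x[i-1] is x_i).
    B: list of N buffer sizes (B[i-1] is B_i).
    Returns a CE index in 1..N, or 0 iff every queue is full.
    """
    # First condition: never consider a full queue.
    candidates = [i for i in range(len(x)) if x[i] < B[i]]
    if not candidates:
        return 0  # second condition: reject only when everything is full
    # Third condition: the choice is a deterministic function of x.
    return min(candidates, key=lambda i: x[i]) + 1

print(beta_jsq([2, 0, 3], [3, 3, 3]))  # -> 2: CE 2 has the shortest queue
```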

Notice that if $B_i = B\ \forall i$, the state space size $\|S\|$ is $(B+1)^N$. As a brokering function can be seen as a word of $\|S\|$ letters, each letter being a number between $1$ and $N$, the number of possible brokering functions is $N^{(B+1)^N}$. For instance, with only $N = 2$ CEs and buffers of size $B = 2$, this already gives $2^{9} = 512$ candidate functions. This count is of course only valid if we neglect that some brokering strategies are not acceptable: a brokering strategy cannot choose to send a job to a full queue.

1.3.5 Bernoulli Brokerings

As already mentioned, the first part of this work (Chapter 2) will be devoted to open-loop systems, and more precisely to systems where the brokering is done in such a way that every CE has a probability to be chosen proportional to its computational capacity.



Please notice that we do not attempt to analyze an efficient open-loop strategy; it has been shown in [41] that there exist better strategies, at least for simple cases.

Our aim here is rather to give a first approach to the kind of system we are looking at, and to experiment with our model.

Definition 1.29 (Equidistributed Bernoulli Brokering)

An Equidistributed Bernoulli Brokering is a probabilistic memoryless open-loop brokering strategy (see Definition 1.27) in which every CE has the same probability to be chosen. That is,
$$\beta_i = \frac{1}{N} \quad \forall i.$$

Definition 1.30 (Weighted Bernoulli Brokering)

A Weighted Bernoulli Brokering is a probabilistic memoryless open-loop brokering strategy (see Definition 1.27) in which every CE has a probability to be chosen proportional to its computational capacity. That is,
$$\beta_i = \frac{\mu_i \cdot s_i}{M} \quad \forall i.$$
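As an illustration, the Weighted Bernoulli Brokering can be sketched in a few lines of Python; here $M$ is the total capacity $\sum_k \mu_k \cdot s_k$ formally introduced just below, and the function names are hypothetical.

```python
import random

def weighted_beta(mu, s):
    """Weights beta_i = mu_i * s_i / M of Definition 1.30."""
    M = sum(m * k for m, k in zip(mu, s))  # M = sum_k mu_k * s_k
    return [m * k / M for m, k in zip(mu, s)]

def weighted_bernoulli_broker(mu, s):
    """Draw a CE index in 1..N with probability mu_i * s_i / M."""
    beta = weighted_beta(mu, s)
    return random.choices(range(1, len(beta) + 1), weights=beta)[0]

# Example: CE 2 is twice as fast, so it is chosen twice as often.
mu, s = [1.0, 2.0], [4, 4]
print(weighted_beta(mu, s))  # [1/3, 2/3]
```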

Since $M \triangleq \sum_{k=1}^{N} \mu_k \cdot s_k$, this indeed yields a probability distribution, as requested by Definition 1.27. Based on these definitions, we can show the following lemma:

Lemma 1.3

Let $G$ be a grid with a set of Computing Elements (CE) $\mathcal{C}$ and one broker using Equidistributed or Weighted Bernoulli Brokering, and let $G'$ be a grid with the same set of CEs, but with $r$ brokers sending jobs to any CE in $\mathcal{C}$, and using the same policy as the broker of $G$.

$G$ and $G'$ are equivalent, meaning that the probability for a job $j$ to be sent to a CE $C_i$ is the same in $G$ and $G'$, whatever the way a client chooses the broker to which it sends its jobs.

Proof

The proof is direct: the probability for a job to be sent to $C_i$ is $\frac{1}{N}$ (Equidistributed) or $\frac{\mu_i \cdot s_i}{M}$ (Weighted) in both systems.

This lemma highlights that even if, for practical reasons (robustness, access rights, network delays. . . ), it could be interesting to use several brokers, it is equivalent in our model to consider one single broker, as long as we assume it to be reliable.
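The lemma can also be checked empirically; the following sketch (with hypothetical names, and clients picking one of the $r$ brokers uniformly at random, although any other choice would do) compares the empirical destination frequencies with the single-broker weights.

```python
import random
from collections import Counter

def destination_frequencies(beta, r, n_jobs=100_000):
    """Empirical CE frequencies with r identical Bernoulli brokers."""
    counts = Counter()
    for _ in range(n_jobs):
        _broker = random.randrange(r)  # which broker is picked is
        # irrelevant: every broker applies the same policy beta.
        ce = random.choices(range(1, len(beta) + 1), weights=beta)[0]
        counts[ce] += 1
    return {ce: round(c / n_jobs, 3) for ce, c in sorted(counts.items())}

beta = [1/3, 2/3]
print(destination_frequencies(beta, r=1))  # ~ {1: 0.333, 2: 0.667}
print(destination_frequencies(beta, r=5))  # same distribution, as stated
```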
