

HAL Id: hal-01197544

https://hal.archives-ouvertes.fr/hal-01197544


Work Stealing Technique and Scheduling on the Critical Path

Christophe Cérin, Michel Koskas

To cite this version:

Christophe Cérin, Michel Koskas. Work Stealing Technique and Scheduling on the Critical Path. GPC 2008: 3rd International Conference on Grid and Pervasive Computing, May 2008, Kunming, China. doi:10.1109/GPC.WORKSHOPS.2008.62. hal-01197544.


Work Stealing Technique and Scheduling on the Critical Path

Christophe Cérin+ and Michel Koskas++

+ Laboratoire d'Informatique de Paris-Nord (LIPN), UMR CNRS 7030, 99, avenue Jean-Baptiste Clément, 93430 Villetaneuse, France
++ INRA – Département de Mathématiques, 16, rue Claude Bernard, 75005 Paris

christophe.cerin@lipn.univ-paris13.fr, michel.koskas@agroparistech.fr

Abstract

In this paper we propose new insights into the problem of concurrently scheduling threads, mainly through the Work Stealing (WS) technique, which is, together with the Parallel Depth First (PDF) technique, one of the techniques able to efficiently manage fine-grained multithreaded programs. We investigate a nice result of Bender and Rabin about the scheduling of threads in a heterogeneous context, show its limits in terms of (theoretical) performance, and propose to enhance processor utilization by keeping the fastest processor busy working on the critical path of the application. We then provide a bound on the completion time. We also investigate the problem of execution time penalties when scheduling according to WS compared to PDF, and we show how to adapt our initial algorithm to behave 'like' a PDF schedule. We hope that such a framework can help multi-threaded or multi-core architects rely on well-founded mathematical results to build appropriate hardware controlling the scheduling of threads, or to implement software schedulers.

Keywords: Chip multiprocessors, Heterogeneity of chips, Scheduling al- gorithms, Work stealing principle.

1 Introduction and related work

Chip multiprocessors are now available from retailers and offer moderate parallelism, with on the order of 128 processing units, as for instance in the Nvidia 88xx chip series or the Cell processor that is part of the PS3. All the major chip manufacturers have made the paradigm shift to producing chips with multiple cores that are not necessarily homogeneous (in the sense that all the cores are identical), as is the case with the two previous examples.


To effectively exploit the available parallelism, such hardware must address contention for shared resources [?,?,?] and should also take care of the scheduling of threads. We do not investigate hardware support for that; instead we put our attention on another layer, closer to the software, for instance the scheduler layer of any Linux system. Our aim is to model the scheduling of processes in order to isolate the main factors that impact the performance of the chips.

Many parallel programming languages and run-time systems use greedy[1] thread scheduling principles to maximize processor utilization. For instance, the Parallel Depth First (PDF) approaches [?,?] have been proposed for constructive cache sharing, whereas Work Stealing (WS), a popular scheduling technique, takes a more traditional approach. The WS policy maintains a work queue for each processor; when a thread forks a new thread, the new thread is put on top of the local queue. When a thread completes, the processor looks in its local queue to execute the next job, and when the queue is empty, the processor tries to steal work from the queue of another processor.

[1] In a greedy schedule, a ready job remains unscheduled only if all processors are already busy.
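As an illustration, here is a minimal sketch (our own, not taken from the paper) of this WS policy; the class name and the task representation are placeholders:

    import collections
    import random

    class WorkStealingPool:
        """One deque per processor: fork pushes on top of the local queue,
        next_task() pops locally first and steals from a victim otherwise."""
        def __init__(self, num_procs):
            self.queues = [collections.deque() for _ in range(num_procs)]

        def fork(self, proc, task):
            # A newly forked thread goes on top of the forking processor's queue.
            self.queues[proc].append(task)

        def next_task(self, proc):
            # Prefer local work from the top of the local queue.
            if self.queues[proc]:
                return self.queues[proc].pop()
            # Local queue empty: steal the oldest task of a random victim.
            victims = [q for i, q in enumerate(self.queues) if i != proc and q]
            return random.choice(victims).popleft() if victims else None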

If we deal with independent tasks (for instance, we have to paint an image black), the scheduler can steal any task without increasing the completion time; otherwise, when there exist dependencies between tasks, it is more difficult to 'choose'. Do we have to select the task from the top or from the bottom of the queue (the insertion order in the queue is important since it encodes the dependencies between tasks)? With the WS technique we may also observe that processors tend to have disjoint working sets, which enforces the parallelism of the application but may induce the movement of large data sets if we steal too frequently.

Parallel Depth First (PDF) [?] is another greedy scheduling strategy, based on the fact that important sequential programs have already been highly optimized for single-core architectures by maintaining small working sets and by capitalizing on good spatial and temporal reuse. In PDF, when a processor completes a task, it is assigned the ready task that the sequential program would have executed the earliest. Note also that [?,?] indicate how to do this online, that is, without executing the sequential program. In a sense, PDF tends to co-schedule work in a way similar to the sequential execution, which is more favorable for optimization (cache reuse, etc.). Intuitively, the benefit of the PDF technique is that spatial data locality is preserved, while WS tends to make 'anarchic' memory accesses by making all threads work concurrently and access memory without coordination, hence problems with numerous cache misses.
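A minimal sketch of this PDF rule for unit-time tasks (our own illustration; `ranks` maps each task to its position in the sequential execution and is assumed to contain distinct values):

    import heapq

    def pdf_schedule(dag, ranks, p):
        """dag: {task: set of successor tasks}; ranks: task -> sequential
        (1DF) number; p: number of processors. Returns the list of steps."""
        indeg = {t: 0 for t in dag}
        for t in dag:
            for s in dag[t]:
                indeg[s] += 1
        ready = [(ranks[t], t) for t in dag if indeg[t] == 0]
        heapq.heapify(ready)
        steps = []
        while ready:
            # Each step, the p processors take the p earliest ready tasks.
            step = [heapq.heappop(ready)[1] for _ in range(min(p, len(ready)))]
            steps.append(step)
            for t in step:
                for s in dag[t]:
                    indeg[s] -= 1
                    if indeg[s] == 0:
                        heapq.heappush(ready, (ranks[s], s))
        return steps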

All of these strategies (WS or PDF) rely on a model of parallel computation. A program in such a model is represented as a directed acyclic graph, and this representation is used in the proofs of the bounds on the computation time. Section 2 returns to models of parallelism and to the representation of a computation. Section 3 recalls the results of Rabin and Bender [?,?] concerning the WS technique, and in Section 4 we introduce new insights on how to improve the bounds. We first consider an offline algorithm to schedule threads. Then, in Section 5, we revisit our algorithm to discuss cache effects. In Section 6 we consider an online strategy. Section 7 concludes the paper.

2 Representing a computation and models of parallelism

Our scheduling algorithm is applicable to programming languages that support fork-join constructs. The algorithm assumes a shared-memory programming model in which parallel programs can be described in terms of threads. A thread is a sequence of actions that are executed serially, each taking exactly one timestep (or one clock cycle). A thread may fork any number of child threads, and this number need not be known at compile time.

The work of a parallel computation is the total number of actions executed in it, that is, the number of timesteps required to execute it serially, ignoring overheads such as context switching. The depth of a parallel computation is the length of the critical path, that is, the time required to execute the computation on an infinite number of processors.

The remaining definitions are common in the literature. We will adapt them later in the paper by considering weights on the edges of graphs.

Figure 1: Left side: a valid graph (on the eleven nodes a, b, c, d, e, f, g, h, i, j, k) – Right side: not a valid graph (on the seven nodes a, b, c, d, e, f, g)

The computation is viewed as a directed acyclic graph, or dag, as follows. Each node in the dag corresponds to an action. The edges of the graph express the dependencies between the actions. To capture the fork-join paradigm in terms of graphs, we use so-called series-parallel graphs. A series-parallel graph G = (V, E) is a directed acyclic graph with two distinguished vertices, a source s and a sink t. G is one of the following: (1) a single edge extending from s to t, that is, V = {s, t} and E = {(s, t)}; (2) two series-parallel graphs G1 and G2 composed in parallel: the sources s1 and s2 of G1 and G2 respectively are merged into a single source s, and the sinks are also merged into a single sink t; (3) two series-parallel graphs G1 and G2 composed in series: the sink t1 of G1 and the source s2 of G2 are merged into a single node.

Note that, according to this definition, not all graphs are accepted as valid programs for our study. For instance, the graph on the left side of Figure 1 is valid, but the one on the right side is not, because it cannot be built according to the three rules above.
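As an illustration of the three rules, here is a small sketch (our own; the function names are placeholders) in which a graph is a triple (edges, source, sink), so that only valid series-parallel graphs can ever be built:

    import itertools

    _fresh = itertools.count()  # generator of fresh node identifiers

    def edge():
        """Rule 1: a single edge from a new source s to a new sink t."""
        s, t = next(_fresh), next(_fresh)
        return {(s, t)}, s, t

    def parallel(g1, g2):
        """Rule 2: merge the two sources and merge the two sinks."""
        e1, s1, t1 = g1
        e2, s2, t2 = g2
        ren = {s2: s1, t2: t1}
        e2 = {(ren.get(u, u), ren.get(v, v)) for (u, v) in e2}
        return e1 | e2, s1, t1

    def series(g1, g2):
        """Rule 3: merge the sink of g1 with the source of g2."""
        e1, s1, t1 = g1
        e2, s2, t2 = g2
        e2 = {(t1 if u == s2 else u, t1 if v == s2 else v) for (u, v) in e2}
        return e1 | e2, s1, t2

    # A fork-join: two parallel branches of two actions each.
    fork_join = parallel(series(edge(), edge()), series(edge(), edge()))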

We use fully strict dags, that is, series-parallel dags such that every node has outdegree at most 2, there is exactly one node with indegree 0, and exactly one node with outdegree 0.

Let W_1 represent the total work, that is, the total number of nodes in the dag G. The total work for the graph of Figure 1 is 11. Let W_∞ represent the critical path length of the graph, that is, the number of nodes in the longest chain in G. The critical path length for the graph of Figure 1 is 5.

The above definitions are in common use in the literature. We will adapt them later in the paper by considering weights between vertices of graphs. As usual, weights represent the time cost of executing the actions between two program states.

3 WS technique and the algorithm of Rabin and Bender

In [?], Bender and Rabin consider how to execute parallel programs on a collection of heterogeneous processors. They consider p processors labeled 1, ..., p, where processor i has speed π_i steps per unit of time. They assume, for the sake of convenience, that π_1 ≥ π_2 ≥ ··· ≥ π_p and that the processor speeds do not change. Let
$$\pi_{ave} = \frac{1}{p}\sum_{i=1}^{p} \pi_i .$$

Bender and Rabin start from a result of Graham [?] proving that a list schedule is a (2 − 1/p)-approximation to the optimal completion time, a result that holds for any greedy schedule. It derives from the following theorem [?], using the notions defined above:

Theorem: A greedy schedule (or list schedule) has a completion time
$$T_p \le \frac{W_1}{p} + \frac{p-1}{p}\, W_\infty .$$

In their work, Bender and Rabin consider the maximum utilization scheduling policy, which maintains the following invariant: during each time interval in which there are exactly i ready threads, the fastest i processors execute these threads; if there are i ≥ p ready threads, then all processors work. Note that in order to maintain this invariant, the scheduling policy must allow preemptions.

The maximum utilization schedule has a straightforward generalization that Bender and Rabin call a high utilization schedule. In this case, the scheduler relaxes the invariant so that, at all times, if there are i < p ready threads, the fastest idle processor is at most β times faster than the slowest busy processor. Thus, when β = 1, we obtain a maximum utilization schedule. The completion time of a high utilization schedule may be worse than the completion time of a maximum utilization schedule, but it may have the advantage of fewer preemptions.

The central theorem in [?] is then the following:

Theorem [?]: Any maximum utilization schedule has a completion time
$$T_p \;\le\; \frac{W_1}{p\,\pi_{ave}} + \left(\frac{\pi_2}{\pi_1} + \frac{\pi_3}{\pi_2} + \cdots + \frac{\pi_p}{\pi_{p-1}}\right)\frac{W_\infty}{p\,\pi_{ave}} \;\le\; \frac{W_1}{p\,\pi_{ave}} + \frac{p-1}{p}\,\frac{W_\infty}{\pi_{ave}} .$$

To explain how we go from the first line to the second line of the theorem, observe that each ratio of speeds is at most 1, and since the speeds are ordered, the sum of the (p − 1) ratios is no more than (p − 1). For further explanation of the proof, please refer to [?]. Note that the authors use a particular accounting trick related to shadow threads: they say that "when a processor i is unable to do any work on an actual thread, the processor begins working on its shadow thread". The sum of π_i/π_{i−1} terms in the theorem corresponds to the work done on shadow threads and on the critical path.
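To make the bound concrete, consider a small instance with our own numbers: p = 3 processors with speeds (4, 2, 1). Then

    \[
    \pi_{ave} = \frac{4+2+1}{3} = \frac{7}{3}, \qquad
    \frac{\pi_2}{\pi_1} + \frac{\pi_3}{\pi_2} = \frac{2}{4} + \frac{1}{2} = 1 \;\le\; p-1 = 2,
    \]
    \[
    T_3 \;\le\; \frac{W_1}{7} + 1\cdot\frac{W_\infty}{7}
        \;\le\; \frac{W_1}{7} + \frac{2}{3}\cdot\frac{3\,W_\infty}{7}
        \;=\; \frac{W_1 + 2\,W_\infty}{7}.
    \]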

Bender and Rabin have also considered a variant of the scheduling strategy that makes the i-th fastest processor work on the i-th longest critical path. In this case, Bender and Rabin showed that the policy guarantees that the critical path progresses at least at the average speed of the working processors, that is:

Theorem [?]: Suppose that the maximum utilization strategy additionally maintains the invariant that the i-th fastest processor executes the thread that is i-th farthest from the end of the dag. This amounts to putting the fastest processor on the critical path. Then the computation has completion time:

$$T_p \;\le\; \frac{W_1}{p\,\pi_{ave}} + \left(\frac{\pi_2}{\pi_1} + \frac{2\pi_3}{\pi_1+\pi_2} + \frac{3\pi_4}{\pi_1+\pi_2+\pi_3} + \cdots + \frac{(p-1)\pi_p}{\pi_1+\cdots+\pi_{p-1}}\right)\frac{W_\infty}{p\,\pi_{ave}} .$$

Again, in the proof, Bender and Rabin use an accounting trick based on shadow threads (when a processor is unable to do any work on an actual thread, the processor works on its shadow thread ST_i). Since only processors 1, ..., (i−1) may work on the computation, the critical path advances at a rate of at least $P = \frac{\pi_1 + \cdots + \pi_{i-1}}{i-1}$ steps per unit of time. Thus, since the critical path has length W_∞, processor i can work on ST_i for $\frac{\pi_i}{P} W_\infty$ time units. Bender and Rabin then conclude that the total amount of work the processors dedicate to actual and shadow threads is at most

$$W_1 + \left(\frac{\pi_2}{\pi_1} + \frac{2\pi_3}{\pi_1+\pi_2} + \frac{3\pi_4}{\pi_1+\pi_2+\pi_3} + \cdots + \frac{(p-1)\pi_p}{\pi_1+\cdots+\pi_{p-1}}\right) W_\infty .$$


Bender and Rabin also point out that the gain seems small in comparison to the potentially infinite number of additional preemptions that this policy entails (to be explained). Moreover, they do not show how to keep, in practice, the i-th fastest processor busy on the i-th critical path: the result is not a constructive one.

4 How to consider the critical path?

4.1 Offline algorithm

In this section we consider that the computation graph is known in advance, and we introduce an offline scheduling algorithm based on the WS principle. Our aim is to replace the π_ave terms in the Bender and Rabin result by a more favorable quantity, especially one which is not related to the mean of the speeds. Our main idea is to keep the fastest processor busy on the critical path.

For programs with independent tasks, W_∞ is small and the $\frac{p-1}{p}\frac{W_\infty}{\pi_{ave}}$ term does not influence the completion time, but a π_ave term still remains in the equation of the completion time. Otherwise, if the task graph is shaped like a hackle (a comb), the critical path length may become an influential factor in the completion time. Our approach is twofold: keeping the benefit of the Bender and Rabin result when the critical path length is 'small', and keeping the most powerful processors busy with the heaviest tasks as much as possible.

We remain in the heterogeneous setting as defined by Bender and Rabin concerning, for instance, the speeds of processors, but we need some complementary definitions. We also work with the series-parallel graph model, although we would like to mention that our scheduling algorithm is also valid for 'general' DAGs. However, we will see later on that, to tackle memory management problems, series-parallel graphs are a prerequisite.

Nodes are now represented by a tuple (name, weight, st, et) where name is the label of a task, weight is the time duration of the task, st is the time at which the task starts and et the time at which the task finishes. Figure 2 introduces an example. Note that if we only know the weights of the tasks at the beginning of the algorithm, and assuming that time starts at 0, a DFS-like (Depth First Search) algorithm allows us to label the nodes with the tuple values in a straightforward way.
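A sketch of this labeling step (our own reconstruction: we take st to be the earliest start compatible with the precedence constraints, i.e. the maximum of the parents' finish times, which matches the tuples of Figure 2; the edge set below is our reading of the figure):

    from graphlib import TopologicalSorter  # Python 3.9+

    def label_nodes(children, weight):
        """children: {node: set of child nodes}; weight: node -> duration.
        Returns node -> (name, weight, st, et)."""
        parents = {v: set() for v in children}
        for u, succ in children.items():
            for v in succ:
                parents[v].add(u)
        labels = {}
        for v in TopologicalSorter(parents).static_order():
            st = max((labels[p][3] for p in parents[v]), default=0)
            labels[v] = (v, weight[v], st, st + weight[v])
        return labels

    # Edges reconstructed from Figure 2:
    children = {'a': {'b', 'h'}, 'b': {'c', 'e'}, 'c': {'d'}, 'd': {'g'},
                'e': {'f'}, 'f': {'g'}, 'g': {'k'}, 'h': {'i'},
                'i': {'j'}, 'j': {'k'}, 'k': set()}
    weight = dict(a=2, b=1, c=2, d=5, e=5, f=2, g=5, h=4, i=4, j=5, k=3)
    assert label_nodes(children, weight)['k'] == ('k', 3, 15, 18)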

4.2 Computation of paths of maximal weight

After labeling the nodes, the second step of the algorithm is to compute 'all the critical paths' of the graph. Each (critical) path of maximal weight has a minimal starting time and a maximal finishing time. The weight of a path is the sum of the weights of the tasks that constitute the path. We compute the connected path of maximal weight in the graph in the range [m = 0, M = ∞]. This path has an execution interval [m1, M1]. We continue the process recursively in order to compute the paths of maximal weight in [m, m1] and in [M1, M]. We obtain a set of paths C1, C2, ... and we store them in an array that also gives the processor that should run the tasks on each path (recall that π_1 ≥ π_2 ≥ ··· ≥ π_p). The array also contains a flag to indicate whether a path is completed or not.

Figure 2: The task graph augmented with information about task durations. Node tuples: (a,2,0,2), (b,1,2,3), (c,2,3,5), (d,5,5,10), (e,5,3,8), (f,2,8,10), (g,5,10,15), (h,4,2,6), (i,4,6,10), (j,5,10,15), (k,3,15,18).

For the example of Figure 2, we get:

Path   Finish?   Proc
C1     false     1
C2     false     2
C3     false     3

with C1 = abefgk, C2 = hij, C3 = cd. Note that the time duration of the tasks on path C1 is 18; on path C2, 13; and on path C3, 7.
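A simplified sketch of this decomposition (our own: it greedily extracts the heaviest remaining chain rather than reproducing the interval-based recursion above, and ties such as abefgk versus abcdgk, both of weight 18 here, are broken arbitrarily):

    def heaviest_chain(children, weight, remaining):
        """Heaviest directed chain using only nodes in `remaining`."""
        memo = {}  # node -> (chain weight, chain) for the best chain from node
        def from_node(v):
            if v not in memo:
                options = [from_node(c) for c in children[v] if c in remaining]
                w, tail = max(options, default=(0, []))
                memo[v] = (weight[v] + w, [v] + tail)
            return memo[v]
        return max((from_node(v) for v in remaining), key=lambda r: r[0])[1]

    def decompose(children, weight):
        """Peel off paths C1, C2, ... by decreasing weight."""
        remaining, paths = set(children), []
        while remaining:
            chain = heaviest_chain(children, weight, remaining)
            paths.append(chain)
            remaining -= set(chain)
        return paths

    # With the graph of Figure 2 this yields, e.g.,
    # C1 = a b e f g k (18), C2 = h i j (13), C3 = c d (7).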

4.3 Preemptive algorithm

Assuming that the previous array is shared among all the threads, the scheduling algorithm starts by executing the first free path on the fastest processor, the second free path on the second fastest processor, and so on. After this initialization step, the scheduler enters a permanent regime in which a processor:

α: becomes idle because it has finished its ’job’ on a path;

β: must interrupt its work because it needs to wait for a task to complete.

The protocol is then as follows for the α and β cases:

α: the processor looks at the array, marks the 'finish?' field of its path with true, and considers the first path which either has not yet been started or, if it has been started by a slower processor, it steals the work. A processor whose work has been stolen behaves like an idle processor (its job has just terminated);

β: the processor interrupts its work on its path. Let N be the node requiring such a synchronization point. We consider the paths belonging to the parents of N that are not yet terminated.

If these paths have not yet been started, we start one of them on the interrupted processor (how do we choose one path among several? In the case of a series-parallel graph, there is at most one possible parent).

If these paths are currently executed on slower processors, we steal the work of the slowest until we reach the node N; we stop the computation on the stolen path by marking it as executed by an infinitely slow processor, and we restart the computation after node N.

If these paths are executed on faster processors and there are no slower processors left, we ask to be notified when node N completes.

During the waiting step, the processor behaves as if it had finished its work. When the processor receives the signal, it resumes its initial work and freezes its current work by marking it as executed by an infinitely slow processor.
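A sketch of the α rule alone (our own simplification; the shared-array entries and their field names are placeholders):

    def alpha_step(array, me, speeds):
        """array: list of entries {'path': ..., 'finished': bool,
        'proc': processor index or None}; me: index of the idle processor;
        speeds: speeds[i] is the speed of processor i.
        Returns the entry `me` should now work on, or None."""
        for entry in array:
            if entry['finished']:
                continue
            if entry['proc'] is None:               # first path not yet started
                entry['proc'] = me
                return entry
            if speeds[entry['proc']] < speeds[me]:  # started by a slower processor
                entry['proc'] = me                  # steal it; the victim is now idle
                return entry
        return None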

Theorem: The above algorithm has a completion time
$$T_p \;\le\; \sum_{i=1}^{p} \frac{W_i}{\pi_i} \;+\; \frac{W_1}{\pi_p}$$
where W_i denotes the weight of the i-th path C_i.

Proof. By construction, the processor running at π_i steps per unit of time works on the i-th critical path as defined above, and when processor i must wait for a parent, it still makes progress in the computation: it contributes to a fraction of W_1 at a speed at least equal to π_p.

5 Offline Algorithm and cache effects

In this section we investigate the problem of scheduling threads efficiently while controlling the cache misses. Two orthogonal facts characterize the WS principle: theoretical bounds on the completion time can be derived, and these bounds show all the potential of the method, including for practical cases (fork-join programming); but since the method tends to activate all the threads concurrently with no control over what they are doing with respect to memory, there is no chance of regular memory access patterns (memory is accessed in a random fashion, as served by the memory controller), and thus we tend to multiply the cache misses.

For that reason, the PDF technique has been studied in the past in order to have concurrently executing threads share a largely overlapping portion of memory. The aim of the PDF technique is to get good spatial and temporal reuse as follows: when a core completes a task, it is assigned the ready-to-execute task that the equivalent sequential program would have executed the earliest. The underlying idea is that we can count on sequential codes because they are heavily optimized by compilers: code reordering (loop unrolling), memory prefetching inside loops, use of 128-bit registers and many other techniques are applied automatically and increase performance.

A paper by Blelloch and Gibbons [?] shows important properties of PDF schedules regarding cache misses. In this paper, the authors compare the number of cache misses M_1 for running a computation on a single processor equipped with a cache of size C_1 to the total number of misses M_p for the same computation when using p processors and a shared cache of size C_p. They demonstrate that for any computation, with an appropriate greedy parallel schedule, if C_p ≥ C_1 + p·W_∞ then M_p ≤ M_1.

As quoted in the paper, "this gives the perhaps surprising result that for sufficiently parallel computations the shared cache need only be an additive size larger than the single processor cache" in order to guarantee no additional misses for any computation. The key point explaining the result is the choice of the schedule, and we could add that it determines the result. Note that the p·W_∞ product is small in practice, because p is small and W_∞ is not of the same magnitude as the input size but much smaller. Note also that the result assumes an ideal cache model, that is, a fully associative cache with an ideal replacement policy, meaning that the evicted memory block is the candidate memory block whose next access is the farthest in the future, which is optimal[2].

[2] Refer to http://www.research.ibm.com/journal/sj/052/belady.pdf

In [?], Acar, Blelloch and Blumofe show that this bound compares favorably to WS, where the cache size must be at least C_1·p·W_∞ to guarantee M_1 misses.

A standard sequential execution corresponds to a particular depth-first schedule of the dependence graph; Blelloch and Gibbons call it a 1DF-schedule. They then consider parallel schedules that are prioritized based on a given sequential schedule, i.e., if there are multiple tasks ready to execute at a given time step, the "schedule will preferentially pick the tasks that are earliest in the sequential schedule". A parallel schedule based on a 1DF-schedule is called a PDF-schedule [?] (Provably efficient...).

In [?,?] (Provably...), the authors show how to maintain a PDF-schedule online for various types of computations, in particular for computations with parallel loops or fork-join constructs. The key requirement for this to succeed is that the DAGs we use must be planar graphs. Series-parallel graphs are planar, and this fact guarantees that the theoretical results can be obtained; this explains why we concentrate our attention on series-parallel graphs.

An intuitive idea for achieving good data locality with WS is to execute on the same processor nodes that are close in the computation graph. Following this idea, each thread can be given an affinity for a processor, and when a processor obtains work, it gives priority to threads with affinity for it. To enable this, in addition to the (Path, Finish?, Proc) information of our scheduler, we add another array containing, for each task, a first-in-first-out (FIFO) queue of the tasks that have affinity for it. We also add a field, protected by a mutex, saying whether the task is completed or not. Let us call this data structure a mailbox. We are now faced with the following problems:

How to choose the affinity threads of a given thread?

How to schedule threads equipped with this new information?

What about the completion time?

What about the cache misses?

5.1 How to choose affinity threads

The affinity list of a task is simply the list of the ancestors of the task in the task graph, up to the root. We assume that we deal with series-parallel graphs, so that there is a unique common ancestor of all tasks, which is the root of the graph.
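A sketch of this computation (our own; `parents` maps each task to its set of parent tasks):

    from collections import deque

    def affinity_list(parents, task):
        """parents: node -> set of parent nodes. Returns the ancestors of
        `task`, nearest first, up to and including the root."""
        seen, order, queue = set(), [], deque([task])
        while queue:
            v = queue.popleft()
            for p in parents[v]:
                if p not in seen:
                    seen.add(p)
                    order.append(p)
                    queue.append(p)
        return order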

Figure 3: The task graph where nodes are numbered in order of a 1DF-schedule (numbering recovered from the figure, in the layout of Figure 2: a=1, b=2, h=3, c=4, e=5, d=6, i=7, f=8, g=9, j=10, k=11)

5.2 How to schedule threads

According to the previous subsection, a task puts in its affinity FIFO queue the tasks given by a linearization of a 1DF-schedule when we look at the graph from bottom to top, with priority given to the task on the lower critical path in case of a tie.

In our scheduling algorithm taking the critical paths into account, we distinguished two cases: a processor becomes idle (α) or must interrupt its work because it needs to wait for a task to complete (β). Only the α case is of concern here, because in the β case we already have a criterion that requires examining a parent of the 'faulty' node, and a parent is necessarily in the affinity list, so we have nothing to do there.

Figure 4: The task graph labeled as a PDF-schedule (labels recovered from the figure, in the layout of Figure 2: a=1, b=2, h=2, c=3, e=3, d=4, f=4, i=5, g=6, j=7, k=8)

In the α case, we modify the protocol as follows. The idle processor looks at the main array, marks the 'finish?' field of its path with true, and first considers the first path which has not yet started; second, if there is no free path, it considers a slower processor: it steals the work of the processor with the most tasks remaining to accomplish in the affinity lists of the tasks on active paths. In case of a tie, it behaves like the initial algorithm: it steals from the processor whose speed is immediately inferior to its own. A processor whose work has been stolen behaves like an idle processor (its job has just terminated).
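A sketch of this modified α rule (our own; `remaining_affinity`, a map from a path to the number of tasks left in the affinity lists of its tasks, is a placeholder for the mailbox bookkeeping described above):

    def alpha_step_affinity(array, me, speeds, remaining_affinity):
        """Prefer a free path; otherwise steal from a slower processor whose
        active path has the most affinity tasks remaining, breaking ties by
        the nearest-lower speed, as in the initial algorithm."""
        free = next((e for e in array
                     if not e['finished'] and e['proc'] is None), None)
        if free is not None:
            free['proc'] = me
            return free
        candidates = [e for e in array
                      if not e['finished'] and e['proc'] is not None
                      and speeds[e['proc']] < speeds[me]]
        if not candidates:
            return None
        victim = max(candidates,
                     key=lambda e: (remaining_affinity[e['path']],
                                    speeds[e['proc']]))
        victim['proc'] = me   # steal; the victim is now idle
        return victim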

5.3 Completion time of the new framework

5.4 Number of cache misses for the new framework

Let us recall some key points about schedules according to Blelloch [?] (Effectively sharing...). In a breadth-first or level-order 1-schedule, a node is scheduled only after all nodes at lower levels are scheduled. A depth-first 1-schedule (1DF-schedule) is defined as follows: at each step, if there are no scheduled nodes with a ready child, schedule a root node; otherwise, schedule a ready child of the most recently scheduled node with a ready child. The node scheduled at step i in a 1DF-schedule is said to have 1DF-number i.
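A sketch of this 1DF numbering rule (our own; ties among ready children are broken arbitrarily):

    def one_df_numbers(children):
        """children: {node: list of child nodes}. Returns node -> 1DF-number."""
        unscheduled_parents = {v: 0 for v in children}
        for v in children:
            for c in children[v]:
                unscheduled_parents[c] += 1
        numbers, scheduled = {}, []   # `scheduled` keeps the scheduling order
        roots = [v for v in children if unscheduled_parents[v] == 0]
        while len(numbers) < len(children):
            node = None
            # Ready child of the most recently scheduled node that has one.
            for v in reversed(scheduled):
                ready = [c for c in children[v]
                         if c not in numbers and unscheduled_parents[c] == 0]
                if ready:
                    node = ready[0]
                    break
            if node is None:          # otherwise, schedule a root node
                node = next(r for r in roots if r not in numbers)
            numbers[node] = len(numbers) + 1
            for c in children[node]:
                unscheduled_parents[c] -= 1
            scheduled.append(node)
        return numbers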

Blelloch also defines a p-schedule S based on a 1-schedule S1: at each step i of S, the k_i earliest nodes in S1 that are ready at step i are scheduled, for some k_i ≤ p.

As pointed out by Blelloch in [?] (Provably...), "the motivation for considering a p-schedule based on a 1-schedule is that it attempts to deviate the least from the 1-schedule, by seeking to make progress on the 1-schedule whenever possible. Intuitively, the closer the p-schedule is to the 1-schedule, the closer its resource bounds may be to those of the 1-schedule".

The p-schedule that we consider is a p-schedule based on a depth-first 1-schedule. If we consider again the graph of Figure 2, Figure 3 shows the same graph with the nodes numbered in order of a 1DF-schedule (and p = 3). In Figure 4, the graph of Figure 3 is labeled according to the PDF-schedule based on this 1-schedule. Let S be this PDF-schedule; we have S = V1, V2, ..., V8 where, for i = 1 to 8, Vi is the set of nodes scheduled at step i, that is to say, the set of nodes labeled i in Figure 4.

Note that in these figures the tasks are not unit tasks, so we have to take their time duration into account, for the 1-schedule as well as for the PDF-schedule.

(Note: for the moment, the PDF-schedules do not serve much purpose, but I intend to reuse them... from now on!)

6 Online Algorithm

7 Conclusion
