
In the document Programming Many-Core Chips (Page 189-193)

9  Practical Many-Core Programming

9.2 Task-Based Parallelism

9.2.1 Cilk

Cilk [1] emerged as a spin-off from research done at MIT, which led to the design of the work-stealing task scheduler. It was recently acquired by Intel and included in Intel’s parallel programming offering, alongside the more established Threading Building Blocks library (which we will cover in the next sub-chapter).

Cilk is a multi-core programming model based on a simple extension to the C (and more recently, C++) programming language (through the introduction of three new keywords), together with a sophisticated work-stealing scheduler. By design, all Cilk programs preserve the semantics of their C equivalent, in the sense that when the specific Cilk keywords are removed from a program, the result is an exactly equivalent serial C program, minus the multi-task capabilities.

Let’s look again at the pedagogical implementation of the calculation of the nth Fibonacci number (first introduced in Chap. 8, shown here in Fig. 9.1). It is the most commonly cited example in the Cilk research literature and we follow that tradition here, as it exposes the most important keyword additions, namely spawn and sync.

The spawn keyword tells the Cilk run-time that the function may, but doesn’t have to, run in parallel with its parent caller. The sync keyword is a barrier indicating that control cannot pass this point until all spawned children have returned. This is about the extent of language features one needs to know in order to write useful Cilk programs. One of the characteristics of Cilk is that the compiler and the run-time, with its supporting scheduler, together bear the responsibility for efficient execution of Cilk programs. The Cilk scheduler’s algorithmic design makes provable guarantees about the achievable efficiency of programs run on it.
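For reference, the Cilk-5 version of the example from the research literature reads approximately as follows (a sketch, since only the caption of Fig. 9.1 survives in this extract; this is Cilk syntax and requires a Cilk compiler, not plain C):

```c
cilk int fib(int n)
{
    if (n < 2)
        return n;
    int f1 = spawn fib(n - 1);  /* may run in parallel with the parent */
    int f2 = spawn fib(n - 2);
    sync;                       /* barrier: f1 and f2 are valid only past this point */
    return f1 + f2;
}
```

Removing the cilk, spawn and sync keywords yields the serial C equivalent, illustrating the elision property described above.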

Figure 9.1 also illustrates the concept of a Cilk thread. In effect, a Cilk thread is the maximal sequence of statements that executes until the next spawn or sync statement. The T1 strand runs in some initial worker thread. On reaching the first spawn statement, a new possible parallel thread of execution is created. The Cilk model of execution is such that the spawned thread continues to execute using its parent worker thread context. The parent, or spawning, thread is then placed on a queue of possible work items. If more than one processor is available and some processor has nothing to do, it will steal this work item; the parent thus obtains a continuation context in a new thread. The same happens when the parent context reaches the second spawn statement: the spawned thread continues using the parent thread context, the parent is placed on the work queue, and an idle processor, if one is available, will again steal from this work queue.

It is interesting to note that the concept of a parent continuation being placed on a queue of available work items is rather analogous to the semantics of the equivalent C program, where the parent frame or context would be pushed onto the stack while execution continues in the called child function. The sync statement functions as a barrier and guarantees that all spawned children have returned before execution passes it. This is important, as the values of the partial Fibonacci results f1 and f2 in the example cannot be relied upon until after the sync statement.

Figure 9.2 shows yet another view of what happens at execution time. The figure captures what happens on execution of fibonacci(4). In effect, a DAG (directed acyclic graph) of the program execution unfolds as the program executes. This DAG represents a graph of control and data flow and can be defined formally as G = (V, E), where each vertex v is an element of the set of vertices V and each edge e is an element of the set of edges E. A vertex then represents a sequence of instructions that does not contain a Cilk language statement, and an edge represents a Cilk spawn, a function return or a continuation.

Fig. 9.1  Fibonacci numbers using Cilk

In Fig. 9.2 the edge e_spawn represents a branch in the execution graph through the creation of a new Cilk thread. After the e_spawn branch the parent has a continuation edge e_cont that may result in continued parallel execution, through the process of work-stealing, on a free processor. The upward edges, labeled e_return, represent a flow of data back to the creating context. Finally, execution completes in a final thread.

9.2.1.1   The Cilk Scheduler

The Cilk scheduler uses the concept of work-stealing, in which a processor that runs out of work, often called a thief, steals work from the task queue of some victim processor that has more tasks waiting to execute than it can currently service. The strategy in Cilk is to select the victim processor at random.

The theory behind the Cilk scheduler states that the performance of a Cilk computation is related to two quantities. The first is work, which is the cost of the computation when run serially on one processor. The second is termed critical-path length, which is the cost of the computation when run on a theoretically infinite number of processors. Figures 9.3 and 9.4 illustrate these concepts on the Fibonacci example, assigning a unit cost to each task.

Fig. 9.2  The execution of fibonacci(4)


The term T1 is the same as the work, the time to execute on one processor, while T∞ denotes the critical-path length. Tp is the time to execute on P processors. Then Tp ≥ T1/P, since P processors can perform at most P units of work in one step, and likewise Tp ≥ T∞, since adding processors cannot shorten the critical path.

Another important principle of the Cilk scheduler is the work-first principle.

This principle states that the scheduling overhead should not be borne by the work of a computation but, rather, moved onto the critical path, to the point where stealing is necessary. The Cilk scheduler is termed a “greedy scheduler” as it attempts to do as much work as possible at each step. There are basically two types of scheduling steps. The first is called complete: in such a step there are at least P tasks ready and P processing resources available; the scheduler selects any P of them and runs them. The second is called incomplete: in such a step there are fewer than P tasks available to run; the scheduler then selects all of them for execution on a subset of the processing resources. Stealing, and consequently scheduling overhead, will only occur when there is at least one processor out of work.
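These quantities yield the classic bounds from the work-stealing literature. Writing T_1 for the work and T_infinity for the critical-path length, any execution satisfies the two lower bounds below, and a greedy scheduler achieves the matching Graham/Brent-type upper bound:

```latex
% Lower bounds, valid for any scheduler on P processors:
T_P \ge \frac{T_1}{P}, \qquad T_P \ge T_\infty
% Upper bound achieved by a greedy scheduler:
T_P \le \frac{T_1}{P} + T_\infty
```

For fibonacci(4) with the unit costs of Figs. 9.3 and 9.4 (T_1 = 17, T_infinity = 8), the greedy bound gives, for example, T_2 ≤ 17/2 + 8 = 16.5 on two processors, within a factor of two of the best either lower bound allows.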

Fig. 9.3  Total work in case of fibonacci(4): with a unit cost per strand, Total Work = 17

Fig. 9.4  Critical path for fibonacci(4): Critical Path = 8

In their paper [1], the original designers of Cilk and its scheduling algorithm have shown that its performance is within a constant factor of optimal in terms of execution time, required memory and communication. This theoretical underpinning, verified in practice as well, is the main reason for the enduring success of the work-stealing scheduling method.
