
Correct and Efficient Work-Stealing for Weak Memory Models

HAL Id: hal-00802885

https://hal.inria.fr/hal-00802885

Submitted on 20 Mar 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Nhat Minh Lê, Antoniu Pop, Albert Cohen, Francesco Zappa Nardelli

To cite this version:

Nhat Minh Lê, Antoniu Pop, Albert Cohen, Francesco Zappa Nardelli. Correct and Efficient Work-Stealing for Weak Memory Models. PPoPP ’13 - Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming, Feb 2013, Shenzhen, China. pp.69-80, ⟨10.1145/2442516.2442524⟩. ⟨hal-00802885⟩


INRIA and ENS Paris

Abstract

Chase and Lev’s concurrent deque is a key data structure in shared-memory parallel programming and plays an essential role in work-stealing schedulers. We provide the first correctness proof of an optimized implementation of Chase and Lev’s deque on top of the POWER and ARM architectures: these provide very relaxed memory models, which we exploit to improve performance but which considerably complicate the reasoning. We also study an optimized x86 and a portable C11 implementation, conducting systematic experiments to evaluate the impact of memory barrier optimizations. Our results demonstrate the benefits of hand tuning the deque code when running on top of relaxed memory models.

Categories and Subject Descriptors D.1.3 [Programming Techniques]: Concurrent Programming; E.1 [Data Structures]: Lists, stacks, and queues

Keywords lock-free algorithm, work-stealing, relaxed memory model, proof

1. Introduction

Multicore POWER and ARM architectures are standard targets for server, consumer electronics, and embedded control applications. The difficulties of parallel programming are exacerbated by the relaxed memory model implemented by these architectures, which allows the processors to perform a wide range of optimizations, including thread-local reordering and non-atomic store propagation.

The safety-critical nature of many embedded applications calls for solid foundations for parallel programming. This paper shows that a high degree of confidence can be achieved for highly optimized, real-world, concurrent algorithms running on top of weak memory models. A good test case is provided by the runtime scheduler of a task library. We thus focus on Chase and Lev’s concurrent double-ended queue (deque) [3], the cornerstone of most work-stealing schedulers. Until now, no rigorous correctness proof has been provided for implementations of this algorithm running on top of a relaxed memory model. Furthermore, while work-stealing is widely used on the x86 architecture (an evaluation under a restrictive hypothesis of idempotence of the workload can be found in [10]), few experiments target weaker memory models.

Our first contribution is a correctness proof of this fundamental concurrent data structure running on top of a relaxed memory model. We provide a hand-tuned implementation of the Chase and Lev’s deque for the ARM architectures, and prove its correctness against the memory semantics defined in [12] and [7]. Our second contribution is a systematic study of the performance of several implementations of Chase–Lev on relaxed hardware. In detail, we compare our optimized ARM implementation against a standard implementation for the x86 architecture and two portable variants expressed in C11: a reference sequentially consistent translation of the algorithm, and an aggressively optimized version making full use of the release–acquire and relaxed semantics offered by C11 low-level atomics. These implementations of the Chase–Lev deque are evaluated in the context of a work-stealing scheduler. We consider diverse worker/thief configurations, including a synthetic benchmark with two different workloads and standard task-parallel kernels. Our experiments demonstrate the impact of the memory barrier optimization on the throughput of our work-stealing runtime. We also comment on how the ARM correctness proof can be tailored to these alternative implementations. As a side effect, we highlight that our optimized ARM implementation cannot be expressed using C11 low-level atomics, which invariably end up inserting one redundant synchronization instruction.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

PPoPP’13, February 23–27, 2013, Shenzhen, China.

Copyright © 2013 ACM 978-1-4503-1922/13/02...$10.00

2. Chase–Lev deque

User-space runtime schedulers offer an excellent playground for studying low-level high-performance code. We focus on randomized work-stealing: it was originally designed as the scheduler of the Cilk language for shared-memory multiprocessors [4], but thanks to its merits [2] it has been adopted in a number of parallel libraries and parallel programming environments, including the Intel TBB and compiler suite. Work-stealing variants have also been proposed for distributed clusters [5] and heterogeneous platforms [1]. The scheduling strategy is intuitive:

• Each processor uses a dynamic array as a deque holding tasks ready to be scheduled.

• Each processor manages its own deque as a stack. It may only push and pop tasks from the bottom of its own deque.

• Other processors may not push or pop from that deque; instead, they steal tasks from the top when their own deque is empty. In most implementations, the stolen deque is selected at random.

• Initially, one processor starts with the “root” task of the parallel program in its deque, and all other deques are empty.

The state-of-the-art algorithm for the work-stealing deque is Chase and Lev’s lock-free deque [3]. It uses an array with automatic, asynchronous growth. Assuming sequentially consistent memory, it involves only one atomic compare-and-swap (CAS) per steal, no CAS on push, and no CAS on take except when the deque has exactly one element left.

We implemented and tested four versions of the concurrent deque algorithm, with different barrier configurations: (1) a sequentially consistent version, written with C11 seq_cst atomics, following the original description in [3]; (2) an optimized version, which takes full advantage of the C11 relaxed memory model, reported in Figure 1; (3) a native version for ARMv7, reported in Figure 2; and (4) a native version for x86. These native versions rely on compiler intrinsics and inline assembly to leverage architecture-specific assumptions and thus reduce the number of barriers required.
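Version (1) is not listed in the paper. As a point of comparison with Figure 1, the following is a minimal sketch of what a fully sequentially consistent variant could look like, under several assumptions that are ours and not the authors': the names `ScDeque`, `sc_push`, `sc_take`, `sc_steal`, `SC_EMPTY` and `SC_ABORT` are hypothetical, the buffer has a fixed capacity with no resizing, and indices are signed so that decrementing `bottom` on an empty deque cannot wrap around.

```c
#include <stdatomic.h>

/* Hypothetical names and sizes: the paper does not give this listing. */
enum { SC_CAP = 8, SC_EMPTY = -1, SC_ABORT = -2 };

typedef struct {
    atomic_llong top, bottom;      /* signed, so bottom - 1 cannot wrap */
    atomic_int buffer[SC_CAP];
} ScDeque;

static void sc_init(ScDeque *q) {
    atomic_init(&q->top, 0);
    atomic_init(&q->bottom, 0);
    for (int i = 0; i < SC_CAP; i++) atomic_init(&q->buffer[i], 0);
}

/* Owner only. Returns 0 when the fixed buffer is full (no resize here). */
static int sc_push(ScDeque *q, int x) {
    long long b = atomic_load(&q->bottom);
    long long t = atomic_load(&q->top);
    if (b - t >= SC_CAP) return 0;
    atomic_store(&q->buffer[b % SC_CAP], x);
    atomic_store(&q->bottom, b + 1);   /* publish the new element */
    return 1;
}

/* Owner only. Every access defaults to memory_order_seq_cst. */
static int sc_take(ScDeque *q) {
    long long b = atomic_load(&q->bottom) - 1;
    atomic_store(&q->bottom, b);       /* reserve the bottom element */
    long long t = atomic_load(&q->top);
    int x = SC_EMPTY;
    if (t <= b) {
        x = atomic_load(&q->buffer[b % SC_CAP]);
        if (t == b) {                  /* last element: race with steal */
            if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
                x = SC_EMPTY;
            atomic_store(&q->bottom, b + 1);
        }
    } else {
        atomic_store(&q->bottom, b + 1);   /* empty: undo the reservation */
    }
    return x;
}

/* Any thread other than the owner. */
static int sc_steal(ScDeque *q) {
    long long t = atomic_load(&q->top);
    long long b = atomic_load(&q->bottom);
    int x = SC_EMPTY;
    if (t < b) {
        x = atomic_load(&q->buffer[t % SC_CAP]);
        if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
            return SC_ABORT;
    }
    return x;
}
```

Because every operation is `seq_cst`, no explicit fence is needed in this sketch; the optimized versions trade this simplicity for fewer barriers.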

In our implementations of Figure 1 and Figure 2, we assume that the Deque type is declared as:

    typedef struct {
        atomic_size_t size;
        atomic_int buffer[];
    } Array;

    typedef struct {
        atomic_size_t top, bottom;
        _Atomic(Array *) array;
    } Deque;
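The paper does not show how a deque is created. A minimal sketch of the allocation step, assuming the declarations above and two hypothetical helpers of our own (`array_new` and `deque_init`), might look like this; it must run before any other thread can reach the deque, since `atomic_init` is not itself an atomic operation.

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct {
    atomic_size_t size;
    atomic_int buffer[];          /* flexible array member */
} Array;

typedef struct {
    atomic_size_t top, bottom;
    _Atomic(Array *) array;
} Deque;

/* Hypothetical helper: allocate an Array with room for `size` slots. */
static Array *array_new(size_t size) {
    Array *a = malloc(sizeof *a + size * sizeof a->buffer[0]);
    if (!a) return NULL;
    atomic_init(&a->size, size);
    for (size_t i = 0; i < size; i++) atomic_init(&a->buffer[i], 0);
    return a;
}

/* Hypothetical helper: set up an empty deque. Returns 0 on allocation failure. */
static int deque_init(Deque *q, size_t size) {
    Array *a = array_new(size);
    if (!a) return 0;
    atomic_init(&q->top, 0);
    atomic_init(&q->bottom, 0);
    atomic_init(&q->array, a);
    return 1;
}
```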

In the code of Figure 1 the atomic_ and memory_order_ prefixes have been elided for clarity. The ARMv7 pseudo-code of Figure 2 uses the keywords R and W to denote reads and writes to shared variables, and atomic indicates a block that will be executed atomically, implemented via LL/SC instructions. The x86 version is based on prior work [10] and only requires a single mfence memory barrier in take, in place of the call to thread_fence in the C11 code.

2.1 Notions of correctness

The expected behavior of the work-stealing deque is intuitive: tasks pushed into the deque are then either taken in reverse order by the same thread, or stolen by another thread. We say that an implementation is correct if it satisfies four criteria, formalized and proven correct for our ARMv7 optimized code in Section 4:

1. tasks are taken in reverse order;

2. only tasks pushed are taken or stolen (well-defined reads);

3. a task pushed into a deque cannot be taken or stolen more than once (uniqueness);

4. given a finite number of push operations, all pushed values will eventually be either taken or stolen exactly once, if enough take and steal operations are attempted (existence).
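Criteria 2 and 3 can be read as checks on the record of a finished run. The sketch below is our own illustration, not part of the paper's proof: it assumes task identifiers are distinct integers and that a test harness has recorded which identifiers were pushed and which were returned by successful take or steal calls. The function name `removals_are_valid` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical checker: `pushed` lists the distinct task ids handed to push,
 * `removed` lists the ids returned by successful take or steal calls. */
static bool removals_are_valid(const int *pushed, size_t n_pushed,
                               const int *removed, size_t n_removed) {
    for (size_t i = 0; i < n_removed; i++) {
        bool was_pushed = false;
        for (size_t j = 0; j < n_pushed; j++)
            if (pushed[j] == removed[i]) was_pushed = true;
        if (!was_pushed) return false;      /* violates criterion 2 */
        for (size_t k = 0; k < i; k++)
            if (removed[k] == removed[i])
                return false;               /* violates criterion 3 */
    }
    return true;
}
```

Such a check can only reject the runs it happens to observe; the proof in Section 4 is what rules out violations in every execution.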

These criteria hold because of the following assumptions and properties of the Chase–Lev algorithm:

• For any given deque, push and take operations execute on a single thread. Concurrency can only occur between one execution of push or take in the owner thread, and one or more executions of steal in different threads.

• Newly pushed tasks are made visible to take and steal by the increment to bottom in push. As we shall see in Section 4, our ARMv7 implementation enforces this by placing a sync barrier before the update of bottom, guaranteeing that the pushed element cannot be stolen before bottom is updated.

• Taken tasks are reserved first by updating bottom; again, in our ARMv7 code, the sync barrier placed after the update to bottom ensures that the reserved task will not be concurrently stolen.

• Stolen tasks are reserved by updating top. The only situation where steal and take contend for the same task is when the deque has a single element left; this particular conflict is resolved through the CAS instructions in both take and steal. This scenario allowed Chase and Lev to make the CAS in take conditional upon the size of the deque being 1. The correctness of this optimization on a relaxed memory model depends on the presence of the two full barriers in take and steal, to ensure that at least one of the participants will have a consistent view of the size of the deque. It is enough for just one of take or steal to see a consistent view of the size of the deque: if it is take, that will force a CAS to be performed; if it is steal, the index reservation will ensure an empty return value.

• Finally, stolen tasks are protected from being concurrently stolen multiple times by the monotonic CAS update to top in steal. This CAS orders steal operations and makes them mutually exclusive. At the same time, steal operations that abort due to a failed CAS do not change the state of the deque.
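The reservation step described in the last two points can be isolated as follows. This is a sketch under our own naming (`reserve_top` is hypothetical); the memory orders are the ones used for the CAS in Figure 1. It shows that only one contender can claim a given index, and that a failed attempt leaves `top` untouched.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper isolating the reservation step of steal (and of take
 * on the last element): claim index t by moving top from t to t + 1.
 * Returns false, leaving top unchanged, if another thread got there first. */
static bool reserve_top(atomic_size_t *top, size_t t) {
    return atomic_compare_exchange_strong_explicit(
        top, &t, t + 1, memory_order_seq_cst, memory_order_relaxed);
}
```

Since every successful call increments `top` by exactly one, the values written to `top` form a strictly increasing sequence, which is the monotonicity the argument above relies on.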

2.2 Comparison of the C11 and ARM implementations

Our C11 implementation in Figure 1 is optimal in the sense that no C11 synchronization can be removed without breaking the algorithm. However, if low-level atomics are compiled using the mapping of McKenney and Silvera [9] on ARMv7/POWER or the mapping of Terekhov [14] on x86, the generated code contains more barriers than the hand-optimized native versions on both x86 and ARMv7. We show in Section 5 that this happens because of the need for seq_cst atomics to simulate ARMv7/POWER cumulative semantics. Concretely, on ARMv7, an extra dmb instruction is inserted before each CAS operation [11], compared to the native version where a relaxed CAS (coherent and atomic only) is sufficient. On x86, an mfence instruction is added between the two reads in steal. The fully sequentially consistent C11 implementation inserts many more redundant barriers [11].

3. The memory model of ARMv7

The memory model of the ARMv7 architecture follows closely that of the POWER architecture, allowing a wide range of relaxed behaviors to be observable to the programmer:

1. The hardware threads can each perform reads and writes out-of-order, or even speculatively. Basically any local reordering is allowed unless there is a data/control dependence or synchronization instruction preventing it.

2. The memory system does not guarantee that a write becomes visible to all other hardware threads at the same time point. Writes performed by one thread are propagated to (and become visible from) any other thread in an arbitrary order, unless synchronization instructions are used.

3. A dmb barrier instruction guarantees that all the writes which have been observed by the thread issuing the barrier instruction are propagated to all the other threads before the thread can continue. Observed writes include all writes previously issued by the thread itself, as well as any write propagated to it from another thread prior to the barrier. This semantics of barrier instructions is referred to as cumulative.
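The practical consequence of points 1 and 2 is easiest to see on the message-passing idiom, which is also the shape of push in Figure 1 (write the buffer slot, fence, then write bottom). The sketch below uses portable C11 fences rather than ARMv7 assembly; the names `mp_data`, `mp_flag`, `mp_writer` and `mp_reader` are ours. Without the two fences, a reader on a relaxed architecture could observe the flag yet still read a stale payload.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int mp_data;   /* payload */
static atomic_int mp_flag;   /* set to 1 once the payload is ready */

/* Writer thread: publish the payload, then raise the flag. */
static void mp_writer(int value) {
    atomic_store_explicit(&mp_data, value, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* payload before flag */
    atomic_store_explicit(&mp_flag, 1, memory_order_relaxed);
}

/* Reader thread: returns true and fills *out once the flag is visible. */
static bool mp_reader(int *out) {
    if (atomic_load_explicit(&mp_flag, memory_order_relaxed) != 1)
        return false;
    atomic_thread_fence(memory_order_acquire);   /* flag before payload */
    *out = atomic_load_explicit(&mp_data, memory_order_relaxed);
    return true;
}
```

How a compiler lowers these fences is target-specific; on ARMv7 they are commonly emitted as `dmb` instructions, which is the barrier discussed in point 3.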

We build on the axiomatic formalization of the POWER and ARMv7 memory model by Mador-Haim et al. [7], which has been proved equivalent to the operational semantics of Sarkar et al. [12]. A gentle introduction can be found in [8].

Axiomatic execution witnesses capture abstract memory events associated with memory-related instructions and internal transitions of the model. Unlike in stronger models such as x86, each memory access is represented at run-time by two distinct events: an issuing event (called sat for reads and ini for writes), eventually followed by a commit event when the speculative state of the instruction is resolved. Once a write instruction is committed, events that propagate it to other threads can be observed; propagation to thread A is denoted $\mathrm{pp}_A$. All the relations that are part of an execution witness are listed in Table 1.

The core of the axiomatic model builds on the evord relation, modeling the happens-before order between events. This satisfies the fundamental property:

\[
\xrightarrow{\text{evord}} \;\supset\; \xrightarrow{\text{after}} \cup \xrightarrow{\text{before}} \cup \xrightarrow{\text{comm}} \cup \xrightarrow{\text{insn}} \cup \xrightarrow{\text{local}}
\]

and must be acyclic for an execution to be consistent.
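The acyclicity requirement is the engine of every proof below: an execution is ruled out by exhibiting a cycle among its events. As a concrete illustration only, the following sketch checks acyclicity of a small event graph by depth-first search. The representation (`EventGraph`, an adjacency matrix over numbered events) and the function names are our own assumptions; the check covers only the edges supplied, not the other axioms of the model.

```c
#include <stdbool.h>

enum { MAX_EVENTS = 16 };

/* edge[u][v] is true when some relation (after, before, comm, insn or local)
 * orders event u before event v; the event numbering is hypothetical. */
typedef struct {
    int n;
    bool edge[MAX_EVENTS][MAX_EVENTS];
} EventGraph;

/* Depth-first search; state 0 = unvisited, 1 = on the stack, 2 = done. */
static bool has_cycle_from(const EventGraph *g, int u, int *state) {
    state[u] = 1;
    for (int v = 0; v < g->n; v++) {
        if (!g->edge[u][v]) continue;
        if (state[v] == 1) return true;      /* back edge closes a cycle */
        if (state[v] == 0 && has_cycle_from(g, v, state)) return true;
    }
    state[u] = 2;
    return false;
}

/* True when the supplied happens-before edges contain no cycle. */
static bool is_consistent(const EventGraph *g) {
    int state[MAX_EVENTS] = {0};
    for (int u = 0; u < g->n; u++)
        if (state[u] == 0 && has_cycle_from(g, u, state)) return false;
    return true;
}
```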

We assume that the atomic sections, used to represent CAS-like behaviors, are executed atomically and obey a total order. We model them either as a single instance of a read instruction (failed CAS) or an atomic read–write pair of instruction instances (successful CAS). The atomicity of these accesses is captured by the $\xrightarrow{\text{po-atom}}$ relation. We do not assume any other property on these atomic sections (e.g., cumulativity). In practice, atomic sections can be implemented with LL/SC instructions.

We use several notation shortcuts. We refer to the deque global variables top, bottom, and array as t, b, and a. Elements of the buffer are written $x_i$, where i is the virtual index in natural numbers

    int take(Deque *q) {
        size_t b = load_explicit(&q->bottom, relaxed) - 1;
        Array *a = load_explicit(&q->array, relaxed);
        store_explicit(&q->bottom, b, relaxed);
        thread_fence(seq_cst);
        size_t t = load_explicit(&q->top, relaxed);
        int x;
        if (t <= b) {
            /* Non-empty queue. */
            x = load_explicit(&a->buffer[b % a->size], relaxed);
            if (t == b) {
                /* Single last element in queue. */
                if (!compare_exchange_strong_explicit(&q->top, &t, t + 1, seq_cst, relaxed))
                    /* Failed race. */
                    x = EMPTY;
                store_explicit(&q->bottom, b + 1, relaxed);
            }
        } else { /* Empty queue. */
            x = EMPTY;
            store_explicit(&q->bottom, b + 1, relaxed);
        }
        return x;
    }

    void push(Deque *q, int x) {
        size_t b = load_explicit(&q->bottom, relaxed);
        size_t t = load_explicit(&q->top, acquire);
        Array *a = load_explicit(&q->array, relaxed);
        if (b - t > a->size - 1) { /* Full queue. */
            resize(q);
            a = load_explicit(&q->array, relaxed);
        }
        store_explicit(&a->buffer[b % a->size], x, relaxed);
        thread_fence(release);
        store_explicit(&q->bottom, b + 1, relaxed);
    }

    int steal(Deque *q) {
        size_t t = load_explicit(&q->top, acquire);
        thread_fence(seq_cst);
        size_t b = load_explicit(&q->bottom, acquire);
        int x = EMPTY;
        if (t < b) {
            /* Non-empty queue. */
            Array *a = load_explicit(&q->array, consume);
            x = load_explicit(&a->buffer[t % a->size], relaxed);
            if (!compare_exchange_strong_explicit(&q->top, &t, t + 1, seq_cst, relaxed))
                /* Failed race. */
                return ABORT;
        }
        return x;
    }

Figure 1. C11 code of Chase–Lev deque, with low-level atomics

    int take(Deque *q) {
        size_t b = R(q->bottom) - 1;                  (a)
        Array *a = R(q->array);                       (b)
        W(q->bottom, b);                              (c)
        sync;
        size_t t = R(q->top);                         (d)
        int x;
        if (t <= b) {
            x = R(a->buffer[b % a->size]);            (e)
            if (t == b) {
                bool success = false;
                atomic /* Implemented with LL/SC. */
                    if (success = (R(q->top) == t))   (f)
                        W(q->top, t + 1);             (g)
                if (!success) x = EMPTY;
                W(q->bottom, b + 1);                  (h)
            }
        } else {
            x = EMPTY;
            W(q->bottom, b + 1);                      (i)
        }
        return x;
    }

    void push(Deque *q, int x) {
        size_t b = R(q->bottom);                      (a)
        size_t t = R(q->top);                         (b)
        Array *a = R(q->array);                       (c)
        if (b - t > a->size - 1) { /* Full queue. */
            resize(q);
            a = R(q->array);                          (d)
        }
        W(a->buffer[b % a->size], x);                 (e)
        sync;
        W(q->bottom, b + 1);                          (f)
    }

    int steal(Deque *q) {
        size_t t = R(q->top);                         (a)
        sync;
        size_t b = R(q->bottom);                      (b)
        int x = EMPTY;
        if (t < b) {
            Array *a = R(q->array);                   (c)
            x = R(a->buffer[t % a->size]);            (d)
            ctrl_isync;
            bool success = false;
            atomic /* Implemented with LL/SC. */
                if (success = (R(q->top) == t))       (e)
                    W(q->top, t + 1);                 (f)
            if (!success) return ABORT;
        }
        return x;
    }

Figure 2. ARMv7 pseudo-code of Chase–Lev deque

before any wrap-around is applied. Barrier instructions are omitted for brevity when implied by the presence of a $\xrightarrow{\text{sync}}$ or $\xrightarrow{\text{ctrl-isync}}$ relation. Irrelevant values in reads and writes are replaced with the placeholder “_” (e.g., R x, _). We do not label instruction instances individually, but decorate them with a disambiguating execution prefix, identified by a dot. These prefixes not only distinguish between instruction instances, but also group related instruction instances within the same execution unit (usually an invocation of one of push, take or steal). When no prefix is specified, the last prefix in left-to-right order is assumed.
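The distinction between the virtual index i of an element $x_i$ and the physical slot it occupies matters later, when the proof must rule out a read at one index observing a write at another. A small sketch, with hypothetical helper names `slot_of` and `same_slot`, makes the mapping used by `b % a->size` and `t % a->size` in Figures 1 and 2 explicit:

```c
#include <stdbool.h>
#include <stddef.h>

/* Physical slot holding virtual index i in a buffer of `size` slots. */
static size_t slot_of(size_t i, size_t size) {
    return i % size;
}

/* Two virtual indices share a slot of the same buffer exactly when they
 * are congruent modulo the buffer size. */
static bool same_slot(size_t i, size_t j, size_t size) {
    return slot_of(i, size) == slot_of(j, size);
}
```

For example, in a buffer of four slots, virtual indices 1 and 5 collide on slot 1, whereas 1 and 2 never do.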

4. Proof of correctness of the ARMv7 code

The proof is divided into five parts; it validates the criteria 2 to 4 enumerated in Section 2.1. Since push and take never execute concurrently and b is only ever modified in one of these functions, the proof of Criterion 1 does not involve reasoning about concurrency and we omit it here.

The proof builds on a precise analysis of all the possible execution witnesses of arbitrary invocations of the algorithm. We recall that an execution witness, as defined by the ARMv7 axiomatic model, is a graph capturing all memory events occurring during an execution (vertices), as well as the relations that link them (edges). Individual lemmas strive to narrow down the set of possible execution witnesses, based on properties of the algorithm and the architecture. To that end, we pinpoint specific subgraphs of an execution witness (hereafter, execution graphs) that cannot occur together in the same consistent execution witness. We then show that all incorrect executions, such as those containing two instances of steal reading the same value added by a single instance of push, cannot have consistent execution witnesses and, as such, cannot happen.

The proof is structured as follows. In 4.1 we provide basic technical definitions and properties of the memory model, which are used throughout the proof. In 4.2 we describe all the possible execution graphs for each of the three operations (push, take and steal), following the control flow of the ARMv7 code in Figure 2. In 4.3 we show how the succession of dynamic arrays built by resizing can be abstracted as a single sequence of unique abstract values independent of resize operations, with strong coherence and consistency properties. Corollary 2 establishes Criterion 2 (well-defined reads). In 4.4 we build on the previous abstraction to prove Theorem 1, pertaining to the uniqueness of elements taken and stolen, which corresponds to Criterion 3 (uniqueness). Finally, in 4.5, we rely on all previous results to prove Theorem 2 establishing Criterion 4 (existence): the existence of matching take or steal operations for every pushed element, under the appropriate hypotheses.

4.1 Preliminary properties

Before delving into the details of the proof itself, we introduce some support definitions and related properties.

Notation                               Meaning
R l, α                                 read of value α from location l (_ stands for any value)
W l, α                                 write of value α to location l (_ stands for any value)
sync                                   memory barrier (usually implied by $\xrightarrow{\text{sync}}$)
isync                                  instruction barrier (usually implied by $\xrightarrow{\text{ctrl-isync}}$)
sat(X)                                 satisfy (a.k.a. complete) event of a read instruction
ini(X)                                 initialize event of a write instruction
com(X)                                 commit event of an in-flight or speculative instruction
$\mathrm{pp}_A(X)$                     propagate to thread of A event
$\xrightarrow{\text{po}}$              program order
$\xrightarrow{\text{po-atom}}$         atomic operation in program order (for CAS; see below)
$\xrightarrow{\text{po-loc}}$          same-location access in program order (defined in 4.1)
$\xrightarrow{\text{co}}$              write coherence
$\xrightarrow{\text{rf}}$              read from
$\xrightarrow{\text{r}}$               read from far (defined in 4.3)
$\xrightarrow{\text{fr}}$              from read
$\xrightarrow{\text{addr}}$            address dependence (usually implicit)
$\xrightarrow{\text{ctrl}}$            control dependence (usually implicit)
$\xrightarrow{\text{data}}$            data dependence (usually implicit)
$\xrightarrow{\text{dp}}$              observable dependence (defined in 4.1)
$\xrightarrow{\text{ctrl-isync}}$      non-cumulative local ordering barrier (see below)
$\xrightarrow{\text{sync}}$            cumulative full barrier (see below)
$\xrightarrow{\text{pp-sat}}$          write-to-read propagation (defined in 4.1)
$\xrightarrow{\text{after}}$           after barrier edge
$\xrightarrow{\text{before}}$          before barrier edge
$\xrightarrow{\text{comm}}$            communication edge
$\xrightarrow{\text{insn}}$            intra-instruction order edge
$\xrightarrow{\text{local}}$           local order edge
$\xrightarrow{\text{evord}}$           event happens-before order (usually typeset as →)

On ARMv7, $\xrightarrow{\text{sync}}$ corresponds to a dmb instruction while $\xrightarrow{\text{ctrl-isync}}$ corresponds to a dependent conditional branch followed by an isb instruction.

Table 1. Summary of relations used in the ARMv7 axiomatic model

For convenience, we define the $\xrightarrow{\text{po-loc}}$ relation, which relates local (same-thread) accesses to the same memory location; $\xrightarrow{\text{po-loc}}$ implies an instruction-level communication edge $\xrightarrow{\text{co}}$, $\xrightarrow{\text{rf}}$ or $\xrightarrow{\text{fr}}$. In particular, $\xrightarrow{\text{po-loc}}$ implies $\xrightarrow{\text{co}}$ between two writes.

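We define the dependence relation $\xrightarrow{\text{dp}}$ as follows:

\[
\begin{aligned}
R\,x,\_ \xrightarrow{\text{dp}} R\,y,\_ \;&\overset{\text{def}}{\iff}\; R\,x,\_ \;\bigl(\xrightarrow{\text{addr}} \cup \xrightarrow{\text{ctrl-isync}}\bigr)\; R\,y,\_ \\
R\,x,\_ \xrightarrow{\text{dp}} W\,y,\_ \;&\overset{\text{def}}{\iff}\; R\,x,\_ \;\bigl(\xrightarrow{\text{addr}} \cup \xrightarrow{\text{ctrl}} \cup \xrightarrow{\text{data}}\bigr)\; W\,y,\_
\end{aligned}
\]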

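Lemma 1. The following properties involving $\xrightarrow{\text{dp}}$ apply:

\[
\begin{aligned}
R\,x,\_ \xrightarrow{\text{dp}} R\,y,\_ \;&\implies\; \mathrm{sat}(R\,x,\_) \to \mathrm{sat}(R\,y,\_) \\
R\,x,\_ \xrightarrow{\text{dp}} W\,y,\_ \;&\implies\; \mathrm{sat}(R\,x,\_) \to \mathrm{com}(W\,y,\_)
\end{aligned}
\]

Proof. In the case of an address or control dependence, the result is an immediate consequence of the definition of intra-instruction and local orders. It remains to be shown that the result holds for $\xrightarrow{\text{ctrl-isync}}$: a dependent conditional branch instruction, ctrl, followed by an isync barrier. Suppose $R\,x,\_ \xrightarrow{\text{ctrl-isync}} R\,y,\_$. Then we have:

\[
\mathrm{sat}(R\,x,\_) \xrightarrow{\text{insn}} \mathrm{com}(R\,x,\_) \xrightarrow{\text{local}} \mathrm{com}(\mathrm{ctrl}) \xrightarrow{\text{local}} \mathrm{com}(\mathrm{isync}) \xrightarrow{\text{local}} \mathrm{sat}(R\,y,\_)
\]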

We define the relation $\xrightarrow{\text{pp-sat}}$ between instruction instances, $A.W\,x,\_ \xrightarrow{\text{pp-sat}} B.R\,y,\_$, as follows:¹

\[
\begin{cases}
W\,x,\_ \xrightarrow{\text{po}} R\,y,\_ & \text{if } A \sim B \\
\mathrm{pp}_B(W\,x,\_) \to \mathrm{sat}(R\,y,\_) & \text{if } A \not\sim B
\end{cases}
\]

where $A \sim B$ means that instruction instances grouped under prefixes A and B belong to the same thread.

Intuitively, $\xrightarrow{\text{pp-sat}}$ represents a “known-to” relation in the following sense: $A.W\,x,\_ \xrightarrow{\text{pp-sat}} B.R\,y,\_$ means that, at the time of reading y, that specific write to x (as well as any write that is coherence-before it) is known to the thread executing B. It is clear that $\xrightarrow{\text{rf}}$ implies $\xrightarrow{\text{pp-sat}}$, by definition of communication edges (if threads are different) or uniprocessor constraints (if same thread).

¹ Note that $\xrightarrow{\text{pp-sat}}$ does not imply an event happens-before order on the events making up the related instruction instances.

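Lemma 2. The following properties involve $\xrightarrow{\text{pp-sat}}$ and $\xrightarrow{\text{po-loc}}$:

(i) $A.W\,x,\_ \xrightarrow{\text{rf}} B.R\,x,\_ \xrightarrow{\text{po-loc}} B'.R\,x,\_ \implies A.W\,x,\_ \xrightarrow{\text{pp-sat}} B'.R\,x,\_$

(ii) $A.W\,x,\_ \xrightarrow{\text{co}} B.W\,x,\_ \xrightarrow{\text{pp-sat}} C.R\,x,\_ \implies A.W\,x,\_ \not\xrightarrow{\text{rf}} C.R\,x,\_$

(iii) $W\,x,\_ \xrightarrow{\text{pp-sat}} R\,y,\_ \xrightarrow{\text{dp}} R\,z,\_ \implies W\,x,\_ \xrightarrow{\text{pp-sat}} R\,z,\_$

(iv) $\neg\bigl(A.W\,x,\_ \xrightarrow{\text{pp-sat}} B.R\,y',\_ \xrightarrow{\text{dp}} B.W\,x',\_ \xrightarrow{\text{pp-sat}} A.R\,y,\_ \xrightarrow{\text{dp}} A.W\,x,\_\bigr)$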

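Proof. We prove each point separately:

(i) If the write and the reads happen in the same thread, then all instruction instances belong to that thread, and program order prevails. Otherwise, either $A.W\,x,\_ \xrightarrow{\text{rf}} B'.R\,x,\_$ and the result is immediate, or $A.W\,x,\_ \not\xrightarrow{\text{rf}} B'.R\,x,\_$ and $B.R\,x,\_ \xrightarrow{\text{po-loc}} B'.R\,x,\_$ implies the following: $\mathrm{com}(B.R\,x,\_) \xrightarrow{\text{local}} \mathrm{sat}(B'.R\,x,\_)$, by definition of $\xrightarrow{\text{local}}$. Hence:

\[
\mathrm{pp}_B(A.W\,x,\_) \to \mathrm{sat}(B.R\,x,\_) \xrightarrow{\text{insn}} \mathrm{com}(R\,x,\_) \xrightarrow{\text{local}} \mathrm{sat}(B'.R\,x,\_)
\]

(ii) Suppose, to the contrary, that $A.W\,x,\_ \xrightarrow{\text{rf}} C.R\,x,\_$. Then $C.R\,x,\_ \xrightarrow{\text{fr}} B.W\,x,\_$, and we have the following cycle in the event happens-before order:

\[
\mathrm{sat}(C.R\,x,\_) \xrightarrow{\text{comm}} \mathrm{pp}_C(B.W\,x,\_) \to \mathrm{sat}(C.R\,x,\_)
\]

(iii) Follows from Lemma 1.

(iv) Assume that:

\[
A.W\,x,\_ \xrightarrow{\text{pp-sat}} B.R\,y',\_ \xrightarrow{\text{dp}} B.W\,x',\_ \xrightarrow{\text{pp-sat}} A.R\,y,\_ \xrightarrow{\text{dp}} A.W\,x,\_
\]

If $A \sim B$ then there is a cycle in $\xrightarrow{\text{po}}$. Otherwise, by Lemma 1, we have a cycle in the event happens-before order:

\[
\mathrm{pp}_B(W\,x,\_) \to \mathrm{sat}(R\,y',\_) \to \mathrm{com}(W\,x',\_) \xrightarrow{\text{insn}} \mathrm{pp}_A(W\,x',\_) \to \mathrm{sat}(R\,y,\_) \to \mathrm{com}(W\,x,\_) \xrightarrow{\text{insn}} \mathrm{pp}_B(W\,x,\_)
\]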

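Lemma 3. The following properties involving barriers apply:

(i) $\bigl(W\,x,\_ \xrightarrow{\text{sync}} W\,y,\_ \xrightarrow{\text{pp-sat}} R\,z,\_ \;\lor\; W\,x,\_ \xrightarrow{\text{pp-sat}} R\,y,\_ \xrightarrow{\text{sync}} R\,z,\_\bigr) \implies W\,x,\_ \xrightarrow{\text{pp-sat}} R\,z,\_$

(ii) $A.W\,x,\_ \xrightarrow{\text{rf}} B.R\,x,\_ \xrightarrow{\text{sync}} B.W\,y,\_ \xrightarrow{\text{pp-sat}} C.R\,x,\_ \implies A.W\,x,\_ \xrightarrow{\text{pp-sat}} C.R\,x,\_$

(iii) Let X stand for $A.W\,x,\_ \xrightarrow{\text{rf}} B.R\,x,\_$ or $(A \sim B).W\,x,\_$, and Y stand for $C.W\,y,\_ \xrightarrow{\text{rf}} D.R\,y,\_$ or $(C \sim D).W\,y,\_$; then the following holds:

\[
\neg\bigl(X \xrightarrow{\text{sync}} B.R\,y,\_ \xrightarrow{\text{fr}} C.W\,y,\_ \;\land\; Y \xrightarrow{\text{sync}} D.R\,x,\_ \xrightarrow{\text{fr}} A.W\,x,\_\bigr)
\]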

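Proof. We prove each point separately:

(i) If $W\,x,\_$ and $R\,z,\_$ occur in the same thread, then all instruction instances belong to that thread and program order prevails. Otherwise, suppose $R\,z,\_$ executes in A; we have two cases:

\[
\mathrm{pp}_A(W\,x,\_) \xrightarrow{\text{before}} \mathrm{pp}_A(\mathrm{sync}) \xrightarrow{\text{before}} \mathrm{pp}_A(W\,y,\_) \to \mathrm{sat}(R\,z,\_)
\]

Or the other way around:

\[
\mathrm{pp}_A(W\,x,\_) \to \mathrm{sat}(R\,y,\_) \xrightarrow{\text{insn}} \mathrm{com}(R\,y,\_) \xrightarrow{\text{local}} \mathrm{com}(\mathrm{sync}) \xrightarrow{\text{local}} \mathrm{sat}(R\,z,\_)
\]

In both cases, $\mathrm{pp}_A(W\,x,\_) \to \mathrm{sat}(R\,z,\_)$.

(ii) Suppose $A \sim C$. If $A \sim B$, then program order prevails: all the instruction instances belong to the same thread. If not, suppose $C.R\,x,\_ \xrightarrow{\text{po}} A.W\,x,\_$; then the event happens-before order contains the following cycle:

\[
\begin{aligned}
\mathrm{pp}_B(A.W\,x,\_) &\xrightarrow{\text{comm}} \mathrm{sat}(B.R\,x,\_) \xrightarrow{\text{insn}} \mathrm{com}(R\,x,\_) \xrightarrow{\text{local}} \mathrm{com}(\mathrm{sync}) \\
&\xrightarrow{\text{local}} \mathrm{com}(B.W\,y,\_) \xrightarrow{\text{insn}} \mathrm{pp}_C(W\,y,\_) \to \mathrm{sat}(C.R\,x,\_) \xrightarrow{\text{insn}} \mathrm{com}(R\,x,\_) \\
&\xrightarrow{\text{local}} \mathrm{com}(A.W\,x,\_) \xrightarrow{\text{insn}} \mathrm{pp}_B(W\,x,\_)
\end{aligned}
\]

Otherwise, suppose $A \not\sim C$. If $A \sim B$, then $A.W\,x,\_ \xrightarrow{\text{sync}} B.W\,y,\_$ and we have the result from (i). If not, we have:

\[
\mathrm{pp}_B(A.W\,x,\_) \xrightarrow{\text{comm}} \mathrm{sat}(B.R\,x,\_) \xrightarrow{\text{insn}} \mathrm{com}(R\,x,\_) \xrightarrow{\text{local}} \mathrm{com}(\mathrm{sync})
\]

Thus, we have $\mathrm{pp}_C(A.W\,x,\_) \xrightarrow{\text{before}} \mathrm{pp}_C(\mathrm{sync}) \xrightarrow{\text{before}} \mathrm{pp}_C(B.W\,y,\_) \to \mathrm{sat}(C.R\,x,\_)$.

(iii) Suppose the contrary. If $B \sim D$, then $\xrightarrow{\text{rf}}$ and $\xrightarrow{\text{fr}}$ form a path that goes against $\xrightarrow{\text{po}}$: the graph is invalid according to uniprocessor constraints. Otherwise, $B \not\sim D$ and the following holds (omitting intermediate steps in elaborating $\xrightarrow{\text{before}}$ for conciseness):

• $\mathrm{com}(B.\mathrm{sync}) \xrightarrow{\text{local}} \mathrm{com}(C.W\,y,\_) \xrightarrow{\text{local}} \mathrm{com}(D.\mathrm{sync}) \xrightarrow{\text{insn}} \mathrm{pp}_B(\mathrm{sync})$ if $B \sim C$.

• $\mathrm{com}(B.\mathrm{sync}) \xrightarrow{\text{local}} \mathrm{sat}(R\,y,\_) \xrightarrow{\text{comm}} \mathrm{pp}_B(C.W\,y,\_) \xrightarrow{\text{before}} \mathrm{pp}_B(D.\mathrm{sync})$ otherwise.

Either way, $\mathrm{com}(B.\mathrm{sync}) \to \mathrm{pp}_B(D.\mathrm{sync})$. By definition, we have an after edge between the two barriers: $\mathrm{pp}_D(B.\mathrm{sync}) \xrightarrow{\text{after}} \mathrm{com}(D.\mathrm{sync})$. Moreover, either $A \sim D$ or $A \not\sim D$:

• $\mathrm{pp}_D(B.\mathrm{sync}) \xrightarrow{\text{after}} \mathrm{com}(D.\mathrm{sync}) \xrightarrow{\text{local}} \mathrm{com}(A.W\,x,\_) \xrightarrow{\text{insn}} \mathrm{pp}_B(W\,x,\_)$ if $A \sim D$.

• $\mathrm{pp}_D(B.\mathrm{sync}) \xrightarrow{\text{after}} \mathrm{com}(D.\mathrm{sync}) \xrightarrow{\text{local}} \mathrm{sat}(R\,x,\_) \xrightarrow{\text{comm}} \mathrm{pp}_D(A.W\,x,\_)$ otherwise.

Thus, in all cases, we have a cycle:

\[
\mathrm{com}(B.\mathrm{sync}) \xrightarrow{\text{before}} \mathrm{pp}_B(A.W\,x,\_) \xrightarrow{\text{comm}} \mathrm{sat}(B.R\,x,\_) \xrightarrow{\text{insn}} \mathrm{com}(R\,x,\_) \xrightarrow{\text{local}} \mathrm{com}(B.\mathrm{sync})
\]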

4.2 Execution paths

We consider the three operations of the work-stealing algorithm: take, push and steal. Each of them exhibits different execution paths depending on control flow. Data and address dependences are implicit in the notations and are omitted for brevity. Control dependences are implied by the guard conditions in each case and are also omitted, but we make explicit the constraints on the b and t variables carrying the control dependence. Greek letters β, τ, ξ denote the memory values of b, t, and some $x_i$, respectively. Reads and writes are annotated with the corresponding line from Figure 2.

For take and steal, we say that an instance of the operation is successful if it returns one element; otherwise (including if it returns empty) it is considered failed.

4.2.1 Take

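Two failure cases return no element (empty), and two success cases return one element from the deque. All four paths start with:

\[
\text{(a)}\; R\,b,\beta \xrightarrow{\text{po}} \text{(b)}\; R\,a,\&x \xrightarrow{\text{po}} \text{(c)}\; W\,b,\beta-1 \xrightarrow{\text{sync}} \text{(d)}\; R\,t,\tau
\]

Specific continuations for each path are listed below.

Return empty without CAS, $\beta - \tau \le 0$:

\[
\cdots \xrightarrow{\text{po}} \text{(i)}\; W\,b,\beta
\]

Return empty with (failed) CAS, $\beta - \tau = 1$, $\tau \ne \tau'$:

\[
\cdots \xrightarrow{\text{po}} \text{(e)}\; R\,x_{\beta-1},\xi \xrightarrow{\text{po}} \text{(f)}\; R\,t,\tau' \xrightarrow{\text{po}} \text{(h)}\; W\,b,\tau+1
\]

Return one without CAS, $\beta - \tau > 1$:

\[
\cdots \xrightarrow{\text{po}} \text{(e)}\; R\,x_{\beta-1},\xi
\]

Return one with (successful) CAS, $\beta - \tau = 1$:

\[
\cdots \xrightarrow{\text{po}} \text{(e)}\; R\,x_{\beta-1},\xi \xrightarrow{\text{po}} \text{(f)}\; R\,t,\tau \xrightarrow{\text{po-atom}} \text{(g)}\; W\,t,\tau+1 \xrightarrow{\text{po}} \text{(h)}\; W\,b,\beta
\]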

4.2.2 Push

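There are two paths: a straight case, and a resizing case which grows the underlying circular buffer.

Straight, $\beta - \tau < \mathrm{size}(x) - 1$:

\[
\text{(a)}\; R\,b,\beta \xrightarrow{\text{po}} \text{(b)}\; R\,t,\tau \xrightarrow{\text{po}} \text{(c)}\; R\,a,\&x \xrightarrow{\text{po}} \text{(e)}\; W\,x_\beta,\xi \xrightarrow{\text{sync}} \text{(f)}\; W\,b,\beta+1
\]

Resizing, $\beta - \tau \ge \mathrm{size}(x) - 1$, where $x'$ refers to the new array:

\[
\text{(a)}\; R\,b,\beta \xrightarrow{\text{po}} \text{(b)}\; R\,t,\tau \xrightarrow{\text{po}} \text{(c)}\; R\,a,\&x \xrightarrow{\text{po}} \mathit{resize} \xrightarrow{\text{sync}} \text{(d)}\; R\,a,\&x' \xrightarrow{\text{po}} \text{(e)}\; W\,x'_\beta,\xi \xrightarrow{\text{sync}} \text{(f)}\; W\,b,\beta+1
\]

where

\[
\mathit{resize} = R\,x_\tau,\xi_\tau \xrightarrow{\text{po}} W\,x'_\tau,\xi_\tau \xrightarrow{\text{po}} \cdots \xrightarrow{\text{po}} R\,x_{\beta-1},\xi_{\beta-1} \xrightarrow{\text{po}} W\,x'_{\beta-1},\xi_{\beta-1} \xrightarrow{\text{sync}} W\,a,\&x'
\]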
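The body of resize is not listed in Figures 1 and 2; only its memory events appear in the path above. The following is a sketch of one way it could be written in portable C11, under assumptions of our own: the name `resize_sketch`, the doubling policy, the return value, and the use of a release store to stand in for the sync that precedes the write of a. The old array is deliberately not freed, since a concurrent steal may still be reading it; safe reclamation is outside the scope of this sketch.

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct { atomic_size_t size; atomic_int buffer[]; } Array;
typedef struct { atomic_size_t top, bottom; _Atomic(Array *) array; } Deque;

/* Hypothetical resize, called by the owner from push when the buffer is
 * full. Returns 1 on success and 0 if allocation fails. */
static int resize_sketch(Deque *q) {
    Array *old = atomic_load_explicit(&q->array, memory_order_relaxed);
    size_t old_size = atomic_load_explicit(&old->size, memory_order_relaxed);
    size_t new_size = 2 * old_size;            /* assumed growth policy */
    Array *a = malloc(sizeof *a + new_size * sizeof a->buffer[0]);
    if (!a) return 0;
    atomic_init(&a->size, new_size);
    for (size_t i = 0; i < new_size; i++) atomic_init(&a->buffer[i], 0);

    size_t t = atomic_load_explicit(&q->top, memory_order_relaxed);
    size_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed);
    /* Copy x_i to x'_i for every live virtual index i in [t, b). */
    for (size_t i = t; i < b; i++) {
        int v = atomic_load_explicit(&old->buffer[i % old_size],
                                     memory_order_relaxed);
        atomic_store_explicit(&a->buffer[i % new_size], v,
                              memory_order_relaxed);
    }
    /* Publish the new array only after all copies are in place. */
    atomic_store_explicit(&q->array, a, memory_order_release);
    return 1;
}
```

Each element keeps its virtual index across the copy; only its physical slot changes, which is what lets Section 4.3 treat the succession of arrays as a single sequence of abstract values.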

4.2.3 Steal

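There are three paths: two failure cases and one success case. Failure returns no element and success returns a stolen element.

Return empty without CAS, $\beta - \tau \le 0$:

\[
\text{(a)}\; R\,t,\tau \xrightarrow{\text{sync}} \text{(b)}\; R\,b,\beta
\]

Return empty with (failed) CAS, $\beta - \tau > 0 \land \tau \ne \tau'$:

\[
\text{(a)}\; R\,t,\tau \xrightarrow{\text{sync}} \text{(b)}\; R\,b,\beta \xrightarrow{\text{ctrl-isync}} \text{(c)}\; R\,a,\&x \xrightarrow{\text{po}} \text{(d)}\; R\,x_\tau,\xi \xrightarrow{\text{ctrl-isync}} \text{(e)}\; R\,t,\tau'
\]

Return one with (successful) CAS, $\beta - \tau > 0$:

\[
\text{(a)}\; R\,t,\tau \xrightarrow{\text{sync}} \text{(b)}\; R\,b,\beta \xrightarrow{\text{ctrl-isync}} \text{(c)}\; R\,a,\&x \xrightarrow{\text{po}} \text{(d)}\; R\,x_\tau,\xi \xrightarrow{\text{ctrl-isync}} \text{(e)}\; R\,t,\tau \xrightarrow{\text{po-atom}} \text{(f)}\; W\,t,\tau+1
\]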

4.3 Significant reads and writes

We define the sequence $(\beta_n)$ of values taken by the variable b over the course of the program, according to the write coherence relation. Initially $\beta_0 = 0$. Since all push and take operations occur in a single thread, and steal operations never alter the value of b, the elements of $(\beta_n)$ correspond to writes to b in program order within the push and take operations. Similarly, we define the sequence $(\tau_m)$ of values taken by the variable t. We assume $\tau_0 = 0$. Furthermore, since all writes to t are from CAS instructions, which are sequentially ordered, and all such CAS instructions increment t by one, $(\tau_m)$ is monotonically increasing, and such that $\tau_m = m$.

For each index i, we define the sequence $(\xi_i^v)_{v \in \mathbb{N}}$ of successive values given to the element at index i in the deque by the last write $W\,x_i,\_$ of a push operation, regardless of the address &x of the underlying array. Only the last such write is called significant, as it induces a new value in a $(\xi_i^v)$ sequence, while writes due to resizing do not. For all i, $\xi_i^0$, the value before the first significant write to the $x_i$ location, is undefined: $\xi_i^0 = \bot$. Similarly, a read is significant if it occurs in a successful instance of take or steal.

Lemma 4. For all i, $(\xi_i^v)$ is globally coherent.

Proof. Consider two significant writes $W\,x_i, \_$ and $W\,x'_i, \_$ at index i (regardless of the address of the underlying array). If $W\,x_i, \_$ and $W\,x'_i, \_$ both write to the same memory location, then they are ordered by write coherence. If they do not, then there must be a resize operation after the first write and before the second (all writes happen in the same thread). Because of the cumulative barrier after a resize operation, threads that see the second value must have seen the first beforehand. Hence, there is a global coherence order on the writes, which corresponds to the order of push operations.

We define the relation read from far as follows: for some memory locations $m_0, \ldots, m_n$ and some value v, $W\,m_0, v \xrightarrow{r} R\,m_n, v$ if $W\,m_0, v \xrightarrow{\text{rf}} R\,m_n, v$ or there exists a sequence of copies carrying the value of the write to the read:

$W\,m_0, v \xrightarrow{\text{rf}} R\,m_0, v \xrightarrow{\text{data}} W\,m_1, v \xrightarrow{\text{rf}} \cdots \xrightarrow{\text{data}} W\,m_n, v \xrightarrow{\text{rf}} R\,m_n, v.$
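
Read from far amounts to following rf and data edges backwards from the read until the originating write is reached. A minimal sketch, assuming events are plain hashable labels and that the two relations are given as dictionaries (the function name and this encoding are hypothetical, not taken from the formal model):

```python
def reads_from_far(write, read, rf, data):
    """Return True if `read` gets its value from `write`, directly or via copies.

    rf   : maps each read event to the write it reads from
    data : maps each copy write to the read whose value it stores
    """
    seen = set()
    current = read
    while current in rf and current not in seen:
        seen.add(current)
        source = rf[current]
        if source == write:
            return True          # reached the originating write
        if source not in data:
            return False         # the chain stops at some other write
        current = data[source]   # step back through one copy
    return False
```

With `rf = {"R m0": "W m0", "R m1": "W m1"}` and `data = {"W m1": "R m0"}`, the read `"R m1"` reads from far the write `"W m0"` through a single copy.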

For conciseness, we hereafter omit the variable name from reads and writes whenever the variable can be inferred from the value: e.g., $W\beta_n$ stands for $W\,b, \beta_n$. Let $W\xi_i^v$ denote the $v$th significant write at index i, and $R\xi_i^v$ a significant read s.t. $W\xi_i^v \xrightarrow{r} R\xi_i^v$.

Lemma 5. Given a write $W\,x_i, \_$ and a read $R\,x'_j, \_$,

$i \ne j \implies W\,x_i, \_ \overset{\text{rf}}{\nrightarrow} R\,x'_j, \_$

Proof. If the addresses of the underlying arrays differ, then the memory locations read and written are distinct and there can be no read from relation.

Otherwise, since old arrays are never reused, the addresses are the same and $i \equiv j \pmod{\mathrm{size}(x)}$. $R\,x'_j, \_$ belongs to a successful instance of take, push (with resizing), or steal. Let X be that instance.

Let P be the instance of push to which $W\,x_i, \_$ belongs. In P, we have the following execution graph:

$P.R\,t, \tau_P \xrightarrow{\text{ctrl}} W\,x_i, \_ \xrightarrow{\text{sync}} W\,b, \beta_P + 1$ where $\tau_P \le i \le \beta_P$ and $\beta_P - \tau_P < \mathrm{size}(x) - 1$

Let us assume $i \ne j \land W\,x_i, \_ \xrightarrow{\text{rf}} R\,x'_j, \_$ and show that this is impossible.

Assume X is a successful instance of take or push. Since X and P belong to the same thread, P must occur before X in program order (the order of loads and stores to the same location is preserved: $P.W\,x_i, \_ \xrightarrow{\text{po-loc}} X.R\,x'_j, \_$).

If $j < i$, then $j \le i - \mathrm{size}(x)$. However, the following must hold in P:

$\tau_P \le i \le \beta_P \land \beta_P - \tau_P < \mathrm{size}(x) - 1$

hence $j < i - \mathrm{size}(x) + 1 \le \beta_P - \mathrm{size}(x) + 1 < \tau_P$

Furthermore, if X is a take operation, $R\,x'_j, \_$ reads the last element of the deque, and $j = \beta_X - 1 \ge \tau_X$; if X is a push operation, $R\,x'_j, \_$ results from a copy operation of the resizing code, hence $j \ge \tau_X$. Since X occurs after P in program order and t is monotonically increasing, $P.R\,t, \tau_P \xrightarrow{\text{po-loc}} X.R\,t, \tau_X$ and $j < \tau_P \le \tau_X \le j$. Impossible.

If $i < j$, then, since $j \ge \beta_X$, b must increase from $\beta_P + 1$ to $j + 1$ between the write in P and the read in X. Hence, there must be an instance $P'$ of push between P and X (in program order) that increments b to $j + 1$.

Indeed, the only writes that increase the value of b occur in push and take; and the effect of take as a whole never increases the value of b since it first decrements the variable. We have:

$P.W\,x_i, \_ \xrightarrow{\text{po-loc}} P'.W\,x_j, \_ \xrightarrow{\text{po-loc}} X.R\,x'_j, \_$

hence $P.W\,x_i, \_ \xrightarrow{\text{co}} P'.W\,x_j, \_ \xrightarrow{\text{pp-sat}} X.R\,x'_j, \_$

Thus, from Lemma 2 (ii), $P.W\,x_i, \_ \overset{\text{rf}}{\nrightarrow} X.R\,x'_j, \_$.

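Now, assume X is a successful instance of steal. We have the following execution graph for X:

$X.R\,t, \tau_X = j \xrightarrow{\text{sync}} R\,b, \beta_X \xrightarrow{\text{ctrl-isync}} R\,a, \&x' \xrightarrow{\text{po}} R\,x'_j, \_ \xrightarrow{\text{ctrl-isync}} R\,t, \tau_X \xrightarrow{\text{po-atom}} W\,t, \tau_X + 1$

If $j < i$, then $j \le i - \mathrm{size}(x)$. However, the following must hold in P:

$j < i - \mathrm{size}(x) + 1 \le \beta_P - \mathrm{size}(x) + 1 < \tau_P$

Hence $\tau_X = j < \tau_P$. Since t increases monotonically, it must be that:

$X.R\,x'_j, \_ \xrightarrow{\text{ctrl-isync}} R\,t, \tau_X \xrightarrow{\text{po-atom}} W\,t, \tau_X + 1 \xrightarrow{\text{rf}} R\,t, \_ \xrightarrow{\text{sync}} W\,t, \_ \xrightarrow{\text{rf}} \cdots \xrightarrow{\text{sync}} W\,t, \tau_P \xrightarrow{\text{rf}} P.R\,t, \tau_P \xrightarrow{\text{ctrl}} W\,x_i, \_$

Hence $X.R\,x'_j, \_$ must be committed before $W\,t, \tau_X + 1$. Since $W\,t, \tau_X + 1$ is (cumulatively) propagated to $W\,x_i, \_$, $X.R\,x'_j, \_$ must be committed before $W\,x_i, \_$. Formally: it follows from Lemma 3 (ii) that $W\,t, \tau_X + 1 \xrightarrow{\text{pp-sat}} P.R\,t, \tau_P$. If $W\,x_i, \_ \xrightarrow{\text{rf}} R\,x'_j, \_$ then $W\,x_i, \_ \xrightarrow{\text{pp-sat}} R\,x'_j, \_$. We get:

$X.W\,t, \tau_X + 1 \xrightarrow{\text{pp-sat}} P.R\,t, \tau_P \xrightarrow{\text{ctrl}} W\,x_i, \_ \ \land\ P.W\,x_i, \_ \xrightarrow{\text{pp-sat}} X.R\,x'_j, \_ \xrightarrow{\text{ctrl-isync}} W\,t, \tau_X + 1$

Lemma 2 (iv) shows that this is impossible. Thus $P.W\,x_i, \_ \overset{\text{rf}}{\nrightarrow} X.R\,x'_j, \_$.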

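If $i < j$, then $i \le j - \mathrm{size}(x)$, and there must be an instance $P'$ of push s.t. $P'.W\,b, j + 1 \xrightarrow{\text{po-loc}} W\,b, \beta_X \xrightarrow{\text{rf}} X.R\,b, \beta_X$ (so that index j is accessible in X). $P'$ cannot occur before P in program order because, as above, we would have $\tau_{P'} \le \tau_P \le i$ on the one hand, and $i \le j - \mathrm{size}(x) < \tau_{P'}$ on the other hand. The underlying array also monotonically increases in size, so the inequality still holds if the sizes of P and $P'$ differ.

Hence $P'$ occurs after P. Furthermore $W\,x''_j, \_ \in P'$. If $x$ in P and $x''$ in $P'$ refer to different arrays, then a resize operation R must precede $P'$, s.t.

$W\,a, \&x \xrightarrow{\text{po-loc}} P.R\,a, \&x \xrightarrow{\text{po-loc}} R.W\,a, \&x'' \xrightarrow{\text{sync}} P'.W\,x''_j, \_ \xrightarrow{\text{sync}} W\,b, j + 1 \xrightarrow{\text{po-loc}} W\,b, \beta_X \xrightarrow{\text{rf}} X.R\,b, \beta_X \xrightarrow{\text{ctrl-isync}} R\,a, \&x' \xrightarrow{\text{addr}} R\,x'_j, \_$

hence $W\,a, \&x \xrightarrow{\text{co}} R.W\,a, \&x'' \xrightarrow{\text{sync}} W\,b, \beta_X \xrightarrow{\text{pp-sat}} X.R\,b, \beta_X$

From Lemma 2 (iii), $W\,b, \beta_X \xrightarrow{\text{pp-sat}} X.R\,a, \&x'$; Lemma 2 (ii) concludes that $W\,a, \&x \overset{\text{rf}}{\nrightarrow} X.R\,a, \&x'$. Since all resize operations allocate new arrays, $\&x' \ne \&x$, which contradicts our premises. Otherwise, $x$ and $x''$ refer to the same array, hence $W\,x_i, \_ \xrightarrow{\text{po-loc}} W\,x''_j, \_$, and we get:

$P.W\,x_i, \_ \xrightarrow{\text{po-loc}} P'.W\,x''_j, \_ \xrightarrow{\text{sync}} W\,b, j + 1 \xrightarrow{\text{po-loc}} W\,b, \beta_X \xrightarrow{\text{rf}} X.R\,b, \beta_X \xrightarrow{\text{ctrl-isync}} R\,x'_j, \_$

It follows from Lemmas 3 (i) and 2 (iii) that:

$P.W\,x_i, \_ \xrightarrow{\text{co}} W\,x''_j, \_ \xrightarrow{\text{pp-sat}} R\,x'_j, \_$

Hence, from Lemma 2 (ii), $W\,x_i, \_ \overset{\text{rf}}{\nrightarrow} R\,x'_j, \_$.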
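
The arithmetic step used in both $j < i$ cases, namely that an index j congruent to i modulo the array size and smaller than i necessarily lies below $\tau_P$, can be checked exhaustively on small instances. The sketch below does so under the constraints stated for P; the function name and the bounds `max_size` and `max_top` are arbitrary illustrative choices, and a finite check is of course no substitute for the inequality chain above.

```python
def stale_indices_below_top(max_size=8, max_top=12):
    """Check: tau <= i <= beta, beta - tau < size - 1, j = i (mod size), j < i  =>  j < tau.

    Returns (all instances hold, number of instances checked).
    """
    checked = 0
    for size in range(2, max_size + 1):
        for tau in range(max_top + 1):
            # beta ranges over tau .. tau + size - 2, i.e. beta - tau < size - 1.
            for beta in range(tau, tau + size - 1):
                for i in range(tau, beta + 1):
                    # All j >= 0 with j congruent to i modulo size and j < i.
                    for j in range(i % size, i, size):
                        if not j < tau:
                            return (False, checked)
                        checked += 1
    return (True, checked)
```

Every enumerated slot that aliases index i in the circular array has therefore already been passed by t when P writes $x_i$.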

Corollary 1. Given a significant write $W\xi_i^v$ and a significant read $R\,x'_j, \_$: $i \ne j \implies W\xi_i^v \overset{r}{\nrightarrow} R\,x'_j, \_$.

Proof. If $i \ne j$, we know that $W\xi_i^v \overset{\text{rf}}{\nrightarrow} R\,x'_j, \_$. Furthermore, all copies, which happen during a resize operation, copy from and to the same index.

Since there are fewer copies than the size of the expanded array, there can be no two copies writing to the same memory location in the new array. Hence, there can be no sequence of copies from $W\xi_i^v$ to $R\,x'_j, \_$.
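
The counting argument on copies can be illustrated as follows. This is a sketch assuming that a resize copies the logical indices from t to b - 1 and that slots are obtained by reducing the index modulo the array size; the function names are hypothetical.

```python
def resize_copy_slots(t, b, old_size, new_size):
    """List (logical index, source slot, destination slot) for each copy of a resize."""
    return [(i, i % old_size, i % new_size) for i in range(t, b)]


def copies_are_collision_free(t, b, old_size, new_size):
    """True if no two copies write to the same slot of the new array."""
    destinations = [dst for _, _, dst in resize_copy_slots(t, b, old_size, new_size)]
    return len(destinations) == len(set(destinations))
```

As long as the number of copies b - t does not exceed `new_size`, consecutive logical indices map to pairwise distinct destination slots, so each slot of the new array receives at most one copied value, carried from the same logical index.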

Lemma 6. Given a significant write $W\xi_i^u$ and a significant read $R\xi_i^v$:

(i) $W\xi_i^u \xrightarrow{\text{pp-sat}} R\,a, \&x \xrightarrow{\text{addr}} R\,x_i, \xi_i^v \implies u \le v$

(ii) $0 < u \le v \implies W\xi_i^u \xrightarrow{\text{pp-sat}} R\,x_i, \xi_i^v$

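Proof. We prove each point separately:

(i) Suppose $v < u$. We define $W'.W\,x_i, \xi_i^v$ as follows.

If $v = 0$, $\xi_i^v$ is an undefined value; let $W'.W\,x_i, \xi_i^0 \xrightarrow{\text{rf}} R\,x_i, \xi_i^v$ be the initialization of $x_i$. $W'.W\,x_i, \xi_i^0$ comes before $W\xi_i^u$ in program order.

Otherwise, $0 < v < u$. Let $W.W\xi_i^v$ be the significant write s.t. $W.W\xi_i^v \xrightarrow{r} R\,x_i, \xi_i^v$. In other words, there exists a sequence of copies carrying the value of $\xi_i^v$ to $R\,x_i, \xi_i^v$. That sequence ends with a write $W'.W\,x_i, \xi_i^v \xrightarrow{\text{rf}} R\,x_i, \xi_i^v$. Moreover, according to the definition of $(\xi_i^v)$ and the semantics of resizing, $W.W\xi_i^v$ and $W'.W\,x_i, \xi_i^v$ must come before $W\xi_i^u$ in program order.

We have two cases: either $W\xi_i^u$ and $R\,x_i, \xi_i^v$ refer to the same memory location or they do not.

Assume that they refer to the same memory location $x_i$. Then it must be that $W'.W\,x_i, \xi_i^v \xrightarrow{\text{po-loc}} W\,x_i, \xi_i^u$, and we have:

$W'.W\,x_i, \xi_i^v \xrightarrow{\text{co}} W\xi_i^u \xrightarrow{\text{pp-sat}} R\,a, \&x \xrightarrow{\text{addr}} R\,x_i, \xi_i^v$

Hence, from Lemma 2 (ii), $W'.W\,x_i, \xi_i^v \overset{\text{rf}}{\nrightarrow} R\,x_i, \xi_i^v$. Impossible.

Conversely, assume that they do not refer to the same memory location. Then there must be a resize operation between $W'.W\,x_i, \xi_i^v$ and $W\xi_i^u$:

$W\,a, \&x \xrightarrow{\text{sync}} W'.W\,x_i, \xi_i^v \xrightarrow{\text{sync}} W\,a, \&x' \xrightarrow{\text{sync}} W\,x'_i, \xi_i^u \xrightarrow{\text{pp-sat}} R\,a, \&x \xrightarrow{\text{addr}} R\,x_i, \xi_i^v$

Hence, from Lemma 3 (i), $W\,a, \&x \xrightarrow{\text{co}} W\,a, \&x' \xrightarrow{\text{pp-sat}} R\,a, \&x$. And from Lemma 2 (ii), $W\,a, \&x \overset{\text{rf}}{\nrightarrow} R\,a, \&x$. Since there is only one write $W\,a, \&x$ that gives the value $\&x$ to a, we have a contradiction.

(ii) There exists a write $W.W\xi_i^v$ s.t. $W.W\xi_i^v \xrightarrow{r} R\xi_i^v$, and a sequence of copies carrying the value of $\xi_i^v$ to $R\xi_i^v$. That sequence ends with a write $W'.W\xi_i^v \xrightarrow{\text{rf}} R\xi_i^v$. Since $u \le v$, $W\xi_i^u \xrightarrow{\text{po}} W.W\xi_i^v$ by definition of $(\xi_i^v)$. Thanks to the barrier after $W\xi_i^u$ in push, $W\xi_i^u \xrightarrow{\text{sync}} W'.W\xi_i^v \xrightarrow{\text{rf}} R\xi_i^v$. From Lemma 3 (i), we get $W\xi_i^u \xrightarrow{\text{pp-sat}} R\xi_i^v$.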

Corollary 2 (Well-defined significant reads). Given a significant read $R\,x_i, \xi$, $\xi = \xi_i^v$ for some $v > 0$.

Proof. Let X be the successful instance of take or steal s.t. $R\,x_i, \xi \in X$. Suppose $\xi \ne \xi_i^v$, then $\xi = \bot$ can only be an undefined value from the uninitialized array, prior to copying. Indeed, if $x_i$ is not affected by copying, then it must be one of the new slots allocated by the resizing, hence its initial value is $\xi_i^0$. Let R be the push operation that allocates the array x. There exists a $\xi_i^u$ such that:

$W\,x_i, \bot \xrightarrow{\text{co}} R.W\,x_i, \xi_i^u \xrightarrow{\text{sync}} W\,a, \&x \xrightarrow{\text{rf}} X.R\,a, \&x \xrightarrow{\text{addr}} R\,x_i, \xi$

It follows from Lemmas 2 (iii), 3 (i) and 2 (ii) that $W\,x_i, \bot \overset{\text{rf}}{\nrightarrow} R\,x_i, \xi$. Impossible.

Hence, $\xi = \xi_i^v$. We have $R\,b, \beta \in X$ and $\beta \ge i + 1 > 0$, for X is successful. Hence, there is an instance of push P s.t. $P.W\,b, \beta \xrightarrow{\text{rf}} X.R\,b, \beta$.

Since $\beta \ge i + 1$, either $\beta = i + 1$ and $W\xi_i^u \in P$, or there must be an instance of push that contains a significant write $W\xi_i^u$ and comes before P in program order. In both cases, $W\xi_i^u$ belongs to a push operation, hence $u > 0$. Moreover, thanks to the barrier after a significant write in push, $W\xi_i^u \xrightarrow{\text{sync}} P.W\,b, \beta$. If X is an instance of take, $P.W\,b, \beta \xrightarrow{\text{po}} X.R\xi_i^v$; otherwise, $P.W\,b, \beta \xrightarrow{\text{rf}} X.R\,b, \beta \xrightarrow{\text{ctrl-isync}} R\xi_i^v$ and Lemma 3 (ii) gives $P.W\,b, \beta \xrightarrow{\text{pp-sat}} X.R\xi_i^v$. In both cases, $W\xi_i^u \xrightarrow{\text{sync}} P.W\,b, \beta \xrightarrow{\text{pp-sat}} X.R\xi_i^v$, hence, by Lemmas 3 (i) and 6, $0 < u \le v$.

4.4 Uniqueness of significant reads

The results from the previous section establish that two significant reads at different indexes cannot retrieve the same element $\xi_i^v$. The only possible cause of duplicate significant reads is thus reduced to the case where the reads access the same index i.

Theorem 1 (Work-stealing: uniqueness of significant reads). Consider a worker thread executing a sequence of push and take operations, and a finite number of thief threads each executing steal operations, all against the same deque. If X and Y are two distinct successful instances of steal or take,

$\forall R\xi_i^v \in X,\ \forall R\xi_{i'}^{v'} \in Y,\quad i \ne i' \lor v \ne v'$

Lemma 7. Given $S_1$ and $S_2$ distinct successful instances of steal,

$\forall R\xi_i^v \in S_1,\ \forall R\xi_{i'}^{v'} \in S_2,\quad i \ne i'$

Proof. All writes to t atomically increment it (by atomicity of CAS). Hence two successful steal operations cannot write (thus read) the same value of t. Reads from x in a steal operation access the index given by the value of the t variable. Hence $R\,t, i \in S_1$ and $R\,t, i' \in S_2$ imply $i \ne i'$.
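
The role of the CAS in this argument can be observed with a small experiment. The sketch below is only an analogy: Python threads and a lock-based `cas` method stand in for the hardware CAS, b is frozen at `n_items`, and the names `Top` and `run_thieves` are hypothetical. It says nothing about relaxed memory behaviour, only about the atomicity of the increment.

```python
import threading


class Top:
    """The variable t with a lock-based compare-and-swap standing in for CAS."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def load(self):
        return self.value

    def cas(self, expected, new):
        with self._lock:
            if self.value != expected:
                return False
            self.value = new
            return True


def run_thieves(n_items, n_thieves):
    """Let several thieves steal from indices 0 .. n_items - 1; return what each got."""
    top = Top()
    stolen = [[] for _ in range(n_thieves)]

    def thief(k):
        while True:
            t = top.load()
            if t >= n_items:          # nothing left to steal
                return
            if top.cas(t, t + 1):     # successful steal: index t belongs to thief k
                stolen[k].append(t)

    threads = [threading.Thread(target=thief, args=(k,)) for k in range(n_thieves)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return stolen
```

Whatever the interleaving, each index is claimed by exactly one successful CAS, so no two successful steals read the same index.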
