Correct and Efficient Work-Stealing for Weak Memory Models


HAL Id: hal-00802885

https://hal.inria.fr/hal-00802885

Submitted on 20 Mar 2013



Nhat Minh Lê, Antoniu Pop, Albert Cohen, Francesco Zappa Nardelli

To cite this version:

Nhat Minh Lê, Antoniu Pop, Albert Cohen, Francesco Zappa Nardelli. Correct and Efficient Work-Stealing for Weak Memory Models. PPoPP ’13 - Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming, Feb 2013, Shenzhen, China. pp.69-80, 10.1145/2442516.2442524. hal-00802885.


Nhat Minh Lê Antoniu Pop Albert Cohen Francesco Zappa Nardelli

INRIA and ENS Paris

Abstract

Chase and Lev’s concurrent deque is a key data structure in shared-memory parallel programming and plays an essential role in work-stealing schedulers. We provide the first correctness proof of an optimized implementation of Chase and Lev’s deque on top of the POWER and ARM architectures: these provide very relaxed memory models, which we exploit to improve performance but which considerably complicate the reasoning. We also study an optimized x86 and a portable C11 implementation, conducting systematic experiments to evaluate the impact of memory barrier optimizations. Our results demonstrate the benefits of hand tuning the deque code when running on top of relaxed memory models.

Categories and Subject Descriptors D.1.3 [Programming Techniques]: Concurrent Programming; E.1 [Data Structures]: Lists, stacks, and queues

Keywords lock-free algorithm, work-stealing, relaxed memory model, proof

1. Introduction

Multicore POWER and ARM architectures are standard targets for server, consumer electronics, and embedded control applications.

The difficulties of parallel programming are exacerbated by the relaxed memory model implemented by these architectures, which allow the processors to perform a wide range of optimizations, including thread-local reordering and non-atomic store propagation.

The safety-critical nature of many embedded applications calls for solid foundations for parallel programming. This paper shows that a high degree of confidence can be achieved for highly optimized, real-world, concurrent algorithms, running on top of weak memory models. A good test case is provided by the runtime scheduler of a task library. We thus focus on Chase and Lev’s concurrent double-ended queue (deque) [3], the cornerstone of most work-stealing schedulers. Until now, no rigorous correctness proof has been provided for implementations of this algorithm running on top of a relaxed memory model. Furthermore, while work-stealing is widely used on the x86 architecture (an evaluation under a restrictive hypothesis of idempotence of the workload can be found in [10]), few experiments target weaker memory models.

Our first contribution is a correctness proof of this fundamental concurrent data structure running on top of a relaxed memory model. We provide a hand-tuned implementation of Chase and Lev’s deque for the ARM architectures, and prove its correctness against the memory semantics defined in [12] and [7]. Our second contribution is a systematic study of the performance of several implementations of Chase–Lev on relaxed hardware. In detail, we compare our optimized ARM implementation against a standard implementation for the x86 architecture and two portable variants expressed in C11: a reference sequentially consistent translation of the algorithm, and an aggressively optimized version making full use of the release–acquire and relaxed semantics offered by C11 low-level atomics. These implementations of the Chase–Lev deque are evaluated in the context of a work-stealing scheduler. We consider diverse worker/thief configurations, including a synthetic benchmark with two different workloads and standard task-parallel kernels. Our experiments demonstrate the impact of the memory barrier optimization on the throughput of our work-stealing runtime. We also comment on how the ARM correctness proof can be tailored to these alternative implementations. As a side effect, we highlight that our optimized ARM implementation cannot be expressed using C11 low-level atomics, which invariably end up inserting one redundant synchronization instruction.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

PPoPP’13, February 23–27, 2013, Shenzhen, China.

Copyright © 2013 ACM 978-1-4503-1922/13/02...$10.00

2. Chase–Lev deque

User-space runtime schedulers offer an excellent playground for studying low-level high-performance code. We focus on randomized work-stealing: it was originally designed as the scheduler of the Cilk language for shared-memory multiprocessors [4], but thanks to its merits [2] it has been adopted in a number of parallel libraries and parallel programming environments, including the Intel TBB and compiler suite. Work-stealing variants have also been proposed for distributed clusters [5] and heterogeneous platforms [1]. The scheduling strategy is intuitive:

• Each processor uses a dynamic array as a deque holding tasks ready to be scheduled.

• Each processor manages its own deque as a stack. It may only push and pop tasks from the bottom of its own deque.

• Other processors may not push or pop from that deque; instead, they steal tasks from the top when their own deque is empty. In most implementations, the stolen deque is selected at random.

• Initially, one processor starts with the “root” task of the parallel program in its deque, and all other deques are empty.

The state-of-the-art algorithm for the work-stealing deque is Chase and Lev’s lock-free deque [3]. It uses an array with automatic, asynchronous growth. Assuming sequentially consistent memory, it involves only one atomic compare-and-swap (CAS) per steal, no CAS on push, and no CAS on take except when the deque has exactly one element left.

We implemented and tested four versions of the concurrent deque algorithm, with different barrier configurations: (1) a sequentially consistent version, written with C11 seq_cst atomics, following the original description in [3]; (2) an optimized version, which takes full advantage of the C11 relaxed memory model, reported in Figure 1; (3) a native version for ARMv7, reported in Figure 2; and (4) a native version for x86. These native versions rely on compiler intrinsics and inline assembly to leverage architecture-specific assumptions and thus reduce the number of barriers required.

In our implementations of Figure 1 and Figure 2, we assume that the Deque type is declared as:

typedef struct {
  atomic_size_t size;
  atomic_int buffer[];
} Array;

typedef struct {
  atomic_size_t top, bottom;
  _Atomic(Array *) array;
} Deque;
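The declaration above does not say how a deque is created. The following is a minimal sketch of one way to allocate and initialize it; the helpers array_new and deque_init are hypothetical names introduced here for illustration and do not appear in the figures.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    atomic_size_t size;
    atomic_int buffer[];          /* flexible array member */
} Array;

typedef struct {
    atomic_size_t top, bottom;
    _Atomic(Array *) array;
} Deque;

/* Hypothetical helper: allocate an Array with room for `size` elements. */
static Array *array_new(size_t size) {
    assert(size > 0);             /* indices are reduced modulo size */
    Array *a = malloc(sizeof(Array) + size * sizeof(atomic_int));
    if (a == NULL) return NULL;
    atomic_init(&a->size, size);
    for (size_t i = 0; i < size; i++)
        atomic_init(&a->buffer[i], 0);
    return a;
}

/* Hypothetical helper: set up an empty deque (top == bottom == 0).
 * Returns 0 on success and -1 if the allocation fails. */
static int deque_init(Deque *q, size_t size) {
    Array *a = array_new(size);
    if (a == NULL) return -1;
    atomic_init(&q->top, 0);
    atomic_init(&q->bottom, 0);
    atomic_init(&q->array, a);
    return 0;
}
```

The buffer is a flexible array member, so the element storage is allocated together with the header in a single malloc call.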

In the code of Figure 1 the atomic_ and memory_order_ prefixes have been elided for clarity. The ARMv7 pseudo-code of Figure 2 uses the keywords R and W to denote reads and writes to shared variables, and atomic indicates a block that will be executed atomically, implemented via LL/SC instructions. The x86 version is based on prior work [10] and only requires a single mfence memory barrier in take, in place of the call to thread_fence in the C11 code.
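To compile the elided names of Figure 1 against a standard <stdatomic.h>, the prefixes can be restored with a few macros. This is only a sketch of one possible shorthand, not necessarily the one used by the authors; demo_shorthand is a hypothetical function added for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumed shorthand: restore the elided atomic_ and memory_order_ prefixes. */
#define load_explicit(p, mo)      atomic_load_explicit((p), memory_order_##mo)
#define store_explicit(p, v, mo)  atomic_store_explicit((p), (v), memory_order_##mo)
#define thread_fence(mo)          atomic_thread_fence(memory_order_##mo)
#define compare_exchange_strong_explicit(p, e, d, s, f) \
    atomic_compare_exchange_strong_explicit((p), (e), (d), \
        memory_order_##s, memory_order_##f)

/* Hypothetical demonstration: a fenced store followed by a CAS on one counter. */
static size_t demo_shorthand(void) {
    atomic_size_t top;
    atomic_init(&top, 0);
    store_explicit(&top, 4, relaxed);
    thread_fence(seq_cst);
    size_t t = load_explicit(&top, relaxed);
    /* Succeeds because top still holds t; top becomes t + 1. */
    int ok = compare_exchange_strong_explicit(&top, &t, t + 1, seq_cst, relaxed);
    assert(ok);
    (void)ok;
    return load_explicit(&top, acquire);
}
```

With these macros, a call such as load_explicit(&q->top, acquire) expands to atomic_load_explicit(&q->top, memory_order_acquire).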

2.1 Notions of correctness

The expected behavior of the work-stealing deque is intuitive: tasks pushed into the deque are then either taken in reverse order by the same thread, or stolen by another thread. We say that an implementation is correct if it satisfies four criteria, formalized and proven correct for our ARMv7 optimized code in Section 4:

1. tasks are taken in reverse order;

2. only tasks pushed are taken or stolen (well-defined reads);

3. a task pushed into a deque cannot be taken or stolen more than once (uniqueness);

4. given a finite number of push operations, all pushed values will eventually be either taken or stolen exactly once, if enough take and steal operations are attempted (existence).
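As an illustration of criteria 2 to 4, the sketch below checks a log of values returned by successful take and steal calls against the list of pushed values. It assumes that every pushed task identifier is distinct; the function names are hypothetical and the check is sequential, so it is a testing aid rather than part of the proof.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when `got` (values returned by successful take/steal calls)
 * satisfies criteria 2 and 3 with respect to `pushed`.
 * Assumption for this sketch: every pushed task id is distinct. */
static bool check_reads(const int *pushed, size_t n, const int *got, size_t m) {
    assert(n == 0 || pushed != NULL);
    assert(m == 0 || got != NULL);
    for (size_t i = 0; i < m; i++) {
        bool defined = false;
        for (size_t j = 0; j < n; j++)
            if (got[i] == pushed[j]) defined = true;
        if (!defined) return false;              /* criterion 2 violated */
        for (size_t k = 0; k < i; k++)
            if (got[k] == got[i]) return false;  /* criterion 3 violated */
    }
    return true;
}

/* Criterion 4 once the deque has been drained: every pushed value
 * was returned exactly once. */
static bool check_existence(const int *pushed, size_t n, const int *got, size_t m) {
    return m == n && check_reads(pushed, n, got, m);
}
```

Because the pushed values are assumed distinct, m == n together with criteria 2 and 3 forces a one-to-one match between pushed and returned values.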

These criteria hold because of the following assumptions and properties of the Chase–Lev algorithm:

• For any given deque, push and take operations execute on a single thread. Concurrency can only occur between one execution of push or take in the owner thread, and one or more executions of steal in different threads.

• Newly pushed tasks are made visible to take and steal by the increment to bottom in push. As we shall see in Section 4, our ARMv7 implementation enforces this by placing a sync barrier before the update of bottom, guaranteeing that the pushed element cannot be stolen before bottom is updated.

• Taken tasks are reserved first by updating bottom; again, in our ARMv7 code, the sync barrier placed after the update to bottom will ensure that it will not be concurrently stolen.

• Stolen tasks are reserved by updating top. The only situation where steal and take contend for the same task is when the deque has a single element left; this particular conflict is resolved through the CAS instructions in both take and steal. This scenario allowed Chase and Lev to make the CAS in take conditional upon the size of the deque being 1. The correctness of this optimization on a relaxed memory model depends on the presence of the two full barriers in take and steal, to ensure that at least one of the participants will have a consistent view of the size of the deque. Having just one take or steal seeing a consistent view of the size of the deque is enough: if it is take, that will force a CAS to be performed; if it is steal, the index reservation will ensure an empty return value.

• Finally, stolen tasks are protected from being concurrently stolen multiple times by the monotonic CAS update to top in steal. This CAS orders steal operations and makes them mutually exclusive. At the same time, steal operations that abort due to a failed CAS do not change the state of the deque.

2.2 Comparison of the C11 and ARM implementations

Our C11 implementation in Figure 1 is optimal in the sense that no C11 synchronization can be removed without breaking the algorithm. However, if low-level atomics are compiled using the mapping of McKenney and Silvera [9] on ARMv7/POWER or the mapping of Terekhov [14] on x86, the generated code contains more barriers than the hand-optimized native versions on both x86 and ARMv7. We show in Section 5 that this happens because of the need for seq_cst atomics to simulate ARMv7/POWER cumulative semantics. Concretely, on ARMv7, an extra dmb instruction is inserted before each CAS operation [11], compared to the native version where a relaxed CAS (coherent and atomic only) is sufficient. On x86, an mfence instruction is added between the two reads in steal. The fully sequentially consistent C11 implementation inserts many more redundant barriers [11].

3. The memory model of ARMv7

The memory model of the ARMv7 architecture follows closely that of the POWER architecture, allowing a wide range of relaxed behaviors to be observable to the programmer:

1. The hardware threads can each perform reads and writes out-of-order, or even speculatively. Basically any local reordering is allowed unless there is a data/control dependence or synchronization instruction preventing it.

2. The memory system does not guarantee that a write becomes visible to all other hardware threads at the same time point. Writes performed by one thread are propagated to (and become visible from) any other thread in an arbitrary order, unless synchronization instructions are used.

3. A dmb barrier instruction guarantees that all the writes which have been observed by the thread issuing the barrier instruction are propagated to all the other threads before the thread can continue. Observed writes include all writes previously issued by the thread itself, as well as any write propagated to it from another thread prior to the barrier. This semantics of barrier instructions is referred to as cumulative.
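The relaxed behaviors listed above are what make the classic message-passing pattern unsafe without synchronization: the payload write and the flag write may propagate in either order. The sketch below expresses the pattern in portable C11, with a release fence on the writer side and an acquire fence on the reader side; how these fences are compiled to ARMv7 barriers depends on the compiler mapping, and the names producer, consumer, data, and flag are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int data = 0;   /* payload */
static atomic_int flag = 0;   /* set to 1 once the payload is published */

/* Producer: write the payload, then publish it. The release fence keeps
 * the payload write from becoming visible after the flag write. */
static void producer(int value) {
    assert(value >= 0);       /* -1 is reserved for "not published yet" */
    atomic_store_explicit(&data, value, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

/* Consumer: returns the payload if it has been published, or -1 otherwise.
 * The acquire fence keeps the payload read from being satisfied early. */
static int consumer(void) {
    if (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        return -1;
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&data, memory_order_relaxed);
}
```

The push operation of Figure 1 has the same shape on the writer side: a relaxed store to the buffer, a release fence, then a relaxed store to bottom.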

We build on the axiomatic formalization of the POWER and ARMv7 memory model by Mador-Haim et al. [7], which has been proved equivalent to the operational semantics of Sarkar et al. [12]. A gentle introduction can be found in [8].

Axiomatic execution witnesses capture abstract memory events associated with memory-related instructions and internal transitions of the model. Unlike in stronger models such as x86, each memory access is represented at run-time by two distinct events: an issuing event (called sat for reads and ini for writes), eventually followed by a commit event when the speculative state of the instruction is resolved. Once a write instruction is committed, events that propagate it to other threads can be observed; propagation to thread A is denoted $pp_A$. All the relations that are part of an execution witness are listed in Table 1.

The core of the axiomatic model builds on the evord relation, modeling the happens-before order between events. This satisfies the fundamental property:

\[ \xrightarrow{\text{evord}} \;\supset\; \xrightarrow{\text{after}} \cup \xrightarrow{\text{before}} \cup \xrightarrow{\text{comm}} \cup \xrightarrow{\text{insn}} \cup \xrightarrow{\text{local}} \]

and must be acyclic for an execution to be consistent.
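The acyclicity requirement can be checked mechanically on a finite set of events. The sketch below is a plain depth-first cycle detection over an adjacency matrix; the EventGraph type and the bound MAX_EVENTS are assumptions made for illustration and are not part of the axiomatic model or of the authors' tooling.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_EVENTS 16

typedef struct {
    int n;                                /* number of events */
    bool edge[MAX_EVENTS][MAX_EVENTS];    /* edge[u][v]: u is ordered before v */
} EventGraph;

/* Depth-first search; state: 0 = unvisited, 1 = on the current path, 2 = done. */
static bool dfs_finds_cycle(const EventGraph *g, int u, int *state) {
    state[u] = 1;
    for (int v = 0; v < g->n; v++) {
        if (!g->edge[u][v]) continue;
        if (state[v] == 1) return true;   /* back edge closes a cycle */
        if (state[v] == 0 && dfs_finds_cycle(g, v, state)) return true;
    }
    state[u] = 2;
    return false;
}

/* Returns true when the ordering has no cycle, which is a necessary
 * condition for the execution to be consistent. */
static bool is_acyclic(const EventGraph *g) {
    int state[MAX_EVENTS] = {0};
    assert(g->n >= 0 && g->n <= MAX_EVENTS);
    for (int u = 0; u < g->n; u++)
        if (state[u] == 0 && dfs_finds_cycle(g, u, state)) return false;
    return true;
}
```

The proofs in Section 4 use the same idea by hand: an incorrect execution is ruled out by exhibiting a cycle among its events.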

We assume that the atomic sections, used to represent CAS-like behaviors, are executed atomically and obey a total order. We model them either as a single instance of a read instruction (failed CAS) or an atomic read–write pair of instruction instances (successful CAS). The atomicity of these accesses is captured by the $\xrightarrow{\text{po-atom}}$ relation. We do not assume any other property on these atomic sections (e.g., cumulativity). In practice, atomic sections can be implemented with LL/SC instructions.

We use several notation shortcuts. We refer to the deque global variables top, bottom, and array as t, b, and a. Elements of the buffer are written $x_i$, where i is the virtual index in natural numbers

int take(Deque *q) {
  size_t b = load_explicit(&q->bottom, relaxed) - 1;
  Array *a = load_explicit(&q->array, relaxed);
  store_explicit(&q->bottom, b, relaxed);
  thread_fence(seq_cst);
  size_t t = load_explicit(&q->top, relaxed);
  int x;
  if (t <= b) {
    /* Non-empty queue. */
    x = load_explicit(&a->buffer[b % a->size], relaxed);
    if (t == b) {
      /* Single last element in queue. */
      if (!compare_exchange_strong_explicit(&q->top, &t, t + 1, seq_cst, relaxed))
        /* Failed race. */
        x = EMPTY;
      store_explicit(&q->bottom, b + 1, relaxed);
    }
  } else { /* Empty queue. */
    x = EMPTY;
    store_explicit(&q->bottom, b + 1, relaxed);
  }
  return x;
}

void push(Deque *q, int x) {
  size_t b = load_explicit(&q->bottom, relaxed);
  size_t t = load_explicit(&q->top, acquire);
  Array *a = load_explicit(&q->array, relaxed);
  if (b - t > a->size - 1) { /* Full queue. */
    resize(q);
    a = load_explicit(&q->array, relaxed);
  }
  store_explicit(&a->buffer[b % a->size], x, relaxed);
  thread_fence(release);
  store_explicit(&q->bottom, b + 1, relaxed);
}

int steal(Deque *q) {
  size_t t = load_explicit(&q->top, acquire);
  thread_fence(seq_cst);
  size_t b = load_explicit(&q->bottom, acquire);
  int x = EMPTY;
  if (t < b) {
    /* Non-empty queue. */
    Array *a = load_explicit(&q->array, consume);
    x = load_explicit(&a->buffer[t % a->size], relaxed);
    if (!compare_exchange_strong_explicit(&q->top, &t, t + 1, seq_cst, relaxed))
      /* Failed race. */
      return ABORT;
  }
  return x;
}

Figure 1. C11 code of Chase–Lev deque, with low-level atomics

int take(Deque *q) {
  size_t b = R(q->bottom) - 1;              (a)
  Array *a = R(q->array);                   (b)
  W(q->bottom, b);                          (c)
  sync;
  size_t t = R(q->top);                     (d)
  int x;
  if (t <= b) {
    x = R(a->buffer[b % a->size]);          (e)
    if (t == b) {
      bool success = false;
      atomic /* Implemented with LL/SC. */
        if (success = (R(q->top) == t))     (f)
          W(q->top, t + 1);                 (g)
      if (!success) x = EMPTY;
      W(q->bottom, b + 1);                  (h)
    }
  } else {
    x = EMPTY;
    W(q->bottom, b + 1);                    (i)
  }
  return x;
}

void push(Deque *q, int x) {
  size_t b = R(q->bottom);                  (a)
  size_t t = R(q->top);                     (b)
  Array *a = R(q->array);                   (c)
  if (b - t > a->size - 1) { /* Full queue. */
    resize(q);
    a = R(q->array);                        (d)
  }
  W(a->buffer[b % a->size], x);             (e)
  sync;
  W(q->bottom, b + 1);                      (f)
}

int steal(Deque *q) {
  size_t t = R(q->top);                     (a)
  sync;
  size_t b = R(q->bottom);                  (b)
  int x = EMPTY;
  if (t < b) {
    Array *a = R(q->array);                 (c)
    x = R(a->buffer[t % a->size]);          (d)
    ctrl_isync;
    bool success = false;
    atomic /* Implemented with LL/SC. */
      if (success = (R(q->top) == t))       (e)
        W(q->top, t + 1);                   (f)
    if (!success) return ABORT;
  }
  return x;
}

Figure 2. ARMv7 pseudo-code of Chase–Lev deque

before any wrap-around is applied. Barrier instructions are omitted for brevity when implied by the presence of a $\xrightarrow{\text{sync}}$ or $\xrightarrow{\text{ctrl-isync}}$ relation. Irrelevant values in reads and writes are replaced with the placeholder “_” (e.g., $R\,x, \_$). We do not label instruction instances individually, but decorate them with a disambiguating execution prefix, identified by a dot. These prefixes not only distinguish between instruction instances, but also group related instruction instances within the same execution unit (usually an invocation of one of push, take or steal). For this, when no prefix is specified, the last prefix in left-to-right order is assumed.

4. Proof of correctness of the ARMv7 code

The proof is divided into five parts; it validates the criteria 2 to 4 enumerated in Section 2.1. Since push and take never execute concurrently and b is only ever modified in one of these functions, the proof of Criterion 1 does not involve reasoning about concurrency and we omit it here.

The proof builds on a precise analysis of all the possible execution witnesses of arbitrary invocations of the algorithm. We recall that an execution witness, as defined by the ARMv7 axiomatic model, is a graph capturing all memory events occurring during an execution (vertices), as well as the relations that link them (edges). Individual lemmas strive to narrow down the set of possible execution witnesses, based on properties of the algorithm and the architecture. To that end, we pinpoint specific subgraphs of an execution witness (hereafter, execution graphs) that cannot occur together in the same consistent execution witness. We then show that all incorrect executions, such as those containing two instances of steal reading the same value added by a single instance of push, cannot have consistent execution witnesses and, as such, cannot happen.

The proof is structured as follows. In 4.1 we provide basic technical definitions and properties of the memory model, which are used throughout the proof. In 4.2 we describe all the possible execution graphs for each of the three operations (push, take and steal), following the control flow of the ARMv7 code in Figure 2. In 4.3 we show how the succession of dynamic arrays built by resizing can be abstracted as a single sequence of unique abstract values independent of resize operations, with strong coherence and consistency properties. Corollary 2 establishes Criterion 2 (well-defined reads). In 4.4 we build on the previous abstraction to prove Theorem 1, pertaining to the uniqueness of elements taken and stolen, which corresponds to Criterion 3 (uniqueness). Finally, in 4.5, we rely on all previous results to prove Theorem 2 establishing Criterion 4 (existence): the existence of matching take or steal operations for every pushed element, under the appropriate hypotheses.

4.1 Preliminary properties

Before delving into the details of the proof itself, we introduce some support definitions and related properties.

Notation                                Meaning
$R\,l, \alpha$                          read of value α from location l (_ stands for any value)
$W\,l, \alpha$                          write of value α to location l (_ stands for any value)
sync                                    memory barrier (usually implied by $\xrightarrow{\text{sync}}$)
isync                                   instruction barrier (usually implied by $\xrightarrow{\text{ctrl-isync}}$)
sat(X)                                  satisfy (a.k.a. complete) event of a read instruction
ini(X)                                  initialize event of a write instruction
com(X)                                  commit event of an in-flight or speculative instruction
$pp_A(X)$                               propagate to thread of A event
$\xrightarrow{\text{po}}$               program order
$\xrightarrow{\text{po-atom}}$          atomic operation in program order (for CAS; see below)
$\xrightarrow{\text{po-loc}}$           same-location access in program order (defined in 4.1)
$\xrightarrow{\text{co}}$               write coherence
$\xrightarrow{\text{rf}}$               read from
$\xrightarrow{\text{r}}$                read from far (defined in 4.3)
$\xrightarrow{\text{fr}}$               from read
$\xrightarrow{\text{addr}}$             address dependence (usually implicit)
$\xrightarrow{\text{ctrl}}$             control dependence (usually implicit)
$\xrightarrow{\text{data}}$             data dependence (usually implicit)
$\xrightarrow{\text{dp}}$               observable dependence (defined in 4.1)
$\xrightarrow{\text{ctrl-isync}}$       non-cumulative local ordering barrier (see below)
$\xrightarrow{\text{sync}}$             cumulative full barrier (see below)
$\xrightarrow{\text{pp-sat}}$           write-to-read propagation (defined in 4.1)
$\xrightarrow{\text{after}}$            after barrier edge
$\xrightarrow{\text{before}}$           before barrier edge
$\xrightarrow{\text{comm}}$             communication edge
$\xrightarrow{\text{insn}}$             intra-instruction order edge
$\xrightarrow{\text{local}}$            local order edge
$\xrightarrow{\text{evord}}$            event happens-before order (usually typeset as →)

On ARMv7, $\xrightarrow{\text{sync}}$ corresponds to a dmb instruction while $\xrightarrow{\text{ctrl-isync}}$ corresponds to a dependent conditional branch followed by an isb instruction.

Table 1. Summary of relations used in the ARMv7 axiomatic model

For convenience, we define the $\xrightarrow{\text{po-loc}}$ relation, which relates local (same-thread) accesses to the same memory location; $\xrightarrow{\text{po-loc}}$ implies an instruction-level communication edge $\xrightarrow{\text{co}}$, $\xrightarrow{\text{rf}}$ or $\xrightarrow{\text{fr}}$. In particular, $\xrightarrow{\text{po-loc}}$ implies $\xrightarrow{\text{co}}$ between two writes.

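We define the dependence relation $\xrightarrow{\text{dp}}$ as follows:

\[ R\,x, \_ \xrightarrow{\text{dp}} R\,y, \_ \;\overset{\text{def}}{\iff}\; R\,x, \_ \left( \xrightarrow{\text{addr}} \cup \xrightarrow{\text{ctrl-isync}} \right) R\,y, \_ \]

\[ R\,x, \_ \xrightarrow{\text{dp}} W\,y, \_ \;\overset{\text{def}}{\iff}\; R\,x, \_ \left( \xrightarrow{\text{addr}} \cup \xrightarrow{\text{ctrl}} \cup \xrightarrow{\text{data}} \right) W\,y, \_ \]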

Lemma 1. The following properties involving $\xrightarrow{\text{dp}}$ apply:

\[ R\,x, \_ \xrightarrow{\text{dp}} R\,y, \_ \implies sat(R\,x, \_) \rightarrow sat(R\,y, \_) \]

\[ R\,x, \_ \xrightarrow{\text{dp}} W\,y, \_ \implies sat(R\,x, \_) \rightarrow com(W\,y, \_) \]

Proof. In the case of an address or control dependence, the result is an immediate consequence of the definition of intra-instruction and local orders. It remains to be shown that the result holds for $\xrightarrow{\text{ctrl-isync}}$: a dependent conditional branch instruction, ctrl, followed by an isync barrier. Suppose $R\,x, \_ \xrightarrow{\text{ctrl-isync}} R\,y, \_$. Then we have:

\[ sat(R\,x, \_) \xrightarrow{\text{insn}} com(R\,x, \_) \xrightarrow{\text{local}} com(ctrl) \xrightarrow{\text{local}} com(isync) \xrightarrow{\text{local}} sat(R\,y, \_) \]

We define the relation $\xrightarrow{\text{pp-sat}}$ between instruction instances, $A.W\,x, \_ \xrightarrow{\text{pp-sat}} B.R\,y, \_$, as follows:¹

\[ A.W\,x, \_ \xrightarrow{\text{pp-sat}} B.R\,y, \_ \iff \begin{cases} W\,x, \_ \xrightarrow{\text{po}} R\,y, \_ & \text{if } A \sim B \\ pp_B(W\,x, \_) \rightarrow sat(R\,y, \_) & \text{if } A \not\sim B \end{cases} \]

where $A \sim B$ means that instruction instances grouped under prefixes A and B belong to the same thread.

Intuitively, $\xrightarrow{\text{pp-sat}}$ represents a “known-to” relation in the following sense: $A.W\,x, \_ \xrightarrow{\text{pp-sat}} B.R\,y, \_$ means that, at the time of reading y, that specific write to x (as well as any write that is coherence-before it) is known to the thread executing B. It is clear that $\xrightarrow{\text{rf}}$ implies $\xrightarrow{\text{pp-sat}}$, by definition of communication edges (if threads are different) or uniprocessor constraints (if same thread).

¹ Note that $\xrightarrow{\text{pp-sat}}$ does not imply an event happens-before order on the events making up the related instruction instances.

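Lemma 2. The following properties involve $\xrightarrow{\text{pp-sat}}$ and $\xrightarrow{\text{po-loc}}$:

(i) $A.W\,x, \_ \xrightarrow{\text{rf}} B.R\,x, \_ \xrightarrow{\text{po-loc}} B'.R\,x, \_ \implies A.W\,x, \_ \xrightarrow{\text{pp-sat}} B'.R\,x, \_$

(ii) $A.W\,x, \_ \xrightarrow{\text{co}} B.W\,x, \_ \xrightarrow{\text{pp-sat}} C.R\,x, \_ \implies A.W\,x, \_ \not\xrightarrow{\text{rf}} C.R\,x, \_$

(iii) $W\,x, \_ \xrightarrow{\text{pp-sat}} R\,y, \_ \xrightarrow{\text{dp}} R\,z, \_ \implies W\,x, \_ \xrightarrow{\text{pp-sat}} R\,z, \_$

(iv) $\neg \left( A.W\,x, \_ \xrightarrow{\text{pp-sat}} B.R\,y', \_ \xrightarrow{\text{dp}} B.W\,x', \_ \xrightarrow{\text{pp-sat}} A.R\,y, \_ \xrightarrow{\text{dp}} A.W\,x, \_ \right)$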

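Proof. We prove each point separately:

(i) If the write and the reads happen in the same thread, then all instruction instances belong to that thread, and program order prevails. Otherwise, either $A.W\,x, \_ \xrightarrow{\text{rf}} B'.R\,x, \_$ and the result is immediate, or $A.W\,x, \_ \not\xrightarrow{\text{rf}} B'.R\,x, \_$ and $B.R\,x, \_ \xrightarrow{\text{po-loc}} B'.R\,x, \_$ implies the following: $com(B.R\,x, \_) \xrightarrow{\text{local}} sat(B'.R\,x, \_)$, by definition of $\xrightarrow{\text{local}}$. Hence:

\[ pp_B(A.W\,x, \_) \rightarrow sat(B.R\,x, \_) \xrightarrow{\text{insn}} com(R\,x, \_) \xrightarrow{\text{local}} sat(B'.R\,x, \_) \]

(ii) Suppose, for contradiction, that $A.W\,x, \_ \xrightarrow{\text{rf}} C.R\,x, \_$. Then $C.R\,x, \_ \xrightarrow{\text{fr}} B.W\,x, \_$, and we have the following cycle in the event happens-before order:

\[ sat(C.R\,x, \_) \xrightarrow{\text{comm}} pp_C(B.W\,x, \_) \rightarrow sat(C.R\,x, \_) \]

(iii) Follows from Lemma 1.

(iv) Assume that:

\[ A.W\,x, \_ \xrightarrow{\text{pp-sat}} B.R\,y', \_ \xrightarrow{\text{dp}} B.W\,x', \_ \xrightarrow{\text{pp-sat}} A.R\,y, \_ \xrightarrow{\text{dp}} A.W\,x, \_ \]

If $A \sim B$ then there is a cycle in $\xrightarrow{\text{po}}$. Otherwise, by Lemma 1, we have a cycle in the event happens-before order:

\[ pp_B(W\,x, \_) \rightarrow sat(R\,y', \_) \rightarrow com(W\,x', \_) \xrightarrow{\text{insn}} pp_A(W\,x', \_) \rightarrow sat(R\,y, \_) \rightarrow com(W\,x, \_) \xrightarrow{\text{insn}} pp_B(W\,x, \_) \]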

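Lemma 3. The following properties involving barriers apply:

(i) $\left( W\,x, \_ \xrightarrow{\text{sync}} W\,y, \_ \xrightarrow{\text{pp-sat}} R\,z, \_ \;\lor\; W\,x, \_ \xrightarrow{\text{pp-sat}} R\,y, \_ \xrightarrow{\text{sync}} R\,z, \_ \right) \implies W\,x, \_ \xrightarrow{\text{pp-sat}} R\,z, \_$

(ii) $A.W\,x, \_ \xrightarrow{\text{rf}} B.R\,x, \_ \xrightarrow{\text{sync}} B.W\,y, \_ \xrightarrow{\text{pp-sat}} C.R\,x, \_ \implies A.W\,x, \_ \xrightarrow{\text{pp-sat}} C.R\,x, \_$

(iii) Let X stand for $A.W\,x, \_ \xrightarrow{\text{rf}} B.R\,x, \_$ or $(A \sim B).W\,x, \_$ and Y stand for $C.W\,y, \_ \xrightarrow{\text{rf}} D.R\,y, \_$ or $(C \sim D).W\,y, \_$; then the following holds:

\[ \neg \left( X \xrightarrow{\text{sync}} B.R\,y, \_ \xrightarrow{\text{fr}} C.W\,y, \_ \;\land\; Y \xrightarrow{\text{sync}} D.R\,x, \_ \xrightarrow{\text{fr}} A.W\,x, \_ \right) \]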

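Proof. (i) If $W\,x, \_$ and $R\,z, \_$ occur in the same thread, then all instruction instances belong to that thread and program order prevails. Otherwise, suppose $R\,z, \_$ executes in A; we have two cases:

\[ pp_A(W\,x, \_) \xrightarrow{\text{before}} pp_A(sync) \xrightarrow{\text{before}} pp_A(W\,y, \_) \rightarrow sat(R\,z, \_) \]

Or the other way around:

\[ pp_A(W\,x, \_) \rightarrow sat(R\,y, \_) \xrightarrow{\text{insn}} com(R\,y, \_) \xrightarrow{\text{local}} com(sync) \xrightarrow{\text{local}} sat(R\,z, \_) \]

In both cases, $pp_A(W\,x, \_) \rightarrow sat(R\,z, \_)$.

(ii) Suppose $A \sim C$. If $A \sim B$, then program order prevails: all the instruction instances belong to the same thread. If not, suppose $C.R\,x, \_ \xrightarrow{\text{po}} A.W\,x, \_$; then the event happens-before order contains the following cycle:

\[ pp_B(A.W\,x, \_) \xrightarrow{\text{comm}} sat(B.R\,x, \_) \xrightarrow{\text{insn}} com(R\,x, \_) \xrightarrow{\text{local}} com(sync) \xrightarrow{\text{local}} com(B.W\,y, \_) \xrightarrow{\text{insn}} pp_C(W\,y, \_) \rightarrow sat(C.R\,x, \_) \xrightarrow{\text{insn}} com(R\,x, \_) \xrightarrow{\text{local}} com(A.W\,x, \_) \xrightarrow{\text{insn}} pp_B(W\,x, \_) \]

Otherwise, suppose $A \not\sim C$. If $A \sim B$, then $A.W\,x, \_ \xrightarrow{\text{sync}} B.W\,y, \_$ and we have the result from (i). If not, we have:

\[ pp_B(A.W\,x, \_) \xrightarrow{\text{comm}} sat(B.R\,x, \_) \xrightarrow{\text{insn}} com(R\,x, \_) \xrightarrow{\text{local}} com(sync) \]

Thus, we have $pp_C(A.W\,x, \_) \xrightarrow{\text{before}} pp_C(sync) \xrightarrow{\text{before}} pp_C(B.W\,y, \_) \rightarrow sat(C.R\,x, \_)$.

(iii) Suppose the contrary. If $B \sim D$, then $\xrightarrow{\text{rf}}$ and $\xrightarrow{\text{fr}}$ form a path that goes against $\xrightarrow{\text{po}}$: the graph is invalid according to uniprocessor constraints. Otherwise, $B \not\sim D$ and the following holds (omitting intermediate steps in elaborating $\xrightarrow{\text{before}}$ for conciseness):

• $com(B.sync) \xrightarrow{\text{local}} com(C.W\,y, \_) \xrightarrow{\text{local}} com(D.sync) \xrightarrow{\text{insn}} pp_B(sync)$ if $B \sim C$.

• $com(B.sync) \xrightarrow{\text{local}} sat(R\,y, \_) \xrightarrow{\text{comm}} pp_B(C.W\,y, \_) \xrightarrow{\text{before}} pp_B(D.sync)$ otherwise.

Either way, $com(B.sync) \rightarrow pp_B(D.sync)$. By definition, we have an after edge between the two barriers: $pp_D(B.sync) \xrightarrow{\text{after}} com(D.sync)$.

Moreover, either $A \sim D$ or $A \not\sim D$:

• $pp_D(B.sync) \xrightarrow{\text{after}} com(D.sync) \xrightarrow{\text{local}} com(A.W\,x, \_) \xrightarrow{\text{insn}} pp_B(W\,x, \_)$ if $A \sim D$.

• $pp_D(B.sync) \xrightarrow{\text{after}} com(D.sync) \xrightarrow{\text{local}} sat(R\,x, \_) \xrightarrow{\text{comm}} pp_D(A.W\,x, \_)$ otherwise.

Thus, in all cases, we have a cycle:

\[ com(B.sync) \xrightarrow{\text{before}} pp_B(A.W\,x, \_) \xrightarrow{\text{comm}} sat(B.R\,x, \_) \xrightarrow{\text{insn}} com(R\,x, \_) \xrightarrow{\text{local}} com(B.sync) \]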

4.2 Execution paths

We consider the three operations of the work-stealing algorithm: take, push and steal. Each of them exhibits different execution paths depending on control flow. Data and address dependences are implicit in the notations and are omitted for brevity. Control dependences are implied by the guard conditions in each case and are also omitted, but we make explicit the constraints on the b and t variables carrying the control dependence. Greek letters β, τ, ξ denote the memory values of b, t, and some $x_i$, respectively. Reads and writes are annotated with the corresponding line from Figure 2.

For take and steal, we say that an instance of the operation is successful if it returns one element; otherwise (including if it returns empty) it is considered failed.

4.2.1 Take

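Two failure cases return no element (empty), and two success cases return one element from the deque. All four paths start with:

\[ \text{(a)}\ R\,b, \beta \xrightarrow{\text{po}} \text{(b)}\ R\,a, \&x \xrightarrow{\text{po}} \text{(c)}\ W\,b, \beta - 1 \xrightarrow{\text{sync}} \text{(d)}\ R\,t, \tau \]

Specific continuations for each path are listed below.

Return empty without CAS, $\beta - \tau \le 0$:

\[ \cdots \xrightarrow{\text{po}} \text{(i)}\ W\,b, \beta \]

Return empty with (failed) CAS, $\beta - \tau = 1$, $\tau \ne \tau'$:

\[ \cdots \xrightarrow{\text{po}} \text{(e)}\ R\,x_{\beta-1}, \xi \xrightarrow{\text{po}} \text{(f)}\ R\,t, \tau' \xrightarrow{\text{po}} \text{(h)}\ W\,b, \tau + 1 \]

Return one without CAS, $\beta - \tau > 1$:

\[ \cdots \xrightarrow{\text{po}} \text{(e)}\ R\,x_{\beta-1}, \xi \]

Return one with (successful) CAS, $\beta - \tau = 1$:

\[ \cdots \xrightarrow{\text{po}} \text{(e)}\ R\,x_{\beta-1}, \xi \xrightarrow{\text{po}} \text{(f)}\ R\,t, \tau \xrightarrow{\text{po-atom}} \text{(g)}\ W\,t, \tau + 1 \xrightarrow{\text{po}} \text{(h)}\ W\,b, \beta \]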
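The four take paths are selected purely by the difference β − τ and, when that difference is 1, by the outcome of the CAS. The following sketch makes the case split explicit; it uses signed integers for β and τ as a simplifying assumption, and the names TakePath, classify_take, and cas_succeeds are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    TAKE_EMPTY_NO_CAS,      /* beta - tau <= 0 */
    TAKE_EMPTY_FAILED_CAS,  /* beta - tau == 1, CAS lost to a thief */
    TAKE_ONE_NO_CAS,        /* beta - tau > 1 */
    TAKE_ONE_CAS            /* beta - tau == 1, CAS won */
} TakePath;

/* beta is the value read at (a), tau the value read at (d);
 * cas_succeeds only matters when exactly one element is left. */
static TakePath classify_take(long beta, long tau, bool cas_succeeds) {
    long diff = beta - tau;
    if (diff <= 0) return TAKE_EMPTY_NO_CAS;
    if (diff > 1) return TAKE_ONE_NO_CAS;
    return cas_succeeds ? TAKE_ONE_CAS : TAKE_EMPTY_FAILED_CAS;
}

/* A few illustrative cases, written as assertions. */
static void classify_take_examples(void) {
    assert(classify_take(0, 0, false) == TAKE_EMPTY_NO_CAS);
    assert(classify_take(5, 2, false) == TAKE_ONE_NO_CAS);
    assert(classify_take(3, 2, true) == TAKE_ONE_CAS);
}
```

This matches the guards of Figure 2: with b = β − 1, the test t <= b is β − τ ≥ 1 and the test t == b is β − τ = 1.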

4.2.2 Push

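There are two paths: a straight case, and a resizing case which grows the underlying circular buffer.

Straight, $\beta - \tau < size(x) - 1$:

\[ \text{(a)}\ R\,b, \beta \xrightarrow{\text{po}} \text{(b)}\ R\,t, \tau \xrightarrow{\text{po}} \text{(c)}\ R\,a, \&x \xrightarrow{\text{po}} \text{(e)}\ W\,x_\beta, \xi \xrightarrow{\text{sync}} \text{(f)}\ W\,b, \beta + 1 \]

Resizing, $\beta - \tau \ge size(x) - 1$, where $x'$ refers to the new array:

\[ \text{(a)}\ R\,b, \beta \xrightarrow{\text{po}} \text{(b)}\ R\,t, \tau \xrightarrow{\text{po}} \text{(c)}\ R\,a, \&x \xrightarrow{\text{po}} resize \xrightarrow{\text{sync}} \text{(d)}\ R\,a, \&x' \xrightarrow{\text{po}} \text{(e)}\ W\,x'_\beta, \xi \xrightarrow{\text{sync}} \text{(f)}\ W\,b, \beta + 1 \]

where

\[ resize = R\,x_\tau, \xi_\tau \xrightarrow{\text{po}} W\,x'_\tau, \xi_\tau \xrightarrow{\text{po}} \cdots \xrightarrow{\text{po}} R\,x_{\beta-1}, \xi_{\beta-1} \xrightarrow{\text{po}} W\,x'_{\beta-1}, \xi_{\beta-1} \xrightarrow{\text{sync}} W\,a, \&x' \]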
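The resize sub-path above (copy each live element, then a barrier, then publish the new array) is not given as code in the figures. A minimal C11 sketch of it could look as follows, assuming that the capacity is doubled and that a release fence stands in for the sync before the write to a; array_new is a hypothetical allocator and the whole block is an illustration rather than the authors' implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    atomic_size_t size;
    atomic_int buffer[];
} Array;

typedef struct {
    atomic_size_t top, bottom;
    _Atomic(Array *) array;
} Deque;

/* Hypothetical allocator, not part of the original figures. */
static Array *array_new(size_t size) {
    assert(size > 0);
    Array *a = malloc(sizeof(Array) + size * sizeof(atomic_int));
    if (a == NULL) abort();
    atomic_init(&a->size, size);
    for (size_t i = 0; i < size; i++)
        atomic_init(&a->buffer[i], 0);
    return a;
}

/* Assumed growth policy: double the capacity. Called by the owner only. */
static void resize(Deque *q) {
    size_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->top, memory_order_relaxed);
    Array *old = atomic_load_explicit(&q->array, memory_order_relaxed);
    size_t old_size = atomic_load_explicit(&old->size, memory_order_relaxed);
    Array *grown = array_new(2 * old_size);

    /* Copy virtual indices t .. b-1: a read of x_i followed by a write of x'_i. */
    for (size_t i = t; i != b; i++) {
        int v = atomic_load_explicit(&old->buffer[i % old_size],
                                     memory_order_relaxed);
        atomic_store_explicit(&grown->buffer[i % (2 * old_size)], v,
                              memory_order_relaxed);
    }

    /* Stands in for the sync that precedes the write of the array pointer. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&q->array, grown, memory_order_relaxed);
    /* The old array is deliberately not freed here: thieves may still read it. */
}
```

Elements keep their virtual index across the copy, so only the slot (index modulo size) changes; reclaiming the old array safely is a separate problem that this sketch leaves open.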

4.2.3 Steal

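There are three paths: two failure cases and one success case. Failure returns no element and success returns a stolen element.

Return empty without CAS, $\beta - \tau \le 0$:

\[ \text{(a)}\ R\,t, \tau \xrightarrow{\text{sync}} \text{(b)}\ R\,b, \beta \]

Return empty with (failed) CAS, $\beta - \tau > 0 \land \tau \ne \tau'$:

\[ \text{(a)}\ R\,t, \tau \xrightarrow{\text{sync}} \text{(b)}\ R\,b, \beta \xrightarrow{\text{ctrl-isync}} \text{(c)}\ R\,a, \&x \xrightarrow{\text{po}} \text{(d)}\ R\,x_\tau, \xi \xrightarrow{\text{ctrl-isync}} \text{(e)}\ R\,t, \tau' \]

Return one with (successful) CAS, $\beta - \tau > 0$:

\[ \text{(a)}\ R\,t, \tau \xrightarrow{\text{sync}} \text{(b)}\ R\,b, \beta \xrightarrow{\text{ctrl-isync}} \text{(c)}\ R\,a, \&x \xrightarrow{\text{po}} \text{(d)}\ R\,x_\tau, \xi \xrightarrow{\text{ctrl-isync}} \text{(e)}\ R\,t, \tau \xrightarrow{\text{po-atom}} \text{(f)}\ W\,t, \tau + 1 \]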

4.3 Significant reads and writes

We define the sequence $(\beta_n)$ of values taken by the variable b over the course of the program, according to the write coherence relation. Initially $\beta_0 = 0$. Since all push and take operations occur in a single thread, and steal operations never alter the value of b, the elements of $(\beta_n)$ correspond to writes to b in program order within the push and take operations. Similarly, we define the sequence $(\tau_m)$ of values taken by the variable t. We assume $\tau_0 = 0$. Furthermore, since all writes to t are from CAS instructions, which are sequentially ordered, and all such CAS instructions increment t by one, $(\tau_m)$ is monotonically increasing, and s.t. $\tau_m = m$.
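The property $\tau_m = m$ follows from the shape of the CAS alone: a reservation either moves top from its expected value to the next one, or leaves it untouched. The sketch below illustrates this with C11 atomics in a single thread; reserve_top and tau_after are hypothetical names, and the single-threaded loop only demonstrates the arithmetic, not the concurrent behavior.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Try to reserve index `expected` by moving top from expected to expected + 1.
 * On failure, top is left unchanged. */
static bool reserve_top(atomic_size_t *top, size_t expected) {
    return atomic_compare_exchange_strong_explicit(
        top, &expected, expected + 1,
        memory_order_seq_cst, memory_order_relaxed);
}

/* Applies m successful reservations starting from tau_0 = 0, interleaved
 * with reservations that use a wrong expectation, and returns top. */
static size_t tau_after(size_t m) {
    atomic_size_t top;
    atomic_init(&top, 0);
    for (size_t k = 0; k < m; k++) {
        bool wrong = reserve_top(&top, k + 1);  /* top is k: must fail */
        bool right = reserve_top(&top, k);      /* matches top: must succeed */
        assert(!wrong && right);
        (void)wrong;
        (void)right;
    }
    return atomic_load_explicit(&top, memory_order_relaxed);
}
```

Failed attempts do not contribute to the sequence, which is why the m-th write to t always carries the value m.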

For each index i, we define the sequence $(\xi_i^v)_{v \in \mathbb{N}}$ of successive values given to the element at index i in the deque by the last write $W\,x_i, \_$ of a push operation, regardless of the address &x of the underlying array. Only the last such write is called significant as it induces a new value in a $(\xi_i^v)$ sequence, while writes due to resizing do not. For all i, $\xi_i^0$, the value before the first significant write to location $x_i$, is undefined: $\xi_i^0 = \bot$. Similarly, a read is significant if it occurs in a successful instance of take or steal.

Lemma 4. For all i, $(\xi_i^v)$ is globally coherent.

Proof. Consider two significant writes $W\,x_i, \_$ and $W\,x'_i, \_$ at index i (regardless of the address of the underlying array). If $W\,x_i, \_$ and $W\,x'_i, \_$ both write to the same memory location, then they are ordered by write coherence. If they do not, then there must be a resize operation after the first write and before the second (all writes happen in the same thread). Because of the cumulative barrier after a resize operation, threads that see the second value must have seen the first beforehand. Hence, there is a global coherence order on the writes, which corresponds to the order of push operations.

We define the relation read from far as follows: for some memory locations $m_0, \ldots, m_n$ and some value v, $W\,m_0, v \xrightarrow{\text{r}} R\,m_n, v$ if $W\,m_0, v \xrightarrow{\text{rf}} R\,m_n, v$ or there exists a sequence of copies carrying the value of the write to the read:

\[
W\,m_0, v \xrightarrow{\text{rf}} R\,m_0, v \xrightarrow{\text{data}} W\,m_1, v \xrightarrow{\text{rf}} \cdots \xrightarrow{\text{data}} W\,m_n, v \xrightarrow{\text{rf}} R\,m_n, v.
\]
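For instance, in the simplest non-trivial case $n = 1$, a value written to a slot of the old array, copied once by a resize into the new array, and finally read from there would form the chain below. The names $x_i$ and $x'_i$ are chosen here for illustration, with $m_0 = x_i$ and $m_1 = x'_i$:

\[
W\,x_i, v \xrightarrow{\text{rf}} R\,x_i, v \xrightarrow{\text{data}} W\,x'_i, v \xrightarrow{\text{rf}} R\,x'_i, v
\]

so that $W\,x_i, v \xrightarrow{\text{r}} R\,x'_i, v$ holds even though no single rf edge links the original write to the final read.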

For conciseness, we hereafter omit the variable name from reads and writes whenever the variable can be inferred from the value: e.g., $W\beta_n$ stands for $W\,b, \beta_n$. Let $W\xi_i^v$ denote the $v$-th significant write at index i, and $R\xi_i^v$ a significant read s.t. $W\xi_i^v \xrightarrow{\text{r}} R\xi_i^v$.

Lemma 5. Given a write $W\,x_i, \_$ and a read $R\,x'_j, \_$,

\[
i \ne j \implies W\,x_i, \_ \not\xrightarrow{\text{rf}} R\,x'_j, \_
\]

Proof. If the addresses of the underlying arrays differ, then the memory locations read and written are distinct and there can be no read from relation.

Otherwise, since old arrays are never reused, the addresses are the same and $i \equiv j \pmod{\mathit{size}(x)}$. $R\,x'_j, \_$ belongs to a successful instance of take, push (with resizing), or steal. Let X be that instance.

Let P be the instance of push to which $W\,x_i, \_$ belongs. In P, we have the following execution graph:

\[
P.R\,t, \tau_P \xrightarrow{\text{ctrl}} W\,x_i, \_ \xrightarrow{\text{sync}} W\,b, \beta_P + 1
\quad \text{where } \tau_P \le i \le \beta_P \text{ and } \beta_P - \tau_P < \mathit{size}(x) - 1
\]

Let us assume $i \ne j \wedge W\,x_i, \_ \xrightarrow{\text{rf}} R\,x'_j, \_$ and show it is indeed impossible.

Assume X is a successful instance of take or push. Since X and P belong to the same thread, P must occur before X in program order (the order of loads and stores to the same location is preserved: $P.W\,x_i, \_ \xrightarrow{\text{po-loc}} X.R\,x'_j, \_$).

If $j < i$, then $j \le i - \mathit{size}(x)$. However, the following must hold in P:

\[
\tau_P \le i \le \beta_P \wedge \beta_P - \tau_P < \mathit{size}(x) - 1
\]

hence

\[
j < i - \mathit{size}(x) + 1 \le \beta_P - \mathit{size}(x) + 1 < \tau_P
\]

Furthermore, if X is a take operation, $R\,x'_j, \_$ reads the last element of the deque, and $j = \beta_X - 1 \ge \tau_X$; if X is a push operation, $R\,x'_j, \_$ results from a copy operation of the resizing code, hence $j \ge \tau_X$. Since X occurs after P in program order and t is monotonically increasing, $P.R\,t, \tau_P \xrightarrow{\text{po-loc}} X.R\,t, \tau_X$ and $j < \tau_P \le \tau_X \le j$. Impossible.
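The step from j < i to j ≤ i − size(x) used above rests on circular indexing: two distinct logical indices can share a physical slot only if they differ by a multiple of the array size. The following C sketch checks this exhaustively on a small range; slot_of and alias_gap_holds are hypothetical helpers introduced for this example, and the modulo mapping is assumed to be the one used by the deque.

```c
#include <assert.h>
#include <stddef.h>

/* Physical slot of logical index i in a circular array of `size` slots
 * (assumed mapping: i modulo size). */
size_t slot_of(size_t i, size_t size) {
    assert(size > 0);
    return i % size;
}

/* Checks, for all logical indices j < i < limit, that sharing a slot
 * implies j <= i - size. Returns 1 if the property holds for every pair. */
int alias_gap_holds(size_t size, size_t limit) {
    for (size_t i = 0; i < limit; i++)
        for (size_t j = 0; j < i; j++)
            if (slot_of(i, size) == slot_of(j, size) && j + size > i)
                return 0;   /* aliasing pair closer than `size` apart */
    return 1;
}
```

For example, with eight slots, logical indices 2 and 10 alias the same slot, and they are exactly one array size apart.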

If $i < j$, then, since $j \ge \beta_X$, b must increase from $\beta_P + 1$ to $j + 1$ between the write in P and the read in X. Hence, there must be an instance P′ of push between P and X (in program order) that increments b to $j + 1$.

Indeed, the only writes that increase the value of b occur in push and take; and the effect of take as a whole never increases the value of b since it first decrements the variable. We have:

\[
P.W\,x_i, \_ \xrightarrow{\text{po-loc}} P'.W\,x_j, \_ \xrightarrow{\text{po-loc}} X.R\,x'_j, \_
\]

hence

\[
P.W\,x_i, \_ \xrightarrow{\text{co}} P'.W\,x_j, \_ \xrightarrow{\text{pp-sat}} X.R\,x'_j, \_
\]

Thus, from Lemma 2 (ii), $P.W\,x_i, \_ \not\xrightarrow{\text{rf}} X.R\,x'_j, \_$.

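Now, assume X is a successful instance of steal. We have the following execution graph for X:

\[
X.R\,t, \tau_X = j \xrightarrow{\text{sync}} R\,b, \beta_X \xrightarrow{\text{ctrl-isync}} R\,a, \&x' \xrightarrow{\text{po}} R\,x'_j, \_ \xrightarrow{\text{ctrl-isync}} R\,t, \tau_X \xrightarrow{\text{po-atom}} W\,t, \tau_X + 1
\]

If $j < i$, then $j \le i - \mathit{size}(x)$. However, the following must hold in P:

\[
j < i - \mathit{size}(x) + 1 \le \beta_P - \mathit{size}(x) + 1 < \tau_P
\]

Hence $\tau_X = j < \tau_P$. Since t increases monotonically, it must be that:

\[
\begin{aligned}
X.R\,x'_j, \_ &\xrightarrow{\text{ctrl-isync}} R\,t, \tau_X \xrightarrow{\text{po-atom}} W\,t, \tau_X + 1 \\
&\xrightarrow{\text{rf}} R\,t, \_ \xrightarrow{\text{sync}} W\,t, \_ \xrightarrow{\text{rf}} \cdots \xrightarrow{\text{sync}} W\,t, \tau_P \xrightarrow{\text{rf}} P.R\,t, \tau_P \xrightarrow{\text{ctrl}} W\,x_i, \_
\end{aligned}
\]

Hence $X.R\,x'_j, \_$ must be committed before $W\,t, \tau_X + 1$. Since $W\,t, \tau_X + 1$ is (cumulatively) propagated to $W\,x_i, \_$, $X.R\,x'_j, \_$ must be committed before $W\,x_i, \_$. Formally: it follows from Lemma 3 (ii) that $W\,t, \tau_X + 1 \xrightarrow{\text{pp-sat}} P.R\,t, \tau_P$. If $W\,x_i, \_ \xrightarrow{\text{rf}} R\,x'_j, \_$ then $W\,x_i, \_ \xrightarrow{\text{pp-sat}} R\,x'_j, \_$. We get:

\[
\begin{aligned}
& X.W\,t, \tau_X + 1 \xrightarrow{\text{pp-sat}} P.R\,t, \tau_P \xrightarrow{\text{ctrl}} W\,x_i, \_ \\
\wedge\ & P.W\,x_i, \_ \xrightarrow{\text{pp-sat}} X.R\,x'_j, \_ \xrightarrow{\text{ctrl-isync}} W\,t, \tau_X + 1
\end{aligned}
\]

Lemma 2 (iv) tells us that this is impossible. Thus $P.W\,x_i, \_ \not\xrightarrow{\text{rf}} X.R\,x'_j, \_$.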

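If $i < j$, then $i \le j - \mathit{size}(x)$, and there must be an instance P′ of push s.t. $P'.W\,b, j + 1 \xrightarrow{\text{po-loc}} W\,b, \beta_X \xrightarrow{\text{rf}} X.R\,b, \beta_X$ (so that index j is accessible in X). P′ cannot occur before P in program order because, as above, we would have $\tau_{P'} \le \tau_P \le i$ on the one hand, and $i \le j - \mathit{size}(x) < \tau_{P'}$ on the other hand. The underlying array also monotonically increases in size, so the inequality still holds if the array sizes in P and P′ differ.

Hence P′ occurs after P. Furthermore $W\,x''_j, \_ \in P'$. If x in P and x″ in P′ refer to different arrays, then a resize operation R must precede P′, s.t.

\[
\begin{aligned}
W\,a, \&x &\xrightarrow{\text{po-loc}} P.R\,a, \&x \xrightarrow{\text{po-loc}} R.W\,a, \&x'' \xrightarrow{\text{sync}} P'.W\,x''_j, \_ \xrightarrow{\text{sync}} W\,b, j + 1 \\
&\xrightarrow{\text{po-loc}} W\,b, \beta_X \xrightarrow{\text{rf}} X.R\,b, \beta_X \xrightarrow{\text{ctrl-isync}} R\,a, \&x' \xrightarrow{\text{addr}} R\,x'_j, \_
\end{aligned}
\]

hence

\[
W\,a, \&x \xrightarrow{\text{co}} R.W\,a, \&x'' \xrightarrow{\text{sync}} W\,b, \beta_X \xrightarrow{\text{pp-sat}} X.R\,b, \beta_X
\]

From Lemma 2 (iii), $W\,b, \beta_X \xrightarrow{\text{pp-sat}} X.R\,a, \&x'$; Lemma 2 (ii) concludes that $W\,a, \&x \not\xrightarrow{\text{rf}} X.R\,a, \&x'$. Since all resize operations allocate new arrays, $\&x' \ne \&x$, which contradicts our premises. Otherwise, x and x″ refer to the same array, hence $W\,x_i, \_ \xrightarrow{\text{po-loc}} W\,x''_j, \_$, and we get:

\[
\begin{aligned}
P.W\,x_i, \_ &\xrightarrow{\text{po-loc}} P'.W\,x''_j, \_ \xrightarrow{\text{sync}} W\,b, j + 1 \xrightarrow{\text{po-loc}} W\,b, \beta_X \\
&\xrightarrow{\text{rf}} X.R\,b, \beta_X \xrightarrow{\text{ctrl-isync}} R\,x'_j, \_
\end{aligned}
\]

It follows from Lemmas 3 (i) and 2 (iii) that:

\[
P.W\,x_i, \_ \xrightarrow{\text{co}} W\,x''_j, \_ \xrightarrow{\text{pp-sat}} R\,x'_j, \_
\]

Hence, from Lemma 2 (ii), $W\,x_i, \_ \not\xrightarrow{\text{rf}} R\,x'_j, \_$.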

Corollary 1. Given a significant write $W\xi_i^v$ and a significant read $R\,x'_j, \_$: $i \ne j \implies W\xi_i^v \not\xrightarrow{\text{r}} R\,x'_j, \_$.

Proof. If $i \ne j$, we know that $W\xi_i^v \not\xrightarrow{\text{rf}} R\,x'_j, \_$. Furthermore, all copies, which happen during a resize operation, copy from and to the same index. Since there are fewer copies than the size of the expanded array, there can be no two copies writing to the same memory location in the new array. Hence, there can be no sequence of copies from $W\xi_i^v$ to $R\,x'_j, \_$.
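The claim that copies preserve the index can be illustrated with a small sequential C sketch. It assumes a hypothetical Array type with circular (modulo) indexing and a resize that doubles the capacity; the doubling factor is an assumption made for this example, since the argument only needs the new array to be larger than the number of copied elements. All names (Array, array_new, array_grow, and so on) are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    size_t size;   /* number of slots */
    int *buf;      /* circular storage, slot = logical index % size */
} Array;

Array *array_new(size_t size) {
    Array *a = malloc(sizeof *a);
    assert(a != NULL);
    a->size = size;
    a->buf = calloc(size, sizeof *a->buf);   /* unused slots start at 0 */
    assert(a->buf != NULL);
    return a;
}

void array_free(Array *a) {
    free(a->buf);
    free(a);
}

int array_get(const Array *a, size_t i) { return a->buf[i % a->size]; }

void array_put(Array *a, size_t i, int v) { a->buf[i % a->size] = v; }

/* Copies logical indices t .. b-1 into a fresh array twice as large.
 * Each element keeps its logical index; the old array is left untouched. */
Array *array_grow(const Array *old, size_t t, size_t b) {
    assert(b - t <= old->size);               /* at most old->size copies */
    Array *fresh = array_new(2 * old->size);
    for (size_t i = t; i < b; i++)
        array_put(fresh, i, array_get(old, i));
    return fresh;
}
```

Because at most old->size elements are copied into 2 * old->size slots, and consecutive logical indices map to consecutive slots modulo the new size, no two copies land in the same slot.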

Lemma 6. Given a significant write $W\xi_i^u$ and a significant read $R\xi_i^v$:

(i) $W\xi_i^u \xrightarrow{\text{pp-sat}} R\,a, \&x \xrightarrow{\text{addr}} R\,x_i, \xi_i^v \implies u \le v$

(ii) $0 < u \le v \implies W\xi_i^u \xrightarrow{\text{pp-sat}} R\,x_i, \xi_i^v$

(i) Suppose $v < u$. We define $W'.W\,x_i, \xi_i^v$ as follows.

If $v = 0$, $\xi_i^v$ is an undefined value; let $W'.W\,x_i, \xi_i^0 \xrightarrow{\text{rf}} R\,x_i, \xi_i^v$ be the initialization of $x_i$. $W'.W\,x_i, \xi_i^0$ comes before $W\xi_i^u$ in program order.

Otherwise, $0 < v < u$. Let $W.W\xi_i^v$ be the significant write s.t. $W.W\xi_i^v \xrightarrow{\text{r}} R\,x_i, \xi_i^v$. In other words, there exists a sequence of copies carrying the value of $\xi_i^v$ to $R\,x_i, \xi_i^v$. That sequence ends with a write $W'.W\,x_i, \xi_i^v \xrightarrow{\text{rf}} R\,x_i, \xi_i^v$. Moreover, according to the definition of $(\xi_i^v)$ and the semantics of resizing, $W.W\xi_i^v$ and $W'.W\,x_i, \xi_i^v$ must come before $W\xi_i^u$ in program order.

We have two cases: either $W\xi_i^u$ and $R\,x_i, \xi_i^v$ refer to the same memory location or they do not.

Assume that they refer to the same memory location $x_i$. Then it must be that $W'.W\,x_i, \xi_i^v \xrightarrow{\text{po-loc}} W\,x_i, \xi_i^u$, and we have:

\[
W'.W\,x_i, \xi_i^v \xrightarrow{\text{co}} W\xi_i^u \xrightarrow{\text{pp-sat}} R\,a, \&x \xrightarrow{\text{addr}} R\,x_i, \xi_i^v
\]

Hence, from Lemma 2 (ii), $W'.W\,x_i, \xi_i^v \not\xrightarrow{\text{rf}} R\,x_i, \xi_i^v$. Impossible.

Conversely, assume that they do not refer to the same memory location. Then there must be a resize operation between $W'.W\,x_i, \xi_i^v$ and $W\xi_i^u$:

\[
W\,a, \&x \xrightarrow{\text{sync}} W'.W\,x_i, \xi_i^v \xrightarrow{\text{sync}} W\,a, \&x' \xrightarrow{\text{sync}} W\,x'_i, \xi_i^u \xrightarrow{\text{pp-sat}} R\,a, \&x \xrightarrow{\text{addr}} R\,x_i, \xi_i^v
\]

Hence, from Lemma 3 (i), $W\,a, \&x \xrightarrow{\text{co}} W\,a, \&x' \xrightarrow{\text{pp-sat}} R\,a, \&x$. And from Lemma 2 (ii), $W\,a, \&x \not\xrightarrow{\text{rf}} R\,a, \&x$. Since there is only one write $W\,a, \&x$ that gives the value &x to a, we have a contradiction.

(ii) There exists a write $W.W\xi_i^v$ s.t. $W.W\xi_i^v \xrightarrow{\text{r}} R\xi_i^v$, and a sequence of copies carrying the value of $\xi_i^v$ to $R\xi_i^v$. That sequence ends with a write $W'.W\xi_i^v \xrightarrow{\text{rf}} R\xi_i^v$. Since $u \le v$, $W\xi_i^u \xrightarrow{\text{po}} W.W\xi_i^v$ by definition of $(\xi_i^v)$. Thanks to the barrier after $W\xi_i^u$ in push, $W\xi_i^u \xrightarrow{\text{sync}} W'.W\xi_i^v \xrightarrow{\text{rf}} R\xi_i^v$. From Lemma 3 (i), we get $W\xi_i^u \xrightarrow{\text{pp-sat}} R\xi_i^v$.

Corollary 2 (Well-defined significant reads). Given a significant read $R\,x_i, \xi$, $\xi = \xi_i^v$ for some $v > 0$.

Proof. Let X be the successful instance of take or steal s.t. $R\,x_i, \xi \in X$. Suppose $\xi \ne \xi_i^v$, then $\xi = \bot$ can only be an undefined value from the uninitialized array, prior to copying. Indeed, if $x_i$ is not affected by copying, then it must be one of the new slots allocated by the resizing, hence its initial value is $\xi_i^0$. Let R be the push operation that allocates the array x. There exists a $\xi_i^u$ such that:

\[
W\,x_i, \bot \xrightarrow{\text{co}} R.W\,x_i, \xi_i^u \xrightarrow{\text{sync}} W\,a, \&x \xrightarrow{\text{rf}} X.R\,a, \&x \xrightarrow{\text{addr}} R\,x_i, \xi
\]

It follows from Lemmas 2 (iii), 3 (i) and 2 (ii) that $W\,x_i, \bot \not\xrightarrow{\text{rf}} R\,x_i, \xi$. Impossible.

Hence, $\xi = \xi_i^v$. We have $R\,b, \beta \in X$ and $\beta \ge i + 1 > 0$, for X is successful. Hence, there is an instance of push P s.t. $P.W\,b, \beta \xrightarrow{\text{rf}} X.R\,b, \beta$. Since $\beta \ge i + 1$, either $\beta = i + 1$ and $W\xi_i^u \in P$, or there must be an instance of push that contains a significant write $W\xi_i^u$ and comes before P in program order. In both cases, $W\xi_i^u$ belongs to a push operation, hence $u > 0$. Moreover, thanks to the barrier after a significant write in push, $W\xi_i^u \xrightarrow{\text{sync}} P.W\,b, \beta$. If X is an instance of take, $P.W\,b, \beta \xrightarrow{\text{po}} X.R\xi_i^v$; otherwise, $P.W\,b, \beta \xrightarrow{\text{rf}} X.R\,b, \beta \xrightarrow{\text{ctrl-isync}} R\xi_i^v$ and Lemma 3 (ii) gives $P.W\,b, \beta \xrightarrow{\text{pp-sat}} X.R\xi_i^v$. In both cases, $W\xi_i^u \xrightarrow{\text{sync}} P.W\,b, \beta \xrightarrow{\text{pp-sat}} X.R\xi_i^v$, hence, by Lemmas 3 (i) and 6, $0 < u \le v$.

4.4 Uniqueness of significant reads

The results from the previous section establish that two significant reads at different indexes cannot retrieve the same element $\xi_i^v$. The only possible cause of duplicate significant reads is thus reduced to the case where the reads access the same index i.

Theorem 1 (Work-stealing: uniqueness of significant reads). Consider a worker thread executing a sequence of push and take operations, and a finite number of thief threads each executing steal operations, all against the same deque. If X and Y are two distinct successful instances of steal or take,

\[
\forall R\xi_i^v \in X,\ \forall R\xi_{i'}^{v'} \in Y,\quad i \ne i' \vee v \ne v'
\]

Lemma 7. Given $S_1$ and $S_2$ distinct successful instances of steal,

\[
\forall R\xi_i^v \in S_1,\ \forall R\xi_{i'}^{v'} \in S_2,\quad i \ne i'
\]

Proof. All writes to t atomically increment it (by atomicity of CAS). Hence two successful steal operations cannot write (thus read) the same value of t. Reads from x in a steal operation access the index given by the value of the t variable. Hence $R\,t, i \in S_1$ and $R\,t, i' \in S_2$ imply $i \ne i'$.
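The atomicity argument can be observed directly on a C11 atomic counter. In the minimal sketch below, claim_index is a hypothetical helper (not part of any library) that plays the role of the final CAS of steal: it succeeds only if top still holds the value the caller read earlier, so a given index can be claimed at most once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Tries to claim logical index `seen` by advancing top from seen to seen + 1.
 * Returns true for exactly one caller per value of `seen`. */
bool claim_index(atomic_size_t *top, size_t seen) {
    assert(top != NULL);
    /* On failure the CAS stores the current value of *top into the local
     * copy `seen`, which this sketch simply discards. */
    return atomic_compare_exchange_strong(top, &seen, seen + 1);
}
```

A second attempt to claim the same index fails because top has already moved on, which is why two successful steals necessarily read distinct values of t.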