HAL Id: hal-00802885
https://hal.inria.fr/hal-00802885
Submitted on 20 Mar 2013
Correct and Efficient Work-Stealing for Weak Memory Models
Nhat Minh Lê, Antoniu Pop, Albert Cohen, Francesco Zappa Nardelli
To cite this version:
Nhat Minh Lê, Antoniu Pop, Albert Cohen, Francesco Zappa Nardelli. Correct and Efficient Work-Stealing for Weak Memory Models. PPoPP ’13 - Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb 2013, Shenzhen, China. pp.69-80, 10.1145/2442516.2442524. hal-00802885.
Correct and Efficient Work-Stealing for Weak Memory Models
Nhat Minh Lê Antoniu Pop Albert Cohen Francesco Zappa Nardelli
INRIA and ENS Paris
Abstract
Chase and Lev’s concurrent deque is a key data structure in shared-memory parallel programming and plays an essential role in work-stealing schedulers. We provide the first correctness proof of an optimized implementation of Chase and Lev’s deque on top of the POWER and ARM architectures: these provide very relaxed memory models, which we exploit to improve performance but which considerably complicate the reasoning. We also study an optimized x86 and a portable C11 implementation, conducting systematic experiments to evaluate the impact of memory barrier optimizations. Our results demonstrate the benefits of hand tuning the deque code when running on top of relaxed memory models.
Categories and Subject Descriptors D.1.3 [Programming Techniques]: Concurrent Programming; E.1 [Data Structures]: Lists, stacks, and queues
Keywords lock-free algorithm, work-stealing, relaxed memory model, proof
1. Introduction
Multicore POWER and ARM architectures are standard targets for server, consumer electronics, and embedded control applications. The difficulties of parallel programming are exacerbated by the relaxed memory model implemented by these architectures, which allows the processors to perform a wide range of optimizations, including thread-local reordering and non-atomic store propagation.
The safety-critical nature of many embedded applications calls for solid foundations for parallel programming. This paper shows that a high degree of confidence can be achieved for highly optimized, real-world, concurrent algorithms running on top of weak memory models. A good test case is provided by the runtime scheduler of a task library. We thus focus on Chase and Lev’s concurrent double-ended queue (deque) [3], the cornerstone of most work-stealing schedulers. Until now, no rigorous correctness proof has been provided for implementations of this algorithm running on top of a relaxed memory model. Furthermore, while work-stealing is widely used on the x86 architecture (an evaluation under a restrictive hypothesis of idempotence of the workload can be found in [10]), few experiments target weaker memory models.
Our first contribution is a correctness proof of this fundamental concurrent data structure running on top of a relaxed memory model. We provide a hand-tuned implementation of the Chase and Lev’s deque for the ARM architectures, and prove its correctness against the memory semantics defined in [12] and [7]. Our second contribution is a systematic study of the performance of several implementations of Chase–Lev on relaxed hardware. In detail, we compare our optimized ARM implementation against a standard implementation for the x86 architecture and two portable variants expressed in C11: a reference sequentially consistent translation of the algorithm, and an aggressively optimized version making full use of the release–acquire and relaxed semantics offered by C11 low-level atomics. These implementations of the Chase–Lev deque are evaluated in the context of a work-stealing scheduler. We consider diverse worker/thief configurations, including a synthetic benchmark with two different workloads and standard task-parallel kernels. Our experiments demonstrate the impact of the memory barrier optimization on the throughput of our work-stealing runtime. We also comment on how the ARM correctness proof can be tailored to these alternative implementations. As a side effect, we highlight that our optimized ARM implementation cannot be expressed using C11 low-level atomics, which invariably end up inserting one redundant synchronization instruction.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PPoPP’13, February 23–27, 2013, Shenzhen, China.
Copyright © 2013 ACM 978-1-4503-1922/13/02...$10.00
2. Chase–Lev deque
User-space runtime schedulers offer an excellent playground for studying low-level high-performance code. We focus on randomized work-stealing: it was originally designed as the scheduler of the Cilk language for shared-memory multiprocessors [4], but thanks to its merits [2] it has been adopted in a number of parallel libraries and parallel programming environments, including the Intel TBB and compiler suite. Work-stealing variants have also been proposed for distributed clusters [5] and heterogeneous platforms [1]. The scheduling strategy is intuitive:
• Each processor uses a dynamic array as a deque holding tasks ready to be scheduled.
• Each processor manages its own deque as a stack. It may only push and pop tasks from the bottom of its own deque.
• Other processors may not push or pop from that deque; instead, they steal tasks from the top when their own deque is empty. In most implementations, the stolen deque is selected at random.
• Initially, one processor starts with the “root” task of the parallel program in its deque, and all other deques are empty.
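The access discipline in the list above can be illustrated with a purely sequential model. This is only a sketch for intuition, not the concurrent algorithm studied in this paper: the names SeqDeque, seq_push, seq_take and seq_steal, the fixed capacity CAP and the EMPTY sentinel are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define CAP 8        /* hypothetical fixed capacity for this sketch */
#define EMPTY (-1)   /* hypothetical sentinel meaning "no task" */

typedef struct {
    int    tasks[CAP];
    size_t top;      /* index of the oldest task (thief end) */
    size_t bottom;   /* index of the next free slot (owner end) */
} SeqDeque;

/* Owner: push a task at the bottom. Returns 0 on success, -1 if full. */
static int seq_push(SeqDeque *d, int task) {
    if (d->bottom - d->top == CAP) return -1;
    d->tasks[d->bottom % CAP] = task;
    d->bottom++;
    return 0;
}

/* Owner: take the most recently pushed task (stack order). */
static int seq_take(SeqDeque *d) {
    if (d->bottom == d->top) return EMPTY;
    d->bottom--;
    return d->tasks[d->bottom % CAP];
}

/* Thief: steal the oldest task from the top. */
static int seq_steal(SeqDeque *d) {
    if (d->bottom == d->top) return EMPTY;
    int task = d->tasks[d->top % CAP];
    d->top++;
    return task;
}
```

In this model the owner sees its tasks in last-in first-out order, while a thief removes the oldest task; the concurrent versions discussed next must preserve this behavior without a lock.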
The state-of-the-art algorithm for the work-stealing deque is Chase and Lev’s lock-free deque [3]. It uses an array with automatic, asynchronous growth. Assuming sequentially consistent memory, it involves only one atomic compare-and-swap (CAS) per steal, no CAS on push, and no CAS on take except when the deque has exactly one element left.
We implemented and tested four versions of the concurrent deque algorithm, with different barrier configurations: (1) a sequentially consistent version, written with C11 seq_cst atomics, following the original description in [3]; (2) an optimized version, which takes full advantage of the C11 relaxed memory model, reported in Figure 1; (3) a native version for ARMv7, reported in Figure 2; and (4) a native version for x86. These native versions rely on compiler intrinsics and inline assembly to leverage architecture-specific assumptions and thus reduce the number of barriers required.
In our implementations of Figure 1 and Figure 2, we assume that the Deque type is declared as:
typedef struct {
  atomic_size_t size;
  atomic_int buffer[];
} Array;

typedef struct {
  atomic_size_t top, bottom;
  _Atomic(Array *) array;
} Deque;
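The declaration above relies on a flexible array member, so an Array has to be allocated with room for its buffer. A minimal allocation sketch follows; the helpers array_new, deque_new and deque_free are hypothetical names introduced here and are not part of the paper's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    atomic_size_t size;
    atomic_int buffer[];          /* flexible array member */
} Array;

typedef struct {
    atomic_size_t top, bottom;
    _Atomic(Array *) array;
} Deque;

/* Hypothetical helper: allocate an Array with room for `size` elements. */
static Array *array_new(size_t size) {
    Array *a = malloc(sizeof(Array) + size * sizeof(atomic_int));
    if (a == NULL) return NULL;
    atomic_init(&a->size, size);
    for (size_t i = 0; i < size; i++)
        atomic_init(&a->buffer[i], 0);
    return a;
}

/* Hypothetical helper: allocate an empty deque (top == bottom == 0). */
static Deque *deque_new(size_t size) {
    Deque *q = malloc(sizeof(Deque));
    Array *a = array_new(size);
    if (q == NULL || a == NULL) {
        free(q);
        free(a);
        return NULL;
    }
    atomic_init(&q->top, 0);
    atomic_init(&q->bottom, 0);
    atomic_init(&q->array, a);
    return q;
}

/* Hypothetical helper: release the deque and its current array. */
static void deque_free(Deque *q) {
    if (q == NULL) return;
    free(atomic_load(&q->array));
    free(q);
}
```

The sketch only covers single-threaded setup and teardown; reclaiming old arrays after a concurrent resize is a separate problem that it does not address.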
In the code of Figure 1 the atomic_ and memory_order_ prefixes have been elided for clarity. The ARMv7 pseudo-code of Figure 2 uses the keywords R and W to denote reads and writes to shared variables, and atomic indicates a block that will be executed atomically, implemented via LL/SC instructions. The x86 version is based on prior work [10] and only requires a single mfence memory barrier in take, in place of the call to thread_fence in the C11 code.
2.1 Notions of correctness
The expected behavior of the work-stealing deque is intuitive: tasks pushed into the deque are then either taken in reverse order by the same thread, or stolen by another thread. We say that an implementation is correct if it satisfies four criteria, formalized and proven correct for our ARMv7 optimized code in Section 4:
1. tasks are taken in reverse order;
2. only tasks pushed are taken or stolen (well-defined reads);
3. a task pushed into a deque cannot be taken or stolen more than once (uniqueness);
4. given a finite number of push operations, all pushed values will eventually be either taken or stolen exactly once, if enough take and steal operations are attempted (existence).
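Criteria 2 to 4 can be phrased as checks on a log of removed tasks. The sketch below is a hypothetical test oracle, not part of the paper's proof: it assumes every pushed task carries a distinct identifier in [0, n_pushed), and the names check_removed and MAX_TASKS are ours.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TASKS 64   /* hypothetical bound on the number of pushed tasks */

/*
 * removed:     identifiers returned by successful take and steal operations
 * n_pushed:    number of tasks pushed, identified by 0 .. n_pushed - 1
 * require_all: also check that every pushed task was removed (criterion 4)
 */
static bool check_removed(const int *removed, size_t n_removed,
                          int n_pushed, bool require_all) {
    int count[MAX_TASKS] = {0};
    if (n_pushed < 0 || n_pushed > MAX_TASKS) return false;
    for (size_t i = 0; i < n_removed; i++) {
        int id = removed[i];
        if (id < 0 || id >= n_pushed) return false;  /* criterion 2: only pushed tasks */
        if (++count[id] > 1) return false;           /* criterion 3: uniqueness */
    }
    if (require_all) {
        for (int id = 0; id < n_pushed; id++)
            if (count[id] != 1) return false;        /* criterion 4: existence */
    }
    return true;
}
```

Such an oracle can only reject observed runs; it says nothing about executions that were not observed, which is why the paper relies on a proof over all execution witnesses.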
These criteria hold because of the following assumptions and properties of the Chase–Lev algorithm:
• For any given deque, push and take operations execute on a single thread. Concurrency can only occur between one execution of push or take in the owner thread, and one or more executions of steal in different threads.
• Newly pushed tasks are made visible to take and steal by the increment to bottom in push. As we shall see in Section 4, our ARMv7 implementation enforces this by placing a sync barrier before the update of bottom, guaranteeing that the pushed element cannot be stolen before bottom is updated.
• Taken tasks are reserved first by updating bottom; again, in our ARMv7 code, the sync barrier placed after the update to bottom will ensure that it will not be concurrently stolen.
• Stolen tasks are reserved by updating top. The only situation where steal and take contend for the same task is when the deque has a single element left; this particular conflict is resolved through the CAS instructions in both take and steal. This scenario allowed Chase and Lev to make the CAS in take conditional upon the size of the deque being 1. The correctness of this optimization on a relaxed memory model depends on the presence of the two full barriers in take and steal, to ensure that at least one of the participants will have a consistent view of the size of the deque. Having just one take or steal seeing a consistent view of the size of the deque is enough: if it is take, that will force a CAS to be performed; if it is steal, the index reservation will ensure an empty return value.
• Finally, stolen tasks are protected from being concurrently stolen multiple times by the monotonic CAS update to top in steal. This CAS orders steal operations and makes them mutually exclusive. At the same time, steal operations that abort due to a failed CAS do not change the state of the deque.
2.2 Comparison of the C11 and ARM implementations
Our C11 implementation in Figure 1 is optimal in the sense that no C11 synchronization can be removed without breaking the algorithm. However, if low-level atomics are compiled using the mapping of McKenney and Silvera [9] on ARMv7/POWER or the mapping of Terekhov [14] on x86, the generated code contains more barriers than the hand-optimized native versions on both x86 and ARMv7. We show in Section 5 that this happens because of the need for seq_cst atomics to simulate ARMv7/POWER cumulative semantics. Concretely, on ARMv7, an extra dmb instruction is inserted before each CAS operation [11], compared to the native version where a relaxed CAS (coherent and atomic only) is sufficient. On x86, an mfence instruction is added between the two reads in steal. The fully sequentially consistent C11 implementation inserts many more redundant barriers [11].
3. The memory model of ARMv7
The memory model of the ARMv7 architecture follows closely that of the POWER architecture, allowing a wide range of relaxed behaviors to be observable to the programmer:
1. The hardware threads can each perform reads and writes out-of-order, or even speculatively. Basically any local reordering is allowed unless there is a data/control dependence or synchronization instruction preventing it.
2. The memory system does not guarantee that a write becomes visible to all other hardware threads at the same time point. Writes performed by one thread are propagated to (and become visible from) any other thread in an arbitrary order, unless synchronization instructions are used.
3. A dmb barrier instruction guarantees that all the writes which have been observed by the thread issuing the barrier instruction are propagated to all the other threads before the thread can continue. Observed writes include all writes previously issued by the thread itself, as well as any write propagated to it from another thread prior to the barrier. This semantics of barrier instructions is referred to as cumulative.
We build on the axiomatic formalization of the POWER and ARMv7 memory model by Mador-Haim et al. [7], which has been proved equivalent to the operational semantics of Sarkar et al. [12]. A gentle introduction can be found in [8].
Axiomatic execution witnesses capture abstract memory events associated with memory-related instructions and internal transitions of the model. Unlike in stronger models such as x86, each memory access is represented at run-time by two distinct events: an issuing event (called sat for reads and ini for writes), eventually followed by a commit event when the speculative state of the instruction is resolved. Once a write instruction is committed, events that propagate it to other threads can be observed; propagation to thread A is denoted $\mathrm{pp}_A$. All the relations that are part of an execution witness are listed in Table 1.
The core of the axiomatic model builds on the evord relation, modeling the happens-before order between events. This satisfies the fundamental property:
\[ \xrightarrow{\text{evord}} \;\supset\; \xrightarrow{\text{after}} \cup \xrightarrow{\text{before}} \cup \xrightarrow{\text{comm}} \cup \xrightarrow{\text{insn}} \cup \xrightarrow{\text{local}} \]
and must be acyclic for an execution to be consistent.
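The acyclicity requirement can be made concrete on a small event graph. The following sketch assumes events are numbered 0 to n-1 and that edge[u][v] records that event u is ordered before event v by one of the relations above; the types and names (EventGraph, evord_is_acyclic, MAX_EVENTS) are illustrative assumptions and do not come from the formal model of [7].

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_EVENTS 16   /* hypothetical bound for this sketch */

typedef struct {
    int  n;                               /* number of events */
    bool edge[MAX_EVENTS][MAX_EVENTS];    /* edge[u][v]: u is ordered before v */
} EventGraph;

/* Depth-first search; state 0 = unvisited, 1 = on the DFS stack, 2 = finished. */
static bool dfs_finds_cycle(const EventGraph *g, int u, int *state) {
    state[u] = 1;
    for (int v = 0; v < g->n; v++) {
        if (!g->edge[u][v]) continue;
        if (state[v] == 1) return true;   /* back edge: cycle found */
        if (state[v] == 0 && dfs_finds_cycle(g, v, state)) return true;
    }
    state[u] = 2;
    return false;
}

/* Returns true when the ordering graph has no cycle. */
static bool evord_is_acyclic(const EventGraph *g) {
    int state[MAX_EVENTS] = {0};
    for (int u = 0; u < g->n; u++)
        if (state[u] == 0 && dfs_finds_cycle(g, u, state)) return false;
    return true;
}
```

The lemmas of Section 4 use the same idea by hand: an execution is ruled out by exhibiting a cycle among its events.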
We assume that the atomic sections, used to represent CAS-like behaviors, are executed atomically and obey a total order. We model them either as a single instance of a read instruction (failed CAS) or an atomic read–write pair of instruction instances (successful CAS). The atomicity of these accesses is captured by the $\xrightarrow{\text{po-atom}}$ relation. We do not assume any other property on these atomic sections (e.g., cumulativity). In practice, atomic sections can be implemented with LL/SC instructions.
We use several notation shortcuts. We refer to the deque global variables top, bottom, and array as t, b, and a. Elements of the buffer are written $x_i$, where i is the virtual index in natural numbers
int take(Deque *q) {
  size_t b = load_explicit(&q->bottom, relaxed) - 1;
  Array *a = load_explicit(&q->array, relaxed);
  store_explicit(&q->bottom, b, relaxed);
  thread_fence(seq_cst);
  size_t t = load_explicit(&q->top, relaxed);
  int x;
  if (t <= b) {
    /* Non-empty queue. */
    x = load_explicit(&a->buffer[b % a->size], relaxed);
    if (t == b) {
      /* Single last element in queue. */
      if (!compare_exchange_strong_explicit(&q->top, &t, t + 1, seq_cst, relaxed))
        /* Failed race. */
        x = EMPTY;
      store_explicit(&q->bottom, b + 1, relaxed);
    }
  } else { /* Empty queue. */
    x = EMPTY;
    store_explicit(&q->bottom, b + 1, relaxed);
  }
  return x;
}

void push(Deque *q, int x) {
  size_t b = load_explicit(&q->bottom, relaxed);
  size_t t = load_explicit(&q->top, acquire);
  Array *a = load_explicit(&q->array, relaxed);
  if (b - t > a->size - 1) { /* Full queue. */
    resize(q);
    a = load_explicit(&q->array, relaxed);
  }
  store_explicit(&a->buffer[b % a->size], x, relaxed);
  thread_fence(release);
  store_explicit(&q->bottom, b + 1, relaxed);
}

int steal(Deque *q) {
  size_t t = load_explicit(&q->top, acquire);
  thread_fence(seq_cst);
  size_t b = load_explicit(&q->bottom, acquire);
  int x = EMPTY;
  if (t < b) {
    /* Non-empty queue. */
    Array *a = load_explicit(&q->array, consume);
    x = load_explicit(&a->buffer[t % a->size], relaxed);
    if (!compare_exchange_strong_explicit(&q->top, &t, t + 1, seq_cst, relaxed))
      /* Failed race. */
      return ABORT;
  }
  return x;
}

Figure 1. C11 code of Chase–Lev deque, with low-level atomics
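For readers who want to compile the code, the following sketch spells out the elided atomic_ and memory_order_ prefixes. It is an adaptation, not the authors' implementation, and it departs from Figure 1 in ways that are our assumptions: the buffer has a fixed capacity (push reports failure instead of calling resize), indices are signed 64-bit integers so that bottom - 1 does not wrap around when bottom is 0, and the names FixedDeque, fd_*, CAPACITY, EMPTY and ABORT are hypothetical. The memory orders follow Figure 1.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define CAPACITY 64   /* hypothetical fixed capacity; this sketch never resizes */
#define EMPTY (-1)    /* hypothetical sentinels; real tasks must not use them */
#define ABORT (-2)

typedef struct {
    _Atomic int64_t top, bottom;   /* signed, unlike Figure 1 (our assumption) */
    atomic_int buffer[CAPACITY];
} FixedDeque;

static void fd_init(FixedDeque *q) {
    atomic_init(&q->top, 0);
    atomic_init(&q->bottom, 0);
    for (int i = 0; i < CAPACITY; i++) atomic_init(&q->buffer[i], 0);
}

/* Owner only. Returns 0 on success, -1 when the buffer is full. */
static int fd_push(FixedDeque *q, int x) {
    int64_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed);
    int64_t t = atomic_load_explicit(&q->top, memory_order_acquire);
    if (b - t > CAPACITY - 1) return -1;   /* Figure 1 calls resize here */
    atomic_store_explicit(&q->buffer[b % CAPACITY], x, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
    return 0;
}

/* Owner only. Returns a task or EMPTY. */
static int fd_take(FixedDeque *q) {
    int64_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed) - 1;
    atomic_store_explicit(&q->bottom, b, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    int64_t t = atomic_load_explicit(&q->top, memory_order_relaxed);
    int x;
    if (t <= b) {                          /* non-empty queue */
        x = atomic_load_explicit(&q->buffer[b % CAPACITY], memory_order_relaxed);
        if (t == b) {                      /* single last element */
            if (!atomic_compare_exchange_strong_explicit(
                    &q->top, &t, t + 1,
                    memory_order_seq_cst, memory_order_relaxed))
                x = EMPTY;                 /* lost the race against a thief */
            atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
        }
    } else {                               /* empty queue */
        x = EMPTY;
        atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
    }
    return x;
}

/* Any thread. Returns a task, EMPTY, or ABORT after a failed CAS. */
static int fd_steal(FixedDeque *q) {
    int64_t t = atomic_load_explicit(&q->top, memory_order_acquire);
    atomic_thread_fence(memory_order_seq_cst);
    int64_t b = atomic_load_explicit(&q->bottom, memory_order_acquire);
    int x = EMPTY;
    if (t < b) {                           /* non-empty queue */
        x = atomic_load_explicit(&q->buffer[t % CAPACITY], memory_order_relaxed);
        if (!atomic_compare_exchange_strong_explicit(
                &q->top, &t, t + 1,
                memory_order_seq_cst, memory_order_relaxed))
            return ABORT;
    }
    return x;
}
```

A single-threaded smoke test (push 1, 2, 3; steal yields 1; take yields 3, then 2, then EMPTY) exercises the control flow only. It does not exercise any of the relaxed-memory behaviors that the proof in Section 4 is about.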
int take(Deque *q) {
  size_t b = R(q->bottom) - 1;              (a)
  Array *a = R(q->array);                   (b)
  W(q->bottom, b);                          (c)
  sync;
  size_t t = R(q->top);                     (d)
  int x;
  if (t <= b) {
    x = R(a->buffer[b % a->size]);          (e)
    if (t == b) {
      bool success = false;
      atomic /* Implemented with LL/SC. */
        if (success = (R(q->top) == t))     (f)
          W(q->top, t + 1);                 (g)
      if (!success) x = EMPTY;
      W(q->bottom, b + 1);                  (h)
    }
  } else {
    x = EMPTY;
    W(q->bottom, b + 1);                    (i)
  }
  return x;
}

void push(Deque *q, int x) {
  size_t b = R(q->bottom);                  (a)
  size_t t = R(q->top);                     (b)
  Array *a = R(q->array);                   (c)
  if (b - t > a->size - 1) { /* Full queue. */
    resize(q);
    a = R(q->array);                        (d)
  }
  W(a->buffer[b % a->size], x);             (e)
  sync;
  W(q->bottom, b + 1);                      (f)
}

int steal(Deque *q) {
  size_t t = R(q->top);                     (a)
  sync;
  size_t b = R(q->bottom);                  (b)
  int x = EMPTY;
  if (t < b) {
    Array *a = R(q->array);                 (c)
    x = R(a->buffer[t % a->size]);          (d)
    ctrl_isync;
    bool success = false;
    atomic /* Implemented with LL/SC. */
      if (success = (R(q->top) == t))       (e)
        W(q->top, t + 1);                   (f)
    if (!success) return ABORT;
  }
  return x;
}

Figure 2. ARMv7 pseudo-code of Chase–Lev deque
before any wrap-around is applied. Barrier instructions are omitted for brevity when implied by the presence of a $\xrightarrow{\text{sync}}$ or $\xrightarrow{\text{ctrl-isync}}$ relation. Irrelevant values in reads and writes are replaced with the placeholder “_” (e.g., R x, _). We do not label instruction instances individually, but decorate them with a disambiguating execution prefix, identified by a dot. These prefixes not only distinguish between instruction instances, but also group related instruction instances within the same execution unit (usually an invocation of one of push, take or steal). For this, when no prefix is specified, the last prefix in left-to-right order is assumed.
4. Proof of correctness of the ARMv7 code
The proof is divided into five parts; it validates Criteria 2 to 4 enumerated in Section 2.1. Since push and take never execute concurrently and b is only ever modified in one of these functions, the proof of Criterion 1 does not involve reasoning about concurrency and we omit it here.
The proof builds on a precise analysis of all the possible execution witnesses of arbitrary invocations of the algorithm. We recall that an execution witness, as defined by the ARMv7 axiomatic model, is a graph capturing all memory events occurring during an execution (vertices), as well as the relations that link them (edges). Individual lemmas strive to narrow down the set of possible execution witnesses, based on properties of the algorithm and the architecture. To that end, we pinpoint specific subgraphs of an execution witness (hereafter, execution graphs) that cannot occur together in the same consistent execution witness. We then show that all incorrect executions, such as those containing two instances of steal reading the same value added by a single instance of push, cannot have consistent execution witnesses and, as such, cannot happen.
The proof is structured as follows. In 4.1 we provide basic technical definitions and properties of the memory model, which are used throughout the proof. In 4.2 we describe all the possible execution graphs for each of the three operations (push, take and steal), following the control flow of the ARMv7 code in Figure 2. In 4.3 we show how the succession of dynamic arrays built by resizing can be abstracted as a single sequence of unique abstract values independent of resize operations, with strong coherence and consistency properties. Corollary 2 establishes Criterion 2 (well-defined reads). In 4.4 we build on the previous abstraction to prove Theorem 1, pertaining to the uniqueness of elements taken and stolen, which corresponds to Criterion 3 (uniqueness). Finally, in 4.5, we rely on all previous results to prove Theorem 2 establishing Criterion 4 (existence): the existence of matching take or steal operations for every pushed element, under the appropriate hypotheses.
4.1 Preliminary properties
Before delving into the details of the proof itself, we introduce
some support definitions and related properties.
Notation                               Meaning
R l, α                                 read of value α from location l (_ stands for any value)
W l, α                                 write of value α to location l (_ stands for any value)
sync                                   memory barrier (usually implied by $\xrightarrow{\text{sync}}$)
isync                                  instruction barrier (usually implied by $\xrightarrow{\text{ctrl-isync}}$)
sat(X)                                 satisfy (a.k.a. complete) event of a read instruction
ini(X)                                 initialize event of a write instruction
com(X)                                 commit event of an in-flight or speculative instruction
$\mathrm{pp}_A(X)$                     propagate to thread of A event
$\xrightarrow{\text{po}}$              program order
$\xrightarrow{\text{po-atom}}$         atomic operation in program order (for CAS; see below)
$\xrightarrow{\text{po-loc}}$          same-location access in program order (defined in 4.1)
$\xrightarrow{\text{co}}$              write coherence
$\xrightarrow{\text{rf}}$              read from
$\xrightarrow{\text{r}}$               read from far (defined in 4.3)
$\xrightarrow{\text{fr}}$              from read
$\xrightarrow{\text{addr}}$            address dependence (usually implicit)
$\xrightarrow{\text{ctrl}}$            control dependence (usually implicit)
$\xrightarrow{\text{data}}$            data dependence (usually implicit)
$\xrightarrow{\text{dp}}$              observable dependence (defined in 4.1)
$\xrightarrow{\text{ctrl-isync}}$      non-cumulative local ordering barrier (see below)
$\xrightarrow{\text{sync}}$            cumulative full barrier (see below)
$\xrightarrow{\text{pp-sat}}$          write-to-read propagation (defined in 4.1)
$\xrightarrow{\text{after}}$           after barrier edge
$\xrightarrow{\text{before}}$          before barrier edge
$\xrightarrow{\text{comm}}$            communication edge
$\xrightarrow{\text{insn}}$            intra-instruction order edge
$\xrightarrow{\text{local}}$           local order edge
$\xrightarrow{\text{evord}}$           event happens-before order (usually typeset as →)
On ARMv7, $\xrightarrow{\text{sync}}$ corresponds to a dmb instruction while $\xrightarrow{\text{ctrl-isync}}$ corresponds to a dependent conditional branch followed by an isb instruction.
Table 1. Summary of relations used in the ARMv7 axiomatic model
For convenience, we define the $\xrightarrow{\text{po-loc}}$ relation, which relates local (same-thread) accesses to the same memory location; $\xrightarrow{\text{po-loc}}$ implies an instruction-level communication edge $\xrightarrow{\text{co}}$, $\xrightarrow{\text{rf}}$ or $\xrightarrow{\text{fr}}$. In particular, $\xrightarrow{\text{po-loc}}$ implies $\xrightarrow{\text{co}}$ between two writes.
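We define the dependence relation $\xrightarrow{\text{dp}}$ as follows:
\[ \mathrm{R}\,x,\_ \xrightarrow{\text{dp}} \mathrm{R}\,y,\_ \;\overset{\text{def}}{\iff}\; \mathrm{R}\,x,\_ \,\bigl(\xrightarrow{\text{addr}} \cup \xrightarrow{\text{ctrl-isync}}\bigr)\, \mathrm{R}\,y,\_ \]
\[ \mathrm{R}\,x,\_ \xrightarrow{\text{dp}} \mathrm{W}\,y,\_ \;\overset{\text{def}}{\iff}\; \mathrm{R}\,x,\_ \,\bigl(\xrightarrow{\text{addr}} \cup \xrightarrow{\text{ctrl}} \cup \xrightarrow{\text{data}}\bigr)\, \mathrm{W}\,y,\_ \]
Lemma 1. The following properties involving $\xrightarrow{\text{dp}}$ apply:
\[ \mathrm{R}\,x,\_ \xrightarrow{\text{dp}} \mathrm{R}\,y,\_ \implies \mathrm{sat}(\mathrm{R}\,x,\_) \to \mathrm{sat}(\mathrm{R}\,y,\_) \]
\[ \mathrm{R}\,x,\_ \xrightarrow{\text{dp}} \mathrm{W}\,y,\_ \implies \mathrm{sat}(\mathrm{R}\,x,\_) \to \mathrm{com}(\mathrm{W}\,y,\_) \]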
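Proof. In the case of an address or control dependence, the result is an immediate consequence of the definition of intra-instruction and local orders. It remains to be shown that the result holds for $\xrightarrow{\text{ctrl-isync}}$: a dependent conditional branch instruction, ctrl, followed by an isync barrier. Suppose $\mathrm{R}\,x,\_ \xrightarrow{\text{ctrl-isync}} \mathrm{R}\,y,\_$. Then we have:
\[ \mathrm{sat}(\mathrm{R}\,x,\_) \xrightarrow{\text{insn}} \mathrm{com}(\mathrm{R}\,x,\_) \xrightarrow{\text{local}} \mathrm{com}(\mathrm{ctrl}) \xrightarrow{\text{local}} \mathrm{com}(\mathrm{isync}) \xrightarrow{\text{local}} \mathrm{sat}(\mathrm{R}\,y,\_) \]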
We define the relation $\xrightarrow{\text{pp-sat}}$ between instruction instances, $A.\mathrm{W}\,x,\_ \xrightarrow{\text{pp-sat}} B.\mathrm{R}\,y,\_$, as follows:$^1$
\[ \begin{cases} \mathrm{W}\,x,\_ \xrightarrow{\text{po}} \mathrm{R}\,y,\_ & \text{if } A \sim B \\ \mathrm{pp}_B(\mathrm{W}\,x,\_) \to \mathrm{sat}(\mathrm{R}\,y,\_) & \text{if } A \not\sim B \end{cases} \]
where $A \sim B$ means that instruction instances grouped under prefixes A and B belong to the same thread.
Intuitively, $\xrightarrow{\text{pp-sat}}$ represents a “known-to” relation in the following sense: $A.\mathrm{W}\,x,\_ \xrightarrow{\text{pp-sat}} B.\mathrm{R}\,y,\_$ means that, at the time of reading y, that specific write to x (as well as any write that is coherence-before it) is known to the thread executing B. It is clear that $\xrightarrow{\text{rf}}$ implies $\xrightarrow{\text{pp-sat}}$, by definition of communication edges (if threads are different) or uniprocessor constraints (if same thread).
$^1$ Note that $\xrightarrow{\text{pp-sat}}$ does not imply an event happens-before order on the events making up the related instruction instances.
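Lemma 2. The following properties involve $\xrightarrow{\text{pp-sat}}$ and $\xrightarrow{\text{po-loc}}$:
(i) $A.\mathrm{W}\,x,\_ \xrightarrow{\text{rf}} B.\mathrm{R}\,x,\_ \xrightarrow{\text{po-loc}} B'.\mathrm{R}\,x,\_ \implies A.\mathrm{W}\,x,\_ \xrightarrow{\text{pp-sat}} B'.\mathrm{R}\,x,\_$
(ii) $A.\mathrm{W}\,x,\_ \xrightarrow{\text{co}} B.\mathrm{W}\,x,\_ \xrightarrow{\text{pp-sat}} C.\mathrm{R}\,x,\_ \implies A.\mathrm{W}\,x,\_ \overset{\text{rf}}{\nrightarrow} C.\mathrm{R}\,x,\_$
(iii) $\mathrm{W}\,x,\_ \xrightarrow{\text{pp-sat}} \mathrm{R}\,y,\_ \xrightarrow{\text{dp}} \mathrm{R}\,z,\_ \implies \mathrm{W}\,x,\_ \xrightarrow{\text{pp-sat}} \mathrm{R}\,z,\_$
(iv) $\neg\bigl(A.\mathrm{W}\,x,\_ \xrightarrow{\text{pp-sat}} B.\mathrm{R}\,y',\_ \xrightarrow{\text{dp}} B.\mathrm{W}\,x',\_ \xrightarrow{\text{pp-sat}} A.\mathrm{R}\,y,\_ \xrightarrow{\text{dp}} A.\mathrm{W}\,x,\_\bigr)$
Proof. We prove each point separately:
(i) If the write and the reads happen in the same thread, then all instruction instances belong to that thread, and program order prevails. Otherwise, either $A.\mathrm{W}\,x,\_ \xrightarrow{\text{rf}} B'.\mathrm{R}\,x,\_$ and the result is immediate, or $A.\mathrm{W}\,x,\_ \overset{\text{rf}}{\nrightarrow} B'.\mathrm{R}\,x,\_$ and $B.\mathrm{R}\,x,\_ \xrightarrow{\text{po-loc}} B'.\mathrm{R}\,x,\_$ implies the following: $\mathrm{com}(B.\mathrm{R}\,y,\_) \xrightarrow{\text{local}} \mathrm{sat}(B'.\mathrm{R}\,y,\_)$, by definition of $\xrightarrow{\text{local}}$. Hence:
\[ \mathrm{pp}_B(A.\mathrm{W}\,x,\_) \to \mathrm{sat}(B.\mathrm{R}\,y,\_) \xrightarrow{\text{insn}} \mathrm{com}(\mathrm{R}\,y,\_) \xrightarrow{\text{local}} \mathrm{sat}(B'.\mathrm{R}\,y,\_) \]
(ii) Suppose, to the contrary, that $A.\mathrm{W}\,x,\_ \xrightarrow{\text{rf}} C.\mathrm{R}\,x,\_$. Then $C.\mathrm{R}\,x,\_ \xrightarrow{\text{fr}} B.\mathrm{W}\,x,\_$, and we have the following cycle in the event happens-before order:
\[ \mathrm{sat}(C.\mathrm{R}\,x,\_) \xrightarrow{\text{comm}} \mathrm{pp}_Z(B.\mathrm{W}\,x,\_) \to \mathrm{sat}(C.\mathrm{R}\,x,\_) \]
(iii) Follows from Lemma 1.
(iv) Assume that:
\[ A.\mathrm{W}\,x,\_ \xrightarrow{\text{pp-sat}} B.\mathrm{R}\,y',\_ \xrightarrow{\text{dp}} B.\mathrm{W}\,x',\_ \xrightarrow{\text{pp-sat}} A.\mathrm{R}\,y,\_ \xrightarrow{\text{dp}} A.\mathrm{W}\,x,\_ \]
If $A \sim B$ then there is a cycle in $\xrightarrow{\text{po}}$. Otherwise, by Lemma 1, we have a cycle in the event happens-before order:
\[ \mathrm{pp}_B(\mathrm{W}\,x,\_) \to \mathrm{sat}(\mathrm{R}\,y',\_) \to \mathrm{com}(\mathrm{W}\,x',\_) \xrightarrow{\text{insn}} \mathrm{pp}_A(\mathrm{W}\,x',\_) \to \mathrm{sat}(\mathrm{R}\,y,\_) \to \mathrm{com}(\mathrm{W}\,x,\_) \xrightarrow{\text{insn}} \mathrm{pp}_B(\mathrm{W}\,x,\_) \]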
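Lemma 3. The following properties involving barriers apply:
(i) $\bigl(\mathrm{W}\,x,\_ \xrightarrow{\text{sync}} \mathrm{W}\,y,\_ \xrightarrow{\text{pp-sat}} \mathrm{R}\,z,\_ \;\lor\; \mathrm{W}\,x,\_ \xrightarrow{\text{pp-sat}} \mathrm{R}\,y,\_ \xrightarrow{\text{sync}} \mathrm{R}\,z,\_\bigr) \implies \mathrm{W}\,x,\_ \xrightarrow{\text{pp-sat}} \mathrm{R}\,z,\_$
(ii) $A.\mathrm{W}\,x,\_ \xrightarrow{\text{rf}} B.\mathrm{R}\,x,\_ \xrightarrow{\text{sync}} B.\mathrm{W}\,y,\_ \xrightarrow{\text{pp-sat}} C.\mathrm{R}\,x,\_ \implies A.\mathrm{W}\,x,\_ \xrightarrow{\text{pp-sat}} C.\mathrm{R}\,x,\_$
(iii) Let X stand for $A.\mathrm{W}\,x,\_ \xrightarrow{\text{rf}} B.\mathrm{R}\,x,\_$ or $(A \sim B).\mathrm{W}\,x,\_$ and Y stand for $C.\mathrm{W}\,y,\_ \xrightarrow{\text{rf}} D.\mathrm{R}\,y,\_$ or $(C \sim D).\mathrm{W}\,y,\_$; then the following holds:
\[ \neg\bigl(X \xrightarrow{\text{sync}} B.\mathrm{R}\,y,\_ \xrightarrow{\text{fr}} C.\mathrm{W}\,y,\_ \;\land\; Y \xrightarrow{\text{sync}} D.\mathrm{R}\,x,\_ \xrightarrow{\text{fr}} A.\mathrm{W}\,x,\_\bigr) \]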
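Proof. We prove each point separately:
(i) If $\mathrm{W}\,x,\_$ and $\mathrm{R}\,z,\_$ occur in the same thread, then all instruction instances belong to that thread and program order prevails. Otherwise, suppose $\mathrm{R}\,z,\_$ executes in A; we have two cases:
\[ \mathrm{pp}_A(\mathrm{W}\,x,\_) \xrightarrow{\text{before}} \mathrm{pp}_A(\mathrm{sync}) \xrightarrow{\text{before}} \mathrm{pp}_A(\mathrm{W}\,y,\_) \to \mathrm{sat}(\mathrm{R}\,z,\_) \]
Or the other way around:
\[ \mathrm{pp}_A(\mathrm{W}\,x,\_) \to \mathrm{sat}(\mathrm{R}\,y,\_) \xrightarrow{\text{insn}} \mathrm{com}(\mathrm{R}\,y,\_) \xrightarrow{\text{local}} \mathrm{com}(\mathrm{sync}) \xrightarrow{\text{local}} \mathrm{sat}(\mathrm{R}\,z,\_) \]
In both cases, $\mathrm{pp}_A(\mathrm{W}\,x,\_) \to \mathrm{sat}(\mathrm{R}\,z,\_)$.
(ii) Suppose $A \sim C$. If $A \sim B$, then program order prevails: all the instruction instances belong to the same thread. If not, suppose $C.\mathrm{R}\,x,\_ \xrightarrow{\text{po}} A.\mathrm{W}\,x,\_$; then the event happens-before order contains the following cycle:
\[ \mathrm{pp}_B(A.\mathrm{W}\,x,\_) \xrightarrow{\text{comm}} \mathrm{sat}(B.\mathrm{R}\,x,\_) \xrightarrow{\text{insn}} \mathrm{com}(\mathrm{R}\,x,\_) \xrightarrow{\text{local}} \mathrm{com}(\mathrm{sync}) \xrightarrow{\text{local}} \mathrm{com}(B.\mathrm{W}\,y,\_) \xrightarrow{\text{insn}} \mathrm{pp}_C(\mathrm{W}\,y,\_) \to \mathrm{sat}(C.\mathrm{R}\,x,\_) \xrightarrow{\text{insn}} \mathrm{com}(\mathrm{R}\,x,\_) \xrightarrow{\text{local}} \mathrm{com}(A.\mathrm{W}\,x,\_) \xrightarrow{\text{insn}} \mathrm{pp}_B(\mathrm{W}\,x,\_) \]
Otherwise, suppose $A \not\sim C$. If $A \sim B$, then $A.\mathrm{W}\,x,\_ \xrightarrow{\text{sync}} B.\mathrm{W}\,y,\_$ and we have the result from (i). If not, we have:
\[ \mathrm{pp}_B(A.\mathrm{W}\,x,\_) \xrightarrow{\text{comm}} \mathrm{sat}(B.\mathrm{R}\,x,\_) \xrightarrow{\text{insn}} \mathrm{com}(\mathrm{R}\,x,\_) \xrightarrow{\text{local}} \mathrm{com}(\mathrm{sync}) \]
Thus, we have $\mathrm{pp}_C(A.\mathrm{W}\,x,\_) \xrightarrow{\text{before}} \mathrm{pp}_C(\mathrm{sync}) \xrightarrow{\text{before}} \mathrm{pp}_C(B.\mathrm{W}\,y,\_) \to \mathrm{sat}(C.\mathrm{R}\,x,\_)$.
(iii) Suppose the contrary. If $B \sim D$, then $\xrightarrow{\text{rf}}$ and $\xrightarrow{\text{fr}}$ form a path that goes against $\xrightarrow{\text{po}}$: the graph is invalid according to uniprocessor constraints. Otherwise, $B \not\sim D$ and the following holds (omitting intermediate steps in elaborating $\xrightarrow{\text{before}}$ for conciseness):
• $\mathrm{com}(B.\mathrm{sync}) \xrightarrow{\text{local}} \mathrm{com}(C.\mathrm{W}\,y,\_) \xrightarrow{\text{local}} \mathrm{com}(D.\mathrm{sync}) \xrightarrow{\text{insn}} \mathrm{pp}_B(\mathrm{sync})$ if $B \sim C$.
• $\mathrm{com}(B.\mathrm{sync}) \xrightarrow{\text{local}} \mathrm{sat}(\mathrm{R}\,y,\_) \xrightarrow{\text{comm}} \mathrm{pp}_B(C.\mathrm{W}\,y,\_) \xrightarrow{\text{before}} \mathrm{pp}_B(D.\mathrm{sync})$ otherwise.
Either way, $\mathrm{com}(B.\mathrm{sync}) \to \mathrm{pp}_B(D.\mathrm{sync})$. By definition, we have an after edge between the two barriers: $\mathrm{pp}_D(B.\mathrm{sync}) \xrightarrow{\text{after}} \mathrm{com}(D.\mathrm{sync})$.
Moreover, either $A \sim D$ or $A \not\sim D$:
• $\mathrm{pp}_D(B.\mathrm{sync}) \xrightarrow{\text{after}} \mathrm{com}(D.\mathrm{sync}) \xrightarrow{\text{local}} \mathrm{com}(A.\mathrm{W}\,x,\_) \xrightarrow{\text{insn}} \mathrm{pp}_B(\mathrm{W}\,x,\_)$ if $A \sim D$.
• $\mathrm{pp}_D(B.\mathrm{sync}) \xrightarrow{\text{after}} \mathrm{com}(D.\mathrm{sync}) \xrightarrow{\text{local}} \mathrm{sat}(\mathrm{R}\,x,\_) \xrightarrow{\text{comm}} \mathrm{pp}_D(A.\mathrm{W}\,x,\_)$ otherwise.
Thus, in all cases, we have a cycle:
\[ \mathrm{com}(B.\mathrm{sync}) \xrightarrow{\text{before}} \mathrm{pp}_B(A.\mathrm{W}\,x,\_) \xrightarrow{\text{comm}} \mathrm{sat}(B.\mathrm{R}\,x,\_) \xrightarrow{\text{insn}} \mathrm{com}(\mathrm{R}\,x,\_) \xrightarrow{\text{local}} \mathrm{com}(B.\mathrm{sync}) \]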
4.2 Execution paths
We consider the three operations of the work-stealing algorithm: take, push and steal. Each of them exhibits different execution paths depending on control flow. Data and address dependences are implicit in the notations and are omitted for brevity. Control dependences are implied by the guard conditions in each case and are also omitted, but we make explicit the constraints on the b and t variables carrying the control dependence. Greek letters β, τ, ξ denote the memory values of b, t, and some $x_i$, respectively. Reads and writes are annotated with the corresponding line from Figure 2.
For take and steal, we say that an instance of the operation is successful if it returns one element; otherwise (including if it returns empty) it is considered failed.
4.2.1 Take
Two failure cases return no element (empty), and two success cases return one element from the deque. All four paths start with:
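\[ (a)\ \mathrm{R}\,b,\beta \xrightarrow{\text{po}} (b)\ \mathrm{R}\,a,\&x \xrightarrow{\text{po}} (c)\ \mathrm{W}\,b,\beta-1 \xrightarrow{\text{sync}} (d)\ \mathrm{R}\,t,\tau \]
Specific continuations for each path are listed below.
Return empty without CAS, $\beta - \tau \le 0$:
\[ \cdots \xrightarrow{\text{po}} (i)\ \mathrm{W}\,b,\beta \]
Return empty with (failed) CAS, $\beta - \tau = 1$, $\tau \ne \tau'$:
\[ \cdots \xrightarrow{\text{po}} (e)\ \mathrm{R}\,x_{\beta-1},\xi \xrightarrow{\text{po}} (f)\ \mathrm{R}\,t,\tau' \xrightarrow{\text{po}} (h)\ \mathrm{W}\,b,\tau+1 \]
Return one without CAS, $\beta - \tau > 1$:
\[ \cdots \xrightarrow{\text{po}} (e)\ \mathrm{R}\,x_{\beta-1},\xi \]
Return one with (successful) CAS, $\beta - \tau = 1$:
\[ \cdots \xrightarrow{\text{po}} (e)\ \mathrm{R}\,x_{\beta-1},\xi \xrightarrow{\text{po}} (f)\ \mathrm{R}\,t,\tau \xrightarrow{\text{po-atom}} (g)\ \mathrm{W}\,t,\tau+1 \xrightarrow{\text{po}} (h)\ \mathrm{W}\,b,\beta \]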
4.2.2 Push
There are two paths: a straight case, and a resizing case which grows the underlying circular buffer.
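Straight, $\beta - \tau < \mathrm{size}(x) - 1$:
\[ (a)\ \mathrm{R}\,b,\beta \xrightarrow{\text{po}} (b)\ \mathrm{R}\,t,\tau \xrightarrow{\text{po}} (c)\ \mathrm{R}\,a,\&x \xrightarrow{\text{po}} (e)\ \mathrm{W}\,x_\beta,\xi \xrightarrow{\text{sync}} (f)\ \mathrm{W}\,b,\beta+1 \]
Resizing, $\beta - \tau \ge \mathrm{size}(x) - 1$, where $x'$ refers to the new array:
\[ (a)\ \mathrm{R}\,b,\beta \xrightarrow{\text{po}} (b)\ \mathrm{R}\,t,\tau \xrightarrow{\text{po}} (c)\ \mathrm{R}\,a,\&x \xrightarrow{\text{po}} \mathit{resize} \xrightarrow{\text{sync}} (d)\ \mathrm{R}\,a,\&x' \xrightarrow{\text{po}} (e)\ \mathrm{W}\,x'_\beta,\xi \xrightarrow{\text{sync}} (f)\ \mathrm{W}\,b,\beta+1 \]
where
\[ \mathit{resize} = \mathrm{R}\,x_\tau,\xi_\tau \xrightarrow{\text{po}} \mathrm{W}\,x'_\tau,\xi_\tau \xrightarrow{\text{po}} \cdots \xrightarrow{\text{po}} \mathrm{R}\,x_{\beta-1},\xi_{\beta-1} \xrightarrow{\text{po}} \mathrm{W}\,x'_{\beta-1},\xi_{\beta-1} \xrightarrow{\text{sync}} \mathrm{W}\,a,\&x' \]
4.2.3 Steal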
There are three paths: two failure cases and one success case.
Failure returns no element and success returns a stolen element.
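Return empty without CAS, $\beta - \tau \le 0$:
\[ (a)\ \mathrm{R}\,t,\tau \xrightarrow{\text{sync}} (b)\ \mathrm{R}\,b,\beta \]
Return empty with (failed) CAS, $\beta - \tau > 0 \land \tau \ne \tau'$:
\[ (a)\ \mathrm{R}\,t,\tau \xrightarrow{\text{sync}} (b)\ \mathrm{R}\,b,\beta \xrightarrow{\text{ctrl-isync}} (c)\ \mathrm{R}\,a,\&x \xrightarrow{\text{po}} (d)\ \mathrm{R}\,x_\tau,\xi \xrightarrow{\text{ctrl-isync}} (e)\ \mathrm{R}\,t,\tau' \]
Return one with (successful) CAS, $\beta - \tau > 0$:
\[ (a)\ \mathrm{R}\,t,\tau \xrightarrow{\text{sync}} (b)\ \mathrm{R}\,b,\beta \xrightarrow{\text{ctrl-isync}} (c)\ \mathrm{R}\,a,\&x \xrightarrow{\text{po}} (d)\ \mathrm{R}\,x_\tau,\xi \xrightarrow{\text{ctrl-isync}} (e)\ \mathrm{R}\,t,\tau \xrightarrow{\text{po-atom}} (f)\ \mathrm{W}\,t,\tau+1 \]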
4.3 Significant reads and writes
We define the sequence $(\beta_n)$ of values taken by the variable b over the course of the program, according to the write coherence relation. Initially $\beta_0 = 0$. Since all push and take operations occur in a single thread, and steal operations never alter the value of b, the elements of $(\beta_n)$ correspond to writes to b in program order within the push and take operations. Similarly, we define the sequence $(\tau_m)$ of values taken by the variable t. We assume $\tau_0 = 0$. Furthermore, since all writes to t are from CAS instructions, which are sequentially ordered, and all such CAS instructions increment t by one, $(\tau_m)$ is monotonically increasing, and s.t. $\tau_m = m$.
For each index i, we define the sequence $(\xi_i^v)_{v \in \mathbb{N}}$ of successive values given to the element at index i in the deque by the last write $\mathrm{W}\,x_i,\_$ of a push operation, regardless of the address $\&x$ of the underlying array. Only the last such write is called significant as it induces a new value in a $(\xi_i^v)$ sequence, while writes due to resizing do not. For all i, $\xi_i^0$, the value before the first significant write to the $x_i$ location, is undefined: $\xi_i^0 = \bot$. Similarly, a read is significant if it occurs in a successful instance of take or steal.
Lemma 4. For all i, $(\xi_i^v)$ is globally coherent.
Proof. Consider two significant writes $\mathrm{W}\,x_i,\_$ and $\mathrm{W}\,x'_i,\_$ at index i (regardless of the address of the underlying array). If $\mathrm{W}\,x_i,\_$ and $\mathrm{W}\,x'_i,\_$ both write to the same memory location, then they are ordered by write coherence. If they do not, then there must be a resize operation after the first write and before the second (all writes happen in the same thread). Because of the cumulative barrier after a resize operation, threads that see the second value must have seen the first beforehand. Hence, there is a global coherence order on the writes, which corresponds to the order of push operations.
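We define the relation read from far as follows: for some memory locations $m_0, \ldots, m_n$ and some value v, $\mathrm{W}\,m_0,v \xrightarrow{\text{r}} \mathrm{R}\,m_n,v$ if $\mathrm{W}\,m_0,v \xrightarrow{\text{rf}} \mathrm{R}\,m_n,v$ or there exists a sequence of copies carrying the value of the write to the read:
\[ \mathrm{W}\,m_0,v \xrightarrow{\text{rf}} \mathrm{R}\,m_0,v \xrightarrow{\text{data}} \mathrm{W}\,m_1,v \xrightarrow{\text{rf}} \cdots \xrightarrow{\text{data}} \mathrm{W}\,m_n,v \xrightarrow{\text{rf}} \mathrm{R}\,m_n,v. \]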
For conciseness, we hereafter omit the variable name from reads and writes whenever the variable can be inferred from the value: e.g., $W\beta_n$ stands for $W b, \beta_n$. Let $W\xi_i^v$ denote the $v$-th significant write at index i, and $R\xi_i^v$ a significant read s.t. $W\xi_i^v \xrightarrow{\mathrm{r}} R\xi_i^v$.
Lemma 5. Given a write $W x_i, \_$ and a read $R x'_j, \_$,
$$i \neq j \implies W x_i, \_ \;\not\xrightarrow{\mathrm{rf}}\; R x'_j, \_$$
Proof. If the addresses of the underlying arrays differ, then the memory locations read and written are distinct and there can be no read from relation.
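Otherwise, since old arrays are never reused, the addresses are the same and $i \equiv j \bmod \mathrm{size}(x)$. $R x'_j, \_$ belongs to a successful instance of take, push (with resizing), or steal. Let X be that instance.
Let P be the instance of push to which $W x_i, \_$ belongs. In P, we have the following execution graph:
$$P.R t, \tau_P \xrightarrow{\mathrm{ctrl}} W x_i, \_ \xrightarrow{\mathrm{sync}} W b, \beta_P + 1 \quad\text{where}\quad \tau_P \le i \le \beta_P \ \text{and}\ \beta_P - \tau_P < \mathrm{size}(x) - 1$$
Let us assume $i \neq j \wedge W x_i, \_ \xrightarrow{\mathrm{rf}} R x'_j, \_$ and show that this is impossible.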
Assume X is a successful instance of take or push. Since X and P belong to the same thread, P must occur before X in program order (the order of loads and stores to the same location is preserved: $P.W x_i, \_ \xrightarrow{\text{po-loc}} X.R x'_j, \_$).
If $j < i$, then $j \le i - \mathrm{size}(x)$, because i and j are congruent modulo $\mathrm{size}(x)$. However, the following must hold in P:
$$\tau_P \le i \le \beta_P \;\wedge\; \beta_P - \tau_P < \mathrm{size}(x) - 1$$
hence
$$j < i - \mathrm{size}(x) + 1 \le \beta_P - \mathrm{size}(x) + 1 < \tau_P$$
Furthermore, if X is a take operation, $R x'_j, \_$ reads the last element of the deque, and $j = \beta_X - 1 \ge \tau_X$; if X is a push operation, $R x'_j, \_$ results from a copy operation of the resizing code, hence $j \ge \tau_X$. Since X occurs after P in program order and t is monotonically increasing, $P.R t, \tau_P \xrightarrow{\text{po-loc}} X.R t, \tau_X$ and $j < \tau_P \le \tau_X \le j$. Impossible.
If $i < j$, then, since $j \ge \beta_X$, b must increase from $\beta_P + 1$ to $j + 1$ between the write in P and the read in X. Hence, there must be an instance $P'$ of push between P and X (in program order) that increments b to $j + 1$. Indeed, the only writes that increase the value of b occur in push and take; and the effect of take as a whole never increases the value of b since it first decrements the variable. We have:
$$P.W x_i, \_ \xrightarrow{\text{po-loc}} P'.W x_j, \_ \xrightarrow{\text{po-loc}} X.R x'_j, \_$$
hence
$$P.W x_i, \_ \xrightarrow{\mathrm{co}} P'.W x_j, \_ \xrightarrow{\text{pp-sat}} X.R x'_j, \_$$
Thus, from Lemma 2 (ii), $P.W x_i, \_ \not\xrightarrow{\mathrm{rf}} X.R x'_j, \_$.
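Now, assume X is a successful instance of steal. We have the following execution graph for X:
$$X.R t, \tau_X = j \xrightarrow{\mathrm{sync}} R b, \beta_X \xrightarrow{\text{ctrl-isync}} R a, \&x' \xrightarrow{\mathrm{po}} R x'_j, \_ \xrightarrow{\text{ctrl-isync}} R t, \tau_X \xrightarrow{\text{po-atom}} W t, \tau_X + 1$$
If $j < i$, then $j \le i - \mathrm{size}(x)$. However, the following must hold in P:
$$j < i - \mathrm{size}(x) + 1 \le \beta_P - \mathrm{size}(x) + 1 < \tau_P$$
Hence $\tau_X = j < \tau_P$. Since t increases monotonically, it must be that:
$$X.R x'_j, \_ \xrightarrow{\text{ctrl-isync}} R t, \tau_X \xrightarrow{\text{po-atom}} W t, \tau_X + 1 \xrightarrow{\mathrm{rf}} R t, \_ \xrightarrow{\mathrm{sync}} W t, \_ \xrightarrow{\mathrm{rf}} \cdots \xrightarrow{\mathrm{sync}} W t, \tau_P \xrightarrow{\mathrm{rf}} P.R t, \tau_P \xrightarrow{\mathrm{ctrl}} W x_i, \_$$
Hence $X.R x'_j, \_$ must be committed before $W t, \tau_X + 1$. Since $W t, \tau_X + 1$ is (cumulatively) propagated to $W x_i, \_$, $X.R x'_j, \_$ must be committed before $W x_i, \_$. Formally: it follows from Lemma 3 (ii) that $W t, \tau_X + 1 \xrightarrow{\text{pp-sat}} P.R t, \tau_P$. If $W x_i, \_ \xrightarrow{\mathrm{rf}} R x'_j, \_$ then $W x_i, \_ \xrightarrow{\text{pp-sat}} R x'_j, \_$. We get:
$$X.W t, \tau_X + 1 \xrightarrow{\text{pp-sat}} P.R t, \tau_P \xrightarrow{\mathrm{ctrl}} W x_i, \_ \;\wedge\; P.W x_i, \_ \xrightarrow{\text{pp-sat}} X.R x'_j, \_ \xrightarrow{\text{ctrl-isync}} W t, \tau_X + 1$$
Lemma 2 (iv) tells us that this is impossible. Thus $P.W x_i, \_ \not\xrightarrow{\mathrm{rf}} X.R x'_j, \_$.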
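If $i < j$, then $i \le j - \mathrm{size}(x)$, and there must be an instance $P'$ of push s.t. $P'.W b, j + 1 \xrightarrow{\text{po-loc}} W b, \beta_X \xrightarrow{\mathrm{rf}} X.R b, \beta_X$ (so that index j is accessible in X). $P'$ cannot occur before P in program order because, as above, we would have $\tau_{P'} \le \tau_P \le i$ on the one hand, and $i \le j - \mathrm{size}(x) < \tau_{P'}$ on the other hand. The underlying array also monotonically increases in size, so the inequality still holds if the sizes of P and $P'$ differ.
Hence $P'$ occurs after P. Furthermore $W x''_j, \_ \in P'$. If $x$ in P and $x''$ in $P'$ refer to different arrays, then a resize operation R must precede $P'$, s.t.
$$W a, \&x \xrightarrow{\text{po-loc}} P.R a, \&x \xrightarrow{\text{po-loc}} R.W a, \&x'' \xrightarrow{\mathrm{sync}} P'.W x''_j, \_ \xrightarrow{\mathrm{sync}} W b, j + 1 \xrightarrow{\text{po-loc}} W b, \beta_X \xrightarrow{\mathrm{rf}} X.R b, \beta_X \xrightarrow{\text{ctrl-isync}} R a, \&x' \xrightarrow{\mathrm{addr}} R x'_j, \_$$
hence
$$W a, \&x \xrightarrow{\mathrm{co}} R.W a, \&x'' \xrightarrow{\mathrm{sync}} W b, \beta_X \xrightarrow{\text{pp-sat}} X.R b, \beta_X$$
From Lemma 2 (iii), $W b, \beta_X \xrightarrow{\text{pp-sat}} X.R a, \&x'$; Lemma 2 (ii) concludes that $W a, \&x \not\xrightarrow{\mathrm{rf}} X.R a, \&x'$. Since all resize operations allocate new arrays, $\&x' \neq \&x$, which contradicts our premises. Otherwise, $x$ and $x''$ refer to the same array, hence $W x_i, \_ \xrightarrow{\text{po-loc}} W x''_j, \_$, and we get:
$$P.W x_i, \_ \xrightarrow{\text{po-loc}} P'.W x''_j, \_ \xrightarrow{\mathrm{sync}} W b, j + 1 \xrightarrow{\text{po-loc}} W b, \beta_X \xrightarrow{\mathrm{rf}} X.R b, \beta_X \xrightarrow{\text{ctrl-isync}} R x'_j, \_$$
It follows from Lemmas 3 (i) and 2 (iii) that:
$$P.W x_i, \_ \xrightarrow{\mathrm{co}} W x''_j, \_ \xrightarrow{\text{pp-sat}} R x'_j, \_$$
Hence, from Lemma 2 (ii), $W x_i, \_ \not\xrightarrow{\mathrm{rf}} R x'_j, \_$.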
Corollary 1. Given a significant write $W\xi_i^v$ and a significant read $R x'_j, \_$: $i \neq j \implies W\xi_i^v \not\xrightarrow{\mathrm{r}} R x'_j, \_$.
Proof. If $i \neq j$, we know from Lemma 5 that $W\xi_i^v \not\xrightarrow{\mathrm{rf}} R x'_j, \_$. Furthermore, all copies, which happen during a resize operation, copy from and to the same index. Since there are fewer copies than the size of the expanded array, there can be no two copies writing to the same memory location in the new array. Hence, there can be no sequence of copies from $W\xi_i^v$ to $R x'_j, \_$.
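The observation that resize copies go from and to the same index can be sketched in C as follows. This is a simplified, single-threaded illustration with hypothetical names (`circ_array`, `ca_new`, `ca_get`, `ca_put`, `ca_grow`); it leaves out the atomics and barriers of the actual implementation, and it assumes that the array doubles in size and that the copied range is [t, b).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical circular array: logical index i lives in slot i mod size. */
typedef struct {
    size_t size;
    int *slots;
} circ_array;

circ_array *ca_new(size_t size)
{
    circ_array *a = malloc(sizeof *a);
    if (a == NULL)
        return NULL;
    a->size = size;
    a->slots = calloc(size, sizeof *a->slots);
    if (a->slots == NULL) {
        free(a);
        return NULL;
    }
    return a;
}

void ca_free(circ_array *a)
{
    if (a != NULL) {
        free(a->slots);
        free(a);
    }
}

int ca_get(const circ_array *a, size_t i) { return a->slots[i % a->size]; }
void ca_put(circ_array *a, size_t i, int v) { a->slots[i % a->size] = v; }

/* Allocate a fresh array of twice the size and copy the range [t, b).
 * Each copy reads logical index i and writes logical index i; because
 * b - t is at most the old size, which is below the new size, no two
 * copies land in the same slot of the new array. The old array is left
 * untouched and is never reused. */
circ_array *ca_grow(const circ_array *old, size_t t, size_t b)
{
    assert(t <= b && b - t <= old->size);
    circ_array *fresh = ca_new(2 * old->size);
    if (fresh == NULL)
        return NULL;
    for (size_t i = t; i < b; i++)
        ca_put(fresh, i, ca_get(old, i));
    return fresh;
}
```

Since a value only ever moves between slots that represent the same logical index, a chain of copies starting at index i cannot end at a read of index j with j different from i, which is the second half of the argument above.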
Lemma 6. Given a significant write $W\xi_i^u$ and a significant read $R\xi_i^v$:
(i) $W\xi_i^u \xrightarrow{\text{pp-sat}} R a, \&x \xrightarrow{\mathrm{addr}} R x_i, \xi_i^v \implies u \le v$
(ii) $0 < u \le v \implies W\xi_i^u \xrightarrow{\text{pp-sat}} R x_i, \xi_i^v$
Proof. We prove each point separately:
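(i) Suppose $v < u$. We define $W'.W x_i, \xi_i^v$ as follows.
If $v = 0$, $\xi_i^v$ is an undefined value; let $W'.W x_i, \xi_i^0 \xrightarrow{\mathrm{rf}} R x_i, \xi_i^v$ be the initialization of $x_i$. $W'.W x_i, \xi_i^0$ comes before $W\xi_i^u$ in program order.
Otherwise, $0 < v < u$. Let $W.W\xi_i^v$ be the significant write s.t. $W.W\xi_i^v \xrightarrow{\mathrm{r}} R x_i, \xi_i^v$. In other words, there exists a sequence of copies carrying the value of $\xi_i^v$ to $R x_i, \xi_i^v$. That sequence ends with a write $W'.W x_i, \xi_i^v \xrightarrow{\mathrm{rf}} R x_i, \xi_i^v$. Moreover, according to the definition of $(\xi_i^v)$ and the semantics of resizing, $W.W\xi_i^v$ and $W'.W x_i, \xi_i^v$ must come before $W\xi_i^u$ in program order.
We have two cases: either $W\xi_i^u$ and $R x_i, \xi_i^v$ refer to the same memory location or they do not.
Assume that they refer to the same memory location $x_i$. Then it must be that $W'.W x_i, \xi_i^v \xrightarrow{\text{po-loc}} W x_i, \xi_i^u$, and we have:
$$W'.W x_i, \xi_i^v \xrightarrow{\mathrm{co}} W\xi_i^u \xrightarrow{\text{pp-sat}} R a, \&x \xrightarrow{\mathrm{addr}} R x_i, \xi_i^v$$
Hence, from Lemma 2 (ii), $W'.W x_i, \xi_i^v \not\xrightarrow{\mathrm{rf}} R x_i, \xi_i^v$. Impossible.
Conversely, assume that they do not refer to the same memory location. Then there must be a resize operation between $W'.W x_i, \xi_i^v$ and $W\xi_i^u$:
$$W a, \&x \xrightarrow{\mathrm{sync}} W'.W x_i, \xi_i^v \xrightarrow{\mathrm{sync}} W a, \&x' \xrightarrow{\mathrm{sync}} W x'_i, \xi_i^u \xrightarrow{\text{pp-sat}} R a, \&x \xrightarrow{\mathrm{addr}} R x_i, \xi_i^v$$
Hence, from Lemma 3 (i), $W a, \&x \xrightarrow{\mathrm{co}} W a, \&x' \xrightarrow{\text{pp-sat}} R a, \&x$. And from Lemma 2 (ii), $W a, \&x \not\xrightarrow{\mathrm{rf}} R a, \&x$. Since there is only one write $W a, \&x$ that gives the value &x to a, we have a contradiction.
(ii) There exists a write $W.W\xi_i^v$ s.t. $W.W\xi_i^v \xrightarrow{\mathrm{r}} R\xi_i^v$, and a sequence of copies carrying the value of $\xi_i^v$ to $R\xi_i^v$. That sequence ends with a write $W'.W\xi_i^v \xrightarrow{\mathrm{rf}} R\xi_i^v$. Since $u \le v$, $W\xi_i^u \xrightarrow{\mathrm{po}} W.W\xi_i^v$ by definition of $(\xi_i^v)$. Thanks to the barrier after $W\xi_i^u$ in push, $W\xi_i^u \xrightarrow{\mathrm{sync}} W'.W\xi_i^v \xrightarrow{\mathrm{rf}} R\xi_i^v$. From Lemma 3 (i), we get $W\xi_i^u \xrightarrow{\text{pp-sat}} R\xi_i^v$.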
Corollary 2 (Well-defined significant reads). Given a significant read $R x_i, \xi$, $\xi = \xi_i^v$ for some $v > 0$.
Proof. Let X be the successful instance of take or steal s.t. $R x_i, \xi \in X$. Suppose $\xi \neq \xi_i^v$; then $\xi = \bot$ can only be an undefined value from the uninitialized array, prior to copying. Indeed, if $x_i$ is not affected by copying, then it must be one of the new slots allocated by the resizing, hence its initial value is $\xi_i^0$. Let R be the push operation that allocates the array x. There exists a $\xi_i^u$ such that:
$$W x_i, \bot \xrightarrow{\mathrm{co}} R.W x_i, \xi_i^u \xrightarrow{\mathrm{sync}} W a, \&x \xrightarrow{\mathrm{rf}} X.R a, \&x \xrightarrow{\mathrm{addr}} R x_i, \xi$$
It follows from Lemmas 2 (iii), 3 (i) and 2 (ii) that $W x_i, \bot \not\xrightarrow{\mathrm{rf}} R x_i, \xi$. Impossible.
Hence, $\xi = \xi_i^v$. We have $R b, \beta \in X$ and $\beta \ge i + 1 > 0$, for X is successful. Hence, there is an instance of push P s.t. $P.W b, \beta \xrightarrow{\mathrm{rf}} X.R b, \beta$. Since $\beta \ge i + 1$, either $\beta = i + 1$ and $W\xi_i^u \in P$, or there must be an instance of push that contains a significant write $W\xi_i^u$ and comes before P in program order. In both cases, $W\xi_i^u$ belongs to a push operation, hence $u > 0$. Moreover, thanks to the barrier after a significant write in push, $W\xi_i^u \xrightarrow{\mathrm{sync}} P.W b, \beta$. If X is an instance of take, $P.W b, \beta \xrightarrow{\mathrm{po}} X.R\xi_i^v$; otherwise, $P.W b, \beta \xrightarrow{\mathrm{rf}} X.R b, \beta \xrightarrow{\text{ctrl-isync}} R\xi_i^v$ and Lemma 3 (ii) gives $P.W b, \beta \xrightarrow{\text{pp-sat}} X.R\xi_i^v$. In both cases, $W\xi_i^u \xrightarrow{\mathrm{sync}} P.W b, \beta \xrightarrow{\text{pp-sat}} X.R\xi_i^v$, hence, by Lemmas 3 (i) and 6, $0 < u \le v$.
4.4 Uniqueness of significant reads
The results from the previous section establish that two significant reads at different indexes cannot retrieve the same element $\xi_i^v$. The only possible cause of duplicate significant reads is thus reduced to the case where the reads access the same index i.
Theorem 1 (Work-stealing: uniqueness of significant reads). Consider a worker thread executing a sequence of push and take operations, and a finite number of thief threads each executing steal operations, all against the same deque. If X and Y are two distinct successful instances of steal or take,
$$\forall R\xi_i^v \in X,\ \forall R\xi_{i'}^{v'} \in Y,\quad i \neq i' \vee v \neq v'$$
Lemma 7. Given $S_1$ and $S_2$ distinct successful instances of steal,
$$\forall R\xi_i^v \in S_1,\ \forall R\xi_{i'}^{v'} \in S_2,\quad i \neq i'$$
Proof. All writes to t atomically increment it (by atomicity of CAS). Hence two successful steal operations cannot write (thus read) the same value of t. Reads from x in a steal operation access the index given by the value of the t variable. Hence $R t, i \in S_1$ and $R t, i' \in S_2$ imply $i \neq i'$.
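The argument of Lemma 7 can be illustrated with a small C11 sketch of the final step of a steal. The names `deque_top`, `steal_commit` and `commits_for_same_index` are hypothetical, and the sketch assumes only what the proof states: a steal succeeds exactly when its CAS moves t from the index it read to that index plus one. The two attempts are run one after the other here for simplicity; the atomicity of the CAS is what carries the same conclusion over to concurrent thieves.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical top index t shared by all thieves. */
atomic_size_t deque_top;

/* Final step of a steal attempt that earlier read index `seen` from t:
 * it succeeds only if t still holds `seen`, in which case t becomes
 * seen + 1 atomically. */
bool steal_commit(size_t seen)
{
    assert(seen < SIZE_MAX); /* the incremented index must not wrap */
    size_t expected = seen;
    return atomic_compare_exchange_strong(&deque_top, &expected, seen + 1);
}

/* Two attempts that both observed the same index: returns how many succeed. */
int commits_for_same_index(size_t seen)
{
    atomic_store(&deque_top, seen);
    int first = steal_commit(seen);
    int second = steal_commit(seen);
    return first + second;
}
```

Whichever attempt commits first changes t, so the other one fails its comparison and is not a successful steal; two successful steals therefore read different indexes.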