LOGICAL TIME
in DISTRIBUTED SYSTEMS
Michel RAYNAL raynal@irisa.fr
IRISA, Université de Rennes, France
Contents
• Scalar (linear) time
• Vector time
• Matrix time
• Using virtual time
Part I
SCALAR/LINEAR TIME
- Lamport, L., Time, Clocks and the Ordering of Events in a Distributed System.
Communications of the ACM, 21(7):558-565, 1978
Aim
• Build a logical time in order to be able to associate a consistent date with events, i.e.,
e −→ f ⇒ date(e) < date(f)
• Why logical time? Because there is no notion of physical time in a pure asynchronous system (no bound on process speeds and message transfer delays)
Even if physical time were available, using it to date events consistently would be more difficult
The fundamental constraint
Logical time has to increase along causal paths
• How logical time is used: when it produces a new event, a process associates the current clock value with that event
• Idea: consider the set of integers for the time domain
? Each process pi has a local clock hi
? From a local point of view:
hi has to measure the progress of pi
? From a global point of view:
hi has to measure the progress of the whole computation
Lamport clocks (1978)
Local progress rule:
before producing an internal event:
  hi ← hi + 1   % date of the internal event %

Sending rule:
when sending a message m to pj:
  hi ← hi + 1;   % date of the send event %
  send (m, hi) to pj

Receiving rule:
when receiving a message (m, h) from pj:
  hi ← max(hi, h);
  hi ← hi + 1   % date of the receive event %
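A minimal sketch of these three rules in Python; the class and method names are illustrative, not from the original:

class LamportClock:
    """Scalar (Lamport) clock for one process, following the three rules above."""
    def __init__(self):
        self.h = 0

    def internal_event(self):
        self.h += 1                   # local progress rule
        return self.h                 # date of the internal event

    def send_event(self):
        self.h += 1                   # sending rule: tick, then piggyback the date
        return self.h                 # value attached to the outgoing message

    def receive_event(self, h_msg):
        self.h = max(self.h, h_msg)   # receiving rule: first synchronize...
        self.h += 1                   # ...then tick
        return self.h                 # date of the receive event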
Illustration
(Figure: three processes p1, p2, p3 exchanging messages; each event is labeled with the current value of the local clock h1, h2, h3.)
Observation: (date(e) = x) ⇔ there are x events on the longest causal path ending at e
Build a total order on all the events
• Motivation: Resource allocation problems
• Observations
? (date(e) < date(f)) ⇒ ¬(f −→ e)
? (date(e) < date(f)) ∧ ¬(e −→ f) is possible
In that case, e and f are independent events, but this cannot be concluded from their dates alone
? (date(e) = date(f)) ⇒ e||f
• Associate a timestamp (h, i) with each event e where:
? h = local clock value when e is produced (date)
? i = identity of the process that produced e
Total order definition
• Let e and f be events timestamped (h, i) and (k, j), respectively
(e −→TO f) def= (h < k) ∨ ((h = k) ∧ (i < j))
• This is the (well-known) lexicographical ordering
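A direct transcription of this lexicographical test in Python (function name illustrative); note that it is exactly the built-in comparison on the tuples (h, i) and (k, j):

def before_total_order(ts_e, ts_f):
    """ts_e = (h, i), ts_f = (k, j): lexicographical order on (clock value, process identity)."""
    (h, i), (k, j) = ts_e, ts_f
    return h < k or (h == k and i < j)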
Illustration
(Figure: two processes p1 and p2; the lattice of consistent global states Σ from Σinit = [0,0] to Σ = [2,2], and the events with their timestamps (h, i); the total order on timestamps selects one particular observation of the computation.)
Lamport’s timestamps capture an observation (among all possible observations)
A theorem on the space of scalar clocks
• Let C be the set of all the scalar clock systems that are consistent (with respect to the causality relation)
• Let e and f be any two events of a distributed execution
• ∀C ∈ C: e −→ev f ⇒ dateC(e) < dateC(f) (Consistency)
• e||f ⇔ ∃C ∈ C : dateC(e) = dateC(f)
• Or equivalently
e||f ⇔ ∃C1, C2 ∈ C :
dateC1(e) ≤ dateC1(f) ∧ dateC2(e) ≥ dateC2(f)
Part II
SCALAR CLOCKS in ACTION
The Mutex problem
• Enrich the underlying system with new operations
• These operations define a service
• Here, two operations: acquire() and release()
• Process behavior: abstracted as a 3-state automaton, statei ∈ {out, asking, in} (all other details are irrelevant)
(Figure: the three-state automaton out → asking → in → out)
The Mutex problem: definition
• Definition
? Safety: no two processes are concurrently in the CS
? Liveness: any request is eventually granted
• Algorithms
? Permission-based (Individual vs arbiter permissions)
? Token-based
- Raynal M., Algorithms for mutual exclusion. The MIT Press, 1986
- Anderson J., Kim Y.-J. and Herman T., Shared-memory mutual exclusion: major research trends since 1986. Distributed Computing, 16(2-3): 75-110, 2003
Individual permissions: principles
• When it wants to enter the CS, pi asks for permissions
• When it has received all the permissions, pi enters
• Ri = the set of processes from which pi needs the permission to enter the CS
• Individual permission: Ri = {1, . . . , n} \ {i}
• When pi gives its permission to a process pk, the meaning of the permission is “As far as I am concerned, you can enter” (a permission is consequently “individual”)
• Core of the algorithm: the way permissions are granted
• The algorithm manages bilateral conflicts
Granting a permission
(Figure: pi, with statei = asking, requests the permission of pj and pk; pj, with statej = out, sends perm by return; pk, with statek ≠ out, is in conflict with pi.)
• Solve the conflict between pi and pk
• A solution: timestamp the requests, and use the total order on timestamps to establish a system-wide consistent priority
? If pk does not have priority: it sends its permission to pi by return
From mechanisms to properties
• Safety: ∀i ≠ j : j ∈ Ri ∧ i ∈ Rj
• Liveness: based on a timestamping mechanism
Ricart-Agrawala mutex algorithm: local variables
• statei ∈ {out, asking, in}, init out
• hi, lasti: integers, init 0
• prioi: boolean
• waiting_fromi, postponedi: sets
Structure
(Figure: the operations acquire() and release(), the messages req(k, j) and perm(j), and the local variables they act upon.)
Ricart-Agrawala mutex algorithm (1)
operation acquire() issued by pi
  statei ← asking; postponedi ← ∅;
  hi ← hi + 1; lasti ← hi; waiting_fromi ← Ri;
  for each j ≠ i do send req(lasti, i) to pj end for;
  wait (waiting_fromi = ∅);
  statei ← in

when perm(j) is received
  waiting_fromi ← waiting_fromi \ {j}
Ricart-Agrawala mutex algorithm (2)
when req(k, j) is received
  hi ← max(hi, k) + 1;
  prioi ← (statei ≠ out) ∧ ((lasti, i) < (k, j));
  if prioi then postponedi ← postponedi ∪ {j}
  else send perm(i) to pj
  end if

operation release()
  for each j ∈ postponedi do send perm(i) to pj end for;
  statei ← out
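A sketch of both parts of the algorithm in Python; the class, the send callback and the message tuples are illustrative assumptions, network plumbing is omitted, and the blocking wait of acquire() is left to the caller:

class RicartAgrawalaNode:
    """Sketch of the Ricart-Agrawala rules for process i among n processes."""
    def __init__(self, i, n, send):
        self.i, self.n = i, n
        self.send = send                 # send(dest, message) supplied by the runtime
        self.state = "out"
        self.h = 0                       # scalar (Lamport) clock
        self.last = 0                    # timestamp of the current request
        self.waiting_from = set()
        self.postponed = set()

    def acquire(self):
        self.state = "asking"
        self.postponed = set()
        self.h += 1
        self.last = self.h
        self.waiting_from = {j for j in range(self.n) if j != self.i}
        for j in self.waiting_from:
            self.send(j, ("req", self.last, self.i))
        # the caller then waits until waiting_from becomes empty (state moves to "in")

    def on_perm(self, j):
        self.waiting_from.discard(j)
        if not self.waiting_from and self.state == "asking":
            self.state = "in"

    def on_req(self, k, j):
        self.h = max(self.h, k) + 1
        prio = self.state != "out" and (self.last, self.i) < (k, j)
        if prio:
            self.postponed.add(j)        # answer deferred until release()
        else:
            self.send(j, ("perm", self.i))

    def release(self):
        for j in self.postponed:
            self.send(j, ("perm", self.i))
        self.postponed = set()
        self.state = "out"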
Clock values
• hi can increase forever
• Aim: limit its increase
• As only requests have to be timestamped we can replace hi ← max(hi, k) + 1 with hi ← max(hi, k)
• As we are about to see in the proof, it is possible to further limit the increase in the acquire() operation:
The two statements [hi ← hi+1; lasti ← hi] are replaced by lasti ← hi + 1, which does not increase hi!
• These two modifications allow obtaining an algorithm in which all variables are bounded: clock values can be implemented modulo 2n − 1
Ricart-Agrawala mutex algorithm
operation acquire() issued by pi
  statei ← asking; postponedi ← ∅;
  lasti ← hi + 1;   % replaces hi ← hi + 1; lasti ← hi %
  waiting_fromi ← Ri;
  for each j ≠ i do send req(lasti, i) to pj end for;
  wait (waiting_fromi = ∅);
  statei ← in

when req(k, j) is received
  hi ← max(hi, k);   % replaces hi ← max(hi, k) + 1 %
  prioi ← (statei ≠ out) ∧ ((lasti, i) < (k, j));
  if prioi then postponedi ← postponedi ∪ {j}
  else send perm(i) to pj
  end if
Proof: on the blackboard!
• Safety: by contradiction
• Liveness: in two steps
? No deadlock: at least one process acquires the CS
? No starvation: eventually any requesting process will be granted the CS
Cost
• Message cost: 2(n − 1) messages per CS use
• Improvement:
? The algorithm can be improved in such a way that a CS use costs between 0 and 2(n − 1) messages
? Idea: every pair of processes manages a single permission (token) to solve their conflicts
• Time: consider each message takes one time unit
? Heavy load: one time unit
? Light load: two time units
Variants
• Ring structure
? Forward a request = give its permission
? Cost: n messages
• Assumption ∆ on transfer delays
? Give its permission = not to answer
? Not to give its permission = send by return a negative ack, cancel it when exiting the CS
On mutual exclusion
• Permission-based
? Individual permission approach
? Arbiter permission approach
Quorums and three-way handshake algorithms
• Token-based
• A continuous view
- Raynal M., Algorithms for mutual exclusion, The MIT Press, 1986
- Anderson J., Kim Y.-J. and Herman T., Shared-memory mutual exclusion: major research trends since 1986, Distributed Computing, 16(2-3): 75-110, 2003
Part III
VECTOR TIME
- Fidge C., Timestamps in Message-Passing Systems that Preserve the Partial Ordering, Proc. 11th Australian Computer Science Conference, pp. 56-66, 1988
- Mattern F., Virtual time and global states of distributed systems. Proc. Int’l workshop on Parallel and Distributed Systems, North-Holland, pp. 215-226, (Cosnard, Quinton, Raynal and Robert Eds), 1988
- Baldoni R. and Raynal M. Fundamentals of Distributed Computing: A Practical Tour of Vector-Clock Systems. IEEE Distributed Systems Online, 3(2):1-18, 2002
Aim: capture the causality relation
• Scalar (linear) clock system
? Respects causality
? But does not capture it
• Find a dating system that captures causality exactly:
(e −→ev f) ⇔ date(e) < date(f)
(e || f) ⇔ date(e) and date(f) cannot be compared
Vector clock: intuition
• Observation: a process pi can always measure its progress by counting the number of events it has produced since the beginning
This number can be seen as its logical local clock. There is one such clock per process
• The time domain is consequently n-dimensional: there is one dimension associated with each process
• Hence the idea of vector clocks: each process pi manages a vector V Ci[1..n] that represents its view of the global time progress
V Ci is a digest of the current causal past of pi
Vector clock: definition
• V Ci[i] = nb of events issued by pi
• V Ci[j] = nb of events issued by pj, as known by pi
Formally, let e be the current event produced by pi:
V Ci[j] = |{f | f −→ e ∧ f has been issued by pj}|
• Notation: component-wise maximum/minimum
max(V 1, V 2) = [max(V 1[1], V 2[1]), · · · , max(V 1[n], V 2[n])]
min(V 1, V 2) = [min(V 1[1], V 2[1]), · · · , min(V 1[n], V 2[n])]
Vector clock: algorithm
Local progress rule:
before producing an internal event:
  V Ci[i] ← V Ci[i] + 1

Sending rule:
when sending a message m to pj:
  V Ci[i] ← V Ci[i] + 1;
  send (m, V Ci) to pj

Receiving rule:
when receiving a message (m, V C) from pj:
  V Ci[i] ← V Ci[i] + 1;
  V Ci ← max(V Ci, V C)
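The same three rules sketched in Python (class and method names are illustrative):

class VectorClock:
    """Vector clock for process i among n processes, following the three rules above."""
    def __init__(self, i, n):
        self.i = i
        self.vc = [0] * n

    def internal_event(self):
        self.vc[self.i] += 1              # local progress rule
        return list(self.vc)              # timestamp of the internal event

    def send_event(self):
        self.vc[self.i] += 1              # sending rule: tick, then piggyback the vector
        return list(self.vc)              # vector attached to the outgoing message

    def receive_event(self, vc_msg):
        self.vc[self.i] += 1              # receiving rule: tick...
        self.vc = [max(a, b) for a, b in zip(self.vc, vc_msg)]  # ...and take the component-wise max
        return list(self.vc)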
Illustration
(Figure: four processes p1, p2, p3, p4; each event is labeled with the current vector clock value, starting from [0,0,0,0].)
∀i, k: V Ci[k] is not decreasing, and V Ci[k] ≤ V Ck[k]
A few simple definitions
• V 1 ≤ V 2 def= ∀k : V 1[k] ≤ V 2[k]
• V 1 < V 2 def= (V 1 ≤ V 2) ∧ (V 1 ≠ V 2)
• V 1||V 2 def= ¬(V 1 ≤ V 2) ∧ ¬(V 2 ≤ V 1)
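These three relations translate directly into Python (function names are illustrative):

def leq(v1, v2):
    """V1 <= V2: component-wise less-or-equal."""
    return all(a <= b for a, b in zip(v1, v2))

def lt(v1, v2):
    """V1 < V2: less-or-equal and not equal."""
    return leq(v1, v2) and v1 != v2

def concurrent(v1, v2):
    """V1 || V2: neither vector dominates the other."""
    return not leq(v1, v2) and not leq(v2, v1)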
The vector clock properties
Let e with date(e) = Ve , and f with date(f) = Vf
(e −→ev f) ⇔ (Ve < Vf)
(e || f) ⇔ (Ve || Vf)
These are the fundamental properties provided by vector clocks
Proof (1)
• Theorem 1: Vector clocks increase along causal paths
• Theorem 2: (e −→ev f) ⇔ (Ve < Vf)
? (e −→ev f) ⇒ (Ve < Vf): follows from Theorem 1.
? (Ve < Vf) ⇒ (e −→ev f):
Let pi be the process that issued the event e. We have (Ve < Vf) ⇒ (Ve[i] ≤ Vf[i]). As only pi can entail an increase of V [i] (for any vector V ), it follows that there is a causal path from e to f.
Proof and cost
• Theorem 3: (e || f) ⇔ (Ve || Vf).
(e || f) def= ¬(e −→ev f) ∧ ¬(f −→ev e) (definition).
? ¬(e −→ev f) ⇒ ¬(Ve < Vf).
? ¬(f −→ev e) ⇒ ¬(Vf < Ve).
From which follows that Vf and Ve cannot be compared.
• Theorem 4: The previous (causality/independence) predicates require O(n) comparisons
Refining the causality test
• Let us associate a timestamp (Ve, i) with each event e, where pi is the process that issued e
• Let e be timestamped (Ve, i) and f be timestamped (Vf, j)
• Refined causality test:
(e −→ev f) ⇔ (Ve[i] ≤ Vf[i])
• Refined independence test:
(e || f) ⇔ (Ve[i] > Vf[i]) ∧ (Vf[j] > Ve[j])
• Theorem 5: The previous (causality/independence) predicates require O(1) comparisons (scalability of the test)
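A sketch of the refined tests in Python; ve, vf are the vector timestamps of e and f, and i, j the identities of their issuing processes (function names are illustrative):

def happened_before_refined(ve, i, vf):
    """e -> f test using the timestamp (Ve, i) of e: a single comparison (e and f assumed distinct)."""
    return ve[i] <= vf[i]

def concurrent_refined(ve, i, vf, j):
    """e || f test using the timestamps (Ve, i) and (Vf, j)."""
    return ve[i] > vf[i] and vf[j] > ve[j]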
A process is a “local” observer
(Figure: the same two-process computation; each process, through the successive values σi0, σi1, . . . of its local states and the associated vector dates, observes the computation from its own viewpoint.)
A process is an observer of the computation
A vector clock denotes a global state
(Figure: the same computation; the vector dates Σa = [2,1] and Σb = [1,2] each denote a consistent global state, and so does Σc = max(Σa, Σb) = [2,2].)
The development of logical time (1)
(Figure: a message m from pi to pj creates a causal path e → f; at the sending event Vi[i] = s, and at the receipt Vj[j] = r and Vj[i] ≥ s.)
• m: sent by pi at Vi[i] = s, received by pj at Vj[j] = r
• “Knowing” the receipt of m ⇒ “knowing” its sending
• I.e., for any event x: (Vx[j] ≥ r) ⇒ (Vx[i] ≥ s)
• Due to m it is impossible to have (Vx[j] ≥ r)∧(Vx[i] < s)
The development of logical time (2)
(Figure: two processes p1 and p2 exchanging messages m1, m2, m3, m4; the successive vector dates are [1,0], [2,0], [3,0], [4,2], [5,2], [6,5] on p1 and [0,1], [0,2], [0,3], [3,4], [3,5], [3,6] on p2.)
Message m1 makes it impossible to have (V[1] < 2) ∧ (V[2] ≥ 6)
Part IV
VECTOR CLOCKS in ACTION (1)
CAUSAL ORDER ABSTRACTION
Causal order abstraction
- Birman K., Schiper A. and Stephenson P., Lightweight causal and atomic group multicast. ACM Transactions on Computer Systems, 9(3):272-314, 1991
- Raynal M., Schiper A. and Toueg S., The causal ordering abstraction and a simple way to implement it. Information Processing Letters, 39:343-351, 1991
• co broadcast(m): allows a process to send a message m to all the processes
• co deliver(): allows a process to deliver a message
Causal delivery: definition
• Termination: If a message is co broadcast, it is eventually co delivered (No loss)
• Integrity: A process co delivers a message m at most once (No duplication)
• Validity: If a process co delivers a message m, then m has been co broadcast (No spurious message)
• Causal Order:
co broadcast(m1) → co broadcast(m2)
⇒ co del(m1) → co del(m2)
Causal delivery: Why it is useful
• Capture causality
• Cooperative work
• Stronger than fifo channels
• But weaker than atomic broadcast
Atomic broadcast = total order delivery
Causal order: Example 1 (figure)
Causal order: Example 2 (figure)
Causal broadcast
V Ci[j] = nb messages broadcast by Pj (to pi’s knowledge)
(Figure: messages m1 to m5; e.g., V Cm2 = [1,1,0,0], V Cm3 = [1,1,1,0], V Cm4 = [1,2,0,0], V Cm5 = [1,2,2,0].)
Illustration
(Figure: three processes p1, p2, p3; each broadcast carries the sender's current vector V Ci, which counts, per process, the messages already co broadcast.)
RST algorithm
operation co broadcast(m)
  for each j ≠ i do send (m, V Ci) to pj end for;
  V Ci[i] ← V Ci[i] + 1

when (m, m.V C) is received from pj:
  wait until (∀k : V Ci[k] ≥ m.V C[k]);
  co deliver m to the application;
  V Ci[j] ← V Ci[j] + 1
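A sketch of the RST rules in Python; the class, the send_to and deliver callbacks are illustrative assumptions, and the blocking "wait until" is replaced by a pending queue re-examined after each receipt:

class CausalBroadcast:
    """RST causal-order broadcast logic for process i among n processes."""
    def __init__(self, i, n, send_to, deliver):
        self.i, self.n = i, n
        self.send_to = send_to          # send_to(j, payload) supplied by the runtime
        self.deliver = deliver          # application-level co_deliver callback
        self.vc = [0] * n               # vc[j] = nb of messages co_broadcast by pj, as known here
        self.pending = []               # received but not yet deliverable messages

    def co_broadcast(self, m):
        for j in range(self.n):
            if j != self.i:
                self.send_to(j, (m, list(self.vc)))
        self.vc[self.i] += 1

    def on_receive(self, m, m_vc, j):
        self.pending.append((m, m_vc, j))
        self._try_deliver()

    def _try_deliver(self):
        progress = True
        while progress:
            progress = False
            for entry in list(self.pending):
                m, m_vc, j = entry
                if all(self.vc[k] >= m_vc[k] for k in range(self.n)):  # delivery condition
                    self.deliver(m)
                    self.vc[j] += 1
                    self.pending.remove(entry)
                    progress = True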
Part V
VECTOR CLOCKS in ACTION (2)
PREDICATE DETECTION
Stable Local Predicate Detection (1)
• Local predicate LPi: on the local variables of a single process pi
• Stable predicate: once true, remains true
• A consistent global state Σ = (σ1, · · · , σn) satisfies the global predicate LP1 ∧ LP2 · · · ∧ LPn if ∀ i : (σi |= LPi)
Σ |= (∧i LPi) ⇔ ∧i (σi |= LPi)
Stable Local Predicate Detection (2)
• Problem: Design an algorithm that detects the first consistent global state that satisfies a conjunction of stable local predicates
• Constraints: Do not use additional control messages, Detection must be done on the fly
Stable Local Predicate Detection (3)
(Figure: three processes P1, P2, P3 exchanging messages m1 to m5; starting from the initial local states σ10, σ20, σ30, the consistent global state Σ = (σ1y1, σ2y2, σ3y3) is the one to be detected.)
Stable Local Predicate Detection (4)
(Figure: the same detection scenario reduced to the messages m1, m2, m3; Σ = (σ1y1, σ2y2, σ3y3) is the detected global state.)
Detection algorithm: local context of pi
• V Ci[1..n]: local vector clock
• SATi: set of process identities such that
  j ∈ SATi ⇔ pj entered a local state σjx from which LPj is true
• FIRSTi: first global state (as known by pi) in which all the local predicates LPj such that j ∈ SATi are satisfied
Detection algorithm (1)
procedure detected? is
  if SATi = {1, 2, . . . , n} then
    FIRSTi defines the first consistent global state Σ that satisfies ∧j LPj
  fi

procedure check LPi is
  if (σix |= LPi) then
    SATi := SATi ∪ {i}; FIRSTi := V Ci; donei := true;
    detected?
  fi
Detection algorithm (2)
(S1) when Pi produces an internal event e:
  V Ci[i] := V Ci[i] + 1;
  execute e and move to σ;
  if ¬donei then check LPi fi
Detection algorithm (3)
(S2) when Pi produces a send event e = send m to Pj:
  V Ci[i] := V Ci[i] + 1;
  move to σ;
  if ¬donei then check LPi fi;
  m.V C := V Ci; m.SAT := SATi; m.FIRST := FIRSTi;
  send (m) to Pj   % m carries m.V C, m.SAT and m.FIRST %
Detection algorithm (4)
(S3) when Pi produces a receive event e = receive(m):
  V Ci[i] := V Ci[i] + 1; V Ci := max(V Ci, m.V C);
  move to σ;   % by delivering m to the process %
  if ¬donei then check LPi fi;
  if ¬(m.SAT ⊆ SATi) then
    SATi := SATi ∪ m.SAT;
    FIRSTi := max(FIRSTi, m.FIRST);
    detected?
  fi
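A condensed sketch of rules S1-S3 for one process in Python; the class, the local_predicate callback and the way control data is piggybacked on messages are illustrative assumptions:

def vmax(v1, v2):
    return [max(a, b) for a, b in zip(v1, v2)]

class StablePredicateDetector:
    """On-the-fly detection of the first consistent global state satisfying all LPj."""
    def __init__(self, i, n, local_predicate):
        self.i, self.n = i, n
        self.local_predicate = local_predicate  # evaluates LPi on the current local state
        self.vc = [0] * n
        self.sat = set()                        # processes whose stable LPj is known to hold
        self.first = [0] * n                    # first global state known to satisfy them
        self.done = False

    def _check_lp(self, local_state):
        if not self.done and self.local_predicate(local_state):
            self.sat.add(self.i)
            self.first = list(self.vc)
            self.done = True
            self._detected()

    def _detected(self):
        if self.sat == set(range(self.n)):
            print("first consistent global state:", self.first)   # report detection (sketch)

    def internal_event(self, local_state):              # S1
        self.vc[self.i] += 1
        self._check_lp(local_state)

    def send_event(self, local_state):                   # S2: returns the control data to piggyback
        self.vc[self.i] += 1
        self._check_lp(local_state)
        return (list(self.vc), set(self.sat), list(self.first))

    def receive_event(self, local_state, m_vc, m_sat, m_first):   # S3
        self.vc[self.i] += 1
        self.vc = vmax(self.vc, m_vc)
        self._check_lp(local_state)
        if not m_sat.issubset(self.sat):
            self.sat |= m_sat
            self.first = vmax(self.first, m_first)
            self._detected()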
Part VI
LIMIT of VECTOR CLOCKS
DETECTION OF A SIMPLE EVENT PATTERN
- Raynal M., Illustrating the Use of Vector Clocks in Property Detection: an Example and a Counter-Example. Proc. 5th Int’l European Parallel Computing Conference (EUROPAR’99), Springer LNCS 1685, pp. 806-814, 1999
Pattern Recognition (1)
• Some internal events are tagged black, the others are tagged white
• All communication events are tagged white
• The problem: Given two black events s and t, does there exist a black event u such that s → u ∧ u → t?
• Formally, P(s, t) is the conjunction of:
? (black(s) ∧ black(t))
? (∃u ≠ s, t : (black(u) ∧ (s → u ∧ u → t)))
Pattern Recognition
(Figure: two computations in which P(s, t) is false (left) and true (right), respectively.
Counting white and black events: s.V C = (0,0,2) and t.V C = (3,4,2) in both cases.
Counting only black events: s.V C = (0,0,1) and t.V C = (2,1,1) in both cases.)
Non-Triviality of the Problem
(Figure: three processes P1, P2, P3 with black events a, b, c, d, u, s, t1 and t2.)
P(s, t2) is true while P(s, t1) is not
Decomposing the Predicate
• P(s, t) ≡ (∃u : P1(s, u, t) ∧ P2(s, u, t))
? P1(s, u, t) ≡ (black(s) ∧ black(u) ∧ black(t))
? P2(s, u, t) ≡ (s → u ∧ u → t)
Using Vector of Vector Clocks
• Only black events are relevant: count only them
• Event e:
? e.V C: its vector timestamp (counting only black events)
? e.M C[1..n]: an array of vector timestamps
e.M C[j] contains the vector timestamp of the last black event of Pj that causally precedes e
e.M C[j] can be considered as a “pointer” from e to the last event that precedes it on Pj
Example (1)
(Figure: the same three-process computation as above.)
t1.M C[1] = a.V C means that t1.M C[1] points to a
t1.M C[2] = b.V C means that t1.M C[2] points to b
t1.M C[3] = s.V C means that t1.M C[3] points to s
Example (2)
(Figure: the same three-process computation as above.)
t2.M C[1] = t1.V C means that t2.M C[1] points to t1
t2.M C[2] = u.V C means that t2.M C[2] points to u
t2.M C[3] = s.V C means that t2.M C[3] points to s
Operational Predicate
• Event s: s.V C and s.M C; event t: t.V C and t.M C
• P1 is trivially satisfied by any triple of events
• (∃u : P2(s, u, t)) ≡ (∃u : s → u → t) can be restated as:
(∃u : s → u → t) ≡ (∃u : s.V C < u.V C < t.V C)
(∃u : s → u → t) ≡ (∃pk : s.V C < t.M C[k] < t.V C)
As ∀k : t.M C[k] < t.V C, we get the operational predicate:
P(s, t) ≡ (∃k : s.V C < t.M C[k])
The Protocol (1)
(S1) when Pi produces a black event e:
  V Ci[i] := V Ci[i] + 1;   % one more black event on Pi %
  e.V C := V Ci; e.M C := M Ci;
  M Ci[i] := V Ci   % vector timestamp of Pi’s last black event %
The Protocol (2)
(S2) when Pi executes a send event e = send m to Pj:
  m.V C := V Ci; m.M C := M Ci;
  send (m) to Pj   % m carries m.V C and m.M C %
The Protocol (3)
(S3) when Pi executes a receive event e = receive(m):
  V Ci := max(V Ci, m.V C);   % update of the local vector clock %
  ∀ k : M Ci[k] := max(M Ci[k], m.M C[k])   % record vector timestamps of last black predecessors %
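A sketch in Python of the bookkeeping (S1)-(S3) and of the operational predicate; the class and function names are illustrative:

class BlackEventTracker:
    """Tracks, for process i, the black-event vector clock and the vector
    timestamps of the last black predecessors on every process (S1-S3)."""
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.vc = [0] * n                       # counts black events only
        self.mc = [[0] * n for _ in range(n)]   # mc[j] = VC of the last black event of Pj known here

    def black_event(self):                      # S1
        self.vc[self.i] += 1
        e_vc = list(self.vc)
        e_mc = [list(row) for row in self.mc]   # e.MC taken before updating MC[i]
        self.mc[self.i] = list(self.vc)         # this event becomes Pi's last black event
        return e_vc, e_mc                       # timestamp associated with the black event

    def send_event(self):                       # S2: values piggybacked on the message
        return list(self.vc), [list(row) for row in self.mc]

    def receive_event(self, m_vc, m_mc):        # S3
        self.vc = [max(a, b) for a, b in zip(self.vc, m_vc)]
        for k in range(self.n):
            self.mc[k] = [max(a, b) for a, b in zip(self.mc[k], m_mc[k])]

def strictly_less(v1, v2):
    return all(a <= b for a, b in zip(v1, v2)) and v1 != v2

def pattern_holds(s_vc, t_mc):
    """Operational predicate P(s, t): there is a black u with s -> u -> t."""
    return any(strictly_less(s_vc, t_mc[k]) for k in range(len(t_mc)))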
What has been learnt
• Power of vector clocks: to track (counter-based) causality: “First Order” predecessor tracking
• Limitation of vector clocks: to solve problems where causality cannot be reduced to event counting: “Second Order” (or more) predecessor tracking
Part VII
VECTOR CLOCKS in ACTION (3)
DETERMINING IMMEDIATE PREDECESSORS
- Anceaume E., Helary J.-M. and Raynal M., A Note on the Determination of the Immediate Predecessors in a Distributed Computation. Int. Journal of Foundations of Computer Science (IJFCS), 13(6):865-972, 2002
- Helary J.-M., Raynal M., Melideo G. and Baldoni R., Efficient Causality-Tracking Timestamping. IEEE Transactions on Knowledge and Data Engineering, 15(5):1239-1250, 2003
Relevant Events
• At some abstraction level only some events of a dis- tributed computation are relevant
• Let R ⊆ H be the set of relevant events
• Let → be the relation on R defined in the following way:
∀ (e, f) ∈ R × R : (e → f) ⇔ (e −→ev f).
• The poset (R, →) constitutes an abstraction of the distributed computation
Without loss of generality we consider that the set of relevant events is a subset of the internal events (if a communication event has to be observed, a relevant internal event can be generated just after the corresponding communication event)
A Distributed Computation
(Figure: a distributed computation involving three processes P1, P2, P3.)
Vector Clocks (2)
VC0 V Ci[1..n] is initialized to [0, . . . , 0]
VC1 Each time pi produces a relevant event e:
  ? It increments its vector clock entry V Ci[i] to indicate its progress: V Ci[i] := V Ci[i] + 1
  ? It associates with e its timestamp e.V C = V Ci
VC2 When a process pi sends a message m, it attaches to it the current value of V Ci (let m.V C denote this value)
VC3 When pi receives a message m, it updates its vector clock: ∀ x : V Ci[x] := max(V Ci[x], m.V C[x])
Vector Clocks (3)
• V Ci = current knowledge of pi on the progress of each process pk (measured by V Ci[k])
• More precisely: V Ci[k] = number of relevant events produced by pk and known by pi
Vector Clocks: Example
(Figure: three processes P1, P2, P3; the relevant events (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2) are labeled with the corresponding vector clock values, e.g., [1,0,0], [1,1,0], [2,0,1], [1,1,2], [3,2,1], [2,2,1], [2,3,1].)
Immediate Predecessor Tracking: the Problem
• Given two relevant events e and f, we say that e is an immediate predecessor of f if:
? e → f, and
? ∄ relevant event g such that e → g → f
• The Immediate Predecessor Tracking (IPT) problem consists in associating with each relevant event e the set of relevant events that are its immediate predecessors
Moreover, this has to be done on the fly and without additional control message (i.e., without modifying the communication pattern of the computation)
Immediate Predecessor Tracking: Why?
• Capture the very structure of the causal past of each event
• Allow the analysis of distributed computations (e.g., detection of global predicates, analysis of control flows)
Distributed Computation and its Reduction
(Figure: the same computation and its reduction to the poset of relevant events (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2).)
Transitive Reduction (Hasse Diagram)
(Figure: the Hasse diagram (transitive reduction) of the relevant events (1,1), . . . , (3,2).)
Basic IPT Protocol (1)
Each pi manages:
• A vector clock V Ci
• A boolean array IPi whose meaning is:
  (IPi[j] = 1) ⇔ the last relevant event produced by pj and known by pi is an immediate predecessor of pi’s current event
Basic IPT Protocol (2)
R0 Both V Ci[1..n] and IPi[1..n] are initialized to [0, . . . , 0]
R1 Each time pi produces a relevant event e:
? It increments its VC entry: V Ci[i] := V Ci[i] + 1
? It associates with e the timestamp
e.T S = {(k, V Ci[k]) | IPi[k] = 1}
? It resets IPi: ∀ ℓ ≠ i : IPi[ℓ] := 0; IPi[i] := 1
R2 When pi sends a message m to pj, it attaches to m the current values of V Ci (denoted m.V C) and of the boolean array IPi (denoted m.IP)
How to Manage the IPi Vectors? (1)
(Figure: three processes exchanging a message m.)
How to Manage the IPi Vectors? (2)
(Figure: three processes exchanging a message m.)
Basic IPT Protocol (3)
R3 When it receives a message m from pj, pi executes the following updates:
  ∀k: case
    V Ci[k] < m.V C[k] then V Ci[k] := m.V C[k]; IPi[k] := m.IP[k]
    V Ci[k] = m.V C[k] then IPi[k] := min(IPi[k], m.IP[k])
    V Ci[k] > m.V C[k] then skip
  end case
Efficient IPT? (1)
• Question : Is it possible to design an IPT pro- tocol that does not require each message m to carry a vector clock m.V C and a boolean vector m.IP whose size is always n?
• Answer : Yes! ... but How???
Efficient IPT (2): Towards a General Condition
Underlying intuition:
(Figure: pi sends m to pj; at the sending event V Ci[k] = x and IPi[k] = 1, and at the receipt V Cj[k] ≥ x.)
Efficient IPT (3): a General Condition
• Let e.Xi = value of the variable Xi of pi when it produces e
• Let K(m, k) be the following predicate:
  (1) (send(m).V Ci[k] = 0)
  (2) ∨ (send(m).V Ci[k] < pred(receive(m)).V Cj[k])
  (3) ∨ ((send(m).V Ci[k] = pred(receive(m)).V Cj[k]) ∧ (send(m).IPi[k] = 1))
Efficient IPT (3): a General Condition
• Theorem 1: The condition K(m, k) is both necessary and sufficient to omit the transmission of V Ci[k] and IPi[k] when m is sent by pi to pj
Efficient IPT (4): Towards a Concrete Condition
• K(m, k) involves events on two processes (send(m) at pi and receive(m) at pj), and consequently cannot be atomically evaluated by a single process
• Replace it by a “concrete” condition C(m, k) that:
? Can be locally evaluated by a process just before it sends a message, and
? Is a correct approximation of K(m, k), i.e., C(m, k) has to be such that ∀m, k: C(m, k) ⇒ K(m, k)
Efficient IPT (5): Towards a Concrete Condition
• The “constant” condition ∀(m, k) : KC(m, k) = false works
It is actually the trivially correct approximation of K that corresponds to the basic IPT protocol, in which each message m carries a whole vector clock m.V C and a whole boolean vector m.IP
• Let us equip each process Pi with an additional matrix Mi of 0/1 values such that
(Mi[j, k] = 1) ⇔ (to Pi’s knowledge: V Cj[k] ≥ V Ci[k])
An Implementation of the Matrices Mi
M0 ∀ (j, k) : Mi[j, k] is initialized to 1
M1 Each time it produces a relevant event e, pi resets the ith column of its boolean matrix: ∀j ≠ i : Mi[j, i] := 0
M2 When pi sends a message: no update of Mi occurs
M3 When it receives a message m from pj, pi executes the following updates (m.V C is carried by m):
  ∀k: case
    V Ci[k] < m.V C[k] then ∀ℓ ≠ i, j, k : Mi[ℓ, k] := 0; Mi[j, k] := 1
    V Ci[k] = m.V C[k] then Mi[j, k] := 1
    V Ci[k] > m.V C[k] then skip
  end case
A Concrete Condition
• Let m be a message sent by pi to pj, and
  C(m, k) = ((send(m).Mi[j, k] = 1) ∧ (send(m).IPi[k] = 1)) ∨ (send(m).V Ci[k] = 0)
• Theorem 2: ∀k : C(m, k) ⇒ K(m, k)
An Efficient IPT Protocol (1)
RM0 Both V Ci[1..n] and IPi[1..n] are set to [0, . . . , 0], and
∀ (j, k) : Mi[j, k] is set to 1
RM1 Each time pi produces a relevant event e:
? It increments its VC entry: V Ci[i] := V Ci[i] + 1,
? It associates with e the timestamp
e.T S = {(k, V Ci[k]) | IPi[k] = 1}
? It resets IPi: ∀ ℓ ≠ i : IPi[ℓ] := 0; IPi[i] := 1
An Efficient IPT Protocol (2)
RM2 When pi sends a message m to pj, it attaches to m the set of triples {(k, V Ci[k], IPi[k])} where k is such that (Mi[j, k] = 0 ∨ IPi[k] = 0) ∧ (V Ci[k] > 0)
RM3 When pi receives a message m from pj, it executes:
  ∀ (k, m.V C[k], m.IP[k]) carried by m: case
    V Ci[k] < m.V C[k] then V Ci[k] := m.V C[k]; IPi[k] := m.IP[k];
                            ∀ℓ ≠ i, j, k : Mi[ℓ, k] := 0; Mi[j, k] := 1
    V Ci[k] = m.V C[k] then IPi[k] := min(IPi[k], m.IP[k]); Mi[j, k] := 1
    V Ci[k] > m.V C[k] then skip
  end case
Properties of the IPT Protocol
• Improvement: Transmitting rows of Mi allows the processes to have more entries of their matrices equal to 1, and hence to transmit fewer triples
• If one is not interested in the IPT problem, the IPi arrays can be suppressed. We then obtain an efficient implementation of vector clocks (that does not require fifo channels)
• A simulation study has shown that the gains are substantial
Part VIII
MATRIX CLOCKS
- Wuu G.T. and Bernstein A.J., Efficient solutions to the replicated log and dictionary problems. Proc. 3rd Int’l ACM Symposium on Principles of Distributed Computing (PODC’84), ACM Press, pp. 233-242, 1984
Matrix clock
• Matrix clocks capture a “second order” knowledge
• Each process manages a time matrix M Ci[1..n, 1..n]
• M Ci[i, i] = nb of events produced by pi
• M Ci[i, k] = nb of events produced by pk, to pi’s knowledge (this is nothing else than pi’s vector clock)
• M Ci[j, k] = pi’s knowledge of the nb of events produced by pk as known by pj
M Ci[j, k] = x means