LOGICAL TIME
in DISTRIBUTED SYSTEMS
Michel RAYNAL raynal@irisa.fr
IRISA, Université de Rennes, France
Contents
• Scalar (linear) time
• Vector time
• Matrix time
• Using virtual time
Part I
SCALAR/LINEAR TIME
- Lamport, L., Time, Clocks and the Ordering of Events in a Distributed System.
Communications of the ACM, 21(7):558-565, 1978
Aim
• Build a logical time in order to be able to associate a consistent date with events, i.e.,
e −→ f ⇒ date(e) < date(f)
• Why logical time? Because there is no notion of physical time in a pure asynchronous system (no bound on process speeds and message transfer delays)
Even if physical time were available, using it to date events consistently would be more difficult
The fundamental constraint
Logical time has to increase along causal paths
• How logical time is used: when it produces a new event, a process associates the current clock value with that event
• Idea: consider the set of integers for the time domain
? Each process pi has a local clock hi
? From a local point of view:
hi has to measure the progress of pi
? From a global point of view:
hi has to measure the progress of the whole computation
Lamport clocks (1978)
Local progress rule:
before producing an internal event:
  hi ← hi + 1   % date of the internal event %

Sending rule:
when sending a message m to pj:
  hi ← hi + 1;   % date of the send event %
  send (m, hi) to pj

Receiving rule:
when receiving a message (m, h) from pj:
  hi ← max(hi, h);
  hi ← hi + 1   % date of the receive event %
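A minimal sketch of these three rules in Python; the class and method names are illustrative, not from the original:

class LamportClock:
    """Scalar (Lamport) clock for one process, following the three rules above."""
    def __init__(self):
        self.h = 0

    def internal_event(self):
        self.h += 1                   # local progress rule
        return self.h                 # date of the internal event

    def send_event(self):
        self.h += 1                   # sending rule: tick, then piggyback the date
        return self.h                 # value attached to the outgoing message

    def receive_event(self, h_msg):
        self.h = max(self.h, h_msg)   # receiving rule: first synchronize...
        self.h += 1                   # ...then tick
        return self.h                 # date of the receive event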
Illustration
(Figure: three processes p1, p2, p3 exchanging messages; each event is labeled with the current value of the local clock h1, h2, h3.)
Observation: (date(e) = x) ⇔ there are x events on the longest causal path ending at e
Build a total order on all the events
• Motivation: Resource allocation problems
• Observations
? (date(e) < date(f)) ⇒ ¬(f −→ e)
? (date(e) < date(f)) ∧ ¬(e −→ f) is possible
In that case, e and f are independent events, but this cannot be concluded from their dates alone
? (date(e) = date(f)) ⇒ e||f
• Associate a timestamp (h, i) with each event e where:
? h = local clock value when e is produced (date)
? i = identity of the process that produced e
Total order definition
• Let e and f be events timestamped (h, i) and (k, j), respectively
(e −→TO f) def= (h < k) ∨ ((h = k) ∧ (i < j))
• This is the (well-known) lexicographical ordering
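A direct transcription of this lexicographical test in Python (function name illustrative); note that it is exactly the built-in comparison on the tuples (h, i) and (k, j):

def before_total_order(ts_e, ts_f):
    """ts_e = (h, i), ts_f = (k, j): lexicographical order on (clock value, process identity)."""
    (h, i), (k, j) = ts_e, ts_f
    return h < k or (h == k and i < j)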
Illustration
(Figure: two processes p1 and p2; the lattice of consistent global states Σ from Σinit = [0,0] to Σ = [2,2], and the events with their timestamps (h, i); the total order on timestamps selects one particular observation of the computation.)
Lamport’s timestamps capture an observation (among all possible observations)
A theorem on the space of scalar clocks
• Let C be the set of all the scalar clock systems that are consistent (with respect to the causality relation)
• Let e and f be any two events of a distributed execution
• ∀C ∈ C: e −→ev f ⇒ dateC(e) < dateC(f) (Consistency)
• e||f ⇔ ∃C ∈ C : dateC(e) = dateC(f)
• Or equivalently
e||f ⇔ ∃C1, C2 ∈ C :
dateC1(e) ≤ dateC1(f) ∧ dateC2(e) ≥ dateC2(f)
Part II
SCALAR CLOCKS in ACTION
The Mutex problem
• Enrich the underlying system with new operations
• These operations define a service
• Here, two operations: acquire() and release()
• Process behavior: abstracted as a 3-state automaton, statei ∈ {out, asking, in} (all other details are irrelevant)
(Figure: the three-state automaton out → asking → in → out)
The Mutex problem: definition
• Definition
? Safety: no two processes are concurrently in the CS
? Liveness: any request is eventually granted
• Algorithms
? Permission-based (Individual vs arbiter permissions)
? Token-based
- Raynal M., Algorithms for mutual exclusion. The MIT Press, 1986
- Anderson J., Kim Y.-J. and Herman T., Shared-memory mutual exclusion: major research trends since 1986. Distributed Computing, 16(2-3): 75-110, 2003
Individual permissions: principles
• When it wants to enter the CS, pi asks for permissions
• When it has received all the permissions, pi enters
• Ri = the set of processes from which pi needs the permission to enter the CS
• Individual permission: Ri = {1, . . . , n} \ {i}
• When pi gives its permission to a process pk, the meaning of the permission is “As far as I am concerned, you can enter” (a permission is consequently “individual”)
• Core of the algorithm: the way permissions are granted
• The algorithm manages bilateral conflicts
Granting a permission
(Figure: pi, with statei = asking, requests the permission of pj and pk; pj, with statej = out, sends perm by return; pk, with statek ≠ out, is in conflict with pi.)
• Solve the conflict between pi and pk
• A solution: timestamp the requests, and use the total order on timestamps to establish a system-wide consistent priority
? If pk does not have priority: it sends its permission to pi by return
From mechanisms to properties
• Safety: ∀i ≠ j : j ∈ Ri ∧ i ∈ Rj
• Liveness: based on a timestamping mechanism
Ricart-Agrawala mutex algorithm: local variables
• statei ∈ {out, asking, in}, init out
• hi, lasti: integers, init 0
• prioi: boolean
• waiting_fromi, postponedi: sets
Structure
(Figure: the operations acquire() and release(), the messages req(k, j) and perm(j), and the local variables they act upon.)
Ricart-Agrawala mutex algorithm (1)
operation acquire() issued by pi
  statei ← asking; postponedi ← ∅;
  hi ← hi + 1; lasti ← hi; waiting_fromi ← Ri;
  for each j ≠ i do send req(lasti, i) to pj end for;
  wait (waiting_fromi = ∅);
  statei ← in

when perm(j) is received
  waiting_fromi ← waiting_fromi \ {j}
Ricart-Agrawala mutex algorithm (2)
when req(k, j) is received
  hi ← max(hi, k) + 1;
  prioi ← (statei ≠ out) ∧ ((lasti, i) < (k, j));
  if prioi then postponedi ← postponedi ∪ {j}
  else send perm(i) to pj
  end if

operation release()
  for each j ∈ postponedi do send perm(i) to pj end for;
  statei ← out
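A sketch of both parts of the algorithm in Python; the class, the send callback and the message tuples are illustrative assumptions, network plumbing is omitted, and the blocking wait of acquire() is left to the caller:

class RicartAgrawalaNode:
    """Sketch of the Ricart-Agrawala rules for process i among n processes."""
    def __init__(self, i, n, send):
        self.i, self.n = i, n
        self.send = send                 # send(dest, message) supplied by the runtime
        self.state = "out"
        self.h = 0                       # scalar (Lamport) clock
        self.last = 0                    # timestamp of the current request
        self.waiting_from = set()
        self.postponed = set()

    def acquire(self):
        self.state = "asking"
        self.postponed = set()
        self.h += 1
        self.last = self.h
        self.waiting_from = {j for j in range(self.n) if j != self.i}
        for j in self.waiting_from:
            self.send(j, ("req", self.last, self.i))
        # the caller then waits until waiting_from becomes empty (state moves to "in")

    def on_perm(self, j):
        self.waiting_from.discard(j)
        if not self.waiting_from and self.state == "asking":
            self.state = "in"

    def on_req(self, k, j):
        self.h = max(self.h, k) + 1
        prio = self.state != "out" and (self.last, self.i) < (k, j)
        if prio:
            self.postponed.add(j)        # answer deferred until release()
        else:
            self.send(j, ("perm", self.i))

    def release(self):
        for j in self.postponed:
            self.send(j, ("perm", self.i))
        self.postponed = set()
        self.state = "out"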
Clock values
• hi can increase forever
• Aim: limit its increase
• As only requests have to be timestamped we can replace hi ← max(hi, k) + 1 with hi ← max(hi, k)
• As we are about to see in the proof, it is possible to further limit the increase in the acquire() operation:
The two statements [hi ← hi+1; lasti ← hi] are replaced by lasti ← hi + 1, which does not increase hi!
• These two modifications allow obtaining an algorithm in which all variables are bounded: clock values can be implemented modulo 2n − 1
Ricart-Agrawala mutex algorithm
operation acquire() issued by pi
  statei ← asking; postponedi ← ∅;
  lasti ← hi + 1;   % replaces hi ← hi + 1; lasti ← hi %
  waiting_fromi ← Ri;
  for each j ≠ i do send req(lasti, i) to pj end for;
  wait (waiting_fromi = ∅);
  statei ← in

when req(k, j) is received
  hi ← max(hi, k);   % replaces hi ← max(hi, k) + 1 %
  prioi ← (statei ≠ out) ∧ ((lasti, i) < (k, j));
  if prioi then postponedi ← postponedi ∪ {j}
  else send perm(i) to pj
  end if
Proof: on the blackboard!
• Safety: by contradiction
• Liveness: in two steps
? No deadlock: at least one process acquires the CS
? No starvation: eventually any requesting process will be granted the CS
Cost
• Message cost: 2(n − 1) messages per CS use
• Improvement:
? The algorithm can be improved in such a way that a CS use costs between 0 and 2(n − 1) messages
? Idea: every pair of processes manages a single permission (token) to solve their conflicts
• Time: consider each message takes one time unit
? Heavy load: one time unit
? Light load: two time units
Variants
• Ring structure
? Forward a request = give its permission
? Cost: n messages
• Assumption ∆ on transfer delays
? Give its permission = not to answer
? Not to give its permission = send by return a negative ack, cancel it when exiting the CS
On mutual exclusion
• Permission-based
? Individual permission approach
? Arbiter permission approach
Quorums and three-way handshake algorithms
• Token-based
• A continuous view
- Raynal M., Algorithms for mutual exclusion, The MIT Press, 1986
- Anderson J., Kim Y.-J. and Herman T., Shared-memory mutual exclusion: major research trends since 1986, Distributed Computing, 16(2-3): 75-110, 2003
Part III
VECTOR TIME
- Fidge C., Timestamps in Message-Passing Systems that Preserve the Partial Ordering, Proc. 11th Australian Computer Science Conference, pp. 56-66, 1988
- Mattern F., Virtual time and global states of distributed systems. Proc. Int’l workshop on Parallel and Distributed Systems, North-Holland, pp. 215-226, (Cosnard, Quinton, Raynal and Robert Eds), 1988
- Baldoni R. and Raynal M. Fundamentals of Distributed Computing: A Practical Tour of Vector-Clock Systems. IEEE Distributed Systems Online, 3(2):1-18, 2002
Aim: capture the causality relation
• Scalar (linear) clock system
? Respects causality
? But does not capture it
• Find a dating system that captures causality exactly:
(e −→ev f) ⇔ date(e) < date(f)
(e || f) ⇔ date(e) and date(f) cannot be compared
Vector clock: intuition
• Observation: a process pi can always measure its progress by counting the number of events it has produced since the beginning
This number can be seen as its logical local clock. There is one such clock per process
• The time domain is consequently n-dimensional: there is one dimension associated with each process
• Hence the idea of vector clocks: each process pi manages a vector V Ci[1..n] that represents its view of the global time progress
V Ci is a digest of the current causal past of pi
Vector clock: definition
• V Ci[i] = nb of events issued by pi
• V Ci[j] = nb of events issued by pj, as known by pi
Formally, let e be the current event produced by pi:
V Ci[j] = |{f | f −→ e ∧ f has been issued by pj}|
• Notation: component-wise maximum/minimum
max(V 1, V 2) = [max(V 1[1], V 2[1]), · · · , max(V 1[n], V 2[n])]
min(V 1, V 2) = [min(V 1[1], V 2[1]), · · · , min(V 1[n], V 2[n])]
Vector clock: algorithm
Local progress rule:
before producing an internal event:
  V Ci[i] ← V Ci[i] + 1

Sending rule:
when sending a message m to pj:
  V Ci[i] ← V Ci[i] + 1;
  send (m, V Ci) to pj

Receiving rule:
when receiving a message (m, V C) from pj:
  V Ci[i] ← V Ci[i] + 1;
  V Ci ← max(V Ci, V C)
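The same three rules sketched in Python (class and method names are illustrative):

class VectorClock:
    """Vector clock for process i among n processes, following the three rules above."""
    def __init__(self, i, n):
        self.i = i
        self.vc = [0] * n

    def internal_event(self):
        self.vc[self.i] += 1              # local progress rule
        return list(self.vc)              # timestamp of the internal event

    def send_event(self):
        self.vc[self.i] += 1              # sending rule: tick, then piggyback the vector
        return list(self.vc)              # vector attached to the outgoing message

    def receive_event(self, vc_msg):
        self.vc[self.i] += 1              # receiving rule: tick...
        self.vc = [max(a, b) for a, b in zip(self.vc, vc_msg)]  # ...and take the component-wise max
        return list(self.vc)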
Illustration
(Figure: four processes p1, p2, p3, p4; each event is labeled with the current vector clock value, starting from [0,0,0,0].)
∀i, k: V Ci[k] is not decreasing, and V Ci[k] ≤ V Ck[k]
A few simple definitions
• V 1 ≤ V 2 def= ∀k : V 1[k] ≤ V 2[k]
• V 1 < V 2 def= (V 1 ≤ V 2) ∧ (V 1 ≠ V 2)
• V 1||V 2 def= ¬(V 1 ≤ V 2) ∧ ¬(V 2 ≤ V 1)
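These three relations translate directly into Python (function names are illustrative):

def leq(v1, v2):
    """V1 <= V2: component-wise less-or-equal."""
    return all(a <= b for a, b in zip(v1, v2))

def lt(v1, v2):
    """V1 < V2: less-or-equal and not equal."""
    return leq(v1, v2) and v1 != v2

def concurrent(v1, v2):
    """V1 || V2: neither vector dominates the other."""
    return not leq(v1, v2) and not leq(v2, v1)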
The vector clock properties
Let e with date(e) = Ve , and f with date(f) = Vf
(e −→ev f) ⇔ (Ve < Vf)
(e || f) ⇔ (Ve || Vf)
These are the fundamental properties provided by vector clocks
Proof (1)
• Theorem 1: Vector clocks increase along causal paths
• Theorem 2: (e −→ev f) ⇔ (Ve < Vf)
? (e −→ev f) ⇒ (Ve < Vf): follows from Theorem 1.
? (Ve < Vf) ⇒ (e −→ev f):
Let pi be the process that issued the event e. We have (Ve < Vf) ⇒ (Ve[i] ≤ Vf[i]). As only pi can entail an increase of V [i] (for any vector V ), it follows that there is a causal path from e to f.
Proof and cost
• Theorem 3: (e || f) ⇔ (Ve || Vf).
(e || f) def= ¬(e −→ev f) ∧ ¬(f −→ev e) (definition).
? ¬(e −→ev f) ⇒ ¬(Ve < Vf).
? ¬(f −→ev e) ⇒ ¬(Vf < Ve).
From which follows that Vf and Ve cannot be compared.
• Theorem 4: The previous (causality/independence) predicates require O(n) comparisons
Refining the causality test
• Let us associate a timestamp (Ve, i) with each event e, where pi is the process that issued e
• Let e be timestamped (Ve, i) and f be timestamped (Vf, j)
• Refined causality test:
(e −→ev f) ⇔ (Ve[i] ≤ Vf[i])
• Refined independence test:
(e || f) ⇔ (Ve[i] > Vf[i]) ∧ (Vf[j] > Ve[j])
• Theorem 5: The previous (causality/independence) predicates require O(1) comparisons (scalability of the test)
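A sketch of the refined tests in Python; ve, vf are the vector timestamps of e and f, and i, j the identities of their issuing processes (function names are illustrative):

def happened_before_refined(ve, i, vf):
    """e -> f test using the timestamp (Ve, i) of e: a single comparison (e and f assumed distinct)."""
    return ve[i] <= vf[i]

def concurrent_refined(ve, i, vf, j):
    """e || f test using the timestamps (Ve, i) and (Vf, j)."""
    return ve[i] > vf[i] and vf[j] > ve[j]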
A process is a “local” observer
(Figure: the same two-process computation; each process, through the successive values σi0, σi1, . . . of its local states and the associated vector dates, observes the computation from its own viewpoint.)
A process is an observer of the computation
A vector clock denotes a global state
(Figure: the same computation; the vector dates Σa = [2,1] and Σb = [1,2] each denote a consistent global state, and so does Σc = max(Σa, Σb) = [2,2].)
The development of logical time (1)
(Figure: a message m from pi to pj creates a causal path e → f; at the sending event Vi[i] = s, and at the receipt Vj[j] = r and Vj[i] ≥ s.)
• m: sent by pi at Vi[i] = s, received by pj at Vj[j] = r
• “Knowing” the receipt of m ⇒ “knowing” its sending
• I.e., for any event x: (Vx[j] ≥ r) ⇒ (Vx[i] ≥ s)
• Due to m it is impossible to have (Vx[j] ≥ r)∧(Vx[i] < s)
The development of logical time (2)
(Figure: two processes p1 and p2 exchanging messages m1, m2, m3, m4; the successive vector dates are [1,0], [2,0], [3,0], [4,2], [5,2], [6,5] on p1 and [0,1], [0,2], [0,3], [3,4], [3,5], [3,6] on p2.)
Message m1 makes it impossible to have (V[1] < 2) ∧ (V[2] ≥ 6)
Part IV
VECTOR CLOCKS in ACTION (1)
CAUSAL ORDER ABSTRACTION
Causal order abstraction
- Birman K., Schiper A. and Stephenson P., Lightweight causal and atomic group multicast. ACM Transactions on Computer Systems, 9(3):272-314, 1991
- Raynal M., Schiper A. and Toueg S., The causal ordering abstraction and a simple way to implement it. Information Processing Letters, 39:343-351, 1991
• co broadcast(m): allows a process to send a message m to all the processes
• co deliver(): allows a process to deliver a message
Causal delivery: definition
• Termination: If a message is co broadcast, it is eventually co delivered (No loss)
• Integrity: A process co delivers a message m at most once (No duplication)
• Validity: If a process co delivers a message m, then m has been co broadcast (No spurious message)
• Causal Order:
co broadcast(m1) → co broadcast(m2)
⇒ co del(m1) → co del(m2)
Causal delivery: Why it is useful
• Capture causality
• Cooperative work
• Stronger than fifo channels
• But weaker than atomic broadcast
Atomic broadcast = total order delivery
Causal order: Example 1 (figure)
Causal order: Example 2 (figure)
Causal broadcast
V Ci[j] = nb messages broadcast by Pj (to pi’s knowledge)
(Figure: messages m1 to m5; e.g., V Cm2 = [1,1,0,0], V Cm3 = [1,1,1,0], V Cm4 = [1,2,0,0], V Cm5 = [1,2,2,0].)
Illustration
(Figure: three processes p1, p2, p3; each broadcast carries the sender's current vector V Ci, which counts, per process, the messages already co broadcast.)
RST algorithm
operation co broadcast(m)
  for each j ≠ i do send (m, V Ci) to pj end for;
  V Ci[i] ← V Ci[i] + 1

when (m, m.V C) is received from pj:
  wait until (∀k : V Ci[k] ≥ m.V C[k]);
  co deliver m to the application;
  V Ci[j] ← V Ci[j] + 1
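A sketch of the RST rules in Python; the class, the send_to and deliver callbacks are illustrative assumptions, and the blocking "wait until" is replaced by a pending queue re-examined after each receipt:

class CausalBroadcast:
    """RST causal-order broadcast logic for process i among n processes."""
    def __init__(self, i, n, send_to, deliver):
        self.i, self.n = i, n
        self.send_to = send_to          # send_to(j, payload) supplied by the runtime
        self.deliver = deliver          # application-level co_deliver callback
        self.vc = [0] * n               # vc[j] = nb of messages co_broadcast by pj, as known here
        self.pending = []               # received but not yet deliverable messages

    def co_broadcast(self, m):
        for j in range(self.n):
            if j != self.i:
                self.send_to(j, (m, list(self.vc)))
        self.vc[self.i] += 1

    def on_receive(self, m, m_vc, j):
        self.pending.append((m, m_vc, j))
        self._try_deliver()

    def _try_deliver(self):
        progress = True
        while progress:
            progress = False
            for entry in list(self.pending):
                m, m_vc, j = entry
                if all(self.vc[k] >= m_vc[k] for k in range(self.n)):  # delivery condition
                    self.deliver(m)
                    self.vc[j] += 1
                    self.pending.remove(entry)
                    progress = True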
Part V
VECTOR CLOCKS in ACTION (2)
PREDICATE DETECTION
Stable Local Predicate Detection (1)
• Local predicate LPi: on the local variables of a single process pi
• Stable predicate: once true, remains true
• A consistent global state Σ = (σ1, · · · , σn) satisfies the global predicate LP1 ∧ LP2 · · · ∧ LPn if ∀ i : (σi |= LPi)
Σ |= (∧i LPi) ⇔ ∧i (σi |= LPi)
Stable Local Predicate Detection (2)
• Problem: Design an algorithm that detects the first consistent global state that satisfies a conjunction of stable local predicates
• Constraints: Do not use additional control messages, Detection must be done on the fly
Stable Local Predicate Detection (3)
(Figure: three processes P1, P2, P3 exchanging messages m1 to m5; starting from the initial local states σ10, σ20, σ30, the consistent global state Σ = (σ1y1, σ2y2, σ3y3) is the one to be detected.)
Stable Local Predicate Detection (4)
(Figure: the same detection scenario reduced to the messages m1, m2, m3; Σ = (σ1y1, σ2y2, σ3y3) is the detected global state.)
Detection algorithm: local context of pi
• V Ci[1..n]: local vector clock
• SATi: set of process identities such that
  j ∈ SATi ⇔ pj entered a local state σjx from which LPj is true
• FIRSTi: first global state (as known by pi) in which all the local predicates LPj such that j ∈ SATi are satisfied
Detection algorithm (1)
procedure detected? is
  if SATi = {1, 2, . . . , n} then
    FIRSTi defines the first consistent global state Σ that satisfies ∧j LPj
  fi

procedure check LPi is
  if (σix |= LPi) then
    SATi := SATi ∪ {i}; FIRSTi := V Ci; donei := true;
    detected?
  fi
Detection algorithm (2)
(S1) when Pi produces an internal event e:
  V Ci[i] := V Ci[i] + 1;
  execute e and move to σ;
  if ¬donei then check LPi fi
Detection algorithm (3)
(S2) when Pi produces a send event e = send m to Pj:
  V Ci[i] := V Ci[i] + 1;
  move to σ;
  if ¬donei then check LPi fi;
  m.V C := V Ci; m.SAT := SATi; m.FIRST := FIRSTi;
  send (m) to Pj   % m carries m.V C, m.SAT and m.FIRST %
Detection algorithm (4)
(S3) when Pi produces a receive event e = receive(m):
  V Ci[i] := V Ci[i] + 1; V Ci := max(V Ci, m.V C);
  move to σ;   % by delivering m to the process %
  if ¬donei then check LPi fi;
  if ¬(m.SAT ⊆ SATi) then
    SATi := SATi ∪ m.SAT;
    FIRSTi := max(FIRSTi, m.FIRST);
    detected?
  fi
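A condensed sketch of rules S1-S3 for one process in Python; the class, the local_predicate callback and the way control data is piggybacked on messages are illustrative assumptions:

def vmax(v1, v2):
    return [max(a, b) for a, b in zip(v1, v2)]

class StablePredicateDetector:
    """On-the-fly detection of the first consistent global state satisfying all LPj."""
    def __init__(self, i, n, local_predicate):
        self.i, self.n = i, n
        self.local_predicate = local_predicate  # evaluates LPi on the current local state
        self.vc = [0] * n
        self.sat = set()                        # processes whose stable LPj is known to hold
        self.first = [0] * n                    # first global state known to satisfy them
        self.done = False

    def _check_lp(self, local_state):
        if not self.done and self.local_predicate(local_state):
            self.sat.add(self.i)
            self.first = list(self.vc)
            self.done = True
            self._detected()

    def _detected(self):
        if self.sat == set(range(self.n)):
            print("first consistent global state:", self.first)   # report detection (sketch)

    def internal_event(self, local_state):              # S1
        self.vc[self.i] += 1
        self._check_lp(local_state)

    def send_event(self, local_state):                   # S2: returns the control data to piggyback
        self.vc[self.i] += 1
        self._check_lp(local_state)
        return (list(self.vc), set(self.sat), list(self.first))

    def receive_event(self, local_state, m_vc, m_sat, m_first):   # S3
        self.vc[self.i] += 1
        self.vc = vmax(self.vc, m_vc)
        self._check_lp(local_state)
        if not m_sat.issubset(self.sat):
            self.sat |= m_sat
            self.first = vmax(self.first, m_first)
            self._detected()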
Part VI
LIMIT of VECTOR CLOCKS
DETECTION OF A SIMPLE EVENT PATTERN
- Raynal M., Illustrating the Use of Vector Clocks in Property Detection: an Example and a Counter-Example. Proc. 5th Int’l European Parallel Computing Conference (EUROPAR’99), Springer LNCS 1685, pp. 806-814, 1999
Pattern Recognition (1)
• Some internal events are tagged black, the others are tagged white
• All communication events are tagged white
• The problem: Given two black events s and t, does there exist a black event u such that s → u ∧ u → t?
• Formally, P(s, t) is the conjunction of:
? (black(s) ∧ black(t))
? (∃u ≠ s, t : (black(u) ∧ (s → u ∧ u → t)))
Pattern Recognition
(Figure: two computations in which P(s, t) is false (left) and true (right), respectively.
Counting white and black events: s.V C = (0,0,2) and t.V C = (3,4,2) in both cases.
Counting only black events: s.V C = (0,0,1) and t.V C = (2,1,1) in both cases.)
Non-Triviality of the Problem
(Figure: three processes P1, P2, P3 with black events a, b, c, d, u, s, t1 and t2.)
P(s, t2) is true while P(s, t1) is not
Decomposing the Predicate
• P(s, t) ≡ (∃u : P1(s, u, t) ∧ P2(s, u, t))
? P1(s, u, t) ≡ (black(s) ∧ black(u) ∧ black(t))
? P2(s, u, t) ≡ (s → u ∧ u → t)
Using Vector of Vector Clocks
• Only black events are relevant: count only them
• Event e:
? e.V C: its vector timestamp (counting only black events)
? e.M C[1..n]: an array of vector timestamps
e.M C[j] contains the vector timestamp of the last black event of Pj that causally precedes e
e.M C[j] can be considered as a “pointer” from e to the last event that precedes it on Pj
Example (1)
(Figure: the same three-process computation as above.)
t1.M C[1] = a.V C means that t1.M C[1] points to a
t1.M C[2] = b.V C means that t1.M C[2] points to b
t1.M C[3] = s.V C means that t1.M C[3] points to s
Example (2)
(Figure: the same three-process computation as above.)
t2.M C[1] = t1.V C means that t2.M C[1] points to t1
t2.M C[2] = u.V C means that t2.M C[2] points to u
t2.M C[3] = s.V C means that t2.M C[3] points to s
Operational Predicate
• Event s: s.V C and s.M C; event t: t.V C and t.M C
• P1 is trivially satisfied by any triple of events
• (∃u : P2(s, u, t)) ≡ (∃u : s → u → t) can be restated as:
(∃u : s → u → t) ≡ (∃u : s.V C < u.V C < t.V C)
(∃u : s → u → t) ≡ (∃pk : s.V C < t.M C[k] < t.V C)
As ∀k : t.M C[k] < t.V C, we get the operational predicate:
P(s, t) ≡ (∃k : s.V C < t.M C[k])
The Protocol (1)
(S1) when Pi produces a black event e:
  V Ci[i] := V Ci[i] + 1;   % one more black event on Pi %
  e.V C := V Ci; e.M C := M Ci;
  M Ci[i] := V Ci   % vector timestamp of Pi’s last black event %
The Protocol (2)
(S2) when Pi executes a send event e = send m to Pj:
  m.V C := V Ci; m.M C := M Ci;
  send (m) to Pj   % m carries m.V C and m.M C %
The Protocol (3)
(S3) when Pi executes a receive event e = receive(m):
  V Ci := max(V Ci, m.V C);   % update of the local vector clock %
  ∀ k : M Ci[k] := max(M Ci[k], m.M C[k])   % record vector timestamps of last black predecessors %
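A sketch in Python of the bookkeeping (S1)-(S3) and of the operational predicate; the class and function names are illustrative:

class BlackEventTracker:
    """Tracks, for process i, the black-event vector clock and the vector
    timestamps of the last black predecessors on every process (S1-S3)."""
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.vc = [0] * n                       # counts black events only
        self.mc = [[0] * n for _ in range(n)]   # mc[j] = VC of the last black event of Pj known here

    def black_event(self):                      # S1
        self.vc[self.i] += 1
        e_vc = list(self.vc)
        e_mc = [list(row) for row in self.mc]   # e.MC taken before updating MC[i]
        self.mc[self.i] = list(self.vc)         # this event becomes Pi's last black event
        return e_vc, e_mc                       # timestamp associated with the black event

    def send_event(self):                       # S2: values piggybacked on the message
        return list(self.vc), [list(row) for row in self.mc]

    def receive_event(self, m_vc, m_mc):        # S3
        self.vc = [max(a, b) for a, b in zip(self.vc, m_vc)]
        for k in range(self.n):
            self.mc[k] = [max(a, b) for a, b in zip(self.mc[k], m_mc[k])]

def strictly_less(v1, v2):
    return all(a <= b for a, b in zip(v1, v2)) and v1 != v2

def pattern_holds(s_vc, t_mc):
    """Operational predicate P(s, t): there is a black u with s -> u -> t."""
    return any(strictly_less(s_vc, t_mc[k]) for k in range(len(t_mc)))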
What has been learnt
• Power of vector clocks: to track (counter-based) causality: “First Order” predecessor tracking
• Limitation of vector clocks: to solve problems where causality cannot be reduced to event counting: “Second Order” (or more) predecessor tracking
Part VII
VECTOR CLOCKS in ACTION (3)
DETERMINING IMMEDIATE PREDECESSORS
- Anceaume E., Helary J.-M. and Raynal M., A Note on the Determination of the Immediate Predecessors in a Distributed Computation. Int. Journal of Foundations of Computer Science (IJFCS), 13(6):865-972, 2002
- Helary J.-M., Raynal M., Melideo G. and Baldoni R., Efficient Causality-Tracking Timestamping. IEEE Transactions on Knowledge and Data Engineering, 15(5):1239-1250, 2003
Relevant Events
• At some abstraction level only some events of a dis- tributed computation are relevant
• Let R ⊆ H be the set of relevant events
• Let → be the relation on R defined in the following way:
∀ (e, f) ∈ R × R : (e → f) ⇔ (e −→ev f).
• The poset (R, →) constitutes an abstraction of the distributed computation
Without loss of generality we consider that the set of relevant events is a subset of the internal events (if a communication event has to be observed, a relevant internal event can be generated just after the corresponding communication event)
A Distributed Computation
(Figure: a distributed computation involving three processes P1, P2, P3.)
Vector Clocks (2)
VC0 V Ci[1..n] is initialized to [0, . . . , 0]
VC1 Each time pi produces a relevant event e:
  ? It increments its vector clock entry V Ci[i] to indicate its progress: V Ci[i] := V Ci[i] + 1
  ? It associates with e its timestamp e.V C = V Ci
VC2 When a process pi sends a message m, it attaches to it the current value of V Ci (let m.V C denote this value)
VC3 When pi receives a message m, it updates its vector clock: ∀ x : V Ci[x] := max(V Ci[x], m.V C[x])
Vector Clocks (3)
• V Ci = current knowledge of pi on the progress of each process pk (measured by V Ci[k])
• More precisely: V Ci[k] = number of relevant events produced by pk and known by pi
Vector Clocks: Example
(Figure: three processes P1, P2, P3; the relevant events (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2) are labeled with the corresponding vector clock values, e.g., [1,0,0], [1,1,0], [2,0,1], [1,1,2], [3,2,1], [2,2,1], [2,3,1].)
Immediate Predecessor Tracking: the Problem
• Given two relevant events e and f, we say that e is an immediate predecessor of f if:
? e → f, and
? ∄ relevant event g such that e → g → f
• The Immediate Predecessor Tracking (IPT) problem consists in associating with each relevant event e the set of relevant events that are its immediate predecessors
Moreover, this has to be done on the fly and without additional control message (i.e., without modifying the communication pattern of the computation)
Immediate Predecessor Tracking: Why?
• Capture the very structure of the causal past of each event
• Allow the analysis of distributed computations (e.g., detection of global predicates, analysis of control flows)
Distributed Computation and its Reduction
(Figure: the same computation and its reduction to the poset of relevant events (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2).)
Transitive Reduction (Hasse Diagram)
(Figure: the Hasse diagram (transitive reduction) of the relevant events (1,1), . . . , (3,2).)
Basic IPT Protocol (1)
Each pi manages:
• A vector clock V Ci
• A boolean array IPi whose meaning is:
  (IPi[j] = 1) ⇔ the last relevant event produced by pj and known by pi is an immediate predecessor of pi’s current event
Basic IPT Protocol (2)
R0 Both V Ci[1..n] and IPi[1..n] are initialized to [0, . . . , 0]
R1 Each time pi produces a relevant event e:
? It increments its VC entry: V Ci[i] := V Ci[i] + 1
? It associates with e the timestamp
e.T S = {(k, V Ci[k]) | IPi[k] = 1}
? It resets IPi: ∀ ℓ ≠ i : IPi[ℓ] := 0; IPi[i] := 1
R2 When pi sends a message m to pj, it attaches to m the current values of V Ci (denoted m.V C) and of the boolean array IPi (denoted m.IP)
How to Manage the IPi Vectors? (1)
(Figure: three processes exchanging a message m.)
How to Manage the IPi Vectors? (2)
(Figure: three processes exchanging a message m.)
Basic IPT Protocol (3)
R3 When it receives a message m from pj, pi executes the following updates:
  ∀k: case
    V Ci[k] < m.V C[k] then V Ci[k] := m.V C[k]; IPi[k] := m.IP[k]
    V Ci[k] = m.V C[k] then IPi[k] := min(IPi[k], m.IP[k])
    V Ci[k] > m.V C[k] then skip
  end case
Efficient IPT? (1)
• Question : Is it possible to design an IPT pro- tocol that does not require each message m to carry a vector clock m.V C and a boolean vector m.IP whose size is always n?
• Answer : Yes! ... but How???
Efficient IPT (2): Towards a General Condition
Underlying intuition:
(Figure: pi sends m to pj; at the sending event V Ci[k] = x and IPi[k] = 1, and at the receipt V Cj[k] ≥ x.)
Efficient IPT (3): a General Condition
• Let e.Xi = value of the variable Xi of pi when it produces e
• Let K(m, k) be the following predicate:
  (1) (send(m).V Ci[k] = 0)
  (2) ∨ (send(m).V Ci[k] < pred(receive(m)).V Cj[k])
  (3) ∨ ((send(m).V Ci[k] = pred(receive(m)).V Cj[k]) ∧ (send(m).IPi[k] = 1))
Efficient IPT (3): a General Condition
• Theorem 1: The condition K(m, k) is both necessary and sufficient to omit the transmission of V Ci[k] and IPi[k] when m is sent by pi to pj
Efficient IPT (4): Towards a Concrete Condition
• K(m, k) involves events on two processes (send(m) at pi and receive(m) at pj), and consequently cannot be atomically evaluated by a single process
• Replace it by a “concrete” condition C(m, k) that:
? Can be locally evaluated by a process just before it sends a message, and
? Is a correct approximation of K(m, k), i.e., C(m, k) has to be such that ∀m, k: C(m, k) ⇒ K(m, k)
Efficient IPT (5): Towards a Concrete Condition
• The “constant” condition ∀(m, k) : KC(m, k) = false works
It is actually the trivially correct approximation of K that corresponds to the basic IPT protocol, in which each message m carries a whole vector clock m.V C and a whole boolean vector m.IP
• Let us equip each process Pi with an additional matrix Mi of 0/1 values such that
(Mi[j, k] = 1) ⇔ (to Pi’s knowledge: V Cj[k] ≥ V Ci[k])
An Implementation of the Matrices Mi
M0 ∀ (j, k) : Mi[j, k] is initialized to 1
M1 Each time it produces a relevant event e, pi resets the ith column of its boolean matrix: ∀j ≠ i : Mi[j, i] := 0
M2 When pi sends a message: no update of Mi occurs
M3 When it receives a message m from pj, pi executes the following updates (m.V C is carried by m):
  ∀k: case
    V Ci[k] < m.V C[k] then ∀ℓ ≠ i, j, k : Mi[ℓ, k] := 0; Mi[j, k] := 1
    V Ci[k] = m.V C[k] then Mi[j, k] := 1
    V Ci[k] > m.V C[k] then skip
  end case
A Concrete Condition
• Let m be a message sent by pi to pj, and
  C(m, k) = ((send(m).Mi[j, k] = 1) ∧ (send(m).IPi[k] = 1)) ∨ (send(m).V Ci[k] = 0)
• Theorem 2: ∀k : C(m, k) ⇒ K(m, k)
An Efficient IPT Protocol (1)
RM0 Both V Ci[1..n] and IPi[1..n] are set to [0, . . . , 0], and
∀ (j, k) : Mi[j, k] is set to 1
RM1 Each time pi produces a relevant event e:
? It increments its VC entry: V Ci[i] := V Ci[i] + 1,
? It associates with e the timestamp
e.T S = {(k, V Ci[k]) | IPi[k] = 1}
? It resets IPi: ∀ ℓ ≠ i : IPi[ℓ] := 0; IPi[i] := 1
An Efficient IPT Protocol (2)
RM2 When pi sends a message m to pj, it attaches to m the set of triples {(k, V Ci[k], IPi[k])} where k is such that (Mi[j, k] = 0 ∨ IPi[k] = 0) ∧ (V Ci[k] > 0)
RM3 When pi receives a message m from pj, it executes:
  ∀ (k, m.V C[k], m.IP[k]) carried by m: case
    V Ci[k] < m.V C[k] then V Ci[k] := m.V C[k]; IPi[k] := m.IP[k];
                            ∀ℓ ≠ i, j, k : Mi[ℓ, k] := 0; Mi[j, k] := 1
    V Ci[k] = m.V C[k] then IPi[k] := min(IPi[k], m.IP[k]); Mi[j, k] := 1
    V Ci[k] > m.V C[k] then skip
  end case
Properties of the IPT Protocol
• Improvement: Transmitting rows of Mi allows the processes to have more entries of their matrices equal to 1, and hence to transmit fewer triples
• If one is not interested in the IPT problem, the IPi arrays can be suppressed. We then obtain an efficient implementation of vector clocks (that does not require fifo channels)
• A simulation study has shown that the gains are substantial
Part VIII
MATRIX CLOCKS
- Wuu G.T. and Bernstein A.J., Efficient solutions to the replicated log and dictionary problems. Proc. 3rd Int’l ACM Symposium on Principles of Distributed Computing (PODC’84), ACM Press, pp. 233-242, 1984
Matrix clock
• Matrix clocks capture a “second order” knowledge
• Each process manages a time matrix M Ci[1..n, 1..n]
• M Ci[i, i] = nb of events produced by pi
• M Ci[i, k] = nb of events produced by pk, to pi’s knowledge (this is nothing else than pi’s vector clock)
• M Ci[j, k] = pi’s knowledge of the nb of events produced by pk as known by pj
M Ci[j, k] = x means