(1)

LOGICAL TIME

in DISTRIBUTED SYSTEMS

Michel RAYNAL raynal@irisa.fr

IRISA, Université de Rennes, France

(2)

Contents

• Scalar (linear) time

• Vector time

• Matrix time

• Using virtual time

(3)

Part I

SCALAR/LINEAR TIME

- Lamport, L., Time, Clocks and the Ordering of Events in a Distributed System.

Communications of the ACM, 21(7):558-565, 1978

(4)

Aim

• Build a logical time in order to be able to associate a consistent date with events, i.e.,

e −→ f ⇒ date(e) < date(f)

• Why logical time? Because there is no notion of physical time in a pure asynchronous system (no bounds on process speeds and message transfer delays)

Even if physical time were available, exploiting it consistently would be more difficult

(5)

The fundamental constraint

Logical time has to increase along causal paths

• How logical time is used: when it produces a new event, a process associates the current clock value with that event

• Idea: consider the set of integers for the time domain

- Each process pi has a local clock hi

- From a local point of view: hi has to measure the progress of pi

- From a global point of view: hi has to measure the progress of the whole computation

(6)

Lamport clocks (1978)

Local progress rule:

before producing an internal event:
  hi ← hi + 1 % date of the internal event %

Sending rule:

when sending a message m to pj:
  hi ← hi + 1; % date of the send event %
  send (m, hi) to pj

Receiving rule:

when receiving a message (m, h) from pj:
  hi ← max(hi, h);
  hi ← hi + 1 % date of the receive event %
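As a minimal Python sketch of these three rules (the class name and the method split per rule are illustrative, not from the slides):

```python
class LamportClock:
    """Scalar (Lamport) clock of one process, following the three rules above."""

    def __init__(self):
        self.h = 0

    def internal_event(self):
        self.h += 1                   # local progress rule
        return self.h                 # date of the internal event

    def send_event(self):
        self.h += 1                   # sending rule: tick, then piggyback h on m
        return self.h                 # value sent along with the message

    def receive_event(self, h_msg):
        self.h = max(self.h, h_msg)   # catch up with the sender's clock
        self.h += 1                   # receiving rule
        return self.h                 # date of the receive event
```

For example, a process sends a message at date 1; a fresh receiver dates the receive event 2, so the clock increases along the causal path.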

(7)

Illustration

[Figure: a three-process execution (p1, p2, p3) annotated with the scalar clock values h1, h2, h3 along each process axis]

Observation: date(e) = x means that there are x events on the longest causal path ending at e

(8)

Build a total order on all the events

• Motivation: Resource allocation problems

• Observations

- (date(e) < date(f)) ⇒ ¬(f −→ e)

- (date(e) < date(f)) ∧ ¬(e −→ f) is possible

In that case, e and f are independent events, but this cannot be concluded from the dates alone

- (date(e) = date(f)) ⇒ e||f

• Associate a timestamp (h, i) with each event e where:

- h = local clock value when e is produced (date)

- i = identity of the process that produced e

(9)

Total order definition

• Let e and f be events timestamped (h, i) and (k, j)

  (e −→TO f) def= (h < k) ∨ ((h = k) ∧ (i < j))

• This is the (well-known) lexicographical ordering
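Since Python tuples compare lexicographically, this total order can be sketched directly (the function name is illustrative):

```python
def to_before(ts1, ts2):
    """Lamport's total order on timestamps (h, i): date first, then process id."""
    (h, i), (k, j) = ts1, ts2
    return h < k or (h == k and i < j)

# Python's built-in tuple comparison yields the same lexicographic order,
# so to_before(ts1, ts2) == (ts1 < ts2) for distinct timestamps.
```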

(10)

Illustration

[Figure: a two-process execution with consistent global states Σ = [·,·] and timestamped events; an observer's clock H orders the events according to their timestamps, yielding one sequential observation of the computation]

Lamport's timestamps capture an observation (among all possible observations)

(11)

A theorem on the space of scalar clocks

• Let C be the set of all the scalar clock systems that are consistent (with respect to the causality relation)

• Let e and f be any two events of a distributed execution

• ∀C ∈ C: e −→ev f ⇒ dateC(e) < dateC(f) (Consistency)

• e||f ⇔ ∃C ∈ C : dateC(e) = dateC(f)

• Or equivalently

e||f ⇔ ∃C1, C2 ∈ C :

dateC1(e) ≤ dateC1(f) ∧ dateC2(e) ≥ dateC2(f)

(12)

Part II

SCALAR CLOCKS in ACTION

(13)

The Mutex problem

• Enrich the underlying system with new operations

• These operations define a service

• Here, two operations: acquire() and release()

• Process behavior: abstracted in a 3-state automaton, statei ∈ {out, asking, in} (all other details are irrelevant)

[Figure: the 3-state automaton out → asking → in → out]

(14)

The Mutex problem: definition

• Definition

- Safety: no two processes are concurrently in the CS

- Liveness: any request is eventually granted

• Algorithms

- Permission-based (individual vs arbiter permissions)

- Token-based

- Raynal M., Algorithms for mutual exclusion. The MIT Press, 1986

- Anderson J., Kim Y.-J. and Herman T., Shared-memory mutual exclusion: major research trends since 1986. Distributed Computing, 16(2-3): 75-110, 2003

(15)

Individual permissions: principles

• When it wants to enter the CS, pi asks for permissions

• When it has received all the permissions, pi enters

• Ri = the set of processes from which pi needs the permission to enter the CS

• Individual permission: Ri = {1, . . . , n} \ {i}

• When pi gives its permission to a process pk, the meaning of the permission is “As far as I am concerned, you can enter” (a permission is consequently “individual”)

• Core of the algorithm: the way permissions are granted

• The algorithm manages bilateral conflicts

(16)

Granting a permission

[Figure: pi (statei = asking) sends its request to pj (statej = out, which answers perm by return) and to pk (statek ≠ out)]

• Solve the conflict between pi and pk

• A solution: timestamp the requests, and use the total order on timestamps to establish a system-wide consis- tent priority

- If pk does not have priority, it sends its permission back to pi immediately

(17)

From mechanisms to properties

• Safety: ∀i ≠ j : j ∈ Ri ∧ i ∈ Rj

• Liveness: based on a timestamping mechanism

(18)

Ricart-Agrawala mutex algorithm: local variables

• statei ∈ {out, asking, in}, init out

• hi, lasti: integers, init 0

• prioi: boolean

• waiting_fromi, postponedi: sets

(19)

Structure

[Figure: module structure of the algorithm: the operations acquire() and release(), the handlers for perm(j) and req(k, j) messages, and the local variables]

(20)

Ricart-Agrawala mutex algorithm (1)

operation acquire() issued by pi:
  statei ← asking; postponedi ← ∅;
  hi ← hi + 1; lasti ← hi; waiting_fromi ← Ri;
  for each j ≠ i do send req(lasti, i) to pj end for;
  wait (waiting_fromi = ∅);
  statei ← in

when perm(j) is received:
  waiting_fromi ← waiting_fromi \ {j}

(21)

Ricart-Agrawala mutex algorithm (2)

when req(k, j) is received:
  hi ← max(hi, k) + 1;
  prioi ← (statei ≠ out) ∧ ((lasti, i) < (k, j));
  if prioi then postponedi ← postponedi ∪ {j}
           else send perm(i) to pj
  end if

operation release():
  for each j ∈ postponedi do send perm(i) to pj end for;
  statei ← out
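The two slides above can be condensed into a runnable Python sketch; real channels are abstracted away (each handler returns the list of (destination, message) pairs it would send), and names such as `on_req`/`on_perm` are illustrative:

```python
class RicartAgrawala:
    """Local state of one process pi among n processes (ids 0..n-1)."""

    def __init__(self, i, n):
        self.i, self.n = i, n
        self.state = "out"
        self.h = 0                    # scalar clock
        self.last = 0                 # timestamp of the current request
        self.postponed = set()
        self.waiting_from = set()

    def acquire(self):
        self.state = "asking"
        self.postponed = set()
        self.h += 1
        self.last = self.h
        self.waiting_from = {j for j in range(self.n) if j != self.i}
        # send req(last, i) to every other process
        return [(j, ("req", self.last, self.i)) for j in sorted(self.waiting_from)]

    def on_perm(self, j):
        self.waiting_from.discard(j)
        if self.state == "asking" and not self.waiting_from:
            self.state = "in"         # all permissions received: enter the CS

    def on_req(self, k, j):
        self.h = max(self.h, k) + 1
        prio = self.state != "out" and (self.last, self.i) < (k, j)
        if prio:
            self.postponed.add(j)     # answer deferred until release()
            return []
        return [(j, ("perm", self.i))]

    def release(self):
        msgs = [(j, ("perm", self.i)) for j in sorted(self.postponed)]
        self.state = "out"
        return msgs
```

The timestamp comparison `(last, i) < (k, j)` is exactly the lexicographic total order of Part I, which makes the priority system-wide consistent.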

(22)

Clock values

• hi can increase forever

• Aim: limit its increase

• As only requests have to be timestamped we can replace hi ← max(hi, k) + 1 with hi ← max(hi, k)

• As we are about to see in the proof, it is possible to further limit the increase in the acquire() operation:

The two statements [hi ← hi+1; lasti ← hi] are replaced by lasti ← hi + 1, which does not increase hi!

• These two modifications allow obtaining an algorithm in which all variables are bounded: clock values can be implemented modulo 2n − 1

(23)

Ricart-Agrawala mutex algorithm

operation acquire() issued by pi:
  statei ← asking; postponedi ← ∅;
  lasti ← hi + 1; % replaces hi ← hi + 1; lasti ← hi %
  waiting_fromi ← Ri;
  for each j ≠ i do send req(lasti, i) to pj end for;
  wait (waiting_fromi = ∅);
  statei ← in

when req(k, j) is received:
  hi ← max(hi, k); % replaces hi ← max(hi, k) + 1 %
  prioi ← (statei ≠ out) ∧ ((lasti, i) < (k, j));
  if prioi then postponedi ← postponedi ∪ {j}
           else send perm(i) to pj
  end if

(24)

Proof: on the blackboard!

• Safety: by contradiction

• Liveness: in two steps

- No deadlock: at least one process acquires the CS

- No starvation: eventually any requesting process is granted the CS

(25)

Cost

• Message cost: 2(n − 1) messages per CS use

• Improvement:

- The algorithm can be improved in such a way that a CS use costs between 0 and 2(n − 1) messages

- Idea: every pair of processes manages a single permission (token) to solve their conflicts

• Time: consider that each message takes one time unit

- Heavy load: one time unit

- Light load: two time units

(26)

Variants

• Ring structure

- Forwarding a request = giving one's permission

- Cost: n messages

• Assumption ∆ on transfer delays

- Giving one's permission = not answering

- Not giving one's permission = sending a negative ack by return, cancelled when exiting the CS

(27)

On mutual exclusion

• Permission-based

- Individual permission approach

- Arbiter permission approach: quorums and three-way handshake algorithms

• Token-based

• A continuous view

- Raynal M., Algorithms for mutual exclusion, The MIT Press, 1986

- Anderson J., Kim Y.-J. and Herman T., Shared-memory mutual exclusion: major research trends since 1986. Distributed Computing, 16(2-3):75-110, 2003

(28)

Part III

VECTOR TIME

- Fidge C., Timestamps in Message-Passing Systems that Preserve the Partial Ordering. Proc. 11th Australian Computer Science Conference, pp. 56-66, 1988

- Mattern F., Virtual time and global states of distributed systems. Proc. Int'l workshop on Parallel and Distributed Systems, North-Holland, pp. 215-226, (Cosnard, Quinton, Raynal and Robert Eds), 1988

- Baldoni R. and Raynal M. Fundamentals of Distributed Computing: A Practical Tour of Vector-Clock Systems. IEEE Distributed Systems Online, 3(2):1-18, 2002

(29)

Aim: capture the causality relation

• Scalar (linear) clock system

? Respects causality

? But does not capture it

• Find a dating system that captures causality exactly (e −→ev f) ⇔ date(e) < date(f)

(e||f) ⇔ date(e) and date(f) cannot be compared

(30)

Vector clock: intuition

• Observation: a process pi can always measure its progress by counting the number of events it has produced since the beginning

This number can be seen as its logical local clock There is one such clock per process

• The time domain is consequently n-dimensional: there is one dimension associated with each process

• Hence the idea of vector clocks: each process pi manages a vector V Ci[1..n] that represents its view of the global time progress

V Ci is a digest of the current causal past of pi

(31)

Vector clock: definition

• V Ci[i] = nb of events issued by pi

• V Ci[j] = nb of events issued by pj, as known by pi

  Formally, let e be the current event produced by pi:

  V Ci[j] = |{f | f −→ e ∧ f has been issued by pj}|

• Notation: component-wise maximum/minimum

max(V 1, V 2) = [max(V 1[1], V 2[1]), · · · , max(V 1[n], V 2[n])]

min(V 1, V 2) = [min(V 1[1], V 2[1]), · · · , min(V 1[n], V 2[n])]

(32)

Vector clock: algorithm

Local progress rule:

before producing an internal event:
  V Ci[i] ← V Ci[i] + 1

Sending rule:

when sending a message m to pj:
  V Ci[i] ← V Ci[i] + 1;
  send (m, V Ci) to pj

Receiving rule:

when receiving a message (m, V C) from pj:
  V Ci[i] ← V Ci[i] + 1;
  V Ci ← max(V Ci, V C)
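A minimal Python sketch of these three rules (the class name and the method split per rule are illustrative):

```python
class VectorClock:
    """Vector clock of process pi among n processes, following the rules above."""

    def __init__(self, i, n):
        self.i = i
        self.vc = [0] * n

    def internal_event(self):
        self.vc[self.i] += 1
        return list(self.vc)          # timestamp associated with the event

    def send_event(self):
        self.vc[self.i] += 1          # the send is an event too
        return list(self.vc)          # vector piggybacked on the message

    def receive_event(self, vc_msg):
        self.vc[self.i] += 1          # the receive is an event too
        self.vc = [max(a, b) for a, b in zip(self.vc, vc_msg)]
        return list(self.vc)
```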

(33)

Illustration

[Figure: a four-process execution (p1, . . . , p4) annotated with vector timestamps such as [0,1,0,0], [0,2,0,0], [0,3,0,1], [0,3,0,2], [1,2,0,0], [0,3,2,2]]

∀i, k: V Ci[k] is not decreasing, and V Ci[k] ≤ V Ck[k]

(34)

A few simple definitions

• V 1 ≤ V 2 def= ∀k : V 1[k] ≤ V 2[k]

• V 1 < V 2 def= (V 1 ≤ V 2) ∧ (V 1 ≠ V 2)

• V 1||V 2 def= ¬(V 1 ≤ V 2) ∧ ¬(V 2 ≤ V 1)
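These three definitions translate directly into Python (function names are illustrative):

```python
def v_le(v1, v2):
    """V1 <= V2: every entry of V1 is bounded by the matching entry of V2."""
    return all(a <= b for a, b in zip(v1, v2))

def v_lt(v1, v2):
    """V1 < V2: V1 <= V2 and the vectors differ."""
    return v_le(v1, v2) and v1 != v2

def v_concurrent(v1, v2):
    """V1 || V2: neither vector dominates the other."""
    return not v_le(v1, v2) and not v_le(v2, v1)
```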

(35)

The vector clock properties

Let e with date(e) = Ve , and f with date(f) = Vf

(e −→ev f) ⇔ (Ve < Vf)

(e || f) ⇔ (Ve || Vf)

These are the fundamental properties provided by vector clocks

(36)

Proof (1)

• Theorem 1: Vector clocks increase along causal paths

• Theorem 2: (e −→ev f) ⇔ (Ve < Vf)

- (e −→ev f) ⇒ (Ve < Vf): follows from Theorem 1

- (Ve < Vf) ⇒ (e −→ev f):

  Let pi be the process that issued the event e. We have (Ve < Vf) ⇒ (Ve[i] ≤ Vf[i]). As only pi can entail an increase of the ith entry of any vector V, it follows that there is a causal path from e to f.

(37)

Proof and cost

• Theorem 3: (e || f) ⇔ (Ve || Vf).

(e || f) def= ¬(e −→ev f) ∧ ¬(f −→ev e) (definition)

- ¬(e −→ev f) ⇒ ¬(Ve < Vf)

- ¬(f −→ev e) ⇒ ¬(Vf < Ve)

It follows that Ve and Vf cannot be compared.

• Theorem 4: The previous (causality/independence) predicates require O(n) comparisons

(38)

Refining the causality test

• Let us associate a timestamp (Ve, i) with each event e, where pi is the process that issued e

• Let e timestamped (Ve, i) and f timestamped (Vf, j)

• Refined causality test:

(e −→ev f) ⇔ (Ve[i] ≤ Vf[i])

• Refined independence test:

(e || f) ⇔ (Ve[i] > Vf[i]) ∧ (Vf[j] > Ve[j])

• Theorem 5: The previous (causality/independence) predicates now require O(1) comparisons (scalability of the test)
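The refined O(1) tests can be sketched in Python (function names are illustrative; the tests assume e and f are distinct events produced by pi and pj respectively):

```python
def causally_precedes(ve, i, vf):
    """e -> f, for e timestamped (Ve, i) and f timestamped (Vf, j), e != f."""
    return ve[i] <= vf[i]

def independent(ve, i, vf, j):
    """e || f: neither event appears in the other's causal past."""
    return ve[i] > vf[i] and vf[j] > ve[j]
```

Each test reads a constant number of vector entries, instead of the n comparisons needed when comparing whole vectors.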

(39)

A process is a “local” observer

[Figure: a two-process execution; each process traverses its sequence of local states σ, and the vector timestamps ([0,1], [1,1], [2,1], [0,2], . . . ) record the computation as that process sees it]

A process is a "local" observer of the computation

(40)

A vector clock denotes a global state

[Figure: a two-process execution; the vector timestamps Σa = [2,1] and Σb = [1,2] each denote a consistent global state, and Σc = max(Σa, Σb) = [2,2] denotes a consistent global state as well]

(41)

The development of logical time (1)

[Figure: a message m creates a causal path from its send event on pi, where Vi[i] = s, to its receive event on pj, where Vj[j] = r]

• m: sent by pi at Vi[i] = s, received by pj at Vj[j] = r

• “Knowing” the receipt of m ⇒ “knowing” its sending

• I.e., for any event x: (Vx[j] ≥ r) ⇒ (Vx[i] ≥ s)

• Due to m it is impossible to have (Vx[j] ≥ r)∧(Vx[i] < s)

(42)

The development of logical time (2)

[Figure: a two-process execution; p1's successive events carry the timestamps 1,0 2,0 3,0 4,2 5,2 6,5 and p2's the timestamps 0,1 0,2 0,3 3,4 3,5 3,6; messages m1, m2, m3, m4 are exchanged between the two processes]

Message m1 makes it impossible to have (V [1] < 2) ∧ (V [2] ≥ 6)

(43)

Part IV

VECTOR CLOCKS in ACTION (1)

CAUSAL ORDER ABSTRACTION

(44)

Causal order abstraction

- Birman K., Schiper A. and Stephenson P., Lightweight causal and atomic group multicast. ACM Transactions on Computer Systems, 9(3):272-314, 1991

- Raynal M., Schiper A. and Toueg S., The causal ordering abstraction and a simple way to implement it. Information Processing Letters, 39:343-351, 1991

• co_broadcast(m): allows a process to send a message m to all processes

• co_deliver(): allows a process to deliver a message

(45)

Causal delivery: definition

• Termination: If a message is co_broadcast, it is eventually co_delivered (no loss)

• Integrity: A process co_delivers a message m at most once (no duplication)

• Validity: If a process co_delivers a message m, then m has been co_broadcast (no spurious message)

• Causal Order:

  co_broadcast(m1) → co_broadcast(m2) ⇒ co_deliver(m1) → co_deliver(m2)

(46)

Causal delivery: Why it is useful

• Capture causality

• Cooperative work

• Stronger than fifo channels

• But weaker than atomic broadcast

Atomic broadcast = total order delivery

(47)

Causal order: Example 1

[Figure: example execution]

(48)

Causal order: Example 2

[Figure: example execution]

(49)

Causal broadcast

V Ci[j] = nb messages broadcast by Pj (to pi’s knowledge)

[Figure: a four-process execution with broadcast messages m1, . . . , m5]

• V Cm2 = [1, 1, 0, 0], V Cm3 = [1, 1, 1, 0]

• V Cm4 = [1, 2, 0, 0], V Cm5 = [1, 2, 2, 0]

(50)

Illustration

[Figure: a three-process execution annotated with the vector clock values carried by the broadcast messages]

(51)

RST algorithm

operation co_broadcast(m):
  for each j ≠ i do send (m, V Ci) to pj end for;
  V Ci[i] ← V Ci[i] + 1

when (m, m.V C) is received from pj:
  wait until (∀k : V Ci[k] ≥ m.V C[k]);
  co_deliver m to the application;
  V Ci[j] ← V Ci[j] + 1
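A runnable Python sketch of this algorithm, with the wait modelled by a pending buffer (class and method names are illustrative):

```python
class CausalBroadcast:
    """Sketch of the RST algorithm for process pi: vc[j] counts the messages
    broadcast by pj that pi already accounts for; a received message is kept
    pending until every message in its causal past has been delivered."""

    def __init__(self, i, n):
        self.i = i
        self.vc = [0] * n
        self.pending = []             # received but not yet deliverable
        self.delivered = []           # messages passed to the application

    def co_broadcast(self, m):
        stamp = list(self.vc)         # causal past of m
        self.vc[self.i] += 1          # the sender counts its own broadcast
        return (m, stamp, self.i)     # to be sent to every other process

    def on_receive(self, msg):
        self.pending.append(msg)
        self._try_deliver()

    def _try_deliver(self):
        progress = True
        while progress:
            progress = False
            for msg in list(self.pending):
                m, stamp, j = msg
                # deliverable once vc dominates the message's causal past
                if all(v >= s for v, s in zip(self.vc, stamp)):
                    self.pending.remove(msg)
                    self.delivered.append(m)
                    self.vc[j] += 1
                    progress = True
```

For example, if a message m2 causally after m1 overtakes m1 on the network, the receiver keeps m2 pending and delivers m1 then m2, in causal order.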

(52)

Part V

VECTOR CLOCKS in ACTION (2)

PREDICATE DETECTION

(53)

Stable Local Predicate Detection (1)

• Local predicate LPi: on the local variables of a single process pi

• Stable predicate: once true, remains true

• A consistent global state Σ = (σ1, · · · , σn) satisfies the global predicate LP1 ∧ LP2 · · · ∧ LPn if ∀ i : (σi |= LPi)

Σ |= (∧i LPi) ⇔ ∧i (σi |= LPi)

(54)

Stable Local Predicate Detection (2)

• Problem: Design an algorithm that detects the first consistent global state that satisfies a conjunction of stable local predicates

• Constraints: Do not use additional control messages, Detection must be done on the fly

(55)

Stable Local Predicate Detection (3)

[Figure: a three-process execution with messages m1, . . . , m5; each Pi moves from its initial state σi0 through σixi (where LPi becomes true) to σiyi, and Σ = (σ1y1, σ2y2, σ3y3) is the detected consistent global state]

(56)

Stable Local Predicate Detection (4)

[Figure: the same execution restricted to the relevant states and messages m1, m2, m3; the detected global state is Σ = (σ1y1, σ2y2, σ3y3)]

(57)

Detection algorithm: local context of pi

• V Ci[1..n]: local vector clock

• SATi: set of process identities such that

  j ∈ SATi ⇔ pj entered a local state σjx from which LPj is true

• F IRSTi: first global state (as known by pi) in which all the local predicates LPj such that j ∈ SATi are satisfied

(58)

Detection algorithm (1)

procedure detected? is
  if SATi = {1, 2, . . . , n} then
    F IRSTi defines the first consistent global state Σ that satisfies ∧j LPj
  fi

procedure check_LPi is
  if (σix |= LPi) then
    SATi := SATi ∪ {i};
    F IRSTi := V Ci;
    donei := true;
    detected?
  fi

(59)

Detection algorithm (2)

(S1) when Pi produces an internal event e:
  V Ci[i] := V Ci[i] + 1;
  execute e and move to the next local state σ;
  if ¬donei then check_LPi fi

(60)

Detection algorithm (3)

(S2) when Pi produces a send event e = send(m) to Pj:
  V Ci[i] := V Ci[i] + 1;
  move to the next local state σ;
  if ¬donei then check_LPi fi;
  m.V C := V Ci; m.SAT := SATi; m.F IRST := F IRSTi;
  send (m) to Pj
  % m carries m.V C, m.SAT and m.F IRST %

(61)

Detection algorithm (4)

(S3) when Pi produces a receive event e = receive(m):
  V Ci[i] := V Ci[i] + 1; V Ci := max(V Ci, m.V C);
  move to the next local state σ; % by delivering m to the process %
  if ¬donei then check_LPi fi;
  if ¬(m.SAT ⊆ SATi) then
    SATi := SATi ∪ m.SAT;
    F IRSTi := max(F IRSTi, m.F IRST);
    detected?
  fi
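The merge performed in (S3) can be sketched in Python (the function name is illustrative):

```python
def merge_detection_state(sat_i, first_i, m_sat, m_first):
    """Sketch of the merge in rule (S3): combine the sender's detection state
    (m.SAT, m.FIRST) into pi's. The component-wise max of the FIRST vectors
    denotes the earliest global state covering both causal pasts."""
    if not m_sat.issubset(sat_i):
        sat_i |= m_sat
        first_i[:] = [max(a, b) for a, b in zip(first_i, m_first)]
    return sat_i, first_i
```

The component-wise max works because, as shown in Part III, a vector clock value denotes a consistent global state, and the max of two such vectors is again a consistent global state.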

(62)

Part VI

LIMIT of VECTOR CLOCKS

DETECTION OF A SIMPLE EVENT PATTERN

- Raynal M., Illustrating the Use of Vector Clocks in Property Detection: an Example and a Counter-Example. Proc. 5th Int'l European Parallel Computing Conference (EUROPAR'99), Springer LNCS 1685, pp. 806-814, 1999

(63)

Pattern Recognition (1)

• Some internal events are tagged black, the others are tagged white

• All communication events are tagged white

• The problem: Given two black events s and t, does there exist a black event u such that (s → u) ∧ (u → t)?

• Formally, P(s, t) is the conjunction of:

- black(s) ∧ black(t)

- ∃u ≠ s, t : (black(u) ∧ (s → u) ∧ (u → t))

(64)

Pattern Recognition

[Figure: two three-process executions with black events s, u, t; in the first, u is not causally between s and t, so P(s, t) is false; in the second, s → u → t holds, so P(s, t) is true]

Counting white and black events: s.V C = (0,0,2) and t.V C = (3,4,2) in both cases

Counting only black events: s.V C = (0,0,1) and t.V C = (2,1,1) in both cases

(65)

Non-Triviality of the Problem

[Figure: a three-process execution with black events s, u, b, a, t1, t2 and white events c, d]

P(s, t2) is true while P(s, t1) is not

(66)

Decomposing the Predicate

• P(s, t) ≡ (∃u : P1(s, u, t) ∧ P2(s, u, t))

- P1(s, u, t) ≡ (black(s) ∧ black(u) ∧ black(t))

- P2(s, u, t) ≡ ((s → u) ∧ (u → t))

(67)

Using Vector of Vector Clocks

• Only black events are relevant: count only them

• For each event e:

- e.V C: its vector timestamp (counting only black events)

- e.M C[1..n]: an array of vector timestamps

  e.M C[j] contains the vector timestamp of the last black event of Pj that causally precedes e

  e.M C[j] can be considered as a “pointer” from e to the last black event that precedes e on Pj

(68)

Example (1)

[Figure: the same three-process execution as before]

t1.M C[1] = a.V C means that t1.M C[1] points to a

t1.M C[2] = b.V C means that t1.M C[2] points to b

t1.M C[3] = s.V C means that t1.M C[3] points to s

(69)

Example (2)

[Figure: the same three-process execution as before]

t2.M C[1] = t1.V C means that t2.M C[1] points to t1

t2.M C[2] = u.V C means that t2.M C[2] points to u

t2.M C[3] = s.V C means that t2.M C[3] points to s

(70)

Operational Predicate

• Event s: s.V C and s.M C Event t: t.V C and t.M C

• P1 is trivially satisfied by any triple of events

• (∃u : P2(s, u, t)) ≡ (∃u : s → u → t) can be restated as:

(∃u : s → u → t) ≡ (∃u : s.V C < u.V C < t.V C)

(∃u : s → u → t) ≡ (∃pk : s.V C < t.M C[k] < t.V C)

As ∀k : t.M C[k] < t.V C, we get the operational predicate:

P(s, t) ≡ (∃k : s.V C < t.M C[k])
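The operational predicate reduces to a scan over the n entries of t.MC (the vector values in the usage below are hypothetical, chosen only to exercise both outcomes):

```python
def v_lt(v1, v2):
    """Strict vector order: component-wise <= and the vectors differ."""
    return all(a <= b for a, b in zip(v1, v2)) and v1 != v2

def pattern_present(s_vc, t_mc):
    """Operational test for P(s, t): true iff some black u with s -> u -> t
    exists, i.e. s.VC < t.MC[k] for some k (all vectors count black events)."""
    return any(v_lt(s_vc, mc_k) for mc_k in t_mc)
```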

(71)

The Protocol (1)

(S1) when Pi produces a black event e:
  V Ci[i] := V Ci[i] + 1; % one more black event on Pi %
  e.V C := V Ci; e.M C := M Ci;
  M Ci[i] := V Ci % vector timestamp of Pi's last black event %

(72)

The Protocol (2)

(S2) when Pi executes a send event e = send(m) to Pj:
  m.V C := V Ci; m.M C := M Ci;
  send (m) to Pj % m carries m.V C and m.M C %

(73)

The Protocol (3)

(S3) when Pi executes a receive event e = receive(m):
  V Ci := max(V Ci, m.V C); % update of the local vector clock %
  ∀k : M Ci[k] := max(M Ci[k], m.M C[k])
  % record the vector timestamps of the last black predecessors %

(74)

What has been learnt

• Power of vector clocks: to track (counter-based) causality: "first order" predecessor tracking

• Limitation of vector clocks: to solve problems where causality cannot be reduced to event counting: "second order" (or more) predecessor tracking

(75)

Part VII

VECTOR CLOCKS in ACTION (3)

DETERMINING IMMEDIATE PREDECESSORS

- Anceaume E., Helary J.-M. and Raynal M., A Note on the Determination of the Immediate Predecessors in a Distributed Computation. Int. Journal of Foundations of Computer Science (IJFCS), 13(6):865-872, 2002

- Helary J.-M., Raynal M., Melideo G., and Baldoni R., Efficient Causality-Tracking Timestamping. IEEE Transactions on Knowledge and Data Engineering, 15(5):1239- 1250, 2003

(76)

Relevant Events

• At some abstraction level, only some events of a distributed computation are relevant

• Let R ⊆ H be the set of relevant events

• Let → be the relation on R defined in the following way:

∀ (e, f) ∈ R × R : (e → f) ⇔ (e −→ev f).

• The poset (R, →) constitutes an abstraction of the distributed computation

Without loss of generality we consider that the set of relevant events is a subset of the internal events (if a communication event has to be observed, a relevant internal event can be generated just after the corresponding communication event)

(77)

A Distributed Computation

[Figure: a three-process distributed computation]

(78)

Vector Clocks (2)

VC0: V Ci[1..n] initialized to [0, . . . , 0]

VC1: Each time pi produces a relevant event e:

- It increments its vector clock entry V Ci[i] to indicate its progress: V Ci[i] := V Ci[i] + 1

- It associates with e its timestamp e.V C = V Ci

VC2: When a process pi sends a message m, it attaches to it the current value of V Ci (let m.V C denote this value)

VC3: When pi receives a message m, it updates its vector clock: ∀x : V Ci[x] := max(V Ci[x], m.V C[x])

(79)

Vector Clocks (3)

• V Ci = current knowledge of pi on the progress of each process pk (measured by V Ci[k])

• More precisely:

  V Ci[k] = number of relevant events produced by pk and known by pi

(80)

Vector Clocks: Example

[Figure: a three-process computation; the relevant events, identified by pairs (i, x), carry vector timestamps such as [1,0,0], [2,0,1], [3,2,1], [1,1,0], [2,2,1], [2,3,1], [0,0,1], [1,1,2]]

(81)

Immediate Predecessor Tracking: the Problem

• Given two relevant events e and f, we say that e is an immediate predecessor of f if:

- e → f, and

- there is no relevant event g such that e → g → f

• The Immediate Predecessor Tracking (IPT) problem consists in associating with each relevant event e the set of relevant events that are its immediate predecessors

Moreover, this has to be done on the fly and without additional control messages (i.e., without modifying the communication pattern of the computation)

(82)

Immediate Predecessor Tracking: Why?

• Capture the very structure of the causal past of each event

• Allow the analysis of distributed computations

(e.g., detection of global predicates, analysis

of control flows)

(83)

Distributed Computation and its Reduction

[Figure: the three-process computation above, and its reduction to the poset of relevant events (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2)]

(84)

Transitive Reduction (Hasse Diagram)

[Figure: the transitive reduction (Hasse diagram) of the relevant events (1,1), . . . , (3,2)]

(85)

Basic IPT Protocol (1)

Each pi manages:

• A vector clock V Ci

• A boolean array IPi whose meaning is:

  (IPi[j] = 1) ⇔ the last relevant event produced by pj and known by pi is an immediate predecessor of pi's current event

(86)

Basic IPT Protocol (2)

R0: Both V Ci[1..n] and IPi[1..n] are initialized to [0, . . . , 0]

R1: Each time pi produces a relevant event e:

- It increments its VC entry: V Ci[i] := V Ci[i] + 1

- It associates with e the timestamp e.T S = {(k, V Ci[k]) | IPi[k] = 1}

- It resets IPi: ∀ℓ ≠ i : IPi[ℓ] := 0; IPi[i] := 1

R2: When pi sends a message m to pj, it attaches to m the current values of V Ci (denoted m.V C) and of the boolean array IPi (denoted m.IP)

(87)

How to Manage the IPi Vectors? (1)

[Figure: a three-process execution with a message m, before the update of the IPi vectors]

(88)

How to Manage the IPi Vectors? (2)

[Figure: the same three-process execution, after the update of the IPi vectors]

(89)

Basic IPT Protocol (3)

R3 When it receives a message m from pj, pi executes the following updates:

∀k : case
  V Ci[k] < m.V C[k] then V Ci[k] := m.V C[k]; IPi[k] := m.IP[k]
  V Ci[k] = m.V C[k] then IPi[k] := min(IPi[k], m.IP[k])
  V Ci[k] > m.V C[k] then skip
end case
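Rule R3 can be sketched as a Python function that updates pi's vectors in place (the function name is illustrative):

```python
def ipt_receive_update(vc_i, ip_i, m_vc, m_ip):
    """Rule R3: reconcile pi's (VCi, IPi) with the (m.VC, m.IP) carried by m."""
    for k in range(len(vc_i)):
        if vc_i[k] < m_vc[k]:            # the sender knows a fresher event of pk
            vc_i[k] = m_vc[k]
            ip_i[k] = m_ip[k]
        elif vc_i[k] == m_vc[k]:         # same last event of pk on both sides:
            ip_i[k] = min(ip_i[k], m_ip[k])  # immediate only if both agree
        # vc_i[k] > m_vc[k]: skip, pi already knows more about pk
    return vc_i, ip_i
```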

(90)

Efficient IPT? (1)

• Question: Is it possible to design an IPT protocol that does not require each message m to carry a vector clock m.V C and a boolean vector m.IP whose size is always n?

• Answer: Yes! ... but how?

(91)

Efficient IPT (2): Towards a General Condition

Underlying intuition:

[Figure: pi sends m to pj; at the send event V Ci[k] = x and IPi[k] = 1, and at the receive event V Cj[k] ≥ x]

(92)

Efficient IPT (3): a General Condition

• Let e.Xi = value of the variable Xi of pi when it produces e

• Let K(m, k) be the following predicate:

  (1) (send(m).V Ci[k] = 0)

  (2) ∨ (send(m).V Ci[k] < pred(receive(m)).V Cj[k])

  (3) ∨ ((send(m).V Ci[k] = pred(receive(m)).V Cj[k]) ∧ (send(m).IPi[k] = 1))

(93)

Efficient IPT (3): a General Condition

• Theorem 1: The condition K(m, k) is both necessary and sufficient to omit the transmission of V Ci[k] and IPi[k] when m is sent by pi to pj

(94)

Efficient IPT (4): Towards a Concrete Condition

• K(m, k) involves events on two processes (send(m) at pi and receive(m) at pj), and consequently cannot be atomically evaluated by a single process

• Replace it by a “concrete” condition C(m, k) that:

- Can be locally evaluated by a process just before it sends a message, and

- Is a correct approximation of K(m, k), i.e., C(m, k) has to be such that ∀m, k : C(m, k) ⇒ K(m, k)

(95)

Efficient IPT (5): Towards a Concrete Condition

• The “constant” condition ∀(m, k) : KC(m, k) = false works

It is actually the trivially correct approximation of K that corresponds to the basic IPT protocol, in which each message m carries a whole vector clock m.V C and a whole boolean vector m.IP

• Let us equip each process Pi with an additional matrix Mi of 0/1 values such that

(Mi[j, k] = 1) ⇔ (to Pi’s knowledge: V Cj[k] ≥ V Ci[k])

(96)

An Implementation of the Matrices Mi

M0: ∀(j, k) : Mi[j, k] is initialized to 1

M1: Each time it produces a relevant event e, pi resets the ith column of its boolean matrix: ∀j ≠ i : Mi[j, i] := 0

M2: When pi sends a message: no update of Mi occurs

M3: When it receives a message m from pj, pi executes the following updates (m.V C is carried by m):

∀k : case
  V Ci[k] < m.V C[k] then ∀ℓ ≠ i, j, k : Mi[ℓ, k] := 0; Mi[j, k] := 1
  V Ci[k] = m.V C[k] then Mi[j, k] := 1
  V Ci[k] > m.V C[k] then skip
end case

(97)

A Concrete Condition

• Let m be a message sent by pi to pj and

  C(m, k) = ((send(m).Mi[j, k] = 1) ∧ (send(m).IPi[k] = 1)) ∨ (send(m).V Ci[k] = 0)

• Theorem 2: ∀k : C(m, k) ⇒ K(m, k)

(98)

An Efficient IPT Protocol (1)

RM0: Both V Ci[1..n] and IPi[1..n] are set to [0, . . . , 0], and ∀(j, k) : Mi[j, k] is set to 1

RM1: Each time pi produces a relevant event e:

- It increments its VC entry: V Ci[i] := V Ci[i] + 1

- It associates with e the timestamp e.T S = {(k, V Ci[k]) | IPi[k] = 1}

- It resets IPi: ∀ℓ ≠ i : IPi[ℓ] := 0; IPi[i] := 1

(99)

An Efficient IPT Protocol (2)

RM2: When pi sends a message m to pj, it attaches to m the set of triples {(k, V Ci[k], IPi[k]) | ((Mi[j, k] = 0) ∨ (IPi[k] = 0)) ∧ (V Ci[k] > 0)}

RM3: When pi receives a message m from pj, it executes:

∀(k, m.V C[k], m.IP[k]) carried by m : case
  V Ci[k] < m.V C[k] then V Ci[k] := m.V C[k]; IPi[k] := m.IP[k];
                          ∀ℓ ≠ i, j, k : Mi[ℓ, k] := 0; Mi[j, k] := 1
  V Ci[k] = m.V C[k] then IPi[k] := min(IPi[k], m.IP[k]); Mi[j, k] := 1
  V Ci[k] > m.V C[k] then skip
end case

(100)

Properties of the IPT Protocol

• Improvement: Transmitting rows of Mi allows the processes to have more entries of their matrices equal to 1, and hence to transmit fewer triples

• If one is not interested in the IPT problem, s/he can suppress the IPi arrays. Then, we obtain an efficient implementation of vector clocks (that does not require fifo channels)

• A simulation study has shown the gains are substantial

(101)

Part VIII

MATRIX CLOCKS

- Wuu G.T. and Bernstein A.J., Efficient solutions to the replicated log and dictionary problems. Proc. 3rd Int'l ACM Symposium on Principles of Distributed Computing (PODC'84), ACM Press, pp. 233-242, 1984

(102)

Matrix clock

• Matrix clocks capture a “second order” knowledge

• Each process manages a time matrix M Ci[1..n, 1..n]

• M Ci[i, i] = nb of events produced by pi

• M Ci[i, k] = nb of events produced by pk, to pi's knowledge (this is nothing else than pi's vector clock)

• M Ci[j, k] = pi's knowledge of the nb of events produced by pk as known by pj

M Ci[j, k] = x means: to pi's knowledge, pj knows that pk has produced x events
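The slide's update rules are cut off in this extraction; as a hedged illustration only, here is a Python sketch using the standard Wuu-Bernstein-style rules (class and method names are illustrative):

```python
class MatrixClock:
    """Sketch of a matrix clock for process pi among n processes.
    mc[i] plays the role of pi's own vector clock; mc[j] records pi's
    knowledge of pj's vector clock."""

    def __init__(self, i, n):
        self.i, self.n = i, n
        self.mc = [[0] * n for _ in range(n)]

    def local_event(self):
        self.mc[self.i][self.i] += 1

    def send(self):
        self.local_event()
        return [row[:] for row in self.mc]      # a copy travels with m

    def receive(self, j, m_mc):
        self.local_event()
        for k in range(self.n):
            # every row absorbs m's rows: second-order knowledge update
            for l in range(self.n):
                self.mc[k][l] = max(self.mc[k][l], m_mc[k][l])
            # row i (pi's own vector clock) also absorbs the sender's row j
            self.mc[self.i][k] = max(self.mc[self.i][k], m_mc[j][k])
```

With such a matrix, min over j of M Ci[j, k] tells pi how many events of pk are known to every process, the kind of second-order information used by Wuu and Bernstein for log garbage collection.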
