
On-the-Fly Determination of the Immediate Predecessors

The Notion of a Relevant Event At some abstraction level, only a subset of the events are relevant. As an example, in some applications only the modifications of some local variables, or the processing of specific messages, are relevant. Given a distributed computation $\widehat{H} = (H, \xrightarrow{ev})$, let $R \subseteq H$ be the subset of its events that are defined as relevant.

Let us accordingly define the causal precedence relation on these events, denoted $\xrightarrow{re}$, as follows:

$$\forall\, e_1, e_2 \in R:\quad (e_1 \xrightarrow{re} e_2) \stackrel{\mathrm{def}}{=} (e_1 \xrightarrow{ev} e_2).$$

This relation is nothing else than the projection of $\xrightarrow{ev}$ onto the elements of $R$. The pair $\widehat{R} = (R, \xrightarrow{re})$ constitutes an abstraction of the distributed computation $\widehat{H} = (H, \xrightarrow{ev})$.

Without loss of generality, we consider that the set of relevant events consists only of internal events. Communication events are low-level events which, without being themselves relevant, participate in the establishment of causal precedence on relevant events. If a communication event has to be explicitly observed as relevant, an associated relevant internal event can be generated just before or after it.

An example of a distributed execution with both relevant events and non-relevant events is described in Fig. 7.17. The relevant events are represented by black dots, while the non-relevant events are represented by white dots.

The management of vector clocks restricted to relevant events (Fig. 7.18) is a simplified version of the one described in Fig. 7.9. As we can see, despite the fact that they are not relevant, communication events participate in the tracking of causal precedence. Given a relevant event e timestamped (e.vc[1..n], i), the integer e.vc[j] is the number of relevant events produced by pj and known by pi.

The Notion of an Immediate Predecessor and the Immediate Predecessor Tracking Problem Given two events $e_1, e_2 \in R$, the relevant event $e_1$ is an immediate predecessor of the relevant event $e_2$ if

$$(e_1 \xrightarrow{re} e_2) \;\wedge\; \big(\nexists\, e \in R:\ (e_1 \xrightarrow{re} e) \wedge (e \xrightarrow{re} e_2)\big).$$


Fig. 7.17 Relevant events in a distributed computation

when producing a relevant internal event e do
(1) vci[i] ← vci[i] + 1;
(2) Produce the relevant event e. % The date of e is vci[1..n].

when sending MSG(m) to pj do
(3) send MSG(m, vci[1..n]) to pj.

when MSG(m, vc) is received from pj do
(4) vci[1..n] ← max(vci[1..n], vc[1..n]).

Fig. 7.18 Vector clock system for relevant events (code for process pi)
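As an illustration, here is a minimal Python sketch of the rules of Fig. 7.18 (the Process class and the way messages are modeled are assumptions made for this example, not part of the book's notation): only relevant events increment the local entry of the vector, while send and receive events merely propagate and merge vectors.

class Process:
    # Holds the vector clock vci[1..n] of Fig. 7.18 (processes are 0-indexed).
    def __init__(self, pid, n):
        self.pid = pid
        self.vc = [0] * n                  # vci[1..n], initialized to zeros

    def relevant_event(self):
        self.vc[self.pid] += 1             # line 1: vci[i] <- vci[i] + 1
        return list(self.vc)               # line 2: the date of e is vci[1..n]

    def send(self, m):
        return (m, list(self.vc))          # line 3: piggyback vci[1..n]

    def receive(self, msg):
        m, vc = msg
        self.vc = [max(a, b) for a, b in zip(self.vc, vc)]   # line 4
        return m

# A relevant event on p0, then a message from p0 to p1, then one on p1.
p0, p1 = Process(0, 2), Process(1, 2)
e1 = p0.relevant_event()                   # e1 is dated [1, 0]
p1.receive(p0.send("m"))                   # communication events: no increment
e2 = p1.relevant_event()                   # e2 is dated [1, 1], so e1 ->re e2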

Fig. 7.19 From relevant events to Hasse diagram

The immediate predecessor tracking (IPT) problem consists in associating with each relevant event the set of relevant events that are its immediate predecessors.

Moreover, this has to be done on the fly and without adding control messages. The determination of the immediate predecessors consists in computing the transitive reduction (or Hasse diagram) of the partial order $\widehat{R} = (R, \xrightarrow{re})$. This reduction captures the essential causality of $\widehat{R}$.
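To make the target of this computation concrete, the following small Python sketch (an illustration under the assumption that the relation is given already transitively closed; it is not the book's algorithm) derives the Hasse diagram of a partial order by discarding every pair implied by a two-step chain:

def transitive_reduction(pairs):
    # pairs: a transitively closed set of (a, b) meaning a ->re b, with a != b.
    return {(a, b) for (a, b) in pairs
            if not any((a, c) in pairs and (c, b) in pairs
                       for c in {y for (x, y) in pairs if x == a})}

# Three relevant events with e1 ->re e2 ->re e3 (closure adds e1 ->re e3):
order = {("e1", "e2"), ("e2", "e3"), ("e1", "e3")}
print(transitive_reduction(order))         # {('e1', 'e2'), ('e2', 'e3')}

The interest of the IPT algorithm presented below is that it obtains the same information on the fly: each relevant event directly carries the identities of its immediate predecessors, without the full relation ever being built.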

The left of Fig. 7.19 represents the distributed computation of Fig. 7.17, in which only the relevant events are explicitly indicated, together with their vector dates and an identification pair (made up of a process identity plus a sequence number). The right of the figure shows the corresponding Hasse diagram.

An Algorithm Solving the IPT Problem: Local Variables The kth relevant event on process pk is unambiguously identified by the pair (k, vck), where vck is the value of vck[k] when pk has produced this event. The aim of the algorithm is consequently to associate with each relevant event e a set e.imp such that (k, vck) ∈ e.imp if and only if the corresponding event is an immediate predecessor of e.

when producing a relevant internal event e do
(1) vci[i] ← vci[i] + 1;
(2) Produce the relevant event e; let e.imp = {(k, vci[k]) | impi[k] = 1};
(3) impi[i] ← 1;
(4) for each j ∈ {1, . . . , n} \ {i} do impi[j] ← 0 end for.

when sending MSG(m) to pj do
(5) send MSG(m, vci[1..n], impi[1..n]) to pj.

when MSG(m, vc, imp) is received do
(6) for each k ∈ {1, . . . , n} do
(7)   case vci[k] < vc[k] then vci[k] ← vc[k]; impi[k] ← imp[k]
(8)        vci[k] = vc[k] then impi[k] ← min(impi[k], imp[k])
(9)        vci[k] > vc[k] then skip
(10)  end case
(11) end for.

Fig. 7.20 Determination of the immediate predecessors (code for process pi)

To that end, each process pi manages the following local variables:

• vci[1..n] is the local vector clock.

• impi[1..n] is a vector initialized to [0, . . . , 0]. Each variable impi[j] contains 0 or 1. Its meaning is the following: impi[j] = 1 means that the last relevant event produced by pj, as known by pi, is a candidate to be an immediate predecessor of the next relevant event produced by pi.

An Algorithm Solving the IPT Problem: Process Behavior The algorithm executed by each process pi is a combined management of both the vectors vci and impi. It is described in Fig. 7.20.

When a process pi produces a relevant event e, it increases its own vector clock (line 1). Then, it considers impi[1..n] to compute the immediate predecessors of e (line 2). According to the definition of each entry of impi, those are the events identified by the pairs (k, vci[k]) such that impi[k] = 1 (which indicates that the last relevant event produced by pk, as known by pi, is still a candidate to be an immediate predecessor of e).

Then, as it has produced a new relevant event e, pi must reset its local array impi[1..n] (lines 3–4). It resets (a) impi[i] to 1 (because e is a candidate to be an immediate predecessor of the relevant events that will appear in its causal future), and (b) each impi[j] (j ≠ i) to 0 (because no event that will be produced in the future of e can have them as immediate predecessors).

When pi sends a message, it attaches to this message all its local control information, namely, vci and impi (line 5). When it receives a message, pi updates its local vector clock as in the basic algorithm, so that the new value of vci is the component-wise maximum of vc and the previous value of vci.


Fig. 7.21 Four possible cases when updating impi[k], while vci[k] = vc[k]

The update of the array impi depends on how, for each entry k, vci[k] compares with vc[k].

• If vci[k] < vc[k], then pj (the sender of the message) has fresher information on pk than pi. Consequently, pi adopts what is known by pj, and sets impi[k] to imp[k] (line 7).

• If vci[k] = vc[k], the last relevant event produced by pk and known by pi is the same as the one known by pj. If this event is still a candidate to be an immediate predecessor of the next event produced by pi from both pi's point of view (impi[k] = 1) and pj's point of view (imp[k] = 1), then impi[k] remains equal to 1; otherwise, impi[k] is set to 0 (line 8). The four possible cases are depicted in Fig. 7.21.

• If vci[k] > vc[k], pi knows more about pk than pj. Hence, it does not modify the value of impi[k] (line 9).
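The following Python sketch is a direct transcription of Fig. 7.20 (the class, the method names, and the message tuples are assumptions made for this example; the numbered comments refer to the lines of the figure, with processes 0-indexed):

class IPTProcess:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.vc = [0] * n                   # vci[1..n]
        self.imp = [0] * n                  # impi[1..n], initialized to zeros

    def relevant_event(self):
        self.vc[self.pid] += 1                                   # line 1
        e_imp = {(k, self.vc[k])                                 # line 2
                 for k in range(self.n) if self.imp[k] == 1}
        self.imp[self.pid] = 1                                   # line 3
        for j in range(self.n):                                  # line 4
            if j != self.pid:
                self.imp[j] = 0
        return (self.pid, self.vc[self.pid]), e_imp

    def send(self, m):
        return (m, list(self.vc), list(self.imp))                # line 5

    def receive(self, msg):
        m, vc, imp = msg
        for k in range(self.n):                                  # line 6
            if self.vc[k] < vc[k]:                               # line 7
                self.vc[k], self.imp[k] = vc[k], imp[k]
            elif self.vc[k] == vc[k]:                            # line 8
                self.imp[k] = min(self.imp[k], imp[k])
            # line 9: vci[k] > vc[k], keep impi[k] unchanged (skip)
        return m

# p0 produces e1 and sends a message to p1; the next relevant event of p1
# then has e1, identified by (0, 1), as its only immediate predecessor.
p0, p1 = IPTProcess(0, 2), IPTProcess(1, 2)
e1_id, e1_imp = p0.relevant_event()         # e1_id = (0, 1), e1_imp = set()
p1.receive(p0.send("m"))
e2_id, e2_imp = p1.relevant_event()         # e2_id = (1, 1)
print(e2_imp)                               # {(0, 1)}

On this run e2.imp = {(0, 1)}, which is exactly the edge that the Hasse diagram of this two-event computation contains.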

7.3 On the Size of Vector Clocks

This section first shows that vector time has an inherent price, namely, the size of vector clocks cannot be less than the number of processes. Then it introduces the notion of a relevant event and presents a general technique that allows us to reduce the number of vector entries that have to be transmitted in each message. Finally, it presents the notions of approximation of the causality relation and approximate vector clocks.

7.3.1 A Lower Bound on the Size of Vector Clocks

A vector clock has one entry per process, i.e., n entries. It follows from the algorithm of Fig. 7.9 and Theorem 5 that vectors of size n are sufficient to capture causality and independence among events. Hence the question: Are vectors of size n necessary to capture causality and concurrency among the events produced by n asynchronous processes communicating by sending and receiving messages through a reliable asynchronous network?

This section shows that the answer to this question is "yes". This means that there are distributed executions in which causality and concurrency cannot be captured by vector clocks of size smaller than n. The proof of this result, which is due to B. Charron-Bost (1991), consists in building such a specific execution and showing a contradiction if vector clocks of size smaller than n are used to capture causality and independence of events.

The Basic Execution Let us consider an execution of n processes in which each process pi, 1 ≤ i ≤ n, executes a sending phase followed by a reception phase.

There is a communication channel between any pair of distinct processes, and the communication pattern is based on a logical ring defined as follows on the process identities: i, i+1, i+2, . . . , n, 1, . . . , i−1, i. The notations i+x and i−y are used to denote the xth successor and the yth predecessor of the identity i on the ring, respectively. More precisely, the behavior of each process pi is as follows.

• A process pi first sends, one after the other, a message to each process of the following "increasing" list of (n−2) processes: pi+1, pi+2, . . . , pn, p1, . . . , pi−2. The important points are the "increasing" order in the list of processes and the fact that pi does not send a message to pi−1.

• Then, pi receives, one after the other, the (n−2) messages sent to it, in the "decreasing" order on the identity of their senders, namely, pi receives first the message from pi−1, then the one from pi−2, . . . , p1, pn, . . . , pi+2. As before, the important points are the "decreasing" order in the list of processes and the fact that pi does not receive a message from pi+1.

This communication pattern is described in Fig. 7.22. (A simpler figure with only three processes is described in Fig. 7.23.)

Considering a process pi, let fsi denote its first send event (which is the sending of a message to pi+1), and lri denote its last receive event (which is the reception of a message from pi+2).

Lemma 1 $\forall\, i \in \{1, \ldots, n\}:\ lr_i \parallel fs_{i+1}$.

Proof As any process first sends messages before receiving any message, there is no causal chain involving more than one message. The lemma follows from this observation and the fact that pi+1 does not send a message to pi.

Lemma 2 $\forall\, i, j:\ 1 \le i \ne j \le n:\ fs_{i+1} \xrightarrow{ev} lr_j$.
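Both lemmas can be checked mechanically on the basic execution. The following Python sketch (an illustration; processes are 0-indexed, every event increments its owner's entry, and the test relies on the standard property that e →ev f if and only if vc(e) ≤ vc(f) component-wise with vc(e) ≠ vc(f)) simulates the sending and reception phases and verifies Lemmas 1 and 2 with full-size vector clocks:

def simulate(n):
    # Basic execution on n >= 3 processes, identities 0..n-1 on the ring.
    vc = [[0] * n for _ in range(n)]   # current vector clock of each process
    fs, lr, msgs = {}, {}, {}

    for i in range(n):                 # sending phase of pi: to p_{i+1},
        for x in range(1, n - 1):      # ..., p_{i+(n-2)}, skipping p_{i-1}
            vc[i][i] += 1
            if x == 1:
                fs[i] = list(vc[i])    # fs_i: first send event of pi
            msgs[(i, (i + x) % n)] = list(vc[i])

    for i in range(n):                 # reception phase of pi: from p_{i-1},
        for y in range(1, n - 1):      # ..., p_{i-(n-2)}, skipping p_{i+1}
            sender = (i - y) % n
            vc[i] = [max(a, b) for a, b in zip(vc[i], msgs[(sender, i)])]
            vc[i][i] += 1
        lr[i] = list(vc[i])            # lr_i: last receive event of pi
    return fs, lr

def precedes(u, v):                    # e ->ev f iff vc(e) < vc(f)
    return u != v and all(a <= b for a, b in zip(u, v))

n = 5
fs, lr = simulate(n)
# Lemma 1: lr_i || fs_{i+1}, i.e., neither event precedes the other.
assert all(not precedes(lr[i], fs[(i + 1) % n]) and
           not precedes(fs[(i + 1) % n], lr[i]) for i in range(n))
# Lemma 2: fs_{i+1} ->ev lr_j for all j != i.
assert all(precedes(fs[(i + 1) % n], lr[j])
           for i in range(n) for j in range(n) if j != i)
print("Lemmas 1 and 2 hold for n =", n)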
