7.1.4 Timestamps in Action: Total Order Broadcast

Linear time and timestamps are particularly useful when one has to order operations or messages. The most typical case is the establishment of a total order on a set of requests, which have to be serviced one after the other. This use of timestamps will be addressed in Chap. 10, which is devoted to permission-based mutual exclusion.

This section considers another illustration of timestamps, namely, it presents a timestamp-based implementation of a high-level communication abstraction, called total order broadcast.

The Total Order Broadcast Abstraction Total order broadcast is a communication abstraction defined by two operations, denoted to_broadcast() and to_deliver().

Intuitively, to_broadcast() allows a process p_i to send a message m to all the processes (we then say “p_i to_broadcasts m”), while to_deliver() allows a process to receive such a message (we then say “p_i to_delivers m”). Moreover, all the messages that have been to_broadcast must be to_delivered in the same order at each process, and this order has to respect the causal precedence order. To simplify the presentation, it is assumed that each message is unique (this can easily be realized by associating a pair ⟨sequence number, sender identity⟩ with each message).

A Causality-Compliant Partial Order on Messages M being the set of messages which are to_broadcast during an execution, let M̂ = (M, →M) be the relation where →M is defined on M as follows. Given m, m' ∈ M, m →M m' (and we then say “m causally precedes m'”) if:

• m and m' have been to_broadcast by the same process, and m has been to_broadcast before m', or

• m has been to_delivered by a process p_i before p_i to_broadcasts m', or

• there is a message m'' ∈ M such that m →M m'' and m'' →M m'.
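The three rules above can be sketched as a small computation. The following is only an illustration, assuming (hypothetically) that an execution is recorded as one list of ("bcast", m) and ("deliver", m) events per process, in local order; the function name and this event format are not part of the original text.

```python
from itertools import product

def causal_precedence(histories):
    """Return the set of pairs (m, m2) such that m causally precedes m2.

    histories: dict mapping a process id to its local list of events,
    each event being ("bcast", m) or ("deliver", m) (assumed format).
    """
    rel = set()
    for events in histories.values():
        seen = []  # messages this process has broadcast or delivered so far
        for kind, m in events:
            if kind == "bcast":
                # Rules 1 and 2: everything broadcast or delivered earlier
                # by the same process precedes the new broadcast.
                rel.update((prev, m) for prev in seen)
            seen.append(m)
    # Rule 3: transitive closure.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(rel), repeat=2):
            if b == c and (a, d) not in rel:
                rel.add((a, d))
                changed = True
    return rel

histories = {
    1: [("bcast", "m1")],
    2: [("deliver", "m1"), ("bcast", "m2")],
    3: [("deliver", "m2"), ("bcast", "m3")],
}
relation = causal_precedence(histories)
```

In this example, m1 precedes m2 and m2 precedes m3 by the second rule, and m1 precedes m3 by transitivity.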

Total Order Broadcast: Definition The total order broadcast abstraction is formally defined by the following properties. In other words, to be correct, an implementation of the total order broadcast abstraction has to ensure that these properties are always satisfied.

• Validity. If a process to_delivers a message m, there is a process that has to_broadcast m.

• Integrity. No message is to_delivered twice.

• Total order. If a process to_delivers m before m', no process to_delivers m' before m.

• Causal precedence order. If m →M m', no process to_delivers m' before m.

• Termination. If a process to_broadcasts a message m, every process to_delivers m.

The first four properties are safety properties. Validity relates the outputs to the inputs. It states that there is neither message corruption, nor message creation. Integrity states that there is no message duplication. Total order states that messages are to_delivered in the same order at every process, while causal precedence states that this total order respects the message causality relation →M. Finally, the termination property is a liveness property stating that no message is lost.
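As an illustration of these properties, the following sketch checks them on the record of a finished run. It is not part of the original text: the function name, the argument format (a set of broadcast messages, one delivery list per process, and an optional list of causal pairs) are assumptions, and termination, being a liveness property, can only be checked here in the sense that the finished run has delivered everything.

```python
def check_run(broadcast, delivered, precedes=()):
    """Check the five properties on a completed run (assumed record format).

    broadcast: set of messages that were to_broadcast.
    delivered: dict mapping a process id to its list of to_delivered messages.
    precedes:  iterable of pairs (m, m2) meaning m causally precedes m2.
    """
    seqs = list(delivered.values())

    def same_relative_order(s1, s2):
        # Messages delivered by both processes must appear in the same order.
        common = set(s1) & set(s2)
        return [m for m in s1 if m in common] == [m for m in s2 if m in common]

    def respects(seq, m, m2):
        # m2 may be delivered only after m has been delivered.
        return m2 not in seq or (m in seq and seq.index(m) < seq.index(m2))

    return {
        "validity": all(m in broadcast for s in seqs for m in s),
        "integrity": all(len(s) == len(set(s)) for s in seqs),
        "total_order": all(same_relative_order(a, b) for a in seqs for b in seqs),
        "causal_precedence": all(respects(s, m, m2)
                                 for s in seqs for m, m2 in precedes),
        "termination": all(set(broadcast) <= set(s) for s in seqs),
    }

good = check_run({"a", "b"}, {1: ["a", "b"], 2: ["a", "b"]}, [("a", "b")])
bad = check_run({"a", "b"}, {1: ["a", "b"], 2: ["b", "a"]}, [("a", "b")])
```

In the second run the two processes disagree on the order of a and b, so the total order and causal precedence checks fail while the other three succeed.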

Principle of the Implementation To simplify the description, we consider that the communication channels are FIFO. Moreover, each pair of processes is connected by a bidirectional channel.

The principle that underlies the implementation is very simple: it consists in associating a timestamp with each message, and to_delivering the messages according to their timestamp order. As timestamps are totally ordered and this order respects causal precedence, we obtain both the total order and the causal precedence order properties.
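The total order on timestamps can be illustrated in a few lines. This sketch assumes a timestamp is represented as a Python tuple (date, sender identity), whose built-in lexicographic comparison matches the order on timestamps; the function name is hypothetical.

```python
def delivery_order(stamped):
    """Return the messages sorted by their (date, sender) timestamps.

    stamped: iterable of pairs (message, (date, sender)).
    """
    return [m for m, ts in sorted(stamped, key=lambda pair: pair[1])]

# Two messages carry the same date; the sender identity breaks the tie.
stamped = [("m3", (2, 1)), ("m2", (1, 2)), ("m1", (1, 1))]
order = delivery_order(stamped)  # ["m1", "m2", "m3"]
```

Knowing the order is not enough by itself: a process must also know when no message with a smaller timestamp can still arrive, which is the issue discussed next.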

To illustrate the main issue posed by associating appropriate timestamps with messages, and to define accordingly a correct message delivery rule, let us consider Fig. 7.5. Independently of each other, p_1 and p_2 to_broadcast the messages m_1 and m_2, respectively. Neither p_1 nor p_2 can immediately to_deliver its message, otherwise the total order delivery property would be violated. The processes have to cooperate so that they to_deliver m_1 and m_2 in the same order. This remains true if only p_1 to_broadcasts a message. This is because, when it issues to_broadcast(m_1), p_1 does not know whether p_2 has independently issued to_broadcast(m_2) or not.

Fig. 7.5 Total order broadcast: the problem that has to be solved

Fig. 7.6 Structure of the total order broadcast implementation

It follows that a to_broadcast message generates two distinct communication events at each process. The first one is associated with the reception of the message from the underlying communication network, while the second one is associated with its to_delivery.

Message Stability A means to implement the same to_delivery order at each process consists in providing each process p_i with information on the clock values of the whole set of processes. This local information can then be used by each process p_i to know which, among the to_broadcast messages it has received and not yet to_delivered, are stable, where message stability is defined as follows.

A message timestamped ⟨k, j⟩ received by a process p_i is stable (at that process) if p_i knows that all the messages it will receive in the future will have a timestamp greater than ⟨k, j⟩. The main job of a timestamp-based implementation consists in ensuring message stability at each process.

Global Structure and Local Variables at a Process p_i The structure of the implementation is described in Fig. 7.6. Each process p_i has a local module implementing the operations to_broadcast() and to_deliver(). Each local module manages the following local variables.

• clock_i[1..n] is an array of integers initialized to [0, ..., 0]. The local variable clock_i[i] is the local clock of p_i, which implements the global linear time. In contrast, for j ≠ i, clock_i[j] is the best approximation of the value of the local clock of p_j, as known by p_i. As the communication channels are FIFO, clock_i[j] contains the last value of clock_j[j] received by p_i. Hence, in addition to the fact that the set of local clocks {clock_i[i]}, 1 ≤ i ≤ n, implements a global scalar clock, the local array clock_i[1..n] of each process p_i represents its current knowledge of the progress of the whole set of local logical clocks.

• to_deliverable_i is a sequence, initially empty (the empty sequence is denoted ε). This sequence contains the list of messages that (a) have been received by p_i, (b) have then been totally ordered, and (c) have not yet been to_delivered. Hence, to_deliverable_i is the list of messages that can be to_delivered to the local upper layer application process.

• pending_i is a set of pairs ⟨m, ⟨d, j⟩⟩, where m is a message whose timestamp is ⟨d, j⟩. Initially, pending_i = ∅.
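The three local variables can be pictured as a small record. The sketch below is one possible representation, not the book's: the class name LocalState and the choice of a list, a deque, and a set are assumptions, and index 0 of the clock array is left unused so that indices match process identities 1..n.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class LocalState:
    """Local variables of process p_i (hypothetical representation)."""
    i: int                                   # identity of this process
    n: int                                   # total number of processes
    clock: list = field(init=False)          # clock[j]: last known clock of p_j
    to_deliverable: deque = field(default_factory=deque)  # ordered, not yet delivered
    pending: set = field(default_factory=set)             # pairs (timestamp, message)

    def __post_init__(self):
        # clock[1..n] initialized to 0; clock[0] is unused padding.
        self.clock = [0] * (self.n + 1)

state = LocalState(i=1, n=3)
state.pending.add(((1, 1), "m"))  # message m with timestamp (date 1, sender 1)
```

Storing the timestamp first in each pending pair makes the pair with the smallest timestamp directly obtainable with min().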

Description of the Implementation The timestamp-based algorithm implementing the total order broadcast abstraction is described in Fig. 7.7. It is assumed that p_i does not interleave the execution of the statements at lines 1–4, lines 9–14, and lines 18–22. Let us recall that there is a bidirectional FIFO point-to-point communication channel connecting any pair of distinct processes. This algorithm works as described below.

When a process p_i invokes to_broadcast(m), it first increases its local clock (line 1) and associates with m a new timestamp ts(m) = ⟨clock_i[i], i⟩ (line 2). Then, it adds the pair ⟨m, ts(m)⟩ to its local set pending_i (line 3), and sends to all the other processes the message TOBC(m, ts(m)) to inform them that a new message has been to_broadcast (line 4). An invocation of to_deliver() returns the first message in the local list to_deliverable_i (lines 5–8).

The behavior of a process p_i when it receives a message TOBC(m, ⟨sd_date, j⟩) can be decomposed into two parts.

• The process p_i first modifies its local context according to the content of the TOBC() message it has just received. It updates clock_i[j] (line 9), and adds the pair ⟨m, ⟨sd_date, j⟩⟩ to its local set pending_i (line 10).

Let us notice that, as (a) clock_j[j] never decreases, (b) p_j increased it before sending the message TOBC(m, −), and (c) the channels are FIFO, it follows that we have clock_i[j] < sd_date when the message TOBC(m, ⟨sd_date, j⟩) is received. Hence the systematic update at line 9.

• The second set of statements executed by p_i (lines 11–14) is related to the update of its local clock clock_i[i], and the dissemination of its new value to the other processes. This is to ensure both that (a) each message that has been to_broadcast is eventually to_delivered, and (b) message to_deliveries satisfy total order.

If sd_date ≥ clock_i[i], p_i resets its local clock (line 12) to a value greater than sd_date (as in the basic scalar clock algorithm of Fig. 7.1). Then p_i sends to each other process a control message denoted CATCH_UP(). This message carries the new clock value of p_i (line 13). As the channels are FIFO, it follows that when this message is received by p_k, 1 ≤ k ≠ i ≤ n, this process has necessarily received all the messages TOBC() sent by p_i before this CATCH_UP() message. This allows p_k to know if, among the to_broadcast messages it has received, the one with the smallest timestamp is stable (Fig. 7.8).

operation to_broadcast(m) is
(1) clock_i[i] ← clock_i[i] + 1;
(2) let ts(m) = ⟨clock_i[i], i⟩;
(3) pending_i ← pending_i ∪ {⟨m, ts(m)⟩};
(4) for each j ∈ {1, ..., n} \ {i} do send TOBC(m, ts(m)) to p_j end for.

operation to_deliver() is
(5) wait (to_deliverable_i ≠ ε);
(6) let m be the first message in the list to_deliverable_i;
(7) withdraw m from to_deliverable_i;
(8) return(m).

when TOBC(m, ⟨sd_date, j⟩) is received do
(9) clock_i[j] ← sd_date;
(10) pending_i ← pending_i ∪ {⟨m, ⟨sd_date, j⟩⟩};
(11) if (sd_date ≥ clock_i[i]) then
(12)    clock_i[i] ← sd_date + 1;
(13)    for each k ∈ {1, ..., n} \ {i} do send CATCH_UP(clock_i[i], i) to p_k end for
(14) end if.

when CATCH_UP(last_date, j) is received do
(15) clock_i[j] ← last_date.

background task T is
(16) repeat forever
(17)    wait (pending_i ≠ ∅);
(18)    let ⟨m, ⟨d, k⟩⟩ be the pair in pending_i with the smallest timestamp;
(19)    if (∀ j ≠ k: ⟨d, k⟩ < ⟨clock_i[j], j⟩) then
(20)       add m at the tail of to_deliverable_i;
(21)       pending_i ← pending_i \ {⟨m, ⟨d, k⟩⟩}
(22)    end if
(23) end repeat.

Fig. 7.7 Implementation of total order broadcast (code for process p_i)

Fig. 7.8 To_delivery predicate of a message at process p_i
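The to_delivery predicate evaluated at line 19 of Fig. 7.7 can be written as a short function. This is only a sketch: the name is_stable and the representation of the clock array as a dictionary from process identity to last known clock value are assumptions made for the example.

```python
def is_stable(ts, clock):
    """Evaluate the to_delivery predicate for a message timestamped ts.

    ts:    pair (d, k), the date and the sender of the message.
    clock: dict j -> clock_i[j], the clock values known by the local process.
    """
    d, k = ts
    # Every process other than the sender must be known to have a clock
    # whose timestamp (clock value, identity) exceeds (d, k).
    return all((d, k) < (clock[j], j) for j in clock if j != k)

# p_1 has broadcast a message timestamped (1, 1) and has heard from p_2 only.
clock_1 = {1: 2, 2: 1, 3: 0}
before = is_stable((1, 1), clock_1)   # p_3 may still send a smaller timestamp
clock_1[3] = 2                        # a CATCH_UP(2, 3) message arrives
after = is_stable((1, 1), clock_1)
```

The message is not stable while clock_1[3] is 0, and becomes stable once the value carried by the control message from p_3 has been recorded.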

Finally, a background task checks if some of the to_broadcast messages which have been received by p_i are stable, and can consequently be moved from the set pending_i to the local sequence to_deliverable_i. To that end, when pending_i ≠ ∅ (line 17), p_i looks for the message m with the smallest timestamp (line 18). Let ts(m) = ⟨d, k⟩. If, for any j ≠ k, ts(m) is smaller than ⟨clock_i[j], j⟩ (line 19), it follows (from lines 11–14 executed by each process when it received the pair ⟨m, ts(m)⟩) that any to_broadcast message not yet received by p_i will have a timestamp greater than ts(m). Hence, if the previous predicate is true, the message m is stable at p_i (Fig. 7.8). Consequently, p_i adds m at the tail of to_deliverable_i (line 20) and withdraws ⟨m, ts(m)⟩ from the set pending_i (line 21). As we will see in the proof, message stability at each process ensures that the to_broadcast messages are added in the same order to all the sequences to_deliverable_x, 1 ≤ x ≤ n.
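To see the whole algorithm at work, the following sketch simulates it in a single Python program. It is an illustration under stated assumptions, not the book's code: the classes Network and Process, the function simulate, and the random scheduler are all hypothetical, channels are modeled as one FIFO queue per ordered pair of processes, and the background task is run after each local step instead of concurrently.

```python
import random
from collections import deque

class Network:
    """Reliable FIFO channels, one queue per ordered pair (assumed model)."""
    def __init__(self, n, seed=0):
        ids = range(1, n + 1)
        self.chan = {(a, b): deque() for a in ids for b in ids if a != b}
        self.rng = random.Random(seed)

    def send(self, src, dst, msg):
        self.chan[(src, dst)].append(msg)

    def run(self, procs):
        # Deliver the head of a randomly chosen non-empty channel until idle.
        while True:
            ready = [c for c, q in self.chan.items() if q]
            if not ready:
                break
            src, dst = self.rng.choice(ready)
            procs[dst].on_receive(self.chan[(src, dst)].popleft())

class Process:
    def __init__(self, i, n, net):
        self.i, self.n, self.net = i, n, net
        self.clock = [0] * (n + 1)        # clock[0] unused
        self.pending = set()              # pairs (timestamp, message)
        self.to_deliverable = deque()

    def others(self):
        return [j for j in range(1, self.n + 1) if j != self.i]

    def to_broadcast(self, m):            # lines 1-4 of Fig. 7.7
        self.clock[self.i] += 1
        ts = (self.clock[self.i], self.i)
        self.pending.add((ts, m))
        for j in self.others():
            self.net.send(self.i, j, ("TOBC", m, ts))
        self.background_step()

    def on_receive(self, msg):
        if msg[0] == "TOBC":              # lines 9-14
            _, m, (sd_date, j) = msg
            self.clock[j] = sd_date
            self.pending.add(((sd_date, j), m))
            if sd_date >= self.clock[self.i]:
                self.clock[self.i] = sd_date + 1
                for k in self.others():
                    self.net.send(self.i, k, ("CATCH_UP", self.clock[self.i], self.i))
        else:                             # line 15
            _, last_date, j = msg
            self.clock[j] = last_date
        self.background_step()

    def background_step(self):            # lines 17-22, repeated while possible
        while self.pending:
            ts, m = min(self.pending)     # smallest timestamp first
            d, k = ts
            if not all((d, k) < (self.clock[j], j)
                       for j in range(1, self.n + 1) if j != k):
                break                     # not stable yet
            self.to_deliverable.append(m)
            self.pending.remove((ts, m))

def simulate(n=3, seed=0):
    net = Network(n, seed)
    procs = {i: Process(i, n, net) for i in range(1, n + 1)}
    procs[1].to_broadcast("m1")           # concurrent broadcasts, as in Fig. 7.5
    procs[2].to_broadcast("m2")
    net.run(procs)
    procs[3].to_broadcast("m3")
    net.run(procs)
    return {i: list(p.to_deliverable) for i, p in procs.items()}
```

Calling simulate() with different seeds changes the interleaving of message receptions, but every process should end with the same sequence, since m1 and m2 carry the timestamps (1, 1) and (1, 2) and m3 is broadcast with a larger date.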

Theorem 4 The algorithm described in Fig. 7.7 implements the total order broadcast abstraction.

Proof The validity property (neither corruption, nor creation of messages) follows directly from the reliability of the channels. The integrity property (no duplication) follows from the reliability of the underlying channels, the fact that no two to_broadcast messages have the same timestamp, and the fact that when a message is added to to_deliverable_i, it is removed from pending_i.

The proof of the termination property is by contradiction. Assuming that some to_broadcast messages are never to_delivered by a process p_i (i.e., never added to to_deliverable_i), let m be the one with the smallest timestamp, and let ts(m) = ⟨d, k⟩.

Let us observe that, each time a process p_j updates its local clock (to a greater value), it sends its new clock value to all processes. This occurs at lines 1 and 4, or at lines 12 and 13.

As each process p_j other than p_k receives the message TOBC(m, ts(m)) sent by p_k, its local clock becomes greater than d (if it was not before). It then follows from the previous observation that a value of clock_j[j] greater than d eventually becomes known by each process, and we eventually have clock_i[j] > d. Hence, the to_delivery predicate for m is eventually satisfied at p_i. The message m is consequently moved from pending_i to to_deliverable_i, which contradicts the initial assumption and proves the termination property.

The proof of the total order property is also by contradiction. Let m_x and m_y be two messages timestamped ts(m_x) = ⟨d_x, x⟩ and ts(m_y) = ⟨d_y, y⟩, respectively. Let us assume that ts(m_x) < ts(m_y), and m_y is to_delivered by a process p_i before m_x (i.e., m_y is added to to_deliverable_i before m_x).

Just before m_y is added to to_deliverable_i, ⟨m_y, ts(m_y)⟩ is the pair with the smallest timestamp in pending_i, and ∀ j ≠ y: ⟨d_y, y⟩ < ⟨clock_i[j], j⟩ (lines 18–19).

It follows that we have then ⟨d_x, x⟩ < ⟨d_y, y⟩ < ⟨clock_i[x], x⟩. As (a) p_x sends only increasing values of its local clock (lines 1 and 4, and lines 12–13), (b) d_x < clock_i[x], and (c) the channels are FIFO, it follows that p_i has received the message TOBC(m_x, ts(m_x)) before the message carrying the value of clock_x[x] which entailed the update of clock_i[x] making true the predicate ⟨d_y, y⟩ <
