Snapshot algorithms for non-FIFO channels

4 Global state and snapshot recording algorithms

4.5 Snapshot algorithms for non-FIFO channels

A FIFO system ensures that all messages sent after a marker on a channel will be delivered after the marker. This ensures that condition C2 is satisfied in the recorded snapshot if LS_i, LS_j, and SC_ij are recorded as described in the Chandy–Lamport algorithm. In a non-FIFO system, the problem of global snapshot recording is complicated because a marker cannot be used to separate the messages to be recorded in the global state from those not to be recorded. In such systems, different techniques have to be used to ensure that a recorded global state satisfies condition C2.

In a non-FIFO system, either some degree of inhibition (i.e., temporarily delaying the execution of an application process or delaying the send of a computation message) or piggybacking of control information on computation messages to capture out-of-sequence messages is necessary to record a consistent global snapshot [31]. The non-FIFO algorithm by Helary uses message inhibition [12]. The non-FIFO algorithms by Lai and Yang [18], Li et al. [20], and Mattern [23] use message piggybacking to distinguish computation messages sent after the marker from those sent before the marker.

The non-FIFO algorithm of Helary [12] uses message inhibition to avoid an inconsistency in a global snapshot in the following way: when a process receives a marker, it immediately returns an acknowledgement. After a process p_i has sent a marker on the outgoing channel to process p_j, it does not send any messages on this channel until it is sure that p_j has recorded its local state. Process p_i can conclude this if it has received an acknowledgement for the marker sent to p_j, or it has received a marker for this snapshot from p_j.
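The inhibition rule can be illustrated with a minimal Python sketch. The class name `InhibitingSender`, its methods, and the choice to buffer held-back messages (rather than block the application) are assumptions made for illustration; the text only states that p_i does not send on the channel until it knows p_j has recorded its state.

```python
class InhibitingSender:
    """Hypothetical per-process send gate for an inhibition-based snapshot."""

    def __init__(self, neighbors):
        self.blocked = set()                        # channels awaiting confirmation
        self.pending = {j: [] for j in neighbors}   # messages held back per channel

    def marker_sent(self, j):
        # After sending a marker to p_j, stop sending on that channel.
        self.blocked.add(j)

    def send(self, j, msg):
        # Return the messages actually released onto the channel to p_j.
        if j in self.blocked:
            self.pending[j].append(msg)   # assumption: buffer instead of dropping
            return []
        return [msg]

    def confirmed(self, j):
        # Called when an acknowledgement or a marker for this snapshot
        # arrives from p_j: p_j has recorded its state, so release the channel.
        self.blocked.discard(j)
        released, self.pending[j] = self.pending[j], []
        return released


sender = InhibitingSender([1, 2])
sender.marker_sent(1)
held = sender.send(1, "a")        # held back: channel to p_1 is inhibited
passed = sender.send(2, "b")      # channel to p_2 is unaffected
released = sender.confirmed(1)    # ack from p_1 releases the buffered message
```

Only the channel on which a marker is outstanding is inhibited, which is why the message to the second neighbor passes through immediately.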

We next discuss snapshot recording algorithms for systems with non-FIFO channels that piggyback control information on computation messages.

4.5.1 Lai–Yang algorithm

Lai and Yang’s global snapshot algorithm for non-FIFO systems [18] is based on two observations on the role of a marker in a FIFO system. The first observation is that a marker ensures that condition C2 is satisfied for LS_i and LS_j when the snapshots are recorded at processes p_i and p_j, respectively.

The Lai–Yang algorithm fulfills this role of a marker in a non-FIFO system by using a coloring scheme on computation messages that works as follows:

1. Every process is initially white and turns red while taking a snapshot. The equivalent of the “marker sending rule” is executed when a process turns red.

2. Every message sent by a white (red) process is colored white (red). Thus, a white (red) message is a message that was sent before (after) the sender of that message recorded its local snapshot.

3. Every white process takes its snapshot at its convenience, but no later than the instant it receives a red message.

Thus, when a white process receives a red message, it records its local snapshot before processing the message. This ensures that no message sent by a process after recording its local snapshot is processed by the destination process before the destination records its local snapshot. Thus, an explicit marker message is not required in this algorithm and the “marker” is piggybacked on computation messages using a coloring scheme.
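Rules 1–3 can be sketched in a few lines of Python. This is a toy model under stated assumptions: the class `Process`, the integer application state, and the tuple message format `(color, payload)` are hypothetical and are not taken from Lai and Yang's paper.

```python
WHITE, RED = "white", "red"


class Process:
    """Toy process whose application state is a running sum of payloads."""

    def __init__(self, pid):
        self.pid = pid
        self.color = WHITE      # rule 1: every process starts white
        self.state = 0
        self.snapshot = None

    def take_snapshot(self):
        # Rule 1: record the local state and turn red (no-op if already red).
        if self.color == WHITE:
            self.snapshot = self.state
            self.color = RED

    def send(self, payload):
        # Rule 2: a message carries the sender's current color.
        return (self.color, payload)

    def receive(self, message):
        color, payload = message
        if color == RED:
            # Rule 3: snapshot no later than the receipt of a red message,
            # and before the message is processed.
            self.take_snapshot()
        self.state += payload


p, q = Process(1), Process(2)
q.receive(p.send(3))    # white message, processed normally
p.take_snapshot()       # p turns red
q.receive(p.send(5))    # red message forces q to snapshot first
```

In this run q records its state before the red payload is applied, so the red message is not reflected in q's snapshot.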

The second observation is that the marker informs process p_j of the value of {send(m_ij) | send(m_ij) ∈ LS_i} so that the state of the channel C_ij can be computed as transit(LS_i, LS_j). The Lai–Yang algorithm fulfills this role of the marker in the following way:

4. Every white process records a history of all white messages sent or received by it along each channel.

5. When a process turns red, it sends these histories along with its snapshot to the initiator process that collects the global snapshot.

6. The initiator process evaluates transit(LS_i, LS_j) to compute the state of a channel C_ij as given below:

SC_ij = {white messages sent by p_i on C_ij} − {white messages received by p_j on C_ij}

= {m_ij | send(m_ij) ∈ LS_i} − {m_ij | rec(m_ij) ∈ LS_j}

Condition C2 holds because a red message is not included in the snapshot of the recipient process and a channel state is the difference of two sets of white messages. Condition C1 holds because a white message m_ij is included in the snapshot of process p_j if p_j receives m_ij before taking its snapshot. Otherwise, m_ij is included in the state of channel C_ij.
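The channel-state computation performed by the initiator can be sketched as a multiset difference of the two white-message histories. The function name `channel_state` and the use of string message identifiers are assumptions for illustration; a multiset (`collections.Counter`) is used so that identical payloads sent more than once are still counted correctly.

```python
from collections import Counter


def channel_state(sent_white, received_white):
    """White messages sent by p_i on C_ij minus those received by p_j on C_ij."""
    # Counter subtraction keeps only positive counts, i.e. messages that
    # were sent but not yet received when p_j recorded its snapshot.
    return Counter(sent_white) - Counter(received_white)


# History reported by p_i (sends) and by p_j (receives) for channel C_ij.
in_transit = channel_state(["m1", "m2", "m3"], ["m2"])
```

Because the difference ignores ordering, the result is the same regardless of the order in which a non-FIFO channel delivered the messages.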

Though marker messages are not required in the algorithm, each process has to record the entire message history on each channel as part of the local snapshot. Thus, the space requirements of the algorithm may be large. However, in applications (such as termination detection) where it is sufficient to know only the number of messages in transit in a channel, message histories can be replaced by integer counters, reducing the space requirement. Lai and Yang describe how the size of the local storage and snapshot recording can be reduced by storing only the messages sent and received since the previous snapshot recording, assuming that the previous snapshot is still available. This approach can be very useful in applications that require repeated snapshots of a distributed system.

4.5.2 Li et al.’s algorithm

Li et al.’s algorithm [20] for recording a global snapshot in a non-FIFO system is similar to the Lai–Yang algorithm. Markers are tagged so as to generalize the red/white colors of the Lai–Yang algorithm to accommodate repeated invocations of the algorithm and multiple initiators. In addition, the algorithm is not concerned with the contents of computation messages and the state of a channel is computed as the number of messages in transit in the channel. A process maintains two counters for each incident channel to record the number of messages sent and received on the channel and reports these counter values with its snapshot to the initiator. This simplification is combined with the incremental technique to compute channel states, which reduces the size of message histories to be stored and transmitted. The initiator computes the state of C_ij as: (the number of messages in C_ij in the previous snapshot) + (the number of messages sent on C_ij since the last snapshot at process p_i) − (the number of messages received on C_ij since the last snapshot at process p_j).

Snapshots initiated by an initiator are assigned a sequence number. All messages sent after a local snapshot recording are tagged by a tuple <init_id, MKNO>, where init_id is the initiator’s identifier and MKNO is the sequence number of the algorithm’s most recent invocation by initiator init_id; to ensure liveness, markers with similar tags are explicitly sent only on those outgoing channels on which no messages might be sent. The tuple <init_id, MKNO> is a generalization of the red/white colors used in Lai–Yang to accommodate repeated invocations of the algorithm and multiple initiators.

For simplicity, we explain this algorithm using the framework of the Lai–Yang algorithm. The local state recording is done as described by rules 1–3 of the Lai–Yang algorithm.

A process maintains input/output counters for the number of messages sent and received on each incident channel after the last snapshot (by that initiator). The algorithm is not concerned with the contents of computation messages and so the computation of the state of a channel is simplified to computing the number of messages in transit in the channel. This simplification is combined with an incremental technique for computing in-transit messages, also suggested independently by Lai and Yang [18], for reducing the size of the entire message history to be locally stored and to be recorded in a local snapshot to compute channel states. The initiator of the algorithm maintains a variable TRANSIT_ij for the number of messages in transit in the channel from process p_i to process p_j, as recorded in the previous snapshot.

The channel states are recorded as described in rules 4–6 of the Lai–Yang algorithm:

4. Every white process records a history, as input and output counters, of all white messages sent or received by it along each channel after the previous snapshot (by the same initiator).

5. When a process turns red, it sends these histories (i.e., input and output counters) along with its snapshot to the initiator process that collects the global snapshot.

6. The initiator process computes the state of channel C_ij as follows:

SC_ij = transit(LS_i, LS_j) = TRANSIT_ij

+ #messages sent on that channel since the last snapshot

− #messages received on that channel since the last snapshot

If the initiator initiates a snapshot before the completion of the previous snapshot, it is possible that some process may get a message with a lower sequence number after participating in a snapshot initiated later. In this case, the algorithm uses the snapshot with the higher sequence number to also create the snapshot for the lower sequence number.
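The incremental counter computation of rule 6 can be sketched as follows. The function name `update_transit` and the dictionary layout keyed by `(i, j)` channel pairs are assumptions for illustration; the sketch covers a single initiator and does not model tags or overlapping snapshots.

```python
def update_transit(transit, sent, received):
    """Return the new TRANSIT_ij for every channel (i, j).

    transit:  messages in transit per channel in the previous snapshot
    sent:     messages sent per channel since the last snapshot (from p_i)
    received: messages received per channel since the last snapshot (from p_j)
    """
    channels = set(transit) | set(sent) | set(received)
    return {
        ch: transit.get(ch, 0) + sent.get(ch, 0) - received.get(ch, 0)
        for ch in channels
    }


previous = {(1, 2): 2}                 # two messages were in transit on C_12
sent = {(1, 2): 3, (2, 1): 1}          # output counters reported by senders
received = {(1, 2): 4}                 # input counters reported by receivers
current = update_transit(previous, sent, received)
```

Only integer counters travel to the initiator, which is the space saving over the full message histories of the Lai–Yang algorithm.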

The algorithm works for multiple initiators if separate input/output counters are associated with each initiator, and marker messages and the tag fields carry a vector of tuples, with one tuple for each initiator.

Though this algorithm does not require any additional message to record a global snapshot provided computation messages are eventually sent on each channel, the local storage and the size of tags on computation messages are of size O(n), where n is the number of initiators. The Spezialetti and Kearns technique [29] of combining concurrently initiated snapshots can be used with this algorithm.

4.5.3 Mattern’s algorithm

Mattern’s algorithm [23] is based on vector clocks. Recall that, in vector clocks, the clock at a process is an integer vector of length n, with one component for each process.

Mattern’s algorithm assumes a single initiator process and works as follows:

1. The initiator “ticks” its local clock and selects a future vector time s at which it would like a global snapshot to be recorded. It then broadcasts this time s and freezes all activity until it receives all acknowledgements of the receipt of this broadcast.

2. When a process receives the broadcast, it remembers the value s and returns an acknowledgement to the initiator.

3. After having received an acknowledgement from every process, the initiator increases its vector clock to s and broadcasts a dummy message to all processes. (Observe that before broadcasting this dummy message, the local clocks of other processes have a value that is not yet ≥ s.)

4. The receipt of this dummy message forces each recipient to increase its clock to a value ≥ s if not already ≥ s.

5. Each process takes a local snapshot and sends it to the initiator when (just before) its clock increases from a value less than s to a value ≥ s. Observe that this may happen before the dummy message arrives at the process.

6. The state of C_ij is all messages sent along C_ij, whose timestamp is smaller than s and which are received by p_j after recording LS_j.
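The trigger in rule 5 can be sketched with a componentwise vector comparison. This is a simplified model under stated assumptions: the class `MatternProcess`, the integer application state, and the merge-then-tick clock update are illustrative choices, and the sketch omits the acknowledgement phase and channel-state recording.

```python
def geq(v, s):
    """True if vector time v is componentwise >= vector time s."""
    return all(a >= b for a, b in zip(v, s))


class MatternProcess:
    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n
        self.target = None      # the vector time s announced by the initiator
        self.state = 0          # toy application state
        self.snapshot = None

    def on_broadcast(self, s):
        # Rule 2: remember s (the acknowledgement is not modeled here).
        self.target = list(s)

    def on_message(self, timestamp, payload):
        # Standard vector clock update: merge, then tick the own component.
        new_clock = [max(a, b) for a, b in zip(self.clock, timestamp)]
        new_clock[self.pid] += 1
        # Rule 5: snapshot just before the clock first reaches a value >= s.
        if (self.target is not None and self.snapshot is None
                and not geq(self.clock, self.target)
                and geq(new_clock, self.target)):
            self.snapshot = self.state
        self.clock = new_clock
        self.state += payload


p = MatternProcess(pid=1, n=3)
p.on_broadcast([5, 0, 0])
p.on_message([1, 0, 0], 10)   # timestamp below s: no snapshot yet
p.on_message([5, 0, 0], 7)    # timestamp reaches s: snapshot taken first
```

The second message could be the dummy message or any computation message from a process that has already passed s, which is why the snapshot may be triggered before the dummy message arrives.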

Processes record their local snapshot as per rule 5. Any message m_ij sent by process p_i after it records its local snapshot LS_i has a timestamp > s. Assume that this m_ij is received by process p_j before it records LS_j. After receiving this m_ij and before p_j records LS_j, p_j’s local clock reads a value > s, as per rules for updating vector clocks. This implies p_j must have already recorded LS_j as per rule 5, which contradicts the assumption. Therefore, m_ij cannot be received by p_j before it records LS_j. By rule 6, m_ij is not recorded in SC_ij and therefore, condition C2 is satisfied. Condition C1 holds because each message m_ij with a timestamp less than s is included in the snapshot of process p_j if p_j receives m_ij before taking its snapshot. Otherwise, m_ij is included in the state of channel C_ij.

The following observations about the above algorithm lead to various optimizations: (i) The initiator can be made a “virtual” process, so no process has to freeze. (ii) As long as a new higher value of s is selected, the phase of broadcasting s and returning the acks can be eliminated. (iii) Only the initiator’s component of s is used to determine when to record a snapshot. Also, one needs to know only if the initiator’s component of the vector timestamp in a message has increased beyond the value of the corresponding component in s. Therefore, it suffices to have just two values of s, say, white and red, which can be represented using one bit.

With these optimizations, the algorithm becomes similar to the Lai–Yang algorithm except for the manner in which transit(LS_i, LS_j) is evaluated for channel C_ij. In Mattern’s algorithm, a process is not required to store message histories to evaluate the channel states. The state of any channel is the set of all the white messages that are received by a red process on which that channel is incident. A termination detection scheme for non-FIFO channels is required to detect that no white messages are in transit to ensure that the recording of all the channel states is complete. One of the following schemes can be used for termination detection:

1. Each process i keeps a counter cntr_i that indicates the difference between the number of white messages it has sent and received before recording its snapshot. It reports this value to the initiator process along with its snapshot and forwards all white messages that it receives henceforth to the initiator. Snapshot collection terminates when the initiator has received Σ_i cntr_i forwarded white messages.

2. Each red message sent by a process carries a piggybacked value of the number of white messages sent on that channel before the local state recording. Each process keeps a counter for the number of white messages received on each channel. A process can detect termination of recording the states of incoming channels when it receives as many white messages on each channel as the value piggybacked on red messages received on that channel.
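The second scheme can be sketched per incoming channel. The class `IncomingChannel` and its method names are hypothetical; the sketch assumes the counter tracks all white messages received on the channel, whether they arrived before or after the local snapshot.

```python
class IncomingChannel:
    """Tracks termination of state recording for one incoming channel."""

    def __init__(self):
        self.white_received = 0   # white messages received on this channel
        self.expected = None      # value piggybacked on a red message

    def on_white(self):
        self.white_received += 1

    def on_red(self, white_sent_before_snapshot):
        # Every red message on this channel carries the same count: the
        # number of white messages the sender sent before its snapshot.
        self.expected = white_sent_before_snapshot

    def done(self):
        # Recording is complete once all white messages have arrived.
        return self.expected is not None and self.white_received == self.expected


ch = IncomingChannel()
ch.on_white()                 # one white message arrives
ch.on_red(3)                  # a red message overtakes the remaining two
before = ch.done()
ch.on_white()
ch.on_white()
after = ch.done()
```

A process would declare the recording of its incoming channel states complete once `done()` holds for every incoming channel.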

The savings of not storing and transmitting entire message histories, over the Lai–Yang algorithm, comes at the expense of delay in the termination of the snapshot recording algorithm and need for a termination detection scheme (e.g., a message counter per channel).
