Distributed snapshots - Message-Passing Systems

The second fundamental technique that we discuss in this chapter is a technique for recording global states during the execution of an asynchronous algorithm. While the

concept of a global state, as introduced in Section 3.1, is of fundamental importance by itself, the ability to record a global state over which some of the algorithm's global properties can be analyzed is no less attractive. In the context of this book, areas in which algorithms for global state recording are especially relevant include the treatment of stable properties (discussed in Chapter 6) and the handling of some issues related to timing during a distributed

simulation (our subject in Chapter 10).

The bulk of our discussion on global state recording is presented in Section 5.2.1, where we present a distributed algorithm to record a global state and leave the recorded information spread throughout the nodes of G. However, there are special cases, chiefly within the area of distributed simulation, for which the recording of global states with some specific

properties is desirable. In these cases, it seems that the use of a leader to perform the global state recording in a centralized fashion is considerably more efficient. We discuss such a centralized approach in Section 5.2.2, aiming at their use in Chapter 10.

Throughout all of Section 5.2, G is taken to be a directed graph, so that the states of edges can be referred to without explicit mention to a particular direction. The extension to the undirected case is immediate, as usual.

5.2.1 An algorithm

In the case of synchronous algorithms, the recording of a global state can be achieved rather simply. At each pulse s ≥ 0, the states of all nodes and the messages that were sent at pulse s −1 (if s > 0), which by assumption must already have arrived at their destinations, constitute a global state. Without further communication, such a global state can be stored in G

distributedly, so that a node stores its own state and the state of all edges on which it receives messages.

Clearly, though, nothing like this simple approach can be employed in the asynchronous case, owing to the total absence of global timing. However, with the aid of communication among the nodes in addition to that pertaining to the computation whose global state we wish to record, the task can also be performed for asynchronous algorithms. The algorithm that we discuss next is surprisingly simple given the apparent intricacy of the task, and yields a global state that can be found at the end of the algorithm stored in a distributed fashion throughout G, in much the same way as in the synchronous case we just discussed.

Before we introduce the algorithm for global state recording, it should be noted that, conceptually, we are dealing with two distributed computations. One of them, which we can refer to as the substrate, is the computation whose global properties one wishes to study, and then the global state one is seeking to record is a global state of the substrate. It is then to the substrate that the set of events Ξ introduced in Section 3.1 refers. The other

distributed computation is an execution of the algorithm for global state recording, which we henceforth call Algorithm A_Record_Global_State. Both computations run on G, so each node is responsible for executing its share of the substrate and of Algorithm

A_Record_Global_State. The two computations are, however, totally independent of each other as far as causality relationships are concerned. Our only assumption about their interaction is that Algorithm A_Record_Global_State is capable of "peeking" at the

substrate's variables and messages with the purpose of recording a global state. Note that a node participates in both computations in such a way that, when the substrate is being executed by a node, Algorithm A_Record_Global_State is suspended, and conversely. This is immaterial from the standpoint of either computation, though. Having been designed to operate in the asynchronous model, the suspension of one to execute the other only adds to the asynchronism already present. Recording a global state during an execution of the substrate is essentially a means of "freezing" that execution in a snapshot (thence this alternative denomination for a global state) to analyze the states of all nodes and edges without actually having to halt the substrate.

This view of the computation at a node as actually comprising the node's participation in two different distributed algorithms is the view that we adopt in this section. What this amounts to when specifying the actions of Algorithm A_Record_Global_State is that there has to exist an action to handle the receipt of messages of the substrate, although in none of the algorithm's actions does one such message get sent. Alternatively, we might have viewed both

computations as constituting the execution of a single algorithm, in which case the technique for recording global states would appear truly as a building block. When arguing formally about the recorded global state, however, we would have to be careful to discriminate events associated with the substrate from those associated with the additional communication employed by the recording algorithm, as it is to the former that the recorded global state relates.

The following is an outline of how Algorithm A_Record_Global_State functions. A node is responsible for recording the substrate's local state and the states of all edges directed toward itself. If all nodes carry their recording tasks to their ends, then the resulting overall recording is a system state, as introduced in Section 3.1, because a local state has been recorded for each node and a set of messages for each edge. The algorithm progresses through the exchange between neighbor nodes of a special message called marker. A node ni ∊ N0 initiates its participation in Algorithm A_Record_Global_State by recording the local state of the substrate,i in the terminology of Section 3.1, and then sending marker on all edges that are directed away from it, without however allowing the substrate to send any messages in the meantime (i.e., after the recording of the local state and before the sending of marker). In practice, this can be achieved by "disabling interrupts" so that the node will not switch to execute the other computation while this is undesired. All other nodes behave likewise upon receiving marker for the first time. Every message of the substrate received at ni from a neighbor nj after ni has received the first marker (and consequently recorded a local state) and before ni receives marker from nj is added to the set of messages representing Φji, which is the state of edge (nj → ni) (cf. Section 3.1 for the appropriate terminology). The state of the edge on which marker was first received is then recorded as the empty set, so the system state recorded by Algorithm A_Record_Global_State can be regarded as containing a forest of empty edges, each of whose trees spanning exactly one node in N0. The recording is completed at a node when marker has been received on all edges directed toward that node.

It is instructive at this point to notice the very close resemblance of the algorithm we just outlined to Algorithm A_PI, introduced in Section 4.1.1 for the propagation of information on G. While that algorithm was given for an undirected G, Algorithm A_Record_Global_State can be easily recognized as a variation of Algorithm PI to propagate marker messages by flooding when G is a directed graph. Of course, the question of whether every node in G does ever receive a copy of marker in the directed case arises, because there may exist nodes to which no directed path from a node in N0 exists. One situation in which this can be guaranteed is, for example, the case of a strongly connected G, in which a directed path exists from every node to every other node.

Even before describing and analyzing Algorithm A_Record_Global_State more thoroughly, we are then in position to assess its complexities. Because every edge carries at most one copy of marker, the algorithm's message complexity is clearly O(m). The algorithm's time complexity, on the other hand, depends only on how long it takes a marker to reach a node that is not in N0, and this is clearly O(n) time.

In the description of Algorithm A_Record_Global_State we give next, sub_msg is used to generically denote a message of the substrate. A node ni maintains a variable to store the substrate's local state at ni, and for each neighbor nj ∊ I_Neigi a variable to store the state of

edge (nj → ni). These variables are, respectively, node_statei and , initialized respectively to nil and . In addition, a variable recordedi (initially equal to false) indicates whether the substrate's local state has already been recorded, and a variable

for each nj ∊ I_Neigi (initialized to false as well) indicates whether marker

has been received from nj. Clearly, for ni ∉ N0, recordedi = true if and only if = true for some nj ∊ I_Neigi.

Algorithm A_Record_Global_State:

Variables::

node_statei = nil;

= for all nj ∊ I_Neigi; recordedi = false;

= false for all nj ∊ I_Neigi.

Listing 5.11 Input:

msgi = nil.

Action ifni ∊ N0: node_statei := σi; recordedi := true;

Send marker to all nj ∊ O_Neigi.

Listing 5.12 Input:

msgi = marker such that origini(msgi) = (nj → ni).

Action:

:= true;

if notrecordedi then begin

node_statei := σi; recordedi := true;

Send marker to all nk ∊ O_Neigi. end.

Listing 5.13 Input:

msgi = sub_msg such that origini(msgi) = (nj → ni).

Action:

if recordedi then

if not then

:= {∪ msgi}.

There are two important observations to be made concerning Algorithm

A_Record_Global_State. The first observation is that the assumed atomicity of actions in Algorithm A_Template (cf. Section 2.1) suffices to prevent a node from executing the substrate computation, possibly with the sending of messages, between the recording of the local state and the sending of marker's in (5.11) and (5.12). The second observation

concerns (5.13) and the sub_msg messages that trigger this action. Because such

messages are in fact messages of_ the substrate, the actions that they trigger do not really belong in a presentation of Algorithm A_Record_Global_State based on Algorithm

A_Template. In fact, we have made no provisions whatsoever to denote an algorithm's peeking at some other algorithm's messages, so that our notation in the description of Algorithm A_Record_Global_State is abusive. One of the problems caused by this abuse of notation is that Algorithm A_Record_Global_State does not seem to terminate as long as there are sub_msg's in transit on G's edges, while clearly the algorithm is to terminate as soon as every node ni has received as many marker's as there are edges in Ini (cf.Section 6.2). As long as these issues are clearly understood, however, our slightly licentious use of the notation should not be troublesome. We keep the improper notation for simplicity (here and in other occasions, as in Section 9.3.3), although it appears that adapting Algorithm A_Template to properly contemplate such an interaction between two distributed algorithms is a simple matter (cf. Exercise 5).

Theorem 5.5 states two sufficient conditions for the system state that Algorithm A_Record_Global_State records to be a global state.

Theorem 5.5.

If G is strongly connected and all of its edges are FIFO, then the system state that Algorithm A_Record_Global_State records is a global state.

Proof: The fact that G is strongly connected implies that every node ni receives marker exactly once on every edge in Ini, by (5.11) and (5.12).

Recalling that Ξ is the set of events related to the substrate only, let (Ξ1, Ξ2) be a partition of Ξ such that ξ ∊ Ξ1 if and only if ξ occurred before the local state of the node at which it occurred was recorded. In addition, referring back to the notation introduced in Section 3.1, let be any total order of the events in Ξ consistent with ≺ ⁺, and consider two

consecutive events ξ2 ∊ Ξ2 and ξ1 ∊ Ξ1 in .

By the definition of Ξ1 and of Ξ2, it is clear that ξ2 did not happen at the same node as ξ1 and before the occurrence of ξ1. Now consider a scenario in which a sequence of events follow ξ2

at the node at which it happened, and then a message is sent in connection with the last event in this sequence, which in turn eventually causes the sending of another message by its destination node, and then the eventual sending of another message by the destination node of this second message, and so on, and then the arrival of the last message causes a sequence of events to happen at its destination node culminating with the occurrence of ξ1. By (5.11) and (5.12), and by the definition of Ξ1 and Ξ2, the node at which ξ2 happened must have sent marker's before ξ2 happened. Likewise, the node at which ξ1 happened must not have received any marker before ξ1 happened. Clearly, these two requirements are inconsistent with the scenario we just described, as the edges are all FIFO, and the

sequence of messages alluded to in the description of the scenario would then have to have been overrun by a marker. In summary, (ξ2, ξ1) ∉≺⁺, so the total order can be altered by substituting (ξ1, ξ2) for (ξ2, ξ1) in it, and yet remain consistent with ≺⁺.

Clearly, it takes no more than |Ξ1∥Ξ2| such substitutions to obtain a total order in which at most one pair (ξ1,ξ2) of consecutive events exists such that ξ1 ∊ Ξ1 and ξ2 ∊ ξ2. The events in all other pairs of consecutive events are in this total order both in ξ1 or in ξ2. By (5.11) through (5.13), and by the definition of ξ1 and ξ2, this distinguished pair of consecutive events is such that system_state (ξ1, ξ2) is precisely the system state recorded by Algorithm A_Record-Global_State, which is then a global state, by our first definition of global states in Section 3.1.

Before we finalize this section, there are a couple of important observations to be made regarding Algorithm A_Record_Global_State. The first observation is that, as we mentioned previously, the global state that the algorithm records is stored in a distributed fashion among G's nodes. Often the recorded global state can be used without having to be concentrated on a single node for analysis (cf. Section 6.3.2 for an example), but equally as frequently it must first be concentrated on a leader, which then works on the global state in a centralized manner.

The second observation is that the global state that the algorithm records is in principle any global state, in the sense that no control is provided to make "choices" regarding desirable characteristics of the resulting global state. While this is fine for many applications (as for example the detection of the stable properties we treat in Chapter 6), for others it does matter which global state is used, and then a centralized approach may be advisable. We elaborate on this a little more in Section 5.2.2.

We end the discussion in this section by returning to the issue of knowledge in distributed computations, treated in Section 2.3, in order to illustrate one of the concepts introduced in that section. Specifically, let be any sentence related to a global state that has been recorded by Algorithm A_Record_Global_State. Because of the distributed fashion in which this global state is stored after it is recorded, is clearly implicit knowledge that the members of N have, that is, IN .

5.2.2 Some centralized alternatives

In this section, we briefly comment on two centralized alternatives to the recording of global states, drawing mainly on motivations to be found in Chapter 10. As we mentioned in the previous section, these centralized alternatives are a solution when the global state that one seeks to record cannot be just any global state, but instead must be a global state with certain specific characteristics.

The first case that we examine is that of a computation for which it is known that the system state in which every node ni is in the kth local state in the sequence σi (cf.Section 3.1) for some k ≥ 1 is a global state. This is the case, for example, of the algorithms we discuss in Section 10.2, where for k > 1 a function of the kth such global state has to be compared with the result of applying the same function to the k — 1st such global state, regardless of whatever messages may be in transit on the edges. It turns out that computing such a function of a global state is a trivial task if done by a single node, and because the edge states do not matter, the natural choice is for every node ni to report every new local state to a leader, which then performs the necessary function evaluation and global state comparison whenever a new global state is completed with information received from the other nodes.

The second case is motivated by the needs of Sections 10.3.2 and 10.6, where in the global states to be recorded every edge state must be the empty set. Again, a centralized approach is preferable because it renders the task of checking for empty edges quite simple. The approach is then the following. Whenever a node ni is in a local state with which it may participate in a global state of interest (or simply periodically), it sends the leader this local state, together with the numbers of messages it has so far received on each edge in Ini and sent on each edge in Outi. Whenever the information the leader receives from all nodes is such that the number of messages sent on each edge is equal to the number of messages received on that edge, the corresponding system state is surely a global state, because every system state in which all edges are empty is a global state (cf. Exercise 7).

Dans le document Message-Passing Systems (Page 88-93)