
Necessary and sufficient conditions for consistent global snapshots


4 Global state and snapshot recording algorithms

4.8 Necessary and sufficient conditions for consistent global snapshots

Many applications (such as transparent failure recovery, distributed debugging, monitoring distributed events, setting distributed breakpoints, and protocol specification and verification) require that local process states be periodically recorded and analyzed, either during execution or post mortem. A saved intermediate state of a process during its execution is called a local checkpoint of the process. A global snapshot of a distributed system is a set of local checkpoints, one from each process, and it represents a snapshot of the distributed computation at some instant. For a global snapshot to be consistent, there must be no causal path between any two distinct checkpoints in it. A consistent snapshot therefore consists of a set of local states that occurred concurrently or had the potential to occur simultaneously.

This condition for the consistency of a global snapshot (that there is no causal path between any two checkpoints) is only a necessary condition; it is not a sufficient one. In this section, we present the necessary and sufficient conditions under which a local checkpoint, or an arbitrary collection of local checkpoints, can be grouped with checkpoints at other processes to form a consistent global snapshot.

Processes take checkpoints asynchronously. Each checkpoint taken by a process is assigned a unique sequence number. The ith checkpoint (i ≥ 0) of process pp, where the second p is the process index, is assigned the sequence number i and is denoted by Cp,i; for example, C1,2 is checkpoint 2 of process p1. We assume that each process takes an initial checkpoint before execution begins and takes a virtual checkpoint after execution ends. The ith checkpoint interval of process pp consists of all the computation performed between its (i−1)th and ith checkpoints (it includes the (i−1)th checkpoint but not the ith).

We first show with the help of an example that even if two local checkpoints do not have a causal path between them (i.e., neither happened before the other under Lamport’s happens-before relation), they may not belong to the same consistent global snapshot. Consider the execution shown in Figure 4.4.

Figure 4.4 An illustration of zigzag paths.

Although neither of the checkpoints C1,1 and C3,2 happened before the other, they cannot be grouped together with a checkpoint on process p2 to form a consistent global snapshot. No checkpoint on p2 can be grouped with both C1,1 and C3,2 while maintaining consistency. Because of message m4, C3,2 cannot be consistent with C2,1 or any earlier checkpoint in p2, and because of message m3, C1,1 cannot be consistent with C2,2 or any later checkpoint in p2. Thus, no checkpoint on p2 is available to form a consistent global snapshot with C1,1 and C3,2.
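The argument above can be checked mechanically. The following Python sketch is illustrative only. It assumes that each message is described by the checkpoint intervals in which it is sent and received, and the interval numbers for m3 and m4 are inferred from the description above rather than read from the figure. The names Message and is_consistent, and the range of checkpoint indices tried for p2, are hypothetical.

```python
from collections import namedtuple

# Hypothetical record: interval k on a process lies between its
# checkpoints k-1 and k, as defined earlier in this section.
Message = namedtuple("Message", "name sender send_interval receiver recv_interval")

# Assumed placement, inferred from the text: m3 leaves p1 after C1,1 and
# reaches p2 before C2,2; m4 leaves p2 after C2,1 and reaches p3 before C3,2.
MESSAGES = [
    Message("m3", 1, 2, 2, 2),
    Message("m4", 2, 2, 3, 2),
]

def is_consistent(cut, messages):
    """cut maps a process id to the checkpoint index chosen for it."""
    for m in messages:
        sent_after = m.send_interval > cut[m.sender]
        received_before = m.recv_interval <= cut[m.receiver]
        if sent_after and received_before:
            return False  # the receive is recorded but the send is not
    return True

# Try each checkpoint of p2 (indices 0..3 are an assumption)
# together with C1,1 and C3,2.
candidates = [k for k in range(4)
              if is_consistent({1: 1, 2: k, 3: 2}, MESSAGES)]
print(candidates)
```

Under these assumptions the list of candidates is empty: a choice of C2,0 or C2,1 is ruled out by m4, and a choice of C2,2 or later is ruled out by m3.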

To describe the necessary and sufficient conditions for a consistent snapshot, Netzer and Xu [25] defined a generalization of Lamport’s happens-before relation, called a zigzag path. A checkpoint Ci,j happens before a checkpoint Cx,y (or a causal path exists between the two checkpoints) if a sequence of messages exists from Ci,j to Cx,y such that each message is sent after the previous one in the sequence is received. A zigzag path between two checkpoints is like a causal path, except that it allows a message to be sent before the previous one in the path is received. For example, in Figure 4.4, although a causal path does not exist from C1,1 to C3,2, a zigzag path does exist from C1,1 to C3,2. This zigzag path is formed by messages m3 and m4, and it means that no consistent snapshot exists in this execution that contains both C1,1 and C3,2.

Several applications require saving or analyzing consistent snapshots, and zigzag paths have implications for such applications. For example, the state from which a distributed computation must restart after a crash must be consistent. Consistency ensures that no process is restarted from a state that has recorded the receipt of a message (called an orphan message) that no other process claims to have sent in the rolled-back state. Processes take local checkpoints independently, and a consistent global snapshot/checkpoint is found from the local checkpoints for crash recovery. Clearly, due to zigzag paths, not all checkpoints taken by the processes will belong to a consistent snapshot. By reducing the number of zigzag paths in the local checkpoints taken by processes, one can increase the number of local checkpoints that belong to a consistent snapshot, thus minimizing the rollback necessary to find a consistent snapshot.¹ This can be achieved by tracking zigzag paths online and allowing each process to adaptively take checkpoints at certain points in the execution so that the number of checkpoints that cannot belong to a consistent snapshot is minimized.

¹ In the worst case, the system would have to restart its execution right from the beginning after repeated rollbacks.

4.8.1 Zigzag paths and consistent global snapshots

In this section, we provide a formal definition of zigzag paths and use zigzag paths to characterize the conditions under which a set of local checkpoints can together belong to the same consistent snapshot. We then present two special cases: first, the conditions for an arbitrary checkpoint to be useful (i.e., a consistent snapshot exists that contains this checkpoint), and second, the conditions for two arbitrary checkpoints to belong to the same consistent snapshot.

A zigzag path

Recall that if a global snapshot is consistent, then none of its checkpoints happened before another (i.e., there is no causal path between any two checkpoints in the snapshot). However, as explained earlier using Figure 4.4, having two checkpoints such that neither happened before the other is still not sufficient to ensure that they can belong together to the same consistent snapshot. This happens when a zigzag path exists between such checkpoints. A zigzag path is defined as a generalization of Lamport’s happens-before relation.

Definition 4.1 A zigzag path exists from a checkpoint Cx,i to a checkpoint Cy,j iff there exist messages m1, m2, ..., mn (n ≥ 1) such that

1. m1 is sent by process px after Cx,i;

2. if mk (1 ≤ k < n) is received by process pz, then mk+1 is sent by pz in the same or a later checkpoint interval (although mk+1 may be sent before or after mk is received);

3. mn is received by process py before Cy,j.

For example, in Figure 4.4, a zigzag path exists from C1,1 to C3,2 due to messages m3 and m4. Even though process p2 sends m4 before receiving m3, it does both in the same checkpoint interval. However, a zigzag path does not exist from C1,2 to C3,3 (due to messages m5 and m6) because process p2 sends m6 and receives m5 in different checkpoint intervals.
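Definition 4.1 translates directly into a search over messages. The sketch below is one possible way to test the three conditions, not the procedure from [25]. It assumes that each message is labelled with the checkpoint intervals of its send and receive events. The Message record, the function name zigzag_path_exists, and the exact interval numbers (chosen to match the example above) are assumptions.

```python
from collections import deque, namedtuple

# Hypothetical record; interval k lies between checkpoints k-1 and k.
Message = namedtuple("Message", "name sender send_interval receiver recv_interval")

def zigzag_path_exists(messages, src, dst):
    """src and dst are (process, checkpoint index) pairs."""
    (x, i), (y, j) = src, dst
    # Condition 1: m1 is sent by p_x after C_{x,i}.
    frontier = deque(m for m in messages
                     if m.sender == x and m.send_interval > i)
    seen = {m.name for m in frontier}
    while frontier:
        m = frontier.popleft()
        # Condition 3: the last message is received by p_y before C_{y,j}.
        if m.receiver == y and m.recv_interval <= j:
            return True
        # Condition 2: the next message is sent by the receiver in the
        # same or a later interval (before or after the receive).
        for nxt in messages:
            if (nxt.name not in seen and nxt.sender == m.receiver
                    and nxt.send_interval >= m.recv_interval):
                seen.add(nxt.name)
                frontier.append(nxt)
    return False

# Assumed interval numbers, chosen to agree with the example in the text.
FIG_4_4 = [
    Message("m3", 1, 2, 2, 2),  # sent after C1,1, received before C2,2
    Message("m4", 2, 2, 3, 2),  # sent in the interval where m3 is received
    Message("m5", 1, 3, 2, 3),  # sent after C1,2
    Message("m6", 2, 2, 3, 3),  # sent in an earlier interval than m5's receive
]

print(zigzag_path_exists(FIG_4_4, (1, 1), (3, 2)))
print(zigzag_path_exists(FIG_4_4, (1, 2), (3, 3)))
```

Only interval numbers are compared, never the order of events inside an interval, which is exactly what lets a zigzag path go "backwards" within a single interval.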

Definition 4.2 A checkpoint C is involved in a zigzag cycle iff there is a zigzag path from C to itself.

For example, in Figure 4.5, C2,1 is on a zigzag cycle formed by messages m1 and m2. Note that messages m1 and m2 are respectively sent and received in the same checkpoint interval at p1.

Difference between a zigzag path and a causal path

It is important to understand the difference between a causal path and a zigzag path. A causal path exists from a checkpoint A to another checkpoint B iff there is a chain of messages starting after A and ending before B such that each message is sent after the previous one in the chain is received. A zigzag path also consists of such a message chain; however, a message in the chain can be sent before the previous one in the chain is received, as long as the send and receive are in the same checkpoint interval. Thus a causal path is always a zigzag path, but a zigzag path need not be a causal path.

Figure 4.4 illustrates the difference between causal and zigzag paths. A causal path exists from C1,0 to C3,1, formed by the chain of messages m1 and m2; this causal path is also a zigzag path. Similarly, a zigzag path exists from C1,1 to C3,2, formed by the chain of messages m3 and m4. Since the receive of m3 happened after the send of m4, this zigzag path is not a causal path and C1,1 does not happen before C3,2.
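The two kinds of path differ only in the rule that links consecutive messages, which the following sketch makes explicit. It is a minimal illustration under stated assumptions: each event carries a local position on its process line in addition to its checkpoint interval, and the positions and intervals for m1 to m4 are invented to agree with the description of Figure 4.4. The names Msg, path_exists, causal_link, and zigzag_link are hypothetical.

```python
from collections import namedtuple

# Hypothetical record: *_pos is the local event index on that process,
# *_interval is the checkpoint interval that contains the event.
Msg = namedtuple(
    "Msg", "name sender send_pos send_interval receiver recv_pos recv_interval")

def causal_link(prev, nxt):
    # Causal path: the next message is sent after the previous is received.
    return nxt.send_pos > prev.recv_pos

def zigzag_link(prev, nxt):
    # Zigzag path: the send only has to be in the same or a later interval.
    return nxt.send_interval >= prev.recv_interval

def path_exists(messages, src, dst, may_follow):
    (x, i), (y, j) = src, dst
    stack = [m for m in messages if m.sender == x and m.send_interval > i]
    seen = {m.name for m in stack}
    while stack:
        m = stack.pop()
        if m.receiver == y and m.recv_interval <= j:
            return True
        for nxt in messages:
            if (nxt.name not in seen and nxt.sender == m.receiver
                    and may_follow(m, nxt)):
                seen.add(nxt.name)
                stack.append(nxt)
    return False

# Assumed layout: on p2 the events occur in the order
# receive m1, send m2, checkpoint C2,1, send m4, receive m3.
MSGS = [
    Msg("m1", 1, 1, 1, 2, 1, 1),
    Msg("m2", 2, 2, 1, 3, 1, 1),
    Msg("m3", 1, 2, 2, 2, 4, 2),
    Msg("m4", 2, 3, 2, 3, 2, 2),
]

for src, dst in [((1, 0), (3, 1)), ((1, 1), (3, 2))]:
    print(src, dst,
          path_exists(MSGS, src, dst, causal_link),
          path_exists(MSGS, src, dst, zigzag_link))
```

With this data, both rules find a path from C1,0 to C3,1, while only the zigzag rule finds one from C1,1 to C3,2, because m4 is sent before m3 is received.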

Another difference between a zigzag path and a causal path is that a zigzag path can form a cycle but a causal path never can. That is, it is possible for a zigzag path to exist from a checkpoint back to itself; such a path is called a zigzag cycle. A zigzag path may form a cycle because it need not represent causality: in a zigzag path, we allow a message to be sent before the previous message in the path is received, as long as the send and receive are in the same interval.

Figure 4.5 shows a zigzag cycle involving C2,1, formed by messages m1 and m2.

Consistent global snapshots

Netzer and Xu [25] proved that if no zigzag path (or cycle) exists between any two checkpoints from a set S of checkpoints, then a consistent snapshot can be formed that includes the set S of checkpoints, and vice versa.

For a formal proof, readers should consult the original paper. Here we give an intuitive explanation. If a zigzag path exists between two checkpoints, and that zigzag path is also a causal path, then the checkpoints are ordered and hence cannot belong to the same consistent snapshot. If the zigzag path between two checkpoints is not a causal path, a consistent snapshot still cannot be formed that contains both checkpoints, because the zigzag nature of the path causes any snapshot that includes the two checkpoints to be inconsistent. To visualize the effect of a zigzag path, consider a snapshot line through the two checkpoints. Because of the existence of a zigzag path between the two checkpoints, the snapshot line will always cross a message that causes one of the line’s checkpoints to happen before another, making the snapshot inconsistent. Figure 4.5 illustrates this. Two snapshot lines are drawn from C1,1 to C3,2. The zigzag path from C1,1 to C3,2 renders both snapshot lines inconsistent, because messages m3 and m4 cross either snapshot line in a way that orders two of its checkpoints.

Figure 4.5 A zigzag cycle, inconsistent snapshot, and consistent snapshot. (The figure shows processes p1, p2, and p3, messages m1 to m4, and checkpoints C1,0 to C1,2, C2,0 to C2,3, and C3,0 to C3,2.)

Conversely, if no zigzag path exists between two checkpoints (including zigzag cycles), then it is always possible to construct a consistent snapshot that includes these two checkpoints. We can form a consistent snapshot by including the first checkpoint at every process that has no zigzag path to either checkpoint. Note that messages can cross a consistent snapshot line as long as they do not cause any of the line’s checkpoints to happen before another. For example, in Figure 4.5, C1,2 and C2,3 can be grouped with C3,1 to form a consistent snapshot even though message m4 crosses the snapshot line.

To summarize:

• the absence of a causal path between checkpoints in a snapshot corresponds to the necessary condition for a consistent snapshot, and the absence of a zigzag path between checkpoints in a snapshot corresponds to the necessary and sufficient conditions for a consistent snapshot;

• a set of checkpoints S can be extended to a consistent snapshot if and only if no checkpoint in S has a zigzag path to any other checkpoint in S;

• a checkpoint can be a part of a consistent snapshot if and only if it is not involved in a zigzag cycle (Z-cycle).
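The last two conditions in the summary can be tested with the same search used for Definition 4.1. The sketch below is illustrative rather than the construction from [25]. It assumes messages are labelled with send and receive checkpoint intervals, and the interval numbers for Figure 4.5 are assumptions chosen to agree with the statements made about that figure in the text. The names Message, zigzag_path_exists, and can_extend are hypothetical.

```python
from collections import namedtuple

# Hypothetical record; interval k lies between checkpoints k-1 and k.
Message = namedtuple("Message", "name sender send_interval receiver recv_interval")

def zigzag_path_exists(messages, src, dst):
    (x, i), (y, j) = src, dst
    stack = [m for m in messages if m.sender == x and m.send_interval > i]
    seen = {m.name for m in stack}
    while stack:
        m = stack.pop()
        if m.receiver == y and m.recv_interval <= j:
            return True
        for nxt in messages:
            if (nxt.name not in seen and nxt.sender == m.receiver
                    and nxt.send_interval >= m.recv_interval):
                seen.add(nxt.name)
                stack.append(nxt)
    return False

def can_extend(messages, checkpoints):
    """True iff no zigzag path links any two checkpoints in the set,
    where a checkpoint paired with itself covers the Z-cycle case."""
    return not any(zigzag_path_exists(messages, a, b)
                   for a in checkpoints for b in checkpoints)

# Assumed interval numbers for Figure 4.5 (not read from the figure).
FIG_4_5 = [
    Message("m1", 2, 2, 1, 1),  # sent by p2 after C2,1
    Message("m2", 1, 1, 2, 1),  # received by p2 before C2,1
    Message("m3", 1, 2, 2, 3),  # sent by p1 after C1,1
    Message("m4", 2, 3, 3, 2),  # received by p3 before C3,2
]

print(can_extend(FIG_4_5, [(2, 1)]))                  # C2,1 alone
print(can_extend(FIG_4_5, [(1, 1), (3, 2)]))          # C1,1 and C3,2
print(can_extend(FIG_4_5, [(1, 2), (2, 3), (3, 1)]))  # C1,2, C2,3, C3,1
```

Under the assumed data, C2,1 fails because of the Z-cycle through m1 and m2, the pair C1,1 and C3,2 fails because of the zigzag path through m3 and m4, and the set containing C1,2, C2,3, and C3,1 passes even though m4 crosses its snapshot line.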
