Master Recherche en Informatique - Novembre 2010
Fault-Tolerance: Checkpointing in Distributed Asynchronous Systems
Achour Mostéfaoui
Irisa/Ifsic, Université de Rennes [email protected]
http://www.irisa.fr/asap/
Checkpointing Distributed Computations 1
Some Failure Types
• Software errors
• Process failures
⋆ Crash failure
⋆ Send/Receive omission
⋆ Arbitrary (Byzantine)
• Link failures
⋆ Omission/duplication failure
• Clock/Performance failures
How to Tolerate Failures?
• Debugging/Validation.
• Duplication of processors and memories.
• Fault-tolerant software.
⋆ Data replication.
⋆ Fault-tolerant services (consensus, NBAC, etc.)
⋆ Checkpointing/Rollback-Recovery.
Computation Model
Asynchronous Distributed System
• Set of processes: P1, . . . , Pn.
• No shared memory.
• No global clock.
• Fail-stop processors (crash failures).
What Is Rollback?
[Figure: three timelines of a process P with events e1 to e5 and the saved state σ4.]
• Failure occurrence: restart at a safe state.
• Necessity to save safe states.
Local States and Local Checkpoints
• Local History of Pi: ei,1 ei,2 · · · ei,s · · ·
• Local State:
⋆ initial state of Pi: σi,0
⋆ σi,s is obtained by applying the event ei,s to the local state σi,s−1
• Local Checkpoint:
A local checkpoint C is a recorded state (snapshot) of a process.
A local state is not necessarily recorded as a local checkpoint, so the set of local checkpoints is only a subset of the set of local states.
Distributed Computation: Example
[Figure: space-time diagram of processes Pi, Pj and Pk exchanging messages m1 to m7, with local checkpoints Ci,0 to Ci,3, Cj,0 to Cj,3 and Ck,0 to Ck,3, and checkpoint intervals Ij,1 and Ik,1 to Ik,3.]
A State of a Distributed Computation?
• Reminder: there is no global memory and no global clock.
• It is impossible to compute a global snapshot instantaneously.
• Illustration of Chandy-Lamport (1985)
• A global checkpoint is a set of local checkpoints, one from each process.
A Pair of Mutually Consistent Checkpoints
[Figure: three diagrams of processes Pi and Pj.]
A Missing Message
[Figure: three diagrams of processes Pi and Pj, with a message whose status is in question.]
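• The message is missing (its sending belongs to the sender's checkpoint, its delivery does not belong to the receiver's) ⇒ it must be recorded.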
An Orphan Message
[Figure: three diagrams of processes Pi and Pj, with a message whose status is in question.]
• The two local checkpoints are definitely inconsistent.
Consistency Rules of a Global Checkpoint (CL 1985)
Let C1,x1, C2,x2, . . . , Cn,xn be a global checkpoint. Each pair (Ci,xi, Cj,xj) of local checkpoints respects the following rules, for every message m sent by Pi to Pj:
R1 ∀m, send(m) ∈ Ci,xi ⇒ delivery(m) ∈ Cj,xj or m is recorded (no missing message).
R2 ∀m, send(m) ∉ Ci,xi ⇒ delivery(m) ∉ Cj,xj (no orphan message).
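Rules R1 and R2 can be checked mechanically once a global checkpoint is described as a cut. The sketch below is only an illustration under assumptions that are not in the slides: each local checkpoint is represented by the index of the last event it includes (`cut[p]`), and `Message`, `classify` and `is_consistent` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    name: str
    sender: str
    send_idx: int    # position of send(m) in the sender's local history
    receiver: str
    recv_idx: int    # position of delivery(m) in the receiver's local history

def classify(messages, cut):
    """cut[p] = index of the last event of p included in p's local checkpoint."""
    orphans, missing = [], []
    for m in messages:
        sent = m.send_idx <= cut[m.sender]
        delivered = m.recv_idx <= cut[m.receiver]
        if delivered and not sent:
            orphans.append(m.name)   # violates R2
        elif sent and not delivered:
            missing.append(m.name)   # R1: the message has to be recorded
    return orphans, missing

def is_consistent(messages, cut):
    """Consistent means no orphan; missing messages are handled by recording."""
    orphans, _ = classify(messages, cut)
    return not orphans

# Hypothetical two-process computation: m1 goes from Pi to Pj, m2 from Pj to Pi.
msgs = [Message("m1", "Pi", 1, "Pj", 2), Message("m2", "Pj", 3, "Pi", 4)]
print(classify(msgs, {"Pi": 1, "Pj": 1}))   # m1 is missing, no orphan
print(classify(msgs, {"Pi": 4, "Pj": 2}))   # m2 is an orphan
```

A cut with a missing message is still usable provided the message is recorded; a cut with an orphan is not.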
Consistent Global Checkpoints
• Consistent Global Checkpoint
A global checkpoint is consistent if it contains no message that is delivered but not sent (no orphan message).
• Message recording can easily overcome the problem of missing messages.
Meaning of a Global Checkpoint
[Figure: space-time diagram of Pi, Pj and Pk with messages m1 to m7, checkpoints Ci,0 to Ci,3, Cj,0 to Cj,3 and Ck,0 to Ck,3, and intervals Ij,1 and Ik,1 to Ik,3.]
• Does a consistent global checkpoint represent a real state of the computation?
Why Compute Global Checkpoints?
• Rollback recovery.
• Stable/Unstable properties detection
• Monitoring
• etc.
Characteristics of a Global Checkpoint
• To be as close as possible to a real state (monitoring)
• To be as recent as possible (rollback)
• To have as few forced checkpoints as possible (properties detection)
• etc.
Example: Rollback Recovery
[Figure: space-time diagram of Pi, Pj and Pk with messages m1 to m4, m6 and m7, checkpoints Ci,0 to Ci,3, Cj,0 to Cj,3 and Ck,0 to Ck,2, and intervals Ij,1, Ik,1 and Ik,2.]
Limits of Chandy-Lamport’s Consistency Rules
• Chandy-Lamport result applies to global checkpoints:
⋆ Considering a global checkpoint, one can say whether it is consistent or not.
• Question:
⋆ Considering a single local checkpoint taken by a process, how can one know whether there exists a consistent global checkpoint to which it belongs?
Consistent Global Checkpoints
Checkpointing protocols must ensure that each local checkpoint belongs to at least one global checkpoint with:
1 No orphan messages.
2 No missing messages.
Limits of Chandy-Lamport’s Consistency Rules: Example
[Figure: space-time diagram of Pi, Pj and Pk with messages m1 to m7, checkpoints Ci,0 to Ci,3, Cj,0 to Cj,3 and Ck,0 to Ck,3, and intervals Ij,1 and Ik,1 to Ik,3.]
• Hidden dependencies.
Theorem of Netzer and Xu (1995)
• Considering a subset of local checkpoints (possibly one local checkpoint), one can say whether this subset can be extended to form a consistent global checkpoint.
Z-Paths
A relation exists from local checkpoint A to local checkpoint B if there exists a Z-path from A to B.
[Figure: a Z-path from A to B.]
Z-Paths
A Z-path exists from local checkpoint A to local checkpoint B if and only if:
• A precedes B within the same process, or
• a sequence of messages [m1, m2, . . . , mq] (q ≥ 1) exists such that:
1. A precedes send(m1) in the same process, and
2. for each mi, i < q, delivery(mi) and send(mi+1) occur in the same process, and delivery(mi) is in the same checkpoint interval as send(mi+1) or in an earlier one, and
3. delivery(mq) precedes B in the same process.
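The definition above translates into a reachability search over messages. The following sketch rests on a numbering convention that is an assumption of this example, not of the slides: interval x of a process is the set of events between its checkpoints x−1 and x, so "sent after checkpoint x" means `send_itv > x` and "delivered before checkpoint y" means `recv_itv <= y`. The names `Msg` and `z_path_exists` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Msg:
    name: str
    sender: str
    send_itv: int    # checkpoint interval of the sender in which m is sent
    receiver: str
    recv_itv: int    # checkpoint interval of the receiver in which m is delivered

def z_path_exists(msgs, a, b):
    """a = (process, index) and b = (process, index) are local checkpoints."""
    (p, x), (q, y) = a, b
    if p == q and x < y:
        return True                          # A precedes B in the same process
    # Condition 1: candidate first messages are sent by p after checkpoint x.
    frontier = deque(m for m in msgs if m.sender == p and m.send_itv > x)
    seen = set(frontier)
    while frontier:
        m = frontier.popleft()
        # Condition 3: the last message is delivered to q before checkpoint y.
        if m.receiver == q and m.recv_itv <= y:
            return True
        # Condition 2: the next message leaves the receiver of m in the same
        # interval as the delivery of m, or in a later one.
        for nxt in msgs:
            if nxt not in seen and nxt.sender == m.receiver and nxt.send_itv >= m.recv_itv:
                seen.add(nxt)
                frontier.append(nxt)
    return False

# Hypothetical scenario: Pj sends m2 before delivering m1, in the same interval.
msgs = [Msg("m1", "Pi", 2, "Pj", 1), Msg("m2", "Pj", 1, "Pk", 1)]
print(z_path_exists(msgs, ("Pi", 1), ("Pk", 1)))   # non-causal Z-path [m1, m2]
```

Note that the search never compares event order inside an interval, which is exactly why non-causal sequences such as [m1, m2] above qualify.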
Causal Z-Paths, Z-Patterns and Z-Cycles
• A Z-path is causal iff for each mi, i < q, we have delivery(mi) →hb send(mi+1).
• A Z-path has a Z-pattern iff ∃i such that send(mi+1) →hb delivery(mi).
• A Z-cycle is a Z-path going from a local checkpoint C to the same local checkpoint C.
Causal Z-Path: Example
[Figure: space-time diagram of Pi, Pj and Pk with messages m1 to m7, checkpoints Ci,0 to Ci,3, Cj,0 to Cj,3 and Ck,0 to Ck,3, illustrating a causal Z-path.]
Z-Cycle: Example
[Figure: space-time diagram of Pi, Pj and Pk with messages m1 to m7, checkpoints Ci,0 to Ci,3, Cj,0 to Cj,3 and Ck,0 to Ck,3, illustrating a Z-cycle.]
Basic Theorem
• A local checkpoint C is Useless if it cannot belong to any consistent global checkpoint.
• Netzer-Xu Theorem (1995): A local checkpoint C is useless iff it is involved in a Z-cycle.
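As an illustration of the theorem, the sketch below lists the checkpoints that lie on a Z-cycle. It is a minimal example under assumptions of its own: messages are tuples `(sender, send_itv, receiver, recv_itv)`, interval x of a process lies between its checkpoints x−1 and x, and `useless_checkpoints` is a hypothetical helper name.

```python
def useless_checkpoints(checkpoints, msgs):
    """checkpoints: list of (process, index);
    msgs: list of tuples (sender, send_itv, receiver, recv_itv)."""
    # follows[a] = messages that may come right after a in a Z-path:
    # sent by the receiver of a, in the delivery interval of a or a later one.
    follows = {a: [b for b in msgs if b[0] == a[2] and b[1] >= a[3]] for a in msgs}

    def on_z_cycle(proc, idx):
        # Start from the messages sent by proc after checkpoint idx ...
        start = [m for m in msgs if m[0] == proc and m[1] > idx]
        stack, seen = list(start), set(start)
        while stack:
            m = stack.pop()
            # ... and look for one delivered to proc before checkpoint idx.
            if m[2] == proc and m[3] <= idx:
                return True
            for n in follows[m]:
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        return False

    return [c for c in checkpoints if on_z_cycle(*c)]

# Hypothetical run: Pj sends m2, Pi delivers it and takes Ci,1, then Pi sends
# m1, which Pj delivers while still in its interval 1.
ckpts = [("Pi", 0), ("Pi", 1), ("Pj", 0), ("Pj", 1)]
msgs = [("Pi", 2, "Pj", 1), ("Pj", 1, "Pi", 1)]
print(useless_checkpoints(ckpts, msgs))
```

In this run only Ci,1 is reported: pairing it with Cj,0 leaves the second message as an orphan, and pairing it with Cj,1 leaves the first one as an orphan.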
Basic Checkpoints
• Some local states of each process, called local checkpoints, are saved on stable storage:
⋆ periodically,
⋆ upon reception of a signal,
⋆ according to the value of a predicate
⋆ according to the OS convenience (light-load, etc.)
⋆ etc.
Uncoordinated Checkpointing: Example 1
Each process has its own checkpointing policy.
[Figure: processes Pi and Pj.]
Risk: domino effect.
Forced Checkpoints
[Figure: processes Pi, Pj and Pk.]
• For each local checkpoint to belong to at least one consistent global checkpoint, some processes may have to take additional checkpoints (forced checkpoints).
Checkpointing Protocols
• Coordinated Checkpointing
⋆ only Forced checkpoints, no Domino effect.
• Uncoordinated Checkpointing
⋆ only Basic checkpoints, possibly: Domino effect.
• Communication Induced Checkpointing
⋆ Forced + Basic checkpoints, no Domino effect.
Chandy-Lamport’s Coordinated Protocol (1985)
This protocol is based on the use of control messages called markers.
[Figure: a marker exchanged between Pi and Pj.]
• A marker is a message that can neither overtake nor be overtaken by any other message sent on the same unidirectional channel.
Communication-Induced Checkpointing
• No communication ⇒ no Z-cycles.
• Avoid Z-cycles formed by application messages.
⋆ Detect Z-cycles: control information carried by messages
⋆ Break Z-cycles: forced checkpoints
Breaking a Z-Cycle
[Figure: processes Pi and Pj; a Z-cycle broken by a forced checkpoint.]
Main Idea of the Protocol
• Idea: Associate a Lamport timestamp with each local checkpoint.
• Theorem: If, for any pair of checkpoints Cj,y and Ck,z, the existence of a Z-path from Cj,y to Ck,z implies Cj,y.t < Ck,z.t, then no checkpoint can be involved in a Z-cycle.
A Protocol: Second Step
Each message carries the value of its sender’s clock at sending time.
• Init: cli := 0
• Upon the definition of a local checkpoint:
  cli := cli + 1
  <Take a local checkpoint timestamped with the current value of cli>
• When Pi sends a message m:
  m.cl := cli; send(m, m.cl)
• Upon the reception of a message (m, m.cl):
  cli := max(cli, m.cl)
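The four rules above fit in a few lines of code. This is a minimal sketch, assuming that a message is a `(payload, clock)` pair and that the class name `Process` and its method names are illustrative only; the transport layer and the actual saving of the state are left out.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.cl = 0               # Init: cl_i := 0
        self.checkpoints = []     # timestamps of the local checkpoints taken

    def take_checkpoint(self):
        self.cl += 1              # cl_i := cl_i + 1
        self.checkpoints.append(self.cl)   # checkpoint timestamped with cl_i
        return self.cl

    def send(self, payload):
        return (payload, self.cl)          # m.cl := cl_i; send(m, m.cl)

    def receive(self, message):
        payload, m_cl = message
        self.cl = max(self.cl, m_cl)       # cl_i := max(cl_i, m.cl)
        return payload

pi, pj = Process("Pi"), Process("Pj")
pi.take_checkpoint()
pi.take_checkpoint()              # Pi's second checkpoint has timestamp 2
pj.receive(pi.send("m"))          # Pj's clock jumps to 2
print(pj.take_checkpoint())       # the next checkpoint of Pj is stamped 3
```

Along this causal path the checkpoint timestamps increase (2 at Pi, then 3 at Pj).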
A Protocol: First Step
Use a Lamport clock to timestamp checkpoints.
[Figure: space-time diagram of Pi, Pj and Pk with messages m1 to m5 and m7; the checkpoints are labelled with their Lamport timestamps.]
• Does not take into account hidden dependencies (non-causal Z-paths).
Hidden Dependencies
• Timestamps of messages increase along causal Z-paths.
• Timestamps of messages should increase along all Z-paths.
To Checkpoint or Not to Checkpoint
[Figure: two cases with processes Pi, Pj and Pk, messages m1 and m2, and checkpoints Ci,x, Cj,y and Ck,z. Case a: m1.t ≤ m2.t. Case b: m1.t > m2.t.]
General Structure of the Protocol
• Init:
  cli := 0; . . .
• When Pi takes a local checkpoint:
  cli := cli + 1; <resetting of data structures>
  <Take a local checkpoint timestamped with cli>
• When Pi sends a message m:
  m.cl := cli; send(m, m.cl, . . .)
• Upon the reception of a message (m, m.cl, . . .):
  if <condition> then <take a ckpt>; (*forced*)
A First Condition to Checkpoint
• sent_toi[k] has the value true iff Pi has sent a message to Pk since its last checkpoint.
• min_toi[k] keeps the timestamp of the first message Pi sent to Pk since Pi’s last checkpoint.
C ≡ (∃k : sent_toi[k] ∧ m1.t > min_toi[k])
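A minimal sketch of a process that applies condition C on every reception. Several points are assumptions of this example rather than statements of the slides: processes are numbered 0 to n−1, a message is a `(payload, clock)` pair, the clock update `max(cl, m.cl)` is carried over from the Lamport-clock rules, and the names `CicProcess` and `forced` are hypothetical.

```python
class CicProcess:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.cl = 0
        self.forced = 0                  # number of forced checkpoints taken
        self._reset()

    def _reset(self):
        self.sent_to = [False] * self.n  # sent_to[k]: sent to Pk since last ckpt
        self.min_to = [None] * self.n    # min_to[k]: timestamp of the first such message

    def take_checkpoint(self, forced=False):
        self.cl += 1
        self._reset()                    # <resetting of data structures>
        if forced:
            self.forced += 1

    def send(self, k, payload):
        if not self.sent_to[k]:
            self.sent_to[k] = True
            self.min_to[k] = self.cl
        return (payload, self.cl)

    def condition(self, m_cl):
        # C: some message sent since the last checkpoint has a smaller timestamp.
        return any(self.sent_to[k] and m_cl > self.min_to[k] for k in range(self.n))

    def receive(self, message):
        payload, m_cl = message
        if self.condition(m_cl):
            self.take_checkpoint(forced=True)   # checkpoint before delivery
        self.cl = max(self.cl, m_cl)            # assumed Lamport update
        return payload

pi = CicProcess(pid=0, n=3)
pi.take_checkpoint()             # basic checkpoint, clock becomes 1
pi.send(2, "m2")                 # m2.t = 1
pi.receive(("m1", 1))            # m1.t <= m2.t: no forced checkpoint
pi.receive(("m1'", 5))           # m1'.t > m2.t: forced checkpoint
print(pi.forced, pi.cl)
```

The forced checkpoint is taken before the message is delivered, so the delivery falls in the new interval and the Z-path through the earlier send is cut.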
Refining the Condition (1)
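[Figure: two cases, a and b, with processes Pi, Pj and Pk, checkpoints Cj,y and Ck,z, and messages m1 and m2, plus an additional message (µ2 in one case, µ1 in the other).]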
Refining the Condition (2)
• cli(k) = value of Pk’s local clock as perceived by Pi (Pi can obtain this knowledge with a classical piggybacking technique).
No forced checkpoint is needed when:
(m1.t ≤ m2.t) ∨ P, where P ≡ (Cj,y.t ≤ m1.t ≤ cli(k) < Ck,z.t).
Does cli(k) Refer to a Correct Value?
[Figure: cases a and b with processes Pi, Pj and Pk, checkpoints Ci,x, Cj,y and Ck,z, and messages m1, m2, µ1 and µ2; a third diagram with Pi, Pk, m2, µ, Ci,x and Ck,z is labelled "causal Z-cycle".]
Final Condition
C′ ≡ (∃k : sent_toi[k] ∧ (m1.t > min_toi[k]) ∧ (m1.t > cli(k) ∨ C1))
C1 being the condition that detects causal Z-cycles.
Particular Cases
• C′′ ≡ ∃k : (sent_toi[k] ∧ (m.cl > min_toi[k]))
• C′′′ ≡ (m.cl > mini)
Consistent Global Checkpoints
Checkpointing protocols must ensure that each local checkpoint belongs to at least one global checkpoint with:
1 No orphan messages.
2 No missing messages.
Question: How can “no missing messages” be ensured at a reasonable cost?
Chandy-Lamport’s Coordinated Protocol: Example
[Figure: two diagrams of Pi and Pj with recorded states σi and σj and messages m1 and m2.]
• Which messages are in-transit (must be recorded)?
Chandy-Lamport’s Recording Rule
Upon reception by Pi of a marker sent by Pj:
• If Pi has not yet taken a local checkpoint:
⋆ no message is in transit with respect to this pair of local checkpoints.
• If Pi has already taken a local checkpoint:
⋆ the in-transit messages are all the messages received from Pj after σi and before the reception of the marker.
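The recording rule can be sketched as a small state machine on the receiving side. This example makes several assumptions of its own: the application state is a plain counter, channels are identified by the sender's name, the process takes its checkpoint when the first marker arrives (as in the Chandy-Lamport algorithm), and the forwarding of markers on outgoing channels is omitted. `SnapshotProcess` is a hypothetical name.

```python
class SnapshotProcess:
    def __init__(self, name, incoming):
        self.name = name
        self.incoming = list(incoming)   # names of the senders of incoming channels
        self.state = 0                   # hypothetical application state
        self.checkpoint = None           # sigma_i once recorded
        self.recording = {}              # channels still being recorded
        self.channel_state = {}          # in-transit messages, per channel

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.recording = {c: [] for c in self.incoming}

    def on_message(self, sender, value):
        # Received after sigma_i and before the marker of that channel: in transit.
        if self.checkpoint is not None and sender in self.recording:
            self.recording[sender].append(value)
        self.state += value

    def on_marker(self, sender):
        if self.checkpoint is None:
            # First marker: take the checkpoint; this channel is empty.
            self.take_checkpoint()
            self.recording.pop(sender)
            self.channel_state[sender] = []
        else:
            # Checkpoint already taken: close the recording of this channel.
            self.channel_state[sender] = self.recording.pop(sender)

p = SnapshotProcess("Pi", incoming=["Pj", "Pk"])
p.on_message("Pj", 1)
p.on_marker("Pj")          # checkpoint taken with state 1
p.on_message("Pk", 10)     # in transit on the channel from Pk
p.on_marker("Pk")
print(p.checkpoint, p.channel_state)
```

Because a marker can neither overtake nor be overtaken on its channel, everything received from a sender before its marker was sent before that sender's checkpoint.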
Timestamps in a Checkpoint Interval (1)
[Figure: two cases with processes Pi, Pj and Pk, messages m1 and m2, and checkpoints Ci,x, Cj,y and Ck,z. Case a: m1.t ≤ m2.t. Case b: m1.t > m2.t.]
Timestamps in a Checkpoint Interval (2)
[Figure: a checkpoint interval of Pi.]
Max_Reci ≤ cli ≤ Min_Senti
(initially, Max_Reci = −∞ and Min_Senti = +∞)
In-Transit vs. Orphan (1)
[Figure: processes Pi and Pj; two panels, "Computation" and "Reversed computation", with a message labelled "in-transit" in one and "orphan" in the other.]
Remark: In-Transit vs. Orphan (2)
• Let us consider the computation where all messages are reversed.
• Ensuring that each local checkpoint belongs to at least one orphan-free global checkpoint of the reversed computation is equivalent to ensuring that each local checkpoint belongs to at least one missing-free global checkpoint of the original computation.
Max_Senti ≤ cli ≤ Min_Reci
Recording Messages vs. Checkpointing
• Question: Can a recorded message be missing wrt a global checkpoint?
Max_Sent_NRi ≤ cli ≤ Min_Rec_NRi
No Orphan and No Missing Messages (R1 and R2)
Within each interval the following invariant must be preserved:
max(Max_Reci, Max_Sent_NRi) ≤ cli ≤ min(Min_Senti, Min_Rec_NRi)
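The invariant defines a range of admissible clock values, which may be empty. The sketch below computes that range from the four bounds; the function names `clock_range` and `invariant_holds` are hypothetical, and the infinite initial values are represented with `math.inf`.

```python
import math

def clock_range(max_rec, max_sent_nr, min_sent, min_rec_nr):
    """Return (lower, upper), the admissible values of cl_i, or None if empty."""
    lower = max(max_rec, max_sent_nr)
    upper = min(min_sent, min_rec_nr)
    return (lower, upper) if lower <= upper else None

def invariant_holds(cl, max_rec, max_sent_nr, min_sent, min_rec_nr):
    rng = clock_range(max_rec, max_sent_nr, min_sent, min_rec_nr)
    return rng is not None and rng[0] <= cl <= rng[1]

# One unrecorded message with timestamp 5 received, nothing sent, clock 5.
print(invariant_holds(5, 5, -math.inf, math.inf, 5))
# A second message with timestamp 8 received: no clock value fits any more.
print(clock_range(8, -math.inf, math.inf, 5))
```

When the range is empty, no increase of the clock can restore the invariant; one of the bounds has to be changed by recording a message or by starting a new interval.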
Sketch of a Protocol (1)
[Figure: relative positions of the values maintained by Pi, labelled Min_Rec_NLi, Min_Senti, cli, Max_Reci and Max_Sent_NLi.]
Sketch of a Protocol (2)
What to do (upon a send or a receive operation) in order to maintain the invariant?
• Increase the value of cli.
• Record some sent or received messages.
• Take a forced checkpoint.
Sketch of a Protocol: Example 1
[Figure: Pi, whose clock is 2, receives m1 (timestamp 5) and then m2 (timestamp 8).]
Variable        Before m1   Before m2   After m2
Max_Reci        −∞          5           8
Max_Sent_NRi    −∞          −∞          −∞
cli             2           5           ?
Min_Senti       +∞          +∞          +∞
Min_Rec_NRi     +∞          5           5
Sketch of a Protocol: Example 1 (continued)
Variable        Before m2   Record m1?   Ckpt before m2?
Max_Reci        5           8            −∞
Max_Sent_NRi    −∞          −∞           −∞
cli             5           8            >5
Min_Senti       +∞          +∞           +∞
Min_Rec_NRi     5           8            +∞
There is a choice: checkpointing or recording m1.
Sketch of a Protocol: Example 2
[Figure: Pi, whose clock is 2, sends m1 (timestamp 2) and then receives m2 (timestamp 8).]
Variable        Before m1   Before m2   After m2
Max_Reci        −∞          −∞          8
Max_Sent_NRi    −∞          2           2
cli             2           2           ?
Min_Senti       +∞          2           2
Min_Rec_NRi     +∞          +∞          8
Sketch of a Protocol: Example 2 (continued)
Variable        Before m2   Record m1?   Ckpt before m2?
Max_Reci        −∞          8            −∞
Max_Sent_NRi    2           −∞           −∞
cli             2           ?            >2
Min_Senti       2           2            +∞
Min_Rec_NRi     +∞          8            +∞
There is no choice: a checkpoint must be taken.
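The two examples can be replayed with a small decision function. This is one reading of the sketch, not the protocol itself, and it assumes that recording a message removes it from the NR bounds only, so Max_Rec and Min_Sent can never be relaxed, and that the clock never decreases. The function name `options_on_receive` and the returned strings are hypothetical.

```python
import math

def options_on_receive(cl, sent, received, t):
    """cl: current clock of Pi; sent / received: timestamps of the messages
    already sent / received in the current interval; t: timestamp of the
    incoming message. Returns what Pi can do before delivering it."""
    received = received + [t]
    # Case 1: nothing is recorded, so all four bounds of the invariant apply.
    lower = max(received + sent + [cl])   # assumption: cl never decreases
    upper = min(received + sent)
    if lower <= upper:
        return "deliver"
    # Case 2: every message is recorded; only Max_Rec <= cl <= Min_Sent remains.
    if max(received + [cl]) <= min(sent, default=math.inf):
        return "record or checkpoint"
    return "checkpoint"

# Example 1: clock 2, receive m1 (5), then m2 (8).
print(options_on_receive(2, [], [], 5))     # m1 can simply be delivered
print(options_on_receive(5, [], [5], 8))    # m2: record m1 or checkpoint
# Example 2: clock 2, m1 sent with timestamp 2, then receive m2 (8).
print(options_on_receive(2, [2], [], 8))    # m2: only a checkpoint works
```

In Example 2 recording cannot help because the conflict is between Max_Rec (8) and Min_Sent (2), the two bounds that come from the no-orphan rule.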