

Fault-tolerance and availability awareness in computational grids

6.5 Stochastic checkpoint model analysis issues


\[
E(V) = T + \frac{1}{1 - F_X(T)}\left[\int_0^T g\,f_X(g)\,dg + F_X(T)\,\mu_r\right] \tag{6.4}
\]

Since the integral is always bounded by the first moment of the time between failures X, denoted by \(\mu_x\), and since the expected completion time must be at least T, we have:

\[
T \le E(V) \le T + \frac{\mu_r + \mu_x}{1 - F_X(T)}. \tag{6.5}
\]

Equation (6.2) states that the completion time of the application increases exponentially when ω grows. This confirms that fault-tolerance techniques are required to improve the system performance. More generally, we conclude from Expression (6.5) that if most of the mass of \(f_X\) is concentrated in the interval [0, T], then \(F_X(T) \to 1\) and \(\int_0^T g f_X(g)\,dg \to \mu_x\), so the completion time tends to be much longer than under the failure-free assumption. On the other hand, if most of the mass of \(f_X\) is concentrated in [T, +∞), then \(F_X(T) \to 0\) and \(\int_0^T x f_X(x)\,dx \to 0\), so the completion time tends to T. Thus, fault-tolerance techniques should be investigated in this context.
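To make this behavior concrete, here is a small Monte Carlo sketch in Python (with hypothetical values for T, the mean time between failures and the repair time) that simulates an execution without checkpointing, where every failure loses all the work done so far, and compares the observed mean completion time with the upper bound of Expression (6.5) for an exponential failure law.

```python
import math
import random

def completion_time(T, mtbf, mu_r, rng):
    """One run of a task needing T units of work with no checkpointing:
    a failure before completion loses all work (restart from scratch)."""
    total = 0.0
    while True:
        x = rng.expovariate(1.0 / mtbf)   # exponential time to next failure
        if x >= T:                        # the task finishes before failing
            return total + T
        total += x + mu_r                 # lost work plus repair, then restart

rng = random.Random(42)
T, mtbf, mu_r = 10.0, 8.0, 1.0            # hypothetical values
runs = [completion_time(T, mtbf, mu_r, rng) for _ in range(20000)]
mean_v = sum(runs) / len(runs)

# Bound (6.5): T <= E(V) <= T + (mu_r + mu_x) / (1 - F_X(T)),
# where F_X(T) = 1 - exp(-T/mtbf) and mu_x = mtbf for the exponential law.
bound = T + (mu_r + mtbf) / math.exp(-T / mtbf)
print(mean_v, bound)                      # the empirical mean respects the bound
```

For these parameters the bound is loose but valid; the simulated mean sits well inside it, while still far above the failure-free time T.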

Let us now introduce how the checkpoint/restart mechanism can improve the system performance.

160 Fundamentals of Grid Computing

6.5.2 Impact of checkpointing on the completion time

To deal with failures and to improve the system performance, many fault-tolerance techniques have been proposed during the last decades. Based on simulation results, Elnozahy and Plank [Elnozahy and Plank, 2004] showed that the coordinated checkpoint approach is the most effective fault-tolerance mechanism in large parallel platforms. Let us recall briefly that the aim of such models is to optimize, with respect to some given metric, the trade-off between the amount of work lost when a failure occurs and the performance lost due to the checkpoint overhead. The most studied metric in computing systems is the completion time of an application. Fault-tolerance techniques are a defensive mechanism added to the application to reduce the consequences of failures (the amount of lost work, in the case of checkpointing mechanisms). Unfortunately, these defensive mechanisms also introduce overheads that decrease the performance of the whole system.

For instance, checkpointing on a BlueGene machine is reported to take an hour [Liang et al., 2006]. Hence, using these mechanisms without optimization techniques can seriously decrease the system performance. Since the 1970s, many works have modeled the application and checkpointing mechanisms and applied suitable analytical methods to optimize the system performance with respect to various criteria. All these models [Young, 1974], [Daly, 2006], [Ziv and Bruck, 1997], [Chandy and Ramamoorthy, 1972], [Toueg and Babaoglu, 1983] differ in some critical assumptions about the computing system.

FIGURE 6.5: General scheme of an execution under failures with checkpoint mechanism. (The figure shows the work ω split by checkpoints of cost c_j, failure inter-arrival times X_1, X_2, repair times R_1, R_2, checkpoint instants t_1, t_2, and the completion time V.)

The two following sections present an analysis of the checkpointing mechanism under two simple but reasonable hypotheses. The previous assumptions are still valid. In addition, we assume that the initial amount of work ω, considered as one big task, is preemptive (a mandatory assumption to implement checkpointing). The execution of this task is divided into k consecutive intervals of length \(t_j\) such that \(\sum_{j=1}^{k} t_j = T\). A checkpoint occurs

between each interval and costs \(c_j\) units of time, as depicted in Figure 6.5. The difference between the two analyses lies in the cost function of a checkpoint: in the first case it is supposed to be constant, whereas it is variable in the second one.

6.5.2.1 Constant checkpoint cost

Young proposes in [Young, 1974] a checkpoint/restart model where failures follow a probabilistic law but the checkpoint cost and the restart cost are constant. The model also assumes that checkpoints can be placed at any moment of the application, and that the checkpointing and restart phases are fault-free. Under those assumptions, Young provides a first-order approximation of the optimal time between two successive checkpoints using the following arguments. Let O be the overhead due to the work lost during failures and to the checkpoint cost, let τ be the period between checkpoints, such that τ = ω/k, and let λ be the rate of the Poisson process which represents failure arrivals.

Thus, O is given by the following expression:

\[
O(\tau) = \frac{\left(\frac{1}{\lambda} + c\right)\left(e^{\lambda(c+\tau)} - 1\right)}{\tau} \tag{6.6}
\]

Therefore, the optimal period between checkpoints τ is the root of the derivative of Expression (6.6), which is approximately \(\sqrt{2c/\lambda}\) when the exponential term is replaced by its first-order approximation and the checkpoint cost is assumed to be much smaller than the mean time between failures (c ≪ 1/λ).
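This first-order argument can be checked numerically. In the sketch below (Python, hypothetical parameter values, function names ours), the overhead per unit of useful work is approximated by c/τ + λτ/2 — the checkpoint cost amortized over the period, plus the expected half-period of work lost per failure — whose minimizer is exactly \(\sqrt{2c/\lambda}\).

```python
import math

def young_tau(c, lam):
    """Young's first-order optimal checkpoint period: tau = sqrt(2c/lambda)."""
    return math.sqrt(2.0 * c / lam)

def overhead_per_work(tau, c, lam):
    """First-order overhead per unit of useful work: the checkpoint cost
    amortized over the period, plus the expected half-period lost per failure."""
    return c / tau + lam * tau / 2.0

# Hypothetical numbers: 60 s checkpoints, one failure per day on average.
c, lam = 60.0, 1.0 / 86400.0
tau_star = young_tau(c, lam)
print(tau_star)   # about 3220 s, i.e. a checkpoint roughly every 54 minutes

# The closed form beats nearby periods under the first-order cost model.
assert overhead_per_work(tau_star, c, lam) <= overhead_per_work(0.5 * tau_star, c, lam)
assert overhead_per_work(tau_star, c, lam) <= overhead_per_work(2.0 * tau_star, c, lam)
```

Note how insensitive the result is to the exact failure law: only the rate λ and the checkpoint cost c enter the first-order optimum.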

Daly extends in [Daly, 2006] the model introduced by Young by proposing a higher-order solution for the optimal interval of time between checkpoints, considering that failures may happen during the checkpoint phase and the restart phase. He proposes an interval between checkpoints equivalent to Young's optimal interval length when the checkpoint cost is less than 2/λ; otherwise the optimal interval length is 1/λ. Therefore, Daly's model is more precise than Young's model when the checkpoint cost is close to the mean time between failures (1/λ).

\[
\tau =
\begin{cases}
\sqrt{\dfrac{2c}{\lambda}}\left[1 + \dfrac{1}{3}\sqrt{\dfrac{\lambda c}{2}} + \dfrac{1}{9}\,\dfrac{\lambda c}{2}\right] - c & \text{if } c \le 2/\lambda \\[2ex]
\dfrac{1}{\lambda} & \text{if } c > 2/\lambda
\end{cases}
\tag{6.7}
\]
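Equation (6.7) translates directly into code. The sketch below (Python, hypothetical numbers, function name ours) evaluates both branches of the formula.

```python
import math

def daly_tau(c, lam):
    """Daly's higher-order optimal checkpoint interval (Equation 6.7);
    c is the checkpoint cost, lam the failure rate of the Poisson process."""
    if c > 2.0 / lam:
        return 1.0 / lam                   # very expensive checkpoints
    x = math.sqrt(lam * c / 2.0)           # sqrt(lambda * c / 2)
    return math.sqrt(2.0 * c / lam) * (1.0 + x / 3.0 + x * x / 9.0) - c

# Hypothetical numbers: 60 s checkpoints, one failure per day on average.
c, lam = 60.0, 1.0 / 86400.0
print(daly_tau(c, lam))                    # a bit below Young's sqrt(2c/lambda)
```

With a cheap checkpoint the correction terms are tiny and the −c shift dominates, so Daly's interval stays close to Young's; the two solutions only diverge as c approaches 2/λ.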

Both models show that the average completion time grows linearly with the initial amount of work. Indeed, the expression of the overhead due to failures and checkpoints in Young's model [Young, 1974] does not depend on the initial amount of work, which implies that the expected completion time grows linearly. In [Daly, 2006], based on simulations, the author reaches the same conclusion. Hence, checkpointing improves the system performance.

6.5.2.2 Variable checkpoint cost

The second family of checkpoint/restart models considers a variable checkpoint cost. Several works claim that the checkpoint cost should not be considered constant [Ziv and Bruck, 1997], [Chandy and Ramamoorthy, 1972], [Toueg and Babaoglu, 1983]. Indeed, a popular technique to reduce the amount of data to save in the stable storage is an incremental method, which only saves the memory that changed since the previous checkpoint. With such a technique, a constant checkpoint cost is no longer a reasonable assumption.
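As an illustration of the incremental idea, the hypothetical sketch below (Python, class and method names ours) splits the process state into fixed-size blocks and writes out only the blocks whose content changed since the last checkpoint, so the checkpoint cost varies with the amount of modified memory.

```python
import hashlib
import pickle

class IncrementalCheckpointer:
    """Sketch of incremental checkpointing: split the state into blocks and
    emit only blocks whose content changed since the previous checkpoint."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.hashes = {}                 # block offset -> digest of last saved copy

    def checkpoint(self, data: bytes) -> bytes:
        delta = {}
        for off in range(0, len(data), self.block_size):
            block = data[off: off + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if self.hashes.get(off) != digest:   # block is new or modified
                delta[off] = block
                self.hashes[off] = digest
        return pickle.dumps(delta)       # only the changed blocks go to stable storage

ckpt = IncrementalCheckpointer(block_size=4)
first = ckpt.checkpoint(b"aaaabbbbcccc")    # everything is new: 3 blocks saved
second = ckpt.checkpoint(b"aaaaXXXXcccc")   # only the middle block changed
assert len(pickle.loads(first)) == 3
assert len(pickle.loads(second)) == 1
```

The second checkpoint is a third of the size of the first, which is precisely why a constant-cost model misrepresents this technique.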

The first work was proposed by Chandy and Ramamoorthy in 1972. Based on graph theory, it finds the optimal placement of the checkpoints [Chandy and Ramamoorthy, 1972]. The proposed technique relies on the existence of prior information about the checkpoint cost. Moreover, it also assumes that failures follow a Poisson process.

Toueg and Babaoglu tackle the same problem under the following assumptions: the application can be preempted only at n specific times \(t_i\) for \(1 \le i \le n\), and the failures follow a Poisson process [Toueg and Babaoglu, 1983]. The cost of a checkpoint at time \(t_i\) is \(c_i\). Under this model, they propose an O(n²) algorithm based on dynamic programming that leads to an optimal expected completion time. The algorithm assumes that there are only n possible places to schedule the checkpoints. The recurrence objective function is based on Equation (6.8), which gives the expected completion time E(V) if a checkpoint is placed at index i.

\[
E(V) = \frac{e^{\lambda(t_1 + t_2 + \cdots + t_i)} - 1}{\lambda} + c_i + \frac{e^{\lambda(t_{i+1} + \cdots + t_n)} - 1}{\lambda} \tag{6.8}
\]

To find the optimal placement of the checkpoints, the algorithm iterates on the index i, leading to the optimal sequence of checkpoints in at most n² iterations.
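The dynamic program can be sketched as follows (Python; this is our own rendering of an O(n²) recurrence in the spirit of Toueg and Babaoglu's algorithm, not their exact formulation). It uses the standard fact that executing an uninterrupted segment of total length L under Poisson failures of rate λ, restarting the segment from scratch after each failure, takes \((e^{\lambda L} - 1)/\lambda\) time in expectation; recovery costs are ignored in this sketch.

```python
import math

def optimal_checkpoints(t, c, lam):
    """O(n^2) dynamic program: pick a subset of the n candidate points at
    which to checkpoint so as to minimize the expected completion time.
    t[i-1] is the length of interval i, c[i-1] the checkpoint cost at point i."""
    n = len(t)
    prefix = [0.0] * (n + 1)
    for i in range(n):
        prefix[i + 1] = prefix[i] + t[i]

    def segment(j, i):
        # expected time to execute the work between points j and i
        return (math.exp(lam * (prefix[i] - prefix[j])) - 1.0) / lam

    best = [float("inf")] * (n + 1)       # best[i]: expected time up to point i,
    best[0] = 0.0                         # with a (virtual) checkpoint at i
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):                # j: position of the previous checkpoint
            cost = best[j] + segment(j, i) + (c[i - 1] if i < n else 0.0)
            if cost < best[i]:
                best[i], prev[i] = cost, j
    placed, i = [], n                     # walk back to recover the placement
    while i > 0:
        if i < n:
            placed.append(i)
        i = prev[i]
    return best[n], sorted(placed)

# Four unit intervals, cheap checkpoints, high failure rate: checkpointing pays off.
value, points = optimal_checkpoints([1.0] * 4, [0.1] * 4, 0.5)
no_ckpt = (math.exp(0.5 * 4.0) - 1.0) / 0.5
print(value, points)
assert value < no_ckpt
```

The double loop over (j, i) gives the quadratic complexity stated above; each inner evaluation is O(1) thanks to the prefix sums.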

Another important contribution is proposed by Ziv and Bruck in [Ziv and Bruck, 1997] under the following assumptions: failures arrive following a Poisson process, and the application can be preempted at any time t to take a checkpoint. They assume that the system is modeled by a Markov chain composed of two states \(s_1\) and \(s_2\) with transition function φ (that is to say, \(\phi_1\) is the probability of going from state \(s_1\) to state \(s_2\) and \(\phi_2\) is the probability of going from state \(s_2\) to state \(s_1\)). When the system is in state \(s_1\) (resp. \(s_2\)), the checkpoint cost is \(c_1\) (resp. \(c_2\)). They propose an algorithm to decide when a checkpoint should be taken. The algorithm has two parameters \(t_1\) and \(t_2\) such that \(t_1 \le t_2\) and is stated as follows:

repeat
    Wait t_1 units of time.
    if the state is s_1 then
        Take a checkpoint. The overhead is c_1.
    else if the state is s_2 then
        Wait up to t_2 for the system to change to state s_1.
        if the system changes to state s_1 then
            Take a checkpoint. The overhead is c_1.
        else
            Take a checkpoint. The overhead is c_2.
        end if
    end if
until all work is done
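The decision loop above can be simulated directly. The sketch below (Python; all names and parameter values are hypothetical) drives the policy against a simulated two-state background Markov chain; failures and the work actually lost to them are not modeled here, only the checkpoint overhead the policy accumulates.

```python
import random

def simulate_policy(t1, t2, c1, c2, phi1, phi2, total_work, rng):
    """Run the two-parameter policy (t1 <= t2) against a simulated background
    Markov chain: phi1 is the s1 -> s2 transition rate, phi2 the s2 -> s1 rate.
    Returns the checkpoint overhead accumulated while total_work units of
    computation complete; failures themselves are not simulated."""
    overhead, done, state = 0.0, 0.0, 1
    next_switch = rng.expovariate(phi1)    # time left until the next state flip

    def advance(dt):                       # progress the clock and the chain
        nonlocal state, next_switch
        next_switch -= dt
        while next_switch <= 0.0:
            state = 2 if state == 1 else 1
            next_switch += rng.expovariate(phi1 if state == 1 else phi2)

    while done < total_work:
        step = min(t1, total_work - done)  # work for t1 units of time
        advance(step)
        done += step
        if done >= total_work:
            break
        if state == 2:                     # wait up to t2 for a switch to s1
            slack = t2 - t1
            while slack > 0.0 and state == 2 and done < total_work:
                dt = min(next_switch, slack, total_work - done)
                advance(dt)
                done += dt
                slack -= dt
        if done < total_work:
            overhead += c1 if state == 1 else c2   # checkpoint now
    return overhead

print(simulate_policy(1.0, 2.0, 0.1, 1.0, 0.3, 0.7, 20.0, random.Random(7)))
```

Such a simulation is one way to explore the (t_1, t_2) trade-off empirically, complementing the analytical expression of the average overhead below.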

The average overhead ratio of this algorithm denoted by O is given by the following expression:

\[
O = \frac{e^{\lambda t_1} - 1 + \dfrac{\lambda p_2}{\lambda - \phi_2}\left(e^{\lambda t_2} - e^{\lambda t_1 + \phi_2(t_2 - t_1)}\right)}{\lambda\left(t_1 + p_2\,\dfrac{1 - e^{-\phi_2(t_2 - t_1)}}{\phi_2}\right)} + \frac{(1 - p_2)\,c_1 + p_2\,c_2}{t_1 + p_2\,\dfrac{1 - e^{-\phi_2(t_2 - t_1)}}{\phi_2}} - 1 \tag{6.9}
\]

where p2 is the probability that the state at a checkpoint is s2. The final step is to find the couple (t1, t2) that minimizes the average overhead. The proposed solution is to use a numerical method.

Let us summarize the results of this section. Under a basic stochastic model, the completion time of an application grows exponentially if no fault-tolerance technique is used. Moreover, using stochastic checkpoint/restart models, it is possible to improve the performance by orders of magnitude. However, these models are quite optimistic, and many of their hypotheses do not hold in actual computing systems. In the next section we focus on the implementation issues of some fault-tolerance mechanisms.

6.6 Implementations

A complete fault-tolerant middleware is a complex system that consists of many interconnected components. This section first presents some implementations of single-process checkpoints, then an overview of implemented distributed fault-tolerance protocols. Finally, we give a synthetic comparison of the main current implementations.

6.6.1 Single process snapshot

The problem of the single-process checkpoint can be addressed following three approaches, depending on the level at which the checkpoint is performed.

In the first approach, a process is saved as a memory dump. This can be done at the kernel level with Berkeley Lab's Linux Checkpoint/Restart (BLCR) [Duell et al., 2002] or at the user level with a library like Condor [Litzkow et al., 1997] or Libckpt [Plank et al., 1995]. Among these solutions, BLCR is the only one that supports multi-threaded applications.

This method is widely employed since it is transparent for the application developer. But this kind of snapshot has several drawbacks: restarting requires a homogeneous resource (same operating system and same CPU).

Furthermore, the process address space contains data that are not strictly required for restarting, so the checkpoint size is larger than necessary.

In order to abstract the process state, the second approach considers that it is the user's responsibility to write functions for saving and restarting a process. This method is effective because the developer can choose exactly which data to checkpoint, but it requires a significant effort from the application developer.
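A minimal sketch of this second approach in Python (the file name and state layout are hypothetical): the developer decides exactly what goes into the checkpoint and writes the save/restore pair, using a write-then-rename pattern so that a failure in the middle of a checkpoint leaves the previous one intact.

```python
import os
import pickle
import tempfile

STATE_FILE = "app_state.ckpt"            # hypothetical checkpoint file name

def save_state(state, path=STATE_FILE):
    """Application-level checkpoint: the developer picks exactly what to save.
    Write to a temporary file, then rename, so a crash mid-checkpoint
    leaves the previous checkpoint usable."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())             # force the data to stable storage
    os.replace(tmp, path)                # atomic rename on POSIX

def load_state(path=STATE_FILE):
    """Restart hook: return the saved state, or None to start from scratch."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage: resume an iterative computation where it left off.
state = load_state() or {"i": 0, "acc": 0}
for i in range(state["i"], 10):
    state["acc"] += i
    state["i"] = i + 1
    save_state(state)                    # checkpoint after every iteration
print(state["acc"])                      # 45 whether or not a restart happened
```

Because only the loop counter and accumulator are saved, the checkpoint is far smaller than a memory dump, which is exactly the advantage this approach trades against the extra developer effort.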

The third approach proceeds at the middleware level. It combines advantages of the two above approaches, but it requires the application to be written with a middleware that uses an abstract representation of the application, like objects (Charm++ [Zheng et al., 2004], [Chakravorty and Kalé, 2004]), task lists (Satin [Wrzesinska et al., 2006]) or data-flow graphs (Kaapi [Besseron and Gautier, 2008], [Jafar et al., 2009]). Using this abstract representation, the middleware can checkpoint on its own the tasks and data used by the application. This approach is fully transparent for the application developer; the process can be restarted on a heterogeneous resource (since the abstract representation is architecture-independent), and the snapshot is smaller than with a memory dump.