

Fault-tolerance and availability awareness in computational grids

6.5 Stochastic checkpoint model analysis issues


\[
E(V) = T + \frac{1}{1 - F_X(T)}\left[\int_0^T g\,f_X(g)\,dg + F_X(T)\,\mu_r\right] \tag{6.4}
\]

Since the integral is always bounded by the first moment of the time between failures X, denoted by \(\mu_x\), and since the expected completion time must be at least T, we have:

\[
T \le E(V) \le T + \frac{\mu_r + \mu_x}{1 - F_X(T)}. \tag{6.5}
\]

Equation (6.2) states that the completion time of the application increases exponentially when ω grows. This confirms that fault-tolerance techniques are required to improve the system performance. More generally, we conclude from Expression (6.5) that if most of the mass of \(f_X\) is concentrated in the interval [0, T], then \(F_X(T) \to 1\) and \(\int_0^T g f_X(g)\,dg \to \mu_x\), so the completion time tends to be much longer than under the failure-free assumption. On the other hand, if most of the mass of \(f_X\) is concentrated in [T, +∞), then \(F_X(T) \to 0\) and \(\int_0^T x f_X(x)\,dx \to 0\), so the completion time tends to T. Thus, fault-tolerance techniques should be investigated in this context.
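To make this behavior concrete, here is a small Monte Carlo sketch in Python (with hypothetical values for T, the mean time between failures and the repair time) that simulates an execution without checkpointing, where every failure loses all the work done so far, and compares the observed mean completion time with the upper bound of Expression (6.5) for an exponential failure law.

```python
import math
import random

def completion_time(T, mtbf, mu_r, rng):
    """One run of a task needing T units of work with no checkpointing:
    a failure before completion loses all work (restart from scratch)."""
    total = 0.0
    while True:
        x = rng.expovariate(1.0 / mtbf)   # exponential time to next failure
        if x >= T:                        # the task finishes before failing
            return total + T
        total += x + mu_r                 # lost work plus repair, then restart

rng = random.Random(42)
T, mtbf, mu_r = 10.0, 8.0, 1.0            # hypothetical values
runs = [completion_time(T, mtbf, mu_r, rng) for _ in range(20000)]
mean_v = sum(runs) / len(runs)

# Bound (6.5): T <= E(V) <= T + (mu_r + mu_x) / (1 - F_X(T)),
# where F_X(T) = 1 - exp(-T/mtbf) and mu_x = mtbf for the exponential law.
bound = T + (mu_r + mtbf) / math.exp(-T / mtbf)
print(mean_v, bound)                      # the empirical mean respects the bound
```

For these parameters the bound is loose but valid; the simulated mean sits well inside it, while still far above the failure-free time T.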

Let us now introduce how the checkpoint/restart mechanism can improve the system performance.

160 Fundamentals of Grid Computing

6.5.2 Impact of checkpointing on the completion time

To deal with failures and to improve the system performance, many fault-tolerance techniques have been proposed during the last decades. Based on simulation results, Elnozahy and Plank [Elnozahy and Plank, 2004] showed that the coordinated checkpoint approach is the most effective fault-tolerance mechanism in large parallel platforms. Let us recall briefly that the aim of such models is to optimize, with respect to some given metric, the trade-off between the amount of work lost when a failure occurs and the performance lost due to the checkpoint overhead. The most studied metric in computing systems is the completion time of an application. Fault-tolerance techniques are a defensive mechanism added to the application to reduce the consequences of failures (the amount of lost work, in the case of checkpointing mechanisms). Unfortunately, these defensive mechanisms also introduce overheads that decrease the performance of the whole system.

For instance, checkpointing on a BlueGene machine is reported to take an hour [Liang et al., 2006]. Hence, using these mechanisms without optimization techniques can seriously decrease the system performance. Since the 1970s, many works have modeled the application and checkpointing mechanisms and applied suitable analytical methods to optimize the system performance with respect to various criteria. All these models [Young, 1974], [Daly, 2006], [Ziv and Bruck, 1997], [Chandy and Ramamoorthy, 1972], [Toueg and Babaoglu, 1983] differ in some critical assumptions about the computing system.

FIGURE 6.5: General scheme of an execution under failures with checkpoint mechanism. (The figure shows the work ω split by checkpoints of cost c_j, failure inter-arrival times X_1, X_2, repair times R_1, R_2, checkpoint instants t_1, t_2, and the completion time V.)

The two following sections present an analysis of the checkpointing mechanism under two simple but reasonable hypotheses. The previous assumptions are still valid. In addition, we assume that the initial amount of work ω, considered as one big task, is preemptive (a mandatory assumption to implement checkpointing). The execution of this task is divided into k consecutive intervals of length \(t_j\) such that \(\sum_{j=1}^{k} t_j = T\). A checkpoint occurs

between each interval and costs \(c_j\) units of time, as depicted in Figure 6.5. The difference between the two analyses lies in the cost function of a checkpoint: in the first case it is supposed to be constant, whereas it is variable in the second one.

6.5.2.1 Constant checkpoint cost

Young proposes in [Young, 1974] a checkpoint/restart model where failures follow a probabilistic law but the checkpoint cost and the restart cost are constant. The model also assumes that checkpoints can be placed at any moment of the application, and that the checkpointing and restart phases are fault-free. Under those assumptions, Young provides a first-order approximation of the optimal time between two successive checkpoints using the following arguments. Let O be the overhead due to the work lost during failures and to the checkpoint cost, let τ be the period between checkpoints, such that τ = ω/k, and let λ be the rate of the Poisson process which represents failure arrivals.

Thus, O is given by the following expression:

\[
O(\tau) = \frac{\left(\frac{1}{\lambda} + c\right)\left(e^{\lambda(c+\tau)} - 1\right)}{\tau} \tag{6.6}
\]

Therefore, the optimal period between checkpoints τ is the root of the derivative of Expression (6.6), which is approximately \(\sqrt{2c/\lambda}\) when the exponential term is replaced by its first-order approximation and the checkpoint cost is assumed to be much smaller than the mean time between failures (c ≪ 1/λ).
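This first-order argument can be checked numerically. In the sketch below (Python, hypothetical parameter values, function names ours), the overhead per unit of useful work is approximated by c/τ + λτ/2 — the checkpoint cost amortized over the period, plus the expected half-period of work lost per failure — whose minimizer is exactly \(\sqrt{2c/\lambda}\).

```python
import math

def young_tau(c, lam):
    """Young's first-order optimal checkpoint period: tau = sqrt(2c/lambda)."""
    return math.sqrt(2.0 * c / lam)

def overhead_per_work(tau, c, lam):
    """First-order overhead per unit of useful work: the checkpoint cost
    amortized over the period, plus the expected half-period lost per failure."""
    return c / tau + lam * tau / 2.0

# Hypothetical numbers: 60 s checkpoints, one failure per day on average.
c, lam = 60.0, 1.0 / 86400.0
tau_star = young_tau(c, lam)
print(tau_star)   # about 3220 s, i.e. a checkpoint roughly every 54 minutes

# The closed form beats nearby periods under the first-order cost model.
assert overhead_per_work(tau_star, c, lam) <= overhead_per_work(0.5 * tau_star, c, lam)
assert overhead_per_work(tau_star, c, lam) <= overhead_per_work(2.0 * tau_star, c, lam)
```

Note how insensitive the result is to the exact failure law: only the rate λ and the checkpoint cost c enter the first-order optimum.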

Daly extends in [Daly, 2006] the model introduced by Young by proposing a higher-order solution for the optimal interval of time between checkpoints, considering that failures may happen during the checkpoint phase and the restart phase. He proposes an interval between checkpoints equivalent to Young's optimal interval length when the checkpoint cost is less than 2/λ; otherwise the optimal interval length is 1/λ. Therefore, Daly's model is more precise than Young's model when the checkpoint cost is close to the mean time between failures (1/λ).

\[
\tau =
\begin{cases}
\sqrt{\dfrac{2c}{\lambda}}\left[1 + \dfrac{1}{3}\sqrt{\dfrac{\lambda c}{2}} + \dfrac{1}{9}\,\dfrac{\lambda c}{2}\right] - c & \text{if } c \le 2/\lambda \\[2ex]
\dfrac{1}{\lambda} & \text{if } c > 2/\lambda
\end{cases}
\tag{6.7}
\]
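Equation (6.7) translates directly into code. The sketch below (Python, hypothetical numbers, function name ours) evaluates both branches of the formula.

```python
import math

def daly_tau(c, lam):
    """Daly's higher-order optimal checkpoint interval (Equation 6.7);
    c is the checkpoint cost, lam the failure rate of the Poisson process."""
    if c > 2.0 / lam:
        return 1.0 / lam                   # very expensive checkpoints
    x = math.sqrt(lam * c / 2.0)           # sqrt(lambda * c / 2)
    return math.sqrt(2.0 * c / lam) * (1.0 + x / 3.0 + x * x / 9.0) - c

# Hypothetical numbers: 60 s checkpoints, one failure per day on average.
c, lam = 60.0, 1.0 / 86400.0
print(daly_tau(c, lam))                    # a bit below Young's sqrt(2c/lambda)
```

With a cheap checkpoint the correction terms are tiny and the −c shift dominates, so Daly's interval stays close to Young's; the two solutions only diverge as c approaches 2/λ.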

Both models show that the average completion time grows linearly with the initial amount of work. Indeed, the expression of the overhead due to failures and checkpoints in Young's model [Young, 1974] does not depend on the initial amount of work, which implies that the expected completion time grows linearly. In [Daly, 2006], based on simulations, the author reaches the same conclusion. Hence, checkpointing improves the system performance.

6.5.2.2 Variable checkpoint cost

The second family of checkpoint/restart models considers a variable checkpoint cost. Several works claim that the checkpoint cost should not be considered constant [Ziv and Bruck, 1997], [Chandy and Ramamoorthy, 1972], [Toueg and Babaoglu, 1983]. Indeed, a popular technique to reduce the amount of data to save in the stable storage is an incremental method, which only saves the memory that changed since the previous checkpoint. With such a technique, a constant checkpoint cost is no longer a reasonable assumption.
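As an illustration of the incremental idea, the hypothetical sketch below (Python, class and method names ours) splits the process state into fixed-size blocks and writes out only the blocks whose content changed since the last checkpoint, so the checkpoint cost varies with the amount of modified memory.

```python
import hashlib
import pickle

class IncrementalCheckpointer:
    """Sketch of incremental checkpointing: split the state into blocks and
    emit only blocks whose content changed since the previous checkpoint."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.hashes = {}                 # block offset -> digest of last saved copy

    def checkpoint(self, data: bytes) -> bytes:
        delta = {}
        for off in range(0, len(data), self.block_size):
            block = data[off: off + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if self.hashes.get(off) != digest:   # block is new or modified
                delta[off] = block
                self.hashes[off] = digest
        return pickle.dumps(delta)       # only the changed blocks go to stable storage

ckpt = IncrementalCheckpointer(block_size=4)
first = ckpt.checkpoint(b"aaaabbbbcccc")    # everything is new: 3 blocks saved
second = ckpt.checkpoint(b"aaaaXXXXcccc")   # only the middle block changed
assert len(pickle.loads(first)) == 3
assert len(pickle.loads(second)) == 1
```

The second checkpoint is a third of the size of the first, which is precisely why a constant-cost model misrepresents this technique.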

The first work was proposed by Chandy and Ramamoorthy in 1972. Based on graph theory, it finds the optimal placement of the checkpoints [Chandy and Ramamoorthy, 1972]. The proposed technique relies on the existence of prior information about the checkpoint cost. Moreover, it also assumes that failures follow a Poisson process.

Toueg and Babaoglu tackle the same problem under the following assumptions: the application can be preempted only at n specific times \(t_i\) for \(1 \le i \le n\), and the failures follow a Poisson process [Toueg and Babaoglu, 1983]. The cost of a checkpoint at time \(t_i\) is \(c_i\). Under this model, they propose an O(n²) algorithm based on dynamic programming that leads to an optimal expected completion time. The algorithm assumes that there are only n possible places to schedule the checkpoints. The recurrence objective function is based on Equation (6.8), which gives the expected completion time E(V) if a checkpoint is placed at index i.

\[
E(V) = \frac{e^{\lambda(t_1 + t_2 + \cdots + t_i)} - 1}{\lambda} + c_i + \frac{e^{\lambda(t_{i+1} + \cdots + t_n)} - 1}{\lambda} \tag{6.8}
\]

To find the optimal placement of the checkpoints, the algorithm iterates on the index i, leading to the optimal sequence of checkpoints in at most n² iterations.
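The dynamic program can be sketched as follows (Python; this is our own rendering of an O(n²) recurrence in the spirit of Toueg and Babaoglu's algorithm, not their exact formulation). It uses the standard fact that executing an uninterrupted segment of total length L under Poisson failures of rate λ, restarting the segment from scratch after each failure, takes \((e^{\lambda L} - 1)/\lambda\) time in expectation; recovery costs are ignored in this sketch.

```python
import math

def optimal_checkpoints(t, c, lam):
    """O(n^2) dynamic program: pick a subset of the n candidate points at
    which to checkpoint so as to minimize the expected completion time.
    t[i-1] is the length of interval i, c[i-1] the checkpoint cost at point i."""
    n = len(t)
    prefix = [0.0] * (n + 1)
    for i in range(n):
        prefix[i + 1] = prefix[i] + t[i]

    def segment(j, i):
        # expected time to execute the work between points j and i
        return (math.exp(lam * (prefix[i] - prefix[j])) - 1.0) / lam

    best = [float("inf")] * (n + 1)       # best[i]: expected time up to point i,
    best[0] = 0.0                         # with a (virtual) checkpoint at i
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):                # j: position of the previous checkpoint
            cost = best[j] + segment(j, i) + (c[i - 1] if i < n else 0.0)
            if cost < best[i]:
                best[i], prev[i] = cost, j
    placed, i = [], n                     # walk back to recover the placement
    while i > 0:
        if i < n:
            placed.append(i)
        i = prev[i]
    return best[n], sorted(placed)

# Four unit intervals, cheap checkpoints, high failure rate: checkpointing pays off.
value, points = optimal_checkpoints([1.0] * 4, [0.1] * 4, 0.5)
no_ckpt = (math.exp(0.5 * 4.0) - 1.0) / 0.5
print(value, points)
assert value < no_ckpt
```

The double loop over (j, i) gives the quadratic complexity stated above; each inner evaluation is O(1) thanks to the prefix sums.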

Another important contribution is proposed by Ziv and Bruck in [Ziv and Bruck, 1997] under the following assumptions: failures arrive following a Poisson process, and the application can be preempted at any time t to take a checkpoint. They assume that the system is modeled by a Markov chain composed of two states \(s_1\) and \(s_2\) with transition function φ (that is to say, \(\phi_1\) is the probability of going from state \(s_1\) to state \(s_2\) and \(\phi_2\) is the probability of going from state \(s_2\) to state \(s_1\)). When the system is in state \(s_1\) (resp. \(s_2\)), the checkpoint cost is \(c_1\) (resp. \(c_2\)). They propose an algorithm to decide when a checkpoint should be taken. The algorithm has two parameters \(t_1\) and \(t_2\) such that \(t_1 \le t_2\) and is stated as follows:

repeat
    Wait t_1 units of time.
    if the state is s_1 then
        Take a checkpoint. The overhead is c_1.
    else if the state is s_2 then
        Wait up to t_2 for the system to change to state s_1.
        if the system changes to state s_1 then
            Take a checkpoint. The overhead is c_1.
        else
            Take a checkpoint. The overhead is c_2.
        end if
    end if
until all work is done
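The decision loop above can be simulated directly. The sketch below (Python; all names and parameter values are hypothetical) drives the policy against a simulated two-state background Markov chain; failures and the work actually lost to them are not modeled here, only the checkpoint overhead the policy accumulates.

```python
import random

def simulate_policy(t1, t2, c1, c2, phi1, phi2, total_work, rng):
    """Run the two-parameter policy (t1 <= t2) against a simulated background
    Markov chain: phi1 is the s1 -> s2 transition rate, phi2 the s2 -> s1 rate.
    Returns the checkpoint overhead accumulated while total_work units of
    computation complete; failures themselves are not simulated."""
    overhead, done, state = 0.0, 0.0, 1
    next_switch = rng.expovariate(phi1)    # time left until the next state flip

    def advance(dt):                       # progress the clock and the chain
        nonlocal state, next_switch
        next_switch -= dt
        while next_switch <= 0.0:
            state = 2 if state == 1 else 1
            next_switch += rng.expovariate(phi1 if state == 1 else phi2)

    while done < total_work:
        step = min(t1, total_work - done)  # work for t1 units of time
        advance(step)
        done += step
        if done >= total_work:
            break
        if state == 2:                     # wait up to t2 for a switch to s1
            slack = t2 - t1
            while slack > 0.0 and state == 2 and done < total_work:
                dt = min(next_switch, slack, total_work - done)
                advance(dt)
                done += dt
                slack -= dt
        if done < total_work:
            overhead += c1 if state == 1 else c2   # checkpoint now
    return overhead

print(simulate_policy(1.0, 2.0, 0.1, 1.0, 0.3, 0.7, 20.0, random.Random(7)))
```

Such a simulation is one way to explore the (t_1, t_2) trade-off empirically, complementing the analytical expression of the average overhead below.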

The average overhead ratio of this algorithm denoted by O is given by the following expression:

\[
O = \frac{e^{\lambda t_1} - 1 + \dfrac{\lambda p_2}{\lambda - \phi_2}\left(e^{\lambda t_2} - e^{\lambda t_1 + \phi_2(t_2 - t_1)}\right)}{\lambda\left(t_1 + p_2\,\dfrac{1 - e^{-\phi_2(t_2 - t_1)}}{\phi_2}\right)} + \frac{(1 - p_2)\,c_1 + p_2\,c_2}{t_1 + p_2\,\dfrac{1 - e^{-\phi_2(t_2 - t_1)}}{\phi_2}} - 1 \tag{6.9}
\]

where p2 is the probability that the state at a checkpoint is s2. The final step is to find the couple (t1, t2) that minimizes the average overhead. The proposed solution is to use a numerical method.

Let us summarize the results of this section. Under a basic stochastic model, the completion time of an application grows exponentially if no fault-tolerance technique is used. Moreover, using stochastic checkpoint/restart models, it is possible to improve the performance by orders of magnitude. However, these models are quite optimistic, and many of their hypotheses do not hold in actual computing systems. In the next section we focus on the implementation issues of some fault-tolerance mechanisms.

6.6 Implementations

A complete fault-tolerant middleware is a complex system that consists of many interconnected components. This section first presents some implementations of single-process checkpoints, then an overview of implemented distributed fault-tolerance protocols. Finally, we give a synthetic comparison of the main current implementations.

6.6.1 Single process snapshot

The problem of the single-process checkpoint can be addressed following three approaches, depending on the level at which the checkpoint is performed.

In the first approach, a process is saved as a memory dump. This can be done at the kernel level with Berkeley Lab's Linux Checkpoint/Restart (BLCR) [Duell et al., 2002] or at the user level with a library like Condor [Litzkow et al., 1997] or Libckpt [Plank et al., 1995]. Among these solutions, BLCR is the only one that supports multi-threaded applications.

This method is widely employed since it is transparent for the application developer. But this kind of snapshot has several drawbacks: restarting requires a homogeneous resource (same operating system and same CPU).

Furthermore, the process address space contains data that are not strictly required for restarting, so the checkpoint size is larger than necessary.

In order to abstract the process state, the second approach considers that it is the user's responsibility to write functions for saving and restarting a process. This method is effective because the developer can choose exactly which data to checkpoint, but it requires a significant effort from the application developer.
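A minimal sketch of this second approach in Python (the file name and state layout are hypothetical): the developer decides exactly what goes into the checkpoint and writes the save/restore pair, using a write-then-rename pattern so that a failure in the middle of a checkpoint leaves the previous one intact.

```python
import os
import pickle
import tempfile

STATE_FILE = "app_state.ckpt"            # hypothetical checkpoint file name

def save_state(state, path=STATE_FILE):
    """Application-level checkpoint: the developer picks exactly what to save.
    Write to a temporary file, then rename, so a crash mid-checkpoint
    leaves the previous checkpoint usable."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())             # force the data to stable storage
    os.replace(tmp, path)                # atomic rename on POSIX

def load_state(path=STATE_FILE):
    """Restart hook: return the saved state, or None to start from scratch."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage: resume an iterative computation where it left off.
state = load_state() or {"i": 0, "acc": 0}
for i in range(state["i"], 10):
    state["acc"] += i
    state["i"] = i + 1
    save_state(state)                    # checkpoint after every iteration
print(state["acc"])                      # 45 whether or not a restart happened
```

Because only the loop counter and accumulator are saved, the checkpoint is far smaller than a memory dump, which is exactly the advantage this approach trades against the extra developer effort.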

The third approach proceeds at the middleware level. It combines advantages of the two above approaches, but it requires the application to be written with a middleware that uses an abstract representation of the application, like objects (Charm++ [Zheng et al., 2004], [Chakravorty and Kalé, 2004]), task lists (Satin [Wrzesinska et al., 2006]) or data-flow graphs (Kaapi [Besseron and Gautier, 2008], [Jafar et al., 2009]). Using this abstract representation, the middleware can checkpoint on its own the tasks and data used by the application. This approach is fully transparent for the application developer; the process can be restarted on a heterogeneous resource (since the abstract representation is architecture-independent), and the snapshot is smaller than with a memory dump.