Fault-tolerance and availability awareness in computational grids

6.2 Background and definitions

This section presents a basic set of definitions used throughout the chapter. These definitions [Avizienis et al., 2004] characterize the concepts that come into play when addressing the dependability of grid systems. The basic qualitative definition of dependability is: “The ability to avoid service failures that are more frequent and more severe than is acceptable to the users” [Avizienis et al., 2004]. More precisely, dependability covers a set of attributes, namely availability, reliability, robustness, safety, integrity, and maintainability. An exhaustive study of all dependability aspects of a system is therefore far beyond the scope of this chapter. This chapter focuses mainly on availability, which is the proportion of time during which a system is in a functioning condition. More precisely, it is the probability that the system is in a correct state at a given time.

We also focus on reliability, which represents the probability that no failure occurs in a given interval of time. Finally, robustness corresponds to the ability of the system to behave as expected in the presence of failures. Thus, we will emphasize automatic or semi-automatic approaches that maximize these attributes using fault tolerance techniques that are transparent to the application.
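As an illustration of availability read as a proportion of time, the following minimal Python sketch computes the fraction of an observation window during which a system was functioning. The function name `availability`, the interval representation, and the sample trace are assumptions made for this example; the intervals are assumed not to overlap.

```python
from typing import List, Tuple


def availability(up_intervals: List[Tuple[float, float]], horizon: float) -> float:
    """Fraction of the window [0, horizon] during which the system was up.

    up_intervals holds (start, end) pairs, assumed non-overlapping.
    """
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    up_time = 0.0
    for start, end in up_intervals:
        # Clip each interval to the observation window.
        lo, hi = max(start, 0.0), min(end, horizon)
        if hi > lo:
            up_time += hi - lo
    return up_time / horizon


# Hypothetical trace: up from 0 h to 40 h, under repair until 45 h, up until 100 h.
trace = [(0.0, 40.0), (45.0, 100.0)]
print(availability(trace, 100.0))  # 0.95
```

With this trace the system is down for 5 hours out of 100, which gives an availability of 0.95.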

6.2.1 Grid architecture and execution model

The grid model abstracts the grid architecture, which makes it possible to design and verify protocols. The grid system model (Figure 6.1) is a set of clusters interconnected through a wide area network (WAN). Each cluster is composed of individual computers interconnected by a local area network (LAN). A cluster may also have a network attached storage (NAS) connected to its LAN.

FIGURE 6.1: Grid system model: each individual node of a cluster is able to access a network attached storage (NAS).

The distributed execution model consists of a set of processes that communicate only through messages (message passing model). The processes cooperate to solve a problem in a distributed fashion. They may interact with the outside world by sending or receiving messages. Rollback-recovery protocols require some assumptions about the communication subsystem: most of them assume that message delivery is reliable and respects FIFO order; some of them tolerate message loss, duplication, or reordering.
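The reliable FIFO assumption can be sketched in a few lines of Python. The class `FifoChannel` and the message names below are hypothetical and only illustrate the property: every message sent is delivered exactly once, in sending order.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class FifoChannel:
    """One-way channel without loss, duplication, or reordering."""

    def __init__(self) -> None:
        self._in_transit: Deque[str] = deque()

    def send(self, payload: str) -> None:
        # The message enters the channel at the tail of the queue.
        self._in_transit.append(payload)

    def deliver(self) -> Optional[str]:
        # The oldest message leaves first; None means the channel is empty.
        return self._in_transit.popleft() if self._in_transit else None


# One channel per ordered (sender, receiver) pair of processes.
channels: Dict[Tuple[str, str], FifoChannel] = {("P1", "P2"): FifoChannel()}

ch = channels[("P1", "P2")]
ch.send("m1")
ch.send("m2")
first = ch.deliver()
second = ch.deliver()
print(first, second)  # m1 m2
```

A protocol that tolerates loss, duplication, or reordering would have to work over a weaker channel than this one, for example one that may drop or permute queued messages.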

Fundamentals of Grid Computing

6.2.2 Fault models

A failure is due to an error in the system, which is itself the consequence of a fault. Different kinds of faults are usually distinguished according to their origin and their duration [Avizienis et al., 2004]. They may be intentional or not, originate in software or hardware, modify the processing time of an operation, provide a wrong result, or return no result at all. In this chapter, we are interested in accidental faults that do not modify the processing times of the computations and that provide no result when they occur.

Faults can also be distinguished by how long they last. A fault is said to be permanent if the affected component never behaves correctly again after the fault occurs, and transient if the fault is only active for a finite time interval. The length of this interval can be either deterministic or stochastic. It is common to consider transient faults of infinitely short duration.

Another important property of faults is the time at which a given fault occurs. Although fault arrivals are deterministic in some cases, it is more common to consider stochastic fault arrivals (see footnote 2). This introduces the concept of mean time between failures (often abbreviated MTBF). When the faults are independent of each other or when the probability of failure is constant, it is usual to consider that the faults arrive according to a Poisson process. In other cases, the more general Weibull distribution can be used. Several fault distribution models, together with a discussion of when to use them, are the subject of a chapter in [Barlow and Proschan, 1996].
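The two arrival models can be sketched with Python's standard `random` module. The function names, the parameter values, and the choice to draw successive inter-arrival times independently from a Weibull distribution are assumptions made for this illustration. Under the Poisson model, inter-arrival times are exponential with mean equal to the MTBF, and the probability of observing no fault in an interval of length t is exp(-t / MTBF). A Weibull distribution with shape 1 reduces to the exponential case.

```python
import math
import random
from typing import List


def poisson_fault_times(mtbf: float, horizon: float, rng: random.Random) -> List[float]:
    """Fault instants in [0, horizon) with exponential inter-arrival times of mean mtbf."""
    times = []
    t = rng.expovariate(1.0 / mtbf)  # expovariate takes the rate, i.e. 1 / mean
    while t < horizon:
        times.append(t)
        t += rng.expovariate(1.0 / mtbf)
    return times


def weibull_fault_times(scale: float, shape: float, horizon: float,
                        rng: random.Random) -> List[float]:
    """Fault instants in [0, horizon) with Weibull-distributed inter-arrival times."""
    times = []
    t = rng.weibullvariate(scale, shape)
    while t < horizon:
        times.append(t)
        t += rng.weibullvariate(scale, shape)
    return times


def reliability_exponential(t: float, mtbf: float) -> float:
    """Probability of no fault in an interval of length t under the Poisson model."""
    return math.exp(-t / mtbf)


rng = random.Random(42)  # fixed seed so that the sketch is repeatable
faults = poisson_fault_times(mtbf=100.0, horizon=1000.0, rng=rng)
print(len(faults), "faults sampled over the horizon")
print(reliability_exponential(100.0, 100.0))  # about 0.368
```

With an MTBF of 100 time units, the chance of running for 100 time units without a fault is only about 37%, which shows why long-running grid applications need fault tolerance.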

6.2.3 Consistent system states

Rollback-recovery protocols aim at restarting the execution after a failure from a global consistent state of the system.

The global state of a distributed application is composed of the states of all the individual processes and the states of the communication sub-system.

The state of each process can be captured easily (see footnote 3). However, the state of the communication subsystem is not directly accessible. It can be captured indirectly by flushing the communication channel or by logging the messages at emission or reception.
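The logging approach can be illustrated as follows. This is a minimal sketch assuming that every message carries a unique identifier and that both an emission log and a reception log are available; the function name `in_transit` is hypothetical. The state of a channel is then the list of messages that were sent but not yet received.

```python
from typing import List


def in_transit(sent_log: List[str], received_log: List[str]) -> List[str]:
    """Messages logged at emission but not at reception, in sending order."""
    received = set(received_log)
    return [msg for msg in sent_log if msg not in received]


# Hypothetical logs for one channel: three messages sent, one already received.
sent_log = ["m1", "m2", "m3"]
received_log = ["m1"]
print(in_transit(sent_log, received_log))  # ['m2', 'm3']
```

Flushing the channel corresponds to waiting until this list is empty before recording the process states.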

A consistent system state is a possible state of the system in a failure-free execution [Chandy and Lamport, 1985]. Applied to the message passing model, this definition means that “if a process state reflects a message receipt, then the state of the corresponding sender reflects sending that message” [Chandy and Lamport, 1985], [Elnozahy et al., 2002]. Consider for instance the global state C2 of Figure 6.2. The process P1 sends a message m3 to process P2. The global state C2 is composed of the state of P1 before sending and the state of P2 after reception: it is inconsistent.

2 A way to deal with deterministic faults is to use scheduling theory in a model with machine availability.

3 Multi-threaded processes may require more work.

FIGURE 6.2: Three processes exchange messages. Two global states are considered: on the left, the global state C1 is consistent; on the right, the global state C2 is inconsistent because message m3 is received on process P2 but not sent on process P1.
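The consistency condition quoted above can be checked mechanically. The sketch below assumes that each process state records the set of messages it has sent and the set it has received; the names `Message`, `orphan_messages`, and `is_consistent` are hypothetical. Only the message m3 from P1 to P2 is taken from the example, and the other states are illustrative.

```python
from typing import Dict, List, NamedTuple, Set


class Message(NamedTuple):
    name: str
    sender: str
    receiver: str


def orphan_messages(messages: List[Message],
                    sent: Dict[str, Set[str]],
                    received: Dict[str, Set[str]]) -> List[str]:
    """Messages whose receipt is recorded while their sending is not."""
    return [m.name for m in messages
            if m.name in received[m.receiver] and m.name not in sent[m.sender]]


def is_consistent(messages: List[Message],
                  sent: Dict[str, Set[str]],
                  received: Dict[str, Set[str]]) -> bool:
    # A global state is consistent when it contains no orphan message.
    return not orphan_messages(messages, sent, received)


messages = [Message("m3", "P1", "P2")]

# State in the spirit of C2: P1 is captured before sending m3, P2 after receiving it.
c2_sent = {"P1": set(), "P2": set()}
c2_received = {"P1": set(), "P2": {"m3"}}
print(is_consistent(messages, c2_sent, c2_received))  # False

# Illustrative state: m3 is sent but still in transit, which is allowed.
ok_sent = {"P1": {"m3"}, "P2": set()}
ok_received = {"P1": set(), "P2": set()}
print(is_consistent(messages, ok_sent, ok_received))  # True
```

A message that is sent but not yet received does not violate the condition; it belongs to the state of the communication subsystem and must be captured by flushing or logging.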

6.3 Multi-objective scheduling for safety