

6.3 Multi-objective scheduling for safety

6.3.1 Generalities

One common way to optimize the behavior of a parallel system is to optimize the use of its resources through scheduling theory [Leung, 2004]. Basically, an application is represented as a set of tasks with constraints such as precedences or communications. These tasks must be distributed among a set of processors.

Many models exist for addressing variants of the problem, such as processors with different speeds, specific routing topologies, etc. Extra features can also be included in scheduling problems, like the optimization of energy consumption or reliability.

Once the underlying model has been established, we can focus on the optimization of some performance indices on a target application, such as the makespan (defined as the maximum completion time of the application) or the flow time (the turnaround time in systems where tasks keep arriving over time).
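In standard scheduling notation (added here for illustration; the chapter does not spell it out), with $C_j$ the completion time of task $j$ and $r_j$ its release date, these two objectives read:

```latex
C_{\max} = \max_{j} C_j \qquad \text{(makespan)}
\qquad\qquad
F = \sum_{j} \bigl(C_j - r_j\bigr) \qquad \text{(flow time)}
```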

Within a scheduling formalism, it is possible to address the problem of safety as another objective, for instance, maximizing the reliability of the whole system. Other works address the problem of optimizing the number of tolerated faults. This issue will not be covered in this chapter, but such work can be found in [Benoit et al., 2008].

Optimizing only the reliability of a system does not really make sense, since doing so costs processing time and therefore degrades the system performance. The interesting problem is thus to optimize both the performance and the reliability of the system. In multi-objective optimization, the goal is to achieve a good trade-off between the objectives. This trade-off is resolved by a decision maker and not by an automatic system. However, computers can be used to produce an interesting set of solutions among which the decision maker can choose the best one depending on his or her use. Here, interesting solutions belong to the Pareto set.4 Details about multi-objective scheduling can be found in [Hoogeveen, 2004].
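Formally (a standard definition, consistent with the informal one given in the footnote), with the makespan to be minimized and the reliability to be maximized, a schedule $S_1$ dominates a schedule $S_2$ when:

```latex
C_{\max}(S_1) \le C_{\max}(S_2)
\quad\text{and}\quad
R(S_1) \ge R(S_2),
\quad\text{with at least one inequality strict.}
```

A schedule is Pareto optimal when no other feasible schedule dominates it.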

There are two common ways of determining the Pareto optimal solutions of a bi-objective optimization problem. The first one is to optimize a weighted sum of the two objectives (or any other adequate combination). By varying the weights, it is possible to find the solutions that lie on the convex hull of the Pareto set. The other way is to introduce a threshold on one objective that must not be exceeded, and then to find the solution that is optimal for the second objective using standard single-objective methods.
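The following sketch illustrates both techniques on a hypothetical finite set of candidate schedules, each summarized by a (makespan, reliability) pair; the data and names are purely illustrative and not taken from the references above.

```python
# Two classical ways of extracting trade-off solutions from a finite set of
# candidate schedules, each evaluated as (makespan, reliability).
candidates = [
    (10.0, 0.90), (12.0, 0.95), (15.0, 0.99), (11.0, 0.92), (20.0, 0.999),
]  # (makespan to minimize, reliability to maximize)

def weighted_sum(cands, w):
    """Scalarization: minimize w*makespan - (1-w)*reliability.
    Varying w only reaches points on the convex hull of the Pareto set."""
    return min(cands, key=lambda c: w * c[0] - (1 - w) * c[1])

def threshold_method(cands, makespan_bound):
    """Threshold (epsilon-constraint): among schedules whose makespan stays
    under the bound, keep the most reliable one.  Sweeping the bound can
    reach every Pareto optimal point."""
    feasible = [c for c in cands if c[0] <= makespan_bound]
    return max(feasible, key=lambda c: c[1]) if feasible else None

if __name__ == "__main__":
    for w in (0.1, 0.5, 0.9):
        print("weighted sum, w =", w, "->", weighted_sum(candidates, w))
    for bound in (11.0, 15.0, 20.0):
        print("threshold", bound, "->", threshold_method(candidates, bound))
```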

The first method is generally easier from an algorithmic perspective, but it misses all the solutions that are not on the convex hull of the Pareto set. The second technique can provide all the Pareto optimal solutions, but at the price of much harder optimization problems.

There are more pertinent models than can be detailed in this chapter. In the following sections, we focus on two scheduling models for safety in computational grids. Their applicability will be discussed. The problems they raise will be detailed, as well as how to solve them. How to optimize safety as well as performance will be sketched. Since duplication leads to harder models, we start by studying problems without duplication.

6.3.2 No duplication

We will present and discuss two problems depending on the type of faults.

The first problem that we consider is to schedule an application subject to independent permanent faults distributed according to a Poisson process.

4The Pareto set is the set of Pareto optimal solutions. Informally, a solution is Pareto optimal if no other solution is better on both objectives simultaneously.

Permanent faults are a common fault model in grid computing. Indeed, when a machine crashes, a technician has to repair it, and this manual intervention usually takes longer than the computation itself. Faults distributed according to a Poisson process is another common assumption; it amounts to considering that the failure rate is constant over time. This assumption is not always realistic, since machines usually have a high failure rate when they are started for the first time (due to unstable hardware or configuration issues), and the failure rate later increases again with hardware wear. However, if the computation is short compared to the grid lifespan, the variability of the failure rate can be neglected. The last assumption is that faults are independent. This assumption is not realistic either, since several causes of fault come from the environment, such as network failures, power outages, air conditioning dysfunctions, etc. However, this assumption is critical for computing the reliability of a system. The only way to prevent the effects of such global dysfunctions is to back up the computations (for instance, using checkpointing).
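Under this constant failure rate assumption, the probability that a processor $p$ with failure rate $\lambda_p$ runs without fault during an interval of length $t$ is the usual exponential survival function (a standard consequence of the Poisson model, stated here for concreteness):

```latex
\Pr\bigl[\text{no fault on } p \text{ during } [0, t]\bigr] = e^{-\lambda_p t}
```

By the independence assumption, the reliability of a group of processors is simply the product of such terms.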

The success probability of the application (i.e., the reliability) is the probability that all processors are still active when they complete their last task.

This is a direct consequence of the permanent fault model. The optimization of both makespan and reliability of a parallel application on a heterogeneous platform (heterogeneous in computing power as well as in failure rate) has been considered in [Dongarra et al., 2007], [Jeannot et al., 2008]. To optimize both, the concept of failure rate per operation is introduced. Optimizing the reliability is done by scheduling tasks on the most reliable (per operation) processors, whereas optimizing the makespan is achieved by using the most powerful machines. The problem of scheduling independent tasks has been solved in [Jeannot et al., 2008] using an approximation algorithm that sets a threshold on the makespan. The idea is to consider the processors ordered by decreasing reliability and to fill them up to the threshold. For the problem of scheduling an arbitrary application, [Dongarra et al., 2007] proposes a non-guaranteed heuristic using the same kind of techniques.
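A minimal sketch of this threshold idea for independent tasks is given below; it only illustrates the ordering-and-filling principle, not the actual approximation algorithm of [Jeannot et al., 2008], and all numerical data are hypothetical.

```python
import math

# Processors are ordered by increasing failure rate per operation (lam / speed)
# and filled without letting any execution time exceed a makespan threshold M.
processors = [
    {"name": "p1", "speed": 2.0, "lam": 1e-4},
    {"name": "p2", "speed": 1.0, "lam": 2e-5},
    {"name": "p3", "speed": 3.0, "lam": 5e-4},
]
tasks = [3.0, 1.0, 2.0, 4.0, 2.5]   # work amounts of independent tasks
M = 5.0                             # makespan threshold

def greedy_threshold(tasks, processors, M):
    order = sorted(processors, key=lambda p: p["lam"] / p["speed"])
    load = {p["name"]: 0.0 for p in processors}     # per-processor time
    mapping = {p["name"]: [] for p in processors}
    for w in sorted(tasks, reverse=True):
        for p in order:                              # most reliable first
            if load[p["name"]] + w / p["speed"] <= M:
                mapping[p["name"]].append(w)
                load[p["name"]] += w / p["speed"]
                break
        else:
            raise ValueError("threshold M too tight for this instance")
    # Permanent fault model: each processor must survive until it finishes
    # its own last task, and faults on different processors are independent.
    reliability = math.prod(math.exp(-p["lam"] * load[p["name"]])
                            for p in processors)
    return mapping, max(load.values()), reliability

print(greedy_threshold(tasks, processors, M))
```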

The second model deals with independent transient faults distributed according to a Poisson process. The idea behind transient faults is that the machine recovers. This model cannot be directly applied to grid computing; it models the situation where another machine can be used to replace the one that crashed. Such a situation can happen in grids that are under-loaded, or in computing systems where an operator guarantees the availability of the machines. Computing the reliability of a system in this model is somewhat easier than before: it corresponds basically to the probability that each task is executed correctly. However, the network introduces some difficulties when computing the reliability. The problem is usually handled using heuristics with no approximation guarantee (greedy algorithms) that optimize a linear combination of both objectives [Dogan and Özgüner, 2002].
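Ignoring communications, the reliability under this transient fault model reduces to the product of the success probabilities of the individual tasks; with $\pi(i)$ the processor executing task $i$ and $t_i$ its execution time there, this reads (a sketch, with communication terms omitted):

```latex
R \;=\; \prod_{i \in \text{tasks}} e^{-\lambda_{\pi(i)}\, t_i}
```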

6.3.3 Using duplication

Duplication makes things much more difficult to tackle, but it also allows the reliability to be improved drastically. The main existing model with duplication is based on an estimation of the reliability. Intuitively, without duplication, computing the reliability relies on very simple statistical events, such as "the processor is still active at the end of the schedule" (for the permanent fault model) or "the processor is active during the lifespan of a task" (for the transient fault model). With duplication, there are in general no simple events that describe the success of an application execution. As before, let us discuss the two main existing models.

Let us first consider the model of transient faults. One way of computing the reliability would be to consider all the possible combinations of fault occurrences and determine whether the schedule is still valid or not. This would consume a lot of computing power and, thus, is not reasonable. A less time-consuming way of computing the reliability is to consider that faults on tasks are statistically independent.5 Under this assumption, when a fault occurs, the impacted processor recovers before the next task begins. It is then possible to construct a causality dependence graph of the application. Each node of the graph represents the execution of one task on one processor, and two additional nodes representing the beginning and the end of the application are included. Directed edges represent the causality dependencies. If there is a path of tasks executed without fault from the beginning node to the end node, then the application execution is failure-free. Such a graph is called a Reliability Block Diagram (RBD) [Lloyd and Lipow, 1962], [Siewiorek and Swarz, 1998].

Computing the reliability of the application can then be done by computing the reliability of the RBD. Unfortunately, estimating the reliability of an RBD is, in general, an NP-complete problem. However, if the RBD has some specific structure, then its reliability can be computed in polynomial time.

The main class of RBDs with this property is the series-parallel graphs. The problem is then to construct a schedule such that the RBD is series-parallel. To reach this goal, it is necessary to add extra constraints to the scheduling problem.

A common property that meets this requirement is that, when a task begins, all the copies of its predecessors have been taken into account. This removes the hard combinatorial part of the reliability estimation by making the RBD a series-parallel graph (in fact, a series of parallel macro blocks). This technique has been used twice, depending on the communication pattern [Assayad et al., 2004, Girault et al., 2009]. The structure of the reliability function usually makes approximation algorithms impossible to construct. However, efficient approximation algorithms have been derived for some specific cases of independent tasks or for a chain of tasks in [Saule and Trystram, 2009].
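Under the "series of parallel macro blocks" structure mentioned above, the reliability of the RBD reduces to a simple product. The sketch below computes it for hypothetical per-copy success probabilities; the block layout and the numbers are illustrative only.

```python
import math

def parallel_block(copy_reliabilities):
    """A parallel macro block fails only if every copy of the task fails."""
    return 1.0 - math.prod(1.0 - r for r in copy_reliabilities)

def series_parallel_reliability(blocks):
    """A series of parallel blocks succeeds only if every block succeeds."""
    return math.prod(parallel_block(copies) for copies in blocks)

# Hypothetical example: three tasks duplicated on 2, 3 and 1 processors,
# with per-copy success probabilities of the form exp(-lambda * exec_time).
blocks = [
    [0.99, 0.95],        # task A, two copies
    [0.90, 0.92, 0.88],  # task B, three copies
    [0.999],             # task C, not duplicated
]
print(series_parallel_reliability(blocks))
```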

5This assumption is different from the statistical independence of faults.

In a model with permanent faults, duplication is complex due to the strong statistical dependency between the faults affecting the tasks. When a permanent fault appears, we would most likely want to use a dynamic scheduling scheme that relies on the machines that are still functioning correctly. Obtaining strong results from the reliability point of view would broaden our comprehension of the problem. Otherwise, it is likely that efficient non-guaranteed heuristics (within the theory of multi-objective scheduling) can be developed. Solutions based on work-stealing and checkpointing seem better suited to achieve efficient parallel executions even when the system is subject to faults.

To summarize, multi-objective scheduling can be applied to safety in order to help a decision maker resolve the trade-off between efficiency and reliability. This approach is general and needs to be adapted to different models.

Unfortunately, no single model is unanimously accepted, since each computing system requires its own specific fault model. Each resulting problem is almost unique; tackling each of them would be a titanic work. Instead, theoretical approaches focus on general but reasonable assumptions that highlight the core properties practical heuristics should focus on.