A Graph Transformation-based Approach for the Validation of Checkpointing Algorithms in Distributed Systems

(1)

O

pen

A

rchive

T

OULOUSE

A

rchive

O

uverte (

OATAO

)

OATAO is an open access repository that collects the work of Toulouse researchers and

makes it freely available over the web where possible.

This is an author-deposited version published in :

http://oatao.univ-toulouse.fr/

Eprints ID : 15189

The contribution was presented at WETICE 2014 :

http://cmt.dmi.unipr.it/wetice14/

To cite this version :

Khlif, Houda and Hadj Kacem, Hatem and Pomares

Hernandez, Saùl E. and Eichler, Cédric and Hadj Kacem, Ahmed and Calixto

Simon, Alberto A Graph Transformation-based Approach for the Validation

of Checkpointing Algorithms in Distributed Systems. (2014) In: 23rd IEEE

International Workshop on Enabling Technologies: Infrastructure for

Collaborative Enterprises (WETICE 2014), 23 June 2014 - 25 June 2014

(Parme, Italy).

Any correspondence concerning this service should be sent to the repository

administrator: [email protected]

(2)

A graph transformation-based approach for the

validation of checkpointing algorithms in distributed

systems

Houda Khlif, Hatem Hadj Kacem

ReDCAD Laboratory FSEGS, University of Sfax

Sfax, Tunisia [email protected] [email protected]

Sa´ul E. Pomares Hernandez1,2,3_{, C´edric Eichler}2,3

1_{Instituto Nacional de Astrof´ısica, ´}_{Optica y Electr´onica (INAOE)} Luis Enrique Erro 1, C.P. 72840, Tonantzintla, Puebla, Mexico 2

CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France 3

Univ de Toulouse, LAAS, F-31400 Toulouse, France [email protected]

[email protected]

Ahmed Hadj Kacem

ReDCAD Laboratory FSEGS, University of Sfax

Sfax, Tunisia [email protected]

Alberto Calixto Sim´on

Universidad del Papaloapan, UNPA

Av, Ferrocarril s/n, C.P. 68400, Loma Bonita, Oaxaca, Mexico [email protected]

Abstract—Autonomic Computing Systems are oriented to pre-vente the human intervention and to enable distributed systems to manage themselves. One of their challenges is the efficient monitoring at runtime oriented to collect information from which the system can automatically repair itself in case of failure.

Quasi-Synchronous Checkpointing is a well-known technique, which

allows processes to recover in spite of failures. Based on this technique, several checkpointing algorithms have been developed. According to the checkpoint properties detected and ensured, they are classified into: Strictly Z-Path Free (SZPF), Z-Path Free (ZPF) and Z-Cycle Free (ZCF). In the literature, the simulation has been the method adopted for the performance evaluation of checkpointing algorithms. However, few works have been designed to validate their correctness. In this paper, we propose a validation approach based on graph transformation oriented to automatically detect the previous mentioned checkpointing properties. To achieve this, we take the vector clocks resulting from the algorithm execution, and we model it into a causal graph. Then, we design and use transformation rules oriented to verify if in such a causal graph, the algorithm is exempt from non desirable patterns, such as Z-paths or Z-cycles, according to the case.

I. INTRODUCTION

As computing systems have reached a level of complexity, their management has become increasingly difficult. As a result, the initiative of autonomic computing has been introduced to prevent human intervention and enable the system to manage itself. An Autonomic Distributed System is considered as a set of geographically distributed autonomic components that communicate/collaborate among them. In general, the autonomic computing has four elements: self-configuration, self-healing, self-optimizing and self-protecting. One open challenge in self-healing is the efficient monitoring at runtime oriented towards the collection of information, from which the system itself will detect problems that result

from failures in software and/or hardware components, diagnose and initiate a corrective action without distributing system applications. Checkpointing is a well-known technique that allows processes to make progress in spite of failures [13]. Its basic idea is to start the system from a consistent state when failures occur. Indeed, when global states are periodically recorded during the system execution, they are called global checkpoints. Thus, a global checkpoint is a set of local checkpoints, where each one presents the recorded states of a process. A global checkpoint is consistent if no local checkpoint occurs before another, that is, there is no causal path from one checkpoint to another, in the sense that a message (or a sequence of messages) sent after one checkpoint is received before the other [7]. Checkpointing algorithms are organized into three classes: asynchronous, synchronous and quasi-synchronous [9]. In asynchronous checkpointing, also known as uncoordinated checkpointing, each process takes its checkpoints independently, which leads to the domino effect. The domino effect exists due to a dependency relation between checkpoints called zigzag path or Z-path [11]. In synchronous checkpointing, or coordinated checkpointing, inorder to avoid the domino effect, processes coordinate their checkpoints by the addition of control messages to form consistent states. In quasi-synchronous checkpointing, or Communication Induced Checkpointing (CIC), coordination is achieved by piggybacking control information on application messages and taking forced local checkpoints in case of dangerous patterns, such as a Z-path, in order to avoid the domino effect.

Several algorithms, which propose different methods to force checkpoints and produce checkpoint and communication patterns with different properties, have been developed. They are classified into: Strictly Z-Path Free (SZPF), Z-Path Free (ZPF) and Z-Cycle Free (ZCF). However, few works have

(3)

dealt with the correctness of such algorithms. In this paper, we propose a validation approach for checkpointing algorithms. In fact, to validate the correctness of a checkpointing algorithm, we propose modelling its execution by a graph. Then, we use graph transformation approaches to verify if in such a graph, the algorithm is exempt from dangerous patterns like Z-paths and Z-cycles.

This paper is structured as follows. Section 2 defines the system model and background, describes the classification of quasi-synchronous checkpointing, and defines the principle approaches of graph transformation. Section 3 shows some related works. In section 4, we describe the process of our approach, and we present a set of transformation rules that we have designed. Finally, we summarize our contributions and suggest new research.

II. PRELIMINARIES

A. System Model

Processes. The system under consideration is composed of a

set of processesP = {p1, p2, · · · , pn}. The processes present

an asynchronous execution and communicate only by message passing.

Messages. We consider a finite set of messages M , where

each message m ∈ M is sent considering an asynchronous

reliable network that is characterized by no transmission time boundaries, no order delivery, and no loss of messages. The

set of destinations of a messagem is identified by Dest(m).

Events. We consider two types of events: internal and external events. An internal event is a unique action that occurs at a

processp in a local manner and which changes only the local

process state. We denote the finite set of internal events asR.

The setR represents the set of relevant events. We consider

only the checkpoints as internal events. We denote byCx

i the

xth_{checkpoint of process}_p

i. The sequence of events occurring

atpi, betweenCix−1 and Cix (i > 0), is called a checkpoint

interval (denoted asIx

i). On the other hand, while an external

event is also a unique action that occurs at a process, it is seen by other processes, thus, affecting the global state of the system. The external events considered in this paper are the

send and delivery events. Let m be a message. We denote

by send(m) the emission event and by delivery(p, m) the

delivery event of m to participant p ∈ P . The set of events

associated toM is the set E(M ) =

{send(m) : m ∈ M } ∪ {delivery(p, m) : m ∈ M ∧ p ∈ P }.

The whole set of events in the system is the finite set E =

R ∪ E(M ). The distributed computation is modeled by the

partially ordered set ˆE = ( ˆE,→), where → denotes Lamports

Happened Before Relation [7].

B. Background and Definitions

Happened Before Relation (HBR). The definition of HBR is the following:

Definition 1. The Happened Before Relation (HBR) [7], “→”,

is the smallest relation on a set of events E satisfying the following conditions:

1) Ifa and b are events belonging to the same process,

anda was originated before b, then a → b.

2) Ifa is the sending of a message by one process, and b

is the receipt of the same message in another process,

thena → b.

3) Ifa → b and b → c, then a → c.

The Happened Before Relation among the set of relevant

eventsR ∈ E , is defined as follows:

• IfCx

i andC

y

j are checkpoint (relevant) events

belong-ing to the same process, andCx

i was originated before

Cjy, thenCix→ C

y

j.

• If Cx

i is the sending of a message by one process,

andCjyis the receipt of the same message in another

process, thenCx i → C y j. • IfCx i → C y j andC y j → Ckz, thenCix→ Ckz.

Z-path and Z-cycle. Netzer et al. [10] defined the notion of zigzag path (z-path) as a genaralization of HBR, as follows:

Definition 2. Z-path exists from Cip to another C

q

j iff there

are messagesm1, m2, ..., ml such that:

1) m1 is sent by processp after Cip,

2) if mk (1 ≤ k < l) is received by process r, then

mk+1 is sent by r in the same or at a later

check-point interval (althoughmk+1may be sent before or

aftermkis received), and

3) ml is received by processq before Cjq.

Helary et al defined the following in [4].

Definition 3. A Z-path[m1, ..., mq] is causal, iff for each pair

of consecutive messagesmα andmα+1:

delivery(mα)→ send(mα+1). Otherwise, it is a non causal

Z-path.

Definition 4. A local checkpoint Cjy Z-depends on a local

checkpointCx

i ,Cix

Z

−→ Cjy, if:

1) j = i and y > x, or

2) there is a Z-path fromCx

i toC

y

j.

Definition 5. A Z-cycle is a Z-dependency from a local

checkpointCx

i to itself:Cix

Z

−→ Cx

i.

Theorem 1. The length of a Z-cycle (or Z-path) is l if the

Z-cycle (or Z-path) is formed byl messages m1,m2, ...,ml.

Definition 6. A communication and checkpoint pattern (CCP)

is a pair ( ˆE,R_Eˆ) where ˆE is a partially ordered set modeling a

distributed computation, andR_Eˆ is a set of local checkpoints

defined on ˆE.

An example of Communication and checkpoint pattern is given

in Figure 1. In this example, the messages [m1,m2] form a

Z-cycle of length two involving C2

2, and [m5,m4,m3] form a

Z-cycle of length three involvingC3

3.

Theorem 2. The following properties of a communication and

checkpoint pattern ( ˆE,R) are equivalent:

1) ( ˆE,R) has no Z-cycle.

2) It is possible to timestamp its local checkpoints in

(4)

p1 p2 p3 C 1 1 C 2 1 C 3 1 C 1 2 C 2 2 C 3 2 C 4 2 C 1 3 m m C 2 3 C 3 3 C 4 3 m m 1 2 3 5 I 1 3 I 2 3 I 3 3 m 4

Fig. 1. A communication and checkpoint pattern

A−→ B ⇒ A.t < B.tZ

wheret is a logical clock as defined by Lamport [7].

C. Classification of quasi-synchronous checkpointing

A local checkpoint that cannot be part of a consistent global snapshot is said to be useless. A local checkpoint that can be part of a consistent global snapshot is called a useful checkpoint. The main advantage of a quasi-synchronous check-pointing algorithm is that it can reduce the number of useless checkpoints. Quasi-synchronous checkpointing algorithms are classified into three different classes, namely, Strictly Z-Path Free (SZPF), Z-Path Free (ZPF), and Z-Cycle Free (ZCF) [9]. Strictly Z-path free checkpointing eliminates all the noncausal Z-paths between checkpoints altogether.

Definition 7. A checkpointing pattern is said to be strictly

Z-path free (or SZPF) if there exists no noncausal Z-Z-path between any two (not necessarily distinct) checkpoints.

In a ZPF system, it is possible to prevent useless checkpoints by eliminating only those noncausal Z-paths in which there is no sibling causal path. The ZPF model is defined below: Definition 8. A checkpointing pattern is said to be Z-path free

(or ZPF) iff for any two checkpoints A and B, a Z-path exists from A to B iff there exists a causal path from A to B.

In a ZCF model, only Z-cycles are prevented. A ZCF model is defined as follows:

Definition 9. A checkpointing pattern is said to be Z-cycle

free (or ZCF) iff none of the checkpoints lie on a Z-cycle.

All checkpoints taken in SZPF, ZPF and ZCF systems are useful. The construction of consistent global checkpoints is more difficult in a ZCF than the other systems because of the existence of noncausal Z-paths. However, a ZCF system has the potential to have the lowest checkpointing overhead as it takes forced checkpoints only to prevent Z-cycles. Figure 2 summarizes the relationship between the various quasi syn-chronous checkpointing models. As shown in this Figure, a SZPF system is a ZPF system, but the converse is not true. Likewise, a ZPF system is a ZCF system, but the converse is not true.

D. Graph Transformation

Graph Transformation is a rule-based approach targeted to-wards the modifications on a graph [14], [2], [1]. A Graph

"#$% "$% "&% '()*+,*-./067.7(*

Fig. 2. Relationship between the various quasi synchronous checkpointing models

Transformation Rule is described by a pair of graphs r =

(L, R), where L is called the left-hand side graph and R is

called the right-hand side graph. Applying the ruler = (L, R)

means finding a match ofL in the source graph and replacing

L by R. The suppression of the occurrence of L in G may cause the appearance of edges without a starting node or a terminating node or both. Those edges are called “dangling edges”. Two principle approaches have dealt with this problem, which are the SPO approach and the DPO approach [1]. The Single PushOut Approach (SPO). A rule of type SPO

is a production of the form(L, R). Its application to a graph

G is related to the existence of an occurrence of L in G. The

application of the SPO rule to a graphG involves the removal

of the graph corresponding to Del = (L\(L ∩ R)) and the

addition of the graph corresponding toAdd = (R\(L ∩ R)).

The Double PushOut Approach (DPO). A rule of type DPO

is a production of the form (L, K, R), where K is used to

specify clearly the part to preserve after applying the rule.

If both conditions of existence of the occurrence of L and

absence of suspended edges are checked, the application of the rule involves the removal of the graph corresponding to

the occurrence of Del = (L\K) and the addition of a copy

of the graphAdd = (R\K).

Neighborhood Controlled Embedding Approach (NCE). The NCE mechanism [14] is based on the specification of the so-called connection instructions. These instructions are based on the labels of nodes to define new edges to be introduced. The connection instructions are described by a pair (n, δ). The execution of this instruction connection involves

the introduction of an edge between the added node n and

all the neighboring nodes of the removed node, whose label

is δ. dNCE (d for edge label), eNCE (e for edge label),

and edNCE are extensions of NCE approach. The dNCE connection instructions are described by a triplet (n, δ, d),

where d ∈ {in, out} is for controlling the direction of the

edges. The eNCE connection instructions are described by a triplet (n, p/q, δ), where p and q are edge labels. The edNCE is the combination of the eNCE and dNCE approaches. The

ed-NCE connection instructions are of the form(n; p/q; δ; d; d′_).

The execution of this instruction implies the introduction of

an edge in the direction indicated by d′ _{between the node}

n and all nodes n′ _{that are} _{p-neighbours and d-neighbours}

(in-neighbours if d=in and out-neighbours otherwise) of the

(5)

III. RELATEDWORK

Finding a method to construct consistent global checkpoints in a ZCF system has been an open problem. The impossibility of designing an optimal ZCF quasi-synchronous checkpoint-ing algorithm has been treated by Tsai et al [17]. In the last few years, some ZCF quasi-synchronous checkpointing algorithms have been developed. For example, the Fully formed (FI) algorithm of Helary et al [4], the Fully In-formed aNd Efficient (FINE) algorithm of Luo et al. [8], the Delayed Communication-Induced Checkpointing (DCFI) algorithm [15] and The Scalable Fully Informed (SF-I) al-gorithm [16] of Calixto et al. According to our knowledge, the present work introduces the first solution based on graph transformation for the validation of checkpointing properties that can be used with any CIC algorithm.

Wang [18], [19] has defined a graph called “Rollback depedency graph” to show Z-paths in a distributed com-putation. It is easy to detect Z-paths in such a graph, but detecting Z-cycles is a critical problem. Taesoon Park and Heon Y. Yeom [12] have proposed a scheme for detecting Z-cycles of length two. Such scheme takes forced checkpoints to break them. Then, Chin-Lin Kuo and Yuo-Ming Yeh [6] have proposed an in-line distributed algorithm to detect all Z-cycles of long length and their involved checkpoints. This algorithm conceptualizes an appropriate data structure to express Z-path and Z-cycle. It requires much piggybacked Z-path information, but it detects, for a distributed computation, all the existing Z-cycles and their involving checkpoints. In order to use this last solution for validation purposes, such algorithm must be executed at runtime in a simultaneous way along with the checkpointing algorithms, which implies an expensive and ad-ditional use of memory, processing and bandwidth resources.

IV. APPROACH

Figure 3 ullistrates the general process of our approach. We have choosen to use the Graph Matching and Transformation

Engine (GMTE)1 _{as it is able to search small and medium}

graph patterns in huge graphs in a short time. It receives as input graph descriptions and transformation rules written in XML [3]. In our approach, the GMTE receives a checkpoint graph that models the execution of a checkpointing algorithm, for which it applies a transformation rule presenting dangerous patterns to show if in such a graph the algorithm is exempt from these patterns. Our solution is based on two main approaches which are SPO approach and edNCE approach. The SPO approach is to deal with suspended egdes as it provides power to remove this kind of edges greater than the DPO approach. The edNCE approach is for the addition and the removal of nodes using the connection instructions.

89:; 86)<0 6(=> 86)<0 ?+=> @.A).B>C $)BB>6.* &0>/D<7+.B+.E )=E76+B0F G(B<(B 86)<0 H)=+C)B+7. H>6C+/B

Fig. 3. Validation process

1_{GMTE is available at http://homepages.laas.fr/khalil/GMTE/}

A graph transformation rule applied to an HBR graph. The HBR graph presents the causal relations of the system. The graph presented in Figure 4 is the HBR graph corrsponding to the scenario of Figure 1. As shown in this Figure, a causal graph contains three types of edges:

• A local relation that connects two successive

check-pointsCx

i andCix+1, belong to the same processpi,

is denoted by the label “c”.

• A direct relation between two checkpoints Cx

i of pi

andCjyof pj is denoted by the label “d”.

• A transitive relation is denoted by the label “t”.

/,=)I>=>C >CE> B,=)I>=>C >CE> C,=)I>=>C >CE> C 1 1 C 2 1 _C 3 1 C 1 2 C 2 2 C 3 2 C 4 2 C 1 3 C 2 3 C 3 3 C 4 3

Fig. 4. Example of an HBR graph

Figure 5 presents the general patterns of Z-paths and Z-cycles.

A Z-path, in an HBR graph, is in the form of (n1

d

−→ n3,n2

c

−→ n3, n3

d

−→ n4). It can be either of length two or more.

A Z-cycle, in an HBR graph, is in the form of (n1

d −→ n2, n2 c −→ n3,n3 d

−→ n1). It can also be of length two or more.

C / C C C C C C C / / / / /

JIK #,/-/=> 7? =>.EB0 BA7

JCK #,/-/=> 7? =>.EB0 !" C / C J)K #,<)B0 7? =>.EB0 BA7 #1 #3 #2 #4 #2 #3 #1 C C " " " " " " C / C C / C J/K #,<)B0 7? =>.EB0 !" / C " " "

(6)

We start by eliminating all t-labeled edges, as they can be involved neither in a Z-path nor in a Z-cycle. For this, we use

the ruler1. This rule is of type SPO. Its application implies the

removal of allt-labeled edges. An example of the application

of ruler1 to an HBR graph is given in Figure 6.

/,=)I>=>C >CE> B,=)I>=>C >CE>

86)<0 $1 L MNO 86)<0 C,=)I>=>C >CE> 86)<0 $2

C 1 1 C 3 1 C 2 2 C 3 3 C 4 2 C 4 3 C 1 3 C 3 2 C 2 1 C 2 3 C 1 2 2 C 1 1 C 2 1 _C 3 1 C 1 2 C 2 2 C 4 2 C 1 3 C 2 3 C 3 3 C 4 3 P<<=-+.E%1 B #1 #2 %1L &₁"L"""""""""""""""""""""""""""""""'"(1")"""""#3 #4 C 3 2

Fig. 6. The application of rule r1

In the first time, we verify if the output graph ofr1 is exempt

from Z-paths. For this, we implement the SPO ruler2that we

present in Figure 7. Its application implies the addition of a z-labeled edge between the two checkpoints (nodes) involved in a Z-path. An illustrative example is given in Figure 7. The

application of r2 to graph G2 shows the existence of two

Z-paths: the first is fromC2

1 toC33 (C21 d −→ C4 2,C23 c −→ C4 2, C3 2 d −→ C3

3) and the second is from C33 toC24 (C33

d −→ C3 1, C2 1 c −→ C3 1, C12 d −→ C4 2). 86)<0 $2 86)<0 $3 Q,=)I>=>C >CE> P<<=-+.E%2 /,=)I>=>C >CE> C,=)I>=>C >CE> C 1 1 _C 3 1 C 2 2 C 3 3 C 4 2 C 4 3 C 1 3 C 3 2 C 2 1 C 2 3 C 1 2 C 1 1 _C 3 1 C 2 2 C 3 3 C 4 2 C 4 3 C 1 3 C 3 2 C 2 1 C 2 3 C 1 2 %2L &2"L"""""""""""""""""""""""""""""""'"(/ 2")""""" #4 #1 C C / #R #S #T C C #5 #3 #2 Q

The occurence of the label “z” in the resulting graph implies that the system is not ZPF. So, it can also contain Z-cycles.

Then, we implement the rules r3, r4 and r5 to verify

such property. These rules are of type SPO, and they are sequentially executed.

Ruler3, presented in Figure 8, is to detect a long Z-path. It is

applied toG3, the output graph ofr1. Two successive Z-paths

form another Z-path (n1

z −→ n3, n2 d −→ n3, n3 z −→ n4).

We denote this model by adding a z-labeled edge between

n1 andn4. In the HBR graph that we have previously used,

there is no case of long Z-path. G4, the output graph of r3,

is equivalent to graphG3. For this, we illustrate the result of

the application of rule r3 in the general case (see Figure 8).

#,/-/=> 7? =>.EB0 !" " " " " " " #,<)B0 7? =>.EB0 !" " " " Q,=)I>=>C >CE> /,=)I>=>C >CE> C,=)I>=>C >CE> C #4 #1 Q Q C _# S #T Q Q #5 Q #3 #2 #R %3 L &3"L"""""""""""""""""""""""""""""""""'"(3")""""" Q,=)I>=>C >CE> J )?B>6 )<<=-+.E %3K

Rule r4 (presented in Figure 9) is used to detect Z-cycles of

length two. Its application implies replacing the node involved

in such a Z-cycle by a zc-labeled node. In this level, it is

necessary to use the connection instructions to link the added

node and all nodes that ared-neighbours (in-neighbours if d=in

and out-neighbours otherwise) of the removed node.This rule

is applied to graph G4. Figure 9 illustrates the application of

r4. In this example, only one Z-cycle of length two exists

involvingC2 2 (C31 d −→ C2 2, C31 c −→ C2 3,C32 d −→ C2 2). P<<=-+.E%4 C 1 1 _C 3 1 C 2 2 C 3 3 C 4 2 C 4 3 C 1 3 C 3 2 C 2 1 C 2 3 C 1 2 _Q/ C 1 1 _C 3 1 C 3 3 C 4 2 C 4 3 C 1 3 C 3 2 C 2 1 C 2 3 C 1 2 86)<0 $3 _{86)<0 $}₄ Q,=)I>=>C >CE> /,=)I>=>C >CE> C,=)I>=>C >CE> / #1 #2 #3 #4 #2 #3 C" C" / C C %4L &4"L"""""""""""""""""""""""""""""""'"(4")""""" Q/

(7)

Finally, we apply rule r5 toG5 to detect long Z-cycles. As a result, a node involved in a long Z-cycle is removed and

replaced by a new zc-labeled node. In this level, it is also

necessary to use the connection instruction because of the removal and the addition of nodes. Figure 10 presents rule

r5 and illustrates its application. In the given example, a

z-cycle exists involvingC3

3 (C12 z −→ C3 3, C12 d −→ C4 2,C33 z −→ C4 2). 86)<0 $5 86)<0 $T Q,=)I>=>C >CE> P<<=-+.E%5 /,=)I>=>C >CE> C,=)I>=>C >CE> Q/ C 1 1 _C 3 1 C 3 3 C 4 2 C 4 3 C 1 3 C 3 2 C 2 1 C 2 3 C 1 2 _Q/ C 1 1 _C 3 1 C 4 2 C 4 3 C 1 3 C 3 2 C 2 1 C 2 3 C 1 2 Q/ Q #1 #2 #3 Q/ #4 #2 #3 Q Q Q %5L &5"L"""""""""""""""""""""""""""""""'"(5")"""""

By applying the set of transformation rules previously pre-sented to a causal graph, we can detect all existing Z-paths and Z-cycles, whether they are of length two or more. Then, we can decide for any checkpointing algorithm if it is ZPF on the one hand, and if it is ZCF on the other hand. The use of graph transformation approaches ensures correct results without much need of time and space. As a result, we obtain an efficient validation approach, which verifies the correctness of checkpointing algorithms in a less costly way.

V. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a validation approach for checkpointing algorithms in distributed systems. We have used graph transformation approaches to validate the correctness of such algorithms in a less costly way. Firstly, we have designed a set of transformation rules to verify in a causal graph, the existence of non desirable patterns like Z-paths and Z-cycles. The application of these rules shows all existing Z-paths and Z-cycles in such a graph regardless of its length. The cost reduction is explained by the fact that the designed rules are not executed at runtime. With the obtained results, we can validate if the algorithm is ZPF in the first time and if it is ZCF in the second.

As an on-going work, we aim to design transformation rules oriented to the Minimal Causal and Set Representation

(MCSR)2 _{tool that constructs, for a distributed computation,}

the minimal causal graph (IDR graph) at a single event level. For this, the MCSR uses the Immediate Dependency Relation (IDR) [5]. The IDR graph greatly reduces the state-space of a system. This is the reason why finding an efficient method

2_{MCSR is available at http://homepages.laas.fr/khalil/GMTE/}

to detect non desirable patterns in such graphs can be an important contribution in term of cost reduction.

REFERENCES

[1] H. Ehrig, H.J. Kreowski, and G. Rozenberg. Tutorial introduction to the algebraic approach of graph grammars based on double and single pushouts. In Graph-Grammars and Their Application to Computer

Science, volume 532 of LNCS. Springer, March 5-9 1990.

[2] J. Engelfriet and G. Rozenberg. Handbook of Graph Grammars and

Computing by Graph Transformation, chapter Node Replacement Graph

Grammars, pages 1–94. World Scientific Publishing, 1997.

[3] M. A. Hannachi, I. B. Rodriguez, K. Drira, and S. E. Pomares-Hernandez. GMTE: A tool for graph transformation and exact/inexact graph matching. In Graph-Based Representations in Pattern

Recogni-tion. 9th IAPR-TC-15 International Workshop, GbRPR 2013, Vienna,

Austria, volume 7877 of LNCS. Springer, 2013.

[4] J. M. Helary, A. Mostefaoui, R. H. B. Netzer, and M. Raynal. Communication-based prevention of useless checkpoints in distributed computations. Distributed Computing, 13:29–43, January 2000. [5] S.E. Pomares Hernandez, J. Fanchon, and K. Drira. The

Immedi-ate Dependency Relation: An Optimal Way to Ensure Causal Group

Communication. Singapore University Press and World Scientific

Publications, 2004.

[6] C. L. Kuo and Y. M. Yeh. An algorithm for detecting z-cycles in distributed computing system. In Int. Computer Symposium, pages 1124–1133. Int. Computer Symposium, December 2004.

[7] L. Lamport. Time, clocks and the ordering of events in a distributed system. ACM, 21:558–565, July 1978.

[8] Y. Luo and D. Manivannan. Fine: A fully informed and efficient communication-induced checkpointing protocol for distributed systems.

Journal of Parallel and Distributed Computing, 69:153–167, February

2009.

[9] D. Manivannan and M. Singhal. Quasi-synchronous checkpointing: Models, characterization, and classification. IEEE Transactions on Parallel and Distributed Systems, 10(7):703–713, 1999.

[10] R. H. B. Netzer and J. Xu. Necessary and sufficient conditions for consistent global snapshots. IEEE, 6:165–169, February 1995. [11] Robert H. B. Netzer and Jian Xu. Adaptive message logging for

incremental replay of message-passing programs. Technical report, Brown University, Providence, RI, USA, 1993.

[12] T. Park and H. Y. Yeom. Application controlled checkpointing co-ordination for fault-tolerant distributed computing systems. Parallel

Computing, 26:467–482, 2000.

[13] B. Randell, P. A. Lee, and P. C. Treleaven. Reliability issues in computing system design. ACM, 10(2):123–166, June 1978. [14] G. Rozenberg. Handbook of Graph Grammars and Computing by

Graph Transformation, volume 3. World Scientific Publishing, 1997. [15] A. C. Simon, S. E. Pomares Hernandez, and J. R. Pilar Cruz. A

delayed checkpoint approach for communication-induced checkpointing in autonomic computing. In Workshops on Enabling Technologies:

Infrastructure for Collaborative Enterprises, June 2013.

[16] A. C. Simon, S. E. Pomares Hernandez, J. R. Pilar Cruz, P. Gomez-Gil, and K. Drira. A scalable communication-induced checkpointing algo-rithm for distributed systems. IEICE TRANSACTIONS on Information

and Systems, E96-D(4):886–896, April 2013.

[17] J. Tsai, Y. Wang, and S. Kuo. Evaluations of domino-free communication-induced checkpointing protocols. Information

Process-ing Letters, 69:1–69, 1998.

[18] Y. M. Wang. Maximum and minimum consistent global checkpoints and their applications. In Symposium on Reliable Distributed Systems, pages 86–95, 1995.

[19] Y. M. Wang. Consistent global checkpoints that contain a given set of local checkpoints. IEEE Transactions on Computers, 46(4):456–468, April 1997.