• Aucun résultat trouvé

A Repair Mechanism for Fault-Tolerance for Tree-Structured Peer-to-Peer Systems

N/A
N/A
Protected

Academic year: 2021

Partager "A Repair Mechanism for Fault-Tolerance for Tree-Structured Peer-to-Peer Systems"

Copied!
20
0
0

Texte intégral

(1)

HAL Id: inria-00115997

https://hal.inria.fr/inria-00115997v3

Submitted on 28 Nov 2006

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

A Repair Mechanism for Fault-Tolerance for

Tree-Structured Peer-to-Peer Systems

Eddy Caron, Frédéric Desprez, Franck Petit, Cédric Tedeschi, Charles

Fourdrignier

To cite this version:

Eddy Caron, Frédéric Desprez, Franck Petit, Cédric Tedeschi, Charles Fourdrignier. A Repair

Mecha-nism for Fault-Tolerance for Tree-Structured Peer-to-Peer Systems. [Research Report] RR-6029, LIP

RR-2006-34, INRIA, LIP. 2006, pp.15. �inria-00115997v3�

(2)

a p p o r t

d e r e c h e r c h e

9

-6

3

9

9

IS

R

N

IN

R

IA

/R

R

--6

0

2

9

--F

R

+

E

N

G

Thème NUM

A Repair Mechanism for Fault-Tolerance for

Tree-Structured Peer-to-Peer Systems

Eddy Caron — Frédéric Desprez — Charles Fourdrignier — Franck Petit — Cédric Tedeschi

N° 6029

(3)
(4)

EddyCaron , Frédéri Desprez , Charles Fourdrignier, Fran k Petit, Cédri

Tedes hi

ThèmeNUM Systèmes numériques ProjetGRAAL

Rapportde re her he n°6029  O t 2006 16 pages

Abstra t:

Fa ing the limits of traditional tools of resour e management within

omputa-tional grids (related to s ale, dynami ity, et . of the platforms newly onsidered), newapproa hes, basedon peer-to-peerte hnologiesareemerging. The resour e dis- overyandinparti ular theservi edis overyis on ernedbythisevolution. Among

thesolutions, a promising one is theindexing of resour es usingtrie stru turesand more parti ularly prex trees. The major advantages of trie-stru tured approa hes isthe apabilitytosupportsear hqueriesonrangesofvalueswithalaten ygrowing

logarithmi ally in the number of nodes in the trie. Those te hniques are easy to extend to multi riteria sear hes. One drawba k of using tries is its inherent poor robustnessinadynami environment,where nodesjoinand leavethenetwork,

lead-ing to the split of the tree into a forest, whi h results inthe impossibility to route requests. Within most re ent approa hes, thefault-toleran e isa prevention me h-anism, often repli ation-based. The repli ation an be ostly in term of resour es

required. In this paper, we propose a fault-toleran e proto ol that re onne ts sub-trees a posteriori, after rashes, to have again a onne ted graph and then reorder thenodes torebuild a onsistent tree.

Key-words: Faulttoleran e, peer-to-peer, prex trees

Thistextisalsoavailableasaresear hreportoftheLaboratoiredel'Informatiquedu Paral-lélismehttp://www.ens-lyon.fr/LIP.

(5)

pair-à-pair de dé ouverte de servi es

Résumé :

Fa eauxlimitesdesoutilstraditionnelsdegestionderessour esdanslesgrillesde al ul(mauvais passageà l'é helle,nonprise en omptede ladynami itéduréseau, et .),desalternativesfondéessurleste hnologiespair-à-pairsontentraind'émerger.

La dé ouverte de ressour es eten parti ulier des servi es de al ulest tou hée par ette évolution. Parmi es solutions, il existe des appro hes prometteuses fondées sur des arbres lexi ographiques. L'intérêt de telles appro hes repose sur la

possi-bilité d'ee tuer des requêtes sur des intervalles de valeurs ainsi que la possibilité de réaliserde l'auto omplétion surles haînes de re her he en temps logarithmique en la taille de l'arbre. Ces te hniques s'étendent fa ilement à des re her hes

mul-ti ritères. Cependant la stru ture en arbre est fragile et peut é later en une forêt si l'un des n÷uds vient à quitter le réseau, rendant ainsi impossible le routage de ertaines requêtes et n'orant au lient qu'une vue partielle des servi es. Dans la

plupart de es appro hes, la toléran e aux pannes, indispensable dans les environ-nementsdynamiquesàlarge é helle, estpréventive(réaliséea priori)etsefondesur larépli ation,qui est oûteuseen termesderessour eset detemps. Dans epapier,

nous présentons un proto ole tolérant aux pannes de n÷uds, omplémentaire à la répli ation, dans les arbres lexi ographiques. Il se fonde sur la re onnexion et la réparation a posteriori d'arbresqui ont subi laperte d'unou plusieurs n÷uds.

(6)

1 Introdu tion

Theselast few years have seen thedevelopment of large s ale grids onne ting

dis-tributed resour es ( omputation resour es, storagefa ilities, omputation libraries, et .) in a seamless way. This is now an e ient alternative to super omputers to solve large problems su h ashigh energy physi s, simulation, bioinformati , et .

However, existing middlewares used ingrids require most of the time a stable and entralized infrastru ture. They usually loose their performan e on dynami and large s ale platforms without entralized management of resour es. To ope with

the hara teristi s ofthese emerging kindof platforms,ithasbeensuggested touse peer-to-peerte hnologies within omputationalgrids [8℄.

Peer-to-peer te hnologies oer algorithms allowing the sear h and retrieval of obje ts over the net (data items, les, servi es, et .). Among these te hnologies, DistributedHashTables(DHT)wereinitiallydesignedforverylarges aleplatforms,

for example to share les over the Internet. However, DHTs have several major drawba ks. Amongthem,theirdis overyme hanismusuallyworksonexa tsear hes of a given key. Some work has then been done to allow omplex requests to be submitted over DHTs or more generally in stru tured peer-to-peer systems, i.e.

systems based on request routing. Some of these works are based on tries (also alledprex trees). A trie stru ture supportsrange queries inalogarithmi time in thenumberof nodesof the trie.

Fault-toleran e isa mandatoryfeature forpeer-to-peersystems toavoidtheloss

ofdatastored onnodesandtoallowa orre trouting ofmessages. The rashofone orseveralnodesinatrie leadsto thelossofobje tsreferen es storedinthetrie and tothesplitofthetrieintoseveralsubtries,also alledaforest. Fault-toleran ewithin

stru tured peer-to-peer systems usually uses repli ation. Using su h an approa h, ea hnodeandea hlinkofthetriewouldhavetobedupli ated

k

times,

k

beingthe repli ationfa tor. Keepingsu h stru ture upis ostly, mainly intermsof resour es

used. Afterward, thepurposeistondfor thevalueof

k

theright trade-obetween the repli ation ost and the robustness of the system. In this paper, we study an alternativeto therepli ationapproa hbasedonthere onne tionofthesubtriesand thea posteriori reorderingofa onsistenttrie. Whenthetrieisdis onne ted,arst

solution onsistsinrebuilding atrie addingnodesofremaining subtries one byone. Thisnaivemethod anleadtoaprohibitive ostwhenthenumberofremainingnodes islarge(whi hisusually the aseinpeer-to-peersystems). Forexample,loosingone

node an leadto a omplete re onstru tion of thetrie. A se ond approa h onsists inre onne tingthesubtries togetthe originaltrie ba kat aminimum ost. Thisis

(7)

this kind of algorithm we des ribe inthis paper in a distributed and asyn hronous

environment. It an also be usedto omplete therepli ationpro ess.

A brief history of peer-to-peer te hnologies is provided in Se tion 2, followed by the formal des riptionof the parti ular trie stru ture we use (Se tion 3) and of the distributed system we pla e ourselves. We fo us our study on fault-toleran e

me anisms relatedto them. Then, inSe tion 4 we present the repair algorithm we designed and give its proofbeforea on lusionand futurework Se tion.

2 Related Work

With the spread of the peer-to-peer te hnologies going along with the le sharing overtheInternet,purelyde entralizedsear hsystemshaveemerged. Su htoolsrst tooktheshapeof unstru tured me hanisms,i.e.,basedon theoodingofsear h

re-quests[10,9℄. Theseme hanismsresultedinoverloadingthenetworkwhileproviding non-exhaustiveresponses. Addressingboththes alabilityandtheexhaustiveness is-sueswithin peer-to-peersystems,the distributedhashtables[13 , 14,18 ,20℄, a.k.a.,

thestru tured peer-to-peer group, arehighly s alable inthe sense that thenumber of logi al hops required to route and thelo al state grows logarithmi ally withthe number of nodes parti ipating in the system. Moreover, DHTs prevent from

loos-ing routing paths and obje ts' referen es by use of repli ation and periodi s ans. Unfortunately, DHTs present several major drawba ks (homogeneous apa ity as-sumptions, topology awareness, et .). Among them, the rigidity of the requesting

me hanism, i.e.,exa tmat honagivenkeyhindersitsuseoverrealsear hsystems.

A series of work gives the opportunity to allow exible meanings of retrieval overstru tured peer-to-peer networks. First a hievement in this way has been the

abilityto des riberesour es withsemi-stru tured language,su hXML,asdes ribed in[3℄.[19℄enhan esDHTswithtraditionaldatabaseoperations. Severalapproa hes, based on spa e lling urves, su h as Squid [15 ℄ or [17 ℄ support multi-dimensional

rangequeries.[1℄mapsone-dimensionaldataspa etod-dimensional Cartesianspa e byusing the inverse Hilbertmapping. Builton topof multiple DHTs,SWORD [11 ℄ is an information servi eaiming at dis overing omputing resour es on the grid by

answeringmulti-attribute rangequeries.

Wefo usinthisworkontrie-stru turedretrievalsolutions,alsosupportingrange queries but outperforming previous approa hes in the sense that logarithmi (or onstant ifwe assumean upperbound on thedepth of thetrie) laten yis a hieved

byparallelizingtheresolution ofthequeryintheseveralbran hesofthetrie. Prex Hash Tree (PHT) [12 ℄ builds a trie of the entire key-spa e on top of a DHT. The

(8)

purpose of this ar hite ture is to use the trie as a logi al layer allowing omplex

sear hes on top of any DHT-like network. The ar hite ture of PHT results in the multipli ationof the omplexities ofthetrie and oftheunderlying DHT.

The Skip Graphs stru ture proposed in [2 ℄ is similar to a trie but is built with theskiplistste hnology,allowingtheuseoftheir inherentfault-toleran e properties. But again, the omplexity of the number of messages generated to pro ess range

queries isin

O(m log(n))

,

m

being the numberof nodespertainedby therangeand

n

thetotal numberof nodesinthegraph.

Other approa hes propose to rely on a trie for ea h purpose, i.e., indexing the key-spa e, mapping the nodesof the trie on thenetwork, and routing therequests.

Among them, Nodewiz [4℄ assumes a set of stati reliable nodes to host the trie, whi h is unfortunately hard to ensure on peer-to-peer platforms. P-Grid [7℄ builds a trie on the whole key-spa e (i.e., the whole set of potential keys). Ea h leaf of

thistrie orrespondsto asubset ofthekey-spa e. Thefault-toleran e isa hieved by probabilisti repli ation.

As a more general onsideration, none of these approa hes address the topol-ogy/physi al lo alityawarenessissue,i.e.,noinformation about theunderlying

net-work is taken into a ount to build the logi al (overlay) network, what an raise a signi antperforman eproblem,physi allo alitybeingbrokenwhenthelogi al net-work isbuilt. Moreover, the several fault-toleran e solutions aremostly

repli ation-based, or DHT-based,also involvingheavy repli ationme hanisms.

Initiallydesignedforthepurposeofservi edis overyoverdynami omputational gridsandattemptingtosolvetheabovedrawba ksofexistingapproa hes,were ently developed a novel ar hite ture, based on a logi al Greatest Common Prex Tree

formally des ribed in Se tion 3, that is dynami ally built as obje ts (servi es, but extensibleto dataitems, les,et .) are de lared.

3 Preliminaries

Greatest Common Prex Tree. Let an ordered alphabet

A

be a nite set of letters. Denote

an orderon

A

. A non emptyword

w

over

A

is a nite sequen e ofletters

a

1

, . . . , a

i

, . . . , a

l

,

l > 0

. The on atenation of two words

u

and

v

,denoted

u ◦ v

or simply

uv

,isequal to the word

a

1

, . . . , a

i

, . . . , a

k

, b

1

, . . . , b

j

, . . . , b

l

su hthat

u = a

1

, . . . , a

i

, . . . , a

k

and

v = b

1

, . . . , b

j

, . . . , b

l

. Let

ǫ

be the empty word su h that foreveryword

w

,

wǫ = ǫw = w

. The length ofaword

w

,denotedby

|w|

,isequal to thenumberof lettersof

w



|ǫ| = 0

.

(9)

Aword

u

isaprex (respe tively,properprex)ofaword

v

ifthereexistsaword

w

su hthat

v = uw

(resp.,

v = uw

and

u 6= v

). TheGreatest CommonPrex (resp., Proper Greatest Common Prex) ofa olle tion ofwords

w

1

, w

2

, . . . , w

i

, . . .

(

i ≥ 2

), denoted

GCP (w

1

, w

2

, . . . , w

i

, . . .)

(resp.

P GCP (w

1

, w

2

, . . . , w

i

, . . .)

), is the longest prex

u

shared byall ofthem(resp., su h that

∀i ≥ 1, u 6= w

i

). A[Proper℄Greatest CommonPrexTree ([P℄GCPTree,alsoaparti ularkindoftrie)isalabeledrooted treesu hthatboth following properties aretruefor everynode ofthetree:

1. Thenodelabelisa properprexof any labelinits subtree;

2. ThenodelabelistheProperGreatestCommon Prex ofall itsson labels.

In the following we usethe word trieto designateour PGCP tree.

Distributed Lexi ographi Pla ement Table. Thedistributed system

onsid-ered in this paper onsists of a set of asyn hronous physi al nodes organized in a Distributed Hash Tables(DHT).Ea hphysi alnodemaintains oneormore nodesof thelogi alPGCP Tree. NotethataDHTisused,but it an berepla edbyany

sys-tem,distributedor not,allowing theretrievalofanynode fromanyothernode. We also onsiderthatthepotentialexistingfault-toleran eme hanismsprovided bythis layerarenotusedwithinourar hite ture. Weproposeinthispaperafault-toleran e

me hanism atthe PGCP Treelevel.

Whenonewantstoinsertanobje tlabeled

o

intothetrie,amessageisgenerated ontaining

o

,a ordingtowhi hthemessageisroutedwithinthetrieuntil rea hing thenodelabeled

v

su hthat

v

isthesmallestlabelinthetrie thatshareswith

o

the greatest ommon prex of anynode of the trie with

o

. More formally,if

L

denotes the whole set of label urrently in the trie, the set

U = {l ∈ L | GCP (l, o) = p}

where

p = max

|m|

{m = P GCP (l, o), l ∈ L)

. The label of the target node is

t =

min

|w|

{u ∈ U | u = pw}

. On e found, the target node performs the insertion. If

t 6= o

, node(s) are reated. If

o = tu

(

u 6= ǫ

), a new node labeled

o

is reated as a new son of the node labeled

t

. If

t = ou

(

u 6= ǫ

), a new node is reated as the father of the node labeled by

t

. Finally, if none of these onditions are satised, it means that

o

and

t

must be siblings but no node in the trie is labeled by their ommonprex. Thus twonodesare reated,anodelabeled

GCP (o, t)

,fatherofthe node labeledby

t

andalso fatheroftheothernewly reated node labeledby

o

. The distributed routing algorithm (that also performs the reation and the mapping of nodes) requiresa numberofhopsbounded bytwi ethedepth ofthetrie [5℄.

Physi al nodes ommuni ate bymessage passing. We assumetwo sending fun -tions. Theformer, simplyreferredto SEND,isusedbyanyphysi al nodeto senda

(10)

messagetoanothernodeasyn hronously,i.e.,withoutwaitinganya knowledgement.

Thelatter, alledSYNC-SEND,waitsforana knowledgementforea hmessagesent. We assume that ea h physi al node may rash. So, when a physi al node rashes, one or morelogi al nodes arelost.

4 Proto ol

Inthisse tion, we give adetailedexplanation ofhowtheproto olworks. Wedivide the algorithm ode in two parts. The former shows the rst phase developed with our te hnique during whi h a unique trie is re overed without onsidering any

lex-i ographi property. During these ond phase, the trie is reorganized to eventually forma distributedgreatest ommon prextree.

4.1 Trie Re overy

After a node

p

dete ts the lossof its father (

p.f ather

),it sear hes for a new father to link on. Making a traversal of the DHT, Node

p

olle ts in Variable

P N

all the addresses of ea h remaining physi al node. Colle ting the addresses in

P N

,

p

builds the set of logi al nodes stored by the physi al nodes in

P N

. Next, using a

P IF

(Propagation of Information with Feedba k) Proto ol [6 , 16 ℄,

p

omputes

T

, the set of logi al nodes in its subtrie, whi h is made of its real des endants and its temporary relinked des endants. This rst step of the re overy proto ol ends

when

p

hooses a temporary father (

p.tmpf ather

) in the subset

N \ T

. When, a node

q

is linked to a node

p

, then

p

onsiders

q

as a temporary sonstored in

p.tmpsons

. Note thatVariable

p.tmpsons

isrequired to ompute

T

using a PIF in thesubtrie of

p

. If

N \ T = ∅

(i.e.,there is nonode for whi h

p

maylink on),then

p

is onsideredasthe root of thetrie.

The above te hnique suers of a drawba k: Several nodes without father may make whi h ould be ome a bad hoi e. In parti ular, they an hoose asa tem-porary father a node belonging to the subtrie of another node being in the same

situation. By doing this in parallel, y les may appear. Our strategy is to dete t and tobreak a posteriori su h y les asfollows.

After the hoi e ofits temporary father

tf

,a node

p

sends a message HELLO withitsID(

p.id

)to

tf

. Inthe nextstep,

tf

transmitsthemessagetoitsownfather, and soon. Stepbystep, one ofthetwo following situationseventually arises:

(11)

1. Thereal rootofthetriere eivesthemessageHELLO.Inthat ase, theroot

noties

p

thatitisnot involved ina y le.

2. Themessage is re eived by afalse root,i.e., a node having alsolost its own father. the falseroot propagatesthemessage to its temporary father.

Notethat, inthe above latter ase, dueto asyn hrony of thenetwork,it ispossible

that the false root re eives the message HELLO sent by

p

before it exe uted its own re overy phase. Inthat ase, thefalse root isstill without a temporary father. ThemessageHELLO isthendelayeduntilthefalseroot hoosesitsowntemporary

father.

Therefore,themessageHELLO sentby

p

keeps ir ulatingamongitsan estors, arryingthelistoffalseroots'IDswhi hweremetduringitstraversal. Uponre eipt

of a message HELLO, if the rst item of the list arried by the message is equal to theID ofthe re eiver, thena y le isdete ted. In that ase, a leader ele tion is omputed among the IDs of theliste.g., by hoosing the smallest ID. The leader be omes the root of the subtrie, breaks its link whi h its father, and exe utes the

re overy phase again. (The other false roots involved in the y le remain on-ne tedtothe subtrierootedbytheleader.) Notethata y lemaybe reated again. However, inthe worst ase, at ea h relaun hing of there overy phase, at least one

subtrie be omes thesubtrie of one false root. Inother words, thenumber of y les is periodi ally divided byat least

2

. Therefore, the systemeventually ontains one (rooted) trieonly.

4.2 Trie Reorganization

The trie reorganization is initiated on e the trie re overy is done. Ea h node

p

having a temporary son

q

i.e.,

q

isa falseroot withits subtrieinitiatesa routing me hanism losed to the original key insertion [5℄. Let us onsider the following ases:

1. Thevalue

p.val

is a prex ofthe value of

q

Figure1, Case

(i)

. In that ase,

q

(and its subtrie) ispla ed inthe subtrieof

p

following one of thefour ases showninFigure1,Cases (

a

) to(

d

).

2. Thevalue

p.val

isnot a prexof the valueof

q

. Then,

p

moves

q

toits father whi hnowhastheresponsibilityto pla e

q

.

Note that new servi es may keep inserting during the trie re onstru tion. So,

a new subtrie may have been reated at the same pla e where the false root ini-tially was. Thus, our method requires to take in a ount that any false root being

(12)

p

q

s

s

s

1

i

k

(

i

)

p.val

= pref ix(q)

and

p.val

= P GCP (s

1

, . . . , s

k

)

.

p

q

s

s

s

1

i

k

(

a

)Thereexists

s

i

su hthat

s

i

.val

= pref ix(q.val)

.

p

q

s

s

s

1

i

k

(

b

)Thereexists

s

i

su hthat

q.val

= pref ix(s

i

.val)

.

p

q

newson

s

s

s

1

k

i

(

c

)Thereexists

s

i

su hthat

P GCP

(q.val, s

i

.val) > p.val

.

p

s

s

s

q=s

1

i

k

k+1

(

d

)

p.val

= pref ix(q.val)

.

(13)

pla ed in the trie an meet a node having the same value. In that ase, the two

tries must be merged. That is the aim of the merging proto ol, initiated by the sending of a message MERGE. Upon re eipt of this message, a node

p

exe utes Pro edure

Gluing(q)

, whi h moves the sons of

q

to

p

before withdrawing

q

from thetrie (in luding the sons of

q

's father). Then, if ne essary,

p

restarts re ursively mergingand pla ementsamong itssons,inorderto merge both subtries eventually.

4.3 Corre tness Proof

Inthissubse tion,wedis ussthe orre tnessofour proto ol. Inorderto dothis,we rst need to make the realisti assumption that under the onsidered ontext, the rashfrequen yislowenough tomakethetriefullybuiltsometime. (Intheopposite

way, the trie ould never be built and unusable most of the time. More generally it is impossible to say anything about termination otherwise.) In other words, we

fairly assume that no rash o urs after a rash until the trie is fullybuilt, i.e., no two onse utive rashes interfereea h other, atone given time.

Assuption 1 If a node rashes at time

t

,then for every

t

> t

, no rash o urs.

Lemma 2 UnderAssumption1,there overyproto ol(Algorithm1)terminates,and whenthis o urs, the system ontainsone trieonly.

Proof. The validation mainly onsists inshowing that the proto ol terminates andthat the reorganizationof thetrie iseventually initiated (by sendinga message

NOCYCLE).

Assume by ontradi tion that under Assumption 1, no node eventually sent a

messageNOCYCLE.So,neitherLine

1.35

norLine

1.37

inAlgorithm1isexe uted. Note that in the rst ase (Line

1.35

), the node be omes the real node after the rash ofits father. So,inboth ases,this means thatNOCYCLEneverrea hesthe

real root of the trie. The height of the trie being nite, this means that every Message HELLO traverses y les only. When a message HELLO is re eived by its initiator, the y leis broken by the node whi h is ele ted among thefalse roots parti ipatinginthe y leLines

1.16

to

1.21

. Therefore, y lesare reatedinnitely often. Let

C

be the number of reated y les. In theworst ase, a y le ismade of atleasttwonodes. So,

C

isinitiallyboundedby

F/2

,where

F

isthenumberoffalse root reated by the rash. When a y le is broken, at most one leader is ele ted.

So,at most

C/2

leadersare ableto linkanother node again. Inthenext phase,the number of y les is less than or equal to

C/2

. Sin e under Assumption 1, y les

(14)

Algorithm 1 Re overy Proto ol forea h node

p

1

.

01

uponre eiptof

<

Dis onne tedfrom Father

>

do

1

.

02

P N :=

Physi alNodeSetintheDHT ( olle tedbyaDHTtraversal);

1

.

03

N :=

Logi alNodeSetin

P N

( olle ted bypollingthenodesin

P N

);

1

.

04

T :=

Logi alNodeSetinmysubtrie( olle ted usingaPIFwave)

1

.

05

using

p.sons ∪ p.tmpsons

;

1

.

06

if

p.tmpf ather 6=⊥

thensend

<

DISCONNECT

>

to

p.tmpf ather

;

1

.

07

if

N \ T = ∅

1

.

08

then //Iamtheroot

1

.

09

p.f ather :=⊥

;

p.tmpf ather :=⊥

;

1

.

10

else

p.tmpf ather :=

random hoi eamong

N \ T

;

1

.

11

send-syn

<

LINK

>

to

p.tmpf ather

;

1

.

12

send

<

HELLO,

p.id>

to

p.tmpf ather

;

1

.

13

endif

1

.

14

uponre eiptof

<

HELLO,

list>

from

q

do

1

.

15

if

F irst(list) = p.id

1

.

16

then //A y leisdete ted

1

.

17

leader := LeaderElection(list)

;

1

.

18

if

p = leader

1

.

19

then Exe utesuponre eipt of

<

Dis onne tfromFather

>

do,

1

.

20

ex ept

P N

and

N

;

1

.

21

endif

1

.

22

elseif

p.F ather 6=⊥

1

.

23

then send

<

HELLO,

list>

to

p.f ather

;

1

.

24

elseif

p.tmpf ather 6=⊥

1

.

25

then

list := list + p.id

;

1

.

26

send

<

HELLO,

list>

to

p.tmpf ather

1

.

27

elseif

p.f ather =⊥

1

.

28

then //Both

f ather

and

tmpf ather

areunknown,i.e.,

1

.

29

I amafalserootwhi hisstill notlinked

1

.

30

Exe utesuponre eiptof

<

Dis onne tfrom Father

>

do

1

.

31

ifitisstillnotworking;

1

.

32

if

tmpf ather 6=⊥

1

.

33

then

list := list + p.id

;

1

.

34

send

<

HELLO,

list>

to

p.tmpf ather

;

1

.

35

else send

<

NOCYCLE

>

to

F irst(list)

;

1

.

36

else //I amtherealroot,sothereisno y le.

1

.

37

send

<

NOCYCLE

>

to

F irst(list)

;

1

.

38

endif

1

.

39

uponre eiptof

<

NOCYCLE

>

from

q

do

1

.

40

send

<

MOVE,

p>

to

p.tmpf ather

;

1

.

41

send-syn

<

UNLINK

>

to

p.tmpf ather

;

1

.

42

p.tmpf ather :=⊥

;

(15)

Algorithm 2 ReorganizationProto olfor ea hnode

p

1

.

01

uponre eiptof

<

MOVE,

f s>

from

q

do

1

.

02

if

f s.val = p.val

1

.

03

then //Isendtomyselfthatafusionisneeded.

1

.

04

send

<

MERGE,

f s>

to

p

1

.

05

elseif

p.val = pref ix(f s.val)

1

.

06

then if

∃s ∈ p.sons| s.val = pref ix(f s.val)

1

.

07

then //

f s

isin thesubtrieof

s

,Case(

a

)inFigure 1

1

.

08

send

<

MOVE,

f s>

to

s

;

1

.

09

elseif

∃s ∈ p.sons| f s.val = pref ix(s.val)

1

.

10

then //

s

isinthesubtrieof

f s

,Case(

b

)inFigure1

1

.

11

p.sons := p.sons ∪ {f s}

;

p.sons := p.sons \ {s}

;

1

.

12

send

<

MOVE,

s>

to

f s

;

1

.

13

elseif

∃s ∈ p.sons | p.val < P GCP (s.val, f s.val)

1

.

14

then //

f s

and

s

haveaPGCPwhi hisgreaterthan

p.val

1

.

15

//Case(

c

)inFigure1

1

.

16

N ewnode(P GCP (f s.val, s.val), s, f s)

;

p.sons := p.sons \ {s}

;

1

.

17

else //

f s

isoneofmysons,Case(

d

)inFigure1

1

.

18

p.sons := p.sons ∪ {f s}

;

1

.

19

endif

1

.

20

else if

p.f ather 6=⊥

1

.

21

then send

<

MOVE,

f s>

to

p.f ather

1

.

22

else if

f s.val = pref ix(p.val)

1

.

23

then //Iamin thesubtrieof

f s

1

.

24

send

<

MOVE,

p>

to

f s

;

1

.

25

else //

p

and

f s

arebrothers

1

.

26

p.sons := p.sons ∪ N ewnode(P GCP (f s.val, f.val), f s, p)

;

1

.

27

endif

1

.

28

endif

1

.

29

endif

2

.

01

uponre eiptof

<

MERGE,

f s>

from

q

do

2

.

02

Gluing(q)

;

2

.

03

Sortingof

p.sons

inthelexi ographi orderin Table

t

s

;

2

.

04

for

i = 0

to

t

s

.length()

do

2

.

05

if

t

s

[i].val = t

s

[i + 1].val

2

.

06

then send

<

MERGE,

t

s

[i + 1]>

to

t

s

[i]

;

2

.

07

i := i + 1

;

2

.

08

elseif

t

s

[i].val = pref ix(t

s

[i + 1].val)

2

.

09

then send

<

MOVE,

t

s

[i + 1]>

to

t

s

[i]

;

2

.

10

p.sons := p.sons \ {t

s

[i + 1]}

;

2

.

11

i := i + 1

2

.

12

elseif

p.val < P GCP (t

s

[i].val, t

s

[i + 1].val)

2

.

13

then

p.sons := p.sons ∪ N ewnode(P GCP (t

s

[i].val, t

s

[i + 1].val),

2

.

14

t

s

[i], t

s

[i + 1])

;

2

.

15

p.sons; = p.sons \ {t

s

[i], t

s

[i + 1]}

;

2

.

16

i := i + 1

;

2

.

17

endif

2

18

(16)

maybe reatedonlywhenfalserootsarelinkedto othernodes(exe uting Lines

1.10

and

1.11

),

C

never grows and iseventually equal to

0

. This ontradi ts that y les

are reated innitelyoften.

2

Wenow onsider the phaseoftrie reorganization showninAlgorithm2.

Lemma 3 Under Assumption 1 and assuming that the system ontains one trie only,the reorganizationproto ol (Algorithm2)terminates, andwhenthis o urs,the trieis a

P GCP

tree.

Proof. Clearly,ea h trie of theforest following the rash of a node is a

P GCP

tree. So,its remains to show that exe uting Algorithm2, thewhole trie eventually satisesthe onditionto bea

P GCP

tree.

From the algorithm, it iseasy to observe that, intheabsen e ofmerging, there areonlytwo ases to onsiderdependingon thevalueofNode

p

anditsfalse son

f s

:

1. Thevalueof

p

is aprex of

f s

's valueLine

1.05

. In that ase, following the four ases des ribed in Figure 1,

f s

is eventually pla ed at theright pla e in the subtrie of

p

refer to Lines

1.06

to

1.19

. The resulting trie is a

P GCP

tree.

2. Thevalueof

p

is not aprex of

f s

. Again, therearetwo ases to onsider: (a) Node

p

hasno father (

p.f ather =⊥

)Line

1.22

to

1.28

. In that ase, if

f s.val

is a prex of

p

, then

p

(and its subtrie) be omes the node to be pla ed in

f s

Line

1.24

. Otherwise,

p

and

f s

be ome thetwo sons of a newroot node

q

su h that

q.val = P GCP (p, f s)

Line

1.26

. The trie is then learly a

P GCP

tree.

(b) Node

p

hasafather. Then,

f s

ismovedto thefatherof

p

Line

1.21

. By indu tionoftheabove dis ussion,either

f s

eventuallymovesonanode

q

su hthat

q.val = pref ix(f s.val)

or

f s

eventuallyrea hestheroot ofthe trie. Theformer ase isequivalent to Case 1,thelatter to Case2a.

If

p

and

f s

merge,thentherearefour asesto onsiderafter

p

and

f s

gluedtogether into

p

:

1. Thereexistsapairofsons

s

i

,

s

j

of

p

su hthat

s

i

.val

isaprexof

s

j

.val

. Then,

s

j

is moved toward

s

i

Lines

2.08

to

2.11

. This ase is similar to the above Case 1(Cases (

a

) or (

b

) inFigure1).

(17)

2. There exists a pair of sons

s

i

,

s

j

of

p

su h that

P GCP (s

i

, s

j

) > p.val

. Then,

s

i

and

s

j

be ome the two sons of a new son

q

of

p

su h that

q.val =

P GCP (p, f s)

Lines

2.12

to

2.16

. This aseisalsosimilartotheaboveCase1 (Case(

c

) inFigure1).

3. There existsa pair of sons

s

i

,

s

j

of

p

su h that

s

i

.val = s

j

.val

. This ase is solvedbyinitiatingare ursivemergingbetween

s

i

and

s

j

Lines

2.05

to

2.07

. This aseis solved byindu tion on

s

i

and

s

j

.

4. Thereexistsnopairofsons

s

i

,

s

j

of

p

satisfyingeither Case1,2,or3. Inthat ase, the subtrieof

p

learly satisesthe properties ofa

P GCP

tree.

2

From Lemmas2 and 3follows:

Theorem 1 Under Assumption 1, Algorithm 1 and Algorithm 2 provide a

P GCP

tree re onstru tion after the rash of a physi al node.

5 Con lusion and Future Work

Inthis paper,we have presented afault-tolerant proto ol in aseof node rashes in aProperCommon GreatestPrex treesear h system. Thisproto ol anbe oupled

witharepli ationstrategytolowerthe ostsrelatedto highrepli ationfa tors. This proto olallowsthere onne tionandrepairofsubtriesafterthe rashofoneormore nodes. This algorithm guarantees to re over a onsistent PGCP tree after a nite

timeandthus to avoidpartially repli ation.

Our futurework will onsist in onne tingthe two me hanisms (repli ation and repair) in order to minimize the ost of fault-toleran e on dynami platforms. We willalso develop andvalidateexperimentally theme hanisms exposedinthis paper on the Grid'5000 platform of the fren h ministry of resear h. The aim of su h

experimentation willbetoseetheperforman eoftherepairalgorithm andtoseeits apa ityto answer lients' requests fa ing dierent levels of dynami ity. Moreover, we willbeable to seestartingfrom whi h levelof dynami itytherepair me hanism

isnomoree ient alone, andthen howwe anprogressivelyinje t somerepli ation asthedynami itylevelin reases.

(18)

Referen es

[1℄ A.AndrzejakandZ.Xu. S alable,E ientRangeQueriesforGridInformation Servi es. InPeer-to-Peer Computing,pages 3340,2002.

[2℄ J. Aspnesand G. Shah. SkipGraphs. InFourteenth Annual ACM-SIAM Sym-posiumon Dis rete Algorithms,pages 384393, January 2003.

[3℄ M. Balazinska,H. Balakrishnan,and D. Karger. INS/Twine: A S alable Peer-to-PeerAr hite ture forIntentional Resour eDis overy. InPro eedings of Per-vasive 2002, 2002.

[4℄ S.Basu, S.Banerjee,P.Sharma, andS.Lee. NodeWiz: Peer-to-Peer Resour e Dis overyforGrids. In5thInternationalWorkshop on Global andPeer-to-Peer Computing (GP2PC) in onjun tion withCCGrid, May2005,2005.

[5℄ E. Caron, F. Desprez, and C. Tedes hi. A dynami prex tree for the servi e dis overywithinlarges alegrids.InIEEE,editor,TheSixthIEEEInternational

Conferen e on Peer-to-Peer Computing, P2P2006,Cambridge,UK.,September 6-8 2006.

[6℄ E.J.H.Chang. E hoAlgorithms: DepthParallelOperationsonGeneralGraphs.

IEEE Trans. on Software Engineering, SE-8:391401, 1982.

[7℄ A. Datta, M. Hauswirth, R.John, R.S hmidt, and K. Aberer. Range Queries

in Trie-Stru tured Overlays. In The Fifth IEEE International Conferen e on Peer-to-Peer Computing, 2005.

[8℄ I. Foster and A.Iamnit hi. On Death,Taxes, and theConvergen e of

Peer-to-Peerand GridComputing. In IPTPS'03,pages118128, 2003.

[9℄ Gnutella. http://www.gnutella. om.

[10℄ KaZaA 2005.The KaZaAWeb Site. http://www.kazaa. om.

[11℄ D. Oppenheimer, J. Albre ht, D. Patterson, and A. Vahdat. Distributed Resour e Dis overy on PlanetLab with SWORD. In Pro eedings of the

ACM/USENIX Workshopon Real, LargeDistributed Systems(WORLDS), De- ember2004.

(19)

[12℄ S.Ramabhadran,S.Ratnasamy,J.M. Hellerstein,andS.Shenker. PrexHash

TreeAnindexing DataStru tureoverDistributedHash Tables. InPro eedings ofthe23rdACMSymposiumonPrin iplesofDistributedComputing,St.John's, Newfoundland, Canada, July 2004.

[13℄ S. Ratnasamy, P. Fran is, M. Handley, R. Karp, and S. Shenker. A S alable Content-Adressable Network. InACM SIGCOMM,2001.

[14℄ A. Rowstron and P. Drus hel. Pastry: S alable, Distributed Obje t Lo ation andRoutingforLarge-S alePeer-To-PeerSystems. InInternationalConferen e

on Distributed SystemsPlatforms (Middleware),November2001.

[15℄ C. S hmidt and M. Parashar. Enabling Flexible Queries with Guarantees in P2P Systems. IEEE Internet Computing,8(3):1926, 2004.

[16℄ A. Segall. Distributed Network Proto ols. IEEE Transa tions on Information Theory,IT-29:2335,1983.

[17℄ Y. Shu, B.-C. Ooi, K.-L. Tan, and A. Zhou. Supporting Multi-Dimensional RangeQueriesinPeer-to-PeerSystems. InPeer-to-Peer Computing,pages173 180, 2005.

[18℄ I. Stoi a, R. Morris, D. Karger, M. Kaashoek, and H. Balakrishnan. Chord: A S alable Peer-to-Peer Lookup servi e for Internet Appli ations. In ACM

SIGCOMM,pages 149160, 2001.

[19℄ P. Triantallou and T. Pitoura. Towards a Unifying Framework for Complex Query Pro essing over Stru tured Peer-to-Peer Data Networks. In DBISP2P, 2003.

[20℄ B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D. Ku-biatowi z. Tapestry: A Resilient Global-s ale Overlayfor Servi e Deployment.

(20)

4, rue Jacques Monod - 91893 ORSAY Cedex (France)

Unité de recherche INRIA Lorraine : LORIA, Technopôle de Nancy-Brabois - Campus scientifique

615, rue du Jardin Botanique - BP 101 - 54602 Villers-lès-Nancy Cedex (France)

Unité de recherche INRIA Rennes : IRISA, Campus universitaire de Beaulieu - 35042 Rennes Cedex (France)

Unité de recherche INRIA Rocquencourt : Domaine de Voluceau - Rocquencourt - BP 105 - 78153 Le Chesnay Cedex (France)

Unité de recherche INRIA Sophia Antipolis : 2004, route des Lucioles - BP 93 - 06902 Sophia Antipolis Cedex (France)

Figure

Figure 1: A false root q is linked to a node p suh that p.val = pref ix(q.val) .

Références

Documents relatifs

Each peer in the network is assigned a pair of VBI-Tree nodes: a routing node and a data node, in which the data node is the left adjacent node of the routing node (in the in-

In that context, two simple experiments were conducted in a real infrastructure using a parallel version of the Peer-to-Peer Evolutionary Algorithm. The first experiment shows that

This section details our approach of reactive peer-to-peer distributed [email protected]. It begins with an overview of our proposition and defines important terms used in the rest

This section analyzes three precedents of peer to peer: fragmented, distributed, or collective property among unidentified, evolving group of peers in legal history – the bundle

Our problematic is to make possible such a service discovery in platforms emerging today which are large (gathering a high number of geographically distributed nodes), heterogeneous

Definition 2 (PGCP Tree) A Proper Greatest Common Prefix Tree is a labeled rooted tree such that each node label is the Proper Greatest Common Prefix of every pair of its

We observed the performance of the storage load balancing method subject to three fac- tors : the storage utilization ratio, the peer dynamicity, and the storage load transfer

While the impact of volatility, the number of peers and the amount of input data on the code coverage are no- ticeable, the only variation of these parameters is not suf- ficient