HAL Id: inria-00115997
https://hal.inria.fr/inria-00115997v3
Submitted on 28 Nov 2006
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
sci-entific research documents, whether they are
pub-lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
A Repair Mechanism for Fault-Tolerance for
Tree-Structured Peer-to-Peer Systems
Eddy Caron, Frédéric Desprez, Franck Petit, Cédric Tedeschi, Charles
Fourdrignier
To cite this version:
Eddy Caron, Frédéric Desprez, Franck Petit, Cédric Tedeschi, Charles Fourdrignier. A Repair
Mecha-nism for Fault-Tolerance for Tree-Structured Peer-to-Peer Systems. [Research Report] RR-6029, LIP
RR-2006-34, INRIA, LIP. 2006, pp.15. �inria-00115997v3�
a p p o r t
d e r e c h e r c h e
9
-6
3
9
9
IS
R
N
IN
R
IA
/R
R
--6
0
2
9
--F
R
+
E
N
G
Thème NUM
A Repair Mechanism for Fault-Tolerance for
Tree-Structured Peer-to-Peer Systems
Eddy Caron — Frédéric Desprez — Charles Fourdrignier — Franck Petit — Cédric Tedeschi
N° 6029
EddyCaron , Frédéri Desprez , Charles Fourdrignier, Fran k Petit, Cédri
Tedes hi
ThèmeNUM Systèmes numériques ProjetGRAAL
Rapportde re her he n°6029 O t 2006 16 pages
Abstra t:
Fa ing the limits of traditional tools of resour e management within
omputa-tional grids (related to s ale, dynami ity, et . of the platforms newly onsidered), newapproa hes, basedon peer-to-peerte hnologiesareemerging. The resour e dis- overyandinparti ular theservi edis overyis on ernedbythisevolution. Among
thesolutions, a promising one is theindexing of resour es usingtrie stru turesand more parti ularly prex trees. The major advantages of trie-stru tured approa hes isthe apabilitytosupportsear hqueriesonrangesofvalueswithalaten ygrowing
logarithmi ally in the number of nodes in the trie. Those te hniques are easy to extend to multi riteria sear hes. One drawba k of using tries is its inherent poor robustnessinadynami environment,where nodesjoinand leavethenetwork,
lead-ing to the split of the tree into a forest, whi h results inthe impossibility to route requests. Within most re ent approa hes, thefault-toleran e isa prevention me h-anism, often repli ation-based. The repli ation an be ostly in term of resour es
required. In this paper, we propose a fault-toleran e proto ol that re onne ts sub-trees a posteriori, after rashes, to have again a onne ted graph and then reorder thenodes torebuild a onsistent tree.
Key-words: Faulttoleran e, peer-to-peer, prex trees
Thistextisalsoavailableasaresear hreportoftheLaboratoiredel'Informatiquedu Paral-lélismehttp://www.ens-lyon.fr/LIP.
pair-à-pair de dé ouverte de servi es
Résumé :
Fa eauxlimitesdesoutilstraditionnelsdegestionderessour esdanslesgrillesde al ul(mauvais passageà l'é helle,nonprise en omptede ladynami itéduréseau, et .),desalternativesfondéessurleste hnologiespair-à-pairsontentraind'émerger.
La dé ouverte de ressour es eten parti ulier des servi es de al ulest tou hée par ette évolution. Parmi es solutions, il existe des appro hes prometteuses fondées sur des arbres lexi ographiques. L'intérêt de telles appro hes repose sur la
possi-bilité d'ee tuer des requêtes sur des intervalles de valeurs ainsi que la possibilité de réaliserde l'auto omplétion surles haînes de re her he en temps logarithmique en la taille de l'arbre. Ces te hniques s'étendent fa ilement à des re her hes
mul-ti ritères. Cependant la stru ture en arbre est fragile et peut é later en une forêt si l'un des n÷uds vient à quitter le réseau, rendant ainsi impossible le routage de ertaines requêtes et n'orant au lient qu'une vue partielle des servi es. Dans la
plupart de es appro hes, la toléran e aux pannes, indispensable dans les environ-nementsdynamiquesàlarge é helle, estpréventive(réaliséea priori)etsefondesur larépli ation,qui est oûteuseen termesderessour eset detemps. Dans epapier,
nous présentons un proto ole tolérant aux pannes de n÷uds, omplémentaire à la répli ation, dans les arbres lexi ographiques. Il se fonde sur la re onnexion et la réparation a posteriori d'arbresqui ont subi laperte d'unou plusieurs n÷uds.
1 Introdu tion
Theselast few years have seen thedevelopment of large s ale grids onne ting
dis-tributed resour es ( omputation resour es, storagefa ilities, omputation libraries, et .) in a seamless way. This is now an e ient alternative to super omputers to solve large problems su h ashigh energy physi s, simulation, bioinformati , et .
However, existing middlewares used ingrids require most of the time a stable and entralized infrastru ture. They usually loose their performan e on dynami and large s ale platforms without entralized management of resour es. To ope with
the hara teristi s ofthese emerging kindof platforms,ithasbeensuggested touse peer-to-peerte hnologies within omputationalgrids [8℄.
Peer-to-peer te hnologies oer algorithms allowing the sear h and retrieval of obje ts over the net (data items, les, servi es, et .). Among these te hnologies, DistributedHashTables(DHT)wereinitiallydesignedforverylarges aleplatforms,
for example to share les over the Internet. However, DHTs have several major drawba ks. Amongthem,theirdis overyme hanismusuallyworksonexa tsear hes of a given key. Some work has then been done to allow omplex requests to be submitted over DHTs or more generally in stru tured peer-to-peer systems, i.e.
systems based on request routing. Some of these works are based on tries (also alledprex trees). A trie stru ture supportsrange queries inalogarithmi time in thenumberof nodesof the trie.
Fault-toleran e isa mandatoryfeature forpeer-to-peersystems toavoidtheloss
ofdatastored onnodesandtoallowa orre trouting ofmessages. The rashofone orseveralnodesinatrie leadsto thelossofobje tsreferen es storedinthetrie and tothesplitofthetrieintoseveralsubtries,also alledaforest. Fault-toleran ewithin
stru tured peer-to-peer systems usually uses repli ation. Using su h an approa h, ea hnodeandea hlinkofthetriewouldhavetobedupli ated
k
times,k
beingthe repli ationfa tor. Keepingsu h stru ture upis ostly, mainly intermsof resour esused. Afterward, thepurposeistondfor thevalueof
k
theright trade-obetween the repli ation ost and the robustness of the system. In this paper, we study an alternativeto therepli ationapproa hbasedonthere onne tionofthesubtriesand thea posteriori reorderingofa onsistenttrie. Whenthetrieisdis onne ted,arstsolution onsistsinrebuilding atrie addingnodesofremaining subtries one byone. Thisnaivemethod anleadtoaprohibitive ostwhenthenumberofremainingnodes islarge(whi hisusually the aseinpeer-to-peersystems). Forexample,loosingone
node an leadto a omplete re onstru tion of thetrie. A se ond approa h onsists inre onne tingthesubtries togetthe originaltrie ba kat aminimum ost. Thisis
this kind of algorithm we des ribe inthis paper in a distributed and asyn hronous
environment. It an also be usedto omplete therepli ationpro ess.
A brief history of peer-to-peer te hnologies is provided in Se tion 2, followed by the formal des riptionof the parti ular trie stru ture we use (Se tion 3) and of the distributed system we pla e ourselves. We fo us our study on fault-toleran e
me anisms relatedto them. Then, inSe tion 4 we present the repair algorithm we designed and give its proofbeforea on lusionand futurework Se tion.
2 Related Work
With the spread of the peer-to-peer te hnologies going along with the le sharing overtheInternet,purelyde entralizedsear hsystemshaveemerged. Su htoolsrst tooktheshapeof unstru tured me hanisms,i.e.,basedon theoodingofsear h
re-quests[10,9℄. Theseme hanismsresultedinoverloadingthenetworkwhileproviding non-exhaustiveresponses. Addressingboththes alabilityandtheexhaustiveness is-sueswithin peer-to-peersystems,the distributedhashtables[13 , 14,18 ,20℄, a.k.a.,
thestru tured peer-to-peer group, arehighly s alable inthe sense that thenumber of logi al hops required to route and thelo al state grows logarithmi ally withthe number of nodes parti ipating in the system. Moreover, DHTs prevent from
loos-ing routing paths and obje ts' referen es by use of repli ation and periodi s ans. Unfortunately, DHTs present several major drawba ks (homogeneous apa ity as-sumptions, topology awareness, et .). Among them, the rigidity of the requesting
me hanism, i.e.,exa tmat honagivenkeyhindersitsuseoverrealsear hsystems.
A series of work gives the opportunity to allow exible meanings of retrieval overstru tured peer-to-peer networks. First a hievement in this way has been the
abilityto des riberesour es withsemi-stru tured language,su hXML,asdes ribed in[3℄.[19℄enhan esDHTswithtraditionaldatabaseoperations. Severalapproa hes, based on spa e lling urves, su h as Squid [15 ℄ or [17 ℄ support multi-dimensional
rangequeries.[1℄mapsone-dimensionaldataspa etod-dimensional Cartesianspa e byusing the inverse Hilbertmapping. Builton topof multiple DHTs,SWORD [11 ℄ is an information servi eaiming at dis overing omputing resour es on the grid by
answeringmulti-attribute rangequeries.
Wefo usinthisworkontrie-stru turedretrievalsolutions,alsosupportingrange queries but outperforming previous approa hes in the sense that logarithmi (or onstant ifwe assumean upperbound on thedepth of thetrie) laten yis a hieved
byparallelizingtheresolution ofthequeryintheseveralbran hesofthetrie. Prex Hash Tree (PHT) [12 ℄ builds a trie of the entire key-spa e on top of a DHT. The
purpose of this ar hite ture is to use the trie as a logi al layer allowing omplex
sear hes on top of any DHT-like network. The ar hite ture of PHT results in the multipli ationof the omplexities ofthetrie and oftheunderlying DHT.
The Skip Graphs stru ture proposed in [2 ℄ is similar to a trie but is built with theskiplistste hnology,allowingtheuseoftheir inherentfault-toleran e properties. But again, the omplexity of the number of messages generated to pro ess range
queries isin
O(m log(n))
,m
being the numberof nodespertainedby therangeandn
thetotal numberof nodesinthegraph.Other approa hes propose to rely on a trie for ea h purpose, i.e., indexing the key-spa e, mapping the nodesof the trie on thenetwork, and routing therequests.
Among them, Nodewiz [4℄ assumes a set of stati reliable nodes to host the trie, whi h is unfortunately hard to ensure on peer-to-peer platforms. P-Grid [7℄ builds a trie on the whole key-spa e (i.e., the whole set of potential keys). Ea h leaf of
thistrie orrespondsto asubset ofthekey-spa e. Thefault-toleran e isa hieved by probabilisti repli ation.
As a more general onsideration, none of these approa hes address the topol-ogy/physi al lo alityawarenessissue,i.e.,noinformation about theunderlying
net-work is taken into a ount to build the logi al (overlay) network, what an raise a signi antperforman eproblem,physi allo alitybeingbrokenwhenthelogi al net-work isbuilt. Moreover, the several fault-toleran e solutions aremostly
repli ation-based, or DHT-based,also involvingheavy repli ationme hanisms.
Initiallydesignedforthepurposeofservi edis overyoverdynami omputational gridsandattemptingtosolvetheabovedrawba ksofexistingapproa hes,were ently developed a novel ar hite ture, based on a logi al Greatest Common Prex Tree
formally des ribed in Se tion 3, that is dynami ally built as obje ts (servi es, but extensibleto dataitems, les,et .) are de lared.
3 Preliminaries
Greatest Common Prex Tree. Let an ordered alphabet
A
be a nite set of letters. Denote≺
an orderonA
. A non emptywordw
overA
is a nite sequen e oflettersa
1
, . . . , a
i
, . . . , a
l
,l > 0
. The on atenation of two wordsu
andv
,denotedu ◦ v
or simplyuv
,isequal to the worda
1
, . . . , a
i
, . . . , a
k
, b
1
, . . . , b
j
, . . . , b
l
su hthatu = a
1
, . . . , a
i
, . . . , a
k
andv = b
1
, . . . , b
j
, . . . , b
l
. Letǫ
be the empty word su h that foreverywordw
,wǫ = ǫw = w
. The length ofawordw
,denotedby|w|
,isequal to thenumberof lettersofw
|ǫ| = 0
.Aword
u
isaprex (respe tively,properprex)ofawordv
ifthereexistsawordw
su hthatv = uw
(resp.,v = uw
andu 6= v
). TheGreatest CommonPrex (resp., Proper Greatest Common Prex) ofa olle tion ofwordsw
1
, w
2
, . . . , w
i
, . . .
(i ≥ 2
), denotedGCP (w
1
, w
2
, . . . , w
i
, . . .)
(resp.P GCP (w
1
, w
2
, . . . , w
i
, . . .)
), is the longest prexu
shared byall ofthem(resp., su h that∀i ≥ 1, u 6= w
i
). A[Proper℄Greatest CommonPrexTree ([P℄GCPTree,alsoaparti ularkindoftrie)isalabeledrooted treesu hthatboth following properties aretruefor everynode ofthetree:1. Thenodelabelisa properprexof any labelinits subtree;
2. ThenodelabelistheProperGreatestCommon Prex ofall itsson labels.
In the following we usethe word trieto designateour PGCP tree.
Distributed Lexi ographi Pla ement Table. Thedistributed system
onsid-ered in this paper onsists of a set of asyn hronous physi al nodes organized in a Distributed Hash Tables(DHT).Ea hphysi alnodemaintains oneormore nodesof thelogi alPGCP Tree. NotethataDHTisused,but it an berepla edbyany
sys-tem,distributedor not,allowing theretrievalofanynode fromanyothernode. We also onsiderthatthepotentialexistingfault-toleran eme hanismsprovided bythis layerarenotusedwithinourar hite ture. Weproposeinthispaperafault-toleran e
me hanism atthe PGCP Treelevel.
Whenonewantstoinsertanobje tlabeled
o
intothetrie,amessageisgenerated ontainingo
,a ordingtowhi hthemessageisroutedwithinthetrieuntil rea hing thenodelabeledv
su hthatv
isthesmallestlabelinthetrie thatshareswitho
the greatest ommon prex of anynode of the trie witho
. More formally,ifL
denotes the whole set of label urrently in the trie, the setU = {l ∈ L | GCP (l, o) = p}
wherep = max
|m|
{m = P GCP (l, o), l ∈ L)
. The label of the target node ist =
min
|w|
{u ∈ U | u = pw}
. On e found, the target node performs the insertion. Ift 6= o
, node(s) are reated. Ifo = tu
(u 6= ǫ
), a new node labeledo
is reated as a new son of the node labeledt
. Ift = ou
(u 6= ǫ
), a new node is reated as the father of the node labeled byt
. Finally, if none of these onditions are satised, it means thato
andt
must be siblings but no node in the trie is labeled by their ommonprex. Thus twonodesare reated,anodelabeledGCP (o, t)
,fatherofthe node labeledbyt
andalso fatheroftheothernewly reated node labeledbyo
. The distributed routing algorithm (that also performs the reation and the mapping of nodes) requiresa numberofhopsbounded bytwi ethedepth ofthetrie [5℄.Physi al nodes ommuni ate bymessage passing. We assumetwo sending fun -tions. Theformer, simplyreferredto SEND,isusedbyanyphysi al nodeto senda
messagetoanothernodeasyn hronously,i.e.,withoutwaitinganya knowledgement.
Thelatter, alledSYNC-SEND,waitsforana knowledgementforea hmessagesent. We assume that ea h physi al node may rash. So, when a physi al node rashes, one or morelogi al nodes arelost.
4 Proto ol
Inthisse tion, we give adetailedexplanation ofhowtheproto olworks. Wedivide the algorithm ode in two parts. The former shows the rst phase developed with our te hnique during whi h a unique trie is re overed without onsidering any
lex-i ographi property. During these ond phase, the trie is reorganized to eventually forma distributedgreatest ommon prextree.
4.1 Trie Re overy
After a node
p
dete ts the lossof its father (p.f ather
),it sear hes for a new father to link on. Making a traversal of the DHT, Nodep
olle ts in VariableP N
all the addresses of ea h remaining physi al node. Colle ting the addresses inP N
,p
builds the set of logi al nodes stored by the physi al nodes inP N
. Next, using aP IF
(Propagation of Information with Feedba k) Proto ol [6 , 16 ℄,p
omputesT
, the set of logi al nodes in its subtrie, whi h is made of its real des endants and its temporary relinked des endants. This rst step of the re overy proto ol endswhen
p
hooses a temporary father (p.tmpf ather
) in the subsetN \ T
. When, a nodeq
is linked to a nodep
, thenp
onsidersq
as a temporary sonstored inp.tmpsons
. Note thatVariablep.tmpsons
isrequired to omputeT
using a PIF in thesubtrie ofp
. IfN \ T = ∅
(i.e.,there is nonode for whi hp
maylink on),thenp
is onsideredasthe root of thetrie.The above te hnique suers of a drawba k: Several nodes without father may make whi h ould be ome a bad hoi e. In parti ular, they an hoose asa tem-porary father a node belonging to the subtrie of another node being in the same
situation. By doing this in parallel, y les may appear. Our strategy is to dete t and tobreak a posteriori su h y les asfollows.
After the hoi e ofits temporary father
tf
,a nodep
sends a message HELLO withitsID(p.id
)totf
. Inthe nextstep,tf
transmitsthemessagetoitsownfather, and soon. Stepbystep, one ofthetwo following situationseventually arises:1. Thereal rootofthetriere eivesthemessageHELLO.Inthat ase, theroot
noties
p
thatitisnot involved ina y le.2. Themessage is re eived by afalse root,i.e., a node having alsolost its own father. the falseroot propagatesthemessage to its temporary father.
Notethat, inthe above latter ase, dueto asyn hrony of thenetwork,it ispossible
that the false root re eives the message HELLO sent by
p
before it exe uted its own re overy phase. Inthat ase, thefalse root isstill without a temporary father. ThemessageHELLO isthendelayeduntilthefalseroot hoosesitsowntemporaryfather.
Therefore,themessageHELLO sentby
p
keeps ir ulatingamongitsan estors, arryingthelistoffalseroots'IDswhi hweremetduringitstraversal. Uponre eiptof a message HELLO, if the rst item of the list arried by the message is equal to theID ofthe re eiver, thena y le isdete ted. In that ase, a leader ele tion is omputed among the IDs of theliste.g., by hoosing the smallest ID. The leader be omes the root of the subtrie, breaks its link whi h its father, and exe utes the
re overy phase again. (The other false roots involved in the y le remain on-ne tedtothe subtrierootedbytheleader.) Notethata y lemaybe reated again. However, inthe worst ase, at ea h relaun hing of there overy phase, at least one
subtrie be omes thesubtrie of one false root. Inother words, thenumber of y les is periodi ally divided byat least
2
. Therefore, the systemeventually ontains one (rooted) trieonly.4.2 Trie Reorganization
The trie reorganization is initiated on e the trie re overy is done. Ea h node
p
having a temporary sonq
i.e.,q
isa falseroot withits subtrieinitiatesa routing me hanism losed to the original key insertion [5℄. Let us onsider the following ases:1. Thevalue
p.val
is a prex ofthe value ofq
Figure1, Case(i)
. In that ase,q
(and its subtrie) ispla ed inthe subtrieofp
following one of thefour ases showninFigure1,Cases (a
) to(d
).2. Thevalue
p.val
isnot a prexof the valueofq
. Then,p
movesq
toits father whi hnowhastheresponsibilityto pla eq
.Note that new servi es may keep inserting during the trie re onstru tion. So,
a new subtrie may have been reated at the same pla e where the false root ini-tially was. Thus, our method requires to take in a ount that any false root being
p
q
s
s
s
1
i
k
(
i
)p.val
= pref ix(q)
andp.val
= P GCP (s
1
, . . . , s
k
)
.p
q
s
s
s
1
i
k
(
a
)Thereexistss
i
su hthats
i
.val
= pref ix(q.val)
.p
q
s
s
s
1
i
k
(
b
)Thereexistss
i
su hthatq.val
= pref ix(s
i
.val)
.p
q
newson
s
s
s
1
k
i
(
c
)Thereexistss
i
su hthatP GCP
(q.val, s
i
.val) > p.val
.p
s
s
s
q=s
1
i
k
k+1
(
d
)p.val
= pref ix(q.val)
.pla ed in the trie an meet a node having the same value. In that ase, the two
tries must be merged. That is the aim of the merging proto ol, initiated by the sending of a message MERGE. Upon re eipt of this message, a node
p
exe utes Pro edureGluing(q)
, whi h moves the sons ofq
top
before withdrawingq
from thetrie (in luding the sons ofq
's father). Then, if ne essary,p
restarts re ursively mergingand pla ementsamong itssons,inorderto merge both subtries eventually.4.3 Corre tness Proof
Inthissubse tion,wedis ussthe orre tnessofour proto ol. Inorderto dothis,we rst need to make the realisti assumption that under the onsidered ontext, the rashfrequen yislowenough tomakethetriefullybuiltsometime. (Intheopposite
way, the trie ould never be built and unusable most of the time. More generally it is impossible to say anything about termination otherwise.) In other words, we
fairly assume that no rash o urs after a rash until the trie is fullybuilt, i.e., no two onse utive rashes interfereea h other, atone given time.
Assuption 1 If a node rashes at time
t
,then for everyt
′
> t
, no rash o urs.
Lemma 2 UnderAssumption1,there overyproto ol(Algorithm1)terminates,and whenthis o urs, the system ontainsone trieonly.
Proof. The validation mainly onsists inshowing that the proto ol terminates andthat the reorganizationof thetrie iseventually initiated (by sendinga message
NOCYCLE).
Assume by ontradi tion that under Assumption 1, no node eventually sent a
messageNOCYCLE.So,neitherLine
1.35
norLine1.37
inAlgorithm1isexe uted. Note that in the rst ase (Line1.35
), the node be omes the real node after the rash ofits father. So,inboth ases,this means thatNOCYCLEneverrea hesthereal root of the trie. The height of the trie being nite, this means that every Message HELLO traverses y les only. When a message HELLO is re eived by its initiator, the y leis broken by the node whi h is ele ted among thefalse roots parti ipatinginthe y leLines
1.16
to1.21
. Therefore, y lesare reatedinnitely often. LetC
be the number of reated y les. In theworst ase, a y le ismade of atleasttwonodes. So,C
isinitiallyboundedbyF/2
,whereF
isthenumberoffalse root reated by the rash. When a y le is broken, at most one leader is ele ted.So,at most
C/2
leadersare ableto linkanother node again. Inthenext phase,the number of y les is less than or equal toC/2
. Sin e under Assumption 1, y lesAlgorithm 1 Re overy Proto ol forea h node
p
1
.01
uponre eiptof<
Dis onne tedfrom Father>
do1
.02
P N :=
Physi alNodeSetintheDHT ( olle tedbyaDHTtraversal);1
.03
N :=
Logi alNodeSetinP N
( olle ted bypollingthenodesinP N
);1
.04
T :=
Logi alNodeSetinmysubtrie( olle ted usingaPIFwave)1
.05
usingp.sons ∪ p.tmpsons
;1
.06
ifp.tmpf ather 6=⊥
thensend<
DISCONNECT>
top.tmpf ather
;1
.07
ifN \ T = ∅
1
.08
then //Iamtheroot1
.09
p.f ather :=⊥
;p.tmpf ather :=⊥
;1
.10
elsep.tmpf ather :=
random hoi eamongN \ T
;1
.11
send-syn<
LINK>
top.tmpf ather
;1
.12
send<
HELLO,p.id>
top.tmpf ather
;1
.13
endif1
.14
uponre eiptof<
HELLO,list>
fromq
do1
.15
ifF irst(list) = p.id
1
.16
then //A y leisdete ted1
.17
leader := LeaderElection(list)
;1
.18
ifp = leader
1
.19
then Exe utesuponre eipt of<
Dis onne tfromFather>
do,1
.20
ex eptP N
andN
;1
.21
endif1
.22
elseifp.F ather 6=⊥
1
.23
then send<
HELLO,list>
top.f ather
;1
.24
elseifp.tmpf ather 6=⊥
1
.25
thenlist := list + p.id
;1
.26
send<
HELLO,list>
top.tmpf ather
1
.27
elseifp.f ather =⊥
1
.28
then //Bothf ather
andtmpf ather
areunknown,i.e.,1
.29
I amafalserootwhi hisstill notlinked1
.30
Exe utesuponre eiptof<
Dis onne tfrom Father>
do1
.31
ifitisstillnotworking;1
.32
iftmpf ather 6=⊥
1
.33
thenlist := list + p.id
;1
.34
send<
HELLO,list>
top.tmpf ather
;1
.35
else send<
NOCYCLE>
toF irst(list)
;1
.36
else //I amtherealroot,sothereisno y le.1
.37
send<
NOCYCLE>
toF irst(list)
;1
.38
endif1
.39
uponre eiptof<
NOCYCLE>
fromq
do1
.40
send<
MOVE,p>
top.tmpf ather
;1
.41
send-syn<
UNLINK>
top.tmpf ather
;1
.42
p.tmpf ather :=⊥
;Algorithm 2 ReorganizationProto olfor ea hnode
p
1
.01
uponre eiptof<
MOVE,f s>
fromq
do1
.02
iff s.val = p.val
1
.03
then //Isendtomyselfthatafusionisneeded.1
.04
send<
MERGE,f s>
top
1
.05
elseifp.val = pref ix(f s.val)
1
.06
then if∃s ∈ p.sons| s.val = pref ix(f s.val)
1
.07
then //f s
isin thesubtrieofs
,Case(a
)inFigure 11
.08
send<
MOVE,f s>
tos
;1
.09
elseif∃s ∈ p.sons| f s.val = pref ix(s.val)
1
.10
then //s
isinthesubtrieoff s
,Case(b
)inFigure11
.11
p.sons := p.sons ∪ {f s}
;p.sons := p.sons \ {s}
;1
.12
send<
MOVE,s>
tof s
;1
.13
elseif∃s ∈ p.sons | p.val < P GCP (s.val, f s.val)
1
.14
then //f s
ands
haveaPGCPwhi hisgreaterthanp.val
1
.15
//Case(c
)inFigure11
.16
N ewnode(P GCP (f s.val, s.val), s, f s)
;p.sons := p.sons \ {s}
;1
.17
else //f s
isoneofmysons,Case(d
)inFigure11
.18
p.sons := p.sons ∪ {f s}
;1
.19
endif1
.20
else ifp.f ather 6=⊥
1
.21
then send<
MOVE,f s>
top.f ather
1
.22
else iff s.val = pref ix(p.val)
1
.23
then //Iamin thesubtrieoff s
1
.24
send<
MOVE,p>
tof s
;1
.25
else //p
andf s
arebrothers1
.26
p.sons := p.sons ∪ N ewnode(P GCP (f s.val, f.val), f s, p)
;1
.27
endif1
.28
endif1
.29
endif2
.01
uponre eiptof<
MERGE,f s>
fromq
do2
.02
Gluing(q)
;2
.03
Sortingofp.sons
inthelexi ographi orderin Tablet
s
;2
.04
fori = 0
tot
s
.length()
do2
.05
ift
s
[i].val = t
s
[i + 1].val
2
.06
then send<
MERGE,t
s
[i + 1]>
tot
s
[i]
;2
.07
i := i + 1
;2
.08
elseift
s
[i].val = pref ix(t
s
[i + 1].val)
2
.09
then send<
MOVE,t
s
[i + 1]>
tot
s
[i]
;2
.10
p.sons := p.sons \ {t
s
[i + 1]}
;2
.11
i := i + 1
2
.12
elseifp.val < P GCP (t
s
[i].val, t
s
[i + 1].val)
2
.13
thenp.sons := p.sons ∪ N ewnode(P GCP (t
s
[i].val, t
s
[i + 1].val),
2
.14
t
s
[i], t
s
[i + 1])
;2
.15
p.sons; = p.sons \ {t
s
[i], t
s
[i + 1]}
;2
.16
i := i + 1
;2
.17
endif2
18
maybe reatedonlywhenfalserootsarelinkedto othernodes(exe uting Lines
1.10
and1.11
),C
never grows and iseventually equal to0
. This ontradi ts that y lesare reated innitelyoften.
2
Wenow onsider the phaseoftrie reorganization showninAlgorithm2.
Lemma 3 Under Assumption 1 and assuming that the system ontains one trie only,the reorganizationproto ol (Algorithm2)terminates, andwhenthis o urs,the trieis a
P GCP
tree.Proof. Clearly,ea h trie of theforest following the rash of a node is a
P GCP
tree. So,its remains to show that exe uting Algorithm2, thewhole trie eventually satisesthe onditionto beaP GCP
tree.From the algorithm, it iseasy to observe that, intheabsen e ofmerging, there areonlytwo ases to onsiderdependingon thevalueofNode
p
anditsfalse sonf s
:1. Thevalueof
p
is aprex off s
's valueLine1.05
. In that ase, following the four ases des ribed in Figure 1,f s
is eventually pla ed at theright pla e in the subtrie ofp
refer to Lines1.06
to1.19
. The resulting trie is aP GCP
tree.2. Thevalueof
p
is not aprex off s
. Again, therearetwo ases to onsider: (a) Nodep
hasno father (p.f ather =⊥
)Line1.22
to1.28
. In that ase, iff s.val
is a prex ofp
, thenp
(and its subtrie) be omes the node to be pla ed inf s
Line1.24
. Otherwise,p
andf s
be ome thetwo sons of a newroot nodeq
su h thatq.val = P GCP (p, f s)
Line1.26
. The trie is then learly aP GCP
tree.(b) Node
p
hasafather. Then,f s
ismovedto thefatherofp
Line1.21
. By indu tionoftheabove dis ussion,eitherf s
eventuallymovesonanodeq
su hthatq.val = pref ix(f s.val)
orf s
eventuallyrea hestheroot ofthe trie. Theformer ase isequivalent to Case 1,thelatter to Case2a.If
p
andf s
merge,thentherearefour asesto onsiderafterp
andf s
gluedtogether intop
:1. Thereexistsapairofsons
s
i
,s
j
ofp
su hthats
i
.val
isaprexofs
j
.val
. Then,s
j
is moved towards
i
Lines2.08
to2.11
. This ase is similar to the above Case 1(Cases (a
) or (b
) inFigure1).2. There exists a pair of sons
s
i
,s
j
ofp
su h thatP GCP (s
i
, s
j
) > p.val
. Then,s
i
ands
j
be ome the two sons of a new sonq
ofp
su h thatq.val =
P GCP (p, f s)
Lines2.12
to2.16
. This aseisalsosimilartotheaboveCase1 (Case(c
) inFigure1).3. There existsa pair of sons
s
i
,s
j
ofp
su h thats
i
.val = s
j
.val
. This ase is solvedbyinitiatingare ursivemergingbetweens
i
ands
j
Lines2.05
to2.07
. This aseis solved byindu tion ons
i
ands
j
.4. Thereexistsnopairofsons
s
i
,s
j
ofp
satisfyingeither Case1,2,or3. Inthat ase, the subtrieofp
learly satisesthe properties ofaP GCP
tree.2
From Lemmas2 and 3follows:Theorem 1 Under Assumption 1, Algorithm 1 and Algorithm 2 provide a
P GCP
tree re onstru tion after the rash of a physi al node.5 Con lusion and Future Work
Inthis paper,we have presented afault-tolerant proto ol in aseof node rashes in aProperCommon GreatestPrex treesear h system. Thisproto ol anbe oupled
witharepli ationstrategytolowerthe ostsrelatedto highrepli ationfa tors. This proto olallowsthere onne tionandrepairofsubtriesafterthe rashofoneormore nodes. This algorithm guarantees to re over a onsistent PGCP tree after a nite
timeandthus to avoidpartially repli ation.
Our futurework will onsist in onne tingthe two me hanisms (repli ation and repair) in order to minimize the ost of fault-toleran e on dynami platforms. We willalso develop andvalidateexperimentally theme hanisms exposedinthis paper on the Grid'5000 platform of the fren h ministry of resear h. The aim of su h
experimentation willbetoseetheperforman eoftherepairalgorithm andtoseeits apa ityto answer lients' requests fa ing dierent levels of dynami ity. Moreover, we willbeable to seestartingfrom whi h levelof dynami itytherepair me hanism
isnomoree ient alone, andthen howwe anprogressivelyinje t somerepli ation asthedynami itylevelin reases.
Referen es
[1℄ A.AndrzejakandZ.Xu. S alable,E ientRangeQueriesforGridInformation Servi es. InPeer-to-Peer Computing,pages 3340,2002.
[2℄ J. Aspnesand G. Shah. SkipGraphs. InFourteenth Annual ACM-SIAM Sym-posiumon Dis rete Algorithms,pages 384393, January 2003.
[3℄ M. Balazinska,H. Balakrishnan,and D. Karger. INS/Twine: A S alable Peer-to-PeerAr hite ture forIntentional Resour eDis overy. InPro eedings of Per-vasive 2002, 2002.
[4℄ S.Basu, S.Banerjee,P.Sharma, andS.Lee. NodeWiz: Peer-to-Peer Resour e Dis overyforGrids. In5thInternationalWorkshop on Global andPeer-to-Peer Computing (GP2PC) in onjun tion withCCGrid, May2005,2005.
[5℄ E. Caron, F. Desprez, and C. Tedes hi. A dynami prex tree for the servi e dis overywithinlarges alegrids.InIEEE,editor,TheSixthIEEEInternational
Conferen e on Peer-to-Peer Computing, P2P2006,Cambridge,UK.,September 6-8 2006.
[6℄ E.J.H.Chang. E hoAlgorithms: DepthParallelOperationsonGeneralGraphs.
IEEE Trans. on Software Engineering, SE-8:391401, 1982.
[7℄ A. Datta, M. Hauswirth, R.John, R.S hmidt, and K. Aberer. Range Queries
in Trie-Stru tured Overlays. In The Fifth IEEE International Conferen e on Peer-to-Peer Computing, 2005.
[8℄ I. Foster and A.Iamnit hi. On Death,Taxes, and theConvergen e of
Peer-to-Peerand GridComputing. In IPTPS'03,pages118128, 2003.
[9℄ Gnutella. http://www.gnutella. om.
[10℄ KaZaA 2005.The KaZaAWeb Site. http://www.kazaa. om.
[11℄ D. Oppenheimer, J. Albre ht, D. Patterson, and A. Vahdat. Distributed Resour e Dis overy on PlanetLab with SWORD. In Pro eedings of the
ACM/USENIX Workshopon Real, LargeDistributed Systems(WORLDS), De- ember2004.
[12℄ S.Ramabhadran,S.Ratnasamy,J.M. Hellerstein,andS.Shenker. PrexHash
TreeAnindexing DataStru tureoverDistributedHash Tables. InPro eedings ofthe23rdACMSymposiumonPrin iplesofDistributedComputing,St.John's, Newfoundland, Canada, July 2004.
[13℄ S. Ratnasamy, P. Fran is, M. Handley, R. Karp, and S. Shenker. A S alable Content-Adressable Network. InACM SIGCOMM,2001.
[14℄ A. Rowstron and P. Drus hel. Pastry: S alable, Distributed Obje t Lo ation andRoutingforLarge-S alePeer-To-PeerSystems. InInternationalConferen e
on Distributed SystemsPlatforms (Middleware),November2001.
[15℄ C. S hmidt and M. Parashar. Enabling Flexible Queries with Guarantees in P2P Systems. IEEE Internet Computing,8(3):1926, 2004.
[16℄ A. Segall. Distributed Network Proto ols. IEEE Transa tions on Information Theory,IT-29:2335,1983.
[17℄ Y. Shu, B.-C. Ooi, K.-L. Tan, and A. Zhou. Supporting Multi-Dimensional RangeQueriesinPeer-to-PeerSystems. InPeer-to-Peer Computing,pages173 180, 2005.
[18℄ I. Stoi a, R. Morris, D. Karger, M. Kaashoek, and H. Balakrishnan. Chord: A S alable Peer-to-Peer Lookup servi e for Internet Appli ations. In ACM
SIGCOMM,pages 149160, 2001.
[19℄ P. Triantallou and T. Pitoura. Towards a Unifying Framework for Complex Query Pro essing over Stru tured Peer-to-Peer Data Networks. In DBISP2P, 2003.
[20℄ B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D. Ku-biatowi z. Tapestry: A Resilient Global-s ale Overlayfor Servi e Deployment.