Laboratoire de l’Informatique du Parallélisme
École Normale Supérieure de Lyon
Unité Mixte de Recherche CNRS-INRIA-ENS LYON-UCBL n° 5668
A Repair Mechanism for Fault-Tolerance for
Tree-Structured Peer-to-Peer Systems
Eddy Caron,
Frédéric Desprez,
Charles Fourdrignier,
Franck Petit,
Cédric Tedeschi
October 2006
Research Report No 2006-34
École Normale Supérieure de Lyon
46 Allée d’Italie, 69364 Lyon Cedex 07, France
Phone: +33(0)4.72.72.80.37
Fax: +33(0)4.72.72.80.80
E-mail: lip@ens-lyon.fr
Abstract
Facing the limits of traditional resource management tools within computational grids (related to the scale, dynamicity, etc. of the newly considered platforms), new approaches based on peer-to-peer technologies are emerging. Resource discovery, and service discovery in particular, is concerned by this evolution. Among the solutions, a promising one is the indexing of resources using trie structures, and more particularly prefix trees. The major advantage of trie-structured approaches is the capability to support search queries on ranges of values with a latency growing logarithmically in the number of nodes in the trie. Those techniques are easy to extend to multicriteria searches. One drawback of using tries is their inherent poor robustness in a dynamic environment, where nodes join and leave the network, leading to the split of the tree into a forest, which makes it impossible to route requests. Within most recent approaches, fault-tolerance is a prevention mechanism, often replication-based. Replication can be costly in terms of required resources. In this paper, we propose a fault-tolerance protocol that reconnects subtrees a posteriori, after crashes, to obtain a connected graph again, and then reorders the nodes to rebuild a consistent tree.
Keywords: Fault tolerance, peer-to-peer, prefix trees
Résumé (translated from the French)
Given the limits of the traditional resource management tools used in computational grids (poor scalability, no account taken of network dynamicity, etc.), alternatives based on peer-to-peer technologies are emerging. Resource discovery, and in particular the discovery of computation services, is affected by this evolution. Among these solutions, there are promising approaches based on lexicographic trees. The value of such approaches lies in the ability to perform queries on ranges of values, as well as to provide auto-completion of search strings in time logarithmic in the size of the tree. These techniques extend easily to multicriteria searches. However, the tree structure is fragile and can break up into a forest if one of the nodes leaves the network, which makes it impossible to route some requests and offers the client only a partial view of the services. In most of these approaches, fault tolerance, which is indispensable in large-scale dynamic environments, is preventive (achieved a priori) and relies on replication, which is costly in terms of resources and time. In this paper, we present a protocol that tolerates node failures in lexicographic trees and is complementary to replication. It relies on the a posteriori reconnection and repair of trees that have lost one or more nodes.
Keywords: Fault tolerance, peer-to-peer, prefix trees
1 Introduction
These last few years have seen the development of large-scale grids connecting distributed resources (computation resources, storage facilities, computation libraries, etc.) in a seamless way. This is now an efficient alternative to supercomputers for solving large problems in fields such as high energy physics, simulation, and bioinformatics. However, existing middleware used in grids requires most of the time a stable and centralized infrastructure. It usually loses its performance on dynamic and large-scale platforms without centralized management of resources. To cope with the characteristics of these emerging kinds of platforms, it has been suggested to use peer-to-peer technologies within computational grids [8].
Peer-to-peertehnologiesoer algorithmsallowingthesearh andretrievalofobjets over
thenet(dataitems, les,servies,et.). Among thesetehnologies, Distributed HashTables
(DHT) were initially designed for very large sale platforms, for example to share les over
the Internet. However, DHTs have several major drawbaks. Among them, their disovery
mehanismusually worksonexat searhes ofagiven key. Somework hasthenbeen done to
allowomplex requeststobesubmitted overDHTs or moregenerallyinstruturedpeer-to-
peer systems,i.e. systemsbased on request routing. Some of these worksare based on tries
(also alledprex trees). Atrie struture supports rangequeries inalogarithmi timeinthe
number ofnodesof thetrie.
Fault-tolerance is a mandatory feature for peer-to-peer systems to avoid the loss of data stored on nodes and to allow a correct routing of messages. The crash of one or several nodes in a trie leads to the loss of the object references stored in the trie and to the split of the trie into several subtries, also called a forest. Fault-tolerance within structured peer-to-peer systems usually uses replication. Using such an approach, each node and each link of the trie would have to be duplicated $k$ times, $k$ being the replication factor. Keeping such a structure up is costly, mainly in terms of resources used. The purpose is then to find, for the value of $k$, the right trade-off between the replication cost and the robustness of the system. In this paper, we study an alternative to the replication approach based on the reconnection of the subtries and the a posteriori reordering of a consistent trie. When the trie is disconnected, a first solution consists in rebuilding a trie by adding the nodes of the remaining subtries one by one. This naive method can lead to a prohibitive cost when the number of remaining nodes is large (which is usually the case in peer-to-peer systems). For example, losing one node can lead to a complete reconstruction of the trie. A second approach consists in reconnecting the subtries to get the original trie back at a minimum cost. This is the kind of algorithm we describe in this paper, in a distributed and asynchronous environment. It can also be used to complement the replication process.
A brief history of peer-to-peer technologies is provided in Section 2, followed by the formal description of the particular trie structure we use (Section 3) and of the distributed system in which we place ourselves. We focus our study on the fault-tolerance mechanisms related to them. Then, in Section 4, we present the repair algorithm we designed and give its proof, before a conclusion and future work section.
2 Related Work
With the spread of peer-to-peer technologies going along with file sharing over the Internet, the first systems made use of unstructured mechanisms, i.e., based on the flooding of search requests [10, 9]. These mechanisms resulted in overloading the network while providing non-exhaustive responses. Addressing both the scalability and the exhaustiveness issues within peer-to-peer systems, the distributed hash tables [13, 14, 18, 20], a.k.a. the structured peer-to-peer group, are highly scalable in the sense that the number of logical hops required to route and the local state grow logarithmically with the number of nodes participating in the system. Moreover, DHTs prevent the loss of routing paths and objects' references by use of replication and periodic scans. Unfortunately, DHTs present several major drawbacks (homogeneous capacity assumptions, lack of topology awareness, etc.). Among them, the rigidity of the requesting mechanism, i.e., exact match on a given key, hinders their use in real search systems.
A series of works allows flexible means of retrieval over structured peer-to-peer networks. A first achievement in this direction has been the ability to describe resources with a semi-structured language, such as XML, as described in [3]. [19] enhances DHTs with traditional database operations. Several approaches based on space-filling curves, such as Squid [15] or [17], support multi-dimensional range queries. [1] maps a one-dimensional data space to a d-dimensional Cartesian space by using the inverse Hilbert mapping. Built on top of multiple DHTs, SWORD [11] is an information service aiming at discovering computing resources on the grid by answering multi-attribute range queries.
We focus in this work on trie-structured retrieval solutions, which also support range queries but outperform the previous approaches in the sense that a logarithmic (or constant, if we assume an upper bound on the depth of the trie) latency is achieved by parallelizing the resolution of the query in the several branches of the trie. Prefix Hash Tree (PHT) [12] builds a trie of the entire key-space on top of a DHT. The purpose of this architecture is to use the trie as a logical layer allowing complex searches on top of any DHT-like network. The architecture of PHT results in the multiplication of the complexities of the trie and of the underlying DHT. The Skip Graphs structure proposed in [2] is similar to a trie but is built with the skip lists technology, allowing the use of their inherent fault-tolerance properties. But again, the number of messages generated to process range queries is in $O(m \log(n))$, $m$ being the number of nodes covered by the range and $n$ the total number of nodes in the graph.
Other approaches propose to rely on a trie for each purpose, i.e., indexing the key-space, mapping the nodes of the trie on the network, and routing the requests. Among them, NodeWiz [4] assumes a set of static reliable nodes to host the trie, which is unfortunately hard to ensure on peer-to-peer platforms. P-Grid [7] builds a trie on the whole key-space (i.e., the whole set of potential keys). Each leaf of this trie corresponds to a subset of the key-space. Fault-tolerance is achieved by probabilistic replication.
As a more general consideration, none of these approaches addresses the topology/physical locality awareness issue, i.e., no information about the underlying network is taken into account to build the logical (overlay) network, which can raise a significant performance problem, physical locality being broken when the logical network is built. Moreover, the several fault-tolerance solutions are mostly replication-based, or DHT-based, also involving heavy replication mechanisms.
We recently developed a novel architecture, initially designed for the purpose of service discovery over dynamic computational grids and attempting to solve the above drawbacks of existing approaches. It is based on a logical Greatest Common Prefix Tree, formally described in Section 3, that is dynamically built as objects (services, but extensible to data items, files, [...]
3 Preliminaries
Greatest Common Prefix Tree. Let an ordered alphabet $A$ be a finite set of letters. Denote by $\prec$ an order on $A$. A non-empty word $w$ over $A$ is a finite sequence of letters $a_1, \ldots, a_i, \ldots, a_l$, $l > 0$. The concatenation of two words $u$ and $v$, denoted $u \circ v$ or simply $uv$, is equal to the word $a_1, \ldots, a_i, \ldots, a_k, b_1, \ldots, b_j, \ldots, b_l$ such that $u = a_1, \ldots, a_i, \ldots, a_k$ and $v = b_1, \ldots, b_j, \ldots, b_l$. Let $\epsilon$ be the empty word such that for every word $w$, $w\epsilon = \epsilon w = w$. The length of a word $w$, denoted by $|w|$, is equal to the number of letters of $w$; $|\epsilon| = 0$.
A word $u$ is a prefix (respectively, proper prefix) of a word $v$ if there exists a word $w$ such that $v = uw$ (resp., $v = uw$ and $u \neq v$). The Greatest Common Prefix (resp., Proper Greatest Common Prefix) of a collection of words $w_1, w_2, \ldots, w_i, \ldots$ ($i \geq 2$), denoted $GCP(w_1, w_2, \ldots, w_i, \ldots)$ (resp. $PGCP(w_1, w_2, \ldots, w_i, \ldots)$), is the longest prefix $u$ shared by all of them (resp., such that $\forall i \geq 1, u \neq w_i$). A [Proper] Greatest Common Prefix Tree ([P]GCP Tree, also a particular kind of trie) is a labeled rooted tree such that both following properties are true for every node of the tree:
1. The node label is a proper prefix of any label in its subtree;
2. The node label is the Proper Greatest Common Prefix of all its son labels.
In the following, we use the word trie to designate our PGCP tree.
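The two properties above can be checked mechanically on a small example. The following is a minimal sketch, assuming labels are plain Python strings and the tree is given as a dictionary from a label to the list of its son labels; the names `gcp`, `is_proper_prefix`, `is_pgcp_tree` and the dictionary layout are illustrative choices, not part of the report. A node with a single son is only required to be a proper prefix of it, since the PGCP is defined for at least two words.

```python
def gcp(*words):
    """Greatest common prefix of one or more words (plain strings)."""
    prefix = words[0]
    for w in words[1:]:
        n = 0
        while n < min(len(prefix), len(w)) and prefix[n] == w[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def is_proper_prefix(u, v):
    """True if v = uw for some non-empty word w."""
    return v.startswith(u) and u != v


def is_pgcp_tree(label, sons):
    """Check both PGCP properties for the subtree rooted at `label`.

    `sons` is a hypothetical dict mapping a label to the list of its son labels.
    """
    children = sons.get(label, [])
    # Property 1 holds for the whole subtree as soon as it holds for every
    # father/son pair, because "is a proper prefix of" is transitive.
    if not all(is_proper_prefix(label, c) for c in children):
        return False
    # Property 2, checked only when there are at least two sons (assumption).
    if len(children) >= 2 and gcp(*children) != label:
        return False
    return all(is_pgcp_tree(c, sons) for c in children)


good = {"DTR": ["DTRMM", "DTRSM"]}
bad = {"D": ["DTRMM", "DTRSM"]}   # the sons share "DTR", not just "D"
print(is_pgcp_tree("DTR", good), is_pgcp_tree("D", bad))
```

In the second tree the root label is a common prefix of its sons but not the greatest one, so Property 2 fails.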
Distributed Lexicographic Placement Table. The distributed system considered in this paper consists of a set of asynchronous physical nodes organized in a Distributed Hash Table (DHT). Each physical node maintains one or more nodes of the logical PGCP Tree. Note that a DHT is used, but it can be replaced by any system, distributed or not, allowing the retrieval of any node from any other node. We also consider that the potential existing fault-tolerance mechanisms provided by this layer are not used within our architecture. We propose in this paper a fault-tolerance mechanism at the PGCP Tree level.
When one wants to insert an object labeled $o$ into the trie, a message containing $o$ is generated, according to which the message is routed within the trie until reaching the node labeled $v$ such that $v$ is the smallest label in the trie that shares with $o$ the greatest common prefix of any node of the trie with $o$. More formally, if $L$ denotes the whole set of labels currently in the trie, consider the set $U = \{l \in L \mid GCP(l, o) = p\}$ where $p = \max_{|m|} \{m = PGCP(l, o), l \in L\}$. The label of the target node is $t = \min_{|w|} \{u \in U \mid u = pw\}$. Once found, the target node performs the insertion. If $t \neq o$, node(s) are created. If $o = tu$ ($u \neq \epsilon$), a new node labeled $o$ is created as a new son of the node labeled $t$. If $t = ou$ ($u \neq \epsilon$), a new node is created as the father of the node labeled by $t$. Finally, if none of these conditions are satisfied, it means that $o$ and $t$ must be siblings but no node in the trie is labeled by their common prefix. Thus two nodes are created: a node labeled $GCP(o, t)$, father of the node labeled by $t$ and also father of the other newly created node, labeled by $o$. The distributed routing algorithm (that also performs the creation and the mapping of nodes) requires a number of hops bounded by twice the depth of the trie [5].
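The choice of the target label and the three creation cases can be illustrated with a small centralized sketch. It assumes the whole set of labels is available locally as a list of strings, which is not how the distributed routing of [5] proceeds; `find_target` and `insertion_case` are hypothetical helper names, and the length of the common prefix is used to compare candidates.

```python
def gcp(u, v):
    """Greatest common prefix of two words."""
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]


def find_target(labels, o):
    """Shortest label among those sharing the longest common prefix with o."""
    best = max(len(gcp(l, o)) for l in labels)
    candidates = [l for l in labels if len(gcp(l, o)) == best]
    return min(candidates, key=len)


def insertion_case(t, o):
    """Which node(s) the target labeled t creates when inserting o."""
    if o == t:
        return ("already present", t)
    if o.startswith(t):            # o = tu, u non-empty
        return ("new son of t", o)
    if t.startswith(o):            # t = ou, u non-empty
        return ("new father of t", o)
    # o and t are siblings: their common prefix gets a new node, father of both
    return ("new common father", gcp(o, t))


labels = ["ab", "abc", "abd"]
for o in ("abce", "a", "abd"):
    t = find_target(labels, o)
    print(o, "->", t, insertion_case(t, o))
```

The sibling case arises, for instance, when the only label is "ab" and "ax" is inserted: a node labeled "a" is created as the father of both.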
Physical nodes communicate by message passing. We assume two sending functions. The former, simply referred to as SEND, is used by any physical node to send a message to another node asynchronously, i.e., without waiting for any acknowledgement. The latter, called SYNC-SEND, waits for an acknowledgement for each message sent. We assume that each physical [...]
4 Protocol
In this section, we give a detailed explanation of how the protocol works. We divide the algorithm code into two parts. The former shows the first phase of our technique, during which a unique trie is recovered without considering any lexicographic property. During the second phase, the trie is reorganized to eventually form a distributed greatest common prefix tree.
4.1 Trie Recovery
After a node $p$ detects the loss of its father ($p.father$), it searches for a new father to link on. Making a traversal of the DHT, Node $p$ collects in Variable $PN$ the addresses of all the remaining physical nodes. From the addresses in $PN$, $p$ builds the set $N$ of logical nodes stored by the physical nodes in $PN$. Next, using a PIF (Propagation of Information with Feedback) protocol [6, 16], $p$ computes $T$, the set of logical nodes in its subtrie, which is made of its real descendants and its temporary relinked descendants. This first step of the recovery protocol ends when $p$ chooses a temporary father ($p.tmpfather$) in the subset $N \setminus T$. When a node $q$ is linked to a node $p$, then $p$ considers $q$ as a temporary son stored in $p.tmpsons$. Note that Variable $p.tmpsons$ is required to compute $T$ using a PIF in the subtrie of $p$. If $N \setminus T = \emptyset$ (i.e., there is no node on which $p$ may link), then $p$ is considered as the root of the trie.
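The choice of a temporary father in $N \setminus T$ can be sketched as follows. This is a centralized illustration under stated assumptions: the DHT traversal and the PIF wave are replaced by direct access to dictionaries of real and temporary sons, and the names `subtrie` and `choose_tmp_father` are hypothetical. The set $T$ is taken to include $p$ itself, so that $p$ never selects itself.

```python
import random


def subtrie(p, sons, tmpsons):
    """Set T: p and every node reachable through real or temporary sons."""
    seen, stack = set(), [p]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(sons.get(node, ()))
        stack.extend(tmpsons.get(node, ()))
    return seen


def choose_tmp_father(p, all_nodes, sons, tmpsons, rng=random):
    """Random node of N \\ T, or None when p must consider itself the root."""
    candidates = sorted(set(all_nodes) - subtrie(p, sons, tmpsons))
    return rng.choice(candidates) if candidates else None


sons = {"p": ["a"], "r": ["x"]}
tmpsons = {"a": ["b"]}            # b was temporarily relinked under a
nodes = ["p", "a", "b", "r", "x"]
print(choose_tmp_father("p", nodes, sons, tmpsons))   # either "r" or "x"
```

Excluding $T$ prevents $p$ from linking below one of its own descendants, but it does not prevent cycles formed by several false roots choosing concurrently.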
The above technique suffers from a drawback: several nodes without a father may each make a choice that turns out to be a bad one. In particular, they can choose as a temporary father a node belonging to the subtrie of another node being in the same situation. By doing this in parallel, cycles may appear. Our strategy is to detect and to break such cycles a posteriori, as follows. After the choice of its temporary father $tf$, a node $p$ sends a message HELLO with its ID ($p.id$) to $tf$. In the next step, $tf$ transmits the message to its own father, and so on. Step by step, one of the two following situations eventually arises:
1. The real root of the trie receives the message HELLO. In that case, the root notifies $p$ that it is not involved in a cycle.
2. The message is received by a false root, i.e., a node having also lost its own father. The false root propagates the message to its temporary father.
Note that, in the latter case, due to the asynchrony of the network, it is possible that the false root receives the message HELLO sent by $p$ before it has executed its own recovery phase. In that case, the false root is still without a temporary father. The message HELLO is then delayed until the false root chooses its own temporary father.
Therefore, the message HELLO sent by $p$ keeps circulating among its ancestors, carrying the list of the IDs of the false roots met during its traversal. Upon receipt of a message HELLO, if the first item of the list carried by the message is equal to the ID of the receiver, then a cycle is detected. In that case, a leader election is computed among the IDs of the list, e.g., by choosing the smallest ID. The leader becomes the root of the subtrie, breaks its link with its father, and executes the recovery phase again. (The other false roots involved in the cycle remain connected to the subtrie rooted by the leader.) Note that a cycle may [...] at least one subtrie becomes the subtrie of one false root. In other words, the number of cycles is periodically divided by at least 2. Therefore, the system eventually contains one (rooted) trie only.
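The circulation of a HELLO message and the detection of a cycle can be simulated sequentially. The sketch below is a simplification under explicit assumptions: every false root has already chosen its temporary father (so no message is delayed), a node with neither father nor temporary father is the real root, and the hop bound `max_hops` exists only to stop the simulation when the message enters a cycle that does not contain its initiator. The function name `route_hello` and the dictionaries are hypothetical.

```python
def route_hello(p, father, tmpfather, max_hops=1000):
    """Follow the HELLO sent by false root p.

    Returns ("nocycle", None) if the real root is reached,
    ("cycle", leader) if the message comes back to p,
    ("undecided", None) if the hop bound is exhausted.
    """
    ids = [p]                       # list of false-root IDs carried by HELLO
    node = tmpfather[p]             # p sends HELLO to its temporary father
    for _ in range(max_hops):
        if node == ids[0]:
            return ("cycle", min(ids))      # leader: smallest ID of the list
        if father.get(node) is not None:
            node = father[node]             # ordinary node: forward upwards
        elif tmpfather.get(node) is not None:
            ids.append(node)                # false root: append its ID
            node = tmpfather[node]
        else:
            return ("nocycle", None)        # real root reached
    return ("undecided", None)


# p linked under x (in q's subtrie) while q linked under y (in p's subtrie)
father = {"x": "q", "y": "p"}
tmpfather = {"p": "x", "q": "y"}
print(route_hello("p", father, tmpfather))
```

Here the message travels p, x, q, y and returns to p, so a cycle is reported with the smaller of the two false-root IDs as leader.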
4.2 Trie Reorganization
The trie reorganization is initiated once the trie recovery is done. Each node $p$ having a temporary son $q$ (i.e., $q$ is a false root with its subtrie) initiates a routing mechanism close to the original key insertion [5]. Let us consider the following cases:
1. The value $p.val$ is a prefix of the value of $q$ (Figure 1, Case (i)). In that case, $q$ (and its subtrie) is placed in the subtrie of $p$ following one of the four cases shown in Figure 1, Cases (a) to (d).
2. The value $p.val$ is not a prefix of the value of $q$. Then, $p$ moves $q$ to its father, which now has the responsibility to place $q$.
[Figure 1: A false root $q$ is linked to a node $p$ such that $p.val = prefix(q.val)$. Panels: (i) $p.val = prefix(q)$ and $p.val = PGCP(s_1, \ldots, s_k)$; (a) there exists $s_i$ such that $s_i.val = prefix(q.val)$; (b) there exists $s_i$ such that $q.val = prefix(s_i.val)$; (c) there exists $s_i$ such that $PGCP(q.val, s_i.val) > p.val$, and a new son of $p$ is created; (d) $p.val = prefix(q.val)$, and $q$ becomes the son $s_{k+1}$ of $p$.]
Note that new services may keep being inserted during the trie reconstruction. So, a new subtrie may have been created at the same place where the false root initially was. Thus, our method requires taking into account that any false root being placed in the trie can meet a node having the same value. In that case, the two tries must be merged. That is the aim of the merging [...] node $p$ executes Procedure $Gluing(q)$, which moves the sons of $q$ to $p$ before withdrawing $q$ from the trie (including from the sons of $q$'s father). Then, if necessary, $p$ recursively restarts merging and placements among its sons, in order to eventually merge both subtries.
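The decision taken by a node $p$ that has to place a false root $q$ can be summarized by a small classifier. It is a sketch under assumptions: values are plain strings, the comparison between $p.val$ and a PGCP is read as a comparison of lengths, and the function returns only the name of the case (merging, Figure 1 Cases (a) to (d), or forwarding to the father) without modifying any tree. The name `placement_case` is hypothetical.

```python
def gcp(u, v):
    """Greatest common prefix of two words."""
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]


def placement_case(p_val, son_vals, q_val):
    """Case applied by node p (value p_val, sons son_vals) to false root q."""
    if q_val == p_val:
        return ("merge", p_val)                 # same value: the tries merge
    if not q_val.startswith(p_val):
        return ("forward to father", None)      # p.val is not a prefix of q.val
    for s in son_vals:
        if q_val.startswith(s):
            return ("a", s)                     # q goes into the subtrie of s
    for s in son_vals:
        if s.startswith(q_val):
            return ("b", s)                     # q replaces s, s moves under q
    for s in son_vals:
        if len(gcp(s, q_val)) > len(p_val):
            return ("c", s)                     # new node PGCP(q, s) above both
    return ("d", None)                          # q simply becomes a son of p


for q in ("abcd", "abd", "abdf", "abx", "ab", "x"):
    print(q, placement_case("ab", ["abc", "abde"], q))
```

With $p.val$ = "ab" and sons "abc" and "abde", the six values exercise Cases (a), (b), (c), (d), the merge, and the forwarding to the father, in that order.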
4.3 Correctness Proof
In this subsection, we discuss the correctness of our protocol. In order to do this, we first need to make the realistic assumption that, in the considered context, the crash frequency is low enough for the trie to be fully built at some time. (Otherwise, the trie could never be built and would be unusable most of the time. More generally, it is impossible to say anything about termination otherwise.) In other words, we fairly assume that no crash occurs after a crash until the trie is fully built, i.e., no two consecutive crashes interfere with each other at any given time.
Assumption 1 If a node crashes at time $t$, then for every $t' > t$, no crash occurs.
Lemma 2 Under Assumption 1, the recovery protocol (Algorithm 1) terminates, and when this occurs, the system contains one trie only.
Proof. The validation mainly consists in showing that the protocol terminates and that the reorganization of the trie is eventually initiated (by sending a message NOCYCLE).
Assume by contradiction that, under Assumption 1, no node eventually sends a message NOCYCLE. So, neither Line 1.35 nor Line 1.37 in Algorithm 1 is executed. Note that in the first case (Line 1.35), the node becomes the real root after the crash of its father. So, in both cases, this means that no message HELLO ever reaches the real root of the trie. The height of the trie being finite, this means that every message HELLO traverses cycles only. When a message HELLO is received by its initiator, the cycle is broken by the node which is elected among the false roots participating in the cycle (Lines 1.16 to 1.21). Therefore, cycles are created infinitely often. Let $C$ be the number of created cycles. In the worst case, a cycle is made of at least two nodes. So, $C$ is initially bounded by $F/2$, where $F$ is the number of false roots created by the crash. When a cycle is broken, at most one leader is elected. So, at most $C/2$ leaders are able to link to another node again. In the next phase, the number of cycles is less than or equal to $C/2$. Since, under Assumption 1, cycles may be created only when false roots are linked to other nodes (executing Lines 1.10 and 1.11), $C$ never grows and is eventually equal to $0$. This contradicts that cycles are created infinitely often. □
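The counting argument that closes this proof can be written as a short recurrence. The phase index $k$ and the notation $C_k$ (number of cycles at the beginning of phase $k$) are introduced here only for illustration and do not appear in the report; the halving step is the one asserted in the proof.

```latex
C_0 \le \frac{F}{2}, \qquad
C_{k+1} \le \frac{C_k}{2}
\;\Longrightarrow\;
C_k \le \frac{F}{2^{k+1}}, \qquad
\text{hence } C_k = 0 \text{ as soon as } 2^{k+1} > F .
```

Since $C_k$ is a non-negative integer, a bound strictly below 1 forces it to 0, which under these assumptions happens after a number of phases logarithmic in $F$.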
We now consider the phase of trie reorganization shown in Algorithm 2.
Lemma 3 Under Assumption 1 and assuming that the system contains one trie only, the reorganization protocol (Algorithm 2) terminates, and when this occurs, the trie is a $PGCP$ tree.
Proof. Clearly, each trie of the forest following the crash of a node is a $PGCP$ tree. So, it remains to show that, executing Algorithm 2, the whole trie eventually satisfies the condition to be a $PGCP$ tree.
From the algorithm, it is easy to observe that, in the absence of merging, there are only two cases to consider depending on the value of Node $p$ and its false son $fs$:
1. The value of $p$ is a prefix of $fs$'s value (Line 1.05). In that case, following the four cases described in Figure 1, $fs$ is eventually placed at the right place in the subtrie of $p$ (refer to Lines 1.06 to 1.19). The resulting trie is a $PGCP$ tree.
2. The value of $p$ is not a prefix of $fs$. Again, there are two cases to consider:
(a) Node $p$ has no father ($p.father = \bot$) (Lines 1.22 to 1.28). In that case, if $fs.val$ is a prefix of $p$, then $p$ (and its subtrie) becomes the node to be placed in $fs$ (Line 1.24). Otherwise, $p$ and $fs$ become the two sons of a new root node $q$ such that $q.val = PGCP(p, fs)$ (Line 1.26). The trie is then clearly a $PGCP$ tree.
(b) Node $p$ has a father. Then, $fs$ is moved to the father of $p$ (Line 1.21). By induction on the above discussion, either $fs$ eventually moves to a node $q$ such that $q.val = prefix(fs.val)$ or $fs$ eventually reaches the root of the trie. The former case is equivalent to Case 1, the latter to Case 2a.
If $p$ and $fs$ merge, then there are four cases to consider after $p$ and $fs$ are glued together into $p$:
1. There exists a pair of sons $s_i$, $s_j$ of $p$ such that $s_i.val$ is a prefix of $s_j.val$. Then, $s_j$ is moved toward $s_i$ (Lines 2.08 to 2.11). This case is similar to the above Case 1 (Cases (a) or (b) in Figure 1).
2. There exists a pair of sons $s_i$, $s_j$ of $p$ such that $PGCP(s_i, s_j) > p.val$. Then, $s_i$ and $s_j$ become the two sons of a new son $q$ of $p$ such that $q.val = PGCP(s_i, s_j)$ (Lines 2.12 to 2.16). This case is also similar to the above Case 1 (Case (c) in Figure 1).
3. There exists a pair of sons $s_i$, $s_j$ of $p$ such that $s_i.val = s_j.val$. This case is solved by initiating a recursive merging between $s_i$ and $s_j$ (Lines 2.05 to 2.07), i.e., by induction on $s_i$ and $s_j$.
4. There exists no pair of sons $s_i$, $s_j$ of $p$ satisfying either Case 1, 2, or 3. In that case, the subtrie of $p$ clearly satisfies the properties of a $PGCP$ tree. □

Algorithm 1 Recovery Protocol for each node p
1.01  upon receipt of <Disconnected from Father> do
1.02    PN := Physical Node Set in the DHT (collected by a DHT traversal);
1.03    N := Logical Node Set in PN (collected by polling the nodes in PN);
1.04    T := Logical Node Set in my subtrie (collected using a PIF wave)
1.05         using p.sons ∪ p.tmpsons;
1.06    if p.tmpfather ≠ ⊥ then send <DISCONNECT> to p.tmpfather;
1.07    if N \ T = ∅
1.08    then // I am the root
1.09      p.father := ⊥; p.tmpfather := ⊥;
1.10    else p.tmpfather := random choice among N \ T;
1.11      send-sync <LINK> to p.tmpfather;
1.12      send <HELLO, p.id> to p.tmpfather;
1.13    endif
1.14  upon receipt of <HELLO, list> from q do
1.15    if First(list) = p.id
1.16    then // A cycle is detected
1.17      leader := LeaderElection(list);
1.18      if p = leader
1.19      then Executes upon receipt of <Disconnected from Father> do,
1.20           except PN and N;
1.21      endif
1.22    elseif p.father ≠ ⊥
1.23    then send <HELLO, list> to p.father;
1.24    elseif p.tmpfather ≠ ⊥
1.25    then list := list + p.id;
1.26      send <HELLO, list> to p.tmpfather
1.27    elseif p.father = ⊥
1.28    then // Both father and tmpfather are unknown, i.e.,
1.29         // I am a false root which is still not linked
1.30      Executes upon receipt of <Disconnected from Father> do
1.31        if it is still not working;
1.32      if tmpfather ≠ ⊥
1.33      then list := list + p.id;
1.34        send <HELLO, list> to p.tmpfather;
1.35      else send <NOCYCLE> to First(list);
1.36    else // I am the real root, so there is no cycle.
1.37      send <NOCYCLE> to First(list);
1.38    endif
1.39  upon receipt of <NOCYCLE> from q do
1.40    send <MOVE, p> to p.tmpfather;
1.41    send-sync <UNLINK> to p.tmpfather;
1.42    p.tmpfather := ⊥;
1.43  upon receipt of <LINK> from q do
1.44    tmpsons := tmpsons ∪ {q};
1.45  upon receipt of <UNLINK> from q do
1.46    tmpsons := tmpsons \ {q};

Algorithm 2 Reorganization Protocol for each node p
1.01  upon receipt of <MOVE, fs> from q do
1.02    if fs.val = p.val
1.03    then // I send to myself that a fusion is needed.
1.04      send <MERGE, fs> to p
1.05    elseif p.val = prefix(fs.val)
1.06    then if ∃ s ∈ p.sons | s.val = prefix(fs.val)
1.07      then // fs is in the subtrie of s, Case (a) in Figure 1
1.08        send <MOVE, fs> to s;
1.09      elseif ∃ s ∈ p.sons | fs.val = prefix(s.val)
1.10      then // s is in the subtrie of fs, Case (b) in Figure 1
1.11        p.sons := p.sons ∪ {fs}; p.sons := p.sons \ {s};
1.12        send <MOVE, s> to fs;
1.13      elseif ∃ s ∈ p.sons | p.val < PGCP(s.val, fs.val)
1.14      then // fs and s have a PGCP which is greater than p.val
1.15           // Case (c) in Figure 1
1.16        Newnode(PGCP(fs.val, s.val), s, fs); p.sons := p.sons \ {s};
1.17      else // fs is one of my sons, Case (d) in Figure 1
1.18        p.sons := p.sons ∪ {fs};
1.19      endif
1.20    else if p.father ≠ ⊥
1.21      then send <MOVE, fs> to p.father
1.22      else if fs.val = prefix(p.val)
1.23        then // I am in the subtrie of fs
1.24          send <MOVE, p> to fs;
1.25        else // p and fs are brothers
1.26          p.sons := p.sons ∪ Newnode(PGCP(fs.val, p.val), fs, p);
1.27        endif
1.28      endif
1.29    endif
2.01  upon receipt of <MERGE, fs> from q do
2.02    Gluing(q);
2.03    Sorting of p.sons in the lexicographic order in Table t_s;
2.04    for i = 0 to t_s.length() do
2.05      if t_s[i].val = t_s[i + 1].val
2.06      then send <MERGE, t_s[i + 1]> to t_s[i];
2.07        i := i + 1;
2.08      elseif t_s[i].val = prefix(t_s[i + 1].val)
2.09      then send <MOVE, t_s[i + 1]> to t_s[i];
2.10        p.sons := p.sons \ {t_s[i + 1]};
2.11        i := i + 1
2.12      elseif p.val < PGCP(t_s[i].val, t_s[i + 1].val)
2.13      then p.sons := p.sons ∪ Newnode(PGCP(t_s[i].val, t_s[i + 1].val),
2.14                                       t_s[i], t_s[i + 1]);
2.15        p.sons := p.sons \ {t_s[i], t_s[i + 1]};
2.16        i := i + 1;
2.17      endif
2.18    done
From Lemmas 2 and 3 follows:
Theorem 1 Under Assumption 1, Algorithm 1 and Algorithm 2 provide a $PGCP$ tree reconstruction after the crash of a physical node.
5 Conclusion and Future Work
In this paper, we have presented a fault-tolerant protocol handling node crashes in a Proper Greatest Common Prefix tree search system. This protocol can be coupled with a replication strategy to lower the costs related to high replication factors. It allows the reconnection and repair of subtries after the crash of one or more nodes. The algorithm guarantees the recovery of a consistent PGCP tree after a finite time, and thus makes it possible to partially avoid replication.
Our future work will consist in connecting the two mechanisms (replication and repair) in order to minimize the cost of fault-tolerance on dynamic platforms. We will also develop and experimentally validate the mechanisms presented in this paper on the Grid'5000 platform of the [...] of the repair algorithm and to see its capacity to answer clients' requests facing different levels of dynamicity. Moreover, we will be able to see starting from which level of dynamicity the repair mechanism alone is no longer efficient, and then how we can progressively inject some replication as the dynamicity level increases.
References
[1] A. Andrzejak and Z. Xu. Scalable, Efficient Range Queries for Grid Information Services. In Peer-to-Peer Computing, pages 33-40, 2002.
[2] J. Aspnes and G. Shah. Skip Graphs. In Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 384-393, January 2003.
[3] M. Balazinska, H. Balakrishnan, and D. Karger. INS/Twine: A Scalable Peer-to-Peer Architecture for Intentional Resource Discovery. In Proceedings of Pervasive 2002, 2002.
[4] S. Basu, S. Banerjee, P. Sharma, and S. Lee. NodeWiz: Peer-to-Peer Resource Discovery for Grids. In 5th International Workshop on Global and Peer-to-Peer Computing (GP2PC) in conjunction with CCGrid, May 2005.
[5] E. Caron, F. Desprez, and C. Tedeschi. A dynamic prefix tree for the service discovery within large scale grids. In IEEE, editor, The Sixth IEEE International Conference on Peer-to-Peer Computing, P2P2006, Cambridge, UK, September 6-8 2006.
[6] E. J. H. Chang. Echo Algorithms: Depth Parallel Operations on General Graphs. IEEE Trans. on Software Engineering, SE-8:391-401, 1982.
[7] A. Datta, M. Hauswirth, R. John, R. Schmidt, and K. Aberer. Range Queries in Trie-Structured Overlays. In The Fifth IEEE International Conference on Peer-to-Peer Computing, 2005.
[8] I. Foster and A. Iamnitchi. On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing. In IPTPS'03, pages 118-128, 2003.
[9] Gnutella. http://www.gnutella.com.
[10] KaZaA 2005. The KaZaA Web Site. http://www.kazaa.com.
[11] D. Oppenheimer, J. Albrecht, D. Patterson, and A. Vahdat. Distributed Resource Discovery on PlanetLab with SWORD. In Proceedings of the ACM/USENIX Workshop on Real, Large Distributed Systems (WORLDS), December 2004.
[12] S. Ramabhadran, S. Ratnasamy, J. M. Hellerstein, and S. Shenker. Prefix Hash Tree: An Indexing Data Structure over Distributed Hash Tables. In Proceedings of the 23rd ACM Symposium on Principles of Distributed Computing, St. John's, Newfoundland, Canada, July 2004.
[13] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A Scalable Content-[...]
[14] A. Rowstron and P. Druschel. Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-To-Peer Systems. In International Conference on Distributed Systems Platforms (Middleware), November 2001.
[15] C. Schmidt and M. Parashar. Enabling Flexible Queries with Guarantees in P2P Systems. IEEE Internet Computing, 8(3):19-26, 2004.
[16] A. Segall. Distributed Network Protocols. IEEE Transactions on Information Theory, IT-29:23-35, 1983.
[17] Y. Shu, B.-C. Ooi, K.-L. Tan, and A. Zhou. Supporting Multi-Dimensional Range Queries in Peer-to-Peer Systems. In Peer-to-Peer Computing, pages 173-180, 2005.
[18] I. Stoica, R. Morris, D. Karger, M. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In ACM SIGCOMM, pages 149-160, 2001.
[19] P. Triantafillou and T. Pitoura. Towards a Unifying Framework for Complex Query Processing over Structured Peer-to-Peer Data Networks. In DBISP2P, 2003.
[20] B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D. Kubiatowicz. Tapestry: A Resilient Global-scale Overlay for Service Deployment. IEEE Journal on Selected Areas in Communications, 22(1):41-53, January 2004.