HAL Id: lirmm-01617207
https://hal-lirmm.ccsd.cnrs.fr/lirmm-01617207
Submitted on 16 Oct 2017
Linking indexing data structures to de Bruijn graphs: Construction and update
Bastien Cazaux, Thierry Lecroq, Eric Rivals
To cite this version:
Bastien Cazaux, Thierry Lecroq, Eric Rivals. Linking indexing data structures to de Bruijn graphs:
Construction and update. Journal of Computer and System Sciences, Elsevier, 2019, 104, pp.165-183.
⟨10.1016/j.jcss.2016.06.008⟩. ⟨lirmm-01617207⟩
Contents lists available at ScienceDirect
Journal of Computer and System Sciences
www.elsevier.com/locate/jcss
Linking indexing data structures to de Bruijn graphs: Construction and update
Bastien Cazaux a,b, Thierry Lecroq c, Eric Rivals a,b,∗
a LIRMM, CNRS and Université de Montpellier, 161 rue Ada, 34095 Montpellier Cedex 5, France
b Institut Biologie Computationnelle, CNRS and Université de Montpellier, 860 rue Saint Priest, 34095 Montpellier Cedex 5, France
c Normandie Univ. & UNIROUEN, UNIHAVRE, INSA Rouen, LITIS, 76000 Rouen, France
Article info
Article history: Received 25 June 2015; Received in revised form 26 May 2016; Accepted 27 June 2016; Available online xxxx
Keywords: Index; Data structure; Suffix tree; Suffix array; Dynamic update; Overlap; Contracted de Bruijn graph; Assembly; Algorithms; Bioinformatics
Abstract
DNA sequencing technologies have tremendously increased their throughput, and hence complicated DNA assembly. Numerous assembly programs use de Bruijn graphs (dBG) built from short reads to merge these into contigs, which represent putative DNA segments. In a dBG of order k, nodes are substrings of length k of reads (or k-mers), while arcs are their (k+1)-mers. As analysing reads often requires indexing all their substrings, it is interesting to exhibit algorithms that directly build a dBG from a pre-existing index, and especially a contracted dBG, where non-branching paths are condensed into single nodes. Here, we exhibit linear time algorithms for constructing the full or contracted dBGs from suffix trees, suffix arrays, and truncated suffix trees. With the latter, the construction uses a space that is linear in the size of the dBG. Finally, we also provide algorithms to dynamically update the order of the graph without reconstructing it.
© 2016 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction
In life sciences, determining the sequence of bio-molecules is an essential step towards the understanding of their functions and interactions within an organism. Powerful sequencing technologies yield huge quantities of short sequencing reads that need to be assembled to infer the complete target sequence. These constraints favour the use of a version of the de Bruijn Graph (dBG) dedicated to genome assembly, a version which differs from the combinatorial structure invented by N.G. de Bruijn [1]. Given a set $S = \{s_1, \ldots, s_n\}$ of n reads and an integer k, an assembly de Bruijn Graph, or for short simply de Bruijn Graph, stores each k-mer (k-long substring) occurring in the reads as a node and has an arc joining two k-mers if they appear as successive (and hence overlapping) k-mers in at least one read. The dBG is then traversed to extract long paths, which will form the contigs, i.e., the sequences of sub-regions of the molecule. In non-repetitive regions, the layout of the reads dictates a simple path of k-mers without bifurcations. Any simple path between an in-branching node and the next out-branching node can then be contracted into a single arc without losing any information on the graph structure. The sequences of such simple paths are called unitigs (the contraction of unique and contigs). The version of the dBG where each such “non-branching” path is condensed into a single arc is termed the Contracted dBG (CdBG).
* Corresponding author at: LIRMM, CNRS and Université de Montpellier, 161 rue Ada, 34095 Montpellier Cedex 5, France.
E-mail addresses: bastien.cazaux@lirmm.fr (B. Cazaux), thierry.lecroq@univ-rouen.fr (T. Lecroq), rivals@lirmm.fr (E. Rivals).
http://dx.doi.org/10.1016/j.jcss.2016.06.008
0022-0000/© 2016 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Sequencing technologies of the second generation can yield hundreds of millions of reads. Compared to the overlap graph or to the string graph, which were used with previous technologies, the dBG has a number of nodes that is not proportional to the number of reads: it depends on a user-controlled parameter k, termed the order of the dBG. Its memory usage can be fine-tuned through this parameter.
In bioinformatics, dBGs are heavily exploited for genome assembly [2], but for other purposes as well. Some programs mine the dBG to seek graph patterns representing mutations, large insertions/deletions, or chromosomal rearrangements [3]. Others use it to correct sequencing errors in long reads [4].
The de Bruijn Graph is usually built directly from the set of reads, which is time and space consuming. Several compact data structures for storing dBGs have been developed [5,6], including probabilistic ones [7]. The emphasis is placed on the practical space needed to store the dBGs in memory. Moreover, some recent assembly algorithms put forward the advantage of using, for the same input, multiple dBGs with increasing orders [8], thereby emphasising the need for dynamically updating the dBG. In all cases, the construction algorithms need to scan through the whole set of reads.
Several genome assembly programs used hash tables to store the k-mers of the reads and allow navigating through the arcs of the dBG, but these solutions suffer from several limitations regarding, e.g., functionalities and flexibility. With hash functions, it is often not possible to add extra information to the nodes, such as the number of times a k-mer is observed in the read set, which is used as a confidence measure. Hash tables make it difficult to compute the contracted dBG or to change the value of k. The main advantage of sophisticated hash functions is their memory footprint. For instance, Minia [9] offers a very space-efficient storage to handle the dBG based on cascading Bloom filters, which are a type of hash functions. This hash table based solution was used for long read error correction and also proves efficient in that context [4].
In studies involving the analysis of sequencing reads, distinct tasks require indexing either all substrings or the k-mers of the reads. For instance, fast dBG assembly programs first count the k-mers before building the dBG to estimate the memory needed [7]. Another example: some error correction software builds a suffix tree of all short reads to correct them [10]. Hence, before the assembly starts, the read set has already been scanned through and indexed. It can thus be efficient to enable the construction of the dBG for the subsequent assembly directly from the index rather than from scratch. For these reasons, we set out to find algorithms that transform usual indexes into a dBG or a contracted dBG. It is also of theoretical interest to build bridges between well studied indexes and this graph on words. Despite recent results [11,12], formal methods for constructing dBGs from suffix trees are an open question. In comparison, Simpson and Durbin have proposed an algorithm to build the String Graph from an FM-index [13].
Here, we present algorithms to build directly the CdBG from a Generalised Suffix Tree or from a Generalised Suffix Array of the reads [14–17]. These algorithms take space and time that are linear in the input size. These well-known data structures index all substrings of the reads, and not only their k-mers. This results in one drawback and in one advantage.
The drawback is their space occupancy. We will then consider an indexing data structure that reduces the set of indexed substrings: the truncated suffix tree [18,19]. We introduce the reduced truncated suffix tree (TST) and then show how to construct with this index both the dBG and CdBG in time and space that are linear in the size of the final dBG, rather than in the cumulated length of the reads. By size of the dBG we mean the number of nodes plus the number of arcs. This algorithm achieves an optimal time and space complexity.
The advantage is the counterpart: as substrings of all lengths are indexed, it is possible to update the order of the graph, that is, to change dynamically the value of k without reconstructing the dBG. Finally, we provide efficient algorithms for increasing or decreasing the value of k. Of course, if one uses the truncated suffix tree instead of the full suffix tree, only some updates remain possible. Our results nevertheless remain applicable to the truncated suffix tree, where the order can be dynamically decreased.
This article includes results that appeared in [20,21].
1.1. Indexing data structures
Suffix trees are well-known indexing data structures that enable storing and retrieving all the factors of a given string. The suffix tree of a string y of length s can be built in time and space in $O(s)$ on a constant size alphabet [14,22]. Then, it is possible to check whether a pattern x of length m is a factor of y in time $O(m)$. Counting the number of occurrences of x in y can also be done in time $O(m)$, while enumerating the positions where x occurs in y can be performed in time $O(m + occ)$, where occ denotes the number of occurrences of x in y. Suffix trees can be adapted to a finite set of strings and are then called Generalised Suffix Trees (GSTs). Thus, given a set S of n strings of total length $\|S\|$ on a constant size alphabet, the generalised suffix tree for S can be built in time and space $O(\|S\|)$. For a detailed exposition of properties of suffix trees we refer the reader to [17]. Suffix trees have been widely studied and used in a large number of applications (see [15] and [17]). In practice, they consume too much space and are often replaced by the more economical suffix arrays [16], which have the same properties [23]. When one is only interested in factors of a given length, truncated suffix trees only store the factors of length up to a given constant k of a given string. They can also be built in linear time and space [18]. In practice, truncated suffix trees save a lot of nodes compared to suffix trees.
Fig. 1. $S := \{bacbab, cbabcaa, bcaacb, cbaac, bbacbaa\}$ is a set of words. Therefore, we have $Support(ba) = \{(1,1), (1,4), (2,2), (4,2), (5,2), (5,5)\}$, $RC(ba) = \{\varepsilon, c, cb, cba, cbab, b, bc, bca, bcaa, a, ac, cbaa\}$, $LC(ba) = \{\varepsilon, c, ac, bac, b, bbac\}$ and $d(ba) = 0$. One has $RC(ba) \cap \Sigma = \{a, b, c\}$. Thus, the word ba is not right extensible in S (see Definition 2).
2. Definitions of de Bruijn graphs
2.1. Notation about strings
Here we introduce the notation and basic definitions.
An alphabet $\Sigma$ is a finite set of letters. A finite sequence of elements of $\Sigma$ is called a word or a string. The set of all words over $\Sigma$ is denoted by $\Sigma^*$, and $\varepsilon$ denotes the empty word. For a word $x$, $|x|$ denotes the length of $x$. Given two words $x$ and $y$, we denote by $x \cdot y$ or simply $xy$ the concatenation of $x$ and $y$. For every $1 \le i \le j \le |x|$, $x[i]$ denotes the $i$-th letter of $x$, and $x[i..j]$ denotes the substring or factor $x[i]\,x[i+1] \ldots x[j]$. Let $k$ be a positive integer. If $|x| \ge k$, $first_k(x)$ is the prefix of length $k$ of $x$ and $last_k(x)$ is the suffix of length $k$ of $x$. Then a substring of length $k$ of $x$ is called a $k$-mer of $x$. For $i$ such that $1 \le i \le |x| - k + 1$, $(x)_{k,i}$ is the $k$-mer of $x$ starting in position $i$, i.e., $(x)_{k,i} = x[i..i+k-1]$. Thus we have $first_k(x) = (x)_{k,1}$ and $last_k(x) = (x)_{k,|x|-k+1}$. We denote by $\#(\Lambda)$ the cardinality of any finite set $\Lambda$.
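As an illustration of this notation, the following Python sketch mirrors $first_k$, $last_k$ and $(x)_{k,i}$ with 1-based positions. The function names are hypothetical helpers chosen for this example; they do not come from the article.

```python
def first_k(x: str, k: int) -> str:
    """first_k(x): prefix of length k of x (requires |x| >= k)."""
    assert len(x) >= k
    return x[:k]


def last_k(x: str, k: int) -> str:
    """last_k(x): suffix of length k of x (requires |x| >= k)."""
    assert len(x) >= k
    return x[len(x) - k:]


def kmer(x: str, k: int, i: int) -> str:
    """(x)_{k,i}: the k-mer of x starting at 1-based position i."""
    assert 1 <= i <= len(x) - k + 1
    return x[i - 1:i - 1 + k]


def kmers(x: str, k: int) -> list:
    """All k-mers of x, in order of their starting position."""
    return [kmer(x, k, i) for i in range(1, len(x) - k + 2)]
```

With the first word of the running example, `kmers("bacbab", 2)` lists the five 2-mers in reading order, the first being `first_k` and the last being `last_k`.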
Let $S = \{s_1, \ldots, s_n\}$ be a finite set of words. It is our running instance for all the following. Let us denote the sum of the lengths of the input strings by $\|S\| := \sum_{s_i \in S} |s_i|$. We denote by
• $F(S)$ the set of factors of words of $S$, i.e., $F(S) = \{w \in \Sigma^* \mid \exists u, v \in \Sigma^*, 1 \le i \le n, s_i = uwv\}$.
• $F_k(S)$ the set of factors of length $k$ of $S$, where $k$ is a positive integer, i.e., $F_k(S) = F(S) \cap \Sigma^k$.
• $Suff_k(S)$ the set of suffixes of length $k$ of words of $S$.
2.2. Classical definition of de Bruijn graph
All definitions below refer to the set S; however, as S is clear from the context, we simply omit the “in S” in the notation.
For a word $w$ of $F(S)$,
• $Support(w)$ is the set of pairs $(i, j)$, where $w$ is the substring $(s_i)_{|w|,j}$. $Support(w)$ is called the support of $w$ in $S$.
• $RC(w)$ (resp. $LC(w)$) is the set of right contexts (resp. left contexts) of the word $w$ in $S$, i.e., the set of words $w'$ such that $ww' \in F(S)$ (resp. $w'w \in F(S)$).
• $\lceil w \rceil$ is the word $ww'$ where $w'$ is the longest word of $RC(w)$ such that $Support(w) = Support(ww')$. In other words, such that $w$ and $ww'$ have exactly the same support in $S$.
• $\lfloor w \rfloor$ is the word $w'$ where $w'$ is the longest prefix of $w$ such that $Support(w') \ne Support(w)$.
• $d(w) := |\lceil w \rceil| - |w|$.
In other words, $\lceil w \rceil$ is the longest extension of $w$ having the same support as $w$ in $S$, while $\lfloor w \rfloor$ is the shortest reduction of $w$ with a support different from that of $w$ in $S$. These definitions are illustrated in a running example presented in Fig. 1.
We give the definition of a de Bruijn graph for assembly (dBG for short), which differs from the original definition of a complete graph over all possible words of length $k$ stated by de Bruijn [1].
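The notions of support, context and $d(\cdot)$ can be checked on the running example of Fig. 1 with a naive Python sketch. It scans the words directly and is quadratic, unlike the index-based methods of the article. All function names are hypothetical; `ceil_word` stands for $\lceil w \rceil$ and is implemented under the reading that an extension keeps the same support when every occurrence of $w$ extends to an occurrence of it.

```python
S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]  # running example (Fig. 1)


def support(w, words):
    """Support(w): 1-based pairs (i, j) such that w occurs in s_i at position j."""
    return {(i, j + 1)
            for i, s in enumerate(words, start=1)
            for j in range(len(s) - len(w) + 1)
            if s.startswith(w, j)}


def right_letters(w, words):
    """RC(w) ∩ Σ: letters following at least one occurrence of w."""
    return {words[i - 1][j - 1 + len(w)]
            for (i, j) in support(w, words)
            if j - 1 + len(w) < len(words[i - 1])}


def left_letters(w, words):
    """LC(w) ∩ Σ: letters preceding at least one occurrence of w."""
    return {words[i - 1][j - 2] for (i, j) in support(w, words) if j > 1}


def ceil_word(w, words):
    """Longest extension of w whose occurrences start exactly where those of w do."""
    occ = support(w, words)
    while True:
        nxt = right_letters(w, words)
        if len(nxt) != 1:
            return w
        cand = w + next(iter(nxt))
        if support(cand, words) != occ:  # some occurrence of w ends its word
            return w
        w = cand


def d(w, words):
    """d(w) = |ceil(w)| - |w|."""
    return len(ceil_word(w, words)) - len(w)
```

On this input, `support("ba", S)` returns the six pairs listed in the caption of Fig. 1, `right_letters("ba", S)` contains three letters, and `d("ba", S)` is 0.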
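Definition 1. Let $k$ be a positive integer. The de Bruijn graph of order $k$ for $S$, denoted by $DBG_k^+$, is a directed graph, $DBG_k^+ := (V_k^+, E_k^+)$, whose vertices are the $k$-mers of words of $S$ and where an arc links $u$ to $v$ if and only if $u$ and $v$ are two successive $k$-mers of a word of $S$, i.e.:
$V_k^+ := F_k(S)$
$E_k^+ := \{(u, v) \in (V_k^+)^2 \mid last_{k-1}(u) = first_{k-1}(v) \text{ and } v[k] \in RC(u)\}$.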
Fig. 2. Examples of arcs from $DBG_k^+$. (a) shows letters in the right context of ba, and (b) the successors of node ba in $DBG_2^+$; one for each letter in $RC(w) \cap \Sigma$. (c) shows letters in the left context of ba, and (d) the predecessors of node ba in $DBG_2^+$; one for each letter in $LC(w) \cap \Sigma$.
Fig. 3. With solid arcs only, the graphs correspond to $DBG_2^+$ (a) and $DBG_3^+$ (b) for our running example. With both solid and dotted arcs, they represent $DBG_2^-$ (a) and $DBG_3^-$ (b).
An equivalent definition of $E_k^+$ can be stated using the left instead of right context:
$E_k^+ := \{(u, v) \in (V_k^+)^2 \mid last_{k-1}(u) = first_{k-1}(v) \text{ and } u[1] \in LC(v)\}$.
Examples of arcs are displayed on Fig. 2. The size of $DBG_k^+$ is defined as $size(DBG_k^+) := \#(V_k^+) + \#(E_k^+)$.
Note that another, simpler definition of the arcs in the de Bruijn graph coexists with that of Definition 1. There, an arc links $u$ to $v$ if and only if $u$ overlaps $v$ by $k-1$ symbols. This graph is denoted by $DBG_k^- = (V_k^-, E_k^-)$, where:
$V_k^- := F_k(S)$
$E_k^- := \{(u, v) \in (V_k^-)^2 \mid last_{k-1}(u) = first_{k-1}(v)\}$.
The arcs of $E_k^-$ satisfy fewer constraints than those of $E_k^+$; hence, $E_k^+$ is a subset of $E_k^-$. Both definitions are illustrated on Fig. 3. Some assembly programs use $DBG_k^-$ [9]. All the algorithmic results that we obtain for $DBG_k^+$ remain valid for $DBG_k^-$. In the sequel, we focus only on $DBG_k^+$.
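To make the difference between the two arc sets concrete, here is a minimal Python sketch that builds both graphs by direct enumeration from the reads. This is the "from scratch" construction, not the index-based one developed in this article. The names `dbg_plus` and `dbg_minus` are hypothetical.

```python
S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]  # running example (Fig. 1)


def dbg_plus(words, k):
    """DBG_k^+: nodes are k-mers, arcs join successive k-mers of a word."""
    nodes = {s[i:i + k] for s in words for i in range(len(s) - k + 1)}
    # Each (k+1)-mer of a word yields one arc between its two k-mers.
    arcs = {(s[i:i + k], s[i + 1:i + k + 1])
            for s in words for i in range(len(s) - k)}
    return nodes, arcs


def dbg_minus(words, k):
    """DBG_k^-: same nodes, an arc whenever u overlaps v by k-1 symbols."""
    nodes, _ = dbg_plus(words, k)
    arcs = {(u, v) for u in nodes for v in nodes if u[1:] == v[:-1]}
    return nodes, arcs


def size(graph):
    """size(G) = #(V) + #(E)."""
    nodes, arcs = graph
    return len(nodes) + len(arcs)
```

For the running example with k = 2, the two graphs share the same nodes and every arc of `dbg_plus(S, 2)` also appears in `dbg_minus(S, 2)`, while the converse does not hold.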
Let us now introduce the notions of extensibility for a substring of $S$ and that of a Contracted dBG (CdBG for short).
Definition 2 (Extensibility). Let $w$ be a word of $F(S)$.
• $w$ is right extensible in $S$ if and only if $\#(RC(w) \cap \Sigma) = 1$.
• $w$ is left extensible in $S$ if and only if $\#(LC(w) \cap \Sigma) = 1$.
Let $w$ be a word of $\Sigma^*$. The word $w$ is said to be a unique $k'$-mer of $S$ if and only if $k' \ge k$ and for all $i \in [1..k'-k+1]$,
Fig. 4. The graphs correspond to $CDBG_2^+$ (a) and $CDBG_3^+$ (b) for our running example.
Definition 3. A contracted de Bruijn graph of order $k$, denoted by $CDBG_k^+ = (V_{k,c}^+, E_{k,c}^+)$, is a directed graph where:
$V_{k,c}^+ = \{w \in \Sigma^* \mid w \text{ is a } k'\text{-mer unique maximal by substring and } k' \ge k\}$
$E_{k,c}^+ = \{(u, v) \in (V_{k,c}^+)^2 \mid last_{k-1}(u) = first_{k-1}(v) \text{ and } v[k] \in RC(last_k(u))\}$.
Examples of $CDBG_k^+$ are displayed on Fig. 4. Note that in the previous definition, an element $w$ in $V_{k,c}^+$ does not necessarily belong to $F(S)$, since $w$ may only exist as the substring of the agglomeration of two words of $S$. Thus, let $w$ be a $k'$-mer unique maximal by substring with $k' \ge k$:
• $last_k(w)$ is not right extensible, or $RC(last_k(w)) \cap \Sigma = \{a\}$ and $last_{k-1}(w) \cdot a$ is not left extensible,
• $first_k(w)$ is not left extensible, or $LC(first_k(w)) \cap \Sigma = \{a\}$ and $a \cdot first_{k-1}(w)$ is not right extensible.
With this argument, we have both following propositions.
Proposition 1. Let $(u, v) \in E_{k,c}^+$; $(last_k(u), first_k(v)) \in E_k^+$ and there exists $w \in V_k^+$ such that $(w, first_k(v)) \in E_k^+ \setminus \{(last_k(u), first_k(v))\}$ or $(last_k(u), w) \in E_k^+ \setminus \{(last_k(u), first_k(v))\}$.
Proposition 2. Let $(u, v) \in E_k^+$. If $u$ is right extensible and $v$ is left extensible, then there exists $w \in V_{k,c}^+$ such that $u \cdot v[k]$ is a substring of $w$. Otherwise, there exists $(u', v') \in E_{k,c}^+$ such that $u = last_k(u')$ and $v = first_k(v')$.
According to Propositions 1 and 2, $CDBG_k^+$ is the graph $DBG_k^+$ where the arcs $(u, v)$ are contracted if and only if $u$ is right extensible and $v$ is left extensible.
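The contraction rule just stated can be sketched in Python on an explicitly built $DBG_k^+$. This is a plain graph compaction written for illustration, not the article's Algorithm 1 or 2, and it assumes that in $DBG_k^+$ "right extensible" and "left extensible" translate to "exactly one successor" and "exactly one predecessor". The names `dbg_plus` and `contract` are hypothetical.

```python
from collections import defaultdict


def dbg_plus(words, k):
    """DBG_k^+ by direct enumeration of k-mers and (k+1)-mers."""
    nodes = {s[i:i + k] for s in words for i in range(len(s) - k + 1)}
    arcs = {(s[i:i + k], s[i + 1:i + k + 1])
            for s in words for i in range(len(s) - k)}
    return nodes, arcs


def contract(nodes, arcs):
    """Merge every arc (u, v) with out-degree(u) = 1 and in-degree(v) = 1."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in arcs:
        succ[u].add(v)
        pred[v].add(u)

    def contractible(u, v):
        return len(succ[u]) == 1 and len(pred[v]) == 1

    def starts_path(v):
        if len(pred[v]) != 1:
            return True
        (u,) = pred[v]
        return not contractible(u, v)

    def walk(start):
        path = [start]
        while len(succ[path[-1]]) == 1:
            (v,) = succ[path[-1]]
            if v == start or not contractible(path[-1], v):
                break
            path.append(v)
        return path

    paths, seen = [], set()
    # First the true path starts, then leftovers (isolated contractible cycles).
    for candidates in (sorted(v for v in nodes if starts_path(v)), sorted(nodes)):
        for v in candidates:
            if v not in seen:
                path = walk(v)
                paths.append(path)
                seen.update(path)

    def spell(path):
        # Word spelled by a path of successive k-mers.
        return path[0] + "".join(x[-1] for x in path[1:])

    owner = {path[0]: spell(path) for path in paths}  # first k-mer -> contracted node
    c_nodes = set(owner.values())
    c_arcs = {(spell(path), owner[v]) for path in paths for v in succ[path[-1]]}
    return c_nodes, c_arcs
```

Each contracted node is represented here by the word spelled along its path, which is one possible convention. Applied to `dbg_plus(S, 2)` for the running example, the eight 2-mers collapse into five contracted nodes.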
2.3. Constructive characterisation of the de Bruijn graph
Let $k$ be a positive integer. We define the following three subsets of $F(S)$.
• $InitExact_k = \{w \in F(S) \mid |w| = k \text{ and } d(w) = 0\}$
• $Init_k = \{w \in F(S) \mid |w| \ge k \text{ and } d(first_k(w)) = |w| - k\}$
• $SubInit_k = InitExact_{k-1}$
A word of $InitExact_k$ is either only the suffix of some $s_i$ or has at least two right extensions, while the first $k$-mer of a word in $Init_k \setminus InitExact_k$ has only one right extension.
Proposition 3. $InitExact_k = Init_k \cap \{w \in F(S) \mid |w| = k\}$.
Proof. Let $w \in InitExact_k$. In this case, we get $first_k(w) = w$ and $|w| - k = 0$. This means that $d(first_k(w)) = |w| - k$ and therefore $w \in Init_k$. □
For $w$ an element of $Init_k$, $first_k(w)$ is a $k$-mer of $S$. Given two distinct words $w_1$ and $w_2$ of $Init_k$, $first_k(w_1)$ and $first_k(w_2)$ are distinct $k$-mers of $S$. Furthermore, for each $k$-mer $w'$ of $S$, there exists a word $w$ of $Init_k$ such that $first_k(w) = w'$. From this, we get the following proposition.
Proposition 4. There exists a bijection between $Init_k$ and the set of the $k$-mers of $S$.
According to Definition 1 and Proposition 4, each vertex of $DBG_k^+$ can be assimilated to a unique element of $Init_k$. As the vertices of $DBG_k^-$ are identical to those of $DBG_k^+$, there also exists a bijection between $Init_k$ and the set of vertices of $DBG_k^-$.
To define the arcs between the words of $Init_k$, which correspond to arcs of $DBG_k^+$, we need the following proposition, which states that each single letter that is a right extension of $w$ gives rise to a single arc.
Proposition 5. For $w \in InitExact_k$ and $a \in \Sigma \cap RC(w)$, there exists a unique $w' \in Init_k$ such that $last_{k-1}(w)a$ is a prefix of $w'$.
Proof. Let $w$ be a word of $InitExact_k$ and $a$ a letter of $RC(w)$. By definition of right context, $last_{k-1}(w)a \in F(S)$. As $|last_{k-1}(w)a| = k$, there exists $w'$ such that $last_{k-1}(w)a$ is a prefix of $w'$ and $|last_{k-1}(w)a| + d(last_{k-1}(w)a) = |w'|$. By definition of $Init_k$, $w' \in Init_k$. □
The set $Init_k$ represents the nodes of $DBG_k^+$. Let us now build the set of arcs that is isomorphic to $E_k^+$. Let $w$ be a word of $Init_k$ and $Succ_k(w)$ denote the set of successors of $first_k(w)$: $Succ_k(w) := \{x \in Init_k \mid (first_k(w), first_k(x)) \in E_k^+\}$. We know that for each letter $a$ in $RC(w)$, there exists an arc from $first_k(w)$ to $first_k(last_{|w|-1}(w)a)$ in $DBG_k^+$. We consider two cases depending on the length of $w$:
Case 1. $|w| = k$. According to Proposition 3, $w \in InitExact_k$ and hence $last_{k-1}(w) \in SubInit_k$. Therefore, the outgoing arcs of $w$ in $DBG_k^+$ are the arcs from $w$ to $w'$ satisfying the condition of Proposition 5. Then, $Succ_k(w) = \bigcup_{a \in \Sigma \cap RC(w)} \{\lceil last_{k-1}(w)a \rceil\}$.
Case 2. $|w| > k$. As $w$ is longer than $k$, it contains the next $k$-mer; hence $first_k(last_{|w|-1}(w)a) = first_k(last_{|w|-1}(w))$, and there exists a unique outgoing arc of $w$: that from $w$ to $\lceil w[2..k+1] \rceil$. Indeed, by definition of $Init_k$, $\lceil w[2..k+1] \rceil \in Init_k$, and thus $Succ_k(w) = \{\lceil w[2..k+1] \rceil\}$.
Now, we can build integrally $DBG_k^+$, or more exactly a graph isomorphic to $DBG_k^+$.
Theorem 1. With the sets $Init_k$, $InitExact_k$ and $SubInit_k$, we can build a graph isomorphic to $DBG_k^+$ in linear time in the size of these sets.
For simplicity, from now on, we identify the graph we build with $DBG_k^+$.
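The word-level characterisation above can be exercised with a short Python sketch that computes $Init_k$ and $Succ_k$ naively and compares the resulting arcs with those of $DBG_k^+$. It does not achieve the linear time of Theorem 1, which relies on an index, and every name in it is a hypothetical helper. The extension $\lceil w \rceil$ is computed under the assumption that it stops as soon as one occurrence of $w$ reaches the end of its word or two occurrences are followed by different letters.

```python
S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]  # running example (Fig. 1)


def occurrences(w, words):
    """0-based pairs (word index, start) of the occurrences of w."""
    return [(i, j) for i, s in enumerate(words)
            for j in range(len(s) - len(w) + 1) if s.startswith(w, j)]


def next_letters(w, words):
    """Following letter of each occurrence of w, None at the end of a word."""
    return {words[i][j + len(w)] if j + len(w) < len(words[i]) else None
            for (i, j) in occurrences(w, words)}


def ceil_word(w, words):
    """Longest extension of w with the same support."""
    while True:
        nxt = next_letters(w, words)
        if len(nxt) != 1 or None in nxt:
            return w
        w += next(iter(nxt))


def init_k(words, k):
    """Map each k-mer x of S to its representative ceil(x) in Init_k."""
    kms = {s[i:i + k] for s in words for i in range(len(s) - k + 1)}
    return {x: ceil_word(x, words) for x in kms}


def succ_k(w, words, k):
    """Succ_k(w) for w in Init_k, following Case 1 and Case 2."""
    if len(w) == k:  # Case 1: one successor per letter of RC(w) ∩ Σ
        letters = next_letters(w, words) - {None}
        return {ceil_word(w[1:] + a, words) for a in letters}
    return {ceil_word(w[1:k + 1], words)}  # Case 2: the next k-mer is inside w


def arcs_from_init(words, k):
    """Arcs of DBG_k^+ recovered as pairs of first k-mers."""
    return {(w[:k], x[:k])
            for w in init_k(words, k).values()
            for x in succ_k(w, words, k)}
```

On the running example with k = 2, `init_k` yields one representative per 2-mer, in line with Proposition 4, and `arcs_from_init(S, 2)` coincides with the arcs obtained by listing successive 2-mers in the reads.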
2.4. Constructive characterisation of the contracted de Bruijn graph
To do the same with $CDBG_k^+$, we first explain the algorithm that we use to build this graph, and then characterise the concepts of right and left extensibility in terms of word properties.
Our algorithm to build $CDBG_k^+$. We present a generic algorithm to build incrementally $CDBG_k^+$. It is explained in terms of words, and does not depend on any indexing data structure. In the following sections, we will use this generic algorithm and explain how it can be performed efficiently using a specified indexing structure.
The main algorithm (Algorithm 2) explores $DBG_k^+$ to find the nodes kept in $CDBG_k^+$ and sets all single arcs that represent whole non-branching paths of $DBG_k^+$ that are properly contracted. The key point is to find all starting nodes of simple paths and explore these paths from them; the exploration is done by Algorithm 1.
A more detailed explanation. First, note that to build $DBG_k^+$ it suffices to know the set $Succ_k(.)$ for each node. The algorithm below simulates a traversal of $DBG_k^+$ without building it, and stores only one node per unique maximal $k'$-mer of $DBG_k^+$. For such a $k'$-mer, say $m$, we choose to represent it by the node $v$ such that $first_k(v)$ is a prefix of $m$. In $DBG_k^+$, $m$ is represented by a simple (i.e., non-branching) path and $v$ is its first node. In the traversal algorithm, for a current starting node $v_c$ in $Init_k$, we traverse the simple path until we arrive at a node $u$ having several successors or such that its only successor is not left extensible (i.e., has several predecessors). In other words, until we find $u$ such that $u$ is not right extensible or $next(u)$ is not left extensible. In $DBG_k^+$, there exists a simple path between $v_c$ and $u$, and this must build a single node in $CDBG_k^+$. To contract this path, we choose to keep $v_c$, and for any successor $w$ of $u$, we insert an arc between $u$ and $w$, as this arc cannot be contracted. Noting that $w$ necessarily starts a chain (having at least a single node), if $w$ is not yet in $CDBG_k^+$, we launch a new path exploration starting from $w$; one gets that $first_k(w)$ is the prefix of a node of $CDBG_k^+$, and thus $w$ can appropriately represent the path. Now, if $w$ already belongs to $CDBG_k^+$, the case is trickier. If $v_f$ stores the first $v_c$ called by the procedure, it may not be the starting node of a path, but be anywhere inside a path. Two cases arise. If $v_f$ is considered during the while loop, then it is not at the start of a simple path: hence we must update $V$ by exchanging $v_f$ with $v_c$ and terminate the exploration. Otherwise, $v_f$ is traversed during the for loop (as the value of $w$), then it is a successor of $u$ and the beginning of a simple path: we just add an arc linking $v_c$ to $w$ and stop. Finally, if $w$ already belongs to $V$ but $w \ne v_f$, we also add an arc linking $v_c$ to $w$ and stop.
Algorithm 1: BuildAuxCDBG(V, E, v_f, v_c).
Input: The partial contracted graph CDBG_k^+ as (V, E), two nodes v_f and v_c: v_f the initial starting node, and v_c the current starting node.
Output: The updated contracted graph (V', E'), which now contains all paths starting from v_c.
begin
    u := v_c; mark u
    // search the node ending the chain that goes through v_c
    while u is right extensible and next(u) is left extensible do
        if v_f = next(u) then
            update (v_f, i) by (v_c, i) for all (v_f, i) ∈ E
            return (V \ {v_f}, E)
        u := next(u); mark u
    // now explore the path starting in the successor of u
    for w ∈ Succ_k(u) do
        if w ∈ V then
            (V, E) := (V, E ∪ {(v_c, w)})
        else
            (V, E) := BuildAuxCDBG(V ∪ {w}, E ∪ {(v_c, w)}, v_f, w)    // explore from node w
    return (V, E)
Algorithm 2: BuildCDBG(S).
Input: A set of words S.
Output: CDBG_k^+ of S.
begin
    (V, E) := (∅, ∅)
    // search for any node v of DBG_k^+ without predecessors
    // and build CDBG_k^+ from v
    for v ∈ Init_k do
        if there exists no w such that v ∈ Succ_k(w) then
            (V, E) := (V, E) ∪ BuildAuxCDBG(V ∪ {v}, E, v, v)
    // explore DBG_k^+ from any node not yet visited
    for v_c an unmarked node of Init_k do
        (V, E) := (V, E) ∪ BuildAuxCDBG(V ∪ {v_c}, E, v_c, v_c)
    return (V, E)
The process performed by Algorithm 1 augments the partial graph $CDBG_k^+$ restrained to the nodes visited when exploring the path starting from $v_c$. It suffices now to ensure that all arcs of $DBG_k^+$ are examined, which Algorithm 2 does. More precisely, it starts by visiting the simple paths starting at nodes having no predecessors (otherwise these nodes would not be visited). Once this is done, one must explore all nodes not yet marked and continue until all nodes have been visited/marked.
From the above discussion, we obtain the following theorem.
Theorem 2. Assume one can determine in constant time for an arc $(u, v)$ of $E_{k,c}^+$, whether $u$ is right extensible and whether $v$ is left extensible. Then, with the sets $Init_k$, $InitExact_k$ and $SubInit_k$, Algorithm 2 builds a graph that is isomorphic to $CDBG_k^+$ in linear time in the size of these sets.
Remark. Executing Algorithm 2 does not require building $DBG_k^+$, since the set of successors $Succ_k(u)$ of any node $u$ is computed in constant time.
Characterisation of the concepts of right and left extensibility. By the construction of $DBG_k^+$, we get the following properties, which will prove useful for the construction of the CdBG from specific indexes (Sections 3 and 4).
Proposition 7. Let $w$ be a word of $Init_k$ such that $first_k(w)$ is right extensible. Let the letter $a$ be the unique element of $RC(first_k(w)) \cap \Sigma$, then $last_{k-1}(first_k(w))a$ is left extensible if and only if
$\#(Support(first_k(w))) = \#(Support(last_{k-1}(first_k(w))a) \setminus \{(i, 1) \mid 1 \le i \le n\})$.
Proof. Let $(i, j)$ be a pair of $Support(first_k(w))$. We have $(i, j+1) \in Support(last_{k-1}(first_k(w)))$. As $Support(last_{k-1}(first_k(w))) = Support(last_{k-1}(first_k(w))a)$, it follows that $(i, j+1) \in Support(last_{k-1}(first_k(w))a)$.
If there exists $(i, j) \in Support(last_{k-1}(first_k(w)))$ such that $j > 0$ and $(i, j-1) \notin Support(first_k(w))$, there exists a letter $b \ne w[1]$ such that $(i, j-1) \in Support(b \cdot last_{k-1}(first_k(w)))$. Hence $(b \cdot last_{k-1}(first_k(w)), last_{k-1}(first_k(w))a)$ also belongs to $E^+$, and thus $last_{k-1}(first_k(w))a$ is not left extensible. □
In summary, this section gives a formulation of the dBG of $S$ in terms of words. Now assume that the substrings of the words are indexed in a data structure, e.g., a generalised suffix array. How can we build the dBG or the contracted graph directly from this structure? To achieve this, it suffices to compute the three sets $Init_k$, $InitExact_k$, $SubInit_k$, as well as the sets $Support(.)$ and $Succ_k(.)$ for some appropriate substrings. In the following sections, we exhibit algorithms to compute $DBG_k^+$ and $CDBG_k^+$ for two important indexing structures and for a home-made truncated data structure.
3. Transition from an indexing data structure to de Bruijn graphs
3.1. From a generalised suffix tree
Suffix Trees (ST) belong to the most studied indexing data structures. A generalised ST can index the substrings of a set of words. Generally, for this sake, all words are concatenated and separated by a special symbol not occurring elsewhere. However, this trick is not compulsory, and an alternative is to keep the indication of a terminating node within each node.
3.1.1. The suffix tree and its properties
The Generalised Suffix Tree of a set of words $S$ is the suffix tree of $S$, where each word of $S$ does not necessarily finish by a letter of unique occurrence. Hence, for each node $v$ of the Generalised Suffix Tree of $S$, we keep in memory the set, denoted by $Suff(v)$, of pairs $(i, j)$ such that the word represented by $v$ is the suffix of $s_i$ starting at position $j$. Let us denote by $T$ the generalised suffix tree of $S$ (from now on, we simply say the tree) and by $V_T$ its set of nodes. For $v \in V_T$, $Children(v)$ denotes its set of children and $f(v)$ its parent. See Fig. 5 for an example of GST.
Some nodes of $T$ may have just one child. The size of the union of $Suff(v)$ for all nodes $v$ of $T$ equals the number of leaves in the generalised suffix tree when the words end with a terminating symbol. Hence, the space to store $T$ and the sets $Suff(.)$ is linear in $\|S\|$. For simplicity, for a node $v$ of $T$, the word represented by $v$ is identified with $v$. For each node $v$ of $T$, $v \in F(S)$. As all elements of $F(S)$ are not necessarily represented by a node of $T$, we give the following proposition.
Proposition 8. The set of nodes of $T$ is exactly the set of words $w$ of $F(S)$ such that $d(w) = 0$.
We recall the notion of a suffix link (SL) for any node $v$ of $T$ (leaves included). Let $sl(v)$ denote the node targeted by the suffix link of $v$, i.e., $sl(v) = v[2..|v|]$.
By definition of a suffix tree, for all $w \in F(S)$, there exists a node $v$ of $T$ such that $w$ is a prefix of $v$. Let $v'$ be the node of minimal length of $T$ such that $w$ is a prefix of $v'$, then $|v'| = |w| + d(w)$, and therefore $\lceil w \rceil = v'$.
Proposition 9. Let $w \in F(S)$. Then $|\lceil w \rceil| \ge |w| > |f(\lceil w \rceil)|$, where $f(\lceil w \rceil)$ is the parent of $\lceil w \rceil$ in $T$.
Proof. As $f(\lceil w \rceil) = \lfloor w \rfloor$, the result is obvious. □
3.1.2. Construction of $DBG_k^+$
Let $[x_1..x_m]$ be the set of $k$-mers of $S$. According to the definition of $Init_k$ and to Proposition 4, $Init_k = [\lceil x_1 \rceil..\lceil x_m \rceil]$. Thus, by Proposition 9, $Init_k = \{v \in V_T \mid |f(v)| < k \text{ and } |v| \ge k\}$. Similarly, $InitExact_k = \{v \in V_T \mid |v| = k\}$. Now, it appears clearly that $InitExact_k$ is a subset of $Init_k$, since for all $v \in V_T$, $|f(v)| < |v|$.
Fig. 5. The generalised suffix tree for our running example and the constructed de Bruijn graph for $k := 2$. Square nodes represent words that occur as a suffix of some $s_i$, circle nodes are the other nodes of $T$. Nodes in grey are those used to represent the nodes of the dBG. Each square node stores its positions of occurrences in $S$; for simplicity, we display the starting position as a number and the word of $S$ in which it occurs as its colour, instead of showing the list of pairs $(i, j)$. The solid curved arrows are the edges of the de Bruijn graph for $k := 2$; those coloured in red correspond to Case 1 and those in blue to Case 2. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 6. The figures (a), (b) and (c) show Case 1 and Case 2 encountered when computing the arcs of $DBG_k^+$. The green node represents the node $v$, and the one in orange $sl(v)$. The dashed arcs correspond to suffix links. Arcs of $DBG_k^+$ are in solid line and coloured in red for Case 1 (a), or in blue for Case 2 (b), (c). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Case 1. $|v| = k$ (Fig. 6a). As $v \in InitExact_k$, $sl(v) \in SubInit_k$. Therefore, each child $u$ of $sl(v)$ is an element of $Init_k$. Thus, the outgoing arcs of $v$ in $DBG_k^+$ are the arcs from $v$ to the child $u$ of $sl(v)$ where the first letter of the label between $sl(v)$ and $u$ is an element of the right context of $v$. As the set of the first letters of the label between $v$ and children of $v$ is exactly $RC(v) \cap \Sigma$, the number of outgoing arcs of $v$ in $DBG_k^+$ is the number of children of $v$. To build the outgoing arcs of $v$ in $DBG_k^+$, for each child $u$ of $v$, we associate $v$ with the node of $Init_k$ between the root and $sl(u)$, i.e., $\lceil first_k(sl(u)) \rceil$.
Case 2. $|v| > k$ (Figs. 6b and 6c). We have that $sl(v)$ is a node of $V_T$. As $|v| > k$, $|sl(v)| \ge k$. Thus, there exists an element of $Init_k$ between the root and $sl(v)$. We associate $v$ with this node, i.e., $\lceil first_k(sl(v)) \rceil$.
We illustrate these two cases in Fig. 5:
Case 1. Case where $v$ is 6,6, $sl(v)$ is 7,7, the unique child $u$ of $v$ is 3, and $sl(u)$ is 4, which is in $Init_k$.
Case 2. Case where $v$ is 1, $sl(v)$ is 2, and $first_k(sl(v))$ is .
In both cases, building the arcs of $E^+$ requires following the SL of some node. The node, say $u$, pointed at by a SL may not be initial. Hence, the initial node representing the associated first $k$-mer of $u$ is the only ancestral initial node of $u$. We equip each such node $u$ with a pointer $p(u)$ that points to the only initial node on its path from the root. In other words, for any $u \notin Init_k$ such that $|u| > k$, one has $p(u) := \lceil first_k(u) \rceil$.
The algorithm to build the $DBG_k^+$ is as follows. An initial depth first traversal of $T$ allows collecting the nodes of $Init_k$ and, for each such node, setting the pointer $p(.)$ of all its descendants in the tree. Finally, to build $E^+$, one scans through $Init_k$ and for each node $v$ one adds $Succ_k(v)$ to $E^+$ using the formula given above. Altogether this algorithm takes a time linear in the size of $T$. Moreover, the number of arcs in $E^+$ is linear in the total number of children of initial nodes. This gives us the following result.
Theorem 3. For a set of words $S$, building the de Bruijn Graph of order $k$, $DBG_k^+$, takes linear time and space in $|T|$, i.e., in $\|S\|$.
3.1.3. Construction of $CDBG_k^+$
In Section 2.4, we have seen an algorithm that allows computing directly $CDBG_k^+$ provided that one can determine if a node $v$ is right extensible and if $next(v)$ is left extensible, where $next(v)$ denotes the only successor of $v$. Let us see how to compute the extensibility in the case of a Suffix Tree.
By applying Proposition 6 in the case of a tree, for an element $v$ of $Init_k$, $first_k(v)$ is right extensible if and only if $|v| > k$ or $\#(Children(v)) = 1$. Thus checking the right extensibility of a node takes constant time.
For the left extensibility of the single successor of a node, one only needs the size of the support of some nodes (Proposition 7). Let us see first how to compute $\#(Support(.))$ on the tree, and then how to apply Proposition 7.
Proposition 10. Let $v$ be a word of $F(S)$ and $V_T(v)$ denote the set of nodes of the subtree rooted in $v$. Then
$Support(v) = \bigcup_{v' \in V_T(v)} Suff(v')$.
Along a traversal of the tree, we can compute and store $\#(Support(v))$ and $\#(Support(v) \cap \{(i, 1) \mid 1 \le i \le n\})$ for each node $v$ in linear time in $|T|$.
Let $v$ be a word of $Init_k$ such that $first_k(v)$ is right extensible.
Case 1. If $|v| = k$, then $first_k(v) = v$ and $\#(Children(v)) = 1$. Let $u$ be the only child of $v$. Thus, $|u| > k$, $RC(v) \cap \Sigma = \{u[k+1]\}$, and $last_{k-1}(v)\,u[k+1] = first_k(sl(u))$. Hence, $\#(Support(v)) = \#(Support(first_k(sl(u))) \setminus \{(i, 1) \mid 1 \le i \le n\})$ and by Proposition 7, $first_k(sl(u))$ is left extensible.
Case 2. If $|v| > k$, then $RC(first_k(v)) \cap \Sigma = \{v[k+1]\}$ and
$last_{k-1}(first_k(v))\,v[k+1] = last_k(first_{k+1}(v)) = first_k(sl(v))$.
By Proposition 7, $first_k(sl(v))$ is left extensible if and only if
$\#(Support(first_k(v))) = \#(Support(first_k(sl(v))) \setminus \{(i, 1) \mid 1 \le i \le n\})$.
As $\#(Support(first_k(v))) = \#(Support(\lceil first_k(v) \rceil))$ and $\#(Support(v) \setminus \{(i, 1) \mid 1 \le i \le n\}) = \#(Support(v)) - \#(Support(v) \cap \{(i, 1) \mid 1 \le i \le n\})$, determining if $next(v)$ is left extensible takes constant time.
To conclude, as for any initial node $v$, we can compute in $O(1)$ time its set of successors $Succ_k(v)$, its right extensibility, and the left extensibility of its single successor, we can readily apply Algorithm 2 to build $CDBG_k^+$ and we obtain a complexity that is linear in the size of $DBG_k^+$, since each successor is accessed only once. This yields Theorem 4.
Theorem 4. For a set of words $S$, building the Contracted de Bruijn Graph of order $k$, $CDBG_k^+$, takes linear time and space in $|T|$, i.e., in $\|S\|$.
3.2. From a generalised suffix array
In the previous subsections we have shown how to build de Bruijn graphs from suffix trees. Suffix trees are very elegant data structures, but they are too space-consuming in practice. In many applications they have been replaced by suffix arrays, which are equivalent data structures and are more space-economical. We will now show how to build de Bruijn graphs from suffix arrays.
Let $SA$ and $LCP$ be the generalised enhanced suffix array of $S$:
• $\forall\, 1 \le i < \|S\|$, if $SA[i] = (g,h)$ and $SA[i+1] = (g',h')$, then $s_g[h\,..\,|s_g|] < s_{g'}[h'\,..\,|s_{g'}|]$,
• $\forall\, 2 \le i \le \|S\|$, $LCP[i]$ is the length of the longest common prefix between the suffixes stored in $SA[i-1]$ and in $SA[i]$.

Let us recall the definition of an lcp-interval.
Definition 4 ([23]). An interval $[i,j]$, $1 \le i < j \le \|S\|$, is called an lcp-interval of value $\ell$, also denoted by $\ell$-$[i,j]$, iff:
1. $LCP[i] < \ell$,
2. $LCP[g] \ge \ell$ for $i < g \le j$,
3. $LCP[g] = \ell$ for at least one $g$ such that $i < g \le j$,
4. $LCP[j+1] < \ell$.

Let us now recall the definitions of the previous and next smaller values (PSV and NSV) arrays.
Definition 5 ([23]). For $2 \le i \le \|S\|$:
• $PSV[i] = \max\{j \mid 1 \le j < i \text{ and } LCP[j] < LCP[i]\}$,
• $NSV[i] = \min\{j \mid i < j \le \|S\|+1 \text{ and } LCP[j] < LCP[i]\}$.

Recall that if $2 \le i \le \|S\|$ then $[PSV[i], NSV[i]-1]$ is an lcp-interval of value $LCP[i]$. The direct inclusion among lcp-intervals defines a tree relationship called the lcp-interval tree (see [23, Def. 4.4.3, p. 87]). Given an lcp-interval $\ell$-$[i,j]$, its parent lcp-interval $\ell'$-$[i',j']$ can be easily computed in constant time using the arrays $LCP$, $PSV$ and $NSV$. Then:
• $\mathit{Init}_k$ consists of:
  – the lcp-intervals $\ell$-$[i,j]$ such that $\ell \ge k$ and the parent interval $\ell'$-$[i',j']$ of $\ell$-$[i,j]$ is such that $\ell' < k$ (the associated string is $s_{SA[i].g}[SA[i].h\,..\,SA[i].h+\ell-1]$);
  – the positions $SA[i] = (g,h)$ such that $i$ is not contained in any lcp-interval $\ell'$-$[i',j']$ with $\ell' \ge k$, and $h \le |s_g| - k + 1$ (the associated string is $s_g[h\,..\,|s_g|]$);
• $\mathit{InitExact}_k$ is composed of the lcp-intervals $k$-$[i,j]$;
• $\mathit{SubInit}_k = \mathit{InitExact}_{k-1}$.

Actually, the lcp-interval tree does not need to be explicitly built, and the sets can be computed by a single scan of the $SA$ and $LCP$ arrays.
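The following Python sketch illustrates Definition 5 and one possible form of the single scan mentioned above. It rests on several assumptions that are not taken from the text: the function names are hypothetical, the suffix array is built by naive sorting (so the construction itself is not linear), equal suffixes of different words are ordered by their pair $(g,h)$, and the sentinels $LCP[1] = LCP[\|S\|+1] = -1$ are used. The scan reports, for each element of $\mathit{Init}_k$, its interval and the $k$-mer it starts with.

```python
def generalised_sa_lcp(words):
    """Naive generalised suffix array (1-based pairs (g, h)) and LCP array.

    Assumptions of this sketch: ties between equal suffixes are broken by
    (g, h), and LCP[1] = LCP[n + 1] = -1 act as sentinels (index 0 unused).
    """
    suffixes = sorted(
        (w[h:], g + 1, h + 1) for g, w in enumerate(words) for h in range(len(w))
    )
    n = len(suffixes)
    sa = [None] + [(g, h) for _, g, h in suffixes]
    lcp = [None, -1] + [0] * (n - 1) + [-1]
    for i in range(2, n + 1):
        a, b = suffixes[i - 2][0], suffixes[i - 1][0]
        common = 0
        while common < min(len(a), len(b)) and a[common] == b[common]:
            common += 1
        lcp[i] = common
    return sa, lcp


def psv_nsv(lcp):
    """PSV and NSV of Definition 5 for 2 <= i <= n, with two stack passes."""
    n = len(lcp) - 2
    psv, nsv = [None] * (n + 1), [None] * (n + 1)
    stack = [1]                       # the sentinel LCP[1] = -1 is never popped
    for i in range(2, n + 1):
        while lcp[stack[-1]] >= lcp[i]:
            stack.pop()
        psv[i] = stack[-1]
        stack.append(i)
    stack = [n + 1]                   # the sentinel LCP[n + 1] = -1
    for i in range(n, 1, -1):
        while lcp[stack[-1]] >= lcp[i]:
            stack.pop()
        nsv[i] = stack[-1]
        stack.append(i)
    return psv, nsv


def init_k_scan(words, sa, lcp, k):
    """Single left-to-right scan returning triples (i, j, kmer).

    A maximal run i..j with LCP[g] >= k for i < g <= j is an lcp-interval of
    value >= k whose parent has value < k; an isolated position is kept only
    if its suffix has length at least k. The support size is j - i + 1.
    """
    n = len(sa) - 1
    out, i = [], 1
    while i <= n:
        j = i
        while j < n and lcp[j + 1] >= k:
            j += 1
        g, h = sa[i]
        if j > i or h <= len(words[g - 1]) - k + 1:
            out.append((i, j, words[g - 1][h - 1:h - 1 + k]))
        i = j + 1
    return out
```

For instance, with the two words `abab` and `bab` and `k = 2`, the scan groups the three occurrences of `ab` in the interval `[1, 3]` and the two occurrences of `ba` in the interval `[6, 7]`.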
For an lcp-interval $\ell$-$[i,j] \in \mathit{Init}_k$ we have $\#(\mathit{Support}(s_{SA[i].g}[SA[i].h\,..\,SA[i].h+k-1])) = j - i + 1$.

Theorem 5. The de Bruijn graph of order $k$, $CDBG^+_k$, for a set of words $S$ can be built in time and space that are linear in $\|S\|$ using the generalised suffix array of $S$.

4. Transition from a truncated structure to de Bruijn graphs
This section is organised as follows. In Section 4.1, we define a simple condition that a set of input strings must satisfy to allow building a generalised index, and we sketch a modification of McCreight’s algorithm [14] for doing so. In Section 4.2, we introduce the reduced truncated suffix tree and specialise the previous algorithm for constructing it efficiently. Finally, in Section 4.3 we show how to construct both the de Bruijn Graph and its contracted version in optimal time from the reduced truncated suffix tree.
4.1. Set of chains of suffix-dependant strings and tree

Here, we introduce the notion of suffix dependence between strings, and the notion of chain of suffix-dependant strings, in order to define a unified index that generalises both the suffix tree [14] and the truncated suffix tree [18]. First, let us define the concept of suffix-dependant strings and of chains of suffix-dependant strings.
Definition 6.
1. A string $x$ is said to be suffix-dependant of another string $y$ if $x[2\,..\,|x|]$ is a prefix of $y$.
2. Let $w$ be a string and $m$ be a positive integer smaller than $|w| - 1$. An $m$-tuple of strings $(x_1, \ldots, x_m)$ is a chain of suffix-dependant strings of $w$ if $x_1$ is a prefix of $w$ and for each $i \in [2,m]$, $x_i$ is a prefix of $w[i, |w|]$ such that $|x_i| \ge |x_{i-1}| - 1$.

Let
$\mathcal{R} = \{C_1, \ldots, C_n\}$ be a set of tuples such that for each $i \in [1,n]$, $C_i$ is a chain of suffix-dependant strings of the string $s_i$. For $i \in [1,n]$ and $j \in [1,|C_i|]$, $C_i[j]$ is the $j$th string of the tuple $C_i$. Let $\overline{\mathcal{R}} = \{\overline{C}_1, \ldots, \overline{C}_n\}$ be the set of tuples such that for each $i \in [1,n]$ and $j \in [1,|C_i|]$, $\overline{C}_i[j] = |C_i[j]|$, i.e. $\overline{\mathcal{R}}$ contains tuples of lengths.

With $\overline{\mathcal{R}}$ and $S$, we can easily compute $\mathcal{R}$. In the sequel, we use $\mathcal{R}$ to demonstrate our results, and $\overline{\mathcal{R}}$ to state the complexities of algorithms. Indeed, in the case where $C_i$ is the tuple of all suffixes of $s_i$, the size of $C_i$ is linear in $|s_i|^2$ but that of $\overline{C}_i$ is linear in $|s_i|$.

Let $w$ be a string; $w$ may occur in distinct tuples of
$\mathcal{R}$. Thus, we define $N(w)$ as the set of pairs $(i,j)$ such that $w = C_i[j]$. In other words, $N(w)$ is the set of coordinates of the elements of $\mathcal{R}$ that are equal to $w$.

We define a contracted version of the well-known Aho–Corasick tree [17]. In fact, we apply nearly the same contraction process that turns the trie of a word into its compact Suffix Tree [17]. Consider the Aho–Corasick tree of $S$, in which each node represents a prefix of words in $S$. We contract the non-branching parts of the branches, except that we keep all nodes representing a word that belongs to a tuple in $\mathcal{R}$. From now on, let $T(\mathcal{R})$ denote this contracted version of the Aho–Corasick tree of $S$. $\mathcal{N}$ and $\mathcal{L}$ denote respectively the set of nodes and the set of leaves of $T(\mathcal{R})$. Furthermore, we define for each node $v$ of $T(\mathcal{R})$ two weights:
• $s(v)$ is the number of times that an element of a tuple of $\mathcal{R}$ is equal to the word represented by $v$ (i.e., $s(v) := |N(v)|$).
• $t(v)$ is the number of times that the first element of a tuple of $\mathcal{R}$ is equal to the word represented by $v$ (i.e., $t(v) := |\{(i,1) \in N(v) \mid i \in [1,n]\}|$).

Let $w$ be a string; we put $\mathit{Succ}(w) = \{(i,j) \mid (i,j-1) \in N(w) \text{ and } j \le |C_i|\}$. We define $\mathcal{H}$ as the subset of $\mathcal{L}$ such that:
$\mathcal{H} := \{u \in \mathcal{L} \mid \exists\, C \in \mathcal{R} \text{ and } j < |C| \text{ such that } u = C[j]\}$.
It is equivalent to say that $\mathcal{H} = \{u \in \mathcal{L} \mid \mathit{Succ}(u) \text{ is not empty}\}$. A mapping $m$ from $\mathcal{H}$ to $\mathcal{N}$ is called a possible link if for each node $v$ in $\mathcal{H}$, $\exists\, (i,j) \in \mathit{Succ}(v)$ such that $m(v) = C_i[j]$. Below we present an algorithm that constructs $T(\mathcal{R})$, and computes for each node $v$ in $\mathcal{N}$ the weights $s(v)$ and $t(v)$ and a possible link $P_0$.
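To make these definitions concrete, here is a small Python sketch that checks the chain condition of Definition 6 and computes $N(w)$, the weights $s$ and $t$, and $\mathit{Succ}(w)$ directly from their definitions. The function names are hypothetical, words are used as dictionary keys instead of tree nodes, and the bound on $m$ in Definition 6 is not checked; the tree $T(\mathcal{R})$ itself is not built here.

```python
from collections import defaultdict


def is_chain(w, xs):
    """Chain condition of Definition 6 (the bound on m is not checked)."""
    if not xs or not w.startswith(xs[0]):
        return False
    for idx in range(1, len(xs)):          # 0-based idx is position idx + 1
        if not w[idx:].startswith(xs[idx]):
            return False
        if len(xs[idx]) < len(xs[idx - 1]) - 1:
            return False
    return True


def coordinates(chains):
    """N(w): set of 1-based pairs (i, j) such that C_i[j] == w."""
    coords = defaultdict(set)
    for i, chain in enumerate(chains, start=1):
        for j, x in enumerate(chain, start=1):
            coords[x].add((i, j))
    return coords


def weights_and_succ(chains):
    """Weights s, t and the sets Succ(w), computed from N(w)."""
    coords = coordinates(chains)
    s = {w: len(c) for w, c in coords.items()}
    t = {w: sum(1 for (_, j) in c if j == 1) for w, c in coords.items()}
    succ = {
        w: {(i, j + 1) for (i, j) in c if j + 1 <= len(chains[i - 1])}
        for w, c in coords.items()
    }
    return s, t, succ


# Two toy chains: ("abc", "bc") for the word "abc", ("bc", "c") for "bc".
toy_chains = [["abc", "bc"], ["bc", "c"]]
toy_s, toy_t, toy_succ = weights_and_succ(toy_chains)
```

In this toy input the word `bc` appears once as a first element and once as a second element, so it has two coordinates, one of which is of the form $(i,1)$, and a single successor coordinate.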
Construction of $T(\mathcal{R})$. Now, we give an algorithm to construct $T(\mathcal{R})$. We use the version of McCreight’s algorithm given by Na et al. [18] on our input, and we build for each leaf $v$ the values $s(v)$, $t(v)$ and $P_0(v)$. For building $T(\mathcal{R})$, we start with a tree that contains only the root. Then, for each word $w$ in every chain $C$, we create or update (if it exists) the node $w$ as follows. Assume that we keep in memory the node $v$ that has been processed just before $w$.
If $w$ is the first word of $C$, we go down from the root by comparing $w$ to the labels of the tree. If we create the node $w$, $s(w)$ and $t(w)$ are initialised to 1, and $P_0(w)$ to nil. If $w$ already exists in the tree, we increment $s(w)$ and $t(w)$ by 1.
If $w$ is not the first word of $C$, we start from $v$, and as in McCreight’s algorithm, we create or arrive at the node representing $w$. If we need to create this node, $s(w)$ is initialised to 1, $t(w)$ to 0, and $P_0(w)$ to nil. Otherwise, we add 1 to $s(w)$. We set $P_0(v) = w$. The loop continues with the next word until the end, and we obtain $T(\mathcal{R})$.

Theorem 6. For a set of chains of suffix-dependant strings $\mathcal{R}$, we can construct $T(\mathcal{R})$ in $O(\|S\|)$ time and space.

Proof. To begin with, let us prove that $T(\mathcal{R})$ is in $O(\|S\|)$ space. Its number of leaves equals $\sum_{C \in \mathcal{R}} |C|$. Hence, its number of nodes is at most $2\sum_{C \in \mathcal{R}} |C| - 1 \le 2\|S\|$, and its number of edges is at most $2\|S\|$. Thus the size of $T(\mathcal{R})$ is in $O(\|S\|)$.

Clearly, the construction algorithm of $T(\mathcal{R})$ computes both weights $s(\cdot)$ and $t(\cdot)$, and the possible link $P_0(\cdot)$ correctly. For the complexity, for each chain of suffix-dependant strings $C_i$ of $\mathcal{R}$, the length of the traversed path in the tree is equal to $|w_i|$, thanks to the use of the suffix links. Thus, as in McCreight’s algorithm, the complexity is in $O(\|S\|)$. □
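The bookkeeping performed by the construction loop can be sketched in Python as follows. This is a deliberately naive counterpart and not the algorithm of Theorem 6: nodes are keyed by the word they represent in a dictionary, so neither the compact tree nor the suffix-link navigation of McCreight’s algorithm is reproduced, and the linear time bound does not apply. The function name `build_weights_and_link` is hypothetical.

```python
def build_weights_and_link(chains):
    """Compute s, t and a link P0 by processing the words chain by chain."""
    s, t, p0 = {}, {}, {}
    for chain in chains:
        v = None                       # node processed just before w
        for w in chain:
            if w not in s:             # create the node for w
                s[w], t[w], p0[w] = 0, 0, None
            s[w] += 1
            if v is None:
                t[w] += 1              # w is the first word of its chain
            else:
                p0[v] = w              # set P0(v) = w
            v = w
    return s, t, p0


toy_s, toy_t, toy_p0 = build_weights_and_link([["abc", "bc"], ["bc", "c"]])
```

Since $P_0(v)$ is always overwritten with the word that follows some occurrence of $v$ in a chain, every word that has a successor ends up linked to one of its successors, which is the property required of a possible link.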
Now, we are equipped with an algorithm that builds $T(\mathcal{R})$ for any set of chains of suffix-dependant strings. Let us review some instances of sets $S$ for which $T(\mathcal{R})$ is in fact a well-known tree.
• If $\mathcal{C} := \bigcup_{w \in S} \{\text{tuple of suffixes of } w\}$, then $T(\mathcal{C})$ is the Generalised Suffix Tree of $S$ (see Fig. 7a). The restricted mapping $sl(\cdot)$ is an example of a possible link.
• If $B_k := \bigcup_{w \in S} \{\text{tuple of } k\text{-mers of } w \text{ and suffixes of length } k' < k \text{ of } w\}$, then $T(B_k)$ is the generalised $k$-truncated suffix tree of $S$, as defined in [19] (which generalises the $k$-truncated suffix tree of Na et al. [18]).
• If $A_k := \bigcup_{w \in S} \{\text{tuple of } (k+1)\text{-mers of } w \text{ and suffixes of length } k \text{ of } w\}$, then $T(A_k)$ is the truncated suffix tree that we define below in Section 4.2 (see Fig. 7b).

4.2. Our truncated suffix tree
Fig. 7. (a) The generalised suffix tree for the set of words {bacbab, bbacbaa, bcaacb, cbaac, cbabcaa}. The part above the green line corresponds to the TST $T(A_2)$, which is shown in (b). (b) The truncated suffix tree $T(A_2)$ for the same set of words. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
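The tuples underlying $T(A_k)$, that is, all $(k+1)$-mers of a word followed by its suffix of length $k$ (formalised in Definition 7), can be enumerated with a few lines of Python. The function name `chain_a` is hypothetical and the sketch assumes that every word has length at least $k$.

```python
def chain_a(w, k):
    """Tuple A_{k,i} for one word w: every (k+1)-mer of w, then the suffix
    of length k (1-based positions j = 1 .. |w| - k + 1)."""
    # Slicing past the end of w truncates, which yields the final k-suffix.
    return [w[j:j + k + 1] for j in range(len(w) - k + 1)]


fig7_words = ["bacbab", "bbacbaa", "bcaacb", "cbaac", "cbabcaa"]
fig7_chains = [chain_a(w, 2) for w in fig7_words]
```

For the first word of Fig. 7 and $k = 2$, this gives the tuple (bac, acb, cba, bab, ab).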
Definition 7.
1. For all $i \in [1,|S|]$ and $j \in [1,|s_i|-k+1]$, $A_{k,i}$ denotes the tuple such that its $j$th component is defined by
$A_{k,i}[j] := \begin{cases} w_i[j, j+k] & \text{if } j \le |w_i| - k, \\ w_i[j, |w_i|] & \text{otherwise,} \end{cases}$
2. and $A_k$ is the set of these tuples: $A_k := \bigcup_{i=1}^{n} A_{k,i}$.

Proposition 11.
1. $A_{k,i}$ is a chain of suffix-dependant strings of $s_i$.
2. Moreover, $\{w \in A_{k,i} \mid A_{k,i} \in A_k\} = F_{k+1}(S) \cup \mathit{Suff}_k(S)$.

Proof.
1. For all $j$