
Linking indexing data structures to de Bruijn graphs: Construction and update


Academic year: 2021


HAL Id: lirmm-01617207

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01617207

Submitted on 16 Oct 2017


Bastien Cazaux, Thierry Lecroq, Eric Rivals

To cite this version:

Bastien Cazaux, Thierry Lecroq, Eric Rivals. Linking indexing data structures to de Bruijn graphs: Construction and update. Journal of Computer and System Sciences, Elsevier, 2019, 104, pp.165-183. ⟨10.1016/j.jcss.2016.06.008⟩. ⟨lirmm-01617207⟩

Contents lists available at ScienceDirect

Journal of Computer and System Sciences

www.elsevier.com/locate/jcss

Linking indexing data structures to de Bruijn graphs: Construction and update

Bastien Cazaux a,b, Thierry Lecroq c, Eric Rivals a,b

a LIRMM, CNRS and Université de Montpellier, 161 rue Ada, 34095 Montpellier Cedex 5, France
b Institut Biologie Computationnelle, CNRS and Université de Montpellier, 860 rue Saint Priest, 34095 Montpellier Cedex 5, France
c Normandie Univ. & UNIROUEN, UNIHAVRE, INSA Rouen, LITIS, 76000 Rouen France

Article info

Article history:
Received 25 June 2015
Received in revised form 26 May 2016
Accepted 27 June 2016
Available online xxxx

Keywords: Index; Data structure; Suffix tree; Suffix array; Dynamic update; Overlap; Contracted de Bruijn graph; Assembly; Algorithms; Bioinformatics

Abstract

DNA sequencing technologies have tremendously increased their throughput, and hence complicated DNA assembly. Numerous assembly programs use de Bruijn graphs (dBG) built from short reads to merge these into contigs, which represent putative DNA segments. In a dBG of order k, nodes are substrings of length k of reads (or k-mers), while arcs are their k+1-mers. As analysing reads often requires indexing all their substrings, it is interesting to exhibit algorithms that directly build a dBG from a pre-existing index, and especially a contracted dBG, where non-branching paths are condensed into single nodes. Here, we exhibit linear time algorithms for constructing the full or contracted dBGs from suffix trees, suffix arrays, and truncated suffix trees. With the latter the construction uses a space that is linear in the size of the dBG. Finally, we also provide algorithms to dynamically update the order of the graph without reconstructing it.

© 2016 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

In life sciences, determining the sequence of bio-molecules is an essential step towards the understanding of their functions and interactions within an organism. Powerful sequencing technologies yield huge quantities of short sequencing reads that need to be assembled to infer the complete target sequence. These constraints favour the use of a version of the de Bruijn Graph (dBG) dedicated to genome assembly, a version which differs from the combinatorial structure invented by N.G. de Bruijn [1]. Given a set $S = \{s_1, \ldots, s_n\}$ of $n$ reads and an integer $k$, an assembly de Bruijn Graph, or for short simply de Bruijn Graph, stores each $k$-mer ($k$-long substring) occurring in the reads as a node and has an arc joining two $k$-mers if they appear as successive (and hence overlapping) $k$-mers in at least one read.

The dBG is then traversed to extract long paths, which will form the contigs, i.e., the sequence of sub-regions of the molecule. In non-repetitive regions, the layout of the reads dictates a simple path of $k$-mers without bifurcations. Any simple path between an in-branching node and the next out-branching node can then be contracted into a single arc without losing any information on the graph structure. The sequences of such simple paths are called unitigs (the contraction of unique and contigs). The version of the dBG where each such “non-branching” path is condensed into a single arc is termed the Contracted dBG (CdBG).

* Corresponding author at: LIRMM, CNRS and Université de Montpellier, 161 rue Ada, 34095 Montpellier Cedex 5, France.

E-mail addresses: bastien.cazaux@lirmm.fr (B. Cazaux), thierry.lecroq@univ-rouen.fr (T. Lecroq), rivals@lirmm.fr (E. Rivals).

http://dx.doi.org/10.1016/j.jcss.2016.06.008

0022-0000/© 2016 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Sequencing technologies of the second generation can yield hundreds of millions of reads. Compared to the overlap graph or to the string graph, which were used with previous technologies, the dBG has a number of nodes that is not proportional to the number of reads: it depends on a user controlled parameter $k$, termed the order of the dBG. Its memory usage can be fine tuned through this parameter.

In bioinformatics, dBGs are heavily exploited for genome assembly [2], but for other purposes as well. Actually, some programs mine the dBG to seek graph patterns representing mutations, large insertions/deletions, or chromosomal rearrangements [3]. Others use it to correct sequencing errors in long reads [4].

The de Bruijn Graph is usually built directly from the set of reads, which is time and space consuming. Several compact data structures for storing dBGs have been developed [5,6], including probabilistic ones [7]. The emphasis is placed on the practical space needed to store the dBGs in memory. Moreover, some recent assembly algorithms put forward the advantage of using, for the same input, multiple dBGs with increasing orders [8], thereby emphasising the need for dynamically updating the dBG. In all cases, the construction algorithms need to scan through the whole set of reads.

Several genome assembly programs use hash tables to store the $k$-mers of the reads and allow navigating through the arcs of the dBG, but these solutions suffer from several limitations regarding, e.g., functionalities and flexibility. With hash functions, it is often not possible to add extra information to the nodes, like for instance the number of times a $k$-mer is observed in the read set, which is used as a confidence measure. Hash tables make it difficult to compute the contracted dBG or to change the value of $k$. The main advantage of sophisticated hash functions is their memory footprint. For instance, Minia [9] offers a very space efficient storage to handle the dBG based on cascading Bloom filters, which are a type of hash functions. This hash table based solution was used for long read error correction and also proves efficient in that context [4].

In studies involving the analysis of sequencing reads, distinct tasks require indexing either all substrings, or the $k$-mers of the reads. For instance, fast dBG assembly programs first count the $k$-mers before building the dBG to estimate the memory needed [7]. Another example: some error correction software builds a suffix tree of all short reads to correct them [10]. Hence, before the assembly starts, the read set has already been scanned through and indexed. It can thus be efficient to enable the construction of the dBG for the subsequent assembly directly from the index rather than from scratch. For these reasons, we set out to find algorithms that transform usual indexes into a dBG or a contracted dBG. It is also of theoretical interest to build bridges between well studied indexes and this graph on words. Despite recent results [11,12], formal methods for constructing dBGs from suffix trees are an open question. In comparison, Simpson and Durbin have proposed an algorithm to build the String Graph from an FM-index [13].

Here, we present algorithms to build directly the CdBG from a Generalised Suffix Tree or from a Generalised Suffix Array of the reads [14–17]. These algorithms take space and time that are linear in the input size. These well-known data structures index all substrings of the reads, and not only their $k$-mers. This results in one drawback and in one advantage.

The drawback is their space occupancy. We will then consider an indexing data structure that reduces the set of indexed substrings: the truncated suffix tree [18,19]. We introduce the reduced truncated suffix tree (TST) and then show how to construct with this index both the dBG and CdBG in time and space that are linear in the size of the final dBG, rather than in the cumulated length of the reads. By size of the dBG we mean the sum of the number of nodes plus the number of arcs. This algorithm achieves an optimal time and space complexity.

The advantage is the counterpart: as substrings of all lengths are indexed, it allows updating the order of the graph, that is, changing dynamically the value of $k$ without reconstructing the dBG. Finally, we provide efficient algorithms for increasing or decreasing the value of $k$. Of course, if one uses the truncated suffix tree instead of the full suffix tree, only some updates remain possible. Our results nevertheless remain applicable to the truncated suffix tree, where the order can be dynamically decreased.

This article includes results that appeared in [20,21].

1.1. Indexing data structures

Suffix trees are well-known indexing data structures that enable storing and retrieving all the factors of a given string. The suffix tree of a string $y$ of length $s$ can be built in time and space in $O(s)$ on a constant size alphabet [14,22]. Then, it is possible to check if a pattern $x$ of length $m$ is a factor of a string $y$ of $S$ in time $O(m)$. Counting the number of occurrences of $x$ in $y$ can also be done in time $O(m)$, while enumerating the positions where $x$ occurs in $y$ can be performed in time $O(m + occ)$, where $occ$ denotes the number of occurrences of $x$ in $y$. Suffix trees can be adapted to a finite set of strings and are then called Generalised Suffix Trees (GSTs). Thus, given a set $S$ of $n$ strings of total length $\|S\|$ on a constant size alphabet, the generalised suffix tree for $S$ can be built in time and space $O(\|S\|)$. For a detailed exposition of properties of suffix trees we refer the reader to [17]. Suffix trees have been widely studied and used in a large number of applications (see [15] and [17]). In practice, they consume too much space and are often replaced by the more economical suffix arrays [16], which have the same properties [23].

When one is only interested in factors of a given length, truncated suffix trees only store the factors of length up to a given constant $k$ of a given string. They can also be built in linear time and space [18]. In practice, truncated suffix trees save a lot of nodes compared to suffix trees.

Fig. 1. $S := \{bacbab, cbabcaa, bcaacb, cbaac, bbacbaa\}$ is a set of words. Therefore, we have $Support(ba) = \{(1,1), (1,4), (2,2), (4,2), (5,2), (5,5)\}$, $RC(ba) = \{\varepsilon, c, cb, cba, cbab, b, bc, bca, bcaa, a, ac, cbaa\}$, $LC(ba) = \{\varepsilon, c, ac, bac, b, bbac\}$ and $d(ba) = 0$. One has $RC(ba) \cap \Sigma = \{a, b, c\}$. Thus, the word $ba$ is not right extensible in $S$ (see Definition 2).

2. Definitions of de Bruijn graphs

2.1. Notation about strings

Here we introduce a notation and basic definitions.

An alphabet $\Sigma$ is a finite set of letters. A finite sequence of elements of $\Sigma$ is called a word or a string. The set of all words over $\Sigma$ is denoted by $\Sigma^*$, and $\varepsilon$ denotes the empty word. For a word $x$, $|x|$ denotes the length of $x$. Given two words $x$ and $y$, we denote by $x \cdot y$ or simply $xy$ the concatenation of $x$ and $y$. For every $1 \le i \le j \le |x|$, $x[i]$ denotes the $i$-th letter of $x$, and $x[i..j]$ denotes the substring or factor $x[i]x[i+1] \ldots x[j]$. Let $k$ be a positive integer. If $|x| \ge k$, $first_k(x)$ is the prefix of length $k$ of $x$ and $last_k(x)$ is the suffix of length $k$ of $x$. Then a substring of length $k$ of $x$ is called a $k$-mer of $x$. For $i$ such that $1 \le i \le |x| - k + 1$, $(x)_{k,i}$ is the $k$-mer of $x$ starting in position $i$, i.e., $(x)_{k,i} = x[i..i+k-1]$. Thus we have $first_k(x) = (x)_{k,1}$ and $last_k(x) = (x)_{k,|x|-k+1}$. We denote by $\#(\Lambda)$ the cardinality of any finite set $\Lambda$.

Let $S = \{s_1, \ldots, s_n\}$ be a finite set of words. It is our running instance for all the following. Let us denote the sum of the lengths of the input strings by $\|S\| := \sum_{s_i \in S} |s_i|$. We denote by

• $F(S)$ the set of factors of words of $S$, i.e., $F(S) = \{w \in \Sigma^* \mid \exists u, v \in \Sigma^*, 1 \le i \le n, s_i = uwv\}$.
• $F_k(S)$ the set of factors of length $k$ of $S$ where $k$ is a positive integer, i.e., $F_k(S) = F(S) \cap \Sigma^k$.
• $Suff_k(S)$ is the set of suffixes of length $k$ of words of $S$.
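The notation above maps directly onto string slicing. The following is a minimal sketch, assuming Python strings for words and 1-based positions as in the text; the function names (`first_k`, `last_k`, `kmer`, `factors_k`, `total_length`) are illustrative choices, not part of the article.

```python
# Running example of Fig. 1.
S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]

def first_k(x, k):
    """first_k(x): prefix of length k of x (requires |x| >= k)."""
    assert len(x) >= k
    return x[:k]

def last_k(x, k):
    """last_k(x): suffix of length k of x (requires |x| >= k)."""
    assert len(x) >= k
    return x[len(x) - k:]

def kmer(x, k, i):
    """(x)_{k,i}: the k-mer of x starting at 1-based position i."""
    assert 1 <= i <= len(x) - k + 1
    return x[i - 1:i - 1 + k]

def factors_k(S, k):
    """F_k(S): the distinct factors of length k of the words of S."""
    return {kmer(s, k, i) for s in S for i in range(1, len(s) - k + 2)}

def total_length(S):
    """||S||: the sum of the lengths of the words of S."""
    return sum(len(s) for s in S)
```

For instance, `kmer("bacbab", 2, 3)` is the 2-mer `cb`, and `first_k(x, k)` coincides with `kmer(x, k, 1)`.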

2.2. Classical definition of de Bruijn graph

All definitions below refer to the set S; however, as S is clear from the context, we simply omit the “in S” in the notation.

For a word $w$ of $F(S)$,

• $Support(w)$ is the set of pairs $(i, j)$, where $w$ is the substring $(s_i)_{|w|,j}$. $Support(w)$ is called the support of $w$ in $S$.
• $RC(w)$ (resp. $LC(w)$) is the set of right contexts (resp. left contexts) of the word $w$ in $S$, i.e., the set of words $w'$ such that $ww' \in F(S)$ (resp. $w'w \in F(S)$).
• $\lceil w \rceil$ is the word $ww'$ where $w'$ is the longest word of $RC(w)$ such that $Support(w) = Support(ww')$. In other words, such that $w$ and $ww'$ have exactly the same support in $S$.
• $\lfloor w \rfloor$ is the word $w'$ where $w'$ is the longest prefix of $w$ such that $Support(w') \neq Support(w)$.
• $d(w) := |\lceil w \rceil| - |w|$.

In other words, $\lceil w \rceil$ is the longest extension of $w$ having the same support as $w$ in $S$, while $\lfloor w \rfloor$ is the shortest reduction of $w$ with a support different from that of $w$ in $S$. These definitions are illustrated in a running example presented in Fig. 1.

We give the definition of a de Bruijn graph for assembly (dBG for short), which differs from the original definition of a complete graph over all possible words of length $k$ stated by de Bruijn [1].
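The support, the one-letter contexts and $\lceil w \rceil$ can be checked by brute force on the running example of Fig. 1. The sketch below is only an illustration under stated assumptions: it scans the words of $S$ directly instead of using an index, positions are 1-based as in the text, and the function names are hypothetical.

```python
# Running example of Fig. 1.
S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]

def support(w, S):
    """Support(w): pairs (i, j), 1-based, such that w occurs in s_i at position j."""
    return {(i + 1, j + 1)
            for i, s in enumerate(S)
            for j in range(len(s) - len(w) + 1)
            if s.startswith(w, j)}

def right_letters(w, S):
    """RC(w) restricted to single letters: letters following an occurrence of w."""
    return {s[j + len(w)]
            for s in S
            for j in range(len(s) - len(w))
            if s.startswith(w, j)}

def left_letters(w, S):
    """LC(w) restricted to single letters: letters preceding an occurrence of w."""
    return {s[j - 1]
            for s in S
            for j in range(1, len(s) - len(w) + 1)
            if s.startswith(w, j)}

def ceil_word(w, S):
    """Longest extension of w that has exactly the same support as w."""
    while True:
        sup = support(w, S)
        ext = [a for a in sorted(right_letters(w, S)) if support(w + a, S) == sup]
        if not ext:
            return w
        w += ext[0]  # at most one letter can preserve the support

def d(w, S):
    """d(w): number of letters added to w to obtain its longest same-support extension."""
    return len(ceil_word(w, S)) - len(w)
```

On this input, `support("ba", S)` reproduces the set given in the caption of Fig. 1, and the word `bb`, which occurs only once, extends to the end of its read.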

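Definition 1. Let $k$ be a positive integer. The de Bruijn graph of order $k$ for $S$, denoted by $DBG_k^+$, is a directed graph, $DBG_k^+ := (V_k^+, E_k^+)$, whose vertices are the $k$-mers of words of $S$ and where an arc links $u$ to $v$ if and only if $u$ and $v$ are two successive $k$-mers of a word of $S$, i.e.:

$V_k^+ := F_k(S)$

$E_k^+ := \{(u, v) \in (V_k^+)^2 \mid last_{k-1}(u) = first_{k-1}(v) \text{ and } v[k] \in RC(u)\}$.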

Fig. 2. Examples of arcs from $DBG_k^+$. (a) shows letters in the right context of $ba$, and (b) the successors of node $ba$ in $DBG_2^+$; one for each letter in $RC(w) \cap \Sigma$. (c) shows letters in the left context of $ba$, and (d) the predecessors of node $ba$ in $DBG_2^+$; one for each letter in $LC(w) \cap \Sigma$.

Fig. 3. With solid arcs only, the graphs correspond to $DBG_2^+$ (a) and $DBG_3^+$ (b) for our running example. With both solid and dotted arcs, they represent $DBG_2^-$ (a) and $DBG_3^-$ (b).

An equivalent definition of $E_k^+$ can be stated using the left instead of right context:

$E_k^+ := \{(u, v) \in (V_k^+)^2 \mid last_{k-1}(u) = first_{k-1}(v) \text{ and } u[1] \in LC(v)\}$.

Examples of arcs are displayed on Fig. 2. The size of $DBG_k^+$ is defined as $size(DBG_k^+) := \#(V_k^+) + \#(E_k^+)$. Note that another, simpler definition of the arcs in the de Bruijn graph coexists with that of Definition 1. There, an arc links $u$ to $v$ if and only if $u$ overlaps $v$ by $k-1$ symbols. This graph is denoted by $DBG_k^- = (V_k^-, E_k^-)$, where:

$V_k^- := F_k(S)$

$E_k^- := \{(u, v) \in (V_k^-)^2 \mid last_{k-1}(u) = first_{k-1}(v)\}$.

The arcs of $E_k^-$ satisfy fewer constraints than those of $E_k^+$; hence, $E_k^+$ is a subset of $E_k^-$. Both definitions are illustrated on Fig. 3. Some assembly programs use $DBG_k^-$ [9]. All the algorithmic results that we obtain for $DBG_k^+$ remain valid for $DBG_k^-$. In the sequel, we focus only on $DBG_k^+$.
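To make the difference between the two graphs concrete, here is a minimal sketch that builds both directly from the words, without any index. It relies on the observation that an arc of $E_k^+$ corresponds to a $(k+1)$-mer of $S$ (its prefix and suffix of length $k$ are the two endpoints); the function names are illustrative.

```python
# Running example of Fig. 1.
S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]

def kmers(S, k):
    """F_k(S): the distinct k-mers of the words of S."""
    return {s[j:j + k] for s in S for j in range(len(s) - k + 1)}

def dbg_plus(S, k):
    """DBG_k^+: arcs link two successive k-mers of some word, i.e. one arc per (k+1)-mer."""
    V = kmers(S, k)
    E = {(x[:-1], x[1:]) for x in kmers(S, k + 1)}
    return V, E

def dbg_minus(S, k):
    """DBG_k^-: arcs link any two k-mers that overlap by k-1 symbols."""
    V = kmers(S, k)
    E = {(u, v) for u in V for v in V if u[1:] == v[:-1]}
    return V, E

V_plus, E_plus = dbg_plus(S, 2)
V_minus, E_minus = dbg_minus(S, 2)
```

On the running example with $k = 2$, both graphs share the same 8 nodes, while $E_2^+$ is a strict subset of $E_2^-$: for example, $(ab, ba)$ overlaps by one symbol but $aba$ is not a factor of any word of $S$.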

Let us introduce now the notions of extensibility for a substring of $S$ and that of a Contracted dBG (CdBG for short).

Definition 2 (Extensibility). Let $w$ be a word of $F(S)$.

• $w$ is right extensible in $S$ if and only if $\#(RC(w) \cap \Sigma) = 1$.
• $w$ is left extensible in $S$ if and only if $\#(LC(w) \cap \Sigma) = 1$.

Let $w$ be a word of $\Sigma^*$. The word $w$ is said to be a unique $k'$-mer of $S$ if and only if $k' \ge k$ and for all $i \in [1..k'-k+1]$,

Fig. 4. The graphs correspond to $CDBG_2^+$ (a) and $CDBG_3^+$ (b) for our running example.

Definition 3. A contracted de Bruijn graph of order $k$, denoted by $CDBG_k^+ = (V_{k,c}^+, E_{k,c}^+)$, is a directed graph where:

$V_{k,c}^+ = \{w \in \Sigma^* \mid w \text{ is a } k'\text{-mer unique maximal by substring and } k' \ge k\}$

$E_{k,c}^+ = \{(u, v) \in (V_{k,c}^+)^2 \mid last_{k-1}(u) = first_{k-1}(v) \text{ and } v[k] \in RC(last_k(u))\}$.

Examples of $CDBG_k^+$ are displayed on Fig. 4. Note that in the previous definition, an element $w$ in $V_{k,c}^+$ does not necessarily belong to $F(S)$, since $w$ may only exist as the substring of the agglomeration of two words of $S$. Thus, let $w$ be a $k'$-mer unique maximal by substring with $k' \ge k$:

• $last_k(w)$ is not right extensible, or $RC(last_k(w)) \cap \Sigma = \{a\}$ and $last_{k-1}(w) \cdot a$ is not left extensible,
• $first_k(w)$ is not left extensible, or $LC(first_k(w)) \cap \Sigma = \{a\}$ and $a \cdot first_{k-1}(w)$ is not right extensible.

With this argument, we have both following propositions.

Proposition 1. Let $(u, v) \in E_{k,c}^+$; $(last_k(u), first_k(v)) \in E_k^+$ and there exists $w \in V_k^+$ such that $(w, first_k(v)) \in E_k^+ \setminus \{(last_k(u), first_k(v))\}$ or $(last_k(u), w) \in E_k^+ \setminus \{(last_k(u), first_k(v))\}$.

Proposition 2. Let $(u, v) \in E_k^+$. If $u$ is right extensible and $v$ is left extensible, then there exists $w \in V_{k,c}^+$ such that $u \cdot v[k]$ is a substring of $w$. Otherwise, there exists $(u', v') \in E_{k,c}^+$ such that $u = last_k(u')$ and $v = first_k(v')$.

According to Propositions 1 and 2, $CDBG_k^+$ is the graph $DBG_k^+$ where the arcs $(u, v)$ are contracted if and only if $u$ is right extensible and $v$ is left extensible.
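The contraction rule just stated can be sketched directly on $DBG_k^+$. The code below is a hedged illustration, not the generic algorithm of the article: it assumes that right (resp. left) extensibility of a $k$-mer can be read as having exactly one successor (resp. predecessor) in $DBG_k^+$, walks each maximal chain of contractible arcs, and returns the word spelled by each chain. The name `contracted_words` is hypothetical.

```python
# Running example of Fig. 1.
S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]

def contracted_words(S, k):
    """Words spelled by the maximal chains of contractible arcs of DBG_k^+."""
    nodes = {s[j:j + k] for s in S for j in range(len(s) - k + 1)}
    succ = {u: set() for u in nodes}
    pred = {u: set() for u in nodes}
    for s in S:
        for j in range(len(s) - k):
            u, v = s[j:j + k], s[j + 1:j + k + 1]
            succ[u].add(v)
            pred[v].add(u)

    def contractible(u, v):
        # (u, v) is contracted iff u has one successor and v one predecessor
        return len(succ[u]) == 1 and len(pred[v]) == 1

    def starts_chain(v):
        # v starts a chain unless it is entered through a contractible arc
        return not (len(pred[v]) == 1 and contractible(next(iter(pred[v])), v))

    # Chain starts first; the second pass only catches isolated cycles.
    order = sorted(v for v in nodes if starts_chain(v)) + sorted(nodes)
    seen, words = set(), []
    for v in order:
        if v in seen:
            continue
        word, u = v, v
        seen.add(v)
        while len(succ[u]) == 1:
            nxt = next(iter(succ[u]))
            if nxt in seen or not contractible(u, nxt):
                break
            word += nxt[-1]
            seen.add(nxt)
            u = nxt
        words.append(word)
    return words
```

On the running example with $k = 2$, the eight 2-mers collapse into five chains; for instance the path $ab \to bc \to ca$ is spelled as the single word `abca`.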

2.3. Constructive characterisation of the de Bruijn graph

Let $k$ be a positive integer. We define the following three subsets of $F(S)$.

• $InitExact_k = \{w \in F(S) \mid |w| = k \text{ and } d(w) = 0\}$
• $Init_k = \{w \in F(S) \mid |w| \ge k \text{ and } d(first_k(w)) = |w| - k\}$
• $SubInit_k = InitExact_{k-1}$

A word of $InitExact_k$ is either only the suffix of some $s_i$ or has at least two right extensions, while the first $k$-mer of a word in $Init_k \setminus InitExact_k$ has only one right extension.

Proposition 3. $InitExact_k = Init_k \cap \{w \in F(S) \mid |w| = k\}$.

Proof. Let $w \in InitExact_k$. In this case, we get $first_k(w) = w$ and $|w| - k = 0$. This means that $d(first_k(w)) = |w| - k$ and therefore $w \in Init_k$. □

For $w$ an element of $Init_k$, $first_k(w)$ is a $k$-mer of $S$. Given two distinct words $w_1$ and $w_2$ of $Init_k$, $first_k(w_1)$ and $first_k(w_2)$ are distinct $k$-mers of $S$. Furthermore, for each $k$-mer $w'$ of $S$, there exists a word $w$ of $Init_k$ such that $first_k(w) = w'$. From this, we get the following proposition.

Proposition 4. There exists a bijection between $Init_k$ and the set of the $k$-mers of $S$.

According to Definition 1 and Proposition 4, each vertex of $DBG_k^+$ can be assimilated to a unique element of $Init_k$. As the vertices of $DBG_k^-$ are identical to those of $DBG_k^+$, there exists also a bijection between $Init_k$ and the set of vertices of $DBG_k^-$.
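The bijection of Proposition 4 can be observed on the running example with a brute-force sketch. It assumes the definition of $Init_k$ given above and a naive computation of the longest same-support extension; all function names are illustrative, and the code is quadratic or worse, far from the linear-time constructions discussed in the article.

```python
# Running example of Fig. 1.
S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]

def support(w, S):
    """Support(w): 1-based pairs (i, j) where w occurs in s_i at position j."""
    return {(i + 1, j + 1)
            for i, s in enumerate(S)
            for j in range(len(s) - len(w) + 1)
            if s.startswith(w, j)}

def ceil_word(w, S):
    """Longest extension of w with exactly the same support as w."""
    alphabet = sorted(set("".join(S)))
    while True:
        sup = support(w, S)
        ext = [a for a in alphabet if support(w + a, S) == sup]
        if not ext:
            return w
        w += ext[0]

def kmers(S, k):
    return {s[j:j + k] for s in S for j in range(len(s) - k + 1)}

def init_k(S, k):
    """Init_k by definition: factors w with |w| >= k and d(first_k(w)) = |w| - k."""
    factors = {s[i:j] for s in S
               for i in range(len(s))
               for j in range(i + k, len(s) + 1)}
    # d(first_k(w)) = |w| - k  is equivalent to  |ceil(first_k(w))| = |w|
    return {w for w in factors if len(ceil_word(w[:k], S)) == len(w)}

init_2 = init_k(S, 2)
```

Each element of `init_2` is the longest same-support extension of its own first 2-mer, and mapping every word to its first 2-mer yields each 2-mer of $S$ exactly once.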

To define the arcs between the words of $Init_k$, which correspond to arcs of $DBG_k^+$, we need the following proposition, which states that each single letter that is a right extension of $w$ gives rise to a single arc.

Proposition 5. For $w \in InitExact_k$ and $a \in \Sigma \cap RC(w)$, there exists a unique $w' \in Init_k$ such that $last_{k-1}(w)a$ is a prefix of $w'$.

Proof. Let $w$ be a word of $InitExact_k$ and $a$ a letter of $RC(w)$. By definition of right context, $last_{k-1}(w)a \in F(S)$. As $|last_{k-1}(w)a| = k$, there exists $w'$ such that $last_{k-1}(w)a$ is a prefix of $w'$ and $|last_{k-1}(w)a| + d(last_{k-1}(w)a) = |w'|$. By definition of $Init_k$, $w' \in Init_k$. □

The set $Init_k$ represents the nodes of $DBG_k^+$. Let us now build the set of arcs that is isomorphic to $E_k^+$. Let $w$ be a word of $Init_k$ and $Succ_k(w)$ denote the set of successors of $first_k(w)$: $Succ_k(w) := \{x \in Init_k \mid (first_k(w), first_k(x)) \in E_k^+\}$. We know that for each letter $a$ in $RC(w)$, there exists an arc from $first_k(w)$ to $first_k(last_{|w|-1}(w)a)$ in $DBG_k^+$. We consider two cases depending on the length of $w$:

Case 1. $|w| = k$.
According to Proposition 3, $w \in InitExact_k$ and hence $last_{k-1}(w) \in SubInit_k$. Therefore, the outgoing arcs of $w$ in $DBG_k^+$ are the arcs from $w$ to $w'$ satisfying the condition of Proposition 5. Then,

$Succ_k(w) = \bigcup_{a \in \Sigma \cap RC(w)} \{\lceil last_{k-1}(w)a \rceil\}$.

Case 2. $|w| > k$.
As $w$ is longer than $k$, it contains the next $k$-mer, namely $w[2..k+1]$; hence $first_k(last_{|w|-1}(w)a) = first_k(last_{|w|-1}(w))$, and there exists a unique outgoing arc of $w$: that from $w$ to $\lceil w[2..k+1] \rceil$. Indeed, by definition of $Init_k$, $\lceil w[2..k+1] \rceil \in Init_k$, and thus $Succ_k(w) = \{\lceil w[2..k+1] \rceil\}$.

Now, we can build integrally $DBG_k^+$ or more exactly an isomorphic graph of $DBG_k^+$.

Theorem 1. With the sets $Init_k$, $InitExact_k$ and $SubInit_k$, we can build an isomorphic graph of $DBG_k^+$ in linear time in the size of these sets.

For simplicity, from now on, we identify the graph we build with $DBG_k^+$.
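The two cases for $Succ_k(w)$ can be checked against the arcs of $DBG_k^+$ on the running example. The sketch below is a brute-force illustration under stated assumptions: $Init_k$ is obtained as the set of longest same-support extensions of the $k$-mers, Python slices are 0-based (so `w[1:k + 1]` is the next $k$-mer inside $w$), and the function names are hypothetical.

```python
# Running example of Fig. 1.
S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]

def support(w, S):
    return {(i + 1, j + 1)
            for i, s in enumerate(S)
            for j in range(len(s) - len(w) + 1)
            if s.startswith(w, j)}

def ceil_word(w, S):
    """Longest extension of w with exactly the same support as w."""
    alphabet = sorted(set("".join(S)))
    while True:
        sup = support(w, S)
        ext = [a for a in alphabet if support(w + a, S) == sup]
        if not ext:
            return w
        w += ext[0]

def kmers(S, k):
    return {s[j:j + k] for s in S for j in range(len(s) - k + 1)}

def succ_k(w, S, k):
    """Successors of w in Init_k, following the two cases on |w|."""
    if len(w) == k:
        # Case 1: one successor per letter that follows an occurrence of w
        letters = {s[j + k] for s in S
                   for j in range(len(s) - k) if s.startswith(w, j)}
        return {ceil_word(w[1:] + a, S) for a in letters}
    # Case 2: w already contains the next k-mer
    return {ceil_word(w[1:k + 1], S)}

def arcs_from_init(S, k):
    init = {ceil_word(x, S) for x in kmers(S, k)}
    return {(w[:k], x[:k]) for w in init for x in succ_k(w, S, k)}

def arcs_from_definition(S, k):
    # an arc of E_k^+ corresponds to a (k+1)-mer of S
    return {(x[:-1], x[1:]) for x in kmers(S, k + 1)}
```

Projecting each word of $Init_k$ onto its first $k$-mer, the arcs obtained from `succ_k` coincide with those of the direct definition for the orders tried here.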

2.4. Constructive characterisation of the contracted de Bruijn graph

To do the same with $CDBG_k^+$, we first explain the algorithm that we use to build this graph, and then characterise the concepts of right and left extensibility in terms of word properties.

Our algorithm to build $CDBG_k^+$. We present a generic algorithm to build incrementally $CDBG_k^+$. It is explained in terms of words, and does not depend on any indexing data structure. In following sections, we will use this generic algorithm and explain how it can be performed efficiently using a specified indexing structure.

The main algorithm (Algorithm 2) explores $DBG_k^+$ to find the nodes kept in $CDBG_k^+$ and set all single arcs that represent whole non-branching paths of $DBG_k^+$ that are properly contracted. The key point is to find all starting nodes of simple paths and explore these paths from them; the exploration is done by Algorithm 1.

A more detailed explanation. First, note that to build $DBG_k^+$ it suffices to know the set $Succ_k(.)$ for each node. The algorithm below simulates a traversal of $DBG_k^+$ without building it, and stores only one node per unique maximal $k'$-mer of $DBG_k^+$. For such a $k'$-mer, say $m$, we choose to represent it by the node $v$ such that $first_k(v)$ is a prefix of $m$. In $DBG_k^+$, $m$ is represented by a simple (i.e., non-branching) path and $v$ is its first node. In the traversal algorithm, for a current starting node $v_c$ in $Init_k$, we traverse the simple path until we arrive at a node $u$ having several successors or such that its only successor is not left extensible (i.e., has several predecessors). In other words, until we find $u$ such that $u$ is not right extensible or $next(u)$ is not left extensible. In $DBG_k^+$, there exists a simple path between $v_c$ and $u$, and this must build a single node in $CDBG_k^+$. To contract this path, we choose to keep $v_c$, and for any successor $w$ of $u$, we insert an arc between $u$ and $w$, as this arc cannot be contracted. Noting that $w$ necessarily starts a chain (having at least a single node), if $w$ is not yet in $CDBG_k^+$, we launch a new path exploration starting from $w$; one gets that $first_k(w)$ is the prefix of a node of $CDBG_k^+$, and thus $w$ can appropriately represent the path. Now, if $w$ already belongs to $CDBG_k^+$, the case is trickier. If $v_f$ stores the first $v_c$ called by the procedure, it may not be the starting node of a path, but be anywhere inside a path. Two cases arise. If $v_f$ is considered during the while loop, then it is not at the start of a simple path: hence we must update $V$ by exchanging $v_f$ with $v_c$ and terminate the exploration. Otherwise, $v_f$ is traversed during the for loop (as the value of $w$), then it is a successor of $u$ and the beginning of a simple path: we just add an arc linking $v_c$ to $w$ and stop. Finally, if $w$ already belongs to $V$ but $w \neq v_f$, we also add an arc linking $v_c$ to $w$ and stop.

Algorithm 1: BuildAuxCDBG(V, E, v_f, v_c).
Input: The partial contracted graph CDBG_k^+ as (V, E), two nodes v_f and v_c; v_f is the initial starting node, and v_c the current starting node.
Output: The updated contracted graph (V', E'), which now contains all paths starting from v_c.
begin
    u := v_c; mark u
    // search the node ending the chain that goes through v_c
    while u is right extensible and next(u) is left extensible do
        if v_f = next(u) then
            update (v_f, i) by (v_c, i) for all (v_f, i) ∈ E
            return (V \ {v_f}, E)
        u := next(u); mark u
    // now explore the path starting in the successor of u
    for w ∈ Succ_k(u) do
        if w ∈ V then
            (V, E) := (V, E ∪ {(v_c, w)})
        else
            (V, E) := BuildAuxCDBG(V ∪ {w}, E ∪ {(v_c, w)}, v_f, w)    // explore from node w
    return (V, E)

Algorithm 2: BuildCDBG(S).
Input: A set of words S.
Output: CDBG_k^+ of S.
begin
    (V, E) := (∅, ∅)
    // search for any node v of DBG_k^+ without predecessors
    // and build CDBG_k^+ from v
    for v ∈ Init_k do
        if there exists no w such that v ∈ Succ_k(w) then
            (V, E) := (V, E) ∪ BuildAuxCDBG(V ∪ {v}, E, v, v)
    // explore DBG_k^+ from any node not yet visited
    for v_c an unmarked node of Init_k do
        (V, E) := (V, E) ∪ BuildAuxCDBG(V ∪ {v_c}, E, v_c, v_c)
    return (V, E)

The process performed by Algorithm 1 augments the partial graph $CDBG_k^+$ restrained to the nodes visited when exploring the path starting from $v_c$. It suffices now to ensure that all arcs of $DBG_k^+$ are examined, which Algorithm 2 does. More precisely, it starts by visiting the simple paths starting at nodes having no predecessors (otherwise these nodes would not be visited). Once this is done, one must explore all nodes not yet marked and continue until all nodes have been visited/marked.

From the above discussion, we obtain the following theorem.

Theorem 2. Assume one can determine in constant time for an arc $(u, v)$ of $E_{k,c}^+$, whether $u$ is right extensible and whether $v$ is left extensible. Then, with the sets $Init_k$, $InitExact_k$ and $SubInit_k$, Algorithm 2 builds a graph that is isomorphic to $CDBG_k^+$ in linear time in the size of these sets.

Remark. Executing Algorithm 2 does not require building $DBG_k^+$, since the set of successors $Succ_k(u)$ of any node $u$ is computed in constant time.

Characterisation of the concepts of right and left extensibility. By the construction of $DBG_k^+$, we get the following properties, which will turn out useful for the construction of the CdBG from specific indexes (Sections 3 and 4).

Proposition 7. Let $w$ be a word of $Init_k$ such that $first_k(w)$ is right extensible. Let the letter $a$ be the unique element of $RC(first_k(w)) \cap \Sigma$, then $last_{k-1}(first_k(w))a$ is left extensible if and only if

$\#(Support(first_k(w))) = \#(Support(last_{k-1}(first_k(w))a) \setminus \{(i, 1) \mid 1 \le i \le n\})$.

Proof. Let $(i, j)$ be a pair of $Support(first_k(w))$. We have $(i, j+1) \in Support(last_{k-1}(first_k(w)))$. As $Support(last_{k-1}(first_k(w))) = Support(last_{k-1}(first_k(w))a)$, it follows that $(i, j+1) \in Support(last_{k-1}(first_k(w))a)$.

If there exists $(i, j) \in Support(last_{k-1}(first_k(w)))$ such that $j > 0$ and $(i, j-1) \notin Support(first_k(w))$, there exists a letter $b \neq w[1]$ such that $(i, j-1) \in Support(b \cdot last_{k-1}(first_k(w)))$. Hence $(b \cdot last_{k-1}(first_k(w)), last_{k-1}(first_k(w))a)$ also belongs to $E^+$, and thus $last_{k-1}(first_k(w))a$ is not left extensible. □

In summary, this section gives a formulation of the dBG of $S$ in terms of words. Now assume that the substrings of the words are indexed in a data structure, e.g., a generalised suffix array. How can we build the dBG or the contracted graph directly from this structure? To achieve this, it suffices to compute the three sets $Init_k$, $InitExact_k$, $SubInit_k$, as well as the sets $Support(.)$ and $Succ_k(.)$ for some appropriate substrings. In the following sections, we exhibit algorithms to compute $DBG_k^+$ and $CDBG_k^+$ for two important indexing structures and for a home-made truncated data structure.

3. Transition from an indexing data structure to de Bruijn graphs

3.1. From a generalised suffix tree

Suffix Trees (ST) belong to the most studied indexing data structures. A generalised ST can index the substrings of a set of words. Generally for this sake, all words are concatenated and separated by a special symbol not occurring elsewhere. However, this trick is not compulsory, and an alternative is to keep the indication of a terminating node within each node.

3.1.1. The suffix tree and its properties

The Generalised Suffix Tree of a set of words $S$ is the suffix tree of $S$, where each word of $S$ does not necessarily finish by a letter of unique occurrence. Hence, for each node $v$ of the Generalised Suffix Tree of $S$, we keep in memory the set, denoted by $Suff(v)$, of pairs $(i, j)$ such that the word represented by $v$ is the suffix of $s_i$ starting at position $j$. Let us denote by $T$ the generalised suffix tree of $S$ (from now on, we simply say the tree) and by $V_T$ its set of nodes. For $v \in V_T$, $Children(v)$ denotes its set of children and $f(v)$ its parent. See Fig. 5 for an example of GST.

Some nodes of $T$ may have just one child. The size of the union of $Suff(v)$ for all nodes $v$ of $T$ equals the number of leaves in the generalised suffix tree when the words end with a terminating symbol. Hence, the space to store $T$ and the sets $Suff(.)$ is linear in $\|S\|$. For simplicity, we identify a node $v$ of $T$ with the word it represents. For each node $v$ of $T$, $v \in F(S)$. As all elements of $F(S)$ are not necessarily represented by a node of $T$, we give the following proposition.

Proposition 8. The set of nodes of $T$ is exactly the set of words $w$ of $F(S)$ such that $d(w) = 0$.

We recall the notion of a suffix link (SL) for any node $v$ of $T$ (leaves included). Let $sl(v)$ denote the node targeted by the suffix link of $v$, i.e., $sl(v) = v[2..|v|]$. By definition of a suffix tree, for all $w \in F(S)$, there exists a node $v$ of $T$ such that $w$ is a prefix of $v$. Let $v'$ be the node of minimal length of $T$ such that $w$ is a prefix of $v'$, then $|v'| = |w| + d(w)$, and therefore $\lceil w \rceil = v'$.

Proposition 9. Let $w \in F(S)$. Then $|\lceil w \rceil| \ge |w| > |f(\lceil w \rceil)|$, where $f(\lceil w \rceil)$ is the parent of $\lceil w \rceil$ in $T$.

Proof. As $f(\lceil w \rceil) = \lfloor w \rfloor$, the result is obvious. □

3.1.2. Construction of $DBG_k^+$

Let $[x_1..x_m]$ be the set of $k$-mers of $S$. According to the definition of $Init_k$ and to Proposition 4, $Init_k = [\lceil x_1 \rceil .. \lceil x_m \rceil]$. Thus, by Proposition 9, $Init_k = \{v \in V_T \mid |f(v)| < k \text{ and } |v| \ge k\}$. Similarly, $InitExact_k = \{v \in V_T \mid |v| = k\}$. Now, it appears clearly that $InitExact_k$ is a subset of $Init_k$, since for all $v \in V_T$, $|f(v)| < |v|$.
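The characterisation of $Init_k$ in terms of tree nodes can be illustrated without implementing a real suffix tree. The sketch below is a deliberately naive stand-in: following Proposition 8, it takes the node set to be the words $w$ of $F(S)$ with $d(w) = 0$ (plus the root, represented by the empty word), and takes the parent of a node to be its longest proper prefix that is a node. Function names are illustrative, and the running time is nowhere near linear.

```python
# Running example of Fig. 1.
S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]

def support(w, S):
    return {(i + 1, j + 1)
            for i, s in enumerate(S)
            for j in range(len(s) - len(w) + 1)
            if s.startswith(w, j)}

def ceil_word(w, S):
    """Longest extension of w with exactly the same support as w."""
    alphabet = sorted(set("".join(S)))
    while True:
        sup = support(w, S)
        ext = [a for a in alphabet if support(w + a, S) == sup]
        if not ext:
            return w
        w += ext[0]

def tree_nodes(S):
    """Node words of the tree: factors w with d(w) = 0, plus the root ''."""
    factors = {s[i:j] for s in S
               for i in range(len(s))
               for j in range(i + 1, len(s) + 1)}
    return {w for w in factors if ceil_word(w, S) == w} | {""}

def parent(v, nodes):
    """f(v): the longest proper prefix of v that is a node (v must be non-empty)."""
    return next(v[:l] for l in range(len(v) - 1, -1, -1) if v[:l] in nodes)

def init_from_tree(S, k):
    """Init_k as the nodes v with |f(v)| < k <= |v|."""
    nodes = tree_nodes(S)
    return {v for v in nodes if len(v) >= k and len(parent(v, nodes)) < k}

def kmers(S, k):
    return {s[j:j + k] for s in S for j in range(len(s) - k + 1)}
```

On the running example, the nodes selected this way are exactly the longest same-support extensions of the $k$-mers, and those of length exactly $k$ form $InitExact_k$.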

Fig. 5. The generalised suffix tree for our running example and the constructed de Bruijn graph for $k := 2$. Square nodes represent words that occur as a suffix of some $s_i$, circle nodes are the other nodes of $T$. Nodes in grey are those used to represent the nodes of the dBG. Each square node stores its positions of occurrences in $S$; for simplicity, we display the starting position as a number and the word of $S$ in which it occurs as its colour, instead of showing the list of pairs $(i, j)$. The solid curved arrows are the edges of the de Bruijn graph for $k := 2$; those coloured in red correspond to Case 1 and those in blue to Case 2. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 6. The figures (a), (b) and (c) show Case 1 and Case 2 encountered when computing the arcs of $DBG_k^+$. The green node represents the node $v$, and the one in orange $sl(v)$. The dashed arcs correspond to suffix links. Arcs of $DBG_k^+$ are in solid line and coloured in red for Case 1 (a), or in blue for Case 2 (b), (c). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Case 1. $|v| = k$ (Fig. 6a).
As $v \in InitExact_k$, $sl(v) \in SubInit_k$. Therefore, each child $u'$ of $sl(v)$ is an element of $Init_k$. Thus, the outgoing arcs of $v$ in $DBG_k^+$ are the arcs from $v$ to the child $u'$ of $sl(v)$ where the first letter of the label between $sl(v)$ and $u'$ is an element of the right context of $v$. As the set of the first letters of the label between $v$ and children of $v$ is exactly $RC(v) \cap \Sigma$, the number of outgoing arcs of $v$ in $DBG_k^+$ is the number of children of $v$. To build the outgoing arcs of $v$ in $DBG_k^+$, for each child $u$ of $v$, we associate $v$ with the node of $Init_k$ between the root and $sl(u)$, i.e., $\lceil first_k(sl(u)) \rceil$.

Case 2. $|v| > k$ (Figs. 6b and 6c).
We have that $sl(v)$ is a node of $V_T$. As $|v| > k$, $|sl(v)| \ge k$. Thus, there exists an element of $Init_k$ between the root and $sl(v)$. We associate $v$ with this node, i.e., $\lceil first_k(sl(v)) \rceil$.

We illustrate these two cases in Fig. 5:

Case 1. Case where $v$ is 6,6, $sl(v)$ is 7,7, the unique child $u$ of $v$ is 3, and $sl(u)$ is 4, which is in $Init_k$.

Case 2. Case where $v$ is 1, $sl(v)$ is 2, and $\lceil first_k(sl(v)) \rceil$ is .

In both cases, building the arcs of $E^+$ requires following the SL of some node. The node, say $u$, pointed at by a SL may not be initial. Hence, the initial node representing the associated first $k$-mer of $u$ is the only ancestral initial node of $u$. We equip each such node $u$ with a pointer $p(u)$ that points to the only initial node on its path from the root. In other words, for any $u \notin Init_k$ such that $|u| > k$, one has $p(u) := \lceil first_k(u) \rceil$.
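The two cases and the pointer $p(\cdot)$ can be sketched on a naive model of the tree. This is only an illustration under stated assumptions, not a linear-time implementation: nodes are the words $w$ with $d(w) = 0$ (Proposition 8) plus the root, the parent of a node is its longest proper prefix that is a node, the suffix link of a node `v` is the word `v[1:]`, and `p` walks up to the initial ancestor instead of being precomputed by a traversal. All names are hypothetical.

```python
from collections import defaultdict

# Running example of Fig. 1.
S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]

def support(w, S):
    return {(i, j) for i, s in enumerate(S)
            for j in range(len(s) - len(w) + 1) if s.startswith(w, j)}

def ceil_word(w, S):
    """Longest extension of w with exactly the same support as w."""
    alphabet = sorted(set("".join(S)))
    while True:
        sup = support(w, S)
        ext = [a for a in alphabet if support(w + a, S) == sup]
        if not ext:
            return w
        w += ext[0]

def tree_nodes(S):
    factors = {s[i:j] for s in S
               for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    return {w for w in factors if ceil_word(w, S) == w} | {""}

def parent(v, nodes):
    return next(v[:l] for l in range(len(v) - 1, -1, -1) if v[:l] in nodes)

def dbg_from_tree(S, k):
    nodes = tree_nodes(S)
    init = {v for v in nodes if len(v) >= k and len(parent(v, nodes)) < k}
    children = defaultdict(list)
    for v in nodes:
        if v:
            children[parent(v, nodes)].append(v)

    def p(u):
        # initial node on the path from the root to u (u itself if initial)
        while u not in init:
            u = parent(u, nodes)
        return u

    arcs = set()
    for v in init:
        if len(v) == k:
            # Case 1: one arc per child u of v, through the suffix link of u
            targets = [p(u[1:]) for u in children[v]]
        else:
            # Case 2: a single arc, through the suffix link of v
            targets = [p(v[1:])]
        arcs.update((v[:k], t[:k]) for t in targets)
    return {v[:k] for v in init}, arcs

def dbg_direct(S, k):
    kmers = lambda n: {s[j:j + n] for s in S for j in range(len(s) - n + 1)}
    return kmers(k), {(x[:-1], x[1:]) for x in kmers(k + 1)}
```

Naming each initial node by its first $k$-mer, the graph obtained this way matches the one built directly from the $k$-mers and $(k+1)$-mers of $S$ on the running example.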

The algorithm to build $DBG^+_k$ is as follows. An initial depth-first traversal of $T$ collects the nodes of $Init_k$ and, for each such node, sets the pointer $p(\cdot)$ of all its descendants in the tree. Finally, to build $E^+$, one scans through $Init_k$ and for each node $v$ one adds $Succ_k(v)$ to $E^+$ using the formula given above. Altogether this algorithm takes time linear in the size of $T$. Moreover, the number of arcs in $E^+$ is linear in the total number of children of initial nodes. This gives us the following result.

Theorem 3. For a set of words $S$, building the de Bruijn Graph of order $k$, $DBG^+_k$, takes linear time and space in $|T|$, i.e., in $\|S\|$.

3.1.3. Construction of $CDBG^+_k$

In Section 2.3, we have seen an algorithm that computes $CDBG^+_k$ directly, provided that one can determine if a node $v$ is right extensible and if $next(v)$ is left extensible, where $next(v)$ denotes the only successor of $v$. Let us see how to compute the extensibility in the case of a Suffix Tree.

By applying Proposition 6 in the case of a tree, for an element $v$ of $Init_k$, $first_k(v)$ is right extensible if and only if $|v| > k$ or $\#(Children(v)) = 1$. Thus checking the right extensibility of a node takes constant time.

For the left extensibility of the single successor of a node, one only needs the size of the support of some nodes (Proposition 7). Let us see first how to compute $\#(Support(\cdot))$ on the tree, and then how to apply Proposition 7.

Proposition 10. Let $v$ be a word of $F(S)$ and let $V_T(v)$ denote the set of nodes of the subtree rooted in $v$. Then

$$Support(v) = \bigcup_{v' \in V_T(v)} Suff(v').$$

Along a traversal of the tree, we can compute and store $\#(Support(v))$ and $\#(Support(v) \cap \{(i, 1) \mid 1 \le i \le n\})$ for each node $v$ in linear time in $|T|$.
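The traversal can be sketched as a bottom-up accumulation. The code below assumes a hypothetical representation: `children` maps a node to its children and `suff` maps a node to the list of positions $(i, j)$ at which the word it represents starts as a suffix of $s_i$. It also assumes that every position is stored at exactly one node, so that the sizes of the sets in Proposition 10 simply add up.

```python
def support_sizes(children, suff, root):
    """For every node v, return #Support(v) and the number of positions
    of Support(v) of the form (i, 1), using one traversal of the tree."""
    order, stack = [], [root]
    while stack:                          # pre-order: parents before children
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    total, first = {}, {}
    for v in reversed(order):             # children are handled before parents
        positions = suff.get(v, [])
        total[v] = len(positions)
        first[v] = sum(1 for (_, j) in positions if j == 1)
        for c in children.get(v, []):
            total[v] += total[c]
            first[v] += first[c]
    return total, first


# Toy tree: suffix tree of the single word "aba" (no terminator).
children = {"": ["a", "ba"], "a": ["aba"]}
suff = {"a": [(1, 3)], "aba": [(1, 1)], "ba": [(1, 2)]}
total, first = support_sizes(children, suff, "")
```

With both counters stored at each node, the size of $Support(v) \setminus \{(i,1) \mid 1 \le i \le n\}$ is obtained by one subtraction.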

Let $v$ be a word of $Init_k$ such that $first_k(v)$ is right extensible.

Case 1. If $|v| = k$, then $first_k(v) = v$ and $\#(Children(v)) = 1$. Let $u$ be the only child of $v$. Thus, $|u| > k$, $RC(v) \cap \Sigma = \{u[k+1]\}$, and $last_{k-1}(v)\, u[k+1] = first_k(sl(u))$. Hence,

$$\#(Support(v)) = \#(Support(first_k(sl(u))) \setminus \{(i, 1) \mid 1 \le i \le n\})$$

and by Proposition 7, $first_k(sl(u))$ is left extensible.

Case 2. If $|v| > k$, then $RC(first_k(v)) \cap \Sigma = \{v[k+1]\}$ and

$$last_{k-1}(first_k(v))\, v[k+1] = last_k(first_{k+1}(v)) = first_k(sl(v)).$$

By Proposition 7, $first_k(sl(v))$ is left extensible if and only if

$$\#(Support(first_k(v))) = \#(Support(first_k(sl(v))) \setminus \{(i, 1) \mid 1 \le i \le n\}).$$

As $\#(Support(first_k(v))) = \#(Support(v))$ and

$$\#(Support(v) \setminus \{(i, 1) \mid 1 \le i \le n\}) = \#(Support(v)) - \#(Support(v) \cap \{(i, 1) \mid 1 \le i \le n\}),$$

determining if $next(v)$ is left extensible takes constant time.

To conclude, as for any initial node $v$ we can compute in $O(1)$ time its set of successors $Succ_k(v)$, its right extensibility, and the left extensibility of its single successor, we can readily apply Algorithm 2 to build $CDBG^+_k$, and we obtain a complexity that is linear in the size of $DBG^+_k$, since each successor is accessed only once. This yields Theorem 4.

Theorem 4. For a set of words $S$, building the Contracted de Bruijn Graph of order $k$, $CDBG^+_k$, takes linear time and space in $|T|$, i.e., in $\|S\|$.

3.2. From a generalised suffix array

In the previous subsections we have shown how to build de Bruijn graphs from suffix trees. Suffix trees are very elegant data structures but they are too space-consuming in practice. In many applications they have been replaced by suffix arrays, which are equivalent data structures and are more space economical. We will now show how to build de Bruijn graphs from suffix arrays.

Let $SA$ and $LCP$ be the generalised enhanced suffix array of $S$:

• $\forall\, 1 \le i < \|S\|$, if $SA[i] = (g, h)$ and $SA[i+1] = (g', h')$ then $s_g[h\,..\,|s_g|] < s_{g'}[h'\,..\,|s_{g'}|]$,

• $\forall\, 2 \le i \le \|S\|$, $LCP[i]$ is the length of the longest common prefix between the suffixes stored in $SA[i-1]$ and in $SA[i]$, and

Let us recall the definition of an lcp-interval.

Definition 4 ([23]). An interval $[i, j]$, $1 \le i < j \le \|S\|$, is called an lcp-interval of value $\ell$, also denoted by $\ell$-$[i, j]$, iff:

1. $LCP[i] < \ell$,
2. $LCP[g] \ge \ell$ for $i < g \le j$,
3. $LCP[g] = \ell$ for at least one $g$ such that $i < g \le j$,
4. $LCP[j+1] < \ell$.
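To make the definition concrete, the sketch below builds a generalised suffix array and its LCP array naively (by sorting all suffixes, which is not linear time) and tests the four conditions of Definition 4. Several choices are assumptions of ours: arrays are 1-based with an unused slot 0, equal suffixes of different words are ordered by their position $(g, h)$, and sentinels $LCP[1] = LCP[\|S\|+1] = -1$ are used so that conditions 1 and 4 are always defined.

```python
def generalised_sa_lcp(words):
    """Naive generalised suffix array and LCP array, both 1-based."""
    sufs = sorted((w[h:], (g + 1, h + 1))
                  for g, w in enumerate(words) for h in range(len(w)))
    n = len(sufs)
    sa = [None] + [pos for _, pos in sufs]          # sa[1..n]
    lcp = [None] + [-1] * (n + 1)                   # lcp[1..n+1], sentinels -1
    for i in range(2, n + 1):
        a, b = sufs[i - 2][0], sufs[i - 1][0]
        l = 0
        while l < min(len(a), len(b)) and a[l] == b[l]:
            l += 1
        lcp[i] = l
    return sa, lcp


def is_lcp_interval(lcp, i, j, l):
    """Check the four conditions of Definition 4 for l-[i, j]."""
    n = len(lcp) - 2
    if not (1 <= i < j <= n):
        return False
    inner = lcp[i + 1:j + 1]                        # LCP[g] for i < g <= j
    return lcp[i] < l and min(inner) >= l and l in inner and lcp[j + 1] < l


sa, lcp = generalised_sa_lcp(["ab", "b"])
# Sorted suffixes: ab (1,1), b (1,2), b (2,1); the two copies of b
# form the lcp-interval 1-[2, 3].
```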

Let us now recall the definitions of the previous and next smaller values (PSV and NSV) arrays.

Definition 5 ([23]). For $2 \le i \le \|S\|$:

$$PSV[i] = \max\{j \mid 1 \le j < i \text{ and } LCP[j] < LCP[i]\},$$

$$NSV[i] = \min\{j \mid i < j \le \|S\| + 1 \text{ and } LCP[j] < LCP[i]\}.$$
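Both arrays can be filled in linear time with a stack of indices whose LCP values are increasing. This is a standard technique rather than something specific to this paper. The sketch assumes a 1-based `lcp` list with an unused slot 0 and sentinels $LCP[1] = LCP[\|S\|+1] = -1$, which guarantee that the maximum and the minimum of Definition 5 always exist.

```python
def psv_nsv(lcp):
    """Compute PSV[i] and NSV[i] for 2 <= i <= n, where n = len(lcp) - 2."""
    n = len(lcp) - 2
    psv = [None] * (n + 2)
    nsv = [None] * (n + 2)
    stack = [1]                           # LCP[1] = -1 is never popped
    for i in range(2, n + 1):
        while lcp[stack[-1]] >= lcp[i]:
            stack.pop()
        psv[i] = stack[-1]                # closest smaller value on the left
        stack.append(i)
    stack = [n + 1]                       # LCP[n+1] = -1 is never popped
    for i in range(n, 1, -1):
        while lcp[stack[-1]] >= lcp[i]:
            stack.pop()
        nsv[i] = stack[-1]                # closest smaller value on the right
        stack.append(i)
    return psv, nsv


lcp = [None, -1, 0, 2, 1, 2, -1]          # a hypothetical LCP array, n = 5
psv, nsv = psv_nsv(lcp)
# For i = 4, [PSV[4], NSV[4] - 1] = [2, 5] is an lcp-interval of value 1.
```

Each index is pushed and popped at most once per pass, which gives the linear bound.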

Recall that if $2 \le i \le \|S\|$ then $[PSV[i], NSV[i] - 1]$ is an lcp-interval of value $LCP[i]$. The direct inclusion among lcp-intervals defines a tree relationship called the lcp-interval tree (see [23, Def. 4.4.3, p. 87]). Given an lcp-interval $\ell$-$[i, j]$, its parent lcp-interval $\ell'$-$[i', j']$ can be easily computed in constant time using the arrays $LCP$, $PSV$ and $NSV$. Then:

• $Init_k$ consists of:

– the lcp-intervals $\ell$-$[i, j]$ such that $\ell \ge k$ and the parent interval $\ell'$-$[i', j']$ of $\ell$-$[i, j]$ is such that $\ell' < k$ (the associated string is $s_{SA[i].g}[SA[i].h\,..\,SA[i].h + \ell - 1]$);

– the positions $SA[i] = (g, h)$ such that $i$ is not contained in lcp-intervals $\ell$-$[i', j']$ with $\ell \ge k$ and $h \le |s_g| - k + 1$ (the associated string is $s_g[h\,..\,|s_g|]$);

• $InitExact_k$ is composed of the lcp-intervals $k$-$[i, j]$;

• $SubInit_k = InitExact_{k-1}$.

Actually, the lcp-interval tree does not need to be explicitly built, and the sets can be computed by a single scan of the $SA$ and $LCP$ arrays.

For an lcp-interval $\ell$-$[i, j] \in Init_k$ we have

$$\#(Support(s_{SA[i].g}[SA[i].h\,..\,SA[i].h + k - 1])) = j - i + 1.$$
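The single scan and the support formula can be sketched together. The code below groups the suffix array into maximal runs whose consecutive LCP values are at least $k$: a run of length two or more is an lcp-interval of $Init_k$, and a run of length one is a single position, kept only if its suffix has length at least $k$. The arrays are built naively here for self-containment, with the same assumptions as before (1-based arrays, ties broken by position, sentinels $-1$). All function names are ours.

```python
def generalised_sa_lcp(words):
    """Naive generalised suffix array and LCP array, both 1-based."""
    sufs = sorted((w[h:], (g + 1, h + 1))
                  for g, w in enumerate(words) for h in range(len(w)))
    n = len(sufs)
    sa = [None] + [pos for _, pos in sufs]
    lcp = [None] + [-1] * (n + 1)
    for i in range(2, n + 1):
        a, b = sufs[i - 2][0], sufs[i - 1][0]
        l = 0
        while l < min(len(a), len(b)) and a[l] == b[l]:
            l += 1
        lcp[i] = l
    return sa, lcp


def kmer_support_sizes(words, sa, lcp, k):
    """Map each k-mer of the words to the size j - i + 1 of its run [i, j]."""
    n = len(sa) - 1
    sizes = {}
    i = 1
    while i <= n:
        j = i
        while j + 1 <= n and lcp[j + 1] >= k:   # extend the run
            j += 1
        g, h = sa[i]
        if h <= len(words[g - 1]) - k + 1:      # the suffix has length >= k
            sizes[words[g - 1][h - 1:h - 1 + k]] = j - i + 1
        i = j + 1
    return sizes


S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]
sa, lcp = generalised_sa_lcp(S)
sizes = kmer_support_sizes(S, sa, lcp, 2)
```

On the running example, `sizes["ba"]` is the number of occurrences of ba in $S$, in agreement with the six positions of $Support(ba)$ listed in Fig. 1.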

Theorem 5. The de Bruijn graph of order $k$, $CDBG^+_k$, for a set of words $S$ can be built in a time and space that are linear in $\|S\|$ using the generalised suffix array of $S$.

4. Transition from a truncated structure to de Bruijn graphs

This section is organised as follows. In Section 4.1, we define a simple condition that a set of input strings must satisfy to allow building a generalised index and sketch a modification of McCreight's algorithm [14] for doing so. In Section 4.2, we introduce the reduced truncated suffix tree and specialise the previous algorithm for constructing it efficiently. Finally, in Section 4.3 we show how to construct both the de Bruijn Graph and its contracted version in optimal time from the reduced truncated suffix tree.

4.1. Set of chains of suffix-dependant strings and tree

Here, we introduce the notion of suffix dependence between strings, and the notion of chain of suffix-dependant strings, in order to define a unified index that generalises both the suffix tree [14] and the truncated suffix tree [18]. First, let us define the concept of suffix-dependant strings and of chains of suffix-dependant strings.

Definition 6.

1. A string $x$ is said to be suffix-dependant of another string $y$ if $x[2\,..\,|x|]$ is a prefix of $y$.

2. Let $w$ be a string and $m$ be a positive integer smaller than $|w| + 1$. An $m$-tuple of $m$ strings $(x_1, \ldots, x_m)$ is a chain of suffix-dependant strings of $w$ if $x_1$ is a prefix of $w$ and for each $i \in [2, m]$, $x_i$ is a prefix of $w[i, |w|]$ such that $|x_i| \ge |x_{i-1}| - 1$.
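Both parts of Definition 6 translate directly into checks on Python strings. The sketch below is only an illustration; the function names are ours, and indices are shifted because Python strings are 0-based while the definition is 1-based.

```python
def is_suffix_dependant(x, y):
    """x is suffix-dependant of y if x[2..|x|] is a prefix of y."""
    return y.startswith(x[1:])


def is_chain(xs, w):
    """Check that the tuple xs is a chain of suffix-dependant strings of w."""
    m = len(xs)
    if not (1 <= m <= len(w)):
        return False
    if not w.startswith(xs[0]):
        return False
    for i in range(2, m + 1):                 # 1-based index i of Definition 6
        x_i, x_prev = xs[i - 1], xs[i - 2]
        if not w[i - 1:].startswith(x_i):     # x_i is a prefix of w[i, |w|]
            return False
        if len(x_i) < len(x_prev) - 1:        # |x_i| >= |x_{i-1}| - 1
            return False
    return True


w = "bacbab"
all_suffixes = tuple(w[j:] for j in range(len(w)))
print(is_chain(all_suffixes, w))                          # True
print(is_chain(("bac", "acb", "cba", "bab", "ab"), w))    # True
print(is_chain(("bac", "a"), w))                          # False: "a" is too short
```

In a chain, the length condition ensures that each string is suffix-dependant of the next one, which is what allows suffix links to be followed from one element to its successor.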

Let $\mathcal{R} = \{C_1, \ldots, C_n\}$ be a set of tuples such that for each $i \in [1, n]$, $C_i$ is a chain of suffix-dependant strings of the string $s_i$. For $i \in [1, n]$ and $j \in [1, |C_i|]$, $C_i[j]$ is the $j$th string of the tuple $C_i$. Let $\widetilde{\mathcal{R}} = \{\widetilde{C}_1, \ldots, \widetilde{C}_n\}$ be the set of tuples such that for each $i \in [1, n]$ and $j \in [1, |C_i|]$, $\widetilde{C}_i[j] = |C_i[j]|$, i.e. $\widetilde{\mathcal{R}}$ contains tuples of lengths.

With $\widetilde{\mathcal{R}}$ and $S$, we can easily compute $\mathcal{R}$. In the sequel, we use $\mathcal{R}$ to demonstrate our results, and $\widetilde{\mathcal{R}}$ to state the complexities of algorithms. Indeed, in the case where $C_i$ is the tuple of each suffix of $s_i$, the size of $C_i$ is linear in $|s_i|^2$ but $\widetilde{C}_i$ is linear in $|s_i|$.
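Recovering a chain from its tuple of lengths is immediate, since the $j$th string of a chain of $w$ is the prefix of $w[j, |w|]$ of the given length. A minimal sketch, with a function name of our choosing:

```python
def chain_from_lengths(word, lengths):
    """Rebuild a chain of suffix-dependant strings of `word` from the
    lengths of its elements (0-based j here, 1-based in the text)."""
    return tuple(word[j:j + l] for j, l in enumerate(lengths))


w = "bacbab"
print(chain_from_lengths(w, (3, 3, 3, 3, 2)))
# ('bac', 'acb', 'cba', 'bab', 'ab')

# Tuple of all suffixes: 6 lengths describe 6 + 5 + ... + 1 = 21 characters.
suffixes = chain_from_lengths(w, (6, 5, 4, 3, 2, 1))
print(sum(len(x) for x in suffixes))   # 21
```

The second call illustrates the gap between the quadratic size of the chain and the linear size of its length tuple.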

Let $w$ be a string; $w$ may occur in distinct tuples of $\mathcal{R}$. Thus, we define $N(w)$ as the set of $(i, j)$ such that $w = C_i[j]$. In other words, $N(w)$ is the set of coordinates of the elements of $\mathcal{R}$ that are equal to $w$.

We define a contracted version of the well-known Aho–Corasick tree [17]. In fact, we apply nearly the same contraction process that turns a trie of a word into its compact Suffix Tree [17]. Consider the Aho–Corasick tree of $S$, in which each node represents a prefix of words in $S$. We contract the non-branching parts of the branches except that we keep all nodes representing a word that belongs to a tuple in $\mathcal{R}$. From now on, let $T(\mathcal{R})$ denote this contracted version of the Aho–Corasick tree of $S$.

$\mathcal{N}$ and $\mathcal{L}$ denote respectively the set of nodes and the set of leaves of $T(\mathcal{R})$. Furthermore, we define for each node $v$ of $T(\mathcal{R})$ two weights:

• $s(v)$ is the number of times that an element of a tuple of $\mathcal{R}$ is equal to the word represented by $v$ (i.e., $s(v) := |N(v)|$).

• $t(v)$ is the number of times that the first element of a tuple of $\mathcal{R}$ is equal to the word represented by $v$ (i.e., $t(v) := |\{(i, 1) \in N(v) \mid i \in [1, n]\}|$).

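Let $w$ be a string; we put $Succ(w) = \{(i, j) \mid (i, j-1) \in N(w) \text{ and } j \le |C_i|\}$. We define $\mathcal{H}$ as the subset of $\mathcal{L}$ such that:

$$\mathcal{H} := \{u \in \mathcal{L} \mid \exists\, C \in \mathcal{R} \text{ and } j < |C| \text{ such that } u = C[j]\}.$$

It is equivalent to say that $\mathcal{H} = \{u \in \mathcal{L} \mid Succ(u) \text{ is not empty}\}$. A mapping $m$ from $\mathcal{H}$ to $\mathcal{N}$ is called a possible link if for each node $v$ in $\mathcal{H}$, $\exists\, (i, j) \in Succ(v)$ such that $m(v) = C_i[j]$.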
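These definitions can be evaluated directly on a small set of chains. In the sketch below, $\mathcal{R}$ is a list of tuples of strings and words play the role of nodes, so the restriction of $\mathcal{H}$ to leaves is ignored. That simplification and all function names are ours.

```python
from collections import defaultdict


def coordinates(R):
    """N(w): the 1-based coordinates (i, j) such that w = C_i[j]."""
    N = defaultdict(list)
    for i, chain in enumerate(R, start=1):
        for j, x in enumerate(chain, start=1):
            N[x].append((i, j))
    return dict(N)


def weights(N):
    """s(w) = |N(w)| and t(w) = number of coordinates (i, 1) in N(w)."""
    s = {w: len(pos) for w, pos in N.items()}
    t = {w: sum(1 for (_, j) in pos if j == 1) for w, pos in N.items()}
    return s, t


def succ(N, R, w):
    """Succ(w): coordinates that follow an occurrence of w in its chain."""
    return [(i, j + 1) for (i, j) in N.get(w, []) if j + 1 <= len(R[i - 1])]


def is_possible_link(m, N, R):
    """Check m(v) = C_i[j] for some (i, j) in Succ(v), for every v in m."""
    return all(any(R[i - 1][j - 1] == target for (i, j) in succ(N, R, v))
               for v, target in m.items())


# Chains of bacbab and cbaac made of their 3-mers followed by their last 2-mer.
R = [("bac", "acb", "cba", "bab", "ab"), ("cba", "baa", "aac", "ac")]
N = coordinates(R)
s, t = weights(N)
print(N["cba"], s["cba"], t["cba"])   # [(1, 3), (2, 1)] 2 1
print(succ(N, R, "cba"))              # [(1, 4), (2, 2)]
```

Here cba may be linked either to bab or to baa: both choices satisfy the definition of a possible link.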

Below we present an algorithm that constructs $T(\mathcal{R})$, and computes for each node $v$ in $\mathcal{N}$ the weights $s(v)$ and $t(v)$ and a possible link $P_0$.

Construction of $T(\mathcal{R})$. Now, we give an algorithm to construct $T(\mathcal{R})$. We use the version of McCreight's algorithm given by Na et al. [18] on our input and we build for each leaf $v$, $s(v)$, $t(v)$ and $P_0(v)$. For building $T(\mathcal{R})$, we start with a tree that contains only the root. Then, for each word $w$ in every chain $C$, we create or update (if it exists) the node $w$ as follows. Assume that we keep in memory the node $v$ that has been processed just before $w$.

• If $w$ is the first word of $C$, we go down from the root by comparing $w$ to the labels of the tree. If we create the node $w$, $s(w)$ and $t(w)$ are initialised to 1, and $P_0(w)$ to nil. If $w$ already exists in the tree, we increment $s(w)$ and $t(w)$ by 1.

• If $w$ is not the first word of $C$, we start from $v$, and as in McCreight's algorithm, we create or arrive on the node representing $w$. If we need to create this node, $s(w)$ is initialised to 1, $t(w)$ to 0, and $P_0(w)$ to nil. Otherwise, we add 1 to $s(w)$. We set $P_0(v) = w$.

The loop continues with the next word until the end, and we obtain $T(\mathcal{R})$.
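The bookkeeping of $s$, $t$ and $P_0$ in this loop can be isolated from the tree construction itself. The sketch below replaces the tree by dictionaries keyed by words, so it does not reproduce the suffix-link navigation of McCreight's algorithm and makes no claim about running time. It only mirrors the update rules stated above; the function name is ours.

```python
def build_weights_and_link(R):
    """Process every word of every chain in order and maintain s, t and P0.

    R is a list of chains (tuples of strings). P0 maps a word to the word
    processed right after it in the last chain where it had a successor,
    or to None (nil) if it never had one.
    """
    s, t, p0 = {}, {}, {}
    for chain in R:
        prev = None                        # node processed just before w
        for j, w in enumerate(chain, start=1):
            if w not in s:                 # "create the node w"
                s[w], t[w], p0[w] = 0, 0, None
            s[w] += 1
            if j == 1:
                t[w] += 1                  # w is the first word of its chain
            else:
                p0[prev] = w               # set P0(v) = w
            prev = w
    return s, t, p0


R = [("bac", "acb", "cba", "bab", "ab"), ("cba", "baa", "aac", "ac")]
s, t, p0 = build_weights_and_link(R)
print(s["cba"], t["cba"], p0["cba"])   # 2 1 baa
print(p0["ab"], p0["ac"])              # None None
```

Because $P_0(v)$ is overwritten each time $v$ is followed by another word, the value kept is the successor of $v$ in the last chain processed, which is one of the admissible targets of a possible link.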

Theorem 6. For a set of chains of suffix-dependant strings $\mathcal{R}$, we can construct $T(\mathcal{R})$ in $O(\|S\|)$ time and space.

Proof. To begin with, let us prove that $T(\mathcal{R})$ is in $O(\|S\|)$ space. Its number of leaves equals $\sum_{C \in \mathcal{R}} |C|$. Hence, its number of nodes is at most $2 \sum_{C \in \mathcal{R}} |C| - 1 \le 2\|S\|$, and its number of edges is at most $2\|S\|$. Thus the size of $T(\mathcal{R})$ is in $O(\|S\|)$.

Clearly, the construction algorithm of $T(\mathcal{R})$ computes both weights $s(\cdot)$ and $t(\cdot)$, and the possible link $P_0(\cdot)$ correctly.

For the complexity, for each chain of suffix-dependant strings $C_i$ of $\mathcal{R}$, the length of the traversed path on the tree is equal to $|s_i|$, thanks to the use of the suffix links. Thus, as in McCreight's algorithm, the complexity is in $O(\|S\|)$. □

Now, we are equipped with an algorithm that builds $T(\mathcal{R})$ for any set of chains of suffix-dependant strings. Let us review some instances of sets $S$, for which $T(\mathcal{R})$ is in fact a well-known tree.

• If $\mathcal{C} := \bigcup_{w \in S} \{\text{tuple of suffixes of } w\}$, then $T(\mathcal{C})$ is the Generalised Suffix Tree of $S$ (see Fig. 7a). We have that the restrained mapping $sl(\cdot)$ is an example of a possible link.

• If $B_k := \bigcup_{w \in S} \{\text{tuple of } k\text{-mers of } w \text{ and suffixes of length } k' < k \text{ of } w\}$, then $T(B_k)$ is the generalised $k$-truncated suffix tree of $S$, as defined in [19] (which generalises the $k$-truncated suffix tree of Na et al. [18]).

• If $A_k := \bigcup_{w \in S} \{\text{tuple of } (k+1)\text{-mers of } w \text{ and suffixes of length } k \text{ of } w\}$, then $T(A_k)$ is the truncated suffix tree that we define below in Section 4.2 (see Fig. 7b).

4.2. Our truncated suffix tree


Fig. 7. (a) The generalised suffix tree for the set of words {bacbab, bbacbaa, bcaacb, cbaac, cbabcaa}. The part above the green line corresponds to the TST $T(A_2)$, which is shown in (b). (b) The truncated suffix tree $T(A_2)$ for the same set of words. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Definition 7.

1. For all $i \in [1, |S|]$ and $j \in [1, |s_i| - k + 1]$, $A_{k,i}$ denotes the tuple such that its $j$th component is defined by

$$A_{k,i}[j] := \begin{cases} s_i[j, j+k] & \text{if } j \le |s_i| - k \\ s_i[j, |s_i|] & \text{otherwise} \end{cases}$$

2. and $A_k$ is the set of these tuples: $A_k := \bigcup_{i=1}^{n} A_{k,i}$.
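Definition 7 can be illustrated by building $A_{k,i}$ explicitly for a word. The sketch below assumes words of length at least $k$ and uses a function name of our choosing. Slices are shifted by one because the definition is 1-based and inclusive.

```python
def chain_A(word, k):
    """Tuple A_{k,i} of a word s_i: its (k+1)-mers in order of position,
    followed by its suffix of length k."""
    n = len(word)
    return tuple(
        word[j - 1:j + k] if j <= n - k else word[j - 1:]   # s_i[j, j+k] or s_i[j, |s_i|]
        for j in range(1, n - k + 2)                         # j in [1, |s_i| - k + 1]
    )


S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]
print(chain_A("bacbab", 2))   # ('bac', 'acb', 'cba', 'bab', 'ab')
A2 = [chain_A(w, 2) for w in S]
```

Collecting all the strings of `A2` yields exactly the 3-mers of $S$ together with the suffixes of length 2 of its words, which is the content of the second point of Proposition 11 for $k = 2$.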

Proposition 11.

1. $A_{k,i}$ is a chain of suffix-dependant strings of $s_i$.

2. Moreover, $\{w \in A_{k,i} \mid A_{k,i} \in A_k\} = F_{k+1}(S) \cup Suff_k(S)$.

Proof.

1. For all $j \in [1, |A_{k,i}| - k]$, it is easy to see that $A_{k,i}[j]$ is a suffix-dependant string of $A_{k,i}[j+1]$.

2. For the second point,

$$\{w \in A_{k,i} \mid A_{k,i} \in A_k\} = \bigcup_{i=1}^{n} \Big( \bigcup_{j=1}^{|s_i|-k+1} \{A_{k,i}[j]\} \Big) = \bigcup_{i=1}^{n} \Big( \bigcup_{j=1}^{|s_i|-k} \{A_{k,i}[j]\} \cup \{A_{k,i}[|s_i| - k + 1]\} \Big)$$
