• Aucun résultat trouvé

A linear time algorithm for Shortest Cyclic Cover of Strings

N/A
N/A
Protected

Academic year: 2021

Partager "A linear time algorithm for Shortest Cyclic Cover of Strings"

Copied!
13
0
0

Texte intégral

(1)

HAL Id: lirmm-01525995

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01525995

Submitted on 22 May 2017

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

A linear time algorithm for Shortest Cyclic Cover of

Strings

Bastien Cazaux, Eric Rivals

To cite this version:

Bastien Cazaux, Eric Rivals. A linear time algorithm for Shortest Cyclic Cover of Strings. Journal of

Discrete Algorithms, Elsevier, 2016, 37, pp.56-67. �10.1016/j.jda.2016.05.001�. �lirmm-01525995�

(2)

Contents lists available atScienceDirect

Journal

of

Discrete

Algorithms

www.elsevier.com/locate/jda

A

linear

time

algorithm

for

Shortest

Cyclic

Cover

of

Strings

Bastien Cazaux

a

,

b

,

Eric Rivals

a

,

b

,

aLIRMM,CNRSandUniversitédeMontpellier,161rueAda,34095MontpellierCedex5,France

bInstitutdeBiologieComputationnelle,CNRSandUniversitédeMontpellier,860rueSaintPriest,34095MontpellierCedex5,France

a

r

t

i

c

l

e

i

n

f

o

a

b

s

t

r

a

c

t

Articlehistory:

Availableonline19May2016

Keywords:

Greedy

Shortestcycliccoverofstrings Superstring

Overlap

Minimumassignment Graph

Mergingwordsaccordingtotheiroverlapyieldsasuperstring.Thisbasicoperationallows toinferlongstringsfromacollectionofshortpieces,asingenomeassembly.Tocapture amaximum of overlaps, the goalis to infer the shortest superstring ofa set ofinput words.TheShortestCyclicCoverofStrings(SCCS)problemasks,insteadofasinglelinear superstring,forasetofcyclicstringsthatcontainthewordsassubstringsandwhosesum oflengthsisminimal. SCCSis used asacrucial stepinpolynomial timeapproximation algorithms for the notably hardShortest Superstring problem, but it is solvedin cubic time.Thecyclicstringsarethencutandmergedtobuildalinearsuperstring.SCCS can alsobesolvedbyagreedyalgorithm.Here,weproposealineartimealgorithmforsolving SCCSbasedonaEuleriangraphthatcapturesallgreedysolutionsinlinearspace.Because thegraphisEulerian,thisalgorithmcanalsofindagreedysolutionofSCCSwiththeleast numberofcyclicstrings.ThishasimplicationsforsolvingcertaininstancesoftheShortest linearorcyclicSuperstringproblems.

©2016TheAuthor(s).PublishedbyElsevierB.V.Thisisanopenaccessarticleunderthe CCBYlicense(http://creativecommons.org/licenses/by/4.0/).

1. Introduction

The possibility ofmergingtwo words intoa longerone accordingto their overlap– forinstancemerging abcde with cdeba into abcdeba – allows to infer the sequence of a target molecule from the short reads produced by sequencing machines. Incomputerscience, mergingresultsina superstring oftheinput words.Theactionofmergingisheavily used in any genome assembler (although not necessarily on the complete reads, sometimes only on substrings ofthe reads) andgiventheredundancy ofthesequencing,whichyields ahighdensityofreads, theobjectiveistocompute ashortest superstringofasetofinputwords[10].Inbioinformatics,thetaskofgenomeassembly isworsenedby theinaccuracyof the sequencingandby thestructure ofthe genomewhichcontains repeatedsequences [10].However, progresseson the modelquestion, calledShortest(linear)SuperstringProblem (SLiS)–oroftenShortestCommon Superstring– stillshedlight on thedifficultyofassembly.Moreover,thetarget moleculecanbe alinearora circularstructure.Itisalsointeresting to addressthequestionoftheShortestCyclicSuperstring (SCyS).Findingsuperstringsalsohasapplicationindatacompression. Indeed,asasuperstringrepresentsacondensedrepresentationoftheinitiallanguage,itcanserveasacompressedversion oftheinputset.

Let P

:= {

s1

,

. . . ,

s|P|

}

beasetofinputwordsanddenotethesumoftheirlengthsby



P



,i.e. thenormof P .Without loss ofgenerality, we always assume that P isfactor free,i.e. for anytwostrings of P ,noneis asubstring of theother. From thecomputationalviewpoint,finding ashortestsuperstringisknowntobehardtosolve(NP-hardevenonabinary

*

Correspondingauthorat:LIRMM,CNRSandUniversitédeMontpellier,CC477,161rueAda,34095MontpellierCedex5,France.Tel.:+33467418664.

E-mailaddresses:[email protected](B. Cazaux),[email protected](E. Rivals).

http://dx.doi.org/10.1016/j.jda.2016.05.001

1570-8667/©2016TheAuthor(s).PublishedbyElsevierB.V.ThisisanopenaccessarticleundertheCCBYlicense (http://creativecommons.org/licenses/by/4.0/).

(3)

alphabet) andto approximate(Max-SNP hard) [7,1].The research inthe last 25years hasmainlyfocused on polynomial timeapproximationalgorithmsforSLiS(see[8]forarecentlist).Despitetheseefforts,thealgorithm

greedy

(Algorithm 1), an algorithmthat iteratively updatestheinput setby mergingtwo maximally overlappingstrings untilone stringis left, wasconjectured toapproximatethelengthoftheshortestsuperstringwithratio2[17,1].Thisisbetterthanmostknown algorithms,andtheconjectureisstillvalid.Forlinearsuperstringsontheconstantsizealphabets,Ukkonenproposedalinear timeimplementationofthealgorithm

greedy

basedontheAho–Corasickautomaton[19];later, anotherimplementation usingthesuffixtreewasgivenin[12,13].

Thequestion ofapproximatingthe superstringcanbe reformulatedona completeweighteddigraph, theso-called dis-tancegraph:each input wordisavertexandan arcencodesthedirectional mergingofthetwo words.The weightofan arcfromu to v equalsthelengthoftheprefixofu thatisnot coveredbytheoverlapofv (e.g.,thecorrespondingprefix ofabcde to cdeba isab, andtheweightwouldbe 2). Thena shortestlinearsuperstringisassociatedwithaHamiltonian path on this graph,and a shortestcyclic superstring witha Hamiltonian cycle[1]. Yet those problems are also hard to approximateandmostapproximationalgorithmsforSLiSresorttoarelaxedquestion,namelyShortestCyclicCoverofStrings (SCCS).ThesealgorithmssolveSCCSbycomputingaminimumassignmentonthedistancegraphinO

(



P



+ |

P

|

3

)

[16],and

thentransformthecyclesintocyclicstrings.OurworkfocusesonSCCS.Weproposeanalgorithmthatusesonlylineartime in



P



,andforthissakeintroducethesuperstringgraph,whichcapturesallsolutionsof

greedy

forSCCS(seeSections4

and3). Thesegreedysolutions arecalledgreedysuperstrings. Exploringthe graphproperties,we show that itgives,fora subsetofinstancesoftheshortestlinearandcyclicsuperstringproblems,interestingapproximationsorevenexactsolutions (Section4).

1.1. Notationandproblemdefinition

ForanyrootedtreeT ,VT isthesetofnodesofT .Foranode v ofVT,ChildrenT

(

v

)

isthesetofchildrenofv inT .We denoteby

[

i

,

j

]

thesetofintegersbetweeni and j.

Weconsidertwokindsofstrings:linear andcyclicstrings (seeFigs. 1a and 1b).Forastrings,thelengthofs is

|

s

|

.Fora linearstrings andi

j in

[

1

,

|

s

|]

,s

[

i

,

j

]

isthelinearsubstringofs beginningatthepositioni andendingattheposition j, s

[

i

]

is thesubstring s

[

i

,

i

]

, s

[

1

,

j

]

isa prefix of s and s

[

i

,

|

s

|]

isa suffix of s. Aprefix (or suffix)s of s isproper ifs is differentofs. Foranotherlinearstringt,theoverlap froms tot,denotedby ov

(

s

,

t

)

, isthelongestsubstring whichisa propersuffixof s andaproper prefixoft.The prefixfroms tot,denotedby pr

(

s

,

t

)

,issuchthat s

=

pr

(

s

,

t

)

ov

(

s

,

t

)

.The merge ofs witht isthelinearwordpr

(

s

,

t

)

t ifs

=

t,andthecyclicwordpr

(

s

,

t

)

otherwise.Alinearsuperstring ofasetP of linearstringsisalinearstring w suchthateachstringofP isasubstringofw.Wesaythatalinearstringw isasubstring ofacyclicstringc ifthereexists wc alinearpermutation ofc such that w isa substringofwc (where wc

=

wcwc

. . .

) –seeFig. 1c.Acycliccover of P isasetC ofcyclicstringssuchthateachlinearstringofP isasubstringofacyclicstring of C .

Fig. 1. (c) Shows how the linear word (a) is included in the cyclic word (b). Wedefinethefollowingproblem:

Problem1(Shortestcycliccoverofstrings). Input: Asetoflinearstrings P .

Output: AcycliccoverC ofP thatminimises



C



.

Let

σ

be a permutation of

{

1

,

. . . ,

|

P

|}

, we can split up

σ

in cyclicpermutations:

σ

=

σ

1

◦ . . . ◦

σ

m. Foreach cyclic permutation

σ

ioflengthni,wedenotethecyclicstringby:

P

(

σ

i

)

:=

pr

(

sσi(1),sσi(2)) . . .pr

(

sσi(ni−1),sσi(ni))pr

(

sσi(ni),sσi(1)),

and

P

(

σ

)

:= {

P

(

σ

1

), . . . ,

P

(

σ

m

)}

(SeeFig. 2).

Itiseasytosee,thatforeachpermutation

σ

of

{

1

,

. . . ,

|

P

|}

,P

(

σ

)

isacycliccoverof P andforeachoptimalsolutionw ofProblem 1(SCCS),thereexists

σ

apermutationof

{

1

,

. . . ,

|

P

|}

suchthatw

=

P

(

σ

)

.

1.2. Relatedworks

(4)

Fig. 2. Running examplewiththeinputsetP:= {ababb, aab, abba, abaa}.InstanceofacycliccoverofP obtainedwithapermutationσ.Weobtainthe cycliccoverP(σ) = {ababba, aba}(ababba andaba arecyclicwords).

SinceSLiSdoesnotadmitaPTAS,numerousworksonSLiShavereportedpolynomialtimeapproximationalgorithmswith aconstantratiorangingfrom4 to

>

2 (see[8, Table 1]).AmajorityofthesemodelSLiSasaminimumweightHamiltonian path problemonthedistancegraph G of P (whereanarc

(

si

,

sj

)

isweighted by



pr

(

si

,

sj

)



foranyvalidi

,

j).Notethat ifinstead itis weightedby



ov

(

si

,

sj

)



, weobtain theOverlapGraph,andSLiS becomesequivalent to findinga maximum weight Hamiltonianpath[1].Clearly, theweight ofacycliccoverof G (i.e.,the sumoftheweight ofall cycles)islower thanthat ofaHamiltoniancycle,whichitselfgivesalowerboundoftheweightofaHamiltonianpathonG.Hence,most approximationalgorithmsforSLiSrelyonsolvingSCCS,whichispolynomial.Blumetal.[1]werethefirsttoexhibitsuchan algorithmandtoprovethatitachievesaconstantratioof3 forSLiS.ThisalgorithmfirstsolvesSCCSandinasecondstep combinesthecyclicstringsbyrunningagreedyalgorithmonthem.TengandYaoimprovedthisratiobyanalgorithmthat recursivelysolvesSCCS[18].Afterthat,manyauthorshaveinvestigatedthecombinatorialpropertiesofwordsandofsome specificcycles,whichenabledthemtofurtherimprovetheapproximation(e.g.[3,14,15]),butallneedtosolveSCCS.Todo so,they computea MinimumAssignment onthedistancegraph.Anassignment(or cycliccover)ofa digraph G isasetof cyclessuchthatevery vertexofG belongstoexactlyonecycle.TosolveSCCSonaninstance P ,approximationalgorithms for SLiS first1/ build the distancegraphof P , with

|

P

|

nodes and

|

P

|(|

P

| −

1

)

arcs,which requires O

(



P



+ |

P

|

2

)

time andspaceusingtheGeneralisedSuffixTree(GST)of P [11],then2/computeaminimumcycliccoverusingtheHungarian algorithmin O

(

|

P

|

3

)

time[16].Oncethecycliccoverisknown,thealgorithmin[18]alsorequirestocomputeaparticular assignment calledacanonicalassignment, whichtakesanadditional O

(



P

)

usingthesuffix treebasedalgorithmof[9]. However, thecomplexitybottleneckofalgorithmsusingacycliccoverremainsthe O

(



P



+ |

P

|

3

)

time tobuildthegraph andcomputethecover.Itwasalsoshownthat

greedy

(Algorithm 1)solvesSCCSexactly[4].Tothebestofourknowledge, thereiscurrentlynolineartimealgorithmforsolvingSCCS.

Algorithm 1: Thealgorithm

greedy

forShortestCyclicCoverofStrings.

1 Input: P asetoflinearwords;

2 Output: S asetofcyclicstringsthatcoverP ;

3 S:= ∅

4 while P is notempty do

5 u andv twoelementsofP havingthelongestoverlap(u canbeequaltov)

6 w isthemergeofu andv

7 P:=P\ {u, v}

8 if u =v (i.e.w isacyclicstring) then S:=S∪ {w}else P:=P∪ {w}

9 return S

Theorem1([6]).ForShortestCyclicCoverofStrings,Algorithm

greedy

yieldsanoptimalsolution. 1.3. Explanationsaboutalgorithm

greedy

An overlapissaidtobegreedy-feasible ifagreedysolutiontakesit.Letu1, u2, v1 and v2 befourstrings.Wesaythat

the overlapsov

(

u1

,

u2

)

andov

(

v1

,

v2

)

aredependent ifandonlyifu1

=

v1 oru2

=

v2.WhenAlgorithm

greedy

merges

u withv,itchoosestheoverlapbetweenu andv andforbidsanyoverlapbeginningbyu orendingbyv,i.e. itforbidsall overlaps that aredependent ofov

(

u

,

v

)

.Wesaythat a setofoverlapsis independent ifeach overlapisnot dependentof another.

(5)

Thus,Algorithm

greedy

makesaseriesofchoicestoselectasequenceofgreedy-feasibleandindependentoverlaps:

InthefirststepoftheAlgorithm

greedy

choosesanoverlapamongthelongestoverlaps.

Inthesecondstep,itchoosesanoverlapamongthelongestoverlapswhichareindependentofthefirstoverlap.

Ateachfurtherstep,ittakesamaximumoverlapthatisindependentofthepreviouslyselectedoverlaps.

Intheend,itoutputsasetofoverlapsthataremaximal(fortheinclusionasabaseinamatroid).Thissetofoverlaps isindependent.

Asthesetofoverlapsattheendof

greedy

ismaximal,thereexistsapermutation

σ

of

{

1

,

. . . ,

|

P

|}

,suchthatforalli in

{

1

,

. . . ,

|

P

|}

,ov

(

si

,

sσ(i)

)

isanoverlapofoursetofoverlaps.Thus,weobtainthefollowingpropertywhichissimilartoa well-knownpropertyonthesolutionofthegreedyalgorithmforSLiS[20, Def. of Induced

(

π

,

S

)

and Def. 2]:

Proposition2.Letw agreedysolutionofSCCS.Thereexistsapermutation

σ

of

{

1

,

. . . ,

|

P

|}

suchthatw

=

P

(

σ

)

. 2. GeneralisedSuffixTree,overlap,merge,andRed–Bluepaths

Thesuffixtreeofawordw isadatastructurethatindexesallsubstringsofw.Eacharcislabelledwithasubstringofw suchthattheconcatenationofthelabelsalongabranchfromtheroottoaleafspellsoutasuffixofw.Aconstraintrequires that thelabelsof arcsfromanode to itsoffspring start withdifferentsymbols.Thus, each noderepresents aprefix ofa suffix,i.e. asubstringof w.Infact,eachinternalnode isthelongestcommonprefixofallleavesinitssubtree.Traversing an arcextends therepresented stringon itsright. The worddepth of a node isthelength of thestringit represents.An exampleof(generalised)suffixtreeisshowninFig. 3:thesolesuffixtreestructureisdisplayedinFig. 3a.Forinstance,the node 3 representsthestringaa.Goingfromnode 3 tonode 1 extends thestringrepresentedbya b toyieldthestring aab.

STare equippedwithadditionalarcs calledSuffixLinks (SL):a SL linksanode representing au (witha aletter andu a string)to thenode representingu. LetuslabeltheSL by thelettera.Thenotion ofSLis usuallyrestrictedto internal nodes,butitextendsnaturallytoleaves.Weconsiderthateachnode hasitsSL.TraversingaSL removesthefirstletterof theword:itshortenstherepresentedwordatthefront.ToeachSL,weassociatealabel,whichistheletteritremoves.For instance,inFig. 3,traversingtheSLofnode 1 shortensthewordaab toab,whichisthestringrepresentedbynode 2.The labelofthatSLisa (labelsofSLarenotrepresentedinthisfigure).Itisknownthat,ifSLsarereversed(consideredinthe otherdirection),theyformatreerootedintherootoftheST,andallleavesareonthesamebranchofthistree[5].Note thatthechainofSLstartingfromanynodegoesuptotherootofT (Forsimplicity,assumetheroothasaSLtoitself).The treecomposedbythesuffixlinksisdisplayed onFig. 3b.Moreover,inthefigures,theSLaredrawnasanarcfromachild toitsparentinL (seeFig. 3).

Forsimplicity,wesaythata suffixtreearcisblue,whileSLarered.So,wehavethebluetree,denoted T ,andthered tree,denotedL,onthesamenodes.

The GST generalises thisprinciple toseveral words. From now on,the generalised suffix tree of P isthe pair

(

T

,

L

)

. AnextensiveintroductiontoGSTandtheirapplicationscanbefoundin[10].Thedistinctionbetweenexplicitandimplicit nodesisskippedhere,butexplainedin[10,2].

2.1. Overlaps

Basically,allsuffixesandallsubstringsofwordsof P arerepresentedintheGST. Ifawordu overlapsaword v,then a suffixof u equals toa prefixof v. Hence, oneexpectsthe overlapto berepresented somewhereinthe GST. Indeed,if aninternalnoderepresentingthesuffixofu appearsonthebranchspelling v,thenu overlaps v.Thisinternalnodeexists sincethissubstringoccursatleasttwice(inu andinv), andrepresentstheoverlap.Letuscallsuchnodean overlapnode fromu to v.Inourrepresentation,theinternalnodeislabelledwiththestarting positionofthatsuffixinu onthebranch leading totheleafof v.Forexample,inFig. 3 thenode 2 islocatedonthebranchof 1.Itisalsoonthebranchesof 1 and 1 .Itrepresentsthestringab.Soweknowthataaboverlaps

abba

,abaaandababb,withtheoverlapab.

Clearly,alloverlapnodesfromu tov areancestorsofanother.Moreover,theworddepthofan overlapnode givesthe lengthofthecorrespondingoverlap.Hence,themaximumoverlapbetweentwowordsisthedeepestoverlapnodeofthose words.Forinstance,both nodes 4 and 3 areancestorsof 1 meaningthatabaaoverlapsaab withtwopossibleoverlaps: a andaa,oflength1 and2 respectively.

Allthesenotionsareknown.Indeed,itisknownthatonecancomputeefficientlyalloverlapsforeachpairofwordsin P usingtheGSTof P .Infact,one usuallycomputespairwise themaximumoverlapsfor P [11],tobuild theprefixorthe overlapgraphofP (asmentionedinintroduction).

2.2. MergeandRed–Bluepaths

Abovetheoverlapnode was definedonlyusing T ;theSL(hence L)were notinvolved.Letusnowgive analternative definitionusingT and L.

(6)

Fig. 3. (c) TheGeneralisedSuffixTree(GST)ofP:= {ababb, aab, abba, abaa}.TheGSTofP isinfactcomposedoftwotrees:(a)thetreeT ,whichisthe basicstructureofthesuffixtree–inbluearcs,and(b)L,thetreemadebythesuffixlinksofT –inredarcs.Squarenodesrepresentwordsthatoccuras asuffixofsomeinputword,circlenodesaretheothernodesofT .EachsquarenodestoresitspositionsofoccurrencesinS;forsimplicity,wedisplaythe startingpositionasanumberandthewordofS inwhichitoccursasitscolour,insteadofshowingthelistofpairs

(

i, j).Thesquarenodelabelledbya red2,whichrepresentsthestringab,meansthatthisstringoccursasthesuffixstartingatposition2 intheredwordaab.A squarenodelabelledwitha 1 representsaninputword.(Forinterpretationofthereferencestocolourinthisfigurelegend,thereaderisreferredtothewebversionofthisarticle.)

Ifu overlapsv,thenasuffixofu equalsaprefixof v.Wehaveseenthattheoverlapsisanodeonthebranchspelling v. We canalsofindthisprecisesuffix ofu startingatthenode representingu byfollowingenough suffixlinks. Thismeans going upintheredtreefromthenodeu.Hence, oneseesthatanoverlapnodefromu to v isan ancestorofu in L (the red tree),andanancestorofv in T (thebluetree).Weobtainanalternativedefinition:anoverlapnodefromu to v isa red ancestorofu andablue ancestorofv.Forsimplicity,letussayitisared–blue ancestoroftheorderedpair

(

u

,

v

)

.

Now,weseethatthemaximumoverlapof

(

u

,

v

)

isthelowestred–blueancestorof

(

u

,

v

)

.Also, iftheredpathreaches theroot,thentheoverlapistheemptyword(andtheirmergeissimplytheirconcatenation).

Inotherwords,anyoverlapforapair

(

u

,

v

)

correspondstoaRed–Bluepath fromu tov in

(

T

,

L

)

.

The merge of u and v usingtheir maximumoverlap ov

(

u

,

v

)

is the word w

:=

pr

(

u

,

v

)

ov

(

u

,

v

)

su

(

u

,

v

)

, where u

:=

pr

(

u

,

v

)

ov

(

u

,

v

)

and v

:=

ov

(

u

,

v

)

su

(

u

,

v

)

.We state this definitionwith the maximum overlap,but it is validwith any overlap; thelargertheoverlap,theshorterthemerge.Wedenoteby RB-path

(

u

,

v

)

theRed–Bluepathwhichcorresponds tothemaximumoverlapbetweenu andv.

Assumewewanttospelloutthe(linear)mergeofu and v usingtheoverlapnode w in

(

T

,

L

)

.Itcanbedoneinthree steps:

1. traverseT fromtheroottou:thisspellsoutu withthelabelsofbluearcs;

2. goup L fromu to w followingtheredarcs:itispossiblesince w isaredancestorofu (here weskiponebyone all lettersofpr

(

u

,

v

)

);

3. godownT from w to v usingbluearcs,whichispossiblesincev isadescendantof w;thisspellsoutsu

(

u

,

v

)

. A mergeof

(

u

,

v

)

isa linearsuperstringof

{

u

,

v

}

.The process abovecan begeneralised tospellout anycyclic super-string of a cyclicpermutation of words

(

w1

,

w2

,

. . . ,

wn

)

using some givenpairwise overlaps: one starts from thenode representing w1,then performs steps2and3usingthe appropriateRed–Blue pathfrom w1 to w2 tomergethem, then

iteratesonlysteps2and3foreachfollowingworduntilonereaches w1 again.ThiscircuitontheGSTiscalledaRed–Blue

circuit.

Example1. For instancein Fig. 3,the Red–Bluecircuit corresponding to thecyclic mergeof thewords ababb,abba,and thenaabinthisorder,containsthreeRed–Bluepaths:forthepairs

(

ababb

,

abba

)

,

(

abba

,

aab

)

andthenfor

(

aab

,

ababb

)

.This

(7)

circuitstarts atnode 1 goesup withSLtonode 2 andthen 3 whichisared–blue ancestorof

(

ababb

,

abba

)

,then goes down withbluearcsto 1.Thisisthe Red–Bluepath for

(

ababb

,

abba

)

, whichspells outab with the SLlabels.Then the circuitgoesupwithSLto 2, 3,and4,whichisanancestorof 1.Thisspellsoutabb withSLlabels.Then,itgoesdownto 1.ThismakesuptheRed–Bluepathfor

(

abba

,

aab

)

.Then,thecircuitgoesupto 2,whichisanancestorof 1 ;thisspells outa withone redlabel.Then, itgoesdownandbackto 1 ,whichwas thestartingnode.Altogetherthesequenceofred labelsspellsoutababba,whilethesequenceofbluelabelsyields

aababb

:bothwritethesamecyclicword.

Now,letusstateamoresubtleproperty.ForaRed–Bluecircuit,writingallbluelabelsinorderorwritingallredlabels inorderspellout thesamesuperstring(aredlabelisalabelofaSL).Hence, thenumberofbluearcsissmallerthanthe numberofredarcs,whichisthesuperstringlength.Aproof ofaslightlydifferentversionofthispropertycanbefoundin

[5, Section 4.3 on p. 18]. 3. Thesuperstringgraph

Here, weare going to show that thechoicesmade byAlgorithm

greedy

forSCCS are not reallychoices. Infact, we willshowthateachgreedysolutioncanberepresentedbyauniquegraphembeddedintheGeneralisedSuffixTree.Asthis graphturnsouttobecomputableinlineartime,onecanderiveagreedysolutionforSCCSinlineartime.

3.1. Equivalenceofgreedysolutions

Letw beagreedysolutionofSCCS.ByProposition 2,thereexistsapermutation

σ

suchthat w

=

P

(

σ

)

.Foreachcyclic permutation

σ

iof

σ

,wedenotebyc

(

σ

i

)

theRed–Bluecircuitobtainedfrom

σ

i intheGST.

We denoteby GP

(

σ

)

= (

V

,

R

,

B

)

thedirected multigraphbuiltwith thearcsof theRed–Blue circuitsof allthe c

(

σ

i

)

, whereR denotesthesetofRed arcs,andB denotesthesetofBlue arcsofallthec

(

σ

i

)

.

AgraphissaidtobeEulerian,ifthereexistsaEulerianpathonthisgraph,i.e. apath thatvisitseacharcofthegraph exactlyonce.As GP

(

σ

)

isacollectionofRed–Bluecircuits,whicharemadeofRed–Bluepaths,eachconnectedcomponent ofGP

(

σ

)

isEulerian.(Witheachadditionalcircuit,thegraphremainsEulerian.)

Foreachpermutationof

{

1

,

. . . ,

|

P

|}

,we canbuildthecorresponding multigraphGP

()

.Wehavethefollowing proposi-tion.

Proposition3.Letw1andw2betwogreedysolutionsofSCCSandlet

σ

1and

σ

2betheirrespectivepermutationsof

{

1

,

. . . ,

|

P

|}

.

ThenGP

(

σ

1

)

=

GP

(

σ

2

)

.

Proposition 3statesakeypropertyofgreedysolutionsforSCCS.Below,wegivetwolinesofproof:thefirstoneexplains whythepropertyisvalid(andbetterconveystheintuitiontothereader),whilethesecondgivesamoreformal,butless explanatory,demonstration.

Fig. 4. Example withthreeoverlaps:twodependentones,betweenu1andv1andbetweenu1andv2,andanotheroverlapbetweenu2andv2:shownon

thestrings(a),intheoverlapgraph(b),andintheSuperstringGraph(c).

Proof. Let w

=

P

(

σ

)

beagreedysolutionofSCCS.Letusshow thatthechoicesmadebyAlgorithm

greedy

forSCCSdo notinfluenceGP

(

σ

)

.

greedy

facesachoicewhenseveraloverlapsofthesamelengtharedependent(Clearly,independent overlapscanbeswappedinapermutationandthusdonotinfluencethegraph).

Let ov

(

u1

,

v1

)

and ov

(

u1

,

v2

)

be two dependent andgreedy-feasible overlaps. Assume that ov

(

u1

,

v1

)

is the overlap

choosenby

greedy

toobtain w,andletu2 inP bethesuccessorofv2in

σ

.Itimpliesthatov

(

u2

,

v2

)

istheoverlapthat

mergesu2 andv2in w.Bythedefinitionof

greedy

,wehavethat

|

ov

(

u1

,

v1

)

| ≥ |

ov

(

u2

,

v2

)

|

.Letusshowthatthereexits

(8)

As

|

ov

(

u1

,

v1

)

| ≥ |

ov

(

u2

,

v2

)

|

,wehavethat ov

(

u2

,

v2

)

isaprefix ofov

(

u1

,

v2

)

,andthus aprefixofov

(

u1

,

v1

)

.Hence

intheGST,thenodeov

(

u2

,

v2

)

isanancestorofov

(

u1

,

v1

)

,andov

(

u1

,

v1

)

isanodeintheRed–Bluepathsfromu1 to v1

andfromu2 tov2.Thus,wehavebothRed–Bluepathsfromu1 tov2 andfromu2to v1areinGP

(

σ

)

(seeFig. 4).

2

Analternativeproof. Let

σ

1and

σ

2 betwopermutationsofP correspondingtotwodistinctgreedysolutionsforSCCS.We

setGp

(

σ

1

)

:= (

V1

,

R1

,

B1

)

andGp

(

σ

2

)

:= (

V2

,

R2

,

B2

)

.

WeproceedbycontrapositionandassumethatGp

(

σ

1

)

=

Gp

(

σ

2

)

;then

((

R1

B1

)

\(

R2

B2

))

∪((

R2

B2

)

\(

R1

B1

))

= ∅

.

As the role of bothpermutations issymmetrical, we assume that

(

R1

B1

)

\ (

R2

B2

)

= ∅

.Lete

:= (

u

,

v

)

be an arcof

(

R1

B1

)

\ (

R2

B2

)

suchthatmax

(

|

u

| ,

|

v

|)

ismaximalamongallpossiblearcsof

(

R1

B1

)

\ (

R2

B2

)

.Accordingtothe

definitionofGp

(

σ

1

)

,thereexisti and j suchthat

σ

1

(

i

)

=

j ande isanarcontheRed–Bluepathlinking si tosj (formally e

RB-path

(

si

,

sj

)

).Twocasesarise.

Case e

B1:

Let i2 denotethepredecessor of j in

σ

2, inotherwords, theelementof

σ

2 satisfying j

=

σ

2

(

i2

)

. But,one gets that

ov

(

si2

,

sj

)

isaprefixofov

(

si

,

sj

)

,sinceotherwisetherewouldexist j1

:=

σ

1

(

i2

)

,suchthat



ov

(

si

,

sj

)



<



ov

(

si2

,

sj1

)



,

whichwouldimplythate isnotmaximal.Itfollowsthate belongstoRB-path

(

si2

,

sj

)

,andthusto B2,which

contra-dictsourhypothesis. Case e

R1:

Let j2 denote the successorof i in

σ

2, that is j2

:=

σ

2

(

i

)

. Clearly, ov

(

si

,

sj2

)

must be a suffix of ov

(

si

,

sj

)

, since

otherwise therewouldexisti1 suchthat j2

=

σ

1

(

i1

)

and



ov

(

si

,

sj

)



<



ov

(

si1

,

sj2

)



,whichwouldmeanthate is not

maximal.Oneobtainsthate

RB-path

(

si

,

sj2

)

,andthuse belongsto R2,whichisacontradiction.

Thisendstheproof.

2

ByProposition 3,thegraphGP

()

representsallthesolutionsofthegreedyalgorithmforSCCS.Wecalledthisgraph,the SuperstringGraph of P ,anddenoteitbyGP.AnexampleofSuperstringGraphforP

:= {

abb

,

bbb

,

bbc

}

isshowninFig. 7.

Tosumup,weknowthat foreach greedysolution w,thereexistsapermutation of

{

1

,

. . . ,

|

P

|}

that canbetranslated intoasetofRed–Bluecircuits,c

(

w

)

= {

c1

(

w

),

. . . ,

cm

(

w

)

}

.RememberthateachRed–BluecircuitismadebychainingRed– Blue paths. Letc be amulti-cycle of G.We saythat c covers GP ifeach arcof GP is includedina single cycleof c.By

Proposition 3,foreachgreedysolution w wehavethatc

(

w

)

coversGP.

Bythefollowingproposition,wehavethateachmulti-cyclethatcoversG comesfromagreedysolution. Proposition4.

{

multi-cycle p

|

p covers GP

}

= {

c

(

w

)

|

w is a greedy solution

}

.

ToproveProposition 4,wefirstneedalemma.Acrossing inGP isanode thathasredin- andout-goingarcsandblue in- andout-goingarcs.

Lemma5(Absenceofcrossing).GPdoesnotcontainanycrossing.

Proofof

Lemma 5

. It is enough to see that the only nodes that can have a red out-goingarc and a blue in-going arc correspondtothewordsof P .Butifv

P , v isaleafinT andinL bydefinitionoftheGST,andthus v hasneitherablue arcgoingout,noraredarcgoingin.

2

Fig. 5. Illustration fortheproofofProposition 4.Intheproofof⊆,theargumentistheexistenceofacrossingnodeinanon-greedy solution.Thisfigure displaysthefournodes v1, . . . , v4 andthenodes(circles)representingtheiroverlaps.Then,thenoderepresentingov(v3, v4)(greynode)isacrossing, i.e. ithasblueandredin-goingandout-goingarcs.(Forinterpretationofthereferencestocolourinthisfigurelegend,thereaderisreferredtotheweb versionofthisarticle.)

Proofof

Proposition 4

. To understandthis equality,it issufficient tonote that there is arelation betweenchoosing the longestoverlapin

greedy

andtheabsenceofcrossinginGP.

Proofof

.

Letusshow

{

multi-cycle p

|

p covers GP

}

⊆ {

c

(

w

)

|

w is a greedy solution

}

by contraposition.Assumethatthere exists a multi-cycle p coveringGP andtheredoesnotexist agreedysolution w of P suchthat p

=

c

(

w

)

.Byhypothesis, there exist four nodes v1

,

v2

,

v3 and v4 in P that do notsatisfy the greedyproperty. In other words,although

|

ov

(

v1

,

v2

)

| <

(9)

|

ov

(

v3

,

v4

)

|

,both P ath

(

v1

,

v4

)

and P ath

(

v3

,

v2

)

arein p. Hence,the nodeov

(

v3

,

v4

)

isa crossingin GP (see Fig. 5).As theredoesnotexistanycrossinginGP (Lemma 5),weobtainacontradiction,andget

{

multi-cycle p

|

p covers GP

} ⊆ {

c

(

w

)

|

w is a greedy solution

}.

Proofof

.

Reciprocally,assumethatthereexistsagreedysolution w of P andtheredoesnotexistamulti-cycle p thatcoversGP such that p

=

c

(

w

)

.Thus,there existsan arcofc

(

w

)

whichisnot an arcofGP.Among thosearcs,let e

= (

u

,

v

)

bethe one withthedeepest node(i.e. longest word) v in T .Bythe definitionof GP, e belongs to R (i.e. is ared arc),because it is the deepest ofits kind. Let v1 and v2 be the nodesof P such that e belongs to P ath

(

v1

,

v2

)

. As e is not in GP, foreach u in P such that P ath

(

v1

,

u

)

is in GP, u

=

v2.Thus, there exists v4 in P such that P ath

(

v1

,

v4

)

is in GP and

|

ov

(

v1

,

v2

)

| < |

ov

(

v1

,

v4

)

|

.As w isagreedysolutionof P ,thereexists v3 in P andnotinGP suchthat P ath

(

v3

,

v4

)

isin

c

(

w

)

and

|

ov

(

v1

,

v4

)

| ≤ |

ov

(

v3

,

v4

)

|

.Thus,there existsan arcinc

(

w

)

that isnot inGP witha deeperextremity thane. Thiscontradictsourhypothesisandweget

{

c

(

w

)

|

w is a greedy solution

}

⊆ {

multi-cycle p

|

p covers GP

}

.

2

3.2. CharacterisationoftheSuperstringGraphandaconstructionalgorithm

Now,wearegoingtocharacterisetheSuperstringGraphandgivealineartimealgorithmtobuilditfromasetofwords. Forthisalgorithm,wedefineandcomputeweights(respectivelyn

(.)

andd

(.)

)fortheredandforthebluearcsoftheGST. Finally,theSGwillbecomposedofthosearcshavingastrictlypositiveweight(andofthenodesvisitedbythem).

As

greedy

selectsoverlapsinorderofdecreasinglength,onecanseeitasarecursivealgorithmthatprogressesalong thebranchesoftheGSTof P .OntheGST,

greedy

progresses fromtheleavestowardtheroot,i.e. inorderofdecreasing worddepthofthenodes.Letususethisrecursivesideof

greedy

toexplaintheSuperstringGraph.

Letv be anodeoftheGST;when

greedy

sees theoverlaps oflength

|

v

|

,ithasalreadyconsideredoverlaps oflarger lengths.Wedefinetwoweightsn

()

andd

()

suchthat:

n

(

v

)

isthenumberofstringsof P whichhavev assuffixandarenotmergedontheirrightwithanotherstringof P withalargeroverlap.

d

(

v

)

isthenumberofstringsof P whichhave v asprefix andare notmergedontheir leftwithanotherstringof P withalargeroverlap.

Inother words,n

(

v

)

isthenumberofleavesofthesubtreeofv inL thathavenotyetbeenmergedontheirright,and d

(

v

)

isthenumberofleavesofthesubtreeofv inT thathavenotyetbeenmergedontheirleft.

First,itcanbeseeneasilyseenthatanode v whichhasnootherleafbothinitssubtreesinL andinT ,isastringofP . Thus,forthesenodesweset,n

(

v

)

=

d

(

v

)

:=

1.

Fortheother nodes,weusetherecursionof

greedy

.Infact, wecan determinehowmanystringsof P havenotyet beenmergedwithanotherstringbylookingtheweightsn

(.)

andd

(.)

ofthechildrenofv inL andofitschildreninT (see

Fig. 6).

Leta

(

v

)

bethe difference betweenthesumofthen

(

u

)

ofthe children u of v inL and the sumofthed

(

u

)

ofthe childrenu ofv inT : a

(

v

)

:=



u∈ChildrenL(v) n

(

u

)



uChildrenT(v) d

(

u

).

Hence, theabsolutevalue ofa

(

v

)

is thelargest numberofstrings of P that

greedy

can mergewithother strings of P usingtheoverlap v.Witha

(

v

)

,wecancalculaten

(

v

)

andd

(

v

)

:

If a

(

v

)

0

,

we have



n

(

v

)

:=

a d

(

v

)

:=

0

,

if a

(

v

) <

0

,

we have



n

(

v

)

:=

0 d

(

v

)

:= −

a

.

Withthisrecursive definitionoftheweights n

()

andd

()

, wecan constructtheSuperstring GraphfromtheGST of P . WhenevertheweightofanarcoftheGSTisequalto0,thisarcdoesnotbelongtotheSuperstringGraph.Similarly,iffor anode,sayv,allofitsin-goingandout-goingarcshaveazeroweight, v isisolatedandisnotincludedintheSuperstring Graph–thiscorrespondstothesubsetU inDefinition 1(whichstatesanotherdefinitionoftheSuperstringGraph).

Definition1.TheSuperstringGraphGP

= (

V

,

R

,

B

)

where V

=

VT

\

U

(10)

Fig. 6. General schemeofthein- andout-goingarcsofanodev inGST.TheredarcsbelongtoR andthebluearcstoB.Ifaweightiszero,thearcis notincludedintheSuperstringGraph.(Forinterpretationofthereferencestocolourinthisfigurelegend,thereaderisreferredtothewebversionofthis article.)

B

= {(

t

,

v

)

d(v)

|

v

VT

,

t the parent of v in T

}

andU

= {

v

VT

|

vis not an extremity of an arc of R

B

}

wheren

()

andd

()

aretheweightsonarcsasdefinedabove. Now thatwecanbuild theSuperstringGraphforaninput P by traversingthegeneralisedsuffixtreeof P ,weprovide analgorithm(Algorithm 2)thatusesthisgraphtosolveSCCSforP .

Theorem6.

Algorithm 2

solvesSCCSinlineartimeandspace.

Proof. Correctness.ByProposition 4,theset w ofcyclicwordscomputedbyAlgorithm 2isasolutionof

greedy

forSCCS. Hence,eachwordof P appearsasasubstringofacyclicwordinw,andbyTheorem 1,thenormof w isminimal.

Complexity.TheSuperstringGraphof P isembeddedinthegeneralisedsuffixtreeof P ,andthusbuiltinatime linear in



P



.

Firstthecyclicconcatenationofthewordsin P islongerthantheSCCSobtainedthroughanycyclicpermutationof P . ByProposition 2,thelengthofasolutionof

greedy

isboundedby



P



;thus,thelengthofamulti-cycleofGP isbounded by2

× 

P



.Consequently,onecanfindamulti-cyclealsoin O

(



P

)

timeandspace.Finally,traversingallitscyclestakesa timeproportionalinthesizeoftheSuperstringGraph.

2

NotethatwithAlgorithm 2,oneobtainsacycliccover.

Algorithm 2: TheSuperstringGraphalgorithmforSCCS.

1 Input: P asetoflinearwords. Output:w agreedysolutionforSCCS;

2 buildGPtheSuperstringGraphofP ;

3 computeacycliccoverc= (c1,. . . ,cn)ofGP;

4 for i ∈ [1, n]do

5 traverseci:listthewordsofP whosenodeareinciandcreatewibyconcatenatingcyclicallypr(sj, sk)wheneverskisthesuccessorofsj;

6 insertthecyclicstringwiinw;

7 return w

4. ImplicationsoftheSuperstringGraph

4.1. SCCSwithacardinalityconstraintongreedysolutions

We cansplitup GP intoconnectedcomponents: GP

= {

C1

,

. . . ,

Cg

}

.ByProposition 4,eachmulti-cycle c ofGP can be partitionedinasubsetofcyclesc

=

S1

∪ . . . ∪

Sg,whereeachsetofcyclesSi coversaconnectedcomponentCiofGP.

Algorithm 2, on its third line (in blue),1 computes a cyclic cover of G

P. This requires to perform a random choice when facing two alternative out-goingarcs.Clearly, the combinationof all possiblesuccessive choicesallows to produce all possiblecycliccovers ofGP.Certain choiceswill generateasolution containingseveralcycles tocoverone connected component(seeFig. 7b),althoughitisalwayspossibletocoveroneconnectedcomponentwithasinglecycle(seeFig. 7c). Thus, anytwocycliccoversmaydifferintheirnumberofcycles,i.e. intheircardinality.Thisgivesrisenaturallytoanew question,toanewvariantofSCCS,whichwedefinebelow.

Problem2(SCCSwithacardinalityconstrainton

greedy

solutions). Input: Asetoflinearstrings P .

Output: A setofcyclicstringsc,withtheleastnumberofelements,thatisasolutionofthegreedyalgorithmonShortest CyclicCoverofStrings.

(11)

Fig. 7. A SuperstringGraphandSCCSsolutions.(a)ThegeneralisedsuffixtreeandtheSuperstringGraph(SG)forP:= {abb, bbb, bbc}.SquarenodesofT

representstringsthataresuffixesofsomewordofP :theintegerinthesquarenodeisastartingpositionofitssuffix.AwordofP isidentifiedbyits colour.Figure (b)and(c)showthetwomulti-cyclesthatcovertheSGandtheirrespectivesetsofcyclicstrings.Thisexampleiswellknownforshowing thelowerboundoftheapproximationratioofthegreedyalgorithmforSLiS.TheSGforthisinstanceisconnectedandthusitalsogivesanoptimalcyclic superstring(asolutionofSCyS).AstheEuleriancircuitvisitstherootoftheGST,thisoptimalcyclicsuperstringcanbecutattheroottoobtainanoptimal linearsuperstring.

Algorithm 3: TheSuperstringGraphalgorithmforSCCSwithacardinalityconstraint.

1 Input: P asetoflinearwords. Output:w agreedysolutionforSCCS;

2 buildGP theSuperstringGraphofP ;

3 computeaEulerianmulti-cyclec= (c1,. . . ,cn)ofGP;

4 for i ∈ [1, n]do

5 traverseci:listthewordsofP whosenodeareinciandcreatewibyconcatenatingcyclicallypr(sj, sk)wheneverskisthesuccessorofsj;

6 insertthecyclicstringwiinw;

7 return w

Algorithm 3differsfromAlgorithm 2onlyon thethirdline:insteadofcomputinganycycliccoverof GP,it computes aEulerianmulti-cycleof GP.ThisispossiblesinceeachconnectedcomponentoftheSuperstringGraphisEulerianandit canbedoneinlineartimeinthenormof P .AEulerianmulti-cycleofGP coverseachconnectedcomponentwithasingle cycle, henceitminimisesthecardinalityofthe cycliccover, andasitisa greedysolutionitsolves Problem 2.Weobtain

Theorem 7.TheminimisationofthenumberofcyclicstringsbyAlgorithm 3isillustratedinFig. 7.

Theorem7.

Algorithm 3

solves

Problem 2

inlineartime.

Proof. AstheconnectedcomponentsoftheSuperstringGraphareEulerian,thisisdoneinlineartimeandspaceinthesize ofGP.ByProposition 4,alabelofaEulerianmulti-cycleofGP isanoptimalsolutionfortheProblem 2.

2

Remark.AlthoughforSCCS,allsolutionsof

greedy

areoptimal,someoptimalsolutionsarenotoutputby

greedy

.Thus, thequestionoffindingashortestcycliccoverwithaminimumnumberofelementsisnotsolvedbytheAlgorithm 3(seea counterexampleinFig. 8).

4.2. ConsequencesofthetopologyoftheSuperstringGraph

Thecharacterisation oftheSGcanalsobe appliedtosolveorapproximatethe“classical”shortestsuperstringproblem (SLiS–forShortestLinearSuperstring),inwhichthesolutionisalinearsuperstring,aswellasitsvariantwherethesolution isasinglecyclicword(SCyS-=forShortestCyclicSuperstring).LetwO P T(S Li S),respectively wO P T(SC y S),denoteanoptimal solutionforSLiS,resp.forSCyS.Let w agreedysolutionforSCCS,wehave:

|

w

| ≤



wO P T(SC y S)

 ≤ 

wO P T(S Li S)



.IftheSGis connected,then w is acyclicsuperstringof P ,i.e. asolution ofSCyS.Bytheabove relation,itis anoptimalsolution for SCyS.

(12)

Fig. 8. An instance with an optimal solution that is nota solution of greedy for SCCS (however, the solution ofthe SG is also optimal). Let

P= {abec, bed, c f abe, dgab}.(Top)thegeneralised suffixtree,(middle)theSG,and(bottom)anoptimalsolutionofSCCSthatconsistsinsinglecycle, whiletheSGhastwocomponents.

Now,fromthiscyclicsuperstring,wecaneasilyderivealinearsuperstringof P .Basically,wecutthecyclicstringatthe startofasuffixofawordofP andpre-catenatethecorrespondingprefix.Letv anodeoftheSGsuchthatn

(

v

)

=

d

(

v

)

=

0, thenv isaprefixofsomewordofP .Then v w makesupalinearsuperstringofP .Weobtain

|

w

| ≤



wO P T(S Li S)

 ≤|

v

|+|

w

|

. As

|

v

| ≤



wO P T(S Li S)



, we get

|

v

| + |

w

| ≤

2



wO P T(S Li S)



. In other words, forinstances P that have a connected SG,we obtain a 2-approximation for the shortest linear superstring. Furthermore, if the Eulerian cycle visits the root of the suffix tree, then v isthe empty word andthen

|

w

| ≤



wO P T(S Li S)

 ≤ |

w

|

, meaning that w is an optimal linear super-string.

Notethatthereexistnon-trivial instances,withanarbitrarylargenorm,whoseSGisconnected:anyinstanceinwhich allwordsarecyclicshiftsofanotherbelongtothisclass.AstheSGcanbebuiltandexploredinlineartimein



P



,onecan usethisprocedureasapredicatetocheck whetherSCySorSLiSaresolvableinlineartime.Ifnot,onecanlaunchamore complexapproximationalgorithm[15].

5. Conclusion

WeconsidertheproblemShortestCyclicCoverofStrings,whichissolvedbyAlgorithm

greedy

.Thesetofsolutionsof

greedy

canbeexponential.Nevertheless,withtheSG,weexhibitacharacterisationofthissetinagraphthatisEulerian and linear inthe input size. This enablesus to solve SCCSin linear time, butalso to find a

greedy

solutionwith the leastnumberofcyclicstrings.Moreover,thisapproachcansolveorapproximatesome non-trivial instancesoftheshortest superstringproblems.OurapproachcanbeadaptedtoavariantofSCCSwherethemultiplicityofeachinputwordisgiven asinput.

Currently,approximationalgorithmsforshortestsuperstringproblemsuseaprocedureforsolvingSCCSthattakescubic time (i.e. O

(



P



+ |

P

|

3

)

time) andquadratic spacein the numberof strings. Thisstep represents their time complexity bottlenecksinceotherstepscanbedonein O

(



P

)

time.Usingouralgorithmmakestheoveralltimecomplexitylinearin theinputsize.

(13)

Acknowledgements

This work is supported by ANR Colib’read (ANR-12-BS02-0008), by ANR Institut de Biologie Computationnelle (ANR-11-BINF-0002),andbyCNRS(DéfiMASTODONS).

References

[1]A.Blum,T.Jiang,M.Li,J.Tromp,M.Yannakakis,Linearapproximationofshortestsuperstrings,in:ACMSymposiumontheTheoryofComputing,1991, pp. 328–336.

[2]D.Breslauer,G.F.Italiano,Onsuffixextensionsinsuffixtrees,Theor.Comput.Sci.457(2012)27–34.

[3]D.Breslauer,T.Jiang,Z.Jiang,Rotationsofperiodicstringsandshortsuperstrings,J.Algorithms24 (2)(1997)340–353.

[4] B. Cazaux, E. Rivals, Approximation of greedy algorithms for Max-ATSP, Maximal Compression, Maximal Cycle Cover, and Shortest Cyclic Cover of Strings, in: Proc. of Prague Stringology Conference, PSC, Czech Technical Univ. Prague, 2014, pp. 148–161, http://www.stringology. org/event/2014/p14.html.

[5] B.Cazaux, E. Rivals, Reverseengineering of compactsuffix trees and links: a novel algorithm, J. Discret. Algorithms 28 (2014) 9–22, http:// dx.doi.org/10.1016/j.jda.2014.07.002.

[6] B.Cazaux,E.Rivals, The powerofgreedy algorithms for approximating Max-ATSP, CyclicCover,and superstrings,Discrete Appl.Math. (2015),

http://dx.doi.org/10.1016/j.dam.2015.06.003.

[7]J.Gallant,D.Maier,J.A.Storer,Onfindingminimallengthsuperstrings,J.Comput.Syst.Sci.20(1980)50–58.

[8]A.Golovnev,A.Kulikov,I.Mihajlin,ApproximatingshortestsuperstringproblemusingdeBruijnGraphs,in:CombinatorialPatternMatching,in:Lect. NotesComput.Sci.,vol. 7922,SpringerVerlag,2013,pp. 120–129.

[9]D.Gusfield,Fasterimplementationofashortestsuperstringapproximation,Inf.Process.Lett.51 (5)(1994)271–274.

[10]D.Gusfield,AlgorithmsonStrings,TreesandSequences,CambridgeUniv.Press,1997.

[11]D.Gusfield,G.M.Landau,B.Schieber,Anefficientalgorithmfortheallpairssuffix–prefixproblem,Inf.Process.Lett.41 (4)(1992)181–185.

[12]S.R.Kosaraju,A.L.Delcher,Large-scaleassemblyofDNA stringsandspace-efficientconstructionofsuffixtrees,in:ProceedingsoftheTwenty-Seventh AnnualACMSymposiumonTheoryofComputing,STOC ’95,1995,pp. 169–177.

[13]S.R.Kosaraju,A.L.Delcher,Large-scaleassemblyofDNA stringsandspace-efficientconstructionofsuffixtrees,in:ProceedingsoftheTwenty-Eighth AnnualACMSymposiumonTheoryofComputing,STOC ’96,ACM,NewYork,NY,USA,1996,p. 659.

[14]M.Mucha,Lyndonwordsandshortsuperstrings,in:ProceedingsoftheTwenty-FourthAnnualACM–SIAMSymposiumonDiscreteAlgorithms,SODA, 2013,pp. 958–972.

[15]K.E.Paluch,Betterapproximationalgorithmsformaximumasymmetrictravelingsalesmanandshortestsuperstring,CoRRarXiv:1401.3670[abs],2014.

[16]C.H.Papadimitriou,K.Steiglitz,CombinatorialOptimization:AlgorithmsandComplexity,2ndedition,DoverPublications,Inc.,1998,496pp.

[17]J.Tarhio,E.Ukkonen,Agreedyapproximationalgorithmforconstructingshortestcommonsuperstrings,Theor.Comp.Sci.57(1988)131–145.

[18]S.Teng,F.F.Yao,Approximatingshortestsuperstrings,in:34thAnnualSymposiumonFoundationsofComputerScience,1993,pp. 158–165.

[19]E.Ukkonen,Alinear-timealgorithmforfindingapproximateshortestcommonsuperstrings,Algorithmica5 (1–4)(June1990)313–323.

[20]V.Vassilevska,Explicitinapproximabilityboundsfortheshortestsuperstringproblem,in:MathematicalFoundationsofComputerScience2005,30th InternationalSymposium,Proceedings,MFCS2005,Gdansk,Poland,August29–September2,2005,in:Lect.NotesComput.Sci.,vol. 3618,Springer, 2005,pp. 793–800.

Figure

Fig. 2. Running example with the input set P := { ababb , aab , abba , abaa } . Instance of a cyclic cover of P obtained with a permutation σ
Fig. 4. Example with three overlaps: two dependent ones, between u 1 and v 1 and between u 1 and v 2 , and another overlap between u 2 and v 2 : shown on the strings (a), in the overlap graph (b), and in the Superstring Graph (c).
Fig. 7. A Superstring Graph and SCCS solutions. (a) The generalised suffix tree and the Superstring Graph (SG) for P := { abb , bbb , bbc }
Fig. 8. An instance with an optimal solution that is not a solution of greedy for SCCS (however, the solution of the SG is also optimal)

Références

Documents relatifs

Figure 4: Small extracts of the OCR results obtained on scanned document images: (a) input image (no restoration); (b) k-means segmentation + restoration; (c) single MRF

In order to allow constraint solvers to handle GIPs in a global way so that they can solve them eciently without loosing CP's exibility, we have in- troduced in [1] a

Because this problem is an NP- hard problem there is not a known deterministic efficient (polinomial) method. Non- deterministic methods are in general more efficient but an

A density functional theory (DFT)-based method for calculating the electron paramagnetic resonance hyperfine coupling tensors, using second-order perturbation theory and the

of single photon events extracted using the fits to the missing mass squared distribution of the mixed events to the number of single photon events that passed the analysis cuts for

L’accès à ce site Web et l’utilisation de son contenu sont assujettis aux conditions présentées dans le site LISEZ CES CONDITIONS ATTENTIVEMENT AVANT D’UTILISER CE SITE WEB.

Lettres adressées au Comte Eugène de Courten XIX e siècle» / Carton B 19 «Correspondance entre Eugène de Courten et son frère Pancrace, 1814-1830» (comprenant également des

Dynamic programming with blocs (DP Blocs) decrease the number of la- bels treated by 79% compared to the label correcting algorithm (DP LC), this is due to the reduction of the