HAL Id: lirmm-01525995
https://hal-lirmm.ccsd.cnrs.fr/lirmm-01525995
Submitted on 22 May 2017
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
sci-entific research documents, whether they are
pub-lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
A linear time algorithm for Shortest Cyclic Cover of
Strings
Bastien Cazaux, Eric Rivals
To cite this version:
Bastien Cazaux, Eric Rivals. A linear time algorithm for Shortest Cyclic Cover of Strings. Journal of
Discrete Algorithms, Elsevier, 2016, 37, pp.56-67. �10.1016/j.jda.2016.05.001�. �lirmm-01525995�
Contents lists available atScienceDirect
Journal
of
Discrete
Algorithms
www.elsevier.com/locate/jda
A
linear
time
algorithm
for
Shortest
Cyclic
Cover
of
Strings
Bastien Cazaux
a,
b,
Eric Rivals
a,
b,
∗
aLIRMM,CNRSandUniversitédeMontpellier,161rueAda,34095MontpellierCedex5,France
bInstitutdeBiologieComputationnelle,CNRSandUniversitédeMontpellier,860rueSaintPriest,34095MontpellierCedex5,France
a
r
t
i
c
l
e
i
n
f
o
a
b
s
t
r
a
c
t
Articlehistory:
Availableonline19May2016
Keywords:
Greedy
Shortestcycliccoverofstrings Superstring
Overlap
Minimumassignment Graph
Mergingwordsaccordingtotheiroverlapyieldsasuperstring.Thisbasicoperationallows toinferlongstringsfromacollectionofshortpieces,asingenomeassembly.Tocapture amaximum of overlaps, the goalis to infer the shortest superstring ofa set ofinput words.TheShortestCyclicCoverofStrings(SCCS)problemasks,insteadofasinglelinear superstring,forasetofcyclicstringsthatcontainthewordsassubstringsandwhosesum oflengthsisminimal. SCCSis used asacrucial stepinpolynomial timeapproximation algorithms for the notably hardShortest Superstring problem, but it is solvedin cubic time.Thecyclicstringsarethencutandmergedtobuildalinearsuperstring.SCCS can alsobesolvedbyagreedyalgorithm.Here,weproposealineartimealgorithmforsolving SCCSbasedonaEuleriangraphthatcapturesallgreedysolutionsinlinearspace.Because thegraphisEulerian,thisalgorithmcanalsofindagreedysolutionofSCCSwiththeleast numberofcyclicstrings.ThishasimplicationsforsolvingcertaininstancesoftheShortest linearorcyclicSuperstringproblems.
©2016TheAuthor(s).PublishedbyElsevierB.V.Thisisanopenaccessarticleunderthe CCBYlicense(http://creativecommons.org/licenses/by/4.0/).
1. Introduction
The possibility ofmergingtwo words intoa longerone accordingto their overlap– forinstancemerging abcde with cdeba into abcdeba – allows to infer the sequence of a target molecule from the short reads produced by sequencing machines. Incomputerscience, mergingresultsina superstring oftheinput words.Theactionofmergingisheavily used in any genome assembler (although not necessarily on the complete reads, sometimes only on substrings ofthe reads) andgiventheredundancy ofthesequencing,whichyields ahighdensityofreads, theobjectiveistocompute ashortest superstringofasetofinputwords[10].Inbioinformatics,thetaskofgenomeassembly isworsenedby theinaccuracyof the sequencingandby thestructure ofthe genomewhichcontains repeatedsequences [10].However, progresseson the modelquestion, calledShortest(linear)SuperstringProblem (SLiS)–oroftenShortestCommon Superstring– stillshedlight on thedifficultyofassembly.Moreover,thetarget moleculecanbe alinearora circularstructure.Itisalsointeresting to addressthequestionoftheShortestCyclicSuperstring (SCyS).Findingsuperstringsalsohasapplicationindatacompression. Indeed,asasuperstringrepresentsacondensedrepresentationoftheinitiallanguage,itcanserveasacompressedversion oftheinputset.
Let P
:= {
s1,
. . . ,
s|P|}
beasetofinputwordsanddenotethesumoftheirlengthsbyP,i.e. thenormof P .Without loss ofgenerality, we always assume that P isfactor free,i.e. for anytwostrings of P ,noneis asubstring of theother. From thecomputationalviewpoint,finding ashortestsuperstringisknowntobehardtosolve(NP-hardevenonabinary*
Correspondingauthorat:LIRMM,CNRSandUniversitédeMontpellier,CC477,161rueAda,34095MontpellierCedex5,France.Tel.:+33467418664.E-mailaddresses:[email protected](B. Cazaux),[email protected](E. Rivals).
http://dx.doi.org/10.1016/j.jda.2016.05.001
1570-8667/©2016TheAuthor(s).PublishedbyElsevierB.V.ThisisanopenaccessarticleundertheCCBYlicense (http://creativecommons.org/licenses/by/4.0/).
alphabet) andto approximate(Max-SNP hard) [7,1].The research inthe last 25years hasmainlyfocused on polynomial timeapproximationalgorithmsforSLiS(see[8]forarecentlist).Despitetheseefforts,thealgorithm
greedy
(Algorithm 1), an algorithmthat iteratively updatestheinput setby mergingtwo maximally overlappingstrings untilone stringis left, wasconjectured toapproximatethelengthoftheshortestsuperstringwithratio2[17,1].Thisisbetterthanmostknown algorithms,andtheconjectureisstillvalid.Forlinearsuperstringsontheconstantsizealphabets,Ukkonenproposedalinear timeimplementationofthealgorithmgreedy
basedontheAho–Corasickautomaton[19];later, anotherimplementation usingthesuffixtreewasgivenin[12,13].Thequestion ofapproximatingthe superstringcanbe reformulatedona completeweighteddigraph, theso-called dis-tancegraph:each input wordisavertexandan arcencodesthedirectional mergingofthetwo words.The weightofan arcfromu to v equalsthelengthoftheprefixofu thatisnot coveredbytheoverlapofv (e.g.,thecorrespondingprefix ofabcde to cdeba isab, andtheweightwouldbe 2). Thena shortestlinearsuperstringisassociatedwithaHamiltonian path on this graph,and a shortestcyclic superstring witha Hamiltonian cycle[1]. Yet those problems are also hard to approximateandmostapproximationalgorithmsforSLiSresorttoarelaxedquestion,namelyShortestCyclicCoverofStrings (SCCS).ThesealgorithmssolveSCCSbycomputingaminimumassignmentonthedistancegraphinO
(
P+ |
P|
3)
[16],andthentransformthecyclesintocyclicstrings.OurworkfocusesonSCCS.Weproposeanalgorithmthatusesonlylineartime in
P,andforthissakeintroducethesuperstringgraph,whichcapturesallsolutionsofgreedy
forSCCS(seeSections4and3). Thesegreedysolutions arecalledgreedysuperstrings. Exploringthe graphproperties,we show that itgives,fora subsetofinstancesoftheshortestlinearandcyclicsuperstringproblems,interestingapproximationsorevenexactsolutions (Section4).
1.1. Notationandproblemdefinition
ForanyrootedtreeT ,VT isthesetofnodesofT .Foranode v ofVT,ChildrenT
(
v)
isthesetofchildrenofv inT .We denoteby[
i,
j]
thesetofintegersbetweeni and j.Weconsidertwokindsofstrings:linear andcyclicstrings (seeFigs. 1a and 1b).Forastrings,thelengthofs is
|
s|
.Fora linearstrings andi≤
j in[
1,
|
s|]
,s[
i,
j]
isthelinearsubstringofs beginningatthepositioni andendingattheposition j, s[
i]
is thesubstring s[
i,
i]
, s[
1,
j]
isa prefix of s and s[
i,
|
s|]
isa suffix of s. Aprefix (or suffix)s of s isproper ifs is differentofs. Foranotherlinearstringt,theoverlap froms tot,denotedby ov(
s,
t)
, isthelongestsubstring whichisa propersuffixof s andaproper prefixoft.The prefixfroms tot,denotedby pr(
s,
t)
,issuchthat s=
pr(
s,
t)
ov(
s,
t)
.The merge ofs witht isthelinearwordpr(
s,
t)
t ifs=
t,andthecyclicwordpr(
s,
t)
otherwise.Alinearsuperstring ofasetP of linearstringsisalinearstring w suchthateachstringofP isasubstringofw.Wesaythatalinearstringw isasubstring ofacyclicstringc ifthereexists wc alinearpermutation ofc such that w isa substringofw∞c (where w∞c=
wcwc. . .
) –seeFig. 1c.Acycliccover of P isasetC ofcyclicstringssuchthateachlinearstringofP isasubstringofacyclicstring of C .Fig. 1. (c) Shows how the linear word (a) is included in the cyclic word (b). Wedefinethefollowingproblem:
Problem1(Shortestcycliccoverofstrings). Input: Asetoflinearstrings P .
Output: AcycliccoverC ofP thatminimises
C.Let
σ
be a permutation of{
1,
. . . ,
|
P|}
, we can split upσ
in cyclicpermutations:σ
=
σ
1◦ . . . ◦
σ
m. Foreach cyclic permutationσ
ioflengthni,wedenotethecyclicstringby:P
(
σ
i)
:=
pr(
sσi(1),sσi(2)) . . .pr(
sσi(ni−1),sσi(ni))pr(
sσi(ni),sσi(1)),and
P
(
σ
)
:= {
P(
σ
1), . . . ,
P(
σ
m)}
(SeeFig. 2).Itiseasytosee,thatforeachpermutation
σ
of{
1,
. . . ,
|
P|}
,P(
σ
)
isacycliccoverof P andforeachoptimalsolutionw ofProblem 1(SCCS),thereexistsσ
apermutationof{
1,
. . . ,
|
P|}
suchthatw=
P(
σ
)
.1.2. Relatedworks
Fig. 2. Running examplewiththeinputsetP:= {ababb, aab, abba, abaa}.InstanceofacycliccoverofP obtainedwithapermutationσ.Weobtainthe cycliccoverP(σ) = {ababba, aba}(ababba andaba arecyclicwords).
SinceSLiSdoesnotadmitaPTAS,numerousworksonSLiShavereportedpolynomialtimeapproximationalgorithmswith aconstantratiorangingfrom4 to
>
2 (see[8, Table 1]).AmajorityofthesemodelSLiSasaminimumweightHamiltonian path problemonthedistancegraph G of P (whereanarc(
si,
sj)
isweighted by pr(
si,
sj)
foranyvalidi,
j).Notethat ifinstead itis weightedby ov(
si,
sj)
, weobtain theOverlapGraph,andSLiS becomesequivalent to findinga maximum weight Hamiltonianpath[1].Clearly, theweight ofacycliccoverof G (i.e.,the sumoftheweight ofall cycles)islower thanthat ofaHamiltoniancycle,whichitselfgivesalowerboundoftheweightofaHamiltonianpathonG.Hence,most approximationalgorithmsforSLiSrelyonsolvingSCCS,whichispolynomial.Blumetal.[1]werethefirsttoexhibitsuchan algorithmandtoprovethatitachievesaconstantratioof3 forSLiS.ThisalgorithmfirstsolvesSCCSandinasecondstep combinesthecyclicstringsbyrunningagreedyalgorithmonthem.TengandYaoimprovedthisratiobyanalgorithmthat recursivelysolvesSCCS[18].Afterthat,manyauthorshaveinvestigatedthecombinatorialpropertiesofwordsandofsome specificcycles,whichenabledthemtofurtherimprovetheapproximation(e.g.[3,14,15]),butallneedtosolveSCCS.Todo so,they computea MinimumAssignment onthedistancegraph.Anassignment(or cycliccover)ofa digraph G isasetof cyclessuchthatevery vertexofG belongstoexactlyonecycle.TosolveSCCSonaninstance P ,approximationalgorithms for SLiS first1/ build the distancegraphof P , with|
P|
nodes and|
P|(|
P| −
1)
arcs,which requires O(
P+ |
P|
2)
time andspaceusingtheGeneralisedSuffixTree(GST)of P [11],then2/computeaminimumcycliccoverusingtheHungarian algorithmin O(
|
P|
3)
time[16].Oncethecycliccoverisknown,thealgorithmin[18]alsorequirestocomputeaparticular assignment calledacanonicalassignment, whichtakesanadditional O(
P)
usingthesuffix treebasedalgorithmof[9]. However, thecomplexitybottleneckofalgorithmsusingacycliccoverremainsthe O(
P+ |
P|
3)
time tobuildthegraph andcomputethecover.Itwasalsoshownthatgreedy
(Algorithm 1)solvesSCCSexactly[4].Tothebestofourknowledge, thereiscurrentlynolineartimealgorithmforsolvingSCCS.Algorithm 1: Thealgorithm
greedy
forShortestCyclicCoverofStrings.1 Input: P asetoflinearwords;
2 Output: S asetofcyclicstringsthatcoverP ;
3 S:= ∅
4 while P is notempty do
5 u andv twoelementsofP havingthelongestoverlap(u canbeequaltov)
6 w isthemergeofu andv
7 P:=P\ {u, v}
8 if u =v (i.e.w isacyclicstring) then S:=S∪ {w}else P:=P∪ {w}
9 return S
Theorem1([6]).ForShortestCyclicCoverofStrings,Algorithm
greedy
yieldsanoptimalsolution. 1.3. Explanationsaboutalgorithmgreedy
An overlapissaidtobegreedy-feasible ifagreedysolutiontakesit.Letu1, u2, v1 and v2 befourstrings.Wesaythat
the overlapsov
(
u1,
u2)
andov(
v1,
v2)
aredependent ifandonlyifu1=
v1 oru2=
v2.WhenAlgorithmgreedy
mergesu withv,itchoosestheoverlapbetweenu andv andforbidsanyoverlapbeginningbyu orendingbyv,i.e. itforbidsall overlaps that aredependent ofov
(
u,
v)
.Wesaythat a setofoverlapsis independent ifeach overlapisnot dependentof another.Thus,Algorithm
greedy
makesaseriesofchoicestoselectasequenceofgreedy-feasibleandindependentoverlaps:•
InthefirststepoftheAlgorithmgreedy
choosesanoverlapamongthelongestoverlaps.•
Inthesecondstep,itchoosesanoverlapamongthelongestoverlapswhichareindependentofthefirstoverlap.•
Ateachfurtherstep,ittakesamaximumoverlapthatisindependentofthepreviouslyselectedoverlaps.•
Intheend,itoutputsasetofoverlapsthataremaximal(fortheinclusionasabaseinamatroid).Thissetofoverlaps isindependent.Asthesetofoverlapsattheendof
greedy
ismaximal,thereexistsapermutationσ
of{
1,
. . . ,
|
P|}
,suchthatforalli in{
1,
. . . ,
|
P|}
,ov(
si,
sσ(i))
isanoverlapofoursetofoverlaps.Thus,weobtainthefollowingpropertywhichissimilartoa well-knownpropertyonthesolutionofthegreedyalgorithmforSLiS[20, Def. of Induced(
π
,
S)
and Def. 2]:Proposition2.Letw agreedysolutionofSCCS.Thereexistsapermutation
σ
of{
1,
. . . ,
|
P|}
suchthatw=
P(
σ
)
. 2. GeneralisedSuffixTree,overlap,merge,andRed–BluepathsThesuffixtreeofawordw isadatastructurethatindexesallsubstringsofw.Eacharcislabelledwithasubstringofw suchthattheconcatenationofthelabelsalongabranchfromtheroottoaleafspellsoutasuffixofw.Aconstraintrequires that thelabelsof arcsfromanode to itsoffspring start withdifferentsymbols.Thus, each noderepresents aprefix ofa suffix,i.e. asubstringof w.Infact,eachinternalnode isthelongestcommonprefixofallleavesinitssubtree.Traversing an arcextends therepresented stringon itsright. The worddepth of a node isthelength of thestringit represents.An exampleof(generalised)suffixtreeisshowninFig. 3:thesolesuffixtreestructureisdisplayedinFig. 3a.Forinstance,the node 3 representsthestringaa.Goingfromnode 3 tonode 1 extends thestringrepresentedbya b toyieldthestring aab.
STare equippedwithadditionalarcs calledSuffixLinks (SL):a SL linksanode representing au (witha aletter andu a string)to thenode representingu. LetuslabeltheSL by thelettera.Thenotion ofSLis usuallyrestrictedto internal nodes,butitextendsnaturallytoleaves.Weconsiderthateachnode hasitsSL.TraversingaSL removesthefirstletterof theword:itshortenstherepresentedwordatthefront.ToeachSL,weassociatealabel,whichistheletteritremoves.For instance,inFig. 3,traversingtheSLofnode 1 shortensthewordaab toab,whichisthestringrepresentedbynode 2.The labelofthatSLisa (labelsofSLarenotrepresentedinthisfigure).Itisknownthat,ifSLsarereversed(consideredinthe otherdirection),theyformatreerootedintherootoftheST,andallleavesareonthesamebranchofthistree[5].Note thatthechainofSLstartingfromanynodegoesuptotherootofT (Forsimplicity,assumetheroothasaSLtoitself).The treecomposedbythesuffixlinksisdisplayed onFig. 3b.Moreover,inthefigures,theSLaredrawnasanarcfromachild toitsparentinL (seeFig. 3).
Forsimplicity,wesaythata suffixtreearcisblue,whileSLarered.So,wehavethebluetree,denoted T ,andthered tree,denotedL,onthesamenodes.
The GST generalises thisprinciple toseveral words. From now on,the generalised suffix tree of P isthe pair
(
T,
L)
. AnextensiveintroductiontoGSTandtheirapplicationscanbefoundin[10].Thedistinctionbetweenexplicitandimplicit nodesisskippedhere,butexplainedin[10,2].2.1. Overlaps
Basically,allsuffixesandallsubstringsofwordsof P arerepresentedintheGST. Ifawordu overlapsaword v,then a suffixof u equals toa prefixof v. Hence, oneexpectsthe overlapto berepresented somewhereinthe GST. Indeed,if aninternalnoderepresentingthesuffixofu appearsonthebranchspelling v,thenu overlaps v.Thisinternalnodeexists sincethissubstringoccursatleasttwice(inu andinv), andrepresentstheoverlap.Letuscallsuchnodean overlapnode fromu to v.Inourrepresentation,theinternalnodeislabelledwiththestarting positionofthatsuffixinu onthebranch leading totheleafof v.Forexample,inFig. 3 thenode 2 islocatedonthebranchof 1.Itisalsoonthebranchesof 1 and 1 .Itrepresentsthestringab.Soweknowthataaboverlaps
abba
,abaaandababb,withtheoverlapab.Clearly,alloverlapnodesfromu tov areancestorsofanother.Moreover,theworddepthofan overlapnode givesthe lengthofthecorrespondingoverlap.Hence,themaximumoverlapbetweentwowordsisthedeepestoverlapnodeofthose words.Forinstance,both nodes 4 and 3 areancestorsof 1 meaningthatabaaoverlapsaab withtwopossibleoverlaps: a andaa,oflength1 and2 respectively.
Allthesenotionsareknown.Indeed,itisknownthatonecancomputeefficientlyalloverlapsforeachpairofwordsin P usingtheGSTof P .Infact,one usuallycomputespairwise themaximumoverlapsfor P [11],tobuild theprefixorthe overlapgraphofP (asmentionedinintroduction).
2.2. MergeandRed–Bluepaths
Abovetheoverlapnode was definedonlyusing T ;theSL(hence L)were notinvolved.Letusnowgive analternative definitionusingT and L.
Fig. 3. (c) TheGeneralisedSuffixTree(GST)ofP:= {ababb, aab, abba, abaa}.TheGSTofP isinfactcomposedoftwotrees:(a)thetreeT ,whichisthe basicstructureofthesuffixtree–inbluearcs,and(b)L,thetreemadebythesuffixlinksofT –inredarcs.Squarenodesrepresentwordsthatoccuras asuffixofsomeinputword,circlenodesaretheothernodesofT .EachsquarenodestoresitspositionsofoccurrencesinS;forsimplicity,wedisplaythe startingpositionasanumberandthewordofS inwhichitoccursasitscolour,insteadofshowingthelistofpairs
(
i, j).Thesquarenodelabelledbya red2,whichrepresentsthestringab,meansthatthisstringoccursasthesuffixstartingatposition2 intheredwordaab.A squarenodelabelledwitha 1 representsaninputword.(Forinterpretationofthereferencestocolourinthisfigurelegend,thereaderisreferredtothewebversionofthisarticle.)Ifu overlapsv,thenasuffixofu equalsaprefixof v.Wehaveseenthattheoverlapsisanodeonthebranchspelling v. We canalsofindthisprecisesuffix ofu startingatthenode representingu byfollowingenough suffixlinks. Thismeans going upintheredtreefromthenodeu.Hence, oneseesthatanoverlapnodefromu to v isan ancestorofu in L (the red tree),andanancestorofv in T (thebluetree).Weobtainanalternativedefinition:anoverlapnodefromu to v isa red ancestorofu andablue ancestorofv.Forsimplicity,letussayitisared–blue ancestoroftheorderedpair
(
u,
v)
.Now,weseethatthemaximumoverlapof
(
u,
v)
isthelowestred–blueancestorof(
u,
v)
.Also, iftheredpathreaches theroot,thentheoverlapistheemptyword(andtheirmergeissimplytheirconcatenation).Inotherwords,anyoverlapforapair
(
u,
v)
correspondstoaRed–Bluepath fromu tov in(
T,
L)
.The merge of u and v usingtheir maximumoverlap ov
(
u,
v)
is the word w:=
pr(
u,
v)
ov(
u,
v)
su(
u,
v)
, where u:=
pr(
u,
v)
ov(
u,
v)
and v:=
ov(
u,
v)
su(
u,
v)
.We state this definitionwith the maximum overlap,but it is validwith any overlap; thelargertheoverlap,theshorterthemerge.Wedenoteby RB-path(
u,
v)
theRed–Bluepathwhichcorresponds tothemaximumoverlapbetweenu andv.Assumewewanttospelloutthe(linear)mergeofu and v usingtheoverlapnode w in
(
T,
L)
.Itcanbedoneinthree steps:1. traverseT fromtheroottou:thisspellsoutu withthelabelsofbluearcs;
2. goup L fromu to w followingtheredarcs:itispossiblesince w isaredancestorofu (here weskiponebyone all lettersofpr
(
u,
v)
);3. godownT from w to v usingbluearcs,whichispossiblesincev isadescendantof w;thisspellsoutsu
(
u,
v)
. A mergeof(
u,
v)
isa linearsuperstringof{
u,
v}
.The process abovecan begeneralised tospellout anycyclic super-string of a cyclicpermutation of words(
w1,
w2,
. . . ,
wn)
using some givenpairwise overlaps: one starts from thenode representing w1,then performs steps2and3usingthe appropriateRed–Blue pathfrom w1 to w2 tomergethem, theniteratesonlysteps2and3foreachfollowingworduntilonereaches w1 again.ThiscircuitontheGSTiscalledaRed–Blue
circuit.
Example1. For instancein Fig. 3,the Red–Bluecircuit corresponding to thecyclic mergeof thewords ababb,abba,and thenaabinthisorder,containsthreeRed–Bluepaths:forthepairs
(
ababb,
abba)
,(
abba,
aab)
andthenfor(
aab,
ababb)
.Thiscircuitstarts atnode 1 goesup withSLtonode 2 andthen 3 whichisared–blue ancestorof
(
ababb,
abba)
,then goes down withbluearcsto 1.Thisisthe Red–Bluepath for(
ababb,
abba)
, whichspells outab with the SLlabels.Then the circuitgoesupwithSLto 2, 3,and4,whichisanancestorof 1.Thisspellsoutabb withSLlabels.Then,itgoesdownto 1.ThismakesuptheRed–Bluepathfor(
abba,
aab)
.Then,thecircuitgoesupto 2,whichisanancestorof 1 ;thisspells outa withone redlabel.Then, itgoesdownandbackto 1 ,whichwas thestartingnode.Altogetherthesequenceofred labelsspellsoutababba,whilethesequenceofbluelabelsyieldsaababb
:bothwritethesamecyclicword.Now,letusstateamoresubtleproperty.ForaRed–Bluecircuit,writingallbluelabelsinorderorwritingallredlabels inorderspellout thesamesuperstring(aredlabelisalabelofaSL).Hence, thenumberofbluearcsissmallerthanthe numberofredarcs,whichisthesuperstringlength.Aproof ofaslightlydifferentversionofthispropertycanbefoundin
[5, Section 4.3 on p. 18]. 3. Thesuperstringgraph
Here, weare going to show that thechoicesmade byAlgorithm
greedy
forSCCS are not reallychoices. Infact, we willshowthateachgreedysolutioncanberepresentedbyauniquegraphembeddedintheGeneralisedSuffixTree.Asthis graphturnsouttobecomputableinlineartime,onecanderiveagreedysolutionforSCCSinlineartime.3.1. Equivalenceofgreedysolutions
Letw beagreedysolutionofSCCS.ByProposition 2,thereexistsapermutation
σ
suchthat w=
P(
σ
)
.Foreachcyclic permutationσ
iofσ
,wedenotebyc(
σ
i)
theRed–Bluecircuitobtainedfromσ
i intheGST.We denoteby GP
(
σ
)
= (
V,
R,
B)
thedirected multigraphbuiltwith thearcsof theRed–Blue circuitsof allthe c(
σ
i)
, whereR denotesthesetofRed arcs,andB denotesthesetofBlue arcsofallthec(
σ
i)
.AgraphissaidtobeEulerian,ifthereexistsaEulerianpathonthisgraph,i.e. apath thatvisitseacharcofthegraph exactlyonce.As GP
(
σ
)
isacollectionofRed–Bluecircuits,whicharemadeofRed–Bluepaths,eachconnectedcomponent ofGP(
σ
)
isEulerian.(Witheachadditionalcircuit,thegraphremainsEulerian.)Foreachpermutationof
{
1,
. . . ,
|
P|}
,we canbuildthecorresponding multigraphGP()
.Wehavethefollowing proposi-tion.Proposition3.Letw1andw2betwogreedysolutionsofSCCSandlet
σ
1andσ
2betheirrespectivepermutationsof{
1,
. . . ,
|
P|}
.ThenGP
(
σ
1)
=
GP(
σ
2)
.Proposition 3statesakeypropertyofgreedysolutionsforSCCS.Below,wegivetwolinesofproof:thefirstoneexplains whythepropertyisvalid(andbetterconveystheintuitiontothereader),whilethesecondgivesamoreformal,butless explanatory,demonstration.
Fig. 4. Example withthreeoverlaps:twodependentones,betweenu1andv1andbetweenu1andv2,andanotheroverlapbetweenu2andv2:shownon
thestrings(a),intheoverlapgraph(b),andintheSuperstringGraph(c).
Proof. Let w
=
P(
σ
)
beagreedysolutionofSCCS.Letusshow thatthechoicesmadebyAlgorithmgreedy
forSCCSdo notinfluenceGP(
σ
)
.greedy
facesachoicewhenseveraloverlapsofthesamelengtharedependent(Clearly,independent overlapscanbeswappedinapermutationandthusdonotinfluencethegraph).Let ov
(
u1,
v1)
and ov(
u1,
v2)
be two dependent andgreedy-feasible overlaps. Assume that ov(
u1,
v1)
is the overlapchoosenby
greedy
toobtain w,andletu2 inP bethesuccessorofv2inσ
.Itimpliesthatov(
u2,
v2)
istheoverlapthatmergesu2 andv2in w.Bythedefinitionof
greedy
,wehavethat|
ov(
u1,
v1)
| ≥ |
ov(
u2,
v2)
|
.LetusshowthatthereexitsAs
|
ov(
u1,
v1)
| ≥ |
ov(
u2,
v2)
|
,wehavethat ov(
u2,
v2)
isaprefix ofov(
u1,
v2)
,andthus aprefixofov(
u1,
v1)
.HenceintheGST,thenodeov
(
u2,
v2)
isanancestorofov(
u1,
v1)
,andov(
u1,
v1)
isanodeintheRed–Bluepathsfromu1 to v1andfromu2 tov2.Thus,wehavebothRed–Bluepathsfromu1 tov2 andfromu2to v1areinGP
(
σ
)
(seeFig. 4).2
Analternativeproof. Let
σ
1andσ
2 betwopermutationsofP correspondingtotwodistinctgreedysolutionsforSCCS.WesetGp
(
σ
1)
:= (
V1,
R1,
B1)
andGp(
σ
2)
:= (
V2,
R2,
B2)
.WeproceedbycontrapositionandassumethatGp
(
σ
1)
=
Gp(
σ
2)
;then((
R1∪
B1)
\(
R2∪
B2))
∪((
R2∪
B2)
\(
R1∪
B1))
= ∅
.As the role of bothpermutations issymmetrical, we assume that
(
R1∪
B1)
\ (
R2∪
B2)
= ∅
.Lete:= (
u,
v)
be an arcof(
R1∪
B1)
\ (
R2∪
B2)
suchthatmax(
|
u| ,
|
v|)
ismaximalamongallpossiblearcsof(
R1∪
B1)
\ (
R2∪
B2)
.AccordingtothedefinitionofGp
(
σ
1)
,thereexisti and j suchthatσ
1(
i)
=
j ande isanarcontheRed–Bluepathlinking si tosj (formally e∈
RB-path(
si,
sj)
).Twocasesarise.Case e
∈
B1:Let i2 denotethepredecessor of j in
σ
2, inotherwords, theelementofσ
2 satisfying j=
σ
2(
i2)
. But,one gets thatov
(
si2,
sj)
isaprefixofov(
si,
sj)
,sinceotherwisetherewouldexist j1:=
σ
1(
i2)
,suchthatov(
si,
sj)
<
ov(
si2,
sj1)
,whichwouldimplythate isnotmaximal.Itfollowsthate belongstoRB-path
(
si2,
sj)
,andthusto B2,whichcontra-dictsourhypothesis. Case e
∈
R1:Let j2 denote the successorof i in
σ
2, that is j2:=
σ
2(
i)
. Clearly, ov(
si,
sj2)
must be a suffix of ov(
si,
sj)
, sinceotherwise therewouldexisti1 suchthat j2
=
σ
1(
i1)
andov(
si,
sj)
<
ov(
si1,
sj2)
,whichwouldmeanthate is notmaximal.Oneobtainsthate
∈
RB-path(
si,
sj2)
,andthuse belongsto R2,whichisacontradiction.Thisendstheproof.
2
ByProposition 3,thegraphGP
()
representsallthesolutionsofthegreedyalgorithmforSCCS.Wecalledthisgraph,the SuperstringGraph of P ,anddenoteitbyGP.AnexampleofSuperstringGraphforP:= {
abb,
bbb,
bbc}
isshowninFig. 7.Tosumup,weknowthat foreach greedysolution w,thereexistsapermutation of
{
1,
. . . ,
|
P|}
that canbetranslated intoasetofRed–Bluecircuits,c(
w)
= {
c1(
w),
. . . ,
cm(
w)
}
.RememberthateachRed–BluecircuitismadebychainingRed– Blue paths. Letc be amulti-cycle of G.We saythat c covers GP ifeach arcof GP is includedina single cycleof c.ByProposition 3,foreachgreedysolution w wehavethatc
(
w)
coversGP.Bythefollowingproposition,wehavethateachmulti-cyclethatcoversG comesfromagreedysolution. Proposition4.
{
multi-cycle p|
p covers GP}
= {
c(
w)
|
w is a greedy solution}
.ToproveProposition 4,wefirstneedalemma.Acrossing inGP isanode thathasredin- andout-goingarcsandblue in- andout-goingarcs.
Lemma5(Absenceofcrossing).GPdoesnotcontainanycrossing.
Proofof
Lemma 5
. It is enough to see that the only nodes that can have a red out-goingarc and a blue in-going arc correspondtothewordsof P .Butifv∈
P , v isaleafinT andinL bydefinitionoftheGST,andthus v hasneitherablue arcgoingout,noraredarcgoingin.2
Fig. 5. Illustration fortheproofofProposition 4.Intheproofof⊆,theargumentistheexistenceofacrossingnodeinanon-greedy solution.Thisfigure displaysthefournodes v1, . . . , v4 andthenodes(circles)representingtheiroverlaps.Then,thenoderepresentingov(v3, v4)(greynode)isacrossing, i.e. ithasblueandredin-goingandout-goingarcs.(Forinterpretationofthereferencestocolourinthisfigurelegend,thereaderisreferredtotheweb versionofthisarticle.)
Proofof
Proposition 4
. To understandthis equality,it issufficient tonote that there is arelation betweenchoosing the longestoverlapingreedy
andtheabsenceofcrossinginGP.Proofof
⊆
.Letusshow
{
multi-cycle p|
p covers GP}
⊆ {
c(
w)
|
w is a greedy solution}
by contraposition.Assumethatthere exists a multi-cycle p coveringGP andtheredoesnotexist agreedysolution w of P suchthat p=
c(
w)
.Byhypothesis, there exist four nodes v1,
v2,
v3 and v4 in P that do notsatisfy the greedyproperty. In other words,although|
ov(
v1,
v2)
| <
|
ov(
v3,
v4)
|
,both P ath(
v1,
v4)
and P ath(
v3,
v2)
arein p. Hence,the nodeov(
v3,
v4)
isa crossingin GP (see Fig. 5).As theredoesnotexistanycrossinginGP (Lemma 5),weobtainacontradiction,andget{
multi-cycle p|
p covers GP} ⊆ {
c(
w)
|
w is a greedy solution}.
Proofof⊇
.Reciprocally,assumethatthereexistsagreedysolution w of P andtheredoesnotexistamulti-cycle p thatcoversGP such that p
=
c(
w)
.Thus,there existsan arcofc(
w)
whichisnot an arcofGP.Among thosearcs,let e= (
u,
v)
bethe one withthedeepest node(i.e. longest word) v in T .Bythe definitionof GP, e belongs to R (i.e. is ared arc),because it is the deepest ofits kind. Let v1 and v2 be the nodesof P such that e belongs to P ath(
v1,
v2)
. As e is not in GP, foreach u in P such that P ath(
v1,
u)
is in GP, u=
v2.Thus, there exists v4 in P such that P ath(
v1,
v4)
is in GP and|
ov(
v1,
v2)
| < |
ov(
v1,
v4)
|
.As w isagreedysolutionof P ,thereexists v3 in P andnotinGP suchthat P ath(
v3,
v4)
isinc
(
w)
and|
ov(
v1,
v4)
| ≤ |
ov(
v3,
v4)
|
.Thus,there existsan arcinc(
w)
that isnot inGP witha deeperextremity thane. Thiscontradictsourhypothesisandweget{
c(
w)
|
w is a greedy solution}
⊆ {
multi-cycle p|
p covers GP}
.2
3.2. CharacterisationoftheSuperstringGraphandaconstructionalgorithm
Now,wearegoingtocharacterisetheSuperstringGraphandgivealineartimealgorithmtobuilditfromasetofwords. Forthisalgorithm,wedefineandcomputeweights(respectivelyn
(.)
andd(.)
)fortheredandforthebluearcsoftheGST. Finally,theSGwillbecomposedofthosearcshavingastrictlypositiveweight(andofthenodesvisitedbythem).As
greedy
selectsoverlapsinorderofdecreasinglength,onecanseeitasarecursivealgorithmthatprogressesalong thebranchesoftheGSTof P .OntheGST,greedy
progresses fromtheleavestowardtheroot,i.e. inorderofdecreasing worddepthofthenodes.Letususethisrecursivesideofgreedy
toexplaintheSuperstringGraph.Letv be anodeoftheGST;when
greedy
sees theoverlaps oflength|
v|
,ithasalreadyconsideredoverlaps oflarger lengths.Wedefinetwoweightsn()
andd()
suchthat:•
n(
v)
isthenumberofstringsof P whichhavev assuffixandarenotmergedontheirrightwithanotherstringof P withalargeroverlap.•
d(
v)
isthenumberofstringsof P whichhave v asprefix andare notmergedontheir leftwithanotherstringof P withalargeroverlap.Inother words,n
(
v)
isthenumberofleavesofthesubtreeofv inL thathavenotyetbeenmergedontheirright,and d(
v)
isthenumberofleavesofthesubtreeofv inT thathavenotyetbeenmergedontheirleft.First,itcanbeseeneasilyseenthatanode v whichhasnootherleafbothinitssubtreesinL andinT ,isastringofP . Thus,forthesenodesweset,n
(
v)
=
d(
v)
:=
1.Fortheother nodes,weusetherecursionof
greedy
.Infact, wecan determinehowmanystringsof P havenotyet beenmergedwithanotherstringbylookingtheweightsn(.)
andd(.)
ofthechildrenofv inL andofitschildreninT (seeFig. 6).
Leta
(
v)
bethe difference betweenthesumofthen(
u)
ofthe children u of v inL and the sumofthed(
u)
ofthe childrenu ofv inT : a(
v)
:=
u∈ChildrenL(v) n(
u)
−
u∈ChildrenT(v) d(
u).
Hence, theabsolutevalue ofa
(
v)
is thelargest numberofstrings of P thatgreedy
can mergewithother strings of P usingtheoverlap v.Witha(
v)
,wecancalculaten(
v)
andd(
v)
:If a
(
v)
≥
0,
we have n(
v)
:=
a d(
v)
:=
0,
if a(
v) <
0,
we have n(
v)
:=
0 d(
v)
:= −
a.
Withthisrecursive definitionoftheweights n
()
andd()
, wecan constructtheSuperstring GraphfromtheGST of P . WhenevertheweightofanarcoftheGSTisequalto0,thisarcdoesnotbelongtotheSuperstringGraph.Similarly,iffor anode,sayv,allofitsin-goingandout-goingarcshaveazeroweight, v isisolatedandisnotincludedintheSuperstring Graph–thiscorrespondstothesubsetU inDefinition 1(whichstatesanotherdefinitionoftheSuperstringGraph).Definition1.TheSuperstringGraphGP
= (
V,
R,
B)
where V=
VT\
UFig. 6. General schemeofthein- andout-goingarcsofanodev inGST.TheredarcsbelongtoR andthebluearcstoB.Ifaweightiszero,thearcis notincludedintheSuperstringGraph.(Forinterpretationofthereferencestocolourinthisfigurelegend,thereaderisreferredtothewebversionofthis article.)
B
= {(
t,
v)
d(v)|
v∈
VT,
t the parent of v in T}
andU
= {
v∈
VT|
vis not an extremity of an arc of R∪
B}
wheren()
andd()
aretheweightsonarcsasdefinedabove. Now thatwecanbuild theSuperstringGraphforaninput P by traversingthegeneralisedsuffixtreeof P ,weprovide analgorithm(Algorithm 2)thatusesthisgraphtosolveSCCSforP .Theorem6.
Algorithm 2
solvesSCCSinlineartimeandspace.Proof. Correctness.ByProposition 4,theset w ofcyclicwordscomputedbyAlgorithm 2isasolutionof
greedy
forSCCS. Hence,eachwordof P appearsasasubstringofacyclicwordinw,andbyTheorem 1,thenormof w isminimal.Complexity.TheSuperstringGraphof P isembeddedinthegeneralisedsuffixtreeof P ,andthusbuiltinatime linear in
P.Firstthecyclicconcatenationofthewordsin P islongerthantheSCCSobtainedthroughanycyclicpermutationof P . ByProposition 2,thelengthofasolutionof
greedy
isboundedbyP;thus,thelengthofamulti-cycleofGP isbounded by2×
P.Consequently,onecanfindamulti-cyclealsoin O(
P)
timeandspace.Finally,traversingallitscyclestakesa timeproportionalinthesizeoftheSuperstringGraph.2
NotethatwithAlgorithm 2,oneobtainsacycliccover.
Algorithm 2: TheSuperstringGraphalgorithmforSCCS.
1 Input: P asetoflinearwords. Output:w agreedysolutionforSCCS;
2 buildGPtheSuperstringGraphofP ;
3 computeacycliccoverc= (c1,. . . ,cn)ofGP;
4 for i ∈ [1, n]do
5 traverseci:listthewordsofP whosenodeareinciandcreatewibyconcatenatingcyclicallypr(sj, sk)wheneverskisthesuccessorofsj;
6 insertthecyclicstringwiinw;
7 return w
4. ImplicationsoftheSuperstringGraph
4.1. SCCSwithacardinalityconstraintongreedysolutions
We cansplitup GP intoconnectedcomponents: GP
= {
C1,
. . . ,
Cg}
.ByProposition 4,eachmulti-cycle c ofGP can be partitionedinasubsetofcyclesc=
S1∪ . . . ∪
Sg,whereeachsetofcyclesSi coversaconnectedcomponentCiofGP.Algorithm 2, on its third line (in blue),1 computes a cyclic cover of G
P. This requires to perform a random choice when facing two alternative out-goingarcs.Clearly, the combinationof all possiblesuccessive choicesallows to produce all possiblecycliccovers ofGP.Certain choiceswill generateasolution containingseveralcycles tocoverone connected component(seeFig. 7b),althoughitisalwayspossibletocoveroneconnectedcomponentwithasinglecycle(seeFig. 7c). Thus, anytwocycliccoversmaydifferintheirnumberofcycles,i.e. intheircardinality.Thisgivesrisenaturallytoanew question,toanewvariantofSCCS,whichwedefinebelow.
Problem2(SCCSwithacardinalityconstrainton
greedy
solutions). Input: Asetoflinearstrings P .Output: A setofcyclicstringsc,withtheleastnumberofelements,thatisasolutionofthegreedyalgorithmonShortest CyclicCoverofStrings.
Fig. 7. A SuperstringGraphandSCCSsolutions.(a)ThegeneralisedsuffixtreeandtheSuperstringGraph(SG)forP:= {abb, bbb, bbc}.SquarenodesofT
representstringsthataresuffixesofsomewordofP :theintegerinthesquarenodeisastartingpositionofitssuffix.AwordofP isidentifiedbyits colour.Figure (b)and(c)showthetwomulti-cyclesthatcovertheSGandtheirrespectivesetsofcyclicstrings.Thisexampleiswellknownforshowing thelowerboundoftheapproximationratioofthegreedyalgorithmforSLiS.TheSGforthisinstanceisconnectedandthusitalsogivesanoptimalcyclic superstring(asolutionofSCyS).AstheEuleriancircuitvisitstherootoftheGST,thisoptimalcyclicsuperstringcanbecutattheroottoobtainanoptimal linearsuperstring.
Algorithm 3: TheSuperstringGraphalgorithmforSCCSwithacardinalityconstraint.
1 Input: P asetoflinearwords. Output:w agreedysolutionforSCCS;
2 buildGP theSuperstringGraphofP ;
3 computeaEulerianmulti-cyclec= (c1,. . . ,cn)ofGP;
4 for i ∈ [1, n]do
5 traverseci:listthewordsofP whosenodeareinciandcreatewibyconcatenatingcyclicallypr(sj, sk)wheneverskisthesuccessorofsj;
6 insertthecyclicstringwiinw;
7 return w
Algorithm 3differsfromAlgorithm 2onlyon thethirdline:insteadofcomputinganycycliccoverof GP,it computes aEulerianmulti-cycleof GP.ThisispossiblesinceeachconnectedcomponentoftheSuperstringGraphisEulerianandit canbedoneinlineartimeinthenormof P .AEulerianmulti-cycleofGP coverseachconnectedcomponentwithasingle cycle, henceitminimisesthecardinalityofthe cycliccover, andasitisa greedysolutionitsolves Problem 2.Weobtain
Theorem 7.TheminimisationofthenumberofcyclicstringsbyAlgorithm 3isillustratedinFig. 7.
Theorem7.
Algorithm 3
solvesProblem 2
inlineartime.Proof. AstheconnectedcomponentsoftheSuperstringGraphareEulerian,thisisdoneinlineartimeandspaceinthesize ofGP.ByProposition 4,alabelofaEulerianmulti-cycleofGP isanoptimalsolutionfortheProblem 2.
2
Remark.AlthoughforSCCS,allsolutionsof
greedy
areoptimal,someoptimalsolutionsarenotoutputbygreedy
.Thus, thequestionoffindingashortestcycliccoverwithaminimumnumberofelementsisnotsolvedbytheAlgorithm 3(seea counterexampleinFig. 8).4.2. ConsequencesofthetopologyoftheSuperstringGraph
Thecharacterisation oftheSGcanalsobe appliedtosolveorapproximatethe“classical”shortestsuperstringproblem (SLiS–forShortestLinearSuperstring),inwhichthesolutionisalinearsuperstring,aswellasitsvariantwherethesolution isasinglecyclicword(SCyS-=forShortestCyclicSuperstring).LetwO P T(S Li S),respectively wO P T(SC y S),denoteanoptimal solutionforSLiS,resp.forSCyS.Let w agreedysolutionforSCCS,wehave:
|
w| ≤
wO P T(SC y S)≤
wO P T(S Li S).IftheSGis connected,then w is acyclicsuperstringof P ,i.e. asolution ofSCyS.Bytheabove relation,itis anoptimalsolution for SCyS.Fig. 8. An instance with an optimal solution that is nota solution of greedy for SCCS (however, the solution ofthe SG is also optimal). Let
P= {abec, bed, c f abe, dgab}.(Top)thegeneralised suffixtree,(middle)theSG,and(bottom)anoptimalsolutionofSCCSthatconsistsinsinglecycle, whiletheSGhastwocomponents.
Now,fromthiscyclicsuperstring,wecaneasilyderivealinearsuperstringof P .Basically,wecutthecyclicstringatthe startofasuffixofawordofP andpre-catenatethecorrespondingprefix.Letv anodeoftheSGsuchthatn
(
v)
=
d(
v)
=
0, thenv isaprefixofsomewordofP .Then v w makesupalinearsuperstringofP .Weobtain|
w| ≤
wO P T(S Li S)≤|
v|+|
w|
. As|
v| ≤
wO P T(S Li S), we get|
v| + |
w| ≤
2wO P T(S Li S). In other words, forinstances P that have a connected SG,we obtain a 2-approximation for the shortest linear superstring. Furthermore, if the Eulerian cycle visits the root of the suffix tree, then v isthe empty word andthen|
w| ≤
wO P T(S Li S)≤ |
w|
, meaning that w is an optimal linear super-string.Notethatthereexistnon-trivial instances,withanarbitrarylargenorm,whoseSGisconnected:anyinstanceinwhich allwordsarecyclicshiftsofanotherbelongtothisclass.AstheSGcanbebuiltandexploredinlineartimein
P,onecan usethisprocedureasapredicatetocheck whetherSCySorSLiSaresolvableinlineartime.Ifnot,onecanlaunchamore complexapproximationalgorithm[15].5. Conclusion
WeconsidertheproblemShortestCyclicCoverofStrings,whichissolvedbyAlgorithm
greedy
.Thesetofsolutionsofgreedy
canbeexponential.Nevertheless,withtheSG,weexhibitacharacterisationofthissetinagraphthatisEulerian and linear inthe input size. This enablesus to solve SCCSin linear time, butalso to find agreedy
solutionwith the leastnumberofcyclicstrings.Moreover,thisapproachcansolveorapproximatesome non-trivial instancesoftheshortest superstringproblems.OurapproachcanbeadaptedtoavariantofSCCSwherethemultiplicityofeachinputwordisgiven asinput.Currently,approximationalgorithmsforshortestsuperstringproblemsuseaprocedureforsolvingSCCSthattakescubic time (i.e. O
(
P+ |
P|
3)
time) andquadratic spacein the numberof strings. Thisstep represents their time complexity bottlenecksinceotherstepscanbedonein O(
P)
time.Usingouralgorithmmakestheoveralltimecomplexitylinearin theinputsize.Acknowledgements
This work is supported by ANR Colib’read (ANR-12-BS02-0008), by ANR Institut de Biologie Computationnelle (ANR-11-BINF-0002),andbyCNRS(DéfiMASTODONS).
References
[1]A.Blum,T.Jiang,M.Li,J.Tromp,M.Yannakakis,Linearapproximationofshortestsuperstrings,in:ACMSymposiumontheTheoryofComputing,1991, pp. 328–336.
[2]D.Breslauer,G.F.Italiano,Onsuffixextensionsinsuffixtrees,Theor.Comput.Sci.457(2012)27–34.
[3]D.Breslauer,T.Jiang,Z.Jiang,Rotationsofperiodicstringsandshortsuperstrings,J.Algorithms24 (2)(1997)340–353.
[4] B. Cazaux, E. Rivals, Approximation of greedy algorithms for Max-ATSP, Maximal Compression, Maximal Cycle Cover, and Shortest Cyclic Cover of Strings, in: Proc. of Prague Stringology Conference, PSC, Czech Technical Univ. Prague, 2014, pp. 148–161, http://www.stringology. org/event/2014/p14.html.
[5] B.Cazaux, E. Rivals, Reverseengineering of compactsuffix trees and links: a novel algorithm, J. Discret. Algorithms 28 (2014) 9–22, http:// dx.doi.org/10.1016/j.jda.2014.07.002.
[6] B.Cazaux,E.Rivals, The powerofgreedy algorithms for approximating Max-ATSP, CyclicCover,and superstrings,Discrete Appl.Math. (2015),
http://dx.doi.org/10.1016/j.dam.2015.06.003.
[7]J.Gallant,D.Maier,J.A.Storer,Onfindingminimallengthsuperstrings,J.Comput.Syst.Sci.20(1980)50–58.
[8]A.Golovnev,A.Kulikov,I.Mihajlin,ApproximatingshortestsuperstringproblemusingdeBruijnGraphs,in:CombinatorialPatternMatching,in:Lect. NotesComput.Sci.,vol. 7922,SpringerVerlag,2013,pp. 120–129.
[9]D.Gusfield,Fasterimplementationofashortestsuperstringapproximation,Inf.Process.Lett.51 (5)(1994)271–274.
[10]D.Gusfield,AlgorithmsonStrings,TreesandSequences,CambridgeUniv.Press,1997.
[11]D.Gusfield,G.M.Landau,B.Schieber,Anefficientalgorithmfortheallpairssuffix–prefixproblem,Inf.Process.Lett.41 (4)(1992)181–185.
[12]S.R.Kosaraju,A.L.Delcher,Large-scaleassemblyofDNA stringsandspace-efficientconstructionofsuffixtrees,in:ProceedingsoftheTwenty-Seventh AnnualACMSymposiumonTheoryofComputing,STOC ’95,1995,pp. 169–177.
[13]S.R.Kosaraju,A.L.Delcher,Large-scaleassemblyofDNA stringsandspace-efficientconstructionofsuffixtrees,in:ProceedingsoftheTwenty-Eighth AnnualACMSymposiumonTheoryofComputing,STOC ’96,ACM,NewYork,NY,USA,1996,p. 659.
[14]M.Mucha,Lyndonwordsandshortsuperstrings,in:ProceedingsoftheTwenty-FourthAnnualACM–SIAMSymposiumonDiscreteAlgorithms,SODA, 2013,pp. 958–972.
[15]K.E.Paluch,Betterapproximationalgorithmsformaximumasymmetrictravelingsalesmanandshortestsuperstring,CoRRarXiv:1401.3670[abs],2014.
[16]C.H.Papadimitriou,K.Steiglitz,CombinatorialOptimization:AlgorithmsandComplexity,2ndedition,DoverPublications,Inc.,1998,496pp.
[17]J.Tarhio,E.Ukkonen,Agreedyapproximationalgorithmforconstructingshortestcommonsuperstrings,Theor.Comp.Sci.57(1988)131–145.
[18]S.Teng,F.F.Yao,Approximatingshortestsuperstrings,in:34thAnnualSymposiumonFoundationsofComputerScience,1993,pp. 158–165.
[19]E.Ukkonen,Alinear-timealgorithmforfindingapproximateshortestcommonsuperstrings,Algorithmica5 (1–4)(June1990)313–323.
[20]V.Vassilevska,Explicitinapproximabilityboundsfortheshortestsuperstringproblem,in:MathematicalFoundationsofComputerScience2005,30th InternationalSymposium,Proceedings,MFCS2005,Gdansk,Poland,August29–September2,2005,in:Lect.NotesComput.Sci.,vol. 3618,Springer, 2005,pp. 793–800.