HAL Id: hal-00619996
https://hal-upec-upem.archives-ouvertes.fr/hal-00619996
Submitted on 20 Mar 2013
A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson
To cite this version:
Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson. A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices. Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2002, United States. pp.679-688. ⟨hal-00619996⟩
Maxime Crochemore*
Institut Gaspard-Monge, Université de Marne-la-Vallée

Gad M. Landau†
Haifa University and Polytechnic University

Michal Ziv-Ukelson‡
Haifa University and IBM T.J.W. Research Center
Abstract
The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size n in $O(n^2)$ time. We address the challenge of computing the similarity of two strings in sub-quadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations.
The speed-up is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an $O(n^2/\log n)$ algorithm for an input of constant alphabet size. For most texts, the time complexity is actually $O(hn^2/\log n)$ where $h \le 1$ is the entropy of the text.
* Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France, email: mac@univ-mlv.fr.
† Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (972-4) 824-0103, FAX: (972-4) 824-9331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 11201-3840; email: landau@poly.edu; partially supported by NSF grant CCR-0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation (grants 173/98 and 282/01), by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award.
‡ Department of Computer Science, Haifa University, Haifa 31905, Israel; on Education Leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.il; partially supported by the Israel Science Foundation (grants 173/98 and 282/01), and by the FIRST Foundation of the Israel Academy of Science and Humanities.
1 Introduction
The rapid progress in large-scale DNA sequencing opens a new level of computational challenges involved in storing, organizing and analyzing the wealth of biological information. One of the most interesting new fields that the availability of the complete genomes has created is that of genome comparison (the genome is all of the DNA sequence passed from one generation to the next). Comparing complete genomes can give deep insights about the relationship between organisms, as well as shedding light on the function of specific genes in each single genome. The challenge of comparing complete genomes necessitates the creation of additional, more efficient computational tools.
One of the most common problems in biological comparative analysis is that of aligning two long bio-sequences in order to measure their similarity. In the global alignment problem [19], [29], the similarity between two strings A and B is measured. In the local alignment problem [39], the objective is to find substrings of A which are similar to substrings of B. Both alignment problems can be solved in $O(n^2)$ time by dynamic programming [19], [39].
In this paper data compression techniques are employed to speed up the alignment of two strings. The compression mechanism enables the algorithm to adapt to the data and to utilize its repetitions. The periodic nature of the sequence is quantified via its entropy, denoted $0 \le h \le 1$. Entropy is a measure of how "compressible" a sequence is [5], [7], and is small when there is a lot of order (i.e., the sequence is repetitive and therefore more compressible) and large when there is a lot of disorder (see section 2.2).
We present an $O(n^2/\log n)$ algorithm for computing both global and local similarity between two strings over a constant alphabet. The algorithm is even faster when the sequence is compressible. In fact, for most texts, the complexity of our algorithm is actually $O(hn^2/\log n)$.
Note that the algorithm presented is the first sub-quadratic local alignment algorithm.
After the optimal scores are computed, an alignment trace corresponding to the optimal score can be recovered in time complexity which is linear with the size of the trace, for both the global alignment and the local alignment problems.
The algorithms described in this paper are the first to approach fully compressed (both source and target strings are compressed) string alignment. The methods given in this paper can also be used by applications where both input strings are stored or transmitted in the form of LZ78 or LZW compressed sequence, thus providing an efficient solution to the problem of how to compare the two strings without having to decompress them first.
The only previously known sub-quadratic global alignment string comparison algorithm, by Masek and Paterson [31], is based on the Four Russians paradigm. The "Four Russians" algorithm divides the dynamic programming table into uniform sized blocks, and uses table lookup to obtain an $O(n^2/\log n)$ time complexity, based on two assumptions. One is that the sequence elements come from a constant alphabet. The other, which they denote the "discreteness" condition, is that the weights (of substitutions and indels) are all rational numbers.
Our algorithms present a new approach and are better than the above algorithm in two aspects.
1. The algorithms presented here are faster for compressible sequences. For such sequences, the complexity of our algorithms is $O(hn^2/\log n)$, where $h \le 1$ is the entropy of the sequence.
2. Our algorithms are general enough to support scoring schemes with real number weights. For many scoring schemes, the rational number weights supported by Masek and Paterson's algorithm do not suffice. For example, the entries of PAM similarity matrices, as well as BLOSUM evolutionary distance matrices, are defined to be real numbers, computed as log-odds ratios, and therefore could be irrational. The paper by Masek and Paterson is concluded with the following statement: "The most important problem remaining is finding a better algorithm for the finite (in our terms constant) alphabet case without the discreteness condition". Here, more than twenty years later, this important open question will finally be answered!
These advantages are based on the following facts. First, our algorithm does not require any pre-computation of lookup-tables, and therefore can afford more flexible weight values. Also, instead of dividing the dynamic programming matrix into uniform sized blocks as did Masek and Paterson, we employ a variable sized block partition, as induced by Lempel-Ziv factorization of both source and target. The common denominator between blocks, maximized by the compression technique, is then recycled and used for computing the relevant information for each block in time which is linear with the length of its sides. In this sense, the approach described in this paper can be viewed as another example of speeding up dynamic programming by keeping and computing only a relevant subset of important values, as demonstrated in [10], [11], [27] and [37].
The remainder of this paper is organized as follows. Section 2 includes preliminaries. In section 3 we describe the global alignment solution using fully compressed string comparison. In section 4 we extend the solution to compute the highest scoring regions of local alignment. Section 5 contains a discussion of how to reduce the space complexity without impairing the time complexity, when computing global alignment over "discrete" scoring matrices. A description of how to recover a path alignment trace in time linear with its size is reserved for the journal version of the paper.
[Figure 1 (image not reproduced): the alignment graph for A and B, with the sub-graph G shown in the lower-right corner. The scoring scheme matrix $\delta$ printed in the figure is:

       -    a    c    g    t
  -        -1   -1   -1   -1
  a   -1    1   -1   -1   -1
  c   -1   -1    1   -1   -1
  g   -1   -1   -1    1   -1
  t   -1   -1   -1   -1    1
]
Figure 1: The alignment graph for comparing strings A = "ctacgaga" and B = "aacgacga". The scoring scheme matrix $\delta$ is shown in the lower left corner of the figure. The highest scoring global alignment paths originate in vertex (0,0), end in vertex (8,8) and have a total weight of 3. The highest scoring local alignment path has a total weight of 5 and corresponds to the alignment of substrings a = "acgaga" and b = "acgacga". A sub-graph G corresponding to the block for comparing substrings a = "ag" and b = "acg" is shown in the lower-right corner of the figure. Also specified are the values I for the entries of the input border for G (in white-shaded rectangles), and the values O of the output border of G (in grey-shaded rectangles), as set during a local alignment computation.
2 Preliminaries
2.1 Highest Scoring Paths in the Alignment Graph. The dynamic programming solution to the string comparison computation problem can be represented in terms of a weighted alignment graph [19] (see Figure 1). The weight of a given edge can be specified directly in the grid graph, or as is frequently the case in biological applications, is given by a penalty matrix, denoted $\delta$, which specifies the substitution cost for each pair of characters and the deletion cost for each character from the alphabet. Typically, in the biological domain, $\delta$ is negative for all operations except replacement of similar symbols, and the objective is to maximize the alignment score.
The global alignment score is computed by setting each entry $V(i,j)$ of the dynamic programming table, in row or column order, to the score between the first i characters of A and the first j characters of B, using the following recurrence, where $\epsilon$ stands for the empty (gap) symbol.
$$V(i,j) = \max\left[\, V(i,j-1) + \delta(\epsilon, B_j),\;\; V(i-1,j) + \delta(A_i, \epsilon),\;\; V(i-1,j-1) + \delta(A_i, B_j) \,\right]$$
Smith and Waterman [39] showed that essentially the same $O(|A||B|)$ dynamic programming solution can be used for local alignment, provided that the score of the alignment of two empty strings is defined as 0, and only pairs whose alignment scores are above 0 are of interest. The Smith-Waterman algorithm for local alignment will compute the following recurrence, which includes 0 as an additional option, and thus restricts the scores to non-negative values.
$$S(i,j) = \max\left[\, 0,\;\; S(i,j-1) + \delta(\epsilon, B_j),\;\; S(i-1,j) + \delta(A_i, \epsilon),\;\; S(i-1,j-1) + \delta(A_i, B_j) \,\right]$$
The score for the most similar substrings is found in the highest scoring nodes in the alignment graph.
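As an illustration only, the two recurrences can be sketched in Python as follows. The function names, the use of "-" for the gap symbol and the initialisation of the first row and column of V are assumptions made for this sketch; the scoring scheme is the one printed in Figure 1 (+1 for a match, -1 for a mismatch or an indel), and the strings are those of Figure 1, read here as A = "ctacgaga" and B = "aacgacga".

```python
def delta(a: str, b: str) -> int:
    """Scoring scheme of Figure 1: +1 for a match, -1 otherwise; '-' is the gap symbol."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def global_score(A: str, B: str) -> int:
    """Fill V row by row with the global recurrence and return V(|A|, |B|)."""
    m, n = len(A), len(B)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: deletions only (assumed boundary)
        V[i][0] = V[i - 1][0] + delta(A[i - 1], "-")
    for j in range(1, n + 1):          # first row: insertions only (assumed boundary)
        V[0][j] = V[0][j - 1] + delta("-", B[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            V[i][j] = max(
                V[i][j - 1] + delta("-", B[j - 1]),
                V[i - 1][j] + delta(A[i - 1], "-"),
                V[i - 1][j - 1] + delta(A[i - 1], B[j - 1]),
            )
    return V[m][n]


def local_score(A: str, B: str) -> int:
    """Fill S with the Smith-Waterman recurrence and return its highest entry."""
    m, n = len(A), len(B)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                0,
                S[i][j - 1] + delta("-", B[j - 1]),
                S[i - 1][j] + delta(A[i - 1], "-"),
                S[i - 1][j - 1] + delta(A[i - 1], B[j - 1]),
            )
            best = max(best, S[i][j])
    return best


print(global_score("ctacgaga", "aacgacga"))  # 3, the global weight quoted in Figure 1
print(local_score("ctacgaga", "aacgacga"))   # 5, the local weight quoted in Figure 1
```

Both functions take time proportional to |A||B|; this is the quadratic baseline that the block-based method is measured against.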
2.2 A Block Partition of the Alignment Graph based on LZ78 Factorization. The traditional aim of text compression is efficient use of resources such as storage and bandwidth. Here, we will compress the sequences in order to speed up the alignment process. Note that this approach, denoted "acceleration by text-compression", has been recently applied to a related problem, that of exact string matching [22], [30], [38].
It should also be mentioned that another related problem, that of exact string matching in compressed text without decoding it, which is often referred to as "compressed pattern matching", has been studied extensively [3], [13], [34]. Along these lines, string search in compressed text was developed for the compression paradigm of LZ78 [45], and its subsequent variant LZW [43], as described in [23], [35]. A more challenging problem is that of "fully compressed" pattern matching when both the pattern and text strings are compressed [16], [17].
For the LZ78-LZW paradigm, compressed matching has been extended and generalized to that of approximate pattern matching (finding all occurrences of a short sequence within a long one allowing up to k changes) in [21], [33].
The LZ compression methods are based on the idea of self reference: while the text file is scanned, substrings or phrases are identified and stored in a dictionary, and whenever, later in the process, a phrase or concatenation of phrases is encountered again, this is compactly encoded by suitable pointers [28], [44], [45].
The main difference between LZ78 factorization and previous LZ compression algorithms is in the choice of code words. Instead of allowing pointers to reference any string that has appeared previously, the text seen so far is parsed into phrases, where each phrase is the longest matching phrase seen previously plus one character. For example, the string S = "aacgacg" is divided into four phrases: a, ac, g, acg. Each phrase is encoded as an index to its prefix, plus the extra character. The new phrase is then added to the list of phrases that may be referenced.
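The parsing rule just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `lz78_phrases`, the dictionary-based trie and the handling of a trailing phrase that repeats an earlier one are assumptions of the sketch.

```python
from typing import Dict, List, Tuple


def lz78_phrases(s: str) -> List[str]:
    """Split s into LZ78 phrases: longest previously seen phrase plus one character."""
    # The trie is stored as a map (parent phrase id, character) -> child phrase id.
    # Phrase id 0 is the root, i.e. the empty phrase.
    trie: Dict[Tuple[int, str], int] = {}
    phrases: List[str] = []
    node = 0       # trie node matched so far for the current phrase
    start = 0      # position in s where the current phrase begins
    for pos, ch in enumerate(s):
        key = (node, ch)
        if key in trie:
            node = trie[key]                  # keep extending the current match
        else:
            trie[key] = len(phrases) + 1      # serial number of the new phrase
            phrases.append(s[start:pos + 1])
            node, start = 0, pos + 1
    if start < len(s):
        phrases.append(s[start:])             # leftover that repeats an earlier phrase
    return phrases


print(lz78_phrases("aacgacg"))   # ['a', 'ac', 'g', 'acg']
```

Applied to the strings of Figure 2 (read here as A = "ctacgaga" and B = "aacgacga"), the sketch yields phrase 5 of A equal to "ag" and phrase 4 of B equal to "acg", which is the block (5,4) used in the running example.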
[Figure 2 (image not reproduced): the LZ78-partitioned alignment graph, the tries for A and B, and the graph G for block (5,4) together with its left prefix (5,2), diagonal prefix (3,2) and top prefix (3,4) blocks.]
Figure 2: The block partition of the alignment graph, and the tries corresponding to LZ78 parsing of strings A = "ctacgaga" and B = "aacgacga". Note that for the block G in this example, xa = "ag", yb = "acg", $\ell_r = 2$, $\ell_c = 3$, i = 5 and j = 4. (The new cell of G, which does not appear in any of the prefix blocks, is the rightmost cell at the bottom row of G, and can be distinguished by its white color.) This figure continues Figure 1.
Since each phrase is distinct, the following upper bound applies to the possible number of phrases obtained by LZ78 factorization.
Theorem 2.1. (Ziv and Lempel 1976 [28].) Given a sequence S of size n over a constant alphabet, the maximal number of distinct phrases in S is $O(n/\log n)$.
Although this bound holds for any sequence over a constant alphabet, it has been shown that in many cases we can do better than that.
Intuitively, the LZ78 algorithm compresses the sequence because it is able to discover some repeated patterns. Therefore, in order to compute a tighter upper bound on the number of phrases obtained by LZ78 factorization for "compressible" sequences, the repetitive nature of the sequence needs to be quantified. One of the fundamental ideas in information theory is that of entropy, denoted $0 \le h \le 1$, which is a measure of the amount of disorder or randomness, or inversely, the amount of order or redundancy in a sequence. Entropy is small when there is a lot of order (i.e., the sequence is repetitive) and large when there is a lot of disorder. The entropy of a sequence should ideally reflect the ratio between the size of the sequence after it has been compressed, and the length of the uncompressed sequence.
The number of distinct phrases obtained by LZ78 factorization has been shown to be $O(hn/\log n)$ for most texts [5], [7], [28], [45]. Note that for any other text over a constant alphabet, the upper bound above still applies by setting h to 1.
3 The Global Alignment Solution
3.1 Definitions and Basic Observations. The alignment graph will be partitioned as follows. Strings A and B will be parsed using LZ78 factorization. This induces a partition of the alignment graph for comparing A with B into variable sized blocks (see Figure 2). Each block will correspond to a comparison of an LZ phrase of A with an LZ phrase of B.
Let xa denote a phrase in A obtained by extending a previous phrase x of A with character a, and yb denote a phrase in B, obtained by extending a previous phrase y of B with character b.
From now on we will focus on the computations necessary for a single block of the alignment graph.
Consider the block G which corresponds to the comparison of xa and yb. We define the input border I as the left and top borders of G, and the output border O as the bottom and right borders of G. (The node entries on the input border are numbered in a clockwise direction, and the node entries on the output border are numbered in a counter-clockwise direction.)
Rather than filling in the values of each vertex in G, as does the classical dynamic programming algorithm, the only values computed for each block will be those on its I/O borders (see Figures 1, 5A). Intuitively, this is the reason behind the efficiency gain.
Let $\ell_r$ denote the number of rows in G, $\ell_r = |xa|$. Let $\ell_c$ denote the number of columns in G, $\ell_c = |yb|$. Let $t = \ell_r + \ell_c$. Clearly, $|I| = |O| = t$.
We define the following three prefix blocks of G.
1. The left prefix of G denotes the block comparing phrase xa of A and phrase y of B.
2. The diagonal prefix of G denotes the block comparing phrase x of A and phrase y of B.
3. The top prefix of G denotes the block comparing phrase x of A and phrase yb of B.
Observation 1. When traversing the blocks of an LZ78 parsed alignment graph in a left-to-right, top-to-bottom order, the blocks for the left prefix, diagonal prefix and top prefix of G are encountered prior to block G.
Note that the graph for the left prefix of G is identical to the subgraph of G containing all columns but the last one. More specifically, both the structure and the weights of the edges of these two graphs are identical, but the weights to be assigned to the vertices during the similarity computation may vary according to the input border values. The same holds for the top prefix and diagonal prefix graphs. The only new cell in G, which does not appear in any of its prefix block graphs, is the cell for comparing a and b.
3.2 I/O Propagation Across G. The work for each block will consist of two stages (a similar approach is shown in [6, 20, 26, 27]).
1. Encoding: Study the structure of G and represent it in an efficient way.
2. Propagation: Given I and the encoding of G, constructed in the previous stage, compute O for G.
The structure of G will be encoded by computing weights of optimal paths connecting each entry of its input border with each entry of its output border. The following DIST matrix will be used (see Figure 3).
Definition 3.1. DIST[i,j] stores the weight of the optimal path from entry i of the input border of G to entry j of its output border.
DIST matrices have also been used in [4], [6], [20], [27] and [37].
Given input row I and the DIST for G, the weight of output row vertex $O_j$ can be computed as follows.
$$O_j = \max_{r=0}^{t} \{\, I_r + DIST[r,j] \,\}$$
$O_j$ is the maximum of column j of the following OUT matrix, which merges the information from input row I and DIST (see Figure 3).
Definition 3.2. $OUT[i,j] = I_i + DIST[i,j]$.
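To make Definitions 3.1 and 3.2 concrete, the sketch below builds the DIST matrix of a single block by brute force (one dynamic programming pass per input-border vertex) and then propagates an input vector I to the output vector O by taking column maxima of OUT. This deliberately ignores the efficient encoding the paper develops; it only checks the definitions. The function names, the border-ordering helpers and the use of "-" for the gap symbol are assumptions of the sketch, and the block and I values are those of the running example (xa = "ag", yb = "acg", I = 1, 2, 3, 2, 1, 3 as read from Figure 3).

```python
NEG_INF = float("-inf")


def delta(a, b):
    """Scoring scheme of Figure 1: +1 match, -1 mismatch or indel ('-' is the gap)."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def input_border(lr, lc):
    """Input-border vertices, clockwise from the bottom-left corner."""
    return [(r, 0) for r in range(lr, -1, -1)] + [(0, c) for c in range(1, lc + 1)]


def output_border(lr, lc):
    """Output-border vertices, counter-clockwise from the bottom-left corner."""
    return [(lr, c) for c in range(lc + 1)] + [(r, lc) for r in range(lr - 1, -1, -1)]


def best_from(src, x, y):
    """Weight of the optimal path from src to every vertex of the block (-inf if none)."""
    lr, lc = len(x), len(y)
    r0, c0 = src
    W = [[NEG_INF] * (lc + 1) for _ in range(lr + 1)]
    W[r0][c0] = 0
    for r in range(r0, lr + 1):
        for c in range(c0, lc + 1):
            if (r, c) == src:
                continue
            cands = []
            if c > c0:
                cands.append(W[r][c - 1] + delta("-", y[c - 1]))
            if r > r0:
                cands.append(W[r - 1][c] + delta(x[r - 1], "-"))
            if r > r0 and c > c0:
                cands.append(W[r - 1][c - 1] + delta(x[r - 1], y[c - 1]))
            W[r][c] = max(cands)
    return W


def dist_matrix(x, y):
    """DIST[i][j]: optimal path weight from input entry i to output entry j."""
    lr, lc = len(x), len(y)
    rows = []
    for src in input_border(lr, lc):
        W = best_from(src, x, y)
        rows.append([W[r][c] for (r, c) in output_border(lr, lc)])
    return rows


def propagate(I, DIST):
    """O_j = max over r of I_r + DIST[r][j], i.e. the column maxima of OUT."""
    return [max(I[r] + DIST[r][j] for r in range(len(I))) for j in range(len(DIST[0]))]


DIST = dist_matrix("ag", "acg")
print(propagate([1, 2, 3, 2, 1, 3], DIST))   # [1, 3, 3, 4, 2, 3]
```

Unreachable pairs are left at minus infinity here, so the maximum never selects them; the paper's own treatment of these undefined entries is given in section 3.2.1.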
Aggarwal and Park [2] and Schmidt [37] observed that DIST matrices are Monge arrays.
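[Figure 3, reconstructed from the figure text; "." marks an undefined DIST entry.

DIST matrix, with the input value $I_i$ at the left of each row:

             O0    O1    O2    O3    O4    O5
  I0 = 1      0    -1    -2    -3     .     .
  I1 = 2     -1    -1    -2    -1    -3     .
  I2 = 3     -2     0     0     1    -1    -3
  I3 = 2      .    -2    -2     0    -2    -2
  I4 = 1      .     .    -2     0    -1    -1
  I5 = 3      .     .     .    -2    -1     0

OUT matrix (undefined entries complemented as in section 3.2.1), by column number:

              0     1     2     3     4     5
              1     0    -1    -2   -inf  -inf
              1     1     0     1    -1   -inf
              1     3     3     4     2     0
            -12     0     0     2     0     0
            -13   -13    -1     1     0     0
            -14   -14   -14     1     2     3

  O:          1     3     3     4     2     3
]
Figure 3: The DIST matrix which corresponds to the subsequences "ag" and "acg", the OUT matrix obtained by adding the values of I to the rows of DIST, and the O containing the column maxima of OUT. This figure continues Figures 1 and 2.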
Definition 3.3. A matrix $M[0 \ldots m, 0 \ldots n]$ is Monge if either condition 1 or 2 below holds for all $a, b = 0 \ldots m$; $c, d = 0 \ldots n$:
1. convex condition: $M[a,c] + M[b,d] \le M[b,c] + M[a,d]$ for all $a < b$ and $c < d$.
2. concave condition: $M[a,c] + M[b,d] \ge M[b,c] + M[a,d]$ for all $a < b$ and $c < d$.
Since DIST is Monge, so is OUT, which is a DIST with constants added to its rows.
An important property of Monge arrays is that of being totally monotone.
Definition 3.4. A matrix $M[0 \ldots m, 0 \ldots n]$ is totally monotone if either condition 1 or 2 below holds for all $a, b = 0 \ldots m$; $c, d = 0 \ldots n$:
1. convex condition: $M[a,c] \ge M[b,c] \implies M[a,d] \ge M[b,d]$ for all $a < b$ and $c < d$.
2. concave condition: $M[a,c] \le M[b,c] \implies M[a,d] \le M[b,d]$ for all $a < b$ and $c < d$.
Note that the Monge property implies total monotonicity, but the converse is not true. Therefore, both DIST and OUT are totally monotone by the concave condition.
Aggarwal et al. [1] gave a recursive algorithm, nicknamed SMAWK in the literature, which can compute in O(n) time all row and column maxima of an $n \times n$ totally monotone matrix, by querying only O(n) elements of the array. Hence, one could use SMAWK to compute the output row O by querying only O(n) elements of OUT. Clearly, if both the full DIST and all entries of I are available, then computing an element of OUT is O(1) work.
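SMAWK itself is intricate, so the sketch below swaps in a simpler divide-and-conquer search that uses the same total monotonicity property. It is not SMAWK and does not reach its linear bound: it inspects O((m + n) log n) entries of an m-by-n matrix. It relies on the concave condition of Definition 3.4, under which the bottom-most maximum of a column splits the rows that can hold the maxima of the columns to its left (rows above or at it) and to its right (rows at or below it). The function name and the hard-coded matrix, which is the complemented OUT matrix of the running example, are assumptions of the sketch.

```python
NEG_INF = float("-inf")


def column_maxima(M):
    """Column maxima of a matrix that is totally monotone by the concave condition."""
    rows, cols = len(M), len(M[0])
    out = [None] * cols

    def solve(c_lo, c_hi, r_lo, r_hi):
        # Invariant: every column in [c_lo, c_hi] has a maximum in rows [r_lo, r_hi].
        if c_lo > c_hi:
            return
        mid = (c_lo + c_hi) // 2
        best = r_lo
        for r in range(r_lo, r_hi + 1):
            if M[r][mid] >= M[best][mid]:     # ">=" keeps the bottom-most maximum
                best = r
        out[mid] = M[best][mid]
        solve(c_lo, mid - 1, r_lo, best)      # columns to the left: rows up to best
        solve(mid + 1, c_hi, best, r_hi)      # columns to the right: rows from best

    solve(0, cols - 1, 0, rows - 1)
    return out


# Complemented OUT matrix of the running example (block "ag" x "acg").
OUT = [
    [1, 0, -1, -2, NEG_INF, NEG_INF],
    [1, 1, 0, 1, -1, NEG_INF],
    [1, 3, 3, 4, 2, 0],
    [-12, 0, 0, 2, 0, 0],
    [-13, -13, -1, 1, 0, 0],
    [-14, -14, -14, 1, 2, 3],
]
print(column_maxima(OUT))   # [1, 3, 3, 4, 2, 3]
```

The result agrees with a plain column-by-column maximum, which is the check used to validate the sketch.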
For various solutions to related problems, which also utilize Monge and total monotonicity properties, we refer the interested reader to [8], [9], [14], [15], [24] and [27]. In order to efficiently utilize these properties here, we need to address the following two problems.
1. How to efficiently compute DIST and represent it in a format which allows direct access to its entries. This will be done in section 3.2.2.
2. SMAWK is intended for a full, rectangular matrix. However, both DIST and its corresponding OUT are not rectangular. Since paths in an alignment graph can only assume a left-to-right, top-to-bottom direction, connections between some input border vertices and some output border vertices are impossible. Therefore, the matrices are missing both a lower left triangle and an upper right triangle (see Figure 3).
3.2.1 Addressing the Rectangle Problem. The undefined entries of OUT can be complemented in constant time each, as follows.
1. The missing upper right triangle entries can be completed by setting the value of any entry OUT[i,j] in this triangle to $-\infty$.
2. Let k denote the maximal absolute value of a score in $\delta$. The missing lower left triangle entries can be completed by setting the value of any OUT[i,j] in this triangle to $-(n+i+1)k$.
Lemma 3.1. Complementing the undefined entries as described above preserves the concave total monotonicity condition of OUT, and does not introduce new column maxima.
Proof. 1. Upper Right Triangle: All similarity scores in the alignment graph are finite, so the $-\infty$ entries introduce no new column maxima. Due to the shape of the redefined upper-right triangle, once a $-\infty$ value in row a is encountered, all future values in row a are also $-\infty$. The future values of row b could be either finite or $-\infty$. Therefore, $OUT[a,d] \le OUT[b,d]$ for all $d > c$.
2. Lower Left Triangle: The worst score appearing in the alignment graph is lower bounded by $-nk$. Since i is always greater than or equal to zero, the complemented values in the lower left triangle are upper-bounded by $-(n+1)k$ and no new column maxima are introduced. Also, for any complemented entry OUT[b,c] in the lower left triangle, $OUT[b,c] < OUT[a,c]$ for all $a < b$, and therefore the concave total monotonicity condition holds.
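The complementing rule and the claim of Lemma 3.1 can be checked numerically on the running example. In the sketch below, undefined entries are marked with `None` (an assumption of this illustration), `complement` fills them as described above, and `is_concave_totally_monotone` tests the concave condition of Definition 3.4 by brute force. The values n = 8 and k = 1 correspond to the strings and scoring scheme of Figure 1.

```python
NEG_INF = float("-inf")


def complement(OUT, n, k):
    """Fill undefined (None) entries: -inf in the upper right, -(n+i+1)k in the lower left."""
    full = []
    for i, row in enumerate(OUT):
        defined = [j for j, v in enumerate(row) if v is not None]
        last = defined[-1]                      # defined entries form one contiguous run
        new_row = []
        for j, v in enumerate(row):
            if v is not None:
                new_row.append(v)
            elif j > last:
                new_row.append(NEG_INF)         # upper right triangle
            else:
                new_row.append(-(n + i + 1) * k)  # lower left triangle
        full.append(new_row)
    return full


def is_concave_totally_monotone(M):
    """Check M[a][c] <= M[b][c] implies M[a][d] <= M[b][d] for all a < b, c < d."""
    rows, cols = len(M), len(M[0])
    for a in range(rows):
        for b in range(a + 1, rows):
            for c in range(cols):
                for d in range(c + 1, cols):
                    if M[a][c] <= M[b][c] and not M[a][d] <= M[b][d]:
                        return False
    return True


# OUT = I + DIST for block "ag" x "acg" with I = 1, 2, 3, 2, 1, 3 (running example).
RAW_OUT = [
    [1, 0, -1, -2, None, None],
    [1, 1, 0, 1, -1, None],
    [1, 3, 3, 4, 2, 0],
    [None, 0, 0, 2, 0, 0],
    [None, None, -1, 1, 0, 0],
    [None, None, None, 1, 2, 3],
]
FULL = complement(RAW_OUT, n=8, k=1)
print(FULL[3][0], FULL[4][0], FULL[5][0])      # -12 -13 -14
print(is_concave_totally_monotone(FULL))       # True
```

The column maxima of the completed matrix are 1, 3, 3, 4, 2, 3, the same as those taken over the defined entries only, which is what the lemma requires.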
3.2.2 Incremental Update of the new DIST Information for G. In this section we will show how to efficiently compute the new DIST information for G, using the DIST representations previously computed for its prefix blocks, plus the information of its new cell.
When processing a new block G, we will compute the scores of t new optimal paths, leading from the input border to the new vertex $(\ell_r, \ell_c)$ in the lowest, rightmost corner of G. These values correspond to column $\ell_c$ of the DIST matrix for G, and can be computed as follows.
Entry [i] in column $\ell_c$ of the DIST for G contains the weight of the optimal path from entry i in the input border of G to vertex $(\ell_r, \ell_c)$. This path must go through one of the three vertices $(\ell_r - 1, \ell_c)$, $(\ell_r - 1, \ell_c - 1)$ or $(\ell_r, \ell_c - 1)$. Therefore, the weight of the optimal path from entry i in the input border of G to $(\ell_r, \ell_c)$ is equal to the maximum among the following three values:
1. Entry [i] of column $\ell_c - 1$ of the DIST for the left prefix of G, plus the weight of the horizontal edge leading into $(\ell_r, \ell_c)$.
2. Entry [i] of column $\ell_c - 1$ of the DIST for the diagonal prefix of G, plus the weight of the diagonal edge leading into $(\ell_r, \ell_c)$.
3. Entry [i] of column $\ell_c$ of the DIST for the top prefix of G, plus the weight of the vertical edge leading into $(\ell_r, \ell_c)$.
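The three-way maximum above can be sketched as a memoised recursion over prefix blocks. Several choices here are assumptions of the illustration rather than the paper's data layout: the new column is stored as a dictionary keyed by the coordinates of the input-border vertex instead of by border index (which sidesteps the index shift between a block and its prefix blocks), degenerate blocks with an empty phrase serve as base cases, and the recursion walks the character prefixes of one phrase pair, relying on the LZ78 property that every prefix of a phrase is itself a phrase.

```python
from functools import lru_cache


def delta(a, b):
    """Scoring scheme of Figure 1: +1 match, -1 mismatch or indel ('-' is the gap)."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def corner_columns(x, y):
    """Return col(i, j): the new DIST column of the block comparing x[:i] with y[:j]."""

    @lru_cache(maxsize=None)
    def col(i, j):
        # Input-border vertices of the block, clockwise from the bottom-left corner.
        border = [(r, 0) for r in range(i, -1, -1)] + [(0, c) for c in range(1, j + 1)]
        result = {}
        for v in border:
            cands = [0] if v == (i, j) else []   # empty path (degenerate blocks only)
            if j > 0:
                left = col(i, j - 1)             # left prefix + horizontal edge
                if v in left:
                    cands.append(left[v] + delta("-", y[j - 1]))
            if i > 0 and j > 0:
                diag = col(i - 1, j - 1)         # diagonal prefix + diagonal edge
                if v in diag:
                    cands.append(diag[v] + delta(x[i - 1], y[j - 1]))
            if i > 0:
                top = col(i - 1, j)              # top prefix + vertical edge
                if v in top:
                    cands.append(top[v] + delta(x[i - 1], "-"))
            result[v] = max(cands)
        return result

    return col


col = corner_columns("ag", "acg")
print(list(col(2, 3).values()))   # [-3, -1, 1, 0, 0, -2]
```

For the block (5,4) of the running example the sketch returns -3, -1, 1, 0, 0, -2 in clockwise input-border order, which agrees with the column values that survive in the text of Figure 4. Each entry costs a constant number of lookups, in line with the O(t) cost per block stated in section 3.3.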
3.2.3 Maintaining Direct Access to DIST Columns. In order to compute an entry of OUT in constant time during the execution of SMAWK, direct access to DIST entries is necessary. This is not straightforward, since as shown in the previous section, for each block only one new DIST column has been computed and stored. All other columns besides column $\ell_c$ of the DIST for G need to be obtained from G's prefix ancestor blocks.
Therefore, before the execution of SMAWK begins, a vector with pointers to all t+1 columns of the DIST for G is constructed (see Figure 4). This vector is no longer needed after the computations for G have been completed, and its space can then be freed.
[Figure 4 (image not reproduced): the Block Table, the tries for A and B, and the vector of pointers to the columns of DIST(5,4).]
Figure 4: A table containing an entry for each block of the alignment graph. Entry (i,j) of the table corresponds to the block whose substrings are represented by node i in the trie for A and node j in the trie for B. The entry for each block in the table points to the start of its new DIST column. Also shown is the vector which contains pointers to all columns of the DIST for block (5,4), as obtained from its ancestor prefix blocks. This figure continues Figures 1, 2 and 3.
The pointers to all columns of the DIST for G are assembled as follows. Column $\ell_c$ is set to the newly constructed vector for G. All columns of indices smaller than $\ell_c$ are obtained via $\ell_c$ recursive calls to left prefix blocks of G. All columns of indices greater than $\ell_c$ are obtained via $\ell_r$ recursive calls to top prefix blocks of G.
3.2.4 Querying a Prefix Block and Obtaining its DIST Column in Constant time. The LZ78 phrases form a trie (see Figure 2), and the string to be compressed is encoded as a sequence of names of prefixes of the trie. Each node in the trie contains the serial number of the phrase it represents. Since each block corresponds to a comparison of a phrase from A with a phrase from B, each block will be identified by a pair of numbers, composed of the serial numbers for its corresponding phrases in the tries for A and B.
Another data structure to be constructed is a Block Table (see Figure 4), containing an entry for each partitioned block of the alignment graph. The entry for each block in the table points to the start of its new DIST column, and can be directly accessed via the block's phrase number index pair.
The left prefix of G can be identified in constant time as a pair of phrase numbers, the first identical to the serial number of xa, and the second identical to the serial number of y. Given the pair of identification numbers for a block, a pointer to the corresponding DIST column can then be directly obtained from the Block Table.
3.3 Time and Space Analysis
Assuming sequence size n and sequence entropy $h \le 1$, the LZ78 factorization algorithm will parse the strings and construct the tries for A and B in O(n) time. The resulting number of phrases in both A and B is $O(hn/\log n)$. The number of resulting blocks in the alignment graph is equal to the number of phrases in A times the number of phrases in B, and is therefore $O(h^2n^2/(\log n)^2)$.
For each block G, the following information (1-3) is computed, in time and space complexity linear with the size of its I/O borders:
1. Updating the Encoding Structure for G. The prefix blocks of G can be accessed in constant time. The vectors of DIST column pointers for the prefix blocks have already been freed. However, since each prefix block directly points to its newly computed DIST column, all values needed for the computations are still available. Since each entry of the new DIST column for G is set to the maximum among up to three sums of pairs, the new DIST column for G can be constructed in O(t) time and space.
2. Maintaining Direct Access to DIST columns. Since prefix blocks and their DIST columns can be accessed in constant time, the vector with pointers to columns of the DIST for G can be set in O(t) time.
3. Propagation for G. Using the information computed for G, and given the I for G obtained from the O vectors for the block above G and the block to its left, the values of O for G are computed via SMAWK matrix searching in O(t) time.
Total Complexity. Since the work and space for each block is linear with the size of its I/O borders, the total time and space complexity is linear with the total size of the borders of the blocks. The block borders form $O(hn/\log n)$ rows of size |B| each, and $O(hn/\log n)$ columns of size |A| each, in the alignment graph (see Figure 2). Therefore, the total time and space complexity is $O(hn^2/\log n)$.
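The counting argument behind the total bound can be written out as a single sum. The symbols $p_A$, $p_B$ (numbers of phrases of A and B), $x_u$, $y_v$ (the u-th phrase of A and the v-th phrase of B) and $t_{u,v}$ (border size of block (u,v)) are introduced here for illustration only.

```latex
\begin{align*}
\sum_{u=1}^{p_A} \sum_{v=1}^{p_B} O(t_{u,v})
  &= O\!\left( \sum_{u=1}^{p_A} \sum_{v=1}^{p_B} \bigl( |x_u| + |y_v| \bigr) \right) \\
  &= O\!\left( p_B \sum_{u=1}^{p_A} |x_u| \;+\; p_A \sum_{v=1}^{p_B} |y_v| \right) \\
  &= O\bigl( p_B \, |A| + p_A \, |B| \bigr) \\
  &= O\!\left( \frac{hn}{\log n} \cdot n \right)
   = O\!\left( \frac{hn^2}{\log n} \right),
\end{align*}
```

using $p_A, p_B = O(hn/\log n)$ and $|A|, |B| = O(n)$. The second line is the statement that the block borders form $p_A$ rows of length $|B|$ and $p_B$ columns of length $|A|$.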
4 Extensions to Local Alignment
When computing the highest local alignment score, the added challenge is that an optimal path could begin and end inside any block. Therefore, we will modify O to consider the additional paths originating inside G.
Also, in addition to the DIST described in section 3, we compute for each block G the following data structures (see Figures 5B, 5C).
[Figure 5 (image not reproduced): panels A, B and C, showing the I, O, DIST[i,j], E, S and C path weights for a block.]
Figure 5: A. The I/O path weight vectors computed for each block in the global alignment solution. DIST[i,j] will be set to the highest scoring path connecting vertex i in the input border with vertex j in the output border. B, C. The vectors of optimal path weights considered for the local alignment computation.
1. E is a vector of size t. E[i] contains the value of the highest scoring path which starts in vertex i of the input border of G and ends inside G.
2. S is a vector of size t. S[i] contains the value of the highest scoring path which starts inside G and ends in vertex i of the output border of G.
3. C is the value of the highest scoring path contained in G, that is, the highest scoring path which originates inside G and ends inside G.
4. F is the weight of the highest scoring path ending in G. This path could either begin and end inside G (a C-path) or start outside G and end inside G (an I-path followed by an E-path).
The overall highest local alignment score for comparing A and B can be computed as the maximum among the F values of each block.
The two stages described in section 3.2 will be extended as follows.
4.1 Encoding. DIST is computed as described in section 3.2. In addition, the values of E, S and C are computed as follows.
1. Computing the values of E. E[i] is computed as the maximum between E[i] for the left prefix of G, E[i] for the top prefix of G, and $DIST[i, \ell_c]$.
2. Computing the values of S. The only new value computed for S is the Smith-Waterman score for the new vertex $(\ell_r, \ell_c)$. Given the Smith-Waterman scores for vertices $(\ell_r - 1, \ell_c - 1)$ obtained from the diagonal prefix, $(\ell_r, \ell_c - 1)$ obtained from the left prefix and $(\ell_r - 1, \ell_c)$ obtained from the top prefix of G, and the weights of the 3 edges leading into vertex $(\ell_r, \ell_c)$, the Smith-Waterman score for vertex $(\ell_r, \ell_c)$ can be computed in O(1) time complexity, using the recursion given in section 2.1. The value for entry $\ell_c$ of S is set to this newly computed Smith-Waterman score for vertex $(\ell_r, \ell_c)$.
The values of all other entries of S are then set as follows. The first $\ell_c$ values of S are copied from the first $\ell_c$ values of the S computed for the left prefix of G. The last $\ell_r$ values are copied from the last $\ell_r$ values of the S vector for the top prefix of G.
3. Computing the value of C. C is computed as the maximum between the C value for the left prefix of G, the C value for the top prefix of G, and the newly computed $S[\ell_c]$ as described above.
4.2 Propagation.
1. Computing the values of O. Our objective is to set O[i] to the weight of the highest scoring path originating anywhere in the alignment graph and ending in entry i of the output border. Vector O will first be computed from the I and DIST for G as described in section 3.2. At this point entry O[i] reflects the weight of the optimal path starting anywhere outside G and ending in entry i of the output border. It needs to be updated with the weights of the highest scoring paths starting inside G. This is achieved by resetting O[i] to the maximum between O[i] and S[i].
2. Computing the values of F. F is computed as $\max\left( \max_{i=0}^{t} \{ I[i] + E[i] \},\; C \right)$.
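The definitions of E, S, C and F can be checked on a single block with the brute-force sketch below. It does not use the incremental updates of section 4.1: S and C come from a Smith-Waterman table restricted to the block, and E from a mirrored table filled from the bottom-right corner. The function names, the "-" gap symbol and the convention that an empty path has weight 0 are assumptions of the sketch; the block and the I values are those of the running example.

```python
def delta(a, b):
    """Scoring scheme of Figure 1: +1 match, -1 mismatch or indel ('-' is the gap)."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def local_block_vectors(x, y):
    """Return (E, S, C) for the block comparing x (rows) with y (columns)."""
    lr, lc = len(x), len(y)
    # SW[r][c]: best path that starts anywhere inside the block and ends at (r, c).
    SW = [[0] * (lc + 1) for _ in range(lr + 1)]
    for r in range(1, lr + 1):
        for c in range(1, lc + 1):
            SW[r][c] = max(
                0,
                SW[r][c - 1] + delta("-", y[c - 1]),
                SW[r - 1][c] + delta(x[r - 1], "-"),
                SW[r - 1][c - 1] + delta(x[r - 1], y[c - 1]),
            )
    # R[r][c]: best path that starts at (r, c) and ends anywhere inside the block.
    R = [[0] * (lc + 1) for _ in range(lr + 1)]
    for r in range(lr, -1, -1):
        for c in range(lc, -1, -1):
            best = 0
            if c < lc:
                best = max(best, delta("-", y[c]) + R[r][c + 1])
            if r < lr:
                best = max(best, delta(x[r], "-") + R[r + 1][c])
            if r < lr and c < lc:
                best = max(best, delta(x[r], y[c]) + R[r + 1][c + 1])
            R[r][c] = best
    inp = [(r, 0) for r in range(lr, -1, -1)] + [(0, c) for c in range(1, lc + 1)]
    out = [(lr, c) for c in range(lc + 1)] + [(r, lc) for r in range(lr - 1, -1, -1)]
    E = [R[r][c] for (r, c) in inp]       # input border, clockwise
    S = [SW[r][c] for (r, c) in out]      # output border, counter-clockwise
    C = max(max(row) for row in SW)
    return E, S, C


def block_F(I, E, C):
    """F = max( max_i (I[i] + E[i]), C )."""
    return max(max(i_val + e_val for i_val, e_val in zip(I, E)), C)


E, S, C = local_block_vectors("ag", "acg")
print(E, S, C)                               # [0, 0, 1, 0, 0, 0] [0, 0, 0, 1, 0, 0] 1
print(block_F([1, 2, 3, 2, 1, 3], E, C))     # 4
```

With the O vector 1, 3, 3, 4, 2, 3 obtained from I and DIST for this block, taking the entrywise maximum with S leaves O unchanged, consistent with Figure 1, whose border values are stated to come from a local alignment computation.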
4.3 Time and Space Analysis
Encoding. Since, as shown in section 3.2.3, each prefix block of G can be accessed in constant time, the values of the S and E vectors for G can be computed and stored in O(t) time and space, and the C value for G can be computed in constant time and space.
Propagation. Given the vectors computed in the encoding stage, the values of O and F can be computed in O(t) time each as described above.
The weight of the highest scoring path in the alignment graph can then be computed in an additional $O(h^2n^2/(\log n)^2)$ time as the maximum value among the F values computed for each block.
Total Complexity. Since the work and space for each block is linear with the size of its I/O borders, the total time and space complexity for the local alignment solution is $O(hn^2/\log n)$.
5 Reducing the Space Complexity
When computing global alignment with scoring matrices which follow the "discreteness" condition (see Section 1), the efficient alignment stage algorithm described in [27] can be extended to support full propagation from the leftmost and upper boundaries to the bottom and rightmost boundaries of G.
This extended propagation algorithm can then be used to compute the values of the global alignment O for G, given the I for G and a minimal encoding of the DIST for G. The advantage of this minimal encoding of DIST is that rather than saving an O(t) sized DIST column per block, we only need to save a constant number of values per block. The encoding for the new DIST column of each block can be computed and stored in constant time and space from the information stored for the left, diagonal and top prefix blocks of G, using the technique described in section 6 of [37].
This reduces the space complexity to $O(h^2n^2/(\log n)^2)$, while preserving the $O(hn^2/\log n)$ time complexity.
6 Conclusions
The results demonstrated in this paper are as follows.
1. The algorithm presented in this paper is the first $O(hn^2/\log n)$ string comparison algorithm.
2. This is the first sub-quadratic string comparison algorithm for general scoring tables whose weights are not restricted to rational numbers.
3. We showed how to extend this result to a local alignment $O(hn^2/\log n)$ algorithm.
4. For global alignment over "discrete" scoring matrices, we explained how the space complexity can be reduced to $O(h^2n^2/(\log n)^2)$, without impairing the $O(hn^2/\log n)$ time complexity.
In addition to the scores computed by dynamic programming, it is often desired to recover a meaningful trace of the optimal alignments. Optimal paths in the alignment graph (paths whose total weight is maximum) represent optimal alignments of A and B.
Without any added complexity, the current algorithmic infrastructure can be modified to support the recovery of an optimal global alignment path trace, as well as an optimal local alignment trace as defined by Erickson and Sellers [12], in time complexity which is linear with the size of the trace.
Due to lack of space, the description of how to recover the path alignment traces is reserved for the journal version of the paper.
References
[1] Aggarwal, A., M. Klawe, S. Moran, P. Shor, and R. Wilber, Geometric Applications of a Matrix-Searching Algorithm, Algorithmica, 2, 195-208 (1987).
[2] Aggarwal, A., and J. Park, Notes on Searching in Multidimensional Monotone Arrays, Proc. 29th IEEE Symp. on Foundations of Computer Science, 497-512 (1988).
[3] Amir, A., G. Benson, and M. Farach, Let sleeping files lie: Pattern matching in Z-compressed files, J. of Comp. and Sys. Sciences, 52(2), 299-307 (1996).
[4] Apostolico, A., M. Atallah, L. Larmore, and S. McFaddin, Efficient parallel algorithms for string editing problems, SIAM J. Comput., 19, 968-998 (1990).
[5] Bell, T.C., J.C. Cleary, and I.H. Witten, Text Compression, Prentice Hall (1990).
[6] Benson, G., A space efficient algorithm for finding the best nonoverlapping alignment score, Theoretical Computer Science, 145, 357-369 (1995).
[7] Crochemore, M., and W. Rytter, Text Algorithms, Oxford University Press (1994).
[8] Eppstein, D., Sequence Comparison with Mixed Convex and Concave Costs, Journal of Algorithms, 11, 85-101 (1990).
[9] Eppstein, D., Z. Galil, and R. Giancarlo, Speeding Up Dynamic Programming, Proc. 29th IEEE Symp. on Foundations of Computer Science, 488-496 (1988).
[10] Eppstein, D., Z. Galil, R. Giancarlo, and G.F. Italiano, Sparse Dynamic Programming I: Linear Cost Functions, JACM, 39, 546-567 (1992).
[11] Eppstein, D., Z. Galil, R. Giancarlo, and G.F. Italiano, Sparse Dynamic Programming II: Convex and Concave Cost Functions, JACM, 39, 568-599 (1992).
[12℄ Erikson, B.W., and P.H.Sellers, Reognition of patternsingeneti sequenes,
in Time Warps, String Edits, and Maromoleules: The Theory and Pratie
of Sequene Comparison, D. Sanko and J.B. Kruskal, eds., Addison-Wesley,
Reading,MA,55{91(1983).
[13] Farach, M., and M. Thorup, String matching in Lempel-Ziv compressed strings.
Algorithmica, 20, 388–404 (1998).
[14] Galil, Z., and R. Giancarlo, Speeding Up Dynamic Programming with
Applications to Molecular Biology, Theoretical Computer Science, 64, 107–118 (1989).
[15] Galil, Z., and K. Park, A linear-time algorithm for concave one-dimensional
dynamic programming, Info. Processing Letters, 33, 309–311 (1990).
[16] Gasieniec, L., M. Karpinski, W. Plandowski, W. Rytter, Randomised efficient
algorithms for compressed strings: the finger-print approach, Proc. 7th Annual
Symposium On Combinatorial Pattern Matching, LNCS 1075, 39–49 (1996).
[17] Gasieniec, L., and W. Rytter, Almost optimal fully LZW compressed pattern
matching, Data Compression Conference, J. Storer, ed, (1999).
[18] Giancarlo, R., Dynamic Programming: Special Cases, Pattern Matching
Algorithms, edited by Apostolico, A. and Z. Galil, Oxford University Press,
201–232 (1997).
[19] Gusfield, D., Algorithms on Strings, Trees, and Sequences. Cambridge University [...]
[20] [...] Regions of Maximum Alignment Score, SIAM J. Comput., 25(3), 648–662 (1996).
[21] Karkkainen, J., G. Navarro, and E. Ukkonen, Approximate String Matching over
Ziv-Lempel Compressed Text, Proc. 11th Annual Symposium On Combinatorial
Pattern Matching, LNCS 1848, 195–209 (2000).
[22] Karkkainen, J., and E. Ukkonen, Lempel-Ziv parsing and sublinear-size index
structures for string matching, Proc. Third South American Workshop on String
Processing (WSP'96), 141–155 (1996).
[23] Kida, T., M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa, Shift-And
approach to pattern matching in LZW compressed text, Proc. 10th Annual
Symposium On Combinatorial Pattern Matching, LNCS 1645, 1–13 (1999).
[24] Klawe, M., and D. Kleitman, An Almost Linear Algorithm for Generalized Matrix
Searching, SIAM Jour. Discrete Math., 3, 81–97 (1990).
[25] Landau, G.M., E.W. Myers, and J.P. Schmidt, Incremental String Comparison,
SIAM J. Comput., 27(2), 557–582 (1998).
[26] Landau, G.M., and M. Ziv-Ukelson, On the Shared Substring Alignment Problem,
Proc. Symposium On Discrete Algorithms, 804–814 (2000).
[27] Landau, G.M., and M. Ziv-Ukelson, On the Common Substring Alignment
Problem, Journal of Algorithms.
[28] Lempel, A., and J. Ziv, On the complexity of finite sequences, IEEE Transactions
on Information Theory, 22, 75–81 (1976).
[29] Levenshtein, V.I., Binary Codes Capable of Correcting Deletions, Insertions and
Reversals, Soviet Phys. Dokl, 10, 707–710 (1966).
[30] Manber, U., A text compression scheme that allows fast searching directly in
the compressed file, Proc. 5th Annual Symposium On Combinatorial Pattern
Matching, LNCS 807, 113–124 (1994).
[31] Masek, W.J., and M.S. Paterson, A faster algorithm for computing string edit
distances. J. Comput. Syst. Sci., 20, 18–31 (1980).
[32] Monge, G., Deblai et Remblai, Memoires de l'Academie des Sciences, Paris
(1781).
[33] Navarro, G., T. Kida, M. Takeda, A. Shinohara, and S. Arikawa, Faster
Approximate String Matching Over Compressed Text, Proc. Data Compression
Conference (DCC2001), IEEE Computer Society, 459–468 (2001).
[34] Navarro, G., and M. Raffinot, A general practical approach to pattern matching
over Ziv-Lempel compressed text, Proc. 10th Annual Symposium On
Combinatorial Pattern Matching, LNCS 1645, 14–36 (1999).
[35] Navarro, G., and M. Raffinot, Boyer-Moore string matching over Ziv-Lempel
compressed text, Proc. 11th Annual Symposium On Combinatorial Pattern
Matching, LNCS 1848, 166–180 (2000).
[36] Sankoff, D., and J.B. Kruskal (editors), Time Warps, String Edits, and
Macromolecules: the Theory and Practice of Sequence Comparison, Addison-Wesley,
Reading, MA, (1983).
[37] Schmidt, J.P., All Highest Scoring Paths In Weighted Grid Graphs and Their
Application To Finding All Approximate Repeats In Strings, SIAM J. Comput,
27(4), 972–992 (1998).
[38] Shibata, Y., T. Kida, S. Fukamachi, M. Takeda, A. Shinohara, T. Shinohara, S.
Arikawa, Speeding up pattern matching by text compression, CIAC 2000, LNCS
1767, 306–315 (2000).
[39] Smith, T.F., and M.S. Waterman, Identification of common molecular subse[...]
[40] [...] Parsing Scheme and Digital Search Trees, Theoretical Computer Science, 144,
161–197 (1995).
[41] Takeda, M., Y. Shibata, T. Matsumoto, T. Kida, A. Shinohara, S. Fukamachi,
T. Shinohara, and S. Arikawa, Speeding up string pattern matching by text
compression: The dawn of a new era, 42(3), pp. 370–384 (2001).
[42] Waterman, M.S., and M. Eggert, A new algorithm for best subsequence alignment
with application to tRNA-rRNA comparisons, J. Molecular Biol., 197, 723–728
(1987).
[43] Welch, T.A., A Technique for High Performance Data Compression, IEEE Trans.
on Computers, 17(6), 8–19 (1984).
[44] Ziv, J., and A. Lempel, A Universal Algorithm for Sequential Data Compression,
IEEE Transactions on Information Theory, IT-23(3), 337–343 (1977).
[45] Ziv, J., and A. Lempel, Compression of individual sequences via variable rate
coding, IEEE Trans. Inform. Th., 24, 530–536 (1978).