HAL Id: hal-00619573
https://hal-upec-upem.archives-ouvertes.fr/hal-00619573
Submitted on 20 Mar 2013
A Subquadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices
Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson
To cite this version:
Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson. A Subquadratic Sequence Alignment
Algorithm for Unrestricted Scoring Matrices. SIAM Journal on Computing, Society for Industrial and
Applied Mathematics, 2003, 32 (6), pp.1654-1673. doi:10.1137/S0097539702402007. hal-00619573
Maxime Crochemore *
Institut Gaspard-Monge
University of Marne-la-Vallée

Gad M. Landau †
Haifa University
and
Polytechnic University

Michal Ziv-Ukelson ‡
Haifa University
and
IBM T.J.W. Research Center

Abstract
The classical algorithm for computing the similarity between two sequences [45, 48] uses a dynamic programming matrix, and compares two strings of size $n$ in $O(n^2)$ time. We address the challenge of computing the similarity of two strings in sub-quadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global similarity computations.
The speed-up is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an $O(n^2/\log n)$ algorithm for an input of constant alphabet size. For most texts, the time complexity is actually $O(hn^2/\log n)$, where $h \le 1$ is the entropy of the text.
We also present an algorithm for comparing two run-length encoded strings of length $m$ and $n$, compressed into $m'$ and $n'$ runs respectively, in $O(m'n + n'm)$ complexity. This result extends to all distance or similarity scoring schemes which use an additive gap penalty.
Keywords: alignment, dynamic programming, text compression, run length.
1 Introduction
The rapid progress in large-scale DNA sequencing opens a new level of computational challenges involved in storing, organizing and analyzing the wealth of biological information. One of the most interesting new fields that the availability of the complete genomes has created is that of genome comparison (the genome is all of the DNA sequence passed from one generation to the next). Comparing complete genomes can give deep insights about the relationship between organisms, as well as shedding light on the function of specific genes in each single genome. The challenge of
* Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France, http://www-igm.univ-mlv.fr/~mac/.
† Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (972-4) 824-0103, FAX: (972-4) 824-9331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 11201-3840; email: landau@poly.edu; partially supported by NSF grant CCR-0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation grants 173/98 and 282/01, by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award.
‡ Department of Computer Science, Haifa University, Haifa 31905, Israel; on Education Leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.ac.il; partially supported by the Israel Science Foundation grants 173/98 and 282/01, and by the FIRST Foundation of the Israel Academy of Science and Humanities.
tools.
One of the most common problems in biological comparative analysis is that of aligning two long bio-sequences in order to measure their similarity. The alignment is classically based on the transformation of one sequence into the other via operations of substitutions, insertions, and deletions (indels). Their costs are given by a scoring matrix.
Definition 1 (Gusfield [24]) Global Alignment Problem. Given a pairwise scoring matrix $\delta$ over the alphabet $\Sigma$, the similarity of two strings A and B is defined as the value maxV of the alignment of A and B that maximizes the total alignment value.
The score value maxV is called the optimal global alignment value of A and B.
A description of a maxV-scoring transformation of A into B is called a global alignment trace.
In many applications, two strings may not be highly similar in their entirety but may contain regions that are highly similar. The task is to find and extract a pair of regions, one from each of the two given strings, that exhibit high similarity. This is called the local alignment or local similarity and is defined formally below.
Definition 2 (Gusfield [24]) Local Alignment Problem. Given two strings A and B, find substrings $\alpha$ and $\beta$ of A and B, respectively, whose similarity (optimal global alignment value) is maximum over all pairs of substrings from A and B.
The score value maxL of the most similar pair of substrings $\alpha$ and $\beta$ is called the optimal local alignment value.
The description of a maxL-scoring transformation of substring $\alpha$ into substring $\beta$ is called a local alignment trace.
Both global and local similarity problems can be solved in $O(n^2)$ time by dynamic programming [24], [35], [48]. After the optimal similarity scores have been computed, both global alignment and local alignment traces can be reported in time linear with their size [10, 25, 26].
1.1 Results
In this paper data compression techniques are employed to speed up the alignment of two strings. The compression mechanism enables the algorithm to adapt to the data and to utilize its repetitions. The periodic nature of the sequence is quantified via its entropy, denoted by the real number $h$, $0 \le h \le 1$. Entropy is a measure of how "compressible" a sequence is (see [7], [12]), and is small when there is a lot of order (i.e., the sequence is repetitive and therefore more compressible) and large when there is a lot of disorder (see Section 2.2).
Our results include the following algorithms.
1.1.1 Global Alignment
• We present an $O(n^2/\log n)$ algorithm for computing the optimal global alignment value of two strings over a constant alphabet (see Section 3). The algorithm is even faster when the sequence is compressible. In fact, for most texts, the complexity of our algorithm is actually $O(hn^2/\log n)$.
• After the optimal score is computed, a single alignment trace corresponding to the optimal score can be recovered in time complexity that is linear with the size of the trace (see Section 4).
• For global alignment over "discrete" scoring matrices, we explain how the space complexity can be reduced to $O(h^2n^2/(\log n)^2)$, without impairing the $O(hn^2/\log n)$ time complexity (see Section 5).
1.1.2 Local Alignment
• We describe a sub-quadratic, $O(hn^2/\log n)$ algorithm for the computation of the optimal local alignment value of two strings over a constant alphabet (see Section 6.1).
• Given an index on A where substring $\alpha$ ends and an index on B where substring $\beta$ ends, an optimal local alignment trace can be reported in time linear with its size (see Section 6.2).
1.1.3 Comparing Two Run-Length Encoded Strings
• We give an algorithm for comparing two run-length encoded strings of length $m$ and $n$, compressed to $m'$ and $n'$ runs respectively, using any distance or similarity scoring scheme with additive gaps, in $O(m'n + n'm)$ complexity (see Section 7).
The algorithms described in this paper are the first to approach fully LZ compressed (both source and target strings are compressed) string alignment. The methods given in this paper can also be used by applications where both input strings are stored or transmitted in the form of an LZ78 or LZW compressed sequence, thus providing an efficient solution to the problem of how to compare two strings without having to decompress them first.
1.2 Previous Results
The only previously known sub-quadratic global alignment string comparison algorithm, by Masek and Paterson [39], is based on the Four Russians paradigm. The "Four Russians" algorithm divides the dynamic programming table into uniform sized ($\log n$ by $\log n$) blocks, and uses table lookup to obtain an $O(n^2/\log n)$ time complexity string comparison algorithm, based on two assumptions. One is that the sequence elements come from a constant alphabet. The other, which they denote the "discreteness" condition, is that the weights (of substitutions and indels) are all rational numbers.
Our algorithms present a new approach and are better than the above algorithm in two aspects. First, the algorithms presented here are faster for compressible sequences. For such sequences, the complexity of our algorithms is $O(hn^2/\log n)$, where $h \le 1$ is the entropy of the sequence.
[Figure 1 graphic not reproduced: the alignment graph with vertices (0,0) to (8,8), the scoring matrix $\delta$, and the sub-graph G with its input and output border values.]

Scoring matrix $\delta$ shown in the figure:
      -    a    c    g    t
 -        -1   -1   -1   -1
 a   -1    1   -1   -1   -1
 c   -1   -1    1   -1   -1
 g   -1   -1   -1    1   -1
 t   -1   -1   -1   -1    1

Figure 1: The alignment graph for comparing strings A = "ctacgaga" and B = "aacgacga". The scoring scheme matrix $\delta$ is shown in the lower left corner of the figure. The highest scoring global alignment paths originate in vertex (0,0), end in vertex (8,8) and have a total weight of 3. The highest scoring local alignment path has a total weight of 5 and corresponds to the alignment of substrings $\alpha$ = "acgaga" and $\beta$ = "acgacga". A sub-graph G corresponding to the block for comparing substrings $\alpha$ = "ag" and $\beta$ = "acg" is shown in the lower-right corner of the figure. Also specified are the values I for the entries of the input border for G (in white-shaded rectangles), and the values O of the output border of G (in grey-shaded rectangles), as set during a local alignment computation.
Second, our algorithms are general enough to support scoring schemes with real number weights. For many scoring schemes, the rational number weights supported by Masek and Paterson's algorithm do not suffice. For example, the entries of PAM similarity matrices, as well as BLOSUM evolutionary distance matrices, are defined to be real numbers, computed as log-odds ratios, and therefore could be irrational.
The paper by Masek and Paterson concludes with the following statement: "The most important problem remaining is finding a better algorithm for the finite (in our terms constant) alphabet case without the discreteness condition". Here, more than twenty years later, this important open question will finally be answered!
These advantages are based on the following facts. First, our algorithm does not require any pre-computation of lookup-tables, and therefore can afford more flexible weight values. Also, instead of dividing the dynamic programming matrix into uniform-sized blocks as did Masek and Paterson, we employ a variable-sized block partition, as induced by Lempel-Ziv factorization of both source
is then re-cycled and used for computing the relevant information for each block in time which is linear with the length of its sides. In this sense, the approach described in this paper can be viewed as another example of speeding up dynamic programming by keeping and computing only a relevant subset of important values, as demonstrated in [16], [17], [33] and [46]. A similar unbalanced strategy has been successfully used for square detection in strings [11] to speed up the original algorithm based on a divide-and-conquer approach [36].
2 Preliminaries
2.1 The Alignment Graph
The dynamic programming solution to the string comparison computation problem can be represented in terms of a weighted alignment graph [24] (see Figure 1).
The weight of a given edge can be specified directly on the grid graph, or, as is frequently the case in biological applications, is given by a penalty matrix, denoted $\delta$, which specifies the substitution cost for each pair of characters and the deletion cost for each character from the alphabet.
The two widely used classes of scoring schemes are distance scoring, in which the objective is to minimize the total alignment score, and similarity scoring, in which the objective is to maximize the total alignment score. Within these classes, scoring schemes are further characterized by the treatment of gap costs. A gap is the result of the deletion of one or more consecutive characters in one of the sequences. Additive gap costs assign a constant weight to each of the consecutive characters. For other gap functions which have been found useful for biological sequences, see [24]. The solutions in this paper assume a scoring scheme with additive gap costs.
Global Alignment via Dynamic Programming The classical dynamic programming algorithm for the global comparison of two strings will set the value at each vertex $(i,j)$ of the alignment graph, row by row in a left to right order, to the score between the first $i$ characters of A and the first $j$ characters of B, using the following recurrence:
$$V(i,j) = \max\bigl[\, V(i,j-1) + \delta(-,B_j),\;\; V(i-1,j) + \delta(A_i,-),\;\; V(i-1,j-1) + \delta(A_i,B_j) \,\bigr].$$
Computing and setting the values of all vertices in the alignment graph, using the above recurrence, takes $O(n^2)$ time and space. After the values at each vertex of the alignment graph have been computed and set, the optimal global alignment value maxV is found at vertex $(n,n)$ of the graph.
If each vertex in the alignment graph stores the operation (insertion, deletion, substitution) selected when its value was set, then a global alignment trace, corresponding to an optimal path in the alignment graph, can be recovered in time linear with its size, starting from vertex $(n,n)$ which contains the maximal score, and tracing the edges back up to vertex $(0,0)$ in the graph.
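The recurrence can be exercised on the example of Figure 1. The following is a minimal sketch rather than the authors' implementation: the names `global_alignment_value`, `delta`, `MATCH`, `MISMATCH` and `GAP` are ours, the weights (+1 for a match, -1 for a mismatch or a gap) are the ones we read from the Figure 1 scoring matrix, and the strings are taken to be A = "ctacgaga" and B = "aacgacga".

```python
MATCH, MISMATCH, GAP = 1, -1, -1  # assumed weights, as read from the Figure 1 scoring matrix


def delta(a, b):
    """Substitution score for a pair of characters."""
    return MATCH if a == b else MISMATCH


def global_alignment_value(A, B):
    """Fill V row by row, left to right, and return maxV = V(|A|, |B|)."""
    m, n = len(A), len(B)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):          # first row: only insertions are possible
        V[0][j] = V[0][j - 1] + GAP
    for i in range(1, m + 1):
        V[i][0] = V[i - 1][0] + GAP    # first column: only deletions are possible
        for j in range(1, n + 1):
            V[i][j] = max(
                V[i][j - 1] + GAP,                            # horizontal edge
                V[i - 1][j] + GAP,                            # vertical edge
                V[i - 1][j - 1] + delta(A[i - 1], B[j - 1]),  # diagonal edge
            )
    return V[m][n]


print(global_alignment_value("ctacgaga", "aacgacga"))  # 3, the weight quoted in the Figure 1 caption
```

The table costs quadratic time and space, which is the baseline the block-based algorithm of Section 3 improves on.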
Local Alignment via Dynamic Programming Smith and Waterman [48], [24] showed that essentially the same $O(|A||B|)$ dynamic programming solution can be used for computing local similarity, provided that the score of the alignment of two empty strings is defined as 0, and only pairs whose alignment scores are above 0 are of interest. The Smith-Waterman algorithm for
option, and thus restricts the scores to non-negative values:
$$L(i,j) = \max\bigl[\, 0,\;\; L(i,j-1) + \delta(-,B_j),\;\; L(i-1,j) + \delta(A_i,-),\;\; L(i-1,j-1) + \delta(A_i,B_j) \,\bigr].$$
The method to compute the optimal local alignment value maxL is to compute all alignment graph vertex values $L(i,j)$ in $O(n^2)$ time and space, and then find the largest value at any vertex on the table, say at vertex $(i_{end}, j_{end})$.
Given the vertex $(i_{end}, j_{end})$ whose score is maxL, the corresponding substrings $\alpha$ and $\beta$ giving the optimal local alignment of A and B are obtained in time linear with their size, by using the stored operations (insertion, deletion, substitution) to trace back the edges from vertex $(i_{end}, j_{end})$ until a vertex $(i_{start}, j_{start})$ is reached that has value zero. Then the optimal local alignment substrings for vertex $(i_{end}, j_{end})$ are $\alpha = A[i_{start} \ldots i_{end}]$ and $\beta = B[j_{start} \ldots j_{end}]$ [24].
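The local computation and its trace-back can be sketched in the same way. This is an illustrative sketch under assumptions: the function name `local_alignment` and the weights (+1 match, -1 mismatch or gap, as we read them from Figure 1) are ours, and instead of storing the selected operation at each vertex the code re-derives it during the trace-back, which gives the same path at the same cost per edge. The substrings are returned as Python slices, so they start one position after the zero-valued vertex at which the trace-back stops.

```python
MATCH, MISMATCH, GAP = 1, -1, -1  # assumed weights, as read from the Figure 1 scoring matrix


def delta(a, b):
    return MATCH if a == b else MISMATCH


def local_alignment(A, B):
    """Return (maxL, alpha, beta) for strings A and B."""
    m, n = len(A), len(B)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    best, i_end, j_end = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            L[i][j] = max(
                0,
                L[i][j - 1] + GAP,
                L[i - 1][j] + GAP,
                L[i - 1][j - 1] + delta(A[i - 1], B[j - 1]),
            )
            if L[i][j] > best:
                best, i_end, j_end = L[i][j], i, j
    # Trace back from the maximal vertex until a vertex with value zero is reached.
    i, j = i_end, j_end
    while L[i][j] > 0:
        if L[i][j] == L[i - 1][j - 1] + delta(A[i - 1], B[j - 1]):
            i, j = i - 1, j - 1      # substitution (diagonal edge)
        elif L[i][j] == L[i - 1][j] + GAP:
            i -= 1                   # deletion (vertical edge)
        else:
            j -= 1                   # insertion (horizontal edge)
    return best, A[i:i_end], B[j:j_end]


print(local_alignment("ctacgaga", "aacgacga"))  # (5, 'acgaga', 'acgacga')
```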
2.2 A Block Partition of the Alignment Graph based on LZ78 Factorization
The traditional aim of text compression is the efficient use of resources such as storage and bandwidth. Here, we will compress the sequences in order to speed up the alignment process. Note that this approach, denoted "acceleration by text-compression", has been recently applied to a related problem, that of exact string matching [29], [38], [47].
It should also be mentioned that another related problem, that of exact string matching in compressed text without decoding it, which is often referred to as "compressed pattern matching", has been studied extensively [4], [18], [43]. Along these lines, string search in compressed text was developed for the compression paradigm of LZ78 [52], and its subsequent variant LZW [50], as described in [30], [44]. A more challenging problem is that of "fully compressed" pattern matching when both the pattern and text strings are compressed [21], [22].
For the LZ78-LZW paradigm, compressed matching has been extended and generalized to that of approximate pattern matching (finding all occurrences of a short sequence within a long one allowing up to $k$ changes) in [28], [42].
The LZ compression methods are based on the idea of self reference: while the text file is scanned, substrings or phrases are identified and stored in a dictionary, and whenever, later in the process, a phrase or concatenation of phrases is encountered again, this is compactly encoded by suitable pointers [34], [51], [52].
Of the several existing versions of the method, we will use the ones which are denoted the LZ78 family [50], [52]. The main feature which distinguishes LZ78 factorization from previous LZ compression algorithms is in the choice of codewords. Instead of allowing pointers to reference any string that has appeared previously, the text seen so far is parsed into phrases, where each phrase is the longest matching phrase seen previously plus one character. For example, the string S = "aacgacg" is divided into four phrases: a, ac, g, acg. Each phrase is encoded as an index to its prefix, plus the extra character. The new phrase is then added to the list of phrases that may be referenced.
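The parsing rule can be sketched with a small trie. This is a minimal sketch under assumptions: the function name `lz78_phrases` and the dictionary-based trie layout are ours, and a leftover suffix that repeats an earlier phrase is simply reported as a final phrase.

```python
def lz78_phrases(s):
    """Parse s into LZ78 phrases: each phrase is the longest previously seen phrase plus one character."""
    trie = {}        # (parent node, character) -> child node; node 0 is the root (empty phrase)
    phrases = []
    node, start = 0, 0
    for pos, ch in enumerate(s):
        child = trie.get((node, ch))
        if child is not None:
            node = child                           # keep extending the longest known phrase
        else:
            trie[(node, ch)] = len(phrases) + 1    # the new phrase gets the next serial number
            phrases.append(s[start:pos + 1])
            node, start = 0, pos + 1               # restart from the root
    if start < len(s):
        phrases.append(s[start:])                  # leftover suffix equals an earlier phrase
    return phrases


print(lz78_phrases("aacgacg"))    # ['a', 'ac', 'g', 'acg']
print(lz78_phrases("ctacgaga"))   # ['c', 't', 'a', 'cg', 'ag', 'a']
```

Each character is looked up once in the trie, so the parse takes linear time for a constant alphabet.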
Since each phrase is distinct from the others, the following upper bound applies to the possible number of phrases obtained by LZ78 factorization.
Theorem 2.2.1 (Lempel and Ziv 1976 [34]) Given a sequence S of size $n$ over a constant alphabet, the maximal number of distinct phrases in S is $O(n/\log n)$.

[Figure 2 graphic not reproduced. Panels: "Trie for A", "Trie for B", "LZ78-Partitioned Alignment Graph", and "Graph G for Block (5,4)" with its left prefix (5,2), diagonal prefix (3,2) and top prefix (3,4).]
Figure 2: The block partition of the alignment graph, and the tries corresponding to LZ78 parsing of strings A = "ctacgaga" and B = "aacgacga". Note that for the block G in this example, xa = "ag", yb = "acg", $\ell_r = 2$, $\ell_c = 3$, $i = 5$ and $j = 4$. (The new cell of G, which does not appear in any of the prefix blocks, is the rightmost cell at the bottom row of G, and can be distinguished by its white color.) This figure continues Figure 1.
Even though the upper bound above applies to any possible sequence over a constant alphabet, it has been shown that in many cases we can do better than that.
Intuitively, the LZ78 algorithm compresses the sequence because it is able to discover some repeated patterns. Therefore, in order to compute a tighter upper bound on the number of phrases obtained by LZ78 factorization for "compressible" sequences, the repetitive nature of the sequence needs to be quantified. One of the fundamental ideas in information theory is that of entropy, denoted by the real number $h$, $0 \le h \le 1$, which measures the amount of disorder or randomness, or inversely, the amount of order or redundancy in a sequence. Entropy is small when there is a lot of order (i.e., the sequence is repetitive) and large when there is a lot of disorder. The entropy of a sequence should ideally reflect the ratio between the size of the sequence after it has been compressed, and the length of the uncompressed sequence.
The number of distinct phrases obtained by LZ78 factorization has been shown to be $O(hn/\log n)$ for most texts [7], [12], [34], [52]. Note that for any text over a constant alphabet, the upper bound above still applies by setting $h$ to 1.
3 Computing the Optimal Global Similarity Value
3.1 Definitions and Basic Observations
The alignment graph will be partitioned as follows. Strings A and B will be parsed using LZ78 factorization. This induces a partition of the alignment graph for comparing A with B into variable-
an LZ phrase of B.
Let xa denote a phrase in A obtained by extending a previous phrase x of A with character a, and yb denote a phrase in B, obtained by extending a previous phrase y of B with character b.
From now on we will focus on the computations necessary for a single block of the alignment graph.
Consider the block G which corresponds to the comparison of xa and yb. We define input border I as the left and top borders of G, and output border O as the bottom and right borders of G. (The node entries on the input border are numbered in a clockwise direction, and the node entries on the output border are numbered in a counter-clockwise direction.)
Rather than filling in the values of each vertex in G, as does the classical dynamic programming algorithm, the only values computed for each block will be those on its I/O borders (see Figures 1 and 5A). Intuitively, this is the reason behind the efficiency gain.
Let $\ell_r$ denote the number of rows in G, $\ell_r = |xa|$. Let $\ell_c$ denote the number of columns in G, $\ell_c = |yb|$. Let $t = \ell_r + \ell_c$. Clearly, $|I| = |O| = t$.
We define the following three prefix blocks of G.
1. The left prefix of G denotes the block comparing phrase xa of A and phrase y of B.
2. The diagonal prefix of G denotes the block comparing phrase x of A and phrase y of B.
3. The top prefix of G denotes the block comparing phrase x of A and phrase yb of B.
Observation 1 When traversing the blocks of an LZ78 parsed alignment graph in a left-to-right, top-to-bottom order, the blocks for the left prefix, diagonal prefix and top prefix of G are encountered prior to block G.
Note that the graph for the left prefix of G is identical to the sub-graph of G containing all columns but the last one. More specifically, both the structure and the weights of edges of these two graphs are identical, but the weights to be assigned to vertices during the similarity computation may vary according to the input border values. Similarly for the top prefix and diagonal prefix graphs. The only new cell in G, which does not appear in any of its prefix block graphs, is the cell for comparing a and b.
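The three prefix blocks can be named directly from the parent links of the two LZ78 tries. The sketch below is illustrative only: `lz78_trie` and `prefix_blocks` are our own names, the dictionary is keyed by whole phrases for brevity (a real trie would avoid the repeated string concatenation), a trailing repeated phrase is ignored, and the strings are the Figure 2 example as we read it.

```python
def lz78_trie(s):
    """Return (phrases, parent): phrases[k] is phrase number k (phrases[0] is the empty root)
    and parent[k] is the number of the phrase that phrase k extends by one character."""
    index = {"": 0}
    phrases, parent = [""], [0]
    cur = ""
    for ch in s:
        if cur + ch in index:
            cur += ch                        # still inside a known phrase
        else:
            index[cur + ch] = len(phrases)   # new phrase = known phrase cur + character ch
            parent.append(index[cur])
            phrases.append(cur + ch)
            cur = ""
    return phrases, parent


def prefix_blocks(i, j, parent_a, parent_b):
    """Block (i, j) compares phrase xa of A with phrase yb of B; x and y are the trie parents."""
    return {
        "left": (i, parent_b[j]),                  # xa against y
        "diagonal": (parent_a[i], parent_b[j]),    # x against y
        "top": (parent_a[i], j),                   # x against yb
    }


phrases_a, parent_a = lz78_trie("ctacgaga")
phrases_b, parent_b = lz78_trie("aacgacga")
print(phrases_a[5], phrases_b[4])                # ag acg
print(prefix_blocks(5, 4, parent_a, parent_b))   # {'left': (5, 2), 'diagonal': (3, 2), 'top': (3, 4)}
```

Because a parent phrase always has a smaller serial number than its child, every prefix block lies above or to the left of G, which is the content of Observation 1.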
3.2 I/O Propagation Across G
The work for each block consists of two stages (a similar approach is shown in [8, 27, 32, 33]).
1. encoding: Study the structure of G and represent it in an efficient way.
2. propagation: Given I and the encoding of G, constructed in the previous stage, compute O for G.
The structure of G is encoded by computing weights of optimal paths connecting each entry of its input border with each entry of its output border. The following DIST matrix is used (see Figure 3).
Definition 3 $DIST[i,j]$ stores the weight of the optimal path from entry $i$ of the input border of G to entry $j$ of its output border.
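[Figure 3 graphic not reproduced. The values below are reconstructed from the figure; minus signs and infinity symbols were lost in extraction and have been restored, and "." marks an undefined entry.]

DIST matrix (row $i$ is labelled with the input border value $I_i$; columns 0 to 5 are output border entries):
 I_0 = 1 |   0   -1   -2   -3    .    .
 I_1 = 2 |  -1   -1   -2   -1   -3    .
 I_2 = 3 |  -2    0    0    1   -1   -3
 I_3 = 2 |   .   -2   -2    0   -2   -2
 I_4 = 1 |   .    .   -2    0   -1   -1
 I_5 = 3 |   .    .    .   -2   -1    0

OUT matrix (undefined entries already complemented):
    1     0    -1    -2  -inf  -inf
    1     1     0     1    -1  -inf
    1     3     3     4     2     0
  -12     0     0     2     0     0
  -13   -13    -1     1     0     0
  -14   -14   -14     1     2     3

 O_0 ... O_5:   1   3   3   4   2   3

Figure 3: The DIST matrix which corresponds to the subsequences "ag", "acg", the OUT matrix obtained by adding the values of I to the rows of DIST, and the O containing the column maxima of OUT. This figure continues Figures 1 and 2.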
DIST matrices have also been used in [5], [8], [27], [33] and [46].
Given input row I and the DIST for G, the weight of output row vertex $O_j$ can be computed as the maximum among the sums $I_r + DIST[r,j]$, if there is indeed a path connecting input border entry $r$ with output border entry $j$.
Vertex $O_j$ is the maximum of column $j$ of the following OUT matrix, which merges the information from input row I and DIST (see Figure 3).
Definition 4 $OUT[i,j] = I_i + DIST[i,j]$.
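Definitions 3 and 4 can be checked on the block of Figure 3 by brute force. The sketch below is not the encoding used by the algorithm (which never runs a dynamic program per input entry); it only illustrates what DIST and OUT contain. The names `block_dist` and `propagate` are ours, the weights are the assumed +1/-1 scheme of Figure 1, both borders are assumed to be numbered starting from the bottom-left vertex of the block, unreachable pairs are given the weight minus infinity, and the input border values are those we read from Figure 3.

```python
NEG_INF = float("-inf")
GAP = -1  # assumed additive gap weight


def delta(a, b):
    """Assumed substitution scores: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def block_dist(x, y):
    """Brute-force DIST for the block comparing phrase x (rows) with phrase y (columns)."""
    lr, lc = len(x), len(y)
    # Input border, clockwise from the bottom-left vertex: left column upward, then top row.
    inp = [(r, 0) for r in range(lr, -1, -1)] + [(0, c) for c in range(1, lc + 1)]
    # Output border, counter-clockwise from the bottom-left vertex: bottom row, then right column upward.
    out = [(lr, c) for c in range(lc + 1)] + [(r, lc) for r in range(lr - 1, -1, -1)]
    dist = []
    for r0, c0 in inp:
        w = [[NEG_INF] * (lc + 1) for _ in range(lr + 1)]
        w[r0][c0] = 0
        for r in range(r0, lr + 1):
            for c in range(c0, lc + 1):
                if (r, c) == (r0, c0):
                    continue
                best = NEG_INF
                if r > r0:
                    best = max(best, w[r - 1][c] + GAP)
                if c > c0:
                    best = max(best, w[r][c - 1] + GAP)
                if r > r0 and c > c0:
                    best = max(best, w[r - 1][c - 1] + delta(x[r - 1], y[c - 1]))
                w[r][c] = best
        dist.append([w[r][c] for r, c in out])
    return dist


def propagate(I, dist):
    """O[j] is the maximum of column j of OUT, where OUT[i][j] = I[i] + DIST[i][j]."""
    return [max(I[i] + dist[i][j] for i in range(len(I))) for j in range(len(dist[0]))]


dist = block_dist("ag", "acg")
print(dist[2])                                  # [-2, 0, 0, 1, -1, -3]
print(propagate([1, 2, 3, 2, 1, 3], dist))      # [1, 3, 3, 4, 2, 3]
```

The propagated values agree with the output border O shown in Figure 3.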
Aggarwal and Park [3] and Schmidt [46] observed that DIST matrices are Monge arrays [41].
Definition 5 A matrix $M[0 \ldots m, 0 \ldots n]$ is Monge if either condition 1 or 2 below holds for all $a, b = 0 \ldots m$; $c, d = 0 \ldots n$:
1. convex condition: $M[a,c] + M[b,d] \le M[b,c] + M[a,d]$ for all $a < b$ and $c < d$.
2. concave condition: $M[a,c] + M[b,d] \ge M[b,c] + M[a,d]$ for all $a < b$ and $c < d$.
An important property of Monge arrays is that of being totally monotone.
Definition 6 A matrix $M[0 \ldots m, 0 \ldots n]$ is totally monotone if either condition 1 or 2 below holds for all $a, b = 0 \ldots m$; $c, d = 0 \ldots n$:
1. convex condition: $M[a,c] \ge M[b,c] \implies M[a,d] \ge M[b,d]$ for all $a < b$ and $c < d$.
2. concave condition: $M[a,c] \le M[b,c] \implies M[a,d] \le M[b,d]$ for all $a < b$ and $c < d$.
Note that the Monge property implies total monotonicity, but the converse is not true. Therefore, both DIST and OUT are totally monotone by the concave condition.
Aggarwal et al. [2] gave a recursive algorithm, nicknamed SMAWK in the literature, which can compute in $O(n)$ time all row and column maxima of an $n \times n$ totally monotone matrix, by querying only $O(n)$ elements of the array. Hence, one can use SMAWK to compute the output row O by querying only $O(n)$ elements of OUT. Clearly, if both the full DIST and all entries of I are available, then computing an element of OUT is $O(1)$ work.
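SMAWK itself is too long to reproduce here. The sketch below deliberately swaps in a simpler divide-and-conquer search, which relies on the same concave total monotonicity but spends $O((m + n)\log n)$ queries on an $m \times n$ matrix instead of a linear number. It is meant only to show how monotonicity prunes the search: the bottom-most maximum of a column can never lie above the bottom-most maximum of a column to its left. The function name `column_maxima` is ours, and the matrix is the OUT matrix of Figure 3 as we read it, with signs restored.

```python
NEG_INF = float("-inf")


def column_maxima(M):
    """Column maxima of a matrix that is totally monotone by the concave condition."""
    result = [None] * len(M[0])

    def solve(c_lo, c_hi, r_lo, r_hi):
        if c_lo > c_hi:
            return
        mid = (c_lo + c_hi) // 2
        best = r_lo
        for r in range(r_lo, r_hi + 1):
            if M[r][mid] >= M[best][mid]:     # ">=" keeps the bottom-most maximum
                best = r
        result[mid] = M[best][mid]
        solve(c_lo, mid - 1, r_lo, best)      # columns to the left: maxima at or above row best
        solve(mid + 1, c_hi, best, r_hi)      # columns to the right: maxima at or below row best

    solve(0, len(M[0]) - 1, 0, len(M) - 1)
    return result


OUT = [
    [1, 0, -1, -2, NEG_INF, NEG_INF],
    [1, 1, 0, 1, -1, NEG_INF],
    [1, 3, 3, 4, 2, 0],
    [-12, 0, 0, 2, 0, 0],
    [-13, -13, -1, 1, 0, 0],
    [-14, -14, -14, 1, 2, 3],
]
print(column_maxima(OUT))  # [1, 3, 3, 4, 2, 3]
```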
For various solutions to related problems, which also utilize Monge and Total Monotonicity properties, we refer the interested reader to [14], [15], [19], [20], [31] and [33]. In order to efficiently utilize these properties here, we need to address the following two problems.
1. How to efficiently compute DIST and represent it in a format which allows direct access to its entries. This will be done in Section 3.4.
2. SMAWK is intended for a full, rectangular matrix. However, both DIST and its corresponding OUT are not rectangular. Since paths in an alignment graph can only assume a left-to-right, top-to-bottom direction, connections between some input border vertices and some output border vertices are impossible. Therefore, the matrices are missing both a lower left triangle and an upper right triangle (see Figure 3). The question is addressed in Section 3.3.
3.3 Addressing the Rectangle Problem
The undefined entries of OUT can be complemented in constant time each, as follows.
1 The missing upper right triangle entries can be completed by setting the value of any entry $OUT[i,j]$ in this triangle to $-\infty$.
2 Let $k$ denote the maximal absolute value of a score in $\delta$. The missing lower left triangle entries can be completed by setting the value of any $OUT[i,j]$ in this triangle to $-(n+i+1)k$.
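The two completion rules can be written down directly. In this sketch, which is ours and not the authors' code, undefined entries are represented by `None`, each row is assumed to hold one contiguous run of defined entries, and the function name `complement_out` is hypothetical. The example rows are taken from the OUT matrix of Figure 3 as we read it, with $n = 8$ and $k = 1$.

```python
NEG_INF = float("-inf")


def complement_out(out, n, k):
    """Return a copy of OUT in which every undefined (None) entry has been filled in."""
    completed = []
    for i, row in enumerate(out):
        first_defined = next(j for j, v in enumerate(row) if v is not None)
        new_row = []
        for j, v in enumerate(row):
            if v is not None:
                new_row.append(v)
            elif j < first_defined:
                new_row.append(-(n + i + 1) * k)   # lower left triangle
            else:
                new_row.append(NEG_INF)            # upper right triangle
        completed.append(new_row)
    return completed


rows = [
    [1, 0, -1, -2, None, None],
    [1, 1, 0, 1, -1, None],
    [1, 3, 3, 4, 2, 0],
    [None, 0, 0, 2, 0, 0],
    [None, None, -1, 1, 0, 0],
    [None, None, None, 1, 2, 3],
]
for row in complement_out(rows, n=8, k=1):
    print(row)
```

For rows 3, 4 and 5 the lower left entries become -12, -13 and -14, matching the values visible in Figure 3.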
Lemma 3.3.1 Complementing the undefined entries as described above preserves the concave total monotonicity condition of OUT, and does not introduce new column maxima.
Proof:
1 Upper Right Triangle: All similarity scores in the alignment graph are finite. Therefore, no new column maxima are introduced. Suppose $OUT[a,c] \le OUT[b,c]$, $a < b$, and $OUT[a,c]$ has been set to $-\infty$. Due to the shape of the redefined upper-right triangle, once a $-\infty$ value in row $a$ is encountered, all future values in row $a$ are also $-\infty$. The future values of row $b$ could either be finite or $-\infty$. Therefore, $OUT[a,d] \le OUT[b,d]$ for all $d > c$.
nk. Since $i$ is always greater than or equal to zero, the complemented values in the lower left triangle are upper-bounded by $-(n+1)k$ and no new column maxima are introduced. Also, for any complemented entry $OUT[b,c]$ in the lower left triangle, $OUT[b,c] < OUT[a,c]$ for all $a < b$, and therefore the concave total monotonicity condition holds.
3.4 Incremental Update of the new DIST Information for G
In this section we show how to efficiently compute the new DIST information for G, using the DIST representations previously computed for its prefix blocks, plus the information of its new cell.
When processing a new block G, we compute the scores of $t$ new optimal paths, leading from the input border to the new vertex $(\ell_r, \ell_c)$ in the lowest, rightmost corner of G. These values correspond to column $\ell_c$ of the DIST matrix for G, and can be computed as follows.
Entry $[i]$ in column $\ell_c$ of the DIST for G contains the weight of the optimal path from entry $i$ in the input border of G to vertex $(\ell_r, \ell_c)$. This path must go through one of the three vertices $(\ell_r - 1, \ell_c)$, $(\ell_r - 1, \ell_c - 1)$ or $(\ell_r, \ell_c - 1)$. Therefore, the weight of the optimal path from entry $i$ in the input border of G to $(\ell_r, \ell_c)$ is equal to the maximum among the following three values:
1 Entry $[i]$ of column $\ell_c - 1$ of the DIST for the left prefix of G, plus the weight of the horizontal edge leading into $(\ell_r, \ell_c)$.
2 Entry $[i]$ of column $\ell_c - 1$ of the DIST for the diagonal prefix of G, plus the weight of the diagonal edge leading into $(\ell_r, \ell_c)$.
3 Entry $[i]$ of column $\ell_c$ of the DIST for the top prefix of G, plus the weight of the vertical edge leading into $(\ell_r, \ell_c)$.
3.4.1 Maintaining Direct Access to DIST Columns
In order to compute an entry of OUT in constant time during the execution of SMAWK, direct access to DIST entries is necessary. This is not straightforward, since as shown in the previous section, for each block only one new DIST column has been computed and stored. All other columns besides column $\ell_c$ of the DIST for G need to be obtained from G's prefix ancestor blocks.
Therefore, before the execution of SMAWK begins, a vector with pointers to all $t+1$ columns of the DIST for G is constructed (see Figure 4). This vector is no longer needed after the computations for G have been completed, and its space can be freed.
The pointers to all columns of the DIST for G are assembled as follows. Column $\ell_c$ is set to the newly constructed vector for G. All columns of indices smaller than $\ell_c$ are obtained via $\ell_c$ recursive calls to left prefix blocks of G. All columns of indices greater than $\ell_c$ are obtained via $\ell_r$ recursive calls to top prefix blocks of G.
3.4.2 Querying a Prefix Block and Obtaining its DIST Column in Constant Time
The LZ78 phrases form a trie (see Figure 2), and the string to be compressed is encoded as a sequence of names of prefixes of the trie. Each node in the trie contains the serial number of the phrase it represents. Since each block corresponds to a comparison of a phrase from A with a phrase from B, each block will be identified by a pair of numbers, composed of the serial numbers for its corresponding phrases in the tries for A and B.

[Figure 4 graphic not reproduced. Panels: "Trie for A", "Trie for B", "Block Table", and the vector of column pointers "DIST(5,4)".]
Figure 4: A table containing an entry for each block of the alignment graph. Entry $(i,j)$ of the table represents the block which corresponds to node $i$ in the trie for A and node $j$ in the trie for B. The entry for each block in the table points to the start of its new DIST column. Also shown is the vector which contains pointers to all columns of the DIST for block (5,4), as obtained from its ancestor prefix blocks. This figure continues Figures 1, 2 and 3.
Another data structure to be constructed is a Block Table (see Figure 4), containing an entry for each partitioned block of the alignment graph. The entry for each block in the table points to the start of its new DIST column, and can be directly accessed via the block's phrase number index pair.
The left prefix of G can be identified in constant time as a pair of phrase numbers, the first identical to the serial number of xa, and the second corresponding to the serial number of y, which is the direct ancestor of yb in the trie for B. Similarly, the top prefix of G can be identified in constant time. Given the pair of identification numbers for a block, a pointer to the corresponding DIST column can then be directly obtained from the Block Table.
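The Block Table and the vector of column pointers can be sketched as follows. This is an illustration under assumptions and not the authors' data structure: the names `lz78_trie`, `corner_column`, `build_block_table`, `dist_columns` and `dist_entry` are ours, the table is a Python dictionary keyed by phrase-number pairs, each new column is computed by brute force for brevity, the root (empty phrase) of each trie is given its own table entries, and a trailing repeated phrase is ignored. Each pointer is stored with a row offset because, under the border numbering assumed here, a top prefix block lacks the first input entries of G.

```python
NEG_INF = float("-inf")
GAP = -1  # assumed additive gap weight


def delta(a, b):
    return 1 if a == b else -1  # assumed match / mismatch scores


def lz78_trie(s):
    """phrases[k] is phrase number k (0 is the empty root); parent[k] is the phrase it extends."""
    index, phrases, parent, cur = {"": 0}, [""], [0], ""
    for ch in s:
        if cur + ch in index:
            cur += ch
        else:
            index[cur + ch] = len(phrases)
            parent.append(index[cur])
            phrases.append(cur + ch)
            cur = ""
    return phrases, parent


def corner_column(x, y):
    """Brute-force new DIST column: best path from each input-border vertex to the corner."""
    lr, lc = len(x), len(y)
    inp = [(r, 0) for r in range(lr, -1, -1)] + [(0, c) for c in range(1, lc + 1)]
    column = []
    for r0, c0 in inp:
        w = {(r0, c0): 0}
        for r in range(r0, lr + 1):
            for c in range(c0, lc + 1):
                if (r, c) == (r0, c0):
                    continue
                cands = []
                if r > r0:
                    cands.append(w[(r - 1, c)] + GAP)
                if c > c0:
                    cands.append(w[(r, c - 1)] + GAP)
                if r > r0 and c > c0:
                    cands.append(w[(r - 1, c - 1)] + delta(x[r - 1], y[c - 1]))
                w[(r, c)] = max(cands)
        column.append(w[(lr, lc)])
    return column


def build_block_table(pa, pb):
    """Block Table: (phrase number in A, phrase number in B) -> that block's new DIST column."""
    return {(i, j): corner_column(pa[i], pb[j]) for i in range(len(pa)) for j in range(len(pb))}


def dist_columns(i, j, table, pa, pb, par_a, par_b):
    """Vector of (column, row offset) references for all columns of the DIST of block (i, j)."""
    lr, lc = len(pa[i]), len(pb[j])
    columns = [None] * (lr + lc + 1)
    jj = j
    for c in range(lc, -1, -1):           # column lc, then the chain of left prefixes
        columns[c] = (table[(i, jj)], 0)
        jj = par_b[jj]
    ii = i
    for k in range(1, lr + 1):            # the chain of top prefixes
        ii = par_a[ii]
        columns[lc + k] = (table[(ii, j)], k)
    return columns


def dist_entry(columns, r, c):
    """DIST[r][c] in constant time; entries outside the stored column are undefined."""
    column, offset = columns[c]
    k = r - offset
    return column[k] if 0 <= k < len(column) else NEG_INF


pa, par_a = lz78_trie("ctacgaga")
pb, par_b = lz78_trie("aacgacga")
table = build_block_table(pa, pb)
cols = dist_columns(5, 4, table, pa, pb, par_a, par_b)
print([dist_entry(cols, 2, c) for c in range(6)])   # [-2, 0, 0, 1, -1, -3]
```

The vector holds references to the stored columns rather than copies, so building it costs one step per column, in line with the $\ell_c$ and $\ell_r$ calls described in Section 3.4.1.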
Time and Space Analysis Assume sequence size $n$ and sequence entropy $h \le 1$. The LZ78 factorization algorithm parses the strings and constructs the tries for A and B in $O(n)$ time. The resulting number of phrases in both A and B is $O(hn/\log n)$. The number of resulting blocks in the alignment graph is equal to the number of phrases in A times the number of phrases in B, and is therefore $O(h^2n^2/(\log n)^2)$. For each block G, the following information (1-3) is computed, in time and space complexity linear with the size of its I/O borders:
1. Updating the Encoding Structure for G. The prefix blocks of G can be accessed in constant time. The vectors of DIST column pointers for the prefix blocks have already been freed. However, since each prefix block directly points to its newly computed DIST column, all values needed for the computations are still available. Since each entry of the new DIST column for G is set to the
$O(t)$ time and space.
2. Maintaining Direct Access to DIST columns. Since prefix blocks and their DIST columns can be accessed in constant time, the vector with pointers to columns of the DIST for G can be set in $O(t)$ time.
3. Propagating I/O values across the block. Using the information computed for G, and given the I for G obtained from the O vectors for the block above G and the block to its left, the values of O for G are computed via SMAWK matrix searching in $O(t)$ time.
Total Complexity Since the work and space for each block is linear with the size of its I/O borders, the total time and space complexity is linear with the total size of the borders of the blocks. The block borders form $O(hn/\log n)$ rows of size $|B|$ each, and $O(hn/\log n)$ columns of size $|A|$ each, in the alignment graph (see Figure 2). Therefore, the total time and space complexity is $O(hn^2/\log n)$.
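The count of border sizes can be spelled out as a short sum. The notation here is ours: $p_1, \ldots, p_{k_A}$ and $q_1, \ldots, q_{k_B}$ denote the LZ78 phrases of A and B, with $k_A, k_B = O(hn/\log n)$ and $|A| = |B| = n$, and the block comparing $p_i$ with $q_j$ is assumed to cost $O(|p_i| + |q_j|)$, as argued above.

```latex
\sum_{i=1}^{k_A} \sum_{j=1}^{k_B} \bigl( |p_i| + |q_j| \bigr)
  \;=\; k_B \sum_{i=1}^{k_A} |p_i| \;+\; k_A \sum_{j=1}^{k_B} |q_j|
  \;=\; k_B \, |A| \;+\; k_A \, |B|
  \;=\; O\!\left( \frac{hn}{\log n} \right) \cdot n
  \;=\; O\!\left( \frac{hn^2}{\log n} \right).
```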
4 Global Similarity Optimal Alignment Trace Recovery
The recovery of an optimal global alignment trace between A and B starts at vertex $(n,n)$. The series of block crossing paths is then traced back until vertex $(0,0)$ is reached. For each block crossed, the internal alignment trace is reported, starting from the output border sink, and back to the optimal origin source vertex in the corresponding input border. In order to support the recovery of block-crossing paths in time linear with their size, the computation and storage of the following additional information for a given block G is required.
1. During the Propagation stage, for each entry $j$ in the output border of G, the index of the input border entry $i$, which is the source of the highest scoring path to output border entry $j$, is saved.
2. During Encoding, an additional $O(t)$ sized vector of pointers, the ancestors vector, is computed for G. For any output border entry $O[j]$, $j = 0 \ldots t$, ancestors[$j$] points to the ancestor block of G for which this entry is its new vertex. (The value of ancestors[$\ell_c$] is set to G. All columns of indices smaller than $\ell_c$ are obtained via $\ell_c$ recursive calls to left prefix blocks of G. All columns of indices greater than $\ell_c$ are obtained via $\ell_r$ recursive calls to top prefix blocks of G.)
3. During Encoding, G's new vertex $(\ell_r, \ell_c)$ is annotated with an additional $O(t)$ sized vector of pointers, denoted direction. These pointers are set during the DIST column computation described in Section 3.4, as follows. The value of direction[$i$] is set according to the direction of the last edge in the optimal path originating at entry $i$ of G's input border and ending at vertex $(\ell_r, \ell_c)$.
Given that the optimal path enters through entry j of the output border of G, the trace-back of the
part of the path going through G proceeds in two stages. The first stage is a destination and origin
initialization stage. This stage includes the fetching of the input row source entry i, which was
stored as the origin for the highest scoring path to G's output border entry j (see 1 above). The
ancestor block P of G, pointed to by ancestors[j], is fetched (see 2 above). The edge recovery begins
in block P.
During the second stage, the origin and destination information computed in the first stage is used
to trace back the part of the path contained in P, from entry j on P's output border (the new
vertex of P), to entry i on its input border. This is done by backtracking through a dynasty of
prefix ancestor blocks internal to P, using the direction vector computed for each of the traversed
blocks (see 3 above). If direction[i] of the traversed block specifies a horizontal edge, then the
trace-back retreats to the left prefix of P, and an "insertion" operation is reported in the alignment
trace. Correspondingly, "substitution" and "deletion" are reported when backtracking to diagonal
and top prefix blocks. The recovery continues through a series of prefix blocks of P until the full
optimal alignment trace is recovered.
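The second stage can be illustrated with a minimal Python sketch. It is not the encoding described in this paper: the direction table is filled here by brute-force dynamic programming rather than during the DIST column computation, and the names build_direction, trace_back, score and gap are hypothetical. A prefix block of P is identified by the coordinates (r, c) of its new vertex, so a horizontal, vertical or diagonal step moves to the left, top or diagonal prefix block respectively.

```python
def build_direction(a, b, score, gap):
    """Brute-force stand-in for the encoding stage (an assumption, not the paper's method).

    direction[(r, c), source] is the last edge ('H', 'V' or 'D') of a highest
    scoring path from `source` on the input border to the new vertex (r, c).
    """
    R, C = len(a), len(b)
    # Input border of P: left column (bottom to top), then the top row.
    sources = [(r, 0) for r in range(R, -1, -1)] + [(0, c) for c in range(1, C + 1)]
    direction = {}
    for sr, sc in sources:
        best = {(sr, sc): 0}
        for r in range(sr, R + 1):
            for c in range(sc, C + 1):
                if (r, c) == (sr, sc):
                    continue
                cands = []
                if c > sc:
                    cands.append((best[(r, c - 1)] + gap, "H"))
                if r > sr:
                    cands.append((best[(r - 1, c)] + gap, "V"))
                if r > sr and c > sc:
                    cands.append((best[(r - 1, c - 1)] + score(a[r - 1], b[c - 1]), "D"))
                weight, edge = max(cands)
                best[(r, c)] = weight
                direction[(r, c), (sr, sc)] = edge
    return direction


def trace_back(direction, end, source):
    """Follow direction pointers from `end` back to `source`, one prefix block per edge."""
    r, c = end
    ops = []
    while (r, c) != source:
        edge = direction[(r, c), source]
        if edge == "H":          # retreat to the left prefix block
            ops.append("insertion")
            c -= 1
        elif edge == "V":        # retreat to the top prefix block
            ops.append("deletion")
            r -= 1
        else:                    # retreat to the diagonal prefix block
            ops.append("substitution")
            r -= 1
            c -= 1
    ops.reverse()
    return ops
```

Each loop iteration performs one dictionary lookup and reports one edge, which mirrors the claim that the trace is recovered in time linear in its size once the direction vectors are available.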
Time and Space Analysis The two additional vectors for G, direction and ancestors, and the
input border source entry i, can be computed and stored during the encoding and propagation stages
in O(t) time and space.
The work for the first stage in the trace-back can be done in constant time. In the second stage,
each edge in the recovered alignment path results in a traversal to a single prefix block. Since prefix
blocks and their corresponding direction vectors can be accessed in constant time, a highest scoring
global alignment between strings A and B can be recovered in time linear in its size.
5 Reducing the Space Complexity
When computing the optimal global alignment value with scoring matrices which follow the
"discreteness" condition (see Section 1), the efficient alignment stage algorithm described in [33] can
be extended to support full propagation from the leftmost and upper boundaries to the bottom
and rightmost boundaries of G.
This extended propagation algorithm can then be used to compute the values of the global alignment
O for G, given the I for G and a minimal encoding of the DIST for G. The advantage of this minimal
encoding of DIST is that rather than saving an O(t) sized DIST column per block, we only need to
save a constant number of values per block. The encoding for the new DIST column of each block
can be computed and stored in constant time and space from the information stored for the left,
diagonal and top prefix blocks of G, using the technique described in Section 6 of [46].
This reduces the space complexity to $O(h^2 n^2 / (\log n)^2)$, while preserving the $O(hn^2 / \log n)$ time
complexity.
6 The Local Alignment Algorithm
6.1 Computing the Optimal Local Similarity Value
When computing the optimal local similarity value, an optimal path could either be contained
entirely in one block (type C), or could be a block-crossing path (see Figure 5). A block-crossing
path consists of a (possibly empty) S-path, followed by any number of paths leading from the input
border of a block to its output border, and ending in an E-path with a highest scoring last vertex.
Since an optimal path could begin inside any block, vector O needs to be updated to consider the
additional paths originating inside G. Also, since an optimal path could end inside any block, extra
bookkeeping is needed in order to keep track of the highest scoring paths ending in each block.
Figure 5: A. The I/O path weight vectors computed for each block in the global alignment solution.
DIST[i,j] will be set to the highest scoring path connecting vertex i in the input border with vertex j in the
output border. B, C. The vectors of optimal path weights considered for the local alignment computation.
Therefore, in addition to the DIST described in Section 3, we compute for each block G the following
data structures (see Figures 5B and 5C).
1. E is a vector of size t. E[i] contains the value of the highest scoring path which starts at
vertex i of the input border of G and ends inside G. E[i] is computed as the maximum
between E[i] for the left prefix of G, E[i] for the top prefix of G, and DIST[i, $\ell_c$].
2. S is a vector of size t. S[i] contains the value of the highest scoring path which starts inside
G and ends at vertex i of the output border of G.
The only new value computed for S is the local alignment score for the new vertex of G,
S[$\ell_c$]. Given the score S[$\ell_c - 1$] obtained from the diagonal prefix, S[$\ell_c - 1$] obtained from the
left prefix and S[$\ell_c$] obtained from the top prefix of G, and the weights of the 3 edges leading
into vertex $(\ell_r, \ell_c)$, S[$\ell_c$] can be computed in O(1) time, using the recursion given
in Section 2.1.
The values of all other entries of S are then set as follows. The first $\ell_c$ values of S are copied
from the first $\ell_c$ values of the S computed for the left prefix of G. The last $\ell_r$ values are
copied from the last $\ell_r$ values of the S vector for the top prefix of G.
3. C is the value of the highest scoring path contained in G, that is, the highest scoring path
which originates inside G and ends inside G. C is computed as the maximum between the C
value for the left prefix of G, the C value for the top prefix of G, and the newly computed
S[$\ell_c$] as described above.
E and C will be used to compute the weight of the highest scoring path ending in G.
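The assembly of S and C from the prefix blocks can be sketched as follows. This is only an illustration under stated assumptions: the function name combine_s_and_c and its parameters are hypothetical, the S vectors are taken to be lists indexed along the output border (bottom row first, then the right column), and the recursion of Section 2.1 is assumed to be the usual local-alignment recursion that takes a maximum with zero.

```python
def combine_s_and_c(S_left, S_top, S_diag, C_left, C_top, l_r, l_c, w_h, w_v, w_d):
    """Build S and C for a block G with l_r rows and l_c columns (l_r, l_c >= 1).

    S_left, S_top, S_diag: S vectors of the left, top and diagonal prefix blocks.
    w_h, w_v, w_d: weights of the horizontal, vertical and diagonal edges that
    lead into the new vertex of G.
    """
    # Score of the new vertex; the 0 lets a path start at this vertex (assumption).
    s_new = max(0,
                S_diag[l_c - 1] + w_d,
                S_left[l_c - 1] + w_h,
                S_top[l_c] + w_v)
    # First l_c entries from the left prefix, last l_r entries from the top prefix.
    S = S_left[:l_c] + [s_new] + S_top[len(S_top) - l_r:]
    # Best path fully contained in G.
    C = max(C_left, C_top, s_new)
    return S, C
```

Only the new-vertex score requires arithmetic; a real implementation would share the copied entries by pointer rather than slicing lists, which this sketch does not attempt.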
Vector O is first computed from the I and DIST for G as described in Section 3.2. At this point
entry O[i] reflects the weight of the optimal path starting anywhere outside G and ending at entry i
of the output border. It needs to be updated with the weights of the highest scoring paths starting
inside G. This is achieved by resetting O[i] to the maximum between O[i] and S[i].
The weight of the highest scoring path ending in G is computed as $\max(\max_{i=0}^{t} \{I[i] + E[i]\},\ C)$.
After the computations for each block have been completed, the overall highest local alignment
score for comparing A and B can be computed as the maximum among the values of the highest
scoring path ending in each block.
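The per-block bookkeeping just described amounts to two maximizations. The following sketch assumes that I, E, S and O are plain lists of equal length and that C is a number; the function names finish_block and best_local_score are hypothetical.

```python
def finish_block(I, E, S, O, C):
    """Return the updated output vector and the best score of a path ending in the block."""
    # Paths that start inside the block may beat paths that cross it.
    O_updated = [max(o, s) for o, s in zip(O, S)]
    # Best path ending inside the block: either it enters through the input
    # border (I[i] + E[i]) or it lies entirely inside the block (C).
    best_ending = max(max(i + e for i, e in zip(I, E)), C)
    return O_updated, best_ending


def best_local_score(block_bests):
    """Overall local alignment score: the maximum over all blocks."""
    return max(block_bests)
```

For example, with I = [0, 3, 1], E = [2, 0, 4], S = [1, 5, 0], O = [4, 2, 3] and C = 6, the updated output vector is [4, 5, 3] and the best score ending in the block is 6.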
Time and Space Analysis Since, as shown in Section 3.4.1, each prefix block of G can be
accessed in constant time, the values of the S and E vectors for G can be computed and stored in
O(t) time and space, and the C value for G can be computed in constant time and space.
Given the S, E and C values for G, the values of O and the weight of the highest scoring path
ending in G can be computed in O(t) time each as described above.
The weight of the highest scoring path in the alignment graph can then be computed in an additional
$O(h^2 n^2 / (\log n)^2)$ time as the maximum value among the best values computed for each block.
Since the work and space for each block is linear with the size of its I/O borders, the total time
and space complexity of computing the optimal local alignment value is $O(hn^2 / \log n)$.
6.2 Optimal Alignment Trace Recovery for the Local Alignment Solution
Similarly to the alignment trace defined in Section 4, given a maxL vertex $(i_{end}, j_{end})$ which was
obtained in the previous section, we show how to recover the optimal path ending in this vertex by
reporting a trace-back of the edges from vertex $(i_{end}, j_{end})$ until a start-point vertex $(i_{start}, j_{start})$
is reached that has value zero.
A block-crossing optimal path consists of a (possibly empty) S-path, followed by any number of
paths leading from the input border of a block to its output border, and ending in an E-path whose
last vertex is $(i_{end}, j_{end})$.
The recovery starts at vertex $(i_{end}, j_{end})$ and continues back to the optimal path origin in three
stages.
1. Recovering the E-path part.
During encoding, whenever the E[i] value of a block is updated by its new vertex, a pointer
to the updating block is saved together with the new E[i] value.
During alignment recovery, given that vertex $(i_{end}, j_{end})$ ends an E[i] path in G, the
corresponding block can be fetched, and the path from its new vertex to entry i on its input border
recovered, as described in Section 4.
2. Recovering all paths leading from the input border of a block to its output border.
The part of the path contained in each one of these blocks can be recovered as described in
Section 4.
3. Recovering the S-path part.
During encoding, when computing the S-score of the new vertex of each block, the direction
of the edge optimizing the score S[$\ell_c$] of the new vertex of G, denoted $s_{direction}$, is saved
with the score.
During the termination of the propagation stage, when setting the score values for each entry
in O, a field is set, indicating whether the newly set score value for this entry corresponds to a
path originating inside G (an S-path), or a path crossing G. In the S-path case, the recovery of the
S-path part utilizes the technique described in Section 4, with a slight modification. Instead
of the direction vector, the $s_{direction}$ field is used for the edge trace-back. The recovery halts
when an ancestor block is reached whose S[$\ell_c$] value is zero.
A special case occurs when vertex $(i_{end}, j_{end})$ is the endpoint of a C-path. A C-path is, in essence,
a halted S-path. During encoding, whenever the C value of a block is updated by its new vertex,
a pointer to the updating block is saved together with the new C value. The recovery of the C
path in G starts at the new vertex of its corresponding block and continues similarly to the S-path
recovery, as described in 3 above.
Time and Space Analysis In addition to the values described in Section 4, an additional
O(t) information (pointers to the E[i] updating blocks) is computed and stored for E-paths, and
an additional O(1) information per block is computed and stored for C and S paths. During
propagation termination, an additional O(t) information is stored with the O vector.
During recovery, each edge in the recovered alignment path results in a traversal to a single prefix
block, for each one of the three path parts. Both prefix blocks and their corresponding direction
vectors can be accessed in constant time. Therefore, in addition to the basic $O(hn^2 / \log n)$ time and
space needed for computing the optimal local alignment score maxL, an alignment trace ending at
a given maxL-scoring vertex can be reported in time linear with the size of the trace.
7 Applications to the Problem of Comparing Two Run Length Encoded Strings
A string S is run-length encoded if it is described as an ordered sequence of pairs $(\sigma, i)$, often denoted
"$\sigma^i$," each consisting of an alphabet symbol, $\sigma$, and an integer, i. Each pair corresponds to a run in
S, consisting of i consecutive occurrences of $\sigma$. For example, the string aabbbbbccc can be encoded
as $a^2 b^5 c^3$. Such a run-length encoded string can be significantly shorter than the expanded string
representation after efficiently encoding the integers (see [13] for example).
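As a small illustration of the definition, the following Python sketch converts a string to its list of (symbol, run length) pairs and back. The function names are hypothetical, and the pairs are kept as plain integers rather than in the compact integer encoding of [13].

```python
from itertools import groupby


def run_length_encode(s):
    """Return the runs of s as a list of (symbol, length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(s)]


def run_length_decode(pairs):
    """Expand a list of (symbol, length) pairs back into a string."""
    return "".join(symbol * count for symbol, count in pairs)
```

For the example above, run_length_encode("aabbbbbccc") yields [("a", 2), ("b", 5), ("c", 3)].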
Run-length encoding serves as a popular image compression technique, since many classes of images
(e.g., binary images in facsimile transmission or for use in optical character recognition) typically
contain large patches of identically-valued pixels.
Let m and n be the lengths of two strings X and Y, whose run-length encodings have lengths
m' and n', respectively. Previous algorithms for the problem compared two run-length encoded
strings using the Levenshtein Edit Distance [35] and the LCS similarity measure [25]. For the
LCS metric, Bunke and Csirik [9] presented an O(mn' + nm') time algorithm, while Apostolico,
Landau, and Skiena [6] described an O(m'n' log(m'n')) time algorithm. Mitchell [40] has obtained
an O((d + m' + n') log(d + m' + n')) time algorithm for a more general string matching problem.
Arbell et al. [1] and Makinen et al. [37] independently obtained an O(m'n + n'm) time algorithm for
computing the edit distance between two run-length encoded strings for the Levenshtein distance
metric.
Makinen et al. [37] posed as an open problem the challenge of extending these results to more
general scoring schemes, since in those applications which are related to image compression, the
change from a pixel value to the next is smooth. Here, we will show how to extend the results to
apply them to any distance or similarity scoring scheme with additive gap scores.
In this solution, the alignment graph is also partitioned into blocks. But rather than using the
LZ78 partition described in Section 2, each block here consists of two runs, one of X and one of
Y. This results in the partition of the alignment graph into m'n' blocks. The algorithm suggested
also propagates accumulated scores from the left and upper boundaries of each block, to its bottom
and right boundaries.
Consider the block R for comparing a run of X with a run of Y. An edge in R could be
assigned one of three possible weight values: D (diagonal), H (horizontal) and V (vertical).
Let $\Delta_h$ and $\Delta_w$ denote the difference in row index values and column index values respectively,
between entry i on the input border of R, and entry j on the output border of R.
We show how to compute DIST[i,j] (which is the cost of the best scoring path from entry i in the
input border of the block, to entry j in the output border of the block) in constant time, given
$\Delta_h$ and $\Delta_w$ for the input and output entries, and the values D, H and V.
• $H + V \ge D$. Clearly, an optimal path from i to j can use all possible diagonal edges and only
then the minimal number of remaining H and V edges necessary to reach j.
Therefore, DIST[i,j] obtains one of three values:
1. If $\Delta_w = \Delta_h$, then DIST[i,j] $= D \Delta_h$.
2. If $\Delta_w > \Delta_h$, then DIST[i,j] $= D \Delta_h + H (\Delta_w - \Delta_h)$.
3. If $\Delta_w < \Delta_h$, then DIST[i,j] $= D \Delta_w + V (\Delta_h - \Delta_w)$.
• $H + V < D$. In this case, an optimal path never uses any diagonal edge. The path includes
only the minimal number of H edges, and the minimal number of V edges necessary to reach
j from i. In this case, DIST[i,j] $= H \Delta_w + V \Delta_h$.
Therefore, DIST[i,j] can be easily computed in constant time when using the general scoring scheme
described in Section 2.1.
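The case analysis can be checked with a short Python sketch. The names dist_run_block and dist_brute_force are hypothetical; the arguments dh and dw stand for the row and column differences, DIST is treated as a cost to be minimized, and the brute-force routine is an ordinary dynamic program included only to compare against the closed form.

```python
def dist_run_block(dh, dw, D, H, V):
    """Closed-form DIST between two border entries of a run block."""
    if H + V >= D:
        # Use every possible diagonal edge, then only the remaining H or V edges.
        diag = min(dh, dw)
        return D * diag + H * (dw - diag) + V * (dh - diag)
    # A diagonal edge is never worth taking.
    return H * dw + V * dh


def dist_brute_force(dh, dw, D, H, V):
    """Cheapest path over a dh-by-dw grid with uniform edge weights."""
    INF = float("inf")
    cost = [[INF] * (dw + 1) for _ in range(dh + 1)]
    cost[0][0] = 0
    for r in range(dh + 1):
        for c in range(dw + 1):
            if r > 0:
                cost[r][c] = min(cost[r][c], cost[r - 1][c] + V)
            if c > 0:
                cost[r][c] = min(cost[r][c], cost[r][c - 1] + H)
            if r > 0 and c > 0:
                cost[r][c] = min(cost[r][c], cost[r - 1][c - 1] + D)
    return cost[dh][dw]
```

The three numbered sub-cases collapse into the single expression using min(dh, dw), since a path with k diagonal edges costs H*dw + V*dh + k*(D - H - V), which is linear in k.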
Time and Space Analysis The O vector for each block is computed using SMAWK. Vector I
for block R can be easily obtained from the O vectors for the block above R and the block to its
left, in time linear with the sides of R. The "rectangle" problem can be solved similarly to Section
3.2. Therefore, any value OUT[i,j] = I[i] + DIST[i,j] can be computed in constant time.
Since the work and space for each block is linear with the size of its I/O borders, the total time and
space complexity is linear with the total size of the borders of the blocks, which is O(m'n + n'm).
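The bound on the total border size follows from a short calculation. Write $x_a$ for the length of the a-th run of X and $y_b$ for the length of the b-th run of Y (symbols introduced here only for illustration), so that the $x_a$ sum to m and the $y_b$ sum to n. The block for this pair of runs has borders of size proportional to $x_a + y_b$, and summing over all m'n' blocks gives:

```latex
\sum_{a=1}^{m'} \sum_{b=1}^{n'} (x_a + y_b)
  = n' \sum_{a=1}^{m'} x_a + m' \sum_{b=1}^{n'} y_b
  = n' m + m' n .
```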
The algorithms presented in this paper are perhaps close to optimal in time complexity. However,
an important concern is the space complexity of the algorithms. If only the similarity score value
is required, the classical, quadratic time sequence alignment algorithm can easily be implemented
to run in linear space, by keeping only two rows of the dynamic programming table alive at each
step. If the recovery of either global or local optimal alignment traces is required, quadratic-time
and linear-space algorithms can be obtained by applying Hirschberg's refinement to the classical
sequence alignment algorithms [10, 25, 26]. We pose as an open problem the challenge of further
reducing the space requirement of the algorithms described in this paper, without impairing their
sub-quadratic time complexity.
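The two-row technique mentioned above for the classical algorithm can be sketched as follows. This illustrates only the classical quadratic-time method, not the sub-quadratic algorithms of this paper; the function name and the score and gap parameters are assumptions made for the example.

```python
def global_similarity_two_rows(a, b, score, gap):
    """Global alignment score of a and b, keeping two rows of the DP table."""
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            curr[j] = max(prev[j - 1] + score(a[i - 1], b[j - 1]),  # diagonal edge
                          prev[j] + gap,                            # vertical edge
                          curr[j - 1] + gap)                        # horizontal edge
        prev = curr  # the older row is no longer needed
    return prev[len(b)]
```

Only the score is returned; recovering the alignment itself in linear space requires the divide-and-conquer refinement cited above.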
Acknowledgement
We are grateful to Dan Gusfield for a helpful discussion.
References
[1] Arbell, O., G.M. Landau, and J. Mitchell, Edit distance of run-length encoded strings, accepted for
publication in Information Processing Letters.
[2] Aggarwal, A., M. Klawe, S. Moran, P. Shor, and R. Wilber, Geometric Applications of a Matrix-Searching
Algorithm, Algorithmica, 2, 195-208 (1987).
[3] Aggarwal, A., and J. Park, Notes on Searching in Multidimensional Monotone Arrays, Proc. 29th IEEE
Symp. on Foundations of Computer Science, 497-512 (1988).
[4] Amir, A., G. Benson, and M. Farach, Let sleeping files lie: Pattern matching in Z-compressed files, J. of
Comp. and Sys. Sciences, 52(2), 299-307 (1996).
[5] Apostolico, A., M. Atallah, L. Larmore, and S. McFaddin, Efficient parallel algorithms for string editing
problems, SIAM J. Comput., 19, 968-998 (1990).
[6] Apostolico, A., G.M. Landau, and S. Skiena, Matching for Run Length Encoded Strings, Journal of
Complexity, 15, 1, 4-16 (1999).
[7] Bell, T.C., J.C. Cleary, and I.H. Witten, Text Compression, Prentice Hall, (1990).
[8] Benson, G., A space efficient algorithm for finding the best nonoverlapping alignment score, Theoretical
Computer Science, 145, 357-369 (1995).
[9] Bunke, H., and J. Csirik, An improved algorithm for computing the edit distance of run length coded
strings, Information Processing Letters, 54, 93-96 (1995).
[10] Chao, K.M., R. Hardison, and W. Miller, Recent developments in linear-space alignment methods: a
mini survey, J. Comp. Biol., 1, 271-291 (1994).
[11] Crochemore, M., Transducers and Repetitions, Theoret. Comput. Sci., 45, 63-86 (1986).
[12] Crochemore, M., and W. Rytter, Text Algorithms, Oxford University Press, (1994).
[13] Elias, P., Universal Codeword Sets and Representation of Integers, IEEE Trans. Inform. Theory,
IT 21, 2, 194-203 (1975).
[14] Eppstein, D., Sequence Comparison with Mixed Convex and Concave Costs, Journal of Algorithms, 11,
85-101 (1990).
on Foundations of Computer Science, 488-296 (1988).
[16] Eppstein, D., Z. Galil, R. Giancarlo, and G.F. Italiano, Sparse Dynamic Programming I: Linear Cost
Functions, JACM, 39, 546-567 (1992).
[17] Eppstein, D., Z. Galil, R. Giancarlo, and G.F. Italiano, Sparse Dynamic Programming II: Convex and
Concave Cost Functions, JACM, 39, 568-599 (1992).
[18] Farach, M., and M. Thorup, String matching in Lempel-Ziv compressed strings, Algorithmica, 20,
388-404 (1998).
[19] Galil, Z., and R. Giancarlo, Speeding Up Dynamic Programming with Applications to Molecular Biology,
Theoretical Computer Science, 64, 107-118 (1989).
[20] Galil, Z., and K. Park, A linear-time algorithm for concave one-dimensional dynamic programming, Info.
Processing Letters, 33, 309-311 (1990).
[21] Gasieniec, L., M. Karpinski, W. Plandowski, and W. Rytter, Randomised efficient algorithms for compressed
strings: the finger-print approach, Proc. 7th Annual Symposium On Combinatorial Pattern Matching,
LNCS 1075, 39-49 (1996).
[22] Gasieniec, L., and W. Rytter, Almost optimal fully LZW compressed pattern matching, Data Compression
Conference, J. Storer, ed, (1999).
[23] Giancarlo, R., Dynamic Programming: Special Cases, Pattern Matching Algorithms, edited by
Apostolico, A. and Z. Galil, Oxford University Press, 201-232 (1997).
[24] Gusfield, D., Algorithms on Strings, Trees, and Sequences, Cambridge University Press, (1997).
[25] Hirschberg, D.S., A linear space algorithm for computing maximal common subsequences, Comm. Assoc.
Comput. Mach., 18(6), 341-343 (1975).
[26] Huang, X., and W. Miller, A time-efficient, linear space local similarity algorithm, Adv. Appl. Math.,
12, 337-357 (1991).
[27] Kannan, S.K., and E.W. Myers, An Algorithm For Locating Non-Overlapping Regions of Maximum
Alignment Score, SIAM J. Comput., 25(3), 648-662 (1996).
[28] Karkkainen, J., G. Navarro, and E. Ukkonen, Approximate String Matching over Ziv-Lempel Compressed
Text, Proc. 11th Annual Symposium On Combinatorial Pattern Matching, LNCS 1848, 195-209 (2000).
[29] Karkkainen, J., and E. Ukkonen, Lempel-Ziv parsing and sublinear-size index structures for string
matching, Proc. Third South American Workshop on String Processing (WSP'96), 141-155 (1996).
[30] Kida, T., M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa, Shift-And approach to pattern
matching in LZW compressed text, Proc. 10th Annual Symposium On Combinatorial Pattern Matching,
LNCS 1645, 1-13 (1999).
[31] Klawe, M., and D. Kleitman, An Almost Linear Algorithm for Generalized Matrix Searching, SIAM
Jour. Discrete Math., 3, 81-97 (1990).
[32] Landau, G.M., and M. Ziv-Ukelson, On the Shared Substring Alignment Problem, Proc. Symposium On
Discrete Algorithms, 804-814 (2000).
[33] Landau, G.M., and M. Ziv-Ukelson, On the Common Substring Alignment Problem, Journal of
Algorithms.
[34] Lempel, A., and J. Ziv, On the complexity of finite sequences, IEEE Transactions on Information
Theory, 22, 75-81 (1976).
[35] Levenshtein, V.I., Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Phys.
Dokl, 10, 707-710 (1966).
5, 422-432 (1984).
[37] Makinen, V., G. Navarro, and E. Ukkonen, Approximate Matching of Run-Length Compressed Strings,
Proc. 12th Annual Symposium On Combinatorial Pattern Matching, LNCS 1645, 1-13 (1999).
[38] Manber, U., A text compression scheme that allows fast searching directly in the compressed file, Proc.
5th Annual Symposium On Combinatorial Pattern Matching, LNCS 2089, 31-49 (2001).
[39] Masek, W.J., and M.S. Paterson, A faster algorithm for computing string edit distances, J. Comput.
Syst. Sci., 20, 18-31 (1980).
[40] Mitchell, J., A Geometric Shortest Path Problem, with Application to Computing a Longest Common
Subsequence in Run-Length Encoded Strings, Technical Report, Dept. of Applied Mathematics, SUNY
Stony Brook, 1997.
[41] Monge, G., Deblai et Remblai, Memoires de l'Academie des Sciences, Paris (1781).
[42] Navarro, G., T. Kida, M. Takeda, A. Shinohara, and S. Arikawa, Faster Approximate String Matching
Over Compressed Text, Proc. Data Compression Conference (DCC2001), IEEE Computer Society,
459-468 (2001).
[43] Navarro, G., and M. Raffinot, A general practical approach to pattern matching over Ziv-Lempel
compressed text, Proc. 10th Annual Symposium On Combinatorial Pattern Matching, LNCS 1645, 14-36
(1999).
[44] Navarro, G., and M. Raffinot, Boyer-Moore string matching over Ziv-Lempel compressed text, Proc.
11th Annual Symposium On Combinatorial Pattern Matching, LNCS 1848, 166-180 (2000).
[45] Sankoff, D., and J.B. Kruskal (editors), Time Warps, String Edits, and Macromolecules: the Theory
and Practice of Sequence Comparison, Addison-Wesley, Reading, MA, (1983).
[46] Schmidt, J.P., All Highest Scoring Paths In Weighted Grid Graphs and Their Application To Finding
All Approximate Repeats In Strings, SIAM J. Comput., 27(4), 972-992 (1998).
[47] Shibata, Y., T. Kida, S. Fukamachi, M. Takeda, A. Shinohara, T. Shinohara, and S. Arikawa, Speeding up
pattern matching by text compression, CIAC 2000, LNCS 1767, 306-315 (2000).
[48] Smith, T.F., and M.S. Waterman, Identification of common molecular subsequences, J. Molecular Biol.,
147, 195-197 (1981).
[49] Szpankowski, W., and P. Jacquet, Asymptotic Behavior of the Lempel-Ziv Parsing Scheme and Digital
Search Trees, Theoretical Computer Science, 144, 161-197 (1995).
[50] Welch, T.A., A Technique for High Performance Data Compression, IEEE Trans. on Computers, 17(6),
8-19 (1984).
[51] Ziv, J., and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions
on Information Theory, IT-23(3), 337-343 (1977).
[52] Ziv, J., and A. Lempel, Compression of individual sequences via variable rate coding, IEEE Trans.
Inform. Th., 24, 530-536 (1978).