A Subquadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices

HAL Id: hal-00619573

https://hal-upec-upem.archives-ouvertes.fr/hal-00619573

Submitted on 20 Mar 2013

Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson

To cite this version:

Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson. A Subquadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices. SIAM Journal on Computing, Society for Industrial and Applied Mathematics, 2003, 32 (6), pp.1654-1673. ⟨10.1137/S0097539702402007⟩. ⟨hal-00619573⟩

Maxime Crochemore*
Institut Gaspard-Monge, University of Marne-la-Vallée

Gad M. Landau†
Haifa University and Polytechnic University

Michal Ziv-Ukelson‡
Haifa University and IBM T.J.W Research Center

Abstract

The classical algorithm for computing the similarity between two sequences [45, 48] uses a dynamic programming matrix, and compares two strings of size $n$ in $O(n^2)$ time. We address the challenge of computing the similarity of two strings in sub-quadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global similarity computations.

The speed-up is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an $O(n^2/\log n)$ algorithm for an input of constant alphabet size. For most texts, the time complexity is actually $O(hn^2/\log n)$ where $h \le 1$ is the entropy of the text.

We also present an algorithm for comparing two run-length encoded strings of length $m$ and $n$, compressed into $m'$ and $n'$ runs respectively, in $O(m'n + n'm)$ complexity. This result extends to all distance or similarity scoring schemes which use an additive gap penalty.

Keywords: alignment, dynamic programming, text compression, run length.

1 Introduction

The rapid progress in large-scale DNA sequencing opens a new level of computational challenges involved in storing, organizing and analyzing the wealth of biological information. One of the most interesting new fields that the availability of the complete genomes has created is that of genome comparison (the genome is all of the DNA sequence passed from one generation to the next). Comparing complete genomes can give deep insights about the relationship between organisms, as well as shedding light on the function of specific genes in each single genome. The challenge of

* Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France, http://www-igm.univ-mlv.fr/~mac/.

† Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (972-4) 824-0103, FAX: (972-4) 824-9331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 11201-3840; email: landau@poly.edu; partially supported by NSF grant CCR-0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation grants 173/98 and 282/01, by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award.

‡ Department of Computer Science, Haifa University, Haifa 31905, Israel; On Education Leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.ac.il; partially supported by the Israel Science Foundation grants 173/98 and 282/01, and by the FIRST Foundation of the Israel Academy of Science and Humanities.

tools.

One of the most common problems in biological comparative analysis is that of aligning two long bio-sequences in order to measure their similarity. The alignment is classically based on the transformation of one sequence into the other via operations of substitutions, insertions, and deletions (indels). Their costs are given by a scoring matrix.

Definition 1 (Gusfield [24]) Global Alignment Problem. Given a pairwise scoring matrix $\delta$ over the alphabet $\Sigma$, the similarity of two strings $A$ and $B$ is defined as the value maxV of the alignment of $A$ and $B$ that maximizes the total alignment value.

The score value maxV is called the optimal global alignment value of $A$ and $B$.

A description of a maxV-scoring transformation of $A$ into $B$ is called a global alignment trace.

In many applications, two strings may not be highly similar in their entirety but may contain regions that are highly similar. The task is to find and extract a pair of regions, one from each of the two given strings, that exhibit high similarity. This is called the local alignment or local similarity and is defined formally below.

Definition 2 (Gusfield [24]) Local alignment problem. Given two strings $A$ and $B$, find substrings $\alpha$ and $\beta$ of $A$ and $B$, respectively, whose similarity (optimal global alignment value) is maximum over all pairs of substrings from $A$ and $B$.

The score value maxL of the most similar pair of substrings $\alpha$ and $\beta$ is called the optimal local alignment value.

The description of a maxL-scoring transformation of substring $\alpha$ into substring $\beta$ is called a local alignment trace.

Both global and local similarity problems can be solved in $O(n^2)$ time by dynamic programming [24], [35], [48]. After the optimal similarity scores have been computed, both global alignment and local alignment traces can be reported in time linear with their size [10, 25, 26].

1.1 Results

In this paper data compression techniques are employed to speed up the alignment of two strings. The compression mechanism enables the algorithm to adapt to the data and to utilize its repetitions. The periodic nature of the sequence is quantified via its entropy, denoted by the real number $h$, $0 \le h \le 1$. Entropy is a measure of how "compressible" a sequence is (see [7], [12]), and is small when there is a lot of order (i.e., the sequence is repetitive and therefore more compressible) and large when there is a lot of disorder (see Section 2.2).

Our results include the following algorithms.

1.1.1 Global Alignment

• We present an $O(n^2/\log n)$ algorithm for computing the optimal global alignment value of two strings over a constant alphabet (see Section 3). The algorithm is even faster when the sequence is compressible. In fact, for most texts, the complexity of our algorithm is actually $O(hn^2/\log n)$.

• After the optimal score is computed, a single alignment trace corresponding to the optimal score can be recovered in time complexity that is linear with the size of the trace (see Section 4).

• For global alignment over "discrete" scoring matrices, we explain how the space complexity can be reduced to $O(h^2 n^2/(\log n)^2)$, without impairing the $O(hn^2/\log n)$ time complexity (see Section 5).

1.1.2 Local Alignment

• We describe a sub-quadratic, $O(hn^2/\log n)$ algorithm for the computation of the optimal local alignment value of two strings over a constant alphabet (see Section 6.1).

• Given an index on $A$ where substring $\alpha$ ends and an index on $B$ where substring $\beta$ ends, an optimal local alignment trace can be reported in time linear with its size (see Section 6.2).

1.1.3 Comparing Two Run-Length Encoded Strings

• We give an algorithm for comparing two run-length encoded strings of length $m$ and $n$, compressed to $m'$ and $n'$ runs respectively, using any distance or similarity scoring scheme with additive gaps, in $O(m'n + n'm)$ complexity (see Section 7).

The algorithms described in this paper are the first to approach fully LZ compressed (both source and target strings are compressed) string alignment. The methods given in this paper can also be used by applications where both input strings are stored or transmitted in the form of an LZ78 or LZW compressed sequence, thus providing an efficient solution to the problem of how to compare two strings without having to decompress them first.

1.2 Previous Results

The only previously known sub-quadratic global alignment string comparison algorithm, by Masek and Paterson [39], is based on the Four Russians paradigm. The "Four Russians" algorithm divides the dynamic programming table into uniform sized ($\log n$ by $\log n$) blocks, and uses table lookup to obtain an $O(n^2/\log n)$ time complexity string comparison algorithm, based on two assumptions. One is that the sequence elements come from a constant alphabet. The other, which they denote the "discreteness" condition, is that the weights (of substitutions and indels) are all rational numbers.

Our algorithms present a new approach and are better than the above algorithm in two aspects. First, the algorithms presented here are faster for compressible sequences. For such sequences, the complexity of our algorithms is $O(hn^2/\log n)$, where $h \le 1$ is the entropy of the sequence.

[Figure 1 image not reproduced. Scoring matrix $\delta$ shown in the figure, over the symbols -, a, c, g, t: +1 for each character matched with itself, -1 for every mismatch and for every indel.]

Figure 1: The alignment graph for comparing strings A = "ctacgaga" and B = "aacgacga". The scoring scheme matrix $\delta$ is shown in the lower left corner of the figure. The highest scoring global alignment paths originate in vertex (0,0), end in vertex (8,8) and have a total weight of 3. The highest scoring local alignment path has a total weight of 5 and corresponds to the alignment of substrings a = "acgaga" and b = "acgacga". A sub-graph G corresponding to the block for comparing substrings a = "ag" and b = "acg" is shown in the lower-right corner of the figure. Also specified are the values I for the entries of the input border for G (in white-shaded rectangles), and the values O of the output border of G (in grey-shaded rectangles), as set during a local alignment computation.

Second, our algorithms are general enough to support scoring schemes with real number weights. For many scoring schemes, the rational number weights supported by Masek and Paterson's algorithm do not suffice. For example, the entries of PAM similarity matrices, as well as BLOSUM evolutionary distance matrices, are defined to be real numbers, computed as log-odds ratios, and therefore could be irrational.

The paper by Masek and Paterson concludes with the following statement: "The most important problem remaining is finding a better algorithm for the finite (in our terms constant) alphabet case without the discreteness condition". Here, more than twenty years later, this important open question will finally be answered!

These advantages are based on the following facts. First, our algorithm does not require any pre-computation of lookup-tables, and therefore can afford more flexible weight values. Also, instead of dividing the dynamic programming matrix into uniform-sized blocks as did Masek and Paterson, we employ a variable-sized block partition, as induced by Lempel-Ziv factorization of both source

is then re-cycled and used for computing the relevant information for each block in time which is linear with the length of its sides. In this sense, the approach described in this paper can be viewed as another example of speeding up dynamic programming by keeping and computing only a relevant subset of important values, as demonstrated in [16], [17], [33] and [46]. A similar unbalanced strategy has been successfully used for square detection in strings [11] to speed up the original algorithm based on a divide-and-conquer approach [36].

2 Preliminaries

2.1 The Alignment Graph

The dynamic programming solution to the string comparison computation problem can be represented in terms of a weighted alignment graph [24] (see Figure 1).

The weight of a given edge can be specified directly on the grid graph, or as is frequently the case in biological applications, is given by a penalty matrix, denoted $\delta$, which specifies the substitution cost for each pair of characters and the deletion cost for each character from the alphabet.

The two widely used classes of scoring schemes are distance scoring, in which the objective is to minimize the total alignment score, and similarity scoring, in which the objective is to maximize the total alignment score. Within these classes, scoring schemes are further characterized by the treatment of gap costs. A gap is the result of the deletion of one or more consecutive characters in one of the sequences. Additive gap costs assign a constant weight to each of the consecutive characters. For other gap functions which have been found useful for biological sequences, see [24]. The solutions in this paper assume a scoring scheme with additive gap costs.

Global Alignment via Dynamic Programming. The classical dynamic programming algorithm for the global comparison of two strings will set the value at each vertex $(i,j)$ of the alignment graph, row by row in a left to right order, to the score between the first $i$ characters of $A$ and the first $j$ characters of $B$, using the following recurrence, where $\epsilon$ denotes the gap symbol:

$$V(i,j) = \max \begin{cases} V(i,j-1) + \delta(\epsilon, B_j), \\ V(i-1,j) + \delta(A_i, \epsilon), \\ V(i-1,j-1) + \delta(A_i, B_j). \end{cases}$$

Computing and setting the values of all vertices in the alignment graph, using the above recurrence, takes $O(n^2)$ time and space. After the values at each vertex of the alignment graph have been computed and set, the optimal global alignment value maxV is found at vertex $(n,n)$ of the graph.

If each vertex in the alignment graph stores the operation (insertion, deletion, substitution) selected when its value was set, then a global alignment trace, corresponding to an optimal path in the alignment graph, can be recovered in time linear with its size, starting from vertex $(n,n)$ which contains the maximal score, and tracing the edges back up to vertex $(0,0)$ in the graph.
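The recurrence and the trace-back described above can be sketched as follows. This is a minimal illustration, not the paper's sub-quadratic algorithm: the function names, the use of "-" for the gap symbol, and the scoring values (match +1, mismatch -1, indel -1, as read from the matrix in Figure 1) are assumptions made for the example.

```python
def delta(a, b):
    """Assumed scoring matrix: match +1, mismatch -1, indel -1 ('-' is the gap)."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def global_alignment(A, B):
    """Classical quadratic DP: returns (maxV, trace of operations)."""
    n, m = len(A), len(B)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    op = [[""] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):          # first row: insertions only
        V[0][j] = V[0][j - 1] + delta("-", B[j - 1])
        op[0][j] = "ins"
    for i in range(1, n + 1):          # first column: deletions only
        V[i][0] = V[i - 1][0] + delta(A[i - 1], "-")
        op[i][0] = "del"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Store the value together with the operation that produced it.
            V[i][j], op[i][j] = max(
                (V[i][j - 1] + delta("-", B[j - 1]), "ins"),
                (V[i - 1][j] + delta(A[i - 1], "-"), "del"),
                (V[i - 1][j - 1] + delta(A[i - 1], B[j - 1]), "sub"),
            )
    # Trace the stored operations back from (n, m) to (0, 0).
    trace, i, j = [], n, m
    while i > 0 or j > 0:
        o = op[i][j]
        trace.append(o)
        if o != "ins":
            i -= 1
        if o != "del":
            j -= 1
    trace.reverse()
    return V[n][m], trace


score, trace = global_alignment("ctacgaga", "aacgacga")
```

For the strings of Figure 1 the returned score agrees with the total weight of 3 stated in the figure caption.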

Local Alignment via Dynamic Programming. Smith and Waterman [48], [24] showed that essentially the same $O(|A||B|)$ dynamic programming solution can be used for computing local similarity, provided that the score of the alignment of two empty strings is defined as 0, and only pairs whose alignment scores are above 0 are of interest. The Smith-Waterman algorithm for

option, and thus restricts the scores to non-negative values:

$$L(i,j) = \max \begin{cases} 0, \\ L(i,j-1) + \delta(\epsilon, B_j), \\ L(i-1,j) + \delta(A_i, \epsilon), \\ L(i-1,j-1) + \delta(A_i, B_j). \end{cases}$$

The method to compute the optimal local alignment value maxL is to compute all alignment graph vertex values $L(i,j)$ in $O(n^2)$ time and space, and then find the largest value at any vertex on the table, say at vertex $(i_{end}, j_{end})$.

Given the vertex $(i_{end}, j_{end})$ whose score is maxL, the corresponding substrings $\alpha$ and $\beta$ giving the optimal local alignment of $A$ and $B$ are obtained in time linear with their size, by using the stored operations (insertion, deletion, substitution) to trace back the edges from vertex $(i_{end}, j_{end})$ until a vertex $(i_{start}, j_{start})$ is reached that has value zero. Then the optimal local alignment substrings for vertex $(i_{end}, j_{end})$ are $\alpha = A[i_{start} \ldots i_{end}]$ and $\beta = B[j_{start} \ldots j_{end}]$ [24].
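The local variant can be sketched in the same style. The code below is an illustrative quadratic implementation under the same assumed scoring (match +1, mismatch -1, indel -1); the names and the tie-breaking rule (prefer the zero option on ties) are choices made for this example, and vertex indices are turned into Python slices when the substrings are extracted.

```python
def delta(a, b):
    """Assumed scoring matrix: match +1, mismatch -1, indel -1 ('-' is the gap)."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def local_alignment(A, B):
    """Quadratic Smith-Waterman style DP: returns (maxL, alpha, beta)."""
    n, m = len(A), len(B)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    op = [["stop"] * (m + 1) for _ in range(n + 1)]
    best, end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [
                (0, "stop"),  # listed first so that ties prefer the zero option
                (L[i][j - 1] + delta("-", B[j - 1]), "ins"),
                (L[i - 1][j] + delta(A[i - 1], "-"), "del"),
                (L[i - 1][j - 1] + delta(A[i - 1], B[j - 1]), "sub"),
            ]
            L[i][j], op[i][j] = max(cands, key=lambda t: t[0])
            if L[i][j] > best:
                best, end = L[i][j], (i, j)
    # Trace back from the best vertex until a zero-valued vertex is reached.
    i, j = end
    while L[i][j] > 0:
        o = op[i][j]
        if o != "ins":
            i -= 1
        if o != "del":
            j -= 1
    return best, A[i:end[0]], B[j:end[1]]


maxL, alpha, beta = local_alignment("ctacgaga", "aacgacga")
```

On the strings of Figure 1 this yields the weight 5 and the substring pair "acgaga" / "acgacga" described in the figure caption.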

2.2 A Block Partition of the Alignment Graph based on LZ78 Factorization

The traditional aim of text compression is the efficient use of resources such as storage and bandwidth. Here, we will compress the sequences in order to speed up the alignment process. Note that this approach, denoted "acceleration by text-compression", has been recently applied to a related problem, that of exact string matching [29], [38], [47].

It should also be mentioned that another related problem, that of exact string matching in compressed text without decoding it, which is often referred to as "compressed pattern matching", has been studied extensively [4], [18], [43]. Along these lines, string search in compressed text was developed for the compression paradigm of LZ78 [52], and its subsequent variant LZW [50], as described in [30], [44]. A more challenging problem is that of "fully compressed" pattern matching when both the pattern and text strings are compressed [21], [22].

For the LZ78-LZW paradigm, compressed matching has been extended and generalized to that of approximate pattern matching (finding all occurrences of a short sequence within a long one allowing up to $k$ changes) in [28], [42].

The LZ compression methods are based on the idea of self reference: while the text file is scanned, substrings or phrases are identified and stored in a dictionary, and whenever, later in the process, a phrase or concatenation of phrases is encountered again, this is compactly encoded by suitable pointers [34], [51], [52].

Of the several existing versions of the method, we will use the ones which are denoted LZ78 family [50], [52]. The main feature which distinguishes LZ78 factorization from previous LZ compression algorithms is in the choice of codewords. Instead of allowing pointers to reference any string that has appeared previously, the text seen so far is parsed into phrases, where each phrase is the longest matching phrase seen previously plus one character. For example, the string S = "aacgacg" is divided into four phrases: a, ac, g, acg. Each phrase is encoded as an index to its prefix, plus the extra character. The new phrase is then added to the list of phrases that may be referenced.
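The parsing rule just described can be illustrated with a short sketch. The function name and the dictionary-based trie are assumptions of this example; a trailing fragment that merely repeats an earlier phrase (because the input ended) is returned as a final, non-distinct phrase.

```python
def lz78_parse(s):
    """Split s into LZ78 phrases: longest previously seen phrase plus one character."""
    trie = {}        # (parent phrase id, character) -> phrase id; id 0 is the empty phrase
    phrases = []
    node, start = 0, 0
    for pos, ch in enumerate(s):
        key = (node, ch)
        if key in trie:
            node = trie[key]                 # keep extending the current match
        else:
            trie[key] = len(phrases) + 1     # new phrase = matched prefix + ch
            phrases.append(s[start:pos + 1])
            node, start = 0, pos + 1
    if start < len(s):
        phrases.append(s[start:])            # leftover that repeats an earlier phrase
    return phrases


print(lz78_parse("aacgacg"))   # ['a', 'ac', 'g', 'acg']
```

Applied to the two strings used in the figures, the same routine gives c, t, a, cg, ag, a for "ctacgaga" and a, ac, g, acg, a for "aacgacga".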

Since each phrase is distinct from others, the following upper bound applies to the possible number of phrases obtained by LZ78 factorization.

Theorem 2.2.1 (Lempel and Ziv 1976 [34]) Given a sequence $S$ of size $n$ over a constant alphabet. The maximal number of distinct phrases in $S$ is $O(n/\log n)$.

[Figure 2 image not reproduced. Panels: Trie for A; Trie for B; LZ78-Partitioned Alignment Graph; Graph G for Block (5,4), with left prefix (5,2), diagonal prefix (3,2) and top prefix (3,4).]

Figure 2: The block partition of the alignment graph, and the tries corresponding to LZ78 parsing of strings A = "ctacgaga" and B = "aacgacga". Note that for the block G in this example, the phrase from A is "ag", the phrase from B is "acg", $\ell_r = 2$, $\ell_c = 3$, $i = 5$ and $j = 4$. (The new cell of G, which does not appear in any of the prefix blocks, is the rightmost cell at the bottom row of G, and can be distinguished by its white color.) This figure continues Figure 1.

Even though the upper bound above applies to any possible sequence over a constant alphabet, it has been shown that in many cases we can do better than that.

Intuitively, the LZ78 algorithm compresses the sequence because it is able to discover some repeated patterns. Therefore, in order to compute a tighter upper bound on the number of phrases obtained by LZ78 factorization for "compressible" sequences, the repetitive nature of the sequence needs to be quantified. One of the fundamental ideas in information theory is that of entropy, denoted by the real number $h$, $0 \le h \le 1$, which measures the amount of disorder or randomness, or inversely, the amount of order or redundancy in a sequence. Entropy is small when there is a lot of order (i.e., the sequence is repetitive) and large when there is a lot of disorder. The entropy of a sequence should ideally reflect the ratio between the size of the sequence after it has been compressed, and the length of the uncompressed sequence.

The number of distinct phrases obtained by LZ78 factorization has been shown to be $O(hn/\log n)$ for most texts [7], [12], [34], [52]. Note that for any text over a constant alphabet, the upper bound above still applies by setting $h$ to 1.

3 Computing the Optimal Global Similarity Value

3.1 Definitions and Basic Observations

The alignment graph will be partitioned as follows. Strings $A$ and $B$ will be parsed using LZ78 factorization. This induces a partition of the alignment graph for comparing $A$ with $B$ into variable-

an LZ phrase of $B$.

Let $xa$ denote a phrase in $A$ obtained by extending a previous phrase $x$ of $A$ with character $a$, and $yb$ denote a phrase in $B$, obtained by extending a previous phrase $y$ of $B$ with character $b$.

From now on we will focus on the computations necessary for a single block of the alignment graph. Consider the block $G$ which corresponds to the comparison of $xa$ and $yb$. We define input border $I$ as the left and top borders of $G$, and output border $O$ as the bottom and right borders of $G$. (The node entries on the input border are numbered in a clockwise direction, and the node entries on the output border are numbered in a counter-clockwise direction.)

Rather than filling in the values of each vertex in $G$, as does the classical dynamic programming algorithm, the only values computed for each block will be those on its I/O borders (see Figures 1 and 5A). Intuitively, this is the reason behind the efficiency gain.

Let $\ell_r$ denote the number of rows in $G$, $\ell_r = |xa|$. Let $\ell_c$ denote the number of columns in $G$, $\ell_c = |yb|$. Let $t = \ell_r + \ell_c$. Clearly, $|I| = |O| = t$.

We define the following three prefix blocks of $G$.

1. The left prefix of $G$ denotes the block comparing phrase $xa$ of $A$ and phrase $y$ of $B$.

2. The diagonal prefix of $G$ denotes the block comparing phrase $x$ of $A$ and phrase $y$ of $B$.

3. The top prefix of $G$ denotes the block comparing phrase $x$ of $A$ and phrase $yb$ of $B$.

Observation 1 When traversing the blocks of an LZ78 parsed alignment graph in a left-to-right, top-to-bottom order, the blocks for the left prefix, diagonal prefix and top prefix of $G$ are encountered prior to block $G$.

Note that the graph for the left prefix of $G$ is identical to the subgraph of $G$ containing all columns but the last one. More specifically, both the structure and the weights of edges of these two graphs are identical, but the weights to be assigned to vertices during the similarity computation may vary according to the input border values. Similarly, for the top prefix and diagonal prefix graphs. The only new cell in $G$, which does not appear in any of its prefix block graphs, is the cell for comparing $a$ and $b$.

3.2 I/O Propagation Across G

The work for each block consists of two stages (a similar approach is shown in [8, 27, 32, 33]).

1. encoding: Study the structure of $G$ and represent it in an efficient way.

2. propagation: Given $I$ and the encoding of $G$, constructed in the previous stage, compute $O$ for $G$.

The structure of $G$ is encoded by computing weights of optimal paths connecting each entry of its input border with each entry of its output border. The following DIST matrix is used (see Figure 3).

Definition 3 $DIST[i,j]$ stores the weight of the optimal path from entry $i$ of the input border of $G$ to entry $j$ of its output border.

[Figure 3 image not reproduced. It shows the input values $I_0, \ldots, I_5$, the DIST matrix, the OUT matrix, the output values $O_0, \ldots, O_5$ and the column numbers 0 to 5.]

Figure 3: The DIST matrix which corresponds to the subsequences "ag", "acg", the OUT matrix obtained by adding the values of I to the rows of DIST, and the O containing the row maxima of OUT. This figure continues Figures 1 and 2.

DIST matrices have also been used in [5], [8], [27], [33] and [46].

Given input row $I$ and the DIST for $G$, the weight of output row vertex $O_j$ can be computed as the maximum among the sums $I_r + DIST[r,j]$, if there is indeed a path connecting input border entry $r$ with output border entry $j$.

Vertex $O_j$ is the maximum of column $j$ of the following OUT matrix, which merges the information from input row $I$ and DIST (see Figure 3).

Definition 4 $OUT[i,j] = I_i + DIST[i,j]$.
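To make Definitions 3 and 4 concrete, the sketch below builds DIST for one block by brute force and then propagates an input border to the output border by taking column maxima of OUT. It is deliberately naive (it fills every vertex, which the paper's algorithm avoids) and rests on assumptions: the scoring of Figure 1, "-" as gap symbol, and a border numbering in which both borders start at the bottom-left corner of the block, clockwise for the input and counter-clockwise for the output. Entries with no connecting path are represented by negative infinity.

```python
NEG_INF = float("-inf")


def delta(a, b):
    """Assumed scoring matrix: match +1, mismatch -1, indel -1 ('-' is the gap)."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def input_border(R, C):
    # Clockwise from the bottom-left corner: up the left side, then along the top.
    return [(R - i, 0) for i in range(R + 1)] + [(0, c) for c in range(1, C + 1)]


def output_border(R, C):
    # Counter-clockwise from the bottom-left corner: along the bottom, then up the right side.
    return [(R, c) for c in range(C + 1)] + [(R - k, C) for k in range(1, R + 1)]


def best_paths_from(src, x, y):
    """Weights of the best paths from vertex src to every vertex it can reach."""
    R, C = len(x), len(y)
    r0, c0 = src
    W = [[NEG_INF] * (C + 1) for _ in range(R + 1)]
    W[r0][c0] = 0
    for r in range(r0, R + 1):
        for c in range(c0, C + 1):
            if (r, c) == src:
                continue
            cands = []
            if c > c0:
                cands.append(W[r][c - 1] + delta("-", y[c - 1]))
            if r > r0:
                cands.append(W[r - 1][c] + delta(x[r - 1], "-"))
            if r > r0 and c > c0:
                cands.append(W[r - 1][c - 1] + delta(x[r - 1], y[c - 1]))
            W[r][c] = max(cands)
    return W


def dist_matrix(x, y):
    """DIST[i][j]: best path weight from input entry i to output entry j."""
    R, C = len(x), len(y)
    outs = output_border(R, C)
    return [[best_paths_from(src, x, y)[r][c] for (r, c) in outs]
            for src in input_border(R, C)]


def propagate(I, DIST):
    """O[j] = max over i of OUT[i][j] = I[i] + DIST[i][j]."""
    return [max(I[i] + DIST[i][j] for i in range(len(I)))
            for j in range(len(DIST[0]))]


# Treat the whole graph of Figure 1 as one block, with global-alignment input values.
x, y = "ctacgaga", "aacgacga"
I = [-r for (r, _) in input_border(8, 8)[:9]] + [-c for c in range(1, 9)]
O = propagate(I, dist_matrix(x, y))
```

With the whole alignment graph treated as a single block, the output entry for vertex (8,8), `O[8]`, reproduces the global alignment value 3 of Figure 1.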

Aggarwal and Park [3] and Schmidt [46] observed that DIST matrices are Monge arrays [41].

Definition 5 A matrix $M[0 \ldots m, 0 \ldots n]$ is Monge if either condition 1 or 2 below holds for all $a, b = 0 \ldots m$; $c, d = 0 \ldots n$:

1. convex condition: $M[a,c] + M[b,d] \le M[b,c] + M[a,d]$ for all $a < b$ and $c < d$.

2. concave condition: $M[a,c] + M[b,d] \ge M[b,c] + M[a,d]$ for all $a < b$ and $c < d$.

An important property of Monge arrays is that of being totally monotone.

Definition 6 A matrix $M[0 \ldots m, 0 \ldots n]$ is totally monotone if either condition 1 or 2 below holds for all $a, b = 0 \ldots m$; $c, d = 0 \ldots n$:

1. convex condition: $M[a,c] \ge M[b,c] \implies M[a,d] \ge M[b,d]$ for all $a < b$ and $c < d$.

2. concave condition: $M[a,c] \le M[b,c] \implies M[a,d] \le M[b,d]$ for all $a < b$ and $c < d$.

Note that the Monge property implies total monotonicity, but the converse is not true. Therefore, both DIST and OUT are totally monotone by the concave condition.
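The two concave conditions can be checked mechanically on small matrices. The sketch below is a brute-force checker over all index quadruples, written only to illustrate Definitions 5 and 6 for matrices with finite entries; the function names and both example matrices are made up for this illustration.

```python
from itertools import combinations


def is_concave_monge(M):
    """M[a][c] + M[b][d] >= M[b][c] + M[a][d] for all a < b and c < d."""
    rows, cols = range(len(M)), range(len(M[0]))
    return all(M[a][c] + M[b][d] >= M[b][c] + M[a][d]
               for a, b in combinations(rows, 2)
               for c, d in combinations(cols, 2))


def is_concave_totally_monotone(M):
    """M[a][c] <= M[b][c] implies M[a][d] <= M[b][d] for all a < b and c < d."""
    rows, cols = range(len(M)), range(len(M[0]))
    return all(M[a][c] > M[b][c] or M[a][d] <= M[b][d]
               for a, b in combinations(rows, 2)
               for c, d in combinations(cols, 2))


# i * j satisfies the concave Monge condition: the difference is (b - a)(d - c) >= 0.
product = [[i * j for j in range(4)] for i in range(4)]

# Totally monotone but not Monge, showing that the converse implication fails.
tm_only = [[0, 10],
           [2, 11]]

print(is_concave_monge(product), is_concave_totally_monotone(product))   # True True
print(is_concave_monge(tm_only), is_concave_totally_monotone(tm_only))   # False True
```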

Aggarwal et al. [2] gave a recursive algorithm, nicknamed SMAWK in the literature, which can compute in $O(n)$ time all row and column maxima of an $n \times n$ totally monotone matrix, by querying only $O(n)$ elements of the array. Hence, one can use SMAWK to compute the output row $O$ by querying only $O(n)$ elements of OUT. Clearly, if both the full DIST and all entries of $I$ are available, then computing an element of OUT is $O(1)$ work.

For various solutions to related problems, which also utilize Monge and Total Monotonicity properties, we refer the interested reader to [14], [15], [19], [20], [31] and [33]. In order to efficiently utilize these properties here, we need to address the following two problems.

1. How to efficiently compute DIST and represent it in a format which allows direct access to its entries. This will be done in Section 3.4.

2. SMAWK is intended for a full, rectangular matrix. However, both DIST and its corresponding OUT are not rectangular. Since paths in an alignment graph can only assume a left-to-right, top-to-bottom direction, connections between some input border vertices and some output border vertices are impossible. Therefore, the matrices are missing both a lower left triangle and upper right triangle (see Figure 3). The question is addressed in Section 3.3.

3.3 Addressing the Rectangle Problem

The undefined entries of OUT can be complemented in constant time each, as follows.

1 The missing upper right triangle entries can be completed by setting the value of any entry $OUT[i,j]$ in this triangle to $-\infty$.

2 Let $k$ denote the maximal absolute value of a score in $\delta$. The missing lower left triangle entries can be completed by setting the value of any $OUT[i,j]$ in this triangle to $-(n+i+1)k$.

Lemma 3.3.1 Complementing the undefined entries as described above preserves the concave total monotonicity condition of OUT, and does not introduce new row-maxima.

Proof:

1 Upper Right Triangle: All similarity scores in the alignment graph are finite. Therefore, no new column maxima are introduced. Suppose $OUT[a,c] \le OUT[b,c]$, $a < b$, and $OUT[a,c]$ has been set to $-\infty$. Due to the shape of the redefined upper-right triangle, once a $-\infty$ value in row $a$ is encountered, all future values in row $a$ are also $-\infty$. The future values of row $b$ could either be finite or $-\infty$. Therefore, $OUT[a,d] \le OUT[b,d]$ for all $d > c$.

$nk$. Since $i$ is always greater than or equal to zero, the complemented values in the lower left triangle are upper-bounded by $-(n+1)k$ and no new column maxima are introduced. Also, for any complemented entry $OUT[b,c]$ in the lower left triangle, $OUT[b,c] < OUT[a,c]$ for all $a < b$, and therefore the concave total monotonicity condition holds.
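The completion rule of this section can be sketched as below. The representation is an assumption of the example: undefined entries are stored as `None`, the defined entries of each row are taken to form one contiguous run, undefined entries to the left of that run are treated as lower-left-triangle entries and those to its right as upper-right-triangle entries. The small matrix is invented purely for illustration.

```python
def complete_out(OUT, n, k):
    """Fill undefined (None) entries: -(n+i+1)*k to the left of the defined run
    in row i (lower left triangle), -infinity to its right (upper right triangle)."""
    filled = []
    for i, row in enumerate(OUT):
        first_defined = next(j for j, v in enumerate(row) if v is not None)
        new_row = []
        for j, v in enumerate(row):
            if v is not None:
                new_row.append(v)
            elif j < first_defined:
                new_row.append(-(n + i + 1) * k)
            else:
                new_row.append(float("-inf"))
        filled.append(new_row)
    return filled


def column_maxima(M):
    return [max(col) for col in zip(*M)]


# Hypothetical 3 x 3 OUT with one missing corner in each triangle (n = 4, k = 1).
OUT = [[1, 0, None],
       [2, 1, 0],
       [None, 3, 1]]
full = complete_out(OUT, n=4, k=1)
print(column_maxima(full))   # [2, 3, 1]
```

The column maxima of the completed matrix coincide with the maxima taken over the originally defined entries only, which is the behaviour the lemma asks for.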

3.4 Incremental Update of the new DIST Information for G

In this section we show how to efficiently compute the new DIST information for $G$, using the DIST representations previously computed for its prefix blocks, plus the information of its new cell.

When processing a new block $G$, we compute the scores of $t$ new optimal paths, leading from the input border to the new vertex $(\ell_r, \ell_c)$ in the lowest, rightmost corner of $G$. These values correspond to column $\ell_c$ of the DIST matrix for $G$, and can be computed as follows.

Entry $[i]$ in column $\ell_c$ of the DIST for $G$ contains the weight of the optimal path from entry $i$ in the input border of $G$ to vertex $(\ell_r, \ell_c)$. This path must go through one of the three vertices $(\ell_r - 1, \ell_c)$, $(\ell_r - 1, \ell_c - 1)$ or $(\ell_r, \ell_c - 1)$. Therefore, the weight of the optimal path from entry $i$ in the input border of $G$ to $(\ell_r, \ell_c)$ is equal to the maximum among the following three values:

1 Entry $[i]$ of column $\ell_c - 1$ of the DIST for the left prefix of $G$, plus the weight of the horizontal edge leading into $(\ell_r, \ell_c)$.

2 Entry $[i]$ of column $\ell_c - 1$ of the DIST for the diagonal prefix of $G$, plus the weight of the diagonal edge leading into $(\ell_r, \ell_c)$.

3 Entry $[i]$ of column $\ell_c$ of the DIST for the top prefix of $G$, plus the weight of the vertical edge leading into $(\ell_r, \ell_c)$.
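The three-way maximum above can be sketched as a recursion over prefix blocks. This is an illustration under stated assumptions rather than the paper's data structure: blocks are identified directly by their two strings (prefixes are obtained by dropping the last character, which for LZ78 phrases is again a phrase), the new column is stored as a dictionary keyed by input-border vertex coordinates instead of by entry number, memoization stands in for the block table, and the scoring is that of Figure 1.

```python
from functools import lru_cache


def delta(a, b):
    """Assumed scoring matrix: match +1, mismatch -1, indel -1 ('-' is the gap)."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


@lru_cache(maxsize=None)
def new_column(x, y):
    """For block (x, y): map each input-border vertex (r, c) to the weight of the
    best path from it to the block's new vertex (len(x), len(y))."""
    R, C = len(x), len(y)
    left = new_column(x, y[:-1]) if C > 0 else {}            # left prefix block
    top = new_column(x[:-1], y) if R > 0 else {}             # top prefix block
    diag = new_column(x[:-1], y[:-1]) if R > 0 and C > 0 else {}
    sources = [(r, 0) for r in range(R + 1)] + [(0, c) for c in range(1, C + 1)]
    col = {}
    for s in sources:
        cands = []
        if s == (R, C):                     # the source is the new vertex itself
            cands.append(0)
        if s in left:                       # last edge is horizontal
            cands.append(left[s] + delta("-", y[-1]))
        if s in top:                        # last edge is vertical
            cands.append(top[s] + delta(x[-1], "-"))
        if s in diag:                       # last edge is diagonal
            cands.append(diag[s] + delta(x[-1], y[-1]))
        col[s] = max(cands)
    return col


# Block (5,4) of Figure 2 compares "ag" (rows) with "acg" (columns).
col = new_column("ag", "acg")
clockwise = [(2, 0), (1, 0), (0, 0), (0, 1), (0, 2), (0, 3)]
print([col[v] for v in clockwise])   # [-3, -1, 1, 0, 0, -2]
```

Read clockwise from the bottom-left corner of the block, the values are -3, -1, 1, 0, 0, -2, which match one of the number runs that survive from Figure 4 in this copy.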

3.4.1 Maintaining Direct Access to DIST Columns

In order to compute an entry of OUT in constant time during the execution of SMAWK, direct access to DIST entries is necessary. This is not straightforward, since as shown in the previous section, for each block only one new DIST column has been computed and stored. All other columns besides column $\ell_c$ of the DIST for $G$ need to be obtained from $G$'s prefix ancestor blocks.

Therefore, before the execution of SMAWK begins, a vector with pointers to all $t+1$ columns of the DIST for $G$ is constructed (see Figure 4). This vector is no longer needed after the computations for $G$ have been completed, and its space can be freed.

The pointers to all columns of the DIST for $G$ are assembled as follows. Column $\ell_c$ is set to the newly constructed vector for $G$. All columns of indices smaller than $\ell_c$ are obtained via $\ell_c$ recursive calls to left prefix blocks of $G$. All columns of indices greater than $\ell_c$ are obtained via $\ell_r$ recursive calls to top prefix blocks of $G$.

3.4.2 Querying a Prefix Block and Obtaining its DIST Column in Constant Time

The LZ78 phrases form a trie (see Figure 2), and the string to be compressed is encoded as a sequence of names of prefixes of the trie. Each node in the trie contains the serial number of the phrase it represents. Since each block corresponds to a comparison of a phrase from $A$ with a phrase from $B$, each block will be identified by a pair of numbers, composed of the serial numbers for its corresponding phrases in the tries for $A$ and $B$.

[Figure 4 image not reproduced. Panels: Block Table; Trie for A; Trie for B; vector of pointers to the columns of DIST(5,4).]

Figure 4: A table containing an entry for each block of the alignment graph. Entry $(i,j)$ of the table represents the block which corresponds to node $i$ in the trie for A and node $j$ in the trie for B. The entry for each block in the table points to the start of its new DIST column. Also shown is the vector which contains pointers to all columns of the DIST for block (5,4), as obtained from its ancestor prefix blocks. This figure continues Figures 1, 2 and 3.

Another data structure to be constructed is a Block Table (see Figure 4), containing an entry for each partitioned block of the alignment graph. The entry for each block in the table points to the start of its new DIST column, and can be directly accessed via the block's phrase number index pair.

The left prefix of $G$ can be identified in constant time as a pair of phrase numbers, the first identical to the serial number of $xa$, and the second corresponding to the serial number of $y$, which is the direct ancestor of $yb$ in the trie for $B$. Similarly, the top prefix of $G$ can be identified in constant time. Given the pair of identification numbers for a block, a pointer to the corresponding DIST column can then be directly obtained from the Block Table.
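The constant-time identification of prefix blocks can be sketched with parent pointers in the two tries. The names `lz78_parents` and `prefix_blocks` are hypothetical, phrase 0 stands for the empty root, and a trailing fragment of the input that only repeats an earlier phrase is ignored in this simplified version.

```python
def lz78_parents(s):
    """parent[k] is the serial number of the phrase that phrase k extends
    (0 is the empty root); phrases are numbered 1, 2, ... in order of creation."""
    children = {}        # (phrase id, character) -> phrase id
    parent = [None]      # index 0 is the root and has no parent
    node = 0
    for ch in s:
        nxt = children.get((node, ch))
        if nxt is not None:
            node = nxt                       # still inside a known phrase
        else:
            parent.append(node)              # new phrase extends `node` by ch
            children[(node, ch)] = len(parent) - 1
            node = 0
    return parent


def prefix_blocks(i, j, parent_a, parent_b):
    """Identify the three prefix blocks of block (i, j) with two array lookups."""
    return {
        "left": (i, parent_b[j]),
        "diagonal": (parent_a[i], parent_b[j]),
        "top": (parent_a[i], j),
    }


parent_a = lz78_parents("ctacgaga")   # phrases c, t, a, cg, ag
parent_b = lz78_parents("aacgacga")   # phrases a, ac, g, acg
print(prefix_blocks(5, 4, parent_a, parent_b))
# {'left': (5, 2), 'diagonal': (3, 2), 'top': (3, 4)}
```

For block (5,4) this gives the left prefix (5,2), diagonal prefix (3,2) and top prefix (3,4) labelled in Figure 2; a Block Table could then be an ordinary dictionary keyed by such pairs.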

Time and Space Analysis. Assume sequence size $n$ and sequence entropy $h \le 1$. The LZ78 factorization algorithm parses the strings and constructs the tries for $A$ and $B$ in $O(n)$ time. The resulting number of phrases in both $A$ and $B$ is $O(hn/\log n)$. The number of resulting blocks in the alignment graph is equal to the number of phrases in $A$ times the number of phrases in $B$, and is therefore $O(h^2 n^2/(\log n)^2)$. For each block $G$, the following information (1-3) is computed, in time and space complexity linear with the size of its I/O borders:

1. Updating the Encoding Structure for G. The prefix blocks of $G$ can be accessed in constant time. The vectors of DIST column pointers for the prefix blocks have already been freed. However, since each prefix block directly points to its newly computed DIST column, all values needed for the computations are still available. Since each entry of the new DIST column for $G$ is set to the

$O(t)$ time and space.

2. Maintaining Direct Access to DIST columns. Since prefix blocks and their DIST columns can be accessed in constant time, the vector with pointers to columns of the DIST for $G$ can be set in $O(t)$ time.

3. Propagating I/O values across the block. Using the information computed for $G$, and given the $I$ for $G$ obtained from the $O$ vectors for the block above $G$ and the block to its left, the values of $O$ for $G$ are computed via SMAWK Matrix Searching in $O(t)$ time.

Total Complexity. Since the work and space for each block is linear with the size of its I/O borders, the total time and space complexity is linear with the total size of the borders of the blocks. The block borders form $O(hn/\log n)$ rows of size $|B|$ each, and $O(hn/\log n)$ columns of size $|A|$ each, in the alignment graph (see Figure 2). Therefore, the total time and space complexity is $O(hn^2/\log n)$.
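The border-counting argument can be written out in one line, taking $|A| = |B| = n$ as the analysis assumes. Every block row contributes one horizontal border of $|B|$ vertices and every block column one vertical border of $|A|$ vertices, so the total border size is:

```latex
\underbrace{O\!\left(\frac{hn}{\log n}\right)\cdot |B|}_{\text{horizontal borders}}
\;+\;
\underbrace{O\!\left(\frac{hn}{\log n}\right)\cdot |A|}_{\text{vertical borders}}
\;=\;
O\!\left(\frac{hn}{\log n}\right)\cdot 2n
\;=\;
O\!\left(\frac{hn^{2}}{\log n}\right).
```

This is smaller than the number of blocks times the largest possible border, because the sum is taken over borders rather than block by block.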

4 Global Similarity Optimal Alignment Trae Reovery

The reovery of an optimal globalalignment trae betweenA and B starts at vertex (n;n). The

series of blok rossing paths is then traed bak until vertex (0;0) is reahed. For eah blok

rossed, the internal alignment trae is reported, starting from the output border sink, and bak

to the optimal origin soure vertex in the orresponding input border. In order to support the

reovery of blok-rossingpaths intimelinear withtheirsize, theomputation and storage ofthe

followingadditionalinformationfora givenblokGisrequired.

1. During the Propagation stage, for each entry j in the output border of G, the index of the input border entry i, which is the source of the highest scoring path to output border entry j, is saved.

2. During Encoding, an additional $O(t)$ sized vector of pointers, the ancestors vector, is computed for G. For any output border entry O[j], $j = 0 \ldots t$, ancestors[j] points to the ancestor block of G for which this entry is its new vertex. (The value of ancestors[$\ell_c$] is set to G. All columns of indices smaller than $\ell_c$ are obtained via $\ell_c$ recursive calls to left prefix blocks of G. All columns of indices greater than $\ell_c$ are obtained via $\ell_r$ recursive calls to top prefix blocks of G.)

3. During Encoding, G's new vertex $(\ell_r, \ell_c)$ is annotated with an additional $O(t)$ sized vector of pointers, denoted direction. These pointers are set during the DIST column computation described in Section 3.4, as follows. The value of direction[i] is set according to the direction of the last edge in the optimal path originating at entry i of G's input border and ending at vertex $(\ell_r, \ell_c)$.

Given that the optimal path enters through entry j of the output border of G, the trace-back of the part of the path going through G proceeds in two stages. The first stage is a destination and origin initialization stage. This stage includes the fetching of the input row source entry i, which was stored as the origin for the highest scoring path to G's output border entry j (see 1 above). Entry […] of G, pointed to by ancestors[j], is fetched (see 2 above). The edge recovery begins in block P.

During the second stage, the origin and destination information computed in the first stage is used to trace back the part of the path contained in P, from entry j on P's output border (the new vertex of P), to entry i on its input border. This is done by backtracking through a dynasty of prefix ancestor blocks internal to P, using the direction vector computed for each of the traversed blocks (see 3 above). If direction[i] of the traversed block specifies a horizontal edge, then the trace-back retreats to the left prefix of P, and an "insertion" operation is reported in the alignment trace. Correspondingly, "substitution" and "deletion" are reported when backtracking to diagonal and top prefix blocks. The recovery continues through a series of prefix blocks of P until the full optimal alignment trace is recovered.
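The second stage can be illustrated with a deliberately simplified sketch. It assumes a single block whose prefix blocks are stored in a dictionary keyed by (rows, columns) rather than reached through the LZ78 ancestor pointers, and it fills the direction tables by brute force instead of incrementally during encoding. The names `build_direction_tables`, `trace_back` and `match_score` are hypothetical and not taken from the paper.

```python
NEG_INF = float("-inf")

# Edge direction -> operation reported in the alignment trace, as in the text:
# horizontal = insertion, diagonal = substitution, vertical (top) = deletion.
OPERATION = {"D": "substitution", "H": "insertion", "V": "deletion"}
STEP = {"D": (1, 1), "H": (0, 1), "V": (1, 0)}


def match_score(x, y):
    # Hypothetical scoring matrix: +1 for a match, -1 for a mismatch.
    return 1 if x == y else -1


def build_direction_tables(a, b, score, gap):
    """For every prefix block (r, c) and every input-border source vertex,
    store the best path score to the new vertex (r, c) and the direction of
    the last edge on that path. Brute force, for illustration only."""
    rows, cols = len(a), len(b)
    # Input border: left column (bottom to top), then the top row.
    sources = [(r, 0) for r in range(rows, -1, -1)]
    sources += [(0, c) for c in range(1, cols + 1)]
    dist, direction = {}, {}
    for r in range(rows + 1):
        for c in range(cols + 1):
            dist[(r, c)], direction[(r, c)] = {}, {}
            for src in sources:
                if src == (r, c):
                    dist[(r, c)][src] = 0
                    continue
                candidates = []
                if r > 0 and c > 0:  # diagonal prefix block
                    prev = dist[(r - 1, c - 1)].get(src, NEG_INF)
                    candidates.append((prev + score(a[r - 1], b[c - 1]), "D"))
                if c > 0:  # left prefix block
                    candidates.append((dist[(r, c - 1)].get(src, NEG_INF) + gap, "H"))
                if r > 0:  # top prefix block
                    candidates.append((dist[(r - 1, c)].get(src, NEG_INF) + gap, "V"))
                best = max(candidates, key=lambda x: x[0], default=(NEG_INF, None))
                if best[0] > NEG_INF:  # src can reach (r, c)
                    dist[(r, c)][src] = best[0]
                    direction[(r, c)][src] = best[1]
    return dist, direction


def trace_back(direction, src, sink):
    """Walk from the sink back to the source, one prefix block per edge."""
    ops = []
    r, c = sink
    while (r, c) != src:
        d = direction[(r, c)][src]
        ops.append(OPERATION[d])
        dr, dc = STEP[d]
        r, c = r - dr, c - dc
    return ops
```

Each loop iteration in `trace_back` performs one dictionary lookup and one move to a prefix block, which mirrors the claim that the trace is recovered in time linear in its length. The operations are reported from the sink back towards the source.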

Time and Space Analysis. The two additional vectors for G, direction and ancestors, and the input border source entry i, can be computed and stored during the encoding and propagation stages in $O(t)$ time and space.

The work for the first stage in the trace-back can be done in constant time. In the second stage, each edge in the recovered alignment path results in a traversal to a single prefix block. Since prefix blocks and their corresponding direction vectors can be accessed in constant time, a highest scoring global alignment between strings A and B can be recovered in time linear in its size.

5 Reducing the Space Complexity

When computing the optimal global alignment value with scoring matrices which follow the "discreteness" condition (see Section 1), the efficient alignment stage algorithm described in [33] can be extended to support full propagation from the leftmost and upper boundaries to the bottom and rightmost boundaries of G.

This extended propagation algorithm can then be used to compute the values of the global alignment O for G, given the I for G and a minimal encoding of the DIST for G. The advantage of this minimal encoding of DIST is that rather than saving an $O(t)$ sized DIST column per block, we only need to save a constant number of values per block. The encoding for the new DIST column of each block can be computed and stored in constant time and space from the information stored for the left, diagonal and top prefix blocks of G, using the technique described in Section 6 of [46].

This reduces the space complexity to $O(h^2 n^2/(\log n)^2)$, while preserving the $O(hn^2/\log n)$ time complexity.

6 The Local Alignment Algorithm

6.1 Computing the Optimal Local Similarity Value

When computing the optimal local similarity value, an optimal path could either be contained entirely in one block (type C), or could be a block-crossing path (see Figure 5). A block-crossing path consists of a (possibly empty) S-path, followed by any number of paths leading from the input border of a block to its output border, and ending in an E-path with a highest scoring last vertex.

Since an optimal path could begin inside any block, vector O needs to be updated to consider the additional paths originating inside G. Also, since an optimal path could end inside any block, extra bookkeeping is needed in order to keep track of the highest scoring paths ending in each block.

Figure 5: A. The I/O path weight vectors computed for each block in the global alignment solution. DIST[i, j] will be set to the highest scoring path connecting vertex i in the input border with vertex j in the output border. B, C. The vectors of optimal path weights considered for the local alignment computation.

Therefore, in addition to the DIST described in Section 3, we compute for each block G the following data structures (see Figures 5B and 5C).

1. E is a vector of size t. E[i] contains the value of the highest scoring path which starts at vertex i of the input border of G and ends inside G. E[i] is computed as the maximum between E[i] for the left prefix of G, E[i] for the top prefix of G, and DIST[i, $\ell_c$].

2. S is a vector of size t. S[i] contains the value of the highest scoring path which starts inside G and ends at vertex i of the output border of G.

The only new value computed for S is the local alignment score for the new vertex of G, $S[\ell_c]$. Given the scores $S[\ell_c - 1]$ obtained from the diagonal prefix, $S[\ell_c - 1]$ obtained from the left prefix and $S[\ell_c]$ obtained from the top prefix of G, and the weights of the 3 edges leading into vertex $(\ell_r, \ell_c)$, $S[\ell_c]$ can be computed in $O(1)$ time, using the recursion given in Section 2.1.

The values of all other entries of S are then set as follows. The first $\ell_c$ values of S are copied from the first $\ell_c$ values of the S computed for the left prefix of G. The last $\ell_r$ values are copied from the last $\ell_r$ values of the S vector for the top prefix of G.

3. C is the value of the highest scoring path contained in G, that is, the highest scoring path which originates inside G and ends inside G. C is computed as the maximum between the C value for the left prefix of G, the C value for the top prefix of G, and the newly computed $S[\ell_c]$ as described above.

[…] will be used to compute the weight of the highest scoring path ending in G.

Vector O is first computed from the I and DIST for G as described in Section 3.2. At this point entry O[i] reflects the weight of the optimal path starting anywhere outside G and ending at entry i of the output border. It needs to be updated with the weights of the highest scoring paths starting inside G. This is achieved by resetting O[i] to the maximum between O[i] and S[i].

The weight of the highest scoring path ending in G is computed as $\max\left(\max_{i=0}^{t}\{I[i] + E[i]\},\ C\right)$.

After the computations for each block have been completed, the overall highest local alignment score for comparing A and B can be computed as the maximum among the values of the highest scoring path ending in each block.
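The per-block bookkeeping described above can be sketched as follows. This is a minimal illustration: the vectors are plain Python lists, the helper names are hypothetical, and the zero floor in `new_vertex_s` assumes that the recursion of Section 2.1 is the standard local-alignment recursion (which is consistent with the trace-back halting at value zero).

```python
def new_vertex_s(s_diag, s_left, s_top, w_diag, w_left, w_top):
    """S value of the new vertex of G from its three predecessors.
    Assumption: standard local-alignment recursion with a zero floor."""
    return max(0, s_diag + w_diag, s_left + w_left, s_top + w_top)


def update_c(c_left, c_top, s_new):
    """C for G: best of the left prefix, the top prefix and the new S value."""
    return max(c_left, c_top, s_new)


def finish_block(I, O, E, S, C):
    """Update O with paths starting inside G, and compute the weight of the
    highest scoring path ending in G: max(max_i(I[i] + E[i]), C)."""
    updated_O = [max(o, s) for o, s in zip(O, S)]
    best_ending_in_block = max(max(i + e for i, e in zip(I, E)), C)
    return updated_O, best_ending_in_block


def best_local_score(per_block_best):
    """Overall local alignment score: the maximum over all blocks."""
    return max(per_block_best)
```

For example, `finish_block([0, 1, 2], [1, 5, 0], [3, 0, 1], [2, 2, 2], 4)` returns `([2, 5, 2], 4)`: O is raised to S where S is larger, and the best path ending in the block is the C value 4, which beats the best I[i] + E[i] of 3.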

Time and Space Analysis. Since, as shown in Section 3.4.1, each prefix block of G can be accessed in constant time, the values of the S and E vectors for G can be computed and stored in $O(t)$ time and space, and the C value for G can be computed in constant time and space.

Given the S and E vectors and the C value for G, the values of O and the weight of the highest scoring path ending in G can be computed in $O(t)$ time each as described above.

The weight of the highest scoring path in the alignment graph can then be computed in an additional $O(h^2 n^2/(\log n)^2)$ time as the maximum value among the best values computed for each block.

Since the work and space for each block is linear with the size of its I/O borders, the total time and space complexity of computing the optimal local alignment value is $O(hn^2/\log n)$.

6.2 Optimal Alignment Trace Recovery for the Local Alignment Solution

Similarly to the alignment trace defined in Section 4, given a maxL vertex $(i_{end}, j_{end})$ which was obtained in the previous section, we show how to recover the optimal path ending in this vertex by reporting a trace-back of the edges from vertex $(i_{end}, j_{end})$ until a start-point vertex $(i_{start}, j_{start})$ is reached that has value zero.

A block-crossing optimal path consists of a (possibly empty) S-path, followed by any number of paths leading from the input border of a block to its output border, and ending in an E-path whose last vertex is $(i_{end}, j_{end})$.

The recovery starts at vertex $(i_{end}, j_{end})$ and continues back to the optimal path origin in three stages.

1. Recovering the E-path part.

During encoding, whenever the E[i] value of a block is updated by its new vertex, a pointer to the updating block is saved together with the new E[i] value.

During alignment recovery, given that vertex $(i_{end}, j_{end})$ ends an E[i] path in G, the corresponding block can be fetched, and the path from its new vertex to entry i on its input border recovered, as described in Section 4.

2. Recovering all paths leading from the input border of a block to its output border.

The part of the path contained in each one of these blocks can be recovered as described in Section 4.

3. Recovering the S-path part.

During encoding, when computing the S-score of the new vertex of each block, the direction of the edge optimizing the score $S[\ell_c]$ of the new vertex of G, denoted s_direction, is saved with the score.

During the termination of the propagation stage, when setting the score values for each entry in O, a field is set, indicating whether the newly set score value for this entry corresponds to a path originating inside G (an S-path), or a path crossing G. In the S-path case, the recovery of the S-path part utilizes the technique described in Section 4, with a slight modification. Instead of the direction vector, the s_direction field is used for the edge trace-back. The recovery halts when an ancestor block is reached whose $S[\ell_c]$ value is zero.

A special case occurs when vertex $(i_{end}, j_{end})$ is the endpoint of a C-path. A C-path is, in essence, a halted S-path. During encoding, whenever the C value of a block is updated by its new vertex, a pointer to the updating block is saved together with the new C value. The recovery of the C-path in G starts at the new vertex of its corresponding block and continues similarly to the S-path recovery, as described in 3 above.

Time and Space Analysis. In addition to the values described in Section 4, an additional $O(t)$ information (pointers to the E[i] updating blocks) is computed and stored for E-paths, and an additional $O(1)$ information per block is computed and stored for C and S paths. During propagation termination, an additional $O(t)$ information is stored with the O vector.

During recovery, each edge in the recovered alignment path results in a traversal to a single prefix block, for each one of the three path parts. Both prefix blocks and their corresponding direction vectors can be accessed in constant time. Therefore, in addition to the basic $O(hn^2/\log n)$ time and space needed for computing the optimal local alignment score maxL, an alignment trace ending at a given maxL-scoring vertex can be reported in time linear with the size of the trace.

7 Applications to the Problem of Comparing Two Run Length Encoded Strings

A string S is run-length encoded if it is described as an ordered sequence of pairs $(\sigma, i)$, often denoted "$\sigma^i$," each consisting of an alphabet symbol, $\sigma$, and an integer, $i$. Each pair corresponds to a run in S, consisting of $i$ consecutive occurrences of $\sigma$. For example, the string aabbbbbccc can be encoded as $a^2 b^5 c^3$. Such a run-length encoded string can be significantly shorter than the expanded string representation after efficiently encoding the integers (see [13] for example).
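As a small illustration of the definition above, the following sketch encodes a string as a list of (symbol, run length) pairs and decodes it back. The function names are hypothetical, and the integers are kept as plain Python ints rather than being compactly encoded as in [13].

```python
from itertools import groupby


def run_length_encode(s):
    """Return the ordered sequence of (symbol, run length) pairs of s."""
    return [(symbol, len(list(run))) for symbol, run in groupby(s)]


def run_length_decode(pairs):
    """Expand a sequence of (symbol, run length) pairs back into a string."""
    return "".join(symbol * count for symbol, count in pairs)
```

For the example in the text, `run_length_encode("aabbbbbccc")` returns `[("a", 2), ("b", 5), ("c", 3)]`.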

Run-length encoding serves as a popular image compression technique, since many classes of images (e.g., binary images in facsimile transmission or for use in optical character recognition) typically contain large patches of identically-valued pixels.

Let $m$ and $n$ be the lengths of two run-length encoded strings X and Y, of encoded lengths $m'$ and $n'$, respectively. Previous algorithms for the problem compared two run-length encoded strings using the Levenshtein Edit Distance [35] and the LCS similarity measure [25]. For the LCS metric, Bunke and Csirik [9] presented an $O(mn' + nm')$ time algorithm, while Apostolico, Landau, and Skiena [6] described an $O(m'n' \log(m'n'))$ time algorithm. Mitchell [40] has obtained an $O((d + m' + n') \log(d + m' + n'))$ time algorithm for a more general string matching problem […]

Arbell et al. [1] and Mäkinen et al. [37] independently obtained an $O(m'n + n'm)$ time algorithm for computing the edit distance between two run-length encoded strings for the Levenshtein distance metric.

Mäkinen et al. [37] posed as an open problem the challenge of extending these results to more general scoring schemes, since in those applications which are related to image compression, the change from a pixel value to the next is smooth. Here, we will show how to extend the results to apply them to any distance or similarity scoring scheme with additive gap scores.

In this solution, the alignment graph is also partitioned into blocks. But rather than using the LZ78 partition described in Section 2, each block here consists of two runs, one of X and one of Y. This results in the partition of the alignment graph into $m'n'$ blocks. The algorithm suggested also propagates accumulated scores from the left and upper boundaries of each block, to its bottom and right boundaries.

Consider the block R for comparing a run of X with a run of Y. An edge in R could be assigned one of three possible weight values: D (diagonal), H (horizontal) and V (vertical).

Let $\Delta_h$ and $\Delta_w$ denote the difference in row index values and column index values respectively, between entry i on the input border of R, and entry j on the output border of R.

We show how to compute DIST[i, j] (which is the cost of the best scoring path from entry i in the input border of the block, to entry j in the output border of the block) in constant time, given $\Delta_h$ and $\Delta_w$ for the input and output entries, and the values D, H and V.

Case $H + V \ge D$. Clearly, an optimal path from i to j can use all possible diagonal edges and only then the minimal number of remaining H and V edges necessary to reach j. Therefore, DIST[i, j] obtains one of three values:

1. If $\Delta_w = \Delta_h$, then $\mathrm{DIST}[i, j] = D\,\Delta_h$.

2. If $\Delta_w > \Delta_h$, then $\mathrm{DIST}[i, j] = D\,\Delta_h + H(\Delta_w - \Delta_h)$.

3. If $\Delta_w < \Delta_h$, then $\mathrm{DIST}[i, j] = D\,\Delta_w + V(\Delta_h - \Delta_w)$.

Case $H + V < D$. In this case, an optimal path never uses any diagonal edge. The path includes only the minimal number of H edges, and the minimal number of V edges necessary to reach j from i. In this case, $\mathrm{DIST}[i, j] = H\,\Delta_w + V\,\Delta_h$.

Therefore, DIST[i, j] can be easily computed in constant time when using the general scoring scheme described in Section 2.1.
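The case analysis above translates into a few lines of code. The sketch below assumes the cost-minimizing (edit distance) reading of "best scoring path"; the function name `dist_run_block` and the parameter names `delta_h` and `delta_w` are hypothetical.

```python
def dist_run_block(delta_h, delta_w, D, H, V):
    """Constant-time DIST[i, j] inside a block comparing two runs.

    delta_h, delta_w: row and column index differences between the input
    border entry and the output border entry (both non-negative).
    D, H, V: the uniform diagonal, horizontal and vertical edge costs.
    Assumption: path costs are minimized.
    """
    if H + V >= D:
        # Use every possible diagonal edge, then the remaining H or V edges.
        diagonals = min(delta_h, delta_w)
        return (D * diagonals
                + H * (delta_w - diagonals)
                + V * (delta_h - diagonals))
    # A diagonal edge is never cheaper than one H edge plus one V edge.
    return H * delta_w + V * delta_h
```

The single `min(delta_h, delta_w)` expression covers the three numbered subcases at once: when the differences are equal, both remainder terms vanish; otherwise exactly one of them is nonzero.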

Time and Space Analysis. The O vector for each block is computed using SMAWK. Vector I for block R can be easily obtained from the O vectors for the block above R and the block to its left, in time linear with the sides of R. The "rectangle" problem can be solved similarly to Section 3.2. Therefore, any value OUT[i, j] = I[i] + DIST[i, j] can be computed in constant time.

Since the work and space for each block is linear with the size of its I/O borders, the total time and space complexity is linear with the total size of the borders of the blocks, which is $O(m'n + n'm)$.
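To make the role of OUT[i, j] = I[i] + DIST[i, j] concrete, the sketch below propagates border values across one run block with a naive scan over all input entries. It is not the SMAWK-based method of the paper: it takes time quadratic in the border size and is meant only to show what each output entry is the optimum of. Border entries are addressed by their vertex coordinates inside the block, costs are assumed to be minimized, and the name `propagate_run_block` is hypothetical.

```python
def propagate_run_block(rows, cols, I, D, H, V):
    """Naive propagation across a block of `rows` x `cols` edges.

    I maps every input-border vertex (r, c), with r == 0 or c == 0, to its
    accumulated cost. Returns a dict mapping every output-border vertex
    (r == rows or c == cols) to min over reachable sources of I + DIST.
    """
    def dist(dh, dw):
        # Constant-time DIST for uniform edge costs (cost minimization).
        if H + V >= D:
            k = min(dh, dw)
            return D * k + H * (dw - k) + V * (dh - k)
        return H * dw + V * dh

    # Output border: the bottom row, then the rest of the right column.
    outputs = [(rows, c) for c in range(cols + 1)]
    outputs += [(r, cols) for r in range(rows)]
    O = {}
    for (r1, c1) in outputs:
        O[(r1, c1)] = min(
            value + dist(r1 - r0, c1 - c0)
            for (r0, c0), value in I.items()
            if r0 <= r1 and c0 <= c1  # source must lie above and to the left
        )
    return O
```

Replacing the inner `min` scan with a totally monotone matrix search is what brings the per-block cost down to linear in the border size.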

The algorithms presented in this paper are perhaps close to optimal in time complexity. However, an important concern is the space complexity of the algorithms. If only the similarity score value is required, the classical, quadratic time sequence alignment algorithm can easily be implemented to run in linear space, by keeping only two rows of the dynamic programming table alive at each step. If the recovery of either global or local optimal alignment traces is required, quadratic-time and linear-space algorithms can be obtained by applying Hirschberg's refinement to the classical sequence alignment algorithms [10, 25, 26]. We pose as an open problem the challenge of further reducing the space requirement of the algorithms described in this paper, without impairing their sub-quadratic time complexity.
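The two-row idea mentioned above for the classical algorithm can be sketched as follows. This illustrates only the score computation of the classical quadratic-time method, not the sub-quadratic algorithm of this paper or Hirschberg's trace recovery. The scoring function `unit_score` and the gap weight are hypothetical choices.

```python
def unit_score(x, y):
    # Hypothetical scoring matrix: +1 for a match, -1 for a mismatch.
    return 1 if x == y else -1


def global_score_linear_space(a, b, score, gap):
    """Classical global alignment score, keeping only two table rows."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            current[j] = max(
                previous[j - 1] + score(a[i - 1], b[j - 1]),  # diagonal edge
                current[j - 1] + gap,                         # horizontal edge
                previous[j] + gap,                            # vertical edge
            )
        previous = current  # the older row is no longer needed
    return previous[len(b)]
```

Only `previous` and `current` are alive at any step, so the space used is linear in the length of `b` while the running time stays quadratic.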

Acknowledgement

We are grateful to Dan Gusfield for a helpful discussion.

References

[1] Arbell, O., G. M. Landau, and J. Mitchell, Edit distance of run-length encoded strings, accepted for publication in Information Processing Letters.

[2] Aggarwal, A., M. Klawe, S. Moran, P. Shor, and R. Wilber, Geometric Applications of a Matrix-Searching Algorithm, Algorithmica, 2, 195-208 (1987).

[3] Aggarwal, A., and J. Park, Notes on Searching in Multidimensional Monotone Arrays, Proc. 29th IEEE Symp. on Foundations of Computer Science, 497-512 (1988).

[4] Amir, A., G. Benson, and M. Farach, Let sleeping files lie: Pattern matching in Z-compressed files, J. of Comp. and Sys. Sciences, 52(2), 299-307 (1996).

[5] Apostolico, A., M. Atallah, L. Larmore, and S. McFaddin, Efficient parallel algorithms for string editing problems, SIAM J. Comput., 19, 968-998 (1990).

[6] Apostolico, A., G. M. Landau, and S. Skiena, Matching for Run Length Encoded Strings, Journal of Complexity, 15, 1, 4-16 (1999).

[7] Bell, T. C., J. C. Cleary, and I. H. Witten, Text Compression, Prentice Hall (1990).

[8] Benson, G., A space efficient algorithm for finding the best nonoverlapping alignment score, Theoretical Computer Science, 145, 357-369 (1995).

[9] Bunke, H., and J. Csirik, An improved algorithm for computing the edit distance of run length coded strings, Information Processing Letters, 54, 93-96 (1995).

[10] Chao, K. M., R. Hardison, and W. Miller, Recent developments in linear-space alignment methods: a mini survey, J. Comp. Biol., 1, 271-291 (1994).

[11] Crochemore, M., Transducers and Repetitions, Theoret. Comput. Sci., 45, 63-86 (1986).

[12] Crochemore, M., and W. Rytter, Text Algorithms, Oxford University Press (1994).

[13] Elias, P., Universal Codeword Sets and Representation of Integers, IEEE Trans. Inform. Theory, IT-21, 2, 194-203 (1975).

[14] Eppstein, D., Sequence Comparison with Mixed Convex and Concave Costs, Journal of Algorithms, 11, 85-101 (1990).

[…] on Foundations of Computer Science, 488-296 (1988).

[16] Eppstein, D., Z. Galil, R. Giancarlo, and G. F. Italiano, Sparse Dynamic Programming I: Linear Cost Functions, JACM, 39, 546-567 (1992).

[17] Eppstein, D., Z. Galil, R. Giancarlo, and G. F. Italiano, Sparse Dynamic Programming II: Convex and Concave Cost Functions, JACM, 39, 568-599 (1992).

[18] Farach, M., and M. Thorup, String matching in Lempel-Ziv compressed strings, Algorithmica, 20, 388-404 (1998).

[19] Galil, Z., and R. Giancarlo, Speeding Up Dynamic Programming with Applications to Molecular Biology, Theoretical Computer Science, 64, 107-118 (1989).

[20] Galil, Z., and K. Park, A linear-time algorithm for concave one-dimensional dynamic programming, Info. Processing Letters, 33, 309-311 (1990).

[21] Gasieniec, L., M. Karpinski, W. Plandowski, and W. Rytter, Randomised efficient algorithms for compressed strings: the finger-print approach, Proc. 7th Annual Symposium On Combinatorial Pattern Matching, LNCS 1075, 39-49 (1996).

[22] Gasieniec, L., and W. Rytter, Almost optimal fully LZW compressed pattern matching, Data Compression Conference, J. Storer, ed. (1999).

[23] Giancarlo, R., Dynamic Programming: Special Cases, Pattern Matching Algorithms, edited by Apostolico, A., and Z. Galil, Oxford University Press, 201-232 (1997).

[24] Gusfield, D., Algorithms on Strings, Trees, and Sequences, Cambridge University Press (1997).

[25] Hirschberg, D. S., A linear space algorithm for computing maximal common subsequences, Comm. Assoc. Comput. Mach., 18(6), 341-343 (1975).

[26] Huang, X., and W. Miller, A time-efficient, linear space local similarity algorithm, Adv. Appl. Math., 12, 337-357 (1991).

[27] Kannan, S. K., and E. W. Myers, An Algorithm For Locating Non-Overlapping Regions of Maximum Alignment Score, SIAM J. Comput., 25(3), 648-662 (1996).

[28] Kärkkäinen, J., G. Navarro, and E. Ukkonen, Approximate String Matching over Ziv-Lempel Compressed Text, Proc. 11th Annual Symposium On Combinatorial Pattern Matching, LNCS 1848, 195-209 (2000).

[29] Kärkkäinen, J., and E. Ukkonen, Lempel-Ziv parsing and sublinear-size index structures for string matching, Proc. Third South American Workshop on String Processing (WSP'96), 141-155 (1996).

[30] Kida, T., M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa, Shift-And approach to pattern matching in LZW compressed text, Proc. 10th Annual Symposium On Combinatorial Pattern Matching, LNCS 1645, 1-13 (1999).

[31] Klawe, M., and D. Kleitman, An Almost Linear Algorithm for Generalized Matrix Searching, SIAM Jour. Discrete Math., 3, 81-97 (1990).

[32] Landau, G. M., and M. Ziv-Ukelson, On the Shared Substring Alignment Problem, Proc. Symposium On Discrete Algorithms, 804-814 (2000).

[33] Landau, G. M., and M. Ziv-Ukelson, On the Common Substring Alignment Problem, Journal of Algorithms.

[34] Lempel, A., and J. Ziv, On the complexity of finite sequences, IEEE Transactions on Information Theory, 22, 75-81 (1976).

[35] Levenshtein, V. I., Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Phys. Dokl, 10, 707-710 (1966).

[…] 5, 422-432 (1984).

[37] Mäkinen, V., G. Navarro, and E. Ukkonen, Approximate Matching of Run-Length Compressed Strings, Proc. 12th Annual Symposium On Combinatorial Pattern Matching, LNCS 1645, 1-13 (1999).

[38] Manber, U., A text compression scheme that allows fast searching directly in the compressed file, Proc. 5th Annual Symposium On Combinatorial Pattern Matching, LNCS 2089, 31-49 (2001).

[39] Masek, W. J., and M. S. Paterson, A faster algorithm for computing string edit distances, J. Comput. Syst. Sci., 20, 18-31 (1980).

[40] Mitchell, J., A Geometric Shortest Path Problem, with Application to Computing a Longest Common Subsequence in Run-Length Encoded Strings, Technical Report, Dept. of Applied Mathematics, SUNY Stony Brook, 1997.

[41] Monge, G., Déblai et Remblai, Mémoires de l'Académie des Sciences, Paris (1781).

[42] Navarro, G., T. Kida, M. Takeda, A. Shinohara, and S. Arikawa, Faster Approximate String Matching Over Compressed Text, Proc. Data Compression Conference (DCC 2001), IEEE Computer Society, 459-468 (2001).

[43] Navarro, G., and M. Raffinot, A general practical approach to pattern matching over Ziv-Lempel compressed text, Proc. 10th Annual Symposium On Combinatorial Pattern Matching, LNCS 1645, 14-36 (1999).

[44] Navarro, G., and M. Raffinot, Boyer-Moore string matching over Ziv-Lempel compressed text, Proc. 11th Annual Symposium On Combinatorial Pattern Matching, LNCS 1848, 166-180 (2000).

[45] Sankoff, D., and J. B. Kruskal (editors), Time Warps, String Edits, and Macromolecules: the Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA (1983).

[46] Schmidt, J. P., All Highest Scoring Paths In Weighted Grid Graphs and Their Application To Finding All Approximate Repeats In Strings, SIAM J. Comput., 27(4), 972-992 (1998).

[47] Shibata, Y., T. Kida, S. Fukamachi, M. Takeda, A. Shinohara, T. Shinohara, and S. Arikawa, Speeding up pattern matching by text compression, CIAC 2000, LNCS 1767, 306-315 (2000).

[48] Smith, T. F., and M. S. Waterman, Identification of common molecular subsequences, J. Molecular Biol., 147, 195-197 (1981).

[49] Szpankowski, W., and P. Jacquet, Asymptotic Behavior of the Lempel-Ziv Parsing Scheme and Digital Search Trees, Theoretical Computer Science, 144, 161-197 (1995).

[50] Welch, T. A., A Technique for High Performance Data Compression, IEEE Trans. on Computers, 17(6), 8-19 (1984).

[51] Ziv, J., and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, IT-23(3), 337-343 (1977).

[52] Ziv, J., and A. Lempel, Compression of individual sequences via variable rate coding, IEEE Trans. Inform. Th., 24, 530-536 (1978).
