A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices


HAL Id: hal-00619996

https://hal-upec-upem.archives-ouvertes.fr/hal-00619996

Submitted on 20 Mar 2013



Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson

To cite this version:

Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson. A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices. Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2002, United States. pp. 679-688. hal-00619996

Maxime Crochemore *
Institut Gaspard-Monge, Université de Marne-la-Vallée

Gad M. Landau †
Haifa University and Polytechnic University

Michal Ziv-Ukelson ‡
Haifa University and IBM T.J.W. Research Center

Abstract

The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size $n$ in $O(n^2)$ time. We address the challenge of computing the similarity of two strings in sub-quadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations.

The speed-up is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an $O(n^2/\log n)$ algorithm for an input of constant alphabet size. For most texts, the time complexity is actually $O(hn^2/\log n)$, where $h \le 1$ is the entropy of the text.

* Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France, email: mac@univ-mlv.fr.

† Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (972-4) 824-0103, FAX: (972-4) 824-9331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 11201-3840; email: landau@poly.edu; partially supported by NSF grant CCR-0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation (grants 173/98 and 282/01), by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award.

‡ Department of Computer Science, Haifa University, Haifa 31905, Israel; on Education Leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.ac.il; partially supported by the Israel Science Foundation (grants 173/98 and 282/01), and by the FIRST Foundation …

1 Introduction

The rapid progress in large-scale DNA sequencing opens a new level of computational challenges involved in storing, organizing and analyzing the wealth of biological information. One of the most interesting new fields that the availability of the complete genomes has created is that of genome comparison (the genome is all of the DNA sequence passed from one generation to the next). Comparing complete genomes can give deep insights about the relationship between organisms, as well as shedding light on the function of specific genes in each single genome. The challenge of comparing complete genomes necessitates the creation of additional, more efficient computational tools.

One of the most common problems in biological comparative analysis is that of aligning two long bio-sequences in order to measure their similarity. In the global alignment problem [19], [29], the similarity between two strings A and B is measured. In the local alignment problem [39], the objective is to find substrings of A which are similar to substrings of B. Both alignment problems can be solved in $O(n^2)$ time by dynamic programming [19], [39].

In this paper data compression techniques are employed to speed up the alignment of two strings. The compression mechanism enables the algorithm to adapt to the data and to utilize its repetitions. The periodic nature of the sequence is quantified via its entropy, denoted $0 \le h \le 1$. Entropy is a measure of how "compressible" a sequence is [5], [7], and is small when there is a lot of order (i.e., the sequence is repetitive and therefore more compressible) and large when there is a lot of disorder (see section 2.2).

We present an $O(n^2/\log n)$ algorithm for computing both global and local similarity between two strings over a constant alphabet. The algorithm is even faster when the sequence is compressible. In fact, for most texts, the complexity of our algorithm is actually $O(hn^2/\log n)$.

Note that the algorithm presented is the first sub-quadratic local alignment algorithm.

After the optimal scores are computed, an alignment trace corresponding to the optimal score can be recovered in time complexity which is linear with the size of the trace, for both the global alignment and the local alignment problems.

The algorithms described in this paper are the first to approach fully compressed (both source and target strings are compressed) string alignment. The methods given in this paper can also be used by applications where both input strings are stored or transmitted in the form of LZ78 or LZW compressed sequences, thus providing an efficient solution to the problem of how to compare the two strings without having to decompress them first.

The only previously known sub-quadratic global alignment string comparison algorithm, by Masek and Paterson [31], is based on the Four Russians paradigm. The "Four Russians" algorithm divides the dynamic programming table into uniform sized blocks and relies on precomputed lookup tables, yielding an $O(n^2/\log n)$ time complexity, based on two assumptions. One is that the sequence elements come from a constant alphabet. The other, which they denote the "discreteness" condition, is that the weights (of substitutions and indels) are all rational numbers.

Our algorithms present a new approach and are better than the above algorithm in two aspects.

• The algorithms presented here are faster for compressible sequences. For such sequences, the complexity of our algorithms is $O(hn^2/\log n)$, where $h \le 1$ is the entropy of the sequence.

• Our algorithms are general enough to support scoring schemes with real number weights. For many scoring schemes, the rational number weights supported by Masek and Paterson's algorithm do not suffice. For example, the entries of PAM similarity matrices, as well as BLOSUM evolutionary distance matrices, are defined to be real numbers, computed as log-odds ratios, and therefore could be irrational. The paper by Masek and Paterson is concluded with the following statement: "The most important problem remaining is finding a better algorithm for the finite (in our terms constant) alphabet case without the discreteness condition". Here, more than twenty years later, this important open question will finally be answered!

These advantages are based on the following facts. First, our algorithm does not require any pre-computation of lookup-tables, and therefore can afford more flexible weight values. Also, instead of dividing the dynamic programming matrix into uniform sized blocks as did Masek and Paterson, we employ a variable sized block partition, as induced by Lempel-Ziv factorization of both source and target. The common denominator between blocks, maximized by the compression technique, is then re-cycled and used for computing the relevant information for each block in time which is linear with the length of its sides. In this sense, the approach described in this paper can be viewed as another example of speeding up dynamic programming by keeping and computing only a relevant subset of important values, as demonstrated in [10], [11], [27] and [37].

The remainder of this paper is organized as follows. Section 2 includes preliminaries. In section 3 we describe the global alignment solution using fully compressed string comparison. In section 4 we extend the solution to compute the highest scoring regions of local alignment. Section 5 contains a discussion of how to reduce the space complexity without impairing the time complexity, when computing global alignment over "discrete" scoring matrices.

A description of how to recover a path alignment trace in time linear with its …

[Figure 1 graphic: the alignment grid for A and B, with vertices numbered 0 to 8 on each axis, and the sub-graph G drawn separately.]

Scoring scheme matrix δ:

        -    a    c    g    t
   -        -1   -1   -1   -1
   a   -1    1   -1   -1   -1
   c   -1   -1    1   -1   -1
   g   -1   -1   -1    1   -1
   t   -1   -1   -1   -1    1

Figure 1: The alignment graph for comparing strings A = "ctacgaga" and B = "aacgacga". The scoring scheme matrix δ is shown in the lower left corner of the figure. The highest scoring global alignment paths originate in vertex (0,0), end in vertex (8,8) and have a total weight of 3. The highest scoring local alignment path has a total weight of 5 and corresponds to the alignment of substrings a = "acgaga" and b = "acgacga". A sub-graph G corresponding to the block for comparing substrings a = "ag" and b = "acg" is shown in the lower-right corner of the figure. Also specified are the values I for the entries of the input border for G (in white-shaded rectangles), and the values O of the output border of G (in grey-shaded rectangles), as set during a local alignment computation.

2 Preliminaries

2.1 Highest Scoring Paths in the Alignment Graph. The dynamic programming solution to the string comparison computation problem can be represented in terms of a weighted alignment graph [19] (see Figure 1). The weight of a given edge can be specified directly in the grid graph, or, as is frequently the case in biological applications, is given by a penalty matrix, denoted δ, which specifies the substitution cost for each pair of characters and the deletion cost for each character from the alphabet. Typically, in the biological domain, δ is negative for all operations except replacement of similar symbols, and the objective is to maximize the alignment score.

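The dynamic programming algorithm sets the value V(i,j) of each vertex (i,j) of the alignment graph, in order, to the score between the first i characters of A and the first j characters of B, using the following recurrence.

\[
V(i,j) = \max \left[\; V(i,j-1) + \delta(\epsilon, B_j),\;\; V(i-1,j) + \delta(A_i, \epsilon),\;\; V(i-1,j-1) + \delta(A_i, B_j) \;\right]
\]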
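The recurrence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `delta` and `global_score` are ours, the gap symbol is written "-", the border values V(i,0) and V(0,j) are taken to be sums of indel weights, and the scoring scheme is the one read off Figure 1 (+1 for a match, -1 otherwise), with the strings of that figure read as A = "ctacgaga" and B = "aacgacga".

```python
def delta(p, q):
    """Scoring scheme of Figure 1: +1 for identical characters, -1 otherwise ('-' is a gap)."""
    return 1 if p == q and p != "-" else -1

def global_score(A, B):
    """Fill the table V row by row and return V(|A|, |B|)."""
    V = [[0] * (len(B) + 1) for _ in range(len(A) + 1)]
    for j in range(1, len(B) + 1):          # first row: only insertions are possible
        V[0][j] = V[0][j - 1] + delta("-", B[j - 1])
    for i in range(1, len(A) + 1):
        V[i][0] = V[i - 1][0] + delta(A[i - 1], "-")   # first column: only deletions
        for j in range(1, len(B) + 1):
            V[i][j] = max(
                V[i][j - 1] + delta("-", B[j - 1]),          # horizontal edge
                V[i - 1][j] + delta(A[i - 1], "-"),          # vertical edge
                V[i - 1][j - 1] + delta(A[i - 1], B[j - 1]), # diagonal edge
            )
    return V[len(A)][len(B)]

print(global_score("ctacgaga", "aacgacga"))  # 3, the global alignment weight quoted in Figure 1
```

The table has (|A|+1)(|B|+1) entries, each filled in constant time, which is the quadratic cost the rest of the paper sets out to reduce.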

Smith and Waterman [39] showed that essentially the same O(|A||B|) dynamic programming solution can be used for local alignment, provided that the score of the alignment of two empty strings is defined as 0, and only pairs whose alignment scores are above 0 are of interest. The Smith-Waterman algorithm for local alignment will compute the following recurrence, which includes 0 as an additional option, and thus restricts the scores to non-negative values.

\[
S(i,j) = \max \left[\; 0,\;\; S(i,j-1) + \delta(\epsilon, B_j),\;\; S(i-1,j) + \delta(A_i, \epsilon),\;\; S(i-1,j-1) + \delta(A_i, B_j) \;\right]
\]

The score for the most similar substrings is found in the highest scoring nodes in the alignment graph.
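A minimal sketch of the local recurrence follows, under the same assumptions as before: `delta` and `local_score` are illustrative names, "-" stands for the gap symbol, and the scoring scheme and strings are those read off Figure 1.

```python
def delta(p, q):
    """Scoring scheme of Figure 1: +1 for identical characters, -1 otherwise ('-' is a gap)."""
    return 1 if p == q and p != "-" else -1

def local_score(A, B):
    """Smith-Waterman table S; returns the highest value found in any node."""
    S = [[0] * (len(B) + 1) for _ in range(len(A) + 1)]   # borders stay at 0
    best = 0
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            S[i][j] = max(
                0,                                            # start a new alignment here
                S[i][j - 1] + delta("-", B[j - 1]),
                S[i - 1][j] + delta(A[i - 1], "-"),
                S[i - 1][j - 1] + delta(A[i - 1], B[j - 1]),
            )
            best = max(best, S[i][j])
    return best

print(local_score("ctacgaga", "aacgacga"))  # 5, the local alignment weight quoted in Figure 1
```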

2.2 A Block Partition of the Alignment Graph based on LZ78 Factorization. The traditional aim of text compression is efficient use of resources such as storage and bandwidth. Here, we will compress the sequences in order to speed up the alignment process. Note that this approach, denoted "acceleration by text-compression", has been recently applied to a related problem, that of exact string matching [22], [30], [38].

It should also be mentioned that another related problem, that of exact string matching in compressed text without decoding it, which is often referred to as "compressed pattern matching", has been studied extensively [3], [13], [34]. Along these lines, string search in compressed text was developed for the compression paradigm of LZ78 [45], and its subsequent variant LZW [43], as described in [23], [35]. A more challenging problem is that of "fully compressed" pattern matching when both the pattern and text strings are compressed [16], [17].

For the LZ78-LZW paradigm, compressed matching has been extended and generalized to that of approximate pattern matching (finding all occurrences of a short sequence within a long one allowing up to k changes) in [21], [33].

The LZ compression methods are based on the idea of self reference: while the text file is scanned, substrings or phrases are identified and stored in a dictionary, and whenever, later in the process, a phrase or concatenation of phrases is encountered again, this is compactly encoded by suitable pointers [28], [44], [45].

LZ78 factorization differs from previous LZ compression algorithms in the choice of code words. Instead of allowing pointers to reference any string that has appeared previously, the text seen so far is parsed into phrases, where each phrase is the longest matching phrase seen previously plus one character. For example, the string S = "aacgacg" is divided into four phrases: a, ac, g, acg. Each phrase is encoded as an index to its prefix, plus the extra character. The new phrase is then added to the list of phrases that may be referenced.
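The parsing rule can be sketched as follows. This is an illustrative implementation, not the authors' code: the name `lz78_parse` and the triple layout are our own, phrase number 0 denotes the empty phrase, and a leftover at the end of the input that merely repeats an existing phrase is emitted with an empty extra character (an assumption about how to treat the tail).

```python
def lz78_parse(s):
    """Return the LZ78 phrases of s as (prefix_index, extra_char, phrase_text) triples."""
    trie = {}        # (index of prefix phrase, extra character) -> phrase index
    texts = [""]     # texts[k] is the text of phrase k; phrase 0 is the empty phrase
    phrases = []
    node = 0         # index of the longest phrase matched so far
    for ch in s:
        if (node, ch) in trie:           # keep extending the longest known phrase
            node = trie[(node, ch)]
            continue
        trie[(node, ch)] = len(texts)    # new phrase = known phrase + one character
        phrases.append((node, ch, texts[node] + ch))
        texts.append(texts[node] + ch)
        node = 0
    if node:                             # tail that repeats an existing phrase
        phrases.append((node, "", texts[node]))
    return phrases

print([text for _, _, text in lz78_parse("aacgacg")])   # ['a', 'ac', 'g', 'acg']
print([text for _, _, text in lz78_parse("ctacgaga")])  # ['c', 't', 'a', 'cg', 'ag', 'a']
```

The dictionary `trie` plays the role of the trie mentioned in the text: each key is an edge from a phrase to its one-character extension, so following it takes constant time per input character.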

[Figure 2 graphic, panels: "Trie for A", "Trie for B", "LZ78-Partitioned Alignment Graph", and "Graph G for Block (5,4)" with its left prefix (5,2), diagonal prefix (3,2) and top prefix (3,4).]

Figure 2: The block partition of the alignment graph, and the tries corresponding to LZ78 parsing of strings A = "ctacgaga" and B = "aacgacga". Note that for the block G in this example, xa = "ag", yb = "acg", $\ell_r = 2$, $\ell_c = 3$, i = 5 and j = 4. (The new cell of G, which does not appear in any of the prefix blocks, is the rightmost cell at the bottom row of G, and can be distinguished by its white color.) This figure continues Figure 1.

Since each phrase is distinct, the following upper bound applies to the possible number of phrases obtained by LZ78 factorization.

Theorem 2.1. (Ziv and Lempel 1976 [28].) Given a sequence S of size n over a constant alphabet, the maximal number of distinct phrases in S is $O(n/\log n)$.

Although this upper bound applies to any sequence over a constant alphabet, it has been shown that in many cases we can do better than that.

Intuitively, the LZ78 algorithm compresses the sequence because it is able to discover some repeated patterns. Therefore, in order to compute a tighter upper bound on the number of phrases obtained by LZ78 factorization for "compressible" sequences, the repetitive nature of the sequence needs to be quantified. One of the fundamental ideas in information theory is that of entropy, denoted $0 \le h \le 1$, which is a measure of the amount of disorder or randomness, or inversely, the amount of order or redundancy in a sequence. Entropy is small when there is a lot of order (i.e., the sequence is repetitive) and large when there is a lot of disorder. The entropy of a sequence should ideally reflect the ratio between the size of the sequence after it has been compressed, and the length of the uncompressed sequence.

The number of distinct phrases obtained by LZ78 factorization has been shown to be $O(hn/\log n)$ for most texts [5], [7], [28], [45]. Note that for any other text over a constant alphabet, the upper bound above still applies by setting h to 1.

3 The Global Alignment Solution

3.1 Definitions and Basic Observations. The alignment graph will be partitioned as follows. Strings A and B will be parsed using LZ78 factorization. This induces a partition of the alignment graph for comparing A with B into variable sized blocks (see Figure 2). Each block will correspond to a comparison of an LZ phrase of A with an LZ phrase of B.

Let xa denote a phrase in A obtained by extending a previous phrase x of A with character a, and yb denote a phrase in B, obtained by extending a previous phrase y of B with character b.

From now on we will focus on the computations necessary for a single block of the alignment graph.

Consider the block G which corresponds to the comparison of xa and yb. We define the input border I as the left and top borders of G, and the output border O as the bottom and right borders of G. (The node entries on the input border are numbered in a clockwise direction, and the node entries on the output border are numbered in a counter-clockwise direction.)

Rather than filling in the values of each vertex in G, as does the classical dynamic programming algorithm, the only values computed for each block will be those on its I/O borders (see Figures 1, 5A). Intuitively, this is the reason behind the efficiency gain.

Let $\ell_r$ denote the number of rows in G, $\ell_r = |xa|$. Let $\ell_c$ denote the number of columns in G, $\ell_c = |yb|$. Let $t = \ell_r + \ell_c$. Clearly, $|I| = |O| = t$.

1. The left prefix of G denotes the block comparing phrase xa of A and phrase y of B.

2. The diagonal prefix of G denotes the block comparing phrase x of A and phrase y of B.

3. The top prefix of G denotes the block comparing phrase x of A and phrase yb of B.

Observation 1. When traversing the blocks of an LZ78 parsed alignment graph in a left-to-right, top-to-bottom order, the blocks for the left prefix, diagonal prefix and top prefix of G are encountered prior to block G.

Note that the graph for the left prefix of G is identical to the subgraph of G containing all columns but the last one. More specifically, both the structure and the weights of the edges of these two graphs are identical, but the weights to be assigned to the vertices during the similarity computation may vary according to the input border values. The same holds for the top prefix and diagonal prefix graphs. The only new cell in G, which does not appear in any of its prefix block graphs, is the cell for comparing a and b.

3.2 I/O Propagation Across G. The work for each block will consist of two stages (a similar approach is shown in [6, 20, 26, 27]).

1. Encoding: Study the structure of G and represent it in an efficient way.

2. Propagation: Given I and the encoding of G, constructed in the previous stage, compute O for G.

The structure of G will be encoded by computing weights of optimal paths connecting each entry of its input border with each entry of its output border. The following DIST matrix will be used (see Figure 3).

Definition 3.1. DIST[i,j] stores the weight of the optimal path from entry i of the input border of G to entry j of its output border.

DIST matrices have also been used in [4], [6], [20], [27] and [37].

Given input row I and the DIST for G, the weight of output row vertex $O_j$ can be computed as follows.

\[
O_j = \max_{r=0}^{t} \left\{ I_r + DIST[r,j] \right\}
\]

$O_j$ is the maximum of column j of the following OUT matrix, which merges the information from input row I and DIST (see Figure 3).

Definition 3.2. $OUT[i,j] = I_i + DIST[i,j]$.
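To make Definitions 3.1 and 3.2 concrete, the sketch below builds DIST by brute force (one small dynamic program per input-border vertex) and then propagates I to O by taking column maxima of OUT. It is only an illustration: the paper never computes DIST this way, all function names are ours, impossible connections are represented by minus infinity, and the border numbering (input clockwise and output counter-clockwise, both starting at the bottom-left vertex) is our reading of the text. The block and the I values are those shown in Figure 1 for xa = "ag", yb = "acg".

```python
NEG_INF = float("-inf")

def delta(p, q):
    """Scoring scheme of Figure 1: +1 for identical characters, -1 otherwise ('-' is a gap)."""
    return 1 if p == q and p != "-" else -1

def borders(rows, cols):
    """Input border (clockwise) and output border (counter-clockwise), from the bottom-left vertex."""
    inp = [(r, 0) for r in range(rows, -1, -1)] + [(0, c) for c in range(1, cols + 1)]
    out = [(rows, c) for c in range(cols + 1)] + [(r, cols) for r in range(rows - 1, -1, -1)]
    return inp, out

def best_from(src, xa, yb):
    """Weights of the best paths from src to every block vertex reachable from it."""
    sr, sc = src
    w = {src: 0}
    for r in range(sr, len(xa) + 1):
        for c in range(sc, len(yb) + 1):
            if (r, c) == src:
                continue
            cand = []
            if c > sc:
                cand.append(w[(r, c - 1)] + delta("-", yb[c - 1]))
            if r > sr:
                cand.append(w[(r - 1, c)] + delta(xa[r - 1], "-"))
            if r > sr and c > sc:
                cand.append(w[(r - 1, c - 1)] + delta(xa[r - 1], yb[c - 1]))
            w[(r, c)] = max(cand)
    return w

def dist_matrix(xa, yb):
    """DIST[i][j]: best path weight from input entry i to output entry j."""
    inp, out = borders(len(xa), len(yb))
    return [[best_from(v, xa, yb).get(u, NEG_INF) for u in out] for v in inp]

def propagate(I, DIST):
    """O[j] is the maximum of column j of OUT[i][j] = I[i] + DIST[i][j]."""
    return [max(I[i] + DIST[i][j] for i in range(len(I))) for j in range(len(DIST[0]))]

DIST = dist_matrix("ag", "acg")
print(propagate([1, 2, 3, 2, 1, 3], DIST))  # [1, 3, 3, 4, 2, 3], the O values of Figure 3
```

The naive propagation above inspects all entries of OUT; the point of the following sections is that the same column maxima can be obtained while querying only a linear number of entries.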

Aggarwal and Park [2] and Schmidt [37] observed that DIST matrices are Monge arrays.

DIST matrix (* marks an undefined entry), with the input border values I to its left:

              col 0  col 1  col 2  col 3  col 4  col 5
   I_0 = 1      0     -1     -2     -3      *      *
   I_1 = 2     -1     -1     -2     -1     -3      *
   I_2 = 3     -2      0      0      1     -1     -3
   I_3 = 2      *     -2     -2      0     -2     -2
   I_4 = 1      *      *     -2      0     -1     -1
   I_5 = 3      *      *      *     -2     -1      0

OUT matrix:

              col 0  col 1  col 2  col 3  col 4  col 5
                1      0     -1     -2     -∞     -∞
                1      1      0      1     -1     -∞
                1      3      3      4      2      0
              -12      0      0      2      0      0
              -13    -13     -1      1      0      0
              -14    -14    -14      1      2      3

               O_0    O_1    O_2    O_3    O_4    O_5
                1      3      3      4      2      3

Figure 3: The DIST matrix which corresponds to the subsequences "ag", "acg", the OUT matrix obtained by adding the values of I to the rows of DIST, and the O containing the column maxima of OUT. This figure continues Figures 1 and 2.

Definition 3.3. A matrix M[0...m, 0...n] is Monge if either condition 1 or 2 below holds for all a, b = 0...m; c, d = 0...n:

1. convex condition: $M[a,c] + M[b,d] \le M[b,c] + M[a,d]$ for all a < b and c < d.

2. concave condition: $M[a,c] + M[b,d] \ge M[b,c] + M[a,d]$ for all a < b and c < d.

Since DIST is Monge, so is OUT, which is a DIST with constants added to its rows.

An important property of Monge arrays is that of being totally monotone.

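Definition 3.4. A matrix M[0...m, 0...n] is totally monotone if either condition 1 or 2 below holds for all a, b = 0...m; c, d = 0...n:

1. convex condition: $M[a,c] \ge M[b,c] \implies M[a,d] \ge M[b,d]$ for all a < b and c < d.

2. concave condition: $M[a,c] \le M[b,c] \implies M[a,d] \le M[b,d]$ for all a < b and c < d.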

Note that the Monge property implies total monotonicity, but the converse is not true. Therefore, both DIST and OUT are totally monotone by the concave condition.

Aggarwal et al. [1] gave a recursive algorithm, nicknamed SMAWK in the literature, which can compute in O(n) time all row and column maxima of an n × n totally monotone matrix, by querying only O(n) elements of the array. Hence, one could use SMAWK to compute the output row O by querying only O(n) elements of OUT. Clearly, if both the full DIST and all entries of I are available, then computing an element of OUT is O(1) work.
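The benefit of total monotonicity can be illustrated without reproducing SMAWK itself. The sketch below is a simpler divide-and-conquer stand-in, not the algorithm of [1]: it uses only the fact that, under the concave condition, the bottommost row holding a column maximum never moves up as the column index grows, and it queries O(t log t) entries rather than the O(t) achieved by SMAWK. The function name `column_maxima` is ours, and the matrix is the OUT matrix of Figure 3 as we read it, with its undefined entries already filled in.

```python
NEG_INF = float("-inf")

def column_maxima(M):
    """Column maxima of a matrix that is totally monotone by the concave condition."""
    result = [None] * len(M[0])

    def solve(c_lo, c_hi, r_lo, r_hi):
        if c_lo > c_hi:
            return
        mid = (c_lo + c_hi) // 2
        best = r_lo                       # bottommost row of the maximum in column mid
        for r in range(r_lo, r_hi + 1):
            if M[r][mid] >= M[best][mid]:
                best = r
        result[mid] = M[best][mid]
        solve(c_lo, mid - 1, r_lo, best)  # columns to the left: maxima lie in rows <= best
        solve(mid + 1, c_hi, best, r_hi)  # columns to the right: maxima lie in rows >= best

    solve(0, len(M[0]) - 1, 0, len(M) - 1)
    return result

OUT = [
    [1, 0, -1, -2, NEG_INF, NEG_INF],
    [1, 1, 0, 1, -1, NEG_INF],
    [1, 3, 3, 4, 2, 0],
    [-12, 0, 0, 2, 0, 0],
    [-13, -13, -1, 1, 0, 0],
    [-14, -14, -14, 1, 2, 3],
]
print(column_maxima(OUT))  # [1, 3, 3, 4, 2, 3]
```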

For various solutions to related problems, which also utilize Monge and Total Monotonicity properties, we refer the interested reader to [8], [9], [14], [15], [24] and [27]. In order to efficiently utilize these properties here, we need to address the following two problems.

1. How to efficiently compute DIST and represent it in a format which allows direct access to its entries. This will be done in section 3.2.2.

2. SMAWK is intended for a full, rectangular matrix. However, both DIST and its corresponding OUT are not rectangular. Since paths in an alignment graph can only assume a left-to-right, top-to-bottom direction, connections between some input border vertices and some output border vertices are impossible. Therefore, the matrices are missing both a lower left triangle and an upper right triangle (see Figure 3).

3.2.1 Addressing the Rectangle Problem. The undefined entries of OUT can be complemented in constant time each, as follows.

1. The missing upper right triangle entries can be completed by setting the value of any entry OUT[i,j] in this triangle to $-\infty$.

2. Let k denote the maximal absolute value of a score in δ. The missing lower left triangle entries can be completed by setting the value of any OUT[i,j] in this triangle to $-(n+i+1)k$.

Lemma 3.1. Complementing the undefined entries as described above preserves the concave total monotonicity condition of OUT, and does not introduce new column maxima.

Proof. 1. Upper Right Triangle: All similarity scores in the alignment … the shape of the redefined upper-right triangle, once a $-\infty$ value in row a is encountered, all future values in row a are also $-\infty$. The future values of row b could either be finite or $-\infty$. Therefore, $OUT[a,d] \le OUT[b,d]$ for all d > c.

2. Lower Left Triangle: The worst score appearing in the alignment graph is lower bounded by $-nk$. Since i is always greater than or equal to zero, the complemented values in the lower left triangle are upper-bounded by $-(n+1)k$ and no new column maxima are introduced. Also, for any complemented entry OUT[b,c] in the lower left triangle, OUT[b,c] < OUT[a,c] for all a < b, and therefore the concave total monotonicity condition holds.
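The lemma can be checked numerically on the running example. The sketch below is an illustration under assumptions: the OUT values are our reading of Figure 3 (with `None` for undefined entries), n = 8 and k = 1 are the string length and maximal absolute score of Figure 1, and an undefined entry is attributed to the upper right triangle when it lies above the main diagonal. It also shows why the lower left triangle is not simply filled with minus infinity: equal fillers in two rows would break the implication of Definition 3.4.

```python
NEG_INF = float("-inf")

def complement(OUT, n, k):
    """Fill undefined entries (None) of OUT as described in Section 3.2.1."""
    filled = []
    for i, row in enumerate(OUT):
        new_row = []
        for j, v in enumerate(row):
            if v is not None:
                new_row.append(v)
            elif j > i:
                new_row.append(NEG_INF)            # upper right triangle
            else:
                new_row.append(-(n + i + 1) * k)   # lower left triangle
        filled.append(new_row)
    return filled

def concave_totally_monotone(M):
    """Check M[a][c] <= M[b][c] implies M[a][d] <= M[b][d] for all a < b, c < d."""
    rows, cols = len(M), len(M[0])
    return all(
        M[a][d] <= M[b][d]
        for a in range(rows) for b in range(a + 1, rows)
        for c in range(cols) for d in range(c + 1, cols)
        if M[a][c] <= M[b][c]
    )

OUT = [
    [1, 0, -1, -2, None, None],
    [1, 1, 0, 1, -1, None],
    [1, 3, 3, 4, 2, 0],
    [None, 0, 0, 2, 0, 0],
    [None, None, -1, 1, 0, 0],
    [None, None, None, 1, 2, 3],
]
filled = complement(OUT, n=8, k=1)
naive = [[NEG_INF if v is None else v for v in row] for row in OUT]

print(concave_totally_monotone(filled))   # True
print(concave_totally_monotone(naive))    # False: rows 4 and 5 tie at -inf in column 0
print([max(col) for col in zip(*filled)]) # [1, 3, 3, 4, 2, 3], same maxima as before
```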

3.2.2 Incremental Update of the new DIST Information for G. In this section we will show how to efficiently compute the new DIST information for G, using the DIST representations previously computed for its prefix blocks, plus the information of its new cell.

When processing a new block G, we will compute the scores of t new optimal paths, leading from the input border to the new vertex $(\ell_r, \ell_c)$ in the lowest, rightmost corner of G. These values correspond to column $\ell_c$ of the DIST matrix for G, and can be computed as follows.

Entry [i] in column $\ell_c$ of the DIST for G contains the weight of the optimal path from entry i in the input border of G to vertex $(\ell_r, \ell_c)$. This path must go through one of the three vertices $(\ell_r - 1, \ell_c)$, $(\ell_r - 1, \ell_c - 1)$ or $(\ell_r, \ell_c - 1)$. Therefore, the weight of the optimal path from entry i in the input border of G to $(\ell_r, \ell_c)$ is equal to the maximum among the following three values:

1. Entry [i] of column $\ell_c - 1$ of the DIST for the left prefix of G, plus the weight of the horizontal edge leading into $(\ell_r, \ell_c)$.

2. Entry [i] of column $\ell_c - 1$ of the DIST for the diagonal prefix of G, plus the weight of the diagonal edge leading into $(\ell_r, \ell_c)$.

3. Entry [i] of column $\ell_c$ of the DIST for the top prefix of G, plus the weight of the vertical edge leading into $(\ell_r, \ell_c)$.
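The three-way maximum can be checked against a direct computation. In the sketch below, which is illustrative only, the prefix-block columns are obtained by brute force instead of being looked up, and all names are ours. One detail is an assumption on our part: since the diagonal and top prefix blocks have one row fewer than G, entry i of G's input border is taken to correspond to entry i - 1 of theirs, and entries that do not exist in a prefix block contribute minus infinity.

```python
NEG_INF = float("-inf")

def delta(p, q):
    """Scoring scheme of Figure 1: +1 for identical characters, -1 otherwise ('-' is a gap)."""
    return 1 if p == q and p != "-" else -1

def best_from(src, xa, yb):
    """Weights of the best paths from src to every block vertex reachable from it."""
    sr, sc = src
    w = {src: 0}
    for r in range(sr, len(xa) + 1):
        for c in range(sc, len(yb) + 1):
            if (r, c) == src:
                continue
            cand = []
            if c > sc:
                cand.append(w[(r, c - 1)] + delta("-", yb[c - 1]))
            if r > sr:
                cand.append(w[(r - 1, c)] + delta(xa[r - 1], "-"))
            if r > sr and c > sc:
                cand.append(w[(r - 1, c - 1)] + delta(xa[r - 1], yb[c - 1]))
            w[(r, c)] = max(cand)
    return w

def dist_column(xa, yb, j):
    """Brute-force column j of the DIST for the block comparing xa with yb."""
    rows, cols = len(xa), len(yb)
    inp = [(r, 0) for r in range(rows, -1, -1)] + [(0, c) for c in range(1, cols + 1)]
    out = [(rows, c) for c in range(cols + 1)] + [(r, cols) for r in range(rows - 1, -1, -1)]
    return [best_from(v, xa, yb).get(out[j], NEG_INF) for v in inp]

def new_column(xa, yb):
    """Column l_c of the DIST for G, assembled from the three prefix blocks."""
    x, a, y, b = xa[:-1], xa[-1], yb[:-1], yb[-1]
    lr, lc = len(xa), len(yb)
    left = dist_column(xa, y, lc - 1)   # covers entries 0 .. t-1 of G's input border
    diag = dist_column(x, y, lc - 1)    # covers entries 1 .. t-1
    top = dist_column(x, yb, lc)        # covers entries 1 .. t
    column = []
    for i in range(lr + lc + 1):
        via_left = left[i] + delta("-", b) if i < len(left) else NEG_INF
        via_diag = diag[i - 1] + delta(a, b) if 1 <= i <= len(diag) else NEG_INF
        via_top = top[i - 1] + delta(a, "-") if 1 <= i <= len(top) else NEG_INF
        column.append(max(via_left, via_diag, via_top))
    return column

print(new_column("ag", "acg"))                                  # [-3, -1, 1, 0, 0, -2]
print(new_column("ag", "acg") == dist_column("ag", "acg", 3))   # True
```

Each entry of the new column costs a constant number of operations once the three prefix columns are at hand, which is where the O(t) bound per block comes from.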

3.2.3 Maintaining Direct Access to DIST Columns. In order to compute an entry of OUT in constant time during the execution of SMAWK, direct access to DIST entries is necessary. This is not straightforward, since, as shown in the previous section, for each block only one new DIST column has been computed and stored. All other columns besides column $\ell_c$ of the DIST for G need to be obtained from G's prefix ancestor blocks.

Therefore, before the execution of SMAWK begins, a vector with pointers to all t+1 columns of the DIST for G is constructed (see Figure 4). This vector is no longer needed after the computations for G have been completed, and its space can then be freed.

[Figure 4 graphic, panels: "Trie for A", "Trie for B", "Block Table", and the vector of column pointers "DIST(5,4)" with columns 0 to 5; the new column stored for block (5,4) is (-3, -1, 1, 0, 0, -2).]

Figure 4: A table containing an entry for each block of the alignment graph. Entry (i,j) of the table corresponds to the block whose substrings are represented by node i in the trie for A and node j in the trie for B. The entry for each block in the table points to the start of its new DIST column. Also shown is the vector which contains pointers to all columns of the DIST for block (5,4), as obtained from its ancestor prefix blocks. This figure continues Figures 1, 2 and 3.

The pointers to all columns of the DIST for G are assembled as follows. Column $\ell_c$ is set to the newly constructed vector for G. All columns of indices smaller than $\ell_c$ are obtained via $\ell_c$ recursive calls to left prefix blocks of G. All columns of indices greater than $\ell_c$ are obtained via $\ell_r$ recursive calls to top prefix blocks of G.

3.2.4 Querying a Prefix Block and Obtaining its DIST Column in Constant time. The LZ78 phrases form a trie (see Figure 2), and the string to be compressed is encoded as a sequence of names of prefixes of the trie. Each node in the trie contains the serial number of the phrase it represents. Since each block corresponds to a comparison of a phrase from A with a phrase from B, each block will be identified by a pair of numbers, composed of the serial numbers for its corresponding phrases in the tries for A and B.

Another data structure to be constructed is a Block Table (see Figure 4), containing an entry for each partitioned block of the alignment graph. The entry for each block in the table points to the start of its new DIST column, and can be directly accessed via the block's phrase number index pair.

The left prefix of G can be identified in constant time as a pair of phrase numbers, the first identical to the serial number of xa, and the second identical to the serial number of y. The diagonal and top prefix blocks are identified similarly. Given the pair of identification numbers for a block, a pointer to the corresponding DIST column can then be directly obtained from the Block Table.

3.3 Time and Space Analysis

Assume sequence size n and sequence entropy $h \le 1$. The LZ78 factorization algorithm will parse the strings and construct the tries for A and B in O(n) time. The resulting number of phrases in both A and B is $O(hn/\log n)$. The number of resulting blocks in the alignment graph is equal to the number of phrases in A times the number of phrases in B, and is therefore $O(h^2n^2/(\log n)^2)$.

For each block G, the following information (1-3) is computed, in time and space complexity linear with the size of its I/O borders:

1. Updating the Encoding Structure for G. The prefix blocks of G can be accessed in constant time. The vectors of DIST column pointers for the prefix blocks have already been freed. However, since each prefix block directly points to its newly computed DIST column, all values needed for the computations are still available. Since each entry of the new DIST column for G is set to the maximum among up to three sums of pairs, the new DIST column for G can be constructed in O(t) time and space.

2. Maintaining Direct Access to DIST columns. Since prefix blocks and their DIST columns can be accessed in constant time, the vector with pointers to columns of the DIST for G can be set in O(t) time.

3. Propagation for G. Using the information computed for G, and given the I for G obtained from the O vectors for the block above G and the block to its left, the values of O for G are computed via SMAWK Matrix Searching in O(t) time.

Total Complexity. Since the work and space for each block is linear with the size of its I/O borders, the total time and space complexity is linear with the total size of the borders of the blocks. The block borders form $O(hn/\log n)$ rows of size |B| each, and $O(hn/\log n)$ columns of size |A| each, in the alignment graph (see Figure 2). Therefore, the total time and space complexity is $O(hn^2/\log n)$.
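The counting argument can be written out as a short derivation. The symbols $P_A$ and $P_B$ (numbers of phrases of A and B), $x_p$ and $y_q$ (the p-th phrase of A and the q-th phrase of B) and $t_G$ (border size of block G) are introduced here for illustration and do not appear in the original text; the bounds $P_A, P_B = O(hn/\log n)$ are those of section 2.2, and $|A|, |B| = O(n)$.

```latex
\[
\sum_{G} t_G
  \;=\; \sum_{p=1}^{P_A} \sum_{q=1}^{P_B} \bigl( |x_p| + |y_q| \bigr)
  \;=\; P_B \sum_{p=1}^{P_A} |x_p| \;+\; P_A \sum_{q=1}^{P_B} |y_q|
  \;=\; P_B\,|A| + P_A\,|B|
  \;=\; O\!\left( \frac{h n^2}{\log n} \right).
\]
```

The first term counts the block columns (each of the $P_B$ column boundaries has total length |A|), and the second counts the block rows, matching the two families of borders described above.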

4 Extensions to Local Alignment

When computing the highest local alignment score, the added challenge is that an optimal path could begin and end inside any block. Therefore, we will modify O to consider the additional paths originating inside G.

Also, in addition to the DIST described in section 3, we compute for each block G the following data structures (see Figures 5B, 5C).

1. E is a vector of size t. E[i] contains the value of the highest scoring path which starts in vertex i of the input border of G and ends inside G.

2. S is a vector of size t. S[i] contains the value of the highest scoring path which starts inside G and ends in vertex i of the output border of G.

3. C is the value of the highest scoring path contained in G, that is, the highest scoring path which originates inside G and ends inside G.

4. F is the weight of the highest scoring path ending in G. This path could either begin and end inside G (a C-path) or start outside G and end inside G (an I-path followed by an E-path).

[Figure 5 graphic: three panels A, B, C, each showing a block with its borders I and O; panel A marks a path DIST[i,j] from input entry i to output entry j, panels B and C mark the E, S and C paths.]

Figure 5: A. The I/O path weight vectors computed for each block in the global alignment solution. DIST[i,j] will be set to the highest scoring path connecting vertex i in the input border with vertex j in the output border. B, C. The vectors of optimal path weights considered for the local alignment computation.

The overall highest local alignment score for comparing A and B can be computed as the maximum among the F values of each block.

The two stages described in section 3.2 will be extended as follows.

4.1 Encoding. DIST is computed as described in section 3.2. In addition, the values of E, S and C are computed as follows.

1. Computing the values of E. E[i] is computed as the maximum between E[i] for the left prefix of G, E[i] for the top prefix of G, and $DIST[i, \ell_c]$.

2. Computing the values of S. The only new value computed for S is the Smith-Waterman score for the new vertex $(\ell_r, \ell_c)$. Given the Smith-Waterman scores for vertices $(\ell_r - 1, \ell_c - 1)$ obtained from the diagonal prefix, $(\ell_r, \ell_c - 1)$ obtained from the left prefix and $(\ell_r - 1, \ell_c)$ obtained from the top prefix of G, and the weights of the 3 edges leading into vertex $(\ell_r, \ell_c)$, the Smith-Waterman score for vertex $(\ell_r, \ell_c)$ can be computed in O(1) time complexity, using the recursion given in section 2.1. The value for entry $\ell_c$ of S is set to this newly computed Smith-Waterman score for vertex $(\ell_r, \ell_c)$.

The values of all other entries of S are then set as follows. The first $\ell_c$ values of S are copied from the first $\ell_c$ values of the S computed for the left prefix of G. The last $\ell_r$ values are copied from the last $\ell_r$ values of the S vector for the top prefix of G.

3. Computing the value of C. C is computed as the maximum between the C value for the left prefix of G, the C value for the top prefix of G, and the newly computed $S[\ell_c]$ as described above.

4.2 Propagation.

1. Computing the values of O. Our objective is to set O[i] to the weight of the highest scoring path originating anywhere in the alignment graph and ending in entry i of the output border. Vector O will first be computed from the I and DIST for G as described in section 3.2. At this point entry O[i] reflects the weight of the optimal path starting anywhere outside G and ending in entry i of the output border. It needs to be updated with the weights of the highest scoring paths starting inside G. This is achieved by resetting O[i] to the maximum between O[i] and S[i].

2. Computing the values of F. F is computed as $\max\left(\max_{i=0}^{t} \{ I[i] + E[i] \},\; C\right)$.
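The definitions of E, C and F can be illustrated on the block of Figure 1 by brute force. This sketch deliberately ignores the incremental scheme of section 4.1 and recomputes every path from scratch; the names are ours, and two conventions are assumptions: an empty path (weight 0) counts as a path, and "inside G" includes the border vertices of the block.

```python
def delta(p, q):
    """Scoring scheme of Figure 1: +1 for identical characters, -1 otherwise ('-' is a gap)."""
    return 1 if p == q and p != "-" else -1

def best_from(src, xa, yb):
    """Weights of the best paths from src to every block vertex reachable from it."""
    sr, sc = src
    w = {src: 0}                      # the empty path has weight 0
    for r in range(sr, len(xa) + 1):
        for c in range(sc, len(yb) + 1):
            if (r, c) == src:
                continue
            cand = []
            if c > sc:
                cand.append(w[(r, c - 1)] + delta("-", yb[c - 1]))
            if r > sr:
                cand.append(w[(r - 1, c)] + delta(xa[r - 1], "-"))
            if r > sr and c > sc:
                cand.append(w[(r - 1, c - 1)] + delta(xa[r - 1], yb[c - 1]))
            w[(r, c)] = max(cand)
    return w

def block_quantities(I, xa, yb):
    """Return (E, C, F) for the block comparing xa with yb, given its input border I."""
    rows, cols = len(xa), len(yb)
    inp = [(r, 0) for r in range(rows, -1, -1)] + [(0, c) for c in range(1, cols + 1)]
    # E[i]: best path starting at input entry i and ending anywhere in the block
    E = [max(best_from(v, xa, yb).values()) for v in inp]
    # C: best path starting and ending anywhere in the block
    C = max(max(best_from((r, c), xa, yb).values())
            for r in range(rows + 1) for c in range(cols + 1))
    F = max(max(I[i] + E[i] for i in range(len(I))), C)
    return E, C, F

print(block_quantities([1, 2, 3, 2, 1, 3], "ag", "acg"))  # ([0, 0, 1, 0, 0, 0], 1, 4)
```

With the I values of Figure 1, F = 4 for this block, which agrees with the largest Smith-Waterman value found at any vertex of the block when the full table for A = "ctacgaga" and B = "aacgacga" is filled in directly.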

4.3 Time and Space Analysis

Encoding. Since, as shown in section 3.2.3, each prefix block of G can be accessed in constant time, the values of the S and E vectors for G can be computed and stored in O(t) time and space, and the C value for G can be computed in constant time and space.

Propagation. Given the vectors computed in the encoding stage, the values of O and F can be computed in O(t) time each as described above.

The weight of the highest scoring path in the alignment graph can then be computed in an additional $O(h^2n^2/(\log n)^2)$ time as the maximum value among the F values computed for each block.

Total Complexity. Since the work and space for each block is linear with the size of its I/O borders, the total time and space complexity for the local alignment solution is $O(hn^2/\log n)$.

5 Reducing the Space Complexity

When computing global alignment with scoring matrices which follow the "discreteness" condition (see Section 1), the efficient alignment stage algorithm described in [27] can be extended to support full propagation from the leftmost and upper boundaries to the bottom and rightmost boundaries of G.

This extended propagation algorithm can then be used to compute the values of the global alignment O for G, given the I for G and a minimal encoding of the DIST for G. The advantage of this minimal encoding of DIST is that rather than saving an O(t) sized DIST column per block, we only need to save a constant number of values per block. The encoding for the new DIST column of each block can be computed and stored in constant time and space from the information stored for the left, diagonal and top prefix blocks of G, using the technique described in section 6 of [37].

This reduces the space complexity to $O(h^2n^2/(\log n)^2)$, while preserving the $O(hn^2/\log n)$ time complexity.

6 Conclusions

The results demonstrated in this paper are as follows.

• The algorithm presented in this paper is the first $O(hn^2/\log n)$ string comparison algorithm.

• This is the first sub-quadratic string comparison algorithm for general scoring tables whose weights are not restricted to rational numbers.

• We showed how to extend this result to a local alignment $O(hn^2/\log n)$ algorithm.

• For global alignment over "discrete" scoring matrices, we explained how the space complexity can be reduced to $O(h^2n^2/(\log n)^2)$, without impairing the $O(hn^2/\log n)$ time complexity.

In addition to the scores computed by dynamic programming, it is often desired to recover a meaningful trace of the optimal alignments. Optimal paths in the alignment graph (paths whose total weight is maximum) represent optimal alignments of A and B.

Without any added complexity, the current algorithmic infrastructure can be modified to support the recovery of an optimal global alignment path trace, as well as an optimal local alignment trace as defined by Erickson and Sellers [12], in time complexity which is linear with the size of the trace.

Due to lack of space, the description of how to recover the path alignment traces is reserved for the journal version of the paper.


Referenes

[1℄ Aggarwal, A., M. Klawe, S. Moran, P. Shor, and R. Wilber, Geometri

AppliationsofaMatrix-SearhingAlgorithm,Algorithmia,2,195-208(1987).

[2] Aggarwal, A., and J. Park, Notes on Searching in Multidimensional Monotone Arrays, Proc. 29th IEEE Symp. on Foundations of Computer Science, 497-512 (1988).

[3] Amir, A., G. Benson, and M. Farach, Let sleeping files lie: Pattern matching in Z-compressed files, J. of Comp. and Sys. Sciences, 52(2), 299-307 (1996).

[4] Apostolico, A., M. Atallah, L. Larmore, and S. McFaddin, Efficient parallel algorithms for string editing problems, SIAM J. Comput., 19, 968-998 (1990).

[5] Bell, T.C., J.C. Cleary, and I.H. Witten, Text Compression, Prentice Hall, (1990).

[6] Benson, G., A space efficient algorithm for finding the best nonoverlapping alignment score, Theoretical Computer Science, 145, 357-369 (1995).

[7] Crochemore, M., and W. Rytter, Text Algorithms, Oxford University Press, (1994).

[8] Eppstein, D., Sequence Comparison with Mixed Convex and Concave Costs, Journal of Algorithms, 11, 85-101 (1990).

[9] Eppstein, D., Z. Galil, and R. Giancarlo, Speeding Up Dynamic Programming, Proc. 29th IEEE Symp. on Foundations of Computer Science, 488-496 (1988).

[10] Eppstein, D., Z. Galil, R. Giancarlo, and G.F. Italiano, Sparse Dynamic Programming I: Linear Cost Functions, JACM, 39, 546-567 (1992).

[11] Eppstein, D., Z. Galil, R. Giancarlo, and G.F. Italiano, Sparse Dynamic Programming II: Convex and Concave Cost Functions, JACM, 39, 568-599 (1992).

[12] Erickson, B.W., and P.H. Sellers, Recognition of patterns in genetic sequences, in Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, D. Sankoff and J.B. Kruskal, eds., Addison-Wesley, Reading, MA, 55-91 (1983).

[13] Farach, M., and M. Thorup, String matching in Lempel-Ziv compressed strings, Algorithmica, 20, 388-404 (1998).

[14] Galil, Z., and R. Giancarlo, Speeding Up Dynamic Programming with Applications to Molecular Biology, Theoretical Computer Science, 64, 107-118 (1989).

[15] Galil, Z., and K. Park, A linear-time algorithm for concave one-dimensional dynamic programming, Info. Processing Letters, 33, 309-311 (1990).

[16] Gasieniec, L., M. Karpinski, W. Plandowski, and W. Rytter, Randomised efficient algorithms for compressed strings: the finger-print approach, Proc. 7th Annual Symposium On Combinatorial Pattern Matching, LNCS 1075, 39-49 (1996).

[17] Gasieniec, L., and W. Rytter, Almost optimal fully LZW compressed pattern matching, Data Compression Conference, J. Storer, ed., (1999).

[18] Giancarlo, R., Dynamic Programming: Special Cases, Pattern Matching Algorithms, edited by Apostolico, A. and Z. Galil, Oxford University Press, 201-232 (1997).

[19] Gusfield, D., Algorithms on Strings, Trees, and Sequences. Cambridge University

Regions of Maximum Alignment Score, SIAM J. Comput., 25(3), 648-662 (1996).

[21] Kärkkäinen, J., G. Navarro, and E. Ukkonen, Approximate String Matching over Ziv-Lempel Compressed Text, Proc. 11th Annual Symposium On Combinatorial Pattern Matching, LNCS 1848, 195-209 (2000).

[22] Kärkkäinen, J., and E. Ukkonen, Lempel-Ziv parsing and sublinear-size index structures for string matching, Proc. Third South American Workshop on String Processing (WSP'96), 141-155 (1996).

[23] Kida, T., M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa, Shift-And approach to pattern matching in LZW compressed text, Proc. 10th Annual Symposium On Combinatorial Pattern Matching, LNCS 1645, 1-13 (1999).

[24] Klawe, M., and D. Kleitman, An Almost Linear Algorithm for Generalized Matrix Searching, SIAM Jour. Discrete Math., 3, 81-97 (1990).

[25] Landau, G.M., E.W. Myers, and J.P. Schmidt, Incremental String Comparison, SIAM J. Comput., 27(2), 557-582 (1998).

[26] Landau, G.M., and M. Ziv-Ukelson, On the Shared Substring Alignment Problem, Proc. Symposium On Discrete Algorithms, 804-814 (2000).

[27] Landau, G.M., and M. Ziv-Ukelson, On the Common Substring Alignment Problem, Journal of Algorithms.

[28] Lempel, A., and J. Ziv, On the complexity of finite sequences, IEEE Transactions on Information Theory, 22, 75-81 (1976).

[29] Levenshtein, V.I., Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Phys. Dokl., 10, 707-710 (1966).

[30] Manber, U., A text compression scheme that allows fast searching directly in the compressed file, Proc. 5th Annual Symposium On Combinatorial Pattern Matching, LNCS 807, 113-124 (1994).

[31] Masek, W.J., and M.S. Paterson, A faster algorithm for computing string edit distances, J. Comput. Syst. Sci., 20, 18-31 (1980).

[32] Monge, G., Déblai et Remblai, Mémoires de l'Académie des Sciences, Paris (1781).

[33] Navarro, G., T. Kida, M. Takeda, A. Shinohara, and S. Arikawa, Faster Approximate String Matching Over Compressed Text, Proc. Data Compression Conference (DCC 2001), IEEE Computer Society, 459-468 (2001).

[34] Navarro, G., and M. Raffinot, A general practical approach to pattern matching over Ziv-Lempel compressed text, Proc. 10th Annual Symposium On Combinatorial Pattern Matching, LNCS 1645, 14-36 (1999).

[35] Navarro, G., and M. Raffinot, Boyer-Moore string matching over Ziv-Lempel compressed text, Proc. 11th Annual Symposium On Combinatorial Pattern Matching, LNCS 1848, 166-180 (2000).

[36] Sankoff, D., and J.B. Kruskal (editors), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA, (1983).

[37] Schmidt, J.P., All Highest Scoring Paths In Weighted Grid Graphs and Their Application To Finding All Approximate Repeats In Strings, SIAM J. Comput., 27(4), 972-992 (1998).

[38] Shibata, Y., T. Kida, S. Fukamachi, M. Takeda, A. Shinohara, T. Shinohara, and S. Arikawa, Speeding up pattern matching by text compression, CIAC 2000, LNCS 1767, 306-315 (2000).

[39] Smith, T.F., and M.S. Waterman, Identification of common molecular subse-

Parsing Scheme and Digital Search Trees, Theoretical Computer Science, 144, 161-197 (1995).

[41] Takeda, M., Y. Shibata, T. Matsumoto, T. Kida, A. Shinohara, S. Fukamachi, T. Shinohara, and S. Arikawa, Speeding up string pattern matching by text compression: The dawn of a new era, 42(3), pp. 370-384 (2001).

[42] Waterman, M.S., and M. Eggert, A new algorithm for best subsequence alignment with application to tRNA-rRNA comparisons, J. Molecular Biol., 197, 723-728 (1987).

[43] Welch, T.A., A Technique for High Performance Data Compression, IEEE Trans. on Computers, 17(6), 8-19 (1984).

[44] Ziv, J., and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, IT-23(3), 337-343 (1977).

[45] Ziv, J., and A. Lempel, Compression of individual sequences via variable rate coding, IEEE Trans. Inform. Th., 24, 530-536 (1978).