
Computing the Burrows–Wheeler transform in place and in small space


HAL Id: hal-01248855

https://hal.inria.fr/hal-01248855

Submitted on 29 Jun 2017



Maxime Crochemore, Roberto Grossi, Juha Kärkkäinen, Gad M. Landau

To cite this version:

Maxime Crochemore, Roberto Grossi, Juha Kärkkäinen, Gad M. Landau. Computing the Burrows–Wheeler transform in place and in small space. Journal of Discrete Algorithms, Elsevier, 2015, 32, pp. 44-52. ⟨10.1016/j.jda.2015.01.004⟩. ⟨hal-01248855⟩

Maxime Crochemore (a), Roberto Grossi (b, *), Juha Kärkkäinen (c), Gad M. Landau (d, e)

(a) King's College London, UK
(b) Dipartimento di Informatica, Università di Pisa, Italy
(c) Department of Computer Science, University of Helsinki, Finland
(d) Department of Computer Science, University of Haifa, Israel
(e) Department of Computer Science and Engineering, NYU-Poly, Brooklyn NY, USA

Abstract

We introduce the problem of computing the Burrows–Wheeler Transform (BWT) using small additional space. Our in-place algorithm does not need the explicit storage for the suffix sort array and the output array, as typically required in previous work. It relies on the combinatorial properties of the BWT, and runs in $O(n^2)$ time in the comparison model using $O(1)$ extra memory cells, apart from the array of $n$ cells storing the $n$ characters of the input text. We then discuss the time–space trade-off when $O(k \cdot \sigma_k)$ extra memory cells are allowed with $\sigma_k$ distinct characters, providing an $O((n^2/k + n) \log k)$-time algorithm to obtain (and invert) the BWT. For example, in real systems where the alphabet size is a constant, for any arbitrarily small $\epsilon > 0$, the BWT of a text of $n$ bytes can be computed in $O(n \epsilon^{-1} \log n)$ time using just $\epsilon n$ extra bytes.

1. Introduction

The Burrows–Wheeler Transform [4] (known as BWT) of a text string is at the heart of the bzip2 family of text compressors, and also finds applications in text indexing and sequence processing. Consider an input text string $T \equiv T[0..n-1]$ and the set of its suffixes $T_i \equiv T[i..n-1]$ ($0 \le i < n$) under the lexicographic order, where $T[n-1]$ is an endmarker character \$ smaller than any other character in $T$. The alphabet $\Sigma$ from which the characters in $T$ are drawn can be unbounded.

Title note: A preliminary version of the results in this paper appeared in [6]. The work of the second author has been supported in part by the Italian Ministry of Education, University, and Research (MIUR) under PRIN 2012C4E3KT project. The work of the third author has been supported by the Academy of Finland Grant 118653 (ALGODAN). The work of the fourth author has been partially supported by the National Science Foundation Award 0904246, Israel Science Foundation Grants 347/09 and 571/14, Yahoo, Grant No. 2008217 from the United States–Israel Binational Science Foundation (BSF) and DFG.

(*) Corresponding author.

E-mail addresses: Maxime.Crochemore@kcl.ac.uk (M. Crochemore), grossi@di.unipi.it (R. Grossi), Juha.Karkkainen@cs.helsinki.fi (J. Kärkkäinen), landau@cs.haifa.ac.il (G.M. Landau).

A classical way to define the BWT uses the $n$ circular shifts of the text $T = \texttt{mississippi\$}$, as shown in the first column of Table 1. We perform a lexicographic sort of these shifts, as shown in the second column: if we mark the last character from each of the circular shifts in this order, we obtain a sequence $L$ of $n$ characters that is called the BWT of $T$. Its relation to suffix sort is well known, as illustrated in the third column: the $r$th character in $L$ is $T[j-1]$ if and only if $T_j$ is the $r$th suffix in the sort (except the borderline case $j = 0$, for which we take $T[n-1]$ as character).

Table 1
BWT L for the text T = mississippi$ and its relation with suffix sort.

    Cyclic shifts   Sorted cyclic shifts   Suffixes
                    (last character = L)    i   T_i
    mississippi$    $mississipp i          11   $
    $mississippi    i$mississip p          10   i$
    i$mississipp    ippi$missis s           7   ippi$
    pi$mississip    issippi$mis s           4   issippi$
    ppi$mississi    ississippi$ m           1   ississippi$
    ippi$mississ    mississippi $           0   mississippi$
    sippi$missis    pi$mississi p           9   pi$
    ssippi$missi    ppi$mississ i           8   ppi$
    issippi$miss    sippi$missi s           6   sippi$
    sissippi$mis    sissippi$mi s           3   sissippi$
    ssissippi$mi    ssippi$miss i           5   ssippi$
    ississippi$m    ssissippi$m i           2   ssissippi$
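For contrast with the in-place method studied in this paper, the suffix-sort definition of the BWT can be sketched directly in C. This is only an illustration of the conventional approach, not code from the paper: the names `naive_bwt`, `cmp_suffix` and `g_text` are hypothetical, the text is assumed to be NUL-terminated after its endmarker, and the sketch deliberately allocates the $n$ integers and the separate output array that the paper sets out to avoid.

```c
#include <stdlib.h>
#include <string.h>

/* Text currently being sorted (hypothetical helper state for qsort). */
static const unsigned char *g_text;

/* Compare two suffixes given by their starting positions. */
static int cmp_suffix(const void *a, const void *b) {
    int i = *(const int *)a, j = *(const int *)b;
    return strcmp((const char *)g_text + i, (const char *)g_text + j);
}

/* Writes the BWT of T[0..n-1] into out[0..n-1] using an explicit suffix
   sort. Assumes T[n-1] is the endmarker, smaller than every other
   character, and that T[n] is a NUL byte. Returns 0 on success, -1 if
   the allocation fails. */
int naive_bwt(const unsigned char *T, int n, unsigned char *out) {
    int *sa = malloc((size_t)n * sizeof *sa);
    if (sa == NULL) return -1;
    for (int i = 0; i < n; i++) sa[i] = i;
    g_text = T;
    qsort(sa, (size_t)n, sizeof *sa, cmp_suffix);
    for (int r = 0; r < n; r++) {
        int j = sa[r];                            /* T_j is the r-th suffix */
        out[r] = (j == 0) ? T[n - 1] : T[j - 1];  /* borderline case j = 0  */
    }
    free(sa);
    return 0;
}
```

On the text of Table 1 this sketch reproduces the column L read from top to bottom.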

As can be seen in the example of Table 1, the BWT produces a text of the same length as the input text $T$. The transform is reversible since it is a one-to-one function when the input text is terminated by an endmarker \$. Thus, not only can we recover $T$ from $L$ alone, but typically $L$ is more compressible than $T$ itself using 0th-order compressors [21].

There are now efficient methods that convert $T$ to $L$ and vice versa, taking $O(n \log n)$ time for unbounded alphabets in the worst case [1] (see footnote 1). The BWT is also a key element of some compressed text indexing implementations due to the small amount of space it requires: some examples are the solution by Ferragina and Manzini [8] or that by Grossi et al. [10], where the transform is associated with the techniques of wavelet trees and of succinct data structures using rank-select queries on binary sequences [22].

One of the prominent applications of the BWT is for software dealing with Next Generation Sequencing, where millions of short strings, called reads, are mapped onto a reference genome. Typical and popular software of this type are Bowtie [19], BWA [18] and SOAP2 [16]. Here it is crucial that the genome is indexed in a compact manner to get reasonable running time. Space issues for computing the BWT are thus relevant: frequently the input data is so large that the input text $T$ stays in main memory while any additional data structure of similar size cannot fit in the rest of the main memory [14].

All the previous work for computing the BWT of $T$ relies on the fact that (a) we need first to store the suffix sorting of $T$ (also known as suffix array [20]), thus occupying $n$ memory cells for storing integers, and (b) we need to output the BWT in another array storing $n$ characters. Motivated by these observations, we want to study the case in which (a) and (b) are avoided, thus saving on the space occupied by them.

In this paper, our goal is to obtain the BWT by directly permuting $T$ and using just $O(1)$ memory cells, i.e., we aim at an in-place algorithm for computing the BWT. We consider the model in which the text $T$ is stored as an array of $n$ entries, where each entry stores exactly one character of $T$. Note that storing an integer usually takes more space than a character, so we assume that only the characters of $T$ can be kept in the array $T$. Moreover, $T$ is not read-only but it can be modified at any time, and just $O(1)$ additional memory cells (besides $T$) can be kept for storing auxiliary information (see footnote 2).

Note that our model represents some realistic situations in which one has to handle large text collections, or large genomic sequences, without relying on extra memory for (a) and (b). Hence it is crucial to maximize the amount of data that can fit into main memory: not storing explicitly (a) and (b) saves space, which is typically regarded as taking more than half of the total space required. For instance, DNA sequences are stored by using 2 bits per character and machine integers take 64 bits. Here we just need $2n$ bits to store the (genomic) text and save the $64n$ bits used for storing the intermediate suffix sorting in (a) and the $2n$ bits for storing the output of BWT in another array in (b): this means that during the BWT construction, we can fit almost 33 times more text using the same main memory size, thus eliminating the usage of the slower external memory for this time-consuming task in these cases.
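The figure quoted above can be checked with a short calculation. The following is only a sketch, using nothing beyond the sizes assumed in the paragraph (2 bits per character, 64 bits per machine integer):

```latex
\[
\underbrace{64n}_{\text{suffix sorting (a)}} \;+\; \underbrace{2n}_{\text{output array (b)}} \;=\; 66n \ \text{bits saved},
\qquad
\frac{66n}{2n} \;=\; 33 .
\]
```

In other words, the storage that is avoided amounts to 33 times the $2n$ bits occupied by the text itself.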

From the combinatorial point of view, the in-place BWT is an interesting question to solve on strings. There are space-saving approaches storing the suffix sorting in compressed form [13,24,14,27] or only partially at a time [15], but none of them provides an in-place algorithm. In-place selection and sorting does not seem to help either [7,9,12,23,28]. It is well known that in-place sorting requires the same comparison cost of $\Theta(n \log n)$ as in standard sorting. But for the BWT, we only know its comparison cost of $\Theta(n \log n)$ for the standard construction. As far as we know, no result is known for the in-place construction of BWT: a naive solution is not that simple, even if it results in exponential time. Indeed, any movement of a character $T[j]$ to another position inside $T$ at least changes the content of its suffixes $T_i$ for $0 \le i \le j$, making the algorithmic flavor of this problem different from that of in-place sorting $n$ elements.

Footnote 1: As is standard in many string algorithms, we assume that any two characters in $\Sigma$ can only be compared and this takes $O(1)$ time. Hence, comparing characterwise any two suffixes may require $O(n)$ time in the worst case.

Footnote 2: In C code, we would declare $T$ as unsigned char T[n] and use this storage plus $O(1)$ local variables of constant size. A more formal model would say that each memory cell hosts a character from $\Sigma$ and so an integer of $\log n$ bits requires $\log_{|\Sigma|} n$ cells. We prefer to keep it simpler and say that an extra cell can contain an integer.

Fig. 1. An illustration of Steps 1–4 of the in-place construction of the BWT.

The above discussion suggests that a careful orchestration of the movement of the characters inside $T$ is needed to avoid losing the content of some suffixes before they contribute to the BWT. Our idea is to define a sequence of transformations $B_0, B_1, \ldots, B_n$, where $B_0$ is the input text $T$ and $B_n$ is the final BWT of $T$. For $1 \le \ell \le n$, we have that $B_\ell$ is the BWT of the last $\ell$ characters in $T$ and is computed from $B_{\ell-1}$ (re)using just $O(1)$ extra memory cells. We think that this sequence of transformations could be of independent interest for the community of string algorithms, and some of the combinatorial properties that we use can be found in [3,14,17,29].

In this paper we propose an $O(n^2)$-time approach that builds the above sequence of transformations using four integer variables and one character variable, taking $O(n)$ time per transformation in the worst case. The resulting in-place algorithm is simple and can be easily encoded in a few lines of C code or similar programming languages. However, we do not claim any practicality of our solution due to its quadratic worst-case cost. Our contribution is that it could lay out the path towards faster methods for the space-efficient computation of the BWT: any method to compute $B_k$ from $B_{k-1}$ in $t(n)$ time (re)using $s(n)$ space would lead to a construction of the BWT in $O(n \cdot t(n))$ time using $O(1 + s(n))$ space. To this end, it is worth noting that the inputs for BWT are typically large and a fast algorithm that is in-place or uses very low additional memory would be relevant in practice.

Our theoretical study also has an impact on practical algorithm design. A natural question is what we get if we allow for some extra space. We prove that using $O(k \cdot \sigma_k)$ additional space for any given parameter $k \le n$, where $\sigma_k \le \min\{|\Sigma|, k\}$ is the maximum number of distinct characters found among $k$ consecutive positions in $T$, we can compute the BWT (and its inverse) of a text of $n$ characters in $O((n^2/k + n) \log k)$ time in the comparison model. We observe that $\sigma_k$ is practically a constant in many applications. The practical implications of this trade-off in space versus time can be appreciated by observing that, for any arbitrarily small $\epsilon > 0$, we can obtain the BWT of a text of $n$ bytes in $O(n \epsilon^{-1} \log n)$ time using just $\epsilon n$ extra bytes for a constant-size alphabet. This is useful when the text occupies a great part of the available main memory, and only $\epsilon n$ free cells are available. This avoids using external-memory algorithms, which are clearly slower as I/O access takes several orders of magnitude more time than main memory access.

The paper is organized as follows. We describe how to perform the in-place BWT in Section 2. We then discuss how to invert the BWT, so as to obtain the original text $T$, in Section 3. We illustrate the trade-off between space and time in Section 4. Finally, we draw some conclusions in Section 5.

2. In-place BWT

Given the input text $T = T[0..n-1]$ where $T[n-1] = \$$, moving a single character inside $T$ can change the content of many suffixes. The idea to circumvent this difficulty without using storage for the suffix sort is to proceed by induction from right to left in $T$, while maintaining the BWT of the current suffix $T_s$, denoted by $\mathrm{BWT}(T_s)$. We assume $0 \le s \le n-3$, since the last two suffixes of $T$ are equal to their respective BWT.

To compute $\mathrm{BWT}(T_s)$, suppose that $\mathrm{BWT}(T_{s+1})$ has been already computed and stored in the last positions of $T$, i.e. $T[s+1..n-1]$. Consider the current character $c = T[s]$: if we look at the content of $T[s..n-1]$, we no longer find $T_s$, but the character $c$ followed by the permutation $\mathrm{BWT}(T_{s+1})$ of $T_{s+1}$. Nevertheless, we still have enough information: as we will show in the proof of Theorem 1, the position of \$ inside $\mathrm{BWT}(T_{s+1})$ is related to the rank of $T_{s+1}$ among the suffixes $T_{s+1}, \ldots, T_{n-1}$ in lexicographic order. We exploit this fact in the following steps (see Fig. 1).

1. Find the position $p$ of the \$ in $T[s+1..n-1]$: note that $p - s$ is the (local) rank of the suffix $T_{s+1}$ that originally was starting at position $s+1$.

2. Find the rank $r$ of the suffix $T_s$ (originally in position $s$). Using character $c$, scan $T[s+1..n-1]$ and compute the sum of two counts: how many characters are strictly smaller than $c$, and how many occurrences of $c$ appear in $T[s+1..p]$ (and add $s$ as an offset to obtain $r$).

3. Store $c$ into $T[p]$ (thus replacing the \$).

4. Insert the character \$ in $T[r]$ by shifting $T[s+1..r]$ by one position to the left, so as to occupy positions $s, \ldots, r-1$ of $T$.

The C code reported in Fig. 2 implements Steps 1–4, where END_MARKER denotes \$. For example, consider $T = \texttt{mississippi\$}$ and $s = 4$, where we use capital letters to denote the BWT partially built on the last positions of $T$. Suppose that we have already computed the BWT for the last 7 characters in $T$, namely, we have missiIPSPIS\$. We then have $p = 11$ and, since there is one character (\$) smaller than $c = \texttt{i}$, and two characters that are equal to $c$ and occur before position $p$, we have $r = s + 3 = 7$. This means that we have to replace \$ by $c$ and shift IPS by one position left so as to insert \$ in position $r$. The next configuration is missIPS\$PISI, which ends with the BWT of $T_s$.

    #define END_MARKER '$'   /* the endmarker, smaller than any other character */

    void inplaceBWT( unsigned char T[ ], int n ){
      int i, p, r, s;
      unsigned char c;
      for ( s = n-3; s >= 0; s-- ){
        c = T[ s ];
        /* steps 1 and 2: locate $ at position p, compute the rank r of T_s */
        r = s;
        for ( i = s+1; T[ i ] != END_MARKER; i++ )
          if ( T[ i ] <= c ) r++;
        p = i;
        while ( i < n )
          if ( T[ i++ ] < c ) r++;
        /* step 3: c replaces the $ */
        T[ p ] = c;
        /* step 4: shift T[s+1..r] one position left and insert $ at r */
        for ( i = s; i < r; i++ ) T[ i ] = T[ i+1 ];
        T[ r ] = END_MARKER;
      }
    }

Fig. 2. In-place construction of BWT.
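The worked example can be replayed by isolating one iteration of the loop of Fig. 2. The sketch below is an illustrative refactoring, not the authors' code: the names `bwt_step` and `inplace_bwt` are hypothetical, lowercase letters are used throughout (the capitals in the text are only a visual aid), and the endmarker is assumed to be the ASCII character `$`.

```c
#define END_MARKER '$'

/* One iteration of the in-place construction: assuming T[s+1..n-1] holds
   BWT(T_{s+1}), leaves BWT(T_s) in T[s..n-1]. */
void bwt_step(unsigned char T[], int n, int s) {
    unsigned char c = T[s];
    int i, p, r = s;
    /* Steps 1 and 2: characters <= c before the endmarker ... */
    for (i = s + 1; T[i] != END_MARKER; i++)
        if (T[i] <= c) r++;
    p = i;                       /* position of the endmarker */
    /* ... and characters < c from the endmarker onwards. */
    for (; i < n; i++)
        if (T[i] < c) r++;
    T[p] = c;                    /* Step 3 */
    for (i = s; i < r; i++)      /* Step 4: shift left, insert endmarker */
        T[i] = T[i + 1];
    T[r] = END_MARKER;
}

/* Full construction: same loop bounds as Fig. 2. */
void inplace_bwt(unsigned char T[], int n) {
    for (int s = n - 3; s >= 0; s--)
        bwt_step(T, n, s);
}
```

Calling `bwt_step` with $s = 4$ on the buffer `missiipspis$` yields `missips$pisi`, the configuration described above, and `inplace_bwt` on `mississippi$` yields the column L of Table 1.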

Theorem 1. Given a text $T$ of $n$ characters, we can compute its Burrows–Wheeler Transform (BWT) in $O(n^2)$ time in the comparison model using $O(1)$ additional memory cells.

Proof. We prove first the correctness. Let $T$ be the input text and $T'$ be its modification at a generic iteration $s$, where $0 \le s \le n-3$. Note that $T'[0..s] = T[0..s]$ while $T'[s+1..n-1] = \mathrm{BWT}(T[s+1..n-1])$. By induction, the position $p$ of \$ in $T'[s+1..n-1]$ indicates the rank $p - s$ of $T_{s+1}$ among the suffixes in $\{T_{s+1}, T_{s+2}, \ldots, T_{n-1}\}$ in lexicographic order. The base case for $T_{n-2}$ and $T_{n-1}$ is trivially satisfied. Hence, we show how to preserve this property for $0 \le s \le n-3$.

First note that the character $c = T[s]$ goes in position $p$, since it precedes $T_{s+1}$ inside $T$. Next, we have to find the new position $r$ for $T_s$, so that $r - (s-1)$ is its rank among the suffixes in $S = \{T_s, T_{s+1}, \ldots, T_{n-1}\}$ in lexicographic order. First count how many characters smaller than $c$ occur in $T'[s+1..n-1]$: there are as many suffixes in $S$ that are smaller than $T_s$ since their first character is smaller than $c$. To this quantity, add the number of occurrences of $c$ in $T'[s+1..p-1]$: these suffixes are also smaller since they start with $c$ but have rank smaller than $p$, i.e. the rank of $T_{s+1}$. In this way, we discover how many suffixes are smaller than $T_s$ in $S$: inserting \$ in the corresponding location $r$ of $T'$, by shifting the characters in $T'[s+1..r]$ to the left, which thus occupy positions $s, \ldots, r-1$ (see Fig. 1), we maintain the induction. Hence, $T'[0..s-1] = T[0..s-1]$ and $T'[s..n-1] = \mathrm{BWT}(T[s..n-1])$. When $s = 0$, we obtain the BWT of $T$.

As for the complexity, note that each of the $n-2$ iterations requires $O(n)$ time, since it can be implemented by $O(1)$ scans of $T[s..n-1]$. This gives a total cost of $O(n^2)$. We use four integer variables (i, p, r, s) and one character variable (c) in the C code shown in Fig. 2, and thus we need $O(1)$ memory cells for the local variables. □

3. Inverting the BWT

Reversing the permutation performed by the in-place BWT is called inverting the BWT. Initially we have the BWT of the original input text $T$, denoted $\mathrm{BWT}(T)$. We want to invert the latter by permuting its characters. Thus we reverse the approach described in Section 2. We maintain the invariant that there is a pointer $L$ to a certain position in the input buffer storing $\mathrm{BWT}(T)$ so that, at any time, (a) the prefix of the buffer to the left of $L$ stores the prefix of $T$ obtained so far by the inverting process and (b) the remaining suffix of the buffer (pointed by $L$ till the end of the input buffer) stores the portion of the BWT still to be inverted. For the sake of notation, we identify $L$ with the entire suffix of the input buffer that still has to be inverted.

Under this invariant, which is initially true by setting $L$ to the beginning of the input buffer for $\mathrm{BWT}(T)$, we proceed as follows. We find the position $p$ of \$ in $L$, and then select the $p$th character in the alphabet order in the multiset given by the characters of $L$. Stability is needed, since equal characters should be considered in the order of their appearance in $L$, as detailed below.

    /* select( L, n, p ) is assumed to return the p-th smallest character
       of L[0..n-1] without moving any element (see step 2 in the text) */

    void inplaceIBWT( unsigned char L[ ], int n ){
      int f, i, p, q, count;
      unsigned char c;
      /* step 1 */
      p = 0;
      while ( L[ p ] != END_MARKER ) p++;
      p++;
      while ( n > 2 ){
        /* step 2 */
        c = select( L, n, p );
        count = 0;
        for ( i = 0; i < n; i++ ){
          if ( L[ i ] < c ) count++;
        }
        /* step 3 */
        f = p - count;
        q = -1;
        while ( f > 0 ){
          q++;
          if ( L[ q ] == c ) f--;
        }
        /* step 4 */
        L[ q ] = END_MARKER;
        for ( i = p-1; i > 0; i-- ){
          L[ i ] = L[ i-1 ];
        }
        /* step 5 */
        L[ 0 ] = c;
        L++; n--;
        /* step 1 */
        if ( p-1 > q )
          p = q+1; /* also the new END_MARKER has been shifted */
        else p = q;
      }
    }

Fig. 3. Reverting the permutation of the inverse BWT.

1. Find the position $p$ of the \$ in $L$, and increment $p$ (since array indexing starts from 0).

2. Let select be a selection algorithm that works on read-only input, i.e., it does not move elements around while finding the $p$th smallest element. Using select on $L$, select the $p$th character $c$ in the multiset of the characters of $L$ or, equivalently, the $p$th character in the sorted list of characters of $L$.

3. Let $q$ denote the position inside $L$ of the $f$th occurrence of $c$, which we hit in a stable fashion when finding $c$ in $L$. Here $f$ is the difference between $p$ and the number of characters $c'$ of $L$ such that $c' < c$.

4. Replace the occurrence of $c$ at position $q$ by \$, and remove the old occurrence of \$ by shifting to the right the first $p$ characters of $L$.

5. At this point, the first position in $L$ is free: store the character $c$ in it, and shorten $L$ by one character at the beginning (i.e. advance the pointer $L$ by one position towards the end of the input buffer).

The C code in Fig. 3 implements Steps 1–5 above, where END_MARKER denotes \$. Note that it is a bit longer than the code for the in-place BWT in Fig. 2. As can be seen below, the original text is reconstructed from left to right as a prefix of increasing length (indicated with small letters).

IPSSM$PISSII

mIPSS$PISSII

miIPSSPISSI$

misIPSSPIS$I

missIPS$PISI

missiIPSPIS$

missisIPSPI$

mississIP$PI

mississiIPP$

mississipIP$

mississippI$

mississippi$

The proof of correctness proceeds along the same lines as in the proof of Theorem 1, since we are reversing the procedure described there. As for the complexity, each of the $n-2$ iterations is dominated by the cost of select.

Let $t_s(n)$ be the time complexity in the comparison model and $s_s(n)$ be the space complexity required by select. Using the result in [23], we have $t_s(n) = O(n^{1+\epsilon})$ in the worst case for any fixed small constant $\epsilon > 0$ with $s_s(n) = O(1)$, and we have $t_s(n) = O(n \log\log n)$ on the average (which meets the randomized lower bound in [5]), with $s_s(n) = O(1)$.

We can state the complexity in general terms.

Theorem 2. Let $t_s(n)$ be the time complexity in the comparison model and $s_s(n)$ be the space complexity required by select. Given the BWT of a text $T$ of $n$ characters, we can recover $T$ by permuting the BWT (also known as inverse BWT) in $O(n \cdot t_s(n))$ time in the comparison model using $O(1 + s_s(n))$ additional memory cells.

We give some examples of the bounds that can be attained with Theorem 2.

Corollary 1. Given the BWT of a text $T$ of $n$ characters, we can recover $T$ by inverting the BWT in $O(n^{2+\epsilon})$ time in the worst case, or $O(n^2 \log\log n)$ time on the average, in the comparison model using $O(1)$ additional memory cells.

Using slightly more additional space than a constant (literally speaking, the algorithm is no longer in-place) and the result in [28], where $t_s(n) = O(n (\log n)^2)$ and $s_s(n) = O(\log n)$, we derive the following.

Corollary 2. Given the BWT of a text $T$ of $n$ characters, we can recover $T$ by inverting the BWT in $O((n \log n)^2)$ time in the comparison model using $O(\log n)$ additional memory cells.

Finally, for the special case in which the alphabet $\Sigma$ of the distinct characters in $T$ is of constant size (as in DNA and ASCII texts), we obtain an improved bound since select can be immediately implemented by a simple scheme that employs $O(|\Sigma|) = O(1)$ counters.

Corollary 3. Given the BWT of a text $T$ of $n$ characters drawn from a constant-size alphabet, we can recover $T$ by inverting the BWT in $O(n^2)$ time in the comparison model using $O(1)$ additional memory cells.
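One way to realize the counter-based selection behind Corollary 3 is sketched below for byte alphabets. This is an assumption-laden illustration rather than the authors' implementation: the helper `select_char` (named so to avoid clashing with the POSIX `select` function), the 256-entry counter array and the function name `inplace_ibwt` are hypothetical, while the main loop follows Steps 1–5 of Fig. 3.

```c
#define END_MARKER '$'

/* p-th smallest (1-based) character of L[0..n-1], found with one counter
   per byte value; L is only read, never modified. */
static unsigned char select_char(const unsigned char *L, int n, int p) {
    int counts[256] = {0};
    int ch, acc = 0;
    for (int i = 0; i < n; i++) counts[L[i]]++;
    for (ch = 0; ch < 256; ch++) {
        acc += counts[ch];
        if (acc >= p) break;
    }
    return (unsigned char)ch;
}

/* Inverts the BWT stored in L[0..n-1] by permuting it in place. */
void inplace_ibwt(unsigned char L[], int n) {
    int f, i, p, q, count;
    unsigned char c;
    p = 0;                                   /* step 1 */
    while (L[p] != END_MARKER) p++;
    p++;
    while (n > 2) {
        c = select_char(L, n, p);            /* step 2 */
        count = 0;
        for (i = 0; i < n; i++)
            if (L[i] < c) count++;
        f = p - count;                       /* step 3 */
        q = -1;
        while (f > 0) {
            q++;
            if (L[q] == c) f--;
        }
        L[q] = END_MARKER;                   /* step 4 */
        for (i = p - 1; i > 0; i--) L[i] = L[i - 1];
        L[0] = c;                            /* step 5 */
        L++; n--;
        p = (p - 1 > q) ? q + 1 : q;         /* step 1 for the next round */
    }
}
```

Each call to `select_char` costs $O(n + 256)$ time and a constant number of counters, so the whole inversion stays within the $O(n^2)$ time and $O(1)$ extra cells stated in Corollary 3 under the byte-alphabet assumption.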

4. Practical trade-off between space and time

The in-place algorithms described so far have the drawback of requiring $\Omega(n^2)$ time, which makes them unfeasible for long texts. A natural question is how much the latter bound can be improved using extra space. For example, using the dynamic wavelet tree data structure [26] in additional $O(n + |\Sigma| \log n)$ bits of space, we can maintain the BWT through insertion and deletion operations of individual symbols, supporting rank and select operations, with a cost of $O(\log n / \log\log n)$ time per operation. Using the latter data structure, our algorithms in Sections 2 and 3 would give a bound of $O(n \log n)$ time with additional $O(n + |\Sigma| \log n)$ bits of space besides that needed for storing the $n$ characters of the input text $T$ (see footnote 3). However, the resulting solution is not very practical as the data structure in [26] is quite sophisticated. We show next how to smoothly adapt our algorithms in Sections 2 and 3 to a situation where extra memory is allowed, producing some trade-off solutions that are amenable to implementation with a flexible parameter $k$ for the additional space.

Theorem 3. Given a text $T$ of $n$ characters, we can compute its Burrows–Wheeler Transform (BWT) and its inverse in $O((n^2/k + n) \log k)$ time in the comparison model using $O(k \cdot \sigma_k)$ additional space, where $\sigma_k \le \min\{|\Sigma|, k\}$ is the maximum number of distinct characters found among $k$ consecutive positions in $T$.

To appreciate the bound in Theorem 3 from a practical point of view, consider the situation in which the text $T$ occupies a great part of the available memory, and the remaining free cells are a constant fraction of the text size. Our algorithm takes $O(n \log n)$ time by fixing $k$ to be a suitable fraction of $n$. This avoids using external-memory algorithms, which are clearly slower as I/O access takes several orders of magnitude more time than main memory access. In general, if the available memory size is $M$, we obtain the following result by setting $k = \Theta(M - n)$.

Corollary 4. Let $M$ be the number of available cells in main memory. Given a text $T$ of $n < M$ characters over a constant alphabet, we can compute its Burrows–Wheeler Transform (BWT) and its inverse in $O\big((\tfrac{n^2}{M-n} + n) \log n\big)$ time in the comparison model using $M$ total memory cells including those containing $T$. When $M \ge (1 + \epsilon) n$ for a constant $\epsilon > 0$, this gives $O(n \log n)$ time using just $\epsilon n$ additional cells.
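The step from Theorem 3 to Corollary 4 can be spelled out as follows. This is a sketch of the substitution only, assuming $k \le n$ and a constant alphabet, so that $\sigma_k = O(1)$ and the extra space is $O(k)$:

```latex
\begin{align*}
k = \Theta(M - n),\ \log k \le \log n
  &\;\Longrightarrow\;
  O\!\left(\left(\frac{n^2}{k} + n\right)\log k\right)
  = O\!\left(\left(\frac{n^2}{M-n} + n\right)\log n\right), \\
M \ge (1+\epsilon)\,n
  &\;\Longrightarrow\;
  M - n \ge \epsilon n
  \;\Longrightarrow\;
  \frac{n^2}{M-n} \le \frac{n}{\epsilon} = O(n)
  \quad\text{for constant } \epsilon .
\end{align*}
```

The hidden constant grows as $1/\epsilon$, which matches the $O(n \epsilon^{-1} \log n)$ form of the bound quoted in the abstract.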

The idea to prove Theorem 3 is to have $n/k$ batches. Each batch simulates $k$ consecutive iterations in the external for loop on $s$ in Fig. 2, taking $O((n + k) \log k)$ time and using $O(k \cdot \sigma_k)$ space as follows.

Footnote 3: This theoretical solution has been suggested by Rossano Venturini (private communication).

Base case. Let $s_1$ be the largest multiple of $k$ that is smaller than $n$. We can compute $\mathrm{BWT}(T[s_1..n-1])$ and store it in $T[s_1..n-1]$ in $O(k \log k)$ time and $O(k)$ additional space by observing that $|T[s_1..n-1]| \le k$.

Inductive case. Suppose that $\mathrm{BWT}(T[s_1..n-1])$ has been already stored in $T[s_1..n-1]$, where $s_1 > 0$ is now a generic multiple of $k$. Letting $s_0 = s_1 - k$, we want to show how to store $\mathrm{BWT}(T[s_0..n-1])$ in $T[s_0..n-1]$ in $O((n + k) \log k)$ time using $O(k \cdot \sigma_k)$ additional space.

Since we have one base case and $n/k$ inductive cases, the final cost will be $O(k \log k + (n/k) \cdot (n + k) \log k) = O((n^2/k + n) \log k)$ time using $O(k \cdot \sigma_k)$ additional space, as stated in Theorem 3.
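The simplification of the total cost can be written out term by term; the only fact used beyond the statement above is $k \le n$, which makes the $k \log k$ term lower-order:

```latex
\begin{align*}
k \log k + \frac{n}{k}\,(n + k)\log k
  &= \left(k + \frac{n^2}{k} + n\right)\log k \\
  &\le \left(\frac{n^2}{k} + 2n\right)\log k
  \qquad (\text{since } k \le n) \\
  &= O\!\left(\left(\frac{n^2}{k} + n\right)\log k\right).
\end{align*}
```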

4.1. Inductive case

We can abstract the problem for a string $X$ (i.e. $T[s_0..n-1]$) of length $m$, where the characters in $X[k..m-1]$ are already permuted according to their BWT, and the characters in $X[0..k-1]$ are still in their original order. We want to compute $\mathrm{BWT}(X)$ in $O((m + k) \log k)$ time using $O(k \cdot \sigma_k)$ additional space.

Let $Z$ denote $X[k..m-1]$ where the \$ character is virtually removed from its position, say $j$. Hence $Z$ is of length $m - k - 1$ and the pair $\langle \$, j \rangle$ is a breakpoint for $Z$. In general, a breakpoint is a pair $\langle c, j \rangle$ such that $c$ is virtually occupying position $j$ of $Z$: if two or more breakpoints claim the same position $j$, there should be a relative order among them.

Our goal is to compute the $k + 1$ breakpoints for $Z$ so that (a) their characters are those in $X[0..k-1]$ plus \$, and (b) flattening $Z$ and these breakpoints correctly produces $\mathrm{BWT}(X)$ as follows. Given $Z$ and an ordered list of $k + 1$ breakpoints $B = \langle c_0, j_0 \rangle, \ldots, \langle c_k, j_k \rangle$, where $0 \le j_0 \le \cdots \le j_k \le |Z|$, flattening $Z$ and $B$ produces a string with the characters of $Z$ suitably shifted to the left to make room for the characters in the breakpoints of $B$ as follows. We scan $Z$ starting with $j = 0$: if $j = j_r$ for the breakpoint $\langle c_r, j_r \rangle$ at the beginning of $B$, we output $c_r$ and remove $\langle c_r, j_r \rangle$ from $B$; else ($j \ne j_r$), we output $Z[j]$ and increase $j$. (If $c_r = \$$, we do not really output it, but we retain its position $j_r$ for the next batch.) The computation ends when $Z$ has been completely scanned and the list $B$ has been emptied. The required time is $O(m \log k)$ and the computation can be performed using $O(k \cdot \sigma_k)$ additional space.
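The flattening scan can be sketched as a simple merge of $Z$ with the ordered breakpoint list. The code below is a simplified illustration under stated assumptions: `struct breakpoint` and `flatten` are hypothetical names, the breakpoints are kept in a plain sorted array rather than the dynamic list of the paper, the output goes to a separate buffer instead of being written back over $T$, and every breakpoint character (including the endmarker) is emitted.

```c
#include <stddef.h>

/* A breakpoint <c, j>: character c virtually occupies position j of Z. */
struct breakpoint {
    unsigned char c;
    size_t j;
};

/* Merges Z[0..zlen-1] with the breakpoints B[0..blen-1], which must be
   ordered by non-decreasing j with j <= zlen. Writes the result to out
   (capacity zlen + blen) and returns the number of characters written. */
size_t flatten(const unsigned char *Z, size_t zlen,
               const struct breakpoint *B, size_t blen,
               unsigned char *out) {
    size_t j = 0, r = 0, w = 0;
    while (j < zlen || r < blen) {
        if (r < blen && B[r].j == j)
            out[w++] = B[r++].c;   /* breakpoint at the current position */
        else if (j < zlen)
            out[w++] = Z[j++];     /* ordinary character of Z */
        else
            break;                 /* malformed list: stop instead of looping */
    }
    return w;
}
```

With an array-based list this scan takes $O(m + k)$ time; the $O(m \log k)$ bound in the text accounts for the dynamic structures used to maintain $B$.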

For this we need the following auxiliary data structures for string $Z$, which require $O(k \cdot \sigma_k)$ additional space. (Note that $Z$ and $X$ require just $O(1)$ space as they originate from $T$.)

1. Static array $C$ of $\sigma_k \le k$ entries, where $\alpha_1 < \cdots < \alpha_{\sigma_k}$ are the distinct characters in $X[0..k-1]$: entry $C[i]$ is the number of positions $j'$ in $Z$ such that $Z[j'] < \alpha_i$.

2. Static rank data structure $R_1$ on $Z$ supporting queries that, for an integer $j$ and a character $\alpha_i$, report how many positions $j'$ satisfy $Z[j'] = \alpha_i$ and $j' \le j$.

3. Dynamic list $B$ of breakpoints, initially containing only the pair $\langle \$, j \rangle$.

4. Dynamic rank data structure $R_2$ on $B$ supporting queries that, for a character $\alpha_i$, report how many breakpoints $\langle c, l \rangle$ to the left of $\langle \$, j \rangle$ in $B$ satisfy $c = \alpha_i$.

As is clear, we want to populate the list $B$ by simulating the algorithm described in Section 2. Namely, we want to find the breakpoint of character $X[s]$ for $s = k-1, k-2, \ldots, 0$ in $Z$.

Consider the breakpoint $\langle \$, j \rangle$, which exists in $B$ by construction. Let $\alpha_i = X[s]$, and let $r = r_0 + r_1 + r_2$ be the sum of three quantities: the number $r_0$ of positions $j'$ in $Z$ such that $Z[j'] < \alpha_i$, the number $r_1$ of positions $j'$ in $Z$ such that $Z[j'] = \alpha_i$ and $j' \le j$, and the number $r_2$ of breakpoints $\langle c, l \rangle$ that are to the left of $\langle \$, j \rangle$ in $B$ and have $c = \alpha_i$. Note that we can compute $r_0$ using entry $C[i]$ in point 1, $r_1$ using the data structure $R_1$ in point 2, and $r_2$ using the data structures $B$ and $R_2$ in points 3–4.

Fact 1. The value of $r$ is the rank of $X[s..m-1]$ among $X[s+1..m-1], X[s+2..m-1], \ldots, X[m-1..m-1]$ in lexicographic order.

Proof. It follows from the fact that $X \equiv T[s_0..n-1]$ and $X[s+d..m-1] \equiv T_{s'+d}$, where $s' = s_0 + s$ and $d \ge 0$. □

After computing $r$, we replace the breakpoint $\langle \$, j \rangle$ by $\langle c, j \rangle$, where $c = X[s]$, and create a new breakpoint $\langle \$, r \rangle$ to be inserted in $B$, updating the data structures in points 3–4 accordingly.

Lemma 1. The static array $C$ can be stored in $O(k \cdot \sigma_k)$ space and built in $O(m \log k)$ time.

Proof. We scan $X[0..k-1]$ and find the distinct characters $\alpha_1 < \cdots < \alpha_{\sigma_k}$. We then perform a scan of $Z$ to store in $C[1]$ the number of positions $j$ such that $Z[j] < \alpha_1$ and, for $i > 1$, to store in $C[i]$ the number of positions $j$ such that $\alpha_{i-1} \le Z[j] < \alpha_i$. We then store in $C$ its prefix sums, thus giving the wanted $C$. Time is $O(m \log \sigma_k)$ since during the scan we perform a binary search among the $\sigma_k$ characters. Space is $O(\sigma_k)$ by definition of $C$. □
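The construction in the proof of Lemma 1 (bucket counts followed by prefix sums) can be sketched as follows. The function name `build_C` and its parameters are hypothetical, the sorted distinct characters are assumed to be supplied by the caller in `alpha`, and indices are 0-based, so `C[i]` here corresponds to $C[i+1]$ in the text.

```c
#include <stddef.h>

/* Fills C[0..sigma-1] so that C[i] is the number of positions j with
   Z[j] < alpha[i]. alpha[0..sigma-1] must hold distinct characters in
   increasing order. */
void build_C(const unsigned char *Z, size_t zlen,
             const unsigned char *alpha, size_t sigma,
             size_t *C) {
    for (size_t i = 0; i < sigma; i++) C[i] = 0;
    for (size_t j = 0; j < zlen; j++) {
        /* Binary search for the first alpha[i] strictly greater than Z[j]. */
        size_t lo = 0, hi = sigma;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (alpha[mid] > Z[j]) hi = mid;
            else lo = mid + 1;
        }
        if (lo < sigma) C[lo]++;   /* bucket: alpha[lo-1] <= Z[j] < alpha[lo] */
    }
    /* Prefix sums turn bucket counts into "strictly smaller" counts. */
    for (size_t i = 1; i < sigma; i++) C[i] += C[i - 1];
}
```

Characters of $Z$ that are at least as large as the largest `alpha[i]` fall outside every bucket and are simply skipped.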

Lemma 2. The static data structure $R_1$ can be stored in $O(k \cdot \sigma_k)$ space and built in $O(m \log \sigma_k)$ time, so that each query requires $O(\log \sigma_k + m/k)$ time.

Proof. We store $R_1$ as a collection of $\sigma_k$ arrays $R_1^i$, for $1 \le i \le \sigma_k$. Entry $R_1^i[h]$, for $0 \le h \le k$, stores the number of occurrences of character $\alpha_i$ in the $h$th segment of $m/k$ consecutive positions in $Z$: namely, $R_1^i[0] = 0$ and, for $h \ge 1$, $R_1^i[h]$ stores the number of positions $j$ such that $Z[j] = \alpha_i$ and $(m/k) \cdot (h-1) \le j \le \min\{m-1, (m/k) h - 1\}$. After initializing the entries in all the arrays to zero, their correct value can be set with a single scan of $Z$ in $O(m \log \sigma_k)$ time (see footnote 4). After that, we perform a post-processing and store in each $R_1^i$ its prefix sums: in this way, $R_1^i[h]$ stores the number of positions $j$ such that $Z[j] = \alpha_i$ and $0 \le j \le \min\{m-1, (m/k) h - 1\}$. For a query with a character $c$ and an integer $j$, it takes $O(\log \sigma_k)$ time to establish that $c = \alpha_i$ for a certain $i$, and $O(1)$ time to find the largest $h$ such that $(m/k) h < j$. The correct answer for the query is then given by the sum of the content of $R_1^i[h]$ and the number of $\alpha_i$'s found in $Z[(m/k) h .. j]$ by its direct inspection. Scanning the latter takes $O(m/k)$ time by definition of $R_1^i$. □
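A blocked rank structure in the spirit of Lemma 2 can be sketched as below. The layout is an assumption made for illustration: `build_R1` and `rank_R1` are hypothetical names, the block size `b` (playing the role of $m/k$) is passed explicitly, the $\sigma_k$ arrays are stored as rows of one flat array `R`, and characters are matched to `alpha` by linear search instead of the binary search used in the proof.

```c
#include <stddef.h>

/* R has sigma rows of width = ceil(zlen / b) + 1 entries. After the call,
   R[i * width + h] is the number of occurrences of alpha[i] in
   Z[0 .. h*b - 1]. */
void build_R1(const unsigned char *Z, size_t zlen,
              const unsigned char *alpha, size_t sigma,
              size_t b, size_t *R) {
    size_t width = (zlen + b - 1) / b + 1;
    for (size_t t = 0; t < sigma * width; t++) R[t] = 0;
    /* Per-block counts, stored one slot to the right of the block index. */
    for (size_t j = 0; j < zlen; j++)
        for (size_t i = 0; i < sigma; i++)
            if (alpha[i] == Z[j]) {
                R[i * width + j / b + 1]++;
                break;
            }
    /* Prefix sums along each row. */
    for (size_t i = 0; i < sigma; i++)
        for (size_t h = 1; h < width; h++)
            R[i * width + h] += R[i * width + h - 1];
}

/* Number of positions t <= j with Z[t] == alpha[i]; requires j < zlen. */
size_t rank_R1(const unsigned char *Z, size_t zlen,
               const unsigned char *alpha, size_t b,
               const size_t *R, size_t i, size_t j) {
    size_t width = (zlen + b - 1) / b + 1;
    size_t h = j / b;                    /* block containing position j */
    size_t count = R[i * width + h];     /* occurrences before that block */
    for (size_t t = h * b; t <= j; t++)  /* direct inspection, at most b steps */
        if (Z[t] == alpha[i]) count++;
    return count;
}
```

A query reads one stored prefix count and inspects at most `b` characters, mirroring the $O(m/k)$ scanning term in the lemma.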

Lemma 3. The dynamic list $B$ and the dynamic data structure $R_2$ can be stored in $O(k)$ space and built in $O(k \log k)$ time, so that each query and each update requires $O(\log k)$ time.

Proof. We handle the list $B = \langle c_0, j_0 \rangle, \ldots, \langle c_r, j_r \rangle$, where $0 \le r \le k$, and the data structure $R_2$ together. In particular, $R_2$ is the wavelet tree of height $O(\log r)$ built on the sequence $c_0 \cdots c_r$ of characters taken from $B$. The query for $\alpha_i$ can be performed as a count query of characters $\alpha_i$ in $c_0 \cdots c_d$, where $d \le r$ and $c_d = \$$ (e.g. [25]). Each update (insertion, deletion, replacement) can be handled in $O(\log r + \log k) = O(\log k)$ time [11,26]. □

We now have all the ingredients to prove the time and space bounds for computing the BWT as stated in Theorem 3. By the inductive scheme and Fact 1, the computation is correctly performed. As for the time bounds, we use Lemmas 1–3.

Note that to invert the BWT, we can now implement the algorithm described in Section 3 in a simpler way, since $R_1$ and $R_2$ also support the selection of the $p$th symbol $c = \alpha_i$. Moreover, flattening $Z$ and $B$ removes the positions $j$ stored in the pairs in $B$ from $Z$, as this simulates the shift of some characters of $Z$ to the right. The time and space analysis is similar to that of computing the BWT.

5. Conclusions

We presented an in-place BWT construction taking $O(n^2)$ time in the comparison model. It would be interesting to improve this bound. Note that the while loop in our in-place BWT can be avoided using $O(|\Sigma|)$ space, where $\Sigma$ is the alphabet of characters occurring in $T$. Time can be further reduced to $O(n^2 / \log_{|\Sigma|} n)$ by packing characters but this is still not useful for large text collections.

We do not know whether a lower bound better than $\Omega(n \log n)$ holds for the problem in the comparison model since the space is very constrained. This is an interesting question to investigate.

On the practical side, our in-place algorithms can be adapted to provide a trade-off between space and time, when $O(k \cdot \sigma_k)$ extra memory cells are allowed, providing an $O((n^2/k + n) \log k)$-time algorithm to obtain (and invert) the BWT. For example, in real systems where the alphabet size is a constant, for any arbitrarily small $\epsilon > 0$, the BWT of a text of $n$ bytes can be computed in $O(n \epsilon^{-1} \log n)$ time using just $\epsilon n$ extra bytes.

Acknowledgements

The second author is grateful to Gianni Franceschini for some preliminary discussions and to Venkatesh Raman for pointing out the results in [23,28] and Rossano Venturini for his comments and indicating the result in [3]. We also thank the anonymous reviewers for their careful reading of our manuscript.

References

[1] Donald Adjeroh, Timothy Bell, Amar Mukherjee, The Burrows–Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching, Springer, 2008.

[2] Alfred Aho, John Hopcroft, Jeff D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA, 1974.

[3] Djamal Belazzougui, Linear time construction of compressed text indices in compact space, in: Proc. STOC 2014, 2014, pp. 148–193.

[4] Michael Burrows, David J. Wheeler, A block-sorting lossless data compression algorithm, Research Report 124, Digital SRC, Palo Alto, CA, USA, May 1994.

[5] Timothy M. Chan, Comparison-based time–space lower bounds for selection, ACM Trans. Algorithms 6 (2) (2010) 1–16.

[6] Maxime Crochemore, Roberto Grossi, Juha Kärkkäinen, Gad M. Landau, A constant-space comparison-based algorithm for computing the Burrows–Wheeler transform, in: Combinatorial Pattern Matching, 24th Annual Symposium (CPM), 2013, pp. 74–82.

[7] David J. Dobkin, J. Ian Munro, Optimal time minimal space selection algorithms, J. ACM 28 (3) (July 1981) 454–461.

[8] Paolo Ferragina, Giovanni Manzini, Indexing compressed text, J. ACM 52 (4) (2005) 552–581.

[9] Gianni Franceschini, S. Muthukrishnan, In-place suffix sorting, automata, languages and programming, in: 34th International Colloquium (ICALP), 2007, pp. 533–545.

[10] Roberto Grossi, Ankur Gupta, Jeffrey S. Vitter, High-order entropy-compressed text indexes, in: ACM–SIAM SODA, 2003, pp. 841–850.

Footnote 4: If $k \cdot \sigma_k = \omega(n \log \sigma_k)$, it is a standard trick to allocate $O(k \cdot \sigma_k)$ memory and initialize only what is needed in $O(m \log \sigma_k)$ time [2].
