
HAL Id: hal-00487228

https://hal.archives-ouvertes.fr/hal-00487228

Submitted on 28 May 2010


Combinatorial Characterization of the Language Recognized by Factor and Suffix Oracles

Alban Mancheron, Christophe Moan

To cite this version:

Alban Mancheron, Christophe Moan. Combinatorial Characterization of the Language Recognized by Factor and Suffix Oracles. International Journal of Foundations of Computer Science, World Scientific Publishing, 2005, 16 (6), pp.1179-1191. ⟨10.1142/S0129054105003741⟩. ⟨hal-00487228⟩

© World Scientific Publishing Company

COMBINATORIAL CHARACTERIZATION OF THE LANGUAGE RECOGNIZED BY FACTOR AND SUFFIX ORACLES

ALBAN MANCHERON and CHRISTOPHE MOAN

Laboratoire d'Informatique de Nantes-Atlantique, Université de Nantes,
B.P. 92208, 44322 Nantes Cedex 3, France

Received (received date)
Revised (revised date)
Communicated by Editor's name

ABSTRACT

Sequence analysis requires elaborate data structures that allow both efficient storage and efficient use. A new one was introduced in 1999 by Cyril Allauzen, Maxime Crochemore and Mathieu Raffinot. This structure is linear in the size of the represented word, both in time and space. It has the smallest possible number of states and it accepts at least all substrings of the represented word. This structure is called the Factor Oracle. The authors developed another structure on the basis of the Factor Oracle, which has the same properties except that it accepts at least all suffixes instead of all factors of the represented word. This structure is called the Suffix Oracle. The characterization of the language recognized by the Factor/Suffix Oracle of a given word is an open problem, for which we provide a solution. Using this result, we show that these structures may accept an exponential number of words that are not factors/suffixes of the given word.

Keywords: Factor Oracle, Suffix Oracle, automata, language, characterization.

1. Introduction

Several structures have been developed for text indexing: we can cite Tries [1], Suffix Automata [1, 2], Suffix Trees [1, 3]... Their objective is to represent a text or a word $s$ (i.e. a succession of symbols taken from an arbitrary alphabet denoted by $\Sigma$), in order to quickly determine whether this word contains some specific sub-word. This sub-word is then called a factor of $s$.

Allauzen et al. [4, 5, 6] described a method to build an acyclic automaton that accepts at least all factors of $s$, has as few states as possible ($|s| + 1$) and is linear in the number of transitions (at most $2|s| - 1$). When each state is final in this automaton, the structure is called a Factor Oracle. By keeping only particular states as final, this automaton becomes a Suffix Oracle.

The construction algorithm is easy to understand and implement; this advantage does not exist in the most efficient algorithm to build Suffix Trees [3]. Oracles are homogeneous automata, i.e. all transitions going into a same state are labeled with the same symbol. Thus, it is not necessary to label edges anymore. Therefore this structure requires less memory than Suffix Trees or Tries. Lefebvre et al. [7, 8, 9] used it for the discovery of repeated motifs over large genomic data and obtained in a few seconds results similar to the ones obtained by using thousands of blastn requests. The authors also used the Factor Oracle for text compression [10].

However, at least two open questions are linked to these Oracles: the first one is about the characterization of the language recognized by Oracles; the second one concerns the existence of an algorithm, linear in time and space, to build an automaton that accepts all factors/suffixes of a word $s$ and that is minimal in number of transitions.

When using these Oracles, the main difficulty is to distinguish true and false positives. Therefore we provide in the next section several definitions related to the construction of Oracles. We characterize the language recognized by this structure in section 3. Finally, some results using Oracles are also described.

2. Definitions

In the following sections, we call $Fact(s)$ (resp. $Suff(s)$ and $Pref(s)$) the set of factors (resp. suffixes and prefixes) of $s \in \Sigma^+$. We call $Pref_s(i)$ the prefix of $s$ that has length $i \geq 0$. Given $x \in Fact(s)$, $Nb_s(x)$ is the number of occurrences of $x$ in $s$, and $x$ is repeated if and only if $Nb_s(x) \geq 2$.

Definition 1 Given a word $s \in \Sigma^+$ and a factor $x$ of $s$, we define the function $Pos$ as the position of the first occurrence of $x$ in $s = uxv$ ($u, v \in \Sigma^*$): $Pos_s(x) = |u| + 1$. We also define the function $poccur$ such that $poccur_s(x) = |ux| = Pos_s(x) + |x| - 1$.
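As a concrete reading of these notations, the sketch below implements $Nb$, $Pos$ and $poccur$ on Python strings. The function names `nb`, `pos` and `poccur` are ours, positions are 1-based as in the text, and we assume that $Nb$ counts overlapping occurrences.

```python
def nb(s: str, x: str) -> int:
    """Number of (possibly overlapping) occurrences of x in s."""
    return sum(1 for k in range(len(s) - len(x) + 1) if s.startswith(x, k))


def pos(s: str, x: str) -> int:
    """1-based position of the first occurrence of the factor x in s."""
    k = s.find(x)
    if k < 0:
        raise ValueError(f"{x!r} is not a factor of {s!r}")
    return k + 1


def poccur(s: str, x: str) -> int:
    """1-based position of the last letter of the first occurrence of x in s."""
    return pos(s, x) + len(x) - 1
```

For instance, with the word $gaccattctc$ used later in the paper, `pos` and `poccur` of `"tc"` are the start and end positions of its first occurrence.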

2.1. Oracles

The Oracle construction is defined by the algorithm of Allauzen et al. [4] (see Algorithm 1). The authors gave another algorithm that builds the same automaton in time linear in the size of $s$. However, since we are only interested in properties of the Oracle, we do not report it in this paper.

Definition 2 [4] Given a word $s \in \Sigma^*$, we define the Factor Oracle of $s$ as the automaton obtained by Algorithm 1, where all states are final. It is denoted by $FO(s)$. We define the Suffix Oracle of $s$ as the automaton obtained by the same algorithm, where a state $e_i$ ($0 \leq i \leq |s|$) is final if and only if there exists a path labeled by a suffix of $s$ from the initial state to the state $e_i$. It is denoted by $SO(s)$. We use the term Oracle to designate either the Factor or the Suffix Oracle of a word $s$, and we denote it by $O(s)$. We define an order relation between states in these Oracles: given two states $e_i$ and $e_j$ such that $i \leq j$, we have $e_i \leq e_j$.

Algorithm 1: Construction of the Factor Oracle of a word

Input:  $\Sigma$               % Alphabet (supposed minimal) %
        $s \in \Sigma^*$       % The word to process %
Output: $Oracle$               % Factor Oracle of $s$ %

Begin
    Create the initial state labeled by $e_0$

    For $i$ from $1$ to $|s|$ Do
        Create a state labeled by $e_i$
        Build a transition from the state $e_{i-1}$ to the state $e_i$ labeled by $s[i]$
    End For

    For $i$ from $0$ to $|s| - 1$ Do
        Let $u$ be a word of minimal length recognized at state $e_i$
        For All $\alpha \in \Sigma \setminus \{s[i + 1]\}$ Do
            If $u\alpha \in Fact(s[i - |u| + 1..|s|])$ Then
                $j \leftarrow poccur_{s[i-|u|+1..|s|]}(u\alpha) - |u|$
                Build a transition from the state $e_i$ to $e_{i+j}$ labeled by $\alpha$
            End If
        End For All
    End For
End
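To make Algorithm 1 easier to experiment with, here is a minimal Python sketch of it. It is a direct, non-linear transcription and not the linear-time construction of [4]. The names `build_oracle`, `state_of` and `suffix_finals` are ours, states are represented by their indices, and the minimal word of each state is maintained by relaxing along transitions, which is our implementation choice. We also assume that the empty suffix makes $e_0$ final in the Suffix Oracle.

```python
def build_oracle(s):
    """Return (trans, minword) for the Oracle of s.

    trans[i] maps a letter to the index of the state reached from e_i;
    minword[i] is the shortest word read from e_0 to e_i.
    """
    n = len(s)
    trans = [{} for _ in range(n + 1)]
    minword = [None] * (n + 1)
    minword[0] = ""
    for i in range(n):
        u = minword[i]                      # final: edges into e_i come from earlier states
        targets = {s[i]: i + 1}             # main transition e_i -> e_{i+1}
        tail = s[i - len(u):]               # s[i-|u|+1..|s|] in the 1-based notation
        for alpha in sorted(set(s) - {s[i]}):
            k = tail.find(u + alpha)        # 0-based start of the first occurrence, or -1
            if k >= 0:
                targets[alpha] = i + k + 1  # external transition e_i -> e_{i+j}, j = k + 1
        for alpha, dst in targets.items():
            trans[i][alpha] = dst
            cand = u + alpha
            if minword[dst] is None or len(cand) < len(minword[dst]):
                minword[dst] = cand
    return trans, minword


def state_of(trans, w):
    """Index of the state reached by reading w from e_0, or None if w is rejected."""
    q = 0
    for letter in w:
        q = trans[q].get(letter)
        if q is None:
            return None
    return q


def suffix_finals(s, trans):
    """Final states of the Suffix Oracle: the states reached by the suffixes of s."""
    return {state_of(trans, s[k:]) for k in range(len(s) + 1)}
```

A word is accepted by the Factor Oracle when `state_of` returns a state, and by the Suffix Oracle when that state belongs to `suffix_finals(s, trans)`.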



Definition 3 Given a word $s \in \Sigma^*$ and a word $x$ accepted at the state $e_i$ ($0 \leq i \leq |s|$) in the Oracle of $s$, we define the function $State$ as $State(x) = e_i$.

Lemma 1 [4] Given a word $s \in \Sigma^*$, a unique word of minimal length is accepted at each state $e_i$ ($0 \leq i \leq |s|$) in the Oracle of $s$. It is denoted by $min(e_i)$.

Lemma 2 [4] Given a word $s \in \Sigma^*$, its Oracle and an integer $i$ ($0 \leq i \leq |s|$), then $min(e_i) \in Fact(s)$ and $i = poccur_s(min(e_i))$.

Notation 1 Given a word $s \in \Sigma^*$, let $\#_{in}(e_i)$ and $\#_{out}(e_i)$ denote the number of ingoing/outgoing transitions to/from state $e_i$ ($0 \leq i \leq |s|$) in the Oracle of $s$.

2.2. Canonical Factors & Contraction Operation

In this section, we define particular factors of a given word and an operation needed to characterize the language. Then we define the sets of words obtained with this operation.

Definition 4 Given a word $s \in \Sigma^*$ and its Oracle, we define the set of Canonical Factors of $s$ as $F_s = \{min(e_i) \mid 1 \leq i \leq |s| \wedge (\#_{out}(e_i) > 1 \vee \#_{in}(e_i) > 1)\}$. Given a suffix $t$ of $s$ and a Canonical Factor $f$ of $s$, $f$ is a conserved Canonical Factor of $s$ in $t$ if the first occurrence of $f$ in $s$ appears in $t$. The set of the conserved Canonical Factors of $s$ in $t$ is denoted by $F_{s,t}$ (thus $F_{s,t} \subseteq F_s$).

Definition 5 Given a word $s \in \Sigma^*$ and a repeated Canonical Factor $f$ of $s$ such that:

$s = ufv$ ($u, v \in \Sigma^*$),
$fv = wfx$ ($w \in \Sigma^+$, $x \in \Sigma^*$),
$Pos_s(f) = |u| + 1$,

[...]

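Notation 2 Given a word $s \in \Sigma^*$ and a Canonical Factor $f \in F_s$, $\mathcal{C}^f_s$ is the set of contractions of $s$ by $f$ and $\mathcal{C}_s$ ($\equiv \bigcup_{f \in F_s} \mathcal{C}^f_s$) is the set of all the contractions we can apply to $s$. Given a suffix $t$ of $s = t't$ ($t', t \in \Sigma^*$), $\mathcal{C}_{s,t}$ is the subset of $\mathcal{C}_s$ such that $\mathcal{C}_{s,t} = \{(p, q) \mid (p, q) \in \mathcal{C}_s \wedge p > |t'|\}$.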
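The end of Definition 5 is not available in this copy. Judging from the Example given after Definition 8, a contraction of $s$ by $f$ appears to be the pair $(|u| + 1, |uw| + 1)$, i.e. the first occurrence of $f$ paired with the start of a later occurrence. Under that assumption, the following sketch enumerates the contractions of $s$ from a given set of Canonical Factors. The name `contractions` is ours and the Canonical Factors are taken as input rather than computed from the Oracle.

```python
def contractions(s, canonical_factors):
    """Pairs (p, q) with p = Pos_s(f) and q the 1-based start of a later occurrence of f."""
    result = set()
    for f in canonical_factors:
        p = s.find(f)                 # 0-based start of the first occurrence
        if p < 0:
            continue                  # f is not a factor of s: nothing to contract
        q = s.find(f, p + 1)          # later occurrences may overlap the first one
        while q >= 0:
            result.add((p + 1, q + 1))
            q = s.find(f, q + 1)
    return result
```

Applied to $gaccattctc$ with the Canonical Factors $\{a, c, ca, t, tc, ct\}$ listed in the paper's Example, this reading reproduces the seven contractions given there.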

In the following, we use sets of contractions to produce new words from a given one. Then we use these words to characterize the language recognized by Oracles.

Definition 6 A set $C$ of contractions is coherent if and only if it does not contain two contractions $(i_1, j_1)$, $(i_2, j_2)$ such that $i_1 < i_2 < j_1 < j_2$. Furthermore, $C$ is minimal if and only if it does not contain two contractions $(i_1, j_1)$ and $(i_2, j_2)$ such that $i_1 \leq i_2 < j_2 \leq j_1$ or such that $i_1 < j_1 = i_2 < j_2$.

Definition 7 Given a word $s \in \Sigma^*$ and a coherent and minimal set of contractions $C = \{(p_1, q_1), \ldots, (p_k, q_k)\}$ (associated to the set of canonical factors $\{f_1, \ldots, f_k\}$), we define the function $Word$ as follows:

$Word(s, C) = s[1..p_1 - 1]\; s[q_1..p_2 - 1] \ldots s[q_{k-1}..p_k - 1]\; s[q_k..|s|]$.

Fig. 1. Words obtained using the contraction operation (see Definition 5).

We are only interested in words obtained by the contraction operation, so we only consider coherent and minimal sets of contractions, without loss of generality. Note that the word remains the same whatever the order of contractions (see figure 1).

Definition 8 We define $E(s) = \bigcup_{C \subseteq \mathcal{C}_s} Word(s, C)$ as the closure of $s$.

Example: Consider the word $gaccattctc$ (see figure 2). Its set of Canonical Factors is $F_{gaccattctc} = \{a, c, ca, t, tc, ct\}$ and then $\mathcal{C}_{gaccattctc} = \{(2, 5), (3, 4), (3, 8), (3, 10), (6, 7), (6, 9), (7, 9)\}$. Now, consider the set $C = \{(2, 5), (7, 9)\}$ ($C \subseteq \mathcal{C}_{gaccattctc}$); then $Word(gaccattctc, C) = g\,[acc]\,at\,[tc]\,tc = gattc$, where the bracketed parts are removed. The closure of $gaccattctc$ is:

$E(gaccattctc) = \{gac, gacatc, gacatctc, gacattc, gacattctc, gaccatc, gaccatctc, gaccattc, gaccattctc, gactc, gatc, gatctc, gattc, gattctc\}$.
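Definitions 6 to 8 can be checked on this example with a short sketch. The helper names `is_coherent_and_minimal`, `word` and `closure` are ours, the set of contractions is taken from the Example rather than computed from the Oracle, and the enumeration of subsets is exponential, so this is only meant for small words.

```python
from itertools import combinations


def is_coherent_and_minimal(C):
    """Check the conditions of Definition 6 on a collection of contractions (p, q)."""
    for (i1, j1) in C:
        for (i2, j2) in C:
            if (i1, j1) == (i2, j2):
                continue
            if i1 < i2 < j1 < j2:        # crossing pairs: not coherent
                return False
            if i1 <= i2 < j2 <= j1:      # nested pairs: not minimal
                return False
            if i1 < j1 == i2 < j2:       # chained pairs: not minimal
                return False
    return True


def word(s, C):
    """Word(s, C) of Definition 7, with 1-based positions."""
    pieces, start = [], 1
    for p, q in sorted(C):
        pieces.append(s[start - 1:p - 1])   # s[start..p-1]
        start = q
    pieces.append(s[start - 1:])            # s[q_k..|s|]
    return "".join(pieces)


def closure(s, all_contractions):
    """E(s): the words obtained from every coherent and minimal subset."""
    pool = sorted(all_contractions)
    return {
        word(s, C)
        for k in range(len(pool) + 1)
        for C in combinations(pool, k)
        if is_coherent_and_minimal(C)
    }
```

With the seven contractions of the Example, only 14 subsets are both coherent and minimal, which matches the 14 words listed in $E(gaccattctc)$.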

[Fig. 2. Oracle of the word $gaccattctc$ (states $e_0$ to $e_{10}$).]

Given a word $s \in \Sigma^*$, we saw how to build the corresponding Factor (resp. Suffix) Oracle. This Oracle recognizes the factors (resp. suffixes) of $s$, but it also accepts additional words. For example, the word $atc$ is accepted by the Factor (resp. Suffix) Oracle of $gaccattctc$ (see figure 2), whereas it is neither a factor nor a suffix of this sequence. We will see that the Suffix Oracle of $s$ recognizes exactly all suffixes of words from $E(s)$, and we will use this result to prove that the Factor Oracle of $s$ recognizes exactly all factors of words from $E(s)$.

We use the following Lemmas to prove the result concerning Suffix Oracles. These Lemmas have been proved in [4].

Lemma 3 [4] Given a word $s \in \Sigma^*$ and an integer $i$ ($0 \leq i \leq |s|$), then $min(e_i)$ is a suffix of all words recognized at state $e_i$ in the Oracle of $s$.

Lemma 4 [4] Given a word $s \in \Sigma^*$ and a factor $w$ of $s$, then $w$ is recognized at a state $e_i$ ($1 \leq i \leq poccur_s(w)$) in the Oracle of $s$.

Lemma 5 [4] Given a word $s \in \Sigma^*$ and an integer $i$ ($0 \leq i \leq |s|$), then every path ending by $min(e_i)$ in the Oracle of $s$ leads to a state $e_j$ such that $j \geq i$.

Lemma 6 [4] Given a word $s \in \Sigma^*$ and a word $w \in \Sigma^*$ accepted at state $e_i$ ($0 \leq i \leq |s|$) in the Oracle of $s$, then every suffix of $w$ is also recognized by the Oracle at a state $e_j$ such that $j \leq i$.

The proof of Lemma 6 is only given in [4] for Factor Oracles. We need this result to be true for Suffix Oracles.

Proof. The original Lemma gives us that if $x$ is a suffix of $w$, then $State(x) \leq State(w)$. We need to prove that if $State(w)$ is final, then $State(x)$ is final. Therefore, we have to consider two cases. When $|x| \geq |min(e_i)|$, we have $min(e_i) \in Suff(x)$ and thus, according to Lemma 5, $State(x) \geq State(min(e_i))$. Since $State(min(e_i)) = e_i = State(w)$, we conclude that $State(x) = State(w)$. When $|x| < |min(e_i)|$, since the state $e_i$ is final, there exists a suffix $t$ of $s$ such that $State(t) = e_i$. According to Lemma 3, we conclude that $min(e_i) \in Suff(t) \subseteq Suff(s)$. Since $x$ and $min(e_i)$ are suffixes of $w$, then $|x| < |min(e_i)| \Rightarrow x \in Suff(min(e_i))$. So $x$ is also a suffix of $s$ and, according to the Definition of the Suffix Oracle, $State(x)$ is final.



We use these previous results to show the two following Lemmas.

Lemma 7 Given a word $s \in \Sigma^*$, a Canonical Factor $f$ of $s = ufv$ ($u, v \in \Sigma^*$) such that $f$ is not repeated in $uf$ and a set $C$ of contractions ($C \subseteq \mathcal{C}_s$) such that $Word(uf, C) = wf$, then the Oracle of $s$ accepts $wf$ and $f$ at the same state.

Proof. We denote by $C_i \subseteq \mathcal{C}_s$ a set of contractions of cardinality $i$. In the same way, we denote by $w_i f$ the word obtained when we apply the contractions $C_i$ to $uf$ (warning: $w_i f = Word(uf, C_i)$; $w_i = Word(u, C_i)$). By induction on the size of $C_i$, we show that $State(Word(uf, C_i)) = State(f)$ for all $C_i \subseteq \mathcal{C}_s$. Let $e_x = State(f)$ ($f = min(e_x)$ by Definition of $f$) and $e_{x_i} = State(Word(uf, C_i))$.

If we consider $C_0$, then $Word(uf, C_0) = uf$. According to Lemma 5, $x_0 \geq x$. Furthermore, according to Lemma 4 applied to $uf$, we have $x_0 \leq poccur_s(uf)$. However, by Definition of $f$, $poccur_s(f) = |uf| = poccur_s(uf)$. Therefore we have $x_0 \leq x$ and $x_0 = x$.

If this lemma is true for a set $C_i \subset \mathcal{C}_s$ of contractions, then it is true for a set $C_{i+1} = C_i \cup \{(p, q)\}$. We assume without loss of generality (see figure 1) that $(p, q)$ is the last contraction in $C_{i+1}$ (by ascending order over positions). Let $b$ be the Canonical Factor used by this contraction. We can write $uf = s[1..p - 1]\, s[p..q - 1]\, s[q..|uf|]$. Since $(p, q)$ is chosen as the last contraction, all contractions in $C_i$ are applicable to $s[1..p - 1]$. So we define $a, c \in \Sigma^*$ such that $w_i f = a\, s[p..|uf|] = abc$ and $d \in \Sigma^*$ such that $w_{i+1} f = a\, s[q..|uf|] = abd$. Also $ab = Word(s[1..p - 1]\, b, C_i)$: the opposite would mean that the contraction $(p, q)$ cannot be operated from $b$, and according to the induction hypothesis, $State(ab) = State(b)$. From this, we conclude that $State(abc) = State(bc)$ and $State(abd) = State(bd)$. We know that $bd$ ($= s[q..|uf|]$) is a suffix of $bc$ ($= s[p..|uf|]$) and according to Lemma 6:

$State(bd) \leq State(bc)$
$\Leftrightarrow State(abd) \leq State(abc)$
$\Leftrightarrow State(w_{i+1} f) \leq State(w_i f)$
$\Leftrightarrow State(w_{i+1} f) \leq State(f)$

However, according to Lemma 5, we have $State(w_{i+1} f) \geq State(f)$ and therefore $State(w_{i+1} f) = State(f)$. This lemma is then true for all $C_i \subseteq \mathcal{C}_s$.



Fig. 3. Illustration of Lemma 7 (states $e_0$, $e_i$ and $e_{|s|}$, with $f = min(e_i)$ and $wf = Word(uf, C)$).

Now, we show how to obtain a contraction, starting with a transition of type $e_i \to e_j$ with $j > i + 1$ (that we call an external transition in the sequel).

Lemma 8 Given a word $s \in \Sigma^*$ and an integer $i$ ($0 \leq i < |s|$) such that $\#_{out}(e_i) > 1$, let $p$ be an external transition from state $e_i$ to state $e_{i+j}$ ($j > 1$) labeled by $\alpha$ and $u = min(e_i)$. Then an occurrence of $u\alpha$ exists at position $(i + j - |u|)$ of $s$ (see figure 5). Moreover, the contraction $(i - |u| + 1, i - |u| + j)$ of $s$ by $u$ exists too.

Proof. According to the construction algorithm (see Algorithm 1), the transition $p$ is added from $e_i$ to $e_{i+j}$ because a position $j$ in $s[i - |u| + 1..|s|]$ is such that $j = poccur_{s[i-|u|+1..|s|]}(u\alpha) - |u|$. We also have $u\alpha \in Fact(s)$ because $u\alpha \in Fact(s[i - |u| + 1..|s|])$. Cleophas et al. [11] proved that since $u = min(e_i)$ and $u\alpha \in Fact(s)$, then $i - |u| + poccur_{s[i-|u|+1..|s|]}(u\alpha) = poccur_s(u\alpha)$. We have $i + j = poccur_s(u\alpha)$ and finally $s[i + j - |u|..i + j] = u\alpha$.



We use the algorithm Contractor (see Algorithm 2) to give a characterization of the language accepted by the Oracle of a word $s$. Given a word $s \in \Sigma^*$ and its Suffix Oracle $SO(s)$, the initial inputs of Contractor are a word $w$ accepted by $SO(s)$ and $t$, the maximal suffix of $s$ beginning with $w[1]$. This algorithm outputs a set $C \subseteq \mathcal{C}_s$ [...]

Algorithm 2: Contractions needed to transform $s = t't$ ($t', t \in \Sigma^*$) into $t'w$

 1  Initialization: $S_0 = t$, $S^w_0 = w$, $C_0 = \emptyset$, $sdec = |s| - |t|$    % $t$ is a suffix of $s$ %
 2
 3  Input:  $S_i \in \Sigma^*$       % A suffix of $s$ that can still be "contracted" %
 4          $S^w_i \in \Sigma^*$     % The word to process %
 5          $C_i$                    % Set of contractions %
 6  Output: a set of contractions
 7
 8  Begin
 9      $p_i \leftarrow$ longest common prefix between $S_i$ and $S^w_i$    (Claim 1, item 1)
10      $e_{q_i} \leftarrow State(p_i)$    (Claim 1, item 2)
11      $f_i \leftarrow min(e_{q_i})$
12      If ($|p_i| < |S^w_i|$) Then
13          $e_{r_i} \leftarrow Transition(e_{q_i}, S^w_i[|p_i| + 1])$    (Claim 1, item 4)
14          $C_{i+1} \leftarrow C_i \cup \{c_i\}$, $c_i = (q_i - |f_i| + 1, r_i - |f_i|)$    (Claim 2, item 2)
15          $S^w_{i+1} \leftarrow S^w_i[|p_i| - |f_i| + 1..|S^w_i|]$    (Claim 1, item 3)
16          $S_{i+1} \leftarrow t[r_i - |f_i| - sdec..|t|]$    (Claim 1, item 3)
17          Return Contractor($S_{i+1}$, $S^w_{i+1}$, $C_{i+1}$)
18      Else
19          If ($|S_i| > |S^w_i|$) Then
20              $C_{i+1} \leftarrow C_i \cup \{c_i\}$, $c_i = (q_i - |f_i| + 1, |s| - |f_i| + 1)$    (Claim 3)
21          Else
22              $C_{i+1} \leftarrow C_i$    (Claim 3)
23          End If
24          Return $C_{i+1}$
25      End If
26  End
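Algorithm 2 can be prototyped as follows. This is a hedged sketch: the recursion is unrolled into a loop, `build_oracle` is our own quadratic transcription of Algorithm 1, and the helper names (`lcp`, `state_of`, `contractor`, `word`) are ours. The word $w$ is assumed to be non-empty and accepted by $SO(s)$; the behaviour on other inputs is not meaningful. The contractions are expressed with positions of $s$ (as in lines 14 and 20), so the result is checked through $Word(s, C) = t'w$.

```python
def build_oracle(s):
    """(trans, minword) of the Oracle of s, by a quadratic reading of Algorithm 1."""
    n = len(s)
    trans = [{} for _ in range(n + 1)]
    minword = [None] * (n + 1)
    minword[0] = ""
    for i in range(n):
        u = minword[i]
        targets = {s[i]: i + 1}                     # main transition
        tail = s[i - len(u):]                       # s[i-|u|+1..|s|]
        for alpha in set(s) - {s[i]}:
            k = tail.find(u + alpha)
            if k >= 0:
                targets[alpha] = i + k + 1          # external transition
        for alpha, dst in targets.items():
            trans[i][alpha] = dst
            if minword[dst] is None or len(u) + 1 < len(minword[dst]):
                minword[dst] = u + alpha
    return trans, minword


def lcp(a, b):
    """Longest common prefix of two strings."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return a[:k]


def state_of(trans, x):
    """Index of the state reached by reading x from e_0 (x must label a path)."""
    q = 0
    for letter in x:
        q = trans[q][letter]
    return q


def contractor(s, w):
    """Contractions (1-based positions of s) turning s = t't into t'w."""
    trans, minword = build_oracle(s)
    sdec = s.find(w[0])                 # |t'|: t is the longest suffix starting with w[1]
    t = s[sdec:]
    S, Sw, C = t, w, []
    while True:
        p = lcp(S, Sw)                                      # line 9
        q = state_of(trans, p)                              # line 10
        f = minword[q]                                      # line 11
        if len(p) < len(Sw):                                # line 12
            r = trans[q][Sw[len(p)]]                        # line 13
            C.append((q - len(f) + 1, r - len(f)))          # line 14
            Sw = Sw[len(p) - len(f):]                       # line 15
            S = t[r - len(f) - sdec - 1:]                   # line 16
        else:
            if len(S) > len(Sw):                            # line 19
                C.append((q - len(f) + 1, len(s) - len(f) + 1))   # line 20
            return C


def word(s, C):
    """Word(s, C); empty pieces are allowed because C need not be minimal."""
    pieces, start = [], 1
    for p, q in sorted(C):
        pieces.append(s[start - 1:p - 1])
        start = q
    pieces.append(s[start - 1:])
    return "".join(pieces)
```

On $s = gaccattctc$ and $w = atc$, the sketch returns three contractions, two of which are chained, which illustrates that the returned set is coherent but not necessarily minimal.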



By Definition 7, given a word $s \in \Sigma^*$ such that $s = t't$ ($t', t \in \Sigma^*$) and a set $C \subseteq \mathcal{C}_{s,t}$ of contractions, $t'w = Word(s, C)$ is then a concatenation of substrings of $s$. These substrings can be seen as prefixes of suffixes of $s$. A contraction is then a jump from one substring to the next one. The main idea of this algorithm is to read the sequences $t$ and $w$ from left to right, in order to compute the longest common prefixes between given suffixes of $t$ and $w$. After each stage, the algorithm computes the length of the jump to the next suffix of $t$ to consider.

Inputs are words $S_i$, $S^w_i$ ($i \geq 0$) and a set $C_i$ of contractions. We initialize $S_0 = t$, $S^w_0 = w$, $C_0 = \emptyset$ and $p_i$ (line 9) as the longest common prefix of $S_i$ and $S^w_i$. So:

$S_i = p_i S'_i$ and $S^w_i = p_i S'^w_i$.    (1)

Let $e_{q_i} = State(p_i)$ and $f_i = min(e_{q_i})$ (lines 10 and 11); Lemma 3 provides:

$p_i = p'_i f_i$ ($p'_i \in \Sigma^*$).    (2)

The variable $e_{r_i}$ (line 13) is the state reached by the transition from $e_{q_i}$ labeled by $\alpha = S^w_i[|p_i| + 1] = S'^w_i[1]$. The set $C_{i+1}$ has cardinality $i + 1$. The variable $sdec = |s| - |t|$ is necessary to compute $S_{i+1}$ (line 16), because each state $e_i$ is linked to the $i$-th character of $s$, not to the character $(i - |s| + |t|)$ of $t$.

Claim 1 The following assertions are true for all $i \geq 0$ (see figure 4):

1. $f_i \alpha \in Pref(p_{i+1})$.
2. $S_i = t[q_i - |p_i| + 1 - sdec..|t|]$.
3. $S_{i+1}$ and $S^w_{i+1}$ are respectively suffixes of $S_i$ and $S^w_i$; $S_i$ and $S^w_i$ ($i \geq 0$) are respectively suffixes of $t$ and $w$.
4. Transition from $e_{q_i}$ to $e_{r_i}$ [...]

Fig. 4. Visualization of Contractor on $S_i$ and $S^w_i$.

Proof.

1. Since $f_i = min(e_{q_i})$ and according to Lemma 8, we have $s[r_i - |f_i|..r_i] = t[r_i - |f_i| - sdec..r_i - sdec] = f_i \alpha$. So $S_{i+1}$ and $S^w_{i+1}$ begin with $f_i \alpha$ (lines 15 and 16).

2. For $i = 0$ (initialization case), $S_0 = t$ and $t$ is the longest suffix of $s$ beginning with $w[1]$. If $e_x = State(S_0[1])$ ($x > 0$), then $t[x - sdec..|t|] = S_0$ and $State(p_0) = e_{x + |p_0| - 1} = e_{q_0}$. Thus we conclude that $S_0 = t[q_0 - |p_0| + 1 - sdec..|t|]$. At iteration $i$, we have $S_{i+1} = t[r_i - |f_i| - sdec..|t|]$ (line 16). Since $S^w_{i+1}$ begins with $f_i \alpha$ (item 1), we conclude that $q_{i+1} = r_i + |p_{i+1}| - |f_i| - 1$ and finally we have $S_{i+1} = t[r_i - |f_i| - sdec..|t|] = t[q_{i+1} - |p_{i+1}| + 1 - sdec..|t|]$.

3. This assertion is true for $S^w_i$ because $S^w_{i+1}$ is a suffix of $S^w_i$ by construction (line 15) and $S^w_0 = w$. Concerning $S_i$, we have $S_0 = t$, thus the property is true for $i = 0$. Suppose that $S_i$ is a suffix of $t$; from item 2, we have $S_i = t[q_i - |p_i| + 1 - sdec..|t|]$. We also have $S_{i+1} = t[r_i - |f_i| - sdec..|t|]$ (line 16). So we conclude using Eq. (2) that $q_i - |p_i| = q_i - |p'_i| - |f_i|$. Since $|p'_i| \geq 0$, we have $q_i - |f_i| \geq q_i - |p_i|$ and $r_i > q_i$. Finally, we conclude that $r_i - |f_i| > q_i - |p_i|$ and that $S_{i+1}$ is a suffix of $S_i$.

4. According to item 3, $S^w_i$ is a suffix of $w$. Then $S^w_i$ is accepted by $O(s)$ (Lemma 6). Using Eq. (1) with $S'^w_i[1] = \alpha$, the transition must exist and implies $\#_{out}(e_{q_i}) \geq 2$. So $f_i = min(e_{q_i}) \in F_s$ by Definition of Canonical Factors.



From Eq. (1) and Claim 1 (item 4):

$t = t_i S_i$ ($t_i \in \Sigma^*$),
$w = w_i S^w_i = w_i p'_i S^w_{i+1}$ ($w_i \in \Sigma^*$).    (3)

Fig. 5. Illustration of a step in the algorithm Contractor ($\alpha = S^w_i[|p_i| + 1]$).

Claim 2 For all $i \geq 0$ (see figure 5):

1. $State(w_i p_i) = State(t_i p_i) = State(p_i) = e_{q_i}$.
2. $c_i$ is a contraction of $t_i S_i$ (resp. $w_i S_i$) provided by $f_i$. The result of this contraction is $t_i p'_i S_{i+1}$ (resp. $w_i p'_i S_{i+1}$).

Proof.

1. For $i = 0$, $t_i = w_i = \epsilon$. Let us suppose that the property is true for $i$ and show it for $i + 1$. From Claim 1 (item 2), we conclude that $S_i$ labels a path using only main transitions (i.e. transitions of type $e_j \to e_{j+1}$) from $e_{q_i - |p_i|}$ to $e_{|s|}$ in $O(s)$. According to Claim 1 (item 3) we conclude that:

$S_i = u S_{i+1}$ ($u \in \Sigma^*$).    (4)

So, the state $e_x$ ($x > q_i - |p_i|$) is such that $S_{i+1}$ labels a path using only main transitions from $e_x$ to $e_{|s|}$ in $O(s)$, and more precisely $x = r_i - |f_i| - 1$. We have $t_{i+1} = t_i u$ (Eq. (3) and (4)) and $State(t_i u) = e_x$. Since $f_i = min(e_{q_i})$ and there is a transition from $e_{q_i}$ to $e_{r_i}$ labeled by $\alpha$ (Claim 1, item 4), $State(t_{i+1} f_i \alpha) = State(t_i u f_i \alpha) = State(f_i \alpha) = e_{r_i}$ and furthermore, $p_{i+1} = f_i \alpha v$ ($v \in \Sigma^*$). So we have $State(t_{i+1} f_i \alpha v) = State(t_{i+1} p_{i+1}) = State(p_{i+1})$.

2. From Eq. (1), (2) and (3), we conclude that:

$t = t_i S_i = t_i p'_i f_i S'_i$.    (5)

Since $S_{i+1} \in Suff(S_i)$, we have $S_i = u S_{i+1}$ ($u \in \Sigma^+$) and from Eq. (5), $t_i p'_i f_i S'_i = t_i u S_{i+1}$. According to Claim 1 (item 1), we have $t_i p'_i f_i S'_i = t_i u f_i \alpha u'$ ($u' \in \Sigma^*$). Since we have $State(t_i p'_i f_i) = State(f_i)$ and $|u| > |p'_i|$ (because $S'_i[1] \neq \alpha$), we can contract $t_i S_i$ by $f_i$ and $t_i p'_i f_i \alpha u' = t_i p'_i S_{i+1}$. We can conclude that $w_i S_i = w_i p_i S'_i$ is contracted by $f_i$ in $w_i p'_i S_{i+1}$ because $State(w_i p_i) = State(t_i p_i)$. According to Eq. (3), we conclude that $w_{i+1} = w_i p'_i$ and $w_i p'_i S_{i+1} = w_{i+1} S_{i+1}$.



Claim 2 shows that a contraction $c_i$ of $t_i S_i$ by $f_i$ is also a contraction for $t$ and for $w_i S_i$. Consider the $i$-th iteration: we have $|S^w_i| > |S^w_{i+1}|$, or $|S^w_i| = |S^w_{i+1}|$ and $|p_{i+1}| > |p_i|$ (if $f_i = p_i$). Since $|p_i| > 0$, we ensure that the recursion stops at an iteration $j > i$ with $p_j = S^w_j$.

Claim 3 Given an integer $i \geq 0$ such that $p_i = S^w_i$, then $t$ needs a last contraction if and only if $|S^w_i| \neq |S_i|$.

Proof. The word obtained with the set $C_i$ of contractions is $w_i p_i S'_i$ (Claim 2, item 2). If $S^w_i = S_i$, then we have $S'_i = \epsilon$ and we conclude that $C_i$ is complete (line 22). If $S^w_i \neq S_i$, according to Claim 2 (item 1), we have $State(w_i p_i) = e_{q_i}$. By Definition of the final states in a Suffix Oracle, $min(e_{q_i}) \in Suff(w)$ (Lemma 3) and $min(e_{q_i}) \in Suff(t)$. A last contraction is needed to complete the set of contractions (line 20).



Considering the $i$-th call of Contractor, the inputs are $S_i = p_i S'_i$, $S^w_i = p_i S'^w_i$ and $C_i$. This set is the set of contractions necessary to transform $t_i \in Pref(t)$ into $w_i \in Pref(w)$. The variable $p_i$ refers to the longest common prefix of $S_i$ and $S^w_i$ ($p_i$ is a factor of both $t$ and $w$, Claim 1). Two cases may occur for this call (line 12). When $|p_i| = |S^w_i|$, the recursion ends. Otherwise, $|p_i| \neq |S^w_i|$ and at least another contraction is necessary until $|p_j| = |S^w_j|$ ($j > i$). From Claim 2, the $i$-th contraction allows to continue with the suffix $S_{i+1}$. At the end of the process (i.e. the end of [...]

Lemma 9 Given a word $s \in \Sigma^*$, its Suffix Oracle, a word $w \in \Sigma^*$ accepted by $SO(s)$ and the longest suffix $t$ of $s$ such that $w[1] = t[1]$, then Contractor($t, w, \emptyset$) outputs a set $C$ such that $w = Word(t, C)$.

Proof. Let $i \geq 0$ be such that $S^w_{i+1} = p_{i+1}$. According to Claim 2, we conclude that $C_{i+1}$ is a coherent set of contractions of $t$. Since $p_{i+1}$ is a prefix of $S_{i+1}$, we have:

$Word(t, C_{i+1}) = w_{i+1} S_{i+1} = w_{i+1} S^w_{i+1} u = w_{i+1} p_{i+1} u$ ($u \in \Sigma^*$).

If $u = \epsilon$, we have $Word(t, C_{i+1}) = w$ (Eq. (3)). Otherwise (Claim 3) a last contraction $c_{i+1}$ by $f_{i+1}$ transforms $w_{i+1} S^w_{i+1} u$ into $w_{i+1} S^w_{i+1} = w = Word(t, C_{i+1} \cup \{c_{i+1}\})$. Finally, Contractor provides a set $C$ such that $w = Word(t, C)$.



We can notice that:

1. $C$ is not always minimal.
2. $C$ is coherent. Let $(a, b)$ and $(c, d)$ be two contractions successively added to $C$. We have $a < b$ and $c < d$ because $r_i > q_i$ and $|s| > q_i$ (lines 14 and 20). If $e_{q_{i+1}} = State(p_{i+1}) = e_{r_i}$, then $b = c$; else $p_{i+1} = f_i \alpha v$ ($\alpha = S^w_i[|p_i| + 1]$, $v \neq \epsilon$) and $e_{q_{i+1}} > e_{r_i}$, therefore $b < c$.

The following theorems are the main purpose of this paper:

Theorem 1 Exactly all suffixes of words from $E(s)$ are recognized by the Suffix Oracle of $s$.

Proof.

First inclusion: each suffix of words from $E(s)$ is recognized by the Suffix Oracle of $s$.

According to Lemma 6, if $w$ is accepted by $SO(s)$, each suffix of $w$ is also accepted by $SO(s)$. So we only need to prove that each word from $E(s)$ is accepted by $SO(s)$. Let $C \subseteq \mathcal{C}_s$ be a set of contractions applicable to $s$ and $w = Word(s, C)$. The set $C_i$ is the set of the first $i$ contractions of $C$ (chosen without loss of generality by ascending order over positions, see figure 1), $(x_j, y_j)$ is the $j$-th contraction, which uses the Canonical Factor $f_j$ ($1 \leq j \leq i$), and $w_j = Word(s, C_j)$. The property $(P)$ to prove is that if $w_i$ ($0 \leq i < |C|$) is accepted by $SO(s)$, then $w_{i+1}$ is accepted too. We have:

$w_i = s[1..x_1 - 1]\, s[y_1..x_2 - 1] \ldots s[y_i..|s|]$,
$s[y_i..y_i + |f_i| - 1] = f_i$.

By Definition of the Canonical Factors, $f_{i+1}$ does not occur in $s$ before position $x_{i+1}$ ($x_{i+1} > y_i$). We have $w_i = v f_{i+1} u$ and $w_{i+1} = v f_{i+1} u'$ with $v = s[1..x_1 - 1]\, s[y_1..x_2 - 1] \ldots s[y_i..x_{i+1} - 1]$ and $f_{i+1} u = u'' f_{i+1} u'$ ($u'' \in \Sigma^+$).

Considering the contraction $(x_{i+1}, y_{i+1})$, we have $|s| - |f_{i+1} u| + 1 = x_{i+1}$ and $|s| - |f_{i+1} u'| + 1 = y_{i+1}$ (because the contractions are in ascending order). The word $s$ is then not yet modified after position $x_{i+1}$, so $f_{i+1} u$ and $f_{i+1} u'$ are suffixes of $s$. Consider the state $q = State(f_{i+1})$ of $SO(s)$; according to Lemma 7:

$State(v f_{i+1}) = q$.    (6)

The suffix $f_{i+1} u'$ of $s$ is necessarily recognized by $SO(s)$. So the path accepting this suffix in $SO(s)$ goes through the state $q$. Starting from $q$, we can read $u'$ and reach a final state and therefore, according to Eq. (6), $SO(s)$ also accepts the word $w_{i+1} = v f_{i+1} u'$. Finally, the property $(P)$ is true for all $i$ ($0 \leq i < |C|$) and since $w_0 = s$, we can prove by induction on $i$ that the Suffix Oracle of $s$ recognizes all words $w_i$ ($0 \leq i \leq |C|$). Lemma 6 allows to conclude that each suffix of words from $E(s)$ is recognized by $SO(s)$.

Converse inclusion: each word recognized by the Suffix Oracle of $s$ is a suffix of a word from $E(s)$.

Let $w$ be a word accepted by the Suffix Oracle of $s$ and $t$ be the longest suffix of $s = t't$ ($t', t \in \Sigma^*$) beginning with $w[1]$. Then a set $C$ of contractions such that $t'w = Word(t't, C)$ exists (Lemma 9). Since the word $w$ is a suffix of $t'w$ and $t'w \in E(s)$, each word accepted by $SO(s)$ is a suffix of a word from $E(s)$.



On the basis of Theorem 1 we give a similar result, which holds for Factor Oracles instead of Suffix Oracles.

Theorem 2 Exactly all factors of words from $E(s)$ are recognized by the Factor Oracle of $s$.

Proof.

First inclusion: each factor of words from $E(s)$ is recognized by the Factor Oracle of $s$.

Let $SO(s)$ be the Suffix Oracle of $s$, $u \in E(s)$ and $m$ a factor (i.e. a prefix of a suffix) of $u$, where $mv$ ($v \in \Sigma^*$) is a suffix of $u$. Then $mv$ is accepted by $SO(s)$ (Theorem 1). Thus, a path $(e_0 \to e_{x_1} \to \ldots \to e_{x_{|mv|}})$ exists in $SO(s)$, which recognizes $mv$. Therefore, we conclude that $m$ labels a path $(e_0 \to e_{x_1} \to \ldots \to e_{x_{|m|}})$ (with $e_{x_{|m|}}$ final in $FO(s)$).

Converse inclusion: each word recognized by the Factor Oracle of $s$ is a factor of a word from $E(s)$.

Let $SO(s)$ be the Suffix Oracle of $s$ and $m$ a word accepted by $FO(s)$. If $SO(s)$ recognizes $m$ then $m$ is a suffix of a word from $E(s)$ (Theorem 1). Suppose that $SO(s)$ does not recognize $m$. Then $FO(s)$ recognizes $m$ at state $e_{x_{|m|}}$ (not final in $SO(s)$). Furthermore, the path $(e_0 \to e_1 \to \ldots \to e_{|s|})$ exists in $O(s)$ and $e_{|s|}$ is final in $SO(s)$. We conclude that a path from $e_{x_{|m|}}$ to $e_{|s|}$ exists in $SO(s)$. Then, the word $m$ is a prefix of a word recognized by $SO(s)$ and therefore $m$ is a prefix of a suffix of some $u \in E(s)$. Thus, $m$ is a factor of a word of $E(s)$.
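Both theorems can be checked by brute force on the running example. The sketch below is only an illustration under stated assumptions: `build_oracle` is our quadratic transcription of Algorithm 1, `run` and `language` are hypothetical helper names, the closure `E` is copied from the Example after Definition 8, and the empty word is counted as a suffix and as a factor.

```python
def build_oracle(s):
    """Transitions of the Oracle of s, by a quadratic reading of Algorithm 1."""
    n = len(s)
    trans = [{} for _ in range(n + 1)]
    minword = [None] * (n + 1)
    minword[0] = ""
    for i in range(n):
        u = minword[i]
        targets = {s[i]: i + 1}                     # main transition
        tail = s[i - len(u):]                       # s[i-|u|+1..|s|]
        for alpha in set(s) - {s[i]}:
            k = tail.find(u + alpha)
            if k >= 0:
                targets[alpha] = i + k + 1          # external transition
        for alpha, dst in targets.items():
            trans[i][alpha] = dst
            if minword[dst] is None or len(u) + 1 < len(minword[dst]):
                minword[dst] = u + alpha
    return trans


def run(trans, w):
    """State reached by reading w from e_0 (w must label a path)."""
    q = 0
    for letter in w:
        q = trans[q][letter]
    return q


def language(trans, finals):
    """All words labelling a path from e_0 to a final state (the automaton is acyclic)."""
    words, stack = set(), [(0, "")]
    while stack:
        q, w = stack.pop()
        if q in finals:
            words.add(w)
        stack.extend((dst, w + letter) for letter, dst in trans[q].items())
    return words


s = "gaccattctc"
E = {"gac", "gacatc", "gacatctc", "gacattc", "gacattctc", "gaccatc", "gaccatctc",
     "gaccattc", "gaccattctc", "gactc", "gatc", "gatctc", "gattc", "gattctc"}
trans = build_oracle(s)
fo = language(trans, set(range(len(s) + 1)))                       # Factor Oracle
so = language(trans, {run(trans, s[k:]) for k in range(len(s) + 1)})  # Suffix Oracle
suffixes = {u[k:] for u in E for k in range(len(u) + 1)}
factors = {u[i:j] for u in E for i in range(len(u) + 1) for j in range(i, len(u) + 1)}
print(so == suffixes, fo == factors)
```

Comparing the sets directly, rather than only their sizes, checks both inclusions of each theorem on this word.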



4. Properties upon Oracles & Future Works

According to Cleophas et al. [11], the Oracle is not minimal in number of transitions among the set of homogeneous automata. Furthermore, if we consider the set of homogeneous automata that recognize at least all factors (resp. suffixes) of $s$ and that have the same number of states and at most the same number of transitions as the Factor (resp. Suffix) Oracle, we show that the Oracle is not minimal in the number of accepted words. The Oracle of $axttyabcdeatzattwu$ (see figure 6) has 35 transitions; the Factor Oracle accepts 247 words and the Suffix Oracle accepts 39 words. However, another homogeneous automaton (see figure 7) exists, which recognizes at least all factors (resp. suffixes) of $axttyabcdeatzattwu$ and which has only 34 transitions. The Factor version of this automaton recognizes only 236 words and its Suffix version accepts only 30 words. Moreover, we thus provide an example that has both fewer transitions and fewer accepted words than the corresponding Oracle.

Fig. 6. Factor Oracle of the word $axttyabcdeatzattwu$.

Fig. 7. This automaton (considering only the continuous lines) accepts all factors of the word $axttyabcdeatzattwu$. The bold transition (from $e_1$ to $e_3$) is the only one that is not present in the Factor Oracle of this word (see figure 6), though the two dotted ones (from $e_1$ to $e_{12}$ and from $e_{12}$ to $e_{16}$) are present in the Factor Oracle, but not in this automaton.

In some cases, we observe that the number of words accepted by Oracles does not allow confidence in this structure when it is used to detect factors or suffixes of words. Even if the number of false positives can sometimes be equal to 0 (e.g. $aaaaaa\ldots$), it can also be exponential. Indeed, we can build a word $s$ such that each subset of $C_s$ is coherent and minimal. For example, for $s = aabbccddee\ldots$, the set $C_s$ of contractions which are available on such a word is $\{(1, 2), (3, 4), \ldots, (|s| - 1, |s|)\}$. If we consider any non-empty subset $C \subseteq (C_s \setminus \{(1, 2)\})$ of contractions, it is easy to notice that $Word(s, C) \notin Fact(s)$. Besides, all words obtained from such subsets are pairwise different.

Numberof subsetsis:

|C

s

|−1

X

i=1

|C

s

| − 1

i



=

|s|

2

−1

X

i=1



|s|

2

− 1

i



= 2

|s|

2

−1

− 1

. Then,

thenumberof words,whi h area eptedbytheOra lesand arenotfa tor/sux of

s

,is

O 2

|s|



.
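As a quick sanity check of this count (a worked instance added for illustration, not part of the original derivation), take $s = aabbccddee$, so that $|s| = 10$ and $|C_s| = 5$:

```latex
\sum_{i=1}^{4} \binom{4}{i}
  = \binom{4}{1} + \binom{4}{2} + \binom{4}{3} + \binom{4}{4}
  = 4 + 6 + 4 + 1
  = 15
  = 2^{4} - 1
  = 2^{\frac{|s|}{2} - 1} - 1 .
```

The identity used is simply $\sum_{i=0}^{m} \binom{m}{i} = 2^{m}$ with the term $i = 0$ (the empty subset) removed.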

To make better use of this structure, we need to improve it or to modify it slightly. However, better knowledge about the Oracle structure would be useful for future work. Indeed, it could be interesting to have an empirical or a statistical estimation of the accuracy of the Oracle (time and quality of the results) when it is substituted for Tries or Suffix Trees in algorithms.
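As a very small illustration of what such an empirical estimate might look like, the sketch below counts by brute force the words accepted by the Factor Oracle that are not factors of the word, for a few words of the form $aabbcc\ldots$. It assumes the standard on-line construction from [4, 5]; the helper names (`factor_oracle`, `accepted_words`, `false_positives`) are hypothetical and chosen for this example only. The last printed column is the closed form $2^{\frac{|s|}{2}-1} - 1$ from the text, shown for comparison; the measured count need not equal it, since the Oracle may accept other non-factors as well.

```python
from string import ascii_lowercase


def factor_oracle(word):
    """Transitions of the Factor Oracle of `word` (on-line construction, assumed from [4, 5])."""
    delta = [dict() for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)
    for i, letter in enumerate(word, start=1):
        delta[i - 1][letter] = i                  # internal transition
        k = supply[i - 1]
        while k > -1 and letter not in delta[k]:
            delta[k][letter] = i                  # external transition
            k = supply[k]
        supply[i] = 0 if k == -1 else delta[k][letter]
    return delta


def accepted_words(delta):
    """Enumerate every word accepted when all states are final (empty word included)."""
    stack = [(0, "")]
    while stack:
        state, prefix = stack.pop()
        yield prefix
        for letter, target in delta[state].items():
            stack.append((target, prefix + letter))


def false_positives(word):
    """Number of accepted words that are not factors of `word`."""
    n = len(word)
    factors = {word[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    return sum(1 for w in accepted_words(factor_oracle(word)) if w not in factors)


for pairs in (3, 4, 5):
    s = "".join(2 * c for c in ascii_lowercase[:pairs])   # aabbcc, aabbccdd, ...
    print(s, false_positives(s), 2 ** (len(s) // 2 - 1) - 1)
```

Brute-force enumeration is only practical for short words, since the number of accepted words itself grows exponentially on this family; for longer inputs one would count paths instead of listing them.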

5. Acknowledgments

We would like to thank Céline Barré for improving the English writing of this article. We also thank Irena Rusu for useful comments and for improving earlier versions of this article.


References

1. D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (Cambridge University Press, 1997).

2. A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen and J. Seiferas, The smallest automaton recognizing the subwords of a text, Theoret. Comput. Sci. 40 (1985) 31–55.

3. E. Ukkonen, On-line construction of suffix trees, Algorithmica 14 (1995) 249–260.

4. C. Allauzen, M. Crochemore and M. Raffinot, Factor oracle, Suffix oracle, Technical Report 99-08, Institut Gaspard-Monge, Université de Marne-la-Vallée, 1999.

5. C. Allauzen, M. Crochemore and M. Raffinot, Factor Oracle: A New Structure for Pattern Matching, Conference on Current Trends in Theory and Practice of Informatics, 1999, pp. 295–310.

6. C. Allauzen, M. Crochemore and M. Raffinot, Efficient Experimental String Matching by Weak Factor Recognition, Proc. of the 12th Conference on Combinatorial Pattern Matching, Lecture Notes in Comput. Sci. 2089 (2001) 51–72.

7. A. Lefebvre and T. Lecroq, Computing repeated factors with a factor oracle, Proc. of the 11th Australasian Workshop On Combinatorial Algorithms, eds. L. Brankovic and J. Ryan (Hunter Valley, Australia, 2000) pp. 145–158.

8. A. Lefebvre and T. Lecroq, A heuristic for computing repeats with a factor oracle: application to biological sequences, International Journal of Computer Mathematics 79 (2002) 1303–1315.

9. A. Lefebvre, T. Lecroq, H. Dauchel and J. Alexandre, FORRepeats: detects repeats on entire chromosomes and between genomes, Bioinformatics 19 (2003) 319–326.

10. A. Lefebvre and T. Lecroq, Compror: on-line lossless data compression with a factor oracle, Information Processing Letters 83 (2002) 1–6.

11. L. Cleophas, G. Zwaan and B. Watson, Constructing Factor Oracles, Proc. of the 3rd

