HAL Id: hal-00487228
https://hal.archives-ouvertes.fr/hal-00487228
Submitted on 28 May 2010
Combinatorial Characterization of the Language
Recognized by Factor and Suffix Oracles
Alban Mancheron, Christophe Moan
To cite this version:
Alban Mancheron, Christophe Moan. Combinatorial Characterization of the Language Recognized by
Factor and Suffix Oracles. International Journal of Foundations of Computer Science, World Scientific
Publishing, 2005, 16 (6), pp.1179-1191. ⟨10.1142/S0129054105003741⟩. ⟨hal-00487228⟩
© World Scientific Publishing Company

COMBINATORIAL CHARACTERIZATION OF THE LANGUAGE
RECOGNIZED BY FACTOR AND SUFFIX ORACLES

ALBAN MANCHERON
and
CHRISTOPHE MOAN
Laboratoire d'Informatique de Nantes-Atlantique, Université de Nantes,
B.P. 92208, 44322 Nantes Cedex 3, France
ABSTRACT
Sequence analysis requires elaborate data structures that allow both efficient storage and efficient use. A new one was introduced in 1999 by Cyril Allauzen, Maxime Crochemore and Mathieu Raffinot. This structure is linear in the size of the represented word, both in time and in space. It has the smallest possible number of states and it accepts at least all substrings of the represented word. This structure is called the Factor Oracle. The authors developed another structure on the basis of the Factor Oracle, which has the same properties except that it accepts at least all suffixes, instead of all factors, of the represented word. This structure is called the Suffix Oracle. The characterization of the language recognized by the Factor/Suffix Oracle of a given word is an open problem, for which we provide a solution. Using this result, we show that these structures may accept an exponential number of words that are not factors/suffixes of the given word.
Keywords: Factor Oracle, Suffix Oracle, automata, language, characterization.
1. Introduction

Several structures have been developed for text indexing: we can cite Tries [1], Suffix Automata [1, 2], Suffix Trees [1, 3]... Their objective is to represent a text or a word $s$ (i.e. a succession of symbols taken from an arbitrary alphabet denoted by $\Sigma$), in order to quickly determine whether this word contains some specific sub-word. This sub-word is then called a factor of $s$.

Allauzen & al. [4, 5, 6] described a method to build an acyclic automaton which accepts at least all factors of $s$, which has as few states as possible ($|s| + 1$) and which is linear in the number of transitions (at most $2|s| - 1$). When each state is final in this automaton, the structure is called a Factor Oracle. By keeping only particular states as final, this automaton becomes a Suffix Oracle. The construction algorithm is easy to understand and implement; the most efficient algorithm to build Suffix Trees [3] does not have this advantage. Oracles are homogeneous automata, i.e. all transitions ingoing to a same state are labeled with the same symbol. Thus, it is not necessary to label edges anymore. Therefore this structure requires less memory than Suffix Trees or Tries. Lefebvre & al. [7, 8, 9] used it for repeated motifs discovery over large genomic data and obtained in a few seconds results similar to the ones obtained by using thousands of blastn requests. The authors also used the Factor Oracle for text compression [10].

However, at least two open questions are linked to these Oracles: the first one is about the characterization of the language recognized by Oracles; the second one concerns the existence of an algorithm, linear in time and space, to build an automaton which accepts all factors/suffixes of a word $s$ and which is minimal in number of transitions.

When using these Oracles, the main difficulty is to distinguish true and false positives. Therefore we provide in the next section several definitions related to the construction of Oracles. We characterize the language recognized by this structure in Section 3. Finally, some results using Oracles are also described.
2. Definitions

In the following sections, we call $Fact(s)$ (resp. $Suff(s)$ and $Pref(s)$) the set of factors (resp. suffixes and prefixes) of $s \in \Sigma^+$. We call $Pref_s(i)$ the prefix of $s$ which has length $i \geq 0$. Given $x \in Fact(s)$, $Nb_s(x)$ is the number of occurrences of $x$ in $s$, and $x$ is repeated if and only if $Nb_s(x) \geq 2$.

Definition 1 Given a word $s \in \Sigma^+$ and a factor $x$ of $s$, we define the function $Pos$ as the position of the first occurrence of $x$ in $s = uxv$ ($u, v \in \Sigma^*$): $Pos_s(x) = |u| + 1$. We also define the function $poccur$ such that $poccur_s(x) = |ux| = Pos_s(x) + |x| - 1$.

2.1. Oracles
The Oracle construction is defined by the algorithm of Allauzen & al. [4] (see Algorithm 1). The authors gave another algorithm to build the same automaton in time linear in the size of $s$. However, since we are only interested in properties of the Oracle, we do not report it in this paper.

Definition 2 [4] Given a word $s \in \Sigma^*$, we define the Factor Oracle of $s$ as the automaton obtained by Algorithm 1, where all states are final. It is denoted by $FO(s)$. We define the Suffix Oracle of $s$ as the automaton obtained by the same algorithm, where a state $e_i$ ($0 \leq i \leq |s|$) is final if and only if there exists a path labeled by a suffix of $s$ from the initial state to the state $e_i$. It is denoted by $SO(s)$.

We use the term Oracle to designate either the Factor or the Suffix Oracle of a word $s$, and we denote it by $O(s)$. We define an order relation between states in these Oracles: given two states $e_i$ and $e_j$ such that $i \leq j$, we have $e_i \leq e_j$.
Algorithm 1: Construction of the Factor Oracle of a word

    Input:   Σ           % Alphabet (supposed minimal) %
             s ∈ Σ*      % The word to process %
    Output:  Oracle      % Factor Oracle of s %

    Begin
      Create the initial state labeled by e_0

      For i from 1 to |s| Do
        Create a state labeled by e_i
        Build a transition from the state e_{i-1} to the state e_i labeled by s[i]
      End For

      For i from 0 to |s| - 1 Do
        Let u be a word of minimal length recognized at state e_i
        For All α ∈ Σ \ {s[i+1]} Do
          If uα ∈ Fact(s[i-|u|+1..|s|]) Then
            j ← poccur_{s[i-|u|+1..|s|]}(uα) - |u|
            Build a transition from the state e_i to e_{i+j} labeled by α
          End If
        End For All
      End For
    End
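As an illustration, Algorithm 1 can be transcribed almost line by line. The sketch below is not the linear-time construction of [4]; it follows the definition directly and is therefore much slower. The names `factor_oracle`, `shortest_word` and `accepts` are hypothetical helpers introduced here, states are represented by their indices, and Python strings are 0-based whereas the paper uses 1-based positions.

```python
from collections import deque


def shortest_word(trans, target):
    """Shortest word labelling a path from e_0 to e_target (breadth-first search)."""
    queue = deque([(0, "")])
    seen = {0}
    while queue:
        state, word = queue.popleft()
        if state == target:
            return word
        for symbol, nxt in trans[state].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + symbol))
    raise ValueError("unreachable state")


def factor_oracle(s):
    """Transcription of Algorithm 1; trans[i][a] = j encodes e_i --a--> e_j."""
    n = len(s)
    trans = [dict() for _ in range(n + 1)]
    for i in range(n):                       # main transitions e_i -> e_{i+1}
        trans[i][s[i]] = i + 1
    for i in range(n):                       # external transitions leaving e_i
        u = shortest_word(trans, i)          # u = min(e_i)
        window = s[i - len(u):]              # s[i-|u|+1 .. |s|] in 1-based notation
        for alpha in sorted(set(s) - {s[i]}):
            k = window.find(u + alpha)       # 0-based start of the first occurrence
            if k != -1:
                trans[i][alpha] = i + k + 1  # j = poccur(u alpha) - |u| = k + 1
    return trans


def accepts(trans, word):
    """True if `word` labels a path from e_0 (every state of a Factor Oracle is final)."""
    state = 0
    for symbol in word:
        if symbol not in trans[state]:
            return False
        state = trans[state][symbol]
    return True


oracle = factor_oracle("gaccattctc")
print(accepts(oracle, "atc"), "atc" in "gaccattctc")   # True False
```

On the word gaccattctc used later in the paper, the sketch accepts atc although atc is not a factor, which is the kind of false positive the paper characterizes.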
Definition 3 Given a word $s \in \Sigma^*$ and a word $x$ accepted at the state $e_i$ ($0 \leq i \leq |s|$) in the Oracle of $s$, we define the function $State$ as $State(x) = e_i$.

Lemma 1 [4] Given a word $s \in \Sigma^*$, a unique word with minimal length is accepted at each state $e_i$ ($0 \leq i \leq |s|$) in the Oracle of $s$. It is denoted by $min(e_i)$.

Lemma 2 [4] Given a word $s \in \Sigma^*$, its Oracle and an integer $i$ ($0 \leq i \leq |s|$), then $min(e_i) \in Fact(s)$ and $i = poccur_s(min(e_i))$.

Notation 1 Given a word $s \in \Sigma^*$, let $\#_{in}(e_i)$ and $\#_{out}(e_i)$ denote the number of ingoing/outgoing transitions to/from state $e_i$ ($0 \leq i \leq |s|$) in the Oracle of $s$.

2.2. Canonical Factors & Contraction Operation
In this section, we define particular factors of a given word and an operation needed to characterize the language. Then we define the sets of words obtained with this operation.

Definition 4 Given a word $s \in \Sigma^*$ and its Oracle, we define the set of Canonical Factors of $s$ as $F_s = \{min(e_i) \mid 1 \leq i \leq |s| \wedge (\#_{out}(e_i) > 1 \vee \#_{in}(e_i) > 1)\}$. Given a suffix $t$ of $s$ and a Canonical Factor $f$ of $s$, $f$ is a conserved Canonical Factor of $s$ in $t$ if the first occurrence of $f$ in $s$ appears in $t$. The set of the conserved Canonical Factors of $s$ in $t$ is denoted by $F_{s,t}$ (thus $F_{s,t} \subseteq F_s$).

Definition 5 Given a word $s \in \Sigma^*$ and a repeated Canonical Factor $f$ of $s$ such that:
$$s = ufv \quad (u, v \in \Sigma^*), \qquad fv = wfx \quad (w \in \Sigma^+,\ x \in \Sigma^*), \qquad Pos_s(f) = |u| + 1,$$
then the pair $(|u| + 1, |uw| + 1)$ is a contraction of $s$ by $f$; applying it to $s$ removes the factor $s[|u|+1..|uw|]$ and yields the word $ufx$.

Notation 2 Given a word $s \in \Sigma^*$ and a Canonical Factor $f \in F_s$, $C^f_s$ is the set of contractions of $s$ by $f$ and $C^*_s$ ($\equiv \bigcup_{f \in F_s} C^f_s$) is the set of all the contractions we can apply to $s$. Given a suffix $t$ of $s = t't$ ($t, t' \in \Sigma^*$), $C^*_{s,t}$ is the subset of $C^*_s$ such that $C^*_{s,t} = \{(p, q) \mid (p, q) \in C^*_s \wedge p > |t'|\}$.

In the following, we use sets of contractions to produce new words from a given one. Then we use these words to characterize the language recognized by Oracles.
Definition 6 A set $C$ of contractions is coherent if and only if it does not contain two contractions $(i_1, j_1)$, $(i_2, j_2)$ such that $i_1 < i_2 < j_1 < j_2$. Furthermore, $C$ is minimal if and only if it does not contain two contractions $(i_1, j_1)$ and $(i_2, j_2)$ such that $i_1 \leq i_2 < j_2 \leq j_1$ or such that $i_1 < j_1 = i_2 < j_2$.

Definition 7 Given a word $s \in \Sigma^*$ and a coherent and minimal set of contractions $C = \{(p_1, q_1), \ldots, (p_k, q_k)\}$, listed by ascending positions and associated to the set of canonical factors $\{f_1, \ldots, f_k\}$, we define the function $Word$ as follows:
$$Word(s, C) = s[1..p_1 - 1]\; s[q_1..p_2 - 1] \ldots s[q_{k-1}..p_k - 1]\; s[q_k..|s|].$$

[Fig. 1. Words obtained using the contraction operation (see Definition 5).]

We are only interested in words obtained by the contraction operation. So, without loss of generality, we only consider coherent and minimal sets of contractions. Note that the word remains the same whatever the order of the contractions (see Figure 1).
Definition 8 We define $E(s) = \bigcup_{C \subseteq C^*_s} Word(s, C)$ as the closure of $s$.

Example: Consider the word $gaccattctc$ (see Figure 2). Its set of Canonical Factors is $F_{gaccattctc} = \{a, c, ca, t, tc, ct\}$ and then $C^*_{gaccattctc} = \{(2, 5), (3, 4), (3, 8), (3, 10), (6, 7), (6, 9), (7, 9)\}$. Now, consider the set $C = \{(2, 5), (7, 9)\}$ ($C \subseteq C^*_{gaccattctc}$); then $Word(gaccattctc, C) = g\,\cancel{acc}\,at\,\cancel{tc}\,tc = gattc$. The closure of $gaccattctc$ is:
$$E(gaccattctc) = \{gac, gacatc, gacatctc, gacattc, gacattctc, gaccatc, gaccatctc, gaccattc, gaccattctc, gactc, gatc, gatctc, gattc, gattctc\}.$$
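The definitions above can be checked mechanically on the example. The following sketch takes the set $C^*_{gaccattctc}$ from the Example as given (it does not recompute it from the Oracle) and enumerates the coherent and minimal subsets to obtain the closure. The function names `word`, `is_coherent`, `is_minimal` and `closure` are hypothetical; contractions are 1-based pairs as in the paper.

```python
from itertools import combinations


def word(s, contractions):
    """Word(s, C) of Definition 7; contractions are 1-based pairs (p, q)."""
    pieces, start = [], 1
    for p, q in sorted(contractions):
        pieces.append(s[start - 1:p - 1])    # keep s[start .. p-1]
        start = q                            # jump to the later occurrence
    pieces.append(s[start - 1:])
    return "".join(pieces)


def is_coherent(c):
    """Definition 6: no two contractions with i1 < i2 < j1 < j2."""
    return not any(i1 < i2 < j1 < j2 for (i1, j1) in c for (i2, j2) in c)


def is_minimal(c):
    """Definition 6: no nested pair and no pair chained by j1 = i2."""
    for a in c:
        for b in c:
            if a == b:
                continue
            (i1, j1), (i2, j2) = a, b
            if i1 <= i2 < j2 <= j1 or i1 < j1 == i2 < j2:
                return False
    return True


def closure(s, all_contractions):
    """E(s): the words Word(s, C) over the coherent and minimal subsets C."""
    pool = sorted(all_contractions)
    return {
        word(s, subset)
        for k in range(len(pool) + 1)
        for subset in combinations(pool, k)
        if is_coherent(subset) and is_minimal(subset)
    }


S = "gaccattctc"
C_STAR = {(2, 5), (3, 4), (3, 8), (3, 10), (6, 7), (6, 9), (7, 9)}  # from the Example
print(word(S, {(2, 5), (7, 9)}))      # gattc
print(len(closure(S, C_STAR)))        # 14
```

The enumeration is exponential in $|C^*_s|$, which is acceptable for a ten-letter example but not meant as a practical way to compute $E(s)$.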
[Fig. 2. Oracle of the word $gaccattctc$, with states $e_0$ to $e_{10}$.]

Given a word $s \in \Sigma^*$, we saw how to build the corresponding Factor (resp. Suffix) Oracle. This Oracle recognizes the factors (resp. suffixes) of $s$, but it also accepts additional words. For example, the word $atc$ is accepted by the Factor (resp. Suffix) Oracle of $gaccattctc$ (see Figure 2), whereas it is neither a factor nor a suffix of this sequence. We will see that the Suffix Oracle of $s$ recognizes exactly all suffixes of words from $E(s)$, and we will use this result to prove that the Factor Oracle of $s$ recognizes exactly all factors of words from $E(s)$.

We use the following Lemmas to prove the result concerning Suffix Oracles. These Lemmas have been proved in [4].
Lemma 3 [4] Given a word $s \in \Sigma^*$ and an integer $i$ ($0 \leq i \leq |s|$), then $min(e_i)$ is a suffix of all words recognized at state $e_i$ in the Oracle of $s$.

Lemma 4 [4] Given a word $s \in \Sigma^*$ and a factor $w$ of $s$, then $w$ is recognized at state $e_i$ ($1 \leq i \leq poccur_s(w)$) in the Oracle of $s$.

Lemma 5 [4] Given a word $s \in \Sigma^*$ and an integer $i$ ($0 \leq i \leq |s|$), then every path ending by $min(e_i)$ in the Oracle of $s$ leads to a state $e_j$ such that $j \geq i$.

Lemma 6 [4] Given a word $s \in \Sigma^*$ and a word $w \in \Sigma^*$ accepted at state $e_i$ ($0 \leq i \leq |s|$) in the Oracle of $s$, then every suffix of $w$ is also recognized by the Oracle at a state $e_j$ such that $j \leq i$.

The proof of Lemma 6 is only given in [4] for Factor Oracles. We need this result to be true for Suffix Oracles.
Proof. The original Lemma gives us that if $x$ is a suffix of $w$, then $State(x) \leq State(w)$. We need to prove that if $State(w)$ is final, then $State(x)$ is final. Therefore, we have to consider two cases. When $|x| \geq |min(e_i)|$, we have $min(e_i) \in Suff(x)$ and thus, according to Lemma 5, $State(x) \geq State(min(e_i))$. Since $State(min(e_i)) = e_i = State(w)$, we conclude that $State(x) = State(w)$. When $|x| < |min(e_i)|$, since the state $e_i$ is final, there exists a suffix $t$ of $s$ such that $State(t) = e_i$. According to Lemma 3, we conclude that $min(e_i) \in Suff(t) \subseteq Suff(s)$. Since $x$ and $min(e_i)$ are suffixes of $w$, $|x| < |min(e_i)| \Rightarrow x \in Suff(min(e_i))$. So $x$ is also a suffix of $s$ and, according to the Definition of the Suffix Oracle, $State(x)$ is final.

We use these previous results to show the two following Lemmas.

Lemma 7 Given a word $s \in \Sigma^*$, a Canonical Factor $f$ of $s = ufv$ ($u, v \in \Sigma^*$) such that $f$ is not repeated in $uf$, and a set $C$ of contractions ($C \subseteq C^*_s$) such that $Word(uf, C) = wf$, then the Oracle of $s$ accepts $wf$ and $f$ at the same state.

Proof. We denote by $C_i \subseteq C^*_s$ a set of contractions which has cardinality $i$. In the same way, we denote by $w_i f$ the word obtained when we apply the contractions $C_i$ to $uf$ (warning: $w_i f = Word(uf, C_i)$; $w_i = Word(u, C_i)$). By induction on the size of $C_i$, we show that $State(Word(uf, C_i)) = State(f)$ for all $C_i \subseteq C^*_s$. Let $e_x = State(f)$ ($f = min(e_x)$ by Definition of $f$) and $e_{x'_i} = State(Word(uf, C_i))$.

If we consider $C_0$, then $Word(uf, C_0) = uf$. According to Lemma 5, $x'_0 \geq x$. Furthermore, according to Lemma 4 applied to $uf$, we have $x'_0 \leq poccur_s(uf)$. However, by Definition of $f$, $poccur_s(f) = |uf| = poccur_s(uf)$. Therefore we have $x'_0 \leq x$ and $x'_0 = x$.

If this lemma is true for a set $C_i \subset C^*_s$ of contractions, then it is true for a set $C_{i+1} = C_i \cup \{(p, q)\}$. We assume without loss of generality (see Figure 1) that $(p, q)$ is the last contraction in $C_{i+1}$ (by ascending order over positions). Let $b$ be the Canonical Factor used by this contraction. We can write $uf = s[1..p-1]\; s[p..q-1]\; s[q..|uf|]$. Since $(p, q)$ is chosen as the last contraction, all contractions in $C_i$ are applicable to $s[1..p-1]$. So we define $a, c \in \Sigma^*$ such that $w_i f = a\, s[p..|uf|] = abc$ and $d \in \Sigma^*$ such that $w_{i+1} f = a\, s[q..|uf|] = abd$. Also $ab = Word(s[1..p-1]\, b, C_i)$ (the opposite would mean that the contraction $(p, q)$ cannot be operated from $b$) and, according to the induction hypothesis, $State(ab) = State(b)$. From this, we conclude that $State(abc) = State(bc)$ and $State(abd) = State(bd)$. We know that $bd$ ($= s[q..|uf|]$) is a suffix of $bc$ ($= s[p..|uf|]$) and according to Lemma 6:
$$State(bd) \leq State(bc) \Leftrightarrow State(abd) \leq State(abc) \Leftrightarrow State(w_{i+1} f) \leq State(w_i f) \Leftrightarrow State(w_{i+1} f) \leq State(f).$$
However, according to Lemma 5, we have $State(w_{i+1} f) \geq State(f)$ and therefore $State(w_{i+1} f) = State(f)$. This lemma is then true for all $C_i \subseteq C^*_s$.

[Fig. 3. Illustration of Lemma 7.]
Now, we show how to obtain a contraction, starting with a transition of type $e_i \rightarrow e_j$ with $j > i + 1$ (which we call an external transition in the sequel).

Lemma 8 Given a word $s \in \Sigma^*$ and an integer $i$ ($0 \leq i < |s|$) such that $\#_{out}(e_i) > 1$, let $p$ be an external transition from state $e_i$ to state $e_{i+j}$ ($j > 1$) labeled by $\alpha$, and $u = min(e_i)$. Then an occurrence of $u\alpha$ exists at position $(i + j - |u|)$ of $s$ (see Figure 5). Moreover, the contraction $(i - |u| + 1, i - |u| + j)$ of $s$ by $u$ exists too.

Proof. According to the construction algorithm (see Algorithm 1), the transition $p$ is added from $e_i$ to $e_{i+j}$ because a position $j$ in $s[i - |u| + 1..|s|]$ is such that $j = poccur_{s[i-|u|+1..|s|]}(u\alpha) - |u|$. We also have $u\alpha \in Fact(s)$ because $u\alpha \in Fact(s[i - |u| + 1..|s|])$. Cleophas & al. [11] proved that since $u = min(e_i)$ and $u\alpha \in Fact(s)$, then $i - |u| + poccur_{s[i-|u|+1..|s|]}(u\alpha) = poccur_s(u\alpha)$. We have $i + j = poccur_s(u\alpha)$ and finally $s[i + j - |u|..i + j] = u\alpha$.

We use algorithm Contractor (see Algorithm 2) to give a characterization of the language accepted by the Oracle of a word $s$. Given a word $s \in \Sigma^*$ and its Suffix Oracle $SO(s)$, the initial inputs of Contractor are a word $w$ accepted by $SO(s)$ and $t$, the maximal suffix of $s$ beginning with $w[1]$. This algorithm outputs a set $C \subseteq C^*_s$.
Algorithm 2: Contractions needed to transform s = t't (t, t' ∈ Σ*) into t'w

     1  Initialization: S_0 = t, S^w_0 = w, C_0 = ∅, sdec = |s| - |t|     % t is a suffix of s %
     2
     3  Input:  S_i ∈ Σ*       % A suffix of s that can still be "contracted" %
     4          S^w_i ∈ Σ*     % The word to process %
     5          C_i            % Set of contractions %
     6  Output: a set of contractions
     7
     8  Begin
     9    p_i ← longest common prefix between S_i and S^w_i                (Claim 1, item 1)
    10    e_{q_i} ← State(p_i)                                             (Claim 1, item 2)
    11    f_i ← min(e_{q_i})
    12    If (|p_i| < |S^w_i|) Then
    13      e_{r_i} ← Transition(e_{q_i}, S^w_i[|p_i| + 1])                (Claim 1, item 4)
    14      C_{i+1} ← C_i ∪ {c_i}, c_i = (q_i - |f_i| + 1, r_i - |f_i|)    (Claim 2, item 2)
    15      S^w_{i+1} ← S^w_i[|p_i| - |f_i| + 1..|S^w_i|]                  (Claim 1, item 3)
    16      S_{i+1} ← t[r_i - |f_i| - sdec..|t|]                           (Claim 1, item 3)
    17      Return Contractor(S_{i+1}, S^w_{i+1}, C_{i+1})
    18    Else
    19      If (|S_i| > |S^w_i|) Then
    20        C_{i+1} ← C_i ∪ {c_i}, c_i = (q_i - |f_i| + 1, |s| - |f_i| + 1)   (Claim 3)
    21      Else
    22        C_{i+1} ← C_i                                                (Claim 3)
    23      End If
    24      Return C_{i+1}
    25    End If
    26  End

By Definition 7, given a word $s \in \Sigma^*$ such that $s = t't$ ($t, t' \in \Sigma^*$) and a set $C \subseteq C^*_{s,t}$ of contractions, $t'w = Word(s, C)$ is a concatenation of substrings of $s$. These substrings can be seen as prefixes of suffixes of $s$. A contraction is then a jump from one substring to the next one. The main idea of this algorithm is to read the sequences $t$ and $w$ from left to right, in order to compute the longest common prefixes between given suffixes of $t$ and $w$. After each stage, the algorithm computes the length of the jump to go to the next suffix of $t$ to consider.

Inputs are words $S_i$, $S^w_i$ ($i \geq 0$) and a set $C_i$ of contractions. We initialize $S_0 = t$, $S^w_0 = w$, $C_0 = \emptyset$, and $p_i$ (line 9) is the longest common prefix of $S_i$ and $S^w_i$. So:
$$S_i = p_i S'_i \quad \text{and} \quad S^w_i = p_i S^{\prime w}_i. \qquad (1)$$
Let $e_{q_i} = State(p_i)$ and $f_i = min(e_{q_i})$ (lines 10 and 11). Lemma 3 provides:
$$p_i = p'_i f_i \quad (p'_i \in \Sigma^*). \qquad (2)$$
The variable $e_{r_i}$ (line 13) is the state reached by the transition from $e_{q_i}$ labeled by $\alpha = S^w_i[|p_i| + 1] = S^{\prime w}_i[1]$. The set $C_{i+1}$ has cardinality $i + 1$. The variable $sdec = |s| - |t|$ is necessary to compute $S_{i+1}$ (line 16), because each state $e_i$ is linked to the $i$-th character of $s$, not to the character $(i - |s| + |t|)$ of $t$.

Claim 1 The following assertions are true for all $i \geq 0$ (see Figure 4):
1. $f_i \alpha \in Pref(p_{i+1})$.
2. $S_i = t[q_i - |p_i| + 1 - sdec..|t|]$.
3. $S_{i+1}$ and $S^w_{i+1}$ are respectively suffixes of $S_i$ and $S^w_i$; $S_i$ and $S^w_i$ ($i \geq 0$) are respectively suffixes of $t$ and $w$.
4. Transition from $e_{q_i}$ to $e_{r_i}$

[Fig. 4. Visualization of Contractor on $S_i$ and $S^w_i$.]

Proof.
1. Since $f_i = min(e_{q_i})$ and according to Lemma 8, we have $s[r_i - |f_i|..r_i] = t[r_i - |f_i| - sdec..r_i - sdec] = f_i \alpha$. So $S_{i+1}$ and $S^w_{i+1}$ begin with $f_i \alpha$ (lines 15 and 16).
2. For $i = 0$ (initialization case), $S_0 = t$ and $t$ is the longest suffix of $s$ beginning by $w[1]$. If $e_x = State(S_0[1])$ ($x > 0$), then $t[x - sdec..|t|] = S_0$ and $State(p_0) = e_{x + |p_0| - 1} = e_{q_0}$. Thus we conclude that $S_0 = t[q_0 - |p_0| + 1 - sdec..|t|]$. At iteration $i$, we have $S_{i+1} = t[r_i - |f_i| - sdec..|t|]$ (line 16). Since $S^w_{i+1}$ begins by $f_i \alpha$ (item 1), we conclude that $q_{i+1} = r_i + |p_{i+1}| - |f_i| - 1$ and finally we have $S_{i+1} = t[r_i - |f_i| - sdec..|t|] = t[q_{i+1} - |p_{i+1}| + 1 - sdec..|t|]$.
3. This assertion is true for $S^w_i$ because $S^w_{i+1}$ is a suffix of $S^w_i$ by construction (line 15) and $S^w_0 = w$. Concerning $S_i$, we have $S_0 = t$, thus the property is true for $i = 0$. Suppose that $S_i$ is a suffix of $t$; from item 2, we have $S_i = t[q_i - |p_i| + 1 - sdec..|t|]$. We also have $S_{i+1} = t[r_i - |f_i| - sdec..|t|]$ (line 16). So we conclude using Eq. (2) that $q_i - |p_i| = q_i - |p'_i| - |f_i|$. Since $|p'_i| \geq 0$, we have $q_i - |f_i| \geq q_i - |p_i|$ and $r_i > q_i$. Finally, we conclude that $r_i - |f_i| > q_i - |p_i|$ and that $S_{i+1}$ is a suffix of $S_i$.
4. According to item 3, $S^w_i$ is a suffix of $w$. Then $S^w_i$ is accepted by $O(s)$ (Lemma 6). Using Eq. (1) with $S^{\prime w}_i[1] = \alpha$, the transition must exist and implies $\#_{out}(e_{q_i}) \geq 2$. So $f_i = min(e_{q_i}) \in F_s$ by Definition of Canonical Factors.

From Eq. (1) and Claim 1 (item 4):
$$t = t'_i S_i \quad (t'_i \in \Sigma^*), \qquad w = w'_i S^w_i = w'_i p'_i S^w_{i+1} \quad (w'_i \in \Sigma^*). \qquad (3)$$

[Fig. 5. Illustration of a step in the algorithm Contractor ($\alpha = S^w_i[|p_i| + 1]$).]

Claim 2 For all $i \geq 0$ (see Figure 5):
1. $State(w'_i p_i) = State(t'_i p_i) = State(p_i) = e_{q_i}$.
2. $c_i$ is a contraction of $t'_i S_i$ (resp. $w'_i S_i$) provided by $f_i$. The result of this contraction is $t'_i p'_i S_{i+1}$ (resp. $w'_i p'_i S_{i+1}$).

Proof.
1. For $i = 0$, $t'_i = w'_i = \epsilon$. Suppose that the property is true for $i$; we show that it is then true for $i + 1$. From Claim 1 (item 2), we conclude that $S_i$ labels a path using only main transitions (i.e. transitions of type $e_j \rightarrow e_{j+1}$) from $e_{q_i - |p_i|}$ to $e_{|s|}$ in $O(s)$. According to Claim 1 (item 3) we conclude that:
$$S_i = u S_{i+1} \quad (u \in \Sigma^*). \qquad (4)$$
So, the state $e_x$ ($x > q_i - |p_i|$) is such that $S_{i+1}$ labels a path using only main transitions from $e_x$ to $e_{|s|}$ in $O(s)$, and more precisely $x = r_i - |f_i| - 1$. We have $t'_{i+1} = t'_i u$ (Eq. (3) and (4)) and $State(t'_i u) = e_x$. Since $f_i = min(e_{q_i})$ and there is a transition from $e_{q_i}$ to $e_{r_i}$ labeled by $\alpha$ (Claim 1, item 4), $State(t'_{i+1} f_i \alpha) = State(t'_i u f_i \alpha) = State(f_i \alpha) = e_{r_i}$ and furthermore, $p_{i+1} = f_i \alpha v$ ($v \in \Sigma^*$). So we have $State(t'_{i+1} f_i \alpha v) = State(t'_{i+1} p_{i+1}) = State(p_{i+1})$.
2. From Eq. (1), (2) and (3), we conclude that:
$$t = t'_i S_i = t'_i p'_i f_i S'_i. \qquad (5)$$
Since $S_{i+1} \in Suff(S_i)$, we have $S_i = u S_{i+1}$ ($u \in \Sigma^+$) and from Eq. (5), $t'_i p'_i f_i S'_i = t'_i u S_{i+1}$. According to Claim 1 (item 1), we have $t'_i p'_i f_i S'_i = t'_i u f_i \alpha u'$ ($u' \in \Sigma^*$). Since we have $State(t'_i p'_i f_i) = State(f_i)$ and $|u| > |p'_i|$ (because $S'_i[1] \neq \alpha$), we can contract $t'_i S_i$ by $f_i$ and $t'_i p'_i f_i \alpha u' = t'_i p'_i S_{i+1}$. We can conclude that $w'_i S_i = w'_i p_i S'_i$ is contracted by $f_i$ in $w'_i p'_i S_{i+1}$ because $State(w'_i p_i) = State(t'_i p_i)$. According to Eq. (3), we conclude that $w'_{i+1} = w'_i p'_i$ and $w'_i p'_i S_{i+1} = w'_{i+1} S_{i+1}$.

Claim 2 shows that a contraction $c_i$ of $t'_i S_i$ by $f_i$ is also a contraction for $t$ and for $w'_i S_i$. Consider the $i$-th iteration: we have $|S^w_i| > |S^w_{i+1}|$, or $|S^w_i| = |S^w_{i+1}|$ and $|p_{i+1}| > |p_i|$ (if $f_i = p_i$). Since $|p_i| > 0$, we ensure that the recursion stops at an iteration $j > i$ with $p_j = S^w_j$.

Claim 3 Given an integer $i \geq 0$ such that $p_i = S^w_i$, then $t$ needs a last contraction if and only if $|S^w_i| \neq |S_i|$.

Proof. The word obtained with the set $C_i$ of contractions is $w'_i p_i S'_i$ (Claim 2, item 2). If $S^w_i = S_i$, then we have $S'_i = \epsilon$ and we conclude that $C_i$ is complete (line 22). If $S^w_i \neq S_i$, according to Claim 2 (item 1), we have $State(w'_i p_i) = e_{q_i}$. By Definition of the final state in a Suffix Oracle, $min(e_{q_i}) \in Suff(w)$ (Lemma 3) and $min(e_{q_i}) \in Suff(t)$. A last contraction is needed to complete the set of contractions (line 20).

Considering the $i$-th call of Contractor, inputs are $S_i = p_i S'_i$, $S^w_i = p_i S^{\prime w}_i$ and $C_i$. This set is the set of contractions necessary to transform $t'_i \in Pref(t)$ into $w'_i \in Pref(w)$. The variable $p_i$ refers to the longest common prefix of $S_i$ and $S^w_i$ ($p_i$ is a factor of both $t$ and $w$, Claim 1). Two cases may occur for this call (line 12). When $|p_i| = |S^w_i|$, the recursion ends. Otherwise, $|p_i| \neq |S^w_i|$ and at least another contraction is necessary until $|p_j| = |S^w_j|$ ($j > i$). From Claim 2, the $i$-th contraction allows to continue with the suffix $S_{i+1}$.
At the end of the process (i.e. the end of

Lemma 9 Given a word $s \in \Sigma^*$, its Suffix Oracle, a word $w \in \Sigma^*$ accepted by $SO(s)$ and the longest suffix $t$ of $s$ such that $w[1] = t[1]$, then Contractor($t$, $w$, $\emptyset$) outputs a set $C$ which is such that $w = Word(t, C)$.

Proof. Let $i \geq 0$ be such that $S^w_{i+1} = p_{i+1}$. According to Claim 2, we conclude that $C_{i+1}$ is a coherent set of contractions of $t$. Since $p_{i+1}$ is a prefix of $S_{i+1}$ we have:
$$Word(t, C_{i+1}) = w'_{i+1} S_{i+1} = w'_{i+1} S^w_{i+1} u = w'_{i+1} p_{i+1} u \quad (u \in \Sigma^*).$$
If $u = \epsilon$, we have $Word(t, C_{i+1}) = w$ (Eq. (3)). Else (Claim 3) an ultimate contraction $c_{i+1}$ by $f_{i+1}$ transforms $w'_{i+1} S^w_{i+1} u$ into $w'_{i+1} S^w_{i+1} = w = Word(t, C_{i+1} \cup \{c_{i+1}\})$. Finally Contractor provides a set $C$ such that $w = Word(t, C)$.

We can notice that:
1. $C$ is not always minimal.
2. $C$ is coherent. Let $(a, b)$ and $(c, d)$ be two contractions successively added to $C$. We have $a < b$ and $c < d$ because $r_i > q_i$ and $|s| > q_i$ (lines 14 and 20). If $e_{q_{i+1}} = State(p_{i+1}) = e_{r_i}$, then $b = c$; else $p_{i+1} = f_i \alpha v$ ($\alpha = S^w_i[|p_i| + 1]$, $v \neq \epsilon$) and $e_{q_{i+1}} > e_{r_i}$, therefore $b < c$.

The following theorems are the main purpose of this paper:
Theorem 1 Exactly all suffixes of words from $E(s)$ are recognized by the Suffix Oracle of $s$.

Proof.
'$\Rightarrow$': Each suffix of words from $E(s)$ is recognized by the Suffix Oracle of $s$.
According to Lemma 6, if $w$ is accepted by $SO(s)$, each suffix of $w$ is also accepted by $SO(s)$. So we only need to prove that each word from $E(s)$ is accepted by $SO(s)$. Let $C \subseteq C^*_s$ be a set of contractions applicable to $s$ and $w = Word(s, C)$. The set $C_i$ is the set of the first $i$ contractions of $C$ (chosen without loss of generality by ascending order over positions, see Figure 1), $(x_j, y_j)$ is the $j$-th contraction, which uses the Canonical Factor $f_j$ ($1 \leq j \leq i$), and $w_j = Word(s, C_j)$. The property $(P)$ to prove is that if $w_i$ ($0 \leq i < |C|$) is accepted by $SO(s)$, then $w_{i+1}$ is accepted too. We have:
$$w_i = s[1..x_1 - 1]\; s[y_1..x_2 - 1] \ldots s[y_i..|s|], \qquad s[y_i..y_i + |f_i| - 1] = f_i.$$
By Definition of the Canonical Factors, $f_{i+1}$ does not occur in $s$ before position $x_{i+1}$ ($x_{i+1} > y_i$). We have $w_i = v' f_{i+1} u$ and $w_{i+1} = v' f_{i+1} u'$ with $v' = s[1..x_1 - 1]\; s[y_1..x_2 - 1] \ldots s[y_i..x_{i+1} - 1]$ and $f_{i+1} u = u'' f_{i+1} u'$ ($u'' \in \Sigma^+$).

Considering the contraction $(x_{i+1}, y_{i+1})$, we have $|s| - |f_{i+1} u| + 1 = x_{i+1}$ and $|s| - |f_{i+1} u'| + 1 = y_{i+1}$ (because the contractions are in ascending order). The word $s$ is then not yet modified after position $x_{i+1}$, so $f_{i+1} u$ and $f_{i+1} u'$ are suffixes of $s$. Consider the state $q = State(f_{i+1})$ of $SO(s)$; according to Lemma 7:
$$State(v' f_{i+1}) = q. \qquad (6)$$
The suffix $f_{i+1} u'$ of $s$ is necessarily recognized by $SO(s)$. So the path accepting this suffix in $SO(s)$ goes through the state $q$. Starting from $q$, we can read $u'$ and reach a final state; therefore, according to Eq. (6), $SO(s)$ also accepts the word $w_{i+1} = v' f_{i+1} u'$. Finally, the property $(P)$ is true for all $i$ ($0 \leq i < |C|$) and since $w_0 = s$, we can prove by induction on $i$ that the Suffix Oracle of $s$ recognizes all words $w_i$ ($0 \leq i \leq |C|$). Lemma 6 allows to conclude that each suffix of words from $E(s)$ is recognized by $SO(s)$.

'$\Leftarrow$': Each word recognized by the Suffix Oracle of $s$ is a suffix of a word from $E(s)$.
Let $w$ be a word accepted by the Suffix Oracle of $s$ and $t$ be the longest suffix of $s = t't$ ($t, t' \in \Sigma^*$) beginning with $w[1]$. Then a set $C$ of contractions such that $t'w = Word(t't, C)$ exists (Lemma 9). Since the word $w$ is a suffix of $t'w$ and $t'w \in E(s)$, each word accepted by $SO(s)$ is a suffix of a word from $E(s)$.
On the basis of Theorem 1 we give a similar result, which holds for Factor Oracles instead of Suffix Oracles.

Theorem 2 Exactly all factors of words from $E(s)$ are recognized by the Factor Oracle of $s$.

Proof.
'$\Rightarrow$': Each factor of words from $E(s)$ is recognized by the Factor Oracle of $s$.
Let $SO(s)$ be the Suffix Oracle of $s$, $u \in E(s)$ and $m$ a factor of $u$, i.e. a prefix of a suffix $mv$ of $u$ ($v \in \Sigma^*$). Then $mv$ is accepted by $SO(s)$ (Theorem 1). Thus, a path $(e_0 \rightarrow e_{x_1} \rightarrow \ldots \rightarrow e_{x_{|mv|}})$ exists in $SO(s)$, which recognizes $mv$. Therefore, we conclude that $m$ labels a path $(e_0 \rightarrow e_{x_1} \rightarrow \ldots \rightarrow e_{x_{|m|}})$ (with $e_{x_{|m|}}$ final in $FO(s)$).

'$\Leftarrow$': Each word recognized by the Factor Oracle of $s$ is a factor of a word from $E(s)$.
Let $SO(s)$ be the Suffix Oracle of $s$ and $m$ a word accepted by $FO(s)$. If $SO(s)$ recognizes $m$ then $m$ is a suffix of a word from $E(s)$ (Theorem 1). Suppose that $SO(s)$ does not recognize $m$. Then $FO(s)$ recognizes $m$ at state $e_{x_{|m|}}$ (not final in $SO(s)$). Furthermore, the path $(e_0 \rightarrow e_1 \rightarrow \ldots \rightarrow e_{|s|})$ exists in $O(s)$ and $e_{|s|}$ is final in $SO(s)$. We conclude that a path from $e_{x_{|m|}}$ to $e_{|s|}$ exists in $SO(s)$. Then, the word $m$ is a prefix of a word recognized by $SO(s)$ and therefore $m$ is a prefix of a suffix of some $u \in E(s)$. Thus, $m$ is a factor of a word of $E(s)$.
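Both theorems can be checked by brute force on the running example. The sketch below builds the Oracle with the on-line construction based on supply links, which is the linear-time algorithm of Allauzen & al. that this paper does not report (it is an assumption of this sketch that readers accept it as producing the same automaton as Algorithm 1). The set `E` is copied from the Example of Section 2.2, and all function names are hypothetical.

```python
def build_oracle(s):
    """On-line construction with supply links; trans[i][a] = j encodes e_i --a--> e_j."""
    n = len(s)
    trans = [dict() for _ in range(n + 1)]
    supply = [-1] * (n + 1)
    for i, symbol in enumerate(s):
        trans[i][symbol] = i + 1
        k = supply[i]
        while k > -1 and symbol not in trans[k]:
            trans[k][symbol] = i + 1
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][symbol]
    return trans


def state_of(trans, word):
    """Index of the state reached by `word` from e_0, or None if it is rejected."""
    state = 0
    for symbol in word:
        if symbol not in trans[state]:
            return None
        state = trans[state][symbol]
    return state


def language(trans, finals):
    """Non-empty words labelling a path from e_0 to a state of `finals`."""
    words, stack = set(), [(0, "")]
    while stack:
        state, word = stack.pop()
        if word and state in finals:
            words.add(word)
        stack.extend((nxt, word + symbol) for symbol, nxt in trans[state].items())
    return words


def factors(words):
    """Non-empty factors of the given words."""
    return {w[a:b] for w in words for a in range(len(w)) for b in range(a + 1, len(w) + 1)}


def suffixes(words):
    """Non-empty suffixes of the given words."""
    return {w[a:] for w in words for a in range(len(w))}


S = "gaccattctc"
E = {"gac", "gacatc", "gacatctc", "gacattc", "gacattctc", "gaccatc", "gaccatctc",
     "gaccattc", "gaccattctc", "gactc", "gatc", "gatctc", "gattc", "gattctc"}
trans = build_oracle(S)
all_states = set(range(len(S) + 1))
# Definition 2: a state is final in SO(s) if some suffix of s reaches it.
suffix_finals = {state_of(trans, S[a:]) for a in range(len(S) + 1)}
print(language(trans, suffix_finals) == suffixes(E))   # Theorem 1 on this instance
print(language(trans, all_states) == factors(E))       # Theorem 2 on this instance
```

The comparison ignores the empty word on both sides. It only confirms the statements on one ten-letter word and is no substitute for the proofs.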
4. Properties upon Oracles & Future Works

According to Cleophas & al. [11], the Oracle is not minimal in number of transitions among the set of homogeneous automata. Furthermore, if we consider the set of homogeneous automata which recognize at least all factors (resp. suffixes) of $s$ and which have the same number of states and at most the same number of transitions as the Factor (resp. Suffix) Oracle, we show that the Oracle is not minimal in the number of accepted words. The Oracle of $axttyabcdeatzattwu$ (see Figure 6) has 35 transitions, the Factor Oracle accepts 247 words and the Suffix Oracle accepts 39 words. Yet another homogeneous automaton exists (see Figure 7), which recognizes at least all factors (resp. suffixes) of $axttyabcdeatzattwu$ and which has only 34 transitions. The Factor version of this automaton recognizes only 236 words and its Suffix version accepts only 30 words. Moreover, we provide an example which has both fewer transitions and fewer accepted words than the corresponding Oracle.

[Fig. 6. Factor Oracle of the word $axttyabcdeatzattwu$.]

[Fig. 7. This automaton (considering only the continuous lines) accepts all factors of the word $axttyabcdeatzattwu$. The bold transition (from $e_1$ to $e_3$) is the only one which is not present in the Factor Oracle of this word (see Figure 6), though the two dotted ones (from $e_1$ to $e_{12}$ and from $e_{12}$ to $e_{16}$) are present in the Factor Oracle, but not in this automaton.]

In some cases, we observe that the number of words accepted by Oracles undermines confidence in this structure when it is used to detect factors or suffixes of words. Even if the number of false positives can sometimes be equal to 0 (e.g. $aaaaaa\ldots$), it can also be exponential. Indeed, we can build a word $s$ such that each subset of $C^*_s$ is coherent and minimal. For example, with $s = aabbccddee\ldots$, the set $C^*_s$ of contractions which are available on such a word is $\{(1, 2), (3, 4), \ldots, (|s| - 1, |s|)\}$. If we consider any non-empty subset $C \subseteq (C^*_s \setminus \{(1, 2)\})$ of contractions, it is easy to notice that $Word(s, C) \notin Fact(s)$. Besides, all words obtained from such subsets are pairwise different. The number of subsets is:
$$\sum_{i=1}^{|C^*_s| - 1} \binom{|C^*_s| - 1}{i} = \sum_{i=1}^{\frac{|s|}{2} - 1} \binom{\frac{|s|}{2} - 1}{i} = 2^{\frac{|s|}{2} - 1} - 1.$$
Then, the number of words which are accepted by the Oracles and are not factors/suffixes of $s$ is $O\left(\sqrt{2^{|s|}}\right)$.

To make better use of this structure, we need to improve it or to slightly modify it. However, better knowledge about the Oracle structure would be useful for future works. Indeed, it could be interesting to have an empirical or a statistical estimation of the accuracy of the Oracle (time and quality of the results), when it is substituted to Tries or Suffix Trees in algorithms.
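The counting argument above can be replayed on a small instance. The sketch below assumes, as stated in the text, that $C^*_s = \{(1, 2), (3, 4), \ldots, (|s| - 1, |s|)\}$ for $s = aabbccddee$ (it does not derive this set from the Oracle) and applies every non-empty subset of $C^*_s \setminus \{(1, 2)\}$. The helper `word` is a hypothetical transcription of Definition 7 with 1-based positions.

```python
from itertools import combinations


def word(s, contractions):
    """Word(s, C) of Definition 7; contractions are 1-based pairs (p, q)."""
    pieces, start = [], 1
    for p, q in sorted(contractions):
        pieces.append(s[start - 1:p - 1])
        start = q
    pieces.append(s[start - 1:])
    return "".join(pieces)


s = "aabbccddee"
c_star = {(p, p + 1) for p in range(1, len(s), 2)}   # {(1,2), (3,4), ..., (9,10)}, as stated in the text
pool = sorted(c_star - {(1, 2)})
words = [word(s, subset)
         for k in range(1, len(pool) + 1)
         for subset in combinations(pool, k)]
distinct = set(words)
non_factors = {w for w in distinct if w not in s}
print(len(words), len(distinct), len(non_factors))   # 15 15 14
```

On this instance the 15 subsets give 15 pairwise different words, in agreement with $2^{\frac{|s|}{2} - 1} - 1$, and 14 of them are not factors of $s$: the subset made of the last contraction alone yields $aabbccdde$, which is a prefix of $s$. The growth in $|s|$ remains exponential.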
5. Acknowledgments

We would like to thank Céline Barré for improving the English writing of this article. We also thank Irena Rusu for her useful comments and for improving earlier versions of this article.
References

1. D. Gusfield, Algorithms on Strings, Trees, and Sequences: computer science and computational biology (Cambridge University Press, 1997).
2. A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen and J. Seiferas, The smallest automaton recognizing the subwords of a text, Theoret. Comput. Sci. 40 (1985) 31-55.
3. E. Ukkonen, On-line construction of suffix trees, Algorithmica 14 (1995) 249-260.
4. C. Allauzen, M. Crochemore and M. Raffinot, Factor oracle, Suffix oracle, Technical Report 99-08, Institut Gaspard-Monge, Université de Marne-la-Vallée, 1999.
5. C. Allauzen, M. Crochemore and M. Raffinot, Factor Oracle: A New Structure for Pattern Matching, Conference on Current Trends in Theory and Practice of Informatics, 1999, pp. 295-310.
6. C. Allauzen, M. Crochemore and M. Raffinot, Efficient Experimental String Matching by Weak Factor Recognition, Proc. of the 12th conference on Combinatorial Pattern Matching, Lecture Notes in Comput. Sci. 2089 (2001) 51-72.
7. A. Lefebvre and T. Lecroq, Computing repeated factors with a factor oracle, Proc. of the 11th Australasian Workshop On Combinatorial Algorithms, eds. L. Brankovic and J. Ryan (Hunter Valley, Australia, 2000) pp. 145-158.
8. A. Lefebvre and T. Lecroq, A heuristic for computing repeats with a factor oracle: application to biological sequences, International Journal of Computer Mathematics 79 (2002) 1303-1315.
9. A. Lefebvre, T. Lecroq, H. Dauchel and J. Alexandre, FORRepeats: detects repeats on entire chromosomes and between genomes, Bioinformatics 19 (2003) 319-326.
10. A. Lefebvre and T. Lecroq, Compror: on-line lossless data compression with a factor oracle, Information Processing Letters 83 (2002) 1-6.
11. L. Cleophas, G. Zwaan and B. Watson, Constructing Factor Oracles, Proc. of the 3