HAL Id: hal-00619990
https://hal-upec-upem.archives-ouvertes.fr/hal-00619990
Submitted on 26 Mar 2013
Minimal forbidden words and factor automata
Maxime Crochemore, Filippo Mignosi, Antonio Restivo
To cite this version:
Maxime Crochemore, Filippo Mignosi, Antonio Restivo. Minimal forbidden words and factor automata. Mathematical Foundations of Computer Science (Brno, 1998), 1998, France. pp.665-673.
⟨hal-00619990⟩
Minimal Forbidden Words and Factor Automata
M. Crochemore*,1, F. Mignosi2, and A. Restivo2
1 Institut Gaspard-Monge [email protected]
2 Università di Palermo, [mignosi,restivo]@altair.math.unipa.it
Abstract. Let $L(M)$ be the (factorial) language avoiding a given anti-factorial language $M$. We design an automaton accepting $L(M)$ and built from the language $M$. The construction is effective if $M$ is finite.
If $M$ is the set of minimal forbidden words of a single word $v$, the automaton turns out to be the factor automaton of $v$ (the minimal automaton accepting the set of factors of $v$).
We also give an algorithm that builds the trie of $M$ from the factor automaton of a single word. It yields a non-trivial upper bound on the number of minimal forbidden words of a word.
Keywords: factorial language, anti-factorial language, factor code, factor automaton, forbidden word, avoiding a word, failure function.
1 Introduction
Let $L \subseteq A^*$ be a factorial language, i.e., a language containing all factors of its words. A word $w \in A^*$ is called a minimal forbidden word for $L$ if $w \notin L$ and all proper factors of $w$ belong to $L$. We denote by $MF(L)$ the language of minimal forbidden words for $L$.
The study of combinatorial properties of $MF(L)$ helps investigate the structure of the language $L$ or of the system it describes. For instance, locally testable factorial languages (cf. [8]) are characterized by the fact that the corresponding languages of minimal forbidden words are finite. In the context of Symbolic Dynamics they correspond to systems of finite type.
Another example is given by a language $L$ that is the set of factors of an infinite word: in this case, as shown in [2], the elements of $MF(L)$ are closely related to the bispecial factors (cf. [6], [7] and [3]) of the infinite word.
A measure of complexity of the language $L$ is introduced in [2], based on the function $F_L$ that counts, for any $n$, the number of words of length $n$ in $MF(L)$. The authors prove that the growth of $F_L(n)$ as well as the topological entropy of $MF(L)$ are topological invariants of the dynamical system defined by $L$. This result provides a useful tool to show that some systems are not isomorphic, in addition to other notions such as the ordinary notion of entropy and the zeta function.
Finally, [5] considers properties of languages defined by finite forbidden sets of words. The authors define the Möbius function for these languages.
* Work by this author is supported in part by Programme "Génomes" of C.N.R.S.
In this paper we focus on the transformations between $L$ and $MF(L)$. We first design an automaton accepting $L(M)$ that is built from the language $M$. When $M$ is a finite set the transformation is effective. Moreover, if $M$ is given by its digital tree, that is, its tree-like deterministic automaton, the algorithm is very similar to the algorithm of Aho and Corasick that builds a pattern-matching machine for a finite set of words [1].
In a second part we consider the particular situation of a language that is the set of factors of a single word $v$. The construction of its factor automaton, the minimal deterministic automaton accepting the factors of $v$ (see [4]), is known to be rather intricate. It is remarkable that the preceding transformation yields exactly the factor automaton of $v$ when the input is the set $M$ of minimal forbidden words of $v$. We also give an algorithm that realizes the converse transformation, building the trie of $M$ from the factor automaton of $v$. A corollary of the algorithm is a non-trivial upper bound on the number of minimal forbidden words of a word.
The complexities of the algorithms described in this paper are all linear in the size of their input or output. Therefore, the design of possibly faster algorithms relies on different representations of objects, which is not the aim of the paper.
2 Avoiding an anti-factorial language
Let $A$ be a finite alphabet and $A^*$ be the set of finite words drawn from the alphabet $A$, the empty word $\varepsilon$ included. Let $L \subseteq A^*$ be a factorial language, i.e., a language satisfying: $\forall u, v \in A^*$, $uv \in L \implies u, v \in L$. The complement language $L^c = A^* \setminus L$ is a (two-sided) ideal of $A^*$. Denoting by $MF(L)$ the base of this ideal, we have $L^c = A^* MF(L) A^*$.
The set $MF(L)$ is called the set of minimal forbidden words for $L$. A word $v \in A^*$ is forbidden for the factorial language $L$ if $v \notin L$, which is equivalent to saying that $v$ occurs in no word of $L$. In addition, $v$ is minimal if it has no proper factor that is forbidden.
One can note that the set $MF(L)$ uniquely characterizes $L$, just because
$L = A^* \setminus A^* MF(L) A^*$. (1)
The following simple observation provides a basic characterization of minimal forbidden words.
Remark 1. A word $v = a_1 a_2 \cdots a_n$ belongs to $MF(L)$ iff the two conditions hold:
- $v$ is forbidden (i.e., $v \notin L$),
- both $a_1 a_2 \cdots a_{n-1} \in L$ and $a_2 a_3 \cdots a_n \in L$ (the prefix and the suffix of $v$ of length $n-1$ belong to $L$).
The remark translates into the equality:
$MF(L) = AL \cap LA \cap (A^* \setminus L)$. (2)
As a consequence of both equalities (1) and (2) we get the following proposition.
Proposition 1. For a factorial language $L$, the languages $L$ and $MF(L)$ are simultaneously rational, that is, $L \in Rat(A^*)$ iff $MF(L) \in Rat(A^*)$.
The set $MF(L)$ is an anti-factorial language or a factor code, which means that it satisfies: $\forall u, v \in MF(L)$, $u \neq v \implies u$ is not a factor of $v$, a property that comes from the minimality of the words of $MF(L)$.
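For a finite factorial language, such as the set of factors of a single word, Equality (2) can be evaluated literally. The following Python sketch does this by brute force on the word abbab over the alphabet {a, b, c}; the helper names `factors`, `minimal_forbidden` and `is_anti_factorial` are illustrative choices and not notation from the paper, and the approach is only meant for small inputs.

```python
def factors(v):
    """All factors (substrings) of v, the empty word included."""
    return {v[i:j] for i in range(len(v) + 1) for j in range(i, len(v) + 1)}


def minimal_forbidden(language, alphabet):
    """Evaluate Equality (2) for a finite factorial language L.

    A word is kept when it lies in AL and in LA but not in L.
    """
    a_l = {a + w for a in alphabet for w in language}  # AL
    l_a = {w + a for w in language for a in alphabet}  # LA
    return (a_l & l_a) - language


def is_anti_factorial(words):
    """True if no word of the set is a factor of a different word of the set."""
    return not any(u != w and u in w for u in words for w in words)


mf = minimal_forbidden(factors("abbab"), "abc")
print(sorted(mf))             # ['aa', 'aba', 'babb', 'bbb', 'c']
print(is_anti_factorial(mf))  # True
```

The result is a factor code, as stated above: no word of the computed set is a factor of another one.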
We introduce a few more definitions.
Definition 1. A word $v \in A^*$ avoids the set $M$, $M \subseteq A^*$, if no word of $M$ is a factor of $v$ (i.e., if $v \notin A^* M A^*$). A language $L$ avoids $M$ if every word of $L$ avoids $M$.
From the definition of $MF(L)$, it readily follows that $L$ is the largest (according to the subset relation) factorial language that avoids $MF(L)$. This shows that for any anti-factorial language $M$ there exists a unique factorial language $L(M)$ for which $M = MF(L(M))$. The next remark summarizes the relation between factorial and anti-factorial languages.
Remark 2. There is a one-to-one correspondence between factorial and anti-factorial languages. If $L$ and $M$ are factorial and anti-factorial languages respectively, both equalities hold: $MF(L(M)) = M$ and $L(MF(L)) = L$.
We also refer to the next definition, which is to be considered in the context of dynamical systems (see [9] for example).
Definition 2. The factorial language $L$ is said to be of finite type when $MF(L)$ is finite.
Finally, with a finite anti-factorial language $M$ we associate the finite automaton $\mathcal{A}(M)$ as described below. The automaton is deterministic and complete, and, as shown at the end of the section by Theorem 3, the automaton accepts the language $L(M)$.
The automaton $\mathcal{A}(M)$ is the tuple $(Q, A, i, T, F)$ where
- the set $Q$ of states is $\{w \mid w \text{ is a prefix of a word in } M\}$,
- $A$ is the current alphabet,
- the initial state $i$ is the empty word $\varepsilon$,
- the set $T$ of terminal states is $Q \setminus M$.
States of $\mathcal{A}(M)$ that are words of $M$ are sink states. The set $F$ of transitions is partitioned into the three (pairwise disjoint) sets $F_1$, $F_2$, and $F_3$ defined by:
- $F_1 = \{(u, a, ua) \mid ua \in Q, a \in A\}$ (forward edges or tree edges),
- $F_2 = \{(u, a, v) \mid u \in Q \setminus M, a \in A, ua \notin Q, v \text{ longest suffix of } ua \text{ in } Q\}$ (backward edges),
- $F_3 = \{(u, a, u) \mid u \in M, a \in A\}$ (loops on sink states).
The transition function defined by the set $F$ of arcs of $\mathcal{A}(M)$ is denoted by $\delta$.
Remark 3. One can easily prove from the definitions that
1. if $q \in Q \setminus (M \cup \{\varepsilon\})$, all transitions arriving on state $q$ are labeled by the same letter $a \in A$,
2. from any state $q \in Q$ we can reach a sink state, i.e., $q$ can be extended to a word of $M$.
Definition 3. For any $v \in A^*$, $q_v$ denotes the state $\delta(\varepsilon, v)$, target of the unique path in $\mathcal{A}(M)$ starting at the initial state and labeled by $v$.
Since $\mathcal{A}(M)$ is a complete automaton, $q_v$ is always defined. In the automaton $\mathcal{A}(M)$ states are words, but to avoid misunderstandings we sometimes write "the word corresponding to $q_v$" instead of just "the word $q_v$".
Remark 4. Note that if $v$ is a state of $\mathcal{A}(M)$ we have $q_v = v$.
We are now ready to state the next lemma (whose proof is by induction on $|v|$), which is used in the proof of Theorem 3, the main result of the section.
Lemma 2. Let $M$ be an anti-factorial language and consider $\mathcal{A}(M)$. Let $v \in A^*$ be such that, for any proper prefix $u$ of $v$, $q_u$ is not a sink state ($q_u \notin M$). Then,
1. the word $q_v$ is a suffix of $v$,
2. $q_v$ is the longest suffix of $v$ that is also a state of $\mathcal{A}(M)$ (that is, $\forall q \in Q$, $q$ suffix of $v \implies q$ suffix of $q_v$).
Proof. By induction on $|v|$. □
Denoting by $Lang(\mathcal{A})$ the language accepted by an automaton $\mathcal{A}$, we get the main theorem of the section.
Theorem 3. For any anti-factorial language $M$, $Lang(\mathcal{A}(M)) = L(M)$.
Proof. We first prove $L(M) \subseteq Lang(\mathcal{A}(M))$. We have to show that if $v$ is a word that avoids $M$ then $v \in Lang(\mathcal{A}(M))$. Assume ab absurdo that $v \notin Lang(\mathcal{A}(M))$; therefore $q_v$ is a sink state. Let $u$ be the shortest prefix of $v$ for which $q_u$ is a sink state (note that $q_u = q_v$). By Lemma 2, statement 1, $q_u$ is a suffix of $u$, but $q_v$ is by definition an element of $M$, and so $v$ does not avoid $M$, a contradiction.
We then prove $Lang(\mathcal{A}(M)) \subseteq L(M)$. Let $v \in Lang(\mathcal{A}(M))$. Let us suppose ab absurdo that $v$ does not avoid $M$, i.e., $v = uwz$ for some $w \in M$, $u, z \in A^*$. We choose $uw$ as the shortest prefix of $v$ that belongs to $A^* M$. Since $w \in M$, it is by definition a state of $\mathcal{A}(M)$; since $w$ is a state that is a suffix of $uw$, by Lemma 2, statement 2, $w$ is a suffix of $q_{uw}$. But $q_{uw}$, which is by definition a state of $\mathcal{A}(M)$, is a prefix of an element $w'$ of $M$. Since $w$ is a suffix of a prefix of $w'$, $w$ is a factor of $w'$. As $M$ is anti-factorial, this forces $w = w'$, so $q_{uw} = w$ is a sink state and $v$ is not accepted, a contradiction. □
The above definition of $\mathcal{A}(M)$ turns into the algorithm below, called L-automaton, that builds the automaton from a finite anti-factorial set of words.
The input is the trie $\mathcal{T}$ that represents $M$. It is a tree-like automaton accepting the set $M$ and, as such, it is denoted by $(Q, A, i, T, \delta')$. The procedure can be adapted to test whether $\mathcal{T}$ represents an anti-factorial set, or even to generate the trie of the anti-factorial language associated with a set of words.
In view of Equality 1, the design of the algorithm amounts to adapting the construction of a pattern-matching machine (see [1] or [4]). The algorithm uses a function $f$ called a failure function and defined on states of $\mathcal{T}$ as follows. States of the trie $\mathcal{T}$ are identified with the prefixes of words in $M$. For a state $au$ ($a \in A$, $u \in A^*$), $f(au)$ is $\delta(i, u)$, a quantity that may happen to be $u$ itself. Note that $f(i)$ is undefined, which justifies a specific treatment of the initial state in the algorithm.
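L-automaton (trie $\mathcal{T} = (Q, A, i, T, \delta')$)
    for each $a \in A$
        if $\delta'(i, a)$ defined
            set $\delta(i, a) = \delta'(i, a)$;
            set $f(\delta(i, a)) = i$;
        else
            set $\delta(i, a) = i$;
    for each state $p \in Q \setminus \{i\}$ in width-first search and each $a \in A$
        if $\delta'(p, a)$ defined
            set $\delta(p, a) = \delta'(p, a)$;
            set $f(\delta(p, a)) = \delta(f(p), a)$;
        else if $p \notin T$
            set $\delta(p, a) = \delta(f(p), a)$;
        else
            set $\delta(p, a) = p$;
    return $(Q, A, i, Q \setminus T, \delta)$;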
Fig. 1. Trie of the factor code $\{aa, bbab, bbb\}$ on the alphabet $\{a, b\}$. Squares represent terminal states.
Example. Figure 1 displays the trie that accepts $M = \{aa, bbab, bbb\}$. It is an anti-factorial language. The automaton produced from the trie by algorithm L-automaton is shown in Figure 2. It accepts the prefixes of $(ab \cup b)(ab)^* ba$, which are all the words avoiding $M$.
Fig. 2. Automaton accepting the words that avoid the set $\{aa, bbab, bbb\}$. Squares represent non-terminal states (sink states).
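Algorithm L-automaton can be sketched in Python and checked against Definition 1 on the anti-factorial set {aa, bbab, bbb}. This is a minimal sketch under stated assumptions: states are integers (0 is the initial state), the trie is a list of dictionaries, and the names `build_trie`, `l_automaton`, `accepts` and `agrees_with_definition` are illustrative, not the paper's. The input set is assumed to be anti-factorial and written over the given alphabet.

```python
from collections import deque
from itertools import product


def build_trie(words):
    """Trie of the words: list of dicts (state 0 is the root) and the set of leaves in M."""
    delta0, sinks = [{}], set()
    for w in words:
        p = 0
        for ch in w:
            if ch not in delta0[p]:
                delta0[p][ch] = len(delta0)
                delta0.append({})
            p = delta0[p][ch]
        sinks.add(p)
    return delta0, sinks


def l_automaton(words, alphabet):
    """Complete the trie of an anti-factorial set into a complete automaton."""
    delta0, sinks = build_trie(words)
    delta = [dict() for _ in delta0]
    fail = [None] * len(delta0)          # failure function f, undefined on state 0
    queue = deque()
    for a in alphabet:                   # specific treatment of the initial state
        if a in delta0[0]:
            q = delta0[0][a]
            delta[0][a], fail[q] = q, 0
            queue.append(q)
        else:
            delta[0][a] = 0
    while queue:                         # width-first traversal of the other states
        p = queue.popleft()
        for a in alphabet:
            if a in delta0[p]:           # tree edge
                q = delta0[p][a]
                delta[p][a] = q
                fail[q] = delta[fail[p]][a]
                queue.append(q)
            elif p not in sinks:         # backward edge
                delta[p][a] = delta[fail[p]][a]
            else:                        # loop on a sink state
                delta[p][a] = p
    return delta, sinks


def accepts(delta, sinks, word):
    """A word is accepted when its path from state 0 does not end in a sink."""
    state = 0
    for ch in word:
        state = delta[state][ch]
    return state not in sinks


def agrees_with_definition(words, alphabet, max_len):
    """Compare the automaton with the definition of avoidance on all short words."""
    delta, sinks = l_automaton(words, alphabet)
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if accepts(delta, sinks, w) != (not any(m in w for m in words)):
                return False
    return True


print(agrees_with_definition({"aa", "bbab", "bbb"}, "ab", 8))  # True
```

The exhaustive comparison is only an empirical check on short words; the general statement is the object of the theorem that follows.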
Theorem 4. Let $\mathcal{T}$ be the trie of an anti-factorial language $M$. Algorithm L-automaton builds a complete deterministic automaton accepting $L(M)$.
Proof. The automaton produced by the algorithm has the same set of states as the input trie. It is clear that the automaton is deterministic and complete.
Let $u \in A^+$ and $p = \delta(i, u)$. A simple induction on $|u|$ shows that the word corresponding to $f(p)$ is the longest proper suffix of $u$ that is a prefix of some word in $M$. This notion comes up in the definition of the set of transitions $F_2$ in the automaton $\mathcal{A}(M)$. Therefore, the rest of the proof amounts to checking that the instructions implement the definition of $\mathcal{A}(M)$. □
Theorem 5. Algorithm L-automaton runs in time $O(|Q| \times |A|)$ on input $\mathcal{T} = (Q, A, i, T, \delta')$ if transition functions are implemented by transition matrices.
3 Factor automaton of a single word
In this section we specialize the previous results to the language of factors of a single word. It is proved below that the construction of Section 2 yields the factor automaton (minimal deterministic automaton accepting the factors) of the word (see Theorem 7). The minimality of the automaton seems to be exceptional because, for example, the same construction applied to the set $\{aa, ab\}$ does not provide a minimal automaton.
The reverse construction that produces the trie of minimal forbidden words from the factor automaton is described in the next section.
We consider a fixed word $v \in A^*$ and denote by $F(v)$ the language of factors of $v$.
Proposition 6. The language $F(v)$ is of finite type.
Proof. Indeed, factors of $v$, being of length less than $|v| + 1$, avoid all words of length exactly $|v| + 1$. Therefore, every minimal forbidden word of $F(v)$ has length at most $|v| + 1$. □
The result of the previous proposition is made more precise in the next section, but an immediate consequence of it and of the definition of the automaton $\mathcal{A}(M)$ for an anti-factorial language $M$ is that the automaton $\mathcal{A}(MF(F(v)))$ has a finite number of states. The next statement gives a complete characterization of the automaton as the factor automaton of $v$.
Theorem 7. For any $v \in A^*$, the automaton obtained from $\mathcal{A}(MF(F(v)))$ by removing its sink states is the minimal deterministic finite automaton accepting the language $F(v)$ of factors of $v$.
Proof. The automaton $\mathcal{A}(MF(F(v)))$ is already a deterministic finite automaton that accepts the language $F(v)$ by Theorem 3. We only have to prove that it is minimal after removing the sink states.
Suppose ab absurdo that there exist two equivalent non-sink states $p, q$ in $Q$. By the standard equivalence relation of indistinguishability and by construction, $p, q \in F(v)$. Hence, $v = xpy$ and $v = x'qy'$ and we can choose $x$ and $x'$ of minimal length. We consider two cases:
(i) $|xp| \neq |x'q|$, (ii) $|xp| = |x'q|$.
Case (i). We can suppose for example that $|xp| < |x'q|$ (the case $|xp| > |x'q|$ is handled symmetrically). Then, $xpy \in F(v)$ implies that $\delta(p, y)$ is not a sink state, hence, by the equivalence, $\delta(q, y)$ is not a sink state, that is, $qy \in F(v)$ by Remark 4. Therefore, $v = x''qyz$ where $|x''| \geq |x'|$ by the choice of $x'$ (of minimal length). Hence, $|v| \geq |x'| + |q| + |y| + |z| > |xp| + |y| = |v|$, a contradiction.
Case (ii). The equality $|xp| = |x'q|$ implies either that $p$ is a suffix of $q$ or the converse. Let us suppose for example that $p = sq$ for some word $s \neq \varepsilon$. By Remark 3, statement 2, there exists $w = pz$ that belongs to $MF(F(v))$. By the equivalence, $\delta(q, z)$ is also a sink state and, again by the equivalence, for no proper prefix $u$ of $qz$ is $q_u$ a sink state. Hence, by Lemma 2, statement 1, $q_{qz}$ is an element of $MF(F(v))$ that is a suffix of $qz$. Since $p = sq$ with $s \neq \varepsilon$, $q_{qz}$ is a proper suffix of $pz$, against the anti-factorial property of $MF(F(v))$. A contradiction again.
After cases (i) and (ii) it appears that there cannot exist two different non-sink states $p, q$ in $Q$ that are equivalent. Therefore the automaton without sink states is minimal, which ends the proof. □
4 Minimal forbidden words of a word
We end the article with an algorithm that builds the trie accepting the language $MF(F(v))$ of minimal words avoided by $v$. This is an implementation of the inverse of the transformation described in Section 2. Its design follows Equality 2.
A corollary of the transformation gives a bound on the number of minimal forbidden words of a single word, which improves on the bound coming readily from Proposition 6.
MF-trie (factor automaton $\mathcal{A} = (Q, A, i, T, \delta)$ and its suffix function $s$)
    for each state $p \in Q$ in width-first search from $i$ and each $a \in A$
        if $\delta(p, a)$ undefined and ($p = i$ or $\delta(s(p), a)$ defined)
            set $\delta'(p, a)$ = new sink;
        else if $\delta(p, a) = q$ and $q$ not already treated
            set $\delta'(p, a) = q$;
    return $(Q, A, i, \{\text{sinks}\}, \delta')$;
The input of algorithm MF-trie is the factor automaton of the word $v$. It includes the failure function defined on the states of the automaton and called $s$. This function is a by-product of efficient algorithms that build the factor automaton (see [4]). It is defined as follows. Let $u \in A^+$ and $p = \delta(i, u)$. Then, $s(p) = \delta(i, u')$ where $u'$ is the longest suffix of $u$ for which $\delta(i, u) \neq \delta(i, u')$. It can be shown that the definition of $s(p)$ does not depend on the choice of $u$.
Fig. 3. Factor automaton of $abbab$; all states are terminal.
Fig. 4. Trie of minimal forbidden words of $F(abbab)$ on the alphabet $\{a, b, c\}$. Squares represent terminal states.
Example. Consider the word $v = abbab$ on the alphabet $\{a, b, c\}$. Its factor automaton is displayed in Figure 3. The failure function $s$ defined on states has values: $s(1) = s(5) = 0$, $s(2) = s(3) = 5$, $s(4) = 1$, $s(6) = 2$. Algorithm MF-trie produces the trie of Figure 4 that represents the set of five words $\{aa, aba, babb, bbb, c\}$.
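The traversal performed by MF-trie can be sketched in Python as follows. As an assumption made for this sketch, the input is the suffix automaton of $v$ built by the standard on-line construction with suffix links, used here as a stand-in for the factor automaton: the two automata can differ in general, but the suffix links play the role of the function $s$ and the set of words produced is the same. The function names are illustrative, and the sketch returns the minimal forbidden words as a set instead of building the trie.

```python
from collections import deque


def suffix_automaton(v):
    """Return (trans, link) for the suffix automaton of v; state 0 is initial."""
    trans, link, length = [{}], [-1], [0]
    last = 0
    for ch in v:
        cur = len(trans)
        trans.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in trans[p]:
            trans[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:  # split state q by cloning it
                clone = len(trans)
                trans.append(dict(trans[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and trans[p].get(ch) == q:
                    trans[p][ch] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    return trans, link


def minimal_forbidden_words(v, alphabet):
    """Width-first traversal from the initial state, as in MF-trie."""
    trans, link = suffix_automaton(v)
    shortest = {0: ""}          # shortest word reaching each treated state
    queue = deque([0])
    result = set()
    while queue:
        p = queue.popleft()
        u = shortest[p]
        for a in alphabet:
            q = trans[p].get(a)
            if q is None:
                # "new sink": ua is forbidden, and its longest proper suffix is a factor
                if p == 0 or a in trans[link[p]]:
                    result.add(u + a)
            elif q not in shortest:   # q not already treated
                shortest[q] = u + a
                queue.append(q)
    return result


print(sorted(minimal_forbidden_words("abbab", "abc")))
# ['aa', 'aba', 'babb', 'bbb', 'c']
```

On the word abbab the sketch returns the five words of the example above.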
Theorem 8. Let $\mathcal{A}$ be the factor automaton of a word $v \in A^*$. (It accepts the language $F(v)$.) Algorithm MF-trie builds the tree-like deterministic automaton accepting $MF(F(v))$, the set of minimal forbidden words of $F(v)$.
Corollary 9. A word $v \in A^*$ has no more than $2(|v| - 2)(|A_v| - 1) + |A|$ minimal forbidden words if $|v| \geq 3$, where $A_v$ is the set of letters occurring in $v$. The bound becomes $|A| + 1$ if $|v| < 3$.
Proof. The number of words in $MF(F(v))$ is the number of sink states created during the execution of algorithm MF-trie. Each of these states has exactly one ingoing arc, originating at a state of the factor automaton $\mathcal{A}$ of $v$. So, we have to count these arcs.
From the initial state of $\mathcal{A}$ there are exactly $|A| - |A_v|$ such arcs. From the (unique) state of $\mathcal{A}$ without outgoing arc, there are at most $|A_v|$ such arcs. From other states there are at most $|A_v| - 1$ such arcs.
For $|v| \geq 3$, it is known that $\mathcal{A}$ has at most $2|v| - 2$ states (see [4]). Therefore,
$|MF(F(v))| \leq (|A| - |A_v|) + |A_v| + (2|v| - 4)(|A_v| - 1) = 2(|v| - 2)(|A_v| - 1) + |A|$.
When $|v| < 3$, it can be checked directly that $|MF(F(v))| \leq |A| + 1$. □
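As an illustration of the bound, it can be evaluated on the word $v = abbab$ over the alphabet $\{a, b, c\}$ used in the example of this section, for which $|v| = 5$, $|A_v| = 2$ and $|A| = 3$:

```latex
\[
  2(|v| - 2)(|A_v| - 1) + |A| \;=\; 2 \cdot 3 \cdot 1 + 3 \;=\; 9
  \;\geq\; 5 \;=\; |MF(F(abbab))| .
\]
```

The bound is therefore not attained on this word, which has five minimal forbidden words.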
Theorem 10. Algorithm MF-trie runs in time $O(|v| \times |A|)$ on input word $v$ if transition functions are implemented by transition matrices.
References
1. A. V. Aho and M. J. Corasick. Efficient string matching: an aid to bibliographic search, Comm. ACM 18:6 (1975) 333–340.
2. M.-P. Béal, F. Mignosi, and A. Restivo. Minimal Forbidden Words and Symbolic Dynamics, in STACS'96 (C. Puech and R. Reischuk, eds., LNCS 1046, Springer, 1996) 555–566.
3. J. Cassaigne. Complexité et Facteurs Spéciaux, Bull. Belg. Math. Soc. 4 (1997) 67–88.
4. M. Crochemore, C. Hancart. Automata for matching patterns, in Handbook of Formal Languages, G. Rozenberg, A. Salomaa, eds., Springer-Verlag, 1997, Volume 2, Linear Modeling: Background and Application, Chapter 9, 399–462.
5. V. Diekert, Y. Kobayashi. Some identities related to automata, determinants, and Möbius functions, Report 1997/05, Fakultät Informatik, Universität Stuttgart, 1997.
6. A. de Luca, F. Mignosi. Some Combinatorial Properties of Sturmian Words, Theor. Comp. Sci. 136 (1994) 361–385.
7. A. de Luca, L. Mione. On Bispecial Factors of the Thue-Morse Word, Inf. Proc. Lett. 49 (1994) 179–183.
8. R. McNaughton, S. Papert. Counter-Free Automata, M.I.T. Press, MA 1970.
9. D. Perrin. Symbolic Dynamics and Finite Automata, invited lecture in Proc. MFCS'95, LNCS 969, Springer, Berlin 1995.