Languages and scanners

(1)

Languages and scanners

Dani`ele Beauquier and Jean-Eric Pin ^∗

In this paper, we are interested in a special class of automata, called scanners. These machines can be considered as a model for computations that require only “local” information. Informally, a scanner is an automaton equipped with a finite memory and a “sliding” window of a fixed length. In a typical computation, the sliding window is moved from left to right on the input, so that the scanner can remember the factors of length smaller than or equal to the size of the window. In view of these factors, the scanner decides whether or not the input is accepted or rejected.

a1 a2 a3 a4 a5 · · · an

Finite Memory

Figure 0.1: A scanner.

Scanners have been used for a long time in language theory. Every- one knows the local languages which occur for instance in the theorem of Chomsky-Sch¨utzenberger on context-free languages. Roughly speaking, a local language is described by the factors of length 2 of its words. For instance, ifA={a, b, c, d}, the language c(ab)⁺d is the set of all words whose set of factors of length 2 is exactly{ca, ab, ba, bd}. The locally testable languages generalize local languages : the membership of a given word in such a language is determined by the set of factors of a fixed lengthk (the order in which these factors occur and their frequency is not relevant) of the word, and by the prefixes and suffixes of length< kof the word. These conditions can be tested by a scanner. The locally testable languages are character-

∗LITP/IBP, Universit´e Paris VI et CNRS, Tour 55-56, 2 place Jussieu, 75251 Paris Cedex 05, FRANCE.e-mail: beauquier@litp.ibp.fr, pin@litp.ibp.fr

(2)

ized by a deep algebraic property of their syntactic semigroup, discovered independently by Brzozowski-Simon [2] and McNaughton [7].

There are several possible variations on this definition. First, one can drop the conditions about the prefixes and suffixes. Membership in this type of language, that we call strongly locally testable (SLT), is determined only by factors of a fixed length k. Thus, a language is SLT if and only if it is a finite boolean combination of languages of the formA^∗wA^∗, where wis a word. Surprisingly, this rather natural family of languages does not seem to have been considered previously in the literature. We show that this family is also decidable and characterized by another nice algebraic property. But this time, the syntactic semigroup does not suffice, and a property of the image of the language in its syntactic semigroup is needed.

A second natural extension is to take in account the number of occurrences of the factors of the word. However, since we want to use finite automata to recognize our languages, we can only count factors up to a certain threshold. Threshold counting is the favorite way of counting of small children : they can distinguish 0, 1, 2, . . . but after a certain numbern(the threshold), all numbers are ”big”. From a more mathematical point of view, two positive integers sand t are congruent threshold n ifs=t or if s≥n and t ≥n. This defines the threshold locally testable languages (TLT). A combination of two deep results of Straubing [11] and Th´erien and Weiss [13]

yields a syntactic characterization of these languages. In view of the results of the previous paragraph, it is reasonable to think that similar results hold if one drops the conditions about the prefixes and suffixes. However, we are not yet able to solve this problem.

The families of languages we have introduced are also deeply connected with the study of first order theory of successor, with a predicate for each letter, interpreted on finite words. Indeed Thomas [14] proved that the languages definable in this logic are exactly the TLT languages. Since we have an effective syntactic characterization of these languages, we derive the following decidability result : given a monadic second order formula ϕ of the theory of successor, it is decidable whetherϕ is equivalent to a first order formula. We also show that the languages definable by a boolean combination of existential formulas are exactly the SLTT languages.

Finally, scanners can also be used to define sets of infinite (or even bi- infinite) words. This is technically more difficult and will be the subject of a future paper. Most of the results of the present paper as well as those of this future paper have been presented, without proofs, at the ICALP in Stresa, 1989 [1].

(3)

1 Preliminaries

1.1 Words.

LetA be a finite alphabet. We denote byA^∗ the set of words over A, and by A⁺ the set of non empty words. Ifu is a word of length≥k, we denote byup_k andus_k, respectively, the prefix and suffix of lengthk ofu. Ifu and x are two words, we denote by _u

x

the number of occurrences of the factor x inu. For instance _abababa

aba

= 3, since aba occurs in three different places inabababa: abababa, abababa, abababa.

1.2 Finite semigroups.

Recall that a semigroup is a set S equipped with an associative multipli- cation. All semigroups considered in this paper are finite, except for free semigroups and free monoids. Therefore the word “semigroup” will mean

“finite semigroup” in the rest of the paper. An elementeof a semigroupS is idempotent ife²=e. The set of idempotents of a semigroupS is denoted by E(S). Every non-empty finite semigroup contains at least one idempotent.

IfS =E(S), S is an idempotent semigroup.

A monoid is a semigroup with an identity. Let S be a semigroup. We denote by S¹ the monoid equal to S if S has an identity, and to S∪ {1}, where 1 is a new identity, otherwise.

Given two semigroups S and T, a semigroup morphism ϕ : S → T is a function from S into T such that, for every s, s⁰ ∈ S,(sϕ)(s⁰ϕ) = (ss⁰)ϕ.

Recall the definitions of the Green’s relationsR, LandD: sRtif and only if there exists u, v ∈ S¹ such that su =t and tv = s, s L t if and only if there existsu, v ∈S¹ such thatus=tand vt=s, sDt if and only if there existsu∈S¹ such that sRu anduLt. We denote by ≤J the preorder on S defined by s≤J t if and only if there exists u, v ∈S¹ such that usv=t.

One can show that sDt if and only if s ≤J t and t ≤J s. We shall need the following technical result.

Lemma 1.1 Let S and T be two finite semigroups, let π : S → T be a surjective morphism. Let t and t⁰ be two elements of T such that t R t⁰ (respectively t L t⁰, t D t⁰). Then there exist s, s⁰ ∈ S such that sπ = t, s⁰π=t⁰ and sRs⁰ (respectively sLs⁰,sDs⁰).

Proof. We give the proof forR only, but the proof for Land D is similar.

Lettandt⁰ be two elements ofT such that tRt⁰. Sinceπ is surjective, the set tπ⁻¹ is non empty. Choose a minimal elements (for the preorder≤J) intπ⁻¹. SincetRt⁰, there existu, v ∈T¹such that t⁰=tuandt=t⁰v. Let x, y∈S¹be such thatxπ=uandyπ=v. Then (sxy)π=tuv=t⁰v=tand sxy≤J sx≤J s. Thus, by the choice ofs, we havesxyDs. It follows, by a

(4)

standard argument ([9], proposition 1.4, p.47),sxyRs, whences⁰=sxRs.

This proves the lemma, sinces⁰π =tu=t⁰.

Let S be a finite semigroup. A local subsemigroup of S is a subsemigroup of S of the form eSe, where e ∈ E(S). A semigroup is said to be locally trivial, (respectively locally commutative, locally idempotent, locally a group, etc.) if all the local subsemigroups of S are trivial (respectively commutative, idempotent, groups, etc.). For instance, a semigroupS is locally idempotent and commutative if, for each e∈E(S) and each s, t∈S, (ese)² = (ese) and (ese)(ete) = (ete)(ese).

1.3 Semidirect products.

Let M be a monoid and let T be semigroup. We write the product in M additively to provide a more transparent notation , but it is not meant to suggest that M is commutative. A left action of T on M is a map (t, m) → tm from T ×M into M such that, for all m, m₁, m₂ ∈ M and t, t₁, t₂ ∈T, (t₁t₂)m=t₁(t₂m) and t(m₁+m₂) =tm₁+tm₂, t0 = 0 Given such a left action, the semidirect product of M andT (with respect to this action) is the semigroup M ∗T defined on the set M ×T by the product (m₁, t₁)(m₂, t₂) = (m₁+t₁m₂, t₁t₂).

1.4 Counting semirings.

We define the following congruences onN. x ≡y threshold t(also denoted x≡t y) if and only if (x < tand x =y) or (x≥tand y ≥t). For instance the equivalence classes of ≡4 are {0}, {1}, {2}, {3}, {4,5,6,7, . . .}. The quotient semiring N_t = N/≡t is called a counting semiring. In particular, the boolean semiringB=N₁ is a counting semiring.

2 Scanners and languages defined by factors of words.

2.1 Some equivalences and congruences.

Factors of words can be used in many different ways to define families of languages. We have selected four of them, which form the subject of this article. For everyk, t >0, let≡k,t be the equivalence of finite index defined on A⁺ by setting u ≡k,t v if and only if, for every word x of length ≤ k, _u

x

≡t

_v

x

. For instance, u≡_k,1v if and only ifu and v have the same sets of factors of length k, and u ≡k,5 v if and only if u and v have the same factors of lengthk, but counted with multiplicity threshold 5.

(5)

Example 2.1 abababab≡2,3abababasinceababababcontains 4 (≡3 threshold 3) occurrences of ab and 3 (≡3 threshold 3) occurrences of ba, and no occurrences ofaa (respectivelybb).

We also define a congruence∼k,t of finite index onA⁺ by settingu∼k,t vif (a) uand v have the same prefixes of length < k,

(b) uand v have the same suffixes of length< k, (c) u≡k,t v.

These four equivalences define four classes of languages. A subset ofA⁺ is locallyk-testable if it is union of∼k,1-classes. It isthreshold locallyk-testable if it is union of ∼k,t-classes for some t. If one replaces in the previous definitions the congruence∼k,t by the equivalence ≡k,t, one defines the corresponding notions ofstrongly locally k-testable andstrongly threshold locally k-testable languages.

A subset of A⁺ is locally testable (LT) if it is locallyk-testable for some k > 0. The notions of threshold locally testable, strongly locally testable, strongly threshold locally testable languages are defined similarly. The corresponding abreviations are TLT, SLT, and STLT.

2.2 A combinatorial description.

One can also define these classes as boolean algebras. Set, for x∈A⁺, and r, t≥0,

L(x, r, t) ={u∈A⁺| u

x

≡t r}

Thus L(x, r, t) is the set of all words u containing r occurrences of the factor x, but counted threshold t. For instance, L(x,1,1) = A^∗xA^∗, and L(x,0,1) =A⁺\A^∗xA^∗.

Proposition 2.1 Let L be a subset ofA⁺.

(1) L is SLT if and only if it is a boolean combination of languages of the form L(x,1,1),

(2) L is STLT if and only if it is a boolean combination of languages of the formL(x, r, t),

(3) L is LT if and only if it is a boolean combination of languages of the form L(x,1,1), xA^∗, or A^∗x,

(4) L is TLT if and only if it is a boolean combination of languages of the form L(x, r, t), xA^∗, or A^∗x.

Proof. We only recall the proof of (1), which is standard. The proof of the other statements can be easily adapted. For each u∈A⁺, put

Ek(u) ={x∈A⁺ |xis a factor of length k of u} and Fk(u) =A^k\Ek(u)

(6)

Then the equivalence class of uwith respect to ≡_k,1 is the set S(u) = \

x∈Ek(u)

A^∗xA^∗\ [

x∈Fk(u)

A^∗xA^∗

This shows that S(u) is a boolean combination of languages of the form A^∗xA^∗.

Conversely, let L be a finite boolean combination of languages of the formA^∗xA^∗. Let k be the maximal length of the wordsx occurring in such a boolean expression. It suffices to show that if |x| ≤ k, then A^∗xA^∗ is a union of ≡_k,1-classes. But this is clear, since if u ∈ A^∗xA^∗ and u ≡_k,1 v, thenu andv have the same factors of length |x|, whence v∈A^∗xA^∗.

The relations between the four classes is shown in the following diagram.

SLT

ST LT

T LT

LT

Example 2.2 LetA={a, b}. Then (ab)⁺ is locally testable since (ab)⁺= (aA^∗∩A^∗b)\(A^∗aaA^∗ ∪A^∗bbA^∗). We shall see in section 3 that (ab)⁺ is not strongly locally testable.

Example 2.3 Let A = {a, b}. Then the set a^∗ba^∗ is strongly threshold locally testable : it is the set of all words containing exactly one occurrence ofb. We shall see in section 3 that a^∗ba^∗ is not locally testable.

2.3 Scanners.

Our four classes of languages can also be defined in terms of a special class of finite automata, the scanners. The informal definition of a scanner has been given in the introduction. There are actually two types of scanners, depending on the use of the window : normally, a window of sizekis allowed to read only factors of lengthk. If the scanner is unbounded, we allow the window to be also moved beyond the first and the last letter of the word, so that the prefixes and suffixes of length< k can be read. For instance, if k= 3, and u=abbaaabab, different positions of the window are represented in the following diagrams :

a bbaaabab ab baaabab abb aaabab a bba aabab · · · abbaaaba b

(7)

We now give the formal definition. A scanner on a counting semiring K is a triple S = (A, k, F) where A is a finite set (the alphabet), k is a positive integer (the size of the window),F is a (finite) set of polynomials of KhA^∗i of degreek, called the memory. Letu∈A⁺be a word. Letf :A⁺ →KhA^∗i be the application defined by

uϕ= X

|x|=k

u x

x

Thusuϕis just the formal sum of all factors of lengthkofu, with multiplicity counted inK. For instance, ifA={a, b}, K=N₃, and k= 3,

(abbabbaabababbabbab)ϕ= 4abb+ 4bba+ 5bab+baa+aab+ 2aba

= 3abb+ 3bba+ 3bab+baa+aab+ 2aba A wordu is accepted byS if and only if uf ∈F.

A scanner on the boolean semiring is aboolean scanner. In the case of an unbounded scanner, the window is allowed to read the prefixes and suffixes of length< k. To represent this information, one introduces two new functions π, σ : A⁺ → KhA^∗i defined by uπ = P

t<kupt and uσ = P

t<kust. The memory is now a triple (P, F, S) of sets of polynomials of KhA^∗i of degree

≤k. Intuitively, P codes the set of possible prefixes, F the set of possible factors, andS the set of possible suffixes. A wordu is accepted byS if and only if uπ∈P, uϕ∈F and uσ∈S.

Proposition 2.2 Let L be a subset ofA⁺.

(1) L is SLT if and only if it is accepted by a boolean scanner, (2) L is STLT if and only if it is accepted by a scanner,

(3) L is LT if and only if it is accepted by an unbounded boolean scanner, (4) L is TLT if and only if it is accepted by an unbounded scanner, Proof. Again, we only give the proof for (1), but the other proofs are similar. Let S = (A, k, F) be a boolean scanner recognizing a subset L of A⁺. Observe that u ≡k,1 v if and only if uf = vf. It follows that L is union of ≡k,1-classes. Conversely, if L is union of ≡k,1-classes for some k, put F = uf |u∈L. Then the boolean scanner S = (A, k, F) recognizes L.

3 Syntactic characterizations.

In this section, we give effective characterizations for three of the four families of languages introduced above. In order to keep a uniform notation

(8)

for the subsequent statements, we shall denote by S(L) the syntactic semigroup of a recognizable language L of A⁺, by η : A⁺ → S(L) its syntactic morphism, and by P(L) =Lη the syntactic image of L. We need first to introduce some algebraic tools.

3.1 Varieties of semigroups.

A variety of (finite) semigroups is a class of semigroups closed under taking subsemigroups, morphic images (or quotients) and finite direct products.

Varieties of monoids are defined similarly. The following varieties will be used in this paper.

• J₁, the variety of all idempotent and commutative monoids,

• Com, the variety of all commutative monoids,

• Acom, the variety of all commutative aperiodic monoids,

• LI, the variety of locally trivial semigroups,

• LI_k, the variety of all semigroupsS that satisfy the equation x₁x₂· · ·xkxx₁x₂· · ·xk=x₁x₂· · ·xk

for allx, x1, x2, . . . , xk∈S.

• LJ₁, the variety of locally idempotent and commutative semigroups, Given a variety of monoidsVand a variety of semigroupsW, we denote byV∗Wthe variety of semigroups generated by all the semidirect products of the formM∗T, whereM ∈V and T ∈W.

3.2 Varieties of languages.

LetVbe a variety of semigroups (monoids). One associates to each alphabet Athe setA⁺V (A^∗V) of all languages ofA⁺(A^∗) whose syntactic semigroup (monoid) belongs to V. V is called the variety of languages corresponding to V. A description of various varieties of languages can be found in the litterature [5, 6, 9].

Proposition 3.1 For each alphabet A,

(1) A⁺J1 is the boolean algebra generated by the languages of the form A^∗aA^∗ where a∈A,

(2) A⁺LIk is the boolean algebra generated by the languages of the form A^∗u, vA^∗ where u and v are words of length ≤k,

(3) A⁺LI is the boolean algebra generated by the languages of the form A^∗u, vA^∗ where u, v ∈A⁺,

(9)

(4) A^∗Acom is the boolean algebra generated by the languages of the form {u∈A^∗ | |u|a =r}, where a∈A andr ∈N.

For now, we need a description, due to Straubing [11], of the varieties of languages corresponding to V∗LI and toV∗LI_k, whenV is a variety of monoids.

Let k be an integer, and let B_k=A^k. To avoid ambiguity, words ofB_k^∗ will be represented by finite sequences (b₁, b₂,· · · , b_n), where eachb_i ∈A^k. We define a (sequential) function σk :A⁺→B_k^∗ by setting

wσ_k= 1 if |w| < k,

(wa)σ_k = (wσ_k,(wa)s_k) if |w| ≥k and a∈A.

For example, (abbaab)σ₃ = (abb, bba, baa, aab). Thusσkassociates to a word u the sequence of words appearing on a window of size k when u is read from left to right.

To each congruence α of finite index on B_k^∗, associate the congruence (α, k) onA⁺ defined by u(α, k)v if and only if

(a) uand v have the same prefixes of length < k, (b) uand v have the same suffixes of length< k,

(c) uσ_kαvσ_k

Denote by V and Wk the varieties of languages corresponding toV and V∗LIk, respectively.

Theorem 3.2 [11]A language belongs toA⁺Wk−1if and only if it is a finite union of(α, k)-classes for some congruenceα on B_k^∗ such that B_k^∗/α∈V.

We give an equivalent form of Theorem 3.2 in terms of boolean algebras.

Corollary 3.3 For every alphabet A, A⁺Wk−1 is the boolean algebra generated by the languages of the form A^∗u, vA^∗ (where u and v are words of length < k) or Xσ⁻¹_k , whereX ∈B_k^∗V.

Proof. In one direction, it suffices to show that each of the languages A^∗u, vA^∗ andXσ_k⁻¹ belong toA⁺Wk−1. First, if|u| < k, then, by proposition 3.1, S(A^∗u), S(uA^∗)∈LI_k−1 ⊂V∗LI_k−1. Furthermore, sinceσk is a sequential function, a general result ([5], Chap. 6) states that S(Xσ⁻¹_k ) divides a wreath product of the form M(X)◦S(σ_k), where S(σ_k) is the syntactic invariant of σk. Now M(X) ∈ V since X ∈ B_k^∗V, and S(σk) ∈ LI_k−1. It follows that S(Xσ_k⁻¹)∈V∗LI_k−1.

Conversely, if L∈A⁺Wk−1, then by Theorem 2.5, L is union of (α, k)- classes for a certain congruence α such that B_k^∗/α ∈ V. Now, it follows from the definition of (α, k) that the equivalence classes of (α, k) are boolean combinations of sets of the formA^∗u,vA^∗(whereuandvare words of length

(10)

< k) or Xσ⁻¹_k , whereX is an equivalence class forα. But sinceB_k^∗/α∈V, one hasX ∈B_k^∗V.

For the variety V∗LI, one has the following result.

Theorem 3.4 Let S be a semigroup. Then S ∈ V∗ LI if and only if S∈V∗LI_k, wherek =|S|.

Applying these results withV=AcomandJ₁ respectively, one obtains the following corollary.

Corollary 3.5

(1) L is locally testable if and only ifS(L) belongs toJ₁∗LI,

(2) Lis threshold locally testable if and only ifS(L)belongs toAcom∗LI, Proof. We only prove (2), the proof of (1) beeing similar. LetLbe a subset ofA⁺. By Corollary 3.3 and Theorem 3.4,S(L)∈Acom∗LIif and only ifL is a boolean combination of languages of the formA^∗u,vA^∗ orXσ⁻¹_k , where X ∈B_k^∗Acom. Furthermore, by Proposition 3.1, X ∈B_k^∗Acomif and only ifX is a boolean combination of languages of the form{u∈B_k^∗ | |u|x =r}, wherex∈B_k and r∈N. Now we have

{u∈A^∗ | |u|x =r}σ_k⁻¹ ={u∈A^∗| u

x

=r}=L(x, r, t) for every t > r.

Therefore S(L) ∈ Acom ∗LI if and only if L is a boolean combination of languages of the formA^∗u, vA^∗ or L(x, r, t), that is, if and only if L is threshold locally testable.

Note that corollary 3.5 does not give an algorithm to decide whether a language is LT, TLT or PLT. Indeed, it is not clear at first sight whether one can decide if a given semigroup belongs to J₁∗LIorAcom∗LI. But Brzozowski and Simon [2] and Mac Naughton [7] shaw independently that J₁∗LI=LJ₁. Therefore we have

Theorem 3.6 [2, 7, 11] L is locally testable if and only if S(L) belongs to LJ₁.

The syntactic characterization of locally threshold testable languages is more involved and depends on a deep result of Th´erien and Weiss [13]. Given a semigroup S, form a graphG(S) as follows : the vertices of G(S) are the idempotents of S, and the edges from e to f are the elements of the form esf.

Theorem 3.7 [13]A semigroup S belongs to Acom∗LIif and only ifS is aperiodic and its graph satisfies the following condition (C): if p and r are edges from eto f, and if q is an edge from f to e, then pqr=rqp.

(11)

e f p, r

q

Figure 3.2: The conditionpqr=rqp.

Therefore, one obtains

Corollary 3.8 A language Lis threshold locally testable if and only ifS(L) is aperiodic and its graph satisfies condition (C).

Example 3.1 LetA={a, b}, and let L=a^∗ba^∗. Then Lis recognized by the automaton

1 a

2 a

b

Figure 3.3: The minimal automaton ofa^∗ba^∗. The transitions are given in the following table

1 2

a 1 2

b 2 −

bb − −

Therefore, S = S(L) is presented by the relations a = 1, b² = 0. Thus S = {a, b,0}, where a = 1 is the identity, and E(S) = {1,0}. The local semigroups are 0S0 = {0}, and 1S¹ = S. This last local semigroup is not idempotent, sinceb²6=b. Therefore,Lis not locally testable. On the other hand,G(S) is the graph represented below, which satisfies condition (C).

1, b,0 1 0 0

0

Figure 3.4: The graph ofS.

ThereforeL is TLT (see example 2.2).

The three classes of languages we have considered so far were characterized by an algebraic property of their syntactic semigroup. Such a property

(12)

do not suffice, however, to characterize the class of strongly locally testable languages. To overcome this difficulty, we need to consider not only the syntactic semigroup, but also the syntactic image of the language.

LetSbe a semigroup and letP be a subset ofS. We say thatP saturates theD-classes ofS if, for every D-classD ofS, D∩P 6=∅impliesD ⊂P. It is equivalent to say thats∈P andsDtimplyt∈P. The next proposition shows that this property is stable under quotients.

Proposition 3.9 Let S and T be two semigroups, let π : S → T be a surjective morphism. LetPS (respectively PT) be a subset of S (respectively T) such that P_T =P_Sπ and P_S =P_Tπ⁻¹. Then P_S saturates the D-classes of S if and only if PT saturates the D-classes of T.

Proof. Suppose thatPT saturates theD-classes ofT. Lets∈PS ands⁰∈S such that s D s⁰. Then sπ D s⁰π, sπ ∈ P_T and therefore, s⁰π ∈P_T. Thus s⁰∈PTπ⁻¹=PS, and PS saturates the D-classes of S.

Conversely, suppose that PS saturates the D-classes of S. Let t, t⁰ ∈ T such that t∈P_T and tDt⁰. By Lemma 1.1, there existss, t∈S such that sπ=t, s⁰π=t⁰ and sDs⁰. It follows thats∈PTπ⁻¹ =PS, whences⁰ ∈PS

and s⁰π =t⁰ ∈PSπ=PT.

Theorem 3.10 A languageLis strongly locally testable if and only if S(L) is locally idempotent and commutative and P(L) saturates the D-classes of S(L).

Proof. To simplify notations, putS =S(L), P =P(L) and denote by ∼k

the congruence ∼k,1. If L is SLT, then L is also LT and thus S is locally idempotent and commutative by Theorem 3.6. Furthermore,Lis a boolean combination of languages of the formA^∗xA^∗. Letk be the maximal length of the words x occurring in this boolean combination. Then, if u ∈L and ifu and v have the same factors of length ≤k, then v ∈L. We claim that P saturates the R-classes of S. Let s ∈ P and let t ∈ S such that s R t.

Then there exist some elements x, y ∈S(L)¹ such that sx =t and ty =s.

Let s⁰ ∈ A⁺, x⁰, y⁰ ∈A^∗ be words such that s⁰η = s, x⁰η = x and y⁰η =y (if x = 1 or y = 1, we take x⁰ = 1 or y⁰ = 1, respectively). Now the word s⁰(x⁰y⁰)^k belongs to L, since (s⁰(x⁰y⁰)^k)η = s(xy)^k = s. Furthermore, the wordss⁰(x⁰y⁰)^kand s⁰(x⁰y⁰)^kx⁰ contain the same factors of length≤k. This is obvious ifx⁰ = 1. Suppose nowx⁰∈A⁺. Then every factor ofs⁰(x⁰y⁰)^k is clearly a factor of s⁰(x⁰y⁰)^kx⁰. Conversely, let t be a factor of length ≤k of s⁰(x⁰y⁰)^kx⁰. Then eithertis a factor ofs⁰(x⁰y⁰)^k, ortis a factor of (x⁰y⁰)^k−1x⁰, since |(x⁰y⁰)^k−1| ≥ k−1. But (x⁰y⁰)^k−1x⁰ itself is a factor of s⁰(x⁰y⁰)^k and thustis a factor ofs⁰(x⁰y⁰)^k. It follows, by the remark above, thats⁰(x⁰y⁰)^kx⁰ belongs to Land hence (s⁰(x⁰y⁰)^kx⁰)η =s(xy)^kx=sx=t∈P, proving the

(13)

claim. A dual argument would show that P saturates the L-classes, and henceP also saturates the D-classes.

Conversely, assume that S is locally idempotent and commutative and thatP saturates theD-classes ofS. Then,L is locally testable by theorem 3.6, and thus is union of ∼k-classes for some k. Put Sk = A⁺/∼k. Then there is a surjective morphismπk :Sk→S, and a subsetQof Sk such that

(a) L=Qπ⁻¹_k and Q=Lπ_k, (b) Q=P π⁻¹ andP =Qπ.

Now, by Proposition 3.9, Q saturates the D-classes of Sk. To finish the proof, we need a result on theD-classes ofS_k. Denote by F_k(u) the set of factors of lengthk of a word u.

Lemma 3.11 Letuandv be two words ofA⁺. Thenuπ_kDvπ_kif and only if either u=v or Fk(u) =Fk(v)6=∅ (this case implies that u and v are of length ≥k).

Proof. We first treate the case|u|< k (respectively|v|< k). IfuπkDvπk, then there exist four wordsx, y, s, t∈A^∗ such thatxuy∼kv and svt∼ku.

This impliessvt=u, whence|v|< k, and thus xuy=v, so that u=v.

Suppose now |u|,|v| ≥ k. IfuπkDvπk, there exist two wordsx, y∈A^∗ such that xuy ∼k v. In particular, every factor of length k of u is a factor ofv, and, by a dual argument,F_k(u) =F_k(v).

Conversely, assume that Fk(u) = Fk(v) and let p (respectively s) be the prefix (suffix) of length k of u. Then p (respectively s) occurs in v, so that v = v₀pv₁ =v₂sv₃. Put w = v₀uv₃. We claim that w ∼k v. Indeed v₀p (respectively sv₃) is a common prefix (suffix) of v and w. Next, since F_k(u) =F_k(v), each factor of lengthk ofvis a factor ofuand hence a factor ofw. Conversely, lettbe a factor of length k ofw. Thentis either a factor ofv0p, a factor ofu, or a factor of sv3. In each case, it is also a factor ofv, which proves the claim. Thereforevπk= (v₀πk)(uπk)(v₃πk), and, by a dual argument,uπ_kDvπ_k.

We now conclude the proof of Theorem 3.10. We start with the equality L=Qπ_k⁻¹ and we distinguish two categories of elements inQ. Put

Q₁ ={s∈Q| every word ofsπ⁻¹_k is of length≥k}and Q₂ ={s∈Q| there exists a word ofsπ⁻¹_k of length< k}

Then, sinceQ=Q₁∪Q₂, we have

L=Q₁π_k⁻¹∪ [

s∈Q₂

sπ_k⁻¹

and we shall prove separately that the languages Q₁π_k⁻¹ and sπ_k⁻¹, for s∈ Q₂, are SLT.

(14)

Lets∈Q2. Thensπ⁻¹_k contains a worduof length< k, andsπ_k⁻¹={u}, since u cannot be equivalent to another word. But

{u}=A^∗uA^∗\ [

a∈A

A^∗uaA^∗∪ [

a∈A

A^∗auA^∗

!

and thus sπ⁻¹_k is strongly locally testable.

Since Q saturates the D-classes of Sk, and since, by Lemma 3.11, an element ofQ₂ cannot beD-equivalent with an element ofQ₁, Q₁ is a union ofD-classes. Furthermore, Lemma 3.11 shows that aD-classDcontained in Q1 is entirely characterized by a certain non empty setF of words of length k. More precisely Dπ⁻¹_k ={u∈A⁺ | Fk(u) =F}. It follows that Dπ_k⁻¹ is strongly locally testable, since

{u∈A⁺|Fk(u) =F}= \

x∈F

A^∗xA^∗\[

x∈A^k\F A^∗xA^∗

Finally,Lis a finite union of SLT languages, and thus is also strongly locally testable.

Example 3.2 Let A = {a, b, c}, and let L = c(ab)^∗ ∪c(ab)^∗a. Then L is recognized by the following automaton

1 c 2 3

a

b

Figure 3.5: An automaton recognizing L=c(ab)^∗∪c(ab)^∗a.

The transitions are given in the following table

1 2 3

a − 3 −

b − − 2

c 2 − −

ab − 2 − ba − − 3 ca 3 − −

Therefore,S(L) is presented by the relations cabc=c, a² =b²=c² =bc= ac= 0. TheD-class structure is represented in the following diagram, where the grey box is the image of L.

(15)

∗ab a b ^∗ba

c ca

∗0

Figure 3.6: The J-class structure.

ThusP(L) saturates theD-classes, andL is SLT. In fact,

L=A^∗cA^∗\(A^∗aaA^∗∪A^∗acA^∗∪A^∗bbA^∗∪A^∗bcA^∗∪A^∗cbA^∗∪A^∗ccA^∗).

The next statement summarizes the results of this section.

Corollary 3.12 For a given recognizable subsetLofA⁺, the following prop- erties are effectively decidable :

(1) L is locally testable,

(2) L is threshold locally testable, (3) L is strongly locally testable.

In view of these results, it is natural to conjecture that one can also decide whether a given language is STLT, but we don’t have a proof of this fact.

4 Connections with logic.

The connections between formal languages and mathematical logic were first studied by B¨uchi [3]. To each word u∈A⁺ is associated a structure

Mu = ({1,2, . . . ,|u|}, S,(R_a)_a∈A)

whereS denotes the successor relation on{1,2, . . . ,|u|} andR_a is set of all isuch that thei-th letter of u is ana. For instance, ifA ={a, b} and u= abaab, thenRa={1,3,4}andRb={2,5}. The logical language appropriate to such models hasS and theRa’s as non logical symbols, and formulas are built in the standard way by using these non-logical symbols, variables, boolean connectives, equality and quantifiers. Note that we don’t use the symbol < in this logic. We shall denote byL1(A) and L2(A), respectively, the set of first order and monadic second order formulas of this logic. Given

(16)

a sentenceϕ, we denote by L(ϕ) the set of all words which satisfy ϕ, when ϕis interpreted according to the model described above. For instance, if

ϕ=∃x∃y∃z((y=Sx)∧(z=Sy)∧Rax∧Rby∧Rbz) thenL(ϕ) =A^∗abbA^∗.

The seminal result of B¨uchi can now be stated as follows

Theorem 4.1 [3]A subset of A⁺ is rational if and only if it can be defined by a L2(A)-sentence.

The first order theory was investigated by Thomas [14]. Thomas proved that a language is TLT if and only if it is definable by a L⁰₁(A)-sentence, whereL⁰₁(A) is the logical language obtained by completingL1(A) with the 0-ary symbols min, max, interpreted as the minimum 1 and the maximum

|u|on{1, . . . ,|u|}. Furthermore, Thomas proved that boolean combinations of existentialL⁰₁(A)-sentences were sufficient to define TLT-languages. Now, it is easy to define min and max in terms ofS. For instance, one can consider min as a new variable satisfying the formula∀x¬S(x, min). Therefore, one obtains

Theorem 4.2 A subset of A⁺ is threshold locally testable if and only if it is definable by aL₁(A)-sentence.

However, the situation is slightly different if one considers only boolean combinations of existentialL1(A)-sentences, because rewriting min in terms ofS creates some alternations of quantifiers. More precisely, we have Theorem 4.3 A subset of A⁺ is strongly threshold locally testable if and only if it is definable by a boolean combination of existentialL1(A)-sentences.

Proof. IfLis a STLT language,Lis a boolean combination of languages of the formL(x, r, t). Therefore it suffices to show that each of these languages can be defined by a boolean combination of existential L1(A)-sentences.

The formal proof can easily be adapted from the following example, where A={a, b}. One hasL(ab,2,3) =L(ϕ), whereϕ=ϕ₁∧ ¬ϕ₂, and

ϕ1 =∃x1∃x2∃x3∃x4(¬(x1 =x3)∧S(x1, x2)∧S(x3, x4)

∧R_ax₁∧R_bx₂∧R_ax₃∧R_bx₄)

ϕ₂ =∃x1∃x2∃x3∃x4∃x5∃x6(¬(x1 =x₃)∧ ¬(x3=x₅)∧ ¬(x1 =x₅)

∧S(x₁, x₂)∧S(x₃, x₄)∧S(x₅, x₆)

∧Rax₁∧Rbx₂∧Rax₃∧Rbx₄∧Rax₅∧Rbx₆)

Conversely, it suffices to show that a language L defined by an existential L1(A)-sentence ϕ is SLTT. We use an argument of game theory, which

(17)

we borrow from [14]. For the conveniance of the reader, we briefly review the terminology of game theory needed to achieve the proof (see [10] for more details). Let u = u₁· · ·ur and v = v₁· · ·vs be two words, where u1, . . . , ur, v1, . . . , vs∈A. A position inu (respectively inv) is a an element of {1, . . . ,|u|} (respectively {1, . . . ,|v|}). For each m > 0, define a game Gm(u, v) between two players as follows. Player I plays first, and choosesm positionsi₁, . . . , i_m of u. Then player II must choose m positionsj₁, . . . , j_m inv. We say that II wins the game if the following conditions are satisfied : (1) For 1≤r ≤m, the letteruir is equal to the lettervjr. (Intuitively, if I chooses an occurrence of a lettera, then II should choose an occurrence of the same letter.)

(2) For each r, s ≤ m, ir = is if and only if jr = js. (Intuitively, if I decides to choose twice - or more - the same position, then II should follow this choice. Conversely, II is not allowed to choose twice the same position if it was not the choice of I.)

(3) For each r, s≤m, i_r =i_s+ 1 if and only if j_r =j_s+ 1. (Intuitively, if I decides to choose two adjacent positions, then II should follow this choice. Similarly, if I chooses two non-adjacent positions, then II should follow this choice.)

Now, by the theory of Ehrenfeucht-Fra¨ıss´e, two words u and v satisfy the same existential formulas of quantifier rank≤mif and only if Player II has a winning strategy in the games Gm(u, v) and Gm(v, u).

The main argument of the proof is the following lemma.

Lemma 4.4 Letn= 2^m+ 1 andt= (m−1)(2^m+ 1) + 1. Ifu≡n,tv, then Player II has a winning strategy in the gamesGm(u, v) andGm(v, u).

Proof. By symmetry, it suffices to prove the result for the game G_m(u, v).

The strategy of II is easier to understand if one thinks that the choice ofik

(respectivelyj_k) also determines the segment

Ik= [1, . . . ,|u|]∩[ik−2^m−k, ik+ 2^m−k]

(respectively Jk = [1, . . . ,|v|]∩[jk−2^m−k, jk+ 2^m−k]). The strategy of II consists to choose jk so that the following conditions are satisfied :

(a) u[I_k] =v[J_k],

(b) For every s < k, i_s ∈ I_k if and only if j_s ∈ J_k. In this case I_k is a subsegment ofIs and Jk is the corresponding subsegment of Js. We prove by induction onk thatj_kcan be choosen so that conditions (a) and (b) are satisfied. For k = 1, condition (b) is empty, and condition (a) can be fulfilled becauseuandvhave the same factors of lengthnby hypothesis.

Assume thatj_s has been choosen successfully for s < k. We now choose j_k as follows.

(18)

First, assume there exists s < k such that is ∈ Ik, and let us take the smallests satisfying this condition. Then

is−2^m−s ≤(ik+ 2^m−k)−2^m−s≤(ik+ 2^m−k)−2.2^m−k =ik−2^m−k and, symmetrically, i_k+ 2^m−k ≤i_s+ 2^m−s. Therefore I_k is a subsegment of Is and we can take for Jk the corresponding subsegment of Js. Since u[Is] = v[Js], we have u[Ik] = v[Jk]. Now if i⁰_s ∈ Ik for some s⁰ such that s < s⁰ < k, thend(i_s, i_s0)≤d(i_s, i_k)+d(i_k, i_s0)≤2.2^m−k≤2^m−s⁰. Therefore Is⁰ is a subsegment ofIs , as illustrated in the following diagram, andJs⁰ is the corresponding subsegment ofJs⁰.

Is

Ik

I_s0

is i_k i_s0

A similar argument would show that, if js⁰ ∈ Jk, then is⁰ ∈ Ik. Thus conditions (a) and (b) are satisfied in this case.

Now suppose that is ∈ I_k for every s < k. We claim that there exists at least one occurrence of the factor x =u[Ik] in v, defining a segment Jk

such thatv[Jk] =x and js ∈Jk for every s < k. Indeed, assume that every segment J_k such that v[J_k] = x satisfies js ∈ J_k for some s < k, that is, jk ∈[js−2^m−k, js+ 2^m−k]. Then one can bound the number of occurrences ofx inv as follows:

v x

≤ X

0<s<k

(2^m−k+1+ 1)) = (k−1)(2^m−k+1+ 1)< t Now, since |x| ≤ n, we have by hypothesis _u

x

≡ _v

x

threshold t, so that _u

x

= _v

x

. But each segment J inv such that v[J] =x and js ∈ J defines a segment I in u such that u[I] =x and is ∈I, since u[Is] = v[Js] by the induction hypothesis.

(19)

Is

I

Js

J u

v

Furthermore, this application is injective, that is, the situation represented in the figure below can never occur.

I_s I_s0

I =I⁰

Js Js⁰

J J⁰

u

v

Indeed, we have seen above that if, for instance, s < s⁰, I = I⁰ implies that I_s0 is a subsegment of I_s. Therefore, J_s0 must be a subsegment of J_s and thusJ =J⁰. It follows that each occurrence ofxinvthat is a factor of somev[Js] is in one-to-one correspondence with an occurrence ofxinuthat is a factor ofu[I_s]. In particular,_u

x

>_v

x

, a contradiction. This proves the claim, and conditions (a) and (b) can be satisfied.

Now, it is easy to verify that this choice ofj₁, . . . , jmis a winning strategy for player II.

We now conclude the proof of theorem 4.3. Assume that Lis defined by an existential sentence ϕof quantifier rank m. Ifu≡n,tv, then Lemma 4.4 and the theorem of Ehrenfeucht-Fra¨ıss´e shows thatu satisfiesϕif and only ifv satisfies ϕ. This means thatL(ϕ) is a union of ≡n,t-classes, and hence L(ϕ) is strongly locally threshold testable.

(20)

5 Remarks.

There are a few extensions that were not considered in order to keep this paper to a reasonable size. The first possibility would be to introduce modulo counting. If one considers modulo counting only, the notions of “periodi- cally locally testable language” and “modular scanner” can be easily defined, and the syntactic characterization follows from the works of Straubing and Th´erien (the condition would be thatS(L) is locally a commutative group).

One can also give a logical interpretation if one allows the “modular” quantifiers considered by Straubing, Th´erien and Thomas [12]. One can also consider simultaneously modulo and threshold counting. The corresponding variety of semigroups would beCom∗LI, for which an effective description has been given by Th´erien and Weiss [13] (a semigroup belongs toCom∗LI if and only if the graph associated with the semigroup satisfies (C)). How- ever, no such decidability results is known for the corresponding “strong”

notions. In conclusion, if one removes the conditions on prefixes and suffixes, nothing is known, except in the boolean case.

The second possible extension is to consider infinite words, and this will be the subject of a future paper.

References

[1] D. Beauquier and J.E. Pin, 1989, Factors of words, inAutomata, Lan- guages and Programming, (G. Ausiello, M. Dezani-Ciancaglini and S. Ronchi Della Rocca, eds.), Lecture Notes in Comput. Sci., 372, Springer, 63–79.

[2] J.A. Brzozowski and I. Simon, 1973, Characterizations of locally testable languages, Discrete Math. 4, 243-271.

[3] J.R. B¨uchi, 1962, On a decision method in restricted second-order arith- metic, inProc. 1960 Int. Congr. for Logic, Methodology and Philosophy of Science, Stanford Univ. Press, Standford, 1–11.

[4] S. Eilenberg, 1974, Automata, languages and machines, Vol. A, Aca- demic Press, New York.

[5] S. Eilenberg, 1976, Automata, languages and machines, Vol. B, Aca- demic Press, New York.

[6] G. Lallement, 1979,Semigroups and combinatorial applications, Wiley, New York.

[7] R. McNaughton, 1974, Algebraic decision procedures for local testabil- ity, Math. Syst. Theor. 8, 60-76.

(21)

[8] D. Perrin and J.E. Pin, First order logic and star-free sets,J. Comput.

System Sci. 32, 1986, 393–406.

[9] J.-E. Pin, 1984 Vari´et´es de langages formels, Masson, Paris; English translation: 1986, Varieties of formal languages, Plenum, New-York.

[10] J.G. Rosenstein, 1982,Linear orderings, Academic Press, New York.

[11] H. Straubing, Finite semigroup varieties of the form V ∗D, J. Pure Applied Algebra 36(1985) 53–94.

[12] H. Straubing, D. Th´erien and W. Thomas, 1988, Regular Languages Defined with Generalized Quantifiers, in Proc. 15th ICALP, Springer Lecture Notes in Computer Science 317, 561–575.

[13] D. Th´erien and A. Weiss, 1985, Graph congruences and wreath products,J. Pure Applied Algebra 35, 205–215.

[14] W. Thomas, 1982, Classifying regular events in symbolic logic,J. Com- put. Syst. Sci 25, 360–375.