Construction of tree automata from regular expressions


Academic year: 2022


DOI:10.1051/ita/2011107 www.rairo-ita.org

CONSTRUCTION OF TREE AUTOMATA FROM REGULAR EXPRESSIONS

Dietrich Kuske¹ and Ingmar Meinecke²

Abstract. Since recognizable tree languages are closed under the rational operations, every regular tree expression denotes a recognizable tree language. We provide an alternative proof of this fact that results in smaller tree automata. To this aim, we transfer Antimirov's partial derivatives from regular word expressions to regular tree expressions. For an analysis of the size of the resulting automaton as well as for algorithmic improvements, we also transfer the methods of Champarnaud and Ziadi from words to trees.

Mathematics Subject Classification. 68Q45.

Keywords and phrases. Trees, automata, regular expressions, partial derivatives.

I. Meinecke was supported by the German Research Foundation (DFG) within the project DR 202/10-1.

¹ Fachgebiet Theoretische Informatik, Technische Universität Ilmenau, Postfach 100565, 98684 Ilmenau, Germany. dietrich.kuske@tu-ilmenau.de

² Institut für Informatik, Universität Leipzig, PF 100920, 04009 Leipzig, Germany. meinecke@informatik.uni-leipzig.de

Article published by EDP Sciences. © EDP Sciences 2011

Introduction

One of the most prominent topics in formal language theory is the comparison of different finite descriptions for potentially infinite objects, the languages. The result of Kleene [14] states the equivalence between finite automata and regular expressions for languages of finite words. The transformation of a finite automaton into an equivalent regular expression is a prototypical example of dynamic programming. The converse transformation is of direct practical consequence, e.g., in text processing. For this reason, several methods were proposed within the last decades to find more efficient algorithms, see [19,20] for surveys. For teaching purposes, one often uses an inductive construction. The most common construction is the standard or position automaton (Glushkov [10] and McNaughton and Yamada [17]). Brzozowski's construction [4] of a deterministic finite automaton uses derivatives of regular expressions. This approach was modified by Antimirov [2], who defined partial derivatives to construct a non-deterministic automaton from a regular expression.

Kleene's theorem was lifted to the setting of trees [22], also cf. [8,9], which are one of the most fundamental concepts in computer science. A regular tree expression defines a language of ordered trees. An inductive construction even produces a tree automaton accepting this language. The number of states of this automaton is exactly the number of iterations in the expression $E$ plus $|E|_\Sigma$, where $|E|_\Sigma$ is the number of occurrences of symbols from the ranked alphabet in $E$. In this paper, we define partial derivatives for regular tree expressions and, with their help, build a non-deterministic finite tree automaton recognizing the language denoted by the regular expression. The concept of partial derivatives will yield a tree automaton with at most $|E|_\Sigma$ states and $|E|_\Sigma^2$ transitions. The construction of this tree automaton and the correctness proof are combined with algorithmic considerations to build this automaton. We adapt and modify the approach by Champarnaud and Ziadi [5,6] in the word case, who extended work of Berry and Sethi [3]. Here, we use linearizations of regular tree expressions. The main idea is to distinguish occurrences of the same symbol at different positions in the regular expression. By doing so, we can ensure a certain uniqueness of the partial derivatives. As it turns out, the partial derivatives of the original regular expression are just projections of the partial derivatives of the linearized regular expression. This approach results in two main advantages. Firstly, the desired automaton is in fact a quotient of an automaton that stems from the linearized regular expression. This way we also get the upper bound on the number of transitions mentioned above. Secondly, the theoretical results allow for an efficient algorithm working in the syntax tree of $E$. We obtain an algorithm with $O(R \cdot \mathrm{size}(E)^2)$ space and time complexity, where $R$ is the maximal rank of a symbol occurring in the finite ranked alphabet $\Sigma$ and $\mathrm{size}(E)$ is the size of the regular expression.

Besides the standard and the partial derivative construction, there are other proposals in the literature on how to obtain an automaton from a regular expression. In particular, it would be interesting to know whether the construction of the follow automaton [7,12,13] carries over to the setting of trees. In this paper we consider ranked trees. However, regular expressions were also explored for unranked trees in connection with XML. They are used in pattern matching, see e.g. [11]. In an extended abstract [15] of this paper, we wondered whether the concept of partial derivatives can lead to fruitful results and algorithms in this area. A first answer was given by Suzuki and Okui [21], who successfully applied the concept of partial derivatives to regular hedge expression patterns. Last but not least, Lombardy and Sakarovitch [16] applied the method of partial derivatives to a weighted setting for words. We are confident that this should also be possible for trees.

1. Trees, automata, and regular expressions

Let $\mathbb{N}$ be the set of non-negative integers. Throughout this paper, we fix a finite ranked alphabet $\Sigma = (\Sigma_m)_{m \in \mathbb{N}}$. The set $T_\Sigma$ of trees over $\Sigma$ is defined by the Backus-Naur form (BNF)

$$t ::= f(\underbrace{t, \dots, t}_{m \text{ times}})$$

where $f \in \Sigma_m$. For the base case $c \in \Sigma_0$, we will write $c$ instead of $c()$. A subset $L \subseteq T_\Sigma$ is called a tree language.

A (top-down) tree automaton over $\Sigma$ is a tuple $A = (Q, \Sigma, I, \Delta)$ where $Q$ is a set of states, $I \subseteq Q$ is the set of initial states, and $\Delta = (\Delta_m)_{m \in \mathbb{N}}$ is the set of transitions¹ such that $\Delta_m \subseteq Q \times \Sigma_m \times Q^m$ for every $m \in \mathbb{N}$. Especially, $\Delta_0 \subseteq Q \times \Sigma_0$. A finite tree automaton (or FTA) is a tree automaton $A$ with only finitely many states and, thus, only finitely many transitions (note that there are only finitely many $m$ with $\Sigma_m \neq \emptyset$).

Whether a tree $t$ is accepted by a tree automaton $A = (Q, \Sigma, I, \Delta)$ is defined inductively along the construction of the tree $t$: if $t = c \in \Sigma_0$, then $t$ is accepted by $A$ iff there exists a state $q \in I$ with $(q, c) \in \Delta_0$. For $f \in \Sigma_m$ with $m > 0$, the tree $f(t_1, \dots, t_m)$ is accepted by $A$ iff there exist states $q \in I$ and $q_1, \dots, q_m \in Q$ such that $(q, f, q_1, \dots, q_m) \in \Delta_m$ and, for $1 \le i \le m$, the tree $t_i$ is accepted by the tree automaton $(Q, \Sigma, \{q_i\}, \Delta)$. The language $L(A)$ recognized by $A$ is the set of all trees $t$ that are accepted by $A$. A tree language $L$ is recognizable if there is an FTA $A$ with $L(A) = L$.

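We next introduce some constructions of tree languages that extend the rational operations on word languages. Let $f \in \Sigma_m$ and $L_1, \dots, L_m \subseteq T_\Sigma$. Then we put

$$f(L_1, \dots, L_m) = \{ f(t_1, \dots, t_m) \mid t_i \in L_i \text{ for } i = 1, \dots, m \}.$$

For $L \subseteq T_\Sigma$ and $c \in \Sigma_0$ we define for every $t \in T_\Sigma$ inductively the non-uniform substitution $t[c \leftarrow L]$:

$$c[c \leftarrow L] = L \quad\text{and}\quad d[c \leftarrow L] = \{d\} \text{ for every } d \in \Sigma_0 \text{ with } d \neq c;$$

$$f(t_1, \dots, t_m)[c \leftarrow L] = f(t_1[c \leftarrow L], \dots, t_m[c \leftarrow L]).$$

Then the $c$-product of $L_1, L_2 \subseteq T_\Sigma$ is the language $L_1 \cdot_c L_2 = \bigcup_{t \in L_1} t[c \leftarrow L_2]$. Now the iterated $c$-products are defined for $L \subseteq T_\Sigma$ by $L^{0,c} = \{c\}$ and $L^{n+1,c} = L^{n,c} \cup L \cdot_c L^{n,c}$. The $c$-iteration of $L$ is defined as $L^{*_c} = \bigcup_{n \ge 0} L^{n,c}$.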
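For finite languages, these operations can be computed directly. The sketch below is a minimal illustration under assumptions made here: trees are `(symbol, children)` pairs, languages are finite Python sets, and the function names `subst`, `c_product` and `iterate` are hypothetical. It shows in particular that the substitution is non-uniform: different occurrences of $c$ in one tree may be replaced by different trees.

```python
from itertools import product

def subst(t, c, L):
    """t[c <- L]: replace every c-leaf of t independently by some tree of L."""
    symbol, children = t
    if not children:
        return set(L) if symbol == c else {t}
    choices = [subst(k, c, L) for k in children]
    return {(symbol, combo) for combo in product(*choices)}

def c_product(L1, c, L2):
    """The c-product of L1 and L2: union of t[c <- L2] over all t in L1."""
    return set().union(*(subst(t, c, L2) for t in L1))

def iterate(L, c, n):
    """L^{n,c}, using L^{0,c} = {c} and L^{n+1,c} = L^{n,c} ∪ L ·_c L^{n,c}."""
    current = {(c, ())}
    for _ in range(n):
        current = current | c_product(L, c, current)
    return current

a, b = ("a", ()), ("b", ())
t = ("f", (a, a))
```

For instance, `subst(t, "a", {a, b})` contains the four trees f(a,a), f(a,b), f(b,a) and f(b,b), since the two leaves are substituted independently.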

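¹ In the term-rewriting terminology employed by [8], the transition $(q, f, q_1, \dots, q_m)$ is denoted by the rule $q(f(x_1, \dots, x_m)) \to f(q_1(x_1), \dots, q_m(x_m))$.

It is well-known that a tree language $L$ is recognizable if and only if it can be denoted by a regular expression. These regular expressions are defined by the following grammar in BNF²:

$$E ::= \emptyset \mid f(\underbrace{E, \dots, E}_{m \text{ times}}) \mid (E + E) \mid (E \cdot_c E) \mid (E^{*_c})$$

where $f \in \Sigma_m$ and $c \in \Sigma_0$. Again, we write $c$ instead of $c()$ whenever $c \in \Sigma_0$. The semantics $\llbracket E \rrbracket$ of a regular expression $E$ is defined inductively by

$$\llbracket \emptyset \rrbracket = \emptyset, \qquad \llbracket f(E_1, \dots, E_m) \rrbracket = f(\llbracket E_1 \rrbracket, \dots, \llbracket E_m \rrbracket),$$

$$\llbracket (E + F) \rrbracket = \llbracket E \rrbracket \cup \llbracket F \rrbracket, \qquad \llbracket (E \cdot_c F) \rrbracket = \llbracket E \rrbracket \cdot_c \llbracket F \rrbracket, \quad\text{and}$$

$$\llbracket (E^{*_c}) \rrbracket = \llbracket E \rrbracket^{*_c}.$$

For a set $M$ of regular expressions, we put $\llbracket M \rrbracket = \bigcup_{E \in M} \llbracket E \rrbracket$.

The set of all regular expressions over the ranked alphabet $\Sigma$ is denoted by $\mathrm{EXP}(\Sigma)$. Let $|E|_f$ denote the number of occurrences of the letter $f \in \Sigma$ in $E$. The alphabetic width $|E|_\Sigma$ of $E$ is defined by $|E|_\Sigma = \max\bigl(1, \sum_{f \in \Sigma} |E|_f\bigr)$, i.e., $|E|_\Sigma$ is the number of occurrences of symbols from $\Sigma$ in $E$ but, for technical reasons, at least 1. Thus, $|\emptyset|_\Sigma = 1$. The size of $E$ is defined inductively by:

$$\mathrm{size}(\emptyset) = \mathrm{size}(c) = 1 \text{ for } c \in \Sigma_0, \qquad \mathrm{size}(f(E_1, \dots, E_m)) = 1 + \sum_{i=1}^{m} \mathrm{size}(E_i),$$

$$\mathrm{size}\bigl((E + F)\bigr) = \mathrm{size}\bigl((E \cdot_c F)\bigr) = 1 + \mathrm{size}(E) + \mathrm{size}(F), \quad\text{and}\quad \mathrm{size}\bigl((E^{*_c})\bigr) = 1 + \mathrm{size}(E).$$

Every regular expression $E$ can be understood as a tree over the ranked alphabet $\Sigma \cup \{+, \cdot_c, {}^{*_c}, \emptyset \mid c \in \Sigma_0\}$ where $+$ and $\cdot_c$ have rank 2, ${}^{*_c}$ has rank 1, and $\emptyset$ has rank 0. This tree is called the syntax tree $t_E$ of $E$.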
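The two measures can be read off the syntax tree by a simple recursion. The following Python sketch assumes a tuple encoding of expressions chosen for this illustration only: `("empty",)`, `("sym", f, children)`, `("plus", E, F)`, `("prod", c, E, F)` and `("star", c, E)`. The function names are likewise hypothetical. Note that the subscript $c$ of $\cdot_c$ and ${}^{*_c}$ is not counted as an occurrence of $c$.

```python
def occurrences(e):
    """Number of occurrences of alphabet symbols in the expression e."""
    tag = e[0]
    if tag == "empty":
        return 0
    if tag == "sym":
        return 1 + sum(occurrences(k) for k in e[2])
    if tag == "plus":
        return occurrences(e[1]) + occurrences(e[2])
    if tag == "prod":
        return occurrences(e[2]) + occurrences(e[3])
    return occurrences(e[2])  # star

def alphabetic_width(e):
    """|E|_Sigma: number of symbol occurrences, but at least 1."""
    return max(1, occurrences(e))

def size(e):
    """size(E): number of nodes of the syntax tree of e."""
    tag = e[0]
    if tag == "empty":
        return 1
    if tag == "sym":
        return 1 + sum(size(k) for k in e[2])
    if tag == "plus":
        return 1 + size(e[1]) + size(e[2])
    if tag == "prod":
        return 1 + size(e[2]) + size(e[3])
    return 1 + size(e[2])  # star

# Illustrative expression (f(a, b)^{*_a} ·_b (a + ∅)), not taken from the paper.
a, b = ("sym", "a", ()), ("sym", "b", ())
example = ("prod", "b",
           ("star", "a", ("sym", "f", (a, b))),
           ("plus", a, ("empty",)))
```

For this illustrative expression the symbols f, a, b, a occur, so the alphabetic width is 4, and the syntax tree has 8 nodes.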

2. A direct construction

In this section, we will construct from a regular expression $E$ a tree automaton $A_E$ that accepts $\llbracket E \rrbracket$. The finiteness of this automaton will only be proved later. Our construction is based on partial derivatives that we have to define and investigate first.

Let $M$ be a set of regular expressions, $F$ some regular expression, and $c \in \Sigma_0$. Then $M \cdot_c F$ denotes the set $\{(E \cdot_c F) \mid E \in M\}$. Similarly, we put for a set $\mathcal{M}$ of $m$-tuples of regular expressions

$$\mathcal{M} \cdot_c F = \bigl\{ \bigl((E_1 \cdot_c F), (E_2 \cdot_c F), \dots, (E_m \cdot_c F)\bigr) \mid (E_1, E_2, \dots, E_m) \in \mathcal{M} \bigr\}.$$

Definition 2.1. For $g \in \Sigma_m$, $m \ge 1$, and a regular expression $E$, we define the sets $g^{-1}E$ of $m$-tuples of regular expressions inductively:

$$g^{-1}\emptyset = \emptyset,$$

$$g^{-1} f(E_1, E_2, \dots, E_n) = \begin{cases} \{(E_1, E_2, \dots, E_n)\} & \text{if } f = g, \\ \emptyset & \text{if } f \neq g, \end{cases}$$

$$g^{-1}(E + F) = g^{-1}E \cup g^{-1}F,$$

$$g^{-1}(E \cdot_c F) = \begin{cases} (g^{-1}E) \cdot_c F \cup g^{-1}F & \text{if } c \in \llbracket E \rrbracket, \\ (g^{-1}E) \cdot_c F & \text{otherwise,} \end{cases}$$

$$g^{-1}(E^{*_c}) = (g^{-1}E) \cdot_c (E^{*_c}).$$

² For constructing an equivalent regular expression for a given tree automaton, it is necessary to introduce additional symbols of rank zero, cf. [9], Proposition 9.2. But since we are interested only in the converse, i.e., the construction of automata from an expression, we do not have to differentiate between nullary symbols from the alphabet and additional ones.

Let $\Sigma_{\ge 1} = \bigcup_{m \ge 1} \Sigma_m = \Sigma \setminus \Sigma_0$ denote the set of non-nullary symbols from the ranked alphabet $\Sigma$. Following Antimirov, we define further functions $\partial_w$ for finite words $w \in \Sigma_{\ge 1}^*$ over the alphabet $\Sigma_{\ge 1}$. By $\varepsilon$ we denote the empty word.

Definition 2.2. Let $E$ be a regular expression. Then $\partial_\varepsilon E = \{E\}$ and, for $w \in \Sigma_{\ge 1}^*$ and $g \in \Sigma_{\ge 1}$, the set $\partial_{wg} E$ consists of all regular expressions $F$ that appear in some tuple from $g^{-1}E'$ for some $E' \in \partial_w E$. For a set of words $W \subseteq \Sigma_{\ge 1}^*$ and a regular expression $E$, we put $\partial_W E = \bigcup_{w \in W} \partial_w E$.

The function $\partial_w$ is called the partial derivative with respect to $w$.

Note that $\partial_{wg} E = \partial_g \partial_w E = \bigcup_{E' \in \partial_w E} \partial_g E'$ for all $w \in \Sigma_{\ge 1}^*$ and $g \in \Sigma_{\ge 1}$. Further note that we consider derivatives with respect to words over the non-nullary symbols from $\Sigma$ and not with regard to trees.

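Example 2.3. Let $\Sigma_0 = \{a, b\}$, $\Sigma_1 = \{g, h\}$, and $\Sigma_2 = \{f\}$. Consider the regular tree expression

$$E = \Bigl( \underbrace{\bigl(f(g(h(a)), g(b))^{*_a}\bigr)}_{E_1} \cdot_b \underbrace{\bigl(h(a) + h(b)\bigr)}_{E_2} \Bigr).$$

Note that $b \notin \llbracket E_1 \rrbracket$. Hence $\partial_g E = \partial_h E = \emptyset$ and

$$\partial_f E = \bigl\{ \bigl((g(h(a)) \cdot_a E_1) \cdot_b E_2\bigr),\ \bigl((g(b) \cdot_a E_1) \cdot_b E_2\bigr) \bigr\}.$$

Furthermore, we compute the following partial derivatives:

$$\partial_{fg} E = \bigl\{ \bigl((h(a) \cdot_a E_1) \cdot_b E_2\bigr),\ \bigl((b \cdot_a E_1) \cdot_b E_2\bigr) \bigr\},$$

$$\partial_{fgh} E = \bigl\{ \bigl((a \cdot_a E_1) \cdot_b E_2\bigr),\ a,\ b \bigr\};$$

where the last equality is due to $\partial_h\bigl((b \cdot_a E_1) \cdot_b E_2\bigr) = \partial_h E_2 = \{a, b\}$. Notice that $\partial_{fghf} E = \partial_f E$. Hence, together with $\partial_\varepsilon E = \{E\}$ we obtain eight different partial derivatives of $E$.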
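Definitions 2.1 and 2.2 translate almost literally into code, and Example 2.3 can then be recomputed mechanically. The sketch below is an illustration under stated assumptions: expressions are encoded as tuples `("empty",)`, `("sym", f, children)`, `("plus", E, F)`, `("prod", c, E, F)`, `("star", c, E)`, and all function names are hypothetical. The helper `consts`, which decides the side condition $c \in \llbracket E \rrbracket$ by computing $\llbracket E \rrbracket \cap \Sigma_0$, is derived here from the semantics and is not spelled out in the paper. The exploration in `all_derivatives` terminates only because the set of partial derivatives is finite, which the paper proves later.

```python
def consts(e):
    """Nullary symbols in the language of e (assumed helper for 'c in [[E]]')."""
    tag = e[0]
    if tag == "empty":
        return frozenset()
    if tag == "sym":
        return frozenset() if e[2] else frozenset({e[1]})
    if tag == "plus":
        return consts(e[1]) | consts(e[2])
    if tag == "prod":
        _, c, left, right = e
        cl = consts(left)
        return (cl - {c}) | (consts(right) if c in cl else frozenset())
    return consts(e[2]) | {e[1]}  # star: L^{0,c} = {c}

def deriv(g, e):
    """g^{-1} e as a set of tuples of expressions (Definition 2.1)."""
    tag = e[0]
    if tag == "empty":
        return set()
    if tag == "sym":
        return {e[2]} if (e[1] == g and e[2]) else set()
    if tag == "plus":
        return deriv(g, e[1]) | deriv(g, e[2])
    if tag == "prod":
        _, c, left, right = e
        out = {tuple(("prod", c, x, right) for x in tup) for tup in deriv(g, left)}
        return (out | deriv(g, right)) if c in consts(left) else out
    _, c, body = e  # star
    return {tuple(("prod", c, x, e) for x in tup) for tup in deriv(g, body)}

def pd(word, e):
    """Partial derivative of e with respect to a word (Definition 2.2)."""
    current = {e}
    for g in word:
        current = {x for s in current for tup in deriv(g, s) for x in tup}
    return current

def all_derivatives(e, symbols):
    """All partial derivatives of e over words of non-nullary symbols."""
    seen, todo = {e}, [e]
    while todo:
        s = todo.pop()
        for g in symbols:
            for tup in deriv(g, s):
                for x in tup:
                    if x not in seen:
                        seen.add(x)
                        todo.append(x)
    return seen

# The expression E = (E1 ·_b E2) of Example 2.3.
a, b = ("sym", "a", ()), ("sym", "b", ())
def unary(f, x):
    return ("sym", f, (x,))
E1 = ("star", "a", ("sym", "f", (unary("g", unary("h", a)), unary("g", b))))
E2 = ("plus", unary("h", a), unary("h", b))
E = ("prod", "b", E1, E2)
```

Under this encoding, `pd("f", E)`, `pd("fg", E)` and `pd("fgh", E)` reproduce the sets listed in the example, and `all_derivatives(E, "fgh")` collects the partial derivatives of $E$ over all words.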

A symbol $f \in \Sigma$ occurs unguarded in $E$ if no ancestor in the syntax tree $t_E$ is labeled by an element of $\Sigma$. We will be interested in the number $\|E\|_f$ of unguarded occurrences of $f$ in $E$ that can be defined inductively by:

• $\|\emptyset\|_f = 0$,

• $\|f(E_1, \dots, E_m)\|_f = 1$ and $\|g(E_1, \dots, E_n)\|_f = 0$ for $g \neq f$,

• $\|(E_1 + E_2)\|_f = \|(E_1 \cdot_c E_2)\|_f = \|E_1\|_f + \|E_2\|_f$, and

• $\|(E^{*_c})\|_f = \|E\|_f$.
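The inductive definition of unguarded occurrences can be sketched as a four-case recursion. As before, the tuple encoding of expressions and the name `unguarded` are assumptions of this illustration; the sample expression is the one from Example 2.3.

```python
def unguarded(e, f):
    """Number of unguarded occurrences of the symbol f in the expression e."""
    tag = e[0]
    if tag == "empty":
        return 0
    if tag == "sym":
        # Everything below an alphabet symbol is guarded by that symbol.
        return 1 if e[1] == f else 0
    if tag == "plus":
        return unguarded(e[1], f) + unguarded(e[2], f)
    if tag == "prod":
        return unguarded(e[2], f) + unguarded(e[3], f)
    return unguarded(e[2], f)  # star

# The expression E = (E1 ·_b E2) of Example 2.3 in the assumed encoding.
a, b = ("sym", "a", ()), ("sym", "b", ())
def unary(f, x):
    return ("sym", f, (x,))
E1 = ("star", "a", ("sym", "f", (unary("g", unary("h", a)), unary("g", b))))
E2 = ("plus", unary("h", a), unary("h", b))
E = ("prod", "b", E1, E2)
```

Here `f` occurs unguarded once, `h` twice (both inside $E_2$), and `g` not at all, since both occurrences of `g` lie below `f`.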

Proposition 2.4. Let $E$ be a regular expression and $g \in \Sigma_{\ge 1}$. Then $|g^{-1}E| \le \|E\|_g$. Especially, if $|E|_g = 0$ then $g^{-1}E = \partial_g E = \emptyset$.

Proof. The claim is shown by induction on the construction of $E$: for $E = \emptyset$, the claim is trivial. If we have $E = f(E_1, \dots, E_m)$ and $f \neq g$, then $|g^{-1}E| = 0$, so the claim is also trivial. If $f = g$, then $|g^{-1}E| = 1 = \|E\|_g$.

Next consider the case $(E + F)$: then

$$|g^{-1}(E + F)| \le |g^{-1}E| + |g^{-1}F| \le \|E\|_g + \|F\|_g = \|(E + F)\|_g.$$

For the product, we have

$$|g^{-1}(E \cdot_c F)| \le |(g^{-1}E) \cdot_c F| + |g^{-1}F| \le |g^{-1}E| + |g^{-1}F| \le \|E\|_g + \|F\|_g = \|(E \cdot_c F)\|_g.$$

Similarly, we obtain for the iteration

$$|g^{-1}(E^{*_c})| = |(g^{-1}E) \cdot_c (E^{*_c})| = |g^{-1}E| \le \|E\|_g = \|(E^{*_c})\|_g. \qquad \Box$$

Next, we express the semantics of a regular expression $E$ in terms of the semantics of the tuples from $g^{-1}E$.

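Proposition 2.5. For any regular expression $E$, we have

$$\llbracket E \rrbracket = \bigcup \{ g(\llbracket G_1 \rrbracket, \dots, \llbracket G_m \rrbracket) \mid g \in \Sigma_{\ge 1},\ (G_1, \dots, G_m) \in g^{-1}E \} \cup \{ c \in \Sigma_0 \mid c \in \llbracket E \rrbracket \}. \tag{1}$$

Proof. Let $E^0 = \llbracket E \rrbracket \cap \Sigma_0$. The proof proceeds by induction on the construction of the regular expression $E$. For $E = \emptyset$, equation (1) holds. Next let $E = f(E_1, \dots, E_m)$ where $f \in \Sigma_m$ and $E_1, \dots, E_m$ are regular expressions. We put $\vec{G} = (G_1, \dots, G_m)$. Then

$$\begin{aligned} \llbracket E \rrbracket &= f(\llbracket E_1 \rrbracket, \dots, \llbracket E_m \rrbracket) \\ &= \bigcup \{ f(\llbracket G_1 \rrbracket, \dots, \llbracket G_m \rrbracket) \mid \vec{G} \in f^{-1}E \} && (\text{since } f^{-1}E = \{(E_1, \dots, E_m)\}) \\ &= \bigcup \{ g(\llbracket G_1 \rrbracket, \dots, \llbracket G_m \rrbracket) \mid g \in \Sigma,\ \vec{G} \in g^{-1}E \} && (\text{since } g^{-1}E = \emptyset \text{ for } f \neq g). \end{aligned}$$

The proof in case $E = (E_1 + E_2)$ is immediate and therefore omitted. Next let $E = (E_1 \cdot_c E_2)$. Then we have

$$\llbracket E \rrbracket = \llbracket E_1 \rrbracket \cdot_c \llbracket E_2 \rrbracket = \bigl( (\llbracket E_1 \rrbracket \setminus \{c\}) \cdot_c \llbracket E_2 \rrbracket \bigr) \cup \bigl( (\llbracket E_1 \rrbracket \cap \{c\}) \cdot_c \llbracket E_2 \rrbracket \bigr).$$

By the induction hypothesis, the first of these two sets equals

$$\begin{aligned} & \Bigl( \bigcup \{ f(\llbracket G_1 \rrbracket, \dots, \llbracket G_m \rrbracket) \mid f \in \Sigma,\ \vec{G} \in f^{-1}E_1 \} \cdot_c \llbracket E_2 \rrbracket \Bigr) \cup (E_1^0 \setminus \{c\}) \\ ={}& \bigcup \{ f(\llbracket (G_1 \cdot_c E_2) \rrbracket, \dots, \llbracket (G_m \cdot_c E_2) \rrbracket) \mid f \in \Sigma,\ \vec{G} \in f^{-1}E_1 \} \cup (E_1^0 \setminus \{c\}) \\ ={}& \bigcup \{ f(\llbracket H_1 \rrbracket, \dots, \llbracket H_m \rrbracket) \mid f \in \Sigma,\ \vec{H} \in (f^{-1}E_1) \cdot_c E_2 \} \cup (E_1^0 \setminus \{c\}). \end{aligned}$$

If $c \in \llbracket E_1 \rrbracket$, then the second of the two sets above equals

$$\llbracket E_2 \rrbracket = \bigcup \{ f(\llbracket H_1 \rrbracket, \dots, \llbracket H_m \rrbracket) \mid f \in \Sigma,\ \vec{H} \in f^{-1}E_2 \} \cup E_2^0,$$

otherwise it is empty. Hence equation (1) holds for $E = (E_1 \cdot_c E_2)$.

Finally, consider the regular expression $(E^{*_c})$. If $f(t_1, \dots, t_m) \in \llbracket (E^{*_c}) \rrbracket = \llbracket E \rrbracket^{*_c}$, then there exists $n \ge 0$ with $f(t_1, \dots, t_m) \in \llbracket E \rrbracket^{n+1,c} \setminus \llbracket E \rrbracket^{n,c}$. Hence there exists $s \in \llbracket E \rrbracket$ with $f(t_1, \dots, t_m) \in \{s\} \cdot_c \llbracket E \rrbracket^{n,c}$. Since $f(t_1, \dots, t_m) \notin \llbracket E \rrbracket^{n,c}$, the tree $s$ is of the form $s = f(s_1, \dots, s_m)$. By the induction hypothesis, we find $(G_1, \dots, G_m) \in f^{-1}E$ with $s_i \in \llbracket G_i \rrbracket$. Hence we obtain

$$\begin{aligned} f(t_1, \dots, t_m) &\in \{s\} \cdot_c \llbracket (E^{*_c}) \rrbracket \\ &\subseteq f(\llbracket G_1 \rrbracket, \dots, \llbracket G_m \rrbracket) \cdot_c \llbracket (E^{*_c}) \rrbracket \\ &= f\bigl(\llbracket (G_1 \cdot_c (E^{*_c})) \rrbracket, \dots, \llbracket (G_m \cdot_c (E^{*_c})) \rrbracket\bigr). \end{aligned}$$

Since the tuple $\bigl((G_i \cdot_c (E^{*_c}))\bigr)_{1 \le i \le m} = (G_i)_{1 \le i \le m} \cdot_c (E^{*_c})$ belongs to $f^{-1}(E^{*_c})$, we showed the containment "$\subseteq$" of equation (1).

Conversely let $f \in \Sigma$ and $\vec{H} \in f^{-1}(E^{*_c}) = (f^{-1}E) \cdot_c (E^{*_c})$. Then there exists a tuple of regular expressions $\vec{G} \in f^{-1}E$ with $\vec{H} = \vec{G} \cdot_c (E^{*_c})$. Hence we get

$$\begin{aligned} f(\llbracket H_1 \rrbracket, \dots, \llbracket H_m \rrbracket) &= f\bigl(\llbracket (G_1 \cdot_c (E^{*_c})) \rrbracket, \dots, \llbracket (G_m \cdot_c (E^{*_c})) \rrbracket\bigr) \\ &= f(\llbracket G_1 \rrbracket, \dots, \llbracket G_m \rrbracket) \cdot_c \llbracket (E^{*_c}) \rrbracket. \end{aligned}$$

By the induction hypothesis, $f(\llbracket G_1 \rrbracket, \dots, \llbracket G_m \rrbracket) \subseteq \llbracket E \rrbracket$, so we can continue

$$\subseteq \llbracket E \rrbracket \cdot_c \llbracket (E^{*_c}) \rrbracket \subseteq \llbracket (E^{*_c}) \rrbracket. \qquad \Box$$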

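Let $E$ be a regular expression and let $Q_E = \partial_{\Sigma_{\ge 1}^*} E$. Then we define a set of transitions $\Delta_E$ as

$$\bigl\{ (F, f, G_1, G_2, \dots, G_m) \mid F \in Q_E,\ f \in \Sigma_m,\ m \ge 1,\ (G_1, \dots, G_m) \in f^{-1}F \bigr\} \cup \bigl\{ (F, c) \mid F \in Q_E,\ c \in \Sigma_0,\ c \in \llbracket F \rrbracket \bigr\}.$$

Furthermore, let $A_E = (Q_E, \Sigma, \{E\}, \Delta_E)$ denote the tree automaton whose only initial state is the regular expression $E$.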

Theorem 2.6. Let $E$ be a regular expression over the ranked alphabet $\Sigma$. Then $A_E$ is a tree automaton that accepts $\llbracket E \rrbracket$.

Proof. We show by induction on the structure of trees that for any tree $t \in T_\Sigma$ and any regular expression $F$, the tree automaton $A_F$ accepts $t$ iff $t \in \llbracket F \rrbracket$.

First let $t = c \in \Sigma_0$. Now $c$ is accepted by $A_F$ iff there is a transition $(F, c) \in \Delta_F$. But this is the case iff $c \in \llbracket F \rrbracket$. Now let $t = f(s_1, \dots, s_m)$ for some $m > 0$. Then $t$ is accepted by $A_F$ iff there is a transition $(F, f, (G_1, \dots, G_m)) \in \Delta_F$ such that $s_i$ is accepted by the tree automaton $(Q_F, \Sigma, \{G_i\}, \Delta_F)$ for all $1 \le i \le m$. Note that the reachable part of the automaton $(Q_F, \Sigma, \{G_i\}, \Delta_F)$ is the set of states $Q_{G_i}$. Hence, $s_i$ is accepted by this automaton iff it is accepted by $A_{G_i}$. By the induction hypothesis, this is equivalent to saying $s_i \in \llbracket G_i \rrbracket$. Since this holds for all $1 \le i \le m$, we have that $t$ is accepted by $A_F$ iff there exists $(G_1, \dots, G_m) \in f^{-1}(F)$ with $s_i \in \llbracket G_i \rrbracket$ which is, by Proposition 2.5, equivalent to saying $t \in \llbracket F \rrbracket$. $\Box$

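Example 2.7. Consider the regular expression

$$E = \Bigl( \underbrace{\bigl(f(g(h(a)), g(b))^{*_a}\bigr)}_{E_1} \cdot_b \underbrace{\bigl(h(a) + h(b)\bigr)}_{E_2} \Bigr)$$

from Example 2.3. There we computed the partial derivatives of $E$. Thus $Q_E = \{ q_i \mid i = 0, \dots, 7 \}$ where

$$\begin{aligned} q_0 &= E, & q_1 &= \bigl((g(h(a)) \cdot_a E_1) \cdot_b E_2\bigr), & q_2 &= \bigl((g(b) \cdot_a E_1) \cdot_b E_2\bigr), \\ q_3 &= \bigl((h(a) \cdot_a E_1) \cdot_b E_2\bigr), & q_4 &= \bigl((b \cdot_a E_1) \cdot_b E_2\bigr), & q_5 &= \bigl((a \cdot_a E_1) \cdot_b E_2\bigr), \\ q_6 &= a, & q_7 &= b. \end{aligned}$$

The set of transitions $\Delta_E$ comprises

$$q_0 \xrightarrow{f} (q_1, q_2), \quad q_1 \xrightarrow{g} q_3, \quad q_2 \xrightarrow{g} q_4, \quad q_3 \xrightarrow{h} q_5, \quad q_4 \xrightarrow{h} q_6, \quad q_4 \xrightarrow{h} q_7, \quad q_5 \xrightarrow{f} (q_1, q_2),$$

$$q_0 \xrightarrow{a} \bot, \quad q_5 \xrightarrow{a} \bot, \quad q_6 \xrightarrow{a} \bot, \quad q_7 \xrightarrow{b} \bot.$$

Here, $q_0 \xrightarrow{f} (q_1, q_2)$ and $q_0 \xrightarrow{a} \bot$ mean that $(q_0, f, q_1, q_2), (q_0, a) \in \Delta_E$.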
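The automaton of this example can be rebuilt by exploring the partial derivatives and recording a transition for every tuple in $f^{-1}F$. The sketch below is one possible rendering under assumptions made for this illustration: expressions are tuples `("empty",)`, `("sym", f, children)`, `("plus", E, F)`, `("prod", c, E, F)`, `("star", c, E)`, trees are `(symbol, children)` pairs, and `consts`, `deriv`, `build_automaton` and `accepts_from` are hypothetical names. The helper `consts` computes $\llbracket F \rrbracket \cap \Sigma_0$ to decide $c \in \llbracket F \rrbracket$, a rule derived here from the semantics rather than quoted from the paper.

```python
def consts(e):
    """Nullary symbols in the language of e (assumed helper)."""
    tag = e[0]
    if tag == "empty":
        return frozenset()
    if tag == "sym":
        return frozenset() if e[2] else frozenset({e[1]})
    if tag == "plus":
        return consts(e[1]) | consts(e[2])
    if tag == "prod":
        cl = consts(e[2])
        return (cl - {e[1]}) | (consts(e[3]) if e[1] in cl else frozenset())
    return consts(e[2]) | {e[1]}  # star

def deriv(g, e):
    """g^{-1} e as a set of tuples of expressions (Definition 2.1)."""
    tag = e[0]
    if tag == "empty":
        return set()
    if tag == "sym":
        return {e[2]} if (e[1] == g and e[2]) else set()
    if tag == "plus":
        return deriv(g, e[1]) | deriv(g, e[2])
    if tag == "prod":
        _, c, left, right = e
        out = {tuple(("prod", c, x, right) for x in tup) for tup in deriv(g, left)}
        return (out | deriv(g, right)) if c in consts(left) else out
    _, c, body = e  # star
    return {tuple(("prod", c, x, e) for x in tup) for tup in deriv(g, body)}

def build_automaton(e, nonnullary, nullary):
    """States Q_E and transitions Delta_E, explored from the initial state e."""
    states, todo, delta = {e}, [e], set()
    while todo:
        s = todo.pop()
        for c in nullary:
            if c in consts(s):
                delta.add((s, c, ()))
        for g in nonnullary:
            for tup in deriv(g, s):
                delta.add((s, g, tup))
                for x in tup:
                    if x not in states:
                        states.add(x)
                        todo.append(x)
    return states, delta

def accepts_from(delta, state, tree):
    """Top-down acceptance of a (symbol, children) tree from `state`."""
    symbol, children = tree
    return any(
        p == state and f == symbol and len(qs) == len(children)
        and all(accepts_from(delta, q, t) for q, t in zip(qs, children))
        for (p, f, qs) in delta
    )

# The expression of Examples 2.3 and 2.7.
a, b = ("sym", "a", ()), ("sym", "b", ())
def unary(f, x):
    return ("sym", f, (x,))
E1 = ("star", "a", ("sym", "f", (unary("g", unary("h", a)), unary("g", b))))
E = ("prod", "b", E1, ("plus", unary("h", a), unary("h", b)))
states, delta = build_automaton(E, "fgh", "ab")

# Two sample trees: f(g(h(a)), g(h(b))) and f(g(h(a)), g(b)).
ta, tb = ("a", ()), ("b", ())
inside = ("f", (("g", (("h", (ta,)),)), ("g", (("h", (tb,)),))))
outside = ("f", (("g", (("h", (ta,)),)), ("g", (tb,))))
```

In this sketch the exploration yields the eight states and eleven transitions listed above. The tree `inside` is accepted from `E`, whereas `outside` is not, because every `b` of $\llbracket E_1 \rrbracket$ has to be replaced by `h(a)` or `h(b)`.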

In the last example, the tree automaton resulting from our construction is finite. But so far, we did not prove in general that the tree automaton $A_E$ has only finitely many states, i.e., that $\llbracket E \rrbracket$ is recognizable. Theorem 3.16 will show that the number of states is linear and that the number of transitions is quadratic in the size of $E$. This will only be achieved after going through the following two constructions.

3. An indirect construction via linearizations

The idea of the indirect construction is as follows: in a regular expression $E$, uniquely mark the occurrences of letters from $\Sigma_{\ge 1}$. Then apply our direct construction to the resulting regular expression $\overline{E}$. The projection of this automaton accepts $\llbracket E \rrbracket$. As it turns out, a quotient of the automaton one obtains this way is isomorphic to the result of the direct construction.

3.1. Linear regular expressions

A regular expression $E$ is linear if every letter $f \in \Sigma_{\ge 1}$ occurs at most once in $E$. Note that $c \in \Sigma_0$ may occur more than once. The following proposition is a consequence of Proposition 2.4.

Proposition 3.1. Let $E$ be a linear regular expression and $g \in \Sigma_m$ for $m \ge 1$. Then $|g^{-1}E| \le 1$ and therefore $|\partial_g E| \le m$.

For $M \subseteq \mathrm{EXP}(\Sigma)$ and $g \in \Sigma_{\ge 1}$, we put $g^{-1}M = \bigcup \{ g^{-1}E \mid E \in M \}$. Now we consider partial derivatives with respect to non-empty words for linear regular expressions.

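Proposition 3.2. Let $E, F$ be linear regular expressions over the alphabet $\Sigma$ such that also $(E + F)$ and $(E \cdot_c F)$ are linear. Let $w \in \Sigma_{\ge 1}^*$ and $g \in \Sigma_{\ge 1}$. Then the following hold true:

$$g^{-1}\partial_w(E + F) = \begin{cases} g^{-1}\partial_w E & \text{if } |E|_g > 0, \\ g^{-1}\partial_w F & \text{otherwise.} \end{cases}$$

$$g^{-1}\partial_w(E \cdot_c F) = \begin{cases} (g^{-1}\partial_w E) \cdot_c F & \text{if } |E|_g > 0, \\ \bigcup \{ g^{-1}\partial_v F \mid \exists u \in \Sigma_{\ge 1}^* : w = uv \text{ and } c \in \llbracket \partial_u E \rrbracket \} & \text{otherwise.} \end{cases}$$

There are suffixes $v_1, \dots, v_k$ of $w$ such that

$$g^{-1}\partial_w(E^{*_c}) = \bigcup_{1 \le i \le k} (g^{-1}\partial_{v_i} E) \cdot_c (E^{*_c}).$$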

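Proof. If $|E|_g = 0$, then $|\partial_w E|_g = 0$, implying $g^{-1}\partial_w E = \emptyset$ by Proposition 2.4.

Now $g^{-1}\partial_w(E + F) = g^{-1}\partial_w E \cup g^{-1}\partial_w F$. If $|E|_g > 0$, then $g^{-1}\partial_w F = \emptyset$ and $g^{-1}\partial_w(E + F) = g^{-1}\partial_w E$. Otherwise we get $g^{-1}\partial_w E = \emptyset$ and $g^{-1}\partial_w(E + F) = g^{-1}\partial_w F$.

For the remaining claims we proceed by induction on $|w|$. The claims are obvious for $|w| = 0$. From now on let $w = w'f$ for some $w' \in \Sigma_{\ge 1}^*$ and $f \in \Sigma_{\ge 1}$. First, we consider $g^{-1}\partial_w(E \cdot_c F)$. By the induction hypothesis, we obtain

$$f^{-1}\partial_{w'}(E \cdot_c F) = \begin{cases} (f^{-1}\partial_{w'} E) \cdot_c F & \text{if } |E|_f > 0, \\ \bigcup \{ f^{-1}\partial_{v'} F \mid \exists u \in \Sigma_{\ge 1}^* : w' = uv' \ \&\ c \in \llbracket \partial_u E \rrbracket \} & \text{otherwise.} \end{cases}$$

First, consider the case $|E|_f > 0$ and $|E|_g > 0$. Then $|F|_f = |F|_g = 0$ and therefore $g^{-1}F = \emptyset$. Hence we have $g^{-1}\partial_w(E \cdot_c F) = (g^{-1}\partial_w E) \cdot_c F$.

Next, let $|E|_f > |E|_g = 0$. Then (i) $g^{-1}\partial_w E = \emptyset$ and (ii) $\partial_v F = \emptyset$ for all non-empty suffixes $v$ of $w$ (since $f$ occurs in $v$ but not in $F$). Since, by the induction hypothesis, $f^{-1}\partial_{w'}(E \cdot_c F) = (f^{-1}\partial_{w'} E) \cdot_c F$, we get $\partial_w(E \cdot_c F) = \partial_f \partial_{w'}(E \cdot_c F) = (\partial_f \partial_{w'} E) \cdot_c F = (\partial_w E) \cdot_c F$ and therefore

$$\begin{aligned} g^{-1}\partial_w(E \cdot_c F) &= g^{-1}\bigl((\partial_w E) \cdot_c F\bigr) \\ &= \begin{cases} (g^{-1}\partial_w E) \cdot_c F & \text{if } c \notin \llbracket \partial_w E \rrbracket \\ (g^{-1}\partial_w E) \cdot_c F \cup g^{-1}F & \text{otherwise} \end{cases} \\ &\overset{\text{(i)}}{=} \begin{cases} \emptyset & \text{if } c \notin \llbracket \partial_w E \rrbracket \\ g^{-1}F & \text{otherwise} \end{cases} \\ &\overset{\text{(ii)}}{=} \bigcup \{ g^{-1}\partial_v F \mid \exists u \in \Sigma_{\ge 1}^* : w = uv \ \&\ c \in \llbracket \partial_u E \rrbracket \} \end{aligned}$$

as required.

Now assume $|E|_f = 0$. Then the induction hypothesis implies $f^{-1}\partial_{w'}(E \cdot_c F) = \bigcup \{ f^{-1}\partial_{v'} F \mid \exists u \in \Sigma_{\ge 1}^* : w' = uv',\ c \in \llbracket \partial_u E \rrbracket \}$. Hence we obtain

$$\begin{aligned} \partial_w(E \cdot_c F) &= \partial_f \partial_{w'}(E \cdot_c F) \\ &= \bigcup \{ \partial_f \partial_{v'} F \mid \exists u \in \Sigma_{\ge 1}^* : w' = uv',\ c \in \llbracket \partial_u E \rrbracket \} \\ &= \bigcup \{ \partial_v F \mid \exists u \in \Sigma_{\ge 1}^* : w = uv,\ v \neq \varepsilon,\ c \in \llbracket \partial_u E \rrbracket \} \\ &= \bigcup \{ \partial_v F \mid \exists u \in \Sigma_{\ge 1}^* : w = uv,\ c \in \llbracket \partial_u E \rrbracket \} \end{aligned}$$

where the last equality holds since $\partial_w E = \emptyset$. Applying $g^{-1}$ to this equation yields

$$g^{-1}\partial_w(E \cdot_c F) = \bigcup \{ g^{-1}\partial_v F \mid \exists u : w = uv,\ c \in \llbracket \partial_u E \rrbracket \}.$$

If $|E|_g = 0$, this is precisely what we wanted to show. Otherwise, we obtain $|F|_g = 0$ and therefore $g^{-1}\partial_v F = \emptyset$ for all $v \in \Sigma_{\ge 1}^*$. Hence, in this case the last expression equals $\emptyset$. Since also $g^{-1}\partial_w E = \emptyset$ (due to the non-occurrence of $f$ in $E$), this equals $(g^{-1}\partial_w E) \cdot_c F$ as required. This shows the claim for $(E \cdot_c F)$.

Now consider the regular expression $(E^{*_c})$. By the induction hypothesis, there are suffixes $v'_1, \dots, v'_k$ of $w'$ such that

$$f^{-1}\partial_{w'}(E^{*_c}) = \bigcup_{1 \le i \le k} (f^{-1}\partial_{v'_i} E) \cdot_c (E^{*_c}) \quad\text{and}\quad \partial_w(E^{*_c}) = \bigcup_{1 \le i \le k} (\partial_{v'_i f} E) \cdot_c (E^{*_c}).$$

Hence, for $v_i = v'_i f$ ($1 \le i \le k$) we have

$$\begin{aligned} g^{-1}\partial_w(E^{*_c}) &= g^{-1} \bigcup_{1 \le i \le k} (\partial_{v_i} E) \cdot_c (E^{*_c}) \\ &= \begin{cases} \bigcup_{1 \le i \le k} (g^{-1}\partial_{v_i} E) \cdot_c (E^{*_c}) & \text{if } c \notin \llbracket \partial_{v_i} E \rrbracket \text{ for all } 1 \le i \le k, \\ \bigcup_{1 \le i \le k} (g^{-1}\partial_{v_i} E) \cdot_c (E^{*_c}) \cup g^{-1}(E^{*_c}) & \text{otherwise.} \end{cases} \end{aligned}$$

Since $g^{-1}(E^{*_c}) = (g^{-1}E) \cdot_c (E^{*_c}) = (g^{-1}\partial_\varepsilon E) \cdot_c (E^{*_c})$, the set of tuples of regular expressions $g^{-1}\partial_w(E^{*_c})$ is of the required form. $\Box$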

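Proposition 3.3. Let $E$ be a linear regular expression, $u, w \in \Sigma_{\ge 1}^*$, and $g \in \Sigma_{\ge 1}$. Then we have:

(1) $|g^{-1}\partial_u E| \le 1$,

(2) if $g^{-1}\partial_u E \neq \emptyset$ and $g^{-1}\partial_w E \neq \emptyset$, then $g^{-1}\partial_u E = g^{-1}\partial_w E$.

Proof. The proof is by induction on the structure of $E$.

For $E = \emptyset$ the claim is immediate. Now consider the case $E = f(E_1, \dots, E_n)$. Since $E$ is linear, there is at most one $i$ with $|E_i|_g > 0$; if no such $i$ exists, set $i = 1$. Then we have

$$g^{-1}\partial_u E = \begin{cases} g^{-1}\partial_{u'}\{E_1, \dots, E_n\} = g^{-1}\partial_{u'} E_i & \text{if } u = fu', \\ \{(E_1, \dots, E_n)\} & \text{if } u = \varepsilon \ \&\ f = g, \\ \emptyset & \text{otherwise,} \end{cases}$$

where the first case is due to $|E_j|_g = 0$ for $j \neq i$, and similarly for $g^{-1}\partial_w E$. By the induction hypothesis, we get immediately $|g^{-1}\partial_u E| \le 1$.

Assume $g^{-1}\partial_u E \neq \emptyset$ and $g^{-1}\partial_w E \neq \emptyset$. If $f = g$ is the first letter of $u = fu'$, then $\emptyset \neq g^{-1}\partial_u E = g^{-1}\partial_{u'} E_i = f^{-1}\partial_{u'} E_i = \emptyset$ since $E$ is linear, a contradiction. Hence, either $f \neq g$ for the first letter $f$ of $u$, or $f = g$ and $u$ is empty. Since the analogous statement holds for $w$, we obtain $u = \varepsilon$ iff $w = \varepsilon$. Now the claim follows immediately from the induction hypothesis.

For $E = (E_1 + E_2)$ the claims are immediate by Proposition 3.2 and the induction hypothesis.

Let $E = (E_1 \cdot_c E_2)$. If $|E_1|_g > 0$, then by Proposition 3.2 $g^{-1}\partial_u E = (g^{-1}\partial_u E_1) \cdot_c E_2$ as well as $g^{-1}\partial_w E = (g^{-1}\partial_w E_1) \cdot_c E_2$. Hence, $|g^{-1}\partial_u E| \le 1$ by the induction hypothesis. If $g^{-1}\partial_u E$ and $g^{-1}\partial_w E$ are non-empty, so are the sets $g^{-1}\partial_u E_1$ and $g^{-1}\partial_w E_1$. Hence, by the induction hypothesis, the claim follows. Suppose now $|E_1|_g = 0$. Then $g^{-1}\partial_u E$ is a finite union of sets of the form $g^{-1}\partial_{u'} E_2$ where every $u'$ is a suffix of $u$. The induction hypothesis implies that any two non-empty of them are equal, i.e., $g^{-1}\partial_u E = g^{-1}\partial_{u'} E_2$ for some $u'$. Similarly, $g^{-1}\partial_w E = g^{-1}\partial_{w'} E_2$ for some word $w'$. Now both claims follow from the induction hypothesis.

A similar argument can be applied in case $E = (F^{*_c})$ with $(g^{-1}\partial_{u'} F) \cdot_c (F^{*_c})$ in place of $g^{-1}\partial_{u'} E_1$. $\Box$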

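An immediate consequence of the last proposition is

Corollary 3.4. Let $E$ be a linear regular expression, $u, w \in \Sigma_{\ge 1}^*$ and $g \in \Sigma_m$ with $m \ge 1$. Then

(1) $|\partial_{ug} E| \le m$,

(2) if $\partial_{ug} E \neq \emptyset$ and $\partial_{wg} E \neq \emptyset$, then $\partial_{ug} E = \partial_{wg} E$.

By Proposition 3.2 and Corollary 3.4 we conclude

Corollary 3.5. For a linear regular expression $E$ and $w \in \Sigma_{\ge 1}^+$ we have $\partial_w(E^{*_c}) = (\partial_u E) \cdot_c (E^{*_c})$ for some non-empty suffix $u$ of $w$.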

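Next, we bound the number of partial derivatives of a linear regular expression.

Proposition 3.6. Let $E$ be a linear regular expression. Then we have $|\partial_{\Sigma_{\ge 1}^+} E| \le |E|_\Sigma - 1$ and $|\partial_{\Sigma_{\ge 1}^*} E| \le |E|_\Sigma$.

Proof. Note that $\partial_{\Sigma_{\ge 1}^*} E = \partial_{\Sigma_{\ge 1}^+} E \cup \{E\}$. We apply induction on $E$. Recall that $|\emptyset|_\Sigma = 1$. Then we have $|\partial_{\Sigma_{\ge 1}^+} \emptyset| = 0 = |\emptyset|_\Sigma - 1$. For $E = f(E_1, \dots, E_n)$, $g \in \Sigma_{\ge 1}$ and $u \in \Sigma_{\ge 1}^*$, we get

$$\partial_{gu} E = \begin{cases} \partial_u \{E_1, \dots, E_n\} & \text{if } g = f, \\ \emptyset & \text{if } g \neq f. \end{cases}$$

Hence, $|\partial_{\Sigma_{\ge 1}^+} E| \le \sum_{i=1}^{n} |\partial_{\Sigma_{\ge 1}^*} E_i| \le \sum_{i=1}^{n} |E_i|_\Sigma = |E|_\Sigma - 1$. For $E = (E_1 + E_2)$ we use Proposition 3.2 and the induction hypothesis and obtain the claim. If $E = (E_1 \cdot_c E_2)$, then again by Proposition 3.2: $|\partial_{\Sigma_{\ge 1}^+} E| \le |\partial_{\Sigma_{\ge 1}^+} E_1| + |\partial_{\Sigma_{\ge 1}^+} E_2| \le |E_1|_\Sigma - 1 + |E_2|_\Sigma - 1 \le |E|_\Sigma - 1$. Finally, we conclude by Corollary 3.5 that $|\partial_{\Sigma_{\ge 1}^+}(E^{*_c})| \le |\partial_{\Sigma_{\ge 1}^+} E| \le |E|_\Sigma - 1 = |(E^{*_c})|_\Sigma - 1$. $\Box$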

3.2. The projection construction

Recall that Theorem 2.6 provides a possibly infinite tree automaton $A_E$ that accepts $\llbracket E \rrbracket$. Assuming $E$ to be linear, we are now in the position to improve this result:

Corollary 3.7. Let $E$ be a linear regular expression over the ranked alphabet $\Sigma$. Then $A_E$ is a finite tree automaton with at most $|E|_\Sigma$ states and at most $|E|_\Sigma \cdot |\Sigma|$ transitions that accepts $\llbracket E \rrbracket$.

Proof. The equality $L(A_E) = \llbracket E \rrbracket$ was shown in Theorem 2.6. Since the set of states of $A_E$ equals $\partial_{\Sigma_{\ge 1}^*} E$, the finite tree automaton has at most $|E|_\Sigma$ states by Proposition 3.6. For $f \in \Sigma_{\ge 1}$ and $D \in Q_E$, there is at most one transition of the form $(D, f, (G_1, \dots, G_m))$ by Proposition 3.3(1), i.e., there are at most $|E|_\Sigma \cdot |\Sigma_{\ge 1}|$ transitions whose label belongs to $\Sigma_{\ge 1}$. In addition, there can be $|Q_E \times \Sigma_0| \le |E|_\Sigma \cdot |\Sigma_0|$ transitions of the form $(D, c)$ with $c \in \Sigma_0$. $\Box$

Remark 3.8. A top-down FTA $A = (Q, \Sigma, I, \Delta)$ is deterministic if $I$ is a singleton and $(q, f, q_1, \dots, q_m), (q, f, p_1, \dots, p_m) \in \Delta$ imply $q_i = p_i$ for all $i \in \{1, \dots, m\}$. Due to Proposition 3.3(1), we have even proved that for a linear regular expression $E$ the FTA $A_E$ is a deterministic top-down automaton, which implies the number of transitions given in the corollary. For arbitrary regular expressions $E$, the FTA $A_E$ is in general not deterministic.

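Let $\Gamma$ and $\Sigma$ be two alphabets with $\Gamma_0 \subseteq \Sigma_0$. A mapping $\eta : \Gamma \to \Sigma$ with $\eta(\Gamma_m) \subseteq \Sigma_m$ for every $m \in \mathbb{N}$ and $\eta(c) = c$ for all $c \in \Sigma_0$ is called a projection. We can extend $\eta$ naturally to $\eta : \mathrm{EXP}(\Gamma) \to \mathrm{EXP}(\Sigma)$ by:

$$\eta(\emptyset) = \emptyset, \qquad \eta(f(E_1, \dots, E_m)) = \eta(f)(\eta(E_1), \dots, \eta(E_m)),$$

$$\eta\bigl((E + F)\bigr) = (\eta(E) + \eta(F)), \qquad \eta\bigl((E \cdot_c F)\bigr) = (\eta(E) \cdot_c \eta(F)), \quad\text{and}\quad \eta\bigl((E^{*_c})\bigr) = (\eta(E)^{*_c}).$$

Definition 3.9. Let $E$ and $\overline{E}$ be regular expressions over the ranked alphabets $\Sigma$ and $\Gamma$, respectively, and let $\eta : \Gamma \to \Sigma$ be a projection. We say that $\overline{E}$ is a refinement of $E$ with respect to the projection $\eta$ if $\eta(\overline{E}) = E$.

$\overline{E}$ is called a linearization of $E$ with respect to $\eta$ if $\overline{E}$ is linear and a refinement of $E$ with respect to $\eta$.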

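Example 3.10. A linearization of the regular expression $E$ from Example 2.3 is

$$\overline{E} = \Bigl( \bigl(f_1(g_2(h_3(a)), g_4(b))^{*_a}\bigr) \cdot_b \bigl(h_5(a) + h_6(b)\bigr) \Bigr)$$

where $\Gamma = \{f_1, g_2, g_4, h_3, h_5, h_6, a, b\}$ and $\eta : \Gamma \to \Sigma$ is given by $\eta : f_1 \mapsto f$, $g_2, g_4 \mapsto g$, $h_3, h_5, h_6 \mapsto h$, $a \mapsto a$, $b \mapsto b$.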
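A linearization like the one in this example can be produced by numbering the occurrences of non-nullary symbols while traversing the syntax tree. The Python sketch below is one possible way to do this, not the paper's algorithm: the tuple encoding of expressions, the names `linearize` and `project`, and the naming scheme `f1`, `g2`, ... (which assumes that no such name already belongs to the alphabet) are assumptions of this illustration.

```python
from itertools import count

def linearize(e):
    """Return (linearized expression, projection eta as a dict)."""
    eta, fresh = {}, count(1)

    def go(x):
        tag = x[0]
        if tag == "empty":
            return x
        if tag == "sym":
            _, f, children = x
            if not children:          # constants are left unchanged
                eta[f] = f
                return x
            name = f"{f}{next(fresh)}"  # number occurrences in preorder
            eta[name] = f
            return ("sym", name, tuple(go(k) for k in children))
        if tag == "plus":
            return ("plus", go(x[1]), go(x[2]))
        if tag == "prod":
            return ("prod", x[1], go(x[2]), go(x[3]))
        return ("star", x[1], go(x[2]))

    return go(e), eta

def project(e, eta):
    """Extension of eta from symbols to expressions."""
    tag = e[0]
    if tag == "empty":
        return e
    if tag == "sym":
        return ("sym", eta[e[1]], tuple(project(k, eta) for k in e[2]))
    if tag == "plus":
        return ("plus", project(e[1], eta), project(e[2], eta))
    if tag == "prod":
        return ("prod", e[1], project(e[2], eta), project(e[3], eta))
    return ("star", e[1], project(e[2], eta))

# The expression E of Example 2.3 in the assumed encoding.
a, b = ("sym", "a", ()), ("sym", "b", ())
def unary(f, x):
    return ("sym", f, (x,))
E = ("prod", "b",
     ("star", "a", ("sym", "f", (unary("g", unary("h", a)), unary("g", b)))),
     ("plus", unary("h", a), unary("h", b)))
lin, eta = linearize(E)
```

Because symbols are numbered in preorder from left to right, `lin` uses exactly the names of the example, and `project(lin, eta)` gives back `E`, so `lin` is a refinement of `E` in the sense of Definition 3.9.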

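Note that due to $\eta(c) = c$ for $c \in \Sigma_0$, both the constants from $\Sigma_0$ and the operations $\cdot_c$ and ${}^{*_c}$ remain unchanged. By abuse of notation, we denote also the two natural continuations of $\eta$ to $\Gamma^*$ and to $T_\Gamma$ by $\eta$. The following lemma is easily shown:

Lemma 3.11. Let $E$ be a regular expression and $\overline{E}$ a refinement of $E$ with respect to $\eta : \Gamma \to \Sigma$. Then $\eta(\llbracket \overline{E} \rrbracket) = \llbracket E \rrbracket$.

Let $E$ be an arbitrary regular expression. Then one can construct a small finite tree automaton accepting $\llbracket E \rrbracket$ as follows. Firstly, construct some linearization $\overline{E}$ of $E$ with respect to $\eta : \Gamma \to \Sigma$ (we can assume that every symbol from $\Gamma$ appears in $\overline{E}$ and therefore $|\Gamma| \le |\overline{E}|_\Sigma = |E|_\Sigma$). Secondly, build the finite tree automaton $A_{\overline{E}}$, which then has at most $|\overline{E}|_\Sigma = |E|_\Sigma$ states and at most $|\overline{E}|_\Sigma \cdot |\Gamma| \le |E|_\Sigma^2$ transitions. Thirdly, replace the transitions $(\overline{F}, \overline{f}, (\overline{G}_1, \dots, \overline{G}_m))$ of this automaton by $(\overline{F}, \eta(\overline{f}), (\overline{G}_1, \dots, \overline{G}_m))$. Then, by Lemma 3.11, the following is immediate:

Corollary 3.12. Let $E$ be a regular expression. Then the tree automaton obtained by this construction is a finite tree automaton with at most $|E|_\Sigma$ states and at most $|E|_\Sigma^2$ transitions that accepts $\llbracket E \rrbracket$.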
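The three steps can be traced on the running example. The following Python sketch starts from the linearization of Example 3.10, builds its derivative automaton and then relabels the transitions with $\eta$. It is an illustration under assumptions made here only: the tuple encoding of expressions, trees as `(symbol, children)` pairs, the helper `consts` for deciding $c \in \llbracket F \rrbracket$ (derived from the semantics, not quoted from the paper), and all function names.

```python
def consts(e):
    """Nullary symbols in the language of e (assumed helper)."""
    tag = e[0]
    if tag == "empty":
        return frozenset()
    if tag == "sym":
        return frozenset() if e[2] else frozenset({e[1]})
    if tag == "plus":
        return consts(e[1]) | consts(e[2])
    if tag == "prod":
        cl = consts(e[2])
        return (cl - {e[1]}) | (consts(e[3]) if e[1] in cl else frozenset())
    return consts(e[2]) | {e[1]}  # star

def deriv(g, e):
    """g^{-1} e as a set of tuples of expressions (Definition 2.1)."""
    tag = e[0]
    if tag == "empty":
        return set()
    if tag == "sym":
        return {e[2]} if (e[1] == g and e[2]) else set()
    if tag == "plus":
        return deriv(g, e[1]) | deriv(g, e[2])
    if tag == "prod":
        _, c, left, right = e
        out = {tuple(("prod", c, x, right) for x in tup) for tup in deriv(g, left)}
        return (out | deriv(g, right)) if c in consts(left) else out
    _, c, body = e  # star
    return {tuple(("prod", c, x, e) for x in tup) for tup in deriv(g, body)}

def build_automaton(e, nonnullary, nullary):
    """States and transitions of the derivative automaton with initial state e."""
    states, todo, delta = {e}, [e], set()
    while todo:
        s = todo.pop()
        delta |= {(s, c, ()) for c in nullary if c in consts(s)}
        for g in nonnullary:
            for tup in deriv(g, s):
                delta.add((s, g, tup))
                for x in tup:
                    if x not in states:
                        states.add(x)
                        todo.append(x)
    return states, delta

def accepts_from(delta, state, tree):
    """Top-down acceptance of a (symbol, children) tree from `state`."""
    symbol, children = tree
    return any(
        p == state and f == symbol and len(qs) == len(children)
        and all(accepts_from(delta, q, t) for q, t in zip(qs, children))
        for (p, f, qs) in delta
    )

# Step 1: the linearization of Example 3.10 and its projection eta.
a, b = ("sym", "a", ()), ("sym", "b", ())
def unary(f, x):
    return ("sym", f, (x,))
lin = ("prod", "b",
       ("star", "a", ("sym", "f1", (unary("g2", unary("h3", a)), unary("g4", b)))),
       ("plus", unary("h5", a), unary("h6", b)))
eta = {"f1": "f", "g2": "g", "g4": "g", "h3": "h", "h5": "h", "h6": "h",
       "a": "a", "b": "b"}

# Step 2: the automaton of the linear expression.
states, delta_lin = build_automaton(lin, ["f1", "g2", "g4", "h3", "h5", "h6"], ["a", "b"])
deterministic = len({(q, f) for (q, f, _) in delta_lin}) == len(delta_lin)

# Step 3: relabel every transition with eta.
delta = {(q, eta[f], targets) for (q, f, targets) in delta_lin}

ta, tb = ("a", ()), ("b", ())
sample = ("f", (("g", (("h", (ta,)),)), ("g", (("h", (tb,)),))))
```

In this run the linear automaton has 8 states, which respects the bound $|E|_\Sigma = 10$, and it is deterministic in the sense of Remark 3.8, while the relabeled automaton is not: the state reached after `g4` has two `h`-transitions. The relabeled automaton accepts the sample tree f(g(h(a)), g(h(b))) from the state `lin`.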
