Construction of tree automata from regular expressions


DOI:10.1051/ita/2011107 www.rairo-ita.org


Dietrich Kuske¹ and Ingmar Meinecke²

Abstract. Since recognizable tree languages are closed under the rational operations, every regular tree expression denotes a recognizable tree language. We provide an alternative proof of this fact that results in smaller tree automata. To this end, we transfer Antimirov’s partial derivatives from regular word expressions to regular tree expressions.

For an analysis of the size of the resulting automaton as well as for algorithmic improvements, we also transfer the methods of Champarnaud and Ziadi from words to trees.

Mathematics Subject Classification. 68Q45.

Introduction

One of the most prominent topics in formal language theory is the comparison of different finite descriptions for potentially infinite objects, the languages. The result of Kleene [14] states the equivalence between finite automata and regular expressions for languages of finite words. The transformation of a finite automaton into an equivalent regular expression is a prototypical example of dynamic programming. The converse transformation is of direct practical consequence, e.g., in text processing. For this reason, several methods were proposed within the last decades to find more efficient algorithms, see [19,20] for surveys. For teaching purposes, one often uses an inductive construction. The most common construction is the standard or position automaton (Glushkov [10] and McNaughton and Yamada [17]). Brzozowski’s construction [4] of a deterministic finite automaton uses derivatives of regular expressions. This approach was modified by Antimirov [2], who defined partial derivatives to construct a non-deterministic automaton from a regular expression.

Keywords and phrases. Trees, automata, regular expressions, partial derivatives.

∗ I. Meinecke was supported by the German Research Foundation (DFG) within the project DR 202/10-1.

¹ Fachgebiet Theoretische Informatik, Technische Universität Ilmenau, Postfach 100565, 98684 Ilmenau, Germany. dietrich.kuske@tu-ilmenau.de

² Institut für Informatik, Universität Leipzig, PF 100920, 04009 Leipzig, Germany. meinecke@informatik.uni-leipzig.de

Article published by EDP Sciences. © EDP Sciences 2011

Kleene’s theorem was lifted to the setting of trees [22] (cf. also [8,9]), which are one of the most fundamental concepts in computer science. A regular tree expression defines a language of ordered trees. An inductive construction even produces a tree automaton accepting this language. The number of states of this automaton is exactly the number of iterations in the expression E plus |E|_Σ, where |E|_Σ is the number of occurrences of symbols from the ranked alphabet in E. In this paper, we define partial derivatives for regular tree expressions and use them to build a non-deterministic finite tree automaton recognizing the language denoted by the regular expression. The concept of partial derivatives yields a tree automaton with at most |E|_Σ states and |E|_Σ² transitions. The construction of this tree automaton and its correctness proof are combined with algorithmic considerations on how to build this automaton. We adapt and modify the approach of Champarnaud and Ziadi [5,6] for the word case, who extended work of Berry and Sethi [3]. Here, we use linearizations of regular tree expressions. The main idea is to distinguish occurrences of the same symbol at different positions in the regular expression. By doing so, we can ensure a certain uniqueness of the partial derivatives. As it turns out, the partial derivatives of the original regular expression are just projections of the partial derivatives of the linearized regular expression. This approach has two main advantages: firstly, the desired automaton is in fact a quotient of an automaton that stems from the linearized regular expression. This way we also get the upper bound on the number of transitions mentioned above. Secondly, the theoretical results allow for an efficient algorithm working on the syntax-tree of E. We obtain an algorithm with O(R·size(E)²) space and time complexity where R is the maximal rank of a symbol occurring in the finite ranked alphabet Σ and size(E) is the size of the regular expression.

Besides the standard and the partial derivative constructions, there are other proposals in the literature for obtaining an automaton from a regular expression. In particular, it would be interesting to know whether the construction of the follow automaton [7,12,13] carries over to the setting of trees. In this paper we consider ranked trees. However, regular expressions were also explored for unranked trees in connection with XML. They are used in pattern matching, see e.g. [11]. In an extended abstract [15] of this paper, we wondered whether the concept of partial derivatives can lead to fruitful results and algorithms in this area. A first answer was given by Suzuki and Okui [21], who successfully applied the concept of partial derivatives to regular hedge expression patterns. Last but not least, Lombardy and Sakarovitch [16] applied the method of partial derivatives to a weighted setting for words. We are confident that this should also be possible for trees.


1. Trees, automata, and regular expressions

Let ℕ be the set of non-negative integers. Throughout this paper, we fix a finite ranked alphabet Σ = (Σ_m)_{m∈ℕ}. The set T_Σ of trees over Σ is defined by the Backus-Naur form (BNF)

\[ t ::= f(\underbrace{t, \dots, t}_{m \text{ times}}) \]

where f ∈ Σ_m. For the base case c ∈ Σ₀, we will write c instead of c(). A subset L ⊆ T_Σ is called a tree language.

A (top-down) tree automaton over Σ is a tuple A = (Q, Σ, I, Δ) where Q is a set of states, I ⊆ Q is the set of initial states, and Δ = (Δ_m)_{m∈ℕ} is the set of transitions¹ such that Δ_m ⊆ Q × Σ_m × Q^m for every m ∈ ℕ. In particular, Δ₀ ⊆ Q × Σ₀. A finite tree automaton (or FTA) is a tree automaton A with only finitely many states and, thus, only finitely many transitions (note that there are only finitely many m with Σ_m ≠ ∅).

Whether a tree t is accepted by a tree automaton A = (Q, Σ, I, Δ) is defined inductively along the construction of t: if t = c ∈ Σ₀, then t is accepted by A iff there exists a state q ∈ I with (q, c) ∈ Δ₀. For f ∈ Σ_m with m > 0, the tree f(t₁, …, t_m) is accepted by A iff there exist states q ∈ I and q₁, …, q_m ∈ Q such that (q, f, q₁, …, q_m) ∈ Δ_m and, for 1 ≤ i ≤ m, the tree t_i is accepted by the tree automaton (Q, Σ, {q_i}, Δ). The language L(A) recognized by A is the set of all trees t that are accepted by A. A tree language L is recognizable if there is an FTA A with L(A) = L.
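The inductive acceptance condition can be sketched directly as a recursive check. The following Python fragment is only an illustration under assumptions of ours: trees are encoded as nested tuples, a transition (q, f, q₁, …, q_m) is stored as (q, f, (q₁, …, q_m)), and the toy automaton DELTA is hypothetical and not taken from the paper.

```python
# Trees are nested tuples ("f", t1, ..., tm); a constant c is the 1-tuple ("c",).
# A transition (q, f, q1, ..., qm) is stored as (q, f, (q1, ..., qm));
# a nullary transition (q, c) is stored as (q, c, ()).

def accepts_from(state, tree, delta):
    """Return True iff the automaton with sole initial state `state` accepts `tree`."""
    symbol, children = tree[0], tree[1:]
    for source, label, targets in delta:
        if source != state or label != symbol or len(targets) != len(children):
            continue
        # every subtree t_i must be accepted from the matching successor state q_i
        if all(accepts_from(q, t, delta) for q, t in zip(targets, children)):
            return True
    return False


def accepts(initial, tree, delta):
    """Acceptance with a set of initial states, as in the definition above."""
    return any(accepts_from(q, tree, delta) for q in initial)


# Hypothetical toy automaton: state "q" reads the binary symbol f or the constant a.
DELTA = {("q", "f", ("q", "q")), ("q", "a", ())}
A = ("a",)
print(accepts({"q"}, ("f", A, ("f", A, A)), DELTA))  # True
print(accepts({"q"}, ("f", A, ("b",)), DELTA))       # False
```

The recursion mirrors the definition: the automaton (Q, Σ, {q_i}, Δ) is represented simply by restarting the check from the state q_i.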

We next introduce some constructions of tree languages that extend the rational operations on word languages. Let f ∈ Σ_m and L₁, …, L_m ⊆ T_Σ. Then we put

\[ f(L_1, \dots, L_m) = \{ f(t_1, \dots, t_m) \mid t_i \in L_i \text{ for } i = 1, \dots, m \}. \]

For L ⊆ T_Σ and c ∈ Σ₀ we define for every t ∈ T_Σ inductively the non-uniform substitution t[c←L]:

• c[c←L] = L and d[c←L] = {d} for every d ∈ Σ₀ with d ≠ c;

• f(t₁, …, t_m)[c←L] = f(t₁[c←L], …, t_m[c←L]).

Then the c-product of L₁, L₂ ⊆ T_Σ is the language

\[ L_1 \cdot_c L_2 = \bigcup_{t \in L_1} t[c \leftarrow L_2]. \]

Now the iterated c-products are defined for L ⊆ T_Σ by L^{0,c} = {c} and L^{n+1,c} = L^{n,c} ∪ L ·c L^{n,c}. The c-iteration of L is defined as

\[ L^{*c} = \bigcup_{n \ge 0} L^{n,c}. \]
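For finite languages, the substitution, the c-product and the iterated c-products can be computed literally from these definitions. The sketch below assumes our own encoding of trees as nested tuples; the function names are illustrative and do not come from the paper.

```python
from itertools import product

# Trees are nested tuples ("f", t1, ..., tm); a constant c is the 1-tuple ("c",).

def substitute(tree, c, lang):
    """Return the set t[c <- lang] for a finite set `lang` of trees."""
    symbol, children = tree[0], tree[1:]
    if not children:                          # tree is a constant
        return set(lang) if symbol == c else {tree}
    # Non-uniform: every occurrence of c may be replaced by a different tree.
    options = [substitute(child, c, lang) for child in children]
    return {(symbol, *choice) for choice in product(*options)}


def c_product(l1, c, l2):
    """Return L1 ·c L2 as the union of t[c <- L2] over all t in L1."""
    result = set()
    for t in l1:
        result |= substitute(t, c, l2)
    return result


def iterated(lang, c, n):
    """Return L^{n,c}, using L^{0,c} = {c} and L^{n+1,c} = L^{n,c} ∪ L ·c L^{n,c}."""
    current = {(c,)}
    for _ in range(n):
        current = current | c_product(lang, c, current)
    return current


a, b = ("a",), ("b",)
print(len(substitute(("f", a, a), "a", {a, b})))   # 4
print(len(iterated({("g", a)}, "a", 2)))           # 3
```

The first call shows the non-uniform character of the substitution: both occurrences of a are replaced independently, which gives four trees rather than two. The c-iteration itself is in general infinite, so only its finite approximations L^{n,c} are computed here.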

¹ In the term-rewriting terminology employed by [8], the transition (q, f, q₁, …, q_m) is denoted by the rule q(f(x₁, …, x_m)) → f(q₁(x₁), …, q_m(x_m)).

It is well-known that a tree language L is recognizable if and only if it can be denoted by a regular expression. These regular expressions are defined by the following grammar in BNF²:

\[ E ::= \emptyset \mid f(\underbrace{E, \dots, E}_{m \text{ times}}) \mid (E + E) \mid (E \cdot_c E) \mid (E^{*c}) \]

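where f ∈ Σ_m and c ∈ Σ₀. Again, we write c instead of c() whenever c ∈ Σ₀. The semantics ⟦E⟧ of a regular expression E is defined inductively by

\[ \llbracket \emptyset \rrbracket = \emptyset, \qquad \llbracket f(E_1, \dots, E_m) \rrbracket = f(\llbracket E_1 \rrbracket, \dots, \llbracket E_m \rrbracket), \]
\[ \llbracket (E + F) \rrbracket = \llbracket E \rrbracket \cup \llbracket F \rrbracket, \qquad \llbracket (E \cdot_c F) \rrbracket = \llbracket E \rrbracket \cdot_c \llbracket F \rrbracket, \qquad \text{and} \qquad \llbracket (E^{*c}) \rrbracket = \llbracket E \rrbracket^{*c}. \]

For a set M of regular expressions, we put

\[ \llbracket M \rrbracket = \bigcup_{E \in M} \llbracket E \rrbracket. \]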

The set of all regular expressions over the ranked alphabet Σ is denoted by EXP(Σ). Let |E|_f denote the number of occurrences of the letter f ∈ Σ in E. The alphabetic width |E|_Σ of E is defined by

\[ |E|_\Sigma = \max\Bigl(1, \sum_{f \in \Sigma} |E|_f\Bigr), \]

i.e., |E|_Σ is the number of occurrences of symbols from Σ in E but, for technical reasons, at least 1. Thus, |∅|_Σ = 1. The size of E is defined inductively by: size(∅) = size(c) = 1 for c ∈ Σ₀,

\[ \mathrm{size}(f(E_1, \dots, E_m)) = 1 + \sum_{i=1}^{m} \mathrm{size}(E_i), \]

size((E + F)) = size((E ·c F)) = 1 + size(E) + size(F), and size((E^{*c})) = 1 + size(E). Every regular expression E can be understood as a tree over the ranked alphabet Σ ∪ {+, ·c, ^{*c}, ∅ | c ∈ Σ₀} where + and ·c have rank 2, ^{*c} has rank 1, and ∅ has rank 0. This tree is called the syntax-tree t_E of E.
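Both measures are straightforward recursions over the syntax-tree. The following sketch assumes a tuple encoding of regular expressions that is ours, not the paper's; it evaluates the expression that also appears in Example 2.3.

```python
# Assumed encoding of regular expressions as nested tuples:
#   ("empty",)            the expression ∅
#   ("sym", f, (E1..Em))  the expression f(E1, ..., Em); constants have no arguments
#   ("plus", E, F)        (E + F)
#   ("cat", c, E, F)      (E ·c F)
#   ("star", c, E)        (E^{*c})

def occurrences(e):
    """Number of occurrences of symbols from Σ in e, i.e. the sum of all |E|_f."""
    tag = e[0]
    if tag == "empty":
        return 0
    if tag == "sym":
        return 1 + sum(occurrences(x) for x in e[2])
    if tag == "plus":
        return occurrences(e[1]) + occurrences(e[2])
    if tag == "cat":
        return occurrences(e[2]) + occurrences(e[3])
    return occurrences(e[2])                      # star


def alphabetic_width(e):
    """|E|_Σ: the number of symbol occurrences, but at least 1."""
    return max(1, occurrences(e))


def size(e):
    """size(E): the number of nodes of the syntax-tree of E."""
    tag = e[0]
    if tag == "empty":
        return 1
    if tag == "sym":
        return 1 + sum(size(x) for x in e[2])
    if tag == "plus":
        return 1 + size(e[1]) + size(e[2])
    if tag == "cat":
        return 1 + size(e[2]) + size(e[3])
    return 1 + size(e[2])                         # star


a, b = ("sym", "a", ()), ("sym", "b", ())
E1 = ("star", "a", ("sym", "f", (("sym", "g", (("sym", "h", (a,)),)), ("sym", "g", (b,)))))
E2 = ("plus", ("sym", "h", (a,)), ("sym", "h", (b,)))
E = ("cat", "b", E1, E2)
print(alphabetic_width(E), size(E))   # 10 13
```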

2. A direct construction

In this section, we will construct from a regular expression E a tree automaton A_E that accepts ⟦E⟧. The finiteness of this automaton will only be proved later. Our construction is based on partial derivatives that we have to define and investigate first.

Let M be a set of regular expressions, F some regular expression, and c ∈ Σ₀. Then M ·c F denotes the set {(E ·c F) | E ∈ M}. Similarly, we put for a set 𝓜 of m-tuples of regular expressions

\[ \mathcal{M} \cdot_c F = \bigl\{ \bigl((E_1 \cdot_c F), (E_2 \cdot_c F), \dots, (E_m \cdot_c F)\bigr) \bigm| (E_1, E_2, \dots, E_m) \in \mathcal{M} \bigr\}. \]

Definition 2.1. For g ∈ Σ_m, m ≥ 1, and a regular expression E, we define the set g⁻¹E of m-tuples of regular expressions inductively:

• g⁻¹∅ = ∅,

• \( g^{-1} f(E_1, E_2, \dots, E_n) = \begin{cases} \{(E_1, E_2, \dots, E_n)\} & \text{if } f = g, \\ \emptyset & \text{if } f \neq g, \end{cases} \)

• g⁻¹(E + F) = g⁻¹E ∪ g⁻¹F,

• \( g^{-1}(E \cdot_c F) = \begin{cases} (g^{-1}E) \cdot_c F \cup g^{-1}F & \text{if } c \in \llbracket E \rrbracket, \\ (g^{-1}E) \cdot_c F & \text{otherwise,} \end{cases} \)

• g⁻¹(E^{*c}) = (g⁻¹E) ·c (E^{*c}).

² For constructing an equivalent regular expression for a given tree automaton, it is necessary to introduce additional symbols of rank zero, cf. [9], Proposition 9.2. But since we are interested only in the converse, i.e., the construction of automata from an expression, we do not have to differentiate between nullary symbols from the alphabet and additional ones.
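Definition 2.1 translates almost line by line into a recursive function. The sketch below uses an assumed tuple encoding of expressions. The helper has_const, which decides whether a constant c belongs to ⟦E⟧, is our own characterization worked out from the definitions of the c-product and the c-iteration; it is not stated in this form in the text.

```python
# Assumed encoding: ("empty",), ("sym", f, args), ("plus", E, F),
# ("cat", c, E, F) for (E ·c F), and ("star", c, E) for (E^{*c}).
def sym(f, *args): return ("sym", f, args)
def plus(e, f): return ("plus", e, f)
def cat(c, e, f): return ("cat", c, e, f)
def star(c, e): return ("star", c, e)


def has_const(e, c):
    """Decide whether the constant c belongs to the language of e (our helper)."""
    tag = e[0]
    if tag == "empty":
        return False
    if tag == "sym":
        return e[1] == c and not e[2]
    if tag == "plus":
        return has_const(e[1], c) or has_const(e[2], c)
    if tag == "cat":
        d, left, right = e[1], e[2], e[3]
        # c arises either by replacing the tree d with c, or from an untouched c.
        return (has_const(left, d) and has_const(right, c)) or (c != d and has_const(left, c))
    d, body = e[1], e[2]                          # star: L^{0,d} = {d}
    return c == d or has_const(body, c)


def tuple_derivative(g, e):
    """Return g^{-1}e as a set of tuples of expressions (g non-nullary)."""
    tag = e[0]
    if tag == "empty":
        return set()
    if tag == "sym":
        return {e[2]} if (e[1] == g and e[2]) else set()
    if tag == "plus":
        return tuple_derivative(g, e[1]) | tuple_derivative(g, e[2])
    if tag == "cat":
        c, left, right = e[1], e[2], e[3]
        result = {tuple(cat(c, x, right) for x in tup) for tup in tuple_derivative(g, left)}
        if has_const(left, c):
            result |= tuple_derivative(g, right)
        return result
    c, body = e[1], e[2]                          # star
    return {tuple(cat(c, x, e) for x in tup) for tup in tuple_derivative(g, body)}


a, b = sym("a"), sym("b")
E1 = star("a", sym("f", sym("g", sym("h", a)), sym("g", b)))
E2 = plus(sym("h", a), sym("h", b))
E = cat("b", E1, E2)
print(len(tuple_derivative("f", E)), len(tuple_derivative("g", E)))   # 1 0
```

On the expression of Example 2.3, the only tuple in f⁻¹E is ((g(h(a)) ·a E₁) ·b E₂, (g(b) ·a E₁) ·b E₂), while g⁻¹E and h⁻¹E are empty.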

Let Σ_≥1 = ⋃_{m≥1} Σ_m = Σ \ Σ₀ denote the set of non-nullary symbols from the ranked alphabet Σ. Following Antimirov, we define further functions ∂_w for finite words w ∈ Σ^∗_≥1 over the alphabet Σ_≥1. By ε we denote the empty word.

Definition 2.2. Let E be a regular expression. Then ∂_ε E = {E} and, for w ∈ Σ^∗_≥1 and g ∈ Σ_≥1, the set ∂_{wg} E consists of all regular expressions F that appear in some tuple from g⁻¹E′ for some E′ ∈ ∂_w E. For a set of words W ⊆ Σ^∗_≥1 and a regular expression E, we put

\[ \partial_W E = \bigcup_{w \in W} \partial_w E. \]

The function ∂_w is called the partial derivative with respect to w.

Note that

\[ \partial_{wg} E = \partial_g \partial_w E = \bigcup_{E' \in \partial_w E} \partial_g E' \]

for all w ∈ Σ^∗_≥1 and g ∈ Σ_≥1. Further note that we consider derivatives with respect to words over the non-nullary symbols from Σ and not with respect to trees.

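Example 2.3. Let Σ₀ = {a, b}, Σ₁ = {g, h}, and Σ₂ = {f}. Consider the regular tree expression

\[ E = \underbrace{f\bigl(g(h(a)), g(b)\bigr)^{*a}}_{E_1} \cdot_b \underbrace{\bigl(h(a) + h(b)\bigr)}_{E_2}. \]

Note that b ∉ ⟦E₁⟧. Hence ∂_g E = ∂_h E = ∅ and

\[ \partial_f E = \bigl\{ (g(h(a)) \cdot_a E_1) \cdot_b E_2,\ (g(b) \cdot_a E_1) \cdot_b E_2 \bigr\}. \]

Furthermore, we compute the following partial derivatives:

\[ \partial_{fg} E = \bigl\{ (h(a) \cdot_a E_1) \cdot_b E_2,\ (b \cdot_a E_1) \cdot_b E_2 \bigr\}, \]
\[ \partial_{fgh} E = \bigl\{ (a \cdot_a E_1) \cdot_b E_2,\ a,\ b \bigr\}, \]

where the last equality is due to ∂_h((b ·a E₁) ·b E₂) = ∂_h E₂ = {a, b}. Notice that ∂_{fghf} E = ∂_f E. Hence, together with ∂_ε E = {E}, we obtain eight different partial derivatives of E.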

A symbol f ∈ Σ occurs unguarded in E if no ancestor in the syntax tree t_E is labeled by an element of Σ. We will be interested in the number ‖E‖_f of unguarded occurrences of f in E, which can be defined inductively by:

• ‖∅‖_f = 0,

• ‖f(E₁, …, E_m)‖_f = 1 and ‖g(E₁, …, E_n)‖_f = 0 for g ≠ f,

• ‖(E₁ + E₂)‖_f = ‖(E₁ ·c E₂)‖_f = ‖E₁‖_f + ‖E₂‖_f, and

• ‖(E^{*c})‖_f = ‖E‖_f.
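The inductive definition of the number of unguarded occurrences can be sketched as follows, again under an assumed tuple encoding of expressions that is ours. The recursion stops at the first symbol from Σ, because everything below it is guarded.

```python
# Assumed encoding: ("empty",), ("sym", f, args), ("plus", E, F),
# ("cat", c, E, F) for (E ·c F), and ("star", c, E) for (E^{*c}).

def unguarded(e, f):
    """Number of unguarded occurrences of the symbol f in the expression e."""
    tag = e[0]
    if tag == "empty":
        return 0
    if tag == "sym":
        # The arguments are guarded by this symbol, so they are not inspected.
        return 1 if e[1] == f else 0
    if tag == "plus":
        return unguarded(e[1], f) + unguarded(e[2], f)
    if tag == "cat":
        return unguarded(e[2], f) + unguarded(e[3], f)
    return unguarded(e[2], f)                     # star


a, b = ("sym", "a", ()), ("sym", "b", ())
E1 = ("star", "a", ("sym", "f", (("sym", "g", (("sym", "h", (a,)),)), ("sym", "g", (b,)))))
E2 = ("plus", ("sym", "h", (a,)), ("sym", "h", (b,)))
E = ("cat", "b", E1, E2)
print(unguarded(E, "f"), unguarded(E, "g"), unguarded(E, "h"))   # 1 0 2
```

For the expression of Example 2.3, g never occurs unguarded, which matches the observation made there that ∂_g E = ∅.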

Proposition 2.4. Let E be a regular expression and g ∈ Σ_≥1. Then |g⁻¹E| ≤ ‖E‖_g. In particular, if |E|_g = 0 then g⁻¹E = ∂_g E = ∅.

Proof. The claim is shown by induction on the construction of E: for E = ∅, the claim is trivial. If we have E = f(E₁, …, E_m) and f ≠ g, then |g⁻¹E| = 0, so the claim is also trivial. If f = g, then |g⁻¹E| = 1 = ‖E‖_g.

Next consider the case (E + F): then

\[ |g^{-1}(E + F)| \le |g^{-1}E| + |g^{-1}F| \le \|E\|_g + \|F\|_g = \|(E + F)\|_g. \]

For the product, we have

\[ |g^{-1}(E \cdot_c F)| \le |(g^{-1}E) \cdot_c F| + |g^{-1}F| \le |g^{-1}E| + |g^{-1}F| \le \|E\|_g + \|F\|_g = \|(E \cdot_c F)\|_g. \]

Similarly, we obtain for the iteration

\[ |g^{-1}(E^{*c})| = |(g^{-1}E) \cdot_c (E^{*c})| = |g^{-1}E| \le \|E\|_g = \|(E^{*c})\|_g. \]

Next, we express the semantics of a regular expression E in terms of the semantics of the tuples from g⁻¹E.

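Proposition 2.5. For any regular expression E, we have

\[ \llbracket E \rrbracket = \bigcup \bigl\{ g(\llbracket G_1 \rrbracket, \dots, \llbracket G_m \rrbracket) \bigm| g \in \Sigma_{\ge 1},\ (G_1, \dots, G_m) \in g^{-1}E \bigr\} \cup \{ c \in \Sigma_0 \mid c \in \llbracket E \rrbracket \}. \tag{1} \]

Proof. Let ⟦E⟧₀ = ⟦E⟧ ∩ Σ₀. The proof proceeds by induction on the construction of the regular expression E. For E = ∅, equation (1) holds. Next let E = f(E₁, …, E_m) where f ∈ Σ_m and E₁, …, E_m are regular expressions. We write \(\vec{G}\) for (G₁, …, G_m). Then

\[ \begin{aligned} \llbracket E \rrbracket &= f(\llbracket E_1 \rrbracket, \dots, \llbracket E_m \rrbracket) \\ &= \bigcup \bigl\{ f(\llbracket G_1 \rrbracket, \dots, \llbracket G_m \rrbracket) \bigm| \vec{G} \in f^{-1}E \bigr\} && (\text{since } f^{-1}E = \{(E_1, \dots, E_m)\}) \\ &= \bigcup \bigl\{ g(\llbracket G_1 \rrbracket, \dots, \llbracket G_m \rrbracket) \bigm| g \in \Sigma,\ \vec{G} \in g^{-1}E \bigr\} && (\text{since } g^{-1}E = \emptyset \text{ for } f \neq g). \end{aligned} \]

The proof in case E = (E₁ + E₂) is immediate and therefore omitted. Next let E = (E₁ ·c E₂). Then we have

\[ \llbracket E \rrbracket = \llbracket E_1 \rrbracket \cdot_c \llbracket E_2 \rrbracket = \bigl( (\llbracket E_1 \rrbracket \setminus \{c\}) \cdot_c \llbracket E_2 \rrbracket \bigr) \cup \bigl( (\llbracket E_1 \rrbracket \cap \{c\}) \cdot_c \llbracket E_2 \rrbracket \bigr). \]

By the induction hypothesis, the first of these two sets equals

\[ \begin{aligned} & \bigcup \bigl\{ f(\llbracket G_1 \rrbracket, \dots, \llbracket G_m \rrbracket) \bigm| f \in \Sigma,\ \vec{G} \in f^{-1}E_1 \bigr\} \cdot_c \llbracket E_2 \rrbracket \cup (\llbracket E_1 \rrbracket_0 \setminus \{c\}) \\ &= \bigcup \bigl\{ f(\llbracket (G_1 \cdot_c E_2) \rrbracket, \dots, \llbracket (G_m \cdot_c E_2) \rrbracket) \bigm| f \in \Sigma,\ \vec{G} \in f^{-1}E_1 \bigr\} \cup (\llbracket E_1 \rrbracket_0 \setminus \{c\}) \\ &= \bigcup \bigl\{ f(\llbracket H_1 \rrbracket, \dots, \llbracket H_m \rrbracket) \bigm| f \in \Sigma,\ \vec{H} \in (f^{-1}E_1) \cdot_c E_2 \bigr\} \cup (\llbracket E_1 \rrbracket_0 \setminus \{c\}). \end{aligned} \]

If c ∈ ⟦E₁⟧, then the second of the two sets above equals

\[ \llbracket E_2 \rrbracket = \bigcup \bigl\{ f(\llbracket H_1 \rrbracket, \dots, \llbracket H_m \rrbracket) \bigm| f \in \Sigma,\ \vec{H} \in f^{-1}E_2 \bigr\} \cup \llbracket E_2 \rrbracket_0, \]

otherwise it is empty. Hence equation (1) holds for E = (E₁ ·c E₂).

Finally, consider the regular expression (E^{*c}). If f(t₁, …, t_m) ∈ ⟦(E^{*c})⟧ = ⟦E⟧^{*c}, then there exists n ≥ 0 with f(t₁, …, t_m) ∈ ⟦E⟧^{n+1,c} \ ⟦E⟧^{n,c}. Hence there exists s ∈ ⟦E⟧ with f(t₁, …, t_m) ∈ {s} ·c ⟦E⟧^{n,c}. Since f(t₁, …, t_m) ∉ ⟦E⟧^{n,c}, the tree s is of the form s = f(s₁, …, s_m). By the induction hypothesis, we find (G₁, …, G_m) ∈ f⁻¹E with s_i ∈ ⟦G_i⟧. Hence we obtain

\[ \begin{aligned} f(t_1, \dots, t_m) &\in \{s\} \cdot_c \llbracket (E^{*c}) \rrbracket \\ &\subseteq f(\llbracket G_1 \rrbracket, \dots, \llbracket G_m \rrbracket) \cdot_c \llbracket (E^{*c}) \rrbracket \\ &= f(\llbracket (G_1 \cdot_c (E^{*c})) \rrbracket, \dots, \llbracket (G_m \cdot_c (E^{*c})) \rrbracket). \end{aligned} \]

Since the tuple ((G_i ·c (E^{*c})))_{1≤i≤m} = (G_i)_{1≤i≤m} ·c (E^{*c}) belongs to f⁻¹(E^{*c}), we showed the containment “⊆” of equation (1).

Conversely let f ∈ Σ and \(\vec{H}\) ∈ f⁻¹(E^{*c}) = (f⁻¹E) ·c (E^{*c}). Then there exists a tuple of regular expressions \(\vec{G}\) ∈ f⁻¹E with \(\vec{H} = \vec{G} \cdot_c (E^{*c})\). Hence we get

\[ \begin{aligned} f(\llbracket H_1 \rrbracket, \dots, \llbracket H_m \rrbracket) &= f(\llbracket (G_1 \cdot_c (E^{*c})) \rrbracket, \dots, \llbracket (G_m \cdot_c (E^{*c})) \rrbracket) \\ &= f(\llbracket G_1 \rrbracket, \dots, \llbracket G_m \rrbracket) \cdot_c \llbracket (E^{*c}) \rrbracket. \end{aligned} \]

By the induction hypothesis, f(⟦G₁⟧, …, ⟦G_m⟧) ⊆ ⟦E⟧, so we can continue

\[ \subseteq \llbracket E \rrbracket \cdot_c \llbracket (E^{*c}) \rrbracket \subseteq \llbracket (E^{*c}) \rrbracket. \]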

Let E be a regular expression and let Q_E = ∂_{Σ^∗_≥1} E. Then we define a set of transitions Δ_E as

\[ \bigl\{ (F, f, G_1, G_2, \dots, G_m) \bigm| F \in Q_E,\ f \in \Sigma_m,\ m \ge 1,\ (G_1, \dots, G_m) \in f^{-1}F \bigr\} \cup \bigl\{ (F, c) \bigm| F \in Q_E,\ c \in \Sigma_0,\ c \in \llbracket F \rrbracket \bigr\}. \]

Furthermore, let A_E = (Q_E, Σ, {E}, Δ_E) denote the tree automaton whose only initial state is the regular expression E.
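The definition of A_E suggests a simple worklist exploration: start from E, take all tuples in f⁻¹F for every discovered state F, and add their components as new states. The following Python sketch does this under an assumed tuple encoding of expressions. The helper has_const (membership of a constant in ⟦F⟧) is our own characterization and not taken from the text, and the loop only terminates because the set of partial derivatives is finite, which the paper proves later.

```python
# Assumed encoding: ("empty",), ("sym", f, args), ("plus", E, F),
# ("cat", c, E, F) for (E ·c F), and ("star", c, E) for (E^{*c}).
def sym(f, *args): return ("sym", f, args)
def cat(c, e, f): return ("cat", c, e, f)


def has_const(e, c):
    """Decide whether the constant c belongs to the language of e (our helper)."""
    tag = e[0]
    if tag == "empty":
        return False
    if tag == "sym":
        return e[1] == c and not e[2]
    if tag == "plus":
        return has_const(e[1], c) or has_const(e[2], c)
    if tag == "cat":
        d, left, right = e[1], e[2], e[3]
        return (has_const(left, d) and has_const(right, c)) or (c != d and has_const(left, c))
    return c == e[1] or has_const(e[2], c)        # star


def tuple_derivative(g, e):
    """Return g^{-1}e following Definition 2.1."""
    tag = e[0]
    if tag == "empty":
        return set()
    if tag == "sym":
        return {e[2]} if (e[1] == g and e[2]) else set()
    if tag == "plus":
        return tuple_derivative(g, e[1]) | tuple_derivative(g, e[2])
    if tag == "cat":
        c, left, right = e[1], e[2], e[3]
        result = {tuple(cat(c, x, right) for x in tup) for tup in tuple_derivative(g, left)}
        return result | tuple_derivative(g, right) if has_const(left, c) else result
    c, body = e[1], e[2]                          # star
    return {tuple(cat(c, x, e) for x in tup) for tup in tuple_derivative(g, body)}


def build_automaton(e, nonnullary, constants):
    """Return (Q_E, Δ_E); a transition is stored as (state, symbol, tuple_of_states)."""
    states, todo, delta = {e}, [e], set()
    while todo:
        state = todo.pop()
        for c in constants:
            if has_const(state, c):
                delta.add((state, c, ()))
        for g in nonnullary:
            for tup in tuple_derivative(g, state):
                delta.add((state, g, tup))
                for target in tup:
                    if target not in states:
                        states.add(target)
                        todo.append(target)
    return states, delta


a, b = sym("a"), sym("b")
E1 = ("star", "a", sym("f", sym("g", sym("h", a)), sym("g", b)))
E2 = ("plus", sym("h", a), sym("h", b))
E = cat("b", E1, E2)
states, delta = build_automaton(E, ["f", "g", "h"], ["a", "b"])
print(len(states), len(delta))   # 8 11
```

States are compared syntactically, so two expressions that denote the same language but are written differently count as different states. This matches the definition of Q_E as a set of expressions.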


Theorem 2.6. Let E be a regular expression over the ranked alphabet Σ. Then A_E is a tree automaton that accepts ⟦E⟧.

Proof. We show by induction on the structure of trees that for any tree t ∈ T_Σ and any regular expression F, the tree automaton A_F accepts t iff t ∈ ⟦F⟧.

First let t = c ∈ Σ₀. Now c is accepted by A_F iff there is a transition (F, c) ∈ Δ_F. But this is the case iff c ∈ ⟦F⟧. Now let t = f(s₁, …, s_m) for some m > 0. Then t is accepted by A_F iff there is a transition (F, f, (G₁, …, G_m)) ∈ Δ_F such that s_i is accepted by the tree automaton (Q_F, Σ, {G_i}, Δ_F) for all 1 ≤ i ≤ m.

Note that the reachable part of the automaton (Q_F, Σ, {G_i}, Δ_F) is the set of states Q_{G_i}. Hence, s_i is accepted by this automaton iff it is accepted by A_{G_i}. By the induction hypothesis, this is equivalent to saying s_i ∈ ⟦G_i⟧. Since this holds for all 1 ≤ i ≤ m, we have that t is accepted by A_F iff there exists (G₁, …, G_m) ∈ f⁻¹(F) with s_i ∈ ⟦G_i⟧, which is, by Proposition 2.5, equivalent to saying t ∈ ⟦F⟧.

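Example 2.7. Consider the regular expression

\[ E = \underbrace{f\bigl(g(h(a)), g(b)\bigr)^{*a}}_{E_1} \cdot_b \underbrace{\bigl(h(a) + h(b)\bigr)}_{E_2} \]

from Example 2.3. There we computed the partial derivatives of E. Thus Q_E = {q_i | i = 0, …, 7} where

\[ \begin{aligned} q_0 &= E, & q_1 &= (g(h(a)) \cdot_a E_1) \cdot_b E_2, & q_2 &= (g(b) \cdot_a E_1) \cdot_b E_2, \\ q_3 &= (h(a) \cdot_a E_1) \cdot_b E_2, & q_4 &= (b \cdot_a E_1) \cdot_b E_2, & q_5 &= (a \cdot_a E_1) \cdot_b E_2, \\ q_6 &= a, & q_7 &= b. \end{aligned} \]

The set of transitions Δ_E comprises

\[ q_0 \xrightarrow{f} (q_1, q_2), \quad q_1 \xrightarrow{g} q_3, \quad q_2 \xrightarrow{g} q_4, \quad q_3 \xrightarrow{h} q_5, \quad q_4 \xrightarrow{h} q_6, \quad q_4 \xrightarrow{h} q_7, \quad q_5 \xrightarrow{f} (q_1, q_2), \]
\[ q_0 \xrightarrow{a} \bot, \quad q_5 \xrightarrow{a} \bot, \quad q_6 \xrightarrow{a} \bot, \quad q_7 \xrightarrow{b} \bot. \]

Here, \( q_0 \xrightarrow{f} (q_1, q_2) \) and \( q_0 \xrightarrow{a} \bot \) mean that (q₀, f, q₁, q₂), (q₀, a) ∈ Δ_E.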

In the last example, the tree automaton resulting from our construction is finite. But so far, we did not prove in general that the tree automaton A_E has only finitely many states, i.e., that ⟦E⟧ is recognizable. Theorem 3.16 will show that the number of states is linear and that the number of transitions is quadratic in the size of E. This will only be achieved after going through the following two constructions.


3. An indirect construction via linearizations

The idea of the indirect construction is as follows: in a regular expression E, uniquely mark the occurrences of letters from Σ_≥1. Then apply our direct construction to the resulting regular expression Ē. The projection of this automaton accepts ⟦E⟧. As it turns out, a quotient of the automaton obtained this way is isomorphic to the result of the direct construction.

3.1. Linear regular expressions

A regular expression E is linear if every letter f ∈ Σ_≥1 occurs at most once in E. Note that c ∈ Σ₀ may occur more than once. The following proposition is a consequence of Proposition 2.4.

Proposition 3.1. Let E be a linear regular expression and g ∈ Σ_m for m ≥ 1. Then |g⁻¹E| ≤ 1 and therefore |∂_g E| ≤ m.

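For M ⊆ EXP(Σ) and g ∈ Σ_≥1, we put g⁻¹M = ⋃{g⁻¹E | E ∈ M}. Now we consider partial derivatives with respect to non-empty words for linear regular expressions.

Proposition 3.2. Let E, F be linear regular expressions over the alphabet Σ such that also (E + F) and (E ·c F) are linear. Let w ∈ Σ^∗_≥1 and g ∈ Σ_≥1. Then the following hold true:

• \( g^{-1}\partial_w(E + F) = \begin{cases} g^{-1}\partial_w E & \text{if } |E|_g > 0, \\ g^{-1}\partial_w F & \text{otherwise.} \end{cases} \)

• \( g^{-1}\partial_w(E \cdot_c F) = \begin{cases} (g^{-1}\partial_w E) \cdot_c F & \text{if } |E|_g > 0, \\ \bigcup \{ g^{-1}\partial_v F \mid \exists u \in \Sigma^*_{\ge 1} : w = uv \text{ and } c \in \llbracket \partial_u E \rrbracket \} & \text{otherwise.} \end{cases} \)

• There are suffixes v₁, …, v_k of w such that

\[ g^{-1}\partial_w(E^{*c}) = \bigcup_{1 \le i \le k} (g^{-1}\partial_{v_i} E) \cdot_c (E^{*c}). \]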

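Proof. If |E|_g = 0, then |∂_w E|_g = 0, implying g⁻¹∂_w E = ∅ by Proposition 2.4. Now g⁻¹∂_w(E + F) = g⁻¹∂_w E ∪ g⁻¹∂_w F. If |E|_g > 0, then g⁻¹∂_w F = ∅ and g⁻¹∂_w(E + F) = g⁻¹∂_w E. Otherwise we get g⁻¹∂_w E = ∅ and g⁻¹∂_w(E + F) = g⁻¹∂_w F.

For the remaining claims we proceed by induction on |w|. The claims are obvious for |w| = 0. From now on let w = w′f for some w′ ∈ Σ^∗_≥1 and f ∈ Σ_≥1. First, we consider g⁻¹∂_w(E ·c F). By the induction hypothesis, we obtain

\[ f^{-1}\partial_{w'}(E \cdot_c F) = \begin{cases} (f^{-1}\partial_{w'} E) \cdot_c F & \text{if } |E|_f > 0, \\ \bigcup \{ f^{-1}\partial_v F \mid \exists u \in \Sigma^*_{\ge 1} : w' = uv \ \&\ c \in \llbracket \partial_u E \rrbracket \} & \text{otherwise.} \end{cases} \]

First, consider the case |E|_f > 0 and |E|_g > 0. Then |F|_f = |F|_g = 0 and therefore g⁻¹F = ∅. Hence we have g⁻¹∂_w(E ·c F) = (g⁻¹∂_w E) ·c F.

Next, let |E|_f > |E|_g = 0. Then (i) g⁻¹∂_w E = ∅ and (ii) ∂_v F = ∅ for all non-empty suffixes v of w (since f occurs in v but not in F). Since, by the induction hypothesis, f⁻¹∂_{w′}(E ·c F) = (f⁻¹∂_{w′} E) ·c F, we get ∂_w(E ·c F) = ∂_f ∂_{w′}(E ·c F) = (∂_f ∂_{w′} E) ·c F = (∂_w E) ·c F and therefore

\[ \begin{aligned} g^{-1}\partial_w(E \cdot_c F) &= g^{-1}\bigl((\partial_w E) \cdot_c F\bigr) \\ &= \begin{cases} (g^{-1}\partial_w E) \cdot_c F & \text{if } c \notin \llbracket \partial_w E \rrbracket, \\ (g^{-1}\partial_w E) \cdot_c F \cup g^{-1}F & \text{otherwise} \end{cases} \\ &\overset{\text{(i)}}{=} \begin{cases} \emptyset & \text{if } c \notin \llbracket \partial_w E \rrbracket, \\ g^{-1}F & \text{otherwise} \end{cases} \\ &\overset{\text{(ii)}}{=} \bigcup \{ g^{-1}\partial_v F \mid \exists u \in \Sigma^*_{\ge 1} : w = uv \ \&\ c \in \llbracket \partial_u E \rrbracket \} \end{aligned} \]

as required.

Now assume |E|_f = 0. Then the induction hypothesis implies f⁻¹∂_{w′}(E ·c F) = ⋃{f⁻¹∂_v F | ∃u ∈ Σ^∗_≥1 : w′ = uv, c ∈ ⟦∂_u E⟧}. Hence we obtain

\[ \begin{aligned} \partial_w(E \cdot_c F) &= \partial_f \partial_{w'}(E \cdot_c F) \\ &= \bigcup \{ \partial_f \partial_v F \mid \exists u \in \Sigma^*_{\ge 1} : w' = uv,\ c \in \llbracket \partial_u E \rrbracket \} \\ &= \bigcup \{ \partial_v F \mid \exists u \in \Sigma^*_{\ge 1} : w = uv,\ v \neq \varepsilon,\ c \in \llbracket \partial_u E \rrbracket \} \\ &= \bigcup \{ \partial_v F \mid \exists u \in \Sigma^*_{\ge 1} : w = uv,\ c \in \llbracket \partial_u E \rrbracket \} \end{aligned} \]

where the last equality holds since ∂_w E = ∅. Applying g⁻¹ to this equation yields

\[ g^{-1}\partial_w(E \cdot_c F) = \bigcup \{ g^{-1}\partial_v F \mid \exists u : w = uv,\ c \in \llbracket \partial_u E \rrbracket \}. \]

If |E|_g = 0, this is precisely what we wanted to show. Otherwise, we obtain |F|_g = 0 and therefore g⁻¹∂_v F = ∅ for all v ∈ Σ^∗_≥1. Hence, in this case the last expression equals ∅. Since also g⁻¹∂_w E = ∅ (due to the non-occurrence of f in E), this equals (g⁻¹∂_w E) ·c F as required. This shows the claim for (E ·c F).

Now consider the regular expression (E^{*c}). By the induction hypothesis, there are suffixes v′₁, …, v′_k of w′ such that

\[ f^{-1}\partial_{w'}(E^{*c}) = \bigcup_{1 \le i \le k} (f^{-1}\partial_{v'_i} E) \cdot_c (E^{*c}) \quad \text{and} \quad \partial_w(E^{*c}) = \bigcup_{1 \le i \le k} (\partial_{v'_i f} E) \cdot_c (E^{*c}). \]

Hence, for v_i = v′_i f (1 ≤ i ≤ k) we have

\[ \begin{aligned} g^{-1}\partial_w(E^{*c}) &= g^{-1}\Bigl( \bigcup_{1 \le i \le k} (\partial_{v_i} E) \cdot_c (E^{*c}) \Bigr) \\ &= \begin{cases} \bigcup_{1 \le i \le k} (g^{-1}\partial_{v_i} E) \cdot_c (E^{*c}) & \text{if } c \notin \llbracket \partial_{v_i} E \rrbracket \text{ for all } 1 \le i \le k, \\ \bigcup_{1 \le i \le k} (g^{-1}\partial_{v_i} E) \cdot_c (E^{*c}) \cup g^{-1}(E^{*c}) & \text{otherwise.} \end{cases} \end{aligned} \]

Since g⁻¹(E^{*c}) = (g⁻¹E) ·c (E^{*c}) = (g⁻¹∂_ε E) ·c (E^{*c}), the set of tuples of regular expressions g⁻¹∂_w(E^{*c}) is of the required form.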

Proposition 3.3. Let E be a linear regular expression, u, w ∈ Σ^∗_≥1, and g ∈ Σ_≥1. Then we have:

(1) |g⁻¹∂_u E| ≤ 1,

(2) if g⁻¹∂_u E ≠ ∅ and g⁻¹∂_w E ≠ ∅, then g⁻¹∂_u E = g⁻¹∂_w E.

Proof. The proof is by induction on the structure of E.

For E = ∅ the claim is immediate. Now consider the case E = f(E₁, …, E_n). Since E is linear, there is at most one i with |E_i|_g > 0; if no such i exists, set i = 1. Then we have

\[ g^{-1}\partial_u E = \begin{cases} g^{-1}\partial_{u'}\{E_1, \dots, E_n\} = g^{-1}\partial_{u'} E_i & \text{if } u = f u', \\ \{(E_1, \dots, E_n)\} & \text{if } u = \varepsilon \ \&\ f = g, \\ \emptyset & \text{otherwise,} \end{cases} \]

where the first case is due to |E_j|_g = 0 for j ≠ i, and similarly for g⁻¹∂_w E. By the induction hypothesis, we immediately get |g⁻¹∂_u E| ≤ 1.

Assume g⁻¹∂_u E ≠ ∅ and g⁻¹∂_w E ≠ ∅. If f = g is the first letter of u = f u′, then ∅ ≠ g⁻¹∂_u E = g⁻¹∂_{u′} E_i = f⁻¹∂_{u′} E_i = ∅ since E is linear, a contradiction. Hence, either f ≠ g for the first letter f of u, or f = g and u is empty. Since the analogous statement holds for w, we obtain u = ε iff w = ε. Now the claim follows immediately from the induction hypothesis.

For E = (E₁ + E₂) the claims are immediate by Proposition 3.2 and the induction hypothesis.

Let E = (E₁ ·c E₂). If |E₁|_g > 0, then by Proposition 3.2 g⁻¹∂_u E = (g⁻¹∂_u E₁) ·c E₂ as well as g⁻¹∂_w E = (g⁻¹∂_w E₁) ·c E₂. Hence, |g⁻¹∂_u E| ≤ 1 by the induction hypothesis. If g⁻¹∂_u E and g⁻¹∂_w E are non-empty, so are the sets g⁻¹∂_u E₁ and g⁻¹∂_w E₁. Hence, by the induction hypothesis, the claim follows. Suppose now |E₁|_g = 0. Then g⁻¹∂_u E is a finite union of sets of the form g⁻¹∂_{u′} E₂ where every u′ is a suffix of u. The induction hypothesis implies that any two non-empty sets among them are equal, i.e., g⁻¹∂_u E = g⁻¹∂_{u′} E₂ for some u′. Similarly, g⁻¹∂_w E = g⁻¹∂_{w′} E₂ for some word w′. Now both claims follow from the induction hypothesis.

A similar argument can be applied in case E = (F^{*c}) with (g⁻¹∂_u F) ·c (F^{*c}) in place of g⁻¹∂_u E₁.


An immediate consequence of the last proposition is

Corollary 3.4. Let E be a linear regular expression, u, w ∈ Σ^∗_≥1 and g ∈ Σ_m with m ≥ 1. Then

(1) |∂_{ug} E| ≤ m,

(2) if ∂_{ug} E ≠ ∅ and ∂_{wg} E ≠ ∅, then ∂_{ug} E = ∂_{wg} E.

By Proposition 3.2 and Corollary 3.4 we conclude

Corollary 3.5. For a linear regular expression E and w ∈ Σ⁺_≥1 we have ∂_w(E^{*c}) = (∂_u E) ·c (E^{*c}) for some non-empty suffix u of w.

Next, we bound the number of partial derivatives of a linear regular expression.

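Proposition 3.6. Let E be a linear regular expression. Then we have |∂_{Σ⁺_≥1} E| ≤ |E|_Σ − 1 and |∂_{Σ^∗_≥1} E| ≤ |E|_Σ.

Proof. Note that ∂_{Σ^∗_≥1} E = ∂_{Σ⁺_≥1} E ∪ {E}. We apply induction on E. Recall that |∅|_Σ = 1. Then we have |∂_{Σ⁺_≥1} ∅| = 0 = |∅|_Σ − 1. For E = f(E₁, …, E_n), g ∈ Σ_≥1 and u ∈ Σ^∗_≥1, we get

\[ \partial_{gu} E = \begin{cases} \partial_u \{E_1, \dots, E_n\} & \text{if } g = f, \\ \emptyset & \text{if } g \neq f. \end{cases} \]

Hence,

\[ |\partial_{\Sigma^+_{\ge 1}} E| \le \sum_{i=1}^{n} |\partial_{\Sigma^*_{\ge 1}} E_i| \le \sum_{i=1}^{n} |E_i|_\Sigma = |E|_\Sigma - 1. \]

For E = (E₁ + E₂) we use Proposition 3.2 and the induction hypothesis and obtain the claim. If E = (E₁ ·c E₂), then again by Proposition 3.2:

\[ |\partial_{\Sigma^+_{\ge 1}} E| \le |\partial_{\Sigma^+_{\ge 1}} E_1| + |\partial_{\Sigma^+_{\ge 1}} E_2| \le |E_1|_\Sigma - 1 + |E_2|_\Sigma - 1 \le |E|_\Sigma - 1. \]

Finally, we conclude by Corollary 3.5

\[ |\partial_{\Sigma^+_{\ge 1}} (E^{*c})| \le |\partial_{\Sigma^+_{\ge 1}} E| \le |E|_\Sigma - 1 = |(E^{*c})|_\Sigma - 1. \]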

3.2. The projection construction

Recall that Theorem 2.6 provides a possibly infinite tree automaton A_E that accepts ⟦E⟧. Assuming E to be linear, we are now in a position to improve this result:

Corollary 3.7. Let E be a linear regular expression over the ranked alphabet Σ. Then A_E is a finite tree automaton with at most |E|_Σ states and at most |E|_Σ · |Σ| transitions that accepts ⟦E⟧.

Proof. The equality L(A_E) = ⟦E⟧ was shown in Theorem 2.6. Since the set of states of A_E equals ∂_{Σ^∗_≥1} E, the finite tree automaton has at most |E|_Σ states by Proposition 3.6. For f ∈ Σ_≥1 and D ∈ Q_E, there is at most one transition of the form (D, f, (G₁, …, G_m)) by Proposition 3.3(1), i.e., there are at most |E|_Σ · |Σ_≥1| transitions whose label belongs to Σ_≥1. In addition, there can be |Q_E × Σ₀| ≤ |E|_Σ · |Σ₀| transitions of the form (D, c) with c ∈ Σ₀.


Remark 3.8. A top-down FTA A = (Q, Σ, I, Δ) is deterministic if I is a singleton and (q, f, q₁, …, q_m), (q, f, p₁, …, p_m) ∈ Δ imply q_i = p_i for all i ∈ {1, …, m}. Due to Proposition 3.3(1), we have even proved that for a linear regular expression E the FTA A_E is a deterministic top-down automaton, which implies the number of transitions given in the corollary. For arbitrary regular expressions E, the FTA A_E is in general not deterministic.

Let Γ and Σ be two alphabets with Γ₀ ⊆ Σ₀. A mapping η: Γ → Σ with η(Γ_m) ⊆ Σ_m for every m ∈ ℕ and η(c) = c for all c ∈ Σ₀ is called a projection. We can extend η naturally to η: EXP(Γ) → EXP(Σ) by:

• η(∅) = ∅, η(f(E₁, …, E_m)) = η(f)(η(E₁), …, η(E_m)),

• η((E + F)) = (η(E) + η(F)), η((E ·c F)) = (η(E) ·c η(F)), and η((E^{*c})) = (η(E)^{*c}).

Definition 3.9. Let E and Ē be regular expressions over the ranked alphabets Σ and Γ, respectively, and let η: Γ → Σ be a projection. We say that Ē is a refinement of E with respect to the projection η if η(Ē) = E.

Ē is called a linearization of E with respect to η if Ē is linear and a refinement of E with respect to η.

Example 3.10. A linearization of the regular expression E from Example 2.3 is

\[ \overline{E} = f_1\bigl(g_2(h_3(a)), g_4(b)\bigr)^{*a} \cdot_b \bigl(h_5(a) + h_6(b)\bigr) \]

where Γ = {f₁, g₂, g₄, h₃, h₅, h₆, a, b} and η: Γ → Σ is given by η: f₁ ↦ f; g₂, g₄ ↦ g; h₃, h₅, h₆ ↦ h; a ↦ a; b ↦ b.
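One way to obtain such a linearization mechanically is to number the occurrences of non-nullary symbols from left to right and to record the projection η alongside. The sketch below is an illustration under assumptions of ours: expressions are encoded as nested tuples, marked symbols are formed by appending a running index to the name (which presumes that no such name is already in use), and the function names are hypothetical.

```python
# Assumed encoding: ("empty",), ("sym", f, args), ("plus", E, F),
# ("cat", c, E, F) for (E ·c F), and ("star", c, E) for (E^{*c}).

def linearize(e):
    """Return (linearized expression, eta) with eta mapping marked symbols back."""
    eta = {}
    counter = 0

    def walk(x):
        nonlocal counter
        tag = x[0]
        if tag == "empty":
            return x
        if tag == "sym":
            name, args = x[1], x[2]
            if not args:                  # constants are left unchanged
                eta[name] = name
                return x
            counter += 1                  # number non-nullary occurrences in preorder
            marked = f"{name}{counter}"
            eta[marked] = name
            return ("sym", marked, tuple(walk(a) for a in args))
        if tag == "plus":
            return ("plus", walk(x[1]), walk(x[2]))
        if tag == "cat":
            return ("cat", x[1], walk(x[2]), walk(x[3]))
        return ("star", x[1], walk(x[2]))

    return walk(e), eta


def project(e, eta):
    """Apply the extension of eta to expressions."""
    tag = e[0]
    if tag == "empty":
        return e
    if tag == "sym":
        return ("sym", eta[e[1]], tuple(project(a, eta) for a in e[2]))
    if tag == "plus":
        return ("plus", project(e[1], eta), project(e[2], eta))
    if tag == "cat":
        return ("cat", e[1], project(e[2], eta), project(e[3], eta))
    return ("star", e[1], project(e[2], eta))


a, b = ("sym", "a", ()), ("sym", "b", ())
E1 = ("star", "a", ("sym", "f", (("sym", "g", (("sym", "h", (a,)),)), ("sym", "g", (b,)))))
E = ("cat", "b", E1, ("plus", ("sym", "h", (a,)), ("sym", "h", (b,))))
lin, eta = linearize(E)
print(sorted(eta))                # ['a', 'b', 'f1', 'g2', 'g4', 'h3', 'h5', 'h6']
print(project(lin, eta) == E)     # True
```

With this numbering scheme the result coincides with the linearization shown in Example 3.10, and projecting it with η gives back E, as Definition 3.9 requires of a refinement.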

Note that due to η(c) = c for c ∈ Σ₀ both the constants from Σ₀ and the operations ·c and ^{*c} remain unchanged. By abuse of notation, we also denote the two natural extensions of η to Γ^∗ and to T_Γ by η. The following lemma is easily shown:

Lemma 3.11. Let E be a regular expression and Ē a refinement of E with respect to η: Γ → Σ. Then η(⟦Ē⟧) = ⟦E⟧.

Let E be an arbitrary regular expression. Then one can construct a small finite tree automaton accepting ⟦E⟧ as follows: firstly, construct some linearization Ē of E with respect to η: Γ → Σ (we can assume that every symbol from Γ appears in Ē and therefore |Γ| ≤ |Ē|_Σ = |E|_Σ). Secondly, build the finite tree automaton A_Ē, which then has at most |Ē|_Σ = |E|_Σ states and at most |Ē|_Σ · |Γ| ≤ |E|_Σ² transitions. Thirdly, replace the transitions (F̄, f̄, (Ḡ₁, …, Ḡ_m)) of this automaton by (F̄, η(f̄), (Ḡ₁, …, Ḡ_m)). Then, by Lemma 3.11, the following is immediate:

Corollary 3.12. Let E be a regular expression. Then the automaton obtained from A_Ē in this way is a finite tree automaton with at most |E|_Σ states and at most |E|_Σ² transitions that accepts ⟦E⟧.
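As a quick sanity check, the bounds can be evaluated for the running example. The following arithmetic only instantiates the statements above with the expression E of Example 2.3 and its linearization Ē from Example 3.10; the symbol counts are obtained by counting occurrences in the displayed expressions.

\[ |E|_\Sigma = |\overline{E}|_\Sigma = 10 \quad (\text{occurrences } f, g, h, a, g, b, h, a, h, b), \qquad |\Gamma| = 8, \]
\[ \#\text{states} \le |E|_\Sigma = 10, \qquad \#\text{transitions} \le |\overline{E}|_\Sigma \cdot |\Gamma| = 80 \le |E|_\Sigma^2 = 100. \]

For comparison, the automaton A_E computed directly in Example 2.7 has 8 states and 11 transitions, well within these bounds.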