DOI:10.1051/ita/2011107 www.rairo-ita.org
CONSTRUCTION OF TREE AUTOMATA FROM REGULAR EXPRESSIONS∗
Dietrich Kuske¹ and Ingmar Meinecke²
Abstract. Since recognizable tree languages are closed under the rational operations, every regular tree expression denotes a recognizable tree language. We provide an alternative proof of this fact that results in smaller tree automata. To this end, we transfer Antimirov's partial derivatives from regular word expressions to regular tree expressions. For an analysis of the size of the resulting automaton as well as for algorithmic improvements, we also transfer the methods of Champarnaud and Ziadi from words to trees.
Mathematics Subject Classification. 68Q45.
Introduction
One of the most prominent topics in formal language theory is the comparison of different finite descriptions for potentially infinite objects, the languages. The result of Kleene [14] states the equivalence between finite automata and regular expressions for languages of finite words. The transformation of a finite automaton into an equivalent regular expression is a prototypical example of dynamic programming. The converse transformation is of direct practical consequence, e.g. in text processing. For this reason, several methods have been proposed over the last decades to find more efficient algorithms; see [19,20] for surveys. For teaching purposes, one often uses an inductive construction. The most common construction is the standard or position automaton (Glushkov [10] and McNaughton and Yamada [17]). Brzozowski's construction [4] of a deterministic finite automaton uses derivatives of regular expressions. This approach was modified by Antimirov [2], who defined partial derivatives to construct a non-deterministic automaton from a regular expression.
Keywords and phrases. Trees, automata, regular expressions, partial derivatives.
∗ I. Meinecke was supported by the German Research Foundation (DFG) within the project DR 202/10-1.
¹ Fachgebiet Theoretische Informatik, Technische Universität Ilmenau, Postfach 100565, 98684 Ilmenau, Germany. dietrich.kuske@tu-ilmenau.de
² Institut für Informatik, Universität Leipzig, PF 100920, 04009 Leipzig, Germany. meinecke@informatik.uni-leipzig.de
Article published by EDP Sciences. © EDP Sciences 2011
Kleene's theorem was lifted to the setting of trees [22], also cf. [8,9], which are one of the most fundamental concepts in computer science. A regular tree expression defines a language of ordered trees. An inductive construction even produces a tree automaton accepting this language. The number of states of this automaton is exactly the number of iterations in the expression $E$ plus $|E|_\Sigma$, where $|E|_\Sigma$ is the number of occurrences of symbols from the ranked alphabet in $E$. In this paper, we define partial derivatives for regular tree expressions and use them to build a non-deterministic finite tree automaton recognizing the language denoted by the regular expression. The concept of partial derivatives will yield a tree automaton with at most $|E|_\Sigma$ states and $|E|_\Sigma^2$ transitions. The construction of this tree automaton and the correctness proof are combined with algorithmic considerations on how to build this automaton. We adapt and modify the approach of Champarnaud and Ziadi [5,6] for the word case, who extended work of Berry and Sethi [3]. Here, we use linearizations of regular tree expressions. The main idea is to distinguish occurrences of the same symbol at different positions in the regular expression. By doing so, we can ensure a certain uniqueness of the partial derivatives. As it turns out, the partial derivatives of the original regular expression are just projections of the partial derivatives of the linearized regular expression. This approach has two main advantages: firstly, the desired automaton is in fact a quotient of an automaton that stems from the linearized regular expression. This way we also get the upper bound on the number of transitions mentioned above. Secondly, the theoretical results allow for an efficient algorithm working in the syntax tree of $E$. We obtain an algorithm with $O(R\cdot\mathrm{size}(E)^2)$ space and time complexity, where $R$ is the maximal rank of a symbol occurring in the finite ranked alphabet $\Sigma$ and $\mathrm{size}(E)$ is the size of the regular expression.
Besides the standard and the partial derivative construction, there are other proposals in the literature for obtaining an automaton from a regular expression. In particular, it would be interesting to know whether the construction of the follow automaton [7,12,13] carries over to the setting of trees. In this paper we consider ranked trees. However, regular expressions were explored for unranked trees in connection with XML. They are used in pattern matching, see e.g. [11]. In an extended abstract [15] of this paper, we wondered whether the concept of partial derivatives can lead to fruitful results and algorithms in this area. A first answer was given by Suzuki and Okui [21], who successfully applied the concept of partial derivatives to regular hedge expression patterns. Last but not least, Lombardy and Sakarovitch [16] applied the method of partial derivatives to a weighted setting for words. We are confident that this should also be possible for trees.
1. Trees, automata, and regular expressions
Let $\mathbb{N}$ be the set of non-negative integers. Throughout this paper, we fix a finite ranked alphabet $\Sigma = (\Sigma_m)_{m\in\mathbb{N}}$. The set $T_\Sigma$ of trees over $\Sigma$ is defined by the Backus-Naur form (BNF)
$$t ::= f(\underbrace{t,\ldots,t}_{m\ \text{times}})$$
where $f\in\Sigma_m$. For the base case $c\in\Sigma_0$, we will write $c$ instead of $c()$. A subset $L\subseteq T_\Sigma$ is called a tree language.
A (top-down) tree automaton over $\Sigma$ is a tuple $\mathcal{A} = (Q,\Sigma,I,\Delta)$ where $Q$ is a set of states, $I\subseteq Q$ is the set of initial states, and $\Delta = (\Delta_m)_{m\in\mathbb{N}}$ is the set of transitions¹ such that $\Delta_m\subseteq Q\times\Sigma_m\times Q^m$ for every $m\in\mathbb{N}$. In particular, $\Delta_0\subseteq Q\times\Sigma_0$. A finite tree automaton (or FTA) is a tree automaton $\mathcal{A}$ with only finitely many states and, thus, only finitely many transitions (note that there are only finitely many $m$ with $\Sigma_m\neq\emptyset$).
Whether a tree $t$ is accepted by a tree automaton $\mathcal{A} = (Q,\Sigma,I,\Delta)$ is defined inductively along the construction of the tree $t$: if $t = c\in\Sigma_0$, then $t$ is accepted by $\mathcal{A}$ iff there exists a state $q\in I$ with $(q,c)\in\Delta_0$. For $f\in\Sigma_m$ with $m>0$, the tree $f(t_1,\ldots,t_m)$ is accepted by $\mathcal{A}$ iff there exist states $q\in I$ and $q_1,\ldots,q_m\in Q$ such that $(q,f,q_1,\ldots,q_m)\in\Delta_m$ and, for $1\le i\le m$, the tree $t_i$ is accepted by the tree automaton $(Q,\Sigma,\{q_i\},\Delta)$. The language $L(\mathcal{A})$ recognized by $\mathcal{A}$ is the set of all trees $t$ that are accepted by $\mathcal{A}$. A tree language $L$ is recognizable if there is an FTA $\mathcal{A}$ with $L(\mathcal{A}) = L$.
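The inductive definition of acceptance translates directly into a recursive procedure. The following Python sketch is an illustration only: the encoding of a tree as a nested tuple `(symbol, child, ...)`, the encoding of a transition $(q, f, q_1, \ldots, q_m)$ as a flat tuple, and the names `accepts`, `delta`, and `initial` are assumptions made here and do not appear in the paper.

```python
def accepts(delta, initial, tree):
    """Top-down acceptance: try every rule whose source state is initial."""
    symbol, children = tree[0], tree[1:]
    for rule in delta:
        q, f, targets = rule[0], rule[1], rule[2:]
        if q not in initial or f != symbol or len(targets) != len(children):
            continue
        # Each subtree t_i must be accepted when started in the single state q_i.
        if all(accepts(delta, {qi}, ti) for qi, ti in zip(targets, children)):
            return True
    return False


# Hypothetical example automaton: it accepts a, f(a, b), f(f(a, b), b), ...
delta = {("p", "f", "p", "q"), ("p", "a"), ("q", "b")}
initial = {"p"}

left_comb = ("f", ("f", ("a",), ("b",)), ("b",))
print(accepts(delta, initial, left_comb))
print(accepts(delta, initial, ("f", ("a",), ("a",))))
```

For a constant the list of target states is empty, so the rule $(q, c)$ alone decides acceptance, which matches the base case of the definition.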
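We next introduce some constructions of tree languages that extend the rational operations on word languages. Let $f\in\Sigma_m$ and $L_1,\ldots,L_m\subseteq T_\Sigma$. Then we put
$$f(L_1,\ldots,L_m) = \{f(t_1,\ldots,t_m)\mid t_i\in L_i\ \text{for}\ i=1,\ldots,m\}.$$
For $L\subseteq T_\Sigma$ and $c\in\Sigma_0$ we define for every $t\in T_\Sigma$ inductively the non-uniform substitution $t[c\leftarrow L]$:
• $c[c\leftarrow L] = L$ and $d[c\leftarrow L] = \{d\}$ for every $d\in\Sigma_0$ with $d\neq c$;
• $f(t_1,\ldots,t_m)[c\leftarrow L] = f(t_1[c\leftarrow L],\ldots,t_m[c\leftarrow L])$.
Then the $c$-product of $L_1,L_2\subseteq T_\Sigma$ is the language
$$L_1\cdot_c L_2 = \bigcup_{t\in L_1} t[c\leftarrow L_2].$$
Now the iterated $c$-products are defined for $L\subseteq T_\Sigma$ by $L^{0,c} = \{c\}$ and $L^{n+1,c} = L^{n,c}\cup L\cdot_c L^{n,c}$. The $c$-iteration of $L$ is defined as
$$L^{*c} = \bigcup_{n\ge 0} L^{n,c}.$$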
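For finite languages, the substitution, the $c$-product and the iterated $c$-products can be computed literally from these definitions. The Python sketch below does so under stated assumptions: trees are nested tuples `(symbol, child, ...)`, languages are finite Python sets, and the function names `substitute`, `c_product`, and `c_power` are chosen here for illustration.

```python
from itertools import product


def substitute(t, c, lang):
    """Return t[c <- lang] as a set; each occurrence of c is replaced independently."""
    if len(t) == 1:  # t is a constant
        return set(lang) if t[0] == c else {t}
    choices = [substitute(s, c, lang) for s in t[1:]]
    return {(t[0],) + combo for combo in product(*choices)}


def c_product(l1, c, l2):
    """L1 ._c L2: union of t[c <- L2] over all t in L1."""
    result = set()
    for t in l1:
        result |= substitute(t, c, l2)
    return result


def c_power(lang, c, n):
    """L^{n,c}, starting from L^{0,c} = {c}."""
    current = {(c,)}
    for _ in range(n):
        current = current | c_product(lang, c, current)
    return current


a, b, c = ("a",), ("b",), ("c",)
t = ("f", c, c)
print(sorted(substitute(t, "c", {a, b})))  # f(a,a), f(a,b), f(b,a), f(b,b)
print(len(c_power({t}, "c", 2)))           # 5
```

The first output shows why the substitution is called non-uniform: the two occurrences of $c$ in $f(c,c)$ may be replaced by different trees.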
¹ In the term-rewriting terminology employed by [8], the transition $(q,f,q_1,\ldots,q_m)$ is denoted by the rule $q(f(x_1,\ldots,x_m))\to f(q_1(x_1),\ldots,q_m(x_m))$.
It is well known that a tree language $L$ is recognizable if and only if it can be denoted by a regular expression. These regular expressions are defined by the following grammar in BNF²:
$$E ::= \emptyset \mid f(\underbrace{E,\ldots,E}_{m\ \text{times}}) \mid (E+E) \mid (E\cdot_c E) \mid (E^{*c})$$
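where $f\in\Sigma_m$ and $c\in\Sigma_0$. Again, we write $c$ instead of $c()$ whenever $c\in\Sigma_0$. The semantics $\llbracket E\rrbracket$ of a regular expression $E$ is defined inductively by
$$\llbracket\emptyset\rrbracket = \emptyset,\qquad \llbracket f(E_1,\ldots,E_m)\rrbracket = f(\llbracket E_1\rrbracket,\ldots,\llbracket E_m\rrbracket),$$
$$\llbracket(E+F)\rrbracket = \llbracket E\rrbracket\cup\llbracket F\rrbracket,\qquad \llbracket(E\cdot_c F)\rrbracket = \llbracket E\rrbracket\cdot_c\llbracket F\rrbracket,\qquad\text{and}\qquad \llbracket(E^{*c})\rrbracket = \llbracket E\rrbracket^{*c}.$$
For a set $M$ of regular expressions, we put $\llbracket M\rrbracket = \bigcup_{E\in M}\llbracket E\rrbracket$.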
The set of all regular expressions over the ranked alphabet $\Sigma$ is denoted by $\mathrm{EXP}(\Sigma)$. Let $|E|_f$ denote the number of occurrences of the letter $f\in\Sigma$ in $E$. The alphabetic width $|E|_\Sigma$ of $E$ is defined by
$$|E|_\Sigma = \max\Bigl(1,\ \sum_{f\in\Sigma}|E|_f\Bigr),$$
i.e., $|E|_\Sigma$ is the number of occurrences of symbols from $\Sigma$ in $E$ but, for technical reasons, at least 1. Thus, $|\emptyset|_\Sigma = 1$. The size of $E$ is defined inductively by: $\mathrm{size}(\emptyset) = \mathrm{size}(c) = 1$ for $c\in\Sigma_0$, $\mathrm{size}(f(E_1,\ldots,E_m)) = 1+\sum_{i=1}^m\mathrm{size}(E_i)$, $\mathrm{size}((E+F)) = \mathrm{size}((E\cdot_c F)) = 1+\mathrm{size}(E)+\mathrm{size}(F)$, and $\mathrm{size}((E^{*c})) = 1+\mathrm{size}(E)$. Every regular expression $E$ can be understood as a tree over the ranked alphabet $\Sigma\cup\{+,\cdot_c,{*c},\emptyset\mid c\in\Sigma_0\}$ where $+$ and $\cdot_c$ have rank 2, ${*c}$ has rank 1, and $\emptyset$ has rank 0. This tree is called the syntax tree $t_E$ of $E$.
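Both measures are simple recursions over the syntax tree. The following Python sketch illustrates them under assumptions that are not part of the paper: expressions are encoded as tagged tuples built by the hypothetical helpers `sym`, `plus`, `prod`, and `star`, and the index $c$ of an operator $\cdot_c$ or ${*c}$ is treated as part of the operator, so it is not counted as an occurrence of a symbol.

```python
EMPTY = ("empty",)

def sym(f, *args):      # f(E1, ..., Em); a constant is sym("a")
    return ("sym", f, tuple(args))

def plus(e, f):         # (E + F)
    return ("sum", e, f)

def prod(c, e, f):      # (E ._c F)
    return ("prod", c, e, f)

def star(c, e):         # (E^{*c})
    return ("star", c, e)

def occurrences(e):
    """Number of occurrences of alphabet symbols (operator indices not counted)."""
    tag = e[0]
    if tag == "empty":
        return 0
    if tag == "sym":
        return 1 + sum(occurrences(x) for x in e[2])
    if tag == "sum":
        return occurrences(e[1]) + occurrences(e[2])
    if tag == "prod":
        return occurrences(e[2]) + occurrences(e[3])
    return occurrences(e[2])  # star

def width(e):
    """Alphabetic width |E|_Sigma, which is at least 1 by definition."""
    return max(1, occurrences(e))

def size(e):
    tag = e[0]
    if tag == "empty":
        return 1
    if tag == "sym":
        return 1 + sum(size(x) for x in e[2])
    if tag == "sum":
        return 1 + size(e[1]) + size(e[2])
    if tag == "prod":
        return 1 + size(e[2]) + size(e[3])
    return 1 + size(e[2])  # star

# The expression f(g(h(a)), g(b))^{*a} ._b (h(a) + h(b)).
a, b = sym("a"), sym("b")
E1 = star("a", sym("f", sym("g", sym("h", a)), sym("g", b)))
E2 = plus(sym("h", a), sym("h", b))
E = prod("b", E1, E2)
print(width(E), size(E))  # 10 13
```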
2. A direct construction
In this section, we will construct from a regular expression $E$ a tree automaton $\mathcal{A}_E$ that accepts $\llbracket E\rrbracket$. The finiteness of this automaton will only be proved later. Our construction is based on partial derivatives that we have to define and investigate first.
Let $M$ be a set of regular expressions, $F$ some regular expression, and $c\in\Sigma_0$. Then $M\cdot_c F$ denotes the set $\{(E\cdot_c F)\mid E\in M\}$. Similarly, we put for a set $\mathcal{M}$ of $m$-tuples of regular expressions
$$\mathcal{M}\cdot_c F = \bigl\{\bigl((E_1\cdot_c F),(E_2\cdot_c F),\ldots,(E_m\cdot_c F)\bigr)\mid (E_1,E_2,\ldots,E_m)\in\mathcal{M}\bigr\}.$$
Definition 2.1. For $g\in\Sigma_m$, $m\ge 1$, and a regular expression $E$, we define the sets $g^{-1}E$ of $m$-tuples of regular expressions inductively:
• $g^{-1}\emptyset = \emptyset$,
• $g^{-1}f(E_1,E_2,\ldots,E_n) = \begin{cases}\{(E_1,E_2,\ldots,E_n)\} & \text{if } f=g,\\ \emptyset & \text{if } f\neq g,\end{cases}$
• $g^{-1}(E+F) = g^{-1}E\cup g^{-1}F$,
• $g^{-1}(E\cdot_c F) = \begin{cases}(g^{-1}E)\cdot_c F\cup g^{-1}F & \text{if } c\in\llbracket E\rrbracket,\\ (g^{-1}E)\cdot_c F & \text{otherwise,}\end{cases}$
• $g^{-1}(E^{*c}) = (g^{-1}E)\cdot_c(E^{*c})$.
² For constructing an equivalent regular expression for a given tree automaton, it is necessary to introduce additional symbols of rank zero, cf. [9], Proposition 9.2. But since we are interested only in the converse, i.e., the construction of automata from an expression, we do not have to differentiate between nullary symbols from the alphabet and additional ones.
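Let $\Sigma_{\ge 1} = \bigcup_{m\ge 1}\Sigma_m = \Sigma\setminus\Sigma_0$ denote the set of non-nullary symbols from the ranked alphabet $\Sigma$. Following Antimirov, we define further functions $\partial_w$ for finite words $w\in\Sigma_{\ge 1}^*$ over the alphabet $\Sigma_{\ge 1}$. By $\varepsilon$ we denote the empty word.
Definition 2.2. Let $E$ be a regular expression. Then $\partial_\varepsilon E = \{E\}$ and, for $w\in\Sigma_{\ge 1}^*$ and $g\in\Sigma_{\ge 1}$, the set $\partial_{wg}E$ consists of all regular expressions $F$ that appear in some tuple from $g^{-1}E'$ for some $E'\in\partial_w E$. For a set of words $W\subseteq\Sigma_{\ge 1}^*$ and a regular expression $E$, we put $\partial_W E = \bigcup_{w\in W}\partial_w E$.
The function $\partial_w$ is called the partial derivative with respect to $w$.
Note that $\partial_{wg}E = \partial_g\partial_w E = \bigcup_{E'\in\partial_w E}\partial_g E'$ for all $w\in\Sigma_{\ge 1}^*$ and $g\in\Sigma_{\ge 1}$. Further note that we consider derivatives with respect to words over the non-nullary symbols from $\Sigma$ and not with regard to trees.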
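Example 2.3. Let $\Sigma_0 = \{a,b\}$, $\Sigma_1 = \{g,h\}$, and $\Sigma_2 = \{f\}$. Consider the regular tree expression
$$E = \underbrace{f\bigl(g(h(a)),\,g(b)\bigr)^{*a}}_{E_1}\cdot_b\underbrace{\bigl(h(a)+h(b)\bigr)}_{E_2}.$$
Note that $b\notin\llbracket E_1\rrbracket$. Hence $\partial_g E = \partial_h E = \emptyset$ and
$$\partial_f E = \bigl\{\bigl(g(h(a))\cdot_a E_1\bigr)\cdot_b E_2,\ \bigl(g(b)\cdot_a E_1\bigr)\cdot_b E_2\bigr\}.$$
Furthermore, we compute the following partial derivatives:
$$\partial_{fg}E = \bigl\{\bigl(h(a)\cdot_a E_1\bigr)\cdot_b E_2,\ \bigl(b\cdot_a E_1\bigr)\cdot_b E_2\bigr\},$$
$$\partial_{fgh}E = \bigl\{\bigl(a\cdot_a E_1\bigr)\cdot_b E_2,\ a,\ b\bigr\};$$
where the last equality is due to $\partial_h\bigl((b\cdot_a E_1)\cdot_b E_2\bigr) = \partial_h E_2 = \{a,b\}$. Notice that $\partial_{fghf}E = \partial_f E$. Hence, together with $\partial_\varepsilon E = \{E\}$ we obtain eight different partial derivatives of $E$.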
A symbol $f\in\Sigma$ occurs unguarded in $E$ if no ancestor in the syntax tree $t_E$ is labeled by an element of $\Sigma$. We will be interested in the number $\|E\|_f$ of unguarded occurrences of $f$ in $E$ that can be defined inductively by:
• $\|\emptyset\|_f = 0$,
• $\|f(E_1,\ldots,E_m)\|_f = 1$ and $\|g(E_1,\ldots,E_n)\|_f = 0$ for $g\neq f$,
• $\|(E_1+E_2)\|_f = \|(E_1\cdot_c E_2)\|_f = \|E_1\|_f+\|E_2\|_f$, and
• $\|(E^{*c})\|_f = \|E\|_f$.
Proposition 2.4. Let $E$ be a regular expression and $g\in\Sigma_{\ge 1}$. Then $|g^{-1}E|\le\|E\|_g$. In particular, if $|E|_g = 0$ then $g^{-1}E = \partial_g E = \emptyset$.
Proof. The claim is shown by induction on the construction of $E$: for $E = \emptyset$, the claim is trivial. If we have $E = f(E_1,\ldots,E_m)$ and $f\neq g$, then $|g^{-1}E| = 0$, so the claim is also trivial. If $f = g$, then $|g^{-1}E| = 1 = \|E\|_g$.
Next consider the case $(E+F)$: then
$$|g^{-1}(E+F)|\le|g^{-1}E|+|g^{-1}F|\le\|E\|_g+\|F\|_g = \|(E+F)\|_g.$$
For the product, we have
$$|g^{-1}(E\cdot_c F)|\le|(g^{-1}E)\cdot_c F|+|g^{-1}F|\le|g^{-1}E|+|g^{-1}F|\le\|E\|_g+\|F\|_g = \|(E\cdot_c F)\|_g.$$
Similarly, we obtain for the iteration
$$|g^{-1}(E^{*c})| = |(g^{-1}E)\cdot_c(E^{*c})| = |g^{-1}E|\le\|E\|_g = \|(E^{*c})\|_g.$$
Next, we express the semantics of a regular expression $E$ in terms of the semantics of the tuples from $g^{-1}E$.
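Proposition 2.5. For any regular expression $E$, we have
$$\llbracket E\rrbracket = \bigcup\bigl\{g(\llbracket G_1\rrbracket,\ldots,\llbracket G_m\rrbracket)\mid g\in\Sigma_{\ge 1},\ (G_1,\ldots,G_m)\in g^{-1}E\bigr\}\ \cup\ \{c\in\Sigma_0\mid c\in\llbracket E\rrbracket\}.\qquad(1)$$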
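Proof. Let $\llbracket E\rrbracket_0 = \llbracket E\rrbracket\cap\Sigma_0$. The proof proceeds by induction on the construction of the regular expression $E$. For $E = \emptyset$, equation (1) holds. Next let $E = f(E_1,\ldots,E_m)$ where $f\in\Sigma_m$ and $E_1,\ldots,E_m$ are regular expressions. We put $\vec{G} = (G_1,\ldots,G_m)$. Then
$$\begin{aligned}\llbracket E\rrbracket &= f(\llbracket E_1\rrbracket,\ldots,\llbracket E_m\rrbracket)\\ &= \bigcup\{f(\llbracket G_1\rrbracket,\ldots,\llbracket G_m\rrbracket)\mid\vec{G}\in f^{-1}E\} && (\text{since } f^{-1}E = \{(E_1,\ldots,E_m)\})\\ &= \bigcup\{g(\llbracket G_1\rrbracket,\ldots,\llbracket G_m\rrbracket)\mid g\in\Sigma,\ \vec{G}\in g^{-1}E\} && (\text{since } g^{-1}E = \emptyset \text{ for } f\neq g).\end{aligned}$$
The proof in case $E = (E_1+E_2)$ is immediate and therefore omitted. Next let $E = (E_1\cdot_c E_2)$. Then we have
$$\llbracket E\rrbracket = \llbracket E_1\rrbracket\cdot_c\llbracket E_2\rrbracket = \bigl((\llbracket E_1\rrbracket\setminus\{c\})\cdot_c\llbracket E_2\rrbracket\bigr)\cup\bigl((\llbracket E_1\rrbracket\cap\{c\})\cdot_c\llbracket E_2\rrbracket\bigr).$$
By the induction hypothesis, the first of these two sets equals
$$\begin{aligned}&\bigcup\bigl\{f(\llbracket G_1\rrbracket,\ldots,\llbracket G_m\rrbracket)\mid f\in\Sigma,\ \vec{G}\in f^{-1}E_1\bigr\}\cdot_c\llbracket E_2\rrbracket\ \cup\ (\llbracket E_1\rrbracket_0\setminus\{c\})\\ &= \bigcup\bigl\{f(\llbracket(G_1\cdot_c E_2)\rrbracket,\ldots,\llbracket(G_m\cdot_c E_2)\rrbracket)\mid f\in\Sigma,\ \vec{G}\in f^{-1}E_1\bigr\}\ \cup\ (\llbracket E_1\rrbracket_0\setminus\{c\})\\ &= \bigcup\bigl\{f(\llbracket H_1\rrbracket,\ldots,\llbracket H_m\rrbracket)\mid f\in\Sigma,\ \vec{H}\in(f^{-1}E_1)\cdot_c E_2\bigr\}\ \cup\ (\llbracket E_1\rrbracket_0\setminus\{c\}).\end{aligned}$$
If $c\in\llbracket E_1\rrbracket$, then the second of the two sets above equals
$$\llbracket E_2\rrbracket = \bigcup\bigl\{f(\llbracket H_1\rrbracket,\ldots,\llbracket H_m\rrbracket)\mid f\in\Sigma,\ \vec{H}\in f^{-1}E_2\bigr\}\cup\llbracket E_2\rrbracket_0,$$
otherwise it is empty. Hence equation (1) holds for $E = (E_1\cdot_c E_2)$.
Finally, consider the regular expression $(E^{*c})$. If $f(t_1,\ldots,t_m)\in\llbracket(E^{*c})\rrbracket = \llbracket E\rrbracket^{*c}$, then there exists $n\ge 0$ with $f(t_1,\ldots,t_m)\in\llbracket E\rrbracket^{n+1,c}\setminus\llbracket E\rrbracket^{n,c}$. Hence there exists $s\in\llbracket E\rrbracket$ with $f(t_1,\ldots,t_m)\in\{s\}\cdot_c\llbracket E\rrbracket^{n,c}$. Since $f(t_1,\ldots,t_m)\notin\llbracket E\rrbracket^{n,c}$, the tree $s$ is of the form $s = f(s_1,\ldots,s_m)$. By the induction hypothesis, we find $(G_1,\ldots,G_m)\in f^{-1}E$ with $s_i\in\llbracket G_i\rrbracket$. Hence we obtain
$$\begin{aligned}f(t_1,\ldots,t_m)&\in\{s\}\cdot_c\llbracket(E^{*c})\rrbracket\\ &\subseteq f(\llbracket G_1\rrbracket,\ldots,\llbracket G_m\rrbracket)\cdot_c\llbracket(E^{*c})\rrbracket\\ &= f(\llbracket(G_1\cdot_c(E^{*c}))\rrbracket,\ldots,\llbracket(G_m\cdot_c(E^{*c}))\rrbracket).\end{aligned}$$
Since the tuple $\bigl((G_i\cdot_c(E^{*c}))\bigr)_{1\le i\le m} = (G_i)_{1\le i\le m}\cdot_c(E^{*c})$ belongs to $f^{-1}(E^{*c})$, we showed the containment "$\subseteq$" of equation (1).
Conversely let $f\in\Sigma$ and $\vec{H}\in f^{-1}(E^{*c}) = (f^{-1}E)\cdot_c(E^{*c})$. Then there exists a tuple of regular expressions $\vec{G}\in f^{-1}E$ with $\vec{H} = \vec{G}\cdot_c(E^{*c})$. Hence we get
$$\begin{aligned}f(\llbracket H_1\rrbracket,\ldots,\llbracket H_m\rrbracket) &= f(\llbracket(G_1\cdot_c(E^{*c}))\rrbracket,\ldots,\llbracket(G_m\cdot_c(E^{*c}))\rrbracket)\\ &= f(\llbracket G_1\rrbracket,\ldots,\llbracket G_m\rrbracket)\cdot_c\llbracket(E^{*c})\rrbracket.\end{aligned}$$
By the induction hypothesis, $f(\llbracket G_1\rrbracket,\ldots,\llbracket G_m\rrbracket)\subseteq\llbracket E\rrbracket$, so we can continue
$$\subseteq\llbracket E\rrbracket\cdot_c\llbracket(E^{*c})\rrbracket\subseteq\llbracket(E^{*c})\rrbracket.$$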
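Let $E$ be a regular expression and let $Q_E = \partial_{\Sigma_{\ge 1}^*}E$. Then we define a set of transitions $\Delta_E$ as
$$\bigl\{(F,f,G_1,G_2,\ldots,G_m)\mid F\in Q_E,\ f\in\Sigma_m,\ m\ge 1,\ (G_1,\ldots,G_m)\in f^{-1}F\bigr\}\ \cup\ \bigl\{(F,c)\mid F\in Q_E,\ c\in\Sigma_0,\ c\in\llbracket F\rrbracket\bigr\}.$$
Furthermore, let $\mathcal{A}_E = (Q_E,\Sigma,\{E\},\Delta_E)$ denote the tree automaton whose only initial state is the regular expression $E$.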
Theorem 2.6. Let $E$ be a regular expression over the ranked alphabet $\Sigma$. Then $\mathcal{A}_E$ is a tree automaton that accepts $\llbracket E\rrbracket$.
Proof. We show by induction on the structure of trees that for any tree $t\in T_\Sigma$ and any regular expression $F$, the tree automaton $\mathcal{A}_F$ accepts $t$ iff $t\in\llbracket F\rrbracket$.
First let $t = c\in\Sigma_0$. Now $c$ is accepted by $\mathcal{A}_F$ iff there is a transition $(F,c)\in\Delta_F$. But this is the case iff $c\in\llbracket F\rrbracket$. Now let $t = f(s_1,\ldots,s_m)$ for some $m>0$. Then $t$ is accepted by $\mathcal{A}_F$ iff there is a transition $(F,f,(G_1,\ldots,G_m))\in\Delta_F$ such that $s_i$ is accepted by the tree automaton $(Q_F,\Sigma,\{G_i\},\Delta_F)$ for all $1\le i\le m$.
Note that the reachable part of the automaton $(Q_F,\Sigma,\{G_i\},\Delta_F)$ is the set of states $Q_{G_i}$. Hence, $s_i$ is accepted by this automaton iff it is accepted by $\mathcal{A}_{G_i}$. By the induction hypothesis, this is equivalent to saying $s_i\in\llbracket G_i\rrbracket$. Since this holds for all $1\le i\le m$, we have that $t$ is accepted by $\mathcal{A}_F$ iff there exists $(G_1,\ldots,G_m)\in f^{-1}(F)$ with $s_i\in\llbracket G_i\rrbracket$, which is, by Proposition 2.5, equivalent to saying $t\in\llbracket F\rrbracket$.
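Example 2.7. Consider the regular expression
$$E = \underbrace{f\bigl(g(h(a)),\,g(b)\bigr)^{*a}}_{E_1}\cdot_b\underbrace{\bigl(h(a)+h(b)\bigr)}_{E_2}$$
from Example 2.3. There we computed the partial derivatives of $E$. Thus $Q_E = \{q_i\mid i = 0,\ldots,7\}$ where
$$\begin{aligned}q_0 &= E, & q_1 &= \bigl(g(h(a))\cdot_a E_1\bigr)\cdot_b E_2, & q_2 &= \bigl(g(b)\cdot_a E_1\bigr)\cdot_b E_2,\\ q_3 &= \bigl(h(a)\cdot_a E_1\bigr)\cdot_b E_2, & q_4 &= \bigl(b\cdot_a E_1\bigr)\cdot_b E_2, & q_5 &= \bigl(a\cdot_a E_1\bigr)\cdot_b E_2,\\ q_6 &= a, & q_7 &= b.\end{aligned}$$
The set of transitions $\Delta_E$ comprises
$$q_0\xrightarrow{f}(q_1,q_2),\quad q_1\xrightarrow{g}q_3,\quad q_2\xrightarrow{g}q_4,\quad q_3\xrightarrow{h}q_5,\quad q_4\xrightarrow{h}q_6,\quad q_4\xrightarrow{h}q_7,\quad q_5\xrightarrow{f}(q_1,q_2),$$
$$q_0\xrightarrow{a}\bot,\quad q_5\xrightarrow{a}\bot,\quad q_6\xrightarrow{a}\bot,\quad q_7\xrightarrow{b}\bot.$$
Here, $q_0\xrightarrow{f}(q_1,q_2)$ and $q_0\xrightarrow{a}\bot$ mean that $(q_0,f,q_1,q_2),(q_0,a)\in\Delta_E$.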
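The direct construction can be carried out mechanically. The Python sketch below is one possible reading of Definition 2.1 and of the definition of $\mathcal{A}_E$, written under assumptions that are not part of the paper: expressions are tagged tuples compared syntactically, the names `has_const`, `inverse`, and `build` are hypothetical, and the test $c\in\llbracket E\rrbracket$ is decided by a separate recursion (`has_const`) derived here from the definitions of the $c$-product and the $c$-iteration. The search for new states terminates only because the set of partial derivatives is finite, which the paper establishes later.

```python
def sym(f, *args): return ("sym", f, tuple(args))
def plus(e, f):    return ("sum", e, f)
def prod(c, e, f): return ("prod", c, e, f)
def star(c, e):    return ("star", c, e)

def has_const(e, c):
    """Decide whether the constant c belongs to the language of e."""
    tag = e[0]
    if tag == "empty":
        return False
    if tag == "sym":
        return e[1] == c and not e[2]
    if tag == "sum":
        return has_const(e[1], c) or has_const(e[2], c)
    if tag == "prod":
        d, left, right = e[1], e[2], e[3]
        # c survives from the left factor, or d is replaced by c from the right.
        return (c != d and has_const(left, c)) or \
               (has_const(left, d) and has_const(right, c))
    d, body = e[1], e[2]  # star: L^{0,d} = {d}; other constants come from body
    return c == d or has_const(body, c)

def inverse(g, e):
    """The set g^{-1} e of tuples of expressions (g non-nullary)."""
    tag = e[0]
    if tag == "empty":
        return set()
    if tag == "sym":
        return {e[2]} if e[1] == g and e[2] else set()
    if tag == "sum":
        return inverse(g, e[1]) | inverse(g, e[2])
    if tag == "prod":
        c, left, right = e[1], e[2], e[3]
        out = {tuple(prod(c, x, right) for x in tup) for tup in inverse(g, left)}
        if has_const(left, c):
            out |= inverse(g, right)
        return out
    c, body = e[1], e[2]  # star
    return {tuple(prod(c, x, e) for x in tup) for tup in inverse(g, body)}

def build(e, ranks):
    """States and transitions of A_E; ranks maps each symbol to its rank."""
    states, todo, delta = {e}, [e], set()
    while todo:
        q = todo.pop()
        for f, m in ranks.items():
            if m == 0:
                if has_const(q, f):
                    delta.add((q, f))
                continue
            for tup in inverse(f, q):
                delta.add((q, f) + tup)
                for s in tup:
                    if s not in states:
                        states.add(s)
                        todo.append(s)
    return states, delta

a, b = sym("a"), sym("b")
E1 = star("a", sym("f", sym("g", sym("h", a)), sym("g", b)))
E = prod("b", E1, plus(sym("h", a), sym("h", b)))
states, delta = build(E, {"a": 0, "b": 0, "g": 1, "h": 1, "f": 2})
print(len(states), len(delta))  # 8 11
```

On the expression of Example 2.3 the sketch finds eight states and eleven transitions, in agreement with the lists above.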
In the last example, the tree automaton resulting from our construction is finite. But so far, we did not prove in general that the tree automaton $\mathcal{A}_E$ has only finitely many states, i.e., that $\llbracket E\rrbracket$ is recognizable. Theorem 3.16 will show that the number of states is linear and that the number of transitions is quadratic in the size of $E$. This will only be achieved after going through the following two constructions.
3. An indirect construction via linearizations
The idea of the indirect construction is as follows: in a regular expression $E$, uniquely mark the occurrences of letters from $\Sigma_{\ge 1}$. Then apply our direct construction to the resulting regular expression $\overline{E}$. The projection of this automaton accepts $\llbracket E\rrbracket$. As it turns out, a quotient of the automaton one obtains this way is isomorphic to the result of the direct construction.
3.1. Linear regular expressions
A regular expression $E$ is linear if every letter $f\in\Sigma_{\ge 1}$ occurs at most once in $E$. Note that $c\in\Sigma_0$ may occur more than once. The following proposition is a consequence of Proposition 2.4.
Proposition 3.1. Let $E$ be a linear regular expression and $g\in\Sigma_m$ for $m\ge 1$. Then $|g^{-1}E|\le 1$ and therefore $|\partial_g E|\le m$.
For $M\subseteq\mathrm{EXP}(\Sigma)$ and $g\in\Sigma_{\ge 1}$, we put $g^{-1}M = \bigcup\{g^{-1}E\mid E\in M\}$. Now we consider partial derivatives with respect to non-empty words for linear regular expressions.
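Proposition 3.2. Let $E,F$ be linear regular expressions over the alphabet $\Sigma$ such that also $(E+F)$ and $(E\cdot_c F)$ are linear. Let $w\in\Sigma_{\ge 1}^*$ and $g\in\Sigma_{\ge 1}$. Then the following hold true:
• $g^{-1}\partial_w(E+F) = \begin{cases}g^{-1}\partial_w E & \text{if } |E|_g>0,\\ g^{-1}\partial_w F & \text{otherwise.}\end{cases}$
• $g^{-1}\partial_w(E\cdot_c F) = \begin{cases}(g^{-1}\partial_w E)\cdot_c F & \text{if } |E|_g>0,\\ \bigcup\{g^{-1}\partial_v F\mid\exists u\in\Sigma_{\ge 1}^*: w = uv \text{ and } c\in\llbracket\partial_u E\rrbracket\} & \text{otherwise.}\end{cases}$
• There are suffixes $v_1,\ldots,v_k$ of $w$ such that
$$g^{-1}\partial_w(E^{*c}) = \bigcup_{1\le i\le k}(g^{-1}\partial_{v_i}E)\cdot_c(E^{*c}).$$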
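Proof. If $|E|_g = 0$, then $|\partial_w E|_g = 0$, implying $g^{-1}\partial_w E = \emptyset$ by Proposition 2.4. Now $g^{-1}\partial_w(E+F) = g^{-1}\partial_w E\cup g^{-1}\partial_w F$. If $|E|_g>0$, then $g^{-1}\partial_w F = \emptyset$ and $g^{-1}\partial_w(E+F) = g^{-1}\partial_w E$. Otherwise we get $g^{-1}\partial_w E = \emptyset$ and $g^{-1}\partial_w(E+F) = g^{-1}\partial_w F$.
For the remaining claims we proceed by induction on $|w|$. The claims are obvious for $|w| = 0$. From now on let $w = w'f$ for some $w'\in\Sigma_{\ge 1}^*$ and $f\in\Sigma_{\ge 1}$. First, we consider $g^{-1}\partial_w(E\cdot_c F)$. By the induction hypothesis, we obtain
$$f^{-1}\partial_{w'}(E\cdot_c F) = \begin{cases}(f^{-1}\partial_{w'}E)\cdot_c F & \text{if } |E|_f>0,\\ \bigcup\{f^{-1}\partial_{v'}F\mid\exists u\in\Sigma_{\ge 1}^*: w' = uv' \ \&\ c\in\llbracket\partial_u E\rrbracket\} & \text{otherwise.}\end{cases}$$
First, consider the case $|E|_f>0$ and $|E|_g>0$. Then $|F|_f = |F|_g = 0$ and therefore $g^{-1}F = \emptyset$. Hence we have $g^{-1}\partial_w(E\cdot_c F) = (g^{-1}\partial_w E)\cdot_c F$.
Next, let $|E|_f>|E|_g = 0$. Then (i) $g^{-1}\partial_w E = \emptyset$ and (ii) $\partial_v F = \emptyset$ for all non-empty suffixes $v$ of $w$ (since $f$ occurs in $v$ but not in $F$). Since, by the induction hypothesis, $f^{-1}\partial_{w'}(E\cdot_c F) = (f^{-1}\partial_{w'}E)\cdot_c F$, we get $\partial_w(E\cdot_c F) = \partial_f\partial_{w'}(E\cdot_c F) = (\partial_f\partial_{w'}E)\cdot_c F = (\partial_w E)\cdot_c F$ and therefore
$$\begin{aligned}g^{-1}\partial_w(E\cdot_c F) &= g^{-1}((\partial_w E)\cdot_c F)\\ &= \begin{cases}(g^{-1}\partial_w E)\cdot_c F & \text{if } c\notin\llbracket\partial_w E\rrbracket\\ (g^{-1}\partial_w E)\cdot_c F\cup g^{-1}F & \text{otherwise}\end{cases}\\ &\overset{\text{(i)}}{=} \begin{cases}\emptyset & \text{if } c\notin\llbracket\partial_w E\rrbracket\\ g^{-1}F & \text{otherwise}\end{cases}\\ &\overset{\text{(ii)}}{=} \bigcup\{g^{-1}\partial_v F\mid\exists u\in\Sigma_{\ge 1}^*: w = uv \ \&\ c\in\llbracket\partial_u E\rrbracket\}\end{aligned}$$
as required.
Now assume $|E|_f = 0$. Then the induction hypothesis implies $f^{-1}\partial_{w'}(E\cdot_c F) = \bigcup\{f^{-1}\partial_{v'}F\mid\exists u\in\Sigma_{\ge 1}^*: w' = uv',\ c\in\llbracket\partial_u E\rrbracket\}$. Hence we obtain
$$\begin{aligned}\partial_w(E\cdot_c F) &= \partial_f\partial_{w'}(E\cdot_c F)\\ &= \bigcup\{\partial_f\partial_{v'}F\mid\exists u\in\Sigma_{\ge 1}^*: w' = uv',\ c\in\llbracket\partial_u E\rrbracket\}\\ &= \bigcup\{\partial_v F\mid\exists u\in\Sigma_{\ge 1}^*: w = uv,\ v\neq\varepsilon,\ c\in\llbracket\partial_u E\rrbracket\}\\ &= \bigcup\{\partial_v F\mid\exists u\in\Sigma_{\ge 1}^*: w = uv,\ c\in\llbracket\partial_u E\rrbracket\}\end{aligned}$$
where the last equality holds since $\partial_w E = \emptyset$. Applying $g^{-1}$ to this equation yields
$$g^{-1}\partial_w(E\cdot_c F) = \bigcup\{g^{-1}\partial_v F\mid\exists u: w = uv,\ c\in\llbracket\partial_u E\rrbracket\}.$$
If $|E|_g = 0$, this is precisely what we wanted to show. Otherwise, we obtain $|F|_g = 0$ and therefore $g^{-1}\partial_v F = \emptyset$ for all $v\in\Sigma_{\ge 1}^*$. Hence, in this case the last expression equals $\emptyset$. Since also $g^{-1}\partial_w E = \emptyset$ (due to the non-occurrence of $f$ in $E$), this equals $(g^{-1}\partial_w E)\cdot_c F$ as required. This shows the claim for $(E\cdot_c F)$.
Now consider the regular expression $(E^{*c})$. By the induction hypothesis, there are suffixes $v'_1,\ldots,v'_k$ of $w'$ such that
$$f^{-1}\partial_{w'}(E^{*c}) = \bigcup_{1\le i\le k}(f^{-1}\partial_{v'_i}E)\cdot_c(E^{*c})\quad\text{and}\quad\partial_w(E^{*c}) = \bigcup_{1\le i\le k}(\partial_{v'_i f}E)\cdot_c(E^{*c}).$$
Hence, for $v_i = v'_i f$ ($1\le i\le k$) we have
$$\begin{aligned}g^{-1}\partial_w(E^{*c}) &= g^{-1}\Bigl(\bigcup_{1\le i\le k}(\partial_{v_i}E)\cdot_c(E^{*c})\Bigr)\\ &= \begin{cases}\bigcup_{1\le i\le k}(g^{-1}\partial_{v_i}E)\cdot_c(E^{*c}) & \text{if } c\notin\llbracket\partial_{v_i}E\rrbracket \text{ for all } 1\le i\le k,\\ \bigcup_{1\le i\le k}(g^{-1}\partial_{v_i}E)\cdot_c(E^{*c})\cup g^{-1}(E^{*c}) & \text{otherwise.}\end{cases}\end{aligned}$$
Since $g^{-1}(E^{*c}) = (g^{-1}E)\cdot_c(E^{*c}) = (g^{-1}\partial_\varepsilon E)\cdot_c(E^{*c})$, the set of tuples of regular expressions $g^{-1}\partial_w(E^{*c})$ is of the required form.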
Proposition 3.3. Let $E$ be a linear regular expression, $u,w\in\Sigma_{\ge 1}^*$, and $g\in\Sigma_{\ge 1}$. Then we have:
(1) $|g^{-1}\partial_u E|\le 1$,
(2) if $g^{-1}\partial_u E\neq\emptyset$ and $g^{-1}\partial_w E\neq\emptyset$, then $g^{-1}\partial_u E = g^{-1}\partial_w E$.
Proof. The proof is by induction on the structure of $E$.
For $E = \emptyset$ the claim is immediate. Now consider the case $E = f(E_1,\ldots,E_n)$. Since $E$ is linear, there is at most one $i$ with $|E_i|_g>0$; if no such $i$ exists, set $i = 1$. Then we have
$$g^{-1}\partial_u E = \begin{cases}g^{-1}\partial_{u'}\{E_1,\ldots,E_n\} = g^{-1}\partial_{u'}E_i & \text{if } u = fu',\\ \{(E_1,\ldots,E_n)\} & \text{if } u = \varepsilon \ \&\ f = g,\\ \emptyset & \text{otherwise,}\end{cases}$$
where the first case is due to $|E_j|_g = 0$ for $j\neq i$, and similarly for $g^{-1}\partial_w E$. By the induction hypothesis, we get immediately $|g^{-1}\partial_u E|\le 1$.
Assume $g^{-1}\partial_u E\neq\emptyset$ and $g^{-1}\partial_w E\neq\emptyset$. If $f = g$ is the first letter of $u = fu'$, then $\emptyset\neq g^{-1}\partial_u E = g^{-1}\partial_{u'}E_i = f^{-1}\partial_{u'}E_i = \emptyset$ since $E$ is linear, a contradiction. Hence, either $f\neq g$ for the first letter $f$ of $u$, or $f = g$ and $u$ is empty. Since the analogous statement holds for $w$, we obtain $u = \varepsilon$ iff $w = \varepsilon$. Now the claim follows immediately from the induction hypothesis.
For $E = (E_1+E_2)$ the claims are immediate by Proposition 3.2 and the induction hypothesis.
Let $E = (E_1\cdot_c E_2)$. If $|E_1|_g>0$, then by Proposition 3.2 $g^{-1}\partial_u E = (g^{-1}\partial_u E_1)\cdot_c E_2$ as well as $g^{-1}\partial_w E = (g^{-1}\partial_w E_1)\cdot_c E_2$. Hence, $|g^{-1}\partial_u E|\le 1$ by the induction hypothesis. If $g^{-1}\partial_u E$ and $g^{-1}\partial_w E$ are non-empty, so are the sets $g^{-1}\partial_u E_1$ and $g^{-1}\partial_w E_1$. Hence, by the induction hypothesis, the claim follows. Suppose now $|E_1|_g = 0$. Then $g^{-1}\partial_u E$ is a finite union of sets of the form $g^{-1}\partial_{u'}E_2$ where every $u'$ is a suffix of $u$. The induction hypothesis implies that any two non-empty sets among them are equal, i.e., $g^{-1}\partial_u E = g^{-1}\partial_{u'}E_2$ for some $u'$. Similarly, $g^{-1}\partial_w E = g^{-1}\partial_{w'}E_2$ for some word $w'$. Now both claims follow from the induction hypothesis.
A similar argument can be applied in case $E = (F^{*c})$ with $(g^{-1}\partial_u F)\cdot_c(F^{*c})$ in place of $g^{-1}\partial_u E_1$.
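An immediate consequence of the last proposition is
Corollary 3.4. Let $E$ be a linear regular expression, $u,w\in\Sigma_{\ge 1}^*$ and $g\in\Sigma_m$ with $m\ge 1$. Then
(1) $|\partial_{ug}E|\le m$,
(2) if $\partial_{ug}E\neq\emptyset$ and $\partial_{wg}E\neq\emptyset$, then $\partial_{ug}E = \partial_{wg}E$.
By Proposition 3.2 and Corollary 3.4 we conclude
Corollary 3.5. For a linear regular expression $E$ and $w\in\Sigma_{\ge 1}^+$ we have $\partial_w(E^{*c}) = (\partial_u E)\cdot_c(E^{*c})$ for some non-empty suffix $u$ of $w$.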
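Next, we bound the number of partial derivatives of a linear regular expression.
Proposition 3.6. Let $E$ be a linear regular expression. Then we have $|\partial_{\Sigma_{\ge 1}^+}E|\le|E|_\Sigma-1$ and $|\partial_{\Sigma_{\ge 1}^*}E|\le|E|_\Sigma$.
Proof. Note that $\partial_{\Sigma_{\ge 1}^*}E = \partial_{\Sigma_{\ge 1}^+}E\cup\{E\}$. We apply induction on $E$. Recall that $|\emptyset|_\Sigma = 1$. Then we have $|\partial_{\Sigma_{\ge 1}^+}\emptyset| = 0 = |\emptyset|_\Sigma-1$. For $E = f(E_1,\ldots,E_n)$, $g\in\Sigma_{\ge 1}$ and $u\in\Sigma_{\ge 1}^*$, we get
$$\partial_{gu}E = \begin{cases}\partial_u\{E_1,\ldots,E_n\} & \text{if } g = f,\\ \emptyset & \text{if } g\neq f.\end{cases}$$
Hence, $|\partial_{\Sigma_{\ge 1}^+}E|\le\sum_{i=1}^n|\partial_{\Sigma_{\ge 1}^*}E_i|\le\sum_{i=1}^n|E_i|_\Sigma = |E|_\Sigma-1$. For $E = (E_1+E_2)$ we use Proposition 3.2 and the induction hypothesis and obtain the claim. If $E = (E_1\cdot_c E_2)$, then again by Proposition 3.2:
$$|\partial_{\Sigma_{\ge 1}^+}E|\le|\partial_{\Sigma_{\ge 1}^+}E_1|+|\partial_{\Sigma_{\ge 1}^+}E_2|\le|E_1|_\Sigma-1+|E_2|_\Sigma-1\le|E|_\Sigma-1.$$
Finally, we conclude by Corollary 3.5
$$|\partial_{\Sigma_{\ge 1}^+}(E^{*c})|\le|\partial_{\Sigma_{\ge 1}^+}E|\le|E|_\Sigma-1 = |(E^{*c})|_\Sigma-1.$$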
3.2. The projection construction
Recall that Theorem 2.6 provides a possibly infinite tree automaton $\mathcal{A}_E$ that accepts $\llbracket E\rrbracket$. Assuming $E$ to be linear, we are now in a position to improve this result:
Corollary 3.7. Let $E$ be a linear regular expression over the ranked alphabet $\Sigma$. Then $\mathcal{A}_E$ is a finite tree automaton with at most $|E|_\Sigma$ states and at most $|E|_\Sigma\cdot|\Sigma|$ transitions that accepts $\llbracket E\rrbracket$.
Proof. The equality $L(\mathcal{A}_E) = \llbracket E\rrbracket$ was shown in Theorem 2.6. Since the set of states of $\mathcal{A}_E$ equals $\partial_{\Sigma_{\ge 1}^*}E$, the finite tree automaton has at most $|E|_\Sigma$ states by Proposition 3.6. For $f\in\Sigma_{\ge 1}$ and $D\in Q_E$, there is at most one transition of the form $(D,f,(G_1,\ldots,G_m))$ by Proposition 3.3(1), i.e., there are at most $|E|_\Sigma\cdot|\Sigma_{\ge 1}|$ transitions whose label belongs to $\Sigma_{\ge 1}$. In addition, there can be $|Q_E\times\Sigma_0|\le|E|_\Sigma\cdot|\Sigma_0|$ transitions of the form $(D,c)$ with $c\in\Sigma_0$.
Remark 3.8. A top-down FTA $\mathcal{A} = (Q,\Sigma,I,\Delta)$ is deterministic if $I$ is a singleton and $(q,f,q_1,\ldots,q_m),(q,f,p_1,\ldots,p_m)\in\Delta$ imply $q_i = p_i$ for all $i\in\{1,\ldots,m\}$. Due to Proposition 3.3(1), we have even proved that for a linear regular expression $E$ the FTA $\mathcal{A}_E$ is a deterministic top-down automaton, which implies the number of transitions given in the corollary. For arbitrary regular expressions $E$, the FTA $\mathcal{A}_E$ is in general not deterministic.
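Let $\Gamma$ and $\Sigma$ be two alphabets with $\Gamma_0\subseteq\Sigma_0$. A mapping $\eta:\Gamma\to\Sigma$ with $\eta(\Gamma_m)\subseteq\Sigma_m$ for every $m\in\mathbb{N}$ and $\eta(c) = c$ for all $c\in\Sigma_0$ is called a projection. We can extend $\eta$ naturally to $\eta:\mathrm{EXP}(\Gamma)\to\mathrm{EXP}(\Sigma)$ by:
• $\eta(\emptyset) = \emptyset$, $\eta(f(E_1,\ldots,E_m)) = \eta(f)(\eta(E_1),\ldots,\eta(E_m))$,
• $\eta((E+F)) = (\eta(E)+\eta(F))$, $\eta((E\cdot_c F)) = (\eta(E)\cdot_c\eta(F))$, and $\eta((E^{*c})) = (\eta(E)^{*c})$.
Definition 3.9. Let $E$ and $\overline{E}$ be regular expressions over the ranked alphabets $\Sigma$ and $\Gamma$, respectively, and let $\eta:\Gamma\to\Sigma$ be a projection. We say that $\overline{E}$ is a refinement of $E$ with respect to the projection $\eta$ if $\eta(\overline{E}) = E$.
$\overline{E}$ is called a linearization of $E$ with respect to $\eta$ if $\overline{E}$ is linear and a refinement of $E$ with respect to $\eta$.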
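Example 3.10. A linearization of the regular expression $E$ from Example 2.3 is
$$\overline{E} = f_1\bigl(g_2(h_3(a)),\,g_4(b)\bigr)^{*a}\cdot_b\bigl(h_5(a)+h_6(b)\bigr)$$
where $\Gamma = \{f_1,g_2,g_4,h_3,h_5,h_6,a,b\}$ and $\eta:\Gamma\to\Sigma$ is given by $\eta: f_1\mapsto f$; $g_2,g_4\mapsto g$; $h_3,h_5,h_6\mapsto h$; $a\mapsto a$; $b\mapsto b$.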
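A linearization such as the one in Example 3.10 can be produced by numbering the occurrences of non-nullary symbols in one traversal of the syntax tree. The Python sketch below illustrates this under assumptions made here and not in the paper: expressions are tagged tuples, the marked copy of the $i$-th occurrence of $f$ is named by appending the index to the symbol name, numbering follows a left-to-right preorder, and `linearize` and `project` are hypothetical names.

```python
from itertools import count


def sym(f, *args): return ("sym", f, tuple(args))
def plus(e, f):    return ("sum", e, f)
def prod(c, e, f): return ("prod", c, e, f)
def star(c, e):    return ("star", c, e)


def linearize(e):
    """Return a linearization of e together with the projection eta (as a dict)."""
    eta, fresh = {}, count(1)

    def walk(x):
        tag = x[0]
        if tag == "empty":
            return x
        if tag == "sym":
            f, args = x[1], x[2]
            if not args:                 # constants stay unchanged
                return x
            name = f"{f}{next(fresh)}"   # fresh marked copy of f
            eta[name] = f
            return ("sym", name, tuple(walk(a) for a in args))
        if tag == "sum":
            return ("sum", walk(x[1]), walk(x[2]))
        if tag == "prod":
            return ("prod", x[1], walk(x[2]), walk(x[3]))
        return ("star", x[1], walk(x[2]))

    return walk(e), eta


def project(e, eta):
    """Apply eta to every symbol of e; symbols not in eta map to themselves."""
    tag = e[0]
    if tag == "empty":
        return e
    if tag == "sym":
        return ("sym", eta.get(e[1], e[1]), tuple(project(a, eta) for a in e[2]))
    if tag == "sum":
        return ("sum", project(e[1], eta), project(e[2], eta))
    if tag == "prod":
        return ("prod", e[1], project(e[2], eta), project(e[3], eta))
    return ("star", e[1], project(e[2], eta))


a, b = sym("a"), sym("b")
E = prod("b",
         star("a", sym("f", sym("g", sym("h", a)), sym("g", b))),
         plus(sym("h", a), sym("h", b)))
E_lin, eta = linearize(E)
print(eta)
print(project(E_lin, eta) == E)
```

With this numbering the marked symbols are `f1`, `g2`, `h3`, `g4`, `h5`, `h6`, and applying `project` recovers the original expression, which is the refinement condition of Definition 3.9.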
Note that due to $\eta(c) = c$ for $c\in\Sigma_0$ both the constants from $\Sigma_0$ and the operations $\cdot_c$ and ${*c}$ remain unchanged. By abuse of notation, we also denote the two natural continuations of $\eta$ to $\Gamma^*$ and to $T_\Gamma$ by $\eta$. The following lemma is easily shown:
Lemma 3.11. Let $E$ be a regular expression and $\overline{E}$ a refinement of $E$ with respect to $\eta:\Gamma\to\Sigma$. Then $\eta(\llbracket\overline{E}\rrbracket) = \llbracket E\rrbracket$.
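Let $E$ be an arbitrary regular expression. Then one can construct a small finite tree automaton accepting $\llbracket E\rrbracket$ as follows: firstly, construct some linearization $\overline{E}$ of $E$ with respect to $\eta:\Gamma\to\Sigma$ (we can assume that every symbol from $\Gamma$ appears in $\overline{E}$ and therefore $|\Gamma|\le|\overline{E}|_\Sigma = |E|_\Sigma$). Secondly, build the finite tree automaton $\mathcal{A}_{\overline{E}}$, which then has at most $|\overline{E}|_\Sigma = |E|_\Sigma$ states and at most $|\overline{E}|_\Sigma\cdot|\Gamma|\le|E|_\Sigma^2$ transitions. Thirdly, replace the transitions $(\overline{F},\overline{f},(\overline{G}_1,\ldots,\overline{G}_m))$ of this automaton by $(\overline{F},\eta(\overline{f}),(\overline{G}_1,\ldots,\overline{G}_m))$. Then, by Lemma 3.11, the following is immediate:
Corollary 3.12. Let $E$ be a regular expression. Then the projection of $\mathcal{A}_{\overline{E}}$ obtained in this way is a finite tree automaton with at most $|E|_\Sigma$ states and at most $|E|_\Sigma^2$ transitions that accepts $\llbracket E\rrbracket$.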