Efficient weighted expressions conversion


DOI:10.1051/ita:2007035 www.rairo-ita.org


Faissal Ouardi¹ and Djelloul Ziadi¹

Abstract. J. Hromkovič et al. have given an elegant method to convert a regular expression of size n into an ε-free nondeterministic finite automaton having O(n) states and O(n log²(n)) transitions. This method has been implemented efficiently in O(n log²(n)) time by C. Hagenah and A. Muscholl. In this paper we extend this method to weighted regular expressions and we show that it can be achieved in O(n log²(n)) time.

Mathematics Subject Classification. 03D15, 68Q45.

Introduction

Weighted automata are efficient data structures for manipulating regular series. They are used in many practical and theoretical applications such as computer algebra, image encoding, speech recognition and text processing. Regular series are also encoded by weighted regular expressions. The equivalence between these two representations, weighted regular expressions and weighted automata, was proved in 1961 by Schützenberger [16]. For the conversion of weighted regular expressions into weighted automata there are principally three algorithms. The first one is due to Caron and Flouret [4]. Their algorithm works recursively on a subset of weighted regular expressions and produces the position automaton.

Champarnaud et al. [7] have given an efficient algorithm that converts a weighted regular expression into its position automaton in quadratic time w.r.t. the size of the expression. Their algorithm is mainly based on the use of the KZPC-structure for weighted expressions. The position automaton is a particular one (see [5]) that can have a quadratic number of transitions and (n + 1) states, where n is the alphabetic width of the expression. The algorithm of Lombardy and Sakarovitch [14] constructs a weighted automaton that turns out to be the generalization of the Antimirov automaton for the Boolean case. The Antimirov automaton [1] has fewer states than the position automaton. Like the position automaton, the Antimirov automaton can have a quadratic number of transitions.

Keywords and phrases. Formal languages and automata, complexity of computation.

¹ L.I.T.I.S., University of Rouen, France; {Faissal.Ouardi, Djelloul.Ziadi}@univ-rouen.fr

Article published by EDP Sciences. © EDP Sciences 2007

In the Boolean case many algorithms have been developed for the conversion problem [2,8,12,15,19]. The best one in terms of complexity is that proposed by Hromkovič et al. [11] and implemented by Hagenah and Muscholl [9]. The resulting automaton, called the common follow sets automaton, has O(n) states and O(n log²(n)) transitions. Hagenah and Muscholl proved that this automaton can be constructed from a regular expression of size n in time O(n log²(n)), which constitutes the best known algorithm for the conversion problem.

In this paper, we extend the algorithm of Hromkovič et al. to weighted regular expressions and we present an efficient algorithm to convert a weighted regular expression of size n into an automaton having O(n) states and O(n log²(n)) transitions in O(n log²(n)) time.

In Section 1 we recall the notions of K-expression and of formal series. In Section 2 we present the position automaton construction. In Section 3 we introduce the notion of common follow polynomials automaton, which is the generalization of the notion of common follow sets automaton presented in [11]. In Section 4 we recall the KZPC-structure in order to implement efficiently the algorithm introduced in Section 5. Finally, in Sections 6 and 7 we describe our algorithm.

1. Preliminaries

Let A be a finite alphabet, and (K, ⊕, ⊗, 0, 1) be a semiring (commutative or not). The star operator ∗ can be partially defined, the scalar y = x^∗ ∈ K being a solution (if one exists) of the equations y ⊗ x ⊕ 1 = y and x ⊗ y ⊕ 1 = y, with 0^∗ = 1 [10,13]. In this paper, we will give examples using the semiring (Q, ⊕, ⊗) with the classical definitions for ⊕ (addition) and ⊗ (multiplication). In the following definition, we introduce the notion of K-expression.

Definition 1.1. K-expressions over an alphabet A are inductively defined as follows:

• a ∈ A and k ∈ K are K-expressions;

• if F and G are K-expressions, then (F + G), (F · G), and (F^∗) are K-expressions.

When there is no ambiguity, the K-expression (F · G) will be denoted (F G). Let E be a K-expression. We denote by A_E the alphabet of E. The linearized version Ē of E is the K-expression deduced from E by ranking every letter occurrence with its position in E. Subscripted letters are called positions. The size of E, denoted |E|, is the size of the syntax tree of E. For example, if E = (1/2 · a^∗ + 1/3 · b^∗)^∗ · a^∗, we get A_E = {a, b}, Ē = (1/2 · a₁^∗ + 1/3 · b₂^∗)^∗ · a₃^∗, A_Ē = {a₁, b₂, a₃} and |E| = 13.


We define inductively the null term of a K-expression E, denoted c(E), by:

c(k) = k for all k ∈ K,

c(a) = 0 for all a ∈ A,

c(F + G) = c(F) ⊕ c(G),

c(F G) = c(F) ⊗ c(G),

c(F^∗) = c(F)^∗.

The null term of E = (1/2 a^∗ + 1/3 b^∗)^∗ a^∗ is c(E) = 6. Indeed, c(a^∗) = c(b^∗) = 0^∗ = 1, so c(E) = (1/2 + 1/3)^∗ ⊗ 1 = (5/6)^∗ = 6.
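The inductive rules for the null term can be checked mechanically. The following is a minimal sketch over the semiring Q, assuming a tuple encoding of expressions and a helper `star` (both are illustrative choices, not notation from the paper); the star of a rational k is taken as the solution 1/(1 − k) of y·k + 1 = y.

```python
from fractions import Fraction

# Assumed encoding of K-expressions (illustrative only):
# ("k", weight), ("a", letter), ("+", F, G), (".", F, G), ("*", F)

def star(k):
    """Star of a rational scalar: the solution y of y*k + 1 = y."""
    if k == 1:
        raise ValueError("star is undefined for this scalar")
    return 1 / (1 - k)

def null_term(e):
    """Null term c(E), following the five inductive rules."""
    tag = e[0]
    if tag == "k":
        return e[1]
    if tag == "a":
        return Fraction(0)
    if tag == "+":
        return null_term(e[1]) + null_term(e[2])
    if tag == ".":
        return null_term(e[1]) * null_term(e[2])
    if tag == "*":
        return star(null_term(e[1]))
    raise ValueError(f"unknown node {tag!r}")

half, third = ("k", Fraction(1, 2)), ("k", Fraction(1, 3))
a_star, b_star = ("*", ("a", "a")), ("*", ("a", "b"))
# E = (1/2 . a* + 1/3 . b*)* . a*
E = (".", ("*", ("+", (".", half, a_star), (".", third, b_star))), a_star)
print(null_term(E))  # 6
```

The recursion visits each node of the syntax tree once, so it is linear in |E|.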

In the following, we present a brief description of formal series and we define a subset of the set of K-expressions, usually called regular K-expressions, which are associated to regular series [3].

Definition 1.2. A (non-commutative) formal series with coefficients in K and variables in A is a map from the free monoid A^∗ to K that associates with the word w ∈ A^∗ a coefficient ⟨S, w⟩ ∈ K.

A formal series is usually written as an infinite sum:

\[ S = \sum_{u \in A^*} \langle S, u \rangle\, u. \]

The support of the formal series S is the language supp(S) = {u ∈ A^∗ | ⟨S, u⟩ ≠ 0}.

The set of formal series over A with coefficients in K is denoted by K⟨⟨A⟩⟩. A semiring structure is defined on K⟨⟨A⟩⟩ as follows [3,13]:

– ⟨S + T, u⟩ = ⟨S, u⟩ ⊕ ⟨T, u⟩;

– \( \langle ST, u \rangle = \sum_{u_1 u_2 = u} \langle S, u_1 \rangle \otimes \langle T, u_2 \rangle \),

with S, T ∈ K⟨⟨A⟩⟩.

A polynomial is a formal series with finite support. The set of polynomials is denoted by K⟨A⟩. It is a subsemiring of K⟨⟨A⟩⟩. The star of a series is defined by:

\[ S^* = \sum_{n \ge 0} S^n \quad \text{with } S^0 = \varepsilon,\ S^n = S^{n-1} S \text{ if } n > 0. \]

Notice that the star of a formal series does not always exist.

Proposition 1.3 (cf. [13]). The star of a formal series S ∈ K⟨⟨A⟩⟩ is defined if and only if ⟨S, ε⟩^∗ is defined in K. In this case:

\[ S^* = \langle S, \varepsilon \rangle^* \left( S_p\, \langle S, \varepsilon \rangle^* \right)^* \tag{1} \]

where the formal series S_p is defined by ⟨S_p, ε⟩ = 0 and ⟨S_p, u⟩ = ⟨S, u⟩ for any word u ≠ ε.

In the following, we will consider the previous construction of the star of formal series.

Definition 1.4. The semiring of regular series K Rat(A^∗) ⊂ K⟨⟨A⟩⟩ is the smallest subset of K⟨⟨A⟩⟩ that contains the semiring of polynomials K⟨A⟩ and that is stable under the operations of addition, product and star, when the latter is defined.

The following definition introduces the notion of regular K-expression.

Definition 1.5. A regular K-expression is defined inductively by:

– a ∈ A and k ∈ K are regular K-expressions that respectively denote the regular series S_a = a and S_k = k;

– if F, G and H (such that c(H)^∗ exists) are regular K-expressions that respectively denote the regular series S_F, S_G and S_H, then (F + G), (F G), and (H^∗) are regular K-expressions that respectively denote the regular series S_F + S_G, S_F S_G and S_H^∗.

The set of regular K-expressions over an alphabet A is denoted by KExp(A).

Definition 1.6. Let A be a finite alphabet, and K be a semiring (commutative or not). We define an automaton with multiplicities 𝒜 = ⟨Q, A, q₀, δ, γ, µ⟩ as follows:

• Q is a finite set of states;

• q₀ is the initial state;

• δ ⊆ Q × A × Q is the set of transitions;

• γ : δ → K is the transition weight function;

• µ : Q → K is the output weight function.

The definition of an automaton with multiplicities is more general, but this one is sufficient for our construction.

A path p from a state q to a state q′ is a sequence of transitions (q, a₁, q₁), (q₁, a₂, q₂), …, (q_{n−1}, a_n, q_n = q′) in 𝒜. It is written p = (p₁, p₂, …, p_n) with p_i = (q_{i−1}, a_i, q_i) for 1 ≤ i ≤ n. Its label is the word w(p) = a₁a₂⋯a_n. We denote by coef(p) the cost of the path p in 𝒜. Formally, coef(p) = γ(p₁) ⊗ γ(p₂) ⊗ ⋯ ⊗ γ(p_n) ⊗ µ(q_n).

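Let C_𝒜 be the set of all paths in 𝒜 starting at q₀. The series S_𝒜 associated with the automaton 𝒜 is

\[ S_{\mathcal{A}} = \sum_{u \in A^*} \langle S_{\mathcal{A}}, u \rangle\, u \quad \text{where} \quad \langle S_{\mathcal{A}}, u \rangle = \sum_{\substack{p \in C_{\mathcal{A}} \\ w(p) = u}} \mathrm{coef}(p). \]

We say that the automaton 𝒜 realizes the series S_𝒜.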
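To make the definition of ⟨S_𝒜, u⟩ concrete, here is a small sketch that sums coef(p) over all paths labelled by a word. The dictionary encoding of δ, γ and µ and the toy automaton are assumptions chosen for illustration; weights are rationals.

```python
from fractions import Fraction

def coefficient(word, q0, delta, mu):
    """<S_A, word>: sum of coef(p) over the paths p from q0 labelled by word.

    delta maps (state, letter) to a list of (next_state, weight) pairs;
    mu maps a state to its output weight (0 when absent).
    """
    # weights[q] = total weight of the partial paths currently ending in q
    weights = {q0: Fraction(1)}
    for letter in word:
        nxt = {}
        for q, w in weights.items():
            for q2, g in delta.get((q, letter), []):
                nxt[q2] = nxt.get(q2, Fraction(0)) + w * g
        weights = nxt
    total = Fraction(0)
    for q, w in weights.items():
        total += w * mu.get(q, Fraction(0))
    return total

# Hypothetical two-state automaton over {a, b}
half = Fraction(1, 2)
delta = {(0, "a"): [(0, half), (1, half)], (1, "b"): [(1, Fraction(1))]}
mu = {1: Fraction(1)}
print(coefficient("aab", 0, delta, mu))  # 1/4
```

Grouping partial paths by their end state avoids enumerating paths one by one, while keeping the left-to-right order of the products.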

Definition 1.7. A formal series S ∈ K⟨⟨A⟩⟩ is called recognizable if there exists an automaton that realizes it.

The following result, due to Schützenberger [16], is classical.

Theorem 1.8 (Schützenberger, 1961). A formal series is recognizable if and only if it is regular.

Let S be a series in K⟨⟨A⟩⟩. If h : A → B is a surjective morphism, then h(S) denotes the series

\[ h(S) = \sum_{u \in A^*} \langle S, u \rangle\, h(u) \quad \text{in } K\langle\langle B \rangle\rangle. \]


Proposition 1.9. Let 𝒜 = ⟨Q, X, q₀, δ, γ, µ⟩ and ℬ = ⟨Q, Y, q₀, δ′, γ′, µ⟩ be two automata. Let h : X → Y be a surjective morphism such that

(1) (p, a, q) ∈ δ′ ⇔ there exists (p, a_i, q) ∈ δ such that h(a_i) = a,

(2) \( \gamma'(p, a, q) = \sum_{\substack{h(a_i) = a \\ (p, a_i, q) \in \delta}} \gamma(p, a_i, q). \)

Then we have S_ℬ = h(S_𝒜).

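Proof. Let Ω_𝒜 be the set of sequences in 𝒜,

\[ \Omega_{\mathcal{A}} = \{ ((q_0, q_1), (q_1, q_2), \dots, (q_{n-1}, q_n)) \mid \exists\, a_{k_i} \in X \text{ s.t. } (q_{i-1}, a_{k_i}, q_i) \in \delta,\ \forall\, 1 \le i \le n \}. \tag{2} \]

We define the transitions polynomial between two states q and q′ as

\[ \Delta_{\mathcal{A}}(q, q') = \sum_{(q, a, q') \in \delta} \gamma(q, a, q')\, a. \]

For a sequence ρ = ((q₀, q₁), (q₁, q₂), …, (q_{n−1}, q_n)) in Ω_𝒜, we set

\[ \Delta_{\mathcal{A}}(\rho) = \prod_{i=1}^{n} \Delta_{\mathcal{A}}(q_{i-1}, q_i)\, \mu(q_n). \]

Thus, the series S_𝒜 associated with the automaton 𝒜 can be written as

\[ S_{\mathcal{A}} = \sum_{\rho \in \Omega_{\mathcal{A}}} \Delta_{\mathcal{A}}(\rho). \]

Now let us prove that h(S_𝒜) = S_ℬ. We have

\begin{align*}
h(S_{\mathcal{A}})
 &= h\Bigl( \sum_{\rho \in \Omega_{\mathcal{A}}} \Delta_{\mathcal{A}}(\rho) \Bigr) \\
 &= h\Bigl( \sum_{\rho = ((q_0,q_1),\dots,(q_{n-1},q_n)) \in \Omega_{\mathcal{A}}} \ \prod_{i=1}^{n} \Delta_{\mathcal{A}}(q_{i-1}, q_i)\, \mu(q_n) \Bigr) \\
 &= \sum_{\rho \in \Omega_{\mathcal{A}}} \prod_{i=1}^{n} \Bigl[ \sum_{(q_{i-1}, a_i, q_i) \in \delta} \gamma(q_{i-1}, a_i, q_i)\, h(a_i) \Bigr] \mu(q_n) \\
 &\overset{\text{Condition (1)}}{=} \sum_{\rho \in \Omega_{\mathcal{A}}} \prod_{i=1}^{n} \Bigl[ \sum_{(q_{i-1}, a, q_i) \in \delta'} \Bigl( \sum_{\substack{(q_{i-1}, a_{k_i}, q_i) \in \delta \\ h(a_{k_i}) = a}} \gamma(q_{i-1}, a_{k_i}, q_i) \Bigr) a \Bigr] \mu(q_n) \\
 &\overset{\text{Condition (2)}}{=} \sum_{\rho \in \Omega_{\mathcal{A}}} \prod_{i=1}^{n} \Bigl[ \sum_{(q_{i-1}, a, q_i) \in \delta'} \gamma'(q_{i-1}, a, q_i)\, a \Bigr] \mu(q_n) \\
 &= \sum_{\rho \in \Omega_{\mathcal{A}}} \Delta_{\mathcal{B}}(\rho) \\
 &\overset{\text{Condition (1), Relation (2)}}{=} \sum_{\rho \in \Omega_{\mathcal{B}}} \Delta_{\mathcal{B}}(\rho) \\
 &= S_{\mathcal{B}}.
\end{align*}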
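Conditions (1) and (2) of Proposition 1.9 describe how the image automaton is obtained: transitions with the same endpoints whose labels have the same image under h are merged and their weights added. A minimal sketch, assuming transitions are stored in a dictionary keyed by (p, a_i, q) (the state names and weights below are illustrative):

```python
from fractions import Fraction

def image_automaton(delta, h):
    """Build the weighted transitions of the image automaton.

    delta maps (p, a_i, q) to a weight; h maps each letter a_i to h(a_i).
    Transitions with equal (p, h(a_i), q) are merged by adding weights.
    """
    image = {}
    for (p, a_i, q), w in delta.items():
        key = (p, h[a_i], q)
        image[key] = image.get(key, Fraction(0)) + w
    return image

# Hypothetical transitions between two states s and t
delta = {
    ("s", "a1", "t"): Fraction(1, 6),
    ("s", "b2", "t"): Fraction(1, 12),
    ("s", "a3", "t"): Fraction(1, 4),
}
h = {"a1": "a", "b2": "b", "a3": "a"}
merged = image_automaton(delta, h)
print(merged[("s", "a", "t")])  # 5/12
```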

2. Position automaton

In this section, we recall some basic notions related to the construction of position automata from regular K-expressions. Let E be a regular K-expression over an alphabet A. The position automaton associated with E is computed from three functions First(E), Last(E) and Follow(·, E) from KExp(A) to K⟨A⟩. They can be computed inductively according to the following rules [7]:

First(k) = 0 for all k ∈ K, (3)

First(a) = 1a_i (a_i is the position associated with a in A_Ē), (4)

First(F + G) = First(F) + First(G), (5)

First(F G) = First(F) + c(F) First(G), (6)

First(F^∗) = c(F)^∗ First(F). (7)

We obtain the same rules for Last by substituting Last for First and replacing formula (6) by:

Last(F G) = Last(G) + c(G) Last(F). (8)

Given a position x ∈ A_Ē, the function Follow(x, E) is inductively computed as follows:

Follow(x, k) = 0 for all k ∈ K,

Follow(x, a) = 0 for all a ∈ A,

Follow(x, F + G) = Follow(x, F) + Follow(x, G),

Follow(x, F G) = Follow(x, F) + ⟨Last(F), x⟩ First(G) + Follow(x, G),

Follow(x, F^∗) = Follow(x, F) + ⟨Last(F^∗), x⟩ First(F).
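The rules above translate directly into a recursive computation. The sketch below works over Q and assumes a tuple encoding of linearized expressions and dictionary-valued polynomials (position → coefficient); these encodings, the helper names and the end marker written "#" are illustrative assumptions. It is applied to the linearized expression (1/3 a₁ + 1/6 b₂)(1/2 a₃^∗ b₄).

```python
from fractions import Fraction as Fr

# Assumed encoding: ("k", w), ("a", position), ("+", F, G), (".", F, G), ("*", F)

def star(k):
    return 1 / (1 - k)  # star in Q, defined when k != 1

def add(p, q):
    r = dict(p)
    for x, w in q.items():
        r[x] = r.get(x, Fr(0)) + w
    return {x: w for x, w in r.items() if w != 0}

def scale(k, p):
    return {x: k * w for x, w in p.items() if k * w != 0}

def c(e):
    t = e[0]
    if t == "k": return e[1]
    if t == "a": return Fr(0)
    if t == "+": return c(e[1]) + c(e[2])
    if t == ".": return c(e[1]) * c(e[2])
    return star(c(e[1]))

def first(e):
    t = e[0]
    if t == "k": return {}
    if t == "a": return {e[1]: Fr(1)}
    if t == "+": return add(first(e[1]), first(e[2]))
    if t == ".": return add(first(e[1]), scale(c(e[1]), first(e[2])))
    return scale(c(e), first(e[1]))          # c(F*) = c(F)^*

def last(e):
    t = e[0]
    if t == "k": return {}
    if t == "a": return {e[1]: Fr(1)}
    if t == "+": return add(last(e[1]), last(e[2]))
    if t == ".": return add(last(e[2]), scale(c(e[2]), last(e[1])))
    return scale(c(e), last(e[1]))

def follow(x, e):
    t = e[0]
    if t in ("k", "a"): return {}
    if t == "+": return add(follow(x, e[1]), follow(x, e[2]))
    if t == ".":
        mid = scale(last(e[1]).get(x, Fr(0)), first(e[2]))
        return add(add(follow(x, e[1]), mid), follow(x, e[2]))
    return add(follow(x, e[1]), scale(last(e).get(x, Fr(0)), first(e[1])))

F = ("+", (".", ("k", Fr(1, 3)), ("a", "a1")), (".", ("k", Fr(1, 6)), ("a", "b2")))
G = (".", ("k", Fr(1, 2)), (".", ("*", ("a", "a3")), ("a", "b4")))
E = (".", F, G)
E_marked = (".", E, ("a", "#"))  # E followed by an end marker
print(first(E), follow("a1", E), follow("b4", E_marked))
```

This naive version recomputes First and Last at every node, so it is only meant to illustrate the rules, not the efficient KZPC-based computation.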


Let h be the mapping from A_Ē to A_E induced by the linearization of E over A_E. It maps every position to its value in A_E. For example, if Ē = 2a₁ + 3b₂ + a₃ then h(a₁) = h(a₃) = a and h(b₂) = b. The First, Last and Follow polynomials of a regular K-expression E can be used to define an automaton with multiplicities realizing S_E. We define the position automaton for E, denoted 𝒜_E, as the 6-tuple ⟨Q, A_E, q₀, δ, γ, µ⟩ where, for some q₀ ∉ A_Ē:

• Q = {q₀} ∪ A_Ē.

• (q, a, p) ∈ δ ⇔ h(p) = a and, if q = q₀, then ⟨First(Ē), p⟩ ≠ 0, else ⟨Follow(q, Ē), p⟩ ≠ 0.

• For all (q, a, p) ∈ δ, one has: γ(q, a, p) = ⟨First(Ē), p⟩ if q = q₀, and γ(q, a, p) = ⟨Follow(q, Ē), p⟩ otherwise.

• µ(q) = c(E) if q = q₀, and µ(q) = ⟨Last(Ē), q⟩ otherwise.
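The definition can be read as a recipe for assembling the automaton once the First, Follow and Last polynomials are known. The sketch below takes these polynomials as given dictionaries (an assumed encoding) and builds the transitions for the small expression 2a₁ + 3b₂ + a₃ used above; the function name and the state name "q0" are illustrative.

```python
from fractions import Fraction

def position_automaton(first, follow, last, null, h, q0="q0"):
    """Return (states, delta, mu); delta maps (q, letter, p) to a weight.

    first, last: dicts position -> coefficient; follow: dict position -> such a dict;
    null: the null term c(E); h: dict position -> letter.
    """
    states = [q0] + sorted(h)
    delta = {}
    for q in states:
        poly = first if q == q0 else follow.get(q, {})
        for p, w in poly.items():
            if w != 0:
                delta[(q, h[p], p)] = w
    mu = {q0: null}
    for p in h:
        mu[p] = last.get(p, Fraction(0))
    return states, delta, mu

h = {"a1": "a", "b2": "b", "a3": "a"}
first = {"a1": Fraction(2), "b2": Fraction(3), "a3": Fraction(1)}
last = {"a1": Fraction(1), "b2": Fraction(1), "a3": Fraction(1)}
states, delta, mu = position_automaton(first, {}, last, Fraction(0), h)

# Coefficient of the one-letter word "a": both a-transitions from q0 contribute.
coef_a = sum(w * mu[p] for (q, letter, p), w in delta.items()
             if q == "q0" and letter == "a")
print(len(states), len(delta), coef_a)  # 4 3 3
```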

Proposition 2.1 [7]. Let E be a regular K-expression and 𝒜_E its position automaton. Then 𝒜_E realizes the regular series S_E.

Lemma 2.2. Let E be a regular K-expression and Ē its linearized version. Then one has S_E = h(S_Ē).

Proof. Let 𝒜_Ē be the position automaton associated with Ē and 𝒜_E be the position automaton associated with E. The surjective morphism h from A_Ē into A_E satisfies conditions (1) and (2) of Proposition 1.9. Then, one has:

\[ S_{\mathcal{A}_E} = h(S_{\mathcal{A}_{\overline{E}}}). \]

From Proposition 2.1, we have S_{𝒜_E} = S_E and S_{𝒜_Ē} = S_Ē. Thus S_E = h(S_Ē).

3. Common follow polynomials automaton

In this section, we introduce the notion of common follow polynomials system, which is the generalization of the notion of common follow sets introduced by Hromkovič et al. [11] in order to compute an automaton having fewer transitions than the position automaton.

Next, we prove that the CFP automaton induced by the common follow polynomials system realizes the same series as the position automaton.

Finally, we recall the KZPC-structure in order to implement an efficient algorithm to convert a regular K-expression into its CFP automaton.

From now on, we will consider the expression (E · ♯) in place of E, where ♯ ∉ A_E is an end marker.

Definition 3.1. Let E be a regular K-expression and let x be a position in A_Ē. The set dec(x) = {(k₁, P₁), …, (k_m, P_m)}, where {P₁, …, P_m} is a set of polynomials with pairwise disjoint supports and k₁, …, k_m are m elements of K, is called a follow decomposition of x if and only if one has:

Follow(x, E) = k₁P₁ + k₂P₂ + ⋯ + k_mP_m.


Proposition 3.2. Let x be a position in A_Ē, and let dec(x) = {(k₁, P₁), …, (k_m, P_m)} and dec′(x) = {(k′₁, P₁), …, (k′_m, P_m)} be two follow decompositions of x. Then, one has k_i = k′_i for all 1 ≤ i ≤ m.

Proof. It comes from the fact that supp(P_i) ∩ supp(P_j) = ∅ for all 1 ≤ i, j ≤ m with i ≠ j.

We will write k_x^{P_i} for the coefficient of P_i in the follow decomposition of x, and Pr(dec(x)) for the set {P₁, …, P_m}.

Definition 3.3 (common follow polynomials system). Let E be a regular K-expression, extended as (E · ♯). A common follow polynomials system S(E) = (dec(x))_{x ∈ A_Ē} of E is a family of follow decompositions of the positions in A_Ē. The set

\[ F_{S(E)} = \{ \mathrm{First}(E) \} \cup \bigcup_{x \in A_{\overline{E}}} \mathrm{Pr}(\mathrm{dec}(x)) \]

is called the set of common follow polynomials associated with the system S(E).

Notice that the common follow polynomials system S(E) is not unique, as we can see in the following example.

Example 3.4. Let E = (1/3 a + 1/6 b)(1/2 a^∗ b). One has:

Ē = (1/3 a₁ + 1/6 b₂)(1/2 a₃^∗ b₄),

First(E) = 1/3 a₁ + 1/6 b₂,

Follow(a₁, E) = 1/2 a₃ + 1/2 b₄,

Follow(b₂, E) = 1/2 a₃ + 1/2 b₄,

Follow(a₃, E) = a₃ + b₄,

Follow(b₄, E) = ♯.

Two possible common follow polynomials systems are:

S(E) = { dec(a₁) = {(1/2, (a₃ + b₄))}, dec(b₂) = {(1/2, (a₃ + b₄))}, dec(a₃) = {(1, (a₃ + b₄))}, dec(b₄) = {(1, ♯)} },

S′(E) = { dec(a₁) = {(1/2, a₃), (1/2, b₄)}, dec(b₂) = {(1/2, a₃), (1/2, b₄)}, dec(a₃) = {(1, (a₃ + b₄))}, dec(b₄) = {(1, ♯)} },

with

F_{S(E)} = {1/3 a₁ + 1/6 b₂, a₃ + b₄, ♯} and F_{S′(E)} = {1/3 a₁ + 1/6 b₂, a₃, b₄, a₃ + b₄, ♯}.

Definition 3.5 (CFP automaton). Let E be a regular K-expression. Let S(E) be a common follow polynomials system of E. The common follow polynomials automaton 𝒜_{S(E)} = ⟨Q, A_E, q₀, δ₁, γ₁, µ⟩ associated with S(E) is defined by:

• Q = F_{S(E)},

• q₀ = {First(E)},

• (P, a, P′) ∈ δ₁ ⇔ there exists a_i ∈ A_Ē such that the following holds:

– h(a_i) = a,

– ⟨P, a_i⟩ ≠ 0,

– P′ ∈ Pr(dec(a_i)).

• For all (P, a, P′) ∈ δ₁, one has:

\[ \gamma_1(P, a, P') = \sum_{\substack{a_i \in A_{\overline{E}} \\ h(a_i) = a}} \langle P, a_i \rangle \otimes k_{a_i}^{P'}. \]

• µ(P) = ⟨P, ♯⟩.
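As an illustration of Definition 3.5, the sketch below assembles a CFP automaton from a first polynomial and a family of follow decompositions, using the first system of Example 3.4. Polynomials are encoded as dictionaries and frozen into hashable states; this encoding, the function names and the marker string "#" are assumptions made for the example.

```python
from fractions import Fraction as Fr

def freeze(poly):
    """Turn a polynomial (dict position -> coefficient) into a hashable state."""
    return frozenset(poly.items())

def cfp_automaton(first, dec, h):
    """dec maps a position to its follow decomposition, a list of (k, polynomial)."""
    q0 = freeze(first)
    states = {q0}
    for pairs in dec.values():
        states.update(freeze(P) for _, P in pairs)
    delta = {}
    for P in states:
        for pos, w in P:                        # positions in the support of P
            for k, P2 in dec.get(pos, []):      # targets are the polynomials of dec(pos)
                key = (P, h[pos], freeze(P2))
                delta[key] = delta.get(key, Fr(0)) + w * k
    mu = {P: dict(P).get("#", Fr(0)) for P in states}
    return q0, states, delta, mu

ab = {"a3": Fr(1), "b4": Fr(1)}
first = {"a1": Fr(1, 3), "b2": Fr(1, 6)}
dec = {
    "a1": [(Fr(1, 2), ab)],
    "b2": [(Fr(1, 2), ab)],
    "a3": [(Fr(1), ab)],
    "b4": [(Fr(1), {"#": Fr(1)})],
}
h = {"a1": "a", "b2": "b", "a3": "a", "b4": "b"}
q0, states, delta, mu = cfp_automaton(first, dec, h)
print(len(states), len(delta))  # 3 4
```

With this system the word ab gets weight 1/6 · 1 · 1 = 1/6, which agrees with the coefficient of ab in (1/3 a + 1/6 b)(1/2 a^∗ b).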

Theorem 3.6. Let E be a regular K-expression and 𝒜_{S(E)} the common follow polynomials automaton associated with a CFP system S(E). Then 𝒜_{S(E)} realizes the regular series S_E.

To prove this theorem, we introduce the automaton 𝒜_{S(Ē)} = ⟨Q, A_Ē, q₀, δ₂, γ₂, µ⟩ defined as follows:

• Q = F_{S(E)}.

• q₀ = {First(E)}.

• (P, a_i, P′) ∈ δ₂ ⇔ the following holds:

– ⟨P, a_i⟩ ≠ 0,

– P′ ∈ Pr(dec(a_i)).

• For all (P, a_i, P′) ∈ δ₂, one has: γ₂(P, a_i, P′) = ⟨P, a_i⟩ ⊗ k_{a_i}^{P′}.

• µ(P) = ⟨P, ♯⟩.

Lemma 3.7. Let E be a regular K-expression. The automaton 𝒜_{S(Ē)} realizes the series S_Ē.

Proof. Let E be a regular K-expression, 𝒜_Ē the position automaton associated with its linearized version Ē, and S(E) = (dec(x))_{x ∈ A_Ē} a CFP system associated with E. Let w = a₁a₂⋯a_n ∈ A_Ē^∗. From Proposition 2.1, 𝒜_Ē realizes the series S_Ē, so to prove this lemma it suffices to show that for each path p ∈ C_{𝒜_Ē} such that w(p) = w, there exists a unique equivalent path p′ ∈ C_{𝒜_{S(Ē)}} such that w(p′) = w and coef(p) = coef(p′), and conversely.

Let p = (q₀, a₁, q₁), …, (q_{n−1}, a_n, q_n) be a path in C_{𝒜_Ē}. According to the definition of the automaton 𝒜_Ē, one has:

⟨First(Ē), a₁⟩ ≠ 0, (9)

⟨Follow(a_i, Ē), a_{i+1}⟩ ≠ 0, ∀ 1 ≤ i ≤ n − 1. (10)

Let P_i be the unique polynomial in Pr(dec(a_i)) such that ⟨P_i, a_{i+1}⟩ ≠ 0, i.e. a_{i+1} ∈ supp(P_i) and P_i ∩ P_j = ∅ for all 1 ≤ i ≠ j ≤ n. P_i exists since ⟨Follow(a_i, Ē), a_{i+1}⟩ ≠ 0.

So the path

p′ = (First(Ē), a₁, P₁), (P₁, a₂, P₂), …, (P_{n−1}, a_n, P_n) (11)

exists in 𝒜_{S(Ē)} and is unique.

Let us prove that coef(p) = coef(p′). By the definition of 𝒜_Ē, one has:

\begin{align*}
\mathrm{coef}(p) &= \gamma(q_0, a_1, q_1) \otimes \gamma(q_1, a_2, q_2) \otimes \cdots \otimes \gamma(q_{n-1}, a_n, q_n) \otimes \mu(q_n) \\
 &= \langle \mathrm{First}(\overline{E}), a_1 \rangle \otimes \langle \mathrm{Follow}(a_1, \overline{E}), a_2 \rangle \otimes \cdots \otimes \langle \mathrm{Follow}(a_{n-1}, \overline{E}), a_n \rangle \otimes \langle \mathrm{Last}(\overline{E}), a_n \rangle,
\end{align*}

and, by the definition of 𝒜_{S(Ē)},

\begin{align*}
\mathrm{coef}(p') &= \gamma_2(\mathrm{First}(\overline{E}), a_1, P_1) \otimes \gamma_2(P_1, a_2, P_2) \otimes \cdots \otimes \gamma_2(P_{n-1}, a_n, P_n) \otimes \mu(P_n) \\
 &= \langle \mathrm{First}(\overline{E}), a_1 \rangle \otimes k_{a_1}^{P_1} \otimes \langle P_1, a_2 \rangle \otimes k_{a_2}^{P_2} \otimes \cdots \otimes \langle P_{n-1}, a_n \rangle \otimes k_{a_n}^{P_n} \otimes \mu(P_n).
\end{align*}

From Definition 3.1, we have Follow(a_i, Ē) = k_{a_i}^{P_{i_1}} P_{i_1} + ⋯ + k_{a_i}^{P_i} P_i + ⋯ + k_{a_i}^{P_{i_n}} P_{i_n}. Then, one has

k_{a_i}^{P_i} ⊗ ⟨P_i, a_{i+1}⟩ = ⟨Follow(a_i, Ē), a_{i+1}⟩ for all 1 ≤ i ≤ n − 1, and ⟨Last(Ē), a_n⟩ = µ(P_n).

Proof of Theorem 3.6. Let E be a regular K-expression. Let h be the surjective morphism from A_Ē into A_E. We consider the automata 𝒜_{S(Ē)} and 𝒜_{S(E)}. The mapping h satisfies conditions (1) and (2) of Proposition 1.9, so one has:

\[ S_{\mathcal{A}_{S(E)}} = h(S_{\mathcal{A}_{S(\overline{E})}}) \overset{\text{Lemma 3.7}}{=} h(S_{\overline{E}}) \overset{\text{Lemma 2.2}}{=} S_E. \]

Example 3.8. Consider the regular K-expression E = (1/3 a + 1/6 b + 1/2 a)(1/2 a^∗ b). The linearized version of E is Ē = (1/3 a₁ + 1/6 b₂ + 1/2 a₃)(1/2 a₄^∗ b₅). Consider the following set of common follow polynomials: F_{S(E)} = {1/3 a₁ + 1/6 b₂ + 1/2 a₃} ∪ {a₄ + b₅, ♯}.

Note that for the previous example, the position automaton associated with E has 6 states and 11 transitions (see Fig. 1). However, there exists a CFP automaton realizing the series S_E with only 3 states and 4 transitions (see Fig. 2).

The number of transitions and the number of states in a common follow polynomials automaton obviously depend on the choice of the common follow polynomials system. The next section deals with the problem of finding an appropriate common follow polynomials system.

A good choice can be found by solving the following problem:

\[ \text{Minimize } f(E) = \sum_{x \in A_{\overline{E}}} n_x\, m_x \]

where m_x denotes the number of polynomials C ∈ F_{S(E)} such that ⟨C, x⟩ ≠ 0 and n_x denotes the size of dec(x).
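The objective f can be evaluated directly from a CFP system. The sketch below compares the two systems of Example 3.4; since only supports matter for m_x, polynomials are represented here by their supports (sets of positions), which is a simplifying assumption that would conflate two distinct polynomials sharing a support.

```python
from fractions import Fraction as Fr

def transition_count(first_support, dec):
    """f = sum over positions x of n_x * m_x.

    dec maps a position to a list of (k, support) pairs; first_support is
    the support of the First polynomial.
    """
    polys = {frozenset(first_support)}
    for pairs in dec.values():
        polys.update(frozenset(support) for _, support in pairs)
    total = 0
    for x, pairs in dec.items():
        n_x = len(pairs)                          # size of dec(x)
        m_x = sum(1 for P in polys if x in P)     # polynomials containing x
        total += n_x * m_x
    return total

first_support = {"a1", "b2"}
half, one = Fr(1, 2), Fr(1)
system_1 = {
    "a1": [(half, {"a3", "b4"})],
    "b2": [(half, {"a3", "b4"})],
    "a3": [(one, {"a3", "b4"})],
    "b4": [(one, {"#"})],
}
system_2 = {
    "a1": [(half, {"a3"}), (half, {"b4"})],
    "b2": [(half, {"a3"}), (half, {"b4"})],
    "a3": [(one, {"a3", "b4"})],
    "b4": [(one, {"#"})],
}
print(transition_count(first_support, system_1),
      transition_count(first_support, system_2))  # 4 8
```

On this example the first system is the better choice: it yields 4 transitions instead of 8.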

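Figure 1. The position automaton for Ē = (1/3 a₁ + 1/6 b₂ + 1/2 a₃)(1/2 a₄^∗ b₅).

Figure 2. A CFP automaton for Ē = (1/3 a₁ + 1/6 b₂ + 1/2 a₃)(1/2 a₄^∗ b₅). The initial state 1/3 a₁ + 1/6 b₂ + 1/2 a₃ has transitions labeled 1/6 a₁, 1/12 b₂ and 1/4 a₃ to the state a₄ + b₅, which carries a loop labeled a₄ and a transition labeled b₅ to the state ♯.

Figure 3. A CFP automaton for E = (1/3 a + 1/6 b + 1/2 a)(1/2 a^∗ b). The initial state 1/3 a₁ + 1/6 b₂ + 1/2 a₃ has transitions labeled 5/12 a and 1/12 b to the state a₄ + b₅, which carries a loop labeled a and a transition labeled b to the state ♯.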

From the definition of the automaton 𝒜_{S(E)}, the function f represents the number of its transitions.

Unfortunately, in practice this method is not efficient. In the Boolean case, J. Hromkovič et al. [11] presented an elegant method that computes a particular common follow sets system, which yields a common follow sets automaton having O(n) states and O(n log²(n)) transitions, where n denotes the size of the regular expression. In [9], C. Hagenah and A. Muscholl have shown that this particular automaton can be computed in time O(n log²(n)), which is the best known algorithm for the conversion problem in the Boolean case.

In the next sections, we prove that for a regular K-expression of size n there exists a common follow polynomials automaton with O(n log²(n)) transitions and O(n) states. This constitutes the generalization of the result proved by J. Hromkovič et al. [11]. Next we give an efficient algorithm that computes this automaton in time O(n log²(n)).

Our algorithm is based on the KZPC-structure. This structure was introduced in [7] in order to compute the position automaton. Its Boolean version is described in [17,19]. The following section gives a brief description of this structure.


4. KZPC-structure

Let E be a regular K-expression. The KZPC-structure of E is based on two labeled trees (TL(E) and TF(E)) deduced from its syntax tree T(E). These trees encode respectively the Last and the First polynomials associated to the subexpressions of E. The edges of these trees are labeled by elements of the semiring K.

A node in T(E) will be denoted by ν. If the arity of ν is two, we write ν_l and ν_r for its left son and its right son, respectively. If its arity is one, its son will be denoted by ν_s. The descendance relation over the syntax tree is denoted by ⪯. Let (ν₁, ν₂) be an edge in TL(E) (or in TF(E)); we denote by α(ν₁, ν₂) its label. For a tree whose edges are labeled by elements of the semiring K, we define the cost of the path (ν = ν₁, ν₂, …, ν_k = ν′), written π(ν, ν′), as

\[ \pi(\nu, \nu') = \bigotimes_{i=1}^{k-1} \alpha(\nu_i, \nu_{i+1}). \]

By convention we set π(ν, ν) = 1.

A subtree t of T(E) is a tree associated to a subexpression of E, or a tree obtained from T(E) by deleting a set of trees which represent some subexpressions of E. This definition is also applied to t. Let t₁ be a subtree of t; we denote by t \ t₁ the subtree of t resulting from t by deleting the subtree t₁. We denote by Pos(t) the set of positions of A_Ē in t. As a measure of t we use the cardinality of the node set of t, denoted by |t|. For a node ν in T(E), E_ν denotes the regular K-expression rooted at the node ν, and c(ν) its null term.

The Last tree TL(E) is a labeled copy of T(E), where an edge going from a node λ labeled “·” to its left son λ_l is marked by α(λ_l, λ) = c(λ_r), in accordance with formula (8), and an edge going from a node λ labeled “∗” to its son λ_s is marked by α(λ_s, λ) = c(λ_s)^∗. For all other edges (λ, λ′), we set α(λ, λ′) = 1. The node λ represents the polynomial

\[ e(\lambda) = \sum_{x \in A_{\overline{E}}} \pi(x, \lambda)\, x. \tag{12} \]

Thus we have

e(λ) = Last(E_λ).

The First tree TF(E) is computed in a similar way, by marking an edge going from a node ϕ labeled “·” to its right son ϕ_r by α(ϕ, ϕ_r) = c(ϕ_l), and an edge going from a node ϕ labeled “∗” to its son ϕ_s by α(ϕ, ϕ_s) = c(ϕ_s)^∗. For all other edges (ϕ, ϕ′), we set α(ϕ, ϕ′) = 1. The node ϕ represents the polynomial

\[ e(\varphi) = \sum_{x \in A_{\overline{E}}} \pi(\varphi, x)\, x. \tag{13} \]

Thus we have

e(ϕ) = First(E_ϕ).

Figure 4. The KZPC-structure associated to the expression E = (1/3 a 1/4 b^∗ + 1/2 b^∗)^∗: the trees TL(E) and TF(E) over the positions a₁, b₂ and b₃, with their edge labels.

The two trees are connected as follows: if a node λ of TL(E) is labeled by “·”, its left son λ_l is linked to the right son ϕ_r of the corresponding node ϕ in TF(E). If a node in TL(E) is labeled “∗”, its son is linked to its corresponding node in TF(E). Such links are called follow links. The set of follow links is denoted by ∆. We denote by ∆_x the set of follow links associated to the position x, that is:

∆_x = {(λ, ϕ) ∈ ∆ | x ⪯ λ}.

Proposition 4.1. Let E be a regular K-expression and x ∈ A_Ē. Then

\[ \mathrm{Follow}(x, E) = \sum_{(\lambda, \varphi) \in \Delta_x} \pi(x, \lambda)\, e(\varphi). \tag{14} \]

5. A CFP system computation

Now we describe a procedure which produces a CFP system that yields a CFP automaton with O(n log²(n)) transitions and O(n) states. This procedure is a generalization of that of J. Hromkovič et al.


We introduce the function f : TL(E) → TF(E) ∪ {⊥} defined by:

f(λ) = ϕ if (λ, ϕ) ∈ ∆_x, and f(λ) = ⊥ otherwise,

where ⊥ denotes an artificial node such that π(x, ⊥) = 0.

We extend the polynomial Follow to the nodes of the trees TL(E) and TF(E) as follows: if λ₁ and λ₂ are two nodes in TL(E) such that λ₁ ⪯ λ₂, then

\[ \mathrm{Follow}(\lambda_1, \lambda_2) = \sum_{\substack{\lambda_1 \preceq \lambda \prec \lambda_2 \\ f(\lambda) = \varphi}} \pi(\lambda_1, \lambda)\, e(\varphi), \tag{15} \]

and if ϕ₁ and ϕ₂ are two nodes in TF(E) such that ϕ₁ ⪯ ϕ₂, then

\[ \mathrm{Follow}(\varphi_1, \varphi_2) = \sum_{\substack{\varphi_1 \preceq \varphi \prec \varphi_2 \\ f(\lambda) = \varphi}} \pi(\varphi_1, \varphi)\, e(\lambda). \tag{16} \]

Let P be a polynomial and let t be a subtree of T(E). The restriction of P to t, denoted by P_t, is the polynomial

\[ P_t = \sum_{x \in \mathrm{Pos}(t)} \langle P, x \rangle\, x. \]

Let t be a subtree of T(E) and x a position in Pos(t) \ {♯}. We denote by tf (respectively tl) the labeled copy of t in TF(E) (respectively in TL(E)). The procedure recursively computes a particular CFP system. It is defined as follows.

If |Pos(t)| > 1

Decompose t into two subtrees t₁ and t₂ according to the following rule:

\[ \frac{|\mathrm{Pos}(t)|}{3} \le |\mathrm{Pos}(t_1)| \le \frac{2\,|\mathrm{Pos}(t)|}{3}, \]

and let t₂ = t \ t₁, with λ₁ the root of the subtree tl₁ and ϕ₁ its corresponding node in tf₁. Let

P₁ = Follow_{t₂}(λ₁, E),

P₂ = e(ϕ₁),

P₃ = e(λ₁),

P₄ = Follow(ϕ₁, E).

We have for all x ∈ Pos(t):

Follow_t(x, E) = Follow_{t₁}(x, E) + Follow_{t₂}(x, E).


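Case x ∈ Pos(t₁):

One has

\begin{align*}
\mathrm{Follow}_{t_2}(x, E)
 &\overset{\text{Relation (15)}}{=} \sum_{\substack{x \preceq \lambda \prec E \\ f(\lambda) = \varphi}} \pi(x, \lambda)\, e_{t_2}(\varphi) \\
 &= \sum_{\substack{x \preceq \lambda \prec E \\ f(\lambda) = \varphi}} \pi(x, \lambda_1) \otimes \pi(\lambda_1, \lambda)\, e_{t_2}(\varphi) \\
 &= \pi(x, \lambda_1) \sum_{\substack{\lambda_1 \preceq \lambda \prec E \\ f(\lambda) = \varphi}} \pi(\lambda_1, \lambda)\, e_{t_2}(\varphi) \\
 &\overset{\text{Relations (12) and (15)}}{=} \langle e(\lambda_1), x \rangle\, P_1 \\
 &= k_x^{P_1} P_1.
\end{align*}

Thus

Follow_t(x, E) = Follow_{t₁}(x, E) + k_x^{P₁} P₁.

It is easy to see that supp(Follow_{t₁}(x, E)) ∩ supp(P₁) = ∅. If k_x^{P₁} ≠ 0, then (k_x^{P₁}, P₁) constitutes the first element of our follow decomposition dec(x, t) of x in t. We recursively apply the same decomposition process to Follow_{t₁}(x, E). Thus we get

dec(x, t) = dec(x, t₁) if k_x^{P₁} = 0, and dec(x, t) = dec(x, t₁) ∪ {(k_x^{P₁}, P₁)} otherwise.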

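Case x ∈ Pos(t₂):

Lemma 5.1. Let x be a position in Pos(t₂) and let (λ, ϕ) be a follow link in ∆_x. If supp(e(ϕ)) ∩ Pos(t₁) ≠ ∅, then ϕ₁ ⪯ ϕ.

Proof. We have (λ, ϕ) ∈ ∆_x, hence x ⪯ λ. If supp(e(ϕ)) ∩ Pos(t₁) ≠ ∅, then ϕ ⪯ ϕ₁ or ϕ₁ ⪯ ϕ. Suppose that ϕ ⪯ ϕ₁; then one has x ⪯ λ ⪯ λ₁, which contradicts x ∈ Pos(t₂).

One has

\begin{align*}
\mathrm{Follow}_{t_1}(x, E)
 &\overset{\text{Relation (16)}}{=} \sum_{\substack{x \preceq \lambda \prec E \\ f(\lambda) = \varphi}} \pi(x, \lambda)\, e_{t_1}(\varphi) \\
 &\overset{\text{Lemma 5.1}}{=} \sum_{\substack{\varphi_1 \preceq \varphi \prec E \\ f(\lambda) = \varphi}} \pi(x, \lambda)\, e(\varphi) \\
 &\overset{\text{Relation (12)}}{=} \Bigl[ \sum_{\substack{\varphi_1 \preceq \varphi \prec E \\ f(\lambda) = \varphi}} \langle e(\lambda), x \rangle \otimes \pi(\varphi, \varphi_1) \Bigr] e(\varphi_1) \\
 &= \Bigl\langle \sum_{\substack{\varphi_1 \preceq \varphi \prec E \\ f(\lambda) = \varphi}} e(\lambda)\, \pi(\varphi, \varphi_1),\ x \Bigr\rangle\, e(\varphi_1) \\
 &= \langle P_4, x \rangle\, P_2 \\
 &= k_x^{P_2} P_2.
\end{align*}

Thus

Follow_t(x, E) = Follow_{t₂}(x, E) + k_x^{P₂} P₂.

It is easy to see that supp(Follow_{t₂}(x, E)) ∩ supp(P₂) = ∅. In this case, if k_x^{P₂} ≠ 0, then (k_x^{P₂}, P₂) constitutes the first element of our follow decomposition dec(x, t) of x in t. We recursively apply the same decomposition process to Follow_{t₂}(x, E).

Thus we get

dec(x, t) = dec(x, t₂) if k_x^{P₂} = 0, and dec(x, t) = dec(x, t₂) ∪ {(k_x^{P₂}, P₂)} otherwise.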

If |Pos(t)| = 1

Using the same idea as in the previous cases, we get dec(x, t) = {(⟨P₀, x⟩, x)}, with

\[ P_0 = \sum_{x \preceq \lambda \prec E} \bigl( \pi(x, \lambda) \otimes \pi(f(\lambda), x) \bigr)\, x. \]
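The balanced split used at each recursive step can be found by a simple descent: starting from the root, move to the child containing the most positions until the current subtree holds at most two thirds of them. The sketch below does this on a plain syntax tree (the paper's subtrees with deleted parts are not modelled); the tuple encoding and the function names are illustrative assumptions.

```python
from fractions import Fraction as Fr

def count_positions(t):
    """Number of positions in t; t is ("a", pos), ("k", w), ("*", F), ("+", F, G) or (".", F, G)."""
    if t[0] == "a":
        return 1
    if t[0] == "k":
        return 0
    return sum(count_positions(s) for s in t[1:])

def split(t):
    """Return a subtree t1 with |Pos(t)|/3 <= |Pos(t1)| <= 2|Pos(t)|/3."""
    n = count_positions(t)
    if n < 2:
        raise ValueError("a split needs at least two positions")
    node = t
    # While the subtree is too heavy, its heaviest child still holds
    # more than n/3 positions, so the lower bound is preserved.
    while 3 * count_positions(node) > 2 * n:
        node = max(node[1:], key=count_positions)
    return node

# (((a1 + 1/3) + (b2 + 1/6)) (b3 + 1))* followed by the end marker "#"
left = ("+", ("+", ("a", "a1"), ("k", Fr(1, 3))), ("+", ("a", "b2"), ("k", Fr(1, 6))))
right = ("+", ("a", "b3"), ("k", Fr(1)))
E_marked = (".", ("*", (".", left, right)), ("a", "#"))
t1 = split(E_marked)
print(count_positions(E_marked), count_positions(t1))  # 4 2
```

Recomputing the counts at every step keeps the sketch short; an efficient implementation would store them in the nodes.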

The following example illustrates the different stages of the recursive procedure.

Example 5.2. Consider the regular K-expression E = (((a + 1/3) + (b + 1/6))(b + 1))^∗. One has

Step 1.

|Pos(T(E))| = 4. We decompose the tree t = T(E) into two subtrees t₁ and t₂ such that t₁ is the subtree representing the subexpression (a₁ + 1/3) + (b₂ + 1/6) and t₂ = t \ t₁. In this case one has

dec(a₁, t) = dec(a₁, t₁) ∪ {(1, (2b₃ + 2♯))},

dec(b₂, t) = dec(b₂, t₁) ∪ {(1, (2b₃ + 2♯))},

dec(b₃, t) = dec(b₃, t₂) ∪ {(1, (2a₁ + 2b₂))}.

Step 2.

Recursively, we decompose the subtree t₁ into two subtrees t₁₁ (the subtree representing the subexpression a₁ + 1/3) and t₁₂ = t₁ \ t₁₁. Similarly, we divide the subtree t₂ into two subtrees t₂₁ (the subtree representing the subexpression b₃ + 1) and t₂₂ = t₂ \ t₂₁. Thus we get

dec(a₁, t) = dec(a₁, t₁₁) ∪ {(2, (b₂))} ∪ {(1, (2b₃ + 2♯))},

dec(b₂, t) = dec(b₂, t₁₂) ∪ {(2, (a₁))} ∪ {(1, (2b₃ + 2♯))},

dec(b₃, t) = dec(b₃, t₂₁) ∪ {(2, (♯))} ∪ {(1, (2a₁ + 2b₂))}.

Step 3.

dec(a₁, t) = {(2, (a₁))} ∪ {(2, (b₂))} ∪ {(1, (2b₃ + 2♯))},

dec(b₂, t) = {(2, (b₂))} ∪ {(2, (a₁))} ∪ {(1, (2b₃ + 2♯))},

dec(b₃, t) = {(2, (b₃))} ∪ {(2, (♯))} ∪ {(1, (2a₁ + 2b₂))}.

Thus the resulting set of common follow polynomials F_{S(E)} associated with S(E) is F_{S(E)} = {2a₁ + 2b₂ + b₃ + 2♯} ∪ {(a₁), (b₂), (b₃), (♯), (2a₁ + 2b₂), (2b₃ + 2♯)}.

Using this procedure, we can prove the following lemma in a similar way as in the Boolean case.

Lemma 5.3 (cf. [11]). For F_{S(E)} the following holds:

(1) |F_{S(E)}| = O(|E|),

(2) \( \sum_{P \in F_{S(E)}} |\mathrm{supp}(P)| = O(|E| \log |E|) \),

(3) |dec(x)| = O(log |E|).
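As a worked step, Lemma 5.3 can be combined with the transition count f(E) = Σ n_x m_x of Section 3 to recover the announced bound on the number of transitions. The derivation below is a sketch; it only uses n_x = |dec(x)| and the fact that each polynomial P is counted in m_x exactly for the positions x of its support, so that Σ_x m_x ≤ Σ_P |supp(P)|.

```latex
\begin{align*}
f(E) &= \sum_{x \in A_{\overline{E}}} n_x\, m_x
      \;\le\; \Bigl( \max_{x \in A_{\overline{E}}} |\mathrm{dec}(x)| \Bigr)
              \sum_{x \in A_{\overline{E}}} m_x \\
     &\le\; O(\log |E|) \cdot \sum_{P \in F_{S(E)}} |\mathrm{supp}(P)|
      \;=\; O(\log |E|) \cdot O(|E| \log |E|)
      \;=\; O(|E| \log^2 |E|).
\end{align*}
```

Together with item (1), this gives O(|E|) states and O(|E| log² |E|) transitions for the automaton over positions; applying the morphism h can only merge transitions, so the bound carries over.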

6. Efficient CFP system computation

In this section, we first give a naive cubic implementation (Algorithm 6.1) of the CFP system procedure described in the previous section. This constitutes our starting point for the efficient implementation of the CFP system procedure.

Next, we show that Algorithm 6.1 runs in quadratic time. Then we show that Algorithm 6.1 contains some redundant computations that can be eliminated.

Finally, we prove that the CFP system procedure can be efficiently implemented in O(|E| log(|E|)) time.