Path Rewriting in Semistructured Data

(1)

HAL Id: inria-00536731

https://hal.inria.fr/inria-00536731

Submitted on 16 Nov 2010

HAL

is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire

HAL, est

destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Path Rewriting in Semistructured Data

Denis Debarbieux, Yves Roos, Sophie Tison, Yves André, Anne-Cécile Caron

To cite this version:

Denis Debarbieux, Yves Roos, Sophie Tison, Yves André, Anne-Cécile Caron. Path Rewriting in Semistructured Data. 4th International Conference on Combinatorics on Words, 2003, Turku, Finland.

pp.358–369. �inria-00536731�

(2)

Path Rewriting in Semistructured Data ^∗

Denis Debarbieux, Yves Roos, Sophie Tison, Yves Andr´ e, Anne-C´ ecile Caron Laboratoire d’Informatique Fondamentale de Lille,

U.M.R. C.N.R.S. 8022

Universit´ e de Lille 1, 59655 Villeneuve d’Ascq Cedex. France.

e-mail:{debarbie, yroos, tison, andre, caronc }@lifl.fr

Abstract

We consider in this paper rooted labeled directed graphs as models for semistructured data. A path inclusion constraint, in such graphs, is an expressionpq wherepand q are regular expressions over the alphabet of labels. An path inclusion constraint pq is satisfied by a rooted labeled directed graph if the set of nodes reached, from the root, by the words of the language described by p is included in the set of nodes reached, from the root, by the words of the language described by q. We answer here to the following problem : given a setC of path inclusion constraints, given a regular expression p describing an infinite regular language, we want to compute, if it exists, a regular expressionf describing a finite language, such thatpf for every graph satisfyingC.

1 Introduction

The notion of semistructured data comes from the Web. In semistructured databases, the data is unconstrained by any type system or schema and may have an irregular structure.

It can be one or more XML documents, or a set of web pages with hyperlinks. In [1] the authors present some models and problematics about semistructured data. In this paper, we see a semistructured data as a rooted edge-labeled directed graph. In the following, we will also consider a semistructured data as a finite state automaton on words, even if this data is not a tool for recognizing languages.

The study of semistructured databases has generated the development of new query languages. Most of them are based on path expressions, which allow to reach nodes to arbitrary depth in the data graph. More precisely, a regular path expression is a regular expression on the alphabet of labels appearing in the data. Each path from the root to a node in the data is labeled by a word. The result of the queryq, regular path expression, is the set of nodes reached from the root by a path labeled by a word ofq.

Figure 1 gives an example of semistructured data. On this data, the result of the query journal.(title + article.title)is the set of nodes{11,12,13}.

One can express constraints on semistructured data. Some of these are called path constraints since they give restrictions on the data paths. Certain kinds of integrity constraints found in object-oriented databases and also common in semistructured databases can be expressed with path constraints. These constraints have been introduced by Abite- boul and Vianu in [2]. See [6], [10] or [4] which analyze different classes of path constraints.

Here, we study path inclusion constraints. A path inclusion constraint is written pq

∗This research was partially supported by Inria (MOSTRARE team).

(3)

0

2

author

3 author

4 journal

6 7

article

12 title writtenBy

writtenBy

5 author

writtenBy 1

author

14 name 9

8 name

name

10

name

11 13

title title

Figure 1: semistructured data

where pandqare regular path queries, and means that the set of nodes result ofpis included in the set of nodes result ofq. In Figure 1, we havejournal.article.writtenBy author. These constraints are interesting because they give semantic informations on the data (on the example,an article in a journal is written by authors) and they are used for query optimization.

In this context, the problem studied in this paper is the following : given a set C of inclusion constraints, given a regular expressionqdescribing an infinite regular language, we want to compute, if it exists, a regular expressionf corresponding to a finite language, such that q f for every data satisfying C. The existence of f is decidable when C contains only word inclusions [2] (inclusions w1 w2 where w1 and w2 are words) or inclusions of the form p w where p is a regular expression and w is a word [5]. We define here a transducer computingf from q(when it exists) extending the result of [5].

2 Path inclusions and query rewriting

2.1 Semistructured data and path inclusions

In the sequel we use the following notions which were introduced in [2]. LetAbe a fixed finite alphabet of labels.

Definition 2.1 : A semistructured datum is a tripleI = (N, i, T) where N is a set of nodes, i∈N is called the root of datumI andT ⊆N×A×N is the set of transitions.

IfN is finite, the datum is said finite.

We are interested in the set of nodes which are reached by some paths in a datum. More precisely, given a datum I = (N, i, T) and a wordu∈A^⋆, we denote byu(I) the set of nodes that can be reached inI at the end of any path labelled byustarting from the root i, that is :

1. ifu=ε, thenu(I) ={i}

(4)

2. ifu=u^′a, withu^′∈A^⋆ anda∈A,u(I) ={n∈N | ∃n^′∈u^′(I),(n^′, a, n)∈T} Then we can define the notion of query and result of query.

Definition 2.2 : Aregular querypis a regular expression overA. Theresultof a query pover a datum I is the setp(I) =∪_u∈L(p)u(I) whereL(p) denotes the regular language described byp.

Definition 2.3 : A path inclusion is an expression of the form p q where p, q are regular queries. A datumI satisfiesa path inclusionpq, denotedI|=pq, if the set of nodes p(I) is included inq(I). I satisfies a setC of path inclusions, denoted I|=C, if I satifies each path inclusion ofC.

Definition 2.4 : A set C of path inclusions implies a path inclusion p q, denoted C |=pq, if for each datumI such thatI|=C,I|=pq.

We will use the following proposition [2]:

Proposition 2.1 A set C of path inclusions implies a path inclusion p q, denoted C |=pq, if for eachfinite datumI such that I|=C,I|=pq.

Definition 2.5 : A regular queryphas theboundedness propertyw.r.t a set Cof path inclusions if there exists a regular query f such that C |=pf andL(f) is finite.

E.g., letC be{a²a}; w.r.t. C, the querya^∗ is bounded, whereas the queryba∗is not.

2.2 Bounded path constraints and Prefix Rewriting

From now on, we will consider uniquely the case of a finite set of inclusions in the form puwhere pis a regular expression anduis a word: we shall call such path inclusions bounded inclusions. In this case, following and slightly generalizing [2], we associate with a setCof bounded path inclusions a rewriting system such that there is a prefix rewriting from utov, if and only if the queryuis included in the queryv for each model ofC.

Definition 2.6 : LetC ={p1 u1, p2 u2, . . . , pn un} be a finite set of bounded path inclusions over an alphabetA. We consider the relation on words defined byu−→

C v if and only if there exists isuch thatu∈L(pi) andv=ui. By extension, we denote also

−→C its rightcongruent closure. Then−→^⋆

C denotes the reflexive, transitive closure of−→

C . We can remark than this relation is a prefix rewriting relation as defined in [8] based on an infinite rewrite system.

From now on, we will suppose that there is no path constraint of the type pε. Then we have the following property:

Proposition 2.2 LetC be a set of bounded path inclusions. For any wordsu,v,u−→^⋆

C v if and only if C |=uv.

The proof uses a kind of (infinite) canonical model ofCwhich is close to the model defined in [2]:

Definition 2.7 : With a set of bounded path constraintsC, we associate an infinite datum IC = (N, i, T) defined by:

• N ={u | u∈A^⋆}

• i=ε

(5)

• T ={(u, x, v) | v−→^⋆

C ux}

Example 2.1 :

LetC ={ab^⋆ ba, b⁺a, a(aa)^⋆ba}. The following datum shows a finite part of the infinite datum defined in 2.7. It follows from bb →a →ba → aathat (a, a, bb) ∈T, it follows fromb→a→bathat (b, a, b)∈T.

ε

a

b

ba

bb a

a, b

a, b a a

a

a a

a

a, b a

a, b

Then, we get immediately by induction on the length ofu: Lemma 2.1 u(IC) ={v | v−→^⋆

C u}

Now, let us supposeIC |=uv, i.e. u(IC)⊆v(IC); by the preceding lemmaubelongs to u(IC), soubelongs tov(IC); once again by the preceding lemmau−→^⋆

C v. Let us suppose now u−→^⋆

C v. So if w−→^⋆

C u, w−→^⋆

C v; by the preceding lemmau(IC)⊆v(IC), that is to sayIC |=uv:

Lemma 2.2 IC |=uv if and only if u−→^⋆

C v We obtain now

Lemma 2.3 IC |=uv if and only if C |=uv

Proof : First, it is easy to get than IC |=C and so that if C |=uv, then IC |=uv.

Now, let us suppose IC |=uv that is to say u−→^⋆

C v. Let C be defined byuC v if C |=uv . The relationC contains−→

C , is transitive and closed by right-congruence:

it contains−→^⋆

C ; sou−→^⋆

C vimpliesuC v i.e. C |=uv. 2

By the two last lemmas, we obtain the proposition 2.2.

The following property will be fundamental:

Lemma 2.4 If IC |=uL, there is some wordv in Lsuch that IC |=uv

(6)

Proof : Indeed, if IC |= u L, u(IC) is included in L(IC); in particular, the state u belongs to L(IC), i.e. there is somev inL such thatubelongs tov(IC); then, by lemma 2.1, u(IC) is included inv(IC), i.e. IC |=uv. 2 Then, by summarizing the preceding lemmas, we get:

Proposition 2.3 Let C be a set of bounded path inclusions, and L a regular query; the following properties are equivalent:

• C |=uL

• there is some word v inL such thatC |=uv

• there is some word v inL such thatu−→^⋆

C v

Let us now supposeC |=uv, for some wordv inL: IC |=uv and thenu−→^⋆

C v.

Lastly, let us suppose u−→^⋆

C v, for some wordv in L: then IC |=uv. So,C |=uv

and we getC |=uL. 2

The following example shows that proposition 2.3 does not hold in the general case of path inclusions which are not bounded inclusions :

Example 2.2 : The following datum satisfies a⁺ (b+c) but does not satisfyab neitherac.

i

a, c a, b

a

2.3 Path inclusions and recognizable relations

In this section, we will use the notion of recognizable relations to decide boundedness of a regular query. The idea behind recognizable relations is simply to encode a n-uple of words on the alphabet A by a word on the alphabetA∪ {⊥} ×. . .×A∪ {⊥} , where

⊥is a new symbol. This encoding is obtained by superposing then words, aligning the words by the end. E.g. (ab, aaa, b) will be encoded in [⊥, a,⊥][a, a,⊥],[b, a, b]. Formally, we define the morphismπj from ([A∪ {⊥}]ⁿ)^∗ byπj[a1, a2, ...an] =aj, if it belongs toA, πj[a1, a2, ...an] = ǫ ifaj =⊥. Now, let Cor be the recognizable language of the words u1...un -on the alphabet [A∪ {⊥}]ⁿ- satisfying the two following properties:

• for eachj, ifπj(uk) is inA,πj(ul) is inAfor anyl > k

• there exists somej s.t. πj(u1) is inA.

Then, the notion of recognizable relation is defined by:

Definition 2.8 : A n-ary relationRonA^⋆×. . .×A^⋆is recognizable if and only if there exists an automatonMon the alphabetA∪ {⊥} ×. . .×A∪ {⊥}recognizing the language

L(M) ={u∈Cor /(π1(u),· · ·, πn(u))∈R}

(7)

This definition corresponds to the definition of tree recognizable relations defined in [9], if we consider here a word as a tree whose root is associated with the end of the word. So, ifRecdenotes the set of recognizable relations, the following properties ofRec(see [9] for more details) hold:

Proposition 2.4

• Rec is closed under boolean operations (union, intersection, complementation)

• Rec is closed under cylindrification and projection.¹

• Finiteness and emptiness are decidable on Rec.

Proposition 2.5 (see for instance [9]) IfCis a set of inclusion constraints, the prefix rewriting relation −→^⋆

C is a recognizable relation.

The boundedness property can be expressed in terms of recognizable relations. More precisely, let define some relations onA^⋆×A^⋆ :

• shorter ={(u, v)| |u|<|v|}

• shortestResults ={(u, v)|u−→^⋆

C v∧(6 ∃v^′u−→^⋆

C v^′∧(v^′, v)∈shorter)}

Lemma 2.5 The relations shorter and shortestResults are recognizable.

Proof : shorter is obviously a recognizable relation. Since−→^⋆

C is a recognizable relation, we can deduce from proposition 2.4 than shortestResults is a recognizable relation.

Let us now study the number of states of an automatonM for the relation shortestResults w.r.t. a set of bounded path inclusions C = {p1 u1, p2 u2, . . . , pn un}, where the size of C is defined by |C| = Σⁿ_i=1|p_i|+|u_i|. We can define an automaton Ms of constant size for shorter(u, v) and an automaton Mc for u−→^⋆

C v of sizeO(|C|). Formula ϕ=u−→^⋆

C v^′∧shorter(v^′, v) can so be associated with an automatonMϕof size O(|C|).

The automaton Msr recognizing the relation shortestResults(u, v) can be built from a cartesian product of Mc and an automaton built from Mϕ using a projection and a determinization: this leads to a size O(2^k|C|) for some constant k. So, the number of states of the final automaton will be inO(|C|.2^k|C|). 2 Proposition 2.6 Let C a set of path inclusions and p a regular path query. Query p has the boundedness property w.r.t. C if and only if the recognizable set shortestResults∩ {(u, v), u∈p}is finite. So, the boundedness property of pis decidable.

Let us note, that we could use a total recognizable ordering on words to associate with a word one and only one canonical query.

In the following, we develop an “ad hoc” method in order to reduce the number of states.

3 Query transduction

We consider here a fixed (non empty) finite set of bounded inclusionsC={p1u1, p2 u2, . . . , pn un} over an alphabetA . The aim of this section is to build a finite word transducer τC such that for any regular querypoverA:

1. C |=pqfor allqsuch thatL(q) =τC(L(p))

2. phas the boundedness property w.r.t. Cif and only if τC(L(p)) is finite

1roughly speaking, cylindrification (projection) consists in adding (deleting) a component in a tuple.

(8)

In order to define our transducerτC, our first goal is to build a finite automatonAC (with ε-moves) which recognizes the languageRC ={v∈A^⋆ | ∃i, v−→⁺

C ui}. It is already known thatRC is a recognizable language from [7], [9], [8]. We give here a different construction : For each i with 1 ≤i≤n, let Mi = (A, Qi, Ii, Fi, δi) be an automaton recognizing the languageL(pi). We can assume, without loss of generality, that for different subscriptsi and j, the intersectionQi∩Qj is empty. Then we can define AC = (A, Q, I, F,∆) where Q=∪ⁿ_i=1Qi, I =∪ⁿ_i=1Ii,F =∪ⁿ_i=1Fi and ∆ =∪_k∈N∆k where ∆k, fork∈

N

_{is defined}

inductively by :

• ∆0=∪ⁿ_i=1δi

• fork >0, ∆k = ∆k−1∪ {(q, ε, q^′) | ∃i≤n, q∈Fi, q^′∈∆k−1(I, ui)}

Since ∆ is included inQ×(A∪ {ε})×Q, it is clear that there exists an integerK such that ∆K = ∆K+1= ∆. Since we haveK≤ |Q|², automatonAC can be build in quadratic time in |C|, the size ofC.

Example 3.1 : Let C = {ab^⋆ ba, b⁺ a, a(aa)^⋆b a}. Automaton AC is the following :

0 f1

1 f2

2 3 f3

a

b

a b a

b

b ε

ε ε

ε

ε ε

ε

We have now to prove that AC recognizesRC.

Lemma 3.1 For any word v in A^⋆, if ∆(I, v)∩Fi 6= ∅ for some i in {1, . . . , n} then v−→⁺

C ui.

Proof : Let v ∈ A^⋆ and q ∈ ∆(I, v)∩Fi for some i in {1, . . . , n}. Then there exists a k such that q ∈ ∆k(I, v). We will show that v −→⁺

C ui by induction on k. If k = 0, then v ∈ L(pi) and v −→

C ui. Suppose now that k > 0. There exist q0, q1, . . . , ql in Q and x1, x2, . . . , xlin A∪ {ε}such thatq=ql, v=x1x2. . . xl and for anyj in {1, . . . , l}, (qj−1, xj, qj)∈∆k. Let mbe the number of such (qj−1, xj, qj) which are in ∆k\∆k−1. We shall now make an induction onm. Ifm= 0, then, by induction hypothesis, we obtain that v−→⁺

C ui. Ifm >0, let pbe the integer such that (qp−1, xp, qp)∈∆k\∆k−1 and for any j with p < j ≤l, (qj−1, xj, qj) is in ∆k−1. Then xp =ε, qp−1 ∈Fi^′ for somei^′ in {1, . . . , n}andqp∈∆k−1(I, ui^′) :

(9)

q0 qp−1 qp

q xp=ε

ui^′

x1. . . xp−1

xp+1. . . xl

∆k\∆k−1

∆k−1

∆k

∆k−1

By induction hypothesis on m, we obtain that x1x2. . . xp−1 +

−→C ui^′ and by induction hypothesis onk, we obtain thatui^′xpxp+1. . . xl

−→+

C ui. It follows thatv=x1x2. . . xl

−→+ C

ui^′xpxp+1. . . xl

−→+

C ui. 2

In order to prove the converse of lemma 3.1, we shall use the following result : Lemma 3.2 Let v andw be two words ofA^⋆, thenv−→^⋆

C w=⇒∆(I, w)⊆∆(I, v).

Proof : Letj be the length of the derivationv−→^⋆

C w. We shall make an induction onj.

Ifj = 0 thenv=wand ∆(I, w) = ∆(I, v). Ifj >0, then there existi in{1, . . . , n} and words v1, v2 such that v−→^j−1

C v1v2,v1 −→

C ui and w=uiv2. By induction hypothesis, we have ∆(I, v1v2)⊆∆(I, v). Moreover, since v1−→

C ui then v1∈L(pi) and there exists a stateq in Fi∩∆(I, v1). Letq^′ be a state in ∆(I, ui) then there exists an integer ksuch that q^′ ∈∆k(I, ui). It follows that (q, ε, q^′)∈∆k+1 ⊆∆ and q^′ ∈∆(I, v1). As we have

∆(I, ui)⊆∆(I, v1), we obtain ∆(I, w) = ∆(I, uiv2)⊆∆(I, v1v2)⊆∆(I, v). 2 We are now able to prove :

Proposition 3.1 For any wordv inA^⋆,∆(I, v)∩Fi6=∅ for somei in{1, . . . , n}if and only ifv−→⁺

C ui. Then automaton AC recognizesRC

Proof : From lemma 3.1, we have only to prove that for any wordv∈A^⋆, if there exists an iin {1, . . . , n} such that v −→⁺

C ui then ∆(I, v)∩Fi 6=∅. Sincev −→⁺

C ui, there exists a word w∈A^⋆ such that v −→^⋆

C w−→

C ui. Then w∈L(pi) and ∆(I, w)∩Fi 6=∅. Since

∆(I, w)⊆∆(I, v), from lemma 3.2 it follows that ∆(I, v)∩Fi6=∅. 2 Now, we construct the transducer τC from the automatonAC. This transducer produces from any regular querypwhich has the boundedness property w.r.t. C a regular queryq, such thatL(q) is finite andC |=pq.

Definition 3.1 : LetC be a set of bounded inclusions overAand v be a word of A^⋆. The shortest suffix of v which cannot be rewritten w.r.t −→^⋆

C , denoted by fC(v) is the suffix ofv which satisfies that for any wordsv1∈RC∪ {ε}andv2∈A^⋆, ifv=v1v2 then

|v2| ≥ |fC(v)|.

It is easy to verify that the following lemma holds :

Lemma 3.3 Let C be a set of bounded inclusions over A. Let v andw be two words of A^⋆. Then v −→^⋆

C w if and only if there exist v1 and w1 in A^⋆ such that v = v1fC(v), w=w1fC(v)andv1 ⋆

−→C w1.

(10)

The shortest suffix of words which cannot be rewritten w.r.t−→^⋆

C has been used to decide if a regular query has the boundedness property w.r.t. a set of bounded inclusions.

Theorem 3.1 ([5]) LetC be a set of bounded inclusions overAandpbe a regular query overAthenphas the boundedness property w.r.t. C if and only if the setFC(p) ={f(v)| v∈L(p)} is finite.

Our transducerτC will satisfy the following property, proved in proposition 3.2 : For any wordv∈A^⋆, we will havew∈τC(v) if and only ifw=v=f(v) orv−→⁺

C w∧w= uif(v) for some i∈ {1, . . . , n}.

Let A^′_C be the complete and deterministic automaton obtained from the automatonAC

applying the classical transformations (removeε-moves, subset-construction, completion).

Then A^′_C = (A, Q^′, q₀^′, F^′,∆^′) and, from the subset-construction and proposition 3.1, we can associate to any state q ∈ Q^′ a subset S(q) ⊆ {1, . . . , n} such that, for any word v∈A^⋆, i∈S(∆^′(q₀^′, v)) if and only ifv−→⁺

C ui. In particular,F^′={q∈Q^′ | S(q)6=∅}.

We can remark than the automatonA^′_C can be built in time 2^|C|.

Then the transducer τC is defined by : τC = (A, Q^′∪Q¯^′,{q0,q¯0},Q¯^′, T) whereA is the input/output alphabet,Q^′∪Q¯^′where ¯Q^′ is a copy ofQ^′ is the set of states,q0 and ¯q0are the initial states, ¯Q^′ is the set of final states andT⊆Q×A∪ {ε} ×A^⋆×Qis the set of transitions ofτC defined by : T ={(q, x, ε, q^′)|q, q^′∈Q^′, x∈A,(q, x, q^′)∈∆^′}

∪ {q, ε, u_i,q)¯ |q∈Q^′, i∈S(q)}

∪ {(¯q, x, x,q¯^′)|q, q^′∈Q^′, x∈A, S(q^′) =∅,(q, x, q^′)∈∆^′}.

Example 3.2 : The transducer τC associated with the set of bounded inclusions of example 3.1 is the following :

0 1 2

3

4 5

0

1

5

2

3

4 b|ε

b|ε a|ε

a|ε

a|ε a|ε

b|ε

a|ε b|ε

a|ε

b|ε ε|ba

ε|ba, ε|a

ε|a, ε|a, ε|ba

a|a

a|a a|a

(11)

Proposition 3.2 For any words v, w ∈ A^⋆, w ∈τC(v) if and only if w= v = f(v) or v−→⁺

C wandw=uif(v)for somei in{1, . . . , n}.

Proof :

• Letv, w∈A^⋆ such thatw∈τC(v) then there exists a path inτCfrom an initial state to a final state, labelled byv. If this path starts from ¯q0, then w=vand, since for any (¯q, x, x,q¯^′) inT,S(q^′) =∅, there does not exist any prefix ofv, different from ε which belongs toRC. It follows that f(v) =v=w. Let us suppose now that the path labelled byv starts fromq0. Then there exists two wordsv1 andv2 such that v =v1v2, v1 labels a path fromq0 to a stateq ∈Q^′ such that there exists somei in S(q) , (q, ε, ui,q) is in¯ T, v2 labels a path from ¯q to some state ¯q^′ of ¯Q^′. Then v1 +

−→C ui,w=uiv2andv−→⁺

C w. Note that for any prefixv₂^′ ofv2, different fromε, we are sure that v1v^′₂ does not belong to RC sincev₂^′ labels a path from ¯q to some state ¯q^′′ withS(q^′′) =∅. It follows thatv2=f(v) thenw=uif(v).

• Conversely, let v, w ∈ A^⋆ such that w = v = f(v) then there does not exist any prefix ofv, different fromεwhich belongs to RC. Thenv labels a path from ¯q0 to some state ¯q^′ of ¯Q^′ and v ∈τC(v). Suppose now thatv −→⁺

C w andw=uif(v) for some i in {1, . . . , n}. From lemma 3.3, it follows thatv = v1f(v) and v1

−→+ C ui. Then v1 labels a path fromq0 to a stateq such thati∈S(q). Moreover, it is clear that f(v) labels a path from ¯qto some state ¯q^′ of ¯Q^′. It follows that uif(v) =wis in τC(v).

2 Proposition 3.3 Let C ={p1 u1, p2 u2, . . . , pn un} be a non-empty finite set of bounded inclusions over an alphabet Aand letτC be the transducer defined above, then for any regular querypoverA :

1. C |=pq for allq such thatL(q) =τC(L(p))

2. phas the boundedness property w.r.t. C if and only ifτC(L(p))is finite Proof :

1. Let u be a word of L(p) and v be a word of τC(u) = L(q) (∀u ∈ A^⋆, τ(u) 6= ∅).

It follows from proposition 3.2 that u−→⁺

C v or u = v and then C |= p q from proposition 2.3.

2. IfτC(L(p)) is finite thenphas clearly the boundedness property w.r.tC. Conversely, ifphas the boundedness property w.r.t. Cthen, from theorem 3.1,Fp={f(w), w∈ L(p)} is a finite set. Since, from proposition 3.2, τC(L(p)) ⊆ {ε, u1, ..., un}Fp it followsτC(L(p)) is finite.

2 Remark : We have produced an algorithm which, given a non-empty finite set of bounded inclusions C and a regular query p gives a finite query f such that C |= p f if and only ifphas the boundedness property w.r.t. C. This algorithm is inO(|p|.2^|C|) since it computesτC(L(p)). It is worth noting that the problem of deciding boundedness property is PSPACE-hard : Indeed letpandqbe two regular expressions over an alphabetA. Let us consider a new letter $ which is not inA and let us defineC asC ={q$⁺$}. Then it is easy to see that phas the boundedness property forC if and only ifL(p)⊆L(q). It follows that our algorithm may be used to decide whether a regular language is included in another regular language where these languages are defined by regular expressions and

(12)

it is known from [3] that this last decision problem is PSPACE-complete.

When the set of path inclusionsC contains not only bounded inclusions, the problem to check whether a regular queryphas the boundedness property w.r.t. C is still open. It is also an open question to know whether this problem is decidable or not.

References

[1] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web. Morgan Kaufmann Publishers, 2000.

[2] Serge Abiteboul and Victor Vianu. Regular path queries with constraints. InPODS, pages 122–133. ACM Press, 1997.

[3] A. Aho, J. Hopcroft, and J. Ullman. The design and Analysis of Computer Algo- rithms. Addison-Wesley Publishing Compagny, Reading, Mass., 1974.

[4] N. Alechina, S. Demri, and M. de Rijke. A modal perspective on path constraints.

Journal of Logic and Computation, 13(6):939 – 956, 2003.

[5] Y. Andr´e, F. Bossut, and A.C. Caron. On decidability of boundedness property for regular path queries. Inproceedings of DLT’99, Aachen, Germany, 1999. Development in Language Theory, World Scientific Publishing Co.

[6] P. Buneman, W. Fan, and S. Weinstein. Path constraints in semistructured databases. Journal of Computer and System Sciences, 61(2), 2000.

[7] J. Richard B¨uchi and W.H. Hosken. Canonical systems which produce periodic sets.

Mathematical Systems Theory, 4(1), 1970.

[8] D. Caucal. On the regular structure of prefix rewritings. In Springer, editor,Selected papers of the conference on Fifteenth colloquium on trees in algebra and programming, pages 87 – 102, Copenhagen, Denmark, May 1990.

[9] H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi. Tree automata techniques and applications. Available on:

http://www.grappa.univ-lille3.fr/tata, 1997.

[10] G. Grahne and A. Thomo. Query containment and rewriting using views for regular path queries under constraints. Inproceedings of PODS’03, pages 111–122. Sympo- sium on Principles of Database Systems, ACM, 2003.

Path Rewriting in Semistructured Data

HAL Id: inria-00536731

https://hal.inria.fr/inria-00536731

Submitted on 16 Nov 2010

is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire

destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Path Rewriting in Semistructured Data

Denis Debarbieux, Yves Roos, Sophie Tison, Yves André, Anne-Cécile Caron

To cite this version:

Denis Debarbieux, Yves Roos, Sophie Tison, Yves André, Anne-Cécile Caron. Path Rewriting in Semistructured Data. 4th International Conference on Combinatorics on Words, 2003, Turku, Finland.

pp.358–369. �inria-00536731�

Path Rewriting in Semistructured Data ∗

Denis Debarbieux, Yves Roos, Sophie Tison, Yves Andr´ e, Anne-C´ ecile Caron Laboratoire d’Informatique Fondamentale de Lille,

U.M.R. C.N.R.S. 8022

Universit´ e de Lille 1, 59655 Villeneuve d’Ascq Cedex. France.

e-mail:{debarbie, yroos, tison, andre, caronc }@lifl.fr

1 Introduction

2 Path inclusions and query rewriting

2.1 Semistructured data and path inclusions

2.2 Bounded path constraints and Prefix Rewriting

2.3 Path inclusions and recognizable relations

3 Query transduction

N

References

Path Rewriting in Semistructured Data ^∗