Compression vs Queryability - A Case Study


HAL Id: inria-00449563

https://hal.inria.fr/inria-00449563

Submitted on 22 Jan 2010



Siva Anantharaman

To cite this version:

Siva Anantharaman. Compression vs Queryability - A Case Study. Dagstuhl Seminar 08621, Jun 2008, Dagstuhl, Germany. http://drops.dagstuhl.de/opus/volltexte/2008/1676. inria-00449563


Siva Anantharaman

LIFO, Université d’Orléans (France) e-mail: [email protected]

Abstract

Some compromise on compression is known to be necessary if the relative positions of the information stored in semi-structured documents are to remain accessible under queries. With this in view, we compare, on an example, the ‘query-friendliness’ of XML documents when compressed into straightline tree grammars which are either regular or context-free. The queries considered are in a limited fragment of XPath, corresponding to a type of patterns; each such query naturally defines a non-deterministic, bottom-up ‘query automaton’ that runs just as well on a tree as on its compressed dag.

Keywords: Tree automata, Tree Grammars, Dags, XML documents, Queries.

1 Introduction

Structures over dags instead of over trees have been widely used in order to optimize algorithms. Tree automata (TA) are among the basic tools employed for querying XML documents (e.g., [10, 11, 16, 17]); on the other hand, the notion of a compressed XML document has been introduced in [2, 9, 14], and a possible advantage of using dag structures for the manipulation of such documents has been brought out in [14]. It is legitimate then to investigate the possibility of using automata running directly over dags instead of over trees, for querying compressed XML documents. Unfortunately, however, the Dag Automata (DA), defined in [5] as a natural extension of tree automata, namely as bottom-up tree automata running on dags, cannot directly serve such a purpose. The reason is that the class of their languages, defined as the sets of dags accepted under their bottom-up runs, is algebraically ill-behaved ([1]): although a deterministic bottom-up TA runs exactly alike on a dag or on its uncompressed tree, the set of dags accepted by a non-deterministic DA does not represent, in general, a regular tree language. Thus, if the notion of acceptance is not ‘adapted appropriately’, the languages of DAs (would) form a strict superclass of the class of regular tree languages. Note, on the other hand, that the answers to MSO-definable queries on semi-structured trees are known to be regular tree languages, cf. [17, 19]. So, for a suitable definition of acceptance, a bottom-up run of a non-deterministic TA on a dag has to be coupled, in general, with an “on the fly determinization” of the TA along the run.

Several observations are in order at this point, before we proceed. First, we are concerned with queries on XML documents, and the trees modeling such documents are unranked, i.e., the symbols at the nodes can have arbitrarily many arguments; so the automata we want to employ for querying should also be unranked. A second observation is that fully or partially compressing an unranked tree into a dag is not a major issue: it suffices to represent it as a straightline regular tree grammar (SLR), e.g., as in [3]. In the sequel, therefore, we shall consider the terms ‘document’ and ‘dag’ to be synonymous; and by a ‘tree’, we shall mean a dag with ‘copying fully allowed’, i.e., with no compression at all. It must be pointed out here that a higher degree of compression of a ranked tree can be achieved by representing it as a straightline, context-free tree grammar (SLCF), cf. e.g., [3, 4]; but it is not clear if the requirement on the base alphabet to be ranked can be dropped. Now, patterns (cf. [15], and Section 3 below) are often conveniently used to specify queries on XML documents, and they can be naturally visualized as bottom-up, non-deterministic query automata. The full evaluation of such queries, i.e., access to the information stored along with its relative position, is quite possible on documents compressed under SLR, without any need for decompression; on a document compressed under SLCF, query satisfiability can be checked efficiently, but full query evaluation appears to be more intricate.

Several mechanisms have been proposed in the literature for querying (compressed, unranked) documents. The one proposed in [2] is a priori for query satisfiability, and employs ‘tree-like’ automata; it can be either top-down or bottom-up, on dags. That of [7] is a mixture of top-down and bottom-up analysis, based on the SLR vision, suitable for query evaluation on dags, for a fragment of Core XPath, but it is not hard to extend it a little farther. [17] proposes a very general query mechanism named query automaton (QA), which is a 2-way tree automaton with specified selecting states: a selection label ‘1’ is attributed to some specified states, and ‘0’ or ‘⊥’ to the others; this works only for trees, however.

For the view presented below, our concern will be limited to a class of queries that are expressible inside a restricted XPath format; they correspond to the patterns using the / and // of XPath, as defined in [15], possibly with some filters (or ‘branches’, as they are called there). Besides being simple, this limitation will also have the advantage that such queries can be visualized as usual bottom-up, non-deterministic tree automata with specified selecting states. (Note: this, incidentally, will also allow us to consider n-ary queries.)

This paper is structured as follows. The notation and the preliminaries are given in Section 2. In Section 3, we define the notion of patterns (as in [15]), and show how to visualize them, naturally, as bottom-up query automata (QA); we also show, on an example, how to evaluate the query corresponding to a given pattern p on a given document t, via a bottom-up run on t of the QA associated with p, if the document is compressed under SLR; on the same example, we will also show that, if the document gets compressed as SLCF, the query is not that easy to evaluate (although it can be tested for satisfiability).

A final remark before closing this section: although we noted above that bottom-up runs of non-deterministic TA on dags can be ‘problematic’, no such complication actually arises for the QAs defined by the patterns we consider here; so, we do not need to determinize the QA on the fly, along its evaluating runs. Moreover, it can be shown that the answer to the query defined by any given pattern on a given SLR-compressed document t will be the ‘same’ as when the query is evaluated on the uncompressed tree equivalent of t.

2 Notation and Preliminaries

We assume given an unranked base alphabet Σ. Trees and dags over Σ are defined as usual, each of their nodes bearing a name that is a symbol from Σ. (Formal definitions do not seem needed.) A first example to show what we mean by fully or partially compressed dags is the following:

[Figure: the same document f(a, a, b, a) in three formats: as a tree, as a partially compressed dag, and as a fully compressed dag.]

The most elegant and efficient way to distinguish between these 3 formats of the ‘same document’ (same information, but stored with more or less optimization) is to associate to each format t, canonically, a straightline regular tree grammar (SLR) G_t: canonical in the sense that G_t recognizes exactly t, and nothing else. The three respective SLR for the above 3 formats are as follows:

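X0 → f(X1, X2, X3, X4), X1 → a, X2 → a, X3 → b, X4 → a (no sharing: the tree)

X0 → f(Y1, Y1, Y2, Y1), Y1 → a, Y2 → b (all three a-leaves shared: the fully compressed dag)

X0 → f(Z1, Z1, Z2, Z3), Z1 → a, Z2 → b, Z3 → a (two a-leaves shared: a partially compressed dag)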
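As a quick sanity check on the three grammars above, the following Python sketch unfolds each of them and confirms that they derive the same term. The dictionary encoding and the names `TREE`, `FULL`, `PARTIAL` and `unfold` are illustrative assumptions, not notation from the paper.

```python
# Each grammar maps a nonterminal to (node label, list of child nonterminals).
# Being straightline, no nonterminal is reachable from itself.
TREE = {"X0": ("f", ["X1", "X2", "X3", "X4"]),
        "X1": ("a", []), "X2": ("a", []), "X3": ("b", []), "X4": ("a", [])}
FULL = {"X0": ("f", ["Y1", "Y1", "Y2", "Y1"]),
        "Y1": ("a", []), "Y2": ("b", [])}
PARTIAL = {"X0": ("f", ["Z1", "Z1", "Z2", "Z3"]),
           "Z1": ("a", []), "Z2": ("b", []), "Z3": ("a", [])}

def unfold(grammar, nt="X0"):
    """Expand nonterminal nt into the unique term it derives, as a string."""
    label, children = grammar[nt]
    if not children:
        return label
    return label + "(" + ", ".join(unfold(grammar, c) for c in children) + ")"

for g in (TREE, FULL, PARTIAL):
    # number of productions (= number of dag nodes) and the derived term
    print(len(g), unfold(g))
```

The three grammars have 5, 3 and 4 productions respectively, which is the number of nodes of the corresponding dag, while the derived term is the same in each case.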

The notions of child, parent, descendant, ancestor are all defined, in a natural manner, on the set of nodes of any given dag. A node is a root (resp. leaf) of a dag iff it has no parent (resp. child). All our dags will be assumed rooted, i.e., to have a unique root node. It is then easy to associate a set of positions to any given node on any given dag, such that the following holds: a dag is a tree iff the set of positions of each of its nodes is a singleton. (Note: a node on a dag can have more than one parent.)
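One possible reading of ‘set of positions’ can be sketched in Python as follows, assuming that a position is the sequence of 1-based child indices along a path from the root (this encoding is an assumption; the text does not fix one). The function names are hypothetical, and the dag is given by an SLR in a dictionary encoding chosen for illustration.

```python
from collections import defaultdict

# nonterminal -> (node label, list of child nonterminals)
TREE = {"X0": ("f", ["X1", "X2", "X3", "X4"]),
        "X1": ("a", []), "X2": ("a", []), "X3": ("b", []), "X4": ("a", [])}
FULL = {"X0": ("f", ["Y1", "Y1", "Y2", "Y1"]),
        "Y1": ("a", []), "Y2": ("b", [])}

def positions(grammar, root="X0"):
    """Map each dag node (nonterminal) to the set of its positions.

    Every root-to-node path is enumerated, so this can take time
    exponential in the size of a highly compressed dag.
    """
    pos = defaultdict(set)

    def walk(nt, path):
        pos[nt].add(path)
        for i, child in enumerate(grammar[nt][1], start=1):
            walk(child, path + (i,))

    walk(root, ())
    return dict(pos)

def is_tree(grammar, root="X0"):
    """A dag is a tree iff every node has exactly one position."""
    return all(len(p) == 1 for p in positions(grammar, root).values())

print(positions(FULL)["Y1"])          # the shared a-node has three positions
print(is_tree(TREE), is_tree(FULL))
```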


We also need the notion of a bottom-up tree automaton over an unranked alphabet; to facilitate understanding, we first recall the notion over a ranked alphabet; the definition is easily extended to the unranked case.

Definition 1 A bottom-up tree automaton (TA) over a ranked alphabet Σ is a tuple (Σ, Q, F, Δ), where Q is a finite non-empty set of states, F ⊆ Q is the set of final (or accepting) states, and Δ is a set of transition rules of the form f(q1, ..., q_k) → q, where f ∈ Σ is of rank k, and q1, ..., q_k, q ∈ Q.

Now, a transition f(q1, ..., q_k) → q can also be written as f(q1 ... q_k) → q, where q1 ... q_k is seen as a word in Q^∗ that has to be of length = rank(f) in the ranked case. So the extension to the unranked case is easy: it suffices to define the transitions to be of the form f(ω) → q, where ω ∈ Q^∗, and f ∈ Σ. A TA is said to be bottom-up deterministic iff whenever there are two transition rules of the form f(ω) → q, f(ω′) → q′, with q ≠ q′, we have necessarily ω ∩ ω′ = ∅; otherwise it is said to be non-deterministic. We also agree to denote the transitions of the form f(∅) → q simply as f → q, and refer to them as initial transitions.

Example. We come now to our second example: to the right of the figure below is the tree format of a document, of which the fully compressed dag format is to the left. Over the unranked signature {a, b, f, g} we consider the bottom-up TA A, with the following transitions:

a → p, b → p, b → q, a(p) → q, a(q) → p, g(q Q^∗) → q, g(p q) → p, f(q p q) → q_fin, f(p Q^∗) → q_fin,

with Q = {p, q, q_fin}, q_fin being the unique accepting state.

[Figure: to the left, the fully compressed dag of the document; to the right, its tree format f(g(a(b), b), g(a(b), b), b).]

It is not hard to check that there is an accepting run of the TA on the tree to the right. And if we want to run the TA directly on the dag to the left, we may have to determinize the run on the fly, as and when we move up, in general. This can be done more elegantly, and more naturally, by seeing the dag as its SLR, by seeing its productions as well as the transitions of the TA as rewrite rules, and by using innermost rewriting. For instance, the SLR for the dag to the left above is as follows:

X0 → f(X1, X1, X2), X1 → g(X3, X2), X3 → a(X2), X2 → b.

We get by innermost rewriting:

X2 = b → {p, q}, X3 = a(X2) → {p, q}, X1 → g({p, q}{p, q}), and finally: X0 → f({p, q}{p, q}{p, q}).

And we end up by checking that q_fin is in the set f({p, q}{p, q}{p, q}).

This is how one would proceed, in general, for deciding acceptance under the bottom-up runs of non-deterministic TAs on dags. However, to any query defined by a pattern of the limited format that we shall be considering below, we shall naturally associate a bottom-up non-deterministic TA, called its query automaton (QA); for such QA, the selecting runs can be computed more directly (without on the fly determinizations).
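The set-valued innermost rewriting of this example can be sketched in Python as follows. This is only an illustration under stated assumptions: each state is encoded as one character (`f` stands for q_fin), each horizontal language such as q Q^∗ is written as a Python regular expression over these characters, and the names `RULES`, `DAG` and `states` are hypothetical.

```python
import re
from itertools import product

Q = "[pqf]"  # one character per state; "f" stands for q_fin

# (node label, regex on the word of child states, target state)
RULES = [
    ("a", "", "p"), ("b", "", "p"), ("b", "", "q"),
    ("a", "p", "q"), ("a", "q", "p"),
    ("g", "q" + Q + "*", "q"), ("g", "pq", "p"),
    ("f", "qpq", "f"), ("f", "p" + Q + "*", "f"),
]

# SLR of the fully compressed dag: nonterminal -> (label, child nonterminals)
DAG = {"X0": ("f", ["X1", "X1", "X2"]),
       "X1": ("g", ["X3", "X2"]),
       "X3": ("a", ["X2"]),
       "X2": ("b", [])}

def states(grammar, nt, memo):
    """Set of states reachable at nt; each shared node is evaluated once."""
    if nt not in memo:
        label, children = grammar[nt]
        child_sets = [states(grammar, c, memo) for c in children]
        reached = set()
        # try every word of child states, one state per child occurrence
        for word in product(*child_sets):
            w = "".join(word)
            for sym, lang, target in RULES:
                if sym == label and re.fullmatch(lang, w):
                    reached.add(target)
        memo[nt] = frozenset(reached)
    return memo[nt]

memo = {}
accepted = "f" in states(DAG, "X0", memo)
for nt in ("X2", "X3", "X1", "X0"):
    print(nt, sorted(memo[nt]))
print("accepted:", accepted)
```

Because two occurrences of the shared node X1 are allowed to pick different states from its set, the computation agrees with acceptance on the unfolded tree, which is the purpose of determinizing along the run.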

3 Query Automata for Patterns

We henceforth assume known the usual notions and terminology of XML, and of XPath which is a language generally employed for querying XML documents.

For simplicity, we only consider unary queries, i.e., queries that return a set of nodes and/or the data attached to these nodes. We are interested here in a limited sub-class of XPath expressions which can also be seen naturally as non-deterministic bottom-up QAs. These expressions are all representable as ‘patterns with selection labels’ with the help of the symbols /, //, ∗ of XPath, along with those from the (unranked) base alphabet Σ, and the filter expressions of XPath. For instance, the XPath expression a//∗[b//d][c] is represented as the following pattern, where a, b, c, d are in the base alphabet, ‘∗’ is the wildcard, and s is the selection symbol:

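[Figure: pattern tree for a//∗[b//d][c]: the root a has a descendant-edge to a ∗-node carrying the selection symbol s; the ∗-node has a child b, which has a descendant-edge to d, and a child c.]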

A node u on any given document t is an answer to the query of this pattern, iff:

- the root node of t bears the name a, and u is a descendant of the root,
- and u has two children nodes named b and c, the child named b itself having a descendant d.


The XPath expressions that we are interested in here can be defined formally in terms of simple grammars; for instance, as in [15], one can define them by:

P ::= σ | ∗ | . | P/P | P//P | P[P]

where σ ∈ Σ, ‘∗’ is the wildcard of XPath, and ‘.’ stands for the current node position. To each such expression, one can associate a pattern tree (pattern, for short), as in the above example (cf. [15] for details). A pattern is essentially a (rooted) tree with usual edges, plus some distinguished “double-edges” (as in the above example), referred to as descendant-edges; the nodes are named from Σ ∪ {∗}; in addition, exactly one of the nodes is assigned a selection symbol: 1 or s. (Note: we are only concerned here with unary queries.) An essential difference between the usual trees and patterns is that the “set of outgoing nodes” from a node, on a pattern, is not ordered. (Thus, in the above pattern example, it is not required that the b-child of the node u to be selected be to the left of its c-child.)

A point we want to drive home now is that a QA can be associated to any pattern in a natural manner. We shall do that only on an example, with which we will continue in the next section. Consider, for instance, the following pattern:

[Figure: pattern tree over the node names ∗, c, c, d, a, with the selection symbol s attached to the d-node.]

This pattern represents the following XPath expression: //c[c]//d[.//a], which is short for: //c[child : c]//d[descendant : a]. The QA associated with this pattern is the bottom-up non-deterministic TA over the alphabet {a, c, d, ∗}, with Q = {q_in, q0, q1, q2, q3, q_acc} as its set of states, q_acc as the accepting state, q1 as the selecting state, and the following transitions:

∗ → q_in
∗(Q^∗ q_in Q^∗) → q_in
a → q0
a(Q^∗ q_in Q^∗) → q0
∗(Q^∗ q0 Q^∗) → q0
d(Q^∗ q0 Q^∗) → q1
∗(Q^∗ q1 Q^∗) → q1
c(Q^∗ q_in Q^∗) → q2
c(Q^∗ q1 Q^∗ q2 Q^∗) → q3
c(Q^∗ q2 Q^∗ q1 Q^∗) → q3
∗(Q^∗ q3 Q^∗) → q3
∗(Q^∗ q3 Q^∗) → q_acc
∗(Q^∗ q_acc Q^∗) → q_acc

(The Q^∗ stands for any string over the set Q.) The semantics of the transitions are as follows: at any leaf node other than an a-named one we would be in state q_in; when we reach an a-node we would be in state q0; when we reach a d-node above a we would be in state q1; at a c-node, ‘in general’ we would be in q2, but if it is above a d-node, we would be in q3; finally, at any node strictly above a “c-node reached at state q3”, we would be in the accepting state q_acc. (Note: a query automaton returns a selected node only on an accepted document.)

4 Querying under SLR or under SLCF

We show here, briefly, how the query //c[child : c]//d[descendant : a] can be evaluated on the following ‘document’:

t=c(c(a, a), d(c(a, a), c(c(a, a), d(c(a, a), c(a, a))))),

under the runs of the QA defined in the previous section. The example has been borrowed from [3]; as was shown there, the following SLCF, with S, A, B(y), C as non-terminals and S being the axiom, gives a ‘nice’ compression of this tree:

S → B(B(C))
B(y) → c(C, d(C, y))
C → c(A, A)
A → a
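That this SLCF derives exactly the document t can be checked with a few lines of Python. The sketch below is illustrative only: the tuple encoding of terms, with the parameter y represented by the string "y", and the names `GRAMMAR` and `expand` are assumptions made for the example.

```python
# A term is either a parameter name (a str) or a pair (symbol, [subterms]).
# GRAMMAR maps each non-terminal to (parameter list, right-hand side).
GRAMMAR = {
    "S": ([], ("B", [("B", [("C", [])])])),
    "B": (["y"], ("c", [("C", []), ("d", [("C", []), "y"])])),
    "C": ([], ("c", [("A", []), ("A", [])])),
    "A": ([], ("a", [])),
}

def expand(term, env):
    """Fully decompress a term into the string form of the tree it derives."""
    if isinstance(term, str):            # a parameter: look up its value
        return env[term]
    head, args = term
    args = [expand(a, env) for a in args]
    if head in GRAMMAR:                  # non-terminal: substitute into its rule
        params, body = GRAMMAR[head]
        return expand(body, dict(zip(params, args)))
    return head + ("(" + ", ".join(args) + ")" if args else "")

T = "c(c(a, a), d(c(a, a), c(c(a, a), d(c(a, a), c(a, a)))))"
derived = expand(GRAMMAR["S"][1], {})
print(derived == T)
```

The context non-terminal B(y) is what distinguishes SLCF from SLR here: it shares the repeated shape c(C, d(C, ·)) around a hole, which a dag cannot share.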

Observe first that our QA, if run on the tree t, would select the subterm d(c(a, a), c(a, a)), the position of which (on t) can be deduced by following the path from the d-node to which is assigned the selecting state q1, to the root.
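The run of the QA on the tree t can be reproduced with the following Python sketch. It is a small illustration under assumptions, not the evaluation procedure of the paper: states are encoded as single characters (`i` for q_in, `0` to `3` for q0 to q3, `A` for q_acc), the wildcard rules are read as applying to every label, horizontal languages are Python regular expressions, and a node is reported when it enters q1 through the d-rule within an accepting run. All function names are hypothetical.

```python
import re
from itertools import product

Q = "[i0123A]"  # i = q_in, 0..3 = q0..q3, A = q_acc

def around(*states):
    """Regex for Q* s1 Q* s2 ... Q*: the states occur, in order, among the children."""
    return Q + "*" + (Q + "*").join(states) + Q + "*"

# (node label or "*" for the wildcard, regex on child states, target state)
RULES = [
    ("*", "", "i"), ("*", around("i"), "i"),
    ("a", "", "0"), ("a", around("i"), "0"), ("*", around("0"), "0"),
    ("d", around("0"), "1"), ("*", around("1"), "1"),
    ("c", around("i"), "2"),
    ("c", around("1", "2"), "3"), ("c", around("2", "1"), "3"),
    ("*", around("3"), "3"), ("*", around("3"), "A"), ("*", around("A"), "A"),
]

A = ("a", ())
C = ("c", (A, A))
D = ("d", (C, C))
T = ("c", (C, ("d", (C, ("c", (C, D))))))   # the document t

def up(node, pos, table):
    """Bottom-up pass: table[pos][state] = list of (rule label, child word)."""
    label, kids = node
    kid_sets = [up(k, pos + (i,), table) for i, k in enumerate(kids, 1)]
    reach = {}
    for word in product(*kid_sets):
        for sym, lang, target in RULES:
            if sym in ("*", label) and re.fullmatch(lang, "".join(word)):
                reach.setdefault(target, []).append((sym, word))
    table[pos] = reach
    return sorted(reach)

def down(node, pos, state, table, seen, answers):
    """Top-down pass restricted to accepting runs; collects the selected positions."""
    if (pos, state) in seen:
        return
    seen.add((pos, state))
    for sym, word in table[pos][state]:
        if state == "1" and sym == "d":
            answers.add(pos)
        for i, (kid, s) in enumerate(zip(node[1], word), 1):
            down(kid, pos + (i,), s, table, seen, answers)

def evaluate(tree):
    table, answers = {}, set()
    if "A" in up(tree, (), table):       # the document must be accepted
        down(tree, (), "A", table, set(), answers)
    return answers

def show(node):
    label, kids = node
    return label + ("(" + ", ".join(show(k) for k in kids) + ")" if kids else "")

def subterm(node, pos):
    for i in pos:
        node = node[1][i - 1]
    return node

for pos in sorted(evaluate(T)):
    print(pos, show(subterm(T, pos)))
```

On this document the sketch reports the single position (2, 2, 2), whose subterm is d(c(a, a), c(a, a)), in agreement with the observation above.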

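[Figure: to the left, the fully compressed dag format of t; to the right, the same dag with the path to the selected d-node drawn in full arrows.]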

Now, the dag to the left of the figure above is the fully compressed dag format for t; and it is not hard to check that our QA can run bottom-up just as well on this dag (without needing any determinization on the fly), as on the tree t; the dag will be accepted, and the d-node to the bottom-right will be assigned the selecting state q1; the set of positions on a dag being well-defined, one deduces easily the position(s) of the selected node: it corresponds here to the path formed of the full arrows in the dag to the right. It is easy then to ‘output’ the sub-dag at that position (or its SLR), as the answer to our query.
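The fully compressed dag of t, or equivalently its SLR, can be obtained by sharing equal subterms. The Python sketch below uses plain hash-consing for this; the technique and the names `compress` and `size` are assumptions chosen for illustration, and the non-terminals are numbered bottom-up, so the axiom gets the highest index rather than X0.

```python
A = ("a", ())
C = ("c", (A, A))
D = ("d", (C, C))
T = ("c", (C, ("d", (C, ("c", (C, D))))))   # the document t

def size(term):
    """Number of nodes of the uncompressed tree."""
    return 1 + sum(size(k) for k in term[1])

def compress(term, table):
    """Return the non-terminal of term; one production per distinct subterm."""
    label, kids = term
    key = (label, tuple(compress(k, table) for k in kids))
    if key not in table:
        table[key] = "X%d" % len(table)
    return table[key]

table = {}
axiom = compress(T, table)
for (label, kids), nt in table.items():
    rhs = label + ("(" + ", ".join(kids) + ")" if kids else "")
    print(nt, "->", rhs)
print("axiom:", axiom, "| tree nodes:", size(T), "| dag nodes:", len(table))
```

Here the 19 nodes of the tree collapse to 6 dag nodes, and the selected subterm d(c(a, a), c(a, a)) is represented by a single production.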

On the other hand, our QA can also ‘run’ on the SLCF compressed format given above for the document t; we will then get the following:


- there is an innermost rewrite derivation from the axiom S to the state q_acc;
- and one of the intermediary terms thus derived gets selected.

In other words, we can conclude that the query defined by the pattern does have an answer, ‘somewhere’. It could also be possible to say ‘exactly where’, using some additional and more intricate syntax, based on the BPLEX algorithm of [3, 4]. But, although BPLEX is ‘bottom-up multiplex’, it is not clear if it can output the exact SLCF grammar corresponding to the answer of the query; more precisely, it does not seem simple to relate the SLCF grammar of the answer to the SLCF of the given document.
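The satisfiability test on the SLCF can be sketched in Python by evaluating each non-terminal on sets of states, a context non-terminal such as B(y) being treated as a function from the state set of its argument to a state set. This is only one possible way to organize the computation, given under assumptions: the single-character state encoding, the regex form of the horizontal languages, and the names `step` and `eval_states` are all illustrative.

```python
import re
from itertools import product

Q = "[i0123A]"  # i = q_in, 0..3 = q0..q3, A = q_acc

def around(*states):
    """Regex for Q* s1 Q* s2 ... Q*."""
    return Q + "*" + (Q + "*").join(states) + Q + "*"

RULES = [
    ("*", "", "i"), ("*", around("i"), "i"),
    ("a", "", "0"), ("a", around("i"), "0"), ("*", around("0"), "0"),
    ("d", around("0"), "1"), ("*", around("1"), "1"),
    ("c", around("i"), "2"),
    ("c", around("1", "2"), "3"), ("c", around("2", "1"), "3"),
    ("*", around("3"), "3"), ("*", around("3"), "A"), ("*", around("A"), "A"),
]

# Non-terminal -> (parameters, right-hand side); parameters are plain strings.
GRAMMAR = {
    "S": ([], ("B", [("B", [("C", [])])])),
    "B": (["y"], ("c", [("C", []), ("d", [("C", []), "y"])])),
    "C": ([], ("c", [("A", []), ("A", [])])),
    "A": ([], ("a", [])),
}

def step(label, kid_sets):
    """States reachable at a node with this label, given its children's state sets."""
    out = set()
    for word in product(*kid_sets):
        w = "".join(word)
        out |= {t for sym, lang, t in RULES
                if sym in ("*", label) and re.fullmatch(lang, w)}
    return frozenset(out)

memo = {}

def eval_states(term, env):
    """State set of a term, without ever building the decompressed tree."""
    if isinstance(term, str):                     # parameter
        return env[term]
    head, args = term
    arg_sets = tuple(eval_states(a, env) for a in args)
    if head in GRAMMAR:                           # memoize per (rule, arguments)
        key = (head, arg_sets)
        if key not in memo:
            params, body = GRAMMAR[head]
            memo[key] = eval_states(body, dict(zip(params, arg_sets)))
        return memo[key]
    return step(head, arg_sets)

satisfiable = "A" in eval_states(GRAMMAR["S"][1], {})
print("query has an answer somewhere:", satisfiable)
```

The sketch only answers the yes/no question; it keeps no record of where in the document the selecting state was reached, which is the part the text describes as more intricate.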

Before we close this section, a few observations are perhaps in order:

i) Several of the usual XPath based queries on XML documents – or their skeletons – can be visualized as pattern trees.

ii) Bottom-up unranked query automata can be derived naturally from such patterns.

iii) These unranked query automata are (not easy to determinize, but they) run bottom-up on compressed dags, just as easily as on their corresponding uncompressed trees; and the answer sets obtained correspond.

iv) Compression based on SLR, and the bottom-up evaluation technique described above for queries defined by our patterns, can both continue to function, without any major modifications, on documents that are subject to rule-based access control policies (RBAC). It is not clear if the same holds for SLCF-based compression.

5 Conclusion

We have tried to show that although compression under SLR ensures at best only a single exponential space optimization, it has a more query-friendly behavior on several fronts than compression based on SLCF. We have also shown that non-deterministic, bottom-up tree automata are natural and intuitive candidates for fully evaluating a certain class of queries in the XPath format, visualizable as patterns, even on documents compressed as dags (it is actually the formulation as a pattern that gives the clue for deriving an automaton corresponding to the query). The pattern-based view for queries seems to have other advantages as well: for instance, n-ary queries can be handled quite naturally in that set up.

Moreover, the patterns of the format studied here can also be directly deﬁned in terms of a suitable grammar; this is of help in handling other problems, such as that of pattern containment, via rewrite techniques (cf. e.g., [8]).

References

[1] S. Anantharaman, P. Narendran, M. Rusinowitch, Closure Properties and Decision Problems of Dag Automata, In Information Processing Letters, 94(5):231–240, 2005.

[2] P. Buneman, M. Grohe, C. Koch, Path queries on compressed XML, In Proc. of the 29th Conf. on VLDB, 2003, pp. 141–152, Ed. Morgan Kaufmann.


[3] G. Busatto, M. Lohrey, S. Maneth, Grammar-Based Tree Compression. EPFL Tech. Report IC/2004/80, http://icwww.epfl.ch/publications.

[4] G. Busatto, M. Lohrey, S. Maneth, Efficient Memory Representation of XML Documents, In Proc. DBPL’05, LNCS 3774, pp. 199–216, Springer-Verlag, 2005.

[5] W. Charatonik, Automata on DAG Representations of Finite Trees, Technical Report MPI-I-99-2-001, Max-Planck-Institut für Informatik, Saarbrücken, Germany.

[6] H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, M. Tommasi, Tree Automata Techniques and Applications, http://www.grappa.univ-lille3.fr/tata/

[7] B. Fila, S. Anantharaman, Automata for Positive Core XPath Queries on Compressed Documents, In Proc. of the Int. Conf. LPAR-13, pp. 467–481, LNAI 4246, Springer-Verlag, November 2006.

[8] B. Fila-Kordy, A Rewrite Approach for Pattern Containment, In Proc. WADT’2008, Pisa, June 2008.

[9] M. Frick, M. Grohe, C. Koch, Query Evaluation of Compressed Trees, In Proc. of LICS’03, pp. 188–197, IEEE.

[10] G. Gottlob, C. Koch, Monadic Queries over Tree-Structured Data, In Proc. of LICS’02, pp. 189–202, IEEE.

[11] G. Gottlob, C. Koch, Monadic Datalog and the Expressive Power of Languages for Web Information Extraction, In Journal of the ACM, 51(1):12–28, 2004.

[12] G. Gottlob, C. Koch, R. Pichler, L. Segoufin, The complexity of XPath query evaluation and XML typing, In Journal of the ACM, 52(2):284–335, 2005.

[13] W. Martens, F. Neven, On the complexity of typechecking top-down XML transformations, In Theoretical Computer Sc., 336(1):153–180, 2005.

[14] M. Marx, XPath and Modal Logics for Finite DAGs, In Proc. of TABLEAUX’03, pp. 150–164, LNAI 2796, 2003.

[15] G. Miklau, D. Suciu, Containment and Equivalence for a Fragment of XPath, In Journal of the ACM, 51(1):2–45, 2004.

[16] F. Neven, Automata Theory for XML Researchers, In SIGMOD Record 31(3), September 2002.

[17] F. Neven, T. Schwentick, Query automata over finite trees, In Theoretical Computer Science, 275(1–2):633–674, 2002.

[18] A. Potthof, S. Seibert, W. Thomas, Nondeterminism versus determinism of finite automata over directed acyclic graphs, In Bull. Belgian Math. Society, 1:285–298, 1994.

[19] J.W. Thatcher, J.B. Wright, Generalized finite automata theory with an application to a decision problem of second-order logic, In Math. Syst. Theory, 2(1):57–81, 1968.