• Aucun résultat trouvé

AntoineAmarilli PierreSenellart OntheConnectionsbetweenRelationalandXMLProbabilisticDataModels

N/A
N/A
Protected

Academic year: 2022

Partager "AntoineAmarilli PierreSenellart OntheConnectionsbetweenRelationalandXMLProbabilisticDataModels"

Copied!
19
0
0

Texte intégral

(1)

On the Connections between

Relational and XML Probabilistic Data Models

Antoine Amarilli1 Pierre Senellart2

1´Ecole normale sup´erieure, Paris, France

& University of Oxford, Oxford, United Kingdom

2Institut Mines-T´el´ecom, T´el´ecom ParisTech, CNRS LTCI, Paris, France

(2)

Table of contents

1 Probabilistic data

2 Efficient models

3 Expressive models

4 Conclusion

2/18

(3)

Probabilistic data

Aprobabilistic instanceis a finite collection of possible states of the data (thepossible worlds) with associated probability.

p= 0.16 p = 0.04

document topic

#42 bncod

#42 oxford

document topic

#42 oxford

p= 0.64 p = 0.16

document topic

#42 bncod

document topic

(4)

Efficient representation

We want to represent thepossible worlds in a concise manner:

document topic p

#42 bncod 0.8

#42 oxford 0.2

We want to evaluate queriesefficiently on all possible worlds:

document p

#42 0.8

q(x) : SELECTdocument WHEREtopic = ’bncod’

4/18

(5)

Relational data and XML

Probabilisticrelationaland XMLdata models have been developed in isolation.

document

id

#42

ind

topic 0.8

bncod topic

0.2

oxford

document topic p

#42 bncod 0.8

#42 oxford 0.2

We will show how query complexity results in both models can be connected throughencodings from one model to the other.

(6)

Relational tuple-independent model

Each tuple is annotated with aprobability score independently from other tuples.

document topic p

#42 bncod 0.8

#42 oxford 0.2

This is not a veryexpressive model! For instance:

conference loc iso p bncod oxford gb 0.8 bncod london gb 0.2

6/18

(7)

Relational block-independent-disjoint model

Use a key to divide the relation attributes. For one value of the key (ablock), choose at most oneof the matching rows, independently between blocks.

name city iso p bncod oxford gb 0.8 bncod london gb 0.2 icalp riga lv 0.9 icalp riga lt 0.1

(8)

Complexity results

We will be interested inconjunctive queries(CQs), unions of conjunctive queries(UCQs), and therelational algebra. We always studydata complexity.

Relational algebra evaluation over BIDs is in FP#P [DS07b].

CQ evaluation over tuple-independent databases is FP#P-hard.

Finer results exist:

UCQs over tuple-independent databases : dichotomy between FP#P-hard queries and FP queries [DSS10] with a doubly exponential time test (exact complexity open).

CQs without self-joins over BID: similar dichotomy, polynomial-time test [DS07b].

CQ without self-joins over tuple-independent: FP#P-hard iff not hierarchical. [DS07a]. Being non-hierarchical is a sufficient condition for FP#P-hardness of any relational calculus query.

8/18

(9)

PrXML

{ind,mux}

model

Documents with additionalind nodes to keep or remove children independently, andmuxnodes to choose at most one child.

document

id

#42

ind

topic 0.8

bncod

topic 0.2

oxford

conference

name

bncod

mux

city 0.8

oxford

city 0.2

london

(10)

Complexity results

Tree-pattern queries (with joins)(TPQ(J)s): tree patterns labeled with constants and variables (possibly multiple occurrences), with child and descendent edges.

Results:

TPQJ evaluation over PrXML{ind,mux} is in FP#P [KNS11].

TPQJ evaluation over PrXML{ind,mux} is FP#P-hard.

More precisely:

TPQs over PrXML{ind,mux} are linear time (actually also true for MSO [CKS09]).

TPQJs with a single join over PrXML{ind,mux} are FP#P-hard if not equivalent to a join-free query [KNS11], with a

ΣP2-complete test.

10/18

(11)

Relational to XML

We can encode any BID table (and hence any tuple-independent table) to PrXML{mux,ind} in linear time:

>

block name

bncod

mux ind 0.8

city 1

oxford iso

1

gb

ind 0.2

city 1

london iso

1

gb

block name

icalp

mux ind 0.9

city 1

riga iso

1

lv

ind 0.1

city 1

riga iso

1

lt Likewise, we can encode CQs to TPQJs in linear time.

(12)

Complexity consequences

Some results from the relational or XML world can thus be reprovenfrom the corresponding result in the other world:

TPQJ evaluation over PrXML{ind,mux} is in FP#P

⇒CQ evaluation over BIDs is in FP#P

CQ evaluation over tuple-independent databases is FP#P-hard

⇒TPQJ evaluation over PrXML{ind} is FP#P-hard;

⇒We can identify examples of hard TPQJ query classes (e.g., the image of classes of non-hierarchical CQs) MSO evaluation over PrXML{ind,mux} is linear time

⇒Read-once relational algebra on BID is linear time Wecannot encode PrXML{ind,mux} to BIDs: the result of a query over PrXML{ind,mux} can be represented in PrXML{ind,mux}

whereas BIDs are not astrong representation systemfor CQs.

12/18

(13)

Relational pc-tables

Each tuple is annotated with aBoolean condition over variables drawn independently with an associatedprobability.

from to C

home ny pods

ny home pods∧ ¬icalp

ny riga icalp∧pods home riga icalp∧ ¬pods

riga home icalp p(pods) = 0.2, p(icalp) = 0.2

This can representany finite distribution. However, evaluating a single-atomCQ over a pc-table withonly conjunctions is already FP#P-hard.

(14)

PrXML

{fie}

model

Boolean conditions on children can be expressed withfienodes like inpc-tables.

>

fie

trip p

from home

to ny

trip p∧ ¬i

from ny

to home

trip pi

from ny

to riga

trip i∧ ¬p

from home

to riga

trip i

from riga

to home

14/18

(15)

Encodings

As before, we can encodepc-tables in PrXML{fie}.

In the other direction, we materialize thedescendant relationto handle descendant edges in TPQJs and the construction is cubic time.

>

fie trip

p

from ny

to fie riga

i

home

¬i

node desc C

> trip p

> from p

> ny p

> to p

> riga pi

> home p∧ ¬i

trip from t

trip ny t

trip to t

trip riga i

trip home ¬i

from ny t

to riga i

to home ¬i

node child C

> trip p

trip from t

trip to t

to riga icalp

to home ¬icalp

(16)

Complexity consequences

TPQJ evaluation over PrXML{fie} is in FP#P

⇒CQ evaluation over pc-tables is in FP#P CQ evaluation over pc-tables is in FP#P

⇒TPQJ evaluation over PrXML{fie} is in FP#P Certain TPQJ queries are FP#P-hard over PrXML{fie}

⇒Examples of FP#P-hard relational queries

16/18

(17)

Probabilistic data Efficient models Expressive models Conclusion

Summary and open problems

The query evaluation problems on relational and XML probabilistic data models have been studied in isolation.

Simple encodings shows that the broad results (FP#P membership and hardness) can be transferred from one setting to the other.

There is no straightforward correspondence for finer results (e.g., dichotomies).

The connection between the two settings could be used to suggesttractable classesof queries and instances that are the preimage of a known tractable class in the other setting.

Thanks for your attention!

(18)

Summary and open problems

The query evaluation problems on relational and XML probabilistic data models have been studied in isolation.

Simple encodings shows that the broad results (FP#P membership and hardness) can be transferred from one setting to the other.

There is no straightforward correspondence for finer results (e.g., dichotomies).

The connection between the two settings could be used to suggesttractable classesof queries and instances that are the preimage of a known tractable class in the other setting.

Thanks for your attention!

17/18

(19)

References

Sara Cohen, Benny Kimelfeld, and Yehoshua Sagiv,Running tree automata on probabilistic XML, PODS, 2009.

Nilesh N. Dalvi and Dan Suciu, Efficient query evaluation on probabilistic databases, VLDB Journal 16(2007), no. 4.

,Management of probabilistic data: foundations and challenges, PODS, 2007.

Nilesh N. Dalvi, Karl Schnaitter, and Dan Suciu, Computing query probability with incidence algebras, PODS, 2010.

Evgeny Kharlamov, Werner Nutt, and Pierre Senellart,Value joins are expensive over (probabilistic) XML, Proc. LID, March 2011.

Références

Documents relatifs

On peut multiplier les exemples de pratiques parentales mises en difficulté par les prescriptions scolaire, comme les demandes d’intervention des parents pour

The upper bound ÉK(i) is discussed in Section 3, first for p-groups and then (by a standard p-Sylow subgroup argument) for arbitrary finite groups.. Then the

The second study compares systems trained in matched conditions (i.e., training and testing with the same speaking style). Here the Lombard speech affords a large increase in

Previous work on structural testing of object- oriented software exploits data flow analysis to derive test requirements for class testing and defines contextual def-use associations

Apart from the purely mathematical interest in develop- ing the connection between problems of determining cov- ering numbers of function classes to those of determining entropy

The paper discusses some stochastic properties and parameter estimation for three types of stationary AR(1) processes involving the Exponential distribution in three different

For the Gaussian sequence model, we obtain non-asymp- totic minimax rates of es- timation of the linear, quadratic and the ℓ 2 -norm functionals on classes of sparse vectors

Bounded cohomology, classifying spaces, flat bundles, Lie groups, subgroup distortion, stable commutator length, word metrics.. The authors are grateful to the ETH Z¨ urich, the MSRI