• Aucun résultat trouvé

Aggregate Queries for Discrete and Continuous Probabilistic XML

N/A
N/A
Protected

Academic year: 2022

Partager "Aggregate Queries for Discrete and Continuous Probabilistic XML"

Copied!
54
0
0

Texte intégral

(1)

Free University of Bozen-Bolzano Télécom ParisTech

S. Abiteboul, T-H. H. Chan, E. Kharlamov, W. Nutt, P. Senellart INRIA Saclay – Île-de-France

The University of Hong Kong

Aggregate Queries for

Discrete and Continuous Probabilistic XML

1 2 1, 3 3 4

1

2 3

4

ICDT, March 2010

(2)

Outline

1. Probabilistic data 2. Problem definition

3. Aggregating discrete Probabilistic XML

4. Aggregating continuous Probabilistic XML

2 /27

(3)

Applications of

Probabilistic Data

Approximate query processing: ranking, linkage

Information extraction: approximate search for entities (e.g. names) in text

Sensor data: imprecise or missing readings

...

3 /27

(4)

Probabilistic Database

P = 0.3 P = 0.2 P = 0.5

4

Probabilistic DB:

/27

(5)

Probabilistic Database

P = 0.3 P = 0.2 P = 0.5

Q a

Q Q

a

4

Probabilistic DB:

/27

(6)

Probabilistic Database

P = 0.3 P = 0.2 P = 0.5

Q a

Q Q

a (a, 0.8)

4

Answer:

Probabilistic DB:

/27

(7)

Probabilistic Database

P = 0.3 P = 0.2 P = 0.5

Q a

Q Q

a (a, 0.8)

4

Answer:

Probabilistic DB: Representation of Prob DB:

/27

(8)

Probabilistic Database

P = 0.3 P = 0.2 P = 0.5

Q a

Q Q

a (a, 0.8)

4

Answer:

Probabilistic DB: Representation of Prob DB:

Q

(a, 0.8)

/27

(9)

Probabilistic Database

P = 0.3 P = 0.2 P = 0.5

Q a

Q Q

a (a, 0.8)

4

Answer:

Probabilistic DB: Representation of Prob DB:

Q

(a, 0.8)

/27

(10)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

[Kimelfed&al:2007] [Senellart&al:2007]

f - event: “weather is fine”

Pr(f) = .4

MUX - mutually exclusive options

5 /27

(11)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

[Kimelfed&al:2007] [Senellart&al:2007]

f - event: “weather is fine”

Pr(f) = .4

MUX - mutually exclusive options

5

g ∧ k ∧¬ m

g ∧ ¬ h

/27

(12)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

[Kimelfed&al:2007] [Senellart&al:2007]

f - event: “weather is fine”

Pr(f) = .4

MUX - mutually exclusive options

5 /27

(13)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

[Kimelfed&al:2007] [Senellart&al:2007]

f - event: “weather is fine”

Pr(f) = .4

MUX - mutually exclusive options

Semantics a world d:

f = true, Pr(f) = 0.4

MUX: 23, Pr(23) = 0.1 Pr(d) = 0.4 x 0.1

5 /27

(14)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

[Kimelfed&al:2007] [Senellart&al:2007]

f - event: “weather is fine”

Pr(f) = .4

MUX - mutually exclusive options

Semantics a world d:

f = true, Pr(f) = 0.4

MUX: 23, Pr(23) = 0.1 Pr(d) = 0.4 x 0.1

5 /27

(15)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

[Kimelfed&al:2007] [Senellart&al:2007]

f - event: “weather is fine”

Pr(f) = .4

MUX - mutually exclusive options

Semantics a world d:

f = true, Pr(f) = 0.4

MUX: 23, Pr(23) = 0.1 Pr(d) = 0.4 x 0.1

5 /27

(16)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

[Kimelfed&al:2007] [Senellart&al:2007]

f - event: “weather is fine”

Pr(f) = .4

MUX - mutually exclusive options

Semantics a world d:

f = true, Pr(f) = 0.4

MUX: 23, Pr(23) = 0.1 Pr(d) = 0.4 x 0.1

5 /27

(17)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

[Kimelfed&al:2007] [Senellart&al:2007]

f - event: “weather is fine”

Pr(f) = .4

MUX - mutually exclusive options

23

Semantics a world d:

f = true, Pr(f) = 0.4

MUX: 23, Pr(23) = 0.1 Pr(d) = 0.4 x 0.1

5 /27

(18)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

[Kimelfed&al:2007] [Senellart&al:2007]

f - event: “weather is fine”

Pr(f) = .4

MUX - mutually exclusive options

23

Semantics a world d:

f = true, Pr(f) = 0.4

MUX: 23, Pr(23) = 0.1 Pr(d) = 0.4 x 0.1

5 /27

(19)

Discrete Probabilistic XML Documents

Probabilistic XML document D

represents (exponentially) many documents d

each with a probability Pr(d)

It is achieved by

Conjunctions of event literals on edges.

Capture long-distance dependencies

Distributional nodes: Mux, Ind, Det, Exp.

Capture local (hierarchical) dependencies

6 /27

(20)

Discrete Probabilistic XML Documents

Probabilistic XML document D

represents (exponentially) many documents d

each with a probability Pr(d)

It is achieved by

Conjunctions of event literals on edges.

Capture long-distance dependencies

Distributional nodes: Mux, Ind, Det, Exp.

Capture local (hierarchical) dependencies

6

Special case of event formulas

/27

(21)

What is Known?

Answering simple XPath queries

Distributional nodes: PTIME

Events: FP -complete

Simple XPath over Mux-Det PXML with HAVING constraints:

PTIME for COUNT and MIN

NP-hard for SUM and AVG

#P

[Kimelfed&al:2007]

[Senellart&al:2007]

7

[Cohen&al:2008]

[Re&al:2007]

/27

(22)

What is Known?

Answering simple XPath queries

Distributional nodes: PTIME

Events: FP -complete

Simple XPath over Mux-Det PXML with HAVING constraints:

PTIME for COUNT and MIN

NP-hard for SUM and AVG

#P

[Kimelfed&al:2007]

[Senellart&al:2007]

7

NO events

[Cohen&al:2008]

[Re&al:2007]

/27

(23)

Outline

1. Probabilistic data 2. Problem definition

3. Aggregating discrete Probabilistic XML

4. Aggregating continuous Probabilistic XML

8 /27

(24)

Aggregate Queries

1. What is the average temperature across sensors?

2. What is the average temperature for sensor i25?

3. How often did sensors i25 and i33

give the same measurement simultaneously?

9

we want to answer queries with aggregate functions:

MIN/MAX, TopK, COUNT, SUM, COUNTD, AVG

/27

(25)

1. What is the average temperature across sensors?

2. What is the average temperature for sensor i25?

3. How often did sensors i25 and i33

give the same measurement simultaneously?

3.

id t vl

Y sensor

X i25

id t vl

Y sensor

X i33

Root

COUNT

Query Models

10

1.

AVG

2.

AVG Root

vl Y

Root

id vl

Y sensor

i25

/27

(26)

1. What is the average temperature across sensors?

2. What is the average temperature for sensor i25?

3. How often did sensors i25 and i33

give the same measurement simultaneously?

3.

id t vl

Y sensor

X i25

id t vl

Y sensor

X i33

Root

COUNT

Query Models

10

1.

AVG

2.

AVG Root

vl Y

Root

id vl

Y sensor

i25

/27

(27)

1. Single-Path queries - SP 2. Tree-Pattern queries - TP

3. Tree-Pattern queries with Joins - TPJ 3.

id t vl

Y sensor

X i25

id t vl

Y sensor

X i33

Root

COUNT

Query Models

10

1.

AVG

2.

AVG Root

vl Y

Root

id vl

Y sensor

i25

/27

(28)

Semantics of AQs

Query:

What is the average temperature?

What should be an answer?

11

AVG(d17) = 44, Pr(d17)=.6 AVG(d23) = 46, Pr(d23)=.1 AVG(d20) = 45, Pr(d20)=.3

sensor

vl vl

mes mes

vl

45 70 MUX

17 23 20

.6 .1 .3

/27

(29)

Semantics of AQs

Query:

What is the average temperature?

What should be an answer?

11

AVG(d17) = 44, Pr(d17)=.6 AVG(d23) = 46, Pr(d23)=.1 AVG(d20) = 45, Pr(d20)=.3 Distribution of aggregate values over all documents

represented by the PXML document

sensor

vl vl

mes mes

vl

45 70 MUX

17 23 20

.6 .1 .3

/27

(30)

Problems to Investigate for Discrete PXML

For PXML document D, constant C

Possible answers:

decide Pr(Q(D)=C) > 0

Probability computation:

compute Pr(Q(D)=C)

Moment computation:

compute E(Q(D) )

12

k E is “expected value”

/27

(31)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20 2 N(70,4) 1

i35

¬f f

.6 .1 .3

Incorporate continuous distributions in PXML leaves

Aggregate continuous PXML

Continuous PXML

13

At the moment there is no formal semantics for continuous probabilistic XML models

/27

(32)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20 2 N(70,4) 1

i35

¬f f

.6 .1 .3

Incorporate continuous distributions in PXML leaves

Aggregate continuous PXML

Continuous PXML

13

At the moment there is no formal semantics for continuous probabilistic XML models

/27

(33)

Outline

1. Probabilistic data 2. Problem definition

3. Aggregating discrete Probabilistic XML

4. Aggregating continuous Probabilistic XML

14 /27

(34)

Data Complexity of Query Answering

PXML Model Single Path Tree Pattern Tree Pat. Joins Query Language

Event

Conjunctions NP-complete NP-complete NP-complete Distributional

Nodes FP -complete P FP -complete FP -complete FP -complete

FP -complete#P

#P

What is difficult?

joins in queries

events in data

15 /27

(35)

Data Complexity of Query Answering

PXML Model Single Path Tree Pattern Tree Pat. Joins Query Language

Event

Conjunctions NP-complete NP-complete NP-complete Distributional

Nodes FP -complete P FP -complete FP -complete FP -complete

FP -complete#P

#P

What is difficult?

joins in queries

events in data

15

Is it getting more difficult with aggregation?

/27

(36)

Possible

Answers NP-complete NP-complete NP-complete Probability

Computation FP -complete FP -complete FP -complete

Moment

Computation FP -completeFP -completeFP -complete NP-complete

FP -comp

Aggregating PXML-Events

Problems Single Path Tree Pattern Tree Pat. Joins Aggregate Query Language

#P

16

FP -complete#P

COUNT, SUM: PTIME #P MIN, AVG

COUNTD:

Aggregates: COUNT, SUM, MIN, COUNTD, AVG Data-complexity

/27

(37)

Possible

Answers NP-complete NP-complete NP-complete Probability

Computation FP -complete FP -complete FP -complete

Moment

Computation FP -completeFP -completeFP -complete NP-complete

FP -comp

Aggregating PXML-Events

Problems Single Path Tree Pattern Tree Pat. Joins Aggregate Query Language

#P

16

FP -complete#P

COUNT, SUM: PTIME #P MIN, AVG

COUNTD:

Aggregates: COUNT, SUM, MIN, COUNTD, AVG Data-complexity

/27

(38)

PTIME FP FP -complete

PTIME AVG: FP

others: PTIME FP -complete PTIME

FP -complete SUM, AVG, COUNTD: FP -complete

COUNT, MIN: PTIME

Aggregating PXML with Distributional Nodes

#P

#P Possible

Answers Probability Computation Probability SUM in |input| +|output|

Moment Computation

COUNT, MIN: PTIME

Problems Single Path Tree Pattern Tree Pat. Joins SUM, AVG, COUNTD: NP-complete

COUNT, MIN : NP

#P

Aggregates: COUNT, SUM, MIN, COUNTD, AVG

17

Aggregate Query Language

Data-complexity

FP -complete#P

/27

(39)

PTIME FP FP -complete

PTIME AVG: FP

others: PTIME FP -complete PTIME

FP -complete SUM, AVG, COUNTD: FP -complete

COUNT, MIN: PTIME

Aggregating PXML with Distributional Nodes

#P

#P Possible

Answers Probability Computation Probability SUM in |input| +|output|

Moment Computation

COUNT, MIN: PTIME

Problems Single Path Tree Pattern Tree Pat. Joins SUM, AVG, COUNTD: NP-complete

COUNT, MIN : NP

#P

Aggregates: COUNT, SUM, MIN, COUNTD, AVG

17

Aggregate Query Language

Data-complexity

FP -complete#P

/27

(40)

Tractable Cases

Key components of tractability:

Hierarchical structure of PXML documents imposed by distributional nodes

Some aggregate functions

can exploit the hierarchy - monoid functions

18

Monoid: COUNT, SUM, MIN, TopK, PARITY, ...

Non Monoid: COUNTD, AVG

PTIME algorithm to compute distributions:

Bottom-up evaluation using convex sums and convolutions

/27

(41)

The Problem Space

19 Tree Pattern

Queries Tree Pattern Queries

with Joins

Single Path Queries

COUNT Distributional Nodes

(Mux-Det-Ind-Exp) Events

Query Language

Aggregate Functions Probabilistic XML

Document Model

Discrete

Continuous

Distributions SUM

MIN COUNTD AVG

Outside: intractable,

i.e., NP-, FP -complete Inside: PTIME

#P

/27

(42)

Approximating Query Answers

Many problems are NP- or FP -complete How good are Monte-Carlo methods?

By Hoeffding bound, to achieve

| E(α(D) ) - Estimate | < ε with Pr = 1-δ

at most O(R 1/ε log(1/δ)) samples is needed for α=COUNTD

quadratically many samples are needed

20

#P

k

2k 2

/27

(43)

Approximating Query Answers

Many problems are NP- or FP -complete How good are Monte-Carlo methods?

By Hoeffding bound, to achieve

| E(α(D) ) - Estimate | < ε with Pr = 1-δ

at most O(R 1/ε log(1/δ)) samples is needed for α=COUNTD

quadratically many samples are needed

20

#P

k

2k 2

/27

(44)

Outline

1. Probabilistic data 2. Problem definition

3. Aggregating discrete Probabilistic XML

4. Aggregating continuous Probabilistic XML

21 /27

(45)

Finite case:

finite sets of trees

where every tree has a non-zero probability

Continuous case:

infinite sets of trees

where some (infinite) subsets of trees have non-zero probability measure

22

Discrete vs Continuous Models

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20 2 N(70,4) 1

i35

¬f f

.6 .1 .3

/27

(46)

Finite case:

finite sets of trees

where every tree has a non-zero probability

Continuous case:

infinite sets of trees

where some (infinite) subsets of trees have non-zero probability measure

22

How to measure infinite sets of trees?

Discrete vs Continuous Models

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20 2 N(70,4) 1

i35

¬f f

.6 .1 .3

/27

(47)

1. Take a set S of trees with

real values on the leaves / share the same structure

Measuring Infinite Sets of Trees

23 t

1

vl 30 25

id mes mes

sensor

t vl

2 70

t 2

vl 30 40

id mes mes

sensor

t vl

3 100

S:

/27

(48)

1. Take a set S of trees with

real values on the leaves / share the same structure 2. collect labels of leaves as tuples of values

Subset L(S) of R

Measuring Infinite Sets of Trees

23 t

1

vl 30 25

id mes mes

sensor

t vl

2 70

t 2

vl 30 40

id mes mes

sensor

t vl

3 100

(25, 1, 30, 2, 70) (40, 2, 30, 3, 100) n

S:

L(S):

/27

(49)

3. Take a standard measure M on Borel subsets of R 4. Use the measure M on L(S)

5. Lift M from sets of tuples L(S) to sets of trees S

Measuring Infinite Sets of Trees

23 t

1

vl 30 25

id mes mes

sensor

t vl

2 70

t 2

vl 30 40

id mes mes

sensor

t vl

3 100

(25, 1, 30, 2, 70) (40, 2, 30, 3, 100)

S:

L(S):

n

/27

(50)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20 2 N(70,4) 1

i35

¬f f

.6 .1 .3

Continuous

PXML Documents

Extension of discrete PXML

with distribution functions attached to leaves

24

piecewise polynomials

Diracs

Gaussian

...

/27

(51)

Aggregation of CPXML:

Probability Computation

Tractable for

1. Data: CPXML with distributional nodes 2. Query: SP with monoid functions

Bottom-up algorithms based on convex sums and convolutions

Works when distributions on the leaves are closed under convolutions and convex sums

piecewise polynomials (SUM, MIN/MAX)

25

PTIME

/27

(52)

Summing Up

Comprehensive picture of complexity for discrete PXML aggregation:

PXML models with local, global dependencies

SP, TP, TPJ queries

COUNT, SUM, MIN, COUNTD, AVG

Continuous PXML model:

formal semantics

initial study of aggregation

26 /27

(53)

Thank you

(54)

References

[Kimelfeld&al:2007] - Benny Kimelfeld, Yehoshua Sagiv:

Matching Twigs in Probabilistic XML. VLDB 2007: 27-38

[Senellart&al:2007] -Pierre Senellart, Serge Abiteboul: On the complexity of managing probabilistic XML data. PODS 2007:

283-292

[Re&al:2007] - C. Ré and D. Suciu. Efficient evaluation of HAVING queries on a probabilistic database. DBPL 2007

[Cohen&al:2008] - Sara Cohen, Benny Kimelfeld, Yehoshua Sagiv: Incorporating constraints in probabilistic XML. PODS 2008: 109-118

Références

Documents relatifs

We address the cost of adding value joins to tree-pattern queries and monadic second-order queries over trees in terms of the tractability of query evaluation over two data models:

non-deterministic Turing machine whose number of accepting paths, given as input the input of the problem, is the output of the problem.. A problem is #P-hard if any #P problem can

Assume that for each edge e, on average, there are 16 potential outgN um(lbl(e)) from each node. Thus, on average, we need 4 bits to represent each edge of the query. Based on

Nous émettrons trois hypothèses à vérifier par la suite : la première supposera que le droit à la santé des MNA et à l’égalité de traitement est respecté dans le canton de

Stepwise tree automata were proved to have two nice properties that yield a simple proof of the theorem. 1) N-ary queries by UTAs can be translated to n-ary queries by stepwise

dene a distributed game in whih the proesses have a winning strategy if and. only if there is a solution to the synthesis problem in the

Thus, each guard G in a positive, depth-bounded TPRS (T, R) can be described by the (finite) set of forests min(G). For the subsumed relation , note that ∼ is the identity...

As a matter of fact, a restricted class of sheaves automata [13] has been used to prove complexity properties of the static fragment of ambient logic, which corresponds to a kind