Aggregate Queries for Discrete and Continuous Probabilistic XML

(1)

Free University of Bozen-Bolzano Télécom ParisTech

S. Abiteboul, T-H. H. Chan, E. Kharlamov, W. Nutt, P. Senellart INRIA Saclay – Île-de-France

The University of Hong Kong

Aggregate Queries for

Discrete and Continuous Probabilistic XML

1 2 1, 3 3 4

1

2 3

4

ICDT, March 2010

(2)

Outline

1. Probabilistic data 2. Problem definition

3. Aggregating discrete Probabilistic XML

4. Aggregating continuous Probabilistic XML

2 /27

(3)

Applications of

Probabilistic Data

•

Approximate query processing: ranking, linkage

•

Information extraction: approximate search for entities (e.g. names) in text

•

Sensor data: imprecise or missing readings

•

^...

3 /27

(4)

Probabilistic Database

P = 0.3 P = 0.2 P = 0.5

4

Probabilistic DB:

/27

(5)

Probabilistic Database

P = 0.3 P = 0.2 P = 0.5

Q a

Q Q

a

4

Probabilistic DB:

/27

(6)

Probabilistic Database

P = 0.3 P = 0.2 P = 0.5

Q a

Q Q

a (a, 0.8)

4

Answer:

Probabilistic DB:

/27

(7)

Probabilistic Database

P = 0.3 P = 0.2 P = 0.5

Q a

Q Q

a (a, 0.8)

4

Answer:

Probabilistic DB: Representation of Prob DB:

/27

(8)

Probabilistic Database

P = 0.3 P = 0.2 P = 0.5

Q a

Q Q

a (a, 0.8)

4

Answer:

Q

(a, 0.8)

/27

(9)

Probabilistic Database

P = 0.3 P = 0.2 P = 0.5

Q a

Q Q

a (a, 0.8)

4

Answer:

Q

(a, 0.8)

/27

(10)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

[Kimelfed&al:2007] [Senellart&al:2007]

• f - event: “weather is fine”

Pr(f) = .4

• MUX - mutually exclusive options

5 /27

(11)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

Pr(f) = .4

5

g ∧ k ∧¬ m

g ∧ ¬ h

/27

(12)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

Pr(f) = .4

5 /27

(13)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

Pr(f) = .4

Semantics a world d:

• f = true, Pr(f) = 0.4

• MUX: 23, Pr(23) = 0.1 Pr(d) = 0.4 x 0.1

5 /27

(14)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

Pr(f) = .4

• f = true, Pr(f) = 0.4

• MUX: 23, Pr(23) = 0.1 Pr(d) = 0.4 x 0.1

5 /27

(15)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

Pr(f) = .4

• f = true, Pr(f) = 0.4

• MUX: 23, Pr(23) = 0.1 Pr(d) = 0.4 x 0.1

5 /27

(16)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

Pr(f) = .4

• f = true, Pr(f) = 0.4

• MUX: 23, Pr(23) = 0.1 Pr(d) = 0.4 x 0.1

5 /27

(17)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

Pr(f) = .4

23

• f = true, Pr(f) = 0.4

• MUX: 23, Pr(23) = 0.1 Pr(d) = 0.4 x 0.1

5 /27

(18)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20

2 70 1

i35

¬f f

.6 .1 .3

PXML with Events and Distributional Nodes

Pr(f) = .4

23

• f = true, Pr(f) = 0.4

• MUX: 23, Pr(23) = 0.1 Pr(d) = 0.4 x 0.1

5 /27

(19)

Discrete Probabilistic XML Documents

•

Probabilistic XML document D

•

represents (exponentially) many documents d

•

each with a probability Pr(d)

•

It is achieved by

•

Conjunctions of event literals on edges.

Capture long-distance dependencies

•

Distributional nodes: Mux, Ind, Det, Exp.

Capture local (hierarchical) dependencies

6 /27

(20)

Discrete Probabilistic XML Documents

•

Probabilistic XML document D

•

represents (exponentially) many documents d

•

each with a probability Pr(d)

•

It is achieved by

•

Conjunctions of event literals on edges.

Capture long-distance dependencies

•

Distributional nodes: Mux, Ind, Det, Exp.

Capture local (hierarchical) dependencies

6

Special case of event formulas

/27

(21)

What is Known?

•

Answering simple XPath queries

•

Distributional nodes: PTIME

•

Events: FP -complete

•

Simple XPath over Mux-Det PXML with HAVING constraints:

•

PTIME for COUNT and MIN

•

NP-hard for SUM and AVG

#P

[Kimelfed&al:2007]

[Senellart&al:2007]

7

[Cohen&al:2008]

[Re&al:2007]

/27

(22)

What is Known?

•

Answering simple XPath queries

•

Distributional nodes: PTIME

•

Events: FP -complete

•

Simple XPath over Mux-Det PXML with HAVING constraints:

•

PTIME for COUNT and MIN

•

NP-hard for SUM and AVG

#P

[Kimelfed&al:2007]

[Senellart&al:2007]

7

NO events

[Cohen&al:2008]

[Re&al:2007]

/27

(23)

Outline

1. Probabilistic data 2. Problem definition

3. Aggregating discrete Probabilistic XML

4. Aggregating continuous Probabilistic XML

8 /27

(24)

Aggregate Queries

1. What is the average temperature across sensors?

2. What is the average temperature for sensor i25?

3. How often did sensors i25 and i33

give the same measurement simultaneously?

9

we want to answer queries with aggregate functions:

MIN/MAX, TopK, COUNT, SUM, COUNTD, AVG

/27

(25)

3.

id t vl

Y sensor

X i25

id t vl

Y sensor

X i33

Root

COUNT

Query Models

10

1.

AVG

2.

AVG Root

vl Y

Root

id vl

Y sensor

i25

/27

(26)

3.

id t vl

Y sensor

X i25

id t vl

Y sensor

X i33

Root

COUNT

Query Models

10

1.

AVG

2.

AVG Root

vl Y

Root

id vl

Y sensor

i25

/27

(27)

1. Single-Path queries - SP 2. Tree-Pattern queries - TP

3. Tree-Pattern queries with Joins - TPJ 3.

id t vl

Y sensor

X i25

id t vl

Y sensor

X i33

Root

COUNT

Query Models

10

1.

AVG

2.

AVG Root

vl Y

Root

id vl

Y sensor

i25

/27

(28)

Semantics of AQs

•

^Query:

What is the average temperature?

•

What should be an answer?

11

AVG(d17) = 44, Pr(d17)=.6 AVG(d23) = 46, Pr(d23)=.1 AVG(d20) = 45, Pr(d20)=.3

sensor

vl vl

mes mes

vl

45 70 MUX

17 23 20

.⁶ .¹ .³

/27

(29)

Semantics of AQs

•

^Query:

What is the average temperature?

•

What should be an answer?

11

AVG(d17) = 44, Pr(d17)=.6 AVG(d23) = 46, Pr(d23)=.1 AVG(d20) = 45, Pr(d20)=.3 Distribution of aggregate values over all documents

represented by the PXML document

sensor

vl vl

mes mes

vl

45 70 MUX

17 23 20

.⁶ .¹ .³

/27

(30)

Problems to Investigate for Discrete PXML

For PXML document D, constant C

•

Possible answers:

decide Pr(Q(D)=C) > 0

•

Probability computation:

compute Pr(Q(D)=C)

•

Moment computation:

compute E(Q(D) )

12

k E is “expected value”

/27

(31)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20 2 N(70,4) 1

i35

¬f f

.6 .1 .3

•

Incorporate continuous distributions in PXML leaves

•

Aggregate continuous PXML

Continuous PXML

13

At the moment there is no formal semantics for continuous probabilistic XML models

/27

(32)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20 2 N(70,4) 1

i35

¬f f

.6 .1 .3

•

Incorporate continuous distributions in PXML leaves

•

Aggregate continuous PXML

Continuous PXML

13

At the moment there is no formal semantics for continuous probabilistic XML models

/27

(33)

Outline

1. Probabilistic data 2. Problem definition

3. Aggregating discrete Probabilistic XML

4. Aggregating continuous Probabilistic XML

14 /27

(34)

Data Complexity of Query Answering

PXML Model Single Path Tree Pattern Tree Pat. Joins Query Language

Event

Conjunctions NP-complete NP-complete NP-complete Distributional

Nodes FP -complete P FP -complete FP -complete FP -complete

FP -complete#P

#P

What is difficult?

•

joins in queries

•

events in data

15 /27

(35)

Data Complexity of Query Answering

PXML Model Single Path Tree Pattern Tree Pat. Joins Query Language

Event

Conjunctions NP-complete NP-complete NP-complete Distributional

Nodes FP -complete P FP -complete FP -complete FP -complete

FP -complete#P

#P

What is difficult?

•

joins in queries

•

events in data

15

Is it getting more difficult with aggregation?

/27

(36)

Possible

Answers NP-complete NP-complete NP-complete Probability

Computation FP -complete FP -complete FP -complete

Moment

Computation FP -completeFP -completeFP -complete NP-complete

FP -comp

Aggregating PXML-Events

Problems Single Path Tree Pattern Tree Pat. Joins Aggregate Query Language

#P

16

FP -complete#P

COUNT, SUM: PTIME #P MIN, AVG

COUNTD:

Aggregates: COUNT, SUM, MIN, COUNTD, AVG Data-complexity

/27

(37)

Possible

Answers NP-complete NP-complete NP-complete Probability

Computation FP -complete FP -complete FP -complete

Moment

Computation FP -completeFP -completeFP -complete NP-complete

FP -comp

Aggregating PXML-Events

Problems Single Path Tree Pattern Tree Pat. Joins Aggregate Query Language

#P

16

FP -complete#P

COUNT, SUM: PTIME #P MIN, AVG

COUNTD:

Aggregates: COUNT, SUM, MIN, COUNTD, AVG Data-complexity

/27

(38)

PTIME FP FP -complete

PTIME AVG: FP

others: PTIME FP -complete PTIME

FP -complete SUM, AVG, COUNTD: FP -complete

COUNT, MIN: PTIME

Aggregating PXML with Distributional Nodes

#P

#P Possible

Answers Probability Computation Probability SUM in |input| +|output|

Moment Computation

COUNT, MIN: PTIME

Problems Single Path Tree Pattern Tree Pat. Joins SUM, AVG, COUNTD: NP-complete

COUNT, MIN : NP

#P

Aggregates: COUNT, SUM, MIN, COUNTD, AVG

17

Aggregate Query Language

Data-complexity

FP -complete#P

/27

(39)

PTIME FP FP -complete

PTIME AVG: FP

others: PTIME FP -complete PTIME

FP -complete SUM, AVG, COUNTD: FP -complete

COUNT, MIN: PTIME

Aggregating PXML with Distributional Nodes

#P

#P Possible

Answers Probability Computation Probability SUM in |input| +|output|

Moment Computation

COUNT, MIN: PTIME

Problems Single Path Tree Pattern Tree Pat. Joins SUM, AVG, COUNTD: NP-complete

COUNT, MIN : NP

#P

Aggregates: COUNT, SUM, MIN, COUNTD, AVG

17

Aggregate Query Language

Data-complexity

FP -complete#P

/27

(40)

Tractable Cases

Key components of tractability:

•

Hierarchical structure of PXML documents imposed by distributional nodes

•

Some aggregate functions

can exploit the hierarchy - monoid functions

18

Monoid: COUNT, SUM, MIN, TopK, PARITY, ...

Non Monoid: COUNTD, AVG

PTIME algorithm to compute distributions:

Bottom-up evaluation using convex sums and convolutions

/27

(41)

The Problem Space

19 Tree Pattern

Queries Tree Pattern Queries

with Joins

Single Path Queries

COUNT Distributional Nodes

(Mux-Det-Ind-Exp) Events

Query Language

Aggregate Functions Probabilistic XML

Document Model

Discrete

Continuous

Distributions SUM

MIN COUNTD AVG

Outside: intractable,

i.e., NP-, FP -complete Inside: PTIME

#P

/27

(42)

Approximating Query Answers

•

Many problems are NP- or FP -complete How good are Monte-Carlo methods?

•

By Hoeffding bound, to achieve

| E(α(D) ) - Estimate | < ε with Pr = 1-δ

at most O(R 1/ε log(1/δ)) samples is needed for α=COUNTD

quadratically many samples are needed

20

#P

k

2k 2

/27

(43)

Approximating Query Answers

•

Many problems are NP- or FP -complete How good are Monte-Carlo methods?

•

By Hoeffding bound, to achieve

| E(α(D) ) - Estimate | < ε with Pr = 1-δ

at most O(R 1/ε log(1/δ)) samples is needed for α=COUNTD

quadratically many samples are needed

20

#P

k

2k 2

/27

(44)

Outline

1. Probabilistic data 2. Problem definition

3. Aggregating discrete Probabilistic XML

4. Aggregating continuous Probabilistic XML

21 /27

(45)

•

Finite case:

•

finite sets of trees

•

where every tree has a non-zero probability

•

Continuous case:

•

infinite sets of trees

•

where some (infinite) subsets of trees have non-zero probability measure

22

Discrete vs Continuous Models

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20 2 N(70,4) 1

i35

¬f f

.6 .1 .3

/27

(46)

•

Finite case:

•

finite sets of trees

•

where every tree has a non-zero probability

•

Continuous case:

•

infinite sets of trees

•

where some (infinite) subsets of trees have non-zero probability measure

22

How to measure infinite sets of trees?

Discrete vs Continuous Models

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20 2 N(70,4) 1

i35

¬f f

.6 .1 .3

/27

(47)

1. Take a set S of trees with

•

real values on the leaves / share the same structure

Measuring Infinite Sets of Trees

23 t

1

vl 30 25

id mes mes

sensor

t vl

2 70

t 2

vl 30 40

id mes mes

sensor

t vl

3 100

S:

/27

(48)

1. Take a set S of trees with

•

real values on the leaves / share the same structure 2. collect labels of leaves as tuples of values

Subset L(S) of R

Measuring Infinite Sets of Trees

23 t

1

vl 30 25

id mes mes

sensor

t vl

2 70

t 2

vl 30 40

id mes mes

sensor

t vl

3 100

(25, 1, 30, 2, 70) (40, 2, 30, 3, 100) n

S:

L(S):

/27

(49)

3. Take a standard measure M on Borel subsets of R 4. Use the measure M on L(S)

5. Lift M from sets of tuples L(S) to sets of trees S

Measuring Infinite Sets of Trees

23 t

1

vl 30 25

id mes mes

sensor

t vl

2 70

t 2

vl 30 40

id mes mes

sensor

t vl

3 100

(25, 1, 30, 2, 70) (40, 2, 30, 3, 100)

S:

L(S):

n

/27

(50)

root

t 1

vl 30 i25

id mes mes

sensor sensor

t vl

id mes

t vl

MUX

17 23 20 2 N(70,4) 1

i35

¬f f

.6 .1 .3

Continuous

PXML Documents

•

Extension of discrete PXML

with distribution functions attached to leaves

24

•

piecewise polynomials

•

^Diracs

•

^Gaussian

•

^...

/27

(51)

Aggregation of CPXML:

Probability Computation

•

Tractable for

1. Data: CPXML with distributional nodes 2. Query: SP with monoid functions

•

Bottom-up algorithms based on convex sums and convolutions

•

Works when distributions on the leaves are closed under convolutions and convex sums

•

piecewise polynomials (SUM, MIN/MAX)

25

PTIME

/27

(52)

Summing Up

•

Comprehensive picture of complexity for discrete PXML aggregation:

•

PXML models with local, global dependencies

•

SP, TP, TPJ queries

•

COUNT, SUM, MIN, COUNTD, AVG

•

Continuous PXML model:

•

formal semantics

•

initial study of aggregation

26 /27

(53)

• ^{Thank you}

(54)

References

•

[Kimelfeld&al:2007] - Benny Kimelfeld, Yehoshua Sagiv:

Matching Twigs in Probabilistic XML. VLDB 2007: 27-38

•

[Senellart&al:2007] -Pierre Senellart, Serge Abiteboul: On the complexity of managing probabilistic XML data. PODS 2007:

283-292

•

[Re&al:2007] - C. Ré and D. Suciu. Efficient evaluation of HAVING queries on a probabilistic database. DBPL 2007

•

[Cohen&al:2008] - Sara Cohen, Benny Kimelfeld, Yehoshua Sagiv: Incorporating constraints in probabilistic XML. PODS 2008: 109-118