On the Expressiveness of Probabilistic XML Models

(1)

Serge Abiteboul Benny Kimelfeld Yehoshua Sagiv³ Pierre Senellart⁴

1

2 4

(2)

Probabilistic XML

Framework Unordered data trees

Details: no attributes, no mixed content. . . A

B C

D 6=

A

B B C

D

(multiset semantics) Sample space: Set of all such data trees.

Probabilistic XML database: (Succinct) representation of a discrete probability distributionover this sample space (= a set of possible worlds).

(3)

B C

D 6=

A

B B C

D

(4)

Probabilistic XML

Framework Unordered data trees

B C

D 6=

A

B B C

D

Probabilistic XML database: (Succinct) representation of a discrete probability distributionover this sample space (= a set of possible worlds).

(5)

Goal

A genericframework for probabilistic XML

Previously proposed models: concreteinstancesof this framework Comparison of theexpressiveness of various models

Updatecapabilities in various models Efficiency issues

(6)

Outline

1 Introduction

2 P-documents

The P-document Model Types of Distributional Nodes

Link with Previously Studied Models

3 Efficient Translations between Models

4 Update Capabilities

5 Conclusion

(7)

A

ind

mux

0:8

B

0:3

C

0:7

D

0:6

Tree with ordinary(circles) and distributional (rectangles) nodes Distributional nodes specify how their children can berandomly selected

Several kindsof distributional nodes (see later on)

Possible-world semantics: every possible selection of children of distributional nodes, with associated probability

(8)

Possible-world semantics

A

B p₃ =0:096 A

D p2=0:12 A

p1 =0:08

A

B D

p₄=0:144

A

C p₅ =0:224

A

C D

p₆ =0:336

(9)

det

B1 . . . Bn

All children are always chosen Not really a distributional node, but sometimes useful in hierarchies

(10)

Independent

ind

B1 p1

. . . Bn pn

Children are randomly chosen independently of one another

The probability of choosing each child is given

(11)

mux

B1 p1

. . . Bn pn

At most one child is randomly chosen, with probability p_i

P_n

i=1p_i 1

(12)

Exponential

exp

B1 . . . Bn

Subset Probability

W1 p1

. . .

W_k p_k

The probability of choosing a given subset Wj of the set of children is given 1k 2ⁿ

P_k

j=1p_j =1

(13)

e1 p1

. . . e_k p_k

cie

c1 cn

Each ci is aconjunction of the random events ej or their negations :ej (e.g., e₁^ :e₂^e₃)

Each e_j is independent of the other ones The probability of each ej is given The ej can be shared across multiple distributional nodes

Only distributional node expressing long-distance dependency

Reminiscent of [Imieliński and Lipski,

(14)

Families of P-documents

Definition

PrXML^f^type¹^;^type²^;:::g: family of p-documents obtained with the distributional nodes type₁;type₂; : : :

PrXML^f_j6_h^type¹^;^type²^;:::g: no hierarchy of distributional nodes is

allowed (i.e., the child of a distributional node is an ordinary node)

(15)

[Nierman and Jagadish, 2002]: PrXML^f^mux^;^det^g [van Keulen et al., 2005]: subset ofPrXML^f^mux^;^det^g

[Abiteboul and Senellart, 2006, Senellart and Abiteboul, 2007]:

PrXML^f_j6_h^ind^g and PrXML^f_j6_h^cie^g

[Hung et al., 2003]: subset ofPrXML^f_j6_h^exp^g (graphs restricted to trees) [Hung et al., 2007]: subset ofPrXML^f_j6_h^exp^g (intervals restricted to

points)

(16)

Outline

1 Introduction

2 P-documents

3 Efficient Translations between Models V-translations

Main Results

5 Conclusion

(17)

Definition

F isv-translatabletoF⁰ if, for eachP 2 F~ , there is aP~⁰ 2 F⁰ such that the possible-world semanticsof P~ and P~⁰ areisomorph.

If, additionally, P~⁰ can be obtained from P~ in polynomial time, F is efficiently v-translatable toF⁰.

(18)

Main Results

All families of p-documents are translatable toPrXML^f^mux^;^det^g PrXML^f^mux^;^ind^g is efficiently translatable to PrXML^f^exp^g and to PrXML^f^cie^g

PrXML^f^exp^g is not efficiently translatable to PrXML^f_j6_h^exp^g PrXML^f^cie^g is efficiently translatable to PrXML^f_j6_h^cie^g

PrXML^f^cie^g is not efficiently translatable to PrXML^fînd^;^mux^;êxp^g PrXML^fêxp^g toPrXML^f^cie^g: open problem, butPrXML^fêxp^g with bounded height or bounded degree efficiently translatable to PrXML^f^cie^g

(19)

1 Introduction

2 P-documents

4 Update Capabilities V-updates

Main Results

(20)

V-updates

Elementary insertions and deletions

Locator queryindicating where to apply the (cf. XPath for XUpdate, XQuery for XQuery Update)

The update itself can be probabilistic: “Insert this subtree at each node matched by this query with probabilityp.”

(21)

PrXML^f^mux^;^det^g (and any family v-translatable to this) closed under updates for any class of queries

PrXML^f^cie^g,PrXML^f^mux^g,PrXML^f^exp^g, etc.: tractably closed under updates defined by single-path queries such that the matched node is at the end of the path

Withoutcie nodes: not tractably closed under insertions defined by single-path queries

PrXML^f^cie^g: tractably closed under insertions defined by tree-pattern queries with joins

(22)

Outline

1 Introduction

2 P-documents

5 Conclusion

(23)

Tree-pattern projection queries can be processed efficientlyin PrXML^f^ind^;^mux^g.

Tree-pattern projection queries are #P-completein PrXML^f^cie^g.

In summary:

PrXML^f^cie^g is more succinct.

Simple updates remaintractable in PrXML^f^cie^g.

. . . but (projection) queries are intractableinPrXML^f^cie^g.

(24)

A Word About Queries

It was shown [Kimelfeld and Sagiv, 2007, Kimelfeld et al., 2008] that:

In summary:

Trade-off betweenqueries and updates, or betweenqueries and express- ibility of complex dependencies.

(25)

In summary:

(26)

Perspectives

Alternative approach: external constraints[Cohen et al., 2008]

Multiset! setsemantics Equivalenceof p-documents Validationagainst a DTD

(27)

(28)

References I

Serge Abiteboul and Pierre Senellart. Querying and updating probabilistic information in XML. InProc. EDBT, Munich, Germany, March 2006.

Sara Cohen, Benny Kimelfeld, and Yehoshua Sagiv. Incorporating constraints in probabilistic XML. InProc. PODS, Vancouver, Canada, June 2008.

Edward Hung, Lise Getoor, and V. S. Subrahmanian. PXML: A probabilistic semistructured data model and algebra. InProc.

ICDE, Bangalore, India, March 2003.

Edward Hung, Lise Getoor, and V. S. Subrahmanian. Probabilistic interval XML. ACM Transactions on Computational Logic, 8(4), 2007.

Tomasz Imieliński and Witold Lipski. Incomplete information in

(29)

XML. InProc. VLDB, Vienna, Austria, September 2007.

Benny Kimelfeld, Yuri Kosharovski, and Yehoshua Sagiv. Query efficiency in probabilistic XML models. In Proc. SIGMOD, Vancouver, Canada, June 2008.

Andrew Nierman and H. V. Jagadish. ProTDB: Probabilistic data in XML. InProc. VLDB, Hong Kong, China, August 2002.

Pierre Senellart and Serge Abiteboul. On the complexity of managing probabilistic XML data. InProc. PODS, pages 283–292, Beijing, China, June 2007.

Maurice van Keulen, Ander de Keijzer, and Wouter Alink. A