Serge Abiteboul Benny Kimelfeld Yehoshua Sagiv3 Pierre Senellart4
1
2 4
Probabilistic XML
Framework Unordered data trees
Details: no attributes, no mixed content. . . A
B C
D 6=
A
B B C
D
(multiset semantics) Sample space: Set of all such data trees.
Probabilistic XML database: (Succinct) representation of a discrete probability distributionover this sample space (= a set of possible worlds).
Details: no attributes, no mixed content. . . A
B C
D 6=
A
B B C
D
(multiset semantics) Sample space: Set of all such data trees.
Probabilistic XML
Framework Unordered data trees
Details: no attributes, no mixed content. . . A
B C
D 6=
A
B B C
D
(multiset semantics) Sample space: Set of all such data trees.
Probabilistic XML database: (Succinct) representation of a discrete probability distributionover this sample space (= a set of possible worlds).
Goal
A genericframework for probabilistic XML
Previously proposed models: concreteinstancesof this framework Comparison of theexpressiveness of various models
Updatecapabilities in various models Efficiency issues
Outline
1 Introduction
2 P-documents
The P-document Model Types of Distributional Nodes
Link with Previously Studied Models
3 Efficient Translations between Models
4 Update Capabilities
5 Conclusion
A
ind
mux
0:8
B
0:3
C
0:7
D
0:6
Tree with ordinary(circles) and distributional (rectangles) nodes Distributional nodes specify how their children can berandomly selected
Several kindsof distributional nodes (see later on)
Possible-world semantics: every possible selection of children of distributional nodes, with associated probability
Possible-world semantics
A
B p3 =0:096 A
D p2=0:12 A
p1 =0:08
A
B D
p4=0:144
A
C p5 =0:224
A
C D
p6 =0:336
det
B1 . . . Bn
All children are always chosen Not really a distributional node, but sometimes useful in hierarchies
Independent
ind
B1 p1
. . . Bn pn
Children are randomly chosen independently of one another
The probability of choosing each child is given
mux
B1 p1
. . . Bn pn
At most one child is randomly chosen, with probability pi
Pn
i=1pi 1
Exponential
exp
B1 . . . Bn
Subset Probability
W1 p1
. . .
Wk pk
The probability of choosing a given subset Wj of the set of children is given 1k 2n
Pk
j=1pj =1
e1 p1
. . . ek pk
cie
c1 cn
Each ci is aconjunction of the random events ej or their negations :ej (e.g., e1^ :e2^e3)
Each ej is independent of the other ones The probability of each ej is given The ej can be shared across multiple distributional nodes
Only distributional node expressing long-distance dependency
Reminiscent of [Imieliński and Lipski,
Families of P-documents
Definition
PrXMLftype1;type2;:::g: family of p-documents obtained with the distributional nodes type1;type2; : : :
PrXMLfj6htype1;type2;:::g: no hierarchy of distributional nodes is
allowed (i.e., the child of a distributional node is an ordinary node)
[Nierman and Jagadish, 2002]: PrXMLfmux;detg [van Keulen et al., 2005]: subset ofPrXMLfmux;detg
[Abiteboul and Senellart, 2006, Senellart and Abiteboul, 2007]:
PrXMLfj6hindg and PrXMLfj6hcieg
[Hung et al., 2003]: subset ofPrXMLfj6hexpg (graphs restricted to trees) [Hung et al., 2007]: subset ofPrXMLfj6hexpg (intervals restricted to
points)
Outline
1 Introduction
2 P-documents
3 Efficient Translations between Models V-translations
Main Results
4 Update Capabilities
5 Conclusion
Definition
F isv-translatabletoF0 if, for eachP 2 F~ , there is aP~0 2 F0 such that the possible-world semanticsof P~ and P~0 areisomorph.
If, additionally, P~0 can be obtained from P~ in polynomial time, F is efficiently v-translatable toF0.
Main Results
All families of p-documents are translatable toPrXMLfmux;detg PrXMLfmux;indg is efficiently translatable to PrXMLfexpg and to PrXMLfcieg
PrXMLfexpg is not efficiently translatable to PrXMLfj6hexpg PrXMLfcieg is efficiently translatable to PrXMLfj6hcieg
PrXMLfcieg is not efficiently translatable to PrXMLfind;mux;expg PrXMLfexpg toPrXMLfcieg: open problem, butPrXMLfexpg with bounded height or bounded degree efficiently translatable to PrXMLfcieg
1 Introduction
2 P-documents
3 Efficient Translations between Models
4 Update Capabilities V-updates
Main Results
V-updates
Elementary insertions and deletions
Locator queryindicating where to apply the (cf. XPath for XUpdate, XQuery for XQuery Update)
The update itself can be probabilistic: “Insert this subtree at each node matched by this query with probabilityp.”
PrXMLfmux;detg (and any family v-translatable to this) closed under updates for any class of queries
PrXMLfcieg,PrXMLfmuxg,PrXMLfexpg, etc.: tractably closed under updates defined by single-path queries such that the matched node is at the end of the path
Withoutcie nodes: not tractably closed under insertions defined by single-path queries
PrXMLfcieg: tractably closed under insertions defined by tree-pattern queries with joins
Outline
1 Introduction
2 P-documents
3 Efficient Translations between Models
4 Update Capabilities
5 Conclusion
Tree-pattern projection queries can be processed efficientlyin PrXMLfind;muxg.
Tree-pattern projection queries are #P-completein PrXMLfcieg.
In summary:
PrXMLfcieg is more succinct.
Simple updates remaintractable in PrXMLfcieg.
. . . but (projection) queries are intractableinPrXMLfcieg.
A Word About Queries
It was shown [Kimelfeld and Sagiv, 2007, Kimelfeld et al., 2008] that:
Tree-pattern projection queries can be processed efficientlyin PrXMLfind;muxg.
Tree-pattern projection queries are #P-completein PrXMLfcieg.
In summary:
PrXMLfcieg is more succinct.
Simple updates remaintractable in PrXMLfcieg.
. . . but (projection) queries are intractableinPrXMLfcieg.
Trade-off betweenqueries and updates, or betweenqueries and express- ibility of complex dependencies.
Tree-pattern projection queries can be processed efficientlyin PrXMLfind;muxg.
Tree-pattern projection queries are #P-completein PrXMLfcieg.
In summary:
PrXMLfcieg is more succinct.
Simple updates remaintractable in PrXMLfcieg.
. . . but (projection) queries are intractableinPrXMLfcieg.
Perspectives
Alternative approach: external constraints[Cohen et al., 2008]
Multiset! setsemantics Equivalenceof p-documents Validationagainst a DTD
References I
Serge Abiteboul and Pierre Senellart. Querying and updating probabilistic information in XML. InProc. EDBT, Munich, Germany, March 2006.
Sara Cohen, Benny Kimelfeld, and Yehoshua Sagiv. Incorporating constraints in probabilistic XML. InProc. PODS, Vancouver, Canada, June 2008.
Edward Hung, Lise Getoor, and V. S. Subrahmanian. PXML: A probabilistic semistructured data model and algebra. InProc.
ICDE, Bangalore, India, March 2003.
Edward Hung, Lise Getoor, and V. S. Subrahmanian. Probabilistic interval XML. ACM Transactions on Computational Logic, 8(4), 2007.
Tomasz Imieliński and Witold Lipski. Incomplete information in
XML. InProc. VLDB, Vienna, Austria, September 2007.
Benny Kimelfeld, Yuri Kosharovski, and Yehoshua Sagiv. Query efficiency in probabilistic XML models. In Proc. SIGMOD, Vancouver, Canada, June 2008.
Andrew Nierman and H. V. Jagadish. ProTDB: Probabilistic data in XML. InProc. VLDB, Hong Kong, China, August 2002.
Pierre Senellart and Serge Abiteboul. On the complexity of managing probabilistic XML data. InProc. PODS, pages 283–292, Beijing, China, June 2007.
Maurice van Keulen, Ander de Keijzer, and Wouter Alink. A