• Aucun résultat trouvé

Finding Optimal Probabilistic Generators for XML Collections

N/A
N/A
Protected

Academic year: 2022

Partager "Finding Optimal Probabilistic Generators for XML Collections"

Copied!
19
0
0

Texte intégral

(1)

Finding Optimal Probabilistic Generators for XML Collections

Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart

To appear in BDA 2011

(2)

Adding probabilities to an XML Schema

• The Web is full of semi-structured documents

• Given a collection of XML documents, we would like to have a schema the documents conform to.

– E.g., DTD or XSD

– Restricts the structure, mostly parent-child node relations (using regular expressions)

• The schema may be very general (e.g., xhtml, RSS)

• We want to add probabilities to 'guide' the schema

– Optimal probabilities – maximize the likelihood of a corpus

- 2 -

Motivation

Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer

(3)

Implementation Idea: XML Editor Auto-Completion

• Based on previous document versions / corpus of user

documents / corpus of example documents –

• Suggest nodes / sub-trees / node values to the user

• For example:

Challenges:

– Allow editing in every part of the document

– What kind of completion to suggest?

– Finding the top-k best completions

- 3 -

Motivation

Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer

<MyPapers>

<Paper>

<title>Provenance Minimization</title>

<author>Yael A.</author>

<author>Daniel D.</author>

<author>Tova M.</author>

<author>Val T.</author>

</Paper>

<Paper>

<title>Provenance for Aggregates</title>

<author>Yael A.</author>

<author>Daniel D./author>

<author>Val T.</author>

</Paper>

<Paper>

<title> </title>

<author> </author>

<author> </author>

<author> </author>

</Paper>

</MyPapers>

(4)

Many Other Usages for a Probabilistic Schema

Testing – e.g., generating many XML messages to

simulate network load and test system performance.

Explaining – e.g., the probabilistic schema for DBLP shows which types of publications are rarely used, which kinds of attributes are not filled for BibTex, etc.

Querying – e.g., finding the probability that a paper has more than 3 authors.

– e.g., finding the top-k best completions to a partial document.

Schema Evaluation – how well a given schema describes a given corpus.

- 4 -

Motivation

Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer

(5)

Our solution - An Outline

- 5 - Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer

Preliminaries – Tree Automata

Generators for Schemas without Constraints

Restart Generators

Continuation-Test Generators Leaf Values

Adding Constraints

(6)

Using a Tree Automaton for Schema Verification

- 6 -

Preliminaries

Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer

q 0 b q 1 q 2

a c

$

An XML document is

modeled as an ordered tree.

Document d 0 :

The children of a-labeled node are accepted by

DFA A a

Automaton A r : (L( A r ) = a*bc*$)

This is done for every inner node in a fixed order (BF-LTR)

abcd 532 abcd

$

r

a b c

(7)

Using the Schema as a (Probabilistic) Generator

• Each transition is assigned a probability

• We assume independent choices, thus the document probability is the product of all t-probs.

• Thus, PR(d)=p a ∙p a ∙p b ∙p $

• The schema and generator ignore leaf values (for now!)

- 7 -

Without Constraints

Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer

b

a c

$ $

p a p c p b p $

q 0 q 1 q 2

r

a a b

(8)

An Algorithm for Probabilities Learning

- 8 -

Without Constraints

Finding Optimal Probabilistic Generators for XML Collections

b

a c

$

$

The frequency each transition is chosen during the corpus verification process is recorded.

(q 0 , a) (q 0 , b) (q 1 , c) (q 1 , $)

1 1 1

1 q 0 q 1 q 2

r

a b c

(9)

An Algorithm for Probabilities Learning (Cont.)

This is repeated for every node in every corpus document.

We set the probability of each transition to be its relative frequency .

- 9 -

Without Constraints

Finding Optimal Probabilistic Generators for XML Collections

(q 0 , a) 1 (q 0 , b) 1 (q 1 , c) 1 (q 1 , $) 1

/2 /2 /2 /2

These probabilities maximize the likelihood of generating the corpus – optimal generator

(similar result in PCFGs)

(10)

Our Results

• Relative frequencies make optimal probabilities.

• Optimal probability learning and document generation can be done efficiently.

• Additional non-trivial result:

Generation terminates with probability 1.

– Guaranteed because of the choice of probabilities according to the corpus.

- 10 -

Without Constraints

Finding Optimal Probabilistic Generators for XML Collections

(11)

Integrity Constraints

• We want to allow the use of the following constraints in the schema:

Key Constraint: the leaves of a-labeled leaves have unique values (unary key)

Inclusion Constraint: the values of a-labeled

leaves are contained in those of b-labeled leaves – Domain Constraint: the values of a-labeled leaves

belong to some (finite or infinite) domain

- 11 -

Adding Constraints

Finding Optimal Probabilistic Generators for XML Collections

(12)

Restart Generators

• A naïve idea:

– Use a probabilistic generator to generate a document – Check if it has a value assignment valid w.r.t. the

constraints

– If not, 'restart' and try again until a valid document is generated

Good news: Checking the existence of a valid assignment is in PTIME

Bad news: number of restarts can be

unboundedly large in an optimal generator

– A different quality measure for restart generators?

- 12 -

Adding Constraints

Finding Optimal Probabilistic Generators for XML Collections

(13)

Continuation-test Generators

• Never make choices that lead to a 'dead end', thus always generate a valid document.

• We use a binary test to check if a choice has a continuation.

• Add to the schema of d 0 the constraints:

c is included in a – c is unique

• The generation process:

- 13 -

Adding Constraints

Finding Optimal Probabilistic Generators for XML Collections

b

a c

$ $

p a p c p b p $

q 0 q 1 q 2

r

a b c

Pr(d) = p a ∙p b ∙p c ∙1

(14)

Algorithm for Learning

Probabilities with Constraints

• The probabilities are again relative frequencies, but –

only in cases where there was an alternative choice.

• The learned generator will generate as many c-s as a-s Adding

Constraints

Finding Optimal Probabilistic Generators for XML Collections

(q 0 , a) 1 (q 0 , b) 1 (q 1 , c) 1 (q 1 , $) 0

/2 /2 /1

/1 (q 1 , $) was chosen only when (q 1 , c) was not available.

- 14 -

(15)

Our Results

• The algorithm learns an optimal continuation-test generator, for automata with binary choices.

– Extensions to non-binary are discussed in the paper

Bad News: Continuation-test is NP-Complete

– But only in the size of the schema; it is polynomial in the document size (not so bad?)

– Based on schema satisfiability test [David et al. 2011]

More Bad News: probability of termination may be arbitrarily small!

– Even for simple, non-recursive schemas

– Can be handled by adding a constraint on the document size.

– Sub-classes of schemas that guarantee termination?

- 15 -

Adding Constraints

Finding Optimal Probabilistic Generators for XML Collections

(16)

Suggested Algorithm

• We start with a valid document skeleton

• Order labels by inclusion constraints (e.g., c, b, a)

• Choose a leaf from the 'smallest' (most included) label, and including leaves

• Draw a value (from the domain) according to a given distribution.

• Use PTIME test to verify validity, if not revert the step

- 16 -

Leaf Values

Finding Optimal Probabilistic Generators for XML Collections

$

r

a b c

abcd

abcd efg

(17)

Possible improvement to the basic algorithm

• Annotate the leaves with 'old' or 'new'

• For 'old' a-labeled leaves choose values already chosen for some a-labeled leaf

• For 'new' choose a value unused by a-labeled leaves yet

• Annotations can be learned from the corpus, and generated:

Offline – after the document generation, using a PTIME validity test – Online – during document generation, using a continuation test.

– Both methods are incomparable in terms of quality

- 17 -

Leaf Values

Finding Optimal Probabilistic Generators for XML Collections

new old new

$

r

a a b

(18)

Ideas for Experimental Study on Probabilistic generators

• How many schemas in practice may require continuation tests?

• How many have termination probability < 1?

• Continuation test is expensive, but how expensive is it in practice?

• 'Competition' between restart and continuation-test generators

• More?

- 18 -

Implementation

Finding Optimal Probabilistic Generators for XML Collections

(19)

Thank You!

Thank You!

Questions, Ideas?...

Références

Documents relatifs

 Processes the lineage formula to get the probability of the query (and of each matching

– Proof – by construction of a simple, non-recursive schema – Can be handled by adding a constraint on the document size. – Sub-classes of schemas that

p-generator An algorithm for finding the best probabilistic model for a document corpus based on a given schema, and a proof that termination probability is 1..

leaves are contained in those of b-labeled leaves – Domain Constraint: the values of a-labeled leaves. belong to some (finite or

Since the tree merging operations we use for obtaining a probabilis- tic document can be seen as a succession of updates, we focus here on global distributional nodes, and, more

(i) We first show how recursive Markov chains can be turned into a representation system for probabilistic XML that allows arbitrarily deep and wide documents, and that most

A hierarchical Markov chain A is tree-like if no two boxes are mapped to the same component.. It follows from the definition that there is

The contributions of this paper are as follows: (i) a general picture of update tractability for various query languages and probabilistic XML models, with complexity results; (ii)