• Aucun résultat trouvé

Leveraging the Structure of Uncertain Data

N/A
N/A
Protected

Academic year: 2022

Partager "Leveraging the Structure of Uncertain Data"

Copied!
1
0
0

Texte intégral

(1)

Leveraging the Structure of Uncertain Data

Antoine Amarilli Télécom ParisTech, CNRS LTCI, Université Paris-Saclay

Probabilistic Databases

A dating website uses machine learning to classify

users as active/inactive, and chats as flirtatious or not Jean

80%

Jamie 100%

Jordan 25%

60%

100% means

12%

8%

3%

2%

36%

24%

9%

6%

We assume independence across all these facts

→ Cannot represent correlations, like

"Monoamorous users are flirtatious with ≤1 person"

Query Evaluation

12%

8%

3%

2%

36%

24%

9%

6%

Query: are there active users engaged in flirtatious chat?

On deterministic data, this is easy:

→ YES

→ NO

On probabilistic data, evaluation becomes much harder!

Jean 80%

Jamie 100%

Jordan 25%

60%

100% → 59%

Can we compute efficiently the probability of a query?

→ The task is intractable (#P-hard) for many queries

even when the query is fixed (i.e., in data complexity)

→ What can we do? (especially on simple, realistic data?)

Ongoing work Lineages

Jeanje Jamie ja

Jordan jo

mea meo

Is there a flirtatious pair of active users?

Lineage: je ∧ ((mea ∧ ja) ∨ (meo ∧ jo))

Lineage φ of an arbitrary Boolean query q on database D:

φ is a Boolean formula (or circuit) on the facts of D

such that D' ⊆ D satisfies q iff φ holds for the valuation of D'

q1 q2

in

in in

q

¬

∨ ∨ ∨

... ... ... ...

...

- Lineage circuits for tree automata on

uncertain trees can be computed in O(n) - Extends to bounded treewidth data

- The circuit has low treewidth itself so probability computation is 1

Our main technical results:

Problem Statement: Which structural hypotheses on probabilistic data make query evaluation tractable?

Treewidth

A measure of how much the data is similar to a tree We can decompose trees ...and treelike data too:

Low treewidth data rewrites to a tree

a

b

c d

e

f a b c

b c e

c d e e f

a1, a2 a2, a3 a1, a3 a2, a4 a3, a1

a1, a4

a4, a1

tree

decompose

tree encode

Courcelle's theorem: Monadic second-order queries are tractable to evaluate on non-probabilistic data using tree automata on the tree encoding

Results

Our main result:

For any monadic second-order query q and integer k for any input TID database D of treewidth ≤ k

we can compute in O(|D|) the probability of q on D

(up to polynomial arithmetic costs)

→ Also for low-treewidth correlations

(e.g. mutually exclusive database facts)

→ Also for expressive lineages (

N

[X]-provenance) Extensions:

Lower bound: Query evaluation on TID is intractable if the treewidth is not bounded (under some technical conditions)

Antoine Amarilli, Pierre Bourhis, Pierre Senellart:

- Provenance Circuits for Trees and Treelike Instances, ICALP'15

- Tractable Lineages on Treelike Instances: Limits and Extensions, PODS'16

Computing tree decompositions on real datasets

joint work with Pierre Bourhis and Pierre Senellart

References

The Paris road network (4.3M nodes and 5.4M edges) has treewidth ≤ 521 (computed using heuristics)

100%

80%

60%

40%

20%

0% width 5 width 10

Most of the network is covered by a partial decomposition

of width 5

high-treewidth core (<10%)

low-treewidth

tentacles (>90%)

→ Answer queries with uncertainty (e.g., RER trip time) with tree automata on the tentacles and sampling on the core

with Mikaël Monet, Silviu Maniu, and Pierre Senellart

Tractability in combined complexity for restricted queries Our method on low-treewidth data is linear in the data

but hides high complexity in the query

→ Lower bound: Tractable complexity in TID data and query seems unlikely, even for tree-shaped queries and data

→ Can we compute lineages more efficiently in some cases?

Références

Documents relatifs

In particular, we construct an ontology T such that answering OMQs (T , q) with tree-shaped CQs q is W[1]- hard if the number of leaves in q is regarded as the parameter.. The number

Does query evaluation on probabilistic data have lower complexity when the structure of the data is close to a tree.. • Existing tools for

For any schema S with result bounds, query Q, and FGTGDs Σ deciding the existence of a monotone plan is 2EXPTIME-complete We can also show choice approximability for another

We study the data complexity of conjunctive query answering under these new semantics, and show a general tractability result for all known first-order rewritable ontology languages.

Number of property types within each of the following groupings: overall, context URL, resource, class, data type.. This tasks tells us how many property types are used in

An important tool we use for analyzing the complexity of query answering is the notion of materializability of a TBox T , which means that computing the certain answers to any query

We show that, contrary to common belief that aggre- gates efficiently summarize (and compress) such histories, the space needed for maintaining exact aggregates is actually larger

While the complexity of DLs has been studied extensively, data complexity in expressive DLs has been characterized only for answering atomic queries, and was still open for