• Aucun résultat trouvé

Provenance in Databases ... and Links to Knowledge Compilation

N/A
N/A
Protected

Academic year: 2022

Partager "Provenance in Databases ... and Links to Knowledge Compilation"

Copied!
218
0
0

Texte intégral

(1)

Provenance in Databases

... and Links to Knowledge Compilation

Antoine Amarilli December 17, 2019 Kocoon Workshop

(2)

Provenance management

• Common task on databases: query evaluation

• What if we wantmorethan the result?

Wheredoes the result come from?

Whywas this result obtained?

Howwas the result produced?

• What is theprobabilityof the result?

• How manytimeswas the result obtained?

• How would the result change if some data wasmissing?

• What is the minimalsecurity clearanceI need to see the result?

• How can a result beexplainedto the user?

• Provenance management: extend query evaluation with provenance informationto answer these questions

• Provenance information often representable as acircuit

(3)

Provenance management

• Common task on databases: query evaluation

• What if we wantmorethan the result?

Wheredoes the result come from?

Whywas this result obtained?

Howwas the result produced?

• What is theprobabilityof the result?

• How manytimeswas the result obtained?

• How would the result change if some data wasmissing?

• What is the minimalsecurity clearanceI need to see the result?

• How can a result beexplainedto the user?

• Provenance management: extend query evaluation with provenance informationto answer these questions

• Provenance information often representable as acircuit

(4)

Provenance management

• Common task on databases: query evaluation

• What if we wantmorethan the result?

Wheredoes the result come from?

Whywas this result obtained?

Howwas the result produced?

• What is theprobabilityof the result?

• How manytimeswas the result obtained?

• How would the result change if some data wasmissing?

• What is the minimalsecurity clearanceI need to see the result?

• How can a result beexplainedto the user?

• Provenance management: extend query evaluation with provenance informationto answer these questions

• Provenance information often representable as acircuit

(5)

Goal of this talk

• Refresher onrelational databases, andprovenancefor them (standard in database theory)

• Primer onquery evaluation(MSO/automata) onwords and trees (standard in database theory and logics)

• Present a notion ofprovenancefor queries on trees

(less standard, but nice connections to knowledge compilation)

• Present applications toprobabilitiesandenumeration for relational data and trees

Myco-authorsfor results in this talk (and some of the slides):

(6)

Goal of this talk

• Refresher onrelational databases, andprovenancefor them (standard in database theory)

• Primer onquery evaluation(MSO/automata) onwords and trees (standard in database theory and logics)

• Present a notion ofprovenancefor queries on trees

(less standard, but nice connections to knowledge compilation)

• Present applications toprobabilitiesandenumeration for relational data and trees

Myco-authorsfor results in this talk (and some of the slides):

(7)

Outline

Query Evaluation on Relational Databases Boolean Provenance on Relational Databases Semiring Provenance on Relational Databases Query Evaluation on Trees and Words

Boolean Provenance on Trees and Words Applications to Probability Computation Applications to Enumeration

Conclusion

(8)

Relational DBMSs

• Relational model: express data as relations (i.e., tables)

• A standard query language:SQL

(9)

Example

Guest id name email

1 John Smith john.smith@gmail.com 2 Alice Black alice@black.name 3 John Smith john.smith@ens.fr

Reservation

id guest room arrival nights

1 1 504 2019-01-01 5

2 2 107 2019-01-10 3

3 3 302 2019-01-15 6

4 2 504 2019-01-15 2

5 2 107 2019-01-30 1

(10)

Relations and databases

Formally:

• Arelation schemaRis a finite sequence of attribute names

• Adatabase schemaDmaps eachrelation nameto arelation schema

• Atupleover relation schemaRmaps each attribute name ofR to adata value

• Arelation instance overRis a finite set of tuples overR

• Adatabaseover database schemaDmaps eachrelation nameR ofDto a relation instance over the relation schema ofRinD

(11)

The positive relational algebra

• Algebraic languageto express queries

• Eachoperatorapplies to 0, 1, or 2subexpressionsand produces arelation instance

• Main operators:

R: relation name

ρa→b: rename attributeatob

Πa1,...,an: project on attributesa1, . . . ,an

σϕ: select all tuples satisfying conditionϕ

: union of two relations (with same relation schema)

×: cross product of two relations

(12)

Relation name

Guest id name email

1 John Smith john.smith@gmail.com 2 Alice Black alice@black.name 3 John Smith john.smith@ens.fr

Reservation id guest room arrival nights

1 1 504 2019-01-01 5

2 2 107 2019-01-10 3

3 3 302 2019-01-15 6

4 2 504 2019-01-15 2

5 2 107 2019-01-30 1

Expression: Guest Result:

id name email

1 John Smith john.smith@gmail.com 2 Alice Black alice@black.name 3 John Smith john.smith@ens.fr

(13)

Renaming

Guest id name email

1 John Smith john.smith@gmail.com 2 Alice Black alice@black.name 3 John Smith john.smith@ens.fr

Reservation id guest room arrival nights

1 1 504 2019-01-01 5

2 2 107 2019-01-10 3

3 3 302 2019-01-15 6

4 2 504 2019-01-15 2

5 2 107 2019-01-30 1

Expression: ρid→guest(Guest) Result:

guest name email

1 John Smith john.smith@gmail.com 2 Alice Black alice@black.name 3 John Smith john.smith@ens.fr

(14)

Projection

Guest id name email

1 John Smith john.smith@gmail.com 2 Alice Black alice@black.name 3 John Smith john.smith@ens.fr

Reservation id guest room arrival nights

1 1 504 2019-01-01 5

2 2 107 2019-01-10 3

3 3 302 2019-01-15 6

4 2 504 2019-01-15 2

5 2 107 2019-01-30 1

Expression: Πemail,id(Guest) Result:

email id

john.smith@gmail.com 1 alice@black.name 2 john.smith@ens.fr 3

(15)

Selection

Guest id name email

1 John Smith john.smith@gmail.com 2 Alice Black alice@black.name 3 John Smith john.smith@ens.fr

Reservation id guest room arrival nights

1 1 504 2019-01-01 5

2 2 107 2019-01-10 3

3 3 302 2019-01-15 6

4 2 504 2019-01-15 2

5 2 107 2019-01-30 1

Expression: σarrival>2019-01-12∧guest=2(Reservation) Result:

id guest room arrival nights

4 2 504 2019-01-15 2

5 2 107 2019-01-30 1

The formula used in the selection can be anyBoolean combinationof comparisonsof attributes to attributes or constants

(16)

Cross product

Guest id name email

1 John Smith john.smith@gmail.com 2 Alice Black alice@black.name 3 John Smith john.smith@ens.fr

Reservation id guest room arrival nights

1 1 504 2019-01-01 5

2 2 107 2019-01-10 3

3 3 302 2019-01-15 6

4 2 504 2019-01-15 2

5 2 107 2019-01-30 1

Expression: Πid(Guest)×Πname(Guest) Result:

id name 1 Alice Black 2 Alice Black 3 Alice Black 1 John Smith 2 John Smith 3 John Smith

(17)

Natural join

Guest id name email

1 John Smith john.smith@gmail.com 2 Alice Black alice@black.name 3 John Smith john.smith@ens.fr

Reservation id guest room arrival nights

1 1 504 2019-01-01 5

2 2 107 2019-01-10 3

3 3 302 2019-01-15 6

4 2 504 2019-01-15 2

5 2 107 2019-01-30 1

Not a basic operator, but auseful shorthand!

Expression: Reservation./ ρid→guest(Guest) Result:

id guest room arrival nights name email

1 1 504 2019-01-01 5 John Smith john.smith@gmail.com

2 2 107 2019-01-10 3 Alice Black alice@black.name

3 3 302 2019-01-15 6 John Smith john.smith@ens.fr

4 2 504 2019-01-15 2 Alice Black alice@black.name

5 2 107 2019-01-30 1 Alice Black alice@black.name

Equivalent to:

Π (Guest)×Reservation)).

(18)

Union

Guest id name email

1 John Smith john.smith@gmail.com 2 Alice Black alice@black.name 3 John Smith john.smith@ens.fr

Reservation id guest room arrival nights

1 1 504 2019-01-01 5

2 2 107 2019-01-10 3

3 3 302 2019-01-15 6

4 2 504 2019-01-15 2

5 2 107 2019-01-30 1

Expression: Πroomguest=2(Reservation))∪ Πroomarrival=2019-01-15(Reservation)) Result:

room 107 302 504

(19)

Relational algebra vs relational calculus

Sometimes we write tuples asground factsrather than tables

Guest id name email

1 John Smith john.smith@gmail.com 2 Alice Black alice@black.name 3 John Smith john.smith@ens.fr

Guest(1, John Smith, john.smith@gmail.com), Guest(2, Alice Black, alice@black.name), Guest(3, John Smith, john.smith@ens.fr)

Sometimes we write queries inrelational calculusrather than algebra Πid(Guest)×Πname(Guest)

Q(x,y0) :∃y z x0z0 Guest(x,y,z)Guest(x0,y0,z0)

→ Relational algebra and calculus have thesame expressive power!

(20)

Relational algebra vs relational calculus

Sometimes we write tuples asground factsrather than tables

Guest id name email

1 John Smith john.smith@gmail.com 2 Alice Black alice@black.name 3 John Smith john.smith@ens.fr

Guest(1, John Smith, john.smith@gmail.com), Guest(2, Alice Black, alice@black.name), Guest(3, John Smith, john.smith@ens.fr)

Sometimes we write queries inrelational calculusrather than algebra Πid(Guest)×Πname(Guest)

Q(x,y0) :∃y z x0z0 Guest(x,y,z)Guest(x0,y0,z0)

→ Relational algebra and calculus have thesame expressive power!

(21)

Relational algebra vs relational calculus

Sometimes we write tuples asground factsrather than tables

Guest id name email

1 John Smith john.smith@gmail.com 2 Alice Black alice@black.name 3 John Smith john.smith@ens.fr

Guest(1, John Smith, john.smith@gmail.com), Guest(2, Alice Black, alice@black.name), Guest(3, John Smith, john.smith@ens.fr)

Sometimes we write queries inrelational calculusrather than algebra Πid(Guest)×Πname(Guest)

Q(x,y0) :∃y z x0z0 Guest(x,y,z)Guest(x0,y0,z0)

→ Relational algebra and calculus have thesame expressive power!

(22)

Outline

Query Evaluation on Relational Databases Boolean Provenance on Relational Databases Semiring Provenance on Relational Databases Query Evaluation on Trees and Words

Boolean Provenance on Trees and Words Applications to Probability Computation Applications to Enumeration

Conclusion

(23)

Data model

• Relational data model: data decomposed into relations, with labeled attributes. . .

• . . . with an extraprovenance annotationfor each tuple (think of it as a Boolean variable)

name position city classification

John Director New York unclassified Paul Janitor New York restricted

Dave Analyst Paris confidential

Ellen Field agent Berlin secret Magdalen Double agent Paris top secret Nancy HR director Paris restricted

Susan Analyst Berlin secret

(24)

Data model

• Relational data model: data decomposed into relations, with labeled attributes. . .

• . . . with an extraprovenance annotationfor each tuple (think of it as a Boolean variable)

name position city classification

John Director New York unclassified Paul Janitor New York restricted

Dave Analyst Paris confidential

Ellen Field agent Berlin secret Magdalen Double agent Paris top secret Nancy HR director Paris restricted

Susan Analyst Berlin secret

(25)

Data model

• Relational data model: data decomposed into relations, with labeled attributes. . .

• . . . with an extraprovenance annotationfor each tuple (think of it as a Boolean variable)

name position city classification prov John Director New York unclassified x1

Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5

Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

(26)

Boolean valuations

• DatabaseDwithntuples

• X ={x1,x2, . . . ,xn}theBoolean variablesannotating the tuples

• ValuationoverX: functionν :X → {⊥,>}

• Possible worldν(D): the subset ofDwhere we keep precisely the tuples whose annotation evaluates to>

(27)

Example of possible worlds

name position city classification prov John Director New York unclassified x1

Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5

Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

ν: x1 x2 x3 x4 x5 x6 x7

> > > > > > >

(28)

Example of possible worlds

name position city classification prov John Director New York unclassified x1

Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5

Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

ν: x1 x2 x3 x4 x5 x6 x7

> ⊥ > ⊥ > ⊥ >

(29)

Boolean provenance of query results

• Goal:Evaluate apositive relational algebra queryQ on a databaseD...

whose tuples are annotated withX =x1, . . . ,xn

• The result is arelation instanceR... where each tuple is annotated with aBoolean functiononX

• Semantics:For every tupletof the result, for every valuationν ofX, the annotation oftevaluates to true onν ifftQ(ν(D)) Example (What cities are in the table?)

name position city classification prov John Director New York unclassified x1

Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5

Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

city prov

New York x1x2

Paris x3x5x6

Berlin x4x7

Claim:we can compute this provenance while evaluating the query!

(30)

Boolean provenance of query results

• Goal:Evaluate apositive relational algebra queryQ

on a databaseD... whose tuples are annotated withX =x1, . . . ,xn

• The result is arelation instanceR... where each tuple is annotated with aBoolean functiononX

• Semantics:For every tupletof the result, for every valuationν ofX, the annotation oftevaluates to true onν ifftQ(ν(D)) Example (What cities are in the table?)

name position city classification prov John Director New York unclassified x1

Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5

Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

city prov

New York x1x2

Paris x3x5x6

Berlin x4x7

Claim:we can compute this provenance while evaluating the query!

(31)

Boolean provenance of query results

• Goal:Evaluate apositive relational algebra queryQ

on a databaseD... whose tuples are annotated withX =x1, . . . ,xn

• The result is arelation instanceR...

where each tuple is annotated with aBoolean functiononX

• Semantics:For every tupletof the result, for every valuationν ofX, the annotation oftevaluates to true onν ifftQ(ν(D)) Example (What cities are in the table?)

name position city classification prov John Director New York unclassified x1

Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5

Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

city prov

New York x1x2

Paris x3x5x6

Berlin x4x7

Claim:we can compute this provenance while evaluating the query!

(32)

Boolean provenance of query results

• Goal:Evaluate apositive relational algebra queryQ

on a databaseD... whose tuples are annotated withX =x1, . . . ,xn

• The result is arelation instanceR... where each tuple is annotated with aBoolean functiononX

• Semantics:For every tupletof the result, for every valuationν ofX, the annotation oftevaluates to true onν ifftQ(ν(D)) Example (What cities are in the table?)

name position city classification prov John Director New York unclassified x1

Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5

Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

city prov

New York x1x2

Paris x3x5x6

Berlin x4x7

Claim:we can compute this provenance while evaluating the query!

(33)

Boolean provenance of query results

• Goal:Evaluate apositive relational algebra queryQ

on a databaseD... whose tuples are annotated withX =x1, . . . ,xn

• The result is arelation instanceR... where each tuple is annotated with aBoolean functiononX

• Semantics:For every tupletof the result, for every valuationν ofX, the annotation oftevaluates to true onν ifftQ(ν(D))

Example (What cities are in the table?)

name position city classification prov John Director New York unclassified x1

Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5

Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

city prov

New York x1x2

Paris x3x5x6

Berlin x4x7

Claim:we can compute this provenance while evaluating the query!

(34)

Boolean provenance of query results

• Goal:Evaluate apositive relational algebra queryQ

on a databaseD... whose tuples are annotated withX =x1, . . . ,xn

• The result is arelation instanceR... where each tuple is annotated with aBoolean functiononX

• Semantics:For every tupletof the result, for every valuationν ofX, the annotation oftevaluates to true onν ifftQ(ν(D)) Example (What cities are in the table?)

name position city classification prov John Director New York unclassified x1

Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5

Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

city prov

New York x1x2

Paris x3x5x6

Berlin x4x7

Claim:we can compute this provenance while evaluating the query!

(35)

Boolean provenance of query results

• Goal:Evaluate apositive relational algebra queryQ

on a databaseD... whose tuples are annotated withX =x1, . . . ,xn

• The result is arelation instanceR... where each tuple is annotated with aBoolean functiononX

• Semantics:For every tupletof the result, for every valuationν ofX, the annotation oftevaluates to true onν ifftQ(ν(D)) Example (What cities are in the table?)

name position city classification prov John Director New York unclassified x1

Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5

Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

city prov

New York x1x2

Paris x3x5x6

Berlin x4x7

Claim:we can compute this provenance while evaluating the query!

(36)

Selection, renaming

Provenance annotations of selected tuples areunchanged Example (ρname→ncity=“New York”(R)))

name position city classification prov John Director New York unclassified x1 Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5 Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

n position city classification prov John Director New York unclassified x1

Paul Janitor New York restricted x2

(37)

Projection

Take the OR of provenance annotations of identical, merged tuples Example (πcity(R))

name position city classification prov John Director New York unclassified x1 Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5 Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

city prov

New York x1x2

Paris x3x5x6

Berlin x4x7

(38)

Union

Take the OR of provenance annotations of identical, merged tuples Example

πcityends-with(position,“agent”)(R))∪πcityposition=“Analyst”(R))

name position city classification prov John Director New York unclassified x1 Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5 Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

city prov Paris x3x5

Berlin x4x7

(39)

Cross product

Take the AND of provenance annotations of combined tuples Example

πcityends-with(position,“agent”)(R))onπcityposition=“Analyst”(R))

name position city classification prov John Director New York unclassified x1 Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5 Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

city prov Paris x3x5

Berlin x4x7

(40)

How is provenance actually represented?

Provenance annotations areBoolean functions

• The simplest representation isBoolean formulas

• Formalism used in most of the provenance literature Example

Is there a city with two different agents?

(x1x2)∨(x3x6)∨(x3x5)∨(x4x7)∨(x5x6)

Theorem (PTIME overhead)

For any fixedpositive relational algebraexpression, given an input database, we can compute in PTIME the provenance annotation of every tuple in the result

(41)

Other representation: Provenance circuits

[Deutch, Milo, Roy, and Tannen 2014]

• UseBoolean circuitsto represent provenance

• Every time an operation reuses a previously computed result, link to thepreviously created circuit gate

• Never largerthan provenance formulas

• Sometimesmore concise: provenance circuits can be...

More concise by alog logfactorthan provenance formulas for positive relational algebra [Amarilli, Bourhis, and Senellart 2016]

More concise by alogfactorthanmonotoneprovenance formulas for positive relational algebra

Super-polynomiallymore concise for more expressive query languages [Deutch, Milo, Roy, and Tannen 2014]

(42)

Other representation: Provenance circuits

[Deutch, Milo, Roy, and Tannen 2014]

• UseBoolean circuitsto represent provenance

• Every time an operation reuses a previously computed result, link to thepreviously created circuit gate

• Never largerthan provenance formulas

• Sometimesmore concise: provenance circuits can be...

More concise by alog logfactorthan provenance formulas for positive relational algebra [Amarilli, Bourhis, and Senellart 2016]

More concise by alogfactorthanmonotoneprovenance formulas for positive relational algebra

Super-polynomiallymore concise for more expressive query languages [Deutch, Milo, Roy, and Tannen 2014]

(43)

Example provenance circuit

(44)

What can we do with Boolean provenance?

(x1x2)∨(x3x6)∨(x3x5)∨(x4x7)∨(x5x6)

• The provenance describes, for each result tuple, thesubsetsof the input database for which it appears in the query result

• SAT: test if the tuple can be an answer when we delete some input tuples (trivial here)

• #SAT: number of sub-databases where the tuple is a result

Useful forprobabilistic reasoning(see later)

• Enumerating models:enumerating sub-databases where the tuple is a result

Useful toenumerate query results(see later)

(45)

What can we do with Boolean provenance?

(x1x2)∨(x3x6)∨(x3x5)∨(x4x7)∨(x5x6)

• The provenance describes, for each result tuple, thesubsetsof the input database for which it appears in the query result

• SAT: test if the tuple can be an answer when we delete some input tuples (trivial here)

• #SAT: number of sub-databases where the tuple is a result

Useful forprobabilistic reasoning(see later)

• Enumerating models:enumerating sub-databases where the tuple is a result

Useful toenumerate query results(see later)

(46)

What can we do with Boolean provenance?

(x1x2)∨(x3x6)∨(x3x5)∨(x4x7)∨(x5x6)

• The provenance describes, for each result tuple, thesubsetsof the input database for which it appears in the query result

• SAT: test if the tuple can be an answer when we delete some input tuples (trivial here)

• #SAT: number of sub-databases where the tuple is a result

Useful forprobabilistic reasoning(see later)

• Enumerating models:enumerating sub-databases where the tuple is a result

Useful toenumerate query results(see later)

(47)

What can we do with Boolean provenance?

(x1x2)∨(x3x6)∨(x3x5)∨(x4x7)∨(x5x6)

• The provenance describes, for each result tuple, thesubsetsof the input database for which it appears in the query result

• SAT: test if the tuple can be an answer when we delete some input tuples (trivial here)

• #SAT: number of sub-databases where the tuple is a result

Useful forprobabilistic reasoning(see later)

• Enumerating models:enumerating sub-databases where the tuple is a result

Useful toenumerate query results(see later)

(48)

Outline

Query Evaluation on Relational Databases Boolean Provenance on Relational Databases Semiring Provenance on Relational Databases Query Evaluation on Trees and Words

Boolean Provenance on Trees and Words Applications to Probability Computation Applications to Enumeration

Conclusion

(49)

Commutative semiring(K,0,1,⊕,⊗)

• SetKwith distinguished elements0,1

• ⊕associative,commutativeoperator, with identity0K:

a(bc) = (ab)c

ab=ba

a0=0a=a

• ⊗associative,commutativeoperator, with identity1K:

a(bc) = (ab)c

ab=ba

a1=1a=a

• ⊗distributesover⊕:

a⊗(b⊕c) = (ab)⊕(a⊗c)

0isannihilatingfor⊗:

a0=0a=0

(50)

Example semirings

• (N,0,1,+,×):countingsemiring

• ({⊥,>},⊥,>,∨,∧): Booleansemiring

• ({unclassified,restricted,confidential,secret,top secret}, top secret,unclassified,min,max):securitysemiring

• (N∪ {∞},∞,0,min,+):tropicalsemiring

• ({Boolean functions overX },⊥,>,∨,∧): semiring ofBoolean functionsoverX

• (N[X],0,1,+,×): semiring of integer-valuedpolynomials with variables inX (also calledHow-semiring oruniversalsemiring)

(51)

Semiring provenance[Green, Karvounarakis, and Tannen 2007]

• Wefixa semiring(K,0,1,⊕,⊗)

• We assume provenance annotations areinK

• We consider a queryQfrom thepositive relational algebra (selection, projection, renaming, product, union)

• We define a semantics for the provenance of a tupletQ(D) inductivelyon the structure ofQjust like before

(52)

Selection, renaming

Provenance annotations of selected tuples areunchanged Example (ρname→ncity=“New York”(R)))

name position city classification prov John Director New York unclassified x1 Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5 Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

n position city classification prov John Director New York unclassified x1

Paul Janitor New York restricted x2

(53)

Projection

Provenance annotations of identical, merged, tuples are⊕-ed Example (πcity(R))

name position city classification prov John Director New York unclassified x1 Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5 Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

city prov

New York x1x2

Paris x3x5x6

Berlin x4x7

(54)

Union

Provenance annotations of identical, merged, tuples are⊕-ed Example

πcityends-with(position,“agent”)(R))∪πcityposition=“Analyst”(R))

name position city classification prov John Director New York unclassified x1 Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5 Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

city prov Paris x3x5

Berlin x4x7

(55)

Cross product

Provenance annotations of combined tuples are⊗-ed Example

πcityends-with(position,“agent”)(R))onπcityposition=“Analyst”(R))

name position city classification prov John Director New York unclassified x1 Paul Janitor New York restricted x2

Dave Analyst Paris confidential x3

Ellen Field agent Berlin secret x4

Magdalen Double agent Paris top secret x5 Nancy HR director Paris restricted x6

Susan Analyst Berlin secret x7

city prov Paris x3x5

Berlin x4x7

(56)

What can we do with semiring provenance?

counting semiring: count the number of times a tuple can be derived, multiset semantics

Boolean semiring: determines if a tuple exists when a subdatabase is selected

security semiring: determines the minimum clearance level required to get a tuple as a result

tropical semiring: minimum-weight way of deriving a tuple (think shortest path in a graph)

Boolean functions: Boolean provenance, as previously defined integer polynomials: N[X], universal provenance, see further

(57)

Example of security provenance

πcityname<name2name,city(R)onρnamename2name,city(R))))

name position city prov

John Director New York unclassified Paul Janitor New York restricted

Dave Analyst Paris confidential

Ellen Field agent Berlin secret Magdalen Double agent Paris top secret Nancy HR director Paris restricted

Susan Analyst Berlin secret

city prov

New York restricted Paris confidential Berlin secret

(58)

Properties[Green, Karvounarakis, and Tannen 2007]

• Semiring provenance still hasPTIMEdata overhead

• Semiring homomorphismscommutewith provenance

computation: ifK−−→hom K0, then one can compute the provenance inK, apply the homomorphism, and obtain the same result as when computing provenance inK0

• The integer polynomial semiringN[X]isuniversal: there is a unique homomorphism to any other commutative semiring that respects a given valuation of the variables

• This meansall computations can be performed in the universal semiring, and homomorphisms applied next

• Twoequivalent queriescan have twodifferent provenance annotationson the same database, in some semirings

(59)

Properties[Green, Karvounarakis, and Tannen 2007]

• Semiring provenance still hasPTIMEdata overhead

• Semiring homomorphismscommutewith provenance

computation: ifK−−→hom K0, then one can compute the provenance inK, apply the homomorphism, and obtain the same result as when computing provenance inK0

• The integer polynomial semiringN[X]isuniversal: there is a unique homomorphism to any other commutative semiring that respects a given valuation of the variables

• This meansall computations can be performed in the universal semiring, and homomorphisms applied next

• Twoequivalent queriescan have twodifferent provenance annotationson the same database, in some semirings

(60)

Properties[Green, Karvounarakis, and Tannen 2007]

• Semiring provenance still hasPTIMEdata overhead

• Semiring homomorphismscommutewith provenance

computation: ifK−−→hom K0, then one can compute the provenance inK, apply the homomorphism, and obtain the same result as when computing provenance inK0

• The integer polynomial semiringN[X]isuniversal: there is a unique homomorphism to any other commutative semiring that respects a given valuation of the variables

• This meansall computations can be performed in the universal semiring, and homomorphisms applied next

• Twoequivalent queriescan have twodifferent provenance annotationson the same database, in some semirings

(61)

Properties[Green, Karvounarakis, and Tannen 2007]

• Semiring provenance still hasPTIMEdata overhead

• Semiring homomorphismscommutewith provenance

computation: ifK−−→hom K0, then one can compute the provenance inK, apply the homomorphism, and obtain the same result as when computing provenance inK0

• The integer polynomial semiringN[X]isuniversal: there is a unique homomorphism to any other commutative semiring that respects a given valuation of the variables

• This meansall computations can be performed in the universal semiring, and homomorphisms applied next

• Twoequivalent queriescan have twodifferent provenance annotationson the same database, in some semirings

(62)

Properties[Green, Karvounarakis, and Tannen 2007]

• Semiring provenance still hasPTIMEdata overhead

• Semiring homomorphismscommutewith provenance

computation: ifK−−→hom K0, then one can compute the provenance inK, apply the homomorphism, and obtain the same result as when computing provenance inK0

• The integer polynomial semiringN[X]isuniversal: there is a unique homomorphism to any other commutative semiring that respects a given valuation of the variables

• This meansall computations can be performed in the universal semiring, and homomorphisms applied next

• Twoequivalent queriescan have twodifferent provenance annotationson the same database, in some semirings

(63)

Extensions

• Beyond positive relational algebra...

• Allowrelational difference: need a semiring withmonus, but complicated semantics [Amer 1984; Geerts and Poggi 2010;

Amsterdamer, Deutch, and Tannen 2011a; Amarilli and Monet 2016]

• Allowaggregate queries: extend semirings tosemimodules [Amsterdamer, Deutch, and Tannen 2011b; Fink, Han, and Olteanu 2012]

• Allowrecursive queries: representation asformal power seriesor cycluits[Amarilli, Bourhis, Monet, and Senellart 2017]

• Beyond semiring provenance...

Where-provenance: capture which outputvaluecomes from which inputvalue[Buneman, Khanna, and Tan 2001]

Why-not provenance: capture why an output tuple wasnot produced, usually as a function of thequery[Chapman and Jagadish 2009]

(64)

Extensions

• Beyond positive relational algebra...

• Allowrelational difference: need a semiring withmonus, but complicated semantics [Amer 1984; Geerts and Poggi 2010;

Amsterdamer, Deutch, and Tannen 2011a; Amarilli and Monet 2016]

• Allowaggregate queries: extend semirings tosemimodules [Amsterdamer, Deutch, and Tannen 2011b; Fink, Han, and Olteanu 2012]

• Allowrecursive queries: representation asformal power seriesor cycluits[Amarilli, Bourhis, Monet, and Senellart 2017]

• Beyond semiring provenance...

Where-provenance: capture which outputvaluecomes from which inputvalue[Buneman, Khanna, and Tan 2001]

Why-not provenance: capture why an output tuple wasnot produced, usually as a function of thequery[Chapman and Jagadish 2009]

(65)

Extensions

• Beyond positive relational algebra...

• Allowrelational difference: need a semiring withmonus, but complicated semantics [Amer 1984; Geerts and Poggi 2010;

Amsterdamer, Deutch, and Tannen 2011a; Amarilli and Monet 2016]

• Allowaggregate queries: extend semirings tosemimodules [Amsterdamer, Deutch, and Tannen 2011b; Fink, Han, and Olteanu 2012]

• Allowrecursive queries: representation asformal power seriesor cycluits[Amarilli, Bourhis, Monet, and Senellart 2017]

• Beyond semiring provenance...

Where-provenance: capture which outputvaluecomes from which inputvalue[Buneman, Khanna, and Tan 2001]

Why-not provenance: capture why an output tuple wasnot produced, usually as a function of thequery[Chapman and Jagadish 2009]

(66)

Extensions

• Beyond positive relational algebra...

• Allowrelational difference: need a semiring withmonus, but complicated semantics [Amer 1984; Geerts and Poggi 2010;

Amsterdamer, Deutch, and Tannen 2011a; Amarilli and Monet 2016]

• Allowaggregate queries: extend semirings tosemimodules [Amsterdamer, Deutch, and Tannen 2011b; Fink, Han, and Olteanu 2012]

• Allowrecursive queries: representation asformal power seriesor cycluits[Amarilli, Bourhis, Monet, and Senellart 2017]

• Beyond semiring provenance...

Where-provenance: capture which outputvaluecomes from which inputvalue[Buneman, Khanna, and Tan 2001]

Why-not provenance: capture why an output tuple wasnot produced, usually as a function of thequery[Chapman and Jagadish 2009]

(67)

Extensions

• Beyond positive relational algebra...

• Allowrelational difference: need a semiring withmonus, but complicated semantics [Amer 1984; Geerts and Poggi 2010;

Amsterdamer, Deutch, and Tannen 2011a; Amarilli and Monet 2016]

• Allowaggregate queries: extend semirings tosemimodules [Amsterdamer, Deutch, and Tannen 2011b; Fink, Han, and Olteanu 2012]

• Allowrecursive queries: representation asformal power seriesor cycluits[Amarilli, Bourhis, Monet, and Senellart 2017]

• Beyond semiring provenance...

Where-provenance: capture which outputvaluecomes from which inputvalue[Buneman, Khanna, and Tan 2001]

Why-not provenance: capture why an output tuple wasnot produced, usually as a function of thequery[Chapman and Jagadish 2009]

(68)

Outline

Query Evaluation on Relational Databases Boolean Provenance on Relational Databases Semiring Provenance on Relational Databases Query Evaluation on Trees and Words

Boolean Provenance on Trees and Words Applications to Probability Computation Applications to Enumeration

Conclusion

Références

Documents relatifs

This paper focus on query semantics on a graph database with constraints defined over a global schema and confidence scores as- signed to the local data sources (which compose

We gave an overview of a provenance-based framework for routing queries, and of di↵erent approaches considered to lower the complexity of query evaluation: considering the shape of

Clearly, each component of the R2PG-DM mapping can be specified by a declarative non-recursive Datalog query [2] over the source DB represented using a fixed set of predicates

Relational Database Management System (RDBMS) is a specialization of a Database Management System (DBMS) whose data abstraction is based on the rela- tional model [Date 2004]

In- stead of connecting with the database through a socket, the SQL code could be directly executed inside the server and the results could be used inside the analytical tool

Dif- ferently from normal process mining analysis which uses event log files, we use event data stored in relational databases.. In other words, we move the location of data from

Selection queries. As shown in Section 4.1, sequential equality selections are used to evaluate range and prefix se- lections. Thus, it suffices to show that equality selections

Inference control can guarantee confidentiality but is costly to imple- ment. Access control can be implemented efficiently but cannot guarantee con- fidentiality. Hence, it is