• Aucun résultat trouvé

Provenance MPRI 2.26.2: Web Data Management

N/A
N/A
Protected

Academic year: 2022

Partager "Provenance MPRI 2.26.2: Web Data Management"

Copied!
150
0
0

Texte intégral

(1)

Provenance

MPRI 2.26.2: Web Data Management

Antoine Amarilli, Pierre Senellart Friday, January 18th

(2)

Provenance definition

(3)

Provenance management

• Data management isall about query evaluation

• What if we wantsomething morethan the query result?

Wheredoes the result come from?

Whywas this result obtained?

Howwas the result produced?

What is theprobabilityof the result?

How manytimeswas the result obtained?

How would the resultchangeif part of the input data was missing?

What is the minimalsecurity clearanceI need to see the result?

What is themost economical wayof obtaining the result?

How can a result beexplainedto the user?

• Provenance management: along with query evaluation, record additional bookkeeping informationallowing to answer the

(4)

Data model

• Relational data model: data decomposed into relations, with labeled attributes…

first as a tuple id)

name position city classification John Director New York unclassified Paul Janitor New York restricted

Dave Analyst Paris confidential

Ellen Field agent Berlin secret Magdalen Double agent Paris top secret Nancy HR director Paris restricted

Susan Analyst Berlin secret

(5)

Data model

• Relational data model: data decomposed into relations, with labeled attributes…

• … with an extraprovenance annotationfor each tuple (think of it first as a tuple id)

name position city classification John Director New York unclassified Paul Janitor New York restricted

Dave Analyst Paris confidential

Ellen Field agent Berlin secret Magdalen Double agent Paris top secret Nancy HR director Paris restricted

Susan Analyst Berlin secret

(6)

Data model

• Relational data model: data decomposed into relations, with labeled attributes…

• … with an extraprovenance annotationfor each tuple (think of it first as a tuple id)

name position city classification prov John Director New York unclassified t1 Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t

(7)

Relations and databases

Formally:

• Arelational schemaRis a finite sequence of distinctattribute names; thearityofRis|R|

• Adatabase schemais a finite set ofrelation names, each having arelational schema

• Atupleover relational schemaRis a mapping fromRtodata values; each tuple comes with aprovenance annotation

• Arelation instance(orrelation) overRis a finite set oftuples overR

• Adatabase instance(ordatabase) over database schemaDis a mapping from the support ofDmapping eachrelation nameR to arelation instanceoverD

(8)

Queries

• Aqueryis an arbitraryfunctionthat maps databases over a fixed database schemaDto relations over some relational schemaR

• The query doesnotlook at the provenance annotations; we will give semantics for the provenance annotations of the output, based on that of the input

• Example of query languages:

First-Order logic (FO) or the relational algebra

Monadic-Second Order logic (MSO)

(9)

Outline

Provenance definition Preliminaries

Boolean provenance

Relational algebra reminder Semiring provenance And beyond…

Representation Systems for Provenance Provenance for Trees

Implementing Provenance Support

(10)

Boolean provenance

X ={x1,x2, . . . ,xn}finite set ofBoolean events

• Provenance annotation: Boolean functionoverX, i.e., a function of the form:(X → {⊥,>})→ {⊥,>}

• Interpretation: possible-world semantics

every valuationν :X → {⊥,>}denotes apossible worldof the database

the provenance of a tuple onνevaluates toor>depending whether this tupleexistsin that possible world

for example, if every tuple of a database is annotated with a different Boolean event, the set of possible worlds is the set ofall

(11)

Example of possible worlds

name position city classification prov John Director New York unclassified t1 Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4

Magdalen Double agent Paris top secret t5 Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

ν: t1 t2 t3 t4 t5 t6 t7

> > > > > > >

(12)

Example of possible worlds

name position city classification prov John Director New York unclassified t1

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4

Magdalen Double agent Paris top secret t5

Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

(13)

Boolean provenance of query results

ν(D): thesubdatabaseofDwhere all tuples whose provenance annotation evaluates tobyν is removed

• TheBoolean provenanceprovq,D(t)of tuplet∈q(D)is the function:

ν 7→



>ift∈q(ν(D))

otherwise Example (What cities are in the table?)

name position city classification prov John Director New York unclassified t1

Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4

Magdalen Double agent Paris top secret t5

Nancy HR director Paris restricted t6

Susan Analyst Berlin secret

city prov

Berlin t4∨t7 New York t1∨t2 Paris t3∨t5∨t6

(14)

Application: Probabilistic databases

[Green and Tannen, 2006, Suciu et al., 2011]

• Tuple-independent database: each tupletin a database is annotated withindependentprobabilityPr(t)of existing

• Probability of a possible worldD⊆D: Pr(D) =∏

tDPr(t)×

tD\D(1Pr(t))

• Probability of a tuple for a queryqoverD: Pr(t∈q(D)) =

DD tq(D)

Pr(D)

tithenPr(t∈q(D)) =Pr(provq,D(t))

• Computing the probability of an answer tuple databases thus amounts tocomputing its Boolean provenance, and then computing theprobability of that Boolean function

• Also works for more complex probabilistic models

(15)

Application: Probabilistic databases

[Green and Tannen, 2006, Suciu et al., 2011]

• Tuple-independent database: each tupletin a database is annotated withindependentprobabilityPr(t)of existing

• Probability of a possible worldD⊆D: Pr(D) =∏

tDPr(t)×

tD\D(1Pr(t))

• Probability of a tuple for a queryqoverD: Pr(t∈q(D)) =

DD tq(D)

Pr(D)

• IfPr(xi) :=Pr(ti)wherexiis the provenance annotation of tuple tithenPr(t∈q(D)) =Pr(provq,D(t))

• Computing the probability of an answer tuple databases thus amounts tocomputing its Boolean provenance, and then

(16)

Example of probability computation

name position city classification prov prob John Director New York unclassified t1 0.5

Paul Janitor New York restricted t2 0.7

Dave Analyst Paris confidential t3 0.3

Ellen Field agent Berlin secret t4 0.2

Magdalen Double agent Paris top secret t5 1.0 Nancy HR director Paris restricted t6 0.8

Susan Analyst Berlin secret t7 0.2

city prov

(17)

Example of probability computation

name position city classification prov prob John Director New York unclassified t1 0.5

Paul Janitor New York restricted t2 0.7

Dave Analyst Paris confidential t3 0.3

Ellen Field agent Berlin secret t4 0.2

Magdalen Double agent Paris top secret t5 1.0 Nancy HR director Paris restricted t6 0.8

Susan Analyst Berlin secret t7 0.2

city prov prob

Berlin t4∨t7 1(10.2)×(10.2) = 0.36 New York t1∨t2 1(10.5)×(10.7) = 0.85

(18)

What now?

• How tocomputeBoolean provenance for practical query languages? What complexity?

• Can wedo morewith provenance?

• How should werepresentprovenance annotations?

• How can weimplementsupport for provenance management in a relational database management system?

(19)

Outline

Provenance definition Preliminaries

Boolean provenance

Relational algebra reminder Semiring provenance And beyond…

Representation Systems for Provenance Provenance for Trees

Implementing Provenance Support

(20)

Relational algebra

• Basic relations:

the relation names in the signature

constant relations, e.g., the empty relation

• ProjectionΠ

• Selectionσ

• Renamingρ

• Union

• Product×andjoin

(21)

Projection: keep a subsequence of the attributes

Class

date teacher resp name num

2019-01-11 Antoine Olivier Web Data Mgmt 4 2019-01-18 Pierre Olivier Web Data Mgmt 5 2019-02-01 Pierre Olivier Web Data Mgmt 6 2019-02-08 Pierre Olivier Web Data Mgmt 7

Πteacher,resp(Class) teacher resp

Antoine Olivier Pierre Olivier

Duplicates areremoved

(22)

Projection: keep a subsequence of the attributes

Class

date teacher resp name num

2019-01-11 Antoine Olivier Web Data Mgmt 4 2019-01-18 Pierre Olivier Web Data Mgmt 5 2019-02-01 Pierre Olivier Web Data Mgmt 6 2019-02-08 Pierre Olivier Web Data Mgmt 7

Πteacher,resp(Class) teacher resp

Pierre Olivier

Duplicates areremoved

(23)

Projection: keep a subsequence of the attributes

Class

date teacher resp name num

2019-01-11 Antoine Olivier Web Data Mgmt 4 2019-01-18 Pierre Olivier Web Data Mgmt 5 2019-02-01 Pierre Olivier Web Data Mgmt 6 2019-02-08 Pierre Olivier Web Data Mgmt 7

Πteacher,resp(Class) teacher resp Antoine Olivier Pierre Olivier

Duplicates areremoved

(24)

Projection: keep a subsequence of the attributes

Class

date teacher resp name num

2019-01-11 Antoine Olivier Web Data Mgmt 4 2019-01-18 Pierre Olivier Web Data Mgmt 5 2019-02-01 Pierre Olivier Web Data Mgmt 6 2019-02-08 Pierre Olivier Web Data Mgmt 7

Πteacher,resp(Class) teacher resp

(25)

Selection: keep a subset of the tuples

Class

date teacher resp name num

2019-01-11 Antoine Olivier Web Data Mgmt 4 2019-01-18 Pierre Olivier Web Data Mgmt 5 2019-02-01 Pierre Olivier Web Data Mgmt 6 2019-02-08 Pierre Olivier Web Data Mgmt 7

σteacher=“Antoine”(Class)

date teacher resp name num

2019-01-11 Antoine Olivier Web Data Mgmt 4

(26)

Selection: keep a subset of the tuples

Class

date teacher resp name num

2019-01-11 Antoine Olivier Web Data Mgmt 4 2019-01-18 Pierre Olivier Web Data Mgmt 5 2019-02-01 Pierre Olivier Web Data Mgmt 6 2019-02-08 Pierre Olivier Web Data Mgmt 7

σteacher=“Antoine”(Class)

(27)

Selection: keep a subset of the tuples

Class

date teacher resp name num

2019-01-11 Antoine Olivier Web Data Mgmt 4 2019-01-18 Pierre Olivier Web Data Mgmt 5 2019-02-01 Pierre Olivier Web Data Mgmt 6 2019-02-08 Pierre Olivier Web Data Mgmt 7

σteacher=“Antoine”(Class)

date teacher resp name num

2019-01-11 Antoine Olivier Web Data Mgmt 4

(28)

Rename: change the name of attributes

Class

date teacher resp name num

2019-01-11 Antoine Olivier Web Data Mgmt 4 2019-01-18 Pierre Olivier Web Data Mgmt 5 2019-02-01 Pierre Olivier Web Data Mgmt 6 2019-02-08 Pierre Olivier Web Data Mgmt 7

ρrespboss(Class)

2019-01-11 Antoine Olivier Web Data Mgmt 4 2019-01-18 Pierre Olivier Web Data Mgmt 5 2019-02-01 Pierre Olivier Web Data Mgmt 6 2019-02-08 Pierre Olivier Web Data Mgmt 7

(29)

Rename: change the name of attributes

Class

date teacher resp name num

2019-01-11 Antoine Olivier Web Data Mgmt 4 2019-01-18 Pierre Olivier Web Data Mgmt 5 2019-02-01 Pierre Olivier Web Data Mgmt 6 2019-02-08 Pierre Olivier Web Data Mgmt 7

ρrespboss(Class)

date teacher boss name num

2019-01-11 Antoine Olivier Web Data Mgmt 4 2019-01-18 Pierre Olivier Web Data Mgmt 5 2019-02-01 Pierre Olivier Web Data Mgmt 6 2019-02-08 Pierre Olivier Web Data Mgmt 7

(30)

Rename: change the name of attributes

Class

date teacher resp name num

2019-01-11 Antoine Olivier Web Data Mgmt 4 2019-01-18 Pierre Olivier Web Data Mgmt 5 2019-02-01 Pierre Olivier Web Data Mgmt 6 2019-02-08 Pierre Olivier Web Data Mgmt 7

ρrespboss(Class)

date teacher boss name num

(31)

Union

• Take tuples occurring inone ofthe input tables

• Applies to twotableswith the sameattributes

Students1 id name 1 Arthur Dent 2 Ford Prefect

id nameS2

42 Zaphod B.

=

Students1 S2 id name 1 Arthur Dent 2 Ford Prefect 42 Zaphod B.

Duplicates areremovedhere as well

(32)

Union

• Take tuples occurring inone ofthe input tables

• Applies to twotableswith the sameattributes

Students1 id name 1 Arthur Dent 2 Ford Prefect

id nameS2

42 Zaphod B.

=

id1 nameArthur Dent 2 Ford Prefect 42 Zaphod B.

Duplicates areremovedhere as well

(33)

Union

• Take tuples occurring inone ofthe input tables

• Applies to twotableswith the sameattributes

Students1 id name 1 Arthur Dent 2 Ford Prefect

S2 id name 42 Zaphod B.

=

Students1 S2 id name 1 Arthur Dent 2 Ford Prefect 42 Zaphod B.

Duplicates areremovedhere as well

(34)

Union

• Take tuples occurring inone ofthe input tables

• Applies to twotableswith the sameattributes

Students1 id name 1 Arthur Dent 2 Ford Prefect

id nameS2

42 Zaphod B.

=

id1 nameArthur Dent 2 Ford Prefect 42 Zaphod B.

Duplicates areremovedhere as well

(35)

Union

• Take tuples occurring inone ofthe input tables

• Applies to twotableswith the sameattributes

Students1 id name 1 Arthur Dent 2 Ford Prefect

id nameS2

42 Zaphod B.

=

Students1 S2 id name 1 Arthur Dent 2 Ford Prefect 42 Zaphod B.

Duplicates areremovedhere as well

(36)

Union

• Take tuples occurring inone ofthe input tables

• Applies to twotableswith the sameattributes

Students1 id name 1 Arthur Dent 2 Ford Prefect

id nameS2

42 Zaphod B.

=

Students1 S2 id name 1 Arthur Dent 2 Ford Prefect 42 Zaphod B.

(37)

Union

• Take tuples occurring inone ofthe input tables

• Applies to twotableswith the sameattributes

Students1 id name 1 Arthur Dent 2 Ford Prefect

id nameS2

42 Zaphod B.

=

Students1 S2 id name 1 Arthur Dent 2 Ford Prefect 42 Zaphod B.

Duplicates areremovedhere as well

(38)

Product

• Take allcombinationsof the input tables

Students id name 1 Arthur Dent 2 Ford Prefect

×

Rooms room E200 E242

=

id name room 1 Arthur Dent E200 1 Arthur Dent E242 2 Ford Prefect E200 2 Ford Prefect E242

(39)

Product

• Take allcombinationsof the input tables

Students id name 1 Arthur Dent 2 Ford Prefect

×

Rooms room E200 E242

=

Students × Rooms id name room 1 Arthur Dent E200 1 Arthur Dent E242 2 Ford Prefect E200 2 Ford Prefect E242

(40)

Product

• Take allcombinationsof the input tables

Students id name 1 Arthur Dent 2 Ford Prefect

×

Rooms room E200 E242

=

id name room 1 Arthur Dent E200 1 Arthur Dent E242 2 Ford Prefect E200 2 Ford Prefect E242

(41)

Product

• Take allcombinationsof the input tables

Students id name 1 Arthur Dent 2 Ford Prefect

×

Rooms room E200 E242

=

Students × Rooms id name room 1 Arthur Dent E200 1 Arthur Dent E242 2 Ford Prefect E200 2 Ford Prefect E242

(42)

Product

• Take allcombinationsof the input tables

Students id name 1 Arthur Dent 2 Ford Prefect

×

Rooms room E200 E242

=

id name room 1 Arthur Dent E200 1 Arthur Dent E242 2 Ford Prefect E200 2 Ford Prefect E242

(43)

Product

• Take allcombinationsof the input tables

Students id name 1 Arthur Dent 2 Ford Prefect

×

Rooms room E200 E242

=

Students × Rooms id name room 1 Arthur Dent E200 1 Arthur Dent E242 2 Ford Prefect E200 2 Ford Prefect E242

(44)

Join

Product is useful to expressjoins:

Member id class

1 WDM

2 WDM

▷◁

Class

class date WDM Nov 21 ABC Nov 24 WDM Nov 28

=

Member ▷◁ Class id class date 1 WDM Nov 21 1 WDM Nov 28 2 WDM Nov 21 2 WDM Nov 28 ExpressMember ▷◁ Classwith theprevious operators:

Πid,class,date σclass=class2 (

ρclassclass2(

Member

)

×Class

))

(45)

Join

Product is useful to expressjoins:

Member id class

1 WDM

2 WDM

▷◁

Class

class date WDM Nov 21 ABC Nov 24 WDM Nov 28

=

Member ▷◁ Class id class date 1 WDM Nov 21 1 WDM Nov 28 2 WDM Nov 21 2 WDM Nov 28 ExpressMember ▷◁ Classwith theprevious operators:

Πid,class,date

(

σclass=class2 (

ρclassclass2(

Member

)

×Class

))

(46)

Join

Product is useful to expressjoins:

Member id class

1 WDM

2 WDM

▷◁

Class

class date WDM Nov 21 ABC Nov 24 WDM Nov 28

=

Member ▷◁ Class id class date 1 WDM Nov 21 1 WDM Nov 28 2 WDM Nov 21 2 WDM Nov 28 ExpressMember ▷◁ Classwith theprevious operators:

Πid,class,date σclass=class2 (

ρclassclass2(

Member

)

×Class

))

(47)

Join

Product is useful to expressjoins:

Member id class

1 WDM

2 WDM

▷◁

Class

class date WDM Nov 21 ABC Nov 24 WDM Nov 28

=

Member ▷◁ Class id class date 1 WDM Nov 21 1 WDM Nov 28 2 WDM Nov 21 2 WDM Nov 28 ExpressMember ▷◁ Classwith theprevious operators:

Πid,class,date

(

σclass=class2 (

ρclassclass2(

Member

)

×Class

))

(48)

Join

Product is useful to expressjoins:

Member id class

1 WDM

2 WDM

▷◁

Class

class date WDM Nov 21 ABC Nov 24 WDM Nov 28

=

Member ▷◁ Class id class date 1 WDM Nov 21 1 WDM Nov 28 2 WDM Nov 21 2 WDM Nov 28 ExpressMember ▷◁ Classwith theprevious operators:

Πid,class,date σclass=class2 (

ρclassclass2(

Member

)

×Class

))

(49)

Join

Product is useful to expressjoins:

Member id class

1 WDM

2 WDM

▷◁

Class

class date WDM Nov 21 ABC Nov 24 WDM Nov 28

=

Member ▷◁ Class id class date 1 WDM Nov 21 1 WDM Nov 28 2 WDM Nov 21 2 WDM Nov 28

ExpressMember ▷◁ Classwith theprevious operators:

Πid,class,date

(

σclass=class2 (

ρclassclass2(

Member

)

×Class

))

(50)

Join

Product is useful to expressjoins:

Member id class

1 WDM

2 WDM

▷◁

Class

class date WDM Nov 21 ABC Nov 24 WDM Nov 28

=

Member ▷◁ Class id class date 1 WDM Nov 21 1 WDM Nov 28 2 WDM Nov 21 2 WDM Nov 28 ExpressMember ▷◁ Classwith theprevious operators:

Πid,class,date σclass=class2 (

ρclassclass2(

Member

)

×Class

))

(51)

Join

Product is useful to expressjoins:

Member id class

1 WDM

2 WDM

▷◁

Class

class date WDM Nov 21 ABC Nov 24 WDM Nov 28

=

Member ▷◁ Class id class date 1 WDM Nov 21 1 WDM Nov 28 2 WDM Nov 21 2 WDM Nov 28 ExpressMember ▷◁ Classwith theprevious operators:

Πid,class,date

(

σclass=class2 (

ρclassclass2(

Member

)

×Class

))

(52)

Join

Product is useful to expressjoins:

Member id class

1 WDM

2 WDM

▷◁

Class

class date WDM Nov 21 ABC Nov 24 WDM Nov 28

=

Member ▷◁ Class id class date 1 WDM Nov 21 1 WDM Nov 28 2 WDM Nov 21 2 WDM Nov 28 ExpressMember ▷◁ Classwith theprevious operators:

Πid,class,date σclass=class2 ( ))

(53)

Join

Product is useful to expressjoins:

Member id class

1 WDM

2 WDM

▷◁

Class

class date WDM Nov 21 ABC Nov 24 WDM Nov 28

=

Member ▷◁ Class id class date 1 WDM Nov 21 1 WDM Nov 28 2 WDM Nov 21 2 WDM Nov 28 ExpressMember ▷◁ Classwith theprevious operators:

Πid,class,date

(

σclass=class2 (

ρclassclass2(Member)×Class)

)

(54)

Join

Product is useful to expressjoins:

Member id class

1 WDM

2 WDM

▷◁

Class

class date WDM Nov 21 ABC Nov 24 WDM Nov 28

=

Member ▷◁ Class id class date 1 WDM Nov 21 1 WDM Nov 28 2 WDM Nov 21 2 WDM Nov 28 ExpressMember ▷◁ Classwith theprevious operators:

(55)

Difference

• Take tuples that are in one table butnotin the other

• Applies to twotableswith sameattributes

Students1 id name 1 Arthur Dent 2 Ford Prefect

S3 id name 1 Arthur Dent 42 Zaphod B.

=

Students1 S3 id name 2 Ford Prefect

(56)

Difference

• Take tuples that are in one table butnotin the other

• Applies to twotableswith sameattributes

Students1 id name 1 Arthur Dent

id name

1 Arthur Dent 42 Zaphod B.

=

Students1 S3 id name 2 Ford Prefect

(57)

Difference

• Take tuples that are in one table butnotin the other

• Applies to twotableswith sameattributes

Students1 id name 1 Arthur Dent 2 Ford Prefect

S3 id name 1 Arthur Dent 42 Zaphod B.

=

Students1 S3 id name 2 Ford Prefect

(58)

Difference

• Take tuples that are in one table butnotin the other

• Applies to twotableswith sameattributes

Students1 id name

1 Arthur Dent

S3 id name 1 Arthur Dent

=

id name

2 Ford Prefect

(59)

Difference

• Take tuples that are in one table butnotin the other

• Applies to twotableswith sameattributes

Students1 id name 1 Arthur Dent 2 Ford Prefect

S3 id name 1 Arthur Dent 42 Zaphod B.

=

Students1 S3 id name 2 Ford Prefect

(60)

Difference

• Take tuples that are in one table butnotin the other

• Applies to twotableswith sameattributes

Students1 id name

1 Arthur Dent

S3 id name

1 Arthur Dent

=

Students1 S3 id name 2 Ford Prefect

(61)

Outline

Provenance definition Preliminaries

Boolean provenance

Relational algebra reminder Semiring provenance And beyond…

Representation Systems for Provenance Provenance for Trees

Implementing Provenance Support

(62)

Commutative semiring

• SetKwith distinguished elements0,1

associative,commutativeoperator, with identity0K:

a(bc) = (ab)c

ab=ba

a0=0a=a

associative,commutativeoperator, with identity1K:

a(bc) = (ab)c

ab=ba

a1=1a=a

distributesover⊕:

(63)

Example semirings

• (N,0,1,+,×): countingsemiring

• ({⊥,>},⊥,>,∨,∧): Booleansemiring

• ({unclassified,restricted,confidential,secret,top secret}, top secret,unclassified,min,max): securitysemiring

• (N∪ {∞},∞,0,min,+): tropicalsemiring

• ({Boolean functions overX },⊥,>,∨,∧): semiring ofBoolean functionsoverX

• (N[X],0,1,+,×): semiring of integer-valuedpolynomialswith variables inX (also calledHow-semiring oruniversalsemiring, see further)

• (P(P(X)),∅,{∅},∪,⋓): Why-semiring overX

(A⋓ { | })

(64)

Semiring provenance

• Wefixa semiring(K,0,1,⊕,⊗)

• We assume provenance annotations areinK

• We consider a queryqfrom thepositive relational algebra (selection, projection, renaming, cross product, union; joins can be simulated with renaming, cross product, selection, projection)

• We define a semantics for the provenance of a tuplet∈q(D) inductivelyon the structure ofq

(65)

Selection, renaming

Provenance annotations of selected tuples areunchanged Example (ρnamencity=“New York”(R)))

name position city classification prov John Director New York unclassified t1 Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4 Magdalen Double agent Paris top secret t5 Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

n position city classification prov John Director New York unclassified t1 Paul Janitor New York restricted t2

(66)

Projection

Provenance annotations of identical, merged, tuples are⊕-ed Example (πcity(R))

name position city classification prov John Director New York unclassified t1 Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4 Magdalen Double agent Paris top secret t5 Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

city prov

(67)

Union

Provenance annotations of identical, merged, tuples are⊕-ed Example

πcityends-with(position,“agent”)(R))∪πcityposition=“Analyst”(R))

name position city classification prov John Director New York unclassified t1 Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4 Magdalen Double agent Paris top secret t5 Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

city prov Berlin t4⊕t7

(68)

Cross product

Provenance annotations of combined tuples are⊗-ed Example

πcityends-with(position,“agent”)(R))⋊⋉πcityposition=“Analyst”(R))

name position city classification prov John Director New York unclassified t1 Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4 Magdalen Double agent Paris top secret t5 Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

(69)

What can we do with it?

counting semiring: count the number of times a tuple can be derived (multiset semantics)

Boolean semiring: indicates in which possible worlds the tuple appears

security semiring: determines the minimum clearance level required to get a tuple as a result

tropical semiring: minimum-weight way of deriving a tuple (think shortest path in a graph)

Boolean functions: Boolean provenance, as previously defined integer polynomials: universal provenance, see further

Why-semiring: Why-provenance[Buneman et al., 2001], set of combinations of tuples needed for a tuple to exist

(70)

Example of security provenance

πcityname<name2name,city(R)⋊⋉ρname→name2name,city(R))))

name position city prov

John Director New York unclassified Paul Janitor New York restricted

Dave Analyst Paris confidential

Ellen Field agent Berlin secret Magdalen Double agent Paris top secret Nancy HR director Paris restricted

Susan Analyst Berlin secret

city prov

(71)

Notes[Green et al., 2007]

• Computing provenance has aPTIMEdata complexity overhead

• Semiringhomomorphisms commutewith provenance

computation: if there is a homomorphism fromKtoK, then one can compute the provenance inK, apply the homomorphism, and obtain the same result as when computing provenance inK

• The integer polynomial semiring isuniversal: there is a unique homomorphism to any other commutative semiring that respects a given valuation of the variables

• This meansall computations can be performed in the universal semiring, and homomorphisms applied next

• Twoequivalent queriescan have twodifferent provenance annotationson the same database, in some semirings

(72)

Outline

Provenance definition Preliminaries

Boolean provenance

Relational algebra reminder Semiring provenance And beyond…

Representation Systems for Provenance Provenance for Trees

(73)

Semirings with monus[Amer, 1984, Geerts and Poggi, 2010]

• Some semirings can be equipped with a verifying:

a(b a) =b(a b)

(a b) c=a (b+c)

a a=0 a=0

• Boolean function semiring with∧¬, Why-semiring with\, counting semiring withtruncated difference…

• Most natural semirings (but not all semirings[Amarilli and Monet, 2016]!) can be extended intosemirings with monus

• Sometimes strange things happen[Amsterdamer et al., 2011]: e.g, doesnot always distributeover

• Allows supportingfull relational algebrawith the\operator, still PTIME

• Semantics for Boolean function semiringcoincideswith that of

(74)

Difference

Provenance annotations of diff-ed tuples are -ed Example

πcityends-with(position,“agent”)(R))cityposition=“Analyst”(R))

name position city classification prov John Director New York unclassified t1 Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4 Magdalen Double agent Paris top secret t5 Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

(75)

Provenance for aggregates

[Amsterdamer et al., 2011, Fink et al., 2012]

• Trickierto define provenance for queries with aggregation, even in the Boolean case

• One can construct aK-semimoduleK∗Mfor each monoid aggregateMover a provenance database with a semiring inK

• Datavaluesbecome elements of the semimodule Example (count(πnamecity=“Paris”(R)))

t31 +t51 +t61

(76)

Where-provenance

• Different form of provenance: captures from which database valuescome which outputvalues

• Bipartite graphof provenance: two attribute values are connected if one can be produced from the other

• Axiomatized in[Buneman et al., 2001, Cheney et al., 2009]

• Cannotbe captured by provenance semirings[Cheney et al., 2009], because of renaming (does not keep track of relation attributes), projection (does not remember which attribute values still exist),

(77)

Outline

Provenance definition

Representation Systems for Provenance

Provenance for Trees

Implementing Provenance Support Conclusion

(78)

Representation systems

• In the Boolean semiring, the counting semiring, the security semiring: provenance annotations areelementary

• In the Boolean function semiring, the universal semiring, etc., provenance annotations can become quitecomplex

• Needs forcompact representationof provenance annotations

• Lower theprovenance computation complexityas much as possible

(79)

Provenance formulas

• Quitestraightforward

• Formalism used in most of the provenance literature

• PTIMEdata complexity

• Expanding formulas (e.g., computing the monomials of aN[X] provenance annotation) can result in anexponential blowup Example

Is there a city with two different agents?

(t1⊗t2)(t3⊗t6)(t3⊗t5)(t4⊗t7)(t5⊗t6)

(80)

Provenance circuits

• Usearithmetic circuits(Boolean circuits for Boolean provenance) to represent provenance

• Every time an operation reuses a previously computed result, link to thepreviously created circuit gate

• Never largerthan provenance formulas

• Sometimesmore concise: provenance circuits can be...

Super-polynomially more concisethan monotone provenance formula for linear Datalog programs [Deutch et al., 2014]

Quadratically more concisethan provenance formulas for monadic second-order (MSO) formulas

(81)

Example provenance circuit

(82)

OBDD and d-DNNF

• Various subclasses ofBooleancircuits commonly used:

OBDD: Ordered Binary Decision Diagrams

d-DNNF: deterministic Decomposable Negation Normal Form

• OBDDscan be obtained inPTIMEdata complexity on bounded-treewidth databases[Amarilli et al., 2016]

• d-DNNFscan be obtained inlinear-timedata complexity on bounded-treewidth databases

(83)

Provenance cycluits[Amarilli et al., 2017b]

• Cycluit (cyclic circuit): arithmetic circuit withcycles

• Well-defined semantics on the Boolean semiring andsome semirings where infinite loops do not matter

• Allows computing provenance inlinear-time combined

complexityforrecursivequeries of a certain form (ICG-Datalog of bounded body size[Amarilli et al., 2017b], capturingα-acyclic conjunctive queries, 2RPQs, etc.), on bounded tree-width databases

• Related toprovenance equation systemsand formal series introduced in[Green et al., 2007]

(84)

Outline

Provenance definition

Representation Systems for Provenance Provenance for Trees

Implementing Provenance Support

(85)

Definition

• The previous definitions of provenance work onrelational data

• What if the data is atree, e.g., an XML document, or aword?

• Problem: we can now haverecursive queries, e.g., descendant queries

• A naive representation of provenance may no longer be polynomialin the database

(86)

Example: uncertain trees

1

5

6 7 2

4 3

keep(1) ordiscard(0) node labels Valuation: {2,3,77→1, ∗ 7→0}

A: “Is there both a pink and a blue node?”

(87)

Example: uncertain trees

1

5

6 7 2

4 3

Avaluationof a tree decides whether to keep(1) ordiscard(0) node labels

Valuation: {2,3,77→1, ∗ 7→0}

A: “Is there both a pink and a blue node?”

(88)

Example: uncertain trees

1

5

6 7 2

4 3

Avaluationof a tree decides whether to keep(1) ordiscard(0) node labels Valuation: {2,3,77→1, ∗ 7→0}

(89)

Example: uncertain trees

1

5

6 7 2

4 3

Avaluationof a tree decides whether to keep(1) ordiscard(0) node labels Valuation: {27→1, ∗ 7→0}

A: “Is there both a pink and a blue node?”

(90)

Example: uncertain trees

1

5

6 7 2

4 3

Avaluationof a tree decides whether to keep(1) ordiscard(0) node labels Valuation: {2,77→1, ∗ 7→0}

(91)

Example: uncertain trees

1

5

6 7 2

4 3

Avaluationof a tree decides whether to keep(1) ordiscard(0) node labels Valuation: {2,77→1, ∗ 7→0}

A: “Is there both a pink and a blue node?”

(92)

Example: uncertain trees

1

5

6 7 2

4 3

Avaluationof a tree decides whether to keep(1) ordiscard(0) node labels Valuation: {2,3,77→1, ∗ 7→0}

A: “Is there both a pink and a blue node?”

The tree automatonAaccepts

(93)

Example: uncertain trees

1

5

6 7 2

4 3

Avaluationof a tree decides whether to keep(1) ordiscard(0) node labels Valuation: {27→1, ∗ 7→0}

A: “Is there both a pink and a blue node?”

The tree automatonArejects

(94)

Example: uncertain trees

1

5

6 7 2

4 3

Avaluationof a tree decides whether to keep(1) ordiscard(0) node labels Valuation: {2,77→1, ∗ 7→0}

A: “Is there both a pink and a blue node?”

The tree automatonAaccepts

(95)

Example: Provenance circuit 1

5

6 7 2

4 3

Query: Is there both a pink and a blue node?

Provenance circuit:

7

2 3

Formally:

• Tree automatonA, uncertain treeT, circuitC

• Variable gatesofC: nodes ofT

• Condition: Letν be a valuation ofT, thenν(C)iffAacceptsν(T)

(96)

Example: Provenance circuit 1

5

6 7 2

4 3

Query: Is there both a pink and a blue node?

Provenance circuit:

7

2 3

• Tree automatonA, uncertain treeT, circuitC

• Variable gatesofC: nodes ofT

• Condition: Letν be a valuation ofT, thenν(C)iffAacceptsν(T)

(97)

Example: Provenance circuit 1

5

6 7 2

4 3

Query: Is there both a pink and a blue node?

Provenance circuit:

7

2 3

Formally:

• Tree automatonA, uncertain treeT, circuitC

• Variable gatesofC: nodes ofT

• Condition: Letν be a valuation ofT, thenν(C)iffAacceptsν(T)

(98)

Building provenance circuits on trees[Amarilli et al., 2015b]

Theorem

For any bottom-uptree automatonAand inputtreeT, we can build aprovenance circuitofAonTinO(|A| × |T|)

• Alphabet:

• Automaton: “Is there both a pink and a blue node?”

States: {⊥,B,P,>}

• Final: {>}

>

P

P

P

n

n

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

B P >

¬

(99)

Building provenance circuits on trees[Amarilli et al., 2015b]

Theorem

For any bottom-uptree automatonAand inputtreeT, we can build aprovenance circuitofAonTinO(|A| × |T|)

• Alphabet:

• Automaton: “Is there both a pink and a blue node?”

• States:

{⊥,B,P,>}

• Final: {>}

• Transitions:

>

P

P

P

n

n

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

B P >

¬

(100)

Building provenance circuits on trees[Amarilli et al., 2015b]

Theorem

For any bottom-uptree automatonAand inputtreeT, we can build aprovenance circuitofAonTinO(|A| × |T|)

• Alphabet:

• Automaton: “Is there both a pink and a blue node?”

• States:

{⊥,B,P,>}

• Final: {>}

• Transitions:

>

P

P

P

n

n

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

¬

(101)

Building provenance circuits on trees[Amarilli et al., 2015b]

Theorem

For any bottom-uptree automatonAand inputtreeT, we can build aprovenance circuitofAonTinO(|A| × |T|)

• Alphabet:

• Automaton: “Is there both a pink and a blue node?”

• States:

{⊥,B,P,>}

• Final: {>}

• Transitions:

>

P

P

P

n

n

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

B P >

¬

(102)

Building provenance circuits on trees[Amarilli et al., 2015b]

Theorem

For any bottom-uptree automatonAand inputtreeT, we can build aprovenance circuitofAonTinO(|A| × |T|)

• Alphabet:

• Automaton: “Is there both a pink and a blue node?”

• States:

{⊥,B,P,>}

• Final: {>}

• Transitions:

>

P

P

P

n

n

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

B P >

¬

(103)

Building provenance circuits on trees[Amarilli et al., 2015b]

Theorem

For any bottom-uptree automatonAand inputtreeT, we can build aprovenance circuitofAonTinO(|A| × |T|)

• Alphabet:

• Automaton: “Is there both a pink and a blue node?”

• States:

{⊥,B,P,>}

• Final: {>}

• Transitions:

>

P

P

P

n

n

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

B P >

¬

(104)

Building provenance circuits on trees[Amarilli et al., 2015b]

Theorem

For any bottom-uptree automatonAand inputtreeT, we can build aprovenance circuitofAonTinO(|A| × |T|)

• Alphabet:

• Automaton: “Is there both a pink and a blue node?”

• States:

{⊥,B,P,>}

• Final: {>}

• Transitions:

>

P

P

P

n

n

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

B P >

∨ ∨ ∨ ∨

B P >

¬

(105)

Building provenance circuits on trees[Amarilli et al., 2015b]

Theorem

For any bottom-uptree automatonAand inputtreeT, we can build aprovenance circuitofAonTinO(|A| × |T|)

• Alphabet:

• Automaton: “Is there both a pink and a blue node?”

• States:

{⊥,B,P,>}

• Final: {>}

• Transitions:

>

P

P

P

n

n

∨ ∨ ∨ ∨

B P >

¬

(106)

d-DNNFs

Lemma

If the tree automaton isunambiguousthen the circuit is ad-DNNF The circuit is ad-DNNF...

... so probability computation iseasy!

¬ gates only have variablesas inputs

g

P(g) := 1−P(g)

gates always have mutually exclusiveinputs

g g1 g2

P(g) :=P(g1) +P(g2)

gates are all on independentinputs

g g1 g2

P(g) :=P(g1)×P(g2)

(107)

d-DNNFs

Lemma

If the tree automaton isunambiguousthen the circuit is ad-DNNF The circuit is ad-DNNF...

... so probability computation iseasy!

¬ gates only have variablesas inputs

¬ g g

P(g) := 1−P(g)

gates always have mutually exclusiveinputs

g g1 g2

P(g) :=P(g1) +P(g2)

gates are all on independentinputs

g g1 g2

P(g) :=P(g1)×P(g2)

(108)

d-DNNFs

Lemma

If the tree automaton isunambiguousthen the circuit is ad-DNNF The circuit is ad-DNNF...

... so probability computation iseasy!

¬ gates only have variablesas inputs

¬ g g

P(g) := 1−P(g)

gates always have mutually exclusiveinputs

g g1 g2

P(g) :=P(g1) +P(g2)

gates are all on independentinputs

g1 g2

P(g) :=P(g1)×P(g2)

(109)

d-DNNFs

Lemma

If the tree automaton isunambiguousthen the circuit is ad-DNNF The circuit is ad-DNNF...

... so probability computation iseasy!

¬ gates only have variablesas inputs

¬ g g

P(g) := 1−P(g)

gates always have mutually exclusiveinputs

g g1 g2

P(g) :=P(g1) +P(g2)

gates are all on independentinputs

g g1 g2

P(g) :=P(g1)×P(g2)

Références

Documents relatifs

SVG+JS Javascript manipulation of vector graphics WebGL Javascript API for accelerated 3D using a GPU. WebVR Describe scenes for virtual reality headsets with a-scene WebRTC Voice

JSP Integrating Java and a Web server (e.g., Apache Tomcat) node.js Chrome’s JavaScript engine (V8) plus a Web server Python Web frameworks: Django , CherryPy, Flask. Ruby

DOM (Document Object Model) gives an API for the tree representation of HTML and XML documents (independent from the programming language). Schema languages Specify the structure

• YAML: extends JSON with several features, e.g., explicit types, user-defined types, anchors and references. • Some JSON parsers are permissive and allow,

• While inside an element we can check that the word of its children satisfies the deterministic regular expression used to define it. • Can be implemented with a

descendant::C[@att1='1'] is a step which denotes all the Element nodes named C, descendant of the context node, having an Attribute node att1 with value

• Information extraction: creating structured information out of existing Web Data

• Recall , the percentage of correct matches that are extracted. → Extract everything gives 100% recall (and very