Inverting Schema Mappings (presenting works by Fagin, Kolaitis, Popa, Tan)

Dmitri Akatov¹, Pierre Senellart²,³

Oxford Computing Laboratory

Outline

1 Data Exchange
    Schema Mappings
    Dependencies
    Solutions of Schema Mappings
2 Composing Schema Mappings
3 Inverses
4 Quasi-Inverses
5 Conclusion

Data Exchange: Schema Mappings

Motivation

Data exchange deals with the problem of transforming a database (the source instance) into another database (the target instance) under a possibly different schema, while adhering to a set of conditions (the dependencies) between the source and target instances.

The target instance should reflect the information given in the source instance as accurately as possible.

The target instance should be as general as possible (not introducing any more information than the source contained).

The "oldest" problem in database theory, and still open to a lot of research.

This talk will concentrate on relational databases.

Definitions and Example

Definition

A schema mapping is a triple

M = (S, T, Σ)

where S is the source schema, T is the target schema (with disjoint sets of relation symbols), and Σ is a set of formulae in some logical formalism over ⟨S, T⟩ defining the schema mapping M.

Example

Let S = ⟨EDL⟩, T = ⟨ED, DL⟩. Let

Σ = {∀x,y,z (EDL(x,y,z) → ED(x,y) ∧ DL(y,z))}.

Data Exchange: Dependencies

Source-Target Conditions vs Target Conditions

We are interested in Σ being a finite set of tuple-generating dependencies (tgds) and equality-generating dependencies (egds), in particular:

Source-to-target tgds (s-t tgds): FO formulae of the form

∀x (φ(x) → ∃y ψ(x,y))

where φ is a conjunction of atoms over S and ψ is a conjunction of atoms over T.

Target tgds: FO formulae of the same form where both φ and ψ are conjunctions of atoms over T.

Target egds: FO formulae of the form ∀x (φ(x) → x_i = x_j), where φ is a conjunction of atoms over T and x_i, x_j are among the variables in x.

Full tgds: s-t tgds without ∃.

Notation: universal quantifiers are usually omitted. Existential quantifiers should always be written out!

Global-as-View and Local-as-View Contexts

There are special forms of tgds that come up particularly in Data Integration:

Local-as-View (LAV) mappings are schema mappings in which Σ is a set of s-t tgds in each of which φ (the left-hand side of the tgd) is a single atom.

Global-as-View (GAV) mappings are schema mappings in which Σ is a set of s-t tgds in each of which ψ (the right-hand side of the tgd) is a single atom.

LAV mappings have some nice invertibility properties.

Data Exchange: Solutions of Schema Mappings

Definition

Given a schema mapping M = (S, T, Σ) and a source instance I of S, we define a solution for I to be an instance J of T such that ⟨I, J⟩ ⊨ Σ. We define Sol(I) to be the set of all solutions for I.

We are interested in the most general solution:

Definition

A universal solution J for I is a solution such that for any other solution J′ there exists a homomorphism h: J → J′.

Theorem

If Σ is a finite set of s-t tgds, then there is always a universal solution for any source instance.
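The two definitions can be made concrete with a minimal Python sketch, under some assumptions that are not part of the talk: facts are pairs (relation, tuple of string values), labelled nulls are strings starting with "_N", and a tgd is a pair (body, head) of atom lists. The sketch checks whether a target instance is a solution and searches for a homomorphism between two instances. It only tests one pair of solutions, so it illustrates the universality condition rather than deciding it.

```python
def is_null(value):
    """Assumed convention: labelled nulls are strings starting with '_N'."""
    return value.startswith("_N")

def matches(atoms, facts, binding=None):
    """Yield every variable assignment that maps all atoms into facts."""
    binding = dict(binding or {})
    if not atoms:
        yield binding
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact_rel, fact_args in facts:
        if fact_rel != rel or len(fact_args) != len(args):
            continue
        extended = dict(binding)
        if all(extended.setdefault(a, v) == v for a, v in zip(args, fact_args)):
            yield from matches(rest, facts, extended)

def is_solution(source, target, tgds):
    """Check that the pair (source, target) satisfies every s-t tgd."""
    for body, head in tgds:
        for b in matches(body, source):
            # Head variables not bound by b play the role of existentials.
            if next(matches(head, target, b), None) is None:
                return False
    return True

def homomorphism(j1, j2):
    """Return a map on the values of j1 that fixes constants and sends
    every fact of j1 to a fact of j2, or None if there is none."""
    fixed = {v: v for _, args in j1 for v in args if not is_null(v)}
    return next(matches(sorted(j1), j2, fixed), None)

# Takes(n,c) -> ∃s Student(n,s), one of the tgds used later in the talk.
sigma = [([("Takes", ("n", "c"))], [("Student", ("n", "s"))])]
source = {("Takes", ("alice", "db"))}
j_null = {("Student", ("alice", "_N1"))}    # uses a labelled null
j_const = {("Student", ("alice", "s42"))}   # commits to a constant

print(is_solution(source, j_null, sigma), is_solution(source, j_const, sigma))
print(homomorphism(j_null, j_const) is not None)  # the null can map to s42
print(homomorphism(j_const, j_null) is not None)  # the constant cannot move
```

Both instances are solutions, but only the one with the labelled null maps homomorphically into the other, which is the sense in which it is more general.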

Computing Universal Solutions (Outline)

To compute a universal solution we chase ⟨I, ∅⟩ with Σ. Writing K for the current instance (initially ⟨I, ∅⟩), in each chase step we do the following:

Select a dependency in Σ (a tgd or an egd).

Find a homomorphism from φ(x) to K.

For a tgd: extend the homomorphism by defining a fresh labelled null for each variable in y, and add the image of ψ to K.

For an egd: assume x_i, x_j have distinct images in K. If both are constants, then the chase ends in failure. Otherwise we either replace a labelled null by a constant or one labelled null by the other.

Repeat until no dependency in Σ can be applied.

Theorem

For tgds and egds, if ⟨I, J⟩ is the result of a successful finite chase, then J is a universal solution. If there exists a failing finite chase, then there is no solution.
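The chase steps above can be sketched in Python as follows. This is a naive illustration under assumed conventions (facts as (relation, tuple of strings), nulls as strings starting with "_N", tgds as (body, head), egds as (body, (x_i, x_j))), and it applies a tgd only when its conclusion is not already satisfied. The `max_steps` guard and the key constraint on Student are additions for the example, not part of the talk.

```python
import itertools

def is_null(value):
    return value.startswith("_N")  # assumed naming convention for nulls

def matches(atoms, facts, binding=None):
    """Yield every variable assignment that maps all atoms into facts."""
    binding = dict(binding or {})
    if not atoms:
        yield binding
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact_rel, fact_args in facts:
        if fact_rel != rel or len(fact_args) != len(args):
            continue
        extended = dict(binding)
        if all(extended.setdefault(a, v) == v for a, v in zip(args, fact_args)):
            yield from matches(rest, facts, extended)

def chase_step(k, tgds, egds, fresh):
    """Apply one step to k in place; return 'tgd', 'egd', 'fail' or None."""
    for body, head in tgds:
        for b in list(matches(body, k)):
            if next(matches(head, k, b), None) is None:
                for _, args in head:
                    for a in args:
                        if a not in b:            # existential variable
                            b[a] = f"_N{next(fresh)}"
                k.update((rel, tuple(b[a] for a in args)) for rel, args in head)
                return "tgd"
    for body, (xi, xj) in egds:
        for b in list(matches(body, k)):
            u, v = b[xi], b[xj]
            if u == v:
                continue
            if not is_null(u) and not is_null(v):
                return "fail"                     # two distinct constants
            old, new = (u, v) if is_null(u) else (v, u)
            renamed = {(rel, tuple(new if a == old else a for a in args))
                       for rel, args in k}
            k.clear()
            k.update(renamed)
            return "egd"
    return None

def chase(instance, tgds, egds=(), max_steps=1000):
    """Return the chased instance, or None if the chase fails."""
    k, fresh = set(instance), itertools.count(1)
    for _ in range(max_steps):
        outcome = chase_step(k, tgds, egds, fresh)
        if outcome is None:
            return k
        if outcome == "fail":
            return None
    raise RuntimeError("no result within max_steps")

sigma12 = [([("Takes", ("n", "c"))], [("Takes'", ("n", "c"))]),
           ([("Takes", ("n", "c"))], [("Student", ("n", "s"))])]
# Hypothetical egd: Student(n,s) ∧ Student(n,t) -> s = t.
student_key = [([("Student", ("n", "s")), ("Student", ("n", "t"))], ("s", "t"))]
source = {("Takes", ("alice", "db")), ("Takes", ("alice", "ai"))}
target = {f for f in chase(source, sigma12, student_key) if f[0] != "Takes"}
print(sorted(target))
```

On this source the chase creates a single labelled null for alice, because the second Takes fact finds the Student tgd already satisfied.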

Canonical Universal Solutions

Definition

We call such a J (if it exists) a canonical universal solution. If it is unique (up to isomorphism), we call it the canonical universal solution.

However...

Example

Canonical solutions are not unique in general: Let S = ⟨P, Q⟩, T = ⟨R⟩. Let Σ = {P(x) → R(x), Q(x) → ∃Y R(Y)}. Let I = {P(a), Q(a)}. The result of a chase is either {R(a)} or {R(Y), R(a)}.

Canonical solutions may also fail to exist, in case there is no finite chase (cyclic dependencies).
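The dependence on the order of chase steps in this example can be reproduced with a short Python sketch. It assumes the same illustrative encoding as elsewhere (facts as (relation, tuple of strings), nulls named "_N1", "_N2", ...), and a chase that always tries the tgds in the order they are listed and fires a tgd only when its conclusion is not yet satisfied.

```python
import itertools

def matches(atoms, facts, binding=None):
    """Yield every variable assignment that maps all atoms into facts."""
    binding = dict(binding or {})
    if not atoms:
        yield binding
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact_rel, fact_args in facts:
        if fact_rel != rel or len(fact_args) != len(args):
            continue
        extended = dict(binding)
        if all(extended.setdefault(a, v) == v for a, v in zip(args, fact_args)):
            yield from matches(rest, facts, extended)

def chase_tgds(instance, tgds, max_steps=100):
    """Chase with tgds only, trying the tgds in list order at every step."""
    k, fresh = set(instance), itertools.count(1)
    for _ in range(max_steps):
        for body, head in tgds:
            violated = [b for b in matches(body, k)
                        if next(matches(head, k, b), None) is None]
            if violated:
                b = violated[0]
                for _, args in head:
                    for a in args:
                        if a not in b:            # existential variable
                            b[a] = f"_N{next(fresh)}"
                k.update((rel, tuple(b[a] for a in args)) for rel, args in head)
                break
        else:
            return k                              # no tgd is violated
    raise RuntimeError("no result within max_steps")

p_rule = ([("P", ("x",))], [("R", ("x",))])       # P(x) -> R(x)
q_rule = ([("Q", ("x",))], [("R", ("y",))])       # Q(x) -> ∃Y R(Y)
source = {("P", ("a",)), ("Q", ("a",))}

p_first = {f for f in chase_tgds(source, [p_rule, q_rule]) if f[0] == "R"}
q_first = {f for f in chase_tgds(source, [q_rule, p_rule]) if f[0] == "R"}
print(sorted(p_first))  # [('R', ('a',))]
print(sorted(q_first))  # [('R', ('_N1',)), ('R', ('a',))]
```

Firing the P rule first makes the Q rule already satisfied by R(a); firing the Q rule first introduces a null that the later R(a) does not remove.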

Weakly acyclic tgds and polynomial-length chase

We want to know when computing canonical universal solutions is feasible.

Using only full tgds ensures that the chase is always finite and always has the same result. However, full tgds are too restricted in practice. We define special sets of tgds called weakly acyclic sets of tgds, which strictly include sets of full tgds as well as acyclic sets of inclusion dependencies.

Weakly acyclic sets of tgds are sets of tgds such that there is no cycle in the dependency graph going through a special edge (an edge representing an existentially quantified variable).

Theorem

Let Σ be the union of a weakly acyclic set of tgds and a set of egds. Then there exists a polynomial in the size of K that bounds the length of every chase of K with Σ.
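The slide only summarises the dependency graph, so the following Python sketch fills in the usual construction as an assumption: nodes are positions (relation, argument index); a universally quantified variable that reaches the head contributes a regular edge to each head position where it reappears and a special edge to each head position holding an existential variable. The tgd encoding and function names are illustrative.

```python
def dependency_graph(tgds):
    """Return (regular_edges, special_edges) between positions (rel, index)."""
    regular, special = set(), set()
    for body, head in tgds:
        body_vars = {a for _, args in body for a in args}
        head_vars = {a for _, args in head for a in args}
        for rel, args in body:
            for i, x in enumerate(args):
                if x not in head_vars:
                    continue                      # x is not propagated
                for head_rel, head_args in head:
                    for j, y in enumerate(head_args):
                        if y == x:
                            regular.add(((rel, i), (head_rel, j)))
                        elif y not in body_vars:  # y is existential
                            special.add(((rel, i), (head_rel, j)))
    return regular, special

def is_weakly_acyclic(tgds):
    """True if no cycle of the dependency graph uses a special edge."""
    regular, special = dependency_graph(tgds)
    successors = {}
    for u, v in regular | special:
        successors.setdefault(u, set()).add(v)

    def reachable(start, goal):
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(successors.get(node, ()))
        return False

    # A special edge u -> v lies on a cycle exactly when u is reachable from v.
    return not any(reachable(v, u) for u, v in special)

full = [([("EDL", ("x", "y", "z"))], [("ED", ("x", "y")), ("DL", ("y", "z"))])]
fresh_successor = [([("E", ("x", "y"))], [("E", ("y", "z"))])]  # E(x,y) -> ∃z E(y,z)
two_relations = [([("E", ("x", "y"))], [("F", ("y", "z"))])]    # E(x,y) -> ∃z F(y,z)

print(is_weakly_acyclic(full))             # True
print(is_weakly_acyclic(fresh_successor))  # False
print(is_weakly_acyclic(two_relations))    # True
```

The second set has a special self-loop on the second position of E, matching the intuition that each chase step can create a null that triggers another step.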

Composing Schema Mappings

Outline

1 Data Exchange
2 Composing Schema Mappings
    The Composition Operator
    Language for Expressing Composition
    Second-Order tgds and Data Exchange
3 Inverses
4 Quasi-Inverses
5 Conclusion

Composing Schema Mappings: The Composition Operator

Motivation

Natural operators on schema mappings:

Composition
Inverse
...

⇒ High-level management of schema mappings. Need for well-defined semantics.

In practice, useful for describing and maintaining successive evolutions of a schema.

[Figure: schema evolution diagram over the schemas S₁, T₁, S′₁, T′₁ with the mappings M₁, M₂, M′₁, the inverse M′₁⁻¹, and the composition M′₁⁻¹ ∘ M₁ ∘ M₂.]

Definition

Unambiguous and query-independent semantics of schema mapping composition.

Based on natural composition of binary relations.

Definition

⟨I, K⟩ is an instance of M₁ ∘ M₂ ⇐⇒ there exists J such that:

⟨I, J⟩ is an instance of M₁
⟨J, K⟩ is an instance of M₂.

Example

M₁₂ = (S₁, S₂, Σ₁₂), M₂₃ = (S₂, S₃, Σ₂₃).

S₁ = {Takes(·,·)}
S₂ = {Takes′(·,·), Student(·,·)}
S₃ = {Enrollment(·,·)}

Σ₁₂ = {∀n∀c (Takes(n,c) → Takes′(n,c)),
       ∀n∀c (Takes(n,c) → ∃s Student(n,s))}

Σ₂₃ = {∀n∀s∀c (Student(n,s) ∧ Takes′(n,c) → Enrollment(s,c))}

M₁₂ ∘ M₂₃ = (S₁, S₃, Σ) with:

Σ = {∀n∃s∀c (Takes(n,c) → Enrollment(s,c))}

Composing Schema Mappings: Language for Expressing Composition

A Positive Result. . .

Proposition

If M₁ is a mapping expressed as a finite set of full source-to-target tgds and M₂ is a mapping expressed as a finite set of source-to-target tgds, then M₁ ∘ M₂ can be expressed as a mapping with source-to-target tgds.

Example

(A(x,y) → B(x)) ∘ (B(x) → ∃z C(x,z)) = (A(x,y) → ∃z C(x,z))

Remark

If the tgds of M₂ are full, so will be the tgds of M₁ ∘ M₂.
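As a sanity check of this example on one source instance (not a proof of the proposition), the Python sketch below chases through the intermediate schema in two steps and compares the outcome with a direct chase using the composed tgd. The encoding, the helper names, and the comparison by homomorphisms in both directions are assumptions made for the illustration.

```python
import itertools

def is_null(value):
    return value.startswith("_N")  # assumed naming convention for nulls

def matches(atoms, facts, binding=None):
    """Yield every variable assignment that maps all atoms into facts."""
    binding = dict(binding or {})
    if not atoms:
        yield binding
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact_rel, fact_args in facts:
        if fact_rel != rel or len(fact_args) != len(args):
            continue
        extended = dict(binding)
        if all(extended.setdefault(a, v) == v for a, v in zip(args, fact_args)):
            yield from matches(rest, facts, extended)

def chase_tgds(instance, tgds, max_steps=100):
    """Chase with tgds only; a tgd fires only if it is not yet satisfied."""
    k, fresh = set(instance), itertools.count(1)
    for _ in range(max_steps):
        for body, head in tgds:
            violated = [b for b in matches(body, k)
                        if next(matches(head, k, b), None) is None]
            if violated:
                b = violated[0]
                for _, args in head:
                    for a in args:
                        if a not in b:            # existential variable
                            b[a] = f"_N{next(fresh)}"
                k.update((rel, tuple(b[a] for a in args)) for rel, args in head)
                break
        else:
            return k
    raise RuntimeError("no result within max_steps")

def homomorphism(j1, j2):
    """Constant-preserving map sending every fact of j1 into j2, or None."""
    fixed = {v: v for _, args in j1 for v in args if not is_null(v)}
    return next(matches(sorted(j1), j2, fixed), None)

m1 = [([("A", ("x", "y"))], [("B", ("x",))])]             # A(x,y) -> B(x)
m2 = [([("B", ("x",))], [("C", ("x", "z"))])]             # B(x) -> ∃z C(x,z)
composed = [([("A", ("x", "y"))], [("C", ("x", "z"))])]   # A(x,y) -> ∃z C(x,z)

source = {("A", ("1", "2")), ("A", ("1", "3")), ("A", ("4", "5"))}
middle = {f for f in chase_tgds(source, m1) if f[0] == "B"}
k_two_step = {f for f in chase_tgds(middle, m2) if f[0] == "C"}
k_direct = {f for f in chase_tgds(source, composed) if f[0] == "C"}

print(homomorphism(k_two_step, k_direct) is not None)  # True
print(homomorphism(k_direct, k_two_step) is not None)  # True
```

Both routes produce one C fact per distinct first component of A, each with its own labelled null, so the two results map into each other.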

. . . and a Negative Result

Proposition

There exist two mappings M₁ and M₂, each defined with a finite set of source-to-target tgds, such that M₁ ∘ M₂ cannot be expressed in first-order logic (even with least fixed point).

Proof.

Composition query problem: given I and J, is ⟨I, J⟩ an instance of the composed schema mapping?

Reduction from 3-colorability

⇒ NP-complete problem.

⇒ Q.E.D. (descriptive complexity theory)

Remark

Fagin's theorem implies that this can be defined as an existential second-order formula. Is there a more precise characterization of the language?

A Natural Language for Composition

Definition

A second-order tgd is of the form:

∃f₁ ... f_m ((∀x₁ (φ₁ → ψ₁)) ∧ ... ∧ (∀x_n (φ_n → ψ_n)))

where

the f_i are function symbols,
the φ_i are conjunctions of source relation atoms and equalities,
the ψ_i are conjunctions of target relation atoms

(additional safety conditions omitted).

Example

∀n∃s∀c (Takes(n,c) → Enrollment(s,c))

≡

∃f (∀n∀c (Takes(n,c) → Enrollment(f(n),c)))
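To see what the function symbol buys, the Python sketch below checks the second-order tgd of this example on small instances by searching for a suitable value f(n) for each student name n, and contrasts it with the weaker first-order tgd Takes(n,c) → ∃s Enrollment(s,c). Relations are encoded as sets of pairs; the function names and the sample data are assumptions for the illustration.

```python
def satisfies_so_tgd(takes, enrollment):
    """∃f ∀n∀c (Takes(n,c) -> Enrollment(f(n),c)): each student name needs
    one identifier that is enrolled in all of that student's courses."""
    ids = {s for s, _ in enrollment}
    for name in {n for n, _ in takes}:
        courses = {c for n, c in takes if n == name}
        if not any(all((s, c) in enrollment for c in courses) for s in ids):
            return False
    return True

def satisfies_naive_tgd(takes, enrollment):
    """Takes(n,c) -> ∃s Enrollment(s,c): the identifier may differ per course."""
    enrolled_courses = {c for _, c in enrollment}
    return all(c in enrolled_courses for _, c in takes)

takes = {("alice", "db"), ("alice", "ai"), ("bob", "db")}
one_id_per_student = {("s1", "db"), ("s1", "ai"), ("s2", "db")}
one_id_per_course = {("s1", "db"), ("s2", "ai")}

print(satisfies_so_tgd(takes, one_id_per_student))    # True
print(satisfies_so_tgd(takes, one_id_per_course))     # False
print(satisfies_naive_tgd(takes, one_id_per_course))  # True
```

The second target instance satisfies the first-order tgd but not the second-order one, because no single identifier covers both of alice's courses.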

Composition Theorem

Theorem

Second-order tgds are closed under composition.

Constructive proof.

All features of the language (disjunction, equalities, second-order quantifiers) are required.

Remark

Any finite set of s-t tgds can be represented as a single second-order tgd.

Composing Schema Mappings: Second-Order tgds and Data Exchange

Properties of the Composition Operator

Two important properties for using second-order tgds in data exchange:

Polynomial chase of schema mappings defined by a second-order tgd.

PTIME computation of certain answers to unions of conjunctive queries.

Remark

Contrast with the fact that computing certain answers with arbitrary first-order mappings is undecidable.

Second-order tgds:

Powerful enough to include normal tgds and to be closed under composition.

Restricted enough to get a PTIME algorithm for answering queries.

Inverses

Outline

1 Data Exchange
2 Composing Schema Mappings
3 Inverses
    Motivation and Definition
    Conditions
    Computing Inverses
4 Quasi-Inverses
5 Conclusion

Inverses: Motivation and Definition

Semantics of Inverses

Example

Let M₁₂ be EDL(x,y,z) → ED(x,y) ∧ DL(y,z). Let M₂₁ be ED(x,y) ∧ DL(y,z) → EDL(x,y,z).

Under what conditions do we obtain the original EDL relation? Define Γ as EDL(x,y,z′) ∧ EDL(x′,y,z) → EDL(x,y,z). We want M₂₁ to be the inverse of M₁₂ for precisely those instances I which satisfy Γ.
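The role of Γ can be checked on small instances with the following Python sketch. It takes the smallest instance produced by each mapping (the facts the tgds force, nothing more), which is an assumption about how the round trip is carried out; the relation encoding as sets of tuples and the sample data are also illustrative.

```python
def decompose(edl):
    """Facts forced by EDL(x,y,z) -> ED(x,y) ∧ DL(y,z)."""
    ed = {(x, y) for x, y, _ in edl}
    dl = {(y, z) for _, y, z in edl}
    return ed, dl

def recombine(ed, dl):
    """Facts forced by ED(x,y) ∧ DL(y,z) -> EDL(x,y,z)."""
    return {(x, y, z) for x, y in ed for y2, z in dl if y == y2}

def satisfies_gamma(edl):
    """Γ: EDL(x,y,z') ∧ EDL(x',y,z) -> EDL(x,y,z)."""
    return all((x, y, z) in edl
               for x, y, _z in edl
               for _x, y2, z in edl
               if y == y2)

def round_trip(edl):
    return recombine(*decompose(edl))

one_location = {("alice", "sales", "paris"), ("bob", "sales", "paris")}
two_locations = {("alice", "sales", "paris"), ("bob", "sales", "rome")}

print(satisfies_gamma(one_location), round_trip(one_location) == one_location)
print(satisfies_gamma(two_locations), round_trip(two_locations) == two_locations)
```

The first instance satisfies Γ and is recovered exactly. The second violates Γ, and the round trip adds the tuples (alice, sales, rome) and (bob, sales, paris).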

Obvious Solution fails

Let M₁₂ be defined by Σ₁₂.

Let S₁₂ = {⟨I, J⟩ : ⟨I, J⟩ ⊨ Σ₁₂}. Let S₂₁ = {⟨J, I⟩ : ⟨I, J⟩ ∈ S₁₂}.

Can we define the inverse as the mapping associated with the set S₂₁? No: if ⟨I, J⟩ ⊨ Σ₁₂, then for all I′ ⊆ I and J′ ⊇ J, we also have ⟨I′, J′⟩ ⊨ Σ₁₂. The analogous closure property will not hold for the pairs in S₂₁.
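The closure argument can be tested by brute force on a toy case. The Python sketch below assumes a deliberately tiny setting that is not in the talk: the single tgd P(x) → Q(x) over the two-element domain {a, b}, with unary instances encoded as frozensets of values, so that ⟨I, J⟩ satisfies the tgd exactly when I ⊆ J.

```python
from itertools import combinations

DOMAIN = ("a", "b")
# All instances of a unary relation over DOMAIN.
INSTANCES = [frozenset(c)
             for r in range(len(DOMAIN) + 1)
             for c in combinations(DOMAIN, r)]

def in_s12(i, j):
    """⟨I, J⟩ satisfies P(x) -> Q(x): every P-value of I is a Q-value of J."""
    return i <= j

def in_s21(j, i):
    """The swapped relation: ⟨J, I⟩ is in S21 iff ⟨I, J⟩ is in S12."""
    return in_s12(i, j)

def closed_like_tgd_mapping(member):
    """Does member(src, tgt) imply member(src2, tgt2) whenever
    src2 ⊆ src and tgt2 ⊇ tgt?"""
    return all(member(src2, tgt2)
               for src in INSTANCES for tgt in INSTANCES if member(src, tgt)
               for src2 in INSTANCES if src2 <= src
               for tgt2 in INSTANCES if tgt2 >= tgt)

print(closed_like_tgd_mapping(in_s12))  # True
print(closed_like_tgd_mapping(in_s21))  # False
```

A counterexample for S₂₁ is the pair ⟨{a}, {a}⟩: shrinking its first component to the empty instance gives ⟨∅, {a}⟩, which is not in S₂₁ because ⟨{a}, ∅⟩ does not satisfy P(x) → Q(x).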

Definition

We say two schema mappings are equivalent on I if they have the same solutions for I.

Definition

Let M_Id = (S₁, Ŝ₁, Σ_Id) be the identity mapping. Let M₁₂ = (S₁, S₂, Σ₁₂) and M₂₁ = (S₂, Ŝ₁, Σ₂₁) be schema mappings. Let σ be the composition formula Σ₁₂ ∘ Σ₂₁ and let M₁₁ = (S₁, Ŝ₁, σ). Let I be an instance of S₁. Then M₂₁ is an inverse of M₁₂ for I if M₁₁ and M_Id are equivalent on I.

Local and Global Inverses

We call such an inverse a local inverse.

If S is a class of instances such that M₂₁ is an inverse of M₁₂ for each I in S, then we call it an S-inverse.

If S is the class of all instances, we call it a global inverse.

Inverses: Conditions

Unique Solutions Property

We would like to know when global inverses exist.

If M₂₁ is an inverse of M₁₂, and I₁ and I₂ are distinct source instances, then the solutions of I₁ and I₂ are different under M₁₂.

Unique solutions property: a mapping M₁₂ has this property if, whenever I₁ and I₂ are distinct source instances, their solution sets are distinct.

This property is necessary for global inverses to exist.

For LAV schema mappings it is also sufficient.
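A failure of the unique solutions property can be exhibited concretely for the decomposition mapping EDL(x,y,z) → ED(x,y) ∧ DL(y,z). The Python sketch below relies on one observation stated here as an assumption of the example: for full s-t tgds without target constraints, a target instance is a solution exactly when it contains the facts the tgds force, so two sources have the same solution set exactly when they force the same facts. The encoding and sample data are illustrative.

```python
def forced_facts(edl):
    """Facts forced by EDL(x,y,z) -> ED(x,y) ∧ DL(y,z), as a pair (ED, DL)."""
    ed = frozenset((x, y) for x, y, _ in edl)
    dl = frozenset((y, z) for _, y, z in edl)
    return ed, dl

def same_solution_set(source1, source2):
    """For this full-tgd mapping, Sol(I1) = Sol(I2) iff the forced facts agree."""
    return forced_facts(source1) == forced_facts(source2)

i1 = {("alice", "sales", "paris"), ("bob", "sales", "rome")}
i2 = i1 | {("alice", "sales", "rome"), ("bob", "sales", "paris")}

print(i1 != i2)                   # True: the source instances are distinct
print(same_solution_set(i1, i2))  # True: yet they have the same solutions
```

Since two distinct source instances share their solutions, the decomposition mapping does not have the unique solutions property, and by the necessity statement above it has no global inverse. This is consistent with wanting an inverse only on the instances that satisfy Γ.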
