
Typing semistructured data


Academic year: 2022


Master Informatique 10/9/2007 1

Typing semistructured data

Serge Abiteboul

2008


Organization

• Motivations

• Automata

– Automata on words

– Ranked tree automata

– Unranked tree automata

– Automata and monadic second-order logic

– Automata to compute

• XML typing: DTD, XML schema

Motivation


XML typing

• Not compulsory

• Simplify writing software for XML

– Improve interoperability between programs

• Improve storage and performance

• Ease querying: data guide

• Simplify data protection

– Reject illegal update – like relational dependencies


Improve storage

[Figure: a lower-bound schema with nodes Root, Company and Employee and edges company, person, name, address, c.e.o., works-for and managed-by. It yields two relations, Company(oid, name, address, c.e.o.) and Employee(oid, name, managed-by, works-for).]

• Lower-bound schema

• Store the rest in an overflow graph

Improve performance

[Figure: the schema of Bib. A Bib contains paper elements (year: int, journal: string, title: string) and book elements (address, author, title); an address has zip, city and street; an author has a last name and a first name; all leaves are strings unless stated otherwise.]

select X.title
from Bib._ X
where X.*.zip = "12345"

Using the schema, this query can be rewritten into

select X.title
from Bib.book X
where X.address.zip = "12345"

Type checking

• Who checks

– XML editor: check that the data conforms to its type

– XML exchange, e.g., with a Web service

• Server: when delivering the data

• Client/application: when receiving it

• Dynamic verification: after the data is produced

• Static verification: verification of the program that generates the data

Static verification

• Input: input type T and code of function f

– f is written in XQuery, XPath, XSLT, etc.

• Verification of T'

– Is it true that for all $d \models T$, $f(d) \models T'$?

• Type inference

– Find the smallest T' such that for all $d \models T$, $f(d) \models T'$

• Quickly becomes undecidable because of "joins"

Example

for $p in doc("parts.xml")//part[color="red"]
return <part>
         <name>{$p/name/text()}</name>
         <desc>{$p/desc/node()}</desc>
       </part>

Result type

(part (name (string), desc (any)))*

If the type of parts.xml//part/desc is string

(part (name (string), desc (string)))*

Difficulty

for $X in Input, $Y in Input do { print(<b/>) }

Input: <a/> <a/>

Result: <b/> <b/> <b/> <b/>

Problem: $\{ b^i \mid i = n^2 \text{ for } n \geq 0 \}$ cannot be described in XML schema

There is no « best » result: each of the following types is at least as precise as the previous one

– $b^*$

– $\varepsilon + b\,b^*$

– $\varepsilon + b + b^4 b^*$

– $\varepsilon + b + b^4 + b^9 b^*$

– …

Why tree automata?

• XML = unranked trees

• No theory for XML

• Rich theory for strings: Automata

• Extend to a rich theory for ranked trees: tree automata

– Nice algorithms

– Nice theorems

– Can this carry over to unranked trees and XML?

• Yes!

From strings to trees

[Figure: from left to right, a word (finite state automata), a binary tree (ranked tree automata), and an unranked tree with no bound on the number of children (unranked tree automata).]

Only unranked tree automata?

• Missing practical gadgets

• Complexity of verification

– Goal: typing at reasonable cost

• Unranked tree automata + …


Automata

Automata on words


Finite state automata on words

$A = (\Sigma, Q, q_0, F, \delta)$

• Alphabet: $\Sigma$

• States: $Q$

• Initial state: $q_0 \in Q$

• Accepting states: $F \subseteq Q$

• Transitions: $\delta : \Sigma \times Q \to \mathcal{P}(Q)$

Nondeterministic automaton: Example

[Figure: a nondeterministic automaton over the alphabet {a, b} with states $q_0, q_1, q_2, q_3$, and a run of this automaton on a word over a and b.]

Reminder

• Deterministic

– No $\varepsilon$-transition such as $\varepsilon, q \to q_0$

– No alternative transitions such as $a, q_0 \to \{q_0, q_1\}$

• Determinization

– It is possible to obtain an equivalent deterministic automaton

– State of the new automaton = set of states of the original one

– Possible exponential blow-up

• Minimization

• Limitations: cannot recognize

– $\{a^n b^n \mid n \in \mathbb{N}\}$

– Context-free languages

• Essential tool, e.g., for lexical analysis

Reminder (2)

• L(A) = set of words accepted by automaton A

• Regular languages

• Can be described by regular expressions, e.g. a(b+c)*d

• Closed under complement: $\Sigma^* - L(A)$

• Closed under union and intersection: $L(A) \cup L(B)$, $L(A) \cap L(B)$

– Product automaton with states (s,s')
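The product construction behind closure under intersection can be sketched in a few lines of Python. This is a minimal illustration, assuming complete deterministic automata encoded as tuples (states, delta, start, accepting); the encoding, the function names and the two sample automata A and B are assumptions made for this example, not part of the course material.

```python
from itertools import product

def run_dfa(delta, start, accepting, word):
    """Run a complete DFA whose transitions are a dict (state, letter) -> state."""
    state = start
    for letter in word:
        state = delta[(state, letter)]
    return state in accepting

def product_dfa(a, b, alphabet):
    """Build the product automaton, with states (s, s'), accepting L(A) ∩ L(B)."""
    states_a, delta_a, start_a, acc_a = a
    states_b, delta_b, start_b, acc_b = b
    states = set(product(states_a, states_b))
    delta = {
        ((s, t), x): (delta_a[(s, x)], delta_b[(t, x)])
        for (s, t) in states
        for x in alphabet
    }
    # For union, accept when s is accepting OR t is accepting instead.
    accepting = {(s, t) for (s, t) in states if s in acc_a and t in acc_b}
    return states, delta, (start_a, start_b), accepting

# Hypothetical sample automata: A accepts words with an even number of a's,
# B accepts words ending with b.
A = ({0, 1}, {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
B = ({"n", "y"},
     {("n", "a"): "n", ("y", "a"): "n", ("n", "b"): "y", ("y", "b"): "y"},
     "n", {"y"})
_, DELTA, START, ACCEPTING = product_dfa(A, B, "ab")

def accepts_both(word):
    return run_dfa(DELTA, START, ACCEPTING, word)
```

With these sample automata, `accepts_both("aab")` holds, while `accepts_both("ab")` and `accepts_both("aa")` do not.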

Automata on words versus trees

[Figure: a word can be read left to right or right to left, and this makes no difference; a tree can be processed bottom-up or top-down, and this makes a difference.]

Automata

Automata on ranked trees


Binary tree automata

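$A = (\Sigma, Q, F, \delta)$

• Parallel evaluation

• For leaves: $\delta : \Sigma \to \mathcal{P}(Q)$

• For other nodes: $\delta : \Sigma \times Q \times Q \to \mathcal{P}(Q)$

[Figure: bottom-up evaluation of a binary tree with nodes labeled a and b, each node annotated with the state it reaches.]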

Bottom-up tree automata

• Bottom-up: if a node labeled a has its children in states q, q' then the node moves nondeterministically to state r or r'

$a, q, q' \to \{r, r'\}$

• Accepts if the root is in some state in F

• Not deterministic if there are alternatives or $\varepsilon$-transitions:

$a, q, q' \to \{r, r'\}$

$\varepsilon, r \to r'$
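A bottom-up run can be sketched as a recursive function that returns the set of states reachable at each node. The sketch below assumes binary trees encoded as nested tuples and transitions stored in dictionaries; the encoding, the names `LEAF`, `NODE`, `FINAL` and the Boolean-circuit automaton used as sample data are assumptions made for illustration.

```python
def run_bottom_up(tree, leaf_delta, node_delta):
    """Set of states a nondeterministic bottom-up automaton can reach at the root.

    A tree is either a leaf label (a string) or a tuple (label, left, right).
    leaf_delta maps a label to a set of states;
    node_delta maps (label, q, q') to a set of states.
    """
    if isinstance(tree, str):
        return set(leaf_delta.get(tree, set()))
    label, left, right = tree
    left_states = run_bottom_up(left, leaf_delta, node_delta)
    right_states = run_bottom_up(right, leaf_delta, node_delta)
    reached = set()
    for q in left_states:
        for q2 in right_states:
            reached |= node_delta.get((label, q, q2), set())
    return reached

def accepts(tree, leaf_delta, node_delta, final):
    """Accept if the root can be in some state of F."""
    return bool(run_bottom_up(tree, leaf_delta, node_delta) & final)

# Hypothetical sample automaton: Boolean circuit evaluation with states q0, q1.
LEAF = {"0": {"q0"}, "1": {"q1"}}
NODE = {}
for x in (0, 1):
    for y in (0, 1):
        NODE[("and", f"q{x}", f"q{y}")] = {f"q{x & y}"}
        NODE[("or", f"q{x}", f"q{y}")] = {f"q{x | y}"}
FINAL = {"q1"}

circuit = ("or", ("and", "1", "0"), ("or", "0", "1"))
```

Here `accepts(circuit, LEAF, NODE, FINAL)` holds because the circuit evaluates to 1. Since every transition set is a singleton, this particular automaton is deterministic; the same code handles alternatives by putting several states in a set.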

Example: deterministic bottom-up

[Example: a deterministic bottom-up tree automaton over the labels $\wedge, \vee, 0, 1$ with states $Q = \{q_0, q_1\}$ and accepting states $F = \{q_1\}$; a leaf labeled 0 moves to $q_0$ and a leaf labeled 1 moves to $q_1$.]

Boolean circuit evaluation

[Figure: a Boolean circuit seen as a binary tree with 0/1 leaves; each node is annotated with the state $q_0$ or $q_1$ reached bottom-up.]

Regular tree language = set of trees accepted by a bottom-up tree automaton

Regular tree languages

The following are equivalent

– L is a regular tree language

– L is accepted by a nondeterministic bottom-up automaton

– L is accepted by a deterministic bottom-up automaton

– L is accepted by a nondeterministic top-down automaton

Top-down tree automata

• Top-down: if a node labeled a is in state q'', then its left child moves to state q and its right child to state q'

$a, q'' \to (q, q')$

• Accepts if all leaves are in states in F

• Not deterministic if

$a, q'' \to \{(q, q'), (r, r')\}$

Why is deterministic top-down weaker?

• Consider the language

– L = { f(a,b), f(b,a) }

• It can be accepted by a bottom-up TA

– Exercise: write a bottom-up tree automaton (BUTA) A such that L = L(A)

• Suppose that B is a deterministic top-down TA with L = L(B)

– Exercise: show that B also accepts f(a,a)

– A contradiction

Fact: no deterministic top-down tree automaton accepts L

Ranked tree automata: Properties

• As for words, but with higher complexity

• Determinization

• Minimization

• Closed under

– Complement

– Intersection

– Union


But…

• XML documents are unranked

• The kind of things we want to do:

book (intro,section*,conclusion)


Automata

Automata on unranked tree


Unranked tree automata

[Figure: transitions of an automaton evaluating Boolean circuits whose nodes have any number of children in states t and f; one transition is listed for each possible word of children states.]

Issue: represent an infinite set of transitions

Solution: a regular language

Unranked tree automata (2)

• Rule: $a, L(Q) \to \{r_1, \ldots, r_m\}$

• Meaning: if the states of the children of some node labeled a form a word in L(Q), this node moves to some state in $\{r_1, \ldots, r_m\}$

Example (Boolean circuits, with states t and f):

– $\vee, L \to f$ where $L = f^*$

– $\vee, L \to t$ where $L = (t+f)^*\, t\, (t+f)^*$

– $\wedge, L \to f$ where $L = (t+f)^*\, f\, (t+f)^*$

– $\wedge, L \to t$ where $L = t^*$
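Rules of the form "label, regular language over states, resulting state" can be run directly with a regular-expression library. The sketch below assumes single-character states "t" and "f", trees encoded as (label, [children]), and leaf rules sending the label 1 to t and 0 to f; all of these encodings and the name `RULES` are assumptions made for this example.

```python
import re

# Rules (label, regular expression over children states, resulting state).
# Leaves have no children, so their language is the empty word only.
RULES = [
    ("1", r"", "t"),
    ("0", r"", "f"),
    ("and", r"[tf]*f[tf]*", "f"),
    ("and", r"t*", "t"),
    ("or", r"[tf]*t[tf]*", "t"),
    ("or", r"f*", "f"),
]

def states(tree):
    """Set of states reachable at the root of tree = (label, [children])."""
    label, children = tree
    # Build every word of children states: one state chosen per child.
    words = {""}
    for child in children:
        words = {w + q for w in words for q in states(child)}
    return {
        q
        for (a, regex, q) in RULES
        if a == label
        for w in words
        if re.fullmatch(regex, w)
    }

circuit = ("or", [("and", [("1", []), ("1", []), ("0", [])]),
                  ("0", []),
                  ("1", [])])
```

For this circuit the children of the root are in states f, f, t, so `states(circuit)` is `{"t"}`. A finite list of rules thus covers nodes with any number of children.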

Building on ranked trees

[Figure: an unranked tree with nodes labeled a and b, and its FirstChild-NextSibling encoding as a ranked (binary) tree.]

• Ranked tree: FirstChild-NextSibling

• F: encoding into a ranked tree

• F is a bijection
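The FirstChild-NextSibling encoding and its inverse fit in a few lines. This sketch assumes unranked trees encoded as (label, [children]) and binary trees as (label, first_child, next_sibling) with `None` for a missing child; the encoding and function names are choices made for this illustration.

```python
def encode(forest):
    """Encode a list of sibling trees (label, [children]) as a binary tree.

    The result is None for an empty list, or a triple
    (label, encoding of the children, encoding of the next siblings).
    """
    if not forest:
        return None
    (label, children), *siblings = forest
    return (label, encode(children), encode(siblings))

def decode(binary):
    """Inverse of encode: rebuild the list of sibling trees."""
    if binary is None:
        return []
    label, first_child, next_sibling = binary
    return [(label, decode(first_child))] + decode(next_sibling)

tree = ("a", [("b", []), ("a", [("b", [])]), ("b", [])])
encoded = encode([tree])
```

Decoding gives back the original tree, `decode(encoded) == [tree]`, which illustrates on one example that F is a bijection.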

Building on bottom-up ranked trees (2)

• For each unranked TA A, there is a ranked TA accepting F(L(A))

• For each ranked TA A, there is an unranked TA accepting $F^{-1}(L(A))$

• Both are easy to construct

Consequence: unranked TA are closed under union, intersection, complement

Determinization

• Determinization is always possible for bottom-up automata

• Can we use the FirstChild-NextSibling encoding? No: it does not preserve determinism

• Determinism for unranked automata: for all $a \in \Sigma$ and $w \in Q^*$, there exists a unique rule $(a, L)$ such that $w \in L$

Top-down?

• This is more delicate

• Transition: $\delta(a, q) = A(a, q)$

– The states taken by the word automaton A(a,q) when reading the labels of the children of a node labeled a determine the states of the children of that node

– Accepts if all the leaves are in accepting states

Boolean circuit evaluation

[Figure: a top-down run on a Boolean circuit with 0/1 leaves; each node is annotated with the state $q_0$ or $q_1$ assigned to it.]

• The circuit shown is accepted

• The automaton rejects if some leaf is neither 0 with $q_0$ nor 1 with $q_1$

Automata

Automata and monadic second-order logic

Monadic second-order logic

• Representation of a tree as a logical structure

[Figure: a tree with nodes numbered 1 to 9, each labeled a or b.]

– Edges: E(1,2), E(1,3), …, E(3,9)

– Siblings: S(2,3), S(3,4), S(4,5), …, S(8,9)

– Labels: a(1), a(4), a(8) and b(2), b(3), b(5), b(6), b(7), b(9)

Monadic second-order logic

E(1,2), E(1,3), …, E(3,9)

S(2,3), S(3,4), S(4,5), …, S(8,9)

a(1), a(4), a(8)

b(2), b(3), b(5), b(6), b(7), b(9)

MSO syntax

$\varphi ::= x = y \mid E(x,y) \mid S(x,y) \mid a(x) \mid \ldots \mid x \in X \mid \exists x\, \varphi \mid \exists X\, \varphi$

• X is a set variable

• $\exists X\, \varphi$ is a quantification over a set variable

Example of MSO

• Each a node has a b-descendant

• This corresponds to the formula

$\forall x\, \big( a(x) \to \forall X\, ( \alpha \wedge \beta \to \exists y\, ( y \in X \wedge b(y) ) ) \big)$

where $\alpha = x \in X$ and $\beta = \forall y\, \forall z\, ( y \in X \wedge E(y,z) \to z \in X )$

• For each node x labeled a: each set X that ($\alpha$) contains x and that ($\beta$) is closed under descendant contains some y labeled b

Bridge

Theorem: for a set L of trees, the following are equivalent

1. L = L(A) for some bottom-up tree automaton A, i.e. L is definable with bottom-up tree automata

2. L = {T | T satisfies $\varphi$} for some MSO formula $\varphi$, i.e. L is definable in MSO

XML typing

DTDs


DTD

• Describes the children of a node labeled a by a regular expression

• Bizarre syntax

<!ELEMENT populationdata (continent*) >

<!ELEMENT continent (name, country*) >

<!ELEMENT country (name, province*)>

<!ELEMENT province (name, city*) >

<!ELEMENT city (name, pop) >

<!ELEMENT name (#PCDATA) >

<!ELEMENT pop (#PCDATA) >


DTD and determinism

• Regular expressions in DTD should be deterministic

– Complicated definition

• Intuition: the corresponding automaton should be deterministic

• (a+b)*a is not deterministic

– When reading <a>, one cannot tell whether it is an a from (a+b) or the final a

– (b*a)(b*a)* is an equivalent expression that is deterministic

Very efficient validation

• It suffices to verify, for each node labeled a, that the word formed by the labels of its children is accepted by the finite state automaton $A_a$

• Possible to type check the document while scanning it, e.g. with a SAX parser
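Validation during a single scan can be sketched with Python's standard `xml.sax` module and a stack holding, for each open element, the labels of the children seen so far. The content models below transcribe the populationdata DTD as regular expressions in which each label is followed by one space; this encoding, the class name `Validator` and the use of `re.fullmatch` at the closing tag (instead of stepping an automaton $A_a$ incrementally) are simplifications assumed for this example. Text content is not checked.

```python
import re
import xml.sax

# Content models of the populationdata DTD as regular expressions over
# child labels; each label is followed by one space.
CONTENT = {
    "populationdata": r"(continent )*",
    "continent": r"name (country )*",
    "country": r"name (province )*",
    "province": r"name (city )*",
    "city": r"name pop ",
    "name": r"",
    "pop": r"",
}

class Validator(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = []     # one list of child labels per open element
        self.valid = True

    def startElement(self, name, attrs):
        if self.stack:
            self.stack[-1].append(name)   # record this child in its parent
        self.stack.append([])

    def endElement(self, name):
        children = self.stack.pop()
        word = "".join(label + " " for label in children)
        model = CONTENT.get(name)
        if model is None or not re.fullmatch(model, word):
            self.valid = False

def is_valid(document):
    handler = Validator()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.valid

GOOD = ("<populationdata><continent><name>Europe</name>"
        "<country><name>France</name></country></continent></populationdata>")
BAD = ("<populationdata><continent>"
       "<country><name>France</name></country></continent></populationdata>")
```

`is_valid(GOOD)` holds, whereas `is_valid(BAD)` fails because the continent element lacks its name child. The stack depth is bounded by the depth of the document, not by its size.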

Very efficient validation (2)

<!ELEMENT a (b, c) >

<!ELEMENT b (d+) >

Document: <a><b><d/><d/></b><c/></a>

[Figure: the tree of this document, and the word automaton $A_a$ (states s, t, u; transitions on b and c) that checks the children of a; the figure also shows states s' and t'.]

Warning

• The previous example can be checked with a simple automaton on words

• But not the following one

<!ELEMENT part ( part* ) >

• A stack is needed for accepting

<a>…<a></a>…</a>, i.e. n opening tags <a> followed by n closing tags </a>

Some bad news for DTD

• Not closed under union

DTD1

<!ELEMENT used( ad*) >

<!ELEMENT ad ( year, brand )>

DTD2

<!ELEMENT new( ad*) >

<!ELEMENT ad ( brand )>

• L(DTD1) $\cup$ L(DTD2) cannot be described by a DTD but can be described easily by a tree automaton

– Problem: the type of ad depends on its parent

• Also not closed under complement

Car example continued

• The best DTD we can choose does not distinguish between ads for used and new cars

– <!ELEMENT ad (year?, brand) >

[Figure: a Car tree with a Used ad (Brand "Renault", Year "2008") and a New ad (Brand "BMW").]

Decoupled types in XML schema

• Each type corresponds to a label, not conversely

car: [car] (used + new)*

used: [used] (ad1*)

new: [new] (ad2*)

ad1: [ad] (year, brand)

ad2: [ad] (brand)

• Tags are written in brackets; type names appear before the colon

• Nice closure properties

• Many other « gadgets » in XML schemas
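Decoupling types from tags can be illustrated with a small checker that computes, bottom-up, the set of type names a tree conforms to. The sketch assumes trees encoded as (tag, [children]) and content models written as regular expressions over type names, each followed by one space; the leaf types `year` and `brand`, the table name `TYPES` and the function `types_of` are assumptions added for this example.

```python
import re

# type name -> (tag, regular expression over child type names,
#               each type name being followed by one space)
TYPES = {
    "car": ("car", r"((used|new) )*"),
    "used": ("used", r"(ad1 )*"),
    "new": ("new", r"(ad2 )*"),
    "ad1": ("ad", r"year brand "),
    "ad2": ("ad", r"brand "),
    "year": ("year", r""),     # assumed leaf type
    "brand": ("brand", r""),   # assumed leaf type
}

def types_of(tree):
    """Set of type names that tree = (tag, [children]) conforms to."""
    tag, children = tree
    child_types = [types_of(child) for child in children]
    # Every word obtained by choosing one type per child.
    words = {""}
    for options in child_types:
        words = {w + t + " " for w in words for t in options}
    return {
        name
        for name, (type_tag, model) in TYPES.items()
        if type_tag == tag and any(re.fullmatch(model, w) for w in words)
    }

used_ad = ("ad", [("year", []), ("brand", [])])
new_ad = ("ad", [("brand", [])])
document = ("car", [("used", [used_ad]), ("new", [new_ad])])
```

Both ads carry the tag ad, yet `types_of(used_ad)` is `{"ad1"}` and `types_of(new_ad)` is `{"ad2"}`, so a used-car ad placed under new is rejected: `types_of(("new", [used_ad]))` is empty.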

XML typing

XML Schemas


XML Schema

• Often criticized as unnecessarily complicated

• Boosted by Web services

• Richer than DTD: decoupled types

• Close to deterministic top-down tree automata

• XML schemas are extensible

• Many other useful functionalities

– Namespaces

– Atomic types

An XML schema is an XML document

• Since it uses XML syntax, it can be processed with XML tools

– Editor

– Type checker

– Etc.

• The type of all XML schemas can be described with an XML schema

<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.net-language.com">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="author" type="xs:string"/>
        <xs:element name="character"
                    minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="friend-of" type="xs:string"
                          minOccurs="0" maxOccurs="unbounded"/>
              <xs:element name="since" type="xs:date"/>
              <xs:element name="qualification" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Simple elements and atomic types

Definition: <xs:element name="xxx" type="yyy"/>

with common types: xs:string, xs:decimal, xs:integer, xs:boolean, xs:date, xs:time

Examples

<xs:element name="lastname" type="xs:string"/>
<xs:element name="age" type="xs:integer"/>
<xs:element name="dateborn" type="xs:date"/>

Instances of such elements

<lastname>Refsnes</lastname>
<age>34</age>
<dateborn>1968-03-27</dateborn>

Attributes

Definition: <xs:attribute name="xxx" type="yyy"/>

Example

<xs:attribute name="lang" type="xs:string"/>

Instance of such attribute

<lastname lang="EN">Smith</lastname>


Complex elements

• Empty element

<product pid="1345"/>

• Contains only other elements

<employee> <firstname>John</firstname>

<lastname>Smith</lastname> </employee>

• Contains only text

<food type="dessert">Ice cream</food>

• Contains both elements and text

<description> It happened on <date lang="norwegian">

03.03.99</date> .... </description>


Restriction of simple elements

<xs:element name="age">

<xs:simpleType>

<xs:restriction base="xs:integer">

<xs:minInclusive value="0"/>

<xs:maxInclusive value="100"/>

</xs:restriction>

</xs:simpleType>

</xs:element>

Other restrictions: enumerated types, patterns, etc.


Restriction on complex elements

<xs:element name="person">

<xs:complexType>

<xs:sequence>

<xs:element name="firstname" type="xs:string"/>

<xs:element name="lastname" type="xs:string"/>

</xs:sequence>

</xs:complexType>

</xs:element>


Possible to name a type

<xs:element name="employee">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Only the "employee" element can use the specified complex type (<sequence> indicates an order on child elements)

Alternative

<xs:element name="employee" type="personinfo"/>

<xs:complexType name="personinfo">
  <xs:sequence>
    <xs:element name="firstname" type="xs:string"/>
    <xs:element name="lastname" type="xs:string"/>
  </xs:sequence>
</xs:complexType>

Other gadgets

• Import of types associated to a namespace

– <import namespace="http:// ..." schemaLocation="http:// ..."/>

• Possible to include an existing schema

– <include schemaLocation="http:// ..."/>

• Possible to extend/redefine an existing schema

– <redefine schemaLocation="http:// ...">

.... Extensions ...

</redefine>

Example: a DTD

<!ELEMENT EMAIL (TO+, FROM, CC*, BCC*, SUBJECT?, BODY?)>

<!ATTLIST EMAIL

LANGUAGE (Western|Greek|Latin|Universal) "Western"

ENCRYPTED CDATA #IMPLIED

PRIORITY (NORMAL|LOW|HIGH) "NORMAL">

<!ELEMENT TO (#PCDATA)>

<!ELEMENT FROM (#PCDATA)>

<!ELEMENT CC (#PCDATA)>

<!ELEMENT BCC (#PCDATA)>

<!ATTLIST BCC

HIDDEN CDATA #FIXED "TRUE">

<!ELEMENT SUBJECT (#PCDATA)>

<!ELEMENT BODY (#PCDATA)>


The same in XML schema

(more verbose)

<?xml version="1.0" ?>

<Schema name="email" xmlns="urn:schemas-microsoft-com:xml-data"

xmlns:dt="urn:schemas-microsoft-com:datatypes">

<AttributeType name="language"

dt:type="enumeration" dt:values="Western Greek Latin Universal" />

<AttributeType name="encrypted" />

<AttributeType name="priority" dt:type="enumeration" dt:values="NORMAL LOW HIGH" />

<AttributeType name="hidden" default="true" />

<ElementType name="to" content="textOnly" />

<ElementType name="from" content="textOnly" />

<ElementType name="cc" content="textOnly" />

<ElementType name="bcc" content="mixed">

<attribute type="hidden" required="yes" />

</ElementType>

<ElementType name="subject" content="textOnly" />

<ElementType name="body" content="textOnly" />

<ElementType name="email" content="eltOnly">

<attribute type="language" default="Western" />

<attribute type="encrypted" />

<attribute type="priority" default="NORMAL" />

<element type="to" minOccurs="1" maxOccurs="*" />

<element type="from" minOccurs="1" maxOccurs="1" />

<element type="cc" minOccurs="0" maxOccurs="*" />

<element type="bcc" minOccurs="0" maxOccurs="*" />

<element type="subject" minOccurs="0" maxOccurs="1" />

<element type="body" minOccurs="0" maxOccurs="1" />

</ElementType>

</Schema>


Where to place XML schemas

• Some bizarre restriction

– Inside an element, no two types with the same tag

• Closer to DTDs than to tree automata

[Figure: relative expressive power of tree automata, deterministic top-down tree automata, XML schema and DTD.]

Exercise: coupled vs decoupled

• Write a realistic DTD1 for new cars

– With make, model, engine…

• Write a realistic DTD2 for used cars

– Also year, miles, zipcode

• Write an XML schema for L(DTD1) $\cup$ L(DTD2)

– Using decoupled schema

Automata

Automata to compute


Another use of automata: XPATH

$x in //a/b

[Figure, shown step by step: an NFA and the equivalent DFA for //a/b are run top-down over a tree with nodes labeled a and b. Each node is annotated with the set of NFA states reached, written (0), (01) or (02), and the nodes reached in (02) are bound to $x.]

Determinization: exponential blow up

//a/*/*/b
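The blow-up can be observed by running the subset construction on the path automaton of //a/*…/*/b and counting the reachable states. The sketch below assumes a two-letter alphabet {a, b} and an NFA with states 0..n in which state 0 loops on every label; the function name `dfa_size` and this encoding are assumptions made for the illustration.

```python
def dfa_size(wildcards, alphabet="ab"):
    """Number of reachable subset-construction states for //a/*.../*/b.

    NFA states are 0..n: state 0 loops on every label (the // step),
    0 -a-> 1, then one 'any label' step per wildcard, then a final b step.
    """
    n = wildcards + 2                     # the accepting NFA state

    def step(subset, label):
        nxt = {0}                         # state 0 loops on any label
        for q in subset:
            if q == 0 and label == "a":
                nxt.add(1)
            elif 1 <= q < n - 1:          # wildcard steps accept any label
                nxt.add(q + 1)
            elif q == n - 1 and label == "b":
                nxt.add(n)
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        subset = todo.pop()
        for label in alphabet:
            nxt = step(subset, label)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)

sizes = [dfa_size(k) for k in range(5)]
```

In this sketch the number of DFA states doubles with each additional wildcard step, while the NFA grows by a single state.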


Proposal: k-pebble transducers [Milo, Suciu, Vianu]

[Figure: a k-pebble transducer with its stack.]

k-pebble transducers: result

[Figure: an output tree with a root and nodes labeled a, b and c.]

Graphs and bisimulation


Graph

• Graph semistructured data

• Graph simulation

• Graph bisimulation

• Data guides


Semistructured data

• With ID-IDREF, XML is a graph model as well

• OEM = Object Exchange Model

• Labeled (rooted) graph (E,r)

– Set N of nodes

– A finite ternary relation $E \subseteq N \times N \times \mathit{Label}$

– E(s,t,l) = there is an edge from s to t labeled l

– Possibly a root r

[Figure: an OEM graph with root &r, a company node &c and employee nodes &p1 to &p8; edges are labeled company, employee, worksfor, manages and managedby.]

Equality revisited

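• {1,2,2,1,5} = {1,2,5}

– Ignores the order and duplicates

• The same holds for trees, if we ignore the order of siblings and use a "set" semantics

[Figure: two trees with root a and nodes labeled b, c and d that differ only by the order and repetition of subtrees, and are therefore equal.]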

Simulation

A simulation $\sim$ of (E,r) with (E',r') is a relation between the nodes of E and E' such that

1. $r \sim r'$

2. if $s \sim s'$ and E(s,t,l) for some l, then there exists t' with $t \sim t'$ and E'(s',t',l)

(we simulate a move in E by a move in E')

Bisimulation

Given $\sim$, E, E', $\sim$ is a bisimulation if

– $\sim$ is a simulation of E with E', and

– $\sim^{-1}$ is a simulation of E' with E

Examples

[Figure: three graphs G, G' and G'' with edges labeled a and d; G and G' are bisimilar, whereas G' and G'' are not.]

Graph bisimulation

[Figure, shown step by step: a data graph with a root, employees e1 to e4, projects p1 to p9 (with string values "exercise", "lecture", "finance", "adminstr.", "PR", "undergrad", "grad", "postgrad", "web") and categories c1, c2 (programmer, statistician), with edges labeled employee, project, workson, leads and consults. Successive frames build a relation R between this graph and a type graph with nodes t1, t2 and STRING, whose edges are labeled employee, projects, programmer | statistician and _.]

Computing bisimulation in PTIME

• Start with $\sim\ = N \times N'$ (for N, N' the sets of nodes)

• While there exists (x,x') in $\sim$ that violates the definition of bisimulation, remove (x,x') from $\sim$

• This computes the maximal bisimulation in PTIME

(Note: this maximal bisimulation exists because $\emptyset$ is a bisimulation, and if $\sim_1$, $\sim_2$ are bisimulations, $\sim_1 \cup \sim_2$ is also one)
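The refinement loop can be written almost literally. This sketch assumes graphs given as a set of nodes plus a set of (source, target, label) triples, and it ignores the root condition; the function and variable names are assumptions made for the example, and no attempt is made at an efficient implementation.

```python
def maximal_bisimulation(nodes1, edges1, nodes2, edges2):
    """Greatest bisimulation between two edge-labeled graphs.

    edges1 and edges2 are sets of triples (source, target, label).
    """
    def ok(x, y, rel, ex, ey, flip):
        # Every move x -l-> t must be matched by a move y -l-> t2
        # such that t and t2 are still related.
        for (s, t, l) in ex:
            if s != x:
                continue
            if not any(
                s2 == y and l2 == l and ((t2, t) if flip else (t, t2)) in rel
                for (s2, t2, l2) in ey
            ):
                return False
        return True

    rel = {(x, y) for x in nodes1 for y in nodes2}   # start with N x N'
    changed = True
    while changed:
        changed = False
        for (x, y) in list(rel):
            if not (ok(x, y, rel, edges1, edges2, False)
                    and ok(y, x, rel, edges2, edges1, True)):
                rel.discard((x, y))                  # (x, y) violates the definition
                changed = True
    return rel

# Hypothetical sample graphs: r has two a-children, r2 has a single a-child.
N1, E1 = {"r", "x", "y"}, {("r", "x", "a"), ("r", "y", "a")}
N2, E2 = {"r2", "z"}, {("r2", "z", "a")}
REL = maximal_bisimulation(N1, E1, N2, E2)
```

On the sample graphs the result relates r to r2 and both x and y to z, which matches the "set" semantics: duplicated subtrees do not matter. Each pass removes at least one pair or stops, so the number of passes is bounded by the number of pairs.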

What does this have to do with typing?

• Take a very complex graph E

• How do you describe it?

• By a "smaller" graph T that is bisimilar to E

• There may be several such graphs, with more and more detail

Rough bisimulation

Nodes of the type graph and the data nodes they stand for:

– Root: &r

– Bosses: &p1, &p4, &p6

– Regulars: &p2, &p3, &p5, &p7, &p8

Edge labels: company, employee, manages, managedby, worksfor

More precise one

Nodes of the type graph and the data nodes they stand for:

– Root: &r

– Employees: &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8

– Bosses: &p1, &p4, &p6

– Regulars: &p2, &p3, &p5, &p7, &p8

– Company: &c

Edge labels: company, employee, manages, managedby, worksfor

Other “typing”: data guide

• See the graph as an automaton with the root as the start state and all states accepting

• This automaton accepts all the paths from the root

• Obtain an equivalent, minimal, deterministic automaton

– This is the data guide for the graph

– It can be used for describing the data

– It can be used to support graphical query interfaces
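The deterministic part of this construction is the usual subset construction applied to the data graph. The sketch below assumes a graph given as (source, target, label) triples and returns the transitions between sets of nodes; it performs determinization only (no minimization step), and the function name `data_guide` and the small sample graph are assumptions made for illustration.

```python
def data_guide(root, edges):
    """Deterministic automaton of all label paths from the root.

    edges is a set of triples (source, target, label). States of the
    result are frozensets of graph nodes (subset construction).
    """
    start = frozenset({root})
    guide = {}                              # (state, label) -> state
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        by_label = {}
        for (s, t, l) in edges:
            if s in state:
                by_label.setdefault(l, set()).add(t)
        for l, targets in by_label.items():
            nxt = frozenset(targets)
            guide[(state, l)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return start, guide

# Hypothetical sample graph in the spirit of the employee/project example.
EDGES = {
    ("root", "e1", "employee"), ("root", "e2", "employee"),
    ("e1", "p1", "workson"), ("e2", "p2", "workson"), ("e2", "p1", "leads"),
}
START, GUIDE = data_guide("root", EDGES)
```

Each label path from the root leads to exactly one state of the guide, namely the set of data nodes reachable by that path; for instance the path employee.workson leads to {p1, p2}.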

Data guide

• Gives all the paths from the root

• Automata minimization

[Figure: the example graph (root, employees e1 to e4, projects p1 to p9, categories c1 and c2) and its data guide. The states of the data guide are sets of nodes: {root}, {c1}, {c2}, {p1,...,p9}, {e1,e2,e3,e4}, and further sets such as {p1,p3}, {p2,p4}, {p1,p3,p5,p7}, {p4,p6}, {p4}, {e1,e2}, {e2,e3}, {p1,p3,p5,p7,p9}, {p2,p4,p6,p8} and {p4,p9}, reached through edges labeled programmer, statistician, project, employee, workson, leads and consults.]

What you should remember

• Tree automata = theoretical foundation for XML

• Bottom-up tree automata are nice

• Top-down and determinism together $\Rightarrow$ limitations

• XML documents do not have to be typed

• Typing may be very useful for XML

– In particular for software managing XML data

• DTD: simple but limited

• XML Schema: more expressive but still limited

• Graph data: bisimulation is the answer

Thank you

Bibliography

• TATA: the book, Tree Automata Techniques and Applications, tata.gforge.inria.fr/

The book on the topic and it is free

• XML schema, see http://w3.org

http://www.w3schools.com/schema/
