• Aucun résultat trouvé

Max-Planck-InstitutfürInformatik ,22August2007 PierreSenellart UnderstandingtheHiddenWeb

N/A
N/A
Protected

Academic year: 2022

Partager "Max-Planck-InstitutfürInformatik ,22August2007 PierreSenellart UnderstandingtheHiddenWeb"

Copied!
89
0
0

Texte intégral

(1)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Understanding the Hidden Web

Pierre Senellart

Max-Planck-Institut für Informatik, 22 August 2007

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 1 / 32

(2)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

The Hidden Web

Definition (Hidden Web, Deep Web, Invisible Web)

The part of Web content not accessible from the hyperlinked structure of the World Wide Web. Typically: HTML forms, Web Services.

Size estimate (2001) : 500 times larger than the surface Web.

How to understand it and benefit from its content?

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 2 / 32

(3)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

The Hidden Web

Definition (Hidden Web, Deep Web, Invisible Web)

The part of Web content not accessible from the hyperlinked structure of the World Wide Web. Typically: HTML forms, Web Services.

Size estimate (2001) : 500 times larger than the surface Web.

How to understand it and benefit from its content?

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 2 / 32

(4)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

The Hidden Web

Definition (Hidden Web, Deep Web, Invisible Web)

The part of Web content not accessible from the hyperlinked structure of the World Wide Web. Typically: HTML forms, Web Services.

Size estimate (2001) : 500 times larger than the surface Web.

How to understand it and benefit from its content?

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 2 / 32

(5)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Understanding the Hidden Web

Purpose

Intensional indexing of the Hidden Web.

High-level queries.

) a semantic search engine over the Hidden Web.

In a fully automatic, unsupervised, way!

Difficult and broad problem.

Use of domain knowledge (ontology, instances).

Example of the database publication domain.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 3 / 32

(6)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Understanding the Hidden Web

Purpose

Intensional indexing of the Hidden Web.

High-level queries.

) a semantic search engine over the Hidden Web.

In a fully automatic, unsupervised, way!

Difficult and broad problem.

Use of domain knowledge (ontology, instances).

Example of the database publication domain.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 3 / 32

(7)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Understanding the Hidden Web

Purpose

Intensional indexing of the Hidden Web.

High-level queries.

) a semantic search engine over the Hidden Web.

In a fully automatic, unsupervised, way!

Difficult and broad problem.

Use of domain knowledge (ontology, instances).

Example of the database publication domain.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 3 / 32

(8)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Understanding the Hidden Web

Purpose

Intensional indexing of the Hidden Web.

High-level queries.

) a semantic search engine over the Hidden Web.

In a fully automatic, unsupervised, way!

Difficult and broad problem.

Use of domain knowledge (ontology, instances).

Example of the database publication domain.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 3 / 32

(9)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Understanding the Hidden Web

Purpose

Intensional indexing of the Hidden Web.

High-level queries.

) a semantic search engine over the Hidden Web.

In a fully automatic, unsupervised, way!

Difficult and broad problem.

Use of domain knowledge (ontology, instances).

Example of the database publication domain.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 3 / 32

(10)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Web Service Semantic Interpretation Process

WWW

HTML form

Analyzed form + result pages Web service

Analyzed

Web service Service index

User discovery

discovery probing

wrapper induction semantic analysis

indexing

query results

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 4 / 32

(11)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Web Service Semantic Interpretation Process

WWW HTML form

Analyzed form + result pages

Web service

Analyzed

Web service Service index

User

discovery

discovery

probing wrapper

induction semantic analysis

indexing

query results

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 4 / 32

(12)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Web Service Semantic Interpretation Process

WWW HTML form

Analyzed form + result pages Web service

Analyzed

Web service Service index

User

discovery

discovery probing

wrapper induction semantic analysis

indexing

query results

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 4 / 32

(13)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Web Service Semantic Interpretation Process

WWW HTML form

Analyzed form + result pages Web service

Analyzed

Web service Service index

User

discovery

discovery probing

wrapper induction

semantic analysis

indexing

query results

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 4 / 32

(14)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Web Service Semantic Interpretation Process

WWW HTML form

Analyzed form + result pages Web service

Analyzed Web service

Service index

User

discovery

discovery probing

wrapper induction semantic analysis

indexing

query results

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 4 / 32

(15)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Web Service Semantic Interpretation Process

WWW HTML form

Analyzed form + result pages Web service

Analyzed

Web service Service index

User

discovery

discovery probing

wrapper induction semantic analysis

indexing

query results

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 4 / 32

(16)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Web Service Semantic Interpretation Process

WWW HTML form

Analyzed form + result pages Web service

Analyzed

Web service Service index

User discovery

discovery probing

wrapper induction semantic analysis

indexing

query results

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 4 / 32

(17)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Imprecise Data and Imprecise Tasks

Observations

Many needed tasks generate imprecise data, with some confidence value.

Need for a way to manage this imprecision, to work with it throughout an entire complex process.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 5 / 32

(18)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Imprecise Data and Imprecise Tasks

Observations

Many needed tasks generate imprecise data, with some confidence value.

Need for a way to manage this imprecision, to work with it throughout an entire complex process.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 5 / 32

(19)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

A Probabilistic XML Warehouse

Module 1 Module 2 Module 3

Update interface Query interface

Probabilistic XML Warehouse

Update transaction + confidence

Query Results + confidence

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 6 / 32

(20)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

A Probabilistic XML Warehouse (Hidden Web)

Module 1 Module 2 Module 3

Update interface Query interface

Probabilistic XML Warehouse

Update transaction + confidence

Query Results + confidence Topic crawler Form analyzer Inf. Extractor

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 6 / 32

(21)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

A Probabilistic XML Warehouse (Hidden Web)

Module 1 Module 2 Module 3

Update interface Query interface

Probabilistic XML Warehouse

Update transaction + confidence

Query Results + confidence Topic crawler Form analyzer Inf. Extractor

Crawled URLs + confidence

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 6 / 32

(22)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

A Probabilistic XML Warehouse (Hidden Web)

Module 1 Module 2 Module 3

Update interface Query interface

Probabilistic XML Warehouse

Update transaction + confidence

Query

Results + confidence Topic crawler Form analyzer Inf. Extractor

Form URL?

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 6 / 32

(23)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

A Probabilistic XML Warehouse (Hidden Web)

Module 1 Module 2 Module 3

Update interface Query interface

Probabilistic XML Warehouse

Update transaction + confidence

Query

Results + confidence

Topic crawler Form analyzer Inf. Extractor

URLs + confidence

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 6 / 32

(24)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

A Probabilistic XML Warehouse (Hidden Web)

Module 1 Module 2 Module 3

Update interface Query interface

Probabilistic XML Warehouse

Update transaction + confidence

Query Results + confidence Topic crawler Form analyzer Inf. Extractor

Analyzed form + confidence

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 6 / 32

(25)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

A Probabilistic XML Warehouse (Hidden Web)

Module 1 Module 2 Module 3

Update interface Query interface

Probabilistic XML Warehouse

Update transaction + confidence

Query

Results + confidence Topic crawler Form analyzer Inf. Extractor

Form?

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 6 / 32

(26)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

A Probabilistic XML Warehouse (Hidden Web)

Module 1 Module 2 Module 3

Update interface Query interface

Probabilistic XML Warehouse

Update transaction + confidence

Query

Results + confidence

Topic crawler Form analyzer Inf. Extractor

Person ! + confidence

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 6 / 32

(27)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

A Probabilistic XML Warehouse (Hidden Web)

Module 1 Module 2 Module 3

Update interface Query interface

Probabilistic XML Warehouse

Update transaction + confidence

Query Results + confidence Topic crawler Form analyzer Inf. Extractor

Person ! ISBN + confidence

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 6 / 32

(28)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Outline

1 Introduction

2 A Probabilistic XML Data Model

3 Probing the Hidden Web

4 Wrapper Induction from Result Pages

5 Deriving Schema Mappings from Database Instances

6 Semantic Model

7 Conclusion

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 7 / 32

(29)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Probabilistic Trees

Framework Unordered data trees

Details: no attributes, no mixed content. . . A

B C

D 6=

A

B B C

D

(multiset semantics)

Sample space: Set of all such data trees.

Probabilistic tree (prob-tree): Representation of a discrete probability distribution over this sample space.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 8 / 32

(30)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Probabilistic Trees

Framework Unordered data trees

Details: no attributes, no mixed content. . . A

B C

D 6=

A

B B C

D

(multiset semantics)

Sample space: Set of all such data trees.

Probabilistic tree (prob-tree): Representation of a discrete probability distribution over this sample space.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 8 / 32

(31)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Probabilistic Trees

Framework Unordered data trees

Details: no attributes, no mixed content. . . A

B C

D 6=

A

B B C

D

(multiset semantics)

Sample space: Set of all such data trees.

Probabilistic tree (prob-tree): Representation of a discrete probability distribution over this sample space.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 8 / 32

(32)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Probabilistic Trees

Framework Unordered data trees

Details: no attributes, no mixed content. . . A

B C

D 6=

A

B B C

D

(multiset semantics)

Sample space: Set of all such data trees.

Probabilistic tree (prob-tree): Representation of a discrete probability distribution over this sample space.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 8 / 32

(33)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Probabilistic Trees

Framework Unordered data trees

Details: no attributes, no mixed content. . . A

B C

D 6=

A

B B C

D

(multiset semantics)

Sample space: Set of all such data trees.

Probabilistic tree (prob-tree): Representation of a discrete probability distribution over this sample space.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 8 / 32

(34)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

The Prob-Tree Model

Data tree with event conditions (conjunction of probabilistic events or negations of probabilistic events) assigned to each node.

Probabilistic events are boolean random variables, assumed to be independent, with their own probability distribution.

A

B

w

1

; : w

2

C

D

w

2

Event Prob.

w

1

0 : 8 w

2

0 : 7

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 9 / 32

(35)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Features of the Prob-Tree Model

Well-defined possible world semantics.

Full expressive power, reasonable conciseness.

Possible to apply query and updates directly on prob-trees, in an efficient way.

Complexity study.

Implementation available.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 10 / 32

(36)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Features of the Prob-Tree Model

Well-defined possible world semantics.

Full expressive power, reasonable conciseness.

Possible to apply query and updates directly on prob-trees, in an efficient way.

Complexity study.

Implementation available.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 10 / 32

(37)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Features of the Prob-Tree Model

Well-defined possible world semantics.

Full expressive power, reasonable conciseness.

Possible to apply query and updates directly on prob-trees, in an efficient way.

Complexity study.

Implementation available.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 10 / 32

(38)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Features of the Prob-Tree Model

Well-defined possible world semantics.

Full expressive power, reasonable conciseness.

Possible to apply query and updates directly on prob-trees, in an efficient way.

Complexity study.

Implementation available.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 10 / 32

(39)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Features of the Prob-Tree Model

Well-defined possible world semantics.

Full expressive power, reasonable conciseness.

Possible to apply query and updates directly on prob-trees, in an efficient way.

Complexity study.

Implementation available.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 10 / 32

(40)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Outline

1 Introduction

2 A Probabilistic XML Data Model

3 Probing the Hidden Web

4 Wrapper Induction from Result Pages

5 Deriving Schema Mappings from Database Instances

6 Semantic Model

7 Conclusion

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 11 / 32

(41)

Joint work with Avin Mittal (IIT Bombay).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Analyzing HTML Forms

Analyzing the structure of HTML forms.

Problem

Associate to each relevant form field its corresponding domain concept.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 12 / 32

(42)

Joint work with Avin Mittal (IIT Bombay).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

First Step: Structural Analysis

1

Build a context for each field:

label tag;

id and name attributes;

text immediately before the field.

2

Remove stop words, stem.

3

Match this context with the concept names, extended with WordNet.

4

Obtain in this way candidate annotations.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 13 / 32

(43)

Joint work with Avin Mittal (IIT Bombay).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

First Step: Structural Analysis

1

Build a context for each field:

label tag;

id and name attributes;

text immediately before the field.

2

Remove stop words, stem.

3

Match this context with the concept names, extended with WordNet.

4

Obtain in this way candidate annotations.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 13 / 32

(44)

Joint work with Avin Mittal (IIT Bombay).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

First Step: Structural Analysis

1

Build a context for each field:

label tag;

id and name attributes;

text immediately before the field.

2

Remove stop words, stem.

3

Match this context with the concept names, extended with WordNet.

4

Obtain in this way candidate annotations.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 13 / 32

(45)

Joint work with Avin Mittal (IIT Bombay).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

First Step: Structural Analysis

1

Build a context for each field:

label tag;

id and name attributes;

text immediately before the field.

2

Remove stop words, stem.

3

Match this context with the concept names, extended with WordNet.

4

Obtain in this way candidate annotations.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 13 / 32

(46)

Joint work with Avin Mittal (IIT Bombay).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Second Step: Confirm Annotations with Probing

For each field annotated with a concept c:

1

Probe the field with nonsense word to get an error page.

2

Probe the field with instances of c (chosen representatively of the frequency distribution of c).

3

Compare pages obtained by probing with the error page (by using clustering along the DOM tree structure of the pages), to

distinguish error pages and result pages.

4

Confirm the annotation if enough result pages are obtained.

In practice, very good precision and good recall; but some limitations on the kind of forms that can be dealt with.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 14 / 32

(47)

Joint work with Avin Mittal (IIT Bombay).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Second Step: Confirm Annotations with Probing

For each field annotated with a concept c:

1

Probe the field with nonsense word to get an error page.

2

Probe the field with instances of c (chosen representatively of the frequency distribution of c).

3

Compare pages obtained by probing with the error page (by using clustering along the DOM tree structure of the pages), to

distinguish error pages and result pages.

4

Confirm the annotation if enough result pages are obtained.

In practice, very good precision and good recall; but some limitations on the kind of forms that can be dealt with.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 14 / 32

(48)

Joint work with Avin Mittal (IIT Bombay).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Second Step: Confirm Annotations with Probing

For each field annotated with a concept c:

1

Probe the field with nonsense word to get an error page.

2

Probe the field with instances of c (chosen representatively of the frequency distribution of c).

3

Compare pages obtained by probing with the error page (by using clustering along the DOM tree structure of the pages), to

distinguish error pages and result pages.

4

Confirm the annotation if enough result pages are obtained.

In practice, very good precision and good recall; but some limitations on the kind of forms that can be dealt with.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 14 / 32

(49)

Joint work with Avin Mittal (IIT Bombay).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Second Step: Confirm Annotations with Probing

For each field annotated with a concept c:

1

Probe the field with nonsense word to get an error page.

2

Probe the field with instances of c (chosen representatively of the frequency distribution of c).

3

Compare pages obtained by probing with the error page (by using clustering along the DOM tree structure of the pages), to

distinguish error pages and result pages.

4

Confirm the annotation if enough result pages are obtained.

In practice, very good precision and good recall; but some limitations on the kind of forms that can be dealt with.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 14 / 32

(50)

Joint work with Avin Mittal (IIT Bombay).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Second Step: Confirm Annotations with Probing

For each field annotated with a concept c:

1

Probe the field with nonsense word to get an error page.

2

Probe the field with instances of c (chosen representatively of the frequency distribution of c).

3

Compare pages obtained by probing with the error page (by using clustering along the DOM tree structure of the pages), to

distinguish error pages and result pages.

4

Confirm the annotation if enough result pages are obtained.

In practice, very good precision and good recall; but some limitations on the kind of forms that can be dealt with.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 14 / 32

(51)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Outline

1 Introduction

2 A Probabilistic XML Data Model

3 Probing the Hidden Web

4 Wrapper Induction from Result Pages

5 Deriving Schema Mappings from Database Instances

6 Semantic Model

7 Conclusion

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 15 / 32

(52)

Joint work with researchers frommostrare(INRIA Futurs).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Query-answer Web Pages

Extract data from query-answer Web pages.

Issues

What part of the Web page contains the answer?

How to extract structured content?

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 16 / 32

(53)

Joint work with researchers frommostrare(INRIA Futurs).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Query-answer Web Pages

Extract data from query-answer Web pages.

Issues

What part of the Web page contains the answer?

How to extract structured content?

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 16 / 32

(54)

Joint work with researchers frommostrare(INRIA Futurs).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Automatic Wrapper Induction with Domain Knowledge

Annotate pages with knowledge domain (finite automata techniques): Both imperfect and incomplete.

Use machine learning to generalize the result into a structural extraction wrapper (Conditional Random Fields).

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 17 / 32

(55)

Joint work with researchers frommostrare(INRIA Futurs).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Automatic Wrapper Induction with Domain Knowledge

Annotate pages with knowledge domain (finite automata techniques): Both imperfect and incomplete.

Use machine learning to generalize the result into a structural extraction wrapper (Conditional Random Fields).

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 17 / 32

(56)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Outline

1 Introduction

2 A Probabilistic XML Data Model

3 Probing the Hidden Web

4 Wrapper Induction from Result Pages

5 Deriving Schema Mappings from Database Instances

6 Semantic Model

7 Conclusion

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 18 / 32

(57)

Joint work with Georg Gottlob (Oxford University).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Motivation

Analyzing the relations between different sources, or between a source and the domain knowledge.

Problem

Given two database instances I and J with different schemata, what is the optimal description of J with respect to I (with a finite set of formulæ in some logical language)?

What does optimal implies:

Conciseness of description.

Validity of facts predicted by I and . Facts of J explained by I and .

(Note the asymmetry between I and J ; context of data exchange where J is computed from I and ).

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 19 / 32

(58)

Joint work with Georg Gottlob (Oxford University).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Motivation

Analyzing the relations between different sources, or between a source and the domain knowledge.

Problem

Given two database instances I and J with different schemata, what is the optimal description of J with respect to I (with a finite set of formulæ in some logical language)?

What does optimal implies:

Conciseness of description.

Validity of facts predicted by I and . Facts of J explained by I and .

(Note the asymmetry between I and J ; context of data exchange where J is computed from I and ).

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 19 / 32

(59)

Joint work with Georg Gottlob (Oxford University).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Motivation

Analyzing the relations between different sources, or between a source and the domain knowledge.

Problem

Given two database instances I and J with different schemata, what is the optimal description of J with respect to I (with a finite set of formulæ in some logical language)?

What does optimal implies:

Conciseness of description.

Validity of facts predicted by I and . Facts of J explained by I and .

(Note the asymmetry between I and J ; context of data exchange where J is computed from I and ).

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 19 / 32

(60)

Joint work with Georg Gottlob (Oxford University).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Example (Tuple-Generating Dependencies)

R R 0

a b c d

a a b b c a d d g h 0 = ?

1 = f8 x R ( x ) ! R 0 ( x ; x )g 2 = f8x R(x ) ! 9y R 0 (x ; y)g 3 = f8 x 8 y R ( x ) ^ R ( y ) ! R 0 ( x ; y )g 4 = f9 x 9 y R 0 ( x ; y )g

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 20 / 32

(61)

Joint work with Georg Gottlob (Oxford University).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Results

Description based on the minimum length of a repair of a formula that is valid and explains all facts of J .

This optimality notion gives “intuitive” results for instances derived from each other with simple operations.

Detailed complexity analysis for various languages and decision problems. Quite high in the polynomial hierarchy (up to P 4 for general tgds!).

Even for 8 x 1 8 x 2 8 x 3 R ( x 1 ; x 2 ; x 3 ) ! R 0 ( x 1 ) , computing the size of the minimal perfect repair is already NP-complete.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 21 / 32

(62)

Joint work with Georg Gottlob (Oxford University).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Results

Description based on the minimum length of a repair of a formula that is valid and explains all facts of J .

This optimality notion gives “intuitive” results for instances derived from each other with simple operations.

Detailed complexity analysis for various languages and decision problems. Quite high in the polynomial hierarchy (up to P 4 for general tgds!).

Even for 8 x 1 8 x 2 8 x 3 R ( x 1 ; x 2 ; x 3 ) ! R 0 ( x 1 ) , computing the size of the minimal perfect repair is already NP-complete.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 21 / 32

(63)

Joint work with Georg Gottlob (Oxford University).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Results

Description based on the minimum length of a repair of a formula that is valid and explains all facts of J .

This optimality notion gives “intuitive” results for instances derived from each other with simple operations.

Detailed complexity analysis for various languages and decision problems. Quite high in the polynomial hierarchy (up to P 4 for general tgds!).

Even for 8 x 1 8 x 2 8 x 3 R ( x 1 ; x 2 ; x 3 ) ! R 0 ( x 1 ) , computing the size of the minimal perfect repair is already NP-complete.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 21 / 32

(64)

Joint work with Georg Gottlob (Oxford University).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Results

Description based on the minimum length of a repair of a formula that is valid and explains all facts of J .

This optimality notion gives “intuitive” results for instances derived from each other with simple operations.

Detailed complexity analysis for various languages and decision problems. Quite high in the polynomial hierarchy (up to P 4 for general tgds!).

Even for 8 x 1 8 x 2 8 x 3 R ( x 1 ; x 2 ; x 3 ) ! R 0 ( x 1 ) , computing the size of the minimal perfect repair is already NP-complete.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 21 / 32

(65)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Outline

1 Introduction

2 A Probabilistic XML Data Model

3 Probing the Hidden Web

4 Wrapper Induction from Result Pages

5 Deriving Schema Mappings from Database Instances

6 Semantic Model

7 Conclusion

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 22 / 32

(66)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Conceptual Model

IsA ontology of concepts (simple DAG) Thing

Person

Man Woman

Publication

Proceedings Article Book n -ary typed roles

AuthorOf(Publication,Person) HasName(Person,Name)

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 23 / 32

(67)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Conceptual Model

IsA ontology of concepts (simple DAG) Thing

Person

Man Woman

Publication

Proceedings Article Book n -ary typed roles

AuthorOf(Publication,Person) HasName(Person,Name)

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 23 / 32

(68)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Semantic Representation of a Service

What is a service described by?

A n-uple of typed input parameters.

A complex (= nested) type of its output.

Semantic relations between inputs and outputs (Datalog-like description).

Definition (Complex types) S : set of concepts

T S j< T ; : : : ; T >j T

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 24 / 32

(69)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Semantic Representation of a Service

What is a service described by?

A n-uple of typed input parameters.

A complex (= nested) type of its output.

Semantic relations between inputs and outputs (Datalog-like description).

Definition (Complex types) S : set of concepts

T S j< T ; : : : ; T >j T

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 24 / 32

(70)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Semantic Representation of a Service

What is a service described by?

A n-uple of typed input parameters.

A complex (= nested) type of its output.

Semantic relations between inputs and outputs (Datalog-like description).

Definition (Complex types) S : set of concepts

T S j< T ; : : : ; T >j T

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 24 / 32

(71)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Semantic Representation of a Service

What is a service described by?

A n-uple of typed input parameters.

A complex (= nested) type of its output.

Semantic relations between inputs and outputs (Datalog-like description).

Definition (Complex types) S : set of concepts

T S j< T ; : : : ; T >j T

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 24 / 32

(72)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Services and Queries

Example

Service giving authors from publication titles

A* AuthorOf(A,P),HasTitle(P,T),Input(T)

Example Query:

<A,T*>* AuthorOf(A,P), Article(P),

HasTitle(P,T), KeywordOf(“xml”,P)

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 25 / 32

(73)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Services and Queries

Example

Service giving authors from publication titles

A* AuthorOf(A,P),HasTitle(P,T),Input(T)

Example Query:

<A,T*>* AuthorOf(A,P), Article(P),

HasTitle(P,T), KeywordOf(“xml”,P)

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 25 / 32

(74)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Managing Extensional Information

How to represent extensional information (i.e. documents) in this formalism?

Definition

A document is a service with no input.

Complex types: natural representation of a DTD.

(Disjunctions a|b simulated by (a?,b?)).

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 26 / 32

(75)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Web Service Indexing and Querying

Given a query, represented as an analyzed Web service, how to know which known Web services to query?

Issues

Subsumption of input/output parameters.

Missing input parameters.

Composition of Web Services.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 27 / 32

(76)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Web Service Indexing and Querying

Given a query, represented as an analyzed Web service, how to know which known Web services to query?

Issues

Subsumption of input/output parameters.

Missing input parameters.

Composition of Web Services.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 27 / 32

(77)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Web Service Indexing and Querying

Given a query, represented as an analyzed Web service, how to know which known Web services to query?

Issues

Subsumption of input/output parameters.

Missing input parameters.

Composition of Web Services.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 27 / 32

(78)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Web Service Indexing and Querying

Given a query, represented as an analyzed Web service, how to know which known Web services to query?

Issues

Subsumption of input/output parameters.

Missing input parameters.

Composition of Web Services.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 27 / 32

(79)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Differences with Classical Database Querying

Three main differences:

Information can be queried only through views (Local As View).

Nested types.

Incomplete information.

Three sources of complexity!

Current direction of work: Using Magic sets techniques (for evaluation of Datalog programs) restricted to appropriate binding patterns.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 28 / 32

(80)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Differences with Classical Database Querying

Three main differences:

Information can be queried only through views (Local As View).

Nested types.

Incomplete information.

Three sources of complexity!

Current direction of work: Using Magic sets techniques (for evaluation of Datalog programs) restricted to appropriate binding patterns.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 28 / 32

(81)

Joint work with Serge Abiteboul.

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Differences with Classical Database Querying

Three main differences:

Information can be queried only through views (Local As View).

Nested types.

Incomplete information.

Three sources of complexity!

Current direction of work: Using Magic sets techniques (for evaluation of Datalog programs) restricted to appropriate binding patterns.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 28 / 32

(82)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Outline

1 Introduction

2 A Probabilistic XML Data Model

3 Probing the Hidden Web

4 Wrapper Induction from Result Pages

5 Deriving Schema Mappings from Database Instances

6 Semantic Model

7 Conclusion

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 29 / 32

(83)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Web Service Semantic Interpretation Process

WWW HTML form

Analyzed form + result pages Web service

Analyzed

Web service Service index

User discovery

discovery probing

wrapper induction semantic analysis

indexing

query results

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 30 / 32

(84)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Perspectives

Still a lot to do... In particular:

Answering queries using views on the semantic model.

Continue work on automatic wrapper induction, to get a form fully

wrapped as a Web service.

Relation between schema mapping induction and inductive logic programming.

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 31 / 32

(85)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Other Works

Data Warehousing Extraction of information from the Web, mailing lists. . . to build a warehouse of sociological data (with various people).

Graph, Text and Web Mining

Similarity between nodes in graphs; application to synonym extraction (with Vincent Blondel, from UCL).

Related nodes in a graph; application to Wikipedia (with Yann Ollivier, from ÉNS Lyon).

PageRank prediction (with Michalis Vazirgiannis).

Machine Translation Close relations with SYSTRAN; XML document processing, statistical and rule-based machine translation, multilingual authoring. . .

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 32 / 32

(86)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Other Works

Data Warehousing Extraction of information from the Web, mailing lists. . . to build a warehouse of sociological data (with various people).

Graph, Text and Web Mining

Similarity between nodes in graphs; application to synonym extraction (with Vincent Blondel, from UCL).

Related nodes in a graph; application to Wikipedia (with Yann Ollivier, from ÉNS Lyon).

PageRank prediction (with Michalis Vazirgiannis).

Machine Translation Close relations with SYSTRAN; XML document processing, statistical and rule-based machine translation, multilingual authoring. . .

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 32 / 32

(87)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Other Works

Data Warehousing Extraction of information from the Web, mailing lists. . . to build a warehouse of sociological data (with various people).

Graph, Text and Web Mining

Similarity between nodes in graphs; application to synonym extraction (with Vincent Blondel, from UCL).

Related nodes in a graph; application to Wikipedia (with Yann Ollivier, from ÉNS Lyon).

PageRank prediction (with Michalis Vazirgiannis).

Machine Translation Close relations with SYSTRAN; XML document processing, statistical and rule-based machine translation, multilingual authoring. . .

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 32 / 32

(88)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Other Works

Data Warehousing Extraction of information from the Web, mailing lists. . . to build a warehouse of sociological data (with various people).

Graph, Text and Web Mining

Similarity between nodes in graphs; application to synonym extraction (with Vincent Blondel, from UCL).

Related nodes in a graph; application to Wikipedia (with Yann Ollivier, from ÉNS Lyon).

PageRank prediction (with Michalis Vazirgiannis).

Machine Translation Close relations with SYSTRAN; XML document processing, statistical and rule-based machine translation, multilingual authoring. . .

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 32 / 32

(89)

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Other Works

Data Warehousing Extraction of information from the Web, mailing lists. . . to build a warehouse of sociological data (with various people).

Graph, Text and Web Mining

Similarity between nodes in graphs; application to synonym extraction (with Vincent Blondel, from UCL).

Related nodes in a graph; application to Wikipedia (with Yann Ollivier, from ÉNS Lyon).

PageRank prediction (with Michalis Vazirgiannis).

Machine Translation Close relations with SYSTRAN; XML document processing, statistical and rule-based machine translation, multilingual authoring. . .

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 32 / 32

Références

Documents relatifs

Les différents modules que nous décrivons dans cette thèse utilisent uniquement ces formes de connaissances (ontologie du domaine, avec un graphe acyclique de concepts et des

Non first normal form relations: An algebra allowing data restructuring all 5 versions » S Abiteboul, N Bidoit Journal of Computer and System Sciences, 1986 portal.acm.org

Other way to see this problem: Match operator on schema mappings, in the setting of data

Wrapping Web Service Descriptions Web Service Semantic Analysis Web Service Indexing and Querying.

Given a query, represented as an Analyzed Web Service, how to know which known web services to

Physicien théorique, professeur à la prestigieuse Université Friedrich-Wilhelms de Berlin, musicien doué, Max Planck est surtout connu pour avoir révolutionné la perception des lois

Le moyen d'y parvenir était à ses yeux de déterminer ce qui avait valeur d'absolu et d'invariant dans la physique, et d'unifier sur cette base des domaines qui pouvaient

D EFINITION 1 (K IND ). The data values in the copy of K have to agree in all T δ ’s, so they have to be determined in advance. By filling in the data values we obtain a data kind.