
2 April 2014, IC2 Lunch Seminar

UnSAID: Uncertainty and Structure in the Access to Intensional Data

Pierre Senellart


2 / 24 IC2 Pierre Senellart

Uncertain data is everywhere

Numerous sources of uncertain data:

Measurement errors

Data integration from contradicting sources

Imprecise mappings between heterogeneous schemas

Imprecise automatic processes (information extraction, natural language processing, etc.)

Imperfect human judgment

Lies, opinions, rumors


Structured data is everywhere

Data is structured, not flat:

Variety of representation formats of data in the wild:

relational tables

trees, semi-structured documents

graphs, e.g., social networks or semantic graphs

data streams

complex views aggregating individual information

Heterogeneous schemas

Additional structural constraints: keys, inclusion dependencies


Intensional data is everywhere

Lots of data sources can be seen as intensional: accessing all the data in the source (in extension) is impossible or very costly, but it is possible to access the data through views, with some access constraints, associated with some access cost.

Indexes over regular data sources

Deep Web sources: Web forms, Web services

The Web or social networks as partial graphs that can be expanded by crawling

Outcome of complex automated processes: information extraction, natural language analysis, machine learning, ontology matching

Crowd data: (very) partial views of the world

Logical consequences of facts, costly to compute


Introducing UnSAID

Uncertainty and Structure in the Access to Intensional Data

Jointly deal with Uncertainty, Structure, and the fact that access to data is limited and has a cost, to solve a user’s knowledge need

Lazy evaluation whenever possible

Evolving probabilistic, structured view of the current knowledge of the world

Solve at each step the problem: What is the next best access to do given my current knowledge of the world and the knowledge need?

Knowledge acquisition plan (recursive, dynamic, adaptive) that minimizes access cost, and provides probabilistic guarantees
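The "next best access" loop can be illustrated with a toy greedy sketch. Everything here is an assumption made for illustration: the `Access` record, the `uncertainty` measure, and the gain-per-cost rule are hypothetical and are not taken from the talk, which leaves the actual optimization open.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Knowledge = Dict[str, float]  # fact -> current probability that it holds


@dataclass
class Access:
    """Hypothetical description of one possible access to a source."""
    name: str
    cost: float  # assumed strictly positive
    # Estimated reduction of uncertainty on the need (an assumed oracle).
    expected_gain: Callable[[Knowledge], float]
    # Performs the access and returns the updated knowledge.
    perform: Callable[[Knowledge], Knowledge]


def uncertainty(knowledge: Knowledge, need: List[str]) -> float:
    """Toy measure: distance of each needed fact from certainty (0 or 1)."""
    return sum(min(knowledge[f], 1 - knowledge[f]) for f in need)


def acquire(knowledge: Knowledge, need: List[str], accesses: List[Access],
            budget: float, threshold: float = 0.05) -> Tuple[Knowledge, List[str]]:
    """Greedily pick the access with the best expected gain per unit cost."""
    plan = []
    while uncertainty(knowledge, need) > threshold:
        useful = [a for a in accesses
                  if a.cost <= budget and a.expected_gain(knowledge) > 0]
        if not useful:
            break  # nothing affordable can still help
        best = max(useful, key=lambda a: a.expected_gain(knowledge) / a.cost)
        knowledge = best.perform(knowledge)  # knowledge update
        budget -= best.cost
        plan.append(best.name)
    return knowledge, plan
```

The plan is recomputed after every access, so it adapts to what was actually learned; a non-greedy planner would look further ahead at the price of a larger search space.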

[Figure: the UnSAID loop, built up step by step. Boxes: Knowledge need; Current knowledge of the world; Knowledge acquisition plan; Uncertain access result; Structured source profiles. Arrows: formulation; answer & explanation; optimization; modeling; priors; intensional access; knowledge update.]


What this talk is about

General overview of my current (and recent) research, through one-slide presentation of individual works

Hopefully, emerging consistent themes

Connections with the UnSAID problem


Plan

Introduction

Instances of UnSAID

Uncertainty and Structure

UnSAID Applications

Conclusion


Adaptive focused crawling

(Gouriten, Maniu, and Senellart 2014)

[Figure: example graph annotated with numeric scores]

Problem: Efficiently crawl nodes in a graph such that total score is high

Challenge: The score of a node is unknown until it is crawled

Methodology: Use various predictors of node scores, and adaptively select the best one so far with multi-armed bandits
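As an illustration of adaptively selecting among predictors, here is a minimal epsilon-greedy bandit in which each arm stands for one node-score predictor. Epsilon-greedy is only one common bandit strategy, chosen here for brevity; the class name `PredictorBandit` and its parameters are hypothetical and not the strategy of the cited work.

```python
import random


class PredictorBandit:
    """Epsilon-greedy bandit: each arm is one node-score predictor."""

    def __init__(self, n_arms, epsilon=0.1, rng=None):
        self.counts = [0] * n_arms    # how often each predictor was used
        self.means = [0.0] * n_arms   # average observed reward per predictor
        self.epsilon = epsilon
        self.rng = rng if rng is not None else random.Random()

    def select(self):
        """Return the index of the predictor to use for the next crawl step."""
        untried = [a for a, c in enumerate(self.counts) if c == 0]
        if untried:
            return untried[0]                      # try every predictor once
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        return max(range(len(self.counts)), key=lambda a: self.means[a])  # exploit

    def update(self, arm, reward):
        """Record the reward, e.g. the true score of the node just crawled."""
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

In a crawler, `select()` would pick the predictor that ranks the frontier, and `update()` would be called once the chosen node has been crawled and its real score is known.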


Adaptive Web application crawling

[Figure: example Web site graph with an entry point, pages p1 to p5 annotated with numeric values, and links l1 to l4]

Problem: Optimize the amount of distinct content retrieved from a Web site w.r.t. the number of HTTP requests

Challenge: No way to know a priori where the content lies on the Web site

Methodology: Sample a small part of the Web site and discover optimal crawling patterns from it


Optimizing crowd queries under order

Problem: Given a query, what is the next best question to ask the crowd when crowd answers are constrained by a partial order?

Challenge: Order constraints make questions not independent of each other

Methodology: Construct a polytope of admissible regions and uniformly sample from it to determine the impact of a data item
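A small sketch of uniform sampling under order constraints, assuming unknown values in [0, 1] and constraints given as (lower, higher) pairs. Rejection sampling is used here only because it is the simplest method that is provably uniform on the polytope; it becomes very slow as the number of constraints grows, and the function names are hypothetical.

```python
import random


def sample_order_polytope(items, order, n_samples, rng=None):
    """Rejection-sample points of [0, 1]^n satisfying every (low, high) pair.

    Points are drawn uniformly from the unit cube and kept only if they
    respect the order, so accepted points are uniform on the polytope.
    """
    rng = rng if rng is not None else random.Random()
    accepted = []
    while len(accepted) < n_samples:
        point = {i: rng.random() for i in items}
        if all(point[lo] <= point[hi] for lo, hi in order):
            accepted.append(point)
    return accepted


def expected_values(samples):
    """Average value of each item over the accepted samples."""
    return {i: sum(s[i] for s in samples) / len(samples) for i in samples[0]}
```

For a chain a ≤ b ≤ c the averages approach 1/4, 1/2 and 3/4, which shows how order constraints alone already inform the estimate for each item before any question is asked.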


Online influence maximization

[Figure: solution framework. Panels: Action Log (User, Action, Date); Social Graph; Weighted Uncertain Graph; Real World / Simulator. Steps of one trial: Sampling, Activation Sequences, Update.]

Problem: Run influence campaigns in social networks, maximizing the number of influenced nodes

Challenge: Influence probabilities are unknown

Methodology: Build a model of influence probabilities and focus on influential nodes, with an exploration/exploitation trade-off
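One possible way to model an unknown influence probability is a Beta belief per edge, updated after each observed activation attempt. This is an assumed modeling choice for illustration, not necessarily the model of the talk; `EdgeBelief` and its methods are hypothetical names.

```python
import random


class EdgeBelief:
    """Beta(alpha, beta) belief over one edge's unknown influence probability."""

    def __init__(self, alpha=1.0, beta=1.0):
        # Beta(1, 1) is the uniform prior: nothing is known about the edge yet.
        self.alpha = alpha
        self.beta = beta

    def update(self, activated):
        """Record one observed attempt: did the influence go through?"""
        if activated:
            self.alpha += 1
        else:
            self.beta += 1

    def mean(self):
        """Point estimate, suitable for pure exploitation."""
        return self.alpha / (self.alpha + self.beta)

    def sample(self, rng):
        """Random draw from the belief, which favors exploring uncertain edges."""
        return rng.betavariate(self.alpha, self.beta)
```

Using `sample()` instead of `mean()` when choosing seeds is the Thompson sampling heuristic: edges with few observations have wide beliefs and are therefore tried more often.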


Query answering under uncertain rules

Pope(X) ⇒ BuriedIn(X, Rome) (98%)

LocatedIn(X, Lombardy) ⇒ BelongsTo(X, AustrianEmpire) (45%)

Pope(PiusXI)

BornIn(PiusXI, Desio)

LocatedIn(Desio, Lombardy)

∃X, BornIn(X, Y) ∧ BelongsTo(Y, AustrianEmpire) ∧ BuriedIn(X, Rome)?

Problem: Determine efficiently the probability of a query being true, given some data and uncertain rules over this data

Challenge: Produced facts may be correlated, the same facts can be generated in different ways, probability computation is hard in general...

Methodology: Find restrictions on the rules (guarded?) and the data (bounded tree-width?) that make the problem tractable
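The slide's example can be worked through by brute force, under an assumed semantics that the talk does not spell out: each ground rule application fires independently with the stated probability, and the query asks for some X and Y. The enumeration below is exponential in the number of rule applications, which is exactly the cost the tractable restrictions aim to avoid.

```python
from itertools import product

# Facts from the slide, as ground atoms (predicate, arguments...).
FACTS = {
    ("Pope", "PiusXI"),
    ("BornIn", "PiusXI", "Desio"),
    ("LocatedIn", "Desio", "Lombardy"),
}

# Ground rule applications: (body atom, head atom, probability).
# Assumption: each application fires independently of the others.
GROUND_RULES = [
    (("Pope", "PiusXI"), ("BuriedIn", "PiusXI", "Rome"), 0.98),
    (("LocatedIn", "Desio", "Lombardy"),
     ("BelongsTo", "Desio", "AustrianEmpire"), 0.45),
]


def query_holds(atoms):
    """Some X, Y with BornIn(X, Y), BelongsTo(Y, AustrianEmpire), BuriedIn(X, Rome)?"""
    return any(
        a[0] == "BornIn"
        and ("BelongsTo", a[2], "AustrianEmpire") in atoms
        and ("BuriedIn", a[1], "Rome") in atoms
        for a in atoms
    )


def query_probability(facts, ground_rules):
    """Sum the weights of all possible worlds in which the query holds."""
    total = 0.0
    for choice in product([True, False], repeat=len(ground_rules)):
        weight = 1.0
        atoms = set(facts)
        # One pass suffices here because no rule head feeds another rule body.
        for fires, (body, head, p) in zip(choice, ground_rules):
            weight *= p if fires else 1.0 - p
            if fires and body in atoms:
                atoms.add(head)
        if query_holds(atoms):
            total += weight
    return total
```

Under these assumptions the query holds only in the world where both rule applications fire, so its probability is 0.98 × 0.45 = 0.441.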


Efficient querying of uncertain graphs

(Maniu, Cheng, and Senellart 2014)

[Figure: example probabilistic graph over nodes 0 to 6 and its decomposition into parts (α) to (ζ), with edges labeled by value: probability pairs such as 1: 0.75]

Problem: Optimize query evaluation on probabilistic graphs

Challenge: Probabilistic query evaluation is hard, and standard indexing techniques for large graphs do not work

Methodology: Build a tree decomposition that preserves probabilities and run the query on this tree decomposition


Uniform sampling of XML documents

(Tang, Amarilli, Senellart, and Bressan 2014)

[Figure: three example trees over nodes n0 to n6]

Problem: Sample a subtree of fixed size/characteristics uniformly at random from a tree, e.g., for data pricing reasons

Challenge: Naive top-down sampling does not work, will result in biased sampling depending on the tree structure

Methodology: Bottom-up annotation of the tree recording distribution information, followed by top-down sampling
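The two-pass idea can be sketched for one specific reading of the problem, which is an assumption: sample uniformly a connected subtree that contains the root and has exactly k nodes. The tree is a dict from node to list of children; `annotate`, `sample_subtree` and the `prefix` tables are hypothetical names, not the cited paper's algorithm.

```python
import random


def annotate(tree, node, k, counts, prefix):
    """Bottom-up pass: counts[v][j] = number of subtrees rooted at v with j nodes.

    prefix[v][i][r] = number of ways to take r nodes in total from the
    subtrees of the first i children of v.
    """
    ways = [[1] + [0] * k]
    for child in tree.get(node, []):
        annotate(tree, child, k, counts, prefix)
        c, prev = counts[child], ways[-1]
        ways.append([sum(prev[r - s] * c[s] for s in range(r + 1))
                     for r in range(k + 1)])
    prefix[node] = ways
    # Index 0 encodes "v is left out"; j >= 1 means v plus j - 1 nodes below it.
    counts[node] = [1] + [ways[-1][j - 1] for j in range(1, k + 1)]


def _sample(tree, node, j, counts, prefix, rng, out):
    """Top-down pass: split the j - 1 remaining nodes among the children."""
    out.append(node)
    remaining = j - 1
    children = tree.get(node, [])
    for i in range(len(children), 0, -1):
        child = children[i - 1]
        # Weight of giving s nodes to this child = number of completions.
        weights = [prefix[node][i - 1][remaining - s] * counts[child][s]
                   for s in range(remaining + 1)]
        s = rng.choices(range(remaining + 1), weights=weights)[0]
        if s > 0:
            _sample(tree, child, s, counts, prefix, rng, out)
        remaining -= s


def sample_subtree(tree, root, k, rng=None):
    """Return a uniformly random k-node subtree containing the root."""
    rng = rng if rng is not None else random.Random()
    counts, prefix = {}, {}
    annotate(tree, root, k, counts, prefix)
    if counts[root][k] == 0:
        raise ValueError("no subtree of that size")
    out = []
    _sample(tree, root, k, counts, prefix, rng, out)
    return frozenset(out)
```

Each choice is made with probability proportional to the number of subtrees it can still lead to, which is what removes the bias of naive top-down sampling.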


Truth discovery on heterogeneous data

Problem: Determine true values from integrated semi-structured Web sources

Challenge: Semi-structured data on the Web is contradictory and copied from source to source

Methodology: Estimate the truth and determine copy patterns between sources, not only of base facts but of subtrees of the data
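To make the truth-estimation step concrete, here is a generic iterative voting sketch in which source trust and estimated truth reinforce each other. It deliberately ignores copy detection and subtree structure, which are the points the talk focuses on; the function name, the initial trust of 0.8 and the claim format are assumptions.

```python
from collections import defaultdict


def truth_discovery(claims, rounds=20):
    """claims: list of (source, item, value) triples.

    Returns (estimated truth per item, trust per source).
    """
    sources = {s for s, _, _ in claims}
    trust = {s: 0.8 for s in sources}  # assumed uniform initial trust
    truth = {}
    for _ in range(rounds):
        # Step 1: each value gets the summed trust of the sources claiming it.
        votes = defaultdict(lambda: defaultdict(float))
        for s, item, value in claims:
            votes[item][value] += trust[s]
        truth = {item: max(vals, key=vals.get) for item, vals in votes.items()}
        # Step 2: a source's trust is the fraction of its claims deemed true.
        for s in sources:
            mine = [(i, v) for src, i, v in claims if src == s]
            trust[s] = sum(truth[i] == v for i, v in mine) / len(mine)
    return truth, trust
```

If sources copy from one another, their votes are no longer independent evidence, so a plain scheme like this one overcounts copied values; that is the motivation for modeling copy patterns.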


Probabilistic XML conditioning

[Figure: example probabilistic XML trees over nodes 0 to 6, annotated with events e0 to e6 or with formulas over a0 to a5; label(0)=R, label(1)=A, label(2)=B, label(3)=C, label(4)=D]

Problem: Incorporate a logical constraint into a probabilistic XML database

Challenge: Constraining is not a standard probabilistic database operation, NP-hard in general

Methodology: Identify tractable subcases
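What conditioning means can be shown on a tiny example with independent Boolean events, by enumerating all possible worlds and renormalizing. This is the brute-force definition, exponential in the number of events, and not one of the tractable subcases; the function `condition` and its input format are hypothetical.

```python
from itertools import product


def condition(event_probs, constraint):
    """Condition independent Boolean events on a constraint.

    event_probs: dict mapping event name to probability of being true.
    constraint: function taking a world (dict name -> bool), returning bool.
    Returns (sorted event names, {tuple of truth values: P(world | constraint)}).
    """
    names = sorted(event_probs)
    weighted = {}
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if constraint(world):
            w = 1.0
            for n in names:
                w *= event_probs[n] if world[n] else 1.0 - event_probs[n]
            weighted[values] = w
    z = sum(weighted.values())  # probability of the constraint itself
    if z == 0:
        raise ValueError("constraint has probability zero")
    return names, {vals: w / z for vals, w in weighted.items()}
```

With two events of probability 0.5 and the constraint "at least one holds", each of the three remaining worlds gets probability 1/3. The events are then no longer independent, which is why the result cannot simply be stored back as new independent event probabilities.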


Provenance and Order

(Amarilli, Ba, Deutch, and Senellart 2013)

Problem: Clean semantics for nondeterministic order-aware queries

Challenge: Existing data manipulation languages treat order in an ad hoc manner, with no compositional semantics

Methodology: Order-aware type system for strict static checking, and individual data-level provenance annotations for dynamic analysis of allowed ordered operations


Social Sensing of Moving Objects

(Ba, Montenez, Abdessalem, and Senellart 2014)

Problem: Infer trajectories and meta-information of moving objects from Web and social Web data

Challenge: Uncertainty and inconsistency in extracted information

Methodology: Data cleaning by filtering incorrect locations, and truth discovery to identify reliable sources


Smarter urban mobility

Problem: Smart and adaptive recommendations for mobility in cities (transit, bike rental, car, etc.)

Challenge: Should take into account personal information (calendar, etc.), past trajectories, public information about transit and traffic

Methodology: Map GPS tracks to routes of public transport, learn route patterns, infer destination while in transit and provide push suggestions


What’s next?

So far, we have tackled individual aspects or specializations of the UnSAID problem

Now we need to consider the general problem, and propose general solutions

There is a strong potential for uncovering the unsaid information from the Web

Strong connections with a number of research areas: active learning, reinforcement learning, adaptive query evaluation, etc.

Inspiration to draw from these areas.

Everyone is welcome to join the effort!

Thank you.


Amarilli, A., M. L. Ba, D. Deutch, and P. Senellart (Dec. 2013). Provenance for Nondeterministic Order-Aware Queries. Preprint available at http://pierre.senellart.com/publications/amarilli2014provenance.pdf.

Ba, M. L., S. Montenez, T. Abdessalem, and P. Senellart (Apr. 2014). Extracting, Integrating, and Visualizing Uncertain Web Information about Moving Objects. Preprint available at http://pierre.senellart.com/publications/ba2014extracting.pdf.

Gouriten, G., S. Maniu, and P. Senellart (Mar. 2014). Scalable, Generic, and Adaptive Systems for Focused Crawling. Preprint available at http://pierre.senellart.com/publications/gouriten2014scalable.pdf.

Maniu, S., R. Cheng, and P. Senellart (Mar. 2014). ProbTree: A Query-Efficient Representation of Probabilistic Graphs. Preprint available at http://pierre.senellart.com/publications/maniu2014probtree.pdf.

Tang, R., A. Amarilli, P. Senellart, and S. Bressan (Mar. 2014). Get a Sample for a Discount: Sampling-Based XML Data Pricing. Preprint available at http://pierre.senellart.com/publications/tang2014get.pdf.
