• Aucun résultat trouvé

Context knowledge representation and reasoning in the Context Interchange system

N/A
N/A
Protected

Academic year: 2021

Partager "Context knowledge representation and reasoning in the Context Interchange system"

Copied!
42
0
0

Texte intégral

(1)

MIT LIBRARIES DUPL

111

....HillIII

(2)
(3)
(4)
(5)

D28

M414

..MIS*-

WEY

To

appear

in

the

Journal

at

Applied

Intelligence

,

September

2000

Context

Knowledge

Representation

and

Reasoning

in

the

Context

Interchange

System

Stephane

Bressan',

Cheng

Goh

2

, Natalia Levina, Stuart

Madnick,

Ahmed

Shah,

Michael

Siegel

Sloan

WP#

4133

CISLWP

#00-04

August,

2000

MIT

Sloan School

of

Management

50

Memorial

Drive

(6)

MASSACHUSETTS

INSTITUTE

E

OF

TECHNOL OGY

OCT

2

5

2000

(7)

Context

Knowledge

Representation

and Reasoning

in

the

Context Interchange

System

Stephane

Bressan1

,

Cheng

Goh

2

,Natalia Levina, Stuart

Madnick,

Ahmed

Shah, MichaelSiegel

Massachusetts

Institute of

Technology,

Cambridge,

MA

02139

Abstract

The

Context Interchange Project presents a unique approach to the problem ofsemantic conflict resolution

among

multiple heterogeneous data sources.

The

system presents a semantically

meaningful view ofthe data to the receivers (e.g. user applications) for all the available data

sources.

The

semantic conflicts are automatically detected and reconciled by a Context Mediator

using the context

knowledge

associated with both the data sources and the data receivers.

The

results are collated and presented in the receiver context.

The

cunent implementation of the system provides access to flat files, classical relational databases, on-line databases, and

web

services.

An

example application, using actual financial information sources, is described along withadetaileddescriptionoftheoperationofthesystem foranexamplequery.

Keywords: context, integration, mediation, semantic heterogeneity,

web

wrapping

1.

Introduction

Inrecentyearsthe

amount

ofinformation availablehas

grown

exponentially. Whilethe availabilityofso

much

information has helped people

become

self-sufficient andgetaccesstoallthe informationhandily,thishas created

anotherdilemma. Allthesedatasourcesandthetechnologiesthatare

employed

by thedata source providersdonot

providesufficient logicalconnectivity(theability toexchangedatameaningfully). Logical connectivity iscrucial

becauseusersofthese sources expecteachsystemtounderstand requestsstatedintheir

own

terms,usingtheir

own

conceptsof

how

theworld is definedandstructured.

As

aresult,anydata integration effortmustbecapableof

reconcilingsemanticconflicts

among

sourcesandreceivers. Thisproblemisgenerallyreferredtoastheneedfor

semanticinteroperability

among

distributeddatasources.

The

ContextInterchange Projectat

MIT

[1,2] isstudyingthesemantic integration ofdisparateinformation sources.

Like other information integrationprojects (the

SIMS

projectatISI [3],the

TSIMMIS

projectatStanford[4], the

DISCO

projectatBull-INRIA [5],theInformation Manifoldprojectat

AT&T

[6],theGarlicprojectat

IBM

[7],the

InfomasterprojectatStanford [8]),

we

have adoptedaMediationarchitecture asoutlinedin Wiederhold's seminal

paper [9].

Insection2,

we

presenta motivational scenarioofausertryingtoaccess informationfromvarious actualdata

sourcesandtheproblems faced.Section3 describesthe currentimplementation ofthe Contextmediation system.

Section 4 presents adetaileddiscussion ofthevarioussubsystems, highlightingthecontext

knowledge

representation andreasoning, usingthescenario outlinedin section2.Section5 concludes ourdiscussion.

2.

Why

Context

Mediation ?

-

An

Example

Scenario

Consideran

example

ofafinancialanalystdoingresearchon DaimlerBenz. She needstofind out the netincome,

netsales,andtotalassetsof Daimler

Benz

Corporationforthe yearending 1993. Inadditionto that,sheneedsto

know

theclosingstockprice of DaimlerBenz. She normallyuses the financialdatastoredinthe Worldscope3

database. Sherecalls Jill,herco-workertellingherabouttwoother databases, Datastrearn4andDisclosure5and

how

theycontained

much

ofthe infonnation thatJillneeded. Shestartsoff with Worldscopedatabase. She

knows

that

Worldscopehastotalassets forallthecompanies. Shebringsupaquerytooland issuesaquery:

Now

attheNational UniversityofSingapore.

2

We

dedicatethis

work

tothe

memory

of

Cheng

Hian

Goh

(1965-1999).

3

The

Worldscopedatabaseisanextractfrom the Worldscopefinancialdatasource

4

(8)
(9)

select

company_name

,

total_assets

from worldscope

where company_name

= "

DAIMLER-BENZ

AG"

;

She immediatelygetsbacktheresult:

DAIMLER-BENZ

AG

5659478

Satisfied,she

moves

onandfiguresoutafterlookingatthedata informationfor the

new

databasesthatshe canget thedataonnet income from Disclosureandnet salesfrom Datastream. Fornet income, sheissues thequery:

select

company_name

,

net_income from disclosure

where

company_name

= "

DAIMLER-BENZ

AG"

;

The

query does notreturnanyrecords. Puzzled, shechecks fortyposandtriesagain. She

knows

thatthe information

exists. Shetriesone

more

time, thistime entering apartial

name

for

DAIMLER

BENZ.

select

company_name

,

net_income

from disclosure

where

company_name like

"

DAIMLER%"

;

Shegetstherecord back:

DAIMLER BENZ

CORP

615000000

She

now

realizes thatthedatasourcesdonotconformtothe

same

standards, asit

becomes

obvious fromthenames.

Cautious,she pressesonandissues the thirdquery:

select

name,

total_sales

from datastream

where

name like

"

DAIMLER%"

;

Shegetstheresult:

DAIMLER-BENZ

9773092

Woikkcupc

(10)
(11)

year 1993). In addition shestill hasto

somehow

findanother data sourcetoget thelatest stock priceforDaimler

Benz.Forthat, she

knows

shewill firsthave to findout the tickerforDaimler

Benz

andthen lookupthe priceusing

one ofthe

many

stockquoteserversontheweb.

The

ContextMediation system can be usedtoautomaticallydetectandresolveallthesemanticconflictsbetweenall

thedatasources beingusedandcan presentthe results totheuser inthe formatthatsheis familiar with. Inthe above

example, ifthe analystwere usingtheContext Mediation system instead,all she

would

havetodo

would

beto

formulateandaskasinglequerywithouthavingtoworryabout theunderlyingdifferencesbetweenthe data. Both

her requestandthe result

would

be formulatedinherpreferredcontext(e.g. Worldscope).

The

multi-source query,

Queryl,

couldbestated as follows:

select

worldscope

.

total_assets,

datastream. total_sales,

disclosure

.

net_income

,

quotes

.

Last

from

worldscope,

datastream,

disclosure,

quotes

where

worldscope

.

company_name

=

"DAIMLER-BENZ

AG"

and

datastream. as_of_date

=

"01/05/94"

and

worldscope

.

company_name

=

datastream. name and

worldscope

.

company_name

=

disclosure

.

company_name and

worldscope

.

company_name

=

quotes

.

cname

;

The

system

would

then detectandreconcile the conflicts encounteredby the analyst.

3.

Overview

of

the

COIN

Project

The COntext

INterchange

(COIN)

strategy seekstoaddresstheproblem of semanticinteroperabilityby

consolidatingdistributeddatasourcesandprovidinga unifiedview.

COIN

technology presentsalldatasourcesas

SQL

relational databasesbyproviding genericwrappersforthem.

The

underlyingintegration strategy, called the

COIN

model,definesanovelapproachformediated [9]dataaccess in whichsemanticconflicts

among

heterogeneoussystemsare automatically detectedandreconciledbytheContext Mediator.

3.1

The

COIN

Framework

The

COIN

framework

is

composed

ofbothadatamodel and a logical language,

COINL

[ll],derived fromthe

familyof F-Logic [10].

The

data

model

andlanguageareused todefinethe

domain model

ofthereceiveranddata

sourceandthecontext [12]associated withthem.

The

datamodelcontainsthe definitions for the"types"of

information units(calledsemantictypes)thatconstitutea

common

vocabularyforcapturingthesemanticsofdatain

disparatesystems. Contexts, associatedwithboth information sources andreceivers, are collectionsofstatements

defining

how

datashould be interpretedand

how

potential conflicts(differences inthe interpretation)should be

resolved.Conceptssuch assemantic-objects, attributes, modifiers, andconversion functions definethesemanticsof

datainsideandacrosscontexts.Togetherwiththedeductiveandobject-orientedfeatures inheritedfrom F-Logic,the

COIN

data

model

and

COINL

constitutean appropriateandexpressiveframeworkforrepresentingsemantic

knowledge

andreasoningabout semanticheterogeneity.

3.2

Context Mediator

The

Context Mediatoristheheartofthe

COIN

project.Mediation istheprocessofrewriting queries posedinthe receiver'scontextintoasetofmediatedquerieswhereallactual conflicts are explicitlyresolvedandthe resultis

reformulatedinthereceivercontext.This process isbasedinan abduction [13]procedurethatdetermineswhat

informationisneededtoanswerthequery and

how

conflictsshouldberesolvedbyusingtheaxioms inthe different

contexts involved.

Answers

generatedbythemediationunitcan be both extensionalandintentional.Extensional

answers correspondtothe actualdataretrievedfromthevarioussources involved. Intentional answers,ontheother

hand, provide only a characterizationoftheextensionalanswerwithoutactually retrievingdatafromthedata

sources. Inaddition, the mediation process supports queriesonthesemanticsofdatathat are implicit inthe different

systems. Therearereferredtoasknowledge-level queriesasopposedtodata-levelqueriesthatareenquires onthe

factual data presentinthedata sources. Finally, integrityknowledge on onesource or across sourcescan be naturallyinvolved inthemediation processto improvethequalityandinformation contentofthemediatedqueries

(12)
(13)

3.3

System

Perspective

From

asystemperspective, the

COIN

strategycombinesthe best featuresofthe loose-andtight-coupling

approachestosemanticinteroperability [14]

among

autonomous andheterogeneous systems. Itsmodulardesign and

implementation, depicted in Figure2, funnels thecomplexity ofthe systemintomanageablechunks, enables sources

andreceivers toremain loosely-coupledtooneanother,andsustainsan infrastructure fordata integration.

This modularity, both inthecomponents andthe protocol, alsokeepsourinfrastructure scalable, extensible,and

accessible[2].

By

scalability,

we

mean

thatthecomplexity ofcreatingandadministeringthe mediationservices

(14)
(15)

services.Second,

COIN

employsthehypertext

markup

Language andJavaforthedevelopment ofportable user

interfaces. Figure 3

shows

the architectureofthe

COIN

system. Itconsistsofthreedistinctgroups ofprocesses.

• Client Processes providethe interactionwithreceiversand routealldatabaserequeststotheContext Mediator.

An

example ofaclientprocess isthemulti-databasebrowser[15],which providesapoint-and-click interface

forformulating queriestomultiplesourcesandfordisplayingthe answersobtained. Specifically,any

application programthatissuesqueriestooneor

more

sourcescan beconsidereda clientprocess.

• ServerProcesses refertodatabasegateways andwrappers. Databasegatewaysprovide physical connectivity

toadatabase onanetwork.

The

goal isto insulate theMediatorProcess fromthe idiosyncrasiesofdifferent

database

management

systems byprovidingauniform protocolfordatabase accessaswell ascanonical query

language(and data model)forformulatingthe queries.Wrappers providericher functionalitybyallowing

semi-structureddocuments on the

World

Wide

Web

tobe queried asiftheywererelational databases.This is

accomplished bydefiningan export

schema

foreachofthese

web

sitesanddescribing

how

attribute-valuescan

be extractedfroma

web

site usingafiniteautomaton withpatternmatching [16].

Mediator

Processes refertothe systemcomponentsthat collectivelyprovidethemediationservices.These

includeSQL-to-datalogcompiler, context mediator,and query planner/optimizerandmulti-database

executioner. SQL-to-Datalog compilertranslates a

SQL

query intoitscorresponding datalog format.

The

Context Mediatorrewrites theuser-providedquery intoamediatedquerywith allthe conflicts resolved.

The

planner/optimizerproduces aqueryevaluation planbasedonthemediatedquery.

The

multi-database

executioner processesthequeryplan generatedbytheplanner. Itdispatches sub-queriestothe server processes,

collatestheintermediaryresults, convertstheresult intotheclientcontext,and returns thereformulatedanswer

tothe client processes.

Of

thesethree distinctgroupsofprocesses, the mostrelevant toour discussionofcontext

knowledge

andreasoning

are themediatorprocesses.

We

willstartbyexplainingthe

domain

model andthen discusstheprototype system.

SERVER

PROCESSES

MEDIATOR

PROCESSES

CLIENT

PROCESSES

WebClient N

2

/v

2

/\

ODBC-compliant Apps (egMicrosoft Excel) ODBC-Driver

Figure3:

COIN

System

Overview

(16)
(17)

4.1

Domain

Model and

Context

definition

The

firstthingthat

we

needtodo isspecifythe

domain

model forthe

domain

that

we

are workingin.

A

domain

model

specifies thesemanticsofthe "types'"ofinformationunits,whichconstitutes a

common

vocabulary used in

capturingthesemanticsofdata in disparate sources. Inotherwordsitdefinestheontologywhichwill beused.

The

varioussemantic types, thetypehierarchy, andthetype signatures (for attributesandmodifiers)arealldefined in

the

domain

model. Typesinthegeneralized hierarchyarerootedtosystemtypes, i.e. typesnative to theunderlying

system such as integers,strings, real numbersetc.

Figure 4 depicts partofthe

domain

modelthatis usedin our example.Inthe

domain

modeldescribed,there are

three kindsofrelationshipsexpressed.

Figure 4:FinancialDomain Model Inheritance

--

Attribute

Modifier

• Inheritance: This isthe classictype inheritance relationship. Allsemantic typesinheritfrombasicsystem

types. Inthe

domain

model, type

companyFinancials

inheritsfrombasic typestring.

• Attributes: In

COIN

[17],objectshavetwo forms ofproperties, thosewhichare structuralpropertiesofthe

underlying data sourceandthosethatencapsulatetheunderlyingassumptionsabouta particularpieceofdata.

Attributesaccess structural propertiesofthesemanticobjectin question.Forinstance,thesemantic type

companyf inancials

has

two

attributes,

company

and f

yEnding.

Intuitively, these attributesdefinea relationshipbetweenobjectsofthecorresponding semantictypes.Here,the relationshipformed bythe

company

attributestatesthatforany

company

financialinquestion, theremust becorresponding

company

to

which

that

company

financial belongs.Similarly, thef

yEnding

attributestatesthatevery

company

financial

objecthas a date

when

it

was

recorded.

• Modifiers: Thesedefinearelationshipbetweensemanticobjectsofthecorresponding semantictypes.

The

differencethough isthatthevaluesofthesemanticobjectsdefinedbythemodifiershavevaryinginterpretations

depending onthe context. Looking onceagainatthe

domain

model,thesemantic type

companyFinancials

defines

two

modifiers,

scaleFactor

and

currency.

The

valueofthe objectreturnedbythemodifier

scaleFactor

depends on agivencontext.

Once

we

havedefinedthe

domain

model,

we

needtodefinethecontextsforallthe sources. Inourcase,

we

have

severaldatasources with theassumptions abouttheirdata infigure5.

A

simplified view ofwhatthecontextmight beforthe Worldscopedata sourceis:

modifier

(companyFinancials,

O,

scaleFactor,

c_ws,

M):-cste

(basic, M,

c_ws,

1000) .

modifier

(companyFinancials,

O,

currency,

c_ws,

M):-cste

(currencyType,

M, c_ws,

"USD"),

modifier

(date, 0,

dateFmt,

c_ws, M)

(18)
(19)
(20)
(21)
(22)
(23)

Value

(V12, c_ws, V3) ,

V5

= V3,

Value

(V20, c_ws, V2) ,

V5

= V2,

Value

(V7, c_ws, VI) ,

V5

= VI,

Value

(V22, c_ws,

total_assets)

,

Value

(V17,

c_ws,

total_sales

) ,

Value

(Vll, c_ws,

net_income),

Value

(q_last, c_ws,

last).

As

can beseen, thequery

now

contains elevated data sources along with asetofpredicatesthat

map

eachattribute

toitsvaluein thecorrespondingcontext. Sincetheuseraskedthe queryin the Worldscopecontext (denotedby

c_ws),thelast four predicatesin the translatedqueryascertainthatthe actual values returnedasthe solutionofthe

query needtobeinthe Worldscopecontext.

The

resultingunmediated datalogquery isthen fedtothe mediation engine.

4.3.2 Mediation

Engine

The

mediationengine isthe partofthesystemthatdetectsandresolves possible semanticconflicts. In essence,the

mediation is aquery rewritingprocess.

The

actual

mechanism

ofmediation isbasedonanAbduction Engine [13].

The

engine takes adatalogquery andasetof

domain

model axioms and computesasetof abductedqueriessuch

thattheabductedqueries haveallthedifferences resolved.

The

system doesthatby incrementallytesting for potential semanticconflictsand introducingconversion functions for the resolutionofthoseconflicts.

The

mediation

engine asitsoutputproducesasetofqueriesthattake intoaccountallthepossiblecasesgiventhe variousconflicts.

Usingtheabove example andwiththe

domain

model and contextsstatedabove,

we

would

get thesetof abducted

queries

shown

below:

answer(V108,

V107,

V106,

V105)

:-datexform(V104,

"European Style

-",

"01/05/94",

"American

Style

/"),

Name_map_Dt_Ws

(V103,

"DAIMLER-BENZ

AG"),

Name_map_Ds_Ws

(V102,

"DAIMLER-BENZ

AG"),

Ticker_Lookup2

("DAIMLER-BENZ

AG",

V101,

V100),

WorldcAFf

"DAIMLER-BENZ

AG", V99, V98, V97, V96,

V108,

V95),

DiscAF(V102,

V94, V93, V92, V91, V90,

V89),

V107

is

V92

* .001,

Currencytypes

(V89,

USD),

DStreamAF(V104,

V103,

V106,

V88, V87,

V86),

Currency_map

(USD,

V86),

quotes

(V101,

V105)

.

answer(V85,

V84, V83, V82)

:-datexform(V81,

"European Style

-",

"01/05/94",

"American

Style

/"),

Name_map_Dt_Ws

(V80,

"DAIMLER-BENZ

AG"),

Name_map_Ds_Ws

(V79,

"DAIMLER-BENZ

AG"),

Ticker_Lookup2

("DAIMLER-BENZ AG"

, V78,

V77),

WorldcAF( "DAIMLER-BENZ

AG",

V76, V75, V74, V73, V85,

V72),

DiscAF(V79,

V71, V70, V69, V68, V67,

V66),

V84

is

V69

*

0.001,

Currencytypes

(V66,

USD),

DStreamAF(V81,

V80, V65, V64,

V63

,

V62),

Currency_map

(V61,

V62),

<> (V61, USD) ,

datexform(V6

0,

"European Style

/",

"01/05/94",

"American Style

/"),

olsen(V61,

USD, V59,

V60),

V83

is

V65

* V59,

quotes

(V78, V82)

.

answer(V58,

V57, V56, V55)

:-datexform(V54,

"European

Style

-",

"01/05/94",

"American

Style

/"),

(24)
(25)

9-Name_map_Dt_Ws

(V53,

"DAIMLER-BENZ

AG"),

Name_map_Ds_Ws

(V52,

"DAIMLER-BENZ

AG"),

Ticker_Lookup2

("DAIMLER-BENZ

AG", V51,

V50),

WorldcAF(

"DAIMLER-BENZ

AG", V49, V48, V47, V46, V58,

V45),

DiscAF(V52,

V44, V43, V42, V41, V40,

V39),

V38

is

V42

*

0.001,

Currencytypes

(V39, V37) , <> (V37, USD) ,

datexform(V36,

"European Style

/", V44,

"American Style

/"),

olsen(V37,

USD, V35,

V36),

V57

is

V38

* V35,

DStreamAF(V54,

V53, V56, V34, V33,

V32),

Currency_map

(USD,

V32),

quotes

(V51, V55) .

answer(V31,

V30, V29, V28)

:-datexform(V27,

"European

Style

-",

"01/05/94",

"American

Style

/"),

Name_map_Dt_Ws

(V26,

"DAIMLER-BENZ

AG"),

Name_map_Ds_Ws

(V25,

"DAIMLER-BENZ

AG"),

Ticker_Lookup2

(

"DAIMLER-BENZ

AG", V24, V23),

WorldcAFf

"DAIMLER-BENZ

AG", V22, V21, V20, V19, V31,

V18),

DiscAF(V25,

V17, V16, V15, V14, V13,

V12),

Vll

is

V15

*

0.001,

Currencytypes

(V12,

V10),

<> (V10, USD)

,

datexform(V9,

"European

Style

/", V17,

"American Style

/"),

olsen(V10,

USD, V8, V9),

V30

is

Vll

* V8,

DStreamAF(V27,

V26, V7,

V6

, V5, V4) ,

Currency_map

(V3, V4) , <> (V3, USD) ,

datexform(V2,

"European Style

/",

"01/05/94",

"American Style

/"),

olsen(V3,

USD, VI, V2)

,

V29

is

V7

* VI,

quotes

(V24, V28)

.

The

mediated querycontains foursub-queries. Each ofthesub-queries accounts fora potential semanticconflict.

For example,thefirstsub-querydeals withthecase

when

there isnocurrency conversionconflict(i.e.,sourceand

receiver use

same

currency). Whilethesecondsub-querytakes intoaccountthe possibilityofcurrency conversion.

Resolvingthe conflicts

may

sometimerequireintroducing intermediate datasources. Figure5 listed

some

ofthe

context differencesin thevarious data sourcesthat

we

useforourexample.Lookingatthetable,

we

observethat

one ofthepossibleconflictsisdifferentdatasourcesusingdifferent currencies. Inordertoresolvethatdifference,

themediation engine hastointroduce an intermediarydata source.

The

sourceused forthispurpose is acurrency

conversion

web

site(http://www.oanda.com)andisreferredtoas olsen. Inordertoresolve thecurrencyconflict in

thesecondsub-query, theolsensourceis usedtoconvertthecurrencytocorrectlyrepresent datainthecurrency

specified asofthespecifieddate inthespecified context.Notethat itis themediator, usingthecontext knowledge,

thatdeterminesthatcurrency conversion

was

needed inthiscase.

4.3.3

Query

planner

and

optimizer

The

query planner

module

takes thesetofdatalog queriesproducedbythemediation engineand producesaquery plan. Itensuresthatan executable planexistswhichwill producearesultthatsatisfiestheinitial query.Thisis

necessitatedbythefactthatthere are

some

sourcesthatimposerestrictions onthetypeofqueriesthatthey can

service. In particular,

some

sources

may

requirethat

some

ofthe attributesmust always be

bounded

while

making

queriestothosesources. Anotherlimitation thatsourcesmight have isthekindsofoperatorsthattheycan handle.

One

example isthatmost

web

sourcesdonotprovide an interfacethatsupportsallthe

SQL

operators,ortheymight

requirethat

some

attributes inqueries bealwaysbound.

Once

the planner ensures than an executable planexists, it

generates asetofconstraintsonthe orderinwhich the differentsub-queriescanbe executed.

Under

these

constraints,theoptimizerapplies standard optimizationheuristicstogeneratethequeryexecutionplan.

The

query

(26)
(27)

execution plan isessentiallyanalgebraicoperatortreein whicheach operation isrepresentedbyanode. Thereare

two

typesofnodes:

• Access Nodes: Access nodesrepresent accesstoremotedatasources.

Two

subtypesofaccessnodesare: • sfw Nodes: These nodesrepresentaccesstodata-sourcesthatdonot require inputbindingsfromother

sourcesin the query.

• join-sfwNode: Thesenode haveadependency inthat theyrequire inputfrom other data sourcesin the

query.

Thus

thesenodes haveto

come

afterthenodesthattheydepend onwhiletraversing thequeryplan

tree.

• LocalNodes: These nodesrepresent localoperations in localexecution engine. Thereare foursubtypesoflocal

nodes.

• JoinNode: Joinstwo trees

• SelectNode: Thisnode isusedtoapply conditionsto intermediateresults.

CVTNode:

Thisnode isusedtoapply conversion functionsto intermediatequeryresult.

Union

Node:Thisnoderepresents aunionoftheresultsobtainedbyexecutingthesub-nodes.

Each nodecarriesadditional information aboutwhatdata-sourcetoaccess(ifitisan accessnode) andother

information thatis used bytheruntime engine.

Some

ofthe informationthatiscarriedineach nodeisalistof

attributes inthesourceandtheirrelative position,listofcondition operationsand any literalsandother information

liketheconversion formulain thecaseofaconversion node.

The

queryplan forthefirstsub-queryofthemediated

queryis

shown

inthe Appendix.

The

queryplan thatis produced bytheplanneristhen forwardedtotheruntime engine.

4 3.4

Runtime

engine

The

runtime execution engine executesthequery plan.Givenaqueryplan, theexecution enginetraverses thequery

plan tree ina depth-first

manner

startingfrom therootnode. Ateach node, itcomputes thesub-trees forthatnode

andthen appliestheoperationspecified for thatnode. For eachsub-tree theengine recursivelydescends

down

the

tree until itencounters an access node. Forthataccessnode, it

composes

a

SQL

query andsends it offto theremote

source.

The

resultsofthatqueryarethenstored in thelocal store.

Once

allthe sub-treeshave been executedandall

the results are inthe localstore,theoperation associatedwith thatnodeis executedandthe results collected.This

operation continuesuntilthe rootofthequery isreach.At thispoint theexecution engine hastherequiredsetof

resultscorrespondingtothe originalquery.Theseresultsare thensentbackto theuserandthe processiscompleted.

4.4

Web

Wrapper

The

originalquery used inourexample, contained accesstoaquoteserverto getthemostrecentquotesforthe

company

inquestion, i.e. Daimler-Benz.

As

opposedtothe restofthe sources, thequote serverthat

we

usedis a

web

quoteserver. In ordertoaccessthe

web

sources suchasthisone,

we

have developedatechnologythatletsusers

treat

web

sitesas relationaldatasources. Userscan then issue

SQL

queriesjustasthey

would

toanyrelation inthe

relationaldomain,thus combiningmultiplesourcesandcreatingqueriesas the oneabove. Thistechnology iscalled

web-wrapping

and

we

havean implementation forthistechnologywhich iscalled

web

wrappingengine[18].Using

the

web

wrapperengine

(web

wrapperforshort)theapplicationdevelopers can veryrapidly

wrap

astructured or

semi-structured

web

siteandexportthe

schema

fortheuserstoqueryagainst.

Once

the source hasbeenwrapped,it

can be usedas a relationalsource inanyquery.

4.4.1

Web

wrapper

architecture

Figure6

shows

the architectureofthe

web

wrapper.

The

systemtakes the

SQL

queryas input. Itparsesthequery

alongwiththe specifications for thegiven

web

site.

A

queryplan isthenconstituted.

The

planconstitutes ofwhat

web

sitestosendhttprequests and whatdocuments onthose

web

sites.

The

executioner thenexecutesthe plan.

Once

thepagesare fetched, theexecutioner thenextractstherequired informationfrom thepagesandpresentsthat

tothe user.

4.4.2

Wrapping

a

web

site

For our query,the relationquote isactually a

web

quote serverthat

we

accessusing our

web

wrapper. In orderto

wrap

asite,

you

needtocreateaspecification file. For each

Web

pageorsetof

Web

pagesthegeneric

Web

Wrapper

engineutilizesa specification filetoguideitthroughthedataextractions process.

The

specificationfile

contains information aboutthe locations.onthe

web

forbothinputand outputdata.

The

Web

Wrapper

Engine

utilizes this information duringquery executionto getinformationbackfrom thesetof

Web

pagesasifitwere a

(28)
(29)

relationaldatabase. This file isplaintextfileandcontains information suchastheexported schema,the

URL

ofthe

web

site toaccess,anda regular expressionthatwillbe usedto extract the actual information fromthe

web

page. In

ourexample

we

usethe

CAW

quoteservertoget quotes.

A

simplifiedspecification tile is included below:

#HEADER

#RELATION=quotes

#HREF=GET

http://qs.cnnfn.com

#EXPORT= quotes. Cname quotes. Last

#ENDHEADER

#BODY

#PAGE

#HREF

=

POST

http: //qs

.cnnfn.com/cgibin/stockquote

?

symbols=##quotes

.

Cname##

#CONTENT=last

:

&nbsp </font><FONT

SIZE=

+

lxb>##quotes

.

Last##</FONTx/TD>

#ENDPAGE

#ENDBODY

The

specificationhas

two

parts.

Header

and Body.

The Header

partspecifies information aboutthe

name

ofthe relationandtheexportedschema. Intheabovecase,the

schema

that

we

decidedtoexport hastwoattributes,

Cname,

the ticketofthe

company

andLastthelatestquote.

The

Body

Portionofthe filespecifies

how

toactually

access thepage(asdefinedinthe

HREF

field)andwhatregularexpressiontouse(asdefinedinthe

CONTENT

field).

Once

the specification fileiswrittenandplacedwherethe

web

wrappercan readit,

we

arereadytousethe

system.

We

canstart

making

queriesagainst the

new

relationthat

we

justcreated.

(30)
(31)

12-Querj Results Interface Query Compiler Interpreter Planner/ Optimizer Specification Compiler; Interpreter Executioner PatternMatching Network Access V I V „ ... .. Web documents Specittcatton

Figure6:

Web

Wrapper

Architecture

4.4.3

Web

Wrapping and

XML

The

extensible

Markup Language

(

XML)

for

Web

pagesis

becoming

increasinglyaccepted. This provides

opportunities forthe

Web

Wrapping

Engine,inparticular,andtheContext Interchange System, in general. First, the

XML

tagsprovide a

much

easierandexplicitdemarcationofthe locationoffieldswithina

web

page. That

makes

the extractionofdata

much

simplerfor the

Web

Wrapper.

On

theotherhand,

XML

isprimarilyasyntacticfacility.

You

may

havea tag

<PRICE>,

but

XML

does notprovidethesemantics, suchaswhatisthe currencyofthe price,

does itincludetax,doesitincludeshippingcosts, etc.

The

Context Interchangeapproach isthenextstep inthe

evolution of

XML

andthe

Web

toprovidethesemanticsthat isso criticaltothe effectiveexchange ofinformation.

5.

Conclusions

In thispaper,

we

havedescribedanovel approachtotheproblem ofresolvingsemantic differencesbetween

disparateinformation sourcesbyautomaticallydetectingandresolvingsemanticconflictsbetween thosesources

basedonthe

knowledge

ofthe contextsofthose datasources inaparticulardomain.

We

havealsodescribedand

explainedthe architectureand implementation oftheprototype,anddiscussed theprototype at

work

byusingan

examplescenario.

More

detailspertainingto thisscenarioandademonstrationofitsoperation canbe foundat

http

:

//context

.mit

.

edu/

-co in/

demos

/t

as

c/

saved/

crueries/qll

.

html

.

Acknowledgements

In

Memory

of

Cheng

Hian

Goh

(1965-1999)

Cheng

Hian

Goh

passed

away

onApril 1, 1999,attheageof33.

He

issurvivedbyhis wife,

Soh Mui

Lee

and

two

sons,

Emmanuel

andGabriel,to

whom

we

all send our deepest condolences.

(32)
(33)

Cheng

Hian receivedhis Ph.D. in InformationTechnologies fromthe Massachusetts Instituteof

technology inFebruary, 1997.

He

joined the Department of

Computer

Science, National Universityof Singapore asanAssistant Professorin

November,

1996.

He

lovedand

was

dedicatedto his

works—

teachingaswellas research.

He

believedin givinghisstudents

the bestandwhat they deserved.

As

a

young

databaseresearcher, hehad

made

majorcontributionstothefieldas

testifiedbyhispublications(in ICDE'97,

VLDB'98,

and ICDE'99).

Cheng

Hian

was

avery sociable person,andoften soughtthe

company

offriends.

As

such,he

was

loved

by those

who

came

in contactwith him.

He would

oftengothe extramiletohelphis friendsandcolleagues.

He

was

agreatperson,and had touchedthe livesofmany.

We

suddenlyrealizedthatthere are

many

things that

we

will

neverdotogetheragain.

We

willmiss

him

sorely,andhis laughter,andhis smile...

Work

reported herein hasbeensupported, inpart,bythe

Advanced

ResearchProjects

Agency

(ARPA)

and

the

USAF/Rome

Laboratory undercontractF30602-93-C-0160,

TASC,

Inc. (andits parent,

PRIM

ARK

Corporation),the

MIT

PROductivity

From

Information Technology

(PROFIT)

Program, andthe

MIT

Total Data

Quality

Management

(TDQM)

Program. Information abouttheContext Interchangeprojectcan be obtainedat

http

:

//context

.

mit

.

edu/~coin.

Bibliography

I]Adil Daruwala,

Cheng

Goh, Scot Hofmeister,

Karim

Husein, StuartMadnick, Michael Siegel. "TheContext

Interchange

Network

Prototype",Center For Information SystemLaboratory (CISL),working paper,#95-01,

Sloan School of

Management,

MassachusettsInstitute ofTechnology, February 1995

2] Goh,

C,

Madnick, S., Siegel,

M.

"Context Interchange:

Overcoming

thechallengesofLarge-scale

interoperabledatabase systems inadynamic environment".Proc. ofIntl. Conf.

On

Informationand

Knowledge Management.

1994.

3]Arens, Y. and Knobloch, C. "Planning andreformulating queries forsemantically-modeled multidatabase".

Proc. ofthe Intl.Conf. on Informationand

Knowledge Management.

1992

4] Garcia-Molina, H.

"The

TSIMMIS

Approach

to Mediation: Data

Models

andLanguages". Proc. oftheConf. on

Next

Generation Information Technologies and Systems. 1995.

5]Tomasic,A. Rashid, L.,andValduriez,P. "Scaling Heterogeneousdatabasesandthedesignof

DISCO".

Proc. of theIntl.Conf. on Distributed

Computing

Systems. 1995.

6] Levy, A., Srivastava, D.andKrik, T.

"Data

Model

and

Query

Evaluation in Global Information Systems".

Journalof Intelligent InformationSystems. 1995.

7] Papakonstantinou, Y.,Gupta, a.,andHaas, L. "Capabilities-Based

Query

RewritinginMediator

Systems". Proc.ofthe4th Intl. Conf.on Paralleland Distributed Information Systems. 1996.

8] Duschka, O.,andGenesereth,

M.

"Query

Planningin Infomaster". http://infomaster.standord.edu. 1997. 9] Wiederhold, G. "Mediation intheArchitectureofFuture Information Systems".Computer,23(3), 1992.

10] Kifer, M., Lausen,G.,and

Wu,

J. Logical Foundationsof Object-orientedand Frame-based Languages.

JACM

5(1995), pp. 741-843.

II] Pena,F.

PENNY:

A

Programming

Language and Compilerfor the ContextInterchange Project.

CISL

Working

Paper#97-06, 1997

12] McCarthy,J.Generality inArtificial Intelligence.

Communications

ofthe

ACM

30, 12(1987), pp. 1030-1035.

13]

KaKas,

A.

C,

Kowalski, R.A. andToni, F. Abductive Logic Programming.Journalof LogicandComputation

2,6(1993), pp.719-770.

14] Sheth,A. P.,and Larson,J. A. Federated databasesystemsfor

managing

distributed,heterogeneous,and

autonomous

databases.

ACM

Computing

Surveys22,3 (1990), pp. 183-236

15] Jakobiasik,

M. Programming

theweb-design andimplementationof amultidatabase browser. Technical

Report, CISL,

WP

#96-04, 1996

16] Qu,J. F. Data wrapping ontheworldwide web.Technical Report,

CISL

WP

#96-05, 1996

17]Goh,C. H. Representingand ReasoningaboutSemanticConflicts inHeterogeneousInformation Systems.

PhD

thesis,Sloan School of

Management,

MassachusettsInstitute of Technology, 1996.

18]Bressan, S., Bonnet,P. ExtractionandIntegrationofData from Semi-structured

Documents

intoBusiness

Applications.

To

be published.

19] Madnick, S.MetadataJones andthe

Tower

of Babel:

The

ChallengeofLarge-Scale Heterogeneity.

Proceedingsofthe

IEEE

Meta-dataConference, April 1999.

(34)
(35)

APPENDIX:

Part

of

the

Query Execution

Plan

(the entireplan canbe foundat http://context.mit.edu/~coin/demos/tasc/saved/saved-results/qlI t3.html)

SELECT

ProjL:

[att(l, 5),

att(l,

4),

att(l,

3),

att(l,

2)]

CondL:

[attd,

1) =

"01/05/94"]

JOIN-SFW-NODE

DateXform

ProjL:

[attd,

3),

att(2,

4),

att(2,

3),

att(2,

2),

att(2,D]

CondL:

[att(l, 1) =

att(2,

5),

att(l,

2) =

"European Style

-",

attd,

4) =

"American

Style

/"]

CVT-NODE

'V18' is 'V17' *

0.001

Projl:

[att(2, 1),

att(l,

1),

att(2,

2),

att(2,

3),

att(2,

4)]

Condi:

['V17' =

att

(2, 5)

]

JOIN-SFW-NODE quotes

ProjL:

[att(2, 1),

att(2,

2),

att(l,

2),

att(2,

3),

att(2,

4)]

CondL:

[att(l, 1) =

att

(2, 5)]

JO

IN

-NODE

ProjL:

[att(2, 1),

att(2,

2),

att(2,

3),

att(2,

4),

att(2,

5)]

CondL:

[attd,

1) =

att

(2, 6)]

SELECT

ProjL:

[attd,

2)]

CondL:

[attd,

1) = 'USD']

SFW-NODE

Currency_map

ProjL:

[attd,

1),

attd,

2)]

CondL:

[]

JOIN-NODE

ProjL:

[attd,

1),

attd,

2),

attd,

3),

attd,

2),

attd,

3),

attd,

4)]

CondL:

[att(l, 1) =

attd,

4)]

SFW-NODE

Datastream

ProjL:

[attd,

2),

attd,

3),

attd,

1),

attd,

6)]

CondL:

[]

JO

IN

-NODE

ProjL:

[attd,

1),

attd,

2),

attd,

3),

attd,

4)]

CondL:

[attd,

1) =

attd,

5)]

SELECT

ProjL:

[att(l, 2)]

CondL:

[attd,

1) = 'USD']

SFW-NODE Currencytypes

ProjL:

[att(l, 2),

attd,

D]

CondL:

[]

JOIN-NODE

ProjL:

[attd,

1),

attd,

2),

attd,

2),

attd,

3),

attd,

3)]

CondL:

[att(l, 1) =

att

(2, 4)]

SFW-NODE Disclosure

ProjL:

[attd,

1),

attd,

4),

attd,

7)]

CondL:

[]

JOIN-NODE

ProjL:

[attd,

1),

attd,

1),

attd,

2),

attd,

3)]

CondL

: []

SELECT

ProjL:

[attd,

2}]

CondL:

[attd,

1) =

"DAIMLER-BENZ

AG"]

SFW-NODE Worldscope

ProjL:

[att(l, 1),

attfl,

6)]

CondL:

[]

(36)
(37)

15-JOIN-NODE

ProjL:

[att(l, 1),

att(2,

1),

att(2,

2)]

CondL

: [

]

SELECT

ProjL:

[attll, 2)]

CondL:

[att(l, 1) =

"DAIMLER-BENZ

AG"]

SFW-NODE Ticker_Lookup2

ProjL:

[att(l, 1),

att(l,

2)]

CondL

: []

JOIN-NODE

ProjL:

[att(2, 1),

attll,

1)]

CondL:

[]

SELECT

ProjL:

[attll, 2)]

CondL:

[attll, 1) =

"DAIMLER-BENZ

AG"]

SFW-NODE Name_map_Ds_Ws

ProjL:

[attll, 2),

attll,

1)]

CondL:

[]

SELECT

ProjL:

[attll, 2)]

CondL:

[attll, 1) =

"DAIMLER-BENZ

AG"]

SFW-NODE Name_map_Dt_Ws

ProjL:

[attll, 2),

attll,

1)]

CondL:

[]

(38)
(39)
(40)

Date

Due

MA\

3

1

200$

(41)

MIT LIBRARIES

(42)

Figure

Figure 3: COIN System Overview
Figure 4 depicts part of the domain model that is used in our example. In the domain model described, there are three kinds of relationships expressed.
Figure 6: Web Wrapper Architecture

Références

Documents relatifs

It is responsible for retrieving context information from the context database module, analyzing stored context information, identifying context changes and notifying the context

Concrete, every value-based configuration unit, such as the necessary value in- terval of a supersonic sensor for a reconfiguration option like an autonomous parking service,

In order to test the feasibility of the denoising algorithms on real data, CoM2, InfoMax and CCA are applied to the denoising of rapid ictal discharges in subject (referred as

Under this constraint, a new link adaptation criterion is proposed: the maximization of the minimum user’s information rate through dynamic spreading gain and power control, al-

IaaS service delivery model is likely to keep losing market share to PaaS and SaaS models because companies realize more value and resource-savings from software

Considering the low experimental value of the ratio d/a and the fast spatial decay of the DLVO interaction, only the side of the particle turned towards the air-liquid interface

[r]

Attempts to increase the specific energy of SCs are centred on the development of high surface area porous materials and on the control of pore size, although