MIT LIBRARIES DUPL
111
....HillIIID28
M414
..MIS*-
D£
WEY
To
appear
inthe
Journal
atApplied
Intelligence
,September
2000
Context
Knowledge
Representation
and
Reasoning
in
the
Context
Interchange
System
Stephane
Bressan',Cheng
Goh
2
, Natalia Levina, Stuart
Madnick,
Ahmed
Shah,
Michael
SiegelSloan
WP#
4133
CISLWP
#00-04
August,
2000
MIT
Sloan School
of
Management
50
Memorial
Drive
MASSACHUSETTS
INSTITUTEE
OF
TECHNOL OGY
OCT
2
5
2000
Context
Knowledge
Representation
and Reasoning
in
the
Context Interchange
System
Stephane
Bressan1,
Cheng
Goh
2
,Natalia Levina, Stuart
Madnick,
Ahmed
Shah, MichaelSiegelMassachusetts
Institute ofTechnology,
Cambridge,
MA
02139
Abstract
The
Context Interchange Project presents a unique approach to the problem ofsemantic conflict resolutionamong
multiple heterogeneous data sources.The
system presents a semanticallymeaningful view ofthe data to the receivers (e.g. user applications) for all the available data
sources.
The
semantic conflicts are automatically detected and reconciled by a Context Mediatorusing the context
knowledge
associated with both the data sources and the data receivers.The
results are collated and presented in the receiver context.
The
cunent implementation of the system provides access to flat files, classical relational databases, on-line databases, andweb
services.
An
example application, using actual financial information sources, is described along withadetaileddescriptionoftheoperationofthesystem foranexamplequery.Keywords: context, integration, mediation, semantic heterogeneity,
web
wrapping1.
Introduction
Inrecentyearsthe
amount
ofinformation availablehasgrown
exponentially. Whilethe availabilityofsomuch
information has helped people
become
self-sufficient andgetaccesstoallthe informationhandily,thishas createdanotherdilemma. Allthesedatasourcesandthetechnologiesthatare
employed
by thedata source providersdonotprovidesufficient logicalconnectivity(theability toexchangedatameaningfully). Logical connectivity iscrucial
becauseusersofthese sources expecteachsystemtounderstand requestsstatedintheir
own
terms,usingtheirown
conceptsof
how
theworld is definedandstructured.As
aresult,anydata integration effortmustbecapableofreconcilingsemanticconflicts
among
sourcesandreceivers. Thisproblemisgenerallyreferredtoastheneedforsemanticinteroperability
among
distributeddatasources.The
ContextInterchange ProjectatMIT
[1,2] isstudyingthesemantic integration ofdisparateinformation sources.Like other information integrationprojects (the
SIMS
projectatISI [3],theTSIMMIS
projectatStanford[4], theDISCO
projectatBull-INRIA [5],theInformation ManifoldprojectatAT&T
[6],theGarlicprojectatIBM
[7],theInfomasterprojectatStanford [8]),
we
have adoptedaMediationarchitecture asoutlinedin Wiederhold's seminalpaper [9].
Insection2,
we
presenta motivational scenarioofausertryingtoaccess informationfromvarious actualdatasourcesandtheproblems faced.Section3 describesthe currentimplementation ofthe Contextmediation system.
Section 4 presents adetaileddiscussion ofthevarioussubsystems, highlightingthecontext
knowledge
representation andreasoning, usingthescenario outlinedin section2.Section5 concludes ourdiscussion.
2.
Why
Context
Mediation ?
-
An
Example
Scenario
Consideran
example
ofafinancialanalystdoingresearchon DaimlerBenz. She needstofind out the netincome,netsales,andtotalassetsof Daimler
Benz
Corporationforthe yearending 1993. Inadditionto that,sheneedstoknow
theclosingstockprice of DaimlerBenz. She normallyuses the financialdatastoredinthe Worldscope3database. Sherecalls Jill,herco-workertellingherabouttwoother databases, Datastrearn4andDisclosure5and
how
theycontained
much
ofthe infonnation thatJillneeded. Shestartsoff with Worldscopedatabase. Sheknows
thatWorldscopehastotalassets forallthecompanies. Shebringsupaquerytooland issuesaquery:
Now
attheNational UniversityofSingapore.2
We
dedicatethiswork
tothememory
ofCheng
HianGoh
(1965-1999).3
The
Worldscopedatabaseisanextractfrom the Worldscopefinancialdatasource4
select
company_name
,total_assets
from worldscope
where company_name
= "DAIMLER-BENZ
AG"
;
She immediatelygetsbacktheresult:
DAIMLER-BENZ
AG
5659478
Satisfied,she
moves
onandfiguresoutafterlookingatthedata informationfor thenew
databasesthatshe canget thedataonnet income from Disclosureandnet salesfrom Datastream. Fornet income, sheissues thequery:select
company_name
,net_income from disclosure
where
company_name
= "DAIMLER-BENZ
AG"
;
The
query does notreturnanyrecords. Puzzled, shechecks fortyposandtriesagain. Sheknows
thatthe informationexists. Shetriesone
more
time, thistime entering apartialname
forDAIMLER
BENZ.
select
company_name
,net_income
from disclosure
where
company_name like
"DAIMLER%"
;
Shegetstherecord back:
DAIMLER BENZ
CORP
615000000
She
now
realizes thatthedatasourcesdonotconformtothesame
standards, asitbecomes
obvious fromthenames.Cautious,she pressesonandissues the thirdquery:
select
name,
total_sales
from datastream
where
name like
"DAIMLER%"
;Shegetstheresult:
DAIMLER-BENZ
9773092
Woikkcupc
year 1993). In addition shestill hasto
somehow
findanother data sourcetoget thelatest stock priceforDaimlerBenz.Forthat, she
knows
shewill firsthave to findout the tickerforDaimlerBenz
andthen lookupthe priceusingone ofthe
many
stockquoteserversontheweb.The
ContextMediation system can be usedtoautomaticallydetectandresolveallthesemanticconflictsbetweenallthedatasources beingusedandcan presentthe results totheuser inthe formatthatsheis familiar with. Inthe above
example, ifthe analystwere usingtheContext Mediation system instead,all she
would
havetodowould
betoformulateandaskasinglequerywithouthavingtoworryabout theunderlyingdifferencesbetweenthe data. Both
her requestandthe result
would
be formulatedinherpreferredcontext(e.g. Worldscope).The
multi-source query,Queryl,
couldbestated as follows:select
worldscope
.total_assets,
datastream. total_sales,
disclosure
.net_income
,
quotes
.Last
from
worldscope,
datastream,
disclosure,
quotes
where
worldscope
.company_name
="DAIMLER-BENZ
AG"
and
datastream. as_of_date
="01/05/94"
and
worldscope
.company_name
=datastream. name and
worldscope
.company_name
=disclosure
.company_name and
worldscope
.company_name
=quotes
.cname
;
The
systemwould
then detectandreconcile the conflicts encounteredby the analyst.3.
Overview
ofthe
COIN
ProjectThe COntext
INterchange(COIN)
strategy seekstoaddresstheproblem of semanticinteroperabilitybyconsolidatingdistributeddatasourcesandprovidinga unifiedview.
COIN
technology presentsalldatasourcesasSQL
relational databasesbyproviding genericwrappersforthem.The
underlyingintegration strategy, called theCOIN
model,definesanovelapproachformediated [9]dataaccess in whichsemanticconflictsamong
heterogeneoussystemsare automatically detectedandreconciledbytheContext Mediator.3.1
The
COIN
Framework
The
COIN
framework
iscomposed
ofbothadatamodel and a logical language,COINL
[ll],derived fromthefamilyof F-Logic [10].
The
datamodel
andlanguageareused todefinethedomain model
ofthereceiveranddatasourceandthecontext [12]associated withthem.
The
datamodelcontainsthe definitions for the"types"ofinformation units(calledsemantictypes)thatconstitutea
common
vocabularyforcapturingthesemanticsofdataindisparatesystems. Contexts, associatedwithboth information sources andreceivers, are collectionsofstatements
defining
how
datashould be interpretedandhow
potential conflicts(differences inthe interpretation)should beresolved.Conceptssuch assemantic-objects, attributes, modifiers, andconversion functions definethesemanticsof
datainsideandacrosscontexts.Togetherwiththedeductiveandobject-orientedfeatures inheritedfrom F-Logic,the
COIN
datamodel
andCOINL
constitutean appropriateandexpressiveframeworkforrepresentingsemanticknowledge
andreasoningabout semanticheterogeneity.3.2
Context Mediator
The
Context MediatoristheheartoftheCOIN
project.Mediation istheprocessofrewriting queries posedinthe receiver'scontextintoasetofmediatedquerieswhereallactual conflicts are explicitlyresolvedandthe resultisreformulatedinthereceivercontext.This process isbasedinan abduction [13]procedurethatdetermineswhat
informationisneededtoanswerthequery and
how
conflictsshouldberesolvedbyusingtheaxioms inthe differentcontexts involved.
Answers
generatedbythemediationunitcan be both extensionalandintentional.Extensionalanswers correspondtothe actualdataretrievedfromthevarioussources involved. Intentional answers,ontheother
hand, provide only a characterizationoftheextensionalanswerwithoutactually retrievingdatafromthedata
sources. Inaddition, the mediation process supports queriesonthesemanticsofdatathat are implicit inthe different
systems. Therearereferredtoasknowledge-level queriesasopposedtodata-levelqueriesthatareenquires onthe
factual data presentinthedata sources. Finally, integrityknowledge on onesource or across sourcescan be naturallyinvolved inthemediation processto improvethequalityandinformation contentofthemediatedqueries
3.3
System
Perspective
From
asystemperspective, theCOIN
strategycombinesthe best featuresofthe loose-andtight-couplingapproachestosemanticinteroperability [14]
among
autonomous andheterogeneous systems. Itsmodulardesign andimplementation, depicted in Figure2, funnels thecomplexity ofthe systemintomanageablechunks, enables sources
andreceivers toremain loosely-coupledtooneanother,andsustainsan infrastructure fordata integration.
This modularity, both inthecomponents andthe protocol, alsokeepsourinfrastructure scalable, extensible,and
accessible[2].
By
scalability,we
mean
thatthecomplexity ofcreatingandadministeringthe mediationservicesservices.Second,
COIN
employsthehypertextmarkup
Language andJavaforthedevelopment ofportable userinterfaces. Figure 3
shows
the architectureoftheCOIN
system. Itconsistsofthreedistinctgroups ofprocesses.• Client Processes providethe interactionwithreceiversand routealldatabaserequeststotheContext Mediator.
An
example ofaclientprocess isthemulti-databasebrowser[15],which providesapoint-and-click interfaceforformulating queriestomultiplesourcesandfordisplayingthe answersobtained. Specifically,any
application programthatissuesqueriestooneor
more
sourcescan beconsidereda clientprocess.• ServerProcesses refertodatabasegateways andwrappers. Databasegatewaysprovide physical connectivity
toadatabase onanetwork.
The
goal isto insulate theMediatorProcess fromthe idiosyncrasiesofdifferentdatabase
management
systems byprovidingauniform protocolfordatabase accessaswell ascanonical querylanguage(and data model)forformulatingthe queries.Wrappers providericher functionalitybyallowing
semi-structureddocuments on the
World
Wide
Web
tobe queried asiftheywererelational databases.This isaccomplished bydefiningan export
schema
foreachoftheseweb
sitesanddescribinghow
attribute-valuescanbe extractedfroma
web
site usingafiniteautomaton withpatternmatching [16].•
Mediator
Processes refertothe systemcomponentsthat collectivelyprovidethemediationservices.TheseincludeSQL-to-datalogcompiler, context mediator,and query planner/optimizerandmulti-database
executioner. SQL-to-Datalog compilertranslates a
SQL
query intoitscorresponding datalog format.The
Context Mediatorrewrites theuser-providedquery intoamediatedquerywith allthe conflicts resolved.
The
planner/optimizerproduces aqueryevaluation planbasedonthemediatedquery.
The
multi-databaseexecutioner processesthequeryplan generatedbytheplanner. Itdispatches sub-queriestothe server processes,
collatestheintermediaryresults, convertstheresult intotheclientcontext,and returns thereformulatedanswer
tothe client processes.
Of
thesethree distinctgroupsofprocesses, the mostrelevant toour discussionofcontextknowledge
andreasoningare themediatorprocesses.
We
willstartbyexplainingthedomain
model andthen discusstheprototype system.SERVER
PROCESSES
MEDIATOR
PROCESSES
CLIENT
PROCESSES
WebClient N
2
/v
2
/\
ODBC-compliant Apps (egMicrosoft Excel) ODBC-DriverFigure3:
COIN
SystemOverview
4.1
Domain
Model and
Context
definitionThe
firstthingthatwe
needtodo isspecifythedomain
model forthedomain
thatwe
are workingin.A
domain
model
specifies thesemanticsofthe "types'"ofinformationunits,whichconstitutes acommon
vocabulary used incapturingthesemanticsofdata in disparate sources. Inotherwordsitdefinestheontologywhichwill beused.
The
varioussemantic types, thetypehierarchy, andthetype signatures (for attributesandmodifiers)arealldefined in
the
domain
model. Typesinthegeneralized hierarchyarerootedtosystemtypes, i.e. typesnative to theunderlyingsystem such as integers,strings, real numbersetc.
Figure 4 depicts partofthe
domain
modelthatis usedin our example.Inthedomain
modeldescribed,there arethree kindsofrelationshipsexpressed.
Figure 4:FinancialDomain Model Inheritance
--
Attribute—
Modifier• Inheritance: This isthe classictype inheritance relationship. Allsemantic typesinheritfrombasicsystem
types. Inthe
domain
model, typecompanyFinancials
inheritsfrombasic typestring.• Attributes: In
COIN
[17],objectshavetwo forms ofproperties, thosewhichare structuralpropertiesoftheunderlying data sourceandthosethatencapsulatetheunderlyingassumptionsabouta particularpieceofdata.
Attributesaccess structural propertiesofthesemanticobjectin question.Forinstance,thesemantic type
companyf inancials
hastwo
attributes,company
and fyEnding.
Intuitively, these attributesdefinea relationshipbetweenobjectsofthecorresponding semantictypes.Here,the relationshipformed bythecompany
attributestatesthatforanycompany
financialinquestion, theremust becorrespondingcompany
towhich
thatcompany
financial belongs.Similarly, thefyEnding
attributestatesthateverycompany
financialobjecthas a date
when
itwas
recorded.• Modifiers: Thesedefinearelationshipbetweensemanticobjectsofthecorresponding semantictypes.
The
differencethough isthatthevaluesofthesemanticobjectsdefinedbythemodifiershavevaryinginterpretations
depending onthe context. Looking onceagainatthe
domain
model,thesemantic typecompanyFinancials
defines
two
modifiers,scaleFactor
andcurrency.
The
valueofthe objectreturnedbythemodifierscaleFactor
depends on agivencontext.Once
we
havedefinedthedomain
model,we
needtodefinethecontextsforallthe sources. Inourcase,we
haveseveraldatasources with theassumptions abouttheirdata infigure5.
A
simplified view ofwhatthecontextmight beforthe Worldscopedata sourceis:modifier
(companyFinancials,
O,scaleFactor,
c_ws,M):-cste
(basic, M,c_ws,
1000) .modifier
(companyFinancials,
O,currency,
c_ws,M):-cste
(currencyType,
M, c_ws,"USD"),
modifier
(date, 0,dateFmt,
c_ws, M)Value
(V12, c_ws, V3) ,V5
= V3,Value
(V20, c_ws, V2) ,V5
= V2,Value
(V7, c_ws, VI) ,V5
= VI,Value
(V22, c_ws,total_assets)
,Value
(V17,c_ws,
total_sales
) ,Value
(Vll, c_ws,net_income),
Value
(q_last, c_ws,last).
As
can beseen, thequerynow
contains elevated data sources along with asetofpredicatesthatmap
eachattributetoitsvaluein thecorrespondingcontext. Sincetheuseraskedthe queryin the Worldscopecontext (denotedby
c_ws),thelast four predicatesin the translatedqueryascertainthatthe actual values returnedasthe solutionofthe
query needtobeinthe Worldscopecontext.
The
resultingunmediated datalogquery isthen fedtothe mediation engine.4.3.2 Mediation
Engine
The
mediationengine isthe partofthesystemthatdetectsandresolves possible semanticconflicts. In essence,themediation is aquery rewritingprocess.
The
actualmechanism
ofmediation isbasedonanAbduction Engine [13].The
engine takes adatalogquery andasetofdomain
model axioms and computesasetof abductedqueriessuchthattheabductedqueries haveallthedifferences resolved.
The
system doesthatby incrementallytesting for potential semanticconflictsand introducingconversion functions for the resolutionofthoseconflicts.The
mediationengine asitsoutputproducesasetofqueriesthattake intoaccountallthepossiblecasesgiventhe variousconflicts.
Usingtheabove example andwiththe
domain
model and contextsstatedabove,we
would
get thesetof abductedqueries
shown
below:answer(V108,
V107,
V106,
V105)
:-datexform(V104,
"European Style
-","01/05/94",
"American
Style
/"),Name_map_Dt_Ws
(V103,"DAIMLER-BENZ
AG"),
Name_map_Ds_Ws
(V102,"DAIMLER-BENZ
AG"),
Ticker_Lookup2
("DAIMLER-BENZ
AG",V101,
V100),
WorldcAFf
"DAIMLER-BENZ
AG", V99, V98, V97, V96,V108,
V95),
DiscAF(V102,
V94, V93, V92, V91, V90,V89),
V107
isV92
* .001,Currencytypes
(V89,USD),
DStreamAF(V104,
V103,
V106,
V88, V87,V86),
Currency_map
(USD,V86),
quotes
(V101,V105)
.answer(V85,
V84, V83, V82):-datexform(V81,
"European Style
-","01/05/94",
"American
Style
/"),Name_map_Dt_Ws
(V80,"DAIMLER-BENZ
AG"),
Name_map_Ds_Ws
(V79,"DAIMLER-BENZ
AG"),
Ticker_Lookup2
("DAIMLER-BENZ AG"
, V78,V77),
WorldcAF( "DAIMLER-BENZ
AG",
V76, V75, V74, V73, V85,V72),
DiscAF(V79,
V71, V70, V69, V68, V67,V66),
V84
isV69
*0.001,
Currencytypes
(V66,USD),
DStreamAF(V81,
V80, V65, V64,V63
,V62),
Currency_map
(V61,V62),
<> (V61, USD) ,datexform(V6
0,"European Style
/","01/05/94",
"American Style
/"),olsen(V61,
USD, V59,V60),
V83
isV65
* V59,quotes
(V78, V82).
answer(V58,
V57, V56, V55):-datexform(V54,
"European
Style
-","01/05/94",
"American
Style
/"),9-Name_map_Dt_Ws
(V53,"DAIMLER-BENZ
AG"),
Name_map_Ds_Ws
(V52,"DAIMLER-BENZ
AG"),
Ticker_Lookup2
("DAIMLER-BENZ
AG", V51,V50),
WorldcAF(
"DAIMLER-BENZ
AG", V49, V48, V47, V46, V58,V45),
DiscAF(V52,
V44, V43, V42, V41, V40,V39),
V38
isV42
*0.001,
Currencytypes
(V39, V37) , <> (V37, USD) ,datexform(V36,
"European Style
/", V44,"American Style
/"),olsen(V37,
USD, V35,V36),
V57
isV38
* V35,DStreamAF(V54,
V53, V56, V34, V33,V32),
Currency_map
(USD,V32),
quotes
(V51, V55) .answer(V31,
V30, V29, V28):-datexform(V27,
"European
Style
-","01/05/94",
"American
Style
/"),Name_map_Dt_Ws
(V26,"DAIMLER-BENZ
AG"),
Name_map_Ds_Ws
(V25,"DAIMLER-BENZ
AG"),
Ticker_Lookup2
("DAIMLER-BENZ
AG", V24, V23),WorldcAFf
"DAIMLER-BENZ
AG", V22, V21, V20, V19, V31,V18),
DiscAF(V25,
V17, V16, V15, V14, V13,V12),
Vll
isV15
*0.001,
Currencytypes
(V12,V10),
<> (V10, USD)
,
datexform(V9,
"European
Style
/", V17,"American Style
/"),olsen(V10,
USD, V8, V9),V30
isVll
* V8,DStreamAF(V27,
V26, V7,V6
, V5, V4) ,Currency_map
(V3, V4) , <> (V3, USD) ,datexform(V2,
"European Style
/","01/05/94",
"American Style
/"),olsen(V3,
USD, VI, V2),
V29
isV7
* VI,quotes
(V24, V28).
The
mediated querycontains foursub-queries. Each ofthesub-queries accounts fora potential semanticconflict.For example,thefirstsub-querydeals withthecase
when
there isnocurrency conversionconflict(i.e.,sourceandreceiver use
same
currency). Whilethesecondsub-querytakes intoaccountthe possibilityofcurrency conversion.Resolvingthe conflicts
may
sometimerequireintroducing intermediate datasources. Figure5 listedsome
ofthecontext differencesin thevarious data sourcesthat
we
useforourexample.Lookingatthetable,we
observethatone ofthepossibleconflictsisdifferentdatasourcesusingdifferent currencies. Inordertoresolvethatdifference,
themediation engine hastointroduce an intermediarydata source.
The
sourceused forthispurpose is acurrencyconversion
web
site(http://www.oanda.com)andisreferredtoas olsen. Inordertoresolve thecurrencyconflict inthesecondsub-query, theolsensourceis usedtoconvertthecurrencytocorrectlyrepresent datainthecurrency
specified asofthespecifieddate inthespecified context.Notethat itis themediator, usingthecontext knowledge,
thatdeterminesthatcurrency conversion
was
needed inthiscase.4.3.3
Query
plannerand
optimizerThe
query plannermodule
takes thesetofdatalog queriesproducedbythemediation engineand producesaquery plan. Itensuresthatan executable planexistswhichwill producearesultthatsatisfiestheinitial query.Thisisnecessitatedbythefactthatthere are
some
sourcesthatimposerestrictions onthetypeofqueriesthatthey canservice. In particular,
some
sourcesmay
requirethatsome
ofthe attributesmust always bebounded
whilemaking
queriestothosesources. Anotherlimitation thatsourcesmight have isthekindsofoperatorsthattheycan handle.
One
example isthatmostweb
sourcesdonotprovide an interfacethatsupportsalltheSQL
operators,ortheymightrequirethat
some
attributes inqueries bealwaysbound.Once
the planner ensures than an executable planexists, itgenerates asetofconstraintsonthe orderinwhich the differentsub-queriescanbe executed.
Under
theseconstraints,theoptimizerapplies standard optimizationheuristicstogeneratethequeryexecutionplan.
The
queryexecution plan isessentiallyanalgebraicoperatortreein whicheach operation isrepresentedbyanode. Thereare
two
typesofnodes:• Access Nodes: Access nodesrepresent accesstoremotedatasources.
Two
subtypesofaccessnodesare: • sfw Nodes: These nodesrepresentaccesstodata-sourcesthatdonot require inputbindingsfromothersourcesin the query.
• join-sfwNode: Thesenode haveadependency inthat theyrequire inputfrom other data sourcesin the
query.
Thus
thesenodes havetocome
afterthenodesthattheydepend onwhiletraversing thequeryplantree.
• LocalNodes: These nodesrepresent localoperations in localexecution engine. Thereare foursubtypesoflocal
nodes.
• JoinNode: Joinstwo trees
• SelectNode: Thisnode isusedtoapply conditionsto intermediateresults.
•
CVTNode:
Thisnode isusedtoapply conversion functionsto intermediatequeryresult.•
Union
Node:Thisnoderepresents aunionoftheresultsobtainedbyexecutingthesub-nodes.Each nodecarriesadditional information aboutwhatdata-sourcetoaccess(ifitisan accessnode) andother
information thatis used bytheruntime engine.
Some
ofthe informationthatiscarriedineach nodeisalistofattributes inthesourceandtheirrelative position,listofcondition operationsand any literalsandother information
liketheconversion formulain thecaseofaconversion node.
The
queryplan forthefirstsub-queryofthemediatedqueryis
shown
inthe Appendix.The
queryplan thatis produced bytheplanneristhen forwardedtotheruntime engine.4 3.4
Runtime
engine
The
runtime execution engine executesthequery plan.Givenaqueryplan, theexecution enginetraverses thequeryplan tree ina depth-first
manner
startingfrom therootnode. Ateach node, itcomputes thesub-trees forthatnodeandthen appliestheoperationspecified for thatnode. For eachsub-tree theengine recursivelydescends
down
thetree until itencounters an access node. Forthataccessnode, it
composes
aSQL
query andsends it offto theremotesource.
The
resultsofthatqueryarethenstored in thelocal store.Once
allthe sub-treeshave been executedandallthe results are inthe localstore,theoperation associatedwith thatnodeis executedandthe results collected.This
operation continuesuntilthe rootofthequery isreach.At thispoint theexecution engine hastherequiredsetof
resultscorrespondingtothe originalquery.Theseresultsare thensentbackto theuserandthe processiscompleted.
4.4
Web
Wrapper
The
originalquery used inourexample, contained accesstoaquoteserverto getthemostrecentquotesforthecompany
inquestion, i.e. Daimler-Benz.As
opposedtothe restofthe sources, thequote serverthatwe
usedis aweb
quoteserver. In ordertoaccesstheweb
sources suchasthisone,we
have developedatechnologythatletsuserstreat
web
sitesas relationaldatasources. Userscan then issueSQL
queriesjustastheywould
toanyrelation intherelationaldomain,thus combiningmultiplesourcesandcreatingqueriesas the oneabove. Thistechnology iscalled
web-wrapping
andwe
havean implementation forthistechnologywhich iscalledweb
wrappingengine[18].Usingthe
web
wrapperengine(web
wrapperforshort)theapplicationdevelopers can veryrapidlywrap
astructured orsemi-structured
web
siteandexporttheschema
fortheuserstoqueryagainst.Once
the source hasbeenwrapped,itcan be usedas a relationalsource inanyquery.
4.4.1
Web
wrapper
architectureFigure6
shows
the architectureoftheweb
wrapper.The
systemtakes theSQL
queryas input. Itparsesthequeryalongwiththe specifications for thegiven
web
site.A
queryplan isthenconstituted.The
planconstitutes ofwhatweb
sitestosendhttprequests and whatdocuments onthoseweb
sites.The
executioner thenexecutesthe plan.Once
thepagesare fetched, theexecutioner thenextractstherequired informationfrom thepagesandpresentsthattothe user.
4.4.2
Wrapping
aweb
siteFor our query,the relationquote isactually a
web
quote serverthatwe
accessusing ourweb
wrapper. In ordertowrap
asite,you
needtocreateaspecification file. For eachWeb
pageorsetofWeb
pagesthegenericWeb
Wrapper
engineutilizesa specification filetoguideitthroughthedataextractions process.The
specificationfilecontains information aboutthe locations.onthe
web
forbothinputand outputdata.The
Web
Wrapper
Engineutilizes this information duringquery executionto getinformationbackfrom thesetof
Web
pagesasifitwere arelationaldatabase. This file isplaintextfileandcontains information suchastheexported schema,the
URL
oftheweb
site toaccess,anda regular expressionthatwillbe usedto extract the actual information fromtheweb
page. Inourexample
we
usetheCAW
quoteservertoget quotes.A
simplifiedspecification tile is included below:#HEADER
#RELATION=quotes
#HREF=GET
http://qs.cnnfn.com
#EXPORT= quotes. Cname quotes. Last
#ENDHEADER
#BODY
#PAGE
#HREF
=POST
http: //qs
.cnnfn.com/cgibin/stockquote
?symbols=##quotes
.Cname##
#CONTENT=last
:  </font><FONT
SIZE=
+lxb>##quotes
.Last##</FONTx/TD>
#ENDPAGE
#ENDBODY
The
specificationhastwo
parts.Header
and Body.The Header
partspecifies information aboutthename
ofthe relationandtheexportedschema. Intheabovecase,theschema
thatwe
decidedtoexport hastwoattributes,Cname,
the ticketofthecompany
andLastthelatestquote.The
Body
Portionofthe filespecifieshow
toactuallyaccess thepage(asdefinedinthe
HREF
field)andwhatregularexpressiontouse(asdefinedintheCONTENT
field).
Once
the specification fileiswrittenandplacedwheretheweb
wrappercan readit,we
arereadytousethesystem.
We
canstartmaking
queriesagainst thenew
relationthatwe
justcreated.12-Querj Results Interface Query Compiler Interpreter Planner/ Optimizer Specification Compiler; Interpreter Executioner PatternMatching Network Access V I V „ ... .. Web documents Specittcatton
Figure6:
Web
Wrapper
Architecture4.4.3
Web
Wrapping and
XML
The
extensibleMarkup Language
(XML)
forWeb
pagesisbecoming
increasinglyaccepted. This providesopportunities forthe
Web
Wrapping
Engine,inparticular,andtheContext Interchange System, in general. First, theXML
tagsprovide amuch
easierandexplicitdemarcationofthe locationoffieldswithinaweb
page. Thatmakes
the extractionofdata
much
simplerfor theWeb
Wrapper.On
theotherhand,XML
isprimarilyasyntacticfacility.You
may
havea tag<PRICE>,
butXML
does notprovidethesemantics, suchaswhatisthe currencyofthe price,does itincludetax,doesitincludeshippingcosts, etc.
The
Context Interchangeapproach isthenextstep intheevolution of
XML
andtheWeb
toprovidethesemanticsthat isso criticaltothe effectiveexchange ofinformation.5.
Conclusions
In thispaper,
we
havedescribedanovel approachtotheproblem ofresolvingsemantic differencesbetweendisparateinformation sourcesbyautomaticallydetectingandresolvingsemanticconflictsbetween thosesources
basedonthe
knowledge
ofthe contextsofthose datasources inaparticulardomain.We
havealsodescribedandexplainedthe architectureand implementation oftheprototype,anddiscussed theprototype at
work
byusinganexamplescenario.
More
detailspertainingto thisscenarioandademonstrationofitsoperation canbe foundathttp
://context
.mit
.edu/
-co in/
demos
/t
as
c/
saved/
crueries/qll
.html
.Acknowledgements
In
Memory
ofCheng
Hian
Goh
(1965-1999)Cheng
HianGoh
passedaway
onApril 1, 1999,attheageof33.He
issurvivedbyhis wife,Soh Mui
Leeand
two
sons,Emmanuel
andGabriel,towhom
we
all send our deepest condolences.Cheng
Hian receivedhis Ph.D. in InformationTechnologies fromthe Massachusetts Instituteoftechnology inFebruary, 1997.
He
joined the Department ofComputer
Science, National Universityof Singapore asanAssistant ProfessorinNovember,
1996.He
lovedandwas
dedicatedto hisworks—
teachingaswellas research.He
believedin givinghisstudentsthe bestandwhat they deserved.
As
ayoung
databaseresearcher, hehadmade
majorcontributionstothefieldastestifiedbyhispublications(in ICDE'97,
VLDB'98,
and ICDE'99).Cheng
Hianwas
avery sociable person,andoften soughtthecompany
offriends.As
such,hewas
lovedby those
who
came
in contactwith him.He would
oftengothe extramiletohelphis friendsandcolleagues.He
was
agreatperson,and had touchedthe livesofmany.We
suddenlyrealizedthatthere aremany
things thatwe
willneverdotogetheragain.
We
willmisshim
sorely,andhis laughter,andhis smile...Work
reported herein hasbeensupported, inpart,bytheAdvanced
ResearchProjectsAgency
(ARPA)
andthe
USAF/Rome
Laboratory undercontractF30602-93-C-0160,TASC,
Inc. (andits parent,PRIM
ARK
Corporation),the
MIT
PROductivityFrom
Information Technology(PROFIT)
Program, andtheMIT
Total DataQuality
Management
(TDQM)
Program. Information abouttheContext Interchangeprojectcan be obtainedathttp
://context
.mit
.edu/~coin.
Bibliography
I]Adil Daruwala,
Cheng
Goh, Scot Hofmeister,Karim
Husein, StuartMadnick, Michael Siegel. "TheContextInterchange
Network
Prototype",Center For Information SystemLaboratory (CISL),working paper,#95-01,Sloan School of
Management,
MassachusettsInstitute ofTechnology, February 19952] Goh,
C,
Madnick, S., Siegel,M.
"Context Interchange:Overcoming
thechallengesofLarge-scaleinteroperabledatabase systems inadynamic environment".Proc. ofIntl. Conf.
On
InformationandKnowledge Management.
1994.3]Arens, Y. and Knobloch, C. "Planning andreformulating queries forsemantically-modeled multidatabase".
Proc. ofthe Intl.Conf. on Informationand
Knowledge Management.
19924] Garcia-Molina, H.
"The
TSIMMIS
Approach
to Mediation: DataModels
andLanguages". Proc. oftheConf. onNext
Generation Information Technologies and Systems. 1995.5]Tomasic,A. Rashid, L.,andValduriez,P. "Scaling Heterogeneousdatabasesandthedesignof
DISCO".
Proc. of theIntl.Conf. on DistributedComputing
Systems. 1995.6] Levy, A., Srivastava, D.andKrik, T.
"Data
Model
andQuery
Evaluation in Global Information Systems".Journalof Intelligent InformationSystems. 1995.
7] Papakonstantinou, Y.,Gupta, a.,andHaas, L. "Capabilities-Based
Query
RewritinginMediatorSystems". Proc.ofthe4th Intl. Conf.on Paralleland Distributed Information Systems. 1996.
8] Duschka, O.,andGenesereth,
M.
"Query
Planningin Infomaster". http://infomaster.standord.edu. 1997. 9] Wiederhold, G. "Mediation intheArchitectureofFuture Information Systems".Computer,23(3), 1992.10] Kifer, M., Lausen,G.,and
Wu,
J. Logical Foundationsof Object-orientedand Frame-based Languages.JACM
5(1995), pp. 741-843.
II] Pena,F.
PENNY:
A
Programming
Language and Compilerfor the ContextInterchange Project.CISL
Working
Paper#97-06, 1997
12] McCarthy,J.Generality inArtificial Intelligence.
Communications
oftheACM
30, 12(1987), pp. 1030-1035.13]
KaKas,
A.C,
Kowalski, R.A. andToni, F. Abductive Logic Programming.Journalof LogicandComputation2,6(1993), pp.719-770.
14] Sheth,A. P.,and Larson,J. A. Federated databasesystemsfor
managing
distributed,heterogeneous,andautonomous
databases.ACM
Computing
Surveys22,3 (1990), pp. 183-23615] Jakobiasik,
M. Programming
theweb-design andimplementationof amultidatabase browser. TechnicalReport, CISL,
WP
#96-04, 199616] Qu,J. F. Data wrapping ontheworldwide web.Technical Report,
CISL
WP
#96-05, 199617]Goh,C. H. Representingand ReasoningaboutSemanticConflicts inHeterogeneousInformation Systems.
PhD
thesis,Sloan School of
Management,
MassachusettsInstitute of Technology, 1996.18]Bressan, S., Bonnet,P. ExtractionandIntegrationofData from Semi-structured
Documents
intoBusinessApplications.
To
be published.19] Madnick, S.MetadataJones andthe
Tower
of Babel:The
ChallengeofLarge-Scale Heterogeneity.Proceedingsofthe
IEEE
Meta-dataConference, April 1999.APPENDIX:
Part
of
the
Query Execution
Plan
(the entireplan canbe foundat http://context.mit.edu/~coin/demos/tasc/saved/saved-results/qlI t3.html)
SELECT
ProjL:
[att(l, 5),att(l,
4),att(l,
3),att(l,
2)]CondL:
[attd,
1) ="01/05/94"]
JOIN-SFW-NODE
DateXform
ProjL:
[attd,
3),att(2,
4),att(2,
3),att(2,
2),att(2,D]
CondL:
[att(l, 1) =att(2,
5),att(l,
2) ="European Style
-",
attd,
4) ="American
Style
/"]CVT-NODE
'V18' is 'V17' *0.001
Projl:
[att(2, 1),att(l,
1),att(2,
2),att(2,
3),att(2,
4)]Condi:
['V17' =att
(2, 5)]
JOIN-SFW-NODE quotes
ProjL:
[att(2, 1),att(2,
2),att(l,
2),att(2,
3),att(2,
4)]CondL:
[att(l, 1) =att
(2, 5)]JO
IN-NODE
ProjL:
[att(2, 1),att(2,
2),att(2,
3),att(2,
4),att(2,
5)]CondL:
[attd,
1) =att
(2, 6)]SELECT
ProjL:
[attd,
2)]CondL:
[attd,
1) = 'USD']SFW-NODE
Currency_map
ProjL:
[attd,
1),attd,
2)]CondL:
[]JOIN-NODE
ProjL:
[attd,
1),attd,
2),attd,
3),attd,
2),attd,
3),attd,
4)]CondL:
[att(l, 1) =attd,
4)]SFW-NODE
Datastream
ProjL:
[attd,
2),attd,
3),attd,
1),attd,
6)]CondL:
[]JO
IN-NODE
ProjL:
[attd,
1),attd,
2),attd,
3),attd,
4)]CondL:
[attd,
1) =attd,
5)]SELECT
ProjL:
[att(l, 2)]CondL:
[attd,
1) = 'USD']SFW-NODE Currencytypes
ProjL:
[att(l, 2),attd,
D]
CondL:
[]JOIN-NODE
ProjL:
[attd,
1),attd,
2),attd,
2),attd,
3),attd,
3)]CondL:
[att(l, 1) =att
(2, 4)]SFW-NODE Disclosure
ProjL:
[attd,
1),attd,
4),attd,
7)]CondL:
[]JOIN-NODE
ProjL:
[attd,
1),attd,
1),attd,
2),attd,
3)]CondL
: []SELECT
ProjL:
[attd,
2}]CondL:
[attd,
1) ="DAIMLER-BENZ
AG"]SFW-NODE Worldscope
ProjL:
[att(l, 1),attfl,
6)]CondL:
[]15-JOIN-NODE
ProjL:
[att(l, 1),att(2,
1),att(2,
2)]CondL
: []
SELECT
ProjL:
[attll, 2)]CondL:
[att(l, 1) ="DAIMLER-BENZ
AG"]SFW-NODE Ticker_Lookup2
ProjL:
[att(l, 1),att(l,
2)]CondL
: []JOIN-NODE
ProjL:
[att(2, 1),attll,
1)]CondL:
[]SELECT
ProjL:
[attll, 2)]CondL:
[attll, 1) ="DAIMLER-BENZ
AG"]SFW-NODE Name_map_Ds_Ws
ProjL:
[attll, 2),attll,
1)]CondL:
[]SELECT
ProjL:
[attll, 2)]CondL:
[attll, 1) ="DAIMLER-BENZ
AG"]SFW-NODE Name_map_Dt_Ws
ProjL:
[attll, 2),attll,
1)]CondL:
[]Date
Due
MA\
31
200$
MIT LIBRARIES