MIT LIBRARIES DUPL jlll Ijl 111III
I
1D28
.M414
f
[DEWEY
Business
i>t:M
IT
MIT
Sloan School
of
Management
Sloan
Working
Paper
4183-01
eBusiness@MIT
Working
Paper
108
October
2001
CROSS-ORGANIZATIONAL
DATA
QUALITY
AND
SEMANTIC
INTEGRITY:
LEARNING
AND
REASONING
ABOUT DATA
SEMANTICS
WITH CONTEXT
INTERCHANGE
MEDIATION
Allen
Moulton,
Stuart
E.
Madnick,
and
Michael D.
Siegel
This
paper
isavailable
through
theCenter
foreBusiness@MIT
web
site atthefollowing
URL:
http://ebusiness.mit.edu/research/papers.html
This
paper
alsocan
be
downloaded
without
charge
from
theSocial
Science Research
Network
Electronic
Paper
Collection:
http://papers.ssm.
com/abstract=289992
MASSACHUSETTS
INSTITUTEOF
TECHNOLOGY
Ui!
3 2002
'!'!
Cross-Organizational
Data
Quality
and Semantic
Integrity:
Learning
and Reasoning
about
Data Semantics
with
Context
Interchange
Mediation
Allen
Moulton,
Stuart
E.
Madnick,
and
Michael
D.
Siegel
MIT
Sloan
School of
Management
amoulton@mit.edu,
smadnick@mit.edu, msiegel@mit.edu
Abstract
IVe use a context interchange mediation
approach
for detectingand
resolvingdata qualityand
semanticintegrity conflicts in infi)rmation
exchanged
across organizational boundaries. Contextmodelsdraw
on adomain
ontology to explainhow
sourceand
receiver data models implement generalprinciples ofthesubject domain. Using the declarative
knowledge from
thedomain
ontologyand
context models, themediator writes a query plan meeting receiversemantic requirements
from
autonomous, heterogeneoussources.
Examples drawn
from
fixedincome
securities investments illustrateproblems and
solutionsenabledbycontextinterchangemediation.
Introduction
Efficiently integrating
new
sourcesofinformation fromoutsidethe enterpriseisoftencritical tosuccessinaworld ofglobalcompetition, interdependency, and rapid market change. Traditional
means
ofassuring data quality and semantic integrityfocus on data within the control ofthe organization
(Huang
et al., 1999). Withoutsome
automated assistance, internalplanning andcontrol
mechanisms
forassuring data qualitywillhavedifficulty in respondingina timelymanner
tochangingdemands. Within an organization, datacan becreated, stored, and used by people and computerssharinga
common
implicitunderstandingofdata semantics.
We
usetheterm contexttorefer to this implicit understanding ofthe relationshipbetweendata elements and structures and the real world that the data represents.
The
context interchange problem ariseswhen
organizationswithdifferent conte.xts
must
exchangeinformation(Madnick, 1999).A
context interchange(COIN)
mediatorisan automatedreasoningengineto assistanorganizationinlearningabout semanticconflictsbetweenits
own
receiver contextandthecontextsofdatasources(Goh
etal., 1999).Because
context definitionsare declarative, theyneedonly bepreparedonceforeachsourceand receiver context(Bressan etal.,2000). Datasourcesmay
berelationaldatabases,
XML
documents,HTML
webs wrapped
toappearas relationswith limitedquerycapability (Firatetal.,2000), and stateless computational procedures. Using declarative context knowledge, a
COIN
mediator identifies semanticconflictsanddesigns plansfor
combining
sourceswithdataconversionstomeet
receiversemanticrequirements.Given
alargenumber
ofcomponent
systemsoperatinginadiversifiedanddynamic
environment,COIN
mediationfacilitatesrapid incorporation of
new
information sources,dynamic
substitution ofinformation sources, extension and evolution ofsemantics, data representation inthe user's context,accesstothe
meaning
ofdata represented, identificationandselectionofinformation sourcealternatives,andadaptationtochanges inuserandbusiness operations.
Research Motivation
In the fixed income securities industry, portfolio
managers
may
need todraw upon
external sources fordataaboutsecuritycharacteristics, formarketvaluation information,andfor models andcalculations(Moulton, Madnick,et al., 1998). All these
sources
may
needtobecombined
withinternalportfolioholdings dataand usedby
a decision support applicationsystem(seeFigure 1).
Suppose
the receiver system requires a security valuation as a "dollar price" expressed as a percentage withfractions in 32nds. Ifa source offers such a price in the required form, data can be simply sent
from
source to receiverunmodified.But suppose thatthebestsourceformarketinformationoffersvaluationsas"nominal spread" expressedinbasis
points (lOOths ofapercent).
To
meet
the receiver's requirements, general industryknowledge
and additional data sourcesCross-OrganizationalDataQualityand Semantic Integrity Fixed
Income
Investment
Portfolio -'A'Manager
RECEIVER
Portfolioy-^
Management
|«
Application iSystem
^-i
SOURCES
r
Context
Interchange
Mediator
mT
Portfolio Holdings Bond Calculation FunctionsFigureI. Fixed
income
securitiesinvestmentmediation scenarioTo
resolve thesemanticconflict,themediatormustknow
1)nominal spreadmeans
thedifferencebetween
yieldona securityand a
benchmark
yield, 2) the on-the-run 10-year T-note yield is an appropriatebenchmark
providedby
another sourceexpressed as apercentage, 3) a
bond
calculation from anothersource can convert yield toprice giventhe security's interestrate and other details in factor form. 4) another source provides security details, and 5)
methods
for convertingamong
percentages with orwithout32nds,basis points, andfactorform.
Research
Strategy
Our
research strategy includes investigation ofpractical problems of semantic integrity and information integration acrossautonomous
sourcesandreceivers,includingfinancialservices(Galpaya, 2000), equitysecuritiesanalysis(Fan, etal.,2000),and fixed
income
securities investments (Moulton, Madnick, et al., 1998).Based on
these industry studies,we
havede\elopedprototypesof
COIN
mediatorknowledge
representationsand reasoningengines.These
prototypesandtheoreticalproofs areusedtoevaluateapproachestopracticalproblems of semantic interoperability.
Current
COIN
Mediation
Research
Building onearlier
work by
Goh
etal. (1999),we
areexploringknowledge
representationand reasoningmethods
toexpandthe functionality of
COIN
mediation to include: 1) identifying data representationconflicts and introducing conversions totransform data from sourcetoreceiver form, 2)applying
domain
ontologyandcontextknowledge
tomap
betweenreceiverschema
and source schemata, 3) determiningwhen
andhow
tocombine
sources, feeding data from one source to anotherwith appropriate data representation conversions, 4) derivingmissingdata
by
applyingdomain
ontology, contextknowledge,or by combining sources.
Where
possible,we
employ
aknowledge
representation consistent withcommon
system designpractices(e.g.,
UML,
E-R,andrepositories).COIN
mediation isbasedon
thesemanticproposition thatinterchangeofinformation ispossiblewhen
sources andreceiversshare a
common
subject domain. Sourcesandreceiversareseenasautonomous
implementationsofcommon
subjectdomain
abstractions. Source andreceiversystem designers
make
decisionsabouthow
to conceptualizeabstract constructsand abouthow
to represent those conceptualizations in data.The
mediator uses declarative information about source and receivercontexts,as
shown
in Figure 1,todeviseaplanforintegratingsourcesanddataconversionstomeet
receiverneeds.The knowledge
used for mediation consistsof
1) adomain
ontology-containing abstractsubject matter conceptualizationsthat
would
beknown
to experienced practitioners and systemdesigners in the industry, and 2)data models foreachsourceand receiver with the kind ofinformation
programmers
would
use to access data, 3) contextmodels
for each source andreceiverthatexplain
how
eachsource or receiver datamodel
implementstheabstractconcepts fromadomain
ontology.The framework
ofa subjectdomain
ontology is a structural conceptualmodel
with classes ofabstractobjects, attributes ofCross-OrganizationalDataQualityand SemanticIntegrity
conceptual categories represent object propertydistinctions that
may
be implementeddifferentlyby
eachsourceandreceiver.Rulescapture functional relationships
among
conceptualmodel
object attributes thatcanbe determined fromgeneraldomain
knowledge. Default and contingentrulesareused forderivingattribute relationships frompartial information, followingthe
same
reasoningthat industry participantswould
use.Data models for relationaldatabase sources
come
fromschema
andcataloginformation. ForXML
sources,datamodelsmay
be obtained from an
XML-schema
orDTD
or reverse engineered fromdocuments
themselves. ForHTML
sources, datamodels are provided
by
web
wrapping. For computational procedural sources, arguments and return values are treated asrelational attributes inadata
model
that isaugmented
with functionaldependency
andinput-outputcombinationconstraints.Context
models
foreach source and receiver explainhow
each datamodel
implementsthe general concepts in thedomain
ontology. Classes from the
domain
ontology structural conceptualmodel
may
be used directly oraugmented
withcontext-specific extensions.Context-specific functional orequivalencerelationshipstieelements oftheconceptual
model
toelementsofthedatamodel. For coding schemes, enumeratedattribute
domains
inacontextaremapped
to conceptual categories fromthe
domain
ontology. Semantic types logicallyencapsulate data attributes and associate context-specific modifier valuesto identify theparticulardatarepresentationusedby
asource orreceiver.A
domain
ontology is not a globalschema. Rather,it isan abstractrepresentation ofthesubjectmatterthateachdatamodel
implements in its
own
way. Neither sources nor receivers need to accept thedomain
ontology as the "rightway"
ofrepresentinginformation aboutthe subjectmatterathand,avoiding
some
ofthe practical useracceptance problems notedby
Moulton, Bressanet al (1998).
By
allowing each contextmodel
toextendthedomain
ontology andtoexplainhow
context-specific concepts
map
to generaldomain
ontology concepts,mediation is facilitated without imposing the rigidity ofview-based systems.
The
mediator usesdomain
ontologies and context models internally within its reasoning process withoutexposing
them
to sources or receiver.The
mediator accepts relational queries against a receiver datamodel
and writes aqueryplanthatusesonlysource datamodels,ineffectdesigningacustomizedview.
Conclusion
Context interchange mediation brings automated
methods
to the important task of assuring that data exchanged acrossorganizationscan
meet
thedata qualityand semantic integrityrequirementsofthe receiver-
anddo
sowithoutrequiringthesource organizationsto
accommodate
theneeds ofthe receiver,or the receivertoadjusttoeithersources or the mediator.References
Bressan, S,,
Goh,
C.H., Levina,N,,Madnick,S.E,Shah,A.,andSiegel,M.D.
"ContextKnowledge
RepresentationandReasoningintheContextInterchange
System"
AppliedIntelligence(13:2), Sept.2000,pp. 165-179.Fan,W.,Lu,H.,Madnick,S.E.,and
Cheung,
D."Discoveringand Reconciling ValueConflictsforDataIntegration,"MIT
SloanSchool
Working
Paper4153,December,
2000.Firat,A.,Madnick,S. E.,andSiegel,
M.
D."The
Cameleon
Web
Wrapper
Engine,"Proc.VLDB
2000 Workshop
onTechnologies forE-Services,Sept., 2000.
Galpaya,H. "Financial Servicesand DataHeterogeneity;
Does
StandardizationWork?",
MIT
SM
Thesis,September,2000.Goh.
C H
, Bressan, S,, Madnick,S, E.,andSiegel,M.
D."ContextInterchange:New
FeaturesandFormalismsfortheIntelligentIntegrationofInformation,"
ACM
Trans, onOfficeInformationSystems, July 1999,pp
270-293.Huang, K.,Lee,Y.,and
Wang
R. Y.Quality'Informationand
Knowledge,Upper
SaddleRiver,New
Jersey:Prentice-HaIl,1999.
Kon, H., Madnick,S.E.,andSiegel,
M.D.
"Good
Answers
fromBad
Data: aDataManagement
Strategy,"Proc. 5thInternational
Workshop
on Information Technologiesand
Systems,Amsterdam,
Netherlands, 1995.Madnick,S.E."Metadata Jones andthe
Tower
ofBabel:The
Challengeof Large-ScaleHeterogeneity," Proc.IEEE
Meta-Data
Conf,Apn\
1999.Moulton,A.,BressanS., Madnick,S. E.,andSiegel,
M.
D."An
ActiveConceptualModel
forFixedIncome
SecuritiesAnalysisfor Multiple FinancialInstitutions,"Proc.
ER
1998, pp.407-420.Moulton,A.,Madnick,S. E.,andSiegel,