• Aucun résultat trouvé

Cross-organizational data quality and semantic integrity : learning and reasoning about data semantics with context interchange mediation

N/A
N/A
Protected

Academic year: 2021

Partager "Cross-organizational data quality and semantic integrity : learning and reasoning about data semantics with context interchange mediation"

Copied!
16
0
0

Texte intégral

(1)

MIT LIBRARIES DUPL jlll Ijl 111III

I

(2)
(3)
(4)
(5)

1D28

.M414

f

[DEWEY

Business

i>t:

M

I

T

MIT

Sloan School

of

Management

Sloan

Working

Paper

4183-01

eBusiness@MIT

Working

Paper

108

October

2001

CROSS-ORGANIZATIONAL

DATA

QUALITY

AND

SEMANTIC

INTEGRITY:

LEARNING

AND

REASONING

ABOUT DATA

SEMANTICS

WITH CONTEXT

INTERCHANGE

MEDIATION

Allen

Moulton,

Stuart

E.

Madnick,

and

Michael D.

Siegel

This

paper

is

available

through

the

Center

for

eBusiness@MIT

web

site atthe

following

URL:

http://ebusiness.mit.edu/research/papers.html

This

paper

also

can

be

downloaded

without

charge

from

the

Social

Science Research

Network

Electronic

Paper

Collection:

http://papers.ssm.

com/abstract=289992

(6)

MASSACHUSETTS

INSTITUTE

OF

TECHNOLOGY

Ui!

3 2002

'!'!

(7)

Cross-Organizational

Data

Quality

and Semantic

Integrity:

Learning

and Reasoning

about

Data Semantics

with

Context

Interchange

Mediation

Allen

Moulton,

Stuart

E.

Madnick,

and

Michael

D.

Siegel

MIT

Sloan

School of

Management

amoulton@mit.edu,

smadnick@mit.edu, msiegel@mit.edu

Abstract

IVe use a context interchange mediation

approach

for detecting

and

resolvingdata quality

and

semantic

integrity conflicts in infi)rmation

exchanged

across organizational boundaries. Contextmodels

draw

on a

domain

ontology to explain

how

source

and

receiver data models implement generalprinciples ofthe

subject domain. Using the declarative

knowledge from

the

domain

ontology

and

context models, the

mediator writes a query plan meeting receiversemantic requirements

from

autonomous, heterogeneous

sources.

Examples drawn

from

fixed

income

securities investments illustrate

problems and

solutions

enabledbycontextinterchangemediation.

Introduction

Efficiently integrating

new

sourcesofinformation fromoutsidethe enterpriseisoftencritical tosuccessinaworld ofglobal

competition, interdependency, and rapid market change. Traditional

means

ofassuring data quality and semantic integrity

focus on data within the control ofthe organization

(Huang

et al., 1999). Without

some

automated assistance, internal

planning andcontrol

mechanisms

forassuring data qualitywillhavedifficulty in respondingina timely

manner

tochanging

demands. Within an organization, datacan becreated, stored, and used by people and computerssharinga

common

implicit

understandingofdata semantics.

We

usetheterm contexttorefer to this implicit understanding ofthe relationshipbetween

data elements and structures and the real world that the data represents.

The

context interchange problem arises

when

organizationswithdifferent conte.xts

must

exchangeinformation(Madnick, 1999).

A

context interchange

(COIN)

mediatorisan automatedreasoningengineto assistanorganizationinlearningabout semantic

conflictsbetweenits

own

receiver contextandthecontextsofdatasources

(Goh

etal., 1999).

Because

context definitionsare declarative, theyneedonly bepreparedonceforeachsourceand receiver context(Bressan etal.,2000). Datasources

may

be

relationaldatabases,

XML

documents,

HTML

webs wrapped

toappearas relationswith limitedquerycapability (Firatetal.,

2000), and stateless computational procedures. Using declarative context knowledge, a

COIN

mediator identifies semantic

conflictsanddesigns plansfor

combining

sourceswithdataconversionsto

meet

receiversemanticrequirements.

Given

alarge

number

of

component

systemsoperatinginadiversifiedand

dynamic

environment,

COIN

mediationfacilitates

rapid incorporation of

new

information sources,

dynamic

substitution ofinformation sources, extension and evolution of

semantics, data representation inthe user's context,accesstothe

meaning

ofdata represented, identificationandselectionof

information sourcealternatives,andadaptationtochanges inuserandbusiness operations.

Research Motivation

In the fixed income securities industry, portfolio

managers

may

need to

draw upon

external sources fordataaboutsecurity

characteristics, formarketvaluation information,andfor models andcalculations(Moulton, Madnick,et al., 1998). All these

sources

may

needtobe

combined

withinternalportfolioholdings dataand used

by

a decision support applicationsystem(see

Figure 1).

Suppose

the receiver system requires a security valuation as a "dollar price" expressed as a percentage with

fractions in 32nds. Ifa source offers such a price in the required form, data can be simply sent

from

source to receiver

unmodified.But suppose thatthebestsourceformarketinformationoffersvaluationsas"nominal spread" expressedinbasis

points (lOOths ofapercent).

To

meet

the receiver's requirements, general industry

knowledge

and additional data sources

(8)
(9)

Cross-OrganizationalDataQualityand Semantic Integrity Fixed

Income

Investment

Portfolio -'A'

Manager

RECEIVER

Portfolio

y-^

Management

|

«

Application i

System

^-i

SOURCES

r

Context

Interchange

Mediator

mT

Portfolio Holdings Bond Calculation Functions

FigureI. Fixed

income

securitiesinvestmentmediation scenario

To

resolve thesemanticconflict,themediatormust

know

1)nominal spread

means

thedifference

between

yieldona security

and a

benchmark

yield, 2) the on-the-run 10-year T-note yield is an appropriate

benchmark

provided

by

another source

expressed as apercentage, 3) a

bond

calculation from anothersource can convert yield toprice giventhe security's interest

rate and other details in factor form. 4) another source provides security details, and 5)

methods

for converting

among

percentages with orwithout32nds,basis points, andfactorform.

Research

Strategy

Our

research strategy includes investigation ofpractical problems of semantic integrity and information integration across

autonomous

sourcesandreceivers,includingfinancialservices(Galpaya, 2000), equitysecuritiesanalysis(Fan, etal.,2000),

and fixed

income

securities investments (Moulton, Madnick, et al., 1998).

Based on

these industry studies,

we

have

de\elopedprototypesof

COIN

mediator

knowledge

representationsand reasoningengines.

These

prototypesandtheoretical

proofs areusedtoevaluateapproachestopracticalproblems of semantic interoperability.

Current

COIN

Mediation

Research

Building onearlier

work by

Goh

etal. (1999),

we

areexploring

knowledge

representationand reasoning

methods

toexpand

the functionality of

COIN

mediation to include: 1) identifying data representationconflicts and introducing conversions to

transform data from sourcetoreceiver form, 2)applying

domain

ontologyandcontext

knowledge

to

map

betweenreceiver

schema

and source schemata, 3) determining

when

and

how

to

combine

sources, feeding data from one source to another

with appropriate data representation conversions, 4) derivingmissingdata

by

applying

domain

ontology, contextknowledge,

or by combining sources.

Where

possible,

we

employ

a

knowledge

representation consistent with

common

system design

practices(e.g.,

UML,

E-R,andrepositories).

COIN

mediation isbased

on

thesemanticproposition thatinterchangeofinformation ispossible

when

sources andreceivers

share a

common

subject domain. Sourcesandreceiversareseenas

autonomous

implementationsof

common

subject

domain

abstractions. Source andreceiversystem designers

make

decisionsabout

how

to conceptualizeabstract constructsand about

how

to represent those conceptualizations in data.

The

mediator uses declarative information about source and receiver

contexts,as

shown

in Figure 1,todeviseaplanforintegratingsourcesanddataconversionsto

meet

receiverneeds.

The knowledge

used for mediation consists

of

1) a

domain

ontology-containing abstractsubject matter conceptualizations

that

would

be

known

to experienced practitioners and systemdesigners in the industry, and 2)data models foreachsource

and receiver with the kind ofinformation

programmers

would

use to access data, 3) context

models

for each source and

receiverthatexplain

how

eachsource or receiver data

model

implementstheabstractconcepts froma

domain

ontology.

The framework

ofa subject

domain

ontology is a structural conceptual

model

with classes ofabstractobjects, attributes of

(10)
(11)

Cross-OrganizationalDataQualityand SemanticIntegrity

conceptual categories represent object propertydistinctions that

may

be implementeddifferently

by

eachsourceandreceiver.

Rulescapture functional relationships

among

conceptual

model

object attributes thatcanbe determined fromgeneral

domain

knowledge. Default and contingentrulesareused forderivingattribute relationships frompartial information, followingthe

same

reasoningthat industry participants

would

use.

Data models for relationaldatabase sources

come

from

schema

andcataloginformation. For

XML

sources,datamodels

may

be obtained from an

XML-schema

or

DTD

or reverse engineered from

documents

themselves. For

HTML

sources, data

models are provided

by

web

wrapping. For computational procedural sources, arguments and return values are treated as

relational attributes inadata

model

that is

augmented

with functional

dependency

andinput-outputcombinationconstraints.

Context

models

foreach source and receiver explain

how

each data

model

implementsthe general concepts in the

domain

ontology. Classes from the

domain

ontology structural conceptual

model

may

be used directly or

augmented

with

context-specific extensions.Context-specific functional orequivalencerelationshipstieelements oftheconceptual

model

toelements

ofthedatamodel. For coding schemes, enumeratedattribute

domains

inacontextare

mapped

to conceptual categories from

the

domain

ontology. Semantic types logicallyencapsulate data attributes and associate context-specific modifier valuesto identify theparticulardatarepresentationused

by

asource orreceiver.

A

domain

ontology is not a globalschema. Rather,it isan abstractrepresentation ofthesubjectmatterthateachdata

model

implements in its

own

way. Neither sources nor receivers need to accept the

domain

ontology as the "right

way"

of

representinginformation aboutthe subjectmatterathand,avoiding

some

ofthe practical useracceptance problems noted

by

Moulton, Bressanet al (1998).

By

allowing each context

model

toextendthe

domain

ontology andtoexplain

how

context-specific concepts

map

to general

domain

ontology concepts,mediation is facilitated without imposing the rigidity of

view-based systems.

The

mediator uses

domain

ontologies and context models internally within its reasoning process without

exposing

them

to sources or receiver.

The

mediator accepts relational queries against a receiver data

model

and writes a

queryplanthatusesonlysource datamodels,ineffectdesigningacustomizedview.

Conclusion

Context interchange mediation brings automated

methods

to the important task of assuring that data exchanged across

organizationscan

meet

thedata qualityand semantic integrityrequirementsofthe receiver

-

and

do

sowithoutrequiringthe

source organizationsto

accommodate

theneeds ofthe receiver,or the receivertoadjusttoeithersources or the mediator.

References

Bressan, S,,

Goh,

C.H., Levina,N,,Madnick,S.E,Shah,A.,andSiegel,

M.D.

"Context

Knowledge

Representationand

ReasoningintheContextInterchange

System"

AppliedIntelligence(13:2), Sept.2000,pp. 165-179.

Fan,W.,Lu,H.,Madnick,S.E.,and

Cheung,

D."Discoveringand Reconciling ValueConflictsforDataIntegration,"

MIT

SloanSchool

Working

Paper4153,

December,

2000.

Firat,A.,Madnick,S. E.,andSiegel,

M.

D.

"The

Cameleon

Web

Wrapper

Engine,"Proc.

VLDB

2000 Workshop

on

Technologies forE-Services,Sept., 2000.

Galpaya,H. "Financial Servicesand DataHeterogeneity;

Does

Standardization

Work?",

MIT

SM

Thesis,September,2000.

Goh.

C H

, Bressan, S,, Madnick,S, E.,andSiegel,

M.

D."ContextInterchange:

New

FeaturesandFormalismsforthe

IntelligentIntegrationofInformation,"

ACM

Trans, onOfficeInformationSystems, July 1999,

pp

270-293.

Huang, K.,Lee,Y.,and

Wang

R. Y.Quality'Information

and

Knowledge,

Upper

SaddleRiver,

New

Jersey:Prentice-HaIl,

1999.

Kon, H., Madnick,S.E.,andSiegel,

M.D.

"Good

Answers

from

Bad

Data: aData

Management

Strategy,"Proc. 5th

International

Workshop

on Information Technologies

and

Systems,

Amsterdam,

Netherlands, 1995.

Madnick,S.E."Metadata Jones andthe

Tower

ofBabel:

The

Challengeof Large-ScaleHeterogeneity," Proc.

IEEE

Meta-Data

Conf,

Apn\

1999.

Moulton,A.,BressanS., Madnick,S. E.,andSiegel,

M.

D.

"An

ActiveConceptual

Model

forFixed

Income

Securities

Analysisfor Multiple FinancialInstitutions,"Proc.

ER

1998, pp.407-420.

Moulton,A.,Madnick,S. E.,andSiegel,

M.

D."ContextMediation on WallStreet,"Proc.

CoopIS

1998,pp.271-27.

(12)
(13)
(14)

"^•^

Mte

Due

(15)
(16)

Figure

Figure I. Fixed income securities investment mediation scenario

Références

Documents relatifs

In this paper, we present a Linked Data-driven, semantically-enabled journal portal (SEJP) that currently supports over 20 different interactive analysis modules.. These modules

In practice, the data input of the analysis tool is a vector containing the covariance of data points with the point in which the error estimate has to be calculated.. Figure

In addition, depending on the data source, data quality metrics as well as prove- nance information can be added to the to-be-imported data sets and change the way the data is

Social DQ component is responsible for standardizing social-originated data according to DQ requirements, while Sensor DQ component is responsible for standardizing

Multiple studies have been carried out by collecting data from hospitals on traffic accident casualties, then comparing it with police reports to observe how

In this regard, we present BreXearch, a cross-lingual and cross-media semantic search system with a focus on the Brexit content, which allows us to retrieve media items from

In the business modeling context semantic mediation based on semantic bridges can be exploited to suggest matching information entities in process parts of different

Some recent analyses of random Schr¨odinger operators have involved three related concepts: the Wegner estimate for the finite-volume Hamiltonians, the spectral shift function