>>
^KCBff.
'*A
V
3j
MBnAstms.
^
no.
3157
WORKING
PAPER
ALFRED
P.SLOAN
SCHOOL
OF
MANAGEMENT
BUILDING
FLEXIBLE,
EXTENSIBLE
TOOLS
FOR
METADATABASE
INTEGRATION
MichaelSiegelArnon
RosenthalNovember
1991WP
# 3357-91-MSA
WP#CIS-91-11
MASSACHUSETTS
INSTITUTE
OF
TECHNOLOGY
50
MEMORIAL
DRIVE
CAMBRIDGE,
MASSACHUSETTS
02139
BUILDING
FLEXIBLE,
EXTENSIBLE
TOOLS
FOR
METADATABASE
INTEGRATION
MichaelSiegelArnon
RosenthalNovember
1991WP#
3357-91-MS
A
WP#C1S-91-11
M.I.T.
UERARIES
IBuilding
Flexible,
Extensible
Tools
for
Metadatabase
Integration
Arnon
Rosenthal Michael SiegelMITRE
Corporation Sloan School ofManagement
arnie@niitre.org Massachusetts Institute ofTechnology
msiegel@sloan.mit.edu
1
Introduction
Applications that
span
multiple existing filesand
databasesneed
an
integrated descriptionof the information that they access (i.e., an integrated metadatabase).
As
thenumber
of cooperating
systems
increases, thedevelopment
and maintenance
of this integratedde-scription
becomes
extremelyburdensome
for administrators. Largeamounts
of descriptiveinformation
must
be
integratedand
much
data
modeling
skill are required.Generating and
maintaining the relationshipsamong
thecomponent
schemas
canbe-come
an
administrative nightmare, especially in federations ofautonomous
data sources each ofwhich
is free tomake
changes toitsmetadatabase.
As
described in thecomponent
metadatabases, data from
these sourcesmay
differ in representation (e.g.,data
models),structure (e.g.,
data
formats, databaseschema
designs) or in semantics (e.g., differingdef-initions of
GNP).
Therefore, developingand
maintainingan
integratedview
ofeven a fewautonomously
operated sourceswould be
a difficult task,and
thus requiresan
automated
metadatabase
integration tool.Such
a toolmust
assist in developingan understanding
ofthe semantic connections
between metadatabases and from
this understanding,produce
a
combined
metadatabase,
includingmappings
between
thecombined
database
and
thecomponents
databases^.The
literatureon metadatabase
integration consists mostly of techniques identifyingrelated information in the
metadatabases
being integratedand
techniques for resolvingdif-ferences
between
thetwo
metadatabases' treatments.Most
ofthe publishedwork
deals withschema
information only, in the relationalmodel
orfor variousextended
entity-relationship(EER)
models.Even
papers that containsystem
overviews (e.g., [BLN86]), give littleattention to
pragmatic
issues thatwe
consider crucial - flexible usage,and
integratorcus-tomization
and
extension.Our
goal is tojustifyand
call attention tothe requirements,and
toprovideaframework
forresearch
on metadatabase
integration tools.We
envisionatool that supports: scheduling'The term metadatabase integration tool is used to emphasize that the users' problem (and hence the
tool's task) isbroader thanjust integratingschemas- thetool shouldalso handle other typesof descriptive
information. For example, attributes
may
be annotated with security level, audit priority, and severalflexibility, to aJlow integration to be
done
in different stagesand
along different levelsof expertise; metadatabase evolution, to allow incremental
development and
evolution ofthe
component
metadatabases
and
the organization's metadatabase;model
extensibility, tocapture descriptive information that is not standardized across the industry;
and method
extensibility, to
add
new
orsite-specific integrationmethods.
Finally, despiteall theflexibleorderings
and
extensions, the integrator shouldmake
minimize
the requirements for user interaction.Section 2 describes
unmet
pragmatic requirementson
integratortools,and
in Section2.1we
illustratethem
with asample
session.An
architecture thatcanmeet
these requirementsis described in Section 3.
Our
work
is intended to bemodel
independent, e.g., to apply tointegrating pairs ofrelational databases, pairs of
EER
schemas, or pairs ofobject-orientedschemas. Finally, Section 4 presents our conclusions.
2
Pragmatic Requirements on Metadatabase
Integration
In this section
we
present asample metadatabase
integration session todemonstrate
therequirements for a
metadatabase
integration tool.Then,
we
describe theimportance
of flex-ibilityand
extensibility inmetadabase
integration. In Section 3we
propose an
architecture to address these requirements.2.1
A
Sample
Session
This
sample
integration session illustrates scheduling flexibilityand
the extensibility thatusers
need
and
our architecture supports. It also illustrates the use of small, concretequestions for elicitinginterschemarelationships
from
users. It is not intended toillustrate asophisticated collectionofrules- ourinterest isinthe
framework
and
thestyleof interaction.For simplicity, the
sample
session integrates relations rather than entitiesand
relationships.Considerthe
problem
when
two
organization with overlappingactivitiesdecide tomerge
theirinformation systems.
The
first stage ofthis effort is toprovideaunifiedview
of thetwo
companies' data while continuing
autonomous
operation of thecomponent
databases.The
information
system
integration project begins with a pilot project to integrate personneldata.
Ms. Smith
ofHuman
Resources is the expert assigned towork
with the InformationTechnology
(IT) group.The
project will use a vendor-supplied integrator that deals withstructural integration issues in relational databases.
The
IT group
helps her develop a user profile for their effort.The
default profile for a nontechnical user is breadth first (i.e., completing a higher level before consideringdetails)and
excludes rules that deal withmathematical
dependenciesand
with versioning.The
group
decides tomake
one change
to the default - assoon
as acombinable
pair of relationsisidentified, detailsofits integration willbe
worked
out beforeproceeding toother relations.Ms.
Smith
identifies the relations in each database that dealwith personnel,and
thein-tegratoris instructed torestrict its attention to these relations.
She
thentells the integratorto begin execution.
A
heuristic rule suggests that the relations. Database!.EMP
and
Database2.EMPL,
Company
1 -Databasel
Company
2 -Database2
Relation
EMP
RelationEMPL
Emp#
-{il:i^#-i^i^-ii:#i^ii:)
E#
-(###-##-####)
Name
- Char(40)EName
- Char(36)Annual_Salary - Integer Weekly.Pay - Integer
Age
- Integer Date-of-Birth - Date... etc. ... ... etc. ...
Figure 1:
Two
Relationsfrom
theSample
Databases
is asked
what
the concept should be called in thecombined schema.
Instead of choosingone
ofthe suggestionsEMP
and
EMPL,
Ms. Smith
typesEMPLOYEE.
Ms.
Smith
tells thesystem
that thetwo
E#
attributes (i.e.,Emp#
and
E#)
aremergeable,
and
thatName
and
EName
are mergeable.She
accepts the defaultnames
forthe
merged
attributes,E#
and
EName.
According
to her schedule,it is not yettime
for herto deal with datatypes.
However,
thesystem
knows
that its rule forcombining
datatypesrequires
no
human
interventionwhen
there isno
conflict (i.e.,E#,
Emp#).
In order tomaximize
progresstoward
acombined schema,
thesystem
executes this noninteractiverule.Next, using a local thesaurus of Personnel jargon, the integrator guesses that
An-nual_Salary
and
Weekly
.Pay
may
be
mergeable.When
Ms.
Smith
replies yes, theinte-grator asks for a
name
for thecombined
attribute,and
away
tocompute
the proper value.She
tells the integrator to defer this issue until later. Partway
through
the attribute list,Ms.
Smith
decides that her assistant is better able to handle definitions ofEMPLOYEE
attributes.
When
thesystem
next requests informationfrom
her, she callsup
a controlscreen
and
tells thesystem
to proceed to the next type of descriptive information.The
integratornow
gathers information necessary todetermine
the constraintson
thecombined
relationand
themappings from
thecomponent
relations tothecombined
relation.The
number
ofpossibilities can be substantial,and
the user is nontechnical, so ratherthan
present possible
combined schemas and mappings,
the integrator askssome
simple questions (Ms. Smith's answers are boldand
underlined):• Ifrecords
from
EMP
and
EMPL
have
thesame
key valueE#,
must
they refer to thesame
EMPLOYEE?
(Yes.
No. Defer)•
Can
recordsfrom
EMP
and
EMPL
with differentE#
values refer to thesame
EM-PLOYEE?
{Yes,No,
Defer)•
Can
thesame
EMPLOYEE
be inboth
relationsEMP
and
EMPL?
(Yes,
No, Defer)A
setofrules inthe integratornow
infers that recordswith thesame
E#
should becombined,
that
EMPLOYEE
istheouterjomofEMP
and
EMPL,
and
thatE#
is akeyofEMPLOYEE.
The
results of attribute integration are used to specify the target list of the outerjoin.The
team
continues filling in details ofEMPLOYEE,
and
then of other relations.Sud-denly, the auditors insist that the
combined metadatabase
beextended
to include accesscontrols.
The
component
metadatabases
include access listson
each attribute, but thevendor-supplied integrator does not handle this information.
A
member
of theteam
writesa
new
rule that helpsa
user tomerge
access lists interactively; theserules are to be appliedwhenever
a pair ofattributes are merged.The
programmer
registers thenew
rule with theintegrator,
and
it isimmediately
applicable.Afterlunch,
one
member
decides togenerateaprototypecombined metadatabase.
Many
of the rules (e.g., for datatype integration) can determine a confidence level for their
de-fault action;
when
the integrator runs with low required confidence, these rules willmake
most
decisions without user interaction.The
prototype is generated rapidly.Meanwhile,
work
proceedson
the accurateschema,
and
discussions beginabout
an enterpriseconcep-tual model.
The
integrated metadatabase,though
not complete,may
be used to developapplications that require data
from
portions ofboth
databases.From
this scenario,we
conclude that several specific capabilitiesought
to be includedin the integrator's skeleton: redefinition ofthe scope for the
immediate
integration effort;priority to fully-
automated
ruleswhenever
all necessary inputs are available; deferral ofinconvenient questions; redefining the set of active rules; simple, user-friendly questions
whose
answers
eventually permit inference ofmore
complicated relationships; reactivationofdeferred questions; avoidanceof
redundant
questions;adding
new
rules tocope
withnew
kinds of
metadata;
and
fast prototyping.2.2
The Need
forFlexibility
and
Evolution
The
scenario identifiedtwo
basic properties that an integratorought
to have,and
that thecurrent research literature ignores - flexibility
and and
extensibility.We
now
summarize
the desired capabilities:
1. Scheduling Flexibility -
The
integratormust
allow its user to control the order of taskexecution, inorder to
adapt
to project goals or availability of resources. Forexample,
the user
must be
allowed to choose the initial goals -a broad
enterprisemodel
thatpresents
major
entities, relationships,and
attributes for thewhole
organization, oralternatively a
narrow
butcomplete
pilot project (e.g., integrating inventoryinfor-mation
from two
corporations that are being merged).Hybrid
goals alsomake
sense,such as detailed integration ofinventory information, plus
a
top-level integration forrelated areas such as order entry.
A
large integration task requiresmany
kinds of expertise (e.g.,about
CAD
tech-niques, design
management,
manufacturing,and
security). Questionsmay
need
to bedeferred until the expert is available.
When
experts are interviewed, the integratorshould concentrate
on
questions relevant to their expertise. Also,spontaneously-offered information should be captured, even ifthe integration
methodology
relegatesit to another stage.
2. Extensibility by User Sites -
New
integration techniquesappear
frequently in theI UserControl
combined metadatabase.
In general,we
prefer that the rules obtainand
theISDb
storeinformation in small, concrete increments, as illustrated by the scenario's questions
about
matching
keys(E#).
Such
small questions tend to bemore
understandableby
users,and
require less rollback ifadecision is to be changed.
The
IDb
includesinformationon
object mergeability,intercomponent
semanticrelation-ships,
and
thecombined schema.
3.2
Rules
Rules are the unified
mechanism
for expressing all the integrator's work, used forboth
high-level decisions (e.g.,
which
relations shall be merged),and
low-level decisions (e.g.,how
toresolve conflicts
between
specifications of String(9) or Integer for Partes).The
unifiedtreatment facilitates tracking
and
explaining theimpact
of evolution in thecomponent
databases or the interschema semantics.
In
many
rule languages, oneassumes
that each predicate in a rule is inexpensive toevaluate.
A
schema
integrator, though, includesmany
rules that require user input.Such
input is normally
more
costly thanany
fully-automated rule, so the formatfrom
rules ischosen to
minimize
thenumber
ofinteractionsand
other expensive operations.A
rule has three parts: a precondition, a body,and an
action.The
descriptionbelow
is
an
attempt
to allow forperformance
optimizations (e.g., possible early evaluation ofsome
predicates), whileminimizing
the interactions requestedfrom
theuser.Note
also thatexecution of the entireset ofrules can take days,
and
it is possible tointroduce informationthat conflicts with previous decisions,
and
for the user to edit theISDb
directly, possiblyinvalidating the results of early evaluation of predicate conjuncts.
The
precondition is a conjunct ofBoolean
terms (with the structure visible to aid inperformance
optimization). Furthermore, the precondition should normally include only termswhose
evaluation is fully automatic.The
system
is permitted to evaluate predicates in the precondition repeatedly, as conditions change.The
body
is evaluated only afterthe entire precondition
becomes
true forsome
binding^.Normally
(i.e., ifnot deferred orrestarted) the
body
will be evaluated just once for each rule binding. Costly operationsshould
appear
in thebody
rather than the precondition.The
action is arbitrary; only it canmodify
theISDb.
Rule
format is illustrated below, fortwo
simple rules.The
bodieshave been
simplified toremove
user-interactions thatmight
otherwise be present.The
outcomes
insert an asser-tionabout
mergeability of thetwo
attributes, orabout
the security level of thecombined
attribute.
Mergeable_attribs(Al,
A2)
/* tells whetherAl
andA2
should be combinedPrecondition: (The relation containing
Al
is mergeable with the relation containingA2)
Body: {If
Name(Al)
=
Name(A2)
then IDbJnsert(Mergeable(Al,A2))}CombineJSecurity_LeveIs(Sl,S2)
/* determineshow
the security levels should beintegratedfor two attributes that are beingcombined
Precondition: Si is the security property of attribute Ai from
component
database i,and
A1,A2
are to be merged.A12
denotes theresulting attribute.Body: {IDbJnsert(Security(A12)=
maximum(Sl,S2)
)}3.3
User
Control
of
the
Integration
Process
ConventionaJ rule-based applications often
run autonomously,
or else interactively oversec-onds
orminutes.Schema
integration lasts for days (i.e.,ormay
continuouslyevolve), duringwhich
themetadatabases
may
bechanged due
to external events or to editsby
the userrather
than by
the action of a particular rule.As
illustrated in the scenario, the usertherefore needs to control the order of tasks. Control can
be
direct or indirect.Users can obtain direct control in three ways: i) the controller can be asked to return
control periodically or
upon
reaching certain points (analogous toa
debugger
steppingthrough
or reaching a breakpoint); ii)an
interrupt can abort the currently-executing rule; iii)whenever a
user supplies input requestedby a
rule, the interface allows the user to takecontrol actions.
The
user hastwo
basic choices after obtaining control. First, he or she can select asetof candidate rule bindings,
and
causethem
to be invoked immediately. Second, the usercan browse
and
edit all information in theIDb
(i.e., subject to authorization).A
controller can consider several kindsof control information in order tochoose the nextrule binding to be executed.
Here
we
present threeexamples
of the types of control that areimportant
in the integration process - scoping, deferral,and
directionality. Allwere
illustrated in the scenario.
Scopes
are views over the rule baseand
database.Rather than
directly execute all stored rules over all information in the
IDb,
a
scope provides a subsetof each. Users can request execution within a scope (e.g.,
Ms.
Smith's scopewas
personnel)that is
more
targeted to their interestsand
asmaller set ofrules (e.g., excluding rules thatdeal with technical details ofdatatypes
and
security).The
integratormust
be
able todefer
a rule bindings forwhich
the useris not preparedto furnish the
answer
(e.g.,Ms.
Smith
chose to defer interactions involving datatypes).The
IDb
includes a history ofexecuted rule bindings,some
ofwhich
aremarked
executed-deferred. Bindings to
be
reactivatedmay
be
selectedby
specialcommands
orby an
ordinarydatabase query.
Finally, the user should be able to influence the
direction
of integration. Forexample,
for hierarchical
metaschemas
the usermay
choose togo "down".
Thiswould
mean
tonext apply candidate bindings associated with the
components
of ametatype,
e.g., afterdetermining
thattwo
relations canbe merged,
obtain informationabout
the first childin the
metaschema
(e.g., Relation_name, Attribute-List, Constraint_List).Another
choicewould be
"same_rule"where
the user chooses toapply thesame
rule again but with different bindings.3.4
Controller
and
Control
Strategies
The
controller determines the next rule binding to execute (one at a time). It is a keyvariable in the integrator's usability, but to date the control process has received little
research attention in the database
community.
Any
rule bindingwhose
precondition isTrue
and which
is not recorded as "executed" iscalled a candidateforexecution.
A
controllermust
invoke only candidate rules,and
notter-minate
untilor candidatesareexhausted, ortheuserissuesan
exitcommand.
The
controllerstores the history of executed rule bindings, with return codes
and
other statusinforma-tion.
The
history is part of the IDb,and
may
be referenced by rules,and
editedby
built-incommands.
Ideally, the controller willbe
built asan
enhancement
of a general-purposerule engine, but it is not clear
whether
off-the-shelf rule systems allow their controllers tobe
modified in this way.An
additional responsibility of the controller is to use indexing,eager evaluation,
and
similar techniques tominimize
the user's time waiting for the nextinteraction.
4
Summary
We
have
identifiedunmet
pragmatic
requirementsof long, interactive design processes suchas
metadatabase
integration.We
describedsome
ways
toadapt
a generic rule-basedar-chitecture to
meet
those needs.Our
designwas
quite preliminary, but points theway
forfuture research.
The
proposed
architecture helps an integrator satisfy the goals offlexibility by allowingintegration tasks (i.e., rule bindings) to be invoked in arbitrary order as long as each
pre-condition is satisfied. Several
mechanisms were proposed
for influencing the choice ofnextrule to be invoked. Users can invoke sets of candidates, explicitly.
The
scopeand
directionof integration is adjustable at
any
time. Users can freely defer tasks (i.e., rule bindings),and
have great flexibility in determiningwhich
deferred bindings should be reactivated.The
proposed
integrator provides for extensibility by using a rule-basedapproach
to defineall integration techniques.
New
techniques can be freelyadded
to the rule base, to allowusers to include techniques tailored to local conditions.
The
integrator will invoke thesenew
ruleswhen
their preconditions are satisfied. Thismechanism
handles both the resultsof
new
research,and
locally-developed rules to handle local kinds ofmetadata.
In a future database system, the
metadatabase
integrator willbe
justone
ofmany
tools. It will
depend on
other tools that help with evolution of multilevelschema
struc-tures,
and
will shareknowledge
with other tools that deal withdata
semantics.There
isa
substantial functional overlapbetween
schema
integrators, intelligent query-formulationassistants
[KBH89,KN89],
and
intelligent assistants that reconcile database contents to auser view [SM91]:
Much
of theknowledge
that they use is similar (e.g.,about
identifyingsimilar concepts
between two
views, orabout
necessary conversions). Also, all three toolsgenerate
mappings
from
some
final view (the integratedschema
orthe user's desired query)to the lower level structures used in
implementing
that final view.However,
the facilitiesfor controlling
and
influencing rule execution are likely to be quite different forschema
to
be
formidable challenge that will requirebroad
knowledge
ofmany
tools.Acknowledgements:
Thiswork
hasbeen funded
in partby
the International Financial Research Services Center atMIT,
National ScienceFoundation Grant #IRI902189, Xerox
Advanced
Information Technology,and ETH-Zurich.
References
[BLN86]
C. Batini,M.
Lenzerini,and
S. Navathe.A
comparative
analysis ofmethodologies
for database
schema
integration.ACM
Computing
Surveys, 18(4):323-364, 1986.[KBH89]
L. Kerschberg, R.Baum,
and
J.Hung.
Kortex: an expert databasesystem
shellfor a
knowledge-based
entity-relationship model. InThe Conference on
theEntity-Relationship Approach, Toronto, 1989.
[KN89]
M.
Kracker
and
E. Neuhold.Schema
independent query
formulation. InThe
Conference
on
the Entity-Relationship Approach, Toronto, 1989.[McC84]
J.McCarthy.
Scientific information=
data
+
meta-data. InDatabase
Man-agement:
Proceedings of theWorkshop November
1-2, U.S.Navy
PostgraduateSchool, Monterey, California,
Department
of Statistics Technical Report,Stan-ford University, 1984.
[SM91]
M.
Siegeland
S.Madnick.
A
metadata approach
to resolving semantic conflicts. In Proceeding of the 17th International Conferenceon
Very LargeData
Bases,,A
., c;MIT LIBRARIES DUPL
Date
Due^
MIT LIBRARIES