Learning Ontologies for the Semantic Web
Alexander Maedche
Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany
Ontoprise GmbH, Haid-und-Neu Strasse 7 76131 Karlsruhe, Germany
ama@aifb.uni-karlsruhe.de
Steffen Staab
Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany
Ontoprise GmbH, Haid-und-Neu Strasse 7 76131 Karlsruhe, Germany
sst@aifb.uni-karlsruhe.de
ABSTRACT
The Semantic Web relies heavily on the formal ontologies
that structureunderlying datafor the purpose of compre-
hensive and transportable machine understanding. There-
fore, thesuccessof theSemanticWebdependsstronglyon
theproliferationofontologies,whichrequiresfastandeasy
engineeringofontologiesandavoidanceofaknowledgeac-
quisitionbottleneck.
Ontology Learning greatly facilitatesthe constructionof
ontologiesbytheontologyengineer. Thevisionofontology
learningthatweproposehereincludesanumberofcomple-
mentarydisciplinesthat feedondierenttypesofunstruc-
tured,semi-structuredandfullystructureddatainorderto
supportasemi-automatic,cooperativeontologyengineering
process. Ourontologylearningframeworkproceedsthrough
ontologyimport,extraction,pruning,renement,andeval-
uationgivingtheontologyengineerawealthofcoordinated
tools for ontology modeling. Besides ofthe generalframe-
work and architecture, we showin this paper some exem-
plarytechniquesintheontologylearningcyclethatwehave
implemented in our ontology learning environment, Text-
To-Onto,suchasontologylearningfromfreetext,fromdic-
tionaries,orfromlegacyontologies,andrefertosomeothers
thatneedtocomplementthecompletearchitecture,suchas
reverseengineeringofontologiesfromdatabaseschemataor
learningfromXMLdocuments.
1. ONTOLOGIES FOR THE SEMANTIC WEB
Conceptualstructuresthatdeneanunderlyingontology
aregermanetotheideaofmachineprocessabledataonthe
SemanticWeb. Ontologiesare(meta)dataschemas,provid-
ingacontrolledvocabularyofconcepts,eachwithanexplic-
itlydenedandmachineprocessablesemantics. Bydening
sharedand commondomaintheories, ontologies helpboth
peopleand machinestocommunicateconcisely,supporting
theexchangeofsemanticsandnotonlysyntax. Hence,the
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission by the authors.
Semantic Web Workshop 2001 Hongkong, China Copyright by the authors.
cheapandfastconstructionofdomain-specicontologiesis
crucialforthesuccessandtheproliferationoftheSemantic
Web.
Thoughontology engineering tools have become mature
overthelast decade(cf.[9]),themanual acquisitionof on-
tologies still remainsatedious, cumbersometask resulting
easily ina knowledge acquisition bottleneck. Havingdevel-
opedourontologyengineeringworkbench,OntoEdit,wehad
toface exactlythisissue,inparticular weweregivenques-
tionslike
Canyoudevelopanontologyfast? (time)
IsitdiÆculttobuildanontology? (diÆculty)
Howdoyouknowthatyou'vegottheontologyright?
(condence)
Infact,theseproblemsontime,diÆcultyandcondence
thatweendedup withweresimilarto whatknowledgeen-
gineershaddealtwithoverthelasttwodecadeswhenthey
elaborated on methodologies for knowledge acquisition or
workbenches fordeningknowledgebases. A methodthat
proved extremely benecial for the knowledge acquisition
taskwastheintegrationofknowledgeacquisitionwithma-
chine learning techniques [33]. The drawback ofthese ap-
proaches,e.g. theworkdescribedin[21],however,wastheir
ratherstrongfocusonstructuredknowledgeor databases,
fromwhichtheyinducedtheirrules.
Incontrast, in the Webenvironment that we encounter
whenbuildingWebontologies,thestructuredknowledge or
databaseisrathertheexceptionthanthenorm. Hence,in-
telligentmeansforanontologyengineertakesonadierent
meaning thanthe| veryseminal | integration architec-
turesformoreconventionalknowledgeacquisition[7].
OurnotionofOntologyLearningaimsattheintegrationof
amultitudeofdisciplinesinordertofacilitatetheconstruc-
tionof ontologies,inparticularmachinelearning. Because
the fully automatic acquisition of knowledge by machines
remains in the distant future, we consider the process of
ontology learning as semi-automatic withhumaninterven-
tion,adoptingtheparadigmofbalancedcooperativemodeling
[20]fortheconstructionofontologiesfortheSemanticWeb.
This objective inmind,we have builtanarchitecture that
combinesknowledgeacquisitionwithmachinelearning,feed-
ingontheresourcesthatwenowadaysndonthesyntactic
Web,viz.freetext,semi-structuredtext,schemadenitions
ferentstepsintheengineering cycle,whichhereconsistsof
thefollowingvesteps(cf.Figure1):
First, existing ontologies are importedand reused by
merging existing structures or dening mapping rules be-
tweenexistingstructuresandtheontologytobeestablished.
For instance, [26] describe how ontological structures con-
tainedinCyc are used inorder to facilitate the construc-
tion of a domain-specic ontology. Second, in the ontol-
ogy extraction phase major parts of the target ontology
are modeled with learning supportfeeding from web doc-
uments. Third, this rough outline of the target ontology
needstobepruned inordertobetteradjust theontology
toitsprimepurpose. Fourth,ontologyrenementprots
fromthegivendomainontology,butcompletestheontology
atanegranularity(also incontrasttoextraction). Fifth,
the prime target application serves as a measure for vali-
datingtheresultingontology[31]. Finally,onemayrevolve
againinthis cycle,e.g.for includingnewdomainsintothe
constructed ontology or for maintaining and updating its
scope.
2. THE ONTOLOGY LEARNING KILLER APPLICATION
Though ontologies and their underlying data in the Se-
manticWebareenvisionedtobereusableforawiderangeof
possiblyunforeseenapplications 1
,aparticulartargetappli-
cationremainsthetouchstonefor agivenontology. Inour
case, we have been dealing with ontology-based knowledge
portals thatstructureWebcontentandthatallowforstruc-
turedprovisioningandaccessingofdata[29,30]. Knowledge
portalsareinformationintermediariesforknowledgeaccess-
ingandsharingontheWeb. Thedevelopmentof aknowl-
edgeportal consists of the tasks of structuring theknowl-
edge,establishingmeansfor providingnew knowledgeand
accessingtheknowledgecontainedintheportal.
A considerable part of development and maintenanceof
theportal lies inintegrating legacy information as wellas
in constructing and maintaining the ontology in vast, of-
ten unknown, terrain. For instance, a knowledge portal
may focus on the electronics sector, integrating compar-
ative shopping in conjunction with manuals, reports and
opinionsaboutcurrentelectronicproducts. Thecreationof
thebackgroundontologyfor thisknowledgeportalinvolves
tremendouseortsforengineeringtheconceptualstructures
that underly existing warehouse databases, product cata-
logues, user manuals, test reports and newsgroup discus-
sions. Correspondingly, ontology structures mustbe con-
structedfromdatabaseschemata,agivenproductthesaurus
(likeBMEcat),XMLdocumentsanddocumenttypedeni-
tions(DTDs),andfreetexts. Stillworse,signicantpartsof
these(meta-)datachangeextremelyfastand,hence,require
aregularupdateofthecorrespondingontologyparts.
Thus, very dierent types of (meta-)data might beuse-
fulinputfor theconstructionofthe ontology. However, in
practiceoneneedscomprehensivesupport 2
inordertodeal
1
Justlike Tim Berners-Lee did not forsee online auctions
beingacommonbusinessmodeloftheWebof2000.
2
Comprehensive support with ontology learning need not
necessarily imply the top-notch learning algorithms, but
may rely more heavily on appropriate tool support and
dierenttechniques: Structureddataandmetadatarequire
reverseengineering approaches,freetextmaycontributeto
ontologylearningdirectlyorthroughinformationextraction
approaches.
3
Semi-structureddatamay nallyrequireand
protfromboth.
In the following we elaborate on our ontology learning
framework. Thereby we approach dierent techniques for
dierenttypesofdata,showingpartsofourarchitecture,its
currentstatus,andpartsthatmaycomplementourcurrent
Text-To-Onto environment.
Ageneraloverviewofontologylearningtechniquesaswell
ascorrespondingreferencesmaybefoundinSection9.
3. AN ARCHITECTURE FOR ONTOLOGY LEARNING
Giventhetaskofconstructingandmaintaininganontol-
ogy for a Semantic Webapplication, e.g. for an ontology-
basedknowledgeportalthatwehavebeendealingwith(cf.
[29]), wehaveproducedawishlistofwhatkindofsupport
wewouldfancy.
3.1 Ontology Engineering Workbench
OntoEditAscoretoourapproachwehavebuiltagraphicaluserin-
terfacetosupporttheontologyengineeringprocessmanually
performedbytheontologyengineer. Here,weoersophisti-
catedgraphicalmeansformanualmodelingandreningthe
nalontology. Dierentviewsareoeredtotheusertarget-
ing the epistemological level rather than a particular rep-
resentation language. However, the ontological structures
builttheremaybeexportedtostandardSemanticWebrep-
resentationlanguages,suchasOILandDAML-ONT,aswell
asourownF-LogicbasedextensionsofRDF(S).Inaddition,
executable representations for constraint checking and ap-
plicationdebuggingcanbegeneratedandthenaccessedvia
SilRi 4
, our F-Logic inference engine, that is directly con-
nectedwithOntoEdit.
Thesophisticatedontologyengineeringtoolsweknew,e.g.
theProtegemodelingenvironmentforknowledge-basedsys-
tems[9],wouldoercapabilitiesroughlycomparabletoOn-
toEdit. However,giventhetaskofconstructingaknowledge
portal,wefoundthattherewasthislargeconceptualbridge
between the ontology engineering tool and the input (of-
ten legacy data), suchas Web documents,Web document
schemata,databasesontheWeb,andWebontologies,which
ultimatelydeterminedthetargetontology. Intothisvoidwe
havepositionednewcomponentsofourontologylearningar-
chitecture(cf. Figure2). Thenewcomponentssupportthe
ontologyengineerinimportingexistingontologyprimitives,
extracting new ones, pruning given ones, or rening with
additional ontology primitives. In our case, the ontology
primitivescomprise:
asetofstringsthatdescribelexicalentriesLfor con-
ceptsandrelations;
3
Infact,ontologylearningforfreetextservesadoublepur-
pose. Onthe one hand it yields a readily exploitable on-
tology for Semantic Web purposes, on the other hand it
often returnsimprovedinformation extractionand natural
languageunderstanding meansadjustedtothe learnedon-
tology,cf. [10].
4
Apply
Domain Ontology
Extract
Refine
^c
Prune
Import / Reuse
Legacy + Application Data
Ontology Learning
Ontology Learning
Legacy + Application Data
Figure 1: OntologyLearningprocesssteps
asetofconcepts 5
|C;
ataxonomyofconceptswithmultipleinheritance(het-
erarchy)H
C
;
asetofnon-taxonomicrelations|R|describedby
theirdomainandrangerestrictions;
aheterarchyofrelations, i.e.a setoftaxonomicrela-
tionsH
R
;
relations F and G that relateconcepts and relations
withtheirlexicalentries,respectively;and,nally,
asetofaxiomsAthatdescribeadditional constraints
onthe ontologyand allow to make implicit factsex-
plicit[29].
Thisstructurecorrespondsclosely to RDFS,theoneex-
ceptionis theexplicit considerationoflexical entries. The
separationofconceptreferenceandconceptdenotation,which
may be easily expressed in RDF, allows to provide very
domain-specic ontologies without incurring an instanta-
neous conict when merging ontologies | a standard re-
questinthe Semantic Web. Forinstance,the lexicalentry
\school" inone ontology may refer to abuilding inontol-
ogyA,buttoanorganizationinontologyB, orto bothin
ontology C. Also inontology A the concept refered to in
Englishby\school" and\schoolbuilding"may bereferred
toinGermanby\Schule"and\Schulgebaude".
5
Conceptsinourframeworkareroughlyakintosynsetsin
Ontologylearningreliesonontologystructuresgivenalong
theselinesandoninputdataasdescribedaboveinorderto
propose new knowledge about reasonably interesting con-
cepts,relations,lexicalentries,oraboutlinksbetweenthese
entities|proposingtheaddition,thedeletion,orthemerg-
ing of some ofthem. The results ofthe ontology learning
processarepresentedtotheontologyengineerbythegraph-
icalresultsetrepresentation(cf. Figure4foranexampleof
howextractedpropertiesmaybepresented). Theontology
engineermay thenbrowse theresultsand decideto follow,
delete,ormodifytheproposalsinaccordancetothepurpose
ofhertask.
4. COMPONENTS FOR LEARNING ONTOLOGIES
Integratingtheconsiderationsfromaboveintoacoherent
generic architectureforextracting andmaintainingontolo-
gies from data onthe Webwe haveidentied several core
components. Thereare,(i),agenericmanagementcompo-
nent dealing with delegation of tasks and constituting the
infrastructurebackbone,(ii),aresourceprocessing compo-
nentworkingoninputdatafromtheWebincluding,inpar-
ticular,anaturallanguageprocessingsystem,(iii),analgo-
rithmlibraryworkingontheoutputoftheresourceprocess-
ing component as well asthe ontology structures sketched
aboveand returningresult setsalso mentionedabove and,
(iv), the graphical userinterface for ontology engineering,
OntoEdit.
Web documents
Domain Ontology Domain Ontology
Resource Processing Component
OntoEdit
WordNet Ontology O1 WordNet Ontology
O1
NLP System
Algorithm Library
Result Set
Lexicon 1
Inference Engine(s) Import
existing ontologies
Lexicon n Crawl
corpus
DTD DTD
Legacy databases
Management Component
Ontology Engineer
Import schema Import semi-
structured schema
O2
...
Figure2: Architecture for LearningOntologiesfor the Semantic Web
4.1 Management component
Theontology engineerusesthe managementcomponent
toselectinputdata,i.e.relevantresourcessuchasHTML&
XMLdocuments,documenttype denitions,databases, or
existingontologiesthatareexploitedinthefurtherdiscovery
process. Secondly, using the management component, the
ontologyengineeralsochoosesamongasetofresourcepro-
cessing methods available at the resource processing com-
ponentandamongasetofalgorithmsavailableinthealgo-
rithmlibrary.
Furthermore,themanagementcomponentevensupports
theontologyengineerindiscoveringtask-relevantlegacydata,
e.g. an ontology-based crawler gathers HTML documents
that are relevant to a given core ontology and an RDF
crawlerfollowsURIs(i.e.,uniqueidentiersinXML/RDF)
thatarealsoURLsinordertocoverpartsofthesofartiny,
butgrowingSemanticWeb.
4.2 Resource processing component
Resourceprocessingstrategiesdierdependingonthetype
ofinputdatamadeavailable:
HTMLdocumentsmaybeindexedandreducedtofree
text.
Semi-structured documents,like dictionaries,maybe
Semi-structuredandstructuredschemadata(likeDTD's,
structureddatabaseschemata,andexistingontologies)
are handeled following dierent strategiesfor import
asdescribedlaterinthispaper.
Forprocessingfreenaturaltextoursystemaccessesthe
naturallanguageprocessingsystemSMES(Saarbrucken
MessageExtractionSystem),ashallowtextprocessor
for German (cf. [24]). SMES comprises a tokenizer
basedonregularexpressions,alexicalanalysiscompo-
nent includingvariouswordlexicons, amorphological
analysismodule,anamedentity recognizer,apart-of-
speechtagger andachunkparser.
Afterrstpreprocessingaccordingtooneoftheseorsimi-
larstrategies,theresourceprocessingmoduletransformsthe
dataintoanalgorithm-specicrelationalrepresentation.
4.3 Algorithm Library
Asdescribed above an ontology may be described by a
numberof sets of concepts, relations, lexical entries, and
linksbetweentheseentities. Anexistingontologydenition
(includingL;C;H
C
;R;H
R
;A;F;G)may beacquiredusing
various algorithmsworkingon this denitionand the pre-
processedinputdata. Whilespecicalgorithmsmaygreatly
varyfromonetype ofinput tothe next, thereisalso con-
Hence,wemayreusealgorithmsfromthelibraryforacquir-
ingdierentpartsoftheontologydenition.
Subsequently,weintroducesomeofthesealgorithmsavail-
able in our implementation. In general, we use a multi-
strategylearningandresultcombinationapproach,i.e. each
algorithmthatispluggedintothelibrarygeneratesnormal-
izedresultsthatadheretotheontologystructuressketched
aboveand that maybecombinedintoa coherent ontology
denition.
5. IMPORT & REUSE
Given our experiences in medicine, telecommunication,
andinsurance,weexpectthatforalmostanycommercially
signicantdomaintherearesomekindofdomainconceptu-
alizationsavailable. Thus,weneedmechanismsandstrate-
giestoimport &reuse domainconceptualizationsfromex-
isting(schema)structures. Thereby,theconceptualizations
mayberecovered,e.g.,fromlegacydatabaseschemata,document-
type denitions (DTDs), or from existing ontologies that
conceptualizesomerelevantpartofthetargetontology.
Inthe rstpart of theimport & reuse step, the schema
structures are identiedand their general content needto
be discussed with domain experts. Each of these knowl-
edgesourcesmustbeimportedseparately. Importmaybe
performedmanually|whichmayincludethemanualdef-
initionoftransformation rules. Alternatively,reverseengi-
neeringtools,suchas exist for recoveringextended entity-
relationshipdiagramsfromthe SQLdescription ofa given
database (cf. reference [32, 14] in survey, Table 1), may
facilitatetherecoveryofconceptualstructures.
Inthesecondpart ofthe import &reuse step,imported
conceptual structures needto be mergedor aligned inor-
der to constitute a single common ground from which to
take-ointothesubsequentontologylearningphasesofex-
tracting, pruning andrening. Whilethe generalresearch
issueconcerningmergingandaligningisstillanopenprob-
lem,recentproposals(e.g.,[25])haveshownhowtoimprove
themanualprocessofmerging/aligning. Existingmethods
formerging/aligningmostlyrelyonmatchingheuristicsfor
proposingthemergeofconceptsandsimilarknowledge-base
operations. Ourcurrentresearchalsointegratesmechanisms
thatuse aapplicationdata oriented, bottom-up approach.
Forinstance,formalconceptanalysisallowstodiscoverpat-
terns between application data on the one hand and the
usageofconceptsandrelationsand thesemanticsgivenby
their heterarchies onthe other hand in aformally concise
way(cf. reference[8] insurvey,Table 1,onformalconcept
analysis).
Overall, the importand reusestep in ontology learning
seemstobethe onethatis thehardestto generalize. The
taskmayremindvaguelyofthegeneralproblemswithdata
warehousing adding, however, challenging problems of its
own.
6. EXTRACTING ONTOLOGIES
In the ontology extraction phase of the ontology learn-
ingprocess,majorparts,i.e.thecompleteontologyorlarge
chunksreectinganewsubdomainoftheontology,aremod-
eledwithlearningsupportexploitingvarioustypesof(Web)
sources.Thereby,ontologylearningtechniquespartiallyrely
learningcyclemaypropelsubsequentonesandmoresophis-
ticatedalgorithmsmayworkonstructuresproposedbymore
straightforwardonesbefore.
Describingthis phase,we sketch someof the techniques
andalgorithmsthathavebeenembeddedinourframework
andimplementedinourontologylearningenvironmentText-
To-Onto(cf.Figure3). Doingso,wecoveraverysubstantial
part oftheoverallontologylearningtask inthe extraction
phase. Text-To-Ontoproposesmanydierentontologycom-
ponents, which we have describedabove (i.e. L;C;R;:::),
totheontologyengineerfeedingonseveraltypesofinput.
6.1 Lexical Entry & Concept Extraction
Thistechniqueisoneofthebaselinemethodsappliedin
ourframeworkforacquiringlexicalentrieswithcorrespond-
ingconcepts. InText-To-Onto,webdocumentsaremorpho-
logically processed, including the treatment of multi-word
terms suchas \database reverse engineering" by N-grams,
asimplestatisticsmeans.Basedonthistextpreprocessing,
termextractiontechniques,whicharebasedon(weighted)
statistical frequencies,are appliedinordertopropose new
lexicalentriesforL.
Often,theontologyengineerfollows the proposalby the
lexicalentry&conceptextractionmechanismandincludes
anewlexicalentryintheontology. Becausethenewlexical
entry comes without an associated concept, the ontology
engineermustthendecide(possiblywithhelpfromfurther
processing)whetherto introduceanewconceptorlinkthe
newlexicalentrytoanexisting concept.
6.2 Hierarchical Concept Clustering
Given a lexicon and a set of concepts, one major next
step is the taxonomicclassication of concepts. One gen-
erally applicablemethodwithtothisregard ishierarchical
clustering. Hierarchical clusteringexploits thesimilarity of
itemsinordertoproposeahierarchyofitemcategories. The
similaritymeasureisdenedonthepropertiesofitems.
Giventhetaskofextractingahierarchyfromnaturallan-
guage text, adjacency ofterms or syntactical relationships
betweentermsaretwopropertiesthatyieldconsiderablede-
scriptivepowertoinducethesemantichierarchyofconcepts
relatedtotheseterms.
Asophisticatedexampleforhierarchicalclusteringisgiven
by Faure &Nedellec (cf. reference [6] in survey, Table 1):
Theypresentacooperativemachinelearningsystem,ASIUM,
which acquires taxonomic relations and subcategorization
framesofverbsbasedonsyntacticinput. TheASIUMsys-
tem hierarchically clusters nouns based onthe verbs that
they are syntactically related with and vice versa. Thus,
theycooperatively extendthelexicon, theset of concepts,
andtheconceptheterarchy(L;C;HC).
6.3 Dictionary Parsing
Machine-readabledictionaries(MRD)arefrequentlyavail-
able for manydomains. Thoughtheir internal structureis
freetexttoalargeextent,therearecomparativelyfewpat-
terns thatare used togive textdenitions. Hence, MRDs
exhibit a large degree of regularity that may be exploited
for extractingadomainconceptualizationandproposingit
totheontologyengineer.
Text-To-Onto has been used to generate ataxonomy of
company(cf.reference [15]insurvey,Table 1). Likewiseto
termextractionfrom free textmorphological processing is
applied,this timehowevercomplementingseveralpattern-
matching heuristics. For examplethedictionarycontained
thefollowingentry:
Automatic Debit Transfer: Electronic service arising
froma debit authorization of theYellowAccountholder
for arecipienttodebit bills thatfallduedirectfromthe
account..
Severalheuristicswereappliedtothemorphologicallyan-
alyzeddenitions. Forinstance,onesimpleheuristicrelates
the denition term, here \automaticdebit transfer", with
therstnounphraseoccurringinthedenition,here\elec-
tronicservice". Their correspondingconceptsare linkedin
theheterarchyHC:
HC(automaticdebittransfer,electronicservice).
Applyingthisheuristiciteratively,onemayproposelarge
partsofthetargetontology,morepreciselyL;C andHC to
the ontology engineer. In fact, because verbs tend to be
modeledasrelations,R(andthelinkagebetweenRandL)
maybeextendedbythisway,too.
6.4 Association Rules
Associationrulelearningalgorithmsaretypicallyusedfor
prototypicalapplicationsofdatamining,likendingassoci-
ationsthatoccurbetweenitems,e.g.supermarketproducts,
inasetoftransactions,e.g.customers'purchases. Thegen-
eralizedassociationrulelearningalgorithmextendsitsbase-
taxonomy,e.g.\snacksarepurchasedtogetherwithdrinks"
rather than\chipsare purchasedwith beer"and \peanuts
arepurchasedwithsoda".
InText-To-Onto (cf.reference[17]insurvey,Table1)we
useamodicationofthegeneralizedassociationrulelearn-
ingalgorithm fordiscoveringpropertiesbetweenclasses. A
givenclass hierarchy HC serves as background knowledge.
Pairsofsyntacticallyrelatedclasses(e.g.pair(festival,island)
describing the head-modier relationship contained in the
sentence\ThefestivalonUsedom 6
attractstouristsfromall
overtheworld.") aregivenas inputtothealgorithm. The
algorithm generates association rules comparing the rele-
vanceofdierent ruleswhile climbingupand/or downthe
taxonomy. Theappearinglymost relevant binaryrulesare
proposedtotheontologyengineerformodelingrelationsinto
theontology,thusextendingR.
As the number of generated rules is typically high, we
oer variousmodesofinteraction. Forexample,it ispossi-
bletorestrictthenumberofsuggestedrelationsbydening
so-called restriction classes that have toparticipate inthe
relationsthatareextracted. Anotherwayoffocusingisthe
exibleenabling/disablingoftheuseoftaxonomicknowl-
edgeforextractingrelations.
Resultsare presented oeringvariousviewsonto the re-
sults as depicted inFigure 4. A generalized relation that
may beinducedbythepartiallygivenexampledataabove
maybetheproperty(event,area),whichmaybenamedby
6
Usedom is anisland located innorth-eastofGermanyin
inanarea(thusextendingLandF). Theusermayaddthe
extractedrelationstotheontologybydrag-and-drop.Toex-
ploreand determinetherightaggregationlevelofaddinga
relationtotheontology,theusermaybrowsethehierarchy
viewonextractedpropertiesasgivenintheleftpartofFig-
ure 4. This view may also support the ontology engineer
in dening appropriate subPropertyOf relations between
properties,suchassubPropertyOf(hasDoubleRoom,hasRoom )
(therebyextendingHR).
7. PRUNING THE ONTOLOGY
Acommonthemeofmodelinginvariousdisciplinesisthe
balance between completeness and scarcity of the domain
model. Itisawidelyheldbeliefthattargetingcompleteness
forthedomainmodelontheonehandappearsto beprac-
tically inmanagable and computationally intractable, and
targetingthescarcestmodelontheotherhandisoverlylim-
itingwithregard toexpressiveness. Hence, whatwestrive
foristhebalancebetweenthesetwo,whichisreallyworking.
Weaim at a modelthat capturesa richconceptualization
ofthetarget domain, butthatexcludes partsthat areout
ofitsfocus. Theimport&reuse ofontologiesaswellasthe
extraction of ontologies considerably pull the lever of the
scaleinto theimbalancewhere out-of-focus concepts reign.
Therefore,wepursuetheappropriatediminishingoftheon-
tologyinthepruning phase.
There are at least two dimensions to look at the prob-
lemofpruning. First,oneneedstoclarify howthepruning
of particular parts of the ontology (e.g., the removalof a
conceptorarelation) aectstherest. Forinstance, Peter-
sonet. al.[26]havedescribedstrategiesthatleavetheuser
withacoherentontology(i.e.nodanglingor brokenlinks).
Second,onemayconsiderstrategiesforproposingontology
itemsthatshouldbeeitherkeptorpruned. Wehaveinves-
tigated several mechanismsfor generating proposals from
applicationdata. Given aset ofapplication-specic docu-
mentsthereareseveralstrategiesforpruningtheontology.
Theyarebasedonabsoluteorrelativecountsoffrequency
ofterms(cf.reference[15]insurvey,Table 1).
8. REFINING THE ONTOLOGY
Reningplaysasimilarroleasextracting. Theirdierence
exists ratheronaslidingscale thanby aclear-cut distinc-
tion. Whileextracting servesmostlyfor cooperativemod-
elingof theoverallontology(orat leastofverysignicant
chunksofit), therenementphaseisaboutnetuningthe
targetontologyandthesupportofitsevolvingnature. The
renement phase may use data that comes from the con-
creteSemanticWebapplication,e.g. loglesofuserqueries
or generic userdata. Adapting and rening the ontology
withrespecttouserrequirementsplaysamajorroleforthe
acceptanceoftheapplicationanditsfurtherdevelopment.
Inprinciple,thesamealgorithmsmaybeusedforextrac-
tionasforrenement. However,duringrenementonemust
considerindetailtheexistingontologyandtheexistingcon-
nectionsintotheontology,whileextractionworksmoreoften
thannotpracticallyfromscratch.
A prototypical approach for renement (though not for
extraction!) has been presented by Hahn & Schnattinger
(cf.reference[11]insurvey,Table1). Theyhaveintroduced
asnewconceptsareacquiredfromtext. Theacquisitionpro-
cessiscenteredaroundthelinguisticandconceptual\qual-
ity"ofvariousforms ofevidenceunderlyingthegeneration
and renement of concept hypothesis. In particular they
consider semantic conicts and analogous semantic struc-
turesfromtheknowledgebaseintotheontologyinorderto
determinethe quality ofaparticular proposal. Thus,they
extendanexisting ontologywithnewlexical entries forL,
newconceptsforC andnewrelationsforHC.
9. RELATED WORK
Untilrecentlyontologylearningperse,i.e.forcomprehen-
siveconstructionofontologies,hasnotexisted. Weheregive
thereaderacomprehensiveoverviewoverexistingworkthat
hasactuallyresearchedandpracticedtechniquesforsolving
partsoftheoverallproblemofontologylearning.
There are only afew approaches that describedthe de-
velopment of frameworks and workbenches for extracting
ontologies from data: Faure & Nedellec [6] present a co-
operativemachinelearningsystem,ASIUM,whichacquires
taxonomicrelations and subcategorization frames ofverbs
basedonsyntacticinput. TheASIUMsystemhierarchically
clustersnounsbasedontheverbsthattheyaresyntactically
relatedwithandviceversa. Thus,theycooperativelyextend
thelexicon,thesetofconcepts,andtheconceptheterarchy
(L;C;HC).
HahnandSchnattinger[11]introducedamethodologyfor
the maintenanceofdomain-specictaxonomies. Anontol-
ogyis incrementallyupdatedasnewconceptsare acquired
from real-world texts. The acquisition process is centered
aroundlinguisticandconceptual\quality"ofvariousforms
ofevidenceunderlyingthegenerationandrenementofcon-
cept hypotheses. Theirontology learning approach is em-
beddedinaframeworkfornaturallanguageunderstanding,
namedSyndicate[10].
Mikheev&Finch[18]havepresentedtheirKAWBWork-
benchfor\Acquisition ofDomainKnowledgeformNatural
Language". Theworkbench compromises a setof compu-
tational tools for uncovering internal structure in natural
language texts. The main idea behind the workbench is
the independenceofthe textrepresentationand textanal-
ysis phases. At the representation phase the text is con-
vertedfromasequenceofcharacterstofeatures ofinterest
by means of the annotation tools. At the analysis phase
those features are used by statistics gathering and infer-
ence tools for ndingsignicant correlations in the texts.
Theanalysistoolsareindependentofparticularassumptions
aboutthenatureofthefeature-setandworkontheabstract
leveloffeatureelementsrepresentedasSGMLitems.
Muchworkinanumberofdisciplines |computational
linguistics,informationretrieval,machinelearning,databases,
software engineering | has actually researched and prac-
ticed techniques for solving part of the overall problem.
Hence,techniquesandmethodsrelevantforontologylearn-
ing maybefoundundertermslikethe acquisitionofselec-
tionalrestrictions(cf. Resnik[27]andBasilietal. [2]),word
sensedisambiguationandlearningofwordsenses(cf. Hast-
ings[34]), the computationofconcept latticesfromformal
contexts(cf. Ganter&Wille[8])and ReverseEngineering
insoftware engineering(cf. Muelleretal. [23]).
Ontology Learning putsa numberof research activities,
get of a commondomain conceptualization, into one per-
spective. Onemayrecognizethattheseactivitiesarespread
between verymany communities incurringreferencesfrom
20completelydierentevents/journals.
10. CHALLENGES
Ontology Learning may add signicant leverage to the
Semantic Web, because it propels the construction of do-
main ontologies, which are needed fastly and cheaply for
theSemanticWebtosucceed. Wehavepresentedacompre-
hensive framework for Ontology Learning that crosses the
boundaries of single disciplines, touching on a numberof
challenges. Table 1 givesa surveyof what types of tech-
niquesshouldbeincludedinafull-edgedontologylearning
and engineering environment. The good news however is
that onedoes notneed perfect or optimal supportfor co-
operativemodelingofontologies. Atleastaccordingtoour
experience \cheap" methods inan integrated environment
mayyieldtremendoushelpfortheontologyengineer.
Whileanumberofproblemsremainwiththesingledisci-
plines,somemorechallengescomeupregardingthepartic-
ularproblem of OntologyLearning for the Semantic Web.
First,withthe XML-basednamespacemechanismstheno-
tionof anontologywith well-denedboundaries, e.g.only
denitionsthat are inonele, will disappear. Rather,the
SemanticWebmayyieldan\amoeba-like"structureregard-
ing ontology boundaries, because ontologies refer to each
otherandimporteachother(cf.e.g.theDAML-ONTprim-
itiveimport). However,itisnotyetclearhowthesemantics
ofthesestructureswill looklike. Inlightofthesefactsthe
importanceofmethodslikeontologypruningandcrawlingof
ontologieswilldrastically increasestill. Second,wehaveso
farrestrictedourattentioninontologylearningtothecon-
ceptual structures that are (almost) contained inRDF(S)
proper. AdditionalsemanticlayersontopofRDF(e.g.fu-
tureOILorDAML-ONTwithaxioms,A) willrequirenew
meansforimprovedontologyengineering withaxioms,too!
Acknowledgements..
Wethankourstudents,DirkWenkeandRaphaelVolz,forworkatOntoEdit andText-To-Onto.
ResearchforthispaperwaspartiallynancedbyOntoprise
GmbH,Karlsruhe,Germany,byUSAirForceintheDARPA
DAML project \OntoAgents", by European Union in the
IST-1999-10132 project \On-To-Knowledge", and by Ger-
manBMBFintheproject\GETESS"(01IN901C0).
11. REFERENCES
[1] H.Assadi.Constructionofaregionalontologyfrom
textanditsusewithinadocumentarysystem.In
ProceedingsoftheInternationalConferenceonFormal
OntologyandInformationSystems-FOIS'98,Trento,
Italy,1998.
[2] R.Basili,M.T.Pazienza,andP.Velardi.Acquisition
ofselectionalpatternsinasublanguage. Machine
Translation, 8(1):175{201,1993.
[3] Paul Buitelaar.CoreLex: Systematic Polysemyand
Underspecication. PhDthesis,BrandeisUniversity,
DepartmentofComputerScience,1998.
[4] A.Doan,P.Domingos,andA.Levy.LearningSource
Descriptionsfor DataIntegration.InProceedingsof
theInternationalWorkshoponTheWeband
Databases (WebDB-2000),2000.
[5] F.Esposito,S.Ferilli,N.Fanizzi,andG.Semeraro.
Learningfromparsedsentenceswithinthelex.In
ProceedingsofLearningLanguageinLogicWorkshop
(LLL-2000), Lisbon,Portugal, 2000,2000.
[6] D. FaureandC.Nedellec.Acorpus-basedconceptual
clusteringmethodforverbframesandontology
acquisition.InLRECworkshoponadaptinglexical
andcorpusresourcestosublanguagesandapplications,
Granada,Spain,1998.
[7] B. GainesandM.Shaw.Integratedknowledge
acquisitionarchitectures.JournalofIntelligent
Information Systems,1(1),1992.
[8] B. GanterandR.Wille.FormalConceptAnalysis:
Domain Method Featuresused Primepurpose Papers
FreeText Clustering Syntax Extract Buitelaar[3],Assadi[1]andFaure&
Nedellec[6]
Inductive Logic
Programming
Syntax, Logic rep-
resentation
Extract Espositoetal.[5]
Associationrules Syntax,Tokens Extract Maedche&Staab[17]
Frequency-based Syntax Prune Kietzetal. [15]
Pattern-Matching Extract Morin[22]
Classication Syntax,Semantics Rene Schnattinger&Hahn[11]
Dictionary Information extrac-
tion
Syntax Extract Hearst [12], Wilks[35]and Kietzet
al. [15]
Pagerank Tokens Jannink&Wiederhold[13]
Knowledge
base
Concept Induction,
A-Boxmining
Relations Extract Kietz & Morik [16] and Schlobach
[28]
Semi-
structured
schemata
NaiveBayes Relations Reverseengineering Doanetal.[4]
Relational
schemata
DataCorrelation Relations Reverseengineering Johannesson[14]andTarietal.[32]
Heidelberg-NewYork,1999.
[9] E.Grosso,H.Eriksson,R.Fergerson,S.Tu,and
M.Musen.Knowledgemodelingatthemillennium|
thedesignandevolutionofprotege-2000.In
Proceedings ofKAW-99,Ban, Canada,1999.
[10] U.HahnandM.Romacker.Contentmanagementin
thesyndikatesystem|howtechnicaldocumentsare
automaticallytransformedtotextknowledgebases.
Data&KnowledgeEngineering,35:137{159,2000.
[11] U.HahnandK.Schnattinger.Towardstext
knowledge engineering.InProc.ofAAAI'98,pages
129{144,1998.
[12] M.A.Hearst.Automaticacquisitionofhyponymsfrom
largetextcorpora.InProceedings ofthe14th
InternationalConferenceonComputational
Linguistics.Nantes,France,1992.
[13] J.JanninkandG.Wiederhold.Thesaurusentry
extractionfromanon-linedictionary.InProceedings
ofFusion'99,SunnyvaleCA, July1999,1999.
http://www-db.stanford.edu/SKC/publications.html.
[14] P.Johannesson.Amethodfortransformingrelational
schemasintoconceptual schemas.InM.Rusinkiewicz,
editor,10thInternationalConference onData
Engineering,pages115{122,Houston,1994.IEEE
Press.
[15] J.-U.Kietz,A.Maedche,andR.Volz.Semi-automatic
ontologyacquisitionfromacorporateintranet.In
InternationalConferenceonGrammarInference
(ICGI-2000),toappear: LectureNotesinArticial
Intelligence,LNAI,2000.
[16] J.-U.KietzandK.Morik.Apolynomialapproachto
theconstructiveinductionofstructuralknowledge.
MachineLearning,14(2):193{218,1994.
[17] A.MaedcheandS.Staab.Discoveringconceptual
relationsfromtext.InProceedingsof ECAI-2000.IOS
Press,Amsterdam,2000.
[18] A.MikheevandS.Finch.Aworkbenchfornding
structureintext.InInProceedingsofthe5th
ConferenceonAppliedNaturalLanguageProcessing
|ANLP'97,March1997,Washington DC,USA,
pages372{379,1997.
[19] G.Miller.Wordnet: Alexicaldatabasefor English.
CACM,38(11):39{41,1995.
[20] K.Morik.Balancedcooperativemodeling.Machine
Learning,11:217{235, 1993.
[21] K.Morik,S.Wrobel,J.-U.Kietz,andW.Emde.
Knowledgeacquisition andmachinelearning: Theory,
methods,andapplications.AcademicPress, London,
1993.
[22] E.Morin.Automaticacquisitionofsemanticrelations
betweentermsfromtechnicalcorpora.InProc.ofthe
FifthInternationalCongress onTerminologyand
KnowledgeEngineering-TKE'99,1999.
[23] H.A.Mueller,J.H.Jahnke,D.B.Smith,M.-A.
Storey,S.R.Tilley,andK.Wong.Reverse
Engineering: ARoadmap.InProceedingsof the22nd
InternationalConference onSoftware Engineering
(ICSE-2000),Limerick,Ireland. Springer,2000.
[24] G.Neumann,R.Backofen,J.Baur,M. Becker,and
C.Braun.Aninformationextractioncoresystemfor
realworldgermantextprocessing.InANLP'97|
ProceedingsoftheConference onAppliedNatural
Language Processing,pages208{215,Washington,
USA,1997.
[25] N.FridmanNoyandM.A.Musen.PROMPT:
AlgorithmandToolforAutomatedOntologyMerging
andAlignment.InProceedings ofthe17thNational
Conf. onArticialIntelligence(AAAI'2000),Austin,
Texas.MITPress/AAAIPress, 2000.
[26] B. Peterson,W.A.Andersen,andJ.Engel.Knowledge
bus: Generatingapplication-focuseddatabases from
largeontologies.InProc ofKRDB1998,Seattle,
Washington, USA,pages2.1{2.10, 1998.
[27] P.Resnik.SelectionandInformation: AClass-based
Approach toLexicalRelationships.PhDthesis,
UniversityofPennsylania,1993.
[28] S.Schlobach.Assertionalminingindescriptionlogics.
InProceedings ofthe2000InternationalWorkshopon
Description Logics(DL2000),2000.
http://SunSITE.Informatik.RWTH-
[29] S.Staab,J.Angele,S.Decker,M.Erdmann,
A.Hotho,A.Maedche,H.-P.Schnurr,R.Studer,and
Y.Sure.Semanticcommunitywebportals. Proc.of
WWW9/Computer Networks,33(1-6):473{491, 2000.
[30] S.StaabandA.Maedche.Knowledgeportals|
ontologiesatwork. AIMagazine,21(2),Summer2001.
[31] S.Staab,H.-P.Schnurr,R.Studer,andY.Sure.
Knowledgeprocessesandontologies.IEEEIntelligent
Systems,16(1),2001.
[32] Z.Tari,O.Bukhres,J.Stokes,andS.Hammoudi.The
ReengineeringofRelationalDatabasesbasedonKey
andDataCorrelations. InProceedingsof the7th
ConferenceonDatabaseSemantics(DS-7),7-10
October1997, Leysin,Switzerland.Chapman&Hall,
1998.
[33] G.Webb,J.Wells,andZ.Zheng.Anexperimental
evaluationofintegratingmachinelearningwith
knowledge acquisition.MachineLearning,35(1):5{23,
1999.
[34] P.Wiemer-Hastings,A.Graesser,and
K.Wiemer-Hastings.Inferringthemeaningofverbs
fromcontext.InProceedings oftheTwentiethAnnual
Conferenceof theCognitiveScienceSociety,1998.
[35] Y.Wilks,B.Slator,andL. Guthrie.ElectricWords:
Dictionaries,Computers,andMeanings.MITPress,
Cambridge,MA,1996.