Learning Ontologies for the Semantic Web

(1)

Learning Ontologies for the Semantic Web

Alexander Maedche

Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany

Ontoprise GmbH, Haid-und-Neu Strasse 7 76131 Karlsruhe, Germany

[email protected]

Steffen Staab

Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany

Ontoprise GmbH, Haid-und-Neu Strasse 7 76131 Karlsruhe, Germany

[email protected]

ABSTRACT

The Semantic Web relies heavily on the formal ontologies

that structureunderlying datafor the purpose of compre-

hensive and transportable machine understanding. There-

fore, thesuccessof theSemanticWebdependsstronglyon

theproliferationofontologies,whichrequiresfastandeasy

engineeringofontologiesandavoidanceofaknowledgeac-

quisitionbottleneck.

Ontology Learning greatly facilitatesthe constructionof

ontologiesbytheontologyengineer. Thevisionofontology

learningthatweproposehereincludesanumberofcomple-

mentarydisciplinesthat feedondierenttypesofunstruc-

tured,semi-structuredandfullystructureddatainorderto

supportasemi-automatic,cooperativeontologyengineering

process. Ourontologylearningframeworkproceedsthrough

ontologyimport,extraction,pruning,renement,andeval-

uationgivingtheontologyengineerawealthofcoordinated

tools for ontology modeling. Besides ofthe generalframe-

work and architecture, we showin this paper some exem-

plarytechniquesintheontologylearningcyclethatwehave

implemented in our ontology learning environment, Text-

To-Onto,suchasontologylearningfromfreetext,fromdic-

tionaries,orfromlegacyontologies,andrefertosomeothers

thatneedtocomplementthecompletearchitecture,suchas

reverseengineeringofontologiesfromdatabaseschemataor

learningfromXMLdocuments.

1. ONTOLOGIES FOR THE SEMANTIC WEB

Conceptualstructuresthatdeneanunderlyingontology

aregermanetotheideaofmachineprocessabledataonthe

SemanticWeb. Ontologiesare(meta)dataschemas,provid-

ingacontrolledvocabularyofconcepts,eachwithanexplic-

itlydenedandmachineprocessablesemantics. Bydening

sharedand commondomaintheories, ontologies helpboth

peopleand machinestocommunicateconcisely,supporting

theexchangeofsemanticsandnotonlysyntax. Hence,the

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission by the authors.

Semantic Web Workshop 2001 Hongkong, China Copyright by the authors.

cheapandfastconstructionofdomain-specicontologiesis

crucialforthesuccessandtheproliferationoftheSemantic

Web.

Thoughontology engineering tools have become mature

overthelast decade(cf.[9]),themanual acquisitionof on-

tologies still remainsatedious, cumbersometask resulting

easily ina knowledge acquisition bottleneck. Havingdevel-

opedourontologyengineeringworkbench,OntoEdit,wehad

toface exactlythisissue,inparticular weweregivenques-

tionslike

Canyoudevelopanontologyfast? (time)

IsitdiÆculttobuildanontology? (diÆculty)

Howdoyouknowthatyou'vegottheontologyright?

(condence)

Infact,theseproblemsontime,diÆcultyandcondence

thatweendedup withweresimilarto whatknowledgeen-

gineershaddealtwithoverthelasttwodecadeswhenthey

elaborated on methodologies for knowledge acquisition or

workbenches fordeningknowledgebases. A methodthat

proved extremely benecial for the knowledge acquisition

taskwastheintegrationofknowledgeacquisitionwithma-

chine learning techniques [33]. The drawback ofthese ap-

proaches,e.g. theworkdescribedin[21],however,wastheir

ratherstrongfocusonstructuredknowledgeor databases,

fromwhichtheyinducedtheirrules.

Incontrast, in the Webenvironment that we encounter

whenbuildingWebontologies,thestructuredknowledge or

databaseisrathertheexceptionthanthenorm. Hence,in-

telligentmeansforanontologyengineertakesonadierent

meaning thanthe| veryseminal | integration architec-

turesformoreconventionalknowledgeacquisition[7].

OurnotionofOntologyLearningaimsattheintegrationof

amultitudeofdisciplinesinordertofacilitatetheconstruc-

tionof ontologies,inparticularmachinelearning. Because

the fully automatic acquisition of knowledge by machines

remains in the distant future, we consider the process of

ontology learning as semi-automatic withhumaninterven-

tion,adoptingtheparadigmofbalancedcooperativemodeling

[20]fortheconstructionofontologiesfortheSemanticWeb.

This objective inmind,we have builtanarchitecture that

combinesknowledgeacquisitionwithmachinelearning,feed-

ingontheresourcesthatwenowadaysndonthesyntactic

Web,viz.freetext,semi-structuredtext,schemadenitions

(2)

ferentstepsintheengineering cycle,whichhereconsistsof

thefollowingvesteps(cf.Figure1):

First, existing ontologies are importedand reused by

merging existing structures or dening mapping rules be-

tweenexistingstructuresandtheontologytobeestablished.

For instance, [26] describe how ontological structures con-

tainedinCyc are used inorder to facilitate the construc-

tion of a domain-specic ontology. Second, in the ontol-

ogy extraction phase major parts of the target ontology

are modeled with learning supportfeeding from web doc-

uments. Third, this rough outline of the target ontology

needstobepruned inordertobetteradjust theontology

toitsprimepurpose. Fourth,ontologyrenementprots

fromthegivendomainontology,butcompletestheontology

atanegranularity(also incontrasttoextraction). Fifth,

the prime target application serves as a measure for vali-

datingtheresultingontology[31]. Finally,onemayrevolve

againinthis cycle,e.g.for includingnewdomainsintothe

constructed ontology or for maintaining and updating its

scope.

2. THE ONTOLOGY LEARNING KILLER APPLICATION

Though ontologies and their underlying data in the Se-

manticWebareenvisionedtobereusableforawiderangeof

possiblyunforeseenapplications 1

,aparticulartargetappli-

cationremainsthetouchstonefor agivenontology. Inour

case, we have been dealing with ontology-based knowledge

portals thatstructureWebcontentandthatallowforstruc-

turedprovisioningandaccessingofdata[29,30]. Knowledge

portalsareinformationintermediariesforknowledgeaccess-

ingandsharingontheWeb. Thedevelopmentof aknowl-

edgeportal consists of the tasks of structuring theknowl-

edge,establishingmeansfor providingnew knowledgeand

accessingtheknowledgecontainedintheportal.

A considerable part of development and maintenanceof

theportal lies inintegrating legacy information as wellas

in constructing and maintaining the ontology in vast, of-

ten unknown, terrain. For instance, a knowledge portal

may focus on the electronics sector, integrating compar-

ative shopping in conjunction with manuals, reports and

opinionsaboutcurrentelectronicproducts. Thecreationof

thebackgroundontologyfor thisknowledgeportalinvolves

tremendouseortsforengineeringtheconceptualstructures

that underly existing warehouse databases, product cata-

logues, user manuals, test reports and newsgroup discus-

sions. Correspondingly, ontology structures mustbe con-

structedfromdatabaseschemata,agivenproductthesaurus

(likeBMEcat),XMLdocumentsanddocumenttypedeni-

tions(DTDs),andfreetexts. Stillworse,signicantpartsof

these(meta-)datachangeextremelyfastand,hence,require

aregularupdateofthecorrespondingontologyparts.

Thus, very dierent types of (meta-)data might beuse-

fulinputfor theconstructionofthe ontology. However, in

practiceoneneedscomprehensivesupport 2

inordertodeal

1

Justlike Tim Berners-Lee did not forsee online auctions

beingacommonbusinessmodeloftheWebof2000.

2

Comprehensive support with ontology learning need not

necessarily imply the top-notch learning algorithms, but

may rely more heavily on appropriate tool support and

dierenttechniques: Structureddataandmetadatarequire

reverseengineering approaches,freetextmaycontributeto

ontologylearningdirectlyorthroughinformationextraction

approaches.

3

Semi-structureddatamay nallyrequireand

protfromboth.

In the following we elaborate on our ontology learning

framework. Thereby we approach dierent techniques for

dierenttypesofdata,showingpartsofourarchitecture,its

currentstatus,andpartsthatmaycomplementourcurrent

Text-To-Onto environment.

Ageneraloverviewofontologylearningtechniquesaswell

ascorrespondingreferencesmaybefoundinSection9.

3. AN ARCHITECTURE FOR ONTOLOGY LEARNING

Giventhetaskofconstructingandmaintaininganontol-

ogy for a Semantic Webapplication, e.g. for an ontology-

basedknowledgeportalthatwehavebeendealingwith(cf.

[29]), wehaveproducedawishlistofwhatkindofsupport

wewouldfancy.

3.1 Ontology Engineering Workbench

^OntoEdit

Ascoretoourapproachwehavebuiltagraphicaluserin-

terfacetosupporttheontologyengineeringprocessmanually

performedbytheontologyengineer. Here,weoersophisti-

catedgraphicalmeansformanualmodelingandreningthe

nalontology. Dierentviewsareoeredtotheusertarget-

ing the epistemological level rather than a particular rep-

resentation language. However, the ontological structures

builttheremaybeexportedtostandardSemanticWebrep-

resentationlanguages,suchasOILandDAML-ONT,aswell

asourownF-LogicbasedextensionsofRDF(S).Inaddition,

executable representations for constraint checking and ap-

plicationdebuggingcanbegeneratedandthenaccessedvia

SilRi 4

, our F-Logic inference engine, that is directly con-

nectedwithOntoEdit.

Thesophisticatedontologyengineeringtoolsweknew,e.g.

theProtegemodelingenvironmentforknowledge-basedsys-

tems[9],wouldoercapabilitiesroughlycomparabletoOn-

toEdit. However,giventhetaskofconstructingaknowledge

portal,wefoundthattherewasthislargeconceptualbridge

between the ontology engineering tool and the input (of-

ten legacy data), suchas Web documents,Web document

schemata,databasesontheWeb,andWebontologies,which

ultimatelydeterminedthetargetontology. Intothisvoidwe

havepositionednewcomponentsofourontologylearningar-

chitecture(cf. Figure2). Thenewcomponentssupportthe

ontologyengineerinimportingexistingontologyprimitives,

extracting new ones, pruning given ones, or rening with

additional ontology primitives. In our case, the ontology

primitivescomprise:

asetofstringsthatdescribelexicalentriesLfor con-

ceptsandrelations;

3

Infact,ontologylearningforfreetextservesadoublepur-

pose. Onthe one hand it yields a readily exploitable on-

tology for Semantic Web purposes, on the other hand it

often returnsimprovedinformation extractionand natural

languageunderstanding meansadjustedtothe learnedon-

tology,cf. [10].

4

(3)

Apply

Domain Ontology

Extract

Refine

^c

Prune

Import / Reuse

Legacy + Application Data

Ontology Learning

Legacy + Application Data

Figure 1: OntologyLearningprocesssteps

asetofconcepts 5

|C;

ataxonomyofconceptswithmultipleinheritance(het-

erarchy)H

C

;

asetofnon-taxonomicrelations|R|describedby

theirdomainandrangerestrictions;

aheterarchyofrelations, i.e.a setoftaxonomicrela-

tionsH

R

;

relations F and G that relateconcepts and relations

withtheirlexicalentries,respectively;and,nally,

asetofaxiomsAthatdescribeadditional constraints

onthe ontologyand allow to make implicit factsex-

plicit[29].

Thisstructurecorrespondsclosely to RDFS,theoneex-

ceptionis theexplicit considerationoflexical entries. The

separationofconceptreferenceandconceptdenotation,which

may be easily expressed in RDF, allows to provide very

domain-specic ontologies without incurring an instanta-

neous conict when merging ontologies | a standard re-

questinthe Semantic Web. Forinstance,the lexicalentry

\school" inone ontology may refer to abuilding inontol-

ogyA,buttoanorganizationinontologyB, orto bothin

ontology C. Also inontology A the concept refered to in

Englishby\school" and\schoolbuilding"may bereferred

toinGermanby\Schule"and\Schulgebaude".

5

Conceptsinourframeworkareroughlyakintosynsetsin

Ontologylearningreliesonontologystructuresgivenalong

theselinesandoninputdataasdescribedaboveinorderto

propose new knowledge about reasonably interesting con-

cepts,relations,lexicalentries,oraboutlinksbetweenthese

entities|proposingtheaddition,thedeletion,orthemerg-

ing of some ofthem. The results ofthe ontology learning

processarepresentedtotheontologyengineerbythegraph-

icalresultsetrepresentation(cf. Figure4foranexampleof

howextractedpropertiesmaybepresented). Theontology

engineermay thenbrowse theresultsand decideto follow,

delete,ormodifytheproposalsinaccordancetothepurpose

ofhertask.

4. COMPONENTS FOR LEARNING ONTOLOGIES

Integratingtheconsiderationsfromaboveintoacoherent

generic architectureforextracting andmaintainingontolo-

gies from data onthe Webwe haveidentied several core

components. Thereare,(i),agenericmanagementcompo-

nent dealing with delegation of tasks and constituting the

infrastructurebackbone,(ii),aresourceprocessing compo-

nentworkingoninputdatafromtheWebincluding,inpar-

ticular,anaturallanguageprocessingsystem,(iii),analgo-

rithmlibraryworkingontheoutputoftheresourceprocess-

ing component as well asthe ontology structures sketched

aboveand returningresult setsalso mentionedabove and,

(iv), the graphical userinterface for ontology engineering,

OntoEdit.

(4)

Web documents

Domain Ontology Domain Ontology

Resource Processing Component

OntoEdit

WordNet Ontology O1 WordNet Ontology

O1

NLP System

Algorithm Library

Result Set

Lexicon 1

Inference Engine(s) Import

existing ontologies

Lexicon _n Crawl

corpus

DTD DTD

Legacy databases

Management Component

Ontology Engineer

Import schema Import semi-

structured schema

O2

...

Figure2: Architecture for LearningOntologiesfor the Semantic Web

4.1 Management component

Theontology engineerusesthe managementcomponent

toselectinputdata,i.e.relevantresourcessuchasHTML&

XMLdocuments,documenttype denitions,databases, or

existingontologiesthatareexploitedinthefurtherdiscovery

process. Secondly, using the management component, the

ontologyengineeralsochoosesamongasetofresourcepro-

cessing methods available at the resource processing com-

ponentandamongasetofalgorithmsavailableinthealgo-

rithmlibrary.

Furthermore,themanagementcomponentevensupports

theontologyengineerindiscoveringtask-relevantlegacydata,

e.g. an ontology-based crawler gathers HTML documents

that are relevant to a given core ontology and an RDF

crawlerfollowsURIs(i.e.,uniqueidentiersinXML/RDF)

thatarealsoURLsinordertocoverpartsofthesofartiny,

butgrowingSemanticWeb.

4.2 Resource processing component

Resourceprocessingstrategiesdierdependingonthetype

ofinputdatamadeavailable:

HTMLdocumentsmaybeindexedandreducedtofree

text.

Semi-structured documents,like dictionaries,maybe

Semi-structuredandstructuredschemadata(likeDTD's,

structureddatabaseschemata,andexistingontologies)

are handeled following dierent strategiesfor import

asdescribedlaterinthispaper.

Forprocessingfreenaturaltextoursystemaccessesthe

naturallanguageprocessingsystemSMES(Saarbrucken

MessageExtractionSystem),ashallowtextprocessor

for German (cf. [24]). SMES comprises a tokenizer

basedonregularexpressions,alexicalanalysiscompo-

nent includingvariouswordlexicons, amorphological

analysismodule,anamedentity recognizer,apart-of-

speechtagger andachunkparser.

Afterrstpreprocessingaccordingtooneoftheseorsimi-

larstrategies,theresourceprocessingmoduletransformsthe

dataintoanalgorithm-specicrelationalrepresentation.

4.3 Algorithm Library

Asdescribed above an ontology may be described by a

numberof sets of concepts, relations, lexical entries, and

linksbetweentheseentities. Anexistingontologydenition

(includingL;C;H

C

;R;H

R

;A;F;G)may beacquiredusing

various algorithmsworkingon this denitionand the pre-

processedinputdata. Whilespecicalgorithmsmaygreatly

varyfromonetype ofinput tothe next, thereisalso con-

(5)

Hence,wemayreusealgorithmsfromthelibraryforacquir-

ingdierentpartsoftheontologydenition.

Subsequently,weintroducesomeofthesealgorithmsavail-

able in our implementation. In general, we use a multi-

strategylearningandresultcombinationapproach,i.e. each

algorithmthatispluggedintothelibrarygeneratesnormal-

izedresultsthatadheretotheontologystructuressketched

aboveand that maybecombinedintoa coherent ontology

denition.

5. IMPORT & REUSE

Given our experiences in medicine, telecommunication,

andinsurance,weexpectthatforalmostanycommercially

signicantdomaintherearesomekindofdomainconceptu-

alizationsavailable. Thus,weneedmechanismsandstrate-

giestoimport &reuse domainconceptualizationsfromex-

isting(schema)structures. Thereby,theconceptualizations

mayberecovered,e.g.,fromlegacydatabaseschemata,document-

type denitions (DTDs), or from existing ontologies that

conceptualizesomerelevantpartofthetargetontology.

Inthe rstpart of theimport & reuse step, the schema

structures are identiedand their general content needto

be discussed with domain experts. Each of these knowl-

edgesourcesmustbeimportedseparately. Importmaybe

performedmanually|whichmayincludethemanualdef-

initionoftransformation rules. Alternatively,reverseengi-

neeringtools,suchas exist for recoveringextended entity-

relationshipdiagramsfromthe SQLdescription ofa given

database (cf. reference [32, 14] in survey, Table 1), may

facilitatetherecoveryofconceptualstructures.

Inthesecondpart ofthe import &reuse step,imported

conceptual structures needto be mergedor aligned inor-

der to constitute a single common ground from which to

take-ointothesubsequentontologylearningphasesofex-

tracting, pruning andrening. Whilethe generalresearch

issueconcerningmergingandaligningisstillanopenprob-

lem,recentproposals(e.g.,[25])haveshownhowtoimprove

themanualprocessofmerging/aligning. Existingmethods

formerging/aligningmostlyrelyonmatchingheuristicsfor

proposingthemergeofconceptsandsimilarknowledge-base

operations. Ourcurrentresearchalsointegratesmechanisms

thatuse aapplicationdata oriented, bottom-up approach.

Forinstance,formalconceptanalysisallowstodiscoverpat-

terns between application data on the one hand and the

usageofconceptsandrelationsand thesemanticsgivenby

their heterarchies onthe other hand in aformally concise

way(cf. reference[8] insurvey,Table 1,onformalconcept

analysis).

Overall, the importand reusestep in ontology learning

seemstobethe onethatis thehardestto generalize. The

taskmayremindvaguelyofthegeneralproblemswithdata

warehousing adding, however, challenging problems of its

own.

6. EXTRACTING ONTOLOGIES

In the ontology extraction phase of the ontology learn-

ingprocess,majorparts,i.e.thecompleteontologyorlarge

chunksreectinganewsubdomainoftheontology,aremod-

eledwithlearningsupportexploitingvarioustypesof(Web)

sources.Thereby,ontologylearningtechniquespartiallyrely

learningcyclemaypropelsubsequentonesandmoresophis-

ticatedalgorithmsmayworkonstructuresproposedbymore

straightforwardonesbefore.

Describingthis phase,we sketch someof the techniques

andalgorithmsthathavebeenembeddedinourframework

andimplementedinourontologylearningenvironmentText-

To-Onto(cf.Figure3). Doingso,wecoveraverysubstantial

part oftheoverallontologylearningtask inthe extraction

phase. Text-To-Ontoproposesmanydierentontologycom-

ponents, which we have describedabove (i.e. L;C;R;:::),

totheontologyengineerfeedingonseveraltypesofinput.

6.1 Lexical Entry & Concept Extraction

Thistechniqueisoneofthebaselinemethodsappliedin

ourframeworkforacquiringlexicalentrieswithcorrespond-

ingconcepts. InText-To-Onto,webdocumentsaremorpho-

logically processed, including the treatment of multi-word

terms suchas \database reverse engineering" by N-grams,

asimplestatisticsmeans.Basedonthistextpreprocessing,

termextractiontechniques,whicharebasedon(weighted)

statistical frequencies,are appliedinordertopropose new

lexicalentriesforL.

Often,theontologyengineerfollows the proposalby the

lexicalentry&conceptextractionmechanismandincludes

anewlexicalentryintheontology. Becausethenewlexical

entry comes without an associated concept, the ontology

engineermustthendecide(possiblywithhelpfromfurther

processing)whetherto introduceanewconceptorlinkthe

newlexicalentrytoanexisting concept.

6.2 Hierarchical Concept Clustering

Given a lexicon and a set of concepts, one major next

step is the taxonomicclassication of concepts. One gen-

erally applicablemethodwithtothisregard ishierarchical

clustering. Hierarchical clusteringexploits thesimilarity of

itemsinordertoproposeahierarchyofitemcategories. The

similaritymeasureisdenedonthepropertiesofitems.

Giventhetaskofextractingahierarchyfromnaturallan-

guage text, adjacency ofterms or syntactical relationships

betweentermsaretwopropertiesthatyieldconsiderablede-

scriptivepowertoinducethesemantichierarchyofconcepts

relatedtotheseterms.

Asophisticatedexampleforhierarchicalclusteringisgiven

by Faure &Nedellec (cf. reference [6] in survey, Table 1):

Theypresentacooperativemachinelearningsystem,ASIUM,

which acquires taxonomic relations and subcategorization

framesofverbsbasedonsyntacticinput. TheASIUMsys-

tem hierarchically clusters nouns based onthe verbs that

they are syntactically related with and vice versa. Thus,

theycooperatively extendthelexicon, theset of concepts,

andtheconceptheterarchy(L;C;HC).

6.3 Dictionary Parsing

Machine-readabledictionaries(MRD)arefrequentlyavail-

able for manydomains. Thoughtheir internal structureis

freetexttoalargeextent,therearecomparativelyfewpat-

terns thatare used togive textdenitions. Hence, MRDs

exhibit a large degree of regularity that may be exploited

for extractingadomainconceptualizationandproposingit

totheontologyengineer.

Text-To-Onto has been used to generate ataxonomy of

(6)

company(cf.reference [15]insurvey,Table 1). Likewiseto

termextractionfrom free textmorphological processing is

applied,this timehowevercomplementingseveralpattern-

matching heuristics. For examplethedictionarycontained

thefollowingentry:

Automatic Debit Transfer: Electronic service arising

froma debit authorization of theYellowAccountholder

for arecipienttodebit bills thatfallduedirectfromthe

account..

Severalheuristicswereappliedtothemorphologicallyan-

alyzeddenitions. Forinstance,onesimpleheuristicrelates

the denition term, here \automaticdebit transfer", with

therstnounphraseoccurringinthedenition,here\elec-

tronicservice". Their correspondingconceptsare linkedin

theheterarchyHC:

HC(automaticdebittransfer,electronicservice).

Applyingthisheuristiciteratively,onemayproposelarge

partsofthetargetontology,morepreciselyL;C andHC to

the ontology engineer. In fact, because verbs tend to be

modeledasrelations,R(andthelinkagebetweenRandL)

maybeextendedbythisway,too.

6.4 Association Rules

Associationrulelearningalgorithmsaretypicallyusedfor

prototypicalapplicationsofdatamining,likendingassoci-

ationsthatoccurbetweenitems,e.g.supermarketproducts,

inasetoftransactions,e.g.customers'purchases. Thegen-

eralizedassociationrulelearningalgorithmextendsitsbase-

taxonomy,e.g.\snacksarepurchasedtogetherwithdrinks"

rather than\chipsare purchasedwith beer"and \peanuts

arepurchasedwithsoda".

InText-To-Onto (cf.reference[17]insurvey,Table1)we

useamodicationofthegeneralizedassociationrulelearn-

ingalgorithm fordiscoveringpropertiesbetweenclasses. A

givenclass hierarchy HC serves as background knowledge.

Pairsofsyntacticallyrelatedclasses(e.g.pair(festival,island)

describing the head-modier relationship contained in the

sentence\ThefestivalonUsedom 6

attractstouristsfromall

overtheworld.") aregivenas inputtothealgorithm. The

algorithm generates association rules comparing the rele-

vanceofdierent ruleswhile climbingupand/or downthe

taxonomy. Theappearinglymost relevant binaryrulesare

proposedtotheontologyengineerformodelingrelationsinto

theontology,thusextendingR.

As the number of generated rules is typically high, we

oer variousmodesofinteraction. Forexample,it ispossi-

bletorestrictthenumberofsuggestedrelationsbydening

so-called restriction classes that have toparticipate inthe

relationsthatareextracted. Anotherwayoffocusingisthe

exibleenabling/disablingoftheuseoftaxonomicknowl-

edgeforextractingrelations.

Resultsare presented oeringvariousviewsonto the re-

sults as depicted inFigure 4. A generalized relation that

may beinducedbythepartiallygivenexampledataabove

maybetheproperty(event,area),whichmaybenamedby

6

Usedom is anisland located innorth-eastofGermanyin

(7)

inanarea(thusextendingLandF). Theusermayaddthe

extractedrelationstotheontologybydrag-and-drop.Toex-

ploreand determinetherightaggregationlevelofaddinga

relationtotheontology,theusermaybrowsethehierarchy

viewonextractedpropertiesasgivenintheleftpartofFig-

ure 4. This view may also support the ontology engineer

in dening appropriate subPropertyOf relations between

properties,suchassubPropertyOf(hasDoubleRoom,hasRoom )

(therebyextendingHR).

7. PRUNING THE ONTOLOGY

Acommonthemeofmodelinginvariousdisciplinesisthe

balance between completeness and scarcity of the domain

model. Itisawidelyheldbeliefthattargetingcompleteness

forthedomainmodelontheonehandappearsto beprac-

tically inmanagable and computationally intractable, and

targetingthescarcestmodelontheotherhandisoverlylim-

itingwithregard toexpressiveness. Hence, whatwestrive

foristhebalancebetweenthesetwo,whichisreallyworking.

Weaim at a modelthat capturesa richconceptualization

ofthetarget domain, butthatexcludes partsthat areout

ofitsfocus. Theimport&reuse ofontologiesaswellasthe

extraction of ontologies considerably pull the lever of the

scaleinto theimbalancewhere out-of-focus concepts reign.

Therefore,wepursuetheappropriatediminishingoftheon-

tologyinthepruning phase.

There are at least two dimensions to look at the prob-

lemofpruning. First,oneneedstoclarify howthepruning

of particular parts of the ontology (e.g., the removalof a

conceptorarelation) aectstherest. Forinstance, Peter-

sonet. al.[26]havedescribedstrategiesthatleavetheuser

withacoherentontology(i.e.nodanglingor brokenlinks).

Second,onemayconsiderstrategiesforproposingontology

itemsthatshouldbeeitherkeptorpruned. Wehaveinves-

tigated several mechanismsfor generating proposals from

applicationdata. Given aset ofapplication-specic docu-

mentsthereareseveralstrategiesforpruningtheontology.

Theyarebasedonabsoluteorrelativecountsoffrequency

ofterms(cf.reference[15]insurvey,Table 1).

8. REFINING THE ONTOLOGY

Reningplaysasimilarroleasextracting. Theirdierence

exists ratheronaslidingscale thanby aclear-cut distinc-

tion. Whileextracting servesmostlyfor cooperativemod-

elingof theoverallontology(orat leastofverysignicant

chunksofit), therenementphaseisaboutnetuningthe

targetontologyandthesupportofitsevolvingnature. The

renement phase may use data that comes from the con-

creteSemanticWebapplication,e.g. loglesofuserqueries

or generic userdata. Adapting and rening the ontology

withrespecttouserrequirementsplaysamajorroleforthe

acceptanceoftheapplicationanditsfurtherdevelopment.

Inprinciple,thesamealgorithmsmaybeusedforextrac-

tionasforrenement. However,duringrenementonemust

considerindetailtheexistingontologyandtheexistingcon-

nectionsintotheontology,whileextractionworksmoreoften

thannotpracticallyfromscratch.

A prototypical approach for renement (though not for

extraction!) has been presented by Hahn & Schnattinger

(cf.reference[11]insurvey,Table1). Theyhaveintroduced

asnewconceptsareacquiredfromtext. Theacquisitionpro-

cessiscenteredaroundthelinguisticandconceptual\qual-

ity"ofvariousforms ofevidenceunderlyingthegeneration

and renement of concept hypothesis. In particular they

consider semantic conicts and analogous semantic struc-

turesfromtheknowledgebaseintotheontologyinorderto

determinethe quality ofaparticular proposal. Thus,they

extendanexisting ontologywithnewlexical entries forL,

newconceptsforC andnewrelationsforHC.

9. RELATED WORK

Untilrecentlyontologylearningperse,i.e.forcomprehen-

siveconstructionofontologies,hasnotexisted. Weheregive

thereaderacomprehensiveoverviewoverexistingworkthat

hasactuallyresearchedandpracticedtechniquesforsolving

partsoftheoverallproblemofontologylearning.

There are only afew approaches that describedthe de-

velopment of frameworks and workbenches for extracting

ontologies from data: Faure & Nedellec [6] present a co-

operativemachinelearningsystem,ASIUM,whichacquires

taxonomicrelations and subcategorization frames ofverbs

basedonsyntacticinput. TheASIUMsystemhierarchically

clustersnounsbasedontheverbsthattheyaresyntactically

relatedwithandviceversa. Thus,theycooperativelyextend

thelexicon,thesetofconcepts,andtheconceptheterarchy

(L;C;HC).

HahnandSchnattinger[11]introducedamethodologyfor

the maintenanceofdomain-specictaxonomies. Anontol-

ogyis incrementallyupdatedasnewconceptsare acquired

from real-world texts. The acquisition process is centered

aroundlinguisticandconceptual\quality"ofvariousforms

ofevidenceunderlyingthegenerationandrenementofcon-

cept hypotheses. Theirontology learning approach is em-

beddedinaframeworkfornaturallanguageunderstanding,

namedSyndicate[10].

Mikheev&Finch[18]havepresentedtheirKAWBWork-

benchfor\Acquisition ofDomainKnowledgeformNatural

Language". Theworkbench compromises a setof compu-

tational tools for uncovering internal structure in natural

language texts. The main idea behind the workbench is

the independenceofthe textrepresentationand textanal-

ysis phases. At the representation phase the text is con-

vertedfromasequenceofcharacterstofeatures ofinterest

by means of the annotation tools. At the analysis phase

those features are used by statistics gathering and infer-

ence tools for ndingsignicant correlations in the texts.

Theanalysistoolsareindependentofparticularassumptions

aboutthenatureofthefeature-setandworkontheabstract

leveloffeatureelementsrepresentedasSGMLitems.

Muchworkinanumberofdisciplines |computational

linguistics,informationretrieval,machinelearning,databases,

software engineering | has actually researched and prac-

ticed techniques for solving part of the overall problem.

Hence,techniquesandmethodsrelevantforontologylearn-

ing maybefoundundertermslikethe acquisitionofselec-

tionalrestrictions(cf. Resnik[27]andBasilietal. [2]),word

sensedisambiguationandlearningofwordsenses(cf. Hast-

ings[34]), the computationofconcept latticesfromformal

contexts(cf. Ganter&Wille[8])and ReverseEngineering

insoftware engineering(cf. Muelleretal. [23]).

Ontology Learning putsa numberof research activities,

(8)

get of a commondomain conceptualization, into one per-

spective. Onemayrecognizethattheseactivitiesarespread

between verymany communities incurringreferencesfrom

20completelydierentevents/journals.

10. CHALLENGES

Ontology Learning may add signicant leverage to the

Semantic Web, because it propels the construction of do-

main ontologies, which are needed fastly and cheaply for

theSemanticWebtosucceed. Wehavepresentedacompre-

hensive framework for Ontology Learning that crosses the

boundaries of single disciplines, touching on a numberof

challenges. Table 1 givesa surveyof what types of tech-

niquesshouldbeincludedinafull-edgedontologylearning

and engineering environment. The good news however is

that onedoes notneed perfect or optimal supportfor co-

operativemodelingofontologies. Atleastaccordingtoour

experience \cheap" methods inan integrated environment

mayyieldtremendoushelpfortheontologyengineer.

Whileanumberofproblemsremainwiththesingledisci-

plines,somemorechallengescomeupregardingthepartic-

ularproblem of OntologyLearning for the Semantic Web.

First,withthe XML-basednamespacemechanismstheno-

tionof anontologywith well-denedboundaries, e.g.only

denitionsthat are inonele, will disappear. Rather,the

SemanticWebmayyieldan\amoeba-like"structureregard-

ing ontology boundaries, because ontologies refer to each

otherandimporteachother(cf.e.g.theDAML-ONTprim-

itiveimport). However,itisnotyetclearhowthesemantics

ofthesestructureswill looklike. Inlightofthesefactsthe

importanceofmethodslikeontologypruningandcrawlingof

ontologieswilldrastically increasestill. Second,wehaveso

farrestrictedourattentioninontologylearningtothecon-

ceptual structures that are (almost) contained inRDF(S)

proper. AdditionalsemanticlayersontopofRDF(e.g.fu-

tureOILorDAML-ONTwithaxioms,A) willrequirenew

meansforimprovedontologyengineering withaxioms,too!

Acknowledgements..

^We^thank^our^students,^Dirk^Wenke

andRaphaelVolz,forworkatOntoEdit andText-To-Onto.

ResearchforthispaperwaspartiallynancedbyOntoprise

GmbH,Karlsruhe,Germany,byUSAirForceintheDARPA

DAML project \OntoAgents", by European Union in the

IST-1999-10132 project \On-To-Knowledge", and by Ger-

manBMBFintheproject\GETESS"(01IN901C0).

11. REFERENCES

[1] H.Assadi.Constructionofaregionalontologyfrom

textanditsusewithinadocumentarysystem.In

ProceedingsoftheInternationalConferenceonFormal

OntologyandInformationSystems-FOIS'98,Trento,

Italy,1998.

[2] R.Basili,M.T.Pazienza,andP.Velardi.Acquisition

ofselectionalpatternsinasublanguage. Machine

Translation, 8(1):175{201,1993.

[3] Paul Buitelaar.CoreLex: Systematic Polysemyand

Underspecication. PhDthesis,BrandeisUniversity,

DepartmentofComputerScience,1998.

[4] A.Doan,P.Domingos,andA.Levy.LearningSource

Descriptionsfor DataIntegration.InProceedingsof

theInternationalWorkshoponTheWeband

Databases (WebDB-2000),2000.

[5] F.Esposito,S.Ferilli,N.Fanizzi,andG.Semeraro.

Learningfromparsedsentenceswithinthelex.In

ProceedingsofLearningLanguageinLogicWorkshop

(LLL-2000), Lisbon,Portugal, 2000,2000.

[6] D. FaureandC.Nedellec.Acorpus-basedconceptual

clusteringmethodforverbframesandontology

acquisition.InLRECworkshoponadaptinglexical

andcorpusresourcestosublanguagesandapplications,

Granada,Spain,1998.

[7] B. GainesandM.Shaw.Integratedknowledge

acquisitionarchitectures.JournalofIntelligent

Information Systems,1(1),1992.

[8] B. GanterandR.Wille.FormalConceptAnalysis:

(9)

Domain Method Featuresused Primepurpose Papers

FreeText Clustering Syntax Extract Buitelaar[3],Assadi[1]andFaure&

Nedellec[6]

Inductive Logic

Programming

Syntax, Logic rep-

resentation

Extract Espositoetal.[5]

Associationrules Syntax,Tokens Extract Maedche&Staab[17]

Frequency-based Syntax Prune Kietzetal. [15]

Pattern-Matching Extract Morin[22]

Classication Syntax,Semantics Rene Schnattinger&Hahn[11]

Dictionary Information extrac-

tion

Syntax Extract Hearst [12], Wilks[35]and Kietzet

al. [15]

Pagerank Tokens Jannink&Wiederhold[13]

Knowledge

base

Concept Induction,

A-Boxmining

Relations Extract Kietz & Morik [16] and Schlobach

[28]

Semi-

structured

schemata

NaiveBayes Relations Reverseengineering Doanetal.[4]

Relational

schemata

DataCorrelation Relations Reverseengineering Johannesson[14]andTarietal.[32]

Heidelberg-NewYork,1999.

[9] E.Grosso,H.Eriksson,R.Fergerson,S.Tu,and

M.Musen.Knowledgemodelingatthemillennium|

thedesignandevolutionofprotege-2000.In

Proceedings ofKAW-99,Ban, Canada,1999.

[10] U.HahnandM.Romacker.Contentmanagementin

thesyndikatesystem|howtechnicaldocumentsare

automaticallytransformedtotextknowledgebases.

Data&KnowledgeEngineering,35:137{159,2000.

[11] U.HahnandK.Schnattinger.Towardstext

knowledge engineering.InProc.ofAAAI'98,pages

129{144,1998.

[12] M.A.Hearst.Automaticacquisitionofhyponymsfrom

largetextcorpora.InProceedings ofthe14th

InternationalConferenceonComputational

Linguistics.Nantes,France,1992.

[13] J.JanninkandG.Wiederhold.Thesaurusentry

extractionfromanon-linedictionary.InProceedings

ofFusion'99,SunnyvaleCA, July1999,1999.

http://www-db.stanford.edu/SKC/publications.html.

[14] P.Johannesson.Amethodfortransformingrelational

schemasintoconceptual schemas.InM.Rusinkiewicz,

editor,10thInternationalConference onData

Engineering,pages115{122,Houston,1994.IEEE

Press.

[15] J.-U.Kietz,A.Maedche,andR.Volz.Semi-automatic

ontologyacquisitionfromacorporateintranet.In

InternationalConferenceonGrammarInference

(ICGI-2000),toappear: LectureNotesinArticial

Intelligence,LNAI,2000.

[16] J.-U.KietzandK.Morik.Apolynomialapproachto

theconstructiveinductionofstructuralknowledge.

MachineLearning,14(2):193{218,1994.

[17] A.MaedcheandS.Staab.Discoveringconceptual

relationsfromtext.InProceedingsof ECAI-2000.IOS

Press,Amsterdam,2000.

[18] A.MikheevandS.Finch.Aworkbenchfornding

structureintext.InInProceedingsofthe5th

ConferenceonAppliedNaturalLanguageProcessing

|ANLP'97,March1997,Washington DC,USA,

pages372{379,1997.

[19] G.Miller.Wordnet: Alexicaldatabasefor English.

CACM,38(11):39{41,1995.

[20] K.Morik.Balancedcooperativemodeling.Machine

Learning,11:217{235, 1993.

[21] K.Morik,S.Wrobel,J.-U.Kietz,andW.Emde.

Knowledgeacquisition andmachinelearning: Theory,

methods,andapplications.AcademicPress, London,

1993.

[22] E.Morin.Automaticacquisitionofsemanticrelations

betweentermsfromtechnicalcorpora.InProc.ofthe

FifthInternationalCongress onTerminologyand

KnowledgeEngineering-TKE'99,1999.

[23] H.A.Mueller,J.H.Jahnke,D.B.Smith,M.-A.

Storey,S.R.Tilley,andK.Wong.Reverse

Engineering: ARoadmap.InProceedingsof the22nd

InternationalConference onSoftware Engineering

(ICSE-2000),Limerick,Ireland. Springer,2000.

[24] G.Neumann,R.Backofen,J.Baur,M. Becker,and

C.Braun.Aninformationextractioncoresystemfor

realworldgermantextprocessing.InANLP'97|

ProceedingsoftheConference onAppliedNatural

Language Processing,pages208{215,Washington,

USA,1997.

[25] N.FridmanNoyandM.A.Musen.PROMPT:

AlgorithmandToolforAutomatedOntologyMerging

andAlignment.InProceedings ofthe17thNational

Conf. onArticialIntelligence(AAAI'2000),Austin,

Texas.MITPress/AAAIPress, 2000.

[26] B. Peterson,W.A.Andersen,andJ.Engel.Knowledge

bus: Generatingapplication-focuseddatabases from

largeontologies.InProc ofKRDB1998,Seattle,

Washington, USA,pages2.1{2.10, 1998.

[27] P.Resnik.SelectionandInformation: AClass-based

Approach toLexicalRelationships.PhDthesis,

UniversityofPennsylania,1993.

[28] S.Schlobach.Assertionalminingindescriptionlogics.

InProceedings ofthe2000InternationalWorkshopon

Description Logics(DL2000),2000.

http://SunSITE.Informatik.RWTH-

(10)

[29] S.Staab,J.Angele,S.Decker,M.Erdmann,

A.Hotho,A.Maedche,H.-P.Schnurr,R.Studer,and

Y.Sure.Semanticcommunitywebportals. Proc.of

WWW9/Computer Networks,33(1-6):473{491, 2000.

[30] S.StaabandA.Maedche.Knowledgeportals|

ontologiesatwork. AIMagazine,21(2),Summer2001.

[31] S.Staab,H.-P.Schnurr,R.Studer,andY.Sure.

Knowledgeprocessesandontologies.IEEEIntelligent

Systems,16(1),2001.

[32] Z.Tari,O.Bukhres,J.Stokes,andS.Hammoudi.The

ReengineeringofRelationalDatabasesbasedonKey

andDataCorrelations. InProceedingsof the7th

ConferenceonDatabaseSemantics(DS-7),7-10

October1997, Leysin,Switzerland.Chapman&Hall,

1998.

[33] G.Webb,J.Wells,andZ.Zheng.Anexperimental

evaluationofintegratingmachinelearningwith

knowledge acquisition.MachineLearning,35(1):5{23,

1999.

[34] P.Wiemer-Hastings,A.Graesser,and

K.Wiemer-Hastings.Inferringthemeaningofverbs

fromcontext.InProceedings oftheTwentiethAnnual

Conferenceof theCognitiveScienceSociety,1998.

[35] Y.Wilks,B.Slator,andL. Guthrie.ElectricWords:

Dictionaries,Computers,andMeanings.MITPress,

Cambridge,MA,1996.

Learning Ontologies for the Semantic Web