Machine Learning for Information Extraction from XML marked-up text on the Semantic Web

(1)

Machine Learning for Information Extraction from XML marked-up text on the Semantic Web

Nigel Collier

National Institute of Informatics (NII) National Center of Sciences, 2-1-2 Hitotsubashi

Chiyoda-ku, Tokyo 101-8430, Japan

collier@nii.ac.jp

ABSTRACT

Thelastfewyearshaveseenanexplosion intheamountof

textb ecomingavailable on theWorld Wide Web asonline

communities of users in diverse domains emerge to share

do cuments and other digital resources. In this pap er we

explore the issue of how to provide a low-level informa-

tion extraction to ol based on hidden Markov mo dels that

can identify and classify terminology based on previously

marked-up examples. Suchato ol shouldprovide thebasis

foradomain p ortable information extraction system, that

when combined with search technology can help users to

access information more eectively within their do cument

collectionsthanto day'sinformation retrieval enginesalone.

Wepresentresultsofapplyingthemo delintwodiversedo-

mains: news andmolecular biology and discuss the mo del

andtermmarkupissuesthatthisinvestigationreveals.

1. INTRODUCTION

Thelastfewyearshave seenanexplosion intheamount

oftext b ecomingavailableon theWorldWide Web(Web).

Encouraged by thegrowthof theWebcommunity, textual

database providers have b een migrating their archives for

online access, adding to the available information. Online

communitiesofusershaveemergedtosharedo cuments,and

otherdigitalresourcesinmultipleanddiversedomains. The

issueofinformation access,i.e. howtondtheinformation

thatmeetsusers'requirements and present it in anunder-

standableformhasnow b ecomeamajorresearchissue. In

order to accomplish this we need to emp ower computers

with an `understanding' of the users' texts. One scenario

thatis b eingexplored, e.g. [34],istocombine information

retrieval (IR)with information extraction (IE). It is likely

thatacriticalcomp onent insuchasystemwillb eanability

tolearntoidentifyandclassify termsbased onexamplesof

previously markeduptext.

Forthispurp oseextensiblemarkuplanguage (XML)[35]

[14]seemstob e mostappropriateforsemanticannotation,

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission by the authors.

Semantic Web Workshop 2001 Hongkong, China Copyright by the authors.

notonlyforterminology,butalsofortasksthatrequire the

learning of relations b etween those terms. XML incorp o-

rates a numb er of p owerful features for describing object

semantics, suchasinheritance ofname spaces bychild ele-

mentsfromtheirparents. Namespaces canb enestedwith

the youngestancestorwithin anamespacedeclaration de-

terminingthecurrentscop e. XMLallowsustorepresentse-

manticsthroughp otentiallyunb oundedhyptertagsbutdo es

notbyitself attempttointerpretthemeaningofthelab els.

AtthelowestlevelofIE,asystemshouldb eabletoidentify

andclassify object,i.e. term,b oundariesandclassify them

according tothe semantic classes and ontologies describ ed

inthe trainingdo cumentsthroughmechanismssuchas the

usualdo cumenttyp edeclaration(DTD)andXMLSchemas

thatarenowb eingprop osed[18].

In our work we emphasize the need for IE to ols to b e

adaptable to dierent domains and languages rather than

as general-purp ose to ols dueto thedistinct semantic char-

acteristics ofeachdomain.

Inthe remainder of this pap er wepresent a metho d for

termidenticationandclassicationbasedonhiddenMarkov

mo dels (HMMs)[24]thatlearns fromannotatedtexts. We

describ eitsp erformanceintwodomains,newsandmolecular-

biologyanddiscusssomeofthetermmarkupissuesthatour

analysis revealed.

2. BACKGROUND

Informationextractionhasdevelop edinthelasttenyears

fromacollectionofad-ho cmetho dsintoadisciplin ethatfo-

cusesonasmallsetofwelldenedsubtasks. Thishaslargely

o ccurredduetotheinuenceoftheDARPA-sp onsoredMes-

sageUnderstandingConferences(MUCs)intheUSA[9][10]

andrecentlytheJapaneselanguageIREXconference[29].

Inthelast fewyearsworkhasb egun on adapting IEfor

the technical domain of molecular biology, e.g. [6][8] [16]

[26][27][31].AswithpreviousIEapproaches,thesesystems

canb eclassed as eitherpredominantly dictionary-bas ed or

learning-based. Itisourviewthatthehand-builtdictionary-

basedsystemscannotb eexp ectedtob eeasilyp ortedtonew

domainsandtheyignoreap otentiallyvaluablesourceofthe

domain exp ert'sknowledge,i.e. markeduptexts.

Recent studies into theuse of sup ervised learning-based

mo dels for the `named entity' task in b oth the news and

molecular-biol ogydomainhaveshownthatmo dels basedon

HMMs [1][5][30], maximum entropy [2] and decision trees

[28][21]are muchmoregeneralisabl e and adaptable tonew

classes of words than systems based on traditional hand-

(2)

overcoming the problems asso ciated with data sparseness

withthehelpofsophisticated smo othingalgorithms [3].

HMMs are one of the most widely metho ds in ML for

IE.Theycanb econsideredtob esto chasticnitestatema-

chines and have enjoyed success in a numb er of elds in-

cluding sp eechrecognition and part-of-sp eechtagging [17].

Ithas b een natural therefore that thesemo dels have b een

adapted for use in other word-class prediction tasks such

as thenamed-entity task in IE. Suchmo dels are based on

n-grams. Although the assumption that a word'spart-of-

sp eechor nameclass canb e predicted by theprevious n-1

words and their classes is counter-intuiti ve to our under-

standing of linguisti c structures and long distance dep en-

dencies,thissimplemetho ddo esseemtob ehighly eective

inpractice. Nymble[1],asystemwhichusesHMMsis one

ofthemostsuccessful suchsystems andtrains onacorpus

ofmarked-uptext,usingonlycharacterfeaturesinaddition

towordbigrams.

AlthoughitisstillearlydaysfortheuseofHMMsforIE,

wecanseeanumb er oftrendsintheresearch. Systemscan

b edivided intothose whichuseonestatep er classsuchas

Nymble (atthetoplevel oftheirbackomo del) andthose

whichautomatically learnab outthemo del'sstructuresuch

as [30]. Additionall y, there isa distinction to b e made in

thesourceofthe knowledge forestimating transitionprob-

abilities b etween mo dels which are built by hand such as

[12]andthosewhichlearnfromtaggedcorp orainthesame

domainsuchasthemo delpresentedinthispap er,wordlists

andcorp oraindierentdomains-so-calleddistantly-lab eled

data[30].

Despite their success, HMMs and other machine learn-

ingmetho ds canonly b e as successfulas thefeatures with

whichthey aretrained. Inthis study we havefo cussed on

developing a simple, yetp owerful setof features based on

orthographic knowledge, that canb ep ortable b etweendo-

mains. Wenowpresentan overviewofthetrainingcorp ora

weused. ThisisfollowedbytheresultsforaHMMnamed-

entity system fortwo diverse domains: news and molecu-

larbiology,usingorthographic, lexicalandclassfeaturesto

train the mo dels. We also discuss some of the problems

that our results revealed, particularly from lo cal syntactic

relations,duetothelo cal contextual viewthe mo delto ok.

3. CORPORA

Inourexp erimentsweusedabstractsinthemolecularbi-

ologydomainavailablefromPubMed'sMEDLINE[19]that

weremarkedupin XMLbyadomainexp ert[23]aswellas

asmallcollection ofnews textsusedinthe MUC-6confer-

ence[9]formalanddryruns. Itisworthnotingthatatthe

present time,nostandard testsetsexistforthenameden-

titytaskoutside ofnews(MUCandIREX),makingformal

systemcomparisons quitediÆcult. Thiswill clearly b e an

imp ortantfactorinthefuturedevelopmentofIEintechnical

domains.

An example of a marked-up sentence from a news text

canb eseeninFigure1. Wecanseeseveralinteresting fea-

turesofthedomainsuchasthefo cusofNEsonp eopleand

organization proles. Moreoverweseethattherearemany

pre-namecluewordssuchas\Ms."or\Rep."indicatin gthat

aRepublican p oliticia n'snameshouldfollow.

In contrast we can seean example of a marked-up sen-

tion wecanseethatmanyare combinations ofprop erand

commonnounswithcross-overofvo cabulary b etweenname

classes, e.g. the lemma `cell' b elongs to b oth PROTEIN,

SOURCE.ctandSOURCE.clterms.

Thesetsofnameclassesforthetwodomainsaregivenin

Tables1and2.

4. METHOD

Ourinitialapproachwasmotivatedbytheneedtoacquire

themajorityoftermsthatdonotinvolvecomplexstructural

analysis torecovertheir baseforms. Forthisweconsidered

that aHMM approachbased on rawtext stringsof words

and nodeep linguisti c analysis iswell suited. Later ween-

visage that more sophisticated markup and pre-pro cessing

metho ds will b e needed to handle term structure and we

hop etoincorp oratethesewithintheautomaticlearningap-

proachwehaveadoptedin ourwork.

Thepurp ose of our mo del is to nd the most likely se-

quence of name classes (C) fora given sequence of words

(W).Thesetofnameclassesincludesthe`Unk'nameclass

whichweuseforbackgroundwordsnotb elongingtoanyof

theinterestingnameclassesgivenforthetwodomains (see

Tables1and2) andthegivensequence ofwordswhichwe

use spans asingle sentence. Thetaskis therefore to max-

imize Pr (CjW). We implement a HMM to estimate this

using theMarkovassumption that Pr (CjW) canb e found

frombigramsofnameclasses.

Inthe following mo del we consider words to b eordered

pairs consisting ofasurfaceword,W, andawordfeature,

F, given as < W;F >. The word features themselves are

discussedin Section4.2.

Asiscommonpractice,weneedtocalculatetheprobabil-

itiesforawordsequencefortherstword'snameclassand

everyother word dierently since wehave no initial name-

class to make a transition from. Accordingly we use the

followingequationtocalculatetheinitialnameclass proba-

bility,

Pr (Ctj<Wfir st;Ffir st>)=

0 f(C

fir st j<W

fir st

;F

fir st

>)+

1 f(C

fir st j< ;F

fir st

>)+

2 f(C

fir st

) (1)

andforallotherwordsandtheir nameclassesasfollows:

Pr (Ctj<Wt;Ft>;<Wt 1;Ft

1

>;Ct 1)=

0f(Ctj<Wt;Ft>;<Wt 1;Ft

1

>;Ct 1)+

1 f(C

t j< ;F

t

>;<W

t 1

;F

t 1

>;C

t 1 )+

2 f(C

t j<W

t

;F

t

>;< ;F

t 1

>;C

t 1 )+

3 f(C

t j< ;F

t

>;< ;F

t 1

>;C

t 1 )+

4f(CtjCt 1)+

5f(Ct) (2)

where f(j) is calculated with maximum-likeli hoo d esti-

(3)

TYPE="PERSON">Washington</ENAMEX> worked as a laywer for the corp orate nance division of the <ENAMEX

TYPE="ORGANIZATION">SEC< /ENAMEX > in the late <TIMEX TYPE="DATE">1970s</TIMEX>. She

has b een a congressional staer since <TIMEX TYPE= "DATE">1979</TIMEX>. Separately, <ENAMEX

TYPE="PERSON">Clinton</ENAMEX> transition oÆcials said that <ENAMEX TYPE="PERSON">Frank

Newman</ENAMEX>, 50, vice chairman and chief nancial oÆcer of <ENAMEX TYPE="ORGANIZATION">BankAmerica

Corp.</ENAMEX>, is exp ected to b e nominated as assistant <ENAMEX TYPE="ORGANIZATION">Treasury</ENAMEX>

secretary for domestic nance. Mr. <ENAMEX TYPE="PERSON">Newman</ENAMEX>, who would b e giving

up a job that pays <ENAMEX TYPE="MONEY">$1 million</ENAMEX> a year, would oversee the <ENAMEX

TYPE="ORGANIZATION">Treasury</ENAMEX>'s auctions of government securities as well as banking issues. He

would rep ort directly to <ENAMEX TYPE="ORGANIZATION">Treasury</ENAMEX> Secretary-designate <ENAMEX

TYPE="PERSON">Lloyd Bentsen</ENAMEX>. Mr. <ENAMEX TYPE="PERSON">Bentsen</ENAMEX>, who headed

the <ENAMEX TYPE="ORGANIZATION">Senate Finance Committee</ENAMEX> for the past six years, also is exp ected

to nominate <ENAMEX TYPE="PERSON"> Samuel Sessions</ENAMEX>, the committee's chief tax counsel, to one of

the top tax jobs at <ENAMEX TYPE="ORGANIZATION">Treasury</ENAMEX>. As early as to day, the <ENAMEX

TYPE="PERSON">Clinton</ENAMEX>campisexp ectedtonameveundersecretariesofstateandseveralassistantsecretaries.

Figure1: Examplesentences taken fromthe annotatedMUC-6 NEcorpus

TI-Activationof<PROTEIN> JAKkinases</PROTEIN>and<PROTEIN>STAT proteins</PROTEIN> by<PROTEIN> in-

terleukin-2 </PROTEIN> and<PROTEIN>interferonalpha</PROTEIN>,but notthe<PROTEIN> Tcellantigenreceptor

</PROTEIN>,in<SOURCE.ct>humanTlympho cytes</SOURCE.ct>.

AB - The activation of <PROTEIN> Janus protein tyrosine kinases </PROTEIN> ( <PROTEIN> JAKs </PROTEIN>

) and <PROTEIN> signal transducer and activator of transcription </PROTEIN> ( <PROTEIN> STAT </PROTEIN> )

proteins by <PROTEIN> interleukin ( IL ) - 2 </PROTEIN> , the <PROTEIN> T cell antigen receptor </PROTEIN>

( <PROTEIN> TCR </PROTEIN> ) and <PROTEIN> interferon ( IFN ) alpha </PROTEIN> was explored in

<SOURCE.ct> human p eripheral blo o d - derived T cells </SOURCE.ct> and the <SOURCE.cl> leukemic T cell line

Kit225 </SOURCE.cl> . An <PROTEIN>IL-2</PROTEIN>-induced increase in <PROTEIN>JAK1</PROTEIN> and

<PROTEIN>JAK3</PROT EIN>, butnot<PROTEIN>JAK2</PROTEIN> or<PROTEIN>Tyk2</PROTEIN>, tyrosinephos-

phorylationwasobserved. Incontrast,noinductionof tyrosinephosphorylationof <PROTEIN>JAKs</PROT EIN> wasdetected

up on stimulationof the <PROTEIN>TCR</PRO TEIN> . <PROTEIN>IFN alpha</PROTEIN> inducedthe tyrosine phospho-

rylation of <PROTEIN>JAK1</PROT EIN> and <PROTEIN>Tyk2</PROTEIN>, but not <PROTEIN>JAK2</PROT EIN>

or <PROTEIN>JAK3</PROTEIN> . <PROTEIN>IFN alpha</PROTEIN> activated <PROTEIN>STAT1</PROT EIN> ,

<PROTEIN>STAT2</PROT EIN> and<PROTEIN>STAT3</PROTEIN> in<SOURCE.ct>T cells</SOURCE.ct>, butno de-

tectableactivationofthese<PROTEIN>STATs</PROT EIN> wasinducedby<PROTEIN>IL-2</PROTEIN>.

Figure2: ExampleMEDLINEsentencetakenfromthe XMLannotated molecular-biologyNEcorpus

Class # Example Description

PROTEIN 2125 JAKkinase proteins,proteingroups,

families,complexesandsubstructures.

DNA 358 IL-2promoter DNAs,DNAgroups,regionsandgenes

RNA 30 TAR RNAs,RNAgroups,regionsandgenes

SOURCE.cl 93 leukemicTcelllineKit225 cellline

SOURCE.ct 417 human Tlymphocytes celltyp e

SOURCE.mo 21 Schizosaccharomyce spombe mono-organism

SOURCE.mu 64 mice multi-organism

SOURCE.vi 90 HIV-1 viruses

SOURCE.sl 77 membrane sublo cation

SOURCE.ti 37 centralnervoussystem tissue

UNK - tyrosinephosphorylation backgroundwords

Table 1: Namedentity classesfor themolecularbiologydomain. # indicatesthenumb eroftaggedtermsin

thecorpusof100 abstracts.

Class # Example Description

ORGANISATION 1783 HarvardLawSchool namesoforganisations

PERSON 838 Washington namesofp eople

LOCATION 390 Houston namesofplaces,countriesetc.

DATE 542 1970s dateexpressions

TIME 3 midnight timeexpressions

MONEY 423 $10 million moneyexpressions

PERCENT 108 2.5% p ercentageexpressions

UNK - start-upcosts backgroundwords

Table 2: Namedentity classes for the news domain. # indicatesthe numb eroftagged terms inthe corpus

(4)

f(C

t j<W

t

;F

t

>;<W

t 1

;F

t 1

>;C

t 1 )

:

=

T(<Wt;Ft>;Ct;<Wt 1;Ft

1

>;Ct 1)

T(<Wt;Ft>;<Wt 1;Ft

1

>;Ct

1 )

(3)

WhereT()hasb eenfoundfromcountingtheeventsinthe

trainingcorpus. Inourcurrent systemwesettheconstants

i and i by hand andlet P

i =1:0, P

i =1:0, 0

1

2 ,

0

1 :::

5

. The current name-class C

t is

conditioned on the current wordand feature,the previous

name-class,Ct

1

,andpreviouswordandfeature.

Inourcurrent systemwesetthe constantsi and i by

handbutclearlyab etterwaywouldb etodothisautomat-

ically. An obvious strategy to use would b e to use some

iterative learning metho d such as Exp ectation Maximiza-

tion [11]. Wealso imp ose the restriction that P

i

=1:0,

P

i

=1:0,

0

1

2 ,

0

1 :::

5

. Thecurrent

name-class Ct is conditioned on the current word and fea-

ture,thepreviousname-class,C

t 1

,andpreviouswordand

feature.

Equations1and2implementalinear-interpolatingHMM

thatincorp oratesanumb erofsub-mo delsdesignedtoreduce

theeects ofdatasparsenessandimprovegeneralisabi l ity.

Once the state transition probabiliti es have b een calcu-

latedaccordingtoEquations1and2,theViterbialgorithm

[33]isusedtosearchthestate spaceofp ossiblenameclass

assignments in linear time to nd the highest probability

path,i.e. tomaximise Pr (W;C).

Thenalstageofouralgorithmthat isusedaftername-

classtaggingiscompleteistouseaclean-upmo dulecalled

Unity. This creates a frequency list of words and name-

classesandthenre-tagsthe textusingthe mostfrequently

usednameclass assigned by theHMM.Wehave generally

foundthatthisimprovesF-scorep erformancebyb etween2

and4%,b othforre-taggingspuriouslytaggedwordsandfor

ndinguntaggedwordsin unknowncontextsthathadb een

correctlytaggedelsewhereinthetext.

4.1 Tokenization

Before featurising we p erform two pre-pro cessing tasks:

sentenceb oundaryidenticatio nandwordtokenization. Th-

esepro ceed accordingtoquite simplealgorithms thathave

nevertheless proven to b e adequately eective. Sentence

b oundaryidenticati on simplytreatsfullstops`.' asendof

sentence exceptforafewsp ecialcasessuchasabbreviation

markinganddecimalp ointsthatarehandledwithheuristic

rules. Although this metho d is farless sophisticated than

otherssuchas[25],wefoundthatitp erformedwellinprac-

tice,althoughwewouldexp ecttoencountermoresignicant

problemsinengineering andtechnical domains.

Tokenization treatsa continuous string of letters and/or

numerals as a word, converts multiple space sequences to

singlespaces, removesnon-printablecharactersexceptend-

of-line,andtreatspunctuation(includin ghyphen)asasep-

arate`word'. Pro cessing isdone sentence by sentence. All

wordsareassignedafeatureco dedep endingontheirortho-

graphicform.

4.2 Orthographic features

InordertogeneralisetheHMMs'knowledgeab outsurface

formsitisnecessarytofeaturisethevo cabularyinsomeway.

Onanalysinglistsoftermswefeltthatorthographicfeatures

shouldb enoted thattheexamples donotshowthe

full form of terms, but simply examples of `words'

thatmake upthetermtogetherwiththeirsemantic

classication.

Featureco de Examples

TwoDigitNumb er [25]per cent

FourDigitNumb er [2000]date

DigitNumb er [2]per cent[3]D NA

SingleCap [I]pr otein[B]pr otein

[T]

sour ce:ct

GreekLetter [alpha]

pr otein

CapsAndDigits [2A]D N

A

[BW5147]

sour ce:cl

[CD4+]

sour ce:ct

TwoCaps [RelB]pr otein [TAR]RNA

[HMG]D NA[NF]D NA

[NF]

pr otein

LettersAndDigits [p50]pr otein [Kit225]sour ce:cl

InitCap [Interleukin]pr otein

[Washington]

per son

LowCaps [kappaB]

pr otein

[mRNA]RNA

Lowercase [cytoplasmic]

sour ce:sl

[tax]

pr otein

Determiner the

Conjunction and

FullStop .

Comma ,

Hyphen [-]

pr otein [-]

D NA

Colon :

SemiColon ;

Op enParen (

CloseParen )

CloseSquare ]

Op enSquare [

Percent %

Other *+#

Backslash [/]pr otein

particularly inmolecularbiology.

Table3shows thecharacterfeatures thatweusedin the

HMM.Ourintuitionisthatsuchfeatureswillhelpthemo del

to nd similarities b etween known words that were found

in the training set and unknown words and so overcome

theunknownwordproblem. Eachwordisdeterministicall y

assignedasinglefeature,givingmatchingfeaturesnearerto

thetopofthetablepriority overthose lowerdown.

5. EXPERIMENT AND RESULTS

We ran exp eriments on the 60 news texts using 6-fold

cross-validation(50training,10testing)andthe100molec-

ular biology textsusing5-foldcross-validation (80training,

20testing). Wethencalculatedthescoresasan average of

theF-scoresforeachmarked-upclass category.

TheresultsaregivenasF-scores,acommonmeasurement

foraccuracy in theMUC conferencesthat isthe harmonic

mean ofrecall andprecision. Theseare calculated using a

(5)

HMMwithUnity 78.4 75.0

HMMw/oUnity 74.2 73.1

Table 4: Named entity acquisition results for the

MUC-6 and molecular biology domains calculated

by n-foldcross validation.

F scor e=

2Pr ecisionRecal l

Pr ecision+Recal l

(4)

Theresultsaresummarisedforall classesineachdomain

inTable4andshowp erformancewithandwithouttheUnity

mo dule. Despite the small numb er of training texts used,

the system could achieve reasonably high p erformance for

b othdomains. Although results indicatethat p erformance

fornewstextsisslightly b etterthan forbiology,thisneeds

conrmingwithalargertestcollectiontoobtaincondence

in the conclusion. The result also highlights the need for

soundly motivated metrics (e.g. see [22]) to compare the

diÆcultiesofthenamedentitytaskb etweenmarked-upcor-

p ora in dierent domains. In the following discussion we

provide failureanalysis oftheresults.

6. ANALYSIS

Inthissectionweconcentrateourdiscussionontheanaly-

sisofresultsfromthemolecular-bio log y corpus. Discussion

of term markup issues for the MUC news domain is well

do cumented,e.g. in[20].

During analysis of the corpus we found that a numb er

of syntactic phenomena caused p otential complications to

identication of term b oundaries and their classication .

Broadly sp eaking the major ones can b e divided into co-

ordination,app ositionandabbreviation, althoughthereare

manyother issues thatwecannot coverhere suchasuseof

negativesintermnames suchas`non-T-cells' andtheneed

toinfersometerm'sclassesfromknowledgecontainedwithin

adomainmo del. Thethreemajorissues arenowdiscussed

b elow.

6.1 Coordination

We observed that failure o ccurred where complex lo cal

structures could not b e resolved by the shallow-level con-

textual viewofthe HMM.Co ordination, as appliedto the

app earanceoftermsinmolecularbiologytextsapp earsquite

frequently through the use of the co ordinators and, slash

(`/'),andhyphen(`-'). Hyphenationisparticularly trouble-

someforourimplementation of theHMMasit isalso one

oftheorthographicfeaturesusedinmanyprotein andgene

names.

Inthefollowing examples,op enandclosetagshaveb een

representedas`['and`]'resp ectivelyforbrevityandthetag

nameapp earsasasubscript.

Training examples (1),(2)and (3)allshowthe need for

structural analysis to take place to transform(at least in-

ternally within the IE software) theterm back to its base

form. Comparing (1)and (2)weseethat `/'is sometimes

usedaspartofthetermandsometimesasaco ordinator.

Ex. 1. ...like the [c-rel]pr otein and [v-rel]pr otein (proto)-

oncogenes.

Ex. 2. ...regulated by members of the [rel/NF-kappa B

family]

pr otein

Ex. 3. ...involvesphosphorylationofseveralmembersof

the[NF-kappaB]

pr otein

/[IkappaBprotein]

pr otein

families.

Examples(2)and(3)resultedinconicting patterns,i.e.

sometimes`/'shouldactasaco ordinatorandsometimesas

partofthetermitselfascanb eseenintheHMMoutputin

(4)and(5).

Ex. 4. ...regulated by members of the rel/[NF-kappa B

family]pr otein

Ex. 5. ...involvesphosphorylationofseveralmembersof

the[NF-kappaB/IkappaB]proteinfamilies.

Training examples (6)and (7)show morecomplex cases

wheretheannotationschemehasnotallowedthedomainex-

p erttofullyexpressherintuitionab outtheclassesofterms,

particularly where thereasequence ofconjoinedmo diers.

In (6) we see a case of elision where the exp ert was not

happy to markup`[c-]pr otein'as atermwithout b eing able

toshowtheattachmentto`-rel'andwasalsouneasyab out

marking up `c-and v-rel' as the term. In(7) we see that

althoughtheheadnoun\regions",whichdominatesthelist

shouldimp oseaproteincategoryon`TATA',without away

ofmarkingupthisrelation,theexp ertpreferstotag`TATA'

asDNAaccording totheclass ofthebasicterm underdis-

cussion. Cases such as these indicate the need for richer

markupmetho ds.

Ex. 6. ThisproteinreducesorabolishesinvitrotheDNA

bindingactivityofwild-typeproteins ofthesamefamily (-

[KBF1]

pr otein /[p50]

pr otein

,c-and[v-rel]

pr otein ).

Ex. 7. .. indicated that multipleregulatory regions in-

cludingtheenhancer,[SP1]pr otein,[TATA]D NA and[TAR]

pr otein

regionswereimportantfor[HIV]

sour ce:v i

geneexpres-

sion.

Itisinterestingtonotethatannotatorsmarkingupmulti-

nameexpressionsintheMUC-6corpusfacedasimilardilemma

asnoted in[20].Forexample\<ENAMEXTYPE=

\LOCATION">North</ENAMEX>and<ENAMEXTYPE-

=\LOCATION">South America</ENAMEX>. In their

casetheyincludetheheadwordwiththenalmo dierasa

namedentitybutmakeno explicit link toits relation with

earlier mo diers.

6.2 Apposition

Toquote from[15]: app ositions are \Twoor morenoun

phrases are in app ositionwhen they haveidentity ofrefer-

ence". Intheexampleswegiveb elow,thereferentprovides

usefulinformation ab outtheapp osition phrases' class that

requires thisrelation to b erecognised. Forexample in (9),

\transcription factor" provides the information that \NF-

Kappa B" is a typ e of protein, and in (8), \RelB-p52" is

a\Rel-NF-kappa Bcomplex". i.e. theapp ositions provide

attribute information ab out theterm. This is particularly

useful in higher level IE tasks that require us to combine

attributesforaparticularterm. Thechallengep osedbyap-

p ositionsisoftentoknowwheretostartandendtheterm's

b oundaries, particularly where no punctuation is used to

(6)

kappaB]

pr otein

complexes,[RelB]

pr otein -[p52]

pr otein canup-

regulatethesynthesisof[IkappaBalpha]pr otein.

Ex. 9. The transcriptionfactor[NF-Kappa B]

pr otein is

storedinthe[cytoplasm]sour ce:sl...

App osition by itself in example (8) did not seem to b e

the cause of the problem, but rather the conjunction im-

pliedbythehyphenmakesitunclearwheretobreakthese-

quence\RelB-p52" asseenin (10). Interestingly theHMM

managedtocorrectly ndthebreakintheearlier sequence

\Rel-NF-kappaB".

Ex. 10. However,similarlytotheother[Rel]

pr otein -[NF-

kappaB]pr otein complexes,[RelB-p52]pr oteincanupregulate

thesynthesisof[IkappaBalpha]

pr otein .

Theapp ositionofexample(9)alsop osednodiÆculty for

theHMMasshownin(11).

Ex. 11. Thetranscriptionfactor[NF-KappaB]pr oteinis

storedinthe[cytoplasm]sour ce:sl...

6.3 Abbreviation

Ofmoreimmediateconcerntousarecaseswhereabbrevi-

ationso ccursinsidethetermitselfasshowninthetraining

example (12). Here we require deep er analysis than can

b eobtained through the lo cal contextual viewusedbythe

HMMthatwehavepresented.

Ex. 12. The[interleukin-2(IL-2)promoter]D NAconsists

ofseveralindependent[T cellreceptor(TcR)responsiveel-

ements]D NA.

The result from the HMM forexample (12) is given in

(13).

Ex. 13. The[interleukin-2]

pr otein ([IL-2]

pr otein

)promoter

consistsofseveralindependent[Tcellreceptor]

pr otein ([TcR]

pr otein)responsiveelements.

It is imp ortant to rememb er that the HMM lo oks for

the most likely sequence of classes that corresp ond to the

word sequence and that, for example, the word sequence

for\interleukin-2" isfarmorelikely tob eaprotein thana

DNA,giventhatabbreviationsusuallyo ccuraftertheterm

hasnished andaremostlymarkedupasseparatetermsin

theirownright. Here thoughwehave an exceptionand it

requiresquitesophisticated pro cessingtorecognisetheem-

b edded abbreviation do esnot formpart ofthe term itself,

andthat the head ofthe rstterm is\promoter". A sim-

ilar case canb e found in the second term `Tcell receptor

resp onsiveelements',wheretheabbreviationshouldalsob e

considered asaseparateterm. ThediÆcultycanb etraced

backtolimitationsinourmarkupschemewhichdidnotal-

low the domain exp ert to express her intuition ab out the

term'sstructure with either nested or cross-overtags. Al-

thoughnestingoftagsisallowedinXML,cross-overoftags

requiresawork-around.

7. DISCUSSION

Many ofthe caseswhere themo del failed to recognisea

(HMM)and ofthemark-upscheme. Althoughnon-nesting

ofmarked-up termsallows ustodevelop rathersimple ML

mo dels, itdo es notreecttheapp earance oftermsasthey

arewrittenbydomainexp erts. XMLitself do esnotimp ose

anestingrestrictionon usbutdo esinsistthatthereshould

b enocross-overofnamespaces. Althoughthiscanb edealt

withbyworkarounds itisnotideal.

Despite a limited context window used in analysis, the

HMMp erformedquite well, showing thatnite statetech-

niques cangivego o dresultsdespiteshallow linguistic anal-

ysis. Unliketraditionaldictionary-based termidentication

metho dsusedinIE,themetho dwehaveshownhasthead-

vantageofb eingp ortableandnohand-made patternswere

used. Thestudy also indicated thatmore training data is

b etterandthatwehavenotyetreachedap eakinthelevel

of p erformance using the small training set that we have

available.

Forourfutureworkweprop osetouseshallowdep endency

analysis to`normalise' theterm, i.e. to disambiguate lo cal

syntactic structures. Theoutput of thiswill b e emb edded

asXMLmarkupintothe do cument forthelearningby the

MLcomp onent ofthenamedentity mo dule.

8. CONCLUSION

In this pap er we have presented a scenario for the ex-

ploitation ofterminologycontainedwithinXMLmarked-up

do cumentsthatarepro ducedbyonline domain-based com-

munities forautomatically annotatinguntaggedtextsbased

on previously seen examples. Although we have fo cussed

largelyonthelowestlevelofIE,i.e.identicati on andclas-

sication of terms, the idea can b e extended to learning

higher-level information structuresforuse invariousinfor-

mation access to ols as well as long distance dep endencies

suchasanaphorathatarenecessarytoestablishequivalence

relationsb etweenterms,theirabbreviation sandreferencing

pronouns.

AlimitationoftheHMMapproachisthatitcannoteasily

mo dellargefeaturesetsduetofragmentationoftheproba-

bilitydistributio n. Inourfutureworkwewouldliketolo ok

at combining orthographic knowledge with other typ es of

lexical information as well as contextual clues fromgram-

matical dep endency analysis. Itis likely thatdierent ML

techniques will b e required that canmore eectively com-

binetheknowledgefromlarge,p ossiblysparse,featuresets.

TheuseofSupp ortVectorMachines(SVMs)[32][7],which

havesuccessfullyb eenappliedinotherresearchareas,seems

onefruitful directionthatweare currently exploring.

Acknowledgements

I am grateful to a numb er of my colleagues at the Tsujii

lab oratory, University ofTokyo,where mostof theexp eri-

mentsforthispap erwerecarriedout.Inparticular Iwould

like toexpress mygratitudetoSang-Zo o LeeandChikashi

Nobata (CRL) fortheir helpful comments concerning ma-

chine learning metho ds, Yuka Tateishi and Tomoko Ohta

for their eorts to pro duce the tagged corpus used in the

molecular-biol ogy exp eriments and also to Professor Jun-

ichiTsujiiforhissupp ortinthis work. Iwouldalsolike to

thanktheanonymousreviewersfortheir helpful comments

(7)

9. REFERENCES

[1] D.Bikel,S.Miller,R.Schwartz,andR.Wesichedel.

Nymble: ahigh-p erformance learningname-nder.In

ProceedingsoftheFifthConfererenceonApplied

NaturalLanguageProcessing,pages194{201,1997.

[2] A.Borthwick,J.Sterling, E.Agichtein,and

R.Grishman.Exploiting diverseknowledge sources

viamaximumentropy innamedentity recognition.In

ProceedingsoftheWorkshoponVeryLargeCorpora

(WVLC'98),1998.

[3] S.ChenandJ.Go o dman.Anempirical studyof

smo othingtechnniques forlanguagemo deling.34st

AnnualMeetingoftheAssociationofComputational

Linguistics,California,USA,24{27June 1996.

[4] N.Chinchor.MUC-5evaluation metrics.InIn

ProceedingsoftheFifthMessageUnderstanding

Conference(MUC-5), Baltimore,Maryland,USA.,

pages69{78,1995.

[5] N.Collier,C.Nobata,andJ.Tsujii.Extractingthe

namesofgenesandgenepro ducts withahidden

Markovmo del.InProceedingsofthe18th

InternationalConferenceonComputational

Linguistics(COLING'2000),Saarbrucken,Germany,

July31st{August4th2000.

[6] N.Collier,H.Park,N.Ogata,Y.Tateishi, C.Nobata,

T.Ohta,T.Sekimizu,H.Imai,andJ.Tsujii.The

GENIAproject: corpus-basedknowledgeacquisition

andinformation extractionfromgenomeresearch

pap ers.InProceedingsoftheAnnualMeetingofthe

EuropeanchapteroftheAssociationfor

ComputationalLinguistics(EACL'99),June 1999.

[7] C.CortesandV.Vapnik.Supp ort-vectornetworks.

MachineLearning,20:273{297, Novemb er1995.

[8] M.CravenandJ.Kumlien.Constructingbiologica l

knowledge basesbyextracting information fromtext

sources.InProceedingsofthe7thInternational

ConferenceonIntelligentSystempsforMolecular

Biology(ISMB-99),Heidelburg,Germany,August

6{101999.

[9] DARPA.ProceedingsoftheSixthMessage

UnderstandingConference(MUC-6),Columbia,MD,

USA,Novemb er1995.MorganKaufmann.

[10] DARPA.ProceedingsoftheSeventhMessage

UnderstandingConference(MUC-7),Fairfax,VA,

USA,May1998.

[11] A.Dempster,N.Laird,andD.Rubins.Maximum

likeliho o d fromincomplete dataviatheEM

algorithm.JournaloftheRoyalStatisticalSociety(B),

39:1{38,1977.

[12] D.FreitagandA.McCallum.Informationextraction

withHMMsandshrinkage.InProceedingsofthe

AAAI'99WorkshoponMachineLearningfor

InformationExtraction,Orlando,Florida,July19th

1999.

[13] K.Fukuda,T.Tsuno da,A.Tamura,andT.Takagi.

Towardinformation extraction: identifyingprotein

namesfrombiological pap ers.InProceedingsofthe

PacicSymposiumonBiocomputing'98(PSB'98),

January1998.

[14] I.GrahamandL.Quin.XML SpecicationGuide.

JohnWileyandSons: NewYork,1999;ISBN

[15] S.Greenbaum andR.Quirk.AStudent'sGrammar of

theEnglishLanguage.Longman,Essex,England,

1990.

[16] K.Humphreys,G.Demetriou,andR.Gaizauskas.

Twoapplicationsofinformation extraction to

biologica l sciencejournalarticles: Enzymeinteractions

andprotein structures.InProceedingsofthePacic

Rim SymposiumonBio-Computing2000(PSB'2000),

January2000.

[17] J.Kupiec.Robustpart-of-sp eechtaggingusinga

hidden markovmo del.ComputerSpeechand

Language,6:225{242,1992.

[18] A.Layman,E.Jung,E.Maler,H.S.Thompson,

J.Paoli,J.Tigue, N.H.Mikura,andS.DeRose.

Xml-data,w3cnote05jan1998.TheXML-Data

sp ecication s arestilldeveloping . Thelatest

description canb efoundat

http://www.w3.org/TR/1998/NOTE-XML-data-

0105/,January

1998.

[19] MEDLINE.ThePubMeddatabasecanb efoundat:,

1999. http://www.ncbi.nlm.nih.gov/PubM ed/.

[20] NewYorkUniversity.NamedEntityTask Denition,

Version2.0,Thisdo cumentcanb efoundonlineat

http://cs.nyu.edu/cs/ faculty/grishman/

NEtask20.b o ok 1.html, May31st1995.

[21] C.Nobata,N.Collier, andJ.Tsujii.Automaticterm

identicati on andclassicationin biology texts.In

ProceedingsoftheNaturalLanguagePacicRim

Symposium(NLPRS'2000),Novemb er 1999.

[22] C.Nobata,N.Collier, andJ.Tsujii.Comparison

b etweentaggedcorp ora forthenamedentitytask.In

ProceedingsoftheAssociationforComputational

Linguistics(ACL'2000)WorkshoponComparing

Corpora,HongKong,Octob er7th2000.

[23] T.Ohta,Y.Tateishi, N.Collier, C.Nobata,K.Ibushi,

andJ.Tsujii.Asemantically annotatedcorpusfrom

MEDLINEabstracts.InProceedingsofthe Tenth

WorkshoponGenomeInformatics.Universal

AcademyPress,Inc.,14{15Decemb er 1999.

[24] L.RabinerandB.Juang.Anintro duction tohidden

Markov mo dels.IEEE ASSPMagazine,pages4{16,

January1986.

[25] J.ReynarandA.Ratnaparkhi.Amaximum entropy

approachtoidentifying sentence b oundaries.In

Proceedingsofthe5thConferenceonApplicationsof

NaturalLanguageProcessingANLP,WashingtonDC,

pages16{19,1997.

[26] T.Rindesch,L.Hunter,andA.Aronson.Mining

molecularbinding terminology frombiomedical text.

InAmericanMedicalInformaticsAssociation

(AMIA)'99annualsymposium,WashingtonDC,USA,

1999.

[27] T.Rindesch,L.Tanab e,N.Weinstein, and

L.Hunter.EDGAR:Extractionofdrugs,genesand

relationsfromthebiomedical literature. InPacic

SymposiumonBio-informatics(PSB'2000),Hawai'i,

USA,January2000.

[28] S.Sekine, R.Grishman,andH.Shinnou.ADecision

TreeMetho dforFinding andClassifyingNamesin

JapaneseTexts.InProceedingsoftheSixthWorkshop

onVeryLargeCorpora,Montreal,Canada,August

(8)

[29] S.SekineandH.Isahara.IREX:Informationretrieval,

information extractioncontest.InInformation

ProcessingSocietyofJapanJointSIGFIandSIGNL

Workshop,UniversityofTokyo,Japan,

http://www.cs.nyu.edu/cs/projects/proteus/ire x,

Septemb er1998.IPSJ.

[30] K.Seymore,A.McCallum,andR.Rosenfeld.

LearninghiddenMarkovestructure forinformation

extraction. InProceedingsoftheAAAI'99Workshop

onMachineLearningforInformationExtraction,

Orlando,Florida,July19th1999.

[31] J.Thomas,D. Milward,C.Ouzounis, S.Pulman,and

M.Carroll.Automaticextraction ofprotein

interactionsfromscienticabstracts.InProceedingsof

thePacicSymposiumonBiocomputing'99(PSB'99),

Hawaii,USA,January4{91999.

[32] V.N.Vapnik.TheNatureofStatisticalLearning

Theory,2ndedition.Springer-Verlag,NewYork;ISBN

0387987800,1999.

[33] A.J.Viterbi.Errorb oundsforconvolutions co desand

anasymptotically optimumdeco dingalgorithm.IEEE

TransactionsonInformationTheory,

IT-13(2):260{269,1967.

[34] E.Vo orheesandD.Harman,editors.Proceedingsof

theEighthText REtrievalConference(TREC{8),

NationalInstitute ofStandardsandTechnology

Sp ecialPublication, Gaithersburg, Md.20899,USA,

Novemb er17{191999.

[35] TheXMLsp ecications aredeveloping veryrapidly.

Thelatestdescription canb efoundinXML1.0

Recommendationathttp://www.w3.org/xml/,2000.