Machine Learning for Information Extraction from XML marked-up text on the Semantic Web
Nigel Collier
National Institute of Informatics (NII) National Center of Sciences, 2-1-2 Hitotsubashi
Chiyoda-ku, Tokyo 101-8430, Japan
collier@nii.ac.jp
ABSTRACT
Thelastfewyearshaveseenanexplosion intheamountof
textb ecomingavailable on theWorld Wide Web asonline
communities of users in diverse domains emerge to share
do cuments and other digital resources. In this pap er we
explore the issue of how to provide a low-level informa-
tion extraction to ol based on hidden Markov mo dels that
can identify and classify terminology based on previously
marked-up examples. Suchato ol shouldprovide thebasis
foradomain p ortable information extraction system, that
when combined with search technology can help users to
access information more eectively within their do cument
collectionsthanto day'sinformation retrieval enginesalone.
Wepresentresultsofapplyingthemo delintwodiversedo-
mains: news andmolecular biology and discuss the mo del
andtermmarkupissuesthatthisinvestigationreveals.
1. INTRODUCTION
Thelastfewyearshave seenanexplosion intheamount
oftext b ecomingavailableon theWorldWide Web(Web).
Encouraged by thegrowthof theWebcommunity, textual
database providers have b een migrating their archives for
online access, adding to the available information. Online
communitiesofusershaveemergedtosharedo cuments,and
otherdigitalresourcesinmultipleanddiversedomains. The
issueofinformation access,i.e. howtondtheinformation
thatmeetsusers'requirements and present it in anunder-
standableformhasnow b ecomeamajorresearchissue. In
order to accomplish this we need to emp ower computers
with an `understanding' of the users' texts. One scenario
thatis b eingexplored, e.g. [34],istocombine information
retrieval (IR)with information extraction (IE). It is likely
thatacriticalcomp onent insuchasystemwillb eanability
tolearntoidentifyandclassify termsbased onexamplesof
previously markeduptext.
Forthispurp oseextensiblemarkuplanguage (XML)[35]
[14]seemstob e mostappropriateforsemanticannotation,
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission by the authors.
Semantic Web Workshop 2001 Hongkong, China Copyright by the authors.
notonlyforterminology,butalsofortasksthatrequire the
learning of relations b etween those terms. XML incorp o-
rates a numb er of p owerful features for describing object
semantics, suchasinheritance ofname spaces bychild ele-
mentsfromtheirparents. Namespaces canb enestedwith
the youngestancestorwithin anamespacedeclaration de-
terminingthecurrentscop e. XMLallowsustorepresentse-
manticsthroughp otentiallyunb oundedhyptertagsbutdo es
notbyitself attempttointerpretthemeaningofthelab els.
AtthelowestlevelofIE,asystemshouldb eabletoidentify
andclassify object,i.e. term,b oundariesandclassify them
according tothe semantic classes and ontologies describ ed
inthe trainingdo cumentsthroughmechanismssuchas the
usualdo cumenttyp edeclaration(DTD)andXMLSchemas
thatarenowb eingprop osed[18].
In our work we emphasize the need for IE to ols to b e
adaptable to dierent domains and languages rather than
as general-purp ose to ols dueto thedistinct semantic char-
acteristics ofeachdomain.
Inthe remainder of this pap er wepresent a metho d for
termidenticationandclassicationbasedonhiddenMarkov
mo dels (HMMs)[24]thatlearns fromannotatedtexts. We
describ eitsp erformanceintwodomains,newsandmolecular-
biologyanddiscusssomeofthetermmarkupissuesthatour
analysis revealed.
2. BACKGROUND
Informationextractionhasdevelop edinthelasttenyears
fromacollectionofad-ho cmetho dsintoadisciplin ethatfo-
cusesonasmallsetofwelldenedsubtasks. Thishaslargely
o ccurredduetotheinuenceoftheDARPA-sp onsoredMes-
sageUnderstandingConferences(MUCs)intheUSA[9][10]
andrecentlytheJapaneselanguageIREXconference[29].
Inthelast fewyearsworkhasb egun on adapting IEfor
the technical domain of molecular biology, e.g. [6][8] [16]
[26][27][31].AswithpreviousIEapproaches,thesesystems
canb eclassed as eitherpredominantly dictionary-bas ed or
learning-based. Itisourviewthatthehand-builtdictionary-
basedsystemscannotb eexp ectedtob eeasilyp ortedtonew
domainsandtheyignoreap otentiallyvaluablesourceofthe
domain exp ert'sknowledge,i.e. markeduptexts.
Recent studies into theuse of sup ervised learning-based
mo dels for the `named entity' task in b oth the news and
molecular-biol ogydomainhaveshownthatmo dels basedon
HMMs [1][5][30], maximum entropy [2] and decision trees
[28][21]are muchmoregeneralisabl e and adaptable tonew
classes of words than systems based on traditional hand-
overcoming the problems asso ciated with data sparseness
withthehelpofsophisticated smo othingalgorithms [3].
HMMs are one of the most widely metho ds in ML for
IE.Theycanb econsideredtob esto chasticnitestatema-
chines and have enjoyed success in a numb er of elds in-
cluding sp eechrecognition and part-of-sp eechtagging [17].
Ithas b een natural therefore that thesemo dels have b een
adapted for use in other word-class prediction tasks such
as thenamed-entity task in IE. Suchmo dels are based on
n-grams. Although the assumption that a word'spart-of-
sp eechor nameclass canb e predicted by theprevious n-1
words and their classes is counter-intuiti ve to our under-
standing of linguisti c structures and long distance dep en-
dencies,thissimplemetho ddo esseemtob ehighly eective
inpractice. Nymble[1],asystemwhichusesHMMsis one
ofthemostsuccessful suchsystems andtrains onacorpus
ofmarked-uptext,usingonlycharacterfeaturesinaddition
towordbigrams.
AlthoughitisstillearlydaysfortheuseofHMMsforIE,
wecanseeanumb er oftrendsintheresearch. Systemscan
b edivided intothose whichuseonestatep er classsuchas
Nymble (atthetoplevel oftheirbackomo del) andthose
whichautomatically learnab outthemo del'sstructuresuch
as [30]. Additionall y, there isa distinction to b e made in
thesourceofthe knowledge forestimating transitionprob-
abilities b etween mo dels which are built by hand such as
[12]andthosewhichlearnfromtaggedcorp orainthesame
domainsuchasthemo delpresentedinthispap er,wordlists
andcorp oraindierentdomains-so-calleddistantly-lab eled
data[30].
Despite their success, HMMs and other machine learn-
ingmetho ds canonly b e as successfulas thefeatures with
whichthey aretrained. Inthis study we havefo cussed on
developing a simple, yetp owerful setof features based on
orthographic knowledge, that canb ep ortable b etweendo-
mains. Wenowpresentan overviewofthetrainingcorp ora
weused. ThisisfollowedbytheresultsforaHMMnamed-
entity system fortwo diverse domains: news and molecu-
larbiology,usingorthographic, lexicalandclassfeaturesto
train the mo dels. We also discuss some of the problems
that our results revealed, particularly from lo cal syntactic
relations,duetothelo cal contextual viewthe mo delto ok.
3. CORPORA
Inourexp erimentsweusedabstractsinthemolecularbi-
ologydomainavailablefromPubMed'sMEDLINE[19]that
weremarkedupin XMLbyadomainexp ert[23]aswellas
asmallcollection ofnews textsusedinthe MUC-6confer-
ence[9]formalanddryruns. Itisworthnotingthatatthe
present time,nostandard testsetsexistforthenameden-
titytaskoutside ofnews(MUCandIREX),makingformal
systemcomparisons quitediÆcult. Thiswill clearly b e an
imp ortantfactorinthefuturedevelopmentofIEintechnical
domains.
An example of a marked-up sentence from a news text
canb eseeninFigure1. Wecanseeseveralinteresting fea-
turesofthedomainsuchasthefo cusofNEsonp eopleand
organization proles. Moreoverweseethattherearemany
pre-namecluewordssuchas\Ms."or\Rep."indicatin gthat
aRepublican p oliticia n'snameshouldfollow.
In contrast we can seean example of a marked-up sen-
tion wecanseethatmanyare combinations ofprop erand
commonnounswithcross-overofvo cabulary b etweenname
classes, e.g. the lemma `cell' b elongs to b oth PROTEIN,
SOURCE.ctandSOURCE.clterms.
Thesetsofnameclassesforthetwodomainsaregivenin
Tables1and2.
4. METHOD
Ourinitialapproachwasmotivatedbytheneedtoacquire
themajorityoftermsthatdonotinvolvecomplexstructural
analysis torecovertheir baseforms. Forthisweconsidered
that aHMM approachbased on rawtext stringsof words
and nodeep linguisti c analysis iswell suited. Later ween-
visage that more sophisticated markup and pre-pro cessing
metho ds will b e needed to handle term structure and we
hop etoincorp oratethesewithintheautomaticlearningap-
proachwehaveadoptedin ourwork.
Thepurp ose of our mo del is to nd the most likely se-
quence of name classes (C) fora given sequence of words
(W).Thesetofnameclassesincludesthe`Unk'nameclass
whichweuseforbackgroundwordsnotb elongingtoanyof
theinterestingnameclassesgivenforthetwodomains (see
Tables1and2) andthegivensequence ofwordswhichwe
use spans asingle sentence. Thetaskis therefore to max-
imize Pr (CjW). We implement a HMM to estimate this
using theMarkovassumption that Pr (CjW) canb e found
frombigramsofnameclasses.
Inthe following mo del we consider words to b eordered
pairs consisting ofasurfaceword,W, andawordfeature,
F, given as < W;F >. The word features themselves are
discussedin Section4.2.
Asiscommonpractice,weneedtocalculatetheprobabil-
itiesforawordsequencefortherstword'snameclassand
everyother word dierently since wehave no initial name-
class to make a transition from. Accordingly we use the
followingequationtocalculatetheinitialnameclass proba-
bility,
Pr (Ctj<Wfir st;Ffir st>)=
0 f(C
fir st j<W
fir st
;F
fir st
>)+
1 f(C
fir st j< ;F
fir st
>)+
2 f(C
fir st
) (1)
andforallotherwordsandtheir nameclassesasfollows:
Pr (Ctj<Wt;Ft>;<Wt 1;Ft
1
>;Ct 1)=
0f(Ctj<Wt;Ft>;<Wt 1;Ft
1
>;Ct 1)+
1 f(C
t j< ;F
t
>;<W
t 1
;F
t 1
>;C
t 1 )+
2 f(C
t j<W
t
;F
t
>;< ;F
t 1
>;C
t 1 )+
3 f(C
t j< ;F
t
>;< ;F
t 1
>;C
t 1 )+
4f(CtjCt 1)+
5f(Ct) (2)
where f(j) is calculated with maximum-likeli hoo d esti-
TYPE="PERSON">Washington</ENAMEX> worked as a laywer for the corp orate nance division of the <ENAMEX
TYPE="ORGANIZATION">SEC< /ENAMEX > in the late <TIMEX TYPE="DATE">1970s</TIMEX>. She
has b een a congressional staer since <TIMEX TYPE= "DATE">1979</TIMEX>. Separately, <ENAMEX
TYPE="PERSON">Clinton</ENAMEX> transition oÆcials said that <ENAMEX TYPE="PERSON">Frank
Newman</ENAMEX>, 50, vice chairman and chief nancial oÆcer of <ENAMEX TYPE="ORGANIZATION">BankAmerica
Corp.</ENAMEX>, is exp ected to b e nominated as assistant <ENAMEX TYPE="ORGANIZATION">Treasury</ENAMEX>
secretary for domestic nance. Mr. <ENAMEX TYPE="PERSON">Newman</ENAMEX>, who would b e giving
up a job that pays <ENAMEX TYPE="MONEY">$1 million</ENAMEX> a year, would oversee the <ENAMEX
TYPE="ORGANIZATION">Treasury</ENAMEX>'s auctions of government securities as well as banking issues. He
would rep ort directly to <ENAMEX TYPE="ORGANIZATION">Treasury</ENAMEX> Secretary-designate <ENAMEX
TYPE="PERSON">Lloyd Bentsen</ENAMEX>. Mr. <ENAMEX TYPE="PERSON">Bentsen</ENAMEX>, who headed
the <ENAMEX TYPE="ORGANIZATION">Senate Finance Committee</ENAMEX> for the past six years, also is exp ected
to nominate <ENAMEX TYPE="PERSON"> Samuel Sessions</ENAMEX>, the committee's chief tax counsel, to one of
the top tax jobs at <ENAMEX TYPE="ORGANIZATION">Treasury</ENAMEX>. As early as to day, the <ENAMEX
TYPE="PERSON">Clinton</ENAMEX>campisexp ectedtonameveundersecretariesofstateandseveralassistantsecretaries.
Figure1: Examplesentences taken fromthe annotatedMUC-6 NEcorpus
TI-Activationof<PROTEIN> JAKkinases</PROTEIN>and<PROTEIN>STAT proteins</PROTEIN> by<PROTEIN> in-
terleukin-2 </PROTEIN> and<PROTEIN>interferonalpha</PROTEIN>,but notthe<PROTEIN> Tcellantigenreceptor
</PROTEIN>,in<SOURCE.ct>humanTlympho cytes</SOURCE.ct>.
AB - The activation of <PROTEIN> Janus protein tyrosine kinases </PROTEIN> ( <PROTEIN> JAKs </PROTEIN>
) and <PROTEIN> signal transducer and activator of transcription </PROTEIN> ( <PROTEIN> STAT </PROTEIN> )
proteins by <PROTEIN> interleukin ( IL ) - 2 </PROTEIN> , the <PROTEIN> T cell antigen receptor </PROTEIN>
( <PROTEIN> TCR </PROTEIN> ) and <PROTEIN> interferon ( IFN ) alpha </PROTEIN> was explored in
<SOURCE.ct> human p eripheral blo o d - derived T cells </SOURCE.ct> and the <SOURCE.cl> leukemic T cell line
Kit225 </SOURCE.cl> . An <PROTEIN>IL-2</PROTEIN>-induced increase in <PROTEIN>JAK1</PROTEIN> and
<PROTEIN>JAK3</PROT EIN>, butnot<PROTEIN>JAK2</PROTEIN> or<PROTEIN>Tyk2</PROTEIN>, tyrosinephos-
phorylationwasobserved. Incontrast,noinductionof tyrosinephosphorylationof <PROTEIN>JAKs</PROT EIN> wasdetected
up on stimulationof the <PROTEIN>TCR</PRO TEIN> . <PROTEIN>IFN alpha</PROTEIN> inducedthe tyrosine phospho-
rylation of <PROTEIN>JAK1</PROT EIN> and <PROTEIN>Tyk2</PROTEIN>, but not <PROTEIN>JAK2</PROT EIN>
or <PROTEIN>JAK3</PROTEIN> . <PROTEIN>IFN alpha</PROTEIN> activated <PROTEIN>STAT1</PROT EIN> ,
<PROTEIN>STAT2</PROT EIN> and<PROTEIN>STAT3</PROTEIN> in<SOURCE.ct>T cells</SOURCE.ct>, butno de-
tectableactivationofthese<PROTEIN>STATs</PROT EIN> wasinducedby<PROTEIN>IL-2</PROTEIN>.
Figure2: ExampleMEDLINEsentencetakenfromthe XMLannotated molecular-biologyNEcorpus
Class # Example Description
PROTEIN 2125 JAKkinase proteins,proteingroups,
families,complexesandsubstructures.
DNA 358 IL-2promoter DNAs,DNAgroups,regionsandgenes
RNA 30 TAR RNAs,RNAgroups,regionsandgenes
SOURCE.cl 93 leukemicTcelllineKit225 cellline
SOURCE.ct 417 human Tlymphocytes celltyp e
SOURCE.mo 21 Schizosaccharomyce spombe mono-organism
SOURCE.mu 64 mice multi-organism
SOURCE.vi 90 HIV-1 viruses
SOURCE.sl 77 membrane sublo cation
SOURCE.ti 37 centralnervoussystem tissue
UNK - tyrosinephosphorylation backgroundwords
Table 1: Namedentity classesfor themolecularbiologydomain. # indicatesthenumb eroftaggedtermsin
thecorpusof100 abstracts.
Class # Example Description
ORGANISATION 1783 HarvardLawSchool namesoforganisations
PERSON 838 Washington namesofp eople
LOCATION 390 Houston namesofplaces,countriesetc.
DATE 542 1970s dateexpressions
TIME 3 midnight timeexpressions
MONEY 423 $10 million moneyexpressions
PERCENT 108 2.5% p ercentageexpressions
UNK - start-upcosts backgroundwords
Table 2: Namedentity classes for the news domain. # indicatesthe numb eroftagged terms inthe corpus
f(C
t j<W
t
;F
t
>;<W
t 1
;F
t 1
>;C
t 1 )
:
=
T(<Wt;Ft>;Ct;<Wt 1;Ft
1
>;Ct 1)
T(<Wt;Ft>;<Wt 1;Ft
1
>;Ct
1 )
(3)
WhereT()hasb eenfoundfromcountingtheeventsinthe
trainingcorpus. Inourcurrent systemwesettheconstants
i and i by hand andlet P
i =1:0, P
i =1:0, 0
1
2 ,
0
1 :::
5
. The current name-class C
t is
conditioned on the current wordand feature,the previous
name-class,Ct
1
,andpreviouswordandfeature.
Inourcurrent systemwesetthe constantsi and i by
handbutclearlyab etterwaywouldb etodothisautomat-
ically. An obvious strategy to use would b e to use some
iterative learning metho d such as Exp ectation Maximiza-
tion [11]. Wealso imp ose the restriction that P
i
=1:0,
P
i
=1:0,
0
1
2 ,
0
1 :::
5
. Thecurrent
name-class Ct is conditioned on the current word and fea-
ture,thepreviousname-class,C
t 1
,andpreviouswordand
feature.
Equations1and2implementalinear-interpolatingHMM
thatincorp oratesanumb erofsub-mo delsdesignedtoreduce
theeects ofdatasparsenessandimprovegeneralisabi l ity.
Once the state transition probabiliti es have b een calcu-
latedaccordingtoEquations1and2,theViterbialgorithm
[33]isusedtosearchthestate spaceofp ossiblenameclass
assignments in linear time to nd the highest probability
path,i.e. tomaximise Pr (W;C).
Thenalstageofouralgorithmthat isusedaftername-
classtaggingiscompleteistouseaclean-upmo dulecalled
Unity. This creates a frequency list of words and name-
classesandthenre-tagsthe textusingthe mostfrequently
usednameclass assigned by theHMM.Wehave generally
foundthatthisimprovesF-scorep erformancebyb etween2
and4%,b othforre-taggingspuriouslytaggedwordsandfor
ndinguntaggedwordsin unknowncontextsthathadb een
correctlytaggedelsewhereinthetext.
4.1 Tokenization
Before featurising we p erform two pre-pro cessing tasks:
sentenceb oundaryidenticatio nandwordtokenization. Th-
esepro ceed accordingtoquite simplealgorithms thathave
nevertheless proven to b e adequately eective. Sentence
b oundaryidenticati on simplytreatsfullstops`.' asendof
sentence exceptforafewsp ecialcasessuchasabbreviation
markinganddecimalp ointsthatarehandledwithheuristic
rules. Although this metho d is farless sophisticated than
otherssuchas[25],wefoundthatitp erformedwellinprac-
tice,althoughwewouldexp ecttoencountermoresignicant
problemsinengineering andtechnical domains.
Tokenization treatsa continuous string of letters and/or
numerals as a word, converts multiple space sequences to
singlespaces, removesnon-printablecharactersexceptend-
of-line,andtreatspunctuation(includin ghyphen)asasep-
arate`word'. Pro cessing isdone sentence by sentence. All
wordsareassignedafeatureco dedep endingontheirortho-
graphicform.
4.2 Orthographic features
InordertogeneralisetheHMMs'knowledgeab outsurface
formsitisnecessarytofeaturisethevo cabularyinsomeway.
Onanalysinglistsoftermswefeltthatorthographicfeatures
shouldb enoted thattheexamples donotshowthe
full form of terms, but simply examples of `words'
thatmake upthetermtogetherwiththeirsemantic
classication.
Featureco de Examples
TwoDigitNumb er [25]per cent
FourDigitNumb er [2000]date
DigitNumb er [2]per cent[3]D NA
SingleCap [I]pr otein[B]pr otein
[T]
sour ce:ct
GreekLetter [alpha]
pr otein
CapsAndDigits [2A]D N
A
[BW5147]
sour ce:cl
[CD4+]
sour ce:ct
TwoCaps [RelB]pr otein [TAR]RNA
[HMG]D NA[NF]D NA
[NF]
pr otein
LettersAndDigits [p50]pr otein [Kit225]sour ce:cl
InitCap [Interleukin]pr otein
[Washington]
per son
LowCaps [kappaB]
pr otein
[mRNA]RNA
Lowercase [cytoplasmic]
sour ce:sl
[tax]
pr otein
Determiner the
Conjunction and
FullStop .
Comma ,
Hyphen [-]
pr otein [-]
D NA
Colon :
SemiColon ;
Op enParen (
CloseParen )
CloseSquare ]
Op enSquare [
Percent %
Other *+#
Backslash [/]pr otein
particularly inmolecularbiology.
Table3shows thecharacterfeatures thatweusedin the
HMM.Ourintuitionisthatsuchfeatureswillhelpthemo del
to nd similarities b etween known words that were found
in the training set and unknown words and so overcome
theunknownwordproblem. Eachwordisdeterministicall y
assignedasinglefeature,givingmatchingfeaturesnearerto
thetopofthetablepriority overthose lowerdown.
5. EXPERIMENT AND RESULTS
We ran exp eriments on the 60 news texts using 6-fold
cross-validation(50training,10testing)andthe100molec-
ular biology textsusing5-foldcross-validation (80training,
20testing). Wethencalculatedthescoresasan average of
theF-scoresforeachmarked-upclass category.
TheresultsaregivenasF-scores,acommonmeasurement
foraccuracy in theMUC conferencesthat isthe harmonic
mean ofrecall andprecision. Theseare calculated using a
HMMwithUnity 78.4 75.0
HMMw/oUnity 74.2 73.1
Table 4: Named entity acquisition results for the
MUC-6 and molecular biology domains calculated
by n-foldcross validation.
F scor e=
2Pr ecisionRecal l
Pr ecision+Recal l
(4)
Theresultsaresummarisedforall classesineachdomain
inTable4andshowp erformancewithandwithouttheUnity
mo dule. Despite the small numb er of training texts used,
the system could achieve reasonably high p erformance for
b othdomains. Although results indicatethat p erformance
fornewstextsisslightly b etterthan forbiology,thisneeds
conrmingwithalargertestcollectiontoobtaincondence
in the conclusion. The result also highlights the need for
soundly motivated metrics (e.g. see [22]) to compare the
diÆcultiesofthenamedentitytaskb etweenmarked-upcor-
p ora in dierent domains. In the following discussion we
provide failureanalysis oftheresults.
6. ANALYSIS
Inthissectionweconcentrateourdiscussionontheanaly-
sisofresultsfromthemolecular-bio log y corpus. Discussion
of term markup issues for the MUC news domain is well
do cumented,e.g. in[20].
During analysis of the corpus we found that a numb er
of syntactic phenomena caused p otential complications to
identication of term b oundaries and their classication .
Broadly sp eaking the major ones can b e divided into co-
ordination,app ositionandabbreviation, althoughthereare
manyother issues thatwecannot coverhere suchasuseof
negativesintermnames suchas`non-T-cells' andtheneed
toinfersometerm'sclassesfromknowledgecontainedwithin
adomainmo del. Thethreemajorissues arenowdiscussed
b elow.
6.1 Coordination
We observed that failure o ccurred where complex lo cal
structures could not b e resolved by the shallow-level con-
textual viewofthe HMM.Co ordination, as appliedto the
app earanceoftermsinmolecularbiologytextsapp earsquite
frequently through the use of the co ordinators and, slash
(`/'),andhyphen(`-'). Hyphenationisparticularly trouble-
someforourimplementation of theHMMasit isalso one
oftheorthographicfeaturesusedinmanyprotein andgene
names.
Inthefollowing examples,op enandclosetagshaveb een
representedas`['and`]'resp ectivelyforbrevityandthetag
nameapp earsasasubscript.
Training examples (1),(2)and (3)allshowthe need for
structural analysis to take place to transform(at least in-
ternally within the IE software) theterm back to its base
form. Comparing (1)and (2)weseethat `/'is sometimes
usedaspartofthetermandsometimesasaco ordinator.
Ex. 1. ...like the [c-rel]pr otein and [v-rel]pr otein (proto)-
oncogenes.
Ex. 2. ...regulated by members of the [rel/NF-kappa B
family]
pr otein
Ex. 3. ...involvesphosphorylationofseveralmembersof
the[NF-kappaB]
pr otein
/[IkappaBprotein]
pr otein
families.
Examples(2)and(3)resultedinconicting patterns,i.e.
sometimes`/'shouldactasaco ordinatorandsometimesas
partofthetermitselfascanb eseenintheHMMoutputin
(4)and(5).
Ex. 4. ...regulated by members of the rel/[NF-kappa B
family]pr otein
Ex. 5. ...involvesphosphorylationofseveralmembersof
the[NF-kappaB/IkappaB]proteinfamilies.
Training examples (6)and (7)show morecomplex cases
wheretheannotationschemehasnotallowedthedomainex-
p erttofullyexpressherintuitionab outtheclassesofterms,
particularly where thereasequence ofconjoinedmo diers.
In (6) we see a case of elision where the exp ert was not
happy to markup`[c-]pr otein'as atermwithout b eing able
toshowtheattachmentto`-rel'andwasalsouneasyab out
marking up `c-and v-rel' as the term. In(7) we see that
althoughtheheadnoun\regions",whichdominatesthelist
shouldimp oseaproteincategoryon`TATA',without away
ofmarkingupthisrelation,theexp ertpreferstotag`TATA'
asDNAaccording totheclass ofthebasicterm underdis-
cussion. Cases such as these indicate the need for richer
markupmetho ds.
Ex. 6. ThisproteinreducesorabolishesinvitrotheDNA
bindingactivityofwild-typeproteins ofthesamefamily (-
[KBF1]
pr otein /[p50]
pr otein
,c-and[v-rel]
pr otein ).
Ex. 7. .. indicated that multipleregulatory regions in-
cludingtheenhancer,[SP1]pr otein,[TATA]D NA and[TAR]
pr otein
regionswereimportantfor[HIV]
sour ce:v i
geneexpres-
sion.
Itisinterestingtonotethatannotatorsmarkingupmulti-
nameexpressionsintheMUC-6corpusfacedasimilardilemma
asnoted in[20].Forexample\<ENAMEXTYPE=
\LOCATION">North</ENAMEX>and<ENAMEXTYPE-
=\LOCATION">South America</ENAMEX>. In their
casetheyincludetheheadwordwiththenalmo dierasa
namedentitybutmakeno explicit link toits relation with
earlier mo diers.
6.2 Apposition
Toquote from[15]: app ositions are \Twoor morenoun
phrases are in app ositionwhen they haveidentity ofrefer-
ence". Intheexampleswegiveb elow,thereferentprovides
usefulinformation ab outtheapp osition phrases' class that
requires thisrelation to b erecognised. Forexample in (9),
\transcription factor" provides the information that \NF-
Kappa B" is a typ e of protein, and in (8), \RelB-p52" is
a\Rel-NF-kappa Bcomplex". i.e. theapp ositions provide
attribute information ab out theterm. This is particularly
useful in higher level IE tasks that require us to combine
attributesforaparticularterm. Thechallengep osedbyap-
p ositionsisoftentoknowwheretostartandendtheterm's
b oundaries, particularly where no punctuation is used to
kappaB]
pr otein
complexes,[RelB]
pr otein -[p52]
pr otein canup-
regulatethesynthesisof[IkappaBalpha]pr otein.
Ex. 9. The transcriptionfactor[NF-Kappa B]
pr otein is
storedinthe[cytoplasm]sour ce:sl...
App osition by itself in example (8) did not seem to b e
the cause of the problem, but rather the conjunction im-
pliedbythehyphenmakesitunclearwheretobreakthese-
quence\RelB-p52" asseenin (10). Interestingly theHMM
managedtocorrectly ndthebreakintheearlier sequence
\Rel-NF-kappaB".
Ex. 10. However,similarlytotheother[Rel]
pr otein -[NF-
kappaB]pr otein complexes,[RelB-p52]pr oteincanupregulate
thesynthesisof[IkappaBalpha]
pr otein .
Theapp ositionofexample(9)alsop osednodiÆculty for
theHMMasshownin(11).
Ex. 11. Thetranscriptionfactor[NF-KappaB]pr oteinis
storedinthe[cytoplasm]sour ce:sl...
6.3 Abbreviation
Ofmoreimmediateconcerntousarecaseswhereabbrevi-
ationso ccursinsidethetermitselfasshowninthetraining
example (12). Here we require deep er analysis than can
b eobtained through the lo cal contextual viewusedbythe
HMMthatwehavepresented.
Ex. 12. The[interleukin-2(IL-2)promoter]D NAconsists
ofseveralindependent[T cellreceptor(TcR)responsiveel-
ements]D NA.
The result from the HMM forexample (12) is given in
(13).
Ex. 13. The[interleukin-2]
pr otein ([IL-2]
pr otein
)promoter
consistsofseveralindependent[Tcellreceptor]
pr otein ([TcR]
pr otein)responsiveelements.
It is imp ortant to rememb er that the HMM lo oks for
the most likely sequence of classes that corresp ond to the
word sequence and that, for example, the word sequence
for\interleukin-2" isfarmorelikely tob eaprotein thana
DNA,giventhatabbreviationsusuallyo ccuraftertheterm
hasnished andaremostlymarkedupasseparatetermsin
theirownright. Here thoughwehave an exceptionand it
requiresquitesophisticated pro cessingtorecognisetheem-
b edded abbreviation do esnot formpart ofthe term itself,
andthat the head ofthe rstterm is\promoter". A sim-
ilar case canb e found in the second term `Tcell receptor
resp onsiveelements',wheretheabbreviationshouldalsob e
considered asaseparateterm. ThediÆcultycanb etraced
backtolimitationsinourmarkupschemewhichdidnotal-
low the domain exp ert to express her intuition ab out the
term'sstructure with either nested or cross-overtags. Al-
thoughnestingoftagsisallowedinXML,cross-overoftags
requiresawork-around.
7. DISCUSSION
Many ofthe caseswhere themo del failed to recognisea
(HMM)and ofthemark-upscheme. Althoughnon-nesting
ofmarked-up termsallows ustodevelop rathersimple ML
mo dels, itdo es notreecttheapp earance oftermsasthey
arewrittenbydomainexp erts. XMLitself do esnotimp ose
anestingrestrictionon usbutdo esinsistthatthereshould
b enocross-overofnamespaces. Althoughthiscanb edealt
withbyworkarounds itisnotideal.
Despite a limited context window used in analysis, the
HMMp erformedquite well, showing thatnite statetech-
niques cangivego o dresultsdespiteshallow linguistic anal-
ysis. Unliketraditionaldictionary-based termidentication
metho dsusedinIE,themetho dwehaveshownhasthead-
vantageofb eingp ortableandnohand-made patternswere
used. Thestudy also indicated thatmore training data is
b etterandthatwehavenotyetreachedap eakinthelevel
of p erformance using the small training set that we have
available.
Forourfutureworkweprop osetouseshallowdep endency
analysis to`normalise' theterm, i.e. to disambiguate lo cal
syntactic structures. Theoutput of thiswill b e emb edded
asXMLmarkupintothe do cument forthelearningby the
MLcomp onent ofthenamedentity mo dule.
8. CONCLUSION
In this pap er we have presented a scenario for the ex-
ploitation ofterminologycontainedwithinXMLmarked-up
do cumentsthatarepro ducedbyonline domain-based com-
munities forautomatically annotatinguntaggedtextsbased
on previously seen examples. Although we have fo cussed
largelyonthelowestlevelofIE,i.e.identicati on andclas-
sication of terms, the idea can b e extended to learning
higher-level information structuresforuse invariousinfor-
mation access to ols as well as long distance dep endencies
suchasanaphorathatarenecessarytoestablishequivalence
relationsb etweenterms,theirabbreviation sandreferencing
pronouns.
AlimitationoftheHMMapproachisthatitcannoteasily
mo dellargefeaturesetsduetofragmentationoftheproba-
bilitydistributio n. Inourfutureworkwewouldliketolo ok
at combining orthographic knowledge with other typ es of
lexical information as well as contextual clues fromgram-
matical dep endency analysis. Itis likely thatdierent ML
techniques will b e required that canmore eectively com-
binetheknowledgefromlarge,p ossiblysparse,featuresets.
TheuseofSupp ortVectorMachines(SVMs)[32][7],which
havesuccessfullyb eenappliedinotherresearchareas,seems
onefruitful directionthatweare currently exploring.
Acknowledgements
I am grateful to a numb er of my colleagues at the Tsujii
lab oratory, University ofTokyo,where mostof theexp eri-
mentsforthispap erwerecarriedout.Inparticular Iwould
like toexpress mygratitudetoSang-Zo o LeeandChikashi
Nobata (CRL) fortheir helpful comments concerning ma-
chine learning metho ds, Yuka Tateishi and Tomoko Ohta
for their eorts to pro duce the tagged corpus used in the
molecular-biol ogy exp eriments and also to Professor Jun-
ichiTsujiiforhissupp ortinthis work. Iwouldalsolike to
thanktheanonymousreviewersfortheir helpful comments
9. REFERENCES
[1] D.Bikel,S.Miller,R.Schwartz,andR.Wesichedel.
Nymble: ahigh-p erformance learningname-nder.In
ProceedingsoftheFifthConfererenceonApplied
NaturalLanguageProcessing,pages194{201,1997.
[2] A.Borthwick,J.Sterling, E.Agichtein,and
R.Grishman.Exploiting diverseknowledge sources
viamaximumentropy innamedentity recognition.In
ProceedingsoftheWorkshoponVeryLargeCorpora
(WVLC'98),1998.
[3] S.ChenandJ.Go o dman.Anempirical studyof
smo othingtechnniques forlanguagemo deling.34st
AnnualMeetingoftheAssociationofComputational
Linguistics,California,USA,24{27June 1996.
[4] N.Chinchor.MUC-5evaluation metrics.InIn
ProceedingsoftheFifthMessageUnderstanding
Conference(MUC-5), Baltimore,Maryland,USA.,
pages69{78,1995.
[5] N.Collier,C.Nobata,andJ.Tsujii.Extractingthe
namesofgenesandgenepro ducts withahidden
Markovmo del.InProceedingsofthe18th
InternationalConferenceonComputational
Linguistics(COLING'2000),Saarbrucken,Germany,
July31st{August4th2000.
[6] N.Collier,H.Park,N.Ogata,Y.Tateishi, C.Nobata,
T.Ohta,T.Sekimizu,H.Imai,andJ.Tsujii.The
GENIAproject: corpus-basedknowledgeacquisition
andinformation extractionfromgenomeresearch
pap ers.InProceedingsoftheAnnualMeetingofthe
EuropeanchapteroftheAssociationfor
ComputationalLinguistics(EACL'99),June 1999.
[7] C.CortesandV.Vapnik.Supp ort-vectornetworks.
MachineLearning,20:273{297, Novemb er1995.
[8] M.CravenandJ.Kumlien.Constructingbiologica l
knowledge basesbyextracting information fromtext
sources.InProceedingsofthe7thInternational
ConferenceonIntelligentSystempsforMolecular
Biology(ISMB-99),Heidelburg,Germany,August
6{101999.
[9] DARPA.ProceedingsoftheSixthMessage
UnderstandingConference(MUC-6),Columbia,MD,
USA,Novemb er1995.MorganKaufmann.
[10] DARPA.ProceedingsoftheSeventhMessage
UnderstandingConference(MUC-7),Fairfax,VA,
USA,May1998.
[11] A.Dempster,N.Laird,andD.Rubins.Maximum
likeliho o d fromincomplete dataviatheEM
algorithm.JournaloftheRoyalStatisticalSociety(B),
39:1{38,1977.
[12] D.FreitagandA.McCallum.Informationextraction
withHMMsandshrinkage.InProceedingsofthe
AAAI'99WorkshoponMachineLearningfor
InformationExtraction,Orlando,Florida,July19th
1999.
[13] K.Fukuda,T.Tsuno da,A.Tamura,andT.Takagi.
Towardinformation extraction: identifyingprotein
namesfrombiological pap ers.InProceedingsofthe
PacicSymposiumonBiocomputing'98(PSB'98),
January1998.
[14] I.GrahamandL.Quin.XML SpecicationGuide.
JohnWileyandSons: NewYork,1999;ISBN
[15] S.Greenbaum andR.Quirk.AStudent'sGrammar of
theEnglishLanguage.Longman,Essex,England,
1990.
[16] K.Humphreys,G.Demetriou,andR.Gaizauskas.
Twoapplicationsofinformation extraction to
biologica l sciencejournalarticles: Enzymeinteractions
andprotein structures.InProceedingsofthePacic
Rim SymposiumonBio-Computing2000(PSB'2000),
January2000.
[17] J.Kupiec.Robustpart-of-sp eechtaggingusinga
hidden markovmo del.ComputerSpeechand
Language,6:225{242,1992.
[18] A.Layman,E.Jung,E.Maler,H.S.Thompson,
J.Paoli,J.Tigue, N.H.Mikura,andS.DeRose.
Xml-data,w3cnote05jan1998.TheXML-Data
sp ecication s arestilldeveloping . Thelatest
description canb efoundat
http://www.w3.org/TR/1998/NOTE-XML-data-
0105/,January
1998.
[19] MEDLINE.ThePubMeddatabasecanb efoundat:,
1999. http://www.ncbi.nlm.nih.gov/PubM ed/.
[20] NewYorkUniversity.NamedEntityTask Denition,
Version2.0,Thisdo cumentcanb efoundonlineat
http://cs.nyu.edu/cs/ faculty/grishman/
NEtask20.b o ok 1.html, May31st1995.
[21] C.Nobata,N.Collier, andJ.Tsujii.Automaticterm
identicati on andclassicationin biology texts.In
ProceedingsoftheNaturalLanguagePacicRim
Symposium(NLPRS'2000),Novemb er 1999.
[22] C.Nobata,N.Collier, andJ.Tsujii.Comparison
b etweentaggedcorp ora forthenamedentitytask.In
ProceedingsoftheAssociationforComputational
Linguistics(ACL'2000)WorkshoponComparing
Corpora,HongKong,Octob er7th2000.
[23] T.Ohta,Y.Tateishi, N.Collier, C.Nobata,K.Ibushi,
andJ.Tsujii.Asemantically annotatedcorpusfrom
MEDLINEabstracts.InProceedingsofthe Tenth
WorkshoponGenomeInformatics.Universal
AcademyPress,Inc.,14{15Decemb er 1999.
[24] L.RabinerandB.Juang.Anintro duction tohidden
Markov mo dels.IEEE ASSPMagazine,pages4{16,
January1986.
[25] J.ReynarandA.Ratnaparkhi.Amaximum entropy
approachtoidentifying sentence b oundaries.In
Proceedingsofthe5thConferenceonApplicationsof
NaturalLanguageProcessingANLP,WashingtonDC,
pages16{19,1997.
[26] T.Rindesch,L.Hunter,andA.Aronson.Mining
molecularbinding terminology frombiomedical text.
InAmericanMedicalInformaticsAssociation
(AMIA)'99annualsymposium,WashingtonDC,USA,
1999.
[27] T.Rindesch,L.Tanab e,N.Weinstein, and
L.Hunter.EDGAR:Extractionofdrugs,genesand
relationsfromthebiomedical literature. InPacic
SymposiumonBio-informatics(PSB'2000),Hawai'i,
USA,January2000.
[28] S.Sekine, R.Grishman,andH.Shinnou.ADecision
TreeMetho dforFinding andClassifyingNamesin
JapaneseTexts.InProceedingsoftheSixthWorkshop
onVeryLargeCorpora,Montreal,Canada,August
[29] S.SekineandH.Isahara.IREX:Informationretrieval,
information extractioncontest.InInformation
ProcessingSocietyofJapanJointSIGFIandSIGNL
Workshop,UniversityofTokyo,Japan,
http://www.cs.nyu.edu/cs/projects/proteus/ire x,
Septemb er1998.IPSJ.
[30] K.Seymore,A.McCallum,andR.Rosenfeld.
LearninghiddenMarkovestructure forinformation
extraction. InProceedingsoftheAAAI'99Workshop
onMachineLearningforInformationExtraction,
Orlando,Florida,July19th1999.
[31] J.Thomas,D. Milward,C.Ouzounis, S.Pulman,and
M.Carroll.Automaticextraction ofprotein
interactionsfromscienticabstracts.InProceedingsof
thePacicSymposiumonBiocomputing'99(PSB'99),
Hawaii,USA,January4{91999.
[32] V.N.Vapnik.TheNatureofStatisticalLearning
Theory,2ndedition.Springer-Verlag,NewYork;ISBN
0387987800,1999.
[33] A.J.Viterbi.Errorb oundsforconvolutions co desand
anasymptotically optimumdeco dingalgorithm.IEEE
TransactionsonInformationTheory,
IT-13(2):260{269,1967.
[34] E.Vo orheesandD.Harman,editors.Proceedingsof
theEighthText REtrievalConference(TREC{8),
NationalInstitute ofStandardsandTechnology
Sp ecialPublication, Gaithersburg, Md.20899,USA,
Novemb er17{191999.
[35] TheXMLsp ecications aredeveloping veryrapidly.
Thelatestdescription canb efoundinXML1.0
Recommendationathttp://www.w3.org/xml/,2000.