HAL Id: hal-00162114
https://hal.archives-ouvertes.fr/hal-00162114
Submitted on 12 Jul 2007
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
A Bayesian approach combining surface clues and linguistic knowledge: Application to the anaphora
resolution problem
Davy Weissenbacher, Adeline Nazarenko
To cite this version:
Davy Weissenbacher, Adeline Nazarenko. A Bayesian approach combining surface clues and linguistic
knowledge: Application to the anaphora resolution problem. Recent Advances in Natural Language
Processing, Sep 2007, Borovets, Bulgaria. 7 p. (electronic edition). ⟨hal-00162114⟩
Université Paris-Nord - Laboratoire d'Informatique de Paris-Nord.
99 av. J-B. Clément 93430 Villetaneuse, FRANCE
dw@lipn.univ-paris13.fr, nazarenko@lipn.univ-paris13.fr
Abstract
In NLP, a traditional distinction opposes linguistically-based systems and knowledge-poor ones, which mainly rely on surface clues. Each approach has its drawbacks and its advantages. In this paper, we propose a new method which is based on Bayes Networks and makes it possible to combine both types of information. As a case study, we focus on the specific task of pronominal anaphora resolution, which is known to be a difficult NLP problem. We show that our Bayesian system performs better than state-of-the-art anaphora resolution systems.
Keywords
Bayesian Network, Anaphora Resolution, linguistic knowledge, surface clue
1 Introduction
Knowledge-based and knowledge-poor Natural Language Processing (NLP) systems are often opposed. The former exploit complex pieces of knowledge which may be built automatically or manually and which are therefore not always reliable or available. The latter rely on machine learning methods and take only surface clues into account. They give mixed results on complex NLP tasks.
This paper proposes an approach that overcomes that opposition. It relies on the Bayesian Network formalism, a probabilistic model designed for reasoning on dubious, partial and missing information, which is still little exploited in NLP.
This approach is tested on the resolution of the anaphoric pronoun it, which is a complex task involving different types of knowledge and for which there is a clear contrast between linguistically-based methods and methods based on surface clues. We designed a system that relies on a Bayesian Network for the classification of antecedent candidates and we compare its performance with that of a state-of-the-art system, MARS, proposed by R. Mitkov [10], which can be considered a knowledge-poor system.
The next section presents the opposition between rich and poor approaches in the case of anaphoric pronoun resolution. Section 3 describes the formalism of Bayesian Networks and its advantages for NLP, and presents our classifier for anaphora resolution. In Section 4, we compare its performance with that of several other systems. The last section discusses the results.
2 The opposition between linguistic knowledge and surface clues
Anaphora is a linguistic relation that holds between two textual units where one of them (the anaphor) cannot be interpreted on its own but refers to the other, which usually occurs before it (the antecedent). As the presence of anaphors significantly degrades the performance of NLP tasks such as information extraction or text synthesis, a lot of work has been devoted to the automatic resolution of these anaphoric relationships, i.e. the identification of the antecedents of anaphoric pronouns. In this paper, we focus on the pronoun it in English texts, which is a well-known and frequent type of anaphor.
2.1 The usefulness of surface clues
The traditional approach for anaphora resolution is composed of three steps: the distinction between anaphoric and impersonal occurrences of the pronoun (it is known that... vs. it produced...), the selection of antecedent candidates and the choice of the most plausible antecedent. For each of these steps, the first systems relied on complex linguistic knowledge that reflected the deep syntactic and semantic constraints of anaphoric relations. As these constraints seemed too complex to build automatically, the first systems relied on a set of manually designed rules, which required a thorough corpus analysis.
During the 1990's, several systems relying on surface clues were proposed to meet the need for robust and less expensive anaphora resolution methods [14]. These systems got rid of the complex linguistic rules of the first ones and tried to approximate them by simple clues that are presumably more reliable and easier to compute.
For instance, [7] modifies the RAP algorithm initially proposed by [8]. Considering that a deep syntactic analysis cannot be achieved with state-of-the-art parsers, the authors implement a relaxed version of that algorithm based on shallow parsing. They show that the performance of the new algorithm is comparable to that of the first one. Another example is given in [5], which proposes to approximate the semantic constraints by co-occurrence frequencies. The antecedent is supposed to belong to the same distributional subject or object class as the anaphoric pronoun, and the reported experiments show that these distributional constraints can partially stand in for deeper semantic ones.
2.2 The limits of surface clues
The surface clues proposed during the 1990's made it possible to build robust systems [10] but recent work has underlined their limits.
Since the predicate-argument schemata that improve candidate filtering [11] are seldom available, they have been approximated by co-occurrence frequencies [5]. However, [2] shows that these frequencies do not really enhance the performance of a system that is already based on morpho-syntactic knowledge. The contribution of frequencies seems to owe more to chance than to semantics.
Such a conclusion brings us back to the initial problem. Anaphora resolution involves complex syntactic and semantic knowledge that is not always available and which is often not fully reliable. Previous works have tried to replace linguistic knowledge with surface clues, which are easier to compute and therefore more reliable. However, these clues only partially reflect the linguistic constraints and may lead to erroneous decisions when solving ambiguous cases.
2.3 Enriching the surface clues with linguistic information
The MARS system [10] relies on surface clues to identify the most salient element in the discourse fragment preceding a pronoun occurrence. This salient element is considered as the most probable pronoun antecedent. The system relies on a part-of-speech tagging (POS tagging) of the text and applies some simple grammar rules in order to list the noun phrases (NPs) of the two sentences preceding a given pronoun occurrence and the NPs preceding the pronoun occurrence in the same sentence. For each NP associated to the pronoun occurrence, a set of constraints and preferences is applied. The constraints filter out the impersonal pronoun occurrences and the NPs that cannot be the antecedent. The preferences rank the remaining NP candidates. Each preference is associated with a score, either positive or negative, and the various scores of a candidate are summed up in a global score. The antecedent with the highest score is chosen. When two candidates end with the same score, additional heuristics are used to rank them¹.
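The score aggregation described above can be illustrated with a minimal Python sketch. It is not the MARS implementation: the preference names, the score values and the `position` field are hypothetical, and the tie-breaking is reduced to the last rule mentioned in the footnote (the most recent candidate wins).

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    text: str
    position: int  # token offset in the text; a larger value is closer to the pronoun
    scores: dict = field(default_factory=dict)  # preference name -> signed score


def global_score(candidate):
    """Sum the signed preference scores of one candidate."""
    return sum(candidate.scores.values())


def choose_antecedent(candidates):
    """Pick the highest global score; ties go to the most recent candidate."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: (global_score(c), c.position))


# Hypothetical candidates that already passed the constraint filters.
candidates = [
    Candidate("a gene", 3, {"first_np": 1, "indefinite": -1}),
    Candidate("the promoter", 9, {"prepositional": -1, "repeated": 2}),
    Candidate("the protein", 14, {"prepositional": -1, "repeated": 2}),
]
best = choose_antecedent(candidates)
print(best.text, global_score(best))
```

The last two candidates reach the same global score, so the sketch falls back on recency, which is only one of the heuristics a full system would apply.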
We propose a new system exploiting all the surface clues of MARS but also integrating the linguistic constraints that the surface clues approximate, whenever some linguistic knowledge is available. We argue that combining both types of information is beneficial. For instance, the verb subject is a salient element but, since the syntactic role analysis may be erroneous, it is useful to exploit in parallel the information relative to the NP location: the surface clue (the first NP of the sentence is very often the verb subject) corroborates the grammatical role hypothesis.
¹ The final ranking depends on the types of the preferences that have been used for each candidate, and the most recent candidate is chosen if nothing else applies.
Our system is modelled as a Bayesian Network. This type of representation has been designed to reason on dubious and incomplete knowledge. It offers a probabilistic approach that unifies deep linguistic constraints and surface clues in a single representation. This unification makes it possible to corroborate linguistic constraints with the surface properties observed in corpora and to correct the errors made by the systems based on surface clues.
3 A unified approach: the Bayesian model
3.1 Classification problems
Like many other NLP tasks, distinguishing anaphoric and impersonal pronoun occurrences and, more generally, solving anaphors can be considered as classification problems [3].
Let us consider for instance the choice of the antecedent among various candidates. Let Corpus be a set of texts belonging to the same domain, Training_Corpus and Test_Corpus two distinct subsets of Corpus, and $Pronouns$ and $NounPhrases$ the sets of the pronoun and NP occurrences of Corpus. Let $R$ be the set of potential anaphora relationships. Each relation $r_{i,j}$ is represented as a couple $(p_i, np_j)$ of $Pronouns \times NounPhrases$, where $np_j$ is considered as a candidate antecedent of the pronoun $p_i$². $Antecedent$ and $Not\_Antecedent$ are two complementary subclasses of $R$. $r_{i,j}$ belongs to the class $Antecedent$ if the candidate $np_j$ is the antecedent of the pronoun occurrence $p_i$. It belongs to the class $Not\_Antecedent$ if the candidate $np_j$ is not the antecedent or if the pronoun $p_i$ is impersonal. Any couple $r_{i,j}$ is described by a vector $(v_1, \ldots, v_a)$ of attributes whose values are defined in $\mathbb{R}$. Each attribute $v_k$ is selected on the basis of an analysis of Training_Corpus and corresponds to either a linguistic piece of knowledge or a surface clue.
The Bayes theorem states how to predict the best class for any new couple of candidate NP and pronoun occurrence of Test_Corpus on the basis of the regularities observed on the set of couples of Training_Corpus: select the class that maximises the probability
$$P(C|E) = \frac{P(E|C)\,P(C)}{P(E)}$$
where $C \in \{Antecedent, Not\_Antecedent\}$, $E$ is an example of Test_Corpus and $P(E|C)$ is the conditional probability of observing the attribute values of $E$ given that $E$ belongs to the class $C$. That probability is estimated on the basis of the training examples.
² Actually, only the NPs occurring in the two sentences preceding the pronoun occurrence or before it in the same sentence are considered as candidates.
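If the attributes are assumed to be conditionally independent given the class, $P(E|C)$ can be decomposed into $P(v_1|C) * \ldots * P(v_a|C)$ and the probability to maximise is
$$P(C|E) = \frac{P(C)}{P(E)} \prod_{j=1}^{a} P(v_j|C)$$
In that case, the classifier is a Naive Bayes Classifier (NBC)³.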
For any pronoun occurrence $p$ of Test_Corpus and for each couple to which it belongs, the Bayesian classifier computes the probability for that couple to belong to the class $Antecedent$. If the pronoun occurrence is anaphoric, the candidate with the highest probability is chosen as antecedent.
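As an illustration of the naive Bayes decision rule of this section, the following Python sketch estimates $P(C)$ and $P(v_j|C)$ from a handful of couples and ranks two candidates of one pronoun. The training couples, the attribute names and the add-one smoothing are assumptions made for the example; they are not taken from the system described here.

```python
from collections import Counter, defaultdict

CLASSES = ("Antecedent", "Not_Antecedent")


def train(examples):
    """Estimate P(C) and P(v_j | C) by relative frequency with add-one smoothing.

    `examples` is a list of (attribute_dict, class_label) pairs.
    """
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    values_seen = defaultdict(set)       # attribute -> observed values
    for attrs, label in examples:
        for name, value in attrs.items():
            value_counts[(name, label)][value] += 1
            values_seen[name].add(value)
    total = sum(class_counts.values())
    prior = {c: class_counts[c] / total for c in CLASSES}

    def likelihood(name, value, label):
        counts = value_counts[(name, label)]
        return (counts[value] + 1) / (class_counts[label] + len(values_seen[name]))

    return prior, likelihood


def posterior_antecedent(attrs, prior, likelihood):
    """P(Antecedent | E); the sum over both classes plays the role of P(E)."""
    joint = {}
    for c in CLASSES:
        p = prior[c]
        for name, value in attrs.items():
            p *= likelihood(name, value, c)
        joint[c] = p
    return joint["Antecedent"] / sum(joint.values())


# Hypothetical training couples (attribute values, class).
training = [
    ({"first_np": True, "number": "singular"}, "Antecedent"),
    ({"first_np": True, "number": "singular"}, "Antecedent"),
    ({"first_np": False, "number": "singular"}, "Not_Antecedent"),
    ({"first_np": False, "number": "plural"}, "Not_Antecedent"),
    ({"first_np": True, "number": "plural"}, "Not_Antecedent"),
    ({"first_np": False, "number": "singular"}, "Not_Antecedent"),
]
prior, likelihood = train(training)

# Two candidate couples for the same pronoun occurrence.
couples = {
    "np_1": {"first_np": True, "number": "singular"},
    "np_2": {"first_np": False, "number": "plural"},
}
scores = {np: posterior_antecedent(a, prior, likelihood) for np, a in couples.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Since $P(E)$ is the same for both classes of a given couple, normalising the two joint probabilities is enough to obtain the posterior used for ranking.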
3.2 Inferring from imperfect attributes
A Bayesian Network is a model designed for reasoning on dubious and incomplete attributes. It is composed of a qualitative description of the attribute dependencies, a directed acyclic graph, and of a quantitative description, a set of conditional probability tables, each random variable (RV) being associated to a graph node. A first parameterising step associates a priori conditional probability tables to each RV. The second inferring step modifies the RV values on the basis of corpus evidence (it updates the a priori probabilities into a posteriori ones). The observations made in the corpus are propagated through the network, which updates the a priori values even for some unobserved variables.
[Figure 1: a Bayesian Network with four nodes and their conditional probability tables: Candidate (Antecedent, NotAntecedent), Number_Filter (Singular, Plural), First_NP (First, NotFirst) and Subject_NP (Subject, Unknown, Complement). Candidate is linked to the three other nodes and First_NP is linked to Subject_NP.]
Fig. 1: Example of a Bayesian classifier represented by a Bayesian Network
Let us explain on a simplified example the inferring mechanism of the Bayesian Network represented on Figure 1. This network chooses the pronoun antecedent by ordering the various couples $(p_i, np_j)$. It is composed of 4 nodes, which respectively represent the probability for the candidate $np_j$ to be the antecedent of $p_i$ (Candidate), to have some morphological properties regarding number (Number_Filter), to be the first NP (First_NP) and to be the verb subject (Subject_NP) of the sentence.
³ If the link (First_NP, Subject_NP) is erased, the classifier becomes a naive Bayesian classifier. More generally, a Bayesian Network whose structure is a tree of depth 1 without any link between leaves is a Naive Bayesian classifier.
The first parameterising step computes the a priori probability values. These probabilities are estimated on the basis of the frequencies computed on the set of couple examples extracted from a training corpus, for which all the attribute values are instantiated. From these observations, we state for instance that P(Candidate=Antecedent)=0.04, i.e. we consider that any candidate has a priori a probability of 4% to be the antecedent of an anaphoric pronoun occurrence⁴.
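The footnote attached to this paragraph mentions that an expert estimation is combined with corpus frequencies through a Maximum A Posteriori approach. One common way to do this, shown below as a sketch, is a Dirichlet prior whose mode is the expert estimate. The counts, the expert probabilities and the `strength` parameter are hypothetical, and the formulation actually used in the system may differ.

```python
def map_estimate(counts, expert_probs, strength):
    """MAP estimate of a discrete distribution under a Dirichlet prior.

    counts:       value -> frequency observed in the training couples
    expert_probs: value -> probability proposed by the expert (sums to 1)
    strength:     equivalent sample size granted to the expert (hypothetical knob)
    """
    k = len(counts)
    # alpha_v = 1 + strength * expert_probs[v] makes the prior mode equal to the expert estimate.
    alphas = {v: 1.0 + strength * expert_probs[v] for v in counts}
    total = sum(counts.values()) + sum(alphas.values()) - k
    return {v: (counts[v] + alphas[v] - 1.0) / total for v in counts}


# Hypothetical figures: 30 antecedents among 1000 couples, expert belief of 5%.
counts = {"Antecedent": 30, "Not_Antecedent": 970}
expert = {"Antecedent": 0.05, "Not_Antecedent": 0.95}
estimate = map_estimate(counts, expert, strength=1000)
print(estimate)
```

With a `strength` equal to the corpus size, the estimate lies halfway between the observed frequency and the expert value; a smaller `strength` trusts the corpus more.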
The influence link between the variables Candidate and Number_Filter indicates that a candidate is less likely to be plural if it is the antecedent of the pronoun it (conversely, it is less likely to be its antecedent if it is a plural noun). In the same manner, the links between the variables Candidate and First_NP on the one hand, and Candidate and Subject_NP on the other hand, respectively indicate that the candidate is more likely to be the first NP of the preceding sentence and to be the subject of the verb if it is the pronoun antecedent. The link (First_NP, Subject_NP) connects two variables that are considered as dependent on each other on the basis of the training corpus and expert estimation. This means that the reliability of the subject syntactic role is increased if the candidate also occurs at the beginning of a sentence. This interdependency is measured through the table of conditional probabilities that is associated to the node Subject_NP on Figure 1. We also added a value Unknown to the RV of the Subject_NP node, as the syntactic analysis quite often fails to associate a grammatical role to some NPs. This is a way to avoid taking incomplete data into account for the first evaluation of our system [4].
Once all the a priori conditional probabilities have been computed, the inferring step begins. Let us take as an example the couple (citA transcription, $it_1$) extracted from the sentence "In minimal medium, [citA transcription]_1 was about 6-fold lower when glucose was the sole carbon source than [it]_1 was when succinate was the carbon source." Our system computes the values of the attributes of that couple. The candidate is not a plural NP but it is the first NP of the sentence. Since these observations are very reliable, we can state that P(Number_Filter=Singular)=1 and P(First_NP=First)=1 (strong evidence). Even if the parser has produced a dependency analysis of that sentence in which the candidate is the subject of the verb, we know that this analysis may be erroneous and we consider that this third observation is only a soft evidence: P(Subject_NP=Subject)=0.89.
On the basis of these observations, the probability for the candidate to be the pronoun antecedent can be computed:
P(Candidate=Antecedent | Number_Filter=Singular, First_NP=First, Subject_NP=Subject) = 0.4
Our system similarly computes the probability for any other NP to be the antecedent of the pronoun $it_1$. If none of the other candidates has a probability higher than 40%, citA transcription is considered to be the antecedent of the pronoun.
⁴ Actually, a part of human expertise is combined with corpus evidence in this probability estimation because the training dataset, although complete, is not fully reliable (some values may be erroneous). To lower that noise effect, we integrate an expert estimation into the a priori probability computation, using the Maximum A Posteriori approach [13].
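The inference step on the four-node network can be sketched in Python by enumerating the two values of Candidate. Only the prior P(Candidate=Antecedent)=0.04 and the soft evidence 0.89 come from the text; every other table entry is a hypothetical placeholder, so the sketch does not reproduce the 0.4 reported above. The soft evidence is handled here with Jeffrey's rule, and the remaining 0.11 is assigned to the Unknown value; both choices are assumptions, since the text does not say how the system treats soft evidence.

```python
# Prior of the Candidate node: 0.04 is the value quoted in the text.
P_CANDIDATE = {"Antecedent": 0.04, "NotAntecedent": 0.96}

# All tables below are hypothetical placeholders, NOT the values of Figure 1.
P_NUMBER = {
    "Antecedent": {"Singular": 0.95, "Plural": 0.05},
    "NotAntecedent": {"Singular": 0.60, "Plural": 0.40},
}
P_FIRST = {
    "Antecedent": {"First": 0.50, "NotFirst": 0.50},
    "NotAntecedent": {"First": 0.10, "NotFirst": 0.90},
}
# Subject_NP depends on both Candidate and First_NP (the extra link of the network).
P_SUBJECT = {
    ("Antecedent", "First"): {"Subject": 0.70, "Unknown": 0.20, "Complement": 0.10},
    ("Antecedent", "NotFirst"): {"Subject": 0.30, "Unknown": 0.30, "Complement": 0.40},
    ("NotAntecedent", "First"): {"Subject": 0.40, "Unknown": 0.30, "Complement": 0.30},
    ("NotAntecedent", "NotFirst"): {"Subject": 0.10, "Unknown": 0.30, "Complement": 0.60},
}


def joint(c, number, first, subject):
    """Joint probability given by the factorisation of the network."""
    return (P_CANDIDATE[c] * P_NUMBER[c][number] * P_FIRST[c][first]
            * P_SUBJECT[(c, first)][subject])


def posterior_hard(number, first, subject):
    """P(Candidate=Antecedent | three attributes observed with certainty)."""
    numerator = joint("Antecedent", number, first, subject)
    denominator = sum(joint(c, number, first, subject) for c in P_CANDIDATE)
    return numerator / denominator


def posterior_soft(number, first, subject_belief):
    """Jeffrey's rule: average the hard posteriors, weighted by the belief in each Subject_NP value."""
    return sum(w * posterior_hard(number, first, s)
               for s, w in subject_belief.items() if w > 0)


belief = {"Subject": 0.89, "Unknown": 0.11, "Complement": 0.0}
hard = posterior_hard("Singular", "First", "Subject")
soft = posterior_soft("Singular", "First", belief)
print(round(hard, 3), round(soft, 3))
```

Because the parser output is only trusted at 0.89, the soft posterior is pulled from the "Subject" posterior towards the "Unknown" one, which is the intended effect of not fully trusting the syntactic analysis.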
3.3 An extensive list of classification attributes
We keep all the attributes of MARS, except the C-command constraint, which is mostly useful for demonstrative pronoun anaphors (e.g. this), and the preferences specifically designed for the technical type of corpora on which MARS has been initially tested⁵. We also enrich that list with some additional clues that are relevant for salience calculus and which are used in several other systems described in the state of the art.
The following list details the various properties that are used as attributes by our classifier. Each property is modelled as a node in our Bayesian Network (see Figure 2, where MARS attributes and the additional ones are distinguished: they are respectively coloured in black and grey):
• Gender_Filter and Number_Filter: the candidate must be morphologically compatible with the pronoun occurrence.
• Impersonal_Filter: the candidate cannot be the antecedent of an impersonal pronoun occurrence.
• First_NP: the first NP of the sentence is very often the verb subject.
• Subject_NP: a candidate is more likely to be the antecedent if it is the verb subject than if it holds a different syntactic role.
• Indicative verb: the NPs immediately following the verbs that belong to the indicative class (analyze, check...) are supposed to be complements of these verbs and are more salient than others. For our experiments, this class has been manually acquired from a training corpus.
• Repeated_NP: an NP that is repeated several times in the same paragraph as the pronoun occurrence is more likely to be salient. These repetitions are computed by counting the number of occurrences of the NP head constituent (on the basis of a simple character string comparison).
• Heading_Candidate: NPs occurring in a title or at the beginning of a paragraph are emphasised and are more salient.
• Collocation_Patterns: our system exploits some collocation patterns with order constraints (<NP/pronoun verb> or <verb NP/pronoun>, in which we consider the lemmatised form of the verbs) but also with syntactic constraints (<Subject verb> and <verb complement>). Occurrence frequencies are computed for each candidate head in each type of collocation pattern.
• Term: the NPs belonging to the domain terminology are considered as salient discourse elements.
• Definite_NP: indefinite NPs are less salient than definite ones. We consider that an NP is indefinite if it does not follow a definite, possessive or demonstrative determiner.
• Prepositional_NP: if an NP belongs to a prepositional complement, its salience score is decreased. The prepositional complements are identified through the text constituent analysis.
• Distance: the candidates that are closer to the pronoun occurrence are more likely to be the antecedent.
• Proper_Name: the proper names are salient discourse elements. We consider as proper names all the NPs tagged as such by the POS tagger or tagged as named entities.
• Pronoun_NP: if the candidate is itself an anaphoric pronoun, its own antecedent is considered as a salient candidate for the new pronoun.
• Appositive_NP: if a candidate occurs in an appositive clause, its salience is decreased. The appositive clauses are identified as textual segments that are preceded and followed by the same or symmetric punctuation marks⁶ and which contain no verb occurrence.
• Syntactic_Parallelism: we check that the candidate has the same syntactic role as the pronoun occurrence.
• Semantic_Class: some semantic classes are more salient than others. For instance, in biological corpora, the genes are more salient than persons.
• Semantic_Consistence: if the candidate is a named entity, we check that it is semantically coherent with the pronoun occurrence. We list the semantic classes of the NPs occurring in the same collocation patterns as the pronoun occurrence and we check that the candidate semantic class is one of those.
⁵ Namely, the immediate reference and sequential instruction preferences.
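Some of the surface attributes listed above are simple enough to be sketched in a few lines of Python. The functions below are illustrative only: they assume a whitespace-tokenised paragraph and NP chunks that include their determiner, the determiner list is a hypothetical approximation, and the example sentence is made up.

```python
# Hypothetical list of definite, demonstrative and possessive determiners.
DEFINITE_DETERMINERS = {
    "the", "this", "that", "these", "those",
    "my", "your", "his", "her", "its", "our", "their",
}


def repeated_np(head, paragraph_tokens):
    """Repeated_NP: count the occurrences of the NP head by plain string comparison."""
    return sum(1 for tok in paragraph_tokens if tok.lower() == head.lower())


def definite_np(np_tokens):
    """Definite_NP: the NP starts with a definite, possessive or demonstrative determiner."""
    return bool(np_tokens) and np_tokens[0].lower() in DEFINITE_DETERMINERS


def distance(np_index, pronoun_index):
    """Distance: number of tokens separating the candidate from the pronoun occurrence."""
    return pronoun_index - np_index


# Made-up paragraph used only to exercise the functions.
paragraph = ("The gene is expressed in minimal medium . "
             "The gene encodes a protein and it is repressed by glucose .").split()
pronoun_index = paragraph.index("it")

print(repeated_np("gene", paragraph))
print(definite_np(["The", "gene"]), definite_np(["a", "protein"]))
print(distance(paragraph.index("protein"), pronoun_index))
```

In the Bayesian Network, each of these values would be attached to the corresponding node as evidence rather than added to a salience score.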
4 Experiments and results
4.1 Description of the classifiers
We have used 6 different classifiers for the anaphora resolution.
Three of them are used as baseline systems: the Random system, which randomly chooses the antecedent among the candidate list, the First_NP system, which systematically selects the first NP of the preceding sentence as the pronoun antecedent, and Bio_MARS, which is our version of Mitkov's MARS system. The solving algorithm of Bio_MARS is the same as that of MARS but our system is specifically designed for genomics. The preprocessing includes the following steps: the NP list
⁶ Except for parentheses, which are often used for acronyms in biological corpora.