Anaphora resolution for question answering


Anaphora Resolution for Question Answering

by

Luciano Castagnola

Submitted to the Department of Electrical Engineering and Computer Science

in partial fulfillment of the requirements for the degrees of

Bachelor of Science in Computer Science and Engineering

and

Master of Engineering in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2002

© Massachusetts Institute of Technology 2002. All rights reserved.

Author

Department of Electrical Engineering and Computer Science

May 24, 2002

Certified by

Boris Katz

Principal Research Scientist

Thesis Supervisor

Accepted by

Arthur C. Smith

Chairman, Department Committee on Graduate Students


Anaphora Resolution for Question Answering

by

Luciano Castagnola

Submitted to the Department of Electrical Engineering and Computer Science on May 24, 2002, in partial fulfillment of the requirements for the degrees of Bachelor of Science in Computer Science and Engineering and Master of Engineering in Electrical Engineering and Computer Science

Abstract

Anaphora is a major phenomenon of natural language, and anaphora resolution is one of the important problems in Natural Language Understanding. In order to analyze text for content, it is important to understand what pronouns (and other referring expressions) refer to. This is important in the context of Question Answering, where questions and information sources are analyzed for content in order to provide precise answers, unlike keyword searches. This thesis describes BRANQA, an anaphora resolution tool built for the purpose of improving the performance of Question Answering systems. It resolves pronoun references via the use of syntactic analysis and high precision heuristic rules. BRANQA serves as an infrastructure for experimentation with different resolution strategies and will enable evaluation of the benefits of anaphora resolution for Question Answering. We evaluated BRANQA's performance and found it to be comparable to that of other systems in the literature.

Thesis Supervisor: Boris Katz
Title: Principal Research Scientist


Acknowledgments

I am grateful to Sue Felshin and Greg Marton for valuable comments in the preparation of this document. I thank Ali Ibrahim and Greg for their help during the development of the system.

I thank Boris Katz for his support and patience throughout these years.


List of Figures

2-1 Binding Theory Examples
2-2 Examples of disjoint reference
2-3 RAP's pleonastic pronoun detector
3-1 Overall Architecture
3-2 Link Parser Output Example
3-3 Problems assigning constituent structure to conjunctions
3-4 Noun Phrase Table


List of Tables

4.1 Test Results by Rule
4.2 Test Results by Pronoun


Chapter 1 Introduction

In this chapter I present the motivation behind the development of BRANQA, an anaphora resolution tool.

1.1 What is anaphora?

Anaphora is reference to entities mentioned previously in the discourse. The referring expression is called an anaphor and the entity to which it refers, or binds, is its referent or antecedent. Anaphora resolution is the process of finding an anaphor's antecedent.

Example: The car is falling apart, but it still works.

Here "it" is the anaphor and "The car" is the antecedent. This is an example of pronominal anaphora, or anaphora where the anaphor is a pronoun. It is the most common type of anaphora, and will be the focus of this thesis. Other kinds of anaphora are definite noun phrase anaphora and one-anaphora:

President George Bush signed (...) The president...

If you don't like the coat, you can choose another one.

In the first sentence "The president" is the anaphor, and "President George Bush" is the antecedent. In the second, "one" is the anaphor and "the coat" the antecedent. When the anaphor is in the same sentence as the antecedent, it is called an intrasentential anaphor; otherwise it is an intersentential anaphor.


1.2 Question Answering

The InfoLab Group at MIT's AI Lab has developed systems that attempt to solve the problem of information access. The belief that natural language is the easiest way for humans to request information has led the group to work on question answering systems. The START (SynTactic Analysis using Reversible Transformations) [17, 18] system provides multimedia access using natural language. It has been available to answer questions on the World Wide Web (http://www.ai.mit.edu/projects/infolab) since December 1993. Since it came online, it has answered millions of questions for hundreds of thousands of people all over the world, providing users with knowledge regarding geography, presidents of the U.S., movies, and many other areas.

The START System strives to deliver "just the right information" in response to a query. Unlike Web search engines, START does not reply with long lists of documents that might contain an answer to our question; it provides the actual answer we are looking for. This comes in the form of a short information segment (e.g., an English sentence, a graph, a picture), rather than an entire document. START has been very successful in its interaction with users, but its domain of knowledge is fairly limited and expanding its knowledge base requires human effort. It works extremely well within the domains it handles, but any question outside its knowledge base will get a reply from START saying it does not know how to answer it.

In response to this problem, the InfoLab Group began to work on systems with less stringent requirements with respect to both returning correct answers and delivering "just the right information". These systems lie somewhere along the spectrum between information retrieval engines like Altavista or Google at one end and natural language systems like START at the other. They are linguistically informed search engines, which attempt to use natural language tools to aid the retrieval of information in order to return a smaller amount of irrelevant information than traditional search engines.

One of these systems, Sapere [22], indexes relations between words to allow it to search for information in a smart way. By storing relations like Subject-Verb-Object, it can distinguish between cases that the simple "bag of words" approach would confuse. For example, in response to the question "When did France attack England?" Sapere will not return the sentences "England and France attacked China in 1857", "England attacked France", or "France was attacked by England", since the crucial relation France-attack-England is missing in all of them. The "bag of words" approach treats documents as sets of keyword counts, and would thus consider "England attacked France" to be equivalent to "France attacked England".

1.3 Resolving Pronouns for Question Answering

Underlying the motivation for this project is a desire to improve the performance of the InfoLab Group's question answering systems. START, Sapere, and future group projects can benefit from the use of a pronominal anaphora resolution tool.

As mentioned above, Sapere indexes relations as part of its linguistically informed information retrieval approach. The analysis, indexing and retrieving are all done at the sentence level and this makes the resolution of anaphors very important. Without resolving what a pronoun refers to in a sentence, relations involving that pronoun are not useful for retrieval. After reading "The first seven attempts to climb Mount Everest were unsuccessful. Edmund Hillary climbed it in 1953..." we cannot answer "Who climbed Mount Everest for the first time?" unless we find that "Mount Everest" is an antecedent for "it".
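The effect can be illustrated with a minimal sketch. It is not Sapere's implementation: the triple format, the function names, and the pronoun-to-antecedent mapping are all assumptions made for this example. It shows that keyword counts ignore word order, and that an indexed relation containing an unresolved "it" cannot answer a question about Mount Everest.

```python
from collections import Counter

def bag_of_words(sentence):
    """Represent a sentence as keyword counts, ignoring word order."""
    return Counter(sentence.lower().split())

def index_relations(triples, antecedents=None):
    """Store Subject-Verb-Object triples (hypothetical format), replacing
    resolved pronouns by their antecedents when a mapping is supplied."""
    antecedents = antecedents or {}
    return {tuple(antecedents.get(term, term) for term in triple) for triple in triples}

def who(index, verb, obj):
    """Answer 'Who <verb> <obj>?' by listing the subjects of matching relations."""
    return sorted(s for s, v, o in index if v == verb and o == obj)

# Word order is invisible to keyword counts.
same_bag = bag_of_words("England attacked France") == bag_of_words("France attacked England")

# Relation extracted from "Edmund Hillary climbed it in 1953".
triples = [("Edmund Hillary", "climb", "it")]
unresolved = who(index_relations(triples), "climb", "Mount Everest")
resolved = who(index_relations(triples, {"it": "Mount Everest"}), "climb", "Mount Everest")
```

Here `unresolved` is empty while `resolved` contains "Edmund Hillary". A real system would attach the antecedent to one specific pronoun occurrence rather than to the word "it" globally.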

Adding a pronominal anaphora resolution module to Sapere should increase the number of questions it can answer about a given corpus; resolving pronouns should increase its recall (the ratio of correct answers found to correct answers in the corpus) by raising the number of useful relations that are indexed.

START also stands to benefit from the availability of an anaphora resolution module. One of the features of START that is not currently being used is the ability to keep track of threads of conversation with different users. Enabling this feature could allow more interesting interaction with the users, turning sessions into dialogues rather than series of disconnected question/answer pairs. In this mode of operation, pronominal anaphora resolution would become very important, since it would allow users to refer to entities introduced in previous sentences much more naturally.

Currently START handles the simplest cases of pronominal anaphora; namely, if there is only one possible antecedent for a pronoun that passes the gender and number agreement test, START will resolve it, but otherwise it will ask the user to clarify. This is the most conservative approach to anaphora resolution (so long as gender and number of entities is identified correctly, no mistakes will be made), but this leads to unnatural conversation in many cases where the pronoun could be resolved with high confidence. We believe that a good pronominal anaphora resolution tool would lead to improved user interaction, an important goal of the START system.
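The conservative strategy described above can be sketched as follows. This is an illustration only, not START's code; the dictionary fields and the function name are assumptions of the example.

```python
def resolve_conservatively(pronoun, candidates):
    """Return the only candidate that agrees with the pronoun in gender and
    number, or None when there are zero or several such candidates (the
    situation in which the user would be asked to clarify)."""
    agreeing = [c for c in candidates
                if (c["gender"], c["number"]) == (pronoun["gender"], pronoun["number"])]
    return agreeing[0]["text"] if len(agreeing) == 1 else None

IT = {"gender": "neuter", "number": "sg"}
hillary = {"text": "Edmund Hillary", "gender": "masc", "number": "sg"}
everest = {"text": "Mount Everest", "gender": "neuter", "number": "sg"}
summit = {"text": "the summit", "gender": "neuter", "number": "sg"}

clear_case = resolve_conservatively(IT, [hillary, everest])
ambiguous_case = resolve_conservatively(IT, [hillary, everest, summit])
```

In the second call two neuter singular candidates remain, so nothing is resolved even though a human reader would often find the choice obvious.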

More traditional approaches to information retrieval also stand to gain from anaphora resolution [29]. It can even help systems based on the "bag of words" scheme, where pronouns should raise the counts of their antecedents. Thus, future group projects in this direction could also profit from this technology.

This thesis presents the design and evaluation of BRANQA, a system motivated by the benefits that question answering could reap from anaphora resolution.

1.4 Outline

The rest of the document is organized as follows:

* Chapter 2 introduces the background work on which the system is based.

* Chapter 3 describes the system's architecture and how it works.

* Chapter 4 presents an evaluation of the system.

* Chapter 5 lists improvements to be made in the near future together with research projects suggested by this work.


Chapter 2 Anaphora Resolution

In this chapter I present the ground on which this thesis rests.

2.1 Overview of Pronominal Anaphora Resolution

When encountering a pronoun in the text, how can one tell what it refers to? The literature shows a wide variety of approaches to solving this problem. Mitkov [27] provides an excellent overview of the state of the art in anaphora resolution, parts of which I summarize briefly in this section.

2.1.1 Two Stages

The process of finding the expression to which a pronoun refers can be split into two tasks: finding a set of plausible referents, and picking a "best" element from the set. The first task is complicated by the many different types of reference that pronouns take part in. Pronouns can refer to noun phrases, verb phrases, clauses, sentences or even whole paragraphs. For example, in "Mary ran ten miles yesterday. She liked it very much", "she" refers to the noun phrase "Mary", and "it" refers to the verb phrase "ran ten miles yesterday". Pronouns can also lack referents. This is the case with pleonastic (alternatively non-referential or semantically empty) pronouns, as in "It is raining" or "It seems John is unhappy."


Additionally, the referent can be mentioned before or after the pronoun. If the referent is mentioned first, the usual situation, the referent is the antecedent of an anaphor. If the pronoun is seen first, the kind of reference is called cataphora and the pronoun is a cataphor. An example of a cataphoric relation would be "When he woke up, John was drenched in sweat."

An ideal system for the resolution of pronouns would have to handle all of these kinds of reference, which would involve search over all possible referents (noun phrases, verb phrases, etc.) both before and after the pronoun. In practice, the scope of systems is usually reduced to the detection of pleonastic pronouns and the resolution of anaphors with noun phrase antecedents, both because of the complexity of handling the general case, and because these are the most common uses of pronouns. In a system with this focus, the first step in resolving a pronoun is to determine whether it is pleonastic, and if not, to identify all noun phrases occurring before it as possible antecedents.

Once the set of potential antecedents is determined, a number of "resolution factors" are used to track down the correct antecedent. Factors used frequently in the resolution process include gender and number agreement, syntactic binding restrictions, semantic consistency, syntactic parallelism, semantic parallelism, salience, proximity and others. These factors can be divided into constraints (or eliminating factors), which must hold, and preferences, which are used to rank candidates.
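A minimal sketch of this two-stage scheme is given below. The candidate fields, the two preference functions and their equal weighting are assumptions chosen for illustration, not part of any system described in this thesis.

```python
def resolve(pronoun, candidates, constraints, preferences):
    """Filter candidates with hard constraints, then rank the rest by summed preferences."""
    plausible = [c for c in candidates if all(check(pronoun, c) for check in constraints)]
    if not plausible:
        return None
    return max(plausible, key=lambda c: sum(score(pronoun, c) for score in preferences))

def agrees(pronoun, cand):
    """Constraint: gender and number must match."""
    return pronoun["number"] == cand["number"] and pronoun["gender"] == cand["gender"]

def subject_bonus(pronoun, cand):
    """Preference (hypothetical weight): favour subjects."""
    return 1.0 if cand["subject"] else 0.0

def proximity(pronoun, cand):
    """Preference (hypothetical weight): favour candidates close to the pronoun."""
    return 1.0 / (pronoun["position"] - cand["position"])

# "The car impressed the mechanics at the garage because it ..."
it = {"number": "sg", "gender": "neuter", "position": 9}
candidates = [
    {"text": "The car", "number": "sg", "gender": "neuter", "subject": True, "position": 1},
    {"text": "the mechanics", "number": "pl", "gender": "unknown", "subject": False, "position": 4},
    {"text": "the garage", "number": "sg", "gender": "neuter", "subject": False, "position": 7},
]
choice = resolve(it, candidates, [agrees], [subject_bonus, proximity])
```

The constraint removes "the mechanics" outright, while the preferences only order the two remaining candidates, so a different weighting could change the winner.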

2.1.2 Constraints

Constraints control what an anaphor can refer to. They are conditions that always need to hold for reference to be valid, and can thus be used to remove implausible candidates from the list of possible antecedents.

Examples of constraints are gender and number agreement. Anaphors and their antecedents must always agree in number and gender. (Note: collective nouns like "government" and "team" can be referred to by "they", and plural nouns like "data" can be referred to by "it". The definition of number is complicated in these cases.) Some constraints are given by syntactic theories like Government and Binding Theory, which specifies binding restrictions (see Section 2.2).

Other constraints are given by semantics. Although it is beyond current natural language technology to understand open domain texts, statistics can be used as a proxy for semantic knowledge. In the two examples below, the frequency of co-occurrence of words could be used to disambiguate the anaphors:

Joe removed the diskette from the computer_i and disconnected it_i.
Joe removed the diskette_i from the computer and copied it_i.

Ge, Hale and Charniak [11] present a successful statistical approach to the resolution of pronouns consisting of a probabilistic model trained on a small subset of the Penn Treebank corpus.
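The diskette example can be sketched with co-occurrence counts standing in for semantic knowledge. The counts below are invented for illustration; in practice they would be gathered from a corpus, and this sketch is not the model of Ge, Hale and Charniak.

```python
# Hypothetical verb-object co-occurrence counts, invented for this example.
COOCCURRENCE = {
    ("disconnect", "computer"): 40,
    ("disconnect", "diskette"): 2,
    ("copy", "diskette"): 35,
    ("copy", "computer"): 3,
}

def pick_by_cooccurrence(verb, candidates):
    """Choose the candidate seen most often as the object of the verb."""
    return max(candidates, key=lambda noun: COOCCURRENCE.get((verb, noun), 0))

disconnected = pick_by_cooccurrence("disconnect", ["diskette", "computer"])
copied = pick_by_cooccurrence("copy", ["diskette", "computer"])
```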

2.1.3 Preferences

Preferences, as opposed to constraints, are not obligatory conditions and therefore do not always hold. They are criteria that can be used to rank the possible antecedents. Among preferences, Mitkov lists syntactic parallelism, semantic parallelism and centering.

Syntactic parallelism gives preference to noun phrases with the same syntactic function as the anaphor. For example:

The programmer successfully combined Prolog_j with C, but he had combined it_j with Pascal last time.

The programmer_i successfully combined Prolog with C_j, but he had combined Pascal with it_j last time.

Similarly, semantic parallelism says that noun phrases which have the same semantic role as the anaphor are favoured.

Vincent gave the diskette to Sody_i. Kim also gave him_i a letter.
Vincent_i gave the diskette to Sody. Kim got a letter from him_i too.

Syntactic and semantic criteria are not always sufficient to choose among a set of candidates. These criteria are usually used as filters to eliminate unsuitable candidates, and after that the most salient element among the remaining noun phrases is selected. This most salient element is referred to as the focus [31] or center [12]. Mitkov uses the following example to illustrate this concept:


Jenny put the cup on the plate and broke it.

Here the meaning of "it" is ambiguous; its antecedent could be "the cup" or "the plate". However, context can help disambiguate the reference:

Jenny went window shopping yesterday and spotted a nice cup. She wanted to buy it, but she had no money with her. The following day, she went to the shop and bought the coveted cup. However, once back home and in her kitchen, she put the cup on the plate and broke it.

Now "the cup" is the most salient entity and is the center of attention throughout the paragraph; it is preferred over "the plate" as an antecedent for "it". This example illustrates the important role of tracking down the center/focus in anaphora resolution. After "filtering" unsuitable candidates, the final choice is made by determining which of the candidates seems to be the center. Various methods have been proposed for center/focus tracking [5, 9, 25, 32, 36].

2.1.4 Computational Strategies

The traditional approach to anaphora resolution is to eliminate unlikely candidates until a minimal set of plausible candidates is obtained, and then make use of preferences to choose a candidate. Other approaches compute the most likely candidate on the basis of statistical or "AI" techniques (Mitkov mentions uncertainty-reasoning methods as an example of these techniques). In these "alternative" systems the concept of constraint might disappear, and all resolution factors might be considered preferences whose weights get updated through "AI" techniques. Mitkov [26] compares a traditional and an "alternative" approach using the same set of anaphora resolution factors.

2.2 Government and Binding Theory

Government and Binding Theory is a version of Chomsky's theory of universal grammar named after his Lectures on Government and Binding [7]. One of its components, Binding Theory, explains the behavior of intrasentential anaphora. The theory explains when an anaphor can bind to a noun phrase based on their relative positions in syntactic structure. The details of the theory are complicated, but the important point for this thesis is that syntax alone can place hard constraints on anaphora, and this can be used to help us pick antecedents for anaphors by eliminating syntactically disallowed candidates. For an introductory treatment of Binding Theory see Haegeman [13].

John_i hit him_j/*i.
John_i hit himself_i/*j.
Lucie_i said [CP that [IP Lili_j hurt herself_j/*i/*k]].
Lucie_i said [CP that [IP Lili_j hurt her_i/k/*j]].
Poirot_i believes [NP John_j's description of himself_j/*i].
Poirot_i believes [NP any description of himself_i/*j].
('*' denotes ungrammatical co-indexings)

Figure 2-1: Binding Theory Examples

Figure 2-1 shows examples of reference determined valid or invalid by Binding Theory on the basis of syntactic structure.

2.3 Prior Work

BRANQA is largely based on two prior systems: Lappin and Leass' RAP [19], and Baldwin's CogNIAC [4]. This section presents some of the ideas taken from them.

2.3.1 RAP

RAP (Resolution of Anaphora Procedure) is an algorithm for identifying the noun phrase antecedents of third person pronouns and lexical anaphors (reflexive and reciprocal pronouns). The algorithm applies to the syntactic representations generated by McCord's Slot Grammar parser [23], and relies on salience measures derived from syntactic structure and a simple dynamic model of attentional state. In a blind test on computer manual text containing 360 pronoun occurrences the system identified the correct antecedent for 86% of these pronoun occurrences.


RAP contains the following main components:

* An intrasentential syntactic filter for ruling out anaphoric dependence of a pronoun on a noun phrase based on syntactic binding constraints.

* A morphological filter for ruling out anaphoric dependence of a pronoun on a noun phrase due to non-agreement of person, number or gender features.

* A procedure for identifying pleonastic (semantically empty) pronouns.

* An anaphor binding algorithm for identifying the possible antecedent of a lexical anaphor within the same sentence.

* A procedure for assigning values to several salience parameters (grammatical role, parallelism of grammatical roles, frequency of mention, proximity, and sentence recency) for a noun phrase. This procedure employs a grammatical role hierarchy according to which the evaluation rules assign higher salience weights to (i) subject over non-subject noun phrases, (ii) direct objects over indirect objects, (iii) arguments of a verb over adjuncts and objects of prepositional phrase adjuncts of the verb, and (iv) head nouns over complements of head nouns.

* A procedure for identifying anaphorically linked noun phrases as an equivalence class for which a global salience value is computed as the sum of the salience values of its elements.

* A decision procedure for selecting the preferred element of a set of antecedent candidates for a pronoun.

BRANQA's syntactic filter and pleonastic pronoun detector were modeled after the ones in RAP, which I describe below.
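Before turning to those two components, the salience bookkeeping from the component list can be sketched in a few lines. The numeric weights are illustrative values chosen only to respect the ordering of the grammatical role hierarchy; the role names and functions are assumptions of this sketch rather than RAP's actual parameters.

```python
# Illustrative weights: only their relative order follows the hierarchy
# (subject > direct object > indirect object > adjunct).
ROLE_WEIGHT = {"subject": 80, "direct_object": 50, "indirect_object": 40, "adjunct": 20}

def class_salience(mention_roles):
    """Global salience of an equivalence class: the sum over its mentions."""
    return sum(ROLE_WEIGHT[role] for role in mention_roles)

def most_salient(classes):
    """Pick the equivalence class with the highest global salience."""
    return max(classes, key=lambda entity: class_salience(classes[entity]))

# Each entity maps to the grammatical roles of its mentions so far.
classes = {
    "the cup": ["direct_object", "direct_object", "direct_object"],
    "the plate": ["adjunct"],
}
winner = most_salient(classes)
```

Because salience is summed over all linked mentions, an entity that is mentioned repeatedly accumulates a higher score than one mentioned once.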

Intrasentential Syntactic Filter

RAP's syntactic filter was developed for English Slot Grammar, a kind of dependency-based grammar [24]. Dependency syntax avoids the use of phrase structure or categories; instead it marks syntactic dependencies between the words of a sentence. These are represented by arcs with arrows: X→Y. We say that Y depends on X, or that X governs Y. X is called the (syntactic) governor of Y and Y is called the (syntactic) dependent of X. The head of a phrase P is a component of P which governs all other components of P. An argument of X is a necessary dependent of X (e.g., the direct object for a transitive verb) and an adjunct of X is an optional dependent of X (e.g., an adjective modifying a noun).

The filter consists of conditions for non-coreference of a noun phrase and a pronoun within the same sentence. The following terminology is used to state these conditions:

* A phrase P is in the argument domain of a phrase N iff P and N are both arguments of the same head.

* P is in the adjunct domain of N iff N is an argument of a head H, P is the object of a preposition PREP, and PREP is an adjunct of H.

* P is in the NP domain of N iff N is the determiner of a noun Q and (i) P is an argument of Q, or (ii) P is the object of a preposition PREP and PREP is an adjunct of Q.

* A phrase P is contained in a phrase Q iff (i) P is either an argument or an adjunct of Q, i.e., P is immediately contained in Q, or (ii) P is immediately contained in some phrase R, and R is contained in Q.

Given these definitions, the syntactic filter says that a pronoun P is non-coreferential with a (non-reflexive or non-reciprocal) noun phrase N if any of the following hold:

1. P is in the argument domain of N.
2. P is in the adjunct domain of N.
3. P is an argument of a head H, N is not a pronoun, and N is contained in H.
4. P is in the NP domain of N.
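The first two conditions can be sketched over a toy dependency structure. The `Node` class, the "arg"/"adj" role labels and the `attach` helper are assumptions of this example; they are not the Slot Grammar representation used by RAP.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class Node:
    word: str
    head: Optional["Node"] = None
    role: str = ""  # "arg" or "adj", relative to the head

def attach(dep, head, role):
    """Make `dep` an argument ("arg") or adjunct ("adj") of `head`."""
    dep.head, dep.role = head, role
    return dep

def in_argument_domain(p, n):
    """P and N are both arguments of the same head."""
    return (p is not n and p.head is not None and p.head is n.head
            and p.role == "arg" and n.role == "arg")

def in_adjunct_domain(p, n):
    """N is an argument of H, P is the object of PREP, and PREP is an adjunct of H."""
    prep = p.head
    return (prep is not None and p.role == "arg" and prep.role == "adj"
            and n.role == "arg" and n.head is not None and prep.head is n.head)

# "She likes her."
likes = Node("likes")
she1 = attach(Node("She"), likes, "arg")
her1 = attach(Node("her"), likes, "arg")

# "She sat near her."
sat = Node("sat")
she2 = attach(Node("She"), sat, "arg")
near = attach(Node("near"), sat, "adj")
her2 = attach(Node("her"), near, "arg")

blocked_by_argument_domain = in_argument_domain(her1, she1)
blocked_by_adjunct_domain = in_adjunct_domain(her2, she2)
```

In both sentences the filter rules out coreference between "her" and "She", the first by condition 1 and the second by condition 2.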


1. She_i likes her_j.
   John_i seems to want to see him_j.
2. She_i sat near her_j.
3. He_i believes that the man_j is amusing.
4. John_i's portrait of him_j is interesting.
5. His_i portrait of John_j is interesting.

Figure 2-2: Examples of disjoint reference

Figure 2-2 shows examples of disjoint reference signalled by these conditions.

Pleonastic pronoun detector

RAP attempts to identify non-referential uses of it to improve resolution performance. It defines a class of modal adjectives (ModalAdj) containing words like "necessary", "easy" and "advisable", together with their morphological negations, as well as comparative and superlative forms. It also defines a class of cognitive verbs (CogV) like "recommend", "think" and "believe". When it is present in one of the constructions in Figure 2-3 it is considered pleonastic. Syntactic variants of these constructions (It is not/may be ModalAdj..., Wouldn't it be ModalAdj..., etc.) are also recognized.

It is ModalAdj that S
It is ModalAdj (for NP) to VP
It is CogV-past-tense that S
It seems/appears/means/follows (that) S
NP makes/finds it ModalAdj (for NP) to VP
It is time to VP
It is thanks to NP that S

Figure 2-3: RAP's pleonastic pronoun detector
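The constructions in Figure 2-3 can be approximated with surface patterns. The sketch below is a rough string-level approximation, not RAP's detector (which works on parser output): the word classes contain only the three adjectives and three verbs named in the text, and the syntactic variants are not covered.

```python
import re

# Only the examples named in the text; the real classes are larger.
MODAL_ADJ = r"(?:necessary|easy|advisable)"
COG_VERB = r"(?:recommended|thought|believed)"

PATTERNS = [
    rf"\bit is {MODAL_ADJ} that\b",
    rf"\bit is {MODAL_ADJ} (?:for .+? )?to\b",
    rf"\bit is {COG_VERB} that\b",
    r"\bit (?:seems|appears|means|follows)\b",
    rf"\b(?:makes|finds) it {MODAL_ADJ} (?:for .+? )?to\b",
    r"\bit is time to\b",
    r"\bit is thanks to .+? that\b",
]

def is_pleonastic(sentence):
    """True if the sentence matches one of the surface patterns above."""
    lowered = sentence.lower()
    return any(re.search(pattern, lowered) for pattern in PATTERNS)
```

Surface matching of this kind over-generates (for instance, "it follows the car" would match the fourth pattern), which is one reason a detector built on syntactic structure is preferable.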


2.3.2 CogNIAC

CogNIAC is a pronoun resolution system giving more importance to precision than to recall. The system resolves a subset of anaphors that do not require general world knowledge or sophisticated linguistic processing for successful resolution. CogNIAC does this by being very sensitive to ambiguity, and only resolving pronouns when very high confidence rules have been satisfied.

CogNIAC, like RAP, first eliminates candidate phrases that are not compatible with the anaphor's gender and number or that are ruled out on syntactic grounds (the syntactic constraints used are not mentioned in the paper). The system then evaluates a set of heuristic rules to choose an antecedent, or in the case that no rules are triggered, to leave it unresolved.

The six core rules of CogNIAC are (in order of application):

1. Unique in Discourse: If there is a single possible antecedent i in the preceding portion of the entire discourse, then pick i as the antecedent.

2. Reflexive: Pick nearest possible antecedent in preceding portion of current sentence if the anaphor is a reflexive pronoun.

3. Unique in Current + Prior: If there is a single possible antecedent i in the prior sentence and the preceding portion of the current sentence, then pick i as the antecedent.

4. Possessive Pro: If the anaphor is a possessive pronoun and there is a single exact string match i of the possessive in the prior sentence, then pick i as the antecedent.

5. Unique Current Sentence: If there is a single possible antecedent i in the preceding portion of the current sentence, then pick i as the antecedent.

6. Unique Subject/Subject Pronoun: If the subject of the prior sentence contains a single possible antecedent i, and the anaphor is the subject of the current sentence, then pick i as the antecedent.
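The ordered application of the rules can be sketched as a cascade that stops at the first rule that fires. The sketch covers rules 1, 2, 3 and 5 only, assumes candidates have already passed the agreement and syntactic filters, and uses a dictionary layout (sentence index, word position) invented for the example.

```python
def preceding(pronoun, candidates, first_sentence):
    """Candidates located before the pronoun, from `first_sentence` onwards."""
    here = (pronoun["sentence"], pronoun["position"])
    return [c for c in candidates
            if first_sentence <= c["sentence"]
            and (c["sentence"], c["position"]) < here]

def unique(items):
    return items[0] if len(items) == 1 else None

def unique_in_discourse(p, cands):
    return unique(preceding(p, cands, 0))

def reflexive(p, cands):
    if not p["reflexive"]:
        return None
    same_sentence = preceding(p, cands, p["sentence"])
    return max(same_sentence, key=lambda c: c["position"]) if same_sentence else None

def unique_in_current_plus_prior(p, cands):
    return unique(preceding(p, cands, p["sentence"] - 1))

def unique_in_current_sentence(p, cands):
    return unique(preceding(p, cands, p["sentence"]))

RULES = [unique_in_discourse, reflexive,
         unique_in_current_plus_prior, unique_in_current_sentence]

def resolve(p, cands):
    """Apply the rules in order; leave the pronoun unresolved if none fires."""
    for rule in RULES:
        antecedent = rule(p, cands)
        if antecedent is not None:
            return antecedent
    return None
```

Returning `None` when no rule fires is what trades recall for precision: an ambiguous pronoun is left unresolved rather than guessed.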


In the first experiment reported, CogNIAC was tested on 298 third person singular pronouns in narrative texts about two same-gender people (chosen to maximize ambiguity). It achieved a precision of 92% and recall of 64%.

In a second experiment, CogNIAC was tested on the articles used in the MUC-6 coreference task [30]. The system underwent some changes in preparation for MUC-6, both because CogNIAC was now being used as part of a larger system, and because the domain of the MUC-6 documents was different from the narrative. Rule 4 was eliminated because it did not seem appropriate for the domain. Additions were made to process quoted speech in a limited fashion (the specific additions were not presented in the paper). A rule was added to search back for a unique antecedent through the text looking backwards at progressively larger portions of the text. A new pattern was added which selected the subject of the immediately surrounding clause. A pleonastic it detector was also implemented.

After these changes, CogNIAC achieved 73% precision and 75% recall on fifteen MUC-6 documents containing 114 pronoun occurrences.


Chapter 3 System Architecture

As mentioned in Chapter 2, the resolution of anaphoric expressions proceeds in two stages: the identification of a set of plausible antecedents, followed by the selection of the most likely candidate. The overall architecture of the system has two main components, each one dealing with one part of the problem.

[Figure 3-1: Overall Architecture. Components shown: Link Parser, Link Parser Interface, Named Entity Module, Noun Phrase Categorization, Coreference Module, Pleonastic Pron., Syntactic Filter, Resolution Procedure, Noun Phrase Table.]


The noun phrase categorization tool identifies noun phrases in the input and determines relevant properties of them (e.g., gender and number). The coreference module then uses these properties of noun phrases to select an antecedent for the anaphor. The two components interact with the external Link Parser through a wrapper that communicates with the parser and attempts to correct some of its deficiencies.
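The properties recorded for each noun phrase can be pictured as rows of the Noun Phrase Table (Figure 3-4). The sketch below transcribes the rows of that figure into a small record type. The class and function names are hypothetical, and reading the he/she/it/they columns as pronoun-compatibility flags and the ref column as a shared entity index is an interpretation of the figure, not a description of BRANQA's Java classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NounPhrase:
    np_id: int
    sentence: int
    text: str
    head: str
    subject: bool
    he: bool
    she: bool
    it: bool
    they: bool
    ref: int

# Rows for "Jimi bought a new guitar. He broke it on stage."
TABLE = [
    NounPhrase(1, 1, "Jimi", "Jimi", True, True, False, False, False, 1),
    NounPhrase(2, 1, "a new guitar", "guitar", False, False, False, True, False, 2),
    NounPhrase(3, 2, "He", "He", True, True, False, False, False, 1),
    NounPhrase(4, 2, "it", "it", False, False, False, True, False, 2),
    NounPhrase(5, 2, "stage", "stage", False, False, False, True, False, 5),
]

def mentions_of(table, np_id):
    """List the text of every noun phrase sharing a ref value with `np_id`."""
    ref = next(np.ref for np in table if np.np_id == np_id)
    return [np.text for np in table if np.ref == ref]
```

Under this reading, `mentions_of(TABLE, 3)` groups "He" with "Jimi", which is the outcome of resolving the pronoun.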

One of the goals in mind while designing the system was to provide a testbed for research in anaphora resolution and its application to question answering. Thus, although the system concentrates on the resolution of pronominal anaphors, the infrastructure is there for experimentation with coreference in general. The resolution procedure does not depend on the parser output representation, nor does it depend directly on the linguistic resources used for noun phrase categorization. This modularity allows for easy experimentation with individual parts of the system, for example, evaluating different resolution strategies or different methods for noun phrase categorization.

The system was written in Java, except for parts of the wrapper for the Link Parser which were written in C. The Link Parser is the only external dependency but the design philosophy allows for easy connection to other systems (e.g., a better Named Entity module).

The following sections describe the system's components in more detail.

3.1 Link Parser

After a sentence is submitted to the system, the system parses it. For this task it uses the Link Parser developed at Carnegie Mellon University.

The Link Parser is written in generic C code, and runs on any platform with a C compiler. An application program interface (API) makes it easy to incorporate the parser into other applications.

The parser has a dictionary of about 60,000 word forms. It has coverage of a wide variety of syntactic constructions, including many rare and idiomatic ones. The parser is robust; it is able to skip over portions of the sentence that it cannot understand, and assign some structure to the rest of the sentence. It is able to handle unknown vocabulary, and make intelligent guesses from context and spelling about the syntactic categories of unknown words. It has knowledge of capitalization, numerical expressions, and a variety of punctuation symbols. When several interpretations of a sentence are possible the parser allows access to all of them, sorted by a measure of how good the parse is (for example, when the parse is not complete, it includes in this measure the number of words that had to be skipped).

+-SFsi+---Paf--+--THi--+-Cet+-Ss-+---I---+--Os-+
|     |        |       |    |    |       |     |
it seemed.v likely.a that.c he would.v kiss.v Mary

Figure 3-2: Link Parser Output Example

3.1.1 Link Grammar

The parser is based on a formal grammatical system called a link grammar [33, 34]. A link grammar has no concept of constituents or categories (e.g., noun phrase, verb phrase). It contains a set of words (the terminal symbols of the grammar) each of which has a linking requirement. The parser connects the words with links so as to satisfy their linking requirements and the requirement of planarity (that links do not cross each other), which is a property that holds for most sentences of most natural languages [24]. The linking requirements for the words are specified in a dictionary. They determine what types of links can be used to connect a word to others.
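The planarity requirement can be checked directly once links are written as pairs of word positions. This is a minimal sketch with a made-up representation; it is not how the Link Parser enforces the constraint internally.

```python
def is_planar(links):
    """links: (left, right) word positions with left < right.
    Two links cross when exactly one endpoint of one lies strictly
    between the endpoints of the other."""
    links = list(links)
    for i, (a, b) in enumerate(links):
        for c, d in links[i + 1:]:
            if a < c < b < d or c < a < d < b:
                return False
    return True

# A chain of links between neighbouring words, as in an eight-word sentence.
chain = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
nested = [(0, 3), (1, 2)]
crossing = [(0, 2), (1, 3)]
```

Nested links are allowed because one is drawn entirely under the other; only partially overlapping links violate planarity.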

The link grammar for English contains more than 100 types of links, each of which specifies a different kind of relation between words. Some of these link types are very useful for the task of pronominal anaphora resolution. For example, an SF link connecting "it" to a verb indicates that this is a non-referential use of "it".

Figure 3-2 shows a sample linkage.
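Reading the linkage of Figure 3-2 as a list of labelled links makes the SF test easy to state. The tuple representation and the function name are assumptions of this sketch and do not correspond to the Link Parser's C API.

```python
# Links of the sample linkage, written as (label, left word, right word).
LINKAGE = [
    ("SFsi", "it", "seemed.v"),
    ("Paf", "seemed.v", "likely.a"),
    ("THi", "likely.a", "that.c"),
    ("Cet", "that.c", "he"),
    ("Ss", "he", "would.v"),
    ("I", "would.v", "kiss.v"),
    ("Os", "kiss.v", "Mary"),
]

def has_nonreferential_it(linkage):
    """True if some SF-type link has "it" as its left word."""
    return any(label.startswith("SF") and left.lower() == "it"
               for label, left, _right in linkage)

sample_is_pleonastic = has_nonreferential_it(LINKAGE)
```

The subscripts after SF (here "si") refine the link type, so the check compares only the label prefix.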


(a) Top-ranked parse

(S (NP Former guests)
   (VP include
       (NP (NP (NP John)
               (NP Paul))
           (NP (NP George)
               (NP (NP Ringo)
                   and
                   (NP Steve))))))

(b) Correct parse (worst rank)

(S (NP Former guests)
   (VP include
       (NP (NP John)
           (NP Paul)
           (NP George)
           (NP Ringo)
           and
           (NP Steve))))

Figure 3-3: Problems assigning constituent structure to conjunctions

the task of extracting relations, and the ongoing JLink project at the InfoLab Group is working on that problem. The extracted relations can then be used in our question answering systems (such as Sapere) as well as in new versions of BRANQA by building on the work of Dagan and Itai [8] (see Chapter 5). Thus, work using the Link Parser has the possibility of helping the InfoLab Group beyond the direct results of this thesis, especially through the identification of problems and possible solutions.

3.1.2 Constituent Structure

Although link grammars have no concept of constituents, the Link Parser has (since version 4.0) a phrase-parser: a system which takes a linkage (the usual link grammar representation of a sentence, showing links connecting pairs of words) and derives a constituent or phrase-structure representation, showing conventional phrase categories such as noun phrase (NP), verb phrase (VP), prepositional phrase (PP), clause (S), and so on. This allows us to identify the noun phrases in the text, the first step towards resolution of pronominal anaphors.

The interface to the Link Parser takes the NPs identified by the parser and attempts to fix some common problems in the parser's handling of conjunctions. Conjunctions with many disjuncts are almost always parsed incorrectly in the top-ranking linkage returned by the parser. This happens because one of the components in the cost vector used to sort the linkages is the sum of link lengths, and a flatter structure will have longer links than one with more embedding of phrases. Thus, a sentence like "Former guests include John, Paul, George, Ringo and Steve" has the constituent structure in Figure 3-3(a) assigned to the top-ranking parse whereas the correct constituent structure is the one of the worst ranked linkage.

Jimi bought a new guitar. He broke it on stage.

NP  sent.  text          head(s)  subject  he     she    it     they   ref.
1   1      Jimi          Jimi     true     true   false  false  false  1
2   1      a new guitar  guitar   false    false  false  true   false  2
3   2      He            He       true     true   false  false  false  1
4   2      it            it       false    false  false  true   false  2
5   2      stage         stage    false    false  false  true   false  5

Figure 3-4: Noun Phrase Table

The interface to the Link Parser identifies instances of lists of items, like the one in the example, and corrects the constituent structure within the conjunction.

By using better knowledge of named entities, BRANQA is also able to correct some parsing errors when identifying noun phrases. The Link Parser fails to parse the second sentence in the example below, since "Son" is not recognized as a name.

Mr. Son sang a song. (...) Son was happy.

Our Named Entity recognition module identifies "Son" as a name after having seen "Mr. Son", enabling us to mark the noun phrase.

3.2 Noun Phrase Categorization

The next step after identifying the noun phrases is to determine their values for relevant features to be used by the resolution engine. These properties include the principal noun or head (in the case of conjunctions, the heads of all disjuncts are listed), whether it is a subject or not, and whether each of "he", "she", "it" and "they" can refer to it. The noun phrase categorization module builds up a table of noun phrases that have been observed in the document. Noun phrases in the table have a reference field used to mark anaphoric reference to a previously seen noun phrase. The coreference module is the one in charge of filling that column of the table. An example can be seen in Figure 3-4.
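As an illustration of the table just described, the rows of Figure 3-4 could be held in a structure like the one below. This is a sketch only: the class name `NounPhraseEntry` and its field names are hypothetical and are not taken from BRANQA's source.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class NounPhraseEntry:
    """One row of a Noun Phrase Table, mirroring the columns of Figure 3-4."""
    np_id: int
    sentence: int
    text: str
    heads: List[str]            # several heads in the case of conjunctions
    subject: bool
    valid_refs: Set[str]        # subset of {"he", "she", "it", "they"}
    ref: Optional[int] = None   # filled in by the coreference module


# The rows of Figure 3-4, with the reference column as shown there.
table = [
    NounPhraseEntry(1, 1, "Jimi", ["Jimi"], True, {"he"}, ref=1),
    NounPhraseEntry(2, 1, "a new guitar", ["guitar"], False, {"it"}, ref=2),
    NounPhraseEntry(3, 2, "He", ["He"], True, {"he"}, ref=1),
    NounPhraseEntry(4, 2, "it", ["it"], False, {"it"}, ref=2),
    NounPhraseEntry(5, 2, "stage", ["stage"], False, {"it"}, ref=5),
]

# Pronouns point back to an earlier noun phrase through the ref field.
for entry in table:
    if entry.ref != entry.np_id:
        print(entry.text, "->", table[entry.ref - 1].text)
```

Storing the four pronoun columns as a set makes the later narrowing of valid references a plain set operation.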

3.2.1 Heads and Subject

Finding the head of a noun phrase is accomplished through examination of the link grammar representation of the sentence. The main noun of a simple noun phrase is the only word with a link crossing the boundaries of the phrase. In the case of conjunctions several words can link outside of the phrase; the heads of all noun phrases that comprise the conjunction are then listed. Occasionally this happens in noun phrases that are not conjunctions, but a small list of rules helps select the correct word in most of these cases.

The value for the subject column is obtained directly from the link grammar parse. If the phrase is a subject, its head will be linked to a verb using one of the subject link types (S, SF, SX, SI, SFI or SXI). In passive constructions the surface subject will be linked to the verb through one of these link types, so care must be taken to not mark it as a subject. This is done by checking for a Pv link, which marks the use of a passive verb.
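The subject test can be sketched as follows. The representation of a linkage as `(label, left, right)` triples and the function names are assumptions made for illustration; the link labels in the examples (such as `Ss` and `Pv`) follow link grammar naming, where lowercase subscripts refine a base link type, but the linkages themselves are hand-written rather than parser output.

```python
import re

SUBJECT_LINK_TYPES = {"S", "SF", "SX", "SI", "SFI", "SXI"}


def base_type(label):
    """Strip subscripts from a link label: 'Ss' -> 'S', 'SIs' -> 'SI'."""
    match = re.match(r"[A-Z]+", label)
    return match.group(0) if match else ""


def is_subject(head, links):
    """Return True if the word at position `head` is a non-passive subject.

    `links` is a list of (label, left, right) triples over word positions.
    """
    for label, left, right in links:
        if base_type(label) in SUBJECT_LINK_TYPES and head in (left, right):
            verb = right if head == left else left
            # A Pv link on the same verb signals a passive construction.
            passive = any(lbl.startswith("Pv") and verb in (l, r)
                          for lbl, l, r in links)
            return not passive
    return False


# "Jimi(0) broke(1) it(2)"
active = [("Ss", 0, 1), ("Ox", 1, 2)]
# "The(0) guitar(1) was(2) broken(3)"
passive = [("Ds", 0, 1), ("Ss", 1, 2), ("Pv", 2, 3)]

print(is_subject(0, active))   # True
print(is_subject(2, active))   # False
print(is_subject(1, passive))  # False
```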

3.2.2 Valid References

The most important task performed by the noun categorization module is determining which pronouns can refer to the noun phrases.

The main resources used for this task are the Named Entity module, Wordnet [10], and a list of male and female common nouns. The nouns in Wordnet were split into three lists according to whether they always, sometimes, or never indicate a person. This was done by checking whether the word was a hyponym of the synset "person, individual, someone, somebody..." in all, some, or none of its senses. (A synset is a set of synonyms; it is the basic element in the Wordnet hierarchy of meanings.) Many of the words that were in the sometimes-a-person list were then moved to either the never-a-person or always-a-person lists if they were assigned to that list because of senses which are very infrequent uses of the word. A similar procedure was used to generate a list of collective nouns like "team" by looking for hyponyms of "group, grouping".
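The three-way split can be sketched with a toy sense inventory. The hypernym chains below are invented for illustration and are not Wordnet data; the function name `personhood` is likewise hypothetical.

```python
PERSON = "person"

# Toy sense inventory: each word maps to one hypernym chain per sense.
# These chains are made up for the example; they are not taken from Wordnet.
SENSES = {
    "singer": [["musician", "performer", "person"]],
    "guitar": [["stringed instrument", "instrument", "artifact"]],
    "pilot": [["aviator", "person"], ["program", "broadcast"]],
}


def personhood(word, senses=SENSES):
    """Classify a noun as 'always', 'sometimes' or 'never' a person."""
    flags = [PERSON in chain for chain in senses[word]]
    if all(flags):
        return "always"
    if any(flags):
        return "sometimes"
    return "never"


for noun in ("singer", "guitar", "pilot"):
    print(noun, personhood(noun))
```

The manual step described above, moving words out of the sometimes-a-person list when the deciding sense is very rare, would be applied to the output of such a classification.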

Proper names are handled by the Named Entity module. For the rest of the noun phrases, the module looks at the head of the noun phrase and checks for number and gender by using the aforementioned lists, Wordnet's list of irregular plurals and some pluralization rules.

3.3 Named Entity Recognition

A very simple Named Entity recognition module was built to assist in the identification of noun phrases and the determination of valid references. The module knows about male and female first names, countries, and US states. It also uses heuristics to recognize unknown names.

For example, it identifies as company names those sequences of capitalized words ending in an element of a set containing "Company", "Co", "Inc" and other words that indicate the entity is a company. Once it has seen the full name ending in one of these words, it will recognize subsequences of the words in the name as coreferent with the full name (e.g. after seeing "Lockheed Martin Corp." it will identify "Lockheed", a word it doesn't know, as a company). A similar treatment is given to names of people, which are identified if they contain known first names or personal titles like "Dr.", "Ms." or "Capt.".
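The company-name heuristic can be sketched as below. The class `CompanyRecognizer` and its methods are hypothetical names, and the suffix set is an assumption beyond the three words named in the text.

```python
# "Company", "Co" and "Inc" come from the text; the other entries are assumed.
COMPANY_SUFFIXES = {"Company", "Co", "Co.", "Inc", "Inc.", "Corp", "Corp."}


class CompanyRecognizer:
    def __init__(self):
        self.known = []  # full company names seen so far, as token lists

    def observe(self, tokens):
        """Record a capitalized token sequence that ends in a company suffix."""
        capitalized = all(t[0].isupper() for t in tokens)
        if tokens and capitalized and tokens[-1] in COMPANY_SUFFIXES:
            self.known.append(list(tokens))
            return True
        return False

    def lookup(self, tokens):
        """Return known full names containing `tokens` as a contiguous run."""
        n = len(tokens)
        return [" ".join(name) for name in self.known
                if any(name[i:i + n] == list(tokens)
                       for i in range(len(name) - n + 1))]


recognizer = CompanyRecognizer()
recognizer.observe(["Lockheed", "Martin", "Corp."])
print(recognizer.lookup(["Lockheed"]))  # ['Lockheed Martin Corp.']
print(recognizer.lookup(["Boeing"]))    # []
```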

When a capitalized sequence of words is a subsequence of more than one previously identified named entity, the module will not mark it as coreferent with any of them, and will set its valid references to be the union of the valid references for the matching named entities. For example, after seeing "Janis Joplin", "Joplin" would be resolved to "Janis Joplin" and identified as a female name. If after that "Peter Joplin" is mentioned in the same article, future mentions of "Joplin" will be left unresolved, but they will be identified as persons. If the module had not seen any of the full names, "Joplin" would not be marked as a person, as it could be referring to a company or a place.
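The treatment of ambiguous partial names can be sketched as follows. The function names and the representation of entities as `(tokens, valid_refs)` pairs are assumptions for the example.

```python
def contains(name, tokens):
    """True if `tokens` occurs as a contiguous run inside `name`."""
    n = len(tokens)
    return any(name[i:i + n] == tokens for i in range(len(name) - n + 1))


def resolve_partial_name(tokens, entities):
    """Match a capitalized sequence against the named entities seen so far.

    `entities` is a list of (full_name_tokens, valid_refs) pairs.
    Returns (index of the coreferent entity or None, valid reference set).
    """
    matches = [(i, refs) for i, (name, refs) in enumerate(entities)
               if contains(name, tokens)]
    if not matches:
        # Unknown name: it could be a person, a company or a place.
        return None, {"he", "she", "it"}
    if len(matches) == 1:
        return matches[0]
    # Several matches: leave unresolved, keep the union of valid references.
    return None, set().union(*(refs for _, refs in matches))


seen = [(["Janis", "Joplin"], {"she"})]
print(resolve_partial_name(["Joplin"], seen))  # (0, {'she'})

seen.append((["Peter", "Joplin"], {"he"}))
index, refs = resolve_partial_name(["Joplin"], seen)
print(index, sorted(refs))                     # None ['he', 'she']
```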

Since articles tend to use full names when a company or person is first mentioned, this strategy gives good performance without requiring a large list of company names or last names. However, a good list of companies would certainly help, especially in the case of household names, since these often show up without a "Co.", "Inc." or any other indication that it is a company.

The scope of person names is limited to the document where they are found, i.e., the module forgets the names of people when the system starts working on a new document. Company names, on the other hand, are not forgotten; after seeing "Lockheed Martin Corp." in one article, "Lockheed" will be identified as a company in all subsequent articles.

3.4 Coreference Module

Once noun phrases have been identified, it is the turn of the coreference module to find what they refer to. Noun phrases are resolved left to right, filling the reference column of the Noun Phrase Table. Possible values for this column are null, unresolved or a reference to a noun phrase in the table.

When two noun phrases are identified as coreferent, each gets its set of valid references reduced to the intersection of the two sets. For example in "Kublai Khan, first Emperor of the Yuan Dynasty", the noun phrases before and after the comma will be marked coreferent. Initially the system doesn't know that "Kublai Khan" refers to a man, or even a person, so it will allow reference by "he", "she" and "it". But when coreference is found with "first Emperor of the Yuan Dynasty", "Kublai Khan" will get its set of valid references reduced to only "he", since BRANQA knows that "Emperor" refers to a male person.
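The narrowing of valid references is a set intersection. A minimal sketch, assuming noun phrases are represented as dictionaries with hypothetical keys `id`, `valid_refs` and `ref`:

```python
def mark_coreferent(antecedent, phrase):
    """Link `phrase` to `antecedent` and narrow both valid reference sets."""
    shared = antecedent["valid_refs"] & phrase["valid_refs"]
    antecedent["valid_refs"] = set(shared)
    phrase["valid_refs"] = set(shared)
    phrase["ref"] = antecedent["id"]


khan = {"id": 1, "text": "Kublai Khan",
        "valid_refs": {"he", "she", "it"}, "ref": None}
emperor = {"id": 2, "text": "first Emperor of the Yuan Dynasty",
           "valid_refs": {"he"}, "ref": None}

mark_coreferent(khan, emperor)
print(khan["valid_refs"])  # {'he'}
```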

The coreference module comprises three components: a pleonastic pronoun detector, a syntactic filter, and the resolution procedure. The pleonastic pronoun detector identifies non-referential instances of it and the syntactic filter uses binding constraints to eliminate syntactically disallowed candidates. The resolution procedure is the component that determines the coreference relations. The following subsections describe these components in more detail.

3.4.1 Pleonastic pronoun detector

It is important to detect pleonastic instances of it (like the one starting this sentence), in order to avoid assigning referents to pronouns that are non-referential. The Link Parser detects some of these instances and uses the link type SF to signal them. However, there are several cases which the Link Parser does not notice. The pleonastic pronoun detector supplements the parser with a set of rules for detection. These are based on the rules in [19] presented in Chapter 2, together with some rules added to handle uncovered cases (e.g., "It's been a long time ...").

3.4.2 Syntactic filter

The syntactic filter is used to rule out reference to noun phrases on the basis of intrasentential binding constraints. Chapter 2 mentioned Government and Binding Theory, and the fact that it can be used to constrain the search for antecedents. However, it is not easy to obtain from a link grammar the syntactic structure that we need to make direct use of binding constraints. In theory, we should be able to construct the necessary categories from the link grammar representation (and the Link Parser already helps by providing some constituent structure), but in practice the mapping from a non-categorial grammar to one based on phrase-structure is not so easy.

A better match to our system is the set of binding constraints for English Slot Grammar [23] presented in [19, 20] (explained here in Chapter 2). Slot Grammar belongs to the set of dependency grammars, and these are very similar to link grammars. Quoting Sleator [34]: "In a dependency grammar, a grammatical sentence is endowed with dependency structure, which is very similar to a linkage. This structure, as defined by Mel'čuk [24], consists of a set of planar directed arcs among the words that form a tree. Each word (except the root word) has an arc to exactly one other word, and no arc may pass over the root word. In a linkage (as opposed to a dependency structure) the links are labeled, undirected, and may form cycles, and there is no notion of a root word."

While we cannot directly apply the algorithms in [19, 20] to the linkages obtained from the Link Parser, it is possible to extract some of the information that would be present in English Slot Grammar by adding direction to the links. The current implementation does this for a few link types covering many important cases. Future work includes improving the syntactic filter to detect more cases of syntactically invalid reference.

3.4.3 Resolution procedure

The resolution procedure is the core of the coreference module. It uses the Named Entity module, the Noun Phrase Table and the other two components of the coreference module to make decisions regarding coreference relations.

It is not directly dependent on the representation of the parse trees, and this modularity allows experiments with the resolution strategies and the linguistic resources to be carried out independently of each other.

The current resolution procedure concentrates on pronominal anaphora, but it also resolves some simple cases of coreference between noun phrases, namely coreference between named entities, coreference with appositional phrases, and coreference with modifiers of a named entity.

Named Entities

The Named Entity Module is used to mark coreference between named entities. When it identifies a noun phrase as matching one of the previously seen named entities, the coreference module marks the two expressions as coreferent in the Noun Phrase Table.


Appositional Phrases

Appositional phrases are typically used to provide an alternative description or name for an entity. The module recognizes appositions by checking for noun phrases of the form: (NP <token>+ , (NP <token>+) ,). For example:

(NP Luca Prodan, (NP the great singer, ...))

(NP the great singer, (NP Luca Prodan, ...))

Here the appositional phrases are marked coreferent with the first noun phrase. A common use of appositions which does not indicate coreference is in the names of places (e.g., "Cambridge, Massachusetts"). The system checks for this possibility using a list of countries and U.S. states, which manages to cover the most common cases in U.S. newspaper articles. In the near future I plan to improve coverage by using a larger list of names, including well known U.S. and foreign cities.
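The place-name exception can be sketched as a simple lookup. The function name and the small `PLACES` set are stand-ins for illustration; the actual list of countries and U.S. states used by the system is not reproduced here.

```python
# Stand-in for the list of countries and U.S. states; entries are examples.
PLACES = {"Massachusetts", "Texas", "Argentina"}


def apposition_coreferent(first_np, second_np, places=PLACES):
    """Decide whether '<first_np>, <second_np>,' marks coreference.

    The second phrase is treated as coreferent with the first unless it
    names a known country or state, as in "Cambridge, Massachusetts".
    """
    return second_np not in places


print(apposition_coreferent("Luca Prodan", "the great singer"))  # True
print(apposition_coreferent("Cambridge", "Massachusetts"))       # False
```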

Modifiers of Named Entities

The case of modifiers of named entities is similar to that of appositions. For phrases of the form: (NP (NP <token>+) <named-entity>), the embedded modifier is marked as coreferent with the whole phrase. An example of this kind of construction is:

(NP (NP famous singer) Luca Prodan) ...

Pronominal Anaphors

We finally get to the reason behind all previously described components, which is resolving pronominal anaphors. The other three cases of coreference marked are there to allow the resolution procedure to work correctly when resolving pronouns. The resolution strategy used belongs to the traditional approach to anaphora resolution, i.e., discounting unlikely candidates and then making use of heuristics to pick a referent from the remaining set of plausible candidates.

The system eliminates from consideration all noun phrases that do not pass the morphological and syntactic filters. The morphological filter eliminates all phrases that do not agree in gender and number with the pronoun. This is done by checking the valid reference columns in the Noun Phrase Table (e.g., themselves cannot refer to a noun phrase whose value in the they column is false).

The syntactic filter further removes from consideration all those noun phrases that are ruled out by binding constraints. Here the system also makes use of coreference links marked between noun phrases to apply the constraints. For example:

Peter_i made fun of John Smith. John beat him_i up.

In the second sentence, binding constraints forbid him from referring to John. Since John will be marked coreferent with John Smith, this also eliminates John Smith from consideration, leaving a single possible antecedent, Peter.

The system then uses heuristics to pick an antecedent from the remaining noun phrases. The heuristics used are taken from the CogNIAC system, described in Section 2.3.2. The core rules of the system are used, together with a rule to search back for a unique antecedent when no possible antecedents are found (Search Back) and a rule that looks for a unique antecedent in the subject of the current sentence (Unique Current Subj). Rule 4, Possessive Pro, was excluded since it was eliminated when preparing CogNIAC for MUC-6. The rules used by BRANQA are listed in order of evaluation in Figure 3-5.

1. Unique in Discourse
2. Reflexive
3. Unique in Current + Prior
4. Unique in Current
5. Search Back (until > 1 candidates)
6. Unique Current Subject / Subject Pron

Figure 3-5: Resolution rules in BRANQA

In evaluating the rules, when checking for a "single possible antecedent" we count possible entities, not possible expressions; that is, if there is more than one possible antecedent but all possible antecedents refer to the same entity, the rule is allowed to trigger. This is one of the reasons why resolving coreference between non-pronominal noun phrases helps with pronoun resolution.

Before trying to apply any of the rules, the pleonastic pronoun detector is used to check if this is an instance of non-referential it, in which case it is assigned null reference. The rules are then evaluated in order, and if none of them trigger, the pronoun is marked unresolved in the Noun Phrase Table.
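The overall control flow can be sketched as follows. This is a simplified illustration, not BRANQA's code: each rule is reduced to a window of candidate noun phrases, which matches rules such as Unique in Discourse or Unique in Current + Prior, while the reflexive, search-back and subject rules and the syntactic filter are omitted. All function names and the dictionary layout are assumptions.

```python
def entity_of(np_id, table):
    """Follow reference links back to the first mention of an entity.

    Assumes references only point to earlier noun phrases (or to themselves).
    """
    while table[np_id]["ref"] not in (None, np_id):
        np_id = table[np_id]["ref"]
    return np_id


def morphological_filter(pronoun_column, candidates, table):
    """Keep the candidates whose valid references include the pronoun."""
    return [c for c in candidates if pronoun_column in table[c]["valid_refs"]]


def unique_entity(candidates, table):
    """Return the most recent candidate if all candidates denote one entity."""
    entities = {entity_of(c, table) for c in candidates}
    return candidates[-1] if len(entities) == 1 else None


def resolve(pronoun_column, windows, table, is_pleonastic=False):
    """Try each (rule_name, candidate ids) window in order of evaluation."""
    if is_pleonastic:
        return "null", None
    for rule_name, candidates in windows:
        allowed = morphological_filter(pronoun_column, candidates, table)
        choice = unique_entity(allowed, table) if allowed else None
        if choice is not None:
            return rule_name, choice
    return "unresolved", None


table = {
    1: {"valid_refs": {"he"}, "ref": 1},   # "John Smith"
    2: {"valid_refs": {"he"}, "ref": 1},   # "John", coreferent with 1
    3: {"valid_refs": {"it"}, "ref": 3},   # "a guitar"
}
windows = [("Unique in Current + Prior", [1, 2, 3])]
print(resolve("he", windows, table))  # ('Unique in Current + Prior', 2)
```

Two expressions survive the filter in the example, but they denote a single entity, so the rule is allowed to trigger.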


Chapter 4 Evaluation

In this chapter I evaluate the performance of BRANQA on a test corpus. I explain the experimental procedure and present the results.

4.1 MUC-7 Coreference Task Corpus

Evaluation was performed on the newspaper articles used in the MUC-7 Coreference Task [15]. These articles have been annotated for coreference using SGML tags, allowing one to procedurally check the correctness of BRANQA's decisions. Coreference relations are tagged between markables: nouns, noun phrases and pronouns. Pronouns include both personal and demonstrative pronouns, and with respect to personal pronouns, all grammatical cases, including the possessive. Dates ("January 23"), currency expressions ("$1.2 billion"), and percentages ("17%") are considered noun phrases.

Coreference relations are marked only between pairs of elements both of which are markables. This means that in those cases where the antecedent is a clause rather than a markable the relation will not be annotated.

Referring expressions and their antecedents are marked as follows:

<COREF ID="100">Lawson Mardon Group Ltd.</COREF> said <COREF ID="101" TYPE="IDENT" REF="100">it</COREF> ...


them through the REF attribute of COREF tags.
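Reading the key out of such annotations can be sketched with regular expressions. This is an illustrative sketch that handles only the attribute layout shown in the example above; the function name `coref_links` is hypothetical, and a real reader for the MUC-7 files would need to be more careful.

```python
import re

COREF_OPEN = re.compile(r'<COREF\s+([^>]*)>')
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def coref_links(sgml):
    """Return {ID: REF} for every COREF tag that carries a REF attribute."""
    links = {}
    for match in COREF_OPEN.finditer(sgml):
        attributes = dict(ATTRIBUTE.findall(match.group(1)))
        if "REF" in attributes:
            links[attributes["ID"]] = attributes["REF"]
    return links


sample = ('<COREF ID="100">Lawson Mardon Group Ltd.</COREF> said '
          '<COREF ID="101" TYPE="IDENT" REF="100">it</COREF> ...')
print(coref_links(sample))  # {'101': '100'}
```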

Two sets of articles were available for testing, one specified "dry-run" and the other one "formal", used in different stages of the MUC-7 evaluation. The "dry-run" set was used for this evaluation, saving the other set for future experiments. This set consists of thirty New York Times articles, most of them regarding airplane crashes.

4.2 Test Procedure

The calls to BRANQA's pronoun resolution procedure were instrumented so that it would send its answers through an evaluation module, which checked them against the key.

In evaluating the system, errors were not chained; that is, answers were corrected, if possible, before proceeding to resolve the next pronoun. After resolving a pronoun, the evaluation procedure recorded the answer and checked it against the key. If it was wrong, it attempted to find a noun phrase in the Noun Phrase Table that would match the one in the key. This was not always possible for two reasons: sometimes the pronoun was not marked on the key because it had no markable antecedent, and sometimes parser errors caused BRANQA not to identify the marked noun phrase. In both cases the pronoun was marked unresolved in the Noun Phrase Table before going on to the next pronoun. If the pronoun was not marked for lack of a markable antecedent, the evaluation module considered an answer of unresolved as correct.
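The scoring loop without error chaining can be sketched as follows. The function and parameter names are hypothetical, and the table is reduced to a mapping from noun phrase ids to references; a key value of `None` stands for a pronoun with no markable antecedent.

```python
def score_without_chaining(pronouns, system_answer, key, table):
    """Score answers one pronoun at a time, correcting wrong ones in place.

    system_answer(p, table) returns an NP id, "null" or "unresolved".
    """
    counts = {"correct": 0, "wrong": 0, "unresolved": 0}
    for p in pronouns:
        answer = system_answer(p, table)
        gold = key.get(p)
        table[p] = answer
        if gold is None and answer in ("null", "unresolved"):
            counts["correct"] += 1
        elif answer == "unresolved":
            counts["unresolved"] += 1
        elif answer == gold:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
            # Correct the table if the key's antecedent was identified;
            # otherwise mark the pronoun unresolved before moving on.
            table[p] = gold if gold in table else "unresolved"
    return counts


answers = {3: 1, 4: 1, 5: "unresolved"}
key = {3: 1, 4: 2}                 # pronoun 5 has no markable antecedent
np_table = {1: 1, 2: 2}
result = score_without_chaining([3, 4, 5], lambda p, t: answers[p], key, np_table)
print(result)  # {'correct': 2, 'wrong': 1, 'unresolved': 0}
```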

4.3 Results

Table 4.1 shows BRANQA's precision and recall characteristics on 336 third person pronouns in the test corpus, broken down by rule. The resolution of a pronoun to null (for the pleonastic case) was considered correct if the pronoun had no antecedent in the key (which could happen either if the pronoun was actually pleonastic, or if it had an antecedent that was not markable). I checked the eight cases marked pleonastic in the test and they were all in fact pleonastic.


Rule                   Contribution to Recall   Precision
Pleonastic             2% (8/336)               100% (8/8)
Unique in Disc.        6% (20/336)              100% (20/20)
Reflexive              1% (4/336)               80% (4/5)
Unique Cur+Prior       14% (47/336)             85% (47/55)
Unique Cur             17% (58/336)             95% (58/61)
Search Back            0% (0/336)               never used
Subject Cur            3% (10/336)              83% (10/12)
Subject Prev           6% (21/336)              84% (21/25)
Unresolved (correct)   3% (10/336)              100% (10/10)
Total                  53% (177/336)            91% (177/195)

Table 4.1: Test Results by Rule

The "Unresolved (correct)" line of the table shows the number of pronouns that were left unresolved but had no antecedent in the key, and were thus considered correct for the purpose of computing precision and recall statistics.

Table 4.2 shows the results broken down by pronoun.

The precision/recall characteristics of BRANQA are comparable to those of CogNIAC. In the first experiment on narrative texts CogNIAC achieved 92% precision for 64% recall, and in the second test, on MUC-6 documents, it yielded 73% precision for a recall of 75%. Especially in the second case, CogNIAC's recall is considerably higher than that of BRANQA, but this came at a significant cost in precision.

Of the 18 incorrect resolutions, six happened in cases where there was no antecedent marked on the key (this included pleonastic pronouns, but also cases where the antecedent was not markable, e.g., "they" referring to two people mentioned in separate sentences). Three incorrect resolutions can be attributed to misclassification of a word according to gender and number. Another three were due to parser errors, and the remaining six can be attributed to failures of the resolution rules.

Of the unresolved cases, several could have been resolved with the existing rules if not for misclassification of words, failure to eliminate candidates by the syntactic filter, and parser errors leading to faulty identification of noun phrases. A detailed case by case analysis of the 140 unresolved pronouns was not carried out.


Pronoun      Correct  Wrong  Unresolved  Precision  Recall
he           37       0      24          100%       61%
she          11       1      7           92%        58%
it           40       7      30          85%        52%
they         14       5      13          74%        44%
him          2        0      4           100%       33%
his          36       0      13          100%       73%
himself      0        1      0           0%         0%
her          11       1      4           92%        69%
hers         0        0      0           -          -
herself      0        0      0           -          -
its          12       1      23          92%        33%
itself       3        0      0           100%       100%
them         1        0      11          100%       8%
their        9        2      11          82%        41%
theirs       0        0      0           -          -
themselves   2        0      0           100%       100%
Total        178      18     140         91%        53%

Table 4.2: Test Results by Pronoun
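The percentages in Table 4.2 follow from the three counts in each row: precision is the share of attempted resolutions that were correct, and recall is the share of all occurrences that were resolved correctly. A minimal sketch of that arithmetic (the function name is hypothetical):

```python
def precision_recall(correct, wrong, unresolved):
    """Compute (precision, recall) from one row of Table 4.2.

    Returns None for a measure whose denominator is zero, which
    corresponds to the "-" entries in the table.
    """
    attempted = correct + wrong
    total = attempted + unresolved
    precision = correct / attempted if attempted else None
    recall = correct / total if total else None
    return precision, recall


p, r = precision_recall(178, 18, 140)   # the Total row
print(f"{p:.0%} {r:.0%}")               # 91% 53%

p, r = precision_recall(40, 7, 30)      # the row for "it"
print(f"{p:.0%} {r:.0%}")               # 85% 52%
```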

4.4 Effect on Question Answering

I chose to use the rules from Baldwin's system because it was developed with a focus on high precision coreference and I believe that high precision is more important than high recall when doing question answering. I think it is better not to give an answer than to give the wrong answer. On the other hand, the best way not to make mistakes is to never attempt to resolve pronouns. The purpose of a resolution tool is to raise the recall characteristics of the systems using it, and thus a balance must be struck between precision and recall.

More important than precision/recall values for resolution on a test corpus are the effects that the tool has on the precision/recall characteristics of the systems using it. Lacking a test set for Sapere, I could not evaluate BRANQA's effect on it. However, I did look at some articles from the WorldBook Encyclopedia that Sapere indexed, in order to get an idea of the potential benefits for question answering. I ran BRANQA on the articles, and then evaluated its results manually, since no coreference annotations had been made.


Taking the article on Afghanistan as an example, we find that out of 42 third-person pronoun occurrences, the system resolved 27 of them correctly, resolved 3 incorrectly, and left 12 occurrences unresolved (all 12 had a markable antecedent). An example of useful resolution for question answering is that of "he" and "his" to "Abdur Rahman" in "After he died in 1901, his policies were continued by his son, Habibullah Khan." Since this is the only article in the Encyclopedia mentioning Abdur Rahman, the resolution of "he" adds information that we did not have before. For the three incorrect resolutions, it does not seem like they would cause Sapere to return wrong answers to questions people would ask. The following lists the mistakes and the reasons behind them:

* "They" resolved to "their communities" instead of "Mullahs" in "They interpret Islamic law and educate the young" (failure to recognize "Mullahs" as plural).

* "It" resolved to "the game" in "In the game, dozens of horsemen try to grab a headless calf and carry it across a goal" (a bug in the syntactic filter eliminating "a headless calf" from the set of possible antecedents)

* "His" resolved to "The British" instead of "Abdur Rahman Khan" in "The British agreed to recognize his authority over the country's internal affairs" (misclassification of "The British" as a person's name)

More formal testing is necessary, but from what I have seen so far I am led to believe that BRANQA would improve Sapere's performance if used before indexing relations.


Chapter 5 Future Work

In this chapter I present a number of ways in which the system will be improved in the near future, and possible research projects that are suggested by this thesis.

5.1 Improvements

5.1.1 Quoted Speech

Several of the pronouns left unresolved in the evaluation could have been assigned an antecedent if better machinery had been added to handle quoted speech. Quotations are very common in newspaper articles, and it seems plausible to construct a module that accurately keeps track of who is the speaker being quoted. This can then be used to add binding constraints for the pronouns in quotations. I expect this should improve the performance of the system, at least for the domain of newspaper articles.

5.1.2 Named Entity Module

The named entity module developed for this system is very simple and leaves ample room for improvement. A better named entity tagger is currently being developed by the InfoLab Group, and I plan to integrate it into BRANQA when it becomes available.
