Coreference annotation schema for an inflectional language *
Maciej Ogrodniczuk 1, Magdalena Zawisławska 2, Katarzyna Głowińska 3, and Agata Savary 4
1 Institute of Computer Science, Polish Academy of Sciences
2 Institute of Polish Language, Warsaw University
3 Lingventa
4 François Rabelais University Tours, Laboratoire d’informatique
Abstract. Creating a coreference corpus for an inflectional and free-word-order language is a challenging task due to specific syntactic features largely ignored by existing annotation guidelines, such as the absence of definite/indefinite articles (making quasi-anaphoricity very common), frequent use of zero subjects or discrepancies between syntactic and semantic heads. This paper comments on the experience gained in preparation of such a resource for an ongoing project (CORE), aiming at creating tools for coreference resolution.
Starting with a clarification of the relation between noun groups and mentions, through definition of the annotation scope and strategies, up to actual decisions for borderline cases, we present the process of building the first, to the best of our knowledge, corpus of general coreference for Polish.
1 Introduction
Although the notion of coreference is no longer a subject of much controversy and many more or less ready-to-use annotation guidelines are available, they usually need to be supplemented with language-specific details when a “new” language is investigated, i.e. one that has not yet received any formalized coreference description. The task of creating a coreference corpus then requires establishing detailed rules concerning annotation scope, strategies and typology of coreferential constructs.
This paper comments on the experience gained in the process of creating the first substantial Polish corpus of general coreference (500K words and 160K mentions are intended), which is currently being completed. We hope our analysis can provide a valuable source of information for creators of new coreference corpora for other inflectional and free-word-order languages. We believe that they could particularly benefit from studying our assumptions based on such specific properties as the absence of definite/indefinite articles (introducing quasi-anaphoricity), frequent use of zero subjects or discrepancies between syntactic and semantic heads. These phenomena are fundamental for building computational coreference resolvers.
* The work reported here was carried out within the Computer-based methods for coreference resolution in Polish texts (CORE) project financed by the Polish National Science Centre (contract number 6505/B/T02/2011/40).
Construction of a large high-quality corpus is of great importance in the context of further tasks in the ongoing CORE project, whose central aim is the creation of an efficient coreference resolver for Polish. We wish to surpass the previous early attempts, both rule-based [1] and statistical [2], which yielded tools trained and evaluated on a very limited amount of data. We believe that a more efficient tool can boost the development of higher-level Polish NLP applications, on which coreference resolution has a crucial impact [3]. Such applications include: 1) machine translation (when translating into Polish, coreferential relations are needed to deduce the proper gender of pronouns), 2) information extraction (coreference relations help with merging partial data about the same entities, entity relationships, and events described at different discourse positions), 3) text summarization, 4) cross-document summarization, and 5) question answering.
2 Reference, Anaphora and Coreference
In order to define the scope of coreference annotation we must bring back the underlying concept of reference to discourse-world objects, leading to an important limitation: only nominal groups (NGs), including pronouns, can be referencing expressions.
Recall that coreference annotation is usually performed (and evaluated) in two steps: (i) identifying mentions (or markables), i.e. phrases denoting entities in the discourse world, (ii) clustering mentions which denote the same referent.
Consequently, the definition of a mention, and of the difference between a mention and an NG in particular, is of crucial importance to the whole process. We, unlike e.g. [4], consider this difference too controversial to be reliably decided in a general case.
For instance, multi-word expressions (MWEs) show opaque semantics, thus the NGs they include might be seen as non-referential. However, most MWEs do inherit some part of the semantics of their components, and might be coreferential in some stylistically marked cases, as in (1) 5. Defining a clear-cut frontier between non-referential and referential NGs in these cases seems very hard.
(1) Nie wahał się włożyć kij w mrowisko.
Mrowisko to, czyli cały senat uniwersytecki, pozostawało zwykle niewzruszone.
’He didn’t hesitate to put a stick into an anthill (i.e. to provoke a disturbance).
This anthill, i.e. the whole university senate, usually didn’t care.’
5 Henceforth, we will mark coreferent NGs with (possibly multiple) underlining, and non-coreferent NGs with dashed underlining.
Thus, our annotation process consists in retaining – as mentions – all NGs (whether referential or not), and establishing coreference chains among them wherever appropriate. In other words, we do not distinguish non-referential NGs from referential, but non-coreferential, NGs (e.g. singleton mentions). This decision obviously has a big influence on coreference resolution quality measures which take singleton mentions into account.
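The effect of singleton mentions on a quality measure can be illustrated with a minimal sketch. It uses B-cubed only as one well-known measure that counts singletons; the text above does not name a specific measure, and the mention labels and the function name b_cubed are hypothetical.

```python
def b_cubed(key, response):
    """Return (precision, recall) of the B-cubed measure.

    key, response: lists of sets partitioning the same mentions
    (gold clusters and system clusters, respectively).
    """
    key_of = {m: cluster for cluster in key for m in cluster}
    resp_of = {m: cluster for cluster in response for m in cluster}
    mentions = list(key_of)
    precision = sum(
        len(key_of[m] & resp_of[m]) / len(resp_of[m]) for m in mentions
    ) / len(mentions)
    recall = sum(
        len(key_of[m] & resp_of[m]) / len(key_of[m]) for m in mentions
    ) / len(mentions)
    return precision, recall


# Hypothetical gold annotation: one two-mention chain plus two singletons.
key_with = [{"a", "b"}, {"c"}, {"d"}]
# A system that links nothing: every mention is its own cluster.
resp_with = [{"a"}, {"b"}, {"c"}, {"d"}]

# The same data with singleton mentions removed from the gold standard.
key_without = [{"a", "b"}]
resp_without = [{"a"}, {"b"}]

p_with, r_with = b_cubed(key_with, resp_with)
p_without, r_without = b_cubed(key_without, resp_without)
```

In this toy case the system finds no coreference link at all, yet its recall rises from 0.5 to 0.75 once the two singletons are counted, which is why the decision to keep all NGs as mentions matters for evaluation.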
We also consider that the reference is context-dependent, not surface-form dependent, cf.
(2) Spotkałam nową dyrektorkę. Osoba ta zrobiła na mnie dobre wrażenie.
’I met the new manager. This person made a good impression on me.’
(3) Nasza nowa dyrektorka to młoda kobieta.
’Our new manager is a young woman.’
(4) Nasza dyrektorka, młoda kobieta, przyszła na spotkanie.
’Our manager, a young woman, came to the meeting.’
(5) Młoda kobieta, która przejęła funkcję dyrektora, zrobiła na mnie dobre wrażenie.
’The young woman who took over the manager’s duties made a good impression on me.’
In example (2) the NG osoba ta (’this person’) has a defined referent, i.e. a concrete human being the speaker refers to. In (3)–(4), the nominal group młoda kobieta does not carry reference, but is used predicatively: it assigns certain properties to the subject of the sentence. Our understanding of nominal coreference is therefore strictly limited to direct nominal constructs; expressions that do not denote the object directly are not included in coreference chains.
There is an additional, operational, criterion that we admit, contrary to many common coreference annotation and resolution approaches, e.g. [5]. If semantic identity relations between NGs are directly expressed by the syntax, we see no point in including them in coreferential chains. Typical cases here are predicates, as in (3), relative clauses, as in (5), and appositions, as in (4), where we see one, not two, mentions in the NG Nasza dyrektorka, młoda kobieta (’Our manager, a young woman’).
Such a definition of reference creates links between the text and the discourse world and is of a different nature than anaphora, an inter-textual reference to previously mentioned objects. Even if anaphora and coreference co-occur in most cases, this is not necessarily so. In example (6), the underlined NGs are anaphoric but not coreferential, cf. [3]. Conversely, NGs in separate texts can be coreferential, but not anaphoric.
(6) Człowiek, który dał piękne kwiaty swojej żonie, wydał mi się sympatyczniejszy niż człowiek, który odmówił kupienia ich swojej.
’The man who gave beautiful flowers to his wife seemed nicer to me than the one who refused to buy them for his (wife).’
3 Scope of Annotation
3.1 Mentions
As stated in the previous section, all NGs (both referential and non-referential) are marked as mentions, while coreference chains can only concern referential NGs (mentions). In particular, some types of nominal pronouns, which seem non-referential by nature, are marked as mentions (since they are NGs) but never included in coreference chains: (i) indefinite pronouns (ktoś ’somebody’), (ii) negative pronouns (nic ’nothing’), (iii) interrogative pronouns (kto ’who’) 6. Note also that some Polish lexemes designated traditionally as pronouns behave morphosyntactically like other parts of speech. Namely, demonstrative pronouns introducing subordinate clauses other than relative clauses (o tym, że ’of-this-that = of the fact that’) are in fact parts of correlates. The reflexive pronoun (się ’oneself’) is a particle. Finally, possessive pronouns (mój ’mine’) behave like adjectives. Consequently, these three types of pronouns are never considered as NGs, i.e. they are never marked as mentions.
Finally, coreference relations between phrases other than nominal ones (e.g. tam ’there’) are obviously never marked, since only NGs are considered as mentions.
3.2 Types of Relations
The major goal of coreference annotation is to determine the type of relation holding among discourse-world entities referred to by two or more mentions. We are essentially interested in identity relations. We also consider, experimentally, the notion of near-identity proposed by [6]. Due to the pioneering (with respect to Polish) nature of our project, all other types of relations (whether among entities or among mentions) have been explicitly ruled out, including non-identity, indirect anaphora, bound anaphora, ellipses (with the exception of zero anaphora), predicative relations, and identity of sense.
Identity. Textual techniques used in Polish to signal the identity of referred entities are manifold:
– lexical and grammatical (personal and demonstrative pronouns),
– stylistic, such as synonymy,
– lexical and grammatical anaphora and cataphora between nominal groups,
– “quasi-anaphora” – when a group with syntactic-functional properties of anaphora introduces new information, e.g.
(7) Duszą towarzystwa był zięć Kowalskich. Młody prawnik właśnie wrócił ze Stanów.
’Kowalski’s son-in-law was the life and soul of the party. The young lawyer had just returned from the US.’
6 Surprisingly enough, recent experience shows that such pronouns may be referential in stylistically marked cases such as: Ktoś ukradł łopatę. Ten sam ktoś zniszczył ogrodzenie. ’Someone stole the spade. The same someone broke the fence.’ We wish to review these cases in the final annotation stage.
– zero-anaphora, very frequent in Polish – a personal pronoun may be omitted whenever the subject’s person and gender are recognizable from the verb’s agreement; therefore the annotation denoting the missing referential NG is most naturally attached to the verb, as in example (8). 7
(8) Maria wróciła już z Francji. ØSpędziła tam miesiąc.
’Maria came back from France. ØHad[singular:feminine] spent a month there.’
Note that some approaches introduce a typology of coreference links which takes the above techniques into account. We, conversely, think that these types of linguistic data should be documented either at other annotation levels or in external linguistic resources. One – formal and practical – reason is that we see coreference chains as clusters, i.e. results of splitting the set of all mentions via a (unique and uniform) equivalence relation. If subtypes of this relation were to be used, clustering would no longer be possible and each pair of coreferent mentions would have to be marked explicitly. Such a methodology might not only have a prohibitive cost in some types of texts but would also be hard to evaluate by classical quality measures.
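The view of coreference chains as equivalence classes can be sketched in a few lines. The example below assumes that mentions carry hypothetical string identifiers and that identity links are given as pairs; it uses a standard union-find structure and is not taken from the CORE project tools.

```python
from collections import defaultdict


def cluster_mentions(mentions, identity_links):
    """Partition mentions into coreference clusters (equivalence classes).

    mentions: iterable of hashable mention IDs (hypothetical identifiers).
    identity_links: iterable of (a, b) pairs asserted to be coreferent.
    """
    parent = {m: m for m in mentions}

    def find(m):
        # Walk up to the root, halving the path on the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in identity_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for m in parent:
        clusters[find(m)].add(m)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Example (8): "Maria" and the zero subject attached to "Spędziła" corefer;
# "Francji" remains a singleton mention.
mentions = ["Maria", "Francji", "Ø(Spędziła)"]
links = [("Maria", "Ø(Spędziła)")]
clusters = cluster_mentions(mentions, links)
```

Because a single untyped relation is used, any set of pairwise links collapses into clusters by transitivity. With typed links this step would no longer be valid, which is the formal argument made above.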
Near-identity. [6] define the notion of near-identity, taking place in two contexts called refocusing and neutralization. Our understanding of these phenomena involves the following:
– Refocusing – two mentions refer to the same entity but the text suggests the opposite. This stylistic technique is often used to account for a temporal or spatial change of an object, as in (9) 8:
(9) Warszawa przedwojenna i ta z początku XXI wieku
’Pre-war Warsaw and the one at the beginning of the 21st century’
– Neutralization – two mentions refer to different entities but the text suggests the opposite. This situation is typical for metonymy, as in example (10), where a container and its contents are merged, and unlike (11), which is a case of a classical identity:
(10) Wziął wino z lodówki i wypił je.
’He took the wine from the fridge and drank it.’
(11) Wziął wino z lodówki i włożył je do torby.
’He took the wine from the fridge and put it into the bag.’
[7] put forward a detailed typology of near-identity relations. However, in the experimental annotation stage of our project, the annotators marked very few examples of near-identity, most of them concerning, in fact, more typical semantic relations, like homonymy, meronymy, metonymy, element of a set or — sometimes — hypernymy, e.g.:
(12) Cała Warszawa była właściwie jednym wielkim cmentarzem. Ginęli ludzie, mnóstwo ludzi! Na podwórku, już tak po 15 sierpnia, praktycznie codziennie był pogrzeb przed kapliczką. Warszawa była bardzo pobożna...
’The whole Warsaw was in fact one big graveyard. People were dying, plenty of people! After the 15th of August there were funerals in the courtyard, in front of the chapel, almost every day. Warsaw was very pious...’ 9
7 Elliptical constructions concerning functions other than the subject, as in Czytałeś książki Lema? Czytałem Ø. ’Did you read Lem’s books? I read Ø.’ are not annotated in our model.
8 Henceforth, near-identity-related mentions will be marked by a wavy underline.
That experience made us think that near-identity is either too infrequent to deserve a rich typology, or too hard to capture and classify reliably by annotators.
That is why we mark near-identity links in our corpus, but we assign no type labels to them. Once the annotation has been completed, we plan to compare our examples of near-identity more thoroughly with the types proposed in [7].
3.3 Dominant Expressions
Despite the fact that all mentions within a cluster are (mathematically speaking) equivalent, we enrich each cluster with a pointer towards the dominant expression, i.e. the one that carries the richest semantics. For instance in the following chain the last element is dominant: stworzenie ’creature’ → zwierzę ’animal’ → pies ’dog’ → jamnik ’dachshund’.
In many cases, pointing at the dominant expression helps the annotators sort out a large set of pronouns denoting various persons (e.g. in fragments of plays or novels). We think that it might also facilitate linking mentions within different texts, and creating a semantic frame containing different descriptions of the same object.
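One possible way to store such a pointer is sketched below. The class name Cluster and its fields are hypothetical and do not reflect the actual annotation format; the dominant expression is assumed to be chosen by the annotator and is not computed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cluster:
    """A coreference cluster with an optional pointer to its dominant expression."""

    mentions: List[str] = field(default_factory=list)
    dominant_index: Optional[int] = None  # set manually by an annotator

    def dominant(self) -> Optional[str]:
        # Return the dominant expression, or None if it has not been chosen yet.
        if self.dominant_index is None:
            return None
        return self.mentions[self.dominant_index]


# The chain from the text: the last, most specific element is dominant.
chain = Cluster(mentions=["stworzenie", "zwierzę", "pies", "jamnik"])
chain.dominant_index = chain.mentions.index("jamnik")
```

Keeping the pointer separate from the mention list leaves the cluster itself an unordered equivalence class, in line with the formal view of clusters taken in the previous section.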
4 Annotation Strategies
4.1 Mention Boundaries
In order to encompass the wide range of mentions, we set the boundaries of nominal groups as broadly as possible. Therefore, an extended set of elements is allowed within NG contents, i.e., 1) adjectives as well as adjectival participles in agreement (with respect to case, gender and number) with the superior noun, 2) a subordinate noun in the genitive case, 3) nouns in case and number agreement with superior nouns (i.e. nouns in apposition); but also 4) a prepositional-nominal phrase that is a subordinate element of a noun (e.g. koncert na skrzypce i fortepian ‘a concerto for violin and piano’) 10; 5) a relative clause (e.g., dziewczyna, o której rozmawiamy ‘the girl that we talk about’). Moreover, the following phrases are treated as nominal groups: 1) numeral groups (e.g., trzy rowery ‘three bicycles’), 2) adjectival phrases with elided nouns (e.g., Zrób bukiet z tych czerwonych kwiatów i z tych niebieskich. ‘Make a bouquet of these red flowers and these blue ones.’), 3) date/time expressions of various syntactic structures, 4) coordinated nominal phrases, including conjoining commas (krzesło, stół i fotel ‘a chair, a table, and an armchair’).
9 The whole Warsaw refers to the place, while Warsaw is a metonymy and refers to people who lived in the city.
10 Such cases should be distinguished from situations where a prepositional-nominal phrase is a subordinate element of a verb, e.g. Kupił mieszkanie z garażem. ‘He bought a flat with a garage.’
For each phrase, the semantic head is selected, being the most relevant word of the group in terms of meaning. The semantic head of a nominal group is usually the same element as the syntactic head, but there are some exceptions, e.g., in numeral groups, the numeral is the syntactic head, and the noun is the semantic head.
4.2 Mention Structure
The deep structure of noun phrases, i.e. all embedded phrases not containing finite verb forms having semantic heads other than those of the superior phrase (which reference different entities), is subject to annotation, therefore the fragment dyrektor departamentu firmy ‘manager of a company department’ contains 3 nominal phrases, referencing dyrektora departamentu firmy (‘manager of a company department’), departamentu firmy (‘a company department’) and firmy (‘the company’) alone.
This assumption is also valid for coordination: we annotate both the individual constituents and the resulting compound, because both can be referred to:
(13) Asia i Basia mnie lubią. One są naprawdę ładne, szczególnie Aśka.
‘Asia and Basia like me. They are really pretty, particularly Aśka.’
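The nesting rules for genitive chains and coordination can be illustrated with a small sketch. Both functions are hypothetical helpers written for this illustration: the first assumes a purely right-branching chain of nouns, the second a flat coordination joined by commas and a single conjunction.

```python
def nested_genitive_mentions(tokens):
    """List all NGs embedded in a right-branching chain of genitive nouns.

    Simplifying assumption: every token is a noun that governs the rest of
    the chain, so each suffix of the token list is a nominal group whose
    semantic head is its first token.
    """
    return [(" ".join(tokens[i:]), tokens[i]) for i in range(len(tokens))]


def coordination_mentions(conjuncts, conjunction="i"):
    """Return the coordinated compound followed by its individual conjuncts."""
    if len(conjuncts) < 2:
        raise ValueError("a coordination needs at least two conjuncts")
    compound = ", ".join(conjuncts[:-1]) + f" {conjunction} {conjuncts[-1]}"
    return [compound] + list(conjuncts)


# Each entry pairs a nested mention with its semantic head.
genitive = nested_genitive_mentions(["dyrektor", "departamentu", "firmy"])
# The compound and both constituents are separate mentions, as in example (13).
coordination = coordination_mentions(["Asia", "Basia"])
```

Real nominal groups also contain adjectives, prepositional phrases and relative clauses, so an actual mention detector would work on parser output and not on bare token lists.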
Discontinuous phrases and compounds are also marked:
(14) To był delikatny, że tak powiem, temat. ‘It was a touchy, so to speak, subject.’
5 Task Organization
Texts for annotation were randomly selected from the National Corpus of Polish [8]. Similarly to this resource, we aimed at creating a 500-thousand word balanced subcorpus. It was divided into over 1700 samples between 250 and 350 segments each. These samples were automatically pre-processed with a shallow parser detecting nominal groups and their semantic heads 11, and a baseline coreference resolution tool marking potential mentions and identity clusters.
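As a rough consistency check, the figures quoted above can be compared with each other. The sketch assumes exactly 1700 samples (the text says "over 1700") and treats words and segments as interchangeable units, which is only an approximation.

```python
TARGET_SIZE = 500_000          # intended corpus size, from the text
N_SAMPLES = 1700               # assumption: "over 1700" taken as 1700
MIN_LEN, MAX_LEN = 250, 350    # sample length bounds in segments, from the text

# Average sample length implied by the target size.
mean_len = TARGET_SIZE / N_SAMPLES

# Corpus size range implied by the sample length bounds.
lower_bound = N_SAMPLES * MIN_LEN
upper_bound = N_SAMPLES * MAX_LEN
```

The implied average of about 294 units per sample lies inside the stated 250 to 350 range, and the target of 500,000 falls between the implied bounds of 425,000 and 595,000.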
The manual revision of this automatically performed pre-annotation is being carried out in the MMAX2 tool [12] adapted to our needs. In particular, the
11