Enriching a Lexical Semantic Net with Selectional Preferences by Means of Statistical Corpus Analysis

Andreas Wagner ¹

Abstract. Broad-coverage ontologies which represent lexical semantic knowledge are being built for more and more natural languages. Such resources provide very useful information for word sense disambiguation, which is crucial for a variety of NLP tasks (e.g. semantic annotation of corpora, information retrieval, or semantic inferencing). Since the manual encoding of such ontologies is very labour-intensive, the development of (semi-)automatic methods for acquiring lexical semantic information is an important task. This paper addresses the automatic acquisition of selectional preferences of verbs by means of statistical corpus analysis. Knowledge about such preferences is essential for inducing thematic relations, which link verbal concepts to nominal concepts that are selectionally preferred as their complements. Several approaches for learning selectional preferences from corpora have been proposed in recent years.

However, their usefulness for ontology building is limited. This paper introduces a modification of one of these methods (i.e. the approach of Li & Abe [1]) and evaluates it by employing a gold standard. The results show that the modified approach is much more appropriate for the given task.

1 INTRODUCTION

Recently, broad-coverage, general-purpose lexical semantic ontologies have become available and/or are being developed for a variety of natural languages, e.g. WordNet [4] and EuroWordNet [12]. WordNet was developed at Princeton University for English and is widely used in the NLP community. EuroWordNet is a multilingual lexical semantic database which comprises WordNet-like ontologies for eight European languages. These wordnets are connected to an inter-lingual index so that a node in one language-specific wordnet can be mapped to the corresponding node in another language-specific wordnet. These resources capture the semantic properties of the most common words in a language. In particular, they encode the different senses of words (represented by the concept nodes of the ontology) and the basic semantic relations between word senses like hyperonymy, antonymy, etc. (represented by the edges of the ontology).² Such resources contain useful information for word sense disambiguation, which is a prerequisite for several NLP tasks like semantic annotation of corpora, text analysis, information retrieval, or semantic inferencing. Thus, the resources provide necessary information for various kinds of NLP tools. Their intention is to capture general, domain-independent knowledge which complements domain-specific knowledge needed for a particular NLP system.

¹Seminar für Sprachwissenschaft, University of Tübingen, Wilhelmstr. 113, D-72074 Tübingen, Germany, email: wagner@sfs.nphil.uni-tuebingen.de

²The synonymy relation is encoded within the nodes, since a concept is represented by a set of synonyms.

The most important semantic relation in the above-mentioned ontologies is hyponymy/hyperonymy. This relation constitutes a hierarchical structuring of the different semantic concepts. In WordNet, for example, the concept <life form>³ is a hyperonym of concepts like <animal>, <human>, and <plant>. Other semantic relations are in general not encoded exhaustively (or even not at all). However, they also provide useful information for NLP tasks. One group of such relations in EuroWordNet are thematic role relations. These relations connect verbal concepts with nominal concepts which typically occur as their complements. For example, the verbal concept <eat> should have AGENT pointers to the nominal concepts <human> and <animal>, and a PATIENT pointer to the concept <food>. Thematic relations provide information about the selectional preferences which verbs impose on their complements. This kind of information is useful for lexical and syntactic disambiguation (cf. [8], [1]).

As manual encoding of ontologies is very labour-intensive, (semi-)automatic methods have been explored, particularly the extraction of information from other existing lexical resources. However, such resources are often incomplete or not available at all. For example, thematic relations are encoded in several language-specific wordnets in EuroWordNet, but only for minor portions of the verb concepts, so that a mapping of these relations to another language does not yield exhaustive coverage.

If appropriate lexical resources are missing, other means of automatically acquiring lexical information have to be considered. One possibility is the statistical analysis of corpora. This paper addresses the usefulness of employing statistical methods for learning thematic relations. In particular, I will investigate the acquisition of selectional preferences that verbs impose on their complements. Knowledge about selectional preferences is a prerequisite for encoding thematic relations.

Several approaches for the statistical acquisition of selectional preferences (represented as WordNet noun classes) have been proposed ([9], [10], [1]). As these approaches investigate corpora, i.e. huge collections of sentences, they reveal preferences for syntactic arguments. As thematic roles can have different syntactic realizations, the preferences for syntactic complements have to be mapped to the corresponding roles. This can be done manually or (semi-)automatically; cf. [6] and [7] for some basic approaches to this problem.

If selectional preferences are gathered to supplement an ontology, it is desirable (if not necessary) to find a representation for them which is both empirically adequate (i.e. captures all and only the preferred concepts) and as compact as possible. For example, it is not desirable to introduce PATIENT relations from <eat> to all the food concepts in the wordnet (<meat>, <strawberry cake>, ...) because this would be highly redundant and would not express any generalization. One would rather want to find a class which subsumes all the preferred classes (such as <food>). Of course, a class which is so general that it also subsumes dispreferred classes is unacceptable as well (e.g. <entity>). Thus, the problem is to find the appropriate level of generalization. The "compactness desideratum" (find classes which are not too specific) is particularly important for our task, the extension of a semantic net. It is motivated from a practical point of view (storage economy) as well as by conceptual considerations (appropriate generalizations should be expressed; this is important for applications like semantic inferencing).

³In this paper, I will enclose concepts in pointed brackets and use capital letters for relation identifiers.

This paper is organized as follows: Section 2 examines the suitability of the above-mentioned statistical methods for finding the appropriate generalization level. I will describe in detail the Li & Abe approach [1], which is explicitly intended for that task. I will report an experiment which reveals an inherent problem of this approach with respect to generalization. In Section 3, I will introduce a modification of the method to overcome this problem, whose effect is shown by an analogous experiment. Section 4 describes a more systematic evaluation of the alternative approaches against a gold standard which I extracted from the EuroWordNet database. Section 5 gives a conclusion and sketches future work.

2 ACQUIRING SELECTIONAL PREFERENCES FROM CORPORA

Among the approaches for the acquisition of selectional preferences mentioned above, only the work of Li & Abe systematically addresses the problem of appropriate generalization. Resnik [9] does not determine a set of classes that represents selectional preferences at all. Ribas [10] determines such a set by a simple greedy algorithm; the impact of this algorithm on the generalization level of the selected classes is undetermined. Li & Abe obtain a set of classes that form a partition of the corpus instances. They employ a theoretically well-founded principle (Minimum Description Length) to find the appropriate generalization level.

In this section, I describe this method and the experiment I carried out to test its behaviour.

2.1 Information theoretic foundations

Information theory deals with coding information as efficiently as possible. In the framework of this discipline, information is usually coded in bits. If one has to code a sequence of signs (in our case, nouns which occur as the complement of a certain verb in a corpus), the simplest way to do this would be to represent each sign by a bit sequence of uniform length. However, if the probabilities of the individual signs differ significantly, it is more efficient (with respect to data compression) to assign shorter bit sequences to more probable (and thus more frequent) signs and longer bit sequences to less probable (and less frequent) signs. It can be shown that one can achieve the shortest average code length by assigning ⌈log_2 (1/p(x))⌉ bits to a sign x with probability p(x) (cf. [3]). Thus, if one has a good estimation of the probability distribution which underlies the occurrence of the signs, one can develop an efficient coding scheme (a mapping between signs and bit sequences) based on this estimation.
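To make this concrete, here is a minimal sketch in Python (not from the paper; the sign inventory and probabilities are invented for illustration) that computes the per-sign code lengths ⌈log_2(1/p(x))⌉ and the resulting average code length:

    import math

    # Toy distribution over four signs (illustrative values only).
    p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    # Optimal code length per sign: ceil(log2(1/p(x))) bits.
    code_len = {x: math.ceil(math.log2(1.0 / px)) for x, px in p.items()}

    # Expected code length: 0.5*1 + 0.25*2 + 2*0.125*3 = 1.75 bits/sign,
    # versus 2 bits/sign for a fixed-length code over four signs.
    avg = sum(p[x] * code_len[x] for x in p)
    print(code_len, avg)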

2.2 The basic method

The approach in [1] is based on the Minimum Description Length (MDL) principle introduced by J. Rissanen (cf. [11]). This principle rests on the assumption that learning can be seen as data compression. The better one knows which general principles underlie a given data sample, the better one can make use of them to encode this sample efficiently. If one wants to encode a sample, one has to encode (a) the probability model that determines a coding scheme, and (b) the data themselves (by employing that coding scheme). The MDL principle states that the best probability model is the one which achieves the highest data compression, i.e. which minimizes the sum of the lengths of (a) and (b). (The length of (a) is called model description length, the length of (b) data description length.) In our case, a sample consists of the noun tokens that appear at a certain syntactic argument slot (e.g. the direct object of a certain verb in the examined corpus).

Li & Abe represent the selectional behaviour of a verb (with respect to a certain argument) as a so-called tree cut model. Such a model provides a horizontal cut through the noun hierarchy tree, so that the classes that are located along the cut form a partition of the noun senses covered by the hierarchy. Each class is assigned a preference value. The preference value for a class in the cut is inherited by its subclasses. A tree cut model (cut + preference values) determines a probability distribution over the sample (see below), and hence a coding scheme. (Examples of tree cut models can be found in Tables 1–6.)
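As an illustration of this representation, the following sketch shows one possible encoding of a tree cut model; the toy hierarchy, the cut, and the preference values are my own invention, not taken from WordNet or from the paper:

    # Toy noun hierarchy (child -> parent); names are invented, not WordNet.
    hierarchy = {
        "strawberry": "fruit", "pear": "fruit",
        "fruit": "food", "cake": "food",
        "food": "entity", "house": "entity",
    }

    # A tree cut: classes forming a partition of the noun senses, each with
    # a preference value (association norm). Values are illustrative.
    tree_cut = {"food": 3.2, "house": 0.4}

    def preference(noun):
        """Walk from a noun sense up to the class on the cut; the class's
        preference value is inherited by all of its descendants."""
        node = noun
        while node is not None:
            if node in tree_cut:
                return tree_cut[node]
            node = hierarchy.get(node)
        raise KeyError(noun + " is not covered by the cut")

    print(preference("strawberry"))  # inherits the value of <food>: 3.2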

As preference value, Li & Abe estimate the so-called association norm:

    A(c, v) = p(c, v) / (p(c) p(v)) = p(c | v) / p(c)    (1)

This measure quantifies the ratio of the occurrence probability of a noun class c at a certain argument slot⁴ of a certain verb v and the expected occurrence probability of c at this slot if independence between c and v is assumed. This is equivalent to the ratio of the conditional probability of c given v and the probability of c regardless of a particular verb. An association norm greater than 1 indicates that v prefers c; an association norm smaller than 1 indicates that v disprefers c.
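Under a maximum-likelihood reading, the association norm can be estimated from co-occurrence counts. The following sketch is my own illustration with hypothetical counts; the paper's actual estimator additionally smooths p(n) with a tree cut model (cf. footnote 5):

    # Hypothetical counts for estimating A(c, v) = p(c|v) / p(c).
    count_c_v = 120      # nouns of class c seen as complement of verb v
    count_v = 500        # all complement tokens seen with verb v
    count_c = 4000       # nouns of class c seen with any verb
    count_all = 100000   # all complement tokens in the sample

    p_c_given_v = count_c_v / count_v   # p(c|v) = 0.24
    p_c = count_c / count_all           # p(c)   = 0.04

    A = p_c_given_v / p_c               # 6.0 > 1: v prefers c
    print(A)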

Given the marginal probabilities p(n) of the noun senses (regardless of a particular verb),⁵ a tree cut model determines a probability distribution over the noun senses n in a sample. This follows from the constraint that the association norm of a class is inherited by its descendants. Every noun sense n is represented in the cut by exactly one class c_n. So we have

    p(n | v) = A(n, v) p(n) = A(c_n, v) p(n)    (2)

MDL is used to get the tree cut model with the appropriate generalization level. The number of bits required to encode a sample S using a probability model M is given by

    L(M) = L_mod(M) + L_dat(M)    (3)

with

    L_mod(M) = L_cut(M) + L_par(M)    (4)

L_mod(M) is the model description length, L_dat(M) the data description length, L_cut(M) is the code length needed to identify the cut through the hierarchy, and L_par(M) is the code length needed to encode the parameters of the model (the association norms of the classes on the cut). Following the MDL principle, we search for the model M that minimizes L(M).

⁴This slot is not explicitly referred to in the formula.

⁵These probabilities are also estimated on the basis of a tree cut model by employing the MDL principle; cf. [1] and [5] for details.

(3)

For simplicity, it is assumed that all possible cuts have uniform probability. Thus, L_cut is constant for all cuts. As we aim at minimizing the description length, we can neglect this term.

Li & Abe calculate L_par(M) as

    L_par(M) = K · (log |S|) / 2    (5)

K is the number of parameters in M (i.e. the number of classes on the cut) and |S| is the sample size. For every class on the cut, the association norm is represented by (log |S|)/2 bits. This precision minimizes L(M) for a given M (cf. [11]).⁶

The data description length is given by

    L_dat(M) = - Σ_{n ∈ S} log p_M(n | v, s)    (7)

where p_M is a probability distribution determined by M (cf. Section 2.1).

If the tree cut is located near the root, the model description length will be low because the model contains only a few classes. However, the data description length will be high because the code for the data is based on the probability distribution of the classes in the model, not on the real probability distribution of the individual nouns. The greater the difference between the supposed distribution and the real one, the longer the code. And the coarser the classification is, the more the corresponding distribution p_M deviates from the real distribution. If, on the other hand, the tree cut is located near the leaves, the reverse is true: the fine-grained classification fits the data well, resulting in a low data description length, but the large number of classes increases the model description length. Minimizing the sum of these two description lengths yields a balance between compactness (expressing generalizations) and accuracy (fitting the data) of the model.
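Putting equations (5) and (7) together, the quantity to be minimized for one candidate cut can be sketched as follows (my own condensation; it assumes the probabilities p_M(n|v) of equation (2) have already been computed for every noun token in the sample):

    import math

    def description_length(cut_classes, sample_probs):
        """L(M) = L_par(M) + L_dat(M) for one candidate cut; L_cut is the
        same for all cuts and therefore omitted.
        cut_classes:  the classes on the cut (only their number K matters
                      for L_par).
        sample_probs: p_M(n|v) for every noun token n in the sample S, as
                      determined by the tree cut model (equation 2)."""
        K = len(cut_classes)
        size = len(sample_probs)
        l_par = K * math.log2(size) / 2                     # equation (5)
        l_dat = -sum(math.log2(q) for q in sample_probs)    # equation (7)
        return l_par + l_dat

    # The MDL model is the candidate with minimal description length, e.g.:
    # best = min(candidates, key=lambda m: description_length(*m))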

2.3 Implementational details

In essence, I used the algorithm described in [1] to obtain the tree cut model. However, some modifications were necessary or useful for practical reasons.

Firstly, some WordNet-specific problems had to be solved. The algorithm requires that the class hierarchy is a tree where the leaves represent the word senses and the inner nodes represent semantic classes. However, WordNet is not a pure tree but a DAG, and all nodes represent both word senses and semantic classes (e.g. the node <person#individual#someone> represents at the same time a semantic class and a particular word sense for the nouns "person", "individual", and "someone"; no hyponym of the class represents this sense). To handle this problem, I introduced for every inner node an additional node that captures the noun sense that the node represents and made this additional node a hyponym of the inner node. So all word senses are represented by leaves. (These additional nodes will be indicated by 'REST::' as they represent the "rest" of class instances which is not captured by the subclasses.) To handle the DAG issue, I "broke the DAG into a tree": if a node has more than one parent, I virtually duplicated that node (and its descendants) to maintain a tree structure. This solution has the disadvantage that parts of the sample are artificially duplicated. I will work on a more principled solution in the future.

⁶I actually introduced an optimization: it does not make sense to represent the parameter value 0 with (log |S|)/2 bits. (An association norm is 0 if and only if the corresponding class has no instances in the sample.) A more efficient coding strategy is to mark the classes that have a non-zero parameter and to represent the parameter values for those classes only. First one needs K bits, one for each class, to indicate whether a class occurs in the sample or not. Then one needs (log |S|)/2 bits for every class that occurs in the sample. Thus,

    L_par(M) = K + K_S · (log |S|) / 2    (6)

K_S is the number of classes that have instances in the sample. With this modification, one saves (K - K_S) · (log |S|)/2 - K bits.

To eliminate noise, I introduced a threshold in the following way: the algorithm compares possible cuts by traversing the hierarchy top down. If a class with a probability smaller than 0.05 is encountered, the traversal stops, i.e. the descendants of that class are not examined. This also has the advantage of limiting the search space.
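The following sketch illustrates the two adjustments just described: the REST:: leaves for inner-node senses and the 0.05 threshold during the top-down traversal. The node representation is my own; access to the actual WordNet hierarchy (and the DAG-to-tree duplication) is omitted:

    THRESHOLD = 0.05

    class Node:
        def __init__(self, name, children=(), prob=0.0):
            self.name = name              # e.g. "animal#animate being"
            self.children = list(children)
            self.prob = prob              # estimated class probability

    def add_rest_leaves(node):
        """Give every inner node an extra REST:: leaf, so that all word
        senses are represented by leaves of the tree."""
        for child in node.children:
            add_rest_leaves(child)
        if node.children:
            node.children.append(Node("REST::" + node.name))

    def candidate_classes(node):
        """Top-down traversal of the hierarchy; the descendants of a class
        whose probability is below the threshold are not examined (noise
        elimination, and a smaller search space)."""
        yield node
        if node.prob >= THRESHOLD:
            for child in node.children:
                yield from candidate_classes(child)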

2.4 Experiment

2.4.1 Setting

To test the behaviour of the Li & Abe approach with respect to generalization, I applied it to acquire selectional preferences for the direct object slot.⁷ I extracted verb–object instances from a portion of the British National Corpus (parts A–E; about 40 million words) with Steven Abney's CASS parser [2].⁸ This resulted in a sample of about 2 million verb–noun pairs. Then I applied the algorithm of Li & Abe to calculate the selectional preferences of 24 test verbs and manually inspected the results.

2.4.2 Results

The experiment revealed a significant drawback of employing the MDL principle for our task. It turned out that the frequency of the examined verb in the sample has an undesirable impact on the generalization level of the tree cut model: the algorithm tends to over-generalize (acquire a tree cut with few general classes) for infrequent verbs and to under-generalize (acquire a tree cut with many specific classes) for frequent verbs. This behaviour is an immediate consequence of the MDL principle: if a large amount of data has to be described, then the model cost L_mod does not contribute much to the whole description length L. The gain of a complex model for encoding the data outweighs the model cost. If, however, only few data have to be described, then L_mod is much more significant for L: the cost of encoding a complex model outweighs the gain for encoding the data.

However, this is not the desired behaviour. Generalization should not be triggered by the sample size, but by the "semantic variety" of the instances in the sample: nouns like "apple", "pear", "strawberry" should generalize to <fruit>. Further instances like "pork" or "cake" should trigger generalization to <food>, and yet further instances like "house" or "vessel" to <physical object>.

To illustrate these considerations, let us look at the verbs "kill", "murder", and "assassinate". Tables 1–3 show (parts of) the tree cut models obtained for these verbs. For the rather frequent verb "kill" (3352 occurrences), hyponyms of <animal> are acquired. These classes are too specific; one would expect the class <life form>. In contrast, the tree cut model for the less frequent verb "murder" (477 occurrences) is an over-generalization. This verb prefers the more specific concept <person>. The over-generalization is even worse for the infrequent verb "assassinate" (79 occurrences). The selectional preference of this verb is even more specific; it prefers a concept like "important person" (which does not exist in WordNet). However, one of the most general concepts, <entity>, is retrieved.

⁷I chose this slot for several practical reasons. Of course, one can apply the algorithm to samples of other syntactic complements as well.

⁸I thank Steven Abney and Marc Light, who made their source code for acquiring selectional preferences available to me. Those parts of my implementation which deal with collecting and storing co-occurrence statistics from the sample are adapted from their code.

Table 1. Part of the tree cut model for "kill" (standard MDL)

    class                                            A(class, verb)
    ...                                              ...
    <person#individual#someone#mortal#human>         3.11
    ...                                              ...
    <herbivore>                                      31.42
    <aquatic vertebrate>                             10.93
    <bird>                                           9.02
    <amphibian>                                      11.36
    <reptile#reptilian>                              5.44
    <metatherian>                                    6.08
    <livestock#farm animal>                          31.42
    <bull>                                           109.96
    <insectivore>                                    9.43
    <aquatic mammal>                                 88.81
    <carnivore>                                      15.56
    <lagomorph#gnawing mammal>                       178.59
    <rodent#gnawer#gnawing animal>                   11.28
    <ungulate#hoofed mammal>                         11.95
    <primate>                                        150.40
    <proboscidean#proboscidian>                      20.95
    <invertebrate>                                   18.21
    <predator#predatory animal>                      31.42
    <prey#quarry>                                    364.85
    <REST::animal#animate being#beast#brute>         343.82
    <plant#flora#plant life>                         0.14
    ...                                              ...
    <REST::life form#organism#being#living thing>    19.47
    ...                                              ...

Table 2. Part of the tree cut model for "murder" (standard MDL)

    class                                            A(class, verb)
    ...                                              ...
    <life form#organism#being#living thing>          4.15
    ...                                              ...

Table 3. Part of the tree cut model for "assassinate" (standard MDL)

    class                                            A(class, verb)
    ...                                              ...
    <entity>                                         2.29
    ...                                              ...

3 THE WEIGHTING ALGORITHM

3.1 Introducing a weighting factor

The problem discussed above is caused by the different complexities of L_par and L_dat (with respect to the sample size |S|). As one can see from equations (5) and (7), L_par has the complexity O(log |S|), while L_dat has the complexity O(|S|). Thus, with growing |S|, L_dat "grows faster" than L_par, and for frequent verbs the model description length can be neglected, so that a model with many specific classes becomes "affordable".

To overcome this drawback, I extended the expression to minimize by a weighting factor: instead of minimizing L_par + L_dat, the modified algorithm minimizes

    L_par(M) + C · (log |S| / |S|) · L_dat(M)    (C > 0)    (8)

Now both addends have the same complexity. |S| does not directly affect generalization any more.

The value of the constant C influences the degree of generalization. The smaller C is, the more general the acquired classes are. The possibility of manipulating the overall generalization level by the choice of C introduces some flexibility which might prove useful when the algorithm is applied in different situations (tasks, domains, languages, etc.).
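Compared with the description-length sketch given at the end of Section 2.2, the change amounts to a single extra factor. Again a sketch, with C exposed as a parameter (the logarithm base in the weighting factor only rescales C and is therefore arbitrary):

    import math

    def weighted_description_length(cut_classes, sample_probs, C=50):
        """Objective of the weighting algorithm (equation 8):
        L_par(M) + C * (log|S| / |S|) * L_dat(M).
        Both addends now grow as O(log|S|), so the sample size no longer
        drives generalization; a smaller C yields more general cuts."""
        K = len(cut_classes)
        size = len(sample_probs)
        l_par = K * math.log2(size) / 2
        l_dat = -sum(math.log2(q) for q in sample_probs)
        return l_par + C * (math.log2(size) / size) * l_dat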

Note that the introduction of weighting is a deviation from the "pure" MDL principle, which is based on the view that learning can be regarded as data compression. However, it can be shown that the modified algorithm is a kind of Bayesian learning.

3.2 Experiments

To test the impact of this modification on the generalization level of the acquired tree cuts, I examined verbs with diverse numbers of different noun complements (types) in the training sample. In particular, I selected all verbs with a high number (≥ 1000), a medium number (400–600), a low number (70–100), and a very low number (10–40) of different complements and compared the generalization levels retrieved by the "standard MDL" algorithm and the weighting algorithm. (I arbitrarily chose C = 50.) For all verbs with a high number and 89% of the verbs with a medium number of complements, the weighting algorithm obtained more general classes than the standard MDL algorithm. In contrast, more specific classes were computed for almost all verbs with a low and a very low number of different complements (95.9% and 99.5%, respectively). Hence, the modification changes the behaviour of the algorithm in the desired direction (variety of complements, not sample size, triggers generalization).

Tables 4–6 show the tree cut models for "kill", "murder", and "assassinate" which are yielded by the weighting algorithm (C = 50). Now these models are at the appropriate level of generalization.

Table 4. Part of the tree cut model for "kill" (weighting algorithm)

    class                                            A(class, verb)
    ...                                              ...
    <life form#organism#being#living thing>          3.25
    ...                                              ...

Table 5. Part of the tree cut model for "murder" (weighting algorithm)

    class                                            A(class, verb)
    ...                                              ...
    <person#individual#someone#mortal#human>         4.65
    ...                                              ...

Table 6. Part of the tree cut model for "assassinate" (weighting algorithm)

    class                                            A(class, verb)
    ...                                              ...
    <adult>                                          4.67
    <communicator>                                   4.74
    <contestant>                                     9.40
    <spiritual leader>                               14.81
    <head#chief#top dog>                             74.25
    <president#chairman#chairwoman#chair>            982.64
    <REST::leader>                                   1187.54
    <peer#equal#match#compeer>                       24.04
    <relative#relation>                              6.47
    <czar#tsar#tzar>                                 791.57
    <king#male monarch>                              1187.36
    <authority>                                      238.61
    <suspect>                                        1187.36
    ...                                              ...

4 EVALUATION

Up to now, it has not been possible to automatically evaluate the "intuitiveness" of the selectional preferences acquired by a certain approach because there was no way to tell the computer which preferences correspond with human intuition. One could only manually inspect a few illustrative examples and concentrate on evaluating the performance of the approach in NLP tasks, e.g. word sense disambiguation (which is, of course, a crucial issue). The EuroWordNet database provides information suitable for compiling a gold standard. This gold standard makes it possible to evaluate the lexicographic appropriateness (the appropriateness with respect to building wordnets) of an acquisition approach automatically and on a broader empirical basis.

This section describes the evaluation of the standard MDL and the weighting algorithm.

4.1 Retrieving the gold standard

As mentioned in Section 1, some of the wordnets in EuroWordNet contain thematic relations: the wordnets for Dutch, English, Estonian, Italian, and Spanish. These relations have been manually encoded or extracted from other lexical resources, respectively. I employed them for the gold standard by mapping them to WordNet (which does not contain thematic relations itself).

I started from the simplifying heuristic that the patient of a verb is usually syntactically realized as its direct object. In EuroWordNet, a verb sense is connected to a noun sense that it prefers as its patient by the INVOLVED PATIENT relation. Thus, I mapped the relations of this type to WordNet.

I extracted those INVOLVED PATIENT relations where both the source node and the target node were linked to a node in the inter-lingual index (ILI) by a synonymy or a near-synonymy relation.⁹ The inter-lingual index essentially consists of all the concept nodes of WordNet 1.5. Thus, extracting the ILI concepts equivalent to the source and the target concept of an INVOLVED PATIENT relation, respectively, immediately yields a mapping of this relation to WordNet 1.5. With this procedure, I retrieved 605 relations.
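The extraction step can be pictured as a filter over the EuroWordNet relation records. This sketch uses invented record and field names (the actual EWN database format is not described in the paper):

    from collections import namedtuple

    # Hypothetical record type; the real EWN database format differs.
    ILILink = namedtuple("ILILink", ["ili_concept", "link_type"])

    def extract_gold_relations(involved_patient, ili_links):
        """Keep INVOLVED PATIENT relations whose source (verb) and target
        (noun) synsets are linked to an ILI (= WordNet 1.5) concept by a
        synonymy or near-synonymy link, and return the WordNet pairs."""
        ok = {"eq_synonym", "eq_near_synonym"}
        gold = []
        for verb_synset, noun_synset in involved_patient:
            src = ili_links.get(verb_synset)
            tgt = ili_links.get(noun_synset)
            if src and tgt and src.link_type in ok and tgt.link_type in ok:
                gold.append((src.ili_concept, tgt.ili_concept))
        return gold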

However, a certain amount of these relations were inappropriate for our task. The assumption that a patient is syntactically realized as an object is a good starting point, but does not apply in all cases. Unaccusative verbs (e.g. <silt>) realize their patient (e.g. <sediment>) as their subject. Other verbs do not realize their patient as a syntactic argument at all (e.g. <delouse> – <louse>). Patients of such verbs cannot be found by examining verb objects.

⁹Most concepts in the language-specific wordnets are linked to a corresponding concept in the ILI by a synonymy link. However, it is often the case that there is no ILI concept that exactly matches a language concept. This language concept has to be linked to a semantically related ILI concept, e.g. by a hyponymy or a hyperonymy link.

Furthermore, some relations like <address> INVOLVED PATIENT <addressee> indicate a noun concept that itself is perfectly adequate, but does not capture the majority of the noun instances in the corpus. Any noun referring to a human could occur as the patient of <address>. Thus, the learning algorithm should generalize to the <human> level. However, <addressee> is a subclass of <human> (which has no hyponyms itself). It makes sense to encode thematic relations where the noun concept does not subsume all preferred concepts extensionally, but characterizes them intensionally. However, such relations cannot be derived by generalizing from corpus instances. They could rather be acquired by examining derivational patterns.

To obtain a gold standard that is appropriate for the evaluation of the two algorithms, I excluded these problematic cases. 390 relations remained.

For every WordNet verb concept which was retrieved in this way, I collected all the verbs which the concept represents and assigned each of them the noun concepts linked to it. This means that the information to which sense of the verb a noun concept is related is lost. However, this is necessary to perform the comparison with the results of the two algorithms, because they compute preferences for verbs, not verb senses. I obtained 599 verbs altogether (excluding multiword lexemes).

4.2 Evaluation results

The intersection of the verbs in the gold standard and the verbs in the training sample contained 522 verbs, which were connected to 1082 noun concepts in the gold standard. For both algorithms (and different values of C), I evaluated the match between the classes acquired for a verb and the gold standard classes for that verb. Table 7 shows the number and the percentage of the noun classes in the gold standard which were exactly matched, not matched at all,¹⁰ or matched by more general or more specific classes in the tree cut model. This table contains recall values (the number of correct classes that are found). Note that it does not make sense to calculate precision (the number of preferred classes in the tree cut model that are correct) because the gold standard does not capture every sense of a verb, i.e. it is not "complete" with respect to a particular verb.
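The level-deviation counts in Table 7 can be pictured as follows; this is my own reconstruction, assuming hypernym chains are available as lists from each concept up to the root, and that only preferred classes (association norm > 1, cf. footnote 10) count as matches:

    def match_level(gold_class, tree_cut, hypernym_path):
        """Compare one gold standard class with an acquired tree cut.
        hypernym_path(c) returns [c, parent(c), ..., root].
        Returns 0 for an exact match, +k if the gold class is matched by a
        k-level hyperonym on the cut, -k if by a k-level hyponym, and None
        if no preferred class (association norm > 1) matches."""
        preferred = {c for c, a in tree_cut.items() if a > 1.0}
        # Case 1: the gold class itself, or one of its hyperonyms, is on
        # the cut (exact match at depth 0, hyperonym match at depth k).
        for depth, ancestor in enumerate(hypernym_path(gold_class)):
            if ancestor in preferred:
                return depth
        # Case 2: a preferred class on the cut is a hyponym of the gold
        # class; report the closest one.
        depths = [hypernym_path(c).index(gold_class)
                  for c in preferred if gold_class in hypernym_path(c)]
        return -min(depths) if depths else None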

4.3 Discussion

The results show that the weighting algorithm significantly outperforms the standard MDL algorithm. For C = 10,000 or higher, about three times as many classes are matched as with standard MDL. The same holds true if we add the exact matches and the 1-level deviations, which are the most helpful cases in a scenario in which selectional preferences are acquired automatically and corrected manually afterwards (33% vs. 11%). The classes with the most exact matches (C ≥ 10,000) are <food> (18 times), <physical object> (10 times), <animal> (9 times), <beverage> (7 times), and <person> (6 times).

The choice of C influences the performance. Here, selecting a high value (which forces specific tree cuts) improves the results, but only to a certain extent: above C = 10,000, no improvement can be observed. The tree cuts have then reached their "lower limit", which is determined by the threshold introduced to eliminate noise (cf. Section 2.3).

¹⁰Dispreferred classes, i.e. classes with a preference value < 1, are considered as not matching the gold standard.


Table 7. Comparison of acquired tree cut models with the gold standard

    number (percentage) of             standard      weighting, C =
    gold standard classes              MDL           100          1,000        10,000       100,000
    exactly matched                    57  (5.3%)    106 (9.8%)   160 (14.8%)  162 (15.0%)  162 (15.0%)
    matched by 1 level hyperonym       56  (5.2%)    90  (8.3%)   120 (11.1%)  126 (11.6%)  126 (11.6%)
    matched by 1 level hyponym         5   (0.5%)    38  (3.5%)   63  (5.8%)   69  (6.4%)   69  (6.4%)
    matched by 2 level hyperonym       107 (9.9%)    86  (7.9%)   122 (11.3%)  125 (11.6%)  125 (11.6%)
    matched by 2 level hyponym         0   (0.0%)    6   (0.6%)   6   (0.6%)   6   (0.6%)   6   (0.6%)
    matched by >= 3 level hyperonym    524 (48.4%)   296 (27.4%)  193 (17.8%)  194 (17.9%)  194 (17.9%)
    matched by >= 3 level hyponym      0   (0.0%)    9   (0.8%)   11  (1.0%)   11  (1.0%)   11  (1.0%)
    not matched                        333 (30.8%)   451 (41.7%)  407 (37.6%)  389 (36.0%)  389 (36.0%)


The percentage of classes which were not matched at all is higher for the weighting algorithm. The reason for this is that the majority of verbs occur rather infrequently (cf. Zipf's law), so that standard MDL tends to acquire over-general classes for them. Thus, the chance that the class in the gold standard is subsumed by such a general class is higher. (Note that most of the classes matched by standard MDL are at least 3 levels too general.)

However, even with the weighting algorithm the overall results are not satisfying. 15% of the classes in the gold standard are exactly matched; 33% are approximated with 0 or 1 level deviation. 41.1% are matched by too general classes, but only 8% by too specific classes. More than one third of the classes are not found at all.

The main reason for this behaviour is that the selectional preferences are acquired for verb forms, not for verb senses. Calculating a tree cut model for a highly polysemous verb may trigger inappropriate generalizations, since the different senses of the verb can introduce a high variety of complement nouns, which yields generalization, even if each sense alone prefers rather specific noun concepts.

On the other hand, it would be useful to pool verb instances which represent the same concept when calculating tree cut models. For example, the verbs "arrest", "nail", "nab", and "cop" are represented by the same concept in the gold standard. More appropriate selectional preferences could be acquired if the algorithm did not compute one tree cut for all instances of "arrest", one for all instances of "nail", etc., but one tree cut for all instances of "arrest", "nail", "nab", and "cop" which have the same sense. This would also reduce the percentage of unmatched classes, since verb instances which have a sense that does not occur in the gold standard would not be taken into account.

5 CONCLUSION AND FUTURE WORK

In this paper, I addressed the automatic acquisition of selectional preferences by the statistical analysis of corpora, with the aim of encoding them in lexical semantic ontologies. I argued that methods which have been proposed for the acquisition of selectional preferences do not cope satisfactorily with the task of finding the appropriate generalization level. I modified one of these approaches and showed that the modified approach is much better suited for computing generalization levels which are appropriate for ontology building. The EuroWordNet database provides information that can be combined to obtain a gold standard for selectional preferences. With this gold standard, lexicographic appropriateness can be evaluated automatically and on a broader empirical basis. This evaluation shows that the algorithm proposed in this paper is promising. However, the results are not satisfying yet. One shortcoming of the experiments described here (as well as in the mentioned work of Resnik, Ribas, and Li & Abe) is that the learning algorithms are fed with word forms rather than word senses, which would be adequate. Employing corpora which are at least partially semantically disambiguated should improve the performance significantly.

In the near future, I will employ approaches to lexical disambiguation and test their impact on the performance of the weighting algorithm. Furthermore, I will test the methods described in this paper on different argument slots. As large syntactically annotated corpora are becoming more and more available, verb–argument relations other than direct objects can be reliably extracted and fed into the learning algorithm.

REFERENCES

[1] Naoki Abe and Hang Li, ‘Learning Word Association Norms Using Tree Cut Pair Models’, in Proc. of 13th Int. Conf. on Machine Learning, (1996).

[2] Steven Abney, 'Partial parsing via finite-state cascades', in Workshop on Robust Parsing (ESSLLI '96), ed., John Carroll, pp. 8–15, (1996).

[3] Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, 1991.

[4] WordNet: An Electronic Lexical Database, ed., Christiane Fellbaum, MIT Press, Cambridge, Mass., 1998.

[5] Hang Li and Naoki Abe, ‘Generalizing Case Frames Using a Thesaurus and the MDL Principle’, in Proc. of Int. Conf. on Recent Advances in NLP, (1995).

[6] Diana McCarthy and Anna Korhonen, 'Detecting verbal participation in diathesis alternations', in Proc. of 36th Annual Meeting of the Association for Computational Linguistics, (1998).

[7] Wim Peters, 'Corpus-based conceptual characterisation of verbal predicate structures', in Proc. of Computational Linguistics in the Netherlands, Antwerpen, (1996).

[8] Philip Resnik, ‘Selectional preference and sense disambiguation’, in ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, Washington, D.C., (1997).

[9] Philip Stuart Resnik, Selection and Information: A Class-Based Approach to Lexical Relationships, Dissertation, University of Pennsylvania, 1993.

[10] Francesc Ribas, ‘An experiment on learning appropriate selectional re- strictions from a parsed corpus’, in Proc. of COLING, Kyoto, (1994).

[11] Jorma Rissanen and Eric Sven Ristad, 'Language acquisition in the MDL framework', in Language Computations, ed., Eric Sven Ristad, volume 17 of Series in Discrete Mathematics and Theoretical Computer Science, 149–166, DIMACS, (1992).

[12] Piek Vossen, ed. EuroWordNet Final Document. EuroWordNet (LE2- 4003, LE4-8328), 1999. Deliverable D032D033/2D014.
