Adding frequencies to the LGLex lexicon with IRASUBCAT

(1)

Adding frequencies to the LGLex lexicon with IRASUBCAT

Elsa Tolone and Romina Altamirano

FaMAF, Universidad Nacional de C´ ordoba,

elsa.tolone@univ-paris-est.fr, romina.altamirano@gmail.com

Abstract. We present a method for enlarge a lexicon (with frequencies information), that is useful for parsing and others NLP applications. We show an example enlarging the verbal LGLex lexicon of French [8], us- ing several corpora extracted from the evaluation campaign for French parsers Passage [5]. To do that, we use the results of the frmg parser [7]

with IRASubcat [1], a tool that automatically acquires subcategoriza- tion frames from corpus in any language and that also allows to complete an existing lexicon. We obtain the frequencies of occurrence for each in- put and each subcategorization frame for 14,068 distinct lemmas.

Keywords: Lexicon-Grammar, syntactic lexicon, French lexicon, sub- categorization, frequency of occurence.

1 Lexicon-Grammar, LGLex lexicon and FRMG parser

Lexicon-Grammar tables are currently one of the major sources of syntactic lexi- cal information for the French language [4]. Moreover, several Lexicon-Grammar tables exist for other languages, such as Italian, Brazilian Portuguese, Modern Greek, Korean, Romanian, and others.

We improved the Lexicon-Grammar tables to make them usable in various NLP applications, in particular parsing [8]. So we generated a French syntactic lexicon for verbs, nouns playing the predicative role frozen expressions including verbal and adjectival idioms, and adverbs from the Lexicon-Grammar tables, called LGLex [3]

¹

.

Then, we converted the verbs and predicative nouns into the Alexina frame- work, that is the one of the Lefff lexicon (Lexicon of French inflected form) [6]

²

, a large-coverage morphological and syntactic lexicon for French.

This enables its integration in the frmg parser (French MetaGrammar) [7]

³

, a large-coverage deep parser for French, based on Tree Adjoining Grammars (TAG), that usually relies on the Lefff. The result is a variant of the frmg parser, that we shall call frmg

LGLex

, to distinguish it from the standard frmg

Lef f f

.

1

All tables are fully available under the LGPL-LR license at http://infolingu.

univ-mlv.fr/english > Language Resources > Lexicon-Grammar > Download.

2

Lefff is available at http://gforge.inria.fr/projects/alexina/.

3

frmg is available at http://gforge.inria.fr/projects/mgkit/, with a visualiza-

tion of the grammar frmg on http://alpage.inria.fr/frmgdemo.

(2)

2 Elsa Tolone, Romina Altamirano

In this article we present a method for enlarge a lexicon (with frequencies information), that is useful for parsing and others NLP applications. We show an example enlarging the verbal LGLex lexicon of French, using several corpora extracted from the evaluation campaign for French parsers Passage [5]. To do that, we use the results of the frmg parser with the IRASubcat tool [1].

2 IRASUBCAT

IRASubcat is a tool that acquires subcategorization information about the behaviour of any tag class (e.g., part of speech, syntactic function, etc.) in a corpus [1]

⁴

. We are interested in using it to acquire information about verbs.

IRASubcat takes as input a corpus in XML format. The output of IRA- Subcat is a lexicon, also in XML format, where each of the verbs under in- spection is associated to a set of subcategorization patterns. The lexicon also provides information about frequencies of occurrence for verbs, patterns, and their co-occurrences in corpus.

Moreover, IRASubcat allows to integrate the output lexicon with a pre- existing one, merging information about verbs and patterns with information that had been previously extracted, possibly from a different corpus or even from a hand-built lexicon.

3 Experiment with IRASUBCAT and the LGLex lexicon of French

We want to use the results of frmg parser on a big corpus with IRASubcat in order to improve the LGLex lexicon of French, adding the frequencies of occurrence for each entry and each subcategorization frame. To do this, we must:

– find a corpus with millons of words (using a small part for the experiment);

– parse the corpus with the frmg parser, with and without the LGLex lexicon (i.e. only with the Lefff lexicon) – results with frmg

^LGLex

and frmg

^{Lef f f}

; – convert both the processed corpus and the LGLex lexicon into XML format;

– use IRASubcat in order to add the frequencies of occurrence for each entry and each subcategorization frame in the LGLex lexicon from the corpus.

3.1 Conversion of the verbal LGLex lexicon

The input is the verbal LGLex lexicon, or more precisely, the extensional lexicon of LGLex lexicon in Lefff format, which contains each inflected form of the lemma and every possible redistribution.

In the output lexicon converted into XML format as IRASubcat output lexicon, each lemma is associated to a set of subcategorization patterns. In fact,

4

IRASubcat is available at http://www.cs.famaf.unc.edu.ar/~romina/

irasubcat/.

(3)

Adding frequencies to the LGLex lexicon with IRASUBCAT 3

we simplify by omitting the realizations. So, we have only the syntactic functions because it’s more easy to find them in the processed corpus.

For each lemma represented by his identifier (for example, verb=”achever V 1 1” , which correspond to the 1st entry in the verb class 1), a count of ocur- rences of this lemma is initialized at 0 ( count oc verb=”0” ). We extracted the set of subcategorization patterns from all his inflected forms and all his redistributions and the number of different pattern is indicated (for example, different patterns=”6” ). For each pattern ( [’obj’, ’suj’] , [’obl’, ’suj’] , [’obl2’,

’suj’] and [’obl’, ’obl2’] ), a count of occurences of this pattern for this lemma and a count of occurences of this pattern for all verbs are both initialized at 0 ( count w verb=”0” total count=”0” ).

We have in total 14,068 distinct lemmas. An example of the output lexicon:

<dictionary>

<entry verb="achever___V_1_1" count_oc_verb="0">

<tag name="fs" different_patterns="6">

<pattern id="[’obj’, ’suj’]" count_w_verb="0" total_count="0"

rejected_patterns_freq_test="NO">

</pattern>

...

</tag>

</entry>

</dictionary>

3.2 Conversion of the processed corpus with the frmg parser To use the result of the parsing in NLP applications of high-level, Forest utils

⁵

represents the forest of dependencies in format XMLDep [7]. Basically, we repre- sent in XMLDep format a graph of dependencies with nodes (lemmas), grouped in clusters (forms), with arcs describing the syntactic dependencies between nodes.

The processed corpus with frmg

LGLex

used for the experiment is CPJ (Cor- pus Passage Jouet) with 100K sentences of AFP (Agence France-Presse), Eu- roparl, Wikipedia and Wikisources, extracted from the corpus of the evaluation campaign (in 2009) for French parsers Passage [5].

The input is the processed corpus CPJ with the frmg parser, more precisely, with frmg

^LGLex

. We want to extract only the useful information in a format directly readable by IRASubcat .

In the output in XML format, for each sentence of the corpus (for example,

<sentence ID=”12” corpus=”frwikipedia 012” s=”12”> ), we extracted the

verbs ( cat=”v” ) with their identifiers (for example, lemmaid=”achever V 1 1 ”).

For each verb, we extracted the syntactic functions and we indicated the number of arguments ( nb fs=”2” ) and then, each syntactic function ( fs ) one by one (for example, fs=”suj” for subject, and fs=”obl2” for oblique).

5

Forest utils is a set of Perl scripts to convert between various formats for shared derivation forest produced by parsers for TAG (available at https://gforge.inria.

fr/projects/lingwb/).

(4)

4 Elsa Tolone, Romina Altamirano

An example of the output:

<sentence ID="12" corpus="frwikipedia_012" s="12">

<word lexica="acheve" lemma="achever" lemmaid="achever___V_1_1" cat="v"

nb_fs="2">acheve</word>

<word fs="suj"></word>

<word fs="obl2"></word>

</sentence>

4 Conclusion

Using IRASubcat with the converted lexicon and the relevant information ex- tracted of the processed corpus we can complete the lexicon with the frequencies of occurrence for each verb and each syntactic function. The processed corpus is the results of the frmg parser with LGLex lexicon, so it could find wrong sense.

The next step is to consider the information on realizations, that we must extract from processed corpus, but it is not a straightforward task. Then we have to use the frmg parser with Lefff lexicon only, without the LGLex lexicon influences the results. We could also use IRASubcat with another parser which is statistical, such as MaltParser, MSTParser, or Berkeley Parser [2]. And we could do a comparison using the original lexicon and the enlarged lexicon with that different parsers to verify that the accuracy is better using more information.

References

1. Ivana Romina Altamirano and Laura Alonso Alemany. Irasubcat, a highly parametrizable, language independent tool for the acquisition of verbal subcate- gorization information from corpus. In Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Ameri- cas, pages 84–91, Los Angeles, California, 2010.

2. Marie. Candito, Joakim Nivre, Pascal Denis, and Enrique Henestroza Anguiano.

Benchmarking of statistical dependency parsers for french. In Proceedings of COL- ING’10 (poster session), Beijing, China, 2010.

3. Matthieu Constant and Elsa Tolone. A generic tool to generate a lexicon for NLP from Lexicon-Grammar tables. In Michele De Gioia, editor, Actes du 27e Col- loque international sur le lexique et la grammaire (L’Aquila, 10-13 septembre 2008), Seconde partie, volume 1 of Lingue d’Europa e del Mediterraneo, Grammatica com- parata, pages 79–193. Aracne, Rome, Italy, 2010.

4. Maurice Gross. M´ ethodes en syntaxe : R´ egimes des constructions compl´ etives. Her- mann, Paris, France, 1975.

5. Olivier Hamon, Djamel Mostefa, Christelle Ayache, Patrick Paroubek, Anne Vilnat, and ´ Eric de La Clergerie. Passage: from french parser evaluation to large sized treebank. In Proceedings of LREC’08, Marrakech, Morocco, 2008.

6. Benoˆıt Sagot. The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French. In Proceedings of LREC’10, Valletta, Malta, 2010.

7. Fran¸ cois Thomasset and ´ Eric de La Clergerie. Comment obtenir plus des m´ eta- grammaires. In Actes de TALN’05, Dourdan, France, 2005.

8. Elsa Tolone. Analyse syntaxique ` a l’aide des tables du Lexique-Grammaire fran¸ cais.

Editions Universitaires Europ´ ´ eenes, Saarbr¨ ucken, Germany, July 2012. (352 pp.).