Adding frequencies to the LGLex lexicon with IRASUBCAT
Elsa Tolone and Romina Altamirano
FaMAF, Universidad Nacional de C´ ordoba,
elsa.tolone@univ-paris-est.fr, romina.altamirano@gmail.com
Abstract. We present a method for enlarge a lexicon (with frequencies information), that is useful for parsing and others NLP applications. We show an example enlarging the verbal LGLex lexicon of French [8], us- ing several corpora extracted from the evaluation campaign for French parsers Passage [5]. To do that, we use the results of the frmg parser [7]
with IRASubcat [1], a tool that automatically acquires subcategoriza- tion frames from corpus in any language and that also allows to complete an existing lexicon. We obtain the frequencies of occurrence for each in- put and each subcategorization frame for 14,068 distinct lemmas.
Keywords: Lexicon-Grammar, syntactic lexicon, French lexicon, sub- categorization, frequency of occurence.
1 Lexicon-Grammar, LGLex lexicon and FRMG parser
Lexicon-Grammar tables are currently one of the major sources of syntactic lexi- cal information for the French language [4]. Moreover, several Lexicon-Grammar tables exist for other languages, such as Italian, Brazilian Portuguese, Modern Greek, Korean, Romanian, and others.
We improved the Lexicon-Grammar tables to make them usable in various NLP applications, in particular parsing [8]. So we generated a French syntactic lexicon for verbs, nouns playing the predicative role frozen expressions including verbal and adjectival idioms, and adverbs from the Lexicon-Grammar tables, called LGLex [3]
1.
Then, we converted the verbs and predicative nouns into the Alexina frame- work, that is the one of the Lefff lexicon (Lexicon of French inflected form) [6]
2, a large-coverage morphological and syntactic lexicon for French.
This enables its integration in the frmg parser (French MetaGrammar) [7]
3, a large-coverage deep parser for French, based on Tree Adjoining Grammars (TAG), that usually relies on the Lefff. The result is a variant of the frmg parser, that we shall call frmg
LGLex, to distinguish it from the standard frmg
Lef f f.
1
All tables are fully available under the LGPL-LR license at http://infolingu.
univ-mlv.fr/english > Language Resources > Lexicon-Grammar > Download.
2
Lefff is available at http://gforge.inria.fr/projects/alexina/.
3
frmg is available at http://gforge.inria.fr/projects/mgkit/, with a visualiza-
tion of the grammar frmg on http://alpage.inria.fr/frmgdemo.
2 Elsa Tolone, Romina Altamirano
In this article we present a method for enlarge a lexicon (with frequencies information), that is useful for parsing and others NLP applications. We show an example enlarging the verbal LGLex lexicon of French, using several corpora extracted from the evaluation campaign for French parsers Passage [5]. To do that, we use the results of the frmg parser with the IRASubcat tool [1].
2 IRASUBCAT
IRASubcat is a tool that acquires subcategorization information about the behaviour of any tag class (e.g., part of speech, syntactic function, etc.) in a corpus [1]
4. We are interested in using it to acquire information about verbs.
IRASubcat takes as input a corpus in XML format. The output of IRA- Subcat is a lexicon, also in XML format, where each of the verbs under in- spection is associated to a set of subcategorization patterns. The lexicon also provides information about frequencies of occurrence for verbs, patterns, and their co-occurrences in corpus.
Moreover, IRASubcat allows to integrate the output lexicon with a pre- existing one, merging information about verbs and patterns with information that had been previously extracted, possibly from a different corpus or even from a hand-built lexicon.
3 Experiment with IRASUBCAT and the LGLex lexicon of French
We want to use the results of frmg parser on a big corpus with IRASubcat in order to improve the LGLex lexicon of French, adding the frequencies of occurrence for each entry and each subcategorization frame. To do this, we must:
– find a corpus with millons of words (using a small part for the experiment);
– parse the corpus with the frmg parser, with and without the LGLex lexicon (i.e. only with the Lefff lexicon) – results with frmg
LGLexand frmg
Lef f f; – convert both the processed corpus and the LGLex lexicon into XML format;
– use IRASubcat in order to add the frequencies of occurrence for each entry and each subcategorization frame in the LGLex lexicon from the corpus.
3.1 Conversion of the verbal LGLex lexicon
The input is the verbal LGLex lexicon, or more precisely, the extensional lexicon of LGLex lexicon in Lefff format, which contains each inflected form of the lemma and every possible redistribution.
In the output lexicon converted into XML format as IRASubcat output lexicon, each lemma is associated to a set of subcategorization patterns. In fact,
4
IRASubcat is available at http://www.cs.famaf.unc.edu.ar/~romina/
irasubcat/.
Adding frequencies to the LGLex lexicon with IRASUBCAT 3
we simplify by omitting the realizations. So, we have only the syntactic functions because it’s more easy to find them in the processed corpus.
For each lemma represented by his identifier (for example, verb=”achever V 1 1” , which correspond to the 1st entry in the verb class 1), a count of ocur- rences of this lemma is initialized at 0 ( count oc verb=”0” ). We extracted the set of subcategorization patterns from all his inflected forms and all his redistributions and the number of different pattern is indicated (for example, different patterns=”6” ). For each pattern ( [’obj’, ’suj’] , [’obl’, ’suj’] , [’obl2’,
’suj’] and [’obl’, ’obl2’] ), a count of occurences of this pattern for this lemma and a count of occurences of this pattern for all verbs are both initialized at 0 ( count w verb=”0” total count=”0” ).
We have in total 14,068 distinct lemmas. An example of the output lexicon:
<dictionary>
<entry verb="achever___V_1_1" count_oc_verb="0">
<tag name="fs" different_patterns="6">
<pattern id="[’obj’, ’suj’]" count_w_verb="0" total_count="0"
rejected_patterns_freq_test="NO">
</pattern>
...
</tag>
</entry>
</dictionary>
3.2 Conversion of the processed corpus with the frmg parser To use the result of the parsing in NLP applications of high-level, Forest utils
5represents the forest of dependencies in format XMLDep [7]. Basically, we repre- sent in XMLDep format a graph of dependencies with nodes (lemmas), grouped in clusters (forms), with arcs describing the syntactic dependencies between nodes.
The processed corpus with frmg
LGLexused for the experiment is CPJ (Cor- pus Passage Jouet) with 100K sentences of AFP (Agence France-Presse), Eu- roparl, Wikipedia and Wikisources, extracted from the corpus of the evaluation campaign (in 2009) for French parsers Passage [5].
The input is the processed corpus CPJ with the frmg parser, more precisely, with frmg
LGLex. We want to extract only the useful information in a format directly readable by IRASubcat .
In the output in XML format, for each sentence of the corpus (for example,
<sentence ID=”12” corpus=”frwikipedia 012” s=”12”> ), we extracted the
verbs ( cat=”v” ) with their identifiers (for example, lemmaid=”achever V 1 1 ”).
For each verb, we extracted the syntactic functions and we indicated the number of arguments ( nb fs=”2” ) and then, each syntactic function ( fs ) one by one (for example, fs=”suj” for subject, and fs=”obl2” for oblique).
5