French parsing enhanced with a word clustering method based on a syntactic lexicon
Anthony Sigogne, Matthieu Constant, Éric Laporte
LIGM, University Paris-Est, France
October 6 2011
1 Introduction
2 Parsing resources
3 A French Syntactic Lexicon, The Lexicon-Grammar
4 Experiments
5 Conclusion and future work
Introduction
Constituent parsing
State-of-the-art performance on French is about 87 (F1 score), obtained by PCFG parsers combined with corpus pre-processing methods.
Pre-processing methods: word clustering
Replacing every word of the corpus by a cluster of words.
For example, a word can be replaced by the combination tag+lemma (Candito and Crabbé 2009; Candito and Seddah 2010).
⇒ Goal: reducing the effect of lexical data sparseness.
⇒ Result: improving the performance of PCFG parsers.
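The tag+lemma replacement can be illustrated with a minimal sketch. The toy corpus, the tag names and the "+" separator are assumptions made for the illustration; they are not taken from the cited papers.

```python
from collections import Counter

def cluster_token(tag: str, lemma: str) -> str:
    """Build the cluster symbol that replaces a word form (tag+lemma)."""
    return f"{tag}+{lemma}"

# Toy tagged corpus: (word form, POS tag, lemma). Illustrative values only.
corpus = [
    ("mange", "V", "manger"),
    ("mangeons", "V", "manger"),
    ("mangent", "V", "manger"),
    ("pommes", "NC", "pomme"),
    ("pomme", "NC", "pomme"),
]

# Vocabulary seen by the parser before and after the replacement.
forms = Counter(form for form, _, _ in corpus)
clusters = Counter(cluster_token(tag, lemma) for _, tag, lemma in corpus)

print(len(forms), "distinct word forms")
print(len(clusters), "distinct clusters")
```

Inflected forms of the same lemma collapse into one symbol, so each symbol is observed more often than any single word form. This is the sense in which clustering reduces lexical data sparseness.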
The problem
Few works on PCFGs report the use of an external syntactic lexicon.
Aim: integrate data from a large-scale external syntactic lexicon into a PCFG parser:
the parser: Berkeley (Petrov et al. 2006);
the lexicon for French: the Lexicon-Grammar (Gross 1994).
⇒ Creation of word clusters according to the LG data.
Parsing resources
A PCFG parser: the Berkeley Parser (Petrov et al. 2006) [BKY]
Goal: for each sentence, BKY generates the best parse according to a non-lexicalized PCFG model.
Algorithm: iteratively creates several grammars with an increasingly complex tagset (Matsuzaki et al. 2005).
The French Treebank (Abeillé et al. 2003) [FTB]
Articles from the newspaper Le Monde.
Each sentence is annotated with a constituent tree.
FTB-UC: 12,531 sentences (350,931 words), 28 POS tags, 12 phrase types.
A French Syntactic Lexicon, The Lexicon-Grammar
The LG is a set of tables:
each table represents a syntactic class;
a class is composed of lexical entries sharing syntactic properties;
a lexical entry is a lemmatized form with syntactic properties.
Meaning separation
A lexical form may appear in several lexical entries.
→ 67 tables of verbs (5,923 lexical forms and 13,862 entries)
Example
The verb voler has two distinct meanings in French:
to fly, intransitive;
to steal, two complements.
⇒ Two lexical entries for the form voler, in two distinct classes.
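One way to picture this organization is a mapping from a lexical form to its entries, each entry carrying the identifier of the table it belongs to. The sketch below is only an illustration: the class name LexicalEntry, the helper tables_of and the identifiers TABLE_A and TABLE_B are hypothetical placeholders, not actual LG table names.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class LexicalEntry:
    lemma: str   # lemmatized form
    table: str   # identifier of the LG table (syntactic class)
    gloss: str   # informal description of the meaning

# TABLE_A and TABLE_B are placeholders, not real LG table identifiers.
entries = [
    LexicalEntry("voler", "TABLE_A", "to fly (intransitive)"),
    LexicalEntry("voler", "TABLE_B", "to steal (two complements)"),
]

# Index the entries by lexical form: one form may have several entries.
lexicon = defaultdict(list)
for entry in entries:
    lexicon[entry.lemma].append(entry)

def tables_of(lemma: str) -> list:
    """Return the sorted table identifiers associated with a lexical form."""
    return sorted(entry.table for entry in lexicon.get(lemma, []))

print(tables_of("voler"))
```

The list returned for a form is what a clustering method can exploit: it summarizes the syntactic classes the form may belong to, without deciding which meaning is used in a given sentence.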
Hierarchy of verb tables
Manually constructed for the verb tables.
Each level contains classes that group LG tables.
Goal ⇒ reducing the number of classes associated with each lexical form.
→ Hierarchy of verbs: 4 levels.
Exploitation of the Lexicon-Grammar data
Table identifiers of the lexical entries are important hints about their syntactic behavior:
Verbs belonging to table 31R are intransitive.
Verbs of table 36DT → to give something to someone.
Approach: LexClust
Each verbal form in the corpus is replaced by the combination of:
its POS tag;
its class identifier(s) at a specific level of the hierarchy.
Example
sanctionner (to punish) belongs to tables 6 and 12:
#tag 6 12 at level 0
#tag QTD2 at level 1
#tag TD2 at levels 2 and 3
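The sanctionner example can be reproduced with a short sketch. It is not the authors' implementation: the names HIERARCHY, VERB_TABLES and lexclust_symbol are hypothetical, the hierarchy fragment covers only tables 6 and 12 as read off the example, and the "_" separator is an assumption because the separator is not recoverable from the slide.

```python
# Hierarchy fragment read off the example: level -> {table -> class}.
# Level 0 keeps the original LG table identifiers, so it needs no mapping.
HIERARCHY = {
    1: {"6": "QTD2", "12": "QTD2"},
    2: {"6": "TD2", "12": "TD2"},
    3: {"6": "TD2", "12": "TD2"},
}

# Tables of each verb lemma (only the verb of the example is listed).
VERB_TABLES = {"sanctionner": ["6", "12"]}

def lexclust_symbol(tag, lemma, level, sep="_"):
    """Return the symbol replacing a verb, or None if the lemma is unknown."""
    tables = VERB_TABLES.get(lemma)
    if tables is None:
        return None  # not in the lexicon: the caller keeps the word as is
    if level == 0:
        classes = tables
    else:
        classes = [HIERARCHY[level][table] for table in tables]
    # Several tables may map to the same class: keep each class once.
    unique = list(dict.fromkeys(classes))
    return sep.join([f"#{tag}"] + unique)

for level in range(4):
    print(level, lexclust_symbol("V", "sanctionner", level))
```

Moving up the hierarchy merges tables 6 and 12 into a single class, which is how the number of distinct symbols in the corpus lexicon shrinks from one level to the next.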
Experiments
⇒ Evaluate the impact of the clustering method LexClust on the FTB-UC with BKY.
Evaluation metrics
Cross-validation evaluation (corpus split into 10 equal parts).
PARSEVAL F1 score (punctuation nodes not included).
Unlabeled and Labeled Attachment Scores [UAS/LAS].
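As a reminder of what these metrics measure, here is a minimal sketch of a labeled-bracket F1 and of the attachment scores. It is a simplified illustration, not the evaluation tool used in the experiments: the function names and toy data are made up, punctuation is assumed to have been filtered out beforehand, and scores are returned in the range 0 to 1 rather than as percentages.

```python
from collections import Counter

def parseval_f1(gold, predicted):
    """Labeled bracket F1; each bracket is a (label, start, end) triple."""
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    matched = sum((gold_counts & pred_counts).values())  # multiset overlap
    if matched == 0:
        return 0.0
    precision = matched / sum(pred_counts.values())
    recall = matched / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)

def attachment_scores(gold, predicted):
    """Return (UAS, LAS); each token is a (head_index, label) pair."""
    assert gold and len(gold) == len(predicted)
    pairs = list(zip(gold, predicted))
    uas = sum(g[0] == p[0] for g, p in pairs) / len(pairs)  # head only
    las = sum(g == p for g, p in pairs) / len(pairs)        # head and label
    return uas, las

# Toy constituents: the predicted tree mislabels the last phrase.
gold_brackets = [("SENT", 0, 5), ("NP", 0, 2), ("VN", 2, 3), ("NP", 3, 5)]
pred_brackets = [("SENT", 0, 5), ("NP", 0, 2), ("VN", 2, 3), ("PP", 3, 5)]
print(parseval_f1(gold_brackets, pred_brackets))

# Toy dependencies: all heads are right, one label is wrong.
gold_deps = [(2, "suj"), (0, "root"), (2, "obj")]
pred_deps = [(2, "suj"), (0, "root"), (2, "mod")]
print(attachment_scores(gold_deps, pred_deps))
```

UAS only checks the head of each token, whereas LAS also requires the dependency label to match, so LAS can never exceed UAS.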
Detecting verbal forms in raw texts
Verbal POS tags: POS tagger MElt (Denis and Sagot 2009).
Lemmas: dictionary Dela (Courtois and Silberztein 1990).
Results
Method      #classes  #lexicon  F1     UAS    LAS    F1<40
Baseline    -         27,143    83.82  89.43  85.85  86.12
LexClust 0  67        24,743    84.11  89.67  86.10  86.53
LexClust 1  13        22,318    84.33  89.77  86.22  86.62
LexClust 2  10        21,833    84.44  89.87  86.32  86.76
LexClust 3  4         20,556    84.26  89.64  86.10  86.57
Clust       -         1,987     85.22  90.26  86.70  87.39
⇒ Slight improvements with the original classes of the LG.
⇒ Best LexClust results obtained with level 2 of the hierarchy (while considerably reducing the size of the corpus lexicon); results are significant (p-value < 0.1).
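The size of these effects can be read directly from the table. The sketch below copies the F1 and lexicon-size columns for the baseline and the LexClust rows and computes, for each hierarchy level, the F1 gain over the baseline and the relative reduction of the corpus lexicon. The names results and compare are illustrative.

```python
# F1 and corpus lexicon size copied from the results table.
results = {
    "Baseline":   {"lexicon": 27143, "F1": 83.82},
    "LexClust 0": {"lexicon": 24743, "F1": 84.11},
    "LexClust 1": {"lexicon": 22318, "F1": 84.33},
    "LexClust 2": {"lexicon": 21833, "F1": 84.44},
    "LexClust 3": {"lexicon": 20556, "F1": 84.26},
}

def compare(method, baseline="Baseline"):
    """Return (F1 gain in points, lexicon reduction in percent)."""
    base, row = results[baseline], results[method]
    gain = row["F1"] - base["F1"]
    reduction = 100 * (1 - row["lexicon"] / base["lexicon"])
    return gain, reduction

lexclust_rows = [name for name in results if name.startswith("LexClust")]
for name in lexclust_rows:
    gain, reduction = compare(name)
    print(f"{name}: F1 {gain:+.2f}, lexicon -{reduction:.1f}%")

best = max(lexclust_rows, key=lambda name: results[name]["F1"])
print("Best LexClust level:", best)
```

For level 2 this gives a gain of 0.62 F1 points with a lexicon about 19.6% smaller than the baseline.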
Conclusion
A syntactic lexicon like the LG can improve the performance of a PCFG parser on French.
The gains are mainly obtained with the help of a hierarchy of verb tables.
Future work
Using other information contained in the lexicon (preposition, ...).
Automatic creation of a hierarchy of verb tables.
Reproducing the experiments with other grammatical categories.
Thank you for your attention!
References
A. Abeillé, L. Clément, and F. Toussenel. 2003. Building a treebank for French. In Anne Abeillé, editor, Treebanks: building and using parsed corpora, Kluwer, Dordrecht.
M. Candito and B. Crabbé. 2009. Improving generative statistical parsing with semi-supervised word clustering. In Proceedings of the 11th International Conference on Parsing Technology (IWPT'09), pages 138–141.
M. Candito and D. Seddah. 2010. Parsing word clusters.
B. Courtois and M. Silberztein. 1990. Dictionnaires électroniques du français. Présentation. In Larousse, editor, Langue Française.
P. Denis and B. Sagot. 2009. Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort. In PACLIC 2009, Hong Kong.
M. Gross. 1994. Constructing Lexicon-grammars. In Atkins and Zampolli, editors, Computational Approaches to the Lexicon, pages 213–263.
T. Matsuzaki, Y. Miyao, and J. Tsujii. 2005. In Proceedings of ACL-05, pages 75–82, Ann Arbor, USA.
S. Petrov, L. Barrett, R. Thibaux, and D. Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia.