• Aucun résultat trouvé

Evaluating and improving syntactic lexica by plugging them within a parser

N/A
N/A
Protected

Academic year: 2022

Partager "Evaluating and improving syntactic lexica by plugging them within a parser"

Copied!
21
0
0

Texte intégral

(1)

INRIA

Evaluating and improving syntactic lexica by plugging them within a parser

Elsa Tolone, Benoît Sagot, Éric de La Clergerie

LIGM, Université Paris-Est, France & FaMAF, Universidad Nacional de Córdoba, Argentine INRIA Paris-Rocquencourt

http://alpage.inria.fr

LREC’12 Istanbul May 23-25 2012

INRIA É. de la Clergerie Evaluating lexica 24/05/12 1 / 19

(2)

INRIA

Deep parsing with FRMG (and L EFFF )

FRMGis a deep wide-coverage TAG parser for French, co-developed withLEFFFlexicon

F1|de F2|quoi

F3|fait

F4|-il

F5|prendre

F6|conscience

F7|à

F8|Marie

F9|?

de:prep

_:S ?:_

_PERSON_m:np faire:v

conscience:ncpred

cln:cln quoi?:pri

à:prep prendre:v

subject

S2 preparg

subject

void causative_prep

S

ncpred N2

de quoi fait-il prendre conscience à Marie ? What does he make Mary become aware of ?

To be tried athttp://alpage.inria.fr/parserdemo

INRIA É. de la Clergerie Evaluating lexica 24/05/12 2 / 19

(3)

INRIA

Evaluating lexica ?

FRMGdepends onLEFFFsyntactic lexicon (Sagot), and largely co-developped with it, but

can we useFRMGwith other lexica ? is it easy to plug a new lexicon ?

can we then run fine-grained evaluations of syntactic lexica ? can we use feedback information to improve a lexicon ? We tried to answer these questions, with 3 new lexica:

LGLEX,DICOVALENCE, andNEWLEFFF

INRIA É. de la Clergerie Evaluating lexica 24/05/12 3 / 19

(4)

INRIA

FRMG: a French meta-grammar

A large-coverage Meta-grammar for French

abstract descriptive layer, constraint-based, modularity, inheritance generation of a TAG/TIG grammar

extended domain of locality, capture of subcategorization frames withfactorizedtrees

I current version: 323 trees (and only 38 verbal trees)

I one tree≡many ordinary TAG trees

I ⇒one verbal tree stands for many subcat frames, arg positions, realizations, . . .

S

|

↓NP0 cln ↓S

VP V

3v

##

free order

↓NP1 ↓PP2

INRIA É. de la Clergerie Evaluating lexica 24/05/12 4 / 19

(5)

INRIA

Coupling FRMG with a lexicon: Hypertags

FRMG

hypertag #286

arg0 arg0

extracted - fun fun0

kind kind0 subj |nosubj

pcas -

real real0 -|CS|N2|PP|S|cln|prel|pri

arg1 arg1

extracted - fun fun1

kind kind1 - |acomp|obj|prepacomp|prepobj

pcas pcas1 +| - |apres|à|avec|de|par |. . . real real1 -|CS|N|N2|PP|S|V|adj|cla|. . .

arg2 arg2

extracted - fun fun2

kind kind2- |prepacomp| prepobj |prepscomp

| prepvcomp | scomp| vcomp | wh- comp

pcas pcas2 -|+|apres|à |...

real real2 -|CS|N|N2|PP|S|...

cat v

diathesis active refl refl ctrsubj ctr

imp imp

LEFFF

hypertag «promettre»

arg0

fun suj kind subj|- pcas -

arg1

fun obj

kind obj|scomp|- pcas -

arg2

fun objà kind prepobj|- pcas à|-

refl - ctrsubj suj

imp -

INRIA É. de la Clergerie Evaluating lexica 24/05/12 5 / 19

(6)

INRIA

Coupling FRMG with a lexicon: Hypertags

FRMG

hypertag #286

arg0 arg0

extracted - fun fun0

kind kind0 subj |nosubj

pcas -

real real0 -|CS|N2|PP|S|cln|prel|pri

arg1 arg1

extracted - fun fun1

kind kind1 - |acomp|obj|prepacomp|prepobj

pcas pcas1 +| - |apres|à|avec|de|par |. . . real real1 -|CS|N|N2|PP|S|V|adj|cla|. . .

arg2 arg2

extracted - fun fun2

kind kind2- |prepacomp| prepobj |prepscomp

| prepvcomp | scomp| vcomp | wh- comp

pcas pcas2 -|+|apres|à |...

real real2 -|CS|N|N2|PP|S|...

cat v

diathesis active refl refl ctrsubj ctr

imp imp

LEFFF

hypertag «promettre»

arg0

fun suj kind subj|- pcas -

arg1

fun obj

kind obj|scomp|- pcas -

arg2

fun objà kind prepobj|- pcas à|-

refl - ctrsubj suj

imp -

INRIA É. de la Clergerie Evaluating lexica 24/05/12 5 / 19

(7)

INRIA

Coupling FRMG with a lexicon: Hypertags

FRMG

hypertag #286

arg0 arg0

extracted - fun fun0 suj

kind kind0 subj|nosubj

pcas -

real real0 -|CS|N2|PP|S|cln|prel|pri

arg1 arg1

extracted - fun fun1 obj

kind kind1 - |acomp| obj |prepacomp|prepobj

pcas pcas1 +| - |apres|avec|de|par|. . . real real1 -|CS|N|N2|PP|S|V|adj|cla|. . .

arg2 arg2

extracted - fun fun2 objà

kind kind2 - |prepacomp | prepobj | prep-

scomp|prepvcomp|scomp|vcomp

|whcomp

pcas pcas2 - |+|apres| à |...

real real2 -|CS|N|N2|PP|S|...

cat v

diathesis active refl refl - ctrsubj ctr suj imp imp -

LEFFF

hypertag «promettre»

arg0

fun suj kind subj|- pcas -

arg1

fun obj

kind obj|scomp|- pcas -

arg2

fun objà kind prepobj|- pcas à|-

refl - ctrsubj suj

imp -

INRIA É. de la Clergerie Evaluating lexica 24/05/12 5 / 19

(8)

INRIA

Alexina and Lefff (Sagot)

ALEXINAis a lexical formalism

with anintensionallevel for lemma

and the generation of anextensionallevel for forms

the descriptions use a set of primitive features, and macros

LEFFFis a wide-coverage morphosyntactic and syntactic lexicon for French, covering all categories

LEFFFis partially factorized:

one entry may cover several meanings and several subcat frames

⇒5,736 entries for 5,450 distinct ones (intensional level)

Freely available athttp://gforge.inria.fr/projects/alexina/

INRIA É. de la Clergerie Evaluating lexica 24/05/12 6 / 19

(9)

INRIA

Lefff example: promettre

Intensional level

p r o m e t t r e v55 100;Lemma ; v ;

<Suj : c l n|scompl|s i n f|sn ,

Obj : ( c l a|de−s i n f|scompl|sn ) , Objà : ( c l d|à−sn ) >;

@CtrlSujObj , c a t =v ;

%a c t i f ,% p a s s i f ,%ppp_employé_comme_adj,% p a s s i f _ i m p e r s o n n e l Extensional level

promet 100 v

[ pred= " promettre_____1<Suj : c l n|scompl|s i n f|sn , Obj : ( c l a|de−s i n f|scompl|sn ) , Objà : ( c l d|à−sn ) > " ,

@CtrlSujObj , @pers , c a t =v , @P3s ] A macro

@CtrlSujObl = [ c t r s u b j = s u j ] ;

INRIA É. de la Clergerie Evaluating lexica 24/05/12 7 / 19

(10)

INRIA

LgLex

Resulting from the conversion of LADL tables (Gross)

1 first, into LGlex format (ConstantandTolone)

2 then, into alexina format (SagotandTolone) Wide-coverage lexicon

kind #tables #entries #lemma

verbs 67 13,867 5,738

pred. nouns 78 12,696 8,531 Fine-grained: many entries for some verbs

53 entries fortenir(LEFFF: 6 entries)

Some features can’t be represented inALEXINAand/or exploited inFRMG

for instance: added determiner for predicative nouns inFRMG

missing: semantic restrictions

Freely available athttp://infolingu.univ-mlv.fr

INRIA É. de la Clergerie Evaluating lexica 24/05/12 8 / 19

(11)

INRIA

Dicovalence

Developed byMertensandvan den Eynde based on pronominal approach (Benveniste) fine grained: one entry for one meaning small coverage

3,738 verbs for 8,313 entries

INRIA É. de la Clergerie Evaluating lexica 24/05/12 9 / 19

(12)

INRIA

New Lefff

Evolution ofLEFFFtowards a more semantic lexicon finer-grained: one meaning per entry

automatic merge ofLEFFFandDICOVALENCE, plus manual validation of 505 verbs (986 entries):

I the 100 most frequent lemmas

I thedubious lemmas: more output entries than the sum of corresponding LEFFFandDICOVALENCEentries

still a wide-coverage lexicon 7,933 verbs, for 12,613 entries

INRIA É. de la Clergerie Evaluating lexica 24/05/12 10 / 19

(13)

INRIA

The lexica at a glance

Verbs from each lexicon:

Lexica #Entries #Lemmas Ratio

LEFFF 7,108 6,827 1.04

LGLEX 13,867 5,738 2.41

DICOVALENCE 8,313 3,738 2.22

NEWLEFFF 12,613 7,933 1.58

Other categories imported fromLEFFF, shared by all lexica

Category #Int. Entries #Lemmas #Ext. Entries

nouns 41,816 41,592 86,675

adjectives 10,556 10,517 34,359

adverbs 4,111 3,676 4,155

prepositions 260 259 728

proper nouns 52,499 52,202 52,571

other 1,007 854 1,589

INRIA É. de la Clergerie Evaluating lexica 24/05/12 11 / 19

(14)

INRIA

EASy/Passage evaluation campaign

French parsing evaluation campaigns organized within EASy and Passage actions.

We use the EASy reference corpus as benchmark

around 4K sentences, manually annotated (but with errors !) various styles: journalistic, literacy, medical, mail, oral, questions constituency and dependency based format (shallow level)

I 6 kinds ofchunks: GN, NV, GA, GR, GP, PV

I 14 kinds ofrelations: SUJ-V, AUX-V, COD-V, ATB-SO , CPL-V, MOD-V, MOD-N, MOD-A, MOD-R, MOD-P, COMP, COORD, APPOS, JUXT

INRIA É. de la Clergerie Evaluating lexica 24/05/12 12 / 19

(15)

INRIA

Overall results

Setting:

each sentence segmented withSXPIPE, no prior tagging, use of a lexicon

FRMGreturns either

I full parses (possibly by relaxing some agreement constraints)

I sequences of partial parses, covering the sentence

I nothing in case of timeout (100s)

whenever possible,FRMGreturns a shared dependency forest of all possibilities

then heuristic-based disambiguisation and conversion to Passage format

Cover. Chunks Rels Time

Lexicon (%) (%) (%) (s)

LEFFF 83.45 89.03 66.76 0.35

NEWLEFFF 82.19 88.74 66.09 0.55

LGLEX 80.61 87.89 63.19 1.10

DICOVALENCE 71.44 88.08 64.49 0.38 Old DICOVALENCE 65.69 87.06 62.72 0.42

INRIA É. de la Clergerie Evaluating lexica 24/05/12 13 / 19

(16)

INRIA

Analysis per verbal relation (F-measure)

SUJ-V AUX-V COD-V CPL-V ATB-SO MOD-N

0 20 40 60 80 100

79 .2

91 .9

72 .7

63 .1

65

.3 70

78 .8

91 .6

71 .8

62 .7

60

.2 69

.6 77

.7

89 .9

62 .7

58 .1

47 .6

67 .8 77

.4

89

66 .9

62 .8

51 .7

69 .4 76

.1

87 .2

66 .1

62 .7

7.7

68 .1

verbal relations

LEFFF NEWLEFFF LGLEX DICOVALENCE Old DICOVALENCE

INRIA É. de la Clergerie Evaluating lexica 24/05/12 14 / 19

(17)

INRIA

Experiments on French TreeBank

Evaluation on CONLL dependency version of FTB (journalistic style) richer set of verbal dependencies, but still shallow level

1 de de P P 5 de_obj

2 quoi quoi? PRO PROWH 1 obj

3 fait faire V V 5 aux_caus

4 -il -il CL CLS 5 suj

5 prendre prendre V V 0 root

6 conscience conscience N NC 5 obj

7 à à P P 5 mod

8 Marie marie N NPP 7 obj

9 ? ? PONCT PONCT 5 ponct

On FTB test part (1200 sentences)

Lexicon Cover.(%) LAS(%) Time(s) New LAS (%) δ(%)

LEFFF 89.53 82.21 0.61 85.82 4.4

NEWLEFFF 88.76 81.36 0.94 83.83 3.0

LGLEX 86.73 78.75 1.95 81.82 3.9

DICOVALENCE 75.28 79.38 0.69 - -

MST - 88.20 - - -

INRIA É. de la Clergerie Evaluating lexica 24/05/12 15 / 19

(18)

INRIA

Analysis per CONLL verbal relation

Recall

aff ato aux_causats

aux_pass aux_tps

de_obj

dep

mod obj p_obj root suj

a_obj

10 20 4030 50 60 70 80 90

100

Lefff LGLex

NewLefff Dicovalence

Precision

aff ato aux_causats

aux_pass aux_tps

de_obj

dep

mod obj p_obj root suj

a_obj

10 20 4030 50 60 70 80 90

100

Lefff LGLex

NewLefff Dicovalence

INRIA É. de la Clergerie Evaluating lexica 24/05/12 16 / 19

(19)

INRIA

Error mining

Original motivation: find lexical entries that are incorrect or incomplete through full parse failures:

a form is suspect if occurring more often than expected in failed sentences, in co-occurrence with non-suspect forms

;fix-point iterative algorithm, close to EM (Expectation-Maximization) and return the best sentences where a form is the main suspect

⇒WEB-based interface to browse suspects, lexical info, and sentences May be used for any lexica, but can also be adapted for contrasting lexica

a verb is suspect for lexicon L if occurring more often than expected in failed sentences that succeed forLEFFF, in co-occurrence with non-suspect verbs.

Tried on a 100Ksent. toy corpus (wikipedia, wikisource, europarl, AFP news) but could be tried on CPC (100Mwords) or even bigger (700Mwords)

INRIA É. de la Clergerie Evaluating lexica 24/05/12 17 / 19

(20)

INRIA

Some suspects (for L G L EX )

A first typology of errors on the first 15th suspects forLGLEX: missing entries in the expected LADL table

I réaffirmer(to reaffirm28),réélire(to reelect, 10),mixer(to mix, 7),zapper(to omit, 4). . .

la mémoire . . . qui zappe les détails . . . the memory . . . which omits the details . . . existing entries, but missing coding information

I susciter(to spark off, 41; 36DT & 38R),recruter(to recruit, 14; 38R), . . . mandatory args (forLGLEX), but missing ones in the sentences

kidnapper(to kidnap, 12; 36DT N0 V N1 Prep N2) :Les deux Italiens ont étékidnappésle 18 décembre

misc. situations

ForNEWLEFFF: most significant error isestimer(to consider) : missing clausal argument (to consider that S)

INRIA É. de la Clergerie Evaluating lexica 24/05/12 18 / 19

(21)

INRIA

Conclusion

Relatively easy to plug new (alexina-based) lexica intoFRMG

Rather good results for all lexica, even if lower than withLEFFF

(better thanFRMG+LEFFFin 2007 Passage campaign) Room for needed and normal co-adaptationFRMG-lexicon Future:

specific training per lexica

completing lexica and/or merging information (error mining, evaluation) better factorization of lexical entries inLGLEX, delaying use of more semantic entries at disambiguisation time

assign probabilities to entries and/or frames enrichFRMGwith some new features, to take into account richer lexical information

Thank you

INRIA É. de la Clergerie Evaluating lexica 24/05/12 19 / 19

Références

Documents relatifs

Improving Exposure Assessment Using Non-Targeted and Suspect Screening: The ISO/IEC 17025: 2017 Quality Standard as a Guideline... Improving Exposure Assessment Using Non-Targeted

We calculated type precision, type recall and F-measure against the gold standard, and obtained 0.79 precision, 0.55 recall and 0.65 F-measure. These results are shown in table 1,

We compared our lexicon against the Tr´esor de la Langue Franc¸aise Informatis´e (TLFI) - a freely available French lexicon containing verbal SCF information from a dictionary. We

◮ FRMG lglex ’s verb entries are the result of a conversion process from the original tables. → this conversion process certainly

This integration required a conversion step so as to extract the syntactic information encoded in Lexicon-Grammar tables and represent it in the NLP lexical formalism used by FRMG

We present some evaluation results for four French syntactic lexica, obtained through their conversion to the Alexina format used by the Lefff lexicon (Sagot, 2010), and

a verb is suspect for lexicon L if occurring more often than expected in failed sentences that succeed for L EFFF , in co-occurrence with non-suspect verbs. Tried on

This integration required a conversion step so as to extract the syntactic information encoded in Lexicon-Grammar tables and represent it in the NLP lexical formalism used by FRMG