
5.3 Framework tests

5.3.4 Results

The List statistics are reported in Table 5.9. These figures are computed over the whole candidate lists, which contain between 0 and 50 entries per source word. For the learning models, we do not report the impact of different corpus configurations here. Instead, we use the large cognate-only training corpus for all models except for the class model, which was trained on the small cognate corpus (the small corpus yielded better results in the transducer tests; see Section 5.2). The training parameters were the same as for the transducer tests.

The most interesting figures of the List table are the average list lengths (column L). While all static models, as well as the rule-based model, generate lists of fewer than ten entries on average, these numbers are considerably higher for most learning models, with the exception of the trigram model. We believe that its poor results are due to the high number of transitions that have never been seen in the training corpus but do occur in the test corpus. Such transitions are much less frequent in the case of the unigram and bigram models.

Model                   N      L    MDCS     P      R      F

Lower bound             407                        17.2
Upper bound            1687                        71.3
Tested words           2366                       100.0
Levenshtein             716    1.2   2.0    24.3   30.3   27.0
Vowel-sensitive         748    1.0   1.8    33.1   31.6   32.4
Alignment constraints   710    1.2   2.0    25.3   30.0   27.5
Phonetic features       643    0.8   1.9    34.3   27.2   30.3
Rules                   920    1.0   1.6    38.9   38.9   38.9
Basic EM                834    0.8   1.5    44.1   35.3   39.2
Classes                 771    1.1   1.8    30.3   32.6   31.4
Bigrams                 518    0.6   1.8     5.8   21.9   26.6
Trigrams                518    0.6   1.8     5.8   21.9   26.6
EM monolingual          699    0.9   2.1    34.3   29.6   31.8

Table 5.10: Top statistics for the framework tests. The table shows the absolute numbers of correct target words induced (N), the average lengths of the candidate lists (L), and the average Levenshtein distance between the proposed and the correct target words (MDCS). The three rightmost columns represent percentage values of precision (P), recall (R), and F-measure (F).
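As a rough illustration of how the columns of Table 5.10 can be derived from the candidate lists, the following sketch uses hypothetical data structures (it is not the evaluation code used in the experiments) and assumes that precision is the fraction of proposed candidates that are correct and recall the fraction of tested source words whose correct target appears in the list; these definitions are consistent with the figures in the table.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance with unit insertion/deletion/substitution costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def top_statistics(candidates: dict[str, list[str]], gold: dict[str, str]):
    """candidates: source word -> minimal-cost candidate list (possibly empty);
    gold: source word -> correct target word."""
    n_words = len(gold)
    proposed = sum(len(c) for c in candidates.values())
    # N: source words for which the correct target appears in the list
    correct = sum(1 for src, cands in candidates.items() if gold[src] in cands)
    distances = [levenshtein(cand, gold[src])
                 for src, cands in candidates.items() for cand in cands]
    L = proposed / n_words                                        # average list length
    mdcs = sum(distances) / len(distances) if distances else 0.0  # mean distance to solution
    P = 100.0 * correct / proposed if proposed else 0.0           # precision (%)
    R = 100.0 * correct / n_words                                 # recall (%)
    F = 2 * P * R / (P + R) if P + R else 0.0                     # F-measure (%)
    return correct, L, mdcs, P, R, F
```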

The monolingual model generates particularly long lists. These probably result from the chosen training setup: the model is not trained on the correct word pairs, but on the maximal set of word pairs that best fits the transducer. The model is thus trained to return as many candidate strings as possible, rather than the correct ones.

The MDCS seems to correlate with the candidate list lengths: the more candidates are proposed, the more distant they are, on average, from the solution. Unfortunately, the MDCS figures are too similar to reveal more interesting patterns.

It is not surprising that the models with large candidate lists (basic EM, bigrams, EM monolingual) obtain high recall rates – it is more likely to find the right word in a long list than in a short one. The class model and the rule model present promising ratios between average list length and recall. We will now see whether the Top statistics confirm this intuition.

Table 5.10 reports the Top statistics. Recall that these figures are computed over the subset of the candidate lists with minimal transduction costs (or maximal transduction probabilities). The F-measure scores combine precision and recall and best represent the overall model performance.
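The F-measure used here appears to be the standard balanced (harmonic-mean) F-measure; for instance, the basic EM row of Table 5.10 is consistent with

\[
F = \frac{2PR}{P + R},
\qquad
F_{\text{basic EM}} = \frac{2 \times 44.1 \times 35.3}{44.1 + 35.3} \approx 39.2 .
\]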

The figures of the static models confirm some of the results that we obtained in the transducer tests (Section 5.2). For instance, the additional alignment constraints (27.5% F-measure) do not improve the results much compared to the simple Levenshtein model (27.0%). While the phonetic feature model yields lower recall rates (27.2%) than the Levenshtein baseline (30.3%), it obtains good precision scores (34.3% vs. 24.3%) thanks to short candidate lists. In contrast to the transducer tests, the vowel-sensitive variant obtains noticeably better results than the baseline (32.4% vs. 27.0%). These results are also due to short candidate lists and, consequently, high precision values.

The rule-based model shows good recall (38.9%), as in the transducer tests, and a good balance between precision and recall. However, its precision (38.9%) is lower than the precision rates of the learning models (up to 50.2%), due to rather long candidate lists.

As for the learning models, the basic unigram model works very well. Its Top statistics show very short candidate lists (0.8) and high precision rates (44.1%), and yield a very good overall performance (39.2% F-measure), comparable with the rule-based model (38.9% F-measure). The class model does not outperform the best static model (31.4% vs. 32.4% for the vowel-sensitive model). Even though it performed as well as the basic model in the transducer tests (Tables 5.3 and 5.4) and obtained a good ratio of list length to recall in the List statistics (Table 5.9), it does not show these good results in the Top statistics, mostly because of a relatively poor recall rate.

The bigram model obtains surprisingly good results. It surpasses the basic EM model in precision (50.2% vs. 44.1%) and in recall (40.3% vs. 35.3%), at comparable candidate list lengths (0.8 for both models). These scores are surprising in light of the mixed results this model obtained in the transducer tests. Nevertheless, they show that the two tasks rely on different characteristics of the transducer and are not comparable in every case. It would be interesting to test a class bigram model, which would be more robust when training data is scarce. The lack of training data clearly shows up in the trigram model: its results are worse than the baseline (26.6% vs. 27.0% F-measure).

The model trained in an unsupervised manner on a monolingual training corpus did not achieve good results either. Its performance is comparable to that of the static, language-independent models.

All models except the alignment constraint model and the EM monolingual model obtained significantly different results from the baseline Levenshtein model (paired McNemar test, p < 0.006 at most). Moreover, the difference between the rule-based model and the basic EM model, as well as the differences between the basic EM model and the other learning models, are significant (paired McNemar test, p < 0.0001 at most). These results hold for the List statistics as well as for the Top statistics.
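For reference, the following is a minimal sketch of how such a paired McNemar test could be carried out, assuming one boolean per tested source word indicating whether each model found the correct target; it is an illustration of the test, not the statistical code behind the reported p-values.

```python
from math import comb

def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Two-sided exact McNemar p-value computed from the discordant pairs."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)  # A right, B wrong
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)  # B right, A wrong
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    # exact binomial tail under the null hypothesis p = 0.5
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```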

As noted above, the Mean Distance between Candidates and Solution (MDCS) is quite constant across the models; the models with the best F-measure values also obtain the lowest MDCS values (e.g. the bigram EM model, the basic EM model, and the rule-based model).

However, it is surprising to see values around 2 even in the Top statistics; we would have expected lower values. These results are likely an effect of using the arithmetic mean; other means cannot be applied because they are undefined or degenerate for distances of 0 (which occur every time a candidate is identical to the solution).
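To make the constraint explicit, writing d_i for the distance of the i-th candidate to the solution:

\[
\text{geometric mean: } \Bigl(\prod_{i=1}^{n} d_i\Bigr)^{1/n} = 0 \ \text{ as soon as one } d_i = 0,
\qquad
\text{harmonic mean: } \frac{n}{\sum_{i=1}^{n} 1/d_i} \ \text{ undefined if any } d_i = 0 .
\]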