
5.3 Framework tests

5.3.2 Evaluation methodology

In this section, we would like to explain our evaluation methodology and illustrate it with a theoretical example. Table 5.5 shows an example output for a corpus containing four source words (a1 … a4). For each source word ai, we obtain a list of candidate words (bi1 … bin). The elements of this list are partially ordered. The list may be empty – this is the case for a2. We carried out a first evaluation series on these lists; they will be referred to as List statistics.

a1   b11   2      a2 → none      a3   b31*  0      a4   b41   1
     b12*  4                          b32   2           b42*  1
     b13   5                          b33   2           b43   2
                                                        b44   2
                                                        b45   5

Table 5.5: Some examples of candidate lists. Source words are represented as ai, target candidate words as bij. The integer numbers represent the distance weights given by the transducer. The correct answers, determined by the reference corpus, are marked with an asterisk. We assume that b21 would be the correct solution for a2. The List statistics refer to these candidate lists.

a1   b11   2      a2 → none      a3   b31*  0      a4   b41   1
                                                        b42*  1

Table 5.6: The cropped candidate lists used for the Top statistics. Source words are represented as ai, target candidate words as bij, correct answers marked with an asterisk.

If the expected target word occurs anywhere in that list, we consider the word pair as correctly induced. In the example, we obtain 3 successfully induced word pairs.

For most applications, such an evaluation does not seem precise enough. If the correct word appears somewhere in a long list of candidates, the lexicon induction process is still incomplete. Ideally, we should obtain one single candidate for each source word.

Therefore, we carried out a second series of evaluations. We cropped the candidate lists in order to keep, for each source word, only the highest-ranked candidate(s). Table 5.6 shows the results of this cropping for our example; we obtain 2 successfully induced word pairs. We will refer to the tests performed on the cropped candidate lists as Top statistics.
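As an illustration, the following Python sketch implements this cropping step on the example data; the function and variable names are our own and purely illustrative, not part of the framework. It keeps, for each source word, all candidates tied at the lowest distance weight, which reproduces Table 5.6:

```python
# Minimal sketch of the cropping step used for the Top statistics.
# Candidate lists map a source word to (candidate, distance weight) pairs.

def crop_to_top(candidates):
    """Keep only the candidate(s) with the lowest distance weight."""
    if not candidates:
        return []
    best = min(weight for _, weight in candidates)
    return [(word, weight) for word, weight in candidates if weight == best]

# The candidate lists of Table 5.5:
lists = {
    "a1": [("b11", 2), ("b12", 4), ("b13", 5)],
    "a2": [],
    "a3": [("b31", 0), ("b32", 2), ("b33", 2)],
    "a4": [("b41", 1), ("b42", 1), ("b43", 2), ("b44", 2), ("b45", 5)],
}

cropped = {source: crop_to_top(cands) for source, cands in lists.items()}
# cropped reproduces Table 5.6:
# a1 -> [b11], a2 -> [], a3 -> [b31], a4 -> [b41, b42] (tied at weight 1)
```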

The Top statistics give us more precise results, as there is rarely more than one candidate for each source word. In contrast, as the example illustrates for a1, we must expect lower success rates, as the cropping will eliminate some correct candidates situated lower in the list. For our two-stage framework, the Top statistics are probably more relevant, as our goal is to create unambiguous word pairs. However, the List statistics may be relevant in another context: our two-stage framework can be extended with a third stage which represents an additional filter based on syntactic information. In this case, we do not want the first two stages to be too severe. We would like to have a candidate list that contains the right answer somewhere, but we do not need it to be at the top position, as the third stage would rerank the candidates anyway. For these reasons, we find that both List and Top statistics are worth reporting.

For each model, we report three values for the List statistics and five values for the Top statistics. First, we report the absolute numbers of correctly induced word pairs: if the expected Standard German word appears in the list, the word pair is considered as successfully induced. These numbers allow us to compute precision and recall values, as well as the F-measure.

In general, precision and recall draw on two sets: the set of expected answers, given in the reference corpus, and the set of answers obtained by the model:

recall = |expected answers ∩ obtained answers| / |expected answers|

precision = |expected answers ∩ obtained answers| / |obtained answers|

In our situation, the answers are word pairs. In the example, there are 4 expected word pairs: 〈a1, b12〉, 〈a2, b21〉, 〈a3, b31〉, 〈a4, b42〉. For each source word, there is always exactly one word pair in the reference corpus. This follows from our restriction to one-to-one mappings (see Section 1.3). The List example contains 11 obtained word pairs – one for each target word: 〈a1, b11〉, 〈a1, b12〉, 〈a1, b13〉, 〈a3, b31〉, . . . We compute recall and precision as follows:

recall = |correctly induced word pairs| / |expected word pairs|

precision = |correctly induced word pairs| / |induced word pairs|

where correctly induced word pairs = expected word pairs ∩ induced word pairs; the expected word pairs are those contained in the reference corpus. The F-measure allows us to combine precision and recall in one number. It is the harmonic mean of precision and recall:

F-measure = (2 · precision · recall) / (precision + recall)
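To make these definitions concrete, here is a small Python sketch (our own illustration, not part of the framework) that computes recall, precision and F-measure over sets of word pairs and applies them to the List example:

```python
def evaluate(expected, induced):
    """Recall, precision and F-measure over sets of word pairs."""
    correct = expected & induced                      # correctly induced pairs
    recall = len(correct) / len(expected)
    precision = len(correct) / len(induced) if induced else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return recall, precision, f_measure

# The 4 expected pairs and the 11 pairs of the List example:
expected = {("a1", "b12"), ("a2", "b21"), ("a3", "b31"), ("a4", "b42")}
induced = {("a1", "b11"), ("a1", "b12"), ("a1", "b13"),
           ("a3", "b31"), ("a3", "b32"), ("a3", "b33"),
           ("a4", "b41"), ("a4", "b42"), ("a4", "b43"),
           ("a4", "b44"), ("a4", "b45")}

print(evaluate(expected, induced))  # (0.75, 0.2727..., 0.4), as in Table 5.7
```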

While we report recall figures for both the List and Top statistics, we mention precision and F-measure only for the Top statistics.

                                  List                     Top
Expected word pairs               4                        4
Induced word pairs                11                       4
Correctly induced word pairs      3                        2
Recall                            3/4 = 0.75               2/4 = 0.5
Precision                         3/11 = 0.27              2/4 = 0.5
F-measure                         0.40                     0.5
Average candidate list length     (3+0+3+5)/4 = 2.75       (1+0+1+2)/4 = 1.0

Table 5.7: Results for the example corpus. We include precision and F-measure values of the List statistics for illustration here, but will not report these values in the tests.

We believe that precision is not relevant for the List statistics; the candidate lists are not optimized for high precision and will never be used in a context where high precision is needed. For all models, the precision values are below 10%.

There is another – probably more intuitive – way to combine the ideas of precision and recall in one number: the average candidate list length. It describes how many candidates, on average, are proposed for each source word. The ideal value is 1. Lower values mean that the model often does not give any answer at all; they stand for low recall. Values higher than 1 mean that the model produces long candidate lists and therefore yields low precision. Table 5.7 illustrates these computations on our example data.
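In code, this measure is a one-liner; the following sketch (with illustrative names of our own) reproduces the List value of Table 5.7:

```python
def average_list_length(candidate_lists):
    """Average number of candidates proposed per source word."""
    return sum(len(c) for c in candidate_lists) / len(candidate_lists)

# List statistics of the example: (3 + 0 + 3 + 5) / 4 = 2.75
example = [["b11", "b12", "b13"], [],
           ["b31", "b32", "b33"],
           ["b41", "b42", "b43", "b44", "b45"]]
print(average_list_length(example))  # 2.75
```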

All of the above measures ultimately rely on binary decisions about conformity with the reference corpus: a proposed candidate is either correct or wrong. We felt that finer-grained judgements could improve the understanding of the qualities of the different models.

Moreover, the measures presented above implicitly assume that long candidate lists are due to a bad model. However, they can also be due to the fact that many words in the target word list are similar. In this case, all candidates are close to the reference translation, and even a good model would propose them all. Therefore, we examined the proposed candidates more thoroughly. We tested how close, on average, the candidates came to the reference translation: we computed the mean distance between candidates and solutions. The idea is the following: Let the reference corpus contain a word pair 〈aR, bR〉. aR is used to generate a candidate list 〈b1 … bn〉. The mean distance between candidates and solution (MDCS) for the word pair 〈aR, bR〉 is computed as follows:

MDCS(bR, b1 … bn) = ( Σ_{i=1}^{n} Dist(bR, bi) ) / n

Tested words                                 2366    100%
Lower bound (identical words)                 407    17.2%
Upper bound (cognates found in lexicon)      1687    71.3%
            (words found in lexicon)         1801    76.1%

Table 5.8: Characteristics of the test corpus.

where Dist(bR, bi) stands for the Levenshtein distance between bR and bi. The MDCS is computed separately for each word pair of the test corpus for which at least one candidate was induced. The final value is the average over all these word pairs.
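The computation can be sketched as follows in Python; the levenshtein implementation is a standard dynamic-programming version, and the word pair in the usage example is hypothetical:

```python
def levenshtein(a, b):
    """Standard Levenshtein edit distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def mdcs(reference, candidates):
    """Mean distance between the candidates and the reference solution."""
    return sum(levenshtein(reference, c) for c in candidates) / len(candidates)

# Hypothetical example: reference "haus" with candidates "haus" and "hus"
print(mdcs("haus", ["haus", "hus"]))  # (0 + 1) / 2 = 0.5
```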