

5.2.3 Learning models

In Section 4.3, we proposed some models that are adapted to a specific language pair by using a training corpus and a learning algorithm. The performance of these models is thus determined by the nature of the training corpus and the settings of the learning algorithm.

The nature of the corpus can be defined by its size and its quality. We claim that larger training corpora yield better models than smaller ones, and that corpora without noise yield better models than noisy ones. In order to test this claim, we trained our models on different corpora. For the size parameter, we used training corpora of about 200 word pairs and of about 2000 word pairs. Mann and Yarowsky (2001) used an intermediate corpus size of 300-700 word pairs. We have defined the quality of the corpus by the noise it contains. In our case, noise represents non-cognate word pairs. As these pairs do not have a common phonetic structure, the alignments between them are not based on regular phonetic correspondences, but are determined randomly. Hence, training a model on such random alignments can degrade its performance. The size and quality parameters allow us to build four different training corpora, represented by the first four columns of Tables 5.3 and 5.4.
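As an illustration, the following sketch shows how the four training configurations could be assembled. The list structure and the cognate flag are hypothetical; they simply encode the size and quality parameters described above.

    import random

    def build_corpora(word_pairs, small_size=200, large_size=2000):
        """Build the four training configurations (size x quality).

        `word_pairs` is assumed to be a list of (dialect, standard,
        is_cognate) tuples; this representation is hypothetical.
        """
        cognates = [(d, s) for d, s, cog in word_pairs if cog]
        full = [(d, s) for d, s, _ in word_pairs]
        return {
            ("cognate", "small"): random.sample(cognates, small_size),
            ("cognate", "large"): random.sample(cognates, large_size),
            ("full", "small"): random.sample(full, small_size),
            ("full", "large"): random.sample(full, large_size),
        }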

The unigram model and the class model were trained on all four corpora. The bigram and trigram models were only trained on the large corpora, as their number of transitions far exceeds the number of edit operations found in the small training corpora. The unsupervised model was only trained on the Swiss German part of the large corpora. We felt that the model was not robust enough to handle training data of smaller size.

The training is performed with the iterative EM algorithm. With each iteration, it gets closer to the maximum likelihood distribution. When the algorithm reaches this distribution, it is said to converge. The unigram model only converged after about 21’000 iterations with the large corpus, and after about 12’000 iterations with the small corpus. These numbers are based on the transition weights, measured at the floating-point precision of the Python interpreter. The same models converged after 59 iterations (large corpus) and 45 iterations (small corpus) to a precision of 0.000001. We decided to train the unigram model for a fixed number of 50 iterations, as longer training did not seem to improve the test results. The bigram and trigram transducers fully converged at 150-250 iterations on the large training corpora. It may be surprising that they converge much faster than the unigram transducer. We believe that this is due to a lack of training data. Assuming an average word alignment length of 8 symbols, the 2000 word pairs yield about 16’000 edit and identity operations, many of which are not unique. A bigram transducer contains 64’000 transitions, a trigram transducer about 2.5 million transitions to be trained. Thus, many transitions never occur in the training corpus.
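A minimal sketch of such a convergence check, assuming the transition weights are stored in a dict and a hypothetical em_step function performs one expectation-maximization pass:

    def train_em(weights, training_pairs, em_step, tol=1e-6, max_iter=50):
        """Run EM until the largest weight change falls below `tol`.

        `em_step` is a hypothetical function returning the re-estimated
        transition weights after one EM iteration; max_iter=50 matches
        the fixed iteration count used for the unigram model above.
        """
        for _ in range(max_iter):
            new_weights = em_step(weights, training_pairs)
            # Convergence criterion: maximum absolute change of any weight,
            # compared against a precision threshold (e.g. 0.000001).
            delta = max(abs(new_weights[t] - weights[t]) for t in weights)
            weights = new_weights
            if delta < tol:
                break
        return weights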

The unsupervised model alternates two steps. In the first step, a transducer generates new word pairs from the monolingual training corpus: for each dialect word in the corpus, it takes the most probable Standard German candidate it finds in the lexicon. We used the full Standard German lexicon for training, even in this simplified framework. In the second step, these hypothesized word pairs are used to train a new transducer in 50 iterations. This new transducer is used in the first step to generate new word pairs. The two steps are repeated ten times.
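The following sketch outlines this bootstrapping loop. The helpers best_candidate and train_transducer are hypothetical stand-ins for the transducer lookup and the EM training described above.

    def bootstrap(dialect_words, standard_lexicon, initial_transducer,
                  best_candidate, train_transducer, rounds=10):
        """Alternate pair generation and re-training, as described above."""
        transducer = initial_transducer
        for _ in range(rounds):
            # Step 1: hypothesize the most probable Standard German
            # partner for each dialect word in the monolingual corpus.
            pairs = [(w, best_candidate(transducer, w, standard_lexicon))
                     for w in dialect_words]
            # Step 2: re-train a fresh transducer on the hypothesized pairs.
            transducer = train_transducer(pairs, iterations=50)
        return transducer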

As for the static models, we report the scores of the learning models on the Full test corpus (Table 5.3) and on the Cognate test corpus (Table 5.4). Both tables show the impact of different model variants (one on each line) and different training corpus configurations (one on each column). For convenience, we repeat the scores of the baseline Levenshtein model and of the rule-based model. To sum up, the results of the trained models based on unigrams (the Basic EM and the Class model) are situated between those of the static models and the rule-based model, while the models based on bigrams and trigrams, as well as the one based on a monolingual training corpus, performed worse than Levenshtein distance.

While the bigram model (up to 89.4%) comes close to the unigram-based models (up to 94.5%), the trigram model (up to 62.3%) is far behind. This suggests that the size of the training corpus was not sufficient for those models. The performance of the monolingual model is below our expectations (up to 85.0%, i.e. significantly lower than Levenshtein distance with 90.9%). We believe that the target lexicon is so large that the model gets confused by its complexity, rather than inferring a satisfactory model of word structures. It remains to be shown whether the model would perform better if trained on smaller monolingual lexicons.

                    Swiss G. - Standard G.           Sp-Pt      Fr-Pt
    Model           Cognates        Full             Cognates   Cognates
                    219     2192    236     2365     621        ~310
    ---------------------------------------------------------------------
    Levenshtein             85.7                     67.9       32.0
    Rules                   90.2
    Basic EM        86.9    88.5    85.5    86.5     67.1       38.5
    Class model     88.8    86.9    87.1    86.1     69.8       42.3
    Bigrams         -       84.1    -       83.4
    Trigrams        -       58.6    -       56.9
    EM monolingual  -       79.6    -       79.0

Table 5.3: Results for the learning models, tested on the Full corpus, containing cognate and non-cognate pairs. The numbers represent the percentage of correctly induced word pairs. The nature and the size of the training corpora are indicated at the top. The results for Spanish-Portuguese (Sp-Pt) and French-Portuguese (Fr-Pt) are those reported by Mann and Yarowsky (2001); their models were trained on cognate-pair corpora only.

                    Swiss G. - Standard G.           Sp-Pt      Fr-Pt
    Model           Cognates        Full             Cognates   Cognates
                    219     2192    236     2365     621        ~310
    ---------------------------------------------------------------------
    Levenshtein             90.9                     92.3       66.4
    Rules                   95.7
    Basic EM        92.9    94.5    91.3    92.1     92.3       78.6
    Class model     94.3    92.3    92.7    91.5     94.7       84.3
    Bigrams         -       89.4    -       88.4
    Trigrams        -       62.3    -       60.4
    EM monolingual  -       85.0    -       84.2

Table 5.4: Results for the learning models, tested on the Cognate corpus. The numbers represent the percentage of correctly induced word pairs. The nature and the size of the training corpora are indicated at the top. The results for Spanish-Portuguese (Sp-Pt) and French-Portuguese (Fr-Pt) are those reported by Mann and Yarowsky (2001); their models were trained on cognate-pair corpora only.


The comparison of the standard unigram model and the class model shows some interesting results. To build the class model, we used the transition weights induced in the standard unigram model, and divided them into four weight classes. We argued that this could improve the results if the training data was not sufficient. The results partially confirm this intuition. The basic EM model performs better than the class model on the big Cognate training corpus, while it performs worse than the class model on the small Cognate training corpus. When trained on the Full corpus, the differences between the models were not significant (paired McNemar's test, p < 0.05).
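A minimal sketch of this weight grouping, assuming the trained unigram weights are stored in a dict; the use of equal-frequency bins for the four classes is a hypothetical choice, not necessarily the one used in the experiments:

    def quantize_weights(weights, n_classes=4):
        """Group transition weights into `n_classes` weight classes.

        Each weight is replaced by the mean weight of its class.
        """
        items = sorted(weights.items(), key=lambda kv: kv[1])
        size = -(-len(items) // n_classes)  # ceiling division
        quantized = {}
        for i in range(0, len(items), size):
            cls = items[i:i + size]
            mean = sum(w for _, w in cls) / len(cls)
            for transition, _ in cls:
                quantized[transition] = mean
        return quantized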

We can also look at the impact of corpus size, keeping the model constant. (Compare the Cognates-219 with the Cognates-2192 column, or the Full-236 with the Full-2365 column.) The basic EM model shows a tendency towards better results with large training corpora, but this tendency is not significant. In contrast, the class model performed significantly better (paired McNemar's test, p < 0.05) on the small training corpus than on the large one (e.g. 94.3% vs. 92.3% for training on a Cognate corpus and testing on a Cognate corpus). These findings show that the class model is quite robust. Our results are consistent with those of Mann and Yarowsky, but show that the class model is not well suited if the training corpus exceeds a certain size. In the light of this result, it would be interesting to investigate whether weight grouping can also improve the performance of the bigram model.
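The significance tests reported here can be reproduced with an exact paired McNemar's test on the per-word correctness of two models. A minimal sketch, assuming two parallel boolean lists recording whether each test word was correctly induced:

    from scipy.stats import binom

    def mcnemar_exact(correct_a, correct_b):
        """Two-sided exact McNemar's test on paired correctness vectors."""
        # Discordant pairs: one model correct where the other is wrong.
        b = sum(a and not o for a, o in zip(correct_a, correct_b))
        c = sum(o and not a for a, o in zip(correct_a, correct_b))
        n = b + c
        if n == 0:
            return 1.0
        # Under H0, the discordant pairs follow Binomial(n, 0.5).
        p = 2 * binom.cdf(min(b, c), n, 0.5)
        return min(1.0, p)

    # Hypothetical usage: reject H0 at p < 0.05.
    # p = mcnemar_exact(class_model_correct, basic_em_correct)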

The variation of corpus quality did not yield surprising results. The models trained on the Cognate corpus (columns 1 and 2) consistently obtained better scores than the ones trained on the Full training corpus (columns 3 and 4). However, the differences are not very large, though they are significant (paired McNemar's test, p < 0.05), except for the basic EM models trained on small corpora. We do not believe that the improvement in results sufficiently outweighs the cost of manually annotating cognate pairs. A simpler heuristic like the one proposed by Mann and Yarowsky (2001) should suffice to obtain a training corpus of good quality.

The comparison of the different language pairs (see the two rightmost columns of Tables 5.3 and 5.4) does not yield new insights. The learning models confirm the close relationship between Bernese Swiss German and Standard German.