One goal of this evaluation was to compare our lexicon induction approach with related approaches. We have already compared our results with Mann and Yarowsky's study on Spanish-Portuguese and French-Portuguese. They use comparable algorithms (phonetic similarity measures), but different language pairs. To complete the comparison, it would be interesting to have results on the same language pair, but using a radically different approach. Concretely, we will use a framework based on syntactic similarity. Syntactic similarity models induce word pairs on the basis of co-occurrence patterns in parallel corpora. This is the standard lexicon induction model used in statistical machine translation.

As explained in Chapter 3, our corpus is in fact a parallel corpus for Swiss German and Standard German. It is virtually the only one in existence for the Bernese dialect, and no parallel corpora are available for other Swiss dialects. Therefore, we argued that we had better not rely on parallel corpora if our work is to be extended to other dialects. However, we do possess parallel data in our particular case, and we think that it is interesting to use it for comparison purposes.

Our parallel corpus consists of 57 text pairs of two pages each. On this data, the syntactic similarity model has to perform two steps. First, it has to find sentence pairs, given the text pairs. Second, it has to find word pairs, given the sentence pairs obtained in the first step.

The first step, the sentence alignment, was performed with the algorithm of Gale and Church (1993)¹. It is essentially based on the comparison of sentence lengths.
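As a rough illustration of this length-based scoring, the following sketch computes the cost of a 1-1 alignment using the parameter values reported by Gale and Church (1993); the function names are ours, and the full algorithm embeds such costs, together with priors over the alignment types (1-1, 1-0, 2-1, ...), in a dynamic program over the two sentence sequences.

```python
import math

# Parameters reported by Gale and Church (1993): the number of target
# characters generated per source character is modelled as normally
# distributed with mean C and variance S2.
C = 1.0    # expected target/source character ratio
S2 = 6.8   # variance of that ratio

def norm_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def match_cost(len_a: int, len_b: int) -> float:
    """Cost (negative log-probability) of aligning a source sentence of
    len_a characters with a target sentence of len_b characters (1-1)."""
    if len_a == 0 or len_b == 0:
        return float("inf")
    delta = (len_b - len_a * C) / math.sqrt(len_a * S2)
    # two-tailed probability of a length deviation at least as large as |delta|
    prob = 2.0 * (1.0 - norm_cdf(abs(delta)))
    return -math.log(max(prob, 1e-12))
```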

The second step, the word alignment, relies on the assumption that aligned sentences carry the same content, and that there is a rough one-to-one correspondence between the words of the dialect sentence and the words of the Standard German sentence.

While the word correspondences are largely underspecified in a single sentence pair, they can be induced by looking at other sentence pairs. For example, if $w_a$ is the only word present in the three source sentences $a_1$, $a_2$, $a_3$, and the corresponding sentences $b_1$, $b_2$, $b_3$ all contain the word $w_b$, we infer that $w_a$ is the translation of $w_b$. The word alignment is performed in an iterative way with the algorithm proposed by Brown et al. (1993). They apply different models (the "IBM models") in sequence to gradually optimize the word alignments and the translation probabilities. For our test, we used the GIZA++ implementation (Och and Ney 2003) with its default parameters.
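As a minimal illustration of this iterative estimation, the following sketch implements the EM training of IBM Model 1, the first and simplest of these models; GIZA++ goes on to train the more sophisticated models, and the data format and names below are our own assumptions.

```python
from collections import defaultdict
from itertools import product

def train_ibm1(sentence_pairs, iterations=10):
    """EM estimation of IBM Model 1 lexical translation probabilities
    t(target_word | source_word) from aligned sentence pairs.
    (The NULL source word of the full model is omitted for brevity.)"""
    # initialise t uniformly over all co-occurring word pairs
    pairs = {(a, b) for src, tgt in sentence_pairs for a, b in product(src, tgt)}
    t = {pair: 1.0 / len(pairs) for pair in pairs}

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(a, b)
        total = defaultdict(float)  # expected counts c(a)
        for src, tgt in sentence_pairs:
            for b in tgt:
                # E-step: distribute the mass of b over all source words
                norm = sum(t[(a, b)] for a in src)
                for a in src:
                    frac = t[(a, b)] / norm
                    count[(a, b)] += frac
                    total[a] += frac
        # M-step: renormalise the expected counts
        t = {(a, b): c / total[a] for (a, b), c in count.items()}
    return t

# e.g. train_ibm1([(["ds", "hus"], ["das", "haus"])]) yields a table of
# translation probabilities such as t("haus" | "hus")
```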

As a result of the alignment, we obtained a table of word pairs with the corresponding translation probabilities. A source word can be translated to different target words, with different probabilities. Thus, we could create the same statistics as in our phonetic similarity based models: we compiled List statistics by testing if the expected word is contained in the candidate list, and we compiled Top statistics by testing if the expected word was the one with maximal probability. We also computed the average target list length, as in the previous tests. Table 5.11 reports the List statistics, Table 5.12 the Top statistics of this complementary study. For convenience, we also report the scores of the baseline Levenshtein model and of the best-scoring bigram model.
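Concretely, the three figures might be compiled as in the following sketch; the data structures and names are illustrative assumptions, not the code we actually used.

```python
def evaluate(candidates, gold):
    """Compile List and Top statistics and the average list length.

    candidates: maps each source word to a list of
                (target_word, probability) pairs
    gold:       maps each source word to its expected target word
    """
    list_hits = top_hits = 0
    total_length = 0
    for source, expected in gold.items():
        cands = candidates.get(source, [])
        total_length += len(cands)
        # List statistic: the expected word appears anywhere in the list
        if any(word == expected for word, _ in cands):
            list_hits += 1
        # Top statistic: the expected word has the maximal probability
        if cands and max(cands, key=lambda c: c[1])[0] == expected:
            top_hits += 1
    n = len(gold)
    return {
        "list_recall": list_hits / n,
        "top_recall": top_hits / n,
        "avg_list_length": total_length / n,
    }
```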

The phonetic similarity methods were tested on half of the words of our corpus (2366), the other half being used for training. As these words were selected randomly and do not make up half of the sentences, we could not use the same test corpus for the alignment method. We therefore used the complete corpus, containing 1850 aligned sentence pairs and 4717 distinct source words. Although this corpus is very small for a word alignment method, it yielded better results than the phonetic similarity models. This is not surprising, as the phonetic models are designed to work with word lists only. Hence, they ignore all information carried by the syntactic context.

The figures show clearly that the alignment method performs better than the phonetic models. However, the comparison is not as simple as the tables suggest. The methods use different types of training corpora: word pair lists in the case of phonetic similarity models, sentence pairs in the case of syntactic similarity models.

¹ We used the implementation contained in the source distribution of the Europarl parallel corpus: http://people.csail.mit.edu/koehn/publications/europarl/.

                 Tested words   Correct words   % Correct words   Avg list length
Levenshtein          2366             900             38.1               5.8
Bigrams              2366            1319             55.8              17.6
Word alignment       4717            2951             62.6               2.7

Table 5.11: List statistics for the comparison of phonetic and syntactic similarity models. The percentage of correct words represents recall.

                 Tested words   Correct words   % Correct words   Avg list length
Levenshtein          2366             716             30.3               1.2
Bigrams              2366             955             40.3               0.8
Word alignment       4717            2737             58.0               1.0

Table 5.12: Top statistics for the comparison of phonetic and syntactic similarity models. The percentage of correct words represents recall.

Moreover, the phonetic models use an independent lexicon as target word list, which contains some gaps. In the syntactic model, the Standard German words are determined directly by the parallel corpus, so there is much less ambiguity than with a large independent lexicon. Nevertheless, if we compare the incomparable, we can say that the bigram model induces roughly 70% (based on Top statistics) of the word pairs that the alignment induces.
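This 70% figure follows directly from the Top recall values in Table 5.12:

$$\frac{40.3\%}{58.0\%} \approx 0.695 \approx 70\%.$$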

This value can be related to Koehn and Knight (2002). They induce an English-German translation lexicon with a sophisticated framework using non-parallel corpora and compare it to a translation lexicon induced with the standard alignment framework using a parallel corpus. The non-parallel corpus allows them to induce 39% of the word pairs induced with the parallel corpus. Two main reasons account for the fact that our results are better. First, our parallel corpus is much smaller than the one used by Koehn and Knight (2002); therefore, our word alignment figures are quite low. Second, the linguistic differences between English and Standard German are much greater than between two dialectal variants of German.

We have proposed phonetic similarity measures as an alternative basis for lexicon induction models. We argued that, especially for dialects, it may be difficult to find parallel corpora. This comparative study showed that we cannot expect phonetic similarity measures to perform as well as models based on parallel corpora. However, it also showed that the gap between the two models is not very large. Further research may show whether a hybrid approach that takes into account the syntactic contexts as well as the phonetic word similarity is appropriate for very closely related language pairs for which parallel corpora are not available.

5.6 Concluding remarks

In Section 5.2, we presented the transducer tests as a means to compare different training settings for the learning models, to obtain a rough classification of our ten models, and to compare our language pair to Romance language pairs on the basis of the complexity of lexicon induction. In particular, these tests showed that static variants of Levenshtein distance did not perform better than the baseline model, but that the rule-based model obtained satisfying results. Concerning the learning models, the class model proved successful for scarce and noisy data. However, if more data is available, and of better quality, the basic EM model and the bigram model are better suited.

In Section 5.3, the framework tests were presented with the goal of obtaining more discriminative results, and of assessing the performance of our models in a slightly different setting, involving candidate string pruning and the use of a non-restricted lexicon. Nevertheless, the results of the framework tests showed some parallels with the transducer tests. There were few differences among the static models, and they were clearly outperformed by the rule-based model. The results of the learning models were surprising: the basic EM model and the bigram model obtained better results than the rule-based model, but the class model did not follow them. However, the good results of the aforementioned models depend heavily on the size of the training corpus; a training corpus of 2000 word pairs may not be readily available for other language pairs.

Finally, we compared our models to a lexicon induction approach relying on parallel corpora (Section 5.5). The results of this comparative study suggest that the integration of word contexts may improve the performance of phonetic similarity models. The next chapter will present some perspectives for such a hybrid approach.

Chapter 6 Conclusion

6.1 Main contributions

Our work contains some novel elements that might be interesting for future research on lexicon induction and on dialects and cognate languages.

We created a parallel corpus for Bernese Swiss German and Standard German, manually extracted the word pairs, and annotated them as cognate or non-cognate pairs. These resources may be useful for future work.

We formulated the problem of lexicon induction with phonetic similarity measures as an instance of the noisy channel metaphor. This allowed us to extend the framework of Mann and Yarowsky (2001) in order to handle larger data sizes. The comparison of these two frameworks also showed some differences in the way they use distance measures. We found that a measure that performs well in one framework does not necessarily do so in the other one.
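In the usual reading of this metaphor (the notation here is ours, for illustration), the induction task amounts to finding, for an observed dialect word $d$, the Standard German word

$$\hat{w} \;=\; \operatorname*{arg\,max}_{w} P(w \mid d) \;=\; \operatorname*{arg\,max}_{w} P(w)\,P(d \mid w),$$

where $P(w)$ corresponds to the target lexicon and $P(d \mid w)$ to the phonetic similarity model.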

Furthermore, we bridged the existing gap between studies on cognate languages and research in dialectometry by using distance measures proposed in either field. However, we could not sufficiently exploit the dialectometric measures, as they rely on exact phonetic transcriptions. We did not possess such data for our language pair.

We also proposed a novel model based on contextual transformation rules and obtained satisfying results. As a description of the phonetic correspondences was already available for our language pair, the implementation of this model was relatively straightforward. However, many scarce-resource languages lack such descriptions. Therefore, we also integrated other methods in our study.

For instance, we reproduced the standard learning model and the class model proposed by Mann and Yarowsky (2001). Our experiments confirmed their intuition that the class model is particularly well suited for small training data. With larger training data, the advantage shifted to the standard model. We extended the learning framework in several ways. We proposed a model that relies on monolingual training data only, but did not obtain the expected results.

We also developed a novel approach to n-gram models. Our models accept n-grams as input, but always yield unigrams in the output. This setting spares us from building a special target lexicon in n-gram representation, and it also avoids the generation of invalid n-gram sequences. However, the increased size of these transducers makes training difficult with our limited resources. The training data were sufficient to train the bigram model: it obtained the best scores in our framework tests. However, they clearly do not suffice to train trigram models.
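To make the input representation concrete, a word might be decomposed into overlapping bigram symbols as in the following sketch; the boundary marker and the exact segmentation are our assumptions for illustration, not necessarily those of our transducers.

```python
def to_bigrams(word: str) -> list[str]:
    """Represent a word as a sequence of overlapping character bigrams
    for the input side of the transducer; '#' marks the word boundary."""
    padded = f"#{word}#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

# to_bigrams("hus") -> ['#h', 'hu', 'us', 's#']
# The output side of the transducer still emits plain characters
# (unigrams), so the target lexicon can stay in its usual form.
```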

In spite of the satisfying results we obtained with some models, many extensions, improvements and variations remain to be explored. The following sections present some potential aspects of future work.