

4.1.5 Translation lexicon induction

Mann and Yarowsky (2001) were the first to present a translation lexicon induction framework for cognate languages based on phonetic similarity measures. While the goal of their study was to induce lexical mappings by transitivity, using a bridge language, they developed a test to measure the performance of the different transducers on the cognate language pair (i.e. on their bridge-to-target language step).

They use a test corpus $C$ containing $n$ word pairs: $C = \{\langle a_i, b_i\rangle \mid 1 \leq i \leq n\}$. It is split into the source word list $A = \{a_i \mid 1 \leq i \leq n\}$ and the target word list $B = \{b_i \mid 1 \leq i \leq n\}$.

The lexicon induction task consists of finding mappings between the elements of $A$ and the elements of $B$ with the help of the transducer that encodes a distance measure $\mathit{Dist}$.

[Figure: input word → transducer → candidate string list → target word list → take best → candidate word]

Figure 4.7: The architecture of the lexicon induction framework proposed by Mann and Yarowsky (2001). All entities in boxes are implemented as finite-state machines (acceptors or transducers). The bold arrows represent composition operations.

For a given source word $a_i$, the transducer computes the distance to each element of $B$. The element with the least distance, $\hat{b}_i$, is retained to form a word pair with $a_i$. This procedure is repeated for all elements of $A$ to constitute a list of word pairs $D$:

$$D = \{\langle a_i, \hat{b}_i\rangle \mid a_i \in A,\ \hat{b}_i = \arg\min_{b \in B} \mathit{Dist}(a_i, b)\}$$
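
The following sketch illustrates this exhaustive search in plain Python. The transducer-encoded distance measure $\mathit{Dist}$ is replaced here by an ordinary Levenshtein distance purely as a placeholder, and the word lists are invented toy data.

```python
def levenshtein(s, t):
    """Plain edit distance; stands in here for the learned distance measure Dist."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution or match
        prev = curr
    return prev[-1]

def induce_pairs(A, B, dist=levenshtein):
    """For each source word, pick the target word with the smallest distance."""
    return [(a, min(B, key=lambda b: dist(a, b))) for a in A]

# Toy data: a dialect word list A and a Standard German word list B.
A = ["hüüser", "chind"]
B = ["häuser", "kind", "hund"]
print(induce_pairs(A, B))   # expected: [('hüüser', 'häuser'), ('chind', 'kind')]
```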

The word pairs $\langle a_i, \hat{b}_i\rangle$ are built using finite-state machines and composition. The source word $a_i$ is compiled into a finite-state automaton (FSA). This automaton is composed with the transducer. The result is another FSA which encodes all possible candidate strings and their respective weights. This FSA is in turn composed with the trie containing the target words $B$. This composition yields another FSA with ranked candidate words. The best-ranked candidate word is selected to form the word pair. Figure 4.7 shows the different operations.
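
To make the composition pipeline of Figure 4.7 concrete, here is a sketch using the pynini finite-state library (chosen for illustration; not the toolkit used in the original experiments), with a toy edit transducer standing in for the learned distance transducer.

```python
import pynini
from pynini.lib import pynutil

# Toy stand-in for the learned distance transducer Dist: identity at cost 0,
# substitution, deletion and insertion at cost 1 each.
sigma = pynini.union(*"abcdefghijklmnopqrstuvwxyzäöü").optimize()
edit = pynini.union(
    sigma,                                                # copy a symbol, cost 0
    pynutil.add_weight(pynini.cross(sigma, sigma), 1.0),  # substitution, cost 1
    pynutil.add_weight(pynini.cross(sigma, ""), 1.0),     # deletion, cost 1
    pynutil.add_weight(pynini.cross("", sigma), 1.0),     # insertion, cost 1
).closure().optimize()

# The target word list B, compiled into a single acceptor (the trie of Figure 4.7).
B = ["häuser", "kind", "hund"]
target_fsa = pynini.union(*(pynini.accep(w) for w in B)).optimize()

def best_candidate(source_word, dist_fst, target_fst):
    """Compose source acceptor, distance transducer and target word list,
    then keep the best-ranked candidate word (Figure 4.7)."""
    lattice = pynini.accep(source_word) @ dist_fst @ target_fst
    if lattice.num_states() == 0:       # no target word reachable at all
        return None
    return pynini.shortestpath(lattice).paths().ostring()

print(best_candidate("hüüser", edit, target_fsa))   # expected: häuser
```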

The algorithm of Mann and Yarowsky (2001) does not use an external lexicon, but it uses the target words in the test corpus to find the best mappings. It performs an exhaustive search on the target word list to find the optimal candidates. While this methodology is simple and functional, it has the drawback that its results depend on the size of the target word list $B$ (which is, by construction, equal to the size of $C$): the bigger the target word list, the more its entries tend to resemble each other, and the more difficult it becomes to choose the right word. Using a large corpus increases the likelihood of “near misses” and thus yields worse results.

The test corpora of Mann and Yarowsky (2001) are small – they contain only 100 target words, one of which is guaranteed to be the solution. However, we believe that such a setting is not representative of the lexicon induction task. In our case, we possess a list of dialect words extracted from a dialect corpus and want to find their Standard German equivalents. A priori, there is no way to restrict the size of the target word list – every Standard German word could be needed. We could restrict the target word list if we knew that our text came from a particular domain. We could also restrict it if we had a parallel corpus: in this case, we would build the target word list from all words occurring on the Standard German side of the parallel corpus. Usually, we do not have such information for dialect texts. Therefore, we cannot use any heuristics to restrict the target word list, and we have to use a list that is as complete as possible, like our 202’000-word lexicon.

We now have to determine whether Mann and Yarowsky’s framework also works with large lexicons. We already mentioned that we suspect its performance to be inversely related to the size of the target word list. If we use a list of 202’000 words, we may expect worse results than with a list of 100 words. Furthermore, a practical problem arises: the second composition step involves a rather complex candidate string FSM and a huge lexicon FSM. It turned out that this composition could not be handled due to memory limitations5.

While we cannot avoid the first drawback, we must at least find a way to perform the composition. We have already argued that the lexicon size cannot be restricted. The only remaining possibility is to restrict the size of the candidate string FSM. We claim that we do not need all candidate strings, but only the best ones. Concretely, we keep the n best-ranked strings for each word, hoping that at least one of them corresponds to an existing target word. In doing so, we obtain a working model, but we lose the guarantee of getting an answer for each source word.
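
A minimal sketch of this pruning step, under the same pynini assumptions as above: the candidate lattice is reduced to its n best paths before being composed with the large target lexicon.

```python
import pynini

def induce_with_pruning(source_word, dist_fst, lexicon_fsa, n=50):
    """Two-stage induction (Figure 4.8): prune the candidate lattice to its
    n best strings, then filter the survivors through the target lexicon."""
    # Stage 1: weighted candidate strings for the source word, pruned to n best.
    candidates = pynini.accep(source_word) @ dist_fst
    pruned = pynini.shortestpath(candidates, nshortest=n)

    # Stage 2: keep only candidates that are actual lexicon entries.
    filtered = pruned @ lexicon_fsa
    if filtered.num_states() == 0:
        return None   # every lexicon word was pruned away: no answer for this input
    return pynini.shortestpath(filtered).paths().ostring()
```

If n is chosen too small, the correct target word may already have been pruned away before the lexicon filter is applied, which is precisely the lost guarantee mentioned above.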

In conclusion, our two-stage framework (Figure 4.8) differs from the one proposed by Mann and Yarowsky (2001) in two respects. First, the target lexicon is independent, in size and content, of the source word corpus; it is a general lexicon, as complete as possible. Second, the candidate strings have to be pruned before the lexicon filter is applied. This is the price to pay for data independence.

5 The tests were performed on a 3.2 GHz Pentium 4 machine with 2 GB RAM.

[Figure: input word → transducer → candidate string list → take n-best → target lexicon → candidate word list]

Figure 4.8: The architecture of our two-stage framework. All entities in boxes are implemented as finite-state machines. The bold arrows represent composition operations. The first composition involves pruning of the resulting finite-state machine. For testing purposes, we do not only take the best candidate word, but all of them.
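
For completeness, a hypothetical sketch of how the complete candidate word list of Figure 4.8 could be enumerated for evaluation, again assuming pynini: every path that survives the lexicon filter corresponds to one candidate word and its weight.

```python
import pynini

def all_candidates(source_word, dist_fst, lexicon_fsa, n=50):
    """Return every candidate word surviving the lexicon filter, with its weight."""
    pruned = pynini.shortestpath(pynini.accep(source_word) @ dist_fst, nshortest=n)
    filtered = pruned @ lexicon_fsa
    best = {}
    it = filtered.paths()               # path enumeration; the FST is acyclic here
    while not it.done():
        word = it.ostring()
        cost = float(str(it.weight()))  # tropical weight, read back as a number
        if word not in best or cost < best[word]:
            best[word] = cost           # keep the cheapest path per candidate word
        it.next()
    return sorted(best.items(), key=lambda item: item[1])
```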