

4.1.6 Implementation

Figure 4.8 shows the transducer and automata used in our framework. They are implemented with the FSM Library developed by AT&T (Mohri, Pereira, and Riley 1998).6 This library contains a set of command-line tools for building binary files representing finite-state machines, modifying and composing them, and for printing the contents of a finite-state machine in text form. We integrated these commands into a set of Python scripts for building the lexicon, building specific types of transducers, and for evaluating a model with a test corpus. The main program for obtaining translations of a given source word consists of a series of piped commands. An example for the dialect word choufet ‘(you) buy’ is reproduced below:7

echo 'choufet' | farcompilestrings | fsmcompose - transducer.fsm |
    fsmbestpath -n 5000 | fsmproject -o | fsmcompose - lexicon.fsm |
    farprintstrings

The source word is first compiled into an automaton (farcompilestrings). This automaton is then composed with the transducer (fsmcompose - transducer.fsm). As there is an infinity of resulting paths, only the 5000 best paths are kept (fsmbestpath -n 5000). Each path represents a pair consisting of the source word and a candidate string. As the source word is the same for all paths, we only keep the candidate strings (fsmproject -o) and compose them with the lexicon (fsmcompose - lexicon.fsm) in order to eliminate all non-words. Finally, the resulting strings and their transduction weights are printed (farprintstrings):

kaufet 31.3210907
kauft 33.0162697
taufet 33.6648293
häufet 33.9177208
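The Python scripts that wrap these commands are not reproduced in the text. As a minimal sketch of how such a wrapper might look, the following function pipes a source word through the commands above and parses the weighted candidates. The helper name translate, the hard-coded file names and the output parsing are illustrative assumptions, not the actual scripts:

import subprocess

# Sketch only: drive the pipeline of this section from Python and parse its output.
# File names and output handling are assumptions for illustration; the real
# commands additionally pass symbol files (cf. footnote 7).
PIPELINE = (
    "echo '{word}' | farcompilestrings | fsmcompose - transducer.fsm | "
    "fsmbestpath -n 5000 | fsmproject -o | fsmcompose - lexicon.fsm | "
    "farprintstrings"
)

def translate(word):
    """Return (candidate, weight) pairs for a source word, best candidates first."""
    output = subprocess.run(PIPELINE.format(word=word), shell=True,
                            capture_output=True, text=True, check=True).stdout
    candidates = []
    for line in output.splitlines():
        if line.strip():
            candidate, weight = line.rsplit(None, 1)  # assumes "string weight" lines
            candidates.append((candidate, float(weight)))
    return sorted(candidates, key=lambda pair: pair[1])

print(translate("choufet")[:4])  # e.g. [('kaufet', 31.32...), ('kauft', 33.02...), ...]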

Mann and Yarowsky's algorithm can be implemented in a very similar way. It suffices to remove the command fsmbestpath -n 5000, to replace the lexicon by a smaller one, and to add fsmbestpath -n 1 after the second composition.

We chose the AT&T framework because it is sufficiently general to represent different types of transducers and acceptors. Therefore, the same scripts could be used for all models. As all information is encoded in finite-state machines, testing a model is quite fast.

6 http://www.research.att.com/~fsmtools/fsm

7 These commands are slightly simplified to keep them readable. In fact, for all commands, symbol files must be specified to map the letters to the integers used in the internal FSM representation. Moreover, the letters in the input and output strings are separated by spaces.

4.1.7 String distance measures

In the preceding sections, we have presented different types of transducers and their role in our framework. We have noted that our different models will use different transducers, and that each transducer encodes a particular distance measure. We would like to define these distance measures now. Generally speaking, string distance measures are functions from string pairs to numbers. The distance between two strings depends, in some way, on the symbols that compose the strings.

Many types of string distance measures have been proposed. For instance, it has been suggested to use the length of the longest common subsequence as a distance measure. Other measures, like Hamming distance, only apply to string pairs of the same length. We will focus on a particular class of distance measures here. They will allow strings of different lengths, but they will only compare one letter pair at a time. In other words, we are interested in context-free distance measures. Such measures can be implemented in memoryless transducers. Context-free distance measures account for all of our models, except for the rule-based system.


Figure 4.9: Four alignments for the word pair müed - müde. There are many more alignments for this word pair. The last line reports the type of the operation: identity (=), substitution (S), insertion (I) or deletion (D).


The distance between two strings s and t is defined by the letter operations needed to transform s into t. There are two fundamental types of operations. Identity operations keep a symbol of s as it is. Edit operations change a symbol of s. There are three types of edit operations. Substitution operations replace the symbol by another one. Insertion operations add a symbol to t. Deletion operations delete a symbol of s. Insertions and deletions allow the comparison of string pairs of different lengths. Note that swapping two letters is not considered a primitive edit operation. This may be inadequate for some purposes, but has the advantage of keeping the measure context-free. Swapping two adjacent letters would call for considering two letter pairs at the same time, and would need a more complicated transducer architecture.

All operations have costs. These costs are the only element that differs from one model to another. Usually, the identity operations are less costly than the edit operations. In a memoryless transducer, each operation represents a transition from the unique state back to the unique state. The transition carries the input symbol (ε for insertion operations), the output symbol (ε for deletion operations) and the cost.
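To make this architecture concrete, the following sketch writes such a one-state transducer out as a textual arc list of the form "source state, destination state, input symbol, output symbol, cost", which tools such as fsmcompile accept together with symbol files mapping the labels to integers (cf. footnote 7). The toy alphabet, the costs and the epsilon label are assumptions chosen for illustration, not the values used in our models:

# Sketch: a memoryless transducer written out as a textual arc list.
# Assumed line format: "src dest input output cost"; the alphabet, the costs
# and the epsilon label "<eps>" are illustrative only.

alphabet = ["a", "b", "c"]
identity_cost = 0.5
substitution_cost = 2.0
insertion_cost = 3.0
deletion_cost = 3.0
EPS = "<eps>"  # epsilon label, mapped to an integer in the symbol file

def memoryless_arcs(alphabet):
    """Yield every operation as a transition from state 0 back to state 0."""
    for a in alphabet:
        yield (0, 0, a, a, identity_cost)              # identity
        for b in alphabet:
            if a != b:
                yield (0, 0, a, b, substitution_cost)  # substitution
        yield (0, 0, EPS, a, insertion_cost)           # insertion: epsilon input
        yield (0, 0, a, EPS, deletion_cost)            # deletion: epsilon output
    # n identity, n*(n-1) substitution, n insertion and n deletion arcs in total

with open("distance.fsm.txt", "w") as f:
    for src, dst, i, o, cost in memoryless_arcs(alphabet):
        f.write(f"{src} {dst} {i} {o} {cost}\n")
    f.write("0\n")  # the unique state 0 is also the final state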

A string pair can be obtained by many different operation sequences. Each sequence defines an alignment – a mapping between symbols of s and symbols of t. Figure 4.9 shows four possible alignments for a given word pair; the operation sequences defined by these alignments are represented graphically in Figure 4.10. The alignment distance for alignment a is defined as the sum of the costs of the operations defined by a. The string distance of a string pair is defined as the minimal alignment distance of the string pair.8

8 Here, we present the string distance measures from a recognition point of view: for two given strings, the distance needs to be computed. This is by far the most frequent application mode, but not the one we will use. We will rather use these distance models to generate strings: from a given word, we can generate an infinity of strings, ordered by their distance. To obtain finite string sets, we keep only a fixed number of best-scoring candidates, as described in the preceding section.


Figure 4.10: Four matrices representing the example alignments of Figure 4.9. In a matrix with the source word on the right and the target word on top, every path from the upper left corner to the lower right corner represents an alignment. Horizontal arrows represent insertion operations, vertical arrows deletion operations. Thin diagonal arrows represent identity operations, bold diagonal arrows represent edit operations.


Figure 4.11: Four alignments for the word pair müed - müde with Levenshtein weights. The rightmost alignment has a distance of 3, while the other three alignments show distances of 2. Hence, the Levenshtein distance between müed and müde is 2.
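To make the definition of the minimal alignment distance concrete, here is a small dynamic-programming sketch with configurable operation costs; it is only an illustration and not part of our transducer framework. With identity cost 0 and unit costs for all edit operations it reproduces the Levenshtein distance of 2 between müed and müde:

# Minimal alignment distance by dynamic programming.
# With identity cost 0 and unit edit costs this is the Levenshtein distance.

def string_distance(s, t, ident=0.0, subst=1.0, insert=1.0, delete=1.0):
    n, m = len(s), len(t)
    # dist[i][j] = minimal alignment distance between s[:i] and t[:j]
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + delete               # delete s[i-1]
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + insert               # insert t[j-1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = ident if s[i - 1] == t[j - 1] else subst
            dist[i][j] = min(dist[i - 1][j - 1] + match,   # identity/substitution
                             dist[i - 1][j] + delete,      # deletion
                             dist[i][j - 1] + insert)      # insertion
    return dist[n][m]

print(string_distance("müed", "müde"))  # 2.0 with Levenshtein weights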

Assume that all strings are formed from an alphabet of n letters. In this case, the distance measure contains n identity operations, n·(n−1) substitution operations, n insertion operations and n deletion operations. Each operation possesses a weight.

These weights may be distinct for each operation, or the operations may be grouped together into classes, all members of a class having the same weight. We will propose models without classes, and models with a varying number of classes. Moreover, we will present different ways of defining these operation classes.
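Purely as an illustration of this grouping, operation classes can be thought of as a mapping from letter pairs to class names, with one weight per class; the class names and weights below are invented for the example and do not correspond to any of the models presented later:

# Invented example of operation classes: every operation belongs to a class,
# and all operations of a class share one weight.
operation_class = {
    ("a", "a"): "identity",
    ("a", "e"): "vowel-vowel",   # substitution between two vowels
    ("a", "t"): "other-subst",
    ("", "a"): "insertion",      # empty input side
    ("a", ""): "deletion",       # empty output side
}
class_weight = {
    "identity": 0.0,
    "vowel-vowel": 1.0,
    "other-subst": 2.0,
    "insertion": 2.5,
    "deletion": 2.5,
}

def operation_cost(source_symbol, target_symbol):
    """Look up the cost of a single letter operation through its class."""
    return class_weight[operation_class[(source_symbol, target_symbol)]]

print(operation_cost("a", "e"))  # 1.0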

Depending on the criteria that define the operation weights, we distinguish two categories of distance measures. The weights of static measures (4.2) are independent of the particular language pair they are applied to. While these measures may encode general linguistic intuitions, they do not reflect the specificities of a particular language pair. In contrast, the weights of adaptive measures are specifically adjusted to a language pair with the help of a learning algorithm (4.3). The rule-based model (4.4), although not implemented in a memoryless transducer, is also considered as an adaptive model.