Multi-word Groups (MWG) - Metrics and Implementation

1.4 Metrics and Implementation

1.4.6 Multi-word Groups (MWG)

The observant reader may have noticed that our requirements for creating sequence groups have one important shortcoming when applied to real-world translations. It will occur that a specific construction of multiple words is difficult to transfer to the target language in terms of alignment, i.e. where a one-to-one mapping of meaning from source to target words is tedious. Even if such one-to-one or one-to-few alignments are possible, the translator may simply opt for a less straightforward construction, either because of personal preference or to create a target text that is more natural. Ultimately, such translation choices will lead to m-to-n alignments where all m source words are aligned to all n target words in the group, because all words contribute to the meaning that is being transferred and a smaller compositional alignment is not possible.

We consider all m-to-n alignments to be MWG candidates in cases where both m and n are greater than 1 (if m = 1 or n = 1, the group is a valid sequence group anyhow, as there cannot be an internal cross). That means that when calculating the sequence cross and SACr cross metrics for words, we can consider whether MWGs are allowed or not and base our calculations on the groups that are formed on said condition. If MWGs are considered valid groups, all MWG candidates are interpreted as valid sequence and SACr groups, even if they do not meet the criteria of the aforementioned Definitions 2 and 3. Instead, an alternative Definition 4 defines what we consider to be MWGs. If MWGs are not considered in the calculation of word group crossings, the words in m-to-n alignments do not constitute valid groups according to our criteria (Definitions 2 and 3 above) and each alignment will be considered individually.

Definition 4 (Multi-word group).

• Includes requirements for Consecutiveness (Definition1)

• All words in the source group need to be aligned with all the words in the target group and vice-versa
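The two requirements of Definition 4 can be sketched as a small check. This is only an illustrative sketch, not the implementation used for the metrics: the function names `is_consecutive` and `is_mwg` are hypothetical, alignments are assumed to be (source index, target index) pairs, and Consecutiveness (Definition 1) is assumed to mean that the word indices on each side form an unbroken range. The check also assumes that it is given all alignments of the words involved, so it does not look for links that leave the group.

```python
def is_consecutive(indices):
    """Assumed reading of Definition 1: the indices form an unbroken range."""
    ordered = sorted(set(indices))
    return bool(ordered) and ordered == list(range(ordered[0], ordered[-1] + 1))


def is_mwg(alignments):
    """Return True if the (source, target) pairs form an MWG candidate.

    Hypothetical helper: requires m > 1 and n > 1, consecutive words on
    both sides, and every source word aligned with every target word.
    """
    pairs = set(alignments)
    src = {s for s, _ in pairs}
    tgt = {t for _, t in pairs}
    if len(src) < 2 or len(tgt) < 2:
        return False  # 1-to-n and m-to-1 groups are not MWG candidates
    if not (is_consecutive(src) and is_consecutive(tgt)):
        return False
    # All source words must be linked to all target words and vice versa.
    return pairs == {(s, t) for s in src for t in tgt}


# A fully connected 3-to-3 block qualifies; removing one link breaks it.
full_block = [(s, t) for s in range(3) for t in range(3)]
print(is_mwg(full_block))
print(is_mwg(full_block[:-1]))
```

The first call corresponds to a fully connected 3-to-3 group, the second to the same group with one alignment missing.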

Note that our inclusion of MWGs for word groups only has an effect on the sequence and SACr cross values and not on word_cross. When considering groups of words, we calculate cross on the sequence level or on the SACr level, and in both cases we can (dis)allow the creation of MWGs as an alternative group type.

It is no surprise that not considering MWG as valid group alignments can lead to incredibly high sequence and SACr cross values, similar to the word_cross value for the same construction, because in such an event the unit of alignment does not increase in size. Instead, rather than forming a group, all alignments would constitute their own singleton group because they do not meet the requirements to form a sequence or SACr group. When we do allow MWGs, however, such alignment constructions are considered as valid coherent units, which leads to a considerably lower cross value. This is illustrated in the following example.

Figure 1.9. Alignment table of Example 3. Blue squares (dashed lines) indicate MWG candidates, orange squares (solid lines) are valid sequence groups.

In Figure 1.9, an English source sentence is translated as a Dutch target sentence. For clarity’s sake, the example sentence and its translation and alignments are also given in text in Example 3. The example contains two MWG candidates, “Climate change scientists” aligned with “Wetenschappers inzake klimaatverandering” and “may increase by as much as” aligned with “meer kans is op”.

er 11-9 11-10 11-11 12-8 12-9 12-10 12-11 13-6 14-7 15-14

In the figure, black squares indicate word alignments, blue squares (dashed lines) are MWG candidates, whereas orange groups (solid lines) are valid sequence groups (no internal or external cross; consecutive source and target words).¹⁵ At the top, the cross values for each word are given. word_cross serves as a baseline. seq_cross* values show the cross value of the group that a word belongs to, as before in Section 1.4.4. The distinction between the two versions is that for seq_cross_MWG we allow MWG candidates (blue squares) to be valid groups, whereas in the other variant we do not. In that second variant, rather than having large units that possibly cross each other, the words themselves constitute their own groups. Because every word in an MWG crosses every other word, that leads to large values. The number of internal crosses in an m-to-n aligned group scales with the number of words on the source (m) and target side (n) according to Formula 1.3. The proof for this formula is given in Appendix A.

\[
\mathit{cross}_{\mathit{MWG}} = \frac{1}{4} \cdot mn(m-1)(n-1) \tag{1.3}
\]

where:

m: number of words on the source side of the MWG

n: number of words on the target side of the MWG
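As an informal sanity check of Formula 1.3 (not a replacement for the proof in Appendix A), the internal crosses of a fully connected m-to-n group can also be counted by brute force. The sketch below assumes that two alignments cross when their source order and target order disagree and that each crossing pair is counted once. The function names are hypothetical.

```python
from itertools import combinations, product


def internal_crosses(m, n):
    """Count crossing pairs in a fully connected m-to-n alignment block."""
    alignments = list(product(range(m), range(n)))
    return sum(
        1
        for (s1, t1), (s2, t2) in combinations(alignments, 2)
        # Assumed criterion: a pair crosses when source and target order differ.
        if (s1 - s2) * (t1 - t2) < 0
    )


def cross_mwg(m, n):
    """Formula 1.3; m(m-1) and n(n-1) are both even, so the division is exact."""
    return m * n * (m - 1) * (n - 1) // 4


# Compare the brute-force count with the closed form for small group sizes.
for m, n in product(range(1, 7), repeat=2):
    assert internal_crosses(m, n) == cross_mwg(m, n)

print(cross_mwg(3, 3))  # 3-to-3 group, e.g. "Climate change scientists"
print(cross_mwg(6, 4))  # 6-to-4 group, e.g. "may increase by as much as"
```

Under these assumptions the two MWG candidates of the example contain 9 and 90 internal crosses respectively, which illustrates how quickly the value grows with group size.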

In the kind of word alignment visualisation in Figure 1.9, a literal, one-on-one translation would be a straight diagonal descending line from left to right. Disruptions in word order are those parts where an alignment deviates from that diagonal. Looking at the individual word alignments in the figure, that is the case for instance for “civil” (positioned relatively early in the source sentence) which is aligned with “sociale” (near the end of the target sentence).

The corresponding alignment point does not directly follow the previous word “that” diagonally. Looking at the figure, all the alignments of the words of “may increase by as much as 56 %” need to be crossed to align “unrest” with “onrust”. That leads to a significant word_cross value of 26.

In terms of word groups, “civil unrest” (aligned with “sociale onrust”) and “56 %” have swapped places between the source and target sentence. If we create sequence groups as per Definition 2, then only the orange groups (solid lines) are acceptable sequence groups: if we do not allow MWGs, then the blue groups (dashed lines) are not valid groups and the individual word alignments each constitute their own sequence (and SACr) group.

¹⁵ For simplicity’s sake we do not include SACr, but the results would be identical to the seq_cross* values because the sequence groups “civil unrest” (aligned with “sociale onrust”) and “56 %” (“56 %”) are valid subtrees, so the SACr groups are identical to the sequence groups, leading to the same respective cross values.

That means that on the sequence group level, even though “civil unrest” is a single group, it still has to cross every single alignment (black box) in “may increase by as much as” (aligned with “meer kans is op”) as well as the newly formed sequence group “56 %” (“56 %”). That decreases the sequence cross value of “unrest” by one compared to the word_cross value (because it only crosses the group “56 %” rather than its individual alignments), but it is still quite high at 25. If, however, MWGs are allowed and groups can be formed according to Definition 4, then the blue groups (dashed lines) are valid groups, too. If that is the case, the sequence group “civil unrest” only needs to cross two other alignments, namely the one between the MWG “may increase by as much as” and its translation, and the alignment between the sequence group “56 %” and its translation. This leads to a seq_cross_MWG value of only 2.
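The three values for “unrest” (26, 25 and 2) can be reproduced on a toy fragment modelled on the example. The sketch makes several assumptions: the indices are local positions within the fragment and not the sentence positions of Figure 1.9, each unit (word alignment or group) is represented by its first source and first target position, and two units cross when their source and target order disagree. The helper `cross_of` is hypothetical.

```python
from itertools import product


def cross_of(unit, units):
    """Number of units whose relative order differs on source and target side."""
    s, t = unit
    return sum(1 for s2, t2 in units if (s - s2) * (t - t2) < 0)


# Assumed local positions.
# Source: civil(0) unrest(1) may..as(2-7) 56(8) %(9)
# Target: 56(0) %(1) meer kans is op(2-5) sociale(6) onrust(7)
mwg_links = list(product(range(2, 8), range(2, 6)))  # 6 x 4 = 24 alignments

# Word level: every alignment is its own unit.
words = [(0, 6), (1, 7)] + mwg_links + [(8, 0), (9, 1)]
word_cross = cross_of((1, 7), words)

# Sequence level without MWGs: "civil unrest" and "56 %" are groups,
# while the 24 MWG alignments remain singleton groups.
seq_units = [(0, 6)] + mwg_links + [(8, 0)]
seq_cross = cross_of((0, 6), seq_units)

# Sequence level with MWGs: the 6-to-4 block is a single unit.
mwg_units = [(0, 6), (2, 2), (8, 0)]
seq_cross_mwg = cross_of((0, 6), mwg_units)

print(word_cross, seq_cross, seq_cross_mwg)
```

The only difference between the three runs is the size of the alignment units, which is what drives the drop from 26 to 25 to 2.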

As is clear from the seq_cross_MWG values in Figure 1.9, interpreting MWGs as single units greatly reduces the word group cross values. Here, too, the principle holds that the larger the units of alignment, the lower the cross value can be. So the decision whether or not to allow MWGs in either sequence or SACr groups not only impacts the cross values in the word group at hand, but also the cross values of the surrounding groups. In the example discussed above, the seq_cross_MWG value was reduced greatly for all words inside the MWG “may increase by as much as”, but as a consequence of this word group the seq_cross_MWG value for the words in “civil unrest” also decreased significantly.
