

HAL Id: hal-02497512

https://hal.archives-ouvertes.fr/hal-02497512

Submitted on 3 Mar 2020


Benefits and State of the Art of Automatic,

Unsupervised Estimation of MT Quality. Session 5 - Assessing Quality in MT

Niko Papula

To cite this version:

Niko Papula. Benefits and State of the Art of Automatic, Unsupervised Estimation of MT Quality.

Session 5 - Assessing Quality in MT. Tralogy II. Trouver le sens : où sont nos manques et nos besoins respectifs ?, Jan 2013, Paris, France. 5p. ⟨hal-02497512⟩


Niko H.M. Papula

Multilizer niko.papula@multilizer.com

TRALOGY II - Session 5. Date of presentation: 18/01/2013

An automatic MT (machine translation) quality estimation system is briefly described and its results are compared to the findings of the 2012 Workshop on Statistical Machine Translation. The system is found to be better than other published systems, based on the criteria used in that workshop. The practical applicability of the system is discussed based on conducted surveys and a real-life test.

Post-editing using generally available MT services is found to be useful in practice, contrary to findings in previous literature. The key theoretical finding is that comparing output from several generally available MT engines provides better-than-state-of-the-art results in automatic MT quality estimation.


Introduction

Automatic estimation of MT quality refers to a method that assesses the quality of MT without any reference translations (sometimes also called quality prediction). This is in contrast to so-called MT quality evaluation, which uses human-made reference translations for quality assessment, as in BLEU (Papineni et al. 2002). Because human-made reference translations are not required, automatic MT quality estimation can be used more effectively in everyday situations. This is very important in practical applications (Specia et al. 2010).

A common understanding is that post-editing poor-quality MT is both unproductive and frustrating. On the other hand, post-editing "good-enough" MT has been shown to increase productivity and sometimes also quality (Carl et al. 2011; de Almeida and O'Brien 2010; Fiederer and O'Brien 2009; Garcia 2011; Groves and Schmidtke 2009; Guerberof 2009; Plitt and Masselot 2010).

One typical use case of an automatic MT quality estimation system is the following:

1. The automatic MT quality estimation system receives translations from several generally available MT services (such as Google, Bing, Systran, etc.).

2. The system estimates their quality.

3. The system selects the best of them (bad translations may be omitted).

4. The system sends the selected translation and a quality estimate to the user (e.g. for post-editing).
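The four steps above can be sketched as follows. The engine names, threshold, and the estimate_quality() scorer are hypothetical placeholders for illustration, not the actual MT-Qualifier implementation:

```python
# Sketch of the "qualified MT" selection pipeline described above.
# estimate_quality() is a stand-in: a real system would use back-translation
# comparison, linguistic features, etc.

def estimate_quality(candidate: str) -> float:
    """Placeholder quality estimator returning a score between 0 and 100."""
    # Sentence length is used here purely as a dummy signal.
    return min(100.0, len(candidate.split()) * 10.0)

def qualify(candidates: dict, threshold: float = 50.0):
    """Score candidates from several MT engines and select the best one.

    Returns (best_translation, score), or None if all candidates fall
    below the threshold (i.e. bad translations are omitted).
    """
    scored = {engine: estimate_quality(text) for engine, text in candidates.items()}
    best_engine = max(scored, key=scored.get)
    if scored[best_engine] < threshold:
        return None
    return candidates[best_engine], scored[best_engine]

candidates = {
    "engine_a": "The cat sits on the mat near the window.",
    "engine_b": "Cat mat sit.",
}
result = qualify(candidates)
```

The threshold step corresponds to omitting bad translations in step 3; a caller receiving None would fall back to translating from scratch.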

The machine translation obtained with the above process is called qualified MT. This is in contrast to "raw" MT, which refers to machine translation used as such, without any processing or selection.

Other use cases include estimating the post-editing workload prior to the actual work and adjusting the price accordingly, thereby providing information for fair pricing of post-editing work.

Using raw machine translations from generally available MT engines (such as Google, Bing, etc.) is usually not considered usable in practical post-editing (QT Launchpad 2013).

1. Explanation of the system

The system used to obtain the results presented in this article is a commercial software product and service called "Multilizer MT-Qualifier". The key idea of the system is to compare machine translation candidates from several MT engines, both when translating the source text to the target language and when back-translating the translation candidates to the original language. The system also uses linguistic data from both the source and target segments, as well as internal variables of the MT engines, if available.

The system estimates the quality of the MT per sentence. The output is a quality estimate between 0 and 100 for each sentence. The system also selects the best of the available translation candidates for each sentence, thus also improving the MT quality.
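The back-translation comparison idea can be illustrated with a minimal sketch: score each candidate by how closely its back-translation matches the original source. The surface-similarity measure and the 0-100 scaling below are illustrative assumptions, not the patented MT-Qualifier method:

```python
# Minimal round-trip scoring sketch: each MT candidate has been translated
# back to the source language; candidates whose back-translation stays close
# to the original source get a higher score.
from difflib import SequenceMatcher

def round_trip_score(source: str, back_translation: str) -> float:
    """Score 0-100: surface similarity between source and back-translation."""
    return 100.0 * SequenceMatcher(None, source.lower(), back_translation.lower()).ratio()

source = "The meeting was postponed until Friday."
back_translations = {
    "engine_a": "The meeting was postponed until Friday.",
    "engine_b": "The reunion is delayed to Friday.",
}
scores = {eng: round_trip_score(source, bt) for eng, bt in back_translations.items()}
best = max(scores, key=scores.get)
```

A production system would combine such a signal with linguistic features and engine-internal variables, as the paper describes, rather than relying on surface similarity alone.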

Currently the system is available with English as the source language and French, Spanish, Italian, Portuguese, and German as target languages. For English-Spanish, other research also exists. For the other target languages, "MT-Qualifier" appears to be the only published system, and no indication of research in these languages has been found.

A global patent for the system has been applied for.


2. Comparison to results published in WMT12 conference

The 2012 Workshop on Statistical Machine Translation (WMT12) included an automatic MT quality estimation task. The workshop established the current state of the art of automatic MT quality estimation when translating from English to Spanish (Callison-Burch et al. 2012). This provides a basis for developing and comparing the performance of different systems.

The material used in the comparison was that published for the WMT12 conference. It contains a total of 2254 segments, divided into training data and test data (1832 and 422 segments, respectively). The data contains the English source text, a Spanish machine translation, and human quality grades for each segment. The human quality grades were obtained using the following criteria:

1. The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited and needs to be translated from scratch.

2. About 50-70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.

3. About 25-50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.

4. About 10-25% of the MT output needs to be edited. It is generally clear and intelligible.

5. The MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.

(Callison-Burch et al. 2012)

The actual measurement of the results of "MT-Qualifier" was done by doctoral student Maarit Koponen of the University of Helsinki, using the evaluation script provided by the conference. The evaluation includes two criteria, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE); for both, lower values are better. The table below shows the results.

The results of the conference were reported in Callison-Burch et al. (2012). In English-to-Spanish, the current accuracy of the "MT-Qualifier" system is better than that of any system published in the WMT12 conference. This is despite the fact that the system's model was fitted to the data using a different criterion than the conference's evaluation criteria. The criterion used was minimizing the mean squared error, as that seems better suited for practical purposes, giving a greater penalty to large estimation errors. That is, the model was not built to optimize the evaluation criteria used in the WMT12 evaluation; instead, the system was optimized for real-life use.

The Pearson correlation between the estimated quality and the human evaluation score was found to be between 65 and 69% per sentence. This is high compared to the agreement between human evaluations: in the WMT12 data there were three human evaluations per sentence, and the Pearson correlations between these were about 85%.
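Pearson correlation, as used in the comparison above, can be computed as follows; the score vectors are invented for illustration and are not WMT12 data:

```python
# Pearson correlation between per-sentence quality estimates and human
# grades. The input vectors below are made-up examples.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

estimates = [62.0, 40.0, 85.0, 30.0, 70.0]  # system scores, 0-100
human = [3.0, 2.0, 5.0, 1.0, 4.0]           # human grades, 1-5
r = pearson(estimates, human)
```

Note that Pearson correlation is scale-invariant, so the 0-100 estimates can be compared directly against 1-5 human grades without rescaling.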

As mentioned, for the other languages supported by "MT-Qualifier", comparison was not possible because competing systems do not seem to exist.

System                             MAE    RMSE
Multilizer MT-Qualifier            0.58   0.73
Winning system in the conference   0.61   0.75
Baseline system                    0.69   0.82
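The two evaluation criteria, MAE and RMSE, are computed in the standard way (lower is better for both); the example predictions and reference grades below are invented for illustration, not WMT12 data:

```python
# Mean Absolute Error and Root Mean Squared Error over per-sentence
# quality predictions versus human reference grades.
from math import sqrt

def mae(preds, refs):
    """Mean Absolute Error: average magnitude of estimation errors."""
    return sum(abs(p - r) for p, r in zip(preds, refs)) / len(preds)

def rmse(preds, refs):
    """Root Mean Squared Error: penalizes large errors more heavily."""
    return sqrt(sum((p - r) ** 2 for p, r in zip(preds, refs)) / len(preds))

preds = [3.2, 4.1, 2.0, 4.8]  # hypothetical system estimates
refs = [3.0, 4.0, 3.0, 5.0]   # hypothetical human grades
m, rm = mae(preds, refs), rmse(preds, refs)
```

Because squaring amplifies large deviations, RMSE is never smaller than MAE on the same data; this property is what makes minimizing squared error attractive for practical use, as discussed above.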


3. Estimating and measuring real-life business benefits

In addition to comparing theoretical results to results in WMT12 workshop, a series of surveys and tests were conducted to understand the benefits of the system in real-life scenarios.

Post-editing raw MT is often perceived as frustrating and unproductive. It also usually requires customizing the MT service, which is both difficult and expensive. Post-editing qualified MT is very different from post-editing raw MT. First of all, it reduces routine work and lets the translator focus on the more difficult sentences. It also does not require customizing the MT in order to be productive.

To get an understanding of translators' views on this, a first survey was conducted. In the survey, 117 professional translators were given three texts: (a) the source text, (b) one containing raw MT, and (c) one containing qualified MT. The translators were asked how well the translations convey the information in the original text. In qualified MT, the significant information was correct in 57% of cases, as opposed to 27% in raw MT. The decision of which information is significant was left to the professional translators themselves.

To understand the magnitude of the benefits, a second survey was conducted. The 117 professional translators were asked how much qualified MT saves in translation cost, compared to translating from scratch. 70% of the translators said that qualified MT saves translation cost. The average cost saving per word was estimated to be about 50%.

Surveys offer valuable information, but even more reliable information about real-life applicability comes from actually using the system. In one test case, 19 translators were asked to translate a text using a translation memory (TM). The translation memory had been generated with MT and contained those translations that "MT-Qualifier" estimated to be of very high quality. The translators had no knowledge of how the TM had been obtained. The translators accepted the content of the translation memory and discounted the price per word by between 17% and 70% for the contents of the TM, thus confirming the good quality of the translations. The average discount was 50%. A few translators discounted 0% and a few 100%; these were left out of the calculations as possibly unrealistic. Including them would not have changed the average discount meaningfully.

The above surveys show that professional translators believe that qualified MT indeed enables them to work faster. This belief was confirmed by the above test in a real-life scenario. These results show that, contrary to what has been presented in the literature, using MT from generally available MT engines is both usable and productive in practical post-editing.

4. Discussion

The key theoretical result is that comparing output from several generally available MT engines provides better-than-state-of-the-art results in automatic estimation of MT quality.

The key practical result is that using qualified MT enables the use of generally available MT engines in a post-editing process, contrary to what has been presented earlier in the literature.

The main challenge for the system is detecting all good translations: a higher percentage of detected good translations results in a higher percentage of "false positives".

Another challenge is that, because the system uses several machine translations, the price is also higher than when using just a single MT engine.


As noted above, the system's model was fitted using different criteria than those used in the WMT12 conference. It is difficult to say whether using the same criteria as in WMT12 would have produced better results. Most likely it would have improved the results on the WMT12 data somewhat, but on the other hand, the applicability in real-life use cases would probably have been lower.

More use of the system is required to verify and obtain further results.

Bibliography

Callison-Burch C et al (2012) Findings of the 2012 workshop on statistical machine translation. In: Proceedings of the Seventh Workshop on Statistical Machine Translation, pp 10-51

Carl M, Dragsted B, Elming J, Hardt D, Jakobsen AL (2011) The process of post-editing: A pilot study. In: Proceedings of the 8th international NLPSC workshop. Special theme: Human- machine interaction in translation, pp 131-142

de Almeida G, O'Brien S (2010) Analysing post-editing performance: Correlations with years of translation experience. In: EAMT 2010: Proceedings of the 14th Annual conference of the European Association for Machine Translation

Fiederer R, O’Brien S (2009) Quality and machine translation: A realistic objective? JoSTrans 11(January 2009):52-74

Garcia I (2011) Translating by post-editing: Is it the way forward? Machine Translation 25(3):217-237

Groves D, Schmidtke D (2009) Identification and analysis of post-editing patterns for MT. In: MT Summit XII: Proceedings of the twelfth Machine Translation Summit, pp 429-436

Guerberof A (2009) Productivity and quality in MT post-editing. In: MT Summit XII - Workshop: Beyond Translation Memories: New Tools for Translators

Papineni K, Roukos S, Ward T, Zhu W (2002) BLEU: A method for automatic evaluation of machine translation. In: ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp 311-318

Plitt M, Masselot F (2010) A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics 93:7-16

QT Launchpad (2013), research project by German Research Center for Artificial Intelligence (DFKI), Dublin City University (DCU), Athena – Research and Innovation Center in Information, Communication, and Knowledge Technologies - Institute for Language and Speech Processing (ILSP), University of Sheffield (USFD), http://www.qt21.eu/launchpad/

Specia L, Farzindar A (2010) Estimating machine translation post-editing effort with HTER. In: Proceedings of the Second joint EM+/CNGL Workshop "Bringing MT to the user: research on integrating MT in the translation industry" (JEC'10), pp 33-41

Specia L, Raj D, Turchi M (2010) Machine translation evaluation versus quality estimation. Machine Translation 24:39-50
