Relationship between segment length and translation quality in neural machine translation: Freelance translators and small language service providers as end users
CARAGOL RIVERA, Fabiola
Abstract
This thesis evaluates whether there is a correlation between segment length and the quality of neural machine translation output. An English‐Spanish corpus was created to carry out human and automatic evaluations. For the human evaluation, the fluency, adequacy and reusability measures were used, and for the automatic evaluation the BLEU metric. The most marked trends are that segments of 1 to 10 words obtain better scores than longer segments (up to 40 words) and that DeepL outperforms Microsoft Translator. Furthermore, in this study, short segments (20 words or fewer) obtained higher scores than long segments (21‐40 words) throughout the human evaluations.
CARAGOL RIVERA, Fabiola. Relationship between segment length and translation quality in neural machine translation : Freelance translators and small language service providers as end users. Master : Univ. Genève, 2020
Available at:
http://archive-ouverte.unige.ch/unige:145874
Fabiola Caragol Rivera
Relationship between sentence length and translation quality in neural machine translation
Freelance translators and small language service providers as end users
Supervisor: Pierrette Bouillon
Jury member: Johanna Gerlach
Thesis submitted to the Faculty of Translation and Interpreting (Faculté de traduction et d'interprétation) for the degree of Maîtrise universitaire en traitement informatique multilingue (Master's degree in multilingual computer processing)
Université de Genève
October 2020
I declare that I have read the information and plagiarism‐prevention documents issued by the Université de Genève and the Faculty of Translation and Interpreting (in particular the Directive on student plagiarism, the Study Regulations of the Faculty of Translation and Interpreting, and the Aide‐mémoire for students preparing an MA thesis in translation).
I certify that this work is the result of my own personal work and was written independently.
I declare that all sources of information used are cited completely and accurately, including Internet sources.
I am aware that failing to cite a source, or failing to cite it correctly, constitutes plagiarism, and that plagiarism is considered a serious offence within the University, subject to sanctions.
In view of the above, I declare on my honour that this work is original.
CARAGOL RIVERA, Fabiola
Geneva, 6 October 2020
Acknowledgments
I would like to thank first and foremost my thesis director, Pierrette Bouillon, for the time she invested in this project and for all the recommendations and suggestions for improvement she provided. Additionally, I want to thank Johanna Gerlach for her willingness to serve on the jury and for her great help as a professor.
I would also like to thank my mother, Enid Rivera, for being my proofreader and editor and for always giving me very relevant advice regardless of how far removed the topic was from her domain. Mostly, I would like to thank her for the moral support during this process.
Lastly, I would like to thank my partner, Nicolas, for the moral support, continuous pep talks during the ups and downs of this process, and for pushing me to finish this project.
Table of Contents
1. Introduction
1.1 Background
1.2 Research question and hypothesis
1.3 Methodology
1.4 Plan
2. Machine translation
2.1 Background
2.2 Types of translation
2.3 Current state of machine translation
2.3.1 Statistical machine translation
2.3.2 Neural machine translation
2.4 Obstacles
2.5 Conclusion
3. MT Evaluation metrics
3.1 General
3.2 Human evaluation
3.2.1 Error annotation
3.2.2 Fluency and adequacy
3.2.3 Reusability
3.2.4 Ranking
3.2.5 Agreement
3.3 Automatic evaluation
3.3.1 Suitability
3.3.2 BLEU metric
3.4 Conclusion
4. Methodology
4.1 Purpose
4.2 NMT systems
4.2.1 Web browser version
4.2.2 Desktop version
4.3 Corpora
4.4 Human evaluation
4.4.1 Evaluation
4.4.2 Annotators
4.5 Automatic evaluation
4.5.1 Reference translation
4.5.2 Evaluation metric
4.6 Conclusion
5. Analysis
5.1 Human evaluation
5.1.1 Adequacy evaluation
5.1.2 Fluency evaluation
5.1.3 Reusability evaluation
5.1.4 Comparison of scores by evaluations
5.1.5 Inter‐annotator agreement
5.2 Automatic evaluation
5.3 Correlation between automatic and human evaluation
5.4 Conclusion
6. Conclusion
6.1 Summary and results
6.2 Future studies
Annex A
A.1 Instructions for the fluency and adequacy evaluations
A.2 Instructions for the reusability evaluation
Annex B
Annex C
Annex D
Bibliography
List of Figures
Figure 1 – Articles containing "neural machine translation" in ArXiv.org (Source: Personal collection)
Figure 2 – Classification of translation types [Source: Hutchins and Somers (1992) as cited in Quah (2006, p. 7)]
Figure 3 – Example of the adequacy evaluation
Figure 4 – Example of the fluency evaluation
Figure 5 – Example of the reusability evaluation
Figure 6 – Screenshot of the BLEU calculator
Figure 7 – Example of ordinal and numerical data
Figure 8 – Frequency of ratings for the adequacy evaluation (300 results per category)
Figure 9 – Frequency of ratings for the fluency evaluation (300 results per category)
Figure 10 – Percentage of ratings for the reusability evaluation
Figure 11 – Frequency of ratings for the reusability evaluation (300 ratings per category)
Figure 12 – Meaning of the average scores in the reusability evaluation
Figure 13 – BLEU scores
List of Tables
Table 1 – TAUS fluency and adequacy scale
Table 2 – Labels for kappa agreement
Table 3 – Precision measure
Table 4 – Modified precision measure
Table 5 – Description of the categories
Table 6 – Corpora
Table 7 – Source text information
Table 8 – Example of segments used in the evaluation, including word count and category
Table 9 – Evaluations’ content
Table 10 – Frequency of ratings for the adequacy evaluation
Table 11 – Average of ratings for the adequacy evaluation
Table 12 – Average of ratings for the adequacy evaluation by NMT system
Table 13 – Number of ratings for the fluency evaluation
Table 14 – Average of ratings for the fluency evaluation
Table 15 – Average of ratings for the fluency evaluation by NMT system
Table 16 – Frequency of ratings for the reusability evaluation
Table 17 – Average of ratings for the reusability evaluation
Table 18 – Average of ratings for the reusability evaluation per NMT system
Table 19 – Inter‐annotator agreement
Table 20 – BLEU scores
Table 21 – Order of the scores/averages for DeepL® (1 = best, 4 = worst)
Table 22 – Order of the scores/averages for Microsoft Translator® (1 = best, 4 = worst)
Table 23 – BLEU scores with the categories regrouped
Table 24 – Order of the scores/averages for DeepL® with the categories regrouped (1 = highest, 2 = lowest)
Table 25 – Order of the scores/averages for Microsoft Translator® with the categories regrouped (1 = highest, 2 = lowest)
Table B.1 – Results from cat. 1 for adequacy
Table B.2 – Results from cat. 2 for adequacy
Table B.3 – Results from cat. 3 for adequacy
Table B.4 – Results from cat. 4 for adequacy
Table C.1 – Results from cat. 1 for fluency
Table C.2 – Results from cat. 2 for fluency
Table C.3 – Results from cat. 3 for fluency
Table C.4 – Results from cat. 4 for fluency
Table D.1 – Results from cat. 1 for reusability
Table D.2 – Results from cat. 2 for reusability
Table D.3 – Results from cat. 3 for reusability
Table D.4 – Results from cat. 4 for reusability
List of abbreviations
BigLM big language model
BLEU bilingual evaluation understudy
CAT computer aided translation
DL DeepL
FAHQT fully automatic high‐quality translation
GPU graphic processing unit
grConv gated recursive convolutional neural network
HAMT human aided machine translation
HT human translation
LSP language service provider
MAHT machine‐aided human translation
MT machine translation
Mt Microsoft Translator
MWU multiword unit
NMT neural machine translation
RBMT rule based machine translation
RCTM recurrent continuous translation model
RNNenc recurrent neural network encoder‐decoder model
SMT statistical machine translation
TER translation error rate
TM translation memory
WER word error rate
1. Introduction
An increasingly interconnected world has increased the demand for language services, as the rapid growth of the language services market shows. This market has doubled in the last ten years, reaching a value of 46 billion US dollars in 2019 (Mazareanu 2019). Within the language services industry, the machine translation share of the market was estimated at 550 million US dollars in 2019 (Mordor Intelligence 2019). Machine translation is now known to the general public thanks to publicly available systems like Google Translate® and Microsoft Bing®. Whereas in the past translation tools were unlikely to be known outside the field, it now seems commonplace for the general public to use these systems.
However, opinions on machine translation are very polarized. Part of the general public still meets machine translation with skepticism and criticizes the quality of the translations. At the other end of the spectrum, it is not uncommon to hear that the translation profession will soon disappear because translation can be done by machines. Within the translation industry, machine translation also encounters obstacles, mainly translators’ resistance to using it. This is shown in a study of freelance translators and interpreters conducted by CSA Research, in which only 36% of the translators said that they offer post‐editing of machine translation as a service and 45% said they never use machine translation (Pielmeier & O'Mara 2020, pp. 7, 20). Nevertheless, machine translation is a helpful tool for translators and should not be overlooked because of concerns about how it may affect the profession.
1.1 Background
Neural machine translation (NMT) is a topic that has exploded over the past decade.
The use of neural methods integrated into statistical machine translation started with the work of Schwenk 2007 (cited in Koehn 2017, p. 5). Initially it was a slow process because of computational constraints; the research groups lacked the graphic processing units (GPUs) needed to train the systems. It was not until 2013 that purely NMT systems started being used, and by 2015 the results of NMT were finally competitive: “[…] NMT became the new state of the art. Within a year or two, the entire research field of machine translation went neural” (Koehn 2017, pp. 5‐6).
NMT has continued developing at a fast pace in recent years. A simple search for the term “neural machine translation” in the title or abstract of research papers in ArXiv.org, a repository of scientific e‐prints, produced the following results: 3 papers submitted in 2014, 9 in 2015, 70 in 2016, 138 in 2017, 265 in 2018, 304 in 2019, and 187 in 2020 up to the month of June (see Figure 1; please note that the number of publications shown for 2020 is an estimate based on the publication rate for the first half of the year). These numbers paint a clear picture of how much research attention this topic has garnered. The creation of diverse architectures, models and training techniques has continued to revolutionize the field of machine translation (MT) (Chen et al. 2018).
Figure 1 ‐ Articles containing "neural machine translation" in ArXiv.org (Source: Personal collection)
[Bar chart: number of published articles (vertical axis, 0 to 400) by publication year (horizontal axis, 2014 to 2020*).]
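The 2020 value plotted in Figure 1 is an extrapolation rather than an observed count. A minimal sketch of how such an estimate can be derived from the figures reported above is given below. It assumes a simple linear projection of the January to June count over twelve months; the exact formula used for the figure is not stated, so the projection rule and the variable names are illustrative assumptions.

```python
# Counts of arXiv papers with "neural machine translation" in the title or
# abstract, as reported in the text. The 2020 count covers January-June only.
papers_per_year = {2014: 3, 2015: 9, 2016: 70, 2017: 138, 2018: 265, 2019: 304}
papers_2020_first_half = 187
months_observed = 6

# Assumption: the publication rate stays constant for the rest of the year.
estimate_2020 = round(papers_2020_first_half * 12 / months_observed)
papers_per_year[2020] = estimate_2020

for year, count in sorted(papers_per_year.items()):
    marker = "*" if year == 2020 else ""  # * marks the estimated value
    print(f"{year}{marker}: {count}")
```

Under this assumption the projected value stays below the upper bound of the chart's vertical axis (400).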
However, the average translator has very limited knowledge about the technology behind NMT systems. Because of limited resources, a small language service provider (LSP) or a freelance translator is in most cases restricted to publicly available NMT systems or to vendor solutions. Therefore, studies that compare the NMT output of different systems or models, although very useful for the research community, are not always useful for a translator, because the NMT systems a translator is most likely to have access to are not necessarily the ones being researched. Many systems, toolkits, vendor solutions, etc. are currently available and grouped under the NMT label. Any statement regarding the quality of NMT output is therefore specific to a tool, model or system.
Statements like “translation quality decreases as source sentence length increases” (Vandeghinste 2018, p. 38), “NMT systems […] perform poorly when translating very long sentences” (Toral & Sánchez‐Cartagena 2017, p. 1063) and many others that convey the same information (long sentences perform worse in NMT) are commonplace in the field of MT. However, there are not many studies that focus solely on this research topic.
Most of the studies that observe this issue focus on comparing statistical machine translation (SMT) to NMT. The translations produced by NMT systems have garnered attention for being “considerably different, more fluent and more accurate in terms of word order compared to those produced by phrase‐based systems. NMT systems are also more accurate at producing inflected forms, but they perform poorly when translating very long sentences” (Toral & Sánchez‐Cartagena, 2017, p. 1063). In terms of the bilingual evaluation understudy (BLEU) metric, an algorithm for evaluating the quality of text that has been machine‐translated from one natural language to another (see 3.3.2), SMT still obtained considerably higher scores than NMT for long sentences in 2014, as shown by Cho, van Merriënboer, Bahdanau and Bengio (2014), although by 2017 the score difference between NMT and SMT had been reduced (Koehn & Knowles 2017).
Pouget‐Abadie, Bahdanau, van Merriënboer, Cho and Bengio (2014) wrote an article about how to improve translation quality with NMT by segmenting long sentences.
They start from the premise that “neural network translation systems suffer from a significant drop in translation quality when translating long sentences” and propose a method for segmenting sentences that are longer than 20 words to improve quality (p. 78). Their method improves the BLEU scores obtained for the newly segmented sentences compared with the original long sentences. They cite the works of Kalchbrenner and Blunsom (2013) and Cho et al. (2014) as studies that have shown that translation quality decreases as the length of the source sentence increases.
Pouget‐Abadie et al. (2014) use the recurrent neural network encoder‐decoder model (RNNenc) for their study, Kalchbrenner and Blunsom use Recurrent Continuous Translation Models (RCTM) and Cho et al. use RNNenc and a gated recursive convolutional neural network (grConv) (p. 78). It is important to note that these researchers study different models, and sometimes different architectures, of NMT systems to assess the translation quality of long sentences (>20 words) in comparison to short sentences (<20 words). Pouget‐Abadie et al. mention that one of the reasons for this decrease in quality for long sentences is that the corpora used to train the systems contain too few long sentences (p. 78). Their own study showed that the output quality of NMT did not start decreasing until the length of the sentences went beyond that of the sentences the system had been trained with.
1.2 Research question and hypothesis
These studies lay the groundwork for this study. We want to see if there is a correlation between sentence length and NMT output quality. The variables for this study are sentence length and NMT system. The target audience consists of freelance translators and small LSPs; therefore, previous knowledge of the technology behind the NMT systems is not expected. Considering these factors, we chose two NMT systems (DeepL® and Microsoft Translator®) that are ready to use and available for free or at a low cost.
We used the aforementioned studies as a basis for our hypothesis. We know that in these studies long sentences performed worse and obtained lower scores than short sentences. Our hypothesis is that short sentences will perform better than long sentences in NMT systems. We will observe whether the drop in quality for long sentences shown in the studies mentioned still occurs today in the systems we have chosen. One of the main differences between this study and the previous ones is that in our case the end user does not train the system and therefore has no knowledge of what the training data sets consist of, or whether they contain enough long sentences. Considering that DeepL® owns Linguee®, a vast database of human translations, and that Microsoft Translator® has Bing®, a web search engine, both systems have access to an immense amount of data, which could include sentences of all lengths. As to the model or architecture being compared, DeepL® does not disclose what kind of NMT architecture or model it uses, and Microsoft Translator® uses the Marian NMT open source toolkit with a transformer model. These are all factors that could influence our results.
1.3 Methodology
Microsoft Translator® and DeepL®, two similar systems, will be evaluated to study the NMT quality of sentences of varying lengths. The hypothesis for this study is that as sentence length increases, quality decreases. A data set will be created containing English sentences from the general domain, with a uniform distribution of sentence lengths. The target language tested will be Spanish, and a professional translator will provide a reference translation for the automatic evaluation.
Alongside the many advances in MT, the methods for evaluating MT output have also evolved. There are now different automatic evaluation metrics, which use a reference translation to give the MT output a score. The translations in this study will be evaluated both manually by human translators and automatically with the BLEU metric, to observe whether the two evaluations correlate. The group of human evaluators will consist of translators with Spanish as their active language and English as their passive language.
The human evaluation performed will follow Koehn’s (2010) proposed method of evaluating for fluency and adequacy. Using a graded scale, translators will be asked to determine the fluency of the NMT output. Is the translation fluent in Spanish? Is it idiomatic? Is it grammatically correct? Those are the key questions when evaluating for fluency. To evaluate for fluency the translator does not necessarily need to consider the source text, and for this reason the fluency evaluations consist only of the two target texts. Evaluating for adequacy consists of comparing the source text and its translation and determining if the meaning is conveyed in full or if some of it is missing, added or distorted. In addition to evaluating for fluency and adequacy, a third evaluation will be done in which a translator is asked whether he/she would use the translation.
After the evaluation is done, the results will be analyzed. We are looking to answer the following questions: Is there a correlation between sentence length and NMT output quality? Are the results of the human evaluation similar to the results of the automatic evaluation?
1.4 Plan
Chapter 2 presents a brief introduction to NMT, followed by the current obstacles it faces. Chapter 3 explains several MT evaluation metrics for both human and automatic evaluation. Chapter 4 describes the methodology chosen to carry out the evaluations and the criteria for choosing evaluators. Chapter 5 presents the results along with their analysis, and Chapter 6 contains the conclusion.
2. Machine translation
This chapter consists of a brief history of the MT field: the main stakeholders, the development of technologies and what impact that had on the field, and the context in which NMT finally gained traction (section 2.1). The types of translation are described in section 2.2. In addition to that, the current state of MT and its most used approaches are described in section 2.3 and lastly the six main challenges of NMT, as explained by Koehn and Knowles, are also presented with their relation to this study (section 2.4). A definition of machine translation (European Association of Machine Translation n.d.) is presented below.
Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another.
2.1 Background
Neural machine translation is the most recent approach used in the field of machine translation. However, machine translation had its beginnings in the 1950s, in the context of the Cold War where there was a need to translate from Russian to English in the Western World (Poibeau 2017, pp. 49‐51). The first attempts at machine translation systems consisted of bilingual dictionaries and a direct translation approach.
In 1954 the first MT demonstration attracted a lot of interest from the American funding agencies. The research group from Georgetown University in a partnership with IBM presented a prototype which they thought could be commercially available within months. However, the issue of word ambiguity, which at first was thought to only affect a small number of cases, proved to be a big obstacle in producing quality results. The systems which had had so much promise in the 1950s did not achieve the desired results.
In the next decade the research funding started to decrease and with the publication of the ALPAC report, the research in this field was severely impacted and diminished in the United States (Poibeau 2017, pp. 49‐89).
In the 1980s texts became increasingly available in electronic form. Some of these texts were translations, which was a leading factor in the emergence of parallel corpora and of research on sentence alignment. A parallel corpus is a collection of texts paired with their translations, and sentence alignment is the process of aligning, at the sentence level, the same text in different languages. The availability of this type of data led to the creation of the example‐based translation approach and of statistical MT systems (Poibeau 2017, pp. 90‐93).
Neural networks were invented in the 1950s and had a resurgence in the 1980s, but it was not until the 2010s that the computational power of computers made it possible to manage the complexity of the representations involved. Neural networks first achieved success in image recognition but are now used in various fields like speech processing and natural language processing. For the MT field, NMT can be considered a revolution, and it has been adopted very quickly by companies like Google, Bing, Facebook, Systran, etc., which now have their online systems based on this approach (Poibeau 2017, pp. 183‐195).
2.2 Types of translation
In section 2.1 the different systems for MT were described. Nonetheless, it is also important to describe the types of translation and how MT and other tools are used in the translation industry. In Figure 2 we can see an illustration of the different types of translation.
Figure 2 – Classification of translation types [Source: Hutchins and Somers (1992) as cited in Quah (2006, p. 7)]
Human translation (HT) is translation accomplished without the use of any computer‐based translation tool. The traditional translator would use his/her knowledge combined with resources like (paper) monolingual/bilingual dictionaries, encyclopedias or thesauri. Nowadays it is uncommon to find translators who still work like this. With the increasing availability of reference works online, and with many word processing programs integrating spell, grammar and style checkers, modern translators have adopted at least these tools into their work. In addition, as computer aided translation (CAT) tools have gained popularity, it has become commonplace for translators to integrate at least one of these tools into their work (Quah 2006, p. 14).
The main actor in machine‐aided human translation (MAHT) is, as the name suggests, the human. It is a translation done by a person with the help of a translation tool.
According to Hutchins and Somers (1992, p. 149), MAHT “must involve a computer‐based linguistic aid”. Some of the tools that fall in this category are spell, style and grammar checkers, electronic dictionaries, terminology databases, terminology management software, alignment tools, translation memories (TM), electronic glossaries and others (Quah 2006, p. 13).
Human aided machine translation (HAMT) is translation done by a machine in which the human only intervenes as needed, mainly in the form of pre‐editing and/or post‐editing. Pre‐editing involves checking the source text before it is translated and resolving issues that could cause problems in the translation, for example resolving word ambiguities or splitting a long sentence into two. Post‐editing is when the human intervenes after the translation is completed to correct any issues such as inaccuracies or style.
In fully automatic high‐quality translation (FAHQT) there is no human intervention. However, this type of translation was considered unfeasible by Bar‐Hillel in 1960 and, despite the vast improvements in the quality of MT output, this is still the case today.
Systems that integrate multiple MAHT tools were known as workbenches. Nowadays there are integrated systems that include various features, some of which may fit into different categories, and therefore these integrated systems cannot be assigned to only one type of translation (Quah 2006, pp. 7‐8). These systems are known simply as CAT tools. The category of CAT tools encompasses most of the tools used today by translators and is not limited to those previously mentioned.
2.3 Current state of machine translation
Different approaches have been used for machine translation. Some of them are rule based machine translation (RBMT), which can be subdivided into direct translation, indirect translation and interlingua translation; statistical machine translation (SMT), which includes word‐based SMT and phrase‐based SMT; and neural machine translation (NMT). We will describe SMT and NMT in the following sections given their importance in today’s MT domain.
2.3.1 Statistical machine translation
Statistical machine translation is a corpus‐based approach to MT. SMT systems have two main processes: training and decoding. For training, the system uses a parallel corpus to extract a statistical translation model and a monolingual corpus to extract a statistical target‐language model. The translation model is a compilation of possible translations. This compilation is similar to a bilingual dictionary, except that instead of containing only base forms, it contains all the segments (n‐grams) with their associated probabilities. The language model consists of target‐language word sequences and their probabilities. The models extracted during training are then used during the decoding process, which takes the source segment and provides the translation with the highest overall probability according to the translation and language models (Hearne & Way 2011, pp. 205‐206).
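The interplay of the two models during decoding can be illustrated with a toy sketch. It is not the algorithm of any particular SMT system: the two dictionaries, their probabilities and the `decode` function are hypothetical, and a real decoder also segments the input, reorders words and searches over a very large number of hypotheses. The sketch only shows how a candidate is scored by combining a translation‐model probability with a language‐model probability.

```python
import math

# Hypothetical toy models; every probability below is invented for illustration.
# Translation model: probability attached to a (source, candidate target) pair.
translation_model = {
    ("the house", "la casa"): 0.7,
    ("the house", "el casa"): 0.2,
    ("the house", "la hogar"): 0.1,
}
# Language model: probability of each target-language word sequence.
language_model = {"la casa": 0.05, "el casa": 0.0001, "la hogar": 0.0002}


def decode(source):
    """Return the candidate with the highest combined log-probability."""
    best, best_score = None, -math.inf
    for (src, candidate), tm_prob in translation_model.items():
        if src != source:
            continue
        # Adding log-probabilities is equivalent to multiplying probabilities.
        score = math.log(tm_prob) + math.log(language_model[candidate])
        if score > best_score:
            best, best_score = candidate, score
    return best


print(decode("the house"))
```

In this toy setting the language model penalizes the ungrammatical candidates, so the fluent candidate wins even though the translation model alone already gives it the highest probability.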
2.3.2 Neural machine translation
Neural machine translation is the latest generation of MT systems. It is a corpus‐based approach to MT that is based on deep learning. These systems assign words and phrases a distributed representation. These distributed representations are vectors, which in other words provide a set of “coordinates” for each word or sentence based on its meaning. “Two similar concepts would ideally be close to each other and therefore have similar coordinates; very different concepts would be far apart from each other and therefore have different coordinates” (Forcada 2017, p. 295). There are various architectures for NMT systems, but the typical one is the encoder‐decoder architecture (p. 299), although an attention mechanism is commonly added to obtain better results with longer sentences. A simple explanation of how encoder‐decoder NMT works is as follows: the encoder encodes each word in the source segment into a vector, and these vectors are then combined into a representation of the whole sentence. The decoder then turns this representation into a sentence in the target language (Asquith 2019).
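The idea of “coordinates” can be made concrete with a small sketch. The three‐dimensional vectors below are hypothetical values chosen by hand; real NMT systems learn vectors with hundreds of dimensions from training data. Cosine similarity is used here as one common way of measuring how close two vectors are, which is an assumption of this sketch and not a detail taken from the sources cited above.

```python
import math

# Hypothetical 3-dimensional "coordinates"; real systems learn much larger
# vectors from data instead of using hand-picked values.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "treaty": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


similar = cosine_similarity(embeddings["cat"], embeddings["dog"])
different = cosine_similarity(embeddings["cat"], embeddings["treaty"])
print(f"cat vs dog:    {similar:.2f}")
print(f"cat vs treaty: {different:.2f}")
```

With these toy values the two related words obtain a higher similarity than the unrelated pair, which mirrors the behavior described in the quotation from Forcada.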
2.4 Obstacles
Koehn and Knowles (2017) present six important challenges to NMT: domain mismatch, amount of training data, low frequency words, long sentences, word alignment, and beam search. Koehn and Knowles state that a common factor in many of the obstacles NMT systems face is that “neural translation models do not show robust behavior when confronted with conditions that differ significantly from training conditions” (p. 37). This is similar to what Pouget‐Abadie et al. (2014) state when explaining one of the factors behind the poor performance of NMT when confronted with long sentences. We will explain the first four challenges in more detail, because they are the most relevant to the topic of this thesis.
The first challenge, domain mismatch, is when a text of one domain, for example a medical domain text, is translated using an NMT system trained with corpora from a different domain, for example with legal corpora. According to Koehn and Knowles (2017), when comparing the outcome of translations in these conditions NMT systems perform worse than SMT systems. The same word can have different translations and be expressed in different styles depending on the domain. We will not be exploring this particular challenge in this study because the texts chosen are from a general domain.
However, this is an important consideration if the user is interested in translating texts pertaining to particular domains.
The second challenge relates to the amount of training data, which is studied by Koehn and Knowles (2017) with an English‐Spanish corpus. The authors compare the BLEU scores for systems trained with varying sizes of training data. The training data volume is as follows: 1/1024, 1/512, 1/256, 1/128, …, 1/2 and all of the training data (the former are fractions of the totality of the training data, 385.7 million English words paired to Spanish). In this comparison they observe differences in the learning curves for NMT and SMT systems. “NMT exhibits a much steeper learning curve, starting with abysmal results (BLEU score of 1.6 vs. 16.4 for 1/1024 of the data), outperforming SMT 25.7 vs. 24.7 with 1/16 of the data (24.1 million words), and even beating the SMT system with a big language model with the full data set (31.1 for NMT, 28.4 for SMT, 30.4 for SMT+BigLM)” (Koehn & Knowles 2017, p. 31). Their study shows the potential for NMT to deliver a high‐quality translation if the system is trained with large corpora.
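The arithmetic behind these training set sizes can be checked with a short sketch. It assumes that the elided steps in the list are the successive powers of two between 1/128 and 1/2, which the text implies but does not spell out; the variable names are illustrative.

```python
from fractions import Fraction

# Size of the full English side of the corpus, as reported in the text.
TOTAL_ENGLISH_WORDS = 385.7e6

# Assumption: the series halves at every step, from 1/1024 up to 1/2,
# followed by the full data set.
fractions = [Fraction(1, 2 ** k) for k in range(10, 0, -1)] + [Fraction(1)]

for frac in fractions:
    words_millions = TOTAL_ENGLISH_WORDS * float(frac) / 1e6
    print(f"{str(frac):>6} of the data: about {words_millions:.1f} million words")
```

For 1/16 of the data the sketch yields about 24.1 million words, which agrees with the figure quoted from Koehn and Knowles.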
For the purposes of this study, the challenge of training data is not considered for two reasons. The first reason is that Microsoft Translator® and DeepL® do not offer the details of their training data, but it is safe to assume that they have access to large amounts of training data in various languages and that they use it to their advantage. The second reason is that the user has no control over how these systems are trained.
Nevertheless, training data is an important factor for users who intend to train or customize an NMT system themselves, which is possible with tools such as Custom Translator®: they will need to obtain training data in the desired languages.
The third challenge, rare words, is a challenge for both SMT and NMT systems. It consists of the poor translation performance of a system when it encounters a word that is absent from the training data or very infrequent in it. It is impossible to know which words are in the training data of the systems compared in this thesis; this challenge is therefore out of its scope. As with the first two challenges, this information could be useful to the user when analyzing the translation output or when choosing the data to customize a system.
The fourth challenge, long sentences, is the main focus of this thesis. Koehn and Knowles (2017) recount how poor performance on long sentences was a well‐known flaw of early encoder‐decoder models, and how the introduction of the attention model somewhat remedied this flaw. The authors use the same systems as for the second challenge, amount of training data, to compare the translation quality of SMT and NMT. They grouped sentences by length (1‐9 words, 10‐19 words, and so on up to 70‐79 words) and obtained the BLEU scores for these test sets. Their findings were that the NMT system outperformed the SMT system until the test set containing 60‐69‐word sentences. However, an interesting fact about this study is that the test set that obtained the highest BLEU score for the NMT system was the 50‐59‐word test set (34.7 BLEU), followed by the 60‐69‐word test set. These findings are very interesting for this thesis as they seem to contradict what has been said so far regarding sentence length and quality (see Vandeghinste and Toral & Sánchez‐Cartagena in 1.1).
Another challenge is multiword units (MWU). This natural occurrence in language is explained as “meaningful lexical units made of two or more words in which at least one of them is restricted by linguistic conventions in the sense that it is not freely chosen”
(Monti et al. 2018, p. 1). They use the example of the MWU smell a rat, where the word rat is fixed and cannot be substituted by a synonym, for example rodent or mouse. Monti et al.
(2018) explain that the meaning of a multiword unit is not always derivable from the meaning of its components nor can it be determined by the rules used to combine them.
Since MWUs are very pervasive in language and are present in sentences of all lengths, this may be considered an important factor affecting NMT quality and therefore this study.
2.5 Conclusion
This chapter has given an outline of the machine translation field, from its theoretical beginnings to the MT systems used today. The types of translation were explained along with a brief description of how statistical and neural machine translation systems work. In addition, the six obstacles to NMT from Koehn and Knowles (2017) were presented, as well as their relation to this study.
3. MT Evaluation metrics
In this chapter we describe two types of evaluation metrics: human evaluation and automatic evaluation. In section 3.1 there is a brief introduction to the topic, followed by some of the different approaches used for human evaluation (section 3.2) including the ones used later in the evaluation which are fluency and adequacy (3.2.2) and reusability (3.2.3). A non‐exhaustive list of automatic evaluation metrics is described in section 3.3.
3.1 General
The evaluation of machine translations can be done in various ways, but all metrics share a common issue: there is no such thing as a perfect translation, or a gold standard. If the same phrase is given to several translators, different translations will be obtained, and this does not necessarily imply that one is correct and the rest are incorrect. This makes translation evaluation less objective than desired; whether it is performed with an automatic or a human evaluation metric, it is intrinsically flawed.
However, that is not to say that the current metrics are not useful, but that there is no perfect metric.
Depending on what is being evaluated, a metric might be more appropriate than another. There are several criteria that are desired for an evaluation: low cost, quickness, consistency, meaningfulness. Each metric can be assessed to verify if it delivers these criteria or some of them.
3.2 Human evaluation
A human evaluation is performed, as the name says, by human annotators. Depending on the metric chosen for the evaluation, the annotator needs a different set of skills. The following sections give an overview of the annotator’s profile, what each metric evaluates, how that relates to the purpose of the evaluation, and other constraints specific to each metric.
3.2.1 Error annotation
The core of error annotation is to evaluate each translation by specifying what types of errors are present. The typology of translation errors is vast and can differ significantly depending on the metric chosen. One example is the Multidimensional Quality Metrics (MQM) framework for translation quality assessment by Lommel and Uszkoreit (2014). MQM was created to provide a framework that can be adapted to the evaluation needs of different projects, instead of relying on separate metrics with no interoperability between them. The framework contains a list of core issue types that include design, locale conventions, style, terminology, verity, fluency and accuracy. Some of these issue types have subtypes; for example, accuracy issues can be further divided into addition, mistranslation, omission and untranslated. A human annotator using an error annotation metric must be bilingual. This type of evaluation metric is analytical and provides a way to quantify specific errors. Because it gives a detailed view of common issues, it can guide developers in finding solutions to improve translation quality.
3.2.2 Fluency and adequacy
Evaluating the fluency and adequacy of a translation is an additional method used for the human evaluation. Koehn (2010) proposes some simple questions that the annotator should take into consideration when evaluating the fluency of a segment: Is the translation good fluent Spanish? Is it grammatically correct? Are the word choices idiomatic? The evaluation for fluency should be separate from the adequacy evaluation to avoid any confusion. The profile requirements of a fluency annotator are also different from the annotators that perform error annotation. The fluency annotator must be someone whose active language is Spanish but it is not necessary for the annotator to know the source language as only the translation is present in the evaluation.
Adequacy measures how well the original meaning is preserved in the translation. Koehn (2010) gives two key questions for the annotator: Does the translation convey the same meaning as the source text? and “Is part of the message lost, added or distorted?” (p. 218). Annotators assessing a text for adequacy must have a working knowledge of the source language to fully understand the meaning of the source text, and must have the target language as their active language.
A graded scale is given to the annotators when evaluating fluency and adequacy. Koehn (2010) proposes a 5‐point scale, which has the disadvantage of having an odd number of grades: when there is a middle grade, annotators might be tempted to choose it more often than the other grades. TAUS, a think tank and resource center for the translation industry, provides guidelines for this type of evaluation, and their example of a graded scale can be seen in Table 1.
Table 1 – TAUS fluency and adequacy scale
| Score | Fluency | Adequacy |
|---|---|---|
| 4 | Flawless | Everything |
| 3 | Good | Most |
| 2 | Dis‐fluent | Little |
| 1 | Incomprehensible | None |
3.2.3 Reusability
The reusability metric is not an established metric, but it will be used in this study. It consists of asking annotators whether they would use the NMT output; the annotators answer “yes” or “no”. It is explained to the annotators beforehand that the purpose is not to find a perfect NMT output, but to evaluate whether the output is worth using considering the post‐editing needed to achieve the desired quality. The purpose of this evaluation is to observe whether there is a correlation between its scores and the scores of the other human evaluations, in this case the fluency and adequacy evaluation.
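One possible way to check for such a correlation is to code the answers as 1 (“yes”) and 0 (“no”) and compute a Pearson coefficient against the graded scores. The sketch below rests on assumptions: this section does not say which coefficient is used, the function name is our own, and the sample values are toy data for illustration, not results from this study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Toy data (hypothetical): reusability coded 1 = "yes", 0 = "no",
# paired with fluency grades on the 4-point scale.
reusability = [1, 0, 1, 0]
fluency = [4, 1, 4, 1]
print(pearson(reusability, fluency))
```

A coefficient close to 1 would suggest that segments judged reusable also tend to receive high fluency grades; a value near 0 would suggest no linear relationship.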
3.2.4 Ranking
The ranking metric consists of evaluating translations phrase by phrase to determine whether translation “a” is better, slightly better, equal, slightly worse or worse than translation “b”. Typically, evaluation with this metric is easier than with the fluency and adequacy metric (Koehn 2010, p. 220). A bilingual annotator is given the source text and the target text of each system; two or more systems can be compared. This evaluation is carried out more quickly than the previous metrics shown here, and it is useful when the purpose of the study is to compare one system to another. This metric, however, does not tell us anything about the quality of the translations per se, nor about the quality of the systems. System “a” can outrank system “b”, but that does not imply that system “a” performs well, nor does it imply the contrary.
3.2.5 Agreement
When performing a human evaluation with more than one annotator it is also pertinent to calculate the agreement between annotators. Human annotators can assess the same phrase and assign it different scores. If we take as an example the fluency and adequacy metric, “good” fluency can be interpreted in different ways. Some annotators might be more lenient than others and professional beliefs might also come into play. If we have an annotator who is an expert in grammar and puts a lot of weight in this area, he or she might be more severe in assigning a score to a phrase and what that annotator considers disfluent can be considered good by someone else. Consequently, we might be faced with scores that are very varied. Normalizing the scores is a way to deal with this issue.
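As an illustration of what normalizing the scores can look like, the sketch below rescales each annotator's grades by that annotator's own mean and standard deviation (z‐scores), so that a lenient and a severe annotator become comparable. This is only one common approach and an assumption on our part; the text does not specify a normalization method, and the annotator names and grades are toy values.

```python
from statistics import mean, pstdev


def normalize_scores(scores_by_annotator):
    """Return z-scores per annotator: (score - annotator mean) / annotator std dev."""
    normalized = {}
    for annotator, scores in scores_by_annotator.items():
        average = mean(scores)
        deviation = pstdev(scores)
        # An annotator who gives the same grade everywhere has no spread to rescale.
        normalized[annotator] = [
            (score - average) / deviation if deviation else 0.0 for score in scores
        ]
    return normalized


# Toy data (hypothetical): annotator B is consistently two grades more severe than A.
raw = {"A": [4, 4, 3, 4], "B": [2, 2, 1, 2]}
print(normalize_scores(raw))
```

After normalization both toy annotators yield the same values, because they differ only in severity and not in how they rank the segments.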
There are other ways to evaluate agreement. For example, when there are only two annotators, we can use Cohen’s kappa coefficient (κ), commonly referred to simply as kappa. Kappa is the “proportion of agreement corrected for chance” and is used to assess whether the scores obtained from two annotators are reliable (Fleiss & Cohen 1973, p. 613). This metric is usually reported on a scale from 0 to 1, where a value of 1 means perfect agreement between annotators (Upton & Cook 2014, p. 85); negative values, which indicate less agreement than expected by chance, are also possible (see the first row of Table 2). In the case of more than two annotators, as in this study, Fleiss’ kappa is used to assess the annotators’ agreement. Landis and Koch (1977, p. 165) propose the following labels for the kappa statistic (see Table 2).
Table 2 – Labels for kappa agreement
Kappa statistic Strength of agreement
<0.00 Poor
0.00‐0.20 Slight
0.21‐0.40 Fair
0.41‐0.60 Moderate
0.61‐0.80 Substantial
0.81‐1.00 Almost perfect
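To make the agreement statistic concrete, the following sketch computes Fleiss' kappa from a table of counts and maps the result onto the labels of Table 2. It follows the standard textbook definition of Fleiss' kappa; the function names and the toy input are our own, and it is not necessarily the tool used in this study.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of annotators who put segment i in category j.

    Assumes every segment was rated by the same number of annotators (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Observed agreement for each segment, averaged over all segments.
    observed = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Agreement expected by chance, from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(share * share for share in shares)
    if expected == 1.0:
        # Every rating falls in one category; treated here as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def agreement_label(kappa):
    """Strength-of-agreement label following Table 2 (Landis and Koch 1977)."""
    if kappa < 0.0:
        return "Poor"
    for upper_bound, label in [(0.20, "Slight"), (0.40, "Fair"),
                               (0.60, "Moderate"), (0.80, "Substantial")]:
        if kappa <= upper_bound:
            return label
    return "Almost perfect"


# Toy example (hypothetical): three annotators, two segments, two categories.
toy_counts = [[3, 0], [0, 3]]
kappa = fleiss_kappa(toy_counts)
print(kappa, agreement_label(kappa))
```

In the toy example all three annotators agree on both segments, so the statistic is 1.0 and the label is “Almost perfect”.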
3.3 Automatic evaluation
3.3.1 Suitability
If automatic evaluation metrics exist, why should one consider spending time and resources on a human evaluation? The answer is quite simple: the objectives of these metrics are different. Automatic evaluation metrics require a reference translation to which the translation output can be compared. This limits their use to cases like the development or training of a system by researchers, where time and cost are very important. In a production environment, a human evaluation could be done, theoretically, when the translator receives the MT output; the translator can then either discard the output or post‐edit it to produce a high‐quality translation.
Eventually a company might decide to use an automatic evaluation metric to grade the MT quality after the fact, but then the question of how to proceed arises. Where will the reference translation come from? The company has three options. First, the reference translation can be the final translation produced by the translator. Second, an independent translation can be provided, but that involves additional time and resources. Third, translations done in the past can be used, but only if they have not been used as training data. The first option could be, at first glance, considered the easiest. However, it would introduce a bias and probably an inflated score. Consider that you, as a translator, are provided with an MT output and the goal is to make the fewest changes possible to achieve a high‐quality translation. This produces a correct translation, but had the translator translated from the source text with no aid whatsoever, the result might have been equally correct yet very different from the post‐edited text. For example, the word order and even the word choices might differ.
If a company chooses to use the post‐edited translations, there is a much higher probability that they resemble the MT output than if an independent translation is provided. Translators have an incentive, in terms of working time, to salvage as much as possible from the MT output. Consequently, using the post‐edited translations would introduce a bias and inflate the score. To avoid this outcome, a company could commission a translation independent from the MT output, in which a translator translates the same source text, and use it as the reference translation. Alternatively, a previously translated text could be used in situations where the MT system has not been trained with the company data. Therefore, in a production environment, using automatic evaluation metrics is possible, but it is not an ideal solution.
On the other hand, in a development or research environment these metrics are very convenient: a researcher can verify that the reference translation is not in the training data, obtain a score quickly, and assess the performance of the system while making constant changes.
3.3.2 BLEU metric
There are numerous automatic metrics to evaluate MT. Every year the Conference on Machine Translation (WMT) has a metrics task whose goal is to evaluate proposed automatic metrics. Some of the metrics currently used are the f‐measure, word error rate (WER), translation edit rate (TER), BLEU, METEOR, NIST and ROUGE.
Koehn (2010) explains briefly how these automatic evaluation metrics work and what they take into consideration. For the purpose of this study, only BLEU will be taken into account.
In the BLEU metric, the quality of an MT is measured by its closeness to reference translations done by humans. BLEU’s closeness metric is based on the WER metric and adapted to consider multiple reference translations, word choice differences and word order (Papineni et al. 2002). BLEU compares the n‐grams present in the MT and in the reference translation(s). An n‐gram is a sequence of n words. With n‐grams of size one, only individual surface forms are compared; with n‐grams of size two, sequences of two surface forms are compared, and so on for the number of words chosen. In general, the longest n‐grams used are 4‐grams. Take the following short sentence as an example: The house was on fire. We have five 1‐grams: the, house, was, on, and fire. We have four 2‐grams: the house, house was, was on, and on fire. We have three 3‐grams: the house was, house was on, and was on fire. We have two 4‐grams: the house was on, and house was on fire.
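The n‐gram counts in this example can be reproduced with a short function. This is a minimal sketch: the function name and the whitespace tokenization are our own simplifications, not part of any BLEU implementation.

```python
def ngrams(tokens, n):
    """Return every sequence of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the house was on fire".split()
for n in range(1, 5):
    grams = ngrams(tokens, n)
    print(f"{len(grams)} {n}-grams:", [" ".join(gram) for gram in grams])
```

A sentence of five words yields five 1‐grams, four 2‐grams, three 3‐grams and two 4‐grams, as listed above.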
BLEU uses a modified n‐gram precision measure. This is based on the precision measure, for which the number of words in the MT that are present in the reference translation are counted (Papineni et al. 2002). The limitation of the precision measure is that it counts the words of the MT that match the reference translation without considering how many times these words are used in the reference translation. See Table 3 for an example.
Table 3 – Precision measure
Identification Sentence Precision
Reference translation The house is on fire. ‐‐
MT candidate 1 The the the the the. 5/5
MT candidate 2 The house caught on fire. 4/5
Modified n‐gram precision caps the number of times an n‐gram can be matched at the maximum number of times it occurs in the reference translation. In this case, we can see in Table 4 how MT candidate 1 only has a unigram precision of 1/5, whereas MT candidate 2 obtains a score of 4/5 for unigrams. This modified precision measure has the advantage of yielding unigram precision scores that are closer to reality and of factoring in word order when measuring n‐grams of size two and above. If multiple reference translations are used, then word choice is also taken into account.
Table 4 – Modified precision measure

| Identification | Sentence | 1-gram precision | 2-gram precision | 3-gram precision |
|---|---|---|---|---|
| Reference translation | The house is on fire. | ‐‐ | ‐‐ | ‐‐ |
| MT candidate 1 | The the the the the. | 1/5 | 0/4 | 0/3 |
| MT candidate 2 | The house caught on fire. | 4/5 | 2/4 | 0/3 |
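The values in Tables 3 and 4 can be checked with a small sketch of plain and modified (clipped) n‐gram precision against a single reference. The function names, the lowercasing and the simple word tokenization are our own assumptions; a full BLEU implementation additionally handles multiple references and combines several n‐gram orders.

```python
import re
from collections import Counter


def ngram_counts(sentence, n):
    """Count the n-grams of a sentence after lowercasing and dropping punctuation."""
    tokens = re.findall(r"\w+", sentence.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n=1, clip=True):
    """Return (matched, total) n-grams of the candidate with respect to one reference.

    With clip=True each n-gram is matched at most as often as it occurs in the
    reference (modified precision); with clip=False every occurrence counts.
    """
    candidate_counts = ngram_counts(candidate, n)
    reference_counts = ngram_counts(reference, n)
    if clip:
        matched = sum(min(count, reference_counts[gram])
                      for gram, count in candidate_counts.items())
    else:
        matched = sum(count for gram, count in candidate_counts.items()
                      if gram in reference_counts)
    return matched, sum(candidate_counts.values())


reference = "The house is on fire."
for candidate in ("The the the the the.", "The house caught on fire."):
    plain = ngram_precision(candidate, reference, n=1, clip=False)
    modified = [ngram_precision(candidate, reference, n=n) for n in (1, 2, 3)]
    print(candidate, "plain:", plain, "modified:", modified)
```

The results are returned as (matched, total) pairs rather than reduced fractions so that they can be read directly against the tables, for example (2, 4) for the 2‐gram precision of MT candidate 2.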
Koehn (2010) and Papineni et al. (2002) remark that there is a correlation between BLEU scores and human evaluation scores. When a system is ranked low by human evaluators it is also ranked low by BLEU and vice versa. This is an interesting finding for developers as it validates that the metric is meaningful. However, BLEU is known for rating NMT systems unfavorably (Volkart, Bouillon & Girletti 2018, p. 148). Nowadays, in research papers about MT, the use of the BLEU metric to evaluate MT systems is almost a given.
Koehn (2010) summarizes the main critiques to automatic evaluation metrics.
First, most of the metrics do not account for the weight of errors¹. When the error weight is not considered, one missing word carries the same penalty in every case, yet a missing article often has a much lesser impact than a missing word like “not”. A missing article may make the sentence less fluent, whereas the absence of “not” would probably change the meaning of the sentence. Another issue is grammatical coherence: high automatic evaluation scores do not necessarily represent a sentence structured in a grammatically correct way. In addition, the interpretation of BLEU scores is problematic and abstract. We know that a score of 0 can be interpreted as the lowest quality score and that a score of 1 means the translation is identical to the reference, but what do scores of .20, .40, .60, etc., mean? At what point does the score represent a good quality translation? There is no straightforward answer, as the scores depend on a number of factors, and therein lies the issue.
¹ NIST, which is a metric based on BLEU, does take error weight into account (Wolk & Danijel 2015, p. 3).
3.4 Conclusion
Human evaluations are expensive because human annotators are generally paid for these evaluations, and some metrics demand more of their time than others. An evaluation metric like error typology will take humans more time to apply than the fluency and adequacy metric or the ranking metric.
Setting up an error typology also involves choosing which errors will be evaluated, which is, again, time consuming for the researchers. However, each metric serves a different purpose. An error typology metric provides an analytical overview of the errors that can serve to improve the MT output. The criteria of fluency and adequacy provide an assessment focused on the current quality of the translation, which can be used, for example, to determine whether the output quality is good enough for post‐editing. The ranking metric provides an insight into which system performs better. Therefore, the objective of the evaluation needs to be determined first, followed by the choice of a metric that provides the correct type of information. For this study we will use the fluency and adequacy metric as it is the most appropriate for its purpose, determining the NMT output quality.
Automatic evaluation metrics provide scores in a quick and cost‐efficient manner. Regardless of their disadvantages, they are still a useful tool for developers, who work directly on the systems and consequently need a way of knowing, on a daily basis, whether their changes are improving or degrading the quality of the output. In a production environment, these metrics are less practical given the necessity of a reference translation. However, given that many researchers state that there is a correlation between human evaluations and automatic evaluations, we will use the BLEU scores to see whether such a correlation is observed in this study.
4. Methodology
In this chapter the methodology used to answer our research question will be explained. Section 4.1 details the objective of this study as well as the variables used. The NMT systems chosen for the evaluations are described in section 4.2. The corpora prepared for the evaluations in conjunction with the variables are explained in section 4.3. The preparation of the human evaluation including information on the annotators is presented in section 4.4. Lastly, the automatic evaluation is described in section 4.5, which includes information about the reference translation and the resources used to obtain the BLEU scores.
4.1 Purpose
The purpose of this study is to evaluate and observe whether there is a variation in the quality of NMT output with regard to segment length. Segments with lengths varying from 1 to 40 words were used. The segments are grouped into four categories: categories 1, 2, 3 and 4 (see Table 5). Although sentence length is at the core of the variables studied, it is important to mention the other variables of this study. To begin with, the languages used were English for the source text and Spanish for the target text. The texts analyzed were all from the general domain, two different NMT systems were used (DeepL® and Microsoft Translator®) and both human and automatic evaluations were performed.
Table 5 – Description of the categories
| Category | Description |
|---|---|
| 1 | Segments with a length of 1‐10 words |
| 2 | Segments with a length of 11‐20 words |
| 3 | Segments with a length of 21‐30 words |
| 4 | Segments with a length of 31‐40 words |
4.2 NMT systems
The systems chosen for the evaluation were DeepL® and Microsoft Translator® (see Table 7 for the dates when the source texts were translated using these systems). Both of these systems have a web browser and a desktop version. The desktop versions were used in this evaluation. In the following subsections the functionality differences between the web browser and desktop versions are explained as well as the differences between the two systems.
4.2.1 Web browser version
Both NMT systems, in their web browser version, had a 5,000‐character limit on what could be translated at once. Curiously, the way the characters were counted was not identical in both systems. The web version of DeepL® was simple to find: typing “DeepL” in a search engine returned the website at the top of the results page. The web version of Microsoft Translator® is found under “Bing Translator”. When those exact words are typed in a search engine it appears at the top of the list, but if “Microsoft Translator” is typed it requires more clicks, as it is necessary to look for it on the Microsoft website.
DeepL® has an editing function where the user can click on a word and a list of other possible translations is displayed. Microsoft Translator® does not have this option. The web version of DeepL® also allowed the user to upload a Microsoft Word® or PowerPoint® document for translation, although the editing functionality is turned off in this case unless the user has a paid subscription (DeepL Pro®). When translating a Word® or PowerPoint® document, the 5,000‐character limit is not imposed, but the website does not state what the limits of this functionality are in the free version of DeepL®. When a text longer than 5,000 characters is entered for translation, Microsoft Translator® gives an error message, whereas DeepL® translates up to the 5,000th character, and hovering with the cursor over the last translated sentence highlights the corresponding source text sentence in a different color. Another difference between the two systems is the number of languages supported: Microsoft Translator® supports more than 60 languages² and DeepL® supports 11 languages.³
4.2.2 Desktop version
Both systems have desktop versions that could be downloaded and used free of charge. The Microsoft Translator® application allowed 10,000 characters to be translated at once, whereas DeepL® allowed only 5,000. Both versions of DeepL® added the following sentence to the translation once the text was pasted: “Traducción realizada con la versión gratuita del traductor www.DeepL.com/Translator” (“Translation made with the free version of the translator www.DeepL.com/Translator”). This sentence was subsequently deleted from our DeepL® target‐text corpus. The desktop application of Microsoft Translator® has more functionalities than that of DeepL®: it translates images, spoken text and text written with a pen simulator, and it allows documents to be uploaded directly for translation. With the exception of uploading documents, these functionalities are not something a professional translator would generally use for work, so they do not bring added value. One issue encountered with Microsoft Translator® is that the application fails if provided with more than 5,000 characters. As a test, it was provided with one Word® document of 6,500 characters and another of 3,500 characters. The larger document was not translated and an error message was received, whereas the smaller document was translated. For reference, the 3,500‐character document contained 600 words. The DeepL® desktop version does not allow documents to be uploaded for translation; however, it does allow the user to highlight text in any application (e.g. a Word® document or a web browser) and press “Ctrl + c” twice, at any moment when using the computer, to have the highlighted text translated in the desktop application.
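Because of these character limits, a long text has to be pasted in several portions. The sketch below shows one way this could be prepared, by grouping whole segments into batches that stay under a given limit. It is a hypothetical helper and not a procedure described in this study; the function name is our own, and Python's `len` is only an approximation, since the two systems did not count characters in the same way.

```python
def batch_segments(segments, limit=5000):
    """Group segments (one per line) into batches of at most `limit` characters.

    Segments are never split; a single segment longer than `limit` is returned
    as its own oversized batch so that it can be handled manually.
    """
    batches = []
    current = []
    current_size = 0
    for segment in segments:
        # One extra character for the line break that joins segments in a batch.
        added = len(segment) + (1 if current else 0)
        if current and current_size + added > limit:
            batches.append("\n".join(current))
            current = [segment]
            current_size = len(segment)
        else:
            current.append(segment)
            current_size += added
    if current:
        batches.append("\n".join(current))
    return batches


# Toy usage with a tiny limit so that the batching is visible.
print(batch_segments(["aaa", "bbb", "ccc"], limit=7))
```

Keeping each segment whole matters here because the evaluation is carried out segment by segment, so a batch boundary should never fall inside a sentence.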
² https://www.microsoft.com/en-us/translator/languages/ ‐ Accessed on April 2nd, 2020.
³ https://www.deepl.com/en/press.html ‐ Accessed on April 2nd, 2020.
4.3 Corpora
The corpora consist of a source text corpus, three target text corpora (one for each NMT system and one for the reference translation) and further subcorpora of the source text and reference translation classified by the sentence length. The corpora classified by sentence length were created for the automatic evaluation system (see Table 6).
Table 6 – Corpora
| Corpus/corpora name | Content | Size |
|---|---|---|
| Source text | This corpus contains all the segments used in the evaluations in the source language (English), as well as additional segments that were ultimately not used in the evaluations. | 9,946 words |
| DeepL | This corpus contains the Spanish translation, obtained with DeepL®, of the source text corpus. | 11,074 words |
| Microsoft Translator | This corpus contains the Spanish translation, obtained with Microsoft Translator®, of the source text corpus. | 10,992 words |
| Source text – category 1, 2, 3 and 4 (corpora) | Corpora consisting of four individual files, each with the source text in the category specified in its title, i.e. category 1 only contains the segments with a length of 1‐10 words. | 50 segments per corpus |
| Source text – regrouped category 1&2 and 3&4 (corpora) | Corpora consisting of two individual files, each with the source text in the category specified in its title, i.e. category 1&2 only contains segments with a length of 1‐20 words. | 100 segments per corpus |
| Reference text – category 1, 2, 3 and 4 (corpora) | Corpora consisting of four individual files, each with the reference translation only for the segments of the category specified in its title, i.e. category 1 only contains the translations for segments of 1‐10 words. | 50 segments per corpus |
| Reference text – regrouped category 1&2 and 3&4 (corpora) | Corpora consisting of two individual files, each with the reference translation only for the segments of the category specified in its title, i.e. category 1&2 only contains the reference translation segments of 1‐20 words. | 100 segments per corpus |
| DeepL – category 1, 2, 3 and 4 (corpora) | Corpora consisting of four individual files, each with the translation only for the segments of the category specified in its title, i.e. category 1 only contains the translations for segments of 1‐10 words. | 50 segments per corpus |
| DeepL – regrouped category 1&2 and 3&4 (corpora) | Corpora consisting of two individual files, each with the DeepL® translation only for the segments of the category specified in its title, i.e. category 1&2 only contains the translations for segments of 1‐20 words. | 100 segments per corpus |
| Microsoft Translator – category 1, 2, 3 and 4 (corpora) | Corpora consisting of four individual files, each with the translation only for the segments of the category specified in its title, i.e. category 1 only contains the translations for segments of 1‐10 words. | 50 segments per corpus |
| Microsoft Translator – regrouped category 1&2 and 3&4 (corpora) | Corpora consisting of two individual files, each with the Microsoft Translator® translation only for the segments of the category specified in its title, i.e. category 1&2 only contains the translations for segments of 1‐20 words. | 100 segments per corpus |

The data gathered for the source text corpus came from articles in reputable and varied English‐language newspapers and magazines. The newspapers used were The New York Times, The Guardian and The Washington Post, and the magazine was The Economist. The topics were varied and chosen at random. The CEFR language level of the articles was primarily C, and some articles had a B2 level. The language level was obtained with the online tool Duolingo CEFR checker®. All the articles chosen had been published at most three days before they were collected, and they were translated with both DeepL® and Microsoft Translator® on the same day they were collected.
Translating them at most three days after publication was a way to ensure that they had not already been used in the training corpora of the NMT systems. Table 7 shows all the information pertaining to the source texts. After the data was gathered, some cleaning of the corpus was needed before proceeding. Each sentence was split after its full stop so that the next sentence started on a new line. Since the source data was already properly separated, this step did not need to be performed for the target data, as both DeepL® and Microsoft Translator® respected this separation and provided the target text in an identical form. Additionally, all source segments retained their punctuation marks (e.g. quotation marks, commas, full stops).
Table 7 – Source text information
| Article name | Publisher | Publication date | Translation date | CEFR level |
|---|---|---|---|---|
| Vlad the indefinite | The Economist | 12/3/2020 | 15/3/2020 | C |
| Trump says he’s ‘strongly considering’ pardoning Michael Flynn | The Washington Post | 15/3/2020 | 15/3/2020 | C |
| The impact of covid‐19 on airlines | The Economist | 15/3/2020 | 15/3/2020 | C |
| Why Germany’s pay gap is so large | The Economist | 15/3/2020 | 15/3/2020 | B2 |
| U.S. Offered ‘Large Sum’ to German Company for Access to Coronavirus Vaccine Research, German Officials Say | The New York Times | 15/3/2020 | 15/3/2020 | B2 |
| Spain’s King Cuts Financial Ties With Father Amid Scandal | The New York Times | 16/3/2020 | 16/3/2020 | C |
| 'Dead Sea Scrolls fragments' at Museum of the Bible are all fakes, study says | The Guardian | 16/3/2020 | 16/3/2020 | C |
| #Notmymermaid: the Disney row is ridiculous – who knows what mermaids look like? | The Guardian | 16/3/2020 | 16/3/2020 | C |
| From Bourbon Street to Miami Beach, America’s party people ignored pleas for social distancing | The Washington Post | 16/3/2020 | 16/3/2020 | B2 |
| Why the right's new strongmen are winning everywhere | The Guardian | 16/3/2020 | 16/3/2020 | C |
| Defrocked French priest jailed for abusing scouts over 20‐year period | The Guardian | 16/3/2020 | 16/3/2020 | C |
| The 1,000‐Bed Comfort Was Supposed to Aid New York. It Has 20 Patients. | The New York Times | 7/4/2020 | 7/4/2020 | B2 |

The TAUS guideline for NMT human evaluations suggests that a minimum of 200 segments should be evaluated. Using Microsoft Excel®, the word count was obtained for each segment in the source text corpus. The segments were then classified into four categories (see Table 5). Fifty segments from each category make up the 200 segments suggested by TAUS. If a segment did not belong to one of the four categories, it was automatically discarded. The sentence length was limited to 40 words in order to establish categories of equal width (a range of 10 words per category).
Additionally, only the first 50 segments of each category were chosen for the evaluation.
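The classification and selection steps described above can be sketched as follows. In the study the word counts were obtained with Microsoft Excel®; the whitespace‐based word count and the function names below are assumptions made for illustration and may count some segments slightly differently.

```python
def length_category(n_words):
    """Map a word count to category 1-4 of Table 5, or None if it is out of range."""
    if 1 <= n_words <= 40:
        return (n_words - 1) // 10 + 1
    return None


def select_segments(segments, per_category=50):
    """Keep the first `per_category` segments of each category, in corpus order."""
    selected = {1: [], 2: [], 3: [], 4: []}
    for segment in segments:
        category = length_category(len(segment.split()))
        # Segments longer than 40 words (or empty lines) are discarded.
        if category is not None and len(selected[category]) < per_category:
            selected[category].append(segment)
    return selected


# Toy usage (hypothetical segments) with two segments kept per category.
sample = ["Short one.", "Another short segment here.", "A third short segment."]
print(select_segments(sample, per_category=2))
```

With `per_category=50` and four categories, this yields the 200 segments suggested by TAUS, provided the corpus contains enough segments of each length.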
Table 8 presents an example of segments in each category.