
3. The experiment

3.3. Evaluation methods

Fluency and Adequacy

As Machine Translation systems constantly aim to replicate or emulate the results produced by human translators, it is almost necessary to incorporate human judgments of translation quality into system development and evaluation. However, such judgments are often time-consuming and expensive to collect, which has pushed specialists to develop automatic evaluation metrics to replace human judgments when possible or necessary.

Interpreter Mode is named and marketed to sound like something that could replace human interpreters. This is why we decided to centre our evaluation on user opinion and feedback by resorting to manual evaluation of the target texts’ fluency and adequacy, as described in Philipp Koehn’s book Statistical Machine Translation (Koehn, 2010b).


Manual evaluation is the process of having humans evaluate Machine Translation output. Human evaluation is traditionally split into two categories: adequacy and fluency. Adequacy judgments ask annotators to rate the amount of meaning expressed in a translation. For this, we will use the scale from Koehn (2010b) (Figure 16):

Figure 16: Adequacy Evaluation Scale31

Fluency judgments, on the other hand, ask annotators to rate the well-formedness of the translation output in the target language, regardless of sentence meaning. For this, we will follow the scale from Koehn (2010b) as well (Figure 17):

Figure 17: Fluency Evaluation Scale32

31 (Koehn, 2010b, p. 219)

32 (Koehn, 2010b, p. 219)
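Since Figures 16 and 17 are reproduced as images, the two five-point scales can also be restated compactly, as in the sketch below. The wording is approximated from the LDC-style definitions that Koehn reproduces, not quoted verbatim from the figures:

    # Approximate restatement of the five-point scales (after Koehn,
    # 2010b, p. 219); wording paraphrased, not copied from the figures.
    ADEQUACY_SCALE = {
        5: "all meaning",
        4: "most meaning",
        3: "much meaning",
        2: "little meaning",
        1: "none",
    }
    FLUENCY_SCALE = {
        5: "flawless language",
        4: "good language",
        3: "non-native language",
        2: "disfluent language",
        1: "incomprehensible",
    }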

During the data collection step described in section 3.2.1, we will rely on the phone screen recording as well as the recorded video (if necessary) to manually transcribe the interview into Excel files that contain the fluency and adequacy evaluation tables (see Annex 4). This Excel file was obtained during the Traduction Automatique 2 - Pré-édition, post-édition et évaluation33 course taught by Professor P. Bouillon. The Excel files will contain the transcriptions of the translations of 10 questions (EN → AR) and of 10 answers (AR → EN) per participant. This amounts to a total of 20 sentences per participant.
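To make the layout concrete, the sketch below shows how one such evaluation table could be assembled programmatically. The column names and file name are hypothetical illustrations; the actual template is the one obtained from the course mentioned above.

    # Hypothetical layout of one fluency evaluation file; the real
    # template from the Traduction Automatique 2 course may differ.
    import pandas as pd

    transcribed = [
        "transcribed translation of question 1 (EN -> AR)",
        "transcribed translation of answer 1 (AR -> EN)",
        # ... up to the 20 sentences of one interview
    ]

    table = pd.DataFrame({
        "sentence": transcribed,
        "fluency (1-5)": [None] * len(transcribed),  # left blank for the participant
    })
    table.to_excel("interview1_fluency.xlsx", index=False)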

Each participant will evaluate the translations obtained from their own interview, as well as the translations obtained from the two other interviews.

Once the transcriptions are ready, they will be sent to the participants so that they can evaluate fluency and adequacy. To that end, each participant receives six Excel files (two per interview), which they fill out and return to us. The results obtained will be discussed in section 4.4.

System usability

For system usability, we chose John Brooke’s “quick and dirty” usability scale (Jordan et al., 1996, p. 189). Brooke argues that “usability does not exist in any absolute sense” because it always depends on the context, which makes coming up with a universal way to measure usability a daunting task. As a workaround, he created what he called a “quick and dirty” System Usability Scale (SUS) (Jordan et al., 1996, Chapter 21). The SUS has proved to be an effective way to get relevant user feedback on any system, regardless of the context. It will be given to participants after each interview, once the participant has had a chance to use the evaluated system but before any discussion or debriefing takes place.

33 FTI, last accessed January 11, 2021 - https://pgc.unige.ch/cursus/programme-des-cours/web/teachings/details/2020-BTM0910

The SUS is a Likert scale. It is composed of ten items to which participants express their degree of agreement or disagreement, from “strongly agree” to “strongly disagree”. The SUS gives “a global view of subjective assessments of usability” (Jordan et al., 1996, Chapter 21).

The SUS should be used in a certain manner to gather optimal data. Brooke explains that

“respondents should be asked to record their immediate response to each item, rather than thinking about items for a long time. All items should be checked. If a respondent feels that they cannot respond to a particular item, they should mark the centre point of the scale” (Jordan et al., 1996, Chapter 21).

Brooke notes that the SUS ultimately yields a single number that represents the overall usability of the system being evaluated, and that the scores obtained for each item are not meaningful on their own (Jordan et al., 1996, Chapter 21). An example of a scored SUS can be seen in Figure 18:


Figure 18: example of a scored system usability scale (SUS) (Jordan et al., 1996, fig. 21.2)

Looking at Figure 18, the final score is calculated by summing the score contributions from each item. “Each item's score contribution will range from 0 to 4. For items 1, 3, 5, 7, and 9 the score contribution is the scale position minus 1. For items 2, 4, 6, 8 and 10, the contribution is 5 minus the scale position. Multiply the sum of the scores by 2.5 to obtain the overall value of SU. SUS scores have a range of 0 to 100.” (Jordan et al., 1996, Chapter 21).
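Brooke’s scoring rule is mechanical, so it can be restated as a short function. The following is a minimal sketch; the function name is ours, not Brooke’s:

    def sus_score(responses):
        """Compute the overall SUS score from the ten Likert responses (1-5).

        Odd-numbered items (1, 3, 5, 7, 9) contribute (position - 1);
        even-numbered items (2, 4, 6, 8, 10) contribute (5 - position).
        The sum of the ten contributions is multiplied by 2.5, giving 0-100.
        """
        assert len(responses) == 10
        total = 0
        for item, position in enumerate(responses, start=1):
            total += (position - 1) if item % 2 == 1 else (5 - position)
        return total * 2.5

For instance, a respondent who marks the centre point (3) on all ten items, as Brooke recommends for items they cannot answer, obtains sus_score([3] * 10) = 50.0, the midpoint of the 0-100 range.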

There is not much official guidance on how SUS scores should be interpreted, but researchers who use the scale extensively have compiled the data they obtained over the years into some guidelines.

UIUXTrend34 provides the following Figure 19, which can be used to interpret SUS scores:

Figure 19: meaning of different SUS scores35

35 UIUXTrend, last accessed December 24, 2020 - https://uiuxtrend.com/measuring-system-usability-scale-sus/

According to Figure 19, a system “passes” when it obtains a score of 68 or more. This threshold is corroborated by researchers such as Jeff Sauro (Sauro, 2011), who compiled his findings from years of using the SUS method and determined that the average SUS score is 68.
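In code terms, the pass criterion from Figure 19 reduces to a simple threshold check against Sauro’s average, continuing the earlier sketch (again, the function name is ours):

    def sus_passes(score):
        # 68 is the average SUS score reported by Sauro (2011), which
        # Figure 19 treats as the passing threshold.
        return score >= 68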

After each interview, the participant is sent by email a Google Form (Annex 5) that contains the SUS, which they fill out before leaving the location of the interview. After all three participants have returned their SUS Google Forms, we will extract the relevant data, which will be discussed in section 4.3, before drawing the appropriate conclusions.

Other metrics

In addition to the SUS, fluency, and adequacy, we will resort to manually calculating several other metrics. We will count the number of times the tool managed to achieve a successful interaction during each interview. We consider a successful interaction to be either a correct translation or a non-idiomatic translation that can be understood from context.

We will also count the number of times we had to repeat a question during each interview.

As mentioned in section 3.4.2, where we discussed our preliminary test, the tool struggled to keep track of the participant’s gender. Since IM is supposed to emulate a human interpreter, keeping track of a person’s gender during a conversation is very important, which is why we will also count the number of times IM uses the correct gender when addressing each participant.
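The sketch below illustrates how these three per-interview counts can be tallied. The field names and data layout are hypothetical and serve only to show the bookkeeping, not any actual IM output format:

    # One hypothetical annotation per translated question or answer.
    exchanges = [
        {"successful": True,  "repeated": False, "gender_correct": True},
        {"successful": False, "repeated": True,  "gender_correct": False},
        # ... one entry per exchange in the interview
    ]

    n_success = sum(e["successful"] for e in exchanges)      # correct or understandable from context
    n_repeats = sum(e["repeated"] for e in exchanges)        # questions we had to repeat
    n_gender  = sum(e["gender_correct"] for e in exchanges)  # IM used the correct gender

    print(n_success, n_repeats, n_gender)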

In Group 2 (see Table 1), Q1 and Q2 are both about the Coronavirus pandemic. However, while Q1 mentions the term explicitly, Q2 refers to it with the pronoun “it”. By using the pronoun in the second question, we wanted to verify whether IM can keep track of the context as a human interpreter would. Finally, although the term “Coronavirus Pandemic” has an official translation in Standard Arabic, many major news outlets such as the France 2436 Arabic website use a loanword from the English term. This is why we decided to look at whether IM translates the term “Coronavirus Pandemic” from English to Arabic correctly.

36 France 24, last accessed January 11, 2021 - https://f24.my/7FUs.E

The results obtained will be used when writing the section “In-depth look into each interview” and will be discussed further in section 4.5.