An Albanian Text-to-Speech System for the BabelDr Medical Speech Translator


Book Chapter

Reference


TSOURAKIS, Nikolaos, et al.

Abstract

In this paper we present work on creating and evaluating a Text-to-Speech system for the Albanian language to be used in the BabelDr medical speech translation system. Its quality was assessed by twelve native speakers who provided feedback on 60 prompts generated by the synthesizer and on 60 real human recordings across three dimensions, namely comprehensibility, naturalness and likeability. The results suggest that the newly created voice can be incorporated in the content creation pipeline of the BabelDr platform.

TSOURAKIS, Nikolaos, et al. An Albanian Text-to-Speech System for the BabelDr Medical Speech Translator. In: Pape-Haugaard, L., Lovis, C., Cort Madsen, I., Weber, P., Hostrup Nielsen, P., Scott, P. Digital Personalized Health and Medicine. IOS Press, 2020. p. 527-531

PMID: 32570439

DOI: 10.3233/SHTI200216

Available at:

http://archive-ouverte.unige.ch/unige:138416


Nikos TSOURAKIS^a,1, Rovena TROQE^b, Johanna GERLACH^a, Pierrette BOUILLON^a and Hervé SPECHBACH^c

^a University of Geneva, Switzerland

^b University of the Free State, Bloemfontein, South Africa

^c Geneva University Hospitals, Switzerland


Keywords. Text-to-Speech, Albanian, Tacotron 2, BabelDr

1. Introduction

The ever-increasing movement of people worldwide poses new challenges for healthcare institutions across Europe, in particular problems related to the language and cultural barriers experienced by healthcare practitioners and their foreign patients [1], [2]. Most of the time, experienced interpreters, relatives or friends try to bridge this communication gap. However, this solution is often less than satisfactory: trained medical interpreters are both scarce and expensive, family members might not communicate the exact meaning of the physician’s question or the patient’s response, and there are also privacy concerns. It is generally agreed that translation quality is of prime importance in this setting [3], while tools like Google Translate raise concerns about reliability and data privacy.

Since 2017 the Faculty of Translation and Interpreting of the University of Geneva and the Geneva University Hospitals (HUG) have joined forces to build a medical speech translator under the BabelDr project², funded by the Fondation Privée des HUG. The system is a web-based tool specifically designed to assist in the triage of non-French-speaking patients visiting HUG’s A&E department. It allows a medical professional to conduct a preliminary medical examination dialogue to determine the nature of the patient’s problem and the appropriate action to take. The system incorporates state-of-the-art technologies such as automatic speech recognition (ASR) and text-to-speech (TTS) and supports translation for different language pairs [4]. It is also aligned with mainstream research findings [5] in that it relies on preset health-related questions and declarative sentences (11,134 in total) that can be selected using around 1M variations of similar spoken phrases.

1 Nikos Tsourakis, TIM/FTI, University of Geneva, 40, boulevard du Pont-d'Arve 1211 GENEVE 4 Switzerland; E-mail: [email protected].

2 https://babeldr.unige.ch

3 https://forms.gle/HZoJszhmKgbmeJqYA

4 https://github.com/mozilla/TTS

5 https://github.com/erogol/WaveRNN

6 https://github.com/NVIDIA/waveglow

This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).

doi:10.3233/SHTI200216

In the context of this work we created and evaluated a synthetic female voice for the Albanian language based on Tacotron 2, a neural network architecture for speech synthesis directly from text [6]. The voice can be used to announce the physician’s questions to an Albanian-speaking patient. In this paper we present the data gathering process and the steps taken to train the TTS. In an evaluation with twelve native speakers we assessed the comprehensibility, naturalness and likeability of the newly created voice. Based on our findings, we can reasonably argue that the synthesized voice can be incorporated in the content creation pipeline of the BabelDr platform.

Section 2 presents the methods used for gathering audio data, training the voice and evaluating its quality. Section 3 reports on the results of the training process and the feedback from the native speakers. Section 4 discusses the findings and different implications in more depth. The final section concludes.

2. Method

2.1. TTS training

As in every training task that involves a neural network architecture, the amount of data is a crucial factor. Using an in-house web-based recording tool and an intermediate-fidelity microphone, we asked a female subject to make 10452 recordings from home at a sampling rate of 48 kHz. The text included questions from 13 diagnostic domains, with phrases like: “Can you show me with the finger where the pain is?”, “Are you allergic to antibiotics?”, “How many cigarettes do you smoke per day?”, etc. The resulting corpus contained 2014 unique words, and the average sentence length was 14 words (sd=6). The corpus was split into a training set of 9500 sentences and an evaluation set of 952 sentences. Approximately 9.5 hours of speech were recorded.
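The corpus figures reported above (unique words, mean sentence length, and the 9500/952 split) could be computed along the following lines. This is a minimal sketch only: the function names, the whitespace tokenization, the punctuation stripping and the seeded random shuffle are assumptions made for illustration, since the paper does not state how the statistics or the split were obtained.

```python
import random
import statistics


def corpus_stats(sentences):
    """Return (unique word count, mean sentence length, sd of sentence length).

    Assumption: words are whitespace-separated tokens, lower-cased and
    stripped of surrounding punctuation before counting unique forms.
    """
    lengths = [len(s.split()) for s in sentences]
    vocab = {w.lower().strip('.,?!;:"') for s in sentences for w in s.split()}
    vocab.discard("")  # tokens that were punctuation only
    return len(vocab), statistics.mean(lengths), statistics.stdev(lengths)


def split_corpus(sentences, n_train, seed=0):
    """Shuffle a copy of the corpus and cut it into (training, evaluation)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```

With the full corpus, `split_corpus(sentences, 9500)` would yield the 9500/952 partition described in the text, assuming the split was random.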

2.2. Human evaluation

Contrary to other evaluation tasks, where software systems can be assessed in terms of various objective measurements, evaluating a synthesized voice is highly subjective. For this reason we recruited twelve native Albanian speakers and asked them to provide their feedback across three dimensions typically encountered in TTS evaluations [7].

Specifically, they had to listen to a series of audio prompts and express their opinion on a five-point Likert scale (subjective mean opinion score, MOS) concerning how comprehensible the prompt is, how natural the voice sounds and how much they like the voice. The whole evaluation was set up as a Google form³ so that the participants could provide their feedback easily and the investigators could access the responses immediately. The evaluation protocol was based on two hypotheses:

• The TTS can be used in our content creation pipeline.

• The TTS works equally well for short and long sentences.

To test the two hypotheses we generated 60 audio prompts with the TTS and randomly selected 60 volume-normalized recordings from the initial training corpus. For each of these two subsets we chose sentences of different word lengths according to the following scheme: 20 sentences with 1 to 3 words, 20 sentences with 4 to 9 words and 20 sentences with more than 9 words. The human recordings and the TTS prompts were merged and shuffled into a single set, so that participants were not aware of the source of each one.
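The selection scheme described above (20 sentences per length bucket, then a blind shuffle of TTS prompts and recordings) can be sketched as follows. The bucket names, function names and the fixed random seed are hypothetical and only illustrate the procedure; the actual tooling used by the authors is not described.

```python
import random

# Word-length buckets from the text: 1-3, 4-9 and more than 9 words.
BUCKETS = {"short": (1, 3), "medium": (4, 9), "long": (10, None)}


def bucket_of(sentence):
    """Return the bucket name for a sentence, or None if it has no words."""
    n = len(sentence.split())
    for name, (low, high) in BUCKETS.items():
        if n >= low and (high is None or n <= high):
            return name
    return None


def sample_by_length(sentences, per_bucket=20, seed=0):
    """Randomly pick `per_bucket` sentences from each length bucket."""
    rng = random.Random(seed)
    groups = {name: [] for name in BUCKETS}
    for s in sentences:
        name = bucket_of(s)
        if name is not None:
            groups[name].append(s)
    picked = []
    for name, items in groups.items():
        if len(items) < per_bucket:
            raise ValueError(f"not enough sentences in bucket {name!r}")
        picked.extend(rng.sample(items, per_bucket))
    return picked


def build_blind_set(tts_prompts, recordings, seed=0):
    """Merge both sources and shuffle; the source label is kept for analysis
    only and would not be shown to participants."""
    items = [(p, "tts") for p in tts_prompts] + [(r, "human") for r in recordings]
    random.Random(seed).shuffle(items)
    return items
```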

3. Results

3.1. TTS training

The TTS engine was trained with a Tacotron 2 implementation⁴ using the following hyper-parameters: learning rate=0.0001, batch size=32 and epochs=1000. The model had to learn ~28M parameters, and training took place on a Linux server equipped with an NVIDIA GeForce GTX Titan X GPU card (12GB). Figure 1 shows the average total loss (the average of the linear loss, mel loss and stop loss) with respect to the training epochs; it reaches 0.01 at the last epoch.
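As a small illustration of the quantity described above, the sketch below averages the three loss components and then averages that value over the training steps of one epoch. The dictionary keys and function names are hypothetical and do not correspond to the configuration format of the Tacotron 2 implementation used; averaging over steps within an epoch is also an assumption.

```python
# Hyper-parameters as reported in the text (key names are illustrative only).
HPARAMS = {"learning_rate": 1e-4, "batch_size": 32, "epochs": 1000}


def total_loss(linear_loss, mel_loss, stop_loss):
    """Average of the three loss components for a single training step."""
    return (linear_loss + mel_loss + stop_loss) / 3.0


def epoch_average(step_losses):
    """Mean total loss over all steps of one epoch.

    `step_losses` is a list of (linear, mel, stop) tuples, one per step.
    """
    totals = [total_loss(*step) for step in step_losses]
    return sum(totals) / len(totals)
```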

Figure 1. Average total loss per number of epochs

3.2. Human evaluation

Table 1. Mean opinion score (MOS) for the three evaluation criteria: Comprehensibility (C), Naturalness (N) & Likeability (L) and for the two types of prompts

                  I understand the question (C)   The voice sounds natural (N)   I like the voice (L)
TTS MOS           3.9 (sd=0.5)                    3.0 (sd=1.0)                   3.1 (sd=1.1)
Recordings MOS    4.5 (sd=0.4)                    4.2 (sd=0.9)                   4.1 (sd=0.9)

Table 1 summarizes the mean opinion score of all participants for the three evaluation criteria and the two types of prompts. As expected, human recordings outperform the TTS in all dimensions. Although the TTS sounds less natural and is less likeable, it is still quite comprehensible: 3.9/5 vs. 4.5/5 for the recordings (a statistically significant difference, p<0.002).
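For readers who want to reproduce a table of this kind, the sketch below computes a mean opinion score and sample standard deviation per prompt type and criterion from raw Likert ratings. The input layout and the function name are assumptions; the paper does not specify over which unit (ratings, prompts or participants) the reported standard deviations were taken, nor which significance test was applied.

```python
import statistics
from collections import defaultdict


def mos_table(ratings):
    """Group Likert ratings and return {(prompt_type, criterion): (mean, sd)}.

    `ratings` is an iterable of (prompt_type, criterion, score) tuples,
    with score on the five-point scale 1..5.
    """
    groups = defaultdict(list)
    for prompt_type, criterion, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        groups[(prompt_type, criterion)].append(score)
    table = {}
    for key, scores in groups.items():
        # The sample sd needs at least two ratings; report 0.0 otherwise.
        sd = statistics.stdev(scores) if len(scores) > 1 else 0.0
        table[key] = (statistics.mean(scores), sd)
    return table
```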

We also examined the variation in comprehensibility, naturalness and likeability of the TTS as a function of sentence length. As can be observed in Table 2, our hypothesis is supported: the TTS works equally well for short and long sentences (no statistically significant differences).

Table 2. Mean opinion score (MOS) of Comprehensibility (C), Naturalness (N) & Likeability (L) of the TTS prompts based on the sentence word length

                                    MOS (C)        MOS (N)        MOS (L)
Sentences with 1-3 words            3.8 (sd=0.6)   3.0 (sd=1.0)   3.1 (sd=1.1)
Sentences with 4-9 words            4.0 (sd=0.4)   3.1 (sd=1.0)   3.1 (sd=1.1)
Sentences with more than 9 words    3.9 (sd=0.6)   3.0 (sd=1.0)   3.0 (sd=1.1)

3.3. Qualitative analysis

To analyze qualitatively the source of errors in the synthesized prompts, we asked two experts to identify possible problems. Specifically, each expert had to assess each prompt for possible intonation problems (e.g. not being able to distinguish whether it is a question or a request) and for muffled or unnatural sounds.

These criteria were selected on the basis of informal feedback from the participants in the evaluation task. For each of the two criteria we calculated Cohen’s Kappa between the two experts, which expresses agreement corrected for chance (Table 3). Kappa ranges between -1 and 1; a value of 0.19 is classed as slight agreement, whereas a value of -0.27 is classed as poor [8]. The last row shows the absolute number of sentences on which the two experts agreed.

Table 3. Cohen’s Kappa for the two kinds of possible problems (95% confidence intervals in parentheses) and absolute sentence agreement

                 Intonation problems    Muffled or unnatural sounds
Cohen’s Kappa    0.19 (0.1, 0.34)       -0.27 (-0.51, -0.04)
Agreement        30 out of 60           25 out of 60
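The chance-corrected agreement reported in Table 3 follows the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's label frequencies. The sketch below is a plain-Python illustration with hypothetical function names; the interpretation bands follow the Landis and Koch scale cited as [8]. It does not compute the confidence intervals shown in the table.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1.0 - expected)


def landis_koch_label(kappa):
    """Verbal interpretation of kappa following Landis and Koch."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

Applied to the values in the text, `landis_koch_label(0.19)` gives "slight" and `landis_koch_label(-0.27)` gives "poor".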

We also calculated the correlation between the comprehensibility MOS (a continuous variable) and the responses from each expert for each of the two criteria (categorical variables, “0” or “1”). The appropriate method in this case is the point-biserial correlation, which did not reveal any significant correlation. Both the correlation and the agreement results suggest that checking for intonation problems and unnatural sounds is not an appropriate method for assessing prompt comprehensibility, which should be determined on a case-by-case basis with other criteria.
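The point-biserial coefficient mentioned above relates a binary variable to a continuous one and is numerically equal to the Pearson correlation between them. A minimal sketch, assuming the expert judgements are coded as 0/1 flags and the comprehensibility MOS is given per prompt (the function name is hypothetical, and no significance test is included):

```python
import math
import statistics


def point_biserial(flags, scores):
    """Point-biserial correlation between 0/1 flags and continuous scores.

    r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), where M1 and M0 are the
    mean scores of the flagged and unflagged groups and s_n is the
    population standard deviation of all scores.
    """
    if len(flags) != len(scores):
        raise ValueError("flags and scores must have the same length")
    ones = [s for f, s in zip(flags, scores) if f == 1]
    zeros = [s for f, s in zip(flags, scores) if f == 0]
    if len(ones) + len(zeros) != len(flags):
        raise ValueError("flags must be 0 or 1")
    if not ones or not zeros:
        raise ValueError("both groups must be non-empty")
    sd = statistics.pstdev(scores)
    if sd == 0:
        raise ValueError("scores have zero variance")
    n = len(scores)
    diff = statistics.mean(ones) - statistics.mean(zeros)
    return diff / sd * math.sqrt(len(ones) * len(zeros) / (n * n))
```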

4. Discussion

In the era of neural networks, creating a new voice from scratch demands only a few person-months of work, something that a decade ago would have required extensive expertise and a much longer commitment. After the positive experience with the Albanian TTS, we are able to create new voices according to the needs of the project. Resorting to a commercial TTS is not always an option due to the lack of support for certain languages; a typical example is Tigrinya, frequently encountered at HUG, for which no synthetic voice yet exists. Even for languages where a commercial option is available, it normally targets only the most common dialect; Arabic is a typical example.

On the other hand, one might argue in favor of recording every new sentence from scratch. This option is well aligned with the BabelDr project, which is strongly biased towards quality assurance: every translation is checked by experts to avoid possible errors and ambiguities. The difference, however, is that multiple people can perform the translation task, whereas only a single person can do the audio recordings. If this person is temporarily unavailable, the whole deployment pipeline stalls. Even as a backup strategy, having our own TTS presents many advantages. Nonetheless, there is always a human in the loop in our content creation pipeline who checks both translations and TTS prompts and decides whether they are acceptable.

A factor that was not properly addressed in this study concerns how we measure comprehensibility. In essence, by asking participants to quantify their understanding, we risk obtaining a negative answer for a perfectly spoken prompt simply because the question was ambiguous or unclear in the first place (e.g. a medical term was used). In a future study, these two levels of comprehensibility should be examined separately by carefully creating and testing a corpus that alleviates this deficiency. Conversely, even a perfect comprehensibility score does not necessarily mean that the prompt sounds perfect. The participants reported that they sometimes heard sentences that did not sound completely right but were still comprehensible.

5. Conclusion

Language barriers often cause inconvenience, but when medical issues are involved they cease to be a mere inconvenience and can become life-threatening. In this work we added a new building block to our platform for creating synthesized voices. After evaluating its feasibility for the Albanian language with twelve native speakers, we can include it in the BabelDr content creation pipeline. We also found no differences in quality between short and long sentences.

In the future we plan to combine Tacotron 2 with a neural vocoder such as WaveNet⁵ or WaveGlow⁶, which promise to improve the output quality. We also intend to gather more data for Albanian, as 9.5 hours of training material places us at the low end in terms of data (e.g. in [6] around 25 hours of speech were used). We will also create a new voice for the Tigrinya language.

References

[1] Flores, G., Laws, M.B., Mayo, S.J., Zuckerman, B., Abreu, M., Medina, L. and Hardt, E.J. Errors in medical interpretation and their potential clinical consequences in pediatric encounters. Pediatrics, 111(1), 2003.

[2] Wasserman, M., Renfrew, M.R., Green, A.R., Lopez, L., Tan-McGrory, A., Brach, C. and Betancourt, J.R. Identifying and preventing medical errors in patients with limited English proficiency: key findings and tools for the field. Journal for Healthcare Quality, 36(3), 2014, 5-16.

[3] Tsourakis, N. and Estrella, P. Evaluating the quality of mobile medical speech translators based on ISO/IEC 9126 series: definition, weighted quality model and metrics. IJRQEH, 2(2), 2013, 1–20.

[4] Spechbach, H., Gerlach, J., Karker, S.M., Tsourakis, N., Combescure, C. and Bouillon, P. A Speech-Enabled Fixed-Phrase Translator for Emergency Settings: Crossover Study. JMIR Medical Informatics, 7(2), 2019.

[5] Panayiotou, A., Gardner, A., Williams, S., Zucchi, E., Mascitti-Meuter, M., Goh, A.M., You, E., et al. Language Translation Apps in Health Care Settings: Expert Opinion. JMIR Mhealth Uhealth, 7(4), 2019.

[6] Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., et al. Natural TTS synthesis by conditioning Wavenet on mel spectrogram predictions. IEEE ICASSP, 2018, 4779-4783.

[7] Dybkjaer, L., Hemsen, H. and Minker, W. Evaluation of Text and Speech Systems (1st ed.). Springer, 2007.

[8] Landis, J.R. and Koch, G.G. The measurement of observer agreement for categorical data. Biometrics, 33, 1977, 159–174.