Improving statistical machine translation of informal language: a rule-based pre-editing approach for French Forums


Thesis

Reference

Improving statistical machine translation of informal language: a rule-based pre-editing approach for French Forums

GERLACH, Johanna

Abstract

The language barrier limits knowledge sharing between communities on internet forums. Machine translation (MT), which could address this limitation, encounters numerous difficulties with community texts. This thesis investigates to what extent an automatic or interactive pre-editing process using rule-based technology can reduce these difficulties and thereby improve statistical MT of French forum texts. We study different transformations and then evaluate their impact on MT. Beyond a significant improvement in MT quality, we observe a reduction in the technical and temporal effort of post-editing. Through a feasibility study with real forum users, we observe that the pre-editing task is feasible in this context. Finally, we confirm the portability of the pre-editing process developed in this work to other MT systems and to forums from different domains.

GERLACH, Johanna. Improving statistical machine translation of informal language: a rule-based pre-editing approach for French Forums. Thèse de doctorat : Univ. Genève, 2015, no. FTI 22

URN : urn:nbn:ch:unige-732262

DOI : 10.13097/archive-ouverte/unige:73226

Available at:

http://archive-ouverte.unige.ch/unige:73226


Improving Statistical Machine Translation of Informal Language:

A Rule-based Pre-editing Approach for French Forums

Thesis

presented to the Faculty of Translation and Interpreting of the Université de Genève

for the degree of Doctor in Multilingual Information Processing by

Johanna Gerlach

Jury:

Prof. Pierrette Bouillon, FTI/TIM, Université de Genève (thesis supervisor)
Prof. Aurélie Picton, FTI/TIM, Université de Genève (president of the jury)
Dr. Ana Guerberof, Pactera (external jury member)
Dr. Sabine Lehmann, Acrolinx (external jury member)
Dr. Emmanuel Rayner, FTI/TIM, Université de Genève (jury member)
Dr. Johann Roturier, Symantec (external jury member)

Defended on 13 March 2015 at the Université de Genève

Abstract

Forums are increasingly used by online communities to share information about a wide range of topics. While this content is in theory available to anyone with internet access, it is in fact accessible only to those users who understand the language in which it was written. Machine translation (MT) seems the most practical solution to make this content more widely accessible, but forum data presents multiple challenges for machine translation.

The central objective of the thesis is to investigate the possibility of improving the outcome of statistical machine translation of French forum data through the application of pre-editing rules. In particular, our work aims at identifying which transformations are useful to improve translation and whether these transformations can be applied automatically or interactively with a rule-based technology. To evaluate the impact of these rules, we propose a human comparative evaluation methodology using crowdsourcing.

Results show that pre-editing significantly improves the machine translation output. To assess the usefulness of these improvements, we perform an evaluation of temporal and technical post-editing effort. Findings show that improvements coincide with reduced effort. Another aspect we consider is whether the pre-editing task can concretely be performed in a forum context. Results of a pre-editing experiment with real forum users suggest that the interactive pre-editing process is accessible, with users producing only slightly less improvement than experts. Finally, to assess the portability of the developed pre-editing process, we perform evaluations with other MT systems, notably rule-based systems, as well as with data from forums from different domains. Findings indicate that, for the most part, the developed pre-editing rules are easily portable.

Internet forums today play an increasingly important role in the sharing of information by communities. However, although the information found there is technically accessible to all, it is in reality useful only to users who have mastered the language. Machine translation (MT) seems an attractive solution to this limitation, but here it is confronted with texts that present numerous difficulties. The main objective of this thesis is to investigate the possibilities of improving statistical machine translation of French forum texts through pre-editing. In particular, our work aims to identify which transformations are useful for improving translation, and to determine whether these transformations can be carried out automatically or interactively using a rule-based technology. First, to evaluate the impact of these rules, we propose a comparative human evaluation method using an online crowdsourcing platform. The results show that pre-editing significantly improves the machine translation output. To measure the usefulness of these improvements, we then evaluate temporal and technical post-editing effort. The results reveal that the improvements coincide with reduced post-editing effort. Next, we consider the feasibility of the pre-editing task in the forum context. An experiment with real forum users suggests that interactive pre-editing is accessible, with users obtaining improvements only slightly lower than those obtained by experts. Finally, to measure the portability of the pre-editing process developed, we carry out evaluations with other MT systems, notably linguistic systems, as well as with data extracted from forums in other domains. The results indicate that most of the rules are easily portable.

Acknowledgements

First and foremost, I am deeply grateful to my advisor Pierrette Bouillon, without whom this thesis would never have come into existence. Her experience and enthusiasm have been invaluable for the completion of this work.

I am greatly indebted to Sabine Lehmann for taking the time to introduce me to the Acrolinx technology and rule development, as well as for the many thought-provoking discussions. I am also very grateful to the other members of my thesis committee, Ana Guerberof, Manny Rayner and Johann Roturier, for providing valuable comments and suggestions. I would also like to thank Aurélie Picton for accepting the role of president of the jury, and for motivating me over these last years.

Many thanks to the members of the ACCEPT project for providing a stimulating research context. In particular, I would like to thank Victoria Porro, who significantly contributed to rule development and experiment setup.

Many thanks also to Liliana Gaspar for setting up the pre-editing experiment with Norton forum users, and to Philipp Koehn, who kindly let me use his Amazon Mechanical Turk Requester account for all my evaluations. I would also like to thank Magdalena Freund who took on the specialisation of Systran. I am also very grateful to the translators and AMT workers who completed a countless number of translation evaluations.

I would like to express my gratitude to my colleagues at the FTI/TIM department, Donatella, Lucía, Marianne, Nikos, Silvia and Violeta, who have been an invaluable source of advice and motivation. Special thanks to Claudia and Tobias who shared an office with me and had to deal with all my doubts and complaints.

I would like to acknowledge funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 288769.

all my ups and downs. I would like to dedicate this thesis to my parents, my mother Silke and my late father Dr. Dieter Gerlach.

Contents

List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Context
1.3 Objectives
1.4 Structure of the thesis
1.5 Published work

2 Pre-Editing
2.1 Introduction
2.2 Pre-editing approaches
2.2.1 Unknown words
2.2.2 Grammar errors and unknown structures
2.2.3 Ambiguity and complexity
2.2.4 Reordering
2.3 Pre-editing with Acrolinx
2.3.1 Acrolinx technology
2.3.2 Using Acrolinx
2.3.3 Developing rules with Acrolinx
2.3.4 Summary
2.4 Pre-editing forum data with Acrolinx
2.4.1 Input
2.4.2 MT technology
2.4.3 Target text
2.4.4 Pre-editing in the forum context
2.4.5 Summary
2.5 Conclusion

3 Rule Development
3.1 Introduction
3.2 Developing Acrolinx rules
3.2.1 Rule formalism
3.2.2 Development environment
3.2.3 Development methodology
3.3 Spelling and grammar rules
3.3.1 Non-word errors - Acrolinx spelling
3.3.2 Real-word errors - Acrolinx grammar rules
3.3.3 Performance of spelling and grammar rules on forum data
3.3.4 Spelling and grammar - summary
3.4 Punctuation and spacing rules
3.5 Informal language rules
3.5.1 Informal vocabulary
3.5.2 Informal syntactic structures
3.6 Controlled language and simplification rules
3.7 Rules for the machine
3.7.1 Development methodology
3.7.2 Reformulation rules
3.7.3 Clitics
3.7.4 Reordering rules
3.7.5 Informal second person
3.8 Rule application in the forum context
3.9 Rule precision
3.9.1 Evaluation
3.9.2 Results
3.9.3 Summary
3.10 Conclusion

4 Rule Evaluation: Impact
4.1 Introduction
4.2 Machine translation evaluation
4.2.1 Human evaluation
4.2.2 Automatic evaluation
4.3 Evaluating the impact of pre-editing on SMT of forum data
4.3.1 Participants
4.3.2 Evaluation setup
4.4 Rule by rule evaluation
4.4.1 Data selection
4.4.2 Results by rule categories
4.4.3 Evaluator groups and rater agreement
4.5 Global evaluation
4.5.1 Human comparative evaluation
4.5.2 Automatic evaluation
4.6 Global evaluation - automatic rules only
4.7 Conclusion

5 Rule Evaluation: Impact on Post-editing Temporal Effort
5.1 Introduction
5.2 Post-editing
5.2.1 Assessing post-editing effort
5.2.2 MT output quality and post-editing effort
5.3 Post-editing effort and pre-editing
5.4 Post-editing in ACCEPT
5.5 First experiment - impact of successful pre-editing
5.5.1 Data
5.5.2 Participants
5.5.3 Pre-editing task
5.5.4 Translation
5.5.5 Post-editing task
5.5.6 Post-editing temporal effort
5.5.7 Edit distance
5.5.8 Summary
5.6 Second experiment - impact of all pre-editing
5.6.1 Experimental setup
5.6.2 Post-editing task
5.6.3 Post-editing temporal effort
5.6.4 Edit distance
5.6.5 Comparison with first experiment
5.6.6 Summary
5.7 Conclusion

6 Rule Evaluation: Portability
6.1 Introduction
6.2 Portability to other MT systems (Lucy, Systran)
6.2.1 Specialising the rule-based systems
6.2.2 Rule by rule evaluation with Lucy and Systran
6.2.3 Global evaluation with Lucy and Systran
6.3 Portability to other forums
6.3.1 Forum selection
6.3.2 Rule precision
6.3.3 Impact on translation
6.4 Conclusion

7 Rule Evaluation: Usability
7.1 Introduction
7.2 Pre-editing in the Norton Community forums
7.2.1 The ACCEPT pre-editing plugin
7.2.2 Participants
7.2.3 Data selection
7.2.4 Pre-editing scenarios
7.3 Fully manual vs semi-automatic pre-editing
7.3.1 Pre-editing activity
7.3.2 Impact on translation
7.4 Users against Experts
7.4.1 Pre-editing activity
7.4.2 Impact on translation
7.5 Conclusion

8 Conclusion
8.1 Achievements
8.2 Limitations
8.3 Future work

References

A Research Overview
B Data
C Pre-editing rules ordered by Set
D Rule by rule results
D.1 Grammar (agreement)
D.2 Grammar (mood/tense)
D.3 Grammar (sequence)
D.4 Homophone confusion
D.5 Punctuation
D.6 Informal
D.7 Simplification
D.8 Reformulation
D.9 Informal 2nd person
D.10 Clitics
D.11 Reordering
E Pre-editing in the Norton Community forum: instructions for participants
F Post-editing guidelines

List of Figures

2.1 Pre-editing plugin in forum interface
3.1 Example of trigram extraction output
4.1 AMT evaluation interface
4.2 Tool evaluation interface
4.3 % Impact of individual grammar (agreement) rules
4.4 % Impact of individual grammar (mood/tense) rules
4.5 % Impact of individual grammar (sequence) rules
4.6 % Impact of individual homophone rules
4.7 % Impact of individual punctuation rules
4.8 % Impact of individual informal language rules
4.9 % Impact of individual simplification rules
4.10 % Impact of individual reformulation rules
4.11 % Impact of informal second person rule
4.12 % Impact of individual clitics rules
4.13 % Impact of individual reordering rules
5.1 Post-editing interface
5.2 XLIFF output from ACCEPT post-editing portal
6.1 % of sentences improved by grammar (agreement) rules
6.2 % of sentences improved by grammar (mood/tense) rules
6.3 % of sentences improved by grammar (sequence) rules
6.4 % of sentences improved by homophone rules
6.5 % of sentences improved by punctuation rules
6.6 % of sentences improved by informal language rules
6.7 % of sentences improved by simplification rules
6.8 % of sentences improved by reformulation rules
6.9 % of sentences improved by clitic rules
6.10 % of sentences improved by reordering rules
7.1 Pre-editing plugin in forum interface
B.1 Overview of data used
E.1 Pre-editing guidelines page 1 of 2
E.2 Pre-editing guidelines page 2 of 2
F.1 Post-editing guidelines page 1 of 4
F.2 Post-editing guidelines page 2 of 4
F.3 Post-editing guidelines page 3 of 4
F.4 Post-editing guidelines page 4 of 4

List of Tables

3.1 Grammar (agreement) rules
3.2 Grammar (tense/mood) rules
3.3 Grammar (sequence) rules
3.5 Homophone rules
3.4 Sequences flagged by the wrongSeq rule
3.6 Precision and recall of non-word error detection on 500 sentences
3.7 Precision and recall of non-word error correction on 500 sentences
3.8 Distribution of correction types for non-word errors on 500 sentences
3.9 Precision and recall of non-word error correction on 500 sentences, taking into account cases with multiple replacement suggestions
3.10 Precision and recall of real-word error detection on 500 sentences
3.11 Precision and recall of real-word error correction on 500 sentences
3.12 Distribution of correction types for real-word errors on 500 sentences
3.13 Punctuation and spacing rules
3.14 Informal language rules
3.15 Simplification rules
3.16 Reformulation rules
3.17 Clitic rules
3.18 Reordering rules
3.19 Informal second person rule
3.20 Pre-editing rule sets
3.21 Flags and precision of rule sets on 10,000 sentences
3.22 Causes of erroneous flags on forum data
4.1 Distribution of HITs among workers
4.2 Comparative evaluation results for grammar (agreement) rules
4.3 Comparative evaluation results for grammar (mood/tense) rules
4.4 Comparative evaluation results for grammar (sequence) rules
4.5 Comparative evaluation results for combined homophone rules
4.6 Comparative evaluation results for combined punctuation rules
4.7 Comparative evaluation results for combined informal language rules
4.8 Comparative evaluation results for simplification rules
4.9 Comparative evaluation results for reformulation rules
4.10 Comparative evaluation results for informal second person rule
4.11 Comparative evaluation results for combined clitic rules
4.12 Examples for the cliticsPersPron rule
4.13 Comparative evaluation results for combined reordering rules
4.14 Judgement distribution for both evaluator groups
4.15 Unweighted Cohen's Kappa computed over each evaluator pair for both scales
4.16 Contingency table for raters 1 and 2
4.17 Agreement between and within groups
4.18 Flags for each rule category on 1,030 sentences of Norton Community forum data
4.19 Flags by rule sets on 1,030 sentences of Norton Community forum data
4.20 Comparative evaluation results for complete pre-editing sequence
4.21 Document level automatic metric results
4.22 Sentence level correlation between metrics and human judgements
4.23 Comparative evaluation results for automatic pre-editing only
5.1 Flags for each pre-editing rule set
5.2 Comparative evaluation results
5.3 Throughput (words/min) for translations of raw and pre-edited data
5.4 Combined pre- and post-editing times (minutes) 1st experiment
5.5 TER scores computed between MT output and post-edited versions
5.6 Sentences where no post-editing was performed
5.7 Comparative evaluation results for 100 random posts
5.8 Throughput (words/min) based on editing time for translations of raw and pre-edited data, for each of the three cases
5.9 Throughput (words/min) based on thinking time + editing time for translations of raw and pre-edited data, for each of the three cases
5.10 Estimation of global impact of pre-editing on post-editing throughput
5.11 Estimated combined pre- and post-editing times (minutes) 2nd experiment
5.12 TER scores computed between MT output and post-edited versions
5.13 Global throughput in both experiments
5.14 Proportion of sentences where MT output was left unedited for both experiments
6.1 Comparative evaluation of grammar (agreement) rules for three MT systems
6.2 Comparative evaluation of grammar (mood/tense) rules for three MT systems
6.3 Comparative evaluation of grammar (sequence) rules for three MT systems
6.4 Comparative evaluation of homophone rules for three MT systems
6.5 Comparative evaluation of punctuation rules for three MT systems
6.6 Comparative evaluation of informal language rules for three MT systems
6.7 Comparative evaluation of simplification rules for three MT systems
6.8 Comparative evaluation of reformulation rules for three MT systems
6.9 Comparative evaluation of tuVous rule for three MT systems
6.10 Comparative evaluation of clitic rules for three MT systems
6.11 Comparative evaluation of reordering rules for three MT systems
6.12 Comparative evaluation results for complete pre-editing sequence for three MT systems
6.13 Rule precision on 10,000 sentences by categories for the three forums
6.14 Comparative evaluation results for alternate forums
7.1 Pre-editing activity for the two user scenarios
7.2 Edit distance (words) between raw and pre-edited versions for user scenarios
7.3 Comparative evaluation results for Raw vs SemiAuto and AllManual
7.4 Comparative evaluation results for SemiAuto vs AllManual
7.5 Edit distance between raw and pre-edited versions for all scenarios
7.6 Flags rejected by forum users
7.7 Comparative evaluation results for Raw vs Expert and Raw vs Oracle
7.8 Comparative evaluation results for User against Expert
A.1 Research overview
C.1 Set 1 (rules for humans; automatic application)
C.2 Set 2 (rules for humans; interactive application)
C.3 Set 3 (rules for the machine; automatic application)
D.1 Pre-editing rules

1 Introduction

1.1 Motivation

The Web 2.0 paradigm, which has transformed users from passive viewers to active contributors, has brought into existence a new form of textual data: user-generated content (UGC). Forums, blogs and social networks are increasingly used by online communities to share information about a wide range of topics, from technical issues like IT support, to all sorts of hobbies, crafts, trades, health and lifestyle issues. UGC now represents a large share of the informative content available on the web. While technically this content is readily available to anyone with internet access, it is in fact accessible only to those users who understand the language in which it was written.

To breach the language barrier and make this content more widely accessible, some form of translation is necessary. Since human translation is not an option, due both to the sheer volume of data and to the cost this would engender, machine translation (MT) seems the most practical solution. However, UGC presents multiple challenges for machine translation. In the context of a forum, where the focus is on solving problems, linguistic accuracy is often not a priority. Spelling, grammar and punctuation conventions are not always respected. Additionally, the language used is closer to spoken language, using informal syntax, colloquial vocabulary, abbreviations and technical terms. This combination of poor linguistic quality and informal language makes this content difficult to translate automatically (Carrera et al., 2009; Roturier & Bensadoun, 2011; Jiang et al., 2012).

To address these issues, two possibilities have been investigated: adapting MT systems to handle this type of data (e.g. Banerjee et al., 2011, who perform a language and translation model adaptation) or pre-processing the data to bring it closer to traditional text forms that MT systems are designed to handle (e.g. Jiang et al., 2012, who perform pre-processing by means of regular expressions). The latter process, pre-editing, and its application to forum content, are the main focus of the present thesis.

Pre-processing text to improve MT is an old topic (e.g. Ruffino, 1981). A fair share of research on machine translation improvement has gone to investigating the input that is being translated, the difficulties it presents for MT systems, and how they can be removed by pre-editing to improve the translatability of text (e.g. Bernth & Gdaniec, 2001). Pre-editing can take on numerous forms, such as spelling and grammar checking, normalisation, the application of controlled languages (CLs), simplification and reordering. The difficulties of natural language tackled by these pre-editing approaches are by no means exclusive to machine translation. Indeed, they are problematic for many other natural language processing (NLP) tasks. Of the numerous approaches that try to deal with these issues, only a fraction have been developed specifically to improve machine translation. Nevertheless, since many processing steps such as part-of-speech tagging or parsing are common to multiple NLP tasks, the approaches found to facilitate these steps could also be shared across tasks, and thus be beneficial to MT.

Traditionally, pre-editing is called for in authoring situations where high quality content is produced, for example for technical documentation (Huijsen, 1998). It has only recently been associated with community content, and then mostly in the form of normalisation, investigated for processing of social media data, such as Twitter, to make it more accessible to processes like data mining or sentiment analysis (e.g. Han & Baldwin, 2011; Clark & Araki, 2012; Sidarenka et al., 2013). Few studies, however, have focussed on pre-editing of forum data to improve machine translation (with the exception of Banerjee et al., 2012; Lehmann et al., 2012, for example).

When applied before machine translation, pre-editing has mostly been associated with rule-based machine translation (e.g. Pym, 1988; Mitamura & Nyberg, 1995; Bernth, 1998). Controlled languages in particular have often been combined with, or even developed specifically for, rule-based machine translation (Nyberg & Mitamura, 1996). This can be explained by the fact that the difficulties encountered by RBMT, such as dealing with specific ambiguities, are well known. It is thus relatively straightforward to pre-process the data in order to reduce them. In the context of statistical machine translation, identifying transformations that will improve translatability is not as straightforward. With the exception of a few studies (e.g. Aikawa et al., 2007; Temnikova, 2010; Lehmann et al., 2012), pre-editing, in particular in the sense of a controlled language, has rarely been associated with statistical machine translation.

1.2 Context

Most of the work described in this thesis was carried out within the context of the ACCEPT (Automated Community Content Editing PorTal) research project¹. This project, funded by the seventh European framework programme, brought together the Universities of Edinburgh and Geneva, Acrolinx, Symantec and Lexcelera/Translators Without Borders from 2012 to 2014. It aimed at improving Statistical Machine Translation (SMT) of community content through minimally-intrusive pre-editing techniques, SMT improvement methods and post-editing strategies, thus allowing community content to be shared across the language barrier.

The context given by this project has conditioned our choice of data, pre-editing technology and MT system. Within the project, the forums used are the Norton Community forums², administered by Symantec, one of the partners in the project. Pre-editing and post-editing are performed with the technology of another project partner, the Acrolinx IQ engine (Bredenkamp et al., 2000). This rule-based engine uses a combination of NLP components and enables the development of declarative rules, which are written in a formalism similar to regular expressions, based on the syntactic tagging of the text. Acrolinx allows both automatic and interactive rule application. The machine translation system, developed by the University of Edinburgh, is a phrase-based Moses system trained using the standard Moses pipeline (Koehn et al., 2007) with Symantec translation memory data, complemented by Europarl and news-commentary data. Forum text was included in the data used to train the language models. The data and technology just listed are thus the main resources used in the present thesis.
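The idea of a declarative rule written as a pattern over tagged tokens can be illustrated with a minimal Python sketch. This is not the Acrolinx formalism or API: the `Rule` class, the `apply_rule` function, the tag names and the example rule are all hypothetical, and the part-of-speech tags are assigned by hand rather than by a tagger. The sketch only shows how a rule can pair a token-level pattern with a suggested replacement, here for the common French confusion of "a" and "à" before an infinitive.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (surface form, part-of-speech tag); the tagset is hypothetical
Token = Tuple[str, str]


@dataclass
class Rule:
    name: str
    pattern: List[Callable[[Token], bool]]       # one predicate per token position
    rewrite: Callable[[List[Token]], List[str]]  # suggested replacement words


def apply_rule(rule: Rule, tokens: List[Token]):
    """Scan left to right; return (flags, rewritten words)."""
    width = len(rule.pattern)
    if width == 0:
        raise ValueError("a rule needs at least one pattern element")
    flags, words = [], []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + width]
        if len(window) == width and all(p(t) for p, t in zip(rule.pattern, window)):
            flags.append((i, rule.name))        # position and rule that fired
            words.extend(rule.rewrite(window))  # apply the suggestion
            i += width
        else:
            words.append(tokens[i][0])
            i += 1
    return flags, words


# Hypothetical rule: "a" directly followed by an infinitive is rewritten as "à".
a_before_infinitive = Rule(
    name="a_before_infinitive",
    pattern=[lambda t: t[0] == "a", lambda t: t[1] == "VINF"],
    rewrite=lambda w: ["à", w[1][0]],
)

# Hand-tagged example sentence (tags are assumptions, not tagger output).
tokens = [("j'", "PRON"), ("ai", "VERB"), ("du", "DET"), ("mal", "NOUN"),
          ("a", "VERB"), ("installer", "VINF"), ("Norton", "NAME")]
flags, words = apply_rule(a_before_infinitive, tokens)
print(flags)
print(" ".join(words))
```

In an automatic setting the rewritten words would be used directly; in an interactive setting the flags would instead be shown to the user, who accepts or rejects each suggestion.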

¹ http://www.accept-project.eu/

² http://community.norton.com/

1.3 Objectives

The central objective of the thesis is to investigate whether pre-editing can improve the outcome of statistical machine translation of French forum data. To address this question, a number of aspects must be considered, from the definition of a pre-editing process to the means of measuring the improvement of MT, but also the feasibility of pre-editing in a forum context. We shall now present these aspects in further detail and introduce the research questions addressed by this thesis.

Pre-editing can take on many forms. Our aim is to explore various types of transformations to see whether they are applicable to French forum data and whether they improve translation. Rule development is motivated by aspects of the input data and of the SMT system: on the one hand we have error-fraught forum content, on the other a system trained on clean published data, from the same domain but different in terms of register. The developed rules thus include rules for spelling and grammar correction, normalisation, disambiguation, simplification, reformulation and reordering. Our objective is not to produce an extensive set of rules, but rather to investigate different types of transformations to identify which ones should be considered for more extensive development. An important aspect of rule development is that, besides being problematic for MT, the forum data can be equally challenging for the technology used for pre-editing. Pre-editing resources must therefore be sufficiently robust. Consequently, our objective is to investigate whether a technology such as Acrolinx, relying on a simple declarative formalism in combination with shallow NLP components, is flexible enough to describe the different types of rules while being robust enough to handle forum data. The research question we seek to answer is whether a declarative rule-based formalism such as Acrolinx is suited to pre-edit forum data.

Since our focus is on translation improvement, rule development and selection is driven by the impact of source transformations on MT. A reliable evaluation method to measure this impact is thus essential. We describe a human comparative evaluation framework, which allows us to quickly determine whether the translation of a pre-edited version is better or worse than that of a raw version. This framework serves multiple purposes. First, in the rule development context, identifying successful and unsuccessful transformations is indispensable to the fine-tuning of rules. Second, it serves as basis for rule selection, by enabling the identification of the most appropriate rules for a given system and data. Ultimately, this framework allows us to establish the impact of the pre-editing process as a whole and thus serves to answer the question whether pre-editing can improve statistical machine translation of French forum data.
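A comparative evaluation of this kind can be summarised with a simple tally. The sketch below assumes, purely for illustration, that each sentence pair is judged by several evaluators who choose between three hypothetical labels ("raw", "pre" and "equal") and that a per-sentence verdict is obtained by majority vote; the label set, the aggregation rule and the example data are assumptions of this sketch, not a description of the evaluation setup used in the thesis.

```python
from collections import Counter
from typing import Dict, List


def majority(judgements: List[str]) -> str:
    """Return the label chosen by most evaluators, or 'no_majority' on a tie."""
    ranked = Counter(judgements).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "no_majority"
    return ranked[0][0]


def summarise(per_sentence: Dict[str, List[str]]) -> Counter:
    """Count how many sentences received each majority verdict."""
    return Counter(majority(j) for j in per_sentence.values())


# Invented judgements: which translation is better, that of the raw source
# ("raw"), that of the pre-edited source ("pre"), or neither ("equal")?
judgements = {
    "s1": ["pre", "pre", "raw"],
    "s2": ["raw", "equal", "pre"],
    "s3": ["equal", "equal", "pre"],
    "s4": ["pre", "pre", "pre"],
}
summary = summarise(judgements)
print(summary)
```

Comparing the number of sentences judged better after pre-editing with the number judged worse then gives a first indication of whether a rule, or the whole pre-editing process, helps or harms translation.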

With the objective of improving the efficiency of this human evaluation, we investigate the possibility of crowdsourcing judgements using an online microworking platform (Amazon Mechanical Turk). This raises another question, namely whether a comparative translation evaluation on Amazon Mechanical Turk (AMT) can produce comparable results to evaluations performed with language professionals.
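One common way to quantify how closely two evaluators agree on categorical judgements is unweighted Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The following sketch computes it for two lists of labels; the function name and the example judgements are invented for illustration and do not reproduce any data from the thesis.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Invented example: one crowd worker and one translator judging six sentences.
worker = ["pre", "pre", "raw", "equal", "pre", "raw"]
translator = ["pre", "raw", "raw", "equal", "pre", "pre"]
kappa = cohen_kappa(worker, translator)
print(round(kappa, 3))
```

A kappa of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance, so comparing kappa within and between evaluator groups shows whether crowd judgements behave like professional ones.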

While comparative evaluation allows the identification of improvement, it gives no insight into the usefulness of these improvements. Taking a broader view, where pre-editing is part of a machine translation process which aims at producing a final, usable translation, pre-editing should contribute to bringing the MT output closer to this final translation, thereby reducing post-editing effort, i.e. the human effort involved in the correction of MT output. Therefore, to assess the usefulness of pre-editing, we will investigate how the improvements identified by our comparative evaluation relate to post-editing effort. More precisely, the research question we seek to answer is whether pre-editing that is found to improve MT also reduces post-editing effort.

Another aspect to consider in this context is that pre-editing, when it cannot be done automatically, for example in the case of complex or ambiguous transformations, also requires effort. This raises the question of return on investment, since it would make little sense to invest effort in pre-editing if the resulting changes had little or no impact on post-editing effort. Thus we will address a second research question relating to post-editing, namely how the effort invested in pre-editing relates to the gain in terms of post-editing effort.

The pre-editing approach developed in this study focuses on the translation of IT forum data with a specially trained SMT system, which is only one amongst a multitude of possible cases of machine translation of forum data. To investigate whether this approach can be generalised, we will address two further research questions, namely whether the pre-editing approach defined in this study is portable to other MT systems, notably RBMT systems, and whether the pre-editing approach is portable to forums from another domain. Investigating these issues will also provide insights into which rules are specific and which must be specialised.


Finally, we must consider whether the developed pre-editing rules can successfully be implemented in a forum. The best pre-editing rules with high impact on machine translation will be of little use if they cannot be applied reliably. Since part of the rules require interactive application, and in the forum context the pre-editing task will have to be accomplished by the community members themselves, it is necessary to investigate whether these rules are accessible to forum users. Our final research question therefore is whether forum users can successfully perform the transformations required by our pre-editing rules, and thus achieve the desired impact on MT output.

1.4 Structure of the thesis

The body of the thesis is divided into two main parts. The first, consisting of Chapters 2 and 3, presents the pre-editing rules and their development. The second (Chapters 4 to 7) describes the different evaluations of the rules. In more detail, the content of the individual chapters is as follows:

Chapter 2 presents the pre-editing approach. After an overview of common pre-editing practices, such as spell-checking, normalisation and controlled language application, we consider how pre-editing can be applied to forum data.

Chapter 3 describes rule development with Acrolinx. An introduction to the Acrolinx rule formalism is followed by a description of the different resources used for rule development and the related rule development methodologies. We then describe the specific rules developed in this thesis. An evaluation of rule precision addresses the question of the suitability of the Acrolinx formalism for the task of pre-editing forum data.

Chapter 4 presents the evaluation of the impact of pre-editing rules on machine translation. After describing the comparative human evaluation methodology, we provide results for the impact of pre-editing rules on forum data, both on a rule-by-rule basis and as a complete process. This evaluation shows that in our application pre-editing has a significant positive impact on SMT output.


Chapter 5 addresses the question of the impact of pre-editing on post-editing effort. We describe two experiments involving post-editing of translations of raw and pre-edited content. The results show that pre-editing that improved translation quality also significantly reduces post-editing time.

Chapter 6 addresses the question of the portability of the rules to other domains and other MT engines. In the first part of this chapter, we focus on the portability to other MT systems, by evaluating the impact on two rule-based systems: one uses a transfer approach, the other an improved direct approach. In the second part, we evaluate the usefulness of the pre-editing rules for a forum taken from another domain (DIY), by evaluating both rule precision and impact on machine translation.

Chapter 7 addresses the question of rule usability by real forum users. In this chapter we describe an experiment where pre-editing rules are applied by real users of the Norton Community forums. The results show that rule application by forum users is close to that by experts.

Chapter 8 concludes, presents limitations of this work and outlines future work.

Appendix A presents a summary of the different experiments performed in this thesis to address the research questions. An overview of the datasets used for these experiments is provided in Appendix B.

1.5 Published work

The work described in this thesis has been discussed in previous publications. A first evaluation of the pre-editing rules was published in Gerlach et al. (2013b). A subset of the pre-editing rules described in this thesis was used in an experiment investigating hybrid vs rule-based approaches for the correction of homophones (Bouillon et al., 2013). Results of the first post-editing experiment were published in Gerlach et al. (2013a). The pre-editing experiment involving Norton Community forum users was published in Bouillon et al. (2014). A second evaluation of the impact of the pre-editing rules was published in Seretan et al. (2014). Finally, results pertaining to rule portability were published in the project deliverable ACCEPT D9.2.4 (2014).


2 Pre-Editing

This chapter provides background on pre-editing for machine translation, introduces the Acrolinx pre-editing technology and outlines our pre-editing approach for forum data.

2.1 Introduction

In the current research on machine translation improvement, a fair share of attention has gone to the input that is being translated and what difficulties it presents for MT systems. Pre-processing methods have been considered to remove these difficulties, and thus improve the translatability of text (e.g. Bernth & Gdaniec, 2001). These pre-editing methods focus on the different aspects of natural language that are problematic for machine translation.

Difficulties begin at the word level: MT systems are thwarted by unknown words, i.e. tokens that are not covered by the system’s resources, and therefore cannot be processed. Often, unknown words are the result of misspellings, thus a commonplace form of pre-editing is spell-checking. Unknown words can also result from the use of non-standard tokens, such as the colloquial language used in social media data. An increasing number of approaches regrouped under the denomination of normalisation transform these non-standard tokens.

The next problem comes with the way words are arranged into phrases and sentences. If the structures do not match those the MT system can process, based either on its linguistic resources or its training data, they will not be translated correctly. This can happen in the case of sentences that do not match the conventions of a language; in these cases, grammar checking would be a useful form of pre-editing.

Even in the absence of errors, the ambiguities of natural language, both at the lexical and syntactic level, can be difficult to handle. Long, complex sentences with multiple clauses and long distance dependencies are difficult to translate. Thus probably the most well known form of pre-editing is the application of controlled languages (CLs), languages defined by sets of rules designed to reduce the ambiguity and complexity of texts. On the same principles, but for different applications, simplification approaches also aim at reducing complexity.

Finally, more specifically for statistical MT as well as rule-based systems using a direct translation architecture, word order differences between source and target language are also problematic (Niessen & Ney, 2001). This has led to the development of reordering approaches, i.e. transforming word order to improve translatability (Collins et al., 2005).

These difficulties of natural language are by no means exclusive to machine translation. Indeed, they are problematic for many other NLP tasks. Numerous approaches have thus been developed to deal with these issues, only a fraction of them specifically to improve machine translation. Nevertheless, since many processing steps such as POS tagging or parsing are common to multiple NLP tasks, the approaches found to facilitate these steps could also be shared across tasks.

While pre-editing texts to improve human readability or MT performance is an old topic (e.g. Ruffino, 1981), the pre-editing approach we will explore in the present thesis will differ from the approaches found in the literature in several aspects.

Most studies involving pre-editing have focussed on one approach at a time, such as applying a controlled language to improve translation (e.g. Mitamura & Nyberg, 1995) or normalising certain phenomena in social media to facilitate processing (e.g. Sidarenka et al., 2013). In our study we intend to combine aspects of different approaches, including several kinds of transformations that can potentially improve translation. To achieve this, we will use Acrolinx, a content control tool which allows development of different types of rules, thereby making it possible to include different transformations within a single tool.

Often, pre-editing is called for in authoring situations where high quality content is produced, for example for technical documentation. It has therefore rarely been associated with community content. One exception is normalisation, which has recently been applied to social media data, such as Twitter, to make it more accessible to processes like data mining or sentiment analysis (e.g. Clark & Araki, 2012; Han & Baldwin, 2011; Sidarenka et al., 2013). Few studies however have focussed on pre-editing of forum data (Roturier & Bensadoun, 2011).

Pre-editing often involves complex rules, such as controlled language rules, and the application of rules is considered a difficult task (Goyvaerts, 1996). It is generally performed by professional writers. In our context however, pre-editing will have to be performed by forum users, who present an entirely different profile. Defining an accessible and reliable pre-editing process for this context is a novel question.

Finally, when applied before machine translation, pre-editing has mostly been associated with rule-based machine translation (e.g. Mitamura & Nyberg, 1995; Bernth, 1998; Pym, 1988). Controlled languages in particular have often been combined with, or even developed specifically for, rule-based machine translation (Nyberg & Mitamura, 1996). Since the difficulties encountered by RBMT, such as dealing with specific ambiguities, are well known, it is relatively straightforward to pre-process the data in order to reduce these. In the context of statistical machine translation, identifying transformations that will improve translatability is not as straightforward. Only a few studies investigate pre-editing in association with SMT.

This chapter is organised as follows: we begin with a first section providing the background on different pre-editing approaches and their association with machine translation (2.2). We then introduce the pre-editing technology used in this thesis, Acrolinx (2.3). We conclude by outlining how these approaches and technology are combined in the pre-editing approach we have defined with the objective of improving machine translation of forum data (2.4).

2.2 Pre-editing approaches

Different pre-editing approaches attempt to resolve or diminish the issues of natural language to make text more easily processable, and possibly more translatable. In this section, we will discuss the difficulties natural language presents for machine translation, and the different pre-editing approaches that could reduce these difficulties. We begin this section with the problem of unknown words (2.2.1), continue with grammar issues (2.2.2) followed by ambiguity and complexity (2.2.3), and conclude with word order differences between language pairs (2.2.4).

2.2.1 Unknown words

A very common problem encountered by MT systems of any kind is that of unknown words, also called out-of-vocabulary items (OOVs). For an RBMT system, these are words that are not in the system’s linguistic resources or dictionaries; for SMT, these are words absent from the training data. Occurrence of these invariably results in poor translations, where these words are either left untranslated or removed altogether. Worse, these OOVs can disrupt the analysis and the translation of the entire context.

OOVs can be of different natures: domain specific terminology, proper nouns, abbreviations, colloquial language, misspelt words, letter and digit sequences, acronyms or tokens resulting from word boundary infractions. Handling domain terminology and well defined abbreviations or acronyms is generally considered a matter of system specialisation rather than pre-processing, and will not be discussed here. The other issues however, can benefit from pre-processing to replace them with in-coverage tokens. The task of identifying and replacing such undesirable tokens has led to much research, with spell-checking addressing misspelt words, and normalisation handling non-standard tokens. We will now provide some background for these two approaches and discuss their application in the context of MT.

2.2.1.1 Spell-checking - Correcting non-word errors

Spelling errors that produce OOVs are ones that result in words that do not exist, commonly referred to as non-word errors. Identifying and replacing these is the task of spell-checkers, development of which began in the sixties (Blair, 1960; Damerau, 1964).

The first tools identified misspellings in a text by looking up each token in a correct word list (often also referred to as dictionary or lexicon in the literature), and rejecting those that could not be found. This a priori simple task requires correct segmentation and tokenisation of text into words. Depending on the language, identifying word boundaries correctly can be quite complicated: punctuation, spacing characters and casing have to be taken into account, each with their share of exceptions (Fontenelle, 2006). Correct tokenisation becomes even more complicated in the presence of word boundary infractions, such as words that are run together (e.g. *tuas fait (youhave done)) or single words that are mistakenly split (e.g. *mon ordina teur (my comp uter)). Besides correct tokenisation, the efficiency of non-word error detectors depends greatly on the word list against which tokens are checked. The larger the word list, the better the coverage, and the lower the chance that valid words will be flagged as incorrect. However, too large a dictionary can also be problematic, as some misspelt words might coincide with rare valid entries and therefore not be identified.

Once non-word errors have been identified, replacement candidates must be found. The most common algorithms to find replacement candidates use some form of word similarity measure (Damerau, 1964; Levenshtein, 1966; Wagner, 1974). A widespread measure is the minimum edit distance, often also referred to as Damerau-Levenshtein distance, the principle of which is to find the minimum number of edit operations (insertions, deletions, substitutions or transpositions) to transform one token into another. Replacement candidates are thus selected among words with a minimum edit distance to the misspelled word. Edit distance is also used to rank replacement candidates, on the assumption that the shorter the distance, the better the candidate.

A more sophisticated method to transform misspelt words into a correct equivalent is rule-based correction, which uses transformation rules based on patterns inferred from a set of misspellings (Yannakoudakis & Fawthrop, 1983). A further development to find replacement candidates is the noisy channel model (Shannon, 1948). Widely used in the domain of speech recognition (Jelinek, 1997), this model was first applied to spelling correction in the nineties (Kernighan et al., 1990; Mays et al., 1991). The aim of this model is to find the most likely valid word, given the observed distorted word. Finally, yet another approach is based on the assumption that many errors are caused by confusions. It uses so-called confusion lists, or confusion sets, which are sets of words which are likely to be confused with each other (e.g. Yarowsky, 1994; Pedler & Mitton, 2010).
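To make the edit-distance principle described above concrete, the following is a minimal sketch in Python. It implements the restricted variant of the Damerau-Levenshtein distance (often called optimal string alignment), which is an assumption on our part since the cited works differ in detail. The function names, the toy lexicon and the `max_distance` threshold are hypothetical and chosen for illustration only; they do not describe any particular spell-checker.

```python
def edit_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance between two tokens.

    Counts insertions, deletions, substitutions and transpositions of
    two adjacent characters, each with a cost of 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggest(token, lexicon, max_distance=2):
    """Return replacement candidates for a non-word, closest first.

    A token found in the lexicon is not a non-word error, so no
    candidates are returned for it. max_distance is an illustrative
    threshold, not a value taken from the literature.
    """
    if token in lexicon:
        return []
    scored = sorted((edit_distance(token, word), word) for word in lexicon)
    return [word for distance, word in scored if distance <= max_distance]


if __name__ == "__main__":
    toy_lexicon = ["ordinateur", "ordonner", "imprimante"]  # hypothetical word list
    print(suggest("ordinatuer", toy_lexicon))
```

With this measure, a transposition such as *ordinatuer for ordinateur costs a single operation, whereas a phonetically motivated error such as *farmacie for pharmacie already costs two.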

While intuitively the words with the shortest edit distance would seem to be the most obvious suggestion, an important exception is that of some language specific phonetic confusions. The mathematical approach based on sequences of edit operations is particularly unfavourable for phonographic errors (e.g. the confusion of f and ph) which are frequent in French (Veronis, 1988). Finding the correct replacement for these words is therefore not a result of minimal edits, but of specific replacement based on phonetic similarity (Fontenelle, 2006). To this end, Brill & Moore (2000) improve a noisy channel model by enhancing the error model to include edit operations composed of multiple letters (such as “ph|f” or “ent|ant”), bringing the model closer to errors humans actually make. In a similar line of reasoning, Toutanova & Moore (2002) incorporate word pronunciation information in a noisy channel model, thus achieving a substantial performance improvement. Another earlier approach using phonographic information is that investigated by van Berkelt & De Smedt (1988), which relies on a so-called triphone analysis to correct both orthographical and typographical errors in Dutch.
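The noisy channel formulation mentioned above can be written compactly. The notation below is ours and is given only as an orientation: $o$ stands for the observed (possibly misspelt) token, $V$ for the list of valid words, $P(w)$ for a language model and $P(o \mid w)$ for the error model.

```latex
\hat{w} \;=\; \arg\max_{w \in V} P(w \mid o)
        \;=\; \arg\max_{w \in V} P(o \mid w)\, P(w)
```

The second equality follows from Bayes' rule, since $P(o)$ is constant for a given observed token. In an error model with multi-letter edit operations of the kind described above, $P(o \mid w)$ is, roughly and leaving aside any conditioning on position, approximated over partitions $R$ of $w$ and $T$ of $o$ into the same number of substrings:

```latex
P(o \mid w) \;\approx\; \max_{R,\,T \,:\, |R| = |T|} \;\prod_{i=1}^{|R|} P(T_i \mid R_i)
```

Under such a model a replacement like ph|f counts as a single, possibly quite probable, operation instead of two independent single-letter edits.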

Finally, we should also mention some more language specific corrections, such as the restoration of diacritics. At the early stages of computer technology, when encodings supporting accented characters were not commonplace and standard keyboards were not adapted to typing these characters, accents were often left out. Simard & Deslauriers (2001) describe a method to reinsert missing accents based on a statistical language model. Another phenomenon specific to French is the omission of the apostrophe in the case of an elision, which leads to a simple concatenation of an article or pronoun with the following word (*sinstalle, which should be s’installe). Including grammatical information in the lexicon allows correct insertion of the apostrophe in such cases (Fontenelle, 2006).

As we have seen, numerous methods exist to identify and replace non-word errors, and checking spelling has become very simple. Spell-checkers have nowadays become ubiquitous: many tools where users can enter text, from text-processors to e-mail clients, or even website forms and search engines, have some kind of checking functionality. As an example, Google Translate’s online interface (https://translate.google.com/) includes on-the-fly spell-checking. The ubiquity of spell-checking probably explains why its impact on machine translation has not been much studied, although some studies (e.g. Banerjee et al., 2012; Liu et al., 2012; Khan et al., 2013) include a spell-checking step as part of a pre-editing process.

2.2.1.2 Normalisation - Replacing non-standard tokens

Besides misspelt words, a number of other non-standard tokens, such as letter and digit sequences, acronyms, abbreviations, informal language, URLs etc. can disrupt machine processing. The process of replacing non-standard tokens by some standard form is referred to as normalisation. It is found to have a positive impact on tasks such as lemmatisation, POS-tagging and assignment of morphosyntactic features (Melero et al., 2012).

Most normalisation approaches are closely related to the spell checking task, with distorted or contracted words being replaced by their “normal” equivalent. A common issue is the expansion of abbreviations, for which both rule-based and machine learning approaches have been investigated. Rowe & Laitinen (1995) describe a semi-automatic technique to expand abbreviations, using a rule-based approach and a large dictionary, which is found to improve the readability of technical text, programs and technical captions. In the medical domain, where many abbreviations and acronyms can stand for different things, depending on the concerned medical speciality and the context, Pakhomov (2002) considers abbreviation normalisation as a special case of word sense disambiguation, and resolves this by means of a Maximum Entropy classifier.

Besides abbreviations, normalisation also treats unusual lexical elements, such as the informal language used in social media. To normalise English tweets, Clark & Araki (2012) use a manually compiled and verified database of “casual English” words mapped to their normalised English equivalent to replace tokens. Additionally, phrase matching rules are employed to identify misused existing words (e.g. right/rite). This pre-processing is found to improve machine translation by reducing the number of untranslated words. Relatedly, Sidarenka et al. (2013) use a rule-based method to pre-process German tweets, replacing Twitter-specific phenomena with artificial tokens, using regular expressions to check left and right context. This approach is found to improve the accuracy of POS tagging. Another approach to normalise Twitter data uses a confusion set of in-vocabulary normalisation candidates generated for each OOV word (Han & Baldwin, 2011; Han et al., 2013). Candidates are then selected based on multiple factors, such as lexical edit distance, morphophonemic similarity, prefix and suffix substrings, and the longest common subsequence. An extrinsic evaluation of this approach shows improvement of POS tagging accuracy. Considering expressions instead of individual tokens, another approach uses paraphrases to normalise travel conversations (Shimohata & Sumita, 2002). Expressions are replaced by more frequent synonymous expressions extracted from a parallel corpus by grouping sentences that have the same translation. The effect of these replacements on NLP tasks is not measured. Normalisation approaches often also use noisy channel models, widely employed in spelling correction tools, to find the most probable valid word given a distorted or compressed word (Clark, 2003; Choudhury et al., 2007; Cook & Stevenson, 2009; Beaufort et al., 2010). In these studies, the normalisation process is evaluated intrinsically, and no evaluation of its impact on other NLP tasks is performed.

Another approach for text normalisation is inspired by machine translation, where social media language is considered as a foreign language. The normalisation task is thus approached as a translation task. Aw et al. (2006) describe a phrase-based statistical machine translation approach to the task of transforming SMS language into normal English. Kaufmann & Kalita (2010) use a two step approach for syntactic normalisation of Twitter messages, combining a pre-processor with a Moses machine translation system to convert phrases, in order to remove noise and thus improve readability and facilitate processing with NLP tools.

Considering that the language used in social media is similar to spoken language, Kobus et al. (2008) combine a machine translation approach with a system inspired by automatic speech recognition to normalise French SMS. An SMS stream is converted into a phone lattice which is then decoded to text.

While many normalisation approaches are close to spell checking, some actually include a spelling component or module. Banerjee et al. (2012) use a combination of approaches for the normalisation of technical English user forum data. Tokens such as URLs or paths are replaced by placeholders by means of regular expressions and spelling errors are treated with an off-the-shelf spellchecker adapted to the domain by including domain-specific terms. This pre-processing is found to improve SMT in terms of BLEU scores. Similarly, Khan et al. (2013) investigate the impact of text normalisation on parsing web data, also replacing URLs and emoticons with placeholders and correcting spelling. Liu et al. (2012) describe a broad-coverage normalisation system integrating three subnormalisers which each suggest candidates, the first performing letter transformations, the second selecting words on the basis of visual similarity, and the third being a spell checker.
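The placeholder technique mentioned above can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the implementation used in any of the cited studies: the two regular expressions, the placeholder format `__LABEL_n__` and the function names `mask` and `unmask` are all hypothetical, and the patterns are deliberately simplistic (for instance, trailing punctuation is absorbed into a match).

```python
import re

# Hypothetical, deliberately simple patterns for two kinds of tokens.
PATTERNS = [
    ("URL", re.compile(r"https?://[^\s]+")),
    ("PATH", re.compile(r"[A-Za-z]:\\[^\s]+")),  # Windows-style paths only
]


def mask(text):
    """Replace URLs and paths with placeholders.

    Returns the masked text and a mapping from each placeholder to the
    original token, so that the tokens can be restored later.
    """
    originals = {}
    for label, pattern in PATTERNS:
        def replace(match, label=label):
            key = f"__{label}_{len(originals)}__"
            originals[key] = match.group(0)
            return key
        text = pattern.sub(replace, text)
    return text, originals


def unmask(text, originals):
    """Restore the original tokens in place of their placeholders."""
    for key, value in originals.items():
        text = text.replace(key, value)
    return text


if __name__ == "__main__":
    source = r"Voir http://example.com/aide et C:\Windows\system32\drivers"
    masked, originals = mask(source)
    print(masked)
    print(unmask(masked, originals) == source)
```

Keeping the mapping makes the step reversible: the placeholders can pass through a translation system as opaque tokens and the original URLs or paths can be reinserted in the output, provided the system leaves the placeholders untouched.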

While most of the research on normalisation has not focussed specifically on the improvement of translatability (with some exceptions, e.g. Banerjee et al. (2012); Clark & Araki (2012)), and most approaches have been tested only intrinsically, we suspect that many of these approaches could benefit machine translation.

2.2.1.3 Summary

The sources of OOVs problematic for machine translation are numerous: misspelt words, domain specific terminology, proper nouns, abbreviations, colloquial language or non-standard letter and digit sequences, among others. While some of these can be taken care of by system specialisation, others require pre-processing to replace them with in-coverage tokens. Two technologies provide the means of completing this task: spell-checking and normalisation. Spell-checking has been the object of much research and many rule-based and data-driven methods now exist to identify errors and find replacement candidates. Spell-checking as such has no particular association with machine translation, and its impact on MT has hardly been studied. This is unsurprising; since most texts that have been considered for translation were published text in a traditional sense, which had undergone editing and revision, misspelled words were not an issue. Normalisation, which aims at replacing non-standard tokens to facilitate subsequent processing, has gained increasing interest with the emergence of social media data. Among other applications, normalisation has been found to improve machine translation.

In our particular context, we expect OOVs to be an important issue, for several reasons. First, considering the linguistic quality of forum data, misspelt words are bound to be frequent. Spell-checking will therefore be necessary. Second, the technical domain to which our data belong, the IT domain, is known for its complex jargon and use of anglicisms, numerous abbreviations, proper nouns, or unusual tokens such as the names of files or processes. Finally, as no large amounts of bilingual aligned data were available of the same nature as the forum content we propose to translate, the SMT system was trained on more traditional data, mainly Europarl (Koehn, 2005) and Symantec translation memories. While the translation memory data are from the same domain, they consist mostly of text from user manuals, which differs strongly from the forum data on the lexical level. For all these reasons, our pre-editing process will include spell-checking for non-word errors and normalisation-inspired rules to handle abbreviations, non-standard tokens and informal language. We will focus on tokens that are not covered by the system (i.e. that are absent from the training data), and attempt to replace these with in-coverage alternatives.

2.2.2 Grammar errors and unknown structures

Beyond the word-level difficulties, MT systems also encounter difficulties with ill-formed structures. This can again be caused by errors, such as incorrect syntax, which disrupt processing steps such as tagging or parsing. Correcting grammar and punctuation before machine translation is found to improve output (Bernth & Gdaniec, 2001; O’Brien, 2005). The issue of ungrammaticality is however considered more problematic for rule-based systems, which require a correct analysis, than for SMT systems which are more robust in the face of grammar/syntax inaccuracies (Carrera et al., 2009).

Correcting these syntax and grammar errors also lies within the scope of spell-checkers, yet these require more sophisticated technologies to identify what are commonly referred to as real-word errors, or context-sensitive errors, since analysis of the context is required for their detection. Errors can be of a syntactic nature (e.g. agreement errors) or at the semantic, discourse structure or pragmatic level (Kukich, 1992).

Two main approaches are distinguished for grammar checking: syntax-based approaches, and approaches combining tagging and patterns, often taking the form of regular expressions. Syntax-based approaches use some form of parsing to analyse structures for correctness. These approaches mostly rely on grammars to describe correct structures, and different strategies to allow parsing of ungrammatical content (Carbonell & Hayes, 1983; Thurmair, 1990). The other approaches, which are sometimes referred to as pattern-based approaches since they use pattern-matching to identify errors, perform POS-tagging or other shallow linguistic annotations and then apply rules describing sequences of tokens or tags. The sequences are either developed manually (e.g. Acrolinx (Bredenkamp et al., 2000) or LanguageTool (Naber, 2003)), or derived statistically by data-driven methods.
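The pattern-based principle (shallow annotation followed by rules over sequences of tokens or tags) can be illustrated with a toy Python sketch. It is emphatically not the Acrolinx or LanguageTool rule formalism: the miniature lexicon, the tag names and the single rule are hypothetical, and a real checker would rely on a proper tagger instead of a lookup table.

```python
# Hypothetical toy lexicon: token -> (POS tag, grammatical number).
LEXICON = {
    "le": ("DET", "sg"),
    "la": ("DET", "sg"),
    "les": ("DET", "pl"),
    "ordinateur": ("NOUN", "sg"),
    "ordinateurs": ("NOUN", "pl"),
    "fichier": ("NOUN", "sg"),
    "fichiers": ("NOUN", "pl"),
    "redémarre": ("VERB", "sg"),
}


def tag(tokens):
    """Annotate each token with a POS tag and number (UNK if unknown)."""
    return [(t, *LEXICON.get(t.lower(), ("UNK", None))) for t in tokens]


def check_det_noun_agreement(sentence):
    """Flag determiner + noun pairs whose grammatical number differs.

    The rule is a pattern over two adjacent tags (DET followed by NOUN)
    with a constraint on their number feature.
    """
    tagged = tag(sentence.split())
    errors = []
    for (w1, pos1, num1), (w2, pos2, num2) in zip(tagged, tagged[1:]):
        if pos1 == "DET" and pos2 == "NOUN" and num1 != num2:
            errors.append((w1, w2))
    return errors


if __name__ == "__main__":
    print(check_det_noun_agreement("les ordinateur redémarre"))
```

Because the rule only inspects a local window of tags, it needs no full parse of the sentence, which is what makes this family of approaches comparatively robust on ill-formed input; the price is that errors involving distant words are not covered by such a rule.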

2.2.2.1 Summary

Similarly to spell-checking, grammar-checking has not been explicitly associated with machine translation. For rule-based MT architectures, the usefulness of correcting grammar as a pre-editing step seems obvious, since incorrect syntax is bound to cause problems for analysis. The use for statistical machine translation is less obvious, since these systems do not perform any analysis as such. To our knowledge, the impact of grammar checking on SMT has not been the object of much investigation.

The forum data presents a wealth of grammar errors, including among others number and gender agreement, wrong verb forms and word confusions. In our particular case, translating from French to English, where a more inflected language is translated into a less inflected language, it is reasonable to suspect that a number of these issues, for example gender agreement within noun phrases, will not have a major impact on machine translation. Other errors however, such as the confusion of words of different categories, are more likely to have a significant impact on MT. Correcting grammar will therefore be part of the pre-editing process developed in this thesis.

2.2.3 Ambiguity and complexity

The highly ambiguous nature of natural language, which can be problematic for humans, is even more problematic for any kind of machine processing, including machine translation. Even the most ordinary words can be ambiguous, if they have more than one meaning or category and the context is insufficient for disambiguation. Additionally, besides lexical ambiguity, the way words are arranged into sentences can also be challenging if different parses are possible.

Ambiguity of natural language has long been a problem in the technical domain, where safety-critical instructions for example need to be totally unambiguous. This issue has led to the development of controlled languages (CLs), sub-languages defined by lexical, syntactic and stylistic constraints designed to reduce or eliminate ambiguity (Huijsen, 1998). Although CLs were originally developed to improve human readability of technical documentation, it was soon found that they also improve machine-translatability (Mitamura & Nyberg, 1995; Bernth, 1998; Nyberg et al., 2003; Roturier, 2004). CLs have therefore often been associated with machine translation, mostly with rule-based systems. However, with some exceptions (e.g. Aikawa et al., 2007, Temnikova, 2010 or Lehmann et al., 2012), pre-editing in the sense of controlled language for statistical machine translation remains a mostly unexplored field.

As described by Kuhn (2014), there is no universally accepted definition for con- trolled natural language. Approaches range from “technical languages that are designed to improve comprehensibility” to “languages that can be interpreted by computer”.

However, controlled languages have in common the fact that they are defined by sets of rules, which govern different linguistic phenomena. O’Brien (2003) classifies rules into three categories: lexical, syntactic and textual rules.

Lexical rules control word usage, i.e. phenomena such as homography, synonymy, pronoun usage or date formats, among others. Many ambiguities result from the use of homographs. To avoid these, Simplified English (AECMA, 1995) for example allows only one category for each word. The reduction of lexical ambiguity is one aspect of CLs that was found to have a positive impact on MT (Baker et al., 1994).
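As an illustration only, the "one category per word" constraint can be sketched as a lookup against an approved lexicon. The lexicon entries, category labels and function name below are hypothetical and are not taken from Simplified English or from any existing checker; the input is assumed to be already part-of-speech tagged.

```python
# Hypothetical approved lexicon: each word is allowed exactly one category.
# The entries and category labels are illustrative assumptions.
APPROVED_LEXICON = {
    "close": "verb",   # "close" used as an adjective would be rejected
    "the": "det",
    "valve": "noun",
}


def check_lexical_rule(tagged_tokens, lexicon=APPROVED_LEXICON):
    """Return a list of (word, used_category, approved_category) violations.

    tagged_tokens is a list of (word, category) pairs, assumed to come from
    some upstream part-of-speech tagger. Words missing from the lexicon are
    reported with an approved category of None.
    """
    violations = []
    for word, category in tagged_tokens:
        approved = lexicon.get(word.lower())
        if approved != category:
            violations.append((word, category, approved))
    return violations
```

A sentence such as "Close the valve", tagged verb, determiner, noun, would pass this check, whereas "close" tagged as an adjective would be reported.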

Syntactic rules define how words can be combined to form phrases and sentences. They mostly deal with structures that can cause ambiguity or reduce understandability, such as long noun clusters, ambiguous attachments or passives (Nyberg & Mitamura, 1996). They also restrict usage of expressions that can only be interpreted correctly in light of another expression in the context, such as ellipses or anaphora.

Textual rules control sentence length, text structure or capitalisation. The objective of these rules is mainly to improve clarity and readability by defining a coherent form. The most common rule in this category, which appears in most CLs, is to avoid long sentences (O’Brien, 2003). Splitting long sentences is also a very common approach to improve MT. Goh & Sumita (2011) argue that partitioning sentences into individual clauses prevents word reordering across clauses.
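A minimal sketch of how an "avoid long sentences" rule might be checked automatically is given below. The 20-word threshold, the punctuation-based sentence splitter and the function names are assumptions made for illustration; they do not correspond to any of the controlled languages cited above.

```python
import re

# Assumed threshold; actual controlled languages set their own limits.
MAX_WORDS = 20


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def flag_long_sentences(text, max_words=MAX_WORDS):
    """Return (word_count, sentence) pairs for sentences above the limit.

    Words are counted by splitting on whitespace, which is a simplification.
    """
    flagged = []
    for sentence in split_sentences(text):
        word_count = len(sentence.split())
        if word_count > max_words:
            flagged.append((word_count, sentence))
    return flagged
```

Such a check only flags candidate sentences. Deciding where to split them, for example at clause boundaries as proposed by Goh & Sumita (2011), would require syntactic information that this sketch does not use.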

Applying the principles of CL to improve MT, Pym (1988) designed ten simplification rules. Five of these tell the writer to “keep it short and simple”, by writing short sentences, avoiding redundancies and taking care with logical constructions, among others. The remaining five rules direct them to “make it explicit”, for example by avoiding ellipses, adhering to a dictionary which allows only one meaning per word and avoiding noun clusters. Application of these ten rules improves raw RBMT output and reduces post-editing time for the translation of workshop manuals. Another CL approach for machine translation is that described by Nyberg & Mitamura (1996), where KANT Controlled English was developed specifically to improve translations by the knowledge-based KANT translation system. This CL is found to greatly reduce the average number of parses per sentence. Not exactly a CL, but based on the same principles, Bernth & Gdaniec (2001) aim at improving “MTranslatability” by commercial RBMT systems. The central idea is again to reduce ambiguity, notably by controlling coordination, using explicit post-nominal modifiers introduced by relative pronouns (that, which, who etc.) and avoiding personal pronouns.


Also for RBMT systems, a study by de Preux (2005) suggests that although application of a CL does not decrease the number of translation errors, it decreases their severity. More recently, a study has used eye tracking measures to evaluate the readability and comprehensibility of MT output, both raw and controlled (Doherty, 2012).

A study for the English-German language pair, measuring both post-editing effort and MT output comprehensibility (O’Brien & Roturier, 2007), shows that not all rules have equal impact on MT. Some rules, such as replacing personal pronouns without antecedents, were found to have a high impact, while others, such as the use of parentheses, only had a limited impact. Also investigating the impact of CL rules on MT output quality and post-editing effort, but using a statistical MT system, Aikawa et al. (2007) translate from English to four different languages: Chinese, French, Dutch and Arabic. It was found that rules correcting informal style, spelling and capitalisation had the greatest crosslinguistic impact. Other rules had more language-specific impact.

The question of portability of rules, between domains, MT systems or language pairs is still worthy of research.

While controlled languages are traditionally associated with technical documentation, very similar approaches have been investigated for more general texts. The objective of these simplification methods is to reduce grammatical complexity while maintaining meaning and information content. A real world example of text simplification is Simple Wikipedia¹. Simplification is beneficial for human readers, and has been found useful, among others, for aphasic readers (Carroll et al., 1998), language learners (Petersen & Ostendorf, 2007) as well as poor literacy readers (Candido et al., 2009).

Simplification can also be of use for many natural language processing tasks, including machine translation (Chandrasekar et al., 1996). Approaches have been developed for summarisation (Siddharthan et al., 2004), to improve syntactic parser performance for text mining (Jonnalagadda et al., 2009), for semantic role labelling (Vickrey & Koller, 2008) or information retrieval (Beigman Klebanov et al., 2004). While most simplification research has focused on English, some research exists for other languages such as Brazilian Portuguese (Aluísio et al., 2008; Candido et al., 2009), Dutch (Daelemans et al., 2004) or French (Seretan, 2012).

Like controlled language rules, simplification rules act both at the lexical and syntactic level, aiming at producing simpler variants. Different methods are used to acquire

¹ http://simple.wikipedia.org/
