Statistical Machine Translation and Controlled Language Impact and Interactions


PORRO RODRIGUEZ, Victoria

Abstract

This thesis explores the possible interactions between controlled language (CL) and statistical machine translation (SMT). Building on previous research done at Autodesk Development Sàrl, we conducted a separate short study to determine whether SMT output could be improved by editing the input texts according to CL rules (in particular, Acrocheck™ rules). Although the results are not conclusive due to the limited extent of the tests performed, this preliminary study has shown that the impact of CL on SMT systems (particularly Moses) is slightly positive and rarely negative. The prospects are very promising, and research in this direction may bear fruit. Future lines of research are suggested.

PORRO RODRIGUEZ, Victoria. Statistical Machine Translation and Controlled Language Impact and Interactions. Master : Univ. Genève, 2012

Available at:

http://archive-ouverte.unige.ch/unige:19363


Statistical Machine Translation and Controlled Language

Impact and Interactions

María Victoria Porro Rodríguez
Faculty of Translation and Interpretation
University of Geneva
January 2012

Master in Translation Technologies
Final research project

Thesis director: Pierrette Bouillon
Jury: Marianne Starlander


To my dearest people

¡Piu Avanti!

No te des por vencido, ni aun vencido,
No te sientas esclavo, ni aun esclavo,
Trémulo de pavor, piénsate bravo,
Y arremete feroz, ya mal herido.

Almafuerte


Acknowledgements

Better late than never, goes the old proverb. This thesis started far back in time, in the first summer of my master’s program, while I was doing an internship in Neuchâtel. Nevertheless, due to a variety of circumstances, its completion required several months of preparation and planning, of intermittent research and discussion, of stopping and resuming, of thinking and rethinking different possibilities, methods and layouts. Although I am now deeply proud of and very satisfied with the results, it turned out to be quite a demanding and all-consuming project.

I believe that this thesis was, to a great extent, the result of the combined effort of many people who inspired me and helped me reach my goals. First and foremost, I want to thank my supervisor, Professor Pierrette Bouillon, who shared her knowledge with me throughout my two years of master's studies and thesis preparation. Not only did she offer me her unconditional support, patience and enthusiasm in this area, but she also guided me through the difficult moments of despair and uncertainty, and above all pulled me out when I found myself walking on quicksand. Moreover, I would like to thank all the teaching staff at the School of Translation and Interpretation of the University of Geneva, for they have contributed in one way or another to the completion of this thesis.

In particular, let me thank Marianne Starlander, for having agreed so willingly and without hesitation to serve as jury at the thesis defence, and Maria Georgescul, for her help regarding the automatic evaluation of MT output.

The Autodesk Localization Services team at Neuchâtel, in particular Mirko Plitt, François Masselot and Petra Ribiczey, deserves very special thanks. They inspired me in research and life through our interactions during my internship over the summer of 2010. They were always ready to answer all my questions and unselfishly provided me with invaluable information and documents, which constitute to a great extent the basis of this thesis. I also want to thank Sabine Lehmann, co-founder and Chief Linguist at Acrolinx IQ, for her help and comments related to the use of Acrocheck™.

My deepest gratitude goes to my family and friends, in particular to my mom and my dad for their unflagging love and support throughout my life. They gave me the strength and the courage to chase my dreams and achieve my goals, always in the quest of happiness and self-fulfilment. My brother, although separated from me by a vast ocean, has always been as present as they were and has supported me just as much in spite of the distance. Thank you, bro. Among my friends, my dear sister Kasia was my main pillar. To her, all I need to say is contained in these words: “I know that our friendship, our whole lives through will last, and we will be cherished just as much, when today becomes the past”.

I am also indebted to my friends in Geneva, in particular to my fellow master's students at the University of Geneva, without whom the long hours of study and work in this city would have been as dull and grey as the weather. Together we became better people and better translators. Let me include here those friends who were not originally translators, but were almost converted to this amazing religion due to the long hours they spent surrounded by us. Thank you all! Finally, a very special person gave me the last bit of encouragement during the last summer of work; he deserves a huge thanks and all my love.

Last but not least, I want to express my most sincere gratitude to my flatmates: Martin Hemmi, Melissa Dumont and Komlan Sangbana. They were, and still are, my family in Geneva. As such, they gave me all their unconditional love and support. In short, they brightened up my days.


Abstract

This thesis explores the possible interactions between controlled language (CL) and statistical machine translation (SMT). Building on previous research done at Autodesk Development Sàrl, we conducted a separate short study to determine whether SMT output could be improved by editing the input texts according to CL rules. Although the results are not conclusive due to the limited extent of the tests performed, this preliminary study has shown that the impact of CL on SMT systems (specifically, on Moses) is imperceptible or slightly positive, and rarely negative. The prospects are very promising, and research in this direction may bear fruit.

This thesis is divided into 8 chapters. Chapter 1 offers the reader a short introduction to the current state of machine translation and explains the reasons that motivated this study, as well as the particularities and main objectives of the research performed in this framework. Chapter 2 looks in more detail into the history of machine translation and describes the main machine translation approaches and systems on the market and under study. Chapter 3 elaborates specifically on statistical machine translation, which is becoming an increasingly attractive alternative, and more particularly on the Moses engine. Chapter 4 is fully devoted to controlled language and to explaining the main functionalities of Acrocheck™ by Acrolinx IQ, the style checker and quality assurance tool used in the current study. Chapter 5 explains how the tests were performed and the methodology used. Chapter 6 describes the two main standard ways of evaluating the results of MT engines, namely human evaluation and automatic metrics (among them BLEU and WER). Chapter 7 contains the results of the tests (and of the MT evaluations) performed for the purpose of this thesis, which are mainly discussed in terms of the impact of CL and the interaction of CL and SMT. The main conclusions follow at the end of that chapter. In Chapter 8 I summarize the whole project, point out its difficulties and its future impact, and add my personal opinion. In Chapter 9 the reader will find a complete bibliography, and in Chapters 10 to 13 the annexes mentioned throughout the thesis.


Index

1. INTRODUCTION
1.1. PURPOSES AND SCOPE OF THE THESIS
2. MACHINE TRANSLATION
2.1. A BRIEF HISTORY ON MACHINE TRANSLATION
2.2. MACHINE TRANSLATION APPROACHES
3. STATISTICAL MT. THE MOSES ENGINE
3.1. THE MOSES ENGINE
3.2. DESCRIPTION OF A STANDARD SMT ENGINE
3.2.1. The translation model
3.2.1.1. Word-based and phrase-based systems
3.2.2. The language model
4. CONTROLLED LANGUAGES. ACROLINX IQ AND ACROCHECK.
4.1. ACROLINX IQ AND ACROCHECK.
5. TEST SETUP AND TEST EXECUTION
5.1. TEST SETUP: AUTODESK'S PRODUCTIVITY TEST 2009
5.2. TEST EXECUTION: TEST LAYOUT AND METHODOLOGY
6. MT EVALUATION: HUMANS AND MACHINES
6.1. AUTOMATIC EVALUATION
6.1.1. Precision, recall and f-measure
6.1.2. The WER score
6.1.3. The BLEU score
6.2. HUMAN EVALUATION
7. TEST RESULTS
7.1. HUMAN AND MACHINE EVALUATION
7.2. CL AND SMT: DISCUSSING INTERACTIONS, IMPACT AND IMPROVEMENT METHODS
7.2.1. In-depth analysis. Impact of CL rules.
7.3. CONCLUSIONS. SMT AND CL INTERACTIONS.
8. OTHER CONCLUSIONS
8.1. PERSONAL CONCLUSIONS
9. BIBLIOGRAPHY
10. ANNEX I. HUMAN EVALUATION GUIDE.
11. ANNEX II. SAMPLE INPUT TEXTS.
12. ANNEX III. TABLES OF PHRASES AND CL RULES APPLIED.
13. ANNEX IV. CL RULES PROPOSED BY ACROCHECK™.


Table 1. Fertility values [p(n|e)] for the words in example 1.
Table 2. Translation model calculation (on the left) for example 1.
Table 3. CL-violation set (subsets) (Set Nº1).
Table 4. CL-compliance set (Set Nº2).
Table 5. Precision and recall comparison between two SMT systems.
Table 6. F-measure scores for the systems in Table 5.
Table 7. Computations of the Levenshtein distance for two different systems.
Table 8. Example of n-gram matches in a BLEU calculation.
Table 9. Comparative example of modified and standard n-gram precision for a candidate translation and two possible reference sentences.
Table 10. Human evaluation results.
Table 11. BLEU scores.
Table 12. WER scores.
Table 13. Average from graph 5.
Table 14. Examples of phrases considered by all three evaluators as improvements, degradations or no changes.
Table 15. Improvements. Rules and frequency.
Table 16. Degradations. Rules and frequency.
Table 17. No changes. Rules and frequency.
Table 18. CL rules impact.

Figure 1. MT approaches.
Figure 2. MT model diversity.
Figure 3. MT model diversity.
Figure 4. MT model diversity.
Figure 5. Components of a standard SMT engine and typical translation flow for this kind of system.
Figure 6. Acrolinx IQ Dashboard.
Figure 7. Acrocheck™ commands.
Figure 8. Errors detected.
Figure 9. Word plug-in.
Figure 10. Acrocheck™ report.
Figure 11. Project phases. Workflow.

Graph 1. CL-violation set (type of error) (Set Nº1).
Graph 2. CL-violation set (subsets) (Set Nº1).
Graph 3. CL-compliance set (subsets) (Set Nº2).
Graph 4. CL-compliance set (type of error) (Set Nº2).
Graph 5. Overall scores comparison. CL-violation vs. CL-compliance.
Graph 6. Average improvements, degradations and no changes.


1. Introduction

Some decades ago no one would have imagined that the machine translation (MT) industry could evolve so fast and get so far. While the 20th century saw significant but unstable success in MT development and natural language processing (NLP), the 21st century came along with completely renewed hopes for these areas.

Two facts mainly explain this optimism: first, the availability of very large online corpora and, secondly, the increasing computing power and data storage capacity of new technologies (Arnold 1994, 192; Jurafsky and Martin 2000, xxi; Koehn 2010, 17-18; Banko and Brill 2001). As a result, areas such as Artificial Intelligence, Computational Linguistics, and Speech and Natural Language Processing (S&NLP) are now in the spotlight, and a myriad of possible MT applications, architectures and hybrid systems are being proposed and studied in depth.

Research in any field is generally driven by market needs and demands. In the case of the translation industry, and particularly in the localization world, the current trend is towards cost and time reduction, which explains the recent shift in interest towards corpus-driven and empirical approaches to machine translation (Arnold 1994, 198). Among these new MT directions, probabilistic and statistical systems gathered steam around 2000 and earned quite a good reputation (Koehn 2010, 17).

Besides refining the internal parameters of their architecture (language and translation models, algorithms, n-gram models, lexical selection), researchers have devoted much attention to data, the main fuel of any probabilistic and statistical engine. The first and immediate efforts aimed at quantity, that is, at feeding the system with increasing amounts of data. Eric Brill's statement is revealing in this sense: "Don't think about algorithms, get more data. If you want to think, think about getting more data" (see Koehn PowerPoint). Nevertheless, this well-trodden path did not always yield good enough results. Recent studies (Mandal, Vergyri et al 2008; Finch and Sumita 2008), in keeping with the current wave towards quality assurance and controlled authoring tools, have shown that data selection and data quality are as important as, or more important than, data quantity for improving the performance of an MT engine.

Many translation managers, as well as MT researchers, are now focusing on other phases of the translation workflow (especially for software and related documentation), notably pre-editing and post-editing tasks, the overall aim being to improve the translation process and to delegate part of the tasks to machines, thus reducing human labour costs and time. Several companies are now using controlled language (CL) and language authoring applications during the pre-editing phases of a project in order to gain control over content creation (either for machine or human-oriented purposes). The issue of whether pre-editing texts according to CL rules can improve the quality of MT output has been investigated using different approaches (basically, hybrid and rule-based systems). Yet empirical studies specifically relating statistical machine translation (SMT) engines and CL are scarce.

Philipp Koehn raises the question in the following terms: “Should we model the training data or optimize on translation performance?” (Koehn 2009, 15). Koehn himself asserts, when it comes to controlled language, that “statistical machine translation systems also reach very high performance when they translate documentation for new products using very similar documentation of old products as training material” (Koehn 2010, 21). The hypothesis underlying this statement is that CL might have good results on SMT engines at least as long as training data and input texts are equally controlled.

The current study will focus on the first direction cited by Koehn, that of data modelling, and specifically on data modelling through CL means. Controlling the training data, that is, modelling the whole corpus an SMT system will be fed with, is not a realistic task (controlling data is more of a permanent improvement cycle than a one-time activity) and greatly exceeds the possibilities and the framework of this study. For this reason, we will only apply CL rules to new input texts and evaluate their impact on the translation output of a chosen SMT engine. This, by contrast, seems to be a feasible and attainable goal, and will yield preliminary results regarding interactions between SMT and CL.


1.1. Purposes and scope of the thesis

After completing an internship at Autodesk Development Sàrl and attending several courses on machine translation at the University of Geneva, I asked myself how SMT systems would interact with CL, whether CL could help improve the output of these systems and, if so, 1) whether it would be enough to control the input text or 2) whether controlling the training corpus was equally important and necessary. I then arrived at the following working hypothesis: pre-editing input texts with a view to translating them with an SMT engine (at least in a sublanguage domain) might improve translation output in terms of quality. Several studies and workshops¹ reporting positive interactions between CL and rule-based MT supported the idea of formulating a hypothesis relating CL input and SMT engines.

The main aim of the test performed within the framework of this thesis is to determine whether applying CL rules to input texts can improve the output of an SMT system regardless of how controlled the training corpus data is. We will also address other related issues with a view to determining whether CL and SMT interaction can yield good results and whether investigations in this direction are likely to bear fruit.

The study builds on previous research done at Autodesk Development Sàrl (Neuchâtel) and is designed for a particular customer scenario. Note that the study was performed under the following conditions:

(1) a specific MT system (the Moses² SMT package) and a specific language and style checker (Acrocheck™ by Acrolinx IQ³),

(2) a single language combination and direction (English>Spanish),

(3) a limited domain (architecture, design), and

(4) a special type of text (software documentation, online help, instructions).

Under no circumstances will the decision to use an SMT engine be called into question during the test. The following sections will develop all these aspects in detail.

1 The fact that CLAW (Controlled Language Application Workshops) meetings have been held since 1996 reveals the importance of this subject for researchers since then.

2 Moses is available at http://www.statmt.org/moses/. The system is described in detail in Section 5.1.

3 Acrocheck™ is described in Section 4.1.


Among others, this thesis aims at answering the following questions:

a) What types of errors occur when using an SMT engine? What types of edits might be needed?

b) If the Acrocheck CL checker is used, does the output improve or degrade, or is there no change at all?

c) Is there any relation between improvements, degradations and no-changes, and the types of errors made by the engine or the edits it requires? Are there any areas of convergence? This might lead us to believe that when using an SMT system certain difficulties could be overcome by pre-editing (CL rules), while others might only be improved by post-editing (human intervention). Yet others might need to be dealt with differently or by designing other CL rules, different from the ones proposed by Acrocheck.

d) How can we improve the system for Autodesk's purposes (a limited-domain application)?

It is worth mentioning that, given the specificity of the framework of the tests performed herein, the results might be perceived as having limited impact in the rapidly expanding research field of SMT. The tests are meant to lay the basis and propose a possible methodology for a more ambitious and detailed study relating CL and SMT. The ultimate intention is to give more insight into this scarcely explored area and encourage research in this direction.

In the remainder of this thesis we will look in more detail into the history of machine translation and describe the main machine translation approaches and systems on the market and under study (Chapter 2). We will elaborate specifically on statistical machine translation, which is becoming an increasingly attractive alternative, and more particularly on the Moses engine (Chapter 3). A whole chapter is devoted to controlled language and to Acrocheck™ by Acrolinx IQ (Chapter 4). We explain how the tests were performed and the methodology used, as well as the two main standard ways of evaluating the results of MT engines (Chapters 5 and 6). The results of the tests and the MT evaluations performed and the main conclusions of the project are presented at the end of the thesis (Chapters 7 to 9).


2. Machine Translation

"The history of machine translation is one of great hopes and disappointments" (Koehn 2010, 14)

In this chapter the reader will be introduced to machine translation. We will trace back the modern origins of machine translation and briefly review the landmarks of machine translation development and the obstacles the field had to surmount. Next, the reader will find an overview of the main MT approaches and systems available in the market or under study, mainly: direct, indirect and interlingual.

2.1. A brief history on machine translation

From its earliest days, MT has been bedevilled by far-fetched ideas and unreasonable expectations. However, achievements during the last decades of the 20th century have shown that the unexpected could be expected and that research was on the right track. Although the pessimistic prospects of the infamous ALPAC report (1966) led to a general loss of morale and cast a silence over the years following its publication, the report brought neither a real nor a virtual end to MT research and development: on the contrary, more than a discouraging barrier, it turned out to be an encouraging challenge. At that moment, the field had reached the top of the “roller coaster” (Arnold 1992, 15); it was time to go down and gather momentum, taking advantage of the technological thrust and the increasing availability of machine-readable (and increasingly online) texts.

The first concrete proposals that encouraged serious research can be traced back to the cryptographic works of Warren Weaver and Andrew D. Booth, and more specifically to the historic memorandum published by Weaver in July 1949 (Nirenburg, Somers, Wilks 2003, 13-17), in which he depicted the prospects of MT and suggested methods for automatic translation using computer techniques, most of which were statistics-related. The memorandum sparked a significant amount of interest, and a few years later the proposals and efforts of renowned researchers such as Bar-Hillel at the Massachusetts Institute of Technology and Leon Dostert at Georgetown University confirmed this trend (Hutchins 1994, 7).


The feasibility of MT was first demonstrated in 1954 with the MT project that Dostert had designed in collaboration with the IBM research group. The results had little scientific value at the time, but they were sufficient to encourage funding (at least in the USA) and to raise hopes in the field (Hutchins 1994, 8). In spite of the encouraging advances, computer facilities and technology were still frequently inadequate, and much effort had to be devoted to improving hardware and developing programs suitable for natural language processing needs.

Before the Automatic Language Processing Advisory Committee (ALPAC) report was published (1966), most experiments had been conducted in the USA and the USSR, and involved the translation of texts from English into Russian and vice versa.

Besides the technological hype and interest in language automation, there was an undeniable social and political motivation: the Cold War. At that moment, research tended to polarize between advocates of empirical and theoretical approaches, whose contrasting methods were usually described at the time as “brute-force” and “perfectionist” respectively (Hutchins 1994, 8). In spite of this differentiation, most groups showed a similar mix of practical and theoretical bases.

The heavy funding required by research groups around the world was quite often conditional on the attainment of decisive and conclusive results. Disillusion grew from time to time as researchers encountered new complex linguistic problems that required new funding and extra time, and extra patience as well. By the mid-1960s most funding bodies harboured serious doubts regarding the advancement and usefulness of MT research, and in 1964 the United States National Academy of Sciences commissioned the infamous ALPAC report, which was meant to analyse the quality, costs and prospects of MT. The report, published in 1966, concluded that MT was “slower, less accurate and twice as expensive as human translation” and that there was “no immediate or predictable prospect of useful machine translation” (Hutchins 1994, 9; Arnold 1994, 14).

At the time the report was published, there were at least three systems in regular and extensive use: one at the Wright-Patterson USAF base, one at the Oak Ridge Laboratory of the United States Atomic Energy Commission and another one at the EURATOM Centre at Ispra, in Italy. Irrespective of the accuracy or validity of the ALPAC report, its conclusions provoked a general distrust of MT and forced most research groups in the USA to dissolve. Efforts then concentrated in a handful of groups in Canada (notably the TAUM group in Montreal, which developed the METEO system), the USSR (especially the groups led by Mel'cuk and Apresian) and Europe (the GETA group in Grenoble, France, and the SUSY group in Saarbrücken, Germany). On top of all that, funding was diverted to more fundamental research on Computational Linguistics and Artificial Intelligence (AI). It was not until the late 1970s that MT got back on its feet and was again considered an important research field (Bernard Vauquois' 1976 article was a milestone in this sense, see Vauquois 1976, 333).

During the 1980s there was considerable hype surrounding commercial and operational systems. While others explored the potential of advances in AI, researchers like Alan K. Melby devoted themselves to developing systems that would cooperate with human users to create high quality translations (Nirenburg, Somers, Wilks 2003, 340-343). His "Interactive Translation System" became a blueprint in this area and sparked a parallel field of investigation, that of computer-assisted translation (CAT) tools. Simultaneously, serious research in MT projects resumed, old investigations were rediscovered as “new frameworks” (Boitet 2003, 273) and systematic studies, both in linguistics and in computer science, “took precedence over all considerations of utilisation” (Vauquois 1976, 333).

Since the 1990s, technological developments and growing computing power have renewed hopes with regard to MT possibilities and have sparked the exploration of new techniques (neural networks, parallel processing, and particularly corpus- or knowledge-based systems) that promise rewarding results. The field started to widen and MT possibilities rocketed. Hybrid architectures spread, and the tendency was towards developing domain-specific and user-specific systems as well as controlling content creation (Hutchins 1994a).

After this brief review of the history of machine translation, in the next section we will describe some of the most common MT approaches and systems that the industry has proposed, developed or studied over time.


2.2. Machine translation approaches

In order to understand the MT analysis done in the current study and the output obtained, this section describes the major approaches and systems developed for MT purposes and their underlying paradigms.

It is generally acknowledged that there are three historical and basic approaches to MT (Koehn 2010, Hutchins 1994, Arnold 1994). Each of them mirrors a different understanding of how human translation is done, and reflects as well the research directions undertaken by most groups in MT since the 1950s: the direct, indirect and interlingual approaches (see Figure 1 below).

[Figure 1: Vauquois triangle. Analysis ascends on the source side and generation descends on the target side through the word, syntactic and semantic structure levels, which are linked by direct, syntactic and semantic transfer respectively, with the interlingua at the apex.]

Figure 1. MT approaches. Basic direct systems lie at the bottom while they become indirect towards the top, which is crowned by the interlingual approach.

The direct approach can be seen at the bottom of the Vauquois triangle. It is the first and oldest type (“first generation” systems), used even before the IBM demonstration of 1954. As the arrow suggests, the process is fairly elementary and unidirectional. It typically consists of a large bilingual dictionary and a basic program for directly mapping source and target words. The analysis is minimal and linguistically weak, and it is almost only meant to choose the correct word in the target language (TL) and place it in the same sequence as in the source language (SL), even though the result might not sound idiomatic or natural. Despite its limited capabilities, when used in conjunction with other strategies it can offer various advantages in specific domains, depending on the users' needs. The METEO system, for instance, relies on a semi-direct approach (see Noone 2003).
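The word-for-word mapping described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the five-entry English>Spanish dictionary and the function name `direct_translate` are hypothetical and do not come from any system discussed in this thesis.

```python
# Toy sketch of the direct approach: dictionary lookup, SL word order kept.
# The dictionary entries and the function name are hypothetical.
BILINGUAL_DICT = {
    "the": "la",
    "red": "roja",
    "house": "casa",
    "is": "es",
    "big": "grande",
}


def direct_translate(sentence: str) -> str:
    """Replace each SL word with its dictionary entry, keeping the SL order."""
    words = sentence.lower().split()
    # Words missing from the dictionary are passed through unchanged.
    return " ".join(BILINGUAL_DICT.get(word, word) for word in words)


print(direct_translate("The red house is big"))
# prints "la roja casa es grande"
```

Every word is translated correctly, yet the output keeps the English adjective-noun order, whereas an idiomatic Spanish rendering would be "la casa roja es grande". This is the kind of unnatural result that the minimal analysis of a direct system cannot avoid.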

The transfer approach, the most widely used, lies somewhere between the second and the third level depicted in Figure 1. It is more concerned with contrastive linguistics and with deeper source and target analysis. The process it follows can be clearly broken up into three phases: analysis, transfer and generation (see Figure 1). During the analysis phase, each SL sentence is parsed into a SL-oriented syntactic representation (sometimes tree-shaped). The intermediate or transfer phase deals with lexical and syntactic ambiguities and divergences through context-specific transfer rules. In this phase, SL and TL phrase structures are matched in a language-independent process, which generally enables these systems to translate from one SL into many TLs without having to modify the whole system. The internal linguistic representation might vary, as different underlying grammar formalisms can be used for modelling constituent structures (lexical functional grammar [LFG], context-free grammar [CFG], or head-driven phrase structure grammar [HPSG]; see Jurafsky 2000, chapter 11). This intermediate phase typically relies on a 'transfer' dictionary, which contains lexical and structural rules (for the transformation of syntactic structures and lexical matching) that map the SL representation into a representation suitable for the TL. Moreover, the transfer can be deep (semantic, which uses a representation of the meaning of each word), shallow (syntactic, where words are directly translated) or both. The generation phase produces an output text in the TL once the intermediate phase has finished mapping grammar rules and lexicon (and resolving ambiguities and differences).
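The three phases can be illustrated with a deliberately simplified Python sketch. It is a toy under several assumptions that are not part of the original description: the analysis produces a flat sequence of part-of-speech tags rather than a parse tree, the lexicon and transfer dictionary hold five hypothetical English>Spanish entries, and a single structural rule (adjective + noun becomes noun + adjective) stands in for a full set of transfer rules.

```python
# Toy sketch of analysis -> transfer -> generation (shallow, syntactic transfer).
# All lexicon entries, tags and function names are hypothetical.
SL_LEXICON = {"the": "DET", "red": "ADJ", "house": "NOUN", "is": "VERB", "big": "ADJ"}
TRANSFER_DICT = {"the": "la", "red": "roja", "house": "casa", "is": "es", "big": "grande"}


def analyse(sentence):
    """Analysis: tag each SL word (a flat stand-in for a real parse)."""
    return [(word, SL_LEXICON.get(word, "UNK")) for word in sentence.lower().split()]


def transfer(tagged):
    """Transfer: lexical mapping plus one structural rule, ADJ NOUN -> NOUN ADJ."""
    mapped = [(TRANSFER_DICT.get(word, word), tag) for word, tag in tagged]
    result = []
    i = 0
    while i < len(mapped):
        is_adj_noun = (
            i + 1 < len(mapped)
            and mapped[i][1] == "ADJ"
            and mapped[i + 1][1] == "NOUN"
        )
        if is_adj_noun:
            # Swap the pair so that the noun precedes the adjective in the TL.
            result.extend([mapped[i + 1], mapped[i]])
            i += 2
        else:
            result.append(mapped[i])
            i += 1
    return result


def generate(tl_representation):
    """Generation: linearise the TL representation into a string."""
    return " ".join(word for word, _ in tl_representation)


print(generate(transfer(analyse("The red house is big"))))
# prints "la casa roja es grande"
```

Because the structural rule operates on tags rather than on specific words, the same transfer step reorders any adjective-noun pair covered by the lexicon, which is what separates this design from a purely word-for-word mapping.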

The third basic design is the interlingual approach (IL), at the top of Figure 1. It relies on a more cognitive and universalistic theory of language and assumes that it is possible to convert SL texts into an abstract, language-independent representation, sufficient to generate sentences in any other language and to perform a back translation process as well (Dorr 1993; Noone 2003, 16). Seen in this way, a typical IL system goes through two phases: an analysis phase (using an analyzer) from SL to IL, and a synthesis phase (using a synthesizer) from IL to TL. Note that SL analysis is SL-specific and TL synthesis is TL-specific. The method is taken as a move towards simplicity and overall economy, in that translation between all pairs of n languages only requires (in principle) one analyzer and one synthesizer per language; that is, the number of components grows linearly, as 2n (n being the number of languages) (Noone 2003). By contrast, other architectures (such as the transfer approach above) require an additional transfer module for each language pair and direction. Naturally, an IL engine would be the most attractive approach for multilingual systems, but the more languages it encompasses, the more complex the system becomes. In spite of its advantages, the feasibility of interlingual approaches is still under study and they are not widely used (except for small domains).
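The difference in scaling can be made concrete with a small calculation. The sketch below is only an illustration under two assumptions of ours: an interlingual system needs one analyzer and one synthesizer per language (2n components), and a transfer system needs one transfer module per ordered language pair, n(n-1), in addition to its analysis and generation components. The function names are hypothetical.

```python
def interlingua_components(n: int) -> int:
    """One analyzer (SL -> IL) and one synthesizer (IL -> TL) per language."""
    return 2 * n


def transfer_modules(n: int) -> int:
    """One transfer module per ordered language pair (direction matters)."""
    return n * (n - 1)


# Compare both counts for a growing number of languages.
for n in (2, 3, 5, 10):
    print(n, interlingua_components(n), transfer_modules(n))
```

For two languages both counts are small, but the number of transfer modules grows quadratically, whereas the interlingual count grows linearly.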

In the last decade, research in MT materialized in the design and development of three major MT systems or paradigms: rule-based systems, knowledge-based systems (statistical MT among them) and hybrid systems.

Besides being second-generation (and “indirect”) MT systems (Vauquois 1976, 334-335), both interlingual and transfer approaches are now generically grouped as rule-based MT systems (RBMT) and used as a basis for developing this kind of system. In spite of the differences mentioned above, they are both based on the idea that success in MT depends on an abstract linguistic representation of texts, and both rely on intermediate rules that map the languages in question at various levels and encode the way linguistic differences should be overcome (Arnold 1994, 185).

These architectures (particularly purely transfer-based engines) are complex and extensive. They require deep linguistic knowledge to create transfer rules, constant manual crafting and maintenance and, very often, a large number of rules to cover all the features of the languages involved, mainly the contrasting ones (Arnold 1994, 192-193). Semantic selection does not generally pose major problems for limited domain applications (or sublanguages, Arnold 1994, 159), but when using RBMT for general-purpose texts, specific lexical items should be added to the dictionary and the lexical constraints should be updated to prevent incorrect semantic selection. Unless the system is properly maintained and updated, and the acquisition of specific linguistic knowledge is semi-automated, the process will most probably turn out to be very time-consuming.

On the positive side, an RBMT system usually delivers a syntactically correct output with very few grammatical mistakes (as long as the linguistic analysis and the representation succeed) (Arnold 1994, 193; Thurmair 2004, 6).


In spite of its flaws, in the last few years most work in MT has been done following a rule-based approach and studies have shown that the aspects mentioned above are not so serious as to effectively undermine the possibility of building robust and large scale rule-based systems (Arnold 1994, 193)⁴.

The predominantly rule-based framework was broken by the emergence of new methods and strategies that took advantage of the increasing availability of large amounts of texts and machine computing power (Arnold 1994, 198; Hutchins 1994a).

The newly proposed approaches rely on stored data or knowledge, in other words, on the examination of stored, real example texts, thus allowing an MT system to derive empirically from those sources the linguistic knowledge required. This can be done using diverse pattern matching techniques and with a varying degree of linguistic analysis. Recent MT research mostly falls into the category of so-called knowledge-based systems: statistical MT and example-based MT.

Example-based MT (EBMT) belongs to the tradition of analogical models arising in the mid-1980s and was first mentioned in the literature by Nagao (Nagao 1984). At its simplest, an EBMT engine translates input phrases by matching them against stored data, making nontrivial use of a large library of examples at runtime (Dekai Wu 2006, 4; Cavalli-Sforza, Brown et al 2004, 3). The essential ‘training’ data for the EBMT engine is a sentence-aligned or parallel corpus, which is preprocessed to build an index used to retrieve candidate translations; these are generally scored and posted in a chart. EBMT typically includes two optimization mechanisms (Cavalli-Sforza, Brown et al 2004, 3): token analysis and the use of tagged entries. In this sense, EBMT resembles the way translation memories work.

Similarly, statistical MT (SMT) systems rely on stored data, but analyze information by resorting to probabilistic and statistical techniques when matching input phrases against stored data. These inherently logical MT models assign probabilities to the retrieved candidate translations by making use of mathematical models of varying degrees of sophistication. As statistical systems are particularly important in the framework of this thesis, a detailed analysis of this MT architecture is offered in the next section.

4 Well-known examples of transfer RBMT are Eurotra, Japan and Susy, or DLT and Translator. For a detailed description of these systems, please refer to (Nirenburg, Somers et al 2003, 22).


Finally, hybrid systems encompass every effort to combine the advantages of the two approaches mentioned above. Although the idea originates in the complementary strengths and weaknesses that statistical and rule-based MT paradigms have shown in the course of time, there are countless possible MT combinations. For more information and examples regarding hybrid systems, please refer to recent articles in this respect (Wolf, Bernandi, Federmann et al 2011 or Xu, Uszkoreit et al 2011).

The figures below (2, 3 and 4) illustrate the myriad of possible MT systems quite well. In Wu's view (2005, 5-7), “[…] any MT model [sits] at some point within a three-dimensional space defined by axes corresponding to the degree of statistical, example-based, and compositional techniques employed”.

The example-based x-axis represents the degree to which abstraction is performed during testing, as opposed to during training. Models vary along the spectrum from schema-based models (abstraction of the training set) to example-based models (memorization of the training set). The compositional y-axis represents the degree to which rules are compositional, as opposed to lexical; the spectrum ranges from flat lexical models to fully recursive compositional ones. Finally, the statistical z-axis indicates the degree to which models make use of statistics: models vary from purely logical models to models that make increasing use of statistics and statistical inference.

Figure 2. MT model diversity. Note how statistical systems follow a midway course between example-based systems and lexical and compositional engines. (Dekai Wu 2005)


In this chapter we have explored the history of machine translation and seen which MT approaches and systems are most commonly used and studied in the field. We have seen that the possible combinations are manifold and that systems may become more complex with time. Among this diversity, SMT stands out as a particularly interesting alternative. In the next chapter we will look more closely at this alternative and explain how it works and why these systems are in the spotlight of academic MT research.

Figure 3. MT model diversity. The model becomes more complex here. Different transfer systems were developed in the late 70s. (Dekai Wu 2005)

Figure 4. MT model diversity. SMT systems gather steam. Different combinations and possibilities of statistical and example-based engines are explored during the 90s. (Dekai Wu 2005)


3. Statistical MT. The Moses engine.

As we have seen in the previous chapter, research on statistical MT took off in the late 80s, when the IBM research division published a landmark article entitled “A statistical approach to language translation” (Brown, Cocke et al 1988). Since then, a myriad of engines have been built, with varying degrees of sophistication and different optimization algorithms.

In this chapter we will explain what Moses is (the specific SMT engine used in the tests performed for the purpose of this thesis) and describe a standard SMT engine. We will explain how these systems work, how they translate unseen sentences and what their most prominent features are.

3.1. The Moses engine

By 2003, most work in the SMT field was carried out mainly for proprietary and in-house research systems (IBM's research is a good example of this) (Koehn, Hoang et al 2007, 177). To overcome this lack of openness, which had created great barriers for researchers, a group of scholars at the University of Edinburgh, led by Philipp Koehn, developed a complete, open-source SMT toolkit, freely available online, called Moses. It was meant to help the research community to proceed with investigations and contribute to the progress of the field. See http://www.statmt.org/moses/.

Moses provides the user (and especially the academic community) not only with a complete out-of-the-box translation system, but also with a whole set of features and modules needed to use, develop and investigate SMT. It consists of all the components needed to pre-process data and train the language and translation models, and from the outset it has contained tools for tuning these models using minimum error rate training and for evaluating the resulting translations using the BLEU score (Koehn, Hoang et al 2007, 178).

Moses incorporates other standard external tools, such as GIZA++ for word alignments and SRILM for language modeling, but its core component is the decoder, which was developed on the basis of the Pharaoh phrase-based decoder, also a creation of Philipp Koehn. A decoder is the core element, the engine of a machine translation system; it uses its knowledge and algorithmic functions to find the best sequence of transformations that can be applied to an initial target sentence to translate a given source sentence. Despite the complex architecture of this statistical engine (which certainly requires more of a computer science profile than a linguistic or translation background), the toolkit was developed with a very wide community of users in mind.

To make it easy for people of different backgrounds to use and test Moses and thereby to contribute to the project, the developers kept to the following principles when developing the decoder (Koehn, Hoang et al 2007, 178):

- accessibility

- easy to maintain

- flexibility

- easy for distributed team development

- portability

In the next sections we will explain only those features of the Moses engine that are useful for understanding its performance during the tests carried out for the purposes of this thesis. For further information about specialized aspects of SMT systems, please refer to Philipp Koehn's book Statistical Machine Translation (2010), which elaborates in detail on the aspects that we will not deal with in this work; with regard to Moses specifically, the reader can visit the website cited above.

3.2. Description of a standard SMT engine

Statistical systems rely on probabilistic and statistical models, and thus need to be trained on large amounts of bilingual corpora. These models rarely include exhaustive linguistic information, but rely instead on the distributional properties of words and phrases to establish the most likely translation of the input text. An SMT engine is essentially a system based on predictions that aims to rely on little, if any, linguistic knowledge. Features that can be easily measured by a statistical engine are, among others, the co-occurrence of two or more words in source and target texts, the relative position of words within sentences, and the length of sentences (Trujillo 1999, 210).

One of the most common approaches to statistical MT combines monolingual and bilingual knowledge, which involves building two separate modules on which the engine will rely to perform the translation task. Based on this knowledge, the engine creates two basic models⁵ (see Figure 5 below):

1- A statistical language model that contains monolingual information.

2- A statistical translation model that contains bilingual information.

The language model is the set of rules employed by SMT engines to measure how likely it is that a sequence of target words would be uttered in the target language. It can be based on unigrams, bigrams or trigrams (see section 3.2.2.), that is, it performs the probability calculation by grouping the words of the text in ones, twos or threes, respectively. To do this, it relies solely on the monolingual texts it has been fed with, the so-called “training data”. The aim of this module is to assess the linguistic correctness of the output text, choose the best possible word combination and optimize fluency in the target language.
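To give an idea of what such a calculation looks like, the following minimal sketch estimates a bigram language model from a toy monolingual corpus by simple relative frequency. The three training sentences and the function names are hypothetical, and the sketch deliberately omits the smoothing techniques that real toolkits apply.

```python
from collections import Counter

# Toy monolingual "training data" (hypothetical sentences, already tokenized).
corpus = [
    "the dog sleeps".split(),
    "the dog barks".split(),
    "the cat sleeps".split(),
]

history_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    history_counts.update(tokens[:-1])             # every word that has a successor
    bigram_counts.update(zip(tokens, tokens[1:]))  # every pair of adjacent words


def bigram_prob(prev: str, word: str) -> float:
    """Relative-frequency estimate p(word | prev) = count(prev, word) / count(prev)."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]


def sentence_prob(sentence: list[str]) -> float:
    """Probability of a whole sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob


print(sentence_prob("the dog sleeps".split()))  # fluent word order
print(sentence_prob("dog the sleeps".split()))  # unseen word order
```

The second sentence receives probability zero because the model has never seen a sentence starting with “dog”; this is the sparse-data problem, which is why practical language models redistribute some probability mass to unseen n-grams.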

The translation model is the core model, the set of rules employed to perform the calculations needed to link source to target and evaluate the correspondence. Compared with the language model, it performs much more sophisticated calculations, carrying out an n-gram analysis as well as different optimization processes. It relies on the “parallel corpus”⁶ (also considered “training data”) it was fed with and computes, among others, (1) the frequency of co-occurrence of source and target words (the probability of a given word being translated by another), (2) the length of the sentences in which they appear, (3) the position of words within their respective sentences, and (4) the fertility of the source words (Trujillo 1999, 211), that is, the number of target words generated by a given source word (Brown, Della Pietra et al 1993). This is only a simple example of the calculations a translation model might perform; models are getting more and more complex and might include more features than those mentioned above. Overall, the translation model serves to compare source and target texts in order to find the most adequate translation. To this end, there are several ways of training and decoding a corpus (e.g. phrase-based and tree-based models) and different algorithms and factors for optimizing and customizing the translation model (e.g. factored translation models).

5 A model is simply the set of rules employed by an MT system to transform the source sentence into a target sentence.

6 A parallel corpus is “a collection of texts, paired with translations into another language” (Koehn 53). Examples of parallel corpora are the Europarl corpus (the collection of proceedings of the European Parliament), the OPUS corpus, the LDC corpus and the Acquis Communautaire corpus of legal documents from the European Union, among others.


Some of these will be described in the next sections. See section 3.2.1. for more information regarding the translation model.

As we will see later, the fundamental challenge for both modules is handling scarce or limited data, which explains the need for growing amounts of data (corpora) to improve output. As Koehn explains, just because something has not been seen in a training text does not mean that it is incorrect or impossible. This is probably the main factor that limits output quality in any SMT system, and it is why we generally say that the larger and the more specialized a corpus is, the better.

Figure 5. Components of a standard SMT engine and typical translation flow for this kind of system. Multiple processes occur at the same time and the system is fed simultaneously on two fronts. Note the training data on the right and the input texts to be translated above.

3.2.1. The translation model

In this section we will first describe in more detail how the translation model performs its functions. The language model is described further below.

After pre-processing the input data and training the corpus adequately⁷, the two modules mentioned in the previous section work together to enable the statistical engine to perform the following three main steps (Trujillo 1999, 211; Brown, Della Pietra et al 1993, 256):

a- compute the probability of a target string being the translation of a source string (translation model);

b- compute the probability of a target string being a valid target sentence (language model);

c- apply search algorithms to find the target string which maximizes these probabilities (language + translation model).

7 Please note that the pre- and post-processing steps of the translation process pipeline will not be explained in this thesis, as we want to focus on the translation process itself. As an example, the pre-processing steps in an SMT system might involve segmenting, tokenizing and truecasing or lowercasing the input text and/or the training data.

In basic mathematical terms, an SMT engine computes p(t | s), that is, the probability of finding a string “t” in the target language given the string “s” in the source language. One way of applying this basic conditional probability to SMT and of combining the language model with the translation model is to use the Noisy-Channel Model (widely used in optical character recognition and speech recognition) and to apply Bayes' Rule or Theorem, a result of probability theory which links a conditional probability to its inverse (Koehn 2010, 69, 95; Knight 1999). By Bayes' Rule, p(t | s) = p(s | t) p(t) / p(s); since p(s) is the same for every candidate t, it can be left out of the maximization. As a result, most SMT algorithms are designed to search for the most likely translation among the available choices (the argmax over t) through this basic formula:

$$\hat{t} = \underset{t}{\operatorname{argmax}} \; p(t \mid s) = \underset{t}{\operatorname{argmax}} \; p(s \mid t)\, p(t)$$

KEY

$\hat{t}$ = best translation

t = target language

s = source language

p = probability

p(t) = language model - target

p(s) = language model - source

p(t | s) = translation model - probability of (t) given (s)

p(s | t) = translation model - probability of (s) given (t) = “noisy-channel model”

To illustrate the typical steps followed by an SMT engine, the calculations performed along the process and the notion of “correspondence” of words, we will have a look at Example 1 below⁸. The aim is to determine the probability of the French target phrase “Le chien est battu par Jean” being a translation of the original English “John does beat the dog”. As we mentioned before, various calculations are performed to build up the translation model (Nielsen 2009), among them:

• The translation probability, [p(f|e)], one for each French word (f) and English word (e). This is, for instance, the probability of “chien” being a translation of “dog” (see Example 1 below). It should not be confused with the case in which (f) and (e) are sentences.

• The fertility probability, [p(n|e)], the probability that the English source word (e) has fertility (n). For instance, the fertility value of “beat” is 2, as in French we obtain “est battu” (see Table 1). This is just one possible assignment for this pair of sentences; many more are tried in the process to maximize the equation.

• The distortion probability, [p(pf|pe,l)], which is the probability that an English word at position /pe/ corresponds to a French word at position /pf/ in a French sentence of length /l/. This is calculated at the end (see Example 1 and Table 2 below).

8 The example was taken from Brown, Cocke, Della Pietra et al 1990, 80-81.
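The noisy-channel decision rule given in the key above, that is, choosing the target string t that maximizes p(s | t) p(t), can be sketched in a few lines. All the candidate translations and scores below are invented for illustration; in a real engine they would come from the trained translation and language models, and the search would cover a vast number of candidates.

```python
# Hypothetical model scores for the French source phrase s = "bien sûr".
# translation_model[t] stands for p(s | t); language_model[t] stands for p(t).
translation_model = {
    "of course": 0.40,
    "course of": 0.40,
    "well sure": 0.50,
}
language_model = {
    "of course": 0.020,
    "course of": 0.001,
    "well sure": 0.002,
}


def decode(candidates):
    """Return the candidate t that maximizes p(s | t) * p(t)."""
    return max(candidates, key=lambda t: translation_model[t] * language_model[t])


best = decode(translation_model.keys())
print(best)
```

In this toy setting the translation model alone would prefer “well sure”, but the language model penalizes it as an unlikely English sequence, so the product favours “of course”. This is the division of labour between the two models described above.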

Le (4) chien (5) est (3) battu (3) par (0) Jean (1)

John (6) does (0) beat (3,4) the (1) dog (2)

Example 1. EN>FR alignment. The number after each French word gives the position of the corresponding English word, and vice versa (0 indicates that the word has no counterpart).

p(fertility = 1 | John) > p(Jean | John)
p(fertility = 0 | does) > ---
p(fertility = 2 | beat) > p(est | beat), p(battu | beat)
p(fertility = 1 | the) > p(Le | the)
p(fertility = 1 | dog) > p(chien | dog)

Table 1. Fertility values [p(n|e)] for the words in Example 1. Fertility is the number of target words (French) that a source word (English) produces in a given alignment.


p(fertility = 1 | John) x p(Jean | John) x                       >> p(6|1,6)
p(fertility = 0 | does) x                                        >> ---
p(fertility = 2 | beat) x p(est | beat) x p(battu | beat) x      >> p(3|3,6), p(4|3,6)
p(fertility = 1 | the) x p(Le | the) x                           >> p(1|4,6)
p(fertility = 1 | dog) x p(chien | dog) x                        >> p(2|5,6)
p(fertility = 1 | <null>) x p(par | <null>)                      >> ---

Table 2. Translation model calculation (on the left) for Example 1. To compute the probability of the EN-FR alignment, the fertility values are multiplied by the translation values (on the left). At the end, the distortion probability comes in. The distortion values [p(pf|pe,l)] are shown on the right.

The analysis consists in calculating the probability of, for example, the word “le” being a translation of “the”, which moves from position 4 in the source sentence to position 1 in the target phrase, which has a total of 6 words.
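As a rough illustration of how the three kinds of probability combine for the alignment of Example 1, the sketch below multiplies fertility, translation and distortion values for each English word. All the numerical values are invented (real values are estimated from a parallel corpus), the variable names are ours, and English positions are counted from 1 (John = 1, does = 2, beat = 3, the = 4, dog = 5).

```python
# Hypothetical parameter values, for illustration only.
fertility = {("John", 1): 0.9, ("does", 0): 0.8, ("beat", 2): 0.3,
             ("the", 1): 0.9, ("dog", 1): 0.9, ("<null>", 1): 0.2}
translation = {("Jean", "John"): 0.8, ("est", "beat"): 0.1, ("battu", "beat"): 0.4,
               ("Le", "the"): 0.5, ("chien", "dog"): 0.7, ("par", "<null>"): 0.1}
# distortion[(pf, pe, l)]: French position pf given English position pe and length l.
distortion = {(6, 1, 6): 0.1, (3, 3, 6): 0.3, (4, 3, 6): 0.2,
              (1, 4, 6): 0.2, (2, 5, 6): 0.2}

FRENCH_LENGTH = 6

# Each entry: English word, its position (None for <null>), and the
# (French word, French position) pairs it generates in this alignment.
alignment = [
    ("John", 1, [("Jean", 6)]),
    ("does", 2, []),
    ("beat", 3, [("est", 3), ("battu", 4)]),
    ("the", 4, [("Le", 1)]),
    ("dog", 5, [("chien", 2)]),
    ("<null>", None, [("par", 5)]),
]

prob = 1.0
for english, pe, generated in alignment:
    prob *= fertility[(english, len(generated))]      # fertility probability
    for french, pf in generated:
        prob *= translation[(french, english)]        # translation probability
        if pe is not None:                            # <null> has no source position
            prob *= distortion[(pf, pe, FRENCH_LENGTH)]

print(prob)
```

The resulting number is very small, as is typical for products of many probabilities; what matters to the engine is how it compares with the values obtained for competing alignments and candidate translations.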

Several other algorithms are used simultaneously to optimize search and analysis, but will not be covered in the framework of this thesis due to the complex analysis they would require. Among the latest techniques and features to enhance the SMT modules and to improve results, the reader might find:

- backward n-grams (see Xiong, Zhang et al 2011)

- mutual information triggers (see Xiong, Zhang et al 2011)

- lexical weighting reordering (see Costa-Jussà, Crego et al 2007; Koehn 2010, 139)

- multiple-stream language models (see Levenberg, Osborne et al 2011)

- paraphrases (Callison-Burch, Koehn et al 2006)

3.2.1.1. Word-based and phrase-based systems

Translation models used in SMT are generally either word-based or phrase-based. In broad terms, while the former are based on words as atomic units, the latter take phrases as the main unit of analysis. The original work on SMT was almost entirely done on word-based systems. Nevertheless, researchers soon realised that, to avoid the increasingly complex models needed to overcome the problem of learning from incomplete data and to obtain a fluent output (Koehn 2010, 6-7 and 88), it would be better to work at the phrase level (or complement word-based analysis with phrase-based alignments) and profit from the fact that parallel corpora provide the user with alignments of original sentences and their translations (not only with word alignments). Echoing the trends that predominated in Translation Studies, single words were not considered the best candidates for the smallest unit of translation (Koehn 2010, 127), especially due to the semantic and linguistic phenomena that characterise the translation process, such as polysemy, amplification or economy (Molina and Hurtado Albir 2002, 500-501). As a result, the phrase-based approach became the most successful one. Word-based models were mainly relegated to alignment and data mark-up tasks, but are still very much in use on their own in other multilingual applications (especially in terminology tools) (Koehn 2010, 6-7).

Because at Autodesk the Moses decoder was built on a phrase-based model, we will only explain the basic principles of this second model⁹. In this context we understand by phrase “any multiword unit” (Koehn 2010, 128). Note, nevertheless, that phrase-based models are not rooted in any deep linguistic notion of the concept “phrase” (Koehn 2010, 128). The engine does not group words or fragment sentences as a linguist would expect. It may compute the probability of “phrases” such as “fun with the”, while a linguist would probably group units following a syntagmatic analysis (noun phrases, prepositional phrases, verb phrases). These unusual groupings have an important impact on the result: they provide more context and more clues to help the system choose the correct translation for a specific SL phrase. Take for instance “play guitar”. If the engine takes “play” and “guitar” separately and assigns the most probable translation to each word, it will probably produce “jouer” and “guitare” (which gives “jouer guitare”) and miss the “de la” that French requires. By contrast, if it calculates probabilities by taking the phrase “play guitar” as a unit and this phrase exists in the phrase table, it will probably come up with the correct translation: “jouer de la guitare”¹⁰.

9 For more information on word-based systems, please refer to chapter 4 in (Koehn 2010). In this book, the reader will find a whole chapter devoted to phrase-based models (Chapter 5).

10 This example is mine, but it is based on the example given in (Koehn 2010, 128).

This kind of model first fragments the input sentences into the so-called phrases, then translates them into equivalent target language (TL) phrases and, finally, reorders the TL phrases. To do this, the system uses tables made up of phrases, not words. A phrase translation table of English (for the French “bien sûr”) may look as follows¹¹ (note that it considers punctuation as well):

Translation Probability p(e|f)

of course 0.5

naturally 0.3

of course, 0.15

, of course, 0.05
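A phrase table of this kind can be thought of as a mapping from a source phrase to its candidate translations and their probabilities. The following sketch stores the illustrative values of the table above in a Python dictionary and picks the most probable entry; the data structure and the function name are our own simplification and do not reflect the actual file format used by Moses.

```python
# Candidate English translations of the French phrase "bien sûr",
# with the illustrative probabilities p(e|f) from the table above.
phrase_table = {
    "bien sûr": {
        "of course": 0.5,
        "naturally": 0.3,
        "of course,": 0.15,
        ", of course,": 0.05,
    }
}


def best_translation(source_phrase: str) -> str:
    """Return the English phrase with the highest p(e|f) for a French phrase."""
    options = phrase_table[source_phrase]
    return max(options, key=options.get)


total = sum(phrase_table["bien sûr"].values())
print(best_translation("bien sûr"), round(total, 2))
```

Note that the probabilities of all the candidates for one source phrase add up to 1, and that in a complete engine the phrase table score is only one of several factors: the language model may still prefer a less probable entry if it fits the surrounding words better.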

All this said, besides being conceptually simpler (Koehn 2010, 128), the major advantages of a phrase-based system are the following:

1) polysemy issues (“one-to-many mappings”) may be overcome more easily;

2) translating word groups may help resolve ambiguities;

3) if the system is fed with a large corpus, it can learn longer phrases and even memorize entire sentences that get repeated.
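The “play guitar” case discussed earlier in this section shows how translating word groups helps. The sketch below contrasts a word-by-word lookup with a phrase lookup that prefers the longest match found in the phrase table. Both tables are hypothetical, and the greedy longest-match strategy is a simplification of ours: a real decoder such as Moses scores many competing segmentations instead of committing to one.

```python
# Hypothetical EN>FR tables; each entry keeps only the most probable translation.
word_table = {"play": "jouer", "guitar": "guitare"}
phrase_table = {"play guitar": "jouer de la guitare"}


def translate_word_by_word(sentence: str) -> str:
    """Translate each word in isolation, ignoring its neighbours."""
    return " ".join(word_table.get(w, w) for w in sentence.split())


def translate_with_phrases(sentence: str) -> str:
    """Greedy left-to-right lookup that prefers the longest known phrase."""
    words = sentence.split()
    output, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):       # try the longest span first
            chunk = " ".join(words[i:j])
            if chunk in phrase_table:
                output.append(phrase_table[chunk])
                i = j
                break
        else:                                    # no phrase matched: use one word
            output.append(word_table.get(words[i], words[i]))
            i += 1
    return " ".join(output)


print(translate_word_by_word("play guitar"))
print(translate_with_phrases("play guitar"))
```

The first call yields “jouer guitare”, whereas the second retrieves the stored phrase “jouer de la guitare”, because the multiword entry carries the context that the individual words lack.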

Besides being phrase-based or word-based, SMT engines (and specifically the Moses engine) can apply factored or non-factored models to perform the translation task. Non-factored models work on a surface level, with the form of words, and use only one phrase table (a kind of dictionary which contains only one possible translation for a given word). The engine analyzes a given phrase and its possible translation by using a phrase dictionary, which searches for matches between source and target words (taken from Koehn, Hoang et al 2007, 178) (see Example 2 below).

I am buying you a green cat

Je vous achète un chat vert

Example 2. Non-factored representation of target and source words for a given pair of sentences.

11 This example is mine, but it is based on the example given in (Koehn 2010, 128). Please note that the probability is not precise or real. It corresponds to the probability calculated for the combination “German (natuerlich)>English”, but it was reproduced here for the combination “French (bien sûr)>English” given the fact that the translation frequency of this word is similar for these languages.