Translation automation and TM systems

4.4 Translation memory systems

4.4.5 TM systems functionalities

Before discussing different approaches to TM processes, it is important to briefly describe the internal workings of a basic TM system. TM systems are produced by different providers of language technology. Although there are differences between the systems, they basically work on the same principles: they store translation segments in a database and enable automatic retrieval from the database of segments matching a given input segment. They also include more or less the same additional facilities, e.g. fuzzy matching and integration with a terminology management system and an MT engine.

In order to understand how TM systems actually operate, it is necessary to be familiar with the way these tools organize and store information. This in turn requires an understanding of the following basic concepts: translation unit, segmentation, match search, retrieval and assembly, and alignment.

4.4.5.1 Translation units

In a TM system context, a translation unit is a source and target language sentence pair stored in the TM database. Translation units (TUs) processed by TM systems are often referred to as ‘segments’ or ‘fragments’ of text in the source and target language. Common to all TM systems available on the market today is that the translation units they work with are in effect sentences, running from one full stop to the next in the source and target language, or alternatively a headline or a bullet point text.¹⁶

Automatic segmentation below sentence level, for instance at word or phrase level, is not possible in present systems, because they work on the basis of orthographic principles. This means that the system needs a specific character in the text on which the division into segments can be based, viz. a full stop. The fact that TM systems are based on segmentation into full sentences is of major importance in relation to this study, and has given rise to some of the hypotheses set up in section ?? below.

4.4.5.2 Segmentation

TM tools process texts in small chunks known as segments. Once the conversion of file formats is complete, the system starts the segmentation of the ST. In essence, it splits the ST into units (‘segments’), which can be of any length (e.g. words, phrases, sentences, or paragraphs) depending on the segmentation strategy of each system. Usually, the punctuation marks (e.g. periods, exclamation marks and question marks), hard returns and text formatting indicators (e.g. paragraph marks, list tags) determine the boundaries of each segment, but the user is also allowed to customise the segmentation rules, for example, by adding stop lists (i.e. by specifying words or abbreviations, such as ‘Prof.’, which should not be considered as segment separators). However, accurate segmentation, or ‘sentence boundary disambiguation’ as it is often called (Mikheev, 2003: 212), is a challenging task, because each language has its own script, so that the principles that drive how a sentence is constructed and delimited differ from one language to another (Savourel 2001: 209).

¹⁶ The user may choose to apply other segmentation rules and divide segments, for instance, at each colon, semicolon or tabulator.

Segmentation is the first key TM process, as it impacts directly on the quantity and quality of matches retrieved by the system in the subsequent process of matching. For instance, if the segmentation is done at paragraph level, the system has little chance of finding a match for the source paragraph in the repository (a phenomenon called silence by NLP researchers). However, if it does find a paragraph match, then this typically does not require any editing and can be incorporated in the translation as it is (Esselink 2000: 363).

In the next case, where the system performs sentence-based segmentation (full sentences or clauses), there is a greater chance of it finding a match in a repository that contains sentences. Whether those matches are correct depends on the match retrieval mechanism of each system. Finally, where a system segments the text into smaller units (e.g. phrases, words), the chances of finding matches are higher still (so many, in fact, that the system produces the so-called phenomenon of noise), but naturally most of them are useless in the context of the source text (Simard 2003).
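The punctuation-based segmentation with a user-defined stop list described above can be illustrated with a minimal Python sketch. It is not the segmentation routine of any particular TM system: the function name `segment`, the contents of `ABBREVIATIONS` and the whitespace tokenisation are assumptions made for illustration only.

```python
# Hypothetical stop list: abbreviations whose final period must not end a segment.
ABBREVIATIONS = {"prof.", "dr.", "e.g.", "i.e."}


def segment(text, abbreviations=ABBREVIATIONS):
    """Split text into sentence-like segments.

    A segment ends at a token whose last character is '.', '!' or '?',
    unless the token is on the stop list, and at every hard return.
    """
    segments = []
    for line in text.splitlines():            # a hard return always closes a segment
        current = []
        for token in line.split():
            current.append(token)
            ends_sentence = token[-1] in ".!?"
            if ends_sentence and token.lower() not in abbreviations:
                segments.append(" ".join(current))
                current = []
        if current:                            # e.g. a headline without final punctuation
            segments.append(" ".join(current))
    return segments
```

Called on `"Prof. Smith arrived. He spoke!\nA headline"`, the sketch yields three segments, because the period in ‘Prof.’ is ignored and the hard return closes the unpunctuated headline. Removing ‘prof.’ from the stop list would split the first sentence in two, which shows how directly the rules determine what is later available for matching.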

For the last ten years, many methods have been proposed for the segmentation of texts by researchers in both TM and MT. In recognition of the fact that whole sentences are rarely repeated in a document to be translated (unless it is a manual), and based on the premise that long segments (e.g. full sentences) reduce the chances of the TM system finding matches (Gervais 2002; Hunt 2003), research has been geared in favour of sub-sentence segmentation (see García-Varea et al., 2005; Gotti et al., 2005; Simard 2003; Simard and Langlais, 2000; Macklovitch and Russel, 2000). A major distinction between the proposed methods involves the contrast of approaches based exclusively on the information contained in the text to be segmented (e.g. punctuation marks, spaces, formatting indicators) and those approaches that depend on complementary linguistic knowledge.

4.4.5.3 Data storage

Most TM systems segment the ST before storing it in the database. Each segment is accompanied by contextual information, such as the name of the client, the domain, the name of the translator and the date. However, some systems (such as MultiTrans and LogiTrans) follow a different approach, called the “full-text approach”. Instead of segmenting the texts at the beginning of translation, they store both the ST and its translation as full bitexts using the character-string-in-bitext (CSB) technique (Gow 2004). Once the bitexts are in the database, they are aligned at the paragraph level. This approach has two main advantages compared to the conventional method: a) the faster creation of a large TM database containing previously translated material and b) the retention of the co-text for any match found and suggested to the user (Gervais 2002).
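As an illustration of the conventional, segment-based storage model, the following Python sketch represents a translation unit together with the contextual information mentioned above. The class name `TranslationUnit`, its field names and the helper `filter_units` are hypothetical; real TM systems use their own schemas and exchange formats.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class TranslationUnit:
    """One source/target segment pair plus contextual information (assumed fields)."""
    source: str
    target: str
    client: str = ""
    domain: str = ""
    translator: str = ""
    created: date = field(default_factory=date.today)


def filter_units(units, **criteria):
    """Return the units whose attributes equal all the given criteria."""
    return [
        unit for unit in units
        if all(getattr(unit, name) == value for name, value in criteria.items())
    ]
```

Storing such attributes with each unit is what allows a repository to be restricted, for instance, to the material of one client or one domain before matching takes place.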

There are also systems that do not require the use of a database for the storage of the previously translated material. STAR Transit, for example, instead of creating a physical TM database, builds and makes use of a ‘virtual’ translation memory, which is essentially an index of selected reference material residing in any directory on one’s computer. The user chooses the reference material she wants to use (i.e. the translated documents that are relevant to her project), and the system, after extracting the text into XML files, creates the index with the associated files. This idea has many advantages over the classic database. Firstly, because the index is resident in RAM, the searches for matches are relatively fast. Secondly, the user has greater freedom in tailoring this repository according to the needs of each project. In addition, this model of a TM repository works especially well for projects with many updates containing a lot of small changes (Zerfass 2002).

However, the greatest advantage of this idea is perhaps the absence of issues related to database maintenance and to the security of data residing in a database (databases are, by nature, prone to crashing or getting corrupted).

4.4.5.4 Match search and retrieval

Returning to the typical TM workflow, after the system has segmented the ST, the translator chooses the segment he wants to start translating and the system automatically looks up this segment in the TM repository. When a match is found, the system retrieves it and suggests it to the translator as a possible translation.

Matching is thus the second key TM process, as it determines the ability of the system to leverage phrases and terms from already existing translations in an effective and efficient way. Overall, the retrieval mechanisms in TM systems build on the basic principles of information retrieval, i.e. the aim is to achieve as high a recall and as high a precision value as possible. TM systems available today seek to achieve this aim by calculating the formal, or more specifically, the orthographic similarity between a given input sentence and sentences stored in the database. Hence the match value is calculated on the basis of the number of identical characters in a specific order in input sentence vs. stored sentences (Reinke 1999: 105).

For a sentence to be retrieved from the TM system, it has to have a similarity of 30-100% compared with the input sentence. It is up to the translator to decide how similar she wants a sentence to be in order for the TM system to retrieve it. Generally, translators set the match value to 70%, since matches below that value are usually irrelevant.
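The role of the match threshold can be sketched in Python as follows. The matching algorithms of commercial TM systems are not documented here, so the sketch substitutes the character-based `ratio()` of the standard library's `difflib.SequenceMatcher` as an assumed stand-in for an orthographic similarity measure; the function names and the dictionary-based memory are likewise illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-based similarity in percent (difflib ratio, an assumed stand-in)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def lookup(source, memory, threshold=70.0):
    """Return (score, stored_source, stored_target) tuples at or above the threshold.

    `memory` maps stored source segments to their translations.
    Hits are sorted with the most similar stored segment first.
    """
    hits = []
    for stored_source, stored_target in memory.items():
        score = similarity(source, stored_source)
        if score >= threshold:
            hits.append((score, stored_source, stored_target))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)
```

With a memory containing ‘Press the red button.’ and ‘Close the door.’, the input ‘Press the green button.’ retrieves only the first entry at the default threshold of 70. Lowering the threshold towards 30 would let more distant segments through, which is the trade-off the translator controls.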

There are several kinds of matches: exact matches, fuzzy matches, full matches, sub-segment matches and term matches (Bowker, 2002a: 95-105). More recently, most TM systems have also become able to produce an additional kind of match: an MT match (O’Brien 2006: 187-8). A brief description of each match type is provided in Table 4.1.

| Type of match | Description |
|---|---|
| exact (or perfect) match | The segment in the TM is 100% identical to the source segment in terms of morphology, syntax and sometimes formatting (in other words, it is the same in spelling, punctuation, inflection and numbers). The process followed by the TM system to recognise exact matches is simple string pattern recognition. |
| full match | The segment in the TM differs from the source segment only in terms of variable elements, such as numbers, dates, times, currencies, measurements, and sometimes proper names. |
| fuzzy match | The segment in the TM resembles approximately or partially the source segment. The differences between the two segments are usually highlighted by the system. Fuzzy matches are presented according to a similarity degree given by the system. The user usually specifies a threshold for the acceptable similarity degrees. |
| sub-segment match | The source segment is identical with part of a segment in the TM. |
| term match | The source segment is identical with an entry in the term base (or the lexicon database). |
| MT match | When the source segment does not match with any segment in the TM, the system constructs a segment usually by combining sub-segment matches or by generating a new match by means of Machine Translation techniques. |

Table 4.1: Types of TM matches
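The distinction between exact, full and fuzzy matches in Table 4.1 can be made concrete with a short Python sketch. It is a simplification under stated assumptions: only numbers are treated as variable elements (dates, currencies and proper names are ignored), fuzzy similarity is again approximated with `difflib.SequenceMatcher`, and the names `classify` and `NUMBER` are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumed definition of a variable element: integers and simple decimals.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def classify(source, stored, threshold=70.0):
    """Label the relation between an input segment and a stored segment."""
    if source == stored:
        return "exact"
    # Identical once every number is replaced by the same placeholder.
    if NUMBER.sub("<num>", source) == NUMBER.sub("<num>", stored):
        return "full"
    score = 100.0 * SequenceMatcher(None, source, stored).ratio()
    return "fuzzy" if score >= threshold else "none"
```

For example, ‘Page 12 of 40’ against ‘Page 3 of 40’ is labelled a full match, since the two segments differ only in a number, whereas ‘Press the green button.’ against ‘Press the red button.’ is labelled a fuzzy match.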

Issues in match search and retrieval (due, for instance, to the inflectional and derivational morphology of words) have been widely discussed (see Bowker 2002a: 72-4; Somers 2003: 37-41) and several matching techniques have been proposed during the past few years (see Hodász and Pohl, 2005; Simard 2003). All matching techniques aim at rendering the TM system efficient not only in retrieving all available exact or fuzzy matches for a source segment (match recall), but also in retrieving the correct exact or fuzzy matches (match precision).
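Match recall and match precision, as used above, follow the standard information retrieval definitions, which the following Python sketch computes for one source segment. The function name and the representation of matches as sets of identifiers are assumptions for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one lookup.

    retrieved: matches the system actually returned
    relevant:  matches a human judge considers correct and available in the TM
    """
    retrieved, relevant = set(retrieved), set(relevant)
    correct = retrieved & relevant
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall
```

If a system returns four matches of which two are correct, while three correct matches exist in the repository, precision is 0.5 and recall is 2/3. In these terms, the silence discussed in section 4.4.5.2 corresponds to low recall and noise to low precision.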

The two match retrieval techniques employed by current commercial TM systems are: a) segment-based matching and b) character-string based (CSB) matching (Gow 2003: 22-44). Each of the two techniques is directly related to the way in which the text is stored in the database. More specifically, a system that segments the text into translation units before storing it in the database normally uses segment-based matching, whereas a system that does not segment the text, but stores it as full text instead, performs CSB matching.

Most TM systems (e.g. SDL Trados, Wordfast) implement the first technique, which means that they look for matches in the sequence of character strings of each segment in order to identify potential translation equivalents. The similarity between segments is operationalised via the distributed similarity of each word in the segment (Hunt 2003; Simard 2003). Some TM systems are also able to recognise matches not only at segment level but also in sub-parts of the segments that exist in the TM database (e.g. Déjà Vu X3). Alternatively, systems such as MultiTrans and Lingotek use a matching algorithm that calculates the similarity of equivalent continuous character strings in a text, rather than relying on segments and fuzzy matching (Garcia and Stevenson 2006: 24). Both these statistical techniques are based on character n-gramming and are language-independent.
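To show what a language-independent measure based on character n-grams may look like, the sketch below computes a Dice coefficient over character trigrams. This is one common textbook formulation, chosen here as an assumption; it is not claimed to be the measure implemented by any of the systems named above.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams of a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_similarity(a, b, n=3):
    """Dice coefficient over character n-grams, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:                      # both strings shorter than n characters
        return 0.0
    shared = sum((grams_a & grams_b).values())
    return 2.0 * shared / total
```

Because the measure only inspects character sequences, it needs no dictionaries or morphological resources, which is why such techniques can be applied to any language pair. The price is that inflected variants of the same word share only some of their n-grams and are therefore penalised.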

However, the ‘second generation’ of TM systems has recently opted to use matching techniques that are linguistically enhanced. These can be lexeme-based or grammatical pattern-based techniques (Planas 2005, Gotti et al., 2005). Based on the premise that superficial matching techniques alone may result in a potentially useful match being overlooked, these systems make use of morphological and grammatical information to improve the match coverage.

According to experiments conducted by Carl and Hansen (1999), when comparing matching results from a string-based TM, a lexeme-based TM and an EBMT system, the least generalising system (the string-based TM) achieved higher translation precision when near matches existed in the database. However, when the database did not contain any similar translation examples, the lexeme-based TM performed better in terms of translation coverage but lost precision (the EBMT system outperformed both in match recall but performed poorly in match precision). Nevertheless, these results were based on a fully automatic evaluation method; hence, they could not establish the superiority of a particular matching technique in real work conditions. An evaluation that includes humans is very important, as translators are able to assess not only whether matches are indeed the correct translations for a source segment, but also whether they are suitable for the context of the ST.

From a general perspective, the main disadvantage of TM systems that use linguistically enhanced matching techniques is that they are language dependent (since they rely on the built-in language resources). Consequently, such systems can work only for a small number of language combinations for which adequate language resources have been developed and incorporated into the system.

Recent developments in TM match enhancement point to a trend that already plays an important role in the localisation industry, namely MT matches. This type of match is fully generated by means of machine translation techniques. MT matches are not new, but it is only recently that they have been seriously integrated into the localisation workflow, owing to the pressing need to lower costs in the localisation cycle when internationalising new products (Guerberof 2009, 2012).

As pointed out by García (2009: 208), freelance translators at first considered MT matches more distracting than helpful. Some years ago, a number of TM tools started to offer MT plug-ins (Wordfast version 3, SDLX version 4, Trados version 5), but the concept did not find favour and was consequently dropped from subsequent versions. This scenario has changed a lot since then. SDL Trados 2007 first offered access to its in-house SDL Automated Translation feature in late 2008 (SDL 2008), and the feature remains in the latest SDL Trados Studio versions with easier access. Wordfast and MemoQ (in all their present versions) also offer the possibility to work with MT matches coming from different MT engines.

Returning to the typical TM workflow, after the system has been able to offer a match, the translator has then to decide whether to reuse the translation example, to adapt it or to ignore it and insert his own translation.

Once the segment is translated, the system automatically saves the pair of source and target segments in the repository. As the translator works his way through the source text, each successive segment is looked up automatically in the TM repository, and a translation is proposed (if one is available). By the end of the translation of the source text, the system has stored all pairs of source and target segments in the TM database for future reuse. The following illustration (Figure 4.3) shows how all the basic TM processes are linked together in a typical TM workflow.
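The interactive cycle of look-up, proposal, decision and storage described above can be summarised in a few lines of Python. The sketch is deliberately reduced: look-up is limited to exact matches, the memory is a plain dictionary, and the translator is modelled as a callback; `translate_document` and its parameters are hypothetical names.

```python
def translate_document(segments, memory, translator):
    """Run a simplified interactive TM session.

    segments:   source segments in document order
    memory:     dict mapping source segments to translations, updated in place
    translator: callable(source, proposal) returning the final translation;
                proposal is None when the memory holds no match
    """
    target = []
    for source in segments:
        proposal = memory.get(source)              # exact-match look-up only
        translation = translator(source, proposal)
        memory[source] = translation               # stored at once, reusable later
        target.append(translation)
    return target
```

Because each pair is stored as soon as the segment is confirmed, a sentence repeated later in the same document is already proposed from the memory on its second occurrence.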

4.4.5.5 Alignment

TM systems usually come with empty, or virtually empty, repositories. According to the ICL TM Survey 2006 (Lagoudaki 2006), the most frequent way in which translators build a repository is to create a database and have it filled as they translate (interactive method). The next most frequent way is to build up a repository with aligned original texts and their translations (retrospective method).

Alignment is a feature offered by most TM systems and it constitutes an additional key process within a TM system. In fact, it impacts directly on the effectiveness of the matching, because if a source segment is not accurately aligned with its translation equivalent, a wrong match is likely to be retrieved.

Figure 4.3: Basic processes in a simplified TM workflow (Quah 2006: 104)

Before the alignment process, the TM system normally segments both source and target texts and then tries to establish correspondences between their segments.¹⁷

Aligning texts means matching up ST and TT segments and linking them together to form TUs. The task involves certain challenges that inevitably compromise the accuracy of alignment. In addition to the issues of segmentation (mentioned in section 4.4.5.2) usually deriving from non-corresponding punctuation systems and a different notion of a ‘sentence’ (Somers 2003: 35), a single ST sentence can often be translated by multiple TL sentences or vice versa, or information can be omitted from or added to the TT (e.g. through explicitation). Moreover, sometimes ST and TT do not have the same structure, because sentences or whole paragraphs may be in a different order. As a consequence, it is not always easy to establish one-to-one correspondences between sentences. Thankfully, TM systems are interactive tools by nature, which means that automatic alignment can be corrected at any time by the user, either immediately after the automatic alignment process by joining or splitting misaligned segments, or during the actual translation by accessing the TM database and editing the appropriate pair of segments.
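The difficulty of one-to-many correspondences can be illustrated with a small length-based aligner in Python. It is a much simplified sketch in the spirit of length-based sentence alignment, not the algorithm of any TM system: it allows only 1:1, 1:2 and 2:1 groupings, uses the absolute difference in character length as cost, and the penalty of 5 for non-1:1 groupings is an arbitrary assumption. Omissions, additions and reordering are not handled.

```python
def align(src, tgt):
    """Align two lists of sentences with 1:1, 1:2 and 2:1 groupings.

    Returns a list of (source_sentences, target_sentences) pairs that
    minimises the summed length differences (dynamic programming).
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            # (source sentences consumed, target sentences consumed, penalty)
            for di, dj, penalty in ((1, 1, 0), (1, 2, 5), (2, 1, 5)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(len_s - len_t) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("no alignment possible with 1:1, 1:2 and 2:1 groupings")
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):                 # walk back from the end to the start
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]
```

When a long source sentence has been rendered as two shorter target sentences, the 1:2 grouping is cheaper than forcing one-to-one links, so the sketch keeps the two target sentences together in a single TU. Cases it cannot express raise an error, which corresponds to the point at which a real system would rely on manual correction by the user.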

Alignment can be performed on various levels, depending on the segmentation output. In other words, if a system segments the ST at sentence level only, it will then try to align sentences; if it segments the text into chunks (i.e. smaller units, like phrases), it will need to align the chunks. In the latter case, the task of alignment becomes even more challenging because the system needs to identify the similarity between the two chunks, and a mere string-length comparison is of little help. To this end, an appropriate morpho-syntactic parsing schema and/or detailed tagging is used to identify any similarities at grammatical and syntactic levels. Examples of TM systems that perform alignment at a sub-sentence level are SIMILIS (Planas 2005) and MetaMorpho (Hodász and Pohl 2005).

¹⁷ TM systems that follow the ‘full-text’ approach do not segment the text prior to alignment and storage. Instead, they store the texts in full and align them at paragraph level.

4.4.5.6 Working with a TM system

Since the TM systems provided by software developers are initially empty shells, the user of a TM system must store TUs in the empty database before any output can be retrieved. Prior to storing sentences in the database, the translator must define certain TM properties: source and target languages, name and description of the TM, etc. Once the framework for a new TM has been set up, the translator can begin storing TUs in the database.

The translator can add TUs to the TM in essentially two different ways.

Firstly, when the translator works interactively with the system, sentences are automatically stored one by one as they are translated. Once a segment is stored, it immediately becomes available for retrieval if it matches a subsequent input sentence. Secondly, the translator may fill existing translation material into the TM. Before existing resources can be reused in a TM system, each sentence in the SL must be aligned with the corresponding sentence in the
