

In the document An empirical study on the impact of (pages 177-182)

Data collection, preparation, coding and analysis

6.3 Corpus preparation and setup

6.3.1 The collection procedure

All the data in the TRACE corpus come from the TMs that the translators fed while translating the three STs under experimental conditions. Translations done under E2 (MS Word + SDL Trados 2007) and E3 (TagEditor + SDL Trados 2007) did not need extra editing before adding metadata to the TUs. However, translations done under E1 (MS Word with no translation memory) had to be aligned against their STs in order to build a TM from which to proceed with further processing. The tool used for alignment was SDL Trados WinAlign, a tool available in the SDL Trados suite to create TMs by recycling previously translated texts. After collecting all the TMs resulting from the experiment (*.tmw files), all the TMs had to be exported to text format (*.txt).

Once in plain text format, the corpus files were first edited at the file-name level prior to adding metadata to each TU in the TMs. Both the file names and the metadata information are used for indexing the files in the corpus query interface.

6.3.2 Conventions

During the basic editing phase the following conventions were adopted:

• TUs in the TMs were left unaltered (i.e. no spelling mistakes, formatting issues, etc., were corrected).

• The TMs resulting from subsequent alignment for translations done under E1 kept the same automatic text segmentation as translations directly done under E2 and E3.

• Encoding: texts were saved as plain text (UTF-8) so as to facilitate corpus annotation.

• For classifying each of the 90 translators in the study with regard to their productivity profile, the total amount of time spent on the task by the translators was divided into three slots, and groups of 30 translators were made for each of the three time slots.

• File naming:

Each TM exported in txt format bears reference to the parameters of the experiment from which the TM resulted. Table 6.2 shows the parameters that all the files building up the corpus encode in the file name:

Attribute coded in the file name | Admitted values | Description
<translator id> | id F / id N / id I | F = freelance; N = novice; I = in-house. The first ID stands for the tool/text combination used (a total of 18), while the second ID stands for the number of the translator performing the task for that combination.
<environment during translation> | E1 / E2 / E3 | Environment 1 / Environment 2 / Environment 3
<source text> | T1 / T2 / T3 | Source text 1 / Source text 2 / Source text 3
<task position> | P1 / P2 / P3 | First / second / third position
<text position> | X1 / X2 / X3 | First / second / third position
<previous environment used> | C0 / C1 / C2 / C3 | C0 = no previous environment; C1 / C2 / C3 = environment E1 / E2 / E3 used previously

Table 6.2: Corpus file annotation scheme

A file named 06F2 E2 T3 P1 X2 C0 W1.txt would contain all the TUs resulting from the second freelance translator translating tool/text order combination number 6. The text translated would be T3, under the second environment (E2), in first position among the experimental tasks, with the ST in second position in the text order combination. No previous environment would have been used prior to this task, and the translator would have been classified as fast in relation to the total amount of time spent by the sample of translators.
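The file-name scheme above can be illustrated with a short sketch. This is not part of the corpus tooling: the field labels are our own, and the helper assumes the seven codes are separated by spaces or underscores, as in the example names.

```python
import re

# Hypothetical helper: split a corpus file name such as
# "06F2_E2_T3_P1_X2_C0_W1.txt" into its seven coded parameters.
# The field labels below are our own, not the thesis's terminology.
FIELDS = ("translator", "environment", "source_text", "task_position",
          "text_position", "previous_environment", "time_group")

def parse_corpus_filename(name: str) -> dict:
    stem = name.rsplit(".", 1)[0]          # drop the file extension
    parts = re.split(r"[ _]+", stem)       # codes separated by space or "_"
    if len(parts) != len(FIELDS):
        raise ValueError(f"unexpected file name layout: {name!r}")
    return dict(zip(FIELDS, parts))

info = parse_corpus_filename("06F2_E2_T3_P1_X2_C0_W1.txt")
# info["translator"] is "06F2"; info["time_group"] is "W1"
```

A query interface could use such a mapping to index files by any of the seven parameters without parsing the names by hand each time.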

6.3.3 Metadata

A second level of annotation performed on the corpus files consisted in adding basic metadata about each translator and the translation tasks they performed. These details were manually encoded in the form of attributes of a <tag>value construct, inserted in the heading of each translation unit <TrU> in the translation memories that every translator in the study fed while translating under experimental conditions. Following the specification for SDL Trados 2007 exported translation memory files, the tag <Att L=XXX>value was used to add metadata attributes to the files of our corpus.
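As a small illustration of that tag layout, an attribute line can be produced as follows (a sketch; the helper name is ours, and the output format follows the exported memories shown in Figure 6.1):

```python
def att_line(name: str, value: str) -> str:
    """Build one metadata line in the SDL Trados 2007 text-export layout,
    e.g. att_line("Profile", "InHouse") gives "<Att L=Profile>InHouse"."""
    return f"<Att L={name}>{value}"

# Lines to be inserted into the heading of a <TrU> block:
tu_metadata = [att_line("Profile", "InHouse"),
               att_line("Environment", "E3"),
               att_line("Source", "T3")]
```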

A list of attributes and admitted values is provided in Tables 6.3 and 6.4 below:

Attribute | Admitted values | Description
<CrU> | open value | (Creation User) ID name of the translator using that translation memory.
<ChU> | TRACE (HUM2006-04349/FIL) | (Change User) Name of the funded research project behind this corpus development initiative.
<Att L=Profile> | InHouse / Freelance / Novice | (Attribute) Participant's profile: in-house translator, freelance translator or novice translator.
<Att L=Environment> | E1 / E2 / E3 | (Attribute) Translation editing environment used to translate. E1: Microsoft Word 2003 text processor; E2: MS Word 2003 text processor + SDL Trados 2007 Professional Translator's Workbench; E3: TagEditor text processor + SDL Trados 2007 Professional Translator's Workbench.
<Att L=Source> | T1 / T2 / T3 | (Attribute) Source text translated. T1: text 1; T2: text 2; T3: text 3.
<Att L=Target> | open value | (Attribute) Name of the file for the translated text. The corpus contains PDF and RTF files for every translation.
<Att L=EPosition> | P1 / P2 / P3 | (Attribute) Position of the environment in the translation tasks performed by the participant. P1: first position; P2: second position; P3: third position.
<Att L=TPosition> | X1 / X2 / X3 | (Attribute) Position of the text in the translation tasks performed by the participant. X1: first position; X2: second position; X3: third position.

Table 6.3: Corpus content annotation scheme (part 1)

Attribute | Admitted values | Description
<Att L=Context> | C0 / C1 / C2 / C3 | (Attribute) Number of the translation editing environment used previously to the one used for the translation being marked. C0 stands for no previous translation editing environment used (first translation task); C1: environment E1; C2: environment E2; C3: environment E3.
<Att L=Time> | W1 / W2 / W3 | (Attribute) Time group assigned to every translation task depending on how fast/slow every translator performed their tasks. W1: fast translators; W2: medium translators; W3: slow translators.
<Att L=Explicitation> | K1 / K2 / K3 / K4 | (Attribute) Any of the four categories of explicitation phenomena under analysis. K1: introduction of contextual references; K2: pronoun specification; K3: introduction of cohesive devices; K4: lexical specification.

Table 6.4: Corpus content annotation scheme (part 2)

Metadata can be used during the analysis in order to restrict searches to tailor-made corpora consisting of TUs which match a set of predefined attributes. The attribute for explicitation phenomena in the corpus (i.e. <Att L=Explicitation>) was used to build the different subcorpora for the analysis described in Chapter 7. The contents of these subcorpora were the TUs containing the four focus points under analysis (i.e. values K1: introduction of contextual references; K2: pronoun specification; K3: introduction of cohesive devices; K4: lexical specification).
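Such a selection can be sketched as follows. This is an illustration under our own assumptions, not the thesis's actual tooling: it assumes the plain-text export layout of Figure 6.1, with one <TrU> block per unit, and the function name is ours.

```python
from collections import defaultdict

def subcorpora_by_explicitation(tm_text: str) -> dict:
    """Group the <TrU> blocks of an exported memory by the value of
    their <Att L=Explicitation> attribute (e.g. "K1".."K4", or "0")."""
    groups = defaultdict(list)
    for body in tm_text.split("<TrU>")[1:]:      # one chunk per unit
        for line in body.splitlines():
            if line.startswith("<Att L=Explicitation>"):
                value = line[len("<Att L=Explicitation>"):].strip()
                groups[value].append("<TrU>" + body)
                break
    return dict(groups)
```

Each resulting group is itself a list of complete TUs, so a subcorpus (say, all K1 units) can be written back out as a smaller memory file and queried separately.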

Figure 6.1 below shows what the content of the TMs looks like at the TU level once this metadata layer of annotation has been added.

<TrU>
<ChU>TRACE (HUM2006-04349/FIL)
<CrD>20022009, 17:53:47
<CrU>05H
<Att L=Profile>InHouse
<Att L=Environment>E3
<Att L=Source>T3
<Att L=Target>05H E3 T3 P2 X2 C1 W2.pdf
<Att L=EPosition>P2
<Att L=TPosition>X2
<Att L=Context>C1
<Att L=Time>W2
<ChD>20022009, 18:25:29
<Att L=Explicitation>0
<Seg L=EN-US>When the Off key is pressed, all stock details are backed-up.
<Seg L=ES-EM>Además, cuando se pulsa la tecla Off (Apagado), el sistema efectúa una copia de seguridad de toda la información del stock.</TrU>

Figure 6.1: Metadata representation in a translation unit <TrU> of the corpus
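Reading such a block back into a structured record can be sketched as follows (our own helper, not part of the SDL Trados toolset; it assumes one tag per line, as in Figure 6.1, and ignores wrapped continuation lines):

```python
import re

# Matches "<CrU>05H", "<Att L=Profile>InHouse", "<Seg L=EN-US>...":
# the optional "Att L=" prefix is stripped so attribute names come out bare.
TAG = re.compile(r"<(?:Att L=)?([^>]+)>(.*)")

def parse_tu(block: str) -> dict:
    """Map each tagged line of a <TrU> block to a field, e.g.
    "<Att L=Profile>InHouse" becomes {"Profile": "InHouse"}."""
    fields = {}
    for line in block.strip().splitlines():
        m = TAG.match(line)
        if m and m.group(1) != "TrU":          # skip the opening <TrU> tag
            fields[m.group(1)] = m.group(2).removesuffix("</TrU>").strip()
    return fields
```

Combined with the metadata scheme in Tables 6.3 and 6.4, a dictionary like this is enough to filter TUs on any attribute (profile, environment, time group, and so on) during analysis.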
