Proceedings Chapter

RAYNER, Emmanuel, et al. Using Artificial Data to Compare the Difficulty of Using Statistical Machine Translation in Different Language-Pairs. In: Proceedings of Machine Translation Summit XII. 2009.

Available at: http://archive-ouverte.unige.ch/unige:5698
Using Artificial Data to Compare the Difficulty of Using Statistical Machine Translation in Different Language-Pairs
Manny Rayner, Paula Estrella, Pierrette Bouillon, Sonia Halimi
University of Geneva, TIM/ISSCO
40 bvd du Pont-d'Arve, CH-1211 Geneva 4, Switzerland
{Emmanuel.Rayner,Pierrette.Bouillon,Sonia.Halimi}@unige.ch
pestrella@gmail.com

Yukie Nakao
LINA, Nantes University, 2, rue de la Houssinière, BP 92208, 44322 Nantes Cedex 03
yukie.nakao@univ-nantes.fr
Abstract

Anecdotally, Statistical Machine Translation works much better in some language pairs than others, but methodological problems mean that it is difficult to draw hard conclusions. In particular, it is generally unclear whether translations in parallel training corpora have been produced using equivalent conventions. In this paper, we report on an experiment where a small-vocabulary multilingual interlingua-based translation system was used to generate data to train SMT models for the 12 pairs involving the languages {English, French, Japanese, Arabic}. By construction, the data can be assumed strongly uniform. As expected, translation between English and French in both directions performed much better than translation involving Japanese. Less obviously, translation from English and French to Arabic performed approximately as well as translation between English and French, and translation to Japanese performed better than translation from Japanese.
1 Introduction

Since its introduction in the early 90s, when it was regarded as a dubious outsider, Statistical Machine Translation (SMT) has rapidly gained ground until it is now considered the mainstream approach. There is, however, general agreement that some language-pairs work much better than others. In the positive direction, the early successes reported by the IBM group (Brown et al., 1990) used French-English, which is now known to be an unusually suitable pair. Anecdotally, translating between two European languages is easier than translating between a European and a non-European language, and some languages, like Japanese, are widely assumed to be difficult. Given the steadily increasing importance of MT technology, it is often important to be able to make a reasonable guess at how well SMT will work for a new language-pair. Both SMT and RBMT require a large investment of effort before any evaluable system emerges; when planning a project, both architectures are in principle possible, and it is desirable to be able to make an informed choice between them at an early stage.
A recent large-scale study (Birch et al., 2008), using the 110 language-pairs covered by the Europarl corpus, found that the features most predictive of SMT translation quality were target language vocabulary size, lexicostatistical relatedness (measured in terms of proportion of cognate words), and similarity in word order. The same study, however, also highlighted the methodological problems inherent in carrying out this type of comparison. As already noted, target vocabulary size turned out to be the most predictive feature. Vocabulary size, however, depends crucially on how morphology is taken into account. For example, (Birch et al., 2008) considered that Swedish had a much larger vocabulary size than English, but this is almost entirely due to the fact that Swedish, like German, writes compound nominals without intervening spaces. The structure of these nominals, however, is often very similar to that of the corresponding English phrases. The problem becomes more acute in a language like Japanese, which is normally written with no word boundaries at all. A further problem concerns parallel human-translated corpora, where it is generally difficult to know whether the data is truly uniform. Quality and style of translation can vary widely, with translators using different guidelines. In particular, some translators will prefer a more literal style than others. It is also common to mix in low-quality data; a frequent choice is translations taken from the reverse language-pair, with the source and target swapped around. Some recent studies (Ozdowska, 2009) have in fact suggested that this kind of low-quality adulteration can do more harm than good. Conversely, other practitioners of SMT have pointed to the performance gains that can be achieved by careful cleaning of the data.
Without controlling for all these factors, it is hard to know how general the results of comparative studies really are. Although (Birch et al., 2008) is an unusually responsible and careful piece of work, the authors point out that removal of the outlier language (Finnish) substantially changes the overall conclusions; it is probably not a coincidence that Finnish was also the only non-Indo-European language used in the study.
In this paper, we present the results of a novel type of comparative study carried out using MedSLT, a small-vocabulary interlingua-based multilingual speech translation system for a medical domain. We generated parallel corpora for all 12 pairs involving the source languages {English, French, Japanese, Arabic}, first using the source language grammars to generate arbitrary amounts of source-language data, then, for each target language, passing it through the relevant translation rules to generate target language expressions. Use of interlingua-based translation enforces a uniform translation style. The small domain, which we completely control, made it possible to enforce uniform decisions about how morphology is treated. For example, we decided in Arabic to split off the definite article al, normally affixed to the following noun, and treat it as a separate word. For similar reasons, we also treated Japanese tense and politeness affixes as separate words. Thus a word like okorimashita ("happened") is split up as okori mashita (happen PAST-POLITE). Once we had created the parallel corpora, we trained SMT models, and evaluated the quality of the translations they produced. As expected, translation between English and French in both directions performed much better than translation involving Japanese. We were, however, interested to discover that translation from English and French to Arabic performed as well as translation between English and French, and that translation to Japanese performed better than translation from Japanese.
The rest of the paper is organised as follows. Section 2 provides background on the MedSLT system. Section 3 describes the experimental framework, and Section 4 the results obtained. Section 5 concludes.
2 The MedSLT System

MedSLT (Bouillon et al., 2008) is a medium-vocabulary interlingua-based Open Source speech translation system for doctor-patient medical examination questions, which provides any-language-to-any-language translation capabilities for all languages in the set {English, French, Japanese, Arabic, Catalan}. Both speech recognition and translation are rule-based. Speech recognition runs on the Nuance 8.5 recognition platform, with grammar-based language models built using the Open Source Regulus compiler. As described in (Rayner et al., 2006), each domain-specific language model is extracted from a general resource grammar using corpus-based methods driven by a seed corpus of domain-specific examples. The seed corpus, which typically contains between 500 and 1500 utterances, is then used a second time to add probabilistic weights to the grammar rules; this substantially improves recognition performance (Rayner et al., 2006, §11.5). Performance measures for speech recognition in the three languages where serious evaluations have been carried out are shown in Table 1.
At run-time, the recogniser produces a source-language semantic representation. This is first translated by one set of rules into an interlingual form, and then by a second set into a target language representation. A target-language Regulus grammar, compiled into generation form, turns this into one or more possible surface strings, after which a set of generation preferences picks one out. Finally, the selected string is realised in spoken form. Robustness is provided by a second, statistical recogniser, which drives a robust embedded help system. The purpose of the help system (Chatzichrisafis et al., 2006) is to guide the user towards supported coverage; it performs approximate matching of output from the statistical recogniser against a library of sentences which have been marked as correctly processed during system development, and then presents the closest matches to the user.

Language   WER   SemER
English    6%    11%
French     8%    10%
Japanese   3%    4%

Table 1: Recognition performance for English, French and Japanese MedSLT recognisers. WER = Word Error Rate for source language recogniser, on in-coverage material; SemER = semantic error rate (proportion of utterances failing to produce correct interlingua) for source language recogniser, on in-coverage material.

Examples of typical English domain sentences and their translations into French, Arabic and Japanese are shown in Table 2.
3 Experimental framework

In the literature on language modelling, there is a known technique for bootstrapping a statistical language model (SLM) from a grammar-based language model (GLM). The grammar which forms the basis of the GLM is sampled randomly in order to create an arbitrarily large corpus of examples; these examples are then used as a training corpus to build the SLM (Jurafsky et al., 1995; Jonson, 2005). We adapt this process in a straightforward way to construct an SMT model for a given language pair, using the source language grammar, the source-to-interlingua translation rules, the interlingua-to-target-language rules, and the target language generation grammar. We start in the same way, using the source language grammar to build a randomly generated source language corpus; as shown in (Hockey et al., 2008), it is important to have a probabilistic grammar. We then use the composition of the other components to attempt to translate each source language sentence into a target language equivalent, discarding the examples for which no translation is produced. The result is an aligned bilingual corpus suitable for training an SMT model.
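The corpus-generation step can be pictured as random sampling from a weighted grammar. The following sketch shows the idea on a toy probabilistic CFG; the grammar, weights and vocabulary are invented for illustration and bear no relation to the actual MedSLT grammars.

```python
import random

# Minimal weighted-CFG sampler illustrating the bootstrapping step.
# Each nonterminal maps to a list of (expansion, weight) pairs; symbols
# not in the grammar are terminals. Grammar and weights are toy values.
GRAMMAR = {
    "S":   [(["do", "you", "have", "NP"], 0.6),
            (["is", "the", "pain", "ADJ"], 0.4)],
    "NP":  [(["headaches"], 0.5), (["nausea"], 0.5)],
    "ADJ": [(["severe"], 0.7), (["frequent"], 0.3)],
}

def sample(symbol="S", rng=random):
    """Recursively expand a nonterminal by weighted random choice."""
    if symbol not in GRAMMAR:          # terminal symbol: emit as-is
        return [symbol]
    rules, weights = zip(*GRAMMAR[symbol])
    rule = rng.choices(rules, weights=weights)[0]
    out = []
    for sym in rule:
        out.extend(sample(sym, rng))
    return out
```

Repeatedly calling sample() yields an arbitrarily large corpus whose sentence distribution follows the rule weights, which is why a probabilistic grammar matters here.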
We used this method to generate aligned corpora for 12 MedSLT language pairs with source and target languages taken from the set {English, French, Japanese, Arabic}. For each language pair, we first generated one million source-language utterances; we next filtered them to keep only examples which were full sentences, as opposed to elliptical phrases, and used the translation rules and target-language generators to attempt to translate each sentence. This created between 260K and 310K aligned sentence-pairs for each language-pair. In order to make coverage uniform for each source language, we kept only the pairs for which the source sentence had translations in all target languages. This makes it possible to compare fairly between language-pairs with the same source-language. In contrast, it appears to us that it is less straightforward to compare across language-pairs with different source-languages, since there is no obvious way to ascertain that the two source-language corpora are of comparable difficulty.
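The coverage-uniformity filter just described can be sketched as a simple set intersection. The data structures below are illustrative assumptions about how per-target translation tables might be held in memory, not the paper's actual implementation.

```python
# Sketch of the coverage-uniformity filter: for each source language,
# keep only source sentences that every target-language pipeline could
# translate. 'translations' maps each target language to a dict from
# source sentence to target sentence (illustrative layout).

def uniform_filter(translations):
    """Return {source: {target_lang: translation}} restricted to
    source sentences translated into *all* target languages."""
    per_target = [set(table) for table in translations.values()]
    common = set.intersection(*per_target)
    return {src: {lang: table[src] for lang, table in translations.items()}
            for src in sorted(common)}
```

Sentences for which any target pipeline failed simply never enter the intersection, mirroring the "discard if no translation is produced" step.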
The sizes of the final source language corpora for each of the four source languages are shown in Table 3. We randomly held out 2.5% of each of these sets as development data, and 2.5% as test data. Using Giza++, Moses and SRILM (Och and Ney, 2000; Koehn et al., 2007; Stolcke, 2002), we trained SMT models from increasingly large subsets of the training portion, using the development portion in the usual way to optimize parameter values. Finally, we used the resulting models to translate the test portion. We performed the tests with subcorpora of different sizes in order to satisfy ourselves that performance had topped out, and that generation of further training data would not improve performance. Full details are presented in (Rayner et al., 2009).
Language   #Sentences   #Words    Vocab.
Eng        236340       1441263   364
Fre        179758       1205308   557
Ara        233141       1509594   253
Jap        207717       1169106   336

Table 3: Statistics for final auto-generated source language corpora for the four source languages: number of sentences, number of words, and size of vocabulary.
English    Have you had the pain for more than a month?
French     Avez-vous mal depuis plus d'un mois ?
Arabic     Hal tahus bi al alam moundhou akthar min chahr wahid?
Japanese   Ikkagetsu ijou itami wa tsuzukimashita ka?

English    When do the headaches usually appear?
French     Quand avez-vous habituellement vos maux de tête ?
Arabic     Mata atahus bi al soudaa adatan?
Japanese   Daitai itsu atama wa itamimasu ka?

English    Is the pain associated with nausea?
French     Avez-vous des nausées quand vous avez la douleur ?
Arabic     Hal tourid an tataqaya indama tahus bi al alam?
Japanese   Itamu to hakike wa okorimasu ka?

English    Does bright light make the headache worse?
French     Vos maux de tête sont-ils aggravés par la lumière ?
Arabic     Hal yachtaddou al soudaa al dhaw?
Japanese   Akarui hikari wo miru to zutsu wa hidoku narimasu ka?

Table 2: Examples of English domain sentences, with system translations into French, Arabic and Japanese.
Our metrics measure the extent to which the derived versions of the SMT were able to approximate the original RBMT on data which was within the RBMT's coverage. The most straightforward way to do this is simply to count the sentences in the test set which receive different translations from the RBMT and the SMT. A variant is to define a non-standard version of the BLEU metric (Papineni et al., 2001), with the RBMT's translation taken as the reference. This means that perfect correspondence between the two translations would yield a non-standard BLEU score of 1.0.
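A minimal sketch of such a non-standard BLEU computation, with the RBMT output as the single reference per sentence. This is a simplified corpus-level version of the Papineni et al. (2001) metric (modified n-gram precisions up to 4-grams, geometric mean, brevity penalty), not the exact scorer used in the experiments.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with one reference per sentence (here, the
    RBMT translation). Identical candidate/reference corpora score 1.0."""
    match = [0] * max_n
    total = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        c, r = cand.split(), ref.split()
        cand_len += len(c)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            cg, rg = ngrams(c, n), ngrams(r, n)
            match[n - 1] += sum(min(cg[g], rg[g]) for g in cg)  # clipped counts
            total[n - 1] += max(len(c) - n + 1, 0)
    if 0 in match:                      # some precision is zero
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return bp * math.exp(log_prec)
```

With the SMT output as candidates and the RBMT output as references, perfect agreement gives exactly the 1.0 ceiling mentioned above.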
For all these metrics, it is important to bring in human judges at some point, using them to evaluate the cases where the SMT and RBMT differ. If, in these cases, it transpired that human judges typically thought that the SMT was as good as the RBMT, then the metrics would not be useful. We need to satisfy ourselves that human judges typically ascribe differences between SMT and RBMT to shortcomings in the SMT rather than in the RBMT.

Concretely, we collected all the different ⟨Source, SMT-translation, RBMT-translation⟩ triples produced during the course of the experiments, and extracted those triples where the two translations were different. We randomly selected triples for selected language pairs, and asked human judges to classify them as follows:

- RBMT better: The RBMT translation was better, in terms of preserving meaning and/or being grammatically correct;
- SMT better: The SMT translation was better, in terms of preserving meaning and/or being grammatically correct;
- Similar: Both translations were about equally good OR the source sentence was meaningless in the domain.

In order to show that our metrics are intuitively meaningful, it is sufficient to demonstrate that the frequency of occurrence of RBMT better is both large in comparison to that of SMT better, and accounts for a substantial proportion of the total population.
In the next section, we present the results of the various experiments we have just described.
4 Results

Tables 4, 5 and 6 present the main results, summarising the extent to which SMT and RBMT translations differ for the 12 language-pairs. Since the training and test data are independently sampled from the source grammar, and the domain is quite constrained, they overlap. This is natural, since sentences occurring in the training data will also occur in the test data; basic questions like "Where is the pain?" will be generated with high frequency by the probabilistic source language model, and will tend to occur in any substantial independently generated set, hence both in test and training. When counting divergent translations in RBMT and SMT output, we nonetheless present separately results for test data that does not overlap with training data (Table 4) and for test data that does overlap with training data (Table 5), on the grounds that the figures are, as usual, very different for the two kinds of material. These two tables thus summarise agreement between SMT and RBMT at the sentence level. Table 6 shows the non-standard BLEU scores, where the RBMT translations have been used as the reference; these give a picture of agreement between the two types of translation at the n-gram level.

Looking in particular at Table 4, we see that the figures fall into three distinct groups. For language-pairs involving only languages in the group {English, French, Arabic}, SMT and RBMT agree on about 70% to 80% of the sentences. For translation from English and French to Japanese, the two types of translation agree on about 27% of the sentences. For translation from Japanese into the other three languages, and for Arabic into Japanese, we only get agreement on about 13% to 16% of the sentences. These divisions appear to show clear qualitative differences.
Source    Target
          Eng     Fre     Ara     Jap
Eng       xxx     69.6    76.5    27.9
Fre       77.1    xxx     72.4    26.9
Ara       76.7    79.1    xxx     13.9
Jap       15.7    14.7    12.7    xxx

Table 4: Percentage of translations where SMT translation coincides with RBMT translation, over test sentences not occurring in training data.
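The sentence-level agreement figures behind Tables 4 and 5 reduce to exact-match counting, split by whether the test sentence also occurs in the training data. A sketch, with the input layout an illustrative assumption:

```python
# Sentence-level agreement metric: share of test sentences whose SMT
# translation is string-identical to the RBMT translation, reported
# separately for test sentences unseen / seen in the training data.

def agreement(test_items, training_sources):
    """test_items: iterable of (source, smt_translation, rbmt_translation).
    training_sources: set of source sentences present in training data.
    Returns (pct_agree_unseen, pct_agree_seen)."""
    stats = {False: [0, 0], True: [0, 0]}    # seen? -> [agree, total]
    for src, smt, rbmt in test_items:
        seen = src in training_sources
        stats[seen][1] += 1
        stats[seen][0] += (smt == rbmt)
    pct = lambda agree, total: 100.0 * agree / total if total else 0.0
    return pct(*stats[False]), pct(*stats[True])
```

Reporting the two populations separately matters because, as noted above, the figures differ sharply for seen and unseen material.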
Source    Target
          Eng     Fre     Ara     Jap
Eng       xxx     87.9    92.4    77.8
Fre       94.7    xxx     94.4    74.4
Ara       95.2    90.8    xxx     64.0
Jap       79.1    81.4    76.6    xxx

Table 5: Percentage of translations where SMT translation coincides with RBMT translation, over test sentences occurring in training data.

Source    Target
          Eng     Fre     Ara     Jap
Eng       xxx     0.91    0.92    0.79
Fre       0.93    xxx     0.92    0.76
Ara       0.97    0.98    xxx     0.74
Jap       0.80    0.83    0.85    xxx

Table 6: Non-standard BLEU scores (RBMT translations used as reference), all data.

As discussed in the previous section, simply counting differences between SMT and RBMT says nothing on its own; it is also necessary to establish what these differences mean in terms of human judgements. We performed evaluations of this kind for the language pairs where we found it easy to locate bilingual judges. First, Table 8 shows the categorisation, according to the criteria outlined at the end of Section 3, for 500 English→French pairs randomly selected from the set of examples where RBMT and SMT gave different results; we asked three judges to evaluate them independently, and combined their judgements by majority decision where appropriate. We observed a very heavy bias towards the RBMT, with unanimous agreement that the RBMT translation was better in 201/500 cases, and 2-1 agreement in a further 127. In contrast, there were only 4/500 cases where the judges unanimously thought that the SMT translation was preferable, with a further 12 supported by a majority decision. The rest of the table gives the cases where the RBMT and SMT translations were judged the same or cases in which the judges disagreed; there were only 41/500 cases where no majority decision was reached. Our overall conclusion is that we are justified in evaluating the SMT by using the BLEU scores with the RBMT as the reference. Of the cases where the two systems differ, only a tiny fraction, at most 16/500, indicate a better translation from the SMT, and well over half are translated better by the RBMT. Table 7 shows some examples of bad SMT translations in the English→French direction; the first two contain grammatical errors (a superfluous extra verb in the first, and agreement errors in the second). The third is a bad choice of tense and preposition; although grammatical, the target language sentence fails to preserve the meaning, and, rather than referring to a 20 day period ending now, instead refers to a 20 day period sometime in the past.
Table 10 shows a similar evaluation for the English→Japanese pair. Here, the difference between the SMT and RBMT versions was so pronounced that we felt justified in taking a smaller sample, of only 150 sentences. This time, 92/150 cases were unanimously judged as having a better RBMT translation, and there was not a single case where even a majority found that the SMT was better. Agreement was good here too, with only 8/150 cases not yielding at least a majority decision. Unsurprisingly, the main problem with this language-pair was inability to handle correctly the differences between English and Japanese word-order. Table 9 again shows some typical examples. The errors are much more serious than in French, and the SMT translations are only marginally comprehensible.
Result        Agreement    Count
RBMT better   all judges   201
RBMT better   majority     127
SMT better    all judges   4
SMT better    majority     12
Similar       all judges   34
Similar       majority     81
Unclear       disagree     41
Total                      500

Table 8: Comparison of RBMT and SMT performance on 500 randomly chosen English→French translation examples, evaluated independently by three judges.
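The "all judges" / "majority" / "disagree" rows in Tables 8 and 10 follow from a simple vote-combination rule over the three independent judgements. A sketch, with label strings chosen to match the tables:

```python
from collections import Counter

# Combine three independent judgements by majority decision:
# unanimous and 2-1 verdicts are accepted; three-way splits
# are recorded as "Unclear".

def combine(judgements):
    """judgements: tuple of per-judge labels, e.g.
    ("RBMT better", "RBMT better", "Similar").
    Returns (verdict, agreement_level)."""
    label, votes = Counter(judgements).most_common(1)[0]
    if votes >= 2:
        level = "all judges" if votes == len(judgements) else "majority"
        return label, level
    return "Unclear", "disagree"
```

Applying this to each evaluated triple and tallying the results yields exactly the row structure of Tables 8 and 10.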
Cursory examination of the remaining language-pairs strongly suggested that the same patterns obtained there as well, with very few cases where SMT was better than RBMT, and numerous cases in the opposite direction. Since other evaluations of the MedSLT system (e.g. (Rayner et al., 2005)) show that over 98% of in-coverage translations produced by the RBMT system are acceptable in terms of preserving meaning, differences between SMT and RBMT can plausibly be interpreted as reflecting errors produced by the SMT.
Result        Agreement    Count
RBMT better   all judges   92
RBMT better   majority     32
SMT better    all judges   0
SMT better    majority     0
Similar       all judges   2
Similar       majority     16
Unclear       disagree     8
Total                      150

Table 10: Comparison of RBMT and SMT performance on 150 randomly chosen English→Japanese translation examples, evaluated independently by three judges.
5 Summary and Conclusions

We have presented an experiment in which we generated uniform artificial data for 12 language pairs in a multilingual small-vocabulary interlingua-based translation system. Use of the interlingua enforced a uniform translation standard, so we feel justified in claiming that the results provide harder evidence than usual about the relative suitability of different language-pairs for SMT.

As expected, translation between English and French in both directions is much more reliable than translation in language pairs involving Japanese. To our surprise, we also found that translation between English or French and Arabic worked about as well as translation between English and French, despite the fact that Arabic typologically belongs to a different language family. Informal conversations with colleagues who have worked on Arabic suggest, however, that the result is not as unexpected as we first imagined.

Table 4 appears to suggest that translation from Japanese works substantially less well than translation to Japanese. The explanation is most probably the usual problem of zero-anaphora, which is very common in Japanese, with words that can clearly be inferred from context generally deleted. In the from-Japanese direction, it is necessary to generate a translation of a zero anaphor (most often the implicit subject), whereas in the to-Japanese direction it is only a question of deleting material. Although, as pointed out earlier in Section 3, we need to be careful when comparing between language-pairs with different source-languages, the gap in performance here is large enough that we can expect it to reflect a real trend.

English      are your headaches caused by changes of temperature
RBMT French  vos maux de tête sont-ils causés par des changements de température
             (your headaches are-they caused by changes of temperature)
SMT French   avez-vous vos maux de tête sont-ils causés par des changements de température
             (have-you your headaches are-they caused by changes of temperature)

English      are headaches relieved in the afternoon
RBMT French  vos maux de tête diminuent-ils l'après-midi
             (your headaches (MASC-PLUR) decrease-MASC-PLUR the afternoon)
SMT French   vos maux de tête diminue-t-elle l'après-midi
             (your headaches (MASC-PLUR) decrease-FEM-SING the afternoon)

English      have you had them for twenty days
RBMT French  avez-vous vos maux de tête depuis vingt jours
             (have-you your headaches since twenty days)
SMT French   avez-vous eu vos maux de tête pendant vingt jours
             (have-you had your headaches during twenty days)

Table 7: Examples of incorrect SMT translations from English into French.
Simple as the idea is, we hope that the methodology described in this paper will make it possible to evaluate the relative suitability of SMT for different language pairs in a more quantitative way than has so far been possible. In general, the construction used could equally well be implemented in the context of any other high-performance multilingual RBMT system. The idea of statistically relearning an RBMT system has recently begun to acquire some popularity (Seneff et al., 2006; Dugast et al., 2008), and it should be easy to check whether our results are generally reproducible.
References

A. Birch, M. Osborne, and P. Koehn. 2008. Predicting success in machine translation. In Proceedings of EMNLP 2008, Waikiki, Hawaii.

P. Bouillon, G. Flores, M. Georgescul, S. Halimi, B.A. Hockey, H. Isahara, K. Kanzaki, Y. Nakao, M. Rayner, M. Santaholma, M. Starlander, and N. Tsourakis. 2008. Many-to-many multilingual medical speech translation on a PDA. In Proceedings of The Eighth Conference of the Association for Machine Translation in the Americas, Waikiki, Hawaii.

P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85.

N. Chatzichrisafis, P. Bouillon, M. Rayner, M. Santaholma, M. Starlander, and B.A. Hockey. 2006. Evaluating task performance for a unidirectional controlled language medical speech translation system. In Proceedings of the HLT-NAACL International Workshop on Medical Speech Translation, pages 9-16, New York.

L. Dugast, J. Senellart, and P. Koehn. 2008. Can we relearn an RBMT system? In Proceedings of the Third Workshop on Statistical Machine Translation, pages 175-178, Columbus, Ohio.

B.A. Hockey, M. Rayner, and G. Christian. 2008. Training statistical language models from grammar-generated data: A comparative case-study. In Proceedings of the 6th International Conference on Natural Language Processing, Gothenburg, Sweden.

R. Jonson. 2005. Generating statistical language models from interpretation grammars in dialogue systems. In
RBMT Japanese  ban arimashita ka
               (evening be-POLITE-PAST Q)
SMT Japanese   arimashita ka ban
               (be-POLITE-PAST Q evening)

English        does noise typically give you your headaches
RBMT Japanese  souon ga ki ni naru to daitai atama wa itamimasu ka
               (noise SUBJ experience WHEN often head TOP hurt-POLITE Q)
SMT Japanese   souon ga ki ni naru to itamimasu ka daitai atama wa itamimasu ka
               (noise SUBJ experience WHEN hurt-POLITE Q often head TOP hurt-POLITE Q)

English        is the pain usually caused by sudden head movements
RBMT Japanese  daitai totsuzen atama wo ugokasu to itamimasu ka
               (usually quickly head OBJ move WHEN hurt-POLITE Q)
SMT Japanese   daitai itami wa totsuzen atama wo ugokasu to [missing: itamimasu ka]
               (usually pain SUBJ quickly head OBJ move WHEN [missing: hurt-POLITE Q])

Table 9: Examples of incorrect SMT translations from English into Japanese.
A. Jurafsky, C. Wooters, J. Segal, A. Stolcke, E. Fosler, G. Tajchman, and N. Morgan. 1995. Using a stochastic context-free grammar as a language model for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 189-192.

P. Koehn and C. Monz. 2005. Shared task: Statistical machine translation between European languages. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, Ann Arbor, MI.

P. Koehn and C. Monz. 2006. Manual and automatic evaluation of machine translation between European languages. In Proceedings of the Workshop on Statistical Machine Translation, New York City, NY.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, demonstration session.

F.J. Och and H. Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong.

S. Ozdowska. 2009. Données bilingues pour la TAS français-anglais : impact de la langue source et de la direction de traduction originales sur la qualité de la traduction. [Bilingual data for French-English SMT: impact of the original source language and translation direction on translation quality.] In Proceedings of TALN 2009, Senlis, France.

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. Research Report RC22176 (W0109-022), IBM Research Division.

M. Rayner, P. Bouillon, N. Chatzichrisafis, B.A. Hockey, M. Santaholma, M. Starlander, H. Isahara, K. Kanzaki, and Y. Nakao. 2005. A methodology for comparing grammar-based and robust approaches to speech understanding. In Proceedings of the 9th International Conference on Spoken Language Processing (ICSLP), pages 1103-1107, Lisboa, Portugal.

M. Rayner, B.A. Hockey, and P. Bouillon. 2006. Putting Linguistics into Speech Recognition: The Regulus Grammar Compiler. CSLI Press, Chicago.

M. Rayner, P. Estrella, P. Bouillon, B.A. Hockey, and Y. Nakao. 2009. Using artificially generated data to evaluate statistical machine translation. In Proceedings of the Third Workshop on Grammar Engineering Across Frameworks, Singapore.

S. Seneff, C. Wang, and J. Lee. 2006. Combining linguistic and statistical methods for bi-directional English Chinese translation in the flight domain. In Proceedings of AMTA 2006.

A. Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Seventh International Conference on Spoken Language Processing. ISCA.