Transactions of the Association for Computational Linguistics, 1 (2013) 429–440. Action Editor: Philipp Koehn.

Submitted 3/2013; Revised 8/2013; Published 10/2013.
© 2013 Association for Computational Linguistics.

Measuring Machine Translation Errors in New Domains

Ann Irvine, Johns Hopkins University, anni@jhu.edu
John Morgan, University of Maryland, jjm@cs.umd.edu
Marine Carpuat, National Research Council Canada, marine.carpuat@nrc.gc.ca
Hal Daumé III, University of Maryland, me@hal3.name
Dragos Munteanu, SDL Research, dmunteanu@sdl.com

Abstract

We develop two techniques for analyzing the effect of porting a machine translation system to a new domain. One is a macro-level analysis that measures how domain shift affects corpus-level evaluation; the second is a micro-level analysis for word-level errors. We apply these methods to understand what happens when a Parliament-trained phrase-based machine translation system is applied in four very different domains: news, medical texts, scientific articles and movie subtitles. We present quantitative and qualitative experiments that highlight opportunities for future research in domain adaptation for machine translation.

1 Introduction

When building a statistical machine translation (SMT) system, the expected use case is often limited to a specific domain, genre and register (henceforth "domain" refers to this set, in keeping with standard, imprecise, terminology), such as a particular type of legal or medical document. Unfortunately, it is expensive to obtain enough parallel data to reliably estimate translation models in a new domain. Instead, one can hope that large amounts of data from another, "old domain," might be close enough to stand as a proxy. This is the de facto standard: we train SMT systems on Parliament proceedings, but then use them to translate all sorts of new text. Unfortunately, this results in significantly degraded translation quality. In this paper, we present two complementary methods for quantifiably measuring the source of translation errors (§5.1 and §5.2) in a novel taxonomy (§4). We show quantitative (§7.1) and qualitative (§7.2) results obtained from our methods on four very different new domains: newswire, medical texts, scientific abstracts, and movie subtitles.

Old Domain (Hansard)
  Inp  monsieur le président, les pêcheurs de homard de la région de l'atlantique sont dans une situation catastrophique.
  Ref  mr. speaker, lobster fishers in atlantic canada are facing a disaster.
  Out  mr. speaker, the lobster fishers in atlantic canada are in a mess.
New Domain (Medical)
  Inp  mode et voie(s) d'administration
  Ref  method and route(s) of administration
  Out  fashion and voie(s) of directors

TABLE 1: Example inputs, references and system outputs. There are three types of errors: unseen words (blue), incorrect sense selection (red) and unknown sense (green).

Our basic approach is to think of translation errors in the context of a novel taxonomy of error categories, "S4." Our taxonomy contains categories for the errors shown in Table 1, in which an SMT system trained on the Hansard parliamentary proceedings is applied to a new domain (in this case, medical texts). Our categorization focuses on the following: new French words, new French senses, and incorrectly chosen translations. The first methodology we develop for studying such errors is a micro-level study of the frequency and distribution of these error types in real translation output at the level of individual words (§5.1), without respect to how these errors affect overall translation quality. The second is a macro-level study of how these errors affect translation performance (measured by BLEU; §5.2). One important feature of our methodologies is that we focus on errors that could possibly be fixed given access to data from a new domain, rather than all errors that might arise because the particular translation model used is inadequate to capture the required
translation task (formally: we measure estimation error, not approximation error).

Our goal is neither to build better SMT systems nor to develop novel domain adaptation methods. We take an ab initio approach and ask: given a large unadapted, out of the box SMT system, what happens when it is applied in a new domain? In order to answer this question, we will use parallel data in new domains, but only for testing purposes. The baseline SMT system is not adapted, except for the use of (1) a language model trained on monolingual new-domain language data,¹ and (2) a few thousand parallel sentences of tuning data in the new domain.

2 Summary of Results

We conduct experiments across a variety of domains (described in §6).² As in any study, our results are limited by assumptions about language, domains, and MT systems: these assumptions and their consequences are discussed in §8. Our high-level conclusions on the domains we study are summarized below (details may be found in §7).

1. Adapting an SMT system from the Parliament domain to the news domain is not a representative adaptation task; there are a very small number of errors due to unseen words, which are minor in comparison to all other domains. This is despite the fact that most previous work focuses exclusively on using news as a "new" domain (§3).
2. For the remaining domains, unseen words have a significant effect, both in terms of BLEU scores as well as fine-grained translation distinctions. However, many of these words have multiple translations, and a system must be able to correctly select which one to use in a particular context.
3. Likewise, words that gain new senses account for approximately as much error as unseen words, suggesting a novel avenue for research in sense induction. Unfortunately, it appears that choosing the right sense for these at translation time is even more difficult than in the unseen word case.
4. The story is more complicated for seen words with known translations: if we limit ourselves to "high confidence" translations, there is a lot to be gained by improving the scores in translation models. However, for an entire phrase table, manipulating scores can hurt as often as it helps.

¹ We use old/new to refer to domains and source/target to refer to languages, to avoid ambiguity (we stay away from in-domain and out-of-domain, which is itself ambiguous).
² All source data, methodological code and outputs are available at http://hal3.name/damt.

3 Related Work

Most related work has focused on either (a) analyzing errors made by machine translation systems in a non-adaptation setting (Popović and Ney, 2011), or (b) trying to directly improve machine translation performance. A small amount of work (discussed next) addresses issues of analyzing MT systems in a domain adaptation setting.

3.1 Analysis of Domain Effects

To date, work on domain adaptation in SMT has mostly proposed methods to efficiently combine data from multiple domains. To the best of our knowledge, there have been only a few studies to understand how domain shifts affect translation quality (Duh et al., 2010; Bisazza et al., 2011; Haddow and Koehn, 2012). However, these start from different premises than this paper, and as a result, ask related but complementary questions. These previous analyses focus on how to improve a particular MT architecture (trained on new domain data) by injecting old domain data into a specific part of the pipeline in order to improve BLEU score. Compared with that work, we focus on finer-grained phenomena. We distinguish between effects previously lumped together as "missing phrase-table entries." Despite different starting assumptions, language pairs and data, some of our conclusions are consistent with previous work: in particular, we highlight the importance of differences in coverage in an adaptation setting. However, our fine-grained analysis shows that correctly scoring translations for previously unseen words and senses is a complex issue. Finally, these other studies suggest potential directions for refining our error categories: for instance, Haddow and Koehn (2012) show that the impact of additional new or old domain data is different for rare vs. frequent phrases.

3.2 Domain Adaptation for MT

Prior work focuses on methods combining data from old and new domains to learn translation and language models.
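As a hedged illustration of one combination strategy surveyed in this section, linear interpolation of translation models, the following minimal Python sketch mixes two toy phrase tables. The nested-dictionary representation, the function name `interpolate_tables`, the mixing weight `lam` and the toy probabilities are all assumptions made for illustration; they are not taken from any of the cited systems, and real phrase tables carry several feature scores per phrase pair rather than a single probability.

```python
def interpolate_tables(old_table, new_table, lam):
    """Mix two phrase tables: p_mix(e|f) = lam * p_new(e|f) + (1 - lam) * p_old(e|f).

    Tables are hypothetical nested dicts: source phrase -> {target phrase: probability}.
    A source phrase present in only one table keeps just its scaled mass, so its
    distribution no longer sums to one; a real system would have to handle this.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed = {}
    for f in set(old_table) | set(new_table):
        old_dist = old_table.get(f, {})
        new_dist = new_table.get(f, {})
        mixed[f] = {
            e: lam * new_dist.get(e, 0.0) + (1.0 - lam) * old_dist.get(e, 0.0)
            for e in set(old_dist) | set(new_dist)
        }
    return mixed


# Toy tables with made-up probabilities (illustrative only).
old_table = {"mode": {"fashion": 0.7, "mode": 0.3}}
new_table = {"mode": {"method": 0.9, "mode": 0.1}, "voie": {"route": 1.0}}

mixed = interpolate_tables(old_table, new_table, lam=0.5)
for source, dist in sorted(mixed.items()):
    print(source, sorted(dist.items()))
```

The weight of 0.5 is arbitrary here; in practice a mixing weight of this kind would be tuned on held-out data from the new domain.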

Many filtering techniques have been proposed to select OLD data that is similar to NEW. Information retrieval techniques have been used to improve the language model (Zhao et al., 2004), the translation model (Hildebrand et al., 2005; Lu et al., 2007; Gong et al., 2011; Duh et al., 2010; Banerjee et al., 2012), or both (Lu et al., 2007); language model cross-entropy has also been used for data selection (Axelrod et al., 2011; Mansour et al., 2011; Sennrich, 2012).

Another research thread addresses corpora weighting, rather than hard filtering. Weighting has been applied at different levels of granularity: sentence pairs (Matsoukas et al., 2009), phrase pairs (Foster et al., 2010), n-grams (Ananthakrishnan et al., 2011), or sub-corpora through factored models (Niehues and Waibel, 2010). In particular, Foster et al. (2010) show that adapting at the phrase pair level outperforms earlier, coarser corpus-level combination approaches (Foster and Kuhn, 2007). This is consistent with our analysis: domain shifts have a fine-grained impact on translation quality.

Finally, strategies have been proposed to combine sub-models trained independently on different sub-corpora. Linear interpolation is widely used for mixing language models in speech recognition, and it has also been used for adapting translation and language models in MT (Foster and Kuhn, 2007; Tiedemann, 2010; Lavergne et al., 2011). Log-linear combination fits well in existing SMT architectures (Foster and Kuhn, 2007; Koehn and Schroeder, 2007). Koehn and Schroeder (2007) consider both an intersection setting (where only entries occurring in all phrase-tables combined are considered), and a union setting (where entries which are not in the intersection are given an arbitrary null score). Razmara et al. (2012) take this approach further and frame combination as ensemble decoding.

3.3 Targeting Specific Error Types

The experiments conducted in this article motivated follow-up work on identifying when a word has gained a new sense in a new domain (Carpuat et al., 2013), as well as learning joint word translation probability distributions from comparable new domain corpora (Irvine et al., 2013). Earlier, Daumé III and Jagarlamudi (2011) showed how mining translations for unseen words from comparable corpora can improve SMT in a new domain.

4 The S4 Taxonomy

We begin with a simple question: when we move an SMT system from an old domain to a new domain, what goes wrong? We employ a set of four error types as our taxonomy. We refer to these error types as SEEN, SENSE, SCORE and SEARCH, and together as the S4 taxonomy:

SEEN: an attempt to translate a source word or phrase that has never been seen before. For example, "voie(s)" in Table 1.

SENSE: an attempt to translate a previously seen source word or phrase, but for which the correct target language sense has never been observed.³ In Table 1, the Hansard-trained system had never seen "mode" translated as "method."

SCORE: an incorrect translation for which the system could have succeeded but did not because an incorrect alternative outweighed the correct translation. In a conventional translation system, this could be due to errors in the language model, translation model, or both. In Table 1, the Hansard-trained system had seen "administration" translated as "administration," but "directors" had a higher probability.

SEARCH: an error due to pruning in beam search.

When limiting oneself to issues of lexical selection, this set is exhaustive and disjoint: any lexical selection error made by an MT system can be attributed to exactly one of these error categories. This observation is important for developing methodologies for measuring the impact of each of these sources of error. Partitions of the set of errors that focus on categories other than lexical choice have been investigated by Vilar et al. (2006).

³ We define "sense" as a particular translation into a target language, in line with Carpuat & Wu (2007) or Mihalcea et al. (2010). This means both traditional word sense errors and other translation errors (like morphological variants) are included.

5 Methodology for Analyzing MT Systems

Given the S4 taxonomy for categorizing SMT errors, it would be possible (if painstaking) to manually annotate SMT output with error types. We prefer automated methods. In this section we describe two such methods: a micro-level analysis to
see what happens at the word level (regardless of how it affects translation performance) and a macro-level analysis to discover impact on corpus translation performance. We focus on the first three S4 categories and separately discuss search errors (§7). In both cases, we use exact string match to detect translation equivalences, as has been done previously in other settings that also use word alignments to inspect errors or automatically generate data for other tasks (Blatz et al., 2004; Carpuat and Wu, 2007; Popović and Ney, 2011; Bach et al., 2011, among others).

FIGURE 1: Example of WADE visualization. Dashed boxes around the French input mark the phrase spans used by the decoder.

5.1 Micro-analysis: WADE

We define Word Alignment Driven Evaluation, or WADE, which is a technique for analyzing MT system output at the word level, allowing us to (1) manually browse visualizations of MT output annotated with S4 error types, and (2) aggregate counts of errors. WADE is based on the fact that we can automatically word-align a French test sentence and its English reference translation, and the MT decoder naturally produces a word alignment between a French sentence and its machine translation. We can then check whether the MT output has the same set of English words aligned to each French word that we would hope for, given the reference.

In some ways, WADE is similar to the word-based analysis technique of Popović and Ney (2011). However, in contrast to that work, we do not directly align the hypothesis and reference translations but, rather, pivot through the source text. Additionally, we use WADE to annotate S4 errors, which are driven more by how lexical choice is made within the SMT framework than by linguistic properties of words in the reference and hypothesis translations. For example, in the case of domain adaptation, we do not expect the rate of inflectional errors to be affected by domain shift.

In WADE, the unit of analysis is each word alignment between a French word, f_i, and a reference English word, e_j. To annotate the aligned pair, a_{i,j}, we consider the word(s), H_i, in the output English sentence which are aligned (by the decoder) to f_i. If e_j appears in the set H_i, then the alignment a_{i,j} is marked correct. If not, the alignment is categorized with one of the S4 error types. If the French word f_i does not appear in the phrase table used for translation, then the alignment is marked as a SEEN error. If f_i does appear in the phrase table, but it is never observed translating as e_j, then the alignment is marked as a SENSE error. If f_i had been observed translating as e_j, but the decoder chose an alternate translation, then the alignment is marked as a SCORE error. Our results in §7 show that SEARCH errors are very infrequent, so we mark all errors other than SEEN and SENSE as SCORE errors. We make use of one additional category: Freebie. Our MT system copies unseen (aka "OOV") French words into the English output, and "freebies" are French words for which this is correct.

For WADE analysis only, we use the alignments yielded by a model trained over our train and test datasets and the grow-diag-final heuristic. Because WADE's unit of analysis is each alignment link between the source text and its reference, it ignores unaligned words in the input source text.

Figure 1 shows an example of a WADE-annotated sentence. In addition to providing an easy way to visualize and browse the errors in MT output, WADE allows us to aggregate counts over the S4 error types. In our analysis (§7), we present results that show not only total numbers of each error type but also how WADE annotations change when we introduce some NEW-domain parallel training data. For example, SEEN errors could remain SEEN errors, become correct, or become SENSE or SCORE errors when we introduce additional training data.

5.2 Macro-analysis: TETRA

In this section, we discuss an approach to measuring the effect of each potential source of error when a translation system is considered in full. The key
idea is to enhance the translation model of OLD, an MT system trained on old domain parallel text, to compare the impact of potential sources of improvement. We use parallel new domain data to propose enhancements to the OLD system. This provides a realistic measure of what could be achieved if one had access to parallel data in the new domain. The specific system we build, called MIXED, is a linear interpolation of a translation model trained only on old domain data and a model trained only on new domain data (Foster and Kuhn, 2007). The mixing weights are selected via grid search on a tuning set, selecting for BLEU.

We call our approach TETRA: Table Enhancement for Translation Analysis. Below, we design experiments to tease apart the differences in domains by adjusting the models and enhancing OLD to be more like MIXED. We perform different enhancements depending on the error category we are targeting. As discussed in §6, our experiments are conducted using phrase-based SMT systems, so the translation models (TM) that are enhanced are the phrase table and reordering table.

Seen: In order to estimate the effect of SEEN errors, we enhance the TM of OLD by adding phrase pairs that translate words found only in the new-domain data, and we measure the BLEU improvement. More precisely, we identify the set of phrase pairs in the TM of MIXED for which the French side contains at least one word that does not appear in the old-domain training data. These are the phrases responsible for the SEEN errors. We build system TETRA+SEEN by adding these phrases to the TM of OLD. When adding these phrases, we add them together with their feature value scores.

Sense: Analogously, the phrases responsible for SENSE errors are those from MIXED where the French side exists in the phrase table of OLD, but their English translations do not. We build TETRA+SENSE by adding these phrases to OLD.

Score: To isolate and measure the effect of phrase scores, we consider the phrases that our OLD and MIXED systems have in common: the intersection of their translation tables. We build two systems, OLDSCORE and NEWSCORE, with identical phrase pairs; in OLDSCORE, the feature values are taken from the OLD system's tables; in NEWSCORE the feature values are taken from the MIXED system's tables.

Domain    Sentences    L   Tokens   Types  #Phrases
Hansard   8,107,356    fr  161.7m   192k   479.0m
                       en  144.5m   187k
News      135,838      fr  3.9m     63k    12.4m
                       en  3.3m     52k
EMEA      472,231      fr  6.5m     35k    4.4m
                       en  5.9m     30k
Science   139,215      fr  4.3m     118k   8.4m
                       en  3.6m     114k
Subs      19,239,980   fr  155.0m   362k   364.7m
                       en  174.4m   293k

TABLE 2: Basic characteristics of the training data: number of sentences, tokens, word types and number of phrase pairs in the phrase tables.

6 Experimental conditions

6.1 Domains and Data

We conduct our study on French-English data sets. We consider five very different domains for which large corpora are publicly available. The largest corpus is the Hansard parliamentary proceedings. Corpora in the four other domains are smaller and more specialized, and, thus, more naturally serve as new domains. For each new domain, we use all available data. We do not attempt to hold the amount of new domain data constant, as we suspect that such artificial constraints would not be sufficient to control for the very different natures of the domains. Detailed statistics for the parallel corpora are given in Table 2.

Hansard: Canadian parliamentary proceedings, consisting of manual transcriptions and translations of meetings of Canada's House of Commons and its committees from 2001 to 2009. Discussions cover a wide variety of topics, and speaking styles range from prepared speeches by a single speaker to more interactive discussions. It is significantly larger than Europarl, the common source of old domain data.

EMEA: Documents from the European Medicines Agency, made available with the OPUS corpora collection (Tiedemann, 2009). This corpus primarily consists of drug usage guidelines.

News: News commentary corpus made available for the WMT 2009 evaluation. It has been commonly used in the domain adaptation literature (Koehn and Schroeder, 2007; Foster and Kuhn, 2007; Haddow and Koehn, 2012, for instance).

Science: Parallel abstracts from scientific publications in many disciplines including physics, biology, and computer science. We collected data from two distinct sources: (1) Canadian Science Publishing made available translated abstracts from their journals, which span many research disciplines; (2) parallel abstracts from PhD theses in Physics and Computer Science collected from the HAL public repository (Lambert et al., 2012).

Subs: Translated movie subtitles, available through the OPUS corpora collection (Tiedemann, 2009). In contrast to the other domains considered, subtitles consist of informal noisy text.⁴

In this study, we use the Hansard domain as the OLD domain, and we consider four possible NEW domains: EMEA, News, Science and Subs. Data sets for all domains were processed consistently. After tokenization, we paid particular attention to normalization in order to minimize artificial differences when combining data, such as American, British and Canadian spellings. This proved particularly important for the news domain; the impact of SEEN errors was reduced by more than half after normalization.

⁴ Hansards, News and the Canadian Science Publishing are available, respectively, at: http://www.parl.gc.ca, http://www.statmt.org/wmt09/translation-task.html, and http://www.nrcresearchpress.com; preprocessed versions and data splits used in this paper can be downloaded from http://hal3.name/damt.

6.2 MT systems

We build standard phrase-based SMT systems using the Moses toolkit (Koehn et al., 2007) for all experiments. Each system scores translation candidates using standard features: 5 phrase-table features, including phrasal translation probabilities and lexical weights in both translation directions, and a constant phrase penalty; 6 lexicalized reordering features, including bidirectional models built for monotone, swap, discontinuous reorderings; 1 distance-based reordering feature; and 2 language models, a 5-gram model learned on the OLD domain, and a 5-gram model learned on the NEW domain.

Features are combined using a log-linear model optimized for BLEU, using the n-best batch MIRA algorithm (Cherry and Foster, 2012). This results in a strong large-scale OLD system, which performs well on the old domain and is a good starting point for studying domain shifts.⁵ The word alignments, language models and tuning sets are kept constant across all experiments per domain.

For reference, we built systems using NEW domain data only; these achieved BLEU scores as follows: News = 21.70, EMEA = 34.63, Science = 30.72, Subs = 18.51.

⁵ We use (unadapted) HMM word alignments (Vogel et al., 1996) in both directions, combined using grow-diag-final (Koehn et al., 2005). We estimate alignments jointly on all data sets. Thus, TETRA may have artificially good phrase tables.

7 Results

Before moving on to the interesting results, we show that SEARCH is not a major source of error. We analyzed search errors separately by computing BLEU scores for each domain with varying beam size from 10 to 1000, using the OLD system. We find that increasing the beam from 10 to 200 yields approximately a one BLEU point advantage across all domains. Increasing it further (to 500 or 1000) does not bestow any additional advantages. This suggests that for sufficiently wide beams, search is unlikely to contribute to adaptation errors.⁶ This is consistent with previous results obtained in non-adapted settings using other measurement techniques: search errors account for less than 5% of the error in modern MT systems (Wisniewski et al., 2010), or 0.13% for small beam settings with a "gap constraint" (Chang and Collins, 2011). We use a beam value of 200 for all other experiments in this work.

⁶ This is likely dependent on language choice and the large amount of old domain parallel data.

7.1 Quantitative Results

Results are summarized in Tables 3 and 4. Table 3 gives an overview of our WADE analysis on test sets in each domain translated using OLD and MIXED models. Table 4 shows BLEU score results based on the TETRA analysis. We first present general observations based on each set of results.

WADE shows that for news, new domain data helps solve only a small number of SEEN issues, and SENSE and SCORE errors remain essentially unchanged. TETRA agrees that SENSE and SCORE are not issues in this domain. In general, the OLD system performs better on news than on the other three domains. For comparison, using the OLD system to translate a test set in the old (Hansard) domain yields a BLEU score of 37.41 and, according to our WADE analysis, 67.64% of all alignments are
correct. As in the news domain, most of the errors are SCORE followed by SENSE and then SEEN. For the other three domains, the two evaluation methods agree that SEEN is a fairly substantial problem. TETRA believes that SENSE is a fairly substantial issue, but WADE does not show this for Science. For SCORE, TETRA detects significant room for improvement, especially for Science.

           %Correct        %Seen Errors           %Sense Errors          %Score Errors
Domain     OLD    MIXED    OLD    MIXED  %Δ       OLD    MIXED  %Δ       OLD    MIXED  %Δ
News       57.42  57.77    5.73   5.38   -6%      11.02  11.13  +1%      25.84  25.72  +0%
EMEA       55.97  62.60    9.28   4.01   -56%     16.16  13.76  -15%     18.59  19.63  +6%
Science    56.20  61.50    10.22  5.63   -45%     13.58  13.11  -3%      20.00  19.77  -1%
Subs       55.76  59.58    5.25   1.67   -68%     13.93  9.71   -30%     25.06  29.04  +15%

TABLE 3: WADE: Percent correct, percent seen errors, percent sense errors, and percent score errors. The changes (%Δ) from OLD to MIXED are also given; here, negative changes are good (error reduction).

                                             SCORE (OLD vs MIXED)
Domain     OLD     +SEEN         +SENSE        OLD     MIXED           MIXED
News       22.81   23.87  +0%    23.95  +1%    23.72   23.86  +1%      24.15
EMEA       28.69   31.02  +8%    30.59  +7%    28.89   30.21  +5%      36.60
Science    26.13   27.72  +6%    27.29  +4%    26.09   28.68  +10%     32.23
Subs       15.10   15.96  +6%    16.41  +9%    14.99   16.25  +8%      18.49

TABLE 4: TETRA: Results on all new domains using OLD and MIXED models (first and last columns), OLD enhanced with seen translations (second), sense translations (third), and scores (fourth), together with percent improvements in terms of BLEU score. Here, positive improvements are good (higher BLEU scores).

The large changes in BLEU score found with TETRA are somewhat surprising given how little the phrase tables change in each of these experimental conditions. For News, EMEA, and Science, adding unseen words results in an increase in the number of phrase pairs of between 0.045% (News) and 0.3% (Science). The sense additions were similarly small: from 0.15% (EMEA) to 0.59% (News). For Subs the story was different: adding unseen words amounted to a growth of 4.2% in phrase table size; sense amounted to 25.1%. In all cases, the size of the score phrase tables was only 0.05% smaller than that of OLD.

At first glance, the WADE and TETRA analyses of the SCORE error type seem to contradict each other. The MIXED systems are worse in terms of SCORE (positive deltas, more errors than OLD), but have better BLEU scores. To understand this discrepancy, we must recognize that TETRA analyzes the score errors in isolation: by restricting the phrase tables to the intersection of the OLD and MIXED domain phrase tables, we remove all seen and sense errors. In the WADE analysis, however, many errors that "used to be" SEEN errors under the OLD system become SCORE errors under the MIXED system.

                       MIXED
                Correct          Incorrect
OLD             Cor    Seen      Score  Seen   Sense    Total
News
  Cor           53.7   0.0       1.9    0.0    0.0      55.6
  Seen-C        0.1    1.7       0.0    0.0    0.0      1.8
  Score         2.2    0.0       23.6   0.0    0.0      25.8
  Seen-I        0.1    0.0       0.0    5.4    0.3      5.7
  Sense         0.0    0.0       0.2    0.0    10.8     11.0
  Total         56.1   1.7       25.7   5.4    11.1     100
EMEA
  Cor           48.3   0.0       3.1    0.0    0.0      51.5
  Seen-C        1.6    2.8       0.1    0.0    0.0      4.5
  Score         5.3    0.0       13.3   0.0    0.0      18.6
  Seen-I        2.3    0.0       0.5    4.0    2.5      9.3
  Sense         2.3    0.0       2.6    0.0    11.3     16.2
  Total         59.8   2.8       19.6   4.0    13.8     100
Science
  Cor           49.8   0.0       3.6    0.0    0.0      53.3
  Seen-C        1.4    1.4       0.0    0.0    0.0      2.9
  Score         5.8    0.0       14.2   0.0    0.0      20.0
  Seen-I        1.8    0.0       0.3    5.6    2.5      10.2
  Sense         1.4    0.0       1.6    0.0    10.6     13.6
  Total         60.1   1.4       19.8   5.6    13.1     100
Subtitles
  Cor           52.4   0.0       2.5    0.0    0.0      54.8
  Seen-C        0.6    0.3       0.0    0.0    0.0      0.9
  Score         4.5    0.0       20.6   0.0    0.0      25.1
  Seen-I        1.1    0.0       0.5    1.7    2.0      5.3
  Sense         0.8    0.0       5.4    0.0    7.7      13.9
  Total         59.3   0.3       29.0   1.7    9.7      100

TABLE 5: Percent of WADE annotation changes moving from OLD (rows) to MIXED (columns) models, for each domain. Non-zero off-diagonals are bolded. Seen-C indicates Freebies, and Seen-I indicates unseen words that were mistranslated.

To see the full picture, we must look at how the different error categories change from the OLD system to the MIXED system in WADE. This is shown in Table 5. In this table, the rightmost column contains the total percentage of errors in the OLD systems; the rows labeled Total show the total percentage of errors in the MIXED systems; the remaining cells show how these errors change from OLD to MIXED. For the news domain, the OLD system has 25.8% SCORE errors. Of those, 2.2% are fixed in the MIXED system.

For the three domains of interest (all except news), addressing SEEN errors can be substantially
helpful, in terms of both BLEU score and the fine-grained distinctions considered by WADE. The more interesting conclusion, however, is that simply bringing in new words isn't enough. Table 5 shows that in these three domains there are a substantial number of errors that transition from being SEEN-Incorrect to SENSE-Incorrect. This indicates that besides observing a new word, we must also observe it with all of its correct translations.

Likewise, there is a lot to be gained in BLEU by correcting new SENSE translation errors (essentially the same percentage as for SEEN). But this is harder to solve. We can see in Table 5 that from the SENSE errors of the OLD system, half become correct but the other half become SCORE errors. So giving appropriate scores to the new senses is a challenge. This makes sense: these new senses are now "competing" with old ones, and getting the interpolation right between old and new domain tables is difficult.

For SCORE, the situation is more complicated. Our TETRA analysis clearly indicates that there is room for improvement. But this is based on intersected phrase tables, from which we removed seen and sense distinctions, and in which there is no competition between phrases from the OLD and NEW systems. The WADE analysis shows a positive effect only for Science. The data in Table 5 shows that a lot (5.8/20) of the errors are corrected, but we also introduce a number of additional errors (3.6% that were correct, 0.3% that were SEEN and 1.6% that were SENSE). Similarly, in the EMEA domain, we fix 5% of the 18% of SCORE errors but introduce 2.6% that were new sense errors before, 0.5% that were SEEN errors before, and make 3% additional error on words we got right before. Subs is similar: out of 25% SCORE errors we fix 4.5%, but introduce 0.5% from SEEN and 5.4% from SENSE, and suffer additional error on 2.5% of what we had correct before.

7.2 Qualitative Results

Table 7 shows examples of the French words that WADE frequently identified as incorrectly translated by the OLD system due to SCORE or SENSE but that were correctly translated under the MIXED system.⁷ For example, in the Science domain, 'mesures' suffered from SCORE errors under the OLD system. While its correct translation was often 'measurements,' the OLD system preferred its most probable translations ('savings,' 'actions,' 'issues,' and 'provisions'). Thirty of these error cases were correctly translated by the MIXED system. Similarly, in the Science domain, the French word 'finis,' when it should have been translated as 'finite,' was translated incorrectly due to a sense error 27 times. Its most frequent translations under the OLD system were 'finish,' 'finished,' and 'more.' The MIXED system corrected these sense errors. We omit examples of where seen errors made by OLD were frequently corrected by the MIXED system because they tend to be less interesting. Examples can be found in Daumé III and Jagarlamudi (2011).

⁷ Complete output lists are available at http://hal3.name/damt

We annotate the French test sentences using the Stanford part-of-speech (POS) tagger (Toutanova et al., 2003) and examine which POS categories correspond to the most errors of each type. Using the OLD system, new sense errors in the Subs domain are made on French nouns 40% of the time and on verbs 35% of the time. In EMEA, 51% are nouns and 23% are adjectives; in Science, 51% nouns and 20% adjectives; in News, 46% nouns and 23% verbs. Seen errors show a very similar trend: in the Subs domain 50% are nouns and 25% verbs; in EMEA, 48% are nouns and 37% adjectives; in Science, 46% are nouns and 40% adjectives; in News, 46% are nouns and 28% adjectives. Similarly, for all domains, more score errors are made on source nouns than any other POS category. In summary, we find that most errors correspond to source language nouns, followed by adjectives, except for Subs, where verbs are also commonly mistranslated due to all error types.

Table 6 (left) shows some examples of how TETRA can automatically estimate the errors due to unseen words when moving to a new domain. For example, the OCR error "miie" in the source sentence is correctly translated as "miss" by the enhanced system. The enhanced phrase tables of TETRA can also automatically estimate the errors due to poor lexical choice when moving to a new domain, and can select a more lucid translation term. For example, the enhanced system appropriately selected "shoot" instead of "growth" in the Science example in Table 6 (middle). When TETRA inserts the scores from the new domain into the translation tables, the system produces translations that take on the flavor of the new domain, yielding higher BLEU scores. This can be observed in Table 6 (right) where the TETRA-enhanced system used the science-specific word "equilibrium" rather than the political word "balance."

Seen errors (left)
  Medical
    Inp  en ce qui concerne les indications thérapeutiques pour l'insuffisance cardiaque, le tamm a proposé le texte suivant: « traitement de l'insuffisance cardiaque congestive. »
    Ref  regarding the therapeutic indications for heart failure, the mah proposed the wording: "treatment of congestive heart failure."
    Old  for therapeutic indications for heart failure, tamm proposed the following: treatment of congestive heart failure.
    Fix  for therapeutic indications for heart failure, the mah has suggested the following: treatment of congestive heart failure.
  Science
    Inp  les résultats à la base de cette hypothése sont révisés.
    Ref  findings that form the basis of this hypothesis are reviewed.
    Old  the results at the base of this assumption are reviewed.
    Fix  the results of this hypothesis are reviewed.
  Subtitles
    Inp  Bonne nuit, MIIe Kenton.
    Ref  good night, miss kenton.
    Old  good night, miie kenton.
    Fix  good night, miss kenton.

Sense errors (middle)
  Medical
    Inp  ce médicament est une solution limpide, incolore à jaune pâle.
    Ref  this medicinal product is a clear, colorless to pale yellow solution.
    Old  lantus is a clear, colorless to pale yellow.
    Fix  this medicine is a solution is clear, colorless to pale yellow.
  Science
    Inp  tous les traitements ont augmenté la production de pousses.
    Ref  all treatments increased shoot production.
    Old  all treatments have increased the production of growth.
    Fix  all treatments increased shoot production.
  Subtitles
    Inp  Le sexe c'est naturel.
    Ref  sex is natural.
    Old  the sex is natural.
    Fix  sex is natural.

Score errors (right)
  Medical
    Inp  les deux substances actives ont des effets inverses sur la kaliémie.
    Ref  the two active substances have inverse effects on plasma potassium.
    Old  the two active substances are reverse effects on the kaliémie.
    Fix  the two active substances have side inverses on kaliémie.
  Science
    Inp  par ailleurs, les constantes d'équilibre sont plus faibles.
    Ref  in contrast, the equilibrium constants are lower.
    Old  furthermore, the constant balance are lower.
    Fix  moreover, the equilibrium constants are lower.
  Subtitles
    Inp  Je bouge mieux.
    Ref  i move better.
    Old  i get better.
    Fix  i move better.

TABLE 6: Example MT results obtained by fixing seen errors (left), sense errors (middle) and score errors (right). Includes source, a reference translation, the output of the OLD system and the output obtained via TETRA methodology.

D    #    French        Correct-E      Hansard-E
Score → Correct
E    21   doit          should         has must needs shall requires
E    8    association   combination    partnership association
E    6    noms          names          names people nominee speakers
Sc   30   mesures       measurements   savings actions issues provisions
Sc   27   courant       current        knowledge knew heads-up
Sc   26   article       paper          standing clause order section
Su   9    comme         like           as because like akin how sort
Su   5    maison        house          home house homes head place
Su   4    fric          money          cash dough money fric bucks loot
Sense → Correct
E    9    notice        leaflet        informed directions notice
E    8    perfusion     infusion       perfusion intravenous
E    8    molles        capsules       lax limpid soft weak
Sc   27   finis         finite         finish finished more
Sc   18   jonctions     junctions      junction (only once)
Sc   10   substrats     substrates     corn streams area substrata
Su   5    emmerde       fuck           annoying (only once)
Su   3    redites       say            repetitious tell covered again
Su   3    mec           man            c me guy mec

TABLE 7: For score/sense errors, in (E)MEA, (Sc)ience and (Su)bs, frequent French words that fall into that category (by WADE), as well as the corrected translations and the most frequent OLD translations.

7.3 Results on an Adapted System

To show how WADE can be used on already adapted systems, we performed a simple experiment based on a standard adaptation technique. We used bilingual cross-entropy difference (Axelrod et al., 2011) to quantify the distance between each OLD domain sentence pair and each NEW domain. We selected the top K closest sentences for each domain. For EMEA and Science, we set K to the size of the NEW domain data. For Subs, this would select nearly all of Hansard, so we arbitrarily set K = 1m. (We excluded the news domain.) We took this data, concatenated it to the NEW domain data, trained full models, and ran the WADE analysis on their outputs.

The trends across the three domains were remarkably similar. In all, SCORE errors in the adapted system were lower by around 2% than even the MIXED baseline (as much as 4% for Subs). This is likely because by excluding parts of the OLD domain most unlike the relevant NEW domain, the correct sense is observed more often. However, this comes at a price: SE
NSEandSEENerrorsgoupabout1%or2%each.Thissuggeststhatamoreﬁne-grainedadaptationap-proachmightachievethebestofbothworlds.8LimitingAssumptionsThispaperrepresentsapartialexplorationofthespaceofpossibleassumptionsaboutmodelsanddata.Wecannothopetoexplorethecombinatorialexplosionofpossibilities,andthereforehaverestric-tedouranalysistothefollowingsettings:Phrase-basedmodels.Allofourexperimentsarecarriedoutusingphrase-basedtranslation,asimple-mentedintheopen-sourceMosestranslationsystem(Koehnetal.,2007)toensurethattheyarerepro-ducible.Ourmethodsareeasilyextendedtohierar-chicalphrase-basedmodels(Chiang,2007).Itisnotclearwhetherthesameconclusionswouldhold:on

the one hand, complex phrasal rules might overfit even more badly than phrases; on the other hand, hierarchical models might have more flexibility to generalize structures.

Translation languages. We only translate from French to English. This well-studied language pair presents several advantages: large quantities of data are publicly available in a wide variety of domains, and standard statistical machine translation architectures yield good performance. Unlike with more distant languages such as Chinese-English, or languages with radically different morphology or word order such as German or Czech, we know that the old-domain translation quality is high, and that translation failures during domain shift can be primarily attributed to domain issues rather than to problems with the SMT system.

Constant old domain. Our old domain is from Hansards, and we only vary our new domain. It would be interesting to consider other datasets as old domains. We deliberately only use the Hansard data: based on its size and scope, we assume that it yields the most general of our SMT systems.

Monolingual new-domain data. We assume that we always have access to monolingual English data in the new domain for learning a domain-specific language model. Our focus is on the effect of the translation model; the effect of adapting language models has been studied previously (see §3). Without access to a new-domain language model, the effect of unseen words and words with new senses is likely to be dramatically underestimated, because their translations are likely to be “thrown out” by an old-domain LM. Moreover, since SCORE errors conflate language model and translation model scores, using a new-domain language model lets us mostly isolate the effect of the translation model.

Parallel new-domain data for tuning. We assume that we always have access to a small amount of parallel data in the new domain, essentially for the purpose of running parameter tuning. Without this, one would not even be able to evaluate the performance of one’s system, typically a non-starter.

Automatic word alignments for WADE. WADE is fundamentally based upon word alignments, so alignment errors may affect its accuracy. Such errors are obvious when manually inspecting sentence triples using the visualizer. When developing this tool, we checked that alignment noise does not invalidate conclusions drawn from WADE counts. In order to estimate how much alignment errors affect WADE, a French speaker manually corrected the word alignments for 955 EMEA test set sentences. The analyses based on the manually corrected alignments show fewer errors overall, but the erroneous annotations appear to be randomly distributed among all categories (details omitted for space). As a result, we believe that WADE yields results which are informative despite the inevitable automatic alignment errors. In particular, because alignments between a test and reference set are held constant in a system comparison, such errors should impact all analyses in the same way.

9 Discussion

Translation performance degrades dramatically when migrating an SMT system to a new and different domain. Our work demonstrates that the majority of this degradation in performance is due to SEEN and SENSE errors: namely, unknown source-language words and known source-language words with unknown translations. This result holds in all domains we studied, except for news, in which there appears to be little adaptation influence at all (especially after spelling normalization).

Our two analysis methods, WADE (Section 5.1) and TETRA (Section 5.2), are both lenses on the adverse effects of domain mismatch. Using WADE, we are able to pinpoint precise translation errors and their sources. This could be extended to more nuanced, human-assisted analysis of adaptation effects. WADE also “labels” translations with different error types, which could be used to train more complex models. Using TETRA, we are able to see how these errors affect overall translation performance. In principle, this performance could be any measure, including human assessment. We started with the BLEU metric since it is most widely used in the community. One point of possible improvement would be to replace exact string match in WADE, and BLEU in TETRA, with metrics that are more morphologically or semantically informed.

Error analysis opens the door to building adapted machine translation systems that directly target specific error categories. As we have seen, most existing domain adaptation techniques in MT aim to improve translation quality in general, and are accordingly evaluated using corpus-level metrics such as BLEU. Our intuitive finer-grained analysis suggests that finer-grained models might be better suited to understanding and comparing the errors made by adapted and unadapted systems. We have shown that considering the S4 taxonomy is important: improving coverage, for example, does not necessarily improve translation quality. Translation candidates must also be complete and must be scored correctly. Our techniques provide an intuitive way to understand the effectiveness of new MT domain adaptation approaches.

Acknowledgments

We gratefully acknowledge the support of the 2012 JHU Summer Workshop and NSF Grant No. 1005411, as well as the NRC for Marine Carpuat, and DARPA CSSG Grant D11AP00279 for Hal Daumé III. We would like to thank the entire DAMT team (http://hal3.name/damt/) and Sanjeev Khudanpur for their invaluable help and suggestions, as well as all the reviewers for their insightful feedback.

References

Sankaranarayanan Ananthakrishnan, Rohit Prasad, and Prem Natarajan. 2011. On-line language model biasing for statistical machine translation. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).
Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Nguyen Bach, Fei Huang, and Yaser Al-Onaizan. 2011. Goodness: A method for measuring machine translation confidence. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).
Pratyush Banerjee, Sudip Kumar Naskar, Johann Roturier, Andy Way, and Josef van Genabith. 2012. Translation quality-based supplementary data selection by incremental update of translation models. In Proceedings of the International Conference on Computational Linguistics (COLING).
Arianna Bisazza, Nick Ruiz, and Marcello Federico. 2011. Fill-up versus interpolation methods for phrase-based SMT adaptation. International Workshop on Spoken Language Translation (IWSLT).
John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2004. Confidence estimation for machine translation. In Proceedings of the International Conference on Computational Linguistics (COLING).
Marine Carpuat and Dekai Wu. 2007. Improving statistical machine translation using word sense disambiguation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Marine Carpuat, Hal Daumé III, Katharine Henry, Ann Irvine, Jagadeesh Jagarlamudi, and Rachel Rudinger. 2013. SenseSpotting: Never let your parallel data tie you to an old domain. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).
Yin-Wen Chang and Michael Collins. 2011. Exact decoding of phrase-based translation models through lagrangian relaxation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.
Hal Daumé III and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).
Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Analysis of translation model adaptation in statistical machine translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).
George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. In Proceedings of the Workshop on Statistical Machine Translation (WMT).
George Foster, Cyril Goutte, and Roland Kuhn. 2010. Discriminative instance weighting for domain adaptation in statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Zhengxian Gong, Min Zhang, and Guodong Zhou. 2011. Cache-based document-level statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Barry Haddow and Philipp Koehn. 2012. Analysing the effect of out-of-domain data on SMT systems. In Proceedings of the Workshop on Statistical Machine Translation (WMT).
Almut Silja Hildebrand, Matthias Eck, Stephan Vogel, and Alex Waibel. 2005. Adaptation of the translation
model for statistical machine translation based on information retrieval. In Proceedings of the Conference of the European Association for Computational Linguistics (EACL).
Ann Irvine, Chris Quirk, and Hal Daumé III. 2013. Monolingual marginal matching for translation model adaptation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Workshop on Statistical Machine Translation (WMT).
Philipp Koehn, Amittai Axelrod, Alexandra Birch, Chris Callison-Burch, Miles Osborne, David Talbot, and Michael White. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of International Workshop on Spoken Language Translation.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).
Patrik Lambert, Holger Schwenk, and Frédéric Blain. 2012. Automatic translation of scientific documents in the HAL archive. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).
Thomas Lavergne, Alexandre Allauzen, Hai-Son Le, and François Yvon. 2011. LIMSI’s experiments in domain adaptation for IWSLT11. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).
Yajuan Lu, Jin Huang, and Qun Liu. 2007. Improving statistical machine translation performance by training data selection and optimization. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
Saab Mansour, Joern Wuebker, and Hermann Ney. 2011. Combining translation and language model scoring for domain-specific data filtering. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).
Spyros Matsoukas, Antti-Veikko I. Rosti, and Bing Zhang. 2009. Discriminative corpus weight estimation for machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Rada Mihalcea, Ravi Sinha, and Diana McCarthy. 2010. SemEval-2010 Task 2: Cross-Lingual Lexical Substitution. In Proceedings of the 5th International Workshop on Semantic Evaluation.
Jan Niehues and Alex Waibel. 2010. Domain adaptation in statistical machine translation using factored translation models. In Proceedings of the European Association for Machine Translation (EAMT).
Maja Popović and Hermann Ney. 2011. Towards automatic error analysis of machine translation output. Computational Linguistics, 37(4).
Majid Razmara, George Foster, Baskaran Sankaran, and Anoop Sarkar. 2012. Mixing multiple translation models in statistical machine translation. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).
Rico Sennrich. 2012. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the Conference of the European Association for Computational Linguistics (EACL).
Jörg Tiedemann. 2009. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing (RANLP).
Jörg Tiedemann. 2010. To cache or not to cache? experiments with adaptive models in statistical machine translation. In Proceedings of the ACL Workshop on Statistical Machine Translation and Metrics (MATR).
Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL.
David Vilar, Jia Xu, Luis Fernando D’Haro, and Hermann Ney. 2006. Error analysis of statistical machine translation output. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).
Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the International Conference on Computational Linguistics (COLING).
Guillaume Wisniewski, Alexandre Allauzen, and François Yvon. 2010. Assessing phrase-based translation models with oracle decoding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Bing Zhao, Matthias Eck, and Stephan Vogel. 2004. Language model adaptation for statistical machine translation with structured query models. In Proceedings of the International Conference on Computational Linguistics (COLING).
