Transactions of the Association for Computational Linguistics, 1 (2013) 429–440. Action Editor: Philipp Koehn.
Submitted 3/2013; Revised 8/2013; Published 10/2013. © 2013 Association for Computational Linguistics.
Measuring Machine Translation Errors in New Domains

Ann Irvine, Johns Hopkins University, anni@jhu.edu
John Morgan, University of Maryland, jjm@cs.umd.edu
Marine Carpuat, National Research Council Canada, marine.carpuat@nrc.gc.ca
Hal Daumé III, University of Maryland, me@hal3.name
Dragos Munteanu, SDL Research, dmunteanu@sdl.com

Abstract

We develop two techniques for analyzing the effect of porting a machine translation system to a new domain. One is a macro-level analysis that measures how domain shift affects corpus-level evaluation; the second is a micro-level analysis for word-level errors. We apply these methods to understand what happens when a Parliament-trained phrase-based machine translation system is applied in four very different domains: news, medical texts, scientific articles and movie subtitles. We present quantitative and qualitative experiments that highlight opportunities for future research in domain adaptation for machine translation.

1 Introduction

When building a statistical machine translation (SMT) system, the expected use case is often limited to a specific domain, genre and register (henceforth “domain” refers to this set, in keeping with standard, imprecise, terminology), such as a particular type of legal or medical document. Unfortunately, it is expensive to obtain enough parallel data to reliably estimate translation models in a new domain. Instead, one can hope that large amounts of data from another, “old domain,” might be close enough to stand as a proxy. This is the de facto standard: we train SMT systems on Parliament proceedings, but then use them to translate all sorts of new text. Unfortunately, this results in significantly degraded translation quality. In this paper, we present two complementary methods for quantifiably measuring the source of translation errors (§5.1 and §5.2) in a novel taxonomy (§4). We show quantitative (§7.1) and qualitative (§7.2) results obtained from our methods on four very different new domains: newswire, medical texts, scientific abstracts, and movie subtitles.

Old Domain (Hansard)
  Inp: monsieur le président, les pêcheurs de homard de la région de l’atlantique sont dans une situation catastrophique.
  Ref: mr. speaker, lobster fishers in atlantic canada are facing a disaster.
  Out: mr. speaker, the lobster fishers in atlantic canada are in a mess.
New Domain (Medical)
  Inp: mode et voie(s) d’administration
  Ref: method and route(s) of administration
  Out: fashion and voie(s) of directors
TABLE 1: Example inputs, references and system outputs. There are three types of errors: unseen words (blue), incorrect sense selection (red) and unknown sense (green).

Our basic approach is to think of translation errors in the context of a novel taxonomy of error categories, “S4.” Our taxonomy contains categories for the errors shown in Table 1, in which an SMT system trained on the Hansard parliamentary proceedings is applied to a new domain (in this case, medical texts). Our categorization focuses on the following: new French words, new French senses, and incorrectly chosen translations. The first methodology we develop for studying such errors is a micro-level study of the frequency and distribution of these error types in real translation output at the level of individual words (§5.1), without respect to how these errors affect overall translation quality. The second is a macro-level study of how these errors affect translation performance (measured by BLEU; §5.2). One important feature of our methodologies is that we focus on errors that could possibly be fixed given access to data from a new domain, rather than all errors that might arise because the particular translation model used is inadequate to capture the required
translation task (formally: we measure estimation error, not approximation error).

Our goal is neither to build better SMT systems nor to develop novel domain adaptation methods. We take an ab initio approach and ask: given a large unadapted, out of the box SMT system, what happens when it is applied in a new domain? In order to answer this question, we will use parallel data in new domains, but only for testing purposes. The baseline SMT system is not adapted, except for the use of (1) a language model trained on monolingual new-domain language data,¹ and (2) a few thousand parallel sentences of tuning data in the new domain.

2 Summary of Results

We conduct experiments across a variety of domains (described in §6).² As in any study, our results are limited by assumptions about language, domains, and MT systems: these assumptions and their consequences are discussed in §8. Our high-level conclusions on the domains we study are summarized below (details may be found in §7).

1. Adapting an SMT system from the Parliament domain to the news domain is not a representative adaptation task; there are a very small number of errors due to unseen words, which are minor in comparison to all other domains. (This is despite the fact that most previous work focuses exclusively on using news as a “new” domain; see §3.)
2. For the remaining domains, unseen words have a significant effect, both in terms of BLEU scores as well as fine-grained translation distinctions. However, many of these words have multiple translations, and a system must be able to correctly select which one to use in a particular context.
3. Likewise, words that gain new senses account for approximately as much error as unseen words, suggesting a novel avenue for research in sense induction. Unfortunately, it appears that choosing the right sense for these at translation time is even more difficult than in the unseen word case.
4. The story is more complicated for seen words with known translations: if we limit ourselves to “high confidence” translations, there is a lot to be gained by improving the scores in translation models. However, for an entire phrase table, manipulating scores can hurt as often as it helps.

¹ We use old/new to refer to domains and source/target to refer to languages, to avoid ambiguity (we stay away from in-domain and out-of-domain, which is itself ambiguous).
² All source data, methodological code and outputs are available at http://hal3.name/damt.

3 Related Work

Most related work has focused on either (a) analyzing errors made by machine translation systems in a non-adaptation setting (Popović and Ney, 2011), or (b) trying to directly improve machine translation performance. A small amount of work (discussed next) addresses issues of analyzing MT systems in a domain adaptation setting.

3.1 Analysis of Domain Effects

To date, work on domain adaptation in SMT has mostly proposed methods to efficiently combine data from multiple domains. To the best of our knowledge, there have been only a few studies to understand how domain shifts affect translation quality (Duh et al., 2010; Bisazza et al., 2011; Haddow and Koehn, 2012). However, these start from different premises than this paper, and as a result, ask related but complementary questions. These previous analyses focus on how to improve a particular MT architecture (trained on new domain data) by injecting old domain data into a specific part of the pipeline in order to improve BLEU score. In comparison to this work, we focus on finer-grained phenomena. We distinguish between effects previously lumped together as “missing phrase-table entries.” Despite different starting assumptions, language pairs and data, some of our conclusions are consistent with previous work: in particular, we highlight the importance of differences in coverage in an adaptation setting. However, our fine-grained analysis shows that correctly scoring translations for previously unseen words and senses is a complex issue. Finally, these other studies suggest potential directions for refining our error categories: for instance, Haddow and Koehn (2012) show that the impact of additional new or old domain data is different for rare vs. frequent phrases.

3.2 Domain Adaptation for MT

Prior work focuses on methods combining data from old and new domains to learn translation and language models.
Many filtering techniques have been proposed to select OLD data that is similar to NEW. Information retrieval techniques have been used to improve the language model (Zhao et al., 2004), the translation model (Hildebrand et al., 2005; Lu et al., 2007; Gong et al., 2011; Duh et al., 2010; Banerjee et al., 2012), or both (Lu et al., 2007); language model cross-entropy has also been used for data selection (Axelrod et al., 2011; Mansour et al., 2011; Sennrich, 2012).

Another research thread addresses corpora weighting, rather than hard filtering. Weighting has been applied at different levels of granularity: sentence pairs (Matsoukas et al., 2009), phrase pairs (Foster et al., 2010), n-grams (Ananthakrishnan et al., 2011), or sub-corpora through factored models (Niehues and Waibel, 2010). In particular, Foster et al. (2010) show that adapting at the phrase-pair level outperforms earlier, coarser corpus-level combination approaches (Foster and Kuhn, 2007). This is consistent with our analysis: domain shifts have a fine-grained impact on translation quality.

Finally, strategies have been proposed to combine sub-models trained independently on different sub-corpora. Linear interpolation is widely used for mixing language models in speech recognition, and it has also been used for adapting translation and language models in MT (Foster and Kuhn, 2007; Tiedemann, 2010; Lavergne et al., 2011). Log-linear combination fits well in existing SMT architectures (Foster and Kuhn, 2007; Koehn and Schroeder, 2007). Koehn and Schroeder (2007) consider both an intersection setting (where only entries occurring in all phrase-tables combined are considered), and a union setting (where entries which are not in the intersection are given an arbitrary null score). Razmara et al. (2012) take this approach further and frame combination as ensemble decoding.

3.3 Targeting Specific Error Types

The experiments conducted in this article motivated follow-up work on identifying when a word has gained a new sense in a new domain (Carpuat et al., 2013), as well as learning joint word translation probability distributions from comparable new domain corpora (Irvine et al., 2013). Earlier, Daumé III and Jagarlamudi (2011) showed how mining translations for unseen words from comparable corpora can improve SMT in a new domain.

4 The S4 Taxonomy

We begin with a simple question: when we move an SMT system from an old domain to a new domain, what goes wrong? We employ a set of four error types as our taxonomy. We refer to these error types as SEEN, SENSE, SCORE and SEARCH, and together as the S4 taxonomy:

SEEN: an attempt to translate a source word or phrase that has never been seen before. For example, “voie(s)” in Table 1.
SENSE: an attempt to translate a previously seen source word or phrase, but for which the correct target language sense has never been observed.³ In Table 1, the Hansard-trained system had never seen “mode” translated as “method.”
SCORE: an incorrect translation for which the system could have succeeded but did not because an incorrect alternative outweighed the correct translation. In a conventional translation system, this could be due to errors in the language model, translation model, or both. In Table 1, the Hansard-trained system had seen “administration” translated as “administration,” but “directors” had a higher probability.
SEARCH: an error due to pruning in beam search.

When limiting oneself to issues of lexical selection, this set is exhaustive and disjoint: any lexical selection error made by an MT system can be attributed to exactly one of these error categories. This observation is important for developing methodologies for measuring the impact of each of these sources of error. Partitions of the set of errors that focus on categories other than lexical choice have been investigated by Vilar et al. (2006).

5 Methodology for Analyzing MT Systems

Given the S4 taxonomy for categorizing SMT errors, it would be possible (if painstaking) to manually annotate SMT output with error types. We prefer automated methods. In this section we describe two such methods: a micro-level analysis to

³ We define “sense” as a particular translation into a target language, in line with Carpuat & Wu (2007) or Mihalcea et al. (2010). This means both traditional word sense errors and other translation errors (like morphological variants) are included.
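
The S4 definitions in §4 can be read as a decision procedure over a phrase table. The following is a minimal Python sketch of that reading, not the authors' implementation. It assumes a simplified, word-level phrase table (a dictionary from a source word to the set of target words it has been observed with); the function name `classify`, the `S4` enum and the toy entries are hypothetical and only mirror the Table 1 example. SEARCH errors are not modeled, since detecting them requires access to the decoder's pruning decisions.

```python
from enum import Enum


class S4(Enum):
    """Labels for a single lexical choice (SEARCH is deliberately left out)."""
    CORRECT = "correct"
    SEEN = "seen"
    SENSE = "sense"
    SCORE = "score"


def classify(src_word, ref_word, hyp_words, phrase_table):
    """Label one source word given its reference and hypothesis translations.

    phrase_table maps a source word to the set of target words it was
    observed with in training (a hypothetical word-level simplification).
    """
    if ref_word in hyp_words:
        return S4.CORRECT
    if src_word not in phrase_table:
        return S4.SEEN    # the source word was never observed
    if ref_word not in phrase_table[src_word]:
        return S4.SENSE   # word observed, but never with this translation
    return S4.SCORE       # correct option was available but not chosen


# Toy old-domain table, loosely following Table 1 (assumed entries).
TOY_OLD_TABLE = {
    "mode": {"fashion", "mode"},
    "administration": {"administration", "directors"},
}

examples = [
    ("voie(s)", "route(s)", {"voie(s)"}),
    ("mode", "method", {"fashion"}),
    ("administration", "administration", {"directors"}),
]
for src, ref, hyp in examples:
    print(f"{src} -> {ref}: {classify(src, ref, hyp, TOY_OLD_TABLE).name}")
```

Under this toy table, the three errors from Table 1 are labeled SEEN, SENSE and SCORE respectively, matching the examples given with the taxonomy.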
see what happens at the word level (regardless of how it affects translation performance) and a macro-level analysis to discover impact on corpus translation performance. We focus on the first three S4 categories and separately discuss search errors (§7). In both cases, we use exact string match to detect translation equivalences, as has been done previously in other settings that also use word alignments to inspect errors or automatically generate data for other tasks (Blatz et al., 2004; Carpuat and Wu, 2007; Popović and Ney, 2011; Bach et al., 2011, among others).

5.1 Micro-analysis: WADE

We define Word Alignment Driven Evaluation, or WADE, which is a technique for analyzing MT system output at the word level, allowing us to (1) manually browse visualizations of MT output annotated with S4 error types, and (2) aggregate counts of errors. WADE is based on the fact that we can automatically word-align a French test sentence and its English reference translation, and the MT decoder naturally produces a word alignment between a French sentence and its machine translation. We can then check whether the MT output has the same set of English words aligned to each French word that we would hope for, given the reference.

In some ways, WADE is similar to the word-based analysis technique of Popović and Ney (2011). However, in contrast to that work, we do not directly align the hypothesis and reference translations but, rather, pivot through the source text. Additionally, we use WADE to annotate S4 errors, which are driven more by how lexical choice is made within the SMT framework than by linguistic properties of words in the reference and hypothesis translations. For example, in the case of domain adaptation, we do not expect the rate of inflectional errors to be affected by domain shift.

In WADE, the unit of analysis is each word alignment between a French word, f_i, and a reference English word, e_j. To annotate the aligned pair, a_{i,j}, we consider the word(s), H_i, in the output English sentence which are aligned (by the decoder) to f_i. If e_j appears in the set H_i, then the alignment a_{i,j} is marked correct. If not, the alignment is categorized with one of the S4 error types. If the French word f_i does not appear in the phrase table used for translation, then the alignment is marked as a SEEN error. If f_i does appear in the phrase table, but it is never observed translating as e_j, then the alignment is marked as a SENSE error. If f_i had been observed translating as e_j, but the decoder chose an alternate translation, then the alignment is marked as a SCORE error. Our results in §7 show that SEARCH errors are very infrequent, so we mark all errors other than SEEN and SENSE as SCORE errors. We make use of one additional category: Freebie. Our MT system copies unseen (aka “OOV”) French words into the English output, and “freebies” are French words for which this is correct.

For WADE analysis only, we use the alignments yielded by a model trained over our train and test datasets and the grow-diag-final heuristic. Because WADE’s unit of analysis is each alignment link between the source text and its reference, it ignores unaligned words in the input source text.

Figure 1 shows an example of a WADE-annotated sentence. In addition to providing an easy way to visualize and browse the errors in MT output, WADE allows us to aggregate counts over the S4 error types. In our analysis (§7), we present results that show not only total numbers of each error type but also how WADE annotations change when we introduce some NEW-domain parallel training data. For example, SEEN errors could remain SEEN errors, become correct, or become SENSE or SCORE errors when we introduce additional training data.

FIGURE 1: Example of WADE visualization. Dashed boxes around the French input mark the phrase spans used by the decoder.

5.2 Macro-analysis: TETRA

In this section, we discuss an approach to measuring the effect of each potential source of error when a translation system is considered in full. The key
idea is to enhance the translation model of OLD, an MT system trained on old domain parallel text, to compare the impact of potential sources of improvement. We use parallel new domain data to propose enhancements to the OLD system. This provides a realistic measure of what could be achieved if one had access to parallel data in the new domain.

The specific system we build, called MIXED, is a linear interpolation of a translation model trained only on old domain data and a model trained only on new domain data (Foster and Kuhn, 2007). The mixing weights are selected via grid search on a tuning set, selecting for BLEU. We call our approach TETRA: Table Enhancement for Translation Analysis.

Below, we design experiments to tease apart the differences in domains by adjusting the models and enhancing OLD to be more like MIXED. We perform different enhancements depending on the error category we are targeting. As discussed in §6, our experiments are conducted using phrase-based SMT systems, so the translation models (TM) that are enhanced are the phrase table and reordering table.

Seen: In order to estimate the effect of SEEN errors, we enhance the TM of OLD by adding phrase pairs that translate words found only in the new-domain data, and we measure the BLEU improvement. More precisely, we identify the set of phrase pairs in the TM of MIXED for which the French side contains at least one word that does not appear in the old-domain training data. These are the phrases responsible for the SEEN errors. We build system TETRA+SEEN by adding these phrases to the TM of OLD. When adding these phrases, we add them together with their feature value scores.

Sense: Analogously, the phrases responsible for SENSE errors are those from MIXED where the French side exists in the phrase table of OLD, but their English translations do not. We build TETRA+SENSE by adding these phrases to OLD.

Score: To isolate and measure the effect of phrase scores, we consider the phrases that our OLD and MIXED systems have in common: the intersection of their translation tables. We build two systems, OLDSCORE and NEWSCORE, with identical phrase pairs; in OLDSCORE, the feature values are taken from the OLD system’s tables; in NEWSCORE the feature values are taken from the MIXED system’s tables.

6 Experimental conditions

6.1 Domains and Data

We conduct our study on French-English datasets. We consider five very different domains for which large corpora are publicly available. The largest corpus is the Hansard parliamentary proceedings. Corpora in the four other domains are smaller and more specialized, and, thus, more naturally serve as new domains. For each new domain, we use all available data. We do not attempt to hold the amount of new domain data constant, as we suspect that such artificial constraints would not be sufficient to control for the very different natures of the domains. Detailed statistics for the parallel corpora are given in Table 2.

Domain     Sentences     L    Tokens    Types   # Phrases
Hansard    8,107,356     fr   161.7m    192k    479.0m
                         en   144.5m    187k
News       135,838       fr   3.9m      63k     12.4m
                         en   3.3m      52k
EMEA       472,231       fr   6.5m      35k     4.4m
                         en   5.9m      30k
Science    139,215       fr   4.3m      118k    8.4m
                         en   3.6m      114k
Subs       19,239,980    fr   155.0m    362k    364.7m
                         en   174.4m    293k
TABLE 2: Basic characteristics of the training data: number of sentences, tokens, word types and number of phrase pairs in the phrase tables.

Hansard: Canadian parliamentary proceedings, consisting of manual transcriptions and translations of meetings of Canada’s House of Commons and its committees from 2001 to 2009. Discussions cover a wide variety of topics, and speaking styles range from prepared speeches by a single speaker to more interactive discussions. It is significantly larger than Europarl, the common source of old domain data.
EMEA: Documents from the European Medicines Agency, made available with the OPUS corpora collection (Tiedemann, 2009). This corpus primarily consists of drug usage guidelines.
News: News commentary corpus made available for the WMT 2009 evaluation. It has been commonly used in the domain adaptation literature (Koehn and Schroeder, 2007; Foster and Kuhn, 2007; Haddow and Koehn, 2012, for instance).
Science: Parallel abstracts from scientific publications in many disciplines including physics, biology, and computer science. We collected data from two distinct sources: (1) Canadian Science Publishing made available translated abstracts from their journals, which span many research disciplines; (2) parallel abstracts from PhD theses in Physics and Computer Science collected from the HAL public repository (Lambert et al., 2012).
Subs: Translated movie subtitles, available through the OPUS corpora collection (Tiedemann, 2009). In contrast to the other domains considered, subtitles consist of informal noisy text.⁴

In this study, we use the Hansard domain as the OLD domain, and we consider four possible NEW domains: EMEA, News, Science and Subs. Datasets for all domains were processed consistently. After tokenization, we paid particular attention to normalization in order to minimize artificial differences when combining data, such as American, British and Canadian spellings. This proved particularly important for the news domain; the impact of SEEN was reduced by more than half after normalization.

6.2 MT systems

We build standard phrase-based SMT systems using the Moses toolkit (Koehn et al., 2007) for all experiments. Each system scores translation candidates using standard features: 5 phrase-table features, including phrasal translation probabilities and lexical weights in both translation directions, and a constant phrase penalty; 6 lexicalized reordering features, including bidirectional models built for monotone, swap, and discontinuous reorderings; 1 distance-based reordering feature; and 2 language models, a 5-gram model learned on the OLD domain, and a 5-gram model learned on the NEW domain. Features are combined using a log-linear model optimized for BLEU, using the n-best batch MIRA algorithm (Cherry and Foster, 2012). This results in a strong large-scale OLD system, which performs well on the old domain and is a good starting point for studying domain shifts.⁵ The word alignments, language models and tuning sets are kept constant across all experiments per domain.

For reference, we built systems using NEW domain data only; these achieved BLEU scores as follows: News = 21.70, EMEA = 34.63, Science = 30.72, Subs = 18.51.

7 Results

Before moving on to the interesting results, we show that SEARCH is not a major source of error. We analyzed search errors separately by computing BLEU scores for each domain with varying beam size from 10 to 1000, using the OLD system. We find that increasing the beam from 10 to 200 yields approximately a one BLEU point advantage across all domains. Increasing it further (to 500 or 1000) does not bestow any additional advantages. This suggests that for sufficiently wide beams, search is unlikely to contribute to adaptation errors.⁶ This is consistent with previous results obtained in non-adapted settings using other measurement techniques: search errors account for less than 5% of the error in modern MT systems (Wisniewski et al., 2010), or 0.13% for small beam settings with a “gap constraint” (Chang and Collins, 2011). We use a beam value of 200 for all other experiments in this work.

7.1 Quantitative Results

Results are summarized in Tables 3 and 4. Table 3 gives an overview of our WADE analysis on test sets in each domain translated using OLD and MIXED models. Table 4 shows BLEU score results based on the TETRA analysis. We first present general observations based on each set of results.

WADE shows that for news, new domain data helps solve only a small number of SEEN issues, and SENSE and SCORE errors remain essentially unchanged. TETRA agrees that SENSE and SCORE are not issues in this domain. In general, the OLD system performs better on news than on the other three domains. For comparison, using the OLD system to translate a test set in the old (Hansard) domain yields a BLEU score of 37.41 and, according to our WADE analysis, 67.64% of all alignments are

⁴ Hansards, News and the Canadian Science Publishing data are available, respectively, at: http://www.parl.gc.ca, http://www.statmt.org/wmt09/translation-task.html, and http://www.nrcresearchpress.com; preprocessed versions and data splits used in this paper can be downloaded from http://hal3.name/damt.
⁵ We use (unadapted) HMM word alignments (Vogel et al., 1996) in both directions, combined using grow-diag-final (Koehn et al., 2005). We estimate alignments jointly on all datasets. Thus, TETRA may have artificially good phrase tables.
⁶ This is likely dependent on language choice and the large amount of old domain parallel data.
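
Since Table 4 reports BLEU for the TETRA-enhanced systems, it may help to recall how those systems are assembled. The sketch below restates the §5.2 selection rules as set operations in Python. It is an illustration under simplifying assumptions, not the authors' code: a phrase table is a hypothetical dictionary from a (French phrase, English phrase) pair to a single score, whereas the real tables carry several feature values plus a reordering table; the SENSE rule is interpreted as "the French side is in OLD but this particular pair is not"; and all function names and toy entries are invented for the example.

```python
def seen_additions(mixed_table, old_vocab):
    """MIXED pairs whose French side has a word absent from old-domain data."""
    return {
        pair: score
        for pair, score in mixed_table.items()
        if any(word not in old_vocab for word in pair[0].split())
    }


def sense_additions(old_table, mixed_table):
    """MIXED pairs whose French side is in OLD but whose translation is not."""
    old_sources = {src for src, _ in old_table}
    return {
        pair: score
        for pair, score in mixed_table.items()
        if pair[0] in old_sources and pair not in old_table
    }


def score_tables(old_table, mixed_table):
    """OLDSCORE and NEWSCORE: the shared pairs, scored by OLD and by MIXED."""
    shared = old_table.keys() & mixed_table.keys()
    old_score = {pair: old_table[pair] for pair in shared}
    new_score = {pair: mixed_table[pair] for pair in shared}
    return old_score, new_score


# Toy tables with made-up scores (assumptions for illustration only).
old_table = {
    ("mode", "fashion"): 0.6,
    ("administration", "directors"): 0.5,
    ("administration", "administration"): 0.4,
}
mixed_table = {
    ("mode", "fashion"): 0.2,
    ("mode", "method"): 0.7,
    ("voie(s)", "route(s)"): 0.9,
    ("administration", "directors"): 0.1,
    ("administration", "administration"): 0.8,
}
old_vocab = {"mode", "administration"}

# TETRA+SEEN and TETRA+SENSE: OLD plus the corresponding additions.
tetra_plus_seen = {**old_table, **seen_additions(mixed_table, old_vocab)}
tetra_plus_sense = {**old_table, **sense_additions(old_table, mixed_table)}
old_score, new_score = score_tables(old_table, mixed_table)
```

In this toy setup the SEEN enhancement adds only the pair for “voie(s)”, the SENSE enhancement adds only “mode” translated as “method”, and OLDSCORE and NEWSCORE contain the same three pairs with different scores.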
correct. As in the news domain, most of the errors are SCORE followed by SENSE and then SEEN. For the other three domains, the two evaluation methods agree that SEEN is a fairly substantial problem. TETRA believes that SENSE is a fairly substantial issue, but WADE does not show this for Science. For SCORE, TETRA detects significant room for improvement, especially for Science.

           % Correct        % Seen Errors            % Sense Errors           % Score Errors
Domain     OLD     MIXED    OLD     MIXED   %∆       OLD     MIXED   %∆       OLD     MIXED   %∆
News       57.42   57.77    5.73    5.38    -6%      11.02   11.13   +1%      25.84   25.72   +0%
EMEA       55.97   62.60    9.28    4.01    -56%     16.16   13.76   -15%     18.59   19.63   +6%
Science    56.20   61.50    10.22   5.63    -45%     13.58   13.11   -3%      20.00   19.77   -1%
Subs       55.76   59.58    5.25    1.67    -68%     13.93   9.71    -30%     25.06   29.04   +15%
TABLE 3: WADE: Percent correct, percent seen errors, percent sense errors, and percent score errors. The changes (%∆) from OLD to MIXED are also given; here, negative changes are good (error reduction).

Domain     OLD      +SEEN           +SENSE          OLD vs MIXED SCORE       MIXED
News       22.81    23.87   +0%     23.95   +1%     23.72   23.86   +1%      24.15
EMEA       28.69    31.02   +8%     30.59   +7%     28.89   30.21   +5%      36.60
Science    26.13    27.72   +6%     27.29   +4%     26.09   28.68   +10%     32.23
Subs       15.10    15.96   +6%     16.41   +9%     14.99   16.25   +8%      18.49
TABLE 4: TETRA: Results on all new domains using OLD and MIXED models (first and last columns), OLD enhanced with seen translations (second), sense translations (third), and scores (fourth), together with percent improvements in terms of BLEU score. Here, positive improvements are good (higher BLEU scores).

The large changes in BLEU score found with TETRA are somewhat surprising given how little the phrase tables change in each of these experimental conditions. For News, EMEA, and Science, adding unseen words results in an increase in the number of phrase pairs between 0.045% (News) and 0.3% (Science). The sense additions were similarly small: from 0.15% (EMEA) to 0.59% (News). For Subs the story was different: adding unseen words amounted to a growth of 4.2% in phrase table size; sense amounted to 25.1%. In all cases, the size of the score phrase tables was only 0.05% smaller than that of OLD.

At first glance, the WADE and TETRA analyses of the SCORE error type seem to contradict each other. The MIXED systems are worse in terms of SCORE (positive deltas, more errors than OLD), but have better BLEU scores. To understand this discrepancy, we must recognize that TETRA analyzes the score errors in isolation: by restricting the phrase tables to the intersection of the OLD and MIXED domain phrase tables, we remove all seen and sense errors. In the WADE analysis, however, many errors that “used to be” SEEN errors in the old domain become SCORE errors in the new domain.

To see the full picture, we must look at how the different error categories change from the OLD system to the MIXED system in WADE. This is shown in Table 5. In this table, the rightmost column contains the total percentage of errors in the OLD systems; the rows labeled Total show the total percentage of errors in the MIXED systems; the remaining cells show how these errors change from OLD to MIXED. For the news domain, the OLD system has 25.8% SCORE errors. Of those, 2.2% are fixed in the MIXED system.

                      Correct          Incorrect
                      Cor     Seen     Score   Seen    Sense   Total
News       Cor        53.7    0.0      1.9     0.0     0.0     55.6
           Seen-C     0.1     1.7      0.0     0.0     0.0     1.8
           Score      2.2     0.0      23.6    0.0     0.0     25.8
           Seen-I     0.1     0.0      0.0     5.4     0.3     5.7
           Sense      0.0     0.0      0.2     0.0     10.8    11.0
           Total      56.1    1.7      25.7    5.4     11.1    100
EMEA       Cor        48.3    0.0      3.1     0.0     0.0     51.5
           Seen-C     1.6     2.8      0.1     0.0     0.0     4.5
           Score      5.3     0.0      13.3    0.0     0.0     18.6
           Seen-I     2.3     0.0      0.5     4.0     2.5     9.3
           Sense      2.3     0.0      2.6     0.0     11.3    16.2
           Total      59.8    2.8      19.6    4.0     13.8    100
Science    Cor        49.8    0.0      3.6     0.0     0.0     53.3
           Seen-C     1.4     1.4      0.0     0.0     0.0     2.9
           Score      5.8     0.0      14.2    0.0     0.0     20.0
           Seen-I     1.8     0.0      0.3     5.6     2.5     10.2
           Sense      1.4     0.0      1.6     0.0     10.6    13.6
           Total      60.1    1.4      19.8    5.6     13.1    100
Subtitles  Cor        52.4    0.0      2.5     0.0     0.0     54.8
           Seen-C     0.6     0.3      0.0     0.0     0.0     0.9
           Score      4.5     0.0      20.6    0.0     0.0     25.1
           Seen-I     1.1     0.0      0.5     1.7     2.0     5.3
           Sense      0.8     0.0      5.4     0.0     7.7     13.9
           Total      59.3    0.3      29.0    1.7     9.7     100
TABLE 5: Percent of WADE annotation changes moving from OLD (rows) to MIXED (columns) models, for each domain. Non-zero off-diagonals are bolded. Seen-C indicates Freebies, and Seen-I indicates unseen words that were mistranslated.

For the three domains of interest (all except news), addressing SEEN errors can be substantially
helpful, in terms of both BLEU score and the fine-grained distinctions considered by WADE. The more interesting conclusion, however, is that simply bringing in new words isn’t enough. Table 5 shows that in these three domains there are a substantial number of errors that transition from being SEEN-Incorrect to SENSE-Incorrect. This indicates that besides observing a new word, we must also observe it with all of its correct translations.

Likewise, there is a lot to be gained in BLEU by correcting new SENSE translation errors (essentially the same percentage as for SEEN). But this is harder to solve. We can see in Table 5 that from the SENSE errors of the OLD system, half become correct but the other half become SCORE errors. So giving appropriate scores to the new senses is a challenge. This makes sense: these new senses are now “competing” with old ones, and getting the interpolation right between old and new domain tables is difficult.

For SCORE, the situation is more complicated. Our TETRA analysis clearly indicates that there is room for improvement. But this is based on intersected phrase tables, from which we removed seen and sense distinctions, and in which there is no competition between phrases from the OLD and NEW systems. The WADE analysis shows a positive effect only for Science. The data in Table 5 shows that a lot (5.8/20) of the errors are corrected, but we also introduce a number of additional errors (3.6% that were correct, 0.3% that were SEEN and 1.6% that were SENSE). Similarly, in the EMEA domain, we fix 5% of 18% of SCORE errors but introduce 2.6% that were new sense errors before, 0.5% that were SEEN errors before, and make 3% additional error on words we got right before. Subs is similar: out of 25% SCORE errors we fix 4.5%, but introduce 0.5% from SEEN and 5.4% from SENSE, and suffer additional error on 2.5% of what we had correct before.

7.2 Qualitative Results

Table 7 shows examples of the French words that WADE frequently identified as incorrectly translated by the OLD system due to SCORE or SENSE but that were correctly translated under the MIXED system.⁷ For example, in the Science domain, ‘mesures’ suffered from SCORE errors under the OLD system. While its correct translation was often ‘measurements,’ the OLD system preferred its most probable translations (‘savings,’ ‘actions,’ ‘issues,’ and ‘provisions’). Thirty of these error cases were correctly translated by the MIXED system. Similarly, in the Science domain, the French word ‘finis,’ when it should have been translated as ‘finite,’ was translated incorrectly due to a sense error 27 times. Its most frequent translations under the OLD system were ‘finish,’ ‘finished,’ and ‘more.’ The MIXED system corrected these sense errors. We omit examples of where seen errors made by OLD were frequently corrected by the MIXED system because they tend to be less interesting. Examples can be found in Daumé III and Jagarlamudi (2011).

We annotate the French test sentences using the Stanford part-of-speech (POS) tagger (Toutanova et al., 2003) and examine which POS categories correspond to the most errors of each type. Using the OLD system, new sense errors in the Subs domain are made on French nouns 40% of the time and on verbs 35% of the time. In EMEA, 51% are nouns and 23% are adjectives; in Science, 51% nouns and 20% adjectives; in News, 46% nouns and 23% verbs. Seen errors show a very similar trend: in the Subs domain 50% are nouns and 25% verbs; in EMEA, 48% are nouns and 37% adjectives; in Science, 46% are nouns and 40% adjectives; in News, 46% are nouns and 28% adjectives. Similarly, for all domains, more score errors are made on source nouns than any other POS category. In summary, we find that most errors correspond to source language nouns, followed by adjectives, except for Subs, where verbs are also commonly mistranslated due to all error types.

Table 6 (left) shows some examples of how TETRA can automatically estimate the errors due to unseen words when moving to a new domain. For example, the OCR error “miie” in the source sentence is correctly translated as “miss” by the enhanced system. The enhanced phrase tables of TETRA can also automatically estimate the errors due to poor lexical choice when moving to a new domain, and can select a more lucid translation term. For example, the enhanced system appropriately selected “shoot” instead of “growth” in the Science example in Table 6 (middle). When TETRA inserts the scores from the new domain into the translation tables, the system produces translations that take on the flavor of the new domain, yielding higher BLEU scores. This can be observed in Table 6 (right), where the TETRA-enhanced system used the science-specific word “equilibrium” rather than the political word “balance.”

Left (seen errors)
  Medical
    Inp: en ce qui concerne les indications thérapeutiques pour l’insuffisance cardiaque, le tamm a proposé le texte suivant: « traitement de l’insuffisance cardiaque congestive. »
    Ref: regarding the therapeutic indications for heart failure, the mah proposed the wording: “treatment of congestive heart failure.”
    Old: for therapeutic indications for heart failure, tamm proposed the following: treatment of congestive heart failure.
    Fix: for therapeutic indications for heart failure, the mah has suggested the following: treatment of congestive heart failure.
  Science
    Inp: les résultats à la base de cette hypothése sont révisés.
    Ref: findings that form the basis of this hypothesis are reviewed.
    Old: the results at the base of this assumption are reviewed.
    Fix: the results of this hypothesis are reviewed.
  Subtitles
    Inp: Bonne nuit, MIIe Kenton.
    Ref: good night, miss kenton.
    Old: good night, miie kenton.
    Fix: good night, miss kenton.
Middle (sense errors)
  Medical
    Inp: ce médicament est une solution limpide, incolore à jaune pâle.
    Ref: this medicinal product is a clear, colorless to pale yellow solution.
    Old: lantus is a clear, colorless to pale yellow.
    Fix: this medicine is a solution is clear, colorless to pale yellow.
  Science
    Inp: tous les traitements ont augmenté la production de pousses.
    Ref: all treatments increased shoot production.
    Old: all treatments have increased the production of growth.
    Fix: all treatments increased shoot production.
  Subtitles
    Inp: Le sexe c’est naturel.
    Ref: sex is natural.
    Old: the sex is natural.
    Fix: sex is natural.
Right (score errors)
  Medical
    Inp: les deux substances actives ont des effets inverses sur la kaliémie.
    Ref: the two active substances have inverse effects on plasma potassium.
    Old: the two active substances are reverse effects on the kaliémie.
    Fix: the two active substances have side inverses on kaliémie.
  Science
    Inp: par ailleurs, les constantes d’équilibre sont plus faibles.
    Ref: in contrast, the equilibrium constants are lower.
    Old: furthermore, the constant balance are lower.
    Fix: moreover, the equilibrium constants are lower.
  Subtitles
    Inp: Je bouge mieux.
    Ref: i move better.
    Old: i get better.
    Fix: i move better.
TABLE 6: Example MT results obtained by fixing seen errors (left), sense errors (middle) and score errors (right). Includes source, a reference translation, the output of the OLD system and the output obtained via TETRA methodology.

D     #    French        Correct-E       Hansard-E
Score → Correct
E     21   doit          should          has must needs shall requires
E     8    association   combination     partnership association
E     6    noms          names           names people nominee speakers
Sc    30   mesures       measurements    savings actions issues provisions
Sc    27   courant       current         knowledge knew heads-up
Sc    26   article       paper           standing clause order section
Su    9    comme         like            as because like akin how sort
Su    5    maison        house           home house homes head place
Su    4    fric          money           cash dough money fric bucks loot
Sense → Correct
E     9    notice        leaflet         informed directions notice
E     8    perfusion     infusion        perfusion intravenous
E     8    molles        capsules        lax limpid soft weak
Sc    27   finis         finite          finish finished more
Sc    18   jonctions     junctions       junction (only once)
Sc    10   substrats     substrates      corn streams area substrata
Su    5    emmerde       fuck            annoying (only once)
Su    3    redites       say             repetitious tell covered again
Su    3    mec           man             cme guy mec
TABLE 7: For score/sense errors, in (E)MEA, (Sc)ience and (Su)bs, frequent French words that fall into that category (by WADE), as well as the corrected translations and the most frequent OLD translations.

⁷ Complete output lists are available at http://hal3.name/damt

7.3 Results on an Adapted System

To show how WADE can be used on already adapted systems, we performed a simple experiment based on a standard adaptation technique. We used bilingual cross-entropy difference (Axelrod et al., 2011) to quantify the distance between each OLD domain sentence pair and each NEW domain. We selected the top K closest sentences for each domain. For EMEA and Science, we set K to the size of the NEW domain data. For Subs, this would select nearly all of Hansard, so we arbitrarily set K = 1m. (We excluded the news domain.) We took this data, concatenated it to the NEW domain data, trained full models, and ran the WADE analysis on their outputs.

The trends across the three domains were remarkably similar. In all, SCORE errors in the adapted system were lower by around 2% than even the MIXED baseline (as much as 4% for Subs). This is likely because by excluding parts of the OLD domain most unlike the relevant NEW domain, the correct sense is observed more often. However, this c
omesataprice:SENSEandSEENerrorsgoupabout1%or2%each.Thissuggeststhatamorefine-grainedadaptationap-proachmightachievethebestofbothworlds.8LimitingAssumptionsThispaperrepresentsapartialexplorationofthespaceofpossibleassumptionsaboutmodelsanddata.Wecannothopetoexplorethecombinatorialexplosionofpossibilities,andthereforehaverestric-tedouranalysistothefollowingsettings:Phrase-basedmodels.Allofourexperimentsarecarriedoutusingphrase-basedtranslation,asimple-mentedintheopen-sourceMosestranslationsystem(Koehnetal.,2007)toensurethattheyarerepro-ducible.Ourmethodsareeasilyextendedtohierar-chicalphrase-basedmodels(Chiang,2007).Itisnotclearwhetherthesameconclusionswouldhold:en
the one hand, complex phrasal rules might overfit even more badly than phrases; on the other hand, hierarchical models might have more flexibility to generalize structures.

Translation languages. We only translate from French to English. This well-studied language pair presents several advantages: large quantities of data are publicly available in a wide variety of domains, and standard statistical machine translation architectures yield good performance. Unlike with more distant languages such as Chinese-English, or languages with radically different morphology or word order such as German or Czech, we know that the old-domain translation quality is high, and that translation failures during domain shift can be primarily attributed to domain issues rather than to problems with the SMT system.

Constant old domain. Our old domain is from Hansards, and we only vary our new domain. It would be interesting to consider other datasets as old domains. We deliberately only use the Hansard data: based on its size and scope, we assume that it yields the most general of our SMT systems.

Monolingual new-domain data. We assume that we always have access to monolingual English data in the new domain for learning a domain-specific language model. Our focus is on the effect of the translation model; the effect of adapting language models has been studied previously (see §3). Without access to a new-domain language model, the effect of unseen words and words with new senses is likely to be dramatically underestimated, because their translations are likely to be "thrown out" by an old-domain LM. Moreover, since SCORE errors conflate language model and translation model scores, using a new-domain language model lets us mostly isolate the effect of the translation model.

Parallel new-domain data for tuning. We assume that we always have access to a small amount of parallel data in the new domain, essentially for the purpose of running parameter tuning. Without this, one would not even be able to evaluate the performance of one's system, typically a non-starter.

Automatic word alignments for WADE. WADE is fundamentally based upon word alignments, so alignment errors may affect its accuracy. Such errors are obvious when manually inspecting sentence triples using the visualizer. When developing this tool, we checked that alignment noise does not
invalidate conclusions drawn from WADE counts. In order to estimate how much alignment errors affect WADE, a French speaker manually corrected the word alignments for 955 EMEA test set sentences. The analyses based on the manually corrected alignments show fewer errors overall, but the erroneous annotations appear to be randomly distributed among all categories (details omitted for space). As a result, we believe that WADE yields results which are informative despite the inevitable automatic alignment errors. In particular, because alignments between a test and reference set are held constant in a system comparison, such errors should impact all analyses in the same way.

9 Discussion

Translation performance degrades dramatically when migrating an SMT system to a new and different domain. Our work demonstrates that the majority of this degradation in performance is due to SEEN and SENSE errors: namely, unknown source-language words and known source-language words with unknown translations. This result holds in all domains we studied, except for news, in which there appears to be little adaptation influence at all (especially after spelling normalization).

Our two analysis methods, WADE (Section 5.1) and TETRA analysis (Section 5.2), are both lenses on the adverse effects of domain mismatch. Using WADE, we are able to pinpoint precise translation errors and their sources. This could be extended to more nuanced, human-assisted analysis of adaptation effects. WADE also "labels" translations with different error types, which could be used to train more complex models. Using TETRA, we are able to see how these errors affect overall translation performance. In principle, this performance could be any measure, including human assessment. We started with the BLEU metric since it is most widely used in the community. One point of possible improvement would be to replace exact string match in WADE, and BLEU in TETRA, with metrics that are more morphologically or semantically informed.

Error analysis opens the door to building adapted machine translation systems that directly target specific error categories. As we have seen, most existing domain adaptation techniques in MT aim to improve translation quality in general, and are accordingly evaluated using corpus-level metrics such as BLEU. Our intuitive finer-grained analysis suggests that finer-grained models might be better suited to understanding and comparing the errors made by adapted and unadapted systems. We have shown that considering the S4 taxonomy is important: improving coverage, for example, does not necessarily improve translation quality. Translation candidates must also be complete and must be scored correctly. Our techniques provide an intuitive way to understand the effectiveness of new MT domain adaptation approaches.

Acknowledgments

We gratefully acknowledge the support of the 2012 JHU Summer Workshop and NSF Grant No. 1005411, as well as the NRC for Marine Carpuat, and DARPA CSSG Grant D11AP00279 for Hal Daumé III. We would like to thank the entire DAMT team (http://hal3.name/damt/) and Sanjeev Khudanpur for their invaluable help and suggestions, as well as all the reviewers for their insightful feedback.

References

Sankaranarayanan Ananthakrishnan, Rohit Prasad, and Prem Natarajan. 2011. On-line language model biasing for statistical machine translation. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Nguyen Bach, Fei Huang, and Yaser Al-Onaizan. 2011. Goodness: A method for measuring machine translation confidence. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).

Pratyush Banerjee, Sudip Kumar Naskar, Johann Roturier, Andy Way, and Josef van Genabith. 2012. Translation quality-based supplementary data selection by incremental update of translation models. In Proceedings of the International Conference on Computational Linguistics (COLING).

Arianna Bisazza, Nick Ruiz, and Marcello Federico. 2011. Fill-up versus interpolation methods for phrase-based SMT adaptation. International Workshop on Spoken Language Translation (IWSLT).

John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2004. Confidence estimation for machine translation. In Proceedings of the International Conference on Computational Linguistics (COLING).

Marine Carpuat and Dekai Wu. 2007. Improving statistical machine translation using word sense disambiguation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Marine Carpuat, Hal Daumé III, Katharine Henry, Ann Irvine, Jagadeesh Jagarlamudi, and Rachel Rudinger. 2013. SenseSpotting: Never let your parallel data tie you to an old domain. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).

Yin-Wen Chang and Michael Collins. 2011. Exact decoding of phrase-based translation models through lagrangian relaxation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Hal Daumé III and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).

Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Analysis of translation model adaptation in statistical machine translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).

George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. In Proceedings of the Workshop on Statistical Machine Translation (WMT).

George Foster, Cyril Goutte, and Roland Kuhn. 2010. Discriminative instance weighting for domain adaptation in statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Zhengxian Gong, Min Zhang, and Guodong Zhou. 2011. Cache-based document-level statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Barry Haddow and Philipp Koehn. 2012. Analysing the effect of out-of-domain data on SMT systems. In Proceedings of the Workshop on Statistical Machine Translation (WMT).

Almut Silja Hildebrand, Matthias Eck, Stephan Vogel, and Alex Waibel. 2005. Adaptation of the translation
model for statistical machine translation based on information retrieval. In Proceedings of the Conference of the European Association for Computational Linguistics (EACL).

Ann Irvine, Chris Quirk, and Hal Daumé III. 2013. Monolingual marginal matching for translation model adaptation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Workshop on Statistical Machine Translation (WMT).

Philipp Koehn, Amittai Axelrod, Alexandra Birch, Chris Callison-Burch, Miles Osborne, David Talbot, and Michael White. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of International Workshop on Spoken Language Translation.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).

Patrik Lambert, Holger Schwenk, and Frédéric Blain. 2012. Automatic translation of scientific documents in the HAL archive. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

Thomas Lavergne, Alexandre Allauzen, Hai-Son Le, and François Yvon. 2011. LIMSI's experiments in domain adaptation for IWSLT11. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).

Yajuan Lu, Jin Huang, and Qun Liu. 2007. Improving statistical machine translation performance by training data selection and optimization. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Saab Mansour, Joern Wuebker, and Hermann Ney. 2011. Combining translation and language model scoring for domain-specific data filtering. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).

Spyros Matsoukas, Antti-Veikko I. Rosti, and Bing Zhang. 2009. Discriminative corpus weight estimation for machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Rada Mihalcea, Ravi Sinha, and Diana McCarthy. 2010. SemEval-2010 Task 2: Cross-Lingual Lexical Substitution. In Proceedings of the 5th International Workshop on Semantic Evaluation.

Jan Niehues and Alex Waibel. 2010. Domain adaptation in statistical machine translation using factored translation models. In Proceedings of the European Association for Machine Translation (EAMT).

Maja Popović and Hermann Ney. 2011. Towards automatic error analysis of machine translation output. Computational Linguistics, 37(4).

Majid Razmara, George Foster, Baskaran Sankaran, and Anoop Sarkar. 2012. Mixing multiple translation models in statistical machine translation. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).

Rico Sennrich. 2012. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the Conference of the European Association for Computational Linguistics (EACL).

Jörg Tiedemann. 2009. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing (RANLP).

Jörg Tiedemann. 2010. To cache or not to cache? Experiments with adaptive models in statistical machine translation. In Proceedings of the ACL Workshop on Statistical Machine Translation and Metrics (MATR).

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL.

David Vilar, Jia Xu, Luis Fernando D'Haro, and Hermann Ney. 2006. Error analysis of statistical machine translation output. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the International Conference on Computational Linguistics (COLING).

Guillaume Wisniewski, Alexandre Allauzen, and François Yvon. 2010. Assessing phrase-based translation models with oracle decoding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Bing Zhao, Matthias Eck, and Stephan Vogel. 2004. Language model adaptation for statistical machine translation with structured query models. In Proceedings of the International Conference on Computational Linguistics (COLING).
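
Supplementary sketch for Section 7.3. The data selection step described there (ranking OLD-domain sentence pairs by bilingual cross-entropy difference and keeping the top K) can be illustrated roughly as follows. This is a minimal sketch and not the authors' implementation: the function names `train_unigram` and `select_top_k`, the add-one-smoothed unigram language models, and the toy sentence pairs are all assumptions made for illustration. Axelrod et al. (2011) use n-gram language models estimated on much larger data.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return a function mapping a sentence to its per-word cross-entropy (bits).

    Assumption: an add-one-smoothed unigram model stands in for a real n-gram LM.
    """
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot so unseen words get non-zero mass

    def cross_entropy(sentence):
        words = sentence.split()
        if not words:
            return 0.0
        log_prob = sum(math.log2((counts[w] + 1) / (total + vocab)) for w in words)
        return -log_prob / len(words)

    return cross_entropy


def select_top_k(old_pairs, new_pairs, k):
    """Keep the k OLD pairs with the lowest bilingual cross-entropy difference."""
    h_new_src = train_unigram([f for f, _ in new_pairs])
    h_new_tgt = train_unigram([e for _, e in new_pairs])
    h_old_src = train_unigram([f for f, _ in old_pairs])
    h_old_tgt = train_unigram([e for _, e in old_pairs])

    def score(pair):
        f, e = pair
        # Lower means "more like NEW than like OLD" on both language sides.
        return (h_new_src(f) - h_old_src(f)) + (h_new_tgt(e) - h_old_tgt(e))

    return sorted(old_pairs, key=score)[:k]


# Hypothetical toy data: a medical-style NEW domain and a Hansard-style OLD pool.
NEW = [
    ("le médicament est une solution", "the medicine is a solution"),
    ("traitement de l' insuffisance cardiaque", "treatment of heart failure"),
]
OLD = [
    ("monsieur le président", "mr. speaker"),
    ("le traitement est une solution", "the treatment is a solution"),
    ("les pêcheurs de homard", "the lobster fishers"),
]

selected = select_top_k(OLD, NEW, k=1)
print(selected)
```

In the experiment of Section 7.3, the selected pairs would then be concatenated with the NEW-domain parallel data before training; that step is omitted here.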