Operazioni dell'Associazione per la Linguistica Computazionale, vol. 5, pag. 205–218, 2017. Redattore di azioni: Stefan Riezler.
Lotto di invio: 12/2016; Lotto di revisione: 2/2017; Pubblicato 7/2017.
2017 Associazione per la Linguistica Computazionale. Distribuito sotto CC-BY 4.0 licenza.
C
(cid:13)
PushingtheLimitsofTranslationQualityEstimationAndr´eF.T.MartinsUnbabelInstitutodeTelecomunicac¸˜oesLisbon,Portugalandre.martins@unbabel.comMarcinJunczys-DowmuntAdamMickiewiczUniversityinPozna´nPozna´n,Polandjunczys@amu.edu.plFabioN.KeplerUnbabelL2F/INESC-ID,Lisbon,PortugalUniversityofPampa,Alegrete,Brazilkepler@unbabel.comRam´onAstudilloUnbabelL2F/INESC-IDLisbon,Portugalramon@unbabel.comChrisHokampDublinCityUniversityDublin,Irelandchokamp@computing.dcu.ieRomanGrundkiewiczAdamMickiewiczUniversityinPozna´nPozna´n,Polandromang@amu.edu.plAbstractTranslationqualityestimationisataskofgrowingimportanceinNLP,duetoitspoten-tialtoreducepost-editinghumaneffortindis-ruptiveways.However,thispotentialiscur-rentlylimitedbytherelativelylowaccuracyofexistingsystems.Inthispaper,weachieveremarkableimprovementsbyexploitingsyn-ergiesbetweentherelatedtasksofword-levelqualityestimationandautomaticpost-editing.First,westackanew,carefullyengineered,neuralmodelintoarichfeature-basedword-levelqualityestimationsystem.Then,weusetheoutputofanautomaticpost-editingsys-temasanextrafeature,obtainingstrikingre-sultsonWMT16:aword-levelFMULT1scoreof57.47%(anabsolutegainof+7.95%overthecurrentstateoftheart),andaPearsoncorrela-tionscoreof65.56%forsentence-levelHTERprediction(anabsolutegainof+13.36%).1IntroductionThegoalofqualityestimation(QE)istoevaluateatranslationsystem’squalitywithoutaccesstoref-erencetranslations(Blatzetal.,2004;Speciaetal.,2013).Thishasmanypotentialusages:informinganenduseraboutthereliabilityoftranslatedcon-tent;decidingifatranslationisreadyforpublish-ingorifitrequireshumanpost-editing;highlightingthewordsthatneedtobechanged.QEsystemsareparticularlyappealingforcrowd-sourcedandpro-fessionaltranslationservices,duetotheirpotentialtodramaticallyreducepost-editingtimesandtosavelaborcosts(Specia,2011).Theincreasinginterestinthisproblemfromanindustrialanglecomesasnosurprise(Turchietal.,2014;deSouzaetal.,2015;Martinsetal.,2016;Kozlovaetal.,2016).Inthispaper,wetackleword-levelQE,whosegoalistoassignalabelofOKorBADtoeachwordinthetranslation(Figure1).Pastapproachestothisproblemincludelinearclassifierswithhandcraftedfeatures(UeffingandNey,2007;Bic¸ici,2013;Shahetal.,2013;Luongetal.,2014),oftencombinedwithfeatureselection(Avramidis,2012;Becketal.,2013),recurrentneuralnetworks(deSouzaetal.,2014;KimandLee,2016),andsystemsthatcom-binelinearandneuralmodels(Kreutzeretal.,2015;Martinsetal.,2016).Westartbyproposinga“pure”QEsystem(§3)consistingofanew,carefullyen-gineeredneuralmodel(NEURALQE),stackedintoalinearfeature-richclassifier(LINEARQE).Alongtheway,weprovidearigorousempiricalanalysistobetterunderstandthecontributionoftheseveralgroupsoffeaturesandtojustifythearchitectureoftheneuralsystem.Asecondcontributionofthispaperisbring-ingintherelatedtaskofautomaticpost-editing(APE;Simardetal.(2007)),whichaimstoau-
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
5
6
1
5
6
7
4
6
0
/
/
T
l
UN
C
_
UN
_
0
0
0
5
6
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
206
SourceTheSharpentoolsharpensareasinanimage.MTDerSch¨arfen-WerkezugBereicheineinemBildsch¨arfererscheint.PE(reference)MitdemScharfzeichnerk¨onnenSieeinzelneBereicheineinemBildscharfzeichnen.QEBADBADOKOKOKOKBADBADOKkHTER=66.7%Figure1:ExamplefromtheWMT16word-levelQEtrainingset.ShownaretheEnglishsourcesentence,theGermantranslation(MT),itsmanualpost-edition(PE),andtheconversiontowordqualitylabelsmadewiththeTERCOMtool(QE).WordslabeledasOKareshowningreen,andthoselabeledasBADareshowninred.WealsoshowtheHTER(fractionofeditoperationstoproducePEfromMT)computedbyTERCOM.tomaticallycorrecttheoutputofmachinetransla-tion(MT).WeshowthatavariantoftheAPEsys-temofJunczys-DowmuntandGrundkiewicz(2016),trainedonalargeamountofartificial“roundtriptranslations,”isextremelyeffectivewhenadaptedtopredictword-levelqualitylabels(yieldingAPEQE,§4).WefurthershowthatthepureandtheAPE-basedQEsystemarehighlycomplementary(§5):astackedcombinationofLINEARQE,NEURALQE,andAPEQEbooststhescoresevenfurther,leadingtoanewstateoftheartontheWMT15andWMT16datasets.Forthelatter,weachieveanFMULT1scoreof57.47%,whichrepresentsanabsoluteimprovementof+7.95%overthepreviousbestsystem.Finally,weprovideasimpleword-to-sentenceconversiontoadaptoursystemtosentence-levelQE.Thisresultsinanewstateoftheartforhuman-targetedtranslationerrorrate(HTER)prediction,whereweobtainaPearson’srcorrelationscoreof65.56%(+13.36%absolutegain),andforsentenceranking,whichachievesaSpearman’sρcorrelationscoreof65.92%(+17.62%).Wecomplementourfindingswitherroranalysisthathighlightsthesyn-ergiesbetweenpureandAPE-basedQEsystems.2DatasetsandSystemArchitectureDatasets.Fordevelopingandevaluatingoursys-tems,weusethedatasetslistedinTable1.ThesedatasetshavebeenusedintheQEandAPEtasksinWMT2015–2016(Bojaretal.,2015,2016).1Theyspantwolanguagepairs(English-SpanishandEnglish-German)andtwodifferentdomains(newstranslationsandinformationtechnology).Weusedthestandardtrain,developmentandtestsplits.Eachsplitcontainsthesourceandautomaticallytrans-latedsentences(whichweuseasinputs),themanu-1Publiclyavailableathttp://www.statmt.org/wmt15andhttp://www.statmt.org/wmt16.allypost-editedsentences(outputfortheAPEtask),andasequenceofOK/BADqualitylabels,onepereachtranslatedword(outputfortheword-levelQEtask);seeFigure1.Besidesthesedatasets,fortrainingtheAPEsystemwemakeuseofartificialroundtriptranslations;thiswillbedetailedin§4.Evaluation.Forallexperiments,wereporttheof-ficialevaluationmetricsofeachdataset’syear.ForWMT15,theofficialmetricfortheword-levelQEtaskistheF1scoreoftheBADlabels(FBAD1).ForWMT16,itistheproductoftheF1scoresfortheOKandBADlabels(denotedFMULT1).Forsentence-levelQE,wereportthePearson’srcorrelationforHTERpredictionandtheSpearman’sρcorrelationscoreforsentenceranking(Graham,2015).Frompost-editedsentencestoqualitylabels.Inthedatasetsabove,thewordqualitylabelsareob-tainedautomaticallybyaligningthetranslatedandthepost-editedsentencewiththeTERCOMsoft-waretool(Snoveretal.,2006)2,withthedefaultsettings(tokenized,caseinsensitive,exactmatchingonly,shiftsdisabled).ThistoolcomputestheHTER(thenormalizededitdistance)betweenthetranslatedandpost-editedsentence.Asaby-product,italignsthewordsinthetwosentences,identifyingsubstitu-tionerrors,worddeletions(i.e.wordsomittedbythetranslationsystem),andinsertions(redundantwordsinthetranslation).WordsintheMToutputthatneedtobeeditedaremarkedbytheBADqualitylabels.Thefactthatthequalitylabelsareautomaticallyobtainedfromthepost-editedsentencesisnotjustanartifactofthesedatasets,butaprocedurethatishighlyconvenientfordevelopingQEsystemsinanindustrialsetting.Manuallyannotatingword-levelqualitylabelsistime-consumingandexpensive;ontheotherhand,post-editingtranslatedsentencesis2http://www.cs.umd.edu/˜snover/tercom.
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
5
6
1
5
6
7
4
6
0
/
/
T
l
UN
C
_
UN
_
0
0
0
5
6
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
207
DatasetLanguagepair#sents#wordsWMT15,TrainEn-Es11,271257,548WMT15,DevEn-Es1,00023,207WMT15,TestEn-Es1,81740,899WMT16,TrainEn-De12,000210,958WMT16,DevEn-De1,00019,487WMT16,TestEn-De2,00034,531Table1:Datasetsusedinthiswork.commonlypartoftheworkflowofcrowd-sourcedandprofessionaltranslationservices.Thus,gettingqualitylabelsforfreefromsentencesthathaveal-readybeenpost-editedisamuchmorerealisticandsustainableprocess.Thisobservationsuggeststhatwecantackleword-levelQEintwoways:1.PureQE:runtheTERalignmenttool(i.e.TER-COM)onthepost-editeddata,andthentrainaQEsystemdirectlyonthegeneratedqualityla-bels;2.APE-basedQE:trainanAPEsystemontheoriginalpost-editeddata,andatruntimeusetheTERaligmenttooltoconverttheautomaticallypost-editedsentencestoqualitylabels.Fromamachinelearningpespective,QEisase-quencelabelingproblem(i.e.,whoseoutputse-quencehasafixedlengthandasmallnumberoflabels),whileAPEisasequence-to-sequenceprob-lem(wheretheoutputisofvariablelengthandspansalargevocabulary).Therefore,wecanregardAPE-basedQEasa“projection”ofamorecomplexandfine-grainedoutput(APE)intoasimpleroutputspace(QE).APE-basedQEsystemshavethepoten-tialforbeingmorepowerfulsincetheyaretrainedwiththisfiner-grainedinformation(providedthereisenoughtrainingdatatomakethemgeneralizewell).Wereportresultsin§4confirmingthishypothesis.Oursystemarchitecture,describedinfulldetailinthefollowingsections,consistsofstateoftheartpureQEandAPE-basedQEsystems,whicharethencombinedtoyieldanew,morepowerful,QEsystem.3PureQualityEstimationThebestperformingsystemintheWMT16word-levelQEtaskwasdevelopedbytheUnbabelteam(Martinsetal.,2016).Itisapurebutrathercom-plexQEsystem,ensemblingalinearfeature-basedclassifierwiththreedifferentneuralnetworkswithdifferentconfigurations.Inthissection,weprovideasimplerversionoftheirsystem,byreplacingthethreeensembledneuralcomponentsbyasingleone,whichweengineerinaprincipledway.Weevaluatetheresultingsystemonadditionaldata(WMT15inadditiontoWMT16),coveringanewlanguagepairandanewcontenttype.Overall,weobtainaslightlyhigheraccuracywithamuchsimplersystem.Inthissection,wedescribethelinear(§3.1)andneural(§3.2)componentsofoursystem,aswellastheircombination(§3.3).3.1LinearSequentialModelWestartwiththelinearcomponentofourmodel,adiscriminativefeature-basedsequentialmodel(calledLINEARQE),basedonMartinsetal.(2016).Thesystemreceivesasinputatuplehs,T,Ai,wheres=s1…sMisthesourcesentence,t=t1…tNisthetranslatedsentence,andA⊆{(M,N)|1≤m≤M,1≤n≤N}isasetofwordalignments.Itpredictsasoutputasequenceby=y1…yN,witheachyi∈{BAD,OK}.Thisisdoneasfollows:by=argmaxyNXi=1w>φu(S,T,UN,yi)+N+1Xi=1w>φb(S,T,UN,yi,yi−1).(1)Above,wisavectorofweights,φu(S,T,UN,yi)areunigramfeatures(dependingonlyonasingleoutputlabel),φb(S,T,UN,yi,yi−1)arebigramfeatures(de-pendingonconsecutiveoutputlabels),andy0andyN+1arespecialstart/stopsymbols.Features.Table2showstheunigramandbigramfeaturesusedintheLINEARQEsystem.LikethebaselinesystemsprovidedinWMT15/16,wein-cludefeaturesthatdependonthetargetwordanditsalignedsourceword,aswellasthecontextsur-roundingthem.3Adistinctiveaspectofoursys-temistheinclusionofsyntacticfeatures,whichwill3FeaturesinvolvingthealignedsourcewordarereplacedbyNILifthetargetwordisunaligned.Iftherearemultiplealignedsourcewords,theyareconcatenatedintoasinglefeature.
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
5
6
1
5
6
7
4
6
0
/
/
T
l
UN
C
_
UN
_
0
0
0
5
6
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
208
FeaturesLabelInput(referencedbytheithtargetword)unigramyi∧…∗BIAS∗WORD,LEFTWORD,RIGHTWORD∗SOURCEWORD,SOURCELEFTWORD,SOURCERIGHTWORD∗LARGESTNGRAMLEFT/RIGHT,SOURCELARGESTNGRAMLEFT/RIGHT∗POSTAG,SOURCEPOSTAG†WORD+LEFTWORD,WORD+RIGHTWORD†WORD+SOURCEWORD,POSTAG+SOURCEPOSTAGsimplebigramyi∧yi−1∧…∗BIASrichbigramsyi∧yi−1∧…allaboveyi+1∧yi∧…WORD+SOURCEWORD,POSTAG+SOURCEPOSTAGsyntacticyi∧…DEPREL,WORD+DEPRELHEADWORD/POSTAG+WORD/POSTAGLEFTSIBWORD/POSTAG+WORD/POSTAGRIGHTSIBWORD/POSTAG+WORD/POSTAGGRANDWORD/POSTAG+HEADWORD/POSTAG+WORD/POSTAGTable2:FeaturesusedintheLINEARQEsystem(seeMartinsetal.,2016foradetaileddescription).Featuresmarkedwith∗areincludedintheWMT16baselinesystem.Thosemarkedwith†wereproposedbyKreutzeretal.(2015).showtobeusefultodetectgrammaticallyincor-rectconstructions.4Weusefeaturesthatinvolvethedependencyrelation,theheadword,andsecond-ordersiblingandgrandparentstructures.Featuresinvolvingpart-of-speech(POS)tagsandsyntacticinformationareobtainedwithTurboTaggerandTur-boParser(Martinsetal.,2013).5Training.Thefeatureweightsarelearnedbyrun-ning50epochsofthemax-lossMIRAalgorithm(Crammeretal.,2006),withregularizationcon-stantC∈{10−k}4k=1andaHammingcostfunctionplacingahigherpenaltyonfalsepositivesthanonfalsenegatives(cFP∈{0.5,0.55,…,0.95},cFN=1−cFP),toaccountfortheexistenceoffewerBADlabelsthanOKlabelsinthedata.Thesevaluesaretunedonthedevelopmentset.Resultsandfeaturecontribution.Table3showstheperformanceoftheLINEARQEsystem.Tohelpunderstandthecontributionofeachgroupoffeatures,weevaluateddifferentvariantsoftheLINEARQEsystemonthedevelopmentsetsofWMT15/16.Asexpected,theuseofbigramsim-provesthesimpleunigrammodel,andthesyntac-4Whilesyntacticfeatureshavebeenusedpreviouslyinsentence-levelQE(Rubinoetal.,2012),theyhaveneverbeenappliedtothefiner-grainedword-levelvarianttackledhere.5http://www.cs.cmu.edu/˜ark/TurboParser.FeaturesWMT15(FBAD1)WMT16(FMULT1)unigramsonly41.7740.05+simplebigram42.2040.63+richbigrams42.8043.65+syntactic(full)43.6846.11Table3:PerformanceontheWMT15(En-Es)andWMT16(En-De)developmentsetsofseveralconfigu-rationsofLINEARQE.Wereporttheofficialmetricforthesesharedtasks,FBAD1forWMT15andFMULT1forWMT16.ticfeatureshelpevenfurther.TheimpactofthesefeaturesismoreprominentinWMT16:therichbi-gramfeaturesleadtoscoresabout3pointsaboveasequentialmodelwithasingleindicatorbigramfeature,andthesyntacticfeaturescontributeanother2.5points.Thenetimprovementexceeds6pointsovertheunigrammodel.3.2NeuralSystemNext,wedescribetheneuralcomponentofourpureQEsystem,whichwecallNEURALQE.InWMT15andWMT16,theneuralQUETCHsystem(Kreutzeretal.,2015)anditsensemblewithotherneuralmod-els(Martinsetal.,2016)werecomponentsofthewinningsystems.However,noneoftheseneuralmodelsmanagedtooutperformalinearmodelwhen
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
5
6
1
5
6
7
4
6
0
/
/
T
l
UN
C
_
UN
_
0
0
0
5
6
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
209
2 x FF2 x 4002 x FF2 x 2002 x FF100 + 50……BiGRU100……BiGRU200softmaxOK/BADsource wordsource POStarget wordtarget POSembeddings3 x 643 X 503 X 643 x 50Figure2:ArchitectureofourNEURALQEsystem.consideredinisolation—forexample,QUETCHob-tainedaFBAD1of35.27%intheWMT15testset,farbelowthe40.84%scoreofthelinearsystembuiltbythesameteam.Bycontrast,ourcarefullyen-gineeredNEURALQEmodelattainsaperformancesuperiortothatofthelinearsystem,asweshallsee.Architecture.ThearchitectureofNEURALQEisdepictedinFigure2.WeusedKeras(Chollet,2015)toimplementourmodel.Thesystemreceivesasinputthesourceandtargetsentencessandt,theirword-levelalignmentsA,andtheircorrespondingPOStagsobtainedfromTurboTagger.TheinputlayerfollowsasimilararchitectureasQUETCH,withtheadditionofPOSfeatures.Avectorrep-resentingeachtargetwordisobtainedbyconcate-natingtheembeddingofthatwordwiththoseofthealignedwordinthesource.6Theimmediateleftandrightcontextsforsourceandtargetwordsarealsoconcatenated.Weusethepre-trained64-dimensionalPolyglotwordembeddings(Al-Rfouetal.,2013)forEnglish,German,andSpanish,andre-finethemduringtraining.Inadditiontothis,POStagsforeachsourceandtargetwordarealsoem-beddedandconcatenated.POSembeddingshavesize50andareinitializedasdescribedbyGlorotandBengio(2010).Adropoutprobabilityof0.5isap-pliedtotheresultingvectorrepresentations.Thefollowinglayersarethenappliedinsequence:1.Twofeed-forwardlayersofsize400withrecti-fiedlinearunits(ReLU;NairandHinton(2010));6Forthecasesinwhichtherearemultiplesourcewordsalignedtothesametargetword,theembeddingsareaveraged.2.Alayerwithbidirectionalgatedrecurrentunits(BiGRU,Choetal.(2014))ofsize200,whereforwardandbackwardvectorsareconcatenated,trainedwithlayernormalization(Baetal.,2016);3.Twofeed-forwardReLUlayersofsize200;4.ABiGRUlayerofsize100withidenticalconfig-urationtothepreviousBiGRU;5.Twomorefeed-forwardReLUlayersofsizes100and50,respectively.Astheoutputlayer,asoftmaxtransformationovertheOK/BADlabelsisapplied.ThechoiceforthisarchitecturewasdictatedbyexperimentsontheWMT16developmentdata,asweexplainnext.Training.WetrainthemodelwiththeRMSPropalgorithm(TielemanandHinton,2012)byminimiz-ingthecross-entropywithalinearpenaltyforBADwordpredictions,asinKreutzeretal.(2015).WesettheBADweightfactorto3.0.Allhyperparametersareadjustedbasedonthedevelopmentset.Targetsentencesarebucketedbylengthandthenprocessedinbatches(withoutanypaddingortruncation).Resultsandarchitecturalchoices.Thefinalre-sultsareshowninTable4.Overall,thefinalNEU-RALQEmodelachievesanFMULT1scoreof46.80%ontheWMT16developmentset,comparedwiththe46.11%obtainedwiththeLINEARQEsystem(cf.Table3).Thiscontrastswithpreviousneuralsystems,suchasQUETCH(Kreutzeretal.,2015)andanyofthethreeneuralsystemsdevelopedbyMartinsetal.(2016),whichcouldnotoutperformarichfeaturelinearclassifier.TojustifythemostrelevantchoicesregardingthearchitectureofNEURALQE,wealsoevaluatedsev-eralvariationsofitontheWMT16developmentset.Theuseofrecurrentlayersyieldsthelargestcon-tributiontotheperformanceofNEURALQE,asthescoresdropsharply(bymorethan4points)iftheyarereplacedbyfeed-forwardlayers(whichwouldcorrespondtoameredeeperQUETCHmodel).ThefirstBiGRUisparticularyeffective,asscoresdropmorethan2pointsifitisremoved.Theuseoflayernormalizationontherecurrentlayersalsocon-tributespositively(+1.20)tothefinalscore.Asex-pected,theuseofPOStagsaddsanotherlargeim-provement:everythingstayingthesame,themodel
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
5
6
1
5
6
7
4
6
0
/
/
T
l
UN
C
_
UN
_
0
0
0
5
6
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
210
ModelFMULT1NEURALQE46.80NoPOStags44.41(-2.39)ReplaceBiGRUbyFF42.36(-4.44)OnlythefirstBiGRU45.76(-1.04)OnlythesecondBiGRU44.37(-2.43)RemoveFFbetweenBiGRUs46.35(-0.45)Narrowerlayers45.09(-1.71)Broaderlayers45.02(-1.78)Onemorelayerattheend46.31(-0.49)Nolayernormalization45.60(-1.20)Table4:EffectofarchitecturalchangesinNEURALQEontheWMT16developmentset.withoutPOStagsasinputperformsalmost2.5pointsworse.Finally,varyingthesizeofthehiddenlayersandthedepthofthenetworkhurtsthefinalmodel’sperformance,albeitmoreslightly.3.3StackingNeuralandLinearModelsWenowstacktheNEURALQEsystem(§3.2)intotheLINEARQEsystem(§3.1)asanensemblestrat-egy;wecalltheresultingsystemSTACKEDQE.Stackingarchitectures(Wolpert,1992;Breiman,1996)haveprovedeffectiveinstructuredNLPprob-lems(CohenanddeCarvalho,2005;Martinsetal.,2008).Theunderlyingideaistocombinetwosys-temsbylettingthepredictionofthefirstsystembeusedasaninputfeatureforthesecondsystem.Dur-ingtraining,itisnecessarytojackknifethefirstsys-tem’spredictionstoavoidoverfittingthetrainingset.ThisisdonebysplittingthetrainingsetinKfolds(wesetK=10)andtrainingKdifferentinstancesofthefirstsystem,whereeachinstanceistrainedonK−1foldsandmakespredictionsfortheleft-outfold.Theconcatenationofallthepredictionsyieldsanunbiasedtrainingsetforthesecondclassifier.Neuralintra-ensembles.Wealsoevaluatetheperformanceofintra-ensembledneuralsystems.WetrainindependentinstancesofNEURALQEwithdif-ferentrandominitializationsanddifferentdatashuf-fles,followingtheapproachofJeanetal.(2015)inneuralMT.InTables5–6,wereporttheperformanceontheWMT15andWMT16datasetsofsystemsen-sembling5and15oftheseinstances,calledrespec-tivelyNEURALQE-5andNEURALQE-15.Thein-ModelFBAD1devFBAD1testBestsysteminWMT1543.1-43.12QUETCH+(2ndbest)–43.05LINEARQE43.6842.50NEURALQE43.5143.35NEURALQE-544.2143.54NEURALQE-1544.1143.93STACKEDQE44.6843.70Table5:PerformanceofthepureQEsystemsontheWMT15datasets.ThebestperformingsystemintheWMT15competitionwasbyEspl`a-Gomisetal.(2015),followedbyKreutzeretal.(2015)’sQUETCH+,whichisalsoanensembleofalinearandaneuralsystem.ModelFMULT1devFMULT1testBestsysteminWMT1649.2549.52Unbabel-Linear(2ndbest)45.9446.29LINEARQE46.1146.16NEURALQE46.8047.29NEURALQE-547.3048.50NEURALQE-1546.7747.98STACKEDQE49.1650.27Table6:PerformanceofthepureQEsystemsontheWMT16datasets.ThebestperformingsystemintheWMT16competitionwasbyMartinsetal.(2016),fol-lowedbyalinearsystemdevelopedbythesameteam(Unbabel-Linear).stancesareensembledbytakingtheaveragedprob-abilityofeachwordbeingBAD.Weseeconsistentbenefits(bothforWMT15andWMT16)inensem-bling5neuralsystemsand(somewhatsurprisingly)somedegradationwithensemblesof15.Stackingarchitecture.Theindividualinstancesoftheneuralsystemsareincorporatedinthestackingarchitectureasdifferentfeatures,yield-ingSTACKEDQE.Intotal,wehave15predictions(probabilityvaluesgivenbyeachNEURALQEsys-tem)foreverywordinthetraining,developmentandtestdatasets.Thesepredictionsarepluggedasaddi-tionalfeaturesintheLINEARQEmodel.Asuni-gramfeatures,weusedonereal-valuedfeatureforeverymodelpredictionateachposition,conjoinedwiththelabel.Asbigramfeatures,weusedtworeal-valuedfeaturesforeverymodelpredictionatthetwopositions,conjoinedwiththelabelpair.
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
5
6
1
5
6
7
4
6
0
/
/
T
l
UN
C
_
UN
_
0
0
0
5
6
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
211
TheresultsobtainedwiththisstackedarchitectureontheWMT15andWMT16datasetsareshownre-spectivelyinTables5and6.InWMT15,itisun-clearifstackinghelpsoverthebestintra-ensembledneuralsystem,withaslightimprovementinthedevelopmentset,butadegradationinthetestset.InWMT16,however,stackingisclearlybeneficial,withaboostofabout2pointsoverthebestintra-ensembledneuralsystemand3–4pointsabovethelinearsystem,bothinthedevelopmentandtestpar-titions.Fortheremainderofthispaper,wewilltakeSTACKEDQEasourpureQEsystem.4APE-BasedQualityEstimationNowthatwehavedescribedapureQEsystem,wemoveontoanAPE-basedQEsystem(APEQE).OurstartingpointisthesystemsubmittedbytheAdamMickiewiczUniversity(AMU)teamtotheAPEtaskofWMT16(Junczys-DowmuntandGrundkiewicz,2016).Theyexploredtheapplica-tionofneuraltranslationmodelstotheAPEproblemandachievedgoodresultsbytreatingdifferentmod-elsascomponentsinalog-linearmodel,allowingformultipleinputs(thesourcesandthetranslatedsentencet)thatweredecodedtothesametargetlan-guage(post-editedtranslationp).Twosystemswereconsidered,oneusingsastheinput(s→p)andanotherusingtastheinput(t→p).Asimplestring-matchingpenaltyintegratedwithinthelog-linearmodelwasusedtocontrolforhigherfaithful-nesswithregardtotherawMToutput.ThepenaltyfiresiftheAPEsystemproposesawordinitsoutputthathasnotbeenseenint.Toovercometheproblemoftoolittletrainingdata,Junczys-DowmuntandGrundkiewicz(2016)generatedlargeamountsofartificialdataviaround-triptranslations:alargecorpusofmonolingualsentencesisfirstgatheredforthetargetlanguageinthedomainofinterest(eachsentenceisregardedasanartificialpost-editedsentencep);thenanMTsys-temisrantotranslatethesesentencestothesourcelanguage(whichareregardedasthesourcesen-tencess),andanotherMTsysteminthereversedi-rectiontranslatesthelatterbacktothetargetlan-guage(playingtheroleofthetranslationst).TheartificialdataisfilteredtomatchtheHTERstatisticsofthetraininganddevelopmentdataforthesharedtask.7Theirsubmissionimprovedovertheuncor-rectedbaselineontheunseenWMT16testsetby-3.2%TERand+5.5%BLEUandoutperformedanyothersystemsubmittedtotheshared-taskbyalargemargin.4.1TrainingtheAPESystemWereproducetheexperimentsfromJunczys-DowmuntandGrundkiewicz(2016)usingNematus(Sennrichetal.,2016)fortrainingandAmuNMT(Junczys-Dowmuntetal.,2016)fordecoding.Asstatedin§3.3,jackknifingisrequiredtoavoidoverfittingduringthetrainingprocedureofthestackedclassifiers(§5),thereforewestartbyprepar-ingfourjackknifedmodels.Weperformthefollow-ingsteps:•WedividetheoriginalWMT16trainingsetintofourequallysizedparts,maintainingcorrespon-dencesbetweendifferentlanguages.Fournewtrainingsetsarecreatedbyleavingoutonepartandconcatenatingtheremainingthreeparts.•Foreachofthefournewtrainingsets,wetrainoneAPEmodelonaconcatenationofasmallersetofartificialdata(denotedas“round-trip.n1”inJunczys-DowmuntandGrundkiewicz(2016),consistingof531,839sentencetriples)anda20-foldoversamplednewtrainingset.EachofthesenewlycreatedfourAPEmodelshasnotseenadif-ferentpartofthequarteredoriginaltrainingdata.•Toavoidoverfitting,weusescalingdropout8overGRUstepsandinputembeddings,withdropoutprobabilities0.2,andoversourceandtargetwordswithprobabilities0.1(Sennrichetal.,2016).•WeuseAdam(KingmaandBa,2014)insteadofAdadelta(Marinaio,2012).•Wetrainbothmodels(s→pandt→p)un-tilconvergenceupto20epochs,savingmodelcheckpointsevery10,000mini-batches.7Theartificialfiltereddatahasbeenmadeavailablebytheauthorsathttps://github.com/emjotde/amunmt/wiki/AmuNMT-for-Automatic-Post-Editing.8CurrentlyavailableintheMRTbranchofNematusathttps://github.com/rsennrich/nematus
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
5
6
1
5
6
7
4
6
0
/
/
T
l
UN
C
_
UN
_
0
0
0
5
6
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
212
SystemWMT15WMT16Bestsystem23.2321.52Uncorrectedbaseline22.9124.76APEt→p23.9122.60APEs→p40.4428.39APETER-tuned23.2920.99Table7:TERscoresontheofficialWMT15andWMT16testsetsfortheAPEtask.Lowerisbetter.•Thelastfourmodelcheckpointsofeachtrain-ingrunareaveragedelement-wise(Junczys-Dowmuntetal.,2016)resultinginnewsinglemodelswithgenerallyimprovedperformance.ToverifythequalityoftheAPEsystem,ween-semblethe8resultingmodels(4timess→pand4timest→p)andaddtheAPEpenaltydescribedinJunczys-DowmuntandGrundkiewicz(2016).Thislargeensembleacrossfoldsisonlyusedduringtesttime.Forcreatingthejackknifedtrainingdata,onlythemodelsfromthecorrespondingfoldareused.Sincewecombinemodelsofdifferenttypes,wetuneweightsonthedevelopmentsetwithMERT9(Och,2003)towardsTER,yieldingthemodeldenotedas“APETER-tuned”.ResultsarelistedinTable7fortheAPEsharedtask(WMT16).Forthepurelys→pandt→pensembles,modelsareweightedequally.WeachieveslightlybetterresultsintermsofTER,themaintaskmetric,thantheoriginalsys-tem,usinglessdata.Forcompleteness,wealsoapplythisproce-duretoWMT15data,generatingasimilarresourceof500KartificialEnglish-Spanish-Spanishpost-editingtripletsviaroundtriptranslation.10Thetrain-ing,jackknifingandensemblingmethodsarethesameasfortheWMT16setting.FortheWMT15APEsharedtask,resultsarelesspersuasivethanforWMT16:noneofthesharedtaskparticipantswasabletobeattheuncorrectedbaselineandoursys-temfailsatthisaswell.However,weproducedthe9WefoundMERTtoworkbetterwhentuningtowardsTERthankb-MIRAwhichhasbeenusedintheoriginalpaper.10Ourartificiallycreateddatamightsufferfromahighermis-matchbetweentraininganddevelopmentdata.WhilewewereabletomatchtheTERstatisticsofthedevset,BLEUscoresareseveralpointslower.TheartificialWMT16datawecreatedinJunczys-DowmuntandGrundkiewicz(2016)matchesboth,TERandBLEUscores,oftherespectivedevelopmentset.FBAD1devFBAD1testAPEt→p13.4612.83APEs→p41.5641.57APETER-tuned5.964.72APEQE46.4446.05Table8:PerformanceofAPE-basedQEsystemsontheWMT15developmentandtestsets.FMULT1devFMULT1testAPEt→p27.4631.39APEs→p51.9253.70APETER-tuned40.1741.87APEQE54.9555.68Table9:PerformanceofAPE-basedQEsystemsontheWMT16developmentandtestsets.secondstrongestsystemforcase-sensitiveTER(Ta-ble7,WMT15)andthestrongestforcase-insensitveTER(22.49vs.22.54).4.2AdaptationtoQEandTask-SpecificTuningAsdescribedin§2,APEoutputscanbeturnedintowordqualitylabelsusingTER-basedwordalign-ments.Somewhatsurprisingly,amongtheAPEsys-temsintroducedabove,weobserveinTable9thatthes→pAPEsystemistheso-farstrongeststand-aloneQEsystemfortheWMT16taskinthiswork.ThissystemisessentiallyaretrainedneuralMTcomponentwithoutanyadditionalfeatures.11Thet→psystemandtheTER-tunedAPEensemblearemuchweakerintermsofFMULT1.Thisislesssurprisinginthecaseofthefullensemble,asithasbeentunedtowardsTERfortheAPEtaskspecif-ically.However,wecanobtainevenbetterAPE-basedQEsystemsforbothsharedtasksettingsbytuningthefullAPEensemblestowardsFMULT1,theofficialWMT16QEmetric,andtowardsFBAD1forWMT15.12Withthisapproach,weproduceournewbeststand-aloneQE-systemsforbothsharedtasks,whichwedenoteasAPEQE.11NotethatthissystemresemblesotherQEapproacheswhichusepseudo-referencefeatures(AlbrechtandHwa,2008;SoricutandNarsale,2012;Shahetal.,2013),sincethes→pisessentiallyan“alternative”MTsystem.12UsingagainMERTandexecuting7iterationsontheoffi-cialdevelopmentsetwithann-bestlistsizeof12.
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
5
6
1
5
6
7
4
6
0
/
/
T
l
UN
C
_
UN
_
0
0
0
5
6
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
213
FBAD1devFBAD1testBestsysteminWMT1543.143.12LINEARQE43.6842.50NEURALQE43.5143.35STACKEDQE44.6843.70APEQE46.4446.05FULLSTACKEDQE47.6147.08Table10:Performanceoftheseveralword-levelQEsys-temsontheWMT15developmentandtestdatasets.ThebaselineisthebestparticipatingsysteminWMT15,fromEspl`a-Gomisetal.(2015).5FullStackedSystemFinally,weconsideralargerstackedsystemwherewestackbothNEURALQEandAPEQEintoLIN-EARQE.ThiswillmixpureQEwithAPE-basedQEsystems;wecalltheresultFULLSTACKEDQE.Theprocedureisanalogoustothatdescribedin§3.3,withoneextrabinaryfeaturefortheAPE-basedwordqualitylabelpredictions.Fortraining,weusedjackknifingasdescribedin§3.3.5.1Word-LevelQETheperformanceoftheFULLSTACKEDQEsystemontheWMT15andWMT16datasetsareshowninTables10–11.Wecomparewiththeothersystemsintroducedinthispaper,andwiththebestpartici-patingsystemsatWMT15–16(Espl`a-Gomisetal.,2015;Martinsetal.,2016).WecanseethattheAPE-basedandthepureQEsystemsarecomplementary:thefullcombinationofthelinear,neural,andAPE-basedsystemsimprovesthescoreswithrespecttothebestindividualsys-tem(APEQE)byabout1pointinWMT15and2pointsinWMT16.Overall,weobtainforWMT16anFMULT1scoreof57.47%,anewstateoftheart,andanabsolutegainof+7.95%overMartinsetal.(2016).Thisisaremarkableimprovementthatcanpavethewayforawideradoptionofword-levelQEsystemsinindustrialsettings.ForWMT15,wealsoobtainanewstateoftheart,withalessimpres-sivegainof+3.96%overthebestprevioussystem.In§6weanalyzetheerrorsmadebythepureandtheAPE-basedQEsystemstobetterunderstandhowtheycomplementeachother.FMULT1devFMULT1testBestsysteminWMT1649.2549.52LINEARQE46.1146.16NEURALQE46.8047.29STACKEDQE49.1650.27APEQE54.9555.68FULLSTACKEDQE56.8057.47Table11:Performanceoftheseveralword-levelQEsys-temsontheWMT16developmentandtestdatasets.ThebaselineisthebestparticipatingsysteminWMT16,fromtheUnbabelteam(Martinsetal.,2016).5.2Sentence-LevelQEEncouragedbythestrongresultsobtainedwiththeFULLSTACKEDQEsysteminword-levelQE,wein-vestigatehowwecanadaptthissystemforHTERpredictionatsentencelevel.Priorwork(deSouzaetal.,2014)incorporatedword-levelqualitypre-dictionsasfeaturesinasentence-levelQEsystem,trainingafeature-basedlinearclassifier.Here,weshowthataverysimpleconversion,whichrequiresnotrainingortuning,isenoughtoobtainasubstan-tialimprovementoverthestateoftheart.FortheAPEsystem,itiseasytoobtainapredic-tionforHTER:wecansimplymeasuretheHTERbetweenthetranslatedsentencetandthepredictedcorrectedsentencebp.ForapureQEsystem,weap-plythefollowingword-to-sentenceconversiontech-nique:(io)runaQEsystemtoobtainasequenceofOKandBADwordqualitylabels;(ii)usethefrac-tionofBADlabelsasanestimateforHTER.Notethatthisprocedure,whilenotrequiringanytraining,isfarfromperfect.Wordsthatarenotinthetrans-latedsentencebutexistinthereferencepost-editedsentencedonotoriginateBADlabels,andthereforewillnotcontributetotheHTERestimate.Yet,aswewillsee,thisprocedureappliedtotheSTACKEDQEsystem(i.e.withouttheAPEQEcomponent)isal-readysufficienttoobtainstateoftheartresults.Fi-nally,tocombinetheAPEandpureQEsystemsto-wardsentence-levelQE,wesimplytaketheaverageofthetwoHTERpredictionsabove.Table12showstheresultsobtainedwithourpureQEsystem(STACKEDQE),withourAPE-basedsystem(APEQE),andwiththecombinationofthetwo(FULLSTACKEDQE).Asbaselines,we
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
5
6
1
5
6
7
4
6
0
/
/
T
l
UN
C
_
UN
_
0
0
0
5
6
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
214
PearsondevPearsontestSpearmandevSpearmantestWMT15BestsysteminWMT15(ranking)–39.41–36.49BestsysteminWMT15(HTER)–38.46–36.81STACKEDQE32.2936.9633.2234.44APEQE29.3340.3930.8038.74FULLSTACKEDQE36.0744.9936.6842.30WMT16BestsysteminWMT16(ranking)–52.5––BestsysteminWMT16(HTER)–46.0–48.3STACKEDQE55.3054.9356.4655.34APEQE59.0461.2761.0662.48FULLSTACKEDQE64.0465.5665.5265.92Table12:Performanceofoursentence-levelQEsystemsontheWMT15anWMT16datasets,asmeasuredbytheWMT16officialevaluationscript.ThebaselinesarethebestWMT15–16systemsintheHTERpredictiontrack(Bicicietal.,2015;Kozlovaetal.,2016)andinthesentencerankingtrack(Langlois,2015;KimandLee,2016).reporttheperformanceofthetwobestsystemsinthesentence-levelQEtasksatWMT15andWMT16(Bicicietal.,2015;Langlois,2015;Kozlovaetal.,2016;KimandLee,2016).Theresultsarestriking:forWMT16,evenourweakestsystem(STACKEDQE)withthesimplecon-versionprocedureaboveisalreadysufficienttoob-tainstateoftheartresults,outperformingKozlovaetal.(2016)andKimandLee(2016)byaconsiderablemargin.TheAPEQEsystemgivesaverylargeboostoverthesescores,whicharefurtherincreasedbythecombinedFULLSTACKEDQEsystem.Overall,weobtainabsolutegainsof+13.36%inPearson’srcor-relationscoreforHTERprediction,and+17.62%inSpearman’sρcorrelationforsentenceranking,aconsiderableadvanceoverthepreviousstateoftheart.ForWMT15,wealsoobtainanewstateoftheart,withlesssharp(butstillsignificant)improve-ments:+5.08%inPearson’srcorrelationscore,and+5.81%inSpearman’sρcorrelation.6ErrorAnalysisPerformanceoversentencelength.Tobetterun-derstandthedifferencesinperformancebetweenthepureQEsystem(STACKEDQE)andtheAPE-basedsystem(APEQE),weanalyzehowthetwosystems,aswellastheircombination(FULLSTACKEDQE),performasafunctionofthesentencelength.Figure3showstheaveragednumberofBADpre-dictionsmadebythethreesystemsfordifferentsen-tenceslengths,intheWMT16developmentset.Forcomparison,weshowalsothetrueaveragenum-berofBADwordsinthegoldstandard.Weob-servethat,forshortsentences(lessthan5words),thepureQEsystemtendstobetoooptimistic(i.e.,itunderpredictsBADwords)andtheAPE-basedsys-temtoopessimistic(overpredictingthem).Intherangeof5-10words,thepureQEsystemmatchestheproportionofBADwordsmoreaccuratelythantheAPE-basedsystem.Formedium/longsentences,weobservetheoppositebehavior(thisispartic-ularlyclearinthe20-25wordrange),withtheAPE-basedsystembeinggenerallybetter.Ontheotherhand,thecombinationofthetwosystems(FULLSTACKEDQE)managestofindagoodbal-ancebetweenthesetwobiases,beingmuchclosertothetrueproportionofBADlabelsforbothshorterandlongersentencesthananyoftheindividualsys-tems.Thisshowsthatthetwosystemscomplementeachotherwellinthecombination.Illustrativeexamples.Table13showsconcreteexamplesofqualitypredictionsontheWMT16de-velopmentdata.Inthetopexample,wecanseethattheAPEsystemcorrectlyreplacedAngleichungs-farbebyMischfarbe,butisunder-correctiveinotherparts.TheAPEQEsystemthereforemissesseveralBADwords,butmanagestogetthecorrectlabel(OK)forden.Bycontrast,thepureQEsystemer-roneouslyflagsthiswordasincorrect,butitmakestherightdecisiononFarbtonandzuerstellen,be-ingmoreaccuratethanAPEQE.Thecombinationofthetwosystems(pureQEandAPEQE)leadsto
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
5
6
1
5
6
7
4
6
0
/
/
T
l
UN
C
_
UN
_
0
0
0
5
6
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
215
SourceCombinesthehuevalueoftheblendcolorwiththeluminanceandsaturationofthebasecolortocreatetheresultcolor.MTKombiniertdenFarbtonWertderAngleichungsfarbemitderLuminanzundS¨attigungderGrundfarbezuerstellen.PE(Reference)KombiniertdenFarbtonwertderMischfarbemitderLuminanzundS¨attigungderGrundfarbe.APEKombiniertdenFarbtonderMischfarbemitderLuminanzunddieS¨attigungderGrundfarbe,umdieErgebnisfarbezuerstellen.STACKEDQEKombiniertdenFarbtonWertderAngleichungsfarbemitderLuminanzundS¨attigungderGrund-farbezuerstellen.APEQEKombiniertdenFarbtonWertderAngleichungsfarbemitderLuminanzundS¨attigungderGrund-farbezuerstellen.FULLSTACKEDQEKombiniertdenFarbtonWertderAngleichungsfarbemitderLuminanzundS¨attigungderGrund-farbezuerstellen.SourceTheVideoPreviewplug-insupportsRGB,grayscale,andindexedimages.MTMitdemZusatzmodul“Videovorschau”unterst¨utztRGB-,Graustufen-undindizierteBilder.PE(Reference)DasZusatzmodul“Videovorschau”unterst¨utztRGB-,Graustufen-undindizierteBilder.APEDasDialogfeld“Videovorschau”unterst¨utztRGB-,Graustufen-undindizierteBilder.STACKEDQEMitdemZusatzmodul“Videovorschau”unterst¨utztRGB-,Graustufen-undindizierteBilder.APEQEMitdemZusatzmodul“Videovorschau”unterst¨utztRGB-,Graustufen-undindizierteBilder.FULLSTACKEDQEMitdemZusatzmodul“Videovorschau”unterst¨utztRGB-,Graustufen-undindizierteBilder.Table13:ExamplesonWMT16validationdata.Shownarethesourceandtranslatedsentences,thegoldpost-editedsentences,theoutputoftheAPEsystem,andtheQEpredictionsofourpureQEandAPE-basedQEsystemsaswellastheircombination.WordspredictedasOKareshowningreen,thosepredictedasBADareshowninred,anddifferencesbetweenthetranslatedandthepost-editedsentencesareshowninblue.Forbothexamples,thefullstackedsystempredictsallqualitylabelscorrectly.Figure3:AveragednumberofwordspredictedasBADbythedifferentsystemsintheWMT16golddevset,fordifferentbinsofthesentencelength.thecorrectsequentialprediction.Inthebottomex-ample,thepureQEsystemassignsthecorrectlabeltoZusatzmodul,whiletheAPEsystemmistranslatesthiswordtoDialogfeld,leadingtoawrongpredic-tionbytheAPEQEsystem.Ontheotherhand,pureQEmisclassifiesunterst¨utztRGB-asBADwords,whiletheAPEQEgetsthemright.Overall,theAPEQEismoreaccurateinthisexample.Again,thesedecisionscomplementeachotherwell,ascanbeseenbythecombinedQEsystemwhichoutputsthecorrectwordlabelsfortheentiresentence.7ConclusionsWehavepresentednewstateoftheartsystemsforword-levelandsentence-levelQEthatareconsid-erablymoreaccuratethanprevioussystemsontheWMT15andWMT16datasets.First,weproposedanewpureQEsystemwhichstacksalinearandaneuralsystem,andissimplerandslighlymoreaccuratethanthecurrentlybestword-levelsystem.Then,byrelatingthetasksofAPEandword-levelQE,wederivedanewAPE-basedQEsystem,whichleveragesadditionalartifi-cialroundtriptranslationdata,achievingalargerim-provement.Finally,wecombinedthetwosystemsviaafullstackingarchitecture,boostingthescoresevenfurther.ErroranalysisshowsthatthepureandAPE-basedsystemsarehighlycomplementary.Thefullsystemwasextendedtosentence-levelQEbyvirtueofasimpleword-to-sentenceconversion,Rif-
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
5
6
1
5
6
7
4
6
0
/
/
T
l
UN
C
_
UN
_
0
0
0
5
6
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
216
quiringnofurthertrainingortuning.AcknowledgmentsWethankthereviewersandtheactionedi-torfortheirinsightfulcomments.ThisworkwaspartiallysupportedbythetheEXPERTproject(EUMarieCurieITNNo.317471),andbyFundac¸˜aoparaaCiˆenciaeTecnolo-gia(FCT),throughcontractsUID/EEA/50008/2013andUID/CEC/50021/2013,theLearnBigproject(PTDC/EEI-SII/7092/2014),theGoLocalproject(grantCMUPERI/TIC/0046/2014),andtheAmazonAcademicResearchAwardsprogram.ReferencesRamiAl-Rfou,BryanPerozzi,andStevenSkiena.2013.Polyglot:DistributedWordRepresentationsforMulti-lingualNLP.InProceedingsoftheSeventeenthCon-ferenceonComputationalNaturalLanguageLearn-ing,pages183–192.JoshuaAlbrechtandRebeccaHwa.2008.TheroleofpseudoreferencesinMTevaluation.InProceedingsoftheThirdWorkshoponStatisticalMachineTransla-tion,pages187–190.EleftheriosAvramidis.2012.Qualityestimationformachinetranslationoutputusinglinguisticanalysisanddecodingfeatures.InProceedingsoftheSeventhWorkshoponStatisticalMachineTranslation,pages84–90.JimmyLeiBa,JamieRyanKiros,andGeoffreyEHin-ton.2016.Layernormalization.arXivpreprintarXiv:1607.06450.DanielBeck,KashifShah,TrevorCohn,andLuciaSpe-cia.2013.SHEF-Lite:Whenlessismorefortransla-tionqualityestimation.InProceedingsoftheEighthWorkshoponStatisticalMachineTranslation,pages335–340.ErgunBicici,QunLiu,andAndyWay.2015.Refer-entialtranslationmachinesforpredictingtranslationqualityandrelatedstatistics.InProceedingsoftheTenthWorkshoponStatisticalMachineTranslation,pages304–308.ErgunBic¸ici.2013.Referentialtranslationmachinesforqualityestimation.InProceedingsoftheEighthWork-shoponStatisticalMachineTranslation,pages343–351.JohnBlatz,ErinFitzgerald,GeorgeFoster,SimonaGan-drabur,CyrilGoutte,AlexKulesza,AlbertoSanchis,andNicolaUeffing.2004.Confidenceestimationformachinetranslation.InProceedingsoftheInterna-tionalConferenceonComputationalLinguistics,page315.OndˇrejBojar,RajanChatterjee,ChristianFedermann,BarryHaddow,ChrisHokamp,MatthiasHuck,Var-varaLogacheva,,PhilippKoehn,,ChristofMonz,MatteoNegri,PavelPecina,MattPost,CarolinaScar-ton,LuciaSpecia,andMarcoTurchi.2015.Findingsofthe2015WorkshoponStatisticalMachineTransla-tion.InProceedingsoftheTenthWorkshoponStatis-ticalMachineTranslation,pages1–46.OndˇrejBojar,RajenChatterjee,ChristianFedermann,YvetteGraham,BarryHaddow,MatthiasHuck,Anto-nioJimenoYepes,PhilippKoehn,VarvaraLogacheva,ChristofMonz,MatteoNegri,AurelieNeveol,Mari-anaNeves,MartinPopel,MattPost,RaphaelRubino,CarolinaScarton,LuciaSpecia,MarcoTurchi,KarinVerspoor,andMarcosZampieri.2016.Findingsofthe2016conferenceonmachinetranslation.InPro-ceedingsoftheFirstConferenceonMachineTransla-tion,pages131–198.LeoBreiman.1996.StackedRegressions.MachineLearning,24:49–64.KyunghyunCho,BartVanMerri¨enboer,CaglarGul-cehre,DzmitryBahdanau,FethiBougares,HolgerSchwenk,andYoshuaBengio.2014.LearningPhraseRepresentationsUsingRNNEncoder-DecoderforSta-tisticalMachineTranslation.InProceedingsofEmpir-icalMethodsinNaturalLanguageProcessing,pages1724–1734.Franc¸oisChollet.2015.Keras.https://github.com/fchollet/keras.WilliamW.CohenandVitorR.deCarvalho.2005.StackedSequentialLearning.InProceedingsofIn-ternationalJointConferenceonArtificialIntelligence,pages671–676.KobyCrammer,OferDekel,JosephKeshet,ShaiShalev-Shwartz,andYoramSinger.2006.OnlinePassive-AggressiveAlgorithms.JournalofMachineLearningResearch,7:551–585.Jos´eG.C.deSouza,Jes´usGonz´alez-Rubio,ChristianBuck,MarcoTurchi,andMatteoNegri.2014.FBK-UPV-UEdinparticipationintheWMT14QualityEsti-mationshared-task.InProceedingsoftheNinthWork-shoponStatisticalMachineTranslation,pages322–328.Jos´eG.C.deSouza,MarcelloFederico,andHas-sanSawaf.2015.MTQualityEstimationforE-CommerceData.InProceedingsofMTSummitXV,vol.2:MTUsers’Track,pages20–29.MiquelEspl`a-Gomis,FelipeS´anchez-Mart´ınez,andMikelForcada.2015.UAlacantword-levelmachinetranslationqualityestimationsystematWMT2015.
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
5
6
1
5
6
7
4
6
0
/
/
T
l
UN
C
_
UN
_
0
0
0
5
6
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
217
InProceedingsoftheTenthWorkshoponStatisticalMachineTranslation,pages309–315.XavierGlorotandYoshuaBengio.2010.Understandingthedifficultyoftrainingdeepfeedforwardneuralnet-works.InInternationalConferenceonArtificialIntel-ligenceandStatistics,pages249–256.YvetteGraham.2015.Improvingevaluationofmachinetranslationqualityestimation.InProceedingsoftheAnnualMeetingoftheAssociationforComputationalLinguistics,pages1804–1813.S´ebastienJean,OrhanFirat,KyunghyunCho,RolandMemisevic,andYoshuaBengio.2015.Montrealneu-ralmachinetranslationsystemsforwmt15.InPro-ceedingsoftheTenthWorkshoponStatisticalMachineTranslation,pages134–140.MarcinJunczys-DowmuntandRomanGrundkiewicz.2016.Log-linearcombinationsofmonolingualandbilingualneuralmachinetranslationmodelsforauto-maticpost-editing.InProceedingsoftheFirstConfer-enceonMachineTranslation,pages751–758.MarcinJunczys-Dowmunt,TomaszDwojak,andHieuHoang.2016.Isneuralmachinetranslationreadyfordeployment?Acasestudyon30translationdirections.arXivpreprintarXiv:1610.01108.HyunKimandJong-HyeokLee.2016.Recurrentneuralnetworkbasedtranslationqualityestimation.InPro-ceedingsoftheFirstConferenceonMachineTransla-tion,pages787–792.DiederikP.KingmaandJimmyBa.2014.Adam:Amethodforstochasticoptimization.arXivpreprintarXiv:1412.6980.AnnaKozlova,MariyaShmatova,andAntonFrolov.2016.YSDAParticipationintheWMT16QualityEs-timationSharedTask.InProceedingsoftheFirstCon-ferenceonMachineTranslation,pages793–799.JuliaKreutzer,ShigehikoSchamoni,andStefanRiezler.2015.QUalityEstimationfromScraTCH(QUETCH):DeepLearningforWord-levelTranslationQualityEs-timation.InProceedingsoftheTenthWorkshoponStatisticalMachineTranslation,pages316–322.DavidLanglois.2015.LORIASystemfortheWMT15QualityEstimationSharedTask.InProceedingsoftheTenthWorkshoponStatisticalMachineTransla-tion,pages323–329.NgocQuangLuong,LaurentBesacier,andBenjaminLecouteux.2014.LIGSystemforWordLevelQEtaskatWMT14.InProceedingsoftheNinthWork-shoponStatisticalMachineTranslation,pages335–341.Andr´eF.T.Martins,DipanjanDas,NoahA.Smith,andEricP.Xing.2008.StackingDependencyParsers.InProceedingsofEmpiricalMethodsforNaturalLan-guageProcessing,pages157–166.Andr´eF.TMartins,MiguelB.Almeida,andNoahA.Smith.2013.Turningontheturbo:Fastthird-ordernon-projectiveturboparsers.InProceedingsoftheAnnualMeetingoftheAssociationforComputationalLinguistics,pages617–622.Andr´eF.T.Martins,Ram´onAstudillo,ChrisHokamp,andFabioN.Kepler.2016.Unbabel’sParticipationintheWMT16Word-LevelTranslationQualityEstima-tionSharedTask.InProceedingsoftheFirstConfer-enceonMachineTranslation,pages806–811.VinodNairandGeoffreyEHinton.2010.Rectifiedlin-earunitsimproverestrictedBoltzmannmachines.InProceedingsoftheInternationalConferenceonMa-chineLearning,pages807–814.FranzJosefOch.2003.Minimumerrorratetraininginstatisticalmachinetranslation.InProceedingsoftheAnnualMeetingonAssociationforComputationalLinguistics,pages160–167.RaphaelRubino,JenniferFoster,JoachimWagner,Jo-hannRoturier,RasulSamadZadehKaljahi,andFredHollowood.2012.DCU-SymantecsubmissionfortheWMT2012qualityestimationtask.InProceedingsoftheSeventhWorkshoponStatisticalMachineTransla-tion,pages138–144.RicoSennrich,BarryHaddow,andAlexandraBirch.2016.EdinburghNeuralMachineTranslationSystemsforWMT16.InProceedingsoftheFirstConferenceonMachineTranslation,pages371–376.KashifShah,TrevorCohn,andLuciaSpecia.2013.Aninvestigationontheeffectivenessoffeaturesfortrans-lationqualityestimation.InProceedingsoftheMa-chineTranslationSummit,volume14,pages167–174.MichelSimard,NicolaUeffing,PierreIsabelle,andRolandKuhn.2007.Rule-basedtranslationwithsta-tisticalphrase-basedpost-editing.InProceedingsoftheSecondWorkshoponStatisticalMachineTransla-tion,pages203–206.MatthewSnover,BonnieDorr,RichardSchwartz,Lin-neaMicciulla,andJohnMakhoul.2006.Astudyoftranslationeditratewithtargetedhumanannotation.InProceedingsofthe7thConferenceoftheAssocia-tionforMachineTranslationintheAmericas,pages223–231.RaduSoricutandSushantNarsale.2012.Combiningqualitypredictionandsystemselectionforimprovedautomatictranslationoutput.InProceedingsoftheSeventhWorkshoponStatisticalMachineTranslation,pages163–170.LuciaSpecia,KashifShah,JoseG.C.deSouza,andTrevorCohn.2013.QuEst-atranslationqualityestimationframework.InProceedingsoftheAnnualMeetingoftheAssociationforComputationalLinguis-tics:SystemDemonstrations,pages79–84.
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
5
6
1
5
6
7
4
6
0
/
/
T
l
UN
C
_
UN
_
0
0
0
5
6
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
218
LuciaSpecia.2011.Exploitingobjectiveannotationsformeasuringtranslationpost-editingeffort.InProceed-ingsofthe15thConferenceoftheEuropeanAssocia-tionforMachineTranslation,pages73–80.TijmenTielemanandGeoffreyHinton.2012.Rmsprop:Dividethegradientbyarunningaverageofitsrecentmagnitude.COURSERA:NeuralNetworksforMa-chineLearning,4(2).MarcoTurchi,AntoniosAnastasopoulos,Jos´eGCdeSouza,andMatteoNegri.2014.Adaptivequal-ityestimationformachinetranslation.InProceedingsoftheAnnualMeetingoftheAssociationforCompu-tationalLinguistics,pages710–720.NicolaUeffingandHermannNey.2007.Word-levelconfidenceestimationformachinetranslation.Com-putationalLinguistics,33(1):9–40.D.Wolpert.1992.Stackedgeneralization.NeuralNet-works,5(2):241–260.MatthewD.Zeiler.2012.ADADELTA:AnAdaptiveLearningRateMethod.arXivpreprintarXiv:1212.5701.
Scarica il pdf