Transactions of the Association for Computational Linguistics, vol. 4, pp. 99–112, 2016. Action Editor: Philipp Koehn.
Submission batch: 11/2015; Revision batch: 2/2016; Published 4/2016.
© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.
Adapting to All Domains at Once: Rewarding Domain Invariance in SMT

Hoang Cuong and Khalil Sima’an and Ivan Titov
Institute for Logic, Language and Computation
University of Amsterdam
Science Park 107, 1098 XG Amsterdam, The Netherlands
{c.hoang,k.simaan,titov}@uva.nl

Abstract

Existing work on domain adaptation for statistical machine translation has consistently assumed access to a small sample from the test distribution (target domain) at training time. In practice, however, the target domain may not be known at training time or it may change to match user needs. In such situations, it is natural to push the system to make safer choices, giving higher preference to domain-invariant translations, which work well across domains, over risky domain-specific alternatives. We encode this intuition by (1) inducing latent subdomains from the training data only; (2) introducing features which measure how specialized phrases are to individual induced subdomains; (3) estimating feature weights on out-of-domain data (rather than on the target domain). We conduct experiments on three language pairs and a number of different domains. We observe consistent improvements over a baseline which does not explicitly reward domain invariance.

1 Introduction

Mismatch in phrase translation distributions between test data (target domain) and train data is known to harm performance of statistical translation systems (Irvine et al., 2013; Carpuat et al., 2014). Domain-adaptation methods (Foster et al., 2010; Bisazza et al., 2011; Sennrich, 2012b; Razmara et al., 2012; Sennrich et al., 2013; Haddow, 2013; Joty et al., 2015) aim to specialize a system estimated on out-of-domain training data to a target domain represented by a small data sample. In practice, however, the target domain may not be known at training time or it may change over time depending on user needs. In this work we address exactly the setting where we have a domain-agnostic system but we have no access to any samples from the target domain at training time. This is an important and challenging setting which, as far as we are aware, has not yet received attention in the literature.

When the target domain is unknown at training time, the system could be trained to make safer choices, preferring translations which are likely to work across different domains. For example, when translating from English to Russian, the most natural translation for the word ‘code’ would be highly dependent on the domain (and the corresponding word sense). The Russian words ‘xifr’, ‘zakon’ or ‘programma’ would perhaps be optimal choices if we consider cryptography, legal and software development domains, respectively. However, the translation ‘kod’ is also acceptable across all these domains and, as such, would be a safer choice when the target domain is unknown. Note that such a translation may not be the most frequent overall and, consequently, might not be proposed by a standard (i.e., domain-agnostic) phrase-based translation system.

In order to encode preference for domain-invariant translations, we introduce a measure which quantifies how likely a phrase (or a phrase pair) is to be “domain-invariant”. We recall that most large parallel corpora are heterogeneous, consisting of diverse language use originating from a variety of unspecified subdomains. For example, news articles may cover sports, finance, politics, technology and a variety of other news topics. None of the subdomains may match the target domain particularly
well, but they can still reveal how domain-specific a given phrase is. For example, if we observe that the word ‘code’ can be translated as ‘kod’ across the cryptography and legal subdomains observed in training data, we can hypothesize that it may work better on a new unknown domain than ‘zakon’, which was specific only to a single subdomain (legal). This would be a suitable decision if the test domain happens to be software development, even though no texts pertaining to this domain were included in the heterogeneous training data.

Importantly, the subdomains are usually not specified in the heterogeneous training data. Therefore, we treat the subdomains as latent, so we can induce them automatically. Once induced, we define measures of domain specificity, particularly expressing two generic properties:

Phrase domain specificity: How specific is a target or a source phrase to some of the induced subdomains?

Phrase pair domain coherence: How coherent are a source phrase and a target language translation across the induced subdomains?

These features capture two orthogonal aspects of phrase behaviour in heterogeneous corpora, with the rationale that phrase pairs can be weighted along these two dimensions. Domain specificity captures the intuition that the more specific a phrase is to certain subdomains, the less applicable it is in general. Note that specificity is applied not only to target phrases (as ‘kod’ and ‘zakon’ in the above example) but also to source phrases. When applied to a source phrase, it may give a preference towards using shorter phrases as they are inherently less domain specific. In contrast to phrase domain specificity, phrase pair coherence reflects whether candidate target and source phrases are typically used in the same set of domains. The intuition here is that the more divergent the distributional behaviour of source and target phrases across subdomains, the less certain we are whether this phrase pair is valid for the unknown target domain. In other words, a translation rule with source and target phrases having two similar distributions over the latent subdomains is likely safer to use.

Weights for these features, alongside all other standard features, are tuned on a development set. Importantly, we show that there is no noteworthy benefit from tuning the weights on a sample from the target domain. It is enough to tune them on a mixed-domain dataset sufficiently different from the training data. We attribute this attractive property to the fact that our features, unlike the ones typically considered in standard domain-adaptation work, are generic and only affect the amount of risk our system takes. In contrast, for example, in Eidelman et al. (2012), Chiang et al. (2011), Hu et al. (2014), Hasler et al. (2014), Su et al. (2015), Sennrich (2012b), Chen et al. (2013b), and Carpuat et al. (2014), features capture similarities between a target domain and each of the training subdomains. Clearly, domain adaptation with such rich features, though potentially more powerful, would not be possible without a development set closely matching the target domain.

We conduct our experiments on three language pairs and explore 9 domain adaptation tasks in total. We observe significant and consistent performance improvements over the baseline domain-agnostic systems. This result confirms that our two features, and the latent subdomains they are computed from, are useful also for the very challenging domain adaptation setting considered in this work.

2 Domain-Invariance for Phrases

At the core of a standard state-of-the-art phrase-based system (Koehn et al., 2003; Och and Ney, 2004) lies a phrase table $\{\langle \tilde{e}, \tilde{f} \rangle\}$ extracted from a word-aligned training corpus together with estimates for phrase translation probabilities $P_{count}(\tilde{e} \mid \tilde{f})$ and $P_{count}(\tilde{f} \mid \tilde{e})$. Typically the phrases and their probabilities are obtained from large parallel corpora, which are usually broad enough to cover a mixture of several subdomains. In such mixtures, phrase distributions may be different across different subdomains. Some phrases (whether source or target) are more specific for certain subdomains than others, while some phrases are useful across many subdomains. Moreover, for a phrase pair, the distribution over the subdomains for its source side may or may not be similar to the distribution for its target side.
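For orientation, the count-based estimates mentioned above are commonly taken to be relative frequencies over extracted phrase pairs. The following is a sketch under that assumption; the symbol $\mathrm{count}(\cdot,\cdot)$ (the number of times a phrase pair is extracted from the word-aligned corpus) is introduced here for illustration only and does not appear in the text.

```latex
P_{count}(\tilde{e} \mid \tilde{f}) =
  \frac{\mathrm{count}(\tilde{e}, \tilde{f})}
       {\sum_{\tilde{e}'} \mathrm{count}(\tilde{e}', \tilde{f})},
\qquad
P_{count}(\tilde{f} \mid \tilde{e}) =
  \frac{\mathrm{count}(\tilde{e}, \tilde{f})}
       {\sum_{\tilde{f}'} \mathrm{count}(\tilde{e}, \tilde{f}')}
```

Under this reading, both estimates pool counts over the whole heterogeneous corpus, which is why they carry no information about how a phrase behaves in individual subdomains.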
[Figure 1: The projection framework of phrases into K-dimensional vector space of probabilistic latent subdomains. The figure shows a source phrase projection and a target phrase projection onto domain 1, ..., domain i, ..., domain K.]

Coherent pairs seem safer to employ than pairs that exhibit different distributions over the subdomains.

These two factors, domain specificity and domain coherence, can be estimated from the training corpus if we have access to subdomain statistics for the phrases. In the setting addressed here, the subdomains are not known in advance and we have to consider them latent in the training data.

Therefore, we introduce a random variable $z \in \{1, \ldots, K\}$ encoding (arbitrary) $K$ latent subdomains that generate each source and target phrase $\tilde{e}$ and $\tilde{f}$ of every phrase pair $\langle \tilde{e}, \tilde{f} \rangle$. In the next section, we aim to estimate distributions $P(z \mid \tilde{e})$ and $P(z \mid \tilde{f})$ for subdomain $z$ over the source and target phrases respectively. In other words, we aim at projecting phrases onto a compact $(K-1)$-dimensional simplex of subdomains with vectors:

$$\vec{\tilde{e}} = \langle P(z=1 \mid \tilde{e}), \ldots, P(z=K \mid \tilde{e}) \rangle, \quad (1)$$

$$\vec{\tilde{f}} = \langle P(z=1 \mid \tilde{f}), \ldots, P(z=K \mid \tilde{f}) \rangle. \quad (2)$$

Each of the $K$ elements encodes how well each source and target phrase expresses a specific latent subdomain in the training data. See Fig. 1 for an illustration of the projection framework.

Once the projection is performed, the hidden cross-domain translation behaviour of phrases and phrase pairs can be modeled as follows:

• Domain-specificity of phrases: A rule with source and target phrases having a peaked distribution over latent subdomains is likely domain-specific. Technically speaking, entropy comes as a natural choice for quantifying domain specificity. Here, we opt for the Rényi entropy and define the domain specificity as follows:

$$D_\alpha(\vec{\tilde{e}}) = \frac{1}{1-\alpha} \log\Big( \sum_{i=1}^{K} P(z=i \mid \tilde{e})^{\alpha} \Big)$$

$$D_\alpha(\vec{\tilde{f}}) = \frac{1}{1-\alpha} \log\Big( \sum_{i=1}^{K} P(z=i \mid \tilde{f})^{\alpha} \Big)$$

For convenience, we refer to $D_\alpha(\cdot)$ as the domain specificity of a phrase. In this study, we choose the value of $\alpha$ as 2, which is the default choice (also known as the collision entropy).

• Source-target coherence across subdomains: A translation rule with source and target phrases having two similar distributions over the latent subdomains is likely safer to use. We use the Chebyshev distance for measuring the similarity between two distributions. The divergence of two vectors $\vec{\tilde{e}}$ and $\vec{\tilde{f}}$ is defined as follows:

$$D(\vec{\tilde{e}}, \vec{\tilde{f}}) = \max_{i \in \{1, \ldots, K\}} \Big| P(z=i \mid \tilde{e}) - P(z=i \mid \tilde{f}) \Big|$$

We refer to $D(\vec{\tilde{e}}, \vec{\tilde{f}})$ as the phrase pair coherence across latent subdomains. We investigated some other similarity measures for phrase pair coherence (the Kullback-Leibler divergence and the Hellinger distance) but have not observed any noticeable improvements in the performance. We will discuss these experiments in the empirical section.

Once computed for every phrase pair, the two measures, domain specificity ($D_\alpha(\vec{\tilde{e}})$, $D_\alpha(\vec{\tilde{f}})$) and coherence ($D(\vec{\tilde{e}}, \vec{\tilde{f}})$), will be integrated into a phrase-based SMT system as feature functions.

3 Latent Subdomain Induction

We now present our approach for inducing latent subdomain distributions $P(z \mid \tilde{e})$ and $P(z \mid \tilde{f})$ for every source and target phrase $\tilde{e}$ and $\tilde{f}$. In our experiments, we compare using our subdomain induction framework with relying on topic distributions provided by a standard topic model, Latent Dirichlet Allocation (Blei et al., 2003). Note that unlike LDA we rely on parallel data and word alignments when
inducing domains. Our intuition is that latent variables capturing regularities in bilingual data may be more appropriate for the translation task.

Inducing these probabilities directly is rather difficult as the task of designing a fully generative phrase-based model is known to be challenging.¹ In order to avoid this, we follow Matsoukas et al. (2009) and Cuong and Sima’an (2014a) who “embed” such a phrase-level model into a latent subdomain model that works at the sentence level. In other words, we associate latent domains with sentence pairs rather than with phrases, and use the posterior probabilities computed for the sentences with all the phrases appearing in the corresponding sentences. Given $P(z \mid e, f)$, a latent subdomain model over sentence pairs $\langle e, f \rangle$, the estimation of $P(z \mid \tilde{e})$ and $P(z \mid \tilde{f})$, for phrases $\tilde{e}$ and $\tilde{f}$, can be simplified by computing expectations for all $z \in \{1, \ldots, K\}$:

$$P(z=i \mid \tilde{e}) = \frac{\sum_{\langle e,f \rangle} P(z=i \mid e,f)\, c(\tilde{e}; e)}{\sum_{i'=1}^{K} \sum_{\langle e,f \rangle} P(z=i' \mid e,f)\, c(\tilde{e}; e)},$$

$$P(z=i \mid \tilde{f}) = \frac{\sum_{\langle e,f \rangle} P(z=i \mid e,f)\, c(\tilde{f}; f)}{\sum_{i'=1}^{K} \sum_{\langle e,f \rangle} P(z=i' \mid e,f)\, c(\tilde{f}; f)}.$$

Here, $c(\tilde{e}; e)$ is the count of a phrase $\tilde{e}$ in a sentence $e$ in the training corpus.

¹ Doing that requires incorporating into the model additional hidden variables encoding phrase segmentation (DeNero et al., 2006). This would significantly complicate inference (Mylonakis and Sima’an, 2008; Neubig et al., 2011; Cohn and Haffari, 2013).

Latent subdomains for sentences. We now turn to describing our latent subdomain model for sentences. We assume the following generative story for sentence pairs:

1. generate the domain $z$ from the prior $P(z)$;
2. choose the generation direction: f-to-e or e-to-f, with equal probability;
3. if the e-to-f direction is chosen then generate the pair relying on $P(e \mid z) P(f \mid e, z)$;
4. otherwise, use $P(f \mid z) P(e \mid f, z)$.

Formally, it is a uniform mixture of the generative processes for the two potential translation directions.² This generative story implies having two translation models (TMs) and two language models (LMs), each augmented with latent subdomains. Now, the posterior $P(z \mid e, f)$ can be computed as

$$P(z \mid e,f) \propto P(z) \Big( \tfrac{1}{2} P(e \mid z) P(f \mid e,z) + \tfrac{1}{2} P(f \mid z) P(e \mid f,z) \Big). \quad (3)$$

² Note that we effectively average between them, which is reasonable, as there is no reason to give preference to any of them.

As we aim for a simple approach, our TMs are computed through the introduction of hidden alignments $a$ and $a'$ in the f-to-e and e-to-f directions respectively, in which $P(f \mid e,z) = \sum_{a} P(f, a \mid e, z)$ and $P(e \mid f,z) = \sum_{a'} P(e, a' \mid f, z)$. To make the marginalization of alignments tractable, we restrict $P(f, a \mid e, z)$ and $P(e, a' \mid f, z)$ to the same assumptions as IBM Model 1 (Brown et al., 1993) (i.e., a product of lexical translation probabilities with respect to latent subdomains). We use a standard $n$th-order Markov model for $P(e \mid z)$ and $P(f \mid z)$, in which $P(e \mid z) = \prod_i P(e_i \mid e_{i-n}^{i-1}, z)$ and $P(f \mid z) = \prod_j P(f_j \mid f_{j-n}^{j-1}, z)$. Here, the notation $e_{i-n}^{i-1}$ and $f_{j-n}^{j-1}$ is used to denote the history of length $n$ for the source and target words $e_i$ and $f_j$, respectively.

Training. For training, we maximize the log-likelihood $L$ of the data

$$L = \sum_{\langle e,f \rangle} \log \Big( \sum_{z} P(z) \Big( \tfrac{1}{2} P(e \mid z) \sum_{a} P(f, a \mid e, z) + \tfrac{1}{2} P(f \mid z) \sum_{a'} P(e, a' \mid f, z) \Big) \Big). \quad (4)$$

As there is no closed-form solution, we use the expectation-maximization (EM) algorithm (Dempster et al., 1977).

In the E-step, we compute the posterior distributions $P(a, z \mid e, f)$ and $P(a', z \mid e, f)$ as follows

$$P(a, z \mid e,f) \propto P(z) \Big( P(e \mid z) P(f, a \mid e, z) + P(f \mid z) \sum_{a'} P(e, a' \mid f, z) \Big), \quad (5)$$

$$P(a', z \mid e,f) \propto P(z) \Big( P(e \mid z) \sum_{a} P(f, a \mid e, z) + P(f \mid z) P(e, a' \mid f, z) \Big). \quad (6)$$

In the M-step, we use the posteriors $P(a, z \mid e, f)$ and $P(a', z \mid e, f)$ to re-estimate parameters of both alignment models. This is done in a very similar way to estimation of the standard IBM Model 1. We use the posteriors to re-estimate LM parameters as follows

$$P(e_i \mid e_1^{i-1}, z) \propto \sum_{\langle e,f \rangle} P(z \mid e,f)\, c(e_1^{i}; e), \quad (7)$$

$$P(f_i \mid f_1^{i-1}, z) \propto \sum_{\langle e,f \rangle} P(z \mid e,f)\, c(f_1^{i}; f). \quad (8)$$

To obtain better parameter estimates for word predictions and avoid overfitting, we use smoothing in the M-step. In this work, we chose to apply the expected Kneser-Ney smoothing technique (Zhang and Chiang, 2014) as it is simple and achieves state-of-the-art performance on the language modeling problem. Finally, $P(z)$ can be simply estimated as follows

$$P(z) \propto \sum_{\langle e,f \rangle} P(z \mid e,f) \quad (9)$$

Hierarchical Training. In practice, we found that training the full joint model leads to brittle performance, as EM is very likely to get stuck in bad local maxima. To address this difficulty, in our implementation, we start out by first jointly training $P(z)$, $P(e \mid z)$ and $P(f \mid z)$. In this stage, in the E-step, we fix our model parameters and compute $P(z \mid e, f)$ for every sentence pair: $P(z \mid e,f) \propto P(e \mid z) P(f \mid z) P(z)$. In the M-step, we use the posteriors to re-estimate the model parameters, as in Equations (7), (8) and (9). Once the model is trained, we fix the language modeling parameters and finally train the full model. This parallel latent subdomain language model is less expressive and, consequently, is less likely to get stuck in a local maximum. The LMs estimated in this way will then drive the full alignment model towards better configurations in the parameter space.³ In practice, this training scheme is particularly useful when learning a more fine-grained latent subdomain model with larger $K$.

³ This procedure can be regarded as a form of hierarchical estimation: we start with a simpler model and then use it to drive a more expressive model. Note that we also use $P(z)$ estimated within the parallel latent subdomain LMs to initialize $P(z)$ for the latent subdomain alignment model.

4 Experiments

| Training data | Sents | Words (English) | Words (foreign side) |
|---|---|---|---|
| English-French | 5.01M | 103.39M | 125.81M |
| English-Spanish | 4.00M | 81.48M | 89.08M |
| English-German | 4.07M | 93.19M | 88.48M |

Table 1: Data Preparation.

4.1 Data

We conduct experiments with large-scale SMT systems across a number of domains for three language pairs (English-Spanish, English-German and English-French). The datasets are summarized in Table 1. For English-Spanish, we run experiments with training data consisting of 4M sentence pairs collected from multiple resources within the WMT 2013 MT Shared Task. These include EuroParl (Koehn, 2005), Common Crawl Corpus, UN Corpus, and News Commentary. For English-German, our training data consists of 4.1M sentence pairs collected from the WMT 2015 MT Shared Task, including EuroParl, Common Crawl Corpus and News Commentary. Finally, for English-French, we train SMT systems on a corpus of 5M sentence pairs collected from the WMT 2015 MT Shared Task, including the 10^9 French-English corpus.

We conducted experiments on 9 different domains (tasks) where the data was manually collected by TAUS.⁴

⁴ https://www.taus.net/.

Table 2 presents the translation tasks: each of the tasks deals with a specific domain, and each task presumably has a very different relevance level
to the training data. In this way, we test the stability of our results across a wide range of target domains.

| Language pair | Task | Set | Sents | Words (English) | Words (foreign side) |
|---|---|---|---|---|---|
| English-French | Professional & Business Services | Dev | 2K | 74.16K | 83.85K |
| English-French | Professional & Business Services | Test | 5K | 92.84K | 105.05K |
| English-French | Leisure, Tourism and Arts | Dev | 2K | 107.45K | 117.16K |
| English-French | Leisure, Tourism and Arts | Test | 5K | 101.82K | 114.76K |
| English-Spanish | Professional & Business Services | Dev | 2K | 31.70K | 34.62K |
| English-Spanish | Professional & Business Services | Test | 5K | 84.1K | 93.4K |
| English-Spanish | Legal | Dev | 2K | 35.06K | 38.78K |
| English-Spanish | Legal | Test | 5K | 88.63K | 102.71K |
| English-Spanish | Financials | Dev | 2K | 37.23K | 42.89K |
| English-Spanish | Financials | Test | 5K | 99.05K | 109.81K |
| English-German | Professional & Business Services | Dev | 2K | 80.49K | 85.08K |
| English-German | Professional & Business Services | Test | 5K | 79.75K | 85.28K |
| English-German | Legal | Dev | 2K | 50.54K | 45.99K |
| English-German | Legal | Test | 5K | 124.93K | 111.70K |
| English-German | Computer Software | Dev | 2K | 40.24K | 38.31K |
| English-German | Computer Software | Test | 5K | 102.71K | 101.12K |
| English-German | Computer Hardware | Dev | 2K | 37.40K | 36.98K |
| English-German | Computer Hardware | Test | 5K | 103.29K | 98.04K |

Table 2: Data and adaptation tasks.

4.2 Systems

We use a standard state-of-the-art phrase-based system. The Baseline system includes MOSES (Koehn et al., 2007) baseline feature functions, plus eight hierarchical lexicalized reordering model feature functions (Galley and Manning, 2008). The training data is first word-aligned using GIZA++ (Och and Ney, 2003) and then symmetrized with grow(-diag)-final-and (Koehn et al., 2003). We limit the phrase length to a maximum of seven words. The language models are interpolated 5-grams with Kneser-Ney smoothing, estimated by KenLM (Heafield et al., 2013) from a large monolingual corpus of nearly 2.1B English words collected within the WMT 2015 MT Shared Task. Finally, we use MOSES as a decoder (Koehn et al., 2007).

Our system is exactly the same as the baseline, plus three additional feature functions induced for the translation rules: two features for domain specificity of phrases (for the source side, $D_\alpha(\vec{\tilde{f}})$, and for the target side, $D_\alpha(\vec{\tilde{e}})$), and one feature for source-target coherence across subdomains ($D(\vec{\tilde{e}}, \vec{\tilde{f}})$). For the projection, we use $K=12$. We also explored different values for $K$, but have not observed a significant difference in the scores. In our experiments we do one iteration of EM with parallel LMs (as described in Section 3), before continuing with the full model for three more iterations. We did not observe a significant improvement from running EM any longer. Finally, we use hard EM, as it has been found to yield better models than the standard soft EM on a number of different tasks (e.g., Johnson, 2007). In other words, instead of standard ‘soft’ EM updates with phrase counts weighted according to the posterior $P(z=i \mid e, f)$, we use the ‘winner-takes-all’ approach:

$$P(z=i \mid \tilde{e}) \propto \sum_{\langle e,f \rangle} c(i; \hat{z}_{\langle e,f \rangle})\, c(\tilde{e}; e),$$

$$P(z=i \mid \tilde{f}) \propto \sum_{\langle e,f \rangle} c(i; \hat{z}_{\langle e,f \rangle})\, c(\tilde{f}; f).$$

Here, $\hat{z}_{\langle e,f \rangle}$ is the “winning” latent subdomain for sentence pair $\langle e, f \rangle$:

$$\hat{z}_{\langle e,f \rangle} = \operatorname*{argmax}_{i \in \{1, \ldots, K\}} P(z=i \mid e, f)$$

In practice, we found that using this hard version leads to better performance.⁵

⁵ A more principled alternative would be to use posterior regularization (Ganchev et al., 2009).

4.3 Alternative tuning scenarios

In order to tune all systems, we use the k-best batch MIRA (Cherry and Foster, 2012). We report the translation accuracy with three metrics: BLEU
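(Papineni et al., 2002), METEOR (Denkowski and Lavie, 2011) and TER (Snover et al., 2006). We mark an improvement as significant when we obtain the p-level of 5% under paired bootstrap resampling (Koehn, 2004). Note that better results correspond to larger BLEU and METEOR but to smaller TER.

| Language pair | Task | System | BLEU↑ / ∆ | METEOR↑ / ∆ | TER↓ / ∆ |
|---|---|---|---|---|---|
| English-French | Professional & Business Services | Baseline | 21.4 | 28.8 | 60.0 |
| English-French | Professional & Business Services | Our System | 21.5 / +0.1 | 28.9 / +0.1 | 59.7 / -0.3 |
| English-French | Leisure, Tourism and Arts | Baseline | 39.9 | 36.7 | 48.1 |
| English-French | Leisure, Tourism and Arts | Our System | 40.8 / +0.9 | 37.1 / +0.4 | 47.1 / -1.0 |
| English-Spanish | Financials | Baseline | 32.5 | 37.1 | 45.6 |
| English-Spanish | Financials | Our System | 32.8 / +0.3 | 37.2 / +0.1 | 45.4 / -0.2 |
| English-Spanish | Professional & Business Services | Baseline | 24.4 | 31.7 | 54.9 |
| English-Spanish | Professional & Business Services | Our System | 24.8 / +0.4 | 31.9 / +0.2 | 54.8 / -0.1 |
| English-Spanish | Legal Services | Baseline | 33.3 | 36.3 | 49.5 |
| English-Spanish | Legal Services | Our System | 33.8 / +0.5 | 36.5 / +0.2 | 49.1 / -0.4 |
| English-German | Computer Software | Baseline | 22.8 | 27.7 | 64.3 |
| English-German | Computer Software | Our System | 23.1 / +0.3 | 27.8 / +0.1 | 64.0 / -0.3 |
| English-German | Computer Hardware | Baseline | 20.5 | 27.7 | 61.2 |
| English-German | Computer Hardware | Our System | 20.9 / +0.4 | 27.9 / +0.2 | 61.1 / -0.1 |
| English-German | Professional & Business Services | Baseline | 15.3 | 25.4 | 69.2 |
| English-German | Professional & Business Services | Our System | 15.7 / +0.4 | 25.6 / +0.2 | 68.6 / -0.6 |
| English-German | Legal Services | Baseline | 29.6 | 32.9 | 55.6 |
| English-German | Legal Services | Our System | 30.2 / +0.6 | 33.3 / +0.4 | 55.1 / -0.5 |

Table 3: Adaptation results when tuning on the in-domain development set. The boldface indicates that the improvement over the baseline is significant.

| Language pair | Task | System | BLEU↑ / ∆ | METEOR↑ / ∆ | TER↓ / ∆ |
|---|---|---|---|---|---|
| English-French | Professional & Business Services | Baseline | 20.7 | 28.3 | 59.5 |
| English-French | Professional & Business Services | Our System | 20.7 / +0.0 | 28.4 / +0.1 | 59.4 / -0.1 |
| English-French | Leisure, Tourism and Arts | Baseline | 39.7 | 37.0 | 48.6 |
| English-French | Leisure, Tourism and Arts | Our System | 40.6 / +0.9 | 37.4 / +0.4 | 47.4 / -1.2 |
| English-Spanish | Financials | Baseline | 33.6 | 37.5 | 45.4 |
| English-Spanish | Financials | Our System | 34.0 / +0.4 | 37.7 / +0.2 | 45.0 / -0.4 |
| English-Spanish | Professional & Business Services | Baseline | 24.4 | 31.9 | 55.3 |
| English-Spanish | Professional & Business Services | Our System | 24.9 / +0.5 | 32.0 / +0.1 | 54.9 / -0.4 |
| English-Spanish | Legal Services | Baseline | 32.4 | 35.8 | 49.0 |
| English-Spanish | Legal Services | Our System | 32.9 / +0.5 | 36.0 / +0.2 | 48.8 / -0.2 |
| English-German | Computer Software | Baseline | 23.2 | 27.6 | 63.4 |
| English-German | Computer Software | Our System | 23.5 / +0.3 | 27.8 / +0.2 | 63.0 / -0.4 |
| English-German | Computer Hardware | Baseline | 20.8 | 27.8 | 61.5 |
| English-German | Computer Hardware | Our System | 21.0 / +0.2 | 28.0 / +0.2 | 61.2 / -0.3 |
| English-German | Professional & Business Services | Baseline | 13.8 | 25.2 | 72.2 |
| English-German | Professional & Business Services | Our System | 13.9 / +0.1 | 25.3 / +0.1 | 72.1 / -0.1 |
| English-German | Legal Services | Baseline | 29.3 | 32.7 | 55.2 |
| English-German | Legal Services | Our System | 29.9 / +0.6 | 33.1 / +0.4 | 54.6 / -0.6 |

Table 4: Adaptation results when tuning on the mixed-domain development set. The boldface indicates that the improvement over the baseline is significant.

For every system reported, we run the optimizer at least three times, before running MultEval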
(Clark et al., 2011) for resampling and significance testing. Note that the scores for the systems are averages over multiple runs.

For tuning the systems we explore two kinds of development sets: (1) an in-domain development set of in-domain data that directly exemplifies the translation task (i.e., a sample of target-domain data), and (2) a mixed-domain development set which is a full concatenation of development sets from all the available domains for a language pair; this scenario is a more realistic one when no in-domain data is available. In the analysis section we also test these two scenarios against the scenario mixed-domain minus in-domain, which excludes the in-domain development set part from the mixed-domain development set. By exploring the three different development sets we hope to shed light on the importance of having samples from the target domain when using our features. If our features can indeed capture domain invariance of phrases then they should improve the performance in all three settings, including the most difficult setting where the in-domain data has been explicitly excluded from the tuning phase.

4.4 Main results

In-domain tuning scenario. Table 3 presents the results for the in-domain development set scenario. The integration of the domain-invariant feature functions into the baseline results in a significant improvement across all domains: average +0.50 BLEU on two adaptation tasks for English-French, +0.40 BLEU on three adaptation tasks for English-Spanish and +0.43 BLEU on four adaptation tasks for English-German.

Mixed-domain tuning scenario. While the improvements are robust and consistent for the in-domain development set scenario, we are especially delighted to see a similar improvement for the mixed-domain tuning scenario (Table 4). In detail, we observe an average +0.45 BLEU on two adaptation tasks for English-French, +0.47 BLEU on three adaptation tasks for English-Spanish and +0.30 BLEU on four adaptation tasks for English-German. We would like to emphasize that this performance improvement is obtained without tuning specifically for the target domain or using other domain-related meta-information in the training corpus.

Additional analysis. We investigate the individual contribution of each domain-invariance feature. We conduct experiments using a basic large-scale phrase-based system described in Koehn et al. (2003) as a baseline. The baseline includes two bi-directional phrase-based models ($P_{count}(\tilde{e} \mid \tilde{f})$ and $P_{count}(\tilde{f} \mid \tilde{e})$), three penalties for word, phrase and distortion, and finally, the language model. On top of the baseline, we build four different systems, each augmented with a domain-invariance feature. The first feature is the source-target coherence feature, $D(\tilde{e}, \tilde{f})$, where we use the Chebyshev distance as our default option. We also investigate the performance of other metrics including the Hellinger distance,⁶ and the Kullback-Leibler divergence.⁷ Our second and third features are the domain specificity of phrases on the source side, $D_\alpha(\tilde{f})$, and on the target side, $D_\alpha(\tilde{e})$. Finally, we also deploy all three domain-invariance features together ($D_\alpha(\tilde{f}) + D_\alpha(\tilde{e}) + D(\tilde{e}, \tilde{f})$). The experiments are conducted for the task Legal on English-German.

⁶ $D_H(\vec{\tilde{e}}, \vec{\tilde{f}}) = \frac{1}{\sqrt{2}} \sqrt{\sum_{z} \Big( \sqrt{P(z \mid \tilde{e})} - \sqrt{P(z \mid \tilde{f})} \Big)^2}$.

⁷ $D_{KL}(\vec{\tilde{e}}, \vec{\tilde{f}}) = \sum_{z} P(z \mid \tilde{e}) \log \frac{P(z \mid \tilde{e})}{P(z \mid \tilde{f})}$; $D_{KL}(\vec{\tilde{f}}, \vec{\tilde{e}}) = \sum_{z} P(z \mid \tilde{f}) \log \frac{P(z \mid \tilde{f})}{P(z \mid \tilde{e})}$.

English-German (Task: Legal)

| Dev | System | BLEU↑ |
|---|---|---|
| In-domain | Baseline | 28.8 |
| In-domain | + $D(\tilde{e}, \tilde{f})$ | 29.1 / +0.3 |
| In-domain | + $D_\alpha(\tilde{e})$ | 29.4 / +0.6 |
| In-domain | + $D_\alpha(\tilde{f})$ | 29.8 / +1.0 |
| In-domain | + $D_\alpha(\tilde{f})$ + $D_\alpha(\tilde{e})$ + $D(\tilde{e}, \tilde{f})$ | 29.9 / +1.1 |
| Mixed-domains | Baseline | 28.5 |
| Mixed-domains | + $D(\tilde{e}, \tilde{f})$ | 28.8 / +0.3 |
| Mixed-domains | + $D_\alpha(\tilde{e})$ | 29.3 / +0.8 |
| Mixed-domains | + $D_\alpha(\tilde{f})$ | 29.6 / +1.1 |
| Mixed-domains | + $D_\alpha(\tilde{f})$ + $D_\alpha(\tilde{e})$ + $D(\tilde{e}, \tilde{f})$ | 29.8 / +1.3 |
| Mixed-domains (Exclude Legal) | Baseline | 28.3 |
| Mixed-domains (Exclude Legal) | + $D(\tilde{e}, \tilde{f})$ | 28.6 / +0.3 |
| Mixed-domains (Exclude Legal) | + $D_\alpha(\tilde{e})$ | 29.1 / +0.8 |
| Mixed-domains (Exclude Legal) | + $D_\alpha(\tilde{f})$ | 29.5 / +1.2 |
| Mixed-domains (Exclude Legal) | + $D_\alpha(\tilde{f})$ + $D_\alpha(\tilde{e})$ + $D(\tilde{e}, \tilde{f})$ | 29.6 / +1.3 |

Table 5: Improvements over the baseline. The boldface indicates that the difference is statistically significant.
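The measures compared in this analysis are all simple functions of two subdomain posterior vectors. The following Python sketch shows one way they could be computed; the function names, the example vectors and the `eps` smoothing constant in the KL term are assumptions made for illustration and are not taken from the described system.

```python
import math


def renyi_entropy(p, alpha=2.0):
    """Renyi entropy of order alpha for a discrete distribution p (alpha != 1).

    With alpha = 2 this is the collision entropy used as D_alpha in the text.
    """
    if alpha == 1.0:
        raise ValueError("alpha must differ from 1")
    return math.log(sum(x ** alpha for x in p)) / (1.0 - alpha)


def chebyshev(p, q):
    """Largest absolute difference between matching components of p and q."""
    return max(abs(a - b) for a, b in zip(p, q))


def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against zeros (an assumption of this sketch)."""
    return sum(a * math.log((a + eps) / (b + eps))
               for a, b in zip(p, q) if a > 0.0)


# Hypothetical posteriors over K = 3 latent subdomains for one phrase pair.
source_vec = [0.7, 0.2, 0.1]
target_vec = [0.1, 0.2, 0.7]

features = {
    "specificity_source": renyi_entropy(source_vec),
    "specificity_target": renyi_entropy(target_vec),
    "coherence": chebyshev(source_vec, target_vec),
}
```

A peaked vector such as `[1.0, 0.0, 0.0]` gives a Rényi entropy of zero, while a uniform vector over K subdomains gives log K, so larger values of this quantity correspond to phrases that are spread across more subdomains.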
German-English (Task: Legal Services)

Example 1
Input: im jahr 2004 befindet der rat über die verpflichtung der elektronischen übertragung solcher aufzeichnungen.
Reference: the council shall decide in 2004 on the obligation to transmit such records electronically.
Baseline: in 2004 the council is the obligation on the electronic transfer of such records.
+ $D_\alpha(\tilde{f})$: in 2004 the council is on the obligation of electronic transfer of such records.
+ $D_\alpha(\tilde{e})$: in 2004 the council is on the obligation of electronic transmission of such records.
+ $D(\tilde{e}, \tilde{f})$: in 2004 the council is on the obligation of electronic transmission of such records.
+ ALL: in 2004 the council is on the obligation of electronic transmission of such records.

Example 2
Input: die angemessenheit und wirksamkeit der internen verwaltungssysteme sowie die leistung der dienststellen
Reference: for assessing the suitability and effectiveness of internal management systems and the performance of departments
Baseline: the adequacy and effectiveness of internal administrative systems as well as the performance of the services
+ $D_\alpha(\tilde{f})$: the adequacy and effectiveness of the internal management systems, as well as the performance of the services
+ $D_\alpha(\tilde{e})$: the adequacy and effectiveness of internal management systems, as well as the performance of the services
+ $D(\tilde{e}, \tilde{f})$: the adequacy and effectiveness of the internal administrative systems as well as the performance of the services
+ ALL: the adequacy and effectiveness of internal management systems, as well as the performance of the services

Example 3
Input: zur ausführung der ausgaben nimmt der anweisungsbefugte mittelbindungen vor, geht rechtliche verpflichtungen ein
Reference: to implement expenditure, the authorising officer shall make budget commitments and legal commitments
Baseline: the implementation of expenditure, the authorising officer commitments before, is a legal commitments
+ $D_\alpha(\tilde{f})$: the implementation of expenditure, the authorising officer commitments, is a legal obligations
+ $D_\alpha(\tilde{e})$: the implementation of expenditure, the authorising officer commitments before, is a legal obligations
+ $D(\tilde{e}, \tilde{f})$: the implementation of expenditure, the authorising officer commitments before, is a legal commitments
+ ALL: the implementation of expenditure, the authorising officer commitments before, is a legal obligations

Table 7: Translation outputs produced by the basic baseline and its augmented systems with additional abstract feature functions derived from hidden domain information.

English-German (Task: Legal)

| Dev | Metric | BLEU↑ |
|---|---|---|
| In-domain | Chebyshev | 29.1 / +0.3 |
| In-domain | Kullback-Leibler ($D_{KL}(\vec{\tilde{e}}, \vec{\tilde{f}})$) | 29.2 / +0.4 |
| In-domain | Kullback-Leibler ($D_{KL}(\vec{\tilde{f}}, \vec{\tilde{e}})$) | 29.0 / +0.2 |
| In-domain | Hellinger | 29.0 / +0.2 |

Table 6: Using different metrics as the measure of coherence.

Table 5 and Table 6 present the results. Overall, we can see that all domain-invariance features contribute to adaptation performance. Specifically, we observe the following:

• Favouring the source-target coherence across subdomains (i.e., adding the feature $D(\tilde{e}, \tilde{f})$) provides a significant translation improvement of +0.3 BLEU. Which specific similarity measure is used does not seem to matter that much (see Table 6). We obtain the best result (+0.4 BLEU) with the KL divergence ($D_{KL}(\vec{\tilde{e}}, \vec{\tilde{f}})$). However, the differences are not statistically significant.
• Integrating a preference for less domain-specific translation phrases at the target side ($D_\alpha(\tilde{e})$) leads to a translation improvement of +0.6 BLEU.
• Doing the same for the source side ($D_\alpha(\tilde{f})$), in turn, leads to an improvement of +1.0 BLEU.
• Augmenting the baseline by integrating all our features leads to the best result, with an improvement of +1.1 BLEU.
• The translation improvement is observed also for training with a development set of mixed domains (even for the mixed-domain minus in-domain setting when excluding the Legal data from the mixed development set).
• The weights for all domain-invariance features, once tuned, are positive in all the experiments.

Table 7 presents examples of translations from different systems. For example, the domain-invariant system revises the translation from “electronic transfer” to “electronic transmission” for the
German phrase “elektronischen Übertragung”, and from “internal administrative systems” to “internal management systems” for the German phrase “internen verwaltungssysteme”. The revisions, however, are not always successful. For instance, adding $D_\alpha(\tilde{e})$ and $D_\alpha(\tilde{f})$ resulted in revising the translation of the German phrase “rechtliche verpflichtungen” to “legal obligations”, which is a worse choice (at least according to BLEU) than “legal commitments” produced by the baseline.

We also present a brief analysis of latent subdomains induced by our projection framework. For each subdomain $z$ we integrate the domain posteriors ($P(z \mid \tilde{e})$ and $P(z \mid \tilde{f})$) and the source-target domain-coherence feature ($\big| P(z \mid \tilde{e}) - P(z \mid \tilde{f}) \big|$). We hypothesize that whenever we observe an improvement for a translation task with domain-informed features, this means that the corresponding latent subdomain $z$ is close to the target translation domain. The results are presented in Table 8.

English-German (columns +z1 to +z12: Our System with features from the given latent subdomain)

| Task | Baseline | +z1 | +z2 | +z3 | +z4 | +z5 | +z6 | +z7 | +z8 | +z9 | +z10 | +z11 | +z12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hardware | 20.2 | 20.4 | 20.4 | 20.4 | 20.5 | 20.5 | 20.5 | 20.4 | 20.4 | 20.5 | 20.4 | 20.4 | 20.4 |
| Software | 22.8 | 23.0 | 23.0 | 23.0 | 22.8 | 22.9 | 23.1 | 23.0 | 23.0 | 23.0 | 23.0 | 23.0 | 22.8 |
| P&B Services | 13.3 | 13.6 | 13.6 | 13.3 | 13.5 | 13.6 | 13.6 | 13.5 | 13.5 | 13.6 | 13.5 | 13.6 | 13.5 |
| Legal | 28.5 | 28.7 | 28.6 | 29.1 | 28.7 | 28.6 | 28.9 | 28.8 | 28.8 | 28.9 | 28.6 | 28.6 | 28.8 |

Table 8: Latent Subdomain Analysis (with BLEU score).

Apparently, among the latent subdomains, z4, z5, z6, z9 are closest to the target domain of Hardware. Their derived feature functions are helpful in improving the translation accuracy for the task. Similarly, z1, z2, z5, z6, z9 and z11 are closest to Professional & Business, z6 is closest to Software, and z3 is closest to Legal. Meanwhile, z4, z5 and z12 are not relevant to the task of Software. Similarly, z3 is not relevant to Professional & Business, and z2, z5 and z10 are not relevant to Legal.

Using topic models instead of latent domains. Our domain-invariance framework demands access to posterior distributions of latent domains for phrases. Though we argued for using our domain induction approach, other latent variable models can be used to compute these posteriors. One natural option is to use topic models, and more specifically LDA (Blei et al., 2003). Will our domain-invariance framework still
workwithtopicmodels,andhowcloselyrelatedaretheinducedlatentdomainsinducedwithLDAandourmodel?Thesearethequestionswestudyinthissection.WeestimateLDAatthesentencelevelinamono-lingualregime8ononesideofeachparallelcorpus(letusassumefornowthatthisisthesourceside).Whenthemodelisestimated,weobtainthepos-teriordistributionsoftopics(wedenotethemasz,aswetreatthemasdomains)foreachsource-sidesentenceinthetrainingset.Now,aswedidwithourphraseinductionframework,weassociatetheseposteriorswitheveryphrasebothinthesourceandinthetargetsidesofthatsentencepair.Phraseandphrase-pairfeaturesdefinedinSection2arecom-putedrelyingontheseprobabilitiesaveragedovertheentiretrainingset.Wetrybothdirections,thatisalsoestimatingLDAonthetargetsideandtransfer-ringtheposteriorprobabilitiestothesourceside.InordertoestimateLDA,weusedGibbssam-plingimplementedintheMalletpackage(McCal-lum,2002)withdefaultvaluesofhyper-parameters(α=0.01andβ=0.01).Table9presentstheresultsfortheLegaltaskwiththreedifferentsys-temoptimizationsettings.BLEU,METEORandTERarereported.Astheresultsuggests,usingourinductionframeworktendstoyieldslightlybettertranslationresultsintermsofMETEORandespe-ciallyBLEU.However,usingLDAseemstoleadtoslightlybettertranslationresultintermsofTER.TopicsinLDA-likemodelsencodeco-occurrencepatternsinbag-of-wordrepresentationsofsen-tences.Incontrast,domainsinourdomain-inductionframeworkrelyonngramsandword-alignmentinformation.Consequently,thesemod-8NotethatbilingualLDAmodels(e.g.,seeHasleretal.(2014),Zhangetal.(2014))couldpotentiallyproducebetterresultsbutweleavethemforfuturework.
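The projection of sentence-level topic posteriors onto phrases, and the coherence feature built from the result, can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation used in the experiments: the function names `average_phrase_posteriors` and `domain_coherence`, the corpus layout (one posterior vector plus source and target phrase lists per sentence pair), and the toy numbers are all hypothetical, and the exact feature definitions of Section 2 may differ in detail.

```python
from collections import defaultdict


def average_phrase_posteriors(corpus):
    """Average sentence-level posteriors P(z | sentence) over every phrase occurrence.

    corpus: iterable of (posterior, src_phrases, tgt_phrases), where posterior is a
    list of topic/domain probabilities for one sentence pair (hypothetical layout).
    Returns two dicts mapping phrase -> averaged posterior (source side, target side).
    """
    sums = {"src": {}, "tgt": {}}
    counts = {"src": defaultdict(int), "tgt": defaultdict(int)}
    for posterior, src_phrases, tgt_phrases in corpus:
        for side, phrases in (("src", src_phrases), ("tgt", tgt_phrases)):
            for phrase in phrases:
                # Accumulate the sentence posterior for each phrase it contains.
                acc = sums[side].setdefault(phrase, [0.0] * len(posterior))
                for z, p in enumerate(posterior):
                    acc[z] += p
                counts[side][phrase] += 1
    averaged = {}
    for side in ("src", "tgt"):
        averaged[side] = {
            phrase: [s / counts[side][phrase] for s in acc]
            for phrase, acc in sums[side].items()
        }
    return averaged["src"], averaged["tgt"]


def domain_coherence(p_src, p_tgt):
    """Per-domain coherence |P(z | e) - P(z | f)| for one phrase pair."""
    return [abs(a - b) for a, b in zip(p_src, p_tgt)]


# Toy corpus with two latent domains (numbers are made up for illustration).
toy_corpus = [
    ([0.75, 0.25], ["rechtliche verpflichtungen"], ["legal obligations"]),
    ([0.25, 0.75], ["rechtliche verpflichtungen"], ["legal commitments"]),
]
src_post, tgt_post = average_phrase_posteriors(toy_corpus)
print(domain_coherence(src_post["rechtliche verpflichtungen"],
                       tgt_post["legal obligations"]))  # [0.25, 0.25]
```

In this toy case the source phrase is spread evenly over both domains while "legal obligations" leans towards the first one, so the pair receives a non-zero coherence penalty in each domain.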
English-German (Task: Legal)
Dev                            Algorithms    BLEU↑  METEOR↑  TER↓
In-domain                      Our           29.9   33.1     55.5
                               LDA (source)  29.9   33.1     55.4
                               LDA (target)  29.9   33.1     55.3
Mixed-domains                  Our           29.8   32.9     54.9
                               LDA (source)  29.7   32.9     54.8
                               LDA (target)  29.7   32.9     54.8
Mixed-domains (Exclude Legal)  Our           29.6   32.8     54.6
                               LDA (source)  29.4   32.7     54.5
                               LDA (target)  29.4   32.7     54.6
Table 9: Comparison in latent domain induction with various algorithms.

We also investigate translation performance when we use both coherence features from LDA and coherence features from our own framework. Table 10 shows that using all the induced coherence features gives the best or tied-best score on every metric (BLEU, METEOR and TER). We leave the exploration of such an extension for future work.

English-German (Task: Legal)
Dev            Algorithms             BLEU↑  METEOR↑  TER↓
Mixed domains  Our features           29.8   32.9     54.9
               LDA (source) features  29.7   32.9     54.8
               All Features           29.8   33.0     54.7
Table 10: Combination of all features.

5 Related Work and Discussion

Domain adaptation is an important challenge for many NLP problems. A good survey of potential translation errors in MT adaptation can be found in Irvine et al. (2013). Lexical selection appears to be the most common source of errors in domain adaptation scenarios (Irvine et al., 2013; Wees et al., 2015). Other translation errors include reordering errors (Chen et al., 2013a; Zhang et al., 2015), alignment errors (Cuong and Sima'an, 2015) and overfitting to the source domain at the parameter tuning stage (Joty et al., 2015).

Adaptation in SMT can be regarded as injecting prior knowledge about the target translation task into the learning process. Various approaches have so far been exploited in the literature. They can be loosely categorized according to the type of prior knowledge exploited for adaptation. Often, a seed in-domain corpus exemplifying the target translation task is used as a form of prior knowledge. Various techniques can then be used for adaptation. For example, one approach is to combine a system trained on the in-domain data with another general-domain system trained on the rest of the data (e.g., see Koehn and Schroeder (2007), Foster et al. (2010), Bisazza et al. (2011), Sennrich (2012b), Razmara et al. (2012), Sennrich et al. (2013), Haddow (2013), Joty et al. (2015)). Rather than using the entire training data, it is also common to combine the in-domain system with a system trained on a selected subset of the data (e.g., see Axelrod et al. (2011), Koehn and Haddow (2012), Duh et al. (2013), Kirchhoff and Bilmes (2014), Cuong and Sima'an (2014b)).

In some other cases, the prior knowledge lies in meta-information about the training data. This could be document-annotated training information (Eidelman et al., 2012; Hu et al., 2014; Hasler et al., 2014; Su et al., 2015; Zhang et al., 2014) or domain-annotated sub-corpora (Chiang et al., 2011; Sennrich, 2012b; Chen et al., 2013b; Carpuat et al., 2014; Cuong and Sima'an, 2015). Some recent approaches perform adaptation by exploiting a target-domain development set, or even only the source side of the development set (Sennrich, 2012a; Carpuat et al., 2013; Carpuat et al., 2014; Mansour and Ney, 2014).

Recently, there has been some research on adapting simultaneously to multiple domains, a goal related to ours (Clark et al., 2012; Sennrich, 2012a). For instance, Clark et al. (2012) augment a phrase-based MT system with various domain indicator features to build a single system that performs well across a range of domains. Sennrich (2012a) proposed to cluster training data in an unsupervised fashion to build mixture models that yield good performance on multiple test domains. However, their approaches are very different from ours, which minimizes the risk associated with choosing domain-specific translations. Moreover, the present work deviates radically from earlier work in that it explores the scenario where no prior data or knowledge is available about the translation task during training time. The focus of our approach is to aim for safer translation by rewarding domain-invariance of translation rules over latent subdomains that can be (still) useful on adaptation tasks.
The present study is inspired by Zhang et al. (2014), who exploit topic-insensitivity learned over documents for translation. The goal and setting we are working on are markedly different (i.e., we do not have access to meta-information about the training and translation tasks at all). The domain-invariance induced is integrated into SMT systems as feature functions, redirecting the decoder to a better search space for the translation over adaptation tasks. This aims at biasing the decoder towards translations that are less domain-specific and more source-target domain coherent.

There is an interesting relation between this work and extensive prior work on minimum Bayes risk (MBR) objectives (used either at test time (Kumar and Byrne, 2004) or during training (Smith and Eisner, 2006; Pauls et al., 2009)). As with our work, the goal of MBR minimization is to select translations that are less "risky". Their risk is due to the uncertainty in model predictions, and some of this uncertainty may indeed be associated with domain-variability of translations. Still, a system trained with an MBR objective will tend to output the most frequent translation rather than the most domain-invariant one, and this, as we argued in the introduction, might not be the right decision when applying it across domains. We believe that the two classes of methods are largely complementary, and leave further investigation for future work.

At a conceptual level our work is also related to regularizers used in learning domain-invariant neural models (Titov, 2011), specifically autoencoders. Though they also consider divergences between distributions of latent variable vectors, they use these divergences at learning time to bias models to induce representations maximally invariant across domains. Moreover, they assume access to meta-information about domains and consider only classification problems.

6 Conclusion

This paper aims at adapting machine translation systems to all domains at once by favoring phrases that are domain-invariant, that is, phrases that are safe to use across a variety of domains. While typical domain adaptation systems expect a sample of the target domain, our approach does not require one and is directly applicable to any domain adaptation scenario. Experiments show that the proposed approach results in modest but consistent improvements in BLEU, METEOR and TER. To the best of our knowledge, our results are the first to suggest consistent and significant improvement by a fully unsupervised adaptation method across a wide variety of translation tasks.

The proposed adaptation framework is fairly simple, leaving much space for future research. One potential direction is the introduction of additional features relying on the assignment of phrases to domains. The framework for inducing latent domains proposed in this paper should be beneficial in this future work. The implementation of our subdomain-induction framework is available at https://github.com/hoangcuong2011/UDIT.

Acknowledgements

We thank anonymous reviewers for their constructive comments on earlier versions. We also thank Hui Zhang for his help with the expected Kneser-Ney smoothing technique. The first author is supported by the EXPERT (EXPloiting Empirical appRoaches to Translation) Initial Training Network (ITN) of the European Union's Seventh Framework Programme. The second author is supported by VICI grant nr. 277-89-002 from the Netherlands Organization for Scientific Research (NWO). We thank TAUS for providing us with suitable data.

References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of EMNLP.

Arianna Bisazza, Nick Ruiz, and Marcello Federico. 2011. Fill-up versus interpolation methods for phrase-based SMT adaptation. In IWSLT.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. JMLR.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Comput. Linguist.

Marine Carpuat, Hal Daume III, Katharine Henry, Ann Irvine, Jagadeesh Jagarlamudi, and Rachel Rudinger. 2013. Sensespotting: Never let your parallel data tie you to an old domain. In Proceedings of ACL.
Marine Carpuat, Cyril Goutte, and George Foster. 2014. Linear mixture models for robust machine translation. In Proc. of WMT.

Boxing Chen, George Foster, and Roland Kuhn. 2013a. Adaptation of reordering models for statistical machine translation. In Proceedings of NAACL.

Boxing Chen, Roland Kuhn, and George Foster. 2013b. Vector space model for adaptation in statistical machine translation. In Proceedings of the ACL.

Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the NAACL-HLT.

David Chiang, Steve DeNeefe, and Michael Pust. 2011. Two easy improvements to lexical weighting. In Proceedings of ACL (Short Papers).

Jonathan Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of ACL (Short Papers).

Jonathan Clark, Alon Lavie, and Chris Dyer. 2012. One system, many domains: Open-domain statistical machine translation via feature augmentation.

Trevor Cohn and Gholamreza Haffari. 2013. An infinite hierarchical bayesian model of phrasal translation. In Proceedings of the ACL.

Hoang Cuong and Khalil Sima'an. 2014a. Latent domain phrase-based models for adaptation. In Proceedings of EMNLP.

Hoang Cuong and Khalil Sima'an. 2014b. Latent domain translation models in mix-of-domains haystack. In Proceedings of COLING.

Hoang Cuong and Khalil Sima'an. 2015. Latent domain word alignment for heterogeneous corpora. In Proceedings of NAACL-HLT.

Arthur Dempster, Nan Laird, and Donald Rubin. 1977. Maximum likelihood from incomplete data via the em algorithm. JRSS, SERIES B, 39(1):1–38.

John DeNero, Dan Gillick, James Zhang, and Dan Klein. 2006. Why generative phrase models underperform surface heuristics. In Proc. of WMT.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proc. of WMT.

Kevin Duh, Graham Neubig, Katsuhito Sudoh, and Hajime Tsukada. 2013. Adaptation data selection using neural language models: Experiments in machine translation. In Proceedings of the ACL.

Vladimir Eidelman, Jordan Boyd-Graber, and Philip Resnik. 2012. Topic models for dynamic translation model adaptation. In ACL (Short Papers).

George Foster, Cyril Goutte, and Roland Kuhn. 2010. Discriminative instance weighting for domain adaptation in statistical machine translation. In Proceedings of EMNLP.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of EMNLP.

Kuzman Ganchev, Ben Taskar, Fernando Pereira, and Joao Gama. 2009. Posterior vs parameter sparsity in latent variable models. In Proceedings of NIPS.

Barry Haddow. 2013. Applying pairwise ranked optimisation to improve the interpolation of translation models. In Proceedings of NAACL-HLT.

Eva Hasler, Phil Blunsom, Philipp Koehn, and Barry Haddow. 2014. Dynamic topic adaptation for phrase-based mt. In Proceedings of EACL.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan Clark, and Philipp Koehn. 2013. Scalable Modified Kneser-Ney Language Model Estimation. In Proceedings of the ACL (Volume 2: Short Papers).

Yuening Hu, Ke Zhai, Vladimir Eidelman, and Jordan Boyd-Graber. 2014. Polylingual tree-based topic models for translation domain adaptation. In Proceedings of the ACL.

Ann Irvine, John Morgan, Marine Carpuat, Hal Daume III, and Dragos Munteanu. 2013. Measuring machine translation errors in new domains. In TACL.

Mark Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In Proceedings of EMNLP-CoNLL.

Shafiq Joty, Hassan Sajjad, Nadir Durrani, Kamla Al-Mannai, Ahmed Abdelali, and Stephan Vogel. 2015. How to avoid unwanted pregnancies: Domain adaptation using neural network models. In Proceedings of EMNLP.

Katrin Kirchhoff and Jeff Bilmes. 2014. Submodularity for data selection in machine translation. In EMNLP.

Philipp Koehn and Barry Haddow. 2012. Towards effective use of training data in statistical machine translation. In Proceedings of the WMT.

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of WMT.

Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of MT Summit.
Shankar Kumar and William J. Byrne. 2004. Minimum bayes-risk decoding for statistical machine translation. In HLT-NAACL.

Saab Mansour and Hermann Ney. 2014. Unsupervised adaptation for statistical machine translation. In Proceedings of WMT.

Spyros Matsoukas, Antti-Veikko I. Rosti, and Bing Zhang. 2009. Discriminative corpus weight estimation for machine translation. In Proceedings of EMNLP.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Markos Mylonakis and Khalil Sima'an. 2008. Phrase translation probabilities with itg priors and smoothing as learning objective. In Proceedings of EMNLP.

Graham Neubig, Taro Watanabe, Eiichiro Sumita, Shinsuke Mori, and Tatsuya Kawahara. 2011. An unsupervised model for joint phrase alignment and extraction. In Proceedings of ACL-HLT.

Franz Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Comput. Linguist., pages 19–51.

Franz Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Comput. Linguist., pages 417–449.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of ACL.

Adam Pauls, John DeNero, and Dan Klein. 2009. Consensus training for consensus decoding in machine translation. In Proceedings of EMNLP.

Majid Razmara, George Foster, Baskaran Sankaran, and Anoop Sarkar. 2012. Mixing multiple translation models in statistical machine translation. In Proceedings of ACL.

Rico Sennrich, Holger Schwenk, and Walid Aransa. 2013. A multi-domain translation model framework for statistical machine translation. In Proceedings of ACL.

Rico Sennrich. 2012a. Mixture-modeling with unsupervised clusters for domain adaptation in statistical machine translation. In Proceedings of the EAMT.

Rico Sennrich. 2012b. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of EACL.

David A. Smith and Jason Eisner. 2006. Minimum risk annealing for training log-linear models. In Proceedings of the COLING/ACL.

Matthew Snover, Bonnie Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA.

Jinsong Su, Deyi Xiong, Yang Liu, Xianpei Han, Hongyu Lin, Junfeng Yao, and Min Zhang. 2015. A context-aware topic model for statistical machine translation. In Proceedings of the ACL-IJCNLP.

Ivan Titov. 2011. Domain adaptation by constraining inter-domain variability of latent feature representation. In Proceedings of ACL.

Marlies van der Wees, Arianna Bisazza, Wouter Weerkamp, and Christof Monz. 2015. What's in a Domain? Analyzing Genre and Topic Differences in SMT. In Proceedings of ACL-IJCNLP (short paper).

Hui Zhang and David Chiang. 2014. Kneser-Ney Smoothing on Expected Counts. In Proceedings of ACL.

Min Zhang, Xinyan Xiao, Deyi Xiong, and Qun Liu. 2014. Topic-based dissimilarity and sensitivity models for translation rule selection. JAIR.

Biao Zhang, Jinsong Su, Deyi Xiong, Hong Duan, and Junfeng Yao. 2015. Discriminative reordering model adaptation via structural learning. In IJCAI.