Transactions of the Association for Computational Linguistics, vol. 5, pp. 487–500, 2017. Action Editor: Chris Quirk.
Submission batch: 3/2017; Published 11/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Phrase Table Induction Using In-Domain Monolingual Data for Domain Adaptation in Statistical Machine Translation

Benjamin Marie    Atsushi Fujita
National Institute of Information and Communications Technology
3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan
{bmarie, atsushi.fujita}@nict.go.jp

Abstract

We present a new framework to induce an in-domain phrase table from in-domain monolingual data that can be used to adapt a general-domain statistical machine translation system to the targeted domain. Our method first compiles sets of phrases in source and target languages separately and generates candidate phrase pairs by taking the Cartesian product of the two phrase sets. It then computes inexpensive features for each candidate phrase pair and filters them using a supervised classifier in order to induce an in-domain phrase table. We experimented on the language pair English–French, both translation directions, in two domains and obtained consistently better results than a strong baseline system that uses an in-domain bilingual lexicon. We also conducted an error analysis that showed the induced phrase tables proposed useful translations, especially for words and phrases unseen in the parallel data used to train the general-domain baseline system.

1 Introduction

In phrase-based statistical machine translation (SMT), translation models are estimated over a large amount of parallel data. In general, using more data leads to a better translation model. When no specific domain is targeted, general-domain[1] parallel data from various domains may be used to train a general-purpose SMT system. However, it is well-known that, in training a system to translate texts from a specific domain, using in-domain parallel data can lead to a significantly better translation quality (Carpuat et al., 2012). Indeed, when only general-domain parallel data are used, it is unlikely that the translation model can learn expressions and their translations specific to the targeted domain. Such expressions will then remain untranslated in the in-domain texts to translate.

[1] As in Axelrod et al. (2011), in this paper, we use the term general-domain instead of the commonly used out-of-domain because we assume that the parallel data may contain some in-domain sentence pairs.

So far, in-domain parallel data have been harnessed to cover domain-specific expressions and their translations in the translation model. However, even if we can assume the availability of a large quantity of general-domain parallel data, at least for resource-rich language pairs, finding in-domain parallel data specific to a particular domain remains challenging. In-domain parallel data may not exist for the targeted language pairs or may not be available at hand to train a good translation model.

In order to circumvent the lack of in-domain parallel data, this paper presents a new method to adapt an existing SMT system to a specific domain by inducing an in-domain phrase table, i.e., a set of phrase pairs associated with features for decoding, from in-domain monolingual data. As we review in Section 2, most of the existing methods for inducing phrase tables are not designed, and may not perform as expected, to induce a phrase table for a specific domain for which only limited resources are available. Instead of relying on a large quantity of parallel data or highly comparable corpora, our method induces an in-domain phrase table from unaligned in-domain monolingual data through a three-step procedure: phrase collection, phrase pair scoring, and phrase pair filtering. Incorporating our induced in-domain phrase table into an SMT system achieves substantial improvements in translating in-domain texts over a strong baseline system, which uses an in-domain bilingual lexicon.

To achieve this improvement, our proposed method for inducing an in-domain phrase table addresses several limitations of previous work by:

• dealing with source and target phrases of arbitrary length collected from in-domain monolingual data,
• proposing translations for not only unseen source phrases, but also those already seen in the general-domain parallel data, and
• making use of potentially many features computed from the monolingual data, as well as from the parallel data, in order to score and filter the candidate phrase pairs.

In the remainder of this paper, we first review previous work in Section 2, highlighting the main weaknesses of existing methods for inducing a phrase table for domain adaptation, and our motivation. In Section 3, we then present our phrase table induction method with all the necessary steps: phrase collection (Section 3.1), computing features of each phrase pair (Section 3.2), and pruning the induced phrase tables to keep their size manageable (Section 3.3). In Section 4, we describe our experiments to evaluate the impact of the induced phrase tables in translating in-domain texts. Following the description of the data (Section 4.1), we explain the tools and parameters used to induce the phrase tables (Section 4.2), our SMT systems (Section 4.3), and present additional baseline systems (Section 4.4). Our experimental results are given in Section 4.5. Section 5.1 analyzes the error distribution of the translations produced by an SMT system using our induced phrase table, followed by translation examples to further illustrate its impact in Section 5.2. Finally, Section 6 concludes this work and proposes some possible improvements to our approach.

2 Motivation

In machine translation (MT), words and phrases that do not appear in the training parallel data, i.e., out-of-vocabulary (OOV) tokens, have been recognized as one of the fundamental issues, regardless of the scenario, such as adapting existing SMT systems to a new specific domain. One straightforward way to find translations of OOV words and phrases consists in enlarging the parallel data used to train the translation model. This can be done by retrieving parallel sentences from comparable corpora. However, these methods heavily rely on document-level information (Zhao and Vogel, 2002; Utiyama and Isahara, 2003; Fung and Cheung, 2004; Munteanu and Marcu, 2005) to reduce their search space by scoring only sentence pairs extracted from each pair of documents. Indeed, scoring all possible sentence pairs from two large monolingual corpora using costly features and a classifier, as proposed by Munteanu and Marcu (2005) for instance, is computationally too expensive.[2]

[2] For instance, using these approaches on source and target monolingual data each containing 5 million sentences means that we have to evaluate 25 × 10¹² candidate sentence pairs.

In many cases, we may not have access to document-level information in the given monolingual data for the targeted domain. Furthermore, even without considering computational cost, it is unlikely that a large number of parallel sentences can be retrieved from non-comparable monolingual corpora. Hewavitharana and Vogel (2016) proposed to directly extract phrase pairs from comparable sentences. However, the number of retrievable phrase pairs is strongly limited, because one can collect such comparable sentences only on a relatively small scale for the targeted language pairs and domains.

When in-domain parallel or comparable sentences cannot be easily retrieved, another possibility to find translations for OOV words is bilingual word lexicon induction using comparable or unaligned monolingual corpora (Fung, 1995; Rapp, 1995; Koehn and Knight, 2002; Haghighi et al., 2008; Daumé and Jagarlamudi, 2011; Irvine and Callison-Burch, 2013). This approach is especially useful in finding words and their translations specific to the given corpus. A recent and completely different trend of work uses an unsupervised method regarding translation as a decipherment problem to learn a bilingual word lexicon and use it as a translation model (Ravi and Knight, 2011; Dou and Knight, 2012; Nuhn et al., 2012). However, all these methods deal only with


words, mainly owing to the computational complexity of dealing with arbitrary lengths of phrases.

Translations of phrases can be induced using bilingual word lexicons and considering permutations of word ordering (Zhang and Zong, 2013; Irvine and Callison-Burch, 2014). However, it is costly to thoroughly investigate all combinations of a large number of word-level translation candidates and possible permutations of word ordering. To retain only appropriate phrase pairs, Irvine and Callison-Burch (2014) proposed to exploit a set of features. Some of them, including temporal, contextual, and topic similarity features, strongly relied on the comparability of Wikipedia articles and on the availability of news articles annotated with a timestamp (Klementiev et al., 2012). We may not have such useful resources in large quantity for the targeted language pairs and domains.

Saluja et al. (2014) and Zhao et al. (2015) also proposed methods to induce a phrase table, focusing only on the OOV words and phrases: unigrams and bigrams in the source side of their development and test data that are unseen in the training data. In their approach, no new translation options are proposed for known source phrases. To generate candidate phrase pairs, for a given source phrase, Saluja et al. (2014) use only phrases from the target side of their parallel data and their morphological variants, ranked and pruned according to the forward lexical translation probabilities given by their baseline system's translation model. Their approach thus strongly relies on the accuracy of the existing translation model. For instance, if the given source phrase contains only OOV tokens, as it may happen when translating a text from a different domain, their approach cannot retrieve candidate target phrases. Furthermore, they do not make use of external monolingual data to explore unseen target phrases. Their method is consequently inadequate to produce translations for phrases from a different domain than the one of the parallel data.

While Saluja et al. (2014) used a costly graph propagation strategy to score the candidate phrase pairs, Zhao et al. (2015) used a method with a much lower computational cost and reported higher BLEU scores using only word embeddings to score and rank many phrase pairs generated from target phrases, unigrams and bigrams, collected from monolingual corpora. The main contribution of Zhao et al. (2015) is the use of a local linear projection strategy (LLP) to obtain a cross-lingual semantic similarity score for each phrase pair. It makes the projection of source embeddings to the target embedding space by learning a translation matrix for each source phrase embedding, trained on m gold phrase pairs with source phrase embeddings similar to the one to project. After the projection, based only on the similarity over embeddings, the k nearest target phrases of the projected source phrase are retrieved. If the projection for a given source phrase is not accurate enough, very noisy phrase pairs are generated. This may be a problem especially when the given source phrase does not need to be translated (i.e., numbers, dates, molecule names, etc.). The system will translate it, because this source phrase, previously OOV, is now registered in its induced phrase table, but has only wrong translations available (see Section 4.5 for empirical evidence).

3 In-domain phrase table induction

To induce an in-domain phrase table, our approach assumes the availability of large general-domain parallel data and in-domain monolingual data of both source and target languages. For some of our configurations, we also assume the availability of an in-domain bilingual lexicon to compute features associated with each candidate phrase pair and to compute a reliability score to filter appropriate ones.

3.1 In-domain phrase collection

In a standard configuration, SMT systems extract phrases of a length up to six or seven tokens. Collecting all the n-grams of such a length from a given large monolingual corpus is feasible, but will provide a large set of source and target phrases, resulting in an enormous number of candidate phrase pairs. In the next step, we evaluate each candidate in a given set of phrase pairs; it is thus crucial to get a reasonably small set of phrases. In contrast with previous work, we collect more meaningful phrases than arbitrary short n-grams, using the following formula presented by Mikolov et al. (2013a):

score(wi wj) = (freq(wi wj) − δ) / (freq(wi) × freq(wj))
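As a minimal sketch, one pass of this collocation scoring could look as follows (the toy corpus and the values of δ and θ here are hypothetical, chosen only for illustration, not the defaults used in the experiments):

```python
from collections import Counter

def collect_phrases(tokens, delta, theta):
    """One word2phrase-style pass: merge the bigrams wi wj whose score
    (freq(wi wj) - delta) / (freq(wi) * freq(wj)) exceeds theta."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    phrases = set()
    for (wi, wj), f_ij in bigram.items():
        score = (f_ij - delta) / (unigram[wi] * unigram[wj])
        if score > theta:
            # identified phrases are merged into a single token
            phrases.add(wi + "_" + wj)
    return phrases

toks = "new york is big new york is old new york has parks".split()
collect_phrases(toks, delta=2, theta=0.1)  # → {"new_york"}
```

Repeating such passes over a corpus in which the identified phrases have been merged into single tokens lets longer phrases emerge, as the surrounding text describes.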


Here, wi and wj are two consecutive tokens, freq(·) the frequency of a given word or phrase in the given monolingual corpus, and δ a discounting coefficient that prevents the retrieval of many phrases composed from infrequent words. Each bigram wi wj in the monolingual corpus is scored with this formula, and only the bigrams with a score above a predefined threshold θ are regarded as phrases. All the identified phrases are transformed into one token,[3] and a new pass is performed over the monolingual corpus to obtain new phrases, also using the phrases identified in the previous passes. To further limit the number of collected phrases, we consider only phrases containing words that appear at least K times in the monolingual data. After T passes, we compile a set of phrases with (a) all the single words and (b) all the phrases with a length of up to L tokens identified during each pass.

[3] This transformation is performed by simply replacing the space between the two tokens with an underscore.

Standard SMT systems for close languages directly output OOV tokens in the translation. To be as good as such systems, our approach must be able to retrieve the right translation, especially for the many domain-specific words and phrases that are identical in both source and target languages. To ensure that a source phrase that must remain untranslated has its identity in the target phrase set, we explicitly add to the target phrase set all the source phrases that also appear in the target monolingual data.

3.2 Feature engineering

Given two sets of phrases, for the source and target languages, respectively, we regard all possible combinations of source and target phrases as candidate phrase pairs. This naive coupling imperatively generates a large number of pairs that are mostly noise. Thus, the challenge here is to effectively estimate the reliability of each pair. This section describes several features to characterize each phrase pair; they are used for evaluating phrase pairs and also added in the induced phrase table to guide the decoder.

3.2.1 Cross-lingual semantic similarity

Many researchers have tackled the problem of estimating cross-lingual semantic similarity between pairs of words or phrases by using their embeddings (Mikolov et al., 2013a; Chandar et al., 2014; Faruqui and Dyer, 2014; Coulmance et al., 2015; Gouws et al., 2015; Duong et al., 2016) in combination with either a seed bilingual lexicon or a set of parallel sentence pairs.

We estimate monolingual phrase embeddings via the element-wise addition of the word embeddings composing the phrase. This method performs well to estimate phrase embeddings (Mitchell and Lapata, 2010; Mikolov et al., 2013a), despite its simplicity and relatively low computational cost compared to state-of-the-art methods based on neural networks (Socher et al., 2013a; Socher et al., 2013b) or rich features (Lazaridou et al., 2015). This low computational cost is crucial in our case, as we need to evaluate a large number of candidate phrase pairs.

In order to make source and target phrase embeddings comparable, we perform a linear projection (Mikolov et al., 2013a) of the embeddings of source phrases to the target embedding space. To learn the projection, we use the method of Mikolov et al. (2013a), with the only exception that we deal with not only words but also phrases. Given training data, i.e., a gold bilingual lexicon, we obtain a translation matrix Ŵ by solving the following optimization problem with stochastic gradient descent:

Ŵ = argmin_W Σ_i ||W xi − zi||²

where xi is the source phrase embedding of the i-th training example, zi the target phrase embedding of the corresponding gold translation, and W the translation matrix used to project xi such that W xi is as close as possible to zi in the target embedding space. One important parameter here is the number of dimensions of word/phrase embeddings. This can be different for the source and target embeddings, but must be smaller than the number of phrase pairs in the training data; otherwise the equation is not solvable. See Section 4.1 for the details about the bilingual lexicon used in our experiment.

Given a phrase pair to evaluate, the source phrase embedding is projected to the target embedding space using Ŵ. Then, we compute the cosine similarity between the projected source phrase embedding and the target phrase embedding to evaluate the semantic similarity between these phrases; this seems to give satisfying results in this cross-lingual scenario, as shown by Mikolov et al. (2013a).
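The projection-and-cosine scoring can be sketched as follows; this is a minimal illustration with toy 2-dimensional embeddings, and it solves the least-squares problem in closed form for brevity, whereas the paper uses stochastic gradient descent on 800- and 300-dimensional embeddings:

```python
import numpy as np

def learn_translation_matrix(X, Z):
    """W minimizing sum_i ||W x_i - z_i||^2, with the gold source embeddings
    x_i stacked as rows of X and the gold target embeddings z_i as rows of Z.
    Solved here as the linear least-squares system X @ W.T ~= Z."""
    Wt, _, _, _ = np.linalg.lstsq(X, Z, rcond=None)
    return Wt.T

def cross_lingual_similarity(W, x, z):
    """Cosine similarity between the projected source embedding Wx and z."""
    p = W @ x
    return float(p @ z / (np.linalg.norm(p) * np.linalg.norm(z)))

# Toy gold lexicon: three source/target embedding pairs related by W_true.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W_true = np.array([[2.0, 0.0], [1.0, 1.0]])
Z = X @ W_true.T
W = learn_translation_matrix(X, Z)
sim = cross_lingual_similarity(W, np.array([1.0, 0.0]), np.array([2.0, 1.0]))
```

On this consistent toy system the learned matrix recovers W_true exactly, so the similarity between a projected source embedding and its gold translation is 1.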


A translation matrix is trained for each translation direction, f→e and e→f, respectively, so that we have two cross-lingual semantic similarity features for each phrase pair.

3.2.2 Lexical translation probabilities

We assume the existence of a large amount of general-domain parallel data, and train a regular translation model with lexical translation probabilities in an ordinary way. Although in-domain phrases are likely to contain tokens that are unseen in the general-domain parallel data, lexical translation probabilities may be useful to score candidate pairs of source and target phrases that contain tokens seen in the general-domain parallel data. To compute a phrase-level score, for a target phrase e given a source phrase f, we consider all possible word alignments as follows:

Plex(e|f) = (1/I) Σ_{i=1}^{I} log( (1/J) Σ_{j=1}^{J} p(ei|fj) )

where I and J are the lengths of e and f, respectively, and p(ei|fj) the lexical translation probability of the i-th target word ei of e given the j-th source word fj of f. Such phrase-level lexical translation probabilities are computed for both translation directions, giving us two features.

3.2.3 Other features

As demonstrated by previous work (Irvine and Callison-Burch, 2014; Irvine and Callison-Burch, 2016), features based on the frequency of the phrases in the monolingual data may help us to better score a phrase pair. We add as features the inverse frequency of the source and target phrases in the in-domain monolingual data, along with their relative difference given by the following formula:

simf(e, f) = | log(freq(e)/Ne) − log(freq(f)/Nf) |

where Nx stands for the number of tokens in the in-domain monolingual data of the corresponding language.

The surface-level similarity of source and target phrases can also be a strong clue when considering the translation between two languages that are relatively close. We investigate two features concerning this: the first feature is the Levenshtein distance between the two phrases calculated regarding words as units,[4] while the other is a binary feature that fires if the two phrases are identical. We expect both features to be very useful in cases where many domain-specific words and phrases are written in the same way in the two languages; for instance, drug and molecule names in the medical domain in French and English.

[4] Here we did not use the character-level edit distance to measure the orthographic similarity between phrases. Even though such a feature may be useful (Koehn and Knight, 2002), its computational cost is too high to deal efficiently with billions of phrase pairs.

We also add as features the lengths of the source and target phrases, i.e., I and J, and their ratio. Using all the above 12 features, the overall score for each pair is given by a classifier as described in Section 3.3; this score is also added as a feature in the induced phrase table for decoding.

3.3 Phrase pair filtering

As mentioned above, phrase pairs so far generated are mostly noise. To reduce the decoder's search space when using our induced phrase table, we radically filter out inappropriate pairs. Each candidate phrase pair is assessed by the method proposed in Irvine and Callison-Burch (2013), which predicts whether a pair of words are translations of one another using a classifier. As training examples, we use a bilingual lexicon as positive examples and randomly associated phrase pairs from our phrase sets as negative examples. For classification, we use all the features presented in Section 3.2. We use the score given by the classifier to rank the target phrases for each source phrase. Only the target phrases with the top n scores are kept in the final induced phrase table.

4 Experiments

This section demonstrates the impact of the induced phrase tables in translating in-domain texts in three configurations. In the first configuration (Conf. 1), we evaluated whether our induced phrase table improves the translation of in-domain texts over the vanilla SMT system, which used only one phrase table trained from general-domain parallel data.


We then evaluated, in the second configuration (Conf. 2), whether our induced phrase table is also beneficial when used in an SMT system that already incorporates an in-domain bilingual lexicon that could be created manually or induced by some of the methods mentioned in Section 2. Finally, we evaluated in complementary experiments (Conf. 3) whether our induced phrase table can also offer useful information to improve translation quality even when used in combination with another standard phrase table generated from in-domain parallel data.

4.1 Data

Since our approach assumes the availability of large-scale general-domain parallel and monolingual corpora, we considered the French–English language pair and both translation directions for our experiments. The French–English version of the Europarl parallel corpus[5] was regarded as a general-domain, and not strictly out-of-domain, corpus because many debates can be associated with a specific domain and can contain phrases specific to particular domains. As general-domain monolingual data, we used the concatenation of one side of Europarl and the 2007–2014 editions of the News Crawl corpora[6] in the same language.

We focused on two domains: medical (EMEA) and science (Science). For both domains, we used the development and test sets provided for a workshop on domain adaptation of MT (Carpuat et al., 2012).[7] We also used the provided in-domain parallel data for training, but regarded only the target side as monolingual data. Since our primary objective is the induction of a phrase table without using in-domain parallel data, the source side of the in-domain parallel data was not used as a part of the source in-domain monolingual data, except when training an ordinary in-domain phrase table in Conf. 3. As medical domain monolingual data for the EMEA translation task, we used the French and English monolingual medical data provided for the WMT'14 medical translation task.[8] None of the parallel corpora provided for the WMT'14 medical translation task was used. As science domain monolingual data for the Science translation task, we used the English side of the ASPEC parallel corpus (Nakazawa et al., 2016).[9] Unfortunately, we did not find any French monolingual corpora publicly available for the Science domain that were sufficiently large for our experiments. Statistics on the data we used are presented in Table 1.

[5] http://statmt.org/europarl/, release 7
[6] http://statmt.org/wmt15/translation-task.html
[7] http://hal3.name/damt/
[8] http://www.statmt.org/wmt14/medical-task/
[9] http://orchid.kuee.kyoto-u.ac.jp/ASPEC/

Table 1: Statistics on train, development, and test data.

Domain    Data          #sent.   #tok. (En–Fr)
EMEA      development   2,022    28k–32k
          test          2,045    25k–29k
          parallel      472k     6M–7M
          monolingual            275M–255M
Science   development   1,990    52k–65k
          test          1,982    52k–65k
          parallel      66k      2M–2M
          monolingual            82M–2M
General   parallel      2M       54M–60M
          monolingual            2.8B–1.1B

To induce the phrase tables from the monolingual data, we compared two bilingual lexicons: a general-domain and an in-domain lexicon. These lexicons are used to train the translation matrices (see Section 3.2.1) and to train the classifier (see Section 3.3). The general-domain lexicon (henceforth, gen-lex) is a phrase-based one extracted from the phrase table built on the general-domain parallel data (see Section 4.3). We extracted the 5,000 most frequent source phrases and their most probable translation according to the forward translation probability, p(e|f). We adopted this size as it had been proven optimal to learn the mapping between two monolingual embedding spaces (Vulić and Korhonen, 2016). For some experiments, we also simulated the availability of an in-domain bilingual lexicon. We automatically generated a lexicon for each domain (henceforth, in-lex) using the entire in-domain parallel data, in the same manner as compiling gen-lex, except that we selected the 5,000 most frequent source words in the in-domain parallel data that were not in the 5,000 most frequent words in the general-domain parallel data in order


to ensure that we obtained mostly in-domain word pairs. Note that we did not use phrases but words for in-lex, assuming that humans are not able to manually construct a lexicon comprising phrase pairs similar to those in phrase tables for SMT systems. For Conf. 3, as we assume the availability of in-domain parallel data, the bilingual lexicon (para-lex) used was 5,000 phrase pairs extracted from the in-domain phrase table, excluding the source phrases of gen-lex.

Table 2: Corpora used for extracting phrases and computing word embeddings: w2p indicates word2phrase, while w2v indicates word2vec. (√) denotes that the data are used in Conf. 3 only.

Side     Domain      Data          w2p    w2v
source   general     monolingual          √
         in-domain   monolingual   √      √
                     parallel      (√)    (√)
                     development   √
                     test          √
target   general     monolingual          √
         in-domain   monolingual   √      √
                     parallel      √      √

4.2 Tools and parameters

A summary of the data used to collect phrases and estimate word embeddings is presented in Table 2. For each pair of domain and translation direction, sets of source and target phrases were extracted from the in-domain monolingual data, as described in Section 3.1. As in previous work (Irvine and Callison-Burch, 2014; Saluja et al., 2014; Zhao et al., 2015), we focus on source phrases appearing in the development and test sets in order to maximize the coverage of our induced phrase table for them.[10] More precisely, source phrases were collected from the concatenation of the development and test sets and the in-domain monolingual data with reliable statistics, and then only the phrases appearing in the development and test sets were retained.[11] We removed phrases containing tokens unseen in the in-domain monolingual data, because we are unable to compute all our features for them. On the other hand, target phrases were collected from the in-domain monolingual data, including the target side of the in-domain parallel data.

[10] We are aware that this may not be practical because it requires the knowledge of the development and test sets beforehand. For instance, for the Fr→En EMEA translation task, inducing a phrase table given all the 4.5M collected source phrases would require approximately 3 months using 100 CPU threads. Increasing the value of K to collect fewer source phrases can be a reasonable alternative to significantly decrease this computation time, even though it will also necessarily decrease the coverage of the phrase table. We leave for our future work the study of a phrase table induction with source phrases extracted from source monolingual data without referring to the development and test sets.

[11] As we had no French monolingual corpus for the Science domain, the development and test sets for the Science Fr→En task were concatenated with one million sentences randomly extracted from the general-domain monolingual data.

Table 3: Size of the phrase sets collected from the source and target in-domain monolingual data, and the number of phrases appearing only in the concatenation of the source side of the development and test sets (dev+test). "#phrase pairs" denotes the number of phrase pairs assessed by the classifier.

Task             source (all)   source (dev+test)   target   #phrase pairs
EMEA Fr→En       4.5M           20k                 437k     8.7B
EMEA En→Fr       5.1M           11k                 469k     5.2B
Science Fr→En    1.1M           28k                 216k     6.0B
Science En→Fr    2.3M           24k                 18k      432M

To identify phrases, we used the word2phrase tool included in the word2vec package,[12] with the default values for δ and θ. We set K = 1 for the source language to ensure that most of the tokens would be translated, and K = 25 for the target language to limit the number of resulting phrases. We set L = 6, as this is the same maximal phrase length that we set for the phrase tables trained from the parallel data. We stopped at T = 4 passes, as the fifth pass retrieved only a very small number of new phrases compared to the fourth pass. Statistics of the collected phrases for each task are presented in Table 3.

[12] https://code.google.com/archive/p/word2vec/

To train the word embeddings, we used word2vec with the following parameters: -cbow 1 -window 10 -negative 15 -sample 1e-4 -iter 15 -min-count 1. Mikolov et al. (2013a) observed that better results for cross-lingual semantic similarity were obtained when using word embeddings with higher dimensions


on the source side than on the target side. We therefore chose 800 and 300 dimensions for the source and target embeddings, respectively. The embeddings were trained on the concatenation of all the general-domain and in-domain monolingual data, as presented in Table 2. Consequently, for each pair of domain and translation direction, we have four word embedding spaces: those with 300 or 800 dimensions for the source and target languages.

The reliability of each phrase pair was estimated as described in Section 3.3 to compile phrase tables of reasonable size and quality. We used Vowpal Wabbit[13] to perform logistic regression with one pass, default parameters, and the --link logistic option to obtain a classification score for each phrase pair. In the final induced phrase table, we kept the 300 best target phrases[14] for each source phrase according to this score.

[13] https://github.com/JohnLangford/vowpal_wabbit/
[14] As in Irvine and Callison-Burch (2014), we obtained better results when favoring recall over precision. We chose 300 empirically, since we did not observe any further improvements when keeping more target phrases.

Table 4: Source of our three language models.

Data                                      LM1   LM2   LM3
Target side of in-domain parallel data    √     √     √
In-domain monolingual data                      √     √
General-domain monolingual data                       √

4.3 SMT systems

The Moses toolkit (Koehn et al., 2007)[15] was used for training SMT models, parameter tuning, and decoding. The phrase tables were trained on the parallel corpus using SyMGIZA++ (Junczys-Dowmunt and Szał, 2012)[16] with IBM-2 word alignment and the grow-diag-final-and heuristics. To obtain strong baseline systems, all SMT systems used three language models[17] built on different sets of corpora, as shown in Table 4; each language model is a 4-gram modified Kneser-Ney smoothed one trained using lmplz (Heafield et al., 2013).[18] To concentrate on the translation model, we did not use the lexical reordering model throughout the experiments, while we enabled distance-based reordering up to six words.

[15] http://statmt.org/moses/, version 2.1.1
[16] https://github.com/emjotde/symgiza-pp/
[17] The one exception is the system for the Science En→Fr task, which uses only two language models, as we do not have any in-domain monolingual data in addition to the target side of the in-domain parallel data.
[18] https://kheafield.com/code/kenlm/estimation

Table 5: Multiple phrase table configurations.

Phrase table                                              Conf. 1   Conf. 2   Conf. 3
Phrase table trained from general-domain parallel data    √         √         √
Phrase table trained from in-domain parallel data                             √
In-domain bilingual lexicon                                         √
Phrase table induced from in-domain monolingual data      √         √         √

Our systems used the multiple decoding paths ability of Moses; we used up to three phrase tables in one system, as summarized in Table 5. We did not add the features presented in Section 3.2 to the phrase pairs directly derived from the parallel data.[19] Weights of the features were optimized with kb-mira (Cherry and Foster, 2012) using 200-best hypotheses on 15 iterations. The translation outputs were evaluated with BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014). The results were averaged over three tuning runs. The statistical significance was measured by approximate randomization (Clark et al., 2011) using MultEval.[20]

[19] As in Irvine and Callison-Burch (2014), we got a drop of up to 0.5 BLEU points when we added our features, derived from monolingual data, to the original phrase table.
[20] https://github.com/jhclark/multeval/

4.4 Additional baseline systems

To compare our work with a state-of-the-art phrase table induction method, we implemented the work of Zhao et al. (2015). Even though they did not propose their method to perform domain adaptation of an SMT system, their work is the closest to ours and does not require other external resources than those we used, i.e., parallel data and monolingual data not necessarily comparable. We implemented both the global (GLP) and local (LLP) linear projection strategies and collected source and target phrases as they did. The source phrase set contains all unigrams and bigrams in the development and test sets, while the target phrase set contains unigrams and bigrams collected from the in-domain monolingual data. They did not mention any filtering of their phrase sets, but we chose to remove all phrases containing digits or punctuation marks, since trying to retrieve the translation of numbers or punctuation marks relying only on word embeddings seems inappropriate and in fact produced worse results in our preliminary experiments. To highlight the impact of the phrase sets used, we also experimented with LLP using our phrase sets collected with word2phrase. Furthermore, to get the best possible results, we did not use the search approximations presented in Zhao et al. (2015), i.e., locality sensitive hashing and redundant bit vectors, and used linear search instead.

For the GLP configuration, the translation matrix was trained on gen-lex, i.e., 5,000 phrase pairs extracted from the general-domain phrase table trained on parallel data. For the LLP configurations, as in Zhao et al. (2015), we trained the translation matrix for each source phrase on the 500 most similar source phrases, retrieved from the general-domain phrase table, associated with their most probable translation. For both the GLP and LLP configurations, we kept the 300 best target phrases for each source phrase. Four features, phrase and lexical translation probabilities for both translation directions, were approximated using the similarity between source and target phrase embeddings for each phrase pair and included in the induced phrase table, as described by Zhao et al. (2015).

Since this approach proposes to translate all OOV unigrams and bigrams, it is likely in our scenario that some medical terms, for example, will have no correct translations in the induced phrase table. For a comparison, we added one more baseline system, which merely uses a vanilla Moses with the -du option of Moses activated to drop all unknown words instead of copying them into the translation.

4.5 Results

The experimental results are given in Table 6. In Conf. 1, our results show that both the GLP and LLP configurations performed much worse than the vanilla Moses when using phrases naively collected. This is due to the fact that the induced phrase table contains translations for every OOV unigram and bigram, even for those that do not need to be translated, such as molecule names or place names. Word embeddings are well-known to be inaccurate for very infrequent words (Mikolov et al., 2013b); as a consequence, for some rare source phrases, even if the right translation is in the target phrase set, it is not guaranteed that it will be registered in the induced phrase table as one of the 300 best translations for the source phrase, relying only on word embeddings. The significant improvements over a vanilla Moses observed by Zhao et al. (2015) would potentially be because they translated from Arabic, and Urdu, to English. For such language pairs, one can safely try to translate every OOV token of a general-domain text, and it is unlikely to do worse than a vanilla Moses system that will leave the OOV tokens as is in the translation. As shown by the Moses-du configurations, dropping them led to a drop of up to 4.2 BLEU points for the EMEA Fr→En translation task. This suggests that OOV tokens must be carefully translated only when necessary. Many OOV tokens in our translation tasks do not need to be translated into different forms. Hence, we regard the vanilla Moses that copies the OOV tokens into the translation as a strong baseline system.

Interestingly, using the phrases collected by our method for LLP produced much better translations, even slightly better than the one produced by the vanilla Moses system for the EMEA En→Fr translation task, with an improvement of 0.2 BLEU points. This may be due to the fact that our source phrase set is not only made from OOV phrases, meaning that new useful translations may be proposed for source phrases that are already registered in the general-domain phrase table. Moreover, with our phrase sets, the decoder also has the possibility to leave some to
kensuntranslatedsinceweaddedeachsourcephraseinthetargetphrasesetifitappearedinthetargetmonolingualdata.Insteadofrelyingonlyonwordembeddings,thefeaturesusedinourapproachhelpedsignificantlytoimprovethetranslationquality.WhenweaddedourinducedphrasetabletoavanillaMosessystem,weobservedconsistentandsignificantimprovementsintranslationquality,withupto2.1BLEUand2.2METEORpointsofimprovementfortheScienceEn→Frtranslationtask.ComparedtotheLLPmethodproposedbyZhao

Configuration                              EMEA Fr→En     EMEA En→Fr     Science Fr→En  Science En→Fr
                                           BLEU   METEOR  BLEU   METEOR  BLEU   METEOR  BLEU   METEOR
vanilla Moses du                           24.2   27.4    21.7   40.0    22.3   29.1    20.4   42.7
vanilla Moses (Conf. 1)                    28.4   30.1    25.4   44.8    24.1   30.4    22.7   45.1
 + GLP IPT naive                           24.3   27.0    22.3   41.0    22.4   29.0    20.6   42.6
 + LLP IPT naive                           24.7   27.4    22.0   40.4    22.5   29.3    21.1   43.4
 + LLP IPT                                 27.9   29.6    25.6   45.1    22.7   29.3    21.3   43.5
 + our IPT (gen-lex)                       30.2   32.1    27.1   46.6    25.4   32.0    24.8   47.3
 + in-domain bilingual lexicon (Conf. 2)   32.4   32.8    28.3   48.2    26.6   32.4    24.9   48.0
 + our IPT (gen-lex)                       33.5   32.6    28.8   48.6    28.5   33.8    25.2   48.4
 + our IPT (in-lex)                        33.8   32.9    29.2   48.9    26.9   32.7    25.9   49.0
 + in-domain phrase table (Conf. 3)        39.1   36.1    33.8   53.1    32.1   36.1    31.0   53.9
 + our IPT (para-lex)                      39.1   36.1    34.0   53.2    32.1   36.1    31.2   54.1

Table 6: Results (BLEU and METEOR) with an induced phrase table (IPT). The Moses du and vanilla Moses systems use only one phrase table trained from the general-domain parallel data. The translation matrices and the classifiers have been trained with a bilingual lexicon: gen-lex, in-lex, or para-lex. The configurations denoted as "naive" use a phrase table induced from phrases collected as described in Section 4.4. Bold scores indicate the statistical significance (p < 0.01) of the gain over the baseline system (Conf. X) in each configuration.

et al. (2015), our approach includes more features and an additional classification step. Thus, the induction of a phrase table is much slower. For instance, for the EMEA Fr→En translation task, using the phrase sets extracted with word2phrase, our induction method (excluding phrase collection) was nearly 14 times slower (9 hours vs. 38 minutes).21 Phrase collection using word2phrase was much faster than feature computation and phrase pair classification. For instance, it took 72 minutes to collect target phrases for the EMEA Fr→En translation task, using four iterations of word2phrase on the English in-domain monolingual data with 1 CPU thread.

In Conf. 2, adding an in-domain bilingual lexicon as a phrase table to the vanilla baseline system significantly boosted the performance, mainly by reducing the number of OOV tokens. Our induced phrase tables had less impact, probably due to the overlap between useful word pairs contained in both the induced phrase table and the added bilingual lexicon. However, we still observed significant improvements, which support the usefulness of the induced phrase table, with up to 1.4 and 1.0 BLEU points of improvement, respectively, for the EMEA Fr→En and Science En→Fr translation tasks, for instance. In this configuration, the in-lex phrase table led to slight but consistent improvements. It helped more than the gen-lex phrase table, except in the Science Fr→En task, for which the use of the gen-lex phrase table yielded significantly better results than the use of the in-lex phrase table. We can expect such differences when the classifier and the translation matrices are trained using infrequent words. Embeddings for such words are typically not as well estimated as those for frequent words, meaning that the features based on the word embeddings are less reliable and thus mislead both the classifier for pruning and the decoder.

In Conf. 3, where the baseline system even used a phrase table trained on in-domain parallel data, we obtained contrasted results, with only slight improvements for the En→Fr translation direction and no improvements for the Fr→En translation direction. This lack of improvement may be due to the more reliable features and more accurate phrase pairs contained in the phrase table directly learned from the parallel data. This may lead the decoder to prefer this table to the induced one and give higher weights to its features according to this preference during tuning.

21 The experiments were performed with 20 CPU threads. Note also that computational speed was not our primary focus when implementing our approach. Optimizing our implementation may lead to significant gains in speed, while Zhao et al. (2015) have presented a search approximation able to make their approach 18 times faster than linear search.
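Both GLP and LLP rest on the same mechanism: a translation matrix W is fitted on a seed lexicon of embedding pairs, a source-phrase embedding is projected into the target space, and candidate target phrases are ranked by cosine similarity. The following is a minimal sketch of this mechanism, not the paper's or Zhao et al.'s actual implementation: it assumes toy NumPy embedding matrices, and uses a least-squares fit in place of the original training procedure (all function names and dimensions are illustrative):

```python
import numpy as np

def train_translation_matrix(src_seed, tgt_seed):
    """Least-squares fit of W such that src_seed @ W approximates tgt_seed.

    src_seed, tgt_seed: (n_pairs, d_src) and (n_pairs, d_tgt) embedding
    matrices of a bilingual seed lexicon (e.g., the 5,000 gen-lex pairs).
    """
    W, *_ = np.linalg.lstsq(src_seed, tgt_seed, rcond=None)
    return W

def top_k_translations(src_vec, W, tgt_matrix, k=300):
    """Indices of the k target phrases most cosine-similar to the
    projection of src_vec into the target embedding space."""
    proj = src_vec @ W
    proj = proj / (np.linalg.norm(proj) + 1e-12)
    # normalize each target-phrase embedding so the dot product is a cosine
    tgt_norm = tgt_matrix / (np.linalg.norm(tgt_matrix, axis=1, keepdims=True) + 1e-12)
    sims = tgt_norm @ proj
    return np.argsort(-sims)[:k]
```

In the LLP variant, W would instead be refitted for each source phrase on its 500 nearest seed entries; the cosine scores also stand in for the four translation-probability features of the induced table.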
             EMEA Fr→En    EMEA En→Fr    Science Fr→En  Science En→Fr
             w/o    w/     w/o    w/     w/o    w/      w/o    w/
correct      53.1   55.1   52.8   54.2   54.0   57.1    55.3   57.2
SEEN         6.6    3.9    7.8    6.1    5.9    2.3     8.5    2.2
SENSE        18.1   13.3   15.6   11.5   18.9   14.0    12.9   12.9
SCORE        22.2   27.7   23.8   28.2   21.2   26.6    23.3   27.7

Table 7: Percentage of the source tokens: comparison of the translations generated with (w/) or without (w/o) our gen-lex induced phrase table (Conf. 1).

5 Error analysis

In Section 5.1, we first present an analysis of the distribution of translation errors that our systems produced, using the S4 taxonomy (Irvine et al., 2013). Then, in Section 5.2, we illustrate some translation examples for which our induced phrase tables have produced a better translation.

5.1 Analysis with the S4 taxonomy

The S4 taxonomy comprises the following four error types:

• SEEN: attempt to translate a word never seen before
• SENSE: attempt to translate a word with the wrong sense
• SCORE: a good translation for the word is available but another one, giving a better score to the hypothesis, is chosen by the system
• SEARCH: a good translation is available for the word but is pruned during the search for the best hypothesis

We considered the SEEN, SENSE, and SCORE errors as in Irvine et al. (2013), but not the SEARCH errors, assuming that recent phrase-based SMT systems rarely make this type of error and that such errors have little impact on the translation quality (Wisniewski et al., 2010; Aziz et al., 2014). We performed a Word Alignment Driven Evaluation (WADE) (Irvine et al., 2013) to count the word-level errors.

Table 7 compares the results with and without our gen-lex induced phrase tables (Conf. 1). For the four tasks, more than half of the source tokens were correctly translated according to the translation reference. Our analysis reveals that our induced phrase table helps to obtain more correct translations, as higher percentages of source words were correctly translated, despite the significant increase of SCORE errors (around 5% for all the tasks). This means that the correct translation for the source word is available, but the features associated with this translation were not informative enough for the decoder to choose it. The percentage of SEEN errors in the translations decreased significantly with the induced phrase table for all the tasks, as a result of many words and phrases unseen in the general-domain parallel data being covered by using the in-domain monolingual data. However, our method does not guarantee to find appropriate translations for these words. It is even possible that all the proposed translations are inappropriate. Nonetheless, we can see a noticeable decrease of the SENSE errors, except in the Science En→Fr task, for which we have used only a small amount of in-domain French monolingual data. As reported in Table 3, fewer target phrases were collected for this task, leading to only a small chance of obtaining the right translation for a given source phrase. The percentage of SENSE errors still remains higher than 10% for all tasks, indicating that the correct translation is not available in our phrase set or is pruned by the classifier during the phrase table induction.

From this analysis, we draw the conclusion that our approach has significantly increased the reachability of the translation reference along with the quality of the translation produced by the decoder. We expect that more informative or better estimated features can further improve our results. Improving our method to collect the target phrases or using a larger in-domain monolingual corpus would also help to reduce SENSE errors.

5.2 Translation examples

Table 8 presents examples of source phrases and their translations chosen by the decoder in the EMEA Fr→En translation task. As shown by Example #1, both the LLP and gen-lex configurations can find a good translation in their induced phrase table for the phrase "au point d'injection", while the general-domain phrase table does not contain this source phrase. As a result, the vanilla Moses system produced a wrong translation using general-domain word translations.
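The per-category percentages reported in Table 7 come from a WADE-style analysis, which assigns each reference-aligned source token to one category. A simplified sketch of such S4 bucketing follows, under strong simplifying assumptions: word-level alignments only, and a phrase table reduced to a word-to-candidates dictionary. All data structures and names here are illustrative, not those of the paper's implementation:

```python
def classify_token(src_word, ref_word, hyp_words, phrase_table, seen_vocab):
    """Assign one S4-style category to a source token (SEARCH omitted,
    as in the paper's analysis).

    src_word:     source token under inspection
    ref_word:     its reference translation (from the word alignment)
    hyp_words:    set of tokens in the system hypothesis
    phrase_table: dict mapping source words to sets of candidate translations
    seen_vocab:   source-side vocabulary of the training parallel data
    """
    if ref_word in hyp_words:
        return "correct"
    if src_word not in seen_vocab:
        return "SEEN"    # the model never saw this source word
    if ref_word not in phrase_table.get(src_word, set()):
        return "SENSE"   # no candidate carries the right sense
    return "SCORE"       # the right candidate existed but was outscored

def wade_counts(tokens, phrase_table, seen_vocab):
    """Percentage of source tokens per category, as in Table 7."""
    counts = {}
    for src, ref, hyp in tokens:
        cat = classify_token(src, ref, hyp, phrase_table, seen_vocab)
        counts[cat] = counts.get(cat, 0) + 1
    total = sum(counts.values())
    return {cat: 100.0 * c / total for cat, c in counts.items()}
```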
System             #1                         #2              #3                            #4
source             au point d'injection       glaucome aigu   contient du lactose monohydraté   le lansoprazole n'est pas
vanilla Moses      at injection               acute glaucome  monohydraté contains lactose      the lansoprazole is not
LLP IPT            at the point of injection  acute           contains lactose                  the , is not
our IPT (gen-lex)  at the site of injection   acute glaucoma  contains lactose monohydrate      the lansoprazole is not
reference          at the injection site      acute glaucoma  contains lactose monohydrate      lansoprazole is not

Table 8: Examples of source phrases and their translations, from the test set of the EMEA Fr→En translation task, produced by the decoder using different configurations: vanilla Moses (Conf. 1) and Moses using a phrase table induced with LLP or with our method (gen-lex).

Example #2 shows a typical error made by the LLP configuration. In this example, "glaucome" is OOV; no translation is proposed for this token in the general-domain phrase table. The LLP IPT contains the source phrase "glaucome aigu", but none of the 300 best corresponding target phrases contains the token "glaucoma". However, most of them contain the meaning of "acute". This can be explained by the much higher frequency of "aigu", while the word "glaucome" is very rare, even in the in-domain monolingual data. Consequently, "aigu" has a more accurate embedding than that of "glaucome", which is then much more difficult to project correctly across languages. In contrast, our gen-lex IPT contains the translation reference for "glaucome aigu", and this translation has been used correctly by the decoder, guided by our feature set.

Example #3 is similar to Example #2: the embedding of the rare word "monohydraté" is probably not accurate enough to be correctly projected, and the correct translation is not in the LLP IPT, while our approach succeeded in translating it correctly.

Finally, Example #4 presents another common situation where an OOV token, here "lansoprazole", has to be preserved as is, and it is correctly reported in the translation by the vanilla Moses system. The LLP IPT proposes translations for "lansoprazole", most of them semantically unrelated, like the one chosen by the decoder in this configuration. We assume that the surface-level similarity features of our method helped the decoder to identify the right translation in this situation. Nonetheless, even when using our gen-lex IPT, we still observed some situations where tokens that should be preserved were actually wrongly translated, producing outputs worse than those produced by the vanilla Moses system.

6 Conclusion and future work

We presented a framework to induce a phrase table from unaligned monolingual data of specific domains. We showed that such a phrase table, when integrated into the decoder, consistently and significantly improved the translation quality for texts in the targeted domain. Our approach uses only simple features without requiring strongly comparable or annotated texts in the targeted domain.

Our method could further be improved in several ways. First, we expect better improvements by using more in-domain monolingual data or by being more careful in collecting the target phrases to use for the phrase table induction, as opposed to simply pruning them according to the word frequency. Moreover, as we saw in Section 5, scoring the phrase pairs is one of the most important issues. We need more informative features to better score the pairs of source and target phrases. Despite their high computational cost, including features based on orthographic similarity or using better estimated cross-lingual embeddings may help for this purpose.

Acknowledgments

We would like to thank the anonymous reviewers and the action editor, Chris Quirk, for their insightful comments.

References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain Adaptation via Pseudo In-Domain Data Selection. In Proceedings of EMNLP, Edinburgh, Scotland, UK.

Wilker Aziz, Marc Dymetman, and Lucia Specia. 2014. Exact Decoding for Phrase-Based Statistical Machine Translation. In Proceedings of EMNLP, Doha, Qatar.
Marine Carpuat, Hal Daumé III, Alexander Fraser, Chris Quirk, Fabienne Braune, Ann Clifton, et al. 2012. Domain adaptation in machine translation: Final report. In 2012 Johns Hopkins summer workshop final report. Baltimore, MD: Johns Hopkins University.

A. P. Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh M. Khapra, Balaraman Ravindran, Vikas Raykar, and Amrita Saha. 2014. An Autoencoder Approach to Learning Bilingual Word Representations. In Proceedings of NIPS, Montréal, Canada.

Colin Cherry and George Foster. 2012. Batch Tuning Strategies for Statistical Machine Translation. In Proceedings of NAACL-HLT, Montréal, Canada.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability. In Proceedings of ACL-HLT, Portland, OR, USA.

Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. 2015. Trans-gram, Fast Cross-lingual Word-embeddings. In Proceedings of EMNLP, Lisbon, Portugal.

Hal Daumé, III and Jagadeesh Jagarlamudi. 2011. Domain Adaptation for Machine Translation by Mining Unseen Words. In Proceedings of ACL-HLT, Portland, OR, USA.

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of EACL, Gothenburg, Sweden.

Qing Dou and Kevin Knight. 2012. Large Scale Decipherment for Out-of-domain Machine Translation. In Proceedings of EMNLP-CoNLL, Jeju Island, Korea.

Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2016. Learning Crosslingual Word Embeddings without Bilingual Corpora. In Proceedings of EMNLP, Austin, TX, USA.

Manaal Faruqui and Chris Dyer. 2014. Improving Vector Space Word Representations Using Multilingual Correlation. In Proceedings of EACL, Gothenburg, Sweden.

Pascale Fung and Percy Cheung. 2004. Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and EM. In Proceedings of EMNLP, Barcelona, Spain.

Pascale Fung. 1995. Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus. In Proceedings of the 3rd Workshop on Very Large Corpora, Cambridge, MA, USA.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. In Proceedings of ICML, Lille, France.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning Bilingual Lexicons from Monolingual Corpora. In Proceedings of ACL-HLT, Columbus, OH, USA.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable Modified Kneser-Ney Language Model Estimation. In Proceedings of ACL, Sofia, Bulgaria.

Sanjika Hewavitharana and Stephan Vogel. 2016. Extracting parallel phrases from comparable data for machine translation. Natural Language Engineering, 22(4):549–573.

Ann Irvine and Chris Callison-Burch. 2013. Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals. In Proceedings of HLT-NAACL, Atlanta, GA, USA.

Ann Irvine and Chris Callison-Burch. 2014. Hallucinating Phrase Translations for Low Resource MT. In Proceedings of CoNLL, Baltimore, MD, USA.

Ann Irvine and Chris Callison-Burch. 2016. End-to-End Statistical Machine Translation with Zero or Small Parallel Texts. Natural Language Engineering, 22(4):517–548.

Ann Irvine, John Morgan, Marine Carpuat, Hal Daumé III, and Dragos Munteanu. 2013. Measuring Machine Translation Errors in New Domains. Transactions of the Association for Computational Linguistics, 1.

Marcin Junczys-Dowmunt and Arkadiusz Szał. 2012. SyMGiza++: Symmetrized Word Alignment Models for Machine Translation. In Security and Intelligent Information Systems (SIIS), volume 7053 of Lecture Notes in Computer Science. Springer-Verlag, Berlin/Heidelberg, Germany.

Alex Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. 2012. Toward Statistical Machine Translation without Parallel Corpora. In Proceedings of EACL, Avignon, France.

Philipp Koehn and Kevin Knight. 2002. Learning a Translation Lexicon from Monolingual Corpora. In Proceedings of the ACL Workshop on Unsupervised Lexical Acquisition, Philadelphia, PA, USA.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of ACL, Prague, Czech Republic.

Angeliki Lazaridou, Georgiana Dinu, Adam Liska, and Marco Baroni. 2015. From Visual Attributes to Adjectives through Decompositional Distributional Semantics. Transactions of the Association for Computational Linguistics, 3.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting Similarities among Languages for Machine Translation. CoRR, abs/1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, Lake Tahoe, NV, USA.

Jeff Mitchell and Mirella Lapata. 2010. Composition in Distributional Models of Semantics. Cognitive Science, 34(8).

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving Machine Translation Performance by Exploiting Non-Parallel Corpora. Computational Linguistics, 31(4):477–504.

Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian Scientific Paper Excerpt Corpus. In Proceedings of LREC, Portorož, Slovenia.

Malte Nuhn, Arne Mauser, and Hermann Ney. 2012. Deciphering Foreign Language by Combining Language Models and Context Vectors. In Proceedings of ACL, Jeju Island, Korea.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL, Philadelphia, PA, USA.

Reinhard Rapp. 1995. Identifying Word Translations in Non-parallel Texts. In Proceedings of ACL, Cambridge, MA, USA.

Sujith Ravi and Kevin Knight. 2011. Deciphering Foreign Language. In Proceedings of ACL-HLT, Portland, OR, USA.

Avneesh Saluja, Hany Hassan, Kristina Toutanova, and Chris Quirk. 2014. Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data. In Proceedings of ACL, Baltimore, MD, USA.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing with Compositional Vector Grammars. In Proceedings of ACL, Sofia, Bulgaria.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013b. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of EMNLP, Seattle, WA, USA.

Masao Utiyama and Hitoshi Isahara. 2003. Reliable Measures for Aligning Japanese-English News Articles and Sentences. In Proceedings of ACL, Sapporo, Japan.

Ivan Vulić and Anna Korhonen. 2016. On the Role of Seed Lexicons in Learning Bilingual Word Embeddings. In Proceedings of ACL, Berlin, Germany.

Guillaume Wisniewski, Alexandre Allauzen, and François Yvon. 2010. Assessing Phrase-Based Translation Models with Oracle Decoding. In Proceedings of EMNLP, Cambridge, MA, USA.

Jiajun Zhang and Chengqing Zong. 2013. Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation. In Proceedings of ACL, Sofia, Bulgaria.

Bing Zhao and Stephan Vogel. 2002. Adaptive parallel sentences mining from web bilingual news collection. In Proceedings of IEEE ICDM, Maebashi, Japan.

Kai Zhao, Hany Hassan, and Michael Auli. 2015. Learning Translation Models from Monolingual Continuous Representations. In Proceedings of NAACL-HLT, Denver, CO, USA.