Transactions of the Association for Computational Linguistics, vol. 5, pp. 309–324, 2017. Action Editor: Sebastian Padó.
Submission batch: 4/2017; Revision batch: 7/2017; Published 9/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Semantic Specialization of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints

Nikola Mrkšić 1,2, Ivan Vulić 1, Diarmuid Ó Séaghdha 2, Ira Leviant 3, Roi Reichart 3, Milica Gašić 1, Anna Korhonen 1, Steve Young 1,2
1 University of Cambridge   2 Apple Inc.   3 Technion, IIT

Abstract

We present ATTRACT-REPEL, an algorithm for improving the semantic quality of word vectors by injecting constraints extracted from lexical resources. ATTRACT-REPEL facilitates the use of constraints from mono- and cross-lingual resources, yielding semantically specialized cross-lingual vector spaces. Our evaluation shows that the method can make use of existing cross-lingual lexicons to construct high-quality vector spaces for a plethora of different languages, facilitating semantic transfer from high- to lower-resource ones. The effectiveness of our approach is demonstrated with state-of-the-art results on semantic similarity datasets in six languages. We next show that ATTRACT-REPEL-specialized vectors boost performance in the downstream task of dialogue state tracking (DST) across multiple languages. Finally, we show that cross-lingual vector spaces produced by our algorithm facilitate the training of multilingual DST models, which brings further performance improvements.

1 Introduction

Word representation learning has become a research area of central importance in modern natural language processing. The common techniques for inducing distributed word representations are grounded in the distributional hypothesis, relying on co-occurrence information in large textual corpora to learn meaningful word representations (Mikolov et al., 2013b; Pennington et al., 2014; Ó Séaghdha and Korhonen, 2014; Levy and Goldberg, 2014). Recently, methods that go beyond stand-alone unsupervised learning have gained increased popularity. These models typically build on distributional ones by using human- or automatically-constructed knowledge bases to enrich the semantic content of existing word vector collections. Often this is done as a post-processing step, where the distributional word vectors are refined to satisfy constraints extracted from a lexical resource such as WordNet (Faruqui et al., 2015; Wieting et al., 2015; Mrkšić et al., 2016). We term this approach semantic specialization.

In this paper we advance the semantic specialization paradigm in a number of ways. We introduce a new algorithm, ATTRACT-REPEL, that uses synonymy and antonymy constraints drawn from lexical resources to tune word vector spaces using linguistic information that is difficult to capture with conventional distributional training. Our evaluation shows that ATTRACT-REPEL outperforms previous methods which make use of similar lexical resources, achieving state-of-the-art results on two word similarity datasets: SimLex-999 (Hill et al., 2015) and SimVerb-3500 (Gerz et al., 2016).

We then deploy the ATTRACT-REPEL algorithm in a multilingual setting, using semantic relations extracted from BabelNet (Navigli and Ponzetto, 2012; Ehrmann et al., 2014), a cross-lingual lexical resource, to inject constraints between words of different languages into the word representations. This allows us to embed vector spaces of multiple languages into a single vector space, exploiting information from high-resource languages to improve the word representations of lower-resource ones. Table 1 illustrates the effects of cross-lingual ATTRACT-REPEL specialization by showing the nearest neighbors for three English words across three cross-lingual spaces.
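As a rough illustration of how nearest-neighbor lists of the kind shown in Table 1 can be read off a shared vector space, the sketch below ranks words by cosine similarity to a query word. The toy vectors, the language-prefixed keys and the helper name `nearest_neighbors` are hypothetical choices made for this example; they are not taken from the released vector collections.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, space, k=3):
    """Return the k words most similar to `query`, excluding the query itself."""
    scored = [(cosine(space[query], vec), word)
              for word, vec in space.items() if word != query]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:k]]


# Hypothetical 3-dimensional toy space; real spaces are 300-dimensional.
toy_space = {
    "en_morning": [0.9, 0.1, 0.0],
    "de_morgen": [0.8, 0.2, 0.1],
    "it_mattina": [0.85, 0.05, 0.1],
    "en_carpet": [0.0, 0.9, 0.3],
    "it_tappeto": [0.1, 0.8, 0.4],
}

print(nearest_neighbors("en_morning", toy_space, k=2))
```

Because all languages live in one space, the same lookup returns both synonyms and translations of the query word.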
Downloaded from http://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00063/1567466/tacl_a_00063.pdf by guest on 09 September 2023
| en_morning: Slavic+EN | en_morning: Germanic | en_morning: Romance+EN | en_carpet: Slavic+EN | en_carpet: Germanic | en_carpet: Romance+EN | en_woman: Slavic+EN | en_woman: Germanic | en_woman: Romance+EN |
|---|---|---|---|---|---|---|---|---|
| en_daybreak | de_vormittag | pt_madrugada | en_rug | de_teppichboden | en_rug | ru_женщина | de_frauen | fr_femme |
| en_morn | nl_krieken | it_mattina | bg_килим | nl_tapijten | it_moquette | bg_жените | sv_kvinnliga | en_womanish |
| bg_разсъмване | en_dawn | en_dawn | ru_ковролин | en_rug | it_tappeti | hr_žena | sv_kvinna | es_mujer |
| hr_svitanje | nl_zonsopkomst | pt_madrugadas | bg_килими | de_teppich | pt_tapete | en_womanish | sv_kvinnor | pt_mulher |
| hr_zore | sv_morgonen | es_madrugada | pl_dywany | en_carpeting | es_moqueta | bg_жена | de_weib | es_fémina |
| bg_изгрев | de_tagesanbruch | it_nascente | bg_мокет | de_teppiche | it_tappetino | pl_kobieta | en_womanish | en_womens |
| en_dawn | en_sunrise | en_morn | pl_dywanów | sv_mattor | en_carpeting | hr_treba | sv_kvinno | pt_feminina |
| ru_утро | nl_opgang | es_aurora | hr_tepih | sv_matta | pt_carpete | bg_жени | de_frauenzimmer | pt_femininas |
| bg_аврора | de_sonnenaufgang | fr_matin | pl_wykładziny | en_carpets | pt_tapetes | en_womens | sv_honkön | es_femina |
| hr_jutro | nl_dageraad | fr_aurora | ru_ковер | nl_tapijt | fr_moquette | pl_kobiet | sv_kvinnan | fr_femelle |
| ru_рассвет | de_anbruch | es_amaneceres | ru_коврик | nl_kleedje | en_carpets | hr_žene | nl_vrouw | pt_fêmea |
| hr_zora | sv_morgon | en_sunrises | hr_ćilim | nl_vloerbedekking | es_alfombra | pl_niewiasta | de_madam | fr_femmes |
| hr_zoru | en_daybreak | es_mañanero | en_carpeting | de_brücke | es_alfombras | hr_žensko | sv_kvinnligt | it_donne |
| pl_poranek | de_morgengrauen | fr_matinée | pl_dywan | de_matta | fr_tapis | hr_ženke | sv_gumman | es_mujeres |
| en_sunrise | nl_zonsopgang | it_mattinata | ru_ковров | nl_matta | pt_tapeçaria | pl_samica | sv_female | pt_fêmeas |
| bg_зазоряване | nl_goedemorgen | pt_amanhecer | en_carpets | en_mat | it_zerbino | ru_самка | sv_gumma | es_hembras |
| bg_сутрин | sv_gryningen | en_cockcrow | ru_килим | de_matte | it_tappeto | bg_женска | sv_kvinnlig | en_wife |
| en_sunrises | en_mornin | pt_aurora | en_mat | en_doilies | es_tapete | hr_ženka | sv_feminin | fr_nana |
| bg_зора | sv_gryning | pt_alvorecer | hr_sag | nl_mat | es_manta | ru_дама | en_wife | es_hembra |

Table 1: Nearest neighbors for three example words across Slavic, Germanic and Romance language groups (with English included as part of each word vector collection). Semantically dissimilar words have been underlined.

In each case, the vast majority of each word's neighbors are meaningful synonyms/translations.[1]

While there is a considerable amount of prior research on joint learning of cross-lingual vector spaces (see Section 2.2), to the best of our knowledge we are the first to apply semantic specialization to this problem.[2] We demonstrate its efficacy with state-of-the-art results on the four languages in the Multilingual SimLex-999 dataset (Leviant and Reichart, 2015). To show that our approach yields semantically informative vectors for lower-resource languages, we collect intrinsic evaluation datasets for Hebrew and Croatian and show that cross-lingual specialization significantly improves word vector quality in these two (comparatively) low-resource languages.

In the second part of the paper, we explore the use of ATTRACT-REPEL-specialized vectors in a downstream application. One important motivation for training word vectors is to improve the lexical coverage of supervised models for language understanding tasks, e.g., question answering (Iyyer et al., 2014) or textual entailment (Rocktäschel et al., 2016). In this work, we use the task of dialogue state tracking (DST) for extrinsic evaluation. This task, which arises in the construction of statistical dialogue systems (Young et al., 2013), involves understanding the goals expressed by the user and updating the system's distribution over such goals as the conversation progresses and new information becomes available.

We show that incorporating our specialized vectors into a state-of-the-art neural-network model for DST improves performance on English dialogues. In the multilingual spirit of this paper, we produce new Italian and German DST datasets and show that using ATTRACT-REPEL-specialized vectors leads to even stronger gains in these two languages. Finally, we show that our cross-lingual vectors can be used to train a single model that performs DST in all three languages, in each case outperforming the monolingual model. To the best of our knowledge, this is the first work on multilingual training of any component of a statistical dialogue system. Our results indicate that multilingual training holds great promise for bootstrapping language understanding models for other languages, especially for dialogue domains where data collection is very resource-intensive.

All resources related to this paper are available at www.github.com/nmrksic/attract-repel. These include: 1) the ATTRACT-REPEL source code; 2) bilingual word vector collections combining English with 51 other languages; 3) Hebrew and Croatian intrinsic evaluation datasets; and 4) Italian and German Dialogue State Tracking datasets collected for this work.

[1] Some (negative) effects of the distributional hypothesis do persist. For example, nl_krieken (Dutch for cherries) is identified as a synonym for en_morning, presumably because the idiom 'het krieken van de dag' translates to 'the crack of dawn'.
[2] Our approach is not suited for languages for which no lexical resources exist. However, many languages have some coverage in cross-lingual lexicons. For instance, BabelNet 3.7 automatically aligns WordNet to Wikipedia, providing accurate cross-lingual mappings between 271 languages. In our evaluation, we demonstrate substantial gains for Hebrew and Croatian, both of which are spoken by fewer than 10 million people worldwide.
2 Related Work

2.1 Semantic Specialization

The usefulness of distributional word representations has been demonstrated across many application areas: Part-of-Speech (POS) tagging (Collobert et al., 2011), machine translation (Zou et al., 2013; Devlin et al., 2014), dependency and semantic parsing (Socher et al., 2013a; Bansal et al., 2014; Chen and Manning, 2014; Johannsen et al., 2015; Ammar et al., 2016), sentiment analysis (Socher et al., 2013b), named entity recognition (Turian et al., 2010; Guo et al., 2014), and many others. The importance of semantic specialization for downstream tasks is relatively unexplored, with improvements in performance so far observed for dialogue state tracking (Mrkšić et al., 2016; Mrkšić et al., 2017), spoken language understanding (Kim et al., 2016b; Kim et al., 2016a) and judging lexical entailment (Vulić et al., 2016).

Semantic specialization methods fall (broadly) into two categories: a) those which train distributed representations 'from scratch' by combining distributional knowledge and lexical information; and b) those which inject lexical information into pre-trained collections of word vectors. Methods from both categories make use of similar lexical resources; common examples include WordNet (Miller, 1995), FrameNet (Baker et al., 1998) or the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013).

Learning from Scratch. Some methods modify the prior or the regularization of the original training procedure using the set of linguistic constraints (Yu and Dredze, 2014; Xu et al., 2014; Bian et al., 2014; Kiela et al., 2015; Aletras and Stevenson, 2015). Other methods modify the skip-gram (Mikolov et al., 2013b) objective function by introducing semantic constraints (Yih et al., 2012; Liu et al., 2015) to train word vectors which emphasize word similarity over relatedness. Osborne et al. (2016) propose a method for incorporating prior knowledge into the Canonical Correlation Analysis (CCA) method used by Dhillon et al. (2015) to learn spectral word embeddings. While such methods introduce semantic similarity constraints extracted from lexicons, approaches such as the one proposed by Schwartz et al. (2015) use symmetric patterns (Davidov and Rappoport, 2006) to push away antonymous words in their pattern-based vector space. Ono et al. (2015) combine both approaches, using thesauri and distributional data to train embeddings specialized for capturing antonymy. Faruqui and Dyer (2015) use many different lexicons to create interpretable sparse binary vectors which achieve competitive performance across a range of intrinsic evaluation tasks.

In theory, word representations produced by models which consider distributional and lexical information jointly could be as good as (or better than) representations produced by fine-tuning distributional vectors. However, their performance has not surpassed that of fine-tuning methods.[3]

Fine-Tuning Pre-trained Vectors. Rothe and Schütze (2015) fine-tune word vector spaces to improve the representations of synsets/lexemes found in WordNet. Faruqui et al. (2015) and Jauhar et al. (2015) use synonymy constraints in a procedure termed retrofitting to bring the vectors of semantically similar words close together, while Wieting et al. (2015) modify the skip-gram objective function to fine-tune word vectors by injecting paraphrasing constraints from PPDB. Mrkšić et al. (2016) build on the retrofitting approach by jointly injecting synonymy and antonymy constraints; the same idea is reassessed by Nguyen et al. (2016). Kim et al. (2016a) further expand this line of work by incorporating semantic intensity information for the constraints, while Recski et al. (2016) use ensembles of rich concept dictionaries to further improve a combined collection of semantically specialized word vectors.

ATTRACT-REPEL is an instance of the second family of models, providing a portable, light-weight approach for incorporating external knowledge into arbitrary vector spaces. In our experiments, we show that ATTRACT-REPEL outperforms previously proposed post-processors, setting the new state-of-the-art performance on the widely used SimLex-999 word similarity dataset. Moreover, we show that starting from distributional vectors allows our method to use existing cross-lingual resources to tie distributional vector spaces of different languages into a unified vector space which benefits from positive semantic transfer between its constituent languages.

[3] The SimLex-999 web page (www.cl.cam.ac.uk/~fh295/simlex.html) lists models with state-of-the-art performance, none of which learn representations jointly.
2.2 Cross-Lingual Word Representations

Most existing models which induce cross-lingual word representations rely on cross-lingual distributional information (Klementiev et al., 2012; Zou et al., 2013; Soyer et al., 2015; Huang et al., 2015, inter alia). These models differ in the cross-lingual signal/supervision they use to tie languages into unified bilingual vector spaces: some models learn on the basis of parallel word-aligned data (Luong et al., 2015; Coulmance et al., 2015) or sentence-aligned data (Hermann and Blunsom, 2014a; Hermann and Blunsom, 2014b; Chandar et al., 2014; Gouws et al., 2015). Other models require document-aligned data (Søgaard et al., 2015; Vulić and Moens, 2016), while some learn on the basis of available bilingual dictionaries (Mikolov et al., 2013a; Faruqui and Dyer, 2014; Lazaridou et al., 2015; Vulić and Korhonen, 2016b; Duong et al., 2016). See Upadhyay et al. (2016) and Vulić and Korhonen (2016b) for an overview of cross-lingual word embedding work.

The inclusion of cross-lingual information results in shared cross-lingual vector spaces which can: a) boost performance on monolingual tasks such as word similarity (Faruqui and Dyer, 2014; Rastogi et al., 2015; Upadhyay et al., 2016); and b) support cross-lingual tasks such as bilingual lexicon induction (Mikolov et al., 2013a; Gouws et al., 2015; Duong et al., 2016), cross-lingual information retrieval (Vulić and Moens, 2015; Mitra et al., 2016), and transfer learning for resource-lean languages (Søgaard et al., 2015; Guo et al., 2015).

However, prior work on cross-lingual word embedding has tended not to exploit pre-existing linguistic resources such as BabelNet. In this work, we make use of cross-lingual constraints derived from such repositories to induce high-quality cross-lingual vector spaces by facilitating semantic transfer from high- to lower-resource languages. In our experiments, we show that cross-lingual vector spaces produced by ATTRACT-REPEL consistently outperform a representative selection of five strong cross-lingual word embedding models in both intrinsic and extrinsic evaluation across several languages.

3 The ATTRACT-REPEL Model

In this section, we propose a new algorithm for producing semantically specialized word vectors by injecting similarity and antonymy constraints into distributional vector spaces. This procedure, which we term ATTRACT-REPEL, builds on the Paragram (Wieting et al., 2015) and counter-fitting procedures (Mrkšić et al., 2016), both of which inject linguistic constraints into existing vector spaces to improve their ability to capture semantic similarity.

Let $V$ be the vocabulary, $S$ the set of synonymous word pairs (e.g. intelligent and brilliant), and $A$ the set of antonymous word pairs (e.g. vacant and occupied). The optimization procedure operates over mini-batches of synonym and antonym pairs $B_S$ and $B_A$ (which list $k_1$ synonym and $k_2$ antonym pairs). For ease of notation, let each word pair $(x_l, x_r)$ in these two sets correspond to a vector pair $(\mathbf{x}_l, \mathbf{x}_r)$, so that a mini-batch is given by $B_S = [(\mathbf{x}_l^1, \mathbf{x}_r^1), \ldots, (\mathbf{x}_l^{k_1}, \mathbf{x}_r^{k_1})]$ (similarly for $B_A$).

Next, we define $T_S = [(\mathbf{t}_l^1, \mathbf{t}_r^1), \ldots, (\mathbf{t}_l^{k_1}, \mathbf{t}_r^{k_1})]$ and $T_A = [(\mathbf{t}_l^1, \mathbf{t}_r^1), \ldots, (\mathbf{t}_l^{k_2}, \mathbf{t}_r^{k_2})]$ as pairs of negative examples for each synonymy and antonymy example pair in mini-batches $B_S$ and $B_A$. These negative examples are chosen from the word vectors present in $B_S$ or $B_A$ so that:

• For each synonymy pair $(\mathbf{x}_l, \mathbf{x}_r)$, the negative example pair $(\mathbf{t}_l, \mathbf{t}_r)$ is chosen from the remaining in-batch vectors so that $\mathbf{t}_l$ is the one closest (cosine similarity) to $\mathbf{x}_l$ and $\mathbf{t}_r$ is closest to $\mathbf{x}_r$.
• For each antonymy pair $(\mathbf{x}_l, \mathbf{x}_r)$, the negative example pair $(\mathbf{t}_l, \mathbf{t}_r)$ is chosen from the remaining in-batch vectors so that $\mathbf{t}_l$ is the one furthest away from $\mathbf{x}_l$ and $\mathbf{t}_r$ is the one furthest from $\mathbf{x}_r$.

These negative examples are used to: a) force synonymous pairs to be closer to each other than to their respective negative examples; and b) to force antonymous pairs to be further away from each other than from their negative examples. The first term of the cost function pulls synonymous words together:

$$S(B_S, T_S) = \sum_{i=1}^{k_1} \left[ \tau\left(\delta_{syn} + \mathbf{x}_l^i \mathbf{t}_l^i - \mathbf{x}_l^i \mathbf{x}_r^i\right) + \tau\left(\delta_{syn} + \mathbf{x}_r^i \mathbf{t}_r^i - \mathbf{x}_l^i \mathbf{x}_r^i\right) \right]$$

where $\tau(x) = \max(0, x)$ is the hinge loss function and $\delta_{syn}$ is the similarity margin which determines how much closer synonymous vectors should be to
each other than to their respective negative examples. The second part of the cost function pushes antonymous word pairs away from each other:

$$A(B_A, T_A) = \sum_{i=1}^{k_2} \left[ \tau\left(\delta_{ant} + \mathbf{x}_l^i \mathbf{x}_r^i - \mathbf{x}_l^i \mathbf{t}_l^i\right) + \tau\left(\delta_{ant} + \mathbf{x}_l^i \mathbf{x}_r^i - \mathbf{x}_r^i \mathbf{t}_r^i\right) \right]$$

In addition to these two terms, we include an additional regularization term which aims to preserve the abundance of high-quality semantic content present in the initial (distributional) vector space, as long as this information does not contradict the injected linguistic constraints. If $V(B)$ is the set of all word vectors present in the given mini-batch, then:

$$R(B_S, B_A) = \sum_{\mathbf{x}_i \in V(B_S \cup B_A)} \lambda_{reg} \left\| \widehat{\mathbf{x}}_i - \mathbf{x}_i \right\|_2$$

where $\lambda_{reg}$ is the L2 regularization constant and $\widehat{\mathbf{x}}_i$ denotes the original (distributional) word vector for word $x_i$. The final ATTRACT-REPEL cost function is given by the sum of all three terms:

$$C(B_S, T_S, B_A, T_A) = S(B_S, T_S) + A(B_A, T_A) + R(B_S, B_A)$$

Comparison to Prior Work. ATTRACT-REPEL draws inspiration from three methods: 1) retrofitting (Faruqui et al., 2015); 2) PARAGRAM (Wieting et al., 2015); and 3) counter-fitting (Mrkšić et al., 2016). Whereas retrofitting and PARAGRAM do not consider antonymy, counter-fitting models both synonymy and antonymy. ATTRACT-REPEL differs from this method in two important ways:

1. Context-Sensitive Updates: Counter-fitting uses attract and repel terms which pull synonyms together and push antonyms apart without considering their relation to other word vectors. For example, its attract term is given by $\mathrm{Attract}(S) = \sum_{(x_l, x_r) \in S} \tau\left(\delta_{syn} - \mathbf{x}_l \mathbf{x}_r\right)$, where $S$ is the set of synonyms and $\delta_{syn}$ is the synonymy margin. Conversely, ATTRACT-REPEL fine-tunes vector spaces by operating over mini-batches of example pairs, updating word vectors only if the position of their negative example implies a stronger semantic relation than that expressed by the position of its target example. Importantly, ATTRACT-REPEL makes fine-grained updates to both the example pair and the negative examples, rather than updating the example word pair but ignoring how this affects its relation to all other word vectors.

2. Regularization: Counter-fitting preserves distances between pairs of word vectors in the initial vector space, trying to 'pull' the words' neighborhoods with them as they move to incorporate external knowledge. The radius of this initial neighborhood introduces an opaque hyperparameter to the procedure. Conversely, ATTRACT-REPEL implements standard L2 regularization, which 'pulls' each vector towards its distributional vector representation.

In our intrinsic evaluation (Section 5), we perform an exhaustive comparison of these models, showing that ATTRACT-REPEL outperforms counter-fitting in both mono- and cross-lingual setups.

Optimization. Following Wieting et al. (2015), we use the AdaGrad algorithm (Duchi et al., 2011) to train the word embeddings for five epochs, which suffices for the parameter estimates to converge. Similar to Faruqui et al. (2015), Wieting et al. (2015) and Mrkšić et al. (2016), we do not use early stopping. By not relying on language-specific validation sets, the ATTRACT-REPEL procedure can induce semantically specialized word vectors for languages with no intrinsic evaluation datasets.[4]

Hyperparameter Tuning. We use Spearman's correlation of the final word vectors with the Multilingual WordSim-353 gold-standard association dataset (Finkelstein et al., 2002; Leviant and Reichart, 2015). The ATTRACT-REPEL procedure has six hyperparameters: the regularization constant $\lambda_{reg}$, the similarity and antonymy margins $\delta_{sim}$ and $\delta_{ant}$ (the similarity margin is the one written $\delta_{syn}$ in the cost function), mini-batch sizes $k_1$ and $k_2$, and the size of the PPDB constraint set used for each language (larger sizes include more

[4] Many languages are present in semi-automatically constructed lexicons such as BabelNet or PPDB (see the discussion in Section 4.2). However, intrinsic evaluation datasets such as SimLex-999 exist for very few languages, as they require expert translators and skilled annotators.
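The three cost terms defined in Section 3 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the released implementation. It assumes unit-length vectors (so that a dot product equals cosine similarity), it interprets "remaining in-batch vectors" as all vectors belonging to the other pairs of the same mini-batch, and it needs at least two pairs per batch. The function names and the toy vectors are hypothetical; the default margins and regularization constant are the values reported as best in the grid search.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hinge(x):
    """tau(x) = max(0, x)."""
    return max(0.0, x)


def in_batch_negative(target, batch, skip_index, closest):
    """Pick a negative example for `target` from the other pairs in the batch.

    Assumption: candidates are all vectors from pairs other than `skip_index`.
    """
    candidates = [v for j, pair in enumerate(batch) if j != skip_index for v in pair]
    pick = max if closest else min
    return pick(candidates, key=lambda v: dot(target, v))


def attract_term(syn_batch, delta_syn):
    """S(B_S, T_S): synonyms should be closer than their nearest negatives."""
    cost = 0.0
    for i, (xl, xr) in enumerate(syn_batch):
        tl = in_batch_negative(xl, syn_batch, i, closest=True)
        tr = in_batch_negative(xr, syn_batch, i, closest=True)
        cost += hinge(delta_syn + dot(xl, tl) - dot(xl, xr))
        cost += hinge(delta_syn + dot(xr, tr) - dot(xl, xr))
    return cost


def repel_term(ant_batch, delta_ant):
    """A(B_A, T_A): antonyms should be further apart than their furthest negatives."""
    cost = 0.0
    for i, (xl, xr) in enumerate(ant_batch):
        tl = in_batch_negative(xl, ant_batch, i, closest=False)
        tr = in_batch_negative(xr, ant_batch, i, closest=False)
        cost += hinge(delta_ant + dot(xl, xr) - dot(xl, tl))
        cost += hinge(delta_ant + dot(xl, xr) - dot(xr, tr))
    return cost


def regularization_term(current, original, lambda_reg):
    """R(B_S, B_A): L2 distance of each in-batch vector from its original value.

    `current` and `original` map each in-batch word to its vector.
    """
    total = 0.0
    for word, vec in current.items():
        diff = [a - b for a, b in zip(original[word], vec)]
        total += lambda_reg * math.sqrt(dot(diff, diff))
    return total


def attract_repel_cost(syn_batch, ant_batch, current, original,
                       delta_syn=0.6, delta_ant=0.0, lambda_reg=1e-9):
    """C = S + A + R for one pair of mini-batches."""
    return (attract_term(syn_batch, delta_syn)
            + repel_term(ant_batch, delta_ant)
            + regularization_term(current, original, lambda_reg))


# Hypothetical 2-dimensional unit vectors, two pairs per mini-batch.
syn = [([1.0, 0.0], [0.0, 1.0]), ([-1.0, 0.0], [0.0, -1.0])]
ant = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
print(attract_repel_cost(syn, ant, {"a": [0.0, 1.0]}, {"a": [1.0, 0.0]}))
```

Each hinge is zero once its margin is satisfied, so a pair that is already better placed than its negative example contributes nothing to the cost. In the procedure described above this cost is minimized with AdaGrad, which the sketch does not attempt to reproduce.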
constraints, but also a larger proportion of false synonyms). We ran a grid search over these for the four SimLex languages, choosing the hyperparameters which achieved the best WordSim-353 score.[5]

| | English syn | English ant | German syn | German ant | Italian syn | Italian ant | Russian syn | Russian ant |
|---|---|---|---|---|---|---|---|---|
| English | 640 | 5 | 246 | 11 | 356 | 24 | 196 | 9 |
| German | - | - | 135 | 2 | 277 | 13 | 175 | 6 |
| Italian | - | - | - | - | 159 | 7 | 220 | 11 |
| Russian | - | - | - | - | - | - | 48 | 1 |

Table 2: Linguistic constraint counts (in thousands). For each language pair, the two figures show the number of injected synonymy and antonymy constraints. Monolingual constraints (the diagonal elements) are underlined.

4 Experimental Setup

4.1 Distributional Vectors

We first present our sixteen experimental languages: English (EN), German (DE), Italian (IT), Russian (RU), Dutch (NL), Swedish (SV), French (FR), Spanish (ES), Portuguese (PT), Polish (PL), Bulgarian (BG), Croatian (HR), Hebrew (HE), Irish (GA), Persian (FA) and Vietnamese (VI). The first four languages are those of the Multilingual SimLex-999 dataset.

For the four SimLex languages, we employ four well-known, high-quality word vector collections: a) the Common Crawl GloVe English vectors from Pennington et al. (2014); b) German vectors from Vulić and Korhonen (2016a); c) Italian vectors from Dinu et al. (2015); and d) Russian vectors from Kutuzov and Andreev (2015). In addition, for each of the 16 languages we also train the skip-gram with negative sampling variant of the word2vec model (Mikolov et al., 2013b), on the latest Wikipedia dump of each language, to induce 300-dimensional word vectors.[6]

[5] We ran the grid search over $\lambda_{reg} \in [10^{-3}, \ldots, 10^{-10}]$, $\delta_{sim}, \delta_{ant} \in [0, 0.1, \ldots, 1.0]$, $k_1, k_2 \in [10, 25, 50, 100, 200]$ and over the six PPDB sizes for the four SimLex languages. $\lambda_{reg} = 10^{-9}$, $\delta_{sim} = 0.6$, $\delta_{ant} = 0.0$ and $k_1 = k_2 \in [10, 25, 50]$ consistently achieved the best performance (we use $k_1 = k_2 = 50$ in all experiments for consistency). The PPDB constraint set size XL was best for English, German and Italian, and M achieved the best performance for Russian.
[6] The frequency cut-off was set to 50: words that occurred less frequently were removed from the vocabularies. Other word2vec parameters were set to the standard values (Vulić and Korhonen, 2016a): 15 epochs, 15 negative samples, global (decreasing) learning rate: 0.025, subsampling rate: 1e−4.

4.2 Linguistic Constraints

Table 2 shows the number of monolingual and cross-lingual constraints for the four SimLex languages.

Monolingual Similarity. We employ the Multilingual Paraphrase Database (Ganitkevitch and Callison-Burch, 2014). This resource contains paraphrases automatically extracted from parallel-aligned corpora for ten of our sixteen languages. In our experiments, the remaining six languages (HE, HR, SV, GA, VI, FA) serve as examples of lower-resource languages, as they have no monolingual synonymy constraints.

Cross-Lingual Similarity. We employ BabelNet, a multilingual semantic network automatically constructed by linking Wikipedia to WordNet (Navigli and Ponzetto, 2012; Ehrmann et al., 2014). BabelNet groups words from different languages into Babel synsets. We consider two words from any (distinct) language pair to be synonymous if they belong to (at least) one set of synonymous Babel synsets. We made use of all BabelNet word senses tagged as conceptual but ignored the ones tagged as Named Entities.

Given a large collection of cross-lingual semantic constraints (e.g. the translation pair en_sweet and it_dolce), ATTRACT-REPEL can use them to bring the vector spaces of different languages together into a shared cross-lingual space. Ideally, sharing information across languages should lead to improved semantic content for each language, especially for those with limited monolingual resources.

Antonymy. BabelNet is also used to extract both monolingual and cross-lingual antonymy constraints. Following Faruqui et al. (2015), who found PPDB constraints more beneficial than the WordNet ones, we do not use BabelNet for monolingual synonymy.

Availability of Resources. Both PPDB and BabelNet are created automatically. However, PPDB relies on large, high-quality parallel corpora such as Europarl (Koehn, 2005). In total, Multilingual PPDB provides collections of paraphrases for 22 languages. On the other hand, BabelNet uses Wikipedia's inter-language links and statistical machine translation (Google Translate) to provide cross-lingual mappings for 271 languages. In our evaluation, we show that PPDB and BabelNet can be used jointly to improve word representations for lower-resource languages by
tying them into bilingual spaces with high-resource ones. We validate this claim on Hebrew and Croatian, which act as 'lower-resource' languages because of their lack of any PPDB resource and their relatively small Wikipedia sizes.[7]

5 Intrinsic Evaluation

5.1 Datasets

Spearman's rank correlation with the SimLex-999 dataset (Hill et al., 2015) is used as the intrinsic evaluation metric throughout the experiments. Unlike other gold standard resources such as WordSim-353 (Finkelstein et al., 2002) or MEN (Bruni et al., 2014), SimLex-999 consists of word pairs scored by annotators instructed to discern between semantic similarity and conceptual association, so that related but non-similar words (e.g. book and read) have a low rating.

Leviant and Reichart (2015) translated SimLex-999 to German, Italian and Russian, crowd-sourcing the similarity scores from native speakers of these languages. We use this resource for multilingual intrinsic evaluation.[8] To investigate the portability of our approach to lower-resource languages, we used the same experimental setup to collect SimLex-999 datasets for Hebrew and Croatian.[9] For English vectors, we also report Spearman's correlation with SimVerb-3500 (Gerz et al., 2016), a semantic similarity dataset that focuses on verb pair similarity.

5.2 Experiments

Monolingual and Cross-Lingual Specialization. We start from distributional vectors for the SimLex languages: English, German, Italian and Russian. For each language, we first perform semantic specialization of these spaces using: a) monolingual synonyms; b) monolingual antonyms; and c) the combination of both. We then add cross-lingual synonyms and antonyms to these constraints and train a shared four-lingual vector space for these languages.

[7] Hebrew and Croatian Wikipedias (which are used to induce their BabelNet constraints) currently consist of 203,867 / 172,824 articles, ranking them 40th / 42nd by size.
[8] Leviant and Reichart (2015) also re-scored the original English SimLex. We report results on their version, but also provide numbers for the original dataset for comparability.
[9] The 999 word pairs and annotator instructions were translated by native speakers and scored by 10 annotators. The inter-annotator agreement scores (Spearman's ρ) were 0.77 (pairwise) and 0.87 (mean) for Croatian, and 0.59 / 0.71 for Hebrew.

Comparison to Baseline Methods. Both mono- and cross-lingual specialization was performed using ATTRACT-REPEL and counter-fitting, in order to conclusively determine which of the two methods exhibited superior performance. Retrofitting and PARAGRAM methods only inject synonymy, and their cost functions can be expressed using sub-components of counter-fitting and ATTRACT-REPEL cost functions. As such, the performance of the two investigated methods when they make use of synonymy (but not antonymy) constraints illustrates the performance range of the two preceding models.

Importance of Initial Vectors. We use three different sets of initial word vectors: a) well-known distributional word vector collections (Section 4.1); b) distributional word vectors trained on the latest Wikipedia dumps; and c) word vectors randomly initialized using the XAVIER initialization (Glorot and Bengio, 2010).

Specialization for Lower-Resource Languages. In this experiment, we first construct bilingual spaces which combine: a) one of the four SimLex languages; with b) each of the other twelve languages.[10] Since each pair contains at least one SimLex language, we can analyse the improvement over monolingual specialization to understand how robust the performance gains are across different language pairs. We next use the newly collected SimLex datasets for Hebrew and Croatian to evaluate the extent to which bilingual semantic specialization using ATTRACT-REPEL and BabelNet constraints can improve word representations for lower-resource languages.

Comparison to State-of-the-Art Bilingual Spaces. The English-Italian and English-German bilingual spaces induced by ATTRACT-REPEL were compared to five state-of-the-art methods for constructing bilingual vector spaces: 1. (Mikolov et al., 2013a), re-trained using the constraints used by our model; and 2.-5. (Hermann and Blunsom, 2014a; Gouws et al., 2015; Vulić and Korhonen, 2016a; Vulić and Moens, 2016). The latter models use various sources of supervision (word-, sentence- and document-aligned

[10] Hyperparameters: we used $\delta_{sim} = 0.6$, $\delta_{ant} = 0.0$ and $\lambda_{reg} = 10^{-9}$, which achieved the best performance when tuned for the original SimLex languages. The largest available PPDB size was used for the six languages with available PPDB (French, Spanish, Portuguese, Polish, Bulgarian and Dutch).
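The intrinsic evaluation metric of Section 5.1 can be sketched as follows: score each word pair by the cosine similarity of its vectors, then compute Spearman's rank correlation between these scores and the human ratings. The sketch uses only the standard library. Skipping pairs with out-of-vocabulary words is an assumption of this example, not something stated in the text, and the toy vectors and gold scores are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(xs), average_ranks(ys))


def evaluate(pairs, vectors):
    """pairs: (word1, word2, gold_score) triples; vectors: word -> vector.

    Assumption: pairs containing an out-of-vocabulary word are skipped.
    """
    model_scores, gold_scores = [], []
    for w1, w2, gold in pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            gold_scores.append(gold)
    return spearman(model_scores, gold_scores)


# Hypothetical toy data: three scored pairs over four 2-dimensional vectors.
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
toy_pairs = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0), ("a", "zzz", 3.0)]
print(evaluate(toy_pairs, toy_vectors))
```

Because only ranks matter, the metric is insensitive to the scale of the cosine scores and of the human ratings, which makes scores comparable across vector collections.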
corpora), which means they cannot be trained using our sets of constraints. For these models, we use competitive setups proposed in (Vulić and Korhonen, 2016a). The goal of this experiment is to show that vector spaces induced by ATTRACT-REPEL exhibit better intrinsic and extrinsic performance when deployed in language understanding tasks.

| Word Vectors | English | German | Italian | Russian |
|---|---|---|---|---|
| Monolingual Distributional Vectors | 0.32 | 0.28 | 0.36 | 0.38 |
| COUNTER-FITTING: Mono-Syn | 0.45 | 0.24 | 0.29 | 0.46 |
| COUNTER-FITTING: Mono-Ant | 0.33 | 0.28 | 0.47 | 0.42 |
| COUNTER-FITTING: Mono-Syn + Mono-Ant | 0.50 | 0.26 | 0.35 | 0.49 |
| COUNTER-FITTING: Cross-Syn | 0.46 | 0.43 | 0.45 | 0.37 |
| COUNTER-FITTING: Mono-Syn + Cross-Syn | 0.47 | 0.40 | 0.43 | 0.45 |
| COUNTER-FITTING: Mono-Syn + Mono-Ant + Cross-Syn + Cross-Ant | 0.53 | 0.41 | 0.49 | 0.48 |
| ATTRACT-REPEL: Mono-Syn | 0.56 | 0.40 | 0.46 | 0.53 |
| ATTRACT-REPEL: Mono-Ant | 0.42 | 0.30 | 0.45 | 0.41 |
| ATTRACT-REPEL: Mono-Syn + Mono-Ant | 0.65 | 0.43 | 0.56 | 0.56 |
| ATTRACT-REPEL: Cross-Syn | 0.57 | 0.53 | 0.58 | 0.46 |
| ATTRACT-REPEL: Mono-Syn + Cross-Syn | 0.61 | 0.58 | 0.59 | 0.54 |
| ATTRACT-REPEL: Mono-Syn + Mono-Ant + Cross-Syn + Cross-Ant | 0.71 | 0.62 | 0.67 | 0.61 |

Table 3: Multilingual SimLex-999. The effect of using the COUNTER-FITTING and ATTRACT-REPEL procedures to inject mono- and cross-lingual synonymy and antonymy constraints into the four collections of distributional word vectors. Our best results set the new state-of-the-art performance for all four languages.

| Word Vectors | EN | DE | IT | RU |
|---|---|---|---|---|
| Random Init. (No Info.) | 0.01 | -0.03 | 0.02 | -0.03 |
| A-R: Monolingual Cons. | 0.54 | 0.33 | 0.29 | 0.35 |
| A-R: Mono + Cross-Ling. | 0.66 | 0.49 | 0.59 | 0.51 |
| Distributional Wiki Vectors | 0.32 | 0.31 | 0.28 | 0.19 |
| A-R: Monolingual Cons. | 0.61 | 0.48 | 0.53 | 0.52 |
| A-R: Mono + Cross-Ling. | 0.66 | 0.60 | 0.65 | 0.54 |

Table 4: Multilingual SimLex-999. The effect of ATTRACT-REPEL (A-R) on alternative sets of starting word vectors (Random = XAVIER initialization).

5.3 Results and Discussion

Table 3 shows the effects of monolingual and cross-lingual semantic specialization of four well-known distributional vector spaces for the SimLex languages. Monolingual specialization leads to very strong improvements in the SimLex performance across all languages. Cross-lingual specialization brings further improvements, with all languages benefiting from sharing the cross-lingual vector space. German and Italian in particular show strong evidence of effective transfer (+0.19 / +0.11 over monolingual specialization), with Italian vectors' performance coming close to the top-performing English ones.

Comparison to Baselines. Table 3 gives an exhaustive comparison of ATTRACT-REPEL to counter-fitting: ATTRACT-REPEL achieved substantially stronger performance in all experiments. We believe these results conclusively show that the context-sensitive updates and L2 regularization employed by ATTRACT-REPEL present a better alternative to the context-insensitive attract/repel terms and pair-wise regularization employed by counter-fitting.[11]

[11] To understand the relative importance of the context-sensitive updates and the change in regularization, we can compare the two methods to the retrofitting procedure (Faruqui et al., 2015). Retrofitting uses L2 regularization (like ATTRACT-REPEL) and a 'global' attract term (like counter-fitting). The performance of retrofitting using all mono- and cross-lingual synonymy constraints (the procedure does not support antonyms) gives an EN-DE-IT-RU score of [0.41, 0.30, 0.36, 0.40], which is a change of [-0.06, -0.10, -0.07, -0.08] compared to counter-fitting, and is substantially weaker than ATTRACT-REPEL: [-0.20, -0.28, -0.23, -0.21]. We can therefore conclude that the bulk of the performance improvement achieved by ATTRACT-REPEL stems from using the context-sensitive updates. Counter-fitting outperforms retrofitting as well, which implies that its pairwise regularization is an improvement over simple L2 regularization. However, its quadratic complexity makes it intractable for the scale of experiments performed in this paper (unlike ATTRACT-REPEL, which fine-tunes the vector space in less than 2 minutes using an NVIDIA GeForce GTX 1080 graphics card).

State-of-the-Art. Wieting et al. (2016) note that the hyperparameters of the widely used Paragram-SL999 vectors (Wieting et al., 2015) are tuned on SimLex-999, and as such are not comparable to methods which hold out the dataset. This implies that further work which uses these vectors (e.g., (Mrkšić et al.,
2016; Recski et al., 2016)) as a starting point does not yield meaningful high scores either. Our reported English score of 0.71 on the Multilingual SimLex-999 corresponds to 0.751 on the original SimLex-999: it outperforms the 0.706 score reported by Wieting et al. (2016) and sets a new high score for this dataset. Similarly, the SimVerb-3500 score of these vectors is 0.674, outperforming the current state-of-the-art score of 0.628 reported by Gerz et al. (2016).

Starting Distributional Spaces  Table 4 repeats the previous experiment with two different sets of initial vector spaces: a) randomly initialized word vectors;¹² and b) skip-gram with negative sampling vectors trained on the latest Wikipedia dumps. The randomly initialized vectors serve to decouple the impact of injecting external knowledge from the information embedded in the distributional vectors. The random vectors benefit from both mono- and cross-lingual specialization: the English performance is surprisingly strong, with other languages suffering more from the lack of initialization.

When comparing distributional vectors trained on Wikipedia to the high-quality word vector collections used in Table 3, the Italian and Russian vectors in particular start from substantially weaker SimLex scores. The difference in performance is largely mitigated through semantic specialization. However, all vector spaces still exhibit a weaker performance compared to those in Table 3. We believe this shows that the quality of the initial distributional vector spaces is important, but can, in large part, be compensated for through semantic specialization.

¹² The XAVIER initialization populates the values for each word vector by uniformly sampling from the interval $[-\sqrt{6}/\sqrt{d}, +\sqrt{6}/\sqrt{d}]$, where $d$ is the vector dimensionality. This is a typical init method in neural nets research (Goldberg, 2015; Bengio et al., 2013).

Bilingual Specialization  Table 5 shows the effect of combining the four original SimLex languages with each other and with twelve other languages (Section 4.1). Bilingual specialization substantially improves over monolingual specialization for all language pairs. This indicates that our improvements are language independent to a large extent.

Interestingly, even though we use no monolingual synonymy constraints for the six right-most languages, combining them with the SimLex languages still improved word vector quality for these four high-resource languages. The reason why even resource-deprived languages such as Irish help improve vector space quality of high-resource languages such as English or Italian is that they provide implicit indicators of semantic similarity. English words which map to the same Irish word are likely to be synonyms, even if those English pairs are not present in the PPDB datasets (Faruqui and Dyer, 2014).¹³

¹³ We release bilingual vector spaces for EN + 51 other languages: the 16 presented here and another 35 languages (all available at www.github.com/nmrksic/attract-repel).

          Mono.  | SimLex Languages       | PPDB available                     | No PPDB available
          Spec.  | EN    DE    IT    RU   | NL    FR    ES    PT    PL    BG   | HR    HE    GA    VI    FA    SV
English   0.65   | -     0.69  0.70  0.70 | 0.70  0.72  0.72  0.70  0.70  0.68 | 0.70  0.66  0.65  0.67  0.68  0.70
German    0.43   | 0.61  -     0.58  0.56 | 0.55  0.60  0.59  0.56  0.54  0.52 | 0.53  0.50  0.49  0.48  0.51  0.55
Italian   0.56   | 0.69  0.65  -     0.64 | 0.67  0.68  0.68  0.66  0.66  0.62 | 0.63  0.59  0.60  0.58  0.61  0.63
Russian   0.56   | 0.63  0.59  0.62  -    | 0.61  0.61  0.62  0.58  0.60  0.61 | 0.59  0.56  0.57  0.58  0.58  0.60

Table 5: SimLex-999 performance. Tying the SimLex languages into bilingual vector spaces with 16 different languages. The first number in each row represents monolingual specialization. All but two of the bilingual spaces improved over these baselines. The EN-FR vectors set a new high score of 0.754 on the original (English) SimLex-999.

Lower-Resource Languages  The previous experiment indicates that bilingual specialization further improves the (already) high-quality estimates for high-resource languages. However, it does little to show how much (or if) the word vectors of lower-resource languages improve during such specialization. Table 6 investigates this proposition using the newly collected SimLex datasets for Hebrew and Croatian.

Tying the distributional vectors for these languages (which have no monolingual constraints) into cross-lingual spaces with high-resource ones (which do, in our case from PPDB) leads to substantial improvements. Table 6 also shows how the distributional vectors of the four SimLex languages improve when tied to other languages (in each row, we use monolingual constraints only for the 'added' language). Hebrew and Croatian exhibit similar trends to the original SimLex languages: tying to English and Italian leads to stronger gains than tying to the morphologically sophisticated German and Russian. Indeed, tying to English consistently led to the strongest performance. We believe this shows that bilingual ATTRACT-REPEL specialization with English promises to produce high-quality vector spaces for many lower-resource languages which have coverage among the 271 BabelNet languages (but are not available in PPDB).

           Distrib.  +EN    +DE    +IT    +RU
Hebrew     0.28      0.51   0.46   0.52   0.45
Croatian   0.21      0.62   0.49   0.58   0.54
English    0.32      -      0.61   0.66   0.63
German     0.28      0.58   -      0.55   0.49
Italian    0.36      0.69   0.66   -      0.63
Russian    0.38      0.56   0.52   0.55   -

Table 6: Bilingual semantic specialization for: a) Hebrew and Croatian; and b) the original SimLex languages. Each row shows how SimLex scores for that language improve when its distributional vectors are tied into bilingual vector spaces with the four high-resource languages.

Existing Bilingual Spaces  Table 7 compares the intrinsic (i.e. SimLex-999) performance of bilingual English-Italian and English-German vectors produced by ATTRACT-REPEL to five previously proposed approaches for constructing bilingual vector spaces. For both languages in both language pairs, ATTRACT-REPEL achieves substantial gains over all of these methods. In the next section, we show that these differences in intrinsic performance lead to substantial gains in downstream evaluation.

                                EN-IT          EN-DE
Model                           EN     IT      EN     DE
(Mikolov et al., 2013a)         0.32   0.28    0.32   0.28
(Hermann and Blunsom, 2014a)    0.40   0.34    0.38   0.35
(Gouws et al., 2015)            0.25   0.18    0.25   0.14
(Vulić and Korhonen, 2016a)     0.32   0.27    0.32   0.33
(Vulić and Moens, 2016)         0.23   0.25    0.20   0.25
Bilingual ATTRACT-REPEL         0.70   0.69    0.69   0.61

Table 7: Comparison of the intrinsic quality (SimLex-999) of bilingual spaces produced by the ATTRACT-REPEL method to those produced by five state-of-the-art methods for constructing bilingual vector spaces.

6 Downstream Task Evaluation

6.1 Dialogue State Tracking

Task-oriented dialogue systems help users achieve goals such as making travel reservations or finding restaurants. In slot-based systems, application domains are defined by ontologies which enumerate the goals that users can express (Young, 2010). The goals are expressed by slot-value pairs such as [price: cheap] or [food: Thai]. For modular task-based systems, the Dialogue State Tracking (DST) component is in charge of maintaining the belief state, which is the system's internal distribution over the possible states of the dialogue. Figure 1 shows the correct dialogue state for each turn of an example dialogue.

User: Suggest something fancy.
    [price=expensive]
System: Sure, where?
User: Downtown. Any Korean places?
    [price=expensive, area=centre, food=Korean]
System: Sorry, no Korean places in the centre.
User: How about Japanese?
    [price=expensive, area=centre, food=Japanese]
System: Sticks'n'Sushi meets your criteria.

Figure 1: Annotated dialogue states in a sample dialogue. Underlined words show rephrasings for ontology values which are typically handled using semantic dictionaries.

Unseen Data/Labels  As dialogue ontologies can be very large, many of the possible class labels (i.e., the various food types or street names) will not occur in the training set. To overcome this problem, delexicalization-based DST models (Henderson et al., 2014c; Henderson et al., 2014b; Mrkšić et al., 2015; Wen et al., 2017) replace occurrences of ontology values with generic tags which facilitate transfer learning across different ontology values. This is done through exact matching supplemented with semantic lexicons which encode rephrasings, morphology and other linguistic variation. For instance, such lexicons would be required to deal with the underlined non-exact matches in Figure 1.

Exact Matching as a Bottleneck  Semantic lexicons can be hand-crafted for small dialogue domains.
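As a rough illustration of delexicalization by exact matching against a hand-crafted semantic lexicon, the sketch below replaces lexicon hits in an utterance with generic slot tags. It is a minimal sketch under stated assumptions, not the implementation used by the cited DST models: the `LEXICON` entries, the `<SLOT>` tag format and the `delexicalize` helper are all hypothetical and chosen only to mirror the dialogue in Figure 1.

```python
import re

# Hypothetical hand-crafted lexicon (an assumption for illustration):
# each (slot, value) pair lists surface forms counted as a mention of it.
LEXICON = {
    ("price", "expensive"): ["expensive", "fancy", "pricey"],
    ("area", "centre"): ["centre", "center", "downtown"],
    ("food", "korean"): ["korean"],
    ("food", "japanese"): ["japanese"],
}


def delexicalize(utterance, lexicon=LEXICON):
    """Replace lexicon matches with generic slot tags.

    Returns the tagged (lowercased) text and the list of matched
    (slot, value) pairs, in lexicon order.
    """
    text = utterance.lower()
    found = []
    for (slot, value), phrases in lexicon.items():
        # Try longer phrases first so they win over shorter alternatives.
        ordered = sorted(phrases, key=len, reverse=True)
        pattern = r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
        # Tags are upper case, so later lowercase patterns cannot match them.
        text, count = re.subn(pattern, "<" + slot.upper() + ">", text)
        if count:
            found.append((slot, value))
    return text, found


if __name__ == "__main__":
    print(delexicalize("Suggest something fancy."))
    print(delexicalize("Downtown. Any Korean places?"))
```

With this toy lexicon, the first two user turns of Figure 1 are tagged only because "fancy" and "downtown" were listed by hand; a rephrasing that is missing from the lexicon (say, "posh") would pass through untagged.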
Mrkšić et al. (2016) showed that semantically specialized vector spaces can be used to automatically induce such lexicons for simple dialogue domains. However, as domains grow more sophisticated, the reliance on (manually- or automatically-constructed) semantic dictionaries which list potential rephrasings for ontology values becomes a bottleneck for deploying dialogue systems. Ambiguous rephrasings are just one problematic instance of this approach: a user asking about Iceland could be referring to the country or the supermarket chain, and someone asking for songs by Train is not interested in train timetables. More importantly, the use of English as the principal language in most dialogue systems research understates the challenges that complex linguistic phenomena present in other languages. In this work, we investigate the extent to which semantic specialization can empower DST models which do not rely on such dictionaries.

Neural Belief Tracker (NBT)  The NBT is a novel DST model which operates purely over distributed representations of words, learning to compose utterance and context representations which it then uses to decide which of the potentially many ontology-defined intents (goals) have been expressed by the user (Mrkšić et al., 2017). To overcome the data sparsity problem, the NBT uses label embedding to decompose this multi-class classification problem into many binary classification ones: for each slot, the model iterates over slot values defined by the ontology, deciding whether each of them was expressed in the current utterance and its surrounding context. The first NBT layer consists of neural networks which produce distributed representations of the user utterance,¹⁴ the preceding system output and the embedded label of the candidate slot-value pair. These representations are then passed to the downstream semantic decoding and context modelling networks, which subsequently make the binary decision regarding the current slot-value candidate. When contradicting goals are detected (i.e. cheap and expensive), the model chooses the more probable one.

¹⁴ There are two variants of the NBT model: NBT-DNN and NBT-CNN. In this work, we limit our investigation to the latter one, as it achieved consistently stronger DST performance.

The NBT training procedure keeps the initial word vectors fixed. That way, at test time, unseen words semantically related to familiar slot values (i.e. affordable or cheaper to cheap) are recognized purely by their position in the original vector space. Thus, it is essential that deployed word vectors are specialized for semantic similarity, as distributional effects which keep antonymous words' vectors together can be very detrimental to DST performance (e.g., by matching northern to south or inexpensive to expensive).

The Multilingual WOZ 2.0 Dataset  Our DST evaluation is based on the WOZ 2.0 dataset introduced by Wen et al. (2017) and Mrkšić et al. (2017). This dataset is based on the ontology used for the 2nd DST Challenge (DSTC2) (Henderson et al., 2014a). It consists of 1,200 Wizard-of-Oz (Fraser and Gilbert, 1991) dialogues in which Amazon Mechanical Turk users assumed the role of the dialogue system or the caller looking for restaurants in Cambridge, UK. Since users typed instead of using speech and interacted with intelligent assistants, the language they used was more sophisticated than in case of DSTC2, where users would quickly adapt to the system's inability to cope with complex queries. For our experiments, the ontology and 1,200 dialogues were translated to Italian and German through gengo.com, a web-based human translation platform.

6.2 DST Experiments

The principal evaluation metric in our DST experiments is the joint goal accuracy, which represents the proportion of test set dialogue turns where all the search constraints expressed up to that point in the conversation were decoded correctly. Our DST experiments investigate two propositions:

1. Intrinsic vs. Downstream Evaluation  If mono- and cross-lingual semantic specialization improves the semantic content of word vector collections according to intrinsic evaluation, we would expect the NBT model to perform higher-quality belief tracking when such improved vectors are deployed. We investigate the difference in DST performance for English, German and Italian when the NBT model employs the following word vector collections: 1) distributional word vectors; 2) monolingual semantically specialized vectors; and 3) monolingual subspaces of the cross-lingual semantically specialized EN-DE-IT-RU vectors. For each language, we also compare to the NBT performance achieved using
the five state-of-the-art bilingual vector spaces we compared to in Section 5.3.

2. Training a Multilingual DST Model  The values expressed by the domain ontology (e.g., cheap, north, Thai, etc.) are language independent. If we assume common semantic grounding across languages, we can decouple the ontologies from the dialogue corpora and use a single ontology (i.e. its values' vector representations) across all languages. Since we know that high-performing English DST is attainable, we will ground the Italian and German ontologies (i.e. all slot-value pairs) to the original English ontology. The use of a single ontology coupled with cross-lingual vectors then allows us to combine the training data for multiple languages and train a single NBT model capable of performing belief tracking across all three languages at once. Given a high-quality cross-lingual vector space, combining the languages effectively increases the training set size and should therefore lead to improved performance across all languages.

6.3 Results and Discussion

The DST performance of the NBT-CNN model on English, German and Italian WOZ 2.0 datasets is shown in Table 8. The first five rows show the performance when the model employs the five baseline vector spaces. The subsequent three rows show the performance of: a) distributional vector spaces; b) their monolingual specialization; and c) their EN-DE-IT-RU cross-lingual specialization. The last row shows the performance of the multilingual DST model trained using ontology grounding, where the training data of all three languages was combined and used to train an improved model. Figure 2 investigates the usefulness of ontology grounding for bootstrapping DST models for new languages with less data. The two figures display the Italian / German performance of models trained using different proportions of the in-language training dataset. The top-performing dash-dotted curve shows the performance of the model trained using the language-specific dialogues and all of the English training data.

Word Vector Space                         EN     IT     DE
EN-IT/EN-DE (Mikolov et al., 2013a)       78.2   71.1   50.5
EN-IT/EN-DE (Hermann et al., 2014a)       71.7   69.3   44.7
EN-IT/EN-DE (Gouws et al., 2015)          75.0   68.4   45.4
EN-IT/EN-DE (Vulić et al., 2016a)         81.6   71.8   50.5
EN-IT/EN-DE (Vulić et al., 2016)          72.3   69.0   38.2
Monolingual Distributional Vectors        77.6   71.2   46.6
A-R: Monolingual Specialization           80.9   72.7   52.4
A-R: Cross-Lingual Specialization         80.3   75.3   55.7
+ English Ontology Grounding              82.8   77.1   57.7

Table 8: NBT model accuracy across the three languages. Each figure shows the performance of the model trained using the subspace of the given vector space corresponding to the target language. For the English baseline figures, we show the stronger of the EN-IT / EN-DE figures.

[Figure 2 plots: "Bootstrapping Italian DST Models" (left) and "Bootstrapping German DST Models" (right). x-axis: number of training dialogues (100 to 600); y-axis: joint goal accuracy (0 to 80). Curves: Distributional Vectors; + Monolingual Cons.; ++ Cross-Lingual Cons.; +++ 100% English data.]

Figure 2: Joint goal accuracy of the NBT-CNN model for Italian (left) and German (right) WOZ 2.0 test sets as a function of the number of in-language dialogues used for training.
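The joint goal accuracy reported in Table 8 and Figure 2 can be computed in a few lines. The sketch below is illustrative only: it assumes that every dialogue turn comes with a predicted and a gold dialogue state stored as slot-to-value dictionaries covering all constraints expressed so far, and the function name and data layout are hypothetical rather than taken from the NBT codebase.

```python
def joint_goal_accuracy(dialogues):
    """Proportion of turns whose full predicted state equals the gold state.

    `dialogues` is a list of dialogues; each dialogue is a list of
    (predicted_state, gold_state) pairs, one per turn. A state is a dict
    mapping slot -> value for every constraint expressed up to that turn
    (an assumed layout, chosen for illustration).
    """
    correct = 0
    total = 0
    for turns in dialogues:
        for predicted, gold in turns:
            total += 1
            # A turn counts only if every slot-value pair matches.
            if predicted == gold:
                correct += 1
    return correct / total if total else 0.0


# Gold states follow Figure 1; the predictions are made up for the example.
example_dialogue = [
    ({"price": "expensive"},
     {"price": "expensive"}),
    ({"price": "expensive", "food": "korean"},  # 'area' was missed
     {"price": "expensive", "area": "centre", "food": "korean"}),
    ({"price": "expensive", "area": "centre", "food": "japanese"},
     {"price": "expensive", "area": "centre", "food": "japanese"}),
]

if __name__ == "__main__":
    print(joint_goal_accuracy([example_dialogue]))  # 2 of 3 turns correct
```

Because a single missed slot makes the whole turn count as wrong, the metric is stricter than per-slot accuracy: the second turn above gets two of three slots right but contributes nothing to the score.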
The results in Table 8 show that both types of specialization improve over DST performance achieved using the distributional vectors or the five baseline bilingual spaces. Interestingly, the bilingual vectors of Vulić and Korhonen (2016a) outperform ours for EN (but not for IT and DE) despite their weaker SimLex performance, showing that intrinsic evaluation does not capture all relevant aspects pertaining to word vectors' usability for downstream tasks.

The multilingual DST model trained using ontology grounding offers substantial performance improvements, with particularly large gains in the low-data scenario investigated in Figure 2 (dash-dotted purple line). This figure also shows that the difference in performance between our mono- and cross-lingual vectors is not very substantial. Again, the large disparity in SimLex scores induced only minor improvements in DST performance.

In summary, our results show that: a) semantically specialized vectors benefit DST performance; b) large gains in SimLex scores do not always induce large downstream gains; and c) high-quality cross-lingual spaces facilitate transfer learning between languages and offer an effective method for bootstrapping DST models for lower-resource languages.

Finally, German DST performance is substantially weaker than both English and Italian, corroborating our intuition that linguistic phenomena such as cases and compounding make German DST very challenging. We release these datasets in the hope that multilingual DST evaluation can give the NLP community a tool for evaluating downstream performance of vector spaces for morphologically richer languages.

7 Conclusion

We have presented a novel ATTRACT-REPEL method for injecting linguistic constraints into word vector space representations. The procedure semantically specializes word vectors by jointly injecting mono- and cross-lingual synonymy and antonymy constraints, creating unified cross-lingual vector spaces which achieve the state-of-the-art performance on the well-established SimLex-999 dataset and its multilingual variants. Next, we have shown that ATTRACT-REPEL can induce high-quality vectors for lower-resource languages by tying them into bilingual vector spaces with high-resource ones. We also demonstrated that the substantial gains in intrinsic evaluation translate to gains in the downstream task of dialogue state tracking (DST), for which we release two novel non-English datasets (in German and Italian). Finally, we have shown that our semantically rich cross-lingual vectors facilitate language transfer in DST, providing an effective method for bootstrapping belief tracking models for new languages.

Further Work  Our results, especially with DST, emphasize the need for improving vector space models for morphologically rich languages. Moreover, our intrinsic and task-based experiments exposed the discrepancies between the conclusions that can be drawn from these two types of evaluation. We consider these to be major directions for future work.

Acknowledgements

The authors would like to thank Anders Johannsen for his help with extracting BabelNet constraints. We would also like to thank our action editor Sebastian Padó and the anonymous TACL reviewers for their constructive feedback. Ivan Vulić, Roi Reichart and Anna Korhonen are supported by the ERC Consolidator Grant LEXICAL (number 648909). Roi Reichart is also supported by the Intel-ICRI grant: Hybrid Models for Minimally Supervised Information Extraction from Conversations.

References

Nikolaos Aletras and Mark Stevenson. 2015. A hybrid distributional and knowledge-based model of lexical semantics. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, *SEM, pages 20–29.

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah Smith. 2016. Many languages, one parser. Transactions of the ACL, 4:431–444.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of ACL, pages 86–90.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of ACL, pages 809–815.

Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.
Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Knowledge-powered deep learning for word embedding. In Proceedings of ECML-PKDD, pages 132–148.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Sarath A. P. Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh M. Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Proceedings of NIPS, pages 1853–1861.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP, pages 740–750.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. 2015. Trans-gram, fast cross-lingual word embeddings. In Proceedings of EMNLP, pages 1109–1113.

Dmitry Davidov and Ari Rappoport. 2006. Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words. In Proceedings of ACL, pages 297–304.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard M. Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of ACL, pages 1370–1380.

Paramveer S. Dhillon, Dean P. Foster, and Lyle H. Ungar. 2015. Eigenwords: Spectral word embeddings. Journal of Machine Learning Research, 16:3035–3078.

Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2015. Improving zero-shot learning by mitigating the hubness problem. In Proceedings of ICLR: Workshop Papers.

John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2016. Learning crosslingual word embeddings without bilingual corpora. In Proceedings of EMNLP, pages 1285–1295.

Maud Ehrmann, Francesco Cecconi, Daniele Vannella, John Philip Mccrae, Philipp Cimiano, and Roberto Navigli. 2014. Representing multilingual data as linked data: The case of BabelNet 2.0. In Proceedings of LREC, pages 401–408.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL, pages 462–471.

Manaal Faruqui and Chris Dyer. 2015. Non-distributional word vector representations. In Proceedings of ACL, pages 464–469.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL, pages 1606–1615.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116–131.

Norman M. Fraser and G. Nigel Gilbert. 1991. Simulating speech systems. Computer Speech and Language, 5(1):81–99.

Juri Ganitkevitch and Chris Callison-Burch. 2014. The Multilingual Paraphrase Database. In Proceedings of LREC, pages 4276–4283.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of NAACL, pages 758–764.

Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A large-scale evaluation set of verb similarity. In Proceedings of EMNLP, pages 2173–2182.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS, pages 249–256.

Yoav Goldberg. 2015. A primer on neural network models for natural language processing. CoRR, abs/1510.00726.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast bilingual distributed representations without word alignments. In Proceedings of ICML, pages 748–756.

Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Revisiting embedding features for simple semi-supervised learning. In Proceedings of EMNLP, pages 110–120.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proceedings of ACL, pages 1234–1244.

Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014a. The Second Dialog State Tracking Challenge. In Proceedings of SIGDIAL, pages 263–272.

Matthew Henderson, Blaise Thomson, and Steve Young. 2014b. Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. In Proceedings of IEEE SLT, pages 360–365.

Matthew Henderson, Blaise Thomson, and Steve Young. 2014c. Word-based dialog state tracking with recurrent neural networks. In Proceedings of SIGDIAL, pages 292–299.
Karl Moritz Hermann and Phil Blunsom. 2014a. Multilingual Distributed Representations without Word Alignment. In Proceedings of ICLR.

Karl Moritz Hermann and Phil Blunsom. 2014b. Multilingual models for compositional distributed semantics. In Proceedings of ACL, pages 58–68.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Kejun Huang, Matt Gardner, Evangelos Papalexakis, Christos Faloutsos, Nikos Sidiropoulos, Tom Mitchell, Partha P. Talukdar, and Xiao Fu. 2015. Translation invariant word embeddings. In Proceedings of EMNLP, pages 1084–1088.

Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A Neural Network for Factoid Question Answering over Paragraphs. In Proceedings of EMNLP, pages 633–644.

Sujay Kumar Jauhar, Chris Dyer, and Eduard H. Hovy. 2015. Ontologically grounded multi-sense representation learning for semantic vector space models. In Proceedings of NAACL, pages 683–693.

Anders Johannsen, Héctor Martínez Alonso, and Anders Søgaard. 2015. Any-language frame-semantic parsing. In Proceedings of EMNLP, pages 2062–2066.

Douwe Kiela, Felix Hill, and Stephen Clark. 2015. Specializing word embeddings for similarity or relatedness. In Proceedings of EMNLP, pages 2044–2048.

Joo-Kyung Kim, Marie-Catherine de Marneffe, and Eric Fosler-Lussier. 2016a. Adjusting word embeddings with semantic intensity orders. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 62–69.

Joo-Kyung Kim, Gokhan Tur, Asli Celikyilmaz, Bin Cao, and Ye-Yi Wang. 2016b. Intent detection using semantically enriched word embeddings. In Proceedings of SLT.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING, pages 1459–1474.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5.

Andrey Kutuzov and Igor Andreev. 2015. Texts in, meaning out: neural language models in semantic similarity task for Russian. In Proceedings of DIALOG.

Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In Proceedings of ACL, pages 270–280.

Ira Leviant and Roi Reichart. 2015. Separated by an Un-common Language: Towards Judgment Language Informed Vector Space Modeling. arXiv preprint: 1508.00106.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of ACL, pages 302–308.

Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. 2015. Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of ACL, pages 1501–1511.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for NLP, pages 151–159.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint, CoRR, abs/1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, pages 39–41.

Bhaskar Mitra, Eric T. Nalisnick, Nick Craswell, and Rich Caruana. 2016. A dual embedding space model for document ranking. CoRR, abs/1602.01137.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2015. Multi-domain dialog state tracking using recurrent neural networks. In Proceedings of ACL, pages 794–799.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of NAACL, pages 142–148.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Tsung-Hsien Wen, and Steve Young. 2017. Neural Belief Tracker: Data-driven dialogue state tracking. In Proceedings of ACL.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of ACL, pages 454–459.

Masataka Ono, Makoto Miwa, and Yutaka Sasaki. 2015. Word Embedding-based Antonym Detection using Thesauri and Distributional Information. In Proceedings of NAACL, pages 984–989.
Dominique Osborne, Shashi Narayan, and Shay Cohen. 2016. Encoding prior knowledge with eigenword embeddings. Transactions of the ACL, 4:417–430.

Diarmuid Ó Séaghdha and Anna Korhonen. 2014. Probabilistic distributional semantics. Computational Linguistics, 40(3):587–631.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

Pushpendre Rastogi, Benjamin Van Durme, and Raman Arora. 2015. Multiview LSA: Representation learning via generalized CCA. In Proceedings of NAACL, pages 556–566.

Gábor Recski, Eszter Iklódi, Katalin Pajkossy, and Andras Kornai. 2016. Measuring Semantic Similarity of Words Using Concept Networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 193–200.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about Entailment with Neural Attention. In Proceedings of ICLR.

Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of ACL, pages 1793–1803.

Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of CoNLL, pages 258–267.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing with compositional vector grammars. In Proceedings of ACL, pages 455–465.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, pages 1631–1642.

Anders Søgaard, Željko Agić, Héctor Martínez Alonso, Barbara Plank, Bernd Bohnet, and Anders Johannsen. 2015. Inverted indexing for cross-lingual NLP. In Proceedings of ACL, pages 1713–1722.

Hubert Soyer, Pontus Stenetorp, and Akiko Aizawa. 2015. Leveraging monolingual data for crosslingual compositional word representations. In Proceedings of ICLR.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL, pages 384–394.

Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In Proceedings of ACL, pages 1661–1670.

Ivan Vulić and Anna Korhonen. 2016a. Is "universal syntax" universally useful for learning distributed representations? In Proceedings of ACL, pages 518–524.

Ivan Vulić and Anna Korhonen. 2016b. On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of ACL, pages 247–257.

Ivan Vulić and Marie-Francine Moens. 2015. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of SIGIR, pages 363–372.

Ivan Vulić and Marie-Francine Moens. 2016. Bilingual distributed word representations from document-aligned comparable data. Journal of Artificial Intelligence Research, 55:953–994.

Ivan Vulić, Daniela Gerz, Douwe Kiela, Felix Hill, and Anna Korhonen. 2016. Hyperlex: A large-scale evaluation of graded lexical entailment. CoRR, abs/1608.02117.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL, pages 437–449.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the ACL, 3:345–358.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Charagram: Embedding words and sentences via character n-grams. In Proceedings of EMNLP, pages 1504–1515.

Chang Xu, Yalong Bai, Jiang Bian, Bin Gao, Gang Wang, Xiaoguang Liu, and Tie-Yan Liu. 2014. RC-NET: A general framework for incorporating knowledge into word representations. In Proceedings of CIKM, pages 1219–1228.

Wen-Tau Yih, Geoffrey Zweig, and John C. Platt. 2012. Polarity inducing Latent Semantic Analysis. In Proceedings of ACL, pages 1212–1222.

Steve J. Young, Milica Gašić, Blaise Thomson, and Jason D. Williams. 2013. POMDP-Based Statistical Spoken Dialog Systems: A Review. Proceedings of the IEEE, 101(5):1160–1179.

Steve Young. 2010. Still talking to machines (cognitively speaking). In Proceedings of INTERSPEECH, pages 1–10.

Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of ACL, pages 545–550.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of EMNLP, pages 1393–1398.