Transactions of the Association for Computational Linguistics, vol. 4, pp. 17–30, 2016. Action Editor: Chris Callison-Burch.

Transactions of the Association for Computational Linguistics, vol. 4, pp. 17–30, 2016. Action Editor: Chris Callison-Burch.
Submission batch: 9/2015; revised 12/2015; revised 1/2016; Published 2/2016.

2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.

c
(cid:13)

LearningtoUnderstandPhrasesbyEmbeddingtheDictionaryFelixHillComputerLaboratoryUniversityofCambridgefelix.hill@cl.cam.ac.ukKyunghyunCho∗CourantInstituteofMathematicalSciencesandCentreforDataScienceNewYorkUniversitykyunghyun.cho@nyu.eduAnnaKorhonenDepartmentofTheoreticalandAppliedLinguisticsUniversityofCambridgealk23@cam.ac.ukYoshuaBengioCIFARSeniorFellowUniversit´edeMontr´ealyoshua.bengio@umontreal.caAbstractDistributionalmodelsthatlearnrichseman-ticwordrepresentationsareasuccessstoryofrecentNLPresearch.However,develop-ingmodelsthatlearnusefulrepresentationsofphrasesandsentenceshasprovedfarharder.Weproposeusingthedefinitionsfoundineverydaydictionariesasameansofbridg-ingthisgapbetweenlexicalandphrasalse-mantics.Neurallanguageembeddingmod-elscanbeeffectivelytrainedtomapdictio-narydefinitions(phrases)à(lexical)repre-sentationsofthewordsdefinedbythosedefi-nitions.Wepresenttwoapplicationsofthesearchitectures:reversedictionariesthatreturnthenameofaconceptgivenadefinitionordescriptionandgeneral-knowledgecrosswordquestionanswerers.Onbothtasks,neurallan-guageembeddingmodelstrainedondefini-tionsfromahandfuloffreely-availablelex-icalresourcesperformaswellorbetterthanexistingcommercialsystemsthatrelyonsig-nificanttask-specificengineering.There-sultshighlighttheeffectivenessofbothneu-ralembeddingarchitecturesanddefinition-basedtrainingfordevelopingmodelsthatun-derstandphrasesandsentences.1IntroductionMuchrecentresearchincomputationalseman-ticshasfocussedonlearningrepresentationsofarbitrary-lengthphrasesandsentences.Thistaskischallengingpartlybecausethereisnoobviousgoldstandardofphrasalrepresentationthatcouldbeused∗WorkmainlydoneattheUniversityofMontreal.intrainingandevaluation.Consequently,itisdiffi-culttodesignapproachesthatcouldlearnfromsuchagoldstandard,andalsohardtoevaluateorcomparedifferentmodels.Inthiswork,weusedictionarydefinitionstoad-dressthisissue.Thecomposedmeaningofthewordsinadictionarydefinition(atall,long-necked,spottedruminantofAfrica)shouldcorrespondtothemeaningofthewordtheydefine(giraffe).Thisbridgebetweenlexicalandphrasalsemanticsisuse-fulbecausehighqualityvectorrepresentationsofsinglewordscanbeusedasatargetwhenlearningtocombinethewordsintoacoherentphrasalrepre-sentation.Thisapproachstillrequiresamodelcapableoflearningtomapbetweenarbitrary-lengthphrasesandfixed-lengthcontinuous-valuedwordvectors.Forthispurposeweexperimentwithtwobroadclassesofneurallanguagemodels(NLMs):Recur-rentNeuralNetworks(RNNs),whichnaturallyen-codetheorderofinputwords,andsimpler(feed-forward)bag-of-words(BOW)embeddingmodels.PriortotrainingtheseNLMs,welearntargetlexi-calrepresentationsbytrainingtheWord2Vecsoft-ware(Mikolovetal.,2013)onbillionsofwordsofrawtext.Wedemonstratetheusefulnessofourapproachbybuildingandreleasingtwoapplications.Thefirstisareversedictionaryorconceptfinder:asystemthatreturnswordsbasedonuserdescriptionsordefini-tions(ZockandBilac,2004).Reversedictionariesareusedbycopywriters,novelists,translatorsandotherprofessionalwriterstofindwordsfornotionsorideasthatmightbeonthetipoftheirtongue.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
8
0
1
5
6
7
3
6
4

/

/
t

je

un
c
_
un
_
0
0
0
8
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

18

Forinstance,atravel-writermightlooktoenhanceherprosebysearchingforexamplesofacountrythatpeopleassociatewithwarmweatheroranac-tivitythatismentallyorphysicallydemanding.WeshowthatanNLM-basedreversedictionarytrainedononlyahandfulofdictionariesidentifiesnoveldef-initionsandconceptdescriptionscomparablyorbet-terthancommercialsystems,whichrelyonsignif-icanttask-specificengineeringandaccesstomuchmoredictionarydata.Moreover,byexploitingmod-elsthatlearnbilingualwordrepresentations(Vulicetal.,2011;Klementievetal.,2012;HermannandBlunsom,2013;Gouwsetal.,2014),weshowthattheNLMapproachcanbeeasilyextendedtopro-duceapotentiallyusefulcross-lingualreversedic-tionary.Thesecondapplicationofourmodelsisasageneral-knowledgecrosswordquestionanswerer.WhentrainedonbothdictionarydefinitionsandtheopeningsentencesofWikipediaarticles,NLMspro-duceplausibleanswersto(non-cryptic)crosswordclues,eventhosethatapparentlyrequiredetailedworldknowledge.BothBOWandRNNmodelscanoutperformbespokecommercialcrosswordsolvers,particularlywhencluescontainagreaternumberofwords.QualitativeanalysisrevealsthatNLMscanlearntorelateconceptsthatarenotdirectlycon-nectedinthetrainingdataandcanthusgeneralisewelltounseeninput.Tofacilitatefurtherresearch,allofourcode,trainingandevaluationsets(togetherwithasystemdemo)arepublishedonlinewiththispaper.12NeuralLanguageModelArchitecturesThefirstmodelweapplytothedictionary-basedlearningtaskisarecurrentneuralnetwork(RNN).RNNsoperateonvariable-lengthsequencesofin-puts;inourcase,naturallanguagedefinitions,de-scriptionsorsentences.RNNs(withLSTMs)haveachievedstate-of-the-artperformanceinlanguagemodelling(Mikolovetal.,2010),imagecaptiongeneration(Kirosetal.,2015)andapproachstate-of-the-artperformanceinmachinetranslation(Bah-danauetal.,2015).Duringtraining,theinputtotheRNNisadic-tionarydefinitionorsentencefromanencyclopedia.1https://www.cl.cam.ac.uk/˜fh295/Theobjectiveofthemodelistomapthesedefin-ingphrasesorsentencestoanembeddingofthewordthatthedefinitiondefines.ThetargetwordembeddingsarelearnedindependentlyoftheRNNweights,usingtheWord2Vecsoftware(Mikolovetal.,2013).Thesetofallwordsinthetrainingdataconsti-tutesthevocabularyoftheRNN.Foreachwordinthisvocabularywerandomlyinitialiseareal-valuedvector(inputembedding)ofmodelparameters.TheRNN‘reads’thefirstwordintheinputbyapplyinganon-linearprojectionofitsembeddingv1parame-terisedbyinputweightmatrixWandb,avectorofbiases.A1=φ(Wv1+b)yieldingthefirstinternalactivationstateA1.Inourimplementation,weuseφ(X)=tanh(X),thoughintheoryφcanbeanydifferentiablenon-linearfunc-tion.Subsequentinternalactivations(aftertime-stept)arecomputedbyprojectingtheembeddingofthetthwordandusingthisinformationto‘update’theinternalactivationstate.At=φ(UAt−1+Wvt+b).Assuch,thevaluesofthefinalinternalactivationstateunitsANareaweightedfunctionofallinputwordembeddings,andconstitutea‘summary’oftheinformationinthesentence.2.1LongShortTermMemoryAknownlimitationwhentrainingRNNstoreadlan-guageusinggradientdescentisthattheerrorsig-nal(gradient)onthetrainingexampleseithervan-ishesorexplodesasthenumberoftimesteps(sen-tencelength)increases(Bengioetal.,1994).Conse-quently,afterreadinglongersentencesthefinalin-ternalactivationANtypicallyretainsusefulinfor-mationaboutthemostrecentlyread(sentence-final)mots,butcanneglectimportantinformationnearthestartoftheinputsentence.LSTMs(HochreiterandSchmidhuber,1997)weredesignedtomitigatethislong-termdependencyproblem.Ateachtimestept,inplaceofthesingleinter-nallayerofunitsA,theLSTMRNNcomputessixinternallayersiw,gi,gf,go,handm.Thefirst,gw,representsthecoreinformationpassedtotheLSTM

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
8
0
1
5
6
7
3
6
4

/

/
t

je

un
c
_
un
_
0
0
0
8
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

19

unitbythelatestinputwordatt.Itiscomputedasasimplelinearprojectionoftheinputembeddingvt(byinputweightsWw)andtheoutputstateoftheLSTMattheprevioustimestepht−1(byupdateweightsUw):iwt=Wwvt+Uwht−1+bwThelayersgi,gfandgoarecomputedasweightedsigmoidfunctionsoftheinputembeddings,againparameterisedbylayer-specificweightmatricesWandU:gst=11+exp(−(Wsvt+Usht−1+bs))wheresstandsforoneofi,foro.Thesevectorstakevalueson[0,1]andareoftenreferredtoasgat-ingactivations.Finally,theinternalmemorystate,mtandnewoutputstateht,oftheLSTMattarecomputedasmt=iwt(cid:12)git+mt−1(cid:12)gftht=got(cid:12)φ(mt),où(cid:12)indicateselementwisevectormultiplica-tionandφis,asbefore,somenon-linearfunction(weusetanh).Ainsi,gideterminestowhatextentthenewinputwordisconsideredateachtimestep,gfdeterminestowhatextenttheexistingstateoftheinternalmemoryisretainedorforgottenincom-putingthenewinternalmemory,andgodetermineshowmuchthismemoryisconsideredwhencomput-ingtheoutputstateatt.Thesentence-finalmemorystateoftheLSTM,mN,a‘summary’ofalltheinformationinthesen-tence,isthenprojectedviaanextranon-linearpro-jection(parameterisedbyafurtherweightmatrix)toatargetembeddingspace.Thislayerenablesthetarget(defined)wordembeddingspacetotakeadif-ferentdimensiontotheactivationlayersoftheRNN,andinprincipleenablesamorecomplexdefinition-readingfunctiontobelearned.2.2Bag-of-WordsNLMsWeimplementasimplerlinearbag-of-words(BOW)architectureforencodingthedefinitionphrases.AswiththeRNN,thisarchitecturelearnsanembeddingviforeachwordinthemodelvocabulary,togetherwithasinglematrixofinputprojectionweightsW.TheBOWmodelsimplymapsaninputdefinitionwithwordembeddingsv1vntothesumoftheprojectedembeddingsPni=1Wvi.ThismodelcanalsobeconsideredaspecialcaseofanRNNinwhichtheupdatefunctionUandnonlinearityφareboththeidentity,sothat‘reading’thenextwordintheinputphraseupdatesthecurrentrepresentationmoresimply:At=At−1+Wvt.2.3Pre-trainedInputRepresentationsWeexperimentwithvariantsofthesemodelsinwhichtheinputdefinitionembeddingsarepre-learnedandfixed(ratherthanrandomly-initialisedandupdated)duringtraining.Thereareseveralpo-tentialadvantagestotakingthisapproach.First,thewordembeddingsaretrainedonmassivecorporaandmaythereforeintroduceadditionallinguisticorconceptualknowledgetothemodels.Second,attesttime,themodelswillhavealargereffectivevocab-ulary,sincethepre-trainedwordembeddingstypi-callyspanalargervocabularythantheunionofalldictionarydefinitionsusedtotrainthemodel.Fi-nally,themodelswillthenmaptoandfromthesamespaceofembeddings(theembeddingspacewillbeclosedundertheoperationofthemodel),socon-ceivablycouldbemoreeasilyappliedasageneral-purpose‘compositionengine’.2.4TrainingObjectiveWetrainallneurallanguagemodelsMtomaptheinputdefinitionphrasescdefiningwordctoalo-cationclosetothethepre-trainedembeddingvcofc.Weexperimentwithtwodifferentcostfunctionsfortheword-phrasepair(c,sc)fromthetrainingdata.ThefirstissimplythecosinedistancebetweenM(sc)andvc.Thesecondistheranklossmax(0,m−cos(M.(sc),vc)−cos(M.(sc),vr))wherevristheembeddingofarandomly-selectedwordfromthevocabularyotherthanc.Thislossfunctionwasusedforlanguagemodels,forexample,dans(Huangetal.,2012).Inallexperimentsweapplyamarginm=0.1,whichhasbeenshowntoworkwellonword-retrievaltasks(Bordesetal.,2015).

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
8
0
1
5
6
7
3
6
4

/

/
t

je

un
c
_
un
_
0
0
0
8
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

20

2.5ImplementationDetailsSincetrainingonthedictionarydatatook6-10hours,wedidnotconductahyper-parametersearchonanyvalidationsetsoverthespaceofpossiblemodelconfigurationssuchasembeddingdimension,orsizeofhiddenlayers.Instead,wechosetheseparameterstobeasstandardaspossiblebasedonpreviousresearch.Forfaircomparison,anyaspectsofmodeldesignthatarenotspecifictoaparticu-larclassofmodelwerekeptconstantacrossexperi-ments.Thepre-trainedwordembeddingsusedinallofourmodels(eitherasinputortarget)werelearnedbyacontinuousbag-of-words(CBOW)modelusingtheWord2Vecsoftwareonapproximately8billionwordsofrunningtext.2Whentrainingsuchmodelsonmassivecorpora,alargeembeddinglengthofupto700havebeenshowntoyieldbestperformance(seee.g.(Faruquietal.,2014)).Thepre-trainedem-beddingsusedinourmodelswereoflength500,asacompromisebetweenqualityandmemorycon-straints.Incaseswherethewordembeddingsarelearnedduringtrainingonthedictionaryobjective,wemaketheseembeddingsshorter(256),sincetheymustbelearnedfrommuchlesslanguagedata.IntheRNNmodels,andateachtimestepeachofthefourLSTMRNNinternallayers(gatingandactiva-tionstates)hadlength512–anotherstandardchoice(seee.g.(Choetal.,2014)).Thefinalhiddenstatewasmappedlinearlytolength500,thedimensionofthetargetembedding.IntheBOWmodels,theprojectionmatrixprojectsinputembeddings(eitherlearned,oflength256,orpre-trained,oflength500)tolength500forsumming.AllmodelswereimplementedwithTheano(Bergstraetal.,2010)andtrainedwithminibatchSGDonGPUs.Thebatchsizewasfixedat16andthelearningratewascontrolledbyadadelta(Zeiler,2012).2TheWord2Vecembeddingmodelsarewellknown;furtherdetailscanbefoundathttps://code.google.com/p/word2vec/Thetrainingdataforthispre-trainingwascom-piledfromvariousonlinetextsourcesusingthescriptdemo-train-big-model-v1.shfromthesamepage.3ReverseDictionariesThemostimmediateapplicationofourtrainedmod-elsisasareversedictionaryorconceptfinder.Itissimpletolookupadefinitioninadictionarygivenaword,butprofessionalwritersoftenalsore-quiresuitablewordsforagivenidea,conceptordefinition.3Reversedictionariessatisfythisneedbyreturningcandidatewordsgivenaphrase,de-scriptionordefinition.Forinstance,whenqueriedwiththephraseanactivitythatrequiresstrengthanddetermination,theOneLook.comreversedictio-naryreturnstheconceptsexerciseandwork.OurtrainedRNNmodelcanperformasimilarfunc-tion,simplybymappingaphrasetoapointinthetarget(Word2Vec)embeddingspace,andreturningthewordscorrespondingtotheembeddingsthatareclosesttothatpoint.Severalotheracademicstudieshaveproposedreversedictionarymodels.Thesegenerallyrelyoncommontechniquesfrominformationretrieval,comparingdefinitionsintheirinternaldatabasetotheinputquery,andreturningthewordwhosedef-initionis‘closest’tothatquery(Bilacetal.,2003;Bilacetal.,2004;ZockandBilac,2004).Proxim-ityisquantifieddifferentlyineachcase,butisgen-erallyafunctionofhand-engineeredfeaturesofthetwosentences.Forinstance,Shawetal.(2013)pro-poseamethodinwhichthecandidatesforagiveninputqueryareallwordsinthemodel’sdatabasewhosedefinitionscontainoneormorewordsfromthequery.Thiscandidatelististhenrankedaccord-ingtoaquery-definitionsimilaritymetricbasedonthehypernymandhyponymrelationsinWordNet,featurescommonlyusedinIRsuchastf-idfandaparser.Thereare,inaddition,atleasttwocommercialonlinereversedictionaryapplications,whosear-chitectureisproprietaryknowledge.ThefirstistheDictionary.comreversedictionary4,whichre-trievescandidatewordsfromtheDictionary.comdictionarybasedonuserdefinitionsordescrip-tions.ThesecondisOneLook.com,whosealgo-rithmsearches1061indexeddictionaries,including3Seethetestimonyfromprofessionalwritersathttp://www.onelook.com/?c=awards4Availableathttp://dictionary.reference.com/reverse/

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
8
0
1
5
6
7
3
6
4

/

/
t

je

un
c
_
un
_
0
0
0
8
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

21

allmajorfreely-availableonlinedictionariesandre-sourcessuchasWikipediaandWordNet.3.1DataCollectionandTrainingTocompileabankofdictionarydefinitionsfortrain-ingthemodel,westartedwithallwordsinthetar-getembeddingspace.Foreachofthesewords,weextracteddictionary-styledefinitionsfromfiveelec-tronicresources:Wordnet,TheAmericanHeritageDictionary,TheCollaborativeInternationalDictio-naryofEnglish,WiktionaryandWebster’s.Wechosethesefivedictionariesbecausetheyarefreely-availableviatheWordNikAPI,5butintheoryanydictionarycouldbechosen.Mostwordsinourtrain-ingdatahadmultipledefinitions.Foreachwordwwithdefinitions{d1dn}weincludedallpairs(w,d1)…(w,dn)astrainingexamples.Toallowmodelsaccesstomorefactualknowl-edgethanmightbepresentinadictionary(forin-stance,informationaboutspecificentities,placesorpeople,wesupplementedthistrainingdatawithin-formationextractedfromSimpleWikipedia.6Foreverywordinthemodel’stargetembeddingspacethatisalsothetitleofaWikipediaarticle,wetreatthesentencesinthefirstparagraphofthearticleasiftheywere(independent)definitionsofthatword.WhenawordinWikipediaalsooccursinone(ormore)ofthefivetrainingdictionaries,wesimplyaddthesepseudo-definitionstothetrainingsetofdefinitionsfortheword.CombiningWikipediaanddictionariesinthiswayresultedin≈900,000word-’definition’pairsof≈100,000uniquewords.Toexploretheeffectofthequantityoftrainingdataontheperformanceofthemodels,wealsotrainedmodelsonsubsetsofthisdata.Thefirstsub-setcomprisedonlydefinitionsfromWordnet(ap-proximately150,000definitionsof75,000words).ThesecondsubsetcomprisedonlywordsinWord-netandtheirfirstdefinitions(approximately75,000word,definitionpairs).7.ForallvariantsofRNNandBOWmodels,cependant,reducingthetrainingdatainthiswayresultedinaclearreductioninper-5Seehttp://developer.wordnik.com6https://simple.wikipedia.org/wiki/Main_Page7Aswithotherdictionaries,thefirstdefinitioninWordNetgenerallycorrespondstothemosttypicalorcommonsenseofaword.formanceonalltasks.Forbrevity,wethereforedonotpresenttheseresultsinwhatfollows.3.2ComparisonsAsabaseline,wealsoimplementedtwoentirelyunsupervisedmethodsusingtheneural(Word2Vec)wordembeddingsfromthetargetwordspace.Inthefirst(W2Vadd),wecomposetheembeddingsforeachwordintheinputquerybypointwiseaddition,andreturnascandidatesthenearestwordembed-dingstotheresultingcomposedvector.8Thesec-ondbaseline,(W2Vmult),isidenticalexceptthattheembeddingsarecomposedbyelementwisemul-tiplication.Bothmethodsareestablishedwaysofbuildingphraserepresentationsfromwordembed-dings(MitchellandLapata,2010).Noneofthemodelsorevaluationsfrompreviousacademicresearchonreversedictionariesispub-liclyavailable,sodirectcomparisonisnotpossi-ble.However,wedocompareperformancewiththecommercialsystems.TheDictionary.comsys-temreturnednocandidatesforover96%ofourin-putdefinitions.Wethereforeconductdetailedcom-parisonwithOneLook.com,whichisthefirstre-versedictionarytoolreturnedbyaGooglesearchandseemstobethemostpopularamongwriters.3.3ReverseDictionaryEvaluationToourknowledgetherearenoestablishedmeansofmeasuringreversedictionaryperformance.IntheonlypreviousacademicresearchonEnglishreversedictionariesthatweareawareof,evaluationwasconductedon300word-definitionpairswrittenbylexicographers(Shawetal.,2013).Sincethesearenotpubliclyavailablewedevelopednewevaluationsetsandmakethemfreelyavailableforfutureeval-uations.Theevaluationitemsareofthreetypes,designedtotestdifferentpropertiesofthemodels.Tocre-atetheseenevaluation,werandomlyselected500wordsfromtheWordNettrainingdata(seenbyallmodels),andthenrandomlyselectedadefinitionforeachword.Testingmodelsontheresulting500word-definitionpairsassessestheirabilitytorecallordecodepreviouslyencodedinformation.Forthe8Sinceweretrieveallanswersfromembeddingspacesbycosinesimilarity,additionofwordembeddingsisequivalenttotakingthemean.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
8
0
1
5
6
7
3
6
4

/

/
t

je

un
c
_
un
_
0
0
0
8
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

22

DictionarydefinitionsTestSetSeen(500WNdefs)Unseen(500WNdefs)Conceptdescriptions(200)Unsup.W2Vadd—923.04/.16163339.07/.30150modelsW2Vmult—1000.00/.0010*1000.00/.0027*OneLook0.89/.9167—18.5.38/.58153RNNcosine12.48/.7310322.41/.7011669.28/.54157RNNw2vcosine19.44/.7011119.44/.6912626.38/.66111RNNranking18.45/.6712824.43/.6910325.34/.66102NLMsRNNw2vranking54.32/.5615533.36/.6513730.33/.6977BOWcosine22.44/.6512919.43/.6910350.34/.6099BOWw2vcosine15.46/.7112414.46/.7110428.36/.6699BOWranking17.45/.6811522.42/.709532.35/.69101BOWw2vrankng55.32/.5615536.35/.6613838.33/.7285medianrankaccuracy@10/100rankvarianceTable1:Performanceofdifferentreversedictionarymodelsindifferentevaluationsettings.*Lowvarianceinmultmodelsisduetoconsistentlypoorscores,sonothighlighted.unseenevaluation,werandomlyselected500wordsfromWordNetandexcludedalldefinitionsofthesewordsfromthetrainingdataofallmodels.Finally,forafaircomparisonwithOneLook,whichhasboththeseenandunseenpairsinitsin-ternaldatabase,webuiltanewdatasetofconceptdescriptionsthatdonotappearinthetrainingdataforanymodel.Todoso,werandomlyselected200adjectives,nounsorverbsfromamongthetop3000mostfrequenttokensintheBritishNationalCor-pus(Leechetal.,1994)(butoutsidethetop100).WethenaskedtennativeEnglishspeakerstowriteasingle-sentence‘description’ofthesewords.Toensuretheresultingdescriptionsweregoodqual-ity,foreachdescriptionweaskedtwoparticipantswhodidnotproducethatdescriptiontolistanywordsthatfittedthedescription(uptoamaximumofthree).Ifthetargetwordwasnotproducedbyoneofthetwocheckers,theoriginalparticipantwasaskedtore-writethedescriptionuntilthevalidationwaspassed.9Theseconceptdescriptions,togetherwithotherevaluationsets,canbedownloadedfromourwebsiteforfuturecomparisons.Givenatestdescription,definition,orquestion,allmodelsproducearankingofpossiblewordan-swersbasedontheproximityoftheirrepresentationsoftheinputphraseandallpossibleoutputwords.Toquantifythequalityofagivenranking,were-portthreestatistics:themedianrankofthecorrect9Re-writingwasrequiredin6ofthe200cases.TestsetWordDescriptionDictionaryvalve”controlconsistingofamechanicaldefinitiondeviceforcontrollingfluidflow”Conceptprefer”whenyoulikeonethingdescriptionmorethananotherthing”Table2:Styledifferencebetweendictionarydefinitionsandconceptdescriptionsintheevaluation.answer(overthewholetestset,lowerbetter),theproportionoftrainingcasesinwhichthecorrectan-swerappearsinthetop10/100inthisranking(accu-racy@10/100-higherbetter)andthevarianceoftherankofthecorrectansweracrossthetestset(rankvariance-lowerbetter).3.4ResultsTable1showstheperformanceofthedifferentmod-elsinthethreeevaluationsettings.Oftheunsu-pervisedcompositionmodels,elementwiseadditionisclearlymoreeffectivethanmultiplication,whichalmostneverreturnsthecorrectwordasthenear-estneighbourofthecomposition.Overall,cependant,thesupervisedmodels(RNN,BOWandOneLook)clearlyoutperformthesebaselines.Theresultsindicateinterestingdifferencesbe-tweentheNLMsandtheOneLookdictionarysearchengine.TheSeen(WNfirst)definitionsinTable1occurinboththetrainingdatafortheNLMsandthelookupdatafortheOneLookmodel.ClearlytheOneLookalgorithmisbetterthanNLMsatretriev-

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
8
0
1
5
6
7
3
6
4

/

/
t

je

un
c
_
un
_
0
0
0
8
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

23

ingalreadyavailableinformation(returning89%ofcorrectwordsamongthetop-tencandidatesonthisset).Cependant,thisislikelytocomeatthecostofagreatermemoryfootprint,sincethemodelrequiresaccesstoitsdatabaseofdictionariesatquerytime.10TheperformanceoftheNLMembeddingmodelsonthe(unseen)conceptdescriptionstaskshowsthatthesemodelscangeneralisewelltonovel,unseenqueries.WhilethemedianrankforOneLookonthisevaluationislower,theNLMsretrievethecor-rectanswerinthetoptencandidatesapproximatelyasfrequently,withinthetop100candidatesmorefrequentlyandwithlowervarianceinrankingoverthetestset.Thus,NLMsseemtogeneralisemore‘consistenly’thanOneLookonthisdataset,inthattheygenerallyassignareasonablyhighrankingtothecorrectword.Incontrast,ascanalsobeverifiedbyqueryingourwedemo,OneLooktendstoper-formeitherverywellorpoorlyonagivenquery.11WhencomparingbetweenNLMs,perhapsthemoststrikingobservationisthattheRNNmodelsdonotsignificantlyoutperformtheBOWmodels,eventhoughtheBOWmodeloutputisinvarianttochangesintheorderofwordsinthedefinition.UsersoftheonlinedemocanverifythattheBOWmodelsrecoverconceptsfromdescriptionsstrikinglywell,evenwhenthewordsinthedescriptionareper-muted.ThisobservationunderlinestheimportanceoflexicalsemanticsintheinterpretationoflanguagebyNLMs,andisconsistentwithsomeotherrecentworkonembeddingsentences(Iyyeretal.,2015).Itisdifficulttoobservecleartrendsinthedif-ferencesbetweenNLMsthatlearninputwordem-beddingsandthosewithpre-trained(Word2Vec)in-putembeddings.Bothtypesofinputyieldgoodperformanceinsomesituationsandweakerperfor-manceinothers.Ingeneral,pre-traininginputem-beddingsseemstohelpmostontheconceptde-scriptions,whicharefurthestfromthetrainingdataintermsoflinguisticstyle.Thisisperhapsunsur-prising,sincemodelsthatlearninputembeddingsfromthedictionarydataacquirealloftheirconcep-10Thetrainedneurallanguagemodelsareapproximatelyhalfthesizeofthesixtrainingdictionariesstoredasplaintext,sowouldbehundredsoftimessmallerthantheOneLookdatabaseof1061dictionariesifstoredthisway.11WealsoobservedthatthemeanrankingforNLMswaslowerthanforOneLookontheconceptdescriptionstask.tualknowledgefromthisdata(andthusmayover-fittothissetting),whereasmodelswithpre-trainedembeddingshavesomesemanticmemoryacquiredfromgeneralrunning-textlanguagedataandotherknowledgeacquiredfromthedictionaries.3.5QualitativeAnalysisSomeexampleoutputfromthevariousmodelsispresentedinTable3.Thedifferencesillustratedherearealsoevidentfromqueryingthewebdemo.ThefirstexampleshowshowtheNLMs(BOWandRNN)generalisebeyondtheirtrainingdata.Fourofthetopfiveresponsescouldbeclassedasap-propriateinthattheyrefertoinhabitantsofcoldcountries.However,inspectingtheWordNiktrain-ingdata,thereisnomentionofcoldoranythingtodowithclimateinthedefinitionsofEskimo,Scandi-navian,Scandinaviaetc.Therefore,theembeddingmodelsmusthavelearnedthatcoldnessisachar-acteristicofScandinavia,Siberia,Russia,relatestoEskimosetc.viaconnectionswithotherconceptsthataredescribedordefinedascold.Incontrast,thecandidatesproducedbytheOneLookand(unsu-pervised)W2Vbaselinemodelshavenothingtodowithcoldness.ThesecondexampledemonstrateshowtheNLMsgenerallyreturncandidateswhoselinguisticorcon-ceptualfunctionisappropriatetothequery.Foraqueryreferringexplicitlytoameans,methodorpro-cess,theRNNandBOWmodelsproduceverbsindifferentformsoranappropriatedeverbalnoun.Incontrast,OneLookreturnswordsofalltypes(aero-dynamics,draught)thatarearbitrarilyrelatedtothewordsinthequery.Asimilareffectisapparentinthethirdexample.WhilethecandidatesproducedbytheOneLookmodelarethecorrectpartofspeech(Noun),andrelatedtothequerytopic,theyarenotsemanticallyappropriate.Thedictionaryembeddingmodelsaretheonlyonesthatreturnalistofplausi-blehabits,theclassofnounrequestedbytheinput.3.6Cross-LingualReverseDictionariesWenowshowhowtheRNNarchitecturecanbeeas-ilymodifiedtocreateabilingualreversedictionary-asystemthatreturnscandidatewordsinonelan-guagegivenadescriptionordefinitioninanother.Abilingualreversedictionarycouldhaveclearap-plicationsfortranslatorsortranscribers.Indeed,le

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
8
0
1
5
6
7
3
6
4

/

/
t

je

un
c
_
un
_
0
0
0
8
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

24

InputDescriptionOneLookW2VaddRNNBOW”anativeof1:country2:citizen1:a2.the1:eskimo2:scandinavian1:frigid2:coldacold3:foreign4:naturalize3:another4:of3:arctic4:indian3:icy4:russiancountry”5:cisco5:whole5:siberian5:indian”awayof1:drag2:whiz1:the2:through1:glide2:scooting1:flying2:glidingmoving3:aerodynamics4:draught3:a4:moving3:glides4:gliding3:glide4:flythrough5:coefficientofdrag5:in5:flight5:scootingtheair””ahabitthat1:sisterinlaw2:fatherinlaw1:annoy2:your1:bossiness2:jealousy1:infidelity2:bossinessmightannoy3:motherinlaw4:stepson3:might4:that3:annoyance4:rudeness3:foible4:unfaithfulnessyourspouse”5:stepchild5:either5:boorishness5:adulterousTable3:Thetop-fivecandidatesforexamplequeries(inventedbytheauthors)fromdifferentreversedictionarymod-els.BoththeRNNandBOWmodelsarewithoutWord2Vecinputandusethecosineloss.InputdescriptionRNNEN-FRW2VaddRNN+Google”anemotionthatyoumightfeeltriste,pitoyableinsister,effectivementsentiment,regretterafterbeingrejected”r´epugnante,´epouvantablepourquoi,nouspeur,aversion”asmallblackflyinginsectthatmouche,canardattentivement,pouvionsvoler,faucontransmitsdiseaseandlikeshorses”hirondelle,pigeonpourrons,naturellementmouches,volantTable4:Responsesfromcross-lingualreversedictionarymodelstoselectedqueries.Underlinedresponsesare‘cor-rect’orpotentiallyusefulforanativeFrenchspeaker.problemofattachingappropriatewordstoconceptsmaybemorecommonwhensearchingforwordsinasecondlanguagethaninamonolingualcontext.Tocreatethebilingualvariant,wesimplyreplacetheWord2Vectargetembeddingswiththosefromabilingualembeddingspace.Bilingualembeddingmodelsusebilingualcorporatolearnaspaceofrep-resentationsofthewordsintwolanguages,suchthatwordsfromeitherlanguagethathavesimilarmeaningsareclosetogether(HermannandBlun-som,2013;Chandaretal.,2014;Gouwsetal.,2014).Foratest-of-conceptexperiment,weusedEnglish-Frenchembeddingslearnedbythestate-of-the-artBilBOWAmodel(Gouwsetal.,2014)fromtheWikipedia(monolingual)andEuroparl(bilin-gual)corpora.12WetrainedtheRNNmodeltomapfromEnglishdefinitionstoEnglishwordsinthebilingualspace.Attesttime,afterreadinganEn-glishdefinition,wethensimplyreturnthenearestFrenchwordneighbourstothatdefinition.Becausenobenchmarksexistforquantitative12Theapproachshouldworkwithanybilingualembeddings.WethankStephanGouwsfordoingthetraining.evaluationofbilingualreversedictionaries,wecom-parethisapproachqualitativelywithtwoalternativemethodsformappingdefinitionstowordsacrosslanguages.ThefirstisanalogoustotheW2VAddmodeloftheprevioussection:inthebilingualem-beddingspace,wefirstcomposetheembeddingsoftheEnglishwordsinthequerydefinitionwithele-mentwiseaddition,andthenreturntheFrenchwordwhoseembeddingisnearesttothisvectorsum.ThesecondusestheRNNmonolingualreversedictio-narymodeltoidentifyanEnglishwordfromanEn-glishdefinition,andthentranslatesthatwordusingGoogleTranslate.Table4showsthattheRNNmodelcanbeef-fectivelymodifiedtocreateacross-lingualreversedictionary.ItisperhapsunsurprisingthattheW2VAddmodelcandidatesaregenerallythelowestinqualitygiventheperformanceofthemethodinthemonolingualsetting.IncomparingthetwoRNN-basedmethods,theRNN(embeddingspace)modelappearstohavetwoadvantagesovertheRNN+Googleapproach.First,itdoesnotrequireon-lineaccesstoabilingualword-wordmappingas

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
8
0
1
5
6
7
3
6
4

/

/
t

je

un
c
_
un
_
0
0
0
8
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

25

definede.g.byGoogleTranslate.Second,itlesspronetoerrorscausedbywordsenseambiguity.Forexample,inresponsetothequeryanemotionyoufeelafterbeingrejected,thebilingualembed-dingRNNreturnsemotionsoradjectivesdescribingmentalstates.Incontrast,themonolingual+GooglemodelincorrectlymapstheplausibleEnglishre-sponseregrettotheverbalinfinitiveregretter.Themodelmakesthesameerrorwhenrespondingtoadescriptionofafly,returningtheverbvoler(tofly).3.7DiscussionWehaveshownthatsimplytrainingRNNorBOWNLMsonsixdictionariesyieldsareversedictionarythatperformscomparablytotheleadingcommer-cialsystem,evenwithaccesstomuchlessdictio-narydata.Indeed,theembeddingmodelsconsis-tentlyreturnsyntacticallyandsemanticallyplausi-bleresponses,whicharegenerallypartofamorecoherentandhomogeneoussetofcandidatesthanthoseproducedbythecommercialsystems.Wealsoshowedhowthearchitecturecanbeeasilyextendedtoproducebilingualversionsofthesamemodel.Intheanalysesperformedthusfar,weonlytestthedictionaryembeddingapproachontasksthatitwastrainedtoaccomplish(mappingdefinitionsordescriptionstowords).Inthenextsection,weex-plorewhethertheknowledgelearnedbydictionaryembeddingmodelscanbeeffectivelytransferredtoanoveltask.4GeneralKnowledge(crossword)QuestionAnsweringTheautomaticansweringofquestionsposedinnat-urallanguageisacentralproblemofArtificialIn-telligence.AlthoughwebsearchandIRtechniquesprovideameanstofindsitesordocumentsrelatedtolanguagequeries,atpresent,internetusersrequiringaspecificfactmuststillsiftthroughpagestolocatethedesiredinformation.Systemsthatattempttoovercomethis,viafullyopen-domainorgeneralknowledgequestion-answering(openQA),generallyrequirelargeteamsofresearchers,modulardesignandpowerfulinfras-tructure,exemplifiedbyIBM’sWatson(Ferruccietal.,2010).Forthisreason,muchacademicre-searchfocusesonsettingsinwhichthescopeofthetaskisreduced.Thishasbeenachievedbyrestrict-ingquestionstoaspecifictopicordomain(Moll´aandVicedo,2007),allowingsystemsaccesstopre-specifiedpassagesoftextfromwhichtheanswercanbeinferred(Iyyeretal.,2014;Westonetal.,2015),orcenteringbothquestionsandanswersonapartic-ularknowledgebase(BerantandLiang,2014;Bor-desetal.,2014).Inwhatfollows,weshowthatthedictionaryem-beddingmodelsintroducedintheprevioussectionsmayformausefulcomponentofanopenQAsys-tem.Giventheabsenceofaknowledgebaseorweb-scaleinformationinourarchitecture,wenar-rowthescopeofthetaskbyfocusingongeneralknowledgecrosswordquestions.Generalknowl-edge(non-cryptic,orquick)crosswordsappearinnationalnewspapersinmanycountries.CrosswordquestionansweringismoretractablethangeneralopenQAfortworeasons.First,modelsknowthelengthofthecorrectanswer(inletters),reducingthesearchspace.Second,somecrosswordquestionsmirrordefinitions,inthattheyrefertofundamentalpropertiesofconcepts(atwelve-sidedshape)orre-questacategorymember(acityinEgypt).134.1EvaluationGeneralKnowledgecrosswordquestionscomeindifferentstylesandforms.WeusedtheEddieJamescrosswordwebsitetocompileabankofsentence-likegeneral-knowledgequestions.14EddieJamesisoneoftheUK’sleadingcrosswordcompilers,work-ingforseveralnationalnewspapers.Ourlongques-tionsetconsistsofthefirst150questions(startingfrompuzzle#1)fromhisgeneral-knowledgecross-words,excludingcluesoffewerthanfourwordsandthosewhoseanswerwasnotasingleword(e.g.kingjames).Toevaluatemodelsonadifferenttypeofclue,wealsocompiledasetofshorterquestionsbasedontheGuardianQuickCrossword.Guardianquestionsstillrequiregeneralfactualorlinguisticknowledge,butaregenerallyshorterandsomewhatmorecrypticthanthelongerEddieJamesclues.Weagainformed13Asourinterestisinthelanguageunderstanding,wedonotaddressthequestionoffittinganswersintoagrid,whichisthemainconcernofend-to-endautomatedcrosswordsolvers(Littmanetal.,2002).14http://www.eddiejames.co.uk/

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
8
0
1
5
6
7
3
6
4

/

/
t

je

un
c
_
un
_
0
0
0
8
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

26

alistof150questions,beginningon1January2015andexcludinganyquestionswithmultiple-wordan-swers.Forclearcontrast,weexcludedthosefewquestionsoflengthgreaterthanfourwords.Ofthese150clues,asubsetof30weresingle-wordclues.Allevaluationdatasetsareavailableonlinewiththepaper.Aswiththereversedictionaryexperiments,can-didatesareextractedfrommodelsbyinputtingdef-initionsandreturningwordscorrespondingtotheclosestembeddingsinthetargetspace.Inthiscase,cependant,weonlyconsidercandidatewordswhoselengthmatchesthelengthspecifiedintheclue.TestsetWordDescriptionLongBaudelaire”Frenchpoet(150)andkeyfigureinthedevelopmentofSymbolism.”Short(120)satanist”devildevotee”Single-Word(30)guilt”culpability”Table5:Examplesofthedifferentquestiontypesinthecrosswordquestionevaluationdataset.4.2BenchmarksandComparisonsAswiththereversedictionaryexperiments,wecom-pareRNNandBOWNLMswithasimpleunsuper-visedbaselineofelementwiseadditionofWord2Vecvectorsintheembeddingspace(wediscardthein-effectiveW2Vmultbaseline),againrestrictingcan-didatestowordsofthepre-specifiedlength.Wealsocomparetotwobespokeonlinecrossword-solvingengines.Thefirst,OneAcross(http://www.oneacross.com/)isthecandidategen-erationmoduleoftheaward-winningProverbcross-wordsystem(Littmanetal.,2002).Proverb,whichwasproducedbyacademicresearchers,hasfeaturedinnationalmediasuchasNewScientist,andbeatenexperthumansincrosswordsolvingtournaments.ThesecondcomparisoniswithCrosswordMaestro(http://www.crosswordmaestro.com/),acommercialcrosswordsolvingsystemthathandlesbothcrypticandnon-crypticcrosswordclues(wefocusonlyonthenon-crypticsetting),andhasalsobeenfeaturedinnationalmedia.15Weareunable15Seee.g.http://www.theguardian.com/crosswords/crossword-blog/2012/mar/08/tocompareagainstathirdwell-knownautomaticcrosswordsolver,DrFill(Ginsberg,2011),becausecodeforDrFill’scandidate-generationmoduleisnotreadilyavailable.AswiththeRNNandbase-linemodels,whenevaluatingexistingsystemswediscardcandidateswhoselengthdoesnotmatchthelengthspecifiedintheclue.Certainprinciplesconnectthedesignoftheex-istingcommercialsystemsanddifferentiatethemfromourapproach.UnliketheNLMs,theyeachre-quirequery-timeaccesstolargedatabasescontain-ingcommoncrosswordclues,dictionarydefinitions,thefrequencywithwhichwordstypicallyappearascrosswordsolutionsandotherhand-engineeredandtask-specificcomponents(Littmanetal.,2002;Ginsberg,2011).4.3ResultsTheperformanceofmodelsonthevariousquestiontypesispresentedinTable6.Whenevaluatingthetwocommercialsystems,OneAcrossandCross-wordMaestro,wehaveaccesstowebinterfacesthatreturnuptoapproximately100candidatesforeachquery,socanonlyreliablyrecordmembershipofthetopten(accuracy@10).Onthelongquestions,weobserveaclearadvan-tageforalldictionaryembeddingmodelsoverthecommercialsystemsandthesimpleunsupervisedbaseline.Here,thebestperformingNLM(RNNwithWord2Vecinputembeddingsandrankingloss)ranksthecorrectanswerthirdonaverage,andinthetop-tencandidatesover60%ofthetime.Asthequestionsgetshorter,theadvantageoftheembeddingmodelsdiminishes.Boththeunsu-pervisedbaselineandOneAcrossanswertheshortquestionswithcomparableaccuracytotheRNNandBOWmodels.Onereasonforthismaybethediffer-enceinformandstylebetweentheshortercluesandthefulldefinitionsorencyclopediasentencesinthedictionarytrainingdata.Asthelengthofthecluede-creases,findingtheansweroftenreducestogenerat-ingsynonyms(culpability-guilt),orcategorymem-bers(tallanimal-giraffe).Thecommercialsystemscanretrievegoodcandidatesforsuchcluesamongtheirdatabasesofentities,relationshipsandcom-moncrosswordanswers.UnsupervisedWord2Veccrossword-blog-computers-crack-cryptic-clues

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
8
0
1
5
6
7
3
6
4

/

/
t

je

un
c
_
un
_
0
0
0
8
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

27

QuestionTypeavgrank-accuracy@10/100-rankvarianceLong(150)Short(120)Single-Word(30)OneAcross.39/.68/.70/CrosswordMaestro.27/.43/.73/W2Vadd42.31/.639211.50/.78662.79/.9045RNNcosine15.43/.6910822.39/.6711772.31/.52187RNNw2vcosine4.61/.82607.56/.796012.48/.72116RNNranking6.58/.844810.51/.735712.48/.6967RNNw2vranking3.62/.80618.57/.784912.48/.69114BOWcosine4.60/.82547.56/.785112.45/.72137BOWw2vcosine4.60/.83567.54/.80483.59/.79111BOWranking5.62/.87508.58/.83378.55/.7939BOWw2vranking5.60/.86488.56/.83354.55/.8343Table6:Performanceofdifferentmodelsoncrosswordquestionsofdifferentlength.Thetwocommercialsystemsareevaluatedviatheirwebinterfacesoonlyaccuracy@10canbereportedinthosecases.representationsarealsoknowntoencodethesesortsofrelationships(evenafterelementwiseadditionforshortsequencesofwords)(Mikolovetal.,2013).Thiswouldalsoexplainwhythedictionaryembed-dingmodelswithpre-trained(Word2Vec)inputem-beddingsoutperfomthosewithlearnedembeddings,particularlyfortheshortestquestions.4.4QualitativeAnalysisAbetterunderstandingofhowthedifferentmodelsarriveattheiranswerscanbegainedfromconsider-ingspecificexamples,aspresentedinTable7.Thefirstthreeexamplesshowthat,despitetheapparentlysuperficialnatureofitstrainingdata(definitionsandintroductorysentences)embeddingmodelscanan-swerquestionsthatrequirefactualknowledgeaboutpeopleandplaces.Anothernotablecharacteristicofthesemodelistheconsistentsemanticappropriate-nessofthecandidateset.Inthefirstcase,thetopfivecandidatesareallmountains,valleysorplacesintheAlps;inthesecond,theyareallbiblicalnames.Inthethird,theRNNmodelretrievescurrencies,inthiscaseperformingbetterthantheBOWmodel,whichretrievesentitiesofvarioustypeassociatedwiththeNetherlands.Generallyspeaking(ascanbeobservedbythewebdemo),the‘smoothness’orconsistencyincandidategenerationofthedictionaryembeddingmodelsisgreaterthanthatofthecom-mercialsystems.Despiteitssimplicity,theunsuper-visedW2Vadditionmethodisattimesalsosurpris-inglyeffective,asshownbythefactthatitreturnsJoshuainitstopcandidatesforthethirdquery.ThefinalexampleinTable7illustratesthesur-prisingpoweroftheBOWmodel.Inthetrainingdatathereisasingledefinitionforthecorrectan-swerSchoenberg:UnitedStatescomposerandmusi-caltheorist(borninAustria)whodevelopedatonalcomposition.Theonlywordcommontoboththequeryandthedefinitionis’composer’(thereisnotokenizationthatallowstheBOWmodeltodirectlyconnectatonalandatonality).Néanmoins,themodelisabletoinferthenecessaryconnectionsbe-tweentheconceptsinthequeryandthedefinitiontoreturnSchoenbergasthetopcandidate.Despitesuchcases,itremainsanopenquestionwhether,withmorediversetrainingdata,theworldknowledgerequiredforfullopenQA(e.g.sec-ondaryfactsaboutSchoenberg,suchashisfam-ily)couldbeencodedandretainedasweightsina(larger)dynamicnetwork,orwhetheritwillbenec-essarytocombinetheRNNwithanexternalmem-orythatislessfrequently(ornever)updated.Thislatterapproachhasbeguntoachieveimpressivere-sultsoncertainQAandentailmenttasks(Bordesetal.,2014;Gravesetal.,2014;Westonetal.,2015).5ConclusionDictionariesexistinmanyoftheworld’slanguages.Wehaveshownhowtheselexicalresourcescancon-stitutevaluabledatafortrainingthelatestneurallan-guagemodelstointerpretandrepresentthemean-ingofphrasesandsentences.Whilehumansuse

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
8
0
1
5
6
7
3
6
4

/

/
t

je

un
c
_
un
_
0
0
0
8
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

28

InputDescriptionOneAcrossCrosswordMaestroBOWRNN”Swissmountain1:noted2:front1:after2:favor1:Eiger2.Crags1:Eiger2:Aostapeakfamedforits3:Eiger4:crown3:ahead4:along3:Teton4:Cerro3:Cuneo4:Lecconorthface(5)”5:fount5:being5:Jebel5:Tyrol”OldTestament1:Joshua2:Exodus1:devise2:Daniel1:Isaiah2:Elijah1:Joshua2:Isaiahsuccessorto3:Hebrew4:person3:Haggai4:Isaiah3:Joshua4:Elisha3:Gideon4:ElijahMoses(6)”5:across5:Joseph5:Yahweh5:Yahweh”Theformer1:Holland2:general1:Holland2:ancient1:Guilder2:Holland1:Guilder2:Escudoscurrencyofthe3:Lesotho3:earlier4:onetime3:Drenthe4:Utrecht3:Pesetas4:SomerenNetherlands5:qondam5:Naarden5:Florins(7)””Arnold,20th1:surrealism1:disharmony1:Schoenberg1:MendelsohnCenturycomposer2:laborparty2:dissonance2:Christleib2:Williamsonpioneerof3:tonemusics3:bringabout3:Stravinsky3:Huddlestonatonality4:introduced4:constitute4:Elderfield4:Mandelbaum(10)”5:Schoenberg5:triggeroff5:Mendelsohn5:ZimmermanTable7:Responsesfromdifferentmodelstoexamplecrosswordclues.Ineachcasethemodeloutputisfilteredtoexcludeanycandidatesthatarenotofthesamelengthasthecorrectanswer.BOWandRNNmodelsaretrainedwithoutWord2Vecinputembeddingsandcosineloss.thephrasaldefinitionsindictionariestobetterun-derstandthemeaningofwords,machinescanusethewordstobetterunderstandthephrases.Weusedtwodictionaryembeddingarchitectures-arecurrentneuralnetworkarchitecturewithalong-short-termmemory,andasimplerlinearbag-of-wordsmodel-toexplicitlyexploitthisidea.Onthereversedictionarytaskthatmirrorsitstrainingsetting,NLMsthatembedallknowncon-ceptsinacontinuous-valuedvectorspaceperformcomparablytothebestknowncommercialapplica-tionsdespitehavingaccesstomanyfewerdefini-tions.Moreover,theygeneratesmoothersetsofcan-didatesandrequirenolinguisticpre-processingortask-specificengineering.Wealsoshowedhowthedescription-to-wordobjectivecanbeusedtotrainmodelsusefulforothertasks.NLMstrainedonthesamedatacananswergeneral-knowledgecrosswordquestions,andindeedoutperformcommercialsys-temsonquestionscontainingmorethanfourwords.WhileourQAexperimentsfocusedoncrosswords,theresultssuggestthatasimilarembedding-basedapproachmayultimatelyleadtoimprovedoutputfrommoregeneralQAanddialogsystemsandin-formationretrievalenginesingeneral.Wemakeallcode,trainingdata,evaluationsetsandbothofourlinguistictoolspubliclyavailableon-lineforfutureresearch.Inparticular,weproposethereversedictionarytaskasacomparativelygeneral-purposeandobjectivewayofevaluatinghowwellmodelscomposelexicalmeaningintophraseorsen-tencerepresentations(whetherornottheyinvolvetrainingondefinitionsdirectly).Inthenextstageofthisresearch,wewillex-plorewaystoenhancetheNLMsdescribedhere,especiallyinthequestion-answeringcontext.Themodelsarecurrentlynottrainedonanyquestion-likelanguage,andwouldconceivablyimproveonexposuretosuchlinguisticforms.WewouldalsoliketounderstandbetterhowBOWmodelscanper-formsowellwithno‘awareness’ofwordorder,andwhethertherearespecificlinguisticcontextsinwhichmodelslikeRNNsorotherswiththepowertoencodewordorderareindeednecessary.Finally,weintendtoexplorewaystoendowthemodelwithricherworldknowledge.Thismayrequirethein-tegrationofanexternalmemorymodule,similartothepromisingapproachesproposedinseveralrecentpapers(Gravesetal.,2014;Westonetal.,2015).AcknowledgmentsKCandYBacknowledgethesupportofthefollow-ingorganizations:NSERC,CalculQu´ebec,Com-puteCanada,theCanadaResearchChairsandCI-

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
8
0
1
5
6
7
3
6
4

/

/
t

je

un
c
_
un
_
0
0
0
8
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

29

FAR.FHandAKweresupportedbyGoogleFacultyResearchAward,andAKfurtherbyGoogleEuro-peanFellowship.ReferencesDzmitryBahdanau,KyunghyunCho,andYoshuaBen-gio.2015.Neuralmachinetranslationbyjointlylearningtoalignandtranslate.InProceedingofICLR.YoshuaBengio,PatriceSimard,andPaoloFrasconi.1994.Learninglong-termdependencieswithgradientdescentisdifficult.NeuralNetworks,IEEETransac-tionson,5(2):157–166.JonathanBerantandPercyLiang.2014.Semanticpars-ingviaparaphrasing.InProceedingsoftheAssocia-tionforComputationalLinguistics.JamesBergstra,OlivierBreuleux,Fr´ed´ericBastien,Pas-calLamblin,RazvanPascanu,GuillaumeDesjardins,JosephTurian,DavidWarde-Farley,andYoshuaBen-gio.2010.Theano:aCPUandGPUmathexpressioncompiler.InProceedingsofthePythonforScientificComputingConference(SciPy).SlavenBilac,TimothyBaldwin,andHozumiTanaka.2003.Improvingdictionaryaccessibilitybymaximiz-inguseofavailableknowledge.TraitementAutoma-tiquedesLangues,44(2):199–224.SlavenBilac,WataruWatanabe,TaiichiHashimoto,TakenobuTokunaga,andHozumiTanaka.2004.Dic-tionarysearchbasedonthetargetworddescription.InProceedingsofNLP2014.AntoineBordes,SumitChopra,andJasonWeston.2014.Questionansweringwithsubgraphembeddings.Pro-ceedingsofEMNLP.AntoineBordes,NicolasUsunier,SumitChopra,andJasonWeston.2015.Large-scalesimplequestionansweringwithmemorynetworks.arXivpreprintarXiv:1506.02075.SarathChandar,StanislasLauly,HugoLarochelle,MiteshKhapra,BalaramanRavindran,VikasC.Raykar,andAmritaSaha.2014.Anautoencoderap-proachtolearningbilingualwordrepresentations.InAdvancesinNeuralInformationProcessingSystems,pages1853–1861.KyunghyunCho,BartVanMerri¨enboer,CaglarGul-cehre,DzmitryBahdanau,FethiBougares,HolgerSchwenk,andYoshuaBengio.2014.LearningphraserepresentationsusingRNNencoder-decoderforstatis-ticalmachinetranslation.InProceedingsofEMNLP.ManaalFaruqui,JesseDodge,SujayK.Jauhar,ChrisDyer,EduardHovy,andNoahA.Smith.2014.Retrofittingwordvectorstosemanticlexicons.Pro-ceedingsoftheNorthAmericanChapteroftheAsso-ciationforComputationalLinguistics.DavidFerrucci,EricBrown,JenniferChu-Carroll,JamesFan,DavidGondek,AdityaA.Kalyanpur,AdamLally,J.WilliamMurdock,EricNyberg,JohnPrager,NicoSchlaefer,andChrisWelty.2010.BuildingWat-son:AnoverviewoftheDeepQAproject.InAImag-azine,volume31(3),pages59–79.MatthewL.Ginsberg.2011.Dr.FILL:CrosswordsandanimplementedsolverforsinglyweightedCSPs.InJournalofArtificialIntelligenceResearch,pages851–886.StephanGouws,YoshuaBengio,andGregCorrado.2014.BilBOWA:Fastbilingualdistributedrepresen-tationswithoutwordalignments.InProceedingsofNIPSDeepLearningWorkshop.AlexGraves,GregWayne,andIvoDanihelka.2014.Neuralturingmachines.arXivpreprintarXiv:1410.5401.KarlMoritzHermannandPhilBlunsom.2013.Multi-lingualdistributedrepresentationswithoutwordalign-ment.InProceedingsofICLR.SeppHochreiterandJ¨urgenSchmidhuber.1997.Longshort-termmemory.Neuralcomputation,9(8):1735–1780.EricH.Huang,RichardSocher,ChristopherD.Manning,andAndrewY.Ng.2012.Improvingwordrepresenta-tionsviaglobalcontextandmultiplewordprototypes.InProceedingsoftheAssociationforComputationalLinguistics.MohitIyyer,JordanBoyd-Graber,LeonardoClaudino,RichardSocher,andHalDaum´eIII.2014.Aneu-ralnetworkforfactoidquestionansweringoverpara-graphs.InProceedingsofEMNLP.MohitIyyer,VarunManjunatha,JordanBoyd-Graber,andHalDaum´eIII.2015.Deepunorderedcompo-sitionrivalssyntacticmethodsfortextclassification.InProceedingsoftheAssociationforComputationalLinguistics.RyanKiros,RuslanSalakhutdinov,andRichardS.Zemel.2015.Unifyingvisual-semanticembeddingswithmultimodalneurallanguagemodels.Transac-tionsoftheAssociationforComputationalLinguistics.toappear.AlexandreKlementiev,IvanTitov,andBinodBhattarai.2012.Inducingcrosslingualdistributedrepresenta-tionsofwords.ProceedingsofCOLING.GeoffreyLeech,RogerGarside,andMichaelBryant.1994.CLAWS4:ThetaggingoftheBritishNationalCorpus.InProceedingsofCOLING.MichaelL.Littman,GregA.Keim,andNoamShazeer.2002.Aprobabilisticapproachtosolvingcrosswordpuzzles.ArtificialIntelligence,134(1):23–55.TomasMikolov,MartinKarafi´at,LukasBurget,JanCer-nock`y,andSanjeevKhudanpur.2010.Recurrentneu-

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
8
0
1
5
6
7
3
6
4

/

/
t

je

un
c
_
un
_
0
0
0
8
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

30

ralnetworkbasedlanguagemodel.InProceedingsofINTERSPEECH2010.TomasMikolov,IlyaSutskever,KaiChen,GregS.Cor-rado,andJeffDean.2013.Distributedrepresentationsofwordsandphrasesandtheircompositionality.InAdvancesinNeuralInformationProcessingSystems.JeffMitchellandMirellaLapata.2010.Compositionindistributionalmodelsofsemantics.CognitiveScience,34(8):1388–1429.DiegoMoll´aandJos´eLuisVicedo.2007.Questionan-sweringinrestricteddomains:Anoverview.Compu-tationalLinguistics,33(1):41–61.RyanShaw,AnindyaDatta,DebraVanderMeer,andKaushikDutta.2013.Buildingascalabledatabase-drivenreversedictionary.KnowledgeandDataEngi-neering,IEEETransactionson,25(3):528–540.IvanVulic,WimDeSmet,andMarie-FrancineMoens.2011.Identifyingwordtranslationsfromcomparablecorporausinglatenttopicmodels.InProceedingsoftheAssociationforComputationalLinguistics.JasonWeston,AntoineBordes,SumitChopra,andTomasMikolov.2015.TowardsAI-completequestionanswering:Asetofprerequisitetoytasks.InarXivpreprintarXiv:1502.05698.MatthewD.Zeiler.2012.Adadelta:Anadaptivelearn-ingratemethod.InarXivpreprintarXiv:1212.5701.MichaelZockandSlavenBilac.2004.Wordlookuponthebasisofassociations:Fromanideatoaroadmap.InProceedingsoftheACLWorkshoponEnhancingandUsingElectronicDictionaries.
Télécharger le PDF