Transactions of the Association for Computational Linguistics, vol. 4, pp. 17–30, 2016. Action Editor: Chris Callison-Burch.
Submission batch: 9/2015; revised 12/2015; revised 1/2016; published 2/2016.
© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Learning to Understand Phrases by Embedding the Dictionary

Felix Hill, Computer Laboratory, University of Cambridge, felix.hill@cl.cam.ac.uk
Kyunghyun Cho*, Courant Institute of Mathematical Sciences and Centre for Data Science, New York University, kyunghyun.cho@nyu.edu
Anna Korhonen, Department of Theoretical and Applied Linguistics, University of Cambridge, alk23@cam.ac.uk
Yoshua Bengio, CIFAR Senior Fellow, Université de Montréal, yoshua.bengio@umontreal.ca

*Work mainly done at the University of Montreal.

Abstract

Distributional models that learn rich semantic word representations are a success story of recent NLP research. However, developing models that learn useful representations of phrases and sentences has proved far harder. We propose using the definitions found in everyday dictionaries as a means of bridging this gap between lexical and phrasal semantics. Neural language embedding models can be effectively trained to map dictionary definitions (phrases) to (lexical) representations of the words defined by those definitions. We present two applications of these architectures: reverse dictionaries that return the name of a concept given a definition or description, and general-knowledge crossword question answerers. On both tasks, neural language embedding models trained on definitions from a handful of freely-available lexical resources perform as well or better than existing commercial systems that rely on significant task-specific engineering. The results highlight the effectiveness of both neural embedding architectures and definition-based training for developing models that understand phrases and sentences.

1 Introduction

Much recent research in computational semantics has focussed on learning representations of arbitrary-length phrases and sentences. This task is challenging partly because there is no obvious gold standard of phrasal representation that could be used in training and evaluation. Consequently, it is difficult to design approaches that could learn from such a gold standard, and also hard to evaluate or compare different models.

In this work, we use dictionary definitions to address this issue. The composed meaning of the words in a dictionary definition (a tall, long-necked, spotted ruminant of Africa) should correspond to the meaning of the word they define (giraffe). This bridge between lexical and phrasal semantics is useful because high quality vector representations of single words can be used as a target when learning to combine the words into a coherent phrasal representation.

This approach still requires a model capable of learning to map between arbitrary-length phrases and fixed-length continuous-valued word vectors. For this purpose we experiment with two broad classes of neural language models (NLMs): Recurrent Neural Networks (RNNs), which naturally encode the order of input words, and simpler (feed-forward) bag-of-words (BOW) embedding models. Prior to training these NLMs, we learn target lexical representations by training the Word2Vec software (Mikolov et al., 2013) on billions of words of raw text.

We demonstrate the usefulness of our approach by building and releasing two applications. The first is a reverse dictionary or concept finder: a system that returns words based on user descriptions or definitions (Zock and Bilac, 2004). Reverse dictionaries are used by copywriters, novelists, translators and other professional writers to find words for notions or ideas that might be on the tip of their tongue.
For instance, a travel-writer might look to enhance her prose by searching for examples of a country that people associate with warm weather, or an activity that is mentally or physically demanding. We show that an NLM-based reverse dictionary trained on only a handful of dictionaries identifies novel definitions and concept descriptions comparably or better than commercial systems, which rely on significant task-specific engineering and access to much more dictionary data. Moreover, by exploiting models that learn bilingual word representations (Vulic et al., 2011; Klementiev et al., 2012; Hermann and Blunsom, 2013; Gouws et al., 2014), we show that the NLM approach can be easily extended to produce a potentially useful cross-lingual reverse dictionary.

The second application of our models is as a general-knowledge crossword question answerer. When trained on both dictionary definitions and the opening sentences of Wikipedia articles, NLMs produce plausible answers to (non-cryptic) crossword clues, even those that apparently require detailed world knowledge. Both BOW and RNN models can outperform bespoke commercial crossword solvers, particularly when clues contain a greater number of words. Qualitative analysis reveals that NLMs can learn to relate concepts that are not directly connected in the training data and can thus generalise well to unseen input. To facilitate further research, all of our code, training and evaluation sets (together with a system demo) are published online with this paper.[1]

[1] https://www.cl.cam.ac.uk/~fh295/

2 Neural Language Model Architectures

The first model we apply to the dictionary-based learning task is a recurrent neural network (RNN). RNNs operate on variable-length sequences of inputs; in our case, natural language definitions, descriptions or sentences. RNNs (with LSTMs) have achieved state-of-the-art performance in language modelling (Mikolov et al., 2010) and image caption generation (Kiros et al., 2015), and approach state-of-the-art performance in machine translation (Bahdanau et al., 2015).

During training, the input to the RNN is a dictionary definition or sentence from an encyclopedia. The objective of the model is to map these defining phrases or sentences to an embedding of the word that the definition defines. The target word embeddings are learned independently of the RNN weights, using the Word2Vec software (Mikolov et al., 2013).

The set of all words in the training data constitutes the vocabulary of the RNN. For each word in this vocabulary we randomly initialise a real-valued vector (input embedding) of model parameters. The RNN 'reads' the first word in the input by applying a non-linear projection of its embedding v_1, parameterised by an input weight matrix W and a vector of biases b:

A_1 = φ(W v_1 + b)

yielding the first internal activation state A_1. In our implementation, we use φ(x) = tanh(x), though in theory φ can be any differentiable non-linear function. Subsequent internal activations (after time-step t) are computed by projecting the embedding of the t-th word and using this information to 'update' the internal activation state:

A_t = φ(U A_{t−1} + W v_t + b).

As such, the values of the final internal activation state units A_N are a weighted function of all input word embeddings, and constitute a 'summary' of the information in the sentence.
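To make the update equations concrete, the following is a minimal NumPy sketch of the plain RNN reader. The paper's models were implemented in Theano; the function name, dimensions and toy parameters here are purely illustrative.

```python
import numpy as np

def rnn_read(word_vectors, W, U, b):
    """Read a definition (one input embedding per word) with the plain RNN
    described above: A_t = tanh(U A_{t-1} + W v_t + b). The final state A_N
    acts as a 'summary' of the whole phrase."""
    A = np.zeros(b.shape[0])            # A_0: zero initial activation state
    for v in word_vectors:              # v_1 ... v_N
        A = np.tanh(U @ A + W @ v + b)  # update the state with the new word
    return A

# Toy usage with random parameters: embedding length 4, hidden length 3.
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)
definition = [rng.normal(size=4) for _ in range(5)]  # a five-"word" input
print(rnn_read(definition, W, U, b).shape)           # -> (3,)
```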
2.1 Long Short Term Memory

A known limitation when training RNNs to read language using gradient descent is that the error signal (gradient) on the training examples either vanishes or explodes as the number of time steps (sentence length) increases (Bengio et al., 1994). Consequently, after reading longer sentences the final internal activation A_N typically retains useful information about the most recently read (sentence-final) words, but can neglect important information near the start of the input sentence. LSTMs (Hochreiter and Schmidhuber, 1997) were designed to mitigate this long-term dependency problem.

At each time step t, in place of the single internal layer of units A, the LSTM RNN computes six internal layers iw, g^i, g^f, g^o, h and m. The first, iw, represents the core information passed to the LSTM unit by the latest input word at t. It is computed as a simple linear projection of the input embedding v_t (by input weights W_w) and the output state of the LSTM at the previous time step h_{t−1} (by update weights U_w):

iw_t = W_w v_t + U_w h_{t−1} + b_w

The layers g^i, g^f and g^o are computed as weighted sigmoid functions of the input embeddings, again parameterised by layer-specific weight matrices W and U:

g^s_t = 1 / (1 + exp(−(W_s v_t + U_s h_{t−1} + b_s)))

where s stands for one of i, f or o. These vectors take values in [0, 1] and are often referred to as gating activations. Finally, the internal memory state m_t and the new output state h_t of the LSTM at t are computed as

m_t = iw_t ⊙ g^i_t + m_{t−1} ⊙ g^f_t
h_t = g^o_t ⊙ φ(m_t),

where ⊙ indicates elementwise vector multiplication and φ is, as before, some non-linear function (we use tanh). Thus, g^i determines to what extent the new input word is considered at each time step, g^f determines to what extent the existing state of the internal memory is retained or forgotten in computing the new internal memory, and g^o determines how much this memory is considered when computing the output state at t.

The sentence-final memory state of the LSTM, m_N, a 'summary' of all the information in the sentence, is then projected via an extra non-linear projection (parameterised by a further weight matrix) to a target embedding space. This layer enables the target (defined) word embedding space to take a different dimension to the activation layers of the RNN, and in principle enables a more complex definition-reading function to be learned.

2.2 Bag-of-Words NLMs

We implement a simpler linear bag-of-words (BOW) architecture for encoding the definition phrases. As with the RNN, this architecture learns an embedding v_i for each word in the model vocabulary, together with a single matrix of input projection weights W. The BOW model simply maps an input definition with word embeddings v_1 ... v_n to the sum of the projected embeddings, Σ_{i=1}^{n} W v_i. This model can also be considered a special case of an RNN in which the update function U and the nonlinearity φ are both the identity, so that 'reading' the next word in the input phrase updates the current representation more simply:

A_t = A_{t−1} + W v_t.

2.3 Pre-trained Input Representations

We experiment with variants of these models in which the input definition embeddings are pre-learned and fixed (rather than randomly-initialised and updated) during training. There are several potential advantages to taking this approach. First, the word embeddings are trained on massive corpora and may therefore introduce additional linguistic or conceptual knowledge to the models. Second, at test time, the models will have a larger effective vocabulary, since the pre-trained word embeddings typically span a larger vocabulary than the union of all dictionary definitions used to train the model. Finally, the models will then map to and from the same space of embeddings (the embedding space will be closed under the operation of the model), so conceivably they could be more easily applied as a general-purpose 'composition engine'.

2.4 Training Objective

We train all neural language models M to map the input definition phrase s_c defining word c to a location close to the pre-trained embedding v_c of c. We experiment with two different cost functions for the word-phrase pair (c, s_c) from the training data. The first is simply the cosine distance between M(s_c) and v_c. The second is the rank loss

max(0, m − cos(M(s_c), v_c) + cos(M(s_c), v_r))

where v_r is the embedding of a randomly-selected word from the vocabulary other than c. This loss function was used for language models, for example, in (Huang et al., 2012). In all experiments we apply a margin m = 0.1, which has been shown to work well on word-retrieval tasks (Bordes et al., 2015).
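The LSTM reader and the two training costs can be sketched in a few lines of NumPy. This is an illustrative reading of the equations above, not the released Theano code; the parameter dictionary P, the output projection W_target, and all names are assumptions made for exposition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(v_t, h_prev, m_prev, P):
    """One LSTM time step following the equations above. P holds the
    layer-specific weights W_s, U_s and biases b_s for s in {w, i, f, o}."""
    iw  = P["Ww"] @ v_t + P["Uw"] @ h_prev + P["bw"]           # core input information
    g_i = sigmoid(P["Wi"] @ v_t + P["Ui"] @ h_prev + P["bi"])  # input gate
    g_f = sigmoid(P["Wf"] @ v_t + P["Uf"] @ h_prev + P["bf"])  # forget gate
    g_o = sigmoid(P["Wo"] @ v_t + P["Uo"] @ h_prev + P["bo"])  # output gate
    m_t = iw * g_i + m_prev * g_f                              # new internal memory
    h_t = g_o * np.tanh(m_t)                                   # new output state
    return h_t, m_t

def lstm_read(word_vectors, P, W_target):
    """Run the LSTM over a definition and project the final memory state m_N
    into the target (Word2Vec) embedding space via an extra non-linearity."""
    h = np.zeros(P["Uw"].shape[0])
    m = np.zeros_like(h)
    for v in word_vectors:
        h, m = lstm_step(v, h, m, P)
    return np.tanh(W_target @ m)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def cosine_cost(pred, target):
    """First cost: cosine distance between M(s_c) and v_c."""
    return 1.0 - cosine(pred, target)

def rank_cost(pred, target, random_embedding, margin=0.1):
    """Second cost: margin rank loss asking the model output to be closer
    (in cosine) to the defined word v_c than to a randomly drawn word v_r."""
    return max(0.0, margin - cosine(pred, target) + cosine(pred, random_embedding))
```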
2.5 Implementation Details

Since training on the dictionary data took 6-10 hours, we did not conduct a hyper-parameter search on any validation sets over the space of possible model configurations, such as embedding dimension or size of hidden layers. Instead, we chose these parameters to be as standard as possible based on previous research. For fair comparison, any aspects of model design that are not specific to a particular class of model were kept constant across experiments.

The pre-trained word embeddings used in all of our models (either as input or target) were learned by a continuous bag-of-words (CBOW) model using the Word2Vec software on approximately 8 billion words of running text.[2] When training such models on massive corpora, large embedding lengths of up to 700 have been shown to yield best performance (see e.g. (Faruqui et al., 2014)). The pre-trained embeddings used in our models were of length 500, as a compromise between quality and memory constraints. In cases where the word embeddings are learned during training on the dictionary objective, we make these embeddings shorter (256), since they must be learned from much less language data. In the RNN models, at each time step each of the four LSTM RNN internal layers (gating and activation states) had length 512 – another standard choice (see e.g. (Cho et al., 2014)). The final hidden state was mapped linearly to length 500, the dimension of the target embedding. In the BOW models, the projection matrix projects input embeddings (either learned, of length 256, or pre-trained, of length 500) to length 500 for summing. All models were implemented with Theano (Bergstra et al., 2010) and trained with minibatch SGD on GPUs. The batch size was fixed at 16 and the learning rate was controlled by adadelta (Zeiler, 2012).

[2] The Word2Vec embedding models are well known; further details can be found at https://code.google.com/p/word2vec/. The training data for this pre-training was compiled from various online text sources using the script demo-train-big-model-v1.sh from the same page.

3 Reverse Dictionaries

The most immediate application of our trained models is as a reverse dictionary or concept finder. It is simple to look up a definition in a dictionary given a word, but professional writers often also require suitable words for a given idea, concept or definition.[3] Reverse dictionaries satisfy this need by returning candidate words given a phrase, description or definition. For instance, when queried with the phrase an activity that requires strength and determination, the OneLook.com reverse dictionary returns the concepts exercise and work. Our trained RNN model can perform a similar function, simply by mapping a phrase to a point in the target (Word2Vec) embedding space, and returning the words corresponding to the embeddings that are closest to that point.

[3] See the testimony from professional writers at http://www.onelook.com/?c=awards
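Concretely, candidate generation in this setting is a nearest-neighbour search in the target embedding space. The sketch below assumes a trained definition encoder `model` (e.g. either reader sketched above), a matrix `target_embeddings` with one row per vocabulary word, and a parallel list `vocab`; these names are illustrative, not part of the released code.

```python
import numpy as np

def reverse_dictionary(query_vectors, model, target_embeddings, vocab, k=5):
    """Encode a description with a trained definition reader and return the k
    vocabulary words whose target embeddings are closest (by cosine) to the
    resulting point in the embedding space."""
    point = model(query_vectors)                   # map the phrase into the target space
    sims = (target_embeddings @ point) / (
        np.linalg.norm(target_embeddings, axis=1) * np.linalg.norm(point) + 1e-8)
    best = np.argsort(-sims)[:k]                   # highest-similarity rows first
    return [vocab[i] for i in best]
```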
Several other academic studies have proposed reverse dictionary models. These generally rely on common techniques from information retrieval, comparing definitions in their internal database to the input query, and returning the word whose definition is 'closest' to that query (Bilac et al., 2003; Bilac et al., 2004; Zock and Bilac, 2004). Proximity is quantified differently in each case, but is generally a function of hand-engineered features of the two sentences. For instance, Shaw et al. (2013) propose a method in which the candidates for a given input query are all words in the model's database whose definitions contain one or more words from the query. This candidate list is then ranked according to a query-definition similarity metric based on the hypernym and hyponym relations in WordNet, features commonly used in IR such as tf-idf, and a parser.

There are, in addition, at least two commercial online reverse dictionary applications, whose architecture is proprietary knowledge. The first is the Dictionary.com reverse dictionary,[4] which retrieves candidate words from the Dictionary.com dictionary based on user definitions or descriptions. The second is OneLook.com, whose algorithm searches 1061 indexed dictionaries, including all major freely-available online dictionaries and resources such as Wikipedia and WordNet.

[4] Available at http://dictionary.reference.com/reverse/

3.1 Data Collection and Training

To compile a bank of dictionary definitions for training the model, we started with all words in the target embedding space. For each of these words, we extracted dictionary-style definitions from five electronic resources: WordNet, The American Heritage Dictionary, The Collaborative International Dictionary of English, Wiktionary and Webster's. We chose these five dictionaries because they are freely available via the WordNik API,[5] but in theory any dictionary could be chosen. Most words in our training data had multiple definitions. For each word w with definitions {d_1 ... d_n} we included all pairs (w, d_1) ... (w, d_n) as training examples.

To allow models access to more factual knowledge than might be present in a dictionary (for instance, information about specific entities, places or people), we supplemented this training data with information extracted from Simple Wikipedia.[6] For every word in the model's target embedding space that is also the title of a Wikipedia article, we treat the sentences in the first paragraph of the article as if they were (independent) definitions of that word. When a word in Wikipedia also occurs in one (or more) of the five training dictionaries, we simply add these pseudo-definitions to the training set of definitions for the word. Combining Wikipedia and dictionaries in this way resulted in approximately 900,000 word-'definition' pairs covering approximately 100,000 unique words.

To explore the effect of the quantity of training data on the performance of the models, we also trained models on subsets of this data. The first subset comprised only definitions from WordNet (approximately 150,000 definitions of 75,000 words). The second subset comprised only words in WordNet and their first definitions (approximately 75,000 word-definition pairs).[7] For all variants of RNN and BOW models, however, reducing the training data in this way resulted in a clear reduction in performance on all tasks. For brevity, we therefore do not present these results in what follows.

[5] See http://developer.wordnik.com
[6] https://simple.wikipedia.org/wiki/Main_Page
[7] As with other dictionaries, the first definition in WordNet generally corresponds to the most typical or common sense of a word.

3.2 Comparisons

As a baseline, we also implemented two entirely unsupervised methods using the neural (Word2Vec) word embeddings from the target word space. In the first (W2V add), we compose the embeddings for each word in the input query by pointwise addition, and return as candidates the nearest word embeddings to the resulting composed vector.[8] The second baseline (W2V mult) is identical except that the embeddings are composed by elementwise multiplication. Both methods are established ways of building phrase representations from word embeddings (Mitchell and Lapata, 2010).

None of the models or evaluations from previous academic research on reverse dictionaries is publicly available, so direct comparison is not possible. However, we do compare performance with the commercial systems. The Dictionary.com system returned no candidates for over 96% of our input definitions. We therefore conduct detailed comparison with OneLook.com, which is the first reverse dictionary tool returned by a Google search and seems to be the most popular among writers.

[8] Since we retrieve all answers from embedding spaces by cosine similarity, addition of word embeddings is equivalent to taking the mean.
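For completeness, the two unsupervised baselines above amount to the following sketch, reusing the illustrative `target_embeddings` and `vocab` names from the earlier retrieval example (again an assumption, not the released code):

```python
import numpy as np

def w2v_baseline(query_vectors, target_embeddings, vocab, mode="add", k=5):
    """Compose the query's Word2Vec embeddings by pointwise addition (W2V add)
    or elementwise multiplication (W2V mult), then return the k nearest
    vocabulary words under cosine similarity."""
    q = np.sum(query_vectors, axis=0) if mode == "add" else np.prod(query_vectors, axis=0)
    sims = (target_embeddings @ q) / (
        np.linalg.norm(target_embeddings, axis=1) * np.linalg.norm(q) + 1e-8)
    return [vocab[i] for i in np.argsort(-sims)[:k]]
```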
3.3 Reverse Dictionary Evaluation

To our knowledge there are no established means of measuring reverse dictionary performance. In the only previous academic research on English reverse dictionaries that we are aware of, evaluation was conducted on 300 word-definition pairs written by lexicographers (Shaw et al., 2013). Since these are not publicly available, we developed new evaluation sets and make them freely available for future evaluations.

The evaluation items are of three types, designed to test different properties of the models. To create the seen evaluation, we randomly selected 500 words from the WordNet training data (seen by all models), and then randomly selected a definition for each word. Testing models on the resulting 500 word-definition pairs assesses their ability to recall or decode previously encoded information. For the unseen evaluation, we randomly selected 500 words from WordNet and excluded all definitions of these words from the training data of all models.

Finally, for a fair comparison with OneLook, which has both the seen and unseen pairs in its internal database, we built a new dataset of concept descriptions that do not appear in the training data for any model. To do so, we randomly selected 200 adjectives, nouns or verbs from among the top 3000 most frequent tokens in the British National Corpus (Leech et al., 1994) (but outside the top 100). We then asked ten native English speakers to write a single-sentence 'description' of these words. To ensure the resulting descriptions were of good quality, for each description we asked two participants who did not produce that description to list any words that fitted the description (up to a maximum of three). If the target word was not produced by one of the two checkers, the original participant was asked to re-write the description until the validation was passed.[9] These concept descriptions, together with the other evaluation sets, can be downloaded from our website for future comparisons.

[9] Re-writing was required in 6 of the 200 cases.

Table 2: Style difference between dictionary definitions and concept descriptions in the evaluation.

Test set               Word     Description
Dictionary definition  valve    "control consisting of a mechanical device for controlling fluid flow"
Concept description    prefer   "when you like one thing more than another thing"

Given a test description, definition, or question, all models produce a ranking of possible word answers based on the proximity of their representations of the input phrase and all possible output words. To quantify the quality of a given ranking, we report three statistics: the median rank of the correct answer (over the whole test set, lower better), the proportion of test cases in which the correct answer appears in the top 10/100 of this ranking (accuracy@10/100, higher better), and the variance of the rank of the correct answer across the test set (rank variance, lower better).
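These ranking statistics can be computed directly from the position each model assigns to the correct answer. The following is a small sketch under stated assumptions: ranks are taken as 0-based, and "rank variance" is read as the variance of the raw ranks, which the paper does not spell out.

```python
import numpy as np

def ranking_statistics(ranks):
    """Summarise the ranks of the correct answers over a test set: median rank,
    accuracy@10, accuracy@100 and rank variance."""
    ranks = np.asarray(ranks, dtype=float)
    return {
        "median rank": float(np.median(ranks)),
        "accuracy@10": float(np.mean(ranks < 10)),
        "accuracy@100": float(np.mean(ranks < 100)),
        "rank variance": float(np.var(ranks)),
    }

print(ranking_statistics([0, 3, 12, 57, 340]))
```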
3.4 Results

Table 1 shows the performance of the different models in the three evaluation settings. Of the unsupervised composition models, elementwise addition is clearly more effective than multiplication, which almost never returns the correct word as the nearest neighbour of the composition. Overall, however, the supervised models (RNN, BOW and OneLook) clearly outperform these baselines.

Table 1: Performance of different reverse dictionary models in different evaluation settings. Each cell shows median rank, accuracy@10/100, and rank variance. *Low variance in the mult models is due to consistently poor scores, so not highlighted.

                   Seen (500 WN defs)     Unseen (500 WN defs)    Concept descriptions (200)
W2V add            -                      923   .04/.16   163     339   .07/.30   150
W2V mult           -                      1000  .00/.00   10*     1000  .00/.00   27*
OneLook            0     .89/.91   67     -                       18.5  .38/.58   153
RNN cosine         12    .48/.73   103    22    .41/.70   116     69    .28/.54   157
RNN w2v cosine     19    .44/.70   111    19    .44/.69   126     26    .38/.66   111
RNN ranking        18    .45/.67   128    24    .43/.69   103     25    .34/.66   102
RNN w2v ranking    54    .32/.56   155    33    .36/.65   137     30    .33/.69   77
BOW cosine         22    .44/.65   129    19    .43/.69   103     50    .34/.60   99
BOW w2v cosine     15    .46/.71   124    14    .46/.71   104     28    .36/.66   99
BOW ranking        17    .45/.68   115    22    .42/.70   95      32    .35/.69   101
BOW w2v ranking    55    .32/.56   155    36    .35/.66   138     38    .33/.72   85

The results indicate interesting differences between the NLMs and the OneLook dictionary search engine. The Seen (WN first) definitions in Table 1 occur in both the training data for the NLMs and the lookup data for the OneLook model. Clearly the OneLook algorithm is better than NLMs at retrieving already available information (returning 89% of correct words among the top-ten candidates on this set). However, this is likely to come at the cost of a greater memory footprint, since the model requires access to its database of dictionaries at query time.[10]

The performance of the NLM embedding models on the (unseen) concept descriptions task shows that these models can generalise well to novel, unseen queries. While the median rank for OneLook on this evaluation is lower, the NLMs retrieve the correct answer in the top ten candidates approximately as frequently, within the top 100 candidates more frequently, and with lower variance in ranking over the test set. Thus, NLMs seem to generalise more 'consistently' than OneLook on this dataset, in that they generally assign a reasonably high ranking to the correct word. In contrast, as can also be verified by querying our web demo, OneLook tends to perform either very well or poorly on a given query.[11]

When comparing between NLMs, perhaps the most striking observation is that the RNN models do not significantly outperform the BOW models, even though the BOW model output is invariant to changes in the order of words in the definition. Users of the online demo can verify that the BOW models recover concepts from descriptions strikingly well, even when the words in the description are permuted. This observation underlines the importance of lexical semantics in the interpretation of language by NLMs, and is consistent with some other recent work on embedding sentences (Iyyer et al., 2015).

It is difficult to observe clear trends in the differences between NLMs that learn input word embeddings and those with pre-trained (Word2Vec) input embeddings. Both types of input yield good performance in some situations and weaker performance in others. In general, pre-training input embeddings seems to help most on the concept descriptions, which are furthest from the training data in terms of linguistic style. This is perhaps unsurprising, since models that learn input embeddings from the dictionary data acquire all of their conceptual knowledge from this data (and thus may overfit to this setting), whereas models with pre-trained embeddings have some semantic memory acquired from general running-text language data and other knowledge acquired from the dictionaries.

[10] The trained neural language models are approximately half the size of the six training dictionaries stored as plain text, so would be hundreds of times smaller than the OneLook database of 1061 dictionaries if stored this way.
[11] We also observed that the mean ranking for NLMs was lower than for OneLook on the concept descriptions task.

3.5 Qualitative Analysis

Some example output from the various models is presented in Table 3. The differences illustrated here are also evident from querying the web demo. The first example shows how the NLMs (BOW and RNN) generalise beyond their training data. Four of the top five responses could be classed as appropriate in that they refer to inhabitants of cold countries. However, inspecting the WordNik training data, there is no mention of cold or anything to do with climate in the definitions of Eskimo, Scandinavian, Scandinavia etc. Therefore, the embedding models must have learned that coldness is a characteristic of Scandinavia, Siberia, Russia, relates to Eskimos etc. via connections with other concepts that are described or defined as cold. In contrast, the candidates produced by the OneLook and (unsupervised) W2V baseline models have nothing to do with coldness.

The second example demonstrates how the NLMs generally return candidates whose linguistic or conceptual function is appropriate to the query. For a query referring explicitly to a means, method or process, the RNN and BOW models produce verbs in different forms or an appropriate deverbal noun. In contrast, OneLook returns words of all types (aerodynamics, draught) that are arbitrarily related to the words in the query. A similar effect is apparent in the third example. While the candidates produced by the OneLook model are the correct part of speech (nouns), and related to the query topic, they are not semantically appropriate. The dictionary embedding models are the only ones that return a list of plausible habits, the class of noun requested by the input.

3.6 Cross-Lingual Reverse Dictionaries

We now show how the RNN architecture can be easily modified to create a bilingual reverse dictionary: a system that returns candidate words in one language given a description or definition in another. A bilingual reverse dictionary could have clear applications for translators or transcribers. Indeed, the problem of attaching appropriate words to concepts may be more common when searching for words in a second language than in a monolingual context.
Table 3: The top-five candidates for example queries (invented by the authors) from different reverse dictionary models. Both the RNN and BOW models are without Word2Vec input and use the cosine loss.

Input description: "a native of a cold country"
  OneLook: country, citizen, foreign, naturalize, cisco
  W2V add: a, the, another, of, whole
  RNN: eskimo, scandinavian, arctic, indian, siberian
  BOW: frigid, cold, icy, russian, indian

Input description: "a way of moving through the air"
  OneLook: drag, whiz, aerodynamics, draught, coefficient of drag
  W2V add: the, through, a, moving, in
  RNN: glide, scooting, glides, gliding, flight
  BOW: flying, gliding, glide, fly, scooting

Input description: "a habit that might annoy your spouse"
  OneLook: sister in law, father in law, mother in law, stepson, stepchild
  W2V add: annoy, your, might, that, either
  RNN: bossiness, jealousy, annoyance, rudeness, boorishness
  BOW: infidelity, bossiness, foible, unfaithfulness, adulterous

Table 4: Responses from cross-lingual reverse dictionary models to selected queries. Underlined responses are 'correct' or potentially useful for a native French speaker.

Input description: "an emotion that you might feel after being rejected"
  RNN EN-FR: triste, pitoyable, répugnante, épouvantable
  W2V add: insister, effectivement, pourquoi, nous
  RNN + Google: sentiment, regretter, peur, aversion

Input description: "a small black flying insect that transmits disease and likes horses"
  RNN EN-FR: mouche, canard, hirondelle, pigeon
  W2V add: attentivement, pouvions, pourrons, naturellement
  RNN + Google: voler, faucon, mouches, volant

To create the bilingual variant, we simply replace the Word2Vec target embeddings with those from a bilingual embedding space. Bilingual embedding models use bilingual corpora to learn a space of representations of the words in two languages, such that words from either language that have similar meanings are close together (Hermann and Blunsom, 2013; Chandar et al., 2014; Gouws et al., 2014). For a test-of-concept experiment, we used English-French embeddings learned by the state-of-the-art BilBOWA model (Gouws et al., 2014) from the Wikipedia (monolingual) and Europarl (bilingual) corpora.[12] We trained the RNN model to map from English definitions to English words in the bilingual space. At test time, after reading an English definition, we then simply return the nearest French word neighbours to that definition.

[12] The approach should work with any bilingual embeddings. We thank Stephan Gouws for doing the training.
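In code, the cross-lingual variant only changes the candidate set at query time. The sketch below assumes a shared English-French embedding matrix with a parallel list of language tags; all names are illustrative and not drawn from the released implementation.

```python
import numpy as np

def cross_lingual_lookup(query_vectors, model, bilingual_embeddings,
                         bilingual_vocab, vocab_langs, k=5):
    """Encode an English definition into the shared bilingual space and return
    only the nearest French words."""
    point = model(query_vectors)
    sims = (bilingual_embeddings @ point) / (
        np.linalg.norm(bilingual_embeddings, axis=1) * np.linalg.norm(point) + 1e-8)
    ranked = np.argsort(-sims)
    return [bilingual_vocab[i] for i in ranked if vocab_langs[i] == "fr"][:k]
```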
Because no benchmarks exist for quantitative evaluation of bilingual reverse dictionaries, we compare this approach qualitatively with two alternative methods for mapping definitions to words across languages. The first is analogous to the W2V add model of the previous section: in the bilingual embedding space, we first compose the embeddings of the English words in the query definition with elementwise addition, and then return the French word whose embedding is nearest to this vector sum. The second uses the RNN monolingual reverse dictionary model to identify an English word from an English definition, and then translates that word using Google Translate.

Table 4 shows that the RNN model can be effectively modified to create a cross-lingual reverse dictionary. It is perhaps unsurprising that the W2V add model candidates are generally the lowest in quality, given the performance of the method in the monolingual setting. In comparing the two RNN-based methods, the RNN (embedding space) model appears to have two advantages over the RNN+Google approach. First, it does not require online access to a bilingual word-word mapping as defined e.g. by Google Translate. Second, it is less prone to errors caused by word sense ambiguity. For example, in response to the query an emotion you feel after being rejected, the bilingual embedding RNN returns emotions or adjectives describing mental states. In contrast, the monolingual+Google model incorrectly maps the plausible English response regret to the verbal infinitive regretter. The model makes the same error when responding to a description of a fly, returning the verb voler (to fly).

3.7 Discussion

We have shown that simply training RNN or BOW NLMs on six dictionaries yields a reverse dictionary that performs comparably to the leading commercial system, even with access to much less dictionary data. Indeed, the embedding models consistently return syntactically and semantically plausible responses, which are generally part of a more coherent and homogeneous set of candidates than those produced by the commercial systems. We also showed how the architecture can be easily extended to produce bilingual versions of the same model.

In the analyses performed thus far, we only test the dictionary embedding approach on tasks that it was trained to accomplish (mapping definitions or descriptions to words). In the next section, we explore whether the knowledge learned by dictionary embedding models can be effectively transferred to a novel task.

4 General Knowledge (crossword) Question Answering

The automatic answering of questions posed in natural language is a central problem of Artificial Intelligence. Although web search and IR techniques provide a means to find sites or documents related to language queries, at present, internet users requiring a specific fact must still sift through pages to locate the desired information.

Systems that attempt to overcome this, via fully open-domain or general knowledge question-answering (open QA), generally require large teams of researchers, modular design and powerful infrastructure, exemplified by IBM's Watson (Ferrucci et al., 2010). For this reason, much academic research focuses on settings in which the scope of the task is reduced. This has been achieved by restricting questions to a specific topic or domain (Mollá and Vicedo, 2007), allowing systems access to pre-specified passages of text from which the answer can be inferred (Iyyer et al., 2014; Weston et al., 2015), or centering both questions and answers on a particular knowledge base (Berant and Liang, 2014; Bordes et al., 2014).

In what follows, we show that the dictionary embedding models introduced in the previous sections may form a useful component of an open QA system. Given the absence of a knowledge base or web-scale information in our architecture, we narrow the scope of the task by focusing on general knowledge crossword questions. General knowledge (non-cryptic, or quick) crosswords appear in national newspapers in many countries. Crossword question answering is more tractable than general open QA for two reasons. First, models know the length of the correct answer (in letters), reducing the search space. Second, some crossword questions mirror definitions, in that they refer to fundamental properties of concepts (a twelve-sided shape) or request a category member (a city in Egypt).[13]

[13] As our interest is in the language understanding, we do not address the question of fitting answers into a grid, which is the main concern of end-to-end automated crossword solvers (Littman et al., 2002).

4.1 Evaluation

General knowledge crossword questions come in different styles and forms. We used the Eddie James crossword website to compile a bank of sentence-like general-knowledge questions.[14] Eddie James is one of the UK's leading crossword compilers, working for several national newspapers. Our long question set consists of the first 150 questions (starting from puzzle #1) from his general-knowledge crosswords, excluding clues of fewer than four words and those whose answer was not a single word (e.g. king james).

To evaluate models on a different type of clue, we also compiled a set of shorter questions based on the Guardian Quick Crossword. Guardian questions still require general factual or linguistic knowledge, but are generally shorter and somewhat more cryptic than the longer Eddie James clues.

[14] http://www.eddiejames.co.uk/
We again formed a list of 150 questions, beginning on 1 January 2015 and excluding any questions with multiple-word answers. For clear contrast, we excluded those few questions of length greater than four words. Of these 150 clues, a subset of 30 were single-word clues. All evaluation datasets are available online with the paper.

Table 5: Examples of the different question types in the crossword question evaluation dataset.

Test set           Word         Description
Long (150)         Baudelaire   "French poet and key figure in the development of Symbolism."
Short (120)        satanist     "devil devotee"
Single-Word (30)   guilt        "culpability"

As with the reverse dictionary experiments, candidates are extracted from models by inputting definitions and returning words corresponding to the closest embeddings in the target space. In this case, however, we only consider candidate words whose length matches the length specified in the clue.
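Restricting candidates to the clue's answer length is a one-line filter on top of the same nearest-neighbour retrieval; a sketch with the same illustrative names as before:

```python
import numpy as np

def crossword_candidates(clue_vectors, model, target_embeddings, vocab,
                         answer_length, k=5):
    """Encode the clue, rank all vocabulary words by cosine similarity to the
    encoded point, and keep only candidates with the required letter count."""
    point = model(clue_vectors)
    sims = (target_embeddings @ point) / (
        np.linalg.norm(target_embeddings, axis=1) * np.linalg.norm(point) + 1e-8)
    ranked = np.argsort(-sims)
    return [vocab[i] for i in ranked if len(vocab[i]) == answer_length][:k]
```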
4.2 Benchmarks and Comparisons

As with the reverse dictionary experiments, we compare RNN and BOW NLMs with a simple unsupervised baseline of elementwise addition of Word2Vec vectors in the embedding space (we discard the ineffective W2V mult baseline), again restricting candidates to words of the pre-specified length. We also compare to two bespoke online crossword-solving engines. The first, One Across (http://www.oneacross.com/), is the candidate generation module of the award-winning Proverb crossword system (Littman et al., 2002). Proverb, which was produced by academic researchers, has featured in national media such as New Scientist, and has beaten expert humans in crossword solving tournaments. The second comparison is with Crossword Maestro (http://www.crosswordmaestro.com/), a commercial crossword solving system that handles both cryptic and non-cryptic crossword clues (we focus only on the non-cryptic setting), and has also been featured in national media.[15] We are unable to compare against a third well-known automatic crossword solver, Dr. Fill (Ginsberg, 2011), because code for Dr. Fill's candidate-generation module is not readily available. As with the RNN and baseline models, when evaluating existing systems we discard candidates whose length does not match the length specified in the clue.

Certain principles connect the design of the existing commercial systems and differentiate them from our approach. Unlike the NLMs, they each require query-time access to large databases containing common crossword clues, dictionary definitions, the frequency with which words typically appear as crossword solutions, and other hand-engineered and task-specific components (Littman et al., 2002; Ginsberg, 2011).

[15] See e.g. http://www.theguardian.com/crosswords/crossword-blog/2012/mar/08/crossword-blog-computers-crack-cryptic-clues

4.3 Results

The performance of models on the various question types is presented in Table 6. When evaluating the two commercial systems, One Across and Crossword Maestro, we have access to web interfaces that return up to approximately 100 candidates for each query, so we can only reliably record membership of the top ten (accuracy@10).

Table 6: Performance of different models on crossword questions of different length. Each cell shows average rank, accuracy@10/100, and rank variance. The two commercial systems are evaluated via their web interfaces, so only accuracy@10 can be reported in those cases.

                    Long (150)           Short (120)          Single-Word (30)
One Across          .39/                 .68/                 .70/
Crossword Maestro   .27/                 .43/                 .73/
W2V add             42  .31/.63   92     11  .50/.78   66     2   .79/.90   45
RNN cosine          15  .43/.69   108    22  .39/.67   117    72  .31/.52   187
RNN w2v cosine      4   .61/.82   60     7   .56/.79   60     12  .48/.72   116
RNN ranking         6   .58/.84   48     10  .51/.73   57     12  .48/.69   67
RNN w2v ranking     3   .62/.80   61     8   .57/.78   49     12  .48/.69   114
BOW cosine          4   .60/.82   54     7   .56/.78   51     12  .45/.72   137
BOW w2v cosine      4   .60/.83   56     7   .54/.80   48     3   .59/.79   111
BOW ranking         5   .62/.87   50     8   .58/.83   37     8   .55/.79   39
BOW w2v ranking     5   .60/.86   48     8   .56/.83   35     4   .55/.83   43

On the long questions, we observe a clear advantage for all dictionary embedding models over the commercial systems and the simple unsupervised baseline. Here, the best performing NLM (RNN with Word2Vec input embeddings and ranking loss) ranks the correct answer third on average, and in the top-ten candidates over 60% of the time.

As the questions get shorter, the advantage of the embedding models diminishes. Both the unsupervised baseline and One Across answer the short questions with comparable accuracy to the RNN and BOW models. One reason for this may be the difference in form and style between the shorter clues and the full definitions or encyclopedia sentences in the dictionary training data. As the length of the clue decreases, finding the answer often reduces to generating synonyms (culpability - guilt) or category members (tall animal - giraffe). The commercial systems can retrieve good candidates for such clues among their databases of entities, relationships and common crossword answers. Unsupervised Word2Vec representations are also known to encode these sorts of relationships (even after elementwise addition for short sequences of words) (Mikolov et al., 2013). This would also explain why the dictionary embedding models with pre-trained (Word2Vec) input embeddings outperform those with learned embeddings, particularly for the shortest questions.

4.4 Qualitative Analysis

A better understanding of how the different models arrive at their answers can be gained from considering specific examples, as presented in Table 7. The first three examples show that, despite the apparently superficial nature of its training data (definitions and introductory sentences), embedding models can answer questions that require factual knowledge about people and places. Another notable characteristic of these models is the consistent semantic appropriateness of the candidate set. In the first case, the top five candidates are all mountains, valleys or places in the Alps; in the second, they are all biblical names. In the third, the RNN model retrieves currencies, in this case performing better than the BOW model, which retrieves entities of various types associated with the Netherlands. Generally speaking (as can be observed via the web demo), the 'smoothness' or consistency in candidate generation of the dictionary embedding models is greater than that of the commercial systems. Despite its simplicity, the unsupervised W2V addition method is at times also surprisingly effective, as shown by the fact that it returns Joshua in its top candidates for the third query.

The final example in Table 7 illustrates the surprising power of the BOW model. In the training data there is a single definition for the correct answer Schoenberg: United States composer and musical theorist (born in Austria) who developed atonal composition. The only word common to both the query and the definition is 'composer' (there is no tokenization that allows the BOW model to directly connect atonal and atonality). Nevertheless, the model is able to infer the necessary connections between the concepts in the query and the definition to return Schoenberg as the top candidate.

Despite such cases, it remains an open question whether, with more diverse training data, the world knowledge required for full open QA (e.g. secondary facts about Schoenberg, such as his family) could be encoded and retained as weights in a (larger) dynamic network, or whether it will be necessary to combine the RNN with an external memory that is less frequently (or never) updated. This latter approach has begun to achieve impressive results on certain QA and entailment tasks (Bordes et al., 2014; Graves et al., 2014; Weston et al., 2015).

5 Conclusion

Dictionaries exist in many of the world's languages. We have shown how these lexical resources can constitute valuable data for training the latest neural language models to interpret and represent the meaning of phrases and sentences.
Table 7: Responses from different models to example crossword clues. In each case the model output is filtered to exclude any candidates that are not of the same length as the correct answer. BOW and RNN models are trained without Word2Vec input embeddings and with the cosine loss.

Input description: "Swiss mountain peak famed for its north face (5)"
  One Across: noted, front, Eiger, crown, fount
  Crossword Maestro: after, favor, ahead, along, being
  BOW: Eiger, Crags, Teton, Cerro, Jebel
  RNN: Eiger, Aosta, Cuneo, Lecco, Tyrol

Input description: "Old Testament successor to Moses (6)"
  One Across: Joshua, Exodus, Hebrew, person, across
  Crossword Maestro: devise, Daniel, Haggai, Isaiah, Joseph
  BOW: Isaiah, Elijah, Joshua, Elisha, Yahweh
  RNN: Joshua, Isaiah, Gideon, Elijah, Yahweh

Input description: "The former currency of the Netherlands (7)"
  One Across: Holland, general, Lesotho, qondam
  Crossword Maestro: Holland, ancient, earlier, onetime
  BOW: Guilder, Holland, Drenthe, Utrecht, Naarden
  RNN: Guilder, Escudos, Pesetas, Someren, Florins

Input description: "Arnold, 20th Century composer pioneer of atonality (10)"
  One Across: surrealism, labor party, tone musics, introduced, Schoenberg
  Crossword Maestro: disharmony, dissonance, bring about, constitute, trigger off
  BOW: Schoenberg, Christleib, Stravinsky, Elderfield, Mendelsohn
  RNN: Mendelsohn, Williamson, Huddleston, Mandelbaum, Zimmerman

While humans use the phrasal definitions in dictionaries to better understand the meaning of words, machines can use the words to better understand the phrases. We used two dictionary embedding architectures - a recurrent neural network architecture with a long short-term memory, and a simpler linear bag-of-words model - to explicitly exploit this idea.

On the reverse dictionary task that mirrors its training setting, NLMs that embed all known concepts in a continuous-valued vector space perform comparably to the best known commercial applications despite having access to many fewer definitions. Moreover, they generate smoother sets of candidates and require no linguistic pre-processing or task-specific engineering. We also showed how the description-to-word objective can be used to train models useful for other tasks. NLMs trained on the same data can answer general-knowledge crossword questions, and indeed outperform commercial systems on questions containing more than four words. While our QA experiments focused on crosswords, the results suggest that a similar embedding-based approach may ultimately lead to improved output from more general QA and dialog systems and information retrieval engines in general.

We make all code, training data, evaluation sets and both of our linguistic tools publicly available online for future research. In particular, we propose the reverse dictionary task as a comparatively general-purpose and objective way of evaluating how well models compose lexical meaning into phrase or sentence representations (whether or not they involve training on definitions directly).

In the next stage of this research, we will explore ways to enhance the NLMs described here, especially in the question-answering context. The models are currently not trained on any question-like language, and would conceivably improve on exposure to such linguistic forms. We would also like to understand better how BOW models can perform so well with no 'awareness' of word order, and whether there are specific linguistic contexts in which models like RNNs or others with the power to encode word order are indeed necessary. Finally, we intend to explore ways to endow the model with richer world knowledge. This may require the integration of an external memory module, similar to the promising approaches proposed in several recent papers (Graves et al., 2014; Weston et al., 2015).

Acknowledgments

KC and YB acknowledge the support of the following organizations: NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR.
FH and AK were supported by a Google Faculty Research Award, and AK further by a Google European Fellowship.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the Association for Computational Linguistics.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).

Slaven Bilac, Timothy Baldwin, and Hozumi Tanaka. 2003. Improving dictionary accessibility by maximizing use of available knowledge. Traitement Automatique des Langues, 44(2):199–224.

Slaven Bilac, Wataru Watanabe, Taiichi Hashimoto, Takenobu Tokunaga, and Hozumi Tanaka. 2004. Dictionary search based on the target word description. In Proceedings of NLP 2004.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Proceedings of EMNLP.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems, pages 1853–1861.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2014. Retrofitting word vectors to semantic lexicons. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79.

Matthew L. Ginsberg. 2011. Dr. Fill: Crosswords and an implemented solver for singly weighted CSPs. Journal of Artificial Intelligence Research, pages 851–886.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2014. BilBOWA: Fast bilingual distributed representations without word alignments. In Proceedings of the NIPS Deep Learning Workshop.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Karl Moritz Hermann and Phil Blunsom. 2013. Multilingual distributed representations without word alignment. In Proceedings of ICLR.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the Association for Computational Linguistics.

Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A neural network for factoid question answering over paragraphs. In Proceedings of EMNLP.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Association for Computational Linguistics.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2015. Unifying visual-semantic embeddings with multimodal neural language models. Transactions of the Association for Computational Linguistics, to appear.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING.

Geoffrey Leech, Roger Garside, and Michael Bryant. 1994. CLAWS4: The tagging of the British National Corpus. In Proceedings of COLING.

Michael L. Littman, Greg A. Keim, and Noam Shazeer. 2002. A probabilistic approach to solving crossword puzzles. Artificial Intelligence, 134(1):23–55.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of INTERSPEECH 2010.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Diego Mollá and José Luis Vicedo. 2007. Question answering in restricted domains: An overview. Computational Linguistics, 33(1):41–61.

Ryan Shaw, Anindya Datta, Debra VanderMeer, and Kaushik Dutta. 2013. Building a scalable database-driven reverse dictionary. Knowledge and Data Engineering, IEEE Transactions on, 25(3):528–540.

Ivan Vulic, Wim De Smet, and Marie-Francine Moens. 2011. Identifying word translations from comparable corpora using latent topic models. In Proceedings of the Association for Computational Linguistics.

Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Michael Zock and Slaven Bilac. 2004. Word lookup on the basis of associations: From an idea to a roadmap. In Proceedings of the ACL Workshop on Enhancing and Using Electronic Dictionaries.