Transactions of the Association for Computational Linguistics, vol. 2, pp. 531–545, 2014. Action Editor: Janyce Wiebe.
Submission batch: 3/2014; Revision batch 9/2014; Published 12/2014. ©2014 Association for Computational Linguistics.
A Large Scale Evaluation of Distributional Semantic Models: Parameters, Interactions and Model Selection

Gabriella Lapesa²,¹
¹Universität Osnabrück
Institut für Kognitionswissenschaft
Albrechtstr. 28, Osnabrück, Germany
gabriella.lapesa@fau.de

Stefan Evert²
²FAU Erlangen-Nürnberg
Professur für Korpuslinguistik
Bismarckstr. 6, Erlangen, Germany
stefan.evert@fau.de

Abstract

This paper presents the results of a large-scale evaluation study of window-based Distributional Semantic Models on a wide variety of tasks. Our study combines a broad coverage of model parameters with a model selection methodology that is robust to overfitting and able to capture parameter interactions. We show that our strategy allows us to identify parameter configurations that achieve good performance across different datasets and tasks¹.

¹ The analysis presented in this paper is complemented by supplementary materials, which are available for download at http://www.linguistik.fau.de/dsmeval/. This page will also be kept up to date with the results of follow-up experiments.

1 Introduction

Distributional Semantic Models (DSMs) are employed to produce semantic representations of words from co-occurrence patterns in texts or documents (Sahlgren, 2006; Turney and Pantel, 2010). Building on the Distributional Hypothesis (Harris, 1954), DSMs quantify the amount of meaning shared by words as the degree of overlap of the sets of contexts in which they occur.

A widely used approach operationalizes the set of contexts as co-occurrences with other words within a certain window (e.g., 5 words). A window-based DSM can be represented as a co-occurrence matrix in which rows correspond to target words, columns correspond to context words, and cells store the co-occurrence frequencies of target words and context words. The co-occurrence information is usually weighted by some scoring function and the rows of the matrix are normalized. Since the co-occurrence matrix tends to be very large and sparsely populated, dimensionality reduction techniques are often used to obtain a more compact representation. Landauer and Dumais (1997) claim that dimensionality reduction also improves the semantic representation encoded in the co-occurrence matrix. Finally, distances between the row vectors of the matrix are computed and – according to the Distributional Hypothesis – interpreted as a correlate of the semantic similarities between the corresponding target words.

The construction and use of a DSM involves many design choices, such as: selection of a source corpus, size of the co-occurrence window; choice of a suitable scoring function, possibly combined with an additional transformation; whether to apply dimensionality reduction, and the number of reduced dimensions; metric for measuring distances between vectors. Different design choices – technically, the DSM parameters – can result in quite different similarities for the same words (Sahlgren, 2006).

DSMs have already proven successful in modeling lexical meaning: they have been applied in Natural Language Processing (Schütze, 1998; Lin, 1998), Information Retrieval (Salton et al., 1975), and Cognitive Modeling (Landauer and Dumais, 1997; Lund and Burgess, 1996; Padó and Lapata, 2007; Baroni and Lenci, 2010). Recently, the field of Distributional Semantics has moved towards new challenges, such as predicting brain activation (Mitchell et al., 2008; Murphy et al., 2012; Bullinaria and Levy, 2013) and modeling meaning composition (Baroni et al., 2014, and references therein).

Despite such progress, a full understanding of the different parameters governing a DSM and their influence on model performance has not been achieved yet. The present paper is a contribution towards this
Downloaded from http://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00201/1566943/tacl_a_00201.pdf by guest on 07 September 2023
goal: it presents the results of a large-scale evaluation of window-based DSMs on a wide variety of semantic tasks. More complex tasks building on distributional representations (e.g., vector composition or relational analogies) will also benefit from our findings, allowing them to choose optimal parameters for the underlying word-level DSMs.

At the level of parameter coverage, this work evaluates most of the relevant parameters considered in comparable state-of-the-art studies (Bullinaria and Levy, 2007; Bullinaria and Levy, 2012); it also introduces an additional one, which has received little attention in the literature: the index of distributional relatedness, which connects distances in the DSM space to semantic similarity. We compare direct use of distance measures to neighbor rank. Neighbor rank has already been successfully used to model priming effects with DSMs (Hare et al., 2009; Lapesa and Evert, 2013); the present study extends its evaluation to standard tasks. We show that neighbor rank consistently improves the performance of DSMs compared to distance, but the degree of this improvement varies from task to task.

At the level of task coverage, the present study includes most of the standard datasets used in comparative studies (Bullinaria and Levy, 2007; Baroni and Lenci, 2010; Bullinaria and Levy, 2012). We consider three types of evaluation tasks: multiple choice (TOEFL test), correlation to human similarity ratings, and semantic clustering.

At the level of methodology, our work adopts the approach to model selection proposed by Lapesa and Evert (2013), which is described in detail in section 4. Our results show that parameter interactions play a crucial role in determining model performance.

This paper is structured as follows. Section 2 briefly reviews state-of-the-art studies on DSM evaluation. Section 3 describes the experimental setting in terms of tasks and evaluated parameters. Section 4 outlines our methodology for model selection. In section 5 we report the results of our evaluation study. Finally, section 6 summarizes the main findings and sketches ongoing and future work.

2 Previous work

In this section we summarize the results of previous evaluation studies of Distributional Semantic Models. Among the existing work on DSM evaluation, we can identify two main types of approaches.

One possibility is to evaluate a distributional model with certain new features on a range of tasks, applying little or no parameter tuning, and to compare it to competing models; examples are Padó and Lapata's (2007) Dependency Vectors as well as Baroni and Lenci's (2010) Distributional Memory. Since both studies focus on testing a single new model with fixed parameters (or a small number of new models), we will not go into further detail concerning them.

Alternatively, the evaluation may be conducted via incremental tuning of parameters, which are tested sequentially to identify their best performing values on a number of tasks, as has been done by Bullinaria and Levy (2007; 2012), Polajnar and Clark (2014), and Kiela and Clark (2014).

Bullinaria and Levy (2007) report on a systematic study of the impact of a number of parameters (shape and size of the co-occurrence window, distance metric, association score for co-occurrence counts) on a number of tasks (including the TOEFL synonym task, which is also evaluated in our study). Evaluated models were based on the British National Corpus. Bullinaria and Levy (2007) found that vectors scored with Pointwise Mutual Information, built from very small context windows with as many context dimensions as possible, and using cosine distance ensured the best performance across all tasks at issue.

Bullinaria and Levy (2012) extend the evaluation reported in Bullinaria and Levy (2007). Starting from the optimal configuration identified in the first study, they test the impact of three further parameters: application of stop-word lists, stemming, and dimensionality reduction using Singular Value Decomposition. DSMs were built from the ukWaC corpus, and evaluated on a number of tasks (including TOEFL and noun clustering on the dataset of Mitchell et al. (2008), also evaluated in our study). Neither stemming nor the application of stop-word lists resulted in a significant improvement of DSM performance. Positive results were achieved by performing SVD dimensionality reduction and discarding the initial components of the reduced matrix.

Polajnar and Clark (2014) evaluate the impact of context selection (for each target, only the most relevant context words are selected, and the remaining vector entries are set to zero) and vector normalization (used to vary model sparsity and the range of values of the DSM vectors) in standard tasks related to word and phrase similarity. Context selection and normalization improved DSM performance on word similarity and compositional tasks, both with and without SVD.

Kiela and Clark (2014) evaluate window-based and dependency-based DSMs on a variety of tasks related to word and phrase similarity. A wide range of parameters are involved in this study: source corpus, window size, number of context dimensions, use of stemming, lemmatization and stop-words, similarity metric, score for feature weighting. Best results were obtained with large corpora and small window sizes, around 50000 context dimensions, stemming, Positive Mutual Information, and a mean-adjusted version of cosine distance.

Even though we adopt a different approach than these incremental tuning studies, there is considerable overlap in the evaluated parameters and tasks, which will be pointed out in section 3.

An alternative to incremental tuning is the methodology proposed by Lapesa and Evert (2013) and Lapesa et al. (2014). They systematically test a large number of parameter combinations and use linear regression to determine the importance of individual parameters and their interactions. As their evaluation methodology is adopted in the present work and described in more detail in section 4, we will not discuss it here and instead focus on the main results. DSMs are evaluated in the task of modeling semantic priming. This task, albeit not standard in DSM evaluation, is of great interest as priming experiments provide a window into the structure of the mental lexicon. Both studies showed that neighbor rank outperforms distance in capturing priming effects. They also found that the scoring function has a crucial influence on model performance and interacts strongly with an additional logarithmic transformation. Lapesa et al. (2014) focused on a comparison of syntagmatic and paradigmatic relations. They found that discarding the initial SVD dimensions is only beneficial for certain relations, suggesting that these dimensions may encode syntagmatic information if larger context windows are used. Concerning the scope of the evaluation, both studies consider a wide range of parameters² but target only a very specific task. Our study aims at extending their parameter set and evaluation methodology to standard tasks.

² The parameter set of Lapesa et al. (2014) fully corresponds to the one used in the present study.

3 Experimental setting

3.1 Tasks

The evaluation of DSMs has been conducted on three standard types of semantic tasks.

The first task is a multiple choice setting: distributional relatedness between a target word and two or more other words is used to select the best, i.e. most similar candidate. Performance in this task is quantified by the decision accuracy. The evaluated dataset is the well-known TOEFL multiple-choice synonym test (Landauer and Dumais, 1997), which was also included in the studies of Bullinaria and Levy (2007; 2012) and Kiela and Clark (2014).

In the second task, we measure the correlation between distributional relatedness and native speaker judgments of semantic similarity or relatedness. Following previous studies (Baroni and Lenci, 2010; Padó and Lapata, 2007), performance in this task is quantified in terms of Pearson correlation.³ Evaluated datasets are the Rubenstein and Goodenough dataset (RG65) of 65 noun pairs (Rubenstein and Goodenough, 1965), also evaluated by Kiela and Clark (2014), and the WordSim-353 dataset (WS353) of 353 noun pairs (Finkelstein et al., 2002), included in the study of Polajnar and Clark (2014).

³ Some other evaluation studies adopt Spearman's rank correlation ρ, which is more appropriate if there is a non-linear relation between distributional relatedness and the human judgements. We computed both coefficients in our experiments and decided to report Pearson's r for three reasons: (i) Baroni and Lenci (2010) already list r scores for a wide range of DSMs in this task; (ii) in most experimental runs, ρ and r values were quite similar, with a tendency for ρ to be slightly lower than r (difference of means RG65: 0.001; WS353: 0.02); (iii) linear regression analyses for ρ and r showed the same trends and patterns for all DSM parameters.

The third evaluation task is noun clustering: distributional similarity between words is used to assign them to a pre-defined number of semantic classes. Performance in this task is quantified in terms of cluster purity. Clustering is performed with an algorithm based on partitioning around medoids (Kaufman and Rousseeuw, 1990, Ch. 2), using the
R function pam with standard settings.⁴ Evaluated datasets for the clustering task are the Almuhareb-Poesio set (henceforth, AP) containing 402 nouns grouped into 21 classes (Almuhareb, 2006); the Battig set, containing 83 concrete nouns grouped into 10 classes (Van Overschelde et al., 2004); the ESSLLI 2008 set, containing 44 concrete nouns grouped into 6 classes;⁵ and the Mitchell set, containing 60 nouns grouped into 12 classes (Mitchell et al., 2008), also employed by Bullinaria and Levy (2012).

⁴ Other clustering studies have often been carried out using the CLUTO toolkit (Karypis, 2003) with standard settings, which corresponds to spectral clustering of the distributional vectors. Unlike pam, which operates on a pre-computed dissimilarity matrix, CLUTO cannot be used to test different distance measures or neighbor rank. Comparative clustering experiments showed no substantial differences for cosine similarity; in the rank-based setting, pam consistently outperformed CLUTO clustering.
⁵ http://wordspace.collocations.de/doku.php/data:esslli2008:concretenounscategorization

3.2 Parameters

DSMs evaluated in this paper belong to the class of window-based models. All models use the same large vocabulary of target words (27522 lemma types), which is based on the vocabulary of Distributional Memory (Baroni and Lenci, 2010) and has been extended to cover all items in our datasets. Distributional models were built using the UCS toolkit⁶ and the wordspace package for R (Evert, 2014). The following parameters have been evaluated:⁷

• Source Corpus (abbreviated in the plots as corpus): the corpora from which we compiled our DSMs differ in both size and quality, and they represent standard choices in DSM evaluation. Evaluated corpora in this study are: British National Corpus⁸; ukWaC; WaCkypedia_EN⁹;
• Context window:
  – Direction* (win.direction): we collected co-occurrence counts both using a directed window (i.e., separate co-occurrence counts for context words to the left and to the right of the target) and an undirected window (no distinction between left and right context);
  – Size (win.size)*†: we expect this parameter to be crucial as it determines the amount of shared context involved in the computation of similarity. We tested windows of 1, 2, 4, 8, and 16 words to the left and right of the target, limited by sentence boundaries;
• Context selection: Context words are filtered by part-of-speech (nouns, verbs, adjectives, and adverbs). From the full co-occurrence matrix, we further select dimensions (i.e., columns, corresponding to context words) according to the following two parameters:
  – Criterion for context selection (criterion): marginal frequency; number of nonzero co-occurrence counts;
  – Threshold for context selection (context.dim)*†: from the context dimensions ranked according to this criterion, we select the top 5000, 10000, 20000, 50000 or 100000 dimensions;
• Score for feature weighting (score)*†: we compare plain co-occurrence frequency to tf.idf and to the following association measures: Dice coefficient; simple log-likelihood; Mutual Information (MI); t-score; z-score;¹⁰
• Feature transformation (transformation): to reduce the skewness of feature scores, it is possible to apply a transformation function. We evaluate square root, sigmoid (tanh) and logarithmic transformation vs. no transformation.

⁶ http://www.collocations.de/software.html
⁷ Parameters also evaluated by Bullinaria and Levy (2007; 2012), albeit with a different range of values, are marked with an asterisk (*); those evaluated by Kiela and Clark (2014) and/or Polajnar and Clark (2014) are marked with a dagger (†).
⁸ http://www.natcorp.ox.ac.uk/
⁹ Both ukWaC and WaCkypedia_EN are available from http://wacky.sslmit.unibo.it/doku.php?id=corpora.
¹⁰ See Evert (2008) for a thorough description of the association measures and details on their calculation (Fig. 58.4 on p. 1225 and Fig. 58.9 on p. 1235). We selected these measures because they have widely been used in previous work on DSMs (tf.idf, MI and log-likelihood) or are popular choices for the identification of multiword expressions. Based on statistical hypothesis tests, log-likelihood, t-score and z-score measure the significance of association between a target and feature term; MI shows how much more frequently they co-occur than expected by chance; and Dice captures the mutual predictability of target and feature term. Note that we compute sparse versions of the association measures with negative values clamped to zero in order to preserve the sparseness of the co-occurrence matrix. For example, our MI measure corresponds to Positive MI in the other evaluation studies.
• Distance metric (metric)*†: cosine distance (i.e., angle between vectors); Manhattan distance¹¹;
• Dimensionality reduction: we optionally apply Singular Value Decomposition to 1000 dimensions, using randomized SVD (Halko et al., 2009) for performance reasons. For the SVD-based models, there are two additional parameters:
  – Number of latent dimensions (red.dim): out of the 1000 SVD dimensions, we select the first 100, 300, 500, 700, 900 dimensions (i.e. those with the largest singular values);
  – Number of skipped dimensions (dim.skip): when selecting the reduced dimensions, we exclude the first 0, 50 or 100 dimensions. This parameter has already been evaluated by Bullinaria and Levy (2012), who achieved best performance by discarding the initial components of the reduced matrix, i.e., those with the highest variance.
• Index of distributional relatedness (rel.index). Given two words a and b represented in a DSM, we consider two alternative ways of quantifying the degree of relatedness between a and b. The first option (and standard in DSM modeling) is to compute the distance (cosine or Manhattan) between the vectors of a and b. The alternative choice, proposed in this work, is based on neighbor rank. Neighbor rank has already been successfully used for capturing priming effects (Hare et al., 2009; Lapesa and Evert, 2013; Lapesa et al., 2014) and for quantifying the semantic relatedness between derivationally related words (Zeller et al., 2014); however, its performance on standard tasks has not been tested yet. For the TOEFL task, we compute rank as the position of the target among the nearest neighbors of each synonym candidate.¹² For the correlation and clustering tasks, we compute a symmetric rank measure as the average of log rank(a,b) and log rank(b,a). An exploration of the effects of directionality on the prediction of similarity ratings and its use in clustering tasks (i.e., experiments involving rank(a,b) and rank(b,a) as indexes of relatedness) is left for future work.

¹¹ In this study, the range of evaluated metrics is restricted to cosine vs. manhattan for a number of reasons: (i) cosine is considered a standard choice in DSM modeling and is adopted by most evaluation studies (Bullinaria and Levy, 2007; Bullinaria and Levy, 2012; Polajnar and Clark, 2014); (ii) for our normalized vectors, Euclidean distance is fully equivalent to cosine; (iii) preliminary experiments with the maximum distance measure resulted in very low performance.
¹² Note that using the positions of the synonym candidates among the neighbors of the target would have been equivalent to direct use of the distance measure, since the transformation from distance to rank is monotonic in this case.

4 Model selection

As has already been pointed out in the introductory section, one of the main open issues in DSM evaluation is the need for a systematic investigation of the interactions between DSM parameters. Another issue that large-scale evaluation studies face is overfitting: if a large number of models (i.e. parameter combinations) is evaluated, it makes little sense to look at the best model (i.e. the best parameter combination), which will be subject to heavy overfitting, especially on small datasets such as TOEFL. The methodology for model selection applied in this work successfully addresses both issues.

In our evaluation study, we tested all possible combinations of the parameters described in section 3.2. This resulted in a total of 537600 model runs (33600 in the unreduced setting, 504000 in the dimensionality-reduced setting). The models were generated and evaluated on a large HPC cluster within approximately 5 weeks.

Following Lapesa and Evert (2013), DSM parameters are considered predictors of model performance: we analyze the influence of individual parameters and their interactions using general linear models with performance (accuracy, correlation, purity) as a dependent variable and the model parameters as independent variables, including all two-way interactions. More complex interactions are beyond the scope of this paper and are left for future work. Analysis of variance – which is straightforward for our full factorial design – is used to quantify the importance of each parameter or interaction. Robust optimal parameter settings are identified with the help of effect displays (Fox, 2003), which show the partial effect of one or two parameters by marginalizing over all other parameters. Unlike coefficient estimates, they allow an intuitive interpretation of the effect sizes of categorical variables irrespective of the dummy coding scheme used.
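The regression-based model selection described in section 4 can be illustrated with a small sketch. The study itself relied on R; the Python/NumPy code below is only a minimal illustration under stated assumptions, not the authors' implementation. The toy parameter grid, the synthetic performance values and all function names (dummy_code, design_matrix, adjusted_r2, feature_ablation) are hypothetical. It fits a linear model with main effects and all two-way interactions, then measures for each parameter the drop in adjusted R² when the parameter and all its interactions are left out.

```python
import itertools
import numpy as np

# Hypothetical toy parameter grid (the grid of the study is far larger).
FACTORS = {
    "score": ["frequency", "MI", "simple-ll"],
    "transformation": ["none", "log"],
    "metric": ["cosine", "manhattan"],
    "criterion": ["frequency", "nonzero"],
}

def dummy_code(levels, values):
    """Treatment coding: one 0/1 column per non-reference level."""
    return np.column_stack([(values == lv).astype(float) for lv in levels[1:]])

def design_matrix(runs, names):
    """Intercept, main effects and all two-way interactions of `names`."""
    n = len(next(iter(runs.values())))
    main = {f: dummy_code(FACTORS[f], runs[f]) for f in names}
    blocks = [np.ones((n, 1))] + [main[f] for f in names]
    for a, b in itertools.combinations(names, 2):
        # Interaction block: all pairwise products of the two dummy blocks.
        blocks.append((main[a][:, :, None] * main[b][:, None, :]).reshape(n, -1))
    return np.hstack(blocks)

def adjusted_r2(X, y):
    """Adjusted R2 of an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = float(((y - X @ beta) ** 2).sum())
    ss_tot = float(((y - y.mean()) ** 2).sum())
    n, p = X.shape  # p includes the intercept column
    return 1.0 - (ss_res / (n - p)) / (ss_tot / (n - 1))

def feature_ablation(runs, y):
    """Drop in adjusted R2 when a parameter and its interactions are removed."""
    names = list(FACTORS)
    full = adjusted_r2(design_matrix(runs, names), y)
    drop = {}
    for f in names:
        rest = [g for g in names if g != f]
        drop[f] = full - adjusted_r2(design_matrix(runs, rest), y)
    return full, drop

# Synthetic full-factorial "evaluation": 3 noisy replicates per configuration.
rng = np.random.default_rng(0)
grid = list(itertools.product(*FACTORS.values())) * 3
runs = {f: np.array([cfg[i] for cfg in grid]) for i, f in enumerate(FACTORS)}
perf = 0.5 + 0.1 * (runs["metric"] == "cosine")
perf = perf + 0.2 * ((runs["score"] == "simple-ll") & (runs["transformation"] == "log"))
perf = perf + rng.normal(0.0, 0.01, size=len(grid))

full_r2, ablation = feature_ablation(runs, perf)
for name, value in sorted(ablation.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} partial R2 = {100 * value:5.1f}%")
```

In this toy setup the score parameter has no main effect at all and matters only through its interaction with the transformation, yet feature ablation still ranks it highest. Looking at each parameter in isolation would miss this, which is the kind of pattern the interaction-aware analysis is meant to capture.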
5 Results

This section reports the results of the modeling experiments outlined in section 3. Table 1 summarizes the evaluation results: for each dataset, we report minimum, maximum and mean performance, comparing unreduced and reduced runs. The column Difference of Means shows the average difference in performance between an unreduced model and its reduced counterpart (with dimensionality reduction parameters set to the values of the general best setting identified in section 5.5) and the p-value¹³ of a Wilcoxon signed rank test with continuity correction. It is evident that dimensionality reduction improves model performances for all datasets¹⁴.

Dataset   Unreduced             Reduced               Difference
          Min    Max    Mean    Min    Max    Mean    of Means
TOEFL     25.0   87.5   63.9    18.7   98.7   64.4    −4.626***
RG65      0.01   0.88   0.59    0.00   0.89   0.63    −0.073***
WS353     0.00   0.73   0.39    0.00   0.73   0.43    −0.074***
AP        0.15   0.73   0.56    0.13   0.76   0.54     0.004 n.s.
BATTIG    0.28   0.99   0.77    0.23   0.99   0.78    −0.037***
ESSLLI    0.32   0.93   0.72    0.32   0.98   0.72    −0.003*
MITCH.    0.26   0.97   0.68    0.27   0.97   0.69    −0.031***

Table 1: Summary of performance

¹³ * = p < 0.05; *** = p < 0.001; n.s. = not significant.
¹⁴ Difference of means and Wilcoxon p-value on Spearman's rho for ratings datasets: RG65, −0.061***; WS353, −0.091***.

While the improvements are only minimal in some cases, dimensionality reduction never has a detrimental effect while offering practical advantages in memory usage and computation speed. Therefore, in our analysis, we focus on the runs involving dimensionality reduction. In the following subsections, we present detailed results for each of the three tasks. In each case, we first discuss the impact of DSM parameters on performance, and then describe the optimal parameter values.

5.1 TOEFL

In the TOEFL task, the linear model achieves an adjusted R² of 89%, showing that it explains the influence of model parameters on TOEFL accuracy very well. Figure 1 displays the ranking of the evaluated parameters according to their importance in a feature ablation setting. The R² values in the plots refer to the proportion of variance explained by the respective parameter together with all its interactions, corresponding to the reduction in adjusted R² if this parameter is left out. We do not rely on significance values for model selection because, given the large number of measurements, virtually all parameters have a highly significant effect.

Figure 1: TOEFL, parameters and feature ablation (partial R² per parameter, axis 0 to 30; parameters as listed on the plot axis: criterion, rel.index, win.direction, win.size, context.dim, corpus, red.dim, dim.skip, transformation, score, metric)

Table 2 reports all parameter interactions for the TOEFL task that explain more than 0.5% of the total variance (i.e. R² ≥ 0.5%), as well as the corresponding degrees of freedom (df) and R².

Interaction           df    R²
score:transf          18    7.42
metric:dim.skip        2    4.44
score:metric           6    1.77
metric:context.dim     4    0.98
win.size:transf       12    0.91
corpus:score          12    0.84
score:context.dim     24    0.64
metric:red.dim         4    0.63

Table 2: TOEFL task: interactions, R²

On the basis of their influence in determining model performance, we can identify three parameters that are crucial for the TOEFL task, and which will also turn out to be very influential in the other tasks at issue: distance metric, feature score and feature transformation.

The best distance metric is cosine distance: this is one of the consistent findings of our evaluation study and it is in accordance with Bullinaria and Levy (2007) and, to a lesser extent, Kiela and Clark (2014).¹⁵ Score and transformation always have a fundamental impact on model performance: these parameters affect the distributional space independently of tasks and datasets. We will show that they are systematically involved in a strong interaction and that it is possible to identify a score/transformation combination with robust performance across all tasks. The interaction between score and transformation is displayed in figure 2. The best results are achieved by association measures based on significance tests (simple-ll, t-score, z-score), followed by MI. This result is in line with previous studies (Bullinaria and Levy, 2012; Kiela and Clark, 2014), which found Pointwise MI or Positive MI to be the best feature scores. The best choice, simple log-likelihood, exhibits a strong variation in performance across different transformations. For all three significance measures, the best feature transformation is consistently a logarithmic transformation. Raw co-occurrence frequency, tf.idf and Dice only perform well in combination with a square root transformation.

¹⁵ In Kiela and Clark (2014), cosine is reported to be the best similarity metric, together with the correlation similarity metric (a mean-adjusted version of cosine similarity). The latter, however, turned out to be more robust across different corpora and weighting schemes.

The best window size, as shown in figure 3, is a 2-word window for all evaluated transformations.

Figure 2: TOEFL, score / transformation (x-axis: frequency, tf.idf, MI, Dice, simple-ll, t-score, z-score; lines: transformation none, log, root, sigmoid)
Figure 3: TOEFL, window size / transformation (x-axis: 1, 2, 4, 8, 16; lines: transformation none, log, root, sigmoid)
Figure 4: TOEFL, metric / n. of latent dim. (x-axis: 100, 300, 500, 700, 900; lines: metric cosine, manhattan)
Figure 5: TOEFL, metric / n. of skipped dim. (x-axis: 0, 50, 100; lines: metric cosine, manhattan)

The SVD parameters (number of latent dimensions and number of skipped dimensions) play a significant role in determining model performance. They are particularly important for the TOEFL task, but we will see that their explanatory power is also quite strong in the other tasks. Interestingly, they show a tendency to participate in interactions with other parameters, but do not interact among themselves. We display the interaction between metric and number of latent dimensions in figure 4: the steep performance increase for both metrics shows that the widely-used choice of 300 latent dimensions (Landauer and Dumais, 1997) is suboptimal for the TOEFL task. The best value in our experiment is 900 latent dimensions, and additional dimensions would probably lead to a further improvement. The interaction between metric and number of skipped dimensions is displayed in figure 5. While manhattan performs poorly no matter how many dimensions are skipped, cosine is positively affected by skipping 100 and (to a lesser extent) 50 dimensions. The latter trend has already been discussed by Bullinaria and Levy (2012).

Inspection of the remaining interaction plots, not shown here for reasons of space, reveals that the best
DSM performance in the TOEFL task is achieved by selecting ukWaC as corpus and 10000 original dimensions. The index of distributional relatedness has a very low explanatory power in the TOEFL task, with neighbor rank being the best choice (see plots 16 and 17 in section 5.4).

Given the minimal explanatory power of the direction of the context window and the criterion for context selection in all three tasks, we will not further consider these parameters in our analysis. We recommend setting them to an "unmarked" option: undirected and frequency.

The best setting identified by inspecting all effects is shown in table 5, together with its performance and with the performance of the (over-trained) best model in this task. Parameters of the latter are reported in appendix A.

5.2 Ratings

Figure 6 displays the importance of the evaluated parameters in the task of predicting similarity ratings. Parameters are ranked according to the average feature ablation R² values across both datasets (adj. R² of the full linear model: RG65: 86%; WS353: 90%).

Figure 6: Ratings, parameters and feature ablation (partial R² per parameter, axis 0 to 30, for RG65 and WS353; parameters as listed on the plot axis: criterion, win.direction, context.dim, dim.skip, win.size, red.dim, rel.index, metric, transformation, corpus, score)

Table 3 reports all interactions that explain more than 0.5% of the total variance in both datasets. For reasons of space, we only discuss the interactions and best parameter values on RG65; the corresponding plots for WS353 are shown only if there are substantial differences.

Interaction           df    RG65    WS353
score:transf          18    10.28   8.66
metric:red.dim         4     2.18   1.42
score:metric           6     1.91   0.59
win.size:transf       12     1.43   1.01
corpus:metric          2     1.83   0.51
metric:context.dim     4     1.08   0.62
corpus:score          12     0.77   0.82
win.size:score        24     0.77   0.69
score:dim.skip        12     0.58   0.85

Table 3: Ratings datasets: interactions, R²

As already noted for the TOEFL task, score and transformation have a large explanatory power and they are involved in a strong interaction showing the same tendencies and optimal values already identified for TOEFL. For reasons of space, we do not elaborate on this interaction here.

The analysis of the main effects shows that for both datasets WaCkypedia is the best option as a source corpus, suggesting that this task benefits from a trade-off between quality and quantity (WaCkypedia being smaller and cleaner than ukWaC, but less balanced than the BNC).

Index of distributional relatedness plays a much more important role than for the TOEFL task, with neighbor rank clearly outperforming distance (see figures 16 and 17 and the discussion in section 5.4 for more details).

The choice of the optimal window size depends on transformation: on the RG65 dataset, figure 7 shows that for a logarithmic transformation – which we already identified as the best transformation in combination with significance association measures – the highest performance is achieved with a 4 word window. The corresponding effect display for WS353 (figure 8) suggests that a further small improvement may be obtained with an 8 word window in this case. One possible explanation for this observation is the different composition of the WS353 dataset, which includes examples of semantic relatedness beyond attributional similarity. The 4 word window is a robust choice across both datasets, though.

The number of latent dimensions is involved in a strong interaction with the distance metric (figure 9). Best results are achieved with the cosine metric and at least 300 latent dimensions, as well as 50 skipped dimensions. The interaction plot between metric and number of original dimensions in figure 10 shows that 50000 context dimensions are sufficient for good performance, and no further improvement can be expected from even higher-dimensional spaces.
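The ingredients of the setting discussed above (latent dimensions with an optional number of skipped initial dimensions, cosine distance as the angle between vectors, and the symmetric log-rank index of section 3.2) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the study used randomized SVD within R, whereas this sketch uses an exact SVD; ranks are computed only among the rows supplied rather than the full target vocabulary; ties are broken arbitrarily; and all function names (reduce_svd, cosine_distances, neighbor_ranks, symmetric_log_rank) are hypothetical.

```python
import numpy as np

def reduce_svd(M, n_latent, n_skip=0):
    """Project rows of M onto SVD dimensions n_skip .. n_skip + n_latent - 1."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)  # s is sorted, largest first
    keep = slice(n_skip, n_skip + n_latent)
    return U[:, keep] * s[keep]

def cosine_distances(X):
    """Angle in degrees between all pairs of row vectors (rows must be nonzero)."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = np.clip(unit @ unit.T, -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def neighbor_ranks(D):
    """ranks[a, b] = position of b among the neighbours of a (1 = nearest)."""
    D = np.array(D, dtype=float)      # work on a float copy
    n = D.shape[0]
    np.fill_diagonal(D, -np.inf)      # each word sorts first in its own row ...
    order = np.argsort(D, axis=1, kind="stable")
    ranks = np.empty((n, n), dtype=int)
    ranks[np.arange(n)[:, None], order] = np.arange(n)  # ... and so gets rank 0
    return ranks

def symmetric_log_rank(D):
    """Average of log rank(a, b) and log rank(b, a) for every pair of rows."""
    r = neighbor_ranks(D).astype(float)
    np.fill_diagonal(r, 1.0)          # self-pairs: log 1 = 0 (convention chosen here)
    logr = np.log(r)
    return (logr + logr.T) / 2.0

# Toy usage: a random stand-in for a weighted co-occurrence matrix.
rng = np.random.default_rng(1)
M = rng.random((8, 6))
X = reduce_svd(M, n_latent=3, n_skip=1)   # skip the first latent dimension
S = symmetric_log_rank(cosine_distances(X))
```

Note that rank(a, b) and rank(b, a) generally differ, because b may be the nearest neighbour of a while a has closer neighbours of its own; averaging the two log ranks yields a symmetric index that can be used wherever a symmetric dissimilarity is required.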
Figure 7: RG65, window size / transformation (x-axis: 1, 2, 4, 8, 16; lines: transformation none, log, root, sigmoid)
Figure 8: WS353, window size / transformation (x-axis: 1, 2, 4, 8, 16; lines: transformation none, log, root, sigmoid)
Figure 9: RG65, metric / n. latent dim. (x-axis: 100, 300, 500, 700, 900; lines: metric cosine, manhattan)
Figure 10: RG65, metric / n. context dimensions (x-axis: 5000, 10000, 20000, 50000, 100000; lines: metric cosine, manhattan)

Best settings for both datasets are summarized in table 5. Refer to appendix A for best models.

5.3 Clustering

Figure 11 displays the importance of the evaluated parameters in the clustering task (adj. R² of the full linear model: AP: 82%; BATTIG: 77%; ESSLLI: 58%; MITCHELL: 73%). Parameter ranking is determined by the average of the feature ablation R² values over all four datasets.

Figure 11: Clustering, parameters and feat. ablation (partial R² per parameter, axis 0 to 30, for AP, BATTIG, ESSLLI and MITCHELL; parameters as listed on the plot axis: criterion, win.direction, rel.index, context.dim, dim.skip, red.dim, win.size, metric, corpus, transformation, score)

Interaction         df    AP     BATTIG   ESSLLI   MITCHELL
score:transf        18    7.10   7.95     7.56     11.42
metric:red.dim       4    3.29   3.16     2.03      2.03
win.size:metric      4    2.22   1.26     2.97      2.72
win.size:transf     12    2.00   2.95     0.88      2.66
corpus:metric        2    1.42   2.91     2.79      1.11
metric:dim.skip      2    2.25   1.54     2.77      0.86
corpus:win.size      8    2.36   1.18     1.49      1.23
score:dim.skip      12    0.56   1.15     0.99      1.39
win.size:score      24    0.74   0.77     0.54      0.65

Table 4: Clustering task: interactions, R²

Table 4 reports all parameter interactions that explain more than 0.5% of the total variance for each of the four datasets.

In the following discussion, we focus on the AP dataset, which is larger and thus more reliable than the other three datasets. We mention remarkable differences between the datasets in terms of best parameter values. For a full overview of the best parameter setting for each dataset, see table 5.

As already discussed for TOEFL and the ratings task, we find score and transformation at the top of the feature ablation ranking. Table 4 confirms that the two parameters are involved in a strong interaction. The interaction plot (figure 12) shows the behavior we are already familiar with: significance
measures (simple-ll, t-score and z-score) reach the best performance in combination with log transformation: this combination is a robust choice also for the other datasets, with minor differences that can be observed in table 5.

The interaction between window size and metric is displayed in figure 13: best performance is achieved with a 2 or 4 word window in combination with cosine distance. Results on the other datasets suggest a preference for the 4 word window. This is confirmed by interaction plots with source corpus (figure 14), which also reveal that WaCkypedia is again the best compromise between size and quality.

A very clear picture concerning the number of skipped dimensions emerges from figure 15 and is the same for all datasets: skipping dimensions is not necessary to achieve good performance (even though skipping 50 dimensions turned out at least to be not detrimental for BATTIG and MITCHELL).

Further effect displays, not shown here for reasons of space, suggest that 300 or 500 latent dimensions – with some variation across the datasets (cf. table 5) – and a medium-sized co-occurrence matrix (20000 or 50000 dimensions) are needed to achieve good performance. Neighbor rank is the best choice as index of distributional relatedness (see section 5.4). See appendix A for best models.

[Figure 12: AP, score / transformation. x-axis: score (frequency, tf.idf, MI, Dice, simple-ll, t-score, z-score); lines: transformation (none, log, root, sigmoid)]
[Figure 13: AP, window size / metric. x-axis: window size (1, 2, 4, 8, 16); lines: metric (cosine, manhattan)]
[Figure 14: AP, corpus / window size. x-axis: window size (1, 2, 4, 8, 16); lines: corpus (bnc, wacky, ukwac)]
[Figure 15: AP, metric / n. of skipped dim. x-axis: 0, 50, 100; lines: metric (cosine, manhattan)]

5.4 Relatedness index

A novel contribution of our work is the systematic evaluation of a parameter that has received little attention in DSM research so far, and only in studies limited to a narrow choice of datasets (Lapesa and Evert, 2013; Lapesa et al., 2014; Zeller et al., 2014): the index of distributional relatedness.

The aim of this section is to provide a full overview of the impact of this parameter in our experiments. Despite the main focus of the paper on the reduced setting, in this section we also show results from the unreduced setting, for two reasons: first, since this parameter is relatively novel and evaluated here for the first time on standard tasks, we consider it necessary to provide a full picture concerning its behavior; second, relatedness index turned out to be much more influential in the unreduced setting than in the reduced one.

Figures 16 and 17 display the partial effect of relatedness index for each dataset, in the unreduced and reduced setting respectively. To allow for a comparison between the different measures of performance, correlation and purity values have been converted to percentages.

[Figure 16: Unreduced Setting. x-axis: TOEFL, RG65, WS353, AP, BATTIG, MITCHELL, ESSLLI; y-axis: performance in % (30 to 80); lines: rel.index (dist, rank)]
[Figure 17: Reduced Setting. x-axis: TOEFL, RG65, WS353, AP, BATTIG, MITCHELL, ESSLLI; y-axis: performance in % (30 to 80); lines: rel.index (dist, rank)]

The picture emerging from the two plots is very clear: neighbor rank is the best choice for both settings across all seven datasets. The degree of improvement over vector distance, however, shows considerable variation between different datasets. The rating task benefits the most from the use of neighbor rank.

On the other hand, neighbor rank has very little effect for the TOEFL task in a reduced setting, where its high computational complexity is clearly not justified; the improvement on the AP clustering dataset is also fairly small. While the TOEFL result seems to contradict the substantial improvement of neighbor rank found by Lapesa and Evert (2013) for a multiple-choice task based on stimuli from priming experiments, there were only two choices (consistent and inconsistent prime) in this case rather than four. We do not rule out that a more refined use of the rank information (for example, different strategies for rank combinations) may produce better results on the TOEFL and AP datasets.

As discussed in section 3.2, we have not yet explored the potential of neighbor rank in modeling directionality effects in semantic similarity. Unlike Lapesa and Evert (2013), who adopt four different indexes of distributional relatedness (vector distance; forward rank, i.e., rank of the target in the neighbors of the prime; backward rank, i.e., rank of the prime in the neighbors of the target; average of backward and forward rank), we used only a single rank-based index (cf. section 3.2), mostly for reasons of computational complexity. We consider the results of this study more than encouraging, and expect further improvements from a full exploration of directionality effects in the tasks at issue.

5.5 Best settings

We conclude the result overview by evaluating the best parameter combinations identified for each task and dataset, showing how well our approach to model selection works in practice.

Table 5 summarizes the optimal parameter settings identified for each task and compares the performance of this model (B.set = best setting) with the over-trained best run in the experiment (B.run = best run).[16] In most cases, the result of our robust parameter optimization is close to the best run. The only exception is the ESSLLI dataset, which is smaller than the other datasets and particularly susceptible to over-training (cf. the low R² of the regression analysis in section 5.3). Table 5 also reports the current state of the art for each task (SoA = state-of-the-art), taken from the ACL wiki[17] where available (TOEFL and similarity ratings), from Baroni and Lenci (2010) for the clustering tasks, and from more recent studies of which we are aware. Our results are comparable to the state of the art, even though the latter includes a much broader range of approaches than our window-based DSMs. In one case (BATTIG), our optimized model even improves on the best previous result.

[16] Abbreviations in the table: win = window size; c.dim = number of context dimensions; tr = transformation; red.dim = number of latent dimensions; d.sk = number of skipped dimensions; r.ind = relatedness index. Parameter values: s-ll = simple-ll; t-sc = t-score; cos = cosine; man = manhattan.
[17] http://aclweb.org/aclwiki

A side-by-side inspection of the main effects and interaction plots for different datasets allowed us to identify parameter settings that are robust across datasets and even across tasks. Table 6 shows recommended settings for each task (independent of the
particular dataset) and a more general setting that achieves good performance in all three tasks. Evaluation results for these settings on each dataset are reported in table 7. In most cases, the general model is close to the performance of the task- and dataset-specific settings. Our robust evaluation methodology has enabled us to find a good trade-off between portability and performance.

Dataset   corpus  win  c.dim  score  tr   metric  r.ind  red.dim  d.sk  B.set  B.run  SoA    Reference
TOEFL     ukwac   2    10k    s-ll   log  cos     rank   900      100   92.5   98.7   100.0  Bullinaria and Levy (2012)
RG65      wacky   4    50k    s-ll   log  cos     rank   500      50    0.87   0.89   0.86   Hassan and Mihalcea (2011)[18]
WS353     wacky   8    50k    s-ll   log  cos     rank   300      50    0.68   0.73   0.81   Halawi et al. (2012)[19]
AP        wacky   4    20k    s-ll   log  cos     rank   300      0     0.69   0.76   0.79   Rothenhäusler and Schütze (2009)
BATTIG    wacky   8    50k    s-ll   log  cos     rank   500      0     0.98   0.99   0.96   Baroni and Lenci (2010)
ESSLLI    wacky   2    20k    t-sc   log  cos     rank   300      0     0.77   0.98   0.91   Katrenko, ESSLLI workshop[20]
MITCHELL  wacky   4    50k    s-ll   log  cos     rank   500      0     0.88   0.97   0.94   Bullinaria and Levy (2012)
common for all datasets: window direction = undirected; criterion for context selection = frequency

Table 5: Best Settings

Task        corpus  win  c.dim  score  tr   metric  r.ind  red.dim  d.sk
TOEFL       ukwac   2    10k    s-ll   log  cos     rank   900      100
Rating      wacky   4    50k    s-ll   log  cos     rank   300      50
Clustering  wacky   4    50k    s-ll   log  cos     rank   500      0
General     wacky   4    50k    s-ll   log  cos     rank   500      50

Table 6: General Best Settings

Dataset   TOEFL  RATINGS  CLUSTERING  GENERAL
TOEFL     92.5   85.0     75.0        90.0
RG65      0.84   0.86     0.84        0.87
WS353     0.62   0.67     0.64        0.68
AP        0.62   0.66     0.67        0.67
BATTIG    0.87   0.91     0.98        0.90
ESSLLI    0.66   0.77     0.80        0.77
MITCHELL  0.75   0.83     0.88        0.83

Table 7: General best Settings – Performance

[18] The ACL wiki lists the hybrid model of Yih and Qazvinian (2012) as the best model on RG65 with ρ = 0.89, but does not specify its Pearson correlation r. In our comparison table, we show the best Pearson correlation, achieved by Hassan and Mihalcea (2011), which is also the best corpus-based model.
[19] Halawi et al. (2012) report Spearman's ρ. The ρ values for our best setting are: RG65: 0.85, WS353: 0.70; best setting for the ratings task: RG65: 0.82, WS353: 0.67; best general setting: RG65: 0.87, WS353: 0.70.
[20] http://wordspace.collocations.de/

6 Conclusion

In this paper, we reported the results of a large-scale evaluation of window-based Distributional Semantic Models, involving a wide range of parameters and tasks. Our model selection methodology is robust to overfitting and sensitive to parameter interactions. It allowed us to identify parameter configurations that perform well across different datasets within the same task, and even across different tasks. We recommend the setting highlighted in bold font in table 5 as a general-purpose DSM for future research. We believe that many applications of DSMs (e.g., vector composition) will benefit from using such a parameter combination that achieves robust performance in a variety of semantic tasks. Moreover, an extensive evaluation based on a robust methodology like the one presented here is the first necessary step for further comparisons of bag-of-words DSMs to different techniques for modeling word meaning, such as neural embeddings (Mikolov et al., 2013).

Let us now summarize our main findings.

• Our experiments show that a cluster of three parameters, namely score, transformation and distance metric, plays a consistently crucial role in determining DSM performance. These parameters also show a homogeneous behavior across tasks and datasets with respect to best parameter values: simple-ll, log transformation and cosine distance. These tendencies confirm the results in Polajnar and Clark (2014) and Kiela and Clark (2014). In particular, the finding that sparse association measures (with negative values clamped to zero) achieve the best performance can be connected to the positive impact of context selection highlighted by Polajnar and Clark (2014): ongoing work targets a more specific analysis of their "thinning" effect on distributional vectors.

• Another group of parameters (corpus, window size, dimensionality reduction parameters) is also influential in all tasks, but shows more variation with respect to the best parameter values. Except for the TOEFL task, best results are obtained with the WaCkypedia corpus, confirming the observation of Sridharan and Murphy (2012) that corpus quality compensates for size to some extent. Window size and dimensionality reduction show a more task-specific behavior, even though it is possible to find a good compromise in a 4 word window, a reduced space of 500 dimensions and skipping of the first 50 dimensions. The latter result confirms the findings of Bullinaria and Levy (2007; 2012) in their clustering experiments.

• The number of context dimensions turned out to be less crucial. While very high-dimensional spaces usually result in better performance, the increase beyond 20000 or 50000 dimensions is rarely sufficient to justify the increased processing cost.

• A novel contribution of our work is the systematic evaluation of a parameter that has been given little attention in DSM research so far: the index of distributional relatedness. Our results show that, even if the parameter is not among the most influential ones, neighbor rank consistently outperforms distance. Without SVD dimensionality reduction, the difference is more pronounced: this result is particularly interesting for compositionality tasks, where SVD has been reported to be detrimental (Baroni and Zamparelli, 2010). In such cases, the benefits of using neighbor rank clearly outweigh the increased (but manageable) computational complexity.

Ongoing work focuses on the extension of the evaluation setting to further parameters (e.g., new distance metrics and association scores, Caron's (2001) exponent p) and tasks (e.g., compositionality tasks, meaning in context), as well as the evaluation of dependency-based models. We are also working on a refined model selection methodology involving a systematic analysis of three-way interactions and the exclusion of inferior parameter values (such as Manhattan distance, sigmoid transformation and Dice score), which may have a confounding effect on some of the effect displays.

Appendix A: Best models

This appendix reports the best runs for every dataset.[21]

[21] Some abbreviations are different from tables 5 and 6. Parameters: w = window; dir = direction; e = exclusion criterion for context selection; m = metric. Performance: acc = accuracy; cor = correlation; pur = purity. Parameter values: dir = directed; undir = undirected; f = frequency; nz = non-zero.

corpus  w   dir    e   c.dim   score      tr    m    r.ind  red.dim  d.sk  acc
ukwac   2   undir  f   5000    MI         none  cos  rank   900      100   98.75
ukwac   4   dir    f   50000   t-score    log   cos  rank   900      100   98.75
ukwac   4   undir  f   50000   t-score    root  cos  dist   900      100   98.75
ukwac   4   dir    f   5000    simple-ll  log   cos  dist   900      100   98.75

Table 8: TOEFL dataset – 23 models tied for best result (4 hand-picked examples shown)

corpus  w   dir    e   c.dim   score      tr    m    r.ind  red.dim  d.sk  cor
ukwac   16  undir  nz  20000   MI         none  cos  rank   700      100   0.89
ukwac   8   dir    f   20000   MI         none  cos  rank   700      100   0.89
wacky   4   dir    nz  50000   simple-ll  log   cos  rank   700      50    0.89
wacky   4   undir  f   100000  z-score    log   cos  rank   900      50    0.89

Table 9: Ratings, RG65 dataset – 19 models tied for best result (4 hand-picked examples shown)

corpus  w   dir    e   c.dim   score      tr    m    r.ind  red.dim  d.sk  cor
wacky   16  dir    f   5000    MI         none  man  rank   900      50    0.73
wacky   16  undir  f   5000    MI         none  man  rank   900      50    0.72
wacky   16  undir  f   5000    z-score    log   man  rank   900      50    0.72
wacky   16  dir    f   10000   z-score    root  man  rank   900      50    0.72

Table 10: Ratings, WordSim353 dataset – best model (3 additional hand-picked models with similar performance are shown)

corpus  w   dir    e   c.dim   score      tr    m    r.ind  red.dim  d.sk  pur
ukwac   4   dir    nz  10000   t-score    log   man  rank   900      50    0.76
wacky   1   dir    nz  10000   z-score    log   man  rank   900      50    0.75
wacky   1   undir  f   20000   simple-ll  log   man  rank   900      50    0.75
wacky   2   dir    f   100000  z-score    log   cos  rank   500      0     0.75

Table 11: Clustering, Almuhareb-Poesio dataset – best model (plus 3 additional hand-picked models)

corpus  w   dir    e   c.dim   score      tr    m    r.ind  red.dim  d.sk  pur
ukwac   1   undir  f   20000   Dice       root  man  rank   300      100   0.99
ukwac   2   undir  f   100000  freq       log   cos  dist   300      50    0.99
wacky   16  undir  f   50000   z-score    log   man  dist   500      50    0.99
wacky   8   undir  f   10000   Dice       root  man  rank   500      0     0.99

Table 12: Clustering, Battig dataset – 1037 models tied for best result (4 hand-picked examples shown)

corpus  w   dir    e   c.dim   score      tr    m    r.ind  red.dim  d.sk  pur
wacky   16  dir    nz  50000   z-score    none  man  dist   900      0     0.98
ukwac   1   dir    nz  100000  simple-ll  log   cos  dist   100      50    0.95
ukwac   2   undir  f   50000   tf.idf     none  man  dist   700      0     0.95
wacky   8   undir  f   100000  tf.idf     root  man  rank   500      0     0.95

Table 13: Clustering, ESSLLI dataset – best model (plus 3 additional hand-picked models)

corpus  w   dir    e   c.dim   score      tr    m    r.ind  red.dim  d.sk  pur
bnc     2   undir  nz  100000  simple-ll  log   cos  rank   900      0     0.97
bnc     2   undir  f   50000   simple-ll  log   cos  rank   700      0     0.97
bnc     2   undir  nz  50000   simple-ll  log   cos  rank   900      0     0.97

Table 14: Clustering, Mitchell dataset – 3 models tied for best result
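
The parameters red.dim, d.sk and r.ind that appear in the tables above can be illustrated with a small sketch. The Python code below is not the implementation used in this study. It assumes NumPy, a dense toy matrix, latent vectors scaled by their singular values, and hypothetical function names (reduce_svd, cosine_distances, forward_rank, average_rank). It shows SVD reduction with skipped leading dimensions and the rank-based relatedness indexes listed in section 5.4 (forward rank, backward rank and their average). Which rank-based index the experiments actually use is defined in section 3.2 and is not reproduced here.

```python
import numpy as np


def reduce_svd(M, n_dims, n_skip=0):
    """Keep latent dimensions n_skip .. n_skip + n_dims - 1 of M.

    Assumption: rows are projected as U * s (singular-value scaling);
    the paper does not state its scaling in this section.
    """
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    keep = slice(n_skip, n_skip + n_dims)
    return U[:, keep] * s[keep]


def cosine_distances(X):
    """Pairwise cosine distance (1 - cosine similarity) between rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)
    return 1.0 - unit @ unit.T


def forward_rank(D, a, b):
    """Rank of word b among the neighbors of word a (1 = nearest)."""
    d = D[a].copy()
    d[a] = np.inf  # a word is not counted as its own neighbor
    return int(np.sum(d < d[b])) + 1


def average_rank(D, a, b):
    """Mean of forward rank (b seen from a) and backward rank (a seen from b)."""
    return 0.5 * (forward_rank(D, a, b) + forward_rank(D, b, a))


# Toy example: four "words" A, B, C, D as unit vectors at 0, 30, 40 and 45 degrees.
angles = np.deg2rad([0.0, 30.0, 40.0, 45.0])
X = np.column_stack([np.cos(angles), np.sin(angles)])
D = cosine_distances(X)

print(forward_rank(D, 0, 1))   # 1: B is the nearest neighbor of A
print(forward_rank(D, 1, 0))   # 3: A is only the third neighbor of B
print(average_rank(D, 0, 1))   # 2.0

# Skipping the first latent dimension of the toy matrix leaves one dimension.
print(reduce_svd(X, n_dims=1, n_skip=1).shape)  # (4, 1)
```

Unlike vector distance, neighbor rank depends on the whole neighborhood of a word and therefore need not be symmetric, which is what the toy data show. Computing it requires distances from each word to all candidate neighbors, which is consistent with the higher computational cost reported for this index.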
Acknowledgments

We are grateful to the editor and the anonymous reviewers, whose comments helped us improve the paper. We would like to thank Sabine Schulte im Walde and the SemRel group at the IMS Stuttgart, the Corpus Linguistics group at the University of Erlangen-Nürnberg and the Computational Linguistics group at the IKW Osnabrück for their feedback on our work. Gabriella Lapesa's PhD research was funded by the DFG Collaborative Research Centre SFB 732 at IMS Stuttgart, where she conducted a large part of the work presented in this paper.

References

Abdulrahman Almuhareb. 2006. Attributes in Lexical Acquisition. Ph.D. thesis, University of Essex.
Marco Baroni and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):1–49.
Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193, MIT, Massachusetts, USA.
Marco Baroni, Raffaella Bernardi, and Roberto Zamparelli. 2014. Frege in space: A program for compositional distributional semantics. Linguistic Issues in Language Technology (LiLT), 9(6):5–109.
John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39:510–526.
John A. Bullinaria and Joseph P. Levy. 2012. Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming and SVD. Behavior Research Methods, 44:890–907.
John A. Bullinaria and Joseph P. Levy. 2013. Limiting factors for mapping corpus-based semantic representations to brain activity. PLoS ONE, 8(3):1–12.
John Caron. 2001. Experiments with LSA scoring: Optimal rank and basis. In Michael W. Berry, editor, Computational Information Retrieval, pages 157–169. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
Stefan Evert. 2008. Corpora and collocations. In Anke Lüdeling and Merja Kytö, editors, Corpus Linguistics. An International Handbook, chapter 58. Mouton de Gruyter, Berlin, New York.
Stefan Evert. 2014. Distributional semantics in R with the wordspace package. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pages 110–114, Dublin, Ireland.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116–131.
John Fox. 2003. Effect displays in R for generalised linear models. Journal of Statistical Software, 8(15):1–27.
Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. 2012. Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1406–1414, New York, NY, USA.
Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. 2009. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. Technical Report 2009-05, ACM, California Institute of Technology.
Mary Hare, Michael Jones, Caroline Thomson, Sarah Kelly, and Ken McRae. 2009. Activating event knowledge. Cognition, 111(2):151–167.
Zelig Harris. 1954. Distributional structure. Word, 10(23):146–162.
Samer Hassan and Rada Mihalcea. 2011. Semantic relatedness using salient semantic analysis. In Proceedings of the Twenty-fifth AAAI Conference on Artificial Intelligence, pages 884–889, San Francisco, California.
George Karypis. 2003. CLUTO: A clustering toolkit (release 2.1.1). Technical Report 02-017, Minneapolis: University of Minnesota, Department of Computer Science.
Leonard Kaufman and Peter J. Rousseeuw. 1990. Finding groups in data: an introduction to cluster analysis. John Wiley and Sons.
Douwe Kiela and Stephen Clark. 2014. A systematic study of semantic vector space model parameters. In Proceedings of EACL 2014, Workshop on Continuous Vector Space Models and their Compositionality (CVSC), pages 21–30, Gothenburg, Sweden.
Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104:211–240.
Gabriella Lapesa and Stefan Evert. 2013. Evaluating neighbor rank and distance measures as predictors of semantic priming. In Proceedings of the ACL Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2013), pages 66–74, Sofia, Bulgaria.
Gabriella Lapesa, Stefan Evert, and Sabine Schulte im Walde. 2014. Contrasting syntagmatic and paradigmatic relations: Insights from distributional semantic models. In Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014), pages 160–170, Dublin, Ireland.
Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, pages 768–774, Montreal, Quebec, Canada.
Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instrumentation and Computers, 28:203–208.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR.
Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244, Columbus, Ohio.
Tom Mitchell, Svetlana V. Shinkareva, Andrew Carlson, Kai-Min Chang, Vicente L. Malave, Robert A. Mason, and Marcel Adam Just. 2008. Predicting human brain activity associated with the meanings of nouns. Science, 320(5880):1191–1195.
Brian Murphy, Partha Talukdar, and Tom Mitchell. 2012. Selecting corpus-semantic models for neurolinguistic decoding. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - SemEval '12, pages 114–123.
Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.
Tamara Polajnar and Stephen Clark. 2014. Improving distributional semantic vectors through context selection and normalisation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), pages 230–238, Gothenburg, Sweden.
Klaus Rothenhäusler and Hinrich Schütze. 2009. Unsupervised classification with dependency based word spaces. In Proceedings of the EACL 2009 Workshop on GEMS: GEometrical Models of Natural Language Semantics, pages 17–24, Athens, Greece.
Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.
Magnus Sahlgren. 2006. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. thesis, University of Stockholm.
Gerard Salton, Andrew Wong, and Chung Shu Yang. 1975. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620.
Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 27(1):97–123.
Seshadri Sridharan and Brian Murphy. 2012. Modeling word meaning: Distributional semantics and the corpus quality-quantity trade-off. In Proceedings of the 3rd workshop on Cognitive Aspects of the Lexicon (CogAlex-III), pages 53–68, Mumbai, India.
Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.
James Van Overschelde, Katherine Rawson, and John Dunlosky. 2004. Category norms: An updated and expanded version of the Battig and Montague (1969) norms. Journal of Memory and Language, 50:289–335.
Wen-tau Yih and Vahed Qazvinian. 2012. Measuring word relatedness using heterogeneous vector space models. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT '12), pages 616–620, Montreal, Canada.
Britta Zeller, Sebastian Padó, and Jan Šnajder. 2014. Towards semantic validation of a derivational lexicon. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1728–1739, Dublin, Ireland.