Transactions of the Association for Computational Linguistics, vol. 6, pp. 107–119, 2018. Action Editor: Ivan Titov.
Submission batch: 6/2017; Revision batch: 9/2017; Published 2/2018.
© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.
Evaluating the Stability of Embedding-based Word Similarities

Maria Antoniak, Cornell University, maa343@cornell.edu
David Mimno, Cornell University, mimno@cornell.edu

Abstract

Word embeddings are increasingly being used as a tool to study word associations in specific corpora. However, it is unclear whether such embeddings reflect enduring properties of language or if they are sensitive to inconsequential variations in the source documents. We find that nearest-neighbor distances are highly sensitive to small changes in the training corpus for a variety of algorithms. For all methods, including specific documents in the training set can result in substantial variations. We show that these effects are more prominent for smaller training corpora. We recommend that users never rely on single embedding models for distance calculations, but rather average over multiple bootstrap samples, especially for small corpora.

1 Introduction

Word embeddings are a popular technique in natural language processing (NLP) in which the words in a vocabulary are mapped to low-dimensional vectors. Embedding models are easily trained (several implementations are publicly available), and relationships between the embedding vectors, often measured via cosine similarity, can be used to reveal latent semantic relationships between pairs of words. Word embeddings are increasingly being used by researchers in unexpected ways and have become popular in fields such as digital humanities and computational social science (Hamilton et al., 2016; Heuser, 2016; Phillips et al., 2017).

Embedding-based analyses of semantic similarity can be a robust and valuable tool, but we find that standard methods dramatically under-represent the variability of these measurements. Embedding algorithms are much more sensitive than they appear to factors such as the presence of specific documents, the size of the documents, the size of the corpus, and even seeds for random number generators. If users do not account for this variability, their conclusions are likely to be invalid. Fortunately, we also find that simply averaging over multiple bootstrap samples is sufficient to produce stable, reliable results in all cases tested.

NLP research in word embeddings has so far focused on a downstream-centered use case, where the end goal is not the embeddings themselves but performance on a more complicated task. For example, word embeddings are often used as the bottom layer in neural network architectures for NLP (Bengio et al., 2003; Goldberg, 2017). The embeddings' training corpus, which is selected to be as large as possible, is only of interest insofar as it generalizes to the downstream training corpus.

In contrast, other researchers take a corpus-centered approach and use relationships between embeddings as direct evidence about the language and culture of the authors of a training corpus (Bolukbasi et al., 2016; Hamilton et al., 2016; Heuser, 2016). Embeddings are used as if they were simulations of a survey asking subjects to free-associate words from query terms. Unlike the downstream-centered approach, the corpus-centered approach is based on direct human analysis of nearest neighbors to embedding vectors, and the training corpus is not simply an off-the-shelf convenience but rather the central object of study.
Downstream-centered                      | Corpus-centered
Big corpus                               | Small corpus, difficult or impossible to expand
Source is not important                  | Source is the object of study
Only vectors are important               | Specific, fine-grained comparisons are important
Embeddings are used in downstream tasks  | Embeddings are used to learn about the mental model of word association for the authors of the corpus

Table 1: Comparison of downstream-centered and corpus-centered approaches to word embeddings.

While word embeddings may appear to measure properties of language, they in fact only measure properties of a curated corpus, which could suffer from several problems. The training corpus is merely a sample of the authors' language model (Shazeer et al., 2016). Sources could be missing or over-represented, typos and other lexical variations could be present, and, as noted by Goodfellow et al. (2016), "Many datasets are most naturally arranged in a way where successive examples are highly correlated." Furthermore, embeddings can vary considerably across random initializations, making lists of "most similar words" unstable.

We hypothesize that training on small and potentially idiosyncratic corpora can exacerbate these problems and lead to highly variable estimates of word similarity. Such small corpora are common in digital humanities and computational social science, and it is often impossible to mitigate these problems simply by expanding the corpus. For example, we cannot create more 18th Century English books or change their topical focus.

We explore causes of this variability, which range from the fundamental stochastic nature of certain algorithms to more troubling sensitivities to properties of the corpus, such as the presence or absence of specific documents. We focus on the training corpus as a source of variation, viewing it as a fragile artifact curated by often arbitrary decisions. We examine four different algorithms and six datasets, and we manipulate the corpus by shuffling the order of the documents and taking bootstrap samples of the documents. Finally, we examine the effects of these manipulations on the cosine similarities between embeddings.

We find that there is considerable variability in embeddings that may not be obvious to users of these methods. Rankings of most similar words are not reliable, and both ordering and membership in such lists are liable to change significantly. Some uncertainty is expected, and there is no clear criterion for "acceptable" levels of variance, but we argue that the amount of variation we observe is sufficient to call the whole method into question. For example, we find cases in which there is zero set overlap in "top 10" lists for the same query word across bootstrap samples. Smaller corpora and larger document sizes increase this variation. Our goal is to provide methods to quantify this variability, and to account for this variability, we recommend that as the size of a corpus gets smaller, cosine similarities should be averaged over many bootstrap samples.

2 Related Work

Word embeddings are mappings of words to points in a K-dimensional continuous space, where K is much smaller than the size of the vocabulary. Reducing the number of dimensions has two benefits: first, large, sparse vectors are transformed into small, dense vectors; and second, the conflation of features uncovers latent semantic relationships between the words. These semantic relationships are usually measured via cosine similarity, though other metrics such as Euclidean distance and the Dice coefficient are possible (Turney and Pantel, 2010). We focus on four of the most popular training algorithms: Latent Semantic Analysis (LSA) (Deerwester et al., 1990), Skip-Gram with Negative Sampling (SGNS) (Mikolov et al., 2013), Global Vectors for Word Representation (GloVe) (Pennington et al., 2014), and Positive Pointwise Mutual Information (PPMI) (Levy and Goldberg, 2014) (see Section 5 for more detailed descriptions of these algorithms).
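The nearest-neighbor queries at the heart of this kind of analysis reduce to a short computation. The following is a minimal sketch; the function and variable names are ours, and the toy vocabulary and random vectors are purely illustrative stand-ins for trained embeddings.

```python
import numpy as np

def most_similar(query, vocab, vectors, n=10):
    """Return the n nearest neighbors of `query` by cosine similarity.

    vocab:   list of words, aligned with the rows of `vectors`
    vectors: array of shape (vocabulary size, K), one embedding per word
    """
    # Normalizing the rows makes the dot product equal to cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(query)]
    ranked = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in ranked if vocab[i] != query][:n]

# Toy usage; a real study would pass embeddings from a trained model.
rng = np.random.default_rng(0)
vocab = ["marijuana", "cocaine", "heroin", "nicotine", "alcohol"]
print(most_similar("marijuana", vocab, rng.normal(size=(5, 100)), n=3))
```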
In NLP, word embeddings are often used as features for downstream tasks. Dependency parsing (Chen and Manning, 2014), named entity recognition (Turian et al., 2010; Cherry and Guo, 2015), and bilingual lexicon induction (Vulic and Moens, 2015) are just a few examples where the use of embeddings as features has increased performance in recent years.

Increasingly, word embeddings have been used as evidence in studies of language and culture.
For example, Hamilton et al. (2016) train separate embeddings on temporal segments of a corpus and then analyze changes in the similarity of words to measure semantic shifts, and Heuser (2016) uses embeddings to characterize discourse about virtues in 18th Century English text. Other studies use cosine similarities between embeddings to measure the variation of language across geographical areas (Kulkarni et al., 2016; Phillips et al., 2017) and time (Kim et al., 2014). Each of these studies seeks to reconstruct the mental model of authors based on documents.

An example that highlights the contrast between the downstream-centered and corpus-centered perspectives is the exploration of implicit bias in word embeddings. Researchers have observed that embedding-based word similarities reflect cultural stereotypes, such as associations between occupations and genders (Bolukbasi et al., 2016). From a downstream-centered perspective, these stereotypical associations represent bias that should be filtered out before using the embeddings as features. In contrast, from a corpus-centered perspective, implicit bias in embeddings is not a problem that must be fixed but rather a means of measurement, providing quantitative evidence of bias in the training corpus.

Embeddings are usually evaluated on direct use cases, such as word similarity and analogy tasks via cosine similarities (Mikolov et al., 2013; Pennington et al., 2014; Levy et al., 2015; Shazeer et al., 2016). Intrinsic evaluations like word similarities measure the interpretability of the embeddings rather than their downstream task performance (Gladkova and Drozd, 2016), but while some research does evaluate embedding vectors on their downstream task performance (Pennington et al., 2014; Faruqui et al., 2015), the standard benchmarks remain intrinsic.

There has been some recent work in evaluating the stability of word embeddings. Levy et al. (2015) focus on the hyperparameter settings for each algorithm and show that hyperparameters such as the size of the context window, the number of negative samples, and the level of context distribution smoothing can affect the performance of embeddings on similarity and analogy tasks. Hellrich and Hahn (2016) examine the effects of word frequency, word ambiguity, and the number of training epochs on the reliability of embeddings produced by SGNS and skip-gram hierarchical softmax (SGHS) (a variant of SGNS), striving for reproducibility and recommending against sampling the corpus in order to preserve stability. Likewise, Tian et al. (2016) explore the robustness of SGNS and GloVe embeddings trained on large, generic corpora (Wikipedia and news data) and propose methods to align these embeddings across different iterations.

In contrast, our goal is not to produce artificially stable embeddings but to identify the factors that create instability and measure our statistical confidence in the cosine similarities between embeddings trained on small, specific corpora. We focus on the corpus as a fragile artifact and source of variation, considering the corpus itself as merely a sample of possible documents produced by the authors. We examine whether the embeddings accurately model those authors, using bootstrap sampling to measure the effects of adding or removing documents from the training corpus.

3 Corpora

We collected two sub-corpora from each of three datasets (see Table 2) to explore how word embeddings are affected by size, vocabulary, and other parameters of the training corpus. In order to better model realistic examples of corpus-centered research, these corpora are deliberately chosen to be publicly available, suggestive of social research questions, varied in corpus parameters (e.g. topic, size, vocabulary), and much smaller than the standard corpora typically used in training word embeddings (e.g. Wikipedia, Gigaword). Each dataset was created organically, over specific time periods, in specific social settings, by specific authors. Thus, it is impossible to expand these datasets without compromising this specificity.
We process each corpus by lowercasing all text, removing words that appear fewer than 20 times in the corpus, and removing all numbers and punctuation. Because our methods rely on bootstrap sampling (see Section 6), which operates by removing or multiplying the presence of documents, we also remove duplicate documents from each corpus.
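A minimal sketch of this preprocessing, under the assumption that a simple letters-only tokenizer is acceptable; the helper name and regular expression are ours, not the paper's exact implementation.

```python
import re
from collections import Counter

def preprocess(docs, min_count=20):
    """Lowercase, strip numbers and punctuation, drop rare words, de-duplicate."""
    docs = list(dict.fromkeys(docs))  # drop exact duplicate documents, keep order
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    counts = Counter(token for doc in tokenized for token in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]
```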
U.S. Federal Courts of Appeals. The U.S. Federal courts of appeals are regional courts that decide appeals from the district courts within their federal judicial circuit. We examine the embeddings of the most recent five years of the 4th and 9th circuits.[1] The 4th circuit contains Washington D.C. and surrounding states, while the 9th circuit contains the entirety of the west coast. Social science research questions might involve measuring a widely held belief that certain courts have distinct ideological tendencies (Broscheid, 2011). Such bias may result in measurable differences in word association due to framing effects (Card et al., 2015), which could be observable by comparing the words associated with a given query term. We treat each opinion as a single document.

[1] https://www.courtlistener.com/

Corpus              Documents   Unique words   Vocabulary density   Words per document
NYT Sports (2000)       8,786         12,475               0.0020                  708
NYT Music (2000)        3,666          9,762               0.0037                  715
AskScience            331,635         16,901               0.0012                   44
AskHistorians          63,578          9,384               0.0022                   66
4th Circuit             5,368         16,639               0.0014                2,281
9th Circuit             9,729         22,146               0.0011                2,108

Table 2: Comparison of the number of documents, number of unique words (after removing words that appear fewer than 20 times), vocabulary density (the ratio of unique words to the total number of words), and the average number of words per document for each corpus.

Setting     Method                               Tests...                                  Run 1   Run 2   Run 3
Fixed       Documents in consistent order        variability due to algorithm (baseline)   ABC     ABC     ABC
Shuffled    Documents in random order            variability due to document order         ACB     BAC     CBA
Bootstrap   Documents sampled with replacement   variability due to document presence      BAA     CAB     BBB

Table 3: The three settings that manipulate the document order and presence in each corpus.

New York Times. The New York Times (NYT) Annotated Corpus (Sandhaus, 2008) contains newspaper articles tagged with additional metadata reflecting their content and publication context. To constrain the size of the corpora and to enhance their specificity, we extract data only for the year 2000 and focus on only two sections of the NYT dataset: sports and music. In the resulting corpora, the sports section is substantially larger than the music section (see Table 2). We treat an article as a single document.

Reddit. Reddit[2] is a social website containing thousands of forums (subreddits) organized by topic. We use a dataset containing all posts for the years 2007-2014 from two subreddits: /r/AskScience and /r/AskHistorians. These two subreddits allow users to post any question in the topics of history and science, respectively. AskScience is more than five times larger than AskHistorians, though the document length is generally longer for AskHistorians (see Table 2). Reddit is a popular data source for computational social science research; for example, subreddits can be used to explore the distinctiveness and dynamicity of communities (Zhang et al., 2017). We treat an original post as a single document.

[2] https://www.reddit.com/

4 Corpus Parameters

Order and presence of documents. We use three different methods to sample the corpus: FIXED, SHUFFLED, and BOOTSTRAP. The FIXED setting includes each document exactly once, and the documents appear in a constant, chronological order across all models. The purpose of this setting is to measure the baseline variability of an algorithm, independent of any change in input data. Algorithmic variability may arise from random initializations of learned parameters, random negative sampling, or randomized subsampling of tokens within documents. The SHUFFLED setting includes each document exactly once, but the order of the documents is randomized for each model. The purpose of this setting is to evaluate the impact of variation in how we present examples to each algorithm. The order of documents could be an important factor for algorithms that use online training such as SGNS. The BOOTSTRAP setting samples N documents randomly with replacement, where N is equal to the number of documents in the FIXED setting. The purpose of this setting is to measure how much variability is due to the presence or absence of specific sequences of tokens in the corpus. See Table 3 for a comparison of these three settings.
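Each setting amounts to a simple manipulation of the document list. A sketch follows, with NumPy's random generator standing in for whatever sampler a given implementation uses; the function name is ours.

```python
import numpy as np

def sample_corpus(docs, setting, seed=None):
    """Produce one training corpus under the FIXED, SHUFFLED, or BOOTSTRAP setting."""
    rng = np.random.default_rng(seed)
    if setting == "fixed":        # each document once, constant chronological order
        return list(docs)
    if setting == "shuffled":     # each document once, random order per model
        return list(rng.permutation(docs))
    if setting == "bootstrap":    # N documents drawn with replacement
        return list(rng.choice(docs, size=len(docs), replace=True))
    raise ValueError(f"unknown setting: {setting}")
```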
Size of corpus. We expect the stability of embedding-based word similarities to be influenced by the size of the training corpus. As we add more documents, the impact of any specific document should be less significant. At the same time, larger corpora may also tend to be more broad in scope and variable in style and topic, leading to less idiosyncratic patterns in word co-occurrence. Therefore, for each corpus, we curate a smaller sub-corpus that contains 20% of the total corpus documents. These samples are selected using contiguous sequences of documents at the beginning of each training (this ensures that the FIXED setting remains constant).

Length of documents. We use two document segmentation strategies. In the first setting, each training instance is a single document (i.e. an article for the NYT corpus, an opinion from the Courts corpus, and a post from the Reddit corpus). In the second setting, each training instance is a single sentence. We expect this choice of segmentation to have the largest impact on the BOOTSTRAP setting. Documents are often characterized by "bursty" words that are locally frequent but globally rare (Madsen et al., 2005), such as the name of a defendant in a court case. Sampling whole documents with replacement should magnify the effect of bursty words: a rare but locally frequent word will either occur in a BOOTSTRAP corpus or not occur. Sampling sentences with replacement should have less effect on bursty words, since the chance that an entire document will be removed from the corpus is much smaller.

5 Algorithms

Evaluating all current embedding algorithms and implementations is beyond the scope of this work, so we select four categories of algorithms that represent distinct optimization strategies. Recall that our goal is to examine how algorithms respond to variation in the corpus, not to maximize the accuracy or effectiveness of the embeddings.

The first category is online stochastic updates, in which the algorithm updates model parameters using stochastic gradients as it proceeds through the training corpus. All methods implemented in the word2vec and fastText packages follow this format, including skip-gram, CBOW, negative sampling, and hierarchical softmax (Mikolov et al., 2013). We focus on SGNS as a popular and representative example. The second category is batch stochastic updates, in which the algorithm first collects a matrix of summary statistics derived from a pass through the training data that takes place before any parameters are set, and then updates model parameters using stochastic optimization. We select the GloVe algorithm (Pennington et al., 2014) as a representative example. The third category is matrix factorization, in which the algorithm makes deterministic updates to model parameters based on a matrix of summary statistics. As a representative example we include PPMI (Levy and Goldberg, 2014). Finally, to test whether word order is a significant factor, we include a document-based embedding method that uses matrix factorization, LSA (Deerwester et al., 1990; Landauer and Dumais, 1997).

These algorithms each include several hyperparameters, which are known to have measurable effects on the resulting embeddings (Levy et al., 2015). We have attempted to choose settings of these parameters that are commonly used and comparable across algorithms, but we emphasize that a full evaluation of the effect of each algorithmic parameter would be beyond the scope of this work. For each of the following algorithms, we set the context window size to 5 and the embedding size to 100. Since we remove words that occur fewer than 20 times during preprocessing of the corpus, we set the frequency threshold for the following algorithms to 0. For all other hyperparameters, we follow the default or most popular settings for each algorithm, as described in the following sections.
5.1 LSA

Latent semantic analysis (LSA) factorizes a sparse term-document matrix X (Deerwester et al., 1990; Landauer and Dumais, 1997). X is factored using singular value decomposition (SVD), retaining K singular values such that

X ≈ X_K = U_K Σ_K V_K^T.

The elements of the term-document matrix are weighted, often with TF-IDF, which measures the importance of a word to a document in a corpus. The dense, low-rank approximation of the term-document matrix, X_K, can be used to measure the relatedness of terms by calculating the cosine similarity of the relevant rows of the reduced matrix.

We use the scikit-learn[3] package to train our LSA embeddings. We create a term-document matrix with TF-IDF weighting, using the default settings except that we add L2 normalization and sublinear TF scaling, which scales the importance of terms with high frequency within a document. We perform dimensionality reduction via a randomized solver (Halko et al., 2009).

[3] http://scikit-learn.org/

The construction of the term-count matrix and the TF-IDF weighting should introduce no variation to the final word embeddings. However, we expect variation due to the randomized SVD solver, even when all other parameters (training document order, presence, size, etc.) are constant.
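A sketch of this configuration in scikit-learn: the sublinear TF scaling, L2 normalization, and randomized solver follow the description above, while reading term vectors off the rows of the factorization is one common convention rather than the paper's verified code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def train_lsa(docs, k=100):
    """Factor a TF-IDF term-document matrix with a randomized SVD solver."""
    vectorizer = TfidfVectorizer(sublinear_tf=True, norm="l2")
    X = vectorizer.fit_transform(docs)  # documents x terms, sparse
    svd = TruncatedSVD(n_components=k, algorithm="randomized")
    svd.fit(X)
    # Each column of components_ corresponds to a term; its K values
    # serve as that term's embedding.
    terms = vectorizer.get_feature_names_out()
    return dict(zip(terms, svd.components_.T))
```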
5.2 SGNS

The skip-gram with negative sampling (SGNS) algorithm (Mikolov et al., 2013) is an online algorithm that uses randomized updates to predict words based on their context. In each iteration, the algorithm proceeds through the original documents and, at each word token, updates model parameters based on gradients calculated from the current model parameters. This process maximizes the likelihood of observed word-context pairs and minimizes the likelihood of negative samples.

We use an implementation of the SGNS algorithm included in the Python library gensim[4] (Řehůřek and Sojka, 2010). We use the default settings provided with gensim except as described above.

[4] https://radimrehurek.com/gensim/models/word2vec.html

We predict that multiple runs of SGNS on the same corpus will not produce the same results. SGNS randomly initializes all the embeddings before training begins, and it relies on negative samples created by randomly selecting word and context pairs (Mikolov et al., 2013; Levy et al., 2015). We also expect SGNS to be sensitive to the order of documents, as it relies on stochastic gradient descent, which can be biased to be more influenced by initial documents (Bottou, 2012).
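A sketch of comparable settings in gensim: the window size, dimensionality, and disabled frequency threshold come from the text above, while the remaining arguments are gensim defaults that we spell out for clarity; the wrapper function name is ours.

```python
from gensim.models import Word2Vec

def train_sgns(token_lists):
    """Skip-gram with negative sampling over pre-tokenized training instances."""
    return Word2Vec(
        sentences=token_lists,  # one token list per document (or per sentence)
        vector_size=100,        # embedding size
        window=5,               # context window size
        min_count=0,            # rare words were already removed in preprocessing
        sg=1,                   # skip-gram rather than CBOW
        negative=5,             # number of negative samples
    )

# model.wv.most_similar("marijuana", topn=10) then returns the ranked neighbors.
```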
5.3 GloVe

Global Vectors for Word Representation (GloVe) uses stochastic gradient updates but operates on a "global" representation of word co-occurrence that is calculated once at the beginning of the algorithm (Pennington et al., 2014). Words and contexts are associated with bias parameters, b_w and b_c, where w is a word and c is a context, learned by minimizing the cost function:

L = Σ_{w,c} f(x_{wc}) (w · c + b_w + b_c − log x_{wc})²

We use the GloVe implementation provided by Pennington et al. (2014).[5] We use the default settings provided with GloVe except as described above.

[5] http://nlp.stanford.edu/projects/glove/

Unlike SGNS, the algorithm does not perform model updates while examining the original documents. As a result, we expect GloVe to be sensitive to random initializations but not sensitive to the order of documents.

5.4 PPMI

The positive pointwise mutual information (PPMI) matrix, whose cells represent the PPMI of each pair of words and contexts, is factored using singular value decomposition (SVD) and results in low-dimensional embeddings that perform similarly to GloVe and SGNS (Levy and Goldberg, 2014):

PMI(w, c) = log [ P(w, c) / (P(w) P(c)) ];  PPMI(w, c) = max(PMI(w, c), 0).

To train our PPMI word embeddings, we use hyperwords,[6] an implementation provided as part of Levy et al. (2015).[7] We follow the authors' recommendations and set the context distribution smoothing (cds) parameter to 0.75, the eigenvalue matrix (eig) to 0.5, the subsampling threshold (sub) to 10^-5, and the context window (win) to 5.

[6] https://bitbucket.org/omerlevy/hyperwords/src
[7] We altered the PPMI code to remove a fixed random seed in order to introduce variability given a fixed corpus; no other change was made.

Like GloVe and unlike SGNS, PPMI operates on a pre-computed representation of word co-occurrence, so we do not expect results to vary based on the order of documents. Unlike both GloVe and SGNS, PPMI uses a stable, non-stochastic SVD algorithm that should produce the same result given the same input, regardless of initialization. However, we expect variation due to PPMI's random subsampling of frequent tokens.
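The PPMI transformation itself follows directly from these definitions. Below is a minimal dense-matrix sketch that omits hyperwords' context distribution smoothing and subsampling; the function name is ours.

```python
import numpy as np

def ppmi(counts):
    """PPMI matrix from a word-context co-occurrence count matrix.

    Assumes every row and column has at least one count, which holds
    after rare words are removed in preprocessing.
    """
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):  # log(0) -> -inf, clipped below
        pmi = np.log(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)  # PPMI(w, c) = max(PMI(w, c), 0)

# Low-dimensional embeddings are then obtained from a truncated SVD of this matrix.
```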
6 Methods

To establish statistical significance bounds for our observations, we train 50 LSA models, 50 SGNS models, 50 GloVe models, and 50 PPMI models for each of the three settings (FIXED, SHUFFLED, and BOOTSTRAP), for each document segmentation size, for each corpus.

For each corpus, we select a set of 20 relevant query words from high probability words from an LDA topic model (Blei et al., 2003) trained on that corpus with 200 topics. We calculate the cosine similarity of each query word to the other words in the vocabulary, creating a similarity ranking of all the words in the vocabulary. We calculate the mean and standard deviation of the cosine similarities for each pair of query word and vocabulary word across each set of 50 models.

From the lists of queries and cosine similarities, we select the 20 words most closely related to the set of query words and compare the mean and standard deviation of those pairs across settings. We calculate the Jaccard similarity between top-N lists to compare membership change in the lists of most closely related words, and we find average changes in rank within those lists. We examine these metrics across different algorithms and corpus parameters.

7 Results

We begin with a case study of the framing around the query term marijuana. One might hypothesize that the authors of various corpora (e.g. judges of the 4th Circuit, journalists at the NYT, and users on Reddit) have different perceptions of this drug and that their language might reflect those differences. Indeed, after qualitatively examining the lists of most similar terms (see Table 4), we might come to the conclusion that the allegedly conservative 4th Circuit judges view marijuana as similar to illegal drugs such as heroin and cocaine, while Reddit users view marijuana as closer to legal substances such as nicotine and alcohol.

[Figure 1: The mean standard deviations across settings and algorithms for the 10 closest words to the query words in the 9th Circuit and NYT Music corpora using the whole documents. Larger variations indicate less stable embeddings.]

However, we observe patterns that cause us to lower our confidence in such conclusions. Table 4 shows that the cosine similarities can vary significantly. We see that the top ranked words (chosen according to their mean cosine similarity across runs of the FIXED setting) can have widely different mean similarities and standard deviations depending on the algorithm and the three training settings, FIXED, SHUFFLED, and BOOTSTRAP.

As expected, each algorithm has a small variation in the FIXED setting. For example, we can see the effect of the random SVD solver for LSA and the effect of random subsampling for PPMI. We do not observe a consistent effect for document order in the SHUFFLED setting.

[Table 4: The most similar words, with their means and standard deviations for the cosine similarities between the query word marijuana and its 10 nearest neighbors (highest mean cosine similarity in the FIXED setting), for each algorithm (LSA, SGNS, GloVe, PPMI) and corpus (4th Circuit, NYT Sports, Reddit AskScience). Embeddings are learned from documents segmented by sentence.]

Most importantly, these figures reveal that the BOOTSTRAP setting causes large increases in variation across all algorithms (with a weaker effect for PPMI) and corpora, with large standard deviations across word rankings. This indicates that the presence of specific documents in the corpus can significantly affect the cosine similarities between embedding vectors.

GloVe produced very similar embeddings in both the FIXED and SHUFFLED settings, with similar means and small standard deviations, which indicates that GloVe is not sensitive to document order. However, the BOOTSTRAP setting caused a reduction in the mean and widened the standard deviation, indicating that GloVe is sensitive to the presence of specific documents.
Run 1        Run 2         Run 3       Run 4            Run 5          Run 6        Run 7
viability    fetus         trimester   surgery          trimester      pregnancies  abdomen
pregnancies  pregnancies   surgery     visit            surgery        occupation   tenure
abortion     gestation     visit       therapy          incarceration  viability    stepfather
abortions    kindergarten  tenure      pain             visit          abortion     wife
fetus        viability     workday     hospitalization  arrival        tenure       groin
gestation    headaches     abortions   neck             pain           visit        throat
surgery      pregnant      hernia      headaches        headaches      abortions    grandmother
expiration   abortion      summer      trimester        birthday       pregnant     daughter
sudden       pain          suicide     experiencing     neck           birthday     panic
fetal        bladder       abortion    medications      tenure         fetus        jaw

Table 5: The 10 closest words to the query term pregnancy are highly variable. None of the words shown appear in every run. Results are shown across runs of the BOOTSTRAP setting for the full corpus of the 9th Circuit, the whole document size, and the SGNS model.

Run 1         Run 2         Run 3         Run 4         Run 5         Run 6         Run 7
selection     selection     selection     selection     selection     selection     selection
genetics      process       human         darwinian     convergent    evolutionary  darwinian
convergent    darwinian     humans        theory        darwinian     humans        nature
process       humans        natural       genetics      evolutionary  species       evolutionary
darwinian     convergent    genetics      human         genetics      convergent    convergent
abiogenesis   evolutionary  species       evolutionary  theory        process       process
evolutionary  species       did           humans        natural       natural       natural
natural       human         convergent    natural       humans        did           species
nature        natural       process       convergent    process       human         humans
species       theory        evolutionary  creationism   human         darwinian     favor

Table 6: The order of the 10 closest words to the query term evolution is highly variable. Results are shown across runs of the BOOTSTRAP setting for the full corpus of AskScience, the whole document length, and the GloVe model.

These patterns of larger or smaller variations are generalized in Figure 1, which shows the mean standard deviation for different algorithms and settings. We calculated the standard deviation across the 50 runs for each query word in each corpus, and then we averaged over these standard deviations. The results show the average levels of variation for each algorithm and corpus. We observe that the FIXED and SHUFFLED settings for GloVe and LSA produce the least variable cosine similarities, while PPMI produces the most variable cosine similarities for all settings. The presence of specific documents has a significant effect on all four algorithms (lesser for PPMI), consistently increasing the standard deviations.

We turn to the question of how this variation in standard deviation affects the lists of most similar words. Are the top-N words simply re-ordered, or do the words present in the list substantially change? Table 5 shows an example of the top-N word lists for the query word pregnancy in the 9th Circuit corpus. Observing Run 1, we might believe that judges of the 9th Circuit associate pregnancy most with questions of viability and abortion, while observing Run 5, we might believe that pregnancy is most associated with questions of prisons and family visits. Although the lists in this table are all produced from the same corpus and document size, the membership of the lists changes substantially between runs of the BOOTSTRAP setting.

As another example, Table 6 shows results for the query evolution for the GloVe model and the AskScience corpus. Although this query shows less variation between runs, we still find cause for concern. For example, Run 3 ranks the words human and humans highly, while Run 1 includes neither of those words in the top 10.

These changes in top-N rank are shown in Figure 2. For each query word for the AskHistorians corpus, we find the N most similar words using SGNS. We generate new top-N lists for each of the 50 models trained in the BOOTSTRAP setting, and we use Jaccard similarity to compare the 50 lists.
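The Jaccard comparison from Section 6 can be sketched in a few lines; here `top_n_lists` stands for the 50 per-model lists of nearest-neighbor words, and the function name is ours.

```python
from itertools import combinations

def mean_jaccard(top_n_lists):
    """Average pairwise Jaccard similarity between top-N neighbor word lists."""
    scores = [len(set(a) & set(b)) / len(set(a) | set(b))
              for a, b in combinations(top_n_lists, 2)]
    return sum(scores) / len(scores)

# Identical lists give 1.0; disjoint lists give 0.0.
print(mean_jaccard([["human", "humans", "species"],
                    ["human", "species", "natural"]]))  # 0.5
```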
[Figure 2: The mean Jaccard similarities across settings and algorithms for the top 2 and top 10 closest words to the query words in the AskHistorians corpus. Larger Jaccard similarity indicates more consistency in top-N membership. Results are shown for the sentence document length.]

We observe similar patterns to the changes in standard deviation in Figure 2; PPMI displays the lowest Jaccard similarity across settings, while the other algorithms have higher similarities in the FIXED and SHUFFLED settings but much lower similarities in the BOOTSTRAP setting. We display results for both N=2 and N=10, emphasizing that even very highly ranked words often drop out of the top-N list.

Even when words do not drop out of the top-N list, they often change in rank, as we observe in Figure 3. We show both a specific example for the query term men and an aggregate of all the terms whose average rank is within the top 10 across runs of the BOOTSTRAP setting. In order to highlight the average changes in rank, we do not show outliers in this figure, but we note that outliers (large falls and jumps in rank) are common. The variability across samples from the BOOTSTRAP setting indicates that the presence of specific documents can significantly affect the top-N rankings.

[Figure 3: The change in rank across runs of the BOOTSTRAP setting for the top 10 words. We show results for both a single query, men, and an aggregate of all the queries, showing the change in rank of the words whose average ranking falls within the 10 nearest neighbors of those queries. Results are shown for SGNS on the AskHistorians corpus and the sentence document length.]

We also find that document segmentation size affects the cosine similarities. Figure 4 shows that documents segmented at a more fine-grained level produce embeddings with less variability across runs of the BOOTSTRAP setting. Documents segmented at the sentence level have standard deviations clustering closer to the median, while larger documents have standard deviations that are spread more widely. This effect is most significant for the 4th Circuit and 9th Circuit corpora, as these have much larger "documents" than the other corpora. We observe a similar effect for corpus size in Figure 5. The smaller corpus shows a larger spread in standard deviation than the larger corpus, indicating greater variability.

Finally, we find that the variance usually stabilizes at about 25 runs of the BOOTSTRAP setting. Figure 6 shows that variability initially increases with the number of models trained. We observe this pattern across corpora, algorithms, and settings.
[Figure 4: Standard deviation of the cosine similarities between all rank-N words and their 10 nearest neighbors. Results are shown for different document sizes (sentence vs whole document) in the BOOTSTRAP setting for SGNS in the 4th Circuit corpus.]

[Figure 5: Standard deviation of the cosine similarities between all rank-N words and their 10 nearest neighbors. Results are shown at different corpus sizes (20% vs 100% of documents) in the BOOTSTRAP setting for SGNS in the 4th Circuit corpus, segmented by sentence.]

8 Discussion

The most obvious result of our experiments is to emphasize that embeddings are not even a single objective view of a corpus, much less an objective view of language. The corpus is itself only a sample, and we have shown that the curation of this sample (its size, document length, and inclusion of specific documents) can cause significant variability in the embeddings. Happily, this variability can be quantified by averaging results over multiple bootstrap samples.

We can make several specific observations about algorithm sensitivities. In general, LSA, GloVe, SGNS, and PPMI are not sensitive to document order in the collections we evaluated. This is surprising, as we had expected SGNS to be sensitive to document order; anecdotally, we had observed cases where the embeddings were affected by groups of documents (e.g. in a different language) at the beginning of training. However, all four algorithms are sensitive to the presence of specific documents, though this effect is weaker for PPMI.

[Figure 6: The mean of the standard deviation of the cosine similarities between each query term and its 20 nearest neighbors. Results are shown for different numbers of runs of the BOOTSTRAP setting on the 4th Circuit corpus.]

Although PPMI appears deterministic (due to its pre-computed word-context matrix), we find that this algorithm produced results under the FIXED ordering whose variability was closest to the BOOTSTRAP setting. We attribute this intrinsic variability to the use of token-level subsampling. This sampling method introduces variation into the source corpus that appears to be comparable to a bootstrap resampling method. Sampling in PPMI is inspired by a similar method in the word2vec implementation of SGNS (Levy et al., 2015). It is therefore surprising that SGNS shows noticeable differentiation between the BOOTSTRAP setting on the one hand and the FIXED and SHUFFLED settings on the other.

The use of embeddings as sources of evidence needs to be tempered with the understanding that fine-grained distinctions between cosine similarities are not reliable and that smaller corpora and longer documents are more susceptible to variation in the cosine similarities between embeddings. When studying the top-N most similar words to a query, it is important to account for variation in these lists, as both rank and membership can significantly change across runs. Therefore, we emphasize that with smaller corpora comes greater variability, and we recommend that practitioners use bootstrap sampling to generate an ensemble of word embeddings for each sub-corpus and present both the mean and variability of any summary statistics such as ordered word similarities.
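In code, this recommendation amounts to reporting ensemble statistics rather than a single model's similarity. A minimal sketch follows, assuming a hypothetical `train_model` function that wraps any of the trainers from Section 5 and exposes gensim-style similarity; it is not the paper's exact pipeline.

```python
import numpy as np

def bootstrap_similarity(docs, query, word, train_model, n_runs=50, seed=0):
    """Mean and standard deviation of cosine(query, word) over bootstrap models.

    docs: list of raw document strings; train_model handles tokenization.
    """
    rng = np.random.default_rng(seed)
    sims = []
    for _ in range(n_runs):
        sample = list(rng.choice(docs, size=len(docs), replace=True))
        model = train_model(sample)  # hypothetical wrapper around a Section 5 trainer
        sims.append(model.wv.similarity(query, word))
    return float(np.mean(sims)), float(np.std(sims))
```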
We leave for future work a full hyperparameter sweep for the three algorithms. While these hyperparameters can substantially impact performance, our goal with this work was not to achieve high performance but to examine how the algorithms respond to changes in the corpus. We make no claim that one algorithm is better than another.

9 Conclusion

We find that there are several sources of variability in cosine similarities between word embedding vectors. The size of the corpus, the length of individual documents, and the presence or absence of specific documents can all affect the resulting embeddings. While differences in word association are measurable and are often significant, small differences in cosine similarity are not reliable, especially for small corpora. If the intention of a study is to learn about a specific corpus, we recommend that practitioners test the statistical confidence of similarities based on word embeddings by training on multiple bootstrap samples.

10 Acknowledgements

This work was supported by NSF #1526155, #1652536, and the Alfred P. Sloan Foundation. We would like to thank Alexandra Schofield, Laure Thompson, our Action Editor Ivan Titov, and our anonymous reviewers for their helpful comments.

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NIPS, pages 4349–4357.

Léon Bottou. 2012. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer.

Andreas Broscheid. 2011. Comparing circuits: Are some U.S. Courts of Appeals more liberal or conservative than others? Law & Society Review, 45(1), March.

Dallas Card, Amber E. Boydstun, Justin H. Gross, Philip Resnik, and Noah A. Smith. 2015. The media frames corpus: Annotations of frames across issues. In ACL.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.

Colin Cherry and Hongyu Guo. 2015. The unreasonable effectiveness of word representations for Twitter named entity recognition. In HLT-NAACL, pages 735–745.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. HLT-ACL, pages 1606–1615.

Anna Gladkova and Aleksandr Drozd. 2016. Intrinsic evaluations of word embeddings: What can we do better? In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 36–42.

Yoav Goldberg. 2017. Neural Network Methods for Natural Language Processing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. 2009. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. Technical Report No. 2009-05, Applied & Computational Mathematics, California Institute of Technology.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In ACL.

Johannes Hellrich and Udo Hahn. 2016. Bad company–neighborhoods in neural embedding spaces considered harmful. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2785–2796.

Ryan Heuser. 2016. Word vectors in the eighteenth century. In IPAM Workshop: Cultural Analytics.

Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal analysis of language through neural language models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science.
Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. 2016. Freshman or fresher? Quantifying the geographic variation of language in online social media. In ICWSM, pages 615–618.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS, pages 2177–2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the ACL, 3:211–225.

Rasmus E. Madsen, David Kauchak, and Charles Elkan. 2005. Modeling word burstiness using the Dirichlet distribution. In Proceedings of the 22nd International Conference on Machine Learning, pages 545–552. ACM.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. HLT-NAACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Lawrence Phillips, Kyle Shaffer, Dustin Arendt, Nathan Hodas, and Svitlana Volkova. 2017. Intrinsic and extrinsic evaluation of spatiotemporal text representations in Twitter streams. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 201–210.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA.

Evan Sandhaus. 2008. The New York Times Annotated Corpus. LDC2008T19. Linguistic Data Consortium.

Noam Shazeer, Ryan Doherty, Colin Evans, and Chris Waterson. 2016. Swivel: Improving embeddings by noticing what's missing. arXiv:1602.02215.

Yingtao Tian, Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. 2016. On the convergent properties of word embedding methods. arXiv preprint arXiv:1605.03956.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the ACL, pages 384–394. Association for Computational Linguistics.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Ivan Vulic and Marie-Francine Moens. 2015. Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proceedings of the ACL, pages 719–725. ACL.

Justine Zhang, William L. Hamilton, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky, and Jure Leskovec. 2017. Community identity and user engagement in a multi-community landscape. In Proceedings of ICWSM.