Transactions of the Association for Computational Linguistics, vol. 6, pp. 107–119, 2018. Action Editor: Ivan Titov.
Submission batch: 6/2017; Revision batch: 9/2017; Published 2/2018.
© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.
Evaluating the Stability of Embedding-based Word Similarities

Maria Antoniak, Cornell University, maa343@cornell.edu
David Mimno, Cornell University, mimno@cornell.edu

Abstract

Word embeddings are increasingly being used as a tool to study word associations in specific corpora. However, it is unclear whether such embeddings reflect enduring properties of language or if they are sensitive to inconsequential variations in the source documents. We find that nearest-neighbor distances are highly sensitive to small changes in the training corpus for a variety of algorithms. For all methods, including specific documents in the training set can result in substantial variations. We show that these effects are more prominent for smaller training corpora. We recommend that users never rely on single embedding models for distance calculations, but rather average over multiple bootstrap samples, especially for small corpora.

1 Introduction

Word embeddings are a popular technique in natural language processing (NLP) in which the words in a vocabulary are mapped to low-dimensional vectors. Embedding models are easily trained (several implementations are publicly available), and relationships between the embedding vectors, often measured via cosine similarity, can be used to reveal latent semantic relationships between pairs of words. Word embeddings are increasingly being used by researchers in unexpected ways and have become popular in fields such as digital humanities and computational social science (Hamilton et al., 2016; Heuser, 2016; Phillips et al., 2017).

Embedding-based analyses of semantic similarity can be a robust and valuable tool, but we find that standard methods dramatically under-represent the variability of these measurements. Embedding algorithms are much more sensitive than they appear to factors such as the presence of specific documents, the size of the documents, the size of the corpus, and even seeds for random number generators. If users do not account for this variability, their conclusions are likely to be invalid. Fortunately, we also find that simply averaging over multiple bootstrap samples is sufficient to produce stable, reliable results in all cases tested.

NLP research in word embeddings has so far focused on a downstream-centered use case, where the end goal is not the embeddings themselves but performance on a more complicated task. For example, word embeddings are often used as the bottom layer in neural network architectures for NLP (Bengio et al., 2003; Goldberg, 2017). The embeddings' training corpus, which is selected to be as large as possible, is only of interest insofar as it generalizes to the downstream training corpus.

In contrast, other researchers take a corpus-centered approach and use relationships between embeddings as direct evidence about the language and culture of the authors of a training corpus (Bolukbasi et al., 2016; Hamilton et al., 2016; Heuser, 2016). Embeddings are used as if they were simulations of a survey asking subjects to free-associate words from query terms. Unlike the downstream-centered approach, the corpus-centered approach is based on direct human analysis of nearest neighbors to embedding vectors, and the training corpus is not simply an off-the-shelf convenience but rather the central object of study.
Downstream-centered                      | Corpus-centered
Big corpus                               | Small corpus, difficult or impossible to expand
Source is not important                  | Source is the object of study
Only vectors are important               | Specific, fine-grained comparisons are important
Embeddings are used in downstream tasks  | Embeddings are used to learn about the mental model of word association for the authors of the corpus

Table 1: Comparison of downstream-centered and corpus-centered approaches to word embeddings.

While word embeddings may appear to measure properties of language, they in fact only measure properties of a curated corpus, which could suffer from several problems. The training corpus is merely a sample of the authors' language model (Shazeer et al., 2016). Sources could be missing or over-represented, typos and other lexical variations could be present, and, as noted by Goodfellow et al. (2016), "Many datasets are most naturally arranged in a way where successive examples are highly correlated." Furthermore, embeddings can vary considerably across random initializations, making lists of "most similar words" unstable.

We hypothesize that training on small and potentially idiosyncratic corpora can exacerbate these problems and lead to highly variable estimates of word similarity. Such small corpora are common in digital humanities and computational social science, and it is often impossible to mitigate these problems simply by expanding the corpus. For example, we cannot create more 18th Century English books or change their topical focus.

We explore causes of this variability, which range from the fundamental stochastic nature of certain algorithms to more troubling sensitivities to properties of the corpus, such as the presence or absence of specific documents. We focus on the training corpus as a source of variation, viewing it as a fragile artifact curated by often arbitrary decisions. We examine four different algorithms and six datasets, and we manipulate the corpus by shuffling the order of the documents and taking bootstrap samples of the documents. Finally, we examine the effects of these manipulations on the cosine similarities between embeddings.

We find that there is considerable variability in embeddings that may not be obvious to users of these methods. Rankings of most similar words are not reliable, and both ordering and membership in such lists are liable to change significantly. Some uncertainty is expected, and there is no clear criterion for "acceptable" levels of variance, but we argue that the amount of variation we observe is sufficient to call the whole method into question. For example, we find cases in which there is zero set overlap in "top 10" lists for the same query word across bootstrap samples. Smaller corpora and larger document sizes increase this variation. Our goal is to provide methods to quantify this variability, and to account for this variability, we recommend that as the size of a corpus gets smaller, cosine similarities should be averaged over many bootstrap samples.

2 Related Work

Word embeddings are mappings of words to points in a K-dimensional continuous space, where K is much smaller than the size of the vocabulary. Reducing the number of dimensions has two benefits: first, large, sparse vectors are transformed into small, dense vectors; and second, the conflation of features uncovers latent semantic relationships between the words. These semantic relationships are usually measured via cosine similarity, though other metrics such as Euclidean distance and the Dice coefficient are possible (Turney and Pantel, 2010). We focus on four of the most popular training algorithms: Latent Semantic Analysis (LSA) (Deerwester et al., 1990), Skip-Gram with Negative Sampling (SGNS) (Mikolov et al., 2013), Global Vectors for Word Representation (GloVe) (Pennington et al., 2014), and Positive Pointwise Mutual Information (PPMI) (Levy and Goldberg, 2014) (see Section 5 for more detailed descriptions of these algorithms).
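The nearest-neighbor queries at the heart of this kind of analysis reduce to a short computation. The following is a minimal sketch; the function and variable names are ours, and the toy vocabulary and random vectors are purely illustrative stand-ins for trained embeddings.

```python
import numpy as np

def most_similar(query, vocab, vectors, n=10):
    """Return the n nearest neighbors of `query` by cosine similarity.

    vocab:   list of words, aligned with the rows of `vectors`
    vectors: array of shape (vocabulary size, K), one embedding per word
    """
    # Normalizing the rows makes the dot product equal to cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(query)]
    ranked = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in ranked if vocab[i] != query][:n]

# Toy usage; a real study would pass embeddings from a trained model.
rng = np.random.default_rng(0)
vocab = ["marijuana", "cocaine", "heroin", "nicotine", "alcohol"]
print(most_similar("marijuana", vocab, rng.normal(size=(5, 100)), n=3))
```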
In NLP, word embeddings are often used as features for downstream tasks. Dependency parsing (Chen and Manning, 2014), named entity recognition (Turian et al., 2010; Cherry and Guo, 2015), and bilingual lexicon induction (Vulic and Moens, 2015) are just a few examples where the use of embeddings as features has increased performance in recent years.

Increasingly, word embeddings have been used as evidence in studies of language and culture.
For example, Hamilton et al. (2016) train separate embeddings on temporal segments of a corpus and then analyze changes in the similarity of words to measure semantic shifts, and Heuser (2016) uses embeddings to characterize discourse about virtues in 18th Century English text. Other studies use cosine similarities between embeddings to measure the variation of language across geographical areas (Kulkarni et al., 2016; Phillips et al., 2017) and time (Kim et al., 2014). Each of these studies seeks to reconstruct the mental model of authors based on documents.

An example that highlights the contrast between the downstream-centered and corpus-centered perspectives is the exploration of implicit bias in word embeddings. Researchers have observed that embedding-based word similarities reflect cultural stereotypes, such as associations between occupations and genders (Bolukbasi et al., 2016). From a downstream-centered perspective, these stereotypical associations represent bias that should be filtered out before using the embeddings as features. In contrast, from a corpus-centered perspective, implicit bias in embeddings is not a problem that must be fixed but rather a means of measurement, providing quantitative evidence of bias in the training corpus.

Embeddings are usually evaluated on direct use cases, such as word similarity and analogy tasks via cosine similarities (Mikolov et al., 2013; Pennington et al., 2014; Levy et al., 2015; Shazeer et al., 2016). Intrinsic evaluations like word similarities measure the interpretability of the embeddings rather than their downstream task performance (Gladkova and Drozd, 2016), but while some research does evaluate embedding vectors on their downstream task performance (Pennington et al., 2014; Faruqui et al., 2015), the standard benchmarks remain intrinsic.

There has been some recent work in evaluating the stability of word embeddings. Levy et al. (2015) focus on the hyperparameter settings for each algorithm and show that hyperparameters such as the size of the context window, the number of negative samples, and the level of context distribution smoothing can affect the performance of embeddings on similarity and analogy tasks. Hellrich and Hahn (2016) examine the effects of word frequency, word ambiguity, and the number of training epochs on the reliability of embeddings produced by SGNS and skip-gram hierarchical softmax (SGHS) (a variant of SGNS), striving for reproducibility and recommending against sampling the corpus in order to preserve stability. Likewise, Tian et al. (2016) explore the robustness of SGNS and GloVe embeddings trained on large, generic corpora (Wikipedia and news data) and propose methods to align these embeddings across different iterations.

In contrast, our goal is not to produce artificially stable embeddings but to identify the factors that create instability and measure our statistical confidence in the cosine similarities between embeddings trained on small, specific corpora. We focus on the corpus as a fragile artifact and source of variation, considering the corpus itself as merely a sample of possible documents produced by the authors. We examine whether the embeddings accurately model those authors, using bootstrap sampling to measure the effects of adding or removing documents from the training corpus.

3 Corpora

We collected two sub-corpora from each of three datasets (see Table 2) to explore how word embeddings are affected by size, vocabulary, and other parameters of the training corpus. In order to better model realistic examples of corpus-centered research, these corpora are deliberately chosen to be publicly available, suggestive of social research questions, varied in corpus parameters (e.g. topic, size, vocabulary), and much smaller than the standard corpora typically used in training word embeddings (e.g. Wikipedia, Gigaword). Each dataset was created organically, over specific time periods, in specific social settings, by specific authors. Thus, it is impossible to expand these datasets without compromising this specificity.
We process each corpus by lowercasing all text, removing words that appear fewer than 20 times in the corpus, and removing all numbers and punctuation. Because our methods rely on bootstrap sampling (see Section 6), which operates by removing or multiplying the presence of documents, we also remove duplicate documents from each corpus.
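A minimal sketch of this preprocessing, under the assumption that a simple letters-only tokenizer is acceptable; the helper name and regular expression are ours, not the paper's exact implementation.

```python
import re
from collections import Counter

def preprocess(docs, min_count=20):
    """Lowercase, strip numbers and punctuation, drop rare words, de-duplicate."""
    docs = list(dict.fromkeys(docs))  # drop exact duplicate documents, keep order
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    counts = Counter(token for doc in tokenized for token in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]
```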
U.S. Federal Courts of Appeals. The U.S. Federal courts of appeals are regional courts that decide appeals from the district courts within their federal judicial circuit. We examine the embeddings of the most recent five years of the 4th and 9th circuits.[1] The 4th circuit contains Washington D.C. and surrounding states, while the 9th circuit contains the entirety of the west coast. Social science research questions might involve measuring a widely held belief that certain courts have distinct ideological tendencies (Broscheid, 2011). Such bias may result in measurable differences in word association due to framing effects (Card et al., 2015), which could be observable by comparing the words associated with a given query term. We treat each opinion as a single document.

[1] https://www.courtlistener.com/

Corpus              Documents   Unique words   Vocabulary density   Words per document
NYT Sports (2000)       8,786         12,475               0.0020                  708
NYT Music (2000)        3,666          9,762               0.0037                  715
AskScience            331,635         16,901               0.0012                   44
AskHistorians          63,578          9,384               0.0022                   66
4th Circuit             5,368         16,639               0.0014                2,281
9th Circuit             9,729         22,146               0.0011                2,108

Table 2: Comparison of the number of documents, number of unique words (after removing words that appear fewer than 20 times), vocabulary density (the ratio of unique words to the total number of words), and the average number of words per document for each corpus.

Setting     Method                               Tests...                                  Run 1   Run 2   Run 3
Fixed       Documents in consistent order        variability due to algorithm (baseline)   ABC     ABC     ABC
Shuffled    Documents in random order            variability due to document order         ACB     BAC     CBA
Bootstrap   Documents sampled with replacement   variability due to document presence      BAA     CAB     BBB

Table 3: The three settings that manipulate the document order and presence in each corpus.

New York Times. The New York Times (NYT) Annotated Corpus (Sandhaus, 2008) contains newspaper articles tagged with additional metadata reflecting their content and publication context. To constrain the size of the corpora and to enhance their specificity, we extract data only for the year 2000 and focus on only two sections of the NYT dataset: sports and music. In the resulting corpora, the sports section is substantially larger than the music section (see Table 2). We treat an article as a single document.

Reddit. Reddit[2] is a social website containing thousands of forums (subreddits) organized by topic. We use a dataset containing all posts for the years 2007-2014 from two subreddits: /r/AskScience and /r/AskHistorians. These two subreddits allow users to post any question in the topics of history and science, respectively. AskScience is more than five times larger than AskHistorians, though the document length is generally longer for AskHistorians (see Table 2). Reddit is a popular data source for computational social science research; for example, subreddits can be used to explore the distinctiveness and dynamicity of communities (Zhang et al., 2017). We treat an original post as a single document.

[2] https://www.reddit.com/

4 Corpus Parameters

Order and presence of documents. We use three different methods to sample the corpus: FIXED, SHUFFLED, and BOOTSTRAP. The FIXED setting includes each document exactly once, and the documents appear in a constant, chronological order across all models. The purpose of this setting is to measure the baseline variability of an algorithm, independent of any change in input data. Algorithmic variability may arise from random initializations of learned parameters, random negative sampling, or randomized subsampling of tokens within documents. The SHUFFLED setting includes each document exactly once, but the order of the documents is randomized for each model. The purpose of this setting is to evaluate the impact of variation in how we present examples to each algorithm. The order of documents could be an important factor for algorithms that use online training such as SGNS. The BOOTSTRAP setting samples N documents randomly with replacement, where N is equal to the number of documents in the FIXED setting. The purpose of this setting is to measure how much variability is due to the presence or absence of specific sequences of tokens in the corpus. See Table 3 for a comparison of these three settings.
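Each setting amounts to a simple manipulation of the document list. A sketch follows, with NumPy's random generator standing in for whatever sampler a given implementation uses; the function name is ours.

```python
import numpy as np

def sample_corpus(docs, setting, seed=None):
    """Produce one training corpus under the FIXED, SHUFFLED, or BOOTSTRAP setting."""
    rng = np.random.default_rng(seed)
    if setting == "fixed":        # each document once, constant chronological order
        return list(docs)
    if setting == "shuffled":     # each document once, random order per model
        return list(rng.permutation(docs))
    if setting == "bootstrap":    # N documents drawn with replacement
        return list(rng.choice(docs, size=len(docs), replace=True))
    raise ValueError(f"unknown setting: {setting}")
```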
Size of corpus. We expect the stability of embedding-based word similarities to be influenced by the size of the training corpus. As we add more documents, the impact of any specific document should be less significant. At the same time, larger corpora may also tend to be more broad in scope and variable in style and topic, leading to less idiosyncratic patterns in word co-occurrence. Therefore, for each corpus, we curate a smaller sub-corpus that contains 20% of the total corpus documents. These samples are selected using contiguous sequences of documents at the beginning of each training (this ensures that the FIXED setting remains constant).

Length of documents. We use two document segmentation strategies. In the first setting, each training instance is a single document (i.e. an article for the NYT corpus, an opinion from the Courts corpus, and a post from the Reddit corpus). In the second setting, each training instance is a single sentence. We expect this choice of segmentation to have the largest impact on the BOOTSTRAP setting. Documents are often characterized by "bursty" words that are locally frequent but globally rare (Madsen et al., 2005), such as the name of a defendant in a court case. Sampling whole documents with replacement should magnify the effect of bursty words: a rare but locally frequent word will either occur in a BOOTSTRAP corpus or not occur. Sampling sentences with replacement should have less effect on bursty words, since the chance that an entire document will be removed from the corpus is much smaller.

5 Algorithms

Evaluating all current embedding algorithms and implementations is beyond the scope of this work, so we select four categories of algorithms that represent distinct optimization strategies. Recall that our goal is to examine how algorithms respond to variation in the corpus, not to maximize the accuracy or effectiveness of the embeddings.

The first category is online stochastic updates, in which the algorithm updates model parameters using stochastic gradients as it proceeds through the training corpus. All methods implemented in the word2vec and fastText packages follow this format, including skip-gram, CBOW, negative sampling, and hierarchical softmax (Mikolov et al., 2013). We focus on SGNS as a popular and representative example. The second category is batch stochastic updates, in which the algorithm first collects a matrix of summary statistics derived from a pass through the training data that takes place before any parameters are set, and then updates model parameters using stochastic optimization. We select the GloVe algorithm (Pennington et al., 2014) as a representative example. The third category is matrix factorization, in which the algorithm makes deterministic updates to model parameters based on a matrix of summary statistics. As a representative example we include PPMI (Levy and Goldberg, 2014). Finally, to test whether word order is a significant factor, we include a document-based embedding method that uses matrix factorization, LSA (Deerwester et al., 1990; Landauer and Dumais, 1997).

These algorithms each include several hyperparameters, which are known to have measurable effects on the resulting embeddings (Levy et al., 2015). We have attempted to choose settings of these parameters that are commonly used and comparable across algorithms, but we emphasize that a full evaluation of the effect of each algorithmic parameter would be beyond the scope of this work. For each of the following algorithms, we set the context window size to 5 and the embedding size to 100. Since we remove words that occur fewer than 20 times during preprocessing of the corpus, we set the frequency threshold for the following algorithms to 0. For all other hyperparameters, we follow the default or most popular settings for each algorithm, as described in the following sections.
5.1 LSA

Latent semantic analysis (LSA) factorizes a sparse term-document matrix X (Deerwester et al., 1990; Landauer and Dumais, 1997). X is factored using singular value decomposition (SVD), retaining K singular values such that

X ≈ X_K = U_K Σ_K V_K^T.

The elements of the term-document matrix are weighted, often with TF-IDF, which measures the importance of a word to a document in a corpus. The dense, low-rank approximation of the term-document matrix, X_K, can be used to measure the relatedness of terms by calculating the cosine similarity of the relevant rows of the reduced matrix.

We use the scikit-learn[3] package to train our LSA embeddings. We create a term-document matrix with TF-IDF weighting, using the default settings except that we add L2 normalization and sublinear TF scaling, which scales the importance of terms with high frequency within a document. We perform dimensionality reduction via a randomized solver (Halko et al., 2009).

[3] http://scikit-learn.org/

The construction of the term-count matrix and the TF-IDF weighting should introduce no variation to the final word embeddings. However, we expect variation due to the randomized SVD solver, even when all other parameters (training document order, presence, size, etc.) are constant.
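A sketch of this configuration in scikit-learn: the sublinear TF scaling, L2 normalization, and randomized solver follow the description above, while reading term vectors off the rows of the factorization is one common convention rather than the paper's verified code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def train_lsa(docs, k=100):
    """Factor a TF-IDF term-document matrix with a randomized SVD solver."""
    vectorizer = TfidfVectorizer(sublinear_tf=True, norm="l2")
    X = vectorizer.fit_transform(docs)  # documents x terms, sparse
    svd = TruncatedSVD(n_components=k, algorithm="randomized")
    svd.fit(X)
    # Each column of components_ corresponds to a term; its K values
    # serve as that term's embedding.
    terms = vectorizer.get_feature_names_out()
    return dict(zip(terms, svd.components_.T))
```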
5.2 SGNS

The skip-gram with negative sampling (SGNS) algorithm (Mikolov et al., 2013) is an online algorithm that uses randomized updates to predict words based on their context. In each iteration, the algorithm proceeds through the original documents and, at each word token, updates model parameters based on gradients calculated from the current model parameters. This process maximizes the likelihood of observed word-context pairs and minimizes the likelihood of negative samples.

We use an implementation of the SGNS algorithm included in the Python library gensim[4] (Řehůřek and Sojka, 2010). We use the default settings provided with gensim except as described above.

[4] https://radimrehurek.com/gensim/models/word2vec.html

We predict that multiple runs of SGNS on the same corpus will not produce the same results. SGNS randomly initializes all the embeddings before training begins, and it relies on negative samples created by randomly selecting word and context pairs (Mikolov et al., 2013; Levy et al., 2015). We also expect SGNS to be sensitive to the order of documents, as it relies on stochastic gradient descent, which can be biased to be more influenced by initial documents (Bottou, 2012).
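A sketch of comparable settings in gensim: the window size, dimensionality, and disabled frequency threshold come from the text above, while the remaining arguments are gensim defaults that we spell out for clarity; the wrapper function name is ours.

```python
from gensim.models import Word2Vec

def train_sgns(token_lists):
    """Skip-gram with negative sampling over pre-tokenized training instances."""
    return Word2Vec(
        sentences=token_lists,  # one token list per document (or per sentence)
        vector_size=100,        # embedding size
        window=5,               # context window size
        min_count=0,            # rare words were already removed in preprocessing
        sg=1,                   # skip-gram rather than CBOW
        negative=5,             # number of negative samples
    )

# model.wv.most_similar("marijuana", topn=10) then returns the ranked neighbors.
```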
5.3 GloVe

Global Vectors for Word Representation (GloVe) uses stochastic gradient updates but operates on a "global" representation of word co-occurrence that is calculated once at the beginning of the algorithm (Pennington et al., 2014). Words and contexts are associated with bias parameters, b_w and b_c, where w is a word and c is a context, learned by minimizing the cost function:

L = Σ_{w,c} f(x_{wc}) (w · c + b_w + b_c − log x_{wc})²

We use the GloVe implementation provided by Pennington et al. (2014).[5] We use the default settings provided with GloVe except as described above.

[5] http://nlp.stanford.edu/projects/glove/

Unlike SGNS, the algorithm does not perform model updates while examining the original documents. As a result, we expect GloVe to be sensitive to random initializations but not sensitive to the order of documents.

5.4 PPMI

The positive pointwise mutual information (PPMI) matrix, whose cells represent the PPMI of each pair of words and contexts, is factored using singular value decomposition (SVD) and results in low-dimensional embeddings that perform similarly to GloVe and SGNS (Levy and Goldberg, 2014):

PMI(w, c) = log [ P(w, c) / (P(w) P(c)) ];  PPMI(w, c) = max(PMI(w, c), 0).

To train our PPMI word embeddings, we use hyperwords,[6] an implementation provided as part of Levy et al. (2015).[7] We follow the authors' recommendations and set the context distribution smoothing (cds) parameter to 0.75, the eigenvalue matrix (eig) to 0.5, the subsampling threshold (sub) to 10^-5, and the context window (win) to 5.

[6] https://bitbucket.org/omerlevy/hyperwords/src
[7] We altered the PPMI code to remove a fixed random seed in order to introduce variability given a fixed corpus; no other change was made.

Like GloVe and unlike SGNS, PPMI operates on a pre-computed representation of word co-occurrence, so we do not expect results to vary based on the order of documents. Unlike both GloVe and SGNS, PPMI uses a stable, non-stochastic SVD algorithm that should produce the same result given the same input, regardless of initialization. However, we expect variation due to PPMI's random subsampling of frequent tokens.
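The PPMI transformation itself follows directly from these definitions. Below is a minimal dense-matrix sketch that omits hyperwords' context distribution smoothing and subsampling; the function name is ours.

```python
import numpy as np

def ppmi(counts):
    """PPMI matrix from a word-context co-occurrence count matrix.

    Assumes every row and column has at least one count, which holds
    after rare words are removed in preprocessing.
    """
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):  # log(0) -> -inf, clipped below
        pmi = np.log(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)  # PPMI(w, c) = max(PMI(w, c), 0)

# Low-dimensional embeddings are then obtained from a truncated SVD of this matrix.
```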
6 Methods

To establish statistical significance bounds for our observations, we train 50 LSA models, 50 SGNS models, 50 GloVe models, and 50 PPMI models for each of the three settings (FIXED, SHUFFLED, and BOOTSTRAP), for each document segmentation size, for each corpus.

For each corpus, we select a set of 20 relevant query words from high probability words from an LDA topic model (Blei et al., 2003) trained on that corpus with 200 topics. We calculate the cosine similarity of each query word to the other words in the vocabulary, creating a similarity ranking of all the words in the vocabulary. We calculate the mean and standard deviation of the cosine similarities for each pair of query word and vocabulary word across each set of 50 models.

From the lists of queries and cosine similarities, we select the 20 words most closely related to the set of query words and compare the mean and standard deviation of those pairs across settings. We calculate the Jaccard similarity between top-N lists to compare membership change in the lists of most closely related words, and we find average changes in rank within those lists. We examine these metrics across different algorithms and corpus parameters.

7 Results

We begin with a case study of the framing around the query term marijuana. One might hypothesize that the authors of various corpora (e.g. judges of the 4th Circuit, journalists at the NYT, and users on Reddit) have different perceptions of this drug and that their language might reflect those differences. Indeed, after qualitatively examining the lists of most similar terms (see Table 4), we might come to the conclusion that the allegedly conservative 4th Circuit judges view marijuana as similar to illegal drugs such as heroin and cocaine, while Reddit users view marijuana as closer to legal substances such as nicotine and alcohol.

[Figure 1: The mean standard deviations across settings and algorithms for the 10 closest words to the query words in the 9th Circuit and NYT Music corpora using the whole documents. Larger variations indicate less stable embeddings.]

However, we observe patterns that cause us to lower our confidence in such conclusions. Table 4 shows that the cosine similarities can vary significantly. We see that the top ranked words (chosen according to their mean cosine similarity across runs of the FIXED setting) can have widely different mean similarities and standard deviations depending on the algorithm and the three training settings, FIXED, SHUFFLED, and BOOTSTRAP.

As expected, each algorithm has a small variation in the FIXED setting. For example, we can see the effect of the random SVD solver for LSA and the effect of random subsampling for PPMI. We do not observe a consistent effect for document order in the SHUFFLED setting.

[Table 4: The most similar words, with their means and standard deviations for the cosine similarities between the query word marijuana and its 10 nearest neighbors (highest mean cosine similarity in the FIXED setting), for each algorithm (LSA, SGNS, GloVe, PPMI) and corpus (4th Circuit, NYT Sports, Reddit AskScience). Embeddings are learned from documents segmented by sentence.]

Most importantly, these figures reveal that the BOOTSTRAP setting causes large increases in variation across all algorithms (with a weaker effect for PPMI) and corpora, with large standard deviations across word rankings. This indicates that the presence of specific documents in the corpus can significantly affect the cosine similarities between embedding vectors.

GloVe produced very similar embeddings in both the FIXED and SHUFFLED settings, with similar means and small standard deviations, which indicates that GloVe is not sensitive to document order. However, the BOOTSTRAP setting caused a reduction in the mean and widened the standard deviation, indicating that GloVe is sensitive to the presence of specific documents.
Run 1        Run 2         Run 3       Run 4            Run 5          Run 6        Run 7
viability    fetus         trimester   surgery          trimester      pregnancies  abdomen
pregnancies  pregnancies   surgery     visit            surgery        occupation   tenure
abortion     gestation     visit       therapy          incarceration  viability    stepfather
abortions    kindergarten  tenure      pain             visit          abortion     wife
fetus        viability     workday     hospitalization  arrival        tenure       groin
gestation    headaches     abortions   neck             pain           visit        throat
surgery      pregnant      hernia      headaches        headaches      abortions    grandmother
expiration   abortion      summer      trimester        birthday       pregnant     daughter
sudden       pain          suicide     experiencing     neck           birthday     panic
fetal        bladder       abortion    medications      tenure         fetus        jaw

Table 5: The 10 closest words to the query term pregnancy are highly variable. None of the words shown appear in every run. Results are shown across runs of the BOOTSTRAP setting for the full corpus of the 9th Circuit, the whole document size, and the SGNS model.

Run 1         Run 2         Run 3         Run 4         Run 5         Run 6         Run 7
selection     selection     selection     selection     selection     selection     selection
genetics      process       human         darwinian     convergent    evolutionary  darwinian
convergent    darwinian     humans        theory        darwinian     humans        nature
process       humans        natural       genetics      evolutionary  species       evolutionary
darwinian     convergent    genetics      human         genetics      convergent    convergent
abiogenesis   evolutionary  species       evolutionary  theory        process       process
evolutionary  species       did           humans        natural       natural       natural
natural       human         convergent    natural       humans        did           species
nature        natural       process       convergent    process       human         humans
species       theory        evolutionary  creationism   human         darwinian     favor

Table 6: The order of the 10 closest words to the query term evolution is highly variable. Results are shown across runs of the BOOTSTRAP setting for the full corpus of AskScience, the whole document length, and the GloVe model.

These patterns of larger or smaller variations are generalized in Figure 1, which shows the mean standard deviation for different algorithms and settings. We calculated the standard deviation across the 50 runs for each query word in each corpus, and then we averaged over these standard deviations. The results show the average levels of variation for each algorithm and corpus. We observe that the FIXED and SHUFFLED settings for GloVe and LSA produce the least variable cosine similarities, while PPMI produces the most variable cosine similarities for all settings. The presence of specific documents has a significant effect on all four algorithms (lesser for PPMI), consistently increasing the standard deviations.

We turn to the question of how this variation in standard deviation affects the lists of most similar words. Are the top-N words simply re-ordered, or do the words present in the list substantially change? Table 5 shows an example of the top-N word lists for the query word pregnancy in the 9th Circuit corpus. Observing Run 1, we might believe that judges of the 9th Circuit associate pregnancy most with questions of viability and abortion, while observing Run 5, we might believe that pregnancy is most associated with questions of prisons and family visits. Although the lists in this table are all produced from the same corpus and document size, the membership of the lists changes substantially between runs of the BOOTSTRAP setting.

As another example, Table 6 shows results for the query evolution for the GloVe model and the AskScience corpus. Although this query shows less variation between runs, we still find cause for concern. For example, Run 3 ranks the words human and humans highly, while Run 1 includes neither of those words in the top 10.

These changes in top-N rank are shown in Figure 2. For each query word for the AskHistorians corpus, we find the N most similar words using SGNS. We generate new top-N lists for each of the 50 models trained in the BOOTSTRAP setting, and we use Jaccard similarity to compare the 50 lists.
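The Jaccard comparison from Section 6 can be sketched in a few lines; here `top_n_lists` stands for the 50 per-model lists of nearest-neighbor words, and the function name is ours.

```python
from itertools import combinations

def mean_jaccard(top_n_lists):
    """Average pairwise Jaccard similarity between top-N neighbor word lists."""
    scores = [len(set(a) & set(b)) / len(set(a) | set(b))
              for a, b in combinations(top_n_lists, 2)]
    return sum(scores) / len(scores)

# Identical lists give 1.0; disjoint lists give 0.0.
print(mean_jaccard([["human", "humans", "species"],
                    ["human", "species", "natural"]]))  # 0.5
```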
[Figure 2: The mean Jaccard similarities across settings and algorithms for the top 2 and top 10 closest words to the query words in the AskHistorians corpus. Larger Jaccard similarity indicates more consistency in top-N membership. Results are shown for the sentence document length.]

We observe similar patterns to the changes in standard deviation in Figure 2; PPMI displays the lowest Jaccard similarity across settings, while the other algorithms have higher similarities in the FIXED and SHUFFLED settings but much lower similarities in the BOOTSTRAP setting. We display results for both N=2 and N=10, emphasizing that even very highly ranked words often drop out of the top-N list.

Even when words do not drop out of the top-N list, they often change in rank, as we observe in Figure 3. We show both a specific example for the query term men and an aggregate of all the terms whose average rank is within the top 10 across runs of the BOOTSTRAP setting. In order to highlight the average changes in rank, we do not show outliers in this figure, but we note that outliers (large falls and jumps in rank) are common. The variability across samples from the BOOTSTRAP setting indicates that the presence of specific documents can significantly affect the top-N rankings.

[Figure 3: The change in rank across runs of the BOOTSTRAP setting for the top 10 words. We show results for both a single query, men, and an aggregate of all the queries, showing the change in rank of the words whose average ranking falls within the 10 nearest neighbors of those queries. Results are shown for SGNS on the AskHistorians corpus and the sentence document length.]

We also find that document segmentation size affects the cosine similarities. Figure 4 shows that documents segmented at a more fine-grained level produce embeddings with less variability across runs of the BOOTSTRAP setting. Documents segmented at the sentence level have standard deviations clustering closer to the median, while larger documents have standard deviations that are spread more widely. This effect is most significant for the 4th Circuit and 9th Circuit corpora, as these have much larger "documents" than the other corpora. We observe a similar effect for corpus size in Figure 5. The smaller corpus shows a larger spread in standard deviation than the larger corpus, indicating greater variability.

Finally, we find that the variance usually stabilizes at about 25 runs of the BOOTSTRAP setting. Figure 6 shows that variability initially increases with the number of models trained. We observe this pattern across corpora, algorithms, and settings.
[Figure 4: Standard deviation of the cosine similarities between all rank-N words and their 10 nearest neighbors. Results are shown for different document sizes (sentence vs whole document) in the BOOTSTRAP setting for SGNS in the 4th Circuit corpus.]

[Figure 5: Standard deviation of the cosine similarities between all rank-N words and their 10 nearest neighbors. Results are shown at different corpus sizes (20% vs 100% of documents) in the BOOTSTRAP setting for SGNS in the 4th Circuit corpus, segmented by sentence.]

8 Discussion

The most obvious result of our experiments is to emphasize that embeddings are not even a single objective view of a corpus, much less an objective view of language. The corpus is itself only a sample, and we have shown that the curation of this sample (its size, document length, and inclusion of specific documents) can cause significant variability in the embeddings. Happily, this variability can be quantified by averaging results over multiple bootstrap samples.

We can make several specific observations about algorithm sensitivities. In general, LSA, GloVe, SGNS, and PPMI are not sensitive to document order in the collections we evaluated. This is surprising, as we had expected SGNS to be sensitive to document order; anecdotally, we had observed cases where the embeddings were affected by groups of documents (e.g. in a different language) at the beginning of training. However, all four algorithms are sensitive to the presence of specific documents, though this effect is weaker for PPMI.

[Figure 6: The mean of the standard deviation of the cosine similarities between each query term and its 20 nearest neighbors. Results are shown for different numbers of runs of the BOOTSTRAP setting on the 4th Circuit corpus.]

Although PPMI appears deterministic (due to its pre-computed word-context matrix), we find that this algorithm produced results under the FIXED ordering whose variability was closest to the BOOTSTRAP setting. We attribute this intrinsic variability to the use of token-level subsampling. This sampling method introduces variation into the source corpus that appears to be comparable to a bootstrap resampling method. Sampling in PPMI is inspired by a similar method in the word2vec implementation of SGNS (Levy et al., 2015). It is therefore surprising that SGNS shows noticeable differentiation between the BOOTSTRAP setting on the one hand and the FIXED and SHUFFLED settings on the other.

The use of embeddings as sources of evidence needs to be tempered with the understanding that fine-grained distinctions between cosine similarities are not reliable and that smaller corpora and longer documents are more susceptible to variation in the cosine similarities between embeddings. When studying the top-N most similar words to a query, it is important to account for variation in these lists, as both rank and membership can significantly change across runs. Therefore, we emphasize that with smaller corpora comes greater variability, and we recommend that practitioners use bootstrap sampling to generate an ensemble of word embeddings for each sub-corpus and present both the mean and variability of any summary statistics such as ordered word similarities.
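In code, this recommendation amounts to reporting ensemble statistics rather than a single model's similarity. A minimal sketch follows, assuming a hypothetical `train_model` function that wraps any of the trainers from Section 5 and exposes gensim-style similarity; it is not the paper's exact pipeline.

```python
import numpy as np

def bootstrap_similarity(docs, query, word, train_model, n_runs=50, seed=0):
    """Mean and standard deviation of cosine(query, word) over bootstrap models.

    docs: list of raw document strings; train_model handles tokenization.
    """
    rng = np.random.default_rng(seed)
    sims = []
    for _ in range(n_runs):
        sample = list(rng.choice(docs, size=len(docs), replace=True))
        model = train_model(sample)  # hypothetical wrapper around a Section 5 trainer
        sims.append(model.wv.similarity(query, word))
    return float(np.mean(sims)), float(np.std(sims))
```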
We leave for future work a full hyperparameter sweep for the three algorithms. While these hyperparameters can substantially impact performance, our goal with this work was not to achieve high performance but to examine how the algorithms respond to changes in the corpus. We make no claim that one algorithm is better than another.

9 Conclusion

We find that there are several sources of variability in cosine similarities between word embedding vectors. The size of the corpus, the length of individual documents, and the presence or absence of specific documents can all affect the resulting embeddings. While differences in word association are measurable and are often significant, small differences in cosine similarity are not reliable, especially for small corpora. If the intention of a study is to learn about a specific corpus, we recommend that practitioners test the statistical confidence of similarities based on word embeddings by training on multiple bootstrap samples.

10 Acknowledgements

This work was supported by NSF #1526155, #1652536, and the Alfred P. Sloan Foundation. We would like to thank Alexandra Schofield, Laure Thompson, our Action Editor Ivan Titov, and our anonymous reviewers for their helpful comments.

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NIPS, pages 4349–4357.

Léon Bottou. 2012. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer.

Andreas Broscheid. 2011. Comparing circuits: Are some U.S. Courts of Appeals more liberal or conservative than others? Law & Society Review, 45(1), March.

Dallas Card, Amber E. Boydstun, Justin H. Gross, Philip Resnik, and Noah A. Smith. 2015. The media frames corpus: Annotations of frames across issues. In ACL.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.

Colin Cherry and Hongyu Guo. 2015. The unreasonable effectiveness of word representations for Twitter named entity recognition. In HLT-NAACL, pages 735–745.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. HLT-ACL, pages 1606–1615.

Anna Gladkova and Aleksandr Drozd. 2016. Intrinsic evaluations of word embeddings: What can we do better? In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 36–42.

Yoav Goldberg. 2017. Neural Network Methods for Natural Language Processing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. 2009. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. Technical Report No. 2009-05, Applied & Computational Mathematics, California Institute of Technology.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In ACL.

Johannes Hellrich and Udo Hahn. 2016. Bad company–neighborhoods in neural embedding spaces considered harmful. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2785–2796.

Ryan Heuser. 2016. Word vectors in the eighteenth century. In IPAM Workshop: Cultural Analytics.

Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal analysis of language through neural language models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science.
Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. 2016. Freshman or fresher? Quantifying the geographic variation of language in online social media. In ICWSM, pages 615–618.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS, pages 2177–2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the ACL, 3:211–225.

Rasmus E. Madsen, David Kauchak, and Charles Elkan. 2005. Modeling word burstiness using the Dirichlet distribution. In Proceedings of the 22nd International Conference on Machine Learning, pages 545–552. ACM.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. HLT-NAACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Lawrence Phillips, Kyle Shaffer, Dustin Arendt, Nathan Hodas, and Svitlana Volkova. 2017. Intrinsic and extrinsic evaluation of spatiotemporal text representations in Twitter streams. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 201–210.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA.

Evan Sandhaus. 2008. The New York Times Annotated Corpus. LDC2008T19. Linguistic Data Consortium.

Noam Shazeer, Ryan Doherty, Colin Evans, and Chris Waterson. 2016. Swivel: Improving embeddings by noticing what's missing. arXiv:1602.02215.

Yingtao Tian, Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. 2016. On the convergent properties of word embedding methods. arXiv preprint arXiv:1605.03956.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the ACL, pages 384–394. Association for Computational Linguistics.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Ivan Vulic and Marie-Francine Moens. 2015. Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proceedings of the ACL, pages 719–725. ACL.

Justine Zhang, William L. Hamilton, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky, and Jure Leskovec. 2017. Community identity and user engagement in a multi-community landscape. In Proceedings of ICWSM.