Transactions of the Association for Computational Linguistics, vol. 4, pp. 47–60, 2016. Action Editor: David Chiang.
Submission batch: 11/2015; Published 2/2016.
© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Detecting Cross-Cultural Differences Using a Multilingual Topic Model

E.D. Gutiérrez¹, Ekaterina Shutova², Patricia Lichtenstein³, Gerard de Melo⁴, Luca Gilardi⁵
¹University of California, San Diego; ²Computer Laboratory, University of Cambridge; ³University of California, Merced; ⁴IIIS, Tsinghua University; ⁵ICSI, Berkeley
edg@icsi.berkeley.edu, es407@cam.ac.uk, tricia1@uchicago.edu, gdm@demelo.org, lucag@icsi.berkeley.edu

Abstract

Understanding cross-cultural differences has important implications for world affairs and many aspects of the life of society. Yet, the majority of text-mining methods to date focus on the analysis of monolingual texts. In contrast, we present a statistical model that simultaneously learns a set of common topics from multilingual, non-parallel data and automatically discovers the differences in perspectives on these topics across linguistic communities. We perform a behavioural evaluation of a subset of the differences identified by our model in English and Spanish to investigate their psychological validity.

1 Introduction

Recent years have seen a growing interest in text-mining applications aimed at uncovering public opinions and social trends (Fader et al., 2007; Monroe et al., 2008; Gerrish and Blei, 2011; Pennacchiotti and Popescu, 2011). They rest on the assumption that the language we use is indicative of our underlying worldviews. Research in cognitive and sociolinguistics suggests that linguistic variation across communities systematically reflects differences in their cultural and moral models and goes beyond lexicon and grammar (Kövecses, 2004; Lakoff and Wehling, 2012). Cross-cultural differences manifest themselves in text in a multitude of ways, most prominently through the use of explicit opinion vocabulary with respect to a certain topic (e.g. "policies that benefit the poor"), idiomatic and metaphorical language (e.g. "the company is spinning its wheels") and other types of figurative language, such as irony or sarcasm.

The connection between language, culture and reasoning remains one of the central research questions in psychology. Thibodeau and Boroditsky (2011) investigated how metaphors affect our decision-making. They presented two groups of human subjects with two different texts about crime. In the first text, crime was metaphorically portrayed as a virus and in the second as a beast. The two groups were then asked a set of questions on how to tackle crime in the city. As a result, while the first group tended to opt for preventive measures (e.g. stronger social policies), the second group converged on punishment- or restraint-oriented measures. According to Thibodeau and Boroditsky, their results demonstrate that metaphors have a profound influence on how we conceptualize and act with respect to societal issues. This suggests that in order to gain a full understanding of social trends across populations, one needs to identify subtle but systematic linguistic differences that stem from the groups' cultural backgrounds, expressed both literally and figuratively. Performing such an analysis by hand is labor-intensive and often impractical, particularly in a multilingual setting where expertise in all of the languages of interest may be rare.

With the rise of blogging and social media, NLP techniques have been successfully used for a number of tasks in political science, including automatically estimating the influence of particular politicians in the US Senate (Fader et al., 2007), identifying lexical features that differentiate political rhetoric of opposing parties (Monroe et al., 2008), predicting voting patterns of politicians based on their use of language (Gerrish and Blei, 2011), and predicting political affiliation of Twitter users (Pennacchiotti and Popescu, 2011). Fang et al. (2012) addressed
the problem of automatically detecting and visualising the contrasting perspectives on a set of topics attested in multiple distinct corpora. While successful in their tasks, all of these approaches focused on monolingual data and did not reach beyond literal language. In contrast, we present a method that detects fine-grained cross-cultural differences from multilingual data, where such differences abound, expressed both literally and figuratively. Our method brings together opinion mining and cross-lingual topic modelling techniques for this purpose. Previous approaches to cross-lingual topic modelling (Boyd-Graber and Blei, 2009; Jagarlamudi and Daumé III, 2010) addressed the problem of mining common topics from multilingual corpora. We present a model that learns such common topics, while simultaneously identifying lexical features that are indicative of the underlying differences in perspectives on these topics by speakers of English, Spanish and Russian. These differences are mined from multilingual, non-parallel datasets of Twitter and news data. In contrast to previous work, our model does not merely output a list of monolingual lexical features for manual comparison, but also automatically infers multilingual contrasts.

Our system (1) uses word-document co-occurrence data as input, where the words are labeled as topic words or perspective words; (2) finds the highest-likelihood dictionary between topic words in the two languages given the co-occurrence data; (3) finds cross-lingual topics specified by distributions over topic-words and perspective-words; and (4) automatically detects differences in perspective-word distributions in the two languages. We perform a behavioural evaluation of a subset of the differences identified by the model and demonstrate their psychological validity. Our data and dictionaries are available from the first author upon request.

2 Related work

View detection. Identifying different viewpoints is related to the well-studied area of subjectivity detection, which aims at exposing opinion, evaluation, and speculation in text (Wiebe et al., 2004) and attributing it to specific people (Awadallah et al., 2011; Abu-Jbara et al., 2012). In our work, we are less interested in explicit local forms of subjectivity, instead aiming at detecting more general contrasts across linguistic communities. Another line of research has focused on inferring author attributes such as gender, age (Garera and Yarowsky, 2009), location (Jones et al., 2007), or political affiliation (Pennacchiotti and Popescu, 2011). Such studies make use of syntactic style, discourse characteristics, as well as lexical choice. The models used for this are typically binary classifiers trained in a fully supervised fashion. In contrast, in our task, we automatically infer the topic distributions and find topic-specific contrasts.

Probabilistic topic models. Probabilistic topic models have proven useful for a variety of semantic tasks, such as selectional-preference induction (Ó Séaghdha, 2010; Ritter et al., 2010), sentiment analysis (Boyd-Graber and Resnik, 2010) and studying the evolution of concepts and ideas (Hall et al., 2008). The goal of a topic model is to characterize observed data in terms of a much smaller set of unobserved, semantically coherent topics. A particularly popular probabilistic topic model is Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Under its assumptions, each document has a unique mix of topics, and each topic is a distribution over terms in the vocabulary. A topic is chosen for every word token according to the topic mix of the document to which it belongs, and then the word's identity is drawn from the corresponding topic's distribution.

Handling multilingual corpora. LDA is designed for monolingual text and thus it lacks the structure necessary to model cross-lingually valid topics. While topic models can be trained individually on two languages and then the acquired topics can be matched, the correspondences between the topics for the two languages will be highly unstable. To address this, Boyd-Graber and Blei (2009) (MUTO) and Jagarlamudi and Daumé III (2010) (JOINTLDA) introduced the notion of cross-lingually valid concepts associated with different terms in different languages, using bilingual dictionaries to model topics across languages. Based on a model by Haghighi et al. (2008), MUTO is capable of learning translations, i.e., matchings between terms in the different languages being compared. The Polylingual Topic Model of Mimno et al. (2009) is another approach to finding topics in multilingual corpora, but it requires tuples composed of comparable documents in each language of the corpus.

Topic models for view detection. LDA also assumes that the distribution of each topic is fixed across all documents in a corpus. Therefore, a topic associated with, e.g., war will have the same distribution over the lexicon regardless of whether the document was taken from a pro-war editorial or an anti-war speech. However, in reality we may expect a single topic to exhibit systematic and predictable variations in its distribution based on authorship. The cross-collection LDA model by Paul and Girju (2009) addresses this by specifically aiming to expose viewpoint differences across different document collections. Ahmed and Xing (2010) proposed a similar model for detecting ideological differences. Fang et al. (2012)'s Cross-Perspective Topic (CPT) model breaks up the terms in the vocabulary into topic terms and perspective terms with different generative processes, and differentiates between different collections of documents within the corpus. The topic terms are assumed to be generated as in LDA. However, the distribution of perspective terms in a document is taken to depend on both the topic mixture of the document and the collection from which the document is drawn.

Recent works proposed models for specific types of data. Qiu and Jiang (2013) use user identities and interactions in threaded discussions, while Gottipati et al. (2013) developed a topic model for Debatepedia, a semi-structured resource in which arguments are explicitly enumerated. However, all of these models perform their analyses on monolingual datasets. Thus, they are useful for comparing different ideologies expressed in the same language, but not for cross-linguistic comparisons.

3 Method

Figure 1: Basic generative model.

The goal of our model is to analyse large, non-parallel, multilingual corpora and present cross-lingually valid topics and the associated perspectives, automatically inferring the differences in conceptualization of these topics across cultures. Following Boyd-Graber and Blei (2009) and Jagarlamudi and Daumé III (2010), our distributions of latent topics range over latent, cross-lingual topic concepts that manifest themselves as language-specific topic words. We use bilingual dictionaries, containing words in one language and their
translations in another language, to represent the topic concepts. These are represented as a bipartite graph, with each translation entry being an edge and each topic word in the two languages being a vertex. While the topic words are tied together by the translation dictionary, the perspective words can vary freely across languages. Following Fang et al. (2012), we treat nouns as topic words and verbs and adjectives as perspective words.¹ The model assumes that adjective and verb tokens in each document are assigned to topics in proportion to the topic assignments of the topic word tokens. Then, the perspective term for this topic is drawn depending on the topic assignment and the language of the speaker.

¹ This approximation was adopted for convenience, computational efficiency and ease of interpretation. However, in principle our method does not depend on it, since it can be applied with all content words as topic or perspective words.

3.1 Basic Generative Model

Given the languages $\ell \in \{a, b\}$, our model infers the distributions of multilingual topics and language-specific perspective-words (Fig. 1), as follows:

1. Draw a set $C$ of concepts $(u, v)$ matching topic word $u$ from language $a$ to topic word $v$ from language $b$, where the probability of concept $(u, v)$ is proportional to a prior $\pi_{u,v}$ (e.g. based on information from a translation dictionary).

2. Draw multinomial distributions:
• For topic indices $k \in \{1, \dots, K\}$, draw language-independent topic-concept distributions $\phi^w_k \sim \mathrm{Dir}(\beta^w)$ over pairs $(w^a, w^b) \in C$.
• For topic indices $k \in \{1, \dots, K\}$ and languages $\ell \in \{a, b\}$, draw language-specific perspective-term distributions $\phi^{\ell,o}_k \sim \mathrm{Dir}(\beta^o)$ over perspective-terms in language $\ell$.

3. For each document $d \in \{1, \dots, D\}$ with language $\ell_d$:
• Draw topic weights $\theta_d \sim \mathrm{Dir}(\alpha)$
• For each topic-word index $i \in \{1, \dots, N^w_d\}$ of document $d$:
  – Draw topic $z_i \sim \theta_d$
  – Draw topic concept $c_i = (w^a, w^b) \sim \phi^w_{z_i}$, and select $w^{\ell_d}$ as the member of that pair corresponding to language $\ell_d$.
• For each perspective-word index $j \in \{1, \dots, N^o_d\}$ of document $d$:
  – Draw topic $x_j \sim \mathrm{Uniform}(z^w_1, \dots, z^w_{N^w_d})$
  – Draw perspective-word $o_j \sim \phi^{\ell_d,o}_{x_j}$

3.2 Model Variants

We have experimented with several variants of our model, in order to account for the translation of polysemous words, adapt the translation model to the corpus used, and handle words for which no translation is found.

a) SINGLE variants of the model match each topic term in a language with at most one topic term in the other language. MULTIPLE variants allow each term to match multiple words in the other language.
b) INFER variants allow higher-likelihood matchings to be inferred from the data. STATIC variants treat the matchings as fixed, which is equivalent to assigning a probability of 0 or 1 to every edge in our bipartite graph $C$.
c) RELEGATE variants relegate all unmatched words in each language to a single separate background topic distinct from the topics that are learned for the matched topic words. This is akin to forcing the probability for currently unmatched words to 0 in all topics except for one, and forcing the probability of all currently matched words to 0 in this topic. INCLUDE variants do not restrict the assignment of unmatched words; they are assigned to the same set of topics as the matched words.

We test the following six variants: SINGLESTATICRELEGATE, SINGLESTATICINCLUDE, SINGLEINFERRELEGATE, SINGLEINFERINCLUDE, MULTIPLESTATICRELEGATE, and MULTIPLESTATICINCLUDE. We do not test MULTIPLEINFER variants because of the complexity of inferring a multiple matching in a bipartite graph.

3.3 Learning & Inference

For all variants, a collapsed Gibbs sampler can be used to infer topics $\phi^{\ell,o}$ and $\phi^w$, per-document topic distributions $\theta$, as well as topic assignments $z$ and $x$. This corresponds to the S-step below. For INFER variants, we follow Boyd-Graber and Blei in using an M-step involving a bipartite graph matching algorithm to infer the matching $m$ that maximizes the posterior likelihood of the matching.

S-Step: Sample topics for words in the corpus using a collapsed Gibbs sampler. For topic-word $w_i = u$ belonging to document $d$, if the word occurs in concept $c_i = (u, v)$, then sample the topic and entry according to:

$$P(z_i = k, c_i = (u, v) \mid w_i = u, z_{-i}, C) \propto \frac{N_{dk} + \alpha_k}{\sum_j (N_{dj} + \alpha_j)} \times \frac{N_{k(u,v)} + \beta^w_k}{\sum_{v'} \left( N_{k(u,v')} + \beta^w_k \right)}$$

where the sum in the denominator of the first term is over all topics, and in the second term is over all words matched to $u$. $N_{dk}$ is the count of topic-words of topic $k$ in document $d$, and $N_{k(u,v)}$ is the count of topic-words either of type $u$ or of type $v$ assigned to topic $k$ in all the corpora.²

² In RELEGATE variants, for unmatched $u$, $z_i$ is sampled as $P(z_i = k \mid w_i = u, z_{-i}, C) \propto \frac{N_{dk} + \alpha_k}{\sum_j (N_{dj} + \alpha_j)}$, which can be seen as $\beta^w_{u\cdot} \to \infty$ for unmatched terms.

For perspective-word $o_i = n$, sample the topic according to:

$$P(z_i = k \mid o_i = n, z_{-i}, C) \propto \frac{N_{dk}}{\sum_j N_{dj}} \times \frac{N^{\ell_d}_{kn} + \beta^o_k}{\sum_m \left( N^{\ell_d}_{km} + \beta^o_k \right)}$$
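The two S-step conditionals can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the dictionary-based count tables, the use of symmetric scalar hyperparameters `alpha`, `beta_w` and `beta_o`, and the toy counts in the demo are all assumptions made here. The counts passed in are assumed to already exclude the token being resampled.

```python
import random


def topic_word_weights(u, doc_topic_counts, concept_counts, matches, alpha, beta_w):
    """Unnormalised P(z_i = k, c_i = (u, v) | ...) for every topic k and match v of u.

    doc_topic_counts[k]       -- N_dk: topic-word tokens of topic k in the document
    concept_counts[k][(u, v)] -- N_k(u,v): tokens of concept (u, v) assigned to topic k
    matches[u]                -- words v matched to u in the current matching C
    """
    doc_norm = sum(n_dj + alpha for n_dj in doc_topic_counts)
    weights = {}
    for k, n_dk in enumerate(doc_topic_counts):
        doc_term = (n_dk + alpha) / doc_norm
        # The second denominator sums over all words matched to u.
        concept_norm = sum(concept_counts[k].get((u, v2), 0) + beta_w
                           for v2 in matches[u])
        for v in matches[u]:
            concept_term = (concept_counts[k].get((u, v), 0) + beta_w) / concept_norm
            weights[(k, v)] = doc_term * concept_term
    return weights


def perspective_word_weights(n, doc_topic_counts, persp_counts, vocab_size, beta_o):
    """Unnormalised P(z_i = k | o_i = n, ...) for every topic k.

    persp_counts[k][m] -- count of perspective-word m assigned to topic k
                          in the document's language
    vocab_size         -- size of that language's perspective-word vocabulary
    """
    doc_total = sum(doc_topic_counts)
    weights = {}
    for k, n_dk in enumerate(doc_topic_counts):
        # Sum over the whole vocabulary: observed counts plus one beta_o per word type.
        word_norm = sum(persp_counts[k].values()) + vocab_size * beta_o
        weights[k] = (n_dk / doc_total) * (persp_counts[k].get(n, 0) + beta_o) / word_norm
    return weights


def draw(weights, rng=random):
    """Sample one outcome with probability proportional to its weight."""
    outcomes = list(weights)
    return rng.choices(outcomes, weights=[weights[o] for o in outcomes], k=1)[0]


# Toy demo with made-up counts: two topics, one English word with two Spanish matches.
demo_matches = {"passage": ["pasaje", "ruta"]}
demo_concepts = [{("passage", "pasaje"): 1, ("passage", "ruta"): 3},
                 {("passage", "pasaje"): 2}]
demo_weights = topic_word_weights("passage", [3, 1], demo_concepts,
                                  demo_matches, alpha=0.5, beta_w=0.02)
print(draw(demo_weights, random.Random(0)))
```

Storing counts in dictionaries keeps the sketch short; a practical sampler would more likely use dense arrays and update the counts incrementally as each token is reassigned.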
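In the perspective-word update, the sum in the denominator of the second term is over the perspective-word vocabulary of language $\ell_d$; $N_{dk}$ is the count of topic words in document $d$ with topic $k$; and $N^{\ell_d}_{km}$ is the count of perspective-word $m$ being assigned topic $k$ in language $\ell_d$. Note that in all the counts above, the current word token $i$ is omitted from the count.

Given our sampling assignments, we can then estimate $\theta_d$, $\phi^{\ell,o}$, and $\phi^w$ as follows:

$$\hat{\theta}_{kd} = \frac{N_{dk} + \alpha_k}{\sum_j (N_{dj} + \alpha_j)}, \qquad \hat{\phi}^w_{k(u,v)} = \frac{N_{k(u,v)} + \beta^w_{(u,v)}}{\sum_{v'} \left( N_{k(u,v')} + \beta^w_{(u,v')} \right)}, \qquad \hat{\phi}^{\ell,o}_{nk} = \frac{N^{\ell}_{kn} + \beta^o_n}{\sum_m \left( N^{\ell}_{km} + \beta^o_n \right)}.$$

M-Step (for INFER variants only): Run the Jonker-Volgenant (Jonker and Volgenant, 1987) bipartite matching algorithm to find the optimal matching $C$ given some weights. For topic-term $u$ from language $a$ and topic-term $v$ from language $b$, our weights correspond to the log of the posterior odds that the occurrences of $u$ and $v$ come from a matched topic distribution, as opposed to coming from unmatched distributions:

$$\mu_{u,v} = \sum_{k \setminus \{a^*, b^*\}} \left( N_{k(u,v)} \log \hat{\phi}^w_{k(u,v)} \right) - N_u \log \hat{\phi}^w_{k(u,\cdot)} - N_v \log \hat{\phi}^w_{k(\cdot,v)} + \pi_{u,v},$$

where $N_u$ is the count of topic-term $u$ in the corpus. This expression can also be interpreted as a kind of pointwise mutual information (Haghighi et al., 2008). The Jonker-Volgenant algorithm has time complexity of at most $O(V^3)$, where $V$ is the size of the lexicon (Jonker and Volgenant, 1987).

3.4 Inference of Perspective-Word Contrasts

Having learned our model and inferred how likely perspective-terms are for a topic in a given language, we seek to know whether these perspectives differ significantly in the two languages. More precisely, can we infer whether word $m$ in language $a$ and the equivalent word $n$ in language $b$ have significantly different distributions under a topic $k$? To do this, we make the assumption that the perspective-words in languages $a$ and $b$ are in one-to-one correspondence to each other. Recall that, for a given topic $k$ and language $\ell$, $N^{\ell}_{km}$ is the count for term $m$ and $\phi^{\ell,o}_{km}$ is the probability for word $m$ in language $\ell$. Just as we collect the probabilities into word-topic distribution vectors $\phi^{\ell,o}_k$, we collect the counts into word-topic count vectors $[N^{\ell}_{k1}, N^{\ell}_{k2}, \dots]$. Then, since our model assumes a prior over the parameter vectors $\phi^{\ell,o}_k$, we can infer the likelihood that the observed word-topic counts $N^a_{km}$ and $N^b_{kn}$ were drawn from a single word-topic-distribution prior denoted by $\breve{\phi} := \phi^{a,o}_{km} = \phi^{b,o}_{kn}$. Below, all our probabilities are conditioned implicitly on this event as well as on $N^a_k$ and $N^b_k$ being fixed.

Denote the total count of word tokens in topic $k$ from language $\ell$ by $N^{\ell}_k = \sum_m N^{\ell}_{km}$. Now, we derive the probability that we observe a ratio greater than $\delta$ between the proportion of words in topic $k$ that belong to word type $m$ in language $a$ and to the corresponding word type $n$ in language $b$:

$$P\left( \frac{N^a_{km}}{N^a_k} \frac{N^b_k}{N^b_{kn}} \geq \delta \right) + P\left( \frac{N^b_{kn}}{N^b_k} \frac{N^a_k}{N^a_{km}} \geq \delta \right) \qquad (1)$$

By symmetry, it suffices to derive an expression for the first term. We note that the inequality in the probability is equivalent to a sum over a range of values of $N^a_{km}$ and $N^b_{kn}$. By rearranging terms, applying the law of conditional probability to condition on the term $\breve{\phi}$, and exploiting the conditional independence of $N^a_{km}$ and $N^b_{kn}$ given $\breve{\phi}$, $N^a_k$, and $N^b_k$, we can rewrite this first term as

$$\sum_{x=0}^{N^b_k} \; \sum_{y = x \delta N_{a/b}}^{N^a_k} \int P(N^b_{kn} = x \mid \breve{\phi}) \, P(N^a_{km} = y \mid \breve{\phi}) \, P(\breve{\phi}) \, d\breve{\phi},$$

where $N_{a/b} = N^a_k / N^b_k$. Recall that $\phi^{\ell,o}_k \sim \mathrm{Dir}(\beta^o)$ under our model. Assume a symmetric Dirichlet distribution for simplicity. It can then be shown that the marginal distribution of $\breve{\phi}$ is $\breve{\phi} \sim \mathrm{Beta}(\beta^o, (V-1)\beta^o)$, where $V$ is the total size of the perspective-word vocabulary. Similarly, it can be shown that the marginal distribution of $N^{\ell}_{km}$ given $\phi^{\ell,o}_k$ is $N^{\ell}_{km} \sim \mathrm{Binom}(N^{\ell}_k, \phi^{\ell,o}_{km})$ for $\ell \in \{a, b\}$. Therefore, the integrand above is proportional to the beta-binomial distribution with number of trials $N^a_k + N^b_k$, successes $x + y$, and parameters $\beta^o$ and $(V-1)\beta^o$, but with partition function $\binom{N^a_k}{y} \binom{N^b_k}{x}$. Denote the PMF of this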
distribution by $f(N^a_k + N^b_k, x + y, \beta^o)$. Then expression (1) above becomes:

$$\sum_{x=0}^{N^b_k} \; \sum_{y = x \delta N_{a/b}}^{N^a_k} f(N^a_k + N^b_k, x + y, \beta^o) \; + \; \sum_{x=0}^{N^a_k} \; \sum_{y = x \delta N_{b/a}}^{N^b_k} f(N^a_k + N^b_k, x + y, \beta^o). \qquad (2)$$

We cannot observe $N^a_{km}$, $N^b_{kn}$, $N^a_k$ and $N^b_k$ explicitly, but we can estimate them by obtaining posterior samples from our Gibbs sampler. We substitute these estimates into expression (2).

4 Experiments

4.1 Data

Twitter Data. We gathered Twitter data in English, Spanish and Russian during the first two weeks of December 2013 using the Twitter API. Following previous work (Puniyani et al., 2010), we treated each Twitter user account as a document. We then tagged each document for part-of-speech, and divided the word tokens in it into topic-words and perspective-words. We constructed a lexicon of 2,000 topic terms and 1,500 perspective-terms for each language by filtering out any terms that occurred in more than 10% of the documents in that language, and then selecting the remaining terms with the highest frequency. Finally, we kept only documents that contained 4 or more topic words from our lexicon. This left us with 847,560 documents in English (4,742,868 topic-word and 1,907,685 perspective-word tokens); 756,036 documents in Spanish (4,409,888 topic-word and 1,668,803 perspective-word tokens); and 260,981 documents in Russian (1,621,571 topic-word and 981,561 perspective-word tokens).

News Data. We gathered all the articles published online during the year 2013 by the state-run media agencies of the United States (Voice of America or "VOA" – English), Russia (RIA Novosti or "RIA" – Russian), and Venezuela (Agencia Venezolana de Noticias or "AVN" – Spanish). These three news agencies were chosen because they not only provide media in three distinct languages, but are also guided by the political world-views of three distinct governments. We treated each news article as a document, and removed duplicates. Once again, we constructed a lexicon of 2,000 topic terms and 1,500 perspective-terms using the same criteria as for Twitter, and kept only documents that contained 4 or more topic words from our lexicon. This left us with 23,159 articles (10,410,949 tokens) from VOA, 41,116 articles (11,726,637 tokens) from RIA, and 8,541 articles (2,606,796 tokens) from AVN.

Dictionaries. To create the translation dictionaries, we extracted translations from the English, Spanish, and Russian editions of Wiktionary, both from the translation sections and the gloss sections if the latter contained single words as glosses. Multi-word expressions were universally removed. We added inverse translations for every original translation. From the resulting collection of translations, we then created separate translation dictionaries for each language and part-of-speech tag combination.

In order to give preference to more important translations, we assigned each translation an initial weight of $1 + \frac{1}{r}$, where $r$ was the rank of the translation within the page. Since a translation (or its inverse) can occur on multiple pages, we aggregated these initial weights and then assigned final weights of $1 + \frac{1}{r'}$, where $r'$ was the rank after aggregation and sorting in descending order of weights.

4.2 Experimental Conditions

To evaluate the different variants of our model, we held out 30,000 documents (test set) during training. We plugged in the estimates of $\phi^w$ and $C$ acquired during training using the rest of the corpus to produce a likelihood estimate for these held-out documents. All models were initialized with the prior matching determined by the dictionary data. For each number of topics $K$, we set $\alpha$ to $50/K$ and the $\beta$ variables to 0.02, as in Fang et al. (2012). For the MULTIPLE variants, we set $\pi_{i,j} = 1$ if $i$ and $j$ share an entry and 0 otherwise. For INFER variants, only three M-steps were performed to avoid overfitting, at 250, 500, and 750 iterations of Gibbs sampling, following the procedure in Boyd-Graber and Blei (2009).

4.3 Comparison of model variants

In order to compare the variants of our model, we computed the perplexity and coherence for
each variant on TWITTER and NEWS, for English–Spanish and English–Russian language pairs.

Perplexity is a measure of how well a model trained on a training set predicts the co-occurrence of words on an unseen test set $H$. Lower perplexity indicates better model fit. We evaluate the held-out perplexity for topic words $w_i$ and perspective-words $o_i$ separately. For topic words, the perplexity is defined as $\exp\left( -\sum_{w_i \in H} \log p(w_i) / N^w \right)$. As for standard LDA, exact inference of $p(w_i)$ is intractable under this model. Therefore we adapted the estimator developed by Murray and Salakhutdinov (2009) to our models.

Coherence is a measure inspired by pointwise mutual information (Newman et al., 2010). Let $D(v)$ be the number of documents with at least one token of type $v$ and let $D(v, w)$ be the number of documents containing at least one token of type $v$ and at least one token of type $w$. Then Mimno et al. (2011) define the coherence of topic $k$ as

$$\frac{1}{\binom{M}{2}} \sum_{m=2}^{M} \sum_{\ell=1}^{m-1} \log \frac{D(v^{(k)}_m, v^{(k)}_\ell) + \epsilon}{D(v^{(k)}_\ell)},$$

where $V^{(k)} = (v^{(k)}_1, \dots, v^{(k)}_M)$ is a list of the $M$ most probable words in topic $k$ and $\epsilon$ is a small smoothing constant used to avoid taking the logarithm of zero. Mimno et al. (2011) find that coherence correlates better with human judgments than do likelihood-based measures. Coherence is a topic-specific measure, so for each model variant we trained, we computed the median topic coherence across all the topics learned by the model. We set $\epsilon = 0.1$.

Model performance and analysis. Fig. 2 shows perplexity for the variants as a function of the number of iterations of Gibbs sampling on the English-Spanish NEWS corpus. The figure confirms that 1000 iterations of Gibbs sampling on the NEWS corpus were sufficient for convergence across model variants. We omit figures for English-Russian and for the TWITTER corpus, since the patterns were nearly identical. Figure 3 shows how perplexity varies as a function of the number of topics. We used this information to choose optimal models for the different corpora. The optimal number of topics was K=175 for the English-Spanish NEWS corpus, K=200 for the English-Russian NEWS, K=325 for the English-Spanish TWITTER, and K=300 for the English-Russian TWITTER. Although the optimal number of topics varied across corpora, the relative performance of the different models was the same. In all of our corpora, the MULTIPLE variants provided better fits than their corresponding SINGLE variants. There are several explanations for this. For one, the MULTIPLE variants are able to exploit the information from multiple translations, unlike the SINGLE variants, which discarded all but one translation per word. For another, the matchings produced by the SINGLEINFER variants can be purely coincidental and the result of overfitting (see some examples below). INCLUDE variants performed markedly better than RELEGATE variants. INFER variants improved model fit compared to STATIC variants, but required more topics to produce optimal fit.

Recall that we performed an M-step in the INFER variants 3 times, at 250, 500, and 750 iterations. As noted in §3.3, the M-step in the INFER variants maximizes the posterior likelihood of the matching. However, Fig. 2 shows that this maximization causes held-out perplexity to increase substantially just after the first matching M-step, around 250 iterations, before decreasing again after about 50 more iterations of Gibbs sampling. We believe that this happens because the M-step is maximizing over expectations that are approximate, since they are estimated using Gibbs sampling. If the sampler has not yet converged, then the M-step's maximization will be unstable. We found support for this explanation when we re-ran the INFER variants using 1000 iterations between M-steps, giving the Markov chain enough time to converge. After this change, perplexity went down immediately after the M-step and kept decreasing monotonically, rather than increasing after the M-step before decreasing. However, this did not result in a significantly lower final perplexity or coherence and thus did not change the relative performance of the models. In addition, Fig. 2 suggests that the second and third M-steps (at 500 and 750 iterations, respectively) had little effect on perplexity. In light of the high computational expense of each inference step, this suggests that in practice a single inference step may be sufficient.

Fig. 4 shows that the MULTIPLESTATICINCLUDE variant was also the superior model as measured by
median topic coherence. Once again, this general pattern held true for the English-Russian pair and TWITTER corpora. Overall, the results show that MULTIPLESTATICINCLUDE provides superior performance across measures, corpora, topic numbers, and languages. We therefore used this variant in further data analysis and evaluation. Incidentally, the observed decrease in topic coherence as K increases is expected, because as K increases, lower-likelihood topics tend to be more incoherent (Mimno et al., 2011). Experiments by Stevens et al. (2012) show that this effect is observed for LDA-, NMF-, and SVD-based topic models.

Figure 2: Perplexity of different model variants for different numbers of iterations at K=175.

Figure 3: Perplexity of different model variants.

Figure 4: Coherence of different model variants.

Cross-linguistic matchings. The matchings inferred by the SINGLEINFERINCLUDE variant were of mixed quality. Some of the matchings corrected low-quality translations in the original dictionary. For instance, our prior dictionary matched passage in English to pasaje in Spanish. Though technically correct, the dominant meaning of pasaje is [travel] ticket. The TWITTER model correctly matched passage to ruta instead. Many of the matchings learned by the model did not provide technically correct translations, yet were still revelatory and interesting. For instance, the dictionary translated the Spanish word pito as cigarette in English. However, in informal usage this word refers specifically to cannabis cigarettes, not tobacco cigarettes. The TWITTER model matches pito to the English slang word weed instead. The Spanish word Siria (Syria) was unmatched in the prior dictionary; the NEWS model matched it to the word chemical, which makes sense in the context of extensive reporting of the usage of chemical weapons in the ongoing Syrian conflict.

4.4 Data analysis and discussion

We have conducted a qualitative analysis of the topics, perspectives and contrasts produced by our models for English–Spanish and English–Russian, TWITTER and NEWS datasets. While the topics were coherent and consistent across languages, sets of perspective words manifested systematic differences revealing interesting cross-cultural contrasts. Fig. 5 and 7 show the top perspective words discovered by the model for the topic of finance and economy in English and Spanish NEWS and TWITTER corpora, respectively. While some of the perspective words are neutral, mostly literal and occur in both English and Spanish (e.g. balance or authorize), many others represent metaphorical vocabulary (e.g. saddle, gut, evaporate in English, or incendiar, sangrar, abatir in Spanish) pointing at distinct models of conceptualization of the topic. When we applied the contrast detection method (described in §3.4) to these perspective words, it highlighted the differences in metaphorical perspectives, rather than the literal ones, as shown in Fig. 6 and 8. English speakers tend to discuss economic and financial processes using motion terms, such as "slow, drive, boost or sluggish", or a related metaphor of horse-riding, e.g. "rein in debt", "saddle with debt", or even "breed money". In contrast, Spanish speakers tend to talk about the economy in terms of size rather than motion, using verbs such as ampliar or disminuir, and other metaphors, such as sangrar (to bleed) and incendiar (to light up). These examples demonstrate coherent conceptualization patterns that differ in the two languages. Interestingly, this difference manifested itself in both NEWS and TWITTER corpora and echoes the findings of a previous corpus-linguistic study of Charteris-Black and Ennis (2001), who manually analysed metaphors used in English and Spanish financial discourse and reported that motion and navigation metaphors that abound in English were rarely observed in Spanish.

For the majority of the topics we analysed, the model revealed interesting cross-cultural differences. For instance, the Spanish corpora exhibited metaphors of battle when talking about poverty (with poverty seen as an enemy), while in the English corpus poverty was discussed more neutrally as a social problem that needs a practical solution. English-Russian NEWS experiments revealed a surprising difference with respect to the topic of protests. They suggested that while US media tend to use stronger metaphorical vocabulary, such as clash, erupt or fire, in Russian protests are discussed more neutrally. Generally, the NEWS corpora contained more abstract topics and richer information about conceptual structure and sentiment in all languages. Many of the topics discovered in TWITTER related to everyday concepts, such as pets or concerts, with fewer topics covering societal issues. Yet, a few TWITTER-specific contrasts could be observed: e.g., the sports topic tends to be discussed using war and battle vocabulary in Russian to a greater extent than in English.

Figure 5: Top perspectives in system output for the topic of finance in the NEWS corpus (metaphors in red italics).
Topic EN: budget debt deficit reduction spend balance cut increase limit downtown tax stress addition planet
Topic ES: presupuesto deficit deuda reduccion equilibrio disminucion gasto aumentacion tasa sacerdote
Perspective EN: balance default triple rein accumulate accrue trim incur saddle slash prioritize avert gut burden evaporate borrow pile cap cut tackle
Perspective ES: renegociar mejora etiquetado desplomar recortar endeudar incendiar destinar asignar autorizar aprobado ascender sangrar augurar abatir

Figure 6: Contrasts identified by the model in NEWS.
Contrasts EN: rein [in debt], saddle [with debt], cap [debt], breed [money], gut [budget], [debt] hit, tackle [debt], boost, slow, drive, sluggish [economy], spur
Contrasts ES: sangrar [dinero], ampliar, disminuir [la economía], superar [la tasa], emitir [deuda]

Our models tend to identify two general kinds of differences: (1) cross-corpus differences representing world views of particular populations whom the corpora characterize (such differences exist both across and within languages, e.g. the metaphors used in the progressive New York Times would be different from the ones in the more conservative Wall Street Journal); and (2) deeply entrenched cross-linguistic differences, such as the motion versus expansion metaphors for the economy in English and Spanish. Such systematic cross-linguistic contrasts can be associated with contrastive behavioural patterns across the different linguistic communities (Casasanto and Boroditsky, 2008; Fuhrman et al., 2011). In both NEWS and TWITTER data, our model effectively identifies and summarises such contrasts, simplifying the manual analysis of the data.
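For readers who want to experiment with the contrast detection step of §3.4, the double sum in expression (2) can be evaluated directly for small counts. The sketch below is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the Dirichlet prior is taken to be symmetric as in §3.4, the threshold `delta` is simply passed in, and the integral is replaced by its closed form, the product of the two binomial coefficients with a ratio of Beta functions.

```python
from math import ceil, exp, lgamma


def log_beta(p, q):
    """Logarithm of the Beta function B(p, q)."""
    return lgamma(p) + lgamma(q) - lgamma(p + q)


def log_choose(n, r):
    """Logarithm of the binomial coefficient C(n, r)."""
    return lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)


def joint_pmf(x, y, n_x, n_y, beta_o, vocab_size):
    """P(X = x, Y = y) for X ~ Binom(n_x, phi) and Y ~ Binom(n_y, phi),
    with the shared phi ~ Beta(beta_o, (V - 1) * beta_o) integrated out."""
    a, b = beta_o, (vocab_size - 1) * beta_o
    log_p = (log_choose(n_x, x) + log_choose(n_y, y)
             + log_beta(x + y + a, n_x + n_y - x - y + b)
             - log_beta(a, b))
    return exp(log_p)


def ratio_tail(n_num, n_den, delta, beta_o, vocab_size):
    """One term of expression (2): sum over x in 0..n_den and
    y in ceil(x * delta * n_num / n_den)..n_num of the joint PMF."""
    total = 0.0
    for x in range(n_den + 1):
        # Small tolerance so floating-point noise does not push the bound up by one.
        y_min = max(0, ceil(x * delta * n_num / n_den - 1e-9))
        for y in range(y_min, n_num + 1):
            total += joint_pmf(x, y, n_den, n_num, beta_o, vocab_size)
    return total


def contrast_probability(n_a, n_b, delta, beta_o, vocab_size):
    """Expression (2): both terms, with the roles of the two languages swapped."""
    return (ratio_tail(n_a, n_b, delta, beta_o, vocab_size)
            + ratio_tail(n_b, n_a, delta, beta_o, vocab_size))


# Toy call with made-up topic totals for languages a and b.
print(contrast_probability(n_a=40, n_b=55, delta=3.0, beta_o=0.02, vocab_size=1500))
```

The brute-force loops cost time proportional to the product of the two topic totals, so this form is only practical for small counts; with totals in the thousands a cumulative or approximate evaluation would be needed.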
by highlighting linguistic trends that are indicative of the underlying conceptual differences. However, the conceptual differences are not straightforward to evaluate based on the surface vocabulary alone. In order to investigate this further, we conducted a behavioural experiment testing a subset of the contrasts discovered by our model.

Topic EN: economy growth rate percent bank economist interest reserve market policy
Topic ES: economía crecimiento tasa banco política mercado interés inflación empleo economista
Perspective EN: economic financial grow global expect remain cut boost low slow drive
Perspective ES: económico mundial agregar financiero informal pequeño significar interno bajar
Figure 7: Top perspectives in system output for the economy topic in TWITTER (metaphors in red).

Contrasts EN: slow [the economy], push [the economy], strong [economy], weak [economy], stable [economy], boost [the economy]
Contrasts ES: caer [la economía], disminuir, superar [la economía], ampliar [el crecimiento]
Figure 8: Contrasts identified by the model in TWITTER.

5 Behavioural evaluation

We assessed the relevance of the contrasts through an experimental study with native English-speaking and native Spanish-speaking human subjects. We focused on a linguistic difference in the metaphors used by English speakers versus Spanish speakers when discussing changes in a nation’s economy. While English speakers tend to use metaphors involving both locative motion verbs (e.g. slow) and expansive/contractive motion verbs (e.g. shrink), Spanish speakers preferentially employ expansive/contractive motion verbs (e.g. disminuir) to describe changes in the economy. These differences could reflect linguistic artefacts (such as collocation frequencies) or could reflect entrenched conceptual differences. Our experiment addresses the question of whether such patterns of behaviour arise cross-linguistically in response to non-linguistic stimuli. If the linguistic differences are indicative of entrenched conceptual differences, then we expect to see responses to the non-linguistic stimuli that correspond to the usage differences in the two languages.

5.1 Experimental setup

We recruited 60 participants from one English-speaking country (the US) and 60 participants from three Spanish-speaking countries (Chile, Mexico, and Spain) using the CrowdFlower crowdsourcing platform. Participants first read a brief description of the experimental task, which introduced them to a fictional country in which economists are devising a simple but effective graphic for “representing change in [the] economy”. They then completed a demographic questionnaire including information about their native language. Results from 9 US and 3 non-US participants were discarded for failure to meet the language requirement.

Participants navigated to a new page to complete the experimental task. Stimuli were presented in a 1200×700-pixel frame. The center of the frame contained a sphere with a 64-pixel diameter. For each trial, participants clicked on a button to activate an animation of the sphere which involved (1) a positive displacement (in rightward pixels) of 10% or 20%, or a negative displacement (in leftward pixels) of 10% or 20%;³ and (2) an expansion (in increased pixel diameter) of 10% or 20%, or a contraction (in decreased pixel diameter) of 10% or 20%.⁴

Participants saw each of the resulting conditions 3 times. The displacement and size conditions were drawn from a random permutation of 16 conditions using a Fisher-Yates shuffle (Fisher and Yates, 1963). Crucially, half of the stimuli contained conflicts of information with respect to the size and displacement metaphors for economic change (e.g. the sphere could both grow and move to the left). Overall we expected the Spanish speakers’ responses to be more closely associated with changes in diameter due to the presence and salience of the size metaphor, and the English speakers’ responses to be influenced by both conditions. We expected these differences to be most prominent in the conflicting trials, which force English speakers (unlike Spanish speakers) to choose between two available metaphors. We focus on these conflicting trials in our analysis and discussion of the results.

³ The use of leftward/rightward horizontal displacement to represent decreases/increases in magnitude is supported by research in numerical cognition showing that people associate smaller magnitudes with the left side of space and larger magnitudes with the right side (Dehaene, 1992; Fias et al., 1995).
⁴ A demonstration of the English experimental interface can be accessed at http://goo.gl/W3YVfC. The Spanish interface is identical, but for a direct translation of the guidelines provided by a native Spanish/fluent English speaker.

Figure 9: “Economy Improved” response rate in conflicting stimulus conditions.

5.2 Results

In trials in which stimuli moving rightward were simultaneously contracting, English speakers responded that the economy improved 66% of the time, whereas Spanish speakers judged the economy to have improved 43% of the time. In trials in which stimuli moving leftward were simultaneously expanding, English speakers judged the economy to have improved 34% of the time, and Spanish speakers responded that the economy improved 55% of the time. The results are illustrated in Figure 9.

These results indicate three effects: (1) English speakers exhibit a pronounced bias for using horizontal displacement rather than expansion/contraction during the decision-making process; (2) Spanish speakers are more biased toward expansion/contraction in formulating a decision; and (3) across the two languages the responses show contrasting patterns. The results support our expectation about the relevance of different metaphors when English and Spanish speakers reason about the economy.

To examine the significance of these effects, we fit a binary logit mixed effects model⁵ to the data. The full analysis modeled judgment with native language, displacement, and size as fully crossed fixed effects and participant as a random effect. This analysis confirmed that native language was associated with judgments about economic change. In particular, it indicated that changes in size affected English speakers’ judgments and Spanish speakers’ judgments differently (p < 0.001), with an increase in size increasing the odds (e^β = 2.5) of a judgment of IMPROVED by Spanish speakers and decreasing the odds (e^β = 0.44) of a judgment of IMPROVED by English speakers. A Type II Wald test revealed the interaction between language and size to be highly statistically significant (χ²(1), p < 0.001).

⁵ See Fox and Weisberg (2011) for a discussion of such models including application of the Type II Wald test.

In summary, the patterns we see in the behavioural data are consistent with the patterns uncovered in the output of our model. While much territory remains to be investigated to delimit the nature of this relationship, our results represent a first step toward establishing an association between information mined from large textual data collections and information observed through behavioural responses on a human scale.

6 Conclusion

We presented the first model that detects common topics from multilingual, non-parallel data and automatically uncovers differences in perspectives on these topics across linguistic communities. Our data analysis and behavioural evaluation offer evidence of a symbiotic relationship between ecologically sound corpus experiments and scientifically controlled human subject experiments, paving the way for the use of large-scale text mining to inform cognitive linguistics and psychology research.

We believe that our model represents a good foundation for future projects in this area. A promising area for further work is in developing better methods for identifying contrasts in perspective terms. This could perhaps involve modifying the generative process for perspective terms or incorporating syntactic dependency information. It would also be interesting to investigate the effect of dictionary quality and corpus size on the relative performance of the STATIC and INFER variants. Finally, we note that the model can be applied to identify contrastive perspectives in monolingual as well as multilingual data, providing a general tool for the analysis of subtle, yet important, cross-population differences.
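
As an illustration of the trial design described in Section 5.1, the following is a minimal sketch and not the authors' experimental code. It enumerates the 16 displacement-by-size conditions, orders them with a Fisher-Yates shuffle, and flags the conflicting trials in which the two cues point in opposite directions. The names `LEVELS`, `build_trials` and `is_conflicting`, the signed-percentage coding, the fixed seed, and the choice to present the three repetitions as three successive permutations are all assumptions made for the example; the paper does not specify these details.

```python
import itertools
import random

# Signed percentage changes (assumed coding): negative means leftward
# displacement or contraction, positive means rightward or expansion.
LEVELS = (-20, -10, 10, 20)


def fisher_yates_shuffle(items, rng):
    """Return a uniformly random permutation of items (Fisher-Yates)."""
    shuffled = list(items)
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


def is_conflicting(displacement, size):
    """True when displacement and size change point in opposite directions."""
    return (displacement > 0) != (size > 0)


def build_trials(repetitions=3, seed=0):
    """Build the trial list: each of the 16 conditions shown `repetitions` times."""
    rng = random.Random(seed)
    conditions = list(itertools.product(LEVELS, LEVELS))  # 4 x 4 = 16
    trials = []
    for _ in range(repetitions):
        # Assumption: one fresh permutation of the 16 conditions per repetition.
        trials.extend(fisher_yates_shuffle(conditions, rng))
    return trials


trials = build_trials()
conflicting = [t for t in trials if is_conflicting(*t)]
print(len(trials), "trials,", len(conflicting), "conflicting")
# prints "48 trials, 24 conflicting"
```

Under this coding, 8 of the 16 conditions are conflicting, which matches the statement that half of the stimuli contained conflicting information.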
Acknowledgments

We would like to thank the anonymous reviewers as well as the TACL editors, Sharon Goldwater and David Chiang, for helpful comments on an earlier draft of this paper. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575. Ekaterina Shutova’s research is supported by the Leverhulme Trust Early Career Fellowship. Gerard de Melo’s research is supported by China 973 Program Grants 2011CBA00300, 2011CBA00301, and NSFC Grants 61033001, 61361136003, 61550110504.

References

Amjad Abu-Jbara, Mona Diab, Pradeep Dasigi, and Dragomir Radev. 2012. Subgroup detection in ideological discussions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL ’12, pages 399–409, Stroudsburg, PA, USA. Association for Computational Linguistics.

Amr Ahmed and Eric P. Xing. 2010. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 1140–1150, Stroudsburg, PA, USA. Association for Computational Linguistics.

Rawia Awadallah, Maya Ramanath, and Gerhard Weikum. 2011. OpinioNetIt: Understanding the Opinions-People network for politically controversial topics. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11, pages 2481–2484, New York, New York, USA. ACM.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Jordan Boyd-Graber and David M. Blei. 2009. Multilingual topic models for unaligned text. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI ’09), pages 75–82. Arlington, VA, USA: AUAI Press.

Jordan Boyd-Graber and Philip Resnik. 2010. Holistic sentiment analysis across languages: multilingual supervised latent Dirichlet allocation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 45–55.

Daniel Casasanto and Lera Boroditsky. 2008. Time in the mind: Using space to think about time. Cognition, 106(2):579–593.

Jonathan Charteris-Black and Timothy Ennis. 2001. A comparative study of metaphor in Spanish and English financial reporting. English for Specific Purposes, 20:249–266.

Stanislas Dehaene. 1992. Varieties of numerical abilities. Cognition, 44:1–42.

Anthony Fader, Dragomir Radev, Burt L. Monroe, and Kevin M. Quinn. 2007. MavenRank: Identifying influential members of the US senate using lexical centrality. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 658–666.

Yi Fang, Luo Si, Naveen Somasundaram, and Zhengtao Yu. 2012. Mining contrastive opinions on political texts using cross-perspective topic model. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM ’12), pages 63–72, New York. New York: ACM.

Wim Fias, Marc Brysbaert, Frank Geypens, and Géry d’Ydewalle. 1995. The importance of magnitude information in numerical processing: evidence from the SNARC effect. Mathematical Cognition, 2(1):95–110.

Ronald A. Fisher and Frank Yates. 1963. Statistical Tables for Biological, Agricultural and Medical Research. Oliver and Boyd, Edinburgh.

John Fox and Sanford Weisberg. 2011. An R Companion to Applied Regression. SAGE Publications, CA: Los Angeles.

Orly Fuhrman, Kelly McCormick, Eva Chen, Heidi Jiang, Dingfang Shu, Shuaimei Mao, and Lera Boroditsky. 2011. How linguistic and cultural forces shape conceptions of time: English and Mandarin time in 3D. Cognitive Science, 35:1305–1328.

Nikesh Garera and David Yarowsky. 2009. Modeling latent biographic attributes in conversational genres. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL ’09, pages 710–718, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sean M. Gerrish and David M. Blei. 2011. Predicting legislative roll calls from text. In Proceedings of ICML.

Swapna Gottipati, Minghui Qiu, Yanchuan Sim, Jing Jiang, and Noah A. Smith. 2013. Learning topics and positions from Debatepedia. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1858–1868, Seattle,
Washington, USA, October. Association for Computational Linguistics.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL-’08: HLT, pages 771–779, Columbus, Ohio, USA.

David Hall, Daniel Jurafsky, and Christopher D. Manning. 2008. Studying the history of ideas using topic models. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 363–371. Association for Computational Linguistics.

Jagadeesh Jagarlamudi and Hal Daumé III. 2010. Extracting multilingual topics from unaligned comparable corpora. In Cathal Gurrin, Yulan He, Gabriella Kazai, Udo Kruschwitz, and Suzanne Little, editors, Proceedings of the 32nd European Conference on Advances in Information Retrieval (ECIR’2010), pages 444–456. Springer-Verlag, Berlin.

Rosie Jones, Ravi Kumar, Bo Pang, and Andrew Tomkins. 2007. “I know what you did last summer”: Query logs and user privacy. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM ’07, pages 909–914, New York, New York, USA. ACM.

Roy Jonker and Anton Volgenant. 1987. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340.

Zoltán Kövecses. 2004. Introduction: Cultural variation in metaphor. European Journal of English Studies, 8:263–274.

George Lakoff and Elisabeth Wehling. 2012. The Little Blue Book: The Essential Guide to Thinking and Talking Democratic. Free Press, New York.

David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pages 880–889. Association for Computational Linguistics.

David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Iain Murray and Ruslan R. Salakhutdinov. 2009. Evaluating probabilities under high-dimensional latent variable models. In Advances in Neural Information Processing Systems, pages 1137–1144.

David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic evaluation of topic coherence. In Proceedings of the 2010 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Diarmuid Ó Séaghdha. 2010. Latent variable models of selectional preference. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 435–444, Uppsala, Sweden. Association for Computational Linguistics.

Michael Paul and Roxana Girju. 2009. Cross-cultural analysis of blogs and forums with mixed-collection topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, EMNLP ’09, pages 1408–1417, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marco Pennacchiotti and Ana-Maria Popescu. 2011. Democrats, Republicans and Starbucks afficionados: user classification in Twitter. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, pages 430–438.

Kriti Puniyani, Jacob Eisenstein, Shay Cohen, and Eric P. Xing. 2010. Social links from latent topics in microblogs. In Proceedings of the NAACL/HLT 2010 Workshop on Computational Linguistics in a World of Social Media, pages 19–20. Association for Computational Linguistics.

Minghui Qiu and Jing Jiang. 2013. A latent variable model for viewpoint discovery from threaded forum posts. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1031–1040, Atlanta, Georgia, June. Association for Computational Linguistics.

Alan Ritter, Mausam, and Oren Etzioni. 2010. A latent Dirichlet allocation method for selectional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 424–434. Association for Computational Linguistics.

Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler. 2012. Exploring topic coherence over many models and many topics. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 952–961, Jeju Island, Korea.

Paul H. Thibodeau and Lera Boroditsky. 2011. Metaphors we think with: The role of metaphor in reasoning. PLoS ONE, 6(2):e16782.

Janyce Wiebe, Theresa Wilson, Rebecca Bruce, Matthew Bell, and Melanie Martin. 2004. Learning subjective language. Comput. Linguist., 30(3):277–308, September.