Transactions of the Association for Computational Linguistics, vol. 4, pp. 47–60, 2016. Action Editor: David Chiang.

Submission batch: 11/2015; published 2/2016.

© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


Detecting Cross-Cultural Differences Using a Multilingual Topic Model

E. D. Gutiérrez¹, Ekaterina Shutova², Patricia Lichtenstein³, Gerard de Melo⁴, Luca Gilardi⁵
¹University of California, San Diego; ²Computer Laboratory, University of Cambridge; ³University of California, Merced; ⁴IIIS, Tsinghua University; ⁵ICSI, Berkeley
edg@icsi.berkeley.edu, es407@cam.ac.uk, tricia1@uchicago.edu, gdm@demelo.org, lucag@icsi.berkeley.edu

Abstract

Understanding cross-cultural differences has important implications for world affairs and many aspects of the life of society. Yet, the majority of text-mining methods to date focus on the analysis of monolingual texts. In contrast, we present a statistical model that simultaneously learns a set of common topics from multilingual, non-parallel data and automatically discovers the differences in perspectives on these topics across linguistic communities. We perform a behavioural evaluation of a subset of the differences identified by our model in English and Spanish to investigate their psychological validity.

1 Introduction

Recent years have seen a growing interest in text-mining applications aimed at uncovering public opinions and social trends (Fader et al., 2007; Monroe et al., 2008; Gerrish and Blei, 2011; Pennacchiotti and Popescu, 2011). They rest on the assumption that the language we use is indicative of our underlying worldviews. Research in cognitive and sociolinguistics suggests that linguistic variation across communities systematically reflects differences in their cultural and moral models and goes beyond lexicon and grammar (Kövecses, 2004; Lakoff and Wehling, 2012). Cross-cultural differences manifest themselves in text in a multitude of ways, most prominently through the use of explicit opinion vocabulary with respect to a certain topic (e.g. "policies that benefit the poor"), idiomatic and metaphorical language (e.g. "the company is spinning its wheels") and other types of figurative language, such as irony or sarcasm.

The connection between language, culture and reasoning remains one of the central research questions in psychology. Thibodeau and Boroditsky (2011) investigated how metaphors affect our decision-making. They presented two groups of human subjects with two different texts about crime. In the first text, crime was metaphorically portrayed as a virus and in the second as a beast. The two groups were then asked a set of questions on how to tackle crime in the city. As a result, while the first group tended to opt for preventive measures (e.g. stronger social policies), the second group converged on punishment- or restraint-oriented measures. According to Thibodeau and Boroditsky, their results demonstrate that metaphors have a profound influence on how we conceptualize and act with respect to societal issues. This suggests that in order to gain a full understanding of social trends across populations, one needs to identify subtle but systematic linguistic differences that stem from the groups' cultural backgrounds, expressed both literally and figuratively. Performing such an analysis by hand is labor-intensive and often impractical, particularly in a multilingual setting where expertise in all of the languages of interest may be rare.

With the rise of blogging and social media, NLP techniques have been successfully used for a number of tasks in political science, including automatically estimating the influence of particular politicians in the US Senate (Fader et al., 2007), identifying lexical features that differentiate the political rhetoric of opposing parties (Monroe et al., 2008), predicting voting patterns of politicians based on their use of language (Gerrish and Blei, 2011), and predicting the political affiliation of Twitter users (Pennacchiotti and Popescu, 2011).


Fang et al. (2012) addressed the problem of automatically detecting and visualising the contrasting perspectives on a set of topics attested in multiple distinct corpora. While successful in their tasks, all of these approaches focused on monolingual data and did not reach beyond literal language. In contrast, we present a method that detects fine-grained cross-cultural differences from multilingual data, where such differences abound, expressed both literally and figuratively. Our method brings together opinion mining and cross-lingual topic modelling techniques for this purpose. Previous approaches to cross-lingual topic modelling (Boyd-Graber and Blei, 2009; Jagarlamudi and Daumé III, 2010) addressed the problem of mining common topics from multilingual corpora. We present a model that learns such common topics, while simultaneously identifying lexical features that are indicative of the underlying differences in perspectives on these topics by speakers of English, Spanish and Russian. These differences are mined from multilingual, non-parallel datasets of Twitter and news data. In contrast to previous work, our model does not merely output a list of monolingual lexical features for manual comparison, but also automatically infers multilingual contrasts.

Our system (1) uses word-document co-occurrence data as input, where the words are labeled as topic words or perspective words; (2) finds the highest-likelihood dictionary between topic words in the two languages given the co-occurrence data; (3) finds cross-lingual topics specified by distributions over topic words and perspective words; and (4) automatically detects differences in perspective-word distributions in the two languages. We perform a behavioural evaluation of a subset of the differences identified by the model and demonstrate their psychological validity. Our data and dictionaries are available from the first author upon request.

2 Related work

View detection. Identifying different viewpoints is related to the well-studied area of subjectivity detection, which aims at exposing opinion, evaluation, and speculation in text (Wiebe et al., 2004) and attributing it to specific people (Awadallah et al., 2011; Abu-Jbara et al., 2012). In our work, we are less interested in explicit local forms of subjectivity, instead aiming at detecting more general contrasts across linguistic communities.

Another line of research has focused on inferring author attributes such as gender, age (Garera and Yarowsky, 2009), location (Jones et al., 2007), or political affiliation (Pennacchiotti and Popescu, 2011). Such studies make use of syntactic style, discourse characteristics, as well as lexical choice. The models used for this are typically binary classifiers trained in a fully supervised fashion. In contrast, in our task, we automatically infer the topic distributions and find topic-specific contrasts.

Probabilistic topic models. Probabilistic topic models have proven useful for a variety of semantic tasks, such as selectional-preference induction (Ó Séaghdha, 2010; Ritter et al., 2010), sentiment analysis (Boyd-Graber and Resnik, 2010) and studying the evolution of concepts and ideas (Hall et al., 2008). The goal of a topic model is to characterize observed data in terms of a much smaller set of unobserved, semantically coherent topics. A particularly popular probabilistic topic model is Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Under its assumptions, each document has a unique mix of topics, and each topic is a distribution over terms in the vocabulary. A topic is chosen for every word token according to the topic mix of the document to which it belongs, and then the word's identity is drawn from the corresponding topic's distribution.

Handling multilingual corpora. LDA is designed for monolingual text and thus it lacks the structure necessary to model cross-lingually valid topics. While topic models can be trained individually on two languages and then the acquired topics can be matched, the correspondences between the topics for the two languages will be highly unstable. To address this, Boyd-Graber and Blei (2009) (MUTO) and Jagarlamudi and Daumé III (2010) (JOINTLDA) introduced the notion of cross-lingually valid concepts associated with different terms in different languages, using bilingual dictionaries to model topics across languages. Based on a model by Haghighi et al. (2008), MUTO is capable of learning translations, i.e., matchings between terms in the different languages being compared.


The Polylingual Topic Model of Mimno et al. (2009) is another approach to finding topics in multilingual corpora, but it requires tuples composed of comparable documents in each language of the corpus.

Topic models for view detection. LDA also assumes that the distribution of each topic is fixed across all documents in a corpus. Therefore, a topic associated with, e.g., war will have the same distribution over the lexicon regardless of whether the document was taken from a pro-war editorial or an anti-war speech. However, in reality we may expect a single topic to exhibit systematic and predictable variations in its distribution based on authorship. The cross-collection LDA model by Paul and Girju (2009) addresses this by specifically aiming to expose viewpoint differences across different document collections. Ahmed and Xing (2010) proposed a similar model for detecting ideological differences. Fang et al. (2012)'s Cross-Perspective Topic (CPT) model breaks up the terms in the vocabulary into topic terms and perspective terms with different generative processes, and differentiates between different collections of documents within the corpus. The topic terms are assumed to be generated as in LDA. However, the distribution of perspective terms in a document is taken to be dependent on both the topic mixture of the document as well as the collection from which the document is drawn.

Recent works proposed models for specific types of data. Qiu and Jiang (2013) use user identities and interactions in threaded discussions, while Gottipati et al. (2013) developed a topic model for Debatepedia, a semi-structured resource in which arguments are explicitly enumerated. However, all of these models perform their analyses on monolingual datasets. Thus, they are useful for comparing different ideologies expressed in the same language, but not for cross-linguistic comparisons.

3 Method

The goal of our model is to analyse large, non-parallel, multilingual corpora and present cross-lingually valid topics and the associated perspectives, automatically inferring the differences in conceptualization of these topics across cultures. Following Boyd-Graber and Blei (2009) and Jagarlamudi and Daumé III (2010), our distributions of latent topics range over latent, cross-lingual topic concepts that manifest themselves as language-specific topic words. We use bilingual dictionaries, containing words in one language and their translations in another language, to represent the topic concepts. These are represented as a bipartite graph, with each translation entry being an edge and each topic word in the two languages being a vertex. While the topic words are tied together by the translation dictionary, the perspective words can vary freely across languages. Following Fang et al. (2012), we treat nouns as topic words and verbs and adjectives as perspective words.[1] The model assumes that adjective and verb tokens in each document are assigned to topics in proportion to the topic assignments of the topic-word tokens. Then, the perspective term for this topic is drawn depending on the topic assignment and the language of the speaker.

[Figure 1: Basic generative model.]

[1] This approximation was adopted for convenience, computational efficiency and ease of interpretation. However, in principle our method does not depend on it, since it can be applied with all content words as topic or perspective words.
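The dictionary-as-bipartite-graph representation described above is straightforward to set up in code. The sketch below is purely illustrative (the entry format and the use of a maximum when duplicate entries collide are our assumptions, not the authors' implementation); it builds the set C of candidate concepts as weighted edges between topic words of the two languages:

```python
from collections import defaultdict

def build_concept_graph(dictionary_entries):
    """Build a bipartite graph of candidate topic-word concepts (u, v).

    dictionary_entries: iterable of (word_lang_a, word_lang_b, prior_weight).
    Returns the prior pi[(u, v)] for each edge plus adjacency lists for
    each side of the bipartite graph.
    """
    prior = {}                   # pi_{u,v}: prior weight of concept (u, v)
    adj_a = defaultdict(set)     # language-a word -> candidate translations
    adj_b = defaultdict(set)     # language-b word -> candidate translations
    for u, v, weight in dictionary_entries:
        prior[(u, v)] = max(weight, prior.get((u, v), 0.0))
        adj_a[u].add(v)
        adj_b[v].add(u)
    return prior, adj_a, adj_b

# Toy usage: one English word with two Spanish translation candidates.
entries = [("economy", "economía", 1.5),
           ("growth", "crecimiento", 2.0),
           ("growth", "aumento", 1.25)]
prior, adj_a, adj_b = build_concept_graph(entries)
```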


3.1 Basic Generative Model

Given the languages ℓ ∈ {a, b}, our model infers the distributions of multilingual topics and language-specific perspective words (Fig. 1), as follows:

1. Draw a set C of concepts (u, v) matching topic word u from language a to topic word v from language b, where the probability of concept (u, v) is proportional to a prior π_{u,v} (e.g. based on information from a translation dictionary).

2. Draw multinomial distributions:
   • For topic indices k ∈ {1, …, K}, draw language-independent topic-concept distributions φ^w_k ∼ Dir(β^w) over pairs (w_a, w_b) ∈ C.
   • For topic indices k ∈ {1, …, K} and languages ℓ ∈ {a, b}, draw language-specific perspective-term distributions φ^{ℓ,o}_k ∼ Dir(β^o) over perspective terms in language ℓ.

3. For each document d ∈ {1, …, D} with language ℓ_d:
   • Draw topic weights θ_d ∼ Dir(α).
   • For each topic-word index i ∈ {1, …, N^w_d} of document d:
     – Draw topic z_i ∼ θ_d.
     – Draw topic concept c_i = (w_a, w_b) ∼ φ^w_{z_i}, and select w_{ℓ_d} as the member of that pair corresponding to language ℓ_d.
   • For each perspective-word index j ∈ {1, …, N^o_d} of document d:
     – Draw topic x_j ∼ Uniform(z^w_1, …, z^w_{N^w_d}).
     – Draw perspective word o_j ∼ φ^{ℓ,o}_{x_j}.

3.2 Model Variants

We have experimented with several variants of our model, in order to account for the translation of polysemous words, adapt the translation model to the corpus used, and to handle words for which no translation is found.

a) SINGLE variants of the model match each topic term in a language with at most one topic term in the other language. MULTIPLE variants allow each term to match multiple other words in the other language.

b) INFER variants allow higher-likelihood matchings to be inferred from the data. STATIC variants treat the matchings as fixed, which is equivalent to assigning a probability of 0 or 1 to every edge in our bipartite graph C.

c) RELEGATE variants relegate all unmatched words in each language to a single separate background topic distinct from the topics that are learned for the matched topic words. This is akin to forcing the probability of currently unmatched words to 0 in all topics except for one, and forcing the probability of all currently matched words to 0 in this topic. INCLUDE variants do not restrict the assignment of unmatched words; they are assigned to the same set of topics as the matched words.

We test the following six variants: SINGLESTATICRELEGATE, SINGLESTATICINCLUDE, SINGLEINFERRELEGATE, SINGLEINFERINCLUDE, MULTIPLESTATICRELEGATE, and MULTIPLESTATICINCLUDE. We do not test MULTIPLEINFER variants because of the complexity of inferring a multiple matching in a bipartite graph.

3.3 Learning & Inference

For all variants, a collapsed Gibbs sampler can be used to infer the topics φ^{ℓ,o} and φ^w, the per-document topic distributions θ, as well as the topic assignments z and x. This corresponds to the S-step below. For INFER variants, we follow Boyd-Graber and Blei in using an M-step involving a bipartite graph matching algorithm to infer the matching m that maximizes the posterior likelihood of the matching.

S-Step: Sample topics for words in the corpus using a collapsed Gibbs sampler. For topic word w_i = u belonging to document d, if the word occurs in concept c_i = (u, v), then sample the topic and entry according to

$$P(z_i = k,\, c_i = (u, v) \mid w_i = u, \mathbf{z}_{-i}, C) \;\propto\; \frac{N_{dk} + \alpha_k}{\sum_j (N_{dj} + \alpha_j)} \times \frac{N_k(u, v) + \beta^w_k}{\sum_{v'} \big( N_k(u, v') + \beta^w_k \big)},$$

where the sum in the denominator of the first term is over all topics, and in the second term over all words matched to u. N_{dk} is the count of topic words of topic k in document d, and N_k(u, v) is the count of topic words either of type u or of type v assigned to topic k in all the corpora.[2]

[2] In RELEGATE variants, for unmatched u, z_i is sampled as P(z_i = k | w_i = u, z_{-i}, C) ∝ (N_{dk} + α_k) / Σ_k (N_{dk} + α_k), which can be seen as taking β^w_u → ∞ for unmatched terms.
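As a concrete illustration of this topic-word update, the sketch below computes the unnormalized sampling weights from raw count arrays and draws one (topic, concept) pair. It is a simplified rendering under the notation above, assuming a symmetric β^w; the dense-array layout and variable names are our own assumptions, not the authors' code:

```python
import numpy as np

def sample_topic_and_concept(d, u, candidates, N_dk, N_kuv, alpha, beta_w, rng):
    """One collapsed-Gibbs draw of (topic k, concept (u, v)) for topic word u.

    d          : document index
    u          : topic-word id in language a
    candidates : word ids v in language b matched to u in the graph C
    N_dk       : (D, K) counts of topic words per document and topic,
                 with the current token already decremented
    N_kuv      : dict mapping (k, u, v) -> corpus-wide concept count
    alpha      : length-K array of Dirichlet hyperparameters
    beta_w     : scalar symmetric prior on topic-concept distributions
    rng        : numpy Generator, e.g. np.random.default_rng()
    """
    K = N_dk.shape[1]
    weights, pairs = [], []
    for k in range(K):
        doc_term = (N_dk[d, k] + alpha[k]) / (N_dk[d].sum() + alpha.sum())
        denom = sum(N_kuv.get((k, u, v2), 0) for v2 in candidates) \
                + beta_w * len(candidates)
        for v in candidates:
            concept_term = (N_kuv.get((k, u, v), 0) + beta_w) / denom
            weights.append(doc_term * concept_term)
            pairs.append((k, v))
    weights = np.asarray(weights)
    idx = rng.choice(len(pairs), p=weights / weights.sum())
    return pairs[idx]   # chosen topic k and translation v, i.e. concept (u, v)
```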


For perspective word o_i = n, sample the topic according to

$$P(z_i = k \mid o_i = n, \mathbf{z}_{-i}, C) \;\propto\; \frac{N_{dk}}{\sum_j N_{dj}} \times \frac{N^{\ell_d}_{kn} + \beta^o_k}{\sum_m \big( N^{\ell_d}_{km} + \beta^o_k \big)},$$

where the sum in the second term of the denominator is over the perspective-word vocabulary of language ℓ_d; N_{dk} is the count of topic words in document d with topic k; and N^{ℓ_d}_{km} is the count of perspective word m being assigned topic k in language ℓ_d. Note that in all the counts above, the current word token i is omitted from the count.

Given our sampling assignments, we can then estimate θ_d, φ^{ℓ,o}, and φ^w as follows:

$$\hat{\theta}_{kd} = \frac{N_{dk} + \alpha_k}{\sum_k (N_{dk} + \alpha_k)}, \qquad \hat{\phi}^w_k(u, v) = \frac{N_k(u, v) + \beta^w_{(u,v)}}{\sum_{v'} \big( N_k(u, v') + \beta^w_{(u,v')} \big)}, \qquad \hat{\phi}^{\ell,o}_{nk} = \frac{N^\ell_{kn} + \beta^o_n}{\sum_m \big( N^\ell_{km} + \beta^o_n \big)}.$$

M-Step (for INFER variants only): Run the Jonker-Volgenant (Jonker and Volgenant, 1987) bipartite matching algorithm to find the optimal matching C given some weights. For topic term u from language a and topic term v from language b, our weights correspond to the log of the posterior odds that the occurrences of u and v come from a matched topic distribution, as opposed to coming from unmatched distributions:

$$\mu_{u,v} = \sum_{k \setminus \{a^*,\, b^*\}} \Big( N_k(u, v) \log \hat{\phi}^w_k(u, v) - N_u \log \hat{\phi}^w_k(u, \cdot) - N_v \log \hat{\phi}^w_k(\cdot, v) \Big) + \pi_{u,v},$$

where N_u is the count of topic term u in the corpus. This expression can also be interpreted as a kind of pointwise mutual information (Haghighi et al., 2008). The Jonker-Volgenant algorithm has a time complexity of at most O(V³), where V is the size of the lexicon (Jonker and Volgenant, 1987).
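For reference, this matching step can be sketched with an off-the-shelf linear-assignment solver; SciPy's `linear_sum_assignment` uses a Jonker-Volgenant-style shortest-augmenting-path algorithm. The sketch below is illustrative only (the weight-matrix layout and the zero threshold for keeping a match are our assumptions, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def infer_matching(mu, threshold=0.0):
    """M-step sketch: choose the one-to-one matching that maximizes total weight.

    mu        : (Va, Vb) array of matching weights mu[u, v] (log posterior odds
                plus the dictionary prior), with very negative values for pairs
                that share no dictionary edge.
    threshold : keep only pairs whose weight exceeds this value (an assumption;
                the paper does not specify a cut-off).
    """
    rows, cols = linear_sum_assignment(mu, maximize=True)
    return [(int(u), int(v)) for u, v in zip(rows, cols) if mu[u, v] > threshold]

# Toy usage with a 3x3 weight matrix: only two pairs survive the threshold.
mu = np.array([[ 2.0, -9.0, -9.0],
               [-9.0,  0.5, -9.0],
               [-9.0, -9.0, -9.0]])
print(infer_matching(mu))   # [(0, 0), (1, 1)]
```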


3.4 Inference of Perspective-Word Contrasts

Having learned our model and inferred how likely perspective terms are for a topic in a given language, we seek to know whether these perspectives differ significantly in the two languages. More precisely, can we infer whether word m in language a and the equivalent word n in language b have significantly different distributions under a topic k? To do this, we make the assumption that the perspective words in languages a and b are in one-to-one correspondence to each other. Recall that, for a given topic k and language ℓ, N^ℓ_{km} is the count for term m and φ^{ℓ,o}_{k,m} is the probability for word m in language ℓ. Just as we collect the probabilities into word-topic distribution vectors φ^{ℓ,o}_k, we collect the counts into word-topic count vectors [N^ℓ_{k1}, N^ℓ_{k2}, …]. Then, since our model assumes a prior over the parameter vectors φ^{ℓ,o}_k, we can infer the likelihood that the observed word-topic counts N^a_{km} and N^b_{kn} were drawn from a single word-topic-distribution prior denoted by φ̆ := φ^{a,o}_{km} = φ^{b,o}_{kn}. Below, all our probabilities are conditioned implicitly on this event as well as on N^a_k and N^b_k being fixed.

Denote the total count of word tokens in topic k from language ℓ by N^ℓ_k = Σ_m N^ℓ_{km}. Now, we derive the probability that we observe a ratio greater than δ between the proportion of words in topic k that belong to word type m in language a and to the corresponding word type n in language b:

$$P\!\left( \frac{N^a_{km}/N^a_k}{N^b_{kn}/N^b_k} \ge \delta \right) + P\!\left( \frac{N^b_{kn}/N^b_k}{N^a_{km}/N^a_k} \ge \delta \right). \tag{1}$$

By symmetry, it suffices to derive an expression for the first term. We note that the inequality in the probability is equivalent to a sum over a range of values of N^a_{km} and N^b_{kn}. By rearranging terms, applying the law of conditional probability to condition on the term φ̆, and exploiting the conditional independence of N^a_{km} and N^b_{kn} given φ̆, N^a_k, and N^b_k, we can rewrite this first term as

$$\sum_{x=0}^{N^b_k} \;\sum_{y = x\delta N_{a/b}}^{N^a_k} \int P(N^b_{kn} = x \mid \breve{\phi})\, P(N^a_{km} = y \mid \breve{\phi})\, P(\breve{\phi})\, d\breve{\phi},$$

where N_{a/b} = N^a_k / N^b_k. Recall that φ^{ℓ,o}_k ∼ Dir(β^o) under our model. Assume a symmetric Dirichlet distribution for simplicity. It can then be shown that the marginal distribution of φ̆ is φ̆ ∼ Beta(β^o, (V−1)β^o), where V is the total size of the perspective-word vocabulary. Similarly, it can be shown that the marginal distribution of N^ℓ_{km} given φ^{ℓ,o}_k is N^ℓ_{km} ∼ Binom(N^ℓ_k, φ^{ℓ,o}_{km}) for ℓ ∈ {a, b}. Therefore, the integrand above is proportional to the beta-binomial distribution with number of trials N^a_k + N^b_k, successes x + y, and parameters β^o and (V−1)β^o, but with partition function $\binom{N^a_k}{y}\binom{N^b_k}{x}$. Denote the PMF of this distribution by f(N^a_k + N^b_k, x + y, β^o). Then expression (1) above becomes

$$\sum_{x=0}^{N^b_k} \;\sum_{y = x\delta N_{a/b}}^{N^a_k} f(N^a_k + N^b_k,\, x + y,\, \beta^o) \;+\; \sum_{x=0}^{N^a_k} \;\sum_{y = x\delta N_{b/a}}^{N^b_k} f(N^a_k + N^b_k,\, x + y,\, \beta^o). \tag{2}$$

We cannot observe N^a_{km}, N^b_{kn}, N^a_k and N^b_k explicitly, but we can estimate them by obtaining posterior samples from our Gibbs sampler. We substitute these estimates into expression (2).

4 Experiments

4.1 Data

Twitter data. We gathered Twitter data in English, Spanish and Russian during the first two weeks of December 2013 using the Twitter API. Following previous work (Puniyani et al., 2010), we treated each Twitter user account as a document. We then tagged each document for part of speech, and divided the word tokens in it into topic words and perspective words. We constructed a lexicon of 2,000 topic terms and 1,500 perspective terms for each language by filtering out any terms that occurred in more than 10% of the documents in that language, and then selecting the remaining terms with the highest frequency. Finally, we kept only documents that contained 4 or more topic words from our lexicon. This left us with 847,560 documents in English (4,742,868 topic-word and 1,907,685 perspective-word tokens); 756,036 documents in Spanish (4,409,888 topic-word and 1,668,803 perspective-word tokens); and 260,981 documents in Russian (1,621,571 topic-word and 981,561 perspective-word tokens).

News data. We gathered all the articles published online during the year 2013 by the state-run media agencies of the United States (Voice of America or "VOA", English), Russia (RIA Novosti or "RIA", Russian), and Venezuela (Agencia Venezolana de Noticias or "AVN", Spanish). These three news agencies were chosen because they not only provide media in three distinct languages, but are guided by the political world-views of three distinct governments. We treated each news article as a document, and removed duplicates. Once again, we constructed a lexicon of 2,000 topic terms and 1,500 perspective terms using the same criteria as for Twitter, and kept only documents that contained 4 or more topic words from our lexicon. This left us with 23,159 articles (10,410,949 tokens) from VOA, 41,116 articles (11,726,637 tokens) from RIA, and 8,541 articles (2,606,796 tokens) from AVN.

Dictionaries. To create the translation dictionaries, we extracted translations from the English, Spanish, and Russian editions of Wiktionary, both from the translation sections and the gloss sections if the latter contained single words as glosses. Multi-word expressions were universally removed. We added inverse translations for every original translation. From the resulting collection of translations, we then created separate translation dictionaries for each language and part-of-speech tag combination.

In order to give preference to more important translations, we assigned each translation an initial weight of 1 + 1/r, where r was the rank of the translation within the page. Since a translation (or its inverse) can occur on multiple pages, we aggregated these initial weights and then assigned final weights of 1 + 1/r′, where r′ was the rank after aggregation and sorting in descending order of weights.
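The rank-based weighting just described can be made concrete with a short sketch. This is illustrative only: the entry format, the per-page treatment of inverse entries, and the global (rather than per-word) re-ranking are our assumptions, since the paper does not specify them:

```python
from collections import defaultdict

def dictionary_weights(pages):
    """Compute translation prior weights from ranked Wiktionary-style entries.

    pages: iterable of translation lists, one per page, each ordered by rank;
           entries are (source_word, target_word) pairs.
    Returns {(source, target): final_weight} with weights of the form 1 + 1/r'.
    """
    initial = defaultdict(float)
    for page in pages:
        for rank, (src, tgt) in enumerate(page, start=1):
            # Initial weight 1 + 1/r, added for the entry and its inverse.
            initial[(src, tgt)] += 1.0 + 1.0 / rank
            initial[(tgt, src)] += 1.0 + 1.0 / rank
    # Aggregate across pages, re-rank by weight (descending), assign 1 + 1/r'.
    ranked = sorted(initial.items(), key=lambda kv: kv[1], reverse=True)
    return {pair: 1.0 + 1.0 / r for r, (pair, _) in enumerate(ranked, start=1)}

# Toy usage: the same pair appears on two pages and is boosted accordingly.
pages = [[("economy", "economía"), ("growth", "crecimiento")],
         [("economía", "economy")]]
print(dictionary_weights(pages))
```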


4.2 Experimental Conditions

To evaluate the different variants of our model, we held out 30,000 documents (test set) during training. We plugged in the estimates of φ^w and C acquired during training using the rest of the corpus to produce a likelihood estimate for these held-out documents. All models were initialized with the prior matching determined by the dictionary data. For each number of topics K, we set α to 50/K and the β variables to 0.02, as in Fang et al. (2012). For the MULTIPLE variants, we set π_{i,j} = 1 if i and j share an entry and 0 otherwise. For INFER variants, only three M-steps were performed to avoid overfitting, at 250, 500, and 750 iterations of Gibbs sampling, following the procedure in Boyd-Graber and Blei (2009).

4.3 Comparison of model variants

In order to compare the variants of our model, we computed the perplexity and coherence for each variant on TWITTER and NEWS, for the English–Spanish and English–Russian language pairs.

Perplexity is a measure of how well a model trained on a training set predicts the co-occurrence of words on an unseen test set H. Lower perplexity indicates better model fit. We evaluate the held-out perplexity for topic words w_i and perspective words o_i separately. For topic words, the perplexity is defined as $\exp\big(-\sum_{w_i \in H} \log p(w_i) / N^w\big)$. As for standard LDA, exact inference of p(w_i) is intractable under this model. Therefore we adapted the estimator developed by Murray and Salakhutdinov (2009) to our models.

Coherence is a measure inspired by pointwise mutual information (Newman et al., 2010). Let D(v) be the number of documents with at least one token of type v and let D(v, w) be the number of documents containing at least one token of type v and at least one token of type w. Then Mimno et al. (2011) define the coherence of topic k as

$$\frac{1}{\binom{M}{2}} \sum_{m=2}^{M} \sum_{\ell=1}^{m-1} \log \frac{D(v^{(k)}_m, v^{(k)}_\ell) + \epsilon}{D(v^{(k)}_\ell)},$$

where V^{(k)} = (v^{(k)}_1, …, v^{(k)}_M) is a list of the M most probable words in topic k and ε is a small smoothing constant used to avoid taking the logarithm of zero. Mimno et al. (2011) find that coherence correlates better with human judgments than do likelihood-based measures. Coherence is a topic-specific measure, so for each model variant we trained, we computed the median topic coherence across all the topics learned by the model. We set ε = 0.1.
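A minimal sketch of this coherence measure, assuming the document-frequency counts D(·) and D(·,·) have already been computed from the corpus (the normalization by $\binom{M}{2}$ follows the formula above; the toy counts are invented for illustration):

```python
import math
from itertools import combinations

def topic_coherence(top_words, doc_freq, co_doc_freq, eps=0.1):
    """Coherence of one topic from its M most probable words.

    top_words   : list of the M most probable words, most probable first
    doc_freq    : dict word -> number of documents containing the word
    co_doc_freq : dict (word_i, word_j) -> number of documents containing both
    """
    M = len(top_words)
    total = 0.0
    for l, m in combinations(range(M), 2):        # l < m, mirroring the double sum
        v_l, v_m = top_words[l], top_words[m]
        pair = co_doc_freq.get((v_m, v_l), co_doc_freq.get((v_l, v_m), 0))
        total += math.log((pair + eps) / doc_freq[v_l])
    return total / (M * (M - 1) / 2)              # normalize by C(M, 2)

# Toy usage with made-up document-frequency counts.
dv = {"budget": 40, "debt": 35, "deficit": 20}
dvw = {("debt", "budget"): 25, ("deficit", "budget"): 12, ("deficit", "debt"): 10}
print(topic_coherence(["budget", "debt", "deficit"], dv, dvw))
```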


[Figure 2: Perplexity of different model variants for different numbers of iterations at K = 175.]

Model performance and analysis. Fig. 2 shows perplexity for the variants as a function of the number of iterations of Gibbs sampling on the English-Spanish NEWS corpus. The figure confirms that 1000 iterations of Gibbs sampling on the NEWS corpus was sufficient for convergence across model variants. We omit figures for English-Russian and for the TWITTER corpus, since the patterns were nearly identical. Figure 3 shows how perplexity varies as a function of the number of topics. We used this information to choose optimal models for the different corpora. The optimal number of topics was K = 175 for the English-Spanish NEWS corpus, K = 200 for the English-Russian NEWS, K = 325 for the English-Spanish TWITTER, and K = 300 for the English-Russian TWITTER. Although the optimal number of topics varied across corpora, the relative performance of the different models was the same. In all of our corpora, the MULTIPLE variants provided better fits than their corresponding SINGLE variants. There are several explanations for this. For one, the MULTIPLE variants are able to exploit the information from multiple translations, unlike the SINGLE variants, which discarded all but one translation per word. For another, the matchings produced by the SINGLEINFER variants can be purely coincidental and the result of overfitting (see some examples below). INCLUDE variants performed markedly better than RELEGATE variants. INFER variants improved model fit compared to STATIC variants, but required more topics to produce an optimal fit.

[Figure 3: Perplexity of different model variants.]

Recall that we performed an M-step in the INFER variants 3 times, at 250, 500, and 750 iterations. As noted in §3.3, the M-step in the INFER variants maximizes the posterior likelihood of the matching. However, Fig. 2 shows that this maximization causes held-out perplexity to increase substantially just after the first matching M-step, around 250 iterations, before decreasing again after about 50 more iterations of Gibbs sampling. We believe that this happens because the M-step is maximizing over expectations that are approximate, since they are estimated using Gibbs sampling. If the sampler has not yet converged, then the M-step's maximization will be unstable. We found support for this explanation when we re-ran the INFER variants using 1000 iterations between M-steps, giving the Markov chain enough time to converge. After this change, perplexity went down immediately after the M-step and kept decreasing monotonically, rather than increasing after the M-step before decreasing. However, this did not result in a significantly lower final perplexity or coherence and thus did not change the relative performance of the models. In addition, Fig. 2 suggests that the second and third M-steps (at 500 and 750 iterations, respectively) had little effect on perplexity. In light of the high computational expense of each inference step, this suggests that in practice a single inference step may be sufficient.

Fig. 4 shows that the MULTIPLESTATICINCLUDE variant was also the superior model as measured by median topic coherence. Once again, this general pattern held true for the English-Russian pair and the TWITTER corpora. Overall, the results show that MULTIPLESTATICINCLUDE provides superior performance across measures, corpora, topic numbers, and languages. We therefore used this variant in further data analysis and evaluation. Incidentally, the observed decrease in topic coherence as K increases is expected, because as K increases, lower-likelihood topics tend to be more incoherent (Mimno et al., 2011). Experiments by Stevens et al. (2012) show that this effect is observed for LDA-, NMF-, and SVD-based topic models.

Cross-linguistic matchings. The matchings inferred by the SINGLEINFERINCLUDE variant were of mixed quality. Some of the matchings corrected low-quality translations in the original dictionary. For instance, our prior dictionary matched passage in English to pasaje in Spanish. Though technically correct, the dominant meaning of pasaje is [travel] ticket. The TWITTER model correctly matched passage to ruta instead. Many of the matchings learned by the model did not provide technically correct translations, yet were still revelatory and interesting. For instance, the dictionary translated the Spanish word pito as cigarette in English. However, in informal usage this word refers specifically to cannabis cigarettes, not tobacco cigarettes. The TWITTER model matches pito to the English slang word weed instead. The Spanish word Siria (Syria) was unmatched in the prior dictionary; the NEWS model matched it to the word chemical, which makes sense in the context of extensive reporting of the usage of chemical weapons in the ongoing Syrian conflict.

4.4 Data analysis and discussion

We have conducted a qualitative analysis of the topics, perspectives and contrasts produced by our models for the English–Spanish and English–Russian, TWITTER and NEWS datasets. While the topics were coherent and consistent across languages, sets of perspective words manifested systematic differences revealing interesting cross-cultural contrasts. Fig. 5 and 7 show the top perspective words discovered by the model for the topic of finance and economy in the English and Spanish NEWS and TWITTER corpora, respectively. While some of the perspective words are neutral, mostly literal and occur in both English and Spanish (e.g. balance or authorize), many others represent metaphorical vocabulary (e.g. saddle, gut, evaporate in English, or incendiar, sangrar, abatir in Spanish) pointing at distinct models of conceptualization of the topic. When we applied the contrast detection method (described in §3.4) to these perspective words, it highlighted the differences in metaphorical perspectives, rather than the literal ones, as shown in Fig. 6 and 8.


[Figure 4: Coherence of different model variants.]

English speakers tend to discuss economic and financial processes using motion terms, such as "slow, drive, boost or sluggish", or a related metaphor of horse-riding, e.g. "rein in debt", "saddle with debt", or even "breed money". In contrast, Spanish speakers tend to talk about the economy in terms of size rather than motion, using verbs such as ampliar or disminuir, and other metaphors, such as sangrar (to bleed) and incendiar (to light up). These examples demonstrate coherent conceptualization patterns that differ in the two languages. Interestingly, this difference manifested itself in both NEWS and TWITTER corpora and echoes the findings of a previous corpus-linguistic study of Charteris-Black and Ennis (2001), who manually analysed metaphors used in English and Spanish financial discourse and reported that motion and navigation metaphors that abound in English were rarely observed in Spanish.

[Figure 5: Top perspectives in system output for the topic of finance in the NEWS corpus (metaphors in red italics).
Topic EN: budget debt deficit reduction spend balance cut increase limit downtown tax stress addition planet
Topic ES: presupuesto deficit deuda reduccion equilibrio disminucion gasto aumentacion tasa sacerdote
Perspective EN: balance default triple rein accumulate accrue trim incur saddle slash prioritize avert gut burden evaporate borrow pile cap cut tackle
Perspective ES: renegociar mejora etiquetado desplomar recortar endeudar incendiar destinar asignar autorizar aprobado ascender sangrar augurar abatir]

[Figure 6: Contrasts identified by the model in NEWS.
Contrasts EN: rein [in debt], saddle [with debt], cap [debt], breed [money], gut [budget], [debt] hit, tackle [debt], boost, slow, drive, sluggish [economy], spur
Contrasts ES: sangrar [dinero], ampliar, disminuir [la economía], superar [la tasa], emitir [deuda]]

For the majority of the topics we analysed, the model revealed interesting cross-cultural differences. For instance, the Spanish corpora exhibited metaphors of battle when talking about poverty (with poverty seen as an enemy), while in the English corpus poverty was discussed more neutrally, as a social problem that needs a practical solution. English-Russian NEWS experiments revealed a surprising difference with respect to the topic of protests. They suggested that while US media tend to use stronger metaphorical vocabulary, such as clash, erupt or fire, in Russian protests are discussed more neutrally. Generally, the NEWS corpora contained more abstract topics and richer information about conceptual structure and sentiment in all languages. Many of the topics discovered in TWITTER related to everyday concepts, such as pets or concerts, with fewer topics covering societal issues. Yet, a few TWITTER-specific contrasts could be observed: e.g., the sports topic tends to be discussed using war and battle vocabulary in Russian to a greater extent than in English.

Our models tend to identify two general kinds of differences: (1) cross-corpus differences representing worldviews of the particular populations whom the corpora characterize (such differences exist both across and within languages, e.g. the metaphors used in the progressive New York Times would be different from the ones in the more conservative Wall Street Journal); and (2) deeply entrenched cross-linguistic differences, such as the motion versus expansion metaphors for the economy in English and Spanish. Such systematic cross-linguistic contrasts can be associated with contrastive behavioural patterns across the different linguistic communities (Casasanto and Boroditsky, 2008; Fuhrman et al., 2011).


In both NEWS and TWITTER data, our model effectively identifies and summarises such contrasts, simplifying the manual analysis of the data by highlighting linguistic trends that are indicative of the underlying conceptual differences. However, the conceptual differences are not straightforward to evaluate based on the surface vocabulary alone. In order to investigate this further, we conducted a behavioural experiment testing a subset of the contrasts discovered by our model.

[Figure 7: Top perspectives in system output for the economy topic in TWITTER (metaphors in red).
Topic EN: economy growth rate percent bank economist interest reserve market policy
Topic ES: economía crecimiento tasa banco política mercado interés inflación empleo economista
Perspective EN: economic financial grow global expect remain cut boost low slow drive
Perspective ES: económico mundial agregar financiero informal pequeño significar interno bajar]

[Figure 8: Contrasts identified by the model in TWITTER.
Contrasts EN: slow [the economy], push [the economy], strong [economy], weak [economy], stable [economy], boost [the economy]
Contrasts ES: caer [la economía], disminuir, superar [la economía], ampliar [el crecimiento]]

5 Behavioural evaluation

We assessed the relevance of the contrasts through an experimental study with native English-speaking and native Spanish-speaking human subjects. We focused on a linguistic difference in the metaphors used by English speakers versus Spanish speakers when discussing changes in a nation's economy. While English speakers tend to use metaphors involving both locative motion verbs (e.g. slow) as well as expansive/contractive motion verbs (e.g. shrink), Spanish speakers preferentially employ expansive/contractive motion verbs (e.g. disminuir) to describe changes in the economy. These differences could reflect linguistic artefacts (such as collocation frequencies) or could reflect entrenched conceptual differences. Our experiment addresses the question of whether such patterns of behaviour arise cross-linguistically in response to non-linguistic stimuli. If the linguistic differences are indicative of entrenched conceptual differences, then we expect to see responses to the non-linguistic stimuli that correspond to the usage differences in the two languages.

5.1 Experimental setup

We recruited 60 participants from one English-speaking country (the US) and 60 participants from three Spanish-speaking countries (Chile, Mexico, and Spain) using the CrowdFlower crowdsourcing platform. Participants first read a brief description of the experimental task, which introduced them to a fictional country in which economists are devising a simple but effective graphic for "representing change in [the] economy". They then completed a demographic questionnaire including information about their native language. Results from 9 US and 3 non-US participants were discarded for failure to meet the language requirement.

Participants navigated to a new page to complete the experimental task. Stimuli were presented in a 1200×700-pixel frame. The center of the frame contained a sphere with a 64-pixel diameter. For each trial, participants clicked on a button to activate an animation of the sphere which involved (1) a positive displacement (in rightward pixels) of 10% or 20%, or a negative displacement (in leftward pixels) of 10% or 20%;[3] and (2) an expansion (in increased pixel diameter) of 10% or 20%, or a contraction (in decreased pixel diameter) of 10% or 20%.[4] Participants saw each of the resulting conditions 3 times. The displacement and size conditions were drawn from a random permutation of 16 conditions using a Fisher-Yates shuffle (Fisher and Yates, 1963). Crucially, half of the stimuli contained conflicts of information with respect to the size and displacement metaphors for economic change (e.g. the sphere could both grow and move to the left). Overall we expected the Spanish speakers' responses to be more closely associated with changes in diameter due to the presence and salience of the size metaphor, and the English speakers' responses to be influenced by both conditions.

[3] The use of leftward/rightward horizontal displacement to represent decreases/increases in magnitude is supported by research in numerical cognition showing that people associate smaller magnitudes with the left side of space and larger magnitudes with the right side (Dehaene, 1992; Fias et al., 1995).

[4] A demonstration of the English experimental interface can be accessed at http://goo.gl/W3YVfC. The Spanish interface is identical, but for a direct translation of the guidelines provided by a native Spanish/fluent English speaker.
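For concreteness, the stimulus schedule described in §5.1 (16 displacement × size combinations, shuffled with a Fisher-Yates permutation and each shown three times) can be sketched as follows. This is an illustrative reconstruction, not the authors' experiment code; in particular, reshuffling the 16 conditions before every repetition block is our assumption:

```python
import random
from itertools import product

def build_trial_schedule(seed=0):
    """Return a shuffled list of (displacement, size_change) stimulus conditions.

    Displacement and size each take the values ±10% and ±20%, giving 16
    combinations; each condition appears 3 times.
    """
    levels = [-0.20, -0.10, 0.10, 0.20]
    conditions = list(product(levels, levels))   # 16 (displacement, size) pairs
    rng = random.Random(seed)
    schedule = []
    for _ in range(3):
        block = list(conditions)
        # Fisher-Yates shuffle (random.shuffle implements the same algorithm).
        for i in range(len(block) - 1, 0, -1):
            j = rng.randint(0, i)
            block[i], block[j] = block[j], block[i]
        schedule.extend(block)
    # Half of all conditions are "conflicting": displacement and size disagree in sign.
    conflicting = [c for c in schedule if (c[0] > 0) != (c[1] > 0)]
    assert len(conflicting) == len(schedule) // 2
    return schedule

print(len(build_trial_schedule()))   # 48 trials
```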


[Figure 9: "Economy Improved" response rate in conflicting stimulus conditions.]

We expected these differences to be most prominent in the conflicting trials, which force English speakers (unlike Spanish speakers) to choose between two available metaphors. We focus on these conflicting trials in our analysis and discussion of the results.

5.2 Results

In trials in which stimuli moving rightward were simultaneously contracting, English speakers responded that the economy improved 66% of the time, whereas Spanish speakers judged the economy to have improved 43% of the time. In trials in which stimuli moving leftward were simultaneously expanding, English speakers judged the economy to have improved 34% of the time, and Spanish speakers responded that the economy improved 55% of the time. The results are illustrated in Figure 9.

These results indicate three effects: (1) English speakers exhibit a pronounced bias for using horizontal displacement rather than expansion/contraction during the decision-making process; (2) Spanish speakers are more biased toward expansion/contraction in formulating a decision; and (3) across the two languages the responses show contrasting patterns. The results support our expectation on the relevance of different metaphors when reasoning about the economy by the English and Spanish speakers.

To examine the significance of these effects, we fit a binary logit mixed effects model[5] to the data. The full analysis modeled judgment with native language, displacement, and size as fully crossed fixed effects and participant as a random effect. This analysis confirmed that native language was associated with judgments about economic change. In particular, it indicated that changes in size affected English speakers' judgments and Spanish speakers' judgments differently (p < 0.001), with an increase in size increasing the odds (e^β = 2.5) of a judgment of IMPROVED by Spanish speakers and decreasing the odds (e^β = 0.44) of a judgment of IMPROVED by English speakers. A Type II Wald test revealed the interaction between language and size to be highly statistically significant (χ²(1), p < 0.001).

[5] See Fox and Weisberg (2011) for a discussion of such models including application of the Type II Wald test.

In summary, the patterns we see in the behavioural data are consistent with the patterns uncovered in the output of our model. While much territory remains to be investigated to delimit the nature of this relationship, our results represent a first step toward establishing an association between information mined from large textual data collections and information observed through behavioural responses on a human scale.

6 Conclusion

We presented the first model that detects common topics from multilingual, non-parallel data and automatically uncovers differences in perspectives on these topics across linguistic communities. Our data analysis and behavioural evaluation offer evidence of a symbiotic relationship between ecologically sound corpus experiments and scientifically controlled human subject experiments, paving the way for the use of large-scale text mining to inform cognitive linguistics and psychology research.

We believe that our model represents a good foundation for future projects in this area. A promising area for further work is in developing better methods for identifying contrasts in perspective terms. This could perhaps involve modifying the generative process for perspective terms or incorporating syntactic dependency information. It would also be interesting to investigate the effect of dictionary quality and corpus size on the relative performance of STATIC and INFER variants. Finally, we note that the model can be applied to identify contrastive perspectives in monolingual as well as multilingual data, providing a general tool for the analysis of subtle, yet important, cross-population differences.

Acknowledgments

We would like to thank the anonymous reviewers as well as the TACL editors, Sharon Goldwater and David Chiang, for helpful comments on an earlier draft of this paper. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575. Ekaterina Shutova's research is supported by the Leverhulme Trust Early Career Fellowship. Gerard de Melo's research is supported by China 973 Program Grants 2011CBA00300, 2011CBA00301, and NSFC Grants 61033001, 61361136003, 61550110504.

References

Amjad Abu-Jbara, Mona Diab, Pradeep Dasigi, and Dragomir Radev. 2012. Subgroup detection in ideological discussions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 399–409, Stroudsburg, PA, USA. Association for Computational Linguistics.

Amr Ahmed and Eric P. Xing. 2010. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 1140–1150, Stroudsburg, PA, USA. Association for Computational Linguistics.

Rawia Awadallah, Maya Ramanath, and Gerhard Weikum. 2011. OpinioNetIt: Understanding the Opinions-People network for politically controversial topics. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 2481–2484, New York, NY, USA. ACM.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Jordan Boyd-Graber and David M. Blei. 2009. Multilingual topic models for unaligned text. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI '09), pages 75–82. Arlington, VA, USA: AUAI Press.

Jordan Boyd-Graber and Philip Resnik. 2010. Holistic sentiment analysis across languages: multilingual supervised latent Dirichlet allocation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 45–55.

Daniel Casasanto and Lera Boroditsky. 2008. Time in the mind: Using space to think about time. Cognition, 106(2):579–593.

Jonathan Charteris-Black and Timothy Ennis. 2001. A comparative study of metaphor in Spanish and English financial reporting. English for Specific Purposes, 20:249–266.

Stanislas Dehaene. 1992. Varieties of numerical abilities. Cognition, 44:1–42.

Anthony Fader, Dragomir Radev, Burt L. Monroe, and Kevin M. Quinn. 2007. MavenRank: Identifying influential members of the US Senate using lexical centrality. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 658–666.

Yi Fang, Luo Si, Naveen Somasundaram, and Zhengtao Yu. 2012. Mining contrastive opinions on political texts using cross-perspective topic model. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM '12), pages 63–72, New York, NY, USA. ACM.

Wim Fias, Marc Brysbaert, Frank Geypens, and Géry d'Ydewalle. 1995. The importance of magnitude information in numerical processing: evidence from the SNARC effect. Mathematical Cognition, 2(1):95–110.

Ronald A. Fisher and Frank Yates. 1963. Statistical Tables for Biological, Agricultural and Medical Research. Oliver and Boyd, Edinburgh.

John Fox and Sanford Weisberg. 2011. An R Companion to Applied Regression. SAGE Publications, Los Angeles, CA.

Orly Fuhrman, Kelly McCormick, Eva Chen, Heidi Jiang, Dingfang Shu, Shuaimei Mao, and Lera Boroditsky. 2011. How linguistic and cultural forces shape conceptions of time: English and Mandarin time in 3D. Cognitive Science, 35:1305–1328.

Nikesh Garera and David Yarowsky. 2009. Modeling latent biographic attributes in conversational genres. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, ACL '09, pages 710–718, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sean M. Gerrish and David M. Blei. 2011. Predicting legislative roll calls from text. In Proceedings of ICML.

Swapna Gottipati, Minghui Qiu, Yanchuan Sim, Jing Jiang, and Noah A. Smith. 2013. Learning topics and positions from Debatepedia. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1858–1868, Seattle, Washington, USA, October. Association for Computational Linguistics.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL-08: HLT, pages 771–779, Columbus, Ohio, USA.

David Hall, Daniel Jurafsky, and Christopher D. Manning. 2008. Studying the history of ideas using topic models. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 363–371. Association for Computational Linguistics.

Jagadeesh Jagarlamudi and Hal Daumé III. 2010. Extracting multilingual topics from unaligned comparable corpora. In Cathal Gurrin, Yulan He, Gabriella Kazai, Udo Kruschwitz, and Suzanne Little, editors, Proceedings of the 32nd European Conference on Advances in Information Retrieval (ECIR 2010), pages 444–456. Springer-Verlag, Berlin.

Rosie Jones, Ravi Kumar, Bo Pang, and Andrew Tomkins. 2007. "I know what you did last summer": Query logs and user privacy. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM '07, pages 909–914, New York, NY, USA. ACM.

Roy Jonker and Anton Volgenant. 1987. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340.

Zoltán Kövecses. 2004. Introduction: Cultural variation in metaphor. European Journal of English Studies, 8:263–274.

George Lakoff and Elisabeth Wehling. 2012. The Little Blue Book: The Essential Guide to Thinking and Talking Democratic. Free Press, New York.

David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pages 880–889. Association for Computational Linguistics.

David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Iain Murray and Ruslan R. Salakhutdinov. 2009. Evaluating probabilities under high-dimensional latent variable models. In Advances in Neural Information Processing Systems, pages 1137–1144.

David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic evaluation of topic coherence. In Proceedings of the 2010 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Diarmuid Ó Séaghdha. 2010. Latent variable models of selectional preference. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 435–444, Uppsala, Sweden. Association for Computational Linguistics.

Michael Paul and Roxana Girju. 2009. Cross-cultural analysis of blogs and forums with mixed-collection topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3 - Volume 3, EMNLP '09, pages 1408–1417, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marco Pennacchiotti and Ana-Maria Popescu. 2011. Democrats, Republicans and Starbucks afficionados: user classification in Twitter. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 430–438.

Kriti Puniyani, Jacob Eisenstein, Shay Cohen, and Eric P. Xing. 2010. Social links from latent topics in microblogs. In Proceedings of the NAACL/HLT 2010 Workshop on Computational Linguistics in a World of Social Media, pages 19–20. Association for Computational Linguistics.

Minghui Qiu and Jing Jiang. 2013. A latent variable model for viewpoint discovery from threaded forum posts. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1031–1040, Atlanta, Georgia, June. Association for Computational Linguistics.

Alan Ritter, Mausam, and Oren Etzioni. 2010. A latent Dirichlet allocation method for selectional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 424–434. Association for Computational Linguistics.

Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler. 2012. Exploring topic coherence over many models and many topics. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 952–961, Jeju Island, Korea.

Paul H. Thibodeau and Lera Boroditsky. 2011. Metaphors we think with: The role of metaphor in reasoning. PLoS ONE, 6(2):e16782.

Janyce Wiebe, Theresa Wilson, Rebecca Bruce, Matthew Bell, and Melanie Martin. 2004. Learning subjective language. Computational Linguistics, 30(3):277–308, September.