Transactions of the Association for Computational Linguistics, vol. 6, pp. 269–285, 2018. Action Editor: Diana McCarthy.
Submission batch: 11/2017; Revision batch: 2/2018; Published 5/2018.
© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Bootstrap Domain-Specific Sentiment Classifiers from Unlabeled Corpora

Andrius Mudinas, Dell Zhang, and Mark Levene
Department of Computer Science and Information Systems
Birkbeck, University of London
London WC1E 7HX, UK
andrius@dcs.bbk.ac.uk, dell.z@ieee.org, mark@dcs.bbk.ac.uk

Abstract

There is often the need to perform sentiment classification in a particular domain where no labeled document is available. Although we could make use of a general-purpose off-the-shelf sentiment classifier or a pre-built one for a different domain, the effectiveness would be inferior. In this paper, we explore the possibility of building domain-specific sentiment classifiers with unlabeled documents only. Our investigation indicates that in the word embeddings learned from the unlabeled corpus of a given domain, the distributed word representations (vectors) for opposite sentiments form distinct clusters, though those clusters are not transferable across domains. Exploiting such a clustering structure, we are able to utilize machine learning algorithms to induce a quality domain-specific sentiment lexicon from just a few typical sentiment words ("seeds"). An important finding is that simple linear model based supervised learning algorithms (such as linear SVM) can actually work better than more sophisticated semi-supervised/transductive learning algorithms which represent the state-of-the-art technique for sentiment lexicon induction. The induced lexicon could be applied directly in a lexicon-based method for sentiment classification, but a higher performance could be achieved through a two-phase bootstrapping method which uses the induced lexicon to assign positive/negative sentiment scores to unlabeled documents first, and then uses those documents found to have clear sentiment signals as pseudo-labeled examples to train a document sentiment classifier via supervised learning algorithms (such as LSTM). On several benchmark datasets for document sentiment classification, our end-to-end pipelined approach, which is overall unsupervised (except for a tiny set of seed words), outperforms existing unsupervised approaches and achieves an accuracy comparable to that of fully supervised approaches.

1 Introduction

Sentiment analysis (Liu, 2015) is a popular research topic which has a wide range of applications, such as summarizing customer reviews, monitoring social media, and predicting stock market trends (Bollen et al., 2011). A basic task in sentiment analysis is to classify the sentiment polarity of a given piece of text (document), i.e., whether the opinion expressed in the text is positive or negative (Pang et al., 2002), which is the focus of this paper.

There are many different approaches to sentiment classification in the Natural Language Processing (NLP) literature, from simple lexicon-based methods (Ding et al., 2008; Thelwall et al., 2010; Thelwall et al., 2012) to learning-based approaches (Pang and Lee, 2004; Turney, 2002; Jo and Oh, 2011; Argamon et al., 2007; Lin and He, 2009), and also hybrid methods in between (Mudinas et al., 2012; Zhang et al., 2011). No matter which approach is taken, a sentiment classifier built for its target domain would work well only within that specific domain, but suffer a serious performance loss once the domain boundary is crossed. The same word could drastically change its sentiment polarity (and/or strength) if it is used in a different domain. For example, being "small" is likely to be negative
for a hotel room but positive for a digital camcorder, being "unexpected" may be a good thing for the ending of a movie but not for the engine of a car, and we will probably enjoy "interesting" books but not necessarily "interesting" food. Here, the domain could be defined not by the topic of the documents but by the style of writing. For example, the meanings of words like "gay" and "terrific" would depend on whether the text was written in a historical era or modern times.

When we need to perform sentiment classification in a new domain unseen before, there is usually neither a labeled dictionary available to employ lexicon-based sentiment classifiers nor a labeled corpus available to train learning-based sentiment classifiers. It is, of course, possible to resort to a general-purpose off-the-shelf sentiment classifier, or a pre-built one for a different domain. However, the effectiveness would often be unsatisfactory because of the reasons mentioned above. There have been some studies on domain adaptation or transfer learning for sentiment classification (Blitzer et al., 2007; Tan et al., 2009; Pan et al., 2010; Glorot et al., 2011; Yoshida et al., 2011; Bollegala et al., 2013; Xia et al., 2013; Yang and Eisenstein, 2015), but they still require a large amount of labeled training data from a fairly similar source domain, which is not always feasible. Those algorithms also tend to be computationally expensive and time-consuming (Mohammad and Turney, 2010; Fast et al., 2016).

In this paper, we propose an end-to-end pipelined nearly-unsupervised approach to domain-specific sentiment classification of documents for a new domain based on distributed word representations (vectors). As shown in Fig. 1, the proposed approach consists of three main stages (components): (1) domain-specific sentiment word embedding, (2) domain-specific sentiment lexicon induction, and (3) domain-specific sentiment classification of documents. Briefly speaking, given a large unlabeled corpus for a new domain, we would first set up the vector space for that domain via word embedding, then induce a sentiment lexicon in the discovered vector space from a very small set of seed words as well as a general-purpose lexicon, and finally exploit the induced lexicon in a lexicon-based document sentiment classifier to bootstrap a more effective learning-based document sentiment classifier for that domain. The second stage of our approach outperforms the state-of-the-art unsupervised method for sentiment lexicon induction (Hamilton et al., 2016), which is the most closely related work (see Section 2). The key to the superior performance of our method compared with theirs is the insight gained from our first stage that positive and negative sentiment words are largely clustered in the domain-specific vector space but these two clusters have a non-negligible overlap; therefore, semi-supervised/transductive learning algorithms could be easily misled by the examples in the overlap and would actually not work as well as simple supervised classification algorithms. Overall, the document sentiment classifier resulting from our nearly-unsupervised approach does not require any labeled document to be trained, and it can outperform the state-of-the-art unsupervised method for document sentiment classification (Eisenstein, 2017). The source code for our implemented system and the datasets for our experiments are open to the research community.1

The rest of this paper is organized as follows. In Section 2, we review previous studies on this topic. In Sections 3 to 5, we describe the three main stages of our approach respectively. In Section 6, we draw conclusions and discuss future work.

2 Related Work

Most of the early sentiment analysis systems took lexicon-based approaches to document sentiment classification which rely on pre-compiled sentiment lexicons (Owsley et al., 2006). Various methods have been proposed to automatically produce such sentiment lexicons (Hu and Liu, 2004; Ding et al., 2008). Later, the focus of research shifted to learning-based approaches (Pang et al., 2002; Pang and Lee, 2004), as supervised learning algorithms usually deliver a much higher accuracy in sentiment classification than pure lexicon-based methods. However, lexicons have not completely lost their attractiveness: they are usually easier to understand and to maintain by non-experts, and they can also be integrated into learning-based sentiment classifiers (Mudinas et al., 2012; Eisenstein, 2017).

1 https://goo.gl/8K9PbE
Figure 1: Our nearly-unsupervised approach to domain-specific sentiment classification.

The lexicon-based sentiment classifier used in our experiments is a publicly-available system called pSenti2 (Mudinas et al., 2012). In addition to a customizable sentiment lexicon, it also uses shallow NLP techniques like part-of-speech (POS) tagging and the detection of sentiment inverters and other modifiers (intensifying and diminishing adverbs).

The introduction of modern word embedding techniques like word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) has opened the possibility of new sentiment analysis methods. Given a large unlabeled corpus, such techniques can learn from word co-occurrence information and produce a vector space of hundreds of dimensions, with each word being assigned a corresponding vector. The resulting vector space helps in understanding the semantic relationships between words and allows grouping of words based on their linguistic similarities. Recently, Rothe et al. (2016) proposed the DENSIFIER method that can reduce the dimensionality of word embeddings without losing semantic information and explored its application in various domains. For the SemEval-2015 task (Rosenthal et al., 2015), DENSIFIER performed slightly worse compared to word2vec, though its training time was shorter by a factor of 21. In fact, previous studies such as (Rothe et al., 2016; Cliche, 2017) suggest that word2vec usually provides the best word embeddings for sentiment analysis tasks.

In their recent work, Hamilton et al. (2016) demonstrated that by starting from a small set of seed words and conducting label propagation over the lexical graph derived from the pairwise proximities of word embeddings, they could induce a domain-specific sentiment lexicon comparable to a hand-curated one. Intuitively, the success of their method, named SentProp, requires a relatively clear separation between sentiment words of opposite polarity in the vector space which, as we will show later, is not very realistic. Moreover, they have focused on the induction of sentiment lexicons alone, while we are trying to design an end-to-end pipeline that can turn unlabeled documents in a new domain directly into their sentiment classifications, with domain-specific sentiment lexicon induction as a key component.

Recent advances in deep learning (LeCun et al., 2015) have elevated sentiment analysis to new performance levels (Kim, 2014; Dai and Le, 2015; Hong and Fang, 2015). As reported by Dai and Le (2015), the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) Recurrent Neural Network (RNN) can reach or surpass the performance levels of all previous baselines for sentiment classification of documents. One of the many appeals of LSTM is that it can connect previous information to the current context and allows seamless integration of pre-trained word embeddings as the first (projection) layer of the neural network.

2 https://goo.gl/pj4XAQ
Moreover, Radford et al. (2017) discovered the "sentiment unit", the single unit which can learn the perfect representation of sentiment, in a multiplicative LSTM with 4096 units, despite the fact that the LSTM was only trained for a completely different purpose: to predict the next character in the text of Amazon reviews. Our results are in line with those findings and confirm the superiority of LSTM in building document-level sentiment classifiers.

Zhang et al. (2011) tried to address the low recall problem of lexicon-based methods for Twitter sentiment classification via training a learning-based sentiment classifier using the noisy labels generated by a lexicon-based sentiment classifier (Ding et al., 2008). Although the basic idea of their work is similar to what we do in the third stage of our approach (see Section 5), there exist several notable differences. First, they adopted a single general-purpose sentiment lexicon provided by Ding et al. (2008) and used it for all domains, while we would induce a different lexicon for each different domain. Consequently, their method could have a relatively large variance in the document sentiment classification performance because of the domain mismatch (e.g., F1 = 0.874 for the "Tangled" tweets and F1 = 0.647 for the "Obama" tweets), whereas our approach would perform quite consistently over different domains. Second, they would need to strip out all the previously-known opinion words in their single general-purpose sentiment lexicon from the training documents in order to prevent the training bias and force their document sentiment classifier to exploit domain-specific features, but doing this would obviously lose the very valuable sentiment signals carried by those opinion words. In contrast, we would be able to utilize all terms in the training documents, including those opinion words that appeared in our automatically induced domain-specific lexicons, as features, when building our document sentiment classifiers. Third, they designed their method specifically for Twitter sentiment classification, while our approach would work for not only short texts such as tweets (see Section 5.2) but also long texts such as customer reviews (see Section 5.1). Fourth, they had to use an intermediate step to identify additional opinionated tweets (according to the opinion indicators extracted through the χ2 test on the results of their lexicon-based sentiment classifier) in order to handle the neutral class, but we would not require that time-consuming step as we would use the calibrated probabilistic outputs of our document sentiment classifier to detect the neutral class (see Section 5.3).

3 Domain-Specific Sentiment Word Embedding

Our approach to domain-specific document-level sentiment classification is built on top of word embeddings: distributed word representations (vectors) that could be learned from an unlabeled corpus to encode the semantic similarities between words (Goldberg, 2017).

In this section, we investigate how the embeddings of sentiment words for a particular domain would look in the domain-specific vector space. To ensure a fair comparison with the state-of-the-art sentiment lexicon induction technique SentProp3 (Hamilton et al., 2016) later in Section 4, we adopt the same publicly-available pre-trained word embeddings for the following three domains together with the corresponding sets of sentiment words (i.e., sentiment lexicons).

• Standard-English. We use the Google News word embeddings4 and the 'General Inquirer' lexicon (Stone et al., 1966) with the sentiment polarity scores collected by Warriner et al. (2013).
• Twitter. We use the word embeddings constructed by Rothe et al. (2016) and the sentiment lexicon from the SemEval-2015 Task 10E (Rosenthal et al., 2015).
• Finance. We use the word embeddings learned using an SVD-based method (Manning et al., 2008) from a collection of "8-K" financial reports5 (Lee et al., 2014) and the finance sentiment lexicon hand-crafted by Hamilton et al. (2016).

Note that the above three sentiment lexicons would be used for both the inspection of sentiment word distributions in this section and the evaluation of sentiment lexicon induction later in the next section. Furthermore, to facilitate a fair comparison with the state-of-the-art unsupervised document sentiment classification technique ProbLex-DCM6 (Eisenstein, 2017) later in Section 5, we also adopt the following two document collections which they have used.

3 https://goo.gl/BFkY8N
4 https://goo.gl/5r79l6
5 https://goo.gl/7ntr2V
6 https://goo.gl/Qr993F
• IMDB. We use 50k movie reviews in English from IMDB (Maas et al., 2011) with 25k labeled training documents.
• Amazon. We use about 28k product reviews in English across four product categories from Amazon (Blitzer et al., 2007; McAuley and Leskovec, 2013) with 8k labeled training documents.

The word embeddings for the above two domains were trained by us on the respective corpora using word2vec (Mikolov et al., 2013), which employs a two-layer neural network and is by far the most widely used word embedding technique. Specifically, we ran word2vec with skip-gram and a five-word window to construct word vectors of 500 dimensions, as recommended by previous studies.7 The sentiment lexicon made by Liu (2015) is consistently one of the best for analyzing reviews (Ribeiro et al., 2016), so it is used for both of those domains.

7 https://goo.gl/SyAdej
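As a concrete illustration of this embedding step, the following is a minimal sketch using the gensim library (our choice for illustration; the paper only specifies the word2vec settings). The corpus file name `reviews.txt` and the `min_count` value are assumptions.

```python
# A minimal sketch of domain-specific embedding training, assuming the
# gensim (v4) API and a corpus file with one tokenized review per line.
from gensim.models import Word2Vec

# Load the unlabeled corpus as lists of tokens.
sentences = [line.lower().split()
             for line in open("reviews.txt", encoding="utf-8")]

# Skip-gram (sg=1), five-word window, 500-dimensional vectors,
# matching the settings reported in this section; min_count is assumed.
model = Word2Vec(sentences, vector_size=500, window=5, sg=1,
                 min_count=5, workers=4)

vec = model.wv["good"]                          # embedding of one word
print(model.wv.most_similar("good", topn=5))    # its nearest neighbours
```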
Drawing an analogy to the well-known cluster hypothesis in Information Retrieval (IR) (Manning et al., 2008), here we put forward the cluster hypothesis for sentiment analysis: words in the same cluster behave similarly with respect to sentiment polarity in a specific domain. That is to say, we expect positive and negative sentiment words to form distinct clusters, given that they have been represented in an appropriate vector space. To verify this hypothesis, it would be useful to visualize the high-dimensional sentiment word vectors in a 2D plane. We have tried a number of dimensionality reduction techniques including t-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008), but found that simply using the classic Principal Component Analysis (PCA) (Bishop, 2006) works very well for this purpose.

We have found that, in general, the above cluster hypothesis holds for word embeddings within a specific domain. Fig. 2a shows that in the Standard-English domain, the sentiment words with opposite polarities form two distinct clusters. However, it can also be seen that those two clusters overlap with each other. That is because each word carries not only a sentiment value but also its linguistic and semantic information. Zooming into one of the word vector space regions (Fig. 2b) can help us understand why sentiment words with different polarities could be grouped together: 'hail', 'stormy' and 'sunny' are linguistically similar as they all describe weather conditions, yet they convey very different sentiment values. Moreover, as described by Plutchik (1984), sentiment could be grouped into multiple dimensions such as joy–sadness, anger–fear, trust–disgust and anticipation–surprise. Putting that aside, certain sentiment words can be classified sometimes as positive and sometimes as negative, depending on the context. These reasons lead to the phenomenon that many sentiment words are located in the overlapping noisy region between the two clusters in the domain-specific vector space.

Figure 2: Visualisation of the sentiment words in the Standard-English domain. (a) The global vector space showing two clusters. (b) A local region of the vector space zoomed in.

On visual inspection of the Finance (Fig. 3a) sentiment words and IMDB (Fig. 4a) sentiment words in their respective vector spaces, we can see that positive and negative words form distinct clusters which are largely separable. However, if we consider Finance sentiment words in the IMDB vector space (see Fig. 3b), positive and negative words would be mixed together and could not be separated easily.

One may be surprised that positive and negative sentiment words form their respective clusters, because most of the time they could be used in exactly the same context, which might suggest that they would result in similar word embeddings.
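To make the visualization step concrete, the following is a minimal sketch of the PCA projection behind plots like Figs. 2-4, assuming the gensim `model` from the previous sketch and two placeholder word lists `pos_words` and `neg_words` (e.g., from a known lexicon).

```python
# A minimal sketch of the 2-D visualization: project the embeddings of
# known positive/negative words with PCA and plot them as +/- markers.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = [w for w in pos_words + neg_words if w in model.wv]
X = [model.wv[w] for w in words]
X2 = PCA(n_components=2).fit_transform(X)   # 500-D -> 2-D

for (x, y), w in zip(X2, words):
    marker = "+" if w in pos_words else "_"  # plus for positive words
    plt.scatter(x, y, marker=marker, color="k")
plt.title("Sentiment words in the domain-specific vector space")
plt.show()
```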
Figure 3: Sentiment words of Finance in the same/different domain vector space. (a) In the Finance (same domain) vector space. (b) In the IMDB (different domain) vector space.

Figure 4: Sentiment words about movies in the IMDB vector space before/after filtering. (a) Original/Full. (b) Filtered.

For example, we could say "the room is good" and also "the room is bad": both are legitimate sentences. The probable reason for the cluster hypothesis to be true is that in reality people tend to use positive sentiment words together much more often than to mix them with negative sentiment words, and vice versa. For example, it would be much more common for us to see sentences like "the room is clean and tidy" than "the room is clean but messy". It is a long established fact in computational linguistics that words with similar meanings tend to occur nearby each other (Miller and Charles, 1991); sentiment words are no exception (Turney, 2002). Moreover, it has been widely observed that online customer reviews are affected by the so-called love-hate self-selection bias: users tend to rate only products which they either like or hate, leading to a lot more 1-star and 5-star ratings than other (moderate) ratings; if the product is just average or so-so, they probably will not bother to leave reviews. The polarization of online customer reviews would also encourage the clustering of sentiment words into opposite polarities.
4 Domain-Specific Sentiment Lexicon Induction

Given the word embeddings for a specific domain, we can induce a customized sentiment lexicon from a few typical sentiment words ("seeds") frequently used in that particular domain. Such an induced domain-specific sentiment lexicon plays a crucial role in the pipeline towards domain-specific document-level sentiment classification.

Table 1 shows the seed words for five different domains, which are identical to those used by Hamilton et al. (2016) except for the two additional domains IMDB and Amazon. The induction of a sentiment lexicon could then be formulated as a simple word sentiment classification problem with two classes (positive vs. negative). Each word is represented as a vector via domain-specific word embedding; the seed words are labeled with their corresponding classes while all the other words (i.e., "candidates") are unlabeled; the task here is to learn a classifier from the labeled examples first and then apply it to predict the sentiment polarity of each unlabeled candidate word. The probabilistic outputs of such a word sentiment classifier could be regarded as the measure of confidence about the predicted sentiment polarity. In the end, those candidate words with a high probability of being either positive or negative would be added to the sentiment lexicon. The final induced sentiment lexicon would include both the seed words and the selected candidate words.

As pointed out by Mudinas et al. (2012), if we simply consider all words from the given corpus as candidate words, the above described word sentiment classifier tends to assign sentiment values not only to the actual sentiment words but also to their associated product features or, more generally, the aspects of the expressed view. For example, if a lot of customers do not like the weight of a product, the word sentiment classifier may assign strong negative sentiment to "weight", yet this is not stable: the sentiment polarity of "weight" may be different when a new version of the product is released or the customer population has changed, and furthermore it probably does not apply to other products. To avoid this potential issue, it would be necessary to consider only a high-quality list of candidate words which are likely to be genuine sentiment words. Such a list of candidate words could be obtained directly from general-purpose sentiment lexicons. It is also possible to perform NLP on the target domain corpus and extract frequently-occurring adjectives or other typical sentiment indicators like emoticons as candidate words, which is beyond the scope of this paper.

To examine the effectiveness of different machine learning algorithms for building such domain-specific word sentiment classifiers, we attempt to recreate known sentiment lexicons in three domains: Standard-English, Twitter, and Finance (see Section 3), in the same way as Hamilton et al. (2016) did. Put differently, for the purpose of evaluation, we would just use a known sentiment lexicon in the corresponding domain as the list of candidate words and see how different machine learning algorithms would classify those candidate words based on their domain-specific word embeddings. For those lexicons with ternary sentiment classification (positive vs. neutral vs. negative), the class-mass normalization method (Zhu et al., 2003) used by Hamilton et al. (2016) has been applied here to identify the neutral category. The quality of each induced lexicon for a specific domain is evaluated by comparing it with its corresponding known lexicon as the ground-truth, according to the same performance metrics as in (Hamilton et al., 2016): Area Under the Receiver-Operating-Characteristic (ROC) Curve (AUC) for the binary classifications (ignoring the neutral class, as is common in previous work) and Kendall's τ rank correlation coefficient with continuous human-annotated polarity scores. Note that Kendall's τ is not suitable for the Finance domain, as its known sentiment lexicon is only binary. Therefore, our experimental setting and performance measures are all identical to those of Hamilton et al. (2016), which ensures the validity of the empirical comparison between our approach and theirs.
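The following is a minimal sketch of the induction step described above: a linear word sentiment classifier trained on the seed embeddings, with high-confidence candidates admitted to the lexicon. The names `model`, `pos_seeds`, `neg_seeds` and `candidates` are placeholders for the embeddings and word lists of Table 1, and the 0.7/0.3 cut-offs mirror the threshold used later in Section 5.

```python
# A minimal sketch of seed-based sentiment lexicon induction with a
# linear classifier (scikit-learn), assuming word vectors from word2vec.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([model.wv[w] for w in pos_seeds + neg_seeds])
y_train = np.array([1] * len(pos_seeds) + [0] * len(neg_seeds))

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

induced = {}
for w in candidates:
    if w not in model.wv:
        continue
    p_pos = clf.predict_proba(model.wv[w].reshape(1, -1))[0, 1]
    if p_pos >= 0.7:       # confidently positive candidate
        induced[w] = +1
    elif p_pos <= 0.3:     # confidently negative candidate
        induced[w] = -1
```

Raising the admission threshold shrinks the induced lexicon but makes it cleaner, the precision/recall trade-off discussed below.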
Table 1: The "seeds" for domain-specific sentiment lexicon induction.

Corpus | Positive seeds | Negative seeds
Standard-English | good, lovely, excellent, fortunate, pleasant, delightful, perfect, loved, love, happy | bad, horrible, poor, unfortunate, unpleasant, disgusting, evil, hated, hate, unhappy
Twitter | love, loved, loves, awesome, nice, amazing, best, fantastic, correct, happy | hate, hated, hates, terrible, nasty, awful, worst, horrible, wrong, sad
Finance | successful, excellent, profit, beneficial, improving, improved, success, gains, positive | negligent, loss, volatile, wrong, losses, damages, bad, litigation, failure, down, negative
IMDB | good, excellent, perfect, happy, interesting, amazing, unforgettable, genius, gifted, incredible | bad, bland, horrible, disgusting, poor, banal, shallow, disappointed, disappointing, lifeless, simplistic, bore
Amazon | IMDB domain seeds (as above) plus positive, fortunate, correct, nice | IMDB domain seeds (as above) plus negative, unfortunate, wrong, terrible, inferior

In Table 2, we compare a number of typical supervised and semi-supervised/transductive learning algorithms for word sentiment classification in the context of domain-specific sentiment lexicon induction:

• kNN: k Nearest Neighbors (Hastie et al., 2009),
• LR: Logistic Regression (Hastie et al., 2009),
• SVMlin: Support Vector Machine with the linear kernel (Joachims, 1998),
• SVMrbf: Support Vector Machine with the non-linear RBF kernel (Joachims, 1998),
• TSVM: Transductive Support Vector Machine (Joachims, 1999),
• S3VM: Semi-Supervised Support Vector Machine (Gieseke et al., 2012),
• CPLE: Contrastive Pessimistic Likelihood Estimation (Loog, 2016),
• SGT: Spectral Graph Transducer (Joachims, 2003),
• SentProp: a label propagation based classification method proposed for the SocialSent system (Hamilton et al., 2016).

The suitable parameter values of the above learning algorithms (such as the C for SVM) are found via grid search with cross-validation, and the probabilistic outputs are given by Platt scaling (Platt, 2000) if they are not provided by the original learning algorithm.

The experimental results shown in Table 2 demonstrate that in almost every single domain, simple linear model based supervised learning algorithms (LR and SVMlin) can achieve the optimal or near-optimal accuracy for the sentiment lexicon induction task, and they outperform the state-of-the-art sentiment lexicon induction method SentProp (Hamilton et al., 2016) by a large margin. The performance improvements are statistically significant (p-value < 0.05) according to the sign test. There does not seem to be any benefit in utilizing non-linear models (kNN and SVMrbf) or semi-supervised/transductive learning algorithms (TSVM, S3VM, CPLE, SGT, and SentProp). The qualitative analysis of the sentiment lexicons induced by different methods shows that they differ only on those borderline, ambiguous words (such as "soft") residing in the noisy overlapping region between the two clusters in the vector space (see Section 3). In particular, SentProp is based on label propagation over the lexical graph of words, so it could be easily misled by noisy borderline words when sentiment clusters have considerable overlap with each other, a kind of "over-fitting" (Bishop, 2006). Furthermore, according to our experiments on the same machine, those simple linear models are 70+ times faster than SentProp. The speed difference is mainly due to the fact that supervised learning algorithms only need to train on a small number of labeled words ("seeds" in our context), while semi-supervised/transductive learning algorithms need to train on not only a small number of labeled words but also a large number of unlabeled words.

It has also been observed in our experiments that there is a typical precision/recall trade-off (Manning et al., 2008) for the automatic induction of sentiment lexicons. Assuming that the classified candidate words are added to the lexicon in the descending order of their probabilities (of being either positive or negative), the induced lexicon will become noisier and noisier as it grows bigger and bigger.
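To illustrate the model selection and calibration just described, the following is a minimal sketch assuming the `(X_train, y_train)` seed matrix from the earlier induction sketch. Since a linear SVM does not natively output probabilities, Platt scaling is added via scikit-learn's `CalibratedClassifierCV`; the parameter grid is an assumption.

```python
# A minimal sketch of grid search for C plus Platt-scaled probabilities
# for a linear SVM word sentiment classifier.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Grid search with cross-validation for the SVM regularization parameter.
search = GridSearchCV(LinearSVC(),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=3)
search.fit(X_train, y_train)

# Platt scaling (Platt, 2000): fit a sigmoid on held-out decision values
# so that the SVM can output calibrated class probabilities.
svm_prob = CalibratedClassifierCV(LinearSVC(C=search.best_params_["C"]),
                                  method="sigmoid", cv=3)
svm_prob.fit(X_train, y_train)
p = svm_prob.predict_proba(X_train[:1])   # probability for one seed word
```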
Table 2: Comparing the induced lexicons with their corresponding known lexicons (ground-truth) according to the ranking of sentiment words, measured by AUC and Kendall's τ. (kNN, LR, SVMlin and SVMrbf are supervised; TSVM, S3VM, CPLE, SGT and SentProp are semi-supervised/transductive.)

Metric | Corpus | kNN | LR | SVMlin | SVMrbf | TSVM | S3VM | CPLE | SGT | SentProp
AUC | Standard-English | 0.892 | 0.931 | 0.939 | 0.941 | 0.901 | 0.540 | 0.680 | 0.852 | 0.906
AUC | Twitter | 0.849 | 0.900 | 0.895 | 0.895 | 0.770 | 0.521 | 0.651 | 0.725 | 0.860
AUC | Finance | 0.711 | 0.944 | 0.942 | 0.932 | 0.665 | 0.561 | 0.836 | 0.725 | 0.916
τ | Standard-English | 0.469 | 0.495 | 0.498 | 0.495 | 0.487 | 0.038 | 0.162 | 0.409 | 0.440
τ | Twitter | 0.490 | 0.569 | 0.548 | 0.547 | 0.522 | 0.001 | 0.211 | 0.437 | 0.500

Fig. 5 shows that imposing a higher cut-off probability threshold (for candidate words to enter the induced lexicon) would decrease the size of the induced lexicon but increase its quality (accuracy). On one hand, the induced lexicon needs to contain a sufficient number of sentiment words, especially when detecting sentiment in short texts, as a lexicon-based method cannot reasonably classify documents with none or too few sentiment words. On the other hand, the noise (misclassified sentiment words) in the induced lexicon would obviously have a detrimental impact on the accuracy of the document sentiment classifier built on top of it. Contrary to most previous work, like that of Qiu et al. (2011), which tries to expand the sentiment lexicon as much as possible and thus maintain a high recall, we put more emphasis on precision and keep a tight control of the lexicon size. For us, having a small sentiment lexicon is affordable, because our proposed approach to document sentiment classification will be able to mitigate the low recall problem of lexicon-based methods by combining them with learning-based methods, which we shall talk about next.

5 Domain-Specific Sentiment Classification of Documents

A domain-specific sentiment lexicon, automatically induced using the above technique, provides a solid basis for building domain-specific document sentiment classifiers. For the experiments here, we use a list of 7866 candidate words constructed by merging two well-known general-purpose sentiment lexicons that are both publicly available: the 'General Inquirer' (Stone et al., 1966) and the sentiment lexicon from Liu (2012). This set of candidate words is itself a combined, general-purpose sentiment lexicon, so we name it the GI+BL lexicon. Moreover, we set the cut-off probability threshold to a generally good value, 0.7, in our sentiment lexicon induction algorithm.

Figure 5: How the accuracy and size of an induced lexicon are influenced by the cut-off probability threshold. (Axes: cut-off probability vs. accuracy and number of words.)

Comparing the IMDB vector space including all the candidate words (Fig. 4a) with that including only the high-probability candidate words (Fig. 4b), it is obvious that the positive and negative sentiment clusters become more clearly separated in the latter.

The induced sentiment lexicon on its own could be applied directly in a lexicon-based method for sentiment classification of documents, and a reasonably good performance could be achieved, as we will show later in Table 4. However, most of the time, lexicon-based sentiment classifiers are not as effective as learning-based sentiment classifiers. One reason is that the former tend to suffer from a poor recall. For example, with a limited-size sentiment lexicon, lexicon-based methods would often fail to detect the sentiment present in short texts, e.g., from Twitter, due to the lexical gap.
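For concreteness, the following is a minimal sketch of a lexicon-based document classifier in the spirit of pSenti: count lexicon hits and flip polarity after a sentiment inverter. The real pSenti also uses POS tagging and intensity modifiers (Section 2), which this sketch omits; the inverter list is illustrative, and `induced` is the {word: +1/-1} lexicon from Section 4.

```python
# A minimal sketch of lexicon-based sentiment scoring with inverter
# handling (a simplification of pSenti, not its actual implementation).
INVERTERS = {"not", "no", "never"}   # illustrative inverter list

def lexicon_score(text, lexicon):
    """Return (score, hits): summed polarity and count of sentiment words."""
    score, hits, invert = 0, 0, False
    for token in text.lower().split():
        if token in INVERTERS:
            invert = True            # flip the polarity of the next hit
            continue
        if token in lexicon:
            score += -lexicon[token] if invert else lexicon[token]
            hits += 1
        invert = False
    return score, hits

s, n = lexicon_score("the room is not clean", induced)
label = "positive" if s > 0 else "negative"
```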
Given the induced sentiment lexicon, we propose to use a lexicon-based sentiment classifier to classify unlabeled documents, and then use those classified documents containing at least three sentiment words as pseudo-labeled documents for the later training of a learning-based sentiment classifier. The condition of "at least three sentiment words" is to ensure that only reliably classified documents would be further utilised as training examples.

5.1 Sentiment Classification of Long Texts

First, we try the induced sentiment lexicons in the lexicon-based sentiment classifier pSenti (Mudinas et al., 2012) to see how good they are. Given a sentiment lexicon, pSenti is able to perform not only binary sentiment classification but also ordinal sentiment classification on a five-point scale. To measure the binary classification performance, we use both micro-averaged F1 (miF1) and macro-averaged F1 (maF1), which are commonly used in text categorization (Yang and Liu, 1999). To measure the five-point-scale classification performance, we use both Cohen's κ coefficient (Manning et al., 2008) and the Root-Mean-Square Error (RMSE) (Bishop, 2006). As the baseline, we use the combined general-purpose sentiment lexicon, GI+BL, mentioned previously in Section 4. As we can see from the results shown in Table 3, using the sentiment lexicon induced for the target domain makes the lexicon-based sentiment classifier pSenti perform better than simply employing an existing general-purpose sentiment lexicon. Moreover, using a sentiment lexicon induced from the same domain leads to a much better performance than using a sentiment lexicon induced from a different domain.

Second, to evaluate the proposed two-phase bootstrapping method, we make empirical comparisons on the IMDB and Amazon datasets using a number of representative methods for document sentiment classification:

• pSenti: a concept-level lexicon-based sentiment classifier (Mudinas et al., 2012),
• ProbLex-DCM: a probabilistic lexicon-based classifier using the Dirichlet Compound Multinomial (DCM) likelihood to reduce effective counts for repeated words (Eisenstein, 2017),
• SVMlin: Support Vector Machine with the linear kernel (Joachims, 1998),
• CNN: Convolutional Neural Network (Kim, 2014),
• LSTM: Long Short-Term Memory, a Recurrent Neural Network (RNN) that can remember values over arbitrary time intervals (Hochreiter and Schmidhuber, 1997; Dai and Le, 2015).

To apply the deep learning algorithms CNN and LSTM that have a word embedding projection layer, we fix the review size to 500 words, truncating reviews longer than that and padding reviews shorter than that with null values. As pointed out by Greff et al. (2017), the hidden layer size is an important hyperparameter of LSTM: usually, the larger the network, the better the performance but the longer the training time. In our experiments, we have used an LSTM network with 400 units on the hidden layer, which is the capacity that a PC with one Nvidia GTX 1080 Ti GPU can afford, and a dropout (Wager et al., 2013) rate of 0.5, which is the most common setting in the research literature (Srivastava et al., 2014; Hong and Fang, 2015; Cliche, 2017).
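The following is a minimal sketch of the LSTM configuration described above, written with the Keras API (an assumption; the paper does not name its framework). Here `emb` is a placeholder (vocab_size x 500) matrix of pre-trained word2vec vectors used as the projection layer, and `sequences`/`y` are the word-id sequences and 0/1 labels of the pseudo-labeled documents.

```python
# A minimal sketch of the 400-unit LSTM with a frozen embedding
# projection layer, 500-word reviews, and dropout of 0.5.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 500  # fixed review size: truncate/pad to 500 words

model = Sequential([
    Embedding(input_dim=emb.shape[0], output_dim=emb.shape[1],
              weights=[emb], input_length=MAX_LEN, trainable=False),
    LSTM(400),        # 400 hidden units, as in the paper's experiments
    Dropout(0.5),     # dropout rate of 0.5
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

X = pad_sequences(sequences, maxlen=MAX_LEN,
                  truncating="post", padding="post")
model.fit(X, y, epochs=3, batch_size=64)   # epochs/batch size assumed
```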
As shown in Table 4, the above described two-phase bootstrapping method has been demonstrated to be beneficial: the learning-based sentiment classifiers trained on pseudo-labeled data are superior to lexicon-based sentiment classifiers, including the state-of-the-art unsupervised sentiment classifier ProbLex-DCM (Eisenstein, 2017). Furthermore, the two-phase bootstrapping method is a general framework which can utilize any lexicon-based sentiment classifier to produce pseudo-labeled data. Therefore, the more sophisticated ProbLex-DCM could also be used instead of pSenti in this framework, which is likely to deliver an even higher performance. Among the three learning-based sentiment classifiers, LSTM achieved the best performance on both datasets, which is consistent with the observations in other studies like Dai and Le (2015).

Comparing the LSTM-based sentiment classifiers trained on pseudo-labeled and real labeled data, we can also see that using a large number of pseudo-labeled examples could achieve a similar effect as using 25/4 ≈ 6k and 8/2 = 4k real labeled examples for IMDB and Amazon respectively. This suggests that the unsupervised approach is actually preferable to the supervised approach if there are only a few thousand (or less) labeled examples.
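To make the two-phase bootstrapping concrete, the following is a minimal sketch of the pseudo-labeling phase, reusing the `lexicon_score` sketch from earlier in this section; `unlabeled_docs` and the zero-score exclusion are assumptions beyond the paper's stated "at least three sentiment words" rule.

```python
# A minimal sketch of phase one of the bootstrapping: label unlabeled
# documents with the lexicon-based classifier and keep only those with
# at least three sentiment words as pseudo-labeled training examples.
def pseudo_label(docs, lexicon, min_hits=3):
    """Return (texts, labels) for reliably classified documents."""
    texts, labels = [], []
    for doc in docs:
        score, hits = lexicon_score(doc, lexicon)
        if hits >= min_hits and score != 0:  # clear sentiment signal only
            texts.append(doc)
            labels.append(1 if score > 0 else 0)
    return texts, labels

train_texts, train_labels = pseudo_label(unlabeled_docs, induced)
# Phase two: train_texts/train_labels feed the SVM/CNN/LSTM training above.
```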
Table 3: Lexicon-based sentiment classification of Amazon Kitchen product reviews. (miF1, maF1, F1(pos) and F1(neg) measure binary classification; Cohen's κ and RMSE measure five-point-scale classification.)

Lexicon | miF1 | maF1 | F1(pos) | F1(neg) | Cohen's κ | RMSE
General-purpose: GI+BL | 0.745 | 0.744 | 0.764 | 0.722 | 0.235 | 1.325
Domain-specific: same domain (Kitchen) | 0.761 | 0.761 | 0.772 | 0.750 | 0.236 | 1.310
Domain-specific: different domain (Electronics) | 0.749 | 0.749 | 0.750 | 0.749 | 0.215 | 1.373
Domain-specific: different domain (Video) | 0.736 | 0.735 | 0.752 | 0.717 | 0.206 | 1.372

Table 4: Sentiment classification of long texts.

Method | IMDB AUC | IMDB F1 | Amazon AUC | Amazon F1
Unsupervised, lexicon-based:
pSenti with existing general-purpose lexicon | 0.808 | 0.705 | 0.818 | 0.747
pSenti with induced domain-specific lexicon | 0.841 | 0.768 | 0.839 | 0.771
ProbLex-DCM (Eisenstein, 2017) | 0.884 | 0.806 | 0.836 | 0.756
Unsupervised, learning-based:
SVMlin trained on pseudo-labeled data | 0.863 | 0.771 | 0.845 | 0.763
CNN trained on pseudo-labeled data | 0.879 | 0.781 | 0.849 | 0.773
LSTM trained on pseudo-labeled data | 0.890 | 0.810 | 0.850 | 0.776
Supervised, learning-based:
LSTM trained on real labeled data (full size) | 0.971 | 0.912 | 0.878 | 0.802
LSTM trained on real labeled data (1/2 size) | 0.934 | 0.862 | 0.852 | 0.752
LSTM trained on real labeled data (1/4 size) | 0.892 | 0.821 | 0.841 | 0.744
LSTM trained on real labeled data (1/8 size) | 0.850 | 0.746 | 0.831 | 0.735

5.2 Sentiment Classification of Short Texts

To evaluate our proposed approach to sentiment classification of short texts, we have carried out experiments on the Twitter sentiment classification benchmark dataset from SemEval-2017 Task 4B (Rosenthal et al., 2017), which is to classify 6185 tweets as either positive or negative. Other than the training set of 20,508 tweets, we also collected unlabeled tweets using the Twitter API. All the tweets were pre-processed by replacing emoticons with their corresponding text representations and encoding URLs by tokens. In addition to the Twitter-domain seed words listed in Table 1, we have also made use of common positive/negative emoticons, which are ubiquitous on Twitter, as additional seeds for the task of sentiment lexicon induction. Note that in all our experiments, we do not use the sentiment labels and the topic information provided in the training data.

Making use of the provided training data and our own unlabeled data collected from Twitter, we constructed the domain-specific word embeddings, induced the sentiment lexicon, and bootstrapped the pseudo-labeled tweet data to train the binary tweet sentiment classifier. As the learning algorithm, we chose LSTM with a hidden layer of 150 units, which is enough for tweets as they are quite short (with an average length of only 20 words). The official performance measures for this short-text sentiment classification task (Rosenthal et al., 2017) include Accuracy (Acc) and F1. Although our approach is nearly-unsupervised (without any reliance on labeled documents), its performance on this benchmark dataset is comparable to that of supervised methods: it would be placed roughly in the middle of all the participating systems in this competition (see Table 5).
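The following is a minimal sketch of the tweet preprocessing described above: mapping emoticons to text tokens and encoding URLs by a placeholder token. The emoticon map and token names are illustrative assumptions, not the paper's actual mapping.

```python
# A minimal sketch of tweet preprocessing: emoticons -> text tokens,
# URLs -> a single URL token.
import re

EMOTICONS = {":)": " smile_positive ", ":(": " frown_negative ",
             ":D": " smile_positive "}   # illustrative, not exhaustive

def preprocess_tweet(text):
    for emo, token in EMOTICONS.items():
        text = text.replace(emo, token)
    text = re.sub(r"https?://\S+", " URL ", text)  # encode URLs by a token
    return text.lower()

print(preprocess_tweet("Loving this! :) https://t.co/abc"))
# e.g. -> "loving this!  smile_positive   url "
```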
Table 5: Sentiment classification of short texts into two categories (SemEval-2017 Task 4B).

System | Acc | F1
Unsupervised: baseline (all positive) | 0.398 | 0.285
Unsupervised: baseline (all negative) | 0.602 | 0.376
Unsupervised: ours (LSTM) | 0.804 | 0.795
Supervised: worst system | 0.412 | 0.372
Supervised: median system | 0.802 | 0.801
Supervised: best system | 0.897 | 0.890

5.3 Detecting Neutral Sentiment

Many real-world applications of sentiment classification (e.g., on social media) are not simply a binary classification task, but involve a neutral category as well. Although many lexicon-based sentiment classifiers, including pSenti, can detect neutral sentiment, extending the above learning-based sentiment classifier (trained on pseudo-labeled data) to recognize neutral sentiment is challenging. To investigate this issue, we have done experiments on the Twitter sentiment classification benchmark dataset from SemEval-2017 Task 4C (Rosenthal et al., 2017), which is to classify 12379 tweets into an ordinal five-point scale (−2, −1, 0, +1, +2) where 0 represents the neutral class.

One common way to handle neutral sentiment is to treat the set of neutral documents as a separate class for the classification algorithm, which is the method advocated by Koppel and Schler (2006). With the pseudo-labeled training examples of three classes (−1: negative, 0: neutral, and +1: positive), we tried both standard multi-class classification (Hsu and Lin, 2002) and ordinal classification (Frank and Hall, 2001). However, neither of them could deliver a reasonable performance. After carefully inspecting the classification results, we realised that it is very difficult to have a set of representative training examples with good coverage for the neutral class. This is because the neutral class is not homogeneous: a document could be neutral because it is equally positive and negative, or because it does not contain any sentiment. In practice, the latter case is seen more often than the former, and it implies that the neutral class is more often defined by the absence of sentiment word features rather than their presence, which would be problematic for most supervised learning algorithms.

What we discovered is that the simple method of identifying neutral documents from the binary sentiment classifier's decision boundary works surprisingly well, as long as the right thresholds are found. Specifically, we take the probabilistic outputs of a binary sentiment classifier trained as before, and then put all the documents whose probability of being positive lies not close to 0, not close to 1, but in the middle range into the neutral class. It turns out that probability calibration (Niculescu-Mizil and Caruana, 2005) is crucially important for this simple method to work. Some supervised learning algorithms for classification can give poor estimates of the class probabilities, and some do not even support probability prediction. For instance, maximum-margin learning algorithms such as SVM focus on hard samples that are close to the decision boundary (the support vectors), which makes their probability prediction biased. The technique of probability calibration allows us to better calibrate the probabilities of a given classifier, or to add support for probability prediction. If a classifier is well calibrated, its probabilistic output can be directly interpreted as a confidence level on the prediction. For example, among the documents to which such a calibrated binary classifier gives a probabilistic output close to 0.8, approximately 80% would actually belong to the positive class. Using the sigmoid model of Platt (2000) with cross-validation on the pseudo-labeled training data, we carry out probability calibration for our LSTM based binary sentiment classifier. Fig. 6 shows that the calibrated probability prediction aligns with the true confidence of prediction much better than the raw probability prediction. In this case, the Brier loss (Brier, 1950), which measures the mean squared difference between the predicted probability and the actual outcome, could be reduced from 0.182 to 0.153 by probability calibration.

If we rank the estimated probabilities of being positive from low to high, the curve of probabilities would be in an "S"-shape with a distinct middle range where the slope is steeper than at the two ends, as shown in Fig. 7. The documents with their probabilities of being positive in such a middle range should be neutral. Therefore, the two elbow points in the probability curve would make appropriate thresholds for the identification of neutral sentiment, and they could be found automatically by a simple algorithm using the central difference to approximate the second derivative. Let pL and pU denote the identified thresholds (pL < pU).
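The following is a minimal sketch of the elbow-point detection just described: sort the predicted probabilities of being positive, approximate the second derivative with central differences, and take the curvature extrema on each side of the "S"-curve as the thresholds. This is one illustrative reading of the paper's description, not its exact algorithm; `probabilities` stands for the calibrated outputs of the binary classifier.

```python
# A minimal sketch of finding the neutral-range thresholds (pL, pU) via
# central second differences on the sorted probability curve.
import numpy as np

def neutral_thresholds(probs):
    p = np.sort(np.asarray(probs))            # the "S"-shaped curve
    d2 = p[2:] - 2 * p[1:-1] + p[:-2]         # central second difference
    mid = len(d2) // 2
    lower = np.argmax(d2[:mid])               # strongest upward bend
    upper = mid + np.argmin(d2[mid:])         # strongest downward bend
    return p[lower + 1], p[upper + 1]         # (pL, pU)

p_L, p_U = neutral_thresholds(probabilities)
# Documents with p_L < P(positive) < p_U are assigned the neutral class.
```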
nt,andtheycouldbefoundautomaticallybyasimplealgo-rithmusingthecentraldifferencetoapproximatethesecondderivative.LetpLandpUdenotetheidenti-fiedthresholds(pL