A Statistical Parsing Framework for
Sentiment Classification
Li Dong∗, ∗∗
Beihang University
Furu Wei†, ‡
Microsoft Research
Shujie Liu†
Microsoft Research
Ming Zhou†
Microsoft Research
Ke Xu∗
Beihang University
We present a statistical parsing framework for sentence-level sentiment classification in this
article. Unlike previous works that use syntactic parsing results for sentiment analysis, we
develop a statistical parser to directly analyze the sentiment structure of a sentence. We show that
complicated phenomena in sentiment analysis (e.g., negation, intensification, and contrast) can
be handled the same way as simple and straightforward sentiment expressions in a unified and
probabilistic way. We formulate the sentiment grammar upon Context-Free Grammars (CFGs),
and provide a formal description of the sentiment parsing framework. We develop the parsing
model to obtain possible sentiment parse trees for a sentence, from which the polarity model
is proposed to derive the sentiment strength and polarity, and the ranking model is dedicated
to selecting the best sentiment tree. We train the parser directly from examples of sentences
annotated only with sentiment polarity labels but without any syntactic annotations or polarity
annotations of constituents within sentences. Therefore we can obtain training data easily. In
particular, we train a sentiment parser, s.parser, from a large number of review sentences with
users’ ratings as rough sentiment polarity labels. Extensive experiments on existing benchmark
data sets show significant improvements over baseline sentiment classification approaches.
∗ State Key Laboratory of Software Development Environment, Beihang University, XueYuan Road No.37,
HaiDian District, Beijing, P.R. China 100191. E-mail: donglixp@gmail.com; kexu@nlsde.buaa.edu.cn.
∗∗ Contribution during internship at Microsoft Research.
† Natural Language Computing Group, Microsoft Research Asia, Building 2, No. 5 Danling Street, Haidian
District, Beijing, P.R. China 100080. E-mail: {fuwei, shujliu, mingzhou}@microsoft.com.
‡ Corresponding author.
Submission received: 10 December 2013; revised version received: 26 July 2014; accepted for publication:
28 January 2015.
doi:10.1162/COLI_a_00221
© 2015 Association for Computational Linguistics
Computational Linguistics
Volume 41, Number 2
1. Introduction
Sentiment analysis (Pang and Lee 2008; Liu 2012) has received much attention from
both research and industry communities in recent years. Sentiment classification, which
identifies sentiment polarity (positive or negative) from text (sentence or document),
has been the most extensively studied task in sentiment analysis. Until now, there
have been two mainstream approaches for sentiment classification. The lexicon-based
approach (Turney 2002; Taboada et al. 2011) aims to aggregate the sentiment polarity
of a sentence from the polarity of words or phrases found in the sentence, and the
learning-based approach (Pang, Lee, and Vaithyanathan 2002) treats sentiment polarity
identification as a special text classification task and focuses on building classifiers
from a set of sentences (or documents) annotated with their corresponding sentiment
polarity.
The lexicon-based sentiment classification approach is simple and interpretable,
but scales poorly and is inevitably limited by sentiment lexicons that are
commonly created manually by experts. It has been widely recognized that sentiment
expressions are colloquial and evolve over time very frequently. Taking tweets from
Twitter1 and movie reviews on IMDb2 as examples, people use very casual language
as well as informal and new vocabulary to comment on general topics and movies. In
practice, it is not feasible to create and maintain sentiment lexicons to capture sentiment
expressions with high coverage. On the other hand, the learning-based approach relies
on large annotated samples to overcome the vocabulary coverage and deals with varia-
tions of words in sentences. Human ratings in reviews (Maas et al. 2011) and emoticons
in tweets (Davidov, Tsur, and Rappoport 2010; Zhao et al. 2012) are extensively used
to collect a large number of training corpora to train the sentiment classifier. However,
it is usually not easy to design effective features to build the classifier. Among
others, unigrams have been reported as the most effective features (Pang, Lee, and
Vaithyanathan 2002) in sentiment classification.
Handling complicated expressions delivering people’s opinions is one of the most
challenging problems in sentiment analysis. Compositionalities such as negation, inten-
sification, contrast, and their combinations are typical cases. We show some concrete
examples here:
(1) The movie is not good. [negation]
(2) The movie is very good. [intensification]
(3) The movie is not funny at all. [negation + intensification]
(4) The movie is just so so, but i still like it. [contrast]
(5) The movie is not very good, but i still like it. [negation + intensification +
contrast]
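To make the difficulty concrete, the following minimal Python sketch scores Examples (1) to (3) by summing word polarities from a tiny lexicon. The lexicon entries, their scores, and the function name are hypothetical and chosen only for illustration. This is a deliberately naive baseline, not a reimplementation of any cited lexicon-based system, which would typically add hand-written rules for negation and intensification.

```python
# Toy lexicon: word -> polarity score (hypothetical values for illustration).
TOY_LEXICON = {"good": 1.0, "funny": 1.0, "like": 1.0}


def naive_lexicon_polarity(sentence):
    """Sum word-level polarity scores, ignoring word order and context."""
    tokens = sentence.lower().rstrip(".").split()
    score = sum(TOY_LEXICON.get(tok, 0.0) for tok in tokens)
    return "positive" if score > 0 else "negative"


for text in ["The movie is not good.",
             "The movie is very good.",
             "The movie is not funny at all."]:
    print(text, "->", naive_lexicon_polarity(text))
```

All three sentences receive the same label and the same score, because "not", "very", and "at all" carry no lexicon entry. Examples (1) and (3) are therefore misclassified, and the stronger sentiment of Example (2) is not reflected.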
The negation expressions, intensification modifiers, and the contrastive conjunction
can change the polarity (Examples (1), (3), (4), (5)), strength (Examples (2), (3), (5)), or
both (Examples (3), (5)) of the sentiment of the sentences. We do not need any detailed
explanations here as they can be commonly found and easily understood in people’s
1 http://twitter.com.
2 http://www.imdb.com.
daily lives. Existing works to address these issues usually rely on syntactic parsing
results either used as features (Choi and Cardie 2008; Moilanen, Pulman, and Zhang
2010) in learning-based methods or hand-crafted rules (Moilanen and Pulman 2007; Jia,
Yu, and Meng 2009; Klenner, Petrakis, and Fahrni 2009; Liu and Seneff 2009) in
lexicon-based methods. However, even setting aside the difficulty and feasibility of deriving the
sentiment structure from syntactic parsing results, it is an even more challenging
task to generate stable and reliable parsing results for text that is ungrammatical in
nature and has a high ratio of out-of-vocabulary words. The accuracy of the linguistic
parsers trained on standard data sets (e.g., the Penn Treebank [Marcus, Marcinkiewicz,
and Santorini 1993]) drops dramatically on user-generated-content (reviews, tweets,
etc.), which is actually the prime focus of sentiment analysis algorithms. The error,
unfortunately, will propagate downstream in the process of sentiment analysis methods
building upon parsing results.
We therefore propose directly analyzing the sentiment structure of a sentence.
The nested structure of sentiment expressions can be naturally modeled in a similar
fashion as statistical syntactic parsing, which aims to find the linguistic structure of a
sentence. This idea creates many opportunities for developing sentiment classifiers from
a new perspective. The most challenging problem and barrier in building a statistical
sentiment parser lies in the acquisition of training data. Ideally, we need examples of
sentences annotated with polarity for the whole sentence as well as sentiment tags
for constituents within a sentence, as with the Penn Treebank for training traditional
linguistic parsers. However, this is not practical as the annotations will be inevitably
time-consuming and require laborious human efforts. Therefore, it is better to learn the
sentiment parser only utilizing examples annotated with the polarity label of the whole
sentence. For example, we can collect a huge number of publicly available reviews and
rating scores on the Web. People may use the movie is gud (“gud” is a popular informal
expression of “good”) to express a positive opinion towards a movie, and not a fan to
express a negative opinion. Also, we can find review sentences such as The movie is
gud, but I am still not a fan to indicate a negative opinion. We can then use these two
fragments and the overall negative opinion of the sentence to deduce sentiment rules
automatically from data. These sentiment fragments and rules can be used to analyze
the sentiment structure for new sentences.
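The idea of deducing a sentiment rule from two known fragments and a sentence-level label can be sketched as follows. This is an illustrative simplification under stated assumptions, not the learning procedure of the article (which is described in Section 4.2): the function name `abstract_rule`, the fragment table, and the greedy longest-match strategy are all hypothetical.

```python
def abstract_rule(tokens, label, fragments):
    """Replace known sentiment fragments by their polarity symbols.

    tokens    -- list of lower-cased tokens of a review sentence
    label     -- sentence-level polarity, "P" or "N"
    fragments -- dict mapping a tuple of tokens to "P" or "N" (assumed known)
    Returns a candidate rule as (left-hand side, right-hand side).
    """
    # Try longer fragments first so that the longest match wins.
    ordered = sorted(fragments, key=len, reverse=True)
    rhs = []
    i = 0
    while i < len(tokens):
        for frag in ordered:
            if tuple(tokens[i:i + len(frag)]) == frag:
                rhs.append(fragments[frag])   # abstract fragment to P / N
                i += len(frag)
                break
        else:
            rhs.append(tokens[i])             # keep the word as a terminal
            i += 1
    return label, tuple(rhs)


known = {("the", "movie", "is", "gud"): "P", ("not", "a", "fan"): "N"}
sentence = "the movie is gud , but i am still not a fan".split()
lhs, rhs = abstract_rule(sentence, "N", known)
print(lhs, "->", " ".join(rhs))
```

For the example sentence, the sketch yields the candidate rule N → P , but i am still N, which records that a positive fragment followed by a contrasted negative fragment gives a negative sentence.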
In this article, we propose a statistical parsing framework to directly analyze the
structure of a sentence from the perspective of sentiment analysis. Specifically, we
formulate a Context-Free Grammar (CFG)–based sentiment grammar. We then develop
a statistical parser to derive the sentiment structure of a sentence. We leverage the CYK
algorithm (Cocke 1969; Younger 1967; Kasami 1965) to conduct bottom–up parsing, and
use dynamic programming to accelerate computation. Meanwhile, we propose using
the polarity model to derive sentiment strength and polarity of a sentiment parse tree,
and the ranking model to select the best one from the sentiment parsing results. We train
the parser directly from examples of sentences annotated with sentiment polarity labels
instead of syntactic annotations and polarity annotations of constituents within
sentences. Therefore we can obtain training data easily. In particular, we train a sentiment
parser, named s.parser, from a large number of review sentences with users’ ratings
as rough sentiment polarity labels. The statistical parsing–based approach builds a
principled and scalable framework to support the sentiment composition and inference
which cannot be well handled by bag-of-words approaches. We show that complicated
phenomena in sentiment analysis (e.g., negation, intensification, and contrast) can be
handled the same way as simple and straightforward sentiment expressions in a unified
and probabilistic way.
The major contributions of the work presented in this article are as follows.
• We propose a statistical parsing framework for sentiment analysis that is
capable of analyzing the sentiment structure of a sentence. This
framework can naturally handle compositionality in a probabilistic way. It
can be trained from sentences annotated with only sentiment polarity but
without any syntactic annotations or polarity annotations of constituents
within sentences.
• We present the parsing model, polarity model, and ranking model in the
proposed framework, which are formulated and can be improved
independently. It provides a principled and flexible approach to sentiment
classification.
• We implement the statistical sentiment parsing framework, and conduct
experiments on several benchmark data sets. The experimental results
show that the proposed framework and algorithm can significantly
outperform baseline methods.
The remainder of this article is organized as follows. We introduce related work in
Section 2. We present the statistical sentiment parsing framework, including the parsing
model, polarity model, and ranking model, in Section 3. Learning methods for our
model are explained in Section 4. Experimental results are reported in Section 5. We
conclude this article with future work in Section 6.
2. Related Work
In this section, we give a brief introduction to related work on sentiment
classification (Section 2.1) and parsing (Section 2.2). We tackle the sentiment classification
problem in a parsing manner, which is a significant departure from most previous
research.
2.1 Sentiment Classification
Sentiment classification has been extensively studied in the past few years. In terms
of text granularity, existing works can be divided into phrase-level, sentence-level, or
document-level sentiment classification. We focus on sentence-level sentiment classifi-
cation in this article. Regardless of what granularity the task is performed on, existing
approaches deriving sentiment polarity from text fall into two major categories, namely,
lexicon-based and learning-based approaches.
The lexicon-based sentiment analysis uses dictionary matching on a predefined sen-
timent lexicon to derive sentiment polarity. These methods often use a set of manually
defined rules to deal with the negation of polarity. Turney (2002) proposed using the
average sentiment orientation of phrases, which contain adjectives or adverbs, in a
review to predict its sentiment orientation. Yu and Hatzivassiloglou (2003) calculated
a modified log-likelihood ratio for every word by the co-occurrences with positive and
negative seed words. To determine the polarity of a sentence, they compare the average
log-likelihood value with a threshold. Taboada et al. (2011) presented a lexicon-based
approach for extracting sentiment from text. They used dictionaries of words with anno-
tated sentiment orientation (polarity and strength) while incorporating intensification
and negation. The lexicon-based methods often achieve high precision and do not need
any labeled samples. But they suffer from coverage and domain adaptation problems.
Moreover, lexicons are often built and used without considering the context (Wilson,
Wiebe, and Hoffmann 2009). Also, hand-crafted rules are often matched heuristically.
The sentiment dictionaries used for lexicon-based sentiment analysis can be cre-
ated manually, or automatically using seed words to expand the list of words. Kamps
et al. (2004) and Williams and Anand (2009) used various lexical relations (such as
synonym and antonym relations) in WordNet to expand a set of seed words. Some other
methods learn lexicons from data directly. Hatzivassiloglou and McKeown (1997) used
a log-linear regression model with conjunction constraints to predict whether conjoined
adjectives have similar or different polarities. Combining conjunction constraints across
many adjectives, a clustering algorithm separated the adjectives into groups of different
polarity. Finally, adjectives were labeled as positive or negative. Velikovich et al. (2010)
constructed a term similarity graph using the cosine similarity of context vectors. They
performed graph propagation from seeds on the graph, obtaining polarity words and
phrases. Takamura, Inui, and Okumura (2005) regarded the polarity of words as spins of
electrons, using the mean field approximation to compute the approximate probability
function of the system instead of the intractable actual probability function. Kanayama
and Nasukawa (2006) used tendencies for similar polarities to appear successively in
contexts. They defined density and precision of coherency to filter neutral phrases and
uncertain candidates. Choi and Cardie (2009a) and Lu et al. (2011) transformed the
lexicon learning to an optimization problem, and used integer linear programming to
solve it. Kaji and Kitsuregawa (2007) defined the χ2-based polarity value and PMI-based
polarity value as a polarity strength to filter neutral phrases. de Marneffe, Manning,
and Potts (2010) utilized review data to define polarity strength as the expected rating
value. Mudinas, Zhang, and Levene (2012) used word count as a feature template and
trained a classifier using Support Vector Machines with linear kernel. They then re-
garded the weights as polarity strengths. Krestel and Siersdorfer (2013) generated topic-
dependent lexicons from review articles by incorporating topic and rating probabilities
and defined the polarity strength based on the results. In this article, the lexical relations
defined in WordNet are not used because of its limited coverage. Moreover, most of these
methods define different criteria to propagate polarity information of seeds, or use
optimization algorithms and sentence-level sentiment labels to learn polarity strength
values. Their goal is to balance the precision and recall of learned lexicons. We also
learn the polarity strength values of phrases from data. However, our primary objective
is to obtain correct sentence-level polarity labels, and use them to form the sentiment
grammar.
Learning-based sentiment analysis uses machine learning methods to classify
sentences or documents into two (negative and positive) or three (negative, positive, and
neutral) classes. Previous research has shown that sentiment classification is more dif-
ficult than traditional topic-based text classification, despite the fact that the number
of classes in sentiment classification is smaller than that in topic-based text
classification (Pang and Lee 2008). Pang, Lee, and Vaithyanathan (2002) investigated three
machine learning methods to produce automated classifiers to generate class labels for
movie reviews. They tested Naïve Bayes, Maximum Entropy, and Support
Vector Machine (SVM), and evaluated the contribution of different features includ-
ing unigrams, bigrams, adjectives, and part-of-speech tags. Their experimental results
suggested that an SVM classifier with unigram presence features outperforms other
competitors. Pang and Lee (2004) separated subjective portions from the objective
by finding minimum cuts in graphs to achieve better sentiment classification perfor-
mance. Matsumoto, Takamura, and Okumura (2005) used text mining techniques to
extract frequent subsequences and dependency subtrees, and used them as features
of SVM. McDonald et al. (2007) investigated a global structured model for jointly
classifying polarity at different levels of granularity. This model allowed classification
decisions from one level in the text to influence decisions at another. Yessenalina,
Yue, and Cardie (2010) used sentence-level latent variables to improve document-level
prediction. Täckström and McDonald (2011a) presented a latent variable model for
only using document-level annotations to learn sentence-level sentiment labels, and
Täckström and McDonald (2011b) improved it by using a semi-supervised latent
variable model to utilize manually crafted sentence labels. Agarwal et al. (2011) and Tu et al.
(2012) explored part-of-speech tag features and tree-kernel. Wang and Manning (2012)
used SVM built over Naïve Bayes log-count ratios as feature values to classify polarity.
They showed that SVM was better at full-length reviews, and Multinomial Naïve Bayes
was better at short-length reviews. Liu, Agam, and Grossman (2012) proposed a set
of heuristic rules based on dependency structure to detect negations and
sentiment-bearing expressions. Most of these methods are built on bag-of-words features, and
sentiment compositions are handled by manually crafted rules. In contrast to these
models, we derive polarity labels from tree structures parsed by the sentiment grammar.
There have been several attempts to assume that the problem of sentiment analy-
sis is compositional. Sentiment classification can be solved by deriving the sentiment
of a complex constituent (sentence) from the sentiment of small units (words and
phrases) (Moilanen and Pulman 2007; Klenner, Petrakis, and Fahrni 2009; Choi and
Cardie 2010; Nakagawa, Inui, and Kurohashi 2010). Moilanen and Pulman (2007)
proposed using delicately written linguistic patterns as heuristic decision rules when
computing the sentiment from individual words to phrases and finally to the sentence. The
manually compiled rules were powerful enough to discriminate between the different
sentiments in effective remedies (positive) / effective torture (negative), and in too colorful
(negative) and too sad (negative). Nakagawa, Inui, and Kurohashi (2010) leveraged a
conditional random field model to calculate the sentiment of all the parsed elements
in the dependency tree and then generated the overall sentiment. It had an advantage
over the rule-based approach (Moilanen and Pulman 2007) in that it did not explicitly
denote any sentiment designation to words or phrases in parse trees. Instead, it modeled
their sentiment polarity as latent variables with a certain probability of being positive
or negative. Councill, McDonald, and Velikovich (2010) used a conditional random field
model informed by a dependency parser to detect the scope of negation for sentiment
analysis. Some other methods model sentiment compositionality in the vector space.
They regard the composition operator as a matrix, and use matrix-vector multiplica-
tion to obtain the transformed vector representation. Socher et al. (2012) proposed a
recursive neural network model that learned compositional vector representations for
phrases and sentences. Their model assigned a vector and a matrix to every node in a
parse tree. The vector captured the inherent meaning of the constituent, and the matrix
captured how it changes the meaning of neighboring words or phrases. Socher et al.
(2013) recently introduced a sentiment treebank based on the results of the Stanford
parser (Klein and Manning 2003). The sentiment treebank included polarity labels of
phrases that are annotated using Amazon Mechanical Turk. The authors trained
recursive neural tensor networks on the sentiment treebank. For a new sentence, the model
predicted polarity labels based on the syntactic parse tree, and used tensors to handle
compositionality in the vector space. Dong et al. (2014) proposed utilizing multiple com-
position functions in recursive neural models and learning to select them adaptively.
Most previous methods are either rigid in terms of handcrafted rules, or sensitive to
the performance of existing syntactic parsers they use. This article addresses sentiment
compositions by defining sentiment grammar and borrowing some techniques in the
parsing research field. Moreover, our method uses symbolic representations instead of
vector spaces.
2.2 Syntactic Parsing and Semantic Parsing
The work presented in this article is close to traditional statistical parsing, as we borrow
some algorithms to build the sentiment parser. Syntactic parsers are learned from the
Treebank corpora, and find the most likely parse tree with the largest probability. In
this article, we borrow some well-known techniques from syntactic parsing methods
(Charniak 1997; Charniak and Johnson 2005; McDonald, Crammer, and Pereira 2005;
Kübler, McDonald, and Nivre 2009), such as the CYK algorithm and Context-Free
Grammar. These techniques are used to build the sentiment grammar and parsing
model. They provide a natural way of defining the structure of sentiment trees and
parsing sentences into trees. The key difference lies in that our task is to calculate the
polarity label of a sentence, instead of obtaining the parse tree. We only have
sentence-polarity pairs as our training instances instead of annotated tree structures. Moreover,
in the decoding process, our goal is to compute correct polarity labels by representing
sentences as latent sentiment trees. Recently, Hall, Durrett, and Klein (2014) developed a
discriminative constituency parser using rich surface features, adapting it to sentiment
analysis. Besides extracting unigrams and bigrams as features, they learned interactions
between tags and words located at the beginning or the end of spans. However, their
method relies on phrase-level polarity annotations.
Semantic parsing is another body of work related to this article. A semantic parser
is used to parse meaning representations for given sentences. Most existing semantic
parsing works (Zelle and Mooney 1996; Kate and Mooney 2006; Raymond and Mooney
2006; Zettlemoyer and Collins 2007, 2009; Li, Liu, and Sun 2013) relied on fine-grained
annotations of target logical forms, which required the supervision of experts and are
relatively expensive. To balance the performance and the amount of human annotation,
some works used only question-answer pairs or even binary correct/incorrect signals
as their input. Clarke et al. (2010) used a binary correct/incorrect signal of a database
query to map sentences to logical forms. It worked with the FunQL language and
transformed semantic parsing into an integer linear programming (ILP) problem. In each
iteration, it solved the ILP and updated the parameters of a structural SVM. Liang, Jordan, and
Klein (2013) learned a semantic parser from question-answer pairs, where the logical
form was modeled as a latent tree-based semantic representation. Krishnamurthy and
Mitchell (2012) presented a method for training a semantic parser using a knowledge
base and an unlabeled text corpus, without any individually annotated sentences.
Artzi and Zettlemoyer (2013) used various types of weak supervision to learn a
grounded Combinatory Categorial Grammar semantic parser, which took context into
consideration. Bao et al. (2014) presented a translation-based weakly supervised seman-
tic parsing method to translate questions to answers based on CYK parsing. A log-linear
model is defined to score derivations. All these weakly supervised semantic parsing
methods learned to transform a natural language sentence to its semantic representation
without annotated logical forms. In this work, we build a sentiment parser.
Specifically, we use a modified version of the CYK algorithm that parses sentences in a
bottom–up fashion. We use the log-linear model to score candidates generated by beam
search. Instead of using question-answer pairs, sentence-polarity pairs are used as our
weak supervision. We also use the parameter estimation algorithm proposed by Liang,
Jordan, and Klein (2013).
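As a rough illustration of how a log-linear model can score candidate derivations, the sketch below normalizes exponentiated feature scores over a candidate list, i.e., p(t | s) ∝ exp(θ · φ(s, t)). The feature names, weights, and the function `log_linear_probs` are hypothetical; the features and training procedure actually used by the ranking model are those presented in Section 3.3 and Section 4.1, not the ones shown here.

```python
import math


def log_linear_probs(feature_vectors, theta):
    """Return p(t | s) for each candidate t under a log-linear model.

    feature_vectors -- one sparse feature dict (name -> value) per candidate
    theta           -- weight dict (name -> weight); missing names weigh 0
    """
    scores = [sum(theta.get(name, 0.0) * value for name, value in fv.items())
              for fv in feature_vectors]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Two hypothetical candidate trees for the same sentence.
candidates = [{"rule:N->not P": 1.0, "rule:P->good": 1.0},
              {"rule:P->good": 1.0}]
theta = {"rule:N->not P": 1.5, "rule:P->good": 0.5}
probs = log_linear_probs(candidates, theta)
best = max(range(len(candidates)), key=probs.__getitem__)
print(probs, best)
```

In a beam search, such scores would be used to keep only the highest-scoring candidates for each span; the sketch shows the scoring step alone.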
3. Statistical Sentiment Parsing
We present the statistical parsing framework for sentence-level sentiment classification
in this section. The underlying idea is to model sentiment classification as a
statistical parsing process. Figure 1 shows the overview of the statistical sentiment parsing
framework. There are three major components. The input sentence s is transformed
into and represented by sentiment trees derived from the parsing model (Section 3.2),
using the sentiment grammar defined in Section 3.1. Trees are scored by the ranking
model in Section 3.3. The sentiment tree with the highest ranking score is treated as the
best derivation for s. In addition, the polarity model (Section 3.4) is used to compute
polarity values for the sentiment trees.
Notably, the sentiment trees t are unobserved during training. We can only observe
the sentence s and its polarity label y in training data. In other words, we train the model
directly from the examples of sentences annotated only with sentiment polarity labels
but without any syntactic annotations or polarity annotations of the constituents within
sentences. To be specific, we first learn the sentiment grammar and the polarity model
from data as described in Section 4.2. Then, given the sentence and polarity label pairs
(s, y), we search the latent sentiment trees t and estimate the parameters of the ranking
model as detailed in Section 4.1.
To better illustrate the whole process, we describe the sentiment parsing procedure
using an example sentence, The movie is not very good, but i still like it. The sentiment
polarity label of the above sentence is “positive.” There are negation, intensification, and
contrast in this example, which are difficult to capture using bag-of-words classification
methods. This sentence is a complex case that demonstrates the capability of the
proposed statistical sentiment parsing framework, which motivates the work in this article.
The statistical sentiment parsing algorithm may generate a number of sentiment trees
for the input sentence. Figure 2 shows the best sentiment parse tree. It shows that the
statistical sentiment parsing framework can deal with the compositionality of sentiment
in a natural way. In Table 1, we list the sentiment rules used during the parsing process.
We show the generation process of the sentiment parse tree from the bottom up and
the calculation of sentiment strength and polarity for every text span in the parsing
process.
Figure 1
The parsing model and ranking model are used to transform the input sentence s to the
sentiment tree t with the highest ranking score. In addition, the polarity model defines
how to compute polarity values for the rules of the sentiment grammar. The sentiment
tree t is evaluated with respect to the polarity model to produce the polarity label y.
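[Figure 1 diagram: the sentence s passes through the Parsing Model and Ranking Model to yield a sentiment tree t, which the Polarity Model evaluates to produce the polarity label y (+/-). Illustrated with the sentence “the movie is not very good”, example rules P → good and N → not P, and node scores +: 0.87, +: 0.93, -: 0.63.]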
Figure 2
Sentiment structure for the sentence The movie is not very good, but i still like it. The rules used in
the derivation process include {P → the movie is; P → good; P → i still like it; P → very P;
N → not P; N → PN; N → NE; E → ,; P → N but P; S → P}.
In the following sections, we first provide a formal description of the sentiment
grammar in Section 3.1. We then present the details of the parsing model in Section 3.2,
the ranking model in Section 3.3, and the polarity model in Section 3.4.
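The bottom-up derivation of the example sentence can be sketched with a small chart recognizer that uses only the ten rules listed for Figure 2. This is a minimal sketch under stated assumptions, not the s.parser implementation: it only records which symbols can cover which span, it omits the ranking and polarity models and the beam search, and the tokenization (the comma as a separate token, 0-based half-open spans) and the names `RULES` and `parse_spans` are hypothetical.

```python
NONTERMINALS = {"N", "P", "S", "E"}

# The rules listed for Figure 2, written as (left-hand side, right-hand side).
RULES = [
    ("P", ("the", "movie", "is")),
    ("P", ("good",)),
    ("P", ("i", "still", "like", "it")),
    ("P", ("very", "P")),
    ("N", ("not", "P")),
    ("N", ("P", "N")),
    ("N", ("N", "E")),
    ("E", (",",)),
    ("P", ("N", "but", "P")),
    ("S", ("P",)),
]


def parse_spans(tokens):
    """Return a chart mapping each span (i, j) to the symbols deriving tokens[i:j]."""
    n = len(tokens)
    chart = {}

    def matches(rhs, i, j):
        """True if the right-hand side rhs can be aligned exactly with tokens[i:j]."""
        if not rhs:
            return i == j
        head, rest = rhs[0], rhs[1:]
        if head in NONTERMINALS:
            # Try every end point k such that head already covers [i, k).
            return any(head in chart.get((i, k), ()) and matches(rest, k, j)
                       for k in range(i + 1, j + 1))
        return i < j and tokens[i] == head and matches(rest, i + 1, j)

    # Fill the chart from short spans to long spans, as in CYK.
    for width in range(1, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = chart.setdefault((i, j), set())
            changed = True
            while changed:  # repeat so unary rules such as S -> P can fire
                changed = False
                for lhs, rhs in RULES:
                    if lhs not in cell and matches(rhs, i, j):
                        cell.add(lhs)
                        changed = True
    return chart


tokens = "the movie is not very good , but i still like it".split()
chart = parse_spans(tokens)
print(sorted(chart[(0, len(tokens))]))
```

With this toy grammar the whole sentence is covered by P and S but not by N, while the span "not very good" is covered by N only, mirroring the derivation steps listed in Table 1. Because the rules mix terminals and non-terminals and are not in Chomsky normal form, the sketch matches right-hand sides recursively instead of using the textbook binary CYK split.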
3.1 Sentiment Grammar
We develop the sentiment grammar upon CFG (Context-Free Grammar) (Chomsky
1956). Let G = ⟨V, Σ, S, R⟩ denote a CFG, where V is a finite set of non-terminals,
Σ is a finite set of terminals (disjoint from V), S ∈ V is the start symbol, and R is
a set of rewrite rules (or production rules) of the form A → c where A ∈ V and c ∈
(V ∪ Σ)∗. We use Gs = ⟨Vs, Σs, S, Rs⟩ to denote the sentiment grammar in this article.
Table 1
Parsing process for the sentence The movie is not very good, but i still like it. [i, Y, j] indicates that the
text spanning from i to j is derived to symbol Y. N and P are non-terminals in the sentiment
grammar; in the Polarity column, N and P denote negative and positive sentiment polarity.

Span                                                         Rule                 Strength  Polarity
[0, P, 3]: the movie is                                      P → the movie is     0.52      P
[5, P, 6]: good                                              P → good             0.87      P
[6, E, 7]: ,                                                 E → ,                –         –
[8, P, 11]: i still like it                                  P → i still like it  0.85      P
[4, P, 6]: very good                                         P → very P           0.93      P
[3, N, 6]: not very good                                     N → not P            0.63      N
[0, N, 6]: the movie is not very good                        N → PN               0.60      N
[0, N, 7]: the movie is not very good,                       N → NE               0.60      N
[0, P, 11]: the movie is not very good, but i still like it  P → N but P          0.76      P
[0, S, 11]: the movie is not very good, but i still like it  S → P                0.76      P
The non-terminal set is denoted as $V_s = \{N, P, S, E\}$, where S is the start symbol, the non-terminal N represents the negative polarity, and the non-terminal P represents the positive polarity. The rules in $R_s$ are divided into the following six categories:

• Dictionary rules: $X \to w_0^k$, where $X \in \{N, P\}$, $w_0^k = w_0 \ldots w_{k-1}$, and $w_0^k \in \Sigma_s^+$. These rules can be regarded as the sentiment dictionary used in traditional approaches. They are basic sentiment units assigned with polarity probabilities. For example, P → good is a dictionary rule.

• Combination rules: $X \to c$, where $c \in (V_s \cup \Sigma_s)^+$, and two successive non-terminals are not allowed. There is at least one terminal in c. These rules combine terminals and non-terminals, such as N → not P, and P → N but P. They are used to handle negation, intensification, and contrast in sentiment analysis. The number of non-terminals in a combination rule is restricted to one or two.

• Glue rules: $X \to X_1 X_2$, where $X, X_1, X_2 \in \{N, P\}$. These rules combine two text spans that are derived into $X_1$ and $X_2$, respectively.

• OOV rules: $E \to w_0^k$, where $w_0^k \in \Sigma^+$. We use these rules to handle Out-Of-Vocabulary (OOV) text spans whose polarity probabilities are not learned from data.

• Auxiliary rules: $X \to E X_1$, $X \to X_1 E$, where $X, X_1 \in \{N, P\}$. These rules combine a text span with polarity and an OOV text span.

• Start rules: $S \to Y$, where $Y \in \{N, P, E\}$. The derivations begin with S, and S can be derived to N, P, and E.
Here, X represents the non-terminals N or P. The dictionary rules and combination rules are automatically extracted from the data; we describe the details in Section 4.2. By applying these rules, we can derive the polarity label of a sentence bottom-up. The glue rules combine the polarity information of two text spans and treat the combined parts as independent. To tackle the OOV problem, we treat a text span that consists of OOV words as an empty text span and derive it to E. The OOV text spans are combined with other text spans without considering their sentiment information. Finally, each sentence is derived to the symbol S using the start rules, which begin every derivation. We can use the sentiment grammar to compactly describe the derivation process of a sentence.
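The six rule categories can be made concrete with a small classifier. The following is a minimal sketch, not the authors' implementation: the `Rule` class and the `category` function are hypothetical names, and it assumes that terminals are written in lowercase so that they never collide with the non-terminal symbols N, P, E, and S.

```python
from dataclasses import dataclass
from typing import Tuple

POLAR = {"N", "P"}                 # polarity non-terminals
NONTERMINALS = POLAR | {"E", "S"}  # full non-terminal set V_s


@dataclass(frozen=True)
class Rule:
    lhs: str
    rhs: Tuple[str, ...]  # any symbol not in NONTERMINALS is a terminal


def category(rule: Rule) -> str:
    """Name the Section 3.1 category a rule falls into, or 'invalid'."""
    rhs = rule.rhs
    if not rhs:
        return "invalid"
    nts = [s for s in rhs if s in NONTERMINALS]
    terminals = [s for s in rhs if s not in NONTERMINALS]
    if rule.lhs == "S":
        ok = len(rhs) == 1 and rhs[0] in (POLAR | {"E"})
        return "start" if ok else "invalid"
    if rule.lhs == "E":
        return "oov" if not nts else "invalid"
    if rule.lhs not in POLAR:
        return "invalid"
    if not nts:
        return "dictionary"
    if not terminals:
        if len(rhs) == 2 and all(s in POLAR for s in rhs):
            return "glue"
        if len(rhs) == 2 and rhs.count("E") == 1 and any(s in POLAR for s in rhs):
            return "auxiliary"
        return "invalid"
    # Combination rule: terminals mixed with one or two polarity
    # non-terminals, and no two non-terminals side by side.
    adjacent = any(a in NONTERMINALS and b in NONTERMINALS
                   for a, b in zip(rhs, rhs[1:]))
    if len(nts) <= 2 and all(s in POLAR for s in nts) and not adjacent:
        return "combination"
    return "invalid"


# The rules of the derivation in Figure 2.
figure2_rules = [
    Rule("P", ("the", "movie", "is")), Rule("P", ("good",)),
    Rule("P", ("i", "still", "like", "it")), Rule("P", ("very", "P")),
    Rule("N", ("not", "P")), Rule("N", ("P", "N")), Rule("N", ("N", "E")),
    Rule("E", (",",)), Rule("P", ("N", "but", "P")), Rule("S", ("P",)),
]
categories = [category(r) for r in figure2_rules]
```

Applied to the rules listed in the caption of Figure 2, every rule falls into exactly one of the six categories.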
3.2 Parsing Model
We present the formal description of the statistical sentiment parsing model following deductive proof systems (Shieber, Schabes, and Pereira 1995; Goodman 1999) as used in traditional syntactic parsing. For a concrete example,

\[ \frac{(A \to BC) \quad [i, B, k] \quad [k, C, j]}{[i, A, j]} \tag{6} \]

which represents that if we have the rule $A \to BC$, $B \overset{*}{\Rightarrow} w_i^k$, and $C \overset{*}{\Rightarrow} w_k^j$ ($\overset{*}{\Rightarrow}$ is used to represent the reflexive and transitive closure of immediate derivation), then we can obtain $A \overset{*}{\Rightarrow} w_i^j$. By adding a unary rule

\[ \frac{(A \to w_i^j)}{[i, A, j]} \tag{7} \]

with the binary rule in Equation (6), we can express the standard CYK algorithm for CFG in Chomsky Normal Form (CNF). The goal is [0, S, n], in which S is the start symbol and n is the length of the input sentence. In the given CYK example, a term in the deductive rules can take one of the following two forms:

• [i, X, j] is an item representing a subtree rooted in X spanning from i to j, or
• (X → γ) is a rule in the grammar.

Generally, we represent the form of an inference rule as:

\[ \frac{(r) \quad H_1 \quad \ldots \quad H_K}{[i, X, j]} \tag{8} \]

where, if all the terms $r$ and $H_k$ are true, then we can infer [i, X, j] as true. Here, $r$ denotes a sentiment rule, and $H_k$ denotes an item. When we refer to both rules and items, we employ the word terms.
Theoretically, we can convert the sentiment rules to CNF versions, and then use
the CYK algorithm to conduct parsing. Because the maximum number of non-terminal
symbols in a rule is already restricted to two, we formulate the statistical sentiment
parsing based on a customized CYK algorithm that is similar to the work of Chiang
(2007). Let X, X1, X2 represent the non-terminals N or P; the inference rules for the
statistical sentiment parsing are summarized in Figure 3.
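Equations (6) and (7) translate almost directly into a chart-filling loop. The sketch below is a plain CYK recognizer over such items, written only to illustrate the deductive view; it is not the customized algorithm of Figure 3. The function name `cyk_items`, the toy grammar, and the helper non-terminal `NEG` are assumptions made for illustration.

```python
from collections import defaultdict


def cyk_items(words, unary, binary):
    """Derive all items [i, A, j] from Equations (6) and (7).

    unary:  maps a tuple of words to the set of A with a rule A -> w_i^j
    binary: maps a pair (B, C) to the set of A with a rule A -> B C
    Returns a chart mapping (i, j) to the set of derivable non-terminals.
    """
    n = len(words)
    chart = defaultdict(set)
    for length in range(1, n + 1):          # short spans before long ones
        for i in range(n - length + 1):
            j = i + length
            # Equation (7): (A -> w_i^j) gives [i, A, j]
            chart[(i, j)] |= unary.get(tuple(words[i:j]), set())
            # Equation (6): (A -> B C), [i, B, k], [k, C, j] give [i, A, j]
            for k in range(i + 1, j):
                for b in chart[(i, k)]:
                    for c in chart[(k, j)]:
                        chart[(i, j)] |= binary.get((b, c), set())
    return chart


# Toy grammar (hypothetical): NEG is a helper non-terminal for "not".
words = "the movie is not good".split()
unary = {("the", "movie", "is"): {"P"}, ("not",): {"NEG"}, ("good",): {"P"}}
binary = {("NEG", "P"): {"N"}, ("P", "N"): {"N"}}
chart = cyk_items(words, unary, binary)
derivable_for_sentence = chart[(0, len(words))]
```

The item [0, N, 5] is derivable here because [0, P, 3] and [3, N, 5] are both in the chart and the grammar contains N → P N.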
3.3 Ranking Model
The parsing model generates many candidate parse trees $T(s)$ for a sentence s. The goal of the ranking model is to score and rank these parse trees. The sentiment tree with the highest score is treated as the best representation for sentence s. We extract a feature vector $\phi(s, t) \in \mathbb{R}^d$ for the specific sentence-tree pair (s, t), where $t \in T(s)$ is the parse tree. Let $\psi \in \mathbb{R}^d$ be the parameter vector for the features. We use the log-linear model to calculate a probability $p(t \mid s; T, \psi)$ for each parse tree $t \in T(s)$. The probabilities indicate how likely the trees are to produce correct predictions. Given the sentence s and parameters $\psi$, the log-linear model defines a conditional probability:

\[ p(t \mid s; T, \psi) = \exp\{\phi(s, t)^T \psi - A(\psi; s, T)\} \tag{9} \]

\[ A(\psi; s, T) = \log \sum_{t \in T(s)} \exp\{\phi(s, t)^T \psi\} \tag{10} \]
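\[ \frac{(X \to w_i^j)}{[i, X, j]} \]

\[ \frac{(X \to w_i^{i_1} X_1 w_{j_1}^{j}) \quad [i_1, X_1, j_1]}{[i, X, j]} \]

\[ \frac{(X \to w_i^{i_1} X_1 w_{j_1}^{i_2} X_2 w_{j_2}^{j}) \quad [i_1, X_1, j_1] \quad [i_2, X_2, j_2]}{[i, X, j]} \]

\[ \frac{(X \to X_1 X_2) \quad [i, X_1, k] \quad [k, X_2, j]}{[i, X, j]} \]

\[ \frac{(E \to w_i^j)}{[i, E, j]} \]

\[ \frac{(X \to E X_1) \quad [i, E, k] \quad [k, X_1, j]}{[i, X, j]} \]

\[ \frac{(X \to X_1 E) \quad [i, X_1, k] \quad [k, E, j]}{[i, X, j]} \]

where X, X1, X2 represent N or P.
Figure 3
Inference rules for the basic parsing model.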
where $A(\psi; s, T)$ is the log-partition function with respect to $T(s)$. The log-linear model is a discriminative model, and it is widely used in natural language processing. We can use $\phi(s, t)^T \psi$ as the score of the parse tree without normalization in the decoding process, because $p(t \mid s; T, \psi) \propto \exp\{\phi(s, t)^T \psi\}$ increases monotonically with $\phi(s, t)^T \psi$, so dropping the normalization does not change the ranking order.
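As an illustration of Equations (9) and (10), the sketch below scores a handful of candidate trees and normalizes the scores with a numerically stable log-partition function. The feature vectors and weights are made-up toy values, and the function names are assumptions, not part of the original system.

```python
import math


def log_partition(scores):
    """A(psi; s, T) = log sum_t exp(score_t), shifted by the max for stability."""
    m = max(scores)
    return m + math.log(sum(math.exp(x - m) for x in scores))


def tree_probabilities(features, psi):
    """Return the raw scores phi(s, t)^T psi and the probabilities of Equation (9)."""
    scores = [sum(f * w for f, w in zip(phi, psi)) for phi in features]
    a = log_partition(scores)
    return scores, [math.exp(x - a) for x in scores]


# Toy candidate set: one feature vector per candidate tree (hypothetical values).
features = [[3.0, 1.0], [2.0, 0.0], [1.0, 2.0]]
psi = [0.5, -0.2]
scores, probs = tree_probabilities(features, psi)

# Ranking by unnormalized score and by probability picks the same tree.
best_by_score = max(range(len(scores)), key=scores.__getitem__)
best_by_prob = max(range(len(probs)), key=probs.__getitem__)
```

Because the exponential is monotone, the decoder never needs the partition function: the unnormalized score is enough to pick the best tree.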
3.4 Polarity Model
The goal of the polarity model is to model the calculation of sentiment strength and
polarity of a text span from its subspans in the parsing process. It is specified in terms of
the rules used in the parsing process. We expand the notations in the inference rule (8)
to incorporate the polarity model. The new form of inference rule is:

\[ \frac{(r) \quad H_1\Phi_1 \quad \ldots \quad H_K\Phi_K}{[i, X, j]\Phi} \tag{11} \]
in which $r, H_1, \ldots, H_K$ are the terms described in Section 3.2. Every item $H_k$ is assigned polarity strength

\[ \Phi_k : \begin{cases} P(\mathcal{N} \mid w_{i_k}^{j_k}) \\ P(\mathcal{P} \mid w_{i_k}^{j_k}) \end{cases} \]

for text span $w_{i_k}^{j_k}$. For the item [i, X, j], the polarity model $\Phi(r, \Phi_1, \ldots, \Phi_K)$ is defined as a function that takes the rule $r$ and polarity strength of subspans as input.

The polarity strength obtained by the polarity model should satisfy two constraints. First, the values calculated by the polarity model are non-negative, that is, $P(\mathcal{X} \mid w_i^j) \geq 0$ and $P(\bar{\mathcal{X}} \mid w_i^j) \geq 0$. Second, the positive and negative polarity values are normalized to 1, namely, $P(\mathcal{X} \mid w_i^j) + P(\bar{\mathcal{X}} \mid w_i^j) = 1$. Notably,

\[ \bar{\mathcal{X}} = \begin{cases} \mathcal{P}, & \mathcal{X} = \mathcal{N} \\ \mathcal{N}, & \mathcal{X} = \mathcal{P} \end{cases} \]

is the opposite polarity of $\mathcal{X}$.

The inference rules with the polarity model are formally defined in Figure 4. In the following part, we define the polarity model for the different types of rules. If the rule is a dictionary rule $X \to w_i^j$, its sentiment strength is obtained as:

\[ \Phi : \begin{cases} P(\mathcal{X} \mid w_i^j) = \tilde{P}(\mathcal{X} \mid w_i^j) \\ P(\bar{\mathcal{X}} \mid w_i^j) = \tilde{P}(\bar{\mathcal{X}} \mid w_i^j) \end{cases} \tag{12} \]

where $\mathcal{X} \in \{\mathcal{N}, \mathcal{P}\}$ denotes the sentiment polarity of the left hand side of the rule, $\bar{\mathcal{X}}$ is the opposite polarity of $\mathcal{X}$, and $\tilde{P}(\mathcal{X} \mid w_i^j)$, $\tilde{P}(\bar{\mathcal{X}} \mid w_i^j)$ indicate the sentiment polarity values estimated from training data.
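\[ \frac{(X \to w_i^j)}{[i, X, j] \;\; P(\mathcal{X} \mid w_i^j) = \tilde{P}(\mathcal{X} \mid w_i^j)} \]

\[ \frac{(X \to w_i^{i_1} X_1 w_{j_1}^{j}) \quad [i_1, X_1, j_1]\Phi_1}{[i, X, j] \;\; P(\mathcal{X} \mid w_i^j) = h(\theta_0 + \theta_1 P(\mathcal{X}_1 \mid w_{i_1}^{j_1}))} \]

\[ \frac{(X \to w_i^{i_1} X_1 w_{j_1}^{i_2} X_2 w_{j_2}^{j}) \quad [i_1, X_1, j_1]\Phi_1 \quad [i_2, X_2, j_2]\Phi_2}{[i, X, j] \;\; P(\mathcal{X} \mid w_i^j) = h(\theta_0 + \theta_1 P(\mathcal{X}_1 \mid w_{i_1}^{j_1}) + \theta_2 P(\mathcal{X}_2 \mid w_{i_2}^{j_2}))} \]

\[ \frac{(X \to X_1 X_2) \quad [i, X_1, k]\Phi_1 \quad [k, X_2, j]\Phi_2}{[i, X, j] \;\; P(\mathcal{X} \mid w_i^j) = \dfrac{P(\mathcal{X} \mid w_i^k) P(\mathcal{X} \mid w_k^j)}{P(\mathcal{X} \mid w_i^k) P(\mathcal{X} \mid w_k^j) + P(\bar{\mathcal{X}} \mid w_i^k) P(\bar{\mathcal{X}} \mid w_k^j)}} \]

\[ \frac{(E \to w_i^j)}{[i, E, j]\circ} \]

\[ \frac{(X \to E X_1) \quad [i, E, k]\circ \quad [k, X_1, j]\Phi_1}{[i, X, j] \;\; P(\mathcal{X} \mid w_i^j) = P(\mathcal{X} \mid w_k^j)} \]

\[ \frac{(X \to X_1 E) \quad [i, X_1, k]\Phi_1 \quad [k, E, j]\circ}{[i, X, j] \;\; P(\mathcal{X} \mid w_i^j) = P(\mathcal{X} \mid w_i^k)} \]

where $h(x) = \frac{1}{1 + \exp\{-x\}}$ is a logistic function, $\circ$ represents the absence, and X, X1, X2 represent N or P. As specified in the polarity model, we have $P(\bar{\mathcal{X}} \mid w_i^j) = 1 - P(\mathcal{X} \mid w_i^j)$.
Figure 4
Inference rules with the polarity model.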
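The glue rules $X \to X_1 X_2$ combine two spans $(w_i^k, w_k^j)$. The polarity value is calculated by their product, and normalized to 1.

\[ \Phi : \begin{cases} P(\mathcal{X} \mid w_i^j) = \dfrac{P(\mathcal{X} \mid w_i^k) P(\mathcal{X} \mid w_k^j)}{P(\mathcal{X} \mid w_i^k) P(\mathcal{X} \mid w_k^j) + P(\bar{\mathcal{X}} \mid w_i^k) P(\bar{\mathcal{X}} \mid w_k^j)} \\ P(\bar{\mathcal{X}} \mid w_i^j) = 1 - P(\mathcal{X} \mid w_i^j) \end{cases} \tag{13} \]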
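A short numerical sketch of Equation (13) follows. The function name `glue` is an assumption; the inputs are the strengths listed in Table 1 for the spans the movie is (positive, 0.52) and not very good (negative, 0.63), combined by the glue rule N → PN.

```python
def glue(p_left, p_right):
    """Equation (13): P(X | w_i^j) from P(X | w_i^k) and P(X | w_k^j).

    Both arguments are probabilities of the SAME polarity X on the two
    sub-spans; the opposite polarity is taken as one minus each value.
    """
    same = p_left * p_right
    opposite = (1.0 - p_left) * (1.0 - p_right)
    return same / (same + opposite)


# N -> P N: "the movie is" has P(positive) = 0.52, so P(negative) = 0.48;
# "not very good" has P(negative) = 0.63.
p_negative = glue(1.0 - 0.52, 0.63)

# A sub-span with probability 0.5 carries no information and leaves
# the other sub-span's value unchanged.
p_unchanged = glue(0.5, 0.8)
```

With these rounded inputs the result is about 0.61, in the neighborhood of the 0.60 listed in Table 1; since the table entries are rounded, exact agreement is not expected.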
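For OOV text spans, the polarity model does not calculate the polarity values. When they are combined with in-vocabulary phrases by the auxiliary rules, the polarity values are determined by the text span with polarity and the OOV text span is ignored. More specifically, for the rule $X \to X_1 E$, where the polarity-bearing span is $w_i^k$,

\[ \Phi : \begin{cases} P(\mathcal{X} \mid w_i^j) = P(\mathcal{X} \mid w_i^k) \\ P(\bar{\mathcal{X}} \mid w_i^j) = P(\bar{\mathcal{X}} \mid w_i^k) \end{cases} \tag{14} \]

and, as in Figure 4, the rule $X \to E X_1$ copies the values of $w_k^j$ in the same way.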
The combination rules are more complicated than other types of rules. In this article, we model the polarity probability calculation as logistic regression. Logistic regression can be regarded as putting a linear combination of the subspans' polarity probabilities into a logistic function (or sigmoid function). We will show that negation, intensification, and contrast can be well modeled by the regression-based method. It is formally shown as

\[ P(\mathcal{X} \mid w_i^j) = h\left(\theta_0 + \sum_{k=1}^{K} \theta_k P(\mathcal{X}_k \mid w_{i_k}^{j_k})\right) = \frac{1}{1 + \exp\left\{-\left(\theta_0 + \sum_{k=1}^{K} \theta_k P(\mathcal{X}_k \mid w_{i_k}^{j_k})\right)\right\}} \tag{15} \]

where $h(x) = \frac{1}{1 + \exp\{-x\}}$ is the logistic function, K is the number of non-terminals in a rule, and $\theta_0, \ldots, \theta_K$ are the parameters that are learned from data. As a concrete example, if the span $w_i^j$ can match N → not P and $P \overset{*}{\Rightarrow} w_{i+1}^j$, the inference rule with the polarity model is defined as

\[ \frac{(N \to \text{not } P) \quad [i + 1, P, j]\Phi_1}{[i, N, j] \;\; \begin{cases} P(\mathcal{N} \mid w_i^j) = h(\theta_0 + \theta_1 P(\mathcal{P} \mid w_{i+1}^j)) \\ P(\mathcal{P} \mid w_i^j) = 1 - P(\mathcal{N} \mid w_i^j) \end{cases}} \tag{16} \]
where the polarity probability is calculated by $P(\mathcal{N} \mid w_i^j) = h(\theta_0 + \theta_1 P(\mathcal{P} \mid w_{i+1}^j))$.
To tackle negation, switch negation (Choi and Cardie 2008; Saurí 2008) simply reverses the sentiment polarity and corresponding sentiment strength. However, consider not great and not good; flipping polarity directly makes not good more positive than not great, which is unreasonable. Another potential problem of switch negation is that negative polarity items interact with intensifiers in undesirable ways (Kennedy and Inkpen 2006). For example, not very good turns out to be even more negative than not good, given the fact that very good is more positive than good. Therefore, Taboada et al. (2011) argue that shift negation is a better way to handle polarity negation. Instead of reversing polarity strength, shift negation shifts it toward the opposite polarity by a fixed amount. This method can partially avoid the aforementioned two problems. However, they set the parameters manually, which might not be reliable and extensible enough to a new data set. Using the regression model, switch negation is captured by the negative scale item θk (k > 0), and shift negation is expressed by the shift item θ0.
Intensifiers are adjectives or adverbs that strengthen (amplifier) or decrease (downtoner) the semantic intensity of their neighboring item (Quirk 1985). For example, extremely good should obtain higher strength of positive polarity than good, because it is modified by the amplifier (extremely). Polanyi and Zaenen (2006) and Kennedy and Inkpen (2006) handle intensifiers by polarity addition and subtraction. This method, termed fixed intensification, increases polarity by a fixed amount for amplifiers and decreases it for downtoners. Taboada et al. (2011) propose a method, called percentage intensification, to associate each intensification word with a percentage scale, which is larger than one for amplifiers, and less than one for downtoners. The regression model can capture both of these methods to handle intensification. The shift item θ0 represents the polarity addition and subtraction directly, and the scale item θk (k > 0) can scale the polarity by a percentage.
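The role of the shift item and the scale item can be seen in a small numerical sketch of Equation (15). All parameter values below are hypothetical and chosen only for illustration (in the article they are learned from data), as are the function names and the sub-span probabilities for great and good. For the rule N → not P, the left-hand side is negative while the sub-span is positive, so in this sketch a positive θ1 reproduces the switch-style ordering criticized above, and a negative θ1 together with a shift θ0 reproduces the shift-style ordering.

```python
import math


def h(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def combine(theta, sub_probs):
    """Equation (15): theta[0] is the shift item, theta[1:] are scale items."""
    return h(theta[0] + sum(t * p for t, p in zip(theta[1:], sub_probs)))


p_good, p_great = 0.87, 0.95   # hypothetical P(positive | sub-span)

# N -> not P with a positive scale item: the more positive the sub-span,
# the more negative the negated phrase (switch-style ordering).
switch_like = (-2.0, 4.0)
neg_not_good_switch = combine(switch_like, [p_good])
neg_not_great_switch = combine(switch_like, [p_great])

# N -> not P with a shift item and a negative scale item: "not great"
# stays less negative than "not good" (shift-style ordering).
shift_like = (2.0, -1.5)
neg_not_good_shift = combine(shift_like, [p_good])
neg_not_great_shift = combine(shift_like, [p_great])

# P -> very P: an amplifier raises the positive strength of "good".
very = (-0.5, 3.6)
p_very_good = combine(very, [p_good])
```

Whatever the parameter values, the logistic function keeps every result strictly between zero and one.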
Table 2 illustrates how the regression-based polarity model represents different negation and intensification methods. For a specific rule, the parameters and the compositional method are automatically learned from data (Section 4.2.3) instead of being set manually as in previous work (Taboada et al. 2011). In a similar way, this method can handle contrast. For example, the inference rule for N → P but N is:

\[ \frac{(N \to P \text{ but } N) \quad [i_1, P, j_1]\Phi_1 \quad [i_2, N, j_2]\Phi_2}{[i, N, j] \;\; \begin{cases} P(\mathcal{N} \mid w_i^j) = h(\theta_0 + \theta_1 P(\mathcal{P} \mid w_{i_1}^{j_1}) + \theta_2 P(\mathcal{N} \mid w_{i_2}^{j_2})) \\ P(\mathcal{P} \mid w_i^j) = 1 - P(\mathcal{N} \mid w_i^j) \end{cases}} \tag{17} \]

where the polarity probability of the rule N → P but N is computed by $P(\mathcal{N} \mid w_i^j) = h(\theta_0 + \theta_1 P(\mathcal{P} \mid w_{i_1}^{j_1}) + \theta_2 P(\mathcal{N} \mid w_{i_2}^{j_2}))$. It can express the contrast relation by specific parameters θ0, θ1, and θ2.

It should be noted that a linear regression model could turn out to be problematic, as it may produce unreasonable results. For example, if we do not add any constraint, we may get $P(\mathcal{N} \mid w_i^j) = -0.6 + P(\mathcal{P} \mid w_{i+1}^j)$. When $P(\mathcal{P} \mid w_{i+1}^j) = 0.55$, we will get $P(\mathcal{N} \mid w_i^j) = -0.6 + 0.55 = -0.05$. This conflicts with the definition that the polarity probability ranges from zero to one. Figure 5 intuitively shows that the logistic function truncates polarity values to (0, 1) smoothly.
Table 2
The check mark means the parameter of the polarity model can capture the corresponding
intensification type and negation type. Shift item θ0 can handle shift negation and fixed
intensification, and scale item θ1 can model switch negation and percentage intensification.

Negation type: $P(\mathcal{X} \mid w_i^j) = h(\theta_0 + \theta_1 P(\bar{\mathcal{X}} \mid w_{i_1}^{j_1}))$
Intensification type: $P(\mathcal{X} \mid w_i^j) = h(\theta_0 + \theta_1 P(\mathcal{X} \mid w_{i_1}^{j_1}))$

| Parameter | Negation: Switch | Negation: Shift | Intensification: Percentage | Intensification: Fixed |
|---|---|---|---|---|
| θ0 (Shift item) | | ✓ | | ✓ |
| θ1 (Scale item) | ✓ | | ✓ | |
Figure 5
Logistic function $h(x) = \frac{1}{1 + \exp\{-x\}}$ truncates polarity values to (0, 1) smoothly. The computed
values are used as polarity probabilities.
3.5 Constraints
We incorporate additional constraints into the parsing model. They are used as pruning conditions in the derivation process, not only to improve efficiency but also to force the derivation towards the correct direction. We expand the inference rules in Section 3.4 as

\[ \frac{(r) \quad H_1\Phi_1 \quad \ldots \quad H_K\Phi_K}{[i, X, j]\Phi} \; C \tag{18} \]

where C is a side condition. The constraints are interpreted in a Boolean manner. If the constraint C is satisfied, the rule can be used; otherwise, it cannot. We define two constraints in the parsing model.
First, in the parsing process, the polarity label of text span $w_i^j$ obtained by the polarity model (Section 3.4) should be consistent with the non-terminal X (N or P) on the left hand side of the rule. To distinguish between the polarity labels and the non-terminals, we denote the corresponding polarity label of non-terminal X as $\mathcal{X}$. Following this notation, we describe the first constraint as

\[ C_1 : P(\mathcal{X} \mid w_i^j) > P(\bar{\mathcal{X}} \mid w_i^j) \tag{19} \]

where $\bar{\mathcal{X}}$ is the opposite polarity of $\mathcal{X}$. For example, if rule P → not N matches the text span $w_i^j$, the polarity calculated by the polarity model should be consistent with P, i.e., the polarity obtained by the polarity model should be positive ($\mathcal{P}$).
Second, when we apply the combination rules, the polarity strength of subspans needs to exceed a predefined threshold τ (≥ 0.5). Specifically, for combination rules $X \to w_i^{i_1} X_1 w_{j_1}^{j}$ and $X \to w_i^{i_1} X_1 w_{j_1}^{i_2} X_2 w_{j_2}^{j}$, we define the second constraint as

\[ C_2 : P(\mathcal{X}_k \mid w_{i_k}^{j_k}) > \tau, \quad k = 1, \ldots, K \tag{20} \]

where K is the number of subspans in the rule, and $\mathcal{X}_k$ is the corresponding polarity label of non-terminal $X_k$ in the right hand side. If $P(\mathcal{X}_k \mid w_{i_k}^{j_k})$ is not larger than threshold τ, we regard the polarity of phrase $w_{i_k}^{j_k}$ as neutral. For example, we do not want to use the combination rule P → a lot of P or N → a lot of N for the phrase a lot of people. This constraint avoids improperly using the combination rules for neutral phrases. Notably, when τ is set as 0.5, this constraint is the same as the first one in Equation (19), applied to each subspan.
As shown in Figure 6, we add these two constraints to the inference rules. The OOV rules do not have any constraints, and the constraint C1 is applied for all the other rules. The constraint C2 is only applied for the combination rules.
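The two side conditions amount to simple Boolean checks. The sketch below is one way to express them; the function names, the rule-kind strings, and the probability assigned to people are assumptions for illustration.

```python
def satisfies_c1(p_lhs):
    """C1 (Equation 19): the left-hand-side polarity must beat its opposite."""
    return p_lhs > 1.0 - p_lhs


def satisfies_c2(sub_probs, tau):
    """C2 (Equation 20): every sub-span must be polar enough."""
    return all(p > tau for p in sub_probs)


def rule_allowed(kind, p_lhs=None, sub_probs=(), tau=0.5):
    """Apply the constraints as assigned to rule types in Figure 6."""
    if kind == "oov":                 # OOV rules carry no constraint
        return True
    allowed = satisfies_c1(p_lhs)     # C1 for every other rule type
    if kind == "combination":         # C2 only for combination rules
        allowed = allowed and satisfies_c2(sub_probs, tau)
    return allowed


# "a lot of people": with a hypothetical P(positive | people) = 0.52 and
# tau = 0.6, the sub-span counts as neutral and P -> a lot of P is blocked.
blocked = rule_allowed("combination", p_lhs=0.70, sub_probs=[0.52], tau=0.6)

# A clearly polar sub-span passes both constraints.
passed = rule_allowed("combination", p_lhs=0.93, sub_probs=[0.87], tau=0.6)
```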
3.6 Decoding Algorithm
In this section, we summarize the decoding algorithm in Algorithm 1. For a sentence s, the CYK algorithm and dynamic programming are used to obtain the sentiment tree with the highest score. To be specific, the modified CYK parsing model parses the input sentence to sentiment trees in a bottom-up manner, that is, from short to long text spans. For every text span $w_i^j$, we match the rules in the sentiment grammar (Section 3.1) to generate the candidate set. Their polarity values are calculated using the polarity model described in Section 3.4. We also use the constraints described in Section 3.5 to prune search paths. The constraints improve the efficiency of the parsing algorithm and make the derivations agree with our intuition.
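\[ \frac{(X \to w_i^j)}{[i, X, j] \;\; P(\mathcal{X} \mid w_i^j) = \tilde{P}(\mathcal{X} \mid w_i^j)} \; C_1 \]

\[ \frac{(X \to w_i^{i_1} X_1 w_{j_1}^{j}) \quad [i_1, X_1, j_1]\Phi_1}{[i, X, j] \;\; P(\mathcal{X} \mid w_i^j) = h(\theta_0 + \theta_1 P(\mathcal{X}_1 \mid w_{i_1}^{j_1}))} \; C_1 \wedge C_2 \]

\[ \frac{(X \to w_i^{i_1} X_1 w_{j_1}^{i_2} X_2 w_{j_2}^{j}) \quad [i_1, X_1, j_1]\Phi_1 \quad [i_2, X_2, j_2]\Phi_2}{[i, X, j] \;\; P(\mathcal{X} \mid w_i^j) = h(\theta_0 + \theta_1 P(\mathcal{X}_1 \mid w_{i_1}^{j_1}) + \theta_2 P(\mathcal{X}_2 \mid w_{i_2}^{j_2}))} \; C_1 \wedge C_2 \]

\[ \frac{(X \to X_1 X_2) \quad [i, X_1, k]\Phi_1 \quad [k, X_2, j]\Phi_2}{[i, X, j] \;\; P(\mathcal{X} \mid w_i^j) = \dfrac{P(\mathcal{X} \mid w_i^k) P(\mathcal{X} \mid w_k^j)}{P(\mathcal{X} \mid w_i^k) P(\mathcal{X} \mid w_k^j) + P(\bar{\mathcal{X}} \mid w_i^k) P(\bar{\mathcal{X}} \mid w_k^j)}} \; C_1 \]

\[ \frac{(E \to w_i^j)}{[i, E, j]\circ} \; \circ \]

\[ \frac{(X \to E X_1) \quad [i, E, k]\circ \quad [k, X_1, j]\Phi_1}{[i, X, j] \;\; P(\mathcal{X} \mid w_i^j) = P(\mathcal{X} \mid w_k^j)} \; C_1 \]

\[ \frac{(X \to X_1 E) \quad [i, X_1, k]\Phi_1 \quad [k, E, j]\circ}{[i, X, j] \;\; P(\mathcal{X} \mid w_i^j) = P(\mathcal{X} \mid w_i^k)} \; C_1 \]

where $h(x) = \frac{1}{1 + \exp\{-x\}}$ is a logistic function, $\circ$ represents the absence, and X, X1, X2 represent N or P. As specified in the polarity model, we have $P(\bar{\mathcal{X}} \mid w_i^j) = 1 - P(\mathcal{X} \mid w_i^j)$.
Figure 6
Inference rules with the polarity model and constraints.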
Algorithm 1 Decoding Algorithm
Input: $w_0^n$: Sentence
Output: Polarity of the input sentence
1: score[, , ] ← {}
2: for l ← 1 . . . n do    ▷ Modified CYK algorithm
3:     for all i, j s.t. j − i = l do
4:         for all inferable rule $\frac{(r)\; H_1 \ldots H_K}{[i, X, j]}$ for $w_i^j$ do
5:             Φ ← calculate polarity value for r    ▷ Polarity model
6:             if constraints are satisfied then    ▷ Constraint
7:                 sc ← compute score for this derivation by ranking model    ▷ Ranking model
8:                 if sc > score[i, j, X] then
9:                     score[i, j, X] ← sc
10: return $\arg\max_{\mathcal{X} \in \{\mathcal{N}, \mathcal{P}\}}$ score[0, n, X]
The features in the ranking model (Section 4.1.1) decompose along the structure of the sentiment tree, so dynamic programming can be used to compute the derivation tree with the highest ranking score. For a span, the scores of its subspans are used to calculate the local scores of its derivations. For example, the score of the derivation

\[ \frac{(r) \quad [i_1, P, j_1] \quad [i_2, N, j_2]}{[i, X, j]} \]

is score[i1, j1, P] + score[i2, j2, N] + $score_r$, where score[i, j, X] is the highest score of text span $w_i^j$ that is derived to the non-terminal X, and $score_r$ is the score of applying the rule r. As described in Section 3.3, the score of using rule r is $score_r = \phi(w_i^j, r)^T \psi$, where $\phi(w_i^j, r)$ is the feature vector of using the rule r for the span $w_i^j$, and $\psi$ is the weight vector of the ranking model. The k highest-scoring trees satisfying the constraints are stored in score[, , ] for decoding the longer text spans. After finishing the CYK parsing, $\arg\max_{\mathcal{X} \in \{\mathcal{N}, \mathcal{P}\}}$ score[0, n, X] is regarded as the polarity label of the input sentence. The time complexity is the same as that of the standard CYK algorithm.
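To make the interplay of parsing, polarity, constraints, and ranking concrete, the following is a deliberately simplified sketch of the decoder. It is not the authors' implementation: it keeps only the single best score per item instead of the k best, supports only dictionary rules, glue rules, and combination rules of the form X → prefix X1, omits OOV and auxiliary rules, and uses hypothetical per-rule weights in place of $\phi(w_i^j, r)^T \psi$. All probabilities and θ values in the toy grammar are made up, apart from the two dictionary strengths taken from Table 1.

```python
import math


def h(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode(words, dictionary, comb_rules, weights, tau=0.5):
    """Return (polarity label, chart); chart maps (i, j, X) to (score, P(X | w_i^j))."""
    n = len(words)
    chart = {}

    def update(i, j, x, score, p):
        # C1: keep the item only if polarity X beats its opposite.
        if p > 0.5 and score > chart.get((i, j, x), (-math.inf, 0.0))[0]:
            chart[(i, j, x)] = (score, p)

    for length in range(1, n + 1):                    # short spans first
        for i in range(n - length + 1):
            j = i + length
            span = tuple(words[i:j])
            for x, p in dictionary.get(span, ()):     # dictionary rules
                update(i, j, x, weights["dictionary"], p)
            for x, prefix, x1, t0, t1 in comb_rules:  # X -> prefix X1
                m = i + len(prefix)
                if m < j and span[:len(prefix)] == prefix and (m, j, x1) in chart:
                    s1, p1 = chart[(m, j, x1)]
                    if p1 > tau:                      # C2 on the sub-span
                        update(i, j, x, s1 + weights["combination"], h(t0 + t1 * p1))
            for k in range(i + 1, j):                 # glue rules X -> X1 X2
                for x1 in "NP":
                    for x2 in "NP":
                        if (i, k, x1) in chart and (k, j, x2) in chart:
                            s1, p1 = chart[(i, k, x1)]
                            s2, p2 = chart[(k, j, x2)]
                            for x in "NP":
                                a = p1 if x1 == x else 1.0 - p1
                                b = p2 if x2 == x else 1.0 - p2
                                p = a * b / (a * b + (1.0 - a) * (1.0 - b))
                                update(i, j, x, s1 + s2 + weights["glue"], p)

    finals = {x: chart[(0, n, x)] for x in "NP" if (0, n, x) in chart}
    label = max(finals, key=lambda x: finals[x][0]) if finals else None
    return label, chart


words = "the movie is not very good".split()
dictionary = {("the", "movie", "is"): [("P", 0.52)], ("good",): [("P", 0.87)]}
comb_rules = [("P", ("very",), "P", -0.5, 3.6), ("N", ("not",), "P", -2.0, 3.0)]
weights = {"dictionary": 0.5, "combination": 1.0, "glue": 0.0}
label, chart = decode(words, dictionary, comb_rules, weights)
```

On this toy input the only full-sentence item that survives constraint C1 is the negative one, obtained by gluing the movie is to not very good.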
4. Model Learning
We described the statistical sentiment parsing framework in Section 3. We present the model learning process in this section. The learning process consists of two steps. First, the sentiment grammar and the polarity model are learned from data. In other words, the rules and the parameters used to compute polarity values are learned. These basic sentiment building blocks are then used to build the parse trees. Second, we estimate the parameters of the ranking model using the sentence and polarity label pairs. At this stage, we concentrate on learning how to score the parse trees based on the learned sentiment grammar and polarity model.
Section 4.1 shows the features and the parameter estimation algorithm used in the ranking model. Section 4.2 illustrates how to learn the sentiment grammar and the polarity model.
4.1 Ranking Model Training
As shown in Section 3.3, we develop the ranking model on the log-linear model. In the following subsections, we first present the features used to rank sentiment tree candidates. Then, we describe the objective function used in the optimization algorithm. Finally, we introduce the algorithm for parameter estimation using the gradient-based method.
4.1.1 Features. We extract a feature vector $\phi(s, t) \in \mathbb{R}^d$ for each parse tree t of sentence s. The feature vector is used in the log-linear model. In Figure 7, we present the features extracted for the sentence The movie is not very good, but i still like it. The features are organized into feature templates, each of which contains a set of features. These feature templates are as follows:

• COMBHIT: This feature is the total number of combination rules used in t.
• COMBRULE: It contains features {COMBRULE[r] : r is a combination rule}, each of which fires on the combination rule r appearing in t.
• DICTHIT: This feature is the total number of dictionary rules used in t.
• DICTRULE: It contains features {DICTRULE[r] : r is a dictionary rule}, each of which fires on the dictionary rule r appearing in t.
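The four templates can be read off a derivation by simple counting. The sketch below assumes a tree is given as a flat list of (category, rule) pairs, which is a hypothetical representation chosen for brevity; the rules are those of the derivation for The movie is not very good, but i still like it.

```python
from collections import Counter


def extract_features(rules_used):
    """Build the sparse feature vector phi(s, t) from the rules used in tree t."""
    feats = Counter()
    for kind, rule in rules_used:
        if kind == "combination":
            feats["COMBHIT"] += 1
            feats[f"COMBRULE[{rule}]"] += 1
        elif kind == "dictionary":
            feats["DICTHIT"] += 1
            feats[f"DICTRULE[{rule}]"] += 1
        # glue, OOV, auxiliary, and start rules fire no feature here
    return feats


derivation = [
    ("dictionary", "P → the movie is"), ("dictionary", "P → good"),
    ("oov", "E → ,"), ("dictionary", "P → i still like it"),
    ("combination", "P → very P"), ("combination", "N → not P"),
    ("glue", "N → PN"), ("auxiliary", "N → NE"),
    ("combination", "P → N but P"), ("start", "S → P"),
]
phi = extract_features(derivation)
```

Because each feature depends on a single rule application, the vector of a tree is the sum of the vectors of its rule applications, which is what makes dynamic programming possible.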
These features are generic local patterns that capture the properties of the sentiment tree. Another intuitive lexical feature template is [combination rule + word]. For instance, P → very P(good) is a feature that lexicalizes the non-terminal P to good. However, if this feature is fired frequently, the phrase very good would be learned as a dictionary rule and can be used in the decoding process. So we do not use this feature template in order to reduce the feature size. It should be noted that these features decompose along structures of sentiment trees, enabling us to use dynamic programming in the CYK algorithm.

(a) COMBHIT and COMBRULE

| Feature Template | Feature | Feature Value |
|---|---|---|
| Number of combination rules | COMBHIT | 3 |
| Combination rule | COMBRULE[P → very P] | 1 |
| | COMBRULE[N → not P] | 1 |
| | COMBRULE[P → N but P] | 1 |

(b) DICTHIT and DICTRULE

| Feature Template | Feature | Feature Value |
|---|---|---|
| Number of dictionary rules | DICTHIT | 3 |
| Dictionary rule | DICTRULE[P → the movie is] | 1 |
| | DICTRULE[P → good] | 1 |
| | DICTRULE[P → i still like it] | 1 |

Figure 7
Feature templates used in the ranking model. The red triangles denote the features for the example.
4.1.2 Objective Function. We design the ranking model upon the log-linear model to score candidate sentiment trees. In the training data D, we only have the input sentence s and its polarity label $L_s$. The forms of sentiment parse trees, which can obtain the correct sentiment polarity, are unobserved. So we work with the marginal log-likelihood of obtaining the correct polarity label $L_s$,

\[ \log p(L_s \mid s; T, \psi) = \log p(t \in T_{L_s}(s) \mid s; T, \psi) = A(\psi; s, T_{L_s}) - A(\psi; s, T) \tag{21} \]

where $T_{L_s}$ is the set of candidate trees whose prediction labels are $L_s$, and $A(\psi; s, T)$ (Equation (10)) is the log-partition function with respect to $T(s)$.
Based on the marginal log-likelihood function, the objective function $O(\psi, T)$ consists of two terms. The first term is the sum of marginal log-likelihood over training instances that can obtain the correct polarity labels. The second term is an L2-norm regularization term on the parameters $\psi$. Formally,

\[ O(\psi, T) = \sum_{\substack{(s, L_s) \in D \\ T_{L_s}(s) \neq \emptyset}} \log p(L_s \mid s; T, \psi) - \frac{\lambda}{2} \|\psi\|_2^2 \tag{22} \]

To learn the parameters $\psi$, we use a gradient-based optimization method to maximize the objective function $O(\psi, T)$. According to Wainwright and Jordan (2008), the derivative of the log-partition function is the expected feature vector

\[ \frac{\partial O(\psi, T)}{\partial \psi} = \sum_{\substack{(s, L_s) \in D \\ T_{L_s}(s) \neq \emptyset}} \left( E_{p(t \mid s; T_{L_s}, \psi)}[\phi(s, t)] - E_{p(t \mid s; T, \psi)}[\phi(s, t)] \right) - \lambda \psi \tag{23} \]

where $E_{p(x)}[f(x)] = \sum_x p(x) f(x)$ for discrete x.
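Equations (21) to (23) can be checked numerically on a single training instance with an explicitly enumerated candidate set. The sketch below is a minimal illustration under that assumption; the function names, the candidate trees, and the value of λ are hypothetical, and the candidate set is assumed to contain at least one tree with the gold label, matching the condition $T_{L_s}(s) \neq \emptyset$.

```python
import math


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def objective(cands, gold, psi, lam):
    """Equations (21)-(22) for one instance; cands is a list of (phi, label)."""
    all_scores = [dot(phi, psi) for phi, _ in cands]
    gold_scores = [dot(phi, psi) for phi, lab in cands if lab == gold]
    return logsumexp(gold_scores) - logsumexp(all_scores) - 0.5 * lam * dot(psi, psi)


def expected_features(cands, psi):
    """E_p[phi] under the log-linear distribution restricted to cands."""
    scores = [dot(phi, psi) for phi, _ in cands]
    a = logsumexp(scores)
    expectation = [0.0] * len(psi)
    for (phi, _), s in zip(cands, scores):
        p = math.exp(s - a)
        for j, f in enumerate(phi):
            expectation[j] += p * f
    return expectation


def gradient(cands, gold, psi, lam):
    """Equation (23): gold-set expectation minus full expectation, minus lam * psi."""
    e_gold = expected_features([c for c in cands if c[1] == gold], psi)
    e_all = expected_features(cands, psi)
    return [g - a - lam * w for g, a, w in zip(e_gold, e_all, psi)]


cands = [([3.0, 1.0], "N"), ([2.0, 0.0], "P"), ([1.0, 2.0], "N")]
psi = [0.3, -0.1]
grad = gradient(cands, "N", psi, lam=0.1)
```

A finite-difference check of `objective` against `gradient` is a cheap way to confirm that the two expectations carry the right signs.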
4.1.3 Parameter Estimation. The objective function $O(\psi, T)$ is not concave (nor convex), hence the optimization potentially results in a local optimum. Stochastic Gradient Descent (SGD; Robbins and Monro 1951) is a widely used optimization method. The SGD algorithm picks a training instance at random, and updates the parameter vector $\psi$ according to

\[ \psi_j^{(t+1)} = \psi_j^{(t)} + \alpha \left( \frac{\partial O(\psi)}{\partial \psi_j} \right) \Bigg|_{\psi = \psi^{(t)}} \tag{24} \]

where α is the learning rate, and $\frac{\partial O(\psi)}{\partial \psi_j}$ is the gradient of the objective function with respect to parameter $\psi_j$. SGD is sensitive to α, and the learning rate is the same for all dimensions. As described in Section 4.1.1, we mix sparse features together with dense features, so we want the learning rate to be different for each dimension. We use AdaGrad (Duchi, Hazan, and Singer 2011) to update the parameters, which sets an adaptive per-feature learning rate. The AdaGrad algorithm tends to use smaller update steps when we meet a feature many times. In order to compute efficiently, a diagonal approximation version of AdaGrad is used. The update rule is

\[ \psi_j^{(t+1)} = \psi_j^{(t)} + \alpha \frac{1}{\sqrt{G_j^{(t+1)}}} \left( \frac{\partial O(\psi)}{\partial \psi_j} \right) \Bigg|_{\psi = \psi^{(t)}}, \qquad G_j^{(t+1)} = G_j^{(t)} + \left( \frac{\partial O(\psi)}{\partial \psi_j} \right)^2 \Bigg|_{\psi = \psi^{(t)}} \tag{25} \]

where we introduce an adaptive term $G_j^{(t)}$. $G_j^{(t)}$ becomes larger along with updating, and decreases the update step for dimension j. Compared with SGD, the only cost is to store and update $G_j^{(t)}$ for each parameter.
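The update of Equation (25) takes only a few lines. The sketch below is a minimal version for gradient ascent; the function name and the small constant `eps`, added to avoid division by zero for dimensions that have never received a gradient, are assumptions and do not appear in the equation.

```python
import math


def adagrad_step(psi, grad, G, alpha=0.1, eps=1e-8):
    """One diagonal AdaGrad ascent step; returns the new (psi, G)."""
    new_G = [acc + g * g for acc, g in zip(G, grad)]
    new_psi = [w + alpha * g / (math.sqrt(acc) + eps)
               for w, g, acc in zip(psi, grad, new_G)]
    return new_psi, new_G


# The same gradient applied twice: the second step on dimension 0 is
# smaller because G_0 has grown; dimension 1 never moves.
psi0, G0 = [0.0, 0.0], [0.0, 0.0]
grad = [1.0, 0.0]
psi1, G1 = adagrad_step(psi0, grad, G0)
psi2, G2 = adagrad_step(psi1, grad, G1)
first_step = psi1[0] - psi0[0]
second_step = psi2[0] - psi1[0]
```

This shows the behavior described above: a feature that keeps firing accumulates a large $G_j$ and therefore receives progressively smaller updates, while rarely seen features keep a larger effective learning rate.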
To train the model, we use the method proposed by Liang, Jordan, and Klein (2013). With the candidate parse trees and objective function, the parameters $\psi$ are updated to make the parsing model favor correct trees and give them a higher score. Because there are many parse trees for a sentence, we need to calculate Equation (23) efficiently. As indicated in Section 4.1.1, the features decompose along the structure of sentiment trees, so dynamic programming can be utilized to compute $E_{p(t \mid s; T, \psi)}[\phi(s, t)]$ of Equation (23). However, the first expectation term $E_{p(t \mid s; T_{L_s}, \psi)}[\phi(s, t)]$ sums over the candidates that obtain the correct polarity labels. As this constraint does not decompose along the tree structure, there is no efficient dynamic program for this. Instead of searching all the parse trees spanning s, we use beam search to approximate this expectation. Beam search is a best-first search algorithm that explores at most K paths (K is the beam size). It keeps the local optimums to reduce the huge search space. Specifically, the beam search algorithm generates the K-best trees with the highest score $\phi(s, t)^T \psi$ for each span. These local optimums are used recursively in the CYK process. The K-best trees for the whole span are regarded as the candidate set $\tilde{T}$. Then $\tilde{T}$ and $\tilde{T}_{L_s}$ are used to approximate Equation (23) as in Liang, Jordan, and Klein (2013).
The intuition behind this parameter estimation algorithm lies in: (1) if we have
better parameters, we can obtain better candidate trees; (2) with better candidate trees,
we can learn better parameters. Thus the optimization problem is solved in an iterative
manner. We initialize the parameters as zeros. This leads to a random search and gen-
erates random candidate trees. With the initial candidates, the two steps in Algorithm 2
lead the parameters ψ towards the direction achieving better performance.
4.2 Sentiment Grammar Learning
In this section, we present the automatic learning of the sentiment grammar as defined
in Section 3.1. We need to extract the dictionary rules and the combination rules from
data. In traditional statistical parsing, grammar rules are induced from annotated parse
trees (such as the Penn TreeBank), so ideally we need examples of sentiment structure
trees, or sentences annotated with sentiment polarity for the whole sentence as well
as those for constituents within sentences. However, this is not practical, if not
infeasible, as the annotations will inevitably be time consuming and require laborious
human effort. We show that it is possible to induce the sentiment grammar directly
from examples of sentences annotated with sentiment polarity labels without using
any syntactic annotations or polarity annotations of constituents within sentences. The
Computational Linguistics
Volume 41, Number 2
Algorithm 2 Ranking Model Learning Algorithm
Input: D: Training data {(s, L_s)}, S: Maximum number of iterations
Output: ψ: Parameters of the ranking model
$\psi^{(0)} \leftarrow (0, 0, \ldots, 0)^T$
repeat
    $(s, L_s) \leftarrow$ randomly select a training instance in D
    $\tilde{T}^{(t)} \leftarrow$ BEAMSEARCH$(s, \psi^{(t)})$    ▷ Beam search to generate K-best candidates
    $G_j^{(t+1)} \leftarrow G_j^{(t)} + \left.\left(\frac{\partial O(\psi, \tilde{T}^{(t)})}{\partial \psi_j}\right)^2\right|_{\psi=\psi^{(t)}}$    ▷ Update parameters using AdaGrad
    $\psi_j^{(t+1)} \leftarrow \psi_j^{(t)} + \alpha \frac{1}{\sqrt{G_j^{(t+1)}}} \left.\left(\frac{\partial O(\psi, \tilde{T}^{(t)})}{\partial \psi_j}\right)\right|_{\psi=\psi^{(t)}}$
    $t \leftarrow t + 1$
until $t > S$
return $\psi^{(t)}$
sentences annotated with sentiment polarity labels are relatively easy to obtain, and we
use them as our input to learn dictionary rules and combination rules.
We first present the basic idea behind the algorithm we propose. People are likely to
express positive or negative opinions using very simple and straightforward sentiment
expressions again and again in their reviews. Intuitively, we can mine dictionary rules
from these massive review sentences by leveraging the redundancy characteristics.
Furthermore, there are many complicated reviews that contain complex sentiment
structures (e.g., negation, intensification, and contrast). If we already have dictionary
rules on hand, we can use them to obtain basic sentiment information for the fragments
within complicated reviews. We can then extract combination rules with the help of the
dictionary rules and the sentiment polarity labels of complicated reviews. Because the
simple and straightforward sentiment expressions are often coupled with complicated
expressions, we need to conduct dictionary rule mining and combination rule
mining in an iterative way.
4.2.1 Dictionary Rule Learning. The dictionary rules $G_D$ are basic sentiment building
blocks used in the parsing process. Each dictionary rule in $G_D$ is in the form $X \to f$,
where $f$ is a sentiment fragment. We use the polarity probabilities $P(N \mid f)$ and $P(P \mid f)$ in
the polarity model. To build $G_D$, we regard all the frequent fragments whose occurrence
frequencies are larger than $\tau_f$ and whose lengths range from 1 to 7 as the sentiment fragments.
We further filter out the phrases formed by stop words and punctuation, which are not
used to express sentiment.
For a balanced data set, the sentiment distribution of a candidate sentiment fragment
$f$ is calculated by
$$P(X \mid f) = \frac{\#(f, X) + 1}{\#(f, N) + \#(f, P) + 2} \tag{26}$$
where $X \in \{N, P\}$, and $\#(f, X)$ denotes the number of reviews containing $f$ with $X$
being the polarity. It should be noted that Laplace smoothing is used in Equation (26) to
deal with the zero frequency problem.
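As a small worked example of Equation (26), the following Python sketch computes the smoothed polarity probabilities from raw review counts. The function name `polarity_probability` and the counts used below are illustrative assumptions, not values from the paper.

```python
def polarity_probability(count_n, count_p):
    """Laplace-smoothed P(N|f) and P(P|f) from the review counts #(f,N) and #(f,P)."""
    total = count_n + count_p + 2
    return {"N": (count_n + 1) / total, "P": (count_p + 1) / total}

# A fragment never seen in negative reviews still gets a non-zero P(N|f).
unseen_in_negative = polarity_probability(count_n=0, count_p=8)
# A fragment with no observations at all falls back to the uniform distribution.
no_evidence = polarity_probability(count_n=0, count_p=0)
```

With 0 negative and 8 positive occurrences the estimate is 1/10 for N and 9/10 for P, rather than the degenerate 0 and 1 that plain relative frequencies would give.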
We do not learn the polarity probabilities $P(N \mid f)$ and $P(P \mid f)$ by directly counting
occurrence frequency. For example, in the review sentence this movie is not good
(negative), the naive counting method increases the count #(good, N) in terms of the polarity
of the whole sentence. Moreover, because of the common collocation not as good as
(negative) in movie reviews, as good as is also regarded as negative if we count the
frequency directly. These examples indicate why some polarity probabilities of phrases
counted from data differ from our intuitions. Such unreasonable polarity
probabilities also cause trouble when learning the polarity model. As a consequence, in order
to estimate more reasonable probabilities, we need to take compositionality into
consideration when learning sentiment fragments.
Following this motivation, we ignore the count #(f, X) if the sentiment fragment
f is covered by a negation rule r that negates the polarity of f. The word cover here
means that f is derived within a non-terminal of the negation rule r. For example, the
negation rule N → not P covers the sentiment fragment good in the sentence this is not
a good movie (negative); that is, good is derived from P of this negation rule. So we
ignore the occurrence for #(good, N) in this sentence. It should be noted that we still
increase the count for #(not good, N), because there is no negation rule covering the
fragment not good.
As shown in Algorithm 3, we learn the dictionary rules and their polarity probabilities
by counting the frequencies in negative and positive classes. Only the fragments
whose occurrence numbers are larger than the threshold $\tau_f$ are kept. Moreover, we take
the combination rules into consideration to acquire a more reasonable $G_D$. Notably, a
subsequence of a frequent fragment must also be frequent. This is similar to the key
insight in the Apriori algorithm (Agrawal and Srikant 1994). When we learn the
dictionary rules, we can count the sentiment fragments from short to long, and prune the
infrequent fragments in the early stages if any subsequence is not frequent. This pruning
method accelerates the dictionary rule learning process and makes the procedure fit in
memory.
Algorithm 3 Dictionary Rule Learning
Input: D: Data set, $G_C$: Combination rules, $\tau_f$: Frequency threshold
Output: $G_D$: Dictionary rules
function MINEDICTIONARYRULES(D, $G_C$)
    $G_D' \leftarrow \{\}$
    for $(s, L_s)$ in D do    ▷ $s$: $w_0 w_1 \cdots w_{|s|-1}$, $L_s$: Polarity label of $s$
        for all $i, j$ s.t. $0 \le i < j \le |s|$ do    ▷ $w_i^j$: $w_i w_{i+1} \cdots w_{j-1}$
            if no negation rule in $G_C$ covers $w_i^j$ then
                $\#(w_i^j, L_s)$ ++
                add $w_i^j$ to $G_D'$
    $G_D \leftarrow \{\}$
    for $f$ in $G_D'$ do
        if $\#(f, \cdot) \ge \tau_f$ then
            compute $P(N \mid f)$ and $P(P \mid f)$ using Equation (26)
            add dictionary rule $(L_f \to f)$ to $G_D$    ▷ $L_f = \arg\max_{X \in \{N,P\}} P(X \mid f)$
    return $G_D$
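The counting scheme of Algorithm 3 can be sketched in Python as below. This is a deliberately simplified illustration, not the authors' implementation: the test for "a negation rule covers the fragment" is replaced by a crude stand-in (a fragment is treated as covered when a negation trigger word occurs to its left in the same sentence), and the names `mine_dictionary_rules` and `negation_triggers` are hypothetical. The Apriori-style pruning mentioned above is omitted for brevity.

```python
from collections import Counter

def mine_dictionary_rules(data, negation_triggers=(), tau_f=2, max_len=7):
    """Return {fragment: (label, P(label|fragment))} for frequent fragments."""
    counts = Counter()  # (fragment, polarity label) -> number of reviews
    for words, label in data:
        seen = set()    # count each fragment at most once per review
        for i in range(len(words)):
            # Crude stand-in for "covered by a negation rule" (an assumption).
            if any(w in negation_triggers for w in words[:i]):
                continue
            for j in range(i + 1, min(i + max_len, len(words)) + 1):
                seen.add(tuple(words[i:j]))
        for frag in seen:
            counts[(frag, label)] += 1
    rules = {}
    for frag in {f for f, _ in counts}:
        n, p = counts[(frag, "N")], counts[(frag, "P")]
        if n + p >= tau_f:
            p_n = (n + 1) / (n + p + 2)  # Equation (26)
            p_p = (p + 1) / (n + p + 2)
            # Ties are resolved in favor of P here; this choice is arbitrary.
            rules[frag] = ("N" if p_n > p_p else "P", max(p_n, p_p))
    return rules

data = [
    ("this movie is good".split(), "P"),
    ("a good movie".split(), "P"),
    ("this movie is not good".split(), "N"),
]
naive = mine_dictionary_rules(data)                                # no negation handling
aware = mine_dictionary_rules(data, negation_triggers=("not",))    # skips negated "good"
```

In the toy data, naive counting attributes one negative occurrence to good, whereas the negation-aware pass ignores it, so the estimated P(P | good) is higher in the second case.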
4.2.2 Combination Rule Learning. The combination rules GC are generalizations for the
dictionary rules. They are used to handle the compositionality and process unseen
phrases. The learning of combination rules is based on the learned dictionary rules and
their polarity values. The sentiment fragments are generalized to combination rules by
replacing the subsequences of dictionary rules with their polarity labels. For instance, as
shown in Figure 8, the fragments is not (good/as expected/funny/well done) are all negative.
After replacing the subspans good, as expected, funny, and well done with their polarity
label P, we can learn the negation rule N → is not P.
We present the combination rule learning approach in Algorithm 4. Specifically,
the first step is to generate combination rule candidates. For every subsequence $w_i^j$
of sentiment fragment $f$, we replace it with the corresponding non-terminal $L_{w_i^j}$ if
$P(L_{w_i^j} \mid w_i^j)$ is larger than the threshold $\tau_p$, and we can get $w_0^i L_{w_i^j} w_j^{|f|}$. Next, we compare
the polarity $L_{w_i^j}$ with $L_f$. If $L_f \ne L_{w_i^j}$, we regard the rule $L_f \to w_0^i L_{w_i^j} w_j^{|f|}$ as a negation
rule. Otherwise, we further compare their polarity values. If this rule makes the polarity
value become larger (or smaller), it will be treated as a strengthen (or weaken) rule. To
obtain the contrast rules, we replace two subsequences with their polarity labels in a
similar way. If the polarities of these two subsequences are different, we categorize this
rule to the contrast type. Notably, these two non-terminals cannot be next to each other.
After these steps, we get the rule candidate set $G_C'$ and the occurrence number of each
rule. We then filter the rule candidates whose occurrence frequencies are too small, and
assign the rule types (negation, strengthen, weaken, and contrast) according to their
occurrence numbers.
4.2.3 Polarity Model Learning. As shown in Section 3.4, we define the polarity model
to calculate the polarity probabilities using the sentiment grammar. In this section, we
present how to learn the parameters of the polarity model for the combination rules.
Figure 8
We replace the subsequences with their polarity labels for frequent sentiment fragments.
As shown here, we replace good, as expected, funny, and well done with their polarity label P.
Then we compare the polarity probabilities of subfragments with the whole fragments,
such as good and is not good, to determine whether it is a negation rule, strengthen
rule, or weaken rule. After obtaining the rule, we use polarity probabilities of these
compositional examples as training data to estimate parameters of the polarity model.
In this example, (P(P|good), P(N|is not good)), (P(P|as expected), P(N|is not as expected)),
(P(P|funny), P(N|is not funny)), and (P(P|well done), P(N|is not well done)) are used to
learn the polarity model for N → is not P.
Algorithm 4 Combination Rule Learning
Input: D: Data set, $G_D$: Dictionary rules, $\tau_p, \tau_\Delta, \tau_r, \tau_c$: Thresholds
Output: $G_C$: Combination rules
function MINECOMBINATIONRULES(D, $G_D$)
    $G_C' \leftarrow \{\}$
    for $(X \to f)$ in $G_D$ do    ▷ $f$: $w_0 w_1 \cdots w_{|f|-1}$
        for all $i, j$ s.t. $0 \le i < j \le |f|$ do
            if $P(L_{w_i^j} \mid w_i^j) > \tau_p$ then    ▷ Polarity label $L_{w_i^j} = \arg\max_{X \in \{N,P\}} P(X \mid w_i^j)$
                $r$: $X \to w_0^i L_{w_i^j} w_j^{|f|}$    ▷ Non-terminal $L_{w_i^j} = \arg\max_{X \in \{N,P\}} P(X \mid w_i^j)$
                if $X \ne L_{w_i^j}$ then
                    #(r, negation) ++
                else if $P(X \mid f) > P(L_{w_i^j} \mid w_i^j) + \tau_\Delta$ then
                    #(r, strengthen) ++
                else if $P(X \mid f) < P(L_{w_i^j} \mid w_i^j) - \tau_\Delta$ then
                    #(r, weaken) ++
                add $r$ to $G_C'$
        for all $i_0, j_0, i_1, j_1$ s.t. $0 \le i_0 < j_0 < i_1 < j_1 \le |f|$ do
            if $P(L_{w_{i_0}^{j_0}} \mid w_{i_0}^{j_0}) > \tau_p$ and $P(L_{w_{i_1}^{j_1}} \mid w_{i_1}^{j_1}) > \tau_p$ then
                $r$: $X \to w_0^{i_0} L_{w_{i_0}^{j_0}} w_{j_0}^{i_1} L_{w_{i_1}^{j_1}} w_{j_1}^{|f|}$    ▷ Replace $w_{i_0}^{j_0}$, $w_{i_1}^{j_1}$ with the non-terminals
                if $L_{w_{i_0}^{j_0}} \ne L_{w_{i_1}^{j_1}}$ then
                    #(r, contrast) ++
                    add $r$ to $G_C'$
    $G_C \leftarrow \{\}$
    for $r$ in $G_C'$ do
        if $\#(r, \cdot) > \tau_r$ and $\max_T \frac{\#(r, T)}{\#(r, \cdot)} > \tau_c$ then
            add $r$ to $G_C$
    return $G_C$
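The single non-terminal part of Algorithm 4 can be illustrated with the following Python sketch. It is a simplification under stated assumptions, not the authors' implementation: the dictionary is a hypothetical mapping from a fragment to its label and the probability of that label, contrast rules are omitted, the trivial case where the subsequence equals the whole fragment is skipped, non-terminals are written as `<P>` and `<N>`, and all threshold values and probabilities are made up for illustration.

```python
from collections import Counter, defaultdict

def mine_combination_rules(dictionary, tau_p=0.7, tau_delta=0.05, tau_r=1, tau_c=0.5):
    """dictionary: {fragment tuple: (label, P(label|fragment))} -> {rule: type}."""
    type_counts = defaultdict(Counter)  # rule -> Counter of rule types
    for frag, (x, p_f) in dictionary.items():
        for i in range(len(frag)):
            for j in range(i + 1, len(frag) + 1):
                sub = frag[i:j]
                if sub == frag or sub not in dictionary:
                    continue
                l_sub, p_sub = dictionary[sub]
                if p_sub <= tau_p:
                    continue
                # Replace the subsequence with its polarity label (non-terminal).
                rule = (x, frag[:i] + (f"<{l_sub}>",) + frag[j:])
                if x != l_sub:
                    type_counts[rule]["negation"] += 1
                elif p_f > p_sub + tau_delta:
                    type_counts[rule]["strengthen"] += 1
                elif p_f < p_sub - tau_delta:
                    type_counts[rule]["weaken"] += 1
    rules = {}
    for rule, counts in type_counts.items():
        total = sum(counts.values())
        best_type, best = counts.most_common(1)[0]
        if total > tau_r and best / total > tau_c:
            rules[rule] = best_type
    return rules

# Hypothetical dictionary rules mirroring the Figure 8 example.
dictionary = {
    ("good",): ("P", 0.9),
    ("funny",): ("P", 0.85),
    ("well", "done"): ("P", 0.8),
    ("is", "not", "good"): ("N", 0.9),
    ("is", "not", "funny"): ("N", 0.8),
    ("is", "not", "well", "done"): ("N", 0.75),
}
rules = mine_combination_rules(dictionary)
```

All three negative fragments generalize to the same candidate, so the rule N → is not P is observed three times with the negation type and survives the frequency and purity filters.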
As shown in Figure 8, we learn combination rules by replacing the subsequences of
frequent sentiment fragments with their polarity labels. Both the replaced fragment and
the whole fragment can be found in the dictionary rules, so their polarity probabilities
have been estimated from data. We can use them as our training examples to figure
out how context changes the polarity of replaced fragment, and learn parameters of the
polarity model.
We describe the polarity model in Section 3.4. To further simplify the notation, we
denote the input vector $x = (1, P(X_1 \mid w_{i_1}^{j_1}), \ldots, P(X_K \mid w_{i_K}^{j_K}))^T$, and the response value as $y$.
Then we can rewrite Equation (15) as
$$h_\theta(x) = \frac{1}{1 + \exp\{-\theta^T x\}} \tag{27}$$
where $h_\theta(x)$ is the polarity probability calculated by the polarity model, and $\theta =
(\theta_0, \theta_1, \ldots, \theta_K)^T$ is the parameter vector. Our goal is to estimate the parameter vector
$\theta$ of the polarity model.
We fit the model to minimize the sum of squared residuals between the predicted
polarity probabilities and the values computed from data. We define the cost function as
$$J(\theta) = \frac{1}{2} \sum_m \left(h_\theta(x^m) - y^m\right)^2 \tag{28}$$
where $(x^m, y^m)$ is the $m$-th training instance.
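The gradient descent algorithm is used to minimize the cost function $J(\theta)$. The
partial derivative of $J(\theta)$ with respect to $\theta_j$ is
$$\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_j}
&= \sum_m \left(h_\theta(x^m) - y^m\right) \frac{\partial h_\theta(x^m)}{\partial \theta_j} \\
&= \sum_m \left(h_\theta(x^m) - y^m\right) h_\theta(x^m) \left(1 - h_\theta(x^m)\right) \frac{\partial \theta^T x^m}{\partial \theta_j} \\
&= \sum_m \left(h_\theta(x^m) - y^m\right) h_\theta(x^m) \left(1 - h_\theta(x^m)\right) x_j^m
\end{aligned} \tag{29}$$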
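Equations (27) to (29) can be checked numerically with a short Python sketch. The code below is an illustration under stated assumptions rather than the authors' implementation: the function names, the step size, the number of steps, and the three training pairs (meant to resemble inputs for a rule such as N → is not P) are all hypothetical.

```python
import math

def h(theta, x):
    """Polarity model of Equation (27): logistic function of theta^T x."""
    return 1.0 / (1.0 + math.exp(-sum(t * v for t, v in zip(theta, x))))

def cost(theta, examples):
    """Sum of squared residuals, Equation (28)."""
    return 0.5 * sum((h(theta, x) - y) ** 2 for x, y in examples)

def gradient(theta, examples):
    """Analytic gradient of the cost, Equation (29)."""
    grad = [0.0] * len(theta)
    for x, y in examples:
        p = h(theta, x)
        for j, xj in enumerate(x):
            grad[j] += (p - y) * p * (1.0 - p) * xj
    return grad

def gradient_descent(examples, alpha=0.5, steps=2000):
    """Plain batch gradient descent starting from theta = 0."""
    theta = [0.0] * len(examples[0][0])
    for _ in range(steps):
        g = gradient(theta, examples)
        theta = [t - alpha * gj for t, gj in zip(theta, g)]
    return theta

# Hypothetical pairs: x = (1, P(P|sub-fragment)), y = P(N|whole fragment).
examples = [((1.0, 0.9), 0.85), ((1.0, 0.8), 0.7), ((1.0, 0.95), 0.9)]
fitted = gradient_descent(examples)
```

Comparing `gradient` against a finite-difference estimate of `cost` is a quick way to confirm that the factor $h_\theta(x)(1 - h_\theta(x))$ from the logistic function has been carried through correctly.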
We initialize θ to zeros. We use the stochastic gradient
descent algorithm to minimize the cost function. For the instance $(x, y)$, the parameters
are updated using
$$\begin{aligned}
\theta_j^{(t+1)} &= \theta_j^{(t)} - \alpha \left.\left(\frac{\partial J(\theta)}{\partial \theta_j}\right)\right|_{\theta=\theta^{(t)}} \\
&= \theta_j^{(t)} - \alpha \left(h_{\theta^{(t)}}(x) - y\right) h_{\theta^{(t)}}(x) \left(1 - h_{\theta^{(t)}}(x)\right) x_j
\end{aligned} \tag{30}$$
where α is the learning rate, and it is set to 0.01 in our experiments. We summarize
the learning method in Algorithm 5. For each combination rule, we iteratively scan
through the training examples $(x, y)$ in a random order, and update the parameters θ
according to Equation (30). The stopping condition is $\|\theta^{(t+1)} - \theta^{(t)}\|_2^2 < \varepsilon$, which indicates
the parameters become stable.

Algorithm 5 Polarity Model Learning Algorithm
Input: $G_C$: Combination rules, ε: Stopping condition, α: Learning rate
Output: θ: Parameters of the polarity model
function ESTIMATEPOLARITYMODEL($G_C$)
    for all combination rule $r \in G_C$ do
        $\theta^{(0)} \leftarrow (0, 0, \ldots, 0)^T$
        repeat
            $(x, y) \leftarrow$ randomly select a training instance
            $\theta_j^{(t+1)} \leftarrow \theta_j^{(t)} - \alpha \left(h_{\theta^{(t)}}(x) - y\right) h_{\theta^{(t)}}(x) \left(1 - h_{\theta^{(t)}}(x)\right) x_j$
            $t \leftarrow t + 1$
        until $\|\theta^{(t+1)} - \theta^{(t)}\|_2^2 < \varepsilon$
        assign $\theta^{(t)}$ as the parameters of the polarity model for rule $r$

4.2.4 Summary of the Grammar Learning Algorithm. We summarize the grammar learning
process in Algorithm 6, which learns the sentiment grammar in an iterative manner.
We first learn the dictionary rules and their polarity probabilities by counting the
frequencies in negative and positive classes. Only the fragments whose occurrence
numbers are larger than the threshold $\tau_f$ are kept. As mentioned in Section 4.2.1, the context
can essentially change the distribution of sentiment fragments. We take the combination
rules into consideration to acquire a more reasonable $G_D$. In the first iteration, the set of
combination rules is empty. Therefore, we have no information about compositionality
to improve dictionary rule learning. The initial $G_D$ contains some inaccurate sentiment
distributions. Next, we replace the subsequences of dictionary rules with their polarity
labels, and generalize these sentiment fragments to the combination rules $G_C$ as
illustrated in Section 4.2.2. At the same time, we can obtain their compositional types and
learn parameters of the polarity model. We iterate over these two steps to obtain refined
$G_D$ and $G_C$.

Algorithm 6 Sentiment Grammar Learning
Input: D: Data set {(s, L_s)}, T: Maximum number of iterations    ▷ $L_s$: Polarity label of $s$
Output: $G_D$: Dictionary rules, $G_C$: Combination rules
$G_C \leftarrow \{\}$
repeat
    $G_D \leftarrow$ MINEDICTIONARYRULES(D, $G_C$)    ▷ Algorithm 3
    $G_C \leftarrow$ MINECOMBINATIONRULES(D, $G_D$)    ▷ Algorithm 4
until iteration number exceeds T
ESTIMATEPOLARITYMODEL($G_C$)    ▷ Algorithm 5
return $G_D$, $G_C$

5. Experimental Studies
In this section, we describe experimental results on existing benchmark data sets with
extensive comparisons with state-of-the-art sentiment classification methods. We also
present the effects of different experimental settings in the proposed statistical
sentiment parsing framework.
5.1 Experiment Set-up
We describe the data sets in Section 5.1.1, the experimental settings in Section 5.1.2, and
the methods used for comparison in Section 5.1.3.
5.1.1 Data Sets. We conduct experiments on sentiment classification for sentence-level
and phrase-level data. The sentence-level data sets contain user reviews and critic
reviews from Rotten Tomatoes3 and IMDb.4 We balance the positive and negative
instances in the training data set to mitigate the problem of data imbalance. Moreover,
the Stanford Sentiment Treebank5 contains polarity labels of all syntactically plausible
phrases. In addition, we use the MPQA6 data set for the phrase-level task. We describe
these data sets as follows.
RT-C: 436,000 critic reviews from Rotten Tomatoes. It consists of 218,000 negative
and 218,000 positive critic reviews. The average review length is 23.2 words. Critic
reviews from Rotten Tomatoes contain a label (Rotten: Negative, Fresh: Positive) to
indicate the polarity, which we use directly as the polarity label of corresponding reviews.
PL05-C: The sentence polarity data set v1.0 (Pang and Lee 2005) contains 5,331
positive and 5,331 negative snippets written by critics from Rotten Tomatoes. This data
set is widely used as the benchmark data set in the sentence-level polarity classification
task. The data source is the same as RT-C.
SST: The Stanford Sentiment Treebank (Socher et al. 2013) is built upon PL05-C. The
sentences are parsed to parse trees. Then, 215,154 syntactically plausible phrases are
extracted and annotated by workers from Amazon Mechanical Turk. The experimental
settings of positive/negative classification for sentences are the same as in Socher et al.
(2013).
RT-U: 737,806 user reviews from Rotten Tomatoes. Because we focus on sentence-level
sentiment classification, we filter out user reviews that are longer than 200 characters.
The average length of these short user reviews from Rotten Tomatoes is 15.4
words. Following previous work on polarity classification, we use the review score to
select highly polarized reviews. For the user reviews from Rotten Tomatoes, a negative
review has a score <2.5 out of 5, and a positive review has a score >3.5 out of 5.
IMDB-U: 600,000 user reviews from IMDb. The user reviews in IMDb contain
comments and short summaries (usually a sentence) to summarize the overall sentiment
expressed in the reviews. We use the review summaries as the sentence-level reviews.
The average length is 6.6 words. For user reviews of IMDb, a negative review has a score
<4 out of 10, and a positive review has a score >7 out of 10.
C-TEST: 2,000 labeled critic reviews sampled from RT-C. We use C-TEST as the
testing data set for RT-C. Note that we exclude these from the training data set (i.e.,
RT-C).
U-TEST: 2,000 manually labeled user reviews sampled from RT-U. User reviews
often contain some noisy ratings compared with critic reviews. To eliminate the effect
of noise, we sample 2,000 user reviews from RT-U, and annotate their polarity
labels manually. We use U-TEST as a testing data set for RT-U and IMDB-U, which
are both user reviews. Note that we exclude them from the training data set (i.e.,
RT-U).
MPQA: The opinion polarity subtask of the MPQA data set (Wiebe, Wilson, and
Cardie 2005). The authors manually annotate sentiment polarity labels for the
expressions (i.e., sub-sentences) within a sentence. We regard the expressions as short
sentences in our experiments. There are 7,308 negative examples and 3,316 positive
examples in this data set. The average number of words per example is 3.1.
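The score thresholds used above to select highly polarized user reviews for RT-U and IMDB-U can be written down as a small Python helper. This is only a sketch of the stated thresholds; the function name `polarity_from_score` and the convention of returning `None` for discarded reviews are assumptions introduced here.

```python
# (negative below, positive above) thresholds as stated for each user-review source.
THRESHOLDS = {"RT-U": (2.5, 3.5), "IMDB-U": (4, 7)}

def polarity_from_score(score, source):
    """Map a user rating to 'N', 'P', or None when the review is not polarized enough."""
    low, high = THRESHOLDS[source]
    if score < low:
        return "N"
    if score > high:
        return "P"
    return None  # mid-range ratings are left out of the training data
```

Reviews whose ratings fall between the two thresholds carry a weak or mixed signal, which is why they are discarded rather than forced into one of the two classes.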
3 http://www.rottentomatoes.com.
4 http://www.imdb.com.
5 http://nlp.stanford.edu/sentiment/treebank.html.
6 http://mpqa.cs.pitt.edu/corpora/mpqa corpus.
Table 3
Statistical information of data sets. #Negative and #Positive are the number of negative instances
and positive instances, respectively. l_avg is the average length of sentences in the data set, and |V| is
the vocabulary size.

Data Set   Size      #Negative   #Positive   l_avg   |V|
RT-C       436,000   218,000     218,000     23.2    136,006
PL05-C     10,662    5,331       5,331       21.0    20,263
SST        98,796    42,608      56,188      7.5     16,372
RT-U       737,806   368,903     368,903     15.4    138,815
IMDB-U     600,000   300,000     300,000     6.6     83,615
MPQA       10,624    7,308       3,316       3.1     5,992
Table 3 shows the summary of these data sets, and all of them are publicly available
at http://goo.gl/WxTdPf.
5.1.2 Settings. To compare with other published results for PL05-C and MPQA, the
training and testing regime (10-fold cross-validation) is the same as in Pang and Lee
(2005), Nakagawa, Inui, and Kurohashi (2010), and Socher et al. (2011). For SST, the
regime is the same as in Socher et al. (2013). We use C-TEST as the testing data for RT-C,
and U-TEST as the testing data for RT-U and IMDB-U. There are a number of settings
that have trade-offs in performance, computation, and the generalization power of our
model. The best settings are chosen using a portion of the training split that serves as the
validation set. We provide the performance comparisons using different experimental
settings in Section 5.4.
Number of training examples: The size of training data has been widely recognized
as one of the most important factors in machine learning-based methods. Generally,
using more data leads to better performance. By default, all the training data is used
in our experiments. We use the same size of training data in different methods for fair
comparisons.
Number of training iterations (T): We use AdaGrad (Duchi, Hazan, and Singer
2011) as the optimization algorithm in the learning process. The algorithm starts with
randomly initialized parameters, and alternates between searching candidate sentiment
trees and updating parameters of the ranking model. We treat a one-pass scan of the training
data as an iteration.
Beam size (K): The beam size is used to make a trade-off between the search space
and the computation cost. Moreover, an appropriate beam size can prune unfavorable
candidates. We set K = 30 in our experiments.
Regularization (λ): The regularization parameter λ in Equation (22) is used to avoid
over-fitting. The value used in the experiments is 0.01.
Minimum fragment frequency: It is difficult to estimate reliable polarity probabilities
when the fragment appears very few times. Hence, a minimum fragment frequency
that is too small will introduce noise in the fragment learning process. On the other
hand, a large threshold will lose much useful information. The minimum fragment
frequency is chosen according to the size of the training data set and the validation
performance. To be specific, we set this parameter as 4 for RT-C, SST, RT-U, and
IMDB-U, and 2 for PL05-C and MPQA.
Maximum fragment length: High order n-grams are more precise and deterministic
expressions than unigrams and bigrams. So it would be useful to use long fragments to
capture polarity information. According to the experimental results, as the maximum
fragment length increases, the accuracy of sentiment classification increases. The maxi-
mum fragment length is set to 7 words in our experiments.
5.1.3 Sentiment Classification Methods for Comparison. We evaluate the proposed statis-
tical sentiment parsing framework on the different data sets, and compare the results
with some baselines and state-of-the-art sentiment classification methods described as
follows.
SVM-m: Support Vector Machine (SVM) achieves good performance in the sen-
timent classification task (Pang and Lee 2005). Though unigrams and bigrams are
reported as the most effective features in existing work (Pang and Lee 2005), we use
high-order n-gram (1 ≤ n ≤ m) features to conduct fair comparisons. Hereafter, m has
the same meaning. We use LIBLINEAR (Fan et al. 2008) in our experiments because it
can handle well the high feature dimension and a large number of training examples.
We try different hyper-parameters $C \in \{10^{-2}, 10^{-1}, 1, 5, 10, 20\}$ for SVM, and select C on
the validation set.
MNB-m: As indicated in Wang and Manning (2012), Multinomial Naïve Bayes
(MNB) often outperforms SVM for sentence-level sentiment classification. We utilize
Laplace smoothing (Manning, Raghavan, and Schütze 2008) to tackle the zero
probability problem. High-order n-gram (1 ≤ n ≤ m) features are considered in the
experiments.
LM-m: Language Model (LM) is a generative model calculating the probability
of word sequences. It is used for sentiment analysis in Cui, Mittal, and Datar (2006).
The probability of generating sentence s is calculated by $P(s) = \prod_{i=0}^{|s|-1} P\left(w_i \mid w_0^{i-1}\right)$, where
$w_0^{i-1}$ denotes the word sequence $w_0 \ldots w_{i-1}$. We use Good-Turing smoothing (Good
1953) to overcome sparsity when estimating the probability of high-order n-grams. We
train language models on negative and positive sentences separately. For a sentence,
its polarity is determined by comparing the probabilities calculated from the positive
and negative language models. The unknown-word token is treated as a regular word
in our experiments.
Voting-w/Rev: This approach is proposed by Choi and Cardie (2009b), and is used
as a baseline in Nakagawa, Inui, and Kurohashi (2010). The polarity of a subjective
sentence is decided by the voting of each phrase’s prior polarity. The polarity of phrases
that have odd numbers of negation phrases in their ancestors is reversed. The results
are reported by Nakagawa, Inui, and Kurohashi (2010).
HardRule: This baseline method is compared by Nakagawa, Inui, and Kurohashi
(2010). The polarity of a subjective sentence is deterministically decided based on rules,
by considering the sentiment polarity of dependency subtrees. The polarity of a mod-
ifier is reversed if its head phrase has a negation word. The decision rules are applied
from the leaf nodes to the root node in a dependency tree. We use the results reported
by Nakagawa, Inui, and Kurohashi (2010).
Tree-CRF: Nakagawa, Inui, and Kurohashi (2010) present a dependency tree-based
method using conditional random fields with hidden variables. In this model, the
polarity of each dependency subtree is represented by a hidden variable. The value of
the hidden variable of the root node is identified as the polarity of the whole sentence.
The experimental results are reported by Nakagawa, Inui, and Kurohashi (2010).
RAE-pretrain: Socher et al. (2011) introduce a framework based on recursive
autoencoders to learn vector space representations for multi-word phrases and predict
sentiment distributions for sentences. We use the results with pre-trained word vectors
learned on Wikipedia, which leads to better results compared with randomized word
vectors. We directly compare the results with those in Socher et al. (2011).
MV-RNN: Socher et al. (2012) try to capture the compositional meaning of
long phrases through matrix-vector recursive neural networks. This model assigns a
vector and a matrix to every node in the parse tree. Matrices are regarded as operators,
and vectors capture the meaning of phrases. The results are reported by Socher
et al. (2012, 2013).
s.parser-LongMatch: The longest matching rules are utilized in the decoding process.
In other words, the derivations that contain the fewest rules are used for all text
spans. In addition, the dictionary rules are preferred to the combination rules if both
of them match the same text span. The dynamic programming algorithm is used in the
implementation.
s.parser-w/oComb: This is our method without using the combination rules (such
as N → not P) learned from data.
5.2 Results of Sentiment Classification
We present the experimental results of the sentiment classification methods on the
different data sets in Table 4. The top three methods on each data set are in bold, and the
best methods are also underlined. The experimental results show that s.parser achieves
better performance than other methods on most data sets.
The data sets RT-C, PL05-C, and SST are critic reviews. On RT-C, the accuracy of
s.parser increases by 2 percentage points, 2.9 percentage points, and 7.1 percentage
points from the best results of SVM, MNB, and LM, respectively. On PL05-C, the
accuracy of s.parser also rises by 2.1 percentage points, 0.7 percentage points, and
4.4 percentage points from the best results of SVM, MNB, and LM, respectively.
Compared to Voting-w/Rev and HardRule, s.parser outperforms them by 16.4 percentage
points and 16.6 percentage points. The results indicate that our method
significantly outperforms the baselines that use manual rules, as rule-based methods
lack a probabilistic way to model the compositionality of context. Furthermore, s.parser
achieves an accuracy improvement of 2.2 percentage points, 1.8 percentage points,
and 0.5 percentage points over Tree-CRF, RAE-pretrain, and MV-RNN, respectively. On
SST, s.parser outperforms SVM, MNB, and LM by 3.4 percentage points, 1.4 percentage
points, and 3.8 percentage points, respectively. The performance is better than MV-RNN
with an improvement of 1.8 percentage points. Moreover, the result is comparable
to the 85.4% obtained by recursive neural tensor networks (Socher et al. 2013)
without depending on syntactic parsing results.
On the user review data sets RT-U and IMDB-U, our method also achieves the best
resultados. More specifically, on the data set RT-U, s.parser outperforms the best results of
SVM, MNB, and LM by 1.7 puntos de porcentaje, 2.9 puntos de porcentaje, y 1.5 porcentaje
puntos, respectivamente. On the data set IMDB-U, our method brings an improved accuracy
tasa de 2.1 puntos de porcentaje, 3.7 puntos de porcentaje, y 2.2 percentage points over
SVM, MNB, and LM, respectivamente. We find that MNB performs better than SVM and
LM on the critics review data sets RT-C and PL05-C. También, SVM and LM achieve better
results on the user review data sets RT-U and IMDB-U. The s.parser is more robust for
the different genres of data sets.
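The percentage-point margins quoted in this subsection can be recomputed from the accuracies in Table 4. The short sketch below is only a bookkeeping aid: the numbers are copied from Table 4 (best n-gram order per baseline family), while the dictionary layout and the helper name `margins` are illustrative assumptions.

```python
# Best accuracy per baseline family, copied from Table 4.
best_baseline = {
    "RT-C":   {"SVM": 83.1, "MNB": 82.2, "LM": 78.0},
    "PL05-C": {"SVM": 77.4, "MNB": 78.8, "LM": 75.1},
    "SST":    {"SVM": 81.3, "MNB": 83.3, "LM": 80.9},
    "RT-U":   {"SVM": 89.8, "MNB": 88.6, "LM": 90.0},
    "IMDB-U": {"SVM": 87.2, "MNB": 85.6, "LM": 87.1},
}
# Accuracy of s.parser on the same data sets, copied from Table 4.
s_parser = {"RT-C": 85.1, "PL05-C": 79.5, "SST": 84.7, "RT-U": 91.5, "IMDB-U": 89.3}


def margins(data_set: str) -> dict:
    """Percentage-point gain of s.parser over each baseline family."""
    return {
        method: round(s_parser[data_set] - accuracy, 1)
        for method, accuracy in best_baseline[data_set].items()
    }


for name in best_baseline:
    print(name, margins(name))
```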
Table 4
Sentiment classification results (accuracy, %) on different data sets; The top three methods are in bold and the
best is also underlined; SVM-m = Support Vector Machine; MNB-m = Multinomial Naïve Bayes;
LM-m = Language Model; Voting-w/Rev = Voting with negation rules; HardRule = Rule based
method on dependency tree; Tree-CRF = Dependency tree-based method employing conditional
random fields; RAE-pretrain = Recursive autoencoders with pre-trained word vectors;
MV-RNN = Matrix-vector recursive neural network; s.parser-LongMatch = The longest
matching rules are used; s.parser-w/oComb = Without using the combination rules; s.parser =
Our method. Some of the results are missing (indicated by "-") in the table as there is no publicly
available implementation or they are hard to scale up.

Method              RT-C   PL05-C  SST    RT-U   IMDB-U  MPQA
SVM-1               80.3   76.3    81.1   88.5   84.9    85.1
SVM-2               83.0   77.4    81.3   88.9   86.8    85.3
SVM-3               83.1   77.0    81.2   89.7   87.2    85.5
SVM-4               81.5   76.9    80.9   89.8   87.0    85.6
SVM-5               81.7   76.8    80.8   89.3   87.0    85.6
MNB-1               79.6   78.0    82.6   83.3   82.7    85.0
MNB-2               82.0   78.8    83.3   87.5   85.6    85.0
MNB-3               82.2   78.4    82.9   88.6   84.6    85.0
MNB-4               81.8   78.2    82.6   88.2   83.1    85.1
MNB-5               81.7   78.1    82.4   88.1   82.5    85.1
LM-1                77.6   75.1    80.9   87.6   81.8    64.0
LM-2                78.0   74.1    78.4   89.0   85.8    71.4
LM-3                77.3   74.2    78.3   89.3   87.1    71.1
LM-4                77.2   73.0    78.3   89.6   87.0    71.1
LM-5                77.0   72.9    78.2   90.0   87.1    71.1
Voting-w/Rev        -      63.1    -      -      -       81.7
HardRule            -      62.9    -      -      -       81.8
Tree-CRF            -      77.3    -      -      -       86.1
RAE-pretrain        -      77.7    -      -      -       86.4
MV-RNN              -      79.0    82.9   -      -       -
s.parser-LongMatch  82.8   78.6    82.5   89.4   86.9    85.7
s.parser-w/oComb    82.6   78.3    82.4   89.0   86.4    85.5
s.parser            85.1   79.5    84.7   91.5   89.3    86.2
On the data set MPQA, the accuracy of s.parser increases by 0.5 percentage points,
1.1 percentage points, and 14.8 percentage points over the best results of SVM, MNB,
and LM, respectively. Compared with Voting-w/Rev and HardRule, s.parser achieves
improvements of 4.5 percentage points and 4.4 percentage points, respectively. As illustrated
in Table 3, the size and length of sentences in MPQA are much smaller than those in
the other four data sets. RAE-pretrain achieves better results than other methods
on this data set, because the word embeddings pre-trained on Wikipedia can leverage
smoothing to relieve the sparsity problem in MPQA. If we do not use any external
resources (i.e., Wikipedia), the accuracy of RAE on MPQA is 85.7%, which is lower than
that of Tree-CRF and s.parser. The results indicate that s.parser achieves the best result if no
external resource is used.
Furthermore, we compare the results of s.parser-LongMatch and s.parser-w/oComb.
s.parser-LongMatch uses the dictionary rules and combination rules in the
longest-matching manner, whereas s.parser-w/oComb removes the combination rules
from the parsing process. Compared with the results of s.parser, we find that both the
ranking model and the combination rules play a positive role in the model. The ranking
model learns to score parse trees by assigning larger weights to the rules that tend to
obtain correct labels. Also, the combination rules generalize the dictionary rules to deal
with sentiment compositionality in a symbolic way, which enables the model to
process unseen phrases. Furthermore, s.parser-LongMatch achieves better results than
s.parser-w/oComb. This indicates that the effect of the combination rules is more
pronounced than that of the ranking model.
The bag-of-words classifiers work well for long documents, relying on sentiment
words that appear many times in a document. This redundancy provides
strong evidence for sentiment classification. Even if some phrases of a document
are not estimated accurately, the classifier can still produce a correct polarity label. However, for
short text, such as a sentence, compositionality plays an important role in sentiment
classification. Tree-CRF, MV-RNN, and s.parser take compositionality into consideration
in different ways, and they achieve significant improvements over SVM, MNB, and
LM. We also find that high-order n-grams contribute to classification accuracy on
most of the data sets, but they harm the accuracy of LM on PL05-C. High-order
n-grams can partially capture compositionality in a brute-force way.
5.3 Effect of Training Data Size
We further investigate the effect of the size of training data for different sentiment
classification methods. This is meaningful as the number of publicly available reviews
is increasing dramatically nowadays. The methods that can take advantage of more
training data will be even more useful in practice.
We report the results of s.parser compared with SVM, MNB, and LM on the data
set RT-C using different training data sizes. In order to make the figure clear, we only
present the results of SVM/MNB/LM-1/5 here. As shown in Figure 9, we find that
the size of training data plays an important role for all these sentiment classification
methods. The basic conclusion is that the performance of all the methods rises as the
data size increases, especially when the data size is smaller than a certain number. This
matches our intuition that the size of data is the key factor when the size is relatively
small. When the size of data is larger, the growth of accuracy becomes slower. The
performance of the baseline methods starts to converge after the data size is larger
than 200,000. The comparisons illustrate that s.parser significantly outperforms these
baselines, and the performance of s.parser becomes even better when the data size
increases. The convergence of s.parser's performance is slower than that of the others. This
indicates that s.parser leverages data more effectively and benefits more from a larger
data set. With more training data, s.parser learns more dictionary rules and combination
rules. These rules enhance the generalization ability of our model. Moreover,
it estimates more reliable parameters for the polarity model and ranking model. In
contrast, the bag-of-words based approaches (such as SVM, MNB, and LM) cannot
make full use of high-order information in the data set. The generalization ability
of the combination rules of s.parser leads to better performance and takes advantage
of larger data. It should be noted that there are similar trends with the other data
sets.
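Figure 9
Test accuracy (y-axis) versus training data size (x-axis, 0 to 4×10^5) for SVM-1, SVM-5,
MNB-1, MNB-5, LM-1, LM-5, and s.parser.
The curves show the test accuracy as the number of training examples increases. Our method
s.parser significantly outperforms the other methods, which indicates s.parser can leverage data
more effectively and benefits more from larger data.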
5.4 Effect of Experimental Settings
In this section, we investigate the effects of different experimental settings. We show
the results on the data set RT-C by changing only one factor and fixing the others.
Figure 10 shows the effect of the minimum fragment frequency and the maximum fragment
length. Specifically, Figure 10a indicates that a minimum fragment frequency
that is too small introduces noise, because it is difficult to estimate reliable polarity
probabilities for infrequent fragments. However, a minimum fragment frequency that
is too large discards too much useful information. As shown in Figure 10b, we find
that accuracy increases as the maximum fragment length increases. The results illustrate
that a large maximum fragment length is helpful for s.parser. We can learn more
Figure 10
Panels: (a) Effect of minimum fragment frequency (accuracy versus minimum fragment frequency
2, 4, 8, 16, 32). (b) Effect of maximum fragment length (accuracy versus maximum fragment
length 1 to 7).
(a) When the minimum fragment frequency is small, noise is introduced in the fragment
learning process. On the other hand, too large a threshold loses useful information. (b) As the
maximum fragment length increases, the accuracy increases monotonically. This indicates that long
fragments are useful for our method.
Figure 11
Panels: (a) Effect of regularization. (b) Effect of beam size.
(a) The test accuracy is relatively insensitive to the regularization parameter λ in Equation (22).
(b) As the beam size K increases, the test accuracy increases; however, the computation cost also
increases. When K = 1, the optimization algorithm cannot learn any weights.
combination rules with a larger maximum fragment length, and long dictionary rules
capture more precise expressions than unigrams. This conclusion is the same as that in
Section 5.2.
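The two settings studied here, minimum fragment frequency and maximum fragment length, can be pictured as filters on n-gram counting. The following is a minimal sketch, assuming fragments are plain contiguous n-grams over whitespace tokens; the function name `frequent_fragments` and the toy sentences are hypothetical, and the actual fragment learning of s.parser (Section 4.2) may differ.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def frequent_fragments(sentences: Iterable[str],
                       min_freq: int = 4,
                       max_len: int = 7) -> Dict[Tuple[str, ...], int]:
    """Count n-grams of up to `max_len` tokens and keep those seen at least `min_freq` times."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # A low threshold keeps rare, noisy fragments; a high one discards useful ones.
    return {frag: c for frag, c in counts.items() if c >= min_freq}


# Made-up sentences, used only to exercise the function.
toy_sentences = ["a very good movie", "a very good film", "not good"]
kept = frequent_fragments(toy_sentences, min_freq=2, max_len=2)
```

Raising `min_freq` shrinks the returned dictionary, while raising `max_len` admits longer candidate fragments, which is the trade-off examined in Figure 10.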
As shown in Figure 11, we also investigate how the training iteration, regularization,
and beam size affect the results. As shown in Figure 11a, we try a wide range of
regularization parameters λ in Equation (22). The results indicate that the accuracy is insensitive
to the choice of λ. Figure 11b shows the effects of different beam sizes K in the search
process. When the beam size K = 1, the optimization algorithm cannot learn the weights.
In this case, the decoding process selects one search path randomly and computes
its polarity probabilities. The results become better as the beam size K increases. On
the other hand, the computation cost also increases. A proper beam size K can prune some
candidates to speed up the search procedure. It should be noted that the sentence length
also affects the run time.
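The role of the beam size K can be seen in a generic beam search, sketched below. This is not the s.parser decoder: the step structure, the toy scoring function, and all names (`beam_search`, `toy_score`) are assumptions chosen only to show how a wider beam keeps candidates that a narrow beam prunes too early.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]


def beam_search(options: Sequence[Sequence[str]],
                score: Callable[[Path, str], float],
                k: int) -> List[Tuple[float, Path]]:
    """Keep the k highest-scoring partial paths after each step."""
    beam: List[Tuple[float, Path]] = [(0.0, ())]
    for choices in options:
        expanded = [
            (total + score(prefix, choice), prefix + (choice,))
            for total, prefix in beam
            for choice in choices
        ]
        # Pruning step: only the k best candidates survive to the next step.
        beam = heapq.nlargest(k, expanded, key=lambda item: item[0])
    return beam


def toy_score(prefix: Path, choice: str) -> float:
    """Hypothetical scores: 'b' looks slightly worse at first but pairs well with 'd'."""
    base = {"a": 1.0, "b": 0.9, "c": 0.1, "d": 0.0}[choice]
    bonus = 1.0 if prefix[-1:] == ("b",) and choice == "d" else 0.0
    return base + bonus


steps = [["a", "b"], ["c", "d"]]
narrow = beam_search(steps, toy_score, k=1)[0][1]
wide = beam_search(steps, toy_score, k=2)[0][1]
```

With `k=1` the search commits to `a` and never sees the better path through `b`; with `k=2` it recovers that path at the cost of scoring more candidates, which is the accuracy versus run-time trade-off described for Figure 11b.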
5.5 Results of Grammar Learning
The sentiment grammar plays a central role in the statistical sentiment parsing
framework. The accuracy of s.parser clearly relies on the quality of the automatically
learned sentiment grammar. The quality can be implicitly evaluated by the accuracy of
sentiment classification results, as we have shown in previous sections. However, there
is no straightforward way to explicitly evaluate the quality of the learned grammar.
In this section, we provide several case studies of the learned dictionary rules and
combination rules to further illustrate the results of the sentiment grammar learning
process as detailed in Section 4.2.
To start with, we report the total number of dictionary rules and combination rules
learned from the data sets. As shown in Table 5, the results indicate that we can learn
more dictionary rules and combination rules from the larger data sets. Although we
learn more dictionary rules from RT-C than from IMDB-U, the number of combination
rules learned from RT-C is smaller than that from IMDB-U. This indicates that the language usage
of RT-C is more diverse than that of IMDB-U. For SST, more rules are learned because
of its constituent-level annotations.
Table 5
Number of rules learned from different data sets. τf represents the minimum fragment frequency,
|GD| represents the total number of dictionary rules, and |GC| is the total number of combination
rules.

Data Set   τf   |GD|      |GC|
RT-C       4    758,723   952
PL05-C     2    44,101    139
SST        4    336,695   751
RT-U       4    831,893   2,003
IMDB-U     4    249,718   1,014
MPQA       2    6,146     21
Furthermore, we explore how the minimum fragment frequency τf affects the
number of dictionary rules, and present the distribution of dictionary rule length. As
illustrated in Figure 12a, we find that the relation between the total number of dictionary
rules |GD| and the minimum fragment frequency τf obeys a power law, that is, the
graph of log10(|GD|) versus log2(τf) takes a linear form. This indicates that most of the fragments
appear only a few times, and only some of them appear frequently. Notably, all the syntactically
plausible phrases of SST are annotated, so its distribution is different from the other
sentence-level data sets. Figure 12b shows the cumulative distribution of dictionary rule
length l. It shows that most dictionary rules are short. For all data sets except SST,
more than 80% of dictionary rules are shorter than five words. The length distributions
of data sets RT-C and IMDB-U are similar, whereas we obtain more high-order n-grams
from RT-U and SST.
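The link between a power law and a straight line in log-log space can be checked numerically. If |GD| = C · τf^(−α), then log10(|GD|) = log10(C) − α · log10(2) · log2(τf), which is linear in log2(τf). The sketch below fits that line by ordinary least squares; the constants `C` and `ALPHA` and the resulting counts are synthetic values invented for illustration and are not measurements from the paper.

```python
import math
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic power law (assumption): |GD| = C * tau ** (-ALPHA).
C, ALPHA = 1e6, 1.5
taus = [2, 4, 8, 16, 32]  # the thresholds used in Figure 12a
rule_counts = [C * tau ** (-ALPHA) for tau in taus]

slope, intercept = fit_line(
    [math.log2(tau) for tau in taus],
    [math.log10(count) for count in rule_counts],
)
# The exponent is recovered from the slope of the log10 versus log2 line.
alpha_hat = -slope / math.log10(2)
```

On real counts the points would scatter around the line, and a clear departure from linearity (as reported for SST) suggests the power law does not hold for that data set.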
We further investigate the effect of context on dictionary rule learning. Table 6
shows some dictionary rules with polarity probabilities learned by our method and
Figure 12
Panels: (a) Effect of minimum fragment frequency (log10(|GD|) versus log2(τf)). (b) Cumulative
distribution of dictionary rule length l (l = 1 to 7). Both panels show RT-C, PL05-C, SST, RT-U,
IMDB-U, and MPQA.
(a) We choose τf = 2, 4, 8, 16, 32, and plot the log10(|GD|)–log2(τf) graph to show the effect of τf on the
total number of dictionary rules |GD|. The results (except SST) follow a power law distribution.
(b) The cumulative distribution of dictionary rule length l indicates that most dictionary rules
are short ones.
Table 6
Comparing our dictionary rule learning method with naive counting. The dictionary rules that
are assigned different polarities by these two methods are presented. N represents negative, and
P represents positive. The polarity probabilities of fragments are shown in this table, and they
demonstrate that our method learns more intuitive results than counting directly.

                       Naive Count              s.parser
Fragment               N     P     Polarity     N     P     Polarity
are fun                0.54  0.46  N            0.11  0.89  P
a very good movie      0.61  0.39  N            0.19  0.81  P
looks gorgeous         0.56  0.44  N            0.17  0.83  P
to enjoy the movies    0.53  0.47  N            0.14  0.86  P
is corny               0.43  0.57  P            0.83  0.17  N
's flawed              0.32  0.68  P            0.63  0.37  N
a difficult film to    0.43  0.57  P            0.67  0.33  N
disappoint             0.39  0.61  P            0.77  0.23  N
naive counting on RT-C. We notice that if we count the fragment occurrences
directly, some polarities of fragments are learned incorrectly. This is caused by the effect
of context as described in Section 4.2.1. By taking the context into consideration, we
obtain more reasonable polarity probabilities of dictionary rules. Our dictionary rule
learning method takes compositionality into consideration; namely, we skip the count
if there are negation indicators outside the phrase. This constraint tries to ensure
that the polarity of a fragment is the same as that of the whole sentence. As shown in the results,
the polarity probabilities learned by our method are more reasonable and match people's
intuitions. However, there are also some negative examples caused by "false subjective."
For example, the neutral phrase "to pay it" tends to appear in negative sentences, and it is
learned as a negative phrase. This makes sense for the data distribution, but it may lead
to a mismatch for the combination rules.
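The difference between naive counting and the context-aware counting described above can be sketched in a few lines. This is an illustrative sketch only: the negation indicator list `NEGATORS`, the function name `polarity_counts`, and the three labeled sentences are hypothetical, and the actual procedure in Section 4.2.1 may use different indicators and additional constraints.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical negation indicators; the paper's actual list is not reproduced here.
NEGATORS = {"not", "never", "no"}


def polarity_counts(labeled_sentences: Iterable[Tuple[str, str]],
                    fragment: str,
                    skip_negated_context: bool) -> Dict[str, int]:
    """Count how often `fragment` occurs in positive (P) and negative (N) sentences."""
    frag = tuple(fragment.split())
    counts = {"P": 0, "N": 0}
    for text, label in labeled_sentences:
        tokens = text.split()
        for i in range(len(tokens) - len(frag) + 1):
            if tuple(tokens[i:i + len(frag)]) != frag:
                continue
            outside = tokens[:i] + tokens[i + len(frag):]
            # Skip the occurrence when a negation indicator appears outside the fragment.
            if skip_negated_context and NEGATORS & set(outside):
                continue
            counts[label] += 1
    return counts


# Made-up labeled sentences for illustration.
toy_reviews = [
    ("these scenes are fun", "P"),
    ("i do not think these scenes are fun", "N"),
    ("it is not true that the sequels are fun", "N"),
]
naive = polarity_counts(toy_reviews, "are fun", skip_negated_context=False)
contextual = polarity_counts(toy_reviews, "are fun", skip_negated_context=True)
```

In this toy example, naive counting labels `are fun` as mostly negative because it inherits the label of negated sentences, whereas the context-aware count keeps only the occurrence whose sentence polarity plausibly matches the fragment.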
In Figure 13, we show the polarity model of some combination rules learned from
the data set RT-C. The first two examples are negation rules. We find that both switch
negation and shift negation exist in the data, whereas previous work uses only one negation
type (Choi and Cardie 2008; Saurí 2008; Taboada et al. 2011). For the rule
"N → i do not P," we find that it is a switch negation rule. This rule reverses the polarity
and the corresponding polarity strength. For example, "i do not like it very much" is
more negative than "i do not like it." As shown in Figure 13b, the "N → is not P" is
a shift negation that reduces the polarity strength by a fixed amount to reverse the original polarity.
Specifically, "is not good" is more negative than "is not great," as described in
Section 3.4. We have a similar conclusion for the next two weaken rules. As illustrated
in Figure 13c, the "P → P actress" describes one aspect of a movie, hence it is more
likely to decrease the polarity intensity. We find that this rule is a fixed intensification
rule that reduces the polarity probability by a fixed value. The "N → a bit of N" is
a percentage intensification rule, which scales polarity intensity by a percentage. It
reduces more strength for stronger polarity. The last two rules in Figure 13e and
Figure 13f are strengthen rules. Both "P → lot of P" and "N → N terribly" increase the
polarity strength of the sub-fragments. These cases indicate that it is necessary to learn
from data how the context performs compositionality. In order to capture the
compositionality for different rules, we define the polarity model and learn parameters for each
rule. This also agrees with the models of Socher et al. (2012) and Dong et al. (2014),
Figure 13
Panels: (a) N → i do not P; (b) N → is not P.; (c) P → P actress; (d) N → a bit of N;
(e) P → lot of P; (f) N → N terribly. Both axes of each panel range from 0.5 to 1.0.
Illustration of the polarity model for combination rules: (a)(b) Negation rule. (c)(d) Weaken rule.
(e)(f) Strengthen rule. The labels of axes represent the corresponding polarity labels, the red
points are the training instances, and the blue lines are the regression results for the polarity
model.
which use multiple composition matrices to make compositions specific and are an
improvement over the recursive neural network that employs one composition matrix.
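The rule types discussed for Figure 13 (switch versus shift negation, fixed versus percentage weakening) can be contrasted with four tiny functions. This sketch makes a simplifying assumption that differs from the paper: polarity is a signed strength in [-1, 1] rather than a probability fitted by the polarity model, and the constants `shift`, `delta`, and `ratio` as well as the example strengths are invented for illustration.

```python
def switch_negation(x: float) -> float:
    """Reverse the polarity and keep the strength: stronger input, stronger opposite."""
    return -x


def shift_negation(x: float, shift: float = 1.0) -> float:
    """Move the score by a fixed amount toward the opposite polarity."""
    return x - shift if x >= 0 else x + shift


def weaken_fixed(x: float, delta: float = 0.2) -> float:
    """Reduce the strength by a fixed value, never crossing zero."""
    sign = 1.0 if x >= 0 else -1.0
    return sign * max(0.0, abs(x) - delta)


def weaken_percentage(x: float, ratio: float = 0.5) -> float:
    """Scale the strength by a percentage, so stronger polarity loses more."""
    return x * (1.0 - ratio)


# Hypothetical strengths: "like" = 0.6, "like very much" = 0.9, "good" = 0.6, "great" = 0.9.
switch_weak, switch_strong = switch_negation(0.6), switch_negation(0.9)
shift_good, shift_great = shift_negation(0.6), shift_negation(0.9)
```

Under switch negation the stronger phrase becomes the more negative one, whereas under shift negation `shift_good` is more negative than `shift_great`, matching the "is not good" versus "is not great" example. The paper learns such behavior per rule from data instead of fixing it by hand.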
6. Conclusion and Future Work
In this article, we propose a statistical parsing framework for sentence-level sentiment
classification that provides a novel approach to designing sentiment classifiers from a
new perspective. It directly analyzes the sentiment structure of a sentence rather than
relying on syntactic parsing results, as in the existing literature. We show that complicated
phenomena in sentiment analysis, such as negation, intensification, and contrast, can
be handled in the same manner as simple and straightforward sentiment expressions in
a unified and probabilistic way. We provide a formal model to represent the sentiment
grammar built upon Context-Free Grammars. The framework consists of: (1) a parsing
model to analyze the sentiment structure of a sentence; (2) a polarity model to calculate
sentiment strength and polarity for each text span in the parsing process; and (3) a
ranking model to select the best parsing result from a list of candidate sentiment parse
trees. We show that the sentiment parser can be trained from examples of sentences
annotated only with sentiment polarity labels, without using any syntactic or sentiment
annotations within sentences. We evaluate the proposed framework on standard
sentiment classification data sets. The experimental results show that statistical
sentiment parsing notably outperforms the baseline sentiment classification approaches.
We believe the work on statistical sentiment parsing can be advanced from many
different perspectives. First, statistical parsing is a well-established research field,
in which many different grammars and parsing algorithms have been proposed in
previously published literature. It will be an interesting direction to apply and adjust more
advanced models and algorithms from syntactic parsing and semantic parsing
to our framework. We leave it as a line of future work. Second, we can incorporate target
and aspect information in the statistical sentiment parsing framework to facilitate
target-dependent and aspect-based sentiment analysis. Intuitively, this can be done by
introducing semantic tags of targets and aspects as new non-terminals in the sentiment
grammar and revising grammar rules accordingly. However, acquiring training data
will be an even more challenging task, as we need more fine-grained information.
Third, as statistical sentiment parsing produces more fine-grained information (e.g.,
the basic sentiment expressions from the dictionary rules as well as the sentiment
structure trees), we will have more opportunities to generate better opinion summaries.
Moreover, we are interested in jointly learning parameters of the polarity model and the
parsing model from data. Last but not least, we are interested in investigating
domain adaptation, which is a very important and challenging problem in sentiment
analysis. Generally, we may need to learn domain-specific dictionary rules for different
domains, whereas we believe combination rules are mostly generic across different
domains. This is also worth consideration for further study.
Acknowledgments
This research was partly supported by NSFC
(grant no. 61421003) and the fund of the
State Key Lab of Software Development
Environment (grant no. SKLSDE-2015ZX-05).
We gratefully acknowledge helpful
discussions with Dr. Nan Yang and the
anonymous reviewers.
References
Agarwal, Apoorv, Boyi Xie, Ilia Vovsha,
Owen Rambow, and Rebecca Passonneau.
2011. Sentiment analysis of Twitter
data. In Proceedings of the Workshop
on Languages in Social Media,
LSM '11, pages 30–38,
Stroudsburg, PA.
Agrawal, Rakesh and Ramakrishnan Srikant.
1994. Fast algorithms for mining
association rules in large databases.
In Proceedings of the 20th International
Conference on Very Large Data Bases,
VLDB '94, pages 487–499,
San Francisco, CA.
Artzi, Yoav and Luke Zettlemoyer. 2013.
Weakly supervised learning of semantic
parsers for mapping instructions to
actions. Transactions of the Association
for Computational Linguistics,
1(1):49–62.
Bao, Junwei, Nan Duan, Ming Zhou, and
Tiejun Zhao. 2014. Knowledge-based
question answering as machine
translation. In Proceedings of the
52nd Annual Meeting of the Association
for Computational Linguistics (Volume 1:
Long Papers), pages 967–976,
Baltimore, MD.
Charniak, Eugene. 1997. Statistical parsing
with a context-free grammar and word
statistics. In Proceedings of the Fourteenth
National Conference on Artificial Intelligence
and Ninth Conference on Innovative
Applications of Artificial Intelligence,
AAAI'97/IAAI'97, pages 598–603,
Providence, RI.
Charniak, Eugene and Mark Johnson. 2005.
Coarse-to-fine n-best parsing and maxent
discriminative reranking. In Proceedings
of the 43rd Annual Meeting of the Association
for Computational Linguistics, ACL '05,
pages 173–180, Stroudsburg, PA.
Chiang, David. 2007. Hierarchical
phrase-based translation. Computational
Linguistics, 33(2):201–228.
Choi, Yejin and Claire Cardie. 2008. Learning
with compositional semantics as structural
inference for subsentential sentiment
analysis. In Proceedings of the Conference on
Empirical Methods in Natural Language
Processing, EMNLP '08, pages 793–801,
Stroudsburg, PA.
Choi, Yejin and Claire Cardie. 2009a.
Adapting a polarity lexicon using integer
linear programming for domain-specific
sentiment classification. In Proceedings of
the 2009 Conference on Empirical Methods in
Natural Language Processing: Volume 2 –
Volume 2, EMNLP '09, pages 590–598,
Stroudsburg, PA.
Choi, Yejin and Claire Cardie. 2009b.
Adapting a polarity lexicon using integer
linear programming for domain-specific
sentiment classification. In Proceedings
of the 2009 Conference on Empirical
Methods in Natural Language Processing:
Volume 2 – Volume 2, pages 590–598,
Singapore.
Choi, Yejin and Claire Cardie. 2010.
Hierarchical sequential learning for
extracting opinions and their attributes.
In Proceedings of the ACL 2010 Conference
Short Papers, ACLShort '10, pages 269–274,
Stroudsburg, PA.
Chomsky, Noam. 1956. Three models
for the description of language. IRE
Transactions on Information Theory,
2(3):113–124.
Clarke, James, Dan Goldwasser,
Ming-Wei Chang, and Dan Roth. 2010.
Driving semantic parsing from the world's
response. In Proceedings of the Fourteenth
Conference on Computational Natural
Language Learning, CoNLL '10,
pages 18–27, Stroudsburg, PA.
Cocke, John. 1969. Programming Languages
and Their Compilers: Preliminary Notes.
Courant Institute of Mathematical
Sciences, New York University.
Councill, Isaac G., Ryan McDonald, and
Leonid Velikovich. 2010. What's great and
what's not: Learning to classify the scope
of negation for improved sentiment
analysis. In Proceedings of the Workshop on
Negation and Speculation in Natural
Language Processing, NeSp-NLP '10,
pages 51–59, Stroudsburg, PA.
Cui, Hang, Vibhu Mittal, and Mayur Datar.
2006. Comparative experiments on
sentiment classification for online product
reviews. In Proceedings of the 21st National
Conference on Artificial Intelligence –
Volume 2, AAAI'06, pages 1,265–1,270,
Boston, MA.
Davidov, Dmitry, Oren Tsur, and Ari
Rappoport. 2010. Enhanced sentiment
learning using Twitter hashtags and
smileys. In Proceedings of the 23rd
International Conference on Computational
Linguistics: Posters, COLING '10,
pages 241–249, Stroudsburg, PA.
de Marneffe, Marie-Catherine,
Christopher D. Manning, and
Christopher Potts. 2010. "Was it good? It
was provocative." Learning the meaning
of scalar adjectives. In Proceedings of the
48th Annual Meeting of the Association for
Computational Linguistics, ACL '10,
pages 167–176, Stroudsburg, PA.
Dong, Li, Furu Wei, Ming Zhou, and Ke Xu.
2014. Adaptive multi-compositionality for
recursive neural models with applications
to sentiment analysis. In AAAI Conference
on Artificial Intelligence, pages 1,537–1,543,
Quebec.
Duchi, John, Elad Hazan, and Yoram Singer.
Kasami, Tadao. 1965. An efficient recognition
2011. Adaptive subgradient methods
for online learning and stochastic
optimization. Journal of Machine Learning
Investigación, 12:2121–2159.
Admirador, Rong-En, Kai-Wei Chang,
Cho-Jui Hsieh, Xiang-Rui Wang,
and Chih-Jen Lin. 2008. Liblinear: A library
for large linear classification. Diario de
Machine Learning Research, 9:1871–1874.
Bien, Irving John. 1953. The population
frequencies of species and the estimation
of population parameters. Biometrika,
40(3-4):237–264.
Buen hombre, Joshua. 1999. Semiring parsing.
Ligüística computacional, 25(4):573–605.
Sala, David, Greg Durrett, and Dan Klein.
2014. Less grammar, more features. En
Proceedings of the 52nd Annual Meeting of the
Asociación de Lingüística Computacional
(Volumen 1: Artículos largos), pages 228–237,
baltimore, Maryland.
Hatzivassiloglou, Vasileios and Kathleen R.
McKeown. 1997. Predicting the semantic
orientation of adjectives. En procedimientos de
the 35th Annual Meeting of the Association
for Computational Linguistics and Eighth
Conference of the European Chapter of the
Asociación de Lingüística Computacional,
ACL ’98, pages 174–181, Stroudsburg, Pensilvania.
Jia, Lifeng, Clement Yu, and Weiyi Meng.
2009. The effect of negation on sentiment
analysis and retrieval effectiveness. En
Conference on Information and Knowledge
Management, pages 1,827–1,830,
Hong Kong.
Kaji, Nobuhiro and Masaru Kitsuregawa.
2007. Building lexicon for sentiment
analysis from massive collection of HTML
documentos. En Actas de la 2007 Joint
Jornada sobre Métodos Empíricos en Natural
Language Processing and Computational
Natural Language Learning
(EMNLP-CoNLL), pages 1,075–1,083,
Prague.
Kamps, Jaap, Robert J. Mokken,
Maarten Marx, and Maarten de Rijke.
2004. Using WordNet to measure semantic
orientation of adjectives. En procedimientos de
the 4th International Conference on Language
Resources and Evaluation (LREC 2004),
volume IV, pages 1,115–1,118, París.
Kanayama, Hiroshi and Tetsuya Nasukawa.
2006. Fully automatic lexicon expansion
for domain-oriented sentiment analysis. En
Actas de la 2006 Conferencia sobre
Empirical Methods in Natural Language
Procesando, EMNLP ’06, pages 355–363,
Stroudsburg, Pensilvania.
and syntax-analysis algorithm for
context-free languages. Technical report
AFCRL-65-758, Air Force Cambridge
Research Lab, Bedford, MAMÁ.
Kate, Rohit J. and Raymond J. Mooney.
2006. Using string-kernels for learning
semantic parsers. In ACL 2006: Actas
of the 21st International Conference on
Computational Linguistics and the 44th
Annual Meeting of the ACL, pages 913–920,
Morristown, Nueva Jersey.
Kennedy, Alistair and Diana Inkpen. 2006.
Sentiment classification of movie reviews
using contextual valence shifters.
Computational Intelligence, 22:110–125.
Klein, Dan and Christopher D. Manning.
2003. Accurate unlexicalized parsing. In
Proceedings of the 41st Annual Meeting on
Association for Computational Linguistics –
Volume 1, ACL ’03, pages 423–430,
Stroudsburg, PA.
Klenner, Manfred, Stefanos Petrakis, and
Angela Fahrni. 2009. Robust compositional
polarity classification. In Proceedings of the
International Conference RANLP-2009,
pages 180–184, Borovets.
Krestel, Ralf and Stefan Siersdorfer. 2013.
Generating contextualized sentiment
lexica based on latent topics and user
ratings. In Proceedings of the 24th ACM
Conference on Hypertext and Social Media,
HT ’13, pages 129–138, New York, NY.
Krishnamurthy, Jayant and Tom M. Mitchell.
2012. Weakly supervised training of
semantic parsers. In Proceedings of the 2012
Joint Conference on Empirical Methods in
Natural Language Processing and
Computational Natural Language Learning,
EMNLP-CoNLL ’12, pages 754–765,
Stroudsburg, PA.
Kübler, Sandra, Ryan McDonald, and Joakim
Nivre. 2009. Dependency parsing.
Synthesis Lectures on Human Language
Technologies, 1(1):1–127.
Li, Peng, Yang Liu, and Maosong Sun. 2013.
An extended GHKM algorithm for inducing
Lambda-SCFG. In Conference on Artificial
Intelligence, pages 605–611, Bellevue, WA.
Liang, Percy, Michael I. Jordan, and Dan
Klein. 2013. Learning dependency-based
compositional semantics. Computational
Linguistics, 39(2):389–446.
Liu, Bing. 2012. Sentiment Analysis and
Opinion Mining. Synthesis Lectures on
Human Language Technologies. Morgan
& Claypool Publishers.
Liu, Jingjing and Stephanie Seneff. 2009.
Review sentiment scoring via a
parse-and-paraphrase paradigm. In
Proceedings of the 2009 Conference on
Empirical Methods in Natural Language
Processing: Volume 1 – Volume 1,
EMNLP ’09, pages 161–169,
Stroudsburg, PA.
Liu, Shizhu, Gady Agam, and
David A. Grossman. 2012. Generalized
sentiment-bearing expression features for
sentiment analysis. In International
Conference on Computational Linguistics,
pages 733–744, Mumbai.
Lu, Yue, Malú Castellanos, Umeshwar Dayal,
and ChengXiang Zhai. 2011. Automatic
construction of a context-aware sentiment
lexicon: an optimization approach. In
International World Wide Web Conference,
pages 347–356, Hyderabad.
Maas, Andrew L., Raymond E. Daly,
Peter T. Pham, Dan Huang, Andrew Y. Ng,
and Christopher Potts. 2011. Learning
word vectors for sentiment analysis. In
Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics:
Human Language Technologies – Volume 1,
HLT ’11, pages 142–150,
Stroudsburg, PA.
Manning, Christopher D., Prabhakar
Raghavan, and Hinrich Schütze. 2008.
Introduction to Information Retrieval.
Cambridge University Press, New York.
Marcus, Mitchell P., Mary Ann
Marcinkiewicz, and Beatrice Santorini.
1993. Building a large annotated corpus of
English: the Penn Treebank. Computational
Linguistics, 19(2):313–330.
Matsumoto, Shotaro, Hiroya Takamura, and
Manabu Okumura. 2005. Sentiment
classification using word sub-sequences
and dependency sub-trees. In Proceedings
of the 9th Pacific-Asia Conference on Advances
in Knowledge Discovery and Data Mining,
PAKDD’05, pages 301–311, Hanoi.
McDonald, Ryan, Koby Crammer, and
Fernando Pereira. 2005. Online
large-margin training of dependency
parsers. In Proceedings of the 43rd Annual
Meeting on Association for Computational
Linguistics, ACL ’05, pages 91–98,
Stroudsburg, PA.
McDonald, Ryan, Kerry Hannan, Tyler
Neylon, Mike Wells, and Jeff Reynar. 2007.
Structured models for fine-to-coarse
sentiment analysis. In Proceedings of the
Association for Computational Linguistics
(ACL), pages 432–439, Prague.
Moilanen, Karo and Stephen Pulman. 2007.
Sentiment composition. In Proceedings of
Recent Advances in Natural Language
Processing (RANLP 2007), pages 378–382,
Borovets.
Moilanen, Karo, Stephen Pulman, and Yue
Zhang. 2010. Packed feelings and ordered
sentiments: Sentiment parsing with
quasi-compositional polarity sequencing
and compression. In Proceedings of the 1st
Workshop on Computational Approaches to
Subjectivity and Sentiment Analysis (WASSA
2010) at the 19th European Conference on
Artificial Intelligence (ECAI 2010),
pages 36–43, Lisbon.
Mudinas, Andrius, Dell Zhang, and Mark
Levene. 2012. Combining lexicon and
learning based approaches for
concept-level sentiment analysis. In
Proceedings of the First International
Workshop on Issues of Sentiment Discovery
and Opinion Mining, WISDOM ’12,
pages 5:1–5:8, New York, NY.
Nakagawa, Tetsuji, Kentaro Inui, and Sadao
Kurohashi. 2010. Dependency tree-based
sentiment classification using CRFs with
hidden variables. In Human Language
Technologies: The 2010 Annual Conference of
the North American Chapter of the Association
for Computational Linguistics, HLT ’10,
pages 786–794, Stroudsburg, PA.
Pang, Bo and Lillian Lee. 2004. A sentimental
education: Sentiment analysis using
subjectivity summarization based on
minimum cuts. In Proceedings of the 42nd
Meeting of the Association for Computational
Linguistics (ACL’04), Main Volume,
pages 271–278, Barcelona.
Pang, Bo and Lillian Lee. 2005. Seeing stars:
exploiting class relationships for sentiment
categorization with respect to rating scales.
In Proceedings of the 43rd Annual Meeting on
Association for Computational Linguistics,
ACL ’05, pages 115–124, Stroudsburg, PA.
Pang, Bo and Lillian Lee. 2008. Opinion
mining and sentiment analysis.
Foundations and Trends in Information
Retrieval, 2(1-2):1–135.
Pang, Bo, Lillian Lee, and Shivakumar
Vaithyanathan. 2002. Thumbs up?:
Sentiment classification using machine
learning techniques. In Proceedings of
the ACL-02 Conference on Empirical
Methods in Natural Language Processing –
Volume 10, EMNLP ’02, pages 79–86,
Stroudsburg, PA.
Polanyi, Livia and Annie Zaenen. 2006.
Contextual valence shifters. In J. G.
Shanahan, Y. Qu, and J. Wiebe, editors,
Computing Attitude and Affect in Text:
Theory and Applications. Springer
Netherlands, pages 1–10.
Quirk, R. 1985. A Comprehensive Grammar of
the English Language. Longman.
Ge, Ruifang and Raymond J. Mooney. 2006.
Discriminative reranking for semantic
parsing. In Proceedings of the COLING/ACL
on Main Conference Poster Sessions,
COLING-ACL ’06, pages 263–270,
Stroudsburg, PA.
Robbins, H. and S. Monro. 1951. A stochastic
approximation method. Annals of
Mathematical Statistics, 22:400–407.
Saurí, Roser. 2008. A Factuality Profiler for
Eventualities in Text. Ph.D. thesis, Brandeis
University.
Shieber, Stuart M., Yves Schabes, and
Fernando C. N. Pereira. 1995. Principles
and implementation of deductive parsing.
Journal of Logic Programming, 24(1–2):3–36.
Socher, Richard, Brody Huval,
Christopher D. Manning, and Andrew Y.
Ng. 2012. Semantic compositionality
through recursive matrix-vector spaces. In
Proceedings of the 2012 Joint Conference on
Empirical Methods in Natural Language
Processing and Computational Natural
Language Learning, EMNLP-CoNLL ’12,
pages 1,201–1,211, Stroudsburg, PA.
Socher, Richard, Jeffrey Pennington, Eric H.
Huang, Andrew Y. Ng, and Christopher D.
Manning. 2011. Semi-supervised recursive
autoencoders for predicting sentiment
distributions. In Proceedings of the
Conference on Empirical Methods in Natural
Language Processing, EMNLP ’11,
pages 151–161, Stroudsburg, PA.
Socher, Richard, Alex Perelygin, Jean Y. Wu,
Jason Chuang, Christopher D. Manning,
Andrew Y. Ng, and Christopher Potts.
2013. Recursive deep models for semantic
compositionality over a sentiment
treebank. In Proceedings of the Conference on
Empirical Methods in Natural Language
Processing (EMNLP), pages 1,631–1,642,
Seattle, WA.
Stolcke, Andreas. 2002. SRILM: An extensible
language modeling toolkit. In Proceedings
of the 7th International Conference on Spoken
Language Processing (ICSLP 2002),
pages 901–904, Denver, CO.
Taboada, Maite, Julian Brooke, Milan
Tofiloski, Kimberly Voll, and Manfred
Stede. 2011. Lexicon-based methods for
sentiment analysis. Computational
Linguistics, 37(2):267–307.
Täckström, Oscar and Ryan McDonald.
2011a. Discovering fine-grained sentiment
with latent variable structured prediction
models. In Proceedings of the 33rd European
Conference on Advances in Information
Retrieval, ECIR’11, pages 368–374,
Berlin.
Täckström, Oscar and Ryan McDonald.
2011b. Semi-supervised latent variable
models for sentence-level sentiment
analysis. In Proceedings of the 49th Annual
Meeting of the Association for Computational
Linguistics: Human Language Technologies:
Short Papers – Volume 2, HLT ’11,
pages 569–574, Stroudsburg, PA.
Takamura, Hiroya, Takashi Inui, and
Manabu Okumura. 2005. Extracting
semantic orientations of words using spin
model. In Proceedings of the 43rd Annual
Meeting of the Association for Computational
Linguistics, ACL ’05, pages 133–140,
Stroudsburg, PA.
Tu, Zhaopeng, Yifan He, Jennifer Foster,
Josef van Genabith, Qun Liu, and Shouxun
Lin. 2012. Identifying high-impact
sub-structures for convolution kernels in
document-level sentiment classification. In
Proceedings of the 50th Annual Meeting of the
Association for Computational Linguistics:
Short Papers – Volume 2, ACL ’12,
pages 338–343, Stroudsburg, PA.
Turney, Peter D. 2002. Thumbs up or thumbs
down?: Semantic orientation applied to
unsupervised classification of reviews. In
Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics,
ACL ’02, pages 417–424, Stroudsburg, PA.
Velikovich, Leonid, Sasha Blair-Goldensohn,
Kerry Hannan, and Ryan McDonald. 2010.
The viability of Web-derived polarity
lexicons. In Human Language Technologies:
The 2010 Annual Conference of the North
American Chapter of the Association for
Computational Linguistics, HLT ’10,
pages 777–785, Stroudsburg, PA.
Wainwright, Martin J. and Michael I. Jordan.
2008. Graphical models, exponential
families, and variational inference.
Foundations and Trends in Machine Learning,
1(1-2):1–305.
Wang, Sida and Christopher Manning. 2012.
Baselines and bigrams: Simple, good
sentiment and topic classification. In
Proceedings of the 50th Annual Meeting
of the Association for Computational
Linguistics (ACL 2012), pages 90–94,
Jeju Island.
Wiebe, Janyce, Theresa Wilson, and Claire
Cardie. 2005. Annotating expressions of
opinions and emotions in language.
Language Resources and Evaluation,
39(2-3):165–210.
Williams, Gbolahan K. and Sarabjot Singh
Anand. 2009. Predicting the polarity
strength of adjectives using Wordnet. In
ICWSM, pages 346–349, San Jose, CA.
Wilson, Theresa, Janyce Wiebe, and Paul
Hoffmann. 2009. Recognizing contextual
polarity: An exploration of features for
phrase-level sentiment analysis.
Computational Linguistics, 35:399–433.
Yessenalina, Ainur, Yisong Yue, and Claire
Cardie. 2010. Multi-level structured
models for document-level sentiment
classification. In Proceedings of the 2010
Conference on Empirical Methods in Natural
Language Processing, pages 1,046–1,056,
Cambridge, MA.
Younger, Daniel H. 1967. Recognition
and parsing of context-free languages
in time n^3. Information and Control,
10(2):189–208.
Yu, Hong and Vasileios Hatzivassiloglou.
2003. Towards answering opinion
questions: Separating facts from opinions
and identifying the polarity of opinion
sentences. In Proceedings of the 2003
Conference on Empirical Methods in Natural
Language Processing, EMNLP ’03,
pages 129–136, Stroudsburg, PA.
Zelle, John M. and Raymond J. Mooney.
1996. Learning to parse database queries
using inductive logic programming.
In AAAI/IAAI, pages 1,050–1,055,
Portland, OR.
Zettlemoyer, Luke S. and Michael Collins.
2007. Online learning of relaxed CCG
grammars for parsing to logical form. In
Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language
Processing and Computational Natural
Language Learning (EMNLP-CoNLL-2007),
pages 678–687, Prague.
Zettlemoyer, Luke S. and Michael Collins.
2009. Learning context-dependent
mappings from sentences to logical form.
In Proceedings of the Joint Conference of the
47th Annual Meeting of the ACL and the
4th International Joint Conference on
Natural Language Processing of the AFNLP:
Volume 2 – Volume 2, ACL ’09, pages
976–984, Stroudsburg, PA.
Zhao, Jichang, Li Dong, Junjie Wu, and
Ke Xu. 2012. Moodlens: An
emoticon-based sentiment analysis system
for Chinese tweets. In Proceedings of the
18th ACM SIGKDD International Conference
on Knowledge Discovery and Data
Mining, KDD ’12, pages 1,528–1,531,
New York, NY.