Supervised Gradual Machine Learning for Aspect-Term
Sentiment Analysis
Yanyan Wang†‡ Qun Chen∗†‡ Murtadha H.M. Ahmed†‡ Zhaoqiang Chen†‡
Jing Su†‡ Wei Pan†‡ Zhanhuai Li†‡
†School of Computer Science, Northwestern Polytechnical University, Xi’an, China
‡Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University,
Ministry of Industry and Information Technology, Xi’an, China
{wangyanyan@mail., chenbenben@, murtadha@mail., chenzhaoqiang@mail.,
sujing@mail., panwei1002@, lizhh@}nwpu.edu.cn
Abstract
Recent work has shown that Aspect-Term Sen-
timent Analysis (ATSA) can be effectively
performed by Gradual Machine Learning
(GML). However,
the performance of the
current unsupervised solution is limited by
inaccurate and insufficient knowledge con-
veyance. In this paper, we propose a supervised
GML approach for ATSA, which can effec-
tively exploit labeled training data to improve
knowledge conveyance. It leverages binary po-
larity relations between instances, which can
be either similar or opposite, to enable su-
pervised knowledge conveyance. Besides the
explicit polarity relations indicated by dis-
course structures, it also separately supervises
a polarity classification DNN and a binary
Siamese network to extract
implicit polar-
ity relations. The proposed approach fulfills
knowledge conveyance by modeling detected
relations as binary features in a factor graph.
Our extensive experiments on real benchmark
data show that it achieves the state-of-the-art
performance across all the test workloads. Our
work demonstrates clearly that, in collabora-
tion with DNN for feature extraction, GML
outperforms pure DNN solutions.
1 Introduction
Aspect-Term Sentiment Analysis (ATSA) is a
classical fine-grained sentiment classification task
(Pontiki et al., 2015, 2016). Aiming to analyze
detailed opinions towards certain aspects of an
entity, it has attracted extensive research interests.
In ATSA, an aspect-term, also called target, has
to explicitly appear in a review. For instance,
consider the running example shown in Table 1,
in which ri and sij denote the review and sentence
∗Corresponding author.
identifiers, respectively. In r1, ATSA needs to
predict the expressed sentiment polarity, positive
or negative, toward the explicit targets of space
and food.
The state-of-the-art solutions of ATSA have
been built upon pre-trained language models, such
as LCF-BERT (Zeng et al., 2019), BAT (Karimi
et al., 2020a), PH-SUM (Karimi et al., 2020b),
and RoBERTa+MLP (Dai et al., 2021), to name
a few. It is noteworthy that the efficacy of these
deep solutions depends on the independent and
identically distributed (i.i.d.) assumption. How-
ever, in real scenarios, there may not be sufficient
labeled training data; even if provided with suf-
ficient training data, the distributions of training
data and target data are almost certainly different
to some extent.
To alleviate the limitation of the i.i.d. assumption,
a solution based on the non-i.i.d. paradigm
of Gradual Machine Learning (GML) has recently
been proposed for ATSA (Wang et al., 2021).
GML begins with some easy instances, which
can be automatically labeled by the machine with
high accuracy, and then gradually labels more
challenging instances by iterative knowledge con-
veyance in a factor graph. Without exploiting
labeled training data, the current unsupervised
solution relies on sentiment lexicons and explicit
polarity relations indicated by discourse structures
to enable knowledge conveyance. An improved
GML solution leverages unsupervised DNN to ex-
tract sentiment features beyond lexicons (Ahmed
et al., 2021). It has been empirically shown
that even without leveraging any labeled training
data, unsupervised GML can achieve competi-
tive performance compared with many supervised
deep models. However, unsupervised sentiment
Transactions of the Association for Computational Linguistics, vol. 11, pp. 723–739, 2023. https://doi.org/10.1162/tacl_a_00571
Action Editor: Minlie Huang. Submission batch: 11/2022; Revision batch: 1/2023; Published 6/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
ri    sij    Text
r1    s11    Space was limited, but the food made up for it.
r2    s21    The food is sinful.
r2    s22    The staff was really friendly.
Table 1: A running example in the domain of restaurant.
features are usually incomplete and noisy. Mean-
while, even though explicit polarity relations are
accurate, they are usually very sparse in real nat-
ural language corpus. Therefore, the performance
of gradual learning is still limited by inaccurate
and insufficient knowledge conveyance.
Hence, there is a need to investigate how to
leverage labeled training data to improve gradual
learning. In this paper, we propose a supervised
solution based on GML for ATSA. As pointed
out by Wang et al. (2021), linguistic hints can be
very helpful for polarity reasoning. For instance,
as shown in Table 1, the two aspect polarities
of s11 can be reasoned to be opposite because
their opinion clauses are connected by the shift
word of ‘‘but’’, while the absence of any shift
word between s21 and s22 indicates their polar-
ity similarity. Representing the most direct way
of knowledge conveyance, such binary polarity
relations can effectively enable gradual learning.
Unfortunately, the binary relations indicated by
discourse structures are usually sparse in real
natural language corpora. Therefore, besides ex-
plicit polarity relations, our proposed approach
also separately supervises a DNN for polarity
classification and a Siamese network to extract
implicit polarity relations.
A supervised DNN can usually effectively sep-
arate the instances with different polarities. As a
result, two instances appearing very close in its
embedding space usually have the same polar-
ity. Therefore, we leverage a polarity classifier
for the detection of polarity similarity between
close neighbors in an embedding space. It can
also be observed that in natural languages, there
are many different types of patterns to associate
opinion words with polarities. However, a po-
larity classifier may put the instances with the
same polarity but different association patterns in
far-away places in its embedding space. In com-
parison, metric learning can cluster the instances
with the same polarity together while separating
those with different polarities as far as possible
(Kaya and Bilge, 2019); it can thus align differ-
ent association patterns with the same polarity.
Therefore, we also employ a Siamese network for
metric learning, which has been shown to perform
well on semantic textual similarity tasks (Reimers
and Gurevych, 2019), to detect complementary
polarity relations. A Siamese network can detect
both similar and opposite polarity relations be-
tween two arbitrary instances, which may be far
away in an embedding space.
Finally, our proposed approach fulfills knowl-
edge conveyance by modeling polarity relations
as binary features in a factor graph. In our imple-
mentation, we use the state-of-the-art DNN model
for ATSA, RoBERTa+MLP (Dai et al., 2021),
to capture neighborhood-based polarity similarity
while adapting the Siamese network (Chopra et al.,
2005), the classical model of deep metric learning
(Kaya and Bilge, 2019), to extract arbitrary polar-
ity relations. It is worth pointing out that our work
is orthogonal to the research on polarity classi-
fication DNNs and Siamese networks in that the
proposed approach can easily accommodate new
polarity classifiers and Siamese network models.
The main contributions of this paper can be
summarized as follows:
• We propose a supervised GML approach for
ATSA, which can effectively exploit labeled
training data to improve gradual learning;
• We present the supervised techniques to ex-
tract implicit polarity relations for ATSA,
which can be easily instilled into a GML
factor graph to enable supervised knowledge
conveyance;
• We empirically validate the efficacy of the
proposed approach on real benchmark data.
Our extensive experiments have shown that
it consistently achieves the state-of-the-art
performance across all the test datasets.
2 Related Work
Sentiment analysis at different granularity levels,
including document, sentence, and aspect lev-
els, has been extensively studied in the literature
(Ravi and Ravi, 2015). At the document (resp.,
sentence) level, its goal is to detect the polarity
of the entire document (resp., sentence) without
regard to the mentioned aspects (Zhang et al.,
2015; Johnson and Zhang, 2017; Qian et al., 2017;
Reimers and Gurevych, 2019). The state-of-the-art
solutions for document-level and sentence-level
sentiment analysis have been built upon various
DNN models (Lei et al., 2018; Long et al., 2017;
Letarte et al., 2018). However, they cannot be
directly applied to the finer-grained aspect-level
sentiment analysis because a document or sen-
tence may express different polarities towards
different aspects. The task of aspect-level senti-
ment analysis has been further classified into two
finer subtasks, Aspect-Term Sentiment Analysis
(ATSA) and Aspect-Category Sentiment Analy-
sis (ACSA) (Xue and Li, 2018). ATSA aims to
predict the sentiment polarity associated with an
explicit aspect term appearing in the text while
ACSA deals with both explicit and implicit as-
pects. In this paper, we focus on the far more
popular subtask of ATSA. But, as shown in our
experimental evaluation, our proposed approach
is also applicable to ACSA.
Even though early work on deep learning for
ATSA employed non-attention models (Dong
et al., 2014; Tang et al., 2016), more recent pro-
posals leveraged various attention mechanisms
to output aspect-specific sentiment features, such
as Interactive Attention Networks (Ma et al.,
2017), Recurrent Attention Network (Chen et al.,
2017), Content Attention Model (Liu et al., 2018),
Multi-grained Attention Network (Fan et al.,
2018), Segmentation Attention Network (Wang
and Lu, 2018), Attention-over-Attention Neural
Networks (Huang et al., 2018), and Effective At-
tention Modeling (He et al., 2018) to name a few.
Most recently, the focus has experienced a consid-
erable shift towards how to leverage pre-trained
language models for ATSA, e.g., BERT-SPC
(Song et al., 2019), AEN-BERT (Attentional
Encoder Network)
(Song et al., 2019), and
LCF-BERT (Local Context Focus) (Zeng et al.,
2019). Since BERT is trained on Wikipedia arti-
cles and has limited ability to understand review
texts, Xu et al. proposed to first post-train BERT on
both domain knowledge and task knowledge, and
then fine-tune the resulting model of BERT-PT on
supervised domain data (Xu et al., 2019). Since
then, many models built upon BERT-PT have
been proposed (Karimi et al., 2020a,b). Other
variants of BERT for ATSA include Adapted
BERT (BERT-ADA) (Rietzler et al., 2020), Ro-
bustly Optimized BERT (RoBERTa) (Dai et al.,
2021), and BERT with Disentangled Attention
(DeBERTa) (Silva and Marcacini, 2021).
Since syntax structures are helpful for as-
pects to find their contextual words, many
syntax-enhanced models have been recently pro-
posed for ATSA, such as Proximity-Weighted
Convolution Network (PWCN) (Zhang et al.,
2019), Relational Graph Attention Network
(RGAT) (Bai et al., 2021), Graph Convolutional
Networks (GCN) (Zhao et al., 2020), Depen-
dency Graph Enhanced Dual-Transformer net-
work (DGEDT) (Tang et al., 2020), Type-aware
Graph Convolutional Networks (T-GCN) (Tian
et al., 2021), and Knowledge-aware Gated Recur-
rent Memory Network with Dual Syntax Graph
(KaGRMN-DSG) (Xing and Tsang, 2022). They
focused on how to exploit explicit syntactic in-
formation provided by dependency-based parse
trees. Other proposals investigated how to induce
implicit syntactic information from pre-trained
models (Dai et al., 2021).
The GML paradigm was first proposed for the
task of entity resolution (Hou et al., 2022). Since
then, it has also been applied to the task of ATSA
(Wang et al., 2021; Ahmed et al., 2021). Without
exploiting labeled training data, the performance
of unsupervised GML is usually limited by inac-
curate and insufficient knowledge conveyance. In
this paper, we focus on how to leverage labeled
examples to improve gradual learning for ATSA.
3 The GML Framework
In this section, we illustrate the GML framework
by the existing unsupervised GML solution for
ATSA (Wang et al., 2021). Given a corpus of
reviews, R, the goal of ATSA is to predict the
sentiment polarity of each aspect unit in R, ti =
(rj, sk, al), where rj denotes a review, sk denotes
a sentence in the review rj, and al denotes an
explicit aspect appearing in the sentence sk. In
this paper, we suppose that an aspect polarity is
either positive or negative.
As shown in Figure 1, the framework consists
of the following three essential steps:
3.1 Easy Instance Labeling
Gradual machine learning begins with some easy
instances. Therefore, high label accuracy of easy
instances is critical for GML’s ultimate perfor-
mance. The existing unsupervised solution for
ATSA employs simple user-specified rules to
identify non-ambiguous instances as easy ones
(Wang et al., 2021). Specifically, if a sentence
contains some strong positive (resp., negative)
sentiment words, but no negation, contrast, and
hypothetical connectives, it can be reliably reasoned
to be positive (resp., negative). It is noteworthy
that since this paper considers ATSA in the
supervised setting, in which some labeled training
data are supposed to be available, these training
data with ground-truth labels can naturally serve
as initial easy instances.
Figure 1: Unsupervised GML Solution for ATSA.
3.2 Feature Extraction and Influence Modeling
Features serve as the medium to convey the
knowledge obtained from labeled easy instances
to unlabeled harder ones. This step extracts the
common features shared by the labeled and
unlabeled instances. To facilitate effective knowledge
conveyance, it is desirable that a wide variety
of features are extracted to capture diverse
information. For each extracted feature, this step also
needs to model its influence over the labels of
relevant instances.
The existing unsupervised solution for ATSA
presented in Wang et al. (2021) relies on sentiment
lexicons and explicit polarity relations indicated
by discourse structures to enable knowledge
conveyance. Specifically, given a sentiment word,
positive or negative, any sentence containing the
word is supposed to have the same polarity as the
word. Similarly, a similar (resp., opposite) polarity
relation between two instances indicates that
they are expected to have the same (resp., opposite)
polarities. GML models word and relation
features as unary and binary factors in a factor
graph respectively.
3.3 Gradual Inference
This step gradually labels the instances with
increasing hardness. Gradual learning is fulfilled by
iterative inference on a factor graph, $G$, which
consists of evidence variables representing labeled
instances, inference variables representing
unlabeled instances, and factors representing their
features. The values of evidence variables once
labeled remain unchanged while the values of
inference variables need to be gradually inferred
based on $G$.
Formally, suppose that a factor graph, $G$, consists
of a set of evidence variables, $\Lambda$, a set
of inference variables, $V_I$, and a group of factor
functions of variables indicating their correlations,
denoted by $\phi_{w_i}(V_i)$. In the case of ATSA, each
variable in the factor graph is a boolean variable
indicating the polarity of an aspect unit, the value
of 1 for positive and 0 for negative. Then, the joint
probability distribution over $V = \{\Lambda, V_I\}$ of $G$
can be formulated as
$$P_w(\Lambda, V_I) = \frac{1}{Z_w} \prod_{i=1}^{m} \phi_{w_i}(V_i), \tag{1}$$
where $V_i$ denotes a set of variables, $w_i$ denotes a
factor weight, $m$ denotes the total number of factors,
and $Z_w$ denotes the normalization constant.
Factor inference on $G$ learns factor weights by
minimizing the negative log marginal likelihood
of evidence variables as follows:
$$\hat{w} = \arg\min_{w} \; -\log \sum_{V_I} P_w(\Lambda, V_I). \tag{2}$$
In each iteration, GML generally chooses to
label the inference variable with the highest degree
of evidential certainty. Given an inference variable
$v$, GML measures its evidential certainty by the
inverse of entropy as follows:
$$E(v) = \frac{1}{H(v)} = \frac{1}{-\sum_{i=0,1} P_i(v) \cdot \log_2 P_i(v)}, \tag{3}$$
in which $E(v)$ and $H(v)$ denote the evidential
certainty and entropy of $v$ respectively, and $P_i(v)$
denotes the inferred probability of $v$ having the
label of 0 or 1. The iteration is repeatedly invoked
until all the instances are labeled.
Algorithm 1: Scalable Gradual Inference
while there exists any unlabeled variable in G do
    V′ ← all the unlabeled variables in G;
    for v ∈ V′ do
        Measure the evidential support of v in G;
    Select top-m unlabeled variables with the most evidential support (denoted by Vm);
    for v ∈ Vm do
        Approximately rank the entropy of v in Vm;
    Select top-k most promising variables in terms of entropy in Vm (denoted by Vk);
    for v ∈ Vk do
        Compute the probability of v in G by factor graph inference over a subgraph of G;
    Label the variable with the minimal entropy in Vk;
To improve efficiency, GML usually imple-
ments gradual inference by a scalable approach,
as sketched in Algorithm 1. It consists of three
steps: measurement of evidential support, approx-
imate ranking of entropy, and construction of
inference subgraph. It first selects the top-m unla-
beled variables with the most evidential support in
G as the inference candidates. For each unlabeled
instance, GML measures its evidential support
from each feature by the degree of labeling confi-
dence indicated by labeled observations, and then
aggregates them based on the Dempster-Shafer
theory.¹ It then approximates entropy estimation
by an efficient algorithm on the m candidates and
selects only the top-k most promising variables
among them for factor graph inference. Finally, it
estimates the probabilities of the finally chosen k
variables by factor graph inference.
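The selection logic of Algorithm 1, together with the evidential certainty of Eq. 3, can be illustrated with a minimal Python sketch. It assumes that evidential support scores and cheap probability estimates are already available as dictionaries, and it uses a hypothetical callable `infer` as a stand-in for factor graph inference over a subgraph; none of these names come from the released GML code.

```python
import math

def entropy(p_pos):
    """Binary entropy H(v) in bits, given P(v = 1) = p_pos."""
    if p_pos <= 0.0 or p_pos >= 1.0:
        return 0.0
    return -(p_pos * math.log2(p_pos) + (1 - p_pos) * math.log2(1 - p_pos))

def evidential_certainty(p_pos):
    """E(v) = 1 / H(v) as in Eq. 3; infinite when the entropy is zero."""
    h = entropy(p_pos)
    return math.inf if h == 0.0 else 1.0 / h

def select_next(support, approx_prob, infer, m, k):
    """One iteration of the selection step sketched in Algorithm 1.

    support:     dict variable -> evidential support (assumed precomputed)
    approx_prob: dict variable -> cheap estimate of P(v = 1) (hypothetical)
    infer:       callable variable -> P(v = 1), stand-in for subgraph inference
    """
    # Step 1: keep the m candidates with the most evidential support.
    top_m = sorted(support, key=support.get, reverse=True)[:m]
    # Step 2: approximately rank by entropy and keep the k most promising.
    top_k = sorted(top_m, key=lambda v: entropy(approx_prob[v]))[:k]
    # Step 3: run (stand-in) inference only on the k finalists.
    probs = {v: infer(v) for v in top_k}
    # Label the variable with minimal entropy, i.e., maximal certainty.
    best = min(probs, key=lambda v: entropy(probs[v]))
    return best, int(probs[best] >= 0.5)
```

Calling `select_next` repeatedly, and moving each returned variable into the evidence set, would reproduce the outer `while` loop of the algorithm under these assumptions.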
4 Supervised GML for ATSA
The overview of the proposed approach, denoted
by S-GML, is shown in Figure 2. In this section,
we first describe how to extract relational features,
and then present their factor modeling.
¹ https://en.wikipedia.org/wiki/Dempster-Shafer_theory.
4.1 Polarity Relation Extraction
As mentioned in the Introduction, there exist some
discourse relations between clauses or sentences
that can provide helpful hints for polarity reason-
ing. Specifically, if two sentences are connected
with a shift word (e.g., ‘‘but’’ and ‘‘however’’),
they usually have opposite polarities. In con-
trast, two neighboring sentences without any shift
word between them usually have similar polari-
ties. S-GML uses the same rules as presented in
Wang et al. (2021) to extract the explicit relations
indicated by discourse structures. Therefore, we
focus on how to extract implicit polarity relations
in the rest of this subsection.
4.1.1 By Polarity Classification DNN
Since a supervised DNN can effectively sepa-
rate the instances with different polarities, two
instances appearing very close in its embedding
space usually have the same polarity. Therefore,
we supervise a DNN to automatically generate
polarity-sensitive vector representations, and then
exploit them for polarity similarity detection based
on the nearest neighborhood.
As shown in Figure 2(b), we extract kn-nearest
neighbors of each unlabeled instance from both la-
beled training data and unlabeled test data, where
vector distance is measured by cosine distance. To
ensure that only very close instances in the embedding
space are considered to be similar, we also set
a strict threshold (e.g., 0.05 in our implementation)
to filter out unreliable pairs. Our experiments have
demonstrated that the performance of supervised
GML is robust w.r.t. the value of kn provided that it
is set within a reasonable range (between 5 and 9).
In the implementation, we use RoBERTa+MLP
(Dai et al., 2021), the state-of-the-art deep model for
ATSA, to learn polarity-sensitive vector representations.
However, other deep models for ATSA
can also be applied.
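The neighborhood-based extraction described above can be sketched in a few lines of Python. The sketch assumes that embeddings are already available as plain lists of floats (in the paper they come from RoBERTa+MLP), that all vectors are non-zero, and that the threshold is applied as an upper bound on cosine distance; the function names and the `max_dist` parameter are illustrative only.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def knn_similar_pairs(unlabeled, pool, k_n=5, max_dist=0.05):
    """Return (instance, neighbor, distance) triples treated as 'similar'.

    unlabeled: dict id -> embedding of the unlabeled instances
    pool:      dict id -> embedding of labeled and unlabeled instances
    """
    pairs = []
    for uid, u in unlabeled.items():
        # Distances to every other instance in the pool, closest first.
        dists = sorted(
            (cosine_distance(u, v), nid) for nid, v in pool.items() if nid != uid
        )
        # Keep the k_n nearest neighbors that also pass the distance filter.
        for d, nid in dists[:k_n]:
            if d <= max_dist:
                pairs.append((uid, nid, d))
    return pairs
```

Each returned pair would then become one similar-polarity relation for the factor graph.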
4.1.2 By Siamese Network
A polarity classifier may put the instances with
the same polarity but different opinion association
patterns in far-away places in its embedding space.
To extract complementary polarity relations, we
also employ metric learning, which can cluster the
instances with the same polarity together while
separating those with different polarities as far
as possible, to align different association patterns
with the same polarity. Metric learning can de-
tect both similar and opposite polarity relations
between two arbitrary instances, which may be far
away in an embedding space. In our implementation,
we use the Siamese network, which has
been shown to perform well on semantic textual
similarity tasks (Reimers and Gurevych, 2019),
to detect polarity relations between two arbitrary
instances.
Figure 2: The overview of S-GML: 1) it extracts three types of polarity relations; 2) it models the extracted
relations as binary factors to enable gradual learning.
The structure of the Siamese network is
shown in Figure 2(c). Given two instances,
$t_1 = (r_1, s_1, a_1)$ and $t_2 = (r_2, s_2, a_2)$, it first
generates their vector representations by feeding the
sequence of ‘‘[CLS] + aspect + [SEP] + text +
[SEP]’’ into the BERT model, then computes their
mutual information by multiplication, and finally
uses a linear layer to predict their polarity relation,
0 for opposite and 1 for similar. The whole process
can be represented by
$$v_1 = \mathrm{BERT}(t_1), \tag{4}$$
$$v_2 = \mathrm{BERT}(t_2), \tag{5}$$
$$pr = \mathrm{softmax}([v_1 \odot v_2] * W), \tag{6}$$
where $d_m$ denotes the dimension of the BERT
model, $v_1, v_2 \in \mathbb{R}^{d_m}$ denote the pooled vector
representations, $W \in \mathbb{R}^{d_m \times 2}$ denotes the weights
of the linear layer, $\odot$ denotes the element-wise
multiplication, and $pr \in \mathbb{R}^{1 \times 2}$ denotes the output
of the Siamese network, $pr = [d, 1 - d]$, where
$d$ denotes the predicted dissimilarity probability
obtained from the softmax layer.
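To make Eq. 6 concrete, the following minimal Python sketch computes the output of the Siamese head from two already-pooled vectors. The BERT encoder itself is not reproduced; `v1` and `v2` are toy stand-ins for its pooled outputs, and `build_input` is a hypothetical helper that only illustrates the input layout (real tokenizers normally insert the special tokens themselves).

```python
import math

def build_input(aspect, text):
    """Illustrative layout of the sequence '[CLS] + aspect + [SEP] + text + [SEP]'."""
    return "[CLS] " + aspect + " [SEP] " + text + " [SEP]"

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def siamese_head(v1, v2, W):
    """Eq. 6: softmax((v1 ⊙ v2) W), with W given as d_m rows of 2 weights.

    Returns [d, 1 - d]: index 0 is the probability of an opposite relation,
    index 1 the probability of a similar relation.
    """
    h = [a * b for a, b in zip(v1, v2)]  # element-wise multiplication
    logits = [sum(h[i] * W[i][c] for i in range(len(h))) for c in range(2)]
    return softmax(logits)
```

With toy weights `W = [[0.0, 1.0], [0.0, 1.0]]`, two identical vectors `[1.0, -1.0]` yield a higher probability for "similar", while `[1.0, -1.0]` and `[-1.0, 1.0]` yield a higher probability for "opposite".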
The training of a Siamese network aims to
minimize the binary cross-entropy loss defined as
$$L = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}), \tag{7}$$
where $\hat{y}$ denotes the predicted probability that two
instances have the same polarity, and $y$ denotes
the ground-truth label indicating whether they are
similar or opposite. Since the Siamese network is
supposed to predict binary labels, 0 or 1, as usual,
we set the threshold at 0.5. Certainly, its predic-
tions are noisy, containing some false positives
and false negatives. However, gradual learning
does not require all the predicted relations to be
correct; instead a set of noisy relations can cor-
rectly predict the label of a target instance pro-
vided that the majority of them are correct.
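The loss of Eq. 7, the 0.5 decision threshold, and the intuition that a majority of correct relations suffices can be sketched as follows. Note that `vote_polarity` is a deliberately simplified majority vote used only to illustrate the robustness argument; S-GML itself resolves noisy relations through weighted factor graph inference, not through this function.

```python
import math

def bce_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy of Eq. 7; y_hat is clipped to avoid log(0)."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def relation_label(p_similar, threshold=0.5):
    """Binarize the Siamese output: 1 for similar, 0 for opposite."""
    return 1 if p_similar >= threshold else 0

def vote_polarity(relations):
    """Simplified stand-in for inference: majority vote over noisy relations.

    relations: list of (neighbor_label, relation), where neighbor_label is
    1 (positive) or 0 (negative) and relation is 1 (similar) or 0 (opposite).
    """
    # A similar relation votes for the neighbor's label, an opposite one for its flip.
    votes = [label if rel == 1 else 1 - label for label, rel in relations]
    return 1 if 2 * sum(votes) > len(votes) else 0
```

For example, `vote_polarity([(1, 1), (0, 0), (1, 0)])` returns 1: two of the three relations vote for a positive label, so one erroneous relation does not change the outcome.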
For the training of the Siamese network, S-GML
randomly selects a fixed number of binary relations
for each labeled training instance (e.g., 80 in our
implementation), half of which are similar and the
other half opposite. In the prediction phase, for each
unlabeled instance, S-GML randomly selects ks
instances from both labeled and unlabeled instances
to extract its binary relations. Since polarity relation
detection between two arbitrary instances is
generally more challenging than polarity similarity
detection between close neighbors in an embedding
space, the number of relations constructed
based on the Siamese network per instance, denoted
by ks, should be no greater
than the number of its extracted nearest neighbors,
namely ks ≤ kn. Our experiments have
demonstrated that the performance of supervised
GML is robust w.r.t. the values of kn and ks
provided that they are set to be within a rea-
sonable range (between 3 and 9). It is notewor-
thy that the total number of relations extracted
by the Siamese network can be represented by
O(m × ks), in which m denotes the number of
unlabeled instances in a target workload. Due to
the limited value of ks, relation extraction by the
Siamese network can be executed very efficiently.
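The two sampling procedures can be illustrated with a short Python sketch. It assumes sampling without replacement and caps the number of pairs when too few candidates exist; both choices, as well as the function names, are assumptions made for illustration rather than details stated in the paper.

```python
import random

def sample_training_pairs(anchor_id, labels, n_total=80, rng=None):
    """Balanced relations for one labeled instance: half similar, half opposite.

    labels: dict id -> 0/1 polarity of the labeled training instances.
    Returns (anchor, other, relation) triples, relation 1 = similar, 0 = opposite.
    """
    rng = rng or random.Random(0)
    anchor_label = labels[anchor_id]
    same = [i for i, y in labels.items() if i != anchor_id and y == anchor_label]
    diff = [i for i, y in labels.items() if y != anchor_label]
    # Cap at what is available so the two halves stay balanced (assumption).
    n = min(n_total // 2, len(same), len(diff))
    pairs = [(anchor_id, j, 1) for j in rng.sample(same, n)]
    pairs += [(anchor_id, j, 0) for j in rng.sample(diff, n)]
    return pairs

def sample_prediction_partners(target_id, candidate_ids, k_s=3, rng=None):
    """Randomly pick k_s partner instances for one unlabeled target instance."""
    rng = rng or random.Random(0)
    pool = [i for i in candidate_ids if i != target_id]
    return rng.sample(pool, min(k_s, len(pool)))
```

Since each of the m unlabeled instances receives at most `k_s` partners, the number of Siamese predictions grows as O(m × ks), matching the cost stated above.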
4.2 Factor Modeling of Polarity Relations
An example of a GML factor graph for ATSA
is shown in Figure 2(d). S-GML models polar-
ity relations as binary factors to enable gradual
knowledge conveyance from labeled instances to
unlabeled ones.
Formally, the constructed factor graph G de-
fines a joint probability distribution over its
variables V by
$$P_w(V = v) = \frac{1}{Z_w} \prod_{f \in F} \phi_f(v_i, v_j), \tag{8}$$
where $v_i$ denotes a Boolean variable indicating
the polarity of an aspect unit, $F = F_c \cup F_k \cup F_s$
denotes the set of all binary factors corresponding
to context-based, KNN-based and Siamese-based
relational features respectively, and the binary
factor $\phi_f(v_i, v_j)$ is formulated as
$$\phi_f(v_i, v_j) = \begin{cases} e^{w_f} & \text{if } v_i = v_j; \\ 1 & \text{otherwise}; \end{cases} \tag{9}$$
where $v_i$ and $v_j$ denote the two variables sharing
the binary feature $f$, and $w_f$ denotes the weight
of $f$. It is noteworthy that a factor function,
which aims to measure the correlation between
variables, is usually defined as an exponential
function (Kschischang et al., 2001). It should take
non-negative values, and have larger values if its
correlated variables take desired values. Therefore,
in Eq. 9, the weight of a similar factor is
positive, or $w_f > 0$, while the weight of an
opposite factor is negative, or $w_f < 0$. This
encoding encourages two variables sharing a similar
factor to hold the same polarity, and two variables
sharing an opposite factor to hold opposite polarities.
In S-GML, we have five types of relational fac-
tors, two modeling explicit relations (similar and
opposite), one modeling implicit relations detected
by polarity classifier (only similar), and the re-
maining two modeling implicit relations detected
by Siamese network (similar and opposite). The
factors of the same type are supposed to have the
same weight. In our implementation, the weights
of similar factors are initially set to 2 while the
weights of opposite factors are set to −2. However,
all the five factor weights have to be continuously
learned in the process of gradual inference.
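The effect of Eqs. 8 and 9 can be checked on a toy graph with a brute-force Python sketch. It enumerates all assignments of the unlabeled variables, which is only feasible for a handful of variables, and it keeps the weights fixed at the initial values of 2 and −2 instead of learning them; the function names are illustrative and do not correspond to the released implementation.

```python
import math
from itertools import product

def factor(w, vi, vj):
    """Eq. 9: e^w if the two polarities agree, 1 otherwise."""
    return math.exp(w) if vi == vj else 1.0

def marginal_positive(target, variables, evidence, factors):
    """P(target = 1 | evidence) under Eq. 8, by exhaustive enumeration.

    variables: list of variable names
    evidence:  dict name -> 0/1 for labeled instances
    factors:   list of (var_i, var_j, weight) binary factors
    """
    free = [v for v in variables if v not in evidence]
    score = {0: 0.0, 1: 0.0}
    for values in product([0, 1], repeat=len(free)):
        assign = dict(evidence)
        assign.update(zip(free, values))
        # Unnormalized probability: product of all factor values.
        weight = 1.0
        for i, j, w in factors:
            weight *= factor(w, assign[i], assign[j])
        score[assign[target]] += weight
    # The normalization constant Z_w cancels in this ratio.
    return score[1] / (score[0] + score[1])
```

For a chain in which labeled `a` (positive) is similar to `b` (weight 2) and `b` is opposite to `c` (weight −2), the sketch gives `b` a positive marginal above 0.5 and `c` one below 0.5, which is the conveyance behavior described above.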
5 Empirical Evaluation
In this section, we empirically evaluate the per-
formance of the proposed approach, denoted by
S-GML, on real benchmark data. We compare
S-GML with the existing GML solution as well
as the state-of-the-art DNN models. Even though
the focus of this paper is on ATSA, the pro-
posed approach can also be applied to the task of
ACSA. Therefore, we also compare S-GML with
its alternatives on ACSA.
The rest of this section is organized as follows:
Section 5.1 describes the experimental setup. Sec-
tion 5.2 presents the evaluation results on ATSA.
Section 5.3 presents the evaluation results of
parameter sensitivity. Section 5.4 presents the
evaluation results on ACSA.
5.1 Experimental Setup
We have used benchmark datasets in three domains
(restaurant, laptop, and neighborhoods)
from the SemEval-2014 Task 4,² SemEval-2015
Task 12,³ SemEval-2016 Task 5,⁴ and
SentiHood.⁵ This paper considers both ATSA and
ACSA as binary classification tasks. Note that
we use the annotated labels provided by Wang
et al. (2021) when aspect terms are not specified
in ATSA. In all the datasets, we ignore neutral
instances and label aspect polarities as positive or
negative.
For performance evaluation, as usual, we randomly
split the default training data of each
benchmark dataset into two parts by the ratio
of 8 : 2, which specifies the proportions of training
and validation data, respectively. Since we run
each approach multiple times, we leverage vali-
dation data to pick the best model in each run. On
SentiHood, we use the default partition of training
and validation data. We use the classical metrics of
Accuracy and Macro-F1 to measure performance,
and conduct pairwise t-tests on both metrics to
verify whether the achieved improvements are
statistically significant.
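For reference, the 8 : 2 split and the two metrics can be written out in plain Python. This is a minimal sketch using the standard definitions of Accuracy and Macro-F1 for binary labels; the seed, the function names, and the convention of scoring an undefined F1 as 0 are assumptions, not settings reported in the paper.

```python
import random

def split_train_val(items, train_ratio=0.8, seed=0):
    """Randomly split items into training and validation parts (8 : 2 by default)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); scored as 0 when the class never occurs.
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

Macro-F1 weights the positive and negative classes equally, which matters on review datasets where positive instances typically dominate.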
² https://alt.qcri.org/semeval2014/task4.
³ https://alt.qcri.org/semeval2015/task12.
⁴ https://alt.qcri.org/semeval2016/task5/.
⁵ https://github.com/HSLCY/ABSA-BERT-pair/tree/master/data/sentihood.
Compared Approaches. For ATSA, the com-
pared GML solutions include:
• Unsupervised Lexicon-based GML (Wang
et al., 2021). The first unsupervised solution
relies on sentiment lexicons and explicit po-
larity relations indicated by discourse struc-
tures for knowledge conveyance.
• Unsupervised DNN-based GML (Ahmed
et al., 2021). As an improvement of lexicon-
based GML, it leverages an unsupervised
attention-based neural network to automati-
cally extract sentimental features for knowl-
edge conveyance.
• Hybrid GML (Ahmed et al., 2021). Built
upon the unsupervised DNN-based GML, it
leverages labeled training data in a naive
way by simply integrating the outputs of
supervised DNN as unary factors into a
factor graph to give a hybrid prediction.
It is noteworthy that for fair comparison,
the hybrid approach uses the same labeled
data as S-GML to train DNN models. The
original solution used the DNN model of
PH-SUM. Since RoBERTa+MLP has been
empirically shown to outperform PH-SUM,
we implement the hybrid solution with
RoBERTa+MLP as its DNN model in this
paper.
Since the deep models for ATSA based
on the pre-trained language models have been
empirically shown to outperform their earlier
alternatives, we compare S-GML with these
state-of-the-art models, which include:
• BERT-SPC (Song et al., 2019). It feeds the
sequences of ‘‘[CLS] + context + [SEP] +
target + [SEP]’’ into the basic BERT model
for sentence pair classification.
• AEN-BERT (Song et al., 2019). It uses
an Attentional Encoder Network (AEN) to
model the correlation between context and
target.
• LCF-BERT (Zeng et al., 2019). It uses a Lo-
cal Context Focus (LCF) mechanism based
on Multi-head Self-Attention (MHSA) to pay
more attention to local context words.
• BERT-PT (Xu et al., 2019). It uses post-
trained BERT on task-aware knowledge to
enhance BERT fine-tuning.
• BAT (Karimi et al., 2020a). It uses adversarial
training to fine-tune BERT for ATSA.
• PH-SUM (Karimi et al., 2020b). It uses two
simple modules named Parallel Aggregation
and Hierarchical Aggregation on the top of
BERT for ATSA.
• RGAT (Bai et al., 2021). It uses a novel
relational graph attention network to integrate
typed syntactic dependency information for
ATSA.
• RoBERTa+MLP (Dai et al., 2021). It uses
RoBERTa to generate context-based word
embeddings of explicit aspect terms, and then
leverages an MLP layer for polarity output.
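The sentence-pair input used by BERT-SPC above can be illustrated with a minimal sketch. The helper name build_spc_input and the example review are hypothetical; in practice the tokenizer of the BERT implementation in use would insert the special tokens itself.

```python
def build_spc_input(context: str, target: str) -> str:
    """Build the sequence '[CLS] context [SEP] target [SEP]' as plain text."""
    return f"[CLS] {context.strip()} [SEP] {target.strip()} [SEP]"


if __name__ == "__main__":
    # Hypothetical review sentence and aspect term, for illustration only.
    print(build_spc_input("The food was great but the service was slow.", "service"))
```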
Implementation Details. We have implemented
S-GML based on the open-source GML solution
for ATSA (Wang et al., 2021). To extract
neighborhood-based polarity similarity relations,
we use the RoBERTa+MLP model, whose
performance has been empirically shown to be state of
the art. In the implementation of RoBERTa+MLP,
we use the default split of the training data and
the default parameter settings presented in
Dai et al. (2021). Specifically, the size of the hidden
layer is set to 768, the batch size to 32, the learning
rate to 2e-5, dropout to 0.5, and the number of
epochs to 40. In the implementation of the Siamese
network, we use the post-trained BERT (Xu et al.,
2019), which was trained using an uncased version
of BERT-base on the restaurant and
laptop domains. To generate training data for the Siamese
network, for each labeled instance in the training
set, we randomly select a total of 80 polarity relations,
40 of which are similar while the remaining
40 are opposite. For the Siamese network,
we set the size of the hidden layer to 768, the maximum
input length to 80, the learning rate to 3e-5,
and the batch size to 32.
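The generation of Siamese training pairs described above can be sketched as follows. This is only an illustration under stated assumptions: the function name sample_polarity_pairs is hypothetical, partners are drawn without replacement, and labels are assumed to be binary polarities; the actual sampling procedure of S-GML may differ.

```python
import random
from typing import List, Sequence, Tuple


def sample_polarity_pairs(
    labels: Sequence[int], per_class: int = 40, seed: int = 0
) -> List[Tuple[int, int, int]]:
    """For each labeled instance, sample `per_class` similar and `per_class`
    opposite partners. Returns (anchor, partner, relation) triples, where
    relation is 1 for a similar pair and 0 for an opposite pair."""
    rng = random.Random(seed)
    pairs = []
    for i, label in enumerate(labels):
        same = [j for j, other in enumerate(labels) if j != i and other == label]
        diff = [j for j, other in enumerate(labels) if other != label]
        for pool, relation in ((same, 1), (diff, 0)):
            # Assumption: sample without replacement, capped by the pool size.
            for j in rng.sample(pool, min(per_class, len(pool))):
                pairs.append((i, j, relation))
    return pairs


if __name__ == "__main__":
    toy_labels = [1] * 50 + [0] * 50  # hypothetical toy training set
    pairs = sample_polarity_pairs(toy_labels)
    print(len(pairs) // len(toy_labels), "relations per labeled instance")
```

With per_class = 40, each labeled instance contributes 80 relations, matching the 40/40 split of similar and opposite relations stated above.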
In the default setting of S-GML, we select the
top-5 nearest neighbors from labeled training
data and unlabeled test data for each unlabeled
instance based on the learned embedding of
RoBERTa+MLP (i.e., kn = 5 in Subsection 4.1.1),
and randomly select 3 instances from both labeled
training data and unlabeled test data to extract
polarity relations based on the Siamese network
(i.e., ks = 3 in Subsection 4.1.2). Our sensitivity
evaluation results presented in Subsection 5.3
Model                            RES14                 RES15                 RES16
                                 Acc        Macro-F1   Acc        Macro-F1   Acc        Macro-F1
BERT-SPC                         92.06%     90.47%     85.80%     83.75%     93.61%     87.60%
AEN-BERT                         91.22%     88.17%     87.52%     85.88%     91.77%     87.11%
LCF-BERT                         91.89%     90.87%     85.99%     84.23%     93.94%     87.05%
BERT-PT                          93.62%     93.24%     88.52%     87.37%     95.50%     90.27%
BAT                              94.76%     93.12%     89.17%     87.86%     95.45%     91.67%
PH-SUM                           94.56%     93.69%     89.44%     88.21%     95.87%     91.49%
RGAT                             92.53%     92.89%     85.70%     83.60%     95.45%     88.90%
RoBERTa+MLP                      94.37%     93.53%     89.51%     88.04%     95.74%     91.05%
Unsupervised Lexicon-based GML   85.64%     79.34%     80.22%     78.94%     83.83%     80.33%
Unsupervised DNN-based GML       86.31%     82.93%     81.19%     79.92%     87.05%     81.15%
Hybrid GML(RoBERTa+MLP)          94.88%     93.86%     89.70%     88.29%     95.92%     91.64%
S-GML                            96.90%     95.33%     90.83%     89.70%     96.00%     93.65%
S-GML vs RoBERTa+MLP (p-value)   7.98e-8 †  5.56e-8 †  0.0183 †   0.0192 †   1.52e-5 †  1.86e-5 †
S-GML vs Hybrid GML (p-value)    1.28e-6 †  6.65e-7 †  0.0247 †   0.0252 †
S-GML vs Hybrid GML (p-value), RES16 columns: 0.0008 †, 0.0007 †

Model                            LAP14                 LAP15                 LAP16
                                 Acc        Macro-F1   Acc        Macro-F1   Acc        Macro-F1
BERT-SPC                         91.68%     89.84%     89.45%     88.53%     86.64%     85.37%
AEN-BERT                         93.39%     91.65%     90.99%     90.32%     86.64%     84.74%
LCF-BERT                         91.90%     90.20%     88.93%     88.30%     86.85%     85.29%
BERT-PT                          93.22%     91.78%     93.10%     92.72%     87.72%     86.38%
BAT                              93.13%     91.55%     93.51%     93.07%     87.93%     86.20%
PH-SUM                           92.84%     91.30%     92.12%     91.67%     87.61%     86.15%
RGAT                             93.12%     90.76%     90.75%     90.51%     87.55%     86.09%
RoBERTa+MLP                      94.24%     92.83%     93.06%     92.62%     88.73%     87.44%
Unsupervised Lexicon-based GML   82.25%     79.41%     82.42%     81.52%     80.31%     78.74%
Unsupervised DNN-based GML       85.84%     83.33%     84.05%     83.26%     81.62%     80.07%
Hybrid GML(RoBERTa+MLP)          94.46%     93.12%     93.39%     92.93%     88.94%     87.68%
S-GML                            95.10%     93.95%     93.70%     93.26%     89.77%     88.49%
S-GML vs RoBERTa+MLP (p-value)                         0.0002 †   0.0006 †
S-GML vs Hybrid GML (p-value)                          0.0111 †   0.0227 †
S-GML vs RoBERTa+MLP (p-value), LAP14 and LAP16 columns: 0.0016 †, 0.0053 †, 0.0006 †, 0.0016 †
S-GML vs Hybrid GML (p-value), LAP14 and LAP16 columns: 0.0261 †, 0.0236 †, 0.0098 †, 0.0106 †
Table 2: Comparative Evaluation Results on ATSA: 1) RES and LAP stand for Restaurant and Laptop
domains respectively; 2) the best accuracies are highlighted in bold; 3) the marker † indicates p-value
< 0.05.
demonstrate that the performance of S-GML is
very robust w.r.t. the parameters kn and ks,
provided that their values are set within a
reasonable range (between 3 and 9). Our
implementation of S-GML is available at our
website.6
6 https://chenbenben.org/sgml.html.
5.2 Comparative Evaluation on ATSA
The detailed comparative results on ATSA
are presented in Table 2, in which Hybrid
GML(RoBERTa+MLP) denotes the Hybrid GML
solution with RoBERTa+MLP as its DNN model.
The reported results are the averages over 25 runs.
To verify the statistical significance of S-GML’s
performance advantage, we have conducted pairwise
t-tests between S-GML and its best alternatives,
RoBERTa+MLP and Hybrid GML.
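As a rough illustration of such a significance test, the sketch below computes the t statistic over per-run scores. It assumes a paired design (the same runs evaluated under both models), which the text does not spell out; the function name paired_t_statistic and the toy scores are hypothetical. The p-value would then be read from Student's t distribution with the returned degrees of freedom, for instance with scipy.stats.ttest_rel.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(
    scores_a: Sequence[float], scores_b: Sequence[float]
) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for a paired t-test on per-run scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least two runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the per-run differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


if __name__ == "__main__":
    # Hypothetical accuracies of two models over four runs.
    model_a = [0.961, 0.958, 0.963, 0.960]
    model_b = [0.948, 0.951, 0.949, 0.950]
    print(paired_t_statistic(model_a, model_b))
```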
It can be observed that S-GML consistently
achieves state-of-the-art performance on all
the datasets. It outperforms the best DNN model
by margins between 1% and 2% on most
datasets. For instance, on RES14, RES15, and
RES16, the improvements are close to 2.0%
in terms of Macro-F1. On LAP14 and LAP16,
Model                 RES14                 RES15                 RES16
                      Acc        Macro-F1   Acc        Macro-F1   Acc        Macro-F1
S-GML(w/o knn)        96.44%     94.68%     89.02%     87.58%     95.66%     93.19%
S-GML(w/o Siamese)    95.14%     92.55%     88.25%     86.77%     93.00%     88.85%
S-GML(w/o context)    96.68%     95.00%     90.55%     89.38%     95.80%     93.35%
S-GML                 96.90%     95.33%     90.83%     89.70%     96.00%     93.65%

Model                 LAP14                 LAP15                 LAP16
                      Acc        Macro-F1   Acc        Macro-F1   Acc        Macro-F1
S-GML(w/o knn)        93.60%     92.16%     93.06%     92.55%     88.10%     86.69%
S-GML(w/o Siamese)    91.90%     89.69%     90.51%     89.95%     89.35%     88.05%
S-GML(w/o context)    94.67%     93.48%     93.66%     93.23%     88.10%     86.73%
S-GML                 95.10%     93.95%     93.70%     93.26%     89.77%     88.49%
Table 3: The evaluation results of ablation study on ATSA.
the improvements are more than 1% in terms of
Macro-F1. S-GML also beats the previous unsupervised
GML solutions by large margins; in terms
of accuracy, S-GML outperforms Unsupervised
DNN-based GML by margins of roughly 8% to
10% across all the test workloads. It is noteworthy
that S-GML also consistently beats the Hybrid GML,
which achieves overall better performance than the
state-of-the-art deep model (RoBERTa+MLP).
For instance, in terms of Macro-F1, the improvement
margins are around 1.5%, 1.5%, and 2%
on RES14, RES15, and RES16, respectively. Given
the widely recognized difficulty of ATSA, the
achieved improvements are considerable.
It can also be observed that with regard to
the pairwise t-test, the p-values of S-GML against
RoBERTa+MLP and Hybrid GML are all below
0.05, which means that the performance improvements
are statistically significant. These experimental
results clearly demonstrate the efficacy of S-GML.
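For reference, the two metrics reported in Table 2 can be computed as in the following minimal sketch. The function names and the toy labels are hypothetical; Macro-F1 is taken here as the unweighted mean of the per-class F1 scores, which is the usual definition.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of instances whose predicted polarity equals the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    # Hypothetical binary polarities: 1 = positive, 0 = negative.
    truth = [1, 1, 1, 0]
    pred = [1, 1, 0, 0]
    print(accuracy(truth, pred), macro_f1(truth, pred))
```

Because Macro-F1 weights both polarity classes equally, it penalizes errors on the minority class more heavily than accuracy does.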
Ablation Study. The evaluation results are
presented in Table 3, where S-GML(w/o knn),
S-GML(w/o Siamese), and S-GML(w/o context)
denote the ablated models with the KNN-based,
Siamese-based, and context-based relational
features removed, respectively. It can
be observed that without either KNN relations
or Siamese relations, the performance of S-GML
drops on all the test workloads. This observation
clearly indicates that KNN and Siamese
relations are complementary to each other and
that their combined modeling in GML achieves better
performance than either of them alone. It can
also be observed that, on all the workloads except
LAP16, the performance of S-GML drops more
considerably without Siamese relations than
without KNN relations. The KNN relations
capture only similarity features, while the Siamese
relations can capture both similar and opposite
relations, which are more diverse. It is noteworthy that
these experimental results are consistent with the
expected characteristic of GML that more diverse
features can usually facilitate knowledge
conveyance more effectively.
An Illustrative Example. We illustrate the effi-
cacy of S-GML by the examples extracted from
RES14, which are shown in Figure 3. Based on
GML, the instance t1 has the most evidential sup-
port, followed by t2, t3, and finally t4. Meanwhile,
the instances t1 and t2 have less evidential conflict
than t3 and t4. Therefore, S-GML labels them
in the order of t1, t2, t3, and t4. In spite of the
noisy relations of t4, S-GML can correctly label
t4 because after t1, t2, and t3 are labeled, the
majority of evidence neighbors provide correct
polarity hints.
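The easy-to-hard labeling order described above can be mimicked with a deliberately simplified toy. This is not the factor-graph inference that GML actually performs: the sketch merely sums the polarity hints of already-labeled neighbors and labels the instance with the strongest net evidence first. All names and the toy relations (including the noisy relation of t4) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# (instance_a, instance_b, relation): +1 means similar, -1 means opposite.
Relation = Tuple[str, str, int]


def evidence_score(x: str, labels: Dict[str, int], relations: List[Relation]) -> int:
    """Sum the polarity hints that already-labeled neighbors give to x."""
    total = 0
    for a, b, rel in relations:
        if a == x and b in labels:
            total += rel * labels[b]
        elif b == x and a in labels:
            total += rel * labels[a]
    return total


def gradual_label(
    labeled: Dict[str, int], unlabeled: Sequence[str], relations: List[Relation]
) -> Tuple[List[str], Dict[str, int]]:
    """Label instances one at a time, strongest net evidence first."""
    labels = dict(labeled)
    pending = sorted(unlabeled)
    order = []
    while pending:
        scores = {x: evidence_score(x, labels, relations) for x in pending}
        best = max(pending, key=lambda x: abs(scores[x]))
        if scores[best] == 0:
            break  # no usable evidence left in this toy setting
        labels[best] = 1 if scores[best] > 0 else -1
        pending.remove(best)
        order.append(best)
    return order, {x: labels[x] for x in order}


if __name__ == "__main__":
    train = {"a": 1, "b": -1, "c": 1}  # hypothetical labeled instances
    rels = [
        ("t1", "a", 1), ("t1", "c", 1), ("t1", "b", -1),
        ("t2", "a", 1), ("t2", "b", -1), ("t2", "t1", 1),
        ("t3", "b", 1), ("t3", "t2", -1),
        ("t4", "b", 1),  # noisy relation: t4 is actually positive
        ("t4", "t1", 1), ("t4", "t2", 1), ("t4", "t3", -1),
    ]
    print(gradual_label(train, ["t1", "t2", "t3", "t4"], rels))
```

In this toy run, t4 initially receives only a misleading hint, but it is labeled last, by which time its three correctly labeled neighbors outvote the noisy relation.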
5.3 Sensitivity Evaluation
To evaluate sensitivity, we vary the values of the
parameters kn and ks, which denote the number
of nearest neighbors selected by the polarity classifier
and the number of relations randomly selected
based on the Siamese network, respectively, within the
range between 3 and 9. Since polarity relation
detection between two arbitrary instances is generally
more challenging than polarity similarity
detection between close neighbors in an embedding
space, we set kn ≥ ks. The detailed
evaluation results are presented in Table 4. It
Figure 3: The illustrated examples of S-GML: the four subfigures show the extracted relational features of four
instances respectively, in which a true factor (resp. false factor) means that its corresponding polarity relation is
true (resp. false).
kn   ks   RES14                 RES15                 RES16
          Acc        Macro-F1   Acc        Macro-F1   Acc        Macro-F1
5    3    96.90%     95.33%     90.83%     89.70%     96.00%     93.65%
5    5    97.04%     95.55%     90.96%     89.86%     95.93%     93.53%
7    3    96.61%     94.87%     90.79%     89.63%     95.80%     93.28%
7    5    96.66%     94.94%     90.68%     89.50%     95.83%     93.34%
7    7    96.61%     94.86%     90.77%     89.62%     95.83%     93.36%
9    3    96.61%     94.87%     90.87%     89.74%     95.76%     93.22%
9    5    96.63%     94.90%     90.70%     89.54%     95.68%     93.11%
9    7    96.61%     94.86%     90.74%     89.59%     95.76%     93.23%
9    9    96.57%     94.81%     90.74%     89.59%     95.74%     93.19%

kn   ks   LAP14                 LAP15                 LAP16
          Acc        Macro-F1   Acc        Macro-F1   Acc        Macro-F1
5    3    95.10%     93.95%     93.70%     93.26%     89.77%     88.49%
5    5    94.88%     93.70%     93.70%     93.26%     89.35%     88.05%
7    3    94.46%     93.18%     93.40%     92.96%     89.98%     88.84%
7    5    94.67%     93.46%     93.32%     92.89%     89.56%     88.34%
7    7    94.46%     93.21%     93.28%     92.85%     89.56%     88.38%
9    3    94.88%     93.70%     93.32%     92.87%     89.77%     88.62%
9    5    95.10%     93.95%     93.36%     92.92%     89.56%     88.34%
9    7    95.10%     93.95%     93.43%     93.00%     89.56%     88.38%
9    9    94.88%     93.70%     93.47%     93.04%     89.35%     88.13%
Table 4: Sensitivity evaluation results on ATSA.
can be observed that the performance of S-GML
fluctuates only marginally with different value
combinations of kn and ks (e.g., the accuracy on
RES14 stays between 96.57% and 97.04%). These experimental
results clearly demonstrate that the performance
of S-GML is very robust w.r.t. the parameter
settings of kn and ks. They bode well for S-GML’s
applicability in real scenarios.
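The roles of kn and ks can be made concrete with a small sketch. It assumes cosine similarity over the classifier embeddings and a single pooled candidate set of labeled and unlabeled instances; neither detail is confirmed by the text, and the names select_candidates and GRID are hypothetical. GRID enumerates the (kn, ks) combinations with kn ≥ ks that appear in Table 4.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_candidates(
    query: Sequence[float],
    pool: Dict[str, Sequence[float]],
    kn: int = 5,
    ks: int = 3,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Return (kn nearest neighbors, ks randomly sampled instances)."""
    ranked = sorted(pool, key=lambda i: cosine(query, pool[i]), reverse=True)
    nearest = ranked[:kn]  # candidates for similarity relations
    rng = random.Random(seed)
    sampled = rng.sample(sorted(pool), min(ks, len(pool)))  # Siamese candidates
    return nearest, sampled


# Parameter combinations evaluated in the sensitivity study (kn >= ks).
GRID = [(kn, ks) for kn in (5, 7, 9) for ks in (3, 5, 7, 9) if kn >= ks]

if __name__ == "__main__":
    toy_pool = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1], "d": [-1.0, 0.0]}
    print(select_candidates([1.0, 0.0], toy_pool, kn=2, ks=3))
    print(GRID)
```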
5.4 Comparative Evaluation on ACSA
For ACSA, we compare performance on all the
RES and LAP workloads except LAP14 because
it does not provide implicit aspect categories.
Additionally, we compare performance on the
benchmark dataset of SentiHood, which is usu-
ally considered as a task of targeted aspect-based
sentiment analysis. In SentiHood, aspect category
consists of two parts: explicit entity (e.g., location
1) and implicit category (e.g., safety).
We compare S-GML with the following
BERT-based models specifically targeting ACSA:
1) BERT-pair-QA-M (Sun et al., 2019). It converts
ACSA to a sentence-pair classification task,
where the auxiliary sentence is a question. 2)
BERT-pair-NLI-M (Sun et al., 2019). It converts
ACSA to a sentence-pair classification task and
learns aspect-specific representations by pseudo-sentence
natural language inference. 3) QACG-BERT
(Wu and Ong, 2021). As an improved variant
of the CG-BERT (Context-Guided BERT) model,
it learns quasi-attention weights in a compositional
manner to enable the subtractive attention lacking
in softmax-attention. Since many deep models
proposed for ATSA, e.g., BERT-SPC, AEN-BERT,
LCF-BERT, BERT-PT, BAT, and PH-SUM, can
be directly applied to the task of ACSA, we also
compare S-GML with these models. However, we
do not compare S-GML with RoBERTa+MLP and
Hybrid GML because they cannot directly handle
implicit aspects.
In the implementation of S-GML for ACSA,
we extract neighborhood-based polarity similarity
based on the BAT model (Karimi et al.,
2020a), whose performance has been empirically
shown to be state of the art. We use the same
Siamese network proposed for ATSA to extract
binary relations between arbitrary instances.
The detailed comparative results on ACSA are
presented in Table 5. We have also conducted
pairwise t-tests between S-GML and its best
alternative, BAT, over 25 runs. It can be observed
that, similar to what has been reported on ATSA,
S-GML outperforms the best alternatives by
margins of roughly 1% to 2% on the RES and LAP
workloads, and by a smaller margin on SentiHood.
For instance, in terms of Macro-F1, S-GML
beats BAT by around 2.0%, 1.5%, and 1.5% on
RES14, RES15, and RES16, respectively. With regard
to the pairwise t-test, it can be observed that the
p-values of S-GML against BAT are all well below
0.05, which means that the achieved improvements
are statistically significant. These experimental
results clearly demonstrate the efficacy of S-GML
on ACSA.
6 Conclusion and Future Work
In this paper, we have proposed a novel supervised
GML approach for ATSA that can effectively
exploit labeled examples to improve gradual learn-
ing. It leverages both polarity classification DNN
and Siamese network to extract implicit polarity
relations between instances, and then instills them
into a factor graph to enable supervised knowl-
edge conveyance. Our extensive empirical study
has validated its efficacy. Our work has demon-
strated clearly that in collaboration with DNN
for feature extraction, GML can outperform pure
DNN solutions.
For future work, we note that even
though the proposed solution is built upon a
specific polarity classifier and Siamese network for
aspect-level sentiment analysis, similar classifiers
and Siamese networks are readily available or
can be constructed for other binary classification
tasks, especially NLP tasks. Therefore, the proposed
collaboration approach of DNN and GML
can potentially be generalized to other binary
classification tasks.
Generalization to Multi-class Classification Tasks.
It is worth pointing out that even though
this paper focuses on binary classification, the
proposed approach can be potentially generalized
to multi-class classification tasks. In principle, in-
stead of binary values, a variable in a factor graph
can take one out of multiple values, each of which
corresponds to a specific class. Relational fac-
tors can also be similarly constructed to indicate
similar or different label relations between vari-
ables. We briefly illustrate the generalization by
the example of three-class aspect-based sentiment
analysis, whose candidate polarities include pos-
itive, negative, and neutral. The technical details
however need further investigation in the future.
Model                    RES14                 RES15                 RES16
                         Acc        Macro-F1   Acc        Macro-F1   Acc        Macro-F1
BERT-SPC                 93.90%     91.87%     88.41%     88.12%     90.71%     88.39%
AEN-BERT                 94.68%     92.86%     87.09%     86.78%     91.20%     89.01%
LCF-BERT                 94.74%     93.02%     88.86%     88.56%     91.98%     89.95%
BERT-PT                  94.51%     92.75%     87.14%     86.79%     92.14%     90.04%
BAT                      95.22%     93.58%     88.80%     88.51%     93.62%     92.05%
PH-SUM                   94.99%     93.36%     89.02%     88.70%     93.25%     91.53%
BERT-pair-QA-M           94.81%     93.17%     88.25%     88.00%     92.64%     90.80%
BERT-pair-NLI-M          95.13%     93.57%     88.58%     88.30%     92.27%     90.16%
QACG-BERT                94.31%     92.47%     87.31%     86.93%     91.17%     88.96%
S-GML                    96.72%     95.71%     90.14%     89.86%     94.87%     93.52%
S-GML vs BAT (p-value)   4.22e-8 †  1.01e-7 †  0.0007 †   0.0016 †   2.06e-9 †  4.66e-9 †

Model                    SentiHood             LAP15                 LAP16
                         Acc        Macro-F1   Acc        Macro-F1   Acc        Macro-F1
BERT-SPC                 92.06%     91.14%     89.95%     89.35%     87.03%     86.26%
AEN-BERT                 91.45%     90.33%     90.92%     90.38%     87.88%     86.73%
LCF-BERT                 93.08%     92.29%     91.18%     90.65%     88.74%     87.68%
BERT-PT                  91.53%     90.40%     91.51%     90.92%     89.13%     88.29%
BAT                      93.16%     92.29%     92.56%     92.15%     89.51%     88.71%
PH-SUM                   91.68%     90.59%     91.15%     90.65%     89.03%     88.20%
BERT-pair-QA-M           93.36%     92.56%     90.83%     90.28%     88.12%     87.19%
BERT-pair-NLI-M          92.85%     91.97%     91.13%     90.61%     88.47%     87.51%
QACG-BERT                92.85%     91.91%     90.44%     89.86%     87.22%     86.21%
S-GML                    93.83%     93.05%     93.74%     93.36%     90.52%     89.79%
S-GML vs BAT (p-value)   0.0077 †   0.0084 †   2.24e-5 †  2.56e-5 †  1.08e-6 †  2.56e-7 †
Table 5: Comparative Evaluation Results on ACSA: the marker † indicates p-value < 0.05.
ri   sij   Text and aspect polarities
r1   s11   Text: The manager then told us we could order from whatever menu we wanted but by that time we were so annoyed with the waiter and the resturant that we let and went some place else.
           Aspect polarities: (manager, neutral), (menu, neutral), (waiter, negative)
r2   s21   Text: Even when the chef is not in the house, the food and service are right on target.
           Aspect polarities: (chef, neutral), (food, positive), (service, positive)
r3   s31   Text: My friend had a burger and I had these wonderful blueberry pancakes.
           Aspect polarities: (burger, neutral), (blueberry pancakes, positive)
r4   s41   Text: It’s about $7 for lunch and they have take-out or dine-in.
           Aspect polarities: (lunch, neutral), (take-out, neutral), (dine-in, neutral)
Table 6: Illustrative examples of three-class aspect-based sentiment analysis.
Since the open-source GML inference engine
(https://github.com/gml-explore/gml) can effectively
support gradual inference
on multi-class factor graphs and modeling relational
features as binary factors in a factor graph
is straightforward, we focus on how to extract
relational features for the task of three-class sentiment
analysis. Similar to the case of binary
sentiment analysis, we can extract explicit relations
by analyzing discourse structures, and implicit
ones by supervising a classification deep
model and a Siamese network separately as
follows:
• For explicit relations, we can similarly ex-
tract opposite relations based on the presence
of shift words, because they can reliably
indicate polarity shift regardless of actual
sentiments. For instance, as shown in Table 6,
the shift words ‘‘but’’ and ‘‘even’’ in s11 and
s21 shift polarity from neutral to negative
and positive respectively. However, identi-
fying similar relations may be more subtle.
Since the neutral polarity usually does not
involve any opinion word, two aspect polar-
ities can be reasoned to be similar if no shift
word exists between them, and both of them
contain opinion words or neither of them
does. As shown in Table 6, the two aspect
polarities in s41 can be reasoned to be similar
due to the absence of shift words and opinion
words, while the two aspect polarities in s31
cannot because its second part contains the
opinion word of ‘‘wonderful’’.
• For implicit relations, we can similarly lever-
age the SOTA polarity classifiers (e.g.,
RoBERTa) and Siamese network for their de-
tection. Since the SOTA polarity classifiers
can naturally support three-class classifica-
tion, they can be trained to detect polarity
similarity based on vector neighborhood as
in binary classification. As for a Siamese
network, it can be similarly trained to detect
similar and dissimilar relations between po-
larities provided that training data sufficiently
cover different combinations of polarities.
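The rule for explicit relations in the first item above can be sketched as follows. The word lists are tiny hypothetical stand-ins for real shift-word and opinion lexicons, the split of a sentence into two parts and a connector is assumed to be supplied by the caller, and the function name explicit_relation is an assumption; a real system would rely on proper discourse analysis.

```python
import re
from typing import Optional, Set

# Hypothetical, deliberately incomplete lexicons for illustration only.
SHIFT_WORDS = {"but", "however", "although", "even"}
OPINION_WORDS = {"wonderful", "annoyed", "great", "terrible"}


def _tokens(text: str) -> Set[str]:
    """Lowercase word tokens of a text fragment."""
    return set(re.findall(r"[a-z']+", text.lower()))


def explicit_relation(part_a: str, connector: str, part_b: str) -> Optional[str]:
    """Return 'opposite', 'similar', or None when no relation can be reasoned.

    A shift word in the connector signals an opposite relation. Otherwise the
    two aspect polarities are taken as similar only if both parts contain
    opinion words or neither of them does."""
    if _tokens(connector) & SHIFT_WORDS:
        return "opposite"
    has_a = bool(_tokens(part_a) & OPINION_WORDS)
    has_b = bool(_tokens(part_b) & OPINION_WORDS)
    if has_a == has_b:
        return "similar"
    return None


if __name__ == "__main__":
    # Fragments taken from the sentences s41, s31, and s11 in Table 6.
    print(explicit_relation("It's about $7 for lunch", "and",
                            "they have take-out or dine-in"))
    print(explicit_relation("My friend had a burger", "and",
                            "I had these wonderful blueberry pancakes"))
    print(explicit_relation("we could order from whatever menu we wanted", "but",
                            "we were so annoyed with the waiter"))
```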
Acknowledgments
Our work has been supported by the National
Natural Science Foundation of China (62172335,
61732014, and 61672432). We would also like
to thank the action editor and the anonymous
reviewers for their insightful comments and suggestions,
which have significantly strengthened the paper.
References
Murtadha H. M. Ahmed, Qun Chen, Yanyan
Wang, Youcef Nafa, Zhanhuai Li, and Tianyi
Duan. 2021. DNN-driven gradual machine
learning for aspect-term sentiment analysis. In
Findings of the Association for Computational
Linguistics, ACL/IJCNLP, pages 488–497.
Xuefeng Bai, Pengbo Liu, and Yue Zhang.
2021. Investigating typed syntactic dependen-
cies for targeted sentiment classification using
graph attention neural network. IEEE/ACM
Transactions on Audio, Speech, and Language
Processing, 29: 503–514. https://doi
.org/10.1109/TASLP.2020.3042009
Peng Chen, Zhongqian Sun, Lidong Bing, and
Wei Yang. 2017. Recurrent attention network
on memory for aspect sentiment analysis.
In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language
Processing, EMNLP, pages 452–461.
https://doi.org/10.18653/v1/D17-1047
Sumit Chopra, Raia Hadsell, and Yann LeCun.
2005. Learning a similarity metric discrimina-
tively, with application to face verification. In
Proceedings of the 2005 IEEE Computer Soci-
ety Conference on Computer Vision and Pattern
Recognition, CVPR, pages 539–546.
Junqi Dai, Hang Yan, Tianxiang Sun, Pengfei Liu,
and Xipeng Qiu. 2021. Does syntax matter? A
strong baseline for aspect-based sentiment anal-
ysis with roberta. In Proceedings of the 2021
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT,
pages 1816–1829. https://doi.org/10
.18653/v1/2021.naacl-main.146
Li Dong, Furu Wei, Chuanqi Tan, Duyu Tang,
Ming Zhou, and Ke Xu. 2014. Adaptive recursive
neural network for target-dependent
twitter sentiment classification. In Proceedings
of the 52nd Annual Meeting of the
Association for Computational Linguistics,
ACL, pages 49–54.
https://doi.org/10.3115/v1/P14-2009
Feifan Fan, Yansong Feng, and Dongyan Zhao.
2018. Multi-grained attention network for
aspect-level sentiment classification. In Pro-
ceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing,
EMNLP, pages 3433–3442. https://doi
.org/10.18653/v1/D18-1380
Ruidan He, Wee Sun Lee, Hwee Tou Ng,
and Daniel Dahlmeier. 2018. Effective attention
modeling for aspect-level sentiment
classification. In Proceedings of the 27th
International Conference on Computational
Linguistics, COLING, pages 1121–1131.
Boyi Hou, Qun Chen, Yanyan Wang, Youcef
Nafa, and Zhanhuai Li. 2022. Gradual machine
learning for entity resolution. IEEE Transac-
tions on Knowledge and Data Engineering,
34(4):1803–1814. https://doi.org/10
.1109/TKDE.2020.3006142
Binxuan Huang, Yanglan Ou, and Kathleen M.
Carley. 2018. Aspect level sentiment classification
with attention-over-attention neural
networks. In Social, Cultural, and Behavioral
Modeling - 11th International Conference,
SBP-BRiMS, pages 197–206.
https://doi.org/10.1007/978-3-319-93372-6_22
Rie Johnson and Tong Zhang. 2017. Deep
pyramid convolutional neural networks for
text categorization. In Proceedings of the 55th
Annual Meeting of the Association for Com-
putational Linguistics, ACL, pages 562–570.
https://doi.org/10.18653/v1/P17
-1052
Akbar Karimi, Leonardo Rossi, and Andrea Prati.
2020a. Adversarial training for aspect-based
sentiment analysis with BERT. In Proceedings
of the 25th International Conference on
Pattern Recognition, ICPR, pages 8797–8803.
https://doi.org/10.48550/arXiv.2010.11731
Akbar Karimi, Leonardo Rossi, and Andrea
Prati. 2020b. Improving BERT performance for
aspect-based sentiment analysis. arXiv preprint
arXiv:2010.11731.
Mahmut Kaya and Hasan Sakir Bilge. 2019.
Deep metric learning: A survey. Symmetry,
11(9):1066. https://doi.org/10.3390
/sym11091066
Frank R. Kschischang, Brendan J. Frey, and
Hans-Andrea Loeliger. 2001. Factor graphs
and the sum-product algorithm. IEEE Transac-
tions on Information Theory, 47(2):498–519.
https://doi.org/10.1109/18.910572
Zeyang Lei, Yujiu Yang, and Yi Liu. 2018.
LAAN: A linguistic-aware attention network
for sentiment analysis. In Companion of the
The Web Conference 2018 on The Web Confer-
ence 2018, WWW, pages 47–48. https://
doi.org/10.1145/3184558.3186922
Gaël Letarte, Frédérik Paradis, Philippe Giguère,
and François Laviolette. 2018. Importance of
self-attention for sentiment analysis. In Proceedings
of the Workshop: Analyzing and
Interpreting Neural Networks for NLP,
BlackboxNLP@EMNLP, pages 267–275.
https://doi.org/10.18653/v1/W18-5429
Qiao Liu, Haibin Zhang, Yifu Zeng, Ziqi Huang,
and Zufeng Wu. 2018. Content attention model
for aspect based sentiment analysis. In Pro-
ceedings of the 2018 Web Conference, WWW,
pages 1023–1032. https://doi.org/10
.1145/3178876.3186001
Yunfei Long, Qin Lu, Rong Xiang, Minglei
Li, and Chu-Ren Huang. 2017. A cognition
based attention model for sentiment analysis.
In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language
Processing, EMNLP, pages 462–471.
https://doi.org/10.18653/v1/D17-1048
Dehong Ma, Sujian Li, Xiaodong Zhang, and
Houfeng Wang. 2017. Interactive attention networks
for aspect-level sentiment classification.
In Proceedings of the 26th International Joint
Conference on Artificial Intelligence, IJCAI,
pages 4068–4074.
Maria Pontiki, Dimitris Galanis, Haris
Papageorgiou, Ion Androutsopoulos, Suresh
Manandhar, Mohammad Al-Smadi, Mahmoud
Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée
De Clercq, Véronique Hoste, Marianna
Apidianaki, Xavier Tannier, Natalia V.
Loukachevitch, Evgeniy Kotelnikov, Núria
Bel, Salud María Jiménez Zafra, and Gülsen
Eryigit. 2016. SemEval-2016 Task 5: Aspect
based sentiment analysis. In Proceedings
of the 10th International Workshop on Semantic
Evaluation, SemEval@NAACL-HLT,
pages 19–30.
https://doi.org/10.18653/v1/S16-1002
Maria Pontiki, Dimitris Galanis, Haris
Papageorgiou, Suresh Manandhar, and Ion
Androutsopoulos. 2015. SemEval-2015 Task
12: Aspect based sentiment analysis. In Proceedings
of the 9th International Workshop on
Semantic Evaluation, SemEval@NAACL-HLT,
pages 486–495.
https://doi.org/10.18653/v1/S15-2082
Qiao Qian, Minlie Huang, Jinhao Lei, and Xiaoyan
Zhu. 2017. Linguistically regularized LSTM
for sentiment classification. In Proceedings
of the 55th Annual Meeting of the Association
for Computational Linguistics, ACL,
pages 1679–1689.
https://doi.org/10.18653/v1/P17-1154
Kumar Ravi and Vadlamani Ravi. 2015. A sur-
vey on opinion mining and sentiment analysis:
Tasks, approaches and applications. Knowledge-
Based Systems, 89:14–46. https://doi.org
/10.1016/j.knosys.2015.06.015
Nils Reimers and Iryna Gurevych. 2019.
Sentence-BERT: Sentence embeddings using
siamese BERT-networks. In Proceedings of
the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Nat-
ural Language Processing, EMNLP-IJCNLP,
pages 3980–3990. https://doi.org/10
.18653/v1/D19-1410
Alexander Rietzler, Sebastian Stabinger, Paul
Opitz, and Stefan Engl. 2020. Adapt or get
left behind: Domain adaptation through BERT
language model finetuning for aspect-target
sentiment classification. In Proceedings of
The 12th Language Resources and Evaluation
Conference, LREC, pages 4933–4941.
Emanuel H. Silva and Ricardo M. Marcacini.
2021. Aspect-based sentiment analysis using
BERT with disentangled attention. In Pro-
ceedings of the LatinX in AI (LXAI) Research
workshop at ICML 2021.
Youwei Song, Jiahai Wang, Tao Jiang,
Zhiyue Liu, and Yanghui Rao. 2019.
Attentional encoder network for targeted
sentiment classification. arXiv preprint
arXiv:1902.09314.
https://doi.org/10.48550/arXiv.1902.09314
Chi Sun, Luyao Huang, and Xipeng Qiu. 2019.
Utilizing BERT for aspect-based sentiment
analysis via constructing auxiliary sentence. In
Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, NAACL-HLT, pages 380–385.
Duyu Tang, Bing Qin, Xiaocheng Feng,
and Ting Liu. 2016. Effective LSTMs for
target-dependent sentiment classification. In
Proceedings of the 26th International Confer-
ence on Computational Linguistics, COLING,
pages 3298–3307.
Hao Tang, Donghong Ji, Chenliang Li, and
Qiji Zhou. 2020. Dependency graph enhanced
dual-transformer structure for aspect-based
sentiment classification. In Proceedings of
the 58th Annual Meeting of the Association
for Computational Linguistics, ACL,
pages 6578–6588.
https://doi.org/10.18653/v1/2020.acl-main.588
Yuanhe Tian, Guimin Chen, and Yan Song.
2021. Aspect-based sentiment analysis with
type-aware graph convolutional networks and
layer ensemble. In Proceedings of the 2021
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT,
pages 2910–2922. https://doi.org/10
.18653/v1/2021.naacl-main.231
Bailin Wang and Wei Lu. 2018. Learning latent
opinions for aspect-level sentiment classifica-
tion. In Proceedings of the 32nd Conference
on Artificial Intelligence, AAAI, the 30th in-
novative Applications of Artificial Intelligence,
IAAI, and the 8th AAAI Symposium on Educa-
tional Advances in Artificial Intelligence, EAAI,
pages 5537–5544. https://doi.org/10
.1609/aaai.v32i1.12020
Yanyan Wang, Qun Chen, Jiquan Shen, Boyi
Hou, Murtadha Ahmed, and Zhanhuai Li.
2021. Aspect-level sentiment analysis based on
gradual machine learning. Knowledge-Based
Systems, 212:106509. https://doi.org
/10.1016/j.knosys.2020.106509
Zhengxuan Wu and Desmond C. Ong. 2021.
Context-guided BERT for targeted aspect-based
sentiment analysis. In Proceedings of
35th AAAI Conference on Artificial Intelligence,
AAAI, 33rd Conference on Innovative
Applications of Artificial Intelligence, IAAI,
The 11th Symposium on Educational Advances
in Artificial Intelligence, EAAI,
pages 14094–14102. https://doi.org
/10.1609/aaai.v35i16.17659
Bowen Xing and Ivor W. Tsang. 2022.
Understand me, if you refer to aspect
knowledge: Knowledge-aware gated recurrent
memory network. IEEE Transactions on
Emerging Topics in Computational Intelligence,
6(5):1092–1102.
https://doi.org/10.1109/TETCI.2022.3156989
Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019.
BERT post-training for review reading compre-
hension and aspect-based sentiment analysis.
In Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, NAACL-HLT, pages 2324–2335.
Wei Xue and Tao Li. 2018. Aspect based
sentiment analysis with gated convolutional
networks. In Proceedings of the 56th Annual
Meeting of
the Association for Computa-
tional Linguistics, ACL, pages 2514–2523.
https://doi.org/10.18653/v1/P18
-1234
Biqing Zeng, Heng Yang, Ruyang Xu, Wu Zhou,
and Xuli Han. 2019. LCF: A local context
focus mechanism for aspect-based sentiment
classification. Applied Sciences, 9(16):3389.
https://doi.org/10.3390/app9163389
Chen Zhang, Qiuchi Li, and Dawei Song. 2019.
Syntax-aware aspect-level sentiment classifi-
cation with proximity-weighted convolution
network. In Proceedings of the 42nd Inter-
national ACM SIGIR Conference on Research
and Development in Information Retrieval, SI-
GIR, pages 1145–1148. https://doi.org
/10.1145/3331184.3331351
Xiang Zhang, Junbo Jake Zhao, and Yann LeCun.
2015. Character-level convolutional networks
for text classification. In Advances in Neural
Information Processing Systems, NIPS,
pages 649–657.
Pinlong Zhao, Linlin Hou, and Ou Wu. 2020.
Modeling sentiment dependencies with graph
convolutional networks for aspect-level sen-
timent classification. Knowledge-Based Sys-
tems, 193:105443. https://doi.org/10
.1016/j.knosys.2020.106292