Extractive Opinion Summarization in Quantized Transformer Spaces
Stefanos Angelidis1 Reinald Kim Amplayo1
Yoshihiko Suhara2 Xiaolan Wang2 Mirella Lapata1
1爱丁堡大学
2Megagon Labs
s.angelidis@ed.ac.uk, reinald.kim@ed.ac.uk
yoshi@megagon.ai, xiaolan@megagon.ai
mlap@inf.ed.ac.uk
抽象的
We present the Quantized Transformer (QT),
an unsupervised system for extractive opi-
nion summarization. QT is inspired by Vector-
Quantized Variational Autoencoders, 哪个
we repurpose for popularity-driven summari-
扎化. It uses a clustering interpretation of the
quantized space and a novel extraction algo-
rithm to discover popular opinions among hun-
dreds of reviews, a significant step towards
opinion summarization of practical scope. 在
添加, QT enables controllable summari-
zation without further training, by utilizing
properties of the quantized space to extract
aspect-specific summaries. We also make pub-
licly available SPACE, a large-scale evaluation
benchmark for opinion summarizers, com-
prising general and aspect-specific summaries
为了 50 hotels. Experiments demonstrate the
promise of our approach, which is validated
by human studies where judges showed clear
preference for our method over competitive
基线.
1 介绍
Online reviews play an integral role in modern
生活, as we look to previous customer experiences
to inform everyday decisions. The need to digest
review content has fueled progress in opinion
矿业 (Pang and Lee, 2008), whose central goal
is to automatically summarize people’s attitudes
towards an entity. Early work (Hu and Liu, 2004)
focused on numerically aggregating customer
satisfaction across different aspects of the entity
under consideration (例如, the quality of a camera,
its size, clarity). 最近, 的成功
neural summarizers in the Wikipedia and news
域 (Cheng and Lapata, 2016; See et al.,
2017; Narayan et al., 2018; 刘等人。, 2018; Perez-
Beltrachini etal., 2019) has spurred interest in
277
opinion summarization; the aggregation, in textual
形式, of opinions expressed in a set of reviews
(Angelidis and Lapata, 2018; Huy Tien et al.,
2019; Tian et al., 2019; Coavoux et al., 2019; Chu
和刘, 2019; Isonuma et al., 2019; Braˇzinskas
等人。, 2020; Amplayo and Lapata, 2020; Suhara
等人。, 2020; 王等人。, 2020).
Opinion summarization has distinct character-
istics that set it apart from other summarization
任务. 第一的, it cannot rely on reference summaries
for training, because such meta-reviews are very
scarce and their crowdsourcing is unfeasible. 甚至
for a single entity, annotators would have to pro-
duce summaries after reading hundreds, 一些-
times thousands, of reviews. 第二, the inherent
subjectivity of review text distorts the notion of
information importance used in generic summa-
rization (Peyrard, 2019). Conflicting opinions are
often expressed for the same entity and, 所以,
useful summaries should be based on opinion pop-
独特性 (Ganesan et al., 2010). 而且, 方法
need to be flexible with respect to the size of the
输入 (entities are frequently reviewed by thou-
sands of users), and controllable with respect to
the scope of the output. 例如, users may
wish to read a general overview summary, 或一个
more targeted one about a particular aspect of
兴趣 (例如, a hotel’s location, its cleanliness, 或者
available food options).
Recent work (Tian et al., 2019; Coavoux et al.,
2019; Chu and Liu, 2019; Isonuma et al., 2019;
Braˇzinskas et al., 2020; Amplayo and Lapata,
2020; Suhara et al., 2020) has increasingly fo-
cused on abstractive summarization, where a sum-
mary is generated token-by-token to create novel
sentences that articulate prevalent opinions in the
input reviews. The abstractive approach offers
a solution to the lack of supervision, under the
assumption that opinion summaries should be
written in the style of reviews. This simplifica-
tion has allowed abstractive models to generate
计算语言学协会会刊, 卷. 9, PP. 277–293, 2021. https://doi.org/10.1162/tacl 00366
动作编辑器: Jing Jiang. 提交批次: 6/2020; 修改批次: 9/2020; 已发表 3/2021.
C(西德:2) 2021 计算语言学协会. 根据 CC-BY 分发 4.0 执照.
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
我
A
C
_
A
_
0
0
3
6
6
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
review-like summaries from aggregate input rep-
resentations, using sequence-to-sequence mod-
els trained to reconstruct single reviews. 尽管
being fluent, abstractive summaries may still suf-
fer from issues of text degeneration (Holtzman et al.,
2020), hallucinations (Rohrbach et al., 2018),
and the undesirable use of first-person narrative,
a direct consequence of review-like generation.
此外, previous work used an unrealistic-
ally small number of input reviews (10 or fewer),
and only sparingly investigated controllable sum-
marization, albeit in weakly supervised settings
(Amplayo and Lapata, 2019; Suhara et al., 2020).
在本文中, we attempt to address shortcom-
ings of existing methods by turning to extractive
summarization, which aims to construct an opi-
nion summary by selecting a few representative
input sentences (Angelidis and Lapata, 2018;
Huy Tien et al., 2019). 具体来说, we introduce
the Quantized Transformer (QT), an unsupervised
neural model inspired by Vector-Quantized Var-
iational Autoencoders (VQ-VAE; van den Oord
等人。, 2017; Roy et al., 2018), which we repur-
pose for popularity-driven summarization. QT
combines Transformers (Vaswani et al., 2017)
with the discretization bottleneck of VQ-VAEs
and is trained via sentence reconstruction, 西米-
larly to the work of Roy and Grangier (2019) 在
paraphrasing. At inference time, we use a clus-
tering interpretation of the quantized space and a
novel extraction algorithm that discovers popu-
lar opinions among hundreds of reviews, a sig-
nificant step towards opinion summarization of
practical scope. QT is also capable of aspect-
specific summarization without further training,
by exploiting the properties of the Transformer’s
multi-head sentence representations.
We further contribute to the progress of opinion
mining research, by introducing SPACE (shorthand
for Summaries of Popular and Aspect-specific
Customer Experiences), a large-scale corpus for the
evaluation of opinion summarizers. We collected
1,050 human-written summaries of TripAdvisor
reviews for 50 hotels. SPACE has general summa-
里斯, giving a high-level overview of popular opin-
ions, and aspect-specific ones, providing detail
on individual aspects (例如, 地点, cleanliness).
Each summary is based on 100 customer reviews,
an order of magnitude increase over existing cor-
pora, thus providing a more realistic input to
competing models. Experiments on SPACE and two
more benchmarks demonstrate that our approach
holds promise for opinion summarization. 帕-
ticipants in human evaluation further express a
clear preference for our model over competitive
基线. We make SPACE and our code publicly
available.1
2 相关工作
Ganesan et al. (2010) were the first to make the con-
nection between opinion mining and text summa-
rization; they developed Opinosis, a graph-based
abstractive summarizer that explicitly models opin-
ion popularity, a key characteristic of subjective
文本, and central to our approach. Follow-on work
(Di Fabbrizio et al., 2014) adopts a hybrid ap-
proach where salient sentences are first extracted
and abstracts are generated based on hand-written
模板 (Carenini et al., 2006). 最近,
Angelidis and Lapata (2018) extract salient opi-
nions according to their polarity intensity and as-
pect specificity, in a weakly supervised setting.
A popular approach to modeling opinion pop-
独特性, 尽管是间接的, is vector averaging. Chu
和刘 (2019) propose MeanSum, an unsuper-
vised abstractive summarizer that learns a review
decoder through reconstruction, and uses it to
generate summaries conditioned on averaged rep-
resentations of the inputs. Averaging is also used
by Braˇzinskas et al. (2020), who train a copy-
enabled variational autoencoder by reconstructing
reviews from averaged vectors of reviews about
the same entity. Other methods include denoising
autoencoders (Amplayo and Lapata, 2020) 和
the system of Coavoux et al. (2019), an encoder-
decoder architecture that uses a clustering of the
encoding space to identify opinion groups, 相似的
对我们的工作.
Our model builds on the VQ-VAE (van den Oord
等人。, 2017), a recently proposed training tech-
nique for learning discrete latent variables, 哪个
aims to overcome problems of posterior collapse
and large variance associated with Variational
Autoencoders (Kingma and Welling, 2014). Like
other related discretization techniques (Maddison
等人。, 2017; Kaiser and Bengio, 2018), VQ-VAE
passes the encoder output through a discretization
bottleneck using a neighbor lookup in the space
of latent code embeddings. The application of
VQ-VAEs to opinion summarization is novel, 到
our knowledge, as well as the proposed sentence
1https://github.com/stangelid/qt.
278
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
我
A
C
_
A
_
0
0
3
6
6
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
extraction algorithm. Our model does not depend
on vector averaging, nor does it suffer from in-
formation loss and hallucination. 此外,
it can easily accommodate a large number of in-
put reviews. Within NLP, VQ-VAEs have been
previously applied to neural machine translation
(Roy et al., 2018) and paraphrase generation
(Roy and Grangier, 2019). Our work is closest to
Roy and Grangier (2019) in its use of a quan-
tized Transformer, however we adopt a different
training algorithm (Soft EM; Roy et al., 2018),
orders of magnitude fewer discrete latent codes,
a different method for obtaining head sentence
vectors, and apply the QT in a novel way for
extractive opinion summarization.
Besides modeling, our work contributes to the
growing body of resources for opinion summa-
rization. We release SPACE, the first corpus to
contain both general and aspect-specific opinion
summaries, while increasing the number of input
reviews tenfold compared to popular benchmarks
(Braˇzinskas et al., 2020; Chu and Liu, 2019;
Angelidis and Lapata, 2018).
3 Problem Formulation
Let C be a corpus of
reviews on entities
{e1, e2, . . . } from a single domain d, 例如,
hotels. Reviews may discuss any number of rel-
evant aspects Ad = {a1, a2, . . . }, like the hotel’s
rooms or location. For every entity e, we define its
review set Re = {r1, r2, . . . }. Every review is a
sequence of sentences (x1, x2, . . . ) and a sentence
x is, 反过来, a sequence of words (w1, w2, . . . ).
For brevity, we use Xe to denote all review sen-
tences about entity e. We formalize two sub-tasks:
(A) general opinion summarization, where a sum-
mary should cover popular opinions in Re across
all discussed aspects; 和 (乙) aspect opinion sum-
marization, where a summary must focus on a
single specified aspect a ∈ Ad. In our extractive
环境, these translate to creating a general or
aspect summary by selecting a small subset of
sentences Se ⊂ Xe.
We train the QT through sentence reconstruc-
tion to learn a rich representation space and its
quantization into latent codes (部分 3.1). 我们
enable opinion summarization, by mapping in-
put sentences onto their nearest latent codes and
extract those sentences that are representative of
the most popular codes (部分 3.2). 我们也
illustrate how to produce aspect-specific summa-
279
数字 1: A sentence is encoded into a 3-head
representation and head vectors are quantized
using a weighted average of their neighboring
code embeddings. The QT model is trained by
reconstructing the original sentence.
ries using a trained QT model and a few aspect-
denoting query terms (部分 3.2.2).
3.1 The Quantized Transformer
Our model is a variant of VQ-VAEs (van den
Oord et al., 2017; Roy and Grangier, 2019) 和
consists of: (A) a Transformer sentence encoder
that encodes an input sentence x into a multi-head
表示 {x1, . . . , xH}, where xh ∈ RD and
H is the number of heads; (乙) a vector quantizer
that maps each head vector to a mixture of dis-
crete latent codes, and uses the codes’ embeddings
to produce quantized vectors {q1, . . . , qH },
qh ∈ RD; 和 (C) a Transformer sentence decoder,
which attends over the quantized vectors to
generate sentence reconstruction ˆx. The decoder
is not used during summarization; we only use the
learned quantized space to extract sentences, 作为
节中描述 3.2.
H }, X(西德:5)
Sentence Encoding Our
encoder prepends
sentence x with the special token [SNT] and uses
the vanilla Transformer encoder (Vaswani et al.,
2017) to produce token-level vectors. We ignore
individual word vectors and only keep the special
token’s vector xsnt ∈ RD. We obtain a multi-head
representation of x, by splitting xsnt into H sub-
h ∈ RD/H, followed by a
vectors {X(西德:5)
, . . . , X(西德:5)
1
layer-normalized transformation:
xh = LayerNorm(Wx(西德:5)
(1)
where xh is the h-th head and W ∈ RD×D/ H,
b ∈ RD are shared across heads. Hyperparame-
ter H, 那是, the number of sentence heads of
our encoder, is different from Transformer’s in-
ternal attention heads. The encoder’s operation
如图所示 1, where the sentence
‘‘The staff was great!’’ is encoded into a 3-head
representation.
H + 乙) ,
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
我
A
C
_
A
_
0
0
3
6
6
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
Vector Quantization Let z1, . . . , zH be discrete
latent variables corresponding to H encoder heads.
Every variable can take one of K possible latent
代码, zh ∈ [K]. The quantizer’s codebook,
e ∈ RK×D, is shared across latent variables and
maps each code (or cluster) to its embedding (或者
centroid) ek ∈ RD. Given sentence x and its multi-
head encoding {x1, . . . , xH }, we independently
quantize every head using a mixture of its nearest
codes from [K]. 具体来说, we follow the Soft
EM training of Roy et al. (2018) and sample, 和
replacement, m latent codes for the h-th head:
H, . . . , zm
z1
h ∼ Multinomial(l1, . . . , lK ) ,
with lk = −(西德:7)xh − ek(西德:7)2
2
,
(2)
where Multinomial(l1, . . . , lK ) is a K-way mul-
tinomial distribution with logits l1, . . . , lK. 这
h-th quantized head vector is obtained as the
average of the sampled codes’ embeddings:
qh =
1
米
米(西德:2)
j=1
.
ezj
H
(3)
This soft quantization process is shown in Figure 1,
where head vectors x1, x2, and x3 are quantized
using a weighted average of their neighboring
code embeddings, to produce q1, q2, and q3.
Sentence Reconstruction and Training
在-
stead of attending over individual token vectors,
as in the vanilla architecture, the Transformer
sentence decoder attends over {q1, . . . , qH }, 这
quantized head vectors of the sentence, to gene-
rate reconstruction ˆx. The model is trained to
minimize:
L = Lr +
(西德:2)
H
(西德:7)xh − sg(qh)(西德:7)2 .
(4)
Lr is the reconstruction cross entropy, and stop-
gradient operator sg(·) is defined as identity during
forward computation and zero on backpropagation.
The sampling of Equation (2) is bypassed using
the straight-through estimator (Bengio et al., 2013)
and the latent codebook is trained via exponen-
tially moving averages, as detailed in Roy et al.
(2018).
3.2 Summarization in Quantized Spaces
Existing neural methods for opinion summariza-
tion have modeled opinion popularity within a set
of reviews by encoding each review into a vector,
280
averaging all vectors to obtain an aggregate rep-
resentation of the input, and feeding it to a review
decoder to produce a summary (Chu and Liu, 2019;
Coavoux et al., 2019; Braˇzinskas et al., 2020). 这
approach is problematic for two reasons. 第一的, 它
assumes that complex semantics of whole reviews
can be encoded in a single vector. 第二, 它也是
assumes that features of commonly occurring
opinions are preserved after averaging and, 那里-
fore, those opinions will appear in the generated
summary. The latter assumption becomes particu-
larly uncertain for larger numbers of input reviews.
We take a different approach, using sentences
as the unit of representation, and propose a gene-
ral extraction algorithm based on the QT, 哪个
explicitly models popularity without vector aggre-
gation. Using the same algorithmic framework we
are also able to extract aspect-specific summaries.
3.2.1 General Opinion Summarization
We exploit QT’s quantization of the encoding
space to cluster similar sentences together, quan-
tify the popularity of the resulting clusters, 和
extract representative sentences from the most
popular ones.
具体来说, given Xe = {x1, . . . , 希, . . . , xN },
the N review sentences about entity e, the trained
encoder produces N × H unquantized head vec-
托尔斯 {x11, . . . , xih, . . . , xN H }, where xih is the
h-th head of the i-th sentence. We perform hard
quantization, assigning every vector to its nearest
latent code, and counting the number of assign-
ments per code, 即, the popularity of each
cluster:
zih = arg min
(西德:7)xih − ek(西德:7)2
k∈[K]
(西德:2)
nk =
(西德:0)[zih = k] .
(5)
(6)
我,H
数字 2 shows how sentences Xe are encoded,
and their different heads are assigned to codes.
Similar sentences cluster under the same codes
和, 最后, clusters receiving numerous
assignments are characteristic of popular opinions
in Xe. A general summary should consist of the
sentences that are most representative of these
popular clusters.
在最简单的情况下, we could couple every
code k with its nearest sentence x(k):
X(k) = arg min
我
(min
H
(西德:7)xih − ek(西德:7)2) ,
(7)
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
我
A
C
_
A
_
0
0
3
6
6
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
数字 2: General opinion summarization with QT.
All input sentences for an entity are encoded using
three heads (shown in orange, 蓝色的, and green
crosses). Sentence vectors are clustered under their
nearest latent code (gray circles). Popular clusters
(histogram) correspond to commonly occurring
意见, and are used to sample and extract the
most representative sentences.
and rank sentences x(k) according to the size nk of
their respective clusters; the top sentences, 最多一个
predefined budget, are extracted into a summary.
This ranking method entails that only those
sentences which are the nearest neighbor of a
popular code are likely to be extracted. 然而,
a salient sentence may be in the neighborhood
of multiple codes per head, despite never being
the nearest sentence of a code vector. 考试用-
普莱, the sentence ‘‘Great location and beautiful
rooms’’ is representative of clusters encoding pos-
itive attitudes for both the location and the rooms
of a hotel. To capture this, we relax the require-
ment of coupling every cluster with exactly one
sentence and propose two-step sampling (数字 3),
a novel sampling process that simultaneously es-
timates cluster popularity and promotes sentences
commonly found in the proximity of popular
clusters. We repeatedly perform the following
运营:
Cluster Sampling We first sample a latent code z
with probability proportional to the clusters’ size:
z ∼ P (z = k) =
nk
N × H
,
(8)
where nk is the number of assignments for code
k, computed in Equation (6). 例如, 如果
input contains many paraphrases of sentence ‘‘Ex-
cellent location’’, these are likely to be clustered
under the same code, which in turn increases the
probability of sampling that code. Cluster sam-
pling is illustrated on the top of Figure 3, 显示
281
数字 3: Sentence ranking via two-step sampling.
In this toy example, each sentence (s1 to s5) 是
assigned to its nearest code (k = 1, 2, 3), 作为
shown by thick purple arrows. During cluster
sampling, the probability of sampling a code (顶部
正确的; shown as blue bars) is proportional to the
number of assignments it receives. For every
sampled code, we perform sentence sampling;
sentences are sampled, with replacement, 符合-
ing to their proximity to the code’s encoding.
Samples from codes 1 和 3 are shown in black
和红色, 分别.
assignments (左边) and resulting code probabilities
(正确的).
Sentence Sampling The sampled code z exists
in the neighborhood of many input sentences.
Picking a single sentence as the most characteristic
of that cluster is too restrictive. 反而, 我们
sample (with replacement) sentences from the
code’s neighborhood n times, thus generalizing
方程 (7):
x1, . . . , xn ∼ Multinomial(我(西德:5)
, . . . , 我(西德:5)
氮 ) ,
1
((西德:7)xih − ez(西德:7)2
2) ,
with l(西德:5)
i = − min
H
(9)
where the Multinomial’s logits l(西德:5)
i mark the (nega-
主动的) distance of the i-th sentence’s head which is
closest to code z. Sentence sampling is depicted
in the toy example of Figure 3 (底部). 后
selecting code k = 1 during cluster sampling,
four sentence samples are drawn (shown in black
arrows). The next cluster sample (k = 3) 结果
in four more sentence samples (shown in red).
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
我
A
C
_
A
_
0
0
3
6
6
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
Sentence s4 (‘‘Excellent room and location’’)
receives the most votes in total, after being
sampled as a neighbor of both codes.
Two-step sampling is repeated multiple times
and all sentences in Xe are ranked according to
the total number of times they have been sampled.
The final summary is constructed by concatenating
the top ones (see right part of Figure 2). Impor-
急切地, our extraction algorithm is not sensitive to
the size of the input. More sentences increase the
absolute number of assignments per code, but do
not hinder two-step sampling or cause information
loss; on the contrary, a larger pool of sentences
may result in a more densely populated quantized
encoding space and, 反过来, a better estimation of
cluster popularity and sentence ranking.
3.2.2 Aspect Opinion Summarization
迄今为止, we have focused on selecting sentences
solely based on the popularity of the opinions
they express. We now turn our attention to aspect
summaries, which discuss a particular aspect of
an entity (例如, the location or service of a hotel)
while still presenting popular opinions. We create
such summaries with a trained QT model, 没有
additional fine-tuning. 反而, we exploit QT’s
multi-head representations and only require a
small number of aspect-denoting query terms.2
We hypothesize that different sentence heads in
QT encode the approximately orthogonal seman-
tic or structural attributes which are necessary
for sentence reconstruction. In the simplified
example in Figure 2, the encoder’s first head
(orange) might capture information about
这
aspects of the sentence, the second head (蓝色的)
encodes sentiment, while head three (绿色的) 可能
encode structural information (例如, the length of
the sentence or its punctuation). Our hypothesis is
reinforced by the empirical observation that sen-
tence vectors originating from the same head will
occupy their own sub-space, and do not show any
similarity to vectors from other heads. 因此,
each latent code k receives assignments from
exactly one head of the sentence encoder. 更多的
formally, head h yields a set of latent codes such
that Kh ⊂ [K]. 数字 2 demonstrates this, 作为
encoding space consists of three sub-spaces, 一
for each head. Sentence and latent code vectors are
2Contrary to Angelidis and Lapata (2018), who used 30
seed-words per aspect, we only assume five query terms per
aspect to replicate a realistic setting.
282
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
我
A
C
_
A
_
0
0
3
6
6
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
数字 4: Aspect opinion summarization with QT.
The aspect-encoding sub-space is identified using
mean aspect entropy and all other sub-spaces are
被忽略 (shown in gray). Two-step sampling is
restricted only to the codes associated with the
desired aspect (shown in red).
further organized within that sub-space according
to the attribute captured by the respective head.
To enable aspect summarization, we identify
the sub-space capturing aspect-relevant informa-
tion and label its aspect-specific codes, as seen in
数字 4. 具体来说, we first quantify the proba-
bility of finding an aspect in the sentences assigned
to a latent code and identify the head sub-space that
best separates sentences according to their aspect.
然后, we map every cluster within that sub-space
to an aspect and extract aspect summaries only
from those aspect-specific clusters.
We utilize a held-out set of review sentences
and keywords Qa = {s1, . . . , s5} 为了
Xdev,
aspect a. We encode and quantize sentences in
Xdev and compute the probability that latent code
k contains tokens typical of aspect a as:
Pk(A) =
(西德:3)
tf (Qa, k)
tf (Qa(西德:5), k)
,
(10)
A(西德:5)
where tf (Qa, k) is the number of times query
terms in Qa where found in sentences assigned
to k. We use information theory’s entropy to mea-
sure how aspect-certain code k is:
Hk = −
Pk(A) logPk(A) .
(11)
(西德:2)
A
Low aspect entropy values indicate that most
sentences assigned to k belong to a single aspect. 它
thus follows that hasp (IE。, the head sub-space that
best separates sentences according to their aspect)
will exhibit the lowest mean aspect entropy:
⎛
⎝ 1
(西德:2)
⎞
hasp = arg min
⎠ .
Hk
(12)
H
|Kh|
k∈Kh
SPACE (This work)
AMAZON (Braˇzinskas et al., 2020)
YELP (Chu and Liu, 2019)
OPOSUM (Angelidis and Lapata, 2018)
评论
1.14中号
4.75中号
1.29中号
359K
Entities Rev/Ent
50
60
200
60
100
8
8
10
Summaries (右)
1,050 (3)
180 (3)
200 (1)
180 (3)
Type
Scope
Abstractive General+Aspect
Abstractive
Abstractive
Extractive
General only
General only
General only
桌子 1: Statistics for SPACE and three recently introduced evaluation corpora for opinion summarization.
SPACE includes aspect summaries for six aspects. (评论: number of reviews in training set, no gold-
standard summaries are available; Rev/Ent: Input reviews per entity in test set; 右: Reference summaries
per example).
We map every code produced by hasp to its aspect
A(k) via Equation (10), and obtain aspect codes:
Ka = {k | k ∈ Khasp and a = a(k)} .
(13)
To extract a summary for aspect a, we follow
the ranking or sampling methods described in
Equations (5)–(9),
restricting the process to
codes Ka. Sub-space selection and aspect-specific
sentence sampling are illustrated in Figure 4.
4 The SPACE Corpus
We introduce SPACE (Summaries of Popular and
Aspect-specific Customer Experiences), 一个大的-
scale opinion summarization benchmark for the
evaluation of unsupervised summarizers. SPACE is
built on TripAdvisor hotel reviews and aims to
facilitate future research by improving upon the
shortcomings of existing datasets. It comes with a
training set of approximately 1.1 million reviews
超过 11,000 hotels, obtained by cleaning and
downsampling an existing collection (王等人。,
2010). The training set contains no reference
summaries, and is useful for unsupervised training.
For evaluation, we created a large collection
of human-written, abstractive opinion summaries.
具体来说, for a held-out set of 50 hotels (25
hotels for development and 25 for testing), 我们
asked human annotators to write high-level gen-
eral summaries and aspect summaries for six pop-
ular aspects: building, cleanliness, 食物, 地点,
房间, and service. For every hotel and summary
类型, we collected three reference summaries from
different annotators. 重要的, for every hotel,
summaries were based on 100 input reviews.
据我们所知, this is the largest
crowdsourcing effort
towards obtaining high-
quality abstractive summaries of reviews, 和
first to use a pool of input reviews of this scale (看
桌子 1 for a comparison with existing datasets).
而且, SPACE is the first benchmark to also
contain aspect-specific opinion summaries.
The large number of input reviews per entity
poses certain challenges with regard to the col-
lection of human summaries. A direct approach is
prohibitive, as it would require annotators to
read all 100 reviews and write a summary in a
single step. A more reasonable method is to first
identify a subset of input sentences that most
people consider salient, and then ask annotators
to summarize them. Summaries were thus created
in multiple stages using the Appen3 platform and
expert annotator channels of native English speak-
呃. Although we propose an extractive model,
annotators were asked to produce abstractive sum-
maries, as we hope SPACE will be broadly useful to
the summarization community. We did not allow
the use of first-person narrative to collect more
summary-like texts. We present our annotation
procedure below.4
4.1 Sentence Selection via Voting
The sentence selection stage identifies a subset of
review sentences that contain the most salient and
useful opinions expressed by the reviewers. 这
is a crucial but subjective task and, 所以, 我们
devised a voting scheme that allowed us to select
sentences that received votes by many annotators.
具体来说, each review was shown to five
judges who were asked to select informative sen-
时态. Annotators were encouraged to exercise
their own judgement in selecting summary-worthy
句子, but were advised to focus on sentences
which explicitly expressed or supported reviewer
意见, avoiding overly general or personal
comments (例如, ‘‘Loved the hotel’’, ‘‘I like a
shower with good pressure’’), and making sure
that important aspects were included. We set no
threshold on the number of sentences they could
选择 (we allowed selecting all or no sentences
3https://appen.com/.
4Full annotation instructions: https://github.com
/stangelid/qt/blob/main/annotation.md.
283
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
我
A
C
_
A
_
0
0
3
6
6
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
for a given review). 然而, the annotation in-
terface kept track of their total votes and guided
them to select between 20% 和 40% 句子数,
一般.
Sentences with 4 or more votes were automat-
ically promoted to the next stage. Inter-annotator
agreement according to Cohen’s kappa was
k = 0.36, indicating ‘‘fair agreement’’. 普雷维-
ous studies have shown that human agreement for
sentence selection tasks in summarization of news
articles is usually lower than 0.3 (Radev et al.,
2003). The median number of sentences promoted
for summarization for each hotel was 83, 尽管
the minimum was 46. This ensured that enough
sentences were always available for summariza-
的, while simplifying the task; annotators were
now required to read and summarize considerably
smaller amounts of review text than the original
100 reviews.
4.2 Summary Collection
General Summaries The top-voted sentences
for each hotel were presented to three annotators,
who were asked to read them and produce a high-
level overview summary up to a budget of 100
字. To simplify the task and help annotators
write coherent summaries, sentences with high
lexical overlap were grouped together and the
interface allowed the annotators to quickly sort
sentences according to words they contained. 这
process resulted in an inter-annotator ROUGE-L
分数为 29.19 and provides ample room for future
研究, as detailed in our experiments (桌子 2).
Aspect Summaries Top-voted sentences were
further labeled by an off-the-shelf aspect classifier
(Angelidis and Lapata, 2018) trained on an public
aspect-labeled corpus of hotel review sentences
(Marcheggiani et al., 2014).5 Sentences outside of
the six most popular aspects (building, cleanliness,
食物, 地点, 房间, and service) were ignored,
and sentences with 3 votes were promoted, 仅有的
if an aspect had no sentences with 4 votes. 这
promoted sentences were grouped according to
their aspect and presented to annotators, who were
asked to create a more detailed, aspect-specific
summary, up to a budget of 75 字. The aspect
summaries have an inter-annotator ROUGE-L
分数为 34.58.
5 评估
在这个部分, we discuss our experimental setup,
including datasets and comparison models, 前
presenting our automatic evaluation results, 胡-
man studies, and further analyses.
5.1 实验装置
Datasets We used SPACE as the main testbed for
our experimental evaluation, covering both gen-
eral and aspect-specific summarization tasks. 为了
general summarization, we used two additional
opinion summarization benchmarks, 即, YELP
(Chu and Liu, 2019) and AMAZON (Braˇzinskas
等人。, 2020) (见表 1). For all datasets, 我们用
pre-defined development and test set splits, 和
only report results on the test set.
Implementation Details We used unigram LM
SentencePiece vocabularies of 32K.6 All system
hyperparameters were selected on the develop-
ment set. The Transformer’s dimensionality was
set to 320 and its feed-forward layer to 512. 我们
用过的 3 layers and 4 internal attention heads for
its encoder and decoder, whose input embedding
layer was shared, but no positional encodings as
we observed no summarization improvements.
We used H = 8 sentence heads for representing
every sentence. For the quantizer, we set the num-
ber of latent codes to K = 1, 024 and sampled
m = 30 codes for every input sentence, 期间
训练. We used the Adam optimizer, with initial
learning rate of 10−3 and a learning rate decay
的 0.9. We warmed up the Transformer by dis-
abling quantization for the first 4 纪元. 总共,
we ran 20 training epochs. On the full SPACE cor-
脓, QT was trained in 4 days on a single GeForce
GTX 1080 Ti GPU, using our available PyTorch
implementation. All general and aspect summaries
were extracted with the two-step sampling proce-
dure described in Section 3.2.1, unless otherwise
指出. When two-step sampling was enabled, 我们
ranked sentences by sampling 300 latent codes
和, for every code, sampled n = 30 neighboring
句子. QT and all extractive baselines use a
greedy algorithm to eliminate redundancy, 相似的
to previous research on multi-document summa-
rization (Cao et al., 2015; Yasunaga et al., 2017;
Angelidis and Lapata, 2018).
5The classifier’s precision on the aspect-labeled corpus’
development set is 85.4%.
6https://github.com/google/sentencepiece.
284
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
我
A
C
_
A
_
0
0
3
6
6
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
5.2 指标
We evaluate the lexical overlap between system
and human summaries using ROUGE F-scores.7
We report uni- and bi-gram variants (R1/R2), 作为
well as longest common subsequence (RL).
A successful opinion summarizer must also pro-
duce summaries which match human-written ones
in terms of aspects mentioned and sentiment con-
veyed. 为此原因, we also evaluate our sys-
tems on two metrics that utilize an off-the-shelf
aspect-based sentiment analysis (ABSA) 系统
(Miao et al., 2020), pre-trained in-domain. 这
ABSA system extracts opinion phrases from sum-
maries, and predicts their aspect category and sen-
timent. The metrics use these predictions as follows.
Aspect Coverage We use the phrase-level aspect
predictions to mark the presence or absence
of an aspect in a summary. We discard very
infrequent aspect categories. Similar to Pan
等人. (2020), we measure precision, 记起,
and F1 of system against human summaries.
Aspect-level Sentiment We propose a new met-
ric to evaluate the sentiment consistency
between system and human summaries.
具体来说, we compute the sentiment
polarity score towards an individual aspect
a as the mean polarity of
the opinion
phrases that discuss this aspect in a sum-
mary (pol a ∈ [−1, 1]). We repeat the process
for every aspect, thus obtaining a vector of
aspect polarities for the summary (我们设定
the polarity of absent aspects to zero). 这
aspect-level sentiment consistency is com-
puted as the mean squared error between
system and human polarity vectors.
5.3 结果: General Summarization
We first discuss our results on general summa-
rization and then move on to present experiments
on aspect-specific summarization. We compared
our model against the following baselines:
Best Review systems select the single review that
best approximates the consensus opinions
in the input. We use a Centroid method
that encodes the entity’s reviews with BERT
(average token vector; Devlin et al., 2019) 或者
SentiNeuron (Radford et al., 2017), and picks
the one closest to the mean review vector.
We also tested an Oracle method, 哪个
7https://github.com/bheinzerling/pyrouge.
selects the review closest to the reference
summaries.
Extractive systems, where we tested LexRank
(Erkan and Radev, 2004), an unsupervised
graph-based summarizer. To compute its
adjacency matrices, we used BERT and Sen-
tiNeuron vectors, in addition to the sparse
tf-idf features of the original. We also present
a random extractive baseline.
Abstractive systems include Opinosis (Ganesan
等人。, 2010), a graph-based method; 和
MeanSum (Chu and Liu, 2019) and Copy-
猫 (Braˇzinskas et al., 2020), two neural
abstractive methods that generate review-
like summaries from aggregate review rep-
resentations learned using autoencoders.
桌子 2 reports ROUGE scores on SPACE
(test set) for the general summarization task.
QT’s popularity-based extraction algorithm shows
strong summarization capabilities outperforming
all comparison systems (differences in ROUGE
are statistically significant against all models but
Copycat). This is a welcome result, consider-
ing that QT is an extractive method and does
not benefit from the compression and rewording
capabilities of abstractive summarizers. 更多的-
超过, as we discuss in Section 5.5, QT is less
data-hungry than other neural models: 它达到了
the same level of performance even when trained
在 5% of the dataset. We also show in Table 2
(fourth block) that the proposed two-step sampling
method yields better extractive summaries com-
pared to simply selecting the sentences nearest to
the most popular clusters.
Aspect coverage and sentiment consistency
results are also encouraging for QT which con-
sistently scores highly on both metrics, 尽管
baselines show mixed results. We also compared
(using ROUGE-L) general system summaries
against reference aspect summaries. The results
表中 2 (column RLASP) confirm that aspect
tailor-made methods.
summarization requires
不出所料, all systems are inferior to the
human upper bound (IE。, inter-annotator ROUGE
and aspect-based metrics), suggesting ample room
for improvement.
QT’s ability for general opinion summarization
is further demonstrated in Table 3, which reports
results on the YELP and AMAZON datasets. We pre-
sent the strongest baselines, 那是, CentroidBERT,
LexRankBERT, OracleBERT, and the abstractive
285
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
我
A
C
_
A
_
0
0
3
6
6
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
t
s
e
乙
t
C
A
r
t
X
乙
e
我
v
e
右
SPACE [GENERAL]
w CentroidSENTI
CentroidBERT
OracleSENTI
OracleBERT
Random
LexRank
LexRankSENTI
LexRankBERT
R2 RL
R1
27.36 5.81 15.15
31.33 5.78 16.54
32.14 7.52 17.43
33.21 8.33 18.02
26.24 3.58 14.72
29.85 5.87 17.56
30.56 4.75 17.19
31.41 5.05 18.12
28.76 4.57 15.96
34.95 7.49 19.92
36.66 8.87 20.90
38.66 10.22 21.90
w/o 2-step samp. 37.82 9.13 20.10
Human Up. Bound 49.80 18.80 29.19
MeanSum
Copycat
t Opinosis
C
A
r
t
s
乙
A
QT
RLASP ACP ACR ACF1 SCMSE
.788 .705 .744
8.77
.805 .701 .749
9.35
.817 .699 .753
9.29
9.67
.823 .777 .799
11.53 .799 .374 .509
11.84 .840 .382 .525
12.11 .820 .441 .574
13.29 .823 .380 .520
11.68 .791 .446 .570
14.52 .845 .477 .610
14.15 .840 .566 .676
14.26 .843 .689 .758
13.88 .833 .680 .748
34.58 .829 .862 .845
.580
.524
.455
.401
.592
.518
.572
.500
.561
.479
.446
.430
.439
.264
桌子 2: Summarization results on SPACE. Best system
(shown in boldface) significantly outperforms all compar-
ison systems, except where underlined (p < 0.05; paired
bootstrap resampling; Koehn, 2004). We exclude Oracle
systems from comparisons as they access gold summaries
at test time. RLASP is the Rouge-L of general summarizers
against gold aspect summaries. AC and SC are shorthands
for Aspect Coverage and Sentiment Consistency. Sub-
scripts P and R refer to precision and recall, and F1 is
their harmonic mean. MSE is mean squared error (lower is
better).
R2 RL ACF1 SCMSE
R1
YELP
23.04 2.44 13.44 .551
Random
24.78 2.64 14.67 .691
CentroidBERT
OracleBERT
27.38 3.75 15.92 .703
LexRankBERT 26.46 3.00 14.36 .601
24.88 2.78 14.09 .672
Opinosis
MeanSum
28.46 3.66 15.57 .713
29.47 5.26 18.09 .728
Copycat
28.40 3.97 15.27 .722
QT
R1
AMAZON
27.66 4.72 16.95 .580
Random
CentroidBERT
29.94 5.19 17.70 .702
OracleBERT
31.69 6.47 19.25 .725
LexRankBERT 31.47 5.07 16.81 .663
28.42 4.57 15.50 .614
Opinosis
MeanSum
29.20 4.70 18.15 .710
31.97 5.81 20.16 .731
CopyCat
34.04 7.03 18.08 .739
QT
.612
.523
.507
.541
.552
.510
.495
.490
.602
.599
.512
.541
.580
.525
.510
.508
R2 RL ACF1 SCMSE
Table 3: Summarization results on YELP
and AMAZON. Best system, shown in
boldface,
is significantly better than
all comparison systems, except where
underlined (p < 0.05; paired bootstrap
resampling; Koehn, 2004).
Opinosis, MeanSum, and Copycat. On YELP, QT
performs on par with MeanSum, but worse than
Copycat. However, it is important to note that,
in contrast to SPACE, YELP’s reference summaries
were purposely written using first-person narra-
tive giving an advantage to review-like summaries
of abstractive methods. On AMAZON, QT outper-
forms all methods on ROUGE-1/2, but comes
second to Copycat on ROUGE-L. This follows a
trend seen across all datasets, where abstractive
systems appear relatively stronger in terms of
ROUGE-L compared to ROUGE-1/2. We partly
attribute this to their ability to fuse opinions into
fluent sentences, thus matching longer reference
sequences.
Besides automatic evaluation, we conducted
a user study to verify the utility of the gener-
ated summaries. We produced general summaries
from five systems (QT, Copycat, MeanSum,
LexRankBERT, and CentroidBERT) for all entities
in SPACE’s test set. For every entity and pair of
systems, we showed to three human judges a
gold-standard summary for reference, and the
two system summaries. We asked them to select
the best summary according to four criteria:
Inform.
+36.0
Centroid
−52.7
LexRank
MeanSum −23.3
−10.7
Copycat
+50.7∗
QT
Coherent
−57.3
−38.0
+26.7
+34.7
+34.0†
Concise
−60.7
−44.7
+28.7
+38.0
+38.7†
Redund.
−12.7
−1.3
+3.3
−3.3
+18.0∗
Table 4: Best-Worst Scaling human study on
SPACE. (*): significant difference to all models;
(†): significant difference to all models, except
Copycat (one-way ANOVA with posthoc Tukey
HSD test p < 0.05).
informativeness (useful opinions, consistent with
reference), coherence (easy to read, avoids contra-
dictions), conciseness (useful in a few words), and
non-redundancy (no repetitions). The systems’
scores were computed using Best-Worst Scaling
(Louviere et al., 2015), with values ranging from
−100 (unanimously worst) to +100 (unanimously
best). As shown in Table 4, participants rate QT
favorably over all baselines in terms of infor-
mativeness, conciseness and lack of redundancy,
with slight preference for Copycat summaries
with respect to coherence (statistical significance
information in caption). QT captures essential
286
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
l
a
c
_
a
_
0
0
3
6
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
SPACE [ASPECT]
T MeanSumASP
R
E
CopycatASP
B
LexRankASP
a
i
v
QTASP
Human
Building
13.25
17.10
14.73
16.45
40.33
Cleanliness
19.24
15.90
25.10
25.12
38.76
ROUGE-L
Food
13.01
14.53
17.56
17.79
33.63
Location
18.41
20.31
23.28
23.63
35.23
Rooms
17.81
17.30
18.24
21.61
29.25
Service
20.40
20.05
26.01
26.07
30.31
R1
23.24
24.95
27.72
28.95
44.86
R2
Average
3.72
4.82
7.54
8.34
18.45
RL
SCMSE
17.02
17.53
20.82
21.77
34.58
.235
.274
.206
.204
.153
Table 5: Aspect summarization results on SPACE. Best model shown in boldface. All differences to best
model are statistically significant, except where underlined (p < 0.05; paired bootstrap resampling;
Koehn, 2004).
Does the summary discuss the specified aspect?
No
Exclusively
Partially
QTGEN
CopycatASP
MeanSumASP
LexRankASP
QTASP
1.1
6.7
21.8
48.2
58.7
72.0
45.3
37.3
28.0
32.7
26.9
48.0
40.9
23.8
8.7
Table 6: User study on aspect-specific sum-
maries. In the ‘‘Exclusively’’ column, QT’s
difference over all models is statistically sig-
nificant (p < 0.05; χ2 test).
Figure 6: Mean aspect entropies (bars) for each of
QT’s head sub-spaces and corresponding aspect
ROUGE-1 scores for the summaries produced by
each head (line).
SPACE
[GENERAL]
Copycat
QT
Proportion of SPACE’s train data used
5%
26.1
36.9
10%
26.2
37.1
50%
31.8
37.7
100%
36.7
38.7
Table 7: ROUGE-1 on SPACE for varying train
set sizes.
Figure 5: t-SNE projection of the quantized space
of an eight-head QT trained on SPACE, showing all
1024 learned latent codes (best viewed in color).
Darker codes correspond to lower aspect entropy,
our proposed measure of high aspect-specificity.
Zooming in the aspect sub-space uncovers good
aspect separation.
opinions effectively, whereas there is room for
improvement in terms of summary cohesion.
5.4 Results: Aspect-specific Summarization
There is no existing unsupervised system for
aspect-specific opinion summarization. Instead,
we use the power of BERT (Devlin et al., 2019)
to enable aspect summarization for our baselines.
Specifically, we obtain BERT sentence vectors
(average of token vectors) for input sentences
Xe, which we cluster via k-means. We then
replicate the cluster-to-aspects mapping used
by QT, as described in Equations (10)–(13):
Each cluster is mapped to exactly one aspect,
according to the probability of
finding the
pre-defined aspect-denoting keywords in the
sentences assigned to it. As a result, we obtain
non-overlapping and aspect-specific sets of input
sentences {X (a1)
, . . . }. For aspect ai, we
create aspect-filtered input reviews, by concate-
nating sentences in X (ai)
based on the reviews
they originated from. The filtered reviews of each
aspect are given as input to general summarizers
(LexRank, MeanSum, and Copycat), thus produc-
ing aspect summaries. QT and all baselines use the
same aspect keywords, which we sourced from a
held-out set of reviews, not included in SPACE.
, X (a2)
e
e
e
Table 5 shows results on SPACE, for individual
aspects, and on average. QT outperforms baselines
287
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
l
a
c
_
a
_
0
0
3
6
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Human
QT
MeanSum
Copycat
All staff members were friendly,
accommodating, and helpful. The
hotel and room were very clean.
The room had modern charm and
was nicely remodeled. The beds
are extremely comfortable. The
rooms are quite with wonderful
beach views. The food at Hash, the
restaurant in lobby, was fabulous.
The location is great, very close
to the beach. It’s a longish walk
to Santa Monica. The price is very
affordable.
Great hotel. We liked our
room with an ocean view.
The staff were friendly and
helpful. There was no bal-
cony. The location is per-
fect. Our room was very
quiet.
I would definitely
stay here again. You’re one
block from the beach. So it
must be good! Filthy hall-
ways. Unvacuumed room.
Pricy, but well worth it.
It was a great stay!
The food at the hotel
is great for the price.
I can’t believe the
noise from the street
is very loud and
is not
the
traffic
so great, but
that
is not a problem.
The restaurant was
great and the food is
excellent.
in a great
is
This hotel
location, just off the beach.
The staff was very friendly
and helpful. We had a room
with a view of the beach and
ocean. The only problem was
that our room was on the
4th floor with a view of the
ocean. If you are looking for
a nice place to sleep then this
is the place for you.
Table 8: Four general opinion summaries for the same hotel: One human-written and three from
competing models.
Building: Bright colors, skateboards, butterfly chairs and a grand ocean/boardwalk view (always entertaining). There is a
small balcony, but there’s only a small glass divider between your neighbor’s balcony.
Food: We had a great breakfast at Hash too! The restaurant was amazing. Lots of good restaurants within walking distance
and some even deliver. The roof bar was the icing on the cake.
Location: The location is perfect. The hotel is very central. The hotel itself is in a great location. We hardly venture far as
everything we need is within walking distance, but for the sightseers the buses are on the doorstep.
Cleanliness: Our room was very clean and comfortable. The room was clean and retrofitted with all the right amenities. Our
room was very large, clean, and artfully decorated.
Rooms: The room was spacious and had really cool furnishings, and the beds were comfortable. The room’s were good, and
we had a free upgrade for one of them (for a Facebook ’like!) A+ for the bed and pillows.
Service: The staff is great. The staff were friendly and helpful. The hotel staff were friendly and provided us with great service.
Each member of the staff was friendly and attentive. The staff excel and nothing is ever too much trouble.
Table 9: Aspect summaries extracted by QT.
in all aspects, except building, with significant
improvements against Copycat and Meansum in
terms of ROUGE and sentiment consistency.
The abstractive methods struggle to generate
summaries restricted to the aspect in question.
To verify this, we ran a second judgment
elicitation study. We used summaries from com-
peting aspect summarizers (QTASP, CopycatASP,
MeanSumASP, and LexRankASP) for all six aspects,
as well as QT’s general summaries. A summary
was shown to three participants, who were asked
whether it discussed the specified aspect exclu-
sively, partially, or not at all. Table 6 shows that
58.7% of QT aspect-specific summaries discuss
the specified aspect exclusively, while only 8.7%
of the summaries fail
to mention the aspect.
LexRankASP follows with 23.8% of its summaries
failing to mention the aspect, while the abstractive
models performed significantly worse.
5.5 Further Analysis
Training Efficiency Table 7 shows ROUGE-1
scores for QT and Copycat on SPACE, when trained
on different portions of the training set (randomly
downsampled and averaged over 5 runs). QT ex-
hibits impressive data efficiency; when trained on
5% of data, it performs comparably to a Copycat
summarizer that has been trained on the full
corpus.
Visualizing Sub-spaces We present a visual
demonstration of QT’s quantized sub-spaces in
Figure 5. We used t-SNE (van der Maaten and
Hinton, 2008) to project the latent code vectors
onto two dimensions. The latent codes produced
by QT’s eight heads are clearly grouped in eight
separate sub-spaces. The aspect sub-space (shown
in square) was detected automatically, as it dis-
played the lowest mean aspect entropy (darker
color). Zooming into its latent codes uncov-
ers reasonable aspect separation, an impressive
the model received no
result considering that
aspect-specific supervision.
Mean Aspect Entropy Figure 6 further illus-
trates the effectiveness of aspect entropy for
detecting the head sub-space that best separates
aspect-specific sentences. Each gray bar shows
288
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
l
a
c
_
a
_
0
0
3
6
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
the mean aspect entropy for the codes produced
by one of QT’s eight heads. One of the heads
(leftmost) exhibits much lower entropy, indicating
a strong confidence for aspect membership within
its latent codes. We confirm this enables better
aspect summarization by generating aspect sum-
maries using each head, and plotting the obtained
ROUGE-1 scores.
System Output Finally, we show gold-standard
and system-generated general
in
Table 8, as well as QT aspect summaries in
Table 9.
summaries
6 Conclusions
We presented a novel opinion summarization
system based on the Quantized Transformer that
requires no reference summaries for training, and
is able to extract general and aspect summaries
from large groups of input reviews. QT is trained
through sentence reconstruction and learns a
rich encoding space, paired with a clustering
component based on vector quantized variational
autoencoders. At summarization time, we exploit
the characteristics of the quantized space,
to
identify those clusters that correspond to the
input’s most popular opinions, and extract the
sentences that best represent them. Moreover, we
used the multi-head representations of the model,
and no further training, to detect the encoding
sub-space that best separates aspects, enabling
aspect-specific summarization. We also collected
SPACE, a new opinion summarization corpus which
we hope will inform and inspire further research.
Experimental results on SPACE and popular
benchmarks reveal that our system is able to
produce informative summaries which cover all
or individual aspects of an entity. In the future, we
would like to utilize the QT framework in order
to generate abstractive summaries. We could also
exploit QT’s multi-head semantics more directly,
and further improve it through weak supervision
or multi-task objectives. Finally, although we
focused on opinion summarization, it would be
interesting to see if the proposed model can be
applied to other multi-document summarization
tasks.
Acknowledgments
We thank the anonymous reviewers for their
feedback and the action editor, Jing Jiang,
for her comments. We gratefully acknowledge
the support of the European Research Council
(Lapata; award number 681760, ‘‘Translating
Multiple Modalities into Text’’). We also thank
Wang-Chiew Tan for her valuable input.
References
Reinald Kim Amplayo and Mirella Lapata. 2019.
Informative and controllable opinion summa-
rization. arXiv preprint arXiv:1909.02322v1.
Reinald Kim Amplayo and Mirella Lapata.
2020. Unsupervised opinion summarization
with noising and denoising. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 1934–1945.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1
/2020.acl-main.175
the
Stefanos Angelidis and Mirella Lapata. 2018.
extraction
Summarizing opinions: Aspect
sentiment prediction and they are
meets
In Proceedings
both weakly supervised.
of
on Empirical
Methods in Natural Language Processing,
pages 3675–3686, Brussels, Belgium. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18
-1403
2018 Conference
Yoshua Bengio, Nicholas L´eonard, and Aaron
Courville. 2013. Estimating or propagating
gradients
for
conditional computation. arXiv preprint arXiv:
1308.3432v1.
through stochastic neurons
Arthur Braˇzinskas, Mirella Lapata, and Ivan Titov.
2020. Unsupervised opinion summarization
In Proceed-
as copycat-review generation.
ings of
the
Association for Computational Linguistics,
pages 5151–5169. Association for Computa-
tional Linguistics. DOI: https://doi.org
/10.18653/v1/2020.acl-main.461
the 58th Annual Meeting of
Ziqiang Cao, Furu Wei, Li Dong, Sujian Li,
and Ming Zhou. 2015. Ranking with recursive
neural networks and its application to multi-
document summarization. In Proceedings of the
Twenty-Ninth AAAI Conference on Artificial
Intelligence, AAAI’15, pages 2153–2159.
AAAI Press.
289
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
l
a
c
_
a
_
0
0
3
6
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Giuseppe Carenini, Raymond Ng, and Adam
Pauls. 2006. Multi-document summarization
In 11th Conference of
of evaluative text.
the European Chapter of
the Association
for Computational Linguistics, Trento, Italy.
Association for Computational Linguistics.
Jianpeng Cheng and Mirella Lapata. 2016. Neu-
ral summarization by extracting sentences and
words. In Proceedings of
the 54th Annual
Meeting of
the Association for Computa-
tional Linguistics (Volume 1: Long Papers),
pages 484–494, Berlin, Germany. Associ-
ation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P16
-1046, PMID: PMC4738087
Eric Chu and Peter Liu. 2019. MeanSum: A neu-
ral model for unsupervised multi-document abs-
tractive summarization. In Proceedings of the
36th International Conference on Machine
Learning, volume 97 of Proceedings of Ma-
chine Learning Research, pages 1223–1232,
Long Beach, California, USA. PMLR.
Maximin Coavoux, Hady Elsahar, and Matthias
Gall´e. 2019. Unsupervised aspect-based multi-
document abstractive summarization. In Pro-
ceedings of
the 2nd Workshop on New
Frontiers in Summarization, pages 42–47,
Hong Kong, China. Association for Compu-
tational Linguistics. DOI: https://doi
.org/10.18653/v1/D19-5405
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of
the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
for
Minneapolis, Minnesota. Association
Computational Linguistics.
Giuseppe Di Fabbrizio, Amanda Stent, and
Robert Gaizauskas. 2014. A hybrid approach
to multi-document summarization of opinions
in reviews. In Proceedings of the 8th Inter-
national Natural Language Generation Con-
ference (INLG), pages 54–63, Philadelphia,
Pennsylvania. Association for Computational
290
Linguistics. DOI: https://doi.org/10
.3115/v1/W14-4408
G¨unes Erkan and Dragomir R. Radev. 2004.
Lexrank: Graph-based lexical centrality as sa-
lience in text summarization. Journal Of Ar-
tificial Intelligence Research, 22(1):457–479.
DOI: https://doi.org/10.1613/jair
.1523
Kavita Ganesan, ChengXiang Zhai, and Jiawei
Han. 2010. Opinosis: A graph based approach
to abstractive summarization of highly redun-
the 23rd
dant opinions. In Proceedings of
International Conference on Computational
Linguistics (COLING 2010), pages 340–348,
Beijing, China. COLING 2010 Organizing
Committee.
Ari Holtzman,
Jan Buys, Maxwell Forbes,
and Yejin Choi. 2020. The curious case of
neural text degeneration. In 8th International
Conference on Learning Representations, ICLR
2020, Virtual Conference, Apr 26th - May 1st.
OpenReview.net.
Minqing Hu and Bing Liu. 2004. Mining and
summarizing customer reviews. In Proceed-
ings of the Tenth ACM SIGKDD Internatio-
nal Conference on Knowledge Discovery
and Data Mining, KDD ’04, pages 168–177,
New York, NY, USA. Association for Com-
puting Machinery. DOI: https://doi
.org/10.1145/1014052.1014073
similarity recognition relaxes
Nguyen Huy Tien, Le Tung Thanh, and Nguyen
summarization:
Minh Le. 2019. Opinions
the
Aspect
In Pro-
constraint of predefined aspects.
the International Conference
ceedings of
on Recent Advances in Natural Language
Processing (RANLP 2019), pages 487–496,
INCOMA Ltd. DOI:
Varna, Bulgaria.
https://doi.org/10.26615/978-954
-452-056-4 058,
30811828,
PMCID: PMC6706286
PMID:
Masaru Isonuma, Junichiro Mori, and Ichiro
Sakata. 2019. Unsupervised neural single-
document summarization of reviews via learn-
ing latent discourse structure and its ranking.
In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
l
a
c
_
a
_
0
0
3
6
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
pages 2142–2152, Florence,
Italy. Associ-
ation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P19
-1206
Lukasz Kaiser and Samy Bengio. 2018. Discrete
autoencoders for sequence models. ArXiv,
abs/1801.09797v1.
Diederik P. Kingma and Max Welling. 2014.
Auto-encoding variational Bayes. CoRR, abs
/1312.6114v10.
Philipp Koehn. 2004. Statistical significance tests
for machine translation evaluation. In Proceed-
ings of
the 2004 Conference on Empirical
Methods in Natural Language Processing,
pages 388–395, Barcelona, Spain. Association
for Computational Linguistics.
Peter
J. Liu, Mohammad Saleh, Etienne
Pot, Ben Goodrich, Ryan Sepassi, Lukasz
Kaiser, and Noam Shazeer. 2018. Generating
Wikipedia by summarizing long sequences.
In 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC,
Canada, April 30 - May 3, 2018, Conference
Track Proceedings. OpenReview.net.
Jordan J. Louviere, Terry N. Flynn, and Anthony
Alfred John Marley. 2015. Best-worst Scaling:
Theory, Methods and Applications. Cambridge
University Press. DOI: https://doi.org
/10.1017/CBO9781107337855
Chris J. Maddison, Andriy Mnih, and Yee Whye
Teh. 2017. The concrete distribution: A conti-
nuous relaxation of discrete random varia-
bles. In International Conference on Learning
Representations. DOI: https://doi.org
/10.1007/978-3-319-06028-6 23
Diego Marcheggiani, Oscar T¨ackstr¨om, Andrea
Esuli, and Fabrizio Sebastiani. 2014. Hierar-
chical multi-label conditional random fields for
aspect-oriented opinion mining. In Advances
in Information Retrieval, pages 273–285,
Cham. Springer International Publishing. DOI:
https://doi.org/10.1007/978-3-319
-06028-6 23
Zhengjie Miao, Yuliang Li, Xiaolan Wang,
and Wang-Chiew Tan. 2020. Snippext: Semi-
supervised opinion mining with augmented
291
In Proceedings of The Web Con-
data.
ference 2020, WWW ’20, pages 617–628,
New York, NY, USA. Association for Com-
puting Machinery. DOI: https://doi.org
/10.1145/3366423.3380144
Shashi Narayan, Shay B. Cohen, and Mirella
Lapata. 2018. Ranking sentences for extractive
summarization with reinforcement
learning.
the 2018 Conference of
In Proceedings of
the North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers),
pages 1747–1759, New Orleans, Louisiana.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1
/N18-1158
Aaron van den Oord, Oriol Vinyals, and
Koray Kavukcuoglu. 2017. Neural discrete
in
representation
Neural Information Processing Systems 30,
pages 6306–6315. Curran Associates, Inc.
In Advances
learning.
Haojie Pan, Rongqin Yang, Xin Zhou, Rui
Wang, Deng Cai, and Xiaozhong Liu. 2020.
Large scale abstractive multi-review sum-
marization (LSARS) via aspect alignment.
the 43rd International
In Proceedings of
ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR
’20, pages 2337–2346, New York, NY, USA.
Association for Computing Machinery. DOI:
https://doi.org/10.1145/3397271
.3401439
Bo Pang and Lillian Lee. 2008. Opinion mining
and sentiment analysis. Foundations and
Trends
in Information Retrieval, 2(1–2):
1–135. DOI: https://doi.org/10.1561
/1500000011
In Proceedings
Laura Perez-Beltrachini, Yang Liu, and Mirella
Lapata. 2019. Generating summaries with
topic templates and structured convolutional
57th
decoders.
Annual Meeting of the Association for Com-
putational Linguistics,
5107–5116,
Florence, Italy. Association for Computational
Linguistics. DOI: https://doi.org/10
.18653/v1/P19-1504
pages
the
of
Maxime Peyrard. 2019. A simple theoretical
model of importance for summarization. In
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
l
a
c
_
a
_
0
0
3
6
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 1059–1073, Florence,
Italy. Associ-
ation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P19
-1101
Dragomir R. Radev, Simone Teufel, Horacio
Saggion, Wai Lam, John Blitzer, Hong Qi,
Arda C¸ elebi, Danyu Liu, and Elliott Drabek.
2003. Evaluation challenges in large-scale
document summarization. In Proceedings of
the 41st Annual Meeting of the Association
for Computational Linguistics, pages 375–382,
Sapporo, Japan. Association for Computational
Linguistics.
Alec Radford, Rafal
Jozefowicz, and Ilya
Sutskever. 2017. Learning to generate reviews
and discovering sentiment. arXiv preprint
arXiv:1704.01444v2.
Anna Rohrbach, Lisa Anne Hendricks, Kaylee
Burns, Trevor Darrell, and Kate Saenko. 2018.
Object hallucination in image captioning. In
Proceedings of the 2018 Conference on Empir-
ical Methods in Natural Language Processing,
pages 4035–4045, Brussels, Belgium. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18
-1437
Aurko Roy and David Grangier. 2019. Unsu-
pervised paraphrasing without translation. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
Italy. Associ-
pages 6033–6039, Florence,
ation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P19
-1605
Aurko Roy, Ashish Vaswani, Arvind Neelakantan,
and Niki Parmar. 2018. Theory and experiments
on vector quantized autoencoders. arXiv
preprint arXiv:1805.11063v2.
Yoshihiko Suhara, Xiaolan Wang, Stefanos
and Wang-Chiew Tan. 2020.
Angelidis,
OpinionDigest: A simple
framework for
opinion summarization. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, pages 5789–5798,
Online. Association for Computational Lin-
https://doi.org/10
guistics. DOI:
.18653/v1/2020.acl-main.513
Yufei Tian, Jianfei Yu, and Jing Jiang. 2019.
Aspect and opinion aware abstractive review
summarization with reinforced hard typed
the 28th ACM
decoder. In Proceedings of
International Conference
Information
on
and Knowledge Management, CIKM ’19,
pages 2061–2064, New York, NY, USA.
Association for Computing Machinery. DOI:
https://doi.org/10.1145/3357384
.3358142, PMID: 30884086
L. J. P. van der Maaten and G.E. Hinton.
2008. Visualizing data using t-SNE. Journal
of Machine Learning Research, 9(nov):
2579–2605.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In I. Guyon, U. V.
Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors,
Advances in Neural Information Processing
Systems 30, pages 5998–6008. Curran Asso-
ciates, Inc.
Hongning Wang, Yue Lu, and Chengxiang Zhai.
2010. Latent aspect rating analysis on review
text data: A rating regression approach. In
Proceedings of the 16th ACM SIGKDD Inter-
national Conference on Knowledge Discovery
and Data Mining, KDD ’10, pages 783–792,
New York, NY, USA. Association for Com-
puting Machinery. DOI: https://doi.org
/10.1145/1835804.1835903
Abigail See, Peter J. Liu, and Christopher D.
Manning. 2017. Get
to the point: Summa-
rization with pointer-generator networks. In
Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1073–1083,
Vancouver, Canada. Association for Computa-
tional Linguistics.
Xiaolan Wang, Yoshihiko Suhara, Natalie Nuno,
Yuliang Li, Jinfeng Li, Nofar Carmeli, Stefanos
Angelidis, Eser Kandogann, and Wang-Chiew
Tan. 2020. ExtremeReader: An interactive
explorer
for customizable and explainable
review summarization. In Companion Pro-
ceedings of the Web Conference 2020, WWW
’20, pages 176–180, New York, NY, USA.
292
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
l
a
c
_
a
_
0
0
3
6
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Association for Computing Machinery. DOI:
https://doi.org/10.1145/3366424
.3383535
Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu,
and
Ayush Pareek, Krishnan Srinivasan,
Dragomir Radev. 2017. Graph-based neural
multi-document summarization. In Proceed-
ings of the 21st Conference on Computational
Natural Language Learning (CoNLL 2017),
pages 452–462, Vancouver, Canada. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/K17
-1045
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
6
6
1
9
2
4
1
8
1
/
/
t
l
a
c
_
a
_
0
0
3
6
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
293