Extractive Opinion Summarization in Quantized Transformer Spaces


Stefanos Angelidis1 Reinald Kim Amplayo1
Yoshihiko Suhara2 Xiaolan Wang2 Mirella Lapata1

1University of Edinburgh

2Megagon Labs

s.angelidis@ed.ac.uk, reinald.kim@ed.ac.uk
yoshi@megagon.ai, xiaolan@megagon.ai
mlap@inf.ed.ac.uk

Abstract

We present the Quantized Transformer (QT), an unsupervised system for extractive opinion summarization. QT is inspired by Vector-Quantized Variational Autoencoders, which we repurpose for popularity-driven summarization. It uses a clustering interpretation of the quantized space and a novel extraction algorithm to discover popular opinions among hundreds of reviews, a significant step towards opinion summarization of practical scope. In addition, QT enables controllable summarization without further training, by utilizing properties of the quantized space to extract aspect-specific summaries. We also make publicly available SPACE, a large-scale evaluation benchmark for opinion summarizers, comprising general and aspect-specific summaries for 50 hotels. Experiments demonstrate the promise of our approach, which is validated by human studies where judges showed clear preference for our method over competitive baselines.

1 Introduction

Online reviews play an integral role in modern life, as we look to previous customer experiences to inform everyday decisions. The need to digest review content has fueled progress in opinion mining (Pang and Lee, 2008), whose central goal is to automatically summarize people's attitudes towards an entity. Early work (Hu and Liu, 2004) focused on numerically aggregating customer satisfaction across different aspects of the entity under consideration (e.g., the quality of a camera, its size, clarity). More recently, the success of neural summarizers in the Wikipedia and news domains (Cheng and Lapata, 2016; See et al., 2017; Narayan et al., 2018; Liu et al., 2018; Perez-Beltrachini et al., 2019) has spurred interest in opinion summarization: the aggregation, in textual form, of opinions expressed in a set of reviews (Angelidis and Lapata, 2018; Huy Tien et al., 2019; Tian et al., 2019; Coavoux et al., 2019; Chu and Liu, 2019; Isonuma et al., 2019; Bražinskas et al., 2020; Amplayo and Lapata, 2020; Suhara et al., 2020; Wang et al., 2020).

Opinion summarization has distinct characteristics that set it apart from other summarization tasks. First, it cannot rely on reference summaries for training, because such meta-reviews are very scarce and their crowdsourcing is unfeasible. Even for a single entity, annotators would have to produce summaries after reading hundreds, sometimes thousands, of reviews. Second, the inherent subjectivity of review text distorts the notion of information importance used in generic summarization (Peyrard, 2019). Conflicting opinions are often expressed for the same entity and, therefore, useful summaries should be based on opinion popularity (Ganesan et al., 2010). Moreover, methods need to be flexible with respect to the size of the input (entities are frequently reviewed by thousands of users), and controllable with respect to the scope of the output. For example, users may wish to read a general overview summary, or a more targeted one about a particular aspect of interest (e.g., a hotel's location, its cleanliness, or available food options).

Recent work (Tian et al., 2019; Coavoux et al., 2019; Chu and Liu, 2019; Isonuma et al., 2019; Bražinskas et al., 2020; Amplayo and Lapata, 2020; Suhara et al., 2020) has increasingly focused on abstractive summarization, where a summary is generated token-by-token to create novel sentences that articulate prevalent opinions in the input reviews. The abstractive approach offers a solution to the lack of supervision, under the assumption that opinion summaries should be written in the style of reviews. This simplification has allowed abstractive models to generate


review-like summaries from aggregate input representations, using sequence-to-sequence models trained to reconstruct single reviews. Despite being fluent, abstractive summaries may still suffer from issues of text degeneration (Holtzman et al., 2020), hallucinations (Rohrbach et al., 2018), and the undesirable use of first-person narrative, a direct consequence of review-like generation. Additionally, previous work used an unrealistically small number of input reviews (10 or fewer), and only sparingly investigated controllable summarization, albeit in weakly supervised settings (Amplayo and Lapata, 2019; Suhara et al., 2020).
In this paper, we attempt to address shortcomings of existing methods by turning to extractive summarization, which aims to construct an opinion summary by selecting a few representative input sentences (Angelidis and Lapata, 2018; Huy Tien et al., 2019). Specifically, we introduce the Quantized Transformer (QT), an unsupervised neural model inspired by Vector-Quantized Variational Autoencoders (VQ-VAE; van den Oord et al., 2017; Roy et al., 2018), which we repurpose for popularity-driven summarization. QT combines Transformers (Vaswani et al., 2017) with the discretization bottleneck of VQ-VAEs and is trained via sentence reconstruction, similarly to the work of Roy and Grangier (2019) on paraphrasing. At inference time, we use a clustering interpretation of the quantized space and a novel extraction algorithm that discovers popular opinions among hundreds of reviews, a significant step towards opinion summarization of practical scope. QT is also capable of aspect-specific summarization without further training, by exploiting the properties of the Transformer's multi-head sentence representations.

We further contribute to the progress of opinion mining research by introducing SPACE (shorthand for Summaries of Popular and Aspect-specific Customer Experiences), a large-scale corpus for the evaluation of opinion summarizers. We collected 1,050 human-written summaries of TripAdvisor reviews for 50 hotels. SPACE has general summaries, giving a high-level overview of popular opinions, and aspect-specific ones, providing detail on individual aspects (e.g., location, cleanliness). Each summary is based on 100 customer reviews, an order of magnitude increase over existing corpora, thus providing a more realistic input to competing models. Experiments on SPACE and two more benchmarks demonstrate that our approach holds promise for opinion summarization. Participants in human evaluation further express a clear preference for our model over competitive baselines. We make SPACE and our code publicly available.1

2 Related Work

Ganesan et al. (2010) were the first to make the connection between opinion mining and text summarization; they developed Opinosis, a graph-based abstractive summarizer that explicitly models opinion popularity, a key characteristic of subjective text and central to our approach. Follow-on work (Di Fabbrizio et al., 2014) adopts a hybrid approach where salient sentences are first extracted and abstracts are generated based on hand-written templates (Carenini et al., 2006). More recently, Angelidis and Lapata (2018) extract salient opinions according to their polarity intensity and aspect specificity, in a weakly supervised setting.

A popular approach to modeling opinion popularity, albeit indirectly, is vector averaging. Chu and Liu (2019) propose MeanSum, an unsupervised abstractive summarizer that learns a review decoder through reconstruction, and uses it to generate summaries conditioned on averaged representations of the inputs. Averaging is also used by Bražinskas et al. (2020), who train a copy-enabled variational autoencoder by reconstructing reviews from averaged vectors of reviews about the same entity. Other methods include denoising autoencoders (Amplayo and Lapata, 2020) and the system of Coavoux et al. (2019), an encoder-decoder architecture that uses a clustering of the encoding space to identify opinion groups, similar to our work.

Our model builds on the VQ-VAE (van den Oord et al., 2017), a recently proposed training technique for learning discrete latent variables, which aims to overcome problems of posterior collapse and large variance associated with Variational Autoencoders (Kingma and Welling, 2014). Like other related discretization techniques (Maddison et al., 2017; Kaiser and Bengio, 2018), VQ-VAE passes the encoder output through a discretization bottleneck using a neighbor lookup in the space of latent code embeddings. The application of VQ-VAEs to opinion summarization is novel, to our knowledge, as is the proposed sentence extraction algorithm.

1 https://github.com/stangelid/qt.


Our model does not depend on vector averaging, nor does it suffer from information loss and hallucination. Moreover, it can easily accommodate a large number of input reviews. Within NLP, VQ-VAEs have been previously applied to neural machine translation (Roy et al., 2018) and paraphrase generation (Roy and Grangier, 2019). Our work is closest to Roy and Grangier (2019) in its use of a quantized Transformer; however, we adopt a different training algorithm (Soft EM; Roy et al., 2018), orders of magnitude fewer discrete latent codes, a different method for obtaining head sentence vectors, and apply the QT in a novel way for extractive opinion summarization.

Besides modeling, our work contributes to the growing body of resources for opinion summarization. We release SPACE, the first corpus to contain both general and aspect-specific opinion summaries, while increasing the number of input reviews tenfold compared to popular benchmarks (Bražinskas et al., 2020; Chu and Liu, 2019; Angelidis and Lapata, 2018).

3 Problem Formulation

Let C be a corpus of reviews on entities {e1, e2, . . .} from a single domain d, for example, hotels. Reviews may discuss any number of relevant aspects A_d = {a1, a2, . . .}, like the hotel's rooms or location. For every entity e, we define its review set R_e = {r1, r2, . . .}. Every review is a sequence of sentences (x1, x2, . . .) and a sentence x is, in turn, a sequence of words (w1, w2, . . .). For brevity, we use X_e to denote all review sentences about entity e. We formalize two sub-tasks: (a) general opinion summarization, where a summary should cover popular opinions in R_e across all discussed aspects; and (b) aspect opinion summarization, where a summary must focus on a single specified aspect a ∈ A_d. In our extractive setting, these translate to creating a general or aspect summary by selecting a small subset of sentences S_e ⊂ X_e.
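To make the setup concrete, here is a minimal sketch of this formalization as Python data structures; the class and field names are illustrative and not taken from the released code:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Review:
    """A review r is a sequence of sentences; each sentence a word sequence."""
    sentences: List[str]

@dataclass
class Entity:
    """An entity e with its review set R_e."""
    entity_id: str
    reviews: List[Review] = field(default_factory=list)

    def all_sentences(self) -> List[str]:
        """X_e: every review sentence about this entity."""
        return [s for r in self.reviews for s in r.sentences]

# An extractive summary S_e is then a small subset of entity.all_sentences().
```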

We train the QT through sentence reconstruction to learn a rich representation space and its quantization into latent codes (Section 3.1). We enable opinion summarization by mapping input sentences onto their nearest latent codes and extracting those sentences that are representative of the most popular codes (Section 3.2). We also illustrate how to produce aspect-specific summaries using a trained QT model and a few aspect-denoting query terms (Section 3.2.2).

Figure 1: A sentence is encoded into a 3-head representation and head vectors are quantized using a weighted average of their neighboring code embeddings. The QT model is trained by reconstructing the original sentence.

3.1 The Quantized Transformer

Our model is a variant of VQ-VAEs (van den Oord et al., 2017; Roy and Grangier, 2019) and consists of: (a) a Transformer sentence encoder that encodes an input sentence x into a multi-head representation {x_1, . . . , x_H}, where x_h ∈ R^D and H is the number of heads; (b) a vector quantizer that maps each head vector to a mixture of discrete latent codes, and uses the codes' embeddings to produce quantized vectors {q_1, . . . , q_H}, q_h ∈ R^D; and (c) a Transformer sentence decoder, which attends over the quantized vectors to generate the sentence reconstruction x̂. The decoder is not used during summarization; we only use the learned quantized space to extract sentences, as described in Section 3.2.

Sentence Encoding  Our encoder prepends sentence x with the special token [SNT] and uses the vanilla Transformer encoder (Vaswani et al., 2017) to produce token-level vectors. We ignore individual word vectors and only keep the special token's vector x_snt ∈ R^D. We obtain a multi-head representation of x by splitting x_snt into H sub-vectors {x'_1, . . . , x'_H}, x'_h ∈ R^{D/H}, followed by a layer-normalized transformation:

    x_h = LayerNorm(W x'_h + b) ,                                    (1)

where x_h is the h-th head and W ∈ R^{D×D/H}, b ∈ R^D are shared across heads. Hyperparameter H, that is, the number of sentence heads of our encoder, is different from the Transformer's internal attention heads. The encoder's operation is illustrated in Figure 1, where the sentence "The staff was great!" is encoded into a 3-head representation.

Vector Quantization  Let z_1, . . . , z_H be discrete latent variables corresponding to the H encoder heads. Every variable can take one of K possible latent codes, z_h ∈ [K]. The quantizer's codebook, e ∈ R^{K×D}, is shared across latent variables and maps each code (or cluster) to its embedding (or centroid) e_k ∈ R^D. Given sentence x and its multi-head encoding {x_1, . . . , x_H}, we independently quantize every head using a mixture of its nearest codes from [K]. Specifically, we follow the Soft EM training of Roy et al. (2018) and sample, with replacement, m latent codes for the h-th head:

    z_h^1, . . . , z_h^m ~ Multinomial(l_1, . . . , l_K), with l_k = -||x_h - e_k||_2^2 ,   (2)

where Multinomial(l_1, . . . , l_K) is a K-way multinomial distribution with logits l_1, . . . , l_K. The h-th quantized head vector is obtained as the average of the sampled codes' embeddings:

    q_h = (1/m) Σ_{j=1}^{m} e_{z_h^j} .                              (3)

This soft quantization process is shown in Figure 1, where head vectors x_1, x_2, and x_3 are quantized using a weighted average of their neighboring code embeddings, to produce q_1, q_2, and q_3.
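A small sketch of the soft quantization step under the same assumptions; the function name and tensor shapes are illustrative:

```python
import torch

def soft_quantize(x_heads, codebook, m=30):
    """Soft-EM quantization of Equations (2)-(3): for each head, sample m codes
    from a multinomial with logits -||x_h - e_k||^2, then average their embeddings.
    x_heads: (H, D) head vectors; codebook: (K, D) code embeddings."""
    sq_dists = torch.cdist(x_heads, codebook) ** 2           # (H, K)
    probs = torch.softmax(-sq_dists, dim=-1)                 # multinomial over codes
    samples = torch.multinomial(probs, m, replacement=True)  # (H, m) sampled z_h^j
    q_heads = codebook[samples].mean(dim=1)                  # (H, D): q_h = mean of e_{z_h^j}
    return q_heads, samples
```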

Sentence Reconstruction and Training  Instead of attending over individual token vectors, as in the vanilla architecture, the Transformer sentence decoder attends over {q_1, . . . , q_H}, the quantized head vectors of the sentence, to generate the reconstruction x̂. The model is trained to minimize:

    L = L_r + Σ_h ||x_h - sg(q_h)||^2 .                              (4)

L_r is the reconstruction cross entropy, and the stop-gradient operator sg(·) is defined as the identity during forward computation and zero on backpropagation. The sampling of Equation (2) is bypassed using the straight-through estimator (Bengio et al., 2013), and the latent codebook is trained via exponentially moving averages, as detailed in Roy et al. (2018).
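The training objective could be sketched as follows; note that Equation (4) omits the codebook update, which happens via exponential moving averages outside the loss, and that this function signature is an assumption:

```python
import torch.nn.functional as F

def qt_loss(logits, target_ids, x_heads, q_heads):
    """Equation (4): reconstruction cross-entropy plus a commitment term, with
    stop-gradient on the quantized vectors (q_heads.detach()).
    logits: (batch, seq, vocab); x_heads, q_heads: (batch, H, D)."""
    l_rec = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())
    l_commit = ((x_heads - q_heads.detach()) ** 2).sum(dim=-1).sum(dim=-1).mean()
    return l_rec + l_commit
```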

3.2 Summarization in Quantized Spaces

Existing neural methods for opinion summarization have modeled opinion popularity within a set of reviews by encoding each review into a vector, averaging all vectors to obtain an aggregate representation of the input, and feeding it to a review decoder to produce a summary (Chu and Liu, 2019; Coavoux et al., 2019; Bražinskas et al., 2020). This approach is problematic for two reasons. First, it assumes that the complex semantics of whole reviews can be encoded in a single vector. Second, it also assumes that features of commonly occurring opinions are preserved after averaging and, therefore, that those opinions will appear in the generated summary. The latter assumption becomes particularly uncertain for larger numbers of input reviews. We take a different approach, using sentences as the unit of representation, and propose a general extraction algorithm based on the QT, which explicitly models popularity without vector aggregation. Using the same algorithmic framework we are also able to extract aspect-specific summaries.

3.2.1 General Opinion Summarization
We exploit QT's quantization of the encoding space to cluster similar sentences together, quantify the popularity of the resulting clusters, and extract representative sentences from the most popular ones.

Specifically, given X_e = {x_1, . . . , x_i, . . . , x_N}, the N review sentences about entity e, the trained encoder produces N × H unquantized head vectors {x_11, . . . , x_ih, . . . , x_NH}, where x_ih is the h-th head of the i-th sentence. We perform hard quantization, assigning every vector to its nearest latent code, and count the number of assignments per code, namely, the popularity of each cluster:

    z_ih = argmin_{k∈[K]} ||x_ih - e_k||_2 ,                         (5)

    n_k = Σ_{i,h} 1[z_ih = k] .                                      (6)
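A possible implementation of this hard quantization and popularity count; shapes and names are illustrative:

```python
import torch

def cluster_popularity(head_vecs, codebook):
    """Equations (5)-(6): hard-assign every head vector to its nearest code
    and count assignments per code. head_vecs: (N, H, D); codebook: (K, D)."""
    N, H, D = head_vecs.shape
    z = torch.cdist(head_vecs.reshape(N * H, D), codebook).argmin(dim=-1)
    n = torch.bincount(z, minlength=codebook.size(0))  # n_k, the cluster popularity
    return z.view(N, H), n
```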

Figure 2 shows how sentences X_e are encoded and their different heads are assigned to codes. Similar sentences cluster under the same codes and, consequently, clusters receiving numerous assignments are characteristic of popular opinions in X_e. A general summary should consist of the sentences that are most representative of these popular clusters.

In the simplest case, we could couple every code k with its nearest sentence x^(k):

    x^(k) = argmin_i ( min_h ||x_ih - e_k||_2 ) ,                    (7)

Figure 2: General opinion summarization with QT. All input sentences for an entity are encoded using three heads (shown as orange, blue, and green crosses). Sentence vectors are clustered under their nearest latent code (gray circles). Popular clusters (histogram) correspond to commonly occurring opinions, and are used to sample and extract the most representative sentences.

and rank sentences x^(k) according to the size n_k of their respective clusters; the top sentences, up to a predefined budget, are extracted into a summary. This ranking method entails that only those sentences which are the nearest neighbor of a popular code are likely to be extracted. However, a salient sentence may be in the neighborhood of multiple codes per head, despite never being the nearest sentence of a code vector. For example, the sentence "Great location and beautiful rooms" is representative of clusters encoding positive attitudes for both the location and the rooms of a hotel. To capture this, we relax the requirement of coupling every cluster with exactly one sentence and propose two-step sampling (Figure 3), a novel sampling process that simultaneously estimates cluster popularity and promotes sentences commonly found in the proximity of popular clusters. We repeatedly perform the following operations:

Cluster Sampling  We first sample a latent code z with probability proportional to the clusters' sizes:

    z ~ P(z = k) = n_k / (N × H) ,                                   (8)

where n_k is the number of assignments for code k, computed in Equation (6). For example, if the input contains many paraphrases of the sentence "Excellent location", these are likely to be clustered under the same code, which in turn increases the probability of sampling that code. Cluster sampling is illustrated at the top of Figure 3, showing assignments (left) and resulting code probabilities (right).

Figure 3: Sentence ranking via two-step sampling. In this toy example, each sentence (s1 to s5) is assigned to its nearest code (k = 1, 2, 3), as shown by thick purple arrows. During cluster sampling, the probability of sampling a code (top right; shown as blue bars) is proportional to the number of assignments it receives. For every sampled code, we perform sentence sampling; sentences are sampled, with replacement, according to their proximity to the code's encoding. Samples from codes 1 and 3 are shown in black and red, respectively.

Sentence Sampling  The sampled code z exists in the neighborhood of many input sentences. Picking a single sentence as the most characteristic of that cluster is too restrictive. Instead, we sample (with replacement) sentences from the code's neighborhood n times, thus generalizing Equation (7):

    x_1, . . . , x_n ~ Multinomial(l'_1, . . . , l'_N), with l'_i = -min_h ||x_ih - e_z||_2^2 ,   (9)

where the Multinomial's logits l'_i mark the (negative) distance of the i-th sentence's head which is closest to code z. Sentence sampling is depicted in the toy example of Figure 3 (bottom). After selecting code k = 1 during cluster sampling, four sentence samples are drawn (shown in black arrows). The next cluster sample (k = 3) results in four more sentence samples (shown in red).
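Putting both steps together, a hedged sketch of the full ranking procedure; the sample sizes follow Section 5.1, while the exact vote-aggregation details are an assumption:

```python
import torch

def two_step_sampling(head_vecs, codebook, n_code_samples=300, n_sent_samples=30):
    """Two-step sampling ranker (Eqs. 8-9).
    head_vecs: (N, H, D) unquantized heads; codebook: (K, D)."""
    N, H, D = head_vecs.shape
    d = torch.cdist(head_vecs.reshape(N * H, D), codebook)                # (N*H, K)
    n = torch.bincount(d.argmin(dim=-1), minlength=codebook.size(0))      # Eq. 6 counts
    sq_dists = (d ** 2).view(N, H, -1).min(dim=1).values                  # (N, K), Eq. 9 logits
    codes = torch.multinomial(n.float() / (N * H), n_code_samples,
                              replacement=True)                           # Eq. 8
    votes = torch.zeros(N)
    for k in codes:                                                       # cluster sampling
        picks = torch.multinomial(torch.softmax(-sq_dists[:, k], dim=0),
                                  n_sent_samples, replacement=True)       # sentence sampling
        votes += torch.bincount(picks, minlength=N).float()
    return votes.argsort(descending=True)                                 # ranked sentences
```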


Sentence s4 ("Excellent room and location") receives the most votes in total, after being sampled as a neighbor of both codes.

Two-step sampling is repeated multiple times and all sentences in X_e are ranked according to the total number of times they have been sampled. The final summary is constructed by concatenating the top ones (see the right part of Figure 2). Importantly, our extraction algorithm is not sensitive to the size of the input. More sentences increase the absolute number of assignments per code, but do not hinder two-step sampling or cause information loss; on the contrary, a larger pool of sentences may result in a more densely populated quantized encoding space and, in turn, a better estimation of cluster popularity and sentence ranking.

3.2.2 Aspect Opinion Summarization

So far, we have focused on selecting sentences solely based on the popularity of the opinions they express. We now turn our attention to aspect summaries, which discuss a particular aspect of an entity (e.g., the location or service of a hotel) while still presenting popular opinions. We create such summaries with a trained QT model, without additional fine-tuning. Instead, we exploit QT's multi-head representations and only require a small number of aspect-denoting query terms.2

We hypothesize that different sentence heads in QT encode the approximately orthogonal semantic or structural attributes which are necessary for sentence reconstruction. In the simplified example in Figure 2, the encoder's first head (orange) might capture information about the aspects of the sentence, the second head (blue) encodes sentiment, while head three (green) may encode structural information (e.g., the length of the sentence or its punctuation). Our hypothesis is reinforced by the empirical observation that sentence vectors originating from the same head occupy their own sub-space, and do not show any similarity to vectors from other heads. As a result, each latent code k receives assignments from exactly one head of the sentence encoder. More formally, head h yields a set of latent codes K_h ⊂ [K]. Figure 2 demonstrates this, as the encoding space consists of three sub-spaces, one for each head. Sentence and latent code vectors are further organized within each sub-space according to the attribute captured by the respective head.

2 Contrary to Angelidis and Lapata (2018), who used 30 seed-words per aspect, we only assume five query terms per aspect to replicate a realistic setting.


Figure 4: Aspect opinion summarization with QT. The aspect-encoding sub-space is identified using mean aspect entropy and all other sub-spaces are ignored (shown in gray). Two-step sampling is restricted to the codes associated with the desired aspect (shown in red).


To enable aspect summarization, we identify the sub-space capturing aspect-relevant information and label its aspect-specific codes, as seen in Figure 4. Specifically, we first quantify the probability of finding an aspect in the sentences assigned to a latent code, and identify the head sub-space that best separates sentences according to their aspect. Then, we map every cluster within that sub-space to an aspect and extract aspect summaries only from those aspect-specific clusters.

We utilize a held-out set of review sentences X_dev and keywords Q_a = {s_1, . . . , s_5} for aspect a. We encode and quantize the sentences in X_dev and compute the probability that latent code k contains tokens typical of aspect a as:

    P_k(a) = tf(Q_a, k) / Σ_{a'} tf(Q_{a'}, k) ,                     (10)

where tf(Q_a, k) is the number of times query terms in Q_a were found in sentences assigned to k. We use information-theoretic entropy to measure how aspect-certain code k is:

    H_k = -Σ_a P_k(a) log P_k(a) .                                   (11)

Low aspect entropy values indicate that most sentences assigned to k belong to a single aspect. It thus follows that h_asp (i.e., the head sub-space that best separates sentences according to their aspect) will exhibit the lowest mean aspect entropy:

    h_asp = argmin_h (1/|K_h|) Σ_{k∈K_h} H_k .                       (12)

                                      Reviews  Entities  Rev/Ent  Summaries (R)  Type         Scope
SPACE (This work)                     1.14M    50        100      1,050 (3)      Abstractive  General+Aspect
AMAZON (Bražinskas et al., 2020)      4.75M    60        8        180 (3)        Abstractive  General only
YELP (Chu and Liu, 2019)              1.29M    200       8        200 (1)        Abstractive  General only
OPOSUM (Angelidis and Lapata, 2018)   359K     60        10       180 (3)        Extractive   General only

Table 1: Statistics for SPACE and three recently introduced evaluation corpora for opinion summarization. SPACE includes aspect summaries for six aspects. (Reviews: number of reviews in training set; no gold-standard summaries are available. Rev/Ent: input reviews per entity in test set. R: reference summaries per example.)

We map every code produced by h_asp to its aspect a(k) via Equation (10), and obtain the aspect codes:

    K_a = {k | k ∈ K_{h_asp} and a = a(k)} .                         (13)

To extract a summary for aspect a, we follow the ranking or sampling methods described in Equations (5)–(9), restricting the process to the codes K_a. Sub-space selection and aspect-specific sentence sampling are illustrated in Figure 4.
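A sketch of the whole aspect-labeling pipeline of Equations (10)–(13), under the simplifying assumption that term frequencies are computed by substring counting; all names are illustrative:

```python
import numpy as np

def label_aspect_codes(code_sentences, code_head, aspect_queries, n_heads):
    """code_sentences[k]: held-out sentences assigned to code k; code_head:
    array giving the single head that produced each code; aspect_queries:
    dict mapping each aspect to its five query terms."""
    aspects = list(aspect_queries)
    K = len(code_sentences)
    tf = np.zeros((K, len(aspects)))
    for k, sents in enumerate(code_sentences):
        text = " ".join(sents).lower()
        for j, a in enumerate(aspects):
            tf[k, j] = sum(text.count(q) for q in aspect_queries[a])  # tf(Q_a, k)
    p = tf / np.clip(tf.sum(axis=1, keepdims=True), 1e-9, None)       # Eq. 10
    H_k = -(p * np.log(np.clip(p, 1e-9, 1.0))).sum(axis=1)            # Eq. 11
    head_entropy = [H_k[code_head == h].mean() for h in range(n_heads)]
    h_asp = int(np.argmin(head_entropy))                              # Eq. 12
    # Eq. 13: within h_asp, label each code with its most probable aspect
    return {a: [k for k in range(K)
                if code_head[k] == h_asp and aspects[int(p[k].argmax())] == a]
            for a in aspects}
```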

4 The SPACE Corpus

We introduce SPACE (Summaries of Popular and Aspect-specific Customer Experiences), a large-scale opinion summarization benchmark for the evaluation of unsupervised summarizers. SPACE is built on TripAdvisor hotel reviews and aims to facilitate future research by improving upon the shortcomings of existing datasets. It comes with a training set of approximately 1.1 million reviews for over 11,000 hotels, obtained by cleaning and downsampling an existing collection (Wang et al., 2010). The training set contains no reference summaries, and is useful for unsupervised training.

For evaluation, we created a large collection of human-written, abstractive opinion summaries. Specifically, for a held-out set of 50 hotels (25 hotels for development and 25 for testing), we asked human annotators to write high-level general summaries and aspect summaries for six popular aspects: building, cleanliness, food, location, rooms, and service. For every hotel and summary type, we collected three reference summaries from different annotators. Importantly, for every hotel, summaries were based on 100 input reviews. To the best of our knowledge, this is the largest crowdsourcing effort towards obtaining high-quality abstractive summaries of reviews, and the first to use a pool of input reviews of this scale (see Table 1 for a comparison with existing datasets). Moreover, SPACE is the first benchmark to also contain aspect-specific opinion summaries.

The large number of input reviews per entity poses certain challenges with regard to the collection of human summaries. A direct approach is prohibitive, as it would require annotators to read all 100 reviews and write a summary in a single step. A more reasonable method is to first identify a subset of input sentences that most people consider salient, and then ask annotators to summarize them. Summaries were thus created in multiple stages using the Appen3 platform and expert annotator channels of native English speakers. Although we propose an extractive model, annotators were asked to produce abstractive summaries, as we hope SPACE will be broadly useful to the summarization community. We did not allow the use of first-person narrative, to collect more summary-like texts. We present our annotation procedure below.4

4.1 Sentence Selection via Voting

The sentence selection stage identifies a subset of review sentences that contain the most salient and useful opinions expressed by the reviewers. This is a crucial but subjective task and, therefore, we devised a voting scheme that allowed us to select sentences that received votes from many annotators. Specifically, each review was shown to five judges who were asked to select informative sentences. Annotators were encouraged to exercise their own judgement in selecting summary-worthy sentences, but were advised to focus on sentences which explicitly expressed or supported reviewer opinions, avoiding overly general or personal comments (e.g., "Loved the hotel", "I like a shower with good pressure"), and making sure that important aspects were included. We set no threshold on the number of sentences they could select (we allowed selecting all or no sentences for a given review).

3 https://appen.com/.
4 Full annotation instructions: https://github.com/stangelid/qt/blob/main/annotation.md.

However, the annotation interface kept track of their total votes and guided them to select between 20% and 40% of sentences, on average.

Sentences with 4 or more votes were automatically promoted to the next stage. Inter-annotator agreement according to Cohen's kappa was k = 0.36, indicating "fair agreement". Previous studies have shown that human agreement for sentence selection tasks in summarization of news articles is usually lower than 0.3 (Radev et al., 2003). The median number of sentences promoted for summarization for each hotel was 83, while the minimum was 46. This ensured that enough sentences were always available for summarization, while simplifying the task; annotators were now required to read and summarize considerably smaller amounts of review text than the original 100 reviews.

4.2 Summary Collection

General Summaries  The top-voted sentences for each hotel were presented to three annotators, who were asked to read them and produce a high-level overview summary of up to 100 words. To simplify the task and help annotators write coherent summaries, sentences with high lexical overlap were grouped together, and the interface allowed the annotators to quickly sort sentences according to the words they contained. The process resulted in an inter-annotator ROUGE-L score of 29.19 and provides ample room for future research, as detailed in our experiments (Table 2).

Aspect Summaries  Top-voted sentences were further labeled by an off-the-shelf aspect classifier (Angelidis and Lapata, 2018) trained on a public aspect-labeled corpus of hotel review sentences (Marcheggiani et al., 2014).5 Sentences outside of the six most popular aspects (building, cleanliness, food, location, rooms, and service) were ignored, and sentences with 3 votes were promoted only if an aspect had no sentences with 4 votes. The promoted sentences were grouped according to their aspect and presented to annotators, who were asked to create a more detailed, aspect-specific summary of up to 75 words. The aspect summaries have an inter-annotator ROUGE-L score of 34.58.

5 Evaluation

In this section, we discuss our experimental setup, including datasets and comparison models, before presenting our automatic evaluation results, human studies, and further analyses.

5.1 Experimental Setup

Datasets  We used SPACE as the main testbed for our experimental evaluation, covering both general and aspect-specific summarization tasks. For general summarization, we used two additional opinion summarization benchmarks, namely, YELP (Chu and Liu, 2019) and AMAZON (Bražinskas et al., 2020) (see Table 1). For all datasets, we use pre-defined development and test set splits, and only report results on the test set.

Implementation Details  We used unigram LM SentencePiece vocabularies of 32K.6 All system hyperparameters were selected on the development set. The Transformer's dimensionality was set to 320 and its feed-forward layer to 512. We used 3 layers and 4 internal attention heads for its encoder and decoder, whose input embedding layer was shared, but no positional encodings, as we observed no summarization improvements. We used H = 8 sentence heads for representing every sentence. For the quantizer, we set the number of latent codes to K = 1,024 and sampled m = 30 codes for every input sentence during training. We used the Adam optimizer, with an initial learning rate of 10^-3 and a learning rate decay of 0.9. We warmed up the Transformer by disabling quantization for the first 4 epochs. In total, we ran 20 training epochs. On the full SPACE corpus, QT was trained in 4 days on a single GeForce GTX 1080 Ti GPU, using our available PyTorch implementation. All general and aspect summaries were extracted with the two-step sampling procedure described in Section 3.2.1, unless otherwise stated. When two-step sampling was enabled, we ranked sentences by sampling 300 latent codes and, for every code, sampling n = 30 neighboring sentences. QT and all extractive baselines use a greedy algorithm to eliminate redundancy, similar to previous research on multi-document summarization (Cao et al., 2015; Yasunaga et al., 2017; Angelidis and Lapata, 2018).

5 The classifier's precision on the aspect-labeled corpus' development set is 85.4%.
6 https://github.com/google/sentencepiece.


5.2 Metrics

We evaluate the lexical overlap between system and human summaries using ROUGE F-scores.7 We report uni- and bi-gram variants (R1/R2), as well as longest common subsequence (RL).

A successful opinion summarizer must also produce summaries which match human-written ones in terms of the aspects mentioned and the sentiment conveyed. For this reason, we also evaluate our systems on two metrics that utilize an off-the-shelf aspect-based sentiment analysis (ABSA) system (Miao et al., 2020), pre-trained in-domain. The ABSA system extracts opinion phrases from summaries, and predicts their aspect category and sentiment. The metrics use these predictions as follows.

Aspect Coverage  We use the phrase-level aspect predictions to mark the presence or absence of an aspect in a summary. We discard very infrequent aspect categories. Similar to Pan et al. (2020), we measure precision, recall, and F1 of system against human summaries.

Aspect-level Sentiment  We propose a new metric to evaluate the sentiment consistency between system and human summaries. Specifically, we compute the sentiment polarity score towards an individual aspect a as the mean polarity of the opinion phrases that discuss this aspect in a summary (pol_a ∈ [-1, 1]). We repeat the process for every aspect, thus obtaining a vector of aspect polarities for the summary (we set the polarity of absent aspects to zero). The aspect-level sentiment consistency is computed as the mean squared error between the system and human polarity vectors.
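A small sketch of this metric; the ABSA system's phrase-level output format is an assumption:

```python
import numpy as np

def polarity_vector(opinion_phrases, aspects):
    """Mean ABSA polarity per aspect (absent aspects get 0).
    opinion_phrases: list of (aspect, polarity) with polarity in [-1, 1]."""
    vec = np.zeros(len(aspects))
    for j, a in enumerate(aspects):
        pols = [p for asp, p in opinion_phrases if asp == a]
        if pols:
            vec[j] = float(np.mean(pols))
    return vec

def sentiment_consistency(system_phrases, human_phrases, aspects):
    """Aspect-level sentiment consistency: MSE between the system and human
    aspect-polarity vectors (lower is better)."""
    diff = polarity_vector(system_phrases, aspects) - polarity_vector(human_phrases, aspects)
    return float((diff ** 2).mean())
```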

5.3 Ergebnisse: General Summarization

We first discuss our results on general summarization and then move on to present experiments on aspect-specific summarization. We compared our model against the following baselines:

Best Review systems select the single review that best approximates the consensus opinions in the input. We use a Centroid method that encodes the entity's reviews with BERT (average token vector; Devlin et al., 2019) or SentiNeuron (Radford et al., 2017), and picks the one closest to the mean review vector (a sketch appears after this list). We also tested an Oracle method, which selects the review closest to the reference summaries.

7 https://github.com/bheinzerling/pyrouge.

Extractive systems, where we tested LexRank (Erkan and Radev, 2004), an unsupervised graph-based summarizer. To compute its adjacency matrices, we used BERT and SentiNeuron vectors, in addition to the sparse tf-idf features of the original. We also present a random extractive baseline.

Abstractive systems include Opinosis (Ganesan et al., 2010), a graph-based method, and MeanSum (Chu and Liu, 2019) and Copycat (Bražinskas et al., 2020), two neural abstractive methods that generate review-like summaries from aggregate review representations learned using autoencoders.
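A minimal sketch of the Centroid selection mentioned above; the review encodings themselves (e.g., averaged BERT token vectors) are assumed to be computed elsewhere:

```python
import numpy as np

def centroid_best_review(review_vectors):
    """Return the index of the review whose encoding is closest (in
    Euclidean distance) to the mean of all review encodings."""
    vecs = np.asarray(review_vectors)          # (n_reviews, dim)
    centroid = vecs.mean(axis=0)
    return int(np.linalg.norm(vecs - centroid, axis=1).argmin())
```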

Table 2 reports ROUGE scores on SPACE (test set) for the general summarization task. QT's popularity-based extraction algorithm shows strong summarization capabilities, outperforming all comparison systems (differences in ROUGE are statistically significant against all models but Copycat). This is a welcome result, considering that QT is an extractive method and does not benefit from the compression and rewording capabilities of abstractive summarizers. Moreover, as we discuss in Section 5.5, QT is less data-hungry than other neural models: It achieves the same level of performance even when trained on 5% of the dataset. We also show in Table 2 (fourth block) that the proposed two-step sampling method yields better extractive summaries compared to simply selecting the sentences nearest to the most popular clusters.

Aspect coverage and sentiment consistency results are also encouraging for QT, which consistently scores highly on both metrics, while baselines show mixed results. We also compared (using ROUGE-L) general system summaries against reference aspect summaries. The results in Table 2 (column RLASP) confirm that aspect summarization requires tailor-made methods. Unsurprisingly, all systems are inferior to the human upper bound (i.e., inter-annotator ROUGE and aspect-based metrics), suggesting ample room for improvement.

QT's ability for general opinion summarization is further demonstrated in Table 3, which reports results on the YELP and AMAZON datasets. We present the strongest baselines, that is, CentroidBERT, LexRankBERT, OracleBERT, and the abstractive Opinosis, MeanSum, and Copycat.
SPACE [GENERAL]       R1     R2     RL     RLASP  ACP   ACR   ACF1  SCMSE
Best Review
  CentroidSENTI       27.36  5.81   15.15  8.77   .788  .705  .744  .580
  CentroidBERT        31.33  5.78   16.54  9.35   .805  .701  .749  .524
  OracleSENTI         32.14  7.52   17.43  9.29   .817  .699  .753  .455
  OracleBERT          33.21  8.33   18.02  9.67   .823  .777  .799  .401
Extractive
  Random              26.24  3.58   14.72  11.53  .799  .374  .509  .592
  LexRank             29.85  5.87   17.56  11.84  .840  .382  .525  .518
  LexRankSENTI        30.56  4.75   17.19  12.11  .820  .441  .574  .572
  LexRankBERT         31.41  5.05   18.12  13.29  .823  .380  .520  .500
Abstractive
  Opinosis            28.76  4.57   15.96  11.68  .791  .446  .570  .561
  MeanSum             34.95  7.49   19.92  14.52  .845  .477  .610  .479
  Copycat             36.66  8.87   20.90  14.15  .840  .566  .676  .446
QT                    38.66  10.22  21.90  14.26  .843  .689  .758  .430
  w/o 2-step samp.    37.82  9.13   20.10  13.88  .833  .680  .748  .439
Human Up. Bound       49.80  18.80  29.19  34.58  .829  .862  .845  .264

Table 2: Summarization results on SPACE. The best system (QT) significantly outperforms all comparison systems, except where underlined in the original (p < 0.05; paired bootstrap resampling; Koehn, 2004). We exclude Oracle systems from comparisons as they access gold summaries at test time. RLASP is the ROUGE-L of general summarizers against gold aspect summaries. AC and SC are shorthands for Aspect Coverage and Sentiment Consistency; subscripts P and R refer to precision and recall, and F1 is their harmonic mean. MSE is mean squared error (lower is better).
YELP              R1     R2    RL     ACF1  SCMSE
Random            23.04  2.44  13.44  .551  .612
CentroidBERT      24.78  2.64  14.67  .691  .523
OracleBERT        27.38  3.75  15.92  .703  .507
LexRankBERT       26.46  3.00  14.36  .601  .541
Opinosis          24.88  2.78  14.09  .672  .552
MeanSum           28.46  3.66  15.57  .713  .510
Copycat           29.47  5.26  18.09  .728  .495
QT                28.40  3.97  15.27  .722  .490

AMAZON            R1     R2    RL     ACF1  SCMSE
Random            27.66  4.72  16.95  .580  .602
CentroidBERT      29.94  5.19  17.70  .702  .599
OracleBERT        31.69  6.47  19.25  .725  .512
LexRankBERT       31.47  5.07  16.81  .663  .541
Opinosis          28.42  4.57  15.50  .614  .580
MeanSum           29.20  4.70  18.15  .710  .525
CopyCat           31.97  5.81  20.16  .731  .510
QT                34.04  7.03  18.08  .739  .508

Table 3: Summarization results on YELP and AMAZON. The best system is significantly better than all comparison systems, except where underlined in the original (p < 0.05; paired bootstrap resampling; Koehn, 2004).

On YELP, QT performs on par with MeanSum, but worse than Copycat. However, it is important to note that, in contrast to SPACE, YELP's reference summaries were purposely written using first-person narrative, giving an advantage to the review-like summaries of abstractive methods. On AMAZON, QT outperforms all methods on ROUGE-1/2, but comes second to Copycat on ROUGE-L. This follows a trend seen across all datasets, where abstractive systems appear relatively stronger in terms of ROUGE-L compared to ROUGE-1/2. We partly attribute this to their ability to fuse opinions into fluent sentences, thus matching longer reference sequences.

Besides automatic evaluation, we conducted a user study to verify the utility of the generated summaries. We produced general summaries from five systems (QT, Copycat, MeanSum, LexRankBERT, and CentroidBERT) for all entities in SPACE's test set. For every entity and pair of systems, we showed three human judges a gold-standard summary for reference, and the two system summaries. We asked them to select the best summary according to four criteria: informativeness (useful opinions, consistent with the reference), coherence (easy to read, avoids contradictions), conciseness (useful in a few words), and non-redundancy (no repetitions). The systems' scores were computed using Best-Worst Scaling (Louviere et al., 2015), with values ranging from -100 (unanimously worst) to +100 (unanimously best).

            Inform.  Coherent  Concise  Redund.
Centroid    -52.7    -57.3     -60.7    -12.7
LexRank     -23.3    -38.0     -44.7    -1.3
MeanSum     -10.7    +26.7     +28.7    +3.3
Copycat     +36.0    +34.7     +38.0    -3.3
QT          +50.7*   +34.0†    +38.7†   +18.0*

Table 4: Best-Worst Scaling human study on SPACE. (*): significant difference to all models; (†): significant difference to all models except Copycat (one-way ANOVA with post-hoc Tukey HSD test, p < 0.05).

As shown in Table 4, participants rate QT favorably over all baselines in terms of informativeness, conciseness, and lack of redundancy, with a slight preference for Copycat summaries with respect to coherence (statistical significance information in the caption). QT captures essential opinions effectively, whereas there is room for improvement in terms of summary cohesion.
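For reference, a common formulation of Best-Worst Scaling scores consistent with the stated range; the exact counting protocol is an assumption, as the paper does not detail it:

```python
def best_worst_score(n_best, n_worst, n_judgments):
    """How often a system was picked best minus how often it was picked
    worst, as a percentage of its judgments; ranges from -100 to +100."""
    return 100.0 * (n_best - n_worst) / n_judgments
```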
SPACE [ASPECT]   ROUGE-L per aspect                                       Average
               Building  Cleanliness  Food   Location  Rooms  Service  R1     R2     RL     SCMSE
MeanSumASP     13.25     19.24        13.01  18.41     17.81  20.40    23.24  3.72   17.02  .235
CopycatASP     17.10     15.90        14.53  20.31     17.30  20.05    24.95  4.82   17.53  .274
LexRankASP     14.73     25.10        17.56  23.28     18.24  26.01    27.72  7.54   20.82  .206
QTASP          16.45     25.12        17.79  23.63     21.61  26.07    28.95  8.34   21.77  .204
Human          40.33     38.76        33.63  35.23     29.25  30.31    44.86  18.45  34.58  .153

Table 5: Aspect summarization results on SPACE. All differences to the best model (QTASP) are statistically significant, except where underlined in the original (p < 0.05; paired bootstrap resampling; Koehn, 2004).

               Exclusively  Partially  No
QTGEN          1.1          72.0       26.9
CopycatASP     6.7          45.3       48.0
MeanSumASP     21.8         37.3       40.9
LexRankASP     48.2         28.0       23.8
QTASP          58.7         32.7       8.7

Table 6: User study on aspect-specific summaries: does the summary discuss the specified aspect? In the "Exclusively" column, QT's difference over all models is statistically significant (p < 0.05; χ2 test).

SPACE [GENERAL]   Proportion of SPACE's train data used
                  5%     10%    50%    100%
Copycat           26.1   26.2   31.8   36.7
QT                36.9   37.1   37.7   38.7

Table 7: ROUGE-1 on SPACE for varying train set sizes.

Figure 5: t-SNE projection of the quantized space of an eight-head QT trained on SPACE, showing all 1024 learned latent codes (best viewed in color). Darker codes correspond to lower aspect entropy, our proposed measure of high aspect-specificity. Zooming in on the aspect sub-space uncovers good aspect separation.

Figure 6: Mean aspect entropies (bars) for each of QT's head sub-spaces and corresponding aspect ROUGE-1 scores for the summaries produced by each head (line).

5.4 Results: Aspect-specific Summarization

There is no existing unsupervised system for aspect-specific opinion summarization. Instead, we use the power of BERT (Devlin et al., 2019) to enable aspect summarization for our baselines. Specifically, we obtain BERT sentence vectors (average of token vectors) for the input sentences X_e, which we cluster via k-means. We then replicate the cluster-to-aspect mapping used by QT, as described in Equations (10)–(13): Each cluster is mapped to exactly one aspect, according to the probability of finding the pre-defined aspect-denoting keywords in the sentences assigned to it. As a result, we obtain non-overlapping and aspect-specific sets of input sentences {X_e^(a1), X_e^(a2), . . .}. For aspect a_i, we create aspect-filtered input reviews by concatenating the sentences in X_e^(ai) based on the reviews they originated from. The filtered reviews of each aspect are given as input to the general summarizers (LexRank, MeanSum, and Copycat), thus producing aspect summaries. QT and all baselines use the same aspect keywords, which we sourced from a held-out set of reviews not included in SPACE.
The hotel and room were very clean. The room had modern charm and was nicely remodeled. The beds are extremely comfortable. The rooms are quite with wonderful beach views. The food at Hash, the restaurant in lobby, was fabulous. The location is great, very close to the beach. It’s a longish walk to Santa Monica. The price is very affordable. Great hotel. We liked our room with an ocean view. The staff were friendly and helpful. There was no bal- cony. The location is per- fect. Our room was very quiet. I would definitely stay here again. You’re one block from the beach. So it must be good! Filthy hall- ways. Unvacuumed room. Pricy, but well worth it. It was a great stay! The food at the hotel is great for the price. I can’t believe the noise from the street is very loud and is not the traffic so great, but that is not a problem. The restaurant was great and the food is excellent. in a great is This hotel location, just off the beach. The staff was very friendly and helpful. We had a room with a view of the beach and ocean. The only problem was that our room was on the 4th floor with a view of the ocean. If you are looking for a nice place to sleep then this is the place for you. Table 8: Four general opinion summaries for the same hotel: One human-written and three from competing models. Building: Bright colors, skateboards, butterfly chairs and a grand ocean/boardwalk view (always entertaining). There is a small balcony, but there’s only a small glass divider between your neighbor’s balcony. Food: We had a great breakfast at Hash too! The restaurant was amazing. Lots of good restaurants within walking distance and some even deliver. The roof bar was the icing on the cake. Location: The location is perfect. The hotel is very central. The hotel itself is in a great location. We hardly venture far as everything we need is within walking distance, but for the sightseers the buses are on the doorstep. Cleanliness: Our room was very clean and comfortable. The room was clean and retrofitted with all the right amenities. Our room was very large, clean, and artfully decorated. Rooms: The room was spacious and had really cool furnishings, and the beds were comfortable. The room’s were good, and we had a free upgrade for one of them (for a Facebook ’like!) A+ for the bed and pillows. Service: The staff is great. The staff were friendly and helpful. The hotel staff were friendly and provided us with great service. Each member of the staff was friendly and attentive. The staff excel and nothing is ever too much trouble. Table 9: Aspect summaries extracted by QT. in all aspects, except building, with significant improvements against Copycat and Meansum in terms of ROUGE and sentiment consistency. The abstractive methods struggle to generate summaries restricted to the aspect in question. To verify this, we ran a second judgment elicitation study. We used summaries from com- peting aspect summarizers (QTASP, CopycatASP, MeanSumASP, and LexRankASP) for all six aspects, as well as QT’s general summaries. A summary was shown to three participants, who were asked whether it discussed the specified aspect exclu- sively, partially, or not at all. Table 6 shows that 58.7% of QT aspect-specific summaries discuss the specified aspect exclusively, while only 8.7% of the summaries fail to mention the aspect. LexRankASP follows with 23.8% of its summaries failing to mention the aspect, while the abstractive models performed significantly worse. 
5.5 Further Analysis Training Efficiency Table 7 shows ROUGE-1 scores for QT and Copycat on SPACE, when trained on different portions of the training set (randomly downsampled and averaged over 5 runs). QT ex- hibits impressive data efficiency; when trained on 5% of data, it performs comparably to a Copycat summarizer that has been trained on the full corpus. Visualizing Sub-spaces We present a visual demonstration of QT’s quantized sub-spaces in Figure 5. We used t-SNE (van der Maaten and Hinton, 2008) to project the latent code vectors onto two dimensions. The latent codes produced by QT’s eight heads are clearly grouped in eight separate sub-spaces. The aspect sub-space (shown in square) was detected automatically, as it dis- played the lowest mean aspect entropy (darker color). Zooming into its latent codes uncov- ers reasonable aspect separation, an impressive the model received no result considering that aspect-specific supervision. Mean Aspect Entropy Figure 6 further illus- trates the effectiveness of aspect entropy for detecting the head sub-space that best separates aspect-specific sentences. Each gray bar shows 288 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 6 6 1 9 2 4 1 8 1 / / t l a c _ a _ 0 0 3 6 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 the mean aspect entropy for the codes produced by one of QT’s eight heads. One of the heads (leftmost) exhibits much lower entropy, indicating a strong confidence for aspect membership within its latent codes. We confirm this enables better aspect summarization by generating aspect sum- maries using each head, and plotting the obtained ROUGE-1 scores. System Output Finally, we show gold-standard and system-generated general in Table 8, as well as QT aspect summaries in Table 9. summaries 6 Conclusions We presented a novel opinion summarization system based on the Quantized Transformer that requires no reference summaries for training, and is able to extract general and aspect summaries from large groups of input reviews. QT is trained through sentence reconstruction and learns a rich encoding space, paired with a clustering component based on vector quantized variational autoencoders. At summarization time, we exploit the characteristics of the quantized space, to identify those clusters that correspond to the input’s most popular opinions, and extract the sentences that best represent them. Moreover, we used the multi-head representations of the model, and no further training, to detect the encoding sub-space that best separates aspects, enabling aspect-specific summarization. We also collected SPACE, a new opinion summarization corpus which we hope will inform and inspire further research. Experimental results on SPACE and popular benchmarks reveal that our system is able to produce informative summaries which cover all or individual aspects of an entity. In the future, we would like to utilize the QT framework in order to generate abstractive summaries. We could also exploit QT’s multi-head semantics more directly, and further improve it through weak supervision or multi-task objectives. Finally, although we focused on opinion summarization, it would be interesting to see if the proposed model can be applied to other multi-document summarization tasks. Acknowledgments We thank the anonymous reviewers for their feedback and the action editor, Jing Jiang, for her comments. 
We gratefully acknowledge the support of the European Research Council (Lapata; award number 681760, ‘‘Translating Multiple Modalities into Text’’). We also thank Wang-Chiew Tan for her valuable input. References Reinald Kim Amplayo and Mirella Lapata. 2019. Informative and controllable opinion summa- rization. arXiv preprint arXiv:1909.02322v1. Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised opinion summarization with noising and denoising. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1934–1945. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1 /2020.acl-main.175 the Stefanos Angelidis and Mirella Lapata. 2018. extraction Summarizing opinions: Aspect sentiment prediction and they are meets In Proceedings both weakly supervised. of on Empirical Methods in Natural Language Processing, pages 3675–3686, Brussels, Belgium. Asso- ciation for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18 -1403 2018 Conference Yoshua Bengio, Nicholas L´eonard, and Aaron Courville. 2013. Estimating or propagating gradients for conditional computation. arXiv preprint arXiv: 1308.3432v1. through stochastic neurons Arthur Braˇzinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised opinion summarization In Proceed- as copycat-review generation. ings of the Association for Computational Linguistics, pages 5151–5169. Association for Computa- tional Linguistics. DOI: https://doi.org /10.18653/v1/2020.acl-main.461 the 58th Annual Meeting of Ziqiang Cao, Furu Wei, Li Dong, Sujian Li, and Ming Zhou. 2015. Ranking with recursive neural networks and its application to multi- document summarization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 2153–2159. AAAI Press. 289 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 6 6 1 9 2 4 1 8 1 / / t l a c _ a _ 0 0 3 6 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Giuseppe Carenini, Raymond Ng, and Adam Pauls. 2006. Multi-document summarization In 11th Conference of of evaluative text. the European Chapter of the Association for Computational Linguistics, Trento, Italy. Association for Computational Linguistics. Jianpeng Cheng and Mirella Lapata. 2016. Neu- ral summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 484–494, Berlin, Germany. Associ- ation for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P16 -1046, PMID: PMC4738087 Eric Chu and Peter Liu. 2019. MeanSum: A neu- ral model for unsupervised multi-document abs- tractive summarization. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Ma- chine Learning Research, pages 1223–1232, Long Beach, California, USA. PMLR. Maximin Coavoux, Hady Elsahar, and Matthias Gall´e. 2019. Unsupervised aspect-based multi- document abstractive summarization. In Pro- ceedings of the 2nd Workshop on New Frontiers in Summarization, pages 42–47, Hong Kong, China. Association for Compu- tational Linguistics. DOI: https://doi .org/10.18653/v1/D19-5405 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. 
Giuseppe Di Fabbrizio, Amanda Stent, and Robert Gaizauskas. 2014. A hybrid approach to multi-document summarization of opinions in reviews. In Proceedings of the 8th International Natural Language Generation Conference (INLG), pages 54–63, Philadelphia, Pennsylvania. Association for Computational Linguistics. DOI: https://doi.org/10.3115/v1/W14-4408

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22(1):457–479. DOI: https://doi.org/10.1613/jair.1523

Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 340–348, Beijing, China. COLING 2010 Organizing Committee.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Virtual Conference, April 26–May 1, 2020. OpenReview.net.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168–177, New York, NY, USA. Association for Computing Machinery. DOI: https://doi.org/10.1145/1014052.1014073

Nguyen Huy Tien, Le Tung Thanh, and Nguyen Minh Le. 2019. Opinions summarization: Aspect similarity recognition relaxes the constraint of predefined aspects. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 487–496, Varna, Bulgaria. INCOMA Ltd. DOI: https://doi.org/10.26615/978-954-452-056-4_058

Masaru Isonuma, Junichiro Mori, and Ichiro Sakata. 2019. Unsupervised neural single-document summarization of reviews via learning latent discourse structure and its ranking. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2142–2152, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1206

Lukasz Kaiser and Samy Bengio. 2018. Discrete autoencoders for sequence models. arXiv preprint arXiv:1801.09797v1.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114v10.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Conference Track Proceedings. OpenReview.net.

Jordan J. Louviere, Terry N. Flynn, and Anthony Alfred John Marley. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9781107337855
Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations.

Diego Marcheggiani, Oscar Täckström, Andrea Esuli, and Fabrizio Sebastiani. 2014. Hierarchical multi-label conditional random fields for aspect-oriented opinion mining. In Advances in Information Retrieval, pages 273–285, Cham. Springer International Publishing. DOI: https://doi.org/10.1007/978-3-319-06028-6_23

Zhengjie Miao, Yuliang Li, Xiaolan Wang, and Wang-Chiew Tan. 2020. Snippext: Semi-supervised opinion mining with augmented data. In Proceedings of The Web Conference 2020, WWW '20, pages 617–628, New York, NY, USA. Association for Computing Machinery. DOI: https://doi.org/10.1145/3366423.3380144

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759, New Orleans, Louisiana. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N18-1158

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural discrete representation learning. In Advances in Neural Information Processing Systems 30, pages 6306–6315. Curran Associates, Inc.

Haojie Pan, Rongqin Yang, Xin Zhou, Rui Wang, Deng Cai, and Xiaozhong Liu. 2020. Large scale abstractive multi-review summarization (LSARS) via aspect alignment. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, pages 2337–2346, New York, NY, USA. Association for Computing Machinery. DOI: https://doi.org/10.1145/3397271.3401439

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135. DOI: https://doi.org/10.1561/1500000011

Laura Perez-Beltrachini, Yang Liu, and Mirella Lapata. 2019. Generating summaries with topic templates and structured convolutional decoders. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5107–5116, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1504

Maxime Peyrard. 2019. A simple theoretical model of importance for summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1059–1073, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1101

Dragomir R. Radev, Simone Teufel, Horacio Saggion, Wai Lam, John Blitzer, Hong Qi, Arda Çelebi, Danyu Liu, and Elliott Drabek. 2003. Evaluation challenges in large-scale document summarization. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 375–382, Sapporo, Japan. Association for Computational Linguistics.
Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444v2.

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1437

Aurko Roy and David Grangier. 2019. Unsupervised paraphrasing without translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6033–6039, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1605

Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. 2018. Theory and experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063v2.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Yoshihiko Suhara, Xiaolan Wang, Stefanos Angelidis, and Wang-Chiew Tan. 2020. OpinionDigest: A simple framework for opinion summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5789–5798, Online. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.acl-main.513

Yufei Tian, Jianfei Yu, and Jing Jiang. 2019. Aspect and opinion aware abstractive review summarization with reinforced hard typed decoder. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, pages 2061–2064, New York, NY, USA. Association for Computing Machinery. DOI: https://doi.org/10.1145/3357384.3358142

L. J. P. van der Maaten and G. E. Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Hongning Wang, Yue Lu, and Chengxiang Zhai. 2010. Latent aspect rating analysis on review text data: A rating regression approach. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, pages 783–792, New York, NY, USA. Association for Computing Machinery. DOI: https://doi.org/10.1145/1835804.1835903

Xiaolan Wang, Yoshihiko Suhara, Natalie Nuno, Yuliang Li, Jinfeng Li, Nofar Carmeli, Stefanos Angelidis, Eser Kandogann, and Wang-Chiew Tan. 2020. ExtremeReader: An interactive explorer for customizable and explainable review summarization. In Companion Proceedings of the Web Conference 2020, WWW '20, pages 176–180, New York, NY, USA. Association for Computing Machinery. DOI: https://doi.org/10.1145/3366424.3383535
Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 452–462, Vancouver, Canada. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/K17-1045