Template-based Abstractive Microblog Opinion Summarization
Iman Munire Bilal1,4, Bo Wang2,4, Adam Tsakalidis3,4, Dong Nguyen5,
Rob Procter1,4, Maria Liakata1,3,4
1Department of Computer Science, University of Warwick, UK
2Center for Precision Psychiatry, Massachusetts General Hospital, USA
3School of Electronic Engineering and Computer Science, Queen Mary University of London, UK
4The Alan Turing Institute, London, UK
5Department of Information and Computing Sciences, Utrecht University, The Netherlands
{iman.bilal|rob.procter}@warwick.ac.uk bwang29@mgh.harvard.edu
{atsakalidis|mliakata}@qmul.ac.uk d.p.nguyen@uu.nl
Abstract
We introduce the task of microblog opinion
summarization (MOS) and share a dataset of
3100 gold-standard opinion summaries to fa-
cilitate research in this domain. The dataset
contains summaries of tweets spanning a
2-year period and covers more topics than any
other public Twitter summarization dataset.
Summaries are abstractive in nature and have
been created by journalists skilled in sum-
marizing news articles following a template
separating factual information (main story)
from author opinions. Our method differs from
previous work on generating gold-standard
summaries from social media, which usually
involves selecting representative posts and
thus favors extractive summarization models.
To showcase the dataset’s utility and chal-
lenges, we benchmark a range of abstractive
and extractive state-of-the-art summarization
models and achieve good performance, with
the former outperforming the latter. We also
show that fine-tuning is necessary to improve
performance and investigate the benefits of
using different sample sizes.
1
Introduction
Social media has gained prominence as a means
for the public to exchange opinions on a broad
range of topics. Furthermore, its social and tem-
poral properties make it a rich resource for policy
makers and organizations to track public opin-
ion on a diverse range of issues (Procter et al.,
2013; Chou et al., 2018; Kalimeri et al., 2019).
However, understanding opinions about different
issues and entities discussed in large volumes of
posts in platforms such as Twitter is a difficult
task. Existing work on Twitter employs extractive
summarization (Inouye and Kalita, 2011; Zubiaga
et al., 2012; Wang et al., 2017a; Jang and Allan,
2018) to filter through information by ranking
and selecting tweets according to various crite-
ria. However, this approach unavoidably ends up
including incomplete or redundant information
(Wang and Ling, 2016).
To tackle this challenge we introduce Micro-
blog opinion summarization (MOS), which we
define as a multi-document summarization task
aimed at capturing diverse reactions and stances
(opinions) of social media users on a topic. While
here we apply our methods to Twitter data readily
available to us, we note that this summarization
strategy is also useful for other microblogging
platforms. An example of a tweet cluster and its
opinion summary is shown in Table 1. As shown,
our proposed summary structure for MOS sepa-
rates the factual information (story) from reactions
to the story (opinions); the latter is further divided
according to the prevalence of different opinions.
We believe that making combined use of stance
identification, sentiment analysis and abstrac-
tive summarization is a challenging but valuable
direction in aggregating opinions expressed in
microblogs.
The availability of high quality news article
datasets has meant that recent advances in text
summarization have focused mostly on this type
of data (Nallapati et al., 2016; Grusky et al., 2018;
Fabbri et al., 2019; Gholipour Ghalandari et al.,
2020). Contrary to news article summarization,
our task focuses on summarizing an event as well
as ensuing public opinions on social media. Re-
view opinion summarization (Ganesan et al., 2010;
Angelidis and Lapata, 2018) is related to MOS
and faces the same challenge of filtering through
large volumes of user-generated content. While
1229
Transactions of the Association for Computational Linguistics, vol. 10, pp. 1229–1248, 2022. https://doi.org/10.1162/tacl a 00516
Action Editor: Ivan Titov. Submission batch: 2/2022; Revision batch: 6/2022; Published 11/2022.
c(cid:2) 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Human Summary
2 Related Work
Main Story: The UK government faces intense backlash after
its decision to fund the war in Syria. Majority Opinion: The
majority of users criticise UK politicians for not directing their
efforts to more important domestic issues like the NHS, education
and homelessness instead of the war in Syria. Minority Opinion:
Some users accuse the government of its intention to kill innocents
by funding the war.
Tweet Cluster
It is shocking to me how the NHS is on its knees and the amount of
homeless people that need help in this country…but we have funds
for war!..SAD
The government cannot even afford to help the homeless people of
Britain yet they can afford to fund a war? It makes no proper sense
at all
They spend so much on sending missiles to murder innocent people
and they complain daily about homeless on the streets? Messed up.
Also, no money to resolve the issues of the homeless or education
or the NHS. Yet loads of money to drop bombs? #SyriaVote
Table 1: Abridged cluster of tweets and its corre-
sponding summary. Cluster content is color-coded
to represent information overlap with each sum-
mary component: blue for Main Story, red for
Majority Opinion, and green for Minority Opinion.
recent work (Chu and Liu, 2019; Braˇzinskas et al.,
2020) aims to produce review-like summaries
that capture the consensus, MOS summaries inev-
itably include a spectrum of stances and reactions.
In this paper we make the following contributions:
1. We introduce the task of microblog opinion
summarization (MOS) and provide detailed
guidelines.
2. We construct a corpus1 of tweet clusters and
corresponding multi-document summaries
produced by expert summarizers following
our detailed guidelines.
3. We evaluate the performance of existing
state-of-the-art models and baselines from
three summarization domains (news articles,
Twitter posts, product reviews) and four
model types (abstractive vs. extractive, sin-
gle document vs. multiple documents) on
our corpus, showing the superiority of neu-
ral abstractive models. We also investigate
the benefits of fine-tuning with various sam-
ple sizes.
1This is available at https://doi.org/10.6084
/m9.figshare.20391144.
Opinion Summarization has focused predomi-
nantly on customer reviews with datasets span-
ning reviews on Tripadvisor (Ganesan et al.,
2010), Rotten Tomatoes (Wang and Ling, 2016),
Amazon (He and McAuley, 2016; Angelidis and
Lapata, 2018) and Yelp (Yelp Dataset Challenge;
Yelp).
Early work by Ganesan et al. (2010) prioritized
redundancy control and concise summaries. More
recent approaches (Angelidis and Lapata, 2018;
Amplayo and Lapata, 2020; Angelidis et al., 2021;
Isonuma et al., 2021) employ aspect driven mod-
els to create relevant topical summaries. While
product reviews have a relatively fixed structure,
MOS operates on microblog clusters where posts
are more loosely related, which poses an additional
challenge. Moreover, while the former generally
only encodes the consensus opinion (Braˇzinskas
et al., 2020; Chu and Liu, 2019), our approach
includes both majority and minority opinions.
Multi-document summarization has gained
traction in non-opinion settings and for news
events in particular. DUC (Dang, 2005) and TAC
conferences pioneered this task by introducing
datasets of 139 clusters of articles paired with mul-
tiple human-authored summaries. Recent work has
seen the emergence of larger scale datasets such as
WikiSum (Liu et al., 2018), Multi-News (Fabbri
et al., 2019), and WCEP (Gholipour Ghalandari
et al., 2020) to combat data sparsity. Extractive
(Wang et al., 2020b,c; Liang et al., 2021) and ab-
stractive (Jin et al., 2020) methods have followed
from these multi-document news datasets.
Twitter Summarization is recognised by Cao
et al. (2016) to be a promising direction for track-
ing reaction to major events. As tweets are inher-
ently succinct and often opinionated (Mohammad
et al., 2016), this task is at the intersection of
multi-document and opinion summarization. The
construction of datasets (Nguyen et al., 2018;
Wang and Zhang, 2017) usually requires a clus-
tering step to group tweets together under specific
temporal and topical constraints, which we include
within our own pipeline. Work by Jang and Allan
(2018) and Corney et al. (2014) makes use of
the subjective nature of tweets by identifying two
stances for each topic to be summarized; we gen-
eralize this idea and do not impose a restriction on
the number of possible opinions on a topic. The
1230
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
lack of an abstractive gold standard means that
the majority of existing Twitter models are ex-
tractive (Alsaedi et al., 2021; Inouye and Kalita,
2011; Jang and Allan, 2018; Corney et al., 2014).
Here we provide such an abstractive gold stan-
dard and show the potential of neural abstractive
models for microblog opinion summarization.
3 Creating the MOS Dataset
3.1 Data Sources
Our MOS corpus consists of summaries of mi-
croblog posts originating from two data sources,
both involving topics that have generated strong
public opinion: COVID-19 (Chen et al., 2020)
and UK Elections (Bilal et al., 2021).
• COVID-19: Chen et al. (2020) collected
tweets by tracking COVID-19 related key-
words (e.g., coronavirus, pandemic, stayat-
home) and accounts (e.g., @CDCemergency,
@HHSGov, @DrTedros). We use data col-
lected between January 2020 and January
2021, which at the time was the most com-
plete version of this dataset.
• UK Elections: The Election dataset con-
sists of all geo-located UK tweets posted be-
tween May 2014 and May 2016. The tweets
were filtered using a list of 438 election-
related keywords and 71 political party
aliases curated by a team of journalists.
We follow the methodology in Bilal et al. (2021)
to obtain opinionated, coherent clusters of be-
tween 20 and 50 tweets: The clustering step em-
ploys the GSDMM-LDA algorithm (Wang et al.,
2017b), followed by thematic coherence evalua-
tion (Bilal et al., 2021). The latter is done by ag-
gregating exhaustive metrics BLEURT (Sellam
et al., 2020), BERTScore (Zhang et al., 2020),
and TF-IDF to construct a random forest classi-
fier to identify coherent clusters. Our final corpus
is created by randomly sampling 3100 clusters,2
1550 each from the COVID-19 and Election
datasets.
3.2 Summary Creation
The summary creation process was carried out in 3
stages on the Figure Eight platform by 3 journalists
2Limited resources available for annotation determined
the size of the MOS corpus.
experienced in sub-editing. Following Iskender
et al. (2021), a short pilot study was followed
by a meeting with the summarizers to ensure the
task and guidelines were well understood. Prior
to this, the design of the summarization interface
was iterated to ensure functionality and usability
(See Appendix A for interface snapshots).
In the first stage, the summarizers were asked
to read a cluster of tweets and state whether the
opinions within it could be easily summarized by
assigning one of three cluster types:
1. Coherent Opinionated: there are clear opin-
ions about a common main story expressed
in the cluster that can be easily summarized.
2. Coherent Non-opinionated: there are very
few or no clear opinions in the cluster, but
a main story is clearly evident and can be
summarized.
3. Incoherent: no main story can be detected.
This happens when the cluster contains di-
verse stories to which no majority of tweets
refers, hence it cannot be summarized.
Following Bilal et al. (2021) on thematic co-
herence, we assume a cluster is coherent if and
only if its contents can be summarized. Thus,
both Coherent Opinionated and Coherent Non-
opinionated can be summarized, but are dis-
tinct with respect to the level of subjectivity in
the tweets, while Incoherent clusters cannot be
summarized.
In the second stage, information nuggets are
defined in a cluster as important pieces of infor-
mation to aid in its summarization. The summa-
rizers were asked to highlight information nuggets
when available and categorise their aspect
in
terms of: WHAT, WHO, WHERE, REACTION,
and OTHER. Thus, each information nugget is
a pair consisting of the text and its aspect cate-
gory (see Appendix A for an example). Inspired
by the pyramid evaluation framework (Nenkova
and Passonneau, 2004) and extractive-abstractive
two-stage models in the summarization literature
(Lebanoff et al., 2018; Rudra et al., 2019; Liu
et al., 2018), information nuggets have a dual
purpose: (1) helping summarizers create the final
summary and (2) constituting an extractive ref-
erence for summary informativeness evaluation
(See 5.2.1).
In the third and final stage of the process,
the summarizers were asked to write a short
1231
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Total COVID-19 Election
R-1f1 R-2f1 R-Lf1 BLEURT
Size (#clusters)
3100
Coherent Opinionated
42%
Coherent Non-opinionated 30%
Incoherent
28%
1550
41%
24%
35%
1550
43%
37%
20%
Summary
Main Story
Majority Opinion
Minority Opinion(s)
37.46
35.15
27.53
22.90
17.91
12.98
6.15
5.10
30.16
34.59
25.95
24.39
−.215
−.324
−.497
−.703
Table 2: Annotation statistics of our MOS corpus.
template-based summary for coherent clusters.
Our chosen summary structure diverges from cur-
rent summarization approaches that reconstruct
the ‘‘most popular opinion’’ (Braˇzinskas et al.,
2020; Angelidis et al., 2021). Instead, we aim to
showcase a spectrum of diverse opinions regard-
ing the same event. Thus, the summary template
comprises three components: Main Story, Major-
ity Opinion, Minority Opinion(s). The component
Main Story serves to succinctly present the focus
of the cluster (often an event), while the other
components describe opinions about
the main
story. Here, we seek to distinguish the most pop-
ular opinion (Majority opinion) from ones ex-
pressed by a minority (Minority opinions). This
structure is consistent with the work of Gerani
et al. (2014) in template-based summarization for
product reviews, which quantifies the popularity
of user opinions in the final summary.
For ‘‘Coherent Opinionated clusters’’, summa-
rizers were asked to identify the majority opinion
within the cluster and, if it exists, to summarize
it, along with any minority opinions. If a majority
opinion could not be detected, then the minority
opinions were summarized. The final summary of
‘‘Coherent Opinionated clusters’’ is the concate-
nation of the three components: Main story +
Majority Opinion (if any) + Minority Opinion(s)
(if any). In 43% of opinionated clusters in our
MOS corpus a majority opinion and at
least
one minority opinion were identified. Addition-
ally, in 12% of opinionated clusters, 2 or more
main opinions were identified (See Appendix C,
Table 13), but without a majority opinion as there
is a clear divide between user reactions. For clus-
ters with few or no clear opinions (Coherent Non-
opinionated), the final summary is represented by
the Main Story component. Statistics regarding
the annotation results are shown in Table 2.
Agreement Analysis
tweet summarization corpus consists of
Our
3100 clusters. Of these, a random sample of 100
clusters was shared among all three summarizers
Table 3: Agreements between summarizers wrt
to final summary, main story, majority opinion
and minority opinions using ROUGE-1,2,L and
BLEURT.
to compute agreement scores. Each then worked
on 1000 clusters.
We obtain a Cohen’s Kappa score of κ = 0.46
for the first stage of the summary creation pro-
cess, which involves categorising clusters as either
Coherent Opinionated, Coherent Non-opinionated
or
Incoherent. Previous work (Feinstein and
Cicchetti, 1990) highlights a paradox regarding
Cohen’s kappa in that high levels of agreement
do not translate to high kappa scores in cases of
highly imbalanced datasets. In our data, at least 2
of the 3 summarizers agreed on the type of cluster
in 97% of instances.
In addition, we evaluate whether the concept
of ‘coherence/summarizability’ is uniformly as-
sessed, that is, we check whether summarizers
agree on what clusters can be summarized (Coher-
ent clusters) and which clusters are too incoherent.
We find that 83 out of 100 clusters were evalu-
ated as coherent by the majority, of which 65
were evaluated as uniformly coherent by all.
ROUGE-1,2,L and BLEURT (Sellam et al.,
2020) are used as proxy metrics to check the
agreement in terms of summary similarity pro-
duced between the summarizers. We compare the
consensus between the complete summaries as
well as individual components such as the main
story of the cluster, its majority opinion and any
minority opinions in Table 3. The highest agree-
ment is achieved for the Main Story, followed by
Majority Opinion and Minority Opinions. These
scores can be interpreted as upper thresholds for
the lexical and semantic overlap later in Section 6.
3.3 Comparison with Other Twitter Datasets
We next compare our corpus against the most
recent and popular Twitter datasets for summa-
rization in Table 4. To the best of our knowledge
there are currently no abstractive summarization
Twitter datasets for either event or opinion sum-
marization. While we primarily focussed on the
1232
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Dataset
Time span
#keywords
#clusters
Avg. Cluster Size
(#posts)
Summary
Avg. Summary Length
(#tokens)
COVID-19
Election
Inouye and Kalita (2011)
SMERP (Ghosh et al., 2017)
TSix (Nguyen et al., 2018)
1 year
2 years
5 days
3 days
26 days
41
112
50
N/A
30
1003
1236
200
8
925
31
30
25
359
36
Abstractive
Abstractive
Extractive
Extractive
Extractive
42
36
17
303
109
Table 4: Overview of other Twitter datasets.
collection of opinionated clusters, some of the
clusters we had automatically identified as opin-
ionated were not deemed to be so by our annota-
tors. Including the non-opinionated clusters helps
expand the depth and range of Twitter datasets
for summarization.
Compared to the summarization of product re-
views and news articles, which has gained recog-
nition in recent years because of the availability
of large-scale datasets and supervised neural archi-
tectures, Twitter summarization remains a mostly
uncharted domain with very few datasets curated.
Inouye and Kalita (2011)3 collected the tweets
for the top ten trending topics on Twitter for 5
days and manually clustered these. The SMERP
dataset (Ghosh et al., 2017) focuses on topics on
post-disaster relief operations for the 2016 earth-
quakes in central Italy. Finally, TSix (Nguyen
et al., 2018) is the dataset most similar to our
work as it covers, but on a smaller scale, several
popular topics that are deemed relevant to news
providers.
Other Twitter summarization datasets include:
(Zubiaga et al., 2012; Corney et al., 2014) on sum-
marization of football matches, (Olariu, 2014)
on real-time summarization for Twitter streams.
These datasets are either publicly unavailable or
unsuitable for our summarization task.4
Summary Type. These datasets exclusively con-
tain extractive summaries, where several tweets
are chosen as representative per cluster. This re-
sults in summaries which are often verbose, redun-
dant and information-deficient. As shown in other
domains (Grusky et al., 2018; Narayan et al.,
2018), this may lead to bias towards extractive
summarization techniques and hinder progress for
3It is unclear whether the full corpus is available: Our
statistics were calculated based on a sample of 100 posts for
each topic, but the original paper mentions that 1500 posts
for each topic were initially collected.
4Comparing to live stream summarization where millions
of posts are used as input, we focus on summarization of
clusters of maximum 50 posts.
abstractive models. Our corpus on COVID-19
and Election data aims to bridge this gap and
introduces an abstractive gold standard generated
by journalists experienced in sub-editing.
Size. The average number of posts in our clusters
is 30, which is similar to the TSix dataset and in
line with the empirical findings by Inouye and
Kalita (2011), who recommend 25 tweets/cluster.
Having clusters with a much larger number of
tweets makes it harder to apply our guidelines
for human summarization. To the best of our
knowledge, our combined corpus (COVID-19 and
Election) is currently the biggest human-generated
corpus for microblog summarization.
Time-span. Both COVID-19 and Election par-
titions were collected across year-long time spans.
This is in contrast to other datasets, which have
been constructed in brief time windows, ranging
from 3 days to a month. This emphasizes the lon-
gitudinal aspect of the dataset, which also allows
topic diversity as 153 keywords and accounts
were tracked through time.
4 Defining Model Baselines
As we introduce a novel summarization task
(MOS), the baselines featured in our experiments
are selected from domains tangential to microblog
opinion summarization, such as news articles,
Twitter posts, and product reviews (See Section 2).
In addition, the selected models represent diverse
summarization strategies: abstractive or extrac-
tive, supervised or unsupervised, multi-document
(MDS) or single-document summarization (SDS).
Note that most SDS models enforce a length limit
(1024 characters) over the input, which makes
it impossible to summarize the whole cluster of
tweets. We address this issue by only considering
the most relevant tweets ordered by topic rele-
vance. The latter is computed using the Kullback-
Leibler divergence with respect to the topical word
1233
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
distribution of the cluster in the GSDMM-LDA
clustering algorithm (Wang et al., 2017b).
The summaries were generated such that their
length matches the average length of the gold stan-
dard. Some model parameters (such as Lexrank)
only allow sentence-level truncation, in which
case the length matches the average number of
sentences in the gold standard. For models that
allow a word limit to the text to be generated
(BART, Pegasus, T5), a minimum and maximum
number of tokens was imposed such that the gen-
erated summary would be within [90%, 110%] of
the gold standard length.
4.1 Heuristic Baselines
Extractive Oracle: This baseline uses the gold
summaries to extract the highest scoring sentences
from a cluster of tweets. We follow Zhong et al.
(2020) and rank each sentence by its average
ROUGE-{1,2,L} recall scores. We then consider
the highest ranking 5 sentences to form combi-
nations of k5 sentences, which are re-evaluated
against the gold summaries. k is chosen to equal
the average number of sentences in the gold stan-
dard. The highest scoring summary with respect
to the average ROUGE-{1,2,L} recall scores is
assigned as the oracle.
Random: k sentences are extracted at random
from a tweet cluster. We report the mean result
over 5 iterations with different random seeds.
4.2 Extractive Baselines
LexRank (Erkan and Radev, 2004) constructs a
weighed connectivity graph based on cosine simi-
larities between sentence TF-IDF representations.
Hybrid TF-IDF (Inouye and Kalita, 2011) is an
unsupervised model designed for Twitter, where
a post is summarized as the weighted mean of its
TF-IDF word vectors.
BERTSumExt
(Liu and Lapata, 2019) is an
SDS model comprising a BERT (Devlin et al.,
2019)-based encoder stacked with Transformer
layers to capture document-level features for sen-
tence extraction. We use the model trained on
CNN/Daily Mail (Hermann et al., 2015).
HeterDocSumGraph (Wang et al., 2020b) in-
troduces the heterogenous graph neural network,
5For opinionated clusters, we set k=3 and for non-
opinionated k=1.
which is constructed and iteratively updated using
both sentence nodes and nodes representing other
semantic units, such as words. We use the MDS
model trained on Multi-News (Fabbri et al., 2019).
(Angelidis
Quantized Transformer
al.,
2021) combines Transformers (Vaswani et al.,
2017) and Vector-Quantized Variational Autoen-
coders for the summarization of popular opinions
in reviews. We trained QT on the MOS corpus.
et
4.3 Abstractive Baselines
Opinosis
(Ganesan et al., 2010) is an unsu-
pervised MDS model. Its graph-based algorithm
identifies valid paths in a word graph and re-
turns the highest scoring path with respect to
redundancy.
PG-MMR (Lebanoff et al., 2018) adapts the
single document setting for multi-documents
by introducing ‘mega-documents’ resulting from
concatenating clusters of texts. The model com-
bines an abstractive SDS pointer-generator net-
work with an MMR-based extractive component.
PEGASUS (Zhang et al., 2020)
introduces
gap-sentences as a pre-training objective for sum-
marization. It is then fine-tuned for 12 downstream
summarization domains. We chose the model
pre-trained on Reddit TIFU (Kim et al., 2019).
transfer
T5 (Raffel et al., 2020) adopts a unified
learning on language-
approach for
understanding tasks. For
the
summarization,
model
is pre-trained on the Colossal Clean
Crawled Corpus (Raffel et al., 2020) and then
fine-tuned on CNN/Daily Mail.
BART (Lewis et al., 2020) is pre-trained on sev-
eral evaluation tasks, including summarization.
With a bidirectional encoder and GPT2, BART is
considered a generalization of BERT. We use the
BART model pre-trained on CNN/Daily Mail.
SummPip (Zhao et al., 2020) is an MDS unsu-
pervised model that constructs a sentence graph
following Approximate Discourse Graph and deep
embedding methods. After spectral clustering of
the sentence graph, summary sentences are gen-
erated through a compression step of each cluster
of sentences.
Copycat
(Braˇzinskas et al., 2020) is a Varia-
tional Autoencoder model trained in an unsuper-
vised setting to capture the consensus opinion in
1234
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
product reviews for Yelp and Amazon. We train
it on the MOS corpus.
5.2.1 Evaluation of Gold Standard
and Models
5 Evaluation Methodology
Similar to other summarization work (Fabbri
et al., 2019; Grusky et al., 2018), we perform both
automatic and human evaluation of models. Au-
tomatic evaluation is conducted on a set of 200
clusters: Each partition of the test (COVID-19
Opinionated, COVID-19 Non-opinionated, Elec-
tion Opinionated, Election Non-opinionated) con-
tains 50 clusters uniformly sampled from the total
corpus. For the human evaluation, only the 100
opinionated clusters are evaluated.
5.1 Automatic Evaluation
Word overlap is evaluated according to the har-
monic mean F1 of ROUGE-1, 2, L6 (Lin, 2004)
as reported elsewhere (Narayan et al., 2018;
Gholipour Ghalandari et al., 2020; Zhang et al.,
2020). Work by Tay et al. (2019) acknowledges
the intractability of ROUGE in opinion text sum-
marization as sentiment-rich language uses a vast
vocabulary that does not rely on word match-
ing. This issue is mitigated by Kryscinski et al.
(2021) and Bhandari et al. (2020), who use se-
mantic similarity as an additional assessment of
candidate summaries. Similarly, we use text gen-
eration metrics BLEURT (Sellam et al., 2020)
and BERTScore7 (Zhang et al., 2020) to assess
semantic similarity.
5.2 Human Evaluation
Human evaluation is conducted to assess the qual-
ity of summaries with respect to three objectives:
1) linguistic quality, 2) informativeness, and 3)
ability to identify opinions. We conducted two
human evaluation experiments: the first (5.2.1) as-
sesses the gold standard and non-fine-tuned model
summaries on a rating scale, and the second (5.2.2)
addresses the advantages and disadvantages of
fine-tuned model summaries via Best-Worst Scal-
ing. Four and three experts were employed for the
two experiments, respectively.
6We use ROUGE-1.5.5 via the pyrouge package:
https://github.com/bheinzerling/pyrouge.
7BERTScore has a narrow score range, which makes
its interpretation more difficult than for BLEURT. Because
both metrics produce similar rankings, BERTScore can be
found in Appendix C.
The first experiment focused on assessing the
gold standard and best models from each summa-
rization type: Gold, LexRank (best extractive),
SummPip (best unsupervised abstractive), and
BART (best supervised).
Linguistic quality measures 4 syntactic dimen-
sions, which were inspired by previous work
on summary evaluation. Similar to DUC (Dang,
2005), each summary was evaluated with respect
to each criterion below on a 5-point scale.
• Fluency (Grusky et al., 2018): Sentences in
the summary ‘‘should have no formatting
problems, capitalization errors or obviously
ungrammatical sentences (e.g., fragments,
missing components) that make the text
difficult to read.’’
• Sentential Coherence (Grusky et al., 2018):
A sententially coherent summary should
be well-structured and well-organized. The
summary should not just be a heap of related
information, but should build from sentence
to sentence to a coherent body of informa-
tion about a topic.
• Non-redundancy (Dang, 2005): A non-
redundant summary should contain no du-
plication, that is, there should be no overlap
of information between its sentences.
• Referential Clarity (Dang, 2005): It should
be easy to identify who or what the pronouns
and noun phrases in the summary are refer-
ring to. If a person or other entity is men-
tioned, it should be clear what their role is in
the story.
Informativeness
is defined as the amount of
factual information displayed by a summary. To
measure this, we use a Question-Answer algo-
rithm (Patil, 2020) as a proxy. Pairs of questions
and corresponding answers are generated from
the information nuggets of each cluster. Because
we want to assess whether the summary contains
factual information, only information nuggets be-
longing to the ‘WHAT’, ‘WHO’, ‘WHERE’ are
selected as input. We chose not to include the
entire cluster as input for the QA algorithm, as
this might lead the informativeness evaluation to
1235
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
prioritize irrelevant details in the summary. Each
cluster in the test set is assigned a question-
answer pair and each system is then scored based
on the percentage of times its generated sum-
maries contain the answer to the question. Similar
to factual consistency (Wang et al., 2020a), infor-
mativeness penalizes incorrect answers (halluci-
nations), as well as the lack of a correct answer
in a summary.
As Opinion is a central component for our task,
we want to assess the extent to which summa-
ries contain opinions. Assessors report whether
summaries identify any majority or minority opin-
ions.8 A summary contains a majority opinion if
most of its sentences express this opinion or if it
contains specific terminology (‘The majority is/
Most users think…’, etc.), which is usually learned
during the fine-tuning process. Similarly, a sum-
mary contains a minority opinion if at least one
of its sentences expresses this opinion or it con-
tains specific terminology (‘A minority/ A few
users’, etc.). The final scores for each system are
the percentage of times the summaries contain
majority or minority opinions, respectively.
5.2.2 Best-Worst Evaluation of
Fine-tuned Models
The second human evaluation assesses the ef-
fects of fine-tuning on the best supervised model,
BART. The experiments use non-fine-tuned BART
(BART), BART fine-tuned on 10% of the corpus
(BART FT10%) and BART fine-tuned on 70%
of the corpus (BART FT70%).
As all the above are versions of the same neu-
ral model, Best-Worst scaling is chosen to detect
subtle improvements, which cannot otherwise be
quantified as reliably by traditional ranking scales
(Kiritchenko and Mohammad, 2017). An evalu-
ator is shown a tuple of 3 summaries (BART,
BART FT70%, BART FT30%) and asked to
choose the best/worst with respect to each criteria.
To avoid any bias, the summary order is random-
ized for each document following van der Lee
et al. (2019). The final score is calculated as the
percentage of times a model is scored as the best,
minus the percentage of times it was selected as
the worst (Orme, 2009). In this setting, a score
of 1 represents the unanimously best, while −1 is
unanimously the worst.
8Note that whether the identified minority or majority
opinions are correct is not evaluated here. This is done in
Section 5.2.2.
The same criteria as before are used for lin-
guistic quality and one new criterion is added
to assess Opinion. We define Meaning Preser-
vation as the extent to which opinions identified
in the candidate summaries match the ones iden-
tified in the gold standard. We draw a parallel
between the Faithfulness measure (Maynez et al.,
2020), which assesses the level of hallucinated
information present in summaries, and Meaning
Preservation, which assesses the extent of hal-
lucinated opinions.
6 Results
6.1 Automatic Evaluation
Results for the automatic evaluation are shown
in Table 5.
Fine-tuned Models Unsurprisingly, the best per-
forming models are ones that have been fine-
tuned on our corpus: BART (FT70%) and BART
(FT10%). Fine-tuning has been shown to yield
competitive results for many domains (Kryscinski
et al., 2021; Fabbri et al., 2021), including ours.
In addition, one can see that only the fine-tuned
abstractive models are capable of outperforming
the Extractive Oracle, which is set as the up-
per threshold for extractive methods. Note that on
average, the Oracle outperforms the Random sum-
marizer by a 59% margin, which only fine-tuned
models are able to improve on, with 112% for
BART (FT10%) and 114% for BART (FT70%).
We hypothesize that our gold summaries’ tem-
plate format poses difficulties for off-the-shelf
models and fine-tuning even on a limited portion
of the corpus produces summaries that follow the
correct structure (See Table 9 and Appendix C
for examples). We include comparisons between
the performance of BART (FT10%) and BART
(FT70%) on the individual components of the
summary in Table 6.9
Non-Fine-tuned Models Of these, SummPip
performs the best across most metrics and datasets
with an increase of 37% in performance over
the random model, followed by LexRank with
an increase of 29%. Both models are designed
for the multi-document setting and benefit from
the common strategy of mapping each sentence
9We do not
include other models in the summary
component-wise evaluation because it is impossible to iden-
tify the Main Story, Majority Opinion, and Minority Opinions
in non-fine-tuned models.
1236
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Models
COVID-19 Opinionated (CO)
COVID-19 Non-opinionated (CNO)
Election Opinionated (EO)
Election Non-opinionated (ENO)
R-1f1
R-2f1
R-Lf1
BLEURT R-1f1
R-2f1
R-Lf1
BLEURT R-1f1
R-2f1
R-Lf1
BLEURT R-1f1
R-2f1
R-Lf1
BLEURT
Heuristics
Gold (195 char)
Random Sentences (204 char)
Extractive Oracle (289 char)
13.55
15.45
1.09
1.67
9.22
10.29
−.660
−.382
7.30
11.80
0.70
1.38
5.97
9.27
−.968
−.510
11.82
15.33
0.80
1.60
8.27
10.12
−.576
−.146
6.75
10.06
0.89
2.15
5.69
8.46
−.592
−.056
LexRank (265 char)
Hybrid TF-IDF (277 char)
Quantized Transformer (273 char)
16.41
12.87
14.23
1.48
1.26
1.03
10.89
8.85
9.55
−.560
−.608
−.621
10.87
9.33
9.85
1.01
0.83
0.96
8.76
7.51
7.83
−.849
−.745
−.857
14.27
12.06
14.78
1.15
1.12
1.08
9.62
8.42
9.45
−.418
−.430
−.468
9.11
7.93
8.69
1.08
1.13
0.81
7.41
6.56
6.79
−.456
−.298
−.668
Extractive Unsupervised Models
Extractive Supervised Models
BERTSumExt (225 char)
HeterDocSumGraph (295 char)
14.22
15.13
1.31
1.19
9.68
9.79
−.571
−.748
9.78
10.05
1.11
0.88
7.70
7.79
−.699
−.867
11.93
14.28
1.10
0.96
8.47
9.15
−.384
−.564
8.06
8.40
1.00
0.72
6.63
6.86
−.407
−.626
Opinosis (215 char)
SummPip (236 char)
Copycat (153 char)
12.45
12.96
12.47
1.14
1.37
1.31
8.86
9.32
9.41
PG-MMR (238 char)
Pegasus (216 char)
T5 (206 char)
BART (237 char)
11.93
13.78
14.25
15.95
1.08
1.40
1.31
1.46
8.93
9.78
9.97
10.74
Abstractive Unsupervised Models
8.35
11.30
10.99
0.73
1.46
1.32
6.99
9.09
9.25
−.673
−.559
−.621
11.34
13.05
14.05
1.00
1.15
1.56
8.15
8.90
10.25
Abstractive Supervised Models
9.68
10.37
9.11
10.41
1.37
1.41
1.21
1.55
8.01
8.61
7.72
8.48
−.578
−.616
−.669
−.576
12.36
12.68
12.99
13.71
1.07
1.23
1.06
1.18
8.73
9.28
8.82
9.09
−.534
−.488
−.552
−.450
−.535
−.530
−.521
Fine-tuned Models
−.537
−.409
−.503
−.400
−.481
−.470
−.409
6.69
9.93
7.48
8.14
9.12
8.59
9.11
0.95
1.36
1.10
1.04
1.11
1.15
1.15
5.66
7.74
6.36
6.86
7.34
7.06
7.37
−.518
−.228
−.316
−.302
−.283
−.347
−.372
BART (FT 10%) (245 char)
BART (FT 70%) (246 char)
21.53
21.54
3.86
3.74
14.76
14.54
−.257
−.259
15.49
15.31
2.61
2.54
12.04
12.09
−.449
−.439
19.77
20.59
2.99
3.42
13.11
13.63
−.209
−.183
12.31
12.37
1.87
1.72
9.62
9.58
−.081
−.071
Table 5: Performance on the test set of baseline models evaluated with automatic metrics: ROUGE-n
(R-n) and BLEURT. The best model from each category (Extractive, Abstractive, Fine-tuned) and
overall are highlighted.
Models
COVID-19 Opinionated (CO)
Election Opinionated (EO)
R-1f1
R-2f1
R-Lf1
BLEURT R-1f1
R-2f1
R-Lf1
BLEURT
Main Story
BART (FT 10%)
BART (FT 70%)
11.43
11.18
2.49
2.29
9.95
9.57
−.082
−.137
9.82
9.55
1.72
1.70
8.31
8.19
−.185
−.104
Majority Opinion
BART (FT 10%)
BART (FT 70%)
19.74 4.06
20.25
4.28
16.18 −.505 19.13
16.86 −.487
3.74
17.88
3.11
15.60 −.392
14.57
−.442
BART (FT 10%)
BART (FT 70%)
19.05
18.70
4.66
4.81
15.87 −.544
−.643
15.83
15.26
15.98
3.97
4.63
13.34
14.01
−.791
−.604
Minority Opinion(s)
Table 6: Performance of fine-tuned models per
each summary component (Main Story, Majority
Opinion, Minority Opinion(s)) on the test set eval-
uated with automatic metrics: ROUGE-n (R-n)
and BLEURT.
in a tweet from the cluster into a node of a
sentence graph. However, not all graph map-
pings prove to be useful: Summaries produced
by Opinosis and HeterDocSumGraph, which em-
ploy a word-to-node mapping, do not correlate
well with the gold standard. The difference be-
tween word and sentence-level approaches can be
partially attributed to the high amount of spelling
variation in tweets, making the former less reliable
than the latter.
ROUGE vs BLEURT The performance on
ROUGE and BLEURT is tightly linked to the
data differences between COVID-19 and Elec-
tion partitions of the corpus. Most models achieve
higher ROUGE scores and lower BLEURT scores
on the COVID-19 than on the Election dataset.
An inspection of the data differences reveals that
COVID-19 tweets are much longer than Election
ones (169 vs 107 characters), as the latter had
been collected before the increase in length limit
from 140 to 280 characters in Twitter posts. This
is in line with findings by Sun et al. (2019), who
revealed that high ROUGE scores are mostly the re-
sult of longer summaries rather than better quality
summaries.
6.2 Human Evaluation
Evaluation of Gold Standard and Models
Table 7 shows the comparison between the gold
standard and the best performing models against
a set of criteria (See 5.2.1). As expected, the
human-authored summaries (Gold) achieve the
highest scores with respect to all linguistic qual-
ity and structure-based criteria. However,
the
gold standard fails to capture informativeness as
well as its automatic counterparts, which are, on
average, longer and thus may include more in-
formation. Since BART is previously pre-trained
1237
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Model
Fluency
Sentential Coherence Non-redundancy Referential Clarity
Informativeness Majority Minority
Gold
Lexrank
BART
SummPip
4.52
3.03
3.24
2.73
4.63
2.43
2.76
2.70
4.85
3.10
3.46
2.53
4.31
2.55
3.01
3.37
57%
58%
67%
69%
86%
15%
8%
32%
64%
62%
60%
36%
Table 7: Evaluation of Gold Standard and Models: Results.
Model
BART
BART FT 10%
BART FT 70%
Fluency Sentential Coherence Non-redundancy Referential Clarity Meaning Preservation
−0.76
0.30
0.44
−0.54
0.14
0.40
−0.65
0.22
0.43
−0.42
0.25
0.17
0.15
−0.11
−0.04
Table 8: Best-Worst Evaluation of Fine-tuned models: Results.
on CNN/DM dataset of news articles, its output
summaries are more fluent, sententially coherent
and contain less duplication than the unsupervised
models Lexrank and SummPip. We hypothesize
that SummPip achieves high referential clarity
and majority scores as a trade-off for its very low
non-redundancy (high redundancy).
Best-Worst Evaluation of Fine-tuned Models
The results for our second human evaluation are
shown in Table 8 using the guidelines presented in
5.2.2. The model fine-tuned on more data BART
(FT70%) achieves the highest fluency and sen-
tential coherence scores. As seen in Table 9, the
summary produced by BART (FT70%) contains
complete and fluent sentences, unlike its counter-
parts. Most importantly, fine-tuning yields better
alignment with the gold standard with respect to
meaning preservation, as the fine-tuned models
BART (FT70%) and BART (FT10%) learn how to
correctly identify and summarize the main story
and the relevant opinions in a cluster of tweets. In
the specific example, non-fine-tuned BART intro-
duces a lot of irrelevant information (‘industrial
air pollution’,‘google, apple rolling out covid’)
to the main story and offers no insight into the
opinions found in the cluster of tweets, whereas
both fine-tuned models correctly introduce the
Main Story and both partially identify the Major-
ity Opinion (‘great idea’ for anti-maskers ‘to dig
graves’). However, we note that the fine-tuning
process does not lead to increased performance
with respect to all criteria; non-redundancy is
compromised and referential clarity stops improv-
ing after a certain amount of training data. As
observed in the example, BART (FT70%) con-
tains duplicated content ‘think this is a great idea.
What a great idea!’. Wilcoxon signed rank tests
Human Summary: Anti-maskers are forced to dig graves for Covid19 victims in
Indonesia. The majority of Twitter users think it is a good idea that will cause a reality
check and make people rethink their ideas.
BART FT 70%: Anti-maskers in Indonesia are forced to dig graves for Covid-19 victims
as punishment for not wearing mask. The majority think this is a great idea. What a great
idea! It’s a good way to get people to reconsider their misinformed opinions. We should
do this in the US.
BART FT 10%: Anti-maskers forced to dig graves for Covid-19 victims in Indonesia as
punishment for refusing to wear a mask. The majority believe that this is a good idea, and
that it will hopefully bring about an end to the need for masks. A minority do not believe
it is necessary to wear a
BART: Covid-19 can be more deadly for people living in areas with industrial air
pollution. Anyone refusing to wear a mask is made to dig graves for covid-19 victims
as punishment in east java. as domestic violence spikes, many victims and their children
have nowhere to live. google, apple rolling out covid-
Table 9: BART Summary Examples for the same
cluster of tweets.
with p < 0.05 and p < 0.10 are used for sig-
nificance testing between all pairs of models.
We note that most pairwise differences are sig-
nificant at p < 0.05, while BART (FT70%) and
BART NFT differences are significant at p < 0.10
for non-redundancy. The only two exceptions are
referential clarity and non-redundancy between
BART (FT70%) and BART (FT10%) where both
fine-tuned models perform similarly.
7 Error Analysis
Error analysis is carried out on 30 fine-tuned
BART summaries from a set of 15 randomly sam-
pled clusters. The results are found in Table 10.
Hallucination Fine-tuning on the MOS corpus
introduces hallucinated content in 8 out of 30
manually evaluated summaries. Generated sum-
maries contain opinions that prove to be either
false or unfounded after careful inspection of the
cluster of tweets. We follow the work of Maynez
et al. (2020) in classifying hallucinations as ei-
ther intrinsic (incorrect synthesis of information
in the source) or extrinsic (external information
1238
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Error type
Intrinsic
Hallucination
Freq.
4/30
Extrinsic
Hallucination
4/30
Example
Example 1
Generated Summary: United States surpasses six million coro-
navirus cases and deaths and remains at the top of the global
list of countries with the most cases and deaths The majority are
pleased to see the US still leads the world in terms of cases and
deaths, with 180,000 people succumbing to Covid-19.
Example 2
Generated Summary: Sex offender Rolf Harris is involved in a
prison brawl after absconding from open jail. The majority think
Rolf Harris deserves to be spat at and called a ‘‘nonce’’ and
a ‘‘terrorist’’ for absconding from open prison. A minority are
putting pressure on
Information
Loss
12/30 Example 3
Human Summary: Miley Cyrus invited a homeless man on
stage to accept her award. Most people thought it was a lovely
thing to do and it was emotional. A minority think that it was a
publicity stunt.
Generated Summary: Miley Cyrus had homeless man accept
Video of the Year award at the MTV Video Music Awards.
The majority think it was fair play for Miley Cyrus to allow
the homeless man to accept the award on her behalf. She was
emotional and selfless. The boy band singer cried and thanked
him for accepting the
Table 10: Error Analysis: Frequency of errors
and representative summary examples for each
error type.
not found in the source). Example 1 in Table 10
is an instance of an intrinsic hallucination: The
majority opinion is wrongly described as ‘pleased’,
despite containing the correct facts regarding US
coronavirus cases. Next, Example 2 shows that
Rolf Harris ‘is called a terrorist’, which is con-
firmed to be an extrinsic hallucination as none
of the tweets in the source cluster contain this
information.
Information Loss
Information loss is the most
frequent error type. As outlined in Kryscinski
et al. (2021), the majority of current summariza-
tion models face length limitations (usually 1024
characters) which are detrimental for long-input
documents and tasks. Since our task involves the
detection of all opinions within the cluster, this
weakness may lead to incomplete and less in-
formative summaries, as illustrated in Example
3 from Table 10. The candidate summary does
not contain the minority opinion identified by the
experts in the gold standard. An inspection of the
cluster of tweets reveals that most posts express-
ing this opinion are indeed not found in the first
1024-character allowed limit of the cluster input.
8 Conclusions and Future Work
We have introduced the task of Twitter opinion
summarization and constructed the first abstrac-
tive corpus for this domain, based on template-
based human summaries. Our experiments show
that existing extractive models fall short on lin-
guistic quality and informativeness while abstrac-
tive models perform better but fail to identify
all relevant opinions required by the task. Fine-
tuning on our corpus boosts performance as the
models learn the summary structure.
In the future, we plan to take advantage of the
template-based structure of our summaries to re-
fine fine-tuning strategies. One possibility is to
exploit style-specific vocabulary during the gen-
eration step of model fine-tuning to improve on
capturing opinions and other aspects of interest.
Acknowledgments
This work was supported by a UKRI/EPSRC
Turing AI Fellowship to Maria Liakata (grant
no. EP/V030302/1) and The Alan Turing Institute
(grant no. EP/N510129/1) through project funding
and its Enrichment PhD Scheme. We are grateful
to our reviewers and action editor for reading our
paper carefully and critically and thank them for
their insightful comments and suggestions. We
would also like to thank our annotators for their
invaluable expertise in constructing the corpus
and completing the evaluation tasks.
Ethics
Ethics approval
to collect and to publish ex-
tracts from social media datasets was sought and
received from Warwick University Humanities
& Social Sciences Research Ethics Committee.
When the corpus will be released to the research
community, only tweet IDs will be made avail-
able along with associated cluster membership
and summaries. Compensation rates were agreed
with the annotators before the annotation process
was launched. Remuneration was fairly paid on
an hourly rate at the end of task.
Appendix A
Summary Annotation Interface
Stage 1: Reading and choosing cluster type
The majority of the tweets in the cluster revolve
around the subject of Trident nuclear subma-
rines. The cluster contains many opinions which
can be summarized easily, hence this cluster is
Coherent Opinionated. Choose ‘Yes’ and proceed
to the next step.
1239
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Figure 1: Fragment of a cluster of tweets for key-
word ‘nuclear’.
Figure
Opinionated’.
2: Choose
type
of
cluster
‘Coherent
Figure 5: Summary template of a Coherent Opinion-
ated cluster with a majority opinion.
Figure 3: Example of information nuggets: ‘a cor-
nerstone of peace and security’ describes the nuclear
submarine (WHAT information nugget), while ‘de-
fence secretary Michael Fallon’ describes a person
(WHO information nugget).
Stage 2: Highlighting information nuggets
important
Highlight
information and select
the relevant aspect each information nugget be-
longs to.
Stage 3: Template-based Summary Writing
Models
Random Sentences
Extractive Oracle
LexRank
Hybrid TF-IDF
BERTSumExt
HeterDocSumGraph
Quantized Transformer
Opinosis
PG-MMR
Pegasus
T5
BART
SummPip
Copycat
BART (FT 10%)
BART (FT 70%)
COVID-19
Opinionated Non-opinionated Opinionated Non-opinionated
COVID-19
Election
Election
Heuristics
0.838
0.867
Extractive Models
0.849
0.853
0.851
0.840
0.827
Abstractive Models
0.853
0.857
0.856
0.851
0.854
0.858
0.852
Fine-tuned Models
0.870
0.870
0.842
0.858
0.851
0.851
0.848
0.839
0.840
0.845
0.853
0.850
0.850
0.852
0.852
0.848
0.873
0.873
0.846
0.871
0.856
0.856
0.859
0.847
0.850
0.846
0.851
0.852
0.853
0.856
0.854
0.848
0.875
0.878
0.861
0.904
0.868
0.879
0.874
0.853
0.856
0.860
0.863
0.869
0.872
0.868
0.878
0.872
0.893
0.892
Figure 4: Choose whether there exists a majority opin-
ion in the cluster.
Table 11: Performance on test set of baseline
models evaluated with BERTScore.
Most user reactions dismiss the Trident plan and
view it as an exaggerated security measure. This
forms the Majority Opinion. A few users express
fear for UK’s potential future in a nuclear war.
This forms a Minority Opinion.
Write cluster summary following the struc-
ture: Main Story + Majority Opinion (+ Minority
Opinions).
Appendix B
Complete Results: BERTScore Evaluation
Model Implementation Details
T5, Pegasus, and BART were implemented us-
ing the HuggingFace Transformer package (Wolf
et al., 2020) with max sequence length of 1024
characters.
Fine-tuning parameters for BART are: 8 batch
size, 5 training epochs, 4 beams, enabled early
stopping, 2 length penalty, and no trigram rep-
etition for the summary generation. The rest of
the parameters are set as default following the
configuration of BartForConditionalGeneration:
activation function gelu, vocabulary size 50265,
0.1 dropout, early stopping, 16 attention heads,
12 layers with feed forward layer dimension set
as 4096 in both decoder and encoder. Quantized
Transformer and Copycat models are trained for
5 epochs.
1240
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Appendix C
Cluster examples and summaries from the
MOS Corpus
Gosh i hope these cases are used for the negligent homicide class action suit that’s being constructed against trump. cdc warns against drinking hand sanitizer amid reports of deaths
Tweet cluster fragment for keyword ‘‘CDC’’
the cdc has also declared, ¨being stupid is hazardous to your health. ¨URLLINK
cdc warning! do not drink hand sanitizer! what the hell! people be idiots!
cdc warns against drinking hand sanitizer amid reports of deaths seriously omg?!
if the cdc has to put out a health bulletin to inform people not to try drinking hand sanitizers, how stupid are those people?
from the ‘‘if you had any doubt’’ department: the cdc is alerting your fellow americans not to drink hand sanitizer. obviously more than a couple of people have had to be treated for it. I
wonder were they poisoned in the womb, too many concussions, mt. dew in their milk bottle when they were babies?
oh my...the cdc actually had to warn people not to drink hand sanitizer. only under a trump presidency have people acted so stupidly.
@realdonaldtrump you should try drinking the hand sanitizer. After your ridiculous suggestion to inject disinfectants, people have decided to drink it and are dying. CDC now issued a
warning not to drink it. since u don’t believe anything the scientists say go ahead and drink it. First get kids out of cages
@USER i think this actually speaks more to the stupidity of the cdc.
@USER trump is in control of the cdc. don’t believe a single word that they are saying
this is sadly what happens when you put an idiot like @realdonaldtrump in the white house...people had seizures, lost vision and dead after drinking hand sanitizer, cdc warns URLLINK
@cdcgov @usfda @USER is it really necessary to tell people not to ingest hand sanitizer????? jesus
cdc gives out warning that people are going blind and are dying from drinking hand sanitizer. #darwinism URLLINK
Human Written Summary
Main Story: CDC warns people not to drink hand sanitiser after people fall ill and die after doing so. Majority Opinion: The majority are shocked by the stupidity of people thinking it’s
OK to drink hand sanitiser. Minority Opinions: A minority blame the Trump administration for putting forward the idea of people disinfecting their insides. A minority think the warning
is an act of stupidity by the CDC.
Lexrank Summary: cdc just issued a warning not to drink it. the cdc just issued an official warning to never drink hand sanitizer and i just. . . . people are having seizures, losing vision
and dying after drinking hand sanitizer, cdc warns via @yahoo. if you need a warning from the cdc not to drink hand sanitizer. . . please go ahead and drink hand sanitizer.
SummPip Summary: people have decided to drink it and are dying after swallowing hand sanitizer . . . children masks . people are having seizures, losing vision and dying after drinking
hand sanitizer, cdc warns via @yahoo .
BART Summary: People are getting sick and even dying after swallowing hand sanitizer, the us centers for disease control and prevention said wednesday. four died and others have
suffered impaired vision or seizures, the cdc says. The cdc has also declared, ‘‘being stupid is hazardous to your health’’ @realdonaldtrump you
BART FT 10% Summary: People are dying and even dying after drinking hand sanitizer, the Centers for Disease Control warns. The majority are shocked and appalled by the CDC’s
warning against people drinking hand Sanitizer. A minority believe the CDC is exaggerating the danger of drinking hand Sanitizinger to distract people from the
BART FT 70% Summary: CDC warns that people are dying and losing vision and blindness after drinking hand sanitiser. The majority believe that people are extremely stupid for even
thinking about doing this, and it should not be allowed to happen. A minority do not believe the CDC’s warning and think it is not necessary to take any action
Table 12: Example of excerpt from tweet cluster ‘‘CDC’’, human summary and best generated summary
candidates.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
1241
Tweet cluster for keyword ‘‘mental health’’
A ’landmark moment’? Nick Clegg (Lib Dems) promise to put mental health on par with physical #health URLLINK #inclusion #care
All of a sudden, Nick Clegg is concerned about people with mental health issues. Nothing at all to do with trying to win voters and save his job.
Delighted that nick is finally doing something about mental health in our nhs
Nick Clegg promises ’dignity and respect’ in NHS mental health treatment video URLLINK | Guardian
I have been hearing very positive noises on the radio today from Lib Dems re: mental health treatment. Certainly long overdue but great to hear!
But if you are patting Nick Clegg on the back for new mental health reforms, consider this:
Mate, Clegg could have stood up to Cameron before his harmful reductive mental health policies got implemented.
Awesome that Clegg highlighted mental health to rapturous applause, but sure he did that with tuition fees once.
.nickclegg speech #libdemconf focusing on mental health was cool. Araith Nick Clegg yn canolpwyntio ar iechyd meddyliol yn wych.
Nick Clegg’s pandering towards the treatment of mental health illness is kinda sad and pathetic#hecantbuyavote
One immediate victory of Clegg’s speech; putting mental health issues on the agenda and in the media. #ldconf #bbcnews
LibDems are back to promising the unachievable because they know they’re safe away from power. Shame because mental health is in dire state.
His position in government could have been used to stop the reductive mental health reforms Cameron put in years back. Did he? no.
Human Written Summary
Main Story: Nick Clegg promises to focus on mental health provision in the NHS. Minority Opinions: Some Twitter users are pleased something is ’finally’ being done about it and that
it is great, it is highlighting mental health. Others are asking why he didn’t do it when he was in power and say that Clegg is doing it for personal gain.
Lexrank Summary: Nick Clegg promises ’dignity and respect’ in NHS mental health treatment video Speaking before his speech to the. . . Been hearing very positive noises on the radio
today from Lib Dems re: mental health treatment. One immediate success of Clegg’s speech; getting mental health issues on the agenda and in the media. nickclegg a six week wait for
mental health related treatment, but didn’t hear how you’ll resource the #NHS to achieve the needed care!
SummPip Summary: happy about nick clegg could have been used to stop the reductive mental health treatment . but if you are patting nick clegg is all of a sudden concerned about
people with mental health issues . nick clegg promises ’ dignity and respect ’ in nhs mental health treatment video speaking before his speech to the . . . been hearing very positive noises
on the radio today from lib dems re: mental health treatment .
BART Summary: Lib Dems promise to put mental health on par with physical health. Nick Clegg promises ’dignity and respect’ in NHS mental health treatment video. But if you are
patting Nick Clegg on the back for new mental health reforms, consider this: Feeling blessed, trying to eradicate mental health stigma and getting lifetime opportunities
BART FT 10% Summary:Lib Dem Nick Clegg makes a speech about mental health in the NHS. The majority are pleased that the Lib Dem leader is trying to tackle the stigma attached to
mental health. A minority are disappointed that he is pandering to the far right and anti-gay groups. A minority believe he is setting us up for a
BART FT 70% Summary: Lib Dem leader Nick Clegg makes a speech about putting mental health on a par with physical health in the manifesto. The majority are pleased that Nick Clegg
is taking a lead on mental health and saying that mental health needs to be treated with dignity and respect. A minority are dismayed by Nick Clegg
Table 13: Example of excerpt from tweet cluster ‘‘mental health’’, human summary and best generated
summary candidates.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
1242
References
Nasser Alsaedi, Pete Burnap, and Omer Rana.
2021. Automatic summarization of real world
events using Twitter. Proceedings of the Inter-
national AAAI Conference on Web and Social
Media, 10(1):511–514.
Reinald Kim Amplayo and Mirella Lapata. 2020.
Unsupervised opinion summarization with
noising and denoising. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2020.acl-main.175
Stefanos Angelidis, Reinald Kim Amplayo,
Yoshihiko Suhara, Xiaolan Wang, and Mirella
Lapata. 2021. Extractive opinion summariza-
tion in quantized transformer spaces. Trans-
actions of the Association for Computational
Linguistics, 9:277–293.
Stefanos Angelidis and Mirella Lapata. 2018.
Summarizing opinions: Aspect extraction meets
sentiment prediction and they are both weakly
supervised. In Proceedings of the 2018 Con-
ference on Empirical Methods in Natural Lan-
guage Processing, pages 3675–3686, Brussels,
Belgium. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/D18-1403
Manik Bhandari, Pranav Narayan Gour, Atabak
Ashfaq, Pengfei Liu, and Graham Neubig.
2020. Re-evaluating evaluation in text summa-
rization. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 9347–9359, On-
line. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.emnlp-main.751
Iman Munire Bilal, Bo Wang, Maria Liakata,
Rob Procter, and Adam Tsakalidis. 2021. Eval-
uation of thematic coherence in microblogs.
In Proceedings of the 59th Annual Meeting of
the Association for Computational Linguistics
and the 11th International Joint Conference on
Natural Language Processing (Volume 1: Long
Papers), pages 6800–6814, Online. Associa-
tion for Computational Linguistics.
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 5151–5169,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/2020.acl-main.461
Ziqiang Cao, Chengyao Chen, Wenjie Li, Sujian
Li, Furu Wei, and Ming Zhou. 2016. Tgsum:
Build tweet guided multi-document summari-
zation dataset. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 30.
https://doi.org/10.1609/aaai.v30i1
.10376
Emily Chen, Kristina Lerman, and Emilio Ferrara.
2020. Tracking social media discourse about
the covid-19 pandemic: Development of a
public coronavirus twitter data set. JMIR Pub-
lic Health Surveill, 6(2):e19273. https://doi
.org/10.2196/19273, PubMed: 32427106
Wen-Ying
Sylvia Chou, April Oh,
and
William MP Klein. 2018. Addressing health-
related misinformation on social media. JAMA,
320(23):2417–2418. https://doi.org/10
.1001/jama.2018.16865, PubMed: 30428002
Eric Chu and Peter J. Liu. 2019. Meansum: A
neural model for unsupervised multi-document
abstractive summarization. In ICML.
D. Corney, Carlos Martin, and Ayse G¨oker. 2014.
Two sides to every story: Subjective event
summarization of sports events using twitter.
In SoMuS@ICMR.
Hoa Trang Dang. 2005. Overview of DUC 2005.
In Proceedings of the Document Understand-
the
ing Conf. Wksp. 2005 (DUC 2005) at
Human Language Technology Conf./Conf. on
Empirical Methods in Natural Language Pro-
cessing (HLT/EMNLP.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional
transformers for lan-
guage understanding. In NAACL.
G¨unes Erkan and Dragomir R. Radev. 2004.
Lexrank: Graph-based lexical centrality as sa-
lience in text summarization. Journal of Ar-
tificial Intelligence Research, 22(1):457–479.
https://doi.org/10.1613/jair.1523
Arthur Braˇzinskas, Mirella Lapata, and Ivan Titov.
2020. Unsupervised opinion summarization as
copycat-review generation. In Proceedings of
Alexander Fabbri, Simeng Han, Haoyuan Li,
Haoran Li, Marjan Ghazvininejad, Shafiq
Joty, Dragomir Radev, and Yashar Mehdad.
1243
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
2021. Improving zero and few-shot abstractive
summarization with intermediate fine-tuning
In Proceedings of
and data augmentation.
the 2021 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 704–717, Online. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/2021.naacl-main.57
Alexander Fabbri, Irene Li, Tianwei She, Suyi
Li, and Dragomir Radev. 2019. Multi-news:
A large-scale multi-document summarization
dataset and abstractive hierarchical model. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 1074–1084, Florence, Italy. Association
for Computational Linguistics. https://
doi.org/10.18653/v1/P19-1102
Alvan R. Feinstein and Dominic V. Cicchetti.
1990. High agreement but low kappa: I. the
problems of two paradoxes. Journal of Clini-
cal Epidemiology, 43(6):543–9. https://doi
.org/10.1016/0895-4356(90)90158-L
Kavita Ganesan, ChengXiang Zhai, and Jiawei
Han. 2010. Opinosis: A graph based approach to
abstractive summarization of highly redundant
opinions. In Proceedings of the 23rd Interna-
tional Conference on Computational Linguis-
tics (Coling 2010), pages 340–348, Beijing,
China. Coling 2010 Organizing Committee.
Shima Gerani, Yashar Mehdad, Giuseppe
Carenini, Raymond T. Ng, and Bita Nejat.
2014. Abstractive summarization of product re-
views using discourse structure. In Proceedings
of the 2014 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP),
pages 1602–1613, Doha, Qatar. Association for
Computational Linguistics. https://doi
.org/10.3115/v1/D14-1168
Demian Gholipour Ghalandari, Chris Hokamp,
Nghia The Pham, John Glover, and Georgiana
Ifrim. 2020. A large-scale multi-document sum-
marization dataset from the Wikipedia current
events portal. In Proceedings of the 58th An-
nual Meeting of the Association for Computa-
tional Linguistics, pages 1302–1308, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-main.120
Saptarshi Ghosh, Kripabandhu Ghosh, Tanmoy
Chakraborty, Debasis Ganguly, Gareth Jones,
and Marie-Francine Moens. 2017. First Inter-
national Workshop on Exploitation of Social
Media for Emergency Relief and Preparedness
(SMERP). In Proceedings of the 39th European
Conference on IR Research – J.M. Jose et al.
(Eds.): ECIR 2017, LNCS 10193, ECIR 2017,
pages 779–783. Springer International Pub-
lishing AG. https://doi.org/10.1145
/3130332.3130338
Max Grusky, Mor Naaman, and Yoav Artzi.
2018. Newsroom: A dataset of 1.3 million
summaries with diverse extractive strategies.
In Proceedings of
the 2018 Conference of
the North American Chapter of the Associ-
ation for Computational Linguistics: Human
Language Technologies, Volume 1 (Long Pa-
pers), pages 708–719, New Orleans, Louisiana.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N18
-1065
Ruining He and Julian McAuley. 2016. Ups and
downs: Modeling the visual evolution of fash-
ion trends with one-class collaborative filter-
ing. In Proceedings of the 25th International
Conference on World Wide Web, WWW ’16,
pages 507–517, Republic and Canton of
Geneva, CHE. International World Wide Web
Conferences Steering Committee. https://
doi.org/10.1145/2872427.2883037
Karl Moritz Hermann, Tom´aˇs Koˇcisk´y, Edward
Grefenstette, Lasse Espeholt, Will Kay,
Mustafa Suleyman, and Phil Blunsom. 2015.
Teaching machines to read and comprehend.
In Proceedings of the 28th International Con-
ference on Neural Information Processing Sys-
tems - Volume 1, NIPS’15, pages 1693–1701,
Cambridge, MA, USA. MIT Press.
David Inouye and Jugal K. Kalita. 2011. Com-
paring Twitter summarization algorithms for
multiple post summaries. In 2011 IEEE Third
International Conference on Privacy, Secu-
rity, Risk and Trust and 2011 IEEE Third
International Conference on Social Computing,
pages 298–306.
Neslihan Iskender, Tim Polzehl, and Sebastian
M¨oller. 2021. Reliability of human evaluation
for text summarization: Lessons learned and
1244
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
challenges ahead. In Proceedings of the Work-
shop on Human Evaluation of NLP Systems
(HumEval), pages 86–96, Online. Association
for Computational Linguistics.
Masaru Isonuma,
Junichiro Mori, Danushka
Bollegala, and Ichiro Sakata. 2021. Unsuper-
vised abstractive opinion summarization by
generating sentences with tree-structured topic
guidance.
Myungha Jang and James Allan. 2018. Explain-
ing controversy on social media via stance
summarization. In The 41st International ACM
SIGIR Conference on Research & Develop-
ment
in Information Retrieval, SIGIR ’18,
pages 1221–1224, New York, NY, USA. Asso-
ciation for Computing Machinery. https://
doi.org/10.1145/3209978.3210143
Hanqi Jin, Tianming Wang, and Xiaojun Wan.
2020. Multi-granularity interaction network
for extractive and abstractive multi-document
summarization. In Proceedings of the 58th An-
nual Meeting of the Association for Computa-
tional Linguistics, pages 6244–6254, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-main.556
Kyriaki Kalimeri, Mariano G. Beir´o, Alessandra
Urbinati, Andrea Bonanomi, Alessandro
Rosina, and Ciro Cattuto. 2019. Human values
and attitudes towards vaccination in social me-
dia. In Companion Proceedings of The 2019
World Wide Web Conference, pages 248–254.
Byeongchang Kim, Hyunwoo Kim, and Gunhee
Kim. 2019. Abstractive summarization of Red-
dit posts with multi-level memory networks.
In NAACL-HLT.
Svetlana Kiritchenko and Saif Mohammad. 2017.
Best-worst scaling more reliable than rating
scales: A case study on sentiment intensity
annotation. In Proceedings of the 55th Annual
the Association for Computa-
Meeting of
tional Linguistics (Volume 2: Short Papers),
pages 465–470, Vancouver, Canada. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/P17-2074
Wojciech Kryscinski, Nazneen Fatema Rajani,
Divyansh Agarwal, Caiming Xiong,
and
Dragomir R. Radev. 2021. Booksum: A col-
lection of datasets for
long-form narrative
summarization. CoRR, abs/2105.08209.
Logan Lebanoff, Kaiqiang Song, and Fei Liu.
2018. Adapting the neural encoder-decoder
framework from single to multi-document
summarization. In Proceedings of
the 2018
Conference on Empirical Methods in Natu-
ral Language Processing, pages 4131–4141,
Brussels, Belgium. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D18-1446
Mike Lewis, Yinhan Liu, Naman Goyal,
Marjan Ghazvininejad, Abdelrahman Mohamed,
Omer Levy, Veselin Stoyanov, and Luke
Zettlemoyer. 2020. BART: Denoising sequence-
to-sequence pre-training for natural language
generation, translation, and comprehension. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics,
pages 7871–7880, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.703
Xinnian Liang, Shuangzhi Wu, Mu Li, and
Zhoujun Li. 2021. Improving unsupervised
extractive summarization with facet-aware
modeling. In Findings of the Association for
Computational Linguistics: ACL-IJCNLP 2021,
pages 1685–1697, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2021.findings-acl.147
Chin-Yew Lin. 2004. ROUGE: A package for
automatic evaluation of summaries. In Text
Summarization Branches Out, pages 74–81,
Barcelona, Spain. Association for Computa-
tional Linguistics.
Peter J. Liu, Mohammad Saleh, E. Pot, Ben
Goodrich, Ryan Sepassi, Lukasz Kaiser, and
Noam M. Shazeer. 2018. Generating Wikipedia
by summarizing long sequences. International
Conference on Learning Representations, abs/
1801.10198.
Yang Liu and Mirella Lapata. 2019. Text sum-
marization with pretrained encoders. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 3730–3740, Hong Kong, China. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1387
1245
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Joshua Maynez, Shashi Narayan, Bernd Bohnet,
and Ryan McDonald. 2020. On faithfulness
and factuality in abstractive summarization.
In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguis-
tics, pages 1906–1919, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.173
Saif Mohammad, Svetlana Kiritchenko, Parinaz
Sobhani, Xiaodan Zhu, and Colin Cherry.
2016. SemEval-2016 task 6: Detecting stance
in tweets. In Proceedings of the 10th Inter-
national Workshop on Semantic Evaluation
(SemEval-2016), pages 31–41, San Diego,
California. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/S16-1003
Ramesh Nallapati, Bowen Zhou, Cicero dos
Santos, C¸ a˘glar Gulc¸ehre, and Bing Xiang.
2016. Abstractive text summarization using
sequence-to-sequence RNNs and beyond. In
Proceedings of The 20th SIGNLL Conference
on Computational Natural Language Learn-
ing, pages 280–290. https://doi.org/10
.18653/v1/K16-1028
Shashi Narayan, Shay B. Cohen, and Mirella
Lapata. 2018. Don’t give me the details, just
the summary! Topic-aware convolutional neu-
ral networks for extreme summarization. In
Proceedings of the 2018 Conference on Empir-
ical Methods in Natural Language Processing,
pages 1797–1807, Brussels, Belgium. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/D18-1206
Ani Nenkova and Rebecca J. Passonneau. 2004.
Evaluating content selection in summarization:
The pyramid method. In Proceedings of the
Human Language Technology Conference of
the North American Chapter of the Association
for Computational Linguistics: HLT-NAACL
2004, pages 145–152.
Minh-Tien Nguyen, Dac Viet Lai, Huy-Tien
Nguyen, and Le-Minh Nguyen. 2018. TSix:
A human-involved-creation dataset for tweet
summarization. In Proceedings of the Eleventh
International Conference on Language Re-
sources and Evaluation (LREC 2018), Miyazaki,
Japan. European Language Resources Associa-
tion (ELRA).
Andrei Olariu. 2014. Efficient online summariza-
tion of microblogging streams. In Proceedings
of the 14th Conference of the European Chapter
of the Association for Computational Linguis-
tics, volume 2: Short Papers, pages 236–240,
Gothenburg, Sweden. Association for Compu-
tational Linguistics. https://doi.org/10
.3115/v1/E14-4046
Bryan K. Orme. 2009. Maxdiff analysis:
Simple counting,individual-level logit,
and hb.
Suraj
Patil.
generation.
2020. Question
https://github.com/patil-suraj/question
generation.
Rob Procter, Jeremy Crump, Susanne Karstedt,
Alex Voss, and Marta Cantijoch. 2013. Read-
ing the riots: What were the police doing on
Twitter? Policing and Society, 23(4):413–436.
https://doi.org/10.1080/10439463.2013
.780223
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer.
Koustav Rudra, Pawan Goyal, Niloy Ganguly,
Muhammad Imran, and Prasenjit Mitra. 2019.
Summarizing situational tweets in crisis sce-
narios: An extractive-abstractive approach.
IEEE Transactions on Computational Social
Systems, 6(5):981–993. https://doi.org
/10.1109/TCSS.2019.2937899
Thibault Sellam, Dipanjan Das, and Ankur
Parikh. 2020. BLEURT: Learning robust met-
rics for text generation. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, pages 7881–7892.
Association for Computational Linguistics,
Online. https://doi.org/10.18653/v1
/2020.acl-main.704
Simeng Sun, Ori Shapira, Ido Dagan, and Ani
Nenkova. 2019. How to compare summariz-
ers without target length? Pitfalls, solutions
and re-examination of the neural summariza-
tion literature. In Proceedings of the Workshop
on Methods for Optimizing and Evaluating
Neural Language Generation, pages 21–29,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.
1246
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Wenyi Tay, Aditya Joshi, Xiuzhen Zhang,
Sarvnaz Karimi, and Stephen Wan. 2019.
Red-faced ROUGE: Examining the suitability
of ROUGE for opinion summary evalua-
tion. In Proceedings of the The 17th Annual
Workshop of the Australasian Language Tech-
nology Association, pages 52–60, Sydney,
Australia. Australasian Language Technology
Association.
Chris van der Lee, Albert Gatt, Emiel van
Miltenburg, Sander Wubben,
and Emiel
Krahmer. 2019. Best practices for the hu-
man evaluation of automatically generated
text. In Proceedings of the 12th International
Conference on Natural Language Generation,
pages 355–368, Tokyo, Japan. Association for
Computational Linguistics. https://doi
.org/10.18653/v1/W19-8643
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Proceedings of
the 31st International Conference on Neu-
ral Information Processing Systems, NIPS’17,
pages 6000–6010, Red Hook, NY, USA.
Curran Associates Inc.
Alex Wang, Kyunghyun Cho, and Mike Lewis.
2020a. Asking and answering questions to eval-
uate the factual consistency of summaries. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguis-
tics, pages 5008–5020, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2020.acl-main.450
Bo Wang, Maria Liakata, Adam Tsakalidis,
Spiros Georgakopoulos Kolaitis, Symeon
Papadopoulos, Lazaros Apostolidis, Arkaitz
Zubiaga,
Yiannis
Procter,
Kompatsiaris. 2017a. Totemss: Topic-based,
temporal sentiment summarization for Twit-
the IJCNLP 2017,
ter.
System Demonstrations, pages 21–24.
In Proceedings of
Rob
and
Bo Wang, Maria Liakata, Arkaitz Zubiaga,
and Rob Procter. 2017b. A hierarchical topic
modelling approach for tweet clustering. In
International Conference on Social Informat-
ics, pages 378–390. Springer
International
Publishing. https://doi.org/10.1007
/978-3-319-67256-4_30
Danqing Wang, Pengfei Liu, Yining Zheng,
Xipeng Qiu, and Xuanjing Huang. 2020b.
Heterogeneous graph neural networks
for
extractive document summarization. In Pro-
ceedings of
the 58th Annual Meeting of
the Association for Computational Linguistics,
pages 6209–6219, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.553
Kexiang Wang, Baobao Chang, and Zhifang
Sui. 2020c. A spectral method for unsupervised
multi-document summarization. In Proceed-
the 2020 Conference on Empirical
ings of
Methods in Natural Language Processing
(EMNLP), pages 435–445, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2020.emnlp-main.32
Lu Wang and Wang Ling. 2016. Neural
network-based abstract generation for opinions
and arguments. In Proceedings of the 2016
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, pages 47–57,
San Diego, California. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N16-1007
Zhongqing Wang and Yue Zhang. 2017. A
neural model for joint event detection and
summarization. In Proceedings of the Twenty-
Sixth International Joint Conference on Artifi-
cial Intelligence, IJCAI-17, pages 4158–4164.
https://doi.org/10.24963/ijcai.2017
/581
Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Remi Louf,
Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick
von Platen, Clara Ma, Yacine
Jernite, Julien Plu, Canwen Xu, Teven Le
Scao, Sylvain Gugger, Mariama Drame,
Quentin Lhoest, and Alexander Rush. 2020.
Transformers: State-of-the-art natural language
processing. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural
Language Processing: System Demonstrations,
pages 38–45, Online. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2020.emnlp-demos.6
Yelp. Yelp
dataset
challenge. https://
www.yelp.com/dataset.
1247
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Jingqing Zhang, Yao Zhao, Mohammad Saleh,
and Peter J. Liu. 2020. Pegasus: Pre-training
with extracted gap-sentences for abstractive
summarization. ICML, abs/1912.08777.
Tianyi Zhang, Varsha Kishore, Felix Wu,
Kilian Q. Weinberger, and Yoav Artzi. 2020.
Bertscore: Evaluating text generation with
BERT. In International Conference on Learn-
ing Representations.
Jinming Zhao, Ming Liu, Longxiang Gao,
Yuan Jin, Lan Du, He Zhao, He Zhang, and
Gholamreza Haffari. 2020. Summpip: Unsu-
pervised multi-document summarization with
sentence graph compression. In Proceedings of
the 43rd International ACM SIGIR Conference
on Research and Development in Information
Retrieval, SIGIR ’20, pages 1949–1952, New
York, NY, USA. Association for Computing
Machinery. https://doi.org/10.1145
/3397271.3401327
summarization
Ming Zhong, Pengfei Liu, Yiran Chen, Danqing
Wang, Xipeng Qiu, and Xuanjing Huang.
2020. Extractive
text
matching. In Proceedings of the 58th Annual
Meeting of
the Association for Computa-
tional Linguistics, pages 6197–6208, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-main.552
as
Arkaitz Zubiaga, Damiano Spina, Enrique Amig´o,
and Julio Gonzalo. 2012. Towards real-time
summarization of scheduled events from twit-
ter streams. https://doi.org/10.1145
/2309996.2310053
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
1
6
2
0
6
0
7
3
1
/
/
t
l
a
c
_
a
_
0
0
5
1
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
1248