WikiAsp: A Dataset for Multi-domain Aspect-based Summarization
Hiroaki Hayashi1
Prashant Budania2
Peng Wang2
Chris Ackerson2 Raj Neervannan2 Graham Neubig1
1Language Technologies Institute, Carnegie Mellon University
2AlphaSense
{hiroakih,gneubig}@cs.cmu.edu
{pbudania,pwang,cackerson,rneervannan}@alpha-sense.com
Abstract
Aspect-based summarization is the task of gen-
erating focused summaries based on specific
points of interest. Such summaries aid efficient
analysis of text, such as quickly understanding
reviews or opinions from different angles.
However, due to large differences in the type
of aspects for different domains (e.g., senti-
ment, product features), the development of
previous models has tended to be domain-
specific. In this paper, we propose WikiAsp,1
a large-scale dataset for multi-domain aspect-
based summarization that attempts to spur
research in the direction of open-domain
aspect-based summarization. Specifically, we
build the dataset using Wikipedia articles
from 20 different domains, using the section
titles and boundaries of each article as a
proxy for aspect annotation. We propose sev-
eral straightforward baseline models for this
task and conduct experiments on the dataset.
Results highlight key challenges that existing
summarization models face in this setting, such
as proper pronoun handling of quoted sources
and consistent explanation of time-sensitive
events.
1 Introduction
Aspect-based summarization is a subtask of sum-
marization that aims to provide targeted sum-
maries of a document from different perspectives
(Titov and McDonald, 2008; Lu et al., 2009; Wang
and Ling, 2016; Yang et al., 2018; Angelidis and
Lapata, 2018). Unlike generic summarization, this
gives more concise summaries that are separated
according to specific points of interest, allowing
readers to fulfill focused information needs more
easily and quickly. However, existing aspect-
1http://github.com/neulab/wikiasp.
based summarization work is somewhat narrowly
focused; for example, a great majority of the
work focuses specifically on the domain of pro-
duct or restaurant reviews. In contrast, generic
summarization models are tested on a much wider
variety of genres, from newswire (Nallapati et al.,
2016; Grusky et al., 2018), to academic papers
(Kang et al., 2018; Kedzie et al., 2018), to movie
scripts (Gorinski and Lapata, 2015). For each
genre, the types and characteristics of aspects that
will need to be touched upon in a good summary
will differ greatly.
One natural source of such multi-domain ar-
ticles is Wikipedia, and the section boundaries
and titles in each article form natural annotations
of aspects and corresponding text. There have
recently been a number of attempts to generate the
lead section of Wikipedia articles from the linked
external sites in the reference section (Liu et al.,
2018; Fan et al., 2019; Liu and Lapata, 2019a),
an approach that does not explicitly consider the
different aspects covered by the article. Perez-
Beltrachini et al. (2019) also examine domain differ-
ences in Wikipedia text summarization. However,
existing datasets and analyses lack structure, broad
domain coverage, or both. We argue that (1)
generating structured summaries is of inherent
interest, as these will allow humans consuming the
information to browse specific aspects of interest
more readily, and (2) the structure will vary across
domains, with different domains demonstrating
very different characteristics.
In this paper, we construct a dataset for multi-
domain aspect-based summarization that allows
us to train models for this unique variety of
summarization task, and examine the challenges
posed therein. Figure 1 illustrates the overview of
our task. Specifically, we turn to section titles of
Wikipedia articles and construct sets of ‘‘aspects’’
through steps of automatic extraction, curation,
Transactions of the Association for Computational Linguistics, vol. 9, pp. 211–225, 2021. https://doi.org/10.1162/tacl_a_00362
Action Editor: Asli Celikyilmaz. Submission batch: 3/2020; Revision batch: 8/2020; Published 3/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Title: Barack Obama
Aspect: Early life and career
Obama was born on August 4, 1961, at Kapiolani Medical Center for Women and Children in Honolulu, Hawaii. . . .
Aspect: Presidency
The inauguration of Barack Obama as the 44th President took place on January 20, 2009. In his first few days in office, Obama issued . . .
Aspect: Legacy
Obama’s most significant legacy is generally considered to be the Patient Protection and Affordable Care Act (PPACA), . . .
Figure 1: In WikiAsp, given reference documents cited by a target article, a summarization model must produce targeted aspect-based summaries that correspond to sections.
Table 1: Example Wikipedia article about Barack Obama. Our goal is to generate texts given the cited references and the specified aspects.
and filtering. The section texts then serve as
corresponding aspect-based summaries.
We devise a baseline two-stage method con-
sisting of aspect identification and summarization
using extractive and abstractive models, and con-
duct experiments on the proposed dataset. The
analysis of experimental results and the generated
summaries reveals the unique challenges posed
by our multi-domain and multi-document setting.
For example, aspects that require summarizing
contents in a particular order (e.g., time series
events) in a multi-document setting adds extra dif-
ficulty because of the need for correctly ordering
scattered (and possibly duplicate) pieces of infor-
mation from different sources. Certain domains
that involve interviews or quotes of people also
exhibit challenges in correctly modifying pro-
nouns based on the relationship to the topic of
interest.
2 Generating Wikipedia as Aspect-based
Summarization
Wikipedia articles exhibit a specific way of
organizing information about a focused topic. An
article S consists of two parts: section titles a,
and their contents p. The contents are further
split into sections, where each section describes
information about the main topic from different
viewpoints. Table 1 shows an example article
about the topic ‘‘Barack Obama’’, with several
sections ‘‘Early life and career’’, ‘‘Presidency’’,
and ‘‘Legacy’’. In practice, the contents included
in each section can take many forms, from text,
tables, and images, to more specialized content
such as brackets of a tournament. In this work,
we focus only on sections that mainly consist of
textual content (see Section 3 for how we define
this).
Importantly, the content in Wikipedia articles
is required to be verifiable: ‘‘other people using
the encyclopedia can check that the information
comes from a reliable source’’.2 To ensure this,
articles contain citations from a set of references
R so that readers can check the validity of the
content. In other words, citations supposedly
contain the majority of the information written
in the articles. Liu et al. (2018) took advantage
of this fact by proposing a summarization task
using cited references as source documents
for summarization. Citations include published
material (such as books) and Web sites, but
because only Web-based citations can easily and
automatically be mined via crawling, we consider
only Web-based citations as source documents in
this work and, following Liu et al. (2018), ignore non-Web-based citations.
The goal of our task is to learn a model f : R →
S, which can 1) identify and gather information
from cited references and 2) generate a section-
by-section summary where each section contains
the appropriate type of information. Formally, let
R = {R1, R2, . . . , RM } be a collection of M cited
references for an article S = {s1, s2, . . . , sN } of
N sections. Each section si is essentially a tuple
of a section title and one or more paragraphs:
si = ⟨ai, pi⟩.
2https://en.wikipedia.org/wiki/Wikipedia:Verifiability.
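For illustration, the objects involved in this formulation can be sketched as simple data structures (a hedged sketch; the field names below are our own and are not the released data format):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Section:
    """One target section s_i = <a_i, p_i>: an aspect name and its paragraphs."""
    aspect: str              # a_i, e.g., "Early life and career"
    paragraphs: List[str]    # p_i

@dataclass
class Example:
    """One WikiAsp-style instance: cited references R and the target article S."""
    title: str               # topic of the article, e.g., "Barack Obama"
    references: List[str]    # R = {R_1, ..., R_M}, Web-based citations only
    sections: List[Section]  # S = {s_1, ..., s_N}

# A model f: R -> S consumes `references` and must decide which aspects to
# cover and generate a summary paragraph for each chosen aspect.
```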
Although there is a fair amount of variety
in section titles across different articles, articles
that belong to the same domain tend to share
aspects that are particularly salient for that domain.
Because of this, we select a fixed-size subset of
all section titles that appear in each domain as
the set of aspects A that we will target; details
on how we select this subset will be elucidated in
the following section. Hence, our task is cast as
multi-document aspect-based summarization.
3 The WIKIASP Dataset
In this section, we describe our concrete steps to
create our dataset.
3.1 Data Collection
As the base data, we build upon the data
collection strategy from the WikiSum dataset
(Liu et al., 2018), a dataset for generating lead
sections of Wikipedia from referenced Web
pages. Following the WikiSum data generation
script,3 we first crawled cited references covered
by CommonCrawl for each Wikipedia article.
We then recover all the sections4 of the target Wikipedia articles from the WikiSum source data (section information was unused in the WikiSum dataset) and obtain pairs of (section title, section paragraph). An example
for this is shown in Table 1.
3.2 Domain Separation
Articles in different domains focus on different
salient topics, as observed by Perez-Beltrachini
et al. (2019). For example, the ‘‘discography’’
section is common for articles about singers, but
is not appropriate for articles about infrastructure.
To characterize such structural differences, we
separate the set of articles obtained in the previous
step into sets in particular domains. Specifically,
we follow Perez-Beltrachini et al. (2019) in
assigning one category for each article using
DBPedia (Auer et al., 2007). DBPedia stores
structured information for each Wikipedia article,
including the domain labels and info boxes.
Additionally, it defines a topical hierarchy of the domains (ontology classes).
3Tensor2tensor’s WikiSum generator was used.
4Due to the design of the WikiSum dataset, the first section title of any article is automatically renamed to ‘‘LEAD’’. Therefore, we could not recover the first sections of the Wikipedia articles. We suggest editing the data generation scripts for future WikiSum users if section title information is necessary.
We first map
between articles and the domain labels from the
corresponding DBPedia dump. Obtained domain
labels, however, have mixed granularity (e.g.,
Person and its sub-class Dancer), which causes
imbalance in the number of examples in each
domain, as well as domain overlap between high-
level and low-level domains in the domain hi-
erarchy. We mitigate this by recursively merging
domains at leaf-level into coarser ones according
to the aforementioned topical hierarchy from the
ontology classes.5 We repeat the merging proce-
dure until a branch in the hierarchy includes more
than 15,000 articles, and pick 20 domains at the leaves of the merged hierarchy.6
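As a rough illustration of this merging step (our own sketch over a hypothetical `children`/`counts` representation of the ontology, not the actual preprocessing script), the procedure can be written as:

```python
from typing import Dict, List

def merge_small_domains(children: Dict[str, List[str]],
                        counts: Dict[str, int],
                        threshold: int = 15000) -> List[str]:
    """Bottom-up merging: repeatedly fold a leaf domain with too few articles
    into its parent until every remaining branch exceeds the threshold; the
    surviving leaves are the candidate domains (20 of them are then picked)."""
    parent = {c: p for p, cs in children.items() for c in cs}
    merged = dict(counts)
    changed = True
    while changed:
        changed = False
        leaves = [n for n in list(merged) if not children.get(n)]
        for leaf in leaves:
            if merged.get(leaf, 0) <= threshold and leaf in parent:
                p = parent[leaf]
                merged[p] = merged.get(p, 0) + merged.pop(leaf)  # fold into parent
                children[p].remove(leaf)
                changed = True
    return [n for n in merged if not children.get(n)]
```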
3.3 Aspect Selection
Next, we perform aspect selection on each set
of articles in the domains extracted during the
previous step. As previously noted, articles in the
same domain tend to share similar set of section
titles. Motivated by this observation, we construct
the set of aspects from the most frequent section
titles.
From the frequency distribution of section titles in a domain, we manually filter out titles whose sections are not primarily textual. For each section title, we take 20 randomly sampled sections and include it in the set of aspects only if 80% of the samples consist of textual paragraphs. Following the steps above, we select the 10 most frequent aspects
for each domain. However, the choice of words in section titles varies depending on the editors, even within the same domain, which can cause relevant aspects that are moderately frequent but not present in the top 10 to be missed. For example, common section titles in the WrittenWork domain include ‘‘summary’’ and ‘‘plot summary,’’ which should be merged together to form a single aspect. We handle these cases by inspecting the frequency distribution further down and manually identifying semantically equivalent titles to merge.
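The frequency-based part of this selection can be sketched as follows (an illustrative outline; the article representation and the `is_textual` predicate are assumptions, and the manual merging of equivalent titles is not shown):

```python
import random
from collections import Counter

def select_aspects(articles, is_textual, num_aspects=10, samples=20, min_textual=0.8):
    """Pick a domain's aspect set from section-title frequencies.
    `articles` is a list of articles, each a list of (title, section_text) pairs;
    `is_textual` is a predicate over section text (both are assumptions here)."""
    freq = Counter(title for art in articles for title, _ in art)
    by_title = {}
    for art in articles:
        for title, text in art:
            by_title.setdefault(title, []).append(text)
    aspects = []
    for title, _count in freq.most_common():
        sampled = random.sample(by_title[title], min(samples, len(by_title[title])))
        if sum(is_textual(t) for t in sampled) / len(sampled) >= min_textual:
            aspects.append(title)           # keep only mostly-textual titles
        if len(aspects) == num_aspects:
            break
    return aspects
```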
The resulting dataset consists of instances in 20
domains where each domain has 10 pre-defined
aspect classes. We show statistics comparisons of
the dataset to existing aspect-based summarization
5http://mappings.dbpedia.org/server/ontology/classes/.
6Many articles are labeled directly as Person, in which case the domain is high-level in the hierarchy. We do not
select this domain because lower-level domains such as Artist
or SoccerPlayer already have enough articles.
Infrastructure: history 13293, route description 5627, facilities 2792, services 1955, future 784, route 689, location 613, construction 577, connections 497, description 463
Software: reception 8196, gameplay 8095, development 3983, plot 3697, history 2465, features 1799, story 991, release 750, overview 570, legacy 564
Table 2: Frequency of filtered aspects that are textual in 2 domains. Due to space constraints, statistics for the rest of the domains are available in Appendix C.
datasets in Table 3 and examples of obtained aspects for two domains in Table 2. Appendices A and C summarize the data size for each domain and the obtained aspects for the remaining 18 domains, respectively.
4 Baseline Models
In this section, we describe two baseline
models for solving this task. Both of these models
decompose the overall process into two stages:
aspect discovery and aspect-based summarization
of classified sentences. Both baseline models
share the same methodology for aspect discovery,
but differ in terms of summarization models. The
model overview is shown in Figure 2.
4.1 Aspect Discovery
The first stage consists of labeling sentences in
cited reference texts according to aspects. Having
training data that contains sentences in the ref-
erence documents labeled with target aspects
would be the ideal case, but these do not exist a
priori. Therefore, we instead create training data
by assigning each sentence in the target articles
with aspect labels corresponding to the aspect to
which the sentence belongs. For example, the arti-
cle about Barack Obama in Table 1 yields training
instances consisting of sentences labeled with
Early life and career, Presidency, and Legacy
depending on which paragraph a sentence comes
from. This data makes it possible to train a clas-
sifier that learns to predict aspects from the texts
at the sentence level. At test time, cited reference
sentences are fed into the learned classifier and
are labeled with their most likely aspects.
However, the discrepancy of inputs at train/test
time is problematic because the model is not
exposed to any noisy sentences that do not belong
to any of the relevant aspects at training time, while
cited reference texts do contain such sentences.
For example, an article in the Company domain
may have a citation to the company Web site
itself, which contains commercial messages that
may not be appropriate in encyclopedic text such
as Wikipedia. We manage such cases by introduc-
ing an auxiliary label Other at training time and
letting the model learn to identify noisy sentences as
well. To do so, sentences labeled with Other are
randomly sampled from texts in different domains
and added to training data. We fine-tune the pre-
trained RoBERTa (Liu et al., 2019) model on
this classification dataset for each domain. Logits
obtained from the model are then passed through
the sigmoid function to obtain probabilities of
each aspect for a given sentence. Finally, we
assign labels to a sentence by taking the aspects ai
whose probabilities are greater than the threshold
λ: P (ai) > λ. The lower we set the threshold, the
more (but potentially noisier) sentences we include
as the input to the summarization model. We tune
λ independently for each domain based on the per-
formance on validation sets and set 0.5 for Group,
0.8 for Album, Animal, Building, Film, and 0.9 for
the remaining domains as the threshold values.
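A minimal sketch of this labeling step is shown below, assuming a fine-tuned Hugging Face-style sentence classifier that returns one logit per aspect label; this is illustrative rather than the exact inference code:

```python
import torch

def label_sentences(sentences, model, tokenizer, aspects, threshold=0.9):
    """Assign to each reference sentence every aspect whose sigmoid probability
    exceeds the per-domain threshold lambda (sentences matching no aspect are
    treated as noise / Other)."""
    labeled = []
    model.eval()
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits.squeeze(0)   # one logit per aspect label
        probs = torch.sigmoid(logits)
        chosen = [a for a, p in zip(aspects, probs.tolist()) if p > threshold]
        labeled.append((sent, chosen))
    return labeled
```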
4.2 Summarization
Sentences that are labeled with the same aspect
are then grouped in order of occurrence in cited
references to form a chunked paragraph that dis-
cusses the same aspect. This forms aspect-based
clusters of relevant sentences, which become the
input to a summarization model. Conversely, aspects that are never labeled (due to low
probabilities) are deemed irrelevant and thus will
not be summarized. We consider both an extrac-
tive and an abstractive summarization model in
our baseline implementation. For the extractive
model, we use TextRank (Mihalcea and Tarau,
2004; Barrios et al., 2016), a graph-based ranking
model for extracting important sentences. For the
abstractive model, we use PreSumm (Liu and
Lapata, 2019b), a Transformer-based summarizer
with fine-tuned BERT as the source encoder. For
each domain, PreSumm is fine-tuned and trained
on the pairs of (grouped sentences, target aspect
Dataset  Domain  #Dom.  #Train  Doc. Length  Sum. Length  #Asp.  #Asp./Ex.
OpoSum  Product Review  6  359,048  138  49  9  2.00
Amazon  Product Review  7  240,000  82  −  −  −
RottenTomatoes  Movie Review  1  2,458  2369  24  ∗2  ∗1.00
MA-News  News  1  284,701  1350  54  6  2.98
WikiAsp  Encyclopedia  20  320,272  13,672  213  10  1.77
Table 3: Training set statistics compared against previous aspect-based summarization datasets. For multi-domain datasets, the sum of all the examples is reported. #Asp./Ex. represents the average number of aspects that a model has to summarize for each example. #Asp. represents the number of aspects per domain if the number of domains is more than one. (∗Review saliency is treated as aspects.) Compared datasets are the work of Angelidis and Lapata (2018), Yang et al. (2018), Wang and Ling (2016), and Frermann and Klementiev (2019), respectively.
Figure 2: Two-stage model diagram. The aspect classifier assigns aspect labels for each reference sentence R_j^i from references R with a threshold λ. Sentences are then grouped according to the assigned labels, which are fed to the summarization model. Groups about irrelevant aspects (i.e., a_2) are ignored. Finally, the summarization model outputs summaries for each relevant aspect.
paragraph) to learn to produce summaries given
the aspect-relevant sentences.
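For illustration, the grouping and summarization stage can be sketched as follows (the `summarize` callable stands in for TextRank or PreSumm and its interface is an assumption, not the exact code used):

```python
from collections import defaultdict

def summarize_by_aspect(labeled_sentences, target_lengths, summarize):
    """Group classified sentences by aspect, preserving their order of
    occurrence in the references, then summarize each group to the aspect's
    average gold length."""
    groups = defaultdict(list)
    for sent, aspects in labeled_sentences:     # output of the aspect classifier
        for aspect in aspects:
            groups[aspect].append(sent)
    return {aspect: summarize(" ".join(sents), target_lengths[aspect])
            for aspect, sents in groups.items()}
```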
5 Evaluation
We evaluate models along two axes: aspect discovery and summarization. We note that the primary task in this dataset is aspect-based summarization; thus, the aspect discovery evaluation discussed below is only for diagnostic purposes. Because the aspect sets differ in different domains, evaluation is performed separately for each domain.
Aspect Discovery Models have to correctly
predict the right set of aspects about which they
generate summaries. The aspect discovery crite-
rion aims to evaluate the similarity between the
set of aspects about which a model decides to
generate summaries and the set of aspects that
appear in the target article.7 For comparing these
two sets, we use precision, recall, and F1 scores.
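Concretely, for a single article this amounts to set-level precision, recall, and F1 over aspect labels, for example:

```python
def aspect_prf(predicted, gold):
    """Set-level precision, recall, and F1 between predicted and gold aspects."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: aspect_prf({"history", "gameplay", "plot"}, {"gameplay", "reception"})
# -> (0.333..., 0.5, 0.4)
```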
Aspect-based Summarization Gold standard
summaries only exist for each of the aspects that
appear in an article. Therefore in this evaluation,
we focus on evaluating the model’s ability to
summarize inputs particularly on these aspects.
Specifically, generated summaries are paired to
7Note that there are two potential reasons an aspect does
not appear in the target article: (1) it may not be appropriate
for that particular entity (e.g., the ‘‘controversy’’ aspect in
the ‘‘company’’ domain should not exist if that company has
legitimately never had a controversy), or (2) the article may
not be complete. For this evaluation, we make the simplifying
assumption that all articles are complete and thus missing
aspects are an indication of failure to recall information, but
relaxing this assumption in some way may result in more
accurate evaluation.
corresponding reference summaries with the same
aspects and are evaluated using ROUGE (Lin,
2004). Because ROUGE is a recall-based mea-
sure, the number of tokens in the model outputs
directly affects the performance. Controlling the
length is particularly important for our dataset
because average summary length for each aspect
in different domains varies (e.g., ‘‘description’’
and ‘‘location’’ from the HistoricPlace domain have
396 and 90 average tokens, respectively). We
take this into account by explicitly setting the
maximum number of words for extractive and
abstractive summaries to be the average number
of words in the target summaries in the training
set for each aspect and for each domain.
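This evaluation protocol can be illustrated with the following sketch (the `rouge` callable stands for any standard ROUGE implementation; the interface shown is an assumption):

```python
def evaluate_article(generated, gold, avg_lengths, rouge):
    """Pair generated and gold summaries by aspect, truncate each model output
    to the aspect's average training-set length, and score it with a supplied
    ROUGE function."""
    scores = {}
    for aspect, reference in gold.items():        # only aspects present in the article
        if aspect not in generated:
            continue                               # missing aspects hurt aspect recall instead
        budget = avg_lengths[aspect]
        hypothesis = " ".join(generated[aspect].split()[:budget])
        scores[aspect] = rouge(hypothesis, reference)
    return scores
```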
6 Experiments
We provide two baseline models for the task and
evaluate them on the proposed dataset.
6.1 Implementation Details
For aspect classification, we used the roberta-
base8 model and fine-tuned for 5 epochs on the
created surrogate dataset above for each domain,
with a learning rate of 2 × 10−5. For the extractive
summarization, we specify the summary length
for TextRank according to the mean length of tar-
get summaries for each aspect in each domain. We
re-train the PreSumm summarizer on our dataset
for each domain: the encoder is initialized with the
weights of pre-trained BERT (Devlin et al., 2019)
and the decoder is trained from scratch. The total
number of training steps is 300,000. For some
domains, we further tuned the decoder dropout
rate to 0.3 to stabilize training. At inference time,
we specify maximum summary lengths for each
aspect for each domain using the average summary
lengths computed from the training set.
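For reference, the classifier fine-tuning configuration could be sketched roughly as below (an illustrative Hugging Face-style setup with placeholder dataset variables, not the authors' exact training script):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=11,                              # e.g., 10 domain aspects + Other
    problem_type="multi_label_classification")  # sigmoid per label, as in Section 4.1

args = TrainingArguments(
    output_dir="aspect-classifier",
    num_train_epochs=5,       # 5 epochs (Section 6.1)
    learning_rate=2e-5)       # learning rate 2e-5 (Section 6.1)

# `train_dataset` / `eval_dataset` are the per-domain surrogate datasets of Section 4.1
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
```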
6.2 Results
In this section, we discuss the experimental results
at each stage.
6.2.1 Aspect Discovery
We show the aspect discovery results in Table 4.
We see a general trend of high recall predictions
made by the model. While varying thresholds
could balance precision and recall, the results
exhibited high recall after hyperparameter search.
Domain  Prec  Rec  F-1
Album  19.64  86.43  30.64
Animal  34.69  84.08  45.52
Artist  26.32  75.24  36.72
Building  31.46  91.25  42.92
Company  28.97  91.50  41.06
EducationalInstitution  25.64  93.82  37.66
Event  28.99  96.44  42.36
Film  32.84  91.46  45.17
Group  17.46  95.56  28.18
HistoricPlace  33.38  90.22  42.98
Infrastructure  28.38  94.00  41.00
MeanOfTransportation  23.24  83.13  33.88
OfficeHolder  21.22  73.25  30.62
Plant  31.25  83.17  42.10
Single  25.36  88.33  37.16
SoccerPlayer  28.54  67.18  37.16
Software  31.52  94.65  45.10
TelevisionShow  20.44  81.76  31.28
Town  42.61  71.85  50.12
WrittenWork  21.50  94.29  33.71
Table 4: Aspect discovery results on the test set.
This suggests that the learned classifier is poorly calibrated. Class imbalance also plays a role here; predicting the major classes gives high recall due to skewed aspect frequency distributions. Among others, the classifier performed best on the Town domain, achieving the highest precision and F1 score.
6.2.2 Summarization
The automatic evaluation results are shown in
Table 5. Neither baseline unanimously outper-
formed the other on all domains, but we observe
that PreSumm (abstractive) performs better than
TextRank (extractive) on average. The low R-2
and R-L scores by both models despite the oracle
being relatively higher suggest that important phrases to be summarized do not appear rarely.9
To understand the upper-bound of model perfor-
mance for the task, we also show summarization
results of the extractive oracle model in Table 5.
Sentences were chosen directly from cited refer-
ence texts to maximize the ROUGE score against
summaries, thus bypassing the aspect classifica-
tion stage. The oracle performance shows that a
summarization model can indeed perform com-
petitively on the dataset if the model is given
the full input information. The contrasting results
8We used Huggingface’s implementation (Wolf et al., 2019) for obtaining and fine-tuning the weights.
9Note that TextRank connects nodes according to content overlap; thus, isolated sentences are not selected.
Domain | TextRank (R-1 / R-2 / R-L) | PreSumm (R-1 / R-2 / R-L) | Extractive Oracle (R-1 / R-2 / R-L)
Album | 19.56 / 2.81 / 17.26 | 22.76 / 6.31 / 20.27 | 37.72 / 12.58 / 33.19
Animal | 18.00 / 3.16 / 16.05 | 27.11 / 8.08 / 25.01 | 34.82 / 10.52 / 31.01
Artist | 17.22 / 2.49 / 15.58 | 21.79 / 3.76 / 20.00 | 41.49 / 15.04 / 37.64
Building | 23.91 / 4.96 / 21.85 | 24.99 / 5.97 / 23.24 | 41.95 / 14.31 / 38.28
Company | 22.92 / 3.70 / 20.65 | 22.28 / 4.08 / 20.50 | 40.20 / 12.30 / 36.16
EducationalInstitution | 21.47 / 4.29 / 19.24 | 24.17 / 6.70 / 21.96 | 39.11 / 14.04 / 35.18
Event | 26.64 / 5.67 / 24.08 | 28.31 / 7.69 / 26.20 | 46.17 / 16.90 / 41.87
Film | 21.25 / 3.81 / 19.14 | 20.58 / 5.34 / 18.86 | 40.24 / 13.78 / 36.14
Group | 22.30 / 3.62 / 20.20 | 25.51 / 4.97 / 23.51 | 41.36 / 13.23 / 37.56
HistoricPlace | 18.96 / 3.71 / 17.51 | 27.40 / 8.08 / 25.69 | 37.78 / 10.83 / 34.65
Infrastructure | 20.40 / 3.27 / 18.39 | 27.86 / 9.24 / 25.80 | 36.04 / 10.00 / 32.25
MeanOfTransportation | 21.20 / 3.93 / 19.31 | 24.52 / 7.04 / 22.72 | 41.13 / 13.70 / 37.45
OfficeHolder | 18.45 / 3.15 / 16.77 | 19.63 / 5.24 / 18.12 | 39.60 / 14.70 / 36.04
Plant | 18.73 / 3.02 / 16.84 | 25.29 / 6.30 / 23.20 | 34.93 / 9.66 / 31.31
Single | 17.96 / 2.67 / 15.86 | 22.06 / 6.78 / 19.98 | 36.51 / 11.57 / 31.88
SoccerPlayer | 14.79 / 2.36 / 12.89 | 12.89 / 1.86 / 12.05 | 31.06 / 8.00 / 27.08
Software | 24.54 / 4.56 / 22.05 | 20.51 / 5.15 / 18.82 | 42.79 / 13.96 / 38.30
TelevisionShow | 19.77 / 3.21 / 17.68 | 19.20 / 3.53 / 17.42 | 40.35 / 13.47 / 35.67
Town | 17.89 / 3.56 / 16.50 | 19.76 / 4.39 / 16.87 | 33.21 / 10.31 / 30.70
WrittenWork | 23.39 / 3.89 / 21.14 | 22.19 / 4.33 / 20.15 | 42.66 / 13.93 / 38.16
AVG | 20.47 / 3.59 / 18.45 | 22.94 / 5.74 / 21.02 | 38.95 / 12.64 / 35.03
Table 5: Aspect-based summarization results on the test set. The last row shows the average performance.
between the oracle and the two-stage models suggest the importance of accurate content selection
before performing summarization.
7 Analysis
We discuss the model outputs and analysis below.
7.1 Aspect-by-Aspect Evaluation
Not all the aspects are equally hard to summarize;
some might require summarization of a broad
range of information, whereas others require only
specific concepts to be summarized. We further
investigate this by looking into summarization
performance for both models on per-aspect basis.
Table 6 shows the best-performing aspects sorted
in descending order by ROUGE-1 scores for
two summarization models on the validation set.
Through manual investigation of the generated
samples for each aspect, we observed that the
aspects where the abstractive model performed
well tend to have common templates and similar
choice of vocabulary, more so than other aspects.
For example, 58% (out of 183 samples) of the
target summaries for government in Town shared
identical summaries despite the fact that arti-
cles discuss different townships. Similar but less
prevalent patterns were observed in other aspects
as well.
Aspects where the extractive summarization
model performed better contain much larger num-
bers of tokens in the summaries than average.
Specifically, the average summary length for 10
aspects where TextRank performed the best was
303, while that for 10 aspects where PreSumm
performed the best was 166. Naturally, abstractive
models have issues with maintaining coherence
over long decoding results, but the extractive
model has few issues gathering relevant sentences
at the cost of incoherent transitions from sentence
to sentence. As for the content, extractive sum-
maries exhibited the advantage of being able to
correctly include mentions related to numbers and
dates.
7.2 Quality of Generated Summaries
We then examined the generated summaries from
the two models and compared them qualitatively.
Samples are shown10 in Table 7 from some of the
domains listed in Table 2.
Manual inspection of the generated summaries
revealed pros and cons of the two models:
10Samples from other domains are in Appendix B.
Dom.  Aspect  PreSumm R-1  TextRank R-1  (sorted by PreSumm ↓ R-1)
Tow.  government  55.10  21.20
Eve.  format  44.94  24.73
Inf.  facilities  42.46  14.75
Bui.  exterior  41.81  25.60
Mea.  background  39.00  23.72
His.  heritage listing  36.58  10.25
Ani.  habitat  32.91  12.95
Pla.  taxonomy and nm.  32.70  9.39
Edu.  rankings  31.80  26.92
Alb.  commercial perf.  31.71  15.51

Dom.  Aspect  PreSumm R-1  TextRank R-1  (sorted by TextRank ↓ R-1)
Eve.  battle  28.00  32.00
Eve.  report  24.77  30.11
Sof.  gameplay  24.17  28.53
Eve.  background  30.01  27.42
Eve.  aftermath  27.54  27.27
Bui.  history  25.32  27.13
Sof.  plot  20.50  27.00
Edu.  rankings  31.80  26.92
Wri.  plot summary  22.08  26.85
Fil.  plot  19.43  26.66
Table 6: List of aspects sorted in descending order
of ROUGE-1 score according to PreSumm (top
half) and TextRank (bottom half). ‘‘performance’’
and ‘‘naming’’ are abbreviated to ‘‘perf.’’ and
‘‘nm.’’, respectively. Domain names shortened to
the first three letters.
• Both models are successful at discussing
on-topic content. For all
the summaries
inspected, both models were able to gener-
ate on-topic content in spite of the source
documents potentially being noisy.
• Abstractive summaries underperform at
generating exact entity mentions. Almost
all the samples require generation of entities
because the task targets generating ency-
clopedic texts. Except for the title (topic)
entity, abstractive models either generated
no entities or wrong ones.
7.3 Aspect Classification Accuracy
We observed a general trend of low precision
for aspect discovery. We hypothesize that this
is due to limited target aspects for each article;
correctly extracted aspects negatively affect
precision if they do not exist in the target article.
To quantify this, 10 random articles are selected
from the validation set in Software domain. For
each article, we extract 10 sentences labeled with
Figure 3: Precision differences in varying threshold
ranges.
the highest confidence for each of the 10 aspects,
resulting in 1,000 sentences in total. Each sentence
is annotated with binary labels indicating whether
it is correctly associated with the aspect or not.11
With the threshold λ set to 0.9, we achieved a precision of 45.1, which shows that the aspect discovery stage has the ability to extract aspects, but is not as good at extracting aspects relevant to the
article. We observed that the model predictions
tend to be polarized to extreme values (i.e., near
0 or 1). We also show the relationship between
λ ranges and the precision in Figure 3, which
indicates that the classifier is not well calibrated.
7.4 Domain-specific Challenges
One of the benefits of having many domains
for the same task is to be able to characterize
the differences and challenges that are unique
to certain domains. We analyzed the generated summaries from both of the summarization models and identified some of them below.
7.4.1 Pronoun Resolution for Opinion-based
Inputs
This is particularly important in domains and
aspects with subjective reviews such as Music
(Album, Artist, Group, and Single) or Software.
Source documents in these domains often include
quotes by artists or critics, which are often writ-
ten from a different person perspective. These are
11Sometimes, the entity in discussion by the sentence is
not clear. In this case, we annotate it as correct if the sentence
could correspond to the target aspect of any entity.
Domain / Title: Software / Cyberpunk 2077
Aspect: Gameplay
Gold: cyberpunk 2077 is a role – playing video game played from either a first – person or third – person perspective . it is set
in an open world metropolis called night city . the game will feature non – english speaking characters . players who do not
speak the languages can buy translator implants to better comprehend them; . . .
Ext.: cyberpunk 2077 takes place in, you guessed it, the year 2077 . for just a few hours, you can be rich, successful, and
popular with your preferred gender . cyberpunk 2077 will be a aaa rpg, but whether it will come to resemble the witcher in any
way remains unclear . how braindances will be used by or on the protagonist is . . .
Abs.: the game is an action role – playing game in which the player assumes the role of an unnamed protagonist, runabout who
lives in a large metropolis known as the ‘‘black city’’ . the game’s narrative is told through procedurally generated procedurally
generated . during the course of the game, the player is briefed into . . .
Domain / Title: Event / Battle of the Saintes
Aspect: Aftermath
Gold: the british lost 243 killed and 816 wounded, and two captains out of 36 were killed . the french loss in killed and
wounded has never been stated, but six captains out of 30 were killed . it is estimated that the french may have lost as much as
3,000, and more than 5,000 french soldiers and sailors were captured, . . .
Ext.: battle of the saintes, (april 9 – 12, 1782), in the american revolution, major naval victory for britain in the west indies
that restored british naval mastery in the area and ended the french threat to nearby british possessions . the valiant was
subsequently present at the relief of gibraltar on 12 april 1781 . . .
Abs.: the immediate aftermath of the battle of the niles meant the british fleet could not be able to deliver a great strategic
victory in the kingdom of france, and had also declared war on great britain with the royal navy to france . the outcome was
the subject of a strategic victory for the british . in the meantime, the british navy had captured the french navy . . .
Aspect: Battle
Gold: on 9 april 1782, the copper – hulled british fleet soon caught up with the french, who were surprised by their speed .
de grasse ordered the french convoy to head into guadeloupe for repair, forcing him to escort two fifty – gun ships ( fier and
experiment ) and placing his fleet in line of battle in order to cover the retreat . . . .
Ext.: ‘‘after some initial maneuvers and minor clashes, a full – scale battle was joined on april 12, by which time the british
had thirty – six ships of the line in action against thirty french ones . he turned his ships ninety degrees and sailed through the
broken french line of battle, splitting the french line into four segments . in doing this, the guns on each side of the british ships
were brought to bear on the french with little risk of return fire . . . .
Abs.: the battle started as the shock . the battle progressed, when the british forces reached the north – eastern flank of the battle
of weidman ( kingdom ) in a battle . he had begun to return to the field, and moved his forces toward the mouth of the river . in
the battle, the first contingent of the french navy ships got off from a small contingent of british soldiers as well as the third –
rate, under the command of general sir henry sturgis . . . .
Table 7: Generated summaries from multiple domains. Ext. and Abs. represent summaries from TextRank
and PreSumm.
usually converted by the Wikipedia editors into
more encyclopedic text, citing the source of the
information and writing in the third person. By
design, extractive summaries have issues with this
problem because of the lack of ability to transform
the input sentences in any way. For example, the
first extractive summary in Table 7 describes a
game in a subjective way. We verified this by
randomly selecting 20 summaries for gameplay
aspect in Software domain. We inspected pro-
nouns in extractive summaries and marked ones with
first- or second-person pronouns if the gold sum-
maries do not contain them. We found 65% of the
samples contained those undesirable pronouns that
do not align with the format of gold summaries.
7.4.2 Chronological Explanation
This variety of content is often found in certain
aspects such as history and event, which tend
to appear across multiple domains but are most
prevalent in Event, HistoricPlace, and non-human
entities like Company and Building. It is essential
in these aspects to describe key information in the
right chronological order for better readability.
This would not be a hard task for single docu-
ment summarization, as the model could perform
reasonably by following the order of the origi-
nal document. However, because our input is of
multi-document form, maintaining chronological
order when aggregating information across multi-
ple documents becomes non-trivial. Indeed, neither of the models was successful at being truthful
to the order even when there are enough clues
in the original references. For example, multiple
sentences start with ‘‘In [year], . . .’’, but the
generated summary jumps around in time. We
randomly picked 20 samples of extractive sum-
maries with history aspect from Company domain
and found that 25% of the samples have incon-
sistent timeline explanations.
8 Related Work
Aspect-based Summarization
Aspect-based summarization has been widely
investigated primarily on product or restaurant
reviews (Titov and McDonald, 2008; Lu et al.,
2009; Yang et al., 2018; Wang and Ling, 2016).
Angelidis and Lapata (2018) proposed a weakly
supervised method for aspect-based opinion sum-
marization that discovers aspects with a topic
model and does not require gold aspect annotation.
TAC 2010 held a shared task on guided summarization in the newswire domain, which resembles aspect-based summarization in terms of topic guidance. Recently, the task has been extended to the news domain by generating artificial datasets for
aspect-based summarization to address the lack of
large-scale data with aspect annotation (Frermann
and Klementiev, 2019; Krishna and Srinivasan,
2018). Our work also builds an aspect-based
summarization dataset automatically and is most
similar to Krishna and Srinivasan (2018), but
utilizes naturally available online encyclopedia
entries and their sections in multiple domains.
Wikipedia as a Summarization Dataset
Wikipedia has been studied as a target resource
for generation. An early attempt at generating full Wikipedia articles relied on Web search results for target entities as inputs (Sauper and Barzilay, 2009), which simulates an authoring process of humans searching for information on the Internet.
Liu et al. (2018) formulate a sub-task of generating
lead sections as summarization of reference web
pages to target articles. The resulting WikiSum
dataset is accompanied by rich metadata about
articles and inspired different uses of the dataset
(Perez-Beltrachini et al., 2019). Our work also
builds upon the WikiSum dataset, and aims to eval-
uate aspect-based summarization models using
different sections from Wikipedia articles. Com-
pared with Sauper and Barzilay (2009), our dataset
is an order of magnitude larger, both in the number
of articles and in the number of domains covered.
Multi-document Summarization
Extractive methods have proven effective for multi-document summarization in previous work (Nenkova et al., 2006; Cao et al., 2015; Yasunaga et al., 2017), but abstractive methods have increasingly been adopted for the task (Lebanoff et al., 2018; Fabbri et al., 2019). Our task is based on the idea of Liu et al. (2018), which treats references as source documents for multi-document summarization, and we experimented with both types of summarization models.
9 Conclusion and Future Work
In this paper, we propose a large-scale, multi-domain
multi-aspect summarization dataset derived from
Wikipedia. Through experiments, we perform an
extensive analysis of performance across different
genres and aspect types. Our analysis has demon-
strated that
there are both general challenges
regarding summarization into various aspects, as
well as specific challenges in particular genres
such as time-consistent mentions and proper pro-
noun conversion depending on the writer of the
original content.
Because of this, the proposed dataset also pro-
vides a testbed for several potential directions for
future work. For example, better aspect discovery
models may take into account the coherence of
the discourse in the original documents when
extracting aspects. Better summarization models
may take into account the provenance of the
information, appropriately determining when the
information is written by a first or third party.
WikiAsp also invites a focus on domains of
interest to investigate various problems of text
summarization, such as correct pronoun handling
and description of chronological timeline.
Acknowledgments
We would like to thank anonymous reviewers for
insightful comments. HH and GN were supported
by a grant from AlphaSense.
References
Stefanos Angelidis and Mirella Lapata. 2018.
Summarizing Opinions: Aspect Extraction Meets
Sentiment Prediction and They Are Both
Weakly Supervised. In Proceedings of the 2018
Conference on Empirical Methods in Natu-
ral Language Processing, pages 3675–3686,
Brussels, Belgium. Association for Computa-
tional Linguistics. DOI: https://doi.org
/10.18653/v1/D18-1403
Sören Auer, Christian Bizer, Georgi Kobilarov,
Jens Lehmann, Richard Cyganiak, and Zachary
Ives. 2007. Dbpedia: A nucleus for a web of
open data. In The semantic web, pages 722–735.
Springer. DOI: https://doi.org/10
.1007/978-3-540-76298-0 52
Federico Barrios, Federico López, Luis Argerich,
and Rosa Wachenchauzer. 2016. Variations of
the similarity function of textrank for automated
summarization. CoRR, abs/1602.03606.
Ziqiang Cao, Furu Wei, Li Dong, Sujian Li,
and Ming Zhou. 2015. Ranking with recursive
neural networks and its application to multi-document summarization. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of
the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association
for
Computational Linguistics.
Alexander Fabbri, Irene Li, Tianwei She, Suyi
Li, and Dragomir Radev. 2019. Multi-News: A
Large-Scale Multi-Document Summarization
Dataset and Abstractive Hierarchical Model.
In Proceedings of the 57th Annual Meeting
of the Association for Computational Linguis-
tics, pages 1074–1084, Florence, Italy. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P19
-1102
Angela Fan, Claire Gardent, Chlo´e Braud, and
Antoine Bordes. 2019. Using local knowledge
graph construction to scale Seq2Seq models
to multi-document inputs. In Proceedings of
the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th In-
ternational Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP), pages
4186–4196, Hong Kong, China. Association
for Computational Linguistics.
Lea Frermann and Alexandre Klementiev. 2019.
Inducing Document Structure for Aspect-
based Summarization. In Proceedings of the
57th Annual Meeting of the Association for
Computational Linguistics, pages 6263–6273,
Florence, Italy. Association for Computational
Linguistics. DOI: https://doi.org/10
.18653/v1/P19-1630
Philip John Gorinski and Mirella Lapata. 2015.
Movie script summarization as graph-based
scene extraction. In Proceedings of the 2015
Conference of the North American Chapter of
the Association for Computational Linguis-
tics: Human Language Technologies, pages
1066–1076, Denver, Colorado. Association for
Computational Linguistics. DOI: https://
doi.org/10.3115/v1/N15-1113
Max Grusky, Mor Naaman, and Yoav Artzi.
2018. Newsroom: A dataset of 1.3 million
summaries with diverse extractive strategies.
In Proceedings of the 2018 Conference of the
North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers),
pages 708–719, New Orleans, Louisiana. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/N18
-1065
Dongyeop Kang, Waleed Ammar, Bhavana Dalvi,
Madeleine van Zuylen, Sebastian Kohlmeier,
Eduard Hovy, and Roy Schwartz. 2018. A
dataset of peer reviews (PeerRead): Collection,
insights and NLP applications. In Proceedings
of the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 1647–1661.
New Orleans, Louisiana. Association for Com-
putational Linguistics. DOI: https://doi
.org/10.18653/v1/N18-1149
Chris Kedzie, Kathleen McKeown, and Hal
Daum´e III. 2018. Content selection in deep
learning models of summarization. In Pro-
ceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing,
pages 1818–1828, Brussels, Belgium. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18
-1208
Kundan Krishna and Balaji Vasan Srinivasan.
2018. Generating Topic-Oriented Summaries
Using Neural Attention. In Proceedings of
the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 1697–1705,
New Orleans, Louisiana. Association for Com-
putational Linguistics. DOI: https://doi
.org/10.18653/v1/N18-1153
Logan Lebanoff, Kaiqiang Song, and Fei Liu.
2018. Adapting the Neural Encoder-Decoder
Framework from Single to Multi-Document
Summarization. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 4131–4141, Brus-
sels, Belgium. Association for Computational
Linguistics. DOI: https://doi.org/10
.18653/v1/D18-1446
Chin-Yew Lin. 2004. ROUGE: A package for
automatic evaluation of summaries. In Text
Summarization Branches Out, pages 74–81,
Barcelona, Spain. Association for Computa-
tional Linguistics.
Peter J. Liu, Mohammad Saleh, Etienne
Pot, Ben Goodrich, Ryan Sepassi, Lukasz
Kaiser, and Noam Shazeer. 2018. Generating
Wikipedia by Summarizing Long Sequences.
arXiv:1801.10198 [cs]. ICLR.
Yang Liu and Mirella Lapata. 2019a. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070–5081, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1500
Yang Liu and Mirella Lapata. 2019b. Text sum-
marization with pretrained encoders. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 3730–3740, Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-1387
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A Robustly Optim-
ized BERT Pretraining Approach. arXiv:
1907.11692 [cs].
Yue Lu, ChengXiang Zhai, and Neel Sundaresan.
2009. Rated aspect summarization of short com-
ments. In Proceedings of the 18th International
Conference on World Wide Web – WWW ’09,
page 131, Madrid, Spain. ACM Press. DOI:
https://doi.org/10.1145/1526709
.1526728
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, Barcelona, Spain. Association for Computational Linguistics.
Ramesh Nallapati, Bowen Zhou, Cicero dos
Santos, Çağlar Gulçehre, and Bing Xiang.
2016. Abstractive text summarization using
sequence-to-sequence RNNs and beyond. In
Proceedings of The 20th SIGNLL Conference
on Computational Natural Language Learning,
pages 280–290, Berlin, Germany. Associ-
ation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/K16
-1028
Ani Nenkova, Lucy Vanderwende, and Kathleen
McKeown. 2006. A compositional context sen-
sitive multi-document summarizer: exploring
the factors that influence summarization. In
Proceedings of the 29th Annual International
ACM SIGIR Conference on Research and
Development in Information Retrieval, pages
573–580. ACM. DOI: https://doi.org
/10.1145/1148170.1148269
Laura Perez-Beltrachini, Yang Liu, and Mirella
Lapata. 2019. Generating summaries with topic
templates and structured convolutional decoders. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5107–5116, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1504
Christina Sauper and Regina Barzilay. 2009.
Automatically generating Wikipedia articles: A structure-aware approach. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 208–216,
Suntec, Singapore. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1687878.1687909
Ivan Titov and Ryan McDonald. 2008. A joint
model of text and aspect ratings for sentiment
summarization. In Proceedings of ACL-08:
HLT, pages 308–316, Columbus, Ohio. Asso-
ciation for Computational Linguistics.
Lu Wang and Wang Ling. 2016. Neural
network-based abstract generation for opinions
and arguments. In Proceedings of the 2016
Conference of the North American Chapter
of
the Association for Computational Lin-
guistics: Human Language Technologies,
pages 47–57, San Diego, California. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/N16
-1007
Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
Min Yang, Qiang Qu, Ying Shen, Qiao Liu, Wei
Zhao, and Jia Zhu. 2018. Aspect and sen-
timent aware abstractive review summariza-
tion. In Proceedings of the 27th International
Conference on Computational Linguistics,
pages 1110–1120, Santa Fe, New Mexico, USA.
Association for Computational Linguistics.
Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu,
Ayush Pareek, Krishnan Srinivasan,
and
Dragomir Radev. 2017. Graph-based neural
multi-document summarization. In Proceed-
ings of the 21st Conference on Computational
Natural Language Learning (CoNLL 2017),
pages 452–462, Vancouver, Canada. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/K17
-1045
A Domain Statistics
Domain  Train  Valid  Test
Album  24434  3104  3038
Animal  16540  2005  2007
Artist  26754  3194  3329
Building  20449  2607  2482
Company  24353  2946  3029
EducationalInstitution  17634  2141  2267
Event  6475  807  828
Film  32129  4014  3981
Group  11966  1462  1444
HistoricPlace  4919  601  600
Infrastructure  17226  1984  2091
MeanOfTransportation  9277  1215  1170
OfficeHolder  18177  2218  2333
Plant  6107  786  774
Single  14217  1734  1712
SoccerPlayer  17599  2150  2280
Software  13516  1637  1638
TelevisionShow  8717  1128  1072
Town  14818  1911  1831
WrittenWork  15065  1843  1931
Table 8: The list of domains and the number of
Wikipedia articles in each domain that contain at
least one salient aspect.
B Additional Samples
Title: Recomposed by Max Richter: Vivaldi – The Four Seasons
Aspect: Critical Reception
Gold: recomposed by max richter: vivaldi – the four seasons received widespread acclaim from contemporary classical music critics . ivan hewett of the telegraph gave the album a very positive review, stating, ” as you would expect of a composer who once studied with the great modernist luciano berio, richter is very self – aware . . . .
Ext.: listen to recomposed by max richter: vivaldi,
the four seasons now . i am highly impressed with
‘recomposed’. the music then propels the audience into
an atmosphere of isolation; a delicate harmony that is
sustained whilst hope takes centre stage . . . .
Abs.: the allmusic review by michael g . nastos awarded
the album 4 stars stating ‘‘ this is an album that generally
considered for fans of the genre ‘‘ . . . .
Table 9: Generated summaries from Album domain.
Title: Pride and Glory (film)
Aspect: Plot
Gold: assistant chief francis tierney sr . is the head of
a multigenerational new york city police department (
nypd ) family, which includes his sons francis “franny”
jr . , ray, and his son – in – law jimmy egan . deputy
inspector franny is the commanding officer of the 31st
precinct, where sergeant jimmy is a patrol officer, . . .
Ext.: as we know, under the macho code, this means
that after two people who love each other end up beaten
and bloody, they will somehow arrive at a catharsis .
the plot involves how and why the four cops were killed
. a family of police officers – patriarch, two sons, and
a son – in – law – deals with corruption in a precinct in
washington heights . . . .
Abs.: in the year before the events of the first film,
the movie takes place in washington heights, d . c . ,
a . army sergeant – in – law, ray ’ s wife, and sister
abby, living in washington city . they have a romantic
relationship with one of their officers . while the four
officers are called to ‘‘ the mental patient ‘‘ , . . .
Table 10: Generated summaries from Film domain.
Title: Dimitri Soudas
Aspect: Career
Gold: soudas served for one term as a school trustee
at the western quebec school board from 2002 to
2005 . between 2006 and 2011, soudas was a “high
profile” member of prime minister stephen harper’s
communication team, and one of the prime minister’s
“closest and most faithful aides” initially serving as
a press secretary and later as an associate director of
communications for the prime minister ’ s office, . . .
Ext.: april 2010 – after serving as a press secretary in
the prime minister’s office, soudas was promoted to
director of communications . “to fulfil the opportunities
afforded by social media, directors of communication
need to be aware of this trend and engage with it,”
dimitri soudas writes in his master’s thesis, a copy of
which has been obtained by cbc news. . . .
Abs.: in 2001, he was elected to the canadian house
of commons as a member of the people’s action party
( pc ) for the riding of yorkshire . he was re – elected
in 2002 and 2006 . in 2006, he was .
Table 11: Generated summaries from Office-
Holder domain.
C Aspect Statistics
Tables 12 and 13 show aspect frequency statistics.
Perf., hist., dist., ext., desc., dev., edu., nm., and
intl. correspond to performance, history, distri-
bution, extracurricular, description, development,
education, naming, and international, respectively.
Album: reception 11782, critical reception 6682, background 6202, commercial perf. 2398, release 2209, chart positions 1891, recording 1490, promotion 1150, history 1045, overview 840
Animal: description 12729, distribution 7813, dist. & habitat 2967, taxonomy 2737, habitat 2208, behavior 2167, ecology 1777, diet 1363, reproduction 1291, biology 1238
Artist: career 10193, biography 8292, early life 7587, personal life 6775, music career 2829, death 1607, life and career 1512, early life & edu. 1239, early years 1129, exhibitions 1030
Building: history 16885, architecture 3223, desc. & hist. 1395, description 1382, location 906, interior 877, construction 862, exterior 746, design 623, facilities 572
Company: history 21488, products 2921, operations 1630, services 1019, controversy 920, overview 891, background 572, subsidiaries 556, company history 504, technology 471
EducationalInstitution: history 12798, athletics 5602, academics 4638, campus 2471, sports 1433, student life 1327, ext. activities 1227, curriculum 1191, facilities 1189, rankings 836
Event: background 3453, aftermath 2483, history 1361, battle 1228, format 461, prelude 450, event 416, report 323, summary 321, casualties 290
Film: plot 25772, reception 14003, production 13882, release 7299, box office 4572, critical reception 4195, critical response 2802, synopsis 2626, home media 2461, filming 2013
Table 12: Aspect frequency for 8 domains.
Group: history 8894, biography 1206, career 1102, musical style 683, background 581, formation 408, early years 279, legacy 272, style 265, influences 204
HistoricPlace: history 3232, description 1398, desc. & hist. 1250, heritage listing 942, architecture 549, location 161, historic uses 90, preservation 84, geography 75, interior 70
MeanOfTransportation: history 2572, design 2152, operational hist. 1989, design & dev. 1566, service history 1435, development 1096, construction 933, fate 632, background 604, description 602
OfficeHolder: personal life 5119, political career 4950, early life 4740, career 4115, biography 2801, education 2168, background 1578, death 1402, legacy 889, early life & career 859
Plant: description 4684, dist. & habitat 1649, uses 1585, distribution 1399, cultivation 1387, taxonomy 1121, ecology 884, conservation 554, etymology 389, taxonomy & nm. 384
Single: music video 9606, critical reception 3829, background 3459, reception 2097, composition 1729, cover versions 1594, content 1266, release 1045, commercial perf. 849, live performance 113
SoccerPlayer: intl. career 8055, club career 8029, career 6386, personal life 3621, playing career 1930, early career 1578, early life 1191, professional 992, style of play 887, football career 550
TelevisionShow: plot 2902, production 2648, reception 2643, synopsis 1304, premise 944, history 908, format 842, broadcast 779, overview 650, critical reception 583
Town: geography 12667, demographics 10949, history 7298, education 2868, government 1910, 2010 census 1363, 2000 census 1284, transportation 1239, economy 1066, name and history 1002
WrittenWork: plot 5495, reception 4970, plot summary 3900, history 2527, background 1218, adaptations 1173, critical reception 933, manga 830, history and profile 803, anime 714
Table 13: Aspect frequency for 10 domains.