Conditional Generation with a Question-Answering Blueprint
Shashi Narayan1, Joshua Maynez1, Reinald Kim Amplayo1, Kuzman Ganchev1,
Annie Louis2, Fantine Huot1, Anders Sandholm2, Dipanjan Das1, Mirella Lapata1
1Google DeepMind, UK 2Google Research
shashinarayan@google.com, joshuahm@google.com, reinald@google.com,
kuzman@google.com, annielouis@google.com, fantinehuot@google.com,
sandholm@google.com, dipanjand@google.com, lapata@google.com
Abstract
The ability to convey relevant and faithful in-
formation is critical for many tasks in con-
ditional generation and yet remains elusive for
neural seq-to-seq models whose outputs often
reveal hallucinations and fail to correctly cover
important details. In this work, we advocate
planning as a useful intermediate representa-
tion for rendering conditional generation less
opaque and more grounded. We propose a new
conceptualization of text plans as a sequence
of question-answer (QA) pairs and enhance
existing datasets (e.g., for summarization) with
a QA blueprint operating as a proxy for con-
tent selection (i.e., what to say) and plan-
ning (i.e., in what order). We obtain blueprints
automatically by exploiting state-of-the-art
question generation technology and convert
input-output pairs into input-blueprint-output
tuples. We develop Transformer-based mod-
els, each varying in how they incorporate the
blueprint in the generated output (e.g., as a
global plan or iteratively). Evaluation across
metrics and datasets demonstrates that blue-
print models are more factual than alterna-
tives which do not resort to planning and
allow tighter control of the generation output.
1 Introduction
Neural generation models are often prone to hal-
lucination (Song et al., 2018; Maynez et al., 2020;
Kryscinski et al., 2020; Gabriel et al., 2021),
repetition and redundancy (Li et al., 2018; Suzuki
and Nagata, 2017), and struggle to identify which
content units are salient (Tan et al., 2017a). These
phenomena are amplified when generating long-
form text, i.e., documents with multiple para-
graphs (Wiseman et al., 2017), when dealing with
non-linguistic data (e.g., database tables), or very
long input, which is common when summarizing
multiple documents (Liu and Lapata, 2019;
Perez-Beltrachini et al., 2019), books (Kryściński
et al., 2021), or dialogue (Chen et al., 2022; Zhong
et al., 2021). An additional challenge concerns
the blackbox nature of deep learning systems,
which hides the inherent complexity of modeling
multiple interconnected linguistic phenomena in
text generation, and makes it difficult to examine
model decisions and attribute errors to specific
components. The lack of modularity further af-
fects controllability as these systems cannot be
easily tailored to individual needs.
Attempts to remedy some of these issues fo-
cus on changing the way entities are represented
(Puduppully et al., 2019b; Iso et al., 2019), al-
lowing the decoder to skip low-confidence tokens
to enhance faithful generation (Tian et al., 2019),
modeling graph connections between document
elements to better capture salience (Tan et al.,
2017b; Liu and Lapata, 2019), encoding docu-
ments hierarchically (Celikyilmaz et al., 2018;
Liu and Lapata, 2019; Rohde et al., 2021), learn-
ing latent alignments between the input and the
target text (Xu et al., 2021), adopting sparse at-
tention mechanisms (Child et al., 2019; Beltagy
et al., 2020), and introducing content selection
(Gehrmann et al., 2018; Dou et al., 2021) and
planning components (Puduppully et al., 2019a;
Moryossef et al., 2019b; Narayan et al., 2021;
Wiseman et al., 2018).
In this paper we also aim to render conditional
generation more modular via an intermediate,
plan-based representation. While autoregressive
models of language predict one token at a time,
there is evidence that in humans some degree
of planning occurs at a higher level than indivi-
dual words (Levelt, 1993; Guhe, 2007). A long
tradition in natural language generation views
Transactions of the Association for Computational Linguistics, vol. 11, pp. 974–996, 2023. https://doi.org/10.1162/tacl_a_00583
Action Editor: Mark Johnson. Submission batch: 11/2022; Revision batch: 3/2023; Published 8/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Table 1: Question-answering (QA) blueprint for AQuaMuSe summary. QA pairs were obtained from a
state-of-the-art question generation and answer identification system (Alberti et al., 2019).
planning as a central component to identifying
important content and structuring it appropriately
(Reiter and Dale, 2000); however, there is less
agreement on how plans should be represented.
Common examples include discourse trees
(Mellish et al., 1998), entity transitions (Kibble
and Power, 2004; Barzilay and Lapata, 2008),
sequences of propositions (Karamanis, 2004), and
schemas (McKeown, 1985).
Our work proposes a new conceptualization
of text plans as a sequence of question-answer
pairs. Specifically, we draw inspiration from the
‘‘Questions under Discussion’’ (QUD) theory of
discourse structure, which posits that one way of
articulating the structure of a text is to identify the
questions and sub-questions that are raised and
answered by subsequent spans of text (Carlson,
1983; Ginzburg, 1994; Van Kuppevelt, 1995;
Larson, 2002; Roberts, 2012; Riester, 2019). The-
oretical models of QUD assume that discourse
contains implicit questions for each of the as-
sertions made, which are thereby turned into
answers. These questions and answers can be
understood in terms of their use in moving a dis-
course forward to achieve communicative goals.
We propose to make QUDs explicit by exploiting
state-of-the-art question generation technology
(Alberti et al., 2019; Lu and Lu, 2021) and use
them as an intermediate representation layer for
conditional generation, i.e., a question-answering
(QA) blueprint operating as a proxy for both con-
tent selection (i.e., what to say) and planning (i.e.,
in what order).
Table 1 illustrates a plan for generating a
Wikipedia abstract from the AQuaMuSe dataset
(Kulkarni et al., 2020). We enhance existing da-
tasets (e.g., for summarization) with similar blue-
prints which we obtain automatically. We then
convert input-output pairs into input-blueprint-
output tuples and propose to learn encoder-decoder
models from these augmented annotations. We
develop three models that vary in how they in-
tegrate blueprints in the generation process and
their ability to handle long outputs. Aside from
generating blueprints and their corresponding text
in one go, we propose a new architecture that it-
eratively plans and generates a sentence at a time,
conditioning on the input and the output sentences
generated so far. This model does not generate a global
blueprint; rather, its planning process is incremental
and informed by generation, which we argue
affords greater control over the output and its
fluency. Moreover, the model is better equipped
for long-form generation, since it does not have
to (autoregressively) decode the blueprint and its
summary in one go, avoiding the risk of exceed-
ing the maximum decoder length.
We instantiate our models with a Transformer
(Vaswani et al., 2017) encoder-decoder architec-
ture and perform experiments on summarization
datasets representing different information seek-
ing tasks, application domains, and user require-
ments.1 In all cases, we empirically demonstrate
that blueprint models are more factual than alter-
natives which do not resort to planning; we also
observe that QA blueprints are a better represen-
tation compared to plans based on entity chains
(Narayan et al., 2021), allowing tighter control
of the output, and providing a comprehensive
explanation for model predictions (if the plan is
erroneous, then the summary will be too).
2 Related Work
Questions under Discussion The QUD-based
approach to discourse structure assumes an open-
ended inventory of possible questions and sub-
questions (Van Kuppevelt, 1995). Recent efforts
(De Kuthy et al., 2018; Westera et al., 2020;
1Our models, training data and predictions are available at
https://github.com/google-research/google-research/tree/master/text_blueprint.
Riester, 2019) have nevertheless shown that it
is possible to manually annotate documents with
QUDs, i.e., to formulate a question for every as-
sertion expressed in a text. De Kuthy et al. (2020)
even go as far as to partially automate QUD an-
notation in German by automatically generating
all potentially relevant questions for a given sen-
tence. Related work (Ko et al., 2020) focuses
on the generation of inquisitive questions that
reflect general text understanding and free-form
open-ended questions (Ko et al., 2021). Our work
builds upon QUD and related discourse structure
theories, although we do not directly implement
any of them in particular. We adopt question an-
swering as a good way of spelling out the con-
nection between the information structure of a
sentence and the discourse in which the sentence
can function.
QA Pairs as a Proxy for Annotation Labels
Question-answer pairs have been previously used
as a proxy for expressing semantic content.
QA-SRL (He et al., 2015) is a representation based
on QA pairs that has been shown to capture the
vast majority of arguments and modifiers in Prop-
Bank (Palmer et al., 2005) and NomBank (Meyers
et al., 2004). Instead of using a pre-defined role
lexicon, QA-SRL labels semantic roles with ques-
tions whose answers denote the argument bearing
the role. Follow-on work uses QA pairs to repre-
sent discourse relations (Pyatkin et al., 2020) and
to capture overlap or redundancy at the proposi-
tional level (Brook Weiss et al., 2021). We also
employ QA pairs as an abstraction of propositional
content; however, we do not target specific
relation types, or make any linguistic assumptions
about them (e.g., discourse relations vs semantic
roles).
Question-Answering in Summarization QA
pairs have been used for evaluating summaries
(Deutsch and Roth, 2021b; Eyal et al., 2019;
Durmus et al., 2020; Wang et al., 2020), spe-
cifically as a means of estimating the informa-
tion overlap between a reference summary and
a system-generated one. QA-based signals have
also been incorporated in the training of sum-
marization models, using reinforcement learning
(Arumae and Liu, 2018, 2019; Scialom et al.,
2019) or as a way of identifying salient content
in the input document (Deutsch and Roth, 2021a).
Cao and Wang (2022) introduce the task of hi-
erarchical question-summary generation, where
a source document is condensed into multiple
summaries, each answering a different question.
Questions are organized hierarchically into broad
questions and more specific sub-questions that
are learned from manual annotations. Our model
outputs a QA-based plan and a single summary
for a given document, although it is possible to
generate different summaries from different plans
for the same document. Our QA pairs are obtained
automatically and they are not structured.
Planning in Encoder-Decoder Models Var-
ious recent efforts have developed planning
modules in the context of data-to-text generation.
In most cases, the plans are specific to the input,
which varies from tables and records to RDF tu-
ples. For instance, Puduppully et al. (2019a) learn
a plan corresponding to a sequence of records, and
generate a summary conditioned on it. Narayan
et al. (2020) treat content selection as a task simi-
lar to extractive summarization; they first extract
sentence plans and then verbalize them one-by-one.
Moryossef et al. (2019a,b) propose a symbolic
planning stage followed by a neural realization
stage. Other work (Puduppully and Lapata, 2021;
Puduppully et al., 2022) advocates macro plan-
ning, where document content is organized into
a sequence of paragraph plans which are verbal-
izations of tabular input. Our work is closest to
Narayan et al. (2021), who also target summariza-
tion applications and learn an intermediate plan
to guide generation. We adopt a more elaborate
plan representation based on QA blueprints, and
interface decoding with plan generation similarly
to Narayan et al. (2020).
3 Text Generation with Blueprints
3.1 Problem Formulation
Let d denote the input to the model which could
be a document (or multiple documents), a dia-
logue history, or even database tables. The model
will learn to generate blueprint b for output s
(e.g., a summary) and the output itself. The blue-
print b is an ordered set of question-answer pairs
{(q1, a1), (q2, a2), . . . , (qm, am)}. Unsurprisingly,
such blueprints are not naturally occurring in ex-
isting datasets that typically consist of (d, s) pairs.
In the following we explain how we automati-
cally augment training examples (d, s) into tuples
(d, b, s) with blueprints (Section 3.2) and then
| Overgenerated Question-Answer Pairs | Answer | RT | RH | CO |
|---|---|---|---|---|
| Q1: What is a high performance variant of the Ford Mustang? | A1: The Shelby Mustang | ✓ | ✗ | |
| Q2: What is the high performance variant of the Ford Mustang called? | A2: Shelby | ✓ | ✗ | |
| Q3: What is a high performance variant of the Ford Mustang? | A3: Shelby Mustang | ✓ | ✗ | |
| Q4: What is a Shelby Mustang? | A4: a high performance variant | ✓ | ✗ | |
| Q5: The Shelby Mustang is a high performance variant of what? | A5: the Ford Mustang | ✓ | ✓ | ✗ |
| Q6: The Shelby Mustang is a high performance variant of what? | A6: Ford Mustang | ✓ | ✗ | |
| Q7: The Shelby Mustang is a high performance variant of what Ford model? | A7: Mustang | ✓ | ✗ | |
| Q8: Who built the Shelby Mustang from 1965 to 1968? | A8: Shelby American | ✓ | ✓ | ✗ |
| Q9: During what years was the Shelby Mustang built by Shelby American? | A9: 1965 to 1968 | ✓ | ✓ | ✓ |
| Q10: In what year did Ford take over production of the Shelby Mustang? | A10: 1969 | ✓ | ✗ | |
| Q11: What was the final year that Shelby American built the Mustang? | A11: 1970 | ✗ | | |
| Q12: Who built the Shelby Mustang from 1969 to 1970? | A12: Ford | ✓ | ✓ | ✓ |
| Q13: What event in 2005 led to the revival of the Shelby Mustang? | A13: the introduction | ✗ | | |
| Q14: What generation of Mustang was introduced in 2005? | A14: the fifth generation | ✓ | ✗ | |
| Q15: What generation of Mustang was introduced in 2005? | A15: fifth | ✓ | ✗ | |
| Q16: In what year was the fifth generation of the Ford Mustang introduced? | A16: 2005 | ✓ | ✓ | ✓ |
| Q17: What name was brought back for the 2005 Ford Mustang? | A17: the Shelby nameplate | ✓ | ✗ | |
| Q18: What was the Shelby Mustang revived as? | A18: a new high-performance model | ✓ | ✓ | ✓ |

[The Shelby Mustang is a high performance variant of the Ford Mustang]P1 which [was built by Shelby American]P2 [from 1965 to
1968,]P3 and [from 1969 to 1970 by Ford.]P4 [Following the introduction of the fifth generation Ford Mustang in 2005,]P5 [the Shelby
nameplate was revived as a new high-performance model, this time designed and built by Ford.]P6

Table 2: Generation of QA pairs for summary in Figure 1 and blueprint annotation. We split the
summary into propositions P and select no more than one QA pair per proposition. RT, RH, and CO
are shorthand for Round Trip, Rheme, and Coverage. Questions that pass/fail each filter are marked
with ✓/✗, respectively.
describe how we devise blueprint models based
on them (Section 3.3).
3.2 Blueprint Annotation
We first explain how question-answer pairs are
automatically (over-)generated for output s, and
subsequently filtered to create blueprint b. We
illustrate the different filtering stages via the ex-
ample in Table 2.
Question-Answer Generation We generate
QA pairs following an approach similar to
Honovich et al. (2021, 2022). We convert the
SQuAD reading comprehension dataset (Rajpurkar
et al., 2018b) to a question generation dataset by
concatenating the answer and context (with
separators) and fine-tuning a sequence-to-sequence
transformer model to predict the question.
Specifically, we fine-tune the T5-11B checkpoint
from Raffel et al. (2020); questions are decoded
with a beam size of 4. During training, answer
candidates are the answers provided in the
SQuAD annotation. At inference time, answer
candidates (i.e., base noun phrases and named
entities) are identified in the output s using
SpaCy2 and questions are generated with the
SQuAD trained system. This procedure yields
a large list of QA pairs (see in Table 2 the
questions generated for the summary at the
bottom), which we reduce using the filtering
explained below.
2https://spacy.io/.
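As a rough illustration of the data conversion described above, the sketch below turns one SQuAD-style record into a source/target pair for question generation. The separator strings (`answer:`, `context:`) and the helper name `make_qg_example` are assumptions made for this example; the text states only that the answer and context are concatenated with separators.

```python
def make_qg_example(answer: str, context: str, question: str) -> dict:
    """Build one seq-to-seq training pair for question generation.

    The 'answer:' / 'context:' separators are illustrative assumptions,
    not the separators used by the authors.
    """
    source = f"answer: {answer} context: {context}"
    return {"source": source, "target": question}


example = make_qg_example(
    answer="Shelby American",
    context="The Shelby Mustang was built by Shelby American from 1965 to 1968.",
    question="Who built the Shelby Mustang from 1965 to 1968?",
)
print(example["source"])
```

At inference time the same source format would be filled with an answer candidate extracted from the summary, and the fine-tuned model would decode the question.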
Question-Answer Blueprints Initially, we apply
a Round-trip Consistency check (Alberti et al.,
2019), which discards questions if they yield
answers different from those used to generate them.
In Table 2, Q11 is discarded as the answer it is
paired with is wrong (1968 was the final year that
Shelby American built the Mustang, not 1970).
The same is the case for Q13, where the answer to
the question ought to have been the introduction
of the fifth generation Ford Mustang.
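The round-trip check can be sketched as follows. This is a minimal illustration, not the authors' implementation: `answer_fn` stands in for any question-answering model, the normalized exact-match comparison is an assumption, and the toy lookup table below merely mimics the Q11/Q12 behaviour described in the text.

```python
import re
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so that 'Ford.' matches 'ford'."""
    return " ".join(re.findall(r"\w+", text.lower()))


def round_trip_filter(context: str,
                      qa_pairs: List[QAPair],
                      answer_fn: Callable[[str, str], str]) -> List[QAPair]:
    """Keep a pair only if answering its question recovers its answer."""
    kept = []
    for question, answer in qa_pairs:
        predicted = answer_fn(question, context)
        if normalize(predicted) == normalize(answer):
            kept.append((question, answer))
    return kept


# Hypothetical stand-in for a QA model: a fixed lookup table.
toy_answers = {
    "What was the final year that Shelby American built the Mustang?": "1968",
    "Who built the Shelby Mustang from 1969 to 1970?": "Ford",
}
pairs = [
    ("What was the final year that Shelby American built the Mustang?", "1970"),
    ("Who built the Shelby Mustang from 1969 to 1970?", "Ford"),
]
kept = round_trip_filter("(summary text)", pairs, lambda q, c: toy_answers[q])
print(kept)  # only the second pair survives
```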
To decrease the number of QA pairs further,
we chunk the text (bottom block in Table 2) into
propositions; a proposition is a sub-sentential
unit which represents a single claim or fact
(Stanovsky et al., 2018; Ernst et al., 2022). We
use propositions instead of sentences since the
latter can be too long and contain multiple facts.
We split text into propositions based on punctu-
ation (period, comma, and semicolon), coordina-
tion (e.g., and, but), relative pronouns (e.g., that,
who), and prepositions (e.g., at, by). Following
this simple approach, the summary in Table 2 is
split into six propositions, shown within square
brackets. We next match each proposition to a
single QA pair heuristically, following a two-
stage approach.
We first find the question whose answer is at the
rightmost position within a proposition. If there
are multiple such questions, we select the one
with the longest answer. This first stage, which
we call Rheme, is motivated by the theme-rheme
structure (Vallduví and Vilkuna, 1998) of natural
language sentences: Already known information
(i.e., the theme) is usually placed first while
new information (i.e., the rheme) is placed later
in a sentence or phrase (Kruijff-Korbayová and
Steedman, 2003). Following this idea, Rheme
selection prioritizes new-information seeking
questions. As can be seen in Table 2, it eliminates
several questions (e.g., Q1–Q4) as their answers
are not the rightmost element in the obtained
propositions. Questions Q5 and Q6 are identical;
however, we retain Q5 as it yields the longest
answer.
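The Rheme stage can be sketched as below, under the assumption that every answer and every proposition is available as a character span of the summary. The class name `AnchoredQA` and the function `rheme_select` are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AnchoredQA:
    question: str
    answer: str
    start: int  # character offset of the answer span in the summary

    @property
    def end(self) -> int:
        return self.start + len(self.answer)


def rheme_select(propositions: List[Tuple[int, int]],
                 qa_pairs: List[AnchoredQA]) -> List[AnchoredQA]:
    """Pick at most one QA pair per proposition (given as (start, end) spans).

    Preference goes to the answer ending furthest to the right;
    ties are broken in favour of the longest answer.
    """
    selected = []
    for p_start, p_end in propositions:
        inside = [qa for qa in qa_pairs
                  if p_start <= qa.start and qa.end <= p_end]
        if inside:
            selected.append(max(inside, key=lambda qa: (qa.end, len(qa.answer))))
    return selected


summary = "The Shelby Mustang is a high performance variant of the Ford Mustang"
candidates = [
    AnchoredQA("What is a Shelby Mustang?", "a high performance variant",
               summary.index("a high performance variant")),
    AnchoredQA("The Shelby Mustang is a high performance variant of what?",
               "the Ford Mustang", summary.index("the Ford Mustang")),
    AnchoredQA("The Shelby Mustang is a high performance variant of what?",
               "Ford Mustang", summary.index("Ford Mustang")),
]
chosen = rheme_select([(0, len(summary))], candidates)
print(chosen[0].answer)  # the Ford Mustang
```

Working with spans rather than string search keeps an answer such as "Ford" attached to the proposition it was extracted from, even when the same string occurs elsewhere in the summary.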
The second stage, which we call Coverage,
prioritizes the selection of informative QA pairs
by selecting non-overlapping ones. Specifically,
we first convert s to a bag of tokens and select
the QA pair with the highest lexical overlap. We
then remove the overlapping tokens from s, and
repeat this greedy selection process until the bag
is empty or the overlap is zero. Table 2 shows how
Coverage further eliminates QA pairs Q5 and Q8.
The remaining four QA pairs constitute the final
blueprint b. Rather than defaulting to a random
order, we sort these based on the location of the
answer spans in s (see the final order in Table 1).
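A minimal sketch of the greedy Coverage step follows. It assumes that lexical overlap is counted between the remaining summary tokens and the tokens of the question plus its answer, and it uses a simple regular-expression tokenizer; both choices are assumptions, since the text does not specify them.

```python
import re
from collections import Counter
from typing import List, Tuple

QA = Tuple[str, str]  # (question, answer)


def tokens(text: str) -> Counter:
    """Bag of lowercased word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def coverage_select(summary: str, qa_pairs: List[QA]) -> List[QA]:
    """Greedily pick QA pairs that cover not-yet-covered summary tokens."""
    bag = tokens(summary)
    remaining = list(qa_pairs)
    selected = []
    while remaining and bag:
        # Overlap = number of still-uncovered summary tokens shared with the pair.
        scored = [(sum((tokens(q + " " + a) & bag).values()), i)
                  for i, (q, a) in enumerate(remaining)]
        best_overlap, best_index = max(scored, key=lambda item: item[0])
        if best_overlap == 0:
            break
        q, a = remaining.pop(best_index)
        selected.append((q, a))
        bag -= tokens(q + " " + a)  # Counter subtraction drops covered tokens
    return selected


demo = coverage_select(
    "Ford built the car in 1969. Shelby designed it.",
    [
        ("Who built the car?", "Ford"),
        ("Who built the car in 1969?", "Ford"),
        ("Who designed it?", "Shelby"),
    ],
)
print(demo)
```

In this toy run the shorter "Who built the car?" pair is dropped because the tokens it would cover have already been removed from the bag. The selected pairs would then be re-sorted by the position of their answer spans in s.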
3.3 Blueprint Models
We devised three seq-to-seq models, which dif-
fer in the way the output and its blueprint are
generated.
End-to-End Model A straightforward approach
would be to take d as input and learn to first
predict blueprint b as p(b|d), and then generate
output s as p(s|b). However, this approach cru-
cially relies on the blueprint being accurate and
capturing all required information, which might
be overly optimistic, given that blueprints (for
training) are generated automatically. Moreover,
pipeline architectures are known to suffer from
error propagation, which in our case would un-
doubtedly affect generation performance, the final
stage of the pipeline.
Rather than modeling the blueprint and output
generation stages separately, we train an encoder-
decoder model to encode d and generate b; s
(i.e., the concatenation of the blueprint and output
sequence) in one go. Essentially, the decoder first
predicts blueprint b and then continues to gen-
erate output s, using both b and d. We prefix b
and s with special markers ‘‘Plan:’’ and ‘‘Sum-
mary:’’, respectively. In particular, we predict b
as a1; q1; . . . ; am; qm, namely, a (concatenated)
sequence of answer-question pairs.3 The model
is trained with the standard maximum-likelihood
objective to generate the augmented target b; s. In-
terestingly, in this end-to-end model the blueprint
functions as a macro-plan, i.e., a global sketch of
the content and organization of the output.
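The augmented target b; s of the end-to-end model could be assembled roughly as follows. The ‘‘Plan:’’ and ‘‘Summary:’’ markers and the answer-before-question order come from the text; the `" ; "` separator and the function name `build_e2e_target` are assumptions made for this sketch.

```python
from typing import List, Tuple

QA = Tuple[str, str]  # (question, answer)


def build_e2e_target(blueprint: List[QA], summary: str) -> str:
    """Serialize a blueprint as a1; q1; ...; am; qm followed by the summary."""
    plan = " ; ".join(f"{answer} ; {question}" for question, answer in blueprint)
    return f"Plan: {plan} Summary: {summary}"


target = build_e2e_target(
    [("Who built the Shelby Mustang from 1969 to 1970?", "Ford"),
     ("What was the Shelby Mustang revived as?", "a new high-performance model")],
    "The Shelby Mustang was built by Ford and later revived.",
)
print(target)
```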
Multi-task Model It is generally challenging
for encoder-decoder models to generate long out-
put sequences (Ko and Li, 2020; Tan et al., 2021).
The end-to-end model sketched above further am-
plifies this problem because it ultimately aims to
generate sequence b; s rather than just s, increas-
ing the sequence length by 220% (see Table 3).
To mitigate this problem, we propose a multi-
task model optimized to perform two separate
tasks. Let a and q denote an ordered sequence
of answers (a1, . . . , am) and corresponding ques-
tions (q1, . . . , qm), in blueprint b. The model is
trained to generate (a) the answer plan concate-
nated with output sequence a; s, and (b) the answer
plan concatenated with questions a; q. In partic-
ular, we train a single encoder-decoder model to
encode input d, while the decoder first predicts
answer plan a (as p(a|d)) and then continues to
generate output s (as p(s|a, d)) or correspond-
ing questions q (as p(q|a, d)), depending on the
task. We prefix a, q, and s with special mark-
ers ‘‘Plan:’’, ‘‘Questions:’’, and ‘‘Summary:’’,
respectively. We further prefix input d with ‘‘Gen-
erate Summary:’’ or ‘‘Generate Questions:’’ to
instruct our model to generate output s or ques-
tions q, respectively. We sample data points from
these two tasks with equal probability and train
the model with the standard maximum-likelihood
objective.
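The two training tasks can be illustrated with the sketch below. The task prefixes and the ‘‘Plan:’’, ‘‘Questions:’’, and ‘‘Summary:’’ markers follow the text; the `" ; "` separator, the function names, and the use of `random.Random` for equal-probability sampling are assumptions for illustration only.

```python
import random
from typing import List, Tuple

QA = Tuple[str, str]  # (question, answer)


def summary_example(document: str, answers: List[str], summary: str) -> Tuple[str, str]:
    """Task (a): answer plan followed by the output sequence."""
    source = f"Generate Summary: {document}"
    target = f"Plan: {' ; '.join(answers)} Summary: {summary}"
    return source, target


def questions_example(document: str, answers: List[str], questions: List[str]) -> Tuple[str, str]:
    """Task (b): answer plan followed by the corresponding questions."""
    source = f"Generate Questions: {document}"
    target = f"Plan: {' ; '.join(answers)} Questions: {' ; '.join(questions)}"
    return source, target


def sample_multitask_example(document: str, blueprint: List[QA], summary: str,
                             rng: random.Random) -> Tuple[str, str]:
    """Draw one of the two tasks with equal probability."""
    answers = [a for _, a in blueprint]
    questions = [q for q, _ in blueprint]
    if rng.random() < 0.5:
        return summary_example(document, answers, summary)
    return questions_example(document, answers, questions)


rng = random.Random(0)
print(sample_multitask_example("doc text", [("Who built it?", "Ford")], "Ford built it.", rng))
```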
During inference, we use a two-step process to
generate output s′ and its blueprint b′ for input d.
3Predicting b as q1; a1; . . . ; qm; am is more natural, but
it led to inferior performance. See the ablation experiments
in Section 5.3.
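| | AQuM | WCSum | SS-FD |
|---|---|---|---|
| # queries | 8,162 | - | - |
| # examples: train | 6,599 | 165,000 | 3,673 |
| # examples: dev | 714 | 8,723 | 338 |
| # examples: test | 849 | 9,166 | 337 |
| source: # docs | 6.46 | 135.56 | 1.00 |
| source: # words | 12,986.88 | 7455.75 | 8051.74 |
| source: # sentences | 339.62 | 307.80 | 804.01 |
| source: # words/doc | 2,008.38 | 52.02 | 8051.74 |
| target (original): # words | 114.07 | 115.61 | 126.73 |
| target (original): # sentences | 3.65 | 4.74 | 5.26 |
| target (original): novel unigrams | 0.02 | 0.13 | 0.17 |
| target (original): novel bigrams | 0.13 | 0.54 | 0.66 |
| target (original): novel trigrams | 0.24 | 0.78 | 0.92 |
| target (original): novel 4-grams | 0.31 | 0.86 | 0.98 |
| target (+blueprint): # QA-Pairs | 8.16 | 9.56 | 28.10 |
| target (+blueprint): # words | 272.68 | 291.28 | 597.90 |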
Table 3: Summary statistics for the datasets used
in this work (AQuM, WCSum, and SS-FD are
shorthands for AQuaMuSe, WikiCatSum, and
SummScreen-FD, respectively). We report on the
number of queries, size of training, development,
and test set, and average source and target length
(in terms of documents, words, sentences, and
words per document). We quantify the
abstractiveness of the target by measuring the proportion
of n-grams unseen in the source. We also report
statistics on the target length augmented with the
blueprint (number of QA pairs and words in total).
We first prefix d with ‘‘Generate Summary:’’ and
generate a′; s′, i.e., answer plan a′ followed by
output sequence s′. We then prefix d with
‘‘Generate Questions:’’, prompt our decoder with the
predicted answer plan a′ and generate corresponding
questions q′ for blueprint b′. The multi-task
model alleviates the length issue discussed above
by learning to generate a; s instead of b; s.
However, this comes at the expense of generation
quality, since the model now conditions on the
answers only, not question-answer pairs. As such,
it can be viewed as an extension of FROST
(Narayan et al., 2021) with the plan being a
sequence of answer spans rather than entity chains.
This model also creates a macro-plan of the
output, although a less detailed one compared to the
end-to-end model.
Iterative Model Rather than predicting a global
plan (i.e., answer plan a or blueprint b) prior
to generating output s, we employ an incremental
approach that interleaves planning with text
generation. Let output s consist of n sentences
{s1, s2, . . . , sn}; then, the corresponding blueprint
b can be represented as {b1, b2, . . . , bn}, where
$b_i : \{(a^i_{j+1}, q^i_{j+1}), \ldots, (a^i_{j+k}, q^i_{j+k})\}$ consists of k
question-answer pairs for sentence si. We train
our model to iteratively plan and generate one
sentence at a time, conditioning on the input and the
output sentences generated so far. In particular,
we train an encoder-decoder model where the
encoder first encodes input d, while the decoder
takes summary {s1, . . . , si} generated so far as a
prompt and generates blueprint bi+1 for next
sentence si+1, followed by sentence si+1 itself.
The iterative model is trained on quadruples
{(d, φ, b1, s1), . . . , (d, s1,i, bi+1, si+1), . . . ,
(d, s1,n−1, bn, sn), (d, s, bend, send)}, where φ is
an empty context placeholder used to predict
the first blueprint b1 and corresponding first
sentence s1, (n + 1) is the blueprint length,
and s1,i = {s1, . . . , si} are the output sentences
generated so far; bend and send are special
tokens marking the end of the output prediction.
We prefix s1,i, bi, and si with special markers
‘‘Context:’’, ‘‘Plan:’’, and ‘‘Next Sentence:’’,
respectively. We train the model with the
standard maximum-likelihood objective to
predict s1,i; bi; si; however, we do not compute
the loss for predicting context s1,i to avoid
over-optimizing for sentences that appear at the
beginning of the output.
The iterative approach does not create a global
macro plan. Rather, it learns micro content plans
and verbalizes them one-by-one, conditioning on
previously generated sentences but not on
previously generated QA pairs. Although it does not
have a global document view like the end-to-end
model, the iterative decoder cannot exceed the
output sequence length as it plans and predicts
one sentence at a time as bi; si, instead of
generating b; s in one go. And unlike the multi-task
model, each sentence si is generated by
conditioning on the full blueprint bi (consisting of
questions and answers).
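The training quadruples of the iterative model can be unrolled from a (d, b, s) example as sketched below. The ‘‘Context:’’, ‘‘Plan:’’, and ‘‘Next Sentence:’’ markers follow the text; the dictionary field names, the `<b_end>` and `<s_end>` strings, and the answer-before-question serialization inside each sentence plan are assumptions made for this illustration. The context is kept as a separate decoder prompt because no loss is computed on it.

```python
from typing import Dict, List, Tuple

QA = Tuple[str, str]  # (question, answer)


def format_plan(plan: List[QA]) -> str:
    """Serialize one sentence-level plan (assumed answer-first order)."""
    return " ; ".join(f"{a} ; {q}" for q, a in plan)


def build_iterative_examples(document: str,
                             sentences: List[str],
                             plans: List[List[QA]],
                             b_end: str = "<b_end>",
                             s_end: str = "<s_end>") -> List[Dict[str, str]]:
    """Unroll one (d, b, s) example into n + 1 iterative training examples."""
    if len(sentences) != len(plans):
        raise ValueError("expected one plan per sentence")
    examples = []
    for i in range(len(sentences) + 1):
        # An empty context plays the role of the placeholder for the first step.
        context = " ".join(sentences[:i])
        prompt = f"Context: {context}".strip()
        if i < len(sentences):
            target = f"Plan: {format_plan(plans[i])} Next Sentence: {sentences[i]}"
        else:
            target = f"Plan: {b_end} Next Sentence: {s_end}"
        examples.append({"encoder_input": document,
                         "decoder_prompt": prompt,  # excluded from the loss
                         "target": target})
    return examples


for ex in build_iterative_examples(
        "doc text",
        ["Shelby American built it.", "Ford revived it."],
        [[("Who built it?", "Shelby American")], [("Who revived it?", "Ford")]]):
    print(ex["decoder_prompt"], "=>", ex["target"])
```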
4 Experimental Setup
4.1 Datasets
We evaluated our model on benchmarks repre-
sentative of long-form question answering and
summarization. Our datasets vary in terms of the
input given to the generation model (e.g., mul-
tiple documents or one, web pages, or dialogue
transcripts), the user’s information need (e.g., an-
swering a question or aggregating information),
and summary style (e.g., genuinely abstractive
vs extractive). Common features among them are
very long inputs and multi-sentence output sum-
maries. We summarize various dataset statistics in
Table 3.
AQuaMuSe (Kulkarni et al., 2020, 2021) is
a query-focused multi-document summarization
dataset; it was created with the intent of simulating
how a search engine might synthesize documents
of high relevance to a user query. It consists of
Google Natural Questions (Kwiatkowski et al.,
2019) paired with web documents extracted from
Common Crawl and long-form answers from
Wikipedia. We approach this task as a genera-
tive QA problem where we take the query and
associated web documents and generate a long-
form answer to the query. We work on the split
from Kulkarni et al. (2021); on average, each
instance has 6.46 web documents (2,008 tokens
per document), leading to very long input (12,987
tokens).
WikiCatSum (Perez-Beltrachini et al., 2019)
is a topic-focused multi-document summarization
dataset where the goal is to generate Wikipedia
abstracts (i.e., lead article sections) from a large
set of webpages related to an entity or a topic. It
focuses on three entities, namely, Films (59,973
instances), Companies (62,545 instances), and
Animals (60,816 instances). In experiments, we
collate the different data subsets into one, which
we refer to collectively as WikiCatSum. The in-
put webpages are truncated to the first 800 tokens.
SummScreen-FD (Chen et al., 2022) is a re-
cently released dialogue summarization dataset. It
contains transcripts of TV episodes (e.g., Game
of Thrones, CSI Las Vegas) and corresponding
(community authored) summaries. The original
dataset is divided into two complementary subsets;
we use the ForeverDreaming (FD) subset released
as part of the SCROLLS benchmark (Shaham et al.,
2022), which incorporates episodes from 88 dif-
ferent shows. SummScreen-FD is a challenging
testbed for several reasons. Plot details are of-
ten expressed indirectly in conversations between
characters and are scattered across the entire tran-
script. The summarization task is highly compressive:
a transcript the size of a book (on average
8,000 tokens; see Table 3) is condensed into a
few sentences, and the evaluation of such sum-
maries comes with its own challenges (e.g., it is
not realistic to expect humans to read the tran-
script to be able to assess their quality).
We further analyze the characteristics of these
datasets in Table 3. Long-form answers in AQua-
MuSe are mostly extractive with only 2%, 13%,
24%, and 31% novel unigrams, bigrams, trigrams,
and 4-grams, respectively. In comparison,
summaries in WikiCatSum and SummScreen-FD
are more abstractive; WikiCatSum abstracts have
13% novel unigrams, 54% bigrams, 78% trigrams,
and 86% 4-grams, whereas in SummScreen-FD
summaries 17% unigrams, 66% bigrams, 92%
trigrams, and 98% 4-grams were not seen
in the training. Interestingly, SummScreen-FD
summaries have far more propositions than AQua-
MuSe or WikiCatSum targets, leading to a much
higher number for QA pairs in their blueprints
(28.10 vs 8.16 or 9.56). This in turn makes
the generation task for end-to-end models very
challenging. The average summary length to-
gether with the blueprint annotations (i.e., b; s) for
SummScreen-FD is almost twice the size of Wi-
kiCatSum and AQuaMuSe (597.90 vs 291.28 and
272.68). The majority of questions in AQuaMuSe
and WikiCatSum are what questions (76.0% and
74.2%, respectively), followed by who, where,
when, and how questions. For SummScreen-FD,
what and who questions are most popular (50.1%
and 42.9%, respectively).
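The novel n-gram proportions quoted above can be computed along the following lines. This is a sketch under stated assumptions: whitespace tokenization and counting over n-gram occurrences (rather than unique types) are choices made here, not details given in the text.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(source_tokens: List[str], target_tokens: List[str], n: int) -> float:
    """Fraction of target n-grams that never occur in the source."""
    target = ngrams(target_tokens, n)
    if not target:
        return 0.0
    source = set(ngrams(source_tokens, n))
    return sum(gram not in source for gram in target) / len(target)


src = "a b c d".split()
tgt = "a b x".split()
print(novel_ngram_ratio(src, tgt, 1), novel_ngram_ratio(src, tgt, 2))
```

A value close to 0 indicates a largely extractive target, while a value close to 1 indicates a highly abstractive one.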
4.2 Comparison Systems
All our experiments used LONGT5 (Guo et al.,
2021), an extension of the original T5 encoder
(Raffel et al., 2020) with global-local atten-
tion sparsity patterns to handle long inputs. We
compared a vanilla LONGT54 model (xl, 3B pa-
rameters) fine-tuned on our datasets (with a
maximum input sequence length of 4,096 tokens
and a maximum output length of 512 tokens)
against several blueprint variants. These include
an end-to-end LONGT5 model (E2E) which first
decodes blueprint b and then continues to decode
output s; a LONGT5 multitask model (MULTITASK)
which jointly learns to predict the answer plan
4We used the publicly released checkpoints from
https://github.com/google-research/longt5.
followed by either the output s or the questions in
b; and a LONGT5 iterative model (ITERATIVE) which
plans and generates one sentence at a time.
In addition, we implemented a two-stage model
(2-STAGE), which first creates blueprint b given
input d and then generates output s given b and
d as input. Finally, we also fine-tuned T5 (xl, 3B
parameters) on our datasets with a maximum input
sequence length of 1,024 tokens and a maximum
output length of 256 tokens, as a baseline. We
present these comparisons in Table 4 together
with the performance of various state-of-the-art
systems.
We fine-tuned all our models with a learning rate
of 0.001 and a batch size of 128, for 50K steps.
We select the best checkpoints using average Rouge
performance on validation sets. During inference,
we use beam search with size 5 and alpha 0.8.
5 Automatic Evaluation
In this section we present experimental results
using automatic evaluation metrics that assess
overall summary (and blueprint) quality. More-
over, we quantify the extent to which automati-
cally generated output is grounded to the blueprint
and faithful to the input document/s.
5.1 Metrics
Summary and Blueprint Quality We evaluate
summary quality automatically using (summary-
level) Rouge F1 (Lin and Hovy, 2003). We report
only RougeLSum5 in Table 4 for the sake of
brevity. We also use RougeLSum to evaluate the
quality of the automatically generated blueprint,
i.e., the QA pairs and their order against the ref-
erence blueprint.
Informativeness and Grounding We evaluate
informativeness using QA-based metrics. Spe-
cifically, following the reading comprehension
literature (Rajpurkar et al., 2016, 2018b), we
quantify the extent to which the generated text
can answer all questions from its reference (Infor-
mativeness) and predicted blueprint (Grounding).
Following Stelmakh et al. (2022), we use a
RoBERTa model (Liu et al., 2019) fine-tuned on
SQuAD-V2 for question-answering in both cases.6
5RougeLSum is very similar to ROUGE-L; while the
latter is calculated on the summary as a whole, RougeLSum
interprets newlines as sentence boundaries.
6This is a high performing model reaching 86.8%
exact-match accuracy and 89.8% F1 on SQuAD.
Table 4: Results on AQuaMuSe, WikiCatSum,
and SummScreen-FD test sets. Baseline and ear-
lier SOTA models are presented in the top block
and all blueprint models are shown in the bottom
block. Models marked with * generate extractive
summaries. HIBERT, TextRank, and SIBERT re-
sults on AQuaMuSe are taken from Kulkarni et al.
(2021). BART and REFLECT (extract-then-abstract)
results are taken from Song et al. (2022). Hybrid
R2T-BART (content selection + generation) results
are taken from Chen et al. (2022). Best results
for each task are boldfaced. Scores that are not
significantly different (using paired bootstrap re-
sampling; p < 0.05) from the best score in each
column are marked with a dagger (†).
Given generated text $s'$ and question-answer pair
$(q_i, a_i)$ from the (reference or predicted) blueprint,
we apply our question-answering model to
$s'$ to predict answer $a'_i$ to question $q_i$. We then
compute the token-level F1 score between predicted
answer $a'_i$ and ground truth answer $a_i$, and
report the average.
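The QA-based score described above can be sketched as follows. This is a minimal illustration, assuming SQuAD-style answer normalization (lowercasing, removing punctuation and English articles), which the text does not spell out. The `answer_fn` callable is a hypothetical stand-in for the fine-tuned question-answering model; it takes a question and a text and returns an answer string (empty when no answer is found).

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(predicted, gold):
    """Token-level F1 between a predicted answer and a ground-truth answer."""
    pred_tokens, gold_tokens = normalize(predicted), normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_score(blueprint, answer_fn, generated_text):
    """Average token F1 over all (question, answer) pairs of a blueprint."""
    scores = [token_f1(answer_fn(q, generated_text), a) for q, a in blueprint]
    return sum(scores) / len(scores) if scores else 0.0
```

Passing the reference blueprint to `qa_score` would correspond to Informativeness, and passing the predicted blueprint to Grounding.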
Faithfulness Hallucinations are a widely known
issue with neural abstractive summarization (Song
et al., 2018; Maynez et al., 2020; Kryscinski et al.,
2020; Gabriel et al., 2021), especially when a
sentence combines content from multiple sources
(Lebanoff et al., 2019).
Following previous work (Maynez et al., 2020;
Falke et al., 2019; Narayan et al., 2022; Honovich
et al., 2022; Dušek and Kasner, 2020), we quantify
the extent to which generated summaries are
faithful to their input using textual entailment. We
resort to textual entailment for two reasons: firstly,
it is a relatively intuitive metric, since all information
in a summary should be entailed by the source
or at least not conflict with it; secondly, recent
studies (Maynez et al., 2020; Fischer et al., 2022)
have shown that it correlates with human judgments
of faithfulness across summarization datasets
and tasks.
Following Honovich et al. (2022), we trained an
entailment model by fine-tuning T5-11B (Raffel
et al., 2020) on the Adversarial NLI dataset
(ANLI; Nie et al., 2020). For each sentence (hy-
pothesis) in the summary, we compute its entail-
ment probability given the input (premise) and
report the average across all sentences to obtain
an overall score (Maynez et al., 2020).
More formally, let E denote a textual entailment
model that predicts E(a, b), namely, that
text b is entailed by text a. The faithfulness score
F of summary s containing sentences $s_1, \ldots, s_n$
with respect to input D is computed as:
$$F(s) = \frac{1}{n} \sum_{i=1}^{n} E(D, s_i)$$
where n is the number of sentences in the sum-
mary. If the input is longer than the T5 maximum
encode length, we split it, calculate the entail-
ment probability per split, and take the maximum.
We convert probabilities to binary labels using a
threshold (1 if > 0.5, and 0, otherwise).
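The computation of F(s), including the splitting of long inputs and the thresholding step, can be sketched as below. This is an illustration under stated assumptions rather than the actual implementation: `entail_prob` is a hypothetical callable standing in for the fine-tuned entailment model, the input is treated as a list of whitespace tokens, and the chunk size `max_len=512` is a placeholder since the text does not state the maximum encode length.

```python
def split_input(tokens, max_len):
    """Split a tokenized input into consecutive chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)] or [[]]


def faithfulness(summary_sentences, input_tokens, entail_prob,
                 max_len=512, threshold=0.5):
    """Average binary entailment label over summary sentences, i.e. F(s)."""
    if not summary_sentences:
        return 0.0
    chunks = split_input(input_tokens, max_len)
    labels = []
    for sentence in summary_sentences:
        # Score the sentence against every split of the input, keep the maximum.
        prob = max(entail_prob(" ".join(chunk), sentence) for chunk in chunks)
        # Convert the probability to a binary label (1 if > threshold, else 0).
        labels.append(1 if prob > threshold else 0)
    return sum(labels) / len(labels)
```

Taking the maximum over splits means a sentence counts as entailed if any single split of the input supports it.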
We further validated our ANLI entailment
scores against human judgments of faithfulness
elicited as part of SummEval (Fabbri et al., 2021),
a recently released dataset for assessing automated
summarization metrics. Our entailment predic-
tions correlate well with human ratings, achiev-
ing a Spearman’s rank correlation of ρ = 0.774.
5.2 Results
Why LONGT5 for Blueprint Models All the
tasks we are dealing with require modeling input
of highly complex nature, which is often very
long (see Table 3). Our results in Table 4
(see Rouge/summary column) demonstrate that
T5 models always fall behind LONGT5, un-
derscoring the importance of sparse attention
mechanisms for modeling long inputs. In fact,
LONGT5 sets a new state of the art on AQua-
MuSe and SummScreen-FD. On WikiCatSum, it
is slightly worse than REFLECT (Song et al., 2022),
an extract-then-abstract model which has a dedicated
content selection module. Similar content
selection techniques could also benefit LONGT5;
however, we leave this to future work. We henceforth
use LONGT5 as a base model for fine-tuning
our blueprint models.
Blueprint Models and Rouge Compared to
LONGT5, blueprint variants slightly underperform
on AQuaMuSe, but score better on WikiCatSum
and SummScreen-FD (see MULTITASK model). All
differences between LONGT5 and blueprint models
are statistically significant (using paired bootstrap
resampling; p < 0.05). For a fair comparison, we
always use a maximum decoder length of 512
tokens. With the exception of AQuaMuse, E2E
is inferior to other blueprint models, which is not
surprising since it has to generate much longer text
(recall it predicts b; s rather than simply s). Over-
all, MULTITASK is significantly better than other
blueprint models on WikiCatSum but on par with
ITERATIVE on SummScreen-FD.
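For readers unfamiliar with the significance test named above, paired bootstrap resampling can be sketched as follows. The text only names the test, so the details here are assumptions: the inputs are per-example scores for two systems on the same test instances, and `num_samples` and `seed` are illustrative parameters.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A fails to beat system B."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    losses = 0
    for _ in range(num_samples):
        # Draw the same example indices for both systems (paired resampling).
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) <= sum(scores_b[i] for i in idx):
            losses += 1
    return losses / num_samples
```

Under this sketch, a returned value below 0.05 would be read as system A being significantly better than system B at p < 0.05.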
Similar patterns emerge when evaluating the
predicted blueprints against reference QA pairs,
with ITERATIVE significantly outperforming the
other two variants on SummScreen-FD. This could
be due to the fact that SummScreen-FD summaries
have far more propositions than AQuaMuSe or
WikiCatSum targets; it is better to predict them
one sentence at a time, rather than all together.
With regard to WikiCatSum, the difference be-
tween MULTITASK and ITERATIVE is not significant
(although MULTITASK has a slight numerical advan-
tage) and both systems are significantly better than
E2E. On AQuaMuSe, MULTITASK is significantly
better than E2E and ITERATIVE.
Note that all 2-STAGE models are significantly
worse in comparison to blueprint variants, when
evaluating either their blueprints or summaries
(in terms of Rouge). While our models learn
to optimize blueprints and summaries together,
2-STAGE models are faced with the harder task of
predicting the blueprint solely based on the input
(text-to-data). Since the blueprints learned by the
first stage are of poor quality, the summaries
generated in the second stage are also inferior.
Blueprint Models and Informativeness Our
blueprint annotation of reference summaries nat-
urally provides a more principled alternative to
Rouge. We can now use QA pairs in reference
blueprints to evaluate the informativeness of predicted
summaries. Results follow a pattern overall
similar to Rouge; however, this approach reveals
the complexity of the different generation tasks
better than Rouge. While we were able to achieve
reasonably high Rouge across datasets, we are
far from generating informative summaries. On
SummScreen-FD, in particular, we achieve a max-
imum Rouge score of 31.88, but are able to answer
correctly only 7.59% of reference questions using
the predicted summaries.
Across datasets, LONGT5 performs on par with
MULTITASK (the difference between the two models
is not statistically significant), and the same is true
of ITERATIVE on SummScreen-FD.
Blueprint Models and Grounding The E2E
and ITERATIVE variants are significantly better
than MULTITASK in generating texts grounded to
their predicted blueprints (see ground. column in
Table 4). This is because both models generate
text conditioned on their blueprints; E2E first pre-
dicts blueprint b and then continues to generate
output s using both b and the input, whereas ITER-
ATIVE plans and generates one sentence at a time
as bi; si. This is not the case with MULTITASK,
which generates s conditioned on answer spans
only. E2E performs slightly better than ITERA-
TIVE on AQuaMuSe and WikiCatSum (differences
are not statistically significant) but struggles on
SummScreen-FD, where summaries are longer
with more facts/propositions, requiring inference
over long-range dependencies, and common sense
reasoning. ITERATIVE seems the best option for
grounded generation without sacrificing informa-
tiveness (ITERATIVE is most informative amongst
blueprint models on SummScreen-FD, second best
on AQuaMuSe, and third best on WikiCatSum).
ITERATIVE Is Most Faithful Model As far as
faithfulness is concerned, ITERATIVE performs consistently
better than E2E and MULTITASK, as well
as T5 and LONGT5 models where text is generated
from scratch without any planning (pairwise
differences between ITERATIVE and comparison
systems are all significant with the exception
of E2E on AQuaMuSe). On SummScreen-FD,
ITERATIVE brings large gains on faithfulness without
sacrificing informativeness (both in terms of
Rouge and QA-F1). The ANLI score for ITERATIVE
is 20.84, whereas it is below 10 for E2E and MULTITASK.
E2E outperforms LONGT5 on AQuaMuSe
and WikiCatSum, but gains are smaller compared
to ITERATIVE.
Table 5: System output and reference summary for
SummScreen-FD (CSI S6.E9, ‘‘Dog Eat Dog’’).
Propositions which are not grounded to the input
are highlighted. Generated questions from blueprint
models are not shown due to space constraints.
We show examples of system output in Table 5,
highlighting propositions that are not grounded to
the input. E2E summaries are shorter, which
is somewhat expected; the model has to decode
both the plan and the summary and in cases where
the blueprint is large (e.g., in SummScreen-FD),
there is no more room to decode the summary.
MULTITASK is more verbose; however, the plan (a
sequence of answer spans) is less detailed and
as a result the summary is less accurate (Jackpot’s
pretzels is a restaurant, not a killer). ITERATIVE
contains many details in the summary, more than
the reference, which are not hallucinations. Both
R2T-BART and LONGT5 are rather loose with the
facts and generate multiple hallucinations.
Table 6: Example of plan/summary generated by
our E2E blueprint model as answer to the question
‘‘What is the difference between an Old English
Bulldog and an English Bulldog?’’ (AQuaMuSe
test set); user edits to the plan and the updated
summary are highlighted.
Blueprint Models are Controllable Our con-
ceptualization of text plans as QA pairs brings
inherent controllability to the generation process.
By changing the blueprint, we can control content
selection (i.e., what to say) and planning (i.e.,
in what order) without retraining the model or
introducing additional control mechanisms. We
provide an example in Table 6 where the plan pre-
dicted by the E2E model has been edited to render
it more coherent and factual. As can be seen, the
model is able to change its output according to
the modified plan. Another example is shown in
Table 7, where the output is rendered shorter by
removing QA pairs from the predicted plan.
Table 7: Example of plan/summary generated by
the E2E blueprint model as answer to the question
‘‘What section of the world or country is hinduism
usually found in?’’ (AQuaMuSe test set); the
part of the plan which is removed by the user and
the shorter summary generated from the elided
plan are highlighted.
We are also able to control the faithfulness
of predicted summaries as follows. We take the
predicted plan and remove question-answer pairs
(E2E, ITERATIVE) or answer spans (MULTITASK)
that cannot be answered based on the input. We
then prompt our decoder with the modified plan
and generate a new summary (or sentence for
ITERATIVE). In Table 8, we quantitatively evaluate
+drop variants, which are controlled for
faithfulness, against vanilla blueprint models. We
observe improvements in entailment scores across
the board (see column entail. in the table), with
ITERATIVE+drop performing best. Improvements
on abstractive datasets (WikiCatSum and
SummScreen-FD) are larger compared to AQuaMuSe
which is mostly extractive (see Table 3).
The minor drop in Rouge and informativeness
is somewhat expected as the models now zoom
in on information they can reliably talk about,
improving the consistency of the output.
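The +drop control can be illustrated with a short sketch. It rests on assumptions that the text does not confirm: a question counts as unanswerable when a hypothetical `answer_fn` (a stand-in for a question-answering model) returns an empty string on the input, and the `Q: ... A: ...` prompt layout in `serialize_plan` is purely illustrative and not the format used by the blueprint models.

```python
def drop_unanswerable(plan, input_text, answer_fn):
    """Keep only QA pairs whose question receives a non-empty answer from the input."""
    return [(q, a) for q, a in plan if answer_fn(q, input_text).strip()]


def serialize_plan(plan):
    """Flatten a plan into a decoder prompt (hypothetical layout)."""
    return " ".join(f"Q: {q} A: {a}" for q, a in plan)


# Usage: filter the predicted plan, then prompt the decoder with what remains.
predicted_plan = [("Who wrote it?", "Ann"), ("Where was it set?", "Mars")]
toy_answer_fn = lambda q, text: "Ann" if q.startswith("Who") else ""
filtered = drop_unanswerable(predicted_plan, "Ann wrote the book.", toy_answer_fn)
prompt = serialize_plan(filtered)
```

For ITERATIVE, the same filtering would be applied to the plan of each sentence before that sentence is generated.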
Finally, we also experiment with creating sim-
ple summaries, by forcing the ITERATIVE model to
generate from a single question-answer pair on
each iteration (see +Q1 variant in Table 8). In the
example shown in Table 9, ITERATIVE+Q1 produces
simple summary sentences, each focusing on a sin-
gle information element. Interestingly, as far as
the ITERATIVE model is concerned, +Q1 variants
are as faithful as +drop ones even if they do
not explicitly control for faithfulness (across data-
sets the differences between the two models are
not statistically significant). This suggests that
controlling for simplicity might be sufficient to
reduce hallucinations, however, at the expense of
informativeness (Rouge scores for +Q1 variants
tend to be significantly worse compared to +drop
counterparts).
Most of the controllability cases we illustrate
here are fully automatic and could be conceptualized
as system flags that users select according
to requirements (e.g., low tolerance for hallucinations,
shorter summaries for small screen
displays). Another potential use case would be
to generate summaries for a set of questions provided
by the user. Their input might be articles
retrieved as an answer to a query, or in an educational
context several chapters on a topic (e.g., cell
biology). However, we leave this to future work.
Table 8: Controllability results on the AQuaMuSe,
WikiCatSum and SummScreen-FD test
sets. Lighter blue color means more control. Best
results for each metric are boldfaced. Scores that
are not significantly different (using paired bootstrap
resampling; p < 0.05) from the best score
for each column are marked with a dagger (†).
Table 9: System output from ITERATIVE and ITERATIVE+Q1
generating WikiCatSum abstract on
‘‘Abraham Verghese.’’
5.3 Ablation Studies
As described in Section 3.2, we construct blueprint
annotations using the Rheme- and Coverage-based
selection strategies. Table 10 presents various
ablations that provide rationales for these
annotation choices. For the sake of brevity, we report
experiments with the E2E model trained (for
50,000 steps) on AQuaMuSe. We observe very
similar trends on the other two datasets. As can
be seen, it is empirically better to form blueprints
from answer-question pairs rather than predicting
the questions first and then their answers, which
is more natural (at least to humans). We further
assessed whether sorting the QA pairs based on
how they appear in the summary matters by defaulting
to a random ordering (see −Sorted in the
table). Removing either Rheme or Coverage has
a small negative impact on the summaries but
not their blueprints, while removing them both
is detrimental to summary quality. The absence of
Sorting mostly affects the quality of the
blueprint. It is not surprising that sorting is most
important to generating a blueprint with correctly
ordered propositions.
Table 10: E2E model trained on AQuaMuSe with
different selection and sorting (validation set).
E2E                             Rouge (RLSum)
                                summary   blueprint   both
QA Plan, Rheme, Covg, Sorted    48.75     39.06       44.31
AQ Plan, Rheme, Covg, Sorted    50.86     39.95       45.60
−Sorted, Random                 50.79     36.08       43.43
−Rheme                          47.16     40.70       44.19
−Coverage                       47.02     41.37       44.79
−Rheme, −Coverage               18.05     42.54       40.90
6 Human-based Evaluation
In addition to automatic evaluation, we conducted
three human-based studies assessing different di-
mensions of output quality. Wishing to avoid
well-documented issues7 with automated bots on
Amazon Mechanical Turk and crowdworkers run-
ning through HITs as quickly as possible without
paying attention to the tasks, we used a few trained
annotators. They were given task-specific instruc-
tions and went through several pilots to iron out
disagreements on edge cases.8
6.1 Summary Quality
Our first study assessed overall summary quality.
Specifically, we asked our annotators to select the
best among three system summaries taking into ac-
count how much they deviated from the reference
in terms of informativeness (are the summaries on
topic or emphasize irrelevant details?) and over-
all fluency. We adapted the definition of fluency
provided in Howcroft et al. (2020): Does the text
‘flow well’ or is it a sequence of unconnected
parts?
We conducted our annotation study on 100 instances
randomly sampled from each of AQuaMuSe,
WikiCatSum, and SummScreen. We collected ratings
from three annotators (after two rounds of
pilot studies to improve agreement) for the output
of seven systems. Overall, we obtained 100
(instances) x 3 (datasets) x 6 (systems) x 3 (annotators)
= 5,400 annotations. Annotator agreement
was 97.11%. Our results are presented in Table 11.
We report the percentage of times each system was
ranked best.
In general, we observe that LONGT5 and
blueprint models based on it are perceived as
significantly better than previous state-of-the-art
models (i.e., SIBERT and R2T-BART). On AQua-
Muse, LONGT5 is rated overall best, followed by
E2E and MULTITASK (however, differences be-
tween them are not statistically significant). On
WikiCatSum, E2E is rated best but is not significantly
different compared to the other models.
On SummScreen, our ITERATIVE variant is rated
best followed by LONGT5. These results mirror the
difficulty of the task (see Table 3): the longer the
input/output, the better ITERATIVE performs.
7https://stanforddaily.com/2020/06/21/.
8We release our instructions and annotation templates
together with our data and models.
Table 11: Proportion of times each system was
ranked best for summary quality (on AQuaMuSe,
WikiCatSum, and SummScreen test sets). Best
results for each task are boldfaced. Systems in
each column are marked with † when they are
not significantly different from the best system;
unmarked pairwise differences from the best system
are significant (p < 0.01; using Friedman’s
ANOVA test with post-hoc Wilcoxon signed-rank
test, Bonferroni corrected for multiple
comparisons).
Table 12: Blueprint quality human evaluation on
AQuaMuSe, WikiCatSum, and SummScreen-FD
test sets. Mean scores for coherence (Coh; higher
is better) and proportion of QA pairs deemed redundant
(Red; lower is better). Best results for
each task are boldfaced. Systems in each column
are marked with † when they are not significantly
different from the best system; unmarked
pairwise differences from the best system are significant
(p < 0.01; using a Friedman’s ANOVA
test with post-hoc Wilcoxon signed-rank test,
Bonferroni corrected for multiple comparisons).
6.2 Blueprint Quality
We further evaluated the predicted plans more
directly. Participants were shown QA blueprints
and asked to assess whether they tell a coherent
story (are they all relevant and ordered compre-
hensively?) using a 3-point scale (where 3 is best
and 1 is worst). They were also asked to evaluate
whether the plans have redundant QA pairs; a QA
pair is redundant if it does not add new infor-
mation to the plan. We collected judgments for
the same instances used in our summary quality
evaluation from three annotators whose overall
agreement was 97.87% and obtained a total of
100 (instances) x 3 (datasets) x 5 (systems) x 3
(raters) = 4,500 annotations.
Table 13: Human evaluation results for blueprint grounded generation on AQuaMuSe, WikiCatSum,
and SummScreen-FD test sets. Proportion of QA pairs not mentioned in the summary (Absent; lower
is better); proportion of QA pairs with information contradictory to the summary (Contra; lower is
better), and mean scores for new information present in the summary (NewInfo; lower is better). The
best results for each task are boldfaced. Systems in each column are marked with † when they are
not significantly different from the best system; unmarked pairwise differences from the best system
are significant (p < 0.01; using a Friedman’s ANOVA test with post-hoc Wilcoxon signed-rank test,
Bonferroni corrected for multiple comparisons).
Table 12 shows the results of this study. We
report mean scores per dataset for all blueprint
models. As an upper bound, we further elicited
annotations for blueprints automatically created
from gold standard reference summaries (see row
Gold in the table). E2E generates the most co-
herent blueprints: Differences between E2E and
all comparison systems are statistically significant
with the exception of the gold standard. This is
not surprising, since all QA pairs in E2E are gen-
erated together, whereas in MULTITASK the spans
and their corresponding questions are generated
separately. ITERATIVE only generates QA pairs for
a sentence at a time and thus we would not expect
it to be more coherent than models which generate
a global document plan. With regard to redun-
dancy, ITERATIVE blueprints are generally most re-
dundant, which is again down to not having a
global view of previously generated QA pairs.
ITERATIVE further underscores issues with our
question generation technology, which is far from
perfect; for example, several QA pairs are different
on the surface but actually semantically equivalent.
However, we have no means of detecting
this without robust coreference resolution.
6.3 Blueprint Grounded Generation
We next examine whether model summaries are
grounded to their blueprints. Specifically, we
asked our annotators to decide whether each QA
pair in the blueprint is mentioned in the summary,
and report the number of times it isn’t. Ideally, we
would like the summary to follow the blueprint
as closely as possible. For QA pairs mentioned in
the summary, we further asked our annotators to
highlight whether the intent of the question was
preserved or contradicted (we report the number of
contradictions). Finally, we also asked participants
to decide whether the summary has additional in-
formation which cannot be found in its blueprint,
using a 3-point scale (where 3 is for summaries
with lots of new information and 1 is for sum-
maries with no new information). We elicited
annotations for blueprint models, and, as an upper
bound, for gold summaries and blueprints extrap-
olated from them. We obtained 100 (instances)
x 3 (datasets) x 5 (systems) x 3 (raters) = 4,500
judgments.
The results of our grounding experiments are
summarized in Table 13. Across datasets, we ob-
serve that ITERATIVE summaries are most grounded.
ITERATIVE blueprints have the least number of
questions that are absent from or contradict their
generated texts. ITERATIVE summaries also display
the least amount of new information in relation
to their blueprints. ITERATIVE+drop is slightly less
grounded compared to ITERATIVE; however, this is
not entirely surprising since we prompt the ITERATIVE
model with externally modified blueprints
(see ITERATIVE+drop in Table 13). Note that ITERATIVE+drop
summaries are deemed more faithful
than ITERATIVE summaries in automatic evaluation.
The entailment scores improve for all three data-
sets (see Table 4).
7 Conclusion
In this work we proposed a novel plan-based
approach to conditional generation. We concep-
tualized text plans as a sequence of QA pairs
operating as a proxy for what to say and in what
order. We developed Transformer-based models
that generate by conditioning on a global QA
blueprint plan (E2E, MULTITASK) or iteratively
by planning and generating one sentence at a
time (ITERATIVE). Experimental results across three
challenging datasets demonstrate that blueprint
models are inherently more informative than
vanilla sequence-to-sequence approaches without
a planning component. Among the three models presented
here (E2E, MULTITASK, ITERATIVE), we find that
ITERATIVE is the best choice for grounded generation
and that it suggests a promising direction for
long-form generation.
Blueprint models offer several advantages com-
pared to blackbox generation. Model predictions
can be examined, and errors can be traced
back to the blueprint, which in turn can reveal
whether the output is informative and faithful to
its input. The formulation of the blueprint plan
as question-answer pairs makes it intuitive and
user-friendly. We have discussed how blueprint
models might be used in a human-in-the-loop set-
ting, where users interact with and influence model
predictions directly, e.g., by editing the blueprint
length and content (as different blueprints lead
to different outputs). In the future, we would
like to use blueprints more directly to advance
methods for training language models using re-
ward learning (Sutton and Barto, 2018), e.g.,
based on whether the output answers the blueprint
questions. Rather than eliciting expensive human
feedback (Stiennon et al., 2020), blueprints could
provide a cheaper automatic alternative. Finally,
although we focused primarily on the generation
problem in this work, we believe blueprints might
also be useful as a general-purpose approach to
retrieving and organizing important content, es-
pecially when faced with many and very long
inputs.
Acknowledgments
We thank the action editor and our reviewers
for their valuable feedback. The human rating
process was managed by Muqthar Mohammad,
Kiranmai Chennuru, Ashwin Kakarla and their
team; without them this work would not have
been possible. Thanks for invaluable support from
Sheila de Guia and Suneet Dhingra.
References
Chris Alberti, Daniel Andor, Emily Pitler, Jacob
Devlin, and Michael Collins. 2019. Synthetic
QA corpora generation with roundtrip consistency.
In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 6168–6173, Florence,
Italy. Association for Computational Linguistics.
https://doi.org/10.18653/v1/P19-1620
Kristjan Arumae and Fei Liu. 2018. Reinforced
extractive summarization with question-focused
rewards. In Proceedings of ACL 2018,
Student Research Workshop, pages 105–111,
Melbourne, Australia. Association for Computational
Linguistics. https://doi.org/10.18653/v1/P18-3015
Kristjan Arumae and Fei Liu. 2019. Guiding
extractive summarization with question-answering
rewards. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 2566–2577,
Minneapolis, Minnesota. Association for Computational
Linguistics. https://doi.org/10.18653/v1/N19-1264
Regina Barzilay and Mirella Lapata. 2008.
Modeling local coherence: An entity-based approach.
Computational Linguistics, 34(1):1–34.
https://doi.org/10.1162/coli.2008.34.1.1
Iz Beltagy, Matthew E. Peters, and Arman
Cohan. 2020. Longformer: The long-document
transformer. ArXiv, abs/2004.05150.
Daniela Brook Weiss, Paul Roit, Ayal Klein,
Ori Ernst, and Ido Dagan. 2021. QA-align:
Representing cross-text content overlap by
aligning question-answer propositions. In Pro-
ceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing,
pages 9879–9894, Online and Punta Cana,
Dominican Republic. Association for Computational
Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.778
Shuyang Cao and Lu Wang. 2022. HIBRIDS:
Attention with hierarchical biases for structure-aware
long document summarization. In Proceedings
of the 60th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 786–807,
Dublin, Ireland. Association for Computational
Linguistics. https://doi.org/10.18653/v1/2022.acl-long.58
L. Carlson. 1983. Dialogue Games: An Approach
to Discourse Analysis. Reidel, Dordrecht.
https://doi.org/10.1007/978-94-015-3963-0_9
Asli Celikyilmaz, Antoine Bosselut, Xiaodong
He, and Yejin Choi. 2018. Deep communicat-
ing agents for abstractive summarization. In
Proceedings of the 2018 Conference of the
North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers),
pages 1662–1675, New Orleans, Louisiana.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N18
-1150
Mingda Chen, Zewei Chu, Sam Wiseman, and
Kevin Gimpel. 2022. SummScreen: A dataset
for abstractive screenplay summarization. In
Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 8602–8615,
Dublin,
Ireland. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2022.acl-long.589
Rewon Child, Scott Gray, Alec Radford, and
Ilya Sutskever. 2019. Generating long se-
quences with sparse transformers. ArXiv,
abs/1904.10509. https://doi.org/10.48550
/arXiv.1904.10509
Kordula De Kuthy, Madeeswaran Kannan,
Haemanth Santhi Ponnusamy, and Detmar
Meurers. 2020. Towards automatically generat-
ing questions under discussion to link informa-
tion and discourse structure. In Proceedings of
the 28th International Conference on Computa-
tional Linguistics, pages 5786–5798, Barcelona,
Spain (Online). International Committee on
Computational Linguistics. https://doi.org
/10.18653/v1/2020.coling-main.509
Kordula De Kuthy, Nils Reiter, and Arndt Riester.
2018. QUD-based annotation of discourse
structure and information structure: Tool and
evaluation. In Proceedings of the Eleventh In-
ternational Conference on Language Resources
and Evaluation (LREC 2018), Miyazaki, Japan.
European Language Resources Association
(ELRA).
Daniel Deutsch and Dan Roth. 2021a. Question-
based salient span selection for more control-
lable text summarization. ArXiv, abs/2111.07935.
https://doi.org/10.48550
/arXiv.2111.07935
Daniel Deutsch and Dan Roth. 2021b. Under-
standing the extent to which content quality
metrics measure the information quality of sum-
maries. In Proceedings of the 25th Conference
on Computational Natural Language Learning,
pages 300–309, Online. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/2021.conll-1.24
Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi,
Zhengbao Jiang, and Graham Neubig. 2021.
GSum: A general framework for guided neu-
ral abstractive summarization. In Proceedings
of the 2021 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 4830–4842, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2021.naacl-main.384
Esin Durmus, He He, and Mona Diab. 2020.
FEQA: A question answering evaluation frame-
work for faithfulness assessment in abstractive
summarization. In Proceedings of the 58th An-
nual Meeting of the Association for Computa-
tional Linguistics, pages 5055–5070, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-main.454
Ondřej Dušek and Zdeněk Kasner. 2020. Evaluat-
ing semantic accuracy of data-to-text generation
with natural language inference. In Proceed-
ings of the 13th International Conference on
Natural Language Generation, pages 131–137,
Dublin, Ireland. Association for Computational
Linguistics.
Ori Ernst, Avi Caciularu, Ori Shapira, Ramakanth
Pasunuru, Mohit Bansal, Jacob Goldberger,
and Ido Dagan. 2022. Proposition-level clus-
tering for multi-document summarization. In
Proceedings of the 2022 Conference of the
North American Chapter of the Associa-
tion for Computational Linguistics: Human
Language Technologies, pages 1765–1779,
Seattle, United States. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2022.naacl-main.128
Matan Eyal, Tal Baumel, and Michael Elhadad.
2019. Question answering as an automatic
evaluation metric for news article summa-
rization. In Proceedings of the 2019 Con-
ference of the North American Chapter of
the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 3938–3948,
Minneapolis, Minnesota. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N19-1395
Alexander R. Fabbri, Wojciech Kryściński,
Bryan McCann, Caiming Xiong, Richard
Socher, and Dragomir Radev. 2021. Summ-
Eval: Re-evaluating summarization evaluation.
Transactions of the Association for Compu-
tational Linguistics, 9:391–409. https://
doi.org/10.1162/tacl_a_00373
Tobias Falke, Leonardo F. R. Ribeiro, Prasetya
Ajie Utama, Ido Dagan, and Iryna Gurevych.
2019. Ranking generated summaries by cor-
rectness: An interesting but challenging ap-
plication for natural language inference. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 2214–2220, Florence, Italy. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/P19-1213
Tim Fischer, Steffen Remus, and Chris Biemann.
2022. Measuring faithfulness of abstractive
summaries. In Proceedings of the 18th Con-
ference on Natural Language Processing
(KONVENS 2022), pages 63–73, Potsdam,
Germany.
Saadia Gabriel, Asli Celikyilmaz, Rahul Jha,
Yejin Choi, and Jianfeng Gao. 2021. GO
FIGURE: A meta evaluation of factuality in
summarization. In Findings of the Association
for Computational Linguistics: ACL-IJCNLP
2021, pages 478–487, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2021.findings-acl.42
Sebastian Gehrmann, Yuntian Deng,
and
Alexander Rush. 2018. Bottom-up abstractive
summarization. In Proceedings of
the 2018
Conference on Empirical Methods in Natu-
ral Language Processing, pages 4098–4109.
Association for Computational Linguistics.
Jonathan Ginzburg. 1994. An update semantics
for dialogue. In Proceedings of the 1st Tilburg
International Workshop on Computational Se-
mantics. Tilburg, The Netherlands.
Markus Guhe. 2007. Incremental Conceptualiza-
tion for Language Production. Mahwah, NJ:
Lawrence Erlbaum Associates Publishers.
Mandy Guo, Joshua Ainslie, David C. Uthus,
Santiago Ontañón, Jianmo Ni, Yun-Hsuan
Sung, and Yinfei Yang. 2021. LongT5: Effi-
cient text-to-text transformer for long sequences.
ArXiv, abs/2112.07916. https://doi.org
/10.18653/v1/2022.findings-naacl.55
Luheng He, Mike Lewis, and Luke Zettlemoyer.
2015. Question-answer driven semantic role
labeling: Using natural language to annotate
natural language. In Proceedings of the 2015
Conference on Empirical Methods in Nat-
ural Language Processing, pages 643–653,
Lisbon, Portugal. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D15-1076
Or Honovich, Roee Aharoni, Jonathan Herzig,
Hagai Taitelbaum, Doron Kukliansy, Vered
Cohen, Thomas Scialom, Idan Szpektor,
Avinatan Hassidim, and Yossi Matias. 2022.
TRUE: Re-evaluating factual consistency
evaluation. In Proceedings of the Second
DialDoc Workshop on Document-grounded
Dialogue and Conversational Question An-
swering, pages 161–175, Dublin, Ireland.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2022
.dialdoc-1.19
Or Honovich, Leshem Choshen, Roee Aharoni,
Ella Neeman, Idan Szpektor, and Omri Abend.
2021. Q²: Evaluating factual consistency in
knowledge-grounded dialogues via question
generation and question answering. In Pro-
ceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing,
pages 7856–7870. https://doi.org/10
.18653/v1/2021.emnlp-main.619
David M. Howcroft, Anya Belz, Miruna-Adriana
Clinciu, Dimitra Gkatzia, Sadid A. Hasan,
Saad Mahamood, Simon Mille, Emiel van
Miltenburg, Sashank Santhanam, and Verena
Rieser. 2020. Twenty years of confusion in hu-
man evaluation: NLG needs evaluation sheets
and standardised definitions. In Proceedings
of the 13th International Conference on Nat-
ural Language Generation, pages 169–182,
Dublin, Ireland. Association for Computa-
tional Linguistics.
Hayate Iso, Yui Uehara, Tatsuya Ishigaki,
Hiroshi Noji, Eiji Aramaki, Ichiro Kobayashi,
Yusuke Miyao, Naoaki Okazaki, and Hiroya
Takamura. 2019. Learning to select, track, and
generate for data-to-text. In Proceedings of the
57th Annual Meeting of the Association for
Computational Linguistics, pages 2102–2113,
Florence,
Italy. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P19-1202
Nikiforos Karamanis. 2004. Entity Coherence
for Descriptive Text Structuring. Ph.D. thesis,
School of Informatics, University of Edinburgh.
Rodger Kibble and Richard Power. 2004.
Optimizing referential coherence in text
generation. Computational Linguistics,
30(4):401–416. https://doi.org/10.1162
/0891201042544893
Wei-Jen Ko, Te-yuan Chen, Yiyan Huang, Greg
Durrett, and Junyi Jessy Li. 2020. Inquisi-
tive question generation for high level
text
comprehension. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 6544–6555,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/2020.emnlp-main.530
Wei-Jen Ko, Cutter Dalton, Mark P. Simmons,
Eliza Fisher, Greg Durrett, and Junyi
Jessy Li. 2021. Discourse comprehension:
A question answering framework to represent
sentence connections. ArXiv, abs/2111.00701.
https://doi.org/10.48550/arXiv
.2111.00701
Wei-Jen Ko and Junyi Jessy Li. 2020. Assessing
discourse relations in language generation from
GPT-2. In Proceedings of the 13th Interna-
tional Conference on Natural Language
Generation, pages 52–59, Dublin,
Ireland.
Association for Computational Linguistics.
Ivana Kruijff-Korbayová and Mark Steedman.
2003. Discourse and information structure.
Journal of Logic, Language and Information,
12(3):249–259. https://doi.org/10.1023
/A:1024160025821
Wojciech Kryscinski, Bryan McCann, Caiming
Xiong, and Richard Socher. 2020. Eval-
uating the factual consistency of abstrac-
tive text summarization. In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 9332–9346, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.emnlp-main.750
Wojciech Kryściński, Nazneen Rajani, Divyansh
Agarwal, Caiming Xiong, and Dragomir
Radev. 2021. Booksum: A collection of da-
tasets for long-form narrative summarization.
ArXiv, abs/2105.08209. https://doi.org
/10.48550/arXiv.2105.08209
Sayali Kulkarni, Sheide Chammas, Wan Zhu,
Fei Sha, and Eugene Ie. 2020. Aquamuse:
Automatically generating datasets for query-
based multi-document summarization. ArXiv,
abs/2010.12694. https://doi.org/10.48550
/arXiv.2010.12694
Sayali Kulkarni, Sheide Chammas, Wan Zhu,
Fei Sha, and Eugene Ie. 2021. Comsum and
sibert: A dataset and neural model for query-
based multi-document summarization. In
Document Analysis and Recognition – ICDAR
2021, pages 84–98, Cham. Springer Inter-
national Publishing. ArXiv, abs/2010.12694
https://doi.org/10.1007/978-3-030
-86331-9_6
Tom Kwiatkowski, Jennimaria Palomaki, Olivia
Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein,
Illia Polosukhin,
Jacob Devlin, Kenton Lee, Kristina Toutanova,
Llion Jones, Matthew Kelcey, Ming-Wei
Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc
Le, and Slav Petrov. 2019. Natural questions:
A benchmark for question answering research.
Transactions of the Association for Computa-
tional Linguistics, 7:452–466. https://doi
.org/10.1162/tacl_a_00276
Staffan Larson. 2002. Issue-based Dialogue Man-
agement. Ph.D. thesis, Göteborg University,
Sweden.
Logan Lebanoff, John Muchovej, Franck
Dernoncourt, Doo Soon Kim, Seokhwan Kim,
Walter Chang, and Fei Liu. 2019. Analyzing
sentence fusion in abstractive summarization.
In Proceedings of the 2nd Workshop on New
Frontiers in Summarization, pages 104–110,
Hong Kong, China. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D19-5413
Willem J. M. Levelt. 1993. Speaking: From
Intention to Articulation. The MIT Press.
https://doi.org/10.7551/mitpress
/6393.001.0001
Wei Li, Xinyan Xiao, Yajuan Lyu, and Yuanzhuo
Wang. 2018. Improving neural abstractive doc-
ument summarization with explicit informa-
tion selection modeling. In Proceedings of the
2018 Conference on Empirical Methods in Nat-
ural Language Processing, pages 1787–1796,
Brussels, Belgium. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D18-1205
Chin-Yew Lin and Eduard Hovy. 2003. Automatic
evaluation of summaries using n-gram co-
occurrence statistics. In Proceedings of the 2003
Human Language Technology Conference of
the North American Chapter of the Association
for Computational Linguistics, pages 150–157.
Yang Liu and Mirella Lapata. 2019. Hierarchical
transformers for multi-document summariza-
tion. In Proceedings of the 57th Annual Meeting
of the Association for Computational Linguis-
tics, pages 5070–5081, Florence, Italy. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/P19-1500
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, and
Veselin Stoyanov. 2019. RoBERTa: A ro-
bustly optimized BERT pretraining approach.
ArXiv, abs/1907.11692. https://doi.org
/10.48550/arXiv.1907.11692
Chao-Yi Lu and Sin-En Lu. 2021. A survey of
approaches to automatic question generation:
From 2019 to early 2021. In Proceedings of the
33rd Conference on Computational Linguis-
tics and Speech Processing (ROCLING 2021),
pages 151–162, Taoyuan, Taiwan. The Associa-
tion for Computational Linguistics and Chinese
Language Processing (ACLCLP).
Joshua Maynez, Shashi Narayan, Bernd Bohnet,
and Ryan McDonald. 2020. On faithfulness
and factuality in abstractive summarization.
In Proceedings of
the 58th Annual Meet-
ing of
the Association for Computational
Linguistics, pages 1906–1919, Online. As-
sociation
for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-main.173
Kathleen McKeown. 1985. Text Generation: Us-
ing Discourse Strategies and Focus Constraints
to Generate Natural Language Text. Studies
in Natural Language Processing. Cambridge
University Press.
Chris Mellish, Alistair Knott, Jon Oberlander,
and Mick O’Donnell. 1998. Experiments using
stochastic search for text planning. In Natural
Language Generation, Niagara-on-the-Lake,
Ontario, Canada. Association for Computa-
tional Linguistics.
Adam Meyers, Ruth Reeves, Catherine Macleod,
Rachel Szekely, Veronika Zielinska, Brian
Young, and Ralph Grishman. 2004. The Nom-
Bank project: An interim report. In Proceedings
of
the Workshop Frontiers in Corpus An-
notation at HLT-NAACL 2004, pages 24–31,
Boston, Massachusetts, USA. Association for
Computational Linguistics.
Amit Moryossef, Yoav Goldberg, and Ido
Dagan. 2019a. Improving quality and efficiency
in plan-based neural data-to-text generation. In
Proceedings of
the 12th International Con-
ference on Natural Language Generation,
pages 377–382, Tokyo, Japan. Association for
Computational Linguistics. https://doi
.org/10.18653/v1/W19-8645
Amit Moryossef, Yoav Goldberg, and Ido Dagan.
2019b. Step-by-step: Separating planning from
realization in neural data-to-text generation.
In Proceedings of
the 2019 Conference of
the North American Chapter of the Associ-
ation for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and
Short Papers), pages 2267–2277, Minneapo-
lis, Minnesota. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/N19-1236
Shashi Narayan, Joshua Maynez, Jakub Adamek,
Daniele Pighin, Blaz Bratanic, and Ryan
McDonald. 2020. Stepwise extractive summa-
rization and planning with structured transform-
ers. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 4143–4159, On-
line. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.emnlp-main.339
Shashi Narayan, Gonçalo Simões, Yao Zhao,
Joshua Maynez, Dipanjan Das, Michael
Collins, and Mirella Lapata. 2022. A well-
composed text is half done! Composition sam-
pling for diverse conditional generation. In
Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1319–1339,
Dublin,
Ireland. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2022.acl-long.94
Shashi Narayan, Yao Zhao, Joshua Maynez,
Gonçalo Simões, Vitaly Nikolaev, and Ryan
McDonald. 2021. Planning
with learned entity prompts for abstractive sum-
marization. Transactions of the Association
for Computational Linguistics, 9:1475–1492.
https://doi.org/10.1162/tacl_a_00438
Yixin Nie, Adina Williams, Emily Dinan, Mohit
Bansal, Jason Weston, and Douwe Kiela. 2020.
Adversarial NLI: A new benchmark for natu-
ral language understanding. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 4885–4901,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.acl-main.441
Martha Palmer, Daniel Gildea,
and Paul
Kingsbury. 2005. The Proposition Bank: An
annotated corpus of semantic roles. Computa-
tional Linguistics, 31(1):71–106. https://
doi.org/10.1162/0891201053630264
Laura Perez-Beltrachini, Yang Liu, and Mirella
Lapata. 2019. Generating summaries with topic
templates and structured convolutional de-
coders. In Proceedings of
the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 5107–5116, Florence, Italy.
Association for Computational Linguistics.
Ratish Puduppully, Li Dong, and Mirella Lapata.
2019a. Data-to-text generation with content se-
lection and planning. In Proceedings of the
33rd AAAI Conference on Artificial Intelli-
gence. AAAI Press. https://doi.org/10
.1609/aaai.v33i01.33016908
Ratish Puduppully, Li Dong, and Mirella Lapata.
2019b. Data-to-text generation with entity
modeling. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 2023–2035, Florence, Italy.
Association for Computational Linguistics.
Ratish Puduppully, Yao Fu, and Mirella Lapata.
2022. Data-to-text generation with varia-
tional sequential planning. Transactions of
the Association for Computational Linguistics,
10:697–715. https://doi.org/10.1162
/tacl_a_00484
Ratish Puduppully and Mirella Lapata. 2021.
Data-to-text generation with macro planning.
Transactions of the Association for Computa-
tional Linguistics, 9:510–527. https://doi
.org/10.1162/tacl_a_00381
Valentina Pyatkin, Ayal Klein, Reut Tsarfaty,
and Ido Dagan. 2020. QADiscourse - discourse
relations as QA pairs: Representation, crowd-
sourcing and baselines. In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 2804–2819, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.emnlp-main.224
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21(140):1–67.
Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018a. Know what you don’t know: Unanswer-
able questions for SQuAD. In Proceedings of
the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 784–789. https://doi.org
/10.18653/v1/P18-2124
Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018b. Know what you don’t know: Unan-
swerable questions for SQuAD. In Proceedings
of the 56th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 2:
Short Papers), pages 784–789, Melbourne,
Australia. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/P18-2124
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev,
and Percy Liang. 2016. SQuAD: 100,000+
questions for machine comprehension of text. In
Proceedings of the 2016 Conference on Empir-
ical Methods in Natural Language Processing,
pages 2383–2392, Austin, Texas. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/D16-1264
Ehud Reiter and Robert Dale. 2000. Build-
ing Natural Language Generation
Sys-
tems. Cambridge University Press, New
York, NY. https://doi.org/10.1017
/CBO9780511519857
Arndt Riester. 2019. Constructing QUD trees. In
Questions in Discourse, volume 2: Pragmatics,
pages 164–193. Brill. https://doi.org
/10.1163/9789004378322_007
Craige Roberts. 2012. Information structure in
discourse: Towards an integrated formal the-
ory of pragmatics. Semantics and Pragmatics,
5(6):1–69. https://doi.org/10.3765
/sp.5.6
Tobias Rohde, Xiaoxia Wu, and Yinhan Liu.
2021. Hierarchical learning for generation with
long source sequences. ArXiv, abs/2104.07545.
https://doi.org/10.48550/arXiv
.2104.07545
Thomas Scialom, Sylvain Lamprier, Benjamin
Piwowarski, and Jacopo Staiano. 2019. An-
swers unite! Unsupervised metrics for rein-
forced summarization models. In Proceedings
of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 3246–3256, Hong Kong, China. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1320
Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat,
Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan
Xiong, Mor Geva,
Jonathan Berant, and
Omer Levy. 2022. SCROLLS: Standardized
comparison over long language sequences.
ArXiv, abs/2201.03533. https://doi.org
/10.48550/arXiv.2201.03533
Kaiqiang Song, Lin Zhao, and Fei Liu. 2018.
Structure-infused copy mechanisms for ab-
stractive summarization. In Proceedings of the
27th International Conference on Computa-
tional Linguistics, pages 1717–1729, Santa Fe,
New Mexico, USA. Association for Computa-
tional Linguistics.
Yun-Zhu Song, Yi-Syuan Chen, and Hong-Han
Shuai. 2022. Improving multi-document sum-
marization through referenced flexible extrac-
tion with credit-awareness. In Proceedings of
the 2022 Conference of the North American
Chapter of the Association for Computa-
tional Linguistics: Human Language Tech-
nologies, pages 1667–1681, Seattle, United
States. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2022.naacl-main.120
Gabriel Stanovsky, Julian Michael, Luke
Zettlemoyer, and Ido Dagan. 2018. Supervised
open information extraction. In Proceedings of
the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 885–895, New
Orleans, Louisiana. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/N18-1081
Ivan Stelmakh, Yi Luan, Bhuwan Dhingra,
and Ming-Wei Chang. 2022. ASQA: Factoid
questions meet long-form answers. In Proceed-
ings of
the 2022 Conference on Empirical
Methods in Natural Language Processing,
pages 8273–8288, Abu Dhabi, United Arab
Emirates. Association
for Computational
Linguistics.
Nisan Stiennon, Long Ouyang, Jeffrey Wu,
Daniel Ziegler, Ryan Lowe, Chelsea Voss,
Alec Radford, Dario Amodei, and Paul F.
Christiano. 2020. Learning to summarize with
human feedback. In Advances in Neural In-
formation Processing Systems, volume 33,
pages 3008–3021. Curran Associates, Inc.
Richard Sutton and Andrew Barto. 2018. Re-
inforcement Learning: An Introduction, 2nd
edition. MIT Press.
Jun Suzuki and Masaaki Nagata. 2017. Cutting-off
redundant repeating generations for neural ab-
stractive summarization. In Proceedings of the
15th Conference of the European Chapter of
the Association for Computational Linguistics:
Volume 2, Short Papers, pages 291–297,
Valencia, Spain. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/E17-2047
Bowen Tan, Zichao Yang, Maruan Al-Shedivat,
Eric Xing, and Zhiting Hu. 2021. Progressive
generation of long text with pretrained language
models. In Proceedings of the 2021 Conference
of the North American Chapter of the Associ-
ation for Computational Linguistics: Human
Language Technologies, pages 4313–4324,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2021.naacl-main.341
Jiwei Tan, Xiaojun Wan, and Jianguo Xiao.
2017a. Abstractive document summarization
with a graph-based attentional neural model.
In Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1171–1181,
Vancouver, Canada. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P17-1108
Jiwei Tan, Xiaojun Wan, and Jianguo Xiao.
2017b. Abstractive document summariza-
tion with a graph-based attentional neural
model. In Proceedings of the 55th Annual
Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers),
pages 1171–1181. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P17-1108
Ran Tian, Shashi Narayan, Thibault Sellam,
and Ankur P. Parikh. 2019. Sticking to
the facts: Confident decoding for faithful
data-to-text generation. ArXiv, abs/1910.08684.
https://doi.org/10.48550/arXiv
.1910.08684
Enric Vallduví and Maria Vilkuna. 1998. On
rheme and kontrast. The Limits of Syntax,
pages 79–108. Brill.
Jan Van Kuppevelt. 1995. Discourse structure,
topicality and questioning. Journal of Lin-
guistics, 31(1):109–147. https://doi.org
/10.1017/S002222670000058X
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R.
Fergus, S. Vishwanathan, and R. Garnett,
editors, Advances in Neural Information Pro-
cessing Systems 30, pages 5998–6008. Curran
Associates, Inc.
Alex Wang, Kyunghyun Cho, and Mike Lewis.
2020. Asking and answering questions to
evaluate the factual consistency of
sum-
maries. In Proceedings of
the 58th Annual
Meeting of
the Association for Computa-
tional Linguistics, pages 5008–5020, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-main.450
Matthijs Westera, Laia Mayol, and Hannah Rohde.
2020. TED-Q: TED talks and the questions
they evoke. In Proceedings of the 12th Lan-
guage Resources and Evaluation Conference,
pages 1118–1127, Marseille, France. European
Language Resources Association.
Sam Wiseman, Stuart Shieber, and Alexander
Rush. 2017. Challenges in data-to-document
generation. In Proceedings of the 2017
Conference on Empirical Methods in Natu-
ral Language Processing, pages 2253–2263,
Copenhagen, Denmark. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/D17-1239
Sam Wiseman, Stuart Shieber, and Alexander
Rush. 2018. Learning neural
templates for
text generation. In Proceedings of the 2018
Conference on Empirical Methods in Natu-
ral Language Processing, pages 3174–3187,
Brussels, Belgium. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D18-1356
Xinnuo Xu, Ondřej Dušek, Verena Rieser,
and Ioannis Konstas. 2021. AggGen: Or-
dering and aggregating while generating. In
Proceedings of
the 59th Annual Meeting
of
the Association for Computational Lin-
guistics and the 11th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 1419–1434,
Online. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2021.acl-long.113
Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi,
Mutethia Mutuma, Rahul Jha, Ahmed Hassan
Awadallah, Asli Celikyilmaz, Yang Liu,
Xipeng Qiu, and Dragomir Radev. 2021.
QMSum: A new benchmark for query-based
multi-domain meeting summarization. In Pro-
ceedings of the 2021 Conference of the North
American Chapter of the Association for Com-
putational Linguistics: Human Language Tech-
nologies, pages 5905–5921, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2021.naacl-main.472