Conditional Generation with a Question-Answering Blueprint
Shashi Narayan1, Joshua Maynez1, Reinald Kim Amplayo1, Kuzman Ganchev1,
Annie Louis2, Fantine Huot1, Anders Sandholm2, Dipanjan Das1, Mirella Lapata1
1Google DeepMind, UK 2Google Research
shashinarayan@google.com, joshuahm@google.com, reinald@google.com,
kuzman@google.com, annielouis@google.com, fantinehuot@google.com,
sandholm@google.com, dipanjand@google.com, lapata@google.com
Abstract

The ability to convey relevant and faithful information is critical for many tasks in conditional generation and yet remains elusive for neural seq-to-seq models whose outputs often reveal hallucinations and fail to correctly cover important details. In this work, we advocate planning as a useful intermediate representation for rendering conditional generation less opaque and more grounded. We propose a new conceptualization of text plans as a sequence of question-answer (QA) pairs and enhance existing datasets (e.g., for summarization) with a QA blueprint operating as a proxy for content selection (i.e., what to say) and planning (i.e., in what order). We obtain blueprints automatically by exploiting state-of-the-art question generation technology and convert input-output pairs into input-blueprint-output tuples. We develop Transformer-based models, each varying in how they incorporate the blueprint in the generated output (e.g., as a global plan or iteratively). Evaluation across metrics and datasets demonstrates that blueprint models are more factual than alternatives which do not resort to planning and allow tighter control of the generation output.
1 Introduction
Neural generation models are often prone to hal-
lucination (Song et al., 2018; Maynez et al., 2020;
Kryscinski et al., 2020; Gabriel et al., 2021),
repetition and redundancy (Li et al., 2018; Suzuki
and Nagata, 2017), and struggle to identify which
content units are salient (Tan et al., 2017a). These
phenomena are amplified when generating long-
form text, i.e., documents with multiple para-
graphs (Wiseman et al., 2017), when dealing with
non-linguistic data (e.g., database tables), or very
long input—which is common when summariz-
ing multiple documents (Liu and Lapata, 2019;
Perez-Beltrachini et al., 2019), books (Kry´sci´nski
et al., 2021), or dialogue (Chen et al., 2022; Zhong
et al., 2021). An additional challenge concerns
the blackbox nature of deep learning systems,
which hides the inherent complexity of modeling
multiple interconnected linguistic phenomena in
text generation, and makes it difficult to examine
model decisions and attribute errors to specific
components. The lack of modularity further af-
fects controllability as these systems cannot be
easily tailored to individual needs.
Attempts to remedy some of these issues fo-
cus on changing the way entities are represented
(Puduppully et al., 2019b; Iso et al., 2019), al-
lowing the decoder to skip low-confidence tokens
to enhance faithful generation (Tian et al., 2019),
modeling graph connections between document
elements to better capture salience (Tan et al.,
2017b; Liu and Lapata, 2019), encoding docu-
ments hierarchically (Celikyilmaz et al., 2018;
Liu and Lapata, 2019; Rohde et al., 2021), learn-
ing latent alignments between the input and the
target text (Xu et al., 2021), adopting sparse at-
tention mechanisms (Child et al., 2019; Beltagy
et al., 2020), and introducing content selection
(Gehrmann et al., 2018; Dou et al., 2021) and
planning components (Puduppully et al., 2019a;
Moryossef et al., 2019b; Narayan et al., 2021;
Wiseman et al., 2018).
In this paper we also aim to render conditional
generation more modular via an intermediate,
plan-based representation. While autoregressive
models of language predict one token at a time,
there is evidence that in humans some degree
of planning occurs at a higher level than indivi-
dual words (Levelt, 1993; Guhe, 2007). A long
tradition in natural language generation views
planning as a central component to identifying important content and structuring it appropriately (Reiter and Dale, 2000); however, there is less agreement on how plans should be represented. Common examples include discourse trees (Mellish et al., 1998), entity transitions (Kibble and Power, 2004; Barzilay and Lapata, 2008), sequences of propositions (Karamanis, 2004), and schemas (McKeown, 1985).

Table 1: Question-answering (QA) blueprint for an AQuaMuSe summary. QA pairs were obtained from a state-of-the-art question generation and answer identification system (Alberti et al., 2019).
Our work proposes a new conceptualization
of text plans as a sequence of question-answer
pairs. Specifically, we draw inspiration from the
‘‘Questions under Discussion’’ (QUD) theory of
discourse structure, which posits that one way of
articulating the structure of a text is to identify the
questions and sub-questions that are raised and
answered by subsequent spans of text (Carlson,
1983; Ginzburg, 1994; Van Kuppevelt, 1995;
Larson, 2002; Roberts, 2012; Riester, 2019). Theoretical models of QUD assume that discourse
contains implicit questions for each of the as-
sertions made, which are thereby turned into
answers. These questions and answers can be
understood in terms of their use in moving a dis-
course forward to achieve communicative goals.
We propose to make QUDs explicit by exploiting
state-of-the-art question generation technology
(Alberti et al., 2019; Lu and Lu, 2021) and use
them as an intermediate representation layer for
conditional generation, i.e., a question-answering
(QA) blueprint operating as a proxy for both con-
tent selection (i.e., what to say) and planning (i.e.,
in what order).
Table 1 illustrates a plan for generating a Wikipedia abstract from the AQuaMuSe dataset (Kulkarni et al., 2020). We enhance existing datasets (e.g., for summarization) with similar blueprints which we obtain automatically. We then convert input-output pairs into input-blueprint-output tuples and propose to learn encoder-decoder models from these augmented annotations. We develop three models that vary in how they integrate blueprints in the generation process and their ability to handle long outputs. Aside from generating blueprints and their corresponding text in one go, we propose a new architecture that iteratively plans and generates a sentence at a time, conditioning on the input and the output sentences generated so far. We do not generate a global blueprint; rather, our planning process is incremental and informed by generation, which we argue affords greater control over the output and its fluency. Moreover, the model is better equipped for long-form generation, since it does not have to (autoregressively) decode the blueprint and its summary in one go, avoiding the risk of exceeding the maximum decoder length.
We instantiate our models with a Transformer
(Vaswani et al., 2017) encoder-decoder architec-
ture and perform experiments on summarization
datasets representing different information seek-
ing tasks, application domains, and user require-
ments.1 In all cases, we empirically demonstrate
that blueprint models are more factual than alter-
natives which do not resort to planning; we also
observe that QA blueprints are a better represen-
tation compared to plans based on entity chains
(Narayan et al., 2021), allowing tighter control
of the output, and providing a comprehensive
explanation for model predictions (if the plan is
erroneous, then the summary will be too).
2 Related Work
Questions under Discussion The QUD-based
approach to discourse structure assumes an open-
ended inventory of possible questions and sub-
questions (Van Kuppevelt, 1995). Recent efforts
(De Kuthy et al., 2018; Westera et al., 2020;

1Our models, training data and predictions are available at https://github.com/google-research/google-research/tree/master/text blueprint.
Riester, 2019) have nevertheless shown that it
is possible to manually annotate documents with
QUDs, i.e., to formulate a question for every as-
sertion expressed in a text. De Kuthy et al. (2020)
even go as far as to partially automate QUD an-
notation in German by automatically generating
all potentially relevant questions for a given sen-
tence. Related work (Ko et al., 2020) focuses
on the generation of inquisitive questions that
reflect general text understanding and free-form
open-ended questions (Ko et al., 2021). Our work
builds upon QUD and related discourse structure
theories, although, we do not directly implement
any of them in particular. We adopt question an-
swering as a good way of spelling out the con-
nection between the information structure of a
sentence and the discourse in which the sentence
can function.
QA Pairs as a Proxy for Annotation Labels
Question-answer pairs have been previously used
as a proxy for expressing semantic content.
QA-SRL (He et al., 2015) is a representation based
on QA pairs that has been shown to capture the
vast majority of arguments and modifiers in Prop-
Bank (Palmer et al., 2005) and NomBank (Meyers
et al., 2004). Instead of using a pre-defined role
lexicon, QA-SRL labels semantic roles with ques-
tions whose answers denote the argument bearing
the role. Follow-on work uses QA pairs to repre-
sent discourse relations (Pyatkin et al., 2020) and to capture overlap or redundancy at the propositional level (Brook Weiss et al., 2021). We also employ QA pairs as an abstraction of propositional content; however, we do not target specific relation types, or make any linguistic assumptions about them (e.g., discourse relations vs semantic roles).
Question-Answering in Summarization QA
pairs have been used for evaluating summaries
(Deutsch and Roth, 2021b; Eyal et al., 2019;
Durmus et al., 2020; Wang et al., 2020), spe-
cifically as a means of estimating the informa-
tion overlap between a reference summary and
a system-generated one. QA-based signals have
also been incorporated in the training of sum-
marization models, using reinforcement learning
(Arumae and Liu, 2018, 2019; Scialom et al.,
2019) or as a way of identifying salient content
in the input document (Deutsch and Roth, 2021a).
Cao and Wang (2022) introduce the task of hi-
erarchical question-summary generation, where
a source document is condensed into multiple
summaries, each answering a different question.
Questions are organized hierarchically into broad
questions and more specific sub-questions that
are learned from manual annotations. Our model
outputs a QA-based plan and a single summary
for a given document, although it is possible to
generate different summaries from different plans
for the same document. Our QA pairs are ob-
tained automatically and they are not structured.
Planning in Encoder-Decoder Models Var-
ious recent efforts have developed planning
modules in the context of data-to-text generation.
In most cases, the plans are specific to the input,
which varies from tables and records to RDF tu-
ples. For example, Puduppully et al. (2019a) learn
a plan corresponding to a sequence of records, and
generate a summary conditioned on it. Narayan
et al. (2020) treat content selection as a task simi-
lar to extractive summarization; they first extract
sentence plans and then verbalize them one-by-one.
Moryossef et al. (2019a,b) propose a symbolic
planning stage followed by a neural realization
stage. Other work (Puduppully and Lapata, 2021;
Puduppully et al., 2022) advocates macro plan-
ning, where document content is organized into
a sequence of paragraph plans which are verbal-
izations of tabular input. Our work is closest to
Narayan et al. (2021), who also target summariza-
tion applications and learn an intermediate plan
to guide generation. We adopt a more elaborate
plan representation based on QA blueprints, and
interface decoding with plan generation similarly
to Narayan et al. (2020).
3 Text Generation with Blueprints
3.1 Problem Formulation
Let d denote the input to the model which could be a document (or multiple documents), a dialogue history, or even database tables. The model will learn to generate blueprint b for output s (e.g., a summary) and the output itself. The blueprint b is an ordered set of question-answer pairs {(q1, a1), (q2, a2), . . . , (qm, am)}. Unsurprisingly, such blueprints are not naturally occurring in existing datasets that typically consist of (d, s) pairs. In the following we explain how we automatically augment training examples (d, s) into tuples (d, b, s) with blueprints (Section 3.2) and then
Overgenerated Question-Answer Pairs
RT
RH
CO
Q1: What is a high performance variant of the Ford Mustang?
Q2: What is the high performance variant of the Ford Mustang called?
Q3: What is a high performance variant of the Ford Mustang?
Q4: What is a Shelby Mustang?
Q5: The Shelby Mustang is a high performance variant of what?
Q6: The Shelby Mustang is a high performance variant of what?
Q7: The Shelby Mustang is a high performance variant of what Ford model?
Q8: Who built the Shelby Mustang from 1965 to 1968?
Q9: During what years was the Shelby Mustang built by Shelby American?
Q10: In what year did Ford take over production of the Shelby Mustang?
Q11: What was the final year that Shelby American built the Mustang?
Q12: Who built the Shelby Mustang from 1969 to 1970?
Q13: What event in 2005 led to the revival of the Shelby Mustang?
Q14: What generation of Mustang was introduced in 2005?
Q15: What generation of Mustang was introduced in 2005?
Q16: In what year was the fifth generation of the Ford Mustang introduced?
Q17: What name was brought back for the 2005 Ford Mustang?
Q18: What was the Shelby Mustang revived as?
A1: The Shelby Mustang
A2: Shelby
A3: Shelby Mustang
A4: a high performance variant
A5: the Ford Mustang
A6: Ford Mustang
A7: Mustang
A8: Shelby American
A9: 1965 to 1968
A10: 1969
A11: 1970
A12: Ford
A13: the introduction
A14: the fifth generation
A15: fifth
A16: 2005
A17: the Shelby nameplate
A18: a new high-performance model
[The Shelby Mustang is a high performance variant of the Ford Mustang]P1 which [was built by Shelby American]P2 [from 1965 to 1968,]P3 and [from 1969 to 1970 by Ford.]P4 [Following the introduction of the fifth generation Ford Mustang in 2005,]P5 [the Shelby nameplate was revived as a new high-performance model, this time designed and built by Ford.]P6
Table 2: Generation of QA pairs for the summary in Figure 1 and blueprint annotation. We split the summary into propositions P and select no more than one QA pair per proposition. RT, RH, and CO are shorthand for Round Trip, Rheme, and Coverage. Questions that pass/fail each filter are marked with ✓/✗, respectively.
describe how we devise blueprint models based on them (Section 3.3).

3.2 Blueprint Annotation

We first explain how question-answer pairs are automatically (over-)generated for output s, and subsequently filtered to create blueprint b. We illustrate the different filtering stages via the example in Table 2.
Question-Answer Generation  We generate QA pairs following an approach similar to Honovich et al. (2021, 2022). We convert the SQuAD reading comprehension dataset (Rajpurkar et al., 2018b) to a question generation dataset by concatenating the answer and context (with separators) and fine-tuning a sequence-to-sequence transformer model to predict the question. Specifically, we fine-tune the T5-11B checkpoint from Raffel et al. (2020); questions are decoded with a beam size of 4. During training, answer candidates are the answers provided in the SQuAD annotation. At inference time, answer candidates (i.e., base noun phrases and named entities) are identified in the output s using SpaCy2 and questions are generated with the SQuAD trained system. This procedure yields a large list of QA pairs (see in Table 2 the questions generated for the summary at the bottom), which we reduce using the filtering explained below.

2https://spacy.io/.
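To make the over-generation step concrete, a minimal sketch (not the released implementation) could look as follows; generate_question is a hypothetical wrapper around the SQuAD-tuned T5 question generator described above, and answer candidates are approximated by base noun phrases and named entities extracted with spaCy:

import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline with a parser and NER

def answer_candidates(summary):
    # Base noun phrases and named entities serve as candidate answer spans.
    doc = nlp(summary)
    spans = [c.text for c in doc.noun_chunks] + [e.text for e in doc.ents]
    seen, ordered = set(), []
    for span in spans:
        if span not in seen:
            seen.add(span)
            ordered.append(span)
    return ordered

def overgenerate_qa_pairs(summary, generate_question):
    # generate_question(answer, context) is assumed to wrap the SQuAD-tuned
    # T5 question generator (decoded with beam size 4).
    return [(generate_question(a, summary), a) for a in answer_candidates(summary)]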
Question-Answer Blueprints  Initially, we apply a Round-trip Consistency check (Alberti et al., 2019), which discards questions if they yield answers different from those used to generate them. In Table 2, Q11 is discarded as the answer it is paired with is wrong (1968 was the final year that Shelby American built the Mustang, not 1970). The same is the case for Q13, where the answer to the question ought to have been the introduction of the fifth generation Ford Mustang.
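A sketch of the round-trip check, assuming a hypothetical answer_question(question, context) wrapper around a reading-comprehension model, might look as follows:

def round_trip_filter(qa_pairs, context, answer_question):
    # Keep a QA pair only if re-answering its question over the context recovers
    # the answer that was used to generate it (Alberti et al., 2019).
    kept = []
    for question, answer in qa_pairs:
        predicted = answer_question(question, context)
        if predicted.strip().lower() == answer.strip().lower():
            kept.append((question, answer))
    return kept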
To decrease the number of QA pairs further, we chunk the text (bottom block in Table 2) into propositions—a proposition is a sub-sentential unit which represents a single claim or fact (Stanovsky et al., 2018; Ernst et al., 2022). We use propositions instead of sentences since the latter can be too long and contain multiple facts. We split text into propositions based on punctuation (period, comma, and semicolon), coordination (e.g., and, but), relative pronouns (e.g., that, who), and prepositions (e.g., at, of). Following this simple approach, the summary in Table 2 is split into six propositions, shown within square brackets. We next match each proposition to a
single QA pair heuristically, following a two-stage approach.

We first find the question whose answer is at the rightmost position within a proposition. If there are multiple such questions, we select the one with the longest answer. This first stage, which we call Rheme, is motivated by the theme-rheme structure (Vallduví and Vilkuna, 1998) of natural language sentences: Already known information (i.e., the theme) is usually placed first while new information (i.e., the rheme) is placed later in a sentence or phrase (Kruijff-Korbayová and Steedman, 2003). Following this idea, Rheme selection prioritizes new-information seeking questions. As can be seen in Table 2, it eliminates several questions (e.g., Q1–Q4) as their answers are not the rightmost element in the obtained propositions. Questions Q5 and Q6 are identical, however we retain Q5 as it yields the longest answer.
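The Rheme stage can be sketched roughly as follows; each QA pair is assumed to carry the character offsets of its answer span in the summary (answer_start, answer_end), a bookkeeping detail not spelled out above:

def rheme_select(qa_pairs, prop_start, prop_end):
    # Among QA pairs whose answer falls inside the proposition [prop_start, prop_end),
    # pick the one whose answer ends right-most, breaking ties in favour of the
    # longest answer.
    in_prop = [qa for qa in qa_pairs
               if prop_start <= qa["answer_start"] and qa["answer_end"] <= prop_end]
    if not in_prop:
        return None
    return max(in_prop,
               key=lambda qa: (qa["answer_end"], qa["answer_end"] - qa["answer_start"]))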
The second stage, which we call Coverage, prioritizes the selection of informative QA pairs by selecting non-overlapping ones. Specifically, we first convert s to a bag of tokens and select the QA pair with the highest lexical overlap. We then remove the overlapping tokens from s, and repeat this greedy selection process until the bag is empty or the overlap is zero. Table 2 shows how Coverage further eliminates QA pairs Q5 and Q8. The remaining four QA pairs constitute the final blueprint b. Rather than defaulting to a random order, we sort these based on the location of the answer spans in s (see the final order in Table 1).
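The Coverage stage and the final sorting step could be implemented roughly along these lines (a simplification; the tokens and answer_start fields and the exact tokenization are assumptions):

from collections import Counter

def coverage_select(summary_tokens, candidates):
    # Greedily keep the QA pair with the highest lexical overlap against the
    # remaining bag of summary tokens, remove the overlapping tokens, and repeat
    # until the bag is empty or the best overlap is zero.
    bag = Counter(summary_tokens)
    remaining = list(candidates)
    selected = []

    def overlap(qa):
        return sum((Counter(qa["tokens"]) & bag).values())

    while bag and remaining:
        best = max(remaining, key=overlap)
        if overlap(best) == 0:
            break
        selected.append(best)
        remaining.remove(best)
        bag = bag - Counter(best["tokens"])  # drops tokens whose count reaches zero
    # Order the blueprint by where each answer appears in the summary.
    return sorted(selected, key=lambda qa: qa["answer_start"])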
3.3 Blueprint Models
We devised three seq-to-seq models, which dif-
fer in the way the output and its blueprint are
generated.
End-to-End Model  A straightforward approach would be to take d as input and learn to first predict blueprint b as p(b|d), and then generate output s as p(s|b). However, this approach crucially relies on the blueprint being accurate and capturing all required information, which might be overly optimistic, given that blueprints (for training) are generated automatically. Moreover, pipeline architectures are known to suffer from error propagation, which in our case would undoubtedly affect generation performance, the final stage of the pipeline.

Rather than modeling the blueprint and output generation stages separately, we train an encoder-decoder model to encode d and generate b; s (i.e., the concatenation of the blueprint and output sequence) in one go. Essentially, the decoder first predicts blueprint b and then continues to generate output s, using both b and d. We prefix b and s with special markers ‘‘Plan:’’ and ‘‘Summary:’’, respectively. In particular, we predict b as a1; q1; . . . ; am; qm, namely, a (concatenated) sequence of answer-question pairs.3 The model is trained with the standard maximum-likelihood objective to generate the augmented target b; s. Interestingly, in this end-to-end model the blueprint functions as a macro-plan, i.e., a global sketch of the content and organization of the output.
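As a rough illustration, the augmented target for the end-to-end model can be linearized as follows (the exact separator strings are an assumption, not the released format):

def linearize_e2e_target(blueprint, summary):
    # blueprint is the final (sorted) list of (answer, question) pairs; the decoder
    # is trained to emit the plan first and then the summary.
    plan = "; ".join(f"{a}; {q}" for a, q in blueprint)
    return f"Plan: {plan} Summary: {summary}"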
Multi-task Model  It is generally challenging for encoder-decoder models to generate long output sequences (Ko and Li, 2020; Tan et al., 2021). The end-to-end model sketched above further amplifies this problem because it ultimately aims to generate sequence b; s rather than just s, increasing the sequence length by 220% (see Table 3). To mitigate this problem, we propose a multi-task model optimized to perform two separate tasks. Let a and q denote an ordered sequence of answers (a1, . . . , am) and corresponding questions (q1, . . . , qm), in blueprint b. The model is trained to generate (a) the answer plan concatenated with output sequence a; s, and (b) the answer plan concatenated with questions a; q. In particular, we train a single encoder-decoder model to encode input d, while the decoder first predicts answer plan a (as p(a|d)) and then continues to generate output s (as p(s|a, d)) or corresponding questions q (as p(q|a, d)), depending on the task. We prefix a, q, and s with special markers ‘‘Plan:’’, ‘‘Questions:’’, and ‘‘Summary:’’, respectively. We further prefix input d with ‘‘Generate Summary:’’ or ‘‘Generate Questions:’’ to instruct our model to generate output s or questions q, respectively. We sample data points from these two tasks with equal probability and train the model with the standard maximum-likelihood objective.

During inference, we use a two-step process to generate output s′ and its blueprint b′ for input d. We first prefix d with ‘‘Generate Summary:’’ and generate a′; s′, i.e., answer plan a′ followed by output sequence s′. We then prefix d with ‘‘Generate Questions:’’, prompt our decoder with the predicted answer plan a′ and generate corresponding questions q′ for blueprint b′. The multi-task model alleviates the length issue discussed above by learning to generate a; s instead of b; s. However, this comes at the expense of generation quality, since the model now conditions on the answers only, not question-answer pairs. As such, it can be viewed as an extension of FROST (Narayan et al., 2021) with the plan being a sequence of answer spans rather than entity chains. This model also creates a macro-plan of the output, however, less detailed compared to the end-to-end model.

3Predicting b as q1; a1; . . . ; qm; am is more natural, but it led to inferior performance. See the ablation experiments in Section 5.3.
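A sketch of how the two multi-task training inputs and the two-step inference described above might be laid out; model.generate and its decoder_prompt argument are hypothetical wrappers around the fine-tuned checkpoint, and the separators are assumptions:

def multitask_examples(document, answers, questions, summary):
    # One training example per task; answers/questions are the ordered blueprint spans.
    plan = "; ".join(answers)
    return [
        ("Generate Summary: " + document, f"Plan: {plan} Summary: {summary}"),
        ("Generate Questions: " + document, f"Plan: {plan} Questions: " + "; ".join(questions)),
    ]

def multitask_inference(document, model):
    # Step 1 predicts the answer plan followed by the summary; step 2 re-uses the
    # predicted plan as a decoder prompt to recover the questions of the blueprint.
    plan_and_summary = model.generate("Generate Summary: " + document)
    plan = plan_and_summary.split("Summary:")[0].strip()  # "Plan: a1; ...; am"
    questions = model.generate("Generate Questions: " + document, decoder_prompt=plan)
    return plan, plan_and_summary, questions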
                          AQuM        WCSum      SS-FD
# queries                 8,162       —          —
# examples
  train                   6,599       165,000    3,673
  dev                     714         8,723      338
  test                    849         9,166      337
source
  # docs                  6.46        135.56     1.00
  # words                 12,986.88   7455.75    8051.74
  # sentences             339.62      307.80     804.01
  # words/doc             2,008.38    52.02      8051.74
target (original)
  # words                 114.07      115.61     126.73
  # sentences             3.65        4.74       5.26
  novel unigrams          0.02        0.13       0.17
  novel bigrams           0.13        0.54       0.66
  novel trigrams          0.24        0.78       0.92
  novel 4-grams           0.31        0.86       0.98
target (+blueprint)
  # QA-Pairs              8.16        9.56       28.10
  # words                 272.68      291.28     597.90

Table 3: Summary statistics for the datasets used in this work (AQuM, WCSum, and SS-FD are shorthands for AQuaMuse, WikiCatSum, and SummScreen-FD, respectively). We report on the number of queries, size of training, development, and test set, and average source and target length (in terms of documents, words, sentences, and words per document). We quantify the abstractiveness of the target by measuring the proportion of n-grams unseen in the source. We also report statistics on the target length augmented with the blueprint (number of QA pairs and words in total).
Iterative Model  Rather than predicting a global plan (i.e., answer plan a or blueprint b) prior to generating output s, we employ an incremental approach that interleaves planning with text generation. Let output s consist of n sentences {s1, s2, . . . , sn}; then, the corresponding blueprint b can be represented as {b1, b2, . . . , bn}, where b_i : {(a^i_{j+1}, q^i_{j+1}), . . . , (a^i_{j+k}, q^i_{j+k})} consists of k question-answer pairs for sentence si. We train our model to iteratively plan and generate one sentence at a time, conditioning on the input and the output sentences generated so far. In particular, we train an encoder-decoder model where the encoder first encodes input d, while the decoder takes summary {s1, . . . , si} generated so far as a prompt and generates blueprint bi+1 for the next sentence si+1, followed by sentence si+1 itself.

The iterative model is trained on quadruples {(d, φ, b1, s1), . . . , (d, s1,i, bi+1, si+1), . . . , (d, s1,n−1, bn, sn), (d, s, bend, send)}, where φ is an empty context placeholder used to predict the first blueprint b1 and corresponding first sentence s1, (n + 1) is the blueprint length, and s1,i = {s1, . . . , si} are the output sentences generated so far; bend and send are special tokens marking the end of the output prediction. We prefix s1,i, bi, and si with special markers ‘‘Context:’’, ‘‘Plan:’’, and ‘‘Next Sentence:’’, respectively. We train the model with the standard maximum-likelihood objective to predict s1,i; bi; si, however, we do not compute the loss for predicting context s1,i to avoid over-optimizing for sentences that appear at the beginning of the output.

The iterative approach does not create a global macro plan. Rather, it learns micro content plans and verbalizes them one-by-one, conditioning on previously generated sentences but not on previously generated QA pairs. Although it does not have a global document view like the end-to-end model, the iterative decoder cannot exceed the output sequence length as it plans and predicts one sentence at a time as bi; si, instead of generating b; s in one go. And unlike the multi-task model, each sentence si is generated by conditioning on the full blueprint bi (consisting of questions and answers).
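The training quadruples for the iterative model could be assembled roughly as follows; sentence_blueprints[i] (the QA pairs aligned to sentence i) and the concrete end-of-output strings are assumptions standing in for bend and send:

def iterative_examples(document, sentences, sentence_blueprints,
                       end_plan="[NO MORE PLAN]", end_sentence="[NO MORE SENTENCE]"):
    # One (input, target) pair per output sentence, plus a final example that
    # teaches the model to emit the end-of-output markers; the loss is not computed
    # on the Context: portion of the target.
    examples = []
    for i, (blueprint, sentence) in enumerate(zip(sentence_blueprints, sentences)):
        context = " ".join(sentences[:i])  # empty for the first sentence
        plan = "; ".join(f"{a}; {q}" for a, q in blueprint)
        target = f"Context: {context} Plan: {plan} Next Sentence: {sentence}"
        examples.append((document, target))
    full_summary = " ".join(sentences)
    examples.append(
        (document, f"Context: {full_summary} Plan: {end_plan} Next Sentence: {end_sentence}"))
    return examples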
4 Experimental Setup
4.1 Datasets
We evaluated our model on benchmarks repre-
sentative of long-form question answering and
summarization. Our datasets vary in terms of the
input given to the generation model (e.g., multiple documents or one, web pages, or dialogue transcripts), the user's information need (e.g., answering a question or aggregating information), and summary style (e.g., genuinely abstractive vs extractive). Common features among them are very long inputs and multi-sentence output summaries. We summarize various dataset statistics in Table 3.
AQuaMuSe
(Kulkarni et al., 2020, 2021) Ist
a query-focused multi-document summarization
dataset; it was created with the intent of simulating
how a search engine might synthesize documents
of high relevance to a user query. It consists of
Google Natural Questions (Kwiatkowski et al.,
2019) paired with web documents extracted from
Common Crawl and long-form answers from
Wikipedia. We approach this task as a genera-
tive QA problem where we take the query and
associated web documents and generate a long-
form answer to the query. We work on the split
from Kulkarni et al. (2021); on average, each
instance has 6.46 web documents (2,008 tokens
per document), leading to very long input (12,987
tokens).
WikiCatSum (Perez-Beltrachini et al., 2019) is a topic-focused multi-document summarization dataset where the goal is to generate Wikipedia abstracts (i.e., lead article sections) from a large set of webpages related to an entity or a topic. It focuses on three entities, namely, Films (59,973 instances), Companies (62,545 instances), and Animals (60,816 instances). In experiments, we collate the different data subsets into one, which we refer to collectively as WikiCatSum. The input webpages are truncated to the first 800 tokens.
SummScreen-FD (Chen et al., 2022) is a recently released dialogue summarization dataset. It contains transcripts of TV episodes (e.g., Game of Thrones, CSI Las Vegas) and corresponding (community authored) summaries. The original dataset is divided into two complementary subsets; we use the ForeverDreaming (FD) subset released as part of the SCROLLS benchmark (Shaham et al., 2022), which incorporates episodes from 88 different shows. SummScreen-FD is a challenging testbed for several reasons. Plot details are often expressed indirectly in conversations between characters and are scattered across the entire transcript. The summarization task is highly compressive: a transcript the size of a book (on average 8,000 tokens; see Table 3) is condensed into a few sentences, and the evaluation of such summaries comes with its own challenges (e.g., it is not realistic to expect humans to read the transcript to be able to assess their quality).
We further analyze the characteristics of these datasets in Table 3. Long-form answers in AQuaMuSe are mostly extractive with only 2%, 13%, 24%, and 31% novel unigrams, bigrams, trigrams, and 4-grams, respectively. In comparison, summaries in WikiCatSum and SummScreen-FD are more abstractive; WikiCatSum abstracts have 13% novel unigrams, 54% bigrams, 78% trigrams, and 86% 4-grams, whereas in SummScreen-FD summaries 17% unigrams, 66% bigrams, 92% trigrams, and 98% 4-grams were not seen in the training. Interestingly, SummScreen-FD summaries have far more propositions than AQuaMuSe or WikiCatSum targets, leading to a much higher number of QA pairs in their blueprints (28.10 vs 8.16 or 9.56). This in turn makes the generation task for end-to-end models very challenging. The average summary length together with the blueprint annotations (i.e., b; s) for SummScreen-FD is almost twice the size of WikiCatSum and AQuaMuSe (597.90 vs 291.28 and 272.68). The majority of questions in AQuaMuSe and WikiCatSum are what questions (76.0% and 74.2%, respectively), followed by who, where, when, and how questions. For SummScreen-FD, what and who questions are most popular (50.1% and 42.9%, respectively).
4.2 Comparison Systems
All our experiments used LONGT5 (Guo et al.,
2021), an extension of the original T5 encoder
(Raffel et al., 2020) with global-local atten-
tion sparsity patterns to handle long inputs. We
compared a vanilla LONGT54 model (xl, 3B pa-
rameters) fine-tuned on our datasets (with a
maximum input sequence length of 4,096 tokens
and a maximum output length of 512 tokens)
against several blueprint variants. These include
an end-to-end LONGT5 model (E2E) which first
decodes blueprint b and then continues to decode
output s; a LONGT5 multitask model (MULTITASK)
which jointly learns to predict the answer plan
4We used the publicly released checkpoints from https://github.com/google-research/longt5.
followed by either the output s or the questions in b; and a LONGT5 iterative model (ITERATIVE) which plans and generates one sentence at a time. Additionally, we implemented a two-stage model (2-STAGE), which first creates blueprint b given input d and then generates output s given b and d as input. Finally, we also fine-tuned T5 (xl, 3B parameters) on our datasets with a maximum input sequence length of 1,024 tokens and a maximum output length of 256 tokens, as a baseline. We present these comparisons in Table 4 together with the performance of various state-of-the-art systems.

We fine-tuned all our models with a learning rate of 0.001 and a batch size of 128, for 50K steps. We select best checkpoints using average Rouge performance on validation sets. During inference, we use beam search with size 5 and alpha 0.8.
5 Automatic Evaluation
In this section we present experimental results
using automatic evaluation metrics that assess
overall summary (and blueprint) quality. Moreover, we quantify the extent to which automatically generated output is grounded to the blueprint and faithful to the input document/s.
5.1 Metrics
Summary and Blueprint Quality We evaluate
summary quality automatically using (summary-
level) Rouge F1 (Lin and Hovy, 2003). We report
only RougeLSum5 in Table 4 for the sake of
brevity. We also use RougeLSum to evaluate the
quality of the automatically generated blueprint,
d.h., the QA pairs and their order against the ref-
erence blueprint.
Informativeness and Grounding We evaluate
informativeness using QA-based metrics. Spe-
cifically, following the reading comprehension
literature (Rajpurkar et al., 2016, 2018b), we
quantify the extent to which the generated text
can answer all questions from its reference (Infor-
mativeness) and predicted blueprint (Grounding).
Following Stelmakh et al. (2022), we use a
RoBERTa model (Liu et al., 2019) fine-tuned on
SQuAD-V2 for question-answering in both cases.6
5RougeLSum is very similar to ROUGE-L; while the
latter is calculated on the summary as a whole, RougeLSum
interprets newlines as sentence boundaries.
6This is a high performing model
reaching 86.8%
exact-match accuracy and 89.8% F1 on SQuAD.
Tisch 4: Results on AQuaMuSe, WikiCatSum,
and SummScreen-FD test sets. Baseline and ear-
lier SOTA models are presented in the top block
and all blueprint models are shown in the bottom
block. Models marked with * generate extractive
summaries. HIBERT, TextRank, and SIBERT re-
sults on AQuaMuSe are taken from Kulkarni et al.
(2021). BART and REFLECT (extract-then-abstract)
results are taken from Song et al. (2022). Hybrid
R2T-BART (content selection + generation) results are taken from Chen et al. (2022). Best results for each task are boldfaced. Scores that are not significantly different (using paired bootstrap resampling; p < 0.05) from the best score in each
column are marked with a dagger (†).
Given generated text s′ and question-answer pair (qi, ai) from the (reference or predicted) blueprint, we apply our question-answering model to s′ to predict answer a′i to question qi. We then compute the token-level F1 score between predicted answer a′i and ground truth answer ai, and report the average.
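Concretely, the metric can be sketched as follows, assuming answer_question wraps the SQuAD-tuned question-answering model mentioned above:

from collections import Counter

def token_f1(prediction, reference):
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def qa_score(generated_text, blueprint, answer_question):
    # Average token-level F1 between each gold answer and the answer predicted
    # from the generated text, over all QA pairs in the blueprint.
    scores = [token_f1(answer_question(q, generated_text), a) for q, a in blueprint]
    return sum(scores) / len(scores) if scores else 0.0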
Faithfulness Hallucinations are a widely known
issue with neural abstractive summarization (Song
et al., 2018; Maynez et al., 2020; Kryscinski et al.,
2020; Gabriel et al., 2021), especially when a
sentence combines content from multiple sources
(Lebanoff et al., 2019).
Following previous work (Maynez et al., 2020;
Falke et al., 2019; Narayan et al., 2022; Honovich
et al., 2022; Duˇsek and Kasner, 2020), we quan-
tify the extent to which generated summaries are
faithful to their input using textual entailment. We
resort to textual entailment for two reasons; firstly,
it is a relatively intuitive metric, all information
in a summary should be entailed by the source
or at least not conflict with it; secondly, recent
studies (Maynez et al., 2020; Fischer et al., 2022)
have shown that it correlates with human judg-
ments of faithfulness across summarization data-
sets and tasks.
Following Honovich et al. (2022), we trained an
entailment model by fine-tuning T5-11B (Raffel
et al., 2020) on the Adversarial NLI dataset
(ANLI; Nie et al., 2020). For each sentence (hy-
pothesis) in the summary, we compute its entail-
ment probability given the input (premise) and
report the average across all sentences to obtain
an overall score (Maynez et al., 2020).
More formally, let E denote a textual entailment model that predicts E(a, b), namely, that text b is entailed by text a. The faithfulness score F of summary s containing sentences s1, . . . , sn with respect to input D is computed as:

F(s) = (1/n) Σ_{i=1}^{n} E(D, si)

where n is the number of sentences in the summary. If the input is longer than the T5 maximum encode length, we split it, calculate the entailment probability per split, and take the maximum. We convert probabilities to binary labels using a threshold (1 if > 0.5, and 0, otherwise).
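A sketch of the scoring loop, assuming entail_prob(premise, hypothesis) wraps the ANLI-tuned entailment model and that the input has already been split to fit the encoder:

def faithfulness(document_splits, summary_sentences, entail_prob, threshold=0.5):
    # F(s) = (1/n) * sum_i E(D, s_i): each summary sentence is scored against the
    # input (maximum entailment probability over the splits) and binarized.
    labels = []
    for sentence in summary_sentences:
        prob = max(entail_prob(split, sentence) for split in document_splits)
        labels.append(1.0 if prob > threshold else 0.0)
    return sum(labels) / len(labels) if labels else 0.0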
We further validated our ANLI entailment
scores against human judgments of faithfulness
elicited as part of SummEval (Fabbri et al., 2021),
a recently released dataset for assessing automated
summarization metrics. Our entailment predic-
tions correlate well with human ratings, achiev-
ing a Spearman’s rank correlation of ρ = 0.774.
5.2 Results

Why LONGT5 for Blueprint Models  All the tasks we are dealing with require modeling input of highly complex nature, which is often very long (see Table 3). Our results in Table 4 (see Rouge/summary column) demonstrate that T5 models always fall behind LONGT5, underscoring the importance of sparse attention mechanisms for modeling long inputs. Indeed, LONGT5 sets a new state of the art on AQuaMuSe and SummScreen-FD. On WikiCatSum, it is slightly worse than REFLECT (Song et al., 2022), an extract-then-abstract model which has a dedicated content selection module. Similar content selection techniques could also benefit LONGT5, however, we leave this to future work. We henceforth use LONGT5 as a base model for fine-tuning our blueprint models.
Blueprint Models and Rouge  Compared to LONGT5, blueprint variants slightly underperform on AQuaMuse, but score better on WikiCatSum and SummScreen-FD (see MULTITASK model). All differences between LONGT5 and blueprint models are statistically significant (using paired bootstrap resampling; p < 0.05). For a fair comparison, we always use a maximum decoder length of 512 tokens. With the exception of AQuaMuse, E2E is inferior to other blueprint models, which is not surprising since it has to generate much longer text (recall it predicts b; s rather than simply s). Overall, MULTITASK is significantly better than other blueprint models on WikiCatSum but on par with ITERATIVE on SummScreen-FD.
Similar patterns emerge when evaluating the
predicted blueprints against reference QA pairs,
with ITERATIVE significantly outperforming the
other two variants on SummScreen-FD. This could
be due to the fact that SummScreen-FD summaries
have far more propositions than AQuaMuSe or
WikiCatSum targets; it is better to predict them
one sentence at a time, rather than all together.
With regard to WikiCatSum, the difference be-
tween MULTITASK and ITERATIVE is not significant
(although MULTITASK has a slight numerical advan-
tage) and both systems are significantly better than
E2E. On AQuaMuSe, MULTITASK is significantly
better than E2E and ITERATIVE.
Note that all 2-STAGE models are significantly
worse in comparison to blueprint variants, when
evaluating either their blueprints or summaries
(in terms of Rouge). While our models learn
to optimize blueprints and summaries together,
2-STAGE models are faced with the harder task of
predicting the blueprint solely based on the input
(text-to-data). Since the blueprints learned by the
first stage are of poor quality, the summaries
generated in the second stage are also inferior.
Blueprint Models and Informativeness Our
blueprint annotation of reference summaries nat-
urally provides a more principled alternative to
Rouge. We can now use QA pairs in reference
blueprints to evaluate the informativeness of pre-
dicted summaries. Results follow a pattern overall
similar to Rouge, however, this approach reveals
the complexity of the different generation tasks
better than Rouge. While we were able to achieve
reasonably high Rouge across datasets, we are
far from generating informative summaries. On
SummScreen-FD, in particular, we achieve a max-
imum Rouge score of 31.88, but are able to answer
correctly only 7.59% of reference questions using
the predicted summaries.
Across datasets, LONGT5 performs on par with
MULTITASK, the difference between the two models
is not statistically significant, and the same is true
of ITERATIVE on SummScreen.
Blueprint Models and Grounding The E2E
and ITERATIVE variants are significantly better
than MULTITASK in generating texts grounded to
their predicted blueprints (see ground. column in
Table 4). This is because both models generate
text conditioned on their blueprints; E2E first pre-
dicts blueprint b and then continues to generate
output s using both b and the input, whereas ITER-
ATIVE plans and generates one sentence at a time
as bi; si. This is not the case with MULTITASK,
which generates s conditioned on answer spans
only. E2E performs slightly better than ITERA-
TIVE on AQuaMuSe and WikiCatSum (differences
are not statistically significant) but struggles on
SummScreen-FD, where summaries are longer
with more facts/propositions, requiring inference
over long-range dependencies, and common sense
reasoning. ITERATIVE seems the best option for
grounded generation without sacrificing informa-
tiveness (ITERATIVE is most informative amongst
blueprint models on SummScreen-FD, second best
on AQuaMuSe, and third best on WikiCatSum).
ITERATIVE Is Most Faithful Model  As far as faithfulness is concerned, ITERATIVE performs consistently better than E2E and MULTITASK, as well as T5 and LONGT5 models where text is generated from scratch without any planning (pairwise differences between ITERATIVE and comparison systems are all significant with the exception of E2E on AQuaMuse). On SummScreen-FD, ITERATIVE brings large gains on faithfulness without sacrificing informativeness (both in terms of Rouge and QA-F1). The ANLI score for ITERATIVE is 20.84, whereas it is below 10 for E2E and MULTITASK. E2E outperforms LONGT5 on AQuaMuSe and WikiCatSum, but gains are smaller compared to ITERATIVE.

Table 5: System output and reference summary for SummScreen-FD (CSI S6.E9, ‘‘Dog Eat Dog’’). Propositions which are not grounded to the input are highlighted. Generated questions from blueprint models are not shown due to space constraints.

We show examples of system output in Table 5, highlighting propositions that are not grounded to the input. E2E summaries are shorter, which is somewhat expected; the model has to decode both the plan and the summary and in cases where the blueprint is large (e.g., in SummScreen-FD),
there is no more room to decode the summary. MULTITASK is more verbose, however, the plan (a sequence of answer spans) is less detailed and as a result the summary less accurate (Jackpot's pretzels is a restaurant, not a killer). ITERATIVE contains many details in the summary, more than the reference, which are not hallucinations. Both R2T-BART and LONGT5 are rather loose with the facts and generate multiple hallucinations.

Table 6: Example of plan/summary generated by our E2E blueprint model as answer to the question ‘‘What is the difference between an Old English Bulldog and an English Bulldog?’’ (AQuaMuse test set); user edits to the plan and the updated summary are highlighted.
Blueprint Models are Controllable Our con-
ceptualization of text plans as QA pairs brings
inherent controllability to the generation process.
By changing the blueprint, we can control content
selection (i.e., what to say) and planning (i.e.,
in what order) without retraining the model or
introducing additional control mechanisms. We
provide an example in Table 6 where the plan pre-
dicted by the E2E model has been edited to render
it more coherent and factual. As can be seen, the
model is able to change its output according to
the modified plan. Another example is shown in
Table 7, where the output is rendered shorter by
removing QA pairs from the predicted plan.
We are also able to control the faithfulness
of predicted summaries as follows. We take the
predicted plan and remove question-answer pairs
(E2E, ITERATIVE) or answer spans (MULTITASK)
that cannot be answered based on the input. We
then prompt our decoder with the modified plan
and generate a new summary (or sentence for
ITERATIVE). In Table 8, we quantitatively eval-
uate +drop variants, which are controlled for
faithfulness against vanilla blueprint models. We
observe improvements in entailment scores across
Table 7: Example of plan/summary generated by the E2E blueprint model as answer to the question ‘‘What section of the world or country is hinduism usually found in?’’ (AQuaMuse test set); the part of the plan which is removed by the user is highlighted, and the shorter summary generated from the elided plan is also shown.
the board (see column entail. in the table), with
the ITERATIVE+drop performing best. Improve-
ments on abstractive datasets (WikiCatSum and
SummScreen-FD) are larger compared to AQua-
MuSe which is mostly extractive (see Table 3).
The minor drop in Rouge and informativeness
is somewhat expected as the models now zoom
in on information they can reliably talk about,
improving the consistency of the output.
Finally, we also experiment with creating sim-
ple summaries, by forcing the ITERATIVE model to
generate from a single question-answer pair on
each iteration (see +Q1 variant in Table 8). In the
example shown in Table 9, ITERATIVE+Q1 produces
simple summary sentences, each focusing on a sin-
gle information element. Interestingly, as far as
the ITERATIVE model is concerned, +Q1 variants
are as faithful as +drop ones even if they do
not explicitly control for faithfulness (across data-
sets the differences between the two models are
not statistically significant). This suggests that
controlling for simplicity might be sufficient to
reduce hallucinations, however, at the expense of
informativeness (Rouge scores for +Q1 variants
tend to be significantly worse compared to +drop counterparts).

Most of the controllability cases we illustrate here are fully automatic and could be conceptualized as system flags that users select according to requirements (e.g., low tolerance for hallucinations, shorter summaries for small screen displays). Another potential use case would be to generate summaries for a set of questions provided by the user. Their input might be articles retrieved as an answer to a query, or in an educational context several chapters on a topic (e.g., cell biology). However, we leave this to future work.

Table 8: Controllability results on the AQuaMuSe, WikiCatSum and SummScreen-FD test sets. Lighter blue color means more control. Best results for each metric are boldfaced. Scores that are not significantly different (using paired bootstrap resampling; p < 0.05) from the best score for each column are marked with a dagger (†).

Table 9: System output from ITERATIVE and ITERATIVE+Q1 generating WikiCatSum abstract on ‘‘Abraham Verghese.’’

5.3 Ablation Studies

As described in Section 3.2, we construct blueprint annotations using the Rheme- and Coverage-based selection strategies. Table 10 presents various ablations that provide rationales for these annotation choices. For the sake of brevity, we report experiments with the E2E model trained (for 50,000 steps) on AQuaMuSe. We observe very similar trends on the other two datasets. As can be seen, it is empirically better to form blueprints from answer-question pairs rather than predicting the questions first and then their answers, which is more natural (at least to humans). We further assessed whether sorting the QA pairs based on how they appear in the summary matters by defaulting to a random ordering (see −Sorted in the table). Removing either Rheme or Coverage has a small negative impact on the summaries but not their blueprints, while removing them both is detrimental to summary quality; the absence of Sorting mostly affects the quality of the blueprint. It is not surprising that sorting is most important to generating a blueprint with correctly ordered propositions.

                                    Rouge (RLSum)
E2E                             summary   blueprint   both
QA Plan, Rheme, Covg, Sorted     48.75     39.06      44.31
AQ Plan, Rheme, Covg, Sorted     50.86     39.95      45.60
  −Sorted, Random                50.79     36.08      43.43
  −Rheme                         47.16     40.70      44.19
  −Coverage                      47.02     41.37      44.79
  −Rheme, −Coverage              18.05     42.54      40.90

Table 10: E2E model trained on AQuaMuSe with different selection and sorting (validation set).
6 Human-based Evaluation
In addition to automatic evaluation, we conducted
three human-based studies assessing different di-
mensions of output quality. Wishing to avoid
well-documented issues7 with automated bots on
Amazon Mechanical Turk and crowdworkers run-
ning through HITs as quickly as possible without
paying attention to the tasks, we used a few trained
annotators. They were given task-specific instruc-
tions and went through several pilots to iron out
disagreements on edge cases.8
6.1 Summary Quality
Our first study assessed overall summary quality.
Specifically, we asked our annotators to select the
best among three system summaries taking into ac-
count how much they deviated from the reference
in terms of informativeness (are the summaries on
topic or emphasize irrelevant details?) and over-
all fluency. We adapted the definition of fluency
provided in Howcroft et al. (2020): Does the text
‘flow well’ or is it a sequence of unconnected
parts?
We conducted our annotation study on 100 in-
stances, each randomly sampled from AQuaMuse,
WikiCatSum, and SummScreen. We collected rat-
ings from three annotators (after two rounds of
pilot studies to improve agreement) for the out-
put of seven systems. Overall, we obtained 100
(instances) x 3 (datasets) x 6 (systems) x 3 (anno-
tators) = 5,400 annotations. Annotator agreement
was 97.11%. Our results are presented in Table 11.
We report on percentage of times each system was
ranked best.
In general, we observe that LONGT5 and
blueprint models based on it are perceived as
significantly better than previous state-of-the-art
models (i.e., SIBERT and R2T-BART). On AQua-
Muse, LONGT5 is rated overall best, followed by
E2E and MULTITASK (however, differences be-
tween them are not statistically significant). On
WikiCatSum, E2E is rated best but is not signif-
icantly different compared to the other models.
On SummScreen, our ITERATIVE variant is rated
best followed by LONGT5. These results mirror the
difficulty of the task (see Table 3), the longer the
input/output, the better ITERATIVE performs.
7https://stanforddaily.com/2020/06/21/.
8We release our instructions and annotation templates
together with our data and models.
Table 11: Proportion of times each system was
ranked best for summary quality (on AQuaMuse,
WikiCatSum, and SummScreen test sets). Best
results for each task are boldfaced. Systems in
each column are marked with † when they are
not significantly different from the best system;
unmarked pairwise differences from the best system are significant (p < 0.01; using Friedman's ANOVA test with post-hoc Wilcoxon signed-rank test, Bonferroni corrected for multiple comparisons).
Table 12: Blueprint quality human evaluation on
AQuaMuse, WikiCatSum, and SummScreen-FD
test sets. Mean scores for coherence (Coh; higher
is better) and proportion of QA pairs deemed re-
dundant (Red; lower is better). Best results for
each task are boldfaced. Systems in each column
are marked with † when they are not significantly different from the best system; unmarked pairwise differences from the best system are significant (p < 0.01; using a Friedman's ANOVA test with post-hoc Wilcoxon signed-rank test, Bonferroni corrected for multiple comparisons).
6.2 Blueprint Quality
We further evaluated the predicted plans more
directly. Participants were shown QA blueprints
and asked to assess whether they tell a coherent
story (are they all relevant and ordered compre-
hensively?) using a 3-point scale (where 3 is best
and 1 is worst). They were also asked to evaluate
whether the plans have redundant QA pairs; a QA
pair is redundant if it does not add new infor-
mation to the plan. We collected judgments for
Table 13: Human evaluation results for blueprint grounded generation on AQuaMuse, WikiCatSum,
and SummScreen-FD test sets. Proportion of QA pairs not mentioned in the summary (Absent; lower
is better); proportion of QA pairs with information contradictory to the summary (Contra; lower is
better), and mean scores for new information present in the summary (NewInfo; lower is better). The
best results for each task are boldfaced. Systems in each column are marked with † when they are
not significantly different from the best system; unmarked pairwise differences from the best system
are significant (p < 0.01; using a Friedman's ANOVA test with post-hoc Wilcoxon signed-rank test,
Bonferroni corrected for multiple comparisons).
the same instances used in our summary quality
evaluation from three annotators whose overall
agreement was 97.87% and obtained a total of
100 (instances) x 3 (datasets) x 5 (systems) x 3
(raters) = 4,500 annotations.
Table 12 shows the results of this study. We
report mean scores per dataset for all blueprint
models. As an upper bound, we further elicited
annotations for blueprints automatically created
from gold standard reference summaries (see row
Gold in the table). E2E generates the most co-
herent blueprints: Differences between E2E and
all comparison systems are statistically significant
with the exception of the gold standard. This is
not surprising, since all QA pairs in E2E are gen-
erated together, whereas in MULTITASK the spans
and their corresponding questions are generated
separately. ITERATIVE only generates QA pairs for one sentence at a time, and thus we would not expect it to be more coherent than models which generate a global document plan. With regard to redundancy, ITERATIVE blueprints are generally the most redundant, which again stems from the lack of a global view of previously generated QA pairs. ITERATIVE also underscores the limitations of our question generation technology, which is far from perfect: several QA pairs differ on the surface but are semantically equivalent, and we have no means of detecting this without robust coreference resolution.
6.3 Blueprint Grounded Generation
We next examine whether model summaries are
grounded to their blueprints. Specifically, we
asked our annotators to decide whether each QA
pair in the blueprint is mentioned in the summary,
and report the number of times it isn’t. Ideally, we
would like the summary to follow the blueprint
as closely as possible. For QA pairs mentioned in
the summary, we further asked our annotators to
highlight whether the intent of the question was
preserved or contradicted (we report the number of
contradictions). Finally, we also asked participants
to decide whether the summary has additional in-
formation which cannot be found in its blueprint,
using a 3-point scale (where 3 is for summaries
with lots of new information and 1 is for sum-
maries with no new information). We elicited
annotations for blueprint models, and, as an upper
bound, for gold summaries and blueprints extrap-
olated from them. We obtained 100 (instances)
x 3 (datasets) x 5 (systems) x 3 (raters) = 4,500
judgments.
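To make the bookkeeping concrete, the sketch below shows how such per-summary judgments could be aggregated into the Absent, Contra, and NewInfo scores reported in Table 13. It is an illustrative aggregation under our assumptions (e.g., that Absent and Contra are normalized over all blueprint QA pairs); the dataclass names are hypothetical and this is not the scoring code used in the paper.

```python
# Illustrative aggregation of grounding judgments (hypothetical data layout).
from dataclasses import dataclass
from statistics import mean

@dataclass
class QAJudgment:
    mentioned: bool      # is the QA pair mentioned in the summary?
    contradicted: bool   # if mentioned, is the question's intent contradicted?

@dataclass
class SummaryJudgment:
    qa_judgments: list   # one QAJudgment per blueprint QA pair
    new_info: int        # 1 (no new information) .. 3 (lots of new information)

def grounding_scores(judgments):
    """Aggregate per-summary judgments into Absent, Contra, and NewInfo."""
    qa = [q for j in judgments for q in j.qa_judgments]
    absent = sum(not q.mentioned for q in qa) / len(qa)                 # Absent: lower is better
    contra = sum(q.mentioned and q.contradicted for q in qa) / len(qa)  # Contra: lower is better
    new_info = mean(j.new_info for j in judgments)                      # NewInfo: lower is better
    return {"Absent": absent, "Contra": contra, "NewInfo": new_info}
```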
The results of our grounding experiments are
summarized in Table 13. Across datasets, we ob-
serve that ITERATIVE summaries are most grounded.
ITERATIVE blueprints have the fewest questions that are absent from or contradict their generated texts. ITERATIVE summaries also display the least new information relative to their blueprints. ITERATIVE+drop is slightly less grounded than ITERATIVE; however, this is not entirely surprising, since we prompt the ITERATIVE model with externally modified blueprints (see ITERATIVE+drop in Table 13). Note that ITERATIVE+drop summaries are deemed more faithful than ITERATIVE summaries in automatic evaluation: the entailment scores improve for all three datasets (see Table 4).
7 Conclusion
In this work we proposed a novel plan-based
approach to conditional generation. We concep-
tualized text plans as a sequence of QA pairs
operating as a proxy for what to say and in what
order. We developed Transformer-based models
that generate by conditioning on a global QA
blueprint plan (E2E, MULTITASK) or iteratively
by planning and generating one sentence at a
time (ITERATIVE). Experimental results across three
challenging datasets demonstrate that blueprint
models are inherently more informative than
vanilla sequence-to-sequence approaches without
a planning component. Among the three models presented here (E2E, MULTITASK, ITERATIVE), we find that ITERATIVE is the best choice for grounded generation, suggesting a promising direction for long-form generation.
Blueprint models offer several advantages com-
pared to blackbox generation. Model predictions
can be examined, and errors can be traced
back to the blueprint, which in turn can reveal
whether the output is informative and faithful to
its input. The formulation of the blueprint plan
as question-answer pairs makes it intuitive and
user-friendly. We have discussed how blueprint
models might be used in a human-in-the-loop set-
ting, where users interact with and influence model
predictions directly, e.g., by editing the blueprint
length and content (as different blueprints lead
to different outputs). In the future, we would
like to use blueprints more directly to advance
methods for training language models using re-
ward learning (Sutton and Barto, 2018), e.g.,
based on whether the output answers the blueprint
questions. Rather than eliciting expensive human
feedback (Stiennon et al., 2020), blueprints could
provide a cheaper automatic alternative. Finally,
although we focused primarily on the generation
problem in this work, we believe blueprints might
also be useful as a general-purpose approach to
retrieving and organizing important content, es-
pecially when faced with many and very long
inputs.
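As a sketch of the reward idea above: given a blueprint and a generated summary, one could query an off-the-shelf QA model with each blueprint question against the summary and reward the fraction of questions whose reference answers are recovered. The `answer_question` callable below is a hypothetical placeholder for such a model; this is an illustration of the idea, not a component of the present work.

```python
def blueprint_reward(blueprint, summary, answer_question):
    """Fraction of blueprint questions whose answers are recovered from the summary.

    blueprint: list of (question, reference_answer) pairs.
    answer_question: hypothetical callable (question, context) -> predicted answer,
        standing in for any off-the-shelf extractive QA model.
    """
    if not blueprint:
        return 0.0
    hits = 0
    for question, reference in blueprint:
        prediction = answer_question(question, summary)
        # Exact match for simplicity; token-level F1 or an NLI-based check
        # would be natural alternatives.
        if prediction.strip().lower() == reference.strip().lower():
            hits += 1
    return hits / len(blueprint)
```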
Acknowledgments
We thank the action editor and our reviewers
for their valuable feedback. The human rating
process was managed by Muqthar Mohammad,
Kiranmai Chennuru, Ashwin Kakarla and their
team; without them this work would not have
been possible. We also thank Sheila de Guia and Suneet Dhingra for their invaluable support.
References
Chris Alberti, Daniel Andor, Emily Pitler, Jacob
Devlin, and Michael Collins. 2019. Synthetic
QA corpora generation with roundtrip con-
sistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6168–6173, Florence,
Italy. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/P19-1620
Kristjan Arumae and Fei Liu. 2018. Rein-
forced extractive summarization with question-
focused rewards. In Proceedings of ACL 2018,
Student Research Workshop, pages 105–111,
Melbourne, Australia. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/P18-3015
Kristjan Arumae and Fei Liu. 2019. Guid-
ing extractive summarization with question-
answering rewards. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 2566–2577,
Minneapolis, Minnesota. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N19-1264
Regina Barzilay and Mirella Lapata. 2008.
Modeling local coherence: An entity-based ap-
proach. Computational Linguistics, 34(1):1–34.
https://doi.org/10.1162/coli.2008
.34.1.1
Iz Beltagy, Matthew E. Peters, and Arman
Cohan. 2020. Longformer: The long-document
transformer. ArXiv, abs/2004.05150.
Daniela Brook Weiss, Paul Roit, Ayal Klein,
Ori Ernst, and Ido Dagan. 2021. QA-align:
Representing cross-text content overlap by
aligning question-answer propositions. In Pro-
ceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing,
pages 9879–9894, Online and Punta Cana,
Dominican Republic. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2021.emnlp-main.778
Shuyang Cao and Lu Wang. 2022. HIBRIDS:
Attention with hierarchical biases for structure-
aware long document summarization. In Pro-
ceedings of
the 60th Annual Meeting of
the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 786–807,
Dublin,
Ireland. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2022.acl-long.58
L. Carlson. 1983. Dialogue Games: An Approach
to Discourse Analysis. Riedel, Dordrecht.
https://doi.org/10.1007/978-94-015
-3963-0 9
Asli Celikyilmaz, Antoine Bosselut, Xiaodong
He, and Yejin Choi. 2018. Deep communicat-
ing agents for abstractive summarization. In
Proceedings of the 2018 Conference of the
North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers),
pages 1662–1675, New Orleans, Louisiana.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N18
-1150
Mingda Chen, Zewei Chu, Sam Wiseman, and
Kevin Gimpel. 2022. SummScreen: A dataset
for abstractive screenplay summarization. In
Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 8602–8615,
Dublin,
Ireland. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2022.acl-long.589
Rewon Child, Scott Gray, Alec Radford, and
Ilya Sutskever. 2019. Generating long se-
quences with sparse transformers. ArXiv,
abs/1904.10509. https://doi.org/10.48550
/arXiv.1904.10509
Kordula De Kuthy, Madeeswaran Kannan,
Haemanth Santhi Ponnusamy, and Detmar
Meurers. 2020. Towards automatically generat-
ing questions under discussion to link information and discourse structure. In Proceedings of
the 28th International Conference on Computa-
tional Linguistics, pages 5786–5798, Barcelona,
Spain (Online). International Committee on
Computational Linguistics. https://doi.org
/10.18653/v1/2020.coling-main.509
Kordula De Kuthy, Nils Reiter, and Arndt Riester.
2018. QUD-based annotation of discourse
structure and information structure: Tool and
evaluation. In Proceedings of the Eleventh In-
ternational Conference on Language Resources
and Evaluation (LREC 2018), Miyazaki, Japan.
European Language Resources Association
(ELRA).
Daniel Deutsch and Dan Roth. 2021a. Question-based salient span selection for more controllable text summarization. ArXiv, abs/2111.07935. https://doi.org/10.48550/arXiv.2111.07935
Daniel Deutsch and Dan Roth. 2021b. Under-
standing the extent to which content quality
metrics measure the information quality of sum-
maries. In Proceedings of the 25th Conference
on Computational Natural Language Learning,
pages 300–309, Online. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/2021.conll-1.24
Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi,
Zhengbao Jiang, and Graham Neubig. 2021.
GSum: A general framework for guided neu-
ral abstractive summarization. In Proceedings
of the 2021 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 4830–4842, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2021.naacl-main.384
Esin Durmus, He He, and Mona Diab. 2020.
FEQA: A question answering evaluation frame-
work for faithfulness assessment in abstractive
summarization. In Proceedings of the 58th An-
nual Meeting of the Association for Computa-
tional Linguistics, pages 5055–5070, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-main.454
Ondřej Dušek and Zdeněk Kasner. 2020. Evaluat-
ing semantic accuracy of data-to-text generation
with natural language inference. In Proceed-
ings of the 13th International Conference on
Natural Language Generation, pages 131–137,
Dublin, Ireland. Association for Computational
Linguistics.
Ori Ernst, Avi Caciularu, Ori Shapira, Ramakanth
Pasunuru, Mohit Bansal, Jacob Goldberger,
and Ido Dagan. 2022. Proposition-level clus-
tering for multi-document summarization. In
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, pages 1765–1779,
Seattle, United States. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2022.naacl-main.128
Matan Eyal, Tal Baumel, and Michael Elhadad.
2019. Question answering as an automatic
evaluation metric for news article summa-
rization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 3938–3948,
Minneapolis, Minnesota. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N19-1395
Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409. https://doi.org/10.1162/tacl_a_00373
Tobias Falke, Leonardo F. R. Ribeiro, Prasetya
Ajie Utama, Ido Dagan, and Iryna Gurevych.
2019. Ranking generated summaries by cor-
rectness: An interesting but challenging application for natural language inference. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 2214–2220, Florence, Italy. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/P19-1213
Tim Fischer, Steffen Remus, and Chris Biemann.
2022. Measuring faithfulness of abstractive
summaries. In Proceedings of the 18th Con-
ference on Natural Language Processing
(KONVENS 2022), pages 63–73, Potsdam,
Germany.
Saadia Gabriel, Asli Celikyilmaz, Rahul Jha,
Yejin Choi, and Jianfeng Gao. 2021. GO
FIGURE: A meta evaluation of factuality in
summarization. In Findings of the Association
for Computational Linguistics: ACL-IJCNLP
2021, pages 478–487, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2021.findings-acl.42
Sebastian Gehrmann, Yuntian Deng,
and
Alexander Rush. 2018. Bottom-up abstractive
summarization. In Proceedings of
the 2018
Conference on Empirical Methods in Natu-
ral Language Processing, pages 4098–4109.
Association for Computational Linguistics.
Jonathan Ginzburg. 1994. An update semantics
for dialogue. In Proceedings of the 1st Tilburg
International Workshop on Computational Se-
mantics. Tilburg, The Netherlands.
Markus Guhe. 2007. Incremental Conceptualiza-
tion for Language Production. Mahwah, NJ:
Lawrence Erlbaum Associates Publishers.
Mandy Guo, Joshua Ainslie, David C. Uthus,
Santiago Ontañón, Jianmo Ni, Yun-Hsuan
Sung, and Yinfei Yang. 2021. LongT5: Effi-
cient text-to-text transformer for long sequences.
ArXiv, abs/2112.07916. https://doi.org
/10.18653/v1/2022.findings-naacl.55
Luheng He, Mike Lewis, and Luke Zettlemoyer.
2015. Question-answer driven semantic role
labeling: Using natural language to annotate
natural language. In Proceedings of the 2015
Conference on Empirical Methods in Nat-
ural Language Processing, pages 643–653,
Lisbon, Portugal. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D15-1076
Or Honovich, Roee Aharoni, Jonathan Herzig,
Hagai Taitelbaum, Doron Kukliansy, Vered
Cohen, Thomas Scialom,
Idan Szpektor,
Avinatan Hassidim, and Yossi Matias. 2022.
TRUE: Re-evaluating factual consistency evaluation. In Proceedings of the Second
DialDoc Workshop on Document-grounded
Dialogue and Conversational Question An-
swering, pages 161–175, Dublin,
Ireland.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2022
.dialdoc-1.19
Or Honovich, Leshem Choshen, Roee Aharoni,
Ella Neeman, Idan Szpektor, and Omri Abend.
2021. Q2: Evaluating factual consistency in
knowledge-grounded dialogues via question
generation and question answering. In Pro-
ceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing,
pages 7856–7870. https://doi.org/10
.18653/v1/2021.emnlp-main.619
David M. Howcroft, Anya Belz, Miruna-Adriana
Clinciu, Dimitra Gkatzia, Sadid A. Hasan,
Saad Mahamood, Simon Mille, Emiel van
Miltenburg, Sashank Santhanam, and Verena
Rieser. 2020. Twenty years of confusion in hu-
man evaluation: NLG needs evaluation sheets
and standardised definitions. In Proceedings
of the 13th International Conference on Nat-
ural Language Generation, pages 169–182, Dublin, Ireland. Association for Computational Linguistics.
Hayate Iso, Yui Uehara, Tatsuya Ishigaki,
Hiroshi Noji, Eiji Aramaki, Ichiro Kobayashi,
Yusuke Miyao, Naoaki Okazaki, and Hiroya
Takamura. 2019. Learning to select, track, and
generate for data-to-text. In Proceedings of the
57th Annual Meeting of the Association for
Computational Linguistics, pages 2102–2113,
Florence,
Italy. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P19-1202
Nikiforos Karamanis. 2004. Entity Coherence
for Descriptive Text Structuring. Ph.D. thesis,
School of Informatics, University of Edinburgh.
Rodger Kibble and Richard Power. 2004. Optimizing referential coherence in text generation. Computational Linguistics, 30(4):401–416. https://doi.org/10.1162/0891201042544893
Wei-Jen Ko, Te-yuan Chen, Yiyan Huang, Greg
Durrett, and Junyi Jessy Li. 2020. Inquisi-
tive question generation for high level
text
comprehension. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 6544–6555,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/2020.emnlp-main.530
Wei-Jen Ko, Cutter Dalton, Mark P. Simmons, Eliza Fisher, Greg Durrett, and Junyi Jessy Li. 2021. Discourse comprehension: A question answering framework to represent sentence connections. ArXiv, abs/2111.00701. https://doi.org/10.48550/arXiv.2111.00701
Wei-Jen Ko and Junyi Jessy Li. 2020. Assessing
discourse relations in language generation from
GPT-2. In Proceedings of the 13th Interna-
tional Conference on Natural Language
Generation, pages 52–59, Dublin,
Ireland.
Association for Computational Linguistics.
Ivana Kruijff-Korbayová and Mark Steedman. 2003. Discourse and information structure. Journal of Logic, Language and Information, 12(3):249–259. https://doi.org/10.1023/A:1024160025821
Wojciech Kryscinski, Bryan McCann, Caiming
Xiong, and Richard Socher. 2020. Eval-
uating the factual consistency of abstrac-
tive text summarization. In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 9332–9346, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.emnlp-main.750
Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. 2021. BookSum: A collection of da-
tasets for long-form narrative summarization.
ArXiv, abs/2105.08209. https://doi.org
/10.48550/arXiv.2105.08209
Sayali Kulkarni, Sheide Chammas, Wan Zhu,
Fei Sha, and Eugene Ie. 2020. Aquamuse:
Automatically generating datasets for query-
based multi-document summarization. ArXiv,
abs/2010.12694. https://doi.org/10.48550
/arXiv.2010.12694
Sayali Kulkarni, Sheide Chammas, Wan Zhu, Fei Sha, and Eugene Ie. 2021. CoMSum and SIBERT: A dataset and neural model for query-based multi-document summarization. In Document Analysis and Recognition – ICDAR 2021, pages 84–98, Cham. Springer International Publishing. https://doi.org/10.1007/978-3-030-86331-9_6
Tom Kwiatkowski, Jennimaria Palomaki, Olivia
Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein,
Illia Polosukhin,
Jacob Devlin, Kenton Lee, Kristina Toutanova,
Llion Jones, Matthew Kelcey, Ming-Wei
Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc
Le, and Slav Petrov. 2019. Natural questions:
A benchmark for question answering research.
Transactions of the Association for Computa-
tional Linguistics, 7:452–466. https://doi
.org/10.1162/tacl_a_00276
Staffan Larson. 2002. Issue-based Dialogue Management. Ph.D. thesis, Göteborg University, Sweden.
Logan Lebanoff,
John Muchovej, Franck
Dernoncourt, Doo Soon Kim, Seokhwan Kim,
Walter Chang, and Fei Liu. 2019. Analyzing
sentence fusion in abstractive summarization.
In Proceedings of the 2nd Workshop on New
Frontiers in Summarization, pages 104–110,
Hong Kong, China. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D19-5413
Willem J. M. Levelt. 1993. Speaking: From
Intention to Articulation. The MIT Press.
https://doi.org/10.7551/mitpress
/6393.001.0001
Wei Li, Xinyan Xiao, Yajuan Lyu, and Yuanzhuo
Wang. 2018. Improving neural abstractive doc-
ument summarization with explicit informa-
tion selection modeling. In Proceedings of the
2018 Conference on Empirical Methods in Nat-
ural Language Processing, pages 1787–1796,
Brussels, Belgium. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D18-1205
Chin-Yew Lin and Eduard Hovy. 2003. Automatic
evaluation of summaries using n-gram co-
occurrence statistics. In Proceedings of the 2003
Human Language Technology Conference of
the North American Chapter of the Association
for Computational Linguistics, pages 150–157.
Yang Liu and Mirella Lapata. 2019. Hierarchical
transformers for multi-document summariza-
tion. In Proceedings of the 57th Annual Meeting
of the Association for Computational Linguis-
tics, pages 5070–5081, Florence, Italy. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/P19-1500
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar
Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, and
Veselin Stoyanov. 2019. RoBERTa: A ro-
bustly optimized BERT pretraining approach.
ArXiv, abs/1907.11692. https://doi.org
/10.48550/arXiv.1907.11692
Chao-Yi Lu and Sin-En Lu. 2021. A survey of approaches to automatic question generation: From 2019 to early 2021. In Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021), pages 151–162, Taoyuan, Taiwan. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP).
Joshua Maynez, Shashi Narayan, Bernd Bohnet,
and Ryan McDonald. 2020. On faithfulness
and factuality in abstractive summarization.
In Proceedings of
the 58th Annual Meet-
ing of
the Association for Computational
Linguistics, pages 1906–1919, Online. As-
sociation
for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-main.173
Kathleen McKeown. 1985. Text Generation: Us-
ing Discourse Strategies and Focus Constraints
to Generate Natural Language Text. Studies in Natural Language Processing. Cambridge University Press.
Chris Mellish, Alistair Knott, Jon Oberlander,
and Mick O’Donnell. 1998. Experiments using
stochastic search for text planning. In Natural
Language Generation, Niagara-on-the-Lake,
Ontario, Canada. Association for Computa-
tional Linguistics.
Adam Meyers, Ruth Reeves, Catherine Macleod,
Rachel Szekely, Veronika Zielinska, Brian
Young, and Ralph Grishman. 2004. The Nom-
Bank project: An interim report. In Proceedings
of
the Workshop Frontiers in Corpus An-
notation at HLT-NAACL 2004, pages 24–31,
Boston, Massachusetts, USA. Association for
Computational Linguistics.
Amit Moryossef, Yoav Goldberg, and Ido
Dagan. 2019a. Improving quality and efficiency
in plan-based neural data-to-text generation. In
Proceedings of
the 12th International Con-
ference on Natural Language Generation,
pages 377–382, Tokyo, Japan. Association for
Computational Linguistics. https://doi
.org/10.18653/v1/W19-8645
Amit Moryossef, Yoav Goldberg, and Ido Dagan.
2019b. Step-by-step: Separating planning from
realization in neural data-to-text generation.
In Proceedings of
the 2019 Conference of
the North American Chapter of the Associ-
ation for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and
Short Papers), pages 2267–2277, Minneapo-
lis, Minnesota. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/N19-1236
Shashi Narayan, Joshua Maynez, Jakub Adamek,
Daniele Pighin, Blaz Bratanic, and Ryan
McDonald. 2020. Stepwise extractive summa-
rization and planning with structured transform-
ers. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 4143–4159, On-
line. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.emnlp-main.339
Shashi Narayan, Gonçalo Simões, Yao Zhao,
Joshua Maynez, Dipanjan Das, Michael
Collins, and Mirella Lapata. 2022. A well-
composed text is half done! Composition sam-
pling for diverse conditional generation. In
Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1319–1339,
Dublin,
Ireland. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2022.acl-long.94
Shashi Narayan, Yao Zhao, Joshua Maynez, Gonçalo Simões, Vitaly Nikolaev, and Ryan McDonald. 2021. Planning with learned entity prompts for abstractive summarization. Transactions of the Association for Computational Linguistics, 9:1475–1492. https://doi.org/10.1162/tacl_a_00438
Yixin Nie, Adina Williams, Emily Dinan, Mohit
Bansal, Jason Weston, and Douwe Kiela. 2020.
Adversarial NLI: A new benchmark for natu-
ral language understanding. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 4885–4901,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.acl-main.441
Martha Palmer, Daniel Gildea,
and Paul
Kingsbury. 2005. The Proposition Bank: An
annotated corpus of semantic roles. Computa-
tional Linguistics, 31(1):71–106. https://
doi.org/10.1162/0891201053630264
Laura Perez-Beltrachini, Yang Liu, and Mirella
Lapata. 2019. Generating summaries with topic
templates and structured convolutional de-
coders. In Proceedings of
the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 5107–5116, Florence, Italy.
Association for Computational Linguistics.
Ratish Puduppully, Li Dong, and Mirella Lapata.
2019a. Data-to-text generation with content se-
lection and planning. In Proceedings of the
33rd AAAI Conference on Artificial Intelli-
gence. AAAI Press. https://doi.org/10
.1609/aaai.v33i01.33016908
Ratish Puduppully, Li Dong, and Mirella Lapata.
2019b. Data-to-text generation with entity
modeling. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 2023–2035, Florence, Italy.
Association for Computational Linguistics.
Ratish Puduppully, Yao Fu, and Mirella Lapata.
2022. Data-to-text generation with varia-
tional sequential planning. Transactions of
the Association for Computational Linguistics,
10:697–715. https://doi.org/10.1162/tacl_a_00484
Ratish Puduppully and Mirella Lapata. 2021.
Data-to-text generation with macro planning.
Transactions of the Association for Computa-
tional Linguistics, 9:510–527. https://doi
.org/10.1162/tacl_a_00381
Valentina Pyatkin, Ayal Klein, Reut Tsarfaty,
and Ido Dagan. 2020. QADiscourse - discourse
relations as QA pairs: Representation, crowdsourcing and baselines. In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 2804–2819, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.emnlp-main.224
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21(140):1–67.
Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018a. Know what you don’t know: Unanswer-
able questions for squad. In Proceedings of
the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 784–789. https://doi.org
/10.18653/v1/P18-2124
Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018b. Know what you don’t know: Unan-
swerable questions for SQuAD. In Proceedings
of the 56th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 2:
Short Papers), pages 784–789, Melbourne,
Australia. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/P18-2124
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev,
and Percy Liang. 2016. SQuAD: 100,000+
questions for machine comprehension of text. In
Proceedings of the 2016 Conference on Empir-
ical Methods in Natural Language Processing,
pages 2383–2392, Austin, Texas. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/D16-1264
Ehud Reiter and Robert Dale. 2000. Build-
ing Natural Language Generation
Sys-
tems. Cambridge University Press, New
York, NY. https://doi.org/10.1017
/CBO9780511519857
Arndt Riester. 2019. Constructing QUD trees,
Questions in Discourse, volume 2: Pragmatics,
pages 164–193. Brill. https://doi.org
/10.1163/9789004378322_007
Craige Roberts. 2012. Information structure in
discourse: Towards an integrated formal the-
ory of pragmatics. Semantics and Pragmatics,
5(6):1–69. https://doi.org/10.3765
/sp.5.6
Tobias Rohde, Xiaoxia Wu, and Yinhan Liu.
2021. Hierarchical learning for generation with
long source sequences. ArXiv, abs/2104.07545.
https://doi.org/10.48550/arXiv
.2104.07545
Thomas Scialom, Sylvain Lamprier, Benjamin
Piwowarski, and Jacopo Staiano. 2019. An-
swers unite! Unsupervised metrics for rein-
forced summarization models. In Proceedings
of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 3246–3256, Hong Kong, China. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1320
Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat,
Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan
Xiong, Mor Geva,
Jonathan Berant, and
Omer Levy. 2022. SCROLLS: Standardized comparison over long language sequences.
ArXiv, abs/2201.03533. https://doi.org
/10.48550/arXiv.2201.03533
Kaiqiang Song, Lin Zhao, and Fei Liu. 2018.
Structure-infused copy mechanisms for ab-
stractive summarization. In Proceedings of the
27th International Conference on Computa-
tional Linguistics, pages 1717–1729, Santa Fe,
New Mexico, USA. Association for Computa-
tional Linguistics.
Yun-Zhu Song, Yi-Syuan Chen, and Hong-Han
Shuai. 2022. Improving multi-document sum-
marization through referenced flexible extrac-
tion with credit-awareness. In Proceedings of
the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1667–1681, Seattle, United
States. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2022.naacl-main.120
Gabriel
Stanovsky,
Julian Michael, Luke
Zettlemoyer, and Ido Dagan. 2018. Supervised
open information extraction. In Proceedings of
the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 885–895, New
Orleans, Louisiana. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/N18-1081
Ivan Stelmakh, Yi Luan, Bhuwan Dhingra,
and Ming-Wei Chang. 2022. ASQA: Factoid
questions meet long-form answers. In Proceed-
ings of
the 2022 Conference on Empirical
Methods in Natural Language Processing,
pages 8273–8288, Abu Dhabi, United Arab
Emirates. Association
for Computational
Linguistics.
Nisan Stiennon, Long Ouyang, Jeffrey Wu,
Daniel Ziegler, Ryan Lowe, Chelsea Voss,
Alec Radford, Dario Amodei, and Paul F.
Christiano. 2020. Learning to summarize with
human feedback. In Advances in Neural In-
formation Processing Systems, volume 33,
pages 3008–3021. Curran Associates, Inc.
Richard Sutton and Andrew Barto. 2018. Re-
inforcement Learning: An Introduction, 2nd
edition. MIT Press.
Jun Suzuki and Masaaki Nagata. 2017. Cutting-off
redundant repeating generations for neural ab-
stractive summarization. In Proceedings of the
15th Conference of the European Chapter of
the Association for Computational Linguistics:
Volume 2, Short Papers, pages 291–297,
Valencia, Spain. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/E17-2047
Bowen Tan, Zichao Yang, Maruan Al-Shedivat,
Eric Xing, and Zhiting Hu. 2021. Progressive
generation of long text with pretrained language
models. In Proceedings of the 2021 Conference
of the North American Chapter of the Associ-
ation for Computational Linguistics: Human
Language Technologies, pages 4313–4324,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2021.naacl-main.341
Jiwei Tan, Xiaojun Wan, and Jianguo Xiao.
2017a. Abstractive document summarization
with a graph-based attentional neural model.
In Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1171–1181,
Vancouver, Canada. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P17-1108
Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017b. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pages 1171–1181. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P17-1108
Ran Tian, Shashi Narayan, Thibault Sellam,
and Ankur P. Parikh. 2019. Sticking to the facts: Confident decoding for faithful data-to-text generation. ArXiv, abs/1910.08684.
https://doi.org/10.48550/arXiv
.1910.08684
Enric Vallduví and Maria Vilkuna. 1998. On
rheme and kontrast. The Limits of Syntax,
pages 79–108. Brill.
Jan Van Kuppevelt. 1995. Discourse structure,
topicality and questioning. Journal of Lin-
guistics, 31(1):109–147. https://doi.org
/10.1017/S002222670000058X
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R.
Fergus, S. Vishwanathan, and R. Garnett,
editors, Advances in Neural Information Pro-
cessing Systems 30, pages 5998–6008. Curran
Associates, Inc.
Alex Wang, Kyunghyun Cho, and Mike Lewis.
2020. Asking and answering questions to
evaluate the factual consistency of
sum-
maries. In Proceedings of
the 58th Annual
Meeting of
the Association for Computa-
tional Linguistics, pages 5008–5020, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-main.450
Matthijs Westera, Laia Mayol, and Hannah Rohde.
2020. TED-Q: TED talks and the questions
they evoke. In Proceedings of the 12th Lan-
guage Resources and Evaluation Conference,
pages 1118–1127, Marseille, France. European
Language Resources Association.
Sam Wiseman, Stuart Shieber, and Alexander
Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017
Conference on Empirical Methods in Natu-
ral Language Processing, pages 2253–2263,
Copenhagen, Denmark. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/D17-1239
Sam Wiseman, Stuart Shieber, and Alexander
Rush. 2018. Learning neural
templates for
text generation. In Proceedings of the 2018
Conference on Empirical Methods in Natu-
ral Language Processing, pages 3174–3187,
Brussels, Belgium. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D18-1356
Xinnuo Xu, Ondˇrej Duˇsek, Verena Rieser,
and Ioannis Konstas. 2021. AggGen: Or-
dering and aggregating while generating. In
Proceedings of
the 59th Annual Meeting
of
the Association for Computational Lin-
guistics and the 11th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 1419–1434,
Online. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2021.acl-long.113
Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi,
Mutethia Mutuma, Rahul Jha, Ahmed Hassan
Awadallah, Asli Celikyilmaz, Yang Liu,
Xipeng Qiu, and Dragomir Radev. 2021.
QMSum: A new benchmark for query-based
multi-domain meeting summarization. In Pro-
ceedings of the 2021 Conference of the North
American Chapter of the Association for Com-
putational Linguistics: Human Language Tech-
nologies, pages 5905–5921, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2021.naacl-main.472