Data-to-text Generation with Macro Planning

Ratish Puduppully and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB

r.puduppully@sms.ed.ac.uk

mlap@inf.ed.ac.uk

Abstract

Recent approaches to data-to-text generation
have adopted the very successful encoder-
decoder architecture or variants thereof. These
models generate text that is fluent (but often
imprecise) and perform quite poorly at select-
ing appropriate content and ordering it co-
herently. To overcome some of these issues,
we propose a neural model with a macro
planning stage followed by a generation stage
reminiscent of traditional methods which em-
brace separate modules for planning and sur-
face realization. Macro plans represent high
level organization of important content such
as entities, events, and their interactions; they
are learned from data and given as input to
the generator. Extensive experiments on two
data-to-text benchmarks (ROTOWIRE and MLB)
show that our approach outperforms compet-
itive baselines in terms of automatic and hu-
man evaluation.

1 Introduction

Data-to-text generation refers to the task of gen-
erating textual output from non-linguistic input
(Reiter and Dale, 1997, 2000; Gatt and Krahmer,
2018) such as databases of records, simulations
of physical systems, accounting spreadsheets, or
expert system knowledge bases. As an example,
Figure 1 shows various statistics describing a
major league baseball (MLB) game, including
extracts from the box score (i.e., the performance
of the two teams and individual team
members who played as batters, pitchers or fielders;
Table (A)), play-by-play (i.e., the detailed
sequence of each play of the game as it occurred;
Table (B)), and a human written game summary
(Table (C)).

Traditional methods for data-to-text generation
(Kukich, 1983; McKeown, 1992; Reiter and Dale,
1997) follow a pipeline architecture, adopting
separate stages for text planning (determining
which content to talk about and how it might
be organized in discourse), sentence planning
(aggregating content into sentences, deciding spe-
cific words to describe concepts and relations, and
generating referring expressions), and linguistic
realization (applying the rules of syntax, mor-
phology, and orthographic processing to generate
surface forms). Recent neural network–based
approaches (Lebret et al., 2016; Mei et al., 2016;
Wiseman et al., 2017) make use of the encoder-
decoder architecture (Sutskever et al., 2014), are
trained end-to-end, and have no special-purpose
modules for how to best generate a text, aside
from generic mechanisms such as attention and
copy (Bahdanau et al., 2015; Gu et al., 2016). The
popularity of end-to-end models has been fur-
ther boosted by the release of new datasets with
thousands of input-document training pairs. The
example shown in Figure 1 is taken from the MLB
dataset (Puduppully et al., 2019b), which contains
baseball game statistics and human written sum-
maries (∼25K instances). ROTOWIRE (Wiseman
et al., 2017) is another widely used benchmark,
which contains NBA basketball game statistics
and their descriptions (∼5K instances).

Wiseman et al. (2017) show that despite being
able to generate fluent text, neural data-to-text
generation models are often imprecise, prone
to hallucination (i.e., generate text that is not
supported by the input), and poor at content
selection and document structuring. Attempts to
remedy some of these issues focus on changing
the way entities are represented (Puduppully et al.,
2019b; Iso et al., 2019), allowing the decoder to
skip low-confidence tokens to enhance faithful
generation (Tian et al., 2019), and making the
encoder-decoder architecture more modular by
introducing micro planning (Puduppully et al.,
2019a; Moryossef et al., 2019). Micro planning
operates at the record level (see Table (A) in

Transactions of the Association for Computational Linguistics, vol. 9, pp. 510–527, 2021. https://doi.org/10.1162/tacl_a_00381
Action Editor: Claire Gardent. Submission batch: 11/2020; Revision batch: 2/2021; Published 5/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Figure 1: MLB statistics tables and game summary. Tables summarize the performance of teams and individual
team members who played as batters and pitchers as well as the most important actions (and their actors) in each
play (Tables (A) and (B)). The macro plan for the game summary is shown at the bottom (Table (E)). <P> indicates
paragraph delimiters. There is a plan for every paragraph in the game summary (correspondence shown in same
color); <V(entity)> verbalizes entities, while <V(inning-T/B)> verbalizes events related to the top/bottom
side of an inning (see Section 3.1). The set of candidate paragraph plans is shown above the macro plan (Table (D)) and
grouped into two types: plans describing a single entity/event or their combinations. Best viewed in color.

Figure 1; e.g., C.Mullins BH 2, J.Villar TEAM Orioles);
it determines which facts should be mentioned
within a textual unit (e.g., a sentence) and how
these should be structured (e.g., the sequence of
records). An explicit content planner essentially
makes the job of the neural network less onerous,
allowing it to concentrate on producing fluent natural
language output without expending too much
effort on content organization.

In this work, we focus on macro planning, the
high-level organization of information and how
it should be presented, which we argue is important
for the generation of long, multi-paragraph
documents (see text (C) in Figure 1). Problem-
atically, modern datasets like MLB (Puduppully
et al., 2019b; and also Figure 1) and ROTOWIRE

(Wiseman et al., 2017) do not naturally lend
themselves to document planning as there is no
explicit link between the summary and the content
of the game (which is encoded in tabular form).
In other words, the underlying plans are latent,
and it is not clear how they might be best repre-
sented, namely, as sequences of records from a
table, or simply words. Nevertheless, game sum-
maries through their segmentation into paragraphs
(and lexical overlap with the input) give clues
as to how content might be organized. Paragraphs
are a central element of discourse (Chafe, 1979;
Longacre, 1979; Halliday and Hasan, 1976),
the smallest domain where coherence and topic
are defined and anaphora resolution is possible
(Zadrozny and Jensen, 1991). We therefore oper-
ationalize the macro plan for a game summary as
a sequence of paragraph plans.

Although resorting to paragraphs describes the
summary plan at a coarse level, we still need to
specify individual paragraph plans. In the sports
domain, paragraphs typically mention entities
(e.g., players important in the game), key events
(e.g., scoring a run), and their interaction. Most
of this information is encapsulated in the
statistics accompanying game summaries (see
Tables (A) and (B) in Figure 1). We thus define
paragraph plans such that they contain verbaliza-
tions of entity and event records (see plan (E) in
Figure 1). Given a set of paragraph plans and their
corresponding game summary (see Table (D) and
summary (C) in Figure 1), our task is twofold.
At training time, we must learn how content was
selected in order to give rise to specific game
summaries (e.g., how input (D) led to plan (E)
for summary (C) in Figure 1), while at test time,
given input for a new game, we first predict a
macro plan for the summary and then generate the
corresponding document.

We present a two-stage approach where macro
plans are induced from training data (by taking the
table and corresponding summaries into account)
and then fed to the text generation stage. Aside
from making data-to-text generation more inter-
pretable, the task of generating a document from
a macro plan (rather than a table) affords greater
control over the output text and plays to the advan-
tage of encoder-decoder architectures which excel
at modeling sequences. We evaluate model per-
formance on the ROTOWIRE (Wiseman et al., 2017)
and MLB (Puduppully et al., 2019b) benchmarks.
Experimental results show that our plan-and-
generate approach produces output that is more
factual, coherent, and fluent compared with existing
state-of-the-art models. Our code, trained
models, and dataset with macro plans can be
found at https://github.com/ratishsp/data2text-macro-plan-py.

2 Related Work

Content planning has been traditionally consid-
ered a fundamental component in natural lan-
guage generation. Not only does it determine
which information-bearing units to talk about,
but it also arranges them into a structure that
creates coherent output. Many content planners
have been based on theories of discourse
coherence (Hovy, 1993), schemas (McKeown
et al., 1997), or have relied on generic plan-
ners (Dale, 1989). Plans are mostly based on
hand-crafted rules after analyzing the target text,
although a few approaches have recognized the
need for learning-based methods. For example,
Duboue and McKeown (2001) learn ordering
constraints in a content plan, Konstas and Lapata
(2013) represent plans as grammar rules whose
probabilities are estimated empirically, while oth-
ers make use of semantically annotated corpora
to bootstrap content planners (Duboue and
McKeown, 2002; Kan and McKeown, 2002).

More recently, various attempts have been made
to improve neural generation models (Wiseman
et al., 2017) based on the encoder-decoder archi-
tecture (Bahdanau et al., 2015) by adding various
planning modules. Puduppully et al. (2019a) pro-
pose a model for data-to-text that first learns a
plan from the records in the input table and then
generates a summary conditioned on this plan.
Shao et al. (2019) introduce a Planning-based
Hierarchical Variational Model where a plan is
a sequence of groups, each of which contains a
subset of input items to be covered in a sentence.
The content of each sentence is verbalized, con-
ditioned on the plan and previously generated
context. In their case, input items are a rela-
tively small list of attributes (∼28) and the output
document is also short (∼110 words).

There have also been attempts to incorporate
neural modules in a pipeline architecture for
data-to-text generation. Moryossef et al. (2019)
develop a model with a symbolic text planning
stage followed by a neural realization stage. They
experiment with the WebNLG dataset (Gardent
et al., 2017) which consists of RDF ⟨Subject,
Object, Predicate⟩ triples paired with corresponding
text. Their document plan is a sequence of
sentence plans that in turn determine the division
of facts into sentences and their order. Along
similar lines, Castro Ferreira et al. (2019) pro-
pose an architecture composed of multiple steps
including discourse ordering, text structuring, lex-
icalization, referring expression generation, and
surface realization. Both approaches show the
effectiveness of pipeline architectures; however,
their task does not require content selection and
the output texts are relatively short (24 tokens on
average).


Although it is generally assumed that task-
specific parallel data is available for model
training, Laha et al. (2020) do away with this
assumption and present a three-stage pipeline
model which learns from monolingual corpora.
They first convert the input to a form of tuples,
which in turn are expressed in simple sentences,
followed by the third stage of merging simple
sentences to form more complex ones by aggre-
gation and referring expression generation. They
also evaluate on data-to-text tasks which have
relatively short outputs. There have also been
efforts to improve the coherence of the output,
especially when dealing with longer documents.
Puduppully et al. (2019b) make use of hierar-
chical attention over entity representations which
are updated dynamically, while Iso et al. (2019)
explicitly keep track of salient entities and mem-
orize which ones have been mentioned.

Our work also attempts to alleviate deficien-
cies in neural data-to-text generation models.
In contrast to previous approaches (Puduppully
et al., 2019a; Moryossef et al., 2019; Laha et al.,
2020), we place emphasis on macro planning and
create plans representing high-level organization
of a document including both its content and
structure. We share with previous work (e.g.,
Moryossef et al. 2019) the use of a two-stage
architecture. We show that macro planning can
be successfully applied to long document data-to-
text generation resulting in improved factuality,
coherence, and fluency without any postpro-
cessing (e.g., to smooth referring expressions)
or recourse to additional tools (e.g., parsing or
information extraction).

3 Problem Formulation

We hypothesize that generation based on plans
should fare better compared to generating from a
set of records, since macro plans offer a bird’s-eye
view, a high-level organization of the document
content and structure. We also believe that macro
planning will work well for long-form text genera-
tion, that is, for datasets that have multi-paragraph
target texts, a large vocabulary space, and require
content selection.

We assume the input to our model is a set of
paragraph plans $\mathcal{E} = \{e_i\}_{i=1}^{|\mathcal{E}|}$ where $e_i$ is a
paragraph plan. We model the process of generating
output summary $y$ given $\mathcal{E}$ as a two-step process,
namely, the construction of a macro plan $x$ based
on the set of paragraph plans, followed by the
generation of a summary given a macro plan as
input. We now explain how E is obtained and
each step is realized. We discuss our model con-
sidering mainly an example from the MLB dataset
(Puduppully et al., 2019b) but also touch on how
the approach can be straightforwardly adapted to
ROTOWIRE (Wiseman et al., 2017).

3.1 Macro Plan Definition

A macro plan consists of a sequence of paragraph
plans separated by a paragraph discourse marker
<P>, that is, $x = e_i$ <P> $e_j$ . . . <P> $e_k$ where
$e_i, e_j, e_k \in \mathcal{E}$. A paragraph plan in turn is a
sequence of entities and events describing the
game. By entities we mean individual players or
teams and the information provided about them in
box score statistics (see rows and column headings
in Figure 1 Table (A)), while events refer to
information described in play-by-play (see Table (B)).
In baseball, plays are grouped in half-innings.
During each half of an inning, a team takes its
turn to bat (the visiting team bats in the top half
and the home team in the bottom half). An example
macro plan is shown at the bottom of Figure 1.

Within a paragraph plan, entities and events are
verbalized into a text sequence along the lines of
Saleh et al. (2019). We make use of special tokens
for the type of record followed by the value
of record from the table. We retain the same
position for each record type and value. For example,
batter C.Mullins from Figure 1 would be verbalized
as C.Mullins H 4 2 2 1 Orioles . . . . For the sake
of brevity we use the shorthand <V(C.Mullins)>
for the full entity.

Paragraph Plan for Entities For a paragraph
containing entities, the corresponding plan will
be a verbalization of the entities in sequence.
For paragraphs with multiple mentions of the
same entity, the plan will verbalize an entity only
once and at its first position of mention. Paragraph
‘‘Keller gave up a home run . . . the teams with the
worst records in the majors’’ from the summary in
Figure 1 describes four entities: B. Keller,
C. Mullins, Royals and Orioles. The respective
plan is the verbalization of the four entities in
sequence: <V(B.Keller)> <V(C.Mullins)> <V(Royals)>
<V(Orioles)>, where V stands for verbalization and
<V(B.Keller)> is a shorthand for B.Keller V
7 5 8 . . . , <V(Royals)> is a shorthand for the
team Royals 9 14 1, and so on.

Paragraph Plan for Events A paragraph may
also describe one or more events. For example,
the paragraph ‘‘With the score tied 1–1 in the
fourth . . . 423-foot home run to left field to
make it 3-1’’ discusses what happened in the
bottom halves of the fourth and fifth innings. We
verbalize an event by first describing the par-
ticipating entities followed by the plays in the
event. Entities are described in the order in which
they appear in a play, and within the same play
we list the batter followed by the pitcher, fielder,
scorer, and basemen. The paragraph plan corre-
sponding to the bottom halves of the fourth and
fifth inning is . Here, is a shorthand for
. . . . . .
, and so on. The en-
tities , ,
, and cor-
respond in turn to W. Merrifield, A. Cashner,
B. Goodwin, and H. Dozier while
refers to the first play in the bottom half of
the fifth inning (see the play-by-play table in
Figure 1) and abbreviates the following detailed
plan: 5 B Royals
Orioles 1
H.Dozier A. Cashner>
Home-run Royals-3-Orioles-1,
and so forth.

The procedure described above is not specific
to MLB and can be ported to other datasets with
similar characteristics such as ROTOWIRE. How-
ever, ROTOWIRE does not provide play-by-play
information, and as a result there is no event
verbalization for this dataset.

3.2 Macro Plan Construction

We provided our definition for macro plans in the
previous sections; however, it is important to note
that such macro plans are not readily available in
data-to-text benchmarks like MLB (Puduppully
et al., 2019b) and ROTOWIRE (Wiseman et al.,
2017) which consist of tables of records r paired
with a gold summary y (see Tables (A)–(C) in
Figure 1). We now describe our method for obtain-
ing macro plans x from r and y.

Similar to Moryossef et al. (2019), we define
macro plans to be conformant with gold sum-
maries such that (1) they have the same splits
into paragraphs—entities and events within a
paragraph in y are grouped into a paragraph plan
in x; and (2) the order of events and entities
in a paragraph and its corresponding plan are
identical. We construct macro plans by matching
entities and events in the summary to records
in the tables. Furthermore, paragraph delimiters
within summaries form natural units which taken
together give rise to a high-level document plan.
We match entities in summaries with entities
in tables using exact string match, allowing for
some degree of variation in the expression of
team names (e.g., A’s for Athletics and D-backs
for Diamondbacks). Information pertaining to
innings appears in the summaries in the form of
ordinal numbers (e.g., first, ninth) modifying the
noun inning and can be relatively easily identi-
fied via pattern matching (e.g., in sentences like
‘‘Dozier led off the fifth inning’’). However, there
are instances where the mention of innings is more
ambiguous (e.g., ‘‘With the score tied 1–1 in the
fourth, Andrew Cashner (4–13) gave up a sacri-
fice fly’’). We could disambiguate such mentions
manually and then train a classifier to learn to
predict whether an inning is mentioned. Instead,
we explore a novel annotation-free method that
makes use of the pretrained language model GPT2
(Radford et al., 2019). Specifically, we feed the
context preceding the ordinal number to GPT2
(i.e., the current paragraph up to the ordinal
number and the paragraph preceding it) and if
inning appears in the top 10 next word predictions,
we consider it a positive match. On a held-out
dataset, this method achieves 98% precision and
98% recall at disambiguating inning mentions.
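
The disambiguation rule above can be sketched as follows. This is a minimal illustration rather than the released implementation: `top_k_next_words` is a hypothetical callable standing in for a query to GPT2 that returns the k most likely next words for a context, and the ordinal list (first to ninth, no extra innings) is an assumption.

```python
import re
from typing import Callable, List

# Assumption: only the nine regulation innings are covered here.
ORDINALS = ["first", "second", "third", "fourth", "fifth",
            "sixth", "seventh", "eighth", "ninth"]
ORDINAL_RE = re.compile(r"\b(" + "|".join(ORDINALS) + r")\b", re.IGNORECASE)


def inning_mentions(prev_paragraph: str,
                    paragraph: str,
                    top_k_next_words: Callable[[str, int], List[str]],
                    k: int = 10) -> List[str]:
    """Return the ordinals in `paragraph` that are taken to denote innings."""
    mentions = []
    for match in ORDINAL_RE.finditer(paragraph):
        ordinal = match.group(1).lower()
        following = paragraph[match.end():].lstrip().lower()
        # Easy case: the ordinal explicitly modifies the noun "inning".
        if following.startswith("inning"):
            mentions.append(ordinal)
            continue
        # Ambiguous case: ask the language model what comes next, given the
        # preceding paragraph and the current paragraph up to the ordinal.
        context = (prev_paragraph + " " + paragraph[:match.end()]).strip()
        predictions = [w.strip().lower() for w in top_k_next_words(context, k)]
        if "inning" in predictions:
            mentions.append(ordinal)
    return mentions
```

Passing the language model in as a function keeps the matching logic testable without loading a model; in practice the callable would wrap a pretrained GPT2.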

To resolve whether the summary discusses the
top or bottom side of an inning, we compare the
entities in the paragraph with the entities in each
half-inning (play-by-play Table (B) in Figure 1)
and choose the side with the greater number of
entity matches. For instance, the mentions Andrew Cashner,
Merrifield, and fourth inning uniquely resolve to
the bottom half of the fourth inning.
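
A minimal sketch of this side-resolution step is given below. The function name, the 'T'/'B' labels, and the tie-breaking rule (ties go to the top half) are assumptions made for illustration; the text above does not specify how ties are handled.

```python
from typing import Dict, Set


def resolve_half(paragraph_entities: Set[str],
                 half_inning_entities: Dict[str, Set[str]]) -> str:
    """Pick the half-inning ('T' or 'B') sharing more entities with the paragraph.

    `half_inning_entities` maps 'T' and 'B' to the entities that take part in
    the plays of the top and bottom half of the inning in question.
    """
    top_matches = len(paragraph_entities & half_inning_entities["T"])
    bottom_matches = len(paragraph_entities & half_inning_entities["B"])
    # Assumption: a tie is resolved in favor of the top half.
    return "T" if top_matches >= bottom_matches else "B"


# Illustrative participants only; the real sets come from the play-by-play table.
fourth_inning = {
    "T": {"B.Keller", "C.Mullins"},
    "B": {"A.Cashner", "W.Merrifield", "H.Dozier"},
}
side = resolve_half({"A.Cashner", "W.Merrifield"}, fourth_inning)
```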

3.3 Paragraph Plan Construction

Figure 1 shows the macro plan we obtain for
game summary (C). Importantly, macro plan (E)
is the outcome of a content selection process after
considering several candidate paragraph plans as
input. So, what are the candidate paragraph plans
that give rise to macro plan (E)? To answer this
question, we examined the empirical distribution
of paragraph plans in MLB and ROTOWIRE (train-
ing portion). Interestingly, we found that ∼79% of
the paragraph plans in MLB refer to a single event
or a single player (and team(s)). In ROTOWIRE,
∼92% of paragraphs are about a singleton player
(and team(s)) or a pair of players.

Based on this analysis, we assume that para-
graph plans can be either one (verbalized)
entity/event or a combination of at most two.
Under this assumption, we explicitly enumerate
the set of candidate paragraph plans in a game.
For the game in Figure 1, candidate paragraph
plans are shown in Table (D). The first table
groups plans based on individual verbalizations
describing the team(s), players, and events taking
place in specific innings. The second table groups
pairwise combinations thereof. In MLB, such
combinations are between team(s) and players. In
ROTOWIRE, we also create combinations between
players. Such paragraph plans form set E based
on which macro plan x is constructed to give rise
to game summary y.
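
The enumeration of candidate paragraph plans can be sketched as below. This is an illustration under stated assumptions: each plan is represented as a tuple of unit names, every team, player, and event yields a singleton plan, pairs are emitted in one order only, and the event labels in the usage example are hypothetical identifiers.

```python
from itertools import combinations
from typing import List, Tuple


def candidate_paragraph_plans(teams: List[str],
                              players: List[str],
                              events: List[str],
                              pair_players: bool = False) -> List[Tuple[str, ...]]:
    """Enumerate plans made of one unit or a combination of at most two."""
    # Single-unit plans: one team, one player, or one event.
    plans: List[Tuple[str, ...]] = [(unit,) for unit in teams + players + events]
    # Pairwise plans between a team and a player (the MLB setting).
    plans += [(team, player) for team in teams for player in players]
    # Player-player pairs are created only when requested (the ROTOWIRE setting).
    if pair_players:
        plans += list(combinations(players, 2))
    return plans


mlb_plans = candidate_paragraph_plans(
    teams=["Royals", "Orioles"],
    players=["B.Keller", "C.Mullins"],
    events=["4-B", "5-B"],  # hypothetical half-inning labels
)
```

Because the set grows only quadratically in the number of players, explicit enumeration remains tractable for a single game.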

4 Model Description

The input to our model is a set of paragraph plans,
each of which is a sequence of tokens. We first
compute paragraph plan representations $\in \mathbb{R}^n$,
and then apply a contextualization and content
planning mechanism similar to planning mod-
ules introduced in earlier work (Puduppully et al.,
2019a; Chen and Bansal, 2018). Predicted macro
plans serve as input to our text generation model,
which adopts an encoder-decoder architecture
(Bahdanau et al., 2015; Luong et al., 2015).

4.1 Macro Planning

Paragraph Plan Representation We encode
tokens in a verbalized paragraph plan $e_i$ as
$\{e_{i,j}\}_{j=1}^{|e_i|}$ with a BiLSTM (Figure 2, bottom part).
To reflect the fact that some records will be more
important than others, we compute an attention
weighted sum of $\{e_{i,j}\}_{j=1}^{|e_i|}$ following Yang et al.
(2016). Let $d \in \mathbb{R}^n$ denote a randomly initialized
query vector learnt jointly with the rest of
parameters. We compute attention values $\alpha_{i,j}$ over
$d$ and paragraph plan token representation $e_{i,j}$:

$$\alpha_{i,j} \propto \exp(d^\top e_{i,j}) \tag{1}$$

Figure 2: Paragraph plan representation and
contextualization for macro planning. Computation
of $e_3$ is detailed in Equations (1) and (2), $e_3^{att}$ in
Equation (3), and $e_3^{c}$ in Equation (4).

Paragraph plan vector $e_i$ is the attention weighted
sum of $e_{i,j}$ (with $\sum_j \alpha_{i,j} = 1$):

$$e_i = \sum_j \alpha_{i,j} e_{i,j} \tag{2}$$
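
Equations (1) and (2) amount to softmax attention pooling over token representations. The sketch below uses plain Python lists in place of BiLSTM outputs and a learnt query; the function name and the toy vectors are illustrative assumptions.

```python
import math
from typing import List, Tuple


def attention_pool(token_vectors: List[List[float]],
                   query: List[float]) -> Tuple[List[float], List[float]]:
    """Return (alphas, pooled): Eq. (1) attention weights and the Eq. (2) sum."""
    # Unnormalized scores d^T e_{i,j}.
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in token_vectors]
    # Softmax normalization, shifted by the maximum for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Attention-weighted sum of the token vectors.
    dim = len(query)
    pooled = [sum(a * vec[d] for a, vec in zip(alphas, token_vectors))
              for d in range(dim)]
    return alphas, pooled


alphas, pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], query=[0.0, 0.0])
```

With a zero query every token receives the same weight, so the pooled vector is the plain average of the token vectors.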

Next, we contextualize each paragraph plan
representation vis-a-vis other paragraph plans
(Figure 2, top left part). First, we compute attention
scores $\beta_{i,k}$ over paragraph plan representations to
obtain an attentional vector $e_i^{att}$ for each:

$$\beta_{i,k} \propto \exp(e_i^\top W_a e_k)$$
$$c_i = \sum_{k \neq i} \beta_{i,k} e_k$$
$$e_i^{att} = W_g [e_i; c_i] \tag{3}$$

where $W_a \in \mathbb{R}^{n \times n}$, $W_g \in \mathbb{R}^{n \times 2n}$ are parameter
matrices, and $\sum_{k \neq i} \beta_{i,k} = 1$. Then, we compute
a content selection gate, and apply this gate to $e_i$
to obtain new paragraph plan representation $e_i^c$:

$$g_i = \mathrm{sigmoid}(e_i^{att})$$
$$e_i^c = g_i \odot e_i \tag{4}$$

where $\odot$ denotes element-wise multiplication.
Thus, each element in $e_i$ is weighted by the
corresponding element of $g_i \in [0, 1]^n$ to obtain a
contextualized paragraph plan representation $e_i^c$.
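
A small numerical sketch of Equations (3) and (4) follows. It uses nested Python lists for the matrices $W_a$ and $W_g$, assumes at least two paragraph plans, and is meant only to make the data flow concrete; the helper names are illustrative and the code is not the authors' implementation.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def softmax(scores: Vector) -> Vector:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(plans: List[Vector], w_a: Matrix, w_g: Matrix) -> List[Vector]:
    """Apply Eq. (3) and Eq. (4) to every paragraph plan vector e_i."""
    gated = []
    for i, e_i in enumerate(plans):
        others = [e_k for k, e_k in enumerate(plans) if k != i]
        # beta_{i,k}: attention of plan i over all other plans k != i.
        betas = softmax([dot(e_i, matvec(w_a, e_k)) for e_k in others])
        c_i = [sum(b * e_k[d] for b, e_k in zip(betas, others))
               for d in range(len(e_i))]
        # e_i^att = W_g [e_i; c_i], with W_g of shape n x 2n.
        e_att = matvec(w_g, e_i + c_i)
        # Content selection gate and element-wise product.
        g_i = [1.0 / (1.0 + math.exp(-a)) for a in e_att]
        gated.append([g * x for g, x in zip(g_i, e_i)])
    return gated


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0] * 4 for _ in range(2)]
contextualized = contextualize([[1.0, 2.0], [3.0, 4.0]], identity, zeros)
```

With $W_g$ set to zero, every gate value is sigmoid(0) = 0.5, so each plan vector is simply halved.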

Content Planning Our model learns to predict
macro plans, after having been trained on pairs
of sets of paragraph plans and corresponding


separators in between. The conditional output
probability p(y|x) is modeled as:

$$p(y \mid x) = \prod_{t=1}^{|y|} p(y_t \mid y_{<t}, x)$$

                        ROTOWIRE    MLB
Vocab Size              11.3K       38.9K
# Tokens                1.5M        14.3M
# Instances             4.9K        26.3K
# Record Types          39          53
Avg Records             628         565
Avg Paragraph Plans     10.7        15.1
Avg Length              337.1       542.05

Table 1: Dataset statistics for ROTOWIRE and
MLB. Vocabulary size, number of tokens,
number of instances (i.e., table-summary pairs),
number of record types, average number of
records, average number of paragraph plans,
and average summary length.

During inference, we employ beam search to
find the most likely macro plan $\hat{z}$ among candidate
macro plans $z'$ given paragraph plans as input.

$$\hat{z} = \arg\max_{z'} p(z' \mid \mathcal{E}; \theta)$$

We deterministically obtain $\hat{x}$ from $\hat{z}$, and
output summary $\hat{y}$ among candidate outputs $y'$
given macro plan $\hat{x}$ as input:

$$\hat{y} = \arg\max_{y'} p(y' \mid \hat{x}; \phi)$$

5 Experimental Setup

Data We performed experiments on the
ROTOWIRE (Wiseman et al., 2017) and MLB
(Puduppully et al., 2019b) benchmarks. The
details of these two datasets are given in Table 1.
We can see that MLB is around 5 times bigger,
has a richer vocabulary, and has longer game
summaries. We use the official splits of 3,398/727/728
for ROTOWIRE and 22,821/1,739/1,744 for MLB.
We make use of a tokenization script¹ to
detokenize and retokenize the summaries in both
ROTOWIRE and MLB.

We reconstructed the MLB dataset, as the
version released by Puduppully et al. (2019b)
had removed all paragraph delimiters from
game summaries. Specifically, we followed their
methodology and downloaded the same summaries
from the ESPN Web site² and added the
<P> delimiter to paragraphs in the summaries.³

¹ https://github.com/neulab/DGT.
² http://www.espn.com/mlb/recap?gameId={gameid}.
³ Although our model is trained on game summaries with
paragraph delimiters, and also predicts these at generation
time, for evaluation we strip <P> from model output.


ROTOWIRE does not have paragraph delimiters in
game summaries either. We reverse engineered
these as follows: (1) we split summaries into sen-
tences using the NLTK (Bird et al., 2009) sentence
tokenizer; (2) initialized each paragraph with a
separate sentence; (3) merged two paragraphs into
one if the entities in the former were a superset of
entities in the latter; (4) repeated Step 3 until no
merges were possible.
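
The merging procedure in steps (2) to (4) can be sketched as follows. Sentence splitting is assumed to have been done already (e.g., with the NLTK sentence tokenizer), `entities_of` is a hypothetical helper that returns the entities mentioned in a sentence, and merging is assumed to apply to adjacent paragraphs only.

```python
from typing import Callable, Iterable, List


def merge_paragraphs(sentences: List[str],
                     entities_of: Callable[[str], Iterable[str]]) -> List[List[str]]:
    """Group sentences into paragraphs by repeatedly merging neighbors."""
    # Step 2: every sentence starts out as its own paragraph.
    paragraphs = [[s] for s in sentences]
    entity_sets = [set(entities_of(s)) for s in sentences]
    merged = True
    while merged:  # Step 4: repeat until no merge is possible.
        merged = False
        for i in range(len(paragraphs) - 1):
            # Step 3: merge if the former paragraph's entities are a
            # superset of the latter's (assumption: adjacent paragraphs).
            if entity_sets[i] >= entity_sets[i + 1]:
                paragraphs[i] += paragraphs.pop(i + 1)
                entity_sets[i] |= entity_sets.pop(i + 1)
                merged = True
                break
    return paragraphs
```

Note that under a literal reading of the superset test, a sentence that mentions no entity is always absorbed into the preceding paragraph.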

Training Configuration We tuned the model
hyperparameters on the development set. For
training the macro planning and the text gener-
ation stages, we used the Adagrad (Duchi et al.,
2011) optimizer. Furthermore, the text generation
stage made use of truncated BPTT (Williams and
Peng, 1990) with truncation length 100. We learn
subword vocabulary (Sennrich et al., 2016) for
paragraph plans in the macro planning stage. We
used 2.5K merge operations for ROTOWIRE and 8K
merge operations for MLB. In text generation, we
learn a joint subword vocabulary for the macro
plan and game summaries. We used 6K merge
operations for ROTOWIRE and 16K merge oper-
ations for MLB. All models were implemented
on OpenNMT-py (Klein et al., 2017). We add to
set E the paragraph plans corresponding to the
output summary paragraphs, to ensure full cover-
age during training of the macro planner. During
inference for predicting macro plans, we employ
length normalization (Bahdanau et al., 2015) to
avoid penalizing longer outputs; specifically, we
divide the scores of beam search by the length of
the output. In addition, we adopt bigram blocking
(Paulus et al., 2018). For MLB, we further block
beams containing more than two repetitions of a
unigram. This helps improve the diversity of the
predicted macro plans.
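
The decoding constraints above can be illustrated with three small helpers. They are a sketch, not OpenNMT-py code: the function names are hypothetical, hypotheses are plain token lists, and "more than two repetitions of a unigram" is interpreted here as a token occurring more than `max_count` = 2 times, which is an assumption.

```python
from collections import Counter
from typing import List, Sequence


def length_normalized_score(log_prob: float, length: int) -> float:
    """Divide the beam score by the output length to avoid favoring short outputs."""
    return log_prob / max(length, 1)


def has_repeated_bigram(tokens: Sequence[str]) -> bool:
    """True if any bigram occurs twice; such hypotheses are blocked."""
    bigrams = list(zip(tokens, tokens[1:]))
    return len(bigrams) != len(set(bigrams))


def exceeds_unigram_limit(tokens: Sequence[str], max_count: int = 2) -> bool:
    """True if some token occurs more than `max_count` times (assumed reading)."""
    return any(count > max_count for count in Counter(tokens).values())


def keep_hypothesis(tokens: List[str], block_unigrams: bool = False) -> bool:
    """Apply bigram blocking, plus the unigram limit when requested."""
    if has_repeated_bigram(tokens):
        return False
    if block_unigrams and exceeds_unigram_limit(tokens):
        return False
    return True
```

Setting `block_unigrams=True` would correspond to the additional MLB constraint.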

System Comparisons We compared our model
against the following systems: (1) the Template-
based generators from Wiseman et al. (2017)
for ROTOWIRE and Puduppully et al. (2019b) for
MLB. Both systems apply the same principle: they
emit a sentence about the teams playing in the
game, followed by player-specific sentences, and
a closing sentence. The MLB template additionally
contains a description of play-by-play; (2) ED+CC, the best
performing system in Wiseman et al. (2017), is
a vanilla encoder-decoder model equipped with
an attention and copy mechanism; (3) NCP+CC,
the micro planning model of Puduppully et al.

(2019a), generates content plans from the table
by making use of Pointer networks (Vinyals
et al., 2015) to point to records; content plans are
encoded with a BiLSTM and the game summary
is decoded using another LSTM with attention
and copy; (4) ENT, the entity-based model of
Puduppully et al. (2019b), creates dynamically
updated entity-specific representations; the text
is generated conditioned on the data input and
entity memory representations using hierarchical
attention at each time step.

6 Results

Automatic Evaluation For automatic evalua-
tion, following earlier work (Wiseman et al. 2017;
Puduppully et al. 2019a,b, inter alia) we report
BLEU (Papineni et al., 2002) with the gold sum-
mary as reference but also make use of the In-
formation Extraction (IE) metrics from Wiseman
et al. (2017), which are defined over the output
of an IE system; the latter extracts entity (players,
teams) and value (numbers) pairs in a summary,
and then predicts the type of relation. For instance,
given the pair Kansas City Royals, 9, it would
predict their relation as TR (i.e., Team Runs).
Training data for the IE system is obtained by
checking for matches between entity, value pairs
in the gold summary and entity, value, record type
triplets in the table.

Let ˆy be the gold summary and y the model
output. Relation Generation (RG) measures the
precision and count of relations extracted from y
that also appear in records r. Content Selection
(CS) measures the precision and recall of relations
extracted from y that are also extracted from ˆy.
Content Ordering (CO) measures the normalized
Damerau-Levenshtein distance between the se-
quences of relations extracted from y and ˆy.

We reused the IE model from Puduppully et al.
(2019a) for ROTOWIRE but retrained it for MLB
to improve its precision and recall. Furthermore,
the implementation of Wiseman et al. (2017)
computes RG, CS, and CO excluding duplicate
relations. This artificially inflates the performance
of models whose outputs contain repetition. We
include duplicates in the computation of the IE
metrics (and recreate them for all comparison
systems).
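
To make the duplicate-sensitive metrics concrete, the sketch below computes CS precision and recall over multisets of relations and a CO score from the Damerau-Levenshtein distance. Several choices are assumptions rather than details given in the text: relations are (entity, value, type) tuples, duplicates are handled with multiset intersection, the distance is the optimal-string-alignment variant, and CO is reported as one minus the normalized distance so that higher is better.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Relation = Tuple[str, str, str]  # (entity, value, record type)


def content_selection(pred: List[Relation], gold: List[Relation]) -> Tuple[float, float]:
    """CS precision and recall with duplicates counted (multiset overlap)."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


def dl_distance(a: Sequence, b: Sequence) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def content_ordering(pred: List[Relation], gold: List[Relation]) -> float:
    """One minus the length-normalized distance between relation sequences."""
    longest = max(len(pred), len(gold))
    return 1.0 - dl_distance(pred, gold) / longest if longest else 1.0
```

With multiset counting, a relation repeated in the output is credited only as often as it appears in the gold summary, so repetition lowers CS precision instead of being ignored.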

Table 2 (top) presents our results on the
ROTOWIRE test set. In addition to Templ,
NCP+CC, ENT, and ED+CC we include the
best performing model of Wiseman et al. (2017)
(WS-2017; note that ED+CC is an improved
re-implementation of their model), and the model of
Rebuffel et al. (2020) (RBF-2020), which represents
the state of the art on ROTOWIRE. This model
has a Transformer encoder (Vaswani et al., 2017)
with a hierarchical attention mechanism over
entities and records within entities. The models
of Saleh et al. (2019), Iso et al. (2019), and Gong
et al. (2019) make use of additional information
not present in the input (e.g., previous/next games,
summary writer) and are not directly comparable
to the systems in Table 2. Results for the MLB
test set are in the bottom portion of Table 2.

ROTOWIRE       RG #    RG P%   CS P%   CS R%   CS F%   CO DLD%   BLEU
Templ          54.3    99.9    27.1    57.7    36.9    13.1       8.46
WS-2017        34.1    75.1    20.3    36.3    26.1    12.4      14.19
ED+CC          35.9    82.6    19.8    33.8    24.9    12.0      14.99
NCP+CC         40.8    87.6    28.0    51.1    36.2    15.8      16.50
ENT            32.7    91.7    34.7    48.5    40.5    16.6      16.12
RBF-2020       44.9    89.5    23.9    47.0    31.7    14.3      17.16
Macro          42.1    97.6    34.1    57.8    42.9    17.7      15.46
−Plan(4)       36.2    81.3    22.1    38.6    28.1    12.1      14.00

MLB            RG #    RG P%   CS P%   CS R%   CS F%   CO DLD%   BLEU
Templ          62.3    99.9    21.6    55.2    31.0    11.0       4.12
ED+CC          32.5    91.3    27.8    40.6    33.0    17.1       9.68
NCP+CC         19.6    81.3    44.5    44.1    44.3    21.9       9.68
ENT            23.8    81.1    40.9    49.5    44.8    20.7      11.50
Macro          30.8    94.4    40.8    54.9    46.8    21.8      12.62
−Plan(SP,4)    25.1    92.7    40.0    44.6    42.2    21.9      11.09

Table 2: Evaluation on ROTOWIRE and MLB
test sets; relation generation (RG) count (#)
and precision (P%), content selection (CS)
precision (P%), recall (R%) and F-measure (F%),
content ordering (CO) in normalized Damerau-
Levenshtein distance (DLD%), and BLEU.

Templ has the highest RG precision and count
on both datasets. This is not surprising; by design,
Templ is always faithful to the input. However,
notice that it achieves the lowest BLEU
among comparison systems, indicating that it
mostly regurgitates facts with low fluency. Macro
achieves the highest RG precision among all
neural models for ROTOWIRE and MLB. We obtain
an absolute improvement of 5.9% over ENT
for ROTOWIRE and 13.3% for MLB. In addition,
Macro achieves the highest CS F-measure for
both datasets. Macro also achieves the highest
CO score on ROTOWIRE and the highest BLEU on MLB.
On ROTOWIRE, in terms of BLEU, Macro is worse
than comparison models (e.g., NCP+CC or ENT).
Inspection of the output showed that the opening
paragraph, which mostly describes how the two
teams fared, is generally shorter in Macro, lead-
ing to shorter summaries and thus lower BLEU.
There is high variance in the length of the opening
paragraph in the training data and Macro verbal-
izes the corresponding plan conservatively. Ideas
such as length normalization (Wu et al., 2016) or
length control (Kikuchi et al., 2016; Takeno et al.,
2017; Fan et al., 2018) could help alleviate this;
however, we do not pursue them further for fair
comparison with the other models.
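
For reference, the length normalization of Wu et al. (2016) divides the log-probability of a hypothesis by a length penalty lp(Y) = ((5 + |Y|) / 6)^α. The sketch below is illustrative only and is not part of Macro: the value α = 0.6 and the example scores are arbitrary assumptions, and the coverage penalty from the same work is omitted.

```python
def length_penalty(length: int, alpha: float = 0.6) -> float:
    """Length penalty lp(Y) = ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Hypothesis score used for ranking: log P(Y|X) / lp(Y)."""
    return log_prob / length_penalty(length, alpha)


# Hypothetical beam candidates: (log-probability, length in tokens).
short_hyp = (-6.0, 5)
long_hyp = (-8.0, 15)

raw_prefers_short = short_hyp[0] > long_hyp[0]
normalized_prefers_long = normalized_score(*long_hyp) > normalized_score(*short_hyp)
```

Without normalization the shorter candidate wins on raw log-probability; with it, the longer candidate ranks higher, which is the kind of effect that could counteract overly short opening paragraphs.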

The Contribution of Macro Planning To study
the effect of macro planning in more detail, we
further compared Macro against text generation
models (see Section 4.2) which are trained on
verbalizations of the tabular data (and gold sum-
maries) but do not make use of document plans or a
document planning mechanism. On ROTOWIRE, the
model was trained on verbalizations of players and
teams, with the input arranged such that the ver-
balization of the home team was followed by the
visiting team, the home team players and the visit-
ing team players. Mention of players was limited
to the four best ones, following Saleh et al. (2019)
(see −Plan(4) in Table 2). For MLB, we addition-
ally include verbalizations of innings focusing on
scoring plays which are likely to be discussed in
game summaries (see −Plan(SP,4) in Table 2).
Note that by preprocessing the input in such a
way some simple form of content selection takes
place simply by removing extraneous information
which the model does not need to consider.
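
The input arrangement for the −Plan variants can be sketched as below. The data structures are hypothetical: each team or player is a dictionary with a pre-computed verbalization under `"text"` and raw statistics under `"stats"`. Ranking players by points (`"PTS"`) and applying the limit of four per team are assumptions, since the selection criterion is not spelled out here.

```python
def arrange_input(home_team, visiting_team, home_players, visiting_players,
                  top_k=4, rank_key="PTS"):
    """Concatenate verbalizations: home team, visiting team,
    best home players, best visiting players."""

    def best(players):
        # Assumption: "best" means highest value of `rank_key`.
        ranked = sorted(players,
                        key=lambda p: p["stats"].get(rank_key, 0),
                        reverse=True)
        return ranked[:top_k]

    segments = [home_team["text"], visiting_team["text"]]
    segments.extend(p["text"] for p in best(home_players))
    segments.extend(p["text"] for p in best(visiting_players))
    return " ".join(segments)


# Placeholder verbalizations; the real format follows Section 4.2.
home = {"text": "[home team]"}
away = {"text": "[visiting team]"}
home_players = [{"text": f"[home player {i}]", "stats": {"PTS": pts}}
                for i, pts in enumerate([10, 30, 5, 20, 15])]
away_players = [{"text": "[visiting player 0]", "stats": {"PTS": 12}}]
model_input = arrange_input(home, away, home_players, away_players)
```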

Across both datasets, −Plan variants appear
competitive. On ROTOWIRE −Plan(4) is better than
ED+CC in terms of content selection but worse
compared to ENT. On MLB, −Plan(SP,4) is again
superior to ED+CC in terms of content selection
but not ENT whose performance lags behind
when considering RG precision. Taken together,
these results confirm that verbalizing entities and
events into a text sequence is effective. At the
same time, we see that −Plan variants are worse
than Macro across most metrics which underlines
the importance of an explicit planning component.

Table 3 presents intrinsic evaluation of the
macro planning stage. Here, we compare the
inferred macro plans with the gold macro plans,
computing the CS and CO metrics over entities and
events instead of relations. We see that our macro
planning model (Macro) achieves high scores for CS
and CO for both ROTOWIRE and MLB. We further
used the CS and CO metrics to check how well the
generated summary follows the (predicted) plan.
We followed the steps in Section 3.2 and reverse
engineered macro plans from the model summaries
and compared these extracted plans with the
original macro plans with regard to entities and
events. We found that Macro creates summaries
that follow the plan closely: For ROTOWIRE, the
CS F-score and CO are greater than 98%; for
MLB, the CS F-score is greater than 94% and CO
is greater than 89%. We show an output summary
for Macro in Table 4, together with the predicted
document plan.

Macro      CS-P  CS-R  CS-F    CO
ROTOWIRE   81.3  73.2  77.0  45.8
MLB        80.6  63.3  70.9  31.4

Table 3: Evaluation of macro planning stage;
content selection precision (CS-P), recall (CS-R),
F-measure (CS-F), and content ordering (CO)
between the inferred plans and gold plans in terms
of entities and events for ROTOWIRE (RW) and
MLB test sets.

Human-Based Evaluation We also asked
participants to assess model output in terms of relation
generation, grammaticality, coherence, and
conciseness (Wiseman et al., 2017; Puduppully
et al., 2019a; Puduppully et al., 2019b). For
ROTOWIRE, we compared Macro against
RBF-2020,⁴ ED+CC, Gold, and Templ. For MLB, we
compared Macro against ENT, ED+CC, Gold,
and Templ.

We conducted our study on the Amazon
Mechanical Turk (AMT) crowdsourcing platform,
following best practices for human evaluation in
NLG (van der Lee et al., 2019). Specifically, to en-
sure consistent ratings, we required crowdworkers
to have an approval rating greater than 98% and
a minimum of 1,000 previously completed tasks.
Raters were restricted to English-speaking coun-
tries (i.e., US, UK, Canada, Ireland, Australia, or
NZ). Participants were allowed to provide feed-
back on the task or field questions (our interface
accepts free text).

⁴ We are grateful to Clément Rebuffel for providing us
with the output of their system.

ST. PETERSBURG, Fla. (AP) – The Tampa Bay Rays are making
the most of it.

Akinori Iwamura hit a two-run homer in the
eighth inning and the Rays beat the Boston Red Sox 2-1 on Sunday
to complete a three-game sweep.

The Rays, who have the best
record in the majors, have won six of their last seven games.


The Rays have won four of their last five series, including three in
a row against the Red Sox, who have won six of their last seven
overall.

Dioner Navarro singled with one out in the eighth off
Clay Buchholz (1-2) and moved to third on Jason Bartlett’s flyout to
center. Iwamura then drove a 1-1 pitch into the left-field stands for
his second homer of the season.

Scott Dohmann (2-0) got the
win in relief, striking out Manny Ramirez with runners on first and
third to end the eighth.

Troy Percival worked the ninth for
his fifth save in five opportunities.

Clay Buchholz (1-2) gave
up two runs and three hits in eight innings. He struck out nine and
walked two.

The Red Sox loaded the bases with one out in the
fifth on a single by Coco Crisp, a wild pitch and a walk to Jed Lowrie.
Jacoby Ellsbury drove in Crisp with a two-out single to center.


Jackson struck out four and walked three.

The Red Sox loaded
the bases with one out in the fifth on a single by Coco Crisp, a walk
to Jed Lowrie and a one-out walk to Jed Lowrie. Jackson struck out
Julio Lugo, but Jacoby Ellsbury singled to center to put the Red Sox
up 1-0.

The Red Sox threatened in the eighth when J. D. Drew
drew a two-out walk against Trever Miller, but Ramirez struck out to
end the inning.

Table 4: Predicted macro plan (top) with
corresponding model output (bottom). Entities and
events in summary corresponding to those in the
macro plan are boldfaced.

In our first study, we presented crowdworkers
with sentences randomly selected from summaries
along with their corresponding box score (and
play-by-play in case of MLB) and asked them to
count supported and contradicting facts (ignoring
hallucinations, i.e., unsupported facts). We did
not require crowdworkers to be familiar with
NBA or MLB. Instead, we provided a cheat sheet
explaining the semantics of box score tables. In
addition, we provided examples of sentences with
supported/contradicting facts. We evaluated 40
summaries from the test set (20 per dataset), 4 sen-
tences from each summary and elicited 3 responses
per summary. This resulted in 40 summaries ×
5 systems × 3 raters, for a total of 600 tasks.
Altogether, 131 crowdworkers participated in this
study (agreement using Krippendorff’s α was 0.44
for supported and 0.42 for contradicting facts).

As shown in Table 5, Macro yields the smallest
number of contradicting facts among neural
models on both datasets. On ROTOWIRE the number
of contradicting facts for Macro is comparable
to Gold and Templ (the difference is not
statistically significant) and significantly smaller
compared to RBF-2020 and ED+CC. The count
of supported facts for Macro is comparable to
Gold and ED+CC, and significantly lower than
Templ and RBF-2020. On MLB, Macro has
significantly fewer contradicting facts than ENT and
ED+CC and is comparable to Templ and Gold
(the difference is not statistically significant). The
count of supported facts for Macro is comparable
to Gold, ENT, ED+CC, and Templ. For both
datasets, Templ has the lowest number of
contradicting facts. This is expected as Templ essentially
parrots facts (aka records) from the table.

ROTOWIRE   #Supp  #Contra     Gram    Coher   Concis
Gold        3.63     0.07    38.33   46.25*    30.83
Templ      7.57*     0.08  −61.67*  −52.92*  −36.67*
ED+CC       3.92    0.91*      5.0    −8.33    −4.58
RBF-2020   5.08*    0.67*    13.33     4.58     3.75
Macro       4.00     0.27      5.0    10.42     6.67

MLB        #Supp  #Contra     Gram    Coher   Concis
Gold        3.59     0.14    21.67     30.0    26.67
Templ       4.21     0.04  −51.25*  −43.75*      7.5
ED+CC       3.42    0.72*   −22.5*  −12.08*  −39.17*
ENT         3.71    0.73*    5.83*   −0.83*  −22.08*
Macro       3.76     0.25    46.25    26.67    27.08

Table 5: Average number of supported (#Supp)
and contradicting (#Contra) facts in game
summaries and best-worst scaling evaluation
(higher is better). Systems significantly different
from Macro are marked with an asterisk * (using
a one-way ANOVA with post hoc Tukey HSD
tests; p ≤ 0.05).

We also conducted a second study to evaluate
the quality of the generated summaries. We pre-
sented crowdworkers with a pair of summaries and
asked them to choose the better one in terms of
Grammaticality (is the summary written in well-
formed English?), Coherence (is the summary
well structured and well organized and does it have
a natural ordering of the facts?), and Conciseness
(does the summary avoid unnecessary repetition
including whole sentences, facts or phrases?). We
provided example summaries showcasing good
and bad output. For this task, we required that the
crowdworkers be able to comfortably compre-
hend NBA/MLB game summaries. We elicited
preferences with Best-Worst Scaling (Louviere
and Woodworth 1991; Louviere et al., 2015), a
method shown to be more reliable than rating
scales. The score of a system is computed as the
number of times it is rated best minus the number
of times it is rated worst (Orme, 2009). The scores
range from −100 (absolutely worst) to +100
(absolutely best). We divided the five competing
systems into ten pairs of summaries and elicited
ratings for 40 summaries (20 per dataset). Each
summary pair was rated by 3 raters. This resulted
in 40 summaries × 10 system pairs × 3 evaluation
criteria × 3 raters, for a total of 3,600 tasks. A
total of 206 crowdworkers participated in this task
(agreement using Krippendorff’s α was 0.47).
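
The counting procedure behind these scores can be sketched as follows. This is a minimal illustration, not the evaluation script used in the study: the judgment format `(system_a, system_b, winner)` is hypothetical, and the normalization (dividing by the number of comparisons a system appears in and multiplying by 100, which yields the stated range of −100 to +100) is an assumption.

```python
from collections import Counter
from itertools import combinations


def system_pairs(systems):
    """All unordered pairs of systems to be compared."""
    return list(combinations(systems, 2))


def bws_scores(judgments, systems):
    """Best-worst scaling by counting: 100 * (#best - #worst) / #appearances."""
    best, worst, seen = Counter(), Counter(), Counter()
    for sys_a, sys_b, winner in judgments:
        if winner not in (sys_a, sys_b):
            raise ValueError("winner must be one of the two compared systems")
        loser = sys_b if winner == sys_a else sys_a
        best[winner] += 1
        worst[loser] += 1
        seen[sys_a] += 1
        seen[sys_b] += 1
    return {s: 100.0 * (best[s] - worst[s]) / seen[s] if seen[s] else 0.0
            for s in systems}


systems = ["Gold", "Templ", "ED+CC", "RBF-2020", "Macro"]
pairs = system_pairs(systems)          # ten pairs for five systems
num_tasks = 40 * len(pairs) * 3 * 3    # summaries x pairs x criteria x raters

# Hypothetical judgments for one criterion.
toy_judgments = [("Gold", "Templ", "Gold"),
                 ("Gold", "Macro", "Macro"),
                 ("Templ", "Macro", "Macro")]
toy_scores = bws_scores(toy_judgments, systems)
```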

As shown in Table 5, on ROTOWIRE, Macro is
comparable to Gold, RBF-2020, and ED+CC in
terms of Grammaticality but significantly better
than Templ. In terms of Coherence, Macro is
comparable to RBF-2020 and ED+CC but signif-
icantly better than Templ and significantly worse
than Gold. With regard to Conciseness, Macro is
comparable to Gold, RBF-2020, and ED+CC, and
significantly better than Templ. On MLB, Macro
is comparable to Gold in terms of Grammaticality
and significantly better than ED+CC, ENT, and
Templ. Macro is comparable to Gold in terms of
Coherence and significantly better than ED+CC,
ENT and Templ. In terms of Conciseness, raters
found Macro comparable to Gold and Templ and
significantly better than ED+CC and ENT. Taken
together, our results show that macro planning
leads to improvement in data-to-text generation in
comparison to other systems for both ROTOWIRE
and MLB datasets.

7 Discussion

In this work we presented a plan-and-generate
approach for data-to-text generation that consists
of a macro planning stage representing high-level
document organization in terms of structure and
content, followed by a text generation stage.
Extensive automatic and human evaluation shows
that our approach achieves better results than ex-
isting state-of-the-art models and generates sum-
maries which are factual, coherent, and concise.

Our results show that macro planning is more
advantageous for generation tasks expected to
produce longer texts with multiple discourse
units, and could be easily extended to other sports
domains such as cricket (Kelly et al., 2009) or
American football (Barzilay and Lapata, 2005).
Other approaches focusing on micro planning
(Puduppully et al., 2019a; Moryossef et al., 2019)
might be better tailored for generating shorter
texts. There has been a surge of datasets recently
focusing on single-paragraph outputs and the task
of content selection such as E2E (Novikova et al.,
2017), WebNLG (Gardent et al., 2017), and
WikiBio (Lebret et al., 2016; Perez-Beltrachini
and Lapata, 2018). We note that in our model
content selection takes place during macro planning
and text generation. The results in Table 2 show
that Macro achieves the highest CS F-measure
on both datasets, indicating that the document as
a whole and individual sentences discuss appro-
priate content.

Throughout our experiments we observed that
template-based systems score poorly in terms of
CS (but also CO and BLEU). This is primarily due
to the inflexibility of the template approach which
is limited to the discussion of a fixed number of
(high-scoring) players. Yet human writers (and
neural models to a certain extent) synthesize
summaries taking into account the particulars of a
specific game (where some players might be more
important than others even if they scored less)
and are able to override global defaults. Template
sentences are fluent on their own, but since it
is not possible to perform aggregation (Reiter,
1995), the whole summary appears stilted and lacks
coherence and variability, which contributes to low
BLEU scores. The template baseline is worse for
MLB than for ROTOWIRE, which reflects the greater
difficulty of manually creating a good template for
MLB. Overall, we observe that neural models are
more fluent and coherent, being able to learn a
better ordering of facts which is in turn reflected
in better CO scores.

Despite promising results, there is ample room
to improve macro planning, especially in terms
of the precision of RG (see Table 2, P% column
of RG). We should not underestimate that Macro
must handle relatively long inputs (the average
input length in the MLB development set is ∼3100
tokens) which are challenging for the attention
mechanism. Consider the following output of our
model on the MLB dataset: Ramirez’s two-run
double off Joe Blanton tied it in the sixth, and
Brandon Moss added a two-out RBI single off
Alan Embree to give Boston a 3-2 lead. Here, the
name of the pitcher should have been Joe Blanton
instead of Alan Embree. In fact, Alan Embree is
the pitcher for the following play in the half in-
ning. In this case, attention diffuses over the rela-
tively long MLB macro plan, leading to inaccurate
content selection. We could alleviate this problem
by adopting a noisy channel decomposition
(Yee et al., 2019; Yu et al., 2020), that is, by
learning two different distributions: a conditional
model that provides the probability of translating
a paragraph plan to text and a language model that
provides an unconditional estimate of the output
(i.e., the whole game summary). However, we
leave this to future work.

For ROTOWIRE, the main source of errors is
the model’s inability to understand numbers. For
example, Macro generates the following output:
The Lakers were the superior shooters in this
game, going 48 percent from the field and 30 per-
cent from the three-point line, while the Jazz went
47 percent from the floor and 30 percent from
beyond the arc. Here, 30 percent should have been
24 percent for the Lakers but the language model
expects a higher score for the three-point line, and
since 24 is low (especially compared to 30 scored
by the Jazz), it simply copies 30 scored by the
Jazz instead. A mechanism for learning better rep-
resentations for numbers (Wallace et al., 2019) or
executing operations such as argmax or minus (Nie
et al., 2018) should help alleviate this problem.

Finally, although our focus so far has been on
learning document plans from data, the decoupling
of planning from generation allows us to flexibly
generate output according to a specification. For
example, we could feed the model with manually
constructed macro plans, consequently controlling
the information content and structure of the output
summary (e.g., for generating short or long texts,
or focusing on specific aspects of the game).

Acknowledgments

We thank the Action Editor, Claire Gardent,
and the three anonymous reviewers for their
constructive feedback. We also thank Laura
Perez-Beltrachini for her comments on an earlier
draft of this paper, and Parag Jain, Hao Zheng,
Stefanos Angelidis and Yang Liu for helpful
discussions. We acknowledge the financial
support of the European Research Council
(Lapata; award number 681760, “Translating
Multiple Modalities into Text”).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In
3rd International Conference on Learning
Representations, ICLR 2015, San Diego, CA,
USA, May 7–9, 2015, Conference Track
Proceedings.

Regina Barzilay and Mirella Lapata. 2005.
Collective content selection for concept-to-text
generation. In Proceedings of Human Language
Technology Conference and Conference on
Empirical Methods in Natural Language
Processing, pages 331–338, Vancouver,
British Columbia, Canada. Association for
Computational Linguistics. DOI:
https://doi.org/10.3115/1220575.1220617

Steven Bird, Ewan Klein, and Edward Loper.
2009. Natural Language Processing with
Python, O’Reilly Media.

Thiago Castro Ferreira, Chris van der Lee,
Emiel van Miltenburg, and Emiel Krahmer.
2019. Neural data-to-text generation: A
comparison between pipeline and end-to-end
architectures. In Proceedings of the
2019 Conference on Empirical Methods in
Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 552–562, Hong Kong, China.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19-1052

Wallace L. Chafe. 1979. The flow of thought
and the flow of language. In Talmy Givón,
editor, Syntax and Semantics, volume 12,
pages 159–181, Academic Press Inc. DOI:
https://doi.org/10.1163/9789004368897_008

Yen-Chun Chen and Mohit Bansal. 2018.
Fast abstractive summarization with
reinforce-selected sentence rewriting. In Proceedings of
the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1:
Long Papers), pages 675–686, Melbourne,
Australia. Association for Computational
Linguistics. DOI:
https://doi.org/10.18653/v1/P18-1063

Robert Dale. 1989. Generating referring expres-
sions in a domain of objects and processes.

Pablo Duboue and Kathleen McKeown. 2002.
Content planner construction via evolutionary
algorithms and a corpus-based fitness function.
In Proceedings of the International Natural
Language Generation Conference,
pages 89–96, Harriman, New York, USA.
Association for Computational Linguistics.

Pablo A. Duboue and Kathleen R. McKeown.
2001. Empirically estimating order constraints
for content planning in generation. In Pro-
ceedings of the 39th Annual Meeting of the
Association for Computational Linguistics,
pages 172–179, Toulouse, France. Associ-
ation for Computational Linguistics. DOI:
https://doi.org/10.3115/1073012.1073035

John C. Duchi, Elad Hazan, and Yoram Singer.
2011. Adaptive subgradient methods for online
learning and stochastic optimization. Journal of
Machine Learning Research, 12:2121–2159.

Angela Fan, David Grangier, and Michael
Auli. 2018. Controllable abstractive summa-
rization. In Proceedings of the 2nd Workshop
on Neural Machine Translation and Gen-
eration, pages 45–54, Melbourne, Australia.
Association for Computational Linguistics.

Claire Gardent, Anastasia Shimorina, Shashi
Narayan, and Laura Perez-Beltrachini. 2017.
Creating training corpora for NLG micro-
planners. In Proceedings of the 55th Annual
Meeting of
the Association for Computa-
tional Linguistics (Volume 1: Long Papers),
pages 179–188, Vancouver, Canada. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P17-1017

Albert Gatt and Emiel Krahmer. 2018. Survey
of the state of the art in natural language
generation: Core tasks, applications and
evaluation. J. Artif. Intell. Res., 61:65–170. DOI:
https://doi.org/10.1613/jair.5477

Heng Gong, Xiaocheng Feng, Bing Qin, and
Ting Liu. 2019. Table-to-text generation with
effective hierarchical encoder on three
dimensions (row, column and time). In Proceedings
of the 2019 Conference on Empirical Methods
in Natural Language Processing and the
9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 3143–3152, Hong Kong, China.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19-1310

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor
O. K. Li. 2016. Incorporating copying mech-
anism in sequence-to-sequence learning. In
Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1631–1640,
Berlin, Germany. Association for Computa-
tional Linguistics.

Caglar Gulcehre, Sungjin Ahn, Ramesh
Nallapati, Bowen Zhou, and Yoshua Bengio.
2016. Pointing the unknown words. In
Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 140–149,
Berlin, Germany. Association for
Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P16-1014

M. A. K. Halliday and Ruqaiya Hasan. 1976.
Cohesion in English, London. Longman.

Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Computation,
9:1735–1780. DOI:
https://doi.org/10.1162/neco.1997.9.8.1735

Eduard H. Hovy. 1993. Automated discourse
generation using discourse structure relations.
Artificial Intelligence, 63(1-2):341–385. DOI:
https://doi.org/10.1016/0004-3702(93)90021-3

Hayate Iso, Yui Uehara, Tatsuya Ishigaki,
Hiroshi Noji, Eiji Aramaki, Ichiro Kobayashi,
Yusuke Miyao, Naoaki Okazaki, and Hiroya
Takamura. 2019. Learning to select, track,
and generate for data-to-text. In Proceedings
of the 57th Annual Meeting of the
Association for Computational Linguistics,
pages 2102–2113, Florence, Italy.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P19-1202

Min-Yen Kan and Kathleen R. McKeown. 2002.
Corpus-trained text generation for
summarization. In Proceedings of the International
Natural Language Generation Conference,
pages 1–8, Harriman, New York, USA.
Association for Computational Linguistics.

Colin Kelly, Ann Copestake, and Nikiforos
Karamanis. 2009. Investigating content selec-
tion for language generation using machine
learning. In Proceedings of the 12th European
Workshop on Natural Language Generation
(ENLG 2009), pages 130–137, Athens, Greece.
Association for Computational Linguistics.
DOI: https://doi.org/10.3115/1610195.1610218

Yuta Kikuchi, Graham Neubig, Ryohei Sasano,
Hiroya Takamura, and Manabu Okumura.
2016. Controlling output length in neural
encoder-decoders. In Proceedings of the 2016
Conference on Empirical Methods in Natural
Language Processing, pages 1328–1338,
Austin, Texas. Association for
Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D16-1140

Guillaume Klein, Yoon Kim, Yuntian Deng,
Jean Senellart, and Alexander Rush. 2017.
OpenNMT: Open-source toolkit for neural
machine translation. In Proceedings of ACL
2017, System Demonstrations, pages 67–72,
Vancouver, Canada. Association for
Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P17-4012

Ioannis Konstas and Mirella Lapata. 2013.
Inducing document plans for concept-to-text
generation. In Proceedings of the 2013
Conference on Empirical Methods in Natural
Language Processing, pages 1503–1514,
Seattle, Washington, USA. Association for
Computational Linguistics.

Karen Kukich. 1983. Design of a knowledge-based
report generator. In 21st Annual
Meeting of the Association for
Computational Linguistics. DOI:
https://doi.org/10.3115/981311.981340

Anirban Laha, Parag Jain, Abhijit Mishra,
and Karthik Sankaranarayanan. 2020.
Scalable micro-planned generation of discourse
from structured data. Computational
Linguistics, 45(4):737–763. DOI:
https://doi.org/10.1162/coli_a_00363

Rémi Lebret, David Grangier, and Michael
Auli. 2016. Neural text generation from
structured data with application to the
biography domain. In Proceedings of the 2016
Conference on Empirical Methods in Natural
Language Processing, pages 1203–1213,
Austin, Texas. Association for
Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D16-1128

R. E. Longacre. 1979. The paragraph as a
grammatical unit. In Talmy Givón, editor,
Syntax and Semantics, volume 12, Academic
Press Inc., pages 115–133.

Jordan J. Louviere, Terry N. Flynn, and
A. A. J. Marley. 2015. Best-Worst Scaling:
Theory, Methods and Applications, Cambridge
University Press. DOI:
https://doi.org/10.1017/CBO9781107337855

Jordan J. Louviere and George G. Woodworth.
1991. Best-worst scaling: A model for the
largest difference judgments. University of
Alberta: Working Paper.

Thang Luong, Hieu Pham, and Christopher D.
Manning. 2015. Effective approaches to
attention-based neural machine translation. In
Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing,
pages 1412–1421, Lisbon, Portugal.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D15-1166

Kathleen R. McKeown. 1992. Text Generation.
Studies in Natural Language Processing,
Cambridge University Press.

Kathleen R. McKeown, Desmond A. Jordan,
Shimei Pan, James Shaw, and Barry A. Allen.
1997. Language generation for multimedia
healthcare briefings. In Fifth Conference
on Applied Natural Language Processing,
pages 277–282, Washington, DC, USA.
Association for Computational Linguistics. DOI:
https://doi.org/10.3115/974557.974598

Hongyuan Mei, Mohit Bansal, and Matthew R.
Walter. 2016. What to talk about and how?
Selective generation using LSTMs with
coarse-to-fine alignment. In Proceedings of the
2016 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 720–730, San Diego, California.
Association for Computational Linguistics.

Amit Moryossef, Yoav Goldberg, and Ido Dagan.
2019. Step-by-step: Separating planning from
realization in neural data-to-text generation. In
Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short
Papers), pages 2267–2277, Minneapolis,
Minnesota. Association for Computational
Linguistics.

Feng Nie, Jinpeng Wang, Jin-Ge Yao, Rong
Pan, and Chin-Yew Lin. 2018. Operation-guided
neural networks for high fidelity
data-to-text generation. In Proceedings of
the 2018 Conference on Empirical Methods
in Natural Language Processing,
pages 3879–3889, Brussels, Belgium.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18-1422

Jekaterina Novikova, Ondřej Dušek, and
Verena Rieser. 2017. The E2E dataset:
New challenges for end-to-end generation.
In Proceedings of the 18th Annual SIGdial
Meeting on Discourse and Dialogue,
pages 201–206, Saarbrücken, Germany.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/W17-5525

Romain Paulus, Caiming Xiong, and Richard
Socher. 2018. A deep reinforced model for
In International
abstractive summarization.
Conference on Learning Representations.

Laura Perez-Beltrachini and Mirella Lapata. 2018.
Bootstrapping generators from noisy data. In
Proceedings of the 2018 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers),
pages 1516–1527, New Orleans, Louisiana.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1
/N18-1137

Ratish Puduppully, Li Dong, and Mirella
Lapata. 2019a. Data-to-text generation with
content selection and planning. In Proceedings
of the 33rd AAAI Conference on Artificial
Intelligence. Honolulu, Hawaii. DOI:
https://doi.org/10.1609/aaai.v33i01.33016908

Ratish Puduppully, Li Dong, and Mirella Lapata.
2019b. Data-to-text generation with entity
modeling. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 2023–2035, Florence, Italy.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/P19-1195

Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever.
2019. Language models are unsupervised
multitask learners. OpenAI blog, 1(8):9.

Bryan Orme. 2009. Maxdiff analysis: Simple
counting, individual-level logit, and HB.
Sawtooth Software.

Kishore Papineni, Salim Roukos, Todd
Ward, and Wei-Jing Zhu. 2002. BLEU: a
method for automatic evaluation of machine
translation. In Proceedings of the 40th
Annual Meeting of the Association for
Computational Linguistics, pages 311–318,
Philadelphia, Pennsylvania, USA. Association
for Computational Linguistics. DOI:
https://doi.org/10.3115/1073083.1073135

Clément Rebuffel, Laure Soulier, Geoffrey
Scoutheeten, and Patrick Gallinari. 2020. A
hierarchical model for data-to-text generation.
In European Conference on Information
Retrieval, pages 65–80. Springer. DOI:
https://doi.org/10.1007/978-3-030-45439-5_5,
PMCID: PMC7148215

Ehud Reiter. 1995. NLG vs. templates. CoRR,
cmp-lg/9504013v1. DOI:
https://doi.org/10.1017/S1351324997001502

Ehud Reiter and Robert Dale. 1997. Building
applied natural language generation systems.
Natural Language Engineering, 3(1):57–87.

Ehud Reiter and Robert Dale. 2000. Building
Natural Language Generation Systems.
Studies in Natural Language Processing,
Cambridge University Press. DOI:
https://doi.org/10.1017/CBO9780511519857

Fahimeh Saleh, Alexandre Berard, Ioan
Calapodescu, and Laurent Besacier. 2019. Naver
Labs Europe’s systems for the document-level
generation and translation task at WNGT
2019. In Proceedings of the 3rd Workshop
on Neural Generation and Translation,
pages 273–279, Hong Kong. Association for
Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19-5631

Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Neural machine translation of
rare words with subword units. In Proceedings
of the 54th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1715–1725,
Berlin, Germany. Association for Computational
Linguistics. DOI:
https://doi.org/10.18653/v1/P16-1162

Zhihong Shao, Minlie Huang, Jiangtao Wen,
Wenfei Xu, and Xiaoyan Zhu. 2019. Long and
diverse text generation with planning-based
hierarchical variational model. In Proceedings
of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 3257–3268, Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19-1321

Ilya Sutskever, Oriol Vinyals, and Quoc V.
Le. 2014. Sequence to sequence learning
with neural networks. In Advances in Neural
Information Processing Systems, volume 27,
pages 3104–3112. Curran Associates, Inc.

Shunsuke Takeno, Masaaki Nagata, and Kazuhide
Yamamoto. 2017. Controlling target features
in neural machine translation via prefix
constraints. In Proceedings of the 4th Workshop
on Asian Translation (WAT 2017), pages 55–63,
Taipei, Taiwan. Asian Federation of Natural
Language Processing.

Ran Tian, Shashi Narayan, Thibault Sellam, and
Ankur P. Parikh. 2019. Sticking to the facts:
Confident decoding for faithful data-to-text
generation. CoRR, abs/1910.08684v2.

Chris van der Lee, Albert Gatt, Emiel van
Miltenburg, Sander Wubben, and Emiel
Krahmer. 2019. Best practices for the human
evaluation of automatically generated text.
In Proceedings of the 12th International
Conference on Natural Language Generation,
pages 355–368, Tokyo, Japan. Association for
Computational Linguistics. DOI:
https://doi.org/10.18653/v1/W19-8643

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In I. Guyon, U. V.
Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors,
Advances in Neural Information Processing
Systems 30, pages 5998–6008. Curran
Associates, Inc.

Oriol Vinyals, Meire Fortunato, and Navdeep
Jaitly. 2015. Pointer networks. In C. Cortes,
N. D. Lawrence, D. D. Lee, M. Sugiyama,
and R. Garnett, editors, Advances in
Neural Information Processing Systems 28,
pages 2692–2700, Curran Associates, Inc.

Eric Wallace, Yizhong Wang, Sujian Li, Sameer
Singh, and Matt Gardner. 2019. Do NLP
models know numbers? Probing numeracy
in embeddings. In Proceedings of the
2019 Conference on Empirical Methods in
Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 5307–5315, Hong Kong, China.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19-1534

Ronald J. Williams and Jing Peng. 1990. An
efficient gradient-based algorithm for on-line
training of recurrent network trajectories.
Neural Computation, 2(4):490–501. DOI:
https://doi.org/10.1162/neco.1990.2.4.490

Sam Wiseman, Stuart Shieber, and Alexander
Rush. 2017. Challenges in data-to-document
generation. In Proceedings of the 2017
Conference on Empirical Methods in Natural
Language Processing, pages 2253–2263,
Copenhagen, Denmark. Association for
Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D17-1239

Yonghui Wu, Mike Schuster, Zhifeng Chen,
Quoc V. Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin
Gao, Klaus Macherey, Jeff Klingner, Apurva
Shah, Melvin Johnson, Xiaobing Liu, Lukasz
Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku
Kudo, Hideto Kazawa, Keith Stevens, George
Kurian, Nishant Patil, Wei Wang, Cliff Young,
Jason Smith, Jason Riesa, Alex Rudnick,
Oriol Vinyals, Greg Corrado, Macduff Hughes,
and Jeffrey Dean. 2016. Google’s neural
machine translation system: Bridging the gap
between human and machine translation. CoRR,
abs/1609.08144v2.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong
He, Alex Smola, and Eduard Hovy. 2016.
Hierarchical attention networks for document
classification. In Proceedings of the 2016
Conference of the North American Chapter
of the Association for Computational
Linguistics: Human Language Technologies,
pages 1480–1489, San Diego, California.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/N16-1174

Kyra Yee, Yann Dauphin, and Michael Auli. 2019.
Simple and effective noisy channel modeling
for neural machine translation. In Proceedings
of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 5696–5701, Hong Kong, China.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19-1571

Lei Yu, Laurent Sartran, Wojciech Stokowiec,
Wang Ling, Lingpeng Kong, Phil Blunsom,
and Chris Dyer. 2020. Better document-level
machine translation with Bayes’ rule. Trans-
actions of the Association for Computational
Linguistics, 8:346–360. DOI:
https://doi.org/10.1162/tacl_a_00319

Wlodek Zadrozny and Karen Jensen. 1991.
Semantics of paragraphs. Computational
Linguistics, 17(2):171–210.
