Data-to-text Generation with Macro Planning

Ratish Puduppully and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB

r.puduppully@sms.ed.ac.uk

mlap@inf.ed.ac.uk

Abstract

Recent approaches to data-to-text generation have adopted the very successful encoder-decoder architecture or variants thereof. These models generate text that is fluent (but often imprecise) and perform quite poorly at selecting appropriate content and ordering it coherently. To overcome some of these issues, we propose a neural model with a macro planning stage followed by a generation stage, reminiscent of traditional methods which embrace separate modules for planning and surface realization. Macro plans represent the high-level organization of important content such as entities, events, and their interactions; they are learned from data and given as input to the generator. Extensive experiments on two data-to-text benchmarks (ROTOWIRE and MLB) show that our approach outperforms competitive baselines in terms of automatic and human evaluation.

1 Introduction

Data-to-text generation refers to the task of generating textual output from non-linguistic input (Reiter and Dale, 1997, 2000; Gatt and Krahmer, 2018) such as databases of records, simulations of physical systems, accounting spreadsheets, or expert system knowledge bases. As an example, Figure 1 shows various statistics describing a major league baseball (MLB) game, including extracts from the box score (i.e., the performance of the two teams and individual team members who played as batters, pitchers or fielders; Table (A)), play-by-play (i.e., the detailed sequence of each play of the game as it occurred; Table (B)), and a human written game summary (Table (C)).

Traditional methods for data-to-text generation (Kukich, 1983; McKeown, 1992; Reiter and Dale, 1997) follow a pipeline architecture, adopting separate stages for text planning (determining which content to talk about and how it might be organized in discourse), sentence planning (aggregating content into sentences, deciding specific words to describe concepts and relations, and generating referring expressions), and linguistic realization (applying the rules of syntax, morphology, and orthographic processing to generate surface forms). Recent neural network–based approaches (Lebret et al., 2016; Mei et al., 2016; Wiseman et al., 2017) make use of the encoder-decoder architecture (Sutskever et al., 2014), are trained end-to-end, and have no special-purpose modules for how to best generate a text, aside from generic mechanisms such as attention and copy (Bahdanau et al., 2015; Gu et al., 2016). The popularity of end-to-end models has been further boosted by the release of new datasets with thousands of input-document training pairs. The example shown in Figure 1 is taken from the MLB dataset (Puduppully et al., 2019b), which contains baseball game statistics and human written summaries (∼25K instances). ROTOWIRE (Wiseman et al., 2017) is another widely used benchmark, which contains NBA basketball game statistics and their descriptions (∼5K instances).

Wiseman et al. (2017) show that despite being able to generate fluent text, neural data-to-text generation models are often imprecise, prone to hallucination (i.e., generate text that is not supported by the input), and poor at content selection and document structuring. Attempts to remedy some of these issues focus on changing the way entities are represented (Puduppully et al., 2019b; Iso et al., 2019), allowing the decoder to skip low-confidence tokens to enhance faithful generation (Tian et al., 2019), and making the encoder-decoder architecture more modular by introducing micro planning (Puduppully et al., 2019a; Moryossef et al., 2019). Micro planning operates at the record level (see Table (A) in

Figure 1: MLB statistics tables and game summary. Tables summarize the performance of teams and individual team members who played as batters and pitchers as well as the most important actions (and their actors) in each play (Tables (A) and (B)). The macro plan for the game summary is shown at the bottom (Table (E)). <P> indicates paragraph delimiters. There is a plan for every paragraph in the game summary (correspondence shown in the same color); entity verbalizations <V(entity)> describe entities, while inning verbalizations describe events related to the top/bottom side of an inning (see Section 3.1). The set of candidate paragraph plans is shown above the macro plan (Table (D)) and grouped into two types: plans describing a single entity/event or their combinations. Best viewed in color.

Figure 1; e.g., C.Mullins BH 2, J.Villar TEAM Orioles), it determines which facts should be mentioned within a textual unit (e.g., a sentence) and how these should be structured (e.g., the sequence of records). An explicit content planner essentially makes the job of the neural network less onerous, allowing it to concentrate on producing fluent natural language output, without expending too much effort on content organization.

In this work, we focus on macro planning, the high-level organization of information and how it should be presented, which we argue is important for the generation of long, multi-paragraph documents (see text (C) in Figure 1). Problematically, modern datasets like MLB (Puduppully et al., 2019b; and also Figure 1) and ROTOWIRE (Wiseman et al., 2017) do not naturally lend themselves to document planning as there is no explicit link between the summary and the content of the game (which is encoded in tabular form). In other words, the underlying plans are latent, and it is not clear how they might be best represented, namely, as sequences of records from a table, or simply words. Nevertheless, game summaries through their segmentation into paragraphs (and lexical overlap with the input) give clues as to how content might be organized. Paragraphs are a central element of discourse (Chafe, 1979; Longacre, 1979; Halliday and Hasan, 1976), the smallest domain where coherence and topic are defined and anaphora resolution is possible (Zadrozny and Jensen, 1991). We therefore operationalize the macro plan for a game summary as a sequence of paragraph plans.

Although resorting to paragraphs describes the summary plan at a coarse level, we still need to specify individual paragraph plans. In the sports domain, paragraphs typically mention entities (e.g., players important in the game), key events (e.g., scoring a run), and their interaction. And most of this information is encapsulated in the statistics accompanying game summaries (see Tables (A) and (B) in Figure 1). We thus define paragraph plans such that they contain verbalizations of entity and event records (see plan (E) in Figure 1). Given a set of paragraph plans and their corresponding game summary (see Table (D) and summary (C) in Figure 1), our task is twofold. At training time, we must learn how content was selected in order to give rise to specific game summaries (e.g., how input (D) led to plan (E) for summary (C) in Figure 1), while at test time, given input for a new game, we first predict a macro plan for the summary and then generate the corresponding document.

We present a two-stage approach where macro plans are induced from training data (by taking the table and corresponding summaries into account) and then fed to the text generation stage. Aside from making data-to-text generation more interpretable, the task of generating a document from a macro plan (rather than a table) affords greater control over the output text and plays to the advantage of encoder-decoder architectures which excel at modeling sequences. We evaluate model performance on the ROTOWIRE (Wiseman et al., 2017) and MLB (Puduppully et al., 2019b) benchmarks. Experimental results show that our plan-and-generate approach produces output that is more factual, coherent, and fluent compared with existing state-of-the-art models. Our code, trained models, and dataset with macro plans can be found at https://github.com/ratishsp/data2text-macro-plan-py.

2 Related Work

Content planning has traditionally been considered a fundamental component in natural language generation. Not only does it determine which information-bearing units to talk about, but it also arranges them into a structure that creates coherent output. Many content planners have been based on theories of discourse coherence (Hovy, 1993), schemas (McKeown et al., 1997), or have relied on generic planners (Dale, 1989). Plans are mostly based on hand-crafted rules after analyzing the target text, although a few approaches have recognized the need for learning-based methods. For instance, Duboue and McKeown (2001) learn ordering constraints in a content plan, Konstas and Lapata (2013) represent plans as grammar rules whose probabilities are estimated empirically, while others make use of semantically annotated corpora to bootstrap content planners (Duboue and McKeown, 2002; Kan and McKeown, 2002).

Recently, various attempts have been made to improve neural generation models (Wiseman et al., 2017) based on the encoder-decoder architecture (Bahdanau et al., 2015) by adding various planning modules. Puduppully et al. (2019a) propose a model for data-to-text that first learns a plan from the records in the input table and then generates a summary conditioned on this plan. Shao et al. (2019) introduce a Planning-based Hierarchical Variational Model where a plan is a sequence of groups, each of which contains a subset of input items to be covered in a sentence. The content of each sentence is verbalized, conditioned on the plan and previously generated context. In their case, input items are a relatively small list of attributes (∼28) and the output document is also short (∼110 words).

There have also been attempts to incorporate neural modules in a pipeline architecture for data-to-text generation. Moryossef et al. (2019) develop a model with a symbolic text planning stage followed by a neural realization stage. They experiment with the WebNLG dataset (Gardent et al., 2017) which consists of RDF ⟨Subject, Object, Predicate⟩ triples paired with corresponding text. Their document plan is a sequence of sentence plans that in turn determine the division of facts into sentences and their order. Along similar lines, Castro Ferreira et al. (2019) propose an architecture composed of multiple steps including discourse ordering, text structuring, lexicalization, referring expression generation, and surface realization. Both approaches show the effectiveness of pipeline architectures; however, their task does not require content selection and the output texts are relatively short (24 tokens on average).


Although it is generally assumed that task-specific parallel data is available for model training, Laha et al. (2020) do away with this assumption and present a three-stage pipeline model which learns from monolingual corpora. They first convert the input to a form of tuples, which in turn are expressed in simple sentences, followed by a third stage that merges simple sentences to form more complex ones by aggregation and referring expression generation. They also evaluate on data-to-text tasks which have relatively short outputs. There have also been efforts to improve the coherence of the output, especially when dealing with longer documents. Puduppully et al. (2019b) make use of hierarchical attention over entity representations which are updated dynamically, while Iso et al. (2019) explicitly keep track of salient entities and memorize which ones have been mentioned.

Our work also attempts to alleviate deficiencies in neural data-to-text generation models. In contrast to previous approaches (Puduppully et al., 2019a; Moryossef et al., 2019; Laha et al., 2020), we place emphasis on macro planning and create plans representing the high-level organization of a document including both its content and structure. We share with previous work (e.g., Moryossef et al. 2019) the use of a two-stage architecture. We show that macro planning can be successfully applied to long document data-to-text generation, resulting in improved factuality, coherence, and fluency without any postprocessing (e.g., to smooth referring expressions) or recourse to additional tools (e.g., parsing or information extraction).

3 Problem Formulation

We hypothesize that generation based on plans should fare better compared to generating from a set of records, since macro plans offer a bird's-eye view, a high-level organization of the document content and structure. We also believe that macro planning will work well for long-form text generation, that is, for datasets that have multi-paragraph target texts, a large vocabulary space, and require content selection.

We assume the input to our model is a set of paragraph plans E = {e_i}_{i=1}^{|E|}, where e_i is a paragraph plan. We model the process of generating output summary y given E as a two-step process, namely, the construction of a macro plan x based on the set of paragraph plans, followed by the generation of a summary given a macro plan as input. We now explain how E is obtained and how each step is realized. We discuss our model considering mainly an example from the MLB dataset (Puduppully et al., 2019b) but also touch on how the approach can be straightforwardly adapted to ROTOWIRE (Wiseman et al., 2017).

3.1 Macro Plan Definition

A macro plan consists of a sequence of paragraph plans separated by a paragraph discourse marker <P>, that is, x = e_i <P> e_j . . . <P> e_k, where e_i, e_j, e_k ∈ E. A paragraph plan in turn is a sequence of entities and events describing the game. By entities we mean individual players or teams and the information provided about them in box score statistics (see rows and column headings in Figure 1, Table (A)), while events refer to information described in play-by-play (see Table (B)). In baseball, plays are grouped in half-innings. During each half of an inning, a team takes its turn to bat (the visiting team bats in the top half and the home team in the bottom half). An example macro plan is shown at the bottom of Figure 1.

Within a paragraph plan, entities and events are verbalized into a text sequence along the lines of Saleh et al. (2019). We make use of special tokens for the type of record followed by the value of the record from the table. We retain the same position for each record type and value. For example, batter C.Mullins from Figure 1 would be verbalized as C.Mullins H 4 . . . 2 2 1 Orioles . . . , with each value preceded by its record-type token. For the sake of brevity we use the shorthand <V(C.Mullins)> for the full entity verbalization.
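To make the verbalization step concrete, the sketch below shows one way such a token sequence could be assembled. This is our simplified illustration rather than the released implementation, and the record-type names and their order (H, R, RBI, TEAM) are hypothetical examples, not the exact inventory used for MLB or ROTOWIRE.

# Sketch: verbalize one entity's records as "record-type token + value" pairs,
# keeping a fixed position for each record type.
def verbalize_entity(name, records, record_types=("H", "R", "RBI", "TEAM")):
    """records: dict mapping record-type name -> value for one player or team."""
    tokens = [name]
    for rtype in record_types:            # fixed order -> same position per type
        if rtype in records:
            tokens.extend([f"<{rtype}>", str(records[rtype])])
    return " ".join(tokens)

# e.g. verbalize_entity("C.Mullins", {"H": 4, "R": 2, "RBI": 2, "TEAM": "Orioles"})
# -> "C.Mullins <H> 4 <R> 2 <RBI> 2 <TEAM> Orioles"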

Paragraph Plan for Entities For a paragraph containing entities, the corresponding plan will be a verbalization of the entities in sequence. For paragraphs with multiple mentions of the same entity, the plan will verbalize an entity only once and at its first position of mention. Paragraph ''Keller gave up a home run . . . the teams with the worst records in the majors'' from the summary in Figure 1 describes four entities including B. Keller, C. Mullins, Royals and Orioles. The respective plan is the verbalization of the four entities in order: <V(B.Keller)> <V(C.Mullins)> <V(Royals)> <V(Orioles)>, where V stands for verbalization, <V(B.Keller)> is a shorthand for the record sequence B.Keller 7 5 8 . . . , <V(Royals)> is a shorthand for the team records Royals 9 14 1, and so on.

Paragraph Plan for Events A paragraph may also describe one or more events. For example, the paragraph ''With the score tied 1–1 in the fourth . . . 423-foot home run to left field to make it 3-1'' discusses what happened in the bottom halves of the fourth and fifth innings. We verbalize an event by first describing the participating entities followed by the plays in the event. Entities are described in the order in which they appear in a play, and within the same play we list the batter followed by the pitcher, fielder, scorer, and basemen. The paragraph plan corresponding to the bottom halves of the fourth and fifth inning is the sequence of the two inning verbalizations. Here, the verbalization of the bottom of the fifth is a shorthand for <V(W.Merrifield)> <V(A.Cashner)> . . . <V(B.Goodwin)> <V(H.Dozier)> followed by the verbalizations of its plays. Entities <V(W.Merrifield)>, <V(A.Cashner)>, <V(B.Goodwin)>, and <V(H.Dozier)> correspond in turn to W. Merrifield, A. Cashner, B. Goodwin, and H. Dozier, while the verbalization of the first play in the bottom half of the fifth inning (see the play-by-play table in Figure 1) abbreviates the following detailed plan: 5 Royals Orioles 1 H.Dozier A.Cashner Home-run Royals-3-Orioles-1, and so on.

The procedure described above is not specific to MLB and can be ported to other datasets with similar characteristics such as ROTOWIRE. However, ROTOWIRE does not provide play-by-play information, and as a result there is no event verbalization for this dataset.

3.2 Macro Plan Construction

We provided our definition for macro plans in the previous sections; however, it is important to note that such macro plans are not readily available in data-to-text benchmarks like MLB (Puduppully et al., 2019b) and ROTOWIRE (Wiseman et al., 2017), which consist of tables of records r paired with a gold summary y (see Tables (A)–(C) in Figure 1). We now describe our method for obtaining macro plans x from r and y.

Similar to Moryossef et al. (2019), we define macro plans to be conformant with gold summaries such that (1) they have the same splits into paragraphs—entities and events within a paragraph in y are grouped into a paragraph plan in x; and (2) the order of events and entities in a paragraph and its corresponding plan are identical. We construct macro plans by matching entities and events in the summary to records in the tables. Moreover, paragraph delimiters within summaries form natural units which taken together give rise to a high-level document plan. We match entities in summaries with entities in tables using exact string match, allowing for some degree of variation in the expression of team names (e.g., A's for Athletics and D-backs for Diamondbacks). Information pertaining to innings appears in the summaries in the form of ordinal numbers (e.g., first, ninth) modifying the noun inning and can be relatively easily identified via pattern matching (e.g., in sentences like ''Dozier led off the fifth inning''). However, there are instances where the mention of innings is more ambiguous (e.g., ''With the score tied 1–1 in the fourth, Andrew Cashner (4–13) gave up a sacrifice fly''). We could disambiguate such mentions manually and then train a classifier to learn to predict whether an inning is mentioned. Instead, we explore a novel annotation-free method that makes use of the pretrained language model GPT2 (Radford et al., 2019). Specifically, we feed the context preceding the ordinal number to GPT2 (i.e., the current paragraph up to the ordinal number and the paragraph preceding it) and if inning appears in the top 10 next word predictions, we consider it a positive match. On a held-out dataset, this method achieves 98% precision and 98% recall at disambiguating inning mentions.
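The following is a minimal sketch of this annotation-free check using the HuggingFace transformers GPT-2 model; it is our illustration, not the code used for the reported numbers, and details such as the exact top-k handling of subword tokens are assumptions.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def mentions_inning(context: str, k: int = 10) -> bool:
    """True if 'inning' is among the top-k next-word predictions for the context."""
    ids = tokenizer(context, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # distribution over the next token
    top_ids = torch.topk(logits, k).indices.tolist()
    top_words = [tokenizer.decode([i]).strip().lower() for i in top_ids]
    return "inning" in top_words

# e.g. mentions_inning("With the score tied 1-1 in the fourth")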

To resolve whether the summary discusses the top or bottom side of an inning, we compare the entities in the paragraph with the entities in each half-inning (play-by-play Table (B) in Figure 1) and choose the side with the greater number of entity matches. For example, Andrew Cashner, Merrifield and fourth inning uniquely resolve to the bottom half of the fourth inning.

3.3 Paragraph Plan Construction

Figure 1 shows the macro plan we obtain for game summary (C). Importantly, macro plan (E) is the outcome of a content selection process after considering several candidate paragraph plans as input. So, what are the candidate paragraph plans that give rise to macro plan (E)? To answer this question, we examined the empirical distribution of paragraph plans in MLB and ROTOWIRE (training portion). Interestingly, we found that ∼79% of the paragraph plans in MLB refer to a single event or a single player (and team(s)). In ROTOWIRE, ∼92% of paragraphs are about a singleton player (and team(s)) or a pair of players.

Based on this analysis, we assume that paragraph plans can be either one (verbalized) entity/event or a combination of at most two. Under this assumption, we explicitly enumerate the set of candidate paragraph plans in a game. For the game in Figure 1, candidate paragraph plans are shown in Table (D). The first table groups plans based on individual verbalizations describing the team(s), players, and events taking place in specific innings. The second table groups pairwise combinations thereof. In MLB, such combinations are between team(s) and players. In ROTOWIRE, we also create combinations between players. Such paragraph plans form the set E based on which macro plan x is constructed to give rise to game summary y.
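A minimal sketch of this enumeration step is shown below; the function and argument names are illustrative assumptions about how the verbalized entities and events might be passed around, not the paper's actual data structures.

from itertools import combinations

def candidate_paragraph_plans(team_verbs, player_verbs, event_verbs, dataset="MLB"):
    # singletons: one verbalized entity or event per candidate plan
    candidates = [[v] for v in team_verbs + player_verbs + event_verbs]
    # pairwise combinations between team(s) and players (both datasets)
    for t in team_verbs:
        for p in player_verbs:
            candidates.append([t, p])
    # ROTOWIRE additionally pairs players with each other
    if dataset == "ROTOWIRE":
        candidates.extend([list(pair) for pair in combinations(player_verbs, 2)])
    return candidates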

4 Model Description

The input to our model is a set of paragraph plans, each of which is a sequence of tokens. We first compute paragraph plan representations e_i ∈ R^n, and then apply a contextualization and content planning mechanism similar to planning modules introduced in earlier work (Puduppully et al., 2019a; Chen and Bansal, 2018). Predicted macro plans serve as input to our text generation model, which adopts an encoder-decoder architecture (Bahdanau et al., 2015; Luong et al., 2015).

4.1 Macro Planning

Paragraph Plan Representation We encode the tokens in a verbalized paragraph plan e_i as {e_{i,j}}_{j=1}^{|e_i|} with a BiLSTM (Figure 2, bottom part). To reflect the fact that some records will be more important than others, we compute an attention weighted sum of {e_{i,j}}_{j=1}^{|e_i|} following Yang et al. (2016). Let d ∈ R^n denote a randomly initialized query vector learnt jointly with the rest of the parameters. We compute attention values α_{i,j} over d and paragraph plan token representation e_{i,j}:

α_{i,j} ∝ exp(d^⊤ e_{i,j})    (1)


Figure 2: Paragraph plan representation and contextualization for macro planning. Computation of e_3 is detailed in Equations (1) and (2), e^att_3 in Equation (3), and e^c_3 in Equation (4).

Paragraph plan vector e_i is the attention weighted sum of e_{i,j} (with Σ_j α_{i,j} = 1):

e_i = Σ_j α_{i,j} e_{i,j}    (2)

Next, we contextualize each paragraph plan representation vis-a-vis other paragraph plans (Figure 2, top left part). First, we compute attention scores β_{i,k} over paragraph plan representations to obtain an attentional vector e^att_i for each:

β_{i,k} ∝ exp(e_i^⊤ W_a e_k)
c_i = Σ_{k≠i} β_{i,k} e_k
e^att_i = W_g [e_i; c_i]    (3)

where W_a ∈ R^{n×n}, W_g ∈ R^{n×2n} are parameter matrices, and Σ_{k≠i} β_{i,k} = 1. Then, we compute a content selection gate, and apply this gate to e_i to obtain the new paragraph plan representation e^c_i:

g_i = sigmoid(e^att_i)
e^c_i = g_i ⊙ e_i    (4)

where ⊙ denotes element-wise multiplication. Thus, each element in e_i is weighted by the corresponding element of g_i ∈ [0, 1]^n to obtain a contextualized paragraph plan representation e^c_i.
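The snippet below is a compact sketch of Equations (1)–(4) under our own assumptions about tensor shapes (token encodings are taken to come from the BiLSTM); it is not the released implementation.

import torch
import torch.nn as nn

class PlanContextualizer(nn.Module):
    """Sketch of Eqs. (1)-(4): attention-pooled plan vectors plus a content gate."""
    def __init__(self, n):
        super().__init__()
        self.d = nn.Parameter(torch.randn(n))      # query vector d of Eq. (1)
        self.Wa = nn.Linear(n, n, bias=False)      # W_a of Eq. (3)
        self.Wg = nn.Linear(2 * n, n, bias=False)  # W_g of Eq. (3)

    def forward(self, token_reps):
        # token_reps: list of |E| tensors, each of shape (|e_i|, n)
        alphas = [torch.softmax(t @ self.d, dim=0) for t in token_reps]   # Eq. (1)
        e = torch.stack([a @ t for a, t in zip(alphas, token_reps)])      # Eq. (2)
        scores = e @ self.Wa(e).T                                         # e_i^T W_a e_k
        scores.fill_diagonal_(float("-inf"))        # exclude k == i
        beta = torch.softmax(scores, dim=-1)
        c = beta @ e
        e_att = self.Wg(torch.cat([e, c], dim=-1))                        # Eq. (3)
        gate = torch.sigmoid(e_att)                                       # Eq. (4)
        return gate * e                             # contextualized e^c_i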

Content Planning Our model learns to predict macro plans, after having been trained on pairs of sets of paragraph plans and corresponding macro plans.

4.2 Text Generation

The predicted macro plan, a sequence of paragraph plans with <P> separators in between, serves as input to the text generation stage. The conditional output probability p(y|x) is modeled as:

p(y|x) = Π_{t=1}^{|y|} p(y_t | y_{<t}, x)

                        ROTOWIRE    MLB
Vocab Size              11.3K       38.9K
# Tokens                1.5M        14.3M
# Instances             4.9K        26.3K
# Record Types          39          53
Avg Records             628         565
Avg Paragraph Plans     10.7        15.1
Avg Length              337.1       542.05

Table 1: Dataset statistics for ROTOWIRE and MLB. Vocabulary size, number of tokens, number of instances (i.e., table-summary pairs), number of record types, average number of records, average number of paragraph plans, and average summary length.

During inference, we employ beam search to find the most likely macro plan ẑ among candidate macro plans z′ given paragraph plans as input:

ẑ = arg max_{z′} p(z′ | E; θ)

We deterministically obtain x̂ from ẑ, and the output summary ŷ among candidate outputs y′ given macro plan x̂ as input:

ŷ = arg max_{y′} p(y′ | x̂; φ)
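Conceptually, inference is thus a two-step pipeline. The sketch below shows the control flow under assumed interfaces; macro_planner.beam_search and generator.beam_search are hypothetical names standing in for the two trained models.

def generate_summary(paragraph_plans, macro_planner, generator, beam_size=5):
    # Stage 1: predict the most likely macro plan z-hat over the candidate set E
    z_hat = macro_planner.beam_search(paragraph_plans, beam_size=beam_size)
    # Deterministically assemble the macro plan x-hat (paragraph plans joined by <P>)
    x_hat = " <P> ".join(z_hat)
    # Stage 2: generate the summary y-hat conditioned on the macro plan
    return generator.beam_search(x_hat, beam_size=beam_size)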

5 Experimental Setup

Data We performed experiments on the ROTOWIRE (Wiseman et al., 2017) and MLB (Puduppully et al., 2019b) benchmarks. The details of these two datasets are given in Table 1. We can see that MLB is around 5 times bigger, has a richer vocabulary, and has longer game summaries. We use the official splits of 3,398/727/728 for ROTOWIRE and 22,821/1,739/1,744 for MLB. We make use of a tokenization script1 to detokenize and retokenize the summaries in both ROTOWIRE and MLB.

We reconstructed the MLB dataset, as the version released by Puduppully et al. (2019b) had removed all paragraph delimiters from game summaries. Specifically, we followed their methodology and downloaded the same summaries from the ESPN Web site2 and added the <P> delimiter to paragraphs in the summaries.3

1https://github.com/neulab/DGT.
2http://www.espn.com/mlb/recap?gameId={gameid}.
3Although our model is trained on game summaries with paragraph delimiters, and also predicts these at generation time, for evaluation we strip <P> from model output.


ROTOWIRE does not have paragraph delimiters in game summaries either. We reverse engineered these as follows: (1) we split summaries into sentences using the NLTK (Bird et al., 2009) sentence tokenizer; (2) initialized each paragraph with a separate sentence; (3) merged two paragraphs into one if the entities in the former were a superset of the entities in the latter; (4) repeated Step 3 until no merges were possible.
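A small sketch of this merging heuristic follows. It is an approximation of the procedure above: extract_entities is a hypothetical helper standing in for the entity matching of Section 3.2, and merging is restricted to adjacent paragraphs, which is our assumption.

import nltk

def reverse_engineer_paragraphs(summary, extract_entities):
    """Split a summary into sentences and greedily merge them into paragraphs."""
    sentences = nltk.sent_tokenize(summary)
    # each paragraph starts as a single sentence paired with its entity set
    paras = [(s, set(extract_entities(s))) for s in sentences]
    merged = True
    while merged:                       # repeat until no merges are possible
        merged = False
        for i in range(len(paras) - 1):
            text_a, ents_a = paras[i]
            text_b, ents_b = paras[i + 1]
            if ents_a >= ents_b:        # former's entities are a superset of the latter's
                paras[i] = (text_a + " " + text_b, ents_a | ents_b)
                del paras[i + 1]
                merged = True
                break
    return [p[0] for p in paras]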

Training Configuration We tuned the model hyperparameters on the development set. For training the macro planning and the text generation stages, we used the Adagrad (Duchi et al., 2011) optimizer. In addition, the text generation stage made use of truncated BPTT (Williams and Peng, 1990) with truncation length 100. We learn a subword vocabulary (Sennrich et al., 2016) for paragraph plans in the macro planning stage. We used 2.5K merge operations for ROTOWIRE and 8K merge operations for MLB. In text generation, we learn a joint subword vocabulary for the macro plan and game summaries. We used 6K merge operations for ROTOWIRE and 16K merge operations for MLB. All models were implemented on OpenNMT-py (Klein et al., 2017). We add to set E the paragraph plans corresponding to the output summary paragraphs, to ensure full coverage during training of the macro planner. During inference for predicting macro plans, we employ length normalization (Bahdanau et al., 2015) to avoid penalizing longer outputs; specifically, we divide the scores of beam search by the length of the output. In addition, we adopt bigram blocking (Paulus et al., 2018). For MLB, we further block beams containing more than two repetitions of a unigram. This helps improve the diversity of the predicted macro plans.
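As a rough illustration of the blocking constraints just described (our sketch, not the OpenNMT-py implementation), a beam extension can be rejected as follows:

def violates_blocking(tokens, next_token, max_unigram_repeats=2):
    """True if appending next_token repeats a bigram or over-repeats a unigram."""
    if tokens and (tokens[-1], next_token) in zip(tokens, tokens[1:]):
        return True                      # bigram already seen in this beam -> block
    if tokens.count(next_token) >= max_unigram_repeats:
        return True                      # would exceed two repetitions of a unigram
    return False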

System Comparisons We compared our model against the following systems: (1) the Template-based generators from Wiseman et al. (2017) for ROTOWIRE and Puduppully et al. (2019b) for MLB. Both systems apply the same principle: they emit a sentence about the teams playing in the game, followed by player-specific sentences, and a closing sentence. MLB additionally contains a description of play-by-play; (2) ED+CC, the best performing system in Wiseman et al. (2017), is a vanilla encoder-decoder model equipped with an attention and copy mechanism; (3) NCP+CC, the micro planning model of Puduppully et al. (2019a), generates content plans from the table by making use of Pointer Networks (Vinyals et al., 2015) to point to records; content plans are encoded with a BiLSTM and the game summary is decoded using another LSTM with attention and copy; (4) ENT, the entity-based model of Puduppully et al. (2019b), creates dynamically updated entity-specific representations; the text is generated conditioned on the data input and entity memory representations using hierarchical attention at each time step.

6 Results

Automatic Evaluation For automatic evaluation, following earlier work (Wiseman et al. 2017; Puduppully et al. 2019a,b, inter alia) we report BLEU (Papineni et al., 2002) with the gold summary as reference but also make use of the Information Extraction (IE) metrics from Wiseman et al. (2017), which are defined over the output of an IE system; the latter extracts entity (players, teams) and value (numbers) pairs in a summary, and then predicts the type of relation. For example, given the pair Kansas City Royals, 9, it would predict their relation as TR (i.e., Team Runs). Training data for the IE system is obtained by checking for matches between entity, value pairs in the gold summary and entity, value, record type triplets in the table.

Let ŷ be the gold summary and y the model output. Relation Generation (RG) measures the precision and count of relations extracted from y that also appear in records r. Content Selection (CS) measures the precision and recall of relations extracted from y that are also extracted from ŷ. Content Ordering (CO) measures the normalized Damerau-Levenshtein distance between the sequences of relations extracted from y and ŷ.
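As a rough illustration of these definitions (our simplified sketch, not the official evaluation scripts), CS can be computed from the two lists of extracted relations and CO from a normalized Damerau-Levenshtein distance between the two relation sequences; the conversion of the latter into the DLD% reported in Table 2 (where higher indicates better ordering) follows the evaluation script of Wiseman et al. (2017).

def content_selection(pred_rels, gold_rels):
    """CS precision/recall between relations extracted from y and from y-hat."""
    matches = sum(1 for r in pred_rels if r in gold_rels)
    precision = matches / len(pred_rels) if pred_rels else 0.0
    recall = matches / len(gold_rels) if gold_rels else 0.0
    return precision, recall

def content_ordering(pred_rels, gold_rels):
    """Normalized Damerau-Levenshtein distance (OSA variant) of the relation sequences."""
    m, n = len(pred_rels), len(gold_rels)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred_rels[i - 1] == gold_rels[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and pred_rels[i - 1] == gold_rels[j - 2] \
                    and pred_rels[i - 2] == gold_rels[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
    return d[m][n] / max(m, n, 1)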

We reused the IE model from Puduppully et al. (2019a) for ROTOWIRE but retrained it for MLB to improve its precision and recall. Furthermore, the implementation of Wiseman et al. (2017) computes RG, CS, and CO excluding duplicate relations. This artificially inflates the performance of models whose outputs contain repetition. We include duplicates in the computation of the IE metrics (and recreate them for all comparison systems).

ROTOWIRE        RG           CS                     CO       BLEU
                #     P%     P%    R%    F%         DLD%
Templ           54.3  99.9   27.1  57.7  36.9       13.1      8.46
WS-2017         34.1  75.1   20.3  36.3  26.1       12.4     14.19
ED+CC           35.9  82.6   19.8  33.8  24.9       12.0     14.99
NCP+CC          40.8  87.6   28.0  51.1  36.2       15.8     16.50
ENT             32.7  91.7   34.7  48.5  40.5       16.6     16.12
RBF-2020        44.9  89.5   23.9  47.0  31.7       14.3     17.16
Macro           42.1  97.6   34.1  57.8  42.9       17.7     15.46
−Plan(4)        36.2  81.3   22.1  38.6  28.1       12.1     14.00

MLB             RG           CS                     CO       BLEU
                #     P%     P%    R%    F%         DLD%
Templ           62.3  99.9   21.6  55.2  31.0       11.0      4.12
ED+CC           32.5  91.3   27.8  40.6  33.0       17.1      9.68
NCP+CC          19.6  81.3   44.5  44.1  44.3       21.9      9.68
ENT             23.8  81.1   40.9  49.5  44.8       20.7     11.50
Macro           30.8  94.4   40.8  54.9  46.8       21.8     12.62
−Plan(SP,4)     25.1  92.7   40.0  44.6  42.2       21.9     11.09

Table 2: Evaluation on ROTOWIRE and MLB test sets; relation generation (RG) count (#) and precision (P%), content selection (CS) precision (P%), recall (R%) and F-measure (F%), content ordering (CO) in normalized Damerau-Levenshtein distance (DLD%), and BLEU.

Table 2 (top) presents our results on the ROTOWIRE test set. In addition to Templ, NCP+CC, ENT, and ED+CC we include the best performing model of Wiseman et al. (2017) (WS-2017; note that ED+CC is an improved re-implementation of their model), and the model of Rebuffel et al. (2020) (RBF-2020), which represents the state of the art on ROTOWIRE. This model has a Transformer encoder (Vaswani et al., 2017) with a hierarchical attention mechanism over entities and records within entities. The models of Saleh et al. (2019), Iso et al. (2019), and Gong et al. (2019) make use of additional information not present in the input (e.g., previous/next games, summary writer) and are not directly comparable to the systems in Table 2. Results for the MLB test set are in the bottom portion of Table 2.

Templ has the highest RG precision and count on both datasets. This is not surprising: by design Templ is always faithful to the input. However, notice that it achieves the lowest BLEU among comparison systems, indicating that it mostly regurgitates facts with low fluency. Macro achieves the highest RG precision among all neural models for ROTOWIRE and MLB. We obtain an absolute improvement of 5.9% over ENT for ROTOWIRE and 13.3% for MLB. In addition, Macro achieves the highest CS F-measure for both datasets. On ROTOWIRE, Macro achieves the highest CO score, and the highest BLEU on MLB. On ROTOWIRE, in terms of BLEU, Macro is worse than comparison models (e.g., NCP+CC or ENT). Inspection of the output showed that the opening paragraph, which mostly describes how the two teams fared, is generally shorter in Macro, leading to shorter summaries and thus lower BLEU. There is high variance in the length of the opening paragraph in the training data and Macro verbalizes the corresponding plan conservatively. Ideas such as length normalization (Wu et al., 2016) or length control (Kikuchi et al., 2016; Takeno et al., 2017; Fan et al., 2018) could help alleviate this; however, we do not pursue them further for fair comparison with the other models.

The Contribution of Macro Planning To study the effect of macro planning in more detail, we further compared Macro against text generation models (see Section 4.2) which are trained on verbalizations of the tabular data (and gold summaries) but do not make use of document plans or a document planning mechanism. On ROTOWIRE, this model was trained on verbalizations of players and teams, with the input arranged such that the verbalization of the home team was followed by the visiting team, the home team players and the visiting team players. Mention of players was limited to the four best ones, following Saleh et al. (2019) (see −Plan(4) in Table 2). For MLB, we additionally include verbalizations of innings focusing on scoring plays which are likely to be discussed in game summaries (see −Plan(SP,4) in Table 2). Note that by preprocessing the input in such a way some simple form of content selection takes place simply by removing extraneous information which the model does not need to consider.

Across both datasets, −Plan variants appear competitive. On ROTOWIRE, −Plan(4) is better than ED+CC in terms of content selection but worse compared to ENT. On MLB, −Plan(SP,4) is again superior to ED+CC in terms of content selection but not ENT, whose performance lags behind when considering RG precision. Taken together, these results confirm that verbalizing entities and events into a text sequence is effective. At the same time, we see that −Plan variants are worse than Macro across most metrics, which underlines the importance of an explicit planning component.

Table 3 presents an intrinsic evaluation of the macro planning stage. Here, we compare the inferred macro plans with the gold macro plans, using CS and CO metrics defined with regard to entities and events instead of relations. We see that our macro planning model (Macro) achieves high scores for CS and CO for both ROTOWIRE and MLB. We further used the CS and CO metrics to check how well the generated summary follows the (predicted) plan. We followed the steps in Section 3.2 and reverse engineered macro plans from the model summaries and compared these extracted plans with the original macro plans with regard to entities and events. We found that Macro creates summaries that follow the plan closely: for ROTOWIRE, the CS F-score and CO are greater than 98%; for MLB, the CS F-score is greater than 94% and CO is greater than 89%. We show an output summary for Macro in Table 4, together with the predicted document plan.

Macro       CS-P    CS-R    CS-F    CO
ROTOWIRE    81.3    73.2    77.0    45.8
MLB         80.6    63.3    70.9    31.4

Table 3: Evaluation of the macro planning stage; content selection precision (CS-P), recall (CS-R), F-measure (CS-F), and content ordering (CO) between the inferred plans and gold plans in terms of entities and events for the ROTOWIRE and MLB test sets.

Human-Based Evaluation We also asked participants to assess model output in terms of relation generation, grammaticality, coherence, and conciseness (Wiseman et al., 2017; Puduppully et al., 2019a; Puduppully et al., 2019b). For ROTOWIRE, we compared Macro against RBF-2020,4 ED+CC, Gold, and Templ. For MLB, we compared Macro against ENT, ED+CC, Gold, and Templ.

We conducted our study on the Amazon Mechanical Turk (AMT) crowdsourcing platform, following best practices for human evaluation in NLG (van der Lee et al., 2019). Specifically, to ensure consistent ratings, we required crowdworkers to have an approval rating greater than 98% and a minimum of 1,000 previously completed tasks. Raters were restricted to English-speaking countries (i.e., US, UK, Canada, Ireland, Australia, or NZ). Participants were allowed to provide feedback on the task or field questions (our interface accepts free text).

4We are grateful to Clément Rebuffel for providing us with the output of their system.





ST. PETERSBURG, Fla. (AP) – The Tampa Bay Rays are making the most of it.
Akinori Iwamura hit a two-run homer in the eighth inning and the Rays beat the Boston Red Sox 2-1 on Sunday to complete a three-game sweep.
The Rays, who have the best record in the majors, have won six of their last seven games.
The Rays have won four of their last five series, including three in a row against the Red Sox, who have won six of their last seven overall.
Dioner Navarro singled with one out in the eighth off Clay Buchholz (1-2) and moved to third on Jason Bartlett's flyout to center. Iwamura then drove a 1-1 pitch into the left-field stands for his second homer of the season.
Scott Dohmann (2-0) got the win in relief, striking out Manny Ramirez with runners on first and third to end the eighth.
Troy Percival worked the ninth for his fifth save in five opportunities.
Clay Buchholz (1-2) gave up two runs and three hits in eight innings. He struck out nine and walked two.
The Red Sox loaded the bases with one out in the fifth on a single by Coco Crisp, a wild pitch and a walk to Jed Lowrie. Jacoby Ellsbury drove in Crisp with a two-out single to center.
Jackson struck out four and walked three.
The Red Sox loaded the bases with one out in the fifth on a single by Coco Crisp, a walk to Jed Lowrie and a one-out walk to Jed Lowrie. Jackson struck out Julio Lugo, but Jacoby Ellsbury singled to center to put the Red Sox up 1-0.
The Red Sox threatened in the eighth when J. D. Drew drew a two-out walk against Trever Miller, but Ramirez struck out to end the inning.

Table 4: Predicted macro plan (top) and corresponding model output (bottom). Entities and events in the summary corresponding to those in the macro plan are boldfaced.

In our first study, we presented crowdworkers with sentences randomly selected from summaries along with their corresponding box score (and play-by-play in the case of MLB) and asked them to count supported and contradicting facts (ignoring hallucinations, i.e., unsupported facts). We did not require crowdworkers to be familiar with NBA or MLB. Instead, we provided a cheat sheet explaining the semantics of box score tables. In addition, we provided examples of sentences with supported/contradicting facts. We evaluated 40 summaries from the test set (20 per dataset), 4 sentences from each summary, and elicited 3 responses per summary. This resulted in 40 summaries × 5 systems × 3 raters, for a total of 600 tasks. Altogether, 131 crowdworkers participated in this study (agreement using Krippendorff's α was 0.44 for supported and 0.42 for contradicting facts).

As shown in Table 5, Macro yields the smallest number of contradicting facts among neural models on both datasets. On ROTOWIRE the number of contradicting facts for Macro is comparable to Gold and Templ (the difference is not statistically significant) and significantly smaller compared to RBF-2020 and ED+CC.

ROTOWIRE    #Supp   #Contra   Gram      Coher     Concis
Gold        3.63    0.07      38.33     46.25*    30.83
Templ       7.57*   0.08     −61.67*   −52.92*   −36.67*
ED+CC       3.92    0.91*     5.0      −4.58     −8.33
RBF-2020    5.08*   0.67*     13.33     3.75      4.58
Macro       4.00    0.27      5.0       6.67      10.42

MLB         #Supp   #Contra   Gram      Coher     Concis
Gold        3.59    0.14      21.67     30.0      26.67
Templ       4.21    0.04     −51.25*   −43.75*    7.5
ED+CC       3.42    0.72*    −22.5*    −12.08*   −39.17*
ENT         3.71    0.73*     5.83*    −0.83*    −22.08*
Macro       3.76    0.25      27.08     26.67     46.25

Table 5: Average number of supported (#Supp) and contradicting (#Contra) facts in game summaries and best-worst scaling evaluation (higher is better). Systems significantly different from Macro are marked with an asterisk * (using a one-way ANOVA with post hoc Tukey HSD test; p ≤ 0.05).

The count of supported facts for Macro is comparable to Gold and ED+CC, and significantly lower than Templ and RBF-2020. On MLB, Macro has significantly fewer contradicting facts than ENT and ED+CC and is comparable to Templ and Gold (the difference is not statistically significant). The count of supported facts for Macro is comparable to Gold, ENT, ED+CC, and Templ. For both datasets, Templ has the lowest number of contradicting facts. This is expected as Templ essentially parrots facts (aka records) from the table.

We also conducted a second study to evaluate the quality of the generated summaries. We presented crowdworkers with a pair of summaries and asked them to choose the better one in terms of Grammaticality (is the summary written in well-formed English?), Coherence (is the summary well structured and well organized and does it have a natural ordering of the facts?), and Conciseness (does the summary avoid unnecessary repetition including whole sentences, facts or phrases?). We provided example summaries showcasing good and bad output. For this task, we required that the crowdworkers be able to comfortably comprehend NBA/MLB game summaries. We elicited preferences with Best-Worst Scaling (Louviere and Woodworth 1991; Louviere et al., 2015), a method shown to be more reliable than rating scales. The score of a system is computed as the number of times it is rated best minus the number of times it is rated worst (Orme, 2009). The scores range from −100 (absolutely worst) to +100 (absolutely best). We divided the five competing systems into ten pairs of summaries and elicited ratings for 40 summaries (20 per dataset). Each summary pair was rated by 3 raters. This resulted in 40 summaries × 10 system pairs × 3 evaluation criteria × 3 raters, for a total of 3,600 tasks. A total of 206 crowdworkers participated in this task (agreement using Krippendorff's α was 0.47).
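For clarity, the sketch below shows how such a best-worst scaling score can be computed and rescaled to the −100..+100 range reported above; normalizing by the number of judgments is our assumption about how the percentages were obtained.

def best_worst_score(n_best, n_worst, n_judgments):
    """BWS score: times rated best minus times rated worst, as a percentage."""
    return 100.0 * (n_best - n_worst) / n_judgments

# e.g. a system rated best 10 times and worst 4 times over 24 judgments
# scores 100 * (10 - 4) / 24 = 25.0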

As shown in Table 5, on ROTOWIRE, Macro is comparable to Gold, RBF-2020, and ED+CC in terms of Grammaticality but significantly better than Templ. In terms of Coherence, Macro is comparable to RBF-2020 and ED+CC but significantly better than Templ and significantly worse than Gold. With regard to Conciseness, Macro is comparable to Gold, RBF-2020, and ED+CC, and significantly better than Templ. On MLB, Macro is comparable to Gold in terms of Grammaticality and significantly better than ED+CC, ENT, and Templ. Macro is comparable to Gold in terms of Coherence and significantly better than ED+CC, ENT and Templ. In terms of Conciseness, raters found Macro comparable to Gold and Templ and significantly better than ED+CC and ENT. Taken together, our results show that macro planning leads to improvements in data-to-text generation in comparison to other systems for both the ROTOWIRE and MLB datasets.

7 Discussion

In this work we presented a plan-and-generate approach for data-to-text generation that consists of a macro planning stage representing high-level document organization in terms of structure and content, followed by a text generation stage. Extensive automatic and human evaluation shows that our approach achieves better results than existing state-of-the-art models and generates summaries which are factual, coherent, and concise.

Our results show that macro planning is more advantageous for generation tasks expected to produce longer texts with multiple discourse units, and could be easily extended to other sports domains such as cricket (Kelly et al., 2009) or American football (Barzilay and Lapata, 2005). Other approaches focusing on micro planning (Puduppully et al., 2019a; Moryossef et al., 2019) might be better tailored for generating shorter texts. There has been a surge of datasets recently focusing on single-paragraph outputs and the task of content selection, such as E2E (Novikova et al., 2017), WebNLG (Gardent et al., 2017), and WikiBio (Lebret et al., 2016; Perez-Beltrachini and Lapata, 2018). We note that in our model content selection takes place during macro planning and text generation. The results in Table 2 show that Macro achieves the highest CS F-measure on both datasets, indicating that the document as a whole and individual sentences discuss appropriate content.

Throughout our experiments we observed that template-based systems score poorly in terms of CS (but also CO and BLEU). This is primarily due to the inflexibility of the template approach, which is limited to the discussion of a fixed number of (high-scoring) players. However, human writers (and neural models to a certain extent) synthesize summaries taking into account the particulars of a specific game (where some players might be more important than others even if they scored less) and are able to override global defaults. Template sentences are fluent on their own, but since it is not possible to perform aggregation (Reiter, 1995), the whole summary appears stilted; it lacks coherence and variability, contributing to low BLEU scores. The template baseline is worse for MLB than ROTOWIRE, which reflects the greater difficulty of manually creating a good template for MLB. Overall, we observe that neural models are more fluent and coherent, being able to learn a better ordering of facts which is in turn reflected in better CO scores.

Despite promising results, there is ample room to improve macro planning, especially in terms of the precision of RG (see Table 2, P% column of RG). We should not underestimate that Macro must handle relatively long inputs (the average input length in the MLB development set is ∼3100 tokens) which are challenging for the attention mechanism. Consider the following output of our model on the MLB dataset: Ramirez's two-run double off Joe Blanton tied it in the sixth, and Brandon Moss added a two-out RBI single off Alan Embree to give Boston a 3-2 lead. Here, the name of the pitcher should have been Joe Blanton instead of Alan Embree. In fact, Alan Embree is the pitcher for the following play in the half inning. In this case, attention diffuses over the relatively long MLB macro plan, leading to inaccurate content selection. We could alleviate this problem by adopting a noisy channel decomposition (Yee et al., 2019; Yu et al., 2020), that is, by learning two different distributions: a conditional model that provides the probability of translating a paragraph plan to text and a language model that provides an unconditional estimate of the output (i.e., the whole game summary). However, we leave this to future work.

For ROTOWIRE, the main source of errors is the model's inability to understand numbers. For example, Macro generates the following output: The Lakers were the superior shooters in this game, going 48 percent from the field and 30 percent from the three-point line, while the Jazz went 47 percent from the floor and 30 percent from beyond the arc. Here, 30 percent should have been 24 percent for the Lakers, but the language model expects a higher score for the three-point line, and since 24 is low (especially compared to the 30 scored by the Jazz), it simply copies the 30 scored by the Jazz instead. A mechanism for learning better representations for numbers (Wallace et al., 2019) or executing operations such as argmax or minus (Nie et al., 2018) should help alleviate this problem.

Finally, although our focus so far has been on learning document plans from data, the decoupling of planning from generation allows us to flexibly generate output according to specification. For example, we could feed the model with manually constructed macro plans, consequently controlling the information content and structure of the output summary (e.g., for generating short or long texts, or focusing on specific aspects of the game).

Acknowledgments

We thank the Action Editor, Claire Gardent, and the three anonymous reviewers for their constructive feedback. We also thank Laura Perez-Beltrachini for her comments on an earlier draft of this paper, and Parag Jain, Hao Zheng, Stefanos Angelidis and Yang Liu for helpful discussions. We gratefully acknowledge the financial support of the European Research Council (award number 681760, ''Translating Multiple Modalities into Text'').

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.

Regina Barzilay and Mirella Lapata. 2005. Collective content selection for concept-to-text generation. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 331–338, Vancouver, British Columbia, Canada. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1220575.1220617

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.

Thiago Castro Ferreira, Chris van der Lee, Emiel van Miltenburg, and Emiel Krahmer. 2019. Neural data-to-text generation: A comparison between pipeline and end-to-end architectures. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 552–562, Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1052

Wallace L. Chafe. 1979. The flow of thought and the flow of language. In Talmy Givón, editor, Syntax and Semantics, volume 12, pages 159–181. Academic Press Inc. DOI: https://doi.org/10.1163/9789004368897_008

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 675–686, Melbourne, Australia. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P18-1063

Robert Dale. 1989. Generating referring expressions in a domain of objects and processes.

Pablo Duboue and Kathleen McKeown. 2002. Content planner construction via evolutionary algorithms and a corpus-based fitness function. In Proceedings of the International Natural Language Generation Conference, pages 89–96, Harriman, New York, USA. Association for Computational Linguistics.

Pablo A. Duboue and Kathleen R. McKeown. 2001. Empirically estimating order constraints for content planning in generation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 172–179, Toulouse, France. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1073012.1073035

John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54, Melbourne, Australia. Association for Computational Linguistics.

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-1017

Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. J. Artif. Intell. Res., 61:65–170. DOI: https://doi.org/10.1613/jair.5477

Heng Gong, Xiaocheng Feng, Bing Qin, and Ting Liu. 2019. Table-to-text generation with effective hierarchical encoder on three dimensions (row, column and time). In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3143–3152, Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1310

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140–149, Berlin, Germany. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P16-1014

M. A. K. Halliday and Ruqaiya Hasan. 1976. Cohesion in English. Longman, London.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780. DOI: https://doi.org/10.1162/neco.1997.9.8.1735

Eduard H. Hovy. 1993. Automated discourse generation using discourse structure relations. Artificial Intelligence, 63(1-2):341–385. DOI: https://doi.org/10.1016/0004-3702(93)90021-3

Hayate Iso, Yui Uehara, Tatsuya Ishigaki, Hiroshi Noji, Eiji Aramaki, Ichiro Kobayashi, Yusuke Miyao, Naoaki Okazaki, and Hiroya Takamura. 2019. Learning to select, track, and generate for data-to-text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2102–2113, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1202

Min-Yen Kan and Kathleen R. McKeown. 2002. Corpus-trained text generation for summarization. In Proceedings of the International Natural Language Generation Conference, pages 1–8, Harriman, New York, USA. Association for Computational Linguistics.

Colin Kelly, Ann Copestake, and Nikiforos Karamanis. 2009. Investigating content selection for language generation using machine learning. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009), pages 130–137, Athens, Greece. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1610195.1610218

Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1328–1338, Austin, Texas. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D16-1140

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-4012

Ioannis Konstas and Mirella Lapata. 2013. Inducing document plans for concept-to-text generation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1503–1514, Seattle, Washington, USA. Association for Computational Linguistics.

Karen Kukich. 1983. Design of a knowledge-
In 21st Annual
based report generator.
the Association for Computa-
Meeting of
tional Linguistics. DOI: https://doi.org
/10.3115/981311.981340

Anirban Laha, Parag Jain, Abhijit Mishra, and Karthik Sankaranarayanan. 2020. Scalable micro-planned generation of discourse from structured data. Computational Linguistics, 45(4):737–763. DOI: https://doi.org/10.1162/coli_a_00363

Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1203–1213, Austin, Texas. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D16-1128

R. E. Longacre. 1979. The paragraph as a grammatical unit. In Talmy Givón, editor, Syntax and Semantics, volume 12, Academic Press Inc., pages 115–133.

Jordan J. Louviere, Terry N. Flynn, and A. A. J. Marley. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9781107337855

Jordan J. Louviere and George G. Woodworth. 1991. Best-worst scaling: A model for the largest difference judgments. University of Alberta: Working Paper.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D15-1166

Kathleen R. McKeown. 1992. Text Generation. Studies in Natural Language Processing. Cambridge University Press.

Kathleen R. McKeown, Desmond A. Jordan, Shimei Pan, James Shaw, and Barry A. Allen. 1997. Language generation for multimedia healthcare briefings. In Fifth Conference on Applied Natural Language Processing, pages 277–282, Washington, DC, USA. Association for Computational Linguistics. DOI: https://doi.org/10.3115/974557.974598

Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 720–730, San Diego, California. Association for Computational Linguistics.

Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-step: Separating planning from realization in neural data-to-text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2267–2277, Minneapolis, Minnesota. Association for Computational Linguistics.

Feng Nie, Jinpeng Wang, Jin-Ge Yao, Rong Pan, and Chin-Yew Lin. 2018. Operation-guided neural networks for high fidelity data-to-text generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3879–3889, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1422

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, Saarbrücken, Germany. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W17-5525

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.

Laura Perez-Beltrachini and Mirella Lapata. 2018. Bootstrapping generators from noisy data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1516–1527, New Orleans, Louisiana. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N18-1137

Ratish Puduppully, Li Dong, and Mirella Lapata. 2019a. Data-to-text generation with content selection and planning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, Hawaii. DOI: https://doi.org/10.1609/aaai.v33i01.33016908

Ratish Puduppully, Li Dong, and Mirella Lapata. 2019b. Data-to-text generation with entity modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2023–2035, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1195

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Bryan Orme. 2009. Maxdiff analysis: Simple counting, individual-level logit, and HB. Sawtooth Software.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1073083.1073135

Clément Rebuffel, Laure Soulier, Geoffrey Scoutheeten, and Patrick Gallinari. 2020. A hierarchical model for data-to-text generation. In European Conference on Information Retrieval, pages 65–80. Springer. DOI: https://doi.org/10.1007/978-3-030-45439-5_5, PMCID: PMC7148215

Ehud Reiter. 1995. NLG vs. templates. CoRR, cmp-lg/9504013v1.

Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Natural Language Engineering, 3(1):57–87. DOI: https://doi.org/10.1017/S1351324997001502


Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Studies in Natural Language Processing. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9780511519857

Fahimeh Saleh, Alexandre Berard, Ioan Calapodescu, and Laurent Besacier. 2019. Naver Labs Europe's systems for the document-level generation and translation task at WNGT 2019. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 273–279, Hong Kong. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-5631

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P16-1162

Zhihong Shao, Minlie Huang, Jiangtao Wen, Wenfei Xu, and Xiaoyan Zhu. 2019. Long and diverse text generation with planning-based hierarchical variational model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3257–3268, Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1321

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, volume 27, pages 3104–3112. Curran Associates, Inc.

Shunsuke Takeno, Masaaki Nagata, and Kazuhide Yamamoto. 2017. Controlling target features in neural machine translation via prefix constraints. In Proceedings of the 4th Workshop on Asian Translation (WAT 2017), pages 55–63, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Ran Tian, Shashi Narayan, Thibault Sellam, and Ankur P. Parikh. 2019. Sticking to the facts: Confident decoding for faithful data-to-text generation. CoRR, abs/1910.08684v2.

Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368, Tokyo, Japan. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W19-8643

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.

Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do NLP models know numbers? Probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5307–5315, Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1534

Ronald J. Williams and Jing Peng. 1990. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4):490–501. DOI: https://doi.org/10.1162/neco.1990.2.4.490

Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263, Copenhagen, Denmark. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D17-1239

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144v2.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N16-1174

Kyra Yee, Yann Dauphin, and Michael Auli. 2019. Simple and effective noisy channel modeling for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5696–5701, Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1571

Lei Yu, Laurent Sartran, Wojciech Stokowiec, Wang Ling, Lingpeng Kong, Phil Blunsom, and Chris Dyer. 2020. Better document-level machine translation with Bayes' rule. Transactions of the Association for Computational Linguistics, 8:346–360. DOI: https://doi.org/10.1162/tacl_a_00319

Wlodek Zadrozny and Karen Jensen. 1991. Semantics of paragraphs. Computational Linguistics, 17(2):171–210.
