Data-to-text Generation with Macro Planning
Ratish Puduppully and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
r.puduppully@sms.ed.ac.uk
mlap@inf.ed.ac.uk
Abstract
Recent approaches to data-to-text generation have adopted the very successful encoder-decoder architecture or variants thereof. These models generate text that is fluent (but often imprecise) and perform quite poorly at selecting appropriate content and ordering it coherently. To overcome some of these issues, we propose a neural model with a macro planning stage followed by a generation stage, reminiscent of traditional methods which embrace separate modules for planning and surface realization. Macro plans represent the high-level organization of important content such as entities, events, and their interactions; they are learned from data and given as input to the generator. Extensive experiments on two data-to-text benchmarks (ROTOWIRE and MLB) show that our approach outperforms competitive baselines in terms of automatic and human evaluation.
1 Introduction
Data-to-text generation refers to the task of generating textual output from non-linguistic input (Reiter and Dale, 1997, 2000; Gatt and Krahmer, 2018) such as databases of records, simulations of physical systems, accounting spreadsheets, or expert system knowledge bases. As an example, Figure 1 shows various statistics describing a major league baseball (MLB) game, including extracts from the box score (i.e., the performance of the two teams and individual team members who played as batters, pitchers or fielders; Table (A)), play-by-play (i.e., the detailed sequence of each play of the game as it occurred; Table (B)), and a human written game summary (Table (C)).
Traditional methods for data-to-text generation (Kukich, 1983; McKeown, 1992; Reiter and Dale, 1997) follow a pipeline architecture, adopting separate stages for text planning (determining which content to talk about and how it might be organized in discourse), sentence planning (aggregating content into sentences, deciding specific words to describe concepts and relations, and generating referring expressions), and linguistic realization (applying the rules of syntax, morphology, and orthographic processing to generate surface forms). Recent neural network–based approaches (Lebret et al., 2016; Mei et al., 2016; Wiseman et al., 2017) make use of the encoder-decoder architecture (Sutskever et al., 2014), are trained end-to-end, and have no special-purpose modules for how to best generate a text, apart from generic mechanisms such as attention and copy (Bahdanau et al., 2015; Gu et al., 2016). The popularity of end-to-end models has been further boosted by the release of new datasets with thousands of input-document training pairs. The example shown in Figure 1 is taken from the MLB dataset (Puduppully et al., 2019b), which contains baseball game statistics and human written summaries (∼25K instances). ROTOWIRE (Wiseman et al., 2017) is another widely used benchmark, which contains NBA basketball game statistics and their descriptions (∼5K instances).

Wiseman et al. (2017) show that despite being able to generate fluent text, neural data-to-text generation models are often imprecise, prone to hallucination (i.e., generate text that is not supported by the input), and poor at content selection and document structuring. Attempts to remedy some of these issues focus on changing the way entities are represented (Puduppully et al., 2019b; Iso et al., 2019), allowing the decoder to skip low-confidence tokens to enhance faithful generation (Tian et al., 2019), and making the encoder-decoder architecture more modular by introducing micro planning (Puduppully et al., 2019a; Moryossef et al., 2019).
Figure 1: MLB statistics tables and game summary. Tables summarize the performance of teams and individual team members who played as batters and pitchers as well as the most important actions (and their actors) in each play (Tables (A) and (B)). The macro plan for the game summary is shown at the bottom (Table (E)); <P> indicates paragraph delimiters. There is a plan for every paragraph in the game summary (correspondence shown in the same color); event verbalizations are tied to the top or bottom side of an inning (see Section 3.1). The set of candidate paragraph plans is shown above the macro plan (Table (D)) and grouped into two types: plans describing a single entity/event or their combinations. Best viewed in color.
Micro planning operates at the record level (see Table (A) in Figure 1; e.g., C.Mullins BH 2, J.Villar TEAM Orioles): it determines which facts should be mentioned within a textual unit (e.g., a sentence) and how these should be structured (e.g., the sequence of records). An explicit content planner essentially makes the job of the neural network less onerous, allowing it to concentrate on producing fluent natural language output without expending too much effort on content organization.
In this work, we focus on macro planning, the high-level organization of information and how it should be presented, which we argue is important for the generation of long, multi-paragraph documents (see text (C) in Figure 1). Problematically, modern datasets like MLB (Puduppully et al., 2019b; see also Figure 1) and ROTOWIRE
(Wiseman et al., 2017) do not naturally lend
themselves to document planning as there is no
explicit link between the summary and the content
of the game (which is encoded in tabular form).
In other words, the underlying plans are latent,
and it is not clear how they might be best repre-
sented, namely, as sequences of records from a
table, or simply words. Nevertheless, game sum-
maries through their segmentation into paragraphs
(and lexical overlap with the input) give clues
as to how content might be organized. Paragraphs
are a central element of discourse (Chafe, 1979;
Longacre, 1979; Halliday and Hasan, 1976),
the smallest domain where coherence and topic
are defined and anaphora resolution is possible
(Zadrozny and Jensen, 1991). We therefore oper-
ationalize the macro plan for a game summary as
a sequence of paragraph plans.
Although resorting to paragraphs describes the
summary plan at a coarse level, we still need to
specify individual paragraph plans. In the sports
domain, paragraphs typically mention entities
(e.g., players important in the game), key events (e.g., scoring a run), and their interaction. Most of this information is encapsulated in the statistics accompanying game summaries (see Tables (A) and (B) in Figure 1). We thus define paragraph plans such that they contain verbalizations of entity and event records (see plan (E) in Figure 1). Given a set of paragraph plans and their corresponding game summary (see Table (D) and summary (C) in Figure 1), our task is twofold.
At training time, we must learn how content was
selected in order to give rise to specific game
summaries (e.g., how input (D) led to plan (E)
for summary (C) in Figure 1), while at test time,
given input for a new game, we first predict a
macro plan for the summary and then generate the
corresponding document.
We present a two-stage approach where macro
plans are induced from training data (by taking the
table and corresponding summaries into account)
and then fed to the text generation stage. Aside
from making data-to-text generation more inter-
pretable, the task of generating a document from
a macro plan (rather than a table) affords greater
control over the output text and plays to the advan-
tage of encoder-decoder architectures which excel
at modeling sequences. We evaluate model per-
formance on the ROTOWIRE (Wiseman et al., 2017)
and MLB (Puduppully et al., 2019B) benchmarks.
Experimental results show that our plan-and-
generate approach produces output that is more
factual, coherent, and fluent compared with exist-
ing state-of-the-art models. Our code, trained models, and dataset with macro plans can be found at https://github.com/ratishsp/data2text-macro-plan-py.
2 Related Work
Content planning has been traditionally consid-
ered a fundamental component in natural lan-
guage generation. Not only does it determine
which information-bearing units to talk about,
but also arranges them into a structure that
creates coherent output. Many content plan-
ners have been based on theories of discourse
coherence (Hovy, 1993), schemas (McKeown
et al., 1997), or have relied on generic plan-
ners (Dale, 1989). Plans are mostly based on
hand-crafted rules after analyzing the target text,
although a few approaches have recognized the
need for learning-based methods. For example,
Duboue and McKeown (2001) learn ordering
constraints in a content plan, Konstas and Lapata
(2013) represent plans as grammar rules whose
probabilities are estimated empirically, while oth-
ers make use of semantically annotated corpora
to bootstrap content planners
(Duboue and
McKeown, 2002; Kan and McKeown, 2002).
More recently, various attempts have been made
to improve neural generation models (Wiseman
et al., 2017) based on the encoder-decoder archi-
tecture (Bahdanau et al., 2015) by adding various planning modules. Puduppully et al. (2019a) pro-
pose a model for data-to-text that first learns a
plan from the records in the input table and then
generates a summary conditioned on this plan.
Shao et al. (2019) introduce a Planning-based
Hierarchical Variational Model where a plan is
a sequence of groups, each of which contains a
subset of input items to be covered in a sentence.
The content of each sentence is verbalized, con-
ditioned on the plan and previously generated
context. In their case, input items are a rela-
tively small list of attributes (∼28) and the output
document is also short (∼110 words).
There have also been attempts to incorporate
neural modules in a pipeline architecture for
data-to-text generation. Moryossef et al. (2019)
develop a model with a symbolic text planning
stage followed by a neural realization stage. They
experiment with the WebNLG dataset (Gardent
et al., 2017) which consists of RDF ⟨Subject, Object, Predicate⟩ triples paired with correspond-
ing text. Their document plan is a sequence of
sentence plans that in turn determine the division
of facts into sentences and their order. Along
similar lines, Castro Ferreira et al. (2019) pro-
pose an architecture composed of multiple steps
including discourse ordering, text structuring, lex-
icalization, referring expression generation, E
surface realization. Both approaches show the
effectiveness of pipeline architectures; however,
their task does not require content selection and
the output texts are relatively short (24 tokens on
average).
Although it is generally assumed that task-
specific parallel data is available for model
training, Laha et al. (2020) do away with this
assumption and present a three-stage pipeline
model which learns from monolingual corpora.
They first convert the input to a form of tuples,
which in turn are expressed in simple sentences,
followed by the third stage of merging simple
sentences to form more complex ones by aggre-
gation and referring expression generation. They
also evaluate on data-to-text tasks which have
relatively short outputs. There have also been
efforts to improve the coherence of the output,
especially when dealing with longer documents.
Puduppully et al. (2019B) make use of hierar-
chical attention over entity representations which
are updated dynamically, while Iso et al. (2019)
explicitly keep track of salient entities and mem-
orize which ones have been mentioned.
Our work also attempts to alleviate deficien-
cies in neural data-to-text generation models.
In contrast to previous approaches (Puduppully et al., 2019a; Moryossef et al., 2019; Laha et al., 2020), we place emphasis on macro planning and
create plans representing high-level organization
of a document including both its content and
structure. We share with previous work (per esempio.,
Moryossef et al. 2019) the use of a two-stage
architecture. We show that macro planning can
be successfully applied to long document data-to-
text generation resulting in improved factuality,
coherence, and fluency without any postprocessing (e.g., to smooth referring expressions) or recourse to additional tools (e.g., parsing or information extraction).
3 Problem Formulation
We hypothesize that generation based on plans
should fare better compared to generating from a
set of records, since macro plans offer a bird’s-eye
view, a high-level organization of the document
content and structure. We also believe that macro
planning will work well for long-form text generation, that is, for datasets that have multi-paragraph
target texts, a large vocabulary space, and require
content selection.
We assume the input to our model is a set of paragraph plans E = {e_i}, i = 1, . . . , |E|, where each e_i is a paragraph plan. We model the process of generating output summary y given E as a two-step process, namely, the construction of a macro plan x based on the set of paragraph plans, followed by the generation of a summary given a macro plan as input. We now explain how E is obtained and how each step is realized. We discuss our model considering mainly an example from the MLB dataset (Puduppully et al., 2019b) but also touch on how the approach can be straightforwardly adapted to ROTOWIRE (Wiseman et al., 2017).
3.1 Macro Plan Definition
A macro plan consists of a sequence of paragraph plans separated by a paragraph discourse marker <P>, that is, x = e_i <P> e_s . . . <P> e_k, where e_i, e_s, e_k ∈ E. A paragraph plan in turn is a sequence of entities and events describing the game. By entities we mean individual players or teams and the information provided about them in box score statistics (see rows and column headings in Figure 1, Table (A)), while events refer to information described in play-by-play (see Table (B)). In baseball, plays are grouped in half-innings. During each half of an inning, a team takes its turn to bat (the visiting team bats in the top half and the home team in the bottom half). An example macro plan is shown at the bottom of Figure 1.

Within a paragraph plan, entities and events are verbalized into a text sequence along the lines of Saleh et al. (2019). We make use of special tokens for the type of each record from the table, and we retain the same position for each record type and value. For example, batter C.Mullins from Figure 1 would be verbalized as a sequence of record-type tokens followed by their values (his name, his 2 hits, and so on). For the sake of brevity we use the shorthand V(C.Mullins) for such a verbalization.
Paragraph Plan for Entities For a paragraph containing entities, the corresponding plan will be a verbalization of the entities in sequence. For paragraphs with multiple mentions of the same entity, the plan will verbalize an entity only once and at its first position of mention. Paragraph ''Keller gave up a home run . . . the teams with the worst records in the majors'' from the summary in Figure 1 describes four entities including B. Keller, C. Mullins, Royals, and Orioles. The respective plan is the verbalization of the four entities in sequence: V(B.Keller) V(C.Mullins) V(Royals) V(Orioles), where V(·) stands for verbalization and, in the case of Royals and Orioles, is shorthand for the team.
Paragraph Plan for Events A paragraph may also describe one or more events. For example, the paragraph ''With the score tied 1–1 in the fourth . . . 423-foot home run to left field to make it 3-1'' discusses what happened in the bottom halves of the fourth and fifth innings. We verbalize an event by first describing the participating entities followed by the plays in the event. Entities are described in the order in which they appear in a play, and within the same play we list the batter followed by the pitcher, fielder, scorer, and basemen. The paragraph plan corresponding to the bottom halves of the fourth and fifth innings is thus a sequence of entity verbalizations, which correspond in turn to W. Merrifield, A. Cashner, B. Goodwin, and H. Dozier, followed by an event verbalization which refers to the first play in the bottom half of the fifth inning (see the play-by-play table in Figure 1) and abbreviates the detailed plan of that play: the batter H.Dozier, the Home-run play, and so forth.
The procedure described above is not specific to MLB and can be ported to other datasets with
similar characteristics such as ROTOWIRE. How-
ever, ROTOWIRE does not provide play-by-play
informazione, and as a result there is no event
verbalization for this dataset.
3.2 Macro Plan Construction

We provided our definition for macro plans in the previous sections; however, it is important to note that such macro plans are not readily available in data-to-text benchmarks like MLB (Puduppully et al., 2019b) and ROTOWIRE (Wiseman et al., 2017), which consist of tables of records r paired with a gold summary y (see Tables (A)–(C) in Figure 1). We now describe our method for obtaining macro plans x from r and y.
Similar to Moryossef et al. (2019), we define macro plans to be conformant with gold summaries such that (1) they have the same splits into paragraphs—entities and events within a paragraph in y are grouped into a paragraph plan in x; and (2) the order of events and entities in a paragraph and its corresponding plan are identical. We construct macro plans by matching entities and events in the summary to records in the tables. Moreover, paragraph delimiters within summaries form natural units which taken together give rise to a high-level document plan.
We match entities in summaries with entities
in tables using exact string match, allowing for
some degree of variation in the expression of
team names (per esempio., A’s for Athletics and D-backs
for Diamondbacks). Information pertaining to
innings appears in the summaries in the form of
ordinal numbers (per esempio., first, ninth) modifying the
noun inning and can be relatively easily identi-
fied via pattern matching (per esempio., in sentences like
‘‘Dozier led off the fifth inning’’). Tuttavia, there
are instances where the mention of innings is more
ambiguous (per esempio., ‘‘With the scored tied 1–1 in the
fourth, Andrew Cashner (4–13) gave up a sacri-
fice fly’’). We could disambiguate such mentions
manually and then train a classifier to learn to
predict whether an inning is mentioned. Invece,
we explore a novel annotation-free method that
makes use of the pretrained language model GPT2
(Radford et al., 2019). Specifically, we feed the
context preceding the ordinal number to GPT2
(cioè.,
the current paragraph up to the ordinal
number and the paragraph preceding it) and if
inning appears in the top 10 next word predictions,
we consider it a positive match. On a held-out
dataset, this method achieves 98% precision and
98% recall at disambiguating inning mentions.
To resolve whether the summary discusses the top or bottom side of an inning, we compare the
entities in the paragraph with the entities in each
half-inning (play-by-play Table (B) in Figure 1)
and choose the side with the greater number of
entity matches. For instance, Andrew Cashner,
Merrifield and fourth inning uniquely resolves to
the bottom half of the fourth inning.
3.3 Paragraph Plan Construction

Figure 1 shows the macro plan we obtain for game summary (C). Importantly, macro plan (E) is the outcome of a content selection process after considering several candidate paragraph plans as
input. So, what are the candidate paragraph plans
that give rise to macro plan (E)? To answer this
question, we examined the empirical distribution
of paragraph plans in MLB and ROTOWIRE (train-
ing portion). È interessante notare, we found that ∼79% of
the paragraph plans in MLB refer to a single event
or a single player (and team(S)). In ROTOWIRE,
∼92% of paragraphs are about a singleton player
(and team(S)) or a pair of players.
Based on this analysis, we assume that paragraph plans can be either one (verbalized)
entity/event or a combination of at most two.
Under this assumption, we explicitly enumerate
the set of candidate paragraph plans in a game.
For the game in Figure 1, candidate paragraph
plans are shown in Table (D). The first table
groups plans based on individual verbalizations
describing the team(s), players, and events taking place in specific innings. The second table groups pairwise combinations thereof. In MLB, such combinations are between team(s) and players. In
ROTOWIRE, we also create combinations between
players. Such paragraph plans form set E based
on which macro plan x is constructed to give rise
to game summary y.
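The enumeration itself is straightforward; the sketch below illustrates it under the assumption that team, player, and event verbalizations are already available as strings (the names and the V(·) notation are illustrative):

```python
# Sketch of candidate paragraph-plan enumeration: singletons plus pairwise
# combinations (here only team-player pairs, as described for MLB).
from itertools import product

def candidate_plans(team_verbs, player_verbs, event_verbs):
    singletons = [[v] for v in team_verbs + player_verbs + event_verbs]
    pairs = [[t, p] for t, p in product(team_verbs, player_verbs)]
    return singletons + pairs

plans = candidate_plans(["V(Royals)", "V(Orioles)"],
                        ["V(B.Keller)", "V(C.Mullins)"],
                        ["V(fifth-inning-bottom)"])   # hypothetical event label
print(len(plans))   # 5 singletons + 4 team-player pairs = 9
```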
4 Model Description

The input to our model is a set of paragraph plans, each of which is a sequence of tokens. We first compute paragraph plan representations e_i ∈ R^n, and then apply a contextualization and content planning mechanism similar to planning modules introduced in earlier work (Puduppully et al., 2019a; Chen and Bansal, 2018). Predicted macro plans serve as input to our text generation model, which adopts an encoder-decoder architecture (Bahdanau et al., 2015; Luong et al., 2015).
4.1 Macro Planning

Paragraph Plan Representation We encode the tokens in a verbalized paragraph plan e_i as {e_{i,j}}, j = 1, . . . , |e_i|, with a BiLSTM (Figure 2, bottom part). To reflect the fact that some records will be more important than others, we compute an attention-weighted sum of {e_{i,j}} following Yang et al. (2016). Let d ∈ R^n denote a randomly initialized query vector, learnt jointly with the rest of the parameters. We compute attention values α_{i,j} over d and the paragraph plan token representations e_{i,j}:

    α_{i,j} ∝ exp(d⊤ e_{i,j})    (1)
Paragraph plan vector e_i is the attention-weighted sum of the e_{i,j} (with Σ_j α_{i,j} = 1):

    e_i = Σ_j α_{i,j} e_{i,j}    (2)

Next, we contextualize each paragraph plan representation vis-à-vis other paragraph plans (Figure 2, top left part). First, we compute attention scores β_{i,k} over paragraph plan representations to obtain an attentional vector e_i^att for each e_i:

    β_{i,k} ∝ exp(e_i⊤ W_a e_k)
    e_i^att = Σ_{k≠i} β_{i,k} e_k    (3)

where W_a ∈ R^{n×n}, W_g ∈ R^{n×2n} are parameter matrices and Σ_{k≠i} β_{i,k} = 1. Then, we compute a content selection gate g_i, and apply this gate to e_i to obtain a new paragraph plan representation e_i^c:

    g_i = sigmoid(W_g [e_i ; e_i^att])
    e_i^c = g_i ⊙ e_i    (4)

where ⊙ denotes element-wise multiplication. Thus, each element in e_i is weighted by the corresponding element of g_i ∈ [0, 1]^n to obtain a contextualized paragraph plan representation e_i^c.

Figure 2: Paragraph plan representation and contextualization for macro planning. Computation of e_3 is detailed in Equations (1) and (2), e_3^att in Equation (3), and e_3^c in Equation (4).
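A compact sketch of Equations (1)–(4) in PyTorch is given below (dimensions, initialization, and variable names are illustrative; in the actual model these parameters are learnt jointly with the BiLSTM encoder):

```python
# Sketch of paragraph-plan pooling and contextualization (Equations (1)-(4)).
import torch
import torch.nn.functional as F

n, num_plans, max_len = 64, 5, 12
E_tok = torch.randn(num_plans, max_len, n)      # token encodings e_{i,j} (BiLSTM outputs)
d = torch.randn(n)                               # learned query vector
W_a = torch.randn(n, n)
W_g = torch.randn(n, 2 * n)

# Eq. (1)-(2): attention-weighted sum of token encodings per paragraph plan
alpha = F.softmax(E_tok @ d, dim=1)              # (num_plans, max_len)
E = (alpha.unsqueeze(-1) * E_tok).sum(dim=1)     # paragraph plan vectors e_i

# Eq. (3): attention over the other paragraph plans
scores = E @ W_a @ E.T                           # (num_plans, num_plans)
scores.fill_diagonal_(float("-inf"))             # exclude k = i
beta = F.softmax(scores, dim=1)
E_att = beta @ E                                 # attentional vectors e_i^att

# Eq. (4): content selection gate and contextualized representation
g = torch.sigmoid(torch.cat([E, E_att], dim=1) @ W_g.T)
E_c = g * E                                      # e_i^c = g_i (elementwise) e_i
print(E_c.shape)                                 # torch.Size([5, 64])
```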
Content Planning Our model learns to predict macro plans, after having been trained on pairs of sets of paragraph plans and corresponding macro plans.

4.2 Text Generation

The generation stage takes the predicted macro plan x as input, that is, the sequence of selected paragraph plans with paragraph separators in between. The conditional output probability p(y|x) is modeled as:

    p(y|x) = Π_{t=1}^{|y|} p(y_t | y_{<t}, x)

During inference, we employ beam search to find the most likely macro plan ẑ among candidate macro plans z′ given paragraph plans as input:

    ẑ = arg max_{z′} p(z′ | E; θ)

We deterministically obtain x̂ from ẑ, and then predict the output summary ŷ among candidate outputs y′ given macro plan x̂ as input:

    ŷ = arg max_{y′} p(y′ | x̂; φ)

                       ROTOWIRE    MLB
Vocab Size                11.3K    38.9K
# Tokens                   1.5M    14.3M
# Instances                4.9K    26.3K
# Record Types               39       53
Avg Records                 628      565
Avg Paragraph Plans        10.7     15.1
Avg Length                337.1   542.05

Table 1: Dataset statistics for ROTOWIRE and MLB. Vocabulary size, number of tokens, number of instances (i.e., table-summary pairs), number of record types, average number of records, average number of paragraph plans, and average summary length.
5 Experimental Setup

Data We performed experiments on the ROTOWIRE (Wiseman et al., 2017) and MLB (Puduppully et al., 2019b) benchmarks. The details of these two datasets are given in Table 1. We can see that MLB is around 5 times bigger, has a richer vocabulary, and has longer game summaries. We use the official splits of 3,398/727/728 for ROTOWIRE and 22,821/1,739/1,744 for MLB. We make use of a tokenization script1 to detokenize and retokenize the summaries in both ROTOWIRE and MLB.

We reconstructed the MLB dataset, as the version released by Puduppully et al. (2019b) had removed all paragraph delimiters from game summaries. Specifically, we followed their methodology and downloaded the same summaries from the ESPN Web site2 and added the <P> token as the delimiter to paragraphs in the summaries.3

1 https://github.com/neulab/DGT.
2 http://www.espn.com/mlb/recap?gameId={gameid}.
3 Although our model is trained on game summaries with paragraph delimiters, and also predicts these at generation time, for evaluation we strip them from model output.
ROTOWIRE does not have paragraph delimiters in game summaries either. We reverse engineered
these as follows: (1) we split summaries into sen-
tences using the NLTK (Bird et al., 2009) sentence
tokenizer; (2) initialized each paragraph with a
separate sentence; (3) merged two paragraphs into
one if the entities in the former were a superset of
entities in the latter; (4) repeated Step 3 until no
merges were possible.
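A sketch of this reverse-engineering heuristic is shown below, assuming an entity extractor is available (here passed in as a stub) and that NLTK's sentence tokenizer is installed:

```python
# Sketch of the ROTOWIRE paragraph reverse-engineering heuristic.
import nltk  # assumes the 'punkt' sentence tokenizer data is installed

def rebuild_paragraphs(summary, entities_in):
    """entities_in(sentence) -> set of entity mentions (caller-provided stub)."""
    # Steps 1-2: start with one paragraph per sentence
    paragraphs = [[s] for s in nltk.sent_tokenize(summary)]
    # Steps 3-4: merge a paragraph into the previous one if the previous
    # paragraph's entities are a superset of the current one's; repeat to a fixpoint.
    merged = True
    while merged:
        merged = False
        for i in range(len(paragraphs) - 1):
            prev_ents = set().union(*(entities_in(s) for s in paragraphs[i]))
            next_ents = set().union(*(entities_in(s) for s in paragraphs[i + 1]))
            if prev_ents >= next_ents:
                paragraphs[i] += paragraphs.pop(i + 1)
                merged = True
                break
    return paragraphs
```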
Training Configuration We tuned the model hyperparameters on the development set. For
training the macro planning and the text gener-
ation stages, we used the Adagrad (Duchi et al.,
2011) optimizer. Inoltre, the text generation
stage made use of truncated BPTT (Williams and
Peng, 1990) with truncation length 100. We learn
subword vocabulary (Sennrich et al., 2016) for
paragraph plans in the macro planning stage. We
used 2.5K merge operations for ROTOWIRE and 8K
merge operations for MLB. In text generation, we
learn a joint subword vocabulary for the macro
plan and game summaries. We used 6K merge
operations for ROTOWIRE and 16K merge oper-
ations for MLB. All models were implemented
on OpenNMT-py (Klein et al., 2017). We add to
set E the paragraph plans corresponding to the
output summary paragraphs, to ensure full cover-
age during training of the macro planner. During
inference for predicting macro plans, we employ
length normalization (Bahdanau et al., 2015) A
avoid penalizing longer outputs; specifically, we
divide the scores of beam search by the length of
the output. Inoltre, we adopt bigram blocking
(Paulus et al., 2018). For MLB, we further block
beams containing more than two repetitions of a
unigram. This helps improve the diversity of the
predicted macro plans.
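The two inference-time heuristics mentioned above amount to simple adjustments of the beam search scoring loop; a hedged sketch (not the OpenNMT-py implementation) is shown below:

```python
# Sketch of length-normalized beam scoring and bigram blocking.
def length_normalized_score(log_prob_sum, output_tokens):
    """Divide the cumulative log-probability by the output length."""
    return log_prob_sum / max(len(output_tokens), 1)

def violates_bigram_blocking(output_tokens, candidate_token):
    """Reject a candidate extension that would repeat an already-seen bigram."""
    if not output_tokens:
        return False
    new_bigram = (output_tokens[-1], candidate_token)
    seen = set(zip(output_tokens, output_tokens[1:]))
    return new_bigram in seen
```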
System Comparisons We compared our model against the following systems: (1) the Template-
based generators from Wiseman et al. (2017)
for ROTOWIRE and Puduppully et al. (2019B) for
MLB. Both systems apply the same principle: they
emit a sentence about the teams playing in the
game, followed by player-specific sentences, E
a closing sentence. MLB additionally contains a
description of play-by-play; (2) ED+CC, the best
performing system in Wiseman et al. (2017), È
a vanilla encoder-decoder model equipped with
an attention and copy mechanism; (3) NCP+CC,
the micro planning model of Puduppully et al.
(2019a), generates content plans from the table by making use of Pointer networks (Vinyals
et al., 2015) to point to records; content plans are
encoded with a BiLSTM and the game summary
is decoded using another LSTM with attention
and copy; (4) ENT, the entity-based model of
Puduppully et al. (2019b), creates dynamically
updated entity-specific representations; the text
is generated conditioned on the data input and
entity memory representations using hierarchical
attention at each time step.
6 Results

Automatic Evaluation For automatic evaluation, following earlier work (Wiseman et al., 2017; Puduppully et al., 2019a,b, inter alia) we report
BLEU (Papineni et al., 2002) with the gold sum-
mary as reference but also make use of the In-
formation Extraction (IE) metrics from Wiseman
et al. (2017), which are defined over the output
of an IE system; the latter extracts entity (players,
teams) and value (numbers) pairs in a summary,
and then predicts the type of relation. For instance,
given the pair Kansas City Royals, 9, it would
predict their relation as TR (cioè., Team Runs).
Training data for the IE system is obtained by
checking for matches between entity, value pairs
in the gold summary and entity, value, record type
triplets in the table.
Let ŷ be the gold summary and y the model output. Relation Generation (RG) measures the
precision and count of relations extracted from y
that also appear in records r. Content Selection
(CS) measures the precision and recall of relations
extracted from y that are also extracted from ˆy.
Content Ordering (CO) measures the normalized
Damerau-Levenshtein distance between the se-
quences of relations extracted from y and ˆy.
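For concreteness, the CO score can be computed as one minus the (restricted) Damerau-Levenshtein distance between the two relation sequences, normalized by the longer sequence length; the sketch below is our illustration, not the original IE evaluation code:

```python
# Illustrative computation of the content ordering (CO) score.
def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[len(a)][len(b)]

def co_score(pred_relations, gold_relations):
    dist = damerau_levenshtein(pred_relations, gold_relations)
    return 1.0 - dist / max(len(pred_relations), len(gold_relations), 1)

print(co_score([("Royals", 9, "TR")], [("Royals", 9, "TR"), ("Orioles", 2, "TR")]))  # 0.5
```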
We reused the IE model from Puduppully et al. (2019a) for ROTOWIRE but retrained it for MLB to improve its precision and recall. In addition, the implementation of Wiseman et al. (2017) computes RG, CS, and CO excluding duplicate relations. This artificially inflates the performance of models whose outputs contain repetition. We include duplicates in the computation of the IE metrics (and recreate them for all comparison systems).
ROTOWIRE        RG             CS                     CO      BLEU
                #      P%      P%     R%     F%      DLD%
Templ           54.3   99.9    27.1   57.7   36.9    13.1     8.46
WS-2017         34.1   75.1    20.3   36.3   26.1    12.4    14.19
ED+CC           35.9   82.6    19.8   33.8   24.9    12.0    14.99
NCP+CC          40.8   87.6    28.0   51.1   36.2    15.8    16.50
ENT             32.7   91.7    34.7   48.5   40.5    16.6    16.12
RBF-2020        44.9   89.5    23.9   47.0   31.7    14.3    17.16
Macro           42.1   97.6    34.1   57.8   42.9    17.7    15.46
−Plan(4)        36.2   81.3    22.1   38.6   28.1    12.1    14.00

MLB             RG             CS                     CO      BLEU
                #      P%      P%     R%     F%      DLD%
Templ           62.3   99.9    21.6   55.2   31.0    11.0     4.12
ED+CC           32.5   91.3    27.8   40.6   33.0    17.1     9.68
NCP+CC          19.6   81.3    44.5   44.1   44.3    21.9     9.68
ENT             23.8   81.1    40.9   49.5   44.8    20.7    11.50
Macro           30.8   94.4    40.8   54.9   46.8    21.8    12.62
−Plan(SP,4)     25.1   92.7    40.0   44.6   42.2    21.9    11.09

Table 2: Evaluation on ROTOWIRE and MLB test sets; relation generation (RG) count (#) and precision (P%), content selection (CS) precision (P%), recall (R%), and F-measure (F%), content ordering (CO) in normalized Damerau-Levenshtein distance (DLD%), and BLEU.

Table 2 (top) presents our results on the ROTOWIRE test set. In addition to Templ, NCP+CC, ENT, and ED+CC we include the best performing model of Wiseman et al. (2017) (WS-2017; note that ED+CC is an improved re-implementation of their model), and the model of Rebuffel et al. (2020) (RBF-2020), which represents the state of the art on ROTOWIRE. This model has a Transformer encoder (Vaswani et al., 2017) with a hierarchical attention mechanism over entities and records within entities. The models of Saleh et al. (2019), Iso et al. (2019), and Gong et al. (2019) make use of additional information not present in the input (e.g., previous/next games, summary writer) and are not directly comparable to the systems in Table 2. Results for the MLB test set are in the bottom portion of Table 2.
Templ has the highest RG precision and count on both datasets. This is not surprising: by design, Templ is always faithful to the input. However, notice that it achieves the lowest BLEU among comparison systems, indicating that it mostly regurgitates facts with low fluency. Macro achieves the highest RG precision among all neural models for ROTOWIRE and MLB. We obtain an absolute improvement of 5.9% over ENT for ROTOWIRE and 13.3% for MLB. Moreover, Macro achieves the highest CS F-measure for both datasets. On ROTOWIRE, Macro achieves the highest CO score, and the highest BLEU on MLB. On ROTOWIRE, in terms of BLEU, Macro is worse than comparison models (e.g., NCP+CC or ENT).
Inspection of the output showed that the opening
paragraph, which mostly describes how the two
teams fared, is generally shorter in Macro, leading to shorter summaries and thus lower BLEU.
There is high variance in the length of the opening
paragraph in the training data and Macro verbal-
izes the corresponding plan conservatively. Ideas
such as length normalization (Wu et al., 2016) O
length control (Kikuchi et al., 2016; Takeno et al.,
2017; Fan et al., 2018) could help alleviate this;
Tuttavia, we do not pursue them further for fair
comparison with the other models.
The Contribution of Macro Planning To study the effect of macro planning in more detail, we
further compared Macro against text generation
models (see Section 4.2) which are trained on
verbalizations of the tabular data (and gold sum-
maries) but do not make use of document plans or a
document planning mechanism. On ROTOWIRE, IL
model was trained on verbalizations of players and
teams, with the input arranged such that the ver-
balization of the home team was followed by the
visiting team, the home team players and the visit-
ing team players. Mention of players was limited
to the four best ones, following Saleh et al. (2019)
(see −Plan(4) in Table 2). For MLB, we addition-
ally include verbalizations of innings focusing on
scoring plays which are likely to be discussed in
game summaries (see −Plan(SP,4) in Table 2).
Note that by preprocessing the input in such a
way some simple form of content selection takes
place simply by removing extraneous information
which the model does not need to consider.
Across both datasets, −Plan variants appear competitive. On ROTOWIRE, −Plan(4) is better than
ED+CC in terms of content selection but worse
compared to ENT. On MLB, −Plan(SP,4) is again
superior to ED+CC in terms of content selection
but not ENT whose performance lags behind
when considering RG precision. Taken together,
these results confirm that verbalizing entities and
events into a text sequence is effective. At the
same time, we see that −Plan variants are worse
than Macro across most metrics which underlines
the importance of an explicit planning component.
Table 3 presents an intrinsic evaluation of the macro planning stage. Here, we compare the inferred macro plans with the gold macro plans, using the CS and CO metrics with regard to entities and events instead of relations.

Macro        CS-P   CS-R   CS-F   CO
ROTOWIRE     81.3   73.2   77.0   45.8
MLB          80.6   63.3   70.9   31.4

Table 3: Evaluation of the macro planning stage; content selection precision (CS-P), recall (CS-R), F-measure (CS-F), and content ordering (CO) between the inferred plans and gold plans in terms of entities and events for the ROTOWIRE (RW) and MLB test sets.
We see that our macro planning model (Macro) achieves high scores for CS
and CO for both ROTOWIRE and MLB. We further
used the CS and CO metrics to check how well the
generated summary follows the (predicted) plan.
We followed the steps in Section 3.2 and reverse
engineered macro plans from the model summa-
ries and compared these extracted plans with the
original macro plans with regard to entities and
events. We found that Macro creates summaries
that follow the plan closely: for ROTOWIRE, the
CS F-score and CO are greater than 98%; for
MLB, the CS F-score is greater than 94% and CO
is greater than 89%. We show an output summary
for Macro in Table 4, together with the predicted
document plan.
Human-Based Evaluation We also asked participants to assess model output in terms of relation
generation, grammaticality, coherence, and con-
ciseness (Wiseman et al., 2017; Puduppully
et al., 2019a; Puduppully et al., 2019b). For
ROTOWIRE, we compared Macro against RBF-
2020,4 ED+CC, Gold, and Templ. For MLB, we
compared Macro against ENT, ED+CC, Gold,
and Templ.
We conducted our study on the Amazon Mechanical Turk (AMT) crowdsourcing platform,
following best practices for human evaluation in
NLG (van der Lee et al., 2019). Specifically, to en-
sure consistent ratings, we required crowdworkers
to have an approval rating greater than 98% E
a minimum of 1,000 previously completed tasks.
Raters were restricted to English-speaking coun-
tries (i.e., US, UK, Canada, Ireland, Australia, or NZ). Participants were allowed to provide feedback on the task or field questions (our interface accepts free text).

4 We are grateful to Clément Rebuffel for providing us with the output of their system.
ST. PETERSBURG, Fla. (AP) – The Tampa Bay Rays are making the most of it.

Akinori Iwamura hit a two-run homer in the eighth inning and the Rays beat the Boston Red Sox 2-1 on Sunday to complete a three-game sweep.

The Rays, who have the best record in the majors, have won six of their last seven games. The Rays have won four of their last five series, including three in a row against the Red Sox, who have won six of their last seven overall.

Dioner Navarro singled with one out in the eighth off Clay Buchholz (1-2) and moved to third on Jason Bartlett's flyout to center. Iwamura then drove a 1-1 pitch into the left-field stands for his second homer of the season.

Scott Dohmann (2-0) got the win in relief, striking out Manny Ramirez with runners on first and third to end the eighth.

Troy Percival worked the ninth for his fifth save in five opportunities.

Clay Buchholz (1-2) gave up two runs and three hits in eight innings. He struck out nine and walked two.

The Red Sox loaded the bases with one out in the fifth on a single by Coco Crisp, a wild pitch and a walk to Jed Lowrie. Jacoby Ellsbury drove in Crisp with a two-out single to center. Jackson struck out four and walked three.

The Red Sox loaded the bases with one out in the fifth on a single by Coco Crisp, a walk to Jed Lowrie and a one-out walk to Jed Lowrie. Jackson struck out Julio Lugo, but Jacoby Ellsbury singled to center to put the Red Sox up 1-0.

The Red Sox threatened in the eighth when J. D. Drew drew a two-out walk against Trever Miller, but Ramirez struck out to end the inning.

Table 4: Predicted macro plan (top) with corresponding model output (bottom). Entities and events in the summary corresponding to those in the macro plan are boldfaced.
In our first study, we presented crowdworkers with sentences randomly selected from summaries along with their corresponding box score (and
play-by-play in case of MLB) and asked them to
count supported and contradicting facts (ignoring
hallucinations, i.e., unsupported facts). We did not require crowdworkers to be familiar with NBA or MLB. Instead, we provided a cheat sheet explaining the semantics of box score tables. In addition, we provided examples of sentences with supported/contradicting facts. We evaluated 40 summaries from the test set (20 per dataset), 4 sentences from each summary, and elicited 3 responses per summary. This resulted in 40 summaries ×
Altogether, 131 crowdworkers participated in this
study (agreement using Krippendorff’s α was 0.44
for supported and 0.42 for contradicting facts).
As shown in Table 5, Macro yields the smallest number of contradicting facts among neural models on both datasets. On ROTOWIRE the number of contradicting facts for Macro is comparable to Gold and Templ (the difference is not statistically significant) and significantly smaller compared to RBF-2020 and ED+CC.
Table 5: Average number of supported (#Supp) and contradicting (#Contra) facts in game summaries, and best-worst scaling evaluation for Grammaticality (Gram), Coherence (Coher), and Conciseness (Concis) on ROTOWIRE and MLB (higher is better). Systems compared are Gold, Templ, ED+CC, RBF-2020 (ROTOWIRE), ENT (MLB), and Macro; systems significantly different from Macro are marked with an asterisk * (using a one-way ANOVA with post hoc Tukey HSD tests; p ≤ 0.05).
The count of supported facts for Macro is comparable to Gold and ED+CC, and significantly lower than
Templ and RBF-2020. On MLB, Macro has sig-
nificantly fewer contradicting facts than ENT and
ED+CC and is comparable to Templ and Gold
(the difference is not statistically significant). The
count of supported facts for Macro is comparable
to Gold, ENT, ED+CC, and Templ. For both
datasets, Templ has the lowest number of contra-
dicting facts. This is expected as Templ essentially
parrots facts (aka records) from the table.
We also conducted a second study to evaluate the quality of the generated summaries. We pre-
sented crowdworkers with a pair of summaries and
asked them to choose the better one in terms of
Grammaticality (is the summary written in well-
formed English?), Coherence (is the summary
well structured and well organized and does it have
a natural ordering of the facts?), and Conciseness
(does the summary avoid unnecessary repetition
including whole sentences, facts or phrases?). We
provided example summaries showcasing good
and bad output. For this task, we required that the
crowdworkers be able to comfortably compre-
hend NBA/MLB game summaries. We elicited
preferences with Best-Worst Scaling (Louviere
and Woodworth, 1991; Louviere et al., 2015), a
method shown to be more reliable than rating
scales. The score of a system is computed as the
number of times it is rated best minus the number
of times it is rated worst (Orme, 2009). The scores
range from −100 (absolutely worst) to +100 (absolutely best). We divided the five competing
systems into ten pairs of summaries and elicited
ratings for 40 summaries (20 per dataset). Each
summary pair was rated by 3 raters. This resulted
in 40 summaries × 10 system pairs × 3 evaluation criteria × 3 raters, for a total of 3,600 tasks. A total of 206 crowdworkers participated in this task
(agreement using Krippendorff’s α was 0.47).
As shown in Table 5, on ROTOWIRE, Macro is comparable to Gold, RBF-2020, and ED+CC in
terms of Grammaticality but significantly better
than Templ. In terms of Coherence, Macro is
comparable to RBF-2020 and ED+CC but signif-
icantly better than Templ and significantly worse
than Gold. With regard to Conciseness, Macro is
comparable to Gold, RBF-2020, and ED+CC, and
significantly better than Templ. On MLB, Macro
is comparable to Gold in terms of Grammaticality
and significantly better than ED+CC, ENT, and
Templ. Macro is comparable to Gold in terms of
Coherence and significantly better than ED+CC,
ENT and Templ. In terms of Conciseness, raters
found Macro comparable to Gold and Templ and
significantly better than ED+CC and ENT. Taken
together, our results show that macro planning
leads to improvement in data-to-text generation in
comparison to other systems for both ROTOWIRE
and MLB datasets.
7 Discussion

In this work we presented a plan-and-generate approach for data-to-text generation that consists
of a macro planning stage representing high-level
document organization in terms of structure and
content, followed by a text generation stage.
Extensive automatic and human evaluation shows
that our approach achieves better results than ex-
isting state-of-the-art models and generates sum-
maries which are factual, coherent, and concise.
Our results show that macro planning is more advantageous for generation tasks expected to
produce longer texts with multiple discourse
units, and could be easily extended to other sports
domains such as cricket (Kelly et al., 2009) or
American football (Barzilay and Lapata, 2005).
Other approaches focusing on micro planning
(Puduppully et al., 2019a; Moryossef et al., 2019)
might be better tailored for generating shorter
texts. There has been a surge of datasets recently
focusing on single-paragraph outputs and the task
of content selection such as E2E (Novikova et al., 2017), WebNLG (Gardent et al., 2017), and
WikiBio (Lebret et al., 2016; Perez-Beltrachini
and Lapata, 2018). We note that in our model con-
tent selection takes place during macro planning
and text generation. The results in Table 2 show
that Macro achieves the highest CS F-measure
on both datasets, indicating that the document as
a whole and individual sentences discuss appro-
priate content.
Throughout our experiments we observed that template-based systems score poorly in terms of
CS (but also CO and BLEU). This is primarily due
to the inflexibility of the template approach which
is limited to the discussion of a fixed number of
(high-scoring) players. Yet, human writers (E
neural models to a certain extent), synthesize
summaries taking into account the particulars of a
specific game (where some players might be more
important than others even if they scored less)
and are able to override global defaults. Template
sentences are fluent on their own, but since it
is not possible to perform aggregation (Reiter,
1995), the whole summary appears stilted, it lacks
coherence and variability, contributing to low
BLEU scores. The template baseline is worse for
MLB than ROTOWIRE which reflects the greater
difficulty to manually create a good template for
MLB. Overall, we observe that neural models are
more fluent and coherent, being able to learn a
better ordering of facts which is in turn reflected
in better CO scores.
Despite promising results, there is ample room to improve macro planning, especially in terms of the precision of RG (see Table 2, P% column of RG). We should not underestimate that Macro must handle relatively long inputs (the average input length in the MLB development set is ∼3,100 tokens), which are challenging for the attention mechanism. Consider the following output of our model on the MLB dataset: Ramirez's two-run double off Joe Blanton tied it in the sixth, and Brandon Moss added a two-out RBI single off Alan Embree to give Boston a 3-2 lead. Here, the name of the pitcher should have been Joe Blanton instead of Alan Embree. In fact, Alan Embree is the pitcher for the following play in the half inning. In this case, attention diffuses over the relatively long MLB macro plan, leading to inaccurate content selection. We could alleviate this problem by adopting a noisy channel decomposition (Yee et al., 2019; Yu et al., 2020), that is, by
learning two different distributions: a conditional
model that provides the probability of translating
a paragraph plan to text and a language model that
provides an unconditional estimate of the output
(i.e., the whole game summary). However, we
leave this to future work.
the model’s inability to understand numbers. For
esempio, Macro generates the following output:
The Lakers were the superior shooters in this
game, going 48 percent from the field and 30 per-
cent from the three-point line, while the Jazz went
47 percent from the floor and 30 percent from
beyond the arc. Here, 30 percent should have been
24 percent for the Lakers but the language model
expects a higher score for the three-point line, E
since 24 is low (especially compared to 30 scored
by the Jazz), it simply copies 30 scored by the
Jazz instead. A mechanism for learning better rep-
resentations for numbers (Wallace et al., 2019) O
executing operations such as argmax or minus (Nie
et al., 2018) should help alleviate this problem.
Finally, although our focus so far has been on learning document plans from data, the decoupling of planning from generation allows us to flexibly
generate output according to specification. For
example, we could feed the model with manually
constructed macro plans, consequently controlling
the information content and structure of the output
summary (e.g., for generating short or long texts,
or focusing on specific aspects of the game).
Acknowledgments

We thank the Action Editor, Claire Gardent, and the three anonymous reviewers for their constructive feedback. We also thank Laura Perez-Beltrachini for her comments on an earlier draft of this paper, and Parag Jain, Hao Zheng, Stefanos Angelidis, and Yang Liu for helpful discussions. We acknowledge the financial support of the European Research Council (Lapata; award number 681760, ''Translating Multiple Modalities into Text'').
Bengio. 2015. Traduzione automatica neurale di
imparare insieme ad allineare e tradurre.
In
3rd International Conference on Learning
USA, May 7–9, 2015, Conference Track
Proceedings.
Collective content selection for concept-to-text
generation. In Proceedings of Human Language
Technology Conference and Conference on
Empirical Methods
in Natural Language
in lavorazione,
331–338, Vancouver,
pagine
British Columbia, Canada. Associazione per
Linguistica computazionale. DOI: https://
doi.org/10.3115/1220575.1220617
2009. Natural Language Processing with
Python, O’Reilly Media.
Emiel van Miltenburg, and Emiel Krahmer.
generation: UN
data-to-text
2019. Neural
comparison between pipeline and end-to-
IL
end architectures.
2019 Conference on Empirical Methods in
Natural Language Processing and the 9th
International Joint Conference on Natural
Language
(EMNLP-IJCNLP),
pages 552–562, Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-1052
and the flow of language. In Talmy Giv´on,
editor, Syntax and Semantics, volume 12,
pages 159–181, Academic Press Inc. DOI:
https://doi.org/10.1163/9789004
368897 008
Fast abstractive summarization with reinforce-
selected sentence rewriting. Negli Atti di
the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1:
Documenti lunghi), pages 675–686, Melbourne,
Australia. Association
for Computational
Linguistica. DOI: https://doi.org/10
.18653/v1/P18-1063
sions in a domain of objects and processes.
Content planner construction via evolutionary
o
w
N
o
UN
D
e
D
R
o
M
H
T
/
/
io
R
e
C
T
.
T
.
D
tu
T
C
l
/
R
T
io
C
e
–
P
D
/
o
/
0
1
1
6
2
T
C
_
UN
_
0
0
3
8
1
1
9
2
4
1
7
6
T
C
_
UN
_
0
0
3
8
1
P
D
sì
G
tu
e
S
T
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
the International Nat-
Negli Atti di
ural
Language Generation Conference,
pages 89–96, Harriman, New York, USA.
Associazione per la Linguistica Computazionale.
Pablo A. Duboue and Kathleen R. McKeown. 2001. Empirically estimating order constraints for content planning in generation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 172–179, Toulouse, France. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1073012.1073035
John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.
Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54, Melbourne, Australia. Association for Computational Linguistics.
Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-1017
Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. J. Artif. Intell. Res., 61:65–170. DOI: https://doi.org/10.1613/jair.5477
Heng Gong, Xiaocheng Feng, Bing Qin, and Ting Liu. 2019. Table-to-text generation with effective hierarchical encoder on three dimensions (row, column and time). In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1310
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.
Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140–149, Berlin, Germany. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P16-1014
M. A. K. Halliday and Ruqaiya Hasan. 1976. Cohesion in English. Longman, London.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780. DOI: https://doi.org/10.1162/neco.1997.9.8.1735
Eduard H. Hovy. 1993. Automated discourse generation using discourse structure relations. Artificial Intelligence, 63(1-2):341–385. DOI: https://doi.org/10.1016/0004-3702(93)90021-3
Hayate Iso, Yui Uehara, Tatsuya Ishigaki, Hiroshi Noji, Eiji Aramaki, Ichiro Kobayashi, Yusuke Miyao, Naoaki Okazaki, and Hiroya Takamura. 2019. Learning to select, track, and generate for data-to-text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2102–2113, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1202
Min-Yen Kan and Kathleen R. McKeown. 2002. Corpus-trained text generation for summarization. In Proceedings of the International
Natural Language Generation Conference, pages 1–8, Harriman, New York, USA. Association for Computational Linguistics.
Colin Kelly, Ann Copestake, and Nikiforos Karamanis. 2009. Investigating content selection for language generation using machine learning. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009), pages 130–137, Athens, Greece. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1610195.1610218
Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1328–1338, Austin, Texas. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D16-1140
Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-4012
Ioannis Konstas and Mirella Lapata. 2013. Inducing document plans for concept-to-text generation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1503–1514, Seattle, Washington, USA. Association for Computational Linguistics.
Karen Kukich. 1983. Design of a knowledge-based report generator. In 21st Annual Meeting of the Association for Computational Linguistics. DOI: https://doi.org/10.3115/981311.981340
Anirban Laha, Parag Jain, Abhijit Mishra, and Karthik Sankaranarayanan. 2020. Scalable micro-planned generation of discourse from structured data. Computational Linguistics, 45(4):737–763. DOI: https://doi.org/10.1162/coli_a_00363
Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1203–1213, Austin, Texas. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D16-1128
Robert E. Longacre. 1979. The paragraph as a grammatical unit. In Talmy Givón, editor, Syntax and Semantics, volume 12, pages 115–133. Academic Press Inc.
Jordan J. Louviere, Terry N. Flynn, and A. A. J. Marley. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9781107337855
Jordan J. Louviere and George G. Woodworth. 1991. Best-worst scaling: A model for the largest difference judgments. University of Alberta: Working Paper.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D15-1166
Kathleen R. McKeown. 1992. Text Generation. Studies in Natural Language Processing, Cambridge University Press.
Shimei Pan, James Shaw, and Barry A. Allen. 1997. Language generation for multimedia healthcare briefings. In Fifth Conference on Applied Natural Language Processing, pages 277–282, Washington, DC, USA. Association for Computational Linguistics. DOI: https://doi.org/10.3115/974557.974598
Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. In Proceedings of the 2016 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 720–730, San Diego, California. Association for Computational Linguistics.
Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-step: Separating planning from realization in neural data-to-text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2267–2277, Minneapolis, Minnesota. Association for Computational Linguistics.
Feng Nie, Jinpeng Wang, Jin-Ge Yao, Rong Pan, and Chin-Yew Lin. 2018. Operation-guided neural networks for high fidelity data-to-text generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3879–3889, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1422
Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, Saarbrücken, Germany. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W17-5525
Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.
Laura Perez-Beltrachini and Mirella Lapata. 2018. Bootstrapping generators from noisy data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1516–1527, New Orleans, Louisiana. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N18-1137
Ratish Puduppully, Li Dong, and Mirella Lapata. 2019a. Data-to-text generation with content selection and planning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, Hawaii. DOI: https://doi.org/10.1609/aaai.v33i01.33016908
Ratish Puduppully, Li Dong, and Mirella Lapata. 2019b. Data-to-text generation with entity modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2023–2035, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1195
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
Bryan Orme. 2009. MaxDiff analysis: Simple counting, individual-level logit, and HB. Sawtooth Software.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1073083.1073135
Clément Rebuffel, Laure Soulier, Geoffrey Scoutheeten, and Patrick Gallinari. 2020. A hierarchical model for data-to-text generation. In Advances in Information Retrieval, pages 65–80. Springer. DOI: https://doi.org/10.1007/978-3-030-45439-5_5, PMCID: PMC7148215
Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Natural Language Engineering, 3(1):57–87. cmp-lg/9504013v1. DOI: https://doi.org/10.1017/S1351324997001502
Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Studies in Natural Language Processing, Cambridge University Press. DOI: https://doi.org/10.1017/CBO9780511519857
Fahimeh Saleh, Alexandre Berard, Ioan Calapodescu, and Laurent Besacier. 2019. Naver Labs Europe's systems for the document-level generation and translation task at WNGT 2019. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 273–279, Hong Kong. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-5631
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P16-1162
Zhihong Shao, Minlie Huang, Jiangtao Wen, Wenfei Xu, and Xiaoyan Zhu. 2019. Long and diverse text generation with planning-based hierarchical variational model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3257–3268, Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1321
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, volume 27, pages 3104–3112. Curran Associates, Inc.
Shunsuke Takeno, Masaaki Nagata, and Kazuhide Yamamoto. 2017. Controlling target features in neural machine translation via prefix constraints. In Proceedings of the 4th Workshop on Asian Translation (WAT 2017), pages 55–63, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Ran Tian, Shashi Narayan, Thibault Sellam, and Ankur P. Parikh. 2019. Sticking to the facts: Confident decoding for faithful data-to-text generation. CoRR, abs/1910.08684v2.
Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368, Tokyo, Japan. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W19-8643
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.
Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do NLP models know numbers? Probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5307–5315, Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1534
Ronald J. Williams and Jing Peng. 1990. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4):490–501. DOI: https://doi.org/10.1162/neco.1990.2.4.490
Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document
generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263, Copenhagen, Denmark. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D17-1239
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144v2.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N16-1174
Kyra Yee, Yann Dauphin, and Michael Auli. 2019. Simple and effective noisy channel modeling for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5696–5701, Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1571
Lei Yu, Laurent Sartran, Wojciech Stokowiec, Wang Ling, Lingpeng Kong, Phil Blunsom, and Chris Dyer. 2020. Better document-level machine translation with Bayes' rule. Transactions of the Association for Computational Linguistics, 8:346–360. DOI: https://doi.org/10.1162/tacl_a_00319
Wlodek Zadrozny and Karen Jensen. 1991. Semantics of paragraphs. Computational Linguistics, 17(2):171–210.