Leveraging Pre-trained Checkpoints for Sequence Generation Tasks

Leveraging Pre-trained Checkpoints for Sequence Generation Tasks

Sascha Rothe
Google Research
rothe@google.com

Shashi Narayan
Google Research

shashinarayan@google.com

Aliaksei Severyn
Google Research
severyn@google.com

抽象的

Unsupervised pre-training of large neural mod-
els has recently revolutionized Natural Language
加工. By warm-starting from the pub-
licly released checkpoints, NLP practitioners
have pushed the state-of-the-art on multiple
benchmarks while saving significant amounts
of compute time. So far the focus has been
mainly on the Natural Language Understanding
任务. 在本文中, we demonstrate the efficacy
of pre-trained checkpoints for Sequence Gen-
进化. We developed a Transformer-based
sequence-to-sequence model that is compati-
ble with publicly available pre-trained BERT,
GPT-2, and RoBERTa checkpoints and con-
ducted an extensive empirical study on the utility
of initializing our model, both encoder and
decoder, with these checkpoints. Our models
results on
结果
机器翻译, Text Summarization,
Sentence Splitting, and Sentence Fusion.

in new state-of-the-art

1 介绍

Unsupervised and self-supervised pre-training
方法, such as ELMo (Peters et al., 2018),
ULMFiT (Howard and Ruder, 2018), 和更多
recently BERT (Devlin et al., 2019), GPT and
GPT-2 (Radford et al., 2018, 2019), XLNet
(杨等人。, 2019), and RoBERTa (刘等人。,
2019) have established a qualitatively new level
of baseline performance for many widely used
Natural Language Understanding (自然语言单元) 长椅-
marks including some of the most popular, 喜欢
GLUE (Williams et al., 2018) and SQuAD
(Rajpurkar et al., 2018).

The most appealing part about this massive
shift towards using large architectures pre-trained
on large collections of texts is that
预-
trained checkpoints along with the inference code
are made freely available. This saves hundreds
of TPU/GPU hours, as warm-starting a model

264

from a pre-trained checkpoint typically requires
orders of magnitude fewer fine-tuning steps while
delivering significant performance boosts. 更多的
重要的是,
the ability to bootstrap from a
state-of-the-art performing model such as BERT
(Devlin et al., 2019) motivates the community to
greatly speed up the progress towards developing
better and easily reusable NLU systems.

While we continue to observe an increasing
number of papers building on top of BERT and/or
GPT models reporting encouraging improvements
on GLUE, SQuAD, and other similar benchmarks,
very little attention has been paid to using these
pre-trained models to warm-start sequence-to-
顺序 (seq2seq) 型号. It has been argued
that the pre-training objective used by BERT is not
well-suited for tasks that require decoding texts,
例如, conditional text generation in machine
translation and summarization (杨等人。, 2019).
尽管如此, it remains unclear to what extent
using such large models pre-trained on large
collections of text can be beneficial to warm-start
seq2seq generation models.

在本文中, we report on a Transformer-based
seq2seq model that is compatible with publicly
available pre-trained BERT, GPT-2, and RoBERTa
checkpoints. We aim to provide an empirical
answer to the following research question: 什么
is the best way to leverage publicly available pre-
trained checkpoints for warm-starting sequence
generation models? 例如, 一个可以
imagine using a BERT checkpoint to initialize
the encoder for better input understanding and
choosing GPT-2 model as the decoder for better
text generation. One of the main contributions of
this paper is that we rigorously experiment with
a large number of different settings to combine
BERT, GPT, and RoBERTa pre-trained check-
points to initialize our Transformer-based model.
We report results on three canonical conditional
text generation tasks of increasing complexity:
sentence-level fusion (DiscoFuse, Geva et al.,
2019) and splitting (WikiSplit, Botha et al., 2018),

计算语言学协会会刊, 卷. 8, PP. 264–280, 2020. https://doi.org/10.1162/tacl 00313
动作编辑器: Wenjie (Maggie) 李. 提交批次: 9/2019; 修改批次: 12/2019; 已发表 6/2020.
C(西德:13) 2020 计算语言学协会. 根据 CC-BY 分发 4.0 执照.

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
3
1
3
1
9
2
3
4
2
2

/

/
t

A
C
_
A
_
0
0
3
1
3
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

WMT14 En↔De machine translation using most
common eval sets: newstest2014 and newstest2016,
and abstractive summarization using three data-
套: Gigaword (Napoles et al., 2012), CNN and
DailyMail (Hermann et al., 2015), and BBC
extreme (Narayan et al., 2018A).

Our models report significant improvements
over randomly initialized models, 展示
the benefit of leveraging unsupervised pre-trained
型号. 更重要的是, this simple strategy
results in new state-of-the-art results on machine
翻译, text summarization, sentence splitting,
and sentence fusion. Our results also demonstrate
that a pre-trained encoder is an essential compo-
nent for sequence generation tasks and often these
tasks benefit from sharing the weights between
the encoder and the decoder. 全面的, 我们有
run over 300 experiments spending thousands of
TPU v3 hours to better accommodate the language
modeling and understanding capabilities of these
pre-trained models for text generation. 我们相信
that NLP researchers and practitioners will derive
actionable insights
findings when
tackling various seq2seq tasks.

from our

The code to query our models and predictions on
various benchmarks will be available at https://
github.com/google-research/google-
research/tree/master/bertseq2seq.

2 Models and Pre-trained Checkpoints

BERT was primarily developed for encoding text
representations for NLU tasks (encoder-only
建筑学), whereas GPT-2 (Radford et al.,
2019), was primarily developed as a decoder-only
architecture for language modeling. Our model
uses a seq2seq architecture with encoder and
decoder both composed of Transformer layers
(Vaswani et al., 2017). For the encoder, we inherit
the BERT Transformer layer implementations
(Devlin et al., 2019), which differs slightly from
the canonical Transformer layer (Vaswani et al.,
2017); BERT uses a GELU activation (Hendrycks
and Gimpel, 2016) rather than the standard RELU.
If not stated otherwise, the implementation of the
decoder layers are also identical to the BERT
implementation with two adjustments. 第一的, 这
self-attention mechanism is masked to look only
at the left context. 第二, we add an encoder-
decoder attention mechanism. 笔记, that if the
model was randomly initialized, we found no

265

difference between a BERT compatible decoder
and a GPT-2 compatible decoder.

Most of the models use the base checkpoint and
therefore have 12 layers, a hidden size of 768,
filter size of 3,072, 和 12 attention heads. 我们
chose the best-performing model and also collect
numbers using larger pre-trained checkpoints.
These models have 24 layers, a hidden size of
1,024, filter size of 4,096, 和 16 attention heads.
All models were fine-tuned on the target task
using Adam with a learning rate of 0.05. 我们
used a linear learning rate warmup with 40k steps,
normalization by the square root of the hidden
尺寸, and a square root decay. We did not perform
any tuning of these hyperparameters (除了
§5). The batch size and the number of training
steps will be reported for each task individually.

BERT Checkpoints. We tokenize our text
using the WordPiece (Wu et al., 2016) to match the
BERT pre-trained vocabulary. Depending on the
实验, we use one of the following publicly
available checkpoints: BERT-Base Cased, BERT-
Base Uncased, BERT-Base Multilingual Cased
(Devlin et al., 2019).1 The first two checkpoints
have a vocabulary size of around ∼30k word-
件, whereas the multilingual checkpoint has
a much larger vocabulary size of ∼110k. BERT
also trains positional embeddings for up to 512
positions, which is the maximum input and output
length in all experiments.

GPT-2 Checkpoints. We tokenize our text
using the SentencePieces (Kudo and Richardson,
2018) to match the GPT-2 pre-trained vocab-
ulary.2 Note that, although the available check-
point is frequently called 117M, which suggests
the same number of parameters, we count 125M
parameters in the checkpoint. This is the smallest
architecture they trained, and the number of
layers, hidden size, and filter size are comparable
to BERT-Base. The model was trained mainly
on English data but does contain some foreign
语言. The vocabulary size is ∼50k. 尽管
GPT-2 has positional embeddings for up to 1,024
positions, we only use the first 512 to make the
results comparable with BERT.

RoBERTa Checkpoints. RoBERTa (刘等人。,
2019) is trained using PyTorch, but we found
that the learned parameters are fully compatible

1BERT checkpoints

可用的

在https://

github.com/google-research/bert.

2GPT-2 checkpoints

可用的

在https://

github.com/openai/gpt-2.

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
3
1
3
1
9
2
3
4
2
2

/

/
t

A
C
_
A
_
0
0
3
1
3
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

RND2RND
BERT2RND
RND2BERT
BERT2BERT
BERTSHARE
ROBERTASHARE
GPT
RND2GPT
BERT2GPT
ROBERTA2GPT

全部的
221中号
221中号
221中号
221中号
136中号
152中号
125中号
238中号
260中号
276中号

embed.
23中号
23中号
23中号
23中号
23中号
39中号
39中号
39中号
62中号
78中号

init.
0

random
221中号
109M 112M
26中号
109中号
26中号
195中号
26中号
109中号
26中号
125中号
125中号
0
125M 114M
26中号
234中号
26中号
250中号

桌子 1: The number of total trainable param-
埃特斯, embedding parameters, and parameters
initialized from the checkpoint vs. randomly.
The BERT/GPT-2 embeddings have 23M/39M
参数. The encoder-decoder attention ac-
counts for 26M parameters.

with the existing TensorFlow BERT architectures
with some minor adjustments.3 The vocabulary
treatment in RoBERTa is compatible with the
SentencePiece tokenization in GPT-2.4 As the
conceptual differences between BERT and
RoBERTa are minor, we might use BERT as a
hypernym to address both pretraining methods in
这篇论文.

3 Investigated Model Variants

在这个部分, we describe several combinations
of model initialization. The number of total train-
可用参数, the number of embedding para-
meters, and the number of parameters initialized
from the checkpoint vs. randomly are shown in
桌子 1.

RND2RND A Transformer

encoder-decoder
architecture with all weights initialized randomly.
BERT2RND A BERT-initialized encoder paired
with a randomly initialized decoder. Encoder and
decoder share the embedding matrix initialized
from a checkpoint.

RND2BERT A randomly initialized encoder
paired with a BERT-initialized decoder. 到
perform autoregressive decoding, we mask the

3进一步来说: A) the variable names have to be
adjusted; 乙) the weight and bias variables of the attention
mechanism have to be splitted into query, 钥匙, 和价值观;
C) all variables except the embedding matrices have to be
transposed.

4RoBERTa checkpoints are available at https://

github.com/pytorch/fairseq.

266

bidirectional self-attention mechanism of BERT
to look only at the left context.

BERT2BERT A BERT-initialized encoder paired
with a BERT-initialized decoder. All weights are
initialized from a public BERT checkpoint. 这
only variable that is initialized randomly is the
encoder-decoder attention.

BERTSHARE Like BERT2BERT, but the parameters
between encoder and decoder are shared. 这
greatly reduces the memory footprint of the model
(136M vs. 221M parameters). 此外, 我们
experimented with a layer-wise attention mech-
万物有灵论 (He et al., 2018), but obtained nearly iden-
tical numbers on most tasks.

ROBERTASHARE Same as BERTSHARE, 但

shared encoder and decoder are initialized with
the public RoBERTa checkpoint.

GPT A decoder-only architecture. We treat the
input as a conditioning prefix of a language
模型. The decoder is warm-started with a public
GPT-2 checkpoint. Similarly to BERTSHARE and
ROBERTASHARE, the memory footprint of this model
is smaller compared to an encoder-decoder setup
(125M parameters).

RND2GPT A randomly initialized encoder paired
with a GPT-2-compatible decoder. We warm-
start the decoder and the embedding matrix with
a public GPT-2 checkpoint.

BERT2GPT A BERT-compatible encoder paired
with a GPT-2-compatible decoder. We warm-
start both sides with the two separate, BERT
and GPT-2, public checkpoints. We use the BERT
vocabulary for the input and the GPT-2 vocabulary
for the output.

ROBERTA2GPT Same as BERT2GPT, but we use
a public RoBERTa checkpoint to warm-start the
encoder. RoBERTa was trained using the GPT-2
vocabulary so we can use it for input and output.
Note that although the vocabulary is shared, 这
model still has two embeddings matrices, one for
the input and one for the output.

输入

representation of

The pre-training objective in the BERT models
learns to predict a masked token using the
文本
bidirectional
(Devlin et al., 2019; 刘等人。, 2019). Our decoder,
even when initialized with the BERT or RoBERTa
checkpoints, always generates the output
文本
in an autoregressive fashion as in Tranformers
(Vaswani et al., 2017) and GPT-2 (Radford et al.,
2019).

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
3
1
3
1
9
2
3
4
2
2

/

/
t

A
C
_
A
_
0
0
3
1
3
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

DiscoFuse

100%

10%
SARI

1%
SARI

SARI
84.5

Exact
(Geva et al., 2019)
51.1
Initialized with the base checkpoint (12 layers)
65.6
ROBERTA2GPT
65.3
ROBERTASHARE
63.9
BERT2BERT
63.9
BERT2RND
BERTSHARE
63.9
61.5
BERT2GPT
60.4
GPT
60.0
RND2BERT
58.3
RND2RND
57.6
RND2GPT
Initialized with the large checkpoint (24 layers)
66.6
ROBERTASHARE
65.3
BERTSHARE

87.1
86.9
86.1
86.1
86.0
84.1
82.9
82.1
81.5
81.4

89.9
89.7
89.3
89.3
89.2
88.4
88.0
87.6
86.9
86.5

87.7
86.6

90.3
89.9

80.3
81.2
81.2
80.3
80.8
70.2
74.5
72.8
69.3
70.6

81.5
81.4

SARI
61.5

蓝线
76.4

Exact
WikiSplit
14.3
(Botha et al., 2018)
Initialized with the base checkpoint (12 layers)
16.3
BERTSHARE
16.1
ROBERTASHARE
15.6
BERT2BERT
15.1
ROBERTA2GPT
15.9
BERT2RND
BERT2GPT
14.6
15.2
RND2BERT
14.6
RND2RND
14.2
RND2GPT
14.2
GPT
Initialized with the large checkpoint (24 layers)
ROBERTASHARE
16.4
16.6
BERTSHARE

77.2
77.1
77.0
76.8
76.9
76.5
76.5
76.3
76.2
75.8

63.5
63.4
63.2
63.2
63.1
62.4
61.8
61.7
61.3
61.1

77.4
77.3

63.8
63.7

桌子 2: Results of different models and initiali-
zation techniques on DiscoFuse and subsampled
training sets. Blockwise sorted by SARI score on
100% of the training set.

We performed the bulk of our experiments
on the 12-layer checkpoints of BERT, GPT-2,
and RoBERTa, assuming that the findings will
also hold for the 24-layer checkpoints. We chose
BERTSHARE, ROBERTASHARE and ROBERTA to also
report numbers using the 24-layer public pre-
trained checkpoints. 我们还尝试过
the GPT setup with 24 layers and 345M parameters
but as we did not achieve any better results we
excluded this from the paper.

4 Experiments and Results

4.1 Sentence Fusion

Sentence Fusion is the problem of combining
multiple sentences into a single coherent sentence.
We use the ‘‘balanced Wikipedia’’ portion of the
DiscoFuse dataset (Geva et al., 2019) for our
experiments with 4.5M fusion examples in the
training set. The evaluation set has 50k examples.
Because of the size of this evaluation set, 甚至
small changes are statistically significant. 为了这
原因, we have solely chosen this dataset for
additional experiments described at the end of the
纸.

Training was done for 300k steps with a global
batch size of 256. The input and output are padded
to a length of 128, which covers 100% 的
训练, 评估, and test data. We report SARI

桌子 3: Results of different models and initiali-
zation setups on WikiSplit. Blockwise sorted by
SARI score.

(徐等人。, 2016)5 and the exact match accuracy.
The results can be seen in Table 2. 以前的
state-of-the-art results by Geva et al. (2019) 用过的
the vanilla transformer model by Vaswani et al.
(2017), 仅与 7 layers. All models with initial-
ized encoders outperform the baseline by a large
margin, with a SARI score of 89.3 compared
和 86.9 (BERT2RND vs. RND2RND). To measure
the effect on smaller training sets, we randomly
subsample the training data down to 10% 和 1%,
(IE。, 450k and 45k training examples, 重新指定-
主动地). 第一的, we notice, that performance compa-
rable to the baseline is achieved even when training
on only 10% of the training data (ROBERTASHARE
与. ROBERTASHARE). 第二, when using only 1%
of the training data, setups with fewer randomly
initialized parameters (BERT2BERT vs. BERT2RND)
perform better. The best performing 12-layer setup
is ROBERTA2GPT with a SARI score of 89.9 仅有的
outperformed by 24-layer setup of ROBERTASHARE
with a SARI score of 90.3.

4.2 Split and Rephrase

The reverse task of sentence fusion is the split-
and-rephrase task, which requires rewriting a long

5SARI is a lexical similarity metric that compares the
model’s output to multiple references and the input in order to
assess the model’s ability to add, delete, and keep an n-gram.
Its implementation is available at: https://github.
com/tensorflow/tensor2tensor/blob/master/
tensor2tensor/utils/sari hook.py.

267

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
3
1
3
1
9
2
3
4
2
2

/

/
t

A
C
_
A
_
0
0
3
1
3
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

sentence into two or more coherent short sentences
(Narayan et al., 2017). We use the WikiSplit
dataset (Botha et al., 2018), which consists of 1M
examples of sentence splits extracted from the
Wikipedia edit history, and follow the training/test
split suggested by the authors. Training was done
for 300k steps with a global batch size of 256. 这
input and output are padded to a length of 128,
which covers 100% of the training, 评估,
and test data. As in Botha et al. (2018), we report
corpus-level BLEU,6 the exact match accuracy,
and SARI score. Previous state-of-the-art results
by Botha et al. (2018) used a bi-directional LSTM
with a copy mechanism (Aharoni and Goldberg,
2018). Analogous to the DiscoFuse task we
observe that initializing the encoder improves the
model the most (桌子 3). The shared encoder-
decoder setup of BERTSHARE outperforms all other
setups. For the larger models with 24 layers, 我们
observed a small over-fitting after 100k steps (˜25
纪元), and therefore stop the training early.
BERTSHARE and ROBERTASHARE perform on par and
both outperform their 12-layer counterpart.

4.3 机器翻译

We test our setups on the most common bench-
mark in machine
translation—WMT 2014
English ↔ German task—using newstest2014 and
newstest2016 eval sets. We use the same hyper-
parameter settings as in the previous experiments.
We limit the input and output lengths to 128 代币
each. We used a global batch size of 256 and train
为了 30 纪元. Decoding was done with beam
大小 4 and the default value for the sentence
length penalty set to α = 0.6. We report uncased
BLEU-4 scores.7

表中 4, we first report the baseline scores
for the original Transformer model Vaswani et al.
(2017) and our Transformer implementation8 with

6We use NLTK v3.2.2 with case-sensitive scoring to

estimate BLEU scores.

7We use a script from the Tensorflow Official Transformer
implementation https://github.com/tensorflow/
models/ tree/ master/ nlp/ transformer. 笔记
那, differently from the https://github.com/
tensorflow/ tensor2tensor/ blob/ master/
tensor2tensor/utils/get ende bleu.sh
用过的
by Vaswani et al. (2017), this script does not split noun
compounds, but we normalize utf-8 quotes to ascii quotes as
we noted that our pre-processed training set contains only
ascii quotes.

8We use Transformer layers from the official BERT
implementation which have small differences from Vaswani
等人. (2017).

268

the same hyper parameters. In both cases, 我们用
the encoder and decoder with 6 layers and the 32k
wordpiece vocabulary extracted from the WMT14
training set. Our implementation obtains slightly
higher scores than the original implementation.

The middle section of Table 4 reports the
results for various initialization schema using
BERT and GPT-2 pre-trained checkpoints. 笔记
that here all models have encoders and decoders
和 12 layers. For BERT models, 我们使用
BERT-Base Multilingual Cased checkpoint

initialize the encoder or the decoder or both,
as the task involves one non-English language.
This checkpoint has been pre-trained on 108
languages using a multilingual Wikipedia dump
with a vocabulary of 110k wordpieces. 第一的, 我们
observe that initializing the model with the BERT
checkpoint is most beneficial on the encoder side;
our observation is in line with Yang et al. (2019).
此外, models initialized with the BERT
checkpoint receive a significant boost: BERT2RND
compared to the no-initialization RND2RND setup
scores higher by +4 points on En→De and +3.6
points on De→En on newstest2014. 与之相反
the WikiSplit and DiscoFuse task, sharing the
encoder and decoder variables did not give an
additional boost. This is most likely because a)
model capacity is an important factor in MT and
乙) encoder and decoder have to deal with different
grammar and vocabulary.
GPT-based models


BERT2GPT) do not perform nearly as well, 尤其
when GPT is used as the decoder and the target
language is German. This is because the GPT
model comes with an English vocabulary and has
been pre-trained mainly on English text. 因此,
we report the scores for GPT in the En→De setting
in gray.

(RND2GPT,

GPT,

Customized BERT Checkpoint. 为了

experiment we did not include RoBERTa, 作为
public checkpoint is available for English only.
反而, we train our own checkpoint. 我们也
observe that our implementation of the baseline
Transformer, as well as RND2RND setup, 哪个
uses no initialization, more weakly weaker on
newstest2014 compared with the Transformer
基线 (和 6 layers and the 32k wordpiece
词汇) we report in the top section of Table 4.
We conjecture that the differences might be due
to the larger 110k wordpiece vocabulary trained
to handle 104 languages from Wikipedia dump,

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
3
1
3
1
9
2
3
4
2
2

/

/
t

A
C
_
A
_
0
0
3
1
3
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

newstest2014

newstest2016


37.9



31.4
31.4

En→De
27.3
28.1
28.7
29.2
35.0 (33.8)

De→En En→De De→En

33.5


(Vaswani et al., 2017)
Transformer (ours)
KERMIT(Chan et al., 2019)
(Shaw et al., 2018)
(Edunov et al., 2018)*
Initialized with public checkpoints (12 layers) and vocabulary
23.7
Transformer (ours)
RND2RND
26.0
30.1
BERT2RND
RND2BERT
27.2
30.1
BERT2BERT
BERTSHARE
29.6
16.4
GPT
RND2GPT
19.6
23.2
BERT2GPT
Initialized with a custom BERT checkpoint (12 layers) and vocabulary
30.6
BERT2RND
BERTSHARE
30.5
Initialized with a custom BERT checkpoint (24 layers) and vocabulary
31.7
BERT2RND
30.5
BERTSHARE

35.8
36.7
39.6
37.5
39.3
39.6
27.7
28.5
37.0

26.6
29.1
32.7
30.4
32.7
32.6
21.5
23.2
31.4

31.6
32.4
34.4
33.2
34.6
34.4
22.4
24.2
28.1

40.2
40.1

35.6
35.4

34.2
33.8

41.1
40.9

33.5
33.6

35.1
35.5

桌子 4: Uncased BLEU-4 scores on WMT14 English ↔ German
newstest2014 and newstest2016 test sets. Models in the middle section
use the 110k wordpiece vocabulary that comes with the multilingual BERT
checkpoint. In the bottom section, we use the native 32k wordpiece vocabulary
extracted from WMT14 train set and a BERT checkpoint pre-trained only on
English and German subset of Wikipedia. * Leveraging a large number of
additional parallel sentence pairs obtained with back-translation; we include
this score as a reference to the highest achieved result on newstest2014. 这
GPT-2 results for En→De (where the GPT-2 initialized decoder is used to
decode targets in De) are grayed out as they are a priori penalizing for GPT-2,
which was only pretrained on En texts.

which is suboptimal for WMT14 data and leads
to inferior results. To verify this conjecture, 我们
perform the following experiment: We use the
32k wordpiece vocabulary extracted from the
WMT14 En ↔ De training set (same as used
in the top section of Table 4) and pre-train a
BERT model on the English and German subset
of the Wikipedia dump in the same way as the
multilingual BERT checkpoint was obtained. 我们
initialize our best-performing setups, BERT2RND
and BERTSHARE, with this checkpoint (the third
block of Table 4). This provides a further +0.5
(En ↔ De) 和 +0.8 (De ↔ En) BLEU improve-

ments on newstest2014, 和, +1.1 和 +0.7 在
newstest2016, yielding an overall very strong
performance on the challenging WMT14 task.
Experiments with the larger models (the last block)
show further improvements of up to +1.1 蓝线
点.

Edunov et al. (2018) report better results when
they augment the training set with a massive
amount of back-translated sentence pairs. 至
据我们所知, among the approaches
that only leverage parallel data from WMT14, 我们的
results are state-of-the-art on both newstest2014
and newstest2016.

269

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
3
1
3
1
9
2
3
4
2
2

/

/
t

A
C
_
A
_
0
0
3
1
3
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

4.4 Abstractive Summarization

Document summarization is the task of producing
a short version of a document while preserving
its salient information content. We evaluate our
setups on three different summarization datasets of
varying characteristics: Gigaword (Napoles et al.,
2012), CNN and DailyMail (Hermann et al.,
2015), and BBC extreme (Narayan et al., 2018A).
The Gigaword dataset focuses on abstractive
sentence summarization with a total of 3.8M
sentence-summary training pairs. The other two
datasets focus on single-document summarization:
The CNN/DailyMail dataset consists of 287k
document–summary pairs, whereas the BBC
dataset consists of 204k document-summary pairs.
The CNN/DailyMail summaries are in the form of
bullet-point story highlights and exhibit a high
degree of extraction, requiring the models to
learn to copy from the source documents. 这
BBC summaries, 另一方面, are extreme
在那里面
the documents are summarized into
single-sentence summaries. These summaries
demonstrate a high level of abstractiveness, 和
generating them automatically requires document-
level inference, abstraction, and paraphrasing.

In all three cases, we did not anonymize entities.
We worked on the original cased versions of the
CNN/DailyMail and BBC datasets. For Gigaword
we used the lowercased version to match the
requirements of the publicly available lowercased
test set. During training, the input documents were
truncated to 512 tokens for the CNN/DailyMail
and BBC, 并 128 tokens for Gigaword.
相似地, the length of the summaries was limited
到 128 tokens for CNN/DailyMail, 64 for BBC,
和 32 for Gigaword. We used a global batch
大小 128 document–summary pairs for CNN/
和 256 document–
DailyMail
summary pairs for Gigaword. We adapted to
different number of training steps depending on
the training data sizes. Models were trained for
500k, 300k, and 200k steps for the Gigaword,
CNN/DailyMail, and BBC summarization data-
sets respectively.
In all cases, we used the
standard publicly available test sets; these consists
的 1951 instances for Gigaword, 11,490 为了
CNN/DailyMail, 和 11,334 for BBC. We report
on the ROUGE F1 scores (Lin and Hovy, 2003); 在
特别的, we report on ROUGE-1 and ROUGE-2
for informativeness and ROUGE-L for fluency in
桌子 5.

and BBC,

Document Understanding. 全部

BERT
encoder-based setups (IE。, BERT2RND, BERTSHARE,
ROBERTASHARE, and BERT2BERT) outperform the
baseline RND2RND by a large margin. The improve-
ments of the RND2BERT setup, where only the
decoder is initialized, are narrow. These results
overall validate the significance of document
representation in the encoder-decoder framework
for summarization. On the BBC extreme summari-
zation in particular, these four models achieve on
average +6.85 point improvement in ROUGE-L
compared with the RND2RND setup. Our results
demonstrate that the models with better document
representations are better in generating extreme
summaries that require document-level inference
and abstraction. For the extractive highlights in
the CNN/DailyMail dataset, these models show
an improvement of +3.53 ROUGE-L points over
the RND2RND baseline. For Gigaword, 哪里的
input is a single sentence, the improvements are
最小的 (平均数 +1.02 ROUGE-L points).
The BERTSHARE setup with shared encoder and
decoder parameters achieves better performance
than BERT2BERT on all three datasets. The gains are
larger on the BBC dataset than on the Gigaword
and CNN/DailyMail datasets. This is probably
because the BBC summary sentences follow a
distribution that is similar to that of the sentences
in the document, whereas this is not necessarily
the case for the Gigaword headlines and the CNN/
DailyMail bullet-point highlights. ROBERTASHARE
performs superior to BERTSHARE on the CNN/
DailyMail and BBC datasets. ROBERTASHARE per-
形式
competitively to BERTSHARE on the
Gigaword dataset where the task is to summarize
句子.

Summarization with GPT Checkpoints. GPT
than RND2GPT,
(decoder-only) performs better
BERT2GPT or ROBERTA2GPT (encoder-decoder mod-
这) by a large margin for generating CNN/
DailyMail extracts, but poorer for generating
BBC abstracts. The encoder–decoder architec-
ture where the input document
is modeled
separately is better equipped for document-level
abstraction than the decoder-only architectures
where the input document is a conditioning prefix
of a language model. Initialization with different
checkpoints
(例如, encoder with BERT and
decoder with GPT in BERT2GPT) is not effective
summarization; BERT2GPT and
for document
to RND2GPT on the
ROBERTA2GPT are inferior

270

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
3
1
3
1
9
2
3
4
2
2

/

/
t

A
C
_
A
_
0
0
3
1
3
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Gigaword
R-2


17.48

R-L


33.29

R-1


35.88

带领
PtGen
ConvS2S
MMN
Bottom-Up

CNN/DailyMail
R-2
17.70
17.28


18.68

R-1
39.60
39.53


41.22

R-L
36.20
36.38


38.34

BBC XSum
R-2
1.61
9.21
11.54
12.10

R-1
16.30
29.70
31.89
32.00

R-L
11.95
23.24
25.75
26.00

35.96

38.73

19.71


39.65
43.47

MASS
TransLM
UniLM
Initialized with the base checkpoint (12 layers)
RND2RND
BERT2RND
RND2BERT
BERT2BERT
BERTSHARE
ROBERTASHARE
GPT
RND2GPT
BERT2GPT
ROBERTA2GPT
Initialized with the large checkpoint (24 layers)
BERTSHARE
ROBERTASHARE

34.45
35.26
34.51
35.58
35.62
35.44
33.67
33.83
34.24
35.42

18.71
19.26
18.91
19.68
19.81
19.70
18.44
18.39
18.23
19.21

36.94
37.71
37.01
38.01
38.13
38.21
36.04
36.21
36.77
37.94

35.77
38.74
36.65
39.02
39.09
40.10
37.26
32.08
25.20
36.35

19.80
19.78

39.83
40.31

35.66
35.94

38.35
38.62


17.74
20.30

14.00
17.76
15.55
17.84
18.10
18.95
15.83
8.81
4.96
14.72


36.85
40.63

32.96
35.95
33.97
36.29
36.33
37.39
34.47
29.03
22.99
33.79



30.90
38.42
32.44
37.53
38.52
39.87
22.21
28.48
27.79
19.91



10.23
15.83
11.52
15.24
16.12
17.50
4.89
8.77
8.37
5.20



24.24
30.80
25.65
30.05
31.13
32.37
16.69
22.30
21.91
15.88

17.69
18.91

37.01
37.62

38.93
41.45

16.35
18.79

31.52
33.90

桌子 5: Summarization results of different models and their initialization setups. 我们
compare our setups (the bottom block) against both non-pre-trained (the top block) 和
pre-trained (the middle block) models on various datasets: the Lead baseline, PtGen
(See et al., 2017), ConvS2S (Gehring et al., 2017), MMN (Kim et al., 2019), Bottom-Up
(Gehrmann et al., 2018), MASS (Song et al., 2019), TransLM (Khandelwal et al., 2019), 和
UniLM (Dong et al., 2019). The Lead results for the CNN/DailyMail dataset is taken from
Narayan et al. (2018乙), whereas Lead, PtGen, and ConvS2S results on the BBC dataset are
taken from Narayan et al. (2018A). Our best results are boldfaced and the best results on the
datasets are italicized.

BBC dataset and BERT2GPT to RND2GPT on the
CNN/DailyMail dataset. 然而, this is not the
case with the Gigaword dataset, which has 3.8M
training instances; BERT2GPT and ROBERTA2GPT
perform better than RND2GPT.

ROBERTASHARE performs the best and is on par
with the current state-of-the-art MASS model
(Song et al., 2019) on the Gigaword dataset. 这
MASS model has an advantage of pre-training
encoder-decoder attention from scratch, our pro-
posed models use the publicly available pre-trained
checkpoints and only fine-tune on the target task.
It is not obvious how the masked seq2seq pre-
training objective for sentence generation in the

MASS model will be beneficial for tasks like doc-
ument summarization. Our proposed models pro-
vide a generic alternative and can be easily adapted
to various text generation tasks. The ROBERTASHARE
setup sets a new state-of-the-art, outperforming all
existing baselines by a large margin on the BBC
extreme summarization task. The best model on
the CNN/DailyMail dataset outperforms

Pointer Generator network (See et al., 2017) 和
the pre-trained single-decoder model with Trans-
formerLM (Khandelwal et al., 2019). Our model,
lags behind the Bottom-Up system
然而,
(Gehrmann et al., 2018) with a task-specific mod-
ule for content selection along with the copy

271

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
3
1
3
1
9
2
3
4
2
2

/

/
t

A
C
_
A
_
0
0
3
1
3
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

机制 (Gu et al., 2016) and the UniLM model
(Dong et al., 2019) with BERT-Large pre-trained
for bidirectional, unidirectional and seq2seq lan-
guage modeling objectives. The UniLM model is
also fine-tuned with an additional extractive sum-
marization objective to predict relevant sentences
in the document; this objective could be beneficial
to generate the CNN/DailyMail extracts.

5 Discussion on Ablation Studies

Combining Different Checkpoints. Combin-
ing BERT and GPT-2 into a single model
(BERT2GPT) did not work and often underper-
formed than a randomly initialized baseline. 这
is presumably because the model has to learn two
different vocabularies. This argument is backed
by the fact that for MT, de→en the BERT2GPT setup
performed well. For this task the vocabulary set-
ting is in favor of this particular task, 意义
that two vocabularies have to be learned anyways
and the output is English, on which GPT-2 was
trained. Because RoBERTa and GPT-2 share the
same vocabulary, combining both into a single
模型 (ROBERTA2GPT) showed strong results on
several tasks but did not outperform a setup where
RoBERTa is used in the encoder and decoder.

Tuning GPT-2 Based Models. We were
surprised that setups using the GPT-2 checkpoint
performed relatively poorly given that it is trained
as a language model on a large corpus; 我们的
intuition was that GPT-2 initialized decoders will
be strong natural language generators. To ensure
that this was not due to an unfortunate choice of
hyperparameters, we tuned the learning rate, 这
warmup steps, and the optimizer ∈ {亚当,
Adafactor} for the GPT-2 based setups (RND2GPT,
BERT2GPT) on the DiscoFuse dataset.
GPT,
Naturally, this gave us slightly higher numbers but
not at a magnitude that would suggest a previously
suboptimal setting. 具体来说, we obtained a
SARI score of 88.8 compared with 88.4 为了
BERT2GPT, 88.1 compared with 88.0 for GPT, 和
87.7 compared with 86.5 for RND2GPT.

Initializing Only Embeddings. We want

investigate the impact of the non-contextualized
BERT and GPT-2 embeddings. This means we
are initializing the transformer model with only
the embedding matrices. The advantage of this
setup would be that we could freely choose the

model architecture and size and adapt it to a
specific task. We found almost no improvement
over the fully randomly initialized model RND2RND.
Concretely, we compute a SARI score of 87.1
using the BERT embeddings and 87.0 using the
GPT-2 embeddings, compared with 86.9 的
RND2RND baseline. We observe slightly higher
improvements of up to 2 percentage points when
training on only 10% of the training data.

Initializing Only Layers. Contrary to the previ-
ous paragraph, we want to investigate the effect
of initializing everything but the word embedding
矩阵. The embedding matrix accounts for only
10% 到 31% of all learnable parameters, 和
sometimes the vocabulary given from a public
checkpoint might not be optimal for a certain task.
In these cases, it would be nice to redefine the
vocabulary while still leveraging the checkpoint.
第一的, we remove the embeddings matrices from
the warm-started variables and observe a drop
的 1.7 points using the BERTSHARE setup and 11
points using the GPT setup (桌子 6). The latter
is probably due to the large vocab of the GPT-2
模型, which now remains random-initialized.
We then train a new BPE model with 16k
tokens using the DiscoFuse training data (Kudo
and Richardson, 2018; Sennrich et al., 2016).
We observe almost no change on BERTSHARE,
suggesting that the BERT vocabulary was already
optimal for DiscoFuse. GPT, 然而, showed a
significant improvement using this much smaller
vocabulary but is still behind the fully initialized
setup. 最后, we experimented with a more
sensitive way of training the model, meaning that
we fix all warm-started variables for 100k steps.
During this pre-training phase, we only train the
new word embeddings. After the pre-training, 我们
fine-tune the entire model for another 300k steps.
This training scheme resulted in an improvement
的 0.5 for the BERTSHARE setup, but overall the
number is still considerably behind the fully
initialized setup. For GPT, this training scheme
did not result in a satisfying training curve.

Initializing a Subset of Layers. Motivated by
the results of using 24 layers, we want to investi-
gate whether only a subset of these 24 layers can
be used. To account for the larger hidden layer size
(1,024 与. 768) and filter size (4,096 与. 3,072),
we limit ourselves to using only 10 layers and the
embedding matrix of this model. This model still

272

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
3
1
3
1
9
2
3
4
2
2

/

/
t

A
C
_
A
_
0
0
3
1
3
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

DiscoFuse
embeddings from checkpoint
+ task specific SentencePieces
+ pre-training SentencePieces

BERTSHARE
89.3
87.5
87.5
88.0

GPT
88.0
77.0
84.2
69.7

桌子 6: SARI scores on the DiscoFuse
dataset when experimenting with different
embedding setups. Each row also includes
the setups of all previous rows.

has more parameters then the base model (324中号
与. 221M for BERT2BERT, 198M vs. 136M for
BERTSHARE) but can be trained with the same batch
尺寸, in a comparable amount of time (3 min/1,000
迭代). As an initial experiment, we used the
第一的 10 layers out of the large BERT checkpoint
to initialize the BERTSHARE setup. This gave us
a SARI score of 88.2 on DiscoFuse, compared
和 89.3 when using the base checkpoint and
compared with 87.0 when using the embeddings
仅有的 (see Initializing Only Embeddings section).
We then performed a hyperparameter search on
the evaluation set using CMA-ES (汉森, 2016)
to find an optimal subset of layers to use. The best
setup used the following layers: 9, 10, 13–18, 23,
24; and achieved a SARI score of 89.1. 虽然
this is a remarkable improvement over using the
第一的 10 layers, this setup is still outperformed by
the base BERT model.

6 Analysis of Abstractive Summaries

最后, we present a qualitative analysis of these
models for text generation. 尤其, we focus
on extreme summarization, which assesses mod-
els ability to do document-level inference and
abstraction. We evaluated summaries from aran-
domly initialized model (RND2RND) and from best
performing models initialized with GPT check-
点 (RND2GPT), BERT checkpoints (BERTSHARE),
and RoBERTa checkpoints (ROBERTASHARE). 我们
also included GOLD summaries in our evaluation.
Results are presented in Table 7.

Human Assessment of Summary Quality. 这
study was conducted on the Amazon Mechanical
Turk platform using Best-Worst Scaling, a less
labor-intensive alternative to paired comparisons
(Louviere and Woodworth, 1991; Louviere et al.,
2015). Our participants were presented with a
document and summaries generated from two out

RND2RND
RND2GPT
BERTSHARE
ROBERTASHARE
GOLD

Length
20.90
21.49
20.71
21.70
24.61

Repetitions Quality
−0.103
−0.303
−0.097
0.153
0.347

29.76
16.28
27.03
28.68
4.66

桌子 7: Qualitative and human evaluations
of BBC extreme summaries. The lowest
numbers for repetitions and the highest
numbers for quality are boldfaced. See the
text for details.

of five systems (four models and gold summaries)
and were asked to decide which summary was
better than the other in order of informativeness
(does the summary capture important information
in the document correctly and concisely?) 和
fluency (is the summary written in well-formed
英语?) 我们随机选择了 40 文件
from the XSum test set. We collected judgments
from three different participants for each compari-
儿子. The order of summaries was randomized
per document and the order of documents was
randomized per participant. The score of a system
was computed as the percentage of times it was
chosen as best minus the percentage of times it
was selected as worst. The scores range from
−1 (worst) 到 1 (最好的). 见图 1 for a few
sample predictions that were used in our human
评估.

Our participants

found the ROBERTASHARE
summaries to be the best in terms of their overall
质量; the BERTSHARE summaries ranked second
after ROBERTASHARE. We further carried out
pairwise comparisons between all models to assess
whether system differences are statistically signi-
ficant.9 We did not observe significant differences
between RND2RND and RND2GPT, RND2RND and
BERTSHARE, 和, ROBERTASHARE and GOLD. 全部
other differences were statistically significant.

Summary Lengths and Repetitions. All mod-
els generated summaries of comparable lengths;
the average length of summaries is 20.90 为了
RND2RND, 21.49 for RND2GPT, 20.71 for BERTSHARE,
和 21.70 for ROBERTASHARE. ROBERTASHARE-
produced summaries were closest to the GOLD
summaries in terms of length (21.70 与. 24.61).

9One-way ANOVA with post hoc Tukey HSD tests;

p < 0.01. 273 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 1 3 1 9 2 3 4 2 2 / / t l a c _ a _ 0 0 3 1 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 1 3 1 9 2 3 4 2 2 / / t l a c _ a _ 0 0 3 1 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Figure 1: Model generated and reference summaries used for human evaluation. Words in orange correspond to incorrect or repeated information. Finally, we estimated the percentage of sum- maries with at least one repetition of rare or content words. We discarded the 500 most common words from the model generated and reference summaries, the rest were considered as rare or content words. BERTSHARE and ROBERTASHARE summaries improve over the RND2RND summaries, but have more repetitions than the RND2GPT sum- maries. See examples in Figure 1 for redundant repeated spans marked in orange. Overall, BERTSHARE and ROBERTASHARE sum- maries are unequivocally better than RND2GPT summaries in terms of both automatic evalua- tions (assessing ROUGE) and human evaluations (assessing summary quality); there is still room for improvements in these models (Dong et al., 2019; Song et al., 2019; Lewis et al., 2019). 7 Related Work Representation Learning. Starting around 2013, word embeddings like word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) became popular as they were easy to train in an unsupervised fashion on raw text and they improved several downstream tasks when used as features. These word embeddings are invariant to the context in which we the word. There has been previously work to contextualize these embeddings, mainly to account for synonyms (e.g., Huang et al., 2012; Rothe and Sch¨utze, 2015) but only in 2018 did training of the contextualized embeddings using large deep neural networks and an unsupervised training scheme become popular. 274 Whereas ELMo (Peters et al., 2018) and ULMFiT (Howard and Ruder, 2018) are based on LSTMs (Hochreiter and Schmidhuber, 1997), BERT and GPT are based on the transformer archi- tecture (Vaswani et al., 2017). This architecture outperforms LSTMs on several NLP tasks and we therefore concentrated on these two pre-trained models. The contextualized embedding for each input token is given by the corresponding output of the last encoder layer. Pre-training Models. One can also see these models as pre-trained models (Dai and Le, 2015), which are then fine-tuned for a downstream task. This is the conceptual view we adopted for this paper. Why unsupervised pre-training helps deep learning was investigated by Erhan et al. (2010). While the unsupervised pre-training strategies are different from those used in our paper, we expect the findings to still hold. They show that unsuper- vised pre-training is not simply a way of getting a good initial marginal distribution, that classical regularization techniques cannot achieve the same performance as unsupervised pre-training, and that the effect of unsupervised pre-training does not go away with more training data. An extensive study of pre-training was done by Wang et al. (2019a). This study compares single sentence classifica- tion, sentence pair classification, seq2seq and language modeling tasks for pre-training, and measures the effect on GLUE. The primary results support the use of language modeling. Peters et al. (2019) explore whether it is preferable to fine-tune the entire model on a specific task or to use the learned representations as features (i.e., freezing the pre-trained model). Their results suggest that the relative performance of fine-tuning vs. fea- ture extraction depends on the similarity between the pre-training and the target tasks. Wang et al. (2019b) propose a combination of both, where first the model is trained with the BERT param- eters being frozen and then the entire model is fine-tuned. This is the training scheme we used in the Initializing Only Layers section. Pre-training for Sequence Generation. Pre- training for seq2seq learning was first done by Ramachandran et al. (2017). They used a language model to pre-train the encoder and decoder of an RNN seq2seq model. Their method improved BLEU scores on newstest2014 by 3 points and ROUGE-L on CNN/DailyMail also by 3 points. However, their BLEU score of 24.7 on newstest2014 En→De, compared to 30.6 in this work, and 29.4 ROUGE-L on CNN/DailyMail, compared with 36.33, also show the superiority of the transformer model as well as the masked language model objective of BERT. MASS (Song et al., 2019) is a BERT-inspired method of pre-training seq2seq models. One advantage of this method is that, in contrast to our setups (except for GPT), the encoder–decoder attention mechanism is also pre- trained. The downside of this approach is that the pre-trained model is task-specific and not as general as BERT or GPT-2. UniLM (Dong et al., 2019) also unifies bidirectional, unidirectional, and seq2seq language modeling. At the time of writing, no public checkpoint was available to us. We compare our work with their results in Table 5. To overcome the issue that the encoder-decoder attention is not pre-trained, Khandelwal et al. (2019) pre-trained a single transformer language model that encodes the source and generates the target. This setup matches our GPT setup. Conneau and Lample (2019) pre-train their model using casual language modeling (like GPT), masked language modeling (like BERT) and a third new objective called translation language modeling to improve cross-lingual pre-training. Leveraging Public Checkpoints. BERT has been used for various NLP tasks, such as question answering on the SQuAD dataset (Rajpurkar et al., 2018). It also achieved new state-of-the-art results on the GLUE benchmark (Williams et al., 2018) and grounded commonsense inference (SWAG, Zellers et al., 2018). All of these tasks are a form of classification or regression. Liu (2019) fine-tuned BERT for extractive summarization. this would mean that An analysis of different layers of the BERT model was performed by Tenney et al. (2019). They found that the classical NLP pipeline appears in the expected sequence. In the context of our experiments in the Initializing a Subset of Layers section, the DiscoFuse task profits the most from pre-trained informa- tion about POS, constituents, dependencies, and semantic roles. A similar study by Jawahar et al. (2019) found that BERT captures phrase-level information in the lower layers and linguistic information in intermediate layers, with surface features at the bottom, syntactic features in the middle, and semantic features at the top. GPT was also evaluated on natural language inference tasks. In the extended version of GPT-2, 275 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 1 3 1 9 2 3 4 2 2 / / t l a c _ a _ 0 0 3 1 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 the model was evaluated on more general natural language processing tasks, like machine transla- tion, reading comprehension, summarization, and language modeling. GPT-2 achieved new state- of-the-art results on several language modeling datasets. On the other tasks, GPT-2 outperformed some unsupervised baselines but is still far behind supervised or task-specific approaches. After we performed the majority of our experi- ments, XLNet (Yang et al., 2019), an autoregres- sive pre-training method based on Transformer XL (Dai et al., 2019), was released. XLNet achieved new state-of-the-art results on several NLP tasks. We leave the experiments with their public checkpoint for future work. 8 Conclusion We performed an extensive study on leveraging pre-trained checkpoints for sequence generation. Our findings show that a pre-trained encoder is an essential part. Most tasks also profit from sharing the weights between the encoder and the decoder, which additionally decreases the memory footprint. While combing BERT and GPT-2 into a single model often underperformed a randomly initialized baseline, combining RoBERTa and GPT-2 achieved strong results and shows the importance of sharing the vocabulary. Training a language-specific BERT model also improves performance over using the multilingual version. Acknowledgments We thank the reviewers and the action editor for their feedback. We would like to thank Ryan McDonald, Joshua Maynez, and Bernd Bohnet for useful discussions. References Roee Aharoni and Yoav Goldberg. 2018. Split and rephrase: Better evaluation and stronger baselines. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 719–724. Melbourne, Australia. Association for Computational Linguistics. Jan A. Botha, Manaal Faruqui, John Alex, Jason Baldridge, and Dipanjan Das. 2018. Learning to split and rephrase from Wikipedia edit his- tory. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Pro- cessing, pages 732–737. Brussels, Belgium. Association for Computational Linguistics. William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit. 2019. KERMIT: Generative insertion-based modeling for sequences. CoRR, abs/1906.01604v1. Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining, In H. Wallach, H. Larochelle, A. Beygelzimer, F. Alch´e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 7057–7067. Curran Asso- ciates, Inc. Andrew M. Dai and Quoc V. Le. 2015. Semi- supervised sequence learning, In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Infor- mation Processing Systems 28, pages 3079–3087. Curran Associates, Inc. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre- training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics. Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In H. Wallach, H. Larochelle, A. Beygelzimer, F. Alch´e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 13042–13054. Curran Associates, Inc. 276 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 1 3 1 9 2 3 4 2 2 / / t l a c _ a _ 0 0 3 1 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back- translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computa- tional Linguistics. Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660. Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1243–1252. Sydney, Australia. Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summariza- tion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Pro- cessing, pages 4098–4109, Brussels, Belgium. Association for Computational Linguistics. Mor Geva, Eric Malmi, Idan Szpektor, and Jonathan Berant. 2019. DiscoFuse: A large- scale dataset for discourse-based sentence fu- sion. In Proceedings of the 2019 Conference of the North American Chapter of the Associ- ation for Computational Linguistics: Human Language Technologies, pages 3443–3455, Minneapolis, Minnesota. Association for Com- putational Linguistics. Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mech- anism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1631–1640, Berlin, Germany. Associa- tion for Computational Linguistics. Nikolaus Hansen. 2016. The CMA evolution strategy: A tutorial. CoRR, abs/1604.00772. Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu. 2018. Layer-wise coordination between encoder and decoder for neural machine translation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Information editors, Advances Processing Systems 31, pages 7944–7954, Curran Associates, Inc. in Neural Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc. Sepp Hochreiter and J¨urgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. Jeremy Howard and Sebastian Ruder. 2018. Uni- versal language model fine-tuning for text clas- sification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 328–339, Melbourne, Australia. Association for Computational Linguistics. Eric Huang, Richard Socher, Christopher Manning, and Andrew Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 873–882, Jeju Island, Korea. Association for Compu- tational Linguistics. Ganesh Jawahar, Benoˆıt Sagot, and Djam´e Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics. Urvashi Khandelwal, Kevin Clark, Dan Jurafsky, and Lukasz Kaiser. 2019. Sample efficient text summarization using a single pre-trained trans- former. CoRR, abs/1905.08836. Byeongchang Kim, Hyunwoo Kim, and Gunhee Kim. 2019. Abstractive summarization of 277 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 1 3 1 9 2 3 4 2 2 / / t l a c _ a _ 0 0 3 1 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Reddit posts with multi-level memory net- works. In Proceedings of the 2019 Conference of the North American Chapter of the Associ- ation for Computational Linguistics: Human Language Technologies, pages 2519–2531. Minneapolis, Minnesota. Association for Com- putational Linguistics. Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated giga- word. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 95–100, Montreal, Canada. Association for Computa- tional Linguistics. Taku Kudo and John Richardson. 2018. Senten- cePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language genera- tion, translation, and comprehension. CoRR, abs/1910.13461. Chin Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co- occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 150–157. Yang Liu. 2019. Fine-tune BERT for extractive summarization. CoRR, abs/1903.10318. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692. Jordan J. Louviere, Terry N. Flynn, and Anthony Alfred John Marley. 2015. Best-worst scaling: Theory, methods and applications. Cambridge University Press. Jordan J. Louviere and George G. Woodworth. 1991. Best-worst scaling: A model for the lar- gest difference judgments. University of Alberta, Working Paper. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781. 278 Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018a. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. the 2018 Conference on In Proceedings of Empirical Methods in Natural Language Pro- cessing, pages 1797–1807, Brussels, Belgium. Association for Computational Linguistics. Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018b. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1747–1759, New Orleans, Louisiana. Association for Computational Lin- guistics. Shashi Narayan, Claire Gardent, Shay B. Cohen, and Anastasia Shimorina. 2017. Split and rephrase. In Proceedings of the 2017 Con- ference on Empirical Methods in Natural Language Processing, pages 606–616. Copen- hagen, Denmark. Association for Computa- tional Linguistics. Jeffrey Socher, Pennington, Richard and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceed- the 2014 Conference on Empirical ings of Methods in Natural Language Processing, pages 1532–1543. Doha, Qatar. Association for Computational Linguistics. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextu- alized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2227–2237. New Orleans, Louisiana. Association for Computational Linguistics. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 1 3 1 9 2 3 4 2 2 / / t l a c _ a _ 0 0 3 1 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Repre- sentation Learning for NLP, pages 7–14. Florence, Italy. Association for Computational Linguistics. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre- training. Technical report, OpenAI. Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI. Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswer- able questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 784–789, Melbourne, Australia. Association for Compu- tational Linguistics. Prajit Ramachandran, Peter Liu, and Quoc Le. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383–391. Copenhagen, Denmark. Association for Com- putational Linguistics. Sascha Rothe and Hinrich Sch¨utze. 2015. AutoEx- tend: Extending word embeddings to embed- dings for synsets and lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th Inter- national Joint Conference on Natural Language Processing, pages 1793–1803, Beijing, China. Association for Computational Linguistics. Abigail See, Peter J. Liu, and Christopher D. to the point: Summa- Manning. 2017. Get rization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1073–1083, Vancouver, Canada. Asso- ciation for Computational Linguistics. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1715–1725, Berlin, Germany. Association for Computa- tional Linguistics. Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 464–468, New Orleans, Louisiana. Association for Com- putational Linguistics. Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language genera- tion. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 5926–5936. PMLR. Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, Berlin Chen, Benjamin Van Durme, Edouard Grave, Ellie Pavlick, and Samuel R. Bowman. 2019a. Can you tell me how to get past sesame street? Sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4465–4476, Association for Computational Linguistics. Ran Wang, Haibo Su, Chunye Wang, Kailin Ji, and Jupeng Ding. 2019b. To tune or not to tune? how about the best of both worlds? CoRR, abs/1907.05338. Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through 279 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 1 3 1 9 2 3 4 2 2 / / t l a c _ a _ 0 0 3 1 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 inference. In Proceedings of the 2018 Confer- ence of the North American Chapter of the Asso- ciation for Computational Linguistics: Human Language Technologies, pages 1112–1122, New Orleans, Louisiana. Association for Com- putational Linguistics. Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimi- zing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237. Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Confe- rence on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 1 3 1 9 2 3 4 2 2 / / t l a c _ a _ 0 0 3 1 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 280
下载pdf