SpanBERT: Improving Pre-training by Representing
and Predicting Spans

Mandar Joshi∗† Danqi Chen∗‡§ Yinhan Liu§
Daniel S. Weld†◊ Luke Zettlemoyer†§ Omer Levy§

† Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
{mandar90,weld,lsz}@cs.washington.edu
‡ Computer Science Department, Princeton University, Princeton, NJ
danqic@cs.princeton.edu
◊ Allen Institute for Artificial Intelligence, Seattle
{danw}@allenai.org
§ Facebook AI Research, Seattle
{danqi,yinhanliu,lsz,omerlevy}@fb.com

Abstract

We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERTlarge, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE.1

1 Introduction

Pre-training methods like BERT (Devlin et al.,
2019) have shown strong performance gains using
self-supervised training that masks individual words
or subword units. However, many NLP tasks in-
volve reasoning about relationships between two
or more spans of text. 例如, in extractive
question answering (Rajpurkar et al., 2016),

∗Equal contribution.
1Our code and pre-trained models are available at https://

github.com/facebookresearch/SpanBERT.


determining that the ‘‘Denver Broncos’’ is a type of
‘‘NFL team’’ is critical for answering the ques-
tion ‘‘Which NFL team won Super Bowl 50?’’
Such spans provide a more challenging target
for self-supervision tasks; for example, predicting ‘‘Denver Broncos’’ is much harder than predicting only ‘‘Denver’’ when you know the next word is ‘‘Broncos’’. In this paper, we introduce a span-
level pretraining approach that consistently out-
performs BERT, with the largest gains on span
selection tasks such as question answering and
coreference resolution.

We present SpanBERT, a pre-training method
that is designed to better represent and predict
spans of text. Our method differs from BERT in
both the masking scheme and the training objec-
tives. First, we mask random contiguous spans, rather than random individual tokens. Second, we introduce a novel span-boundary objective (SBO) so the model learns to predict the entire masked span from the observed tokens at its boundary. Span-based masking forces the model to predict entire spans solely using the context in which they appear. Furthermore, the SBO encourages the model to store this span-level information at the boundary tokens, which can be easily accessed during the fine-tuning stage. Figure 1 illustrates our approach.

To implement SpanBERT, we build on a well-
tuned replica of BERT, which itself substantially
outperforms the original BERT. While building on
our baseline, we find that pre-training on single
segments, instead of two half-length segments
with the next sentence prediction (NSP) objective,


Figure 1: An illustration of SpanBERT training. The span an American football game is masked. The SBO uses the output representations of the boundary tokens, x4 and x9 (in blue), to predict each token in the masked span. The equation shows the MLM and SBO loss terms for predicting the token, football (in pink), which, as marked by the position embedding p3, is the third token from x4.

considerably improves performance on most downstream tasks. Therefore, we add our modifications on top of the tuned single-sequence BERT baseline.

Together, our pre-training process yields models that outperform all BERT baselines on a wide variety of tasks, and reach substantially better performance on span selection tasks in particular. Specifically, our method reaches 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 (Rajpurkar et al., 2016, 2018), respectively—reducing error
by as much as 27% compared with our tuned
BERT replica. We also observe similar gains
on five additional extractive question answering
benchmarks (NewsQA, TriviaQA, SearchQA,
HotpotQA, and Natural Questions).2

SpanBERT also arrives at a new state of the art
on the challenging CoNLL-2012 (‘‘OntoNotes’’)
shared task for document-level coreference resolution, where we reach 79.6% F1, exceeding the previous top model by 6.6% absolute. Finally, we demonstrate that SpanBERT also helps on tasks that do not explicitly involve span selection, and show that our approach even improves performance on TACRED (Zhang et al., 2017) and GLUE (Wang et al., 2019).

Whereas others show the benefits of adding
more data (Yang et al., 2019) and increasing model size (Lample and Conneau, 2019), this
work demonstrates the importance of designing

2We use the modified MRQA version of these datasets.

See more details in Section 4.1.

good pre-training tasks and objectives, which can
also have a remarkable impact.

2 Background: BERT

BERT (Devlin et al., 2019) is a self-supervised
approach for pre-training a deep transformer encoder (Vaswani et al., 2017), before fine-tuning it for a particular downstream task. BERT optimizes two training objectives—masked language model (MLM) and next sentence prediction (NSP)—which only require a large collection of unlabeled text.

Notation Given a sequence of word or sub-
word tokens X = (x1, x2, . . . , xn), BERT trains
an encoder that produces a contextualized vector
representation for each token:

enc(x1, x2, . . . , xn) = x1, x2, . . . , xn.

Masked Language Model Also known as a
cloze test, MLM is the task of predicting missing
tokens in a sequence from their placeholders.
Specifically, a subset of tokens Y ⊆ X is sampled and substituted with a different set of tokens. In BERT's implementation, Y accounts for 15% of the tokens in X; of those, 80% are replaced with [MASK], 10% are replaced with a random token (according to the unigram distribution), and 10%
are kept unchanged. The task is to predict the
original tokens in Y from the modified input.

BERT selects the tokens in Y independently and uniformly at random. In SpanBERT, we define Y by randomly selecting contiguous spans (Section 3.1).
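To make the 80/10/10 replacement rule above concrete, the following sketch (a minimal illustration rather than BERT's actual implementation; the token list and vocabulary are placeholders) applies the corruption to an already-selected set of positions Y:

```python
import random

def corrupt_masked_positions(tokens, masked_positions, vocab, mask_token="[MASK]"):
    """Apply the 80/10/10 rule to the positions selected for prediction.

    tokens: list of token strings; masked_positions: indices in Y.
    Returns the corrupted input; the prediction targets remain the original tokens.
    """
    corrupted = list(tokens)
    for i in masked_positions:
        r = random.random()
        if r < 0.8:                        # 80%: replace with [MASK]
            corrupted[i] = mask_token
        elif r < 0.9:                      # 10%: replace with a random vocabulary token
            corrupted[i] = random.choice(vocab)
        # remaining 10%: keep the original token unchanged
    return corrupted
```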


Next Sentence Prediction The NSP task takes
two sequences (XA, XB) as input, and predicts
whether XB is the direct continuation of XA. This is implemented in BERT by first reading XA from the corpus, and then (1) either reading XB from the point where XA ended, or (2) randomly sampling XB from a different point in the corpus. The two sequences are separated by a special [SEP] token. In addition, a special [CLS] token is added to XA, XB to form the input, where the target of [CLS] is whether XB indeed follows XA in the corpus.

In summary, BERT optimizes the MLM and the NSP objectives by masking word pieces uniformly at random in data generated by the bi-sequence sampling procedure. In the next section, we will present our modifications to the data pipeline, masking, and pre-training objectives.

3 Model

We present SpanBERT, a self-supervised pre-
training method designed to better represent and
predict spans of text. Our approach is inspired
by BERT (Devlin et al., 2019), but deviates from
its bi-text classification framework in three ways.
First, we use a different random process to mask spans of tokens, rather than individual ones. We also introduce a novel auxiliary objective—the SBO—which tries to predict the entire masked span using only the representations of the tokens at the span's boundary. Finally, SpanBERT samples a single contiguous segment of text for each training example (instead of two), and thus does not use BERT's next sentence prediction objective.

3.1 Span Masking
Given a sequence of tokens X = (x1, x2, . . . , xn),
we select a subset of tokens Y ⊆ X by iteratively
sampling spans of text until the masking budget
(e.g., 15% of X) has been spent. At each iteration, we first sample a span length (number of words) from a geometric distribution ℓ ∼ Geo(p), which is skewed towards shorter spans. We then randomly (uniformly) select the starting point for the span to be masked. We always sample a sequence of complete words (instead of subword tokens) and the starting point must be the beginning of one word. Following preliminary trials,3 we set

3We experimented with p = {0.1, 0.2, 0.4} and found 0.2

to perform the best.

Figure 2: We sample random span lengths from a geometric distribution ℓ ∼ Geo(p = 0.2) clipped at ℓmax = 10.

p = 0.2, and also clip ℓ at ℓmax = 10. This yields a mean span length of mean(ℓ) = 3.8. Figure 2 shows the distribution of span mask lengths.

As in BERT, we also mask 15% of the tokens
in total: replacing 80% of the masked tokens with
[MASK], 10% with random tokens, and 10% with the original tokens. However, we perform this replacement at the span level and not for each token individually; that is, all the tokens in a span
are replaced with [MASK] or sampled tokens.
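The sampling procedure of this section can be sketched as follows (a minimal illustration in Python; the function and argument names are ours, the sketch works over token positions rather than whole words, and clipping is implemented by re-sampling, which reproduces the quoted mean of roughly 3.8):

```python
import numpy as np

def sample_span_mask(seq_len, mask_budget=0.15, p=0.2, max_len=10, rng=np.random):
    """Sample token positions to mask by drawing spans until ~15% of the
    sequence is covered. Span lengths follow Geo(p=0.2) clipped at 10;
    starting points are uniform."""
    budget = int(round(seq_len * mask_budget))
    masked = set()
    while len(masked) < budget:
        length = rng.geometric(p)           # samples 1, 2, 3, ...
        if length > max_len:                # clip by re-sampling over-long spans
            continue
        length = min(length, budget - len(masked))
        start = rng.randint(0, seq_len - length + 1)
        masked.update(range(start, start + length))
    return sorted(masked)

# Mean of the clipped length distribution, matching the ~3.8 figure quoted above.
lengths = np.arange(1, 11)
probs = 0.2 * 0.8 ** (lengths - 1)
print(round(float((lengths * probs).sum() / probs.sum()), 1))  # 3.8
```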

3.2 Span Boundary Objective

Span selection models (Lee et al., 2016, 2017;
He et al., 2018) typically create a fixed-length
representation of a span using its boundary tokens
(start and end). To support such models, we would
ideally like the representations for the end of the
span to summarize as much of the internal span
content as possible. We do so by introducing a
span boundary objective that involves predicting
each token of a masked span using only the representations of the observed tokens at its boundary (Figure 1).

Formally, we denote the output of the transformer encoder for each token in the sequence by x1, . . . , xn. Given a masked span of tokens (xs, . . . , xe) ∈ Y, where (s, e) indicates its start and end positions, we represent each token xi in the span using the output encodings of the external boundary tokens xs−1 and xe+1, as well as the position embedding of the target token pi−s+1:

yi = f(xs−1, xe+1, pi−s+1)


where position embeddings p1, p2, . . . mark rela-
tive positions of the masked tokens with respect
to the left boundary token xs−1. We implement
the representation function f (·) as a 2-layer
feed-forward network with GeLU activations
(Hendrycks and Gimpel, 2016) and layer normal-
化 (Ba et al., 2016):

h0 = [xs−1; xe+1; pi−s+1]
h1 = LayerNorm (GeLU(W1h0))
yi = LayerNorm (GeLU(W2h1))

We then use the vector representation yi to predict
the token xi and compute the cross-entropy loss
exactly like the MLM objective.

SpanBERT sums the loss from both the span
boundary and the regular masked language model
objectives for each token xi in the masked span
(xs, . . . , xe), while reusing the input embedding
(Press and Wolf, 2017) for the target tokens in
both MLM and SBO:

L(xi) = LMLM(xi) + LSBO(xi)
      = − log P (xi | xi) − log P (xi | yi)

where the MLM term conditions on the transformer output for xi and the SBO term conditions on the span boundary representation yi.
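A minimal PyTorch sketch of the SBO prediction head described by these equations (module names and dimension defaults are ours; in SpanBERT the output projection is tied to the input embeddings, which the stand-in linear layer below does not do):

```python
import torch
import torch.nn as nn

class SBOHead(nn.Module):
    """Predict a masked token from the encodings of the two boundary tokens
    plus a relative position embedding, via a 2-layer feed-forward network
    with GeLU activations and layer normalization."""

    def __init__(self, hidden_size=1024, pos_size=200, max_rel_pos=512, vocab_size=28996):
        super().__init__()
        # vocab_size here is a placeholder for the cased WordPiece vocabulary.
        self.pos_emb = nn.Embedding(max_rel_pos, pos_size)
        self.ffn = nn.Sequential(
            nn.Linear(2 * hidden_size + pos_size, hidden_size), nn.GELU(), nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size), nn.GELU(), nn.LayerNorm(hidden_size),
        )
        # Stand-in output projection; SpanBERT reuses the input embedding matrix here.
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, x_start, x_end, rel_pos):
        # x_start, x_end: (batch, hidden) encodings of x_{s-1} and x_{e+1}
        # rel_pos: (batch,) position of the target token relative to x_{s-1}
        h0 = torch.cat([x_start, x_end, self.pos_emb(rel_pos)], dim=-1)
        return self.decoder(self.ffn(h0))   # logits, trained with cross-entropy as in MLM
```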

3.3 Single-Sequence Training

As described in Section 2, BERT’s examples con-
tain two sequences of text (XA, XB), 和
objective that trains the model to predict whether
they are connected (NSP). We find that this set-
ting is almost always worse than simply using a
single sequence without the NSP objective (看
部分 5 for further details). We conjecture that
single-sequence training is superior to bi-sequence
training with NSP because (A) the model benefits
from longer full-length contexts, 或者 (乙) condi-
tioning on, often unrelated, context from an-
other document adds noise to the masked language
模型. 所以, in our approach, we remove
both the NSP objective and the two-segment sam-
pling procedure, and simply sample a single con-
tiguous segment of up to n = 512 代币, 相当
than two half-segments that sum up to n tokens
一起.

In summary, SpanBERT pre-trains span representations by (1) masking spans of full words using a geometric distribution based masking scheme (Section 3.1) and (2) optimizing an auxiliary span boundary objective (Section 3.2) in addition to MLM, using a single-sequence data pipeline (Section 3.3). A procedural description can be found in Appendix A.

4 Experimental Setup

4.1 Tasks

We evaluate on a comprehensive suite of tasks, including seven question answering tasks, coreference resolution, nine tasks in the GLUE benchmark (Wang et al., 2019), and relation extraction. We expect that the span selection tasks, question answering and coreference resolution, will particularly benefit from our span-based pre-training.

Extractive Question Answering Given a short
passage of text and a question as input, the task
of extractive question answering is to select a
contiguous span of text in the passage as the
answer.

We first evaluate on SQuAD 1.1 and 2.0 (Rajpurkar et al., 2016, 2018), which have served as major question answering benchmarks, particularly for pre-trained models (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019). We also evaluate on five more datasets from the MRQA shared task (Fisch et al., 2019)4: NewsQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), TriviaQA (Joshi et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019). Because the MRQA shared task does not have a public test set, we split the development set in half to make new development and test sets. These datasets vary in both domain and collection methodology, making this collection a good test bed for evaluating whether our pre-trained models can generalize well across different data distributions.

Following BERT (Devlin et al., 2019), we use the same QA model architecture for all the datasets. We first convert the passage P = (p1, p2, . . . , pl) and question Q = (q1, q2, . . . , ql′) into a single sequence X = [CLS]p1p2 . . . pl[SEP]q1q2 . . . ql′[SEP], pass it to the pre-trained trans-
former encoder, and train two linear classifiers
independently on top of it for predicting the answer
span boundary (start and end). For the unanswer-
able questions in SQuAD 2.0, we simply set the

4https://github.com/mrqa/MRQA-Shared-
Task-2019. MRQA changed the original datasets to unify
them into the same format, e.g., all the contexts are truncated
to a maximum of 800 tokens and only answerable questions
are kept.


answer span to be the special token [CLS] for both training and testing.
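The QA architecture described above amounts to two position-wise linear classifiers over the encoder outputs; the following sketch (our own minimal rendering, not the fine-tuning code used in the paper) shows the heads and the loss:

```python
import torch
import torch.nn as nn

class SpanQAHead(nn.Module):
    """Two independent linear classifiers over the encoder states that score
    every position as the answer start or end."""

    def __init__(self, hidden_size=1024):
        super().__init__()
        self.start_classifier = nn.Linear(hidden_size, 1)
        self.end_classifier = nn.Linear(hidden_size, 1)

    def forward(self, encoder_states):            # (batch, seq_len, hidden)
        start_logits = self.start_classifier(encoder_states).squeeze(-1)
        end_logits = self.end_classifier(encoder_states).squeeze(-1)
        return start_logits, end_logits            # each (batch, seq_len)

def qa_loss(start_logits, end_logits, start_positions, end_positions):
    # For unanswerable SQuAD 2.0 questions, both gold positions point at [CLS] (index 0).
    ce = nn.CrossEntropyLoss()
    return ce(start_logits, start_positions) + ce(end_logits, end_positions)
```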

Coreference Resolution Coreference resolu-
tion is the task of clustering mentions in text which
refer to the same real-world entities. We evaluate
on the CoNLL-2012 shared task (Pradhan et al.,
2012) for document-level coreference resolution. We use the independent version of the Joshi et al. (2019b) implementation of the higher-order coreference model (Lee et al., 2018). The document is divided into non-overlapping segments of
independently by the pre-trained transformer
encoder, which replaces the original LSTM-based
encoder. For each mention span x, the model
learns a distribution P (·) over possible antecedent
spans Y :

P(y) = exp(s(x, y)) / Σ_{y′ ∈ Y} exp(s(x, y′))

The span pair scoring function s(x, y) is a feedforward neural network over fixed-length span representations and hand-engineered features over x and y:

s(x, y) = sm(x) + sm(y) + sc(x, y)
sm(x) = FFNNm(gx)
sc(x, y) = FFNNc(gx, gy, φ(x, y))

Here gx and gy denote the span representations,
which are a concatenation of the two transformer
output states of the span endpoints and an attention
vector computed over the output representations
of the tokens in the span. FFNNm and FFNNc represent two feedforward neural networks with one hidden layer, and φ(x, y) represents the hand-engineered features (e.g., speaker and genre information). A more detailed description of the model can be found in Joshi et al. (2019b).
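A schematic rendering of the scoring function s(x, y) above (a sketch with our own module names and hidden size; the full model of Lee et al. (2018) also includes span pruning and additional features):

```python
import torch
import torch.nn as nn

class SpanPairScorer(nn.Module):
    """s(x, y) = s_m(x) + s_m(y) + s_c(x, y): a mention score for each span plus
    a pairwise antecedent score. g_x, g_y are fixed-length span representations;
    phi_xy embeds the hand-engineered features (speaker, genre, ...)."""

    def __init__(self, span_dim, feature_dim, hidden=150):
        super().__init__()
        self.ffnn_m = nn.Sequential(nn.Linear(span_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.ffnn_c = nn.Sequential(
            nn.Linear(2 * span_dim + feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, g_x, g_y, phi_xy):
        s = self.ffnn_m(g_x) + self.ffnn_m(g_y) + self.ffnn_c(torch.cat([g_x, g_y, phi_xy], dim=-1))
        return s  # P(y) is the softmax of these scores over candidate antecedents y
```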

Relation Extraction TACRED (Zhang et al.,
2017) is a challenging relation extraction dataset.
Given one sentence and two spans within it—
subject and object—the task is to predict the
relation between the spans from 42 pre-defined
relation types, including no relation. We follow
the entity masking schema from Zhang et al.
(2017) and replace the subject and object entities
by their NER tags such as ‘‘[CLS] [SUBJ-PER]

5The length was chosen from {128, 256, 384, 512}. See

more details in Appendix B.

was born in [OBJ-LOC] , Michigan, . . . ’’, and
finally add a linear classifier on top of the [CLS]
token to predict the relation type.
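The entity masking schema can be illustrated as follows (a toy sketch; the example sentence, helper name, and span convention are ours):

```python
def mask_entities(tokens, subj_span, obj_span, subj_type, obj_type):
    """Replace the subject and object mentions with their NER-tag placeholders,
    following the entity masking schema of Zhang et al. (2017). Spans are
    (start, end) token indices, inclusive; types are NER labels such as 'PER'."""
    out, i = [], 0
    while i < len(tokens):
        if i == subj_span[0]:
            out.append(f"[SUBJ-{subj_type}]")
            i = subj_span[1] + 1
        elif i == obj_span[0]:
            out.append(f"[OBJ-{obj_type}]")
            i = obj_span[1] + 1
        else:
            out.append(tokens[i])
            i += 1
    return ["[CLS]"] + out   # a linear classifier over the [CLS] encoding predicts the relation

tokens = "Barack Obama was born in Honolulu , Hawaii .".split()
print(mask_entities(tokens, (0, 1), (5, 5), "PER", "LOC"))
# ['[CLS]', '[SUBJ-PER]', 'was', 'born', 'in', '[OBJ-LOC]', ',', 'Hawaii', '.']
```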

GLUE The General Language Understanding
Evaluation (GLUE) benchmark (Wang et al., 2019) consists of 9 sentence-level classification tasks:

• Two sentence-level classification tasks in-
cluding CoLA (Warstadt et al., 2018)
for evaluating linguistic acceptability and
SST-2 (Socher et al., 2013) for sentiment
classification.

• Three sentence-pair similarity tasks includ-
ing MRPC (Dolan and Brockett, 2005), a binary paraphrasing task between sentence pairs from news sources, STS-B (Cer et al., 2017), a graded similarity task for news headlines, and QQP,6 a binary paraphrasing task between Quora question pairs.

• Four natural language inference tasks in-
cluding MNLI (Williams et al., 2018), QNLI
(Rajpurkar et al., 2016), RTE (Dagan et al.,
2005; Bar-Haim et al., 2006; Giampiccolo
等人。, 2007), and WNLI (Levesque et al.,
2011).

Unlike question answering, coreference resolution, and relation extraction, these sentence-level tasks do not require explicit modeling of span-level semantics. However, they might still benefit from implicit span-based reasoning (e.g., the Prime Minister is the head of the government).
Following previous work (Devlin et al., 2019;
Radford et al., 2018),7 we exclude WNLI from
the results to enable a fair comparison. Although recent work (Liu et al., 2019a) has applied several
task-specific strategies to increase performance
on the individual GLUE tasks, we follow BERT’s
single-task setting and only add a linear classi-
fier on top of the [CLS] token for these classifi-
cation tasks.

4.2 Implementation

We reimplemented BERT’s model and pre-
training method in fairseq (Ott et al., 2019).

6https://data.quora.com/First-Quora-

Dataset-Release-Question-Pairs.

7Previous work has excluded WNLI on account of con-
struction issues outlined on the GLUE website – https://
gluebenchmark.com/faq.


We used the model configuration of BERTlarge
as in Devlin et al. (2019) and also pre-trained all
our models on the same corpus: BooksCorpus and
English Wikipedia using cased Wordpiece tokens.
Compared with the original BERT implementation, the main differences in our implementation include: (a) We use different masks at each epoch, whereas BERT samples 10 different masks for each sequence during data processing. (b) We remove all the short-sequence strategies used before (they sampled shorter sequences with a small probability of 0.1; they also first pre-trained with a smaller sequence length of 128 for 90% of the steps). Instead, we always take sequences of up to 512 tokens until a document boundary is reached. We refer readers to Liu et al. (2019b) for further discussion of these modifications and their effects.

As in BERT, the learning rate is warmed up
over the first 10,000 steps to a peak value of 1e-4,
and then linearly decayed. We retain the β hyperparameters (β1 = 0.9, β2 = 0.999) and a decoupled weight decay (Loshchilov and Hutter, 2019) of 0.1. We also keep a dropout of 0.1 across all layers and attention weights, and a GeLU activation function (Hendrycks and Gimpel, 2016). We deviate from the optimization by running for 2.4M steps and using an epsilon of 1e-8 for AdamW (Kingma and Ba, 2015), which converges to a better set of model parameters. Our implementation uses a batch size of 256 sequences with a maximum of 512 tokens.8 For the SBO, we use 200-dimensional position embeddings p1, p2, . . . to mark positions relative to the left boundary token. The pre-training was done on 32 Volta V100 GPUs and took 15 days to complete. Fine-tuning is implemented based on HuggingFace's codebase (Wolf et al., 2019), and more details are given in Appendix B.
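The optimization setup quoted above corresponds to the following sketch (an illustrative PyTorch rendering of the warmup/decay schedule and AdamW settings, not the fairseq configuration actually used):

```python
import torch

def build_optimizer(model, total_steps=2_400_000, warmup_steps=10_000, peak_lr=1e-4):
    """Linear warmup to peak_lr over warmup_steps, then linear decay to zero,
    with AdamW betas (0.9, 0.999), eps 1e-8, and decoupled weight decay 0.1."""
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=peak_lr, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                                # linear warmup
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))  # linear decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```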

4.3 Baselines

We compare SpanBERT to three baselines:

Google BERT The pre-trained models released
by Devlin et al. (2019).9

Our BERT Our reimplementation of BERT
with improved data preprocessing and optimiza-
tion (Section 4.2).

8On average, this is approximately 390 sequences, because some documents have fewer than 512 tokens.

9https://github.com/google-research/bert.

                 SQuAD 1.1       SQuAD 2.0
                 EM     F1       EM     F1
Human Perf.      82.3   91.2     86.8   89.4
Google BERT      84.3   91.3     80.0   83.3
Our BERT         86.5   92.6     82.8   85.9
Our BERT-1seq    87.5   93.3     83.8   86.6
SpanBERT         88.8   94.6     85.7   88.7

Table 1: Test results on SQuAD 1.1 and SQuAD 2.0.

Our BERT-1seq Our reimplementation of BERT
trained on single full-length sequences without
NSP (部分 3.3).

5 结果

We compare SpanBERT to the baselines per task,
and draw conclusions based on the overall trends.

5.1 Per-Task Results

Extractive Question Answering Table 1 shows the performance on both SQuAD 1.1 and 2.0. SpanBERT exceeds our BERT baseline by 2.0% and 2.8% F1, respectively (3.3% and 5.4% over
Google BERT). In SQuAD 1.1, this result ac-
counts for over 27% error reduction, reaching
3.4% F1 above human performance.

Table 2 demonstrates that this trend goes beyond SQuAD, and is consistent in every MRQA dataset. On average, we see a 2.9% F1 improvement from our reimplementation of BERT. Although some gains come from single-sequence training (+1.1%), most of the improvement stems from span masking and the span boundary objective (+1.8%), with particularly large gains on
TriviaQA (+3.2%) and HotpotQA (+2.7%).

Coreference Resolution Table 3 shows the
performance on the OntoNotes coreference res-
olution benchmark. Our BERT reimplementation
improves over the Google BERT model by 1.2% on the average F1 metric and single-sequence training brings another 0.5% gain. Finally, SpanBERT
improves considerably on top of that, achieving a
new state of the art of 79.6% F1 (previous best
result is 73.0%).

Relation Extraction Table 4 shows the perfor-
mance on TACRED. SpanBERT exceeds our
reimplementation of BERT by 3.3% F1 and
achieves close to the current state of the art (Soares et al., 2019)—our model performs better than


                 NewsQA  TriviaQA  SearchQA  HotpotQA  Natural Questions  Avg.
Google BERT      68.8    77.5      81.7      78.3      79.9               77.3
Our BERT         71.0    79.0      81.8      80.5      80.5               78.6
Our BERT-1seq    71.9    80.4      84.0      80.3      81.8               79.7
SpanBERT         73.6    83.6      84.8      83.0      82.5               81.5

Table 2: Performance (F1) on the five MRQA extractive question answering tasks.

                               MUC                  B3                   CEAFφ4
                               P     R     F1       P     R     F1       P     R     F1     Avg. F1
Prev. SotA (Lee et al., 2018)  81.4  79.5  80.4     72.2  69.5  70.8     68.2  67.1  67.6   73.0
Google BERT                    84.9  82.5  83.7     76.7  74.2  75.4     74.6  70.1  72.3   77.1
Our BERT                       85.1  83.5  84.3     77.3  75.5  76.4     75.0  71.9  73.9   78.3
Our BERT-1seq                  85.5  84.1  84.8     77.8  76.7  77.2     75.3  73.5  74.4   78.8
SpanBERT                       85.8  84.8  85.3     78.3  77.9  78.1     76.4  74.2  75.3   79.6

Table 3: Performance on the OntoNotes coreference resolution benchmark. The main evaluation is the average F1 of three metrics: MUC, B3, and CEAFφ4 on the test set.

                               P      R      F1
BERTEM (Soares et al., 2019)   -      -      70.1
BERTEM+MTB∗                    -      -      71.5
Google BERT                    69.1   63.9   66.4
Our BERT                       67.8   67.2   67.5
Our BERT-1seq                  72.4   67.9   70.1
SpanBERT                       70.8   70.9   70.8

Table 4: Test performance on the TACRED relation extraction benchmark. BERTlarge and BERTEM+MTB from Soares et al. (2019) are the current state-of-the-art. ∗: BERTEM+MTB incorporated an intermediate ‘‘matching the blanks’’ pre-training on the entity-linked text based on English Wikipedia, which is not a direct comparison to ours trained only from raw text.

their BERTEM but is 0.7 point behind BERTEM +
MTB, which used entity-linked text for additional
pre-training. Most of this gain (+2.6%) stems from
single-sequence training although the contribution
of span masking and the span boundary objective
is still a considerable 0.7%, resulting largely from
higher recall.

GLUE Table 5 shows the performance on
GLUE.

For most tasks, the different models appear
to perform similarly. Moving to single-sequence


training without the NSP objective substantially
improves CoLA, and yields smaller (but consid-
erable) improvements on MRPC and MNLI. 这
main gains from SpanBERT are in the SQuAD-
based QNLI dataset (+1.3%) and in RTE (+6.9%),
the latter accounting for most of the rise in
SpanBERT’s GLUE average.

5.2 Overall Trends

We compared our approach to three BERT base-
lines on 17 benchmarks, and found that SpanBERT
outperforms BERT on almost every task. In 14 tasks, SpanBERT performed better than all baselines. In two tasks (MRPC and QQP), it performed on par in terms of accuracy with single-sequence trained BERT, but still outperformed the other baselines. In one task (SST-2), Google's BERT baseline performed better than SpanBERT by 0.4% accuracy.

When considering the magnitude of the gains,
it appears that SpanBERT is especially better at
extractive question answering. In SQuAD 1.1,
例如, we observe a solid gain of 2.0% F1
even though the baseline is already well above
human performance. On MRQA, SpanBERT im-
proves between 2.0% (Natural Questions) 和
4.6% (TriviaQA) F1 on top of our BERT baseline.
Finally, we observe that single-sequence train-
ing works considerably better than bi-sequence

               CoLA  SST-2  MRPC       STS-B      QQP        MNLI       QNLI  RTE   (Avg)
Google BERT    59.3  95.2   88.5/84.3  86.4/88.0  71.2/89.0  86.1/85.7  93.0  71.1  80.4
Our BERT       58.6  93.9   90.1/86.6  88.4/89.1  71.8/89.3  87.2/86.6  93.0  74.7  81.1
Our BERT-1seq  63.5  94.8   91.2/87.8  89.0/88.4  72.1/89.5  88.0/87.4  93.0  72.1  81.7
SpanBERT       64.3  94.8   90.9/87.9  89.9/89.1  71.9/89.5  88.1/87.7  94.3  79.0  82.8

Table 5: Test set performance on GLUE tasks. MRPC: F1/accuracy, STS-B: Pearson/Spearman correlations, QQP: F1/accuracy, MNLI: matched/mismatched accuracies, and accuracy for all the other tasks. WNLI (not shown) is always set to majority class (65.1% accuracy) and included in the average.

training with NSP with BERT’s choice of se-
quence lengths for a wide variety of tasks. 这
is surprising because BERT’s ablations showed
gains from the NSP objective (Devlin et al., 2019).
However, the ablation studies still involved bi-sequence data processing (i.e., the pre-training stage only controlled for the NSP objective while still sampling two half-length sequences). We hypothesize that bi-sequence training, as it is implemented in BERT (see Section 2), impedes the model from learning longer-range features, and consequently hurts performance on many downstream tasks.

6 Ablation Studies

We compare our random span masking scheme
with linguistically-informed masking schemes,
and find that masking random spans is a com-
petitive and often better approach. We then study
the impact of the SBO, and contrast it with BERT’s
NSP objective.10

6.1 Masking Schemes

Previous work (Sun et al., 2019) has shown im-
provements in downstream task performance by
masking linguistically informed spans during pre-
training for Chinese data. We compare our ran-
dom span masking scheme with masking of
linguistically informed spans. Specifically, we train the following five baseline models, differing only in the way tokens are masked.

Subword Tokens We sample random Word-
piece tokens, as in the original BERT.

Whole Words We sample random words, and then mask all of the subword tokens in those words. The total number of masked subtokens is approximately 15%.

10To save time and resources, we use the checkpoints at

1.2M steps for all the ablation experiments.

Named Entities At 50% of the time, we sample from named entities in the text, and sample random whole words for the other 50%. The total number of masked subtokens is 15%. Specifically, we run
spaCy’s named entity recognizer (Honnibal and
Montani, 2017)11 on the corpus and select all the
non-numerical named entity mentions as candidates.

Noun Phrases Similar to Named Entities, we sample from noun phrases 50% of the time. The
noun phrases are extracted by running spaCy’s
constituency parser.

Geometric Spans We sample random spans
from a geometric distribution, as in our SpanBERT
(see Section 3.1).

Table 6 shows how different pre-training
masking schemes affect performance on the devel-
opment set of a selection of tasks. All the mod-
els are evaluated on the development sets and are
based on the default BERT setup of bi-sequence
training with NSP; the results are not directly com-
parable to the main evaluation. With the exception
of coreference resolution, masking random spans
is preferable to other strategies. Although linguis-
tic masking schemes (named entities and noun
phrases) are often competitive with random spans, their performance is not consistent; for example,
masking noun phrases achieves parity with ran-
dom spans on NewsQA, but underperforms on
TriviaQA (−1.1% F1).

On coreference resolution, we see that masking
random subword tokens is preferable to any form
of span masking. Nevertheless, we shall see in
the following experiment that combining random
span masking with the span boundary objective
can improve upon this result considerably.

6.2 Auxiliary Objectives

In Section 5, we saw that bi-sequence training
with the NSP objective can hurt performance on

11https://spacy.io/.


                 SQuAD 2.0  NewsQA  TriviaQA  Coreference  MNLI-m  QNLI  GLUE (Avg)
Subword Tokens   83.8       72.0    76.3      77.7         86.7    92.5  83.2
Whole Words      84.3       72.8    77.1      76.6         86.3    92.8  82.9
Named Entities   84.8       72.7    78.7      75.6         86.0    93.1  83.2
Noun Phrases     85.0       73.0    77.7      76.7         86.5    93.2  83.5
Geometric Spans  85.4       73.0    78.8      76.4         87.0    93.3  83.4

Table 6: The effect of replacing BERT's original masking scheme (Subword Tokens) with different masking schemes. Results are F1 scores for QA tasks and accuracy for MNLI and QNLI on the development sets. All the models are based on bi-sequence training with NSP.

                            SQuAD 2.0  NewsQA  TriviaQA  Coref  MNLI-m  QNLI  GLUE (Avg)
Span Masking (2seq) + NSP   85.4       73.0    78.8      76.4   87.0    93.3  83.4
Span Masking (1seq)         86.7       73.4    80.0      76.3   87.3    93.8  83.8
Span Masking (1seq) + SBO   86.8       74.1    80.3      79.0   87.6    93.9  84.0

Table 7: The effects of different auxiliary objectives, given MLM over random spans as the primary objective.

downstream tasks, when compared with single-
sequence training. We test whether this holds true
for models pre-trained with span masking, and also
evaluate the effect of replacing the NSP objective
with the SBO.

Table 7 confirms that single-sequence training
typically improves performance. Adding SBO fur-
ther improves performance, with a substantial gain
on coreference resolution (+2.7% F1) over span
masking alone. Unlike the NSP objective, SBO
does not appear to have any adverse effects.

7 Related Work

Pre-trained contextualized word representations that
can be trained from unlabeled text (Dai and Le,
2015; Melamud et al., 2016; Peters et al., 2018)
have had an immense impact on NLP lately, particularly as methods for initializing a large model before fine-tuning it for a specific task (Howard
and Ruder, 2018; Radford et al., 2018; Devlin
等人。, 2019). Beyond differences in model hyper-
parameters and corpora, these methods mainly
differ in their pre-training tasks and loss functions,
with a considerable amount of contemporary liter-
ature proposing augmentations of BERT’s MLM
客观的.

While previous and concurrent work has looked
at masking (Sun et al., 2019) or dropping (Song et al., 2019; Chan et al., 2019) multiple words from the input—particularly as pretraining for language generation tasks—SpanBERT pretrains span representations (Lee et al., 2016), which are widely used for question answering, coreference resolution, and a variety of other tasks. ERNIE (Sun et al., 2019) shows improvements on Chinese NLP tasks using phrase and named entity masking. MASS (Song et al., 2019) focuses on language
generation tasks, and adopts the encoder-decoder
framework to reconstruct a sentence fragment
given the remaining part of the sentence. We attempt to more explicitly model spans using the
SBO objective, and show that (geometrically dis-
tributed) random span masking works as well,
and sometimes better than, masking linguistically-
coherent spans. We evaluate on English bench-
marks for question answering, relation extraction,
and coreference resolution in addition to GLUE.
A different ERNIE (Zhang et al., 2019) fo-
cuses on integrating structured knowledge bases
with contextualized representations with an eye on
knowledge-driven tasks like entity typing and re-
lation classification. UNILM (Dong et al., 2019)
uses multiple language modeling objectives—
unidirectional (both left-to-right and right-to-left),
bidirectional, and sequence-to-sequence prediction—
to aid generation tasks like summarization and
question generation. XLM (Lample and Conneau,
2019) explores cross-lingual pre-training for multi-
lingual tasks such as translation and cross-lingual
classification. KERMIT (Chan et al., 2019), an insertion-based approach, fills in missing tokens


(instead of predicting masked ones) during pre-
training; they show improvements on machine
translation and zero-shot question answering.

Concurrent with our work, RoBERTa (Liu et al., 2019b) presents a replication study of BERT pre-training that measures the impact of many key hyperparameters and training data size. Also concurrent, XLNet (Yang et al., 2019) combines an autoregressive loss and the Transformer-XL (Dai et al., 2019) architecture with a more than eight-fold increase in data to achieve current state-of-the-art results on multiple benchmarks. XLNet also masks spans (of 1–5 tokens) during pre-training, but predicts them autoregressively. Our
model focuses on incorporating span-based pre-
training, and as a side effect, we present a stronger BERT baseline while controlling for the corpus, architecture, and number of parameters.

Related to our SBO objective, pair2vec (Joshi et al., 2019a) encodes word-pair relations using a negative sampling-based multivariate objective during pre-training. Subsequently, the word-pair representations are injected into the attention layer of downstream tasks, and thus encode limited downstream context. Unlike pair2vec, our SBO objective yields ‘‘pair’’ (start and end tokens of spans) representations which more fully encode the context during both pre-training and fine-tuning, and are thus more appropriately viewed as span representations. Stern et al. (2018) focus on improving
language generation speed using a block-wise par-
allel decoding scheme; they make predictions for
multiple time steps in parallel and then back off
to the longest prefix validated by a scoring model.
Also related are sentence representation methods
(Kiros et al., 2015; Logeswaran and Lee, 2018),
which focus on predicting surrounding contexts
from sentence embeddings.

8 Conclusion

We presented a new method for span-based pre-
training which extends BERT by (1) masking
contiguous random spans, rather than random
tokens, and (2) training the span boundary repre-
sentations to predict the entire content of the
masked span, without relying on the individual
token representations within it. Together, our pre-
training process yields models that outperform all
BERT baselines on a variety of tasks, and reach

substantially better performance on span selection
tasks in particular.

Appendices

A Pre-training Procedure

We describe our pre-training procedure as follows:

1. Divide the corpus into single contiguous blocks of up to 512 tokens.

2. At each step of pre-training:

   (a) Sample a batch of blocks uniformly at random.

   (b) Mask 15% of word pieces in each block in the batch using the span masking scheme (Section 3.1).

   (c) For each masked token xi, optimize L(xi) = LMLM(xi) + LSBO(xi) (Section 3.2).

B Fine-tuning Hyperparameters

We apply the following fine-tuning hyperparam-
eters to all methods, including the baselines.

Extractive Question Answering For all

question answering tasks, we use max seq
length = 512 and a sliding window of size
128 if the lengths are longer than 512. We choose
learning rates from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5}
and batch sizes from {16, 32} and fine-tune four
epochs for all the datasets.
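The sliding-window handling of over-long QA inputs can be sketched as follows (our own illustration; special tokens and answer-position bookkeeping are omitted, and interpreting the window size of 128 as the stride between consecutive windows is an assumption):

```python
def sliding_windows(token_ids, max_seq_length=512, window_stride=128):
    """Split an over-long tokenized input into overlapping windows of at most
    max_seq_length tokens, advancing by window_stride each time."""
    if len(token_ids) <= max_seq_length:
        return [token_ids]
    windows, start = [], 0
    while True:
        windows.append(token_ids[start:start + max_seq_length])
        if start + max_seq_length >= len(token_ids):
            break
        start += window_stride
    return windows
```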

Coreference Resolution We divide the docu-
ments into multiple chunks of lengths up to max
seq length and encode each chunk indepen-
dently. We choose max seq length from {128,
256, 384, 512}, BERT learning rates from {1e-5,
2e-5}, task-specific learning rates from {1e-4,
2e-4, 3e-4}, and fine-tune 20 epochs for all the
datasets. We use batch size = 1 (one document)
for all the experiments.

TACRED/GLUE We use max seq length =
128 and choose learning rates from {5e-6, 1e-5,
2e-5, 3e-5, 5e-5} and batch sizes from {16, 32}
and fine-tuning 10 epochs for all the datasets.
The only exception is CoLA, where we used four
epochs (following Devlin et al., 2019), because 10 epochs lead to severe overfitting.


Acknowledgments

We would like to thank Pranav Rajpurkar and Robin
Jia for patiently helping us evaluate SpanBERT
on SQuAD. We thank the anonymous reviewers,
the action editor, and our colleagues at Facebook
AI Research and the University of Washington for
their insightful feedback that helped improve the
paper.

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, pages 6–4.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In International Workshop on Semantic Evaluation (SemEval), pages 1–14. Vancouver, Canada.

William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit. 2019. KERMIT: Generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems (NIPS), pages 3079–3087.

Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Association for Computational Linguistics (ACL).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems (NIPS).

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Machine Reading for Question Answering (MRQA) Workshop at EMNLP.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9.

Luheng He, Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2018. Jointly predicting predicates and arguments in neural semantic role labeling. In Association for Computational Linguistics (ACL), pages 364–369.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.


Mandar Joshi, Eunsol Choi, Omer Levy, Daniel Weld, and Luke Zettlemoyer. 2019a. pair2vec: Compositional word-pair embeddings for cross-sentence inference. In North American Association for Computational Linguistics (NAACL), pages 3597–3608.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Association for Computational Linguistics (ACL), pages 1601–1611.

Mandar Joshi, Omer Levy, Daniel S. Weld, and Luke Zettlemoyer. 2019b. BERT for coreference resolution: Baselines and analysis. In Empirical Methods in Natural Language Processing (EMNLP).

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems (NIPS).

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics (TACL).

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems (NIPS).

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Empirical Methods in Natural Language Processing (EMNLP), pages 188–197.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In North American Association for Computational Linguistics (NAACL), pages 687–692.

Kenton Lee, Shimi Salant, Tom Kwiatkowski, Ankur Parikh, Dipanjan Das, and Jonathan Berant. 2016. Learning recurrent span representations for extractive question answering. arXiv preprint arXiv:1611.01436.

Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL).

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR).

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Computational Natural Language Learning (CoNLL), pages 51–61.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In North American Association for Computational Linguistics (NAACL), pages 48–53.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL), pages 2227–2237.


Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task, pages 1–40.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163. Association for Computational Linguistics (ACL).

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. OpenAI.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Association for Computational Linguistics (ACL), pages 784–789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392.

Livio Baldini Soares, Nicholas Arthur FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Association for Computational Linguistics (ACL), pages 2895–2905.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning (ICML), pages 5926–5936.

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. 2018. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems (NIPS).

Yu Stephanie Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xinlun Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In 2nd Workshop on Representation Learning for NLP, pages 191–200.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS).

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In North American Association for Computational Linguistics (NAACL), pages 1112–1122.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems (NeurIPS).


Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Empirical Methods in Natural Language Processing (EMNLP), pages 2369–2380.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Empirical Methods in Natural Language Processing (EMNLP), pages 35–45.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Association for Computational Linguistics (ACL), pages 1441–1451.
