A Primer in BERTology: What We Know About How BERT Works
Anna Rogers
Center for Social Data Science
University of Copenhagen
arogers@sodas.ku.dk
Olga Kovaleva
Dept. of Computer Science
University of Massachusetts Lowell
okovalev@cs.uml.edu
Anna Rumshisky
Dept. of Computer Science
University of Massachusetts Lowell
arum@cs.uml.edu
Abstract
Transformer-based models have pushed state
of the art in many areas of NLP, but our under-
standing of what is behind their success is still
limited. This paper is the first survey of over
150 studies of the popular BERT model. We
review the current state of knowledge about
how BERT works, what kind of information
it learns and how it is represented, common
modifications to its training objectives and
architecture, the overparameterization issue,
and approaches to compression. We then
outline directions for future research.
1 Introduction
Since their introduction in 2017, Transformers
(Vaswani et al., 2017) have taken NLP by storm,
offering enhanced parallelization and better mod-
eling of long-range dependencies. The best known
Transformer-based model is BERT (Devlin et al.,
2019); it obtained state-of-the-art results in nume-
rous benchmarks and is still a must-have baseline.
Although it is clear that BERT works remark-
ably well, it is less clear why, which limits further
hypothesis-driven improvement of the architec-
ture. Unlike CNNs, the Transformers have little
cognitive motivation, and the size of these models
limits our ability to experiment with pre-training
and perform ablation studies. This explains a large
number of studies over the past year that at-
tempted to understand the reasons behind BERT’s
performance.
In this paper, we provide an overview of what
has been learned to date, highlighting the questions
that are still unresolved. We first consider the
linguistic aspects of it, namely, the current evi-
dence regarding the types of linguistic and world
knowledge learned by BERT, as well as where and
how this knowledge may be stored in the model.
We then turn to the technical aspects of the model
and provide an overview of the current proposals
to improve BERT’s architecture, pre-training, and
fine-tuning. We conclude by discussing the issue
of overparameterization, the approaches to com-
pressing BERT, and the nascent area of pruning
as a model analysis technique.
2 Overview of BERT Architecture
Fundamentally, BERT is a stack of Transformer
encoder layers (Vaswani et al., 2017) that consist
of multiple self-attention ‘‘heads’’. For every in-
put token in a sequence, each head computes key,
价值, and query vectors, used to create a weighted
表示. The outputs of all heads in the
same layer are combined and run through a fully
connected layer. Each layer is wrapped with a skip
connection and followed by layer normalization.
The conventional workflow for BERT consists
of two stages: pre-training and fine-tuning. Pre-
training uses two self-supervised tasks: masked
language modeling (MLM, prediction of randomly
masked input tokens) and next sentence predic-
tion (NSP, predicting if two input sentences are
adjacent to each other). In fine-tuning for down-
stream applications, one or more fully connected
layers are typically added on top of the final
encoder layer.
The input representations are computed as
follows: Each word in the input is first tokenized
into wordpieces (Wu et al., 2016), and then three
embedding layers (token, position, and segment)
are combined to obtain a fixed-length vector.
Special token [CLS] is used for classification
predictions, and [SEP] separates input segments.
Google1 and HuggingFace (Wolf et al., 2020)
provide many variants of BERT, including the
original ‘‘base’’ and ‘‘large’’ versions. They vary
in the number of heads, layers, and hidden state
sizes.
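As an illustration, the short sketch below (ours; it uses the HuggingFace Transformers library mentioned above, and the example sentence and model name are arbitrary) shows how an input sentence is split into wordpieces, wrapped in the special [CLS] and [SEP] tokens, and encoded into one vector per token.

```python
# Illustrative sketch only: tokenize a sentence and obtain per-token vectors from BERT-base.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT works remarkably well.", return_tensors="pt")
# Wordpiece tokens, wrapped in the special [CLS] and [SEP] tokens
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))

with torch.no_grad():
    outputs = model(**inputs)
# One hidden-state vector per input token from the final encoder layer
print(outputs.last_hidden_state.shape)  # (1, sequence length, hidden size)
```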
1https://github.com/google-research/bert.
Transactions of the Association for Computational Linguistics, vol. 8, pp. 842–866, 2020. https://doi.org/10.1162/tacl_a_00349
Action Editor: Dipanjan Das. Submission batch: 4/2020; Revision batch: 8/2020; Published 12/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
3 What Knowledge Does BERT Have?
A number of studies have looked at the know-
ledge encoded in BERT weights. The popular ap-
proaches include fill-in-the-gap probes of MLM,
analysis of self-attention weights, and probing
classifiers with different BERT representations as
input.
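For concreteness, the sketch below (ours; the toy sentences and labels are invented for illustration) shows the probing-classifier setup in its simplest form: BERT is kept frozen, token representations are extracted, and a light classifier is trained to predict a linguistic label from them.

```python
# Illustrative sketch of a probing classifier; real studies use large annotated corpora.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

# Toy labeled data (invented): (sentence, index of the word of interest, label).
data = [("the cat runs", 1, "NOUN"), ("the cat runs", 2, "VERB"),
        ("a dog eats", 1, "NOUN"), ("a dog eats", 2, "VERB")]

features, labels = [], []
for sentence, word_idx, tag in data:
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, hidden_size); BERT stays frozen
    # +1 skips [CLS]; assumes one wordpiece per word in this toy data
    features.append(hidden[word_idx + 1].numpy())
    labels.append(tag)

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print(probe.score(features, labels))  # how linearly decodable the label is from the representations
```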
3.1 Syntactic Knowledge
Lin et al. (2019) showed that BERT representa-
tions are hierarchical rather than linear, 那是,
there is something akin to syntactic tree structure
in addition to the word order information. Tenney
et al. (2019b) and Liu et al. (2019a) also showed
that BERT embeddings encode information
about parts of speech, syntactic chunks, and
roles. Enough syntactic information seems to be
captured in the token embeddings themselves to
recover syntactic trees (Vilares et al., 2020; Kim
et al., 2020; Rosa and Mareček, 2019), although
probing classifiers could not recover the labels
of distant parent nodes in the syntactic tree (Liu
et al., 2019a). Warstadt and Bowman (2020) report
evidence of hierarchical structure in three out of
four probing tasks.
As far as how syntax is represented, it seems
that syntactic structure is not directly encoded
in self-attention weights. Htut et al. (2019) were
unable to extract full parse trees from BERT
heads even with the gold annotations for the root.
Jawahar et al. (2019) include a brief illustration of
a dependency tree extracted directly from self-
attention weights, but provide no quantitative
评估.
然而, syntactic information can be recov-
ered from BERT token representations. Hewitt
and Manning (2019) were able to learn transforma-
tion matrices that successfully recovered syntactic
dependencies in PennTreebank data from BERT’s
token embeddings (see also Manning et al., 2020).
Jawahar et al. (2019) experimented with transfor-
mations of the [CLS] token using Tensor Product
Decomposition Networks (McCoy et al., 2019a),
concluding that dependency trees are the best
match among five decomposition schemes (although
the reported MSE differences are very small).
Miaschi and Dell’Orletta (2020) perform a range
of syntactic probing experiments with concate-
nated token representations as input.
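The simplified sketch below (ours) illustrates the core idea behind such transformation-based probes, in the spirit of Hewitt and Manning (2019): a linear map B is trained so that squared distances between transformed token vectors approximate syntactic tree distances. The token vectors and gold distances here are random stand-ins for real BERT embeddings and treebank annotations.

```python
# Simplified structural-probe sketch; real probes use BERT token vectors and treebank distances.
import torch

def structural_probe_loss(H, dist_gold, B):
    """H: (seq_len, hidden) token vectors; dist_gold: (seq_len, seq_len) tree distances."""
    T = H @ B.T                                   # project into the probe space
    diff = T.unsqueeze(0) - T.unsqueeze(1)        # pairwise differences between token vectors
    dist_pred = (diff ** 2).sum(-1)               # squared L2 distances in the probe space
    return (dist_pred - dist_gold).abs().mean()   # L1 loss between predicted and gold distances

hidden, rank, seq_len = 768, 128, 5
B = torch.randn(rank, hidden, requires_grad=True)            # the only learned parameters
H = torch.randn(seq_len, hidden)                             # stand-in for BERT token embeddings
dist_gold = torch.randint(1, 5, (seq_len, seq_len)).float()  # stand-in for gold tree distances

opt = torch.optim.Adam([B], lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = structural_probe_loss(H, dist_gold, B)
    loss.backward()
    opt.step()
```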
Note that all these approaches look for the
evidence of gold-standard linguistic structures,
and add some amount of extra knowledge to the
probe. Most recently, Wu et al. (2020) proposed a
parameter-free approach based on measuring the
impact that one word has on predicting another
word within a sequence in the MLM task (Figure 1).
They concluded that BERT ‘‘naturally’’ learns
some syntactic information, although it is not
very similar to linguistic annotated resources.

Figure 1: Parameter-free probe for syntactic knowledge: words sharing syntactic subtrees have larger impact on each other in the MLM prediction (Wu et al., 2020).
The fill-in-the-gap probes of MLM showed
that BERT takes subject-predicate agreement
into account when performing the cloze task
(Goldberg, 2019; van Schijndel et al., 2019),
even for meaningless sentences and sentences
with distractor clauses between the subject and
the verb (Goldberg, 2019). A study of negative
polarity items (NPIs) by Warstadt et al. (2019)
showed that BERT is better able to detect the
presence of NPIs (e.g., ‘‘ever’’) and the words
that allow their use (e.g., ‘‘whether’’) than
scope violations.
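A minimal sketch of such a fill-in-the-gap agreement probe (ours; the sentence and candidate verbs are illustrative) is given below: the verb position is masked and the probabilities that the MLM head assigns to the plural and singular forms are compared.

```python
# Illustrative agreement probe in the spirit of Goldberg (2019).
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

sentence = "The keys to the cabinet [MASK] on the table."
inputs = tokenizer(sentence, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]
probs = logits.softmax(-1)

for verb in ["are", "is"]:
    verb_id = tokenizer.convert_tokens_to_ids(verb)
    print(verb, probs[verb_id].item())  # a higher score for "are" indicates correct agreement
```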
The above claims of syntactic knowledge are
belied by the evidence that BERT does not
‘‘understand’’ negation and is insensitive to
malformed input. In particular, its predictions
were not altered2 even with shuffled word order,
truncated sentences, and removed subjects and objects
(Ettinger, 2019). This could mean that either
BERT’s syntactic knowledge is incomplete, or
it does not need to rely on it for solving its
tasks. The latter seems more likely, since Glavaš
and Vulić (2020) report that an intermediate
fine-tuning step with supervised parsing does
not make much difference for downstream task
performance.

2See also the recent findings on adversarial triggers, which
get the model to produce a certain output even though they
are not well-formed from the point of view of a human reader
(Wallace et al., 2019a).
3.2 Semantic Knowledge
To date, more studies have been devoted to
BERT’s knowledge of syntactic rather than se-
mantic phenomena. 然而, we do have evi-
dence from an MLM probing study that BERT
has some knowledge of semantic roles (Ettinger,
2019). BERT even displays some preference for
the incorrect fillers for semantic roles that are
semantically related to the correct ones, as op-
posed to those that are unrelated (e.g., ‘‘to tip a
chef’’ is better than ‘‘to tip a robin’’, but worse
than ‘‘to tip a waiter’’).
Tenney et al. (2019b) showed that BERT en-
codes information about entity types, 关系,
semantic roles, and proto-roles, since this infor-
mation can be detected with probing classifiers.
BERT struggles with representations of num-
bers. Addition and number decoding tasks showed
that BERT does not form good representations for
floating point numbers and fails to generalize away
from the training data (Wallace et al., 2019b). A
part of the problem is BERT’s wordpiece tokeniza-
的, since numbers of similar values can be di-
vided up into substantially different word chunks.
Out-of-the-box BERT is surprisingly brittle
to named entity replacements: For example,
replacing names in the coreference task changes
85% of predictions (Balasubramanian et al., 2020).
This suggests that the model does not actually
form a generic idea of named entities, although
its F1 scores on NER probing tasks are high
(Tenney et al., 2019a). Broscheit (2019) finds that
fine-tuning BERT on Wikipedia entity linking
‘‘teaches’’ it additional entity knowledge, which
would suggest that it did not absorb all the
relevant entity information during pre-training on
Wikipedia.
Figure 2: BERT world knowledge (Petroni et al., 2019).
3.3 World Knowledge
The bulk of evidence about commonsense know-
ledge captured in BERT comes from practitioners
using it to extract such knowledge. One direct
probing study of BERT reports that BERT strug-
gles with pragmatic inference and role-based
event knowledge (Ettinger, 2019). BERT also
struggles with abstract attributes of objects, 作为
well as visual and perceptual properties that are
likely to be assumed rather than mentioned (Da
and Kasai, 2019).
The MLM component of BERT is easy to adapt
for knowledge induction by filling in the blanks
(e.g., ‘‘Cats like to chase [ ]’’). Petroni et al.
(2019) showed that, for some relation types, va-
nilla BERT is competitive with methods relying
on knowledge bases (Figure 2), and Roberts et al.
(2020) show the same for open-domain QA using
the T5 model (Raffel et al., 2019). Davison et al.
(2019) suggest that it generalizes better to unseen
data. In order to retrieve BERT’s knowledge, we
need good template sentences, and there is work
on their automatic extraction and augmentation
(Bouraoui et al., 2019; Jiang et al., 2019b).
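A minimal sketch of this kind of knowledge induction (ours, reusing the template from the example above) is shown below; in practice the quality of the retrieved facts depends heavily on the wording of the template.

```python
# Illustrative MLM-based knowledge probe in the style of Petroni et al. (2019).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Cats like to chase [MASK]."):
    # Each prediction carries the filled-in token and the probability BERT assigns to it.
    print(prediction["token_str"], round(prediction["score"], 3))
```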
然而, BERT cannot reason based on its
world knowledge. Forbes et al. (2019) show that
BERT can ‘‘guess’’ the affordances and properties
of many objects, but cannot reason about the
relationship between properties and affordances.
For example, it ‘‘knows’’ that people can walk
into houses, and that houses are big, but it cannot
infer that houses are bigger than people. Zhou et al.
(2020) and Richardson and Sabharwal (2019) also
show that the performance drops with the number
of necessary inference steps. Some of BERT’s
world knowledge success comes from learning
stereotypical associations (Poerner et al., 2019),
for example, a person with an Italian-sounding
name is predicted to be Italian, even when it is
incorrect.
3.4 Limitations
Multiple probing studies in section 3 and section 4
report that BERT possesses a surprising amount of
syntactic, semantic, and world knowledge. How-
ever, Tenney et al. (2019a) remark, ‘‘the fact that
a linguistic pattern is not observed by our probing
classifier does not guarantee that it is not there, and
the observation of a pattern does not tell us how it
is used.’’ There is also the issue of how complex a
probe should be allowed to be (Liu et al., 2019a).
If a more complex probe recovers more infor-
mation, to what extent are we still relying on the
original model?
此外, different probing methods may
lead to complementary or even contradictory con-
clusions, which makes a single test (as in most
studies) insufficient (Warstadt et al., 2019). A
given method might also favor one model over
another, for example, RoBERTa trails BERT with
one tree extraction method, but leads with another
(Htut et al., 2019). The choice of linguistic formal-
ism also matters (Kuznetsov and Gurevych, 2020).
In view of all that, the alternative is to focus
on identifying what BERT actually relies on at
inference time. This direction is currently pursued
both at the level of architecture blocks (to be
discussed in detail in subsection 6.3), and at the
level of information encoded in model weights.
Amnesic probing (Elazar et al., 2020) aims to
specifically remove certain information from the
model and see how it changes performance,
finding, for example, that language modeling does
rely on part-of-speech information.
Another direction is information-theoretic prob-
ing. Pimentel et al. (2020) operationalize probing
as estimating mutual information between the
learned representation and a given linguistic prop-
erty, which highlights that the focus should be
not on the amount of information contained in
a representation, but rather on how easily it can
be extracted from it. Voita and Titov (2020) quan-
tify the amount of effort needed to extract infor-
mation from a given representation as minimum
description length needed to communicate both
the probe size and the amount of data required for
it to do well on a task.
4 Localizing Linguistic Knowledge
4.1 BERT Embeddings
In studies of BERT, the term ‘‘embedding’’ refers
to the output of a Transformer layer (typically,
the final one). Both conventional static embed-
dings (Mikolov et al., 2013) and BERT-style
embeddings can be viewed in terms of mutual
information maximization (Kong et al., 2019),
but the latter are contextualized. Every token is
represented by a vector dependent on the par-
ticular context of occurrence, and contains at least
some information about that context (Miaschi and
Dell’Orletta, 2020).
Several studies reported that distilled context-
ualized embeddings better encode lexical seman-
tic information (i.e., they are better at traditional
word-level tasks such as word similarity). The
methods to distill a contextualized representation
into static include aggregating the information
across multiple contexts (Akbik et al., 2019;
Bommasani et al., 2020), encoding ‘‘semantically
bleached’’ sentences that rely almost exclusively
on the meaning of a given word (e.g., ‘‘This is <>’’)
(May et al., 2019), and even using contextualized
embeddings to train static embeddings (Wang
et al., 2020d).
But this is not to say that there is no room
for improvement. Ethayarajh (2019) measure how
similar the embeddings for identical words are
in every layer, reporting that later BERT layers
produce more context-specific representations.3
They also find that BERT embeddings occupy a
narrow cone in the vector space, and this effect
increases from the earlier to later layers. 那是,
two random words will on average have a much
higher cosine similarity than expected if em-
beddings were directionally uniform (isotro-
pic). Because isotropy was shown to be beneficial
for static word embeddings (Mu and Viswanath,
2018), this might be a fruitful direction to explore
for BERT.
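The sketch below (ours; the sentences are arbitrary) illustrates how such anisotropy can be measured: the average cosine similarity between contextualized vectors of unrelated tokens is computed per layer, and values far above zero indicate a narrow cone rather than directional uniformity.

```python
# Illustrative anisotropy measurement in the spirit of Ethayarajh (2019).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

sentences = ["The cat sat on the mat.", "Stock prices fell sharply today.",
             "She plays the violin beautifully."]
enc = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden_states = model(**enc).hidden_states      # embeddings + one tensor per layer

for layer, H in enumerate(hidden_states):
    vecs = H[enc["attention_mask"].bool()]          # all non-padding token vectors
    vecs = torch.nn.functional.normalize(vecs, dim=-1)
    sim = vecs @ vecs.T                             # pairwise cosine similarities
    off_diag = sim[~torch.eye(len(vecs), dtype=torch.bool)]
    print(f"layer {layer}: mean cosine similarity {off_diag.mean().item():.2f}")
```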
Because BERT embeddings are contextualized,
an interesting question is to what extent they
capture phenomena like polysemy and hom-
onymy. There is indeed evidence that BERT’s
contextualized embeddings form distinct clus-
ters corresponding to word senses (Wiedemann
et al., 2019; Schmidt and Hofmann, 2020), making
BERT successful at word sense disambiguation
tasks. However, Mickus et al. (2019) note that
the representations of the same word depend
3Voita et al. (2019a) look at the evolution of token
embeddings, showing that in the earlier Transformer layers,
MLM forces the acquisition of contextual information at the
expense of the token identity, which gets recreated in later
layers.
Figure 3: Attention patterns in BERT (Kovaleva et al., 2019).
on the position of the sentence in which it
occurs, likely due to the NSP objective. This is
not desirable from the linguistic point of view, and
could be a promising avenue for future work.
The above discussion concerns token embed-
dings, but BERT is typically used as a sentence
or text encoder. The standard way to generate
sentence or text representations for classification
is to use the [CLS] token, but alternatives are also
being discussed, including concatenation of token
representations (Tanaka et al., 2020), normalized
mean (Tanaka et al., 2020), and layer activations
(Ma et al., 2019). See Toshniwal et al. (2020) for a
systematic comparison of several methods across
tasks and sentence encoders.
4.2 Self-attention Heads
Several studies proposed classification of attention
head types. Raganato and Tiedemann (2018) dis-
cuss attending to the token itself, previous/next
tokens, and the sentence end. Clark et al. (2019)
distinguish between attending to previous/next
tokens, [CLS], [SEP], punctuation, and ‘‘at-
tending broadly’’ over the sequence. Kovaleva
et al. (2019) propose five patterns, shown in
Figure 3.
4.2.1 Heads With Linguistic Functions
The ‘‘heterogeneous’’ attention pattern shown
in Figure 3 could potentially be linguistically
interpretable, and a number of studies focused on
identifying the functions of self-attention heads. In
particular, some BERT heads seem to specialize
in certain types of syntactic relations. Htut
et al. (2019) and Clark et al. (2019) report that
there are BERT heads that attended significantly
more than a random baseline to words in certain
syntactic positions. The datasets and methods
used in these studies differ, but they both find
that there are heads that attend to words in
obj role more than the positional baseline. The
evidence for nsubj, advmod, and amod varies
between these two studies. The overall conclusion
is also supported by Voita et al.’s (2019b) study
of the base Transformer in machine translation
context. Hoover et al. (2019) hypothesize that even
complex dependencies like dobj are encoded by
a combination of heads rather than a single head,
but this work is limited to qualitative analysis.
Zhao and Bethard (2020) looked specifically for
the heads encoding negation scope.
Both Clark et al. (2019) and Htut et al. (2019)
conclude that no single head has the complete
syntactic tree information, in line with evidence
of partial knowledge of syntax (cf. subsection 3.1).
However, Clark et al. (2019) identify a BERT head
that can be directly used as a classifier to perform
coreference resolution on par with a rule-based
system, which by itself would seem to require
quite a lot of syntactic knowledge.
Lin et al. (2019) present evidence that attention
weights are weak indicators of subject-verb
agreement and reflexive anaphora. Instead of
serving as strong pointers between tokens that
should be related, BERT’s self-attention weights
were close to a uniform attention baseline, but
there was some sensitivity to different types of
distractors coherent with psycholinguistic data.
This is consistent with conclusions by Ettinger
(2019).
To our knowledge, morphological information
in BERT heads has not been addressed, but with
the sparse attention variant by Correia et al.
(2019) in the base Transformer, some attention
heads appear to merge BPE-tokenized words.
For semantic relations, there are reports of self-
attention heads encoding core frame-semantic
relations (Kovaleva et al., 2019), as well as lexi-
cographic and commonsense relations (Cui et al.,
2020).
The overall popularity of self-attention as an
interpretability mechanism is due to the idea that
‘‘attention weight has a clear meaning: how much
a particular word will be weighted when comput-
ing the next representation for the current word’’
(Clark et al., 2019). This view is currently debated
(Jain and Wallace, 2019; Serrano and Smith,
2019; Wiegreffe and Pinter, 2019; Brunner et al.,
2020), and in a multilayer model where attention
is followed by nonlinear transformations, the
patterns in individual heads do not provide a full
picture. Also, although many current papers are
accompanied by attention visualizations, and there
is a growing number of visualization tools (Vig,
2019; Hoover et al., 2019), the visualization is
typically limited to qualitative analysis (often with
cherry-picked examples) (Belinkov and Glass,
2019), and should not be interpreted as definitive
evidence.
4.2.2 Attention to Special Tokens
Kovaleva et al. (2019) show that most self-
attention heads do not directly encode any
non-trivial linguistic information, at least when
fine-tuned on GLUE (Wang et al., 2018), since
only fewer than 50% of heads exhibit the
‘‘heterogeneous’’ pattern. Much of the model pro-
duced the vertical pattern (attention to [CLS],
[SEP], and punctuation tokens), consistent with
the observations by Clark et al. (2019). This re-
dundancy is likely related to the overparameteri-
zation issue (see section 6).
Recently, Kobayashi et al. (2020) showed
that the norms of attention-weighted input vectors,
which yield a more intuitive interpretation of self-
attention, reduce the attention to special tokens.
However, even when the attention weights are
normed, it is still not the case that most heads
that do the ‘‘heavy lifting’’ are even potentially
interpretable (Prasanna et al., 2020).
One methodological choice in many studies
of attention is to focus on inter-word attention
and simply exclude special tokens (e.g., Lin et al.
[2019] and Htut et al. [2019]). However, if atten-
tion to special tokens actually matters at inference
time, drawing conclusions purely from inter-word
attention patterns does not seem warranted.
The functions of special tokens are not yet well
understood. [CLS] is typically viewed as an ag-
gregated sentence-level representation (although
all token representations also contain at least
some sentence-level information, as discussed in
subsection 4.1); in that case, we may not see, for
example, full syntactic trees in inter-word atten-
tion because part of that information is actually
packed in [CLS].
Clark et al. (2019) experiment with encoding
Wikipedia paragraphs with base BERT to consider
specifically the attention to special tokens, noting
that heads in early layers attend more to [CLS],
in middle layers to [SEP], and in final layers
to periods and commas. They hypothesize that its
function might be one of ‘‘no-op’’, a signal to
ignore the head if its pattern is not applicable to
the current case. Thus, for example, [SEP]
gets increased attention starting in layer 5, but its
importance for prediction drops. However, after
fine-tuning both [SEP] and [CLS] get a lot of
attention, depending on the task (Kovaleva et al.,
2019). Interestingly, BERT also pays a lot of
attention to punctuation, which Clark et al. (2019)
explain by the fact that periods and commas are
simply almost as frequent as the special tokens,
and so the model might learn to rely on them for
the same reasons.
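The following sketch (ours; the input sentence is arbitrary) shows the kind of measurement behind these observations: the share of attention mass that each layer directs to [CLS], [SEP], and punctuation, averaged over heads and positions.

```python
# Illustrative sketch of the special-token attention analysis in the spirit of Clark et al. (2019).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

inputs = tokenizer("BERT pays a lot of attention to special tokens.", return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
special = torch.tensor([t in ("[CLS]", "[SEP]", ".", ",") for t in tokens])

with torch.no_grad():
    attentions = model(**inputs).attentions          # one (batch, heads, seq, seq) tensor per layer

for layer, att in enumerate(attentions):
    share = att[0, :, :, special].sum(-1).mean()     # attention mass landing on special tokens
    print(f"layer {layer}: {share.item():.2f}")
```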
4.3 BERT Layers
The first layer of BERT receives as input a
combination of token, segment, and positional
embeddings.
It stands to reason that the lower layers have
the most information about linear word order.
Lin et al. (2019) report a decrease in the knowledge
of linear word order around layer 4 in BERT-base.
This is accompanied by an increased knowledge
of hierarchical sentence structure, as detected by
the probing tasks of predicting the token index,
the main auxiliary verb and the sentence subject.
There is a wide consensus in studies with
different tasks, datasets, and methodologies that
syntactic information is most prominent in the
middle layers of BERT.4 Hewitt and Manning
(2019) had the most success reconstructing syn-
tactic tree depth from the middle BERT layers (6-9
for base-BERT, 14-19 for BERT-large). Goldberg
(2019) reports the best subject-verb agreement
around layers 8-9, and the performance on syntac-
tic probing tasks used by Jawahar et al. (2019) 还
seems to peak around the middle of the model.
The prominence of syntactic information in the
middle BERT layers is related to Liu et al.’s
(2019a) observation that the middle layers of
Transformers are best-performing overall and the
most transferable across tasks (see Figure 4).
4These BERT results are also compatible with findings
by Vig and Belinkov (2019), who report the highest attention
to tokens in dependency relations in the middle layers of
GPT-2.
Figure 4: BERT layer transferability (columns correspond to probing tasks), Liu et al. (2019a).

There is conflicting evidence about syntactic
chunks. Tenney et al. (2019a) conclude that ‘‘the
basic syntactic information appears earlier in the
network while high-level semantic features appear
at the higher layers’’, drawing parallels between
this order and the order of components in a typical
NLP pipeline—from POS-tagging to dependency
parsing to semantic role labeling. Jawahar et al.
(2019) also report that the lower layers were more
useful for chunking, while middle layers were
more useful for parsing. At the same time, the
probing experiments by Liu et al. (2019a) find
the opposite: Both POS-tagging and chunking
were performed best at the middle layers, in both
BERT-base and BERT-large. However, all three
studies use different suites of probing tasks.
The final layers of BERT are the most task-
specific. In pre-training, this means specificity to
the MLM task, which explains why the middle
layers are more transferable (Liu et al., 2019a). In
fine-tuning, it explains why the final layers change
the most (Kovaleva et al., 2019), and why restoring
the weights of lower layers of fine-tuned BERT
to their original values does not dramatically hurt
the model performance (Hao et al., 2019).
Tenney et al. (2019a) suggest that whereas
syntactic information appears early in the model
and can be localized, semantics is spread across
the entire model, which explains why certain
non-trivial examples get solved incorrectly at first
but correctly at the later layers. This is rather to be
expected: Semantics permeates all language, and
linguists debate whether meaningless structures
can exist at all (Goldberg, 2006, p.166–182). But
this raises the question of what stacking more
Transformer layers in BERT actually achieves in
terms of the spread of semantic knowledge, and
whether that is beneficial. Tenney et al. compared
BERT-base and BERT-large, and found that the
overall pattern of cumulative score gains is the
same, only more spread out in the larger model.
Note that Tenney et al.’s (2019a) experiments
concern sentence-level semantic relations; Cui
et al. (2020) report that the encoding of ConceptNet
semantic relations is the worst in the early layers
and increases towards the top. Jawahar et al.
(2019) place ‘‘surface features in lower layers,
syntactic features in middle layers and semantic
features in higher layers’’, but their conclusion is
surprising, given that only one semantic task in
this study actually topped at the last layer, and
three others peaked around the middle and then
considerably degraded by the final layers.

5 Training BERT

This section reviews the proposals to optimize the
training and architecture of the original BERT.

5.1 Model Architecture Choices

To date, the most systematic study of BERT ar-
chitecture was performed by Wang et al. (2019b),
who experimented with the number of layers,
heads, and model parameters, varying one option
and freezing the others. They concluded that the
number of heads was not as significant as the
number of layers. That is consistent with the find-
ings of Voita et al. (2019b) and Michel et al.
(2019) (section 6), and also the observation by
Liu et al. (2019a) that the middle layers were the
most transferable. Larger hidden representation
size was consistently better, but the gains varied
by setting.
All in all, changes in the number of heads and
layers appear to perform different functions.
The issue of model depth must be related to
the information flow from the most task-specific
layers closer to the classifier (Liu et al., 2019a), to
the initial layers which appear to be the most task-
invariant (Hao et al., 2019), and where the tokens
resemble the input tokens the most (Brunner et al.,
2020) (see subsection 4.3). If that is the case,
a deeper model has more capacity to encode
information that is not task-specific.
On the other hand, many self-attention heads
in vanilla BERT seem to naturally learn the same
patterns (Kovaleva et al., 2019). This explains
why pruning them does not have too much impact.
The question that arises from this is how far we
could get with intentionally encouraging diverse
self-attention patterns: Theoretically, this would
mean increasing the amount of information in the
model with the same number of weights. Raganato
et al. (2020) show for Transformer-based machine
translation that we can simply pre-set the patterns that
we already know the model would learn, instead
of learning them from scratch.
Vanilla BERT is symmetric and balanced in
terms of self-attention and feed-forward layers, but
it may not have to be. For the base Transformer,
Press et al. (2020) report benefits from more
self-attention sublayers at the bottom and more
feedforward sublayers at the top.
5.2 Improvements to the Training Regime
Liu et al. (2019b) demonstrate the benefits of
large-batch training: With 8k examples, both the
language model perplexity and downstream task
performance are improved. They also publish their
recommendations for other parameters. You et al.
(2019) report that with a batch size of 32k BERT’s
training time can be significantly reduced with no
degradation in performance. Zhou et al. (2019)
observe that the normalization of the trained
[CLS] token stabilizes the training and slightly
improves performance on text classification tasks.
Gong et al. (2019) note that, because self-
attention patterns in higher and lower layers are
相似的, the model training can be done in a
recursive manner, where the shallower version
is trained first and then the trained parameters are
copied to deeper layers. Such a ‘‘warm-start’’ can
lead to a 25% faster training without sacrificing
performance.
5.3 Pre-training BERT
The original BERT is a bidirectional Transformer
pre-trained on two tasks: NSP and MLM
(section 2). Multiple studies have come up with
alternative training objectives to improve on
BERT, and these could be categorized as follows:
• How to mask. Raffel et al. (2019) system-
atically experiment with corruption rate and
corrupted span length. Liu et al. (2019b)
propose diverse masks for training examples
within an epoch, while Baevski et al. (2019)
mask every token in a sequence instead of
a random selection. Clinchant et al. (2019)
replace the MASK token with the [UNK] token,
to help the model learn a representation for
unknowns that could be useful for transla-
tion. Song et al. (2020) maximize the amount
of information available to the model by
conditioning on both masked and unmasked
tokens, and letting the model see how many
tokens are missing. (A sketch of the standard
masking scheme that these variants modify
appears after this list.)
• What to mask. Masks can be applied to
full words instead of word-pieces (Devlin
et al., 2019; Cui et al., 2019). Similarly, we
can mask spans rather than single tokens
(Joshi et al., 2020), predicting how many
are missing (Lewis et al., 2019). Masking
phrases and named entities (Sun et al.,
2019b) improves representation of structured
knowledge.
• Where to mask. Lample and Conneau
(2019) use arbitrary text streams instead of
sentence pairs and subsample frequent out-
puts similar to Mikolov et al. (2013). Bao
et al. (2020) combine the standard autoencod-
ing MLM with partially autoregressive LM
objective using special pseudo mask tokens.
• Alternatives to masking. Raffel et al. (2019)
experiment with replacing and dropping
spans; Lewis et al. (2019) explore deletion,
infilling, sentence permutation and docu-
ment rotation; and Sun et al. (2019c) predict
whether a token is capitalized and whether
it occurs in other segments of the same
document. Yang et al. (2019) train on dif-
ferent permutations of word order in the input
sequence, maximizing the probability of the
original word order (cf. the n-gram word or-
der reconstruction task (Wang et al., 2019a)).
Clark et al. (2020) detect tokens that were
replaced by a generator network rather than
masked.
• NSP alternatives. Removing NSP does not
hurt or slightly improves performance (Liu
et al., 2019b; Joshi et al., 2020; Clinchant
et al., 2019). Wang et al. (2019a) and
Cheng et al. (2019) replace NSP with the
task of predicting both the next and the
previous sentences. Lan et al. (2020) replace
the negative NSP examples by swapped
sentences from positive examples, rather than
sentences from different documents. ERNIE
2.0 includes sentence reordering and sentence
distance prediction. Bai et al. (2020) replace
both NSP and token position embeddings by
a combination of paragraph, sentence, and
token index embeddings. Li and Choi (2020)
experiment with utterance order prediction
task for multiparty dialogue (and also MLM
at the level of utterances and the whole
dialogue).
• Other tasks. Sun et al. (2019c) propose
simultaneous learning of seven tasks, in-
cluding discourse relation classification and
predicting whether a segment is relevant for
IR. Guu et al. (2020) include a latent knowl-
edge retriever in language model pretrain-
ing. Wang et al. (2020c) combine MLM with
a knowledge base completion objective. Glass
et al. (2020) replace MLM with a span predic-
tion task (as in extractive question answer-
ing), where the model is expected to provide
the answer not from its own weights, but
from a different passage containing the cor-
rect answer (a relevant search engine query
snippet).
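For reference, the sketch below (ours) implements the standard corruption scheme that most of the masking variants above modify: 15% of positions are selected for prediction; of those, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged (Devlin et al., 2019). Real implementations additionally avoid masking special tokens.

```python
# Illustrative MLM masking sketch (the 15/80/10/10 scheme of Devlin et al., 2019).
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    probs = torch.full(input_ids.shape, mlm_prob)
    selected = torch.bernoulli(probs).bool()          # positions selected for prediction
    labels[~selected] = -100                          # loss is computed only on selected positions

    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[replace] = mask_token_id                # 80% of selected positions: [MASK]

    random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~replace
    input_ids[random] = torch.randint(vocab_size, input_ids.shape)[random]  # 10%: random token
    return input_ids, labels                          # remaining 10% stay unchanged
```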
Another obvious source of improvement is pre-
training data. Several studies explored the benefits
of increasing the corpus volume (Liu et al., 2019b;
Conneau et al., 2019; Baevski et al., 2019) and
longer training (Liu et al., 2019b). The data
also does not have to be raw text: There are a
number of efforts to incorporate explicit linguistic
information, both syntactic (Sundararaman et al.,
2019) and semantic (Zhang et al., 2020). Wu
et al. (2019b) and Kumar et al. (2020) include
the label for a given sequence from an annotated
task dataset. Schick and Schütze (2020) separately
learn representations for rare words.
Although BERT is already actively used as a
source of world knowledge (see subsection 3.3),
there is also work on explicitly supplying
structured knowledge. One approach is entity-
enhanced models. For example, Peters et al.
(2019a) and Zhang et al. (2019) include entity
embeddings as input for training BERT, while
Poerner et al. (2019) adapt entity vectors to BERT
representations. As mentioned above, Wang et al.
(2020c) integrate knowledge not through entity
embeddings, but through the additional pre-
training objective of knowledge base completion.
Sun et al. (2019b,c) modify the standard MLM task
to mask named entities rather than random words,
and Yin et al. (2020) train with MLM objective
over both text and linearized table data. Wang et al.
(2020a) enhance RoBERTa with both linguistic
and factual knowledge with task-specific adapters.

Figure 5: Pre-trained weights help BERT find wider
optima in fine-tuning on MRPC (right) than training
from scratch (left) (Hao et al., 2019).
Pre-training is the most expensive part of train-
ing BERT, and it would be informative to know
how much benefit it provides. On some tasks, a
randomly initialized and fine-tuned BERT obtains
competitive or higher results than the pre-trained
BERT with the task classifier and frozen weights
(Kovaleva et al., 2019). The consensus in the
community is that pre-training does help in most
cases, but the degree and its exact contribution
requires further investigation. Prasanna et al.
(2020) found that most weights of pre-trained
BERT are useful in fine-tuning, although there
are ‘‘better’’ and ‘‘worse’’ subnetworks. One ex-
planation is that pre-trained weights help the fine-
tuned BERT find wider and flatter areas with
smaller generalization error, which makes the
model more robust to overfitting (see Figure 5
from Hao et al. [2019]).
Given the large number and variety of pro-
posed modifications, one would wish to know how
much impact each of them has. However, due to
the overall trend towards large model sizes, syste-
matic ablations have become expensive. Most
new models claim superiority on standard bench-
marks, but gains are often marginal, and estimates
of model stability and significance testing are
very rare.
5.4 Fine-tuning BERT
Pre-training + fine-tuning workflow is a crucial
part of BERT. The former is supposed to provide
task-independent knowledge, and the latter would
presumably teach the model to rely more on the
representations useful for the task at hand.
Kovaleva et al. (2019) did not find that to be the
case for BERT fine-tuned on GLUE tasks:5 During
fine-tuning, the most changes for three epochs
occurred in the last two layers of the models, 但
those changes caused self-attention to focus on
[SEP] rather than on linguistically interpretable
patterns. It is understandable why fine-tuning
would increase the attention to [CLS], but not
[SEP]. If Clark et al. (2019) are correct that
[SEP] serves as ‘‘no-op’’ indicator, fine-tuning
basically tells BERT what to ignore.
Several studies explored the possibilities of
improving the fine-tuning of BERT:
• Taking more layers into account: learning
a complementary representation of the infor-
mation in deep and output layers (Yang and
Zhao, 2019), using a weighted combination
of all layers instead of the final one (Su and
Cheng, 2019; Kondratyuk and Straka, 2019),
and layer dropout (Kondratyuk and Straka,
2019).
• Two-stage fine-tuning introduces an inter-
mediate supervised training stage between
pre-training and fine-tuning (Phang et al.,
2019; Garg et al., 2020; Arase and Tsujii,
2019; Pruksachatkun et al., 2020; Glavaš
and Vulić, 2020). Ben-David et al. (2020)
propose a pivot-based variant of MLM to
fine-tune BERT for domain adaptation.
• Adversarial token perturbations improve
the robustness of the model (Zhu et al., 2019).
• Adversarial regularization in combination
with Bregman Proximal Point Optimization
helps alleviate pre-trained knowledge forget-
ting and therefore prevents BERT from
overfitting to downstream tasks (Jiang et al.,
2019A).
• Mixout regularization improves the stab-
ility of BERT fine-tuning even for a small
number of training examples (Lee et al.,
2019).

5Kondratyuk and Straka (2019) suggest that fine-tuning
on Universal Dependencies does result in syntactically
meaningful attention patterns, but there was no quantitative
evaluation.
With large models, even fine-tuning becomes
expensive, but Houlsby et al. (2019) show that
it can be successfully approximated with adapter
modules. They achieve competitive performance
on 26 classification tasks at a fraction of the com-
putational cost. Adapters in BERT were also used
for multitask learning (Stickland and Murray,
2019) and cross-lingual transfer (Artetxe et al.,
2019). An alternative to fine-tuning is extracting
features from frozen representations, but fine-
tuning works better for BERT (Peters et al.,
2019b).
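A schematic adapter module (ours, with illustrative dimensions) is shown below: a small bottleneck network with a residual connection is inserted into each layer, and only these new parameters (plus the task head) are updated during fine-tuning.

```python
# Illustrative adapter sketch in the spirit of Houlsby et al. (2019).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # project down to a small bottleneck
        self.up = nn.Linear(bottleneck, hidden_size)     # project back up to the hidden size
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))       # residual keeps the pre-trained signal

# During fine-tuning, the pre-trained BERT weights would be frozen, e.g.:
# for p in bert.parameters(): p.requires_grad = False
```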
A big methodological challenge in the
current NLP is that the reported performance
improvements of new models may well be within
variation induced by environment factors (Crane,
2018). BERT is not an exception. Dodge et al.
(2020) report significant variation for BERT
fine-tuned on GLUE tasks due to both weight
initialization and training data order. They also
propose early stopping on the less-promising
seeds.
Although we hope that the above observations
may be useful for the practitioners, this section
does not exhaust the current research on fine-
tuning and its alternatives. For example, we do not
cover such topics as Siamese architectures, policy
gradient training, automated curriculum learning,
and others.
6 How Big Should BERT Be?
6.1 Overparameterization
Transformer-based models keep growing by or-
ders of magnitude: The 110M parameters of base
BERT are now dwarfed by 17B parameters of
Turing-NLG (Microsoft, 2020), which is dwarfed
by 175B of GPT-3 (Brown et al., 2020). This trend
raises concerns about computational complexity
of self-attention (Wu et al., 2019a), environmental
issues (Strubell et al., 2019; Schwartz et al., 2019),
fair comparison of architectures (Aßenmacher
and Heumann, 2020), and reproducibility.
Human language is incredibly complex, and
would perhaps take many more parameters to
describe fully, but the current models do not make
good use of the parameters they already have.
Voita et al. (2019b) showed that all but a few
Transformer heads could be pruned without
|                                     | Compression | Performance | Speedup | Model    | Evaluation                    |
| Distillation                        |             |             |         |          |                               |
| BERT-base (Devlin et al., 2019)     | ×1          | 100%        | ×1      | BERT12   | All GLUE tasks, SQuAD         |
| BERT-small                          | ×3.8        | 91%         | –       | BERT4†   | All GLUE tasks                |
| DistilBERT (Sanh et al., 2019)      | ×1.5        | 90%§        | ×1.6    | BERT6    | All GLUE tasks, SQuAD         |
| BERT6-PKD (Sun et al., 2019a)       | ×1.6        | 98%         | ×1.9    | BERT6    | No WNLI, CoLA, STS-B; RACE    |
| BERT3-PKD (Sun et al., 2019a)       | ×2.4        | 92%         | ×3.7    | BERT3    | No WNLI, CoLA, STS-B; RACE    |
| Aguilar et al. (2019), Exp. 3       | ×1.6        | 93%         | –       | BERT6    | CoLA, MRPC, QQP, RTE          |
| BERT-48 (Zhao et al., 2019)         | ×62         | 87%         | ×77     | BERT12∗† | MNLI, MRPC, SST-2             |
| BERT-192 (Zhao et al., 2019)        | ×5.7        | 93%         | ×22     | BERT12∗† | MNLI, MRPC, SST-2             |
| TinyBERT (Jiao et al., 2019)        | ×7.5        | 96%         | ×9.4    | BERT4†   | No WNLI; SQuAD                |
| MobileBERT (Sun et al., 2020)       | ×4.3        | 100%        | ×4      | BERT24†  | No WNLI; SQuAD                |
| PD (Turc et al., 2019)              | ×1.6        | 98%         | ×2.5‡   | BERT6    | No WNLI, CoLA and STS-B       |
| WaLDORf (Tian et al., 2019)         | ×4.4        | 93%         | ×9      | BERT8†k  | SQuAD                         |
| MiniLM (Wang et al., 2020b)         | ×1.65       | 99%         | ×2      | BERT6    | No WNLI, STS-B, MNLImm; SQuAD |
| MiniBERT (Tsai et al., 2019)        | ×6∗∗        | 98%         | ×27∗∗   | mBERT3†  | CoNLL-18 POS and morphology   |
| BiLSTM-soft (Tang et al., 2019)     | ×110        | 91%         | ×434‡   | BiLSTM1  | MNLI, QQP, SST-2              |
| Quantization                        |             |             |         |          |                               |
| Q-BERT-MP (Shen et al., 2019)       | ×13         | 98%¶        | –       | BERT12   | MNLI, SST-2, CoNLL-03, SQuAD  |
| BERT-QAT (Zafrir et al., 2019)      | ×4          | 99%         | –       | BERT12   | No WNLI, MNLI; SQuAD          |
| GOBO (Zadeh and Moshovos, 2020)     | ×9.8        | 99%         | –       | BERT12   | MNLI                          |
| Pruning                             |             |             |         |          |                               |
| McCarley et al. (2020), ff2         | ×2.2‡       | 98%‡        | ×1.9‡   | BERT24   | SQuAD, Natural Questions      |
| RPP (Guo et al., 2019)              | ×1.7‡       | 99%‡        | –       | BERT24   | No WNLI, STS-B; SQuAD         |
| Soft MvP (Sanh et al., 2020)        | ×33         | 94%¶        | –       | BERT12   | MNLI, QQP, SQuAD              |
| IMP (Chen et al., 2020), rewind 50% | ×1.4–2.5    | 94–100%     | –       | BERT12   | No MNLI-mm; SQuAD             |
| Other                               |             |             |         |          |                               |
| ALBERT-base (Lan et al., 2020)      | ×9          | 97%         | –       | BERT12†  | MNLI, SST-2                   |
| ALBERT-xxlarge (Lan et al., 2020)   | ×0.47       | 107%        | –       | BERT12†  | MNLI, SST-2                   |
| BERT-of-Theseus (Xu et al., 2020)   | ×1.6        | 98%         | ×1.9    | BERT6    | No WNLI                       |
| PoWER-BERT (Goyal et al., 2020)     | N/A         | 99%         | ×2–4.5  | BERT12   | No WNLI; RACE                 |

Table 1: Comparison of BERT compression studies. Compression, performance retention, and inference
time speedup figures are given with respect to BERT-base, unless indicated otherwise. Performance
retention is measured as a ratio of average scores achieved by a given model and by BERT-base. The
subscript in the model description reflects the number of layers used. ∗Smaller vocabulary used. †The
dimensionality of the hidden layers is reduced. kConvolutional layers used. ‡Compared to BERT-large.
∗∗Compared to mBERT. §As reported in Jiao et al. (2019). ¶In comparison to the dev set.
significant losses in performance. For BERT,
Clark et al. (2019) observe that most heads in
the same layer show similar self-attention patterns
(perhaps related to the fact that the output of all
self-attention heads in a layer is passed through
the same MLP), which explains why Michel et al.
(2019) were able to reduce most layers to a single
head.
Depending on the task, some BERT heads/
layers are not only redundant (Kao et al.,
2020), but also harmful to the downstream task
performance. Positive effect from head disabling
was reported for machine translation (Michel et al.,
2019), abstractive summarization (Baan et al.,
2019), and GLUE tasks (Kovaleva et al., 2019).
Furthermore, Tenney et al. (2019a) examine
the cumulative gains of their structural probing
classifier, observing that in 5 out of 8 probing
tasks some layers cause a drop in scores (typically
in the final layers). Gordon et al. (2020) find that
30%–40% of the weights can be pruned without
impact on downstream tasks.
In general, larger BERT models perform better
(Liu et al., 2019a; Roberts et al., 2020), but not
always: BERT-base outperformed BERT-large
on subject-verb agreement (Goldberg, 2019) and
sentence subject detection (Lin et al., 2019). Given
the complexity of language, and amounts of pre-
training data, it is not clear why BERT ends
up with redundant heads and layers. Clark et al.
(2019) suggest that one possible reason is the use
of attention dropouts, which causes some attention
weights to be zeroed-out during training.
6.2 Compression Techniques
Given the above evidence of overparameteriza-
tion, it does not come as a surprise that BERT can
be efficiently compressed with minimal accu-
racy loss, which would be highly desirable for
real-world applications. Such efforts to date are
summarized in Table 1. The main approaches are
knowledge distillation, quantization, and pruning.
The studies in the knowledge distillation
framework (Hinton et al., 2014) use a smaller
student-network trained to mimic the behavior of
a larger teacher-network. For BERT, this has been
achieved through experiments with loss functions
(Sanh et al., 2019; Jiao et al., 2019), mimick-
ing the activation patterns of individual portions
of the teacher network (Sun et al., 2019a), and
knowledge transfer at the pre-training (Turc et al.,
2019; Jiao et al., 2019; Sun et al., 2020) or fine-
tuning stage (Jiao et al., 2019). McCarley et al.
(2020) suggest that distillation has so far worked
better for GLUE than for reading comprehen-
sion, and report good results for QA from a com-
bination of structured pruning and task-specific
distillation.
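The sketch below (ours) shows the generic distillation objective underlying this line of work (Hinton et al., 2014): the student is trained to match the teacher's softened output distribution, usually combined with the standard supervised loss; the temperature and weighting values are illustrative.

```python
# Illustrative distillation loss: soft teacher targets combined with the hard-label loss.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),   # student distribution at temperature T
        F.softmax(teacher_logits / T, dim=-1),       # softened teacher distribution
        reduction="batchmean",
    ) * T * T                                        # rescale so gradients stay comparable
    hard = F.cross_entropy(student_logits, labels)   # standard supervised loss
    return alpha * soft + (1 - alpha) * hard
```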
Quantization decreases BERT’s memory
footprint through lowering the precision of its
weights (Shen et al., 2019; Zafrir et al., 2019).
Note that this strategy often requires compatible
hardware.
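As an illustration (ours), post-training dynamic quantization of the linear layers can be applied to a fine-tuned BERT with generic PyTorch tooling; this is a simpler setup than the quantization-aware schemes cited above.

```python
# Illustrative post-training dynamic quantization: linear layers are stored in 8-bit integers.
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```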
As discussed in section 6,
individual self-
attention heads and BERT layers can be disabled
without significant drop in performance (Michel
et al., 2019; Kovaleva et al., 2019; Baan et al.,
2019). Pruning is a compression technique that
takes advantage of that fact, typically reducing the
amount of computation via zeroing out of certain
parts of the large model. In structured pruning,
architecture blocks are dropped, as in LayerDrop
(Fan et al., 2019). In unstructured, the weights in
the entire model are pruned irrespective of their
location, as in magnitude pruning (Chen et al.,
2020) or movement pruning (Sanh et al., 2020).
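A minimal sketch of unstructured magnitude pruning (ours) is given below: the smallest-magnitude weights in each linear layer are zeroed out. Movement pruning instead scores weights by how they change during fine-tuning.

```python
# Illustrative unstructured magnitude pruning of all linear layers in a model.
import torch

def magnitude_prune(model, amount=0.3):
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            k = max(1, int(amount * w.numel()))
            threshold = w.abs().flatten().kthvalue(k).values
            w[w.abs() < threshold] = 0.0   # zero out the smallest `amount` fraction of weights
    return model
```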
Prasanna et al. (2020) and Chen et al. (2020)
explore BERT from the perspective of the lot-
tery ticket hypothesis (Frankle and Carbin, 2019),
looking specifically at the ‘‘winning’’ subnet-
works in pre-trained BERT. They independently
find that such subnetworks do exist, and that trans-
ferability between subnetworks for different tasks
varies.
If the ultimate goal of training BERT is com-
pression, Li et al. (2020) recommend training
larger models and compressing them heavily
rather than compressing smaller models lightly.
Other techniques include decomposing BERT’s
embedding matrix into smaller matrices (Lan et al.,
2020), progressive module replacing (Xu et al.,
2020), and dynamic elimination of intermediate
encoder outputs (Goyal et al., 2020). See Ganesh
et al. (2020) for a more detailed discussion of
compression methods.
6.3 Pruning and Model Analysis
There is a nascent discussion around pruning as a
model analysis technique. The basic idea is that
a compressed model a priori consists of elements
that are useful for prediction; therefore by finding
out what they do we may find out what the whole
network does. For example, BERT has heads
that seem to encode frame-semantic relations, but
disabling them might not hurt downstream task
performance (Kovaleva et al., 2019); this suggests
that this knowledge is not actually used.
For the base Transformer, Voita et al. (2019b)
identify the functions of self-attention heads and
then check which of them survive the pruning,
finding that the syntactic and positional heads are
the last ones to go. For BERT, Prasanna et al.
(2020) go in the opposite direction: pruning on the
basis of importance scores, and interpreting the
remaining ‘‘good’’ subnetwork. With respect to
self-attention heads specifically, it does not seem
to be the case that only the heads that potentially
encode non-trivial linguistic patterns survive the
pruning.
The models and methodology in these stud-
ies differ, so the evidence is inconclusive. In
particular, Voita et al. (2019b) find that before
pruning the majority of heads are syntactic, and
Prasanna et al. (2020) find that the majority of
heads do not have potentially non-trivial attention
patterns.
An important limitation of the current head
and layer ablation studies (Michel et al., 2019;
Kovaleva et al., 2019) is that they inherently
assume that certain knowledge is contained in
heads/layers. However, there is evidence of
more diffuse representations spread across the
full network, such as the gradual increase in
accuracy on difficult semantic parsing tasks
(Tenney et al., 2019a) or the absence of
heads that would perform parsing ‘‘in general’’
(Clark et al., 2019; Htut et al., 2019). If
so, ablating individual components harms the
weight-sharing mechanism. Conclusions from
component ablations are also problematic if the
same information is duplicated elsewhere in the
network.
7 Directions for Further Research
BERTology has clearly come a long way, but it
is fair to say we still have more questions than
answers about how BERT works. In this section,
we list what we believe to be the most promising
directions for further research.
Benchmarks that require verbal reasoning.
Although BERT enabled breakthroughs on many
NLP benchmarks, a growing list of analysis papers
are showing that its language skills are not as
impressive as they seem. In particular, it was
shown to rely on shallow heuristics in natural lan-
guage inference (McCoy et al., 2019b; Zellers
et al., 2019; Jin et al., 2020), reading compre-
hension (Si et al., 2019; Rogers et al., 2020;
Sugawara et al., 2020; Yogatama et al., 2019),
argument reasoning comprehension (Niven and
Kao, 2019), and text classification (Jin et al.,
2020). Such heuristics can even be used to recon-
struct a non–publicly available model (Krishna
et al., 2020). As with any optimization method, if
there is a shortcut in the data, we have no reason
to expect BERT to not learn it. But harder datasets
that cannot be resolved with shallow heuristics are
unlikely to emerge if their development is not as
valued as modeling work.
Benchmarks for the full range of linguistic
competence. Although the language models
seem to acquire a great deal of knowledge about
language, we do not currently have comprehensive
stress tests for different aspects of linguistic
knowledge. A step in this direction is the
‘‘Checklist’’ behavioral testing (Ribeiro et al.,
2020), the best paper at ACL 2020. Ideally, such
tests would measure not only errors, but also
sensitivity (Ettinger, 2019).
Developing methods to ‘‘teach’’ reasoning.
While large pre-trained models have a lot of know-
ledge, they often fail if any reasoning needs to be
performed on top of the facts they possess (Talmor
et al., 2019, see also subsection 3.3). For example,
Richardson et al. (2020) propose a method to
‘‘teach’’ BERT quantification, conditionals, com-
paratives, and Boolean coordination.
Learning what happens at inference time.
Most BERT analysis papers focus on different
probes of the model, with the goal to find what
the language model ‘‘knows’’. However, probing
studies have limitations (subsection 3.4), and to
this point, far fewer papers have focused on
discovering what knowledge actually gets used.
Several promising directions are the ‘‘amnesic
probing’’ (Elazar et al., 2020), identifying
features important for prediction for a given task
(Arkhangelskaia and Dutta, 2019), and pruning the
model to remove the non-important components
(Voita et al., 2019b; Michel et al., 2019; Prasanna
et al., 2020).
8 Conclusion
In a little over a year, BERT has become a
ubiquitous baseline in NLP experiments and
inspired numerous studies analyzing the model
and proposing various improvements. The stream
of papers seems to be accelerating rather than
slowing down, and we hope that this survey helps
the community to focus on the biggest unresolved
问题.
9 Acknowledgments
We thank the anonymous reviewers for their
valuable feedback. This work is funded in part
by NSF award number IIS-1844740 to Anna
Rumshisky.
References
Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin
Yao, Xing Fan, and Edward Guo. 2019. Knowl-
edge Distillation from Internal Representations.
arXiv preprint arXiv:1910.03723.
Alan Akbik, Tanja Bergmann, and Roland
Vollgraf. 2019. Pooled Contextualized Embed-
dings for Named Entity Recognition. In Pro-
ceedings of the 2019 Conference of the North
American Chapter of the Association for Com-
putational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers),
pages 724–728, Minneapolis, Minnesota. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/N19
-1078
Yuki Arase and Jun’ichi Tsujii. 2019. Transfer
Fine-Tuning: A BERT Case Study. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 5393–5404, Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-1542
Ekaterina Arkhangelskaia and Sourav Dutta.
2019. Whatcha lookin’ at? DeepLIFTing
BERT’s Attention in Question Answering.
arXiv preprint arXiv:1910.06431.
Mikel Artetxe, Sebastian Ruder, and Dani
Yogatama. 2019. On the Cross-lingual Trans-
ferability of Monolingual Representations.
arXiv:1911.03310 [cs]. DOI: https://doi
.org/10.18653/v1/2020.acl-main.421
Matthias Aßenmacher and Christian Heumann.
2020. On the comparability of Pre-Trained
Language Models. arXiv:2001.00781 [cs, stat].
Joris Baan, Maartje ter Hoeve, Marlies van der
Wees, Anne Schuth, and Maarten de Rijke.
2019. Understanding Multi-Head Attention
in Abstractive Summarization. arXiv preprint
arXiv:1911.03898.
Alexei Baevski, Sergey Edunov, Yinhan Liu,
Luke Zettlemoyer, and Michael Auli. 2019.
Cloze-driven Pretraining of Self-Attention
网络. 在诉讼程序中 2019 Confer-
ence on Empirical Methods in Natural Lan-
guage Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 5360–5369,
Hong Kong, China. Association for Compu-
tational Linguistics. DOI: https://doi.org
/10.18653/v1/D19-1539
He Bai, Peng Shi, Jimmy Lin, Luchen Tan, Kun
Xiong, Wen Gao, and Ming Li. 2020. Sega
BERT: Pre-training of Segment-aware BERT
for Language Understanding. arXiv:2004.
14996 [cs].
Sriram Balasubramanian, Naman Jain, Gaurav
Jindal, Abhijeet Awasthi, and Sunita Sarawagi.
2020. What’s in a Name? Are BERT Named
Entity Representations just as Good for
any other Name? In Proceedings of the
5th Workshop on Representation Learning
for NLP, pages 205–214, Online. Associ-
ation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/2020
.repl4nlp-1.24
Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang,
Nan Yang, Xiaodong Liu, Yu Wang, Songhao
Piao, Jianfeng Gao, Ming Zhou, and Hsiao-
Wuen Hon. 2020. UniLMv2: Pseudo-Masked
Language Models for Unified Language Model
Pre-Training. arXiv:2002.12804 [cs].
Yonatan Belinkov and James Glass. 2019. Anal-
ysis Methods in Neural Language Processing:
A Survey. Transactions of the Association
for Computational Linguistics, 7:49–72. DOI:
https://doi.org/10.1162/tacl_a_00254
Eyal Ben-David, Carmel Rabinovitz, and Roi
Reichart. 2020. PERL: Pivot-based Domain
Adaptation for Pre-trained Deep Contextual-
ized Embedding Models. arXiv:2006.09075
[cs]. DOI: https://doi.org/10.1162
/tacl_a_00328
Rishi Bommasani, Kelly Davis, and Claire
Cardie. 2020. Interpreting Pretrained Contex-
tualized Representations via Reductions to
Static Embeddings. In Proceedings of the 58th
Annual Meeting of the Association for Compu-
tational Linguistics, pages 4758–4781. DOI:
https://doi.org/10.18653/v1/2020
.acl-main.431
Zied Bouraoui, Jose Camacho-Collados, 和
Steven Schockaert. 2019. Inducing Relational
Knowledge from BERT. arXiv:1911.12753
[cs]. DOI: https://doi.org/10.1609
/aaai.v34i05.6242
Samuel Broscheit. 2019. Investigating Entity
Knowledge in BERT with Simple Neural
End-To-End Entity Linking. In Proceedings
of the 23rd Conference on Computational
Natural Language Learning (CoNLL),
pages 677–685, Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/K19
-1063
Tom B. Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya
Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark
Chen, Eric Sigler, Mateusz Litwin, Scott
Gray, Benjamin Chess, Jack Clark, Christopher
Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. 2020.
Language Models are Few-Shot Learners.
arXiv:2005.14165 [cs].
Gino Brunner, Yang Liu, Damian Pascual,
Oliver Richter, Massimiliano Ciaramita, 和
Roger Wattenhofer. 2020. On Identifiability in
Transformers. In International Conference on
Learning Representations.
Tianlong Chen, Jonathan Frankle, Shiyu Chang,
Sijia Liu, Yang Zhang, Zhangyang Wang, 和
Michael Carbin. 2020. The Lottery Ticket
Hypothesis for Pre-trained BERT Networks.
arXiv:2007.12223 [cs, stat].
Xingyi Cheng, Weidi Xu, Kunlong Chen, Wei
王, Bin Bi, Ming Yan, Chen Wu, Luo Si,
Wei Chu, and Taifeng Wang. 2019. Symmetric
Regularization based BERT for Pair-Wise
Semantic Reasoning. arXiv:1909.03405 [cs].
Kevin Clark, Urvashi Khandelwal, Omer Levy,
and Christopher D. Manning. 2019. What
Does BERT Look at? An Analysis of
BERT’s Attention. In Proceedings of the
2019 ACL Workshop BlackboxNLP: Analyzing
and Interpreting Neural Networks for NLP,
pages 276–286, Florence, Italy. Association
for Computational Linguistics. DOI: https://
doi.org/10.18653/v1/W19-4828, PMID:
31709923
Kevin Clark, Minh-Thang Luong, Quoc V. Le,
and Christopher D. Manning. 2020. ELECTRA:
Pre-Training Text Encoders as Discriminators
Rather Than Generators. In International
Conference on Learning Representations.
Stephane Clinchant, Kweon Woo Jung, 和
Vassilina Nikoulina. 2019. On the use of BERT
for Neural Machine Translation. In Proceedings
of the 3rd Workshop on Neural Generation and
Translation, pages 108–117, Hong Kong. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-5611
Alexis Conneau, Kartikay Khandelwal, Naman
Goyal, Vishrav Chaudhary, Guillaume Wenzek,
Francisco Guzmán, Edouard Grave, Myle Ott,
Luke Zettlemoyer, and Veselin Stoyanov.
2019. Unsupervised Cross-Lingual Represen-
tation Learning at Scale. arXiv:1911.02116
[cs]. DOI: https://doi.org/10.18653
/v1/2020.acl-main.747
Gonçalo M. Correia, Vlad Niculae, and André
F. 时间. 马丁斯. 2019. Adaptively Sparse Trans-
前者. 在诉讼程序中 2019 会议
on Empirical Methods
in Natural Lan-
guage Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 2174–2184,
香港, 中国. Association for Compu-
tational Linguistics. DOI: https://土井
.org/10.18653/v1/D19-1223
Matt Crane. 2018. Questionable Answers in
Question Answering Research: Reproducibility
and Variability of Published Results. Trans-
actions of the Association for Computational
语言学, 6:241–252. DOI: https://土井
org/10.1162/tacl a 00018
Leyang Cui, Sijie Cheng, Yu Wu, and Yue
Zhang. 2020. Does BERT Solve Common-
sense Task via Commonsense Knowledge?
arXiv:2008.03945 [cs].
Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin,
Ziqing Yang, Shijin Wang, and Guoping Hu.
2019. Pre-Training with Whole Word Masking
for Chinese BERT. arXiv:1906.08101 [cs].
Jeff Da and Jungo Kasai. 2019. Cracking the
Contextual Commonsense Code: Understand-
ing Commonsense Reasoning Aptitude of
Deep Contextual Representations. In Proceed-
ings of the First Workshop on Commonsense
Inference in Natural Language Processing,
pages 1–12, Hong Kong, China. Association
for Computational Linguistics.
Joe Davison, Joshua Feldman, and Alexander
Rush. 2019. Commonsense Knowledge Mining
from Pretrained Models. In Proceedings of
the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 1173–1178, Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-1109
Jacob Devlin, Ming-Wei Chang, Kenton Lee,
and Kristina Toutanova. 2019. BERT: Pre-
training of Deep Bidirectional Transformers
for Language Understanding. In Proceedings
of the 2019 Conference of the North
American Chapter of
the Association for
计算语言学: Human Language
Technologies, Volume 1 (Long and Short
文件), pages 4171–4186.
Jesse Dodge, Gabriel Ilharco, Roy Schwartz,
Ali Farhadi, Hannaneh Hajishirzi, and Noah
史密斯. 2020. Fine-Tuning Pretrained Language
楷模: Weight Initializations, Data Orders,
and Early Stopping. arXiv:2002.06305 [cs].
Yanai Elazar, Shauli Ravfogel, Alon Jacovi, 和
Yoav Goldberg. 2020. When Bert Forgets
How To POS: Amnesic Probing of Linguistic
Properties and MLM Predictions. arXiv:2006.
00995 [cs].
Kawin Ethayarajh. 2019. How Contextual
are Contextualized Word Representations?
Comparing the Geometry of BERT, ELMo,
and GPT-2 Embeddings. In Proceedings of
the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 55–65, Hong Kong, China. Associ-
ation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-1006
Allyson Ettinger. 2019. What BERT is not:
Lessons from a new suite of psycholinguis-
tic diagnostics for language models. arXiv:
1907.13528 [cs]. DOI: https://doi.org
/10.1162/tacl_a_00298
Angela Fan, Edouard Grave, and Armand Joulin.
2019. Reducing Transformer Depth on Demand
with Structured Dropout. In International
Conference on Learning Representations.
Maxwell Forbes, Ari Holtzman, and Yejin Choi.
2019. Do Neural Language Representations
Learn Physical Commonsense? In Proceedings
of the 41st Annual Conference of the Cognitive
科学社 (CogSci 2019), 页 7.
Jonathan Frankle and Michael Carbin. 2019. 这
Lottery Ticket Hypothesis: Finding Sparse,
Trainable Neural Networks. 在国际
Conference on Learning Representations.
Prakhar Ganesh, Yao Chen, Xin Lou,
Mohammad Ali Khan, Yin Yang, Deming
Chen, Marianne Winslett, Hassan Sajjad, and
Preslav Nakov. 2020. Compressing large-scale
transformer-based models: A case study on
BERT. arXiv preprint arXiv:2002.11985.
Siddhant Garg, Thuy Vu,
and Alessandro
Moschitti. 2020. TANDA: Transfer and Adapt
Pre-Trained Transformer Models for Answer
Sentence Selection. In AAAI. DOI: https://
doi.org/10.1609/aaai.v34i05.6282
Michael Glass, Alfio Gliozzo, Rishav Chakravarti,
Anthony Ferritto, Lin Pan, GP. Shrivatsa
Bhargav, Dinesh Garg, and Avi Sil. 2020.
Span Selection Pre-training for Question
Answering. In Proceedings of the 58th Annual
Meeting of the Association for Computational
语言学, pages 2773–2782, 在线的. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/2020
.acl-main.247
Goran Glavaš and Ivan Vulić. 2020. Is Supervised
Syntactic Parsing Beneficial for Language
Understanding? An Empirical Investigation.
arXiv:2008.06788 [cs].
Adele Goldberg. 2006. Constructions at Work:
The Nature of Generalization in Language,
Oxford University Press, USA.
Yoav Goldberg. 2019. Assessing BERT’s syntac-
tic abilities. arXiv preprint arXiv:1901.05287.
Linyuan Gong, Di He, Zhuohan Li, Tao Qin,
Liwei Wang, and Tieyan Liu. 2019. Efficient
training of BERT by progressively stacking.
In International Conference on Machine
学习, pages 2337–2346.
Mitchell A. Gordon, Kevin Duh, and Nicholas
Andrews. 2020. Compressing BERT: Studying
the effects of weight pruning on transfer
学习. arXiv 预印本 arXiv:2002.08307.
Saurabh Goyal, Anamitra Roy Choudhary,
Venkatesan Chakaravarthy, Saurabh ManishRaje,
Yogish Sabharwal, and Ashish Verma. 2020.
Power-bert: Accelerating BERT inference for
classification tasks. arXiv preprint arXiv:2001.
08950.
Fu-Ming Guo, Sijia Liu, Finlay S. Mungall, Xue
Lin, and Yanzhi Wang. 2019. Reweighted
Proximal Pruning for Large-Scale Language
Representation. arXiv:1909.12486 [cs, stat].
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
4
9
1
9
2
3
2
8
1
/
/
t
我
A
C
_
A
_
0
0
3
4
9
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3
Kelvin Guu, Kenton Lee, Zora Tung, Panupong
Pasupat, and Ming-Wei Chang. 2020. REALM:
Retrieval-Augmented Language Model Pre-
Training. arXiv:2002.08909 [cs].
Yaru Hao, Li Dong, Furu Wei, and Ke Xu.
2019. Visualizing and Understanding the
Effectiveness of BERT. In Proceedings of
the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 4143–4152, Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-1424
John Hewitt and Christopher D. Manning. 2019.
A Structural Probe for Finding Syntax in Word
Representations. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 4129–4138.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean.
2014. Distilling the Knowledge in a Neural
网络. In Deep Learning and Representation
Learning Workshop: NIPS 2014.
Benjamin Hoover, Hendrik Strobelt, and
Sebastian Gehrmann. 2019. exBERT: A Visual
Analysis Tool to Explore Learned Represen-
tations in Transformers Models. arXiv:1910.
05276 [cs]. DOI: https://doi.org/10
.18653/v1/2020.acl-demos.22
Neil Houlsby, Andrei Giurgiu,
Stanislaw
Jastrzebski, Bruna Morrone, Quentin de
Laroussilhe, Andrea Gesmundo, Mona
Attariyan, and Sylvain Gelly. 2019. Parameter-
Efficient Transfer Learning for NLP. arXiv:
1902.00751 [cs, stat].
Phu Mon Htut, Jason Phang, Shikha Bordia, 和
Samuel R. Bowman. 2019. Do attention heads
in BERT track syntactic dependencies? arXiv
preprint arXiv:1911.12246.
Sarthak Jain and Byron C. 华莱士. 2019.
Attention is not Explanation. In Proceedings
的 2019 Conference of the North American
Chapter of the Association for Computational
语言学: 人类语言技术,
Volume 1 (Long and Short Papers),
pages 3543–3556.
Ganesh Jawahar, Benoît Sagot, Djamé Seddah,
Samuel Unicomb, Gerardo Iñiguez, Márton
Karsai, Yannick Léo, Márton Karsai, Carlos
Sarraute, Éric Fleury, et al. 2019. What does
BERT learn about the structure of language?
In 57th Annual Meeting of the Association
for Computational Linguistics (ACL), Florence,
Italy. DOI: https://doi.org/10.18653
/v1/P19-1356
Haoming Jiang, Pengcheng He, Weizhu Chen,
Xiaodong Liu, Jianfeng Gao, and Tuo Zhao.
2019A. SMART: Robust and Efficient Fine-
Tuning for Pre-trained Natural Language
Models through Principled Regularized Opti-
mization. arXiv preprint arXiv:1911.03437.
DOI: https://doi.org/10.18653/v1
/2020.acl-main.197, PMID: 33121726,
PMCID: PMC7218724
Zhengbao Jiang, Frank F. Xu, Jun Araki, and
Graham Neubig. 2019b. How Can We Know
What Language Models Know? arXiv:1911.
12543 [cs]. DOI: https://doi.org/10
.1162/tacl_a_00324
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang,
Xiao Chen, Linlin Li, Fang Wang, and Qun
Liu. 2019. TinyBERT: Distilling BERT for
Natural Language Understanding. arXiv preprint
arXiv:1909.10351.
Di Jin, Zhijing Jin, Joey Tianyi Zhou, 和
Peter Szolovits. 2020. Is BERT Really Robust?
A Strong Baseline for Natural Language Attack
on Text Classification and Entailment. In AAAI
2020. DOI: https://doi.org/10.1609
/aaai.v34i05.6311
Mandar Joshi, Danqi Chen, Yinhan Liu,
Daniel S. Weld, Luke Zettlemoyer, and
Omer Levy. 2020. SpanBERT: Improving
Pre-Training by Representing and Predicting
Spans. Transactions of the Association for
Computational Linguistics, 8:64–77. DOI:
https://doi.org/10.1162/tacl_a_00300
Wei-Tsung Kao, Tsung-Han Wu, Po-Han Chi,
Chun-Cheng Hsieh, and Hung-Yi Lee. 2020.
Further boosting BERT-based models by
duplicating existing layers: Some intriguing
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
4
9
1
9
2
3
2
8
1
/
/
t
我
A
C
_
A
_
0
0
3
4
9
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3
phenomena inside BERT. arXiv preprint
arXiv:2001.09309.
Taeuk Kim, Jihun Choi, Daniel Edmiston, 和
Sang-goo Lee. 2020. Are pre-trained language
models aware of phrases? simple but strong
baselines for grammar induction. In ICLR 2020.
Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi,
and Kentaro Inui. 2020. Attention Module is
Not Only a Weight: Analyzing Transformers
with Vector Norms. arXiv:2004.10102 [cs].
Dan Kondratyuk and Milan Straka. 2019. 75
Languages, 1 模型: Parsing Universal Depen-
dencies Universally. In Proceedings of the
2019 Conference on Empirical Methods in
Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 2779–2795, Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-1279
Lingpeng Kong, Cyprien de Masson d’Autume,
Lei Yu, Wang Ling, Zihang Dai, and Dani
Yogatama. 2019. A mutual information max-
imization perspective of language representa-
tion learning. In International Conference on
Learning Representations.
Olga Kovaleva, Alexey Romanov, Anna Rogers,
and Anna Rumshisky. 2019. Revealing the
Dark Secrets of BERT. In Proceedings of
the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 4356–4365, Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-1445
Kalpesh Krishna, Gaurav Singh Tomar, Ankur P.
Parikh, Nicolas Papernot, and Mohit Iyyer.
2020. Thieves on Sesame Street! 模型
Extraction of BERT-Based APIs. In ICLR 2020.
Varun Kumar, Ashutosh Choudhary, and Eunah
Cho. 2020. Data Augmentation using Pre-
Trained Transformer Models. arXiv:2003.
02245 [cs].
Ilia Kuznetsov and Iryna Gurevych. 2020. A
Matter of Framing: The Impact of Linguistic
Formalism on Probing Results. arXiv:2004.
14999 [cs].
Guillaume Lample and Alexis Conneau. 2019.
Cross-Lingual Language Model Pretraining.
arXiv:1901.07291 [cs].
Zhenzhong Lan, Mingda Chen, Sebastian
Goodman, Kevin Gimpel, Piyush Sharma, and
Radu Soricut. 2020A. ALBERT: A Lite BERT
for Self-Supervised Learning of Language
Representations. In ICLR.
Cheolhyoung Lee, Kyunghyun Cho, and Wanmo
Kang. 2019. Mixout: Effective regularization
to finetune large-scale pretrained language
models. arXiv preprint arXiv:1909.11299.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Ves Stoyanov, and Luke Zettlemoyer.
2019. BART: Denoising Sequence-to-Sequence
Pre-Training for Natural Language Genera-
tion, Translation, and Comprehension. arXiv:
1910.13461 [cs, stat]. DOI: https://doi
.org/10.18653/v1/2020.acl-main.703
Changmao Li and Jinho D. Choi. 2020. 反式-
formers to Learn Hierarchical Contexts in
Multiparty Dialogue for Span-based Question
Answering. In Proceedings of the 58th Annual
Meeting of
the Association for Computa-
tional Linguistics, pages 5709–5714, Online.
Association for Computational Linguistics.
Zhuohan Li, Eric Wallace, Sheng Shen, Kevin
Lin, Kurt Keutzer, Dan Klein, and Joseph E.
Gonzalez. 2020. Train large, then compress:
Rethinking model size for efficient training
and inference of transformers. arXiv preprint
arXiv:2002.11794.
Yongjie Lin, Yi Chern Tan, and Robert Frank.
2019. Open Sesame: Getting inside BERT’s
Linguistic Knowledge. In Proceedings of the
2019 ACL Workshop BlackboxNLP: Analyzing
and Interpreting Neural Networks for NLP,
pages 241–253.
Nelson F. 刘, Matt Gardner, Yonatan Belinkov,
Matthew E. Peters, and Noah A. 史密斯. 2019A.
Linguistic Knowledge and Transferability of
Contextual Representations. In Proceedings
of the 2019 Conference of the North
American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short
Papers), pages 1073–1094, Minneapolis,
Minnesota. Association for Computational
Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019乙. RoBERTa: A Robustly Opti-
mized BERT Pretraining Approach. arXiv:
1907.11692 [cs].
Alessio Miaschi and Felice Dell’Orletta. 2020.
Contextual and Non-Contextual Word Embed-
dings: An in-depth Linguistic Investigation. In
Proceedings of the 5th Workshop on Represen-
tation Learning for NLP, pages 110–119. DOI:
https://doi.org/10.18653/v1/2020
.repl4nlp-1.15
Xiaofei Ma, Zhiguo Wang, Patrick Ng, Ramesh
Nallapati, and Bing Xiang. 2019. Universal
Text Representation from BERT: An Empirical
Study. arXiv:1910.07973 [cs].
Paul Michel, Omer Levy, and Graham Neubig.
2019. Are Sixteen Heads Really Better than
One? Advances in Neural Information Process-
ing Systems 32 (NIPS 2019).
Christopher D. Manning, Kevin Clark, John
Hewitt, Urvashi Khandelwal, and Omer Levy.
2020. Emergent linguistic structure in artificial
neural networks trained by self-supervision.
Proceedings of the National Academy of Sci-
ences, page 201907367. DOI: https://
doi.org/10.1073/pnas.1907367117,
PMID: 32493748
Timothee Mickus, Denis Paperno, Mathieu
Constant, and Kees van Deemter. 2019. What
do you mean, BERT? Assessing BERT as a
distributional semantics model. arXiv preprint
arXiv:1911.05758.
Microsoft. 2020. Turing-NLG: A 17-billion-
parameter language model by Microsoft.
Chandler May, Alex Wang, Shikha Bordia,
Samuel R. Bowman, and Rachel Rudinger.
2019. On Measuring Social Biases in Sentence
Encoders. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 622–628,
Minneapolis, Minnesota. Association for
Computational Linguistics.
J. S. McCarley, Rishav Chakravarti, and Avirup
Sil. 2020. Structured Pruning of a BERT-based
Question Answering Model. arXiv:1910.06360
[cs].
右. Thomas McCoy, Tal Linzen, Ewan Dunbar,
and Paul Smolensky. 2019A. RNNs implicitly
implement
陈述.
In International Conference on Learning
Representations.
tensor-product
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S.
Corrado, and Jeff Dean. 2013. Distributed
representations of words and phrases and
their compositionality. In Advances in Neural
Information Processing Systems 26 (NIPS
2013), 第 3111–3119 页.
Jiaqi Mu and Pramod Viswanath. 2018. All-but-
the-top: Simple and effective postprocessing
for word representations. In International
Conference on Learning Representations.
Timothy Niven and Hung-Yu Kao. 2019.
Probing Neural Network Comprehension
of Natural Language Arguments. In Pro-
ceedings of the 57th Annual Meeting of the
Association for Computational Linguistics,
pages 4658–4664, Florence, Italy. Associ-
ation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P19
-1459
Tom McCoy, Ellie Pavlick, and Tal Linzen.
2019b. Right for the Wrong Reasons: Diag-
nosing Syntactic Heuristics in Natural Lan-
guage Inference. In Proceedings of the 57th
Annual Meeting of the Association for Com-
putational Linguistics, pages 3428–3448,
Florence, Italy. Association for Computational
Linguistics. DOI: https://doi.org/10
.18653/v1/P19-1334
Matthew E. Peters, Mark Neumann, Robert Logan,
Roy Schwartz, Vidur Joshi, Sameer Singh, 和
诺亚A. 史密斯. 2019A. Knowledge Enhanced
Contextual Word Representations. In Proceed-
ings of
这 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 43–54, 香港, 中国. Associ-
ation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-1005, PMID: 31383442
Matthew E. Peters, Sebastian Ruder, and Noah
A. Smith. 2019b. To Tune or Not to Tune?
Adapting Pretrained Representations to
Diverse Tasks. In Proceedings of the 4th
Workshop on Representation Learning for
NLP (RepL4NLP-2019), pages 7–14, Florence,
Italy. Association for Computational Lin-
guistics. DOI: https://doi.org/10.18653
/v1/W19-4302, PMCID: PMC6351953
Fabio Petroni, Tim Rocktäschel, Sebastian
Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang
Wu, and Alexander Miller. 2019. Language
Models as Knowledge Bases? In Proceedings
of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 2463–2473, Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-1250
Jason Phang, Thibault Févry, and Samuel R.
Bowman. 2019. Sentence Encoders on STILTs:
Supplementary Training on Intermediate
Labeled-Data Tasks. arXiv:1811.01088 [cs].
Tiago Pimentel, Josef Valvoda, Rowan Hall
Maudslay, Ran Zmigrod, Adina Williams, 和
Ryan Cotterell. 2020. Information-Theoretic
Probing for Linguistic Structure. arXiv:2004.
03061 [cs]. DOI: https://doi.org/10
.18653/v1/2020.acl-main.420
Nina Poerner, Ulli Waltinger, and Hinrich
Schütze. 2019. BERT is not a knowledge base
(yet): Factual knowledge vs. name-based rea-
soning in unsupervised QA. arXiv preprint arXiv:
1911.03681.
Sai Prasanna, Anna Rogers, and Anna Rumshisky.
2020. When BERT Plays the Lottery, 全部
Tickets Are Winning. In Proceedings of the
2020 Conference on Empirical Methods in Nat-
ural Language Processing. 在线的. 协会
for Computational Linguistics.
Ofir Press, Noah A. Smith, and Omer Levy.
2020. Improving Transformer Models by Re-
ordering their Sublayers. In Proceedings of the
58th Annual Meeting of the Association for
计算语言学, pages 2996–3005,
在线的. Association for Computational Lin-
语言学. DOI: https://doi.org/10.18653
/v1/2020.acl-main.270
Yada Pruksachatkun, Jason Phang, Haokun Liu,
Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe
Pang, Clara Vania, Katharina Kann, 和
Samuel R. Bowman. 2020. Intermediate-Task
Transfer Learning with Pretrained Language
楷模: When and Why Does It Work?
In Proceedings of the 58th Annual Meet-
ing of the Association for Computational
Linguistics, pages 5231–5247, Online. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/2020
.acl-main.467
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, 迈克尔
Matena, Yanqi Zhou, Wei Li, and Peter J. 刘.
2019. Exploring the Limits of Transfer Learning
with a Unified Text-to-Text Transformer.
arXiv:1910.10683 [cs, stat].
Alessandro Raganato, Yves Scherrer, and Jörg
Tiedemann. 2020. Fixed Encoder Self-Attention
Patterns in Transformer-Based Machine Trans-
lation. arXiv:2002.10260 [cs].
Alessandro Raganato and Jörg Tiedemann. 2018.
An Analysis of Encoder Representations in
Transformer-Based Machine Translation. 在
诉讼程序 2018 EMNLP Workshop
BlackboxNLP: Analyzing and Interpreting
Neural Networks for NLP, pages 287–297,
Brussels, Belgium. Association for Computa-
tional Linguistics. DOI: https://doi.org
/10.18653/v1/W18-5431
Marco Tulio Ribeiro, Tongshuang Wu, Carlos
Guestrin, and Sameer Singh. 2020. Beyond
Accuracy: Behavioral Testing of NLP Models
with CheckList. In Proceedings of the 58th
Annual Meeting of the Association for Com-
putational Linguistics, pages 4902–4912,
在线的. Association for Computational Lin-
guistics. DOI: https://doi.org/10.18653
/v1/2020.acl-main.442
Kyle Richardson, Hai Hu, Lawrence S. Moss,
and Ashish Sabharwal. 2020. Probing Natural
Language Inference Models through Semantic
Fragments. In AAAI 2020. DOI: https://
doi.org/10.1609/aaai.v34i05.6397
Kyle Richardson and Ashish Sabharwal. 2019.
What Does My QA Model Know? Devising
Controlled Probes using Expert Knowledge.
arXiv:1912.13337 [cs]. DOI: https://doi
.org/10.1162/tacl_a_00331
Adam Roberts, Colin Raffel, and Noam Shazeer.
2020. How Much Knowledge Can You Pack
Into the Parameters of a Language Model?
arXiv preprint arXiv:2002.08910.
Anna Rogers, Olga Kovaleva, Matthew Downey,
and Anna Rumshisky. 2020. Getting Closer to
AI Complete Question Answering: A Set of
Prerequisite Real Tasks. In AAAI, page 11.
DOI: https://doi.org/10.1609/aaai
.v34i05.6398
Rudolf Rosa and David Mareček. 2019. Inducing
syntactic trees from BERT representations.
arXiv preprint arXiv:1906.11511.
Victor Sanh, Lysandre Debut, Julien Chaumond,
and Thomas Wolf. 2019. DistilBERT, a distilled
version of BERT: Smaller, faster, cheaper and
lighter. In 5th Workshop on Energy Efficient
Machine Learning and Cognitive Computing –
NeurIPS 2019.
Victor Sanh, Thomas Wolf, and Alexander M.
Rush. 2020. Movement Pruning: Adaptive
Sparsity by Fine-Tuning. arXiv:2005.07683
[cs].
Florian Schmidt and Thomas Hofmann. 2020.
BERT as a Teacher: Contextual Embeddings
for Sequence-Level Reward. arXiv preprint
arXiv:2003.02738.
Roy Schwartz, Jesse Dodge, Noah A. Smith,
and Oren Etzioni. 2019. Green AI. arXiv:
1907.10597 [cs, stat].
Sofia Serrano and Noah A. Smith. 2019. Is
Attention Interpretable? arXiv:1906.03731
[cs]. DOI: https://doi.org/10.18653
/v1/P19-1282
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian
Ma, Zhewei Yao, Amir Gholami, Michael W.
Mahoney, and Kurt Keutzer. 2019. Q-BERT:
Hessian Based Ultra Low Precision Quanti-
zation of BERT. arXiv preprint arXiv:1909.
05840. DOI: https://doi.org/10.1609
/aaai.v34i05.6409
Chenglei Si, Shuohang Wang, Min-Yen Kan,
and Jing Jiang. 2019. What does BERT Learn
from Multiple-Choice Reading Comprehension
Datasets? arXiv:1910.12391 [cs].
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, 和
Tie-Yan Liu. 2020. MPNet: Masked and Per-
muted Pre-training for Language Understand-
ing. arXiv:2004.09297 [cs].
Asa Cooper Stickland and Iain Murray. 2019.
BERT and PALs: Projected Attention Layers
for Efficient Adaptation in Multi-Task Learn-
ing. In International Conference on Machine
学习, pages 5986–5995.
Emma Strubell, Ananya Ganesh, and Andrew
McCallum. 2019. Energy and Policy Consider-
ations for Deep Learning in NLP. In ACL 2019.
Ta-Chun Su and Hsiang-Chih Cheng. 2019.
SesameBERT: Attention for Anywhere. arXiv:
1910.03176 [cs].
Timo Schick and Hinrich Schütze. 2020.
BERTRAM:
Improved Word Embeddings
Have Big Impact on Contextualized Model Per-
formance. In Proceedings of the 58th Annual
Meeting of the Association for Computational
语言学, pages 3996–4007, 在线的. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/2020
.acl-main.368
Saku Sugawara, Pontus Stenetorp, Kentaro Inui,
and Akiko Aizawa. 2020. Assessing the Bench-
marking Capacity of Machine Reading Com-
prehension Datasets. In AAAI. DOI: https://
doi.org/10.1609/aaai.v34i05.6422
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing
刘. 2019A. Patient Knowledge Distillation for
BERT Model Compression. In Proceedings
of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 4314–4323. DOI: https://doi.org
/10.18653/v1/D19-1441
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng,
Xuyi Chen, Han Zhang, Xin Tian, Danxiang
Zhu, Hao Tian, and Hua Wu. 2019b. ERNIE:
Enhanced Representation through Knowledge
Integration. arXiv:1904.09223 [cs].
Yu Sun, Shuohuan Wang, Yukun Li, Shikun
Feng, Hao Tian, Hua Wu, and Haifeng Wang.
2019C. ERNIE 2.0: A Continual Pre-Training
Framework for Language Understanding.
arXiv:1907.12412 [cs]. DOI: https://doi
.org/10.1609/aaai.v34i05.6428
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie
Liu, Yiming Yang, and Denny Zhou. 2020.
MobileBERT: Task-Agnostic Compression of
BERT for Resource Limited Devices.
Dhanasekar Sundararaman, Vivek Subramanian,
Guoyin Wang, Shijing Si, Dinghan Shen,
Dong Wang, and Lawrence Carin. 2019.
Syntax-Infused Transformer and BERT models
for Machine Translation and Natural Language
Understanding. arXiv:1911.06156 [cs, stat].
DOI: https://doi.org/10.1109/IALP48816
.2019.9037672, PMID: 31938450, PMCID:
PMC6959198
Alon Talmor, Yanai Elazar, Yoav Goldberg,
and Jonathan Berant. 2019. oLMpics – On
what Language Model Pre-Training Captures.
arXiv:1912.13283 [cs].
Hirotaka Tanaka, Hiroyuki Shinnou, Rui Cao,
Jing Bai, and Wen Ma. 2020. Document
Classification by Word Embeddings of BERT.
In Computational Linguistics, Communica-
tions in Computer and Information Science,
pages 145–154, Singapore. Springer.
Raphael Tang, Yao Lu, Linqing Liu, Lili
Mou, Olga Vechtomova, and Jimmy Lin.
2019. Distilling Task-Specific Knowledge from
BERT into Simple Neural Networks. arXiv
preprint arXiv:1903.12136.
Ian Tenney, Dipanjan Das, and Ellie Pavlick.
2019A. BERT Rediscovers the Classical NLP
Pipeline. In Proceedings of the 57th Annual
Meeting of the Association for Computa-
tional Linguistics, pages 4593–4601. DOI:
https://doi.org/10.18653/v1/P19
-1452
Ian Tenney, Patrick Xia, Berlin Chen, Alex
Wang, Adam Poliak, R. Thomas McCoy,
Najoung Kim, Benjamin Van Durme, Samuel
右. Bowman, Dipanjan Das, and Ellie Pavlick.
2019乙. What do you learn from context?
Probing for sentence structure in contextu-
alized word representations. 在国际
Conference on Learning Representations.
James Yi Tian, Alexander P. Kreuzer, Pai-Hung
Chen, and Hans-Martin Will. 2019. WaLDORf:
Wasteless Language-model Distillation On
Reading-comprehension. arXiv preprint arXiv:
1912.06638.
Shubham Toshniwal, Haoyue Shi, Bowen Shi,
Lingyu Gao, Karen Livescu, and Kevin
Gimpel. 2020. A Cross-Task Analysis of Text
Span Representations. In Proceedings of the
5th Workshop on Representation Learning
for NLP, pages 166–176, Online. Associ-
ation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/2020
.repl4nlp-1.20
Henry Tsai,
Jason Riesa, Melvin Johnson,
Naveen Arivazhagan, Xin Li, and Amelia
Archer. 2019. Small and Practical BERT
Models for Sequence Labeling. arXiv preprint
arXiv:1909.00100. DOI: https://doi.org
/10.18653/v1/D19-1374
Iulia Turc, Ming-Wei Chang, Kenton Lee,
and Kristina Toutanova. 2019. Well-Read
Students Learn Better: The Impact of Student
Initialization on Knowledge Distillation. arXiv
preprint arXiv:1908.08962.
Marten van Schijndel, Aaron Mueller, and Tal
Linzen. 2019. Quantity doesn’t buy quality
syntax with neural language models. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 5831–5837, Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-1592
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is All you Need. In Advances
in Neural Information Processing Systems,
pages 5998–6008.
Jesse Vig. 2019. Visualizing Attention in
Transformer-Based Language Representation
楷模. arXiv:1904.02679 [cs, stat].
Jesse Vig and Yonatan Belinkov. 2019. Analyzing
the Structure of Attention in a Transformer
Language Model. In Proceedings of the
2019 ACL Workshop BlackboxNLP: Analyz-
ing and Interpreting Neural Networks for
NLP, pages 63–76, Florence, Italy. Associ-
ation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/W19
-4808
David Vilares, Michalina Strzyz, Anders Søgaard,
and Carlos Gómez-Rodríguez. 2020. Parsing as
pretraining. In Thirty-Fourth AAAI Conference
on Artificial Intelligence (AAAI-20). DOI:
https://doi.org/10.1609/aaai.v34i05
.6446
Elena Voita, Rico Sennrich, and Ivan Titov. 2019A.
The Bottom-up Evolution of Representations
in the Transformer: A Study with Machine
Translation and Language Modeling Objec-
tives. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 4387–4397. DOI:
https://doi.org/10.18653/v1/D19
-1448
Elena Voita, David Talbot, Fedor Moiseev, Rico
Sennrich, and Ivan Titov. 2019乙. Analyzing
Multi-Head Self-Attention: Specialized Heads
Do the Heavy Lifting, the Rest Can Be Pruned.
arXiv preprint arXiv:1905.09418. DOI:
https://doi.org/10.18653/v1/P19
-1580
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt
Gardner, and Sameer Singh. 2019a. Uni-
versal Adversarial Triggers for Attacking
and Analyzing NLP. In Proceedings of the
2019 Conference on Empirical Methods in
Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 2153–2162, Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-1221
Eric Wallace, Yizhong Wang, Sujian Li, Sameer
Singh, and Matt Gardner. 2019b. Do NLP
Models Know Numbers? Probing Numeracy
in Embeddings. arXiv preprint arXiv:1909.
07940. DOI: https://doi.org/10.18653
/v1/D19-1534
Alex Wang, Amapreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R.
Bowman. 2018. GLUE: A Multi-Task Bench-
mark and Analysis Platform for Natural Lan-
guage Understanding. In Proceedings of the
2018 EMNLP Workshop BlackboxNLP: Ana-
lyzing and Interpreting Neural Networks for
自然语言处理, pages 353–355, 布鲁塞尔, 比利时. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/W18
-5446
Ruize Wang, Duyu Tang, Nan Duan, Zhongyu
Wei, Xuanjing Huang, Jianshu Ji, Guihong
Cao, Daxin Jiang, and Ming Zhou. 2020a. K-
Adapter: Infusing Knowledge into Pre-Trained
Models with Adapters. arXiv:2002.01808 [cs].
Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi
Bao, Liwei Peng, and Luo Si. 2019A. Struct-
BERT: Incorporating Language Structures into
Pre-Training for Deep Language Understand-
ing. arXiv:1908.04577 [cs].
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao,
Nan Yang, and Ming Zhou. 2020乙. MiniLM:
Deep Self-Attention Distillation for Task-
Agnostic Compression of Pre-Trained Trans-
前者. arXiv 预印本 arXiv:2002.10957.
Elena Voita and Ivan Titov. 2020. Infor-
Theoretic Probing with Minimum Description
Length. arXiv:2003.12298 [cs].
Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu,
Zhiyuan Liu, Juanzi Li, and Jian Tang. 2020C.
KEPLER: A Unified Model for Knowledge
Embedding and Pre-trained Language Repre-
sentation. arXiv:1911.06136 [cs].
Yile Wang, Leyang Cui, and Yue Zhang. 2020d.
How Can BERT Help Lexical Semantics Tasks?
arXiv:1911.02929 [cs].
Zihan Wang, Stephen Mayhew, Dan Roth, 等人.
2019乙. Cross-Lingual Ability of Multilingual
BERT: An Empirical Study. arXiv preprint
arXiv:1912.07840.
Alex Warstadt and Samuel R. Bowman. 2020.
Can neural networks acquire a structural bias
from raw linguistic data? In Proceedings of the
42nd Annual Virtual Meeting of the Cognitive
科学社. 在线的.
Alex Warstadt, Yu Cao,
Ioana Grosu, Wei
Peng, Hagen Blix, Yining Nie, Anna Alsop,
Shikha Bordia, Haokun Liu, Alicia Parrish,
et al. 2019. Investigating BERT’s Knowledge
of Language: Five Analysis Methods with
NPIs. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 2870–2880. DOI:
https://doi.org/10.18653/v1/D19-1286
Gregor Wiedemann,
Steffen Remus, Avi
Chawla, and Chris Biemann. 2019. Does
BERT Make Any Sense? Interpretable Word
Sense Disambiguation with Contextualized
Embeddings. arXiv preprint arXiv:1909.10430.
Sarah Wiegreffe and Yuval Pinter. 2019. Atten-
tion is not not Explanation. In Proceedings
的 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 11–20, Hong Kong, China. Associ-
ation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19-1002
Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, R´emi Louf,
Morgan Funtowicz, and Jamie Brew. 2020.
HuggingFace’s Transformers: State-of-the-Art
Natural Language Processing. arXiv:1910.
03771 [cs].
Felix Wu, Angela Fan, Alexei Baevski, Yann
Dauphin, and Michael Auli. 2019A. Pay Less
Attention with Lightweight and Dynamic
Convolutions. In International Conference on
Learning Representations.
Xing Wu, Shangwen Lv, Liangjun Zang,
Jizhong Han, and Songlin Hu. 2019b. Con-
ditional BERT Contextual Augmentation. 在
ICCS 2019: Computational Science ICCS
2019, pages 84–95. Springer. DOI: https://
doi.org/10.1007/978-3-030-22747-0 7
Yonghui Wu, Mike Schuster, Zhifeng Chen,
Quoc V. Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao,
Klaus Macherey, 等人. 2016. Google’s Neural
Machine Translation System: Bridging the Gap
between Human and Machine Translation.
Zhiyong Wu, Yun Chen, Ben Kao, and Qun
Liu. 2020. Perturbed Masking: Parameter-free
Probing for Analyzing and Interpreting BERT.
In Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics,
pages 4166–4176, Online. Association for
Computational Linguistics.
Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu
Wei, and Ming Zhou. 2020. BERT-of-Theseus:
Compressing BERT by Progressive Module
Replacing. arXiv preprint arXiv:2002.02925.
Junjie Yang and Hai Zhao. 2019. Deepening
Hidden Representations from Pre-Trained Lan-
guage Models for Natural Language Under-
standing. arXiv:1911.01940 [cs].
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime
Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. 2019. XLNet: Generalized Autoregressive
Pretraining
for Language Understanding.
arXiv:1906.08237 [cs].
Pengcheng Yin, Graham Neubig, Wen-tau Yih,
and Sebastian Riedel. 2020. TaBERT: Pretrain-
ing for Joint Understanding of Textual and
Tabular Data. In Proceedings of the 58th Annual
Meeting of the Association for Computational
语言学, pages 8413–8426, 在线的. Associ-
ation for Computational Linguistics.
Dani Yogatama, Cyprien de Masson d’Autume,
Jerome Connor, Tomas Kocisky, Mike
Chrzanowski, Lingpeng Kong, Angeliki
Lazaridou, Wang Ling, Lei Yu, Chris Dyer,
and Phil Blunsom. 2019. Learning and Evaluat-
ing General Linguistic Intelligence. arXiv:
1901.11373 [cs, stat].
Zhuosheng Zhang, Yuwei Wu, Hai Zhao,
Zuchao Li, Shuailiang Zhang, Xi Zhou, 和
Xiang Zhou. 2020. Semantics-aware BERT for
Language Understanding. In AAAI 2020.
Yang You, Jing Li, Sashank Reddi, Jonathan Hseu,
Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan
Song, James Demmel, and Cho-Jui Hsieh. 2019.
Large Batch Optimization for Deep Learning:
Training BERT in 76 Minutes. arXiv preprint
arXiv:1904.00962, 1(5).
Ali Hadi Zadeh and Andreas Moshovos. 2020.
GOBO: Quantizing Attention-Based NLP
Models for Low Latency and Energy Efficient
Inference. arXiv:2005.03842 [cs, stat]. DOI:
https://doi.org/10.1109/MICRO50266
.2020.00071
Ofir Zafrir, Guy Boudoukh, Peter Izsak, 和
Moshe Wasserblat. 2019. Q8BERT: Quantized
8bit BERT. arXiv preprint arXiv:1910.06188.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali
Farhadi, and Yejin Choi. 2019. HellaSwag: Can
a Machine Really Finish Your Sentence? In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 4791–4800.
Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin
Jiang, Maosong Sun, and Qun Liu. 2019.
ERNIE: Enhanced Language Representa-
tion with Informative Entities. In Proceedings
of the 57th Annual Meeting of the Association
for Computational Linguistics, pages 1441–1451,
Florence, Italy. Association for Computa-
tional Linguistics. DOI: https://doi.org
/10.18653/v1/P19-1139
Sanqiang Zhao, Raghav Gupta, Yang Song,
and Denny Zhou. 2019. Extreme Language
Model Compression with Optimal Subwords
and Shared Projections. arXiv preprint
arXiv:1909.11687.
Yiyun Zhao and Steven Bethard. 2020. 如何
does BERT’s attention change when you fine-
tune? An analysis methodology and a case study
in negation scope. In Proceedings of the 58th
Annual Meeting of the Association for Com-
putational Linguistics, pages 4729–4747,
Online. Association for Computational Linguis-
tics. DOI: https://doi.org/10.18653
/v1/2020.acl-main.429, PMCID:
PMC7660194
Wenxuan Zhou, Junyi Du, and Xiang Ren. 2019.
Improving BERT Fine-tuning with Embedding
Normalization. arXiv preprint arXiv:1911.
03918.
Xuhui Zhou, Yue Zhang, Leyang Cui, and Dandan
Huang. 2020. Evaluating Commonsense in
Pre-Trained Language Models. In AAAI 2020.
DOI: https://doi.org/10.1609
/aaai.v34i05.6523
Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom
Goldstein, and Jingjing Liu. 2019. FreeLB:
Enhanced Adversarial Training for Language
Understanding. arXiv:1909.11764 [cs].