Syntactic Structure Distillation Pretraining for Bidirectional Encoders

Adhiguna Kuncoro∗♠♦  Lingpeng Kong∗♠  Daniel Fried∗♣
Dani Yogatama♠  Laura Rimell♠  Chris Dyer♠  Phil Blunsom♠♦
♠DeepMind, London, United Kingdom
♦Department of Computer Science, University of Oxford, United Kingdom
♣Computer Science Division, University of California, Berkeley, CA, USA
{akuncoro,lingpenk,dyogatama,laurarimell,cdyer,pblunsom}@google.com
dfried@cs.berkeley.edu

Transactions of the Association for Computational Linguistics, vol. 8, pp. 776–794, 2020. https://doi.org/10.1162/tacl_a_00345
Action Editor: James Henderson. Submission batch: 6/2020; Revision batch: 8/2020; Published 12/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

Textual representation learners trained on large amounts of data have achieved notable success on downstream tasks; intriguingly, they have also performed well on challenging tests of syntactic competence. Thus, it remains an open question whether scalable learners like BERT can become fully proficient in the syntax of natural language by virtue of data scale alone, or whether they still benefit from more explicit syntactic biases. To answer this question, we introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining, by distilling the syntactically informative predictions of a hierarchical—albeit harder to scale—syntactic language model. Since BERT models masked words in bidirectional context, we propose to distill the approximate marginal distribution over words in context from the syntactic LM. Our approach reduces relative error by 2–21% on a diverse set of structured prediction tasks, although we obtain mixed results on the GLUE benchmark. Our findings demonstrate the benefits of syntactic biases, even for representation learners that exploit large amounts of data, and contribute to a better understanding of where syntactic biases are helpful in benchmarks of natural language understanding.

1 Introduction

Large-scale textual representation learners trained with variants of the language modeling (LM) objective have achieved remarkable success on downstream tasks (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019). Moreover, these models have also been shown to perform remarkably well at syntactic grammaticality judgment tasks (Goldberg, 2019), and encode substantial amounts of syntax in their learned representations (Liu et al., 2019a; Tenney et al., 2019a,b; Hewitt and Manning, 2019; Jawahar et al., 2019). Intriguingly, success on these syntactic tasks has been achieved by Transformer architectures (Vaswani et al., 2017) that lack explicit notions of hierarchical syntactic structures.

∗Equal contribution.

Based on such evidence, it would be tempting
to conclude that data scale alone is all we need to
learn the syntax of natural language. Nevertheless,
recent findings that systematically compare the
syntactic competence of models trained at varying
data scales suggest that model inductive biases are
in fact more important than data scale for acquiring
syntactic competence (Hu et al., 2020). Two
natural questions, therefore, are the following:
Can representation learners that work well at scale
still benefit from explicit syntactic biases? And
where exactly would such syntactic biases be
helpful in different language understanding tasks?
Here we work towards answering these questions
by devising a new pretraining strategy that injects
syntactic biases into a BERT (Devlin et al., 2019)
learner that works well at scale. We hypothesize
that this approach can improve the competence of
BERT on various tasks, which provides evidence
for the benefits of syntactic biases in large-scale
models.

Our approach is based on the prior work of
Kuncoro et al. (2019), who devised an effective
knowledge distillation (KD; Bucilˇa et al., 2006;
Hinton et al., 2015) procedure for improving
the syntactic competence of scalable LMs that
lack explicit syntactic biases. More concretely,
their KD procedure utilized the predictions of
an explicitly hierarchical (albeit hard to scale)
syntactic LM, recurrent neural network grammars

(RNNGs; Dyer et al., 2016) (§2) as a syntactically
informed learning signal for a sequential LM that
works well at scale.

Our setup nevertheless presents a new challenge: Here the BERT student is a denoising autoencoder that models a collection of conditionals for words in bidirectional context, while the RNNG teacher is an autoregressive LM that predicts words in a left-to-right fashion, that is, tφ(xi | x<i), hence lacking the bidirectional context x>i that is accessible to the BERT student (§3). Thus, we propose an approach where the BERT student distills the RNNG's marginal distribution over words in context, tφ(xi | x<i, x>i). We develop an efficient yet effective approximation for this quantity, since exact inference is expensive owing to the RNNG's left-to-right parameterization.

Our structure-distilled BERT model differs from the standard BERT model only in its pretraining objective, and thus retains the scalability afforded by Transformer architectures and specialized hardware like TPUs. In fact, our approach maintains compatibility with standard BERT pipelines; the structure-distilled BERT models can simply be loaded as pretrained BERT weights, which can then be fine-tuned in the exact same fashion.

We hypothesize that the stronger syntactic biases from our new pretraining procedure are useful for a variety of natural language understanding (NLU) tasks that involve structured output spaces—including tasks like semantic role labeling (SRL) and coreference resolution that are not explicitly syntactic in nature. We thus evaluate our models on six diverse structured prediction tasks, including phrase-structure parsing (in-domain and out-of-domain), dependency parsing, SRL, coreference resolution, and a combinatory categorial grammar (CCG) supertagging probe, in addition to the GLUE benchmark (Wang et al., 2019). On the structured prediction tasks, our structure-distilled BERTBASE reduces relative error by 2% to 21%. These gains are more pronounced in the low-resource scenario, suggesting that stronger syntactic biases help improve sample efficiency (§4).

Despite the gains on the structured prediction tasks, we achieve mixed results on GLUE: Our approach yields improvements on the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018), but performs slightly worse on the rest of GLUE. These findings allude to a partial dissociation between model performance on GLUE, and on structured prediction benchmarks of NLU.

Overall, our findings: (i) showcase the benefits of syntactic biases, even for representation learners that leverage large amounts of data, (ii) help better understand where syntactic biases are most helpful, and (iii) make a case for designing approaches that not only work well at scale, but also integrate stronger notions of syntactic biases.

2 Recurrent Neural Network Grammars

Here we briefly describe the RNNG (Dyer et al., 2016) that we use as the teacher model. An RNNG is a syntactic LM that defines the joint probability of surface strings x and phrase-structure nonterminals y, henceforth denoted as tφ(x, y), through a series of structure-building actions that traverse the tree in a top-down, left-to-right fashion. Let N and Σ denote the set of phrase-structure non-terminals and word terminals, respectively. At each time step, the decision over the next action at ∈ {NT(n), GEN(w), REDUCE}, where n ∈ N and w ∈ Σ, is parameterized by a stack LSTM (Dyer et al., 2015) that encodes partial constituents. The choice of at yields these transitions:

• at ∈ {NT(n), GEN(w)} would push the corresponding non-terminal or word embeddings—en or ew—onto the stack;

• at = REDUCE would pop the top k elements up to the last incomplete non-terminal, compose these elements with a separate bidirectional LSTM, and lastly push the composite phrase embedding ephrase back onto the stack. The hierarchical inductive bias of RNNGs can be attributed to this composition function,1 which recursively combines smaller units into larger ones.

RNNGs attempt to maximize the probability of
correct action sequences relative to each gold
tree.2
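To make these transition semantics concrete, the following minimal Python sketch (ours, not the authors' DyNet implementation) executes a sequence of NT(n), GEN(w), and REDUCE actions on a stack; the compose function is only a placeholder for the bidirectional LSTM composition used in the actual model, and all names here are illustrative.

```python
# Minimal sketch of the RNNG transition semantics (illustrative only; the
# real model scores actions with a stack LSTM and composes with a biLSTM).

class OpenNT:
    """Marker for an incomplete non-terminal on the stack."""
    def __init__(self, label):
        self.label = label

def compose(label, children):
    # Placeholder for the RNNG composition function: in the actual model a
    # separate bidirectional LSTM maps the children to one phrase embedding.
    return (label, tuple(children))

def step(stack, action, arg=None):
    """Apply one of NT(n), GEN(w), REDUCE to the stack (in place)."""
    if action == "NT":          # push an open non-terminal marker
        stack.append(OpenNT(arg))
    elif action == "GEN":       # push a (sub)word terminal
        stack.append(arg)
    elif action == "REDUCE":    # pop up to the last open non-terminal, compose
        children = []
        while not isinstance(stack[-1], OpenNT):
            children.append(stack.pop())
        open_nt = stack.pop()
        stack.append(compose(open_nt.label, reversed(children)))
    return stack

# Example: "(S (NP the dog) (VP barks))"
actions = [("NT", "S"), ("NT", "NP"), ("GEN", "the"), ("GEN", "dog"),
           ("REDUCE", None), ("NT", "VP"), ("GEN", "barks"),
           ("REDUCE", None), ("REDUCE", None)]
stack = []
for a, arg in actions:
    step(stack, a, arg)
print(stack)  # [('S', (('NP', ('the', 'dog')), ('VP', ('barks',))))]
```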

1Not all syntactic LMs have hierarchical biases; Choe
and Charniak (2016) modeled strings and phrase structures
sequentially with LSTMs. This model can be understood as
a special case of RNNGs without the composition function.
2Unsupervised RNNGs (Kim et al., 2019) exist, although they perform worse on measures of syntactic competence.

Extension to Subwords. Here we extend the
RNNG to operate over subword units (Sennrich
et al., 2016) to enable compatibility with the
BERT student. As each word can be split into
an arbitrary-length sequence of subwords, we
preprocess the phrase-structure trees to include
an additional nonterminal symbol that represents
a word sequence, as illustrated by the example ‘‘(S
(NP (WORD the) (WORD d ##og)) (VP (WORD
ba ##rk ##s)))’’, where tokens prefixed by ‘‘##’’
are subword units.3
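A minimal sketch of this preprocessing step is shown below (our illustration, not the authors' pipeline); the toy wordpiece function stands in for the actual BERT WordPiece tokenizer, and its tiny vocabulary is chosen only to reproduce the example above.

```python
# Illustrative sketch of the tree preprocessing described above: every word
# leaf is re-wrapped under an extra WORD non-terminal whose children are its
# subword units. The `wordpiece` tokenizer is a toy stand-in.

def wordpiece(word, vocab={"the", "d", "##og", "ba", "##rk", "##s"}):
    # Toy greedy longest-match WordPiece-style segmentation.
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            return ["[UNK]"]
    return pieces

def add_word_nodes(tree):
    """tree is (label, children...) for phrases and a plain str for words."""
    if isinstance(tree, str):
        return ("WORD", *wordpiece(tree))
    label, *children = tree
    return (label, *[add_word_nodes(c) for c in children])

tree = ("S", ("NP", "the", "dog"), ("VP", "barks"))
print(add_word_nodes(tree))
# ('S', ('NP', ('WORD', 'the'), ('WORD', 'd', '##og')),
#  ('VP', ('WORD', 'ba', '##rk', '##s')))
```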

3 Approach

We begin with a brief review of the BERT
objective, before outlining our structure distil-
lation approach.

3.1 BERT Pretraining Objective

The aim of BERT pretraining is to find model parameters ˆθB that would maximize the probability of reconstructing parts of x = x1, · · · , xk conditional on a corrupted version c(x) = c(x1), · · · , c(xk), where c(·) denotes the stochastic corruption protocol of Devlin et al. (2019) that is applied to each word xi ∈ x. Formally:

ˆθB = arg minθ Σi∈M(x) − log pθ(xi | c(x1), · · · , c(xk)),        (1)

where M(x) ⊆ {1, · · · , k} denotes the indices of masked tokens that serve as reconstruction targets.4 This masked LM objective is then combined with a next-sentence prediction loss that predicts whether the two segments in x are contiguous sequences.
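For concreteness, the sketch below spells out one way to implement the corruption protocol c(·) and the masked LM loss of Eq. 1; the 80%/10%/10% replacement scheme follows Devlin et al. (2019), while the model interface is a hypothetical stand-in rather than the actual BERT implementation.

```python
# Minimal sketch of the BERT corruption protocol c(.) and the masked-LM
# loss of Eq. 1, assuming a `model` that returns per-position distributions
# over the vocabulary (hypothetical interface, not the original code).
import math
import random

MASK = "[MASK]"

def corrupt(tokens, vocab, mask_rate=0.15, seed=0):
    """Return (corrupted copy of tokens, indices M(x) of reconstruction targets)."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: keep the token unchanged (still a prediction target)
    return corrupted, targets

def masked_lm_loss(model, tokens, vocab):
    """Negative log-likelihood of the original tokens at the masked positions (Eq. 1)."""
    corrupted, targets = corrupt(tokens, vocab)
    probs = model(corrupted)  # assumed: probs[i][w] = p_theta(x_i = w | c(x))
    return -sum(math.log(probs[i][tokens[i]]) for i in targets)
```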

3.2 Motivation

Because the RNNG teacher is an expert on
syntactic generalizations (Kuncoro et al., 2018;
Futrell et al., 2019; Wilcox et al., 2019), we adopt
a structure distillation procedure (Kuncoro et al.,
2019) that enables the BERT student to learn from
the RNNG’s syntactically informative predictions.
Our setup nevertheless means that the two models
here crucially differ in nature: The BERT student

3An alternative here is to represent each phrase as a flat
sequence of subwords, although our preliminary experiments
indicate that this approach yields worse perplexity.

4In practice, the corruption protocol c(·) and the reconstruction targets M(x) are intertwined; M(x) denotes the indices of tokens in x (∼15%) that were altered by c(x).

Figure 1: An example of the masked LM task, where [MASK] = chase, and window is an attractor (red). We suppress phrase-structure annotations and corruptions on the context tokens for clarity.

is not a left-to-right LM like the RNNG, but rather
a denoising autoencoder that models a collection
of conditionals for words in bidirectional context
(Eq. 1).

We now present two strategies for dealing with this challenge. The first, naïve approach is to ignore this difference, and let the BERT student distill the RNNG's marginal next-word distribution for each w ∈ Σ based on the left context alone, that is tφ(w | x<i). This approach nevertheless ignores the right context x>i that is accessible to the BERT student, and runs the risk of encouraging the student to assign high probabilities for words that fit poorly with the bidirectional context.

Thus, our second approach is to learn from teacher distributions that not only: (i) reflect the strong syntactic biases of the RNNG teacher, but also (ii) consider both the left and right context when predicting w ∈ Σ. Formally, we propose to distill the RNNG's marginal distribution over words in bidirectional context, tφ(w | x<i, x>i), henceforth referred to as the posterior probability for generating w under all available information. We now demonstrate that this quantity can, in fact, be computed from left-to-right LMs like RNNGs.

3.3 Posterior Inference

Consider a pretrained autoregressive, left-to-right LM that factorizes tφ(x) = Π_{i=1}^{k} tφ(xi | x<i). By the definition of conditional probabilities:5

tφ(xi | x<i, x>i) = tφ(x<i, xi, x>i) / Σ_{w∈Σ} tφ(x<i, ˜xi = w, x>i)
                  = [tφ(x<i) tφ(xi | x<i) tφ(x>i | x<i, xi)] / [Σ_{w∈Σ} tφ(x<i) tφ(w | x<i) tφ(x>i | x<i, ˜xi = w)]
                  = [tφ(xi | x<i) tφ(x>i | x<i, xi)] / [Σ_{w∈Σ} tφ(w | x<i) tφ(x>i | x<i, ˜xi = w)],        (2)

where ˜xi = w denotes replacing the word at position i with an alternative w ∈ Σ. Intuitively, this posterior penalizes candidate words whose continuation x>i is improbable. In the example of Figure 1, tφ(x>i | x<i, xi) would be low for an intransitive verb like bark (i.e., we expect tφ(the cat | The dogs by the window bark) to be low because it is syntactically illicit). In contrast, the posterior would assign high probabilities to plural verbs like fight and chase that are consistent with the bidirectional context, because we expect both tφ(fight | The dogs by the window) and tφ(the cat | The dogs by the window fight) to be probable.

Computational Cost. Let k denote the maximum length of x. Our KD approach requires computing the posterior distribution (Eq. 2) for every masked token xi in the dataset D, which (excluding the marginalization cost over y) necessitates O(|Σ| ∗ k ∗ |D|) operations.6

5In this setup, we assume that x is a fixed-length sequence. We aim to infer the LM's estimate for generating a single token xi, relative to all potential single tokens w ∈ Σ (denominator in Eq. 2), conditional on the bidirectional context.

To reduce this cost, we approximate the suffix term tφ(x>i | x<i, xi) in Eq. 2 with tφ(x>i | xi), which yields:7

tφ(xi | x<i, x>i) ≈ [tφ(xi | x<i) tφ(x>i | xi)] / [Σ_{w∈Σ} tφ(w | x<i) tφ(x>i | ˜xi = w)].        (3)

Although Eq. 3 is still expensive to compute, it enables us to apply the Bayes rule to compute tφ(x>i | xi):

tφ(x>i | xi) = tφ(xi | x>i) tφ(x>i) / q(xi),        (4)

where q(·) denotes the unigram distribution. For efficiency, we replace tφ(xi | x>i) with a separately trained ‘‘reverse’’, right-to-left RNNG, denoted as rω(xi | x>i). We now apply Eq. 4 and the right-to-left parameterization rω(xi | x>i) into Eq. 3, and cancel the common factors tφ(x>i):

tφ(xi | x<i, x>i) ≈ [tφ(xi | x<i) rω(xi | x>i) / q(xi)] / [Σ_{w∈Σ} tφ(w | x<i) rω(w | x>i) / q(w)].        (5)

Our approximation in Eq. 5 crucially reduces
the required number of operations from O(|Σ| ∗
k ∗ |D|) to O(|Σ| ∗ |D|), although the actual
speedup is much more substantial in practice,
since Eq. 5 involves easily batched operations that
considerably benefit from specialized hardware
like GPUs.
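The approximation itself is straightforward to implement once its three ingredients are available; the sketch below (our reading of Eq. 5, not the authors' code) assumes dictionaries mapping each vocabulary item w to tφ(w | x<i), rω(w | x>i), and q(w) for a given masked position.

```python
# Minimal sketch of the approximate posterior of Eq. 5 for one masked slot.
# `l2r`, `r2l`, and `unigram` are assumed dicts over the vocabulary.

def approx_posterior(l2r, r2l, unigram, vocab):
    """Return the approximate posterior over every w in vocab (Eq. 5)."""
    scores = {w: l2r[w] * r2l[w] / unigram[w] for w in vocab}
    z = sum(scores.values())              # normalize over the vocabulary
    return {w: s / z for w, s in scores.items()}

def product_of_experts(l2r, r2l, vocab):
    """Uniform q(w) recovers a standard product of experts (cf. below)."""
    uniform = {w: 1.0 / len(vocab) for w in vocab}
    return approx_posterior(l2r, r2l, uniform, vocab)
```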

6In our BERT pretraining setup, |Σ| ≈ 29,000 (vocabulary size of BERT-cased), |D| ≈ 3 ∗ 10^9, and k = 512.

7This approximation preserves the intuition explained in §3.3. Concretely, verbs like bark would also be assigned low probabilities under this posterior approximation, since tφ(the cat | bark) would be low because it is syntactically illicit—the alternative ‘‘bark at the cat’’ would be syntactically licit.

Notably, our proposed approach here is a general one; it can approximate the posterior
over xi from any left-to-right LM, which can
be used as a learning signal for BERT through
KD, irrespective of the LM’s parameterization.
It does, Jedoch, necessitate a separately trained
right-to-left LM.

Connection to a Product of Experts. Eq. 5
has a similar form to a product of experts (PoE;
Hinton, 2002) between the left-to-right and right-
to-left RNNGs’ next-word distributions, albeit
with extra unigram terms q(w). If we replace the
unigram distribution with a uniform one, namely,
Q(w) = 1/|Σ| ∀w ∈ Σ, Eq. 5 reduces to a standard
PoE.

Approximating the Marginal. The approximation in Eq. 5 requires estimates of tφ(xi | x<i) and rω(xi | x>i) from the left-to-right and right-to-left RNNGs, respectively, which necessitate expensive marginalizations over all possible tree prefixes y<i and suffixes y>i. Following Kuncoro et al. (2019), we approximate this marginalization using a one-best predicted tree ˆy(x) = argmaxy∈Y(x) sψ(y | x), where sψ(y | x) is parameterized by the transition-based parser of Fried et al. (2019), and Y(x) denotes the set of all possible trees for x. Formally:

tφ(xi | x<i) ≈ tφ(xi | x<i, ˆy<i(x)),        (6)

where ˆy<i(x) denotes the portion of ˆy(x) that precedes xi;8 the marginal rω(xi | x>i) from the right-to-left RNNG is approximated similarly.

Preliminary Experiments. Before proceeding
with the KD experiments, we assess the quality
and feasibility of our approximation through
preliminary LM experiments on the Penn
Treebank (PTB; Marcus et al., 1993). We find
that our approximation is much faster than exact
inference by a factor of more than 50,000, at
the expense of a slightly worse average posterior
negative log-likelihood (2.68 rather than 2.5 für
exact inference). More details are provided in
Appendix A.

8Our approximation of tφ(xi | x<i) conditions on the one-best tree ˆy(x), which is predicted from the full sentence x, including the right context x>i. This non-incremental procedure is justified, however, because we aim to design the most informative teacher distributions for the non-incremental BERT student, which also has access to bidirectional context.

Model                  KL Div. with Posterior Approx.
Left-to-right LM       2.27±1.84
Right-to-left LM       2.04±1.87
Product of Experts     1.12±1.08

Table 1: Preliminary experiments reporting the mean±stdev. of the KL divergence (in nats) between the proposed posterior approximation (Eq. 5) and: (i) the left-to-right LM, (ii) the right-to-left LM, and (iii) a simple product of experts baseline (Eq. 5, but with the uniform distribution for q(w)).

Differences Between the Models. We now
empirically validate our motivating intuition in
Figur 1: A model that takes into account the
bidirectional context (as is the case for our pro-
posed posterior approximation in Eq. 5) should
make different predictions compared with the
unidirectional left-to-right and right-to-left mod-
els.9 To ascertain whether this is truly the case,
we compute the mean Kullback-Leibler (KL)
divergence between the distributions from the
proposed posterior approximation (Eq. 5) und das
distributions from: (i) the left-to-right model, (ii)
the right-to-left model, and (iii) a simple product
of experts baseline (i.e., Eq. 5, but where q(w) is
the uniform distribution). The findings in Table 1
suggest that our proposed posterior approximation
approach indeed yields quantifiably different
distributions from the left-to-right and right-to-
left baselines. To a lesser extent, it also differs
from a simple product of experts baseline that
similarly incorporates both the left-to-right and
right-to-left models’ predictions, albeit with the
uniform distribution for q(w).
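As a reference for how such a comparison can be computed, the sketch below averages per-position KL divergences; the direction of the KL is not stated in the text, so computing KL(posterior approximation ‖ baseline) here is our assumption.

```python
# Sketch of the Table 1 comparison under our stated assumption about the KL
# direction. Each distribution is a dict over the same vocabulary.
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[w] > 0 wherever p[w] > 0."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def mean_kl(posterior_dists, baseline_dists):
    """Average the per-position KL over paired lists of distributions."""
    pairs = list(zip(posterior_dists, baseline_dists))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)
```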

3.5 Objective Function

In our structure distillation pretraining, we aim to find BERT parameters ˆθKD that emulate our approximation of tφ(w | x<i, x>i) through a word-level cross-entropy loss (Hinton et al., 2015; Kim and Rush, 2016; Furlanello et al., 2018, inter alia):

ˆθKD = arg minθ (1/|D|) Σx∈D ℓKD(x; θ), where

ℓKD(x; θ) = − Σi∈M(x) Σw∈Σ [ ˜tφ,ω(w | x<i, x>i) log pθ(˜xi = w | c(x1), · · · , c(xk)) ],

9We use the same setup as Preliminary Experiments.


where ˜tφ,ω(w | x<i, x>i) is our approximation of
tφ(w | x<i, x>i), as defined in Eqs. 5 and 6.

Interpolation. The RNNG teacher is an expert
on syntax, although in practice it is only feasible
to train it on a much smaller dataset. Thus, we
not only want the BERT student to learn from
the RNNG’s syntactic expertise, but also from
the rich common-sense and semantics knowledge
contained in large text corpora by virtue of
predicting the true identity of the masked token
xi,10 as done in the standard BERT setup. We thus
interpolate the KD loss and the original BERT
masked LM objective:

ˆθB-KD = arg minθ (1/|D|) Σx∈D [ α ℓKD(x; θ) + (1 − α) Σi∈M(x) − log pθ(xi | c(x1), · · · , c(xk)) ],        (7)

omitting the next-sentence prediction for brevity.
We henceforth set α = 0.5 unless stated otherwise.
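Putting Eqs. 5 and 7 together, the per-position training signal can be sketched as follows (an illustration under assumed inputs, not the original implementation): student_probs holds pθ(· | c(x)) at the masked position, teacher_probs holds the approximate posterior ˜tφ,ω(· | x<i, x>i) from Eq. 5, and gold is the true token xi.

```python
# Illustrative sketch of the interpolated objective in Eq. 7 for a single
# masked position i (not the authors' code).
import math

def interpolated_loss(student_probs, teacher_probs, gold, alpha=0.5):
    # Word-level KD term: cross-entropy of the student against the teacher.
    kd = -sum(t * math.log(student_probs[w]) for w, t in teacher_probs.items())
    # Standard masked-LM term: negative log-likelihood of the true token.
    mlm = -math.log(student_probs[gold])
    return alpha * kd + (1.0 - alpha) * mlm
```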

4 Experiments

Here we outline the evaluation setup, present
our results, and discuss the implications of our
findings.

4.1 Evaluation Tasks and Setup

We conjecture that the improved syntactic competence from our approach would benefit a broad range of tasks that involve structured output spaces, including tasks that are not explicitly syntactic. We thus evaluate our structure-distilled BERTs on six diverse structured prediction tasks that encompass syntactic, semantic, and coreference resolution tasks, in addition to the GLUE benchmark that is largely composed of classification tasks.

Phrase-structure Parsing – PTB. We first evaluate our model on phrase-structure parsing on the WSJ section of the PTB. Following prior work, we use sections 02–21 for training, section 22 for validation, and section 23 for testing. We apply our approach on top of the BERT-augmented in-order (Liu and Zhang, 2017) transition-based parser of Fried et al. (2019), which approaches the current state of the art. Because the RNNG teacher that we distill into BERT also uses phrase-structure trees, this setup is related to self-training (Yarowsky, 1995; Charniak, 1997; Zhou and Li, 2005; McClosky et al., 2006; Andor et al., 2016, inter alia).

10The KD loss ℓKD(x; θ) is defined independently of xi.

Phrase-structure Parsing – OOD. Still in the context of phrase-structure parsing, we evaluate how well our approach generalizes to three out-of-domain (OOD) treebanks: Brown (Francis and Kučera, 1979), Genia (Tateisi et al., 2005), and the English Web Treebank (Petrov and McDonald, 2012). Following Fried et al. (2019), we test the PTB-trained parser on the test splits11 of these OOD treebanks without any retraining, to simulate the case where no in-domain labeled data are available. We use the same codebase as above.

Dependency Parsing – PTB. Our third task
is PTB dependency parsing with Stanford
Dependencies (De Marneffe and Manning, 2008)
v3.3.0. We use the BERT-augmented joint phrase-
structure and dependency parser of Zhou and
Zhao (2019), which is inspired by head-driven
phrase-structure grammar (HPSG; Pollard and
Sag, 1994).

Semantic Role Labeling. Our fourth evaluation
task is span-based SRL on the English CoNLL
2012 (OntoNotes) dataset (Pradhan et al., 2013).
We apply our approach on top of the BERT-
augmented model of Shi and Lin (2019), als
implemented on AllenNLP (Gardner et al., 2018).

Coreference Resolution. Our fifth evaluation
task is coreference resolution, also on the English
OntoNotes dataset (Pradhan et al., 2012). For
this task, we use the BERT-augmented model of
Joshi et al. (2019), which extends the higher-order
coarse-to-fine model of Lee et al. (2018).

CCG Supertagging Probe. All proposed tasks thus far necessitate either fine-tuning the entire BERT model, or training a task-specific model on top of the BERT embeddings. Thus, it remains unclear how much of the gains are due to better structural representations from our new pretraining strategy, rather than the available supervision at the fine-tuning stage. To better understand the gains from our approach, we evaluate on CCG (Steedman, 2000) supertagging

11We use the Brown test split of Gildea (2001), the Genia
test split of McClosky et al. (2008), and the EWT test split
from SANCL 2012 (Petrov and McDonald, 2012).

(Bangalore and Joshi, 1999; Clark and Curran,
2007) through a classifier probe (Shi et al., 2016;
Adi et al., 2017; Belinkov et al., 2017, inter alia),
where no BERT fine-tuning takes place.12

CCG supertagging is a compelling probing task because it necessitates an understanding of bidirectional context information; the per-word classification setup also lends itself well to classifier probes. Nevertheless, it remains unclear how much of the accuracy can be attributed to the information encoded in the representation, as opposed to the classifier probe itself. We thus adopt the control task protocol of Hewitt and Liang (2019) that assigns each word type to a random control category,13 which assesses the memorisation capacity of the classifier. In addition to the probing accuracy, we report the probe selectivity,14 where higher selectivity denotes probes that faithfully rely on the linguistic knowledge encoded in the representation. We use linear classifiers to maintain high selectivities.

Commonality. All our structured prediction ex-
periments are conducted on top of publicly avail-
able repositories of BERT-augmented models,
with the exception of the CCG supertagging task
that we evaluate as a probe. This setup means that
obtaining our results is as simple as changing the
pretrained BERT weights to our structure-distilled
BERT, and applying the exact same steps as for
fine-tuning the baseline model. The fine-tuning
hyperparameters are summarized in Appendix C.

GLUE. Beyond the six structured prediction
tasks above, we evaluate our approach on the
classification15 tasks of the GLUE benchmark
except the Winograd NLI (Levesque et al., 2012)
for consistency with the BERT paper (Devlin
et al., 2019). The BERT GLUE fine-tuning
hyperparameters are based on the fine-tuning
configurations of Joshi et al. (2020); we sum-
marize these in Appendix C.

12A similar CCG probe was explored by Liu et al. (2019a); we obtain comparable results for the no distillation baseline.

13Following Hewitt and Liang (2019), the cardinality of this control category is the same as the number of supertags.

14A probe's selectivity is defined as the difference between the probing task accuracy and the control task accuracy.

15This setup excludes the semantic textual similarity benchmark (STS-B), which is formulated as a regression task.

4.2 Experimental Setup and Baselines

Here we describe the key aspects of our empirical
setup, and outline the baselines for assessing the
efficacy of our approach.

RNNG Teacher. We implement the subword-augmented RNNG teachers (§2) on DyNet (Neubig et al., 2017a), and obtain ‘‘silver-grade’’ phrase-structure annotations for the entire BERT training set using the transition-based parser of Fried et al. (2019). These trees are used to train the RNNG (§2), and to approximate its marginal next-word distribution at inference (Eq. 6). We use the same WordPiece tokenization and vocabulary as BERT-Cased; Appendix B summarizes the complete list of RNNG hyperparameters. Because our approximation (Eq. 5) makes use of a right-to-left RNNG, we train this variant with the same hyperparameters and data as the left-to-right model. We train each directional RNNG teacher on a shared subset of 3.6M sentences (∼3%) from the BERT training set with automatic dynamic batching (Neubig et al., 2017b), which takes three weeks on a V100 GPU.

BERT Student. We first apply our structure dis-
tillation pretraining protocol to BERTBASE-Cased.
We use the exact same training dataset, model con-
figuration, WordPiece tokenization, vocabulary,
and hyperparameters (Appendix C) as in the stan-
dard pretrained BERT model.16 The sole excep-
tion is that we use a larger initial learning rate
of 3e−4 based on preliminary experiments,17
which we apply to all models (including the
no distillation/standard BERT baseline) for fair
comparison.

Baselines and Comparisons. We compare the
following set of models in our experiments:

• A standard BERTBASE-Cased without any
structure distillation loss, which benefits
from scalability but lacks syntactic biases
(‘‘No-KD’’);

• Four variants of structure-distilled BERTs that: (i) only distill the left-to-right RNNG (‘‘L2R-KD’’), (ii) only distill the right-to-left RNNG (‘‘R2L-KD’’), (iii) distill the RNNG's approximated marginal for
16https://github.com/google-research/bert.
17We find this larger learning rate to perform better on most of our evaluation tasks. Liu et al. (2019b) have similarly found that tuning BERT's initial learning rate leads to better results.

Task                     Validation Set                                                      Test Set
                         Baselines            Structure-distilled BERTs
                         No-KD    Seq-KD      L2R-KD   R2L-KD   UF-KD    UG-KD               No-KD    Best-KD   Err. Red.
Const. PTB – F1          95.38    95.33       95.55    95.55    95.58    95.59               95.35    95.70     7.6%
Const. PTB – EM          55.33    55.41       55.92    56.18    56.39    56.59               55.25    57.77     5.63%
Const. OOD – F1†         86.76    86.54       87.43    87.53    87.23    87.40               89.04    89.76     6.55%
Dep. PTB – UAS           96.48    96.40       96.70    96.64    96.60    96.66               96.79    96.86     2.18%
Dep. PTB – LAS           94.65    94.56       94.90    94.80    94.79    94.83               95.13    95.23     1.99%
SRL – OntoNotes          86.17    86.09       86.34    86.29    86.30    86.46               86.08    86.39     2.23%
Coref. – OntoNotes       72.53    69.27       73.74    73.49    73.79    73.33               72.71    73.69     3.58%
CCG supertag. probe      93.69    91.59       93.97    95.21    95.13    95.21               93.88    95.2      21.57%
Probe selectivity        24.79    23.77       23.3     23.57    27.28    28.3                23.15    26.07     N/A

Table 2: Validation and test results for the structured prediction tasks; each entry reflects the mean of three random seeds. To preserve test set integrity, we only obtain test set results for the no distillation baseline and the best structure-distilled BERT on the validation set; ‘‘Err. Red.’’ reports the test error reductions relative to the No-KD baseline. We report F1 and exact match (EM) for PTB phrase-structure parsing; for dependency parsing, we report unlabeled (UAS) and labeled (LAS) attachment scores. The ‘‘Const. OOD’’ (†) row indicates the mean F1 from three out-of-domain corpora: Brown, Genia, and the English Web Treebank (EWT), although the validation results exclude the Brown Treebank that has no validation set.

generating xi under the bidirectional context, where q(w) (Eq. 5) is the uniform distribution (‘‘UF-KD’’), and lastly (iv) a similar variant as (iii), but where q(w) is the unigram distribution (‘‘UG-KD’’). All these BERT models crucially benefit from the syntactic biases of RNNGs, although only variants (iii) and (iv) learn from teacher distributions that consider bidirectional context for predicting xi; and

• A BERTBASE model that distills the approximate posterior for generating xi under the bidirectional context, but from sequential LSTM teachers (‘‘Seq-KD’’) in place of RNNGs.18 This baseline crucially isolates the importance of learning from hierarchical teachers, because it utilizes the exact same approximation technique and KD loss as the structure-distilled BERTs.

Learning Curves. Given enough labeled data,
BERT can acquire the relevant structural infor-
mation from the fine-tuning (as opposed to pre-
Ausbildung) procedure, although better pretrained
representations can nevertheless facilitate sample-
efficient generalizations (Yogatama et al., 2019).

18For fair comparison, we train the LSTM on the exact
same subset as the RNNG, with comparable number of
model parameters. An alternative here is to use Transformers,
although we elect to use LSTMs to facilitate fair comparison
with RNNGs, which are also based on LSTM architectures.


We thus additionally examine the models’ fine-
tuning learning curves, as a function of varying
amounts of training data, on phrase-structure
parsing and SRL.

Random Seeds. Because fine-tuning the same
pretrained BERT with different random seeds
can lead to varying results, we report the mean
performance from three random seeds on the
structured prediction tasks, and from five random
seeds on GLUE.

Test Results. To preserve the integrity of the
test sets, we first report all performance on the
validation sets, and only report the test set results
for: (i) the No-KD baseline, and (ii) the best
structure-distilled model on the validation set
(‘‘Best-KD’’).

4.3 Findings and Discussion

We report the validation and test set results for
the structured prediction tasks in Table 2. Der
validation set learning curves for phrase-structure
parsing and SRL that compare the No-KD baseline
with the UG-KD variant are provided in Figure 2.

General Discussion. We summarize several
key observations from Table 2 and Figure 2.

• All four structure-distilled BERT models
consistently outperform the No-KD baseline,
including the L2R-KD and R2L-KD variants

Figure 2: The fine-tuning learning curves that examine how the number of fine-tuning instances (from 5% to 100% of the full training sets) affect validation set F1 scores in the case of phrase-structure parsing and SRL. We compare the No-KD/standard BERTBASE-Cased and the UG-KD structure-distilled BERT.

that only distill the syntactic knowledge of unidirectional RNNGs. Remarkably, this pattern holds true for all six structured prediction tasks. In contrast, we observe no such gains for the Seq-KD baseline, which largely performs worse than the No-KD model. We conclude that the gains afforded by our structure-distilled BERTs can be attributed to the syntactic biases of the RNNG teacher.

• We conjecture that the surprisingly strong performance of the L2R-KD and R2L-KD models, which distill the knowledge of unidirectional RNNGs, can be attributed to the interpolated objective in Eq. 7 (α = 0.5). This interpolation means that the target distribution assigns a probability mass of at least 0.5 to the true masked word xi, which is guaranteed to be consistent with the bidirectional context. However, the syntactic knowledge contained in the unidirectional RNNGs' predictions can still provide a structurally informative learning signal, over the rest of the probability mass, for the BERT student.

• Although all structure-distilled variants out-
perform the baseline, models that distill our
approximation of the RNNG’s distribution
for words in bidirectional context (UF-KD
and UG-KD) yield the best results on four out
of six tasks (PTB phrase-structure parsing,
SRL, coreference resolution, and the CCG
supertagging probe). This finding confirms
the efficacy of our approach.

• We observe the largest gains for the syntactic tasks, particularly for phrase-structure parsing and CCG supertagging. However, the improvements are not at all confined to purely syntactic tasks: we reduce relative error from strong BERT baselines by 2.2% and 3.6% on SRL and coreference resolution, respectively. While the RNNG's syntactic biases are derived from phrase-structure grammar, the strong improvement on CCG supertagging, in addition to the smaller improvement on dependency parsing, suggests that the RNNG's syntactic biases generalize well across different syntactic formalisms.

• We observe larger improvements in a low-
resource scenario, where the model is exposed
to fewer fine-tuning instances (Figure 2), sug-
gesting that syntactic biases are helpful for
enabling more sample-efficient generaliza-
tionen. This pattern holds for both tasks that we
investigated: phrase-structure parsing (syn-
tactic in nature) and SRL (not explicitly
syntactic in nature). With only 5% of the fine-
tuning data, the UG-KD model improves F1
score from 79.9 Zu 80.6 for SRL (A 3.5%
error reduction relative to the No-KD base-
line, as opposed to 2.2% on the full data). For
phrase-structure parsing, the UG-KD model
achieves a remarkable 93.68 F1 (a 16% rel-
ative error reduction, as opposed to 7.6% on
the full data) with only 5% of the PTB—this
performance is notably better than past state
of the art phrase-structure parsers trained on
the full PTB c. 2017 (Kuncoro et al., 2017).

GLUE Results and Discussion. We report the
GLUE validation and test results for BERTBASE-
Cased in Table 3. Because we observe a different
pattern of results on the Corpus of Linguistic
Acceptability (CoLA; Warstadt et al., 2018) als

                              No-KD          UG-KD
Validation Set (Per-task average / 1-best random seed)
CoLA                          54.3 / 60.6    50.7 / 60.2
7-task avg. (excl. CoLA)      84.8 / 86.9    85.4 / 87.8
Overall 8-task avg.           81.0 / 83.6    81.1 / 84.4
Test Set (Per-task 1-best random seed on validation set)
CoLA                          53.1           55.3
7-task avg. (excl. CoLA)      84.2           83.5
Overall 8-task avg.           80.3           80.0

Table 3: Summary of the validation and test set results on GLUE. The validation results are derived from the average of five random seeds for each task, which accounts for variance, and the 1-best random seed, which does not. The test results are derived from the 1-best random seed on the validation set.

on the rest of GLUE, we henceforth report: (i)
the CoLA results, (ii) the seven task average
that excludes CoLA, and (iii) the average across
all eight tasks. We select the UG-KD model
because it achieved the best validation set eight
task average among the structure-distilled BERTs;
the per-task GLUE breakdown is provided in
Appendix D.

The results on GLUE provide an interesting contrast to the consistent improvements we observed on the structured prediction tasks. More concretely, our UG-KD model outperforms the baseline on CoLA, but performs slightly worse on the other GLUE tasks in aggregate, leading to a slightly lower overall test set accuracy (80.0 for the UG-KD as opposed to 80.3 for the No-KD baseline).

The improvement on the syntax-sensitive
CoLA provides additional evidence—beyond the
improvement on the syntactic tasks (Table 2)—
that our approach indeed yields improved syntactic
competence. We conjecture that these improve-
ments do not transfer to the other GLUE tasks
because they rely more on lexical and semantic
properties, and less on syntactic competence
(McCoy et al., 2019).

We defer a more thorough investigation of
how much syntactic competence is necessary for
solving most of the GLUE tasks to future work,
but make two remarks. First, the findings on
GLUE are consistent with the hypothesis that our
approach yields improved structural competence,
albeit at the expense of a slightly less rich meaning
representation, which we attribute to the smaller
dataset used to train the RNNG teacher. Second,


human-level natural language understanding includes the ability to predict structured outputs, for example, to decipher ‘‘who did what to whom’’ (SRL). Succeeding in these tasks necessitates inference about structured output spaces, which (unlike most of GLUE) cannot be reduced to a single classification decision. Our findings indicate a partial dissociation between model performance on these two types of tasks; thus, supplementing GLUE evaluation with some of these structured prediction tasks can offer a more holistic assessment of progress in NLU.

CCG Probe Example. The CCG supertagging
probe is a particularly interesting test bed, because
it clearly assesses the model’s ability to use con-
textual information in making its predictions—
without introducing additional confounds from
the BERT fine-tuning procedure. We thus provide
a representative example of four different BERT
variants’ predictions on the CCG supertagging
probe in Table 4, based on which we discuss
two observations. First, the different models make
different predictions, where the No-KD and L2R-
KD models produce (coincidentally the same)
incorrect predictions, while the R2L-KD and
UG-KD models are able to predict the correct
supertag. This finding suggests that different
teacher models are able to impose different biases
on the BERT students.19

Second, the mistakes of the No-KD and L2R-KD BERTs belong to the broader category of challenging argument-adjunct distinctions (Palmer et al., 2005; Fowlie, 2017). Here both models fail to subcategorize for the prepositional phrase (PP) ‘‘as screens’’, which serves as an argument of the verb ‘‘use’’, as opposed to the noun phrase ‘‘TV sets’’. Distinguishing between these two potential dependencies naturally requires syntactic information from the right context; hence the R2L-KD BERT, which is trained to emulate the predictions of an RNNG teacher that observes the right context, is able to make the correct prediction. This advantage is crucially retained by the UG-KD model that distills the RNNG's approximate distribution over words in bidirectional context (Eq. 5), and further confirms the efficacy of our proposed approach.

19All four BERTs have access to the full bidirectional
context at test time, although some are trained to mimic
the predictions of unidirectional RNNGs (L2R-KD and
R2L-KD).

Sentence Input: ‘‘Apple II owners , for example , had to use their TV sets as screens and stored data on audiocassettes’’

No-KD & L2R-KD Pred.    R2L-KD & UG-KD Pred.
(S[b]\NP)/NP            ((S[b]\NP)/PP)/NP

Table 4: An example of the CCG supertag predictions for the verb ‘‘use’’ from four different BERT variants. The correct answer is ‘‘((S[b]\NP)/PP)/NP’’, which both the R2L-KD and UG-KD predict correctly (blue). However, the No-KD baseline and the L2R-KD model produce (the same) incorrect predictions (red); both models fail to subcategorize for the prepositional phrase ‘‘as screens’’ as a dependent of the verb ‘‘use’’. Beyond this, all four models predict the correct supertags for all other words (not shown).

Measuring the Models' Differences. Beyond
the qualitative example in Table 4, we further
quantify the extent to which the different BERT
models produce different predictions. To this
end, we compute pairwise model agreement for
the phrase-structure parsing task, as measured
by exact match accuracy. We present the full
experimental setup and findings in Appendix E,
but summarize two key findings here.

First, the highest exact match agreement bet-
ween any pair of different models is fairly low
bei 44.92%, further supporting our conjecture
that different
teacher models indeed impose
different biases on the BERT student, as evidenced
by the different model predictions. Second, all
four structure-distilled BERT variants have the
lowest pairwise agreement score with the No-
KD baseline (< 39% pairwise model agreement), suggesting that all variants of our structure distillation objectives yield quantifiably different outputs compared to the no distillation alternative, which does not learn from the syntactic knowledge of RNNGs.

BERTLARGE Results. Having evaluated our structure-distilled BERTBASE-Cased, we now apply our approach on top of BERTLARGE-Cased, and present the results on the structured prediction tasks in Table 5. Overall, we observe a similar pattern of results with BERTLARGE as we do with BERTBASE: On the structured prediction tasks, our best structure distillation approach reduces error by 1.5% to 5.5% relative to the No-KD baseline. Furthermore, our structure-distilled BERTLARGE models establish new state of the art single model results—among models pretrained on the original BERT training set20—on phrase-structure parsing (PTB and OOD), PTB dependency parsing, and SRL.

                        Test Set – BERTLARGE-Cased
Task                    No-KD    Best-KD   Error Red.   BERT SoTA
Const. PTB – F1         95.80    95.95     3.73%        95.84†
Const. PTB – EM         56.87    57.74     2.02%        −
Const. OOD – F1         89.63    90.20     5.48%        89.91‡
Dep. PTB – UAS          96.91    97.03     3.78%        97.0†
Dep. PTB – LAS          95.33    95.49     3.43%        95.43†
SRL – OntoNotes         87.59    87.77     1.45%        86.5♦
Coref. – OntoNotes      74.03    74.69     2.55%        79.6◇

Table 5: Test set results for the structured prediction tasks with BERTLARGE-Cased; each entry reflects the mean of three random seeds. We compare the no distillation baseline (‘‘No-KD’’) with the best structure-distilled model, as selected on the validation set (‘‘Best-KD’’); ‘‘Error Red.’’ reports the test error reductions relative to the No-KD baseline. We also report the previous state of the art among non-ensemble models pretrained on the original BERT training set (‘‘BERT SoTA’’).21

20This comparison excludes other models like XLNet and RoBERTa, which are trained on more data.

21†Zhou and Zhao (2019), ‡Fried et al. (2019), ♦Shi and Lin (2019), and ◇Joshi et al. (2020).

4.4 Limitations

We outline two limitations to our approach. First, we assume the existence of decent-quality ‘‘silver-grade’’ phrase-structure trees to train the RNNG teacher. Although this assumption holds true for English because of the existence of accurate phrase-structure parsers, this is not necessarily the case for other languages. Second, pretraining the BERT student in our naïve implementation is about half as fast on TPUs compared with the baseline due to I/O bottleneck. This overhead only applies at pretraining, and can be reduced through parallelization.

5 Related Work

Earlier work has proposed a few ways for introducing notions of hierarchical structures into BERT, for instance, through designing structurally motivated auxiliary losses (Wang et al., 2020), or including syntactic information in the embedding layers that serve as inputs for the Transformer (Sundararaman et al., 2019). In contrast, we use a different technique for injecting syntactic biases, which is based on the structure distillation technique of Kuncoro et al. (2019), although our work features two key differences. First, Kuncoro et al.
(2019) put a sole emphasis on cases where both the teacher and student models are autoregressive, left-to-right LMs; here we extend this objective for when the student model is a representation learner that has access to bidirectional context. Second, Kuncoro et al. (2019) only evaluated their structure-distilled LMs in terms of perplexity and grammatical judgment (Marvin and Linzen, 2018). In contrast, we evaluate our structure-distilled BERT models on six diverse structured prediction tasks and the GLUE benchmark. It remains an open question whether, and how much, syntactic biases are helpful for a broader range of NLU tasks beyond grammatical judgment; our work represents a step towards answering this question.

Substantial progress has recently been made in improving the performance of BERT and other masked LMs (Lan et al., 2020; Liu et al., 2019b; Raffel et al., 2019; Sun et al., 2020, inter alia). Our structure distillation technique is orthogonal, and can be applied for these approaches. Lastly, our findings on the benefits of syntactic knowledge for structured prediction tasks that are not explicitly syntactic in nature, such as SRL and coreference resolution, are consistent with those of prior work (He et al., 2017; Swayamdipta et al., 2018; He et al., 2018; Strubell et al., 2018, inter alia).

6 Conclusion

Given the remarkable success of textual representation learners trained on large amounts of data, it remains an open question whether syntactic biases are still relevant for these models that work well at scale. Here we present evidence to the affirmative: our structure-distilled BERT models outperform the baseline on a diverse set of six structured prediction tasks. We achieve this through a new pretraining strategy that enables the BERT student to learn from the predictions of an explicitly hierarchical, but much less scalable, RNNG teacher model. Because the BERT student is a bidirectional model that estimates the conditional probabilities of masked words in context, we propose to distill an efficient yet surprisingly effective approximation of the RNNG's posterior estimate for generating each word conditional on its bidirectional context.

Our findings suggest that syntactic inductive biases are beneficial for a diverse range of structured prediction tasks, including for tasks that are not explicitly syntactic in nature. In addition, these biases are particularly helpful for improving fine-tuning sample efficiency on these tasks. Lastly, our findings motivate the broader question of how we can design models that integrate stronger notions of structural biases—and yet can be easily scalable at the same time—as a promising (if relatively underexplored) direction of future research.

Acknowledgments

We would like to thank Mandar Joshi, Zhaofeng Wu, Rui Zhang, Timothy Dozat, and Kenton Lee for answering questions regarding the evaluation of the model. We also thank Sebastian Ruder, John Hale, Kris Cao, Stephen Clark, and the three anonymous reviewers for their helpful suggestions. A. K. is supported by an EPSRC Doctoral Training Partnership studentship and a Balliol Mark Sadler scholarship; D. F. is supported by a Google PhD Fellowship.

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In Proceedings of ICLR.
Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016.
Globally normalized transition-based neural networks. In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/P16-1231
Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics.
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/P17-1080
Cristian Bucilǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of KDD. DOI: https://doi.org/10.1145/1150402.1150464
Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of AAAI.
Do Kook Choe and Eugene Charniak. 2016. Parsing as language modeling. In Proceedings of EMNLP. DOI: https://doi.org/10.18653/v1/D16-1257
Stephen Clark and James R. Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics. DOI: https://doi.org/10.1162/coli.2007.33.4.493
Marie-Catherine De Marneffe and Christopher D. Manning. 2008. Stanford typed dependencies manual. DOI: https://doi.org/10.3115/1608858.1608859
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL.
Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of ACL. DOI: https://doi.org/10.3115/v1/P15-1033
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of NAACL. DOI: https://doi.org/10.18653/v1/N16-1024
Meaghan Fowlie. 2017. Slaying the Great Green Dragon: Learning and Modelling Iterable Ordered Optional Adjuncts. Ph.D. thesis, UCLA.
Winthrop Nelson Francis and Henry Kučera. 1979. Manual of information to accompany a standard corpus of present-day edited American English, for use with digital computers. Brown University, Department of Linguistics.
Daniel Fried, Nikita Kitaev, and Dan Klein. 2019. Cross-domain generalization of neural constituency parsers. In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/P19-1031
Tommaso Furlanello, Zachary Chase Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born-again neural networks. In Proceedings of ICML.
Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. Neural language models as psycholinguistic subjects: Representations of syntactic state. In Proceedings of NAACL. DOI: https://doi.org/10.18653/v1/N19-1004
Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E. Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640. DOI: https://doi.org/10.18653/v1/W18-2501
Daniel Gildea. 2001. Corpus variation and parser performance. In Proceedings of EMNLP.
Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. CoRR, abs/1901.05287.
Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017.
Deep semantic role labeling: What works and what's next. In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/P17-1044
Shexia He, Zuchao Li, Hai Zhao, and Hongxiao Bai. 2018. Syntax for semantic role labeling, to be, or not to be. In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/P18-1192
John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of EMNLP. DOI: https://doi.org/10.18653/v1/D19-1275
John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL.
Geoffrey E. Hinton. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation. DOI: https://doi.org/10.1162/089976602760128018
Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR, abs/1503.02531.
Jennifer Hu, Jon Gauthier, Peng Qian, Ethan Wilcox, and Roger P. Levy. 2020. A systematic assessment of syntactic generalization in neural language models. In Proceedings of ACL.
Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/P19-1356
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. TACL. DOI: https://doi.org/10.1162/tacl_a_00300
Mandar Joshi, Omer Levy, Luke Zettlemoyer, and Daniel Weld. 2019. BERT for coreference resolution: Baselines and analysis. In Proceedings of EMNLP.
Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of EMNLP. DOI: https://doi.org/10.18653/v1/D16-1139
Yoon Kim, Alexander M. Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gabor Melis. 2019. Unsupervised recurrent neural network grammars. In Proceedings of NAACL. DOI: https://doi.org/10.18653/v1/N19-1114
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of EMNLP System Demonstrations. DOI: https://doi.org/10.18653/v1/D18-2012
Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. What do recurrent neural network grammars learn about syntax? In Proceedings of EACL. DOI: https://doi.org/10.18653/v1/E17-1117
Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/P18-1132
Adhiguna Kuncoro, Chris Dyer, Laura Rimell, Stephen Clark, and Phil Blunsom. 2019. Scalable syntax-aware language modelling with knowledge distillation. In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/P19-1337
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of ICLR.
Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018.
Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of NAACL.
Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Proceedings of KR.
Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. TACL. DOI: https://doi.org/10.1162/tacl_a_00115
Jiangming Liu and Yue Zhang. 2017. In-order transition-based constituent parsing. TACL.
Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019a. Linguistic knowledge and transferability of contextual representations. In Proceedings of NAACL.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330. DOI: https://doi.org/10.21236/ADA273556
Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of EMNLP. DOI: https://doi.org/10.18653/v1/D18-1151
David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of NAACL. DOI: https://doi.org/10.3115/1220835.1220855
David McClosky, Eugene Charniak, and Mark Johnson. 2008. When is self-training effective for parsing? In Proceedings of COLING. DOI: https://doi.org/10.3115/1599081.1599152
Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/P19-1334
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of Interspeech. DOI: https://doi.org/10.1109/ICASSP.2011.5947611
Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017a. DyNet: The Dynamic Neural Network Toolkit. arXiv preprint arXiv:1701.03980.
Graham Neubig, Yoav Goldberg, and Chris Dyer. 2017b. On-the-fly operation batching in dynamic computation graphs. In Proceedings of NeurIPS.
Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics. DOI: https://doi.org/10.1162/0891201053630264
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL. DOI: https://doi.org/10.18653/v1/N18-1202
Slav Petrov and Ryan McDonald. 2012. Overview of the 2012 shared task on parsing the web. In Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).
Carl Pollard and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. University of Chicago Press.
Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of CoNLL.
Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of CoNLL.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/P16-1162
Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.
Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of EMNLP. DOI: https://doi.org/10.18653/v1/D16-1159
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.
Mark Steedman. 2000. The Syntactic Process. MIT Press. DOI: https://doi.org/10.7551/mitpress/6591.001.0001
Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of EMNLP. DOI: https://doi.org/10.18653/v1/D18-1548
Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE 2.0: A continual pre-training framework for language understanding. In Proceedings of AAAI. DOI: https://doi.org/10.1609/aaai.v34i05.6428
Dhanasekar Sundararaman, Vivek Subramanian, Guoyin Wang, Shijing Si, Dinghan Shen, Dong Wang, and Lawrence Carin. 2019. Syntax-infused transformer and BERT models for machine translation and natural language understanding. arXiv preprint arXiv:1911.06156.
Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, and Noah A. Smith. 2018. Syntactic scaffolds for semantic structures. In Proceedings of EMNLP. DOI: https://doi.org/10.18653/v1/D18-1412
Yuka Tateisi, Akane Yakushiji, Tomoko Ohta, and Jun’ichi Tsujii. 2005. Syntax annotation for the GENIA corpus. In Proceedings of IJCNLP.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/P19-1452
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of ICLR.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NeurIPS.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR. DOI: https://doi.org/10.18653/v1/W18-5446
Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Liwei Peng, and Luo Si. 2020. StructBERT: Incorporating language structures into pre-training for deep language understanding. In Proceedings of ICLR.
Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.
Ethan Wilcox, Peng Qian, Richard Futrell, Miguel Ballesteros, and Roger Levy. 2019. Structural supervision improves learning of non-local grammatical dependencies. In Proceedings of NAACL.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of NeurIPS.
David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of ACL. DOI: https://doi.org/10.3115/981658.981684
Dani Yogatama, Cyprien de Masson d’Autume, Jerome Connor, Tomáš Kočiský, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, and Phil Blunsom. 2019. Learning and evaluating general linguistic intelligence. CoRR, abs/1901.11373.
Junru Zhou and Hai Zhao. 2019. Head-driven phrase structure grammar parsing on Penn Treebank. In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/P19-1230
Zhi-Hua Zhou and Ming Li. 2005. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering. DOI: https://doi.org/10.1109/TKDE.2005.186

A Preliminary Experiments

Here we discuss the preliminary experiments to assess the quality and computational efficiency of our posterior approximation procedure (§3.4). Recall that this approximation procedure only applies at inference; the LM is still trained in a typical autoregressive, left-to-right fashion.

Model. Because exactly computing the tφ(xi|x
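To make the role of the approximation concrete, the sketch below shows one generic way to estimate a word's distribution in bidirectional context using only a left-to-right LM: score the sentence completed with each candidate word and renormalise the joint scores over a restricted candidate set. This is a minimal illustration under stated assumptions, not necessarily the exact procedure of §3.4; the scoring callable sentence_log_prob and the candidate-set restriction are hypothetical.

import math
from typing import Callable, Dict, List, Sequence


def approx_word_posterior(
    prefix: Sequence[str],
    suffix: Sequence[str],
    candidates: List[str],
    sentence_log_prob: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Approximate p(x_i = w | prefix, suffix) with a left-to-right LM.

    Each candidate w is scored by the LM's log-probability of the full
    sequence prefix + [w] + suffix; renormalising these joint scores over
    the (restricted) candidate set gives an approximate posterior over
    the word at position i. `sentence_log_prob` is an assumed interface
    that returns the total autoregressive log-probability of a sequence.
    """
    # Joint log-probabilities log p(prefix, w, suffix) for each candidate.
    joint = {w: sentence_log_prob(list(prefix) + [w] + list(suffix)) for w in candidates}
    # Renormalise with a numerically stable log-sum-exp over the candidate set.
    max_score = max(joint.values())
    norm = sum(math.exp(s - max_score) for s in joint.values())
    return {w: math.exp(s - max_score) / norm for w, s in joint.items()}

Restricting the candidate set (for example, to the most likely words under some proposal distribution) keeps the number of LM evaluations per position manageable, which is the computational-efficiency concern this appendix examines.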