Syntactic Structure Distillation Pretraining for Bidirectional Encoders
Adhiguna Kuncoro∗♠♦ Lingpeng Kong∗♠ Daniel Fried∗♣
Dani Yogatama♠ Laura Rimell♠ Chris Dyer♠ Phil Blunsom♠♦
♠DeepMind, London, UK
♦Department of Computer Science, University of Oxford, UK
♣Computer Science Division, University of California, Berkeley, CA, USA
{akuncoro,lingpenk,dyogatama,laurarimell,cdyer,pblunsom}@google.com
dfried@cs.berkeley.edu
Abstract
Textual representation learners trained on large
amounts of data have achieved notable success
on downstream tasks; intriguingly, they have
also performed well on challenging tests of
syntactic competence. Hence, it remains an
open question whether scalable learners like
BERT can become fully proficient in the
syntax of natural language by virtue of data
scale alone, or whether they still benefit from
more explicit syntactic biases. To answer
this question, we introduce a knowledge
distillation strategy for injecting syntactic
biases into BERT pretraining, by distilling
the syntactically informative predictions of a
hierarchical (albeit harder to scale) syntactic
language model. Since BERT models masked
words in bidirectional context, we propose to
distill the approximate marginal distribution
over words in context from the syntactic LM.
Our approach reduces relative error by 2–21%
on a diverse set of structured prediction tasks,
although we obtain mixed results on the
GLUE benchmark. Our findings demonstrate
the benefits of syntactic biases, even for
representation learners that exploit large
amounts of data, and contribute to a better
understanding of where syntactic biases are
helpful in benchmarks of natural language
understanding.
1 Introduction
Large-scale textual representation learners trained
with variants of the language modeling (LM)
objective have achieved remarkable success on
downstream tasks (Peters et al., 2018; Devlin et al.,
2019; Yang et al., 2019). Furthermore, these
models have also been shown to perform remarkably
well at syntactic grammaticality judgment
tasks (Goldberg, 2019), and encode substantial
amounts of syntax in their learned representations
(Liu et al., 2019a; Tenney et al., 2019a,b;
Hewitt and Manning, 2019; Jawahar et al., 2019).
Intriguingly, success on these syntactic tasks
has been achieved by Transformer architectures
(Vaswani et al., 2017) that lack explicit notions of
hierarchical syntactic structures.
∗Equal contribution.
Based on such evidence, it would be tempting
to conclude that data scale alone is all we need to
learn the syntax of natural language. However,
recent findings that systematically compare the
syntactic competence of models trained at varying
data scales suggest that model inductive biases are
in fact more important than data scale for acquiring
syntactic competence (Hu et al., 2020). Two
natural questions, therefore, are the following:
Can representation learners that work well at scale
still benefit from explicit syntactic biases? And
where exactly would such syntactic biases be
helpful in different language understanding tasks?
Here we work towards answering these questions
by devising a new pretraining strategy that injects
syntactic biases into a BERT (Devlin et al., 2019)
learner that works well at scale. We hypothesize
that this approach can improve the competence of
BERT on various tasks, which would provide evidence
for the benefits of syntactic biases in large-scale
models.
Our approach is based on the prior work of
Kuncoro et al. (2019), who devised an effective
knowledge distillation (KD; Bucilǎ et al., 2006;
Hinton et al., 2015) procedure for improving
the syntactic competence of scalable LMs that
lack explicit syntactic biases. More concretely,
their KD procedure utilized the predictions of
an explicitly hierarchical (albeit hard to scale)
syntactic LM, recurrent neural network grammars
Transactions of the Association for Computational Linguistics, vol. 8, pp. 776–794, 2020. https://doi.org/10.1162/tacl_a_00345
Action Editor: James Henderson. Submission batch: 6/2020; Revision batch: 8/2020; Published 12/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Downloaded from http://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00345/1923888/tacl_a_00345.pdf by guest on 08 September 2023
(RNNGs; Dyer et al., 2016) (§2) as a syntactically
informed learning signal for a sequential LM that
works well at scale.
Our setup nevertheless presents a new challenge:
Here the BERT student is a denoising
autoencoder that models a collection of conditionals
for words in bidirectional context, while
the RNNG teacher is an autoregressive LM that
predicts words in a left-to-right fashion, that is,
tφ(xi | x<i). The teacher thus lacks the right
context x>i that is accessible to the BERT
student (§3). Hence, we propose an approach
where the BERT student distills the RNNG’s
marginal distribution over words in context,
tφ(xi | x<i, x>i). We develop
an efficient yet effective approximation for this
quantity, since exact inference is expensive owing
to the RNNG’s left-to-right parameterization.
Our structure-distilled BERT model differs
from the standard BERT model only in its
pretraining objective, and thus retains the scalability
afforded by Transformer architectures and specialized
hardware like TPUs. In fact, our approach
maintains compatibility with standard BERT
pipelines; the structure-distilled BERT models can
simply be loaded as pretrained BERT weights,
which can then be fine-tuned in the exact same
fashion.
We hypothesize that the stronger syntactic
biases from our new pretraining procedure are
useful for a variety of natural language
understanding (NLU) tasks that involve structured
output spaces, including tasks like semantic role
labeling (SRL) and coreference resolution that are
not explicitly syntactic in nature. We thus evaluate
our models on six diverse structured prediction
tasks, including phrase-structure parsing
(in-domain and out-of-domain), dependency parsing,
SRL, coreference resolution, and a combinatory
categorial grammar (CCG) supertagging probe, in
addition to the GLUE benchmark (Wang et al.,
2019). On the structured prediction tasks, our
structure-distilled BERTBASE reduces relative
error by 2% to 21%. These gains are more
pronounced in the low-resource scenario, suggesting
that stronger syntactic biases help improve sample
efficiency (§4).
Despite the gains on the structured prediction
tasks, we achieve mixed results on GLUE: Our
approach yields improvements on the corpus of
linguistic acceptability (Warstadt et al., 2018,
CoLA), but performs slightly worse on the rest
of GLUE. These findings point to a partial
dissociation between model performance on
GLUE and on structured prediction benchmarks
of NLU.
Altogether, our findings: (i) showcase the benefits
of syntactic biases, even for representation
learners that leverage large amounts of data, (ii)
help better understand where syntactic biases are
most helpful, and (iii) make a case for designing
approaches that not only work well at scale, but
also integrate stronger notions of syntactic biases.
2 Recurrent Neural Network Grammars
Here we briefly describe the RNNG (Dyer et al.,
2016) that we use as the teacher model. An RNNG
is a syntactic LM that defines the joint probability
of surface strings x and phrase-structure
nonterminals y, henceforth denoted as tφ(x, y),
through a series of structure-building actions that
traverse the tree in a top-down, left-to-right
fashion. Let N and Σ denote the set of phrase-structure
non-terminals and word terminals, respectively.
At each time step, the decision over the next action
at ∈ {NT(n), GEN(w), REDUCE}, where n ∈
N and w ∈ Σ, is parameterized by a stack LSTM
(Dyer et al., 2015) that encodes partial
constituents. The choice of at yields these transitions:
• at ∈ {NT(n), GEN(w)} would push the
corresponding non-terminal or word embeddings
(en or ew) onto the stack;
• at = REDUCE would pop the top k
elements up to the last incomplete
non-terminal, compose these elements with a
separate bidirectional LSTM, and lastly push
the composite phrase embedding ephrase back
onto the stack. The hierarchical inductive
bias of RNNGs can be attributed to this
composition function,1 which recursively
combines smaller units into larger ones.
RNNGs attempt to maximize the probability of
correct action sequences relative to each gold
tree.2
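The stack transitions above can be sketched symbolically. The following is a minimal illustration only, not the RNNG implementation: it replaces the stack LSTM and the bidirectional LSTM composition function with plain Python tuples, and the names `OpenNT` and `run_actions` are hypothetical helpers introduced here.

```python
from dataclasses import dataclass


@dataclass
class OpenNT:
    """Marker for an incomplete (still open) non-terminal on the stack."""
    label: str


def run_actions(actions):
    """Replay NT / GEN / REDUCE actions and return the final stack.

    Assumption: composition is symbolic (a tuple of children); the real
    model composes embeddings with a bidirectional LSTM instead.
    """
    stack = []
    for action in actions:
        kind = action[0]
        if kind == "NT":
            stack.append(OpenNT(action[1]))      # push an open non-terminal
        elif kind == "GEN":
            stack.append(action[1])              # push a generated word
        elif kind == "REDUCE":
            children = []
            # Pop elements up to the last incomplete non-terminal.
            while stack and not isinstance(stack[-1], OpenNT):
                children.append(stack.pop())
            if not stack:
                raise ValueError("REDUCE without an open non-terminal")
            label = stack.pop().label
            # Push the composed phrase back onto the stack.
            stack.append((label, tuple(reversed(children))))
        else:
            raise ValueError(f"unknown action: {kind}")
    return stack


actions = [("NT", "S"), ("NT", "NP"), ("GEN", "the"), ("GEN", "dogs"),
           ("REDUCE",), ("NT", "VP"), ("GEN", "bark"), ("REDUCE",),
           ("REDUCE",)]
tree = run_actions(actions)[0]
```

Replaying the action sequence yields a single completed constituent for "(S (NP the dogs) (VP bark))", which mirrors how smaller units are recursively combined into larger ones.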
1Not all syntactic LMs have hierarchical biases; Choe
and Charniak (2016) modeled strings and phrase structures
sequentially with LSTMs. This model can be understood as
a special case of RNNGs without the composition function.
2Unsupervised RNNGs (Kim et al., 2019) exist, although
they perform worse on measures of syntactic competence.
Extension to Subwords. Here we extend the
RNNG to operate over subword units (Sennrich
et al., 2016) to enable compatibility with the
BERT student. As each word can be split into
an arbitrary-length sequence of subwords, we
preprocess the phrase-structure trees to include
an additional nonterminal symbol that represents
a word sequence, as illustrated by the example ‘‘(S
(NP (WORD the) (WORD d ##og)) (VP (WORD
ba ##rk ##s)))’’, where tokens prefixed by ‘‘##’’
are subword units.3
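The preprocessing step described above can be illustrated with a small sketch. It assumes a greedy longest-match-first splitter in the style of WordPiece and a tiny hypothetical vocabulary chosen to reproduce the example; the function names (`wordpiece`, `add_word_nodes`, `bracket`) are illustrative and not taken from the authors' code.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split; non-initial pieces carry '##'."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]          # no piece matches: unknown token
        pieces.append(piece)
        start = end
    return pieces


def add_word_nodes(tree, vocab):
    """Wrap every leaf word in a WORD non-terminal over its subwords."""
    if isinstance(tree, str):
        return ("WORD", tuple(wordpiece(tree, vocab)))
    label, children = tree
    return (label, tuple(add_word_nodes(c, vocab) for c in children))


def bracket(tree):
    """Render a nested (label, children) tuple as a bracketed string."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    return "(" + label + " " + " ".join(bracket(c) for c in children) + ")"


# Hypothetical toy vocabulary, chosen only to reproduce the example.
vocab = {"the", "d", "##og", "ba", "##rk", "##s"}
tree = ("S", (("NP", ("the", "dog")), ("VP", ("barks",))))
print(bracket(add_word_nodes(tree, vocab)))
```

With this toy vocabulary the sketch prints the same bracketing as the example in the text, with each word dominated by its own WORD node.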
3 Approach
We begin with a brief review of the BERT
objective, before outlining our structure
distillation approach.
3.1 BERT Pretraining Objective
The aim of BERT pretraining is to find model
parameters ˆθB that would maximize the probability
of reconstructing parts of x = x1, · · · ,
xk conditional on a corrupted version c(x) =
c(x1), · · · , c(xk), where c(·) denotes the
stochastic corruption protocol of Devlin et al.
(2019) that is applied to each word xi ∈ x.
Formally:

$$
\hat{\theta}_B = \arg\min_{\theta} \sum_{i \in M(x)} -\log p_\theta\big(x_i \mid c(x_1), \cdots, c(x_k)\big),
\tag{1}
$$

where M(x) ⊆ {1, · · · , k} denotes the indices
of masked tokens that serve as reconstruction
targets.4 This masked LM objective is then
combined with a next-sentence prediction loss
that predicts whether the two segments in x are
contiguous sequences.
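As a minimal sketch of the summand in Eq. 1 for a single sequence, the following pure-Python snippet computes the masked LM loss from per-position vocabulary scores. The function names and the list-based data layout are assumptions made for illustration; the next-sentence prediction loss is omitted.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one list of vocabulary scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def masked_lm_loss(logits, original_ids, masked_positions):
    """Sum over i in M(x) of -log p_theta(x_i | c(x)).

    logits[i] holds the model scores at position i (computed from the
    corrupted input), original_ids[i] is the uncorrupted token id x_i,
    and masked_positions plays the role of M(x).
    """
    loss = 0.0
    for i in masked_positions:
        loss -= log_softmax(logits[i])[original_ids[i]]
    return loss


# Toy example: 4 positions, vocabulary of 5 tokens, uniform scores.
logits = [[0.0] * 5 for _ in range(4)]
loss = masked_lm_loss(logits, original_ids=[0, 1, 2, 3], masked_positions=[1, 3])
```

With uniform scores, each masked position contributes log 5 to the loss, so only the positions in M(x) affect the objective.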
3.2 Motivation
Because the RNNG teacher is an expert on
syntactic generalizations (Kuncoro et al., 2018;
Futrell et al., 2019; Wilcox et al., 2019), we adopt
a structure distillation procedure (Kuncoro et al.,
2019) that enables the BERT student to learn from
the RNNG’s syntactically informative predictions.
Our setup nevertheless means that the two models
here crucially differ in nature: The BERT student
3An alternative here is to represent each phrase as a flat
sequence of subwords, although our preliminary experiments
indicate that this approach yields worse perplexity.
4In practice, the corruption protocol c(·) and the
reconstruction targets M(x) are intertwined; M(x) denotes the
indices of tokens in x (∼ 15%) that were altered by c(x).
Figure 1: An example of the masked LM task, where
[MASK] = chase, and window is an attractor (red). We
suppress phrase-structure annotations and corruptions
on the context tokens for clarity.
is not a left-to-right LM like the RNNG, but rather
a denoising autoencoder that models a collection
of conditionals for words in bidirectional context
(Eq. 1).
We now present two strategies for dealing
with this challenge. The first, naïve approach
is to ignore this difference, and let the BERT
student distill the RNNG’s marginal next-word
distribution for each w ∈ Σ based on the left
context alone, that is, tφ(w | x<i). This approach
nevertheless ignores the right context x>i
that is accessible to the BERT student, and runs
the risk of encouraging the student to assign high
probabilities for words that fit poorly with the
bidirectional context.
Hence, our second approach is to learn from
teacher distributions that not only: (i) reflect the
strong syntactic biases of the RNNG teacher, but
also (ii) consider both the left and right context
when predicting w ∈ Σ. Formally, we propose
to distill the RNNG’s marginal distribution over
words in bidirectional context, tφ(w | x<i, x>i),
henceforth referred to as the posterior probability
for generating w under all available information.
We now demonstrate that this quantity can, in fact,
be computed from left-to-right LMs like RNNGs.
3.3 Posterior Inference
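Given a pretrained autoregressive, left-to-right
LM that factorizes tφ(x) = ∏_{i=1}^{k} tφ(xi | x<i), we
aim to infer the posterior tφ(xi | x<i, x>i). By
definition of conditional probabilities:5

$$
t_\phi(x_i \mid x_{<i}, x_{>i})
  = \frac{t_\phi(x_{<i}, x_i, x_{>i})}
         {\sum_{w \in \Sigma} t_\phi(x_{<i}, \tilde{x}_i = w, x_{>i})}
  = \frac{t_\phi(x_i \mid x_{<i}) \, t_\phi(x_{>i} \mid x_i, x_{<i})}
         {\sum_{w \in \Sigma} t_\phi(w \mid x_{<i}) \, t_\phi(x_{>i} \mid \tilde{x}_i = w, x_{<i})},
\tag{2}
$$

where ˜xi = w denotes replacing the token at position i
with w ∈ Σ, and the common factor tφ(x<i) cancels between
the numerator and the denominator.
In the example of Figure 1, the posterior would assign a
low probability to a verb like bark, even though it fits
the left context (we expect tφ(the cat | The dogs by
the window bark) to be low because it is
syntactically illicit). In contrast, the posterior
would assign high probabilities to plural verbs
like fight and chase that are consistent with the
bidirectional context, because we expect both
tφ(fight | The dogs by the window) and tφ(the cat
| The dogs by the window fight) to be probable.
Computational Cost. Let k denote the maximum
length of x. Our KD approach requires
computing the posterior distribution (Eq. 2)
for every masked token xi in the dataset D,
which (excluding marginalization cost over y)
necessitates O(|Σ| ∗ k ∗ |D|) operations, where
|Σ| denotes the vocabulary size.6
5In this setup, we assume that x is a fixed-length sequence.
We aim to infer the LM’s estimate for generating a single
token xi, relative to all potential single tokens w ∈ Σ
(denominator in Eq. 2), conditional on the bidirectional
context.
3.4 Posterior Approximation
Approximating tφ(x>i | xi, x<i) ≈ tφ(x>i | xi) in Eq. 2
yields:7

$$
t_\phi(x_i \mid x_{<i}, x_{>i}) \approx
  \frac{t_\phi(x_i \mid x_{<i}) \, t_\phi(x_{>i} \mid x_i)}
       {\sum_{w \in \Sigma} t_\phi(w \mid x_{<i}) \, t_\phi(x_{>i} \mid w)}.
\tag{3}
$$

Although Eq. 3 is still expensive to compute, it
enables us to apply Bayes’ rule to compute
tφ(x>i | xi):

$$
t_\phi(x_{>i} \mid x_i) = \frac{t_\phi(x_i \mid x_{>i}) \, t_\phi(x_{>i})}{q(x_i)},
\tag{4}
$$

where q(·) denotes the unigram distribution. For
efficiency, we replace tφ(xi | x>i) with a separately
trained ‘‘reverse’’, right-to-left RNNG,
denoted as rω(xi | x>i). We now apply Eq. 4 and
the right-to-left parameterization rω(xi | x>i) in
Eq. 3, and cancel common factors tφ(x>i):

$$
t_\phi(x_i \mid x_{<i}, x_{>i}) \approx
  \frac{\dfrac{t_\phi(x_i \mid x_{<i}) \, r_\omega(x_i \mid x_{>i})}{q(x_i)}}
       {\sum_{w \in \Sigma} \dfrac{t_\phi(w \mid x_{<i}) \, r_\omega(w \mid x_{>i})}{q(w)}}.
\tag{5}
$$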
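To make Eq. 5 concrete, here is a minimal sketch that combines a left-to-right next-word distribution, a right-to-left next-word distribution, and a unigram distribution for a single position. The function name and the toy numbers are hypothetical; the sketch assumes all probabilities are strictly positive and works in log space for numerical stability.

```python
import math


def posterior_approx(l2r, r2l, unigram):
    """Eq. 5 for one position: normalize l2r[w] * r2l[w] / q[w] over w.

    l2r[w]     ~ t_phi(w | x_<i)    (left-to-right teacher)
    r2l[w]     ~ r_omega(w | x_>i)  (right-to-left teacher)
    unigram[w] ~ q(w)
    Assumption: every probability is strictly positive.
    """
    log_scores = [math.log(a) + math.log(b) - math.log(q)
                  for a, b, q in zip(l2r, r2l, unigram)]
    m = max(log_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in log_scores))
    return [math.exp(s - log_z) for s in log_scores]


# Made-up distributions over a three-word toy vocabulary.
l2r = [0.5, 0.3, 0.2]
r2l = [0.2, 0.3, 0.5]
unigram = [0.5, 0.25, 0.25]
posterior = posterior_approx(l2r, r2l, unigram)
```

Because the computation is one elementwise product followed by a normalization over the vocabulary, it needs a single pass of each directional model per position rather than one pass per candidate word.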
Our approximation in Eq. 5 crucially reduces
the required number of operations from O(|Σ| ∗
k ∗ |D|) to O(|Σ| ∗ |D|), although the actual
speedup is much more substantial in practice,
since Eq. 5 involves easily batched operations that
considerably benefit from specialized hardware
like GPUs.
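As a rough worked example of the two complexity estimates, the values reported for the BERT pretraining setup (a vocabulary of about 29,000 subwords, about 3 × 10^9 tokens, and k = 512) can be plugged into the asymptotic expressions. These are order-of-magnitude counts under that assumption, not measured runtimes:

```latex
\begin{align*}
|\Sigma| \cdot k \cdot |D| &\approx 29{,}000 \times 512 \times 3 \times 10^{9} \approx 4.5 \times 10^{16}, \\
|\Sigma| \cdot |D| &\approx 29{,}000 \times 3 \times 10^{9} = 8.7 \times 10^{13}.
\end{align*}
```

The ratio between the two counts is exactly k = 512.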
Notably, our proposed approach here is a
general one; it can approximate the posterior
6In our BERT pretraining setup, |Σ| ≈ 29,000 (vocabulary
size of BERT-cased), |D| ≈ 3 × 10^9, and k = 512.
7This approximation preserves the intuition explained in
§3.3. Concretely, verbs like bark would also be assigned low
probabilities under this posterior approximation, since tφ(the
cat | bark) would be low because it is syntactically illicit; the
alternative ‘‘bark at the cat’’ would be syntactically licit.
over xi from any left-to-right LM, which can
be used as a learning signal for BERT through
KD, irrespective of the LM’s parameterization.
It does, however, necessitate a separately trained
right-to-left LM.
Connection to a Product of Experts. Eq. 5
has a similar form to a product of experts (PoE;
Hinton, 2002) between the left-to-right and
right-to-left RNNGs’ next-word distributions, albeit
with extra unigram terms q(w). If we replace the
unigram distribution with a uniform one, namely
q(w) = 1/|Σ| ∀w ∈ Σ, Eq. 5 reduces to a standard
PoE.
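The reduction can be verified in one line. Writing the left-to-right and right-to-left next-word distributions as $t_\phi$ and $r_\omega$, substituting the uniform $q(w) = 1/|\Sigma|$ into the unigram-weighted product makes the constant factor $|\Sigma|$ cancel between the numerator and the denominator:

```latex
\[
\frac{t_\phi(x_i \mid x_{<i}) \, r_\omega(x_i \mid x_{>i}) \, |\Sigma|}
     {\sum_{w \in \Sigma} t_\phi(w \mid x_{<i}) \, r_\omega(w \mid x_{>i}) \, |\Sigma|}
=
\frac{t_\phi(x_i \mid x_{<i}) \, r_\omega(x_i \mid x_{>i})}
     {\sum_{w \in \Sigma} t_\phi(w \mid x_{<i}) \, r_\omega(w \mid x_{>i})},
\]
```

which is the normalized product of the two directional distributions.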
Approximating the Marginal. The approximation
in Eq. 5 requires estimates of tφ(xi | x<i) and
rω(xi | x>i) from the left-to-right and
right-to-left RNNGs, respectively, which necessitate
expensive marginalizations over all possible tree
prefixes y<i and y>i. Following Kuncoro et al.
(2019), we approximate this marginalization using
a one-best predicted tree ˆy(x) = argmax_{y∈Y(x)}
sψ(y | x), where sψ(y | x) is parameterized by the
transition-based parser of Fried et al. (2019), and
Y(x) denotes the set of all possible trees for x.
Formally:

$$
t_\phi(x_i \mid x_{<i}) \approx t_\phi\big(x_i \mid x_{<i}, \hat{y}_{<i}(x)\big),
\tag{6}
$$

where ˆy<i(x) denotes the non-terminal symbols in ˆy(x)
that occur before xi.8 The marginal next-word
distribution rω(xi | x>i) from the
right-to-left RNNG is approximated similarly.
Preliminary Experiments. Before proceeding
with the KD experiments, we assess the quality
and feasibility of our approximation through
preliminary LM experiments on the Penn
Treebank (PTB; Marcus et al., 1993). We find
that our approximation is faster than exact
inference by a factor of more than 50,000, at
the expense of a slightly worse average posterior
negative log-likelihood (2.68 as opposed to 2.5 for
exact inference). More details are provided in
Appendix A.
8Our approximation of tφ(xi | x<i) conditions on a tree prefix
ˆy<i(x) predicted by a parser that has access to the yet-unseen
words x>i. This non-incremental
procedure is justified, however, because we aim to design
the most informative teacher distributions for the
non-incremental BERT student, which also has access to
bidirectional context.
Model                 KL Div. with Posterior Approx.
Left-to-right LM      2.27±1.84
Right-to-left LM      2.04±1.87
Product of Experts    1.12±1.08

Table 1: Preliminary experiments reporting the
mean±stdev. of the KL divergence (in nats)
between the proposed posterior approximation
(Eq. 5) and: (i) the left-to-right LM, (ii) the
right-to-left LM, and (iii) a simple product of experts
baseline (Eq. 5, but with the uniform distribution
for q(w)).
Differences Between the Models. We now
empirically validate our motivating intuition in
Figure 1: A model that takes into account the
bidirectional context (as is the case for our proposed
posterior approximation in Eq. 5) should
make different predictions compared with the
unidirectional left-to-right and right-to-left
models.9 To ascertain whether this is truly the case,
we compute the mean Kullback-Leibler (KL)
divergence between the distributions from the
proposed posterior approximation (Eq. 5) and the
distributions from: (i) the left-to-right model, (ii)
the right-to-left model, and (iii) a simple product
of experts baseline (i.e., Eq. 5, but where q(w) is
the uniform distribution). The findings in Table 1
suggest that our proposed posterior approximation
approach indeed yields quantifiably different
distributions from the left-to-right and
right-to-left baselines. To a lesser extent, it also differs
from a simple product of experts baseline that
similarly incorporates both the left-to-right and
right-to-left models’ predictions, albeit with the
uniform distribution for q(w).
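For reference, the KL divergence in nats between two discrete distributions over the same vocabulary can be computed as in the sketch below. The direction of the divergence used for Table 1 is not stated in the text, so treating the first argument as the posterior approximation is an assumption, as are the function name and the toy inputs.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p[w] == 0 contribute zero.

    Assumption: q[w] > 0 wherever p[w] > 0.
    """
    return sum(pw * math.log(pw / qw) for pw, qw in zip(p, q) if pw > 0.0)


# Made-up distributions over a two-word vocabulary.
p = [0.5, 0.5]
q = [0.25, 0.75]
value = kl_divergence(p, q)
```

Averaging this quantity over masked positions, and reporting its standard deviation, would give summary statistics of the kind shown in Table 1.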
3.5 Objective Function
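In our structure distillation pretraining, we aim
to find BERT parameters ˆθKD that emulate our
approximation of tφ(w | x<i, x>i) through a
word-level cross-entropy loss (Hinton et al., 2015; Kim
and Rush, 2016; Furlanello et al., 2018, inter alia):

$$
\hat{\theta}_{KD} = \arg\min_{\theta} \frac{1}{|D|} \sum_{x \in D} \ell_{KD}(x; \theta), \quad \text{where}
$$

$$
\ell_{KD}(x; \theta) = - \sum_{i \in M(x)} \sum_{w \in \Sigma}
  \Big[ \tilde{t}_{\phi,\omega}(w \mid x_{<i}, x_{>i}) \,
  \log p_\theta\big(\tilde{x}_i = w \mid c(x_1), \cdots, c(x_k)\big) \Big],
$$

9We use the same setup as Preliminary Experiments.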
9We use the same setup as Preliminary Experiments.
where ˜tφ,ω(w | x<i, x>i) is our approximation of
tφ(w | x<i, x>i), as defined in Eqs. 5 and 6.
Interpolation. The RNNG teacher is an expert
on syntax, although in practice it is only feasible
to train it on a much smaller dataset. Hence, we
not only want the BERT student to learn from
the RNNG’s syntactic expertise, but also from
the rich common-sense and semantic knowledge
contained in large text corpora by virtue of
predicting the true identity of the masked token
xi,10 as done in the standard BERT setup. We thus
interpolate the KD loss and the original BERT
masked LM objective:

$$
\hat{\theta}_{B\text{-}KD} = \arg\min_{\theta} \frac{1}{|D|} \sum_{x \in D}
  \Big[ \alpha \, \ell_{KD}(x; \theta) + (1 - \alpha) \sum_{i \in M(x)}
  -\log p_\theta\big(x_i \mid c(x_1), \cdots, c(x_k)\big) \Big],
\tag{7}
$$

omitting the next-sentence prediction for brevity.
We henceforth set α = 0.5 unless stated otherwise.
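The per-sequence loss inside Eq. 7 can be sketched as follows. This is an illustrative pure-Python version under assumed data layouts (lists of per-position scores, and a dictionary from masked positions to teacher distributions); the function names are hypothetical and the next-sentence prediction term is omitted, as in the equation.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one list of vocabulary scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kd_loss(student_logits, teacher_probs, masked_positions):
    """Word-level cross-entropy between teacher and student distributions."""
    loss = 0.0
    for i in masked_positions:
        log_p = log_softmax(student_logits[i])
        loss -= sum(t * lp for t, lp in zip(teacher_probs[i], log_p))
    return loss


def interpolated_loss(student_logits, teacher_probs, original_ids,
                      masked_positions, alpha=0.5):
    """alpha * KD loss + (1 - alpha) * masked LM loss for one sequence."""
    mlm = -sum(log_softmax(student_logits[i])[original_ids[i]]
               for i in masked_positions)
    kd = kd_loss(student_logits, teacher_probs, masked_positions)
    return alpha * kd + (1.0 - alpha) * mlm


# Toy sequence of 3 positions over a 2-word vocabulary; position 2 is masked.
student_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, math.log(3.0)]]
teacher_probs = {2: [0.5, 0.5]}   # teacher posterior at the masked position
original_ids = [0, 1, 1]
loss = interpolated_loss(student_logits, teacher_probs, original_ids, [2])
```

Setting `alpha=0.0` recovers the standard masked LM loss, while `alpha=1.0` trains on the teacher distribution alone, which does not depend on the identity of the true token.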
4 Experiments
Here we outline the evaluation setup, present
our results, and discuss the implications of our
findings.
4.1 Evaluation Tasks and Setup
We conjecture that the improved syntactic
competence from our approach would benefit a
broad range of tasks that involve structured output
spaces, including tasks that are not explicitly
syntactic. We thus evaluate our structure-distilled
BERTs on six diverse structured prediction
tasks that encompass syntactic, semantic, and
coreference resolution tasks, in addition to the
GLUE benchmark that is largely composed of
classification tasks.
Phrase-structure Parsing – PTB. We first
evaluate our model on phrase-structure parsing on
the WSJ section of the PTB. Following prior work,
we use sections 02–21 for training, section 22 for
validation, and section 23 for testing. We apply
our approach on top of the BERT-augmented
in-order (Liu and Zhang, 2017) transition-based
parser of Fried et al. (2019), which approaches
the current state of the art. Because the RNNG
teacher that we distill into BERT also uses
phrase-structure trees, this setup is related to self-training
(Yarowsky, 1995; Charniak, 1997; Zhou and Li,
2005; McClosky et al., 2006; Andor et al., 2016,
inter alia).
10The KD loss ℓKD(x; θ) is defined independently of xi.
Phrase-structure Parsing – OOD. Still in the
context of phrase-structure parsing, we evaluate
how well our approach generalizes to three
out-of-domain (OOD) treebanks: Brown (Francis and
Kučera, 1979), Genia (Tateisi et al., 2005), and
the English Web Treebank (Petrov and McDonald,
2012). Following Fried et al. (2019), we test the
PTB-trained parser on the test splits11 of these
OOD treebanks without any retraining, to simulate
the case where no in-domain labeled data are
available. We use the same codebase as above.
Dependency Parsing – PTB. Our third task
is PTB dependency parsing with Stanford
Dependencies (De Marneffe and Manning, 2008)
v3.3.0. We use the BERT-augmented joint
phrase-structure and dependency parser of Zhou and
Zhao (2019), which is inspired by head-driven
phrase-structure grammar (HPSG; Pollard and
Sag, 1994).
Semantic Role Labeling. Our fourth evaluation
task is span-based SRL on the English CoNLL
2012 (OntoNotes) dataset (Pradhan et al., 2013).
We apply our approach on top of the
BERT-augmented model of Shi and Lin (2019), as
implemented on AllenNLP (Gardner et al., 2018).
Coreference Resolution. Our fifth evaluation
task is coreference resolution, also on the English
OntoNotes dataset (Pradhan et al., 2012). For
this task, we use the BERT-augmented model of
Joshi et al. (2019), which extends the higher-order
coarse-to-fine model of Lee et al. (2018).
CCG Supertagging Probe. All proposed tasks
thus far necessitate either fine-tuning the entire
BERT model, or training a task-specific model
on top of the BERT embeddings. Hence, it
remains unclear how much of the gains are due
to better structural representations from our new
pretraining strategy, rather than the available
supervision at the fine-tuning stage. To better
understand the gains from our approach, we
evaluate on CCG (Steedman, 2000) supertagging
11We use the Brown test split of Gildea (2001), the Genia
test split of McClosky et al. (2008), and the EWT test split
from SANCL 2012 (Petrov and McDonald, 2012).
(Bangalore and Joshi, 1999; Clark and Curran,
2007) through a classifier probe (Shi et al., 2016;
Adi et al., 2017; Belinkov et al., 2017, inter alia),
where no BERT fine-tuning takes place.12
CCG supertagging is a compelling probing
task because it necessitates an understanding
of bidirectional context information; the
per-word classification setup also lends itself well to
classifier probes. However, it remains unclear
how much of the accuracy can be attributed to
the information encoded in the representation,
as opposed to the classifier probe itself. We
thus adopt the control task protocol of Hewitt
and Liang (2019) that assigns each word type
to a random control category,13 which assesses
the memorization capacity of the classifier. In
addition to the probing accuracy, we report
the probe selectivity,14 where higher selectivity
denotes probes that faithfully rely on the linguistic
knowledge encoded in the representation. We use
linear classifiers to maintain high selectivities.
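The control task and the selectivity measure can be sketched in a few lines. This is a simplified illustration: it assumes that control categories are drawn uniformly at random per word type, which may differ from the sampling scheme of the original protocol, and the helper names are hypothetical.

```python
import random


def control_labels(word_types, num_categories, seed=0):
    """Assign each word type a fixed random control category.

    Assumption: categories are sampled uniformly; num_categories is set
    to the number of supertags so both tasks share the same label space.
    """
    rng = random.Random(seed)
    return {w: rng.randrange(num_categories) for w in sorted(word_types)}


def selectivity(task_accuracy, control_accuracy):
    """Probing-task accuracy minus control-task accuracy."""
    return task_accuracy - control_accuracy


labels = control_labels({"the", "dogs", "chase", "cat"}, num_categories=3)
```

Since every occurrence of a word type receives the same arbitrary label, a probe can only succeed on the control task by memorizing word identities, so a large gap between the two accuracies suggests that the probe relies on information in the representation.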
Commonality. All our structured prediction ex-
periments are conducted on top of publicly avail-
able repositories of BERT-augmented models,
with the exception of the CCG supertagging task
that we evaluate as a probe. This setup means that
obtaining our results is as simple as changing the
pretrained BERT weights to our structure-distilled
BERT, and applying the exact same steps as for
fine-tuning the baseline model. The fine-tuning
hyperparameters are summarized in Appendix C.
GLUE. Beyond the six structured prediction
tasks above, we evaluate our approach on the
classification15 tasks of the GLUE benchmark
except the Winograd NLI (Levesque et al., 2012)
for consistency with the BERT paper (Devlin
et al., 2019). The BERT GLUE fine-tuning
hyperparameters are based on the fine-tuning
configurations of Joshi et al. (2020); we sum-
marize these in Appendix C.
12A similar CCG probe was explored by Liu et al. (2019a);
we obtain comparable results for the no distillation baseline.
13Following Hewitt and Liang (2019), the cardinality of
this control category is the same as the number of supertags.
14A probe’s selectivity is defined as the difference between
the probing task accuracy and the control task accuracy.
15This setup excludes the semantic textual similarity
benchmark (STS-B), which is formulated as a regression
task.
4.2 Experimental Setup and Baselines
Here we describe the key aspects of our empirical
setup, and outline the baselines for assessing the
efficacy of our approach.
RNNG Teacher. We implement the
subword-augmented RNNG teachers (§2) on DyNet
(Neubig et al., 2017a), and obtain ‘‘silver-grade’’
phrase-structure annotations for the entire BERT
training set using the transition-based parser of
Fried et al. (2019). These trees are used to train
the RNNG (§2), and to approximate its marginal
next-word distribution at inference (Eq. 6). We use
the same WordPiece tokenization and vocabulary
as BERT-Cased; Appendix B summarizes the
complete list of RNNG hyperparameters. Because
our approximation (Eq. 5) makes use of a
right-to-left RNNG, we train this variant with the
same hyperparameters and data as the left-to-right
model. We train each directional RNNG teacher
on a shared subset of 3.6M sentences (∼3%) of
the BERT training set with automatic dynamic
batching (Neubig et al., 2017b), which takes three
weeks on a V100 GPU.
BERT Student. We first apply our structure
distillation pretraining protocol to BERTBASE-Cased.
We use the exact same training dataset, model
configuration, WordPiece tokenization, vocabulary,
and hyperparameters (Appendix C) as in the
standard pretrained BERT model.16 The sole
exception is that we use a larger initial learning rate
of 3e−4 based on preliminary experiments,17
which we apply to all models (including the
no distillation/standard BERT baseline) for fair
comparison.
Baselines and Comparisons. We compare the
following set of models in our experiments:
• A standard BERTBASE-Cased without any
structure distillation loss, which benefits
from scalability but lacks syntactic biases
(‘‘No-KD’’);
• Four variants of structure-distilled BERTs
that: (i) only distill the left-to-right RNNG
(‘‘L2R-KD’’), (ii) only distill the
right-to-left RNNG (‘‘R2L-KD’’), (iii) distill
the RNNG’s approximated marginal for
16https://github.com/google-research/bert.
17We find this larger learning rate to perform better on most of
our evaluation tasks. Liu et al. (2019b) have similarly found
that tuning BERT’s initial learning rate leads to better results.
                       Validation Set                                      Test Set
                       Baselines        Structure-distilled BERTs
Task                   No-KD   Seq-KD   L2R-KD  R2L-KD  UF-KD   UG-KD     No-KD   Best-KD  Err. Red.
Parsing
Const. PTB – F1        95.38   95.33    95.55   95.55   95.58   95.59     95.35   95.70    7.6%
Const. PTB – EM        55.33   55.41    55.92   56.18   56.39   56.59     55.25   57.77    5.63%
Const. OOD – F1†       86.76   86.54    87.43   87.53   87.23   87.40     89.04   89.76    6.55%
Dep. PTB – UAS         96.48   96.40    96.70   96.64   96.60   96.66     96.79   96.86    2.18%
Dep. PTB – LAS         94.65   94.56    94.90   94.80   94.79   94.83     95.13   95.23    1.99%

SRL – OntoNotes        86.17   86.09    86.34   86.29   86.30   86.46     86.08   86.39    2.23%
Coref. – OntoNotes     72.53   69.27    73.74   73.49   73.79   73.33     72.71   73.69    3.58%

CCG supertag. probe    93.69   91.59    93.97   95.21   95.13   95.21     93.88   95.2     21.57%
Probe selectivity      24.79   23.77    23.3    23.57   27.28   28.3      23.15   26.07    N/A

Table 2: Validation and test results for the structured prediction tasks; each entry reflects the mean of
three random seeds. To preserve test set integrity, we only obtain test set results for the no distillation
baseline and the best structure-distilled BERT on the validation set; ‘‘Err. Red.’’ reports the test error
reductions relative to the No-KD baseline. We report F1 and exact match (EM) for PTB phrase-structure
parsing; for dependency parsing, we report unlabeled (UAS) and labeled (LAS) attachment scores. The ‘‘Const.
OOD’’ (†) row indicates the mean F1 from three out-of-domain corpora: Brown, Genia, and the English
Web Treebank (EWT), although the validation results exclude the Brown Treebank, which has no validation
set.
generating xi under the bidirectional context,
where q(w) (Eq. 5) is the uniform distribution
(‘‘UF-KD’’), and lastly (iv) a similar variant
as (iii), but where q(w) is the unigram
distribution (‘‘UG-KD’’). All these BERT
models crucially benefit from the syntactic
biases of RNNGs, although only variants (iii)
and (iv) learn from teacher distributions that
consider bidirectional context for predicting
xi; and
• A BERTBASE model that distills the
approximate posterior for generating xi under
the bidirectional context, but from sequential
LSTM teachers (‘‘Seq-KD’’) in place of
RNNGs.18 This baseline crucially isolates
the importance of learning from hierarchical
teachers, because it utilizes the exact same
approximation technique and KD loss as the
structure-distilled BERTs.
Learning Curves. Given enough labeled data,
BERT can acquire the relevant structural
information from the fine-tuning (as opposed to
pretraining) procedure, although better pretrained
representations can nevertheless facilitate
sample-efficient generalizations (Yogatama et al., 2019).
18For fair comparison, we train the LSTM on the exact
same subset as the RNNG, with comparable number of
model parameters. An alternative here is to use Transformers,
although we elect to use LSTMs to facilitate fair comparison
with RNNGs, which are also based on LSTM architectures.
We thus additionally examine the models’ fine-
tuning learning curves, as a function of varying
amounts of training data, on phrase-structure
parsing and SRL.
Random Seeds. Because fine-tuning the same
pretrained BERT with different random seeds
can lead to varying results, we report the mean
performance from three random seeds on the
structured prediction tasks, and from five random
seeds on GLUE.
Test Results. To preserve the integrity of the
test sets, we first report all performance on the
validation sets, and only report the test set results
for: (i) the No-KD baseline, and (ii) the best
structure-distilled model on the validation set
(‘‘Best-KD’’).
4.3 Findings and Discussion
We report the validation and test set results for
the structured prediction tasks in Table 2. The
validation set learning curves for phrase-structure
parsing and SRL that compare the No-KD baseline
with the UG-KD variant are provided in Figure 2.
General Discussion. We summarize several
key observations from Table 2 and Figure 2.
• All four structure-distilled BERT models
consistently outperform the No-KD baseline,
including the L2R-KD and R2L-KD variants
Figure 2: The fine-tuning learning curves that examine how the number of fine-tuning instances (from 5% to
100% of the full training sets) affects validation set F1 scores in the case of phrase-structure parsing and SRL. We
compare the No-KD/standard BERTBASE-Cased and the UG-KD structure-distilled BERT.
that only distill the syntactic knowledge
of unidirectional RNNGs. Remarkably, this
pattern holds true for all six structured prediction
tasks. In contrast, we observe no such
gains for the Seq-KD baseline, which largely
performs worse than the No-KD model. We
conclude that the gains afforded by our
structure-distilled BERTs can be attributed
to the syntactic biases of the RNNG teacher.
• We conjecture that the surprisingly strong
performance of the L2R-KD and R2L-KD
models, which distill the knowledge of
unidirectional RNNGs, can be attributed to
the interpolated objective in Eq. 7 (α = 0.5).
This interpolation means that the target
distribution assigns a probability mass of
at least 0.5 to the true masked word xi,
which is guaranteed to be consistent with the
bidirectional context. Nevertheless, the syntactic
knowledge contained in the unidirectional
RNNGs’ predictions can still provide a
structurally informative learning signal, through
the rest of the probability mass, for the BERT
student.
• Although all structure-distilled variants outperform
the baseline, models that distill our
approximation of the RNNG’s distribution
for words in bidirectional context (UF-KD
and UG-KD) yield the best results on four out
of six tasks (PTB phrase-structure parsing,
SRL, coreference resolution, and the CCG
supertagging probe). This finding confirms
the efficacy of our approach.
• We observe the largest gains for the syntactic
tasks, particularly for phrase-structure
parsing and CCG supertagging. Nevertheless,
the improvements are not at all confined to
purely syntactic tasks: we reduce relative
error from strong BERT baselines by 2.2%
and 3.6% on SRL and coreference resolution,
respectively. While the RNNG’s syntactic
biases are derived from phrase-structure
grammar, the strong improvement on CCG
supertagging, in addition to the smaller improvement
on dependency parsing, suggests
that the RNNG’s syntactic biases generalize
well across different syntactic formalisms.
• We observe larger improvements in a low-resource
scenario, where the model is exposed
to fewer fine-tuning instances (Figure 2), suggesting
that syntactic biases are helpful for
enabling more sample-efficient generalizations.
This pattern holds for both tasks that we
investigated: phrase-structure parsing (syntactic
in nature) and SRL (not explicitly
syntactic in nature). With only 5% of the fine-tuning
data, the UG-KD model improves F1
score from 79.9 to 80.6 for SRL (a 3.5%
error reduction relative to the No-KD baseline,
as opposed to 2.2% on the full data). For
phrase-structure parsing, the UG-KD model
achieves a remarkable 93.68 F1 (a 16% relative
error reduction, as opposed to 7.6% on
the full data) with only 5% of the PTB; this
performance is notably better than past state
of the art phrase-structure parsers trained on
the full PTB c. 2017 (Kuncoro et al., 2017).
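The interpolation argument in the second observation above can be made concrete with a small Python sketch. It mixes a teacher distribution with the one-hot distribution of the true masked word and shows that the true word always receives at least half of the probability mass when the interpolation weight is 0.5. The function name, the toy vocabulary, and the placement of the weight on the one-hot term are assumptions made for illustration; Eq. 7 itself is not reproduced in this section, and at a weight of 0.5 either placement yields the same target.

```python
def interpolated_target(teacher_probs, true_index, alpha=0.5):
    """Mix a teacher distribution with the one-hot distribution of the true word.

    Assumption: weight alpha goes to the teacher and (1 - alpha) to the
    one-hot term; with alpha = 0.5 the two placements coincide.
    """
    if abs(sum(teacher_probs) - 1.0) > 1e-6:
        raise ValueError("teacher_probs must sum to 1")
    target = [alpha * p for p in teacher_probs]
    target[true_index] += 1.0 - alpha
    return target


# Toy 4-word vocabulary (made-up numbers); the unidirectional teacher puts
# most of its mass on a word other than the true masked word at index 0.
teacher = [0.1, 0.6, 0.2, 0.1]
target = interpolated_target(teacher, true_index=0, alpha=0.5)
print(target)
```

Even though this hypothetical teacher prefers the wrong word, the target still ranks the true word first, while the remaining mass preserves the teacher's relative preferences among the other words.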
GLUE Results and Discussion. We report the
GLUE validation and test results for BERTBASE-
Cased in Table 3. Because we observe a different
pattern of results on the Corpus of Linguistic
Acceptability (CoLA; Warstadt et al., 2018) than
                           No-KD         UG-KD
Validation Set (Per-task average / 1-best random seed)
CoLA                       50.7 / 60.2   54.3 / 60.6
7-task avg. (excl. CoLA)   85.4 / 87.8   84.8 / 86.9
Overall 8-task avg.        81.1 / 84.4   81.0 / 83.6
Test set (Per-task 1-best random seed on validation set)
CoLA                       53.1          55.3
7-task avg. (excl. CoLA)   84.2          83.5
Overall 8-task avg.        80.3          80.0
Table 3: Summary of the validation and test
set results on GLUE. The validation results are
derived from the average of five random seeds
for each task, which accounts for variance, and
the 1-best random seed, which does not. The test
results are derived from the 1-best random seed
on the validation set.
on the rest of GLUE, we henceforth report: (i)
the CoLA results, (ii) the seven task average
that excludes CoLA, and (iii) the average across
all eight tasks. We select the UG-KD model
because it achieved the best validation set eight
task average among the structure-distilled BERTs;
the per-task GLUE breakdown is provided in
Appendix D.
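The three reported aggregates are tied together by simple arithmetic: the eight-task average is the CoLA score plus seven times the seven-task average, divided by eight. The short Python sketch below checks this against the test-set numbers reported in Table 3. The function name is hypothetical, and because the seven-task averages are themselves rounded to one decimal, the recomputed values are only expected to agree after rounding.

```python
def overall_eight_task_avg(cola, seven_task_avg):
    """Combine the CoLA score with the average over the other seven tasks."""
    return (cola + 7 * seven_task_avg) / 8


# Test-set scores from Table 3.
no_kd = overall_eight_task_avg(cola=53.1, seven_task_avg=84.2)
ug_kd = overall_eight_task_avg(cola=55.3, seven_task_avg=83.5)
print(round(no_kd, 1), round(ug_kd, 1))  # 80.3 80.0
```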
The results on GLUE provide an interesting
contrast to the consistent improvements we
observed on the structured prediction tasks. More
concretely, our UG-KD model outperforms the
baseline on CoLA, but performs slightly worse on
the other GLUE tasks in aggregate, leading to a
slightly lower overall test set accuracy (80.0 for
the UG-KD as opposed to 80.3 for the No-KD
baseline).
The improvement on the syntax-sensitive
CoLA provides additional evidence, beyond the
improvement on the syntactic tasks (Table 2),
that our approach indeed yields improved syntactic
competence. We conjecture that these improvements
do not transfer to the other GLUE tasks
because they rely more on lexical and semantic
properties, and less on syntactic competence
(McCoy et al., 2019).
We defer a more thorough investigation of
how much syntactic competence is necessary for
solving most of the GLUE tasks to future work,
but make two remarks. First, the findings on
GLUE are consistent with the hypothesis that our
approach yields improved structural competence,
albeit at the expense of a slightly less rich meaning
representation, which we attribute to the smaller
dataset used to train the RNNG teacher. Second,
human-level natural language understanding
includes the ability to predict structured outputs,
for example, to decipher ‘‘who did what to whom’’
(SRL). Succeeding in these tasks necessitates
inference about structured output spaces, which
(unlike most of GLUE) cannot be reduced to
a single classification decision. Our findings
indicate a partial dissociation between model
performance on these two types of tasks; hence,
supplementing GLUE evaluation with some of
these structured prediction tasks can offer a more
holistic assessment of progress in NLU.
CCG Probe Example. The CCG supertagging
probe is a particularly interesting test bed, because
it clearly assesses the model’s ability to use contextual
information in making its predictions,
without introducing additional confounds from
the BERT fine-tuning procedure. We thus provide
a representative example of four different BERT
variants’ predictions on the CCG supertagging
probe in Table 4, based on which we discuss
two observations. First, the different models make
different predictions, where the No-KD and L2R-KD
models produce (coincidentally the same)
incorrect predictions, while the R2L-KD and
UG-KD models are able to predict the correct
supertag. This finding suggests that different
teacher models are able to impose different biases
on the BERT students.19
Second, the mistakes of the No-KD and L2R-KD
BERTs belong to the broader category
of challenging argument-adjunct distinctions
(Palmer et al., 2005; Fowlie, 2017). Here both
models fail to subcategorize for the prepositional
phrase (PP) ‘‘as screens’’, which serves as
an argument of the verb ‘‘use’’, as opposed to the
noun phrase ‘‘TV sets’’. Distinguishing between
these two potential dependencies naturally requires
syntactic information from the right
context; hence the R2L-KD BERT, which is
trained to emulate the predictions of an RNNG
teacher that observes the right context, is able
to make the correct prediction. This advantage
is crucially retained by the UG-KD model that
distills the RNNG’s approximate distribution over
words in bidirectional context (Eq. 5), and further
confirms the efficacy of our proposed approach.
19All four BERTs have access to the full bidirectional
context at test time, although some are trained to mimic
the predictions of unidirectional RNNGs (L2R-KD and
R2L-KD).
Sentence Input: ‘‘Apple II owners , for example , had to use their TV
sets as screens and stored data on audiocassettes’’
No-KD & L2R-KD Pred.: (S[b]\NP)/NP
R2L-KD & UG-KD Pred.: ((S[b]\NP)/PP)/NP
Table 4: An example of the CCG supertag predictions for the verb ‘‘use’’ from four different BERT
variants. The correct answer is ‘‘((S[b]\NP)/PP)/NP’’, which both the R2L-KD and UG-KD predict
correctly (blue). However, the No-KD baseline and the L2R-KD model produce (the same) incorrect
predictions (red); both models fail to subcategorize for the prepositional phrase ‘‘as screens’’ as a
dependent of the verb ‘‘use’’. Beyond this, all four models predict the correct supertags for all other
words (not shown).
Measuring the Models’ Differences. Beyond
the qualitative example in Table 4, we further
quantify the extent to which the different BERT
models produce different predictions. To this
end, we compute pairwise model agreement for
the phrase-structure parsing task, as measured
by exact match accuracy. We present the full
experimental setup and findings in Appendix E,
but summarize two key findings here.
First, the highest exact match agreement between
any pair of different models is fairly low
at 44.92%, further supporting our conjecture
that different teacher models indeed impose
different biases on the BERT student, as evidenced
by the different model predictions. Second, all
four structure-distilled BERT variants have the
lowest pairwise agreement score with the No-KD
baseline (< 39% pairwise model agreement),
suggesting that all variants of our structure
distillation objectives yield quantifiably different
outputs compared to the no distillation alternative,
which does not learn from the syntactic knowledge
of RNNGs.
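As an illustration of the agreement measure described above, the following Python sketch computes exact match agreement for every pair of models: the fraction of sentences on which two models output the identical parse. The function names and the toy bracketed parses are hypothetical, and the actual setup in Appendix E may differ in details such as tree normalization or how random seeds are handled.

```python
from itertools import combinations


def exact_match_agreement(preds_a, preds_b):
    """Fraction of sentences on which two models output the identical parse."""
    if len(preds_a) != len(preds_b):
        raise ValueError("both models must be evaluated on the same sentences")
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)


def pairwise_agreement(predictions):
    """Map each unordered pair of model names to its exact match agreement."""
    return {
        (m1, m2): exact_match_agreement(predictions[m1], predictions[m2])
        for m1, m2 in combinations(sorted(predictions), 2)
    }


# Toy bracketed parses for three sentences (made-up data, not from the paper).
toy_predictions = {
    "No-KD": ["(S (NP a) (VP b))", "(S (NP c) (VP d))", "(S (VP e))"],
    "UG-KD": ["(S (NP a) (VP b))", "(S (NP c) (VP (V d)))", "(S (VP e))"],
    "R2L-KD": ["(S (NP a) (VP b))", "(S (NP c) (VP (V d)))", "(S (NP e))"],
}
agreement = pairwise_agreement(toy_predictions)
for pair, score in agreement.items():
    print(pair, round(score, 2))
```

Exact match is a strict criterion: a single differing bracket makes two parses count as disagreeing, which is one reason agreement scores under this measure can be low even between accurate parsers.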
BERTLARGE Results. Having evaluated our
structure-distilled BERTBASE-Cased, we now apply
our approach on top of BERTLARGE-Cased, and
present the results on the structured prediction
tasks in Table 5. Overall, we observe a similar
pattern of results with BERTLARGE as we do with
BERTBASE: On the structured prediction tasks, our
best structure distillation approach reduces error
by 1.5% to 5.5% relative to the No-KD baseline.
Furthermore, our structure-distilled BERTLARGE
models establish new state of the art single model
results—among models pretrained on the original
BERT training set20—on phrase-structure parsing
(PTB and OOD), PTB dependency parsing, and
SRL.
Test Set - BERTLARGE-Cased
Task                    No-KD   Best-KD   Error Red.   BERT SoTA
Parsing
  Const. PTB − F1       95.80   95.95     3.73%        95.84†
  Const. PTB − EM       56.87   57.74     2.02%        −
  Const. OOD − F1       89.63   90.20     5.48%        89.91‡
  Dep. PTB − UAS        96.91   97.03     3.78%        97.0†
  Dep. PTB − LAS        95.33   95.49     3.43%        95.43†
SRL − OntoNotes         87.59   87.77     1.45%        86.5♦
Coref. − OntoNotes      74.03   74.69     2.55%        79.6(cid:7)
Table 5: Test set results for the structured
prediction tasks with BERTLARGE-Cased; each
entry reflects the mean of three random seeds.
We compare the no distillation baseline (‘‘No-KD’’)
with the best structure-distilled model,
as selected on the validation set (‘‘Best-KD’’);
‘‘Error Red.’’ reports the test error reductions
relative to the No-KD baseline. We also report
the previous state of the art among non-ensemble
models pretrained on the original BERT training
set (‘‘BERT SoTA’’).21
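The error reduction column can be reproduced from the two score columns. The Python sketch below assumes the usual definition, namely the share of the baseline's remaining error (100 minus its score) that the new model removes; the function name is hypothetical. Values recomputed from the rounded table entries can differ slightly from the reported ones, presumably because the reported reductions were computed from unrounded seed averages.

```python
def relative_error_reduction(baseline_score, new_score, max_score=100.0):
    """Percentage of the baseline's remaining error removed by the new model."""
    baseline_error = max_score - baseline_score
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    return 100.0 * (new_score - baseline_score) / baseline_error


# SRL row of Table 5: No-KD 87.59, Best-KD 87.77 (reported reduction: 1.45%).
srl = relative_error_reduction(87.59, 87.77)
# Coreference row of Table 5: No-KD 74.03, Best-KD 74.69 (reported: 2.55%).
coref = relative_error_reduction(74.03, 74.69)
print(round(srl, 2), round(coref, 2))  # 1.45 2.54
```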
4.4 Limitations
We outline two limitations to our approach. First,
we assume the existence of decent-quality ‘‘silver-
grade’’ phrase-structure trees to train the RNNG
teacher. Although this assumption holds true for
English because of the existence of accurate
phrase-structure parsers, this is not necessarily
the case for other languages. Second, pretraining
the BERT student in our naïve implementation
is about half as fast on TPUs compared with
the baseline due to an I/O bottleneck. This overhead
only applies at pretraining, and can be reduced
through parallelization.
5 Related Work
Earlier work has proposed a few ways for
introducing notions of hierarchical structures into
20This comparison excludes other models like XLNet and
RoBERTa, which are trained on more data.
21†Zhou and Zhao (2019), ‡Fried et al. (2019), ♦Shi and
Lin (2019), and (cid:7)Joshi et al. (2020).
BERT, for instance, through designing structurally
motivated auxiliary losses (Wang et al., 2020), or
including syntactic information in the embedding
layers that serve as inputs for the Transformer
(Sundararaman et al., 2019). In contrast, we
use a different technique for injecting syntactic
biases, which is based on the structure distillation
technique of Kuncoro et al. (2019), although
our work features two key differences. First,
Kuncoro et al. (2019) put a sole emphasis on
cases where both the teacher and student models
are autoregressive, left-to-right LMs; here we
extend this objective for when the student model
is a representation learner that has access to
bidirectional context. Second, Kuncoro et al.
(2019) only evaluated their structure-distilled
LMs in terms of perplexity and grammatical
judgment (Marvin and Linzen, 2018). In contrast,
we evaluate our structure-distilled BERT models
on six diverse structured prediction tasks and the
GLUE benchmark. It remains an open question
whether, and how much, syntactic biases are
helpful for a broader range of NLU tasks beyond
grammatical judgment; our work represents a step
towards answering this question.
Substantial progress has recently been made in
improving the performance of BERT and other
masked LMs (Lan et al., 2020; Liu et al., 2019b;
Raffel et al., 2019; Sun et al., 2020, inter alia). Our
structure distillation technique is orthogonal to, and
can be combined with, these approaches. Lastly, our
findings on the benefits of syntactic knowledge for
structured prediction tasks that are not explicitly
syntactic in nature, such as SRL and coreference
resolution, are consistent with those of prior work
(He et al., 2017; Swayamdipta et al., 2018; He
et al., 2018; Strubell et al., 2018, inter alia).
6 Conclusion
Given the remarkable success of textual
representation learners trained on large amounts
of data, it remains an open question whether
syntactic biases are still relevant for these models
that work well at scale. Here we present evidence
in the affirmative: our structure-distilled BERT
models outperform the baseline on a diverse set
of six structured prediction tasks. We achieve
this through a new pretraining strategy that
enables the BERT student to learn from the
predictions of an explicitly hierarchical, but much
less scalable, RNNG teacher model. Because
the BERT student is a bidirectional model that
estimates the conditional probabilities of masked
words in context, we propose to distill an efficient
yet surprisingly effective approximation of the
RNNG’s posterior estimate for generating each
word conditional on its bidirectional context.
Our findings suggest that syntactic inductive
biases are beneficial for a diverse range of
structured prediction tasks, including for tasks that
are not explicitly syntactic in nature. In addition,
these biases are particularly helpful for improving
fine-tuning sample efficiency on these tasks.
Lastly, our findings motivate the broader ques-
tion of how we can design models that integrate
stronger notions of structural biases—and yet can
be easily scalable at the same time—as a promising
(if relatively underexplored) direction of future
research.
Acknowledgments
We would like to thank Mandar Joshi, Zhaofeng
Wu, Rui Zhang, Timothy Dozat, and Kenton Lee
for answering questions regarding the evaluation
of the model. We also thank Sebastian Ruder,
John Hale, Kris Cao, Stephen Clark, and the
three anonymous reviewers for
their helpful
suggestions. A. K. is supported by an EPSRC
Doctoral Training Partnership studentship and a
Balliol Mark Sadler scholarship; D. F. is supported
by a Google PhD Fellowship.
References
Yossi Adi, Einat Kermany, Yonatan Belinkov,
Ofer Lavi, and Yoav Goldberg. 2017. Fine-
grained analysis of sentence embeddings using
auxiliary prediction tasks. In Proceedings of
ICLR.
Daniel Andor, Chris Alberti, David Weiss,
Aliaksei Severyn, Alessandro Presta, Kuzman
Ganchev, Slav Petrov, and Michael Collins.
2016. Globally normalized transition-based
neural networks. In Proceedings of ACL. DOI:
https://doi.org/10.18653/v1/P16-1231
Srinivas Bangalore and Aravind K. Joshi. 1999.
Supertagging: An approach to almost parsing.
Computational Linguistics.
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi,
Hassan Sajjad, and James Glass. 2017. What do
neural machine translation models learn about
morphology? In Proceedings of ACL. DOI:
https://doi.org/10.18653/v1/P17-1080
Cristian Bucilǎ, Rich Caruana, and Alexandru
Niculescu-Mizil. 2006. Model compression. In
Proceedings of KDD. DOI:
https://doi.org/10.1145/1150402.1150464
Eugene Charniak. 1997. Statistical parsing with
a context-free grammar and word statistics. In
Proceedings of AAAI.
Do Kook Choe and Eugene Charniak. 2016.
Parsing as language modeling. In Proceedings
of EMNLP. DOI: https://doi.org/10
.18653/v1/D16-1257
Stephen Clark and James R. Curran. 2007. Wide-
coverage efficient statistical parsing with CCG
and log-linear models. Computational Linguis-
tics. DOI: https://doi.org/10.1162
/coli.2007.33.4.493
Marie-Catherine De Marneffe and Christopher D.
Manning. 2008. Stanford typed dependencies
manual. DOI: https://doi.org/10.3115
/1608858.1608859
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of NAACL.
Chris Dyer, Miguel Ballesteros, Wang Ling,
Austin Matthews, and Noah A. Smith. 2015.
Transition-based dependency parsing with
stack long short-term memory. In Proceedings
of ACL. DOI: https://doi.org/10.3115
/v1/P15-1033
Chris Dyer, Adhiguna Kuncoro, Miguel
Ballesteros, and Noah A. Smith. 2016. Re-
current neural network grammars. In Proceed-
ings of NAACL. DOI: https://doi.org
/10.18653/v1/N16-1024, PMID: 26993434
Meaghan Fowlie. 2017. Slaying the Great Green
Dragon: Learning and Modelling Iterable Or-
dered Optional Adjuncts. Ph.D. thesis, UCLA.
Winthrop Nelson Francis and Henry Kučera.
1979. Manual of information to accompany a
standard corpus of present-day edited American
English, for use with digital computers. Brown
University, Department of Linguistics.
Daniel Fried, Nikita Kitaev, and Dan Klein. 2019.
Cross-domain generalization of neural consti-
tuency parsers. In Proceedings of ACL. DOI:
https://doi.org/10.18653/v1/P19-1031
Tommaso Furlanello, Zachary Chase Lipton,
Michael Tschannen, Laurent Itti, and Anima
Anandkumar. 2018. Born-again neural net-
works. In Proceedings of ICML.
Richard Futrell, Ethan Wilcox, Takashi Morita,
Peng Qian, Miguel Ballesteros, and Roger Levy.
2019. Neural language models as psycholinguis-
tic subjects: Representations of syntactic state.
In Proceedings of NAACL. DOI: https://
doi.org/10.18653/v1/N19-1004
Matt Gardner, Joel Grus, Mark Neumann, Oyvind
Tafjord, Pradeep Dasigi, Nelson F. Liu,
Matthew E. Peters, Michael Schmitz, and Luke
Zettlemoyer. 2018. AllenNLP: A deep semantic
natural language processing platform. CoRR,
abs/1803.07640. DOI:
https://doi.org/10.18653/v1/W18-2501
Daniel Gildea. 2001. Corpus variation and parser
performance. In Proceedings of EMNLP.
Yoav Goldberg. 2019. Assessing BERT’s syntac-
tic abilities. CoRR, abs/1901.05287.
Luheng He, Kenton Lee, Mike Lewis, and Luke
Zettlemoyer. 2017. Deep semantic role labeling:
What works and what’s next. In Proceedings
of ACL. DOI:
https://doi.org/10.18653/v1/P17-1044,
PMCID: PMC5961228
Shexia He, Zuchao Li, Hai Zhao, and Hongxiao
Bai. 2018. Syntax for semantic role labeling,
to be, or not to be. In Proceedings of ACL. DOI:
https://doi.org/10.18653/v1/P18-1192,
PMCID: PMC6010685
John Hewitt and Percy Liang. 2019. Designing
and interpreting probes with control tasks. In
Proceedings of EMNLP. DOI: https://
doi.org/10.18653/v1/D19-1275
John Hewitt and Christopher D. Manning. 2019.
A structural probe for finding syntax in word
representations. In Proceedings of NAACL.
Geoffrey E Hinton. 2002. Training products of
experts by minimizing contrastive divergence.
Neural Computation. DOI: https://doi
.org/10.1162/089976602760128018,
PMID: 12180402
Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey
Dean. 2015. Distilling the knowledge in a neural
network. CoRR, abs/1503.02531.
Jennifer Hu, Jon Gauthier, Peng Qian, Ethan
Wilcox, and Roger P. Levy. 2020. A systematic
assessment of syntactic generalization in neural
language models. In Proceedings of ACL.
Ganesh Jawahar, Benoît Sagot, and Djamé
Seddah. 2019. What does BERT learn about
the structure of language? In Proceedings of
ACL. DOI: https://doi.org/10.18653
/v1/P19-1356
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S.
Weld, Luke Zettlemoyer, and Omer Levy. 2020.
SpanBERT: Improving pre-training by rep-
resenting and predicting spans. TACL. DOI:
https://doi.org/10.1162/tacl_a_00300
Mandar Joshi, Omer Levy, Luke Zettlemoyer,
and Daniel Weld. 2019. BERT for coreference
resolution: Baselines and analysis. In Proceed-
ings of EMNLP.
Yoon Kim and Alexander M. Rush. 2016.
Sequence-level knowledge distillation. In Pro-
ceedings of EMNLP. DOI: https://doi
.org/10.18653/v1/D16-1139
Yoon Kim, Alexander M. Rush, Lei Yu, Adhiguna
Kuncoro, Chris Dyer, and Gabor Melis. 2019.
Unsupervised recurrent neural network grammars.
In Proceedings of NAACL. DOI:
https://doi.org/10.18653/v1/N19-1114
Taku Kudo and John Richardson. 2018. Sentence-
Piece: A simple and language independent sub-
word tokenizer and detokenizer for neural
text processing. In Proceedings of EMNLP
System Demonstrations. DOI: https://doi
.org/10.18653/v1/D18-2012, PMID:
29382465
Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng
Kong, Chris Dyer, Graham Neubig, and Noah
A. Smith. 2017. What do recurrent neural net-
work grammars learn about syntax? In Pro-
ceedings of EACL. DOI: https://doi
.org/10.18653/v1/E17-1117
Adhiguna Kuncoro, Chris Dyer, John Hale, Dani
Yogatama, Stephen Clark, and Phil Blunsom.
2018. LSTMs can learn syntax-sensitive depen-
dencies well, but modeling structure makes
them better. In Proceedings of ACL. DOI:
https://doi.org/10.18653/v1/P18-1132
Adhiguna Kuncoro, Chris Dyer, Laura Rimell,
Stephen Clark, and Phil Blunsom. 2019. Scal-
able syntax-aware language modelling with
knowledge distillation. In Proceedings of ACL.
DOI: https://doi.org/10.18653/v1
/P19-1337
Zhenzhong Lan, Mingda Chen, Sebastian
Goodman, Kevin Gimpel, Piyush Sharma, and
Radu Soricut. 2020. ALBERT: A lite BERT
for self-supervised learning of language repre-
sentations. In Proceedings of ICLR.
Kenton Lee, Luheng He, and Luke Zettlemoyer.
2018. Higher-order coreference resolution with
coarse-to-fine inference. In Proceedings of
NAACL.
Hector J. Levesque, Ernest Davis, and Leora
Morgenstern. 2012. The Winograd schema
challenge. In Proceedings of KR.
Tal Linzen, Emmanuel Dupoux, and Yoav
Goldberg. 2016. Assessing the ability of LSTMs
to learn syntax-sensitive dependencies. TACL.
DOI: https://doi.org/10.1162/tacl_a_00115
Jiangming Liu and Yue Zhang. 2017. In-order
transition-based constituent parsing. TACL.
Nelson F. Liu, Matt Gardner, Yonatan Belinkov,
Matthew E. Peters, and Noah A. Smith. 2019a.
Linguistic knowledge and transferability of
contextual representations. In Proceedings of
NAACL.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019b. RoBERTa: A robustly
optimized BERT pretraining approach. CoRR,
abs/1907.11692.
Mitchell P. Marcus, Mary Ann Marcinkiewicz,
and Beatrice Santorini. 1993. Building a large
annotated corpus of English: The Penn Tree-
bank. Computational Linguistics, 19:313–330.
DOI: https://doi.org/10.21236/ADA273556
Rebecca Marvin and Tal Linzen. 2018. Targeted
syntactic evaluation of language models. In
Proceedings of EMNLP. DOI: https://doi
.org/10.18653/v1/D18-1151
David McClosky, Eugene Charniak, and Mark
Johnson. 2006. Effective self-training for
parsing. In Proceedings of NAACL. DOI:
https://doi.org/10.3115/1220835.1220855
David McClosky, Eugene Charniak, and Mark
Johnson. 2008. When is self-training effective
for parsing? In Proceedings of COLING. DOI:
https://doi.org/10.3115/1599081
.1599152
Tom McCoy, Ellie Pavlick, and Tal Linzen.
2019. Right for the wrong reasons: Diagnosing
syntactic heuristics in natural language
inference. In Proceedings of ACL. DOI:
https://doi.org/10.18653/v1/P19-1334
Tomáš Mikolov, Martin Karafiát, Lukáš Burget,
Jan Černocký, and Sanjeev Khudanpur. 2010.
Recurrent neural network based language
model. In Proceedings of Interspeech. DOI:
https://doi.org/10.1109/ICASSP.2011.5947611
Graham Neubig, Chris Dyer, Yoav Goldberg,
Austin Matthews, Waleed Ammar, Antonios
Anastasopoulos, Miguel Ballesteros, David
Chiang, Daniel Clothiaux, Trevor Cohn,
Kevin Duh, Manaal Faruqui, Cynthia Gan,
Dan Garrette, Yangfeng Ji, Lingpeng Kong,
Adhiguna Kuncoro, Gaurav Kumar, Chaitanya
Malaviya,
Paul Michel, Yusuke Oda,
Matthew Richardson, Naomi Saphra, Swabha
Swayamdipta, and Pengcheng Yin. 2017a.
DyNet: The Dynamic Neural Network Toolkit.
arXiv preprint arXiv:1701.03980.
Graham Neubig, Yoav Goldberg, and Chris Dyer.
2017b. On-the-fly operation batching in dy-
namic computation graphs. In Proceedings of
NeurIPS.
Martha Palmer, Daniel Gildea,
and Paul
Kingsbury. 2005. The proposition bank: An an-
notated corpus of semantic roles. Computatio-
nal Linguistics. DOI: https://doi.org
/10.1162/0891201053630264
Matthew E. Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextua-
lized word representations. In Proceedings of
NAACL. DOI: https://doi.org/10.18653
/v1/N18-1202
Slav Petrov and Ryan McDonald. 2012. Overview
of the 2012 shared task on parsing the web.
In Notes of the First Workshop on Syntactic
Analysis of Non-Canonical Language (SANCL).
Carl Pollard and Ivan A. Sag. 1994. Head-Driven
Phrase Structure Grammar. University of
Chicago Press.
Sameer Pradhan, Alessandro Moschitti, Nianwen
Xue, Hwee Tou Ng, Anders Björkelund, Olga
Uryupina, Yuchen Zhang, and Zhi Zhong.
2013. Towards robust linguistic analysis using
OntoNotes. In Proceedings of CoNLL.
Sameer Pradhan, Alessandro Moschitti, Nianwen
Xue, Olga Uryupina, and Yuchen Zhang. 2012.
CoNLL-2012 shared task: Modeling multilin-
gual unrestricted coreference in OntoNotes. In
Proceedings of CoNLL.
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2019. Exploring the limits of transfer learning
with a unified text-to-text transformer. CoRR,
abs/1910.10683.
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Neural machine translation of rare
words with subword units. In Proceedings of
ACL. DOI: https://doi.org/10.18653/v1/P16-1162
Peng Shi and Jimmy Lin. 2019. Simple BERT
models for relation extraction and semantic role
labeling. CoRR, abs/1904.05255.
Xing Shi, Inkit Padhi, and Kevin Knight. 2016.
Does string-based neural MT learn source
syntax? In Proceedings of EMNLP. DOI:
https://doi.org/10.18653/v1/D16-1159
Nitish Srivastava, Geoffrey Hinton, Alex
Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. 2014. Dropout: A simple way
to prevent neural networks from overfitting.
Journal of Machine Learning Research.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Proceedings of
NeurIPS.
Mark Steedman. 2000. The Syntactic Process,
MIT Press. DOI: https://doi.org/10
.7551/mitpress/6591.001.0001
Emma Strubell, Patrick Verga, Daniel Andor,
David Weiss, and Andrew McCallum. 2018.
Linguistically-informed self-attention for se-
mantic role labeling. In Proceedings of EMNLP.
DOI: https://doi.org/10.18653/v1
/D18-1548
Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun
Feng, Hao Tian, Hua Wu, and Haifeng
Wang. 2020. ERNIE 2.0: A continual
pre-training framework for language understanding.
In Proceedings of AAAI. DOI:
https://doi.org/10.1609/aaai.v34i05.6428
Dhanasekar Sundararaman, Vivek Subramanian,
Guoyin Wang, Shijing Si, Dinghan Shen,
Dong Wang, and Lawrence Carin. 2019.
Syntax-infused transformer and BERT models for
machine translation and natural language understanding.
arXiv preprint arXiv:1911.06156.
Swabha Swayamdipta, Sam Thomson, Kenton
Lee, Luke Zettlemoyer, Chris Dyer, and Noah
A. Smith. 2018. Syntactic scaffolds for seman-
tic structures. In Proceedings of EMNLP. DOI:
https://doi.org/10.18653/v1/D18-1412
Yuka Tateisi, Akane Yakushiji, Tomoko Ohta, and
Jun’ichi Tsujii. 2005. Syntax annotation for the
GENIA corpus. In Proceedings of IJCNLP.
Ian Tenney, Dipanjan Das, and Ellie Pavlick.
2019a. BERT rediscovers the classical NLP
pipeline. In Proceedings of ACL. DOI:
https://doi.org/10.18653/v1/P19-1452
Ian Tenney, Patrick Xia, Berlin Chen, Alex
Wang, Adam Poliak, R Thomas McCoy,
Najoung Kim, Benjamin Van Durme, Sam
Bowman, Dipanjan Das, and Ellie Pavlick.
2019b. What do you learn from context?
Probing for sentence structure in contextualized
word representations. In Proceedings of ICLR.
Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R. Bowman.
2019. GLUE: A multi-task benchmark and
analysis platform for natural language understanding.
In Proceedings of ICLR. DOI:
https://doi.org/10.18653/v1/W18-5446
Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi
Bao, Liwei Peng, and Luo Si. 2020. Struct-
BERT: Incorporating language structures into
pre-training for deep language understanding.
In Proceedings of ICLR.
Alex Warstadt, Amanpreet Singh, and Samuel R.
Bowman. 2018. Neural network acceptability
judgments. arXiv preprint arXiv:1805.12471.
Ethan Wilcox, Peng Qian, Richard Futrell, Miguel
Ballesteros, and Roger Levy. 2019. Structural
supervision improves learning of non-local
grammatical dependencies. In Proceedings of
NAACL.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G.
Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. 2019. XLNet: Generalized autoregressive
pretraining for language understanding. In
Proceedings of NeurIPS.
David Yarowsky. 1995. Unsupervised word sense
disambiguation rivaling supervised methods. In
Proceedings of ACL. DOI:
https://doi.org/10.3115/981658.981684
Dani Yogatama, Cyprien de Masson d’Autume,
Jerome Connor, Tomáš Kočiský, Mike
Chrzanowski, Lingpeng Kong, Angeliki
Lazaridou, Wang Ling, Lei Yu, Chris Dyer,
and Phil Blunsom. 2019. Learning and eval-
uating general linguistic intelligence. CoRR,
abs/1901.11373.
Junru Zhou and Hai Zhao. 2019. Head-driven
phrase structure grammar parsing on Penn
treebank. In Proceedings of ACL. DOI:
https://doi.org/10.18653/v1/P19-1230, PMCID: PMC6593428
Zhi-Hua Zhou and Ming Li. 2005. Tri-training:
Exploiting unlabeled data using three class-
ifiers. IEEE Transactions on Knowledge and
Data Engineering. DOI:
https://doi.org/10.1109/TKDE.2005.186
A Preliminary Experiments
Here we discuss the preliminary experiments to
assess the quality and computational efficiency
of our posterior approximation procedure (§3.4).
Recall that this approximation procedure only
applies at inference; the LM is still trained in a
typical autoregressive, left-to-right fashion.
Model. Because exactly computing the
tφ(xi|x