Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals


Yanai Elazar¹,² Shauli Ravfogel¹,² Alon Jacovi¹ Yoav Goldberg¹,²
¹Computer Science Department, Bar Ilan University
²Allen Institute for Artificial Intelligence
{yanaiela,shauli.ravfogel,alonjacovi,yoav.goldberg}@gmail.com

Abstract

A growing body of work makes use of probing in order to investigate the
working of neural models, often considered black boxes. Recently, an ongoing
debate emerged surrounding the limitations of the probing paradigm. In this
work, we point out the inability to infer behavioral conclusions from probing
results, and offer an alternative method that focuses on how the information
is being used, rather than on what information is encoded. Our method,
Amnesic Probing, follows the intuition that the utility of a property for a
given task can be assessed by measuring the influence of a causal intervention
that removes it from the representation. Equipped with this new analysis tool,
we can ask questions that were not possible before; for example, is
part-of-speech information important for word prediction? We perform a series
of analyses on BERT to answer these types of questions. Our findings
demonstrate that conventional probing performance is not correlated to task
importance, and we call for increased scrutiny of claims that draw behavioral
or causal conclusions from probing results.¹

1 Introduction

What drives a model to perform a specific prediction? What information is
being used for prediction, and what would have happened if that information
went missing? Because neural representation is opaque and hard to interpret,
answering these questions is challenging.

The recent advancements in Language Models (LMs) and their success in transfer
learning of many NLP tasks (e.g., Peters et al., 2018; Devlin et al., 2019;
Liu et al., 2019b) sparked interest in understanding how these models work and
what is being encoded in them. One prominent methodology that attempts to shed
light on those questions is probing (Conneau et al., 2018) (also known as
auxiliary prediction [Adi et al., 2016] and diagnostic classification [Hupkes
et al., 2018]). Under this methodology, one trains a simple model (a probe) to
predict some desired information from the latent representations of the
pre-trained model. High prediction performance is interpreted as evidence for
the information being encoded in the representation. A key drawback of such an
approach is that while it may indicate that the information can be extracted
from the representation, it provides no evidence for or against the actual use
of this information by the model. Indeed, Hewitt and Liang (2019) have shown
that under certain conditions, above-random probing accuracy can be achieved
even when the information that one probes for is linguistically meaningless
noise, which is unlikely to have any use by the actual model. More recently,
Ravichander et al. (2020) showed that models encode linguistic properties,
even when not required at all for solving the task, questioning the usefulness
and common interpretation of probing. These results call for higher scrutiny
of causal claims based on probing results.

¹The code is available at: https://github.com/yanaiela/amnesic_probing.

In this paper, we propose a counterfactual approach that serves as a step
towards causal attribution: Amnesic Probing (see Figure 1 for a schematic
view). We build on the intuition that if a property Z (e.g., part-of-speech)
is being used for a task T (e.g., language modeling), then the removal of Z
should negatively influence the ability of the model to solve the task.
Conversely, when the removal of Z has little or no influence on the ability to
solve T, one can argue that knowing Z is not a significant contributing factor
in the strategy the model employs in solving T.

Transactions of the Association for Computational Linguistics, vol. 9, pp. 160–175, 2021. https://doi.org/10.1162/tacl_a_00359
Action Editor: Radu Florian. Submission batch: 7/2020; Revision batch: 9/2020; Published 3/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

As opposed to previous work that focused on intervention in the input space
(Goyal et al., 2019; Kaushik et al., 2020; Vig et al., 2020) or in specific
neurons (Vig et al., 2020), our intervention is done on the representation
layers. This makes it easier than changing the input (which is non-trivial)
and more efficient than querying hundreds of neurons (which becomes
combinatorial when considering the effect of multiple neurons simultaneously).

We demonstrate that amnesic probing can function as a debugging and analysis
tool for neural models. Specifically, by using amnesic probing we show how to
deduce whether a property is used by a given model in prediction.

In order to build the counterfactual representations, we need a function that
operates on a pre-trained representation and returns a counterfactual version
which no longer encodes the property we focus on. We use the recently proposed
algorithm for neutralizing linear information: Iterative Nullspace Projection
(INLP) (Ravfogel et al., 2020). This approach allows us to ask the
counterfactual question: “How will the prediction of a task differ without
access to some property?” (Pearl and Mackenzie, 2018). This approach relies on
the assumption that the usefulness of some information can be measured by
neutralizing it from the representation, and witnessing the resulting
behavioral change. It echoes the basic idea of ablation tests where one
removes some component and measures the influence of that intervention.

We study several linguistic properties such as part-of-speech (POS) and
dependency labels. Overall, we find that as opposed to the common belief, high
probing performance does not mean that the probed information is used for
predicting the main task (§4). This is consistent with the recent findings of
Ravichander et al. (2020). Our analysis also reveals that the properties we
examine are often being used differently in the masked setting (which is
mostly used in LM training) and in the non-masked setting (which is commonly
used for probing or fine-tuning) (§5). We then dive deeper into a more
fine-grained analysis, and show that not all of the linguistic property labels
equally influence prediction (§6). Finally, we re-evaluate previous claims
about the way that BERT processes the traditional NLP pipeline (Tenney et al.,
2019a) with amnesic probing and provide a novel interpretation on the utility
of different layers (§7).

Figure 1: A schematic description of the proposed amnesic intervention: We
transform the contextualized representation of the word “ran” so as to remove
information (here, POS), resulting in a “cleaned” version h^{¬POS}_{ran}. This
representation is fed to the word-prediction layer and the behavioral
influence of POS erasure is measured.

2 Amnesic Probing

2.1 Setup and Formulation

Given a set of labeled data of data points² X = x1, . . . , xn and task labels
Y = y1, . . . , yn, we analyze a model f that predicts the labels Y from X:
ŷi = f(xi). We assume that this model is composed of two parts: an encoder h
that transforms input xi into a representation vector hxi and a classifier c
that is used for predicting ŷi based on hxi: ŷi = c(h(xi)). We refer by model
to the component that follows the encoding function h and is used for the
classification of the task of interest y. Each data point xi is also
associated with a property of interest zi which represents additional
information, which may or may not affect the decision of the classifier c.

In this work, we are interested in the change in the prediction ŷi of the
classifier c that is caused by the removal of the property Z from the
representation h(xi), that is, h(xi)^{¬Z}.

²The data points can be words, documents, images, etc., based on the
application.

2.2 Amnesic Probing with INLP

Under the counterfactual approach, we aim to evaluate the behavioral influence
of a specific type of information Z (e.g., POS) on some task


(e.g., language modeling). To do so, we selectively remove this information
from the representation and observe the change in the behavior of the model on
the main task.
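As a rough illustration of what "observing the change in behavior" can mean in practice, the sketch below compares toy word distributions before and after a hypothetical intervention, using an accuracy drop and a mean KL divergence. The function names, the toy distributions, and the direction of the divergence (before relative to after) are our assumptions for illustration, not details taken from the released code.

```python
import math

def accuracy(dists, gold):
    """Fraction of positions where the most probable token equals the gold token."""
    hits = sum(max(d, key=d.get) == g for d, g in zip(dists, gold))
    return hits / len(gold)

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two distributions given as token -> probability dicts."""
    return sum(pv * math.log(pv / max(q.get(tok, 0.0), eps))
               for tok, pv in p.items() if pv > 0)

# Toy word distributions before and after a hypothetical amnesic intervention.
before = [{"ran": 0.7, "run": 0.2, "red": 0.1}, {"the": 0.9, "a": 0.1}]
after = [{"ran": 0.2, "run": 0.3, "red": 0.5}, {"the": 0.8, "a": 0.2}]
gold = ["ran", "the"]

acc_drop = accuracy(before, gold) - accuracy(after, gold)
mean_kl = sum(kl_divergence(p, q) for p, q in zip(before, after)) / len(before)
print(f"accuracy drop: {acc_drop:.2f}, mean D_KL: {mean_kl:.3f}")
```

The accuracy drop only looks at the top prediction, while the divergence also registers shifts in probability mass that leave the top prediction unchanged.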

One commonly used method for information removal relies on adversarial
training through the gradient reversal layer technique (Ganin et al., 2016).
However, this technique requires changing the original encoding by retraining
the model, which is not desired in our case as we wish to study the original
model's behavior. Additionally, Elazar and Goldberg (2018) found that this
technique does not completely remove all the information from the learned
representation.

Instead, we make use of a recently proposed algorithm called Iterative
Nullspace Projection (INLP) (Ravfogel et al., 2020). Given a labeled dataset
of representations H, and a property to remove, Z, INLP neutralizes the
ability to linearly predict Z from H. It does so by training a sequence of
linear classifiers (probes) c1, . . . , ck that predict Z, interpreting each
one as conveying information on a unique direction in the latent space that
corresponds to Z, and iteratively removing each of these directions.
Concretely, we assume that the ith probe ci is parameterized by a matrix Wi.
In the ith iteration, ci is trained to predict Z from H,³ and the data is
projected onto its nullspace using a projection matrix PN(Wi). This operation
guarantees Wi PN(Wi) H = 0, i.e., it neutralizes the features in the latent
space which were found by Wi to be indicative of Z. By repeating this process
until no classifier achieves above-majority accuracy, INLP removes all such
features.⁴
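The loop described above can be sketched in a few lines for the simplest case of a binary property, where each probe is a single weight vector w and the nullspace projection reduces to h - (h·w / w·w) w. This is a minimal sketch under our own assumptions: it uses a bias-free logistic-regression probe trained by plain gradient descent rather than a linear SVM, toy three-dimensional data, and a stopping tolerance of one accuracy point on the training data; none of the function names come from the released code.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_probe(H, z, epochs=200, lr=0.5):
    """Fit a bias-free logistic-regression probe by gradient descent; returns w."""
    w = [0.0] * len(H[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for h, y in zip(H, z):
            p = 1.0 / (1.0 + math.exp(-dot(w, h)))
            for j, hj in enumerate(h):
                grad[j] += (p - y) * hj
        w = [wj - lr * gj / len(H) for wj, gj in zip(w, grad)]
    return w

def probe_accuracy(w, H, z):
    return sum((dot(w, h) > 0) == bool(y) for h, y in zip(H, z)) / len(H)

def project_out(H, w):
    """Project each row h onto the nullspace of w: h - (h.w / w.w) w."""
    ww = dot(w, w)
    projected = []
    for h in H:
        c = dot(h, w) / ww
        projected.append([hj - c * wj for hj, wj in zip(h, w)])
    return projected

def inlp(H, z, max_iters=10, tol=0.01):
    """Remove probe directions until a fresh probe is within `tol` of majority accuracy."""
    majority = max(sum(z), len(z) - sum(z)) / len(z)
    directions = []
    for _ in range(max_iters):
        w = train_probe(H, z)
        if dot(w, w) < 1e-12 or probe_accuracy(w, H, z) <= majority + tol:
            break
        directions.append(w)
        H = project_out(H, w)
    return H, directions

# Toy data: the binary property z shifts the first two coordinates.
rng = random.Random(0)
z = [i % 2 for i in range(100)]
H = [[(2 * y - 1) + rng.gauss(0, 0.3), (2 * y - 1) + rng.gauss(0, 0.3), rng.gauss(0, 1)]
     for y in z]
H_clean, directions = inlp(H, z)
print(len(directions), "direction(s) removed")
```

Because every new probe is trained on already-projected data, its weight vector lies in the span of that data and is therefore orthogonal to the earlier directions, so applying the projections one after another amounts to projecting onto the intersection of the nullspaces. For a property with more than two labels, Wi has one row per label and the projection would remove the whole row space of Wi instead of a single direction.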

Amnesic Probing vs. Probing. Note that amnesic probing extends conventional
probing, as it is only relevant in cases where the property of interest can be
predicted from the representation. If a probe gets random accuracy, the
information cannot be used by the model to begin with. As such, amnesic
probing can be seen as a complementary method, which inspects probe accuracy
as a first step, but then proceeds to derive behavioral outcomes from the
directions associated with the probe, with respect to a specific task.

2.3 Controls

The usage of INLP in this setup involves some subtleties we aim to account
for: (1) Any modification to the representation, regardless of whether it
removes information necessary to the task, may cause a decrease in
performance. Can the drop in performance be attributed solely to the
modification of the representation? (2) The removal of any property using INLP
may also cause removal of correlating properties. Does the removed information
only pertain to the property in question?

Control Over Information. In order to control for the information loss of the
representations, we make use of a baseline that removes the same number of
directions as INLP does, but randomly. For every INLP iteration the data
matrix's rank decreases by the number of labels of the inspected property.
This operation removes information from the representation which might be used
for prediction. Using this control, Rand, instead of finding the directions
using a classifier that learned some task, we generate random vectors from a
uniform distribution, which serve as random directions. Then, we construct the
projection matrix as in INLP, by finding the intersection of nullspaces.
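A minimal sketch of the Rand control follows, assuming the random vectors are drawn uniformly from [-1, 1] in each coordinate (the text only says "uniform") and that the intersection of nullspaces is obtained by orthonormalizing the directions with Gram-Schmidt and subtracting the component along each one. The helper names and the toy dimensions are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def random_directions(k, dim, rng):
    """Draw k random direction vectors with coordinates uniform in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(k)]

def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal basis of the span of `vectors`."""
    basis = []
    for v in vectors:
        for q in basis:
            c = dot(v, q)
            v = [vj - c * qj for vj, qj in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > eps:
            basis.append([vj / norm for vj in v])
    return basis

def project_to_nullspace(h, basis):
    """Remove from h its component along every basis vector."""
    for q in basis:
        c = dot(h, q)
        h = [hj - c * qj for hj, qj in zip(h, q)]
    return h

rng = random.Random(0)
dirs = random_directions(k=3, dim=8, rng=rng)   # same count as INLP removed
basis = orthonormal_basis(dirs)
h = [rng.gauss(0, 1) for _ in range(8)]
h_rand = project_to_nullspace(h, basis)
print(len(basis), "random directions removed")
```

In the actual control, k would be set to the number of directions INLP removed for the property under study (N. dir in the tables), so that both interventions reduce the rank by the same amount.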

If the Rand impact on performance is lower than the impact of amnesic probing
for some property, we conclude that we removed important directions for the
main task. Otherwise, when the Rand control has a similar or higher impact, we
conclude that there is no evidence for property usage for the main task.

Control over Selectivity.⁵ The result of amnesic probing is taken as an
indication of whether or not the model we query makes use of the inspected
property for prediction. However, the removed features might solely correlate
with the property (e.g., word position in the sentence has a nonzero
correlation to syntactic function). To what extent is the information removal
process we employ selective to the property in focus?

³Concretely, we use linear SVM (Pedregosa et al., 2011).
⁴All relevant directions are removed to the extent they are identified by the
classifiers we train. Therefore, we run INLP until the last linear classifier
achieves a score within one point above majority accuracy on the development
set.
⁵Not to be confused with the Selectivity of Hewitt and Liang (2019). Although
it is recommended when performing standard probing, we argue it does not fit
as a control for amnesic probing and provide a detailed explanation in
Appendix B.

We test that by explicitly providing the gold information that has been
removed from the


representation, and finetuning the subsequent layers (while the rest of the
network is frozen). Restoring the original performance is taken as evidence
that the property we aimed to remove is enough to account for the damage
sustained by the amnesic intervention (it may still be the case that the
intervention removes unrelated properties; but given the explicitly-provided
property information, the model can make up for the damage). However, if the
original performance is not restored, this indicates that the intervention
removed more information than intended, and this cannot be accounted for by
merely explicitly providing the value of the single property we focused on.

Concretely, we concatenate feature vectors of the studied property to the
amnesic representations. Those vectors are 32-dimensional, and are initialized
randomly, with a unique vector for each value of the property of interest.
Those are fine-tuned until convergence. We note that as the new representation
vectors are of a higher dimension than the original ones, we cannot use the
original matrix. For an easier learning process, we use the original embedding
matrix and concatenate it with a new embedding matrix, randomly initialized,
and treat it as the new decision function.
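The data flow of this control can be sketched as follows. Only the 32-dimensional size of the property vectors comes from the text; the Gaussian initialization, the toy hidden size and vocabulary, and all function names are our assumptions, and the fine-tuning step itself is omitted.

```python
import random

PROPERTY_DIM = 32  # size of the property vectors, as stated in the text

def init_property_table(values, rng, dim=PROPERTY_DIM, scale=0.1):
    """One randomly initialized (later fine-tuned) vector per property value."""
    return {v: [rng.gauss(0.0, scale) for _ in range(dim)] for v in sorted(set(values))}

def with_gold_property(amnesic_vecs, gold_values, table):
    """Concatenate each amnesic vector with the vector of its gold property value."""
    return [vec + table[val] for vec, val in zip(amnesic_vecs, gold_values)]

def extend_decision_matrix(E, rng, dim=PROPERTY_DIM, scale=0.1):
    """Append `dim` randomly initialized columns to every row of the original matrix E."""
    return [row + [rng.gauss(0.0, scale) for _ in range(dim)] for row in E]

rng = random.Random(0)
hidden = 4  # toy size; a real encoder is much wider
amnesic = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(3)]
gold_pos = ["NOUN", "VERB", "NOUN"]

table = init_property_table(gold_pos, rng)
inputs = with_gold_property(amnesic, gold_pos, table)

E = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(5)]  # toy 5-word vocabulary
E_new = extend_decision_matrix(E, rng)
logits = [sum(a * b for a, b in zip(inputs[0], row)) for row in E_new]
print(len(inputs[0]), len(E_new[0]), len(logits))
```

Keeping the original rows of E and only adding new columns means the extended decision function starts from the pre-trained word-prediction weights, which is one way to read the "easier learning process" mentioned above.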

3 Studying BERT: Experimental Setup

3.1 Model

We use our proposed method to investigate BERT (Devlin et al., 2019),⁶ a
popular and competitive masked language model (MLM) that has recently been the
subject of many analysis works (e.g., Hewitt and Manning, 2019; Liu et al.,
2019a; Tenney et al., 2019a). While most probing works focus on the ability to
decode a certain linguistic property of the input text from the
representation, we aim to understand which information is being used by it
when predicting words from context. For example, we seek to answer questions
such as the following: “Is POS information used by the model in word
prediction?” The following experiments focus on language modeling, as a basic
and popular task, but our method is more widely applicable.

3.2 Studied Properties

We focus on six tasks of sequence tagging: coarse and fine-grained
part-of-speech tags (c-pos and f-pos, respectively); syntactic dependency
labels (dep); named-entity labels (ner); and syntactic constituency
boundaries⁷ that mark the beginning and the end of a phrase (phrase start and
phrase end, respectively).

We use the training and dev data of the following datasets for each task:
English UD Treebank (McDonald et al., 2013) for c-pos, f-pos, and dep; and
English OntoNotes (Weischedel et al., 2013) for ner, phrase start, and phrase
end. For training, we use 100,000 random tokens from those datasets.

3.3 Metrics

We report the following metrics:
LM accuracy: Word prediction accuracy.
Kullback-Leibler Divergence (DKL): We calculate the DKL between the
distribution of the model over tokens, before and after the amnesic
intervention. This measure focuses on the entire distribution, rather than the
correct token only. Larger values imply a more significant change.

4 To Probe or Not to Probe?

By using the probing technique, different linguistic phenomena such as POS,
dependency information, and NER (Tenney et al., 2019a; Liu et al., 2019a; Alt
et al., 2020) have been found to be “easily extractable” (typically using
linear probes). A naive interpretation of these results may conclude that
because information can be easily extracted by the probing model, this
information is being used for the predictions. We show that this is not the
case. Some properties such as syntactic structure and POS are very informative
and are being used in practice to predict words. However, we also find some
properties, such as phrase markers, which the model does not make use of when
predicting tokens, in contrast to what one can naively deduce from probing
results. This finding is in line with a recent work that observed the same
behavior (Ravichander et al., 2020).

⁶Specifically, BERT-BASE-UNCASED (Wolf et al., 2019).
⁷Based on the Penn Treebank syntactic definitions.

                             dep   f-pos   c-pos     ner   phrase start   phrase end
Properties  N. dir           738     585     264     133             36           22
            N. classes        41      45      12      19              2            2
            Majority       11.44   13.22   31.76   86.09          59.25        58.51
Probing     Vanilla        76.00   89.50   92.34   93.53          85.12        83.09
LM-Acc      Vanilla        94.12   94.12   94.12   94.00          94.00        94.00
            Rand           12.31   56.47   89.65   92.56          93.75        93.86
            Selectivity    73.78   92.68   97.26   96.06          96.96        96.93
            Amnesic         7.05   12.31   61.92   83.14          94.21        94.32
LM-DKL      Rand            8.11    4.61    0.36    0.08           0.01         0.01
            Amnesic         8.53    7.63    3.21    1.24           0.01         0.01

Table 1: Property statistics, probing accuracies, and the influence of the
amnesic intervention on the model's distribution over words. dep: dependency
edge identity; f-pos and c-pos: fine-grained and coarse POS tags; phrase start
and phrase end: beginning and end of phrases. Rand refers to replacing our
INLP-based projection with removal of an equal number of random directions
from the representation. The number of iterations per task can be inferred
from N. dir / N. classes.

For each linguistic property, we report the probing accuracy using a linear
model, as well as the

word prediction accuracy after removing information about that property. The
results are summarized in Table 1.⁸ Probing achieves substantially higher
performance over majority across all tasks. Moreover, after neutralizing the
studied property from the representation, the performance on that task drops
to majority (not presented in the table for brevity). Next, we compare the LM
performance before and after the projection and observe a major drop for dep
and f-pos information (decrease of 87.0 and 81.8 accuracy points,
respectively), and a moderate drop for c-pos and ner information (decrease of
32.2 and 10.8 accuracy points, respectively). For these tasks, Rand
performance on LM-Acc is lower than the original scores, but substantially
higher than the Amnesic scores. Recall that the Rand experiment is done with
respect to the amnesic probing, thus the number of removed dimensions is the
same, but each task may differ in the number of dimensions removed.
Furthermore, the DKL metric shows the same trend (but in reverse, as a lower
value indicates a smaller change). We also report the selectivity results,
where in most experiments the LM performance is restored, indicating amnesic
probing works as expected. Note that the dep performance is not fully
restored, thus some non-related features must have been coupled and removed
with the dependency features. We believe that this happens in part due to the
large number of removed directions.⁹ These results suggest that to a large
degree, the damage to LM performance is to be attributed to the specific
information we remove, and not to rank-reduction alone. We conclude that
dependency information, POS and NER are important for word prediction.

Interestingly, for phrase start and phrase end we observe a small improvement
in accuracy of 0.21 and 0.32 points, respectively. The performance for the
control on these properties is lower, therefore not only are these properties
not important for the LM prediction at this part of the model, they slightly
harm it. The last observation is rather surprising as phrase boundaries are
coupled to the structure of sentences, and the words that form them. A
potential explanation for this phenomenon is that this information is simply
not being used at this part of the model, and is rather being processed in an
earlier stage. We further inspect this hypothesis in Section 7. Finally, the
probe accuracy does not correlate with task importance as measured by our
method (Spearman correlation of 8.5, with a p-value of 0.871).

⁸Note that because we use two different datasets, the UD Treebank for dep,
f-pos, and c-pos, and OntoNotes for ner, phrase-start, and phrase-end, the
Vanilla LM-Acc performance differs between these setups.
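The rank correlation can be checked against the Table 1 values with a few lines of code. Which two quantities were correlated is not spelled out here, so the pairing below (probing accuracy against amnesic LM accuracy, one value per property) is our assumption. Under that assumption the coefficient is 3/35, roughly 0.086, which would agree with the reported 8.5 if that figure is expressed on a 0 to 100 scale.

```python
def ranks(values):
    """Rank values from 1 to n (this example has no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman's rho via the sum of squared rank differences (valid without ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Values from Table 1, in the order: dep, f-pos, c-pos, ner, phrase start, phrase end.
probing_acc = [76.00, 89.50, 92.34, 93.53, 85.12, 83.09]
amnesic_lm_acc = [7.05, 12.31, 61.92, 83.14, 94.21, 94.32]

rho = spearman(probing_acc, amnesic_lm_acc)
print(round(rho, 4))
```

With only six properties, a coefficient this close to zero is far from significant, in line with the large p-value reported above.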

These results strengthen recent works that question the usefulness of probing
as an analysis tool (Hewitt and Liang, 2019; Ravichander et al., 2020), but
measure it from the usefulness of properties on the main task. We conclude
that high probing performance does not entail this information is being used
at a later part of the network.

⁹Since this experiment involves additional fine-tuning and is not entirely
comparable to the vanilla setup (also due to the additional explicit
information), we also experiment with concatenating the inspected features and
finetuning. This results in an improvement of 3-4 points above the vanilla
experiment.

                             dep   f-pos   c-pos     ner   phrase start   phrase end
Properties  N. dir           820     675     240      95             35           52
            N. classes        41      45      12      19              2            2
            Majority       11.44   13.22   31.76   86.09          59.25        58.51
Probing     Vanilla        71.19   78.32   84.40   90.68          85.53        83.21
LM-Acc      Vanilla        56.98   56.98   56.98   57.71          57.71        57.71
            Rand            4.67   24.69   54.55   56.88          57.46        57.27
            Selectivity    20.46   59.51   66.49   60.35          60.97        60.80
            Amnesic         4.67    6.01   33.28   48.39          56.89        56.19
LM-DKL      Rand            7.77    6.10    0.45    0.10           0.02         0.04
            Amnesic         7.77    7.26    3.36    1.39           0.06         0.13

Table 2: Amnesic probing results for the masked representations. Property
statistics, word-prediction accuracy, and DKL results for the different
properties inspected in this work. We report the vanilla word prediction
accuracy and the Amnesic scores, as well as the Rand and 1-Hot controls, which
show minimal information loss and high selectivity (except for the dep
property, for which all information was removed). The DKL is also reported for
all properties in the last rows, which show similar trends to the accuracy
performance.

5 What Properties are Important for the Pre-Training Objective?

Probing studies tend to focus on representations that are used for an end-task
(usually the last hidden layer before the classification layer). In the case
of MLM models, the words are not masked when encoding them for downstream
tasks.

However, these representations are different from those used during the
pre-training LM phase (of interest to us), where the input words are masked.
It is therefore unclear if the conclusions drawn from conventional probing
also apply to the way that the pre-trained model operates.

From this section on, unless mentioned otherwise, we report our experiments on
the masked words. That is, given a sequence of tokens x1, . . . , xi, . . . ,
xn we encode the representation of each token xi using its context, as
follows: x1, . . . , xi−1, [MASK], xi+1, . . . , xn. The rest of the tokens
remain intact. We feed these input tokens to BERT, and only use the masked
representation of each word in its context
h(x1, . . . , xi−1, [MASK], xi+1, . . . , xn)_i.
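The masking scheme can be made concrete with a small sketch that builds, for every position i, the input in which only token i is replaced by the mask symbol. The function name is hypothetical and the encoder call is left out; in the setup described above, each variant would be fed to BERT and only the vector at position i would be kept.

```python
MASK = "[MASK]"

def masked_variants(tokens):
    """Yield (i, sequence) pairs where only position i is replaced by the mask token."""
    for i in range(len(tokens)):
        yield i, tokens[:i] + [MASK] + tokens[i + 1:]

for i, seq in masked_variants(["the", "dog", "ran"]):
    print(i, seq)
```

A sentence of n tokens therefore yields n separate inputs, one per masked position.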

Figure 2: LM accuracy over INLP predictions, for the masked tokens version. We
present both the vanilla word-prediction score (straight, blue line), as well
as the control (orange, large circles) and INLP (green, small circles). Note
that the number of removed dimensions per iteration differs, based on the
number of classes of that property.

We repeat the experiments from Section 4 and report the results in Table 2. As
expected, the LM accuracy drops significantly, as the model does not have
access to the original word, and has to infer it only based on context.
Overall, the trends in the masked setting are similar to the non-masked
setting. However, this is not always the case, as we show in Section 7. We
also report the selectivity control. Notice that the performance for this
experiment was improved across all tasks. In the case of dep and f-pos, where
we had to neutralize most of the dimensions, the performance does not fully
recover. Note that the number of classes in those experiments might be a
factor in the large performance gaps (expressed by the number of removed
dimensions, N. dir, in the table). While not part of this study, it would be
interesting to control for this factor in future work. However, for the rest
of the properties (c-pos, ner, and the phrase-markers) the performance is
fully recovered, showing our method's selectivity.

c-pos          Vanilla    Rand   Amnesic       Δ
verb             46.72   44.85     34.99   11.73
noun             42.91   38.94     34.26    8.65
adposition       73.80   72.21     37.86   35.93
determiner       82.29   83.53     16.64   65.66
numeral          40.32   40.19     33.41    6.91
punctuation      80.71   81.02     47.03   33.68
particle         96.40   95.71     18.74   77.66
conjunction      78.01   72.94      4.28   73.73
adverb           39.84   34.11     23.71   16.14
pronoun          70.29   61.93     33.23   37.06
adjective        46.41   42.63     34.56   11.85
other            70.59   76.47     52.94   17.65

Table 3: Masked, c-pos removal, fine-grained LM analysis. Removing c-pos
information and testing the accuracy performance of words, accumulated by
their label. Δ is the difference in performance between the Vanilla and
Amnesic scores.

To further study the effect of INLP and inspect how the removal of different
dimensions affects performance, we display in Figure 2 the LM performance
after each iteration, both with the amnesic probing and the control, and
observe a consistent gap between them. Moreover, we highlight the difference
in the slope for our method and the random direction removal. The amnesic
probing exhibits a much steeper slope than the random direction removal,
indicating that the studied properties are indeed correlated with word
prediction. We also provide the main task performance after each iteration in
Figure 5 in the Appendix, which steadily decreases with each iteration.

c-pos          Vanilla   Amnesic      Δ
verb             56.98     55.60   1.38
noun             56.98     55.79   1.19
adposition       56.98     53.40   3.58
determiner       56.98     51.04   5.94
numeral          56.98     55.88   1.10
punctuation      56.98     53.12   3.86
particle         56.98     55.26   1.72
conjunction      56.98     54.29   2.69
adverb           56.98     55.64   1.34
pronoun          56.98     54.97   2.02
adjective        56.98     55.95   1.03

Table 4: Word prediction accuracy after fine-grained tag distinction removal,
masked version. Rand control performances are all between 56.05 and 56.49
accuracy (with a maximum difference from vanilla of 0.92 points).

6 Specific Labels and Word Prediction

In the previous sections we observed the impact (or lack thereof) of different
properties on word prediction. But when a property affects word prediction,
are all words affected similarly? In this section, we inspect a more
fine-grained version of the properties of interest, and study the impact of
those on word predictions.

Fine-Grained Analysis. When we remove the POS information from the
representation, are nouns affected to the same degree as conjunctions? We
repeat the masked experimental setting from Section 5, but this time we
inspect the word prediction performance for the different labels. We report
the results for the c-pos tagging in Table 3. We observe large differences in
the word prediction performance before and after the POS removal between the
labels. Nouns, numbers, and verbs show a relatively small impact in
performance (8.64, 6.91, and 11.73, respectively), while conjunctions,
particles, and determiners demonstrate large performance drops (73.73, 77.66,
and 65.65, respectively). We see that the information about POS labels at the
word-level prediction is much more important in closed-set vocabularies (such
as conjunctions and determiners) than with open vocabularies (such as nouns
and verbs).
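The per-label breakdown amounts to grouping word-prediction accuracy by the property label of each token and taking the difference between the vanilla and amnesic runs. The sketch below illustrates this with made-up predictions; the function name and the toy data are ours and do not reproduce the numbers in Table 3.

```python
from collections import defaultdict

def accuracy_by_label(gold, predicted, labels):
    """Word-prediction accuracy (in %) grouped by the property label of each token."""
    hits, totals = defaultdict(int), defaultdict(int)
    for g, p, lab in zip(gold, predicted, labels):
        totals[lab] += 1
        hits[lab] += int(g == p)
    return {lab: 100.0 * hits[lab] / totals[lab] for lab in totals}

gold = ["and", "dog", "of", "cat", "a", "ran"]
labels = ["conjunction", "noun", "adposition", "noun", "determiner", "verb"]
vanilla = ["and", "dog", "of", "car", "a", "ran"]          # hypothetical predictions
amnesic = ["rotate", "dog", "say", "car", "final", "ran"]  # hypothetical predictions

acc_vanilla = accuracy_by_label(gold, vanilla, labels)
acc_amnesic = accuracy_by_label(gold, amnesic, labels)
delta = {lab: acc_vanilla[lab] - acc_amnesic[lab] for lab in acc_vanilla}
print(delta)
```

In this toy example the function words lose all of their accuracy while nouns and verbs are unaffected, mirroring the direction (though not the magnitude) of the trend reported above.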


A manual inspection of predicted words after removing the POS information
reveals that many of the changes are due to the transformation of function
words into content words. For example, the words ‘and’, ‘of’, and ‘a’ become
‘rotate’, ‘say’, and ‘final’, respectively, in the inspected sentences. For a
more quantitative analysis, we use a POS tagger in order to measure the POS
label confusion before and after the intervention. Out of the 12,700
determiners, conjunctions, and punctuation marks, 200 of the words predicted
by BERT were tagged as nouns and verbs before the intervention, compared to
3,982 after.

Removal of Specific Labels. Following the observation that classes are
affected differently when predicting words, we further investigate the
differences of specific label removal. To this end, we repeat the amnesic
probing experiments, but instead of removing the fine-grained information of a
linguistic property, we make a cruder removal: the distinction between a
specific label and the rest. For example, with POS as the general property, we
now investigate whether the information of noun vs. the rest is important for
predicting a word. We perform this experiment for all of the c-pos labels, and
report the results in Table 4.¹⁰
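The cruder removal only changes the labels handed to the probes: each multi-class tag is collapsed into a binary "this label vs. the rest" target before running the removal procedure. A minimal sketch of that relabeling, with a hypothetical helper name:

```python
def one_vs_rest(labels, target):
    """Collapse multi-class labels into 1 for `target` and 0 for every other label."""
    return [int(lab == target) for lab in labels]

tags = ["noun", "verb", "determiner", "noun", "adposition"]
print(one_vs_rest(tags, "noun"))        # binary target for the noun-vs-rest removal
print(one_vs_rest(tags, "determiner"))  # binary target for the determiner-vs-rest removal
```

Each binary target would then replace the full tag set as the property Z, so only the directions separating that one label from all others are neutralized.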

We observe big performance gaps when removing different labels. For example,
removing the distinctions between nouns and the rest, or verbs and the rest,
has minimal impact on performance. On the other hand, determiners and
punctuation are highly affected. This is consistent with the previous
observation on removing specific information. These results call for more
detailed observations and experiments when studying a phenomenon, as the
fine-grained property distinction does not behave the same across labels.¹¹

7 Behavior Across Layers

The results up to this section treat all of BERT’s ‘Transformer blocks’
(Vaswani et al., 2017) as the encoding function and the embedding matrix
as the model. But what happens when we remove the information of some
linguistic property from earlier layers?

¹⁰ In order to properly compare the different properties, we run INLP for
only 60 iterations for each property. Since the ‘other’ tag is not
common, we omit it from this experiment.
¹¹ We repeat these experiments with the other studied properties and
observe similar trends.

By using INLP to remove a property from an intermediate layer, we prevent
the subsequent layers from using linearly present information originally
stored in that layer. Though this operation does not erase all the
information correlated with the studied property (as INLP only removes
linearly present information), it makes it harder for the model to use
this information. Concretely, we begin by extracting the representation
of some text from the first k layers of BERT and then run INLP on these
representations to remove the property of interest. Given that we wish to
study the effect of a property on layer i, we project the representation
using the corresponding projection matrix P_i that was learned on those
representations, and then continue the encoding of the following layers.¹²
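The intervention at an intermediate layer can be illustrated with a toy encoder. The sketch below is an assumption-laden stand-in, not BERT: the layers are random tanh maps, the projection removes a single arbitrary direction rather than one learned by INLP, and the function names are hypothetical. It only shows the control flow of projecting after layer i and then continuing the encoding.

```python
import numpy as np


def nullspace_projection(w):
    """P = I - w w^T / ||w||^2: removes the component along direction w."""
    w = np.asarray(w, dtype=float)
    return np.eye(w.shape[0]) - np.outer(w, w) / np.dot(w, w)


def encode_with_intervention(x, layers, projections=None):
    """Run x (n_tokens x dim) through `layers` (1-indexed).

    After layer i, apply projections[i] if one is given, then keep encoding.
    """
    projections = projections or {}
    h = x
    for i, layer in enumerate(layers, start=1):
        h = layer(h)
        if i in projections:
            h = h @ projections[i]  # P is symmetric, so row-wise h @ P works
    return h


# Toy 3-layer encoder with random weights (hypothetical stand-in for BERT).
rng = np.random.default_rng(0)
dim = 4
weights = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3)]
layers = [lambda h, W=W: np.tanh(h @ W) for W in weights]

x = rng.normal(size=(5, dim))   # 5 token representations
w = rng.normal(size=dim)        # direction to remove (illustrative only)
P = nullspace_projection(w)

plain = encode_with_intervention(x, layers)
amnesic = encode_with_intervention(x, layers, {2: P})  # remove at layer 2
last = encode_with_intervention(x, layers, {3: P})     # remove at last layer
```

In a real model, the special tokens would presumably be excluded from the projection, as noted in the footnote.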

7.1 Property Recovery After an Amnesic Operation

Is the property we linearly remove from a given layer recoverable by
subsequent layers? We remove the information about some linguistic
property from layer i, and learn a probe classifier on all subsequent
layers i + 1, . . . , n. This tests how much information about this
property the following layers have recovered. We experiment with the
properties that could be removed without reducing too many dimensions:
c-pos, ner, phrase start, and phrase end. These results are summarized in
Figure 3, both for the non-masked version (upper row) and the masked
version (lower row).
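The recovery test can be sketched on a toy encoder as follows. Everything here is an illustrative assumption: the probe is a least-squares linear classifier standing in for a trained linear probe, the encoder is a stack of random tanh layers, the removed direction is arbitrary rather than learned by INLP, and all names are hypothetical.

```python
import numpy as np


def fit_linear_probe(X, y, n_classes):
    """Least-squares linear probe (a simple stand-in for a trained classifier)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # add a bias column
    Y = np.eye(n_classes)[y]                        # one-hot targets
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W


def probe_accuracy(W, X, y):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(np.argmax(Xb @ W, axis=1) == y))


def recovery_after_removal(x_train, y_train, x_test, y_test,
                           layers, i, P, n_classes):
    """Apply projection P after layer i, then probe layers i+1..n."""
    def hidden_states(x):
        states, h = {}, x
        for idx, layer in enumerate(layers, start=1):
            h = layer(h)
            if idx == i:
                h = h @ P
            states[idx] = h
        return states

    train_states = hidden_states(x_train)
    test_states = hidden_states(x_test)
    return {
        j: probe_accuracy(
            fit_linear_probe(train_states[j], y_train, n_classes),
            test_states[j], y_test)
        for j in range(i + 1, len(layers) + 1)
    }


# Toy setup (hypothetical): 4 random tanh layers and a binary property.
rng = np.random.default_rng(0)
dim, n_layers = 6, 4
mats = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(n_layers)]
layers = [lambda h, M=M: np.tanh(h @ M) for M in mats]
x_train = rng.normal(size=(600, dim))
x_test = rng.normal(size=(300, dim))
y_train = (x_train[:, 0] > 0).astype(int)
y_test = (x_test[:, 0] > 0).astype(int)

w = rng.normal(size=dim)                       # illustrative direction
P = np.eye(dim) - np.outer(w, w) / (w @ w)     # rank-one nullspace projection
scores = recovery_after_removal(x_train, y_train, x_test, y_test,
                                layers, 2, P, 2)
```

Repeating this for every removal layer i would fill one row of a grid like the one in Figure 3, with one probing score per subsequent layer j.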

Notably, for the c-pos, non-masked version, the information is highly
recoverable in subsequent layers when applying the amnesic operation on
the first seven layers: the performance drops from the regular probing of
that layer by between 5.72 and 12.69 accuracy points. However, in the
second part of the network, the drop is substantially larger: between
16.57 and 46.39 accuracy points. For the masked version, we witness an
opposite trend: the c-pos information is much less recoverable in the
lower parts of the network than in the upper parts. In particular, the
removal of c-pos from the second layer appears to affect the rest of the
layers, which do not manage to recover a high score on this task, ranging
from 32.7 to 42.1 accuracy.

¹² As the representations used to train INLP do not include BERT’s special
tokens (e.g., ‘CLS’, ‘SEP’), we also do not apply the projection matrix to
those tokens.

Figure 3: Layer-wise removal. Removing from layer i (the rows) and testing
probing performance on layer j (the columns). Top row (3a) is the
non-masked version, bottom row (3b) is the masked version.

For all of the non-masked experiments, the upper layers seem to make it
harder for the subsequent layers to extract the property. In the masked
version, however, there is no consistent trend. It is harder to extract
properties after the lower parts for c-pos and ner. For phrase start, the
upper part makes further extraction harder, and for phrase end, both the
lower and upper parts make it harder, as opposed to the middle layers.
Further research is needed in order to understand the significance of
these findings, and whether or not they are related to information usage
across layers.

This leads us to the final experiment, where we test the main task
performance after an amnesic operation at the intermediate layers.

7.2 Re-rediscovering the NLP Pipeline

In the previous set of experiments, we measured how much of the signal
removed in layer i is recovered in subsequent layers. We now study how
the removal of information in layer i affects the word prediction
accuracy at the final layer, in order to get a complementary measure for
layer importance with respect to a property. The results for the
different properties are presented in Figure 4, where we plot the
difference in word prediction performance between the control and the
amnesic probing when removing a linguistic property from a certain layer.

These results provide a clear interpretation of the internal function of
BERT’s layers. For the masked version (Figure 4), we observe that the
c-pos property is mostly important in layer 3 and its surrounding layers,
as well as layer 12. However, this information is accurately extractable
only towards the last layers. For ner, we observe that the main
performance loss occurs at layer 4. For the phrase markers, the middle
layers are important: layers 5 and 7 for phrase start (although the
absolute performance loss is not big), and layer 6 for phrase end
contributes the most to the word prediction performance.

The story with the non-masked version is quite different (Figure 4).
First, notice that the amnesic operation improves the LM performance for
all properties, in some layers.¹³ Second, the peak of the performance drop
across all properties differs from that of the masked version
experiments. In particular, it seems that for c-pos, when the words are
non-masked in the input, the most important layer is 11 (and not layer 3,
as in the masked version), while this information is easily extractable
(by standard probing) across all layers (above 80% accuracy).

Interestingly, the conclusions we draw on layer-importance from amnesic
probing partly differ from the ones in the “Pipeline processing”
hypothesis (Tenney et al., 2019a), which aims to localize and attribute
information processing of linguistic properties to parts of BERT (for the
non-masked version).¹⁴ On one hand, the ner experiment trends are similar:
the last layers are much more important than earlier ones (in particular,
layer 11 is the most affected in our case, with a decrease of 31.09
accuracy points). On the other hand, in contrast to their hypothesis, we
find that POS information, c-pos (which was considered to be more
important in the earlier layers), affects the word prediction performance
much more in the upper layers (40.99 accuracy loss in the 11th layer).
Finally, we note that our approach performs an ablation of these
properties in the representation space, which reveals which layers are
actually responsible for processing properties, as opposed to Tenney et
al. (2019a), who focused on where this information is easily extractable.

¹³ Giulianelli et al. (2018) observed a similar behavior by performing an
intervention on LSTM activations.

Figure 4: The influence of the different properties, from each layer, on
LM predictions. Top figure (4a) shows the results on the regular,
non-masked version, bottom figure (4b) for the masked version. Colors
allow ease of layer comparison across graphs.

We note the big differences in behavior when analyzing the masked vs. the
non-masked version of BERT, and call for future work to make clearer
distinctions between the two. Finally, we stress that the different
experiments should not be compared between one setting and the other,
hence the different y-scales in the figures. This is due to confounding
variables (e.g., the number of removed dimensions from the
representations), which we do not control for in this work.

8 Related Work

With the established impressive performance of large pre-trained language
models (Devlin et al., 2019; Liu et al., 2019b), based on the Transformer
architecture (Vaswani et al., 2017), a large body of work started
studying and gaining insight into how these models work and what they
encode.¹⁵ For a thorough summary of these advancements we refer the reader
to a recent primer on the subject (Rogers et al., 2020).

¹⁴ We note that this work analyzes BERT-base, in contrast to Tenney et al.
(2019a), who analyzed BERT-Large.

¹⁵ These works cover a wide variety of topics, including grammatical
generalization (Goldberg, 2019; Warstadt et al., 2019), syntax (Tenney et
al., 2019b; Lin et al., 2019; Reif et al., 2019; Hewitt and Manning, 2019;
Liu et al., 2019a), world knowledge (Petroni et al., 2019; Jiang et al.,
2020), reasoning (Talmor et al., 2019), and common sense (Forbes et al.,
2019; Zhou et al., 2019; Weir et al., 2020).


A particularly popular and easy-to-use interpretation method is probing
(Conneau et al., 2018). Despite its popularity, recent works have
questioned the use of probing as an interpretation tool. Hewitt and Liang
(2019) have emphasized the need to distinguish between decoding and
learning the probing tasks. They introduced control tasks, a consistent
but linguistically meaningless attribution of labels to tokens, and have
shown that probes trained on the control tasks often perform well, due to
the strong lexical information held in the representations and learned by
the probe. This leads them to propose a selectivity measure that aims to
choose probes which achieve high accuracy only on linguistically
meaningful tasks. Tamkin et al. (2020) claim that probing cannot serve as
an explanation of downstream task success. They observe that the probing
scores do not correlate with the transfer scores achieved by fine-tuning.
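To make the control-task idea concrete, the sketch below assigns each word type a fixed random label and computes selectivity as the gap between the linguistic-task accuracy and the control-task accuracy. It is a simplification under stated assumptions: labels are drawn uniformly here, the helper names are hypothetical, and the accuracy values passed to `selectivity` are made-up numbers used only for illustration.

```python
import random


def make_control_labels(tokens, n_labels, seed=0):
    """Give every word type one random label, reused for all its occurrences."""
    rng = random.Random(seed)
    type_to_label = {}
    labels = []
    for token in tokens:
        if token not in type_to_label:
            type_to_label[token] = rng.randrange(n_labels)
        labels.append(type_to_label[token])
    return labels


def selectivity(linguistic_accuracy, control_accuracy):
    """Gap between probe accuracy on the real task and on the control task."""
    return linguistic_accuracy - control_accuracy


tokens = ["the", "cat", "the", "dog", "cat"]
control_labels = make_control_labels(tokens, n_labels=5)

# Hypothetical accuracies, for illustration only.
gap = selectivity(0.95, 0.80)
```

Because the control labels depend only on word identity, a probe can solve the control task by memorizing the lexicon, which is why a small gap suggests the probe relies on lexical information.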

Finally, Ravichander et al. (2020) show that probing can achieve
non-trivial results for linguistic properties that were not needed for the
task the model was trained on. In this work, we observe a similar
phenomenon, but from a different angle. We actively remove some property
of interest from the queried representation, and measure the impact of
the resulting amnesic representation on the main task.

Two recent works study the probing paradigm from an information-theoretic
perspective. Pimentel et al. (2020) emphasize that under a
mutual-information maximization objective, “better” probes are
increasingly more accurate, regardless of complexity. They use the
data-processing inequality to question the rationale behind methods that
focus on encoding, and propose ease of extractability as an alternative
criterion. Voita and Titov (2020) follow this direction, using the
concept of minimum description length (MDL; Rissanen, 1978) to quantify
the total information needed to transmit both the probing model and the
labels it predicts. Our discussion here is somewhat orthogonal to those
on the meaning of encoding and probe complexity, as we focus on the
influence of the information on the model’s behavior, rather than on the
ability to extract it from the representation.

Finally, and concurrently with this work, Feder et al. (2020) have studied
a similar question of causal attribution of concepts to representations,
using adversarial training guided by causal graphs.

9 Discussion

Intuitively, we would like to completely neutralize the abstract property
we are interested in, e.g., POS information (completeness), as
represented by the model, while keeping the rest of the representation
intact (selectivity). This is a nontrivial goal, as it is not clear
whether neural models actually have abstract and disentangled
representations of properties such as POS, which are independent of other
properties of the text. It may be the case that the representation of
many properties is intertwined. Indeed, there is an ongoing debate on the
assertion that certain information is “encoded” in the representation
(Voita and Titov, 2020; Pimentel et al., 2020). However, even if a
disentangled representation of the information we focus on exists, it is
not clear how to detect it.

We implement the information removal operation with INLP, which gives a
first-order approximation using linear classifiers; we note, however,
that one can in principle use other approaches to achieve the same goal.
While we show that we do remove the linear ability to predict the
properties and provide some evidence for the selectivity of this method
(§2), one has to bear in mind that we remove only linearly present
information, and that the classifiers can rely on arbitrary features that
happen to correlate with the gold label, be it a result of spurious
correlations or inherent encoding of the direct property. Indeed, we
observe this behavior in Section 7.1 (Figure 3), where we neutralize the
information from certain layers, but occasionally observe higher probing
accuracy in following layers. We thus stress that the information we
remove in practice should be seen only as an approximation for the
abstract information we are interested in, and that one has to be
cautious of causal interpretations of the results. Although in this paper
we use the INLP algorithm in order to remove linear information, amnesic
probing is not restricted to removing linear information. When non-linear
removal methods become available, they can be swapped in for INLP. This
stresses the importance of creating algorithms for non-linear information
removal.
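As a rough illustration of linear removal, the following sketch repeatedly fits a linear classifier, projects the data onto the nullspace of its weight vector, and refits. It is a simplified stand-in rather than the INLP implementation used in this work: it handles only a binary property, uses a least-squares classifier instead of a trained SVM or logistic regression, composes the per-iteration projections by plain multiplication, and runs on synthetic data. All function names are hypothetical.

```python
import numpy as np


def fit_direction(X, y):
    """Least-squares binary classifier; returns (weights, bias)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    targets = 2.0 * y - 1.0                       # map {0, 1} to {-1, +1}
    coef, *_ = np.linalg.lstsq(Xb, targets, rcond=None)
    return coef[:-1], coef[-1]


def linear_probe_accuracy(X, y):
    """Training accuracy of a freshly fitted linear probe on (X, y)."""
    w, b = fit_direction(X, y)
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))


def iterative_nullspace_projection(X, y, n_iterations):
    """Accumulate a projection that removes one classifier direction per step."""
    dim = X.shape[1]
    P = np.eye(dim)
    for _ in range(n_iterations):
        w, _ = fit_direction(X @ P, y)            # classifier on projected data
        norm = np.linalg.norm(w)
        if norm < 1e-10:                          # nothing left to remove
            break
        w = w / norm
        P = (np.eye(dim) - np.outer(w, w)) @ P    # project out this direction
    return P


# Synthetic data (hypothetical): the property is a linear function of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
v = rng.normal(size=8)
y = (X @ v > 0).astype(int)

P = iterative_nullspace_projection(X, y, n_iterations=3)
acc_before = linear_probe_accuracy(X, y)
acc_after = linear_probe_accuracy(X @ P, y)
```

A non-linear probe could still recover the property from `X @ P` in general, which is the limitation the paragraph above points to.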

Another unanswered question is how to quantify the relative importance of
different properties encoded in the representation for the word
prediction task. The different portions erased for different properties
make it hard to draw conclusions on which property is more important for
the task of interest. Although we do not make claims such as “dependency
information is more important than POS”, these are interesting questions
that should be further discussed and researched.

10 Conclusions

In this work, we propose a new method, Amnesic Probing, which aims to
quantify the influence of specific properties on a model that is trained
on a task of interest. We demonstrate that conventional probing falls
short in answering such behavioral questions, and perform a series of
experiments on different linguistic phenomena, quantifying their
influence on the masked language modeling task. Furthermore, we inspect
both unmasked and masked BERT representations and detail the differences
between them, which we find to be substantial. We also highlight the
different influence of specific fine-grained properties (e.g., nouns and
determiners) on the final task. Finally, we use our proposed method on
the different layers of BERT, and study which parts of the model make use
of the different properties. Taken together, we argue that compared with
probing, counterfactual intervention, such as the one we present here,
can provide a richer and more refined view of the way symbolic linguistic
information is encoded and used by neural models with distributed
representations.¹⁶

Acknowledgments

We would like to thank Hila Gonen, Amit Moryossef, Divyansh Kaushik,
Abhilasha Ravichander, Uri Shalit, Felix Kreuk, Jurica Ševa, and Yonatan
Belinkov for their helpful comments and discussions. We also thank the
anonymous reviewers and the action editor, Radu Florian, for their
valuable suggestions.

This project has received funding from the European Research Council
(ERC) under the European Union’s Horizon 2020 research and innovation
programme, grant agreement no. 802774 (iEXTRACT). Yanai Elazar is
grateful to be partially supported by the PBC fellowship for outstanding
PhD candidates in Data Science.

¹⁶ All of the experiments were logged and tracked using Weights and Biases
(Biewald, 2020).


References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg.
2016. Fine-grained analysis of sentence embeddings using auxiliary
prediction tasks. CoRR, abs/1608.04207.

Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. 2020. Probing
linguistic features of sentence-level representations in relation
extraction. In Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, pages 1534–1545, Online. Association for
Computational Linguistics.

Lukas Biewald. 2020. Experiment tracking with weights and biases.
Software available from wandb.com.

Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and
Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing
sentence embeddings for linguistic properties. In Proceedings of the 56th
Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pages 2126–2136. DOI:
https://doi.org/10.18653/v1/P18-1198

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019.
BERT: Pre-training of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7,
2019, Volume 1 (Long and Short Papers), pages 4171–4186.

Yanai Elazar and Yoav Goldberg. 2018. Adversarial removal of demographic
attributes from text data. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing, pages 11–21.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18-1002

Amir Feder, Nadav Oved, Uri Shalit, and Roi Reichart. 2020. CausaLM:
Causal model explanation through counterfactual language models.

Maxwell Forbes, Ari Holtzman, and Yejin Choi. 2019. Do neural language
representations learn physical commonsense? Proceedings of the 41st
Annual Conference of the Cognitive Science Society.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo
Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky.
2016. Domain-adversarial training of neural networks. Journal of Machine
Learning Research, 17(1):2096–2030.

Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and
Willem Zuidema. 2018. Under the hood: Using diagnostic classifiers to
investigate and improve how language models track agreement information.
In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and
Interpreting Neural Networks for NLP, pages 240–248. DOI:
https://doi.org/10.18653/v1/W18-5426

Yoav Goldberg. 2019. Assessing BERT’s syntactic abilities. arXiv preprint
arXiv:1901.05287.

Yash Goyal, Uri Shalit, and Been Kim. 2019. Explaining classifiers with
causal concept effect (CaCE). arXiv preprint arXiv:1907.07165.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with
control tasks. In Empirical Methods in Natural Language Processing
(EMNLP). DOI: https://doi.org/10.18653/v1/D19-1275

John Hewitt and Christopher D. Manning. 2019. A structural probe for
finding syntax in word representations. In Proceedings of the Conference
of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, NAACL-HLT, pages 4129–4138.

Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation
and ‘diagnostic classifiers’ reveal how recurrent and recursive neural
networks process hierarchical structure. Journal of Artificial
Intelligence Research, 61:907–926. DOI:
https://doi.org/10.1613/jair.1.11196

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can
we know what language models know? Transactions of the Association for
Computational Linguistics, 8:423–438. DOI:
https://doi.org/10.1162/tacl_a_00324

Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2020. Learning the
difference that makes a difference with counterfactually-augmented data.
In International Conference on Learning Representations.

Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019. Open sesame: Getting
inside BERT’s linguistic knowledge. In Proceedings of the 2019 ACL
Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP,
pages 241–253.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and
Noah A. Smith. 2019a. Linguistic knowledge and transferability of
contextual representations. In Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers), pages
1073–1094.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b.
RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint
arXiv:1907.11692.

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg,
Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar
Täckström, et al. 2013. Universal dependency annotation for multilingual
parsing. In Proceedings of the 51st Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), pages 92–97.

Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of
Cause and Effect. Basic Books.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel,
Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer,
Ron Weiss, and Vincent Dubourg. 2011. Scikit-learn: Machine learning in
Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher
Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word
representations. In Proceedings of NAACL-HLT, pages 2227–2237. DOI:
https://doi.org/10.18653/v1/N18-1202

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton
Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as
knowledge bases? In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP), pages
2463–2473. DOI: https://doi.org/10.18653/v1/D19-1250

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina
Williams, and Ryan Cotterell. 2020. Information-theoretic probing for
linguistic structure. DOI: https://doi.org/10.18653/v1/2020.acl-main.420

Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav
Goldberg. 2020. Null it out: Guarding protected attributes by iterative
nullspace projection. In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, pages 7237–7256, Online.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/2020.acl-main.647

Abhilasha Ravichander, Yonatan Belinkov, and Eduard Hovy. 2020. Probing
the probing paradigm: Does probing accuracy entail task relevance? arXiv
preprint arXiv:2005.00719.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen,
Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry
of BERT. In Advances in Neural Information Processing Systems, pages
8592–8600.

Jorma Rissanen. 1978. Modeling by shortest data description. Automatica,
14(5):465–471. DOI: https://doi.org/10.1016/0005-1098(78)90005-5

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in
BERTology: What we know about how BERT works. arXiv preprint
arXiv:2002.12327. DOI: https://doi.org/10.1162/tacl_a_00349

Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2019.
oLMpics – on what language model pre-training captures. DOI:
https://doi.org/10.1162/tacl_a_00342

Alex Tamkin, Trisha Singh, Davide Giovanardi, and Noah Goodman. 2020.
Investigating transferability in pretrained language models. arXiv
preprint arXiv:2004.14975. DOI:
https://doi.org/10.18653/v1/2020.findings-emnlp.125

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the
classical NLP pipeline. In Proceedings of the Conference of the
Association for Computational Linguistics, ACL, pages 4593–4601. DOI:
https://doi.org/10.18653/v1/P19-1452

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas
McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and
Ellie Pavlick. 2019b. What do you learn from context? Probing for
sentence structure in contextualized word representations. In
International Conference on Learning Representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is
all you need. In Advances in Neural Information Processing Systems, pages
5998–6008.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel
Nevo, Yaron Singer, and Stuart Shieber. 2020. Causal mediation analysis
for interpreting neural NLP: The case of gender bias. arXiv preprint
arXiv:2004.12265.

Elena Voita and Ivan Titov. 2020. Information-theoretic probing with
minimum description length. arXiv preprint arXiv:2003.12298. DOI:
https://doi.org/10.18653/v1/2020.emnlp-main.14

Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng, Hagen Blix, Yining Nie,
Anna Alsop, Shikha Bordia, Haokun Liu, Alicia Parrish, Sheng-Fu Wang,
Jason Phang, Anhad Mohananey, Phu Mon Htut, Paloma Jeretič, and Samuel R.
Bowman. 2019. Investigating BERT’s knowledge of language: Five analysis
methods with NPIs. In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP), pages
2870–2880. DOI: https://doi.org/10.18653/v1/D19-1286

Nathaniel Weir, Adam Poliak, and Benjamin Van Durme. 2020. Probing neural
language models for human tacit assumptions. In 42nd Annual Virtual
Meeting of the Cognitive Science Society (CogSci).

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer
Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle
Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013.
OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium,
Philadelphia, PA, 23.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement
Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan
Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers:
State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Xuhui Zhou, Yue Zhang, Leyang Cui, and Dandan Huang. 2019. Evaluating
commonsense in pre-trained language models. arXiv preprint
arXiv:1911.11931.

Appendix A

We provide additional experiments that depict the main task (e.g., POS)
performance during the INLP iterations in Figure 5.

Figure 5: LM accuracy over INLP predictions, for the masked tokens
version. We present both the Vanilla word-prediction score (straight,
blue line), as well as Amnesic Probing (orange, small circles), and the
main task performance (red, large circles). For reference we also provide
the vanilla probing performance of each task (green, cross marks). Note
that the number of removed dimensions per iteration differs, based on the
number of classes of that property.

Appendix B Hewitt and Liang’s Control Task

Control Task (Hewitt and Liang, 2019) has been suggested as a way to
attribute the performance of the probe to extraction of encoded
information, as opposed to lexical memorization. Our goal in this work,
however, is not to extract information from the representation (as is
done in conventional probing) but to measure a behavioral outcome. Since
the control task is solved by lexical memorization, applying INLP on the
control task’s classifiers erases lexical information (i.e., erases the
ability to distinguish between arbitrary words), which is at the core of
the LM objective and which is highly correlated with many of the other
linguistic properties, such as POS. We argue that even if we do see a
significant drop in performance with the control task, this says little
about the validity of the results of removal of the linguistic property
(e.g., POS). However, for completeness, we provide the results in Figure
6. As can be seen from this figure, this control’s slope is smaller than
that of the amnesic probing, suggesting that those directions have less
behavioral influence. However, the slopes are steeper than in the ‘Rand’
experiment. This is due to the identity removal of groups of words, due
to the label shuffle, as suggested in their setup. This is the reason we
believe this test is not adequate in our case, and why we provide other
tests to control for our method: Rand and Selectivity (§2.3).

Figure 6: LM accuracy over INLP predictions, for the masked tokens
version. We present both the Vanilla word-prediction score (straight,
blue line), as well as Amnesic Probing (orange, small circles), and the
control performance (orange, large circles). We also provide the Control
results for the selectivity test, proposed by Hewitt and Liang (2019)
(red, crosses). Note that the number of removed dimensions per iteration
differs, based on the number of classes of that property.
