Naturalistic Causal Probing for Morpho-Syntax

Afra Amini1,2  Tiago Pimentel3  Clara Meister1  Ryan Cotterell1,2
1ETH Zürich, Switzerland  2ETH AI Center, Switzerland  3University of Cambridge, United Kingdom

afra.amini@inf.ethz.ch  tp472@cam.ac.uk  clara.meister@inf.ethz.ch  ryan.cotterell@inf.ethz.ch

Abstract

Probing has become a go-to methodology for interpreting and analyzing deep neural models in natural language processing. However, there is still a lack of understanding of the limitations and weaknesses of various types of probes. In this work, we suggest a strategy for input-level intervention on naturalistic sentences. Using our approach, we intervene on the morpho-syntactic features of a sentence, while keeping the rest of the sentence unchanged. Such an intervention allows us to causally probe pre-trained models. We apply our naturalistic causal probing framework to analyze the effects of grammatical gender and number on contextualized representations extracted from three pre-trained models in Spanish: the multilingual versions of BERT, RoBERTa, and GPT-2. Our experiments suggest that naturalistic interventions lead to stable estimates of the causal effects of various linguistic properties. Moreover, our experiments demonstrate the importance of naturalistic causal probing when analyzing pre-trained models.

https://github.com/rycolab/naturalistic-causal-probing

1 Introduction

Contextualized word representations are a byproduct of pre-trained neural language models and have led to improvements in performance on a myriad of downstream natural language processing (NLP) tasks (Joshi et al., 2019; Kondratyuk, 2019; Zellers et al., 2019; Brown et al., 2020). Despite this performance improvement, though, it is still not obvious to researchers how these representations encode linguistic information. A prominent line of work attempts to shed light on this topic through probing (Alain and Bengio, 2017), also referred to as auxiliary prediction (Adi et al., 2017) or diagnostic classification (Hupkes et al., 2018). In machine learning parlance, a probe is a supervised classifier that is trained to predict


a property of interest from the target model's representations. If the probe manages to predict the property with high accuracy, one may conclude that these representations encode information about the probed property.

While widely used, probing is not without its limitations.1 For instance, probing a pre-trained model for grammatical gender can only tell us whether information about gender is present in the representations;2 it cannot, however, tell us how or if the model actually uses information about gender in its predictions (Ravichander et al., 2021; Elazar et al., 2021; Ravfogel et al., 2021; Lasri et al., 2022). Furthermore, supervised probing cannot tell us whether the property under consideration is directly encoded in the representations, or if it can be recovered from the representations alone due to spurious correlations among various linguistic properties. In other words, while we might find correlations between a probed property and the representations through supervised probing techniques, we cannot uncover causal relationships between them.

In this work, we propose a new strategy for input-level intervention on naturalistic data to obtain what we call naturalistic counterfactuals, which we then use to perform causal probing. Through such input-level interventions, we can ascertain whether a particular linguistic property has a causal effect on a model's representations. A number of prior papers have attempted to tease apart causal dependencies using either input-level or representation-level interventions. For example, work on representational counterfactuals has investigated causal dependencies via interventions on neural representations. While quite versatile, representation-level interventions make it hard,
1 See Belinkov (2021) for an overview.
2 See Pimentel et al. (2020b), Hewitt et al. (2021), and Pimentel and Cotterell (2021) for formalizations of this statement under information-theoretic frameworks.


if not impossible, to determine whether we are only intervening on our property of interest. Another proposed method, templated counterfactuals, does perform an input-level intervention strategy, which is guaranteed to only affect the probed property. Under such an approach, the researcher first creates a number of templated sentences (either manually or automatically), which they then fill with a set of minimal-pair words to generate counterfactual examples. However, template-based interventions are limited by design: They do not reflect the diversity of sentences present in natural language and, thus, lead to biased estimates of the measured causal effects. Naturalistic counterfactuals improve upon template-based interventions in that they lead to unbiased estimates of the causal effect.

In our first set of experiments, we employ naturalistic causal probing to estimate the average treatment effect (ATE) of two morpho-syntactic features, namely number and grammatical gender, on a noun's contextualized representation. We show the estimated ATE's stability across corpora. In our second set of experiments, we find that a noun's grammatical gender and its number are encoded by a small number of directions in three pre-trained models' representations: BERT, RoBERTa, and GPT-2.3 We further use naturalistic counterfactuals to causally investigate gender bias in RoBERTa. We find that RoBERTa is much more likely to predict the adjective hermoso(a) (beautiful) for feminine nouns and racional (rational) for masculine ones. This suggests that RoBERTa is indeed gender-biased in its adjective predictions.

Finally, through our naturalistic counterfactuals, we show that correlational probes overestimate the presence of certain linguistic properties. We compare the performance of correlational probes on two versions of our dataset: one unaltered and one augmented with naturalistic counterfactuals. While correlational probes achieve very high (above 90%) performance when predicting gender from sentence-level representations, they perform only close to chance (around 60%) on the augmented data. Together, our results demonstrate the importance of a naturalistic causal approach to probing.

3 We study the Spanish version of each model if it exists, or the multilingual version if there is no Spanish version.

2 Probing

There are several types of probing methods that have been proposed for the analysis of NLP models, and there are many possible taxonomies of those methods. For the purposes of this paper, we divide previously proposed probing models into two groups: correlational and causal probes. On one hand, correlational probes attempt to uncover whether a probed property is present in a model's representations. On the other hand, causal probes, roughly speaking, attempt to uncover how a model encodes and makes use of a specific probed property. We compare and contrast correlational and causal probing techniques in this section.

2.1 Correlational Probing

Correlational probing is any attempt to correlate the input representations with the probed property of interest. Under correlational probing, the performance of a probe is viewed as the degree to which a model encodes information about some probed property in its representations (Alain and Bengio, 2017). At various times, correlational results have been used to claim that language models have knowledge of various morphological, syntactic, and semantic phenomena (Adi et al., 2017; Ettinger et al., 2016; Belinkov et al., 2017; Conneau et al., 2018, inter alia). Yet the validity of these claims has been a subject of debate (Saphra and Lopez, 2019; Hewitt and Liang, 2019; Pimentel et al., 2020a,b; Voita and Titov, 2020).
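To make the setup concrete, the sketch below shows what a correlational probe amounts to in practice. The arrays here are random placeholders standing in for frozen contextual representations and property labels; they are not data or code from the paper.

```python
# A minimal correlational probe: a classifier trained to predict a linguistic
# property (e.g., grammatical gender) from frozen representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
reps = rng.normal(size=(1000, 768))     # stand-in for contextual representations
labels = rng.integers(0, 2, size=1000)  # stand-in for property labels (e.g., MSC/FEM)

X_tr, X_te, y_tr, y_te = train_test_split(reps, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# High test accuracy is read as "the property is decodable from the representations";
# as the text argues, it says nothing about whether the model *uses* the property.
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```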

2.2 Causal Probing

A more recent line of work aims to answer the question: What is the causal relationship between the property of interest and the probed model's representations? In natural language, however, answering this question is not straightforward: Sentences typically contain confounding factors that render analyses tedious. To circumvent this problem, most work in causal probing relies on interventions, that is, the act of setting a variable of interest to a fixed value (Pearl, 2009). Importantly, this must be done without altering any of this variable's causal parents, thereby keeping their probability distributions fixed.4 As a byproduct, these interventions generate counterfactuals,

4 Consider a set of three random variables with a causal structure X → Y → Z (where X causes Y, which causes Z). If we simply conditioned on Y = 1, we would be left with the conditional distribution p(x, z | Y = 1) = p(x | Y = 1) p(z | Y = 1). If we perform an intervention on Y = 1, on the other hand, we are left with the distribution p(x, z | do(Y) = 1) = p(x) p(z | Y = 1); thus, X's distribution is not altered by Y.

namely, examples where a specific property of interest is changed while everything else is held constant. Counterfactuals can then be used to perform a causal analysis. Prior probing papers have proposed methods using both representational and templated counterfactuals, as we discuss next.

Representational Counterfactuals. A few recent causal probing papers perform interventions directly on a model's representations (Giulianelli et al., 2018; Feder et al., 2021; Vig et al., 2020; Tucker et al., 2021; Ravfogel et al., 2021; Lasri et al., 2022; Ravfogel et al., 2022a). For example, Elazar et al. (2021) use iterative null space projection (INLP; Ravfogel et al., 2020) to remove an analyzed property's information, for example, part of speech, from the representations. Although representational interventions can be applied to situations where other forms of intervention are not feasible, it is often impossible to make sure only the information about the probed property is removed or changed.5 In the absence of this guarantee, any causal conclusion should be viewed with caution.

Templated Counterfactuals. Other work (Vig et al., 2020; Finlayson et al., 2021), like us, has leveraged input-level interventions. However, in these cases, the interventions are carried out using templated minimal-pair sentences, which differ only with respect to a single analyzed property. Using these minimal pairs, they estimate the effect of an input-level intervention on individual attention heads and neurons. One benefit of template-based approaches is that they create a highly controlled environment, which guarantees that the intervention is done correctly and which may lead to insights that would be impossible to gain from natural data. However, since the templates are typically designed to analyze a specific property, they cover a narrow set of linguistic phenomena, which may not reflect the complexity of language in naturalistic data.

5 There are, however, methods to mitigate this issue; e.g., Ravfogel et al. (2022b) recently proposed an improved (adversarial) method to remove information from a set of representations that greatly reduces the number of removed dimensions.

Naturalistic Counterfactuals. In this paper, following Zmigrod et al. (2019), we propose a new and less complex strategy to perform input-level interventions by creating naturalistic counterfactuals that are not derived from templates. Instead, we derive the counterfactuals from the dependency structure of the sentence. By creating counterfactuals on the fly using a dependency parse, we avoid the biases of manually creating templates. Furthermore, our approach guarantees that we only intervene on the specific linguistic property of interest, for example, changing the grammatical gender or number of a noun.

3 The Causal Framework

The question of interest in this paper is how contextualized representations are causally affected by a morpho-syntactic feature such as gender or number. To see how our method works, it is easiest to start with an example. Let us consider the following pair of Spanish sentences:

(1) El programador talentoso escribió el código.
the.M.SG programmer.M.SG talented.M.SG wrote the code.
The talented programmer wrote the code.

(2) La programadora talentosa escribió el código.
the.F.SG programmer.F.SG talented.F.SG wrote the code.
The talented programmer wrote the code.

The meaning of these sentences is equivalent up to the gender of the noun programador, whose feminine form is programadora. However, more than just this one word changes from (1) to (2): The definite article el changes to la and the adjective talentoso changes to talentosa. In the terminology of this paper, we will refer to programador as the focus noun, as it is the noun whose grammatical properties we are going to change. We will refer to the changing of (1) to (2) as a syntactic intervention on the focus noun. Informally, a syntactic intervention may be thought of as taking place in two steps. First, we swap the focus noun (programador) with another noun that is equivalent up to a single grammatical property. In this case, we swap programador with programadora, which differs only in its gender marking. Second, we reinflect the sentence so that all necessary words grammatically agree with the new focus noun. The result of a syntactic intervention


is a pair of sentences that differ minimally, that is, only with respect to this one grammatical property (Figure 1). Another way of framing the syntactic intervention is as a counterfactual: What would (1) have looked like if programador had been feminine? The rest of this section focuses on formalizing the notion of a syntactic intervention and discussing how to use such interventions in a causal inference framework for probing.

Figure 1: Intervention on the gender of the lemma programador (masculine → feminine). Changes are propagated from that noun to its dependent words accordingly.

A Note on Inanimate Nouns. When estimating the effect of grammatical gender here, we restrict our investigation to animate nouns, for example, programadora/programador (feminine/masculine programmer). The grammatical gender of inanimate nouns is lexicalized, meaning that each noun is assigned a single gender; for example, puente (bridge) is masculine. In other words, there is not a non-zero probability of assigning each lemma to each gender, which violates a condition called positivity in the causal inference literature. Thus, we cannot perform an intervention on the grammatical gender of those words, but rather would need to perform an intervention on the lemma itself. We refer to Gonen et al. (2019) for an analysis of the effect of gender on inanimate nouns' representations. Note that a similar lexicalization can also be observed in a few animate nouns, for example, madre/padre (mother/father). In such cases, to separate the lemma from gender, we assume that these words share a hypothetical lemma, which in our example represents parenthood, and combining that with gender would give us the specific forms (e.g., madre/padre).

3.1 The Causal Model

We now describe a causal model that will allow us to more formally discuss syntactic interventions.

Notation and Variables. We denote random variables with upper-case letters and instances with lower-case letters. We bold sequences: bold lower-case letters represent a sequence of words and bold upper-case letters represent a sequence of random variables. Let f = ⟨f1, . . . , fT⟩ be a sentence (of length T) where each ft is a word form. In addition, let r be the list of contextual representations r = ⟨r1, . . . , rT⟩, where each rt ∈ R^h and is in one-to-one correspondence with the sentence f; that is, rt is ft's contextual representation. Furthermore, let ℓ = ⟨ℓ1, . . . , ℓT⟩ be a list of lemmata and m̄ = ⟨m̄1, . . . , m̄T⟩ a list of morpho-syntactic features co-indexed with f; ℓt is the lemma of ft and m̄t is its morpho-syntactic features. We call m = ⟨mt1, . . . , mtK⟩ the minimal list of morpho-syntactic features, where each tk is an index between 1 and T. In essence, we drop the features of the tokens whose morphology is dependent on other tokens. In our example (1), this means we only include the morpho-syntactic features of programador and código, thus m = ⟨m2, m6⟩.6 We denote the morpho-syntactic feature of interest as m∗, which, in this work, represents either the gender g∗ or the number n∗ of the focus noun. We further denote the lemma of the focus noun as ℓ.

Causal Assumptions. Our causal model is introduced in Figure 2. It encodes the causal relationships between U, L, M, F, and R. Explicitly, we assume the following causal relationships:

• M and L are causally dependent on U. The underlying meaning that the writer of a sentence wants to convey determines the used lemmata and morpho-syntactic features;

• In general, Lt can causally affect Mt. Take the gender of inanimate nouns as an example, where the lemma determines the gender;

• F is causally dependent on L and M. Word forms are a combination of lemmata and morpho-syntactic features;

• R is causally dependent on F. Contextualized representations are obtained by processing the sentences through the probed model.

6 In this work, we only focus on two morpho-syntactic features: gender and number. To analyze other features, the minimal list should be expanded; e.g., to analyze verb tense, m3 should be added to the list.


Figure 2: Causal graph for the Spanish sentence El programador talentoso escribió el código, before (left) and after (right) an intervention on the grammatical gender of the focus noun.

Dependency Trees. In order to measure the causal effect of the gender of the focus noun (g∗) on the contextualized representation (r), all of its causal dependencies must be considered. As our causal graph shows (in Figure 2), g∗ not only has a causal effect on the focus noun's form, but also on the definite article el and the adjective talentoso. Still, not all word forms in a sentence are affected; for example, the definite article el in the noun phrase el código. Luckily, in a given sentence, such relationships are naturally encoded by that sentence's dependency tree. The dependency graph d of a sentence f is a directed graph created by connecting each word form ft, for 1 ≤ t ≤ T, to its syntactic parent. We use the information encoded in d by leveraging the fact that a word form ft is causally dependent on its syntactic parent. In essence, a dependency tree d implicitly encodes a function dt[m], which returns the subset of morphological properties that causally affect the form ft. Thus, we are able to express the complete joint probability distribution of our causal model as follows:

p(\mathbf{f}, \mathbf{m}, \boldsymbol{\ell}, u) = p(u)\, p(\mathbf{m}, \boldsymbol{\ell} \mid u)\, p(\mathbf{f} \mid \mathbf{m}, \boldsymbol{\ell}) = p(u)\, p(\mathbf{m}, \boldsymbol{\ell} \mid u) \prod_{t=1}^{T} p(f_t \mid d_t[\mathbf{m}], \ell_t)   (1)
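As a concrete instance of Equation (1), consider sentence (1) under the parse assumed in Figure 2: the article's form depends on the focus noun's features, p(f1 = El | d1[m] = {m2}, ℓ1 = el), while the object noun's form depends only on its own features, p(f6 = código | d6[m] = {m6}, ℓ6 = código). The particular index sets shown here are our illustration of the dependency structure, not a definition taken from the paper.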

Abstract Causal Model. We can now simplify the causal model from Figure 2 into Figure 3. For simplicity, we isolate the lemma and morpho-syntactic feature of interest, L∗ and M∗, and aggregate the other lemmata and morpho-syntactic features into an abstract variable, which we call Z and refer to as the context. Furthermore, we only show the aggregation of word forms and representations, as F and R, in the abstract model. We will assume for now, and in most of our experiments, that the output of the causal model (R in Figure 3) represents the contextualized representation of the focus noun. However, as we generalize later, the output of the causal model can be any function of the word forms F, such as the representation of other words in the sentence, the probability distribution assigned by the model to a masked word, or even the output of a downstream task. We note that Figure 3 can easily be re-expanded into Figure 2 for any specific utterance by using its dependency tree.

Figure 3: Causal model showing dependencies between the underlying meaning (U), the lemma (L∗) and morpho-syntactic features (M∗) of the focus noun, the context (Z), sentences (F), and contextualized representations (R).


3.2 Naturalistic Counterfactuals

In the causal inference literature, the do(·) operator represents an intervention on a causal diagram. For example, we might want to intervene on the gender of the focus noun (thus using gender G∗ as the morpho-syntactic feature of interest M∗). Concretely, in our example (Figure 2), do(G∗ = FEM) means intervening on the causal graph by removing all the causal edges going into G∗ from U and L∗ and setting G∗'s value to a specific realization, FEM. The result of this intervention on a sampled sentence f is a new counterfactual sentence f′. As our causal graph suggests, the relationship between words in a sentence is complex, occurring at multiple levels of abstraction; swapping the gender of a single word, while leaving all other words unchanged, may not result in grammatical text. Consequently, one must approach the creation of counterfactuals in natural language with caution. Specifically, we rely on syntactic interventions to generate our naturalistic counterfactuals.

Syntactic Intervention. We develop a heuristic algorithm to perform our interventions, shown in Appendix B. Given a sentence and its dependency tree, the algorithm generates a counterfactual version of the sentence, that is, it approximates the do(·) operation. The algorithm processes the dependency tree of each sentence in a depth-first-search, recursive manner. In each iteration, if the node being processed is a noun, it is marked as the focus noun7 and a new copy of the sentence is created, which will be the base of the counterfactual sentence. Then, the intervention is performed, altering the focus noun and all dependent tokens in the copied sentence.8 Notably, when we syntactically intervene on the grammatical gender or number of a noun, we do not alter potentially incompatible semantic contexts. Take sentence (3) as an example, where the focus noun is mujer and we intervene on gender. Its counterfactual sentence (4) is semantically odd and unlikely, but still meaningful. We can thus estimate the causal effect of grammatical gender on the contextual representations, breaking the correlation between morpho-syntax and semantics.

7 Specifically, for the gender intervention we only mark the noun as the focus if it is an animate noun.
8 This is a simplified version of the algorithm, where we omit the rule-based re-inflection functions for nouns, adjectives, and determiners. We also handle contractions, such as a + el → al, which is not shown in the pseudo-code.

(3) La mujer dio a luz a 6 bebés.
the.F.SG woman.F.SG gave birth to 6 babies.
The woman gave birth to 6 babies.

(4) El hombre dio a luz a 6 bebés.
the.M.SG man.M.SG gave birth to 6 babies.
The man gave birth to 6 babies.
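The core of the intervention can be sketched as follows. The Token class, the REINFLECT lookup table, and the animacy check below are simplified stand-ins for the rule-based machinery of Appendix B, not the paper's actual implementation.

```python
# A simplified sketch of a syntactic gender intervention over a dependency parse.
from dataclasses import dataclass, field

@dataclass
class Token:
    form: str
    pos: str                                     # e.g., "NOUN", "DET", "ADJ"
    feats: dict = field(default_factory=dict)    # e.g., {"Gender": "Masc", "Animacy": "Anim"}
    children: list = field(default_factory=list) # dependents in the parse

# Toy reinflection lookup; a real system would use rule-based re-inflectors.
REINFLECT = {("el", "Fem"): "la",
             ("programador", "Fem"): "programadora",
             ("talentoso", "Fem"): "talentosa"}

def intervene(node: Token, gender: str, in_scope: bool = False) -> None:
    """Depth-first pass: flip the focus noun's gender and propagate the change
    to its agreeing dependents (determiners, adjectives)."""
    if node.pos == "NOUN" and node.feats.get("Animacy") == "Anim":
        in_scope = True  # found an animate focus noun
    if in_scope and node.pos in {"NOUN", "DET", "ADJ"}:
        node.form = REINFLECT.get((node.form, gender), node.form)
        node.feats["Gender"] = gender
    for child in node.children:
        # only agreeing dependents (DET/ADJ) inherit the intervention scope
        intervene(child, gender, in_scope and child.pos in {"DET", "ADJ"})
```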

3.3 Measuring Causal Effects

In this section, we define the causal effect of a morpho-syntactic feature. We then present estimators for these values in the following section. While we focus on grammatical gender here, our derivations are similarly applicable to number and other morpho-syntactic features.

Given a specific focus–context pair (ℓ, z), the causal effect of gender G∗ on the representations is called the individual treatment effect (ITE; Rosenbaum and Rubin, 1983) and is defined as:

\Delta(\ell, z) = \mathbb{E}_F[\mathrm{tgt}(F) \mid G^* = \mathrm{MSC}, L^* = \ell, Z = z] - \mathbb{E}_F[\mathrm{tgt}(F) \mid G^* = \mathrm{FEM}, L^* = \ell, Z = z]   (2)

where tgt(·) is a deterministic function that implements the model being probed, for example, a pre-trained model like BERT, taking a form F as input and outputting R. Since F is itself a deterministic function of a ⟨G∗, L∗, Z⟩ triple, we can rewrite this equation as:9

\Delta(\ell, z) = \mathrm{tgt}(\mathrm{MSC}, \ell, z) - \mathrm{tgt}(\mathrm{FEM}, \ell, z)   (3)

As can be seen from Equation (3), the ITE is the difference in the representation given that the focus noun of the sentence is masculine vs. feminine.

To get a more general understanding of how gender affects these representations, however, it is not enough to just look at individual treatment effects. It is necessary to consider a holistic metric across the entire language. The ATE is one such metric, and it is defined as the difference between the following expectations:

\psi_{\mathrm{ATE}} = \mathbb{E}_F[\mathrm{tgt}(F) \mid \mathrm{do}(G^* = \mathrm{MSC})] - \mathbb{E}_F[\mathrm{tgt}(F) \mid \mathrm{do}(G^* = \mathrm{FEM})]   (4)

9 We overload tgt(·) to receive either F or ⟨G∗, L∗, Z⟩.


In words, the ATE is the expected causal effect of one random variable on another (in this case, gender on the model's representations). Computing this expectation, however, is not as simple as conditioning on gender. As there are backdoor paths10 from the treatment (gender) to the effect (the representations), we rely on the backdoor criterion (Pearl, 2009) to compute this expectation. Simply put, we first need to find a set of variables that block every such backdoor path. We then condition our expectation on them. As shown in Proposition 1 (in the Appendix), the set of variables satisfying the backdoor criterion in our case is {L∗, Z}. Therefore, we can rewrite Equation (4) by conditioning our expectation on {L∗, Z}:

\psi_{\mathrm{ATE}} = \mathbb{E}_{L^*, Z}\bigl[\mathbb{E}_F[\mathrm{tgt}(F) \mid G^* = \mathrm{MSC}, L^*, Z]\bigr] - \mathbb{E}_{L^*, Z}\bigl[\mathbb{E}_F[\mathrm{tgt}(F) \mid G^* = \mathrm{FEM}, L^*, Z]\bigr]   (5)

which we can again rewrite as:

\psi_{\mathrm{ATE}} = \mathbb{E}_{L^*, Z}[\mathrm{tgt}(\mathrm{MSC}, L^*, Z) - \mathrm{tgt}(\mathrm{FEM}, L^*, Z)]   (6)

Furthermore, plugging Equation (3) into Equation (6):

\psi_{\mathrm{ATE}} = \mathbb{E}_{L^*, Z}[\Delta(L^*, Z)]   (7)

reveals that Equation (5) is just the ITE in expectation. Thus, the ATE is an appropriate language-wide measure of the effect of gender on contextual representations.

10 A backdoor path is a causal path from an analyzed variable to its effect which contains an arrow into the treatment (i.e., an arrow going backwards). For example, consider random variables with a causal structure Y → X → Z and Y → Z (where Y causes X, and both X and Y cause Z). X ← Y → Z forms a backdoor path (Definition 3; Pearl, 2009).

4 Approximating the ATE

In this section, we show how to estimate Equation (6) from a finite corpus of sentences S.

4.1 Naïve Estimator

Each sentence in our corpus can be written as a triple ⟨g∗, ℓ, z⟩. We now discuss how to use such a corpus to estimate Equation (6). Specifically, we first compute the sample mean using two subsets of sentences: one with only masculine focus nouns, S_MSC, and the other with only feminine ones, S_FEM. We then compute their difference:

\psi_{\text{naïve}} = \frac{1}{|S_{\mathrm{MSC}}|} \sum_{\langle g, \ell, z\rangle \in S_{\mathrm{MSC}}} \mathrm{tgt}(\mathrm{MSC}, \ell, z) - \frac{1}{|S_{\mathrm{FEM}}|} \sum_{\langle g, \ell, z\rangle \in S_{\mathrm{FEM}}} \mathrm{tgt}(\mathrm{FEM}, \ell, z)   (8)

We note, however, that this is a very naïve estimator.11 Since S_MSC (and, respectively, S_FEM) includes only the fraction of sentences with masculine focus nouns, restricting the sample mean to this set of instances is equivalent to using samples z, ℓ∗ ∼ p(z, ℓ | MSC), rather than z, ℓ∗ ∼ p(z, ℓ) (as should be done for the ATE). Notably, this is equivalent to ignoring the do operator in Equation (4). Consequently, Equation (8) yields a purely correlational baseline. In the following section, we present our (better) causal estimator.

4.2 Paired Estimator

We now use our naturalistic counterfactual sentences to approximate the ATE. Specifically, by relying on our syntactic interventions, we can obtain both a feminine and a masculine form of each sentence (ℓ, z) sampled from the corpus. Concretely, we use the following paired estimator:

\psi_{\text{paired}} = \frac{1}{|S|} \sum_{\langle g, \ell, z\rangle \in S} \bigl[\underbrace{\mathrm{tgt}(\mathrm{MSC}, \ell, z)}_{(1)} - \underbrace{\mathrm{tgt}(\mathrm{FEM}, \ell, z)}_{(2)}\bigr]   (9)

where, depending on g∗, the model's output tgt(·) in (1) and (2) is extracted from a pre-trained model using either the original or the counterfactual sentence.

4.3 A Closer Look at our Estimators

A closer look at our paired estimator in Equation (9) shows that it is an unbiased Monte Carlo estimator of the ATE presented in Equation (6). In short, if we assume our corpus S was sampled from the target distribution, we can use this corpus as samples ℓ, z ∼ p(ℓ, z). For each ⟨ℓ, z⟩ pair, we can then generate sentences with both MSC and FEM grammatical genders to estimate the ATE.

The naïve estimator, on the other hand, will not produce an unbiased estimate of the ATE. As mentioned above, by considering the sentences in S_MSC or S_FEM separately, we implicitly condition on the gender when approximating each expectation. This estimator instead approximates a value we term the average correlational effect (ACE):

\psi_{\mathrm{ACE}} = \mathbb{E}_{L^*, Z \mid G^* = \mathrm{MSC}}[\mathrm{tgt}(\mathrm{MSC}, L^*, Z)] - \mathbb{E}_{L^*, Z \mid G^* = \mathrm{FEM}}[\mathrm{tgt}(\mathrm{FEM}, L^*, Z)]   (10)

On a separate note, template-based approaches allow the researcher to investigate causal effects by using minimal pairs of sentences, each of which can be used to estimate an ITE (as in Equation (3)). And, by averaging them, they provide an estimate of the ATE (as in Equation (7)). However, these minimal pairs are either manually written or automatically collected using template structures. Therefore, they cover a narrow (and potentially biased) set of structures, arguably not following a naturalistic distribution. In other words, their corpus S cannot be assumed to be sampled according to the distribution p(ℓ, z).12 In practice, templated counterfactuals approximate the treatment effect using an approach identical to the paired estimator's, up to a change of distribution. This change of distribution, however, may lead to biased estimates of the ATE.

11 This is referred to as the naïve or unadjusted estimator in the literature (Hernán and Robins, 2020).
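As a minimal sketch of the two estimators, assume a hypothetical function tgt(gender, lemma, context) that returns the probed model's representation as a NumPy vector, and a corpus of (gender, lemma, context) triples; neither name comes from the paper's codebase.

```python
import numpy as np

def ate_paired(corpus, tgt):
    # Equation (9): average, over all sentences, of the difference between the
    # masculine and feminine versions of the *same* (lemma, context) pair.
    diffs = [tgt("MSC", l, z) - tgt("FEM", l, z) for _, l, z in corpus]
    return np.mean(diffs, axis=0)

def ace_naive(corpus, tgt):
    # Equation (8): contrast the masculine and feminine *subsets*, which
    # implicitly conditions on gender and yields a correlational quantity.
    msc = [tgt("MSC", l, z) for g, l, z in corpus if g == "MSC"]
    fem = [tgt("FEM", l, z) for g, l, z in corpus if g == "FEM"]
    return np.mean(msc, axis=0) - np.mean(fem, axis=0)
```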

5 Dataset

We use two Spanish UD treebanks (Nivre et al., 2020) in our experiments: Spanish-GSD (McDonald et al., 2013) and Spanish-AnCora (Taulé et al., 2008). We only analyze gender on animate nouns and use Open Multilingual WordNet (Gonzalez-Agirre et al., 2012) to mark animacy. Corpus statistics for the datasets can be found in Table 1.

5.1 Evaluating Counterfactual Sentences

To evaluate our syntactic intervention algorithm (introduced in §3.2), we randomly sample a subset of 100 sentences from our datasets. These samples are evenly distributed across the two datasets

12 This becomes clear when we take a look at the sentences in one such template-based dataset. For example, all sentences in the Winogender dataset (Rudinger et al., 2018), used by Vig et al. (2020), have very similar sentential structures. Such biases, however, are not necessarily problematic and might be imposed by design to analyze specific phenomena.

Table 1: Aggregated dataset statistics for the Spanish-AnCora and Spanish-GSD treebanks: train/dev/test splits, gender counts (MSC, FEM) over animate focus nouns, and number counts (SING, PLUR).

(AnCora and GSD), morpho-syntactic features (gender and number), and categories within each feature (masculine, feminine, singular, and plural). A native Spanish speaker assessed the grammaticality of the sampled sentences. Our syntactic intervention algorithm was able to accurately generate counterfactuals for 73% of the sentences.13 The accuracies for the gender and number interventions are 76% and 70%, respectively. Due to the subtleties discussed in disentangling syntax from semantics and the complex sentence structures found in naturalistic data, we believe this error is within an acceptable range and leave improvements to future work.

5.2 Template-Based Dataset

To compare our approach to templated counterfactuals, we translate two datasets for measuring gender bias: Winogender (Rudinger et al., 2018) and WinoBias (Zhao et al., 2018). As shown by Stanovsky et al. (2019), simply translating these templates to Spanish leads to biased translations, where professions are translated stereotypically and the context is ignored. Following Stanovsky et al., we thus put either handsome or pretty before nouns to enforce the gender constraint after translation. Consider, for example, the sentence: ''The developer was unable to communicate with the writer because he only understands the code.'' We rewrite it as ''The handsome developer. . .''. Similarly, if the pronoun was she, we would write ''The pretty developer. . .''. As an extra constraint, we want to ensure that the gender of the writer stays the same before and after the intervention. Therefore, we make two copies of the sentence: one where writer is translated as escritora (feminine writer), enforced by replacing writer with pretty writer, and one where writer is translated as escritor (masculine writer), enforced by replacing writer with handsome writer. We translate the resulting pairs of sentences using the Google Translate API and drop the sentences with wrong gender translations. In the end, we obtain 2,740 minimal pairs.

13 Approximating our estimate of this accuracy with a normal distribution, we obtain a 95% confidence interval (Wald interval) ranging from 64% to 82% (Brown et al., 2001).
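The quoted interval follows directly from the normal approximation: with \hat{p} = 0.73 and n = 100, \hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n} = 0.73 \pm 1.96 \times 0.044 \approx [0.64, 0.82].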


Figure 4: Cosine similarities of the ATE on BERT representations. N. represents ψnaïve; P. represents ψpaired; and T. represents ψpaired estimated on the template-based dataset.

6 Insights From ATE Estimators

In the following experiments, we first use the estimators introduced in §4 to approximate the ATE of number and grammatical gender on contextualized representations. We look at how stable these ATE estimates are across datasets, and whether they change across words with different parts of speech. We then analyze whether the ATE (as an expected value) is an accurate description of how representations actually change in individual sentences. Finally, we compute the ATE of gender on the probability of predicting specific adjectives in a sentence, thereby measuring the causal effect of gender in adjective prediction.

6.1 Variations Across ATEs

Variation Across Datasets. Using our ATE estimators, we compute the average treatment effect of both gender and number on BERT's contextualized representations (Devlin et al., 2019) of focus nouns.14 We compute the ψpaired and ψnaïve estimators. Figure 4 presents their cosine similarities. We observe high cosine similarities between paired estimators across datasets,15 but lower cosine similarities with the naïve estimator. This suggests that, while the causal effect is stable across treebanks, the correlational effect is more susceptible to variations in the datasets, for example, semantic variations due to the domain from which the treebanks were sampled.

14 More specifically, BERT-BASE-MULTILINGUAL-CASED in the Transformers library (Wolf et al., 2020).
15 To make sure that the imbalance in the dataset before intervention does not have a significant effect on the results, we create a balanced version of the dataset, where we observe similar results.

Figure 5: Cosine similarity of ATE estimators computed on focus nouns, adjectives, and determiners using BERT representations.

Templated vs. Naturalistic Counterfactuals. As an extra baseline, we estimate the ATE using a paired estimator with the template-based dataset introduced in §5.2. We observe a low cosine similarity between our naturalistic ATE estimates and the template-based ones. This shows that sentences from template-based datasets are substantially different from naturalistic datasets, and thus fail to provide unbiased estimates in naturalistic settings.

Variation Across Part-of-Speech Tags. Using the same approach, we additionally compute the ATEs on adjectives and determiners. Figure 5 presents our naïve and paired ATE estimates, computed on words with different parts of speech. These results suggest that gender and number do not affect the focus noun and its dependent words in the same way. While the ATEs on focus nouns and adjectives are strongly aligned, the cosine similarity between the ATEs on focus nouns and determiners is smaller.16

6.2 Masked Language Modeling Predictions

We now analyze the effect of our morpho-syntactic features on masked language modeling predictions. Specifically, we analyze RoBERTa (Conneau et al., 2020)17 in these experiments, as it has better performance than BERT in masked prediction. We thus look at how grammatical gender and number affect the probability that RoBERTa assigns to each word in its output vocabulary.

16 Relatedly, Lasri et al. (2022) recently showed that BERT encodes number differently on nouns and verbs.
17 More specifically, we use XLM-ROBERTA-BASE.




|        |                   | MProbs(h)   | MProbs(ĥψnaïve) | MProbs(ĥψpaired) |
|--------|-------------------|-------------|-----------------|------------------|
| GENDER | DET: MProbs(h′)   | 4.85 ± 2.39 | 1.09 ± 1.40     | 0.67 ± 1.14      |
|        | ADJ: MProbs(h′)   | 2.29 ± 2.00 | 1.04 ± 1.05     | 0.90 ± 1.12      |
|        | FOCUS: MProbs(h′) | 3.75 ± 2.67 | 1.74 ± 1.11     | 1.53 ± 0.93      |
| NUMBER | DET: MProbs(h′)   | 6.93 ± 2.52 | 1.92 ± 2.87     | 2.05 ± 2.64      |
|        | ADJ: MProbs(h′)   | 5.63 ± 2.75 | 2.25 ± 2.20     | 2.50 ± 2.17      |
|        | FOCUS: MProbs(h′) | 5.50 ± 3.02 | 2.25 ± 2.14     | 2.41 ± 1.90      |

Table 2: Mean and standard deviation of the Jensen–Shannon divergence between the masked probability distributions of focus nouns, determiners, and adjectives over the corpus.


We start by masking a word in our sentence: either the focus noun, a dependent determiner, or an adjective. We then obtain this word's contextual representation h. Second, we apply a syntactic intervention to this sentence and, following similar steps, obtain another representation h′. Third, we use these representations to obtain the probabilities RoBERTa assigns to the words in its vocabulary, MProbs(h) and MProbs(h′). Finally, we obtain these same probability assignments, but using the ATE to estimate the counterfactual representations:

\mathrm{MProbs}(\hat{h}_{\psi_{\text{paired}}}), \quad \hat{h}_{\psi_{\text{paired}}} = h \pm \psi_{\text{paired}}   (11)

\mathrm{MProbs}(\hat{h}_{\psi_{\text{naïve}}}), \quad \hat{h}_{\psi_{\text{naïve}}} = h \pm \psi_{\text{naïve}}   (12)
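A sketch of this comparison is given below, assuming a hypothetical helper mprobs(·) that maps a representation to the masked-LM probability vector over the vocabulary; h, h_cf, and psi are placeholder names for the original representation, the counterfactual representation, and an ATE estimate.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def divergences(h, h_cf, psi, mprobs):
    # Approximate the counterfactual representation via the ATE; the sign of
    # psi follows the direction of the intervention (Equations (11)-(12)).
    h_hat = h - psi
    p_cf = mprobs(h_cf)                               # true counterfactual distribution
    jsd_orig = jensenshannon(p_cf, mprobs(h)) ** 2    # vs. original representation
    jsd_ate = jensenshannon(p_cf, mprobs(h_hat)) ** 2 # vs. ATE-based estimate
    return jsd_orig, jsd_ate  # per Table 2, the second value should be much smaller
```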

We now look at how probability assignments change as a function of our interventions. Specifically, Table 2 shows Jensen–Shannon divergences between MProbs(·) computed on top of different representations. We can make a number of observations based on this table. First, for gender, these distributions change more when predicting determiners and focus nouns than adjectives. We speculate that this may be because many Spanish adjectives are syncretic, that is, they have the same inflected form for masculine and feminine (e.g., inteligente [intelligent], or profesional [professional]). Second, the distributions change more after an intervention on number than on gender. Third, when we use either of our estimators to approximate the counterfactual representation, the divergences are greatly reduced. These results show that the ATE values do describe (at least to some extent) the change of representations in individual sentences.

6.3 Gender Bias in Adjectives

As shown by Bartl et al. (2020) and Gonen et al. (2022), the results of studies on gender bias in English are not completely transferable to gender-marking languages. We analyze the causal effect of gender on specific masked adjective probabilities, as predicted by the RoBERTa model. To this end, we manually create a list of 30 adjectives (the complete list is in Appendix A) in both masculine and feminine forms. We sample a sentence f from a subset of the dataset in which the focus noun has one dependent adjective a, and mask this adjective. We then define a new function, tgt(·), to measure the ATE on adjective probabilities. Specifically, we write:

\mathrm{tgt}_a(F) = \ln p_\theta(a \mid F) = \ln p_\theta(a \mid g^*, \ell, z)   (13)

where a represents an adjective in our list (that also exists in RoBERTa's vocabulary V) and p_θ(a | F) is the probability RoBERTa assigns to that adjective.18 We plug this new function into our paired ATE estimator in Equation (9). As this prediction is somewhat susceptible to noise, we replace the mean in Equation (9) with the median. Specifically, this is equivalent to computing:

\psi^{(a)}_{\text{paired}} = \operatorname*{median}_{\langle g^*, \ell, z\rangle \in S} \left[ \ln \frac{p_\theta(a \mid \mathrm{MSC}, \ell, z)}{p_\theta(a \mid \mathrm{FEM}, \ell, z)} \right]   (14)

In this equation, if ψ(a)paired > 0, the predicted probability that the adjective appears in a sentence where it is dependent on a masculine focus noun will typically be higher than in a sentence with a feminine focus noun, whereas if ψ(a)paired < 0 the reverse holds. Therefore, we say a is biased towards the masculine gender if ψ(a)paired > 0 and biased towards the feminine gender if ψ(a)paired < 0.

18 When an adjective in the list has two forms depending on the gender (e.g., hermosa/hermoso), we sum the probabilities of the masculine and feminine forms.
As shown in Figure 6, rich (rica/rico) and rational (racional) are more biased towards the masculine gender, while beautiful (hermosa/hermoso) is biased towards the feminine gender.

Figure 6: ψ(a)paired values computed using Equation (14) to measure causal gender bias in masked adjective prediction.

7 Insights From Naturalistic Counterfactuals

In the following experiments, we rely on a dataset augmented with naturalistic counterfactuals. We first explore the geometry of the encoded morpho-syntactic features. We then run a more classic correlational probing experiment, highlighting the importance of a causal framework when analyzing representations.

7.1 Geometry of Morpho-Syntactic Features

In this experiment, we follow Bolukbasi et al.'s (2016) methodology to isolate the subspace capturing our morpho-syntactic features' information. First, we create a matrix with the representations of all focus nouns in our counterfactually augmented dataset. Second, we pair each noun's representation with its counterfactual representation (after the intervention). Third, we center the matrix of representations by subtracting each pair's mean. Finally, we perform principal component analysis on this new matrix.

As Figure 7 shows, in BERT and RoBERTa, the first principal component explains close to 20% of the variance caused by gender and number. In GPT-2 (Radford et al., 2019),19 more than half of the variance is captured by the first or the first two principal components.20 This result is in line with prior work (e.g., Biasion et al., 2020, on Italian word embeddings), and suggests that these morpho-syntactic features are linearly encoded in the representations.

To further explore the gender and number subspaces, we project a random sample of 20 sentences (along with their counterfactuals) onto the first principal component. Figure 7 (bottom) shows that the three models we probe can (at least to a large extent) differentiate both morpho-syntactic features using a single dimension. Notably, this first principal component is strongly aligned with the estimate ψpaired; they have a cosine similarity of roughly 0.99 in all these settings.

Figure 7: (top) Percentage of the gender and number variance explained by the first 10 PCA components. (bottom) The projection of 20 pairs of focus nouns' representations onto the first principal component.

19 More specifically, we use GPT2-SMALL-SPANISH.
20 These results are not due to the randomness of a finite sample of high-dimensional vectors, nor to the structure of the model. To show this, we present two random baselines in Figure 7: random vectors of the same size |S| (as green traces) and representations extracted from models with randomized weights (as gray traces).
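The pair-centered PCA can be sketched as follows; originals and counterfactuals are hypothetical aligned (N, h) arrays of focus-noun representations before and after the intervention, not variables from the paper's code.

```python
import numpy as np
from sklearn.decomposition import PCA

def paired_pca(originals, counterfactuals, n_components=10):
    # Center each (original, counterfactual) pair on its own mean, so that the
    # remaining variance is the variance induced by the intervention.
    pair_means = (originals + counterfactuals) / 2.0
    centered = np.concatenate([originals - pair_means,
                               counterfactuals - pair_means])
    pca = PCA(n_components=n_components).fit(centered)
    # explained_variance_ratio_ corresponds to Figure 7 (top); the leading
    # component can then be compared against the paired ATE estimate.
    return pca.explained_variance_ratio_, pca.components_[0]
```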
7.2 Analysis of Correlational Probing

We now use a dataset augmented with naturalistic counterfactuals to empirically evaluate the entanglement of correlation and causation discussed in §2, which arises when using diagnostic probes on the representations. Again, we probe three contextual representations: BERT, RoBERTa, and GPT-2. We train logistic regressors (LogRegProbe) and support vector machines (SVMProbe) to predict either the gender or the number of the focus noun from its contextual representation. Further, we probe the representations in two positions: the focus noun and the [CLS] token (or a sentence's last token, for GPT-2).21

The accuracy of the correlational probes on the original dataset is shown in Figure 8 as green points. Both gender and number probes reach near-perfect accuracy on focus nouns' representations. Furthermore, all correlational gender probes reach a high accuracy on [CLS] representations, suggesting that gender can be reliably recovered from them.

Next, we evaluate the trained probes on counterfactually augmented test sets (shown as yellow points in Figure 8). We see that there is a drop in performance in all settings, and, more specifically, the accuracy of probes on [CLS] representations drops significantly when evaluated on the counterfactual test set. This suggests that the previous results using correlational probes overestimate the extent to which gender and number can be predicted from the representations.

Finally, we also train supervised probes on a counterfactually augmented dataset in order to study whether we can achieve the levels of performance attested in the literature (shown as gray points in Figure 8). Since these probes are trained on a dataset augmented with counterfactuals, they are not as susceptible to spurious correlations; we thus call them the causal probes. Although there is a considerable improvement in accuracy, there is still a large gap between the correlational and causal probes' accuracies. Together, these results imply that correlational probes are sensitive to spurious correlations in the data (such as the semantic context in which nouns appear), and do not learn to predict grammatical gender robustly.

Figure 8: Accuracy scores of gender and number probes on the original and augmented datasets.

21 BERT and RoBERTa treat [CLS] as a special token whose representation is supposed to aggregate information from the whole input sentence. In GPT-2, the last token in a sentence should also contain information about all its previous tokens.
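A sketch of this evaluation protocol is shown below; the arrays of representations and labels are hypothetical placeholders.

```python
from sklearn.linear_model import LogisticRegression

def probe_accuracies(X_tr, y_tr, X_te, y_te, X_te_aug, y_te_aug):
    """Train a correlational probe on unaltered data and compare its accuracy
    on the original vs. the counterfactually augmented test set."""
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return (probe.score(X_te, y_te),         # high on the original test set
            probe.score(X_te_aug, y_te_aug)) # drops on counterfactual data

def causal_probe(X_tr_aug, y_tr_aug):
    """'Causal' probe: retrained with counterfactuals in the training set,
    which reduces its reliance on spurious correlations."""
    return LogisticRegression(max_iter=1000).fit(X_tr_aug, y_tr_aug)
```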
8 Conclusion

We propose a heuristic algorithm for syntactic intervention which, when applied to naturalistic data, allows us to create naturalistic counterfactuals. Although similar analyses have been run by prior work, using either templated or representational counterfactuals (Elazar et al., 2021; Vig et al., 2020; Bolukbasi et al., 2016, inter alia), our syntactic intervention approach allows us to run these analyses on naturalistic data. We further discuss how to use these counterfactuals in a causal setting to probe for morpho-syntax. Experimentally, we first showed that ATE estimates are more robust to dataset differences than either our naïve (correlational) estimator or template-based approaches. Second, we showed that the ATE can (at least partially) predict how representations will be affected after an intervention on gender or number. Third, we employed our ATE framework to study gender bias, finding a list of adjectives that are biased towards one or the other gender. Fourth, we found that the variation of gender and number can be captured by a few principal axes in the nouns' representations. And, finally, we highlighted the importance of causal analyses when probing: When evaluated on counterfactually augmented data, correlational probe results drop significantly.

Ethical Concerns

Pretrained models often encode gender bias. The adjective bias experiments in this work can provide further insights into the extent to which these biases are encoded in multilingual pretrained models. As our paper focuses on (grammatical) gender as a morpho-syntactic feature, it focuses on a binary notion of gender, which is not representative of the full spectrum of human gender expression. Most of the analysis in this paper focuses on measuring grammatical gender, not gender bias. We thus advise caution when interpreting the findings from this work. Nonetheless, we hope the causal structure formalized here, together with our analyses, can be of use to bias mitigation techniques in the future (e.g., Liang et al., 2020).

A List of Adjectives

We use 30 different Spanish adjectives in our experiments: hermoso/hermosa (beautiful), sexy (sexy), molesto/molesta (upset), bonito/bonita (pretty), delicado/delicada (delicate), rápido/rápida (fast), inteligente (intelligent), fuerte (strong), joven (young), divertido/divertida (funny), duro/dura (hard), alegre (cheerful), protegido/protegida (protected), excelente (excellent), nuevo/nueva (new), serio/seria (serious), sensible (sensitive), profesional (professional), emocional (emotional), independiente (independent), fantástico/fantástica (fantastic), brutal (brutal), malo/mala (bad), bueno/buena (good), horrible (horrible), triste (sad), amable (nice), rico/rica (rich), tranquilo/tranquila (quiet), racional (rational).
B Algorithm for Heuristic Intervention

Algorithm 1
1: procedure REINFLECTTREE(node, parent, state)
2:     isFocusNoun ← false
3:     if state == NORMAL and node is a valid noun :
4:         if node is a determiner :
5:             REINFLECTDET(node)  ▷ Change determiner
6:         REINFLECTNOUN(node)  ▷ Change the noun and set the morpho-syntactic feature to the desired value
7:         isFocusNoun ← true
8:         if node is subject :
9:             REINFLECTVERB(parent)  ▷ Change verb
10:    if state == DIR :  ▷ Current node is a direct dependent of a focus noun
11:        if node is an adjective modifier :
12:            REINFLECTADJ(node)  ▷ Change adjective
13:        if node is a nominal subject :
14:            REINFLECTNOUN(node)  ▷ Change noun
15:            nsubj ← true
16:        if node is a copula :
17:            REINFLECTCOP(node)  ▷ Change copula
18:    if state == INDIR and node is an adjective modifier and parent is an adjective modifier :  ▷ Current node is a descendant of a focus noun
19:        REINFLECTADJ(node)
20:    for child ∈ children(node) :
21:        if isFocusNoun or nsubj :
22:            REINFLECTTREE(child, node, DIR)
23:        else if state == DIR or state == INDIR :
24:            REINFLECTTREE(child, node, INDIR)
25:        else
26:            REINFLECTTREE(child, node, NORMAL)

C Theory

Proposition 1. In this proposition we show that the average treatment effect is equivalent to the difference of two expectations with no do-operator:

\mathbb{E}_F[\mathrm{tgt}(F) \mid \mathrm{do}(G^* = \mathrm{MSC})] - \mathbb{E}_F[\mathrm{tgt}(F) \mid \mathrm{do}(G^* = \mathrm{FEM})] = \mathbb{E}_{L^*, Z}\bigl[\mathbb{E}_F[\mathrm{tgt}(F) \mid G^* = \mathrm{MSC}, L^*, Z]\bigr] - \mathbb{E}_{L^*, Z}\bigl[\mathbb{E}_F[\mathrm{tgt}(F) \mid G^* = \mathrm{FEM}, L^*, Z]\bigr]   (15)

Proof. First, we note the existence of two backdoor paths in our model (Figure 3): M∗ ← U → Z → F → R and M∗ ← U → L∗ → F → R. We can easily check that Z blocks the first path and L∗ blocks the second, and neither Z nor L∗ is a descendant of M∗. Therefore, {L∗, Z} satisfies the backdoor criterion. To simplify the proof, we show that the first term on the left-hand side of Equation (15) equals the first term on the right-hand side; we then obtain the full result by symmetry. We proceed as follows:

\mathbb{E}_F[\mathrm{tgt}(F) \mid \mathrm{do}(G^* = \mathrm{MSC})]
= \sum_{\ell^* \in L} \sum_{z \in Z} \mathbb{E}_F[\mathrm{tgt}(F) \mid \mathrm{do}(G^* = \mathrm{MSC}), \ell^*, z]\, p(\ell^*, z)   (marginalize ℓ∗ and z)
= \sum_{\ell^* \in L} \sum_{z \in Z} \mathbb{E}_F[\mathrm{tgt}(F) \mid G^* = \mathrm{MSC}, \ell^*, z]\, p(\ell^*, z)   (backdoor criterion)
= \mathbb{E}_{L^*, Z}\bigl[\mathbb{E}_F[\mathrm{tgt}(F) \mid G^* = \mathrm{MSC}, L^*, Z]\bigr]   (rewrite as an expectation)   (16)
Ryan Cotterell acknowledges support from the SNSF through the ‘‘The For- gotten Role of Inductive Bias in Interpretability’’ project. References and Yoav Goldberg. Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In 5th Inter- national Conference on Learning Represen- tations (Conference Track). https://doi .org/10.1147/JRD.2017.2702858 Guillaume Alain and Yoshua Bengio. 2017. Un- derstanding intermediate layers using linear classifier probes. In 5th International Confer- ence on Learning Representations (Workshop Track). Marion Bartl, Malvina Nissim, and Albert Gatt. 2020. Unmasking contextual stereotypes: Mea- suring and mitigating BERT’s gender bias. In Proceedings of the Second Workshop on Gender Bias in Natural Language Process- ing, pages 1–16, Barcelona, Spain (Online). Association for Computational Linguistics. Yonatan Belinkov. 2021. Probing classifiers: Promises, shortcomings, and alternatives. arXiv preprint arXiv:2102.12452. Yonatan Belinkov, Llu´ıs M`arquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2017. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Taipei, Taiwan. Asian Federation of Natural Language Processing. Davide Biasion, Alessandro Fabris, Gianmaria Silvello, and Gian Antonio Susto. 2020. Gen- In der bias in italian word embeddings. Proceedings of the Seventh Italian Confer- ence on Computational Linguistics CLiC-it 2020: Bologna. https://doi.org/10 .4000/books.aaccademia.8280 Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, volume 29. Lawrence D. Brown, T. Tony Cai, and Anirban DasGupta. 2001. Interval estimation for a binomial proportion. Statistical Science, 16(2):101–133. https://doi.org/10.1214 /ss/1009213286 Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm´an, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representa- tion learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguis- tics. https://doi.org/10.18653/v1 /P18-1198 Alexis Conneau, German Kruszewski, Guillaume Lample, Lo¨ıc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for lin- guistic properties. In Proceedings of the 56th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia. 
Acknowledgments

We would like to thank Shauli Ravfogel for feedback on a preliminary draft and Damián Blasi for analyzing the errors made by our naturalistic counterfactual algorithm. We would also like to thank the action editor and the anonymous reviewers for their insightful feedback during the review process. Afra Amini is supported by an ETH AI Center doctoral fellowship. Ryan Cotterell acknowledges support from the SNSF through the project "The Forgotten Role of Inductive Bias in Interpretability."