Erasure of Unaligned Attributes from Neural Representations
Shun Shao∗ Yftah Ziser∗ Shay B. Cohen
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh, EH8 9AB, UK
s.shao-11@inf.ed.ac.uk
yftah.ziser@inf.ed.ac.uk
scohen@inf.ed.ac.uk
Abstract
We present the Assignment-Maximization Spectral Attribute removaL (AMSAL) algorithm, which erases information from neural representations when the information to be erased is implicit rather than directly aligned to each input example. Our algorithm works by alternating between two steps. In one, it finds an assignment of the input representations to the information to be erased, and in the other, it creates projections of both the input representations and the information to be erased into a joint latent space. We test our algorithm on an extensive array of datasets, including a Twitter dataset with multiple guarded attributes, the BiasBios dataset, and the BiasBench benchmark. The latter benchmark includes four datasets with various types of protected attributes. Our results demonstrate that bias can often be removed in our setup. We also discuss the limitations of our approach when there is a strong entanglement between the main task and the information to be erased.1
1 Introduction
Developing a methodology for adjusting neural representations to preserve user privacy and avoid encoding bias in them has been an active area of research in recent years. Previous work shows it is possible to erase undesired information from representations so that downstream classifiers cannot use that information in their decision-making process. This previous work assumes that this sensitive information (or guarded attributes, such as gender or race) is available for each input instance. These guarded attributes, however, are sensitive, and obtaining them on a large scale is often challenging and, in some cases, not feasible (Han et al., 2021b). For example, Blodgett et al. (2016)
∗Equal contribution.
1Our code is available at https://github.com/jasonshaoshun/AMSAL.
studied the characteristics of African-American
English on Twitter, and could not couple the
ethnicity attribute directly with the tweets they
collected due to the attribute’s sensitivity.
This paper introduces a novel debiasing setting
in which the guarded attributes are not paired
up with each input instance and an algorithm to
remove information from representations in that
setting. In our setting, we assume that each neural
input representation is coupled with a guarded at-
tribute value, but this assignment is unavailable.
In cases where the domain of the guarded attribute is small (for example, with binary attributes), this means that the guarded attribute information consists of priors with respect to the whole population and not instance-level information.
The intuition behind our algorithm is that if we were to find a strong correlation between the input variable and a set of guarded attributes, either in the form of an unordered list of records or as priors, then it is unlikely to be coincidental if the sample size is sufficiently large (§3.5). We implement this intuition by jointly finding projections of the input samples and the guarded attributes into a joint embedding space and an alignment between the two sets in that joint space.
Our resulting algorithm (§3), the Assignment-Maximization Spectral Attribute removaL algorithm (AMSAL), is a coordinate-ascent algorithm reminiscent of the hard expectation-maximization algorithm (hard EM; MacKay, 2003). It first loops between two steps, Assignment and Maximization, during which it finds an assignment (A) based on existing projections and then projects the representations and guarded attributes into a joint space based on an existing alignment (M). After these two steps are iteratively repeated and an alignment is identified, the algorithm takes another step to erase information from the input representations based on the projections identified. This step closely follows the work of Shao et al.
preserving the information needed for the main
task decision-making. We also study the limita-
tions of our algorithm by experimenting with a
setup where it is hard to distinguish between the
guarded attributes and the downstream task labels
when aligning the neural representations with the
guarded attributes (§4.5).
2 Problem Formulation and Notation
For an integer n we denote by [n] the set {1, . . . , n}. For a vector v, we denote by ‖v‖2 its ℓ2 norm. For two vectors v and u, by default in column form, ⟨v, u⟩ = v⊤u (dot product). Matrices and vectors are in boldface font (with uppercase or lowercase letters, respectively). Random variable vectors are also denoted by boldface uppercase letters. For a matrix A, we denote by aij the value of cell (i, j). The Frobenius norm of a matrix A is ‖A‖F = (Σ_{i,j} aij²)^{1/2}. The spectral norm of a matrix is ‖A‖2 = max_{‖x‖2=1} ‖Ax‖2. The expectation of a random variable T is denoted by E[T].
In our problem formulation, we assume three random variables: X ∈ R^d, Y ∈ R, and Z ∈ R^{d′}, such that d′ ≤ d and the expectation of all three variables is 0 (see Shao et al., 2023). Samples of X are the inputs for a classifier to predict corresponding samples of Y. The random vector Z represents the guarded attributes. We want to maintain the ability to predict Y from X, while minimizing the ability to predict Z from X.
We assume n samples of (X, Y) and m samples of Z, denoted by (x(i), y(i)) for i ∈ [n], and z(i) for i ∈ [m] (m ≤ n). While originally these samples were generated jointly from the underlying distribution p(X, Y, Z), we assume a shuffling of the Z samples in such a way that we are only left with m samples that are unique (no repetition) and an underlying unknown many-to-one mapping π : [n] → [m] that maps each x(i) to its original z(j).
The problem formulation is such that we need
to remove the information from the xs in such a
way that we consider the samples of zs as a set.
In our case, we do so by iterating between trying
to infer π, and then using standard techniques to
remove the information from xs based on their
alignment to the corresponding zs.
Singular Value Decomposition Let A = E[XZ⊤], the matrix of cross-covariance between
Figure 1: A depiction of the problem setting and solution. The inputs are aligned to the guarded samples based on alignment strength under two projections U and V. We solve a bipartite matching problem to find the blue edges, and then recalculate U and V.
(2023), who use Singular Value Decomposition to
remove principal directions of the covariance ma-
trix between the input examples and the guarded
attributes. Chiffre 1 depicts a sketch of our set-
ting and the corresponding algorithm, with xi
being the input representations and zj being the
guarded attributes. Our algorithm is modular:
While our use of the algorithm of Shao et al.
(2023) for the removal step is natural due to the
nature of the AM steps, a user can use any such
algorithm to erase the information from the input
representations (§3.4).
Our contributions are as follows: (1) We propose a new setup for removing guarded information from neural representations where there are few or no labeled guarded attributes; (2) We present a novel two-stage coordinate-ascent algorithm that iteratively improves (a) an alignment between guarded attributes and neural representations; and (b) information removal projections.
Using an array of datasets, we perform exten-
sive experiments to assess how challenging our
setup is and whether our algorithm is able to re-
move information without having aligned guarded
attributes (§4). We find in several cases that lit-
tle information is needed to align between neural
representations and their corresponding guarded
attributes. The consequence is that it is possible
to erase the information such guarded attributes
provide from the neural representations while
X and Z. This means that Aij = Cov(Xi, Zj) for i ∈ [d] and j ∈ [d′].
For any two vectors a ∈ R^d, b ∈ R^{d′}, the following holds due to the linearity of expectation:

a⊤Ab = Cov(a⊤X, b⊤Z).   (1)

Singular value decomposition on A, in this case, finds the ''principal directions'': directions in which the projections of X and Z maximize their covariance. The projections are represented as two matrices U ∈ R^{d×d} and V ∈ R^{d′×d′}. Each column in these matrices plays the role of the vectors a and b in Eq. 1. SVD finds U and V such that for any i ∈ [d′] it holds that:

Cov(Ui⊤X, Vi⊤Z) = max_{(a,b)∈Oi} Cov(a⊤X, b⊤Z),

where Oi is the set of pairs of vectors (a, b) such that ‖a‖2 = ‖b‖2 = 1, a is orthogonal to U1, . . . , U_{i−1}, and similarly, b is orthogonal to V1, . . . , V_{i−1}.
Shao et al. (2023) showed that SVD in this form can be used to debias representations. We calculate the SVD between X and Z and then prune out the principal directions that denote the highest covariance. We will use their method, SAL (Spectral Attribute removaL), in the rest of the paper. See also §3.4.
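To make the removal step concrete, the following is a minimal sketch of a SAL-style removal on centered data matrices whose rows hold the samples; the function name, the toy data, and the choice to project back into the original space (rather than keep the reduced coordinates) are illustrative assumptions, not the released implementation.

```python
import numpy as np

def sal_removal(X, Z, num_directions):
    """Remove the principal cross-covariance directions between X and Z.

    X: (n, d) centered input representations.
    Z: (n, d_prime) centered guarded attributes, row-aligned with X.
    num_directions: how many top singular directions to prune.
    """
    n = X.shape[0]
    omega = X.T @ Z / n                 # empirical cross-covariance, (d, d_prime)
    U, S, Vt = np.linalg.svd(omega, full_matrices=True)
    # Keep the left singular vectors with the smallest singular values:
    # these directions co-vary least with the guarded attributes.
    U_keep = U[:, num_directions:]
    # Projecting back with U_keep keeps the original dimensionality;
    # returning X @ U_keep instead would keep reduced coordinates.
    return X @ U_keep @ U_keep.T

# Toy usage with random, centered data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)); X -= X.mean(0)
Z = rng.normal(size=(100, 2)); Z -= Z.mean(0)
X_clean = sal_removal(X, Z, num_directions=2)
```

In the experiments (§4), the number of removed directions is tied to the rank of the cross-covariance matrix.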
3 Methodology
We view the problem of information removal with unaligned samples as a joint optimization problem of: (a) finding the alignment; and (b) finding the projection that maximizes the covariance between the aligned pairs, and using its complement to project the inputs. Such an optimization, in principle, is intractable, so we break it down into two coordinate-ascent style steps: an A-step (in which the alignment is identified as a bipartite graph matching problem) and an M-step (in which, based on the previously identified alignment, a maximal-covariance projection is calculated).
Formally, the maximization problem we solve is:

(U, V, π) = argmax_{U, V, π} Σ_{i=1}^n (x(i))⊤ U V⊤ z(π(i)),   (2)

where we constrain U and V to be matrices with orthonormal columns (U ∈ R^{d×k}, V ∈ R^{d′×k}).
Figure 2: The main Assignment-Maximization Spectral
Attribute removaL (AMSAL) algorithm for removal
of information without alignment between samples of
X and Z.
Note that the sum in the above equation has
a term per pair of (X(je), zπ(je)), which enables us
to frame the A-step as an integer linear program-
ming (ILP) problem (§3.1). The full algorithm is
given in Figure 2, and we proceed in the next two
steps to further explain the A-step and the M-step.
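As a rough illustration of the alternation in Figure 2, the sketch below implements the simplified one-to-one case (m = n), where the A-step reduces to the Hungarian method; in the general many-to-one setting, the bounded ILP of Eq. 3 (§3.1) replaces that call. All function and variable names are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def amsal_iterate(X, Z, k, num_iters=100):
    """Alternate A-steps and M-steps (one-to-one case, m = n).

    X: (n, d) inputs; Z: (n, d_prime) shuffled guarded attributes; k: rank kept.
    Returns projections U, V and the assignment pi (pi[i] indexes into Z).
    """
    n = X.shape[0]
    pi = np.arange(n)                               # initial assignment
    for _ in range(num_iters):
        # M-step: SVD of the empirical cross-covariance under the current assignment.
        omega = X.T @ Z[pi] / n                     # (d, d_prime)
        U, _, Vt = np.linalg.svd(omega)
        U, V = U[:, :k], Vt[:k].T
        # A-step: score every (x_i, z_j) pair in the joint space and re-align.
        scores = (X @ U) @ (Z @ V).T                # scores[i, j] = <U^T x_i, V^T z_j>
        _, new_pi = linear_sum_assignment(-scores)  # maximize the total score
        if np.array_equal(new_pi, pi):
            break                                   # assignment is stable
        pi = new_pi
    return U, V, pi
```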
3.1 A-step (Guarded Sample Assignment)
In the Assignment Step, we are required to find a many-to-one alignment π : [n] → [m] between {x(1), . . . , x(n)} and {z(1), . . . , z(m)}. Given U and V from the previous M-step, we can find such an assignment by solving the following optimization problem:

argmax_π Σ_{i=1}^n ⟨U⊤x(i), V⊤z(π(i))⟩.

This maximization problem can be formulated as an integer linear program of the following form:

max_{P ∈ {0,1}^{n×m}} Σ_{i=1}^n Σ_{j=1}^m pij ⟨U⊤x(i), V⊤z(j)⟩
s.t. ∀i: Σ_{j=1}^m pij = 1,   ∀j: b0j ≤ Σ_{i=1}^n pij ≤ b1j.   (3)
This is a solution to an assignment prob-
lem (Kuhn, 1955; Ramshaw and Tarjan, 2012),
where pij denotes whether x(je) is associated with
le (type of) guarded attribute z(j). The values
(b0j, b1j) determine lower and upper bounds on
the number of xs a given z(j) can be assigned
à. While a standard assignment problem can be
solved efficiently using the Hungarian method of
Kuhn (1955), we choose to use the ILP formu-
lation, as it enables us to have more freedom in
adding constraints to the problem, such as the
lower and upper bounds.
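A sketch of the bounded assignment ILP in Eq. 3, using SciPy's mixed-integer interface, is given below; the helper name, the score matrix, and the bound vectors are illustrative placeholders, and any off-the-shelf ILP solver could be substituted.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def a_step_ilp(scores, lower, upper):
    """Solve Eq. 3: assign each x_i to one z_j, with per-j capacity bounds.

    scores: (n, m) matrix with entries <U^T x_i, V^T z_j>.
    lower, upper: length-m arrays (b_{0j}, b_{1j}) bounding the column sums.
    Returns pi with pi[i] = assigned j (assumes the problem is feasible).
    """
    n, m = scores.shape
    c = -scores.ravel()                        # milp minimizes; negate to maximize
    # Each x_i gets exactly one z: sum_j p_ij = 1.
    row_A = np.zeros((n, n * m))
    for i in range(n):
        row_A[i, i * m:(i + 1) * m] = 1.0
    row_con = LinearConstraint(row_A, lb=np.ones(n), ub=np.ones(n))
    # Column capacities: b_0j <= sum_i p_ij <= b_1j.
    col_A = np.zeros((m, n * m))
    for j in range(m):
        col_A[j, j::m] = 1.0
    col_con = LinearConstraint(col_A, lb=lower, ub=upper)
    res = milp(c, constraints=[row_con, col_con],
               integrality=np.ones(n * m), bounds=Bounds(0, 1))
    P = res.x.reshape(n, m).round().astype(int)
    return P.argmax(axis=1)
```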
3.2 M-step (Covariance Maximization)
The result of an A-step is an assignment π such
that π(je) = j implies x(je) was deemed as aligned
to zj. With that π in mind, we define the follow-
ing empirical covariance matrix Ωπ ∈ R^{d×d′}:

Ωπ = Σ_{i=1}^n x(i) (z(π(i)))⊤.   (4)
We then apply SVD on Ωπ to get new U and
V that are used in the next iteration of the algo-
rithm with the A-step, if the algorithm continues
to run. When the maximal number of iterations is
reached, we follow the work of Shao et al. (2023)
in using a truncated part of U to remove the in-
formation from the xs. We do that by projecting
X(je) using the singular vectors of U with the
smallest singular values. These projected vectors
co-vary the least with the guarded attributes, assuming the assignment in the last A-step was precise. This method has been shown by Shao
et autres. (2023) to be highly effective and efficient in
debiasing neural representations.
3.3 A Matrix Formulation of the AM Steps
Let e1, . . . , em be the standard basis vectors. This means ei is a vector of length m with 0 in all coordinates except for the ith coordinate, where it is 1.
Let E be the set of all matrices E such that E ∈ R^{n×m} and each row of E is one of the ei, i ∈ [m]. In that case, EZ⊤ is an n × d′ matrix, such that the jth row is a copy of the ith column of Z ∈ R^{d′×m} whenever the jth row of E is ei. Therefore, the AM steps can be viewed as solving the following maximization problem using coordinate ascent (with X ∈ R^{d×n} stacking the samples x(i) as columns):

argmax_{E∈E, U, V, Σ} ‖UΣV⊤ − XEZ⊤‖F²,

where U, V are orthonormal matrices and Σ is a diagonal matrix with non-negative elements. This corresponds to the SVD of the matrix XEZ⊤.
In that case, the matrix E can be directly mapped to an assignment in the form of π, where π(i) would be the j such that the jth coordinate in the ith row of E is non-zero.
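The following small numerical check illustrates the matrix view above, under the assumption that rows of the data matrices hold the samples: building E from π reproduces Ωπ, so the M-step is exactly an SVD of XEZ⊤. All variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, d_prime = 6, 4, 5, 3
X_rows = rng.normal(size=(n, d))         # rows are x^(i)
Z_rows = rng.normal(size=(m, d_prime))   # rows are z^(j)
pi = rng.integers(0, m, size=n)          # a many-to-one assignment

E = np.eye(m)[pi]                        # (n, m); row i is e_{pi(i)}
omega_via_E = X_rows.T @ E @ Z_rows      # X E Z^T in the notation above
omega_direct = X_rows.T @ Z_rows[pi]     # sum_i x^(i) (z^(pi(i)))^T
assert np.allclose(omega_via_E, omega_direct)
# The SVD of this matrix is exactly the M-step computation.
U, S, Vt = np.linalg.svd(omega_via_E)
```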
3.4 Removal Algorithm
The AM steps are best suited for the removal of
information through SVD with an algorithm such
as SAL. This is because the AM steps optimize an objective of the same type as SAL's, relying on the projections U and V to project the inputs and guarded representations into a joint space. However, a by-product of the algorithm in Figure 2 is an assignment function π that aligns the inputs with the guarded representations.
With that assignment, other removal algo-
rithms can be used, Par exemple, the algorithm of
Ravfogel et al. (2020). We experiment with this
idea in §4.
3.5 Justification of the AM Steps
We next provide a justification of our algorithm
(which may be skipped on a first reading). Notre
justification is based on the observation that if
indeed X and Z are linked together (this con-
nection is formalized as a latent variable in their
joint distribution), then for a given sample that is
permuted, the singular values of Ω will be larger
the closer the permutation is to the identity per-
mutation. This justifies finding such a permuta-
tion that maximizes the singular values in an
SVD of Ω.
More Details Let ι : [n] → [n] be the identity
permutation, ι(je) = i. We will assume the case
in which n = m (but the justification can be
generalized to the case m < n), and that the
underlying joint distribution p(X, Z) is mediated
by a latent variable H, such that
p(X, Z, H) = p(H)p(X | H)p(Z | H).
(5)
This implies there is a latent variable that con-
nects X and Z, and that the joint distribution
p(X, Z) is a mixture through H.
Proposition 1 (informal). Let {(x(i), z(i))} be a
sample of size n from the distribution in Eq. 5.
Let π be a permutation over [n] uniformly sam-
pled from the set of permutations. Then with
high likelihood, the sum of the singular values
of Ωπ is smaller than the sum of singular values
under Ωι.
For full details of this claim, see Appendix A.
4 Experiments
In our experiments, we test several combinations
of algorithms. We use k-means (KMEANS) as a substitute for the AM steps, as a baseline for the assignment step of xs to zs. In addition, for the
removal step (once an assignment has been iden-
tified), we test two algorithms: SAL (Shao et al.,
2023; resulting in AMSAL) and INLP (Ravfogel
et al., 2020). We also compare these two algo-
rithms in oracle mode (in which the assignment
of guarded attributes to inputs is known), to see
the loss in performance that happens due to noisy
assignments from the AM or k-means algorithm
(ORACLESAL and ORACLEINLP).
When running the AM algorithm or k-means,
we execute it with three random seeds (see also
§4.6) for a maximum of a hundred iterations and
choose the projection matrix with the largest ob-
jective value over all seeds and iterations. For the
slack variables (b0j and b1j variables in Eq. 3),
we use 20%–30% above and below the baseline
of the guarded attribute priors according to the
training set. With the SAL methods, we remove a number of directions determined by the rank of the Ω matrix (between 2 and 6 across all experiments).
In addition, we experiment with a partially
supervised assignment process, in which a small
seed dataset of aligned xs and zs is provided to the
AM steps. We use it for model selection: Rather
than choosing the assignment with the highest
SVD objective value, we choose the assignment
with the highest accuracy on this seed dataset.
We refer to this setting as PARTIAL (for ‘‘partially
supervised assignment’’).
Finally, in the case of a gender-protected at-
tribute, we compare our results against a baseline
in which the input x is compared against a list of
words stereotypically associated with the genders
of male or female.2 Based on the overlap with
2https://tinyurl.com/33bzddtw.
these two lists, we heuristically assign the gen-
der label to x and then run SAL or INLP (rather
than using the AM algorithm). While this word-
list heuristic is plausible in the case of gender,
it is not as easy to derive in the case of other
protected attributes, such as age or race. We give
the results for this baseline using the marker WL
in the corresponding tables.
Main Findings Our overall main finding shows that our novel setting, in which guarded information is erased from individually unaligned representations, is viable. We discovered that AM methods perform particularly well when dealing with more complex bias removal scenarios, such as when multiple guarded attributes are present. We also found that having similar priors for the guarded attributes and downstream task labels may lead to poor performance on the task at hand. In these cases, using a small amount of supervision often effectively helps reduce bias while maintaining the utility of the representations for the main classification or regression problem.
Finally, our analysis of alignment stability shows
that our AM algorithm often converges to suit-
able solutions that align X with Z.
Due to the unsupervised nature of our prob-
lem setting, we advise validating the utility of
our method in the following way. Once we run
the AM algorithm, we check whether there is a
high-accuracy alignment between X and Y (rather
than Z, which is unavailable). If this alignment
is accurate, then we run the risk of significantly
damaging task performance. An example is given
in §4.5.
4.1 Word Embedding Debiasing
As a preliminary assessment of our setup and
algorithms, we apply our methods to GloVe word
embeddings to remove gender bias, and follow
the previous experiment settings of this problem
(Bolukbasi et al., 2016; Ravfogel et al., 2020;
Shao et al., 2023). We considered only the 150,000
most common words to ensure the embedding
quality and omitted the rest. We sort the remaining
embeddings by their projection on the he − she direction. Then we consider the top 7,500 word embeddings as male-associated words (z = 1) and the bottom 7,500 as female-associated words (z = −1).
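A sketch of how these pseudo-labels can be produced is shown below, assuming a dictionary from words to GloVe vectors that contains ''he'' and ''she''; the function name is ours.

```python
import numpy as np

def gender_direction_labels(embeddings, top_k=7500):
    """Label words by their projection on the he - she direction.

    embeddings: dict mapping each word to a 1-D numpy vector.
    Returns (male_words, female_words): the top_k and bottom top_k words.
    """
    direction = embeddings["he"] - embeddings["she"]
    words = list(embeddings)
    proj = np.array([embeddings[w] @ direction for w in words])
    order = np.argsort(proj)                            # ascending projection
    female_words = [words[i] for i in order[:top_k]]    # z = -1
    male_words = [words[i] for i in order[-top_k:]]     # z = +1
    return male_words, female_words
```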
Our findings are that both the k-means and the
AM algorithms perfectly identify the alignment
is perfect in this case. This finding indicates that
this standard word embedding dataset used for
debiasing is trivial to debias—debiasing can be
done even without knowing the identity of the
stereotypical gender associated with each word.
4.2 BiasBios Results
De-Arteaga et al. (2019) presented the BiasBios
dataset, which consists of self-provided biogra-
phies paired with the profession and gender of
their authors. A list of pronouns and names is used
to obtain the authors’ gender automatically. They
aim to expose the caveats of automated hiring
systems by showing that even the simple task of
predicting a candidate’s profession can be affected
by the candidate’s gender, which is encoded in
the biography representation. For example, we
want to avoid one being identified as ‘‘he’’ or
‘‘she’’ in their biography, affecting the likelihood
of them being classified as engineers or teachers.
We follow the setup of De-Arteaga et al.
(2019), predicting a candidate’s profession (y) based on a self-provided short biography (x), aiming to remove any information about the candidate’s gender (z). Due to computational constraints, we use only 30K random examples to learn the projections with both SAL and INLP
(whether in the unaligned or aligned setting). For
the classification problem, we use the full dataset.
To obtain vector representations for the biogra-
phies, we use two different encoders, FastText
word embeddings (Joulin et al., 2016), and BERT
(Devlin et al., 2019). We stack a multi-class clas-
sifier on top of these representations, as there are
28 different professions. We use 20% of the train-
ing examples for the PARTIAL setting. For BERT,
we followed De-Arteaga et al. (2019) in using
the last CLS token state as the representation of
the whole biography. We used the BERT model
bert-base-uncased.
Evaluation Measures We use an extension of
the True Positive Rate (TPR) gap, the root mean
square (RMS) TPR gap of all classes, for eval-
uating bias in a multiclass setting. This metric
was suggested by De-Arteaga et al. (2019), who
demonstrated it is significantly correlated with
gender imbalances, which often lead to unfair
classification. The higher the metric value is,
the bigger the gap between the two categories
(for example, between male and female) for the
Figure 3: A t-SNE visualization of the word embed-
dings before and after gender information removal. In
(a) we see the embeddings naturally cluster into the
corresponding gender.
between the word embeddings and their asso-
ciated gender label (100%). Indeed, the dataset
construction itself follows a natural perfect clus-
tering that these algorithms easily discover. Since
the alignments are perfectly identified, the results
of predicting the gender from the word embed-
dings after removal are identical to the oracle
case. These results are quite close to the results of
a random guess, and we refer the reader to Shao
et al. (2023) for details on experiments with SAL
and INLP for this dataset. Considering Figure 3,
it is evident that our algorithm essentially follows
a natural clustering of the word embeddings into
two clusters, female and male, as the embeddings
are highly separable in this case. This is why the
alignment score of X (embedding) to Z (gender)
4.3 BiasBench Results
Meade et al. (2022) followed an empirical study
of an array of datasets in the context of debiasing.
They analyzed different methods and tasks, and
we follow their benchmark evaluation to assess
our AMSAL algorithm and other methods in the
context of our new setting. We include a short
description of the datasets we use in this section.
We include full results in Appendix B, with a
description of other datasets. We also encourage
the reader to refer to Meade et al. (2022) for details
on this benchmark. We use 20% of the training
examples for the PARTIAL setting.
StereoSet (Nadeem et al., 2021) This dataset
presents a word completion test for a language
model, where the completion can be stereotypical
or non-stereotypical. The bias is then measured
by calculating how often a model prefers the ste-
reotypical completion over the non-stereotypical
one. Nadeem et al. (2021) introduced the language
model score to measure the language model us-
ability, which is the percentage of examples for
which a model prefers the stereotypical or non-
stereotypical word over some unrelated word.
CrowS-Pairs (Nangia et al., 2020) This dataset
includes pairs of sentences that are minimally dif-
ferent at the token level, but these differences lead
to the sentence being either stereotypical or anti-
stereotypical. The assessment measures how many
times a language model prefers the stereotypi-
cal element in a pair over the anti-stereotypical
element.
Results We start with an assessment of the
BERT model for the CrowS-Pairs gender, race,
and religion bias evaluation (Table 2). We observe
that all approaches for gender, except AM+INLP,
reduce the stereotype score. Race and religion are
more difficult to debias in the case of BERT. INLP
with k-means works best when no seed align-
ment data is provided at all, but when we con-
sider PARTIALSAL, in which we use the alignment
algorithm with some seed aligned data, we see
that the results are the strongest. When we con-
sider the RoBERTa model, the results are sim-
ilar, with PARTIALSAL significantly reducing the
bias. Our findings from Table 2 overall indicate
that the ability to debias a representation highly
depends on the model that generates the rep-
resentation. In Table 10 we observe that
the
Table 1: BiasBios dataset results. The top part
uses BERT embeddings to encode the biographies,
while the bottom part uses FastText embeddings.
specific main task prediction. For the profession
classification, we report accuracy.
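A sketch of the RMS TPR-gap computation for a binary protected attribute is given below; the function name is ours, and the per-class gap follows the multiclass extension referenced above.

```python
import numpy as np

def rms_tpr_gap(y_true, y_pred, z):
    """Root-mean-square of per-class TPR gaps between two protected groups.

    y_true, y_pred: arrays of main-task labels (e.g., professions).
    z: binary array of protected-attribute values (e.g., 0/1 gender).
    """
    gaps = []
    for label in np.unique(y_true):
        tprs = []
        for group in (0, 1):
            mask = (y_true == label) & (z == group)
            if mask.sum() == 0:
                tprs.append(0.0)                      # no examples for this group/class
            else:
                tprs.append(np.mean(y_pred[mask] == label))  # TPR for this group/class
        gaps.append(tprs[1] - tprs[0])
    return float(np.sqrt(np.mean(np.square(gaps))))
```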
Results Table 1 provides the results for the bi-
ography dataset. We see that INLP significantly
reduces the TPR-GAP in all settings, but this
comes at a cost: The representations are signifi-
cantly less useful for the main task of predicting
the profession. When inspecting the alignments,
we observe that their accuracy is quite high with
BERT: 100% with k-means, 85% with the AM
algorithm, and 99% with PARTIAL AM. For Fast-
Text, the results are lower, hovering around 55%
for all three methods. The high BERT assignment
performance indicates that the BiasBios BERT
representations are naturally separated by gender.
We also observe that the results of WL+SAL
and WL+INLP are correspondingly identical to
Oracle+SAL and Oracle+INLP. This comes as
no surprise, as the gender label is derived from
a similar word list, which enables the WL ap-
proach to get a nearly perfect alignment (over
96% agreement with the gender label).
Table 2: (a) CrowS-Pairs Gender stereotype scores (Stt. score) in language models debiased by differ-
ent debiasing techniques and assignment; (b) CrowS-Pairs Race stereotype scores; (c) CrowS-Pairs
Religion stereotype scores. All models are deemed least biased if the stereotype score is 50%. The colored
numbers are calculated as ||b − 50| − |s − 50||, where b is the top row score and s is the correspond-
ing system score.
representations, on average, are not damaged for most GLUE tasks; additional analysis in Appendix B supports this conclusion.
As Meade et al. (2022) have noted, when chang-
ing the representations of a language model to
remove bias, we might cause such adjustments
that damage the usability of the language model.
To test which methods possibly cause such an
issue, we also assess the language model score
on the StereoSet dataset in Table 3. We overall
see that often SAL-based methods give a lower
stereotype score, while INLP methods more sig-
nificantly damage the language model score. This
implies that the SAL-based methods remove bias
baseline task performance almost in full. See Appendix B.
4.4 Multiple-Guarded Attribute Sentiment
We hypothesize that AM-based methods are bet-
ter suited for setups where multiple guarded at-
tributes should be removed, as they allow us to
target several guarded attributes with different
priors. To examine our hypothesis, we experi-
ment with a dataset curated from Twitter (tweets
encoded using BERT, bert-base-uncased),
in which users are surveyed for their age and
gender (Cachola et al., 2018). We bucket the age
into three groups (0–25, 26–50, and above 50).
Tweets in this dataset are annotated with their
sentiment, ranging from 1 (very negative) to 5
(very positive). The dataset consists of more than
6,400 tweets written by more than 1,700 users.
We removed users who no longer have public
Twitter accounts and users with locations that do
not exist based on a filter,3 resulting in a dataset
with over 3,000 tweets, written by 817 unique us-
ers. As tweets are short by nature and their num-
ber is relatively small, the debiasing signal in
this dataset (the amount of information it contains
about the guarded attributes) might not be suffi-
cient for the attribute removal. To amplify this
signal, we concatenated each tweet in the dataset
to at most ten other tweets from the same user.
We study the relationship between the main
task of sentiment detection and the two protected
attributes of age and gender. As a protected at-
tribute z, we use the combination of both age and
gender as a binary one-hot vector. This dataset
presents a use-case for our algorithm of a com-
posed protected attribute. Rather than using a
classifier for predicting the sentiment, we use lin-
ear regression. Following Cachola et al. (2018),
we use Mean Absolute Error (MAE) to report the
error of the sentiment predictions. Given that the
sentiment is predicted as a continuous value, we
cannot use the TPR gap as in previous sections.
Rather, we use the following formula:
MAEGap = std(MADz=j | j ∈ [m]),
(6)
3We used a list of cities, counties, and states in
the United States, taken from https://tinyurl.com
/4kmc6pyn. All users were in the United States when
the data was collected by the original curators.
Table 3: StereoSet stereotype scores (Stt. Score)
and language modeling scores (LM Score) for the
gender category. Stereotype scores indicate the
least bias at 50% and the LM scores indicate high
usability at 100%.
effectively while less significantly harming the
usability of the language model representations.
We also report comprehensive results for other datasets (SEAT and GLUE) and categories of bias (race and religion). The results, especially for GLUE, demonstrate the effectiveness of our method for unaligned information removal. For GLUE, we consistently retain the
Table 4: MAE and debiasing gap values on the Twitter dataset, when using BERT to encode the tweets. For age and gender, we give the MAE gap as in Eq. 6.
where MAD_{z=j} = (1/ℓ) Σ_i ηij, where i ranges over the set of size ℓ of examples with protected attribute value j, μj is the average absolute Y prediction error for that set, and ηij is the absolute difference between μj and the absolute error of example i.4 The function std in this case indicates the standard deviation of the m values of MAD_{z=j}, j ∈ [m].
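A sketch of this metric, matching the reconstruction of MAD_{z=j} above, is given below; the function and variable names are ours.

```python
import numpy as np

def mae_gap(y_true, y_pred, z):
    """MAEGap (Eq. 6): std of per-group mean absolute deviations of the error.

    y_true, y_pred: continuous sentiment values; z: protected-attribute ids.
    """
    mads = []
    for group in np.unique(z):
        errors = np.abs(y_true[z == group] - y_pred[z == group])  # absolute errors
        mads.append(np.mean(np.abs(errors - errors.mean())))      # MAD_{z=j}
    return float(np.std(mads))
```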
Results Table 4 presents our results. Overall, AMSAL reduces the gender and age gap in the predictions while not increasing the MAE by much. In addition, we can see that both AM-based methods outperform their k-means counterparts, which either increase unfairness (KMEANS + INLP) or significantly harm the downstream-task performance (KMEANS + SAL). We also consider Figure 4, which shows how the quality of the assignments of the AM algorithm changes as a function of the amount of labeled data used. As expected, the more labeled data we have, the more accurate the assignments are, but the differences are not very large.
4.5 An Example of Our Method Limitations
We now present the main limitation in our ap-
proach and setting. This limitation arises when
the random variables Y and Z are not easily
distinguishable through information about X.
We experiment with a binary sentiment analysis
(y) task, predicted on users’ tweets (x), aim-
ing to remove information regarding the authors’
ethnic affiliations. To do so, we use a dataset
collected by Blodgett et al. (2016), which exam-
ined the differences between African-American
English speakers and Standard American English
4The absolute error of prediction a with true value b
is |a − b|.
Figure 4: Accuracy of the AM steps with respect to age
and gender separately (on unseen data), as a function
of the fraction of the labeled dataset used by the AM
algorithm.
speakers. As information about one’s ethnicity is
hard to obtain, the user’s geolocation informa-
tion was used to create a distantly supervised
mapping between authors and their ethnic affilia-
tions. We follow previous work (Shao et al., 2023;
Ravfogel et al., 2020) and use the DeepMoji en-
coder (Felbo et al., 2017) to obtain representa-
tions for the tweets. The train and test sets are
balanced regarding sentiment and authors’ ethnic-
ity. We use 20% of the examples for the PARTIAL
setting. Table 5 gives the results for this dataset.
We observe that the removal with the assignment
(k-means, AM, or PARTIAL) significantly harms
the performance on the main task and reduces it
to a random guess.
This presents a limitation of our algorithm.
A priori, there is no distinction between Y and
Z, as our method is unsupervised. In addition,
the positive labels of Y and Z have the same
prior probability. Indeed, when we check the as-
signment accuracy in the sentiment dataset, we
observe that the k-means, AM, and PARTIAL AM
assignment accuracy for identifying Z are be-
tween 0.55 and 0.59. If we check the assignment
against Y, we get an accuracy between 0.74 and
0.76. This means that all assignment algorithms
actually identify Y rather than Z (both Y and Z
are binary variables in this case). The conclusion
from this is that our algorithm works best when
sufficient information on Z is presented such that
it can provide a basis for aligning samples of
Table 5: The performance of removing race information from the DeepMoji dataset is shown for two cases: with balanced ratios of race and sentiment (left) and with ratios of 0.8 for sentiment and 0.5 for race (right). In both cases, the total size of the dataset used is 30,000 examples. To evaluate the performance on the unbalanced sentiment dataset, we use the F1 macro measure, because in an unbalanced dataset such as this one, a simple classifier that always returns one label will achieve an accuracy of 80%. Such a classifier would have an F1 macro score of approximately 0.44.
Z with samples of X. Suppose such information
is unavailable or unidentifiable with information
regarding Y. In that case, we may simply iden-
tify the natural clustering of X according to
their main task classes, leading to low main-task
performance.
In Table 5, we observe that this behavior is
significantly mitigated when the priors over the
sentiment and the race are different (0.8 for sen-
timent and 0.5 for race). In that case, the AM
algorithm is able to distinguish between the race-
protected attribute (z) and the sentiment class (y)
quite consistently with INLP and SAL, and the
gap is reduced.
We also observe that INLP changed neither
the accuracy nor the TPR-GAP for the balanced
scenario (Table 5) when using a k-means assign-
ment or an AM assignment. Upon inspection, we
found that INLP returns an identity projection in these cases, unable to amplify the relatively weak signal in the assignment enough to change the representations.
4.6 Stability Analysis of the Alignment
In Figure 5, we plot the accuracy of the alignment
algorithm (knowing the true value of the guarded
attribute per input) throughout the execution of the
AM steps for the first ten iterations. The shaded
area indicates one standard deviation. We observe
that the first few iterations are the ones in which
the accuracy improves the most. For most of the
datasets, the accuracy does not decrease between
Figure 5: Accuracy of the AM steps (in identifying
the correct assignment of inputs to guarded informa-
tion) as a function of the iteration number. Shaded
gray gives upper and lower bound on the standard
deviation over five runs with different seeds for the
initial π. FastText refers to the BiasBios dataset, the
BERT models are for the CrowS-Pairs dataset and
Emb. refers to the word embeddings dataset from §4.1.
iterations, though in the case of DeepMoji we do
observe a ‘‘bump.’’ This is indeed why the PARTIAL
setting of our algorithm, where a small amount
of guarded information is available to determine
at which iteration to stop the AM algorithm, is
important. In the word embeddings case, the vari-
ance is larger because, in certain executions, the
beddings. Gonen and Goldberg (2019) examined
the effectiveness of the methods mentioned above
and concluded they remove bias in a shallow way.
For example, they demonstrated that classifiers
can accurately predict the gender associated with
a word when fed with the embeddings of both
debiasing methods.
Another related strand of work uses adver-
sarial learning (Ganin et al., 2016), where an
additional objective function is added for balanc-
ing undesired-information removal and the main
task (Edwards and Storkey, 2016; Li et al., 2018;
Coavoux et al., 2018; Wang et al., 2021). Elazar
and Goldberg (2018) have also demonstrated
that an ad-hoc classifier can easily recover the
removed information from adversarially trained
representations. Since then, methods for informa-
tion erasure such as INLP and its generalization
(Ravfogel et al., 2020, 2022), SAL (Shao et al.,
2023) and methods based on similarity measures
between neural representations (Colombo et al.,
2022) have been developed. With a similar moti-
vation to ours, Han et al. (2021b) aimed to ease
the burden of obtaining guarded attributes at a
large scale by decoupling the adversarial informa-
tion removal process from the main task training.
They, however, did not experiment with debi-
asing representations where no guarded attribute
alignments are available. Shao et al. (2023) exper-
imented with the removal of features in a scenario
in which a low number of protected attributes is
available.
Additional previous work showed that methods
based on causal inference (Feder et al., 2021),
train-set balancing (Han et al., 2021a), and con-
trastive learning (Shen et al., 2021; Chi et al.,
2022) effectively reduce bias and increase fair-
ness. In addition, there is a large body of work for
detecting bias, its evaluation (Dev et al., 2021)
and its implications in specific NLP applica-
tions. Savoldi et al. (2022) detected a gender
bias in speech translation systems for gendered
languages. Gender bias is also discussed in the
context of knowledge base embeddings by Fisher
et al. (2019); Du et al. (2022), and multilingual
text classification (Huang, 2022).
6 Conclusions and Future Work
We presented a new and challenging setup for
removing information, with minimal or no avail-
able sensitive information alignment. This setup
Figure 6: Ratio of the objective value in iteration t
and iteration 0 of the ILP for the AM steps as a
function of the iteration number t. Shaded gray gives
upper and lower bound on the standard deviation over
five runs with different seeds for the initial π. See
legend explanation in Figure 5.
algorithm converged quickly, while in others, it
took more iterations to converge to high accuracy.
Figure 6 plots the relative change of the ob-
jective value of the ILP from §3.1 against itera-
tion number. The relative change is defined as the
ratio between the objective value before the al-
gorithm begins and the same value at a given
iteration. We see that there is a relative stability
of the algorithm and that the AM steps converge
quite quickly. We also observe the DeepMoji
dataset has a large increase in the objective value
in the first iteration (around ×5 compared to the
value the algorithm starts with), after which it
remains stable.
5 Related Work
There has been an increasing amount of work on
detecting and erasing undesired or protected in-
formation from neural representations, with stan-
dard software packages for this process having
been developed (Han et al., 2022). For example,
in their seminal work, Bolukbasi et al. (2016)
showed that word embeddings exhibit gender ste-
reotypes. To mitigate this issue, they projected the
word embeddings to a neutral space with respect
to a ‘‘he-she’’ direction. Influenced by this work,
Zhao et al. (2018) proposed a customized training
scheme to reduce the gender bias in word em-
is crucial for the wide applicability of debiasing
methods, as for most applications, obtaining such
sensitive labels on a large scale is challenging. To
ease this problem, we present a method to erase
information from neural representations, where
the guarded attribute information does not accom-
pany each input instance. Our main algorithm,
AMSAL, alternates between two steps (Assign-
ment and Maximization) to identify an assignment
between the input instances and the guarded in-
formation records. It then completes its execu-
tion by removing the information by minimizing
covariance between the input instances and the
aligned guarded attributes. Our approach is mod-
ular, and other erasure algorithms, such as INLP,
can be used with it. Experiments show that we
can reduce the unwanted bias in many cases while
keeping the representations highly useful. Future
work might include extending our technique to
the kernelized case, analogously to the method
of Shao et al. (2023).
Ethical Considerations
The AM algorithm could potentially be misused
by, rather than using the AM steps to erase infor-
mation, using them to link records of two differ-
ent types, undermining the privacy of the record
holders. Such a situation may merit additional
concern because the links returned between the
guarded attributes and the input instances will
likely contain mistakes. The links are unreliable
for decision-making at the individual level. In-
stead, they should be used on an aggregate as
a statistical construct to erase information from
the input representations. Finally,5 we note that
the automation of the debiasing process, without
properly statistically confirming its accuracy us-
ing a correct sample, may promote a false sense
of security that a given system is making fair de-
cisions. We do not recommend using our method
for debiasing without proper statistical control
and empirical verification of correctness.
being a sounding board for certain parts of the pa-
per. The experiments in this paper were supported
by compute grants from the Edinburgh Parallel
Computing Center and from the Baskerville Tier
2 HPC service (University of Birmingham).
A Justification of the AM Algorithm:
Further Details
We provide here the full details for the claim in
§3.5. Our first observation is that for a uniformly
sampled permutation π : [n] → [n], the probabil-
ity that it has exactly k ≤ n elements such that
π(i) = i for all i in this set of elements is
bounded from above by:6
(n choose k) · (n − k)! / n! = 1/k!.
We also assume that E[X | H] = 0 and
E[Z | H] = 0, and that the product of every pair
of coordinates of X and Z is bounded in absolute
value by a constant B > 0. Let {(X(je), z(je), h(je))}
be a random sample of size n from the joint
distribution p(X, Z, H). Given a permutation
π : [n] → [n], define I(π) = {je | π(je) = i}.
For a given set M ⊆ [n], define
Ω_{π|M} = Σ_{i∈M} x(i) (z(π(i)))⊤.
For a matrix A ∈ R^{d×d′}, let σj(A) be its jth largest singular value, and let σ+(A) = Σ_j σj(A). Let σ+ = σ+(E[Ωι]).
We first note that for any permutation π, it holds that E[Ω_{π|K}] = 0, where we define K = [n] \ I(π).
Lemma 1. For any t > 0, it holds that:
p(||Ωπ|je(π) − E[Ωπ|je(π)]||2 ≥ dd(cid:9)t)
(cid:9)
(cid:10)
(7)
t2
|je(π)|B2
.
Remerciements
is smaller than 2dd(cid:9) exp
−
We thank the reviewers, the action editors and
Marcio Fonseca for their thorough feedback. We also thank Daniel Preoţiuc-Pietro for his help with
the Twitter data. We thank Kousha Etessami for
Proof. By Hoeffding’s inequality, for any i ∈ [d], j ∈ [d′], it holds that the probability that for
5We thank the anonymous reviewer for raising this issue.
6Choose k elements that are fixed, and let the rest vary arbitrarily.
|je(π)| i.i.d. random variables Xk, Zk the follow-
ing is true:
(cid:11)
(cid:11)
(cid:11)
(cid:11)
(cid:11)
(cid:11)
(cid:4)
Xk
i Zk
j
−
(cid:4)
k∈I(π)
k∈I(π)
E[Xk
i Zk
j ]
(cid:5)
(cid:6)
(cid:11)
(cid:11)
(cid:11)
(cid:11)
(cid:11)
(cid:11)
≥ t
is smaller than 2 exp
. Donc, par
a union bound on each element of the matrix Ωπ,
we get the upper bound on Eq. 7.
|je(π)|B2
− t2
Lemma 2. For any t > 0, it holds that ‖Ω_{π|K} − E[Ω_{π|K}]‖2 is smaller than 2|K|dd′B.
Proof. Since Xi and Zj are bounded as a product in absolute value by B, and the dimensions of Ω_{π|K} are d × d′, each cell being a sum of |K| values, the bound naturally follows.
Let n be such that nσ+ > 2kdd′B, where k = |K|. Then from Lemma 2, ‖Ω_{π|K} − E[Ω_{π|K}]‖2 < nσ+. Consider the event σ+(Ωι) < σ+(Ωπ). Its probability is bounded from above by the probability of the event σ+(Ωι) ≤ nσ+ OR σ+(Ωπ) ≥ nσ+ (for any n as above). Due to the inequality of Weyl (Theorem 1 in Stewart, 1990; see below), the fact that Ωπ = Ω_{π|K} + Ω_{π|I(π)}, Lemma 1, and the fact that n − k ≤ n, the probability of this OR event is bounded from above by 4dd′ exp(−(n − k)(σ+)² / (dd′B)²).
The conclusion from this is that if we were to
sample uniformly a permutation π from the set
of permutations over [n], then with quite high
likelihood (because the fraction of elements that
are preserved under π becomes smaller as n be-
comes larger), the sum of the singular values of
Ωπ under this permutation will be smaller than
the sum of the singular values of Ωι—meaning,
when the xs and the zs are correctly aligned. This
justifies our objective of aligning the xs and the
zs with an objective that maximizes the singular
values, following Proposition 1.
Inequality of Weyl (1912) As mentioned by
Stewart (1990), the following holds:
Lemma 3. Let A and E be two matrices, and let
˜A = A + E. Let σi be the ith singular value of
A and ˜σi be the ith singular value of ˜A. Then
|σi − ˜σi| ≤ ||E||2.
B Comprehensive Results on the
BiasBench Datasets
We include more results for the SEAT dataset
from BiasBench and for the CrowS-Pairs dataset
and StereoSet datasets for bias categories other
than gender. A description of the SEAT and
GLUE datasets (with metrics used) follows.
SEAT (May et al., 2019) SEAT is a sentence-
level extension of WEAT (Caliskan et al., 2017),
which is an association test between two catego-
ries of words: attribute word sets and target word
sets. For example, attribute words for gender bias
could be { he, man }, while a target words could
be { career, office }. For example, an attribute
word set (in case of gender bias) could be a set
of words such as { he, him, man }, while a target
word set might be words related to office work.
If we see a high association between an attribute
word set and a target word set, we may claim
that a particular gender bias is encoded. The final
evaluation is calculated by measuring the simi-
larity between the different attributes and target
word sets. To extend WEAT to a sentence-level
test, (Caliskan et al., 2017) incorporated the
WEAT attribute and target words into synthetic
sentence templates.
We use an effect size metric to report our
results for SEAT. This measure is a normalized
difference between cosine similarity of repre-
sentations of the attribute words and the target
words. Both attribute words and target words are
split into two categories (for example, in rela-
tion to gender), so the difference is based on four
terms, between each pair of each category set
of words (target and attribute). An effect size
closer to zero indicates less bias is encoded in
the representations.
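The sketch below computes the standard WEAT effect size, which SEAT applies to sentence encodings; it is an illustration rather than the benchmark's exact implementation, and all names are ours.

```python
import numpy as np

def _cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def _assoc(w, A, B):
    # Differential association of one target vector with the two attribute sets.
    return np.mean([_cos(w, a) for a in A]) - np.mean([_cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """WEAT/SEAT effect size for target sets X, Y and attribute sets A, B.

    Each argument is a list of 1-D numpy vectors (word or sentence encodings).
    Values closer to zero indicate less measured bias.
    """
    x_assoc = [_assoc(x, A, B) for x in X]
    y_assoc = [_assoc(y, A, B) for y in Y]
    pooled = np.array(x_assoc + y_assoc)
    return (np.mean(x_assoc) - np.mean(y_assoc)) / np.std(pooled, ddof=1)
```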
GLUE (Wang et al., 2019) We follow Meade
et al. (2022) and use the GLUE dataset to test the
debiased model on an array of downstream tasks
to validate their usability. GLUE is a highly pop-
ular benchmark for testing NLP models, contain-
ing a variety of tasks, such as classification tasks
(e.g., sentiment analysis), similarity tasks (e.g.,
Table 6: (a) StereoSet stereotype scores and language modeling scores (LM Score) for race debiased
BERT, ALBERT, RoBERTa, and GPT-2 models. Stereotype scores are least biased at 50% and the LM
Scores are best at 100%; (b) StereoSet stereotype scores and language modeling scores (LM Score) for
religion debiased BERT, ALBERT, RoBERTa, and GPT-2 models. Stereotype scores are least biased
at 50% and the LM Scores are best at 100%.
paraphrase identification), and inference tasks
(e.g., question-answering).
The following tables of results are included:
• Tables 7, 8, and 9 describe the SEAT effect
sizes for the gender, race, and religion cases,
respectively.
• Table 6 presents the StereoSet results for re-
moving the race (a) and religion (b) guarded
attributes.
• Table 10 presents the scores the debi-
ased representations achieve for the GLUE
benchmark.
Table 7: SEAT effect sizes for gender-debiased representations of BERT, ALBERT, RoBERTa, and
GPT-2 models. Effect sizes closer to 0 are indicative of less biased model representations. Statistically
significant effect sizes at p < 0.01 are denoted by *. The final column reports the average absolute
effect size across all six gender SEAT tests for each debiased model.
Table 8: SEAT effect sizes for race debiased BERT, ALBERT, RoBERTa, and GPT-2 models. Effect
sizes closer to 0 are indicative of less biased model representations. Statistically significant effect sizes
at p < 0.01 are denoted by *. The final column reports the average absolute effect size across all six
race SEAT tests for each debiased model.
Table 9: SEAT effect sizes for religion debiased BERT, ALBERT, RoBERTa, and GPT-2 models.
Effect sizes closer to 0 are indicative of less biased model representations. Statistically significant effect
sizes at p < 0.01 are denoted by *. The final column reports the average absolute effect size across all
six religion SEAT tests for each debiased model.
Table 10: GLUE tests for gender-debiased BERT, ALBERT, RoBERTa, and GPT-2 models.
References
Su Lin Blodgett, Lisa Green, and Brendan
O’Connor. 2016. Demographic dialectal varia-
tion in social media: A case study of African-
American English. In Proceedings of the 2016
Conference on Empirical Methods in Natu-
ral Language Processing, pages 1119–1130,
Austin, Texas. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/D16-1120
Tolga Bolukbasi, Kai-Wei Chang, James Y.
Zou, Venkatesh Saligrama, and Adam Tauman
Kalai. 2016. Man is to computer programmer
as woman is to homemaker? Debiasing word
embeddings. In Advances in Neural Informa-
tion Processing Systems 29: Annual Conference
on Neural
Information Processing Systems
2016, December 5–10, 2016, Barcelona, Spain,
pages 4349–4357.
Isabel Cachola, Eric Holgate, Daniel Preot¸iuc-
Pietro, and Junyi Jessy Li. 2018. Expres-
sively vulgar: The socio-dynamics of vulgarity
and its effects on sentiment analysis
in
the 27th
social media.
International Conference on Computational
Linguistics, pages 2927–2938, Santa Fe, New
Mexico, USA. Association for Computational
Linguistics.
In Proceedings of
Aylin Caliskan, Joanna J. Bryson, and Arvind
Narayanan. 2017. Semantics derived automat-
ically from language corpora contain human-
356(6334):183–186.
like
https://doi.org/10.1126/science
.aal4230, PubMed: 28408601
Science,
biases.
Jianfeng Chi, William Shand, Yaodong Yu,
Kai-Wei Chang, Han Zhao, and Yuan Tian.
2022. Conditional supervised contrastive learn-
ing for fair text classification. ArXiv preprint,
abs/2205.11485.
Pierre Colombo, Guillaume Staerman, Nathan Noiry, and Pablo Piantanida. 2022. Learning
disentangled textual representations via statistical measures of similarity. In Proceedings
of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), pages 2614–2630, Dublin, Ireland. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2022.acl-long.187
Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs,
Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019.
Bias in bios: A case study of semantic representation bias in a high-stakes setting.
In Proceedings of the Conference on Fairness, Accountability, and Transparency,
pages 120–128. https://doi.org/10.1145/3287560.3287572
Sunipa Dev, Tao Li, Jeff M. Phillips, and
Vivek Srikumar. 2021. OSCaR: Orthogonal
subspace correction and rectification of biases
in word embeddings. In Proceedings of the
2021 Conference on Empirical Methods in Nat-
ural Language Processing, pages 5034–5050,
Online and Punta Cana, Dominican Republic.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.emnlp-main.411
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of
the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.
Maximin Coavoux, Shashi Narayan, and Shay B.
Cohen. 2018. Privacy-preserving neural rep-
resentations of text. In Proceedings of
the
2018 Conference on Empirical Methods in
Natural Language Processing, pages 1–10,
Brussels, Belgium. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D18-1001
Yupei Du, Qi Zheng, Yuanbin Wu, Man Lan, Yan
Yang, and Meirong Ma. 2022. Understanding
gender bias in knowledge base embeddings.
In Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1381–1395,
Dublin, Ireland. Association for Computational
Linguistics.
Harrison Edwards and Amos J. Storkey. 2016.
Censoring representations with an adversary.
In 4th International Conference on Learn-
ing Representations, ICLR 2016, San Juan,
Puerto Rico, May 2–4, 2016, Conference Track
Proceedings.
Yanai Elazar and Yoav Goldberg. 2018. Ad-
versarial removal of demographic attributes
from text data. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 11–21, Brussels,
Belgium. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/D18-1002
Amir Feder, Nadav Oved, Uri Shalit, and Roi
Reichart. 2021. CausaLM: Causal model expla-
nation through counterfactual language mod-
els. Computational Linguistics, 47(2):333–386.
https://doi.org/10.1162/coli_a_00404
Bjarke Felbo, Alan Mislove, Anders Søgaard,
Iyad Rahwan, and Sune Lehmann. 2017. Us-
ing millions of emoji occurrences to learn any-
domain representations for detecting sentiment,
emotion and sarcasm. In Proceedings of the
2017 Conference on Empirical Methods in Nat-
ural Language Processing, pages 1615–1625,
Copenhagen, Denmark. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/D17-1169
Joseph Fisher, Dave Palfrey, Christos Christodoulopoulos, and Arpit Mittal. 2019. Measuring
social bias in knowledge graph embeddings. ArXiv preprint, abs/1912.02761.
https://doi.org/10.18653/v1/2020.emnlp-main.595
Yaroslav Ganin, Evgeniya Ustinova, Hana
Ajakan, Pascal Germain, Hugo Larochelle,
François Laviolette, Mario Marchand, and
Victor Lempitsky. 2016. Domain-adversarial
training of neural networks. The Journal of
Machine Learning Research, 17(1):2096–2030.
Hila Gonen and Yoav Goldberg. 2019. Lipstick
on a pig: Debiasing methods cover up system-
atic gender biases in word embeddings but do
not remove them. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 609–614,
Minneapolis, Minnesota. Association for Computational Linguistics.
Xudong Han, Timothy Baldwin, and Trevor
Cohn. 2021a. Balancing out bias: Achieving
fairness through training reweighting. ArXiv
preprint, abs/2109.08253.
Xudong Han, Timothy Baldwin, and Trevor
Cohn. 2021b. Decoupling adversarial training
for fair NLP. In Findings of the Association
for Computational Linguistics: ACL-IJCNLP
2021, pages 471–477, Online. Association for
Computational Linguistics.
Xudong Han, Aili Shen, Yitong Li, Lea Frermann,
Timothy Baldwin, and Trevor Cohn. 2022.
fairlib: A unified framework for assessing and
improving classification fairness. ArXiv pre-
print, abs/2205.01876.
Xiaolei Huang. 2022. Easy adaptation to mitigate
gender bias in multilingual text classification.
In Proceedings of the 2022 Conference of the
North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, pages 717–723, Seattle,
United States. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/2022.naacl-main.52
Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas
Mikolov. 2016. FastText.zip: Compressing text classification models. ArXiv preprint,
abs/1612.03651.
Harold W. Kuhn. 1955. The Hungarian method
for the assignment problem. Naval Research
Logistics Quarterly, 2:83–97. https://doi
.org/10.1002/nav.3800020109
Yitong Li, Timothy Baldwin, and Trevor Cohn.
2018. Towards robust and privacy-preserving
text representations. In Proceedings of
the
56th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 25–30, Melbourne, Australia.
Association for Computational Linguistics.
David J. C. MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge
University Press.
Chandler May, Alex Wang, Shikha Bordia,
Samuel R. Bowman, and Rachel Rudinger.
2019. On measuring social biases in sentence
encoders. In Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 622–628, Minneapolis, Minnesota. Association for
Computational Linguistics.
Nicholas Meade, Elinor Poole-Dayan, and Siva
Reddy. 2022. An empirical survey of the ef-
fectiveness of debiasing techniques for pre-
trained language models. In Proceedings of
the 60th Annual Meeting of
the Associa-
tion for Computational Linguistics (Volume 1:
Long Papers), pages 1878–1898, Dublin,
Ireland. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2022.acl-long.132
Moin Nadeem, Anna Bethke, and Siva Reddy.
2021. StereoSet: Measuring
stereotypical
bias in pretrained language models. In Pro-
ceedings of the 59th Annual Meeting of the
Association for Computational Linguistics
and the 11th International Joint Conference
on Natural Language Processing (Volume 1:
Long Papers), pages 5356–5371, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.acl-long.416
Nikita Nangia, Clara Vania, Rasika Bhalerao, and
Samuel R. Bowman. 2020. CrowS-pairs: A
challenge dataset for measuring social biases
in masked language models. In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 1953–1967, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.emnlp-main.154
Lyle Ramshaw and Robert E. Tarjan. 2012. On
minimum-cost assignments in unbalanced bi-
partite graphs. HP Labs, Palo Alto, CA, USA,
Technical Report HPL-2012-40R1.
Shauli Ravfogel, Yanai Elazar, Hila Gonen,
Michael Twiton, and Yoav Goldberg. 2020.
Null it out: Guarding protected attributes by it-
erative nullspace projection. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 7237–7256,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.acl-main.647
Shauli Ravfogel, Michael Twiton, Yoav
Goldberg, and Ryan D. Cotterell. 2022.
Linear adversarial concept erasure. In Inter-
national Conference on Machine Learning,
pages 18400–18421. PMLR.
Beatrice Savoldi, Marco Gaido, Luisa Bentivogli,
Matteo Negri, and Marco Turchi. 2022. Un-
der the morphosyntactic lens: A multifaceted
evaluation of gender bias in speech translation.
In Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1807–1824,
Dublin,
Ireland. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2022.acl-long.127
Shun Shao, Yftah Ziser, and Shay B. Cohen. 2023.
Gold doesn’t always glitter: Spectral removal
of linear and nonlinear guarded attribute in-
formation. In Proceedings of the 17th Annual
Meeting of the European chapter of the Asso-
ciation for Computational Linguistics (EACL),
volume abs/2203.07893.
Aili Shen, Xudong Han, Trevor Cohn, Timothy
Baldwin, and Lea Frermann. 2021. Contras-
tive learning for fair representations. ArXiv
preprint, abs/2109.10645.
Gilbert W. Stewart. 1990. Perturbation theory
for the singular value decomposition, Tech-
nical Report UMIACS-90-120 / CS-TR 2539,
University of Maryland, College Park.
Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R.
Bowman. 2019. GLUE: A multi-task bench-
mark and analysis platform for natural language
understanding. In 7th International Confer-
ence on Learning Representations, ICLR 2019,
New Orleans, LA, USA, May 6–9, 2019.
OpenReview.net.
Liwen Wang, Yuanmeng Yan, Keqing He, Yanan
Wu, and Weiran Xu. 2021. Dynamically
disentangling social bias from task-oriented
representations with adversarial attack. In Pro-
ceedings of the 2021 Conference of the North
American Chapter of the Association for Com-
putational Linguistics: Human Language Tech-
nologies, pages 3740–3750, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2021.naacl-main.293
Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei
Wang, and Kai-Wei Chang. 2018. Learning
gender-neutral word embeddings. In Proceed-
ings of
the 2018 Conference on Empirical
Methods in Natural Language Processing,
pages 4847–4853, Brussels, Belgium. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/D18-1521