Erasure of Unaligned Attributes from Neural Representations

Shun Shao∗ Yftah Ziser∗ Shay B. Cohen
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh, EH8 9AB, UK

s.shao-11@inf.ed.ac.uk

yftah.ziser@inf.ed.ac.uk

scohen@inf.ed.ac.uk

Abstract

We present the Assignment-Maximization
Spectral Attribute removaL (AMSAL) algorithm,
which erases information from neural
representations when the information to be
erased is implicit rather than directly being
aligned to each input example. Our algorithm
works by alternating between two steps. In
one, it finds an assignment of the input rep-
resentations to the information to be erased,
and in the other, it creates projections of both
the input representations and the information
to be erased into a joint latent space. We test
our algorithm on an extensive array of data-
sets, including a Twitter dataset with multiple
guarded attributes, the BiasBios dataset, and
the BiasBench benchmark. The latter bench-
mark includes four datasets with various types
of protected attributes. Our results demon-
strate that bias can often be removed in our
setup. We also discuss the limitations of our
approach when there is a strong entanglement
between the main task and the information to
be erased.1

1 Introduction

Developing a methodology for adjusting neural
representations to preserve user privacy and avoid
encoding bias in them has been an active area of
research in recent years. Previous work shows it is
possible to erase undesired information from rep-
resentations so that downstream classifiers can-
not use that information in their decision-making
process. This previous work assumes that this
sensitive information (or guarded attributes, such
as gender or race) is available for each input in-
stance. These guarded attributes, however, are sen-
sitive, and obtaining them on a large scale is often
challenging and, in some cases, not feasible (Han
et al., 2021b). For example, Blodgett et al. (2016)
studied the characteristics of African-American
English on Twitter, and could not couple the
ethnicity attribute directly with the tweets they
collected due to the attribute's sensitivity.

∗ Equal contribution.
1 Our code is available at https://github.com/jasonshaoshun/AMSAL.

This paper introduces a novel debiasing setting
in which the guarded attributes are not paired
up with each input instance and an algorithm to
remove information from representations in that
setting. In our setting, we assume that each neural
input representation is coupled with a guarded at-
tribute value, but this assignment is unavailable.
In cases where the domain of the guarded attri-
bute is small (for example, with binary attributes),
this means that the guarded attribute informa-
tion consists of priors with respect to the whole
population and not instance-level information.

The intuition behind our algorithm is that if we
were to find a strong correlation between the input
variable and a set of guarded grounded attributes
either in the form of an unordered list of records
or as priors, then it is unlikely to be coincidental
if the sample size is sufficiently large (§3.5). We
implement this intuition by jointly finding pro-
jections of the input samples and the guarded
attributes into a joint embedding space and an
alignment between the two sets in that joint space.
Our resulting algorithm (§3), the Alignment-
Maximization Spectral Attribute removaL algo-
rithm (AMSAL), is a coordinate-ascent algorithm
reminiscent of the hard expectation-maximization
algorithm (hard EM; MacKay, 2003). It first loops
between two Alignment and Maximization steps,
during which it finds an alignment (A) based on
existing projections and then projects the repre-
sentations and guarded attributes into a joint space
based on an existing alignment (M). After these
two steps are iteratively repeated and an align-
ment is identified, the algorithm takes another
step to erase information from the input rep-
resentations based on the projections identified.
This step closely follows the work of Shao et al. (2023), who use
Singular Value Decomposition to remove principal directions of the
covariance matrix between the input examples and the guarded
attributes. Figure 1 depicts a sketch of our setting and the
corresponding algorithm, with $x_i$ being the input representations
and $z_j$ being the guarded attributes. Our algorithm is modular:
While our use of the algorithm of Shao et al. (2023) for the removal
step is natural due to the nature of the AM steps, a user can use any
such algorithm to erase the information from the input
representations (§3.4).

Figure 1: A depiction of the problem setting and solution. The inputs
are aligned to each guarded sample, based on strength using two
projections U and V. We solve a bipartite matching problem to find the
blue edges, and then recalculate U and V.

Our contributions are as follows: (1) We propose a new setup for
removing guarded information from neural representations where there
are few or no labeled guarded attributes; (2) We present a novel
two-stage coordinate-ascent algorithm that iteratively improves (a) an
alignment between guarded attributes and neural representations; and
(b) information removal projections.

Using an array of datasets, we perform extensive experiments to assess
how challenging our setup is and whether our algorithm is able to
remove information without having aligned guarded attributes (§4). We
find in several cases that little information is needed to align
between neural representations and their corresponding guarded
attributes. The consequence is that it is possible to erase the
information such guarded attributes provide from the neural
representations while preserving the information needed for the main
task decision-making. We also study the limitations of our algorithm
by experimenting with a setup where it is hard to distinguish between
the guarded attributes and the downstream task labels when aligning
the neural representations with the guarded attributes (§4.5).

Transactions of the Association for Computational Linguistics, vol. 11, pp. 488–510, 2023. https://doi.org/10.1162/tacl_a_00558
Action Editor: Jonathan Berant. Submission batch: 12/2022; Revision batch: 2/2023; Published 5/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

2 Problem Formulation and Notation

For an integer $n$ we denote by $[n]$ the set $\{1, \ldots, n\}$. For
a vector $v$, we denote by $\|v\|_2$ its $\ell_2$ norm. For two
vectors $v$ and $u$, by default in column form,
$\langle v, u \rangle = v^\top u$ (dot product). Matrices and vectors
are in boldface font (with uppercase or lowercase letters,
respectively). Random variable vectors are also denoted by boldface
uppercase letters. For a matrix $A$, we denote by $a_{ij}$ the value
of cell $(i, j)$. The Frobenius norm of a matrix $A$ is
$\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$. The spectral norm of a matrix
is $\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2$. The expectation of a
random variable $T$ is denoted by $E[T]$.

In our problem formulation, we assume three random variables:
$X \in \mathbb{R}^{d}$, $Y \in \mathbb{R}$, and
$Z \in \mathbb{R}^{d'}$, such that $d' \le d$ and the expectation of
all three variables is 0 (see Shao et al., 2023). Samples of X are the
inputs for a classifier to predict corresponding samples of Y. The
random vector Z represents the guarded attributes. We want to maintain
the ability to predict Y from X, while minimizing the ability to
predict Z from X.

We assume n samples of (X, Y) and m samples of Z, denoted by
$(x^{(i)}, y^{(i)})$ for $i \in [n]$, and $z^{(i)}$ for $i \in [m]$
($m \le n$). While originally, these samples were generated jointly
from the underlying distribution p(X, Y, Z), we assume a shuffling of
the Z samples in such a way that we are only left with m samples that
are unique (no repetitions) and an underlying unknown many-to-one
mapping $\pi : [n] \to [m]$ that maps each $x^{(i)}$ to its original
$z^{(j)}$.

The problem formulation is such that we need to remove the information
from the xs in such a way that we consider the samples of zs as a set.
In our case, we do so by iterating between trying to infer π, and then
using standard techniques to remove the information from xs based on
their alignment to the corresponding zs.

Singular Value Decomposition Let $A = E[XZ^\top]$, the matrix of
cross-covariance between X and Z. This means that
$A_{ij} = \mathrm{Cov}(X_i, Z_j)$ for $i \in [d]$ and $j \in [d']$.

For any two vectors, $a \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{d'}$,
the following holds due to the linearity of expectation:

$$a^\top A b = \mathrm{Cov}(a^\top X, b^\top Z). \tag{1}$$
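
As a worked check of Eq. 1, and relying only on the zero-mean assumption stated in the problem formulation, the identity can be unpacked in three steps:

```latex
\begin{align*}
\mathrm{Cov}(a^\top X, b^\top Z)
  &= E\big[(a^\top X)(b^\top Z)\big]
     && \text{since } E[X] = 0 \text{ and } E[Z] = 0 \\
  &= E\big[a^\top X Z^\top b\big]
     && \text{since } b^\top Z \text{ is a scalar, } b^\top Z = Z^\top b \\
  &= a^\top E[X Z^\top]\, b = a^\top A b
     && \text{by linearity of expectation.}
\end{align*}
```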

Singular value decomposition on A, in this case, finds the "principal
directions": directions in which the projections of X and Z maximize
their covariance. The projections are represented as two matrices
$U \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{d' \times d'}$.
Each column in these matrices plays the role of the vectors a and b in
Eq. 1. SVD finds U and V such that for any $i \in [d']$ it holds that:

$$\mathrm{Cov}(U_i^\top X, V_i^\top Z) = \max_{(a,b) \in O_i} \mathrm{Cov}(a^\top X, b^\top Z),$$

where $O_i$ is the set of pairs of vectors (a, b) such that
$\|a\|_2 = \|b\|_2 = 1$, a is orthogonal to $U_1, \ldots, U_{i-1}$ and
similarly, b is orthogonal to $V_1, \ldots, V_{i-1}$.

Shao et al. (2023) showed that SVD in this
form can be used to debias representations. We
calculate SVD between X and Z and then prune
out the principal directions that denote the high-
est covariance. We will use their method, SAL
(Spectral Attribute removaL), in the rest of the
paper. See also §3.4.
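
To make the notion of principal directions concrete, the following is a minimal NumPy sketch on synthetic data. The toy data, the sizes, and the helper `random_unit` are assumptions made for illustration and are not taken from the paper. It checks that the empirical covariance of the projections onto the first pair of singular vectors equals the top singular value of the empirical cross-covariance, and that random unit directions do not exceed it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_prime = 500, 6, 2

# Toy data (assumption): Z leaks into the first two coordinates of X.
Z = rng.normal(size=(n, d_prime))
X = rng.normal(size=(n, d))
X[:, :d_prime] += 2.0 * Z

# Center both variables, since the formulation assumes zero means.
X -= X.mean(axis=0)
Z -= Z.mean(axis=0)

# Empirical cross-covariance A (d x d') and its SVD.
A = X.T @ Z / n
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Covariance of the projections onto the first pair of singular vectors.
u1, v1 = U[:, 0], Vt[0]
top_cov = np.mean((X @ u1) * (Z @ v1))


def random_unit(dim):
    """Draw a random unit vector of the given dimension."""
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)


# Covariance along random unit directions, for comparison.
random_covs = [random_unit(d) @ A @ random_unit(d_prime) for _ in range(1000)]
```

Here `top_cov` coincides with `s[0]`, which is the empirical counterpart of the maximization over $O_1$.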

3 Methodology

We view the problem of information removal
with unaligned samples as a joint optimization
problem of: (a) finding the alignment; (b) find-
ing the projection that maximizes the covariance
between the alignments, and using its comple-
ment to project the inputs. Such an optimization,
in principle, is intractable, so we break it down
into two coordinate-ascent style steps: A-step (in
which the alignment is identified as a bipartite
graph matching problem) and M-step (in which
based on the previously identified alignment,
a maximal-covariance projection is calculated).
Formally, the maximization problem we solve is:

$$(U, V, \pi) = \arg\max_{U, V, \pi} \sum_{i=1}^{n} (x^{(i)})^\top U V^\top z^{(\pi(i))}, \tag{2}$$

where we constrain U and V to be matrices with orthonormal columns,
$U \in \mathbb{R}^{d \times k}$ and $V \in \mathbb{R}^{d' \times k}$.

Figure 2: The main Assignment-Maximization Spectral
Attribute removaL (AMSAL) algorithm for removal
of information without alignment between samples of
X and Z.

Note that the sum in the above equation has
a term per pair $(x^{(i)}, z^{(\pi(i))})$, which enables us
to frame the A-step as an integer linear programming
(ILP) problem (§3.1). The full algorithm is
given in Figure 2, and we proceed in the next two
subsections to further explain the A-step and the M-step.
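
The alternation between the two steps can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: the A-step here is an unconstrained row-wise argmax (it ignores the lower and upper bounds of §3.1), rows of `X` and `Z` hold the samples, and the function names `a_step`, `m_step`, and `am_loop` are hypothetical.

```python
import numpy as np


def m_step(X, Z, pi, k):
    """Given an assignment pi, return top-k projections and the objective."""
    omega = X.T @ Z[pi]                      # d x d' matrix, sum_i x_i z_{pi(i)}^T
    U, s, Vt = np.linalg.svd(omega, full_matrices=False)
    return U[:, :k], Vt[:k].T, s[:k].sum()   # objective = sum of top-k singular values


def a_step(X, Z, U, V):
    """Given projections, assign each x_i to its best-scoring z_j (no bounds)."""
    scores = (X @ U) @ (Z @ V).T             # scores[i, j] = <U^T x_i, V^T z_j>
    return scores.argmax(axis=1)


def am_loop(X, Z, k, n_iter=50, seed=0):
    """Alternate A- and M-steps from a random initial assignment."""
    rng = np.random.default_rng(seed)
    pi = rng.integers(0, Z.shape[0], size=X.shape[0])
    U, V, obj = m_step(X, Z, pi, k)
    history = [obj]
    for _ in range(n_iter):
        new_pi = a_step(X, Z, U, V)
        if np.array_equal(new_pi, pi):       # assignment is stable: stop
            break
        pi = new_pi
        U, V, obj = m_step(X, Z, pi, k)
        history.append(obj)
    return U, V, pi, history


# Toy usage (assumption): two guarded records and two clusters of inputs.
rng = np.random.default_rng(0)
Z = np.array([[1.0, 0.0], [0.0, 1.0]])
hidden = rng.integers(0, 2, size=60)
X = rng.normal(size=(60, 5)) + 3.0 * np.where(hidden[:, None] == 0, 1.0, -1.0)
U, V, pi, history = am_loop(X, Z, k=1)
```

Because each step maximizes the objective of Eq. 2 over one block of variables while the other is fixed, the values recorded in `history` never decrease, which is the coordinate-ascent property the text refers to.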

3.1 A-step (Guarded Sample Assignment)

In the Assignment Step, we are required to find
a many-to-one alignment π : [n] → [m] between
{x(1), . . . , x(n)} and {z(1), . . . , z(m)}. Given U
and V from the previous M-step, we can find
such an assignment by solving the following opti-
mization problem:

$$\arg\max_{\pi} \sum_{i=1}^{n} \langle U^\top x^{(i)}, V^\top z^{(\pi(i))} \rangle.$$

This maximization problem can be formulated
as an integer linear program of the following form:

$$
\begin{aligned}
\max_{P \in \{0,1\}^{n \times m}} \quad & \sum_{j=1}^{m} \sum_{i=1}^{n} p_{ij} \langle U^\top x^{(i)}, V^\top z^{(j)} \rangle \\
\text{s.t.} \quad & \forall i.\ \sum_{j=1}^{m} p_{ij} = 1, \\
& \forall j.\ b_{0j} \le \sum_{i=1}^{n} p_{ij} \le b_{1j}.
\end{aligned}
\tag{3}
$$


This is an instance of an assignment problem
(Kuhn, 1955; Ramshaw and Tarjan, 2012),
where pij denotes whether x(i) is associated with
the (type of) guarded attribute z(j). The values
(b0j, b1j) determine lower and upper bounds on
the number of xs a given z(j) can be assigned
to. While a standard assignment problem can be
solved efficiently using the Hungarian method of
Kuhn (1955), we choose to use the ILP formu-
lation, as it enables us to have more freedom in
adding constraints to the problem, such as the
lower and upper bounds.
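
To illustrate what the bounds in Eq. 3 change, here is a brute-force sketch that enumerates every assignment for a toy score matrix and keeps the best feasible one. It is only practical for very small n and m; it stands in for an ILP solver and is not the solver used in the paper. The function name `constrained_assignment` and the example scores are hypothetical.

```python
from itertools import product


def constrained_assignment(scores, lower, upper):
    """Maximize sum_i scores[i][pi[i]] subject to per-column count bounds.

    scores[i][j] plays the role of <U^T x_i, V^T z_j>;
    lower[j] and upper[j] play the roles of b_0j and b_1j in Eq. 3.
    """
    n, m = len(scores), len(scores[0])
    best_pi, best_val = None, float("-inf")
    for pi in product(range(m), repeat=n):       # every many-to-one mapping
        counts = [pi.count(j) for j in range(m)]
        feasible = all(lo <= c <= hi for c, lo, hi in zip(counts, lower, upper))
        if not feasible:
            continue
        val = sum(scores[i][pi[i]] for i in range(n))
        if val > best_val:
            best_pi, best_val = pi, val
    return best_pi, best_val


# Without bounds, every row would pick column 0.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
pi, value = constrained_assignment(scores, lower=[1, 1], upper=[3, 3])
```

With a lower bound of one per column, the row that loses the least by switching (the last one) is moved to column 1, which is the kind of behavior the prior-based bounds are meant to enforce.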

3.2 M-step (Covariance Maximization)

The result of an A-step is an assignment π such
that π(i) = j implies $x^{(i)}$ was deemed as aligned
to $z^{(j)}$. With that π in mind, we define the following
empirical covariance matrix $\Omega_\pi \in \mathbb{R}^{d \times d'}$:

$$\Omega_\pi = \sum_{i=1}^{n} x^{(i)} \left( z^{(\pi(i))} \right)^\top. \tag{4}$$


We then apply SVD on Ωπ to get new U and
V that are used in the next iteration of the algo-
rithm with the A-step, if the algorithm continues
to run. When the maximal number of iterations is
reached, we follow the work of Shao et al. (2023)
in using a truncated part of U to remove the in-
formation from the xs. We do that by projecting
x(i) using the singular vectors of U with the
smallest singular values. These projected vectors
co-vary the least with the guarded attributes, as-
suming the assignment in the last A-step was pre-
cise. This method has been shown by Shao
et al. (2023) to be highly effective and efficient in
debiasing neural representations.
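
The removal step described above can be sketched as follows, assuming the assignment has already been applied so that row i of `Z_aligned` is the guarded record matched to row i of `X`. The function name `removal_projection` is hypothetical, and mapping the result back into the original d dimensions (rather than into a reduced space) is an assumption of this sketch.

```python
import numpy as np


def removal_projection(X, Z_aligned, n_remove):
    """Projection onto the left singular directions left after dropping the top ones."""
    Xc = X - X.mean(axis=0)
    Zc = Z_aligned - Z_aligned.mean(axis=0)
    omega = Xc.T @ Zc                          # empirical cross-covariance (unnormalized)
    U, _, _ = np.linalg.svd(omega, full_matrices=True)
    U_keep = U[:, n_remove:]                   # directions with the smallest singular values
    return U_keep @ U_keep.T                   # d x d projection matrix


# Toy usage (assumption): a binary guarded attribute leaking into one coordinate.
rng = np.random.default_rng(0)
z = rng.choice([-1.0, 1.0], size=(200, 1))
X = rng.normal(size=(200, 4))
X[:, 0] += 3.0 * z[:, 0]

P = removal_projection(X, z, n_remove=1)
X_clean = (X - X.mean(axis=0)) @ P
residual = X_clean.T @ (z - z.mean(axis=0))    # cross-covariance after removal
```

In this toy case the cross-covariance matrix has rank one, so removing a single direction drives `residual` to zero up to floating-point error.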

3.3 A Matrix Formulation of the AM Steps

Let e1, . . . , em be the standard basis vectors. This
means ei is a vector of length m with 0 in all
coordinates except for the ith coordinate, where
it is 1.

Let $\mathcal{E}$ be the set of all matrices $E \in \mathbb{R}^{n \times m}$
in which each row is one of $e_i$, $i \in [m]$. In that case,
$EZ^\top$ is an $n \times d'$ matrix, such that the jth row is a copy
of the ith column of $Z \in \mathbb{R}^{d' \times m}$, where $e_i$ is
the jth row of E. Therefore, the AM steps can be viewed as solving the
following maximization problem using coordinate ascent:

$$\arg\max_{E \in \mathcal{E},\, U,\, V,\, \Sigma} \left\| U^\top \Sigma V - X E Z^\top \right\|_F^2,$$

where U, V are orthonormal matrices, and Σ is
a diagonal matrix with non-negative elements.
This corresponds to the SVD of the matrix
$X E Z^\top$.

In that case, the matrix E can be directly
mapped to an assignment in the form of π, where
π(i) would be the j such that the jth coordinate
in the ith row of E is non-zero.
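
The correspondence between E and π can be checked numerically. In this sketch the sizes and the random data are assumptions, X stores the inputs as columns (d by n), and Z stores the m unique guarded records as columns (d' by m), so that the product with the n by m matrix E is well defined.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, d_prime = 6, 3, 4, 2
X = rng.normal(size=(d, n))          # columns are x^(i)
Z = rng.normal(size=(d_prime, m))    # columns are z^(j)
pi = np.array([0, 2, 1, 1, 0, 2])    # a many-to-one assignment [n] -> [m]

# Row i of E is the standard basis vector e_{pi(i)}.
E = np.eye(m)[pi]

# The matrix product equals the sum of outer products in Eq. 4.
omega_matrix = X @ E @ Z.T
omega_sum = sum(np.outer(X[:, i], Z[:, pi[i]]) for i in range(n))

# The assignment is recovered from the position of the non-zero entry per row.
pi_recovered = E.argmax(axis=1)
```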

3.4 Removal Algorithm

The AM steps are best suited for the removal of
information through SVD with an algorithm such
as SAL. This is because the AM steps optimize
an objective of the same type as SAL, relying on
the projections U and V to project the inputs and
guarded representations into a joint space. How-
ever, a by-product of the algorithm in Figure 2 is
an assignment function π that aligns between the
inputs and the guarded representations.

With that assignment, other removal algo-
rithms can be used, for example, the algorithm of
Ravfogel et al. (2020). We experiment with this
idea in §4.

3.5 Justification of the AM Steps

We next provide a justification of our algorithm
(which may be skipped on a first reading). Our
justification is based on the observation that if
indeed X and Z are linked together (this con-
nection is formalized as a latent variable in their
joint distribution), then for a given sample that is
permuted, the singular values of Ω will be larger
the closer the permutation is to the identity per-
mutation. This justifies finding such a permuta-
tion that maximizes the singular values in an
SVD of Ω.
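
The observation can be probed empirically on synthetic data. The sketch below assumes a toy latent-variable model (a binary latent variable that shifts the means of both X and Z); the model, the noise levels, and the helper `nuclear_norm` are illustrative assumptions and do not reproduce the paper's analysis.

```python
import numpy as np


def nuclear_norm(M):
    """Sum of the singular values of M."""
    return np.linalg.svd(M, compute_uv=False).sum()


rng = np.random.default_rng(0)
n, d, d_prime = 200, 5, 2

# A binary latent variable H links X and Z (toy assumption).
h = rng.integers(0, 2, size=n)
sign = np.where(h == 0, 1.0, -1.0)[:, None]
X = 2.0 * sign + rng.normal(size=(n, d))
Z = 1.0 * sign + 0.3 * rng.normal(size=(n, d_prime))
X -= X.mean(axis=0)
Z -= Z.mean(axis=0)

# Sum of singular values under the true pairing versus random permutations.
identity_value = nuclear_norm(X.T @ Z)
shuffled_values = [
    nuclear_norm(X.T @ Z[rng.permutation(n)]) for _ in range(100)
]
```

In this setup the value under the true pairing exceeds the value under every sampled permutation, in line with the informal claim made in the rest of this section.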

More Details Let ι : [n] → [n] be the identity
permutation, ι(i) = i. We will assume the case
in which n = m (but the justification can be
generalized to the case m < n), and that the underlying joint
distribution p(X, Z) is mediated by a latent variable H, such that

$$p(X, Z, H) = p(H)\, p(X \mid H)\, p(Z \mid H). \tag{5}$$

This implies there is a latent variable that connects X and Z, and
that the joint distribution p(X, Z) is a mixture through H.

Proposition 1 (informal). Let $\{(x^{(i)}, z^{(i)})\}$ be a sample of
size n from the distribution in Eq. 5. Let π be a permutation over [n]
uniformly sampled from the set of permutations. Then with high
probability, the sum of the singular values of $\Omega_\pi$ is smaller
than the sum of singular values under $\Omega_\iota$.

For full details of this claim, see Appendix A.

4 Experiments

In our experiments, we test several combinations of algorithms. We use
k-means (KMEANS) as a baseline that substitutes for the AM steps when
assigning xs to zs. In addition, for the removal step (once an
assignment has been identified), we test two algorithms: SAL (Shao et
al., 2023; resulting in AMSAL) and INLP (Ravfogel et al., 2020). We
also compare these two algorithms in oracle mode (in which the
assignment of guarded attributes to inputs is known), to see the loss
in performance that happens due to noisy assignments from the AM or
k-means algorithm (ORACLESAL and ORACLEINLP).

When running the AM algorithm or k-means, we execute it with three
random seeds (see also §4.6) for a maximum of a hundred iterations and
choose the projection matrix with the largest objective value over all
seeds and iterations. For the slack variables ($b_{0j}$ and $b_{1j}$
in Eq. 3), we use 20%–30% above and below the baseline of the guarded
attribute priors according to the training set.
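
One way to turn training-set priors into the bounds of Eq. 3 is sketched below. The function name `slack_bounds`, the symmetric slack, and the choice to round the lower bound down and the upper bound up are assumptions of this sketch; the text only states that the bounds lie 20%–30% around the prior.

```python
from collections import Counter
from fractions import Fraction
from math import ceil, floor


def slack_bounds(train_labels, n, slack=Fraction(1, 5)):
    """Return {value: (b0, b1)} for assigning n inputs, given training labels.

    The expected count of each guarded value is its training prior times n;
    the bounds sit a fraction `slack` below and above that expected count.
    """
    counts = Counter(train_labels)
    total = sum(counts.values())
    bounds = {}
    for value, count in counts.items():
        expected = Fraction(count, total) * n   # exact arithmetic, no float drift
        bounds[value] = (
            floor(expected * (1 - slack)),
            ceil(expected * (1 + slack)),
        )
    return bounds


# Toy usage: priors of 0.3 and 0.7, bounds for 200 inputs with 20% slack.
bounds = slack_bounds(["a"] * 30 + ["b"] * 70, n=200)
```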
With the SAL methods, we remove the number of directions according to
the rank of the Ω matrix (between 2 and 6 in all experiments overall).

In addition, we experiment with a partially supervised assignment
process, in which a small seed dataset of aligned xs and zs is
provided to the AM steps. We use it for model selection: Rather than
choosing the assignment with the highest SVD objective value, we
choose the assignment with the highest accuracy on this seed dataset.
We refer to this setting as PARTIAL (for "partially supervised
assignment").

Finally, in the case of a gender-protected attribute, we compare our
results against a baseline in which the input x is compared against a
list of words stereotypically associated with the genders of male or
female.2 Based on the overlap with these two lists, we heuristically
assign the gender label to x and then run SAL or INLP (rather than
using the AM algorithm). While this word-list heuristic is plausible
in the case of gender, it is not as easy to derive in the case of
other protected attributes, such as age or race. We give the results
for this baseline using the marker WL in the corresponding tables.

2 https://tinyurl.com/33bzddtw.

Main Findings Our overall main finding shows that our novel setting in
which guarded information is erased from individually unaligned
representations is viable. We discovered that AM methods perform
particularly well when dealing with more complex bias removal
scenarios, such as when multiple guarded attributes are present. We
also found that having similar priors for the guarded attributes and
downstream task labels may lead to poor performance on the task at
hand. In these cases, using a small amount of supervision often
effectively helps reduce bias while maintaining the utility of the
representations for the main classification or regression problem.
Finally, our analysis of alignment stability shows that our AM
algorithm often converges to suitable solutions that align X with Z.

Due to the unsupervised nature of our problem setting, we advise
validating the utility of our method in the following way. Once we run
the AM algorithm, we check whether there is a high-accuracy alignment
between X and Y (rather than Z, which is unavailable). If this
alignment is accurate, then we run the risk of significantly damaging
task performance. An example is given in §4.5.

4.1 Word Embedding Debiasing

As a preliminary assessment of our setup and algorithms, we apply our
methods to GloVe word embeddings to remove gender bias, and follow the
previous experiment settings of this problem (Bolukbasi et al., 2016;
Ravfogel et al., 2020; Shao et al., 2023). We considered only the
150,000 most common words to ensure the embedding quality and omitted
the rest. We sort the remaining embeddings by their projection on the
$\overrightarrow{he} - \overrightarrow{she}$ direction. Then we
consider the top 7,500 word embeddings as male-associated words
(z = 1) and the bottom 7,500 as female-associated words (z = −1).

Our findings are that both the k-means and the AM algorithms perfectly
identify the alignment between the word embeddings and their
associated gender label (100%). Indeed, the dataset construction
itself follows a natural perfect clustering that these algorithms
easily discover. Since the alignments are perfectly identified, the
results of predicting the gender from the word embeddings after
removal are identical to the oracle case. These results are quite
close to the results of a random guess, and we refer the reader to
Shao et al. (2023) for details on experiments with SAL and INLP for
this dataset. Considering Figure 3, it is evident that our algorithm
essentially follows a natural clustering of the word embeddings into
two clusters, female and male, as the embeddings are highly separable
in this case. This is why the alignment score of X (embedding) to Z
(gender) is perfect in this case. This finding indicates that this
standard word embedding dataset used for debiasing is trivial to
debias: debiasing can be done even without knowing the identity of the
stereotypical gender associated with each word.

Figure 3: A t-SNE visualization of the word embeddings before and
after gender information removal. In (a) we see the embeddings
naturally cluster into the corresponding gender.

4.2 BiasBios Results

De-Arteaga et al. (2019) presented the BiasBios dataset, which
consists of self-provided biographies paired with the profession and
gender of their authors. A list of pronouns and names is used to
obtain the authors' gender automatically. They aim to expose the
caveats of automated hiring systems by showing that even the simple
task of predicting a candidate's profession can be affected by the
candidate's gender, which is encoded in the biography representation.
For example, we want to avoid a situation in which being identified as
"he" or "she" in a biography affects the likelihood of being
classified as an engineer or a teacher.

We follow the setup of De-Arteaga et al. (2019), predicting a
candidate's profession (y), based on a self-provided short biography
(x), aiming to remove any information about the candidate's gender
(z). Due to computational constraints, we use only 30K random examples
to learn the projections with both SAL and INLP (whether in the
unaligned or aligned setting). For the classification problem, we use
the full dataset. To obtain vector representations for the
biographies, we use two different encoders, FastText word embeddings
(Joulin et al., 2016), and BERT (Devlin et al., 2019). We stack a
multi-class classifier on top of these representations, as there are
28 different professions. We use 20% of the training examples for the
PARTIAL setting. For BERT, we followed De-Arteaga et al. (2019) in
using the last CLS token state as the representation of the whole
biography. We used the BERT model bert-base-uncased.

Evaluation Measures We use an extension of the True Positive Rate
(TPR) gap, the root mean square (RMS) TPR gap of all classes, for
evaluating bias in a multiclass setting. This metric was suggested by
De-Arteaga et al. (2019), who demonstrated it is significantly
correlated with gender imbalances, which often lead to unfair
classification. The higher the metric value is, the bigger the gap
between the two categories (for example, between male and female) for
the specific main task prediction. For the profession classification,
we report accuracy.

Results Table 1 provides the results for the biography dataset. We see
that INLP significantly reduces the TPR-GAP in all settings, but this
comes at a cost: The representations are significantly less useful for
the main task of predicting the profession. When inspecting the
alignments, we observe that their accuracy is quite high with BERT:
100% with k-means, 85% with the AM algorithm, and 99% with PARTIAL AM.
For FastText, the results are lower, hovering around 55% for all three
methods. The high BERT assignment performance indicates that the
BiasBios BERT representations are naturally separated by gender. We
also observe that the results of WL+SAL and WL+INLP are
correspondingly identical to Oracle+SAL and Oracle+INLP. This comes as
no surprise, as the gender label is derived from a similar word list,
which enables the WL approach to get a nearly perfect alignment (over
96% agreement with the gender label).

Table 1: BiasBios dataset results. The top part uses BERT embeddings
to encode the biographies, while the bottom part uses FastText
embeddings.

4.3 BiasBench Results

Meade et al. (2022) conducted an empirical study of an array of
datasets in the context of debiasing. They analyzed different methods
and tasks, and we follow their benchmark evaluation to assess our
AMSAL algorithm and other methods in the context of our new setting.
We include a short description of the datasets we use in this section.
We include full results in Appendix B, with a description of other
datasets. We also encourage the reader to refer to Meade et al. (2022)
for details on this benchmark. We use 20% of the training examples for
the PARTIAL setting.

StereoSet (Nadeem et al., 2021) This dataset presents a word
completion test for a language model, where the completion can be
stereotypical or non-stereotypical. The bias is then measured by
calculating how often a model prefers the stereotypical completion
over the non-stereotypical one. Nadeem et al. (2021) introduced the
language model score to measure the language model usability, which is
the percentage of examples for which a model prefers the stereotypical
or non-stereotypical word over some unrelated word.

CrowS-Pairs (Nangia et al., 2020) This dataset includes pairs of
sentences that are minimally different at the token level, but these
differences lead to the sentence being either stereotypical or
anti-stereotypical. The assessment measures how many times a language
model prefers the stereotypical element in a pair over the
anti-stereotypical element.

Results We start with an assessment of the BERT model for the
CrowS-Pairs gender, race, and religion bias evaluation (Table 2). We
observe that all approaches for gender, except AM+INLP, reduce the
stereotype score. Race and religion are more difficult to debias in
the case of BERT. INLP with k-means works best when no seed alignment
data is provided at all, but when we consider PARTIALSAL, in which we
use the alignment algorithm with some seed aligned data, we see that
the results are the strongest. When we consider the RoBERTa model, the
results are similar, with PARTIALSAL significantly reducing the bias.
Our findings from Table 2 overall indicate that the ability to debias
a representation highly depends on the model that generates the
representation. In Table 10 we observe that the representations, on
average, are not damaged for most GLUE tasks. Additional analysis,
included in the appendix of the full version, supports the same
conclusion.

Table 2: (a) CrowS-Pairs Gender stereotype scores (Stt. score) in
language models debiased by different debiasing techniques and
assignment; (b) CrowS-Pairs Race stereotype scores; (c) CrowS-Pairs
Religion stereotype scores. All models are deemed least biased if the
stereotype score is 50%. The colored numbers are calculated as
$\big|\, |b - 50| - |s - 50| \,\big|$ where b is the top row score and
s is the corresponding system score.

As Meade et al.
(2022) have noted, when chang- ing the representations of a language model to remove bias, we might cause such adjustments that damage the usability of the language model. To test which methods possibly cause such an issue, we also assess the language model score on the StereoSet dataset in Table 3. We overall see that often SAL-based methods give a lower stereotype score, while INLP methods more sig- nificantly damage the language model score. This implies that the SAL-based methods remove bias 495 baseline task performance almost Appendix B. in full. See 4.4 Multiple-Guarded Attribute Sentiment We hypothesize that AM-based methods are bet- ter suited for setups where multiple guarded at- tributes should be removed, as they allow us to target several guarded attributes with different priors. To examine our hypothesis, we experi- ment with a dataset curated from Twitter (tweets encoded using BERT, bert-base-uncased), in which users are surveyed for their age and gender (Cachola et al., 2018). We bucket the age into three groups (0–25, 26–50, and above 50). Tweets in this dataset are annotated with their sentiment, ranging from 1 (very negative) to 5 (very positive). The dataset consists of more than 6,400 tweets written by more than 1,700 users. We removed users who no longer have public Twitter accounts and users with locations that do not exist based on a filter,3 resulting in a dataset with over 3,000 tweets, written by 817 unique us- ers. As tweets are short by nature and their num- ber is relatively small, the debiasing signal in this dataset (the amount of information it contains about the guarded attributes) might not be suffi- cient for the attribute removal. To amplify this signal, we concatenated each tweet in the dataset to at most ten other tweets from the same user. We study the relationship between the main task of sentiment detection and the two protected attributes of age and gender. 
As a protected at- tribute z, we use the combination of both age and gender as a binary one-hot vector. This dataset presents a use-case for our algorithm of a com- posed protected attribute. Rather than using a classifier for predicting the sentiment, we use lin- ear regression. Following Cachola et al. (2018), we use Mean Absolute Error (MAE) to report the error of the sentiment predictions. Given that the sentiment is predicted as a continuous value, we cannot use the TPR gap as in previous sections. Rather, we use the following formula: MAEGap = std(MADz=j | j ∈ [m]), (6) 3We used a list of cities, counties, and states in the United States, taken from https://tinyurl.com /4kmc6pyn. All users were in the United States when the data was collected by the original curators. Table 3: StereoSet stereotype scores (Stt. Score) and language modeling scores (LM Score) for the gender category. Stereotype scores indicate the least bias at 50% and the LM scores indicate high usability at 100%. effectively while less significantly harming the usability of the language model representations. We also conducted comprehensive results for other datasets (SEAT and GLUE) and categories of bias (based on race and religion). The results, especially for GLUE, demonstrate the effective- ness of our method of unaligned information removal. For GLUE, we consistently retain the 496 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 5 8 2 1 1 0 6 0 2 / / t l a c _ a _ 0 0 5 5 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Table 4: MAE and debiasing gap values on the Twitter dataset, when using BERT to encode the tweets. For age and gender, we give the MAE gap as in Eq. 5. 
In Eq. 6, $\mathrm{MAD}_{z=j} = \frac{1}{\ell} \sum_i |\eta_{ij} - \mu_j|$, where $i$ ranges over the set of size $\ell$ of examples with protected attribute value $j$, $\mu_j$ is the average of the absolute Y prediction error for that set, and $\eta_{ij}$ is the absolute error of example $i$.4 The function std in this case indicates the standard deviation of the m values of $\mathrm{MAD}_{z=j}$, j ∈ [m].

4The absolute error of prediction a with true value b is |a − b|.

Results Table 4 presents our results. Overall, AMSAL reduces the gender and age gap in the predictions while not increasing the MAE by much. In addition, we can see that both AM-based methods outperform their k-means counterparts, which increase unfairness (KMEANS + INLP) or significantly harm the downstream-task performance (KMEANS + SAL). We also consider Figure 4, which shows how the quality of the assignments of the AM algorithm changes as a function of the labeled data used. As expected, the more labeled data we have, the more accurate the assignments are, but the differences are not very large.

Figure 4: Accuracy of the AM steps with respect to age and gender separately (on unseen data), as a function of the fraction of the labeled dataset used by the AM algorithm.

4.5 An Example of Our Method Limitations

We now present the main limitation in our approach and setting. This limitation arises when the random variables Y and Z are not easily distinguishable through information about X. We experiment with a binary sentiment analysis (y) task, predicted on users' tweets (x), aiming to remove information regarding the authors' ethnic affiliations. To do so, we use a dataset collected by Blodgett et al. (2016), which examined the differences between African-American English speakers and Standard American English speakers. As information about one's ethnicity is hard to obtain, the user's geolocation information was used to create a distantly supervised mapping between authors and their ethnic affiliations.
We follow previous work (Shao et al., 2023; Ravfogel et al., 2020) and use the DeepMoji encoder (Felbo et al., 2017) to obtain representations for the tweets. The train and test sets are balanced regarding sentiment and authors' ethnicity. We use 20% of the examples for the PARTIAL setting. Table 5 gives the results for this dataset. We observe that the removal with the assignment (k-means, AM, or PARTIAL) significantly harms the performance on the main task and reduces it to a random guess. This presents a limitation of our algorithm. A priori, there is no distinction between Y and Z, as our method is unsupervised. In addition, the positive labels of Y and Z have the same prior probability. Indeed, when we check the assignment accuracy in the sentiment dataset, we observe that the k-means, AM, and PARTIAL AM assignment accuracy for identifying Z is between 0.55 and 0.59. If we check the assignment against Y, we get an accuracy between 0.74 and 0.76. This means that all assignment algorithms actually identify Y rather than Z (both Y and Z are binary variables in this case). The conclusion from this is that our algorithm works best when sufficient information on Z is presented such that it can provide a basis for aligning samples of Z with samples of X. Suppose such information is unavailable or unidentifiable with information regarding Y. In that case, we may simply identify the natural clustering of X according to their main task classes, leading to low main-task performance.

Table 5: The performance of removing race information from the DeepMoji dataset is shown for two cases: with balanced ratios of race and sentiment (left) and with ratios of 0.8 for sentiment and 0.5 for race (right). In both cases, the total size of the dataset used is 30,000 examples. To evaluate the performance of the unbalanced sentiment dataset, we use the F1 macro measure, because in an unbalanced dataset such as this one, a simple classifier that always returns one label will achieve an accuracy of 80%. Such a classifier would have an F1 macro score of $0.\dot{4}$ (approximately 0.444).

In Table 5, we observe that this behavior is significantly mitigated when the priors over the sentiment and the race are different (0.8 for sentiment and 0.5 for race). In that case, the AM algorithm is able to distinguish between the race-protected attribute (z) and the sentiment class (y) quite consistently with INLP and SAL, and the gap is reduced.

We also observe that INLP changed neither the accuracy nor the TPR-GAP for the balanced scenario (Table 5) when using a k-means assignment or an AM assignment. Upon inspection, we found out that INLP returns an identity projection in these cases, unable to amplify the relatively weak signal in the assignment to change the representations.

4.6 Stability Analysis of the Alignment

In Figure 5, we plot the accuracy of the alignment algorithm (knowing the true value of the guarded attribute per input) throughout the execution of the AM steps for the first ten iterations. The shaded area indicates one standard deviation. We observe that the first few iterations are the ones in which the accuracy improves the most. For most of the datasets, the accuracy does not decrease between iterations, though in the case of DeepMoji we do observe a “bump.” This is indeed why the PARTIAL setting of our algorithm, where a small amount of guarded information is available to determine at which iteration to stop the AM algorithm, is important. In the word embeddings case, the variance is larger because, in certain executions, the algorithm converged quickly, while in others, it took more iterations to converge to high accuracy.

Figure 5: Accuracy of the AM steps (in identifying the correct assignment of inputs to guarded information) as a function of the iteration number. Shaded gray gives upper and lower bound on the standard deviation over five runs with different seeds for the initial π. FastText refers to the BiasBios dataset, the BERT models are for the CrowS-Pairs dataset and Emb. refers to the word embeddings dataset from §4.1.

Figure 6 plots the relative change of the objective value of the ILP from §3.1 against iteration number. The relative change is defined as the ratio of the objective value at a given iteration to its value before the algorithm begins. We see that there is a relative stability of the algorithm and that the AM steps converge quite quickly. We also observe the DeepMoji dataset has a large increase in the objective value in the first iteration (around ×5 compared to the value the algorithm starts with), after which it remains stable.

Figure 6: Ratio of the objective value in iteration t and iteration 0 of the ILP for the AM steps as a function of the iteration number t. Shaded gray gives upper and lower bound on the standard deviation over five runs with different seeds for the initial π. See legend explanation in Figure 5.

5 Related Work

There has been an increasing amount of work on detecting and erasing undesired or protected information from neural representations, with standard software packages for this process having been developed (Han et al., 2022). For example, in their seminal work, Bolukbasi et al. (2016) showed that word embeddings exhibit gender stereotypes. To mitigate this issue, they projected the word embeddings to a neutral space with respect to a “he-she” direction. Influenced by this work, Zhao et al. (2018) proposed a customized training scheme to reduce the gender bias in word embeddings. Gonen and Goldberg (2019) examined the effectiveness of the methods mentioned above and concluded they remove bias in a shallow way. For example, they demonstrated that classifiers can accurately predict the gender associated with a word when fed with the embeddings of both debiasing methods.

Another related strand of work uses adversarial learning (Ganin et al., 2016), where an additional objective function is added for balancing undesired-information removal and the main task (Edwards and Storkey, 2016; Li et al., 2018; Coavoux et al., 2018; Wang et al., 2021). Elazar and Goldberg (2018) have also demonstrated that an ad-hoc classifier can easily recover the removed information from adversarially trained representations. Since then, methods for information erasure such as INLP and its generalization (Ravfogel et al., 2020, 2022), SAL (Shao et al., 2023) and methods based on similarity measures between neural representations (Colombo et al., 2022) have been developed. With a similar motivation to ours, Han et al. (2021b) aimed to ease the burden of obtaining guarded attributes at a large scale by decoupling the adversarial information removal process from the main task training. They, however, did not experiment with debiasing representations where no guarded attribute alignments are available. Shao et al. (2023) experimented with the removal of features in a scenario in which a low number of protected attributes is available.

Additional previous work showed that methods based on causal inference (Feder et al., 2021), train-set balancing (Han et al., 2021a), and contrastive learning (Shen et al., 2021; Chi et al., 2022) effectively reduce bias and increase fairness. In addition, there is a large body of work for detecting bias, its evaluation (Dev et al., 2021) and its implications in specific NLP applications. Savoldi et al. (2022) detected a gender bias in speech translation systems for gendered languages. Gender bias is also discussed in the context of knowledge base embeddings by Fisher et al. (2019); Du et al. (2022), and multilingual text classification (Huang, 2022).

6 Conclusions and Future Work

We presented a new and challenging setup for removing information, with minimal or no available sensitive information alignment. This setup is crucial for the wide applicability of debiasing methods, as for most applications, obtaining such sensitive labels on a large scale is challenging. To ease this problem, we present a method to erase information from neural representations, where the guarded attribute information does not accompany each input instance. Our main algorithm, AMSAL, alternates between two steps (Assignment and Maximization) to identify an assignment between the input instances and the guarded information records. It then completes its execution by removing the information by minimizing covariance between the input instances and the aligned guarded attributes. Our approach is modular, and other erasure algorithms, such as INLP, can be used with it. Experiments show that we can reduce the unwanted bias in many cases while keeping the representations highly useful. Future work might include extending our technique to the kernelized case, analogously to the method of Shao et al. (2023).
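The alternation summarized above can be illustrated with a deliberately simplified sketch. It is not the AMSAL implementation and rests on several assumptions: it tracks only the top singular direction (rank one), it assumes a one-to-one assignment with exactly as many guarded records as inputs, it replaces the ILP of the assignment step with sorting (for scalar scores, pairing sorted inputs with sorted records maximizes the sum of products), and it replaces the SVD with alternating power updates. It also omits the PARTIAL stopping rule and the final removal step. All function names are hypothetical.

```python
import random


def matvec(rows, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in rows]


def normalize(v):
    norm = sum(a * a for a in v) ** 0.5
    return [a / norm for a in v]


def cross_cov(xs, zs, perm):
    """Omega_pi = sum_i x_i z_{perm[i]}^T, returned as a d x d' list of rows."""
    omega = [[0.0] * len(zs[0]) for _ in xs[0]]
    for i, x in enumerate(xs):
        z = zs[perm[i]]
        for r, xr in enumerate(x):
            for c, zc in enumerate(z):
                omega[r][c] += xr * zc
    return omega


def objective(omega, u, v):
    """u^T Omega v for the current projections."""
    return sum(a * b for a, b in zip(u, matvec(omega, v)))


def m_step(omega, u, sweeps=50):
    """Maximization: alternating power updates towards the top singular pair."""
    omega_t = [list(col) for col in zip(*omega)]
    for _ in range(sweeps):
        v = normalize(matvec(omega_t, u))  # best unit v for the current u
        u = normalize(matvec(omega, v))    # best unit u for the current v
    return u, v


def a_step(xs, zs, u, v):
    """Assignment: pair inputs and records by the rank of their projected scores."""
    a, b = matvec(xs, u), matvec(zs, v)
    order_x = sorted(range(len(xs)), key=a.__getitem__)
    order_z = sorted(range(len(zs)), key=b.__getitem__)
    perm = [0] * len(xs)
    for i, j in zip(order_x, order_z):
        perm[i] = j
    return perm


def rank_one_am(xs, zs, iterations=10, seed=0):
    """Alternate M and A steps from a random initial assignment."""
    rng = random.Random(seed)
    perm = list(range(len(xs)))
    rng.shuffle(perm)
    u = normalize([rng.gauss(0, 1) for _ in xs[0]])
    history = []
    for _ in range(iterations):
        u, v = m_step(cross_cov(xs, zs, perm), u)
        perm = a_step(xs, zs, u, v)
        history.append(objective(cross_cov(xs, zs, perm), u, v))
    return perm, history
```

Because each update is an exact maximizer of the same objective in one argument while the others are held fixed, the recorded objective values do not decrease across iterations in this sketch. That property says nothing about whether the recovered assignment is the correct one, which is the concern raised in §4.5.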
Ethical Considerations

The AM algorithm could potentially be misused by, rather than using the AM steps to erase information, using them to link records of two different types, undermining the privacy of the record holders. Such a situation may merit additional concern because the links returned between the guarded attributes and the input instances will likely contain mistakes. The links are unreliable for decision-making at the individual level. Instead, they should be used on an aggregate as a statistical construct to erase information from the input representations. Finally,5 we note that the automation of the debiasing process, without properly statistically confirming its accuracy using a correct sample, may promote a false sense of security that a given system is making fair decisions. We do not recommend using our method for debiasing without proper statistical control and empirical verification of correctness.

Acknowledgments

We thank the reviewers, the action editors and Marcio Fonseca for their thorough feedback. We also thank Daniel Preoţiuc-Pietro for his help with the Twitter data. We thank Kousha Etessami for being a sounding board for certain parts of the paper. The experiments in this paper were supported by compute grants from the Edinburgh Parallel Computing Center and from the Baskerville Tier 2 HPC service (University of Birmingham).

A Justification of the AM Algorithm: Further Details

We provide here the full details for the claim in §3.5. Our first observation is that for a uniformly sampled permutation π : [n] → [n], the probability that it has exactly k ≤ n elements such that π(i) = i for all i in this set of elements is bounded from above by:6

$$\binom{n}{k} \frac{(n-k)!}{n!} = \frac{1}{k!}.$$

We also assume that E[X | H] = 0 and E[Z | H] = 0, and that the product of every pair of coordinates of X and Z is bounded in absolute value by a constant B > 0. Let $\{(x^{(i)}, z^{(i)}, h^{(i)})\}$ be a random sample of size n from the joint distribution p(X, Z, H). Given a permutation π : [n] → [n], define I(π) = {i | π(i) = i}. For a given set M ⊆ [n], define

$$\Omega_{\pi|M} = \sum_{i \in M} x^{(i)} \left(z^{(\pi(i))}\right)^{\top}.$$

For a matrix $A \in \mathbb{R}^{d \times d'}$, let $\sigma_j(A)$ be its jth largest singular value, and let $\sigma_+(A) = \sum_j \sigma_j(A)$. Let $\sigma_+ = \sigma_+(E[\Omega_\iota])$. We first note that for any permutation π, it holds that $E[\Omega_{\pi|K}] = 0$ where we define K = [n] \ I(π).

Lemma 1. For any t > 0, it holds that:

$$p\left(\left\|\Omega_{\pi|I(\pi)} - E[\Omega_{\pi|I(\pi)}]\right\|_2 \ge dd't\right) \tag{7}$$

is smaller than $2dd' \exp\left(-\frac{t^2}{|I(\pi)|B^2}\right)$.

Proof. By Hoeffding's inequality, for any i ∈ [d], j ∈ [d′], it holds that the probability that for |I(π)| i.i.d. random variables $X^k$, $Z^k$ the following is true:

$$\left| \sum_{k \in I(\pi)} X_i^k Z_j^k - \sum_{k \in I(\pi)} E[X_i^k Z_j^k] \right| \ge t$$

is smaller than $2 \exp\left(-\frac{t^2}{|I(\pi)|B^2}\right)$. Therefore, by a union bound on each element of the matrix Ωπ, we get the upper bound on Eq. 7.

5We thank the anonymous reviewer for raising this issue.

6Choose k elements that are fixed, and let the rest vary arbitrarily.
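The counting argument of footnote 6 can be checked by brute force for a small n. The sketch below is illustrative only (n = 6 and the helper names are our own choices here): it enumerates all permutations, computes the exact probability of having exactly k fixed points, and compares it with the bound obtained by fixing k elements and letting the remaining n − k vary freely.

```python
from fractions import Fraction
from itertools import permutations
from math import comb, factorial


def fixed_point_distribution(n):
    """Exact probability that a uniform permutation of [n] has exactly k fixed points."""
    counts = [0] * (n + 1)
    for perm in permutations(range(n)):
        counts[sum(1 for i, j in enumerate(perm) if i == j)] += 1
    return [Fraction(c, factorial(n)) for c in counts]


def counting_bound(n, k):
    """Choose k fixed elements and let the other n - k vary: C(n, k) (n - k)! / n!."""
    return Fraction(comb(n, k) * factorial(n - k), factorial(n))


n = 6
for k, prob in enumerate(fixed_point_distribution(n)):
    bound = counting_bound(n, k)
    print(k, prob, bound, prob <= bound)
```

The bound overcounts because the elements that are left to vary may themselves be fixed points, which is why it is an upper bound and not an equality for most values of k.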

Lemma 2. For any t > 0, it holds that:

$$\left\| \Omega_{\pi|K} - E[\Omega_{\pi|K}] \right\|_2$$

is smaller than $2|K|dd'B$.

Proof. Since $X_i$ and $Z_j$ are bounded as a product in absolute value by B, and the dimensions of $\Omega_{\pi|K}$ are d × d′, each cell being a sum of |K| values, the bound naturally follows.
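To make the quantities of this appendix concrete, the following sketch builds Ωπ for a synthetic sample and compares σ+ under the identity alignment ι with σ+ under a uniformly shuffled alignment. Everything in it is an illustrative assumption: the data are synthetic with d = d′ = 2, each z is a noisy copy of its x, and the helper names are hypothetical. For a 2 × 2 matrix A, the sum of the two singular values equals the square root of the squared Frobenius norm plus twice |det A|, which avoids the need for an SVD routine.

```python
import math
import random


def omega(xs, zs, perm):
    """Omega_pi = sum_i x^(i) (z^(perm[i]))^T for 2-dimensional x and z."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for i, x in enumerate(xs):
        z = zs[perm[i]]
        for r in range(2):
            for c in range(2):
                m[r][c] += x[r] * z[c]
    return m


def sigma_plus(m):
    """Sum of the singular values of a 2 x 2 matrix, via its Frobenius norm and determinant."""
    frob_sq = sum(v * v for row in m for v in row)
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return math.sqrt(frob_sq + 2.0 * abs(det))


rng = random.Random(0)
n = 500
xs = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
zs = [[x[0] + 0.1 * rng.gauss(0, 1), x[1] + 0.1 * rng.gauss(0, 1)] for x in xs]

identity = list(range(n))
shuffled = identity[:]
rng.shuffle(shuffled)

aligned = sigma_plus(omega(xs, zs, identity))
misaligned = sigma_plus(omega(xs, zs, shuffled))
fixed_points = sum(1 for i, j in enumerate(shuffled) if i == j)
print(aligned, misaligned, fixed_points)
```

With zero-mean data, the aligned matrix accumulates a sum that grows linearly in n, while the entries of the shuffled one are sums of products of mostly unrelated pairs, which is the gap that the argument in this appendix formalizes.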

Let n be such that $n\sigma_+ > 2kdd'B$, where k = |K|. Then from Lemma 2, $\left\| \Omega_{\pi|K} - E[\Omega_{\pi|K}] \right\|_2 < n\sigma_+$. Consider the event σ+(Ωι) < σ+(Ωπ). Its probability is bounded from above by the probability of the event σ+(Ωι) ≤ nσ+ OR σ+(Ωπ) ≥ nσ+ (for any n as above). Due to the inequality of Weyl (Theorem 1 in Stewart 1990; see below), the fact that Ωπ = Ωπ|K + Ωπ|I(π), Lemma 1, and the fact that n − k ≤ n, the probability of this OR event is bounded from above by

$$4dd' \exp\left(-\frac{(n-k)(\sigma_+)^2}{(dd'B)^2}\right).$$

The conclusion from this is that if we were to sample uniformly a permutation π from the set of permutations over [n], then with quite high likelihood (because the fraction of elements that are preserved under π becomes smaller as n becomes larger), the sum of the singular values of Ωπ under this permutation will be smaller than the sum of the singular values of Ωι, that is, of the matrix obtained when the xs and the zs are correctly aligned. This justifies our objective of aligning the xs and the zs with an objective that maximizes the singular values, following Proposition 1.

Inequality of Weyl (1912) As mentioned by Stewart (1990), the following holds:

Lemma 3. Let A and E be two matrices, and let $\tilde{A} = A + E$. Let $\sigma_i$ be the ith singular value of A and $\tilde{\sigma}_i$ be the ith singular value of $\tilde{A}$. Then $|\sigma_i - \tilde{\sigma}_i| \le \|E\|_2$.

B Comprehensive Results on the BiasBench Datasets

We include more results for the SEAT dataset from BiasBench and for the CrowS-Pairs and StereoSet datasets for bias categories other than gender. A description of the SEAT and GLUE datasets (with metrics used) follows.

SEAT (May et al., 2019) SEAT is a sentence-level extension of WEAT (Caliskan et al., 2017), which is an association test between two categories of words: attribute word sets and target word sets.
For example, an attribute word set (in case of gender bias) could be a set of words such as { he, him, man }, while a target word set might be words related to office work. If we see a high association between an attribute word set and a target word set, we may claim that a particular gender bias is encoded. The final evaluation is calculated by measuring the similarity between the different attribute and target word sets. To extend WEAT to a sentence-level test, May et al. (2019) incorporated the WEAT attribute and target words into synthetic sentence templates.

We use an effect size metric to report our results for SEAT. This measure is a normalized difference between cosine similarity of representations of the attribute words and the target words. Both attribute words and target words are split into two categories (for example, in relation to gender), so the difference is based on four terms, between each pair of each category set of words (target and attribute). An effect size closer to zero indicates less bias is encoded in the representations.

GLUE (Wang et al., 2019) We follow Meade et al. (2022) and use the GLUE dataset to test the debiased model on an array of downstream tasks to validate its usability. GLUE is a highly popular benchmark for testing NLP models, containing a variety of tasks, such as classification tasks (e.g., sentiment analysis), similarity tasks (e.g.,
paraphrase identification), and inference tasks (e.g., question-answering).

The following tables of results are included:

• Tables 7, 8, and 9 describe the SEAT effect sizes for the gender, race, and religion cases, respectively.
• Table 6 presents the StereoSet results for removing the race (a) and religion (b) guarded attributes.
• Table 10 presents the scores the debiased representations achieve for the GLUE benchmark.

Table 6: (a) StereoSet stereotype scores and language modeling scores (LM Score) for race debiased BERT, ALBERT, RoBERTa, and GPT-2 models. Stereotype scores are least biased at 50% and the LM Scores are best at 100%; (b) StereoSet stereotype scores and language modeling scores (LM Score) for religion debiased BERT, ALBERT, RoBERTa, and GPT-2 models. Stereotype scores are least biased at 50% and the LM Scores are best at 100%.

Table 7: SEAT effect sizes for gender-debiased representations of BERT, ALBERT, RoBERTa, and GPT-2 models. Effect sizes closer to 0 are indicative of less biased model representations. Statistically significant effect sizes at p < 0.01 are denoted by *. The final column reports the average absolute effect size across all six gender SEAT tests for each debiased model.

Table 8: SEAT effect sizes for race debiased BERT, ALBERT, RoBERTa, and GPT-2 models. Effect sizes closer to 0 are indicative of less biased model representations. Statistically significant effect sizes at p < 0.01 are denoted by *.
The final column reports the average absolute effect size across all six gender SEAT tests for each debiased model.

Table 9: SEAT effect sizes for religion debiased BERT, ALBERT, RoBERTa, and GPT-2 models. Effect sizes closer to 0 are indicative of less biased model representations. Statistically significant effect sizes at p < 0.01 are denoted by *. The final column reports the average absolute effect size across all six gender SEAT tests for each debiased model.

Table 10: GLUE tests for gender-debiased BERT, ALBERT, RoBERTa, and GPT-2 models.

References

Su Lin Blodgett, Lisa Green, and Brendan O'Connor. 2016. Demographic dialectal variation in social media: A case study of African-American English. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1119–1130, Austin, Texas. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1120

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pages 4349–4357.

Isabel Cachola, Eric Holgate, Daniel Preoţiuc-Pietro, and Junyi Jessy Li. 2018.
Expressively vulgar: The socio-dynamics of vulgarity and its effects on sentiment analysis in social media. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2927–2938, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186. https://doi.org/10.1126/science.aal4230, PubMed: 28408601

Jianfeng Chi, William Shand, Yaodong Yu, Kai-Wei Chang, Han Zhao, and Yuan Tian. 2022. Conditional supervised contrastive learning for fair text classification. ArXiv preprint, abs/2205.11485.

Pierre Colombo, Guillaume Staerman, Nathan Noiry, and Pablo Piantanida. 2022. Learning disentangled textual representations via statistical measures of similarity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2614–2630, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.187

Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 120–128. https://doi.org/10.1145/3287560.3287572

Sunipa Dev, Tao Li, Jeff M. Phillips, and Vivek Srikumar. 2021. OSCaR: Orthogonal subspace correction and rectification of biases in word embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5034–5050, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.411

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019.
BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Maximin Coavoux, Shashi Narayan, and Shay B. Cohen. 2018. Privacy-preserving neural representations of text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1–10, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1001

Yupei Du, Qi Zheng, Yuanbin Wu, Man Lan, Yan Yang, and Meirong Ma. 2022. Understanding gender bias in knowledge base embeddings. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1381–1395, Dublin, Ireland. Association for Computational Linguistics.

Harrison Edwards and Amos J. Storkey. 2016. Censoring representations with an adversary. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings.

Yanai Elazar and Yoav Goldberg. 2018. Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 11–21, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1002

Amir Feder, Nadav Oved, Uri Shalit, and Roi Reichart. 2021. CausaLM: Causal model explanation through counterfactual language models. Computational Linguistics, 47(2):333–386.
https://doi.org/10.1162/coli_a_00404

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1615–1625, Copenhagen, Denmark. Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1169

Joseph Fisher, Dave Palfrey, Christos Christodoulopoulos, and Arpit Mittal. 2019. Measuring social bias in knowledge graph embeddings. ArXiv preprint, abs/1912.02761. https://doi.org/10.18653/v1/2020.emnlp-main.595

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030.

Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614, Minneapolis, Minnesota. Association for Computational Linguistics.

Xudong Han, Timothy Baldwin, and Trevor Cohn. 2021a. Balancing out bias: Achieving fairness through training reweighting. ArXiv preprint, abs/2109.08253.

Xudong Han, Timothy Baldwin, and Trevor Cohn. 2021b. Decoupling adversarial training for fair NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 471–477, Online. Association for Computational Linguistics.

Xudong Han, Aili Shen, Yitong Li, Lea Frermann, Timothy Baldwin, and Trevor Cohn. 2022. fairlib: A unified framework for assessing and improving classification fairness. ArXiv preprint, abs/2205.01876.

Xiaolei Huang. 2022.
Easy adaptation to mitigate gender bias in multilingual text classification. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 717–723, Seattle, United States. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.naacl-main.52

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. Fasttext.zip: Compressing text classification models. ArXiv preprint, abs/1612.03651.

Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97. https://doi.org/10.1002/nav.3800020109

Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018. Towards robust and privacy-preserving text representations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 25–30, Melbourne, Australia. Association for Computational Linguistics.

David J. C. MacKay. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press.

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 622–628, Minneapolis, Minnesota. Association for Computational Linguistics.

Nicholas Meade, Elinor Poole-Dayan, and Siva Reddy. 2022. An empirical survey of the effectiveness of debiasing techniques for pre-trained language models.
In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1878–1898, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.132

Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.416

Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.154

Lyle Ramshaw and Robert E. Tarjan. 2012. On minimum-cost assignments in unbalanced bipartite graphs. HP Labs, Palo Alto, CA, USA, Technical Report HPL-2012-40R1.

Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. 2020. Null it out: Guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7237–7256, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.647

Shauli Ravfogel, Michael Twiton, Yoav Goldberg, and Ryan D. Cotterell. 2022. Linear adversarial concept erasure. In International Conference on Machine Learning, pages 18400–18421. PMLR.

Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2022. Under the morphosyntactic lens: A multifaceted evaluation of gender bias in speech translation.
In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1807–1824, Dublin, Ireland. Association for Computa- tional Linguistics. https://doi.org/10 .18653/v1/2022.acl-long.127 Shun Shao, Yftah Ziser, and Shay B. Cohen. 2023. Gold doesn’t always glitter: Spectral removal of linear and nonlinear guarded attribute in- formation. In Proceedings of the 17th Annual Meeting of the European chapter of the Asso- ciation for Computational Linguistics (EACL), volume abs/2203.07893. Aili Shen, Xudong Han, Trevor Cohn, Timothy Baldwin, and Lea Frermann. 2021. Contras- tive learning for fair representations. ArXiv preprint, abs/2109.10645. Gilbert W. Stewart. 1990. Perturbation theory for the singular value decomposition, Tech- nical Report UMIACS-90-120 / CS-TR 2539, University of Maryland, College Park. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task bench- mark and analysis platform for natural language understanding. In 7th International Confer- ence on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net. Liwen Wang, Yuanmeng Yan, Keqing He, Yanan Wu, and Weiran Xu. 2021. Dynamically disentangling social bias from task-oriented representations with adversarial attack. In Pro- ceedings of the 2021 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Tech- nologies, pages 3740–3750, Online. Association 509 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 5 8 2 1 1 0 6 0 2 / / t l a c _ a _ 0 0 5 5 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 for Computational Linguistics. https://doi .org/10.18653/v1/2021.naacl-main.293 Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018. Learning gender-neutral word embeddings. 
In Proceed- ings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4847–4853, Brussels, Belgium. Associa- tion for Computational Linguistics. https:// doi.org/10.18653/v1/D18-1521 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 5 8 2 1 1 0 6 0 2 / / t l a c _ a _ 0 0 5 5 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 510