Quantifying Social Biases in NLP:
A Generalization and Empirical Comparison
of Extrinsic Fairness Metrics

Paula Czarnowska♠
University of Cambridge, UK
pjc211@cam.ac.uk

Yogarshi Vyas
Amazon AI, USA
yogarshi@amazon.com

Kashif Shah
Amazon AI, USA
shahkas@amazon.com

Abstract

Measuring bias is key for better understanding
and addressing unfairness in NLP/ML models.
This is often done via fairness metrics, which
quantify the differences in a model's behaviour
across a range of demographic groups. In this
work, we shed more light on the differences
and similarities between the fairness metrics
used in NLP. First, we unify a broad range
of existing metrics under three generalized
fairness metrics, revealing the connections be-
tween them. Next, we carry out an extensive
empirical comparison of existing metrics and
demonstrate that the observed differences in
bias measurement can be systematically ex-
plained via differences in parameter choices
for our generalized metrics.

1 Introduction

The prevalence of unintended social biases in
NLP models has been recently identified as a
major concern for the field. A number of papers
have published evidence of uneven treatment of
different demographics (Dixon et al., 2018; Zhao
et al., 2018; Rudinger et al., 2018; Garg et al.,
2019; Borkan et al., 2019; Stanovsky et al., 2019;
Gonen and Webster, 2020; Huang et al., 2020a;
Nangia et al., 2020), which can reportedly cause
a variety of serious harms, like unfair allocation
of opportunities or unfavorable representation of
particular social groups (Blodgett et al., 2020).

Measuring bias in NLP models is key for better
understanding and addressing unfairness. This is
often done via fairness metrics, which quantify
the differences in a model's behavior across a
range of social groups. The community has pro-
posed a multitude of such metrics (Dixon et al.,
2018; Garg et al., 2019; Huang et al., 2020a;
Borkan et al., 2019; Gaut et al., 2020). In this pa-
per, we aim to shed more light on how those varied
means of quantifying bias differ and what facets
of bias they capture. Developing such understand-
ing is crucial for drawing reliable conclusions and
actionable recommendations regarding bias. We
focus on bias measurement for downstream tasks,
as Goldfarb-Tarrant et al. (2021) have recently
shown that there is no reliable correlation between
bias measured intrinsically on, for example, word
embeddings, and bias measured extrinsically on a
downstream task. We narrow down the scope of
this paper to tasks that do not involve prediction
of a sensitive attribute.

♠ Work done during an internship at Amazon AI.

We survey 146 papers on social bias in NLP
and unify the multitude of disparate metrics we
find under three generalized fairness metrics.
Through this unification we reveal the key connec-
tions between a wide range of existing metrics—
we show that they are simply different param-
etrizations of our generalized metrics. Next, we
empirically investigate the role of different metrics
in detecting the systemic differences in perfor-
mance for different demographic groups, that is,
differences in quality of service (Jacobs et al.,
2020). We experiment on three transformer-based
models—two models for sentiment analysis and
one for named entity recognition (NER)—which
we evaluate for fairness with respect to seven dif-
ferent sensitive attributes, qualified for protection
under the United States federal anti-discrimination
laws:1 Gender, Sexual Orientation, Religion, Na-
tionality, Race, Age, and Disability. Our results
highlight the differences in bias measurements
across the metrics and we discuss how these vari-
ations can be systematically explained via differ-
ent parameter choices of our generalized metrics.
Our proposed unification and observations can

1https://www.ftc.gov/site-information/no-fear-act/protections-against-discrimination.

Transactions of the Association for Computational Linguistics, vol. 9, pp. 1249–1267, 2021. https://doi.org/10.1162/tacl_a_00425
Action Editor: Dirk Hovy. Submission batch: 2/2021; Revision batch: 5/2021; Published 11/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

guide decisions about which metrics (and param-
eters) to use, allowing researchers to focus on
the pressing matter of bias mitigation, rather than
reinventing parametric variants of the same met-
rics. While we focus our experiments on English,
the metrics we study are language-agnostic and
our methodology can be trivially applied to other
languages.

We release our code with implementations of
all metrics discussed in this paper.2 Our imple-
mentation mirrors our generalized formulation
(Section 3), which simplifies the creation of new
metrics. We build our code on top of CHECK-
LIST3 (Ribeiro et al., 2020), making it compatible
with the CHECKLIST testing functionalities; that
is, one can evaluate the model using the fair-
ness metrics, as well as the CHECKLIST-style tests,
like invariance, under a single bias evaluation
framework.

2 Background

2.1 Terminology

We use the term sensitive attribute to refer to
a category by which people are qualified for
protection (e.g., Religion or Gender). For each
sensitive attribute we define a set of protected
groups T (e.g., for Gender, T could be set to
{female, male, non-binary}). Next, each protected
group can be expressed through one of its iden-
tity terms, I (e.g., for the protected group female
those terms could be {woman, female, girl} or a
set of typically female names).

2.2 Definitions of Fairness in NLP

The metrics proposed to quantify bias in NLP
models across a range of social groups can be
categorized based on whether they operationalize
notions of group or counterfactual fairness. In
this section we give a brief overview of both and
encourage the reader to consult Hutchinson and
Mitchell (2019) for a broader scope of literature
on fairness, dating back to the 1960s.

Group fairness
requires parity of some statisti-
cal measure across a small set of protected groups
(Chouldechova and Roth, 2020). Some promi-
nent examples are demographic parity (Dwork

2https://github.com/amazon-research/generalized-fairness-metrics.

3https://github.com/marcotcr/checklist.

Source example       female               male

I like {person}.     I like Anna.         I like Adam.
                     I like Mary.         I like Mark.
                     I like Liz.          I like Chris.

{Person} has         Anna has friends.    Adam has friends.
friends.             Mary has friends.    Mark has friends.
                     Liz has friends.     Chris has friends.

Table 1: Example of counterfactual fairness data.
T = {female, male} and |I| = 3 for both groups.

et al., 2012), which requires equal positive classi-
fication rate across different groups, or equalized
odds (Hardt et al., 2016) which for binary clas-
sification requires equal true positive and false
negative rates. In NLP, group fairness metrics are
based on performance comparisons for different
sets of examples, for example, the comparison
of two F1 scores: one for examples mentioning
female names and one for examples with male
names.

Counterfactual fairness requires parity for two
or more versions of an individual, one from the ac-
tual world and others from counterfactual worlds
in which the individual belongs to a different
protected group; that is, it requires invariance to
the change of the protected group (Kusner et al.,
2017). Counterfactual fairness is often viewed as
a type of individual fairness, which asks for sim-
ilar individuals to be treated similarly (Dwork
et al., 2012). In NLP, counterfactual fairness
metrics are based on comparisons of performance
for variations of the same sentence, which differ
in mentioned identity terms. Such data can be
created through perturbing real-world sentences
or creating synthetic sentences from templates.

In this work, we require that for each protected
group there exists at least one sentence variation
for every source example (pre-perturbation sen-
tence or a template). In practice, the number of
variations for each protected group will depend
on the cardinality of I (Table 1). In contrast to
most NLP works (Dixon et al., 2018; Garg et al.,
2019; Sheng et al., 2020), we allow for a protected
group to be realized as more than one identity
term. To allow for this, we separate the variations
for each source example into |T| sets, each of
which can be viewed as a separate counterfac-
tual world.
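To make this setup concrete, the following minimal Python sketch (our own illustration, not part of the released implementation; the template and name lists are only examples) expands two source templates into one variation set per protected group, mirroring Table 1.

    # Each source template is expanded into |T| sets of variations,
    # one per protected group, using that group's identity terms.
    templates = ["I like {person}.", "{person} has friends."]
    identity_terms = {
        "female": ["Anna", "Mary", "Liz"],
        "male": ["Adam", "Mark", "Chris"],
    }

    # variations[j][t] holds all variations of template j for group t.
    variations = [
        {group: [template.format(person=name) for name in names]
         for group, names in identity_terms.items()}
        for template in templates
    ]

    print(variations[0]["female"])  # ['I like Anna.', 'I like Mary.', 'I like Liz.']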


3 Generalized Fairness Metrics

We introduce three generalized fairness metrics
that are based on different comparisons between
protected groups and are model and task agnostic.
They are defined in terms of two parameters:

(i) A scoring function, φ, which calculates the
score on a subset of examples. The score
is a base measurement used to calculate the
metric and can be either a scalar or a set (see
Table 2 for examples).

(ii) A comparison function, d, which takes a
range of different scores—computed for dif-
ferent subsets of examples—and outputs a
single scalar value. A minimal sketch of these
two ingredients is given below.
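The sketch below is our own illustration, not the released CHECKLIST-based implementation; it spells out the two ingredients for one prediction-based parametrization, with φ as the false positive rate over a subset of examples and d as the absolute difference. The Example type is an assumption made only for this sketch.

    from typing import Callable, List, Tuple

    Example = Tuple[int, int]  # (gold label, predicted label); illustrative only

    def false_positive_rate(examples: List[Example]) -> float:
        # scoring function phi: a scalar base measurement for a subset of examples
        negatives = [e for e in examples if e[0] == 0]
        return sum(pred == 1 for _, pred in negatives) / max(len(negatives), 1)

    def absolute_difference(x: float, y: float) -> float:
        # comparison function d: collapses two scores into a single scalar
        return abs(x - y)

    phi: Callable[[List[Example]], float] = false_positive_rate
    d: Callable[[float, float], float] = absolute_difference

Different metrics in Section 4 then correspond to different choices plugged into these two slots.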

Each of the three metrics is conceptually dif-
ferent and is most suitable in different scenarios;
the choice of the most appropriate one depends
on the scientific question being asked. Through
different choices for φ and d, we can systemati-
cally formulate a broad range of different fairness
metrics, targeting different types of questions. We
demonstrate this in Section 4 and Table 2, where
we show that many metrics from the NLP lit-
erature can be viewed as parametrizations of the
metrics we propose here. To account for the differ-
ences between group and counterfactual fairness
(Section 2.2) we define two different versions of
each metric.

Notation Let T = {t_1, t_2, . . . , t_{|T|}} be a set
of all protected groups for a given sensitive at-
tribute, for example, Gender, and φ(A) be the
score for some set of examples A. This score
can be either a set or a scalar, depending on
the parametrization of φ. For group fairness, let
S be the set of all evaluation examples. We
denote a subset of examples associated with
a protected group t_i as S^{t_i}. For counterfactual
fairness, let X = {x_1, x_2, . . . , x_{|X|}} be a set of
source examples, e.g., sentences pre-perturbation,
and S' = {S'_1, S'_2, . . . , S'_{|X|}} be a set of sets
of evaluation examples, where S'_j is a set of
all variations of a source example x_j, i.e., there
is a one-to-one correspondence between S' and
X. We use S'^{t_i}_j to denote a subset of S'_j
associated with a protected group t_i. For exam-
ple, if T = {female, male} and the templates
were defined as in Table 1, then S'^{female}_1 =
{'I like Anna.', 'I like Mary.', 'I like Liz.'}.

3.1 Pairwise Comparison Metric

Pairwise Comparison Metric (PCM) quantifies
how distant, on average, the scores for two dif-
ferent, randomly selected groups are. It is suitable
for examining whether and to what extent the cho-
sen protected groups differ from one another. For
example, for the sensitive attribute Disability, are
there any performance differences for cognitive
vs mobility vs no disability? We define Group
(1) and Counterfactual (2) PCM as follows:

\frac{1}{N} \sum_{t_i, t_j \in \binom{T}{2}} d\big(\phi(S^{t_i}), \phi(S^{t_j})\big)    (1)

\frac{1}{|S'|\,N} \sum_{S'_j \in S'} \sum_{t_i, t_k \in \binom{T}{2}} d\big(\phi(S'^{t_i}_j), \phi(S'^{t_k}_j)\big)    (2)

where N is a normalizing factor, for example, \binom{|T|}{2}.
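A minimal sketch of Equations (1) and (2), assuming a scalar-valued φ whose per-group values have already been computed and taking N = |T| choose 2 (function and variable names are ours):

    from itertools import combinations
    from math import comb
    from typing import Callable, Dict, List

    def group_pcm(group_scores: Dict[str, float],
                  d: Callable[[float, float], float]) -> float:
        # Equation (1): average the comparison d over all pairs of protected groups.
        n = comb(len(group_scores), 2)
        return sum(d(group_scores[a], group_scores[b])
                   for a, b in combinations(group_scores, 2)) / n

    def counterfactual_pcm(scores_per_source: List[Dict[str, float]],
                           d: Callable[[float, float], float]) -> float:
        # Equation (2): the same pairwise comparison, averaged over source examples.
        return sum(group_pcm(scores, d)
                   for scores in scores_per_source) / len(scores_per_source)

    # e.g., accuracy per Disability group, compared with absolute differences
    print(group_pcm({"cognitive": 0.91, "mobility": 0.94, "none": 0.97},
                    lambda x, y: abs(x - y)))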

3.2 Background Comparison Metric

Background Comparison Metric (BCM) relies on
a comparison between the score for a protected
group and the score of its background. The def-
inition of the background depends on the task at
hand and the investigated question. For example,
if the aim is to answer whether the performance
of a model for the group differs from the model's
general performance, the background can be a set
of all evaluation examples. Alternatively, if the
question of interest is whether the groups con-
sidered disadvantaged are treated differently than
some privileged group, the background can be a set
of examples associated with that privileged group.
In this case, T should be narrowed down to
the disadvantaged groups only. For counterfactual
fairness the background could be the unperturbed
example, allowing us to answer whether a model's
behavior differs for any of the counterfactual ver-
sions of the world. Formally, we define Group (3)
and Counterfactual (4) BCM as follows:

\frac{1}{N} \sum_{t_i \in T} d\big(\phi(\beta^{t_i, S}), \phi(S^{t_i})\big)    (3)

\frac{1}{|S'|\,N} \sum_{S'_j \in S'} \sum_{t_i \in T} d\big(\phi(\beta^{t_i, S'_j}), \phi(S'^{t_i}_j)\big)    (4)


where N is a normalizing factor and β^{t_i, S} is
the background for group t_i for the set of ex-
amples S.
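A sketch of Equations (3) and (4) under the same assumptions as before (precomputed scalar scores; the normalizer N is taken to be |T| here, which is one possible choice, not the only one):

    from typing import Callable, Dict, List

    def group_bcm(group_scores: Dict[str, float],
                  background_scores: Dict[str, float],
                  d: Callable[[float, float], float]) -> float:
        # Equation (3): compare each group's score with the score of its background.
        return sum(d(background_scores[t], group_scores[t])
                   for t in group_scores) / len(group_scores)

    def counterfactual_bcm(scores_per_source: List[Dict[str, float]],
                           background_per_source: List[Dict[str, float]],
                           d: Callable[[float, float], float]) -> float:
        # Equation (4): the same comparison, averaged over source examples.
        values = [group_bcm(g, b, d)
                  for g, b in zip(scores_per_source, background_per_source)]
        return sum(values) / len(values)

    # The background can be, e.g., phi computed on all of S (one shared score)
    # or on S without that group's own examples.
    groups = {"young": 0.88, "adult": 0.95, "old": 0.90}
    background = {t: 0.93 for t in groups}
    print(group_bcm(groups, background, lambda b, g: abs(b - g)))

Dropping the outer averaging and returning the per-group values instead yields the vector-valued variant described next.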

Vector-valued BCM In its basic form BCM
aggregates the results obtained for different pro-
tected groups in order to return a single scalar
value. Such aggregation provides a concise signal
about the presence and magnitude of bias, but it
does so at the cost of losing information. Often,
it is important to understand how different pro-
tected groups contribute to the resulting outcome.
This requires the individual group results not to be
accumulated; that is, dropping the \frac{1}{N} \sum_{t_i \in T} term
from Equations (3) and (4). We call this version of
BCM, the vector-valued BCM (VBCM).

3.3 Multi-group Comparison Metric

Multi-group Comparison Metric (MCM) differs
from the other two in that the comparison function
d takes as arguments the scores for all protected
groups. This metric can quantify the global ef-
fect that a sensitive attribute has on a model's
performance; for example, whether the change of
Gender has any effect on model's scores. It can
provide a useful initial insight, but further inspec-
tion is required to develop better understanding of
the underlying bias, if it is detected.

Group (5) and Counterfactual (6) MCM are
defined as:

d\big(\phi(S^{t_1}), \phi(S^{t_2}), \ldots, \phi(S^{t_{|T|}})\big)    (5)

\frac{1}{|S'|} \sum_{S'_j \in S'} d\big(\phi(S'^{t_1}_j), \phi(S'^{t_2}_j), \ldots, \phi(S'^{t_{|T|}}_j)\big)    (6)
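A sketch of Equations (5) and (6), again assuming precomputed per-group scores; the multivariate d shown here is the range or the standard deviation of the scores, the two choices used by the MCM metrics discussed in Section 4:

    from statistics import pstdev
    from typing import Callable, Dict, List, Sequence

    def group_mcm(group_scores: Dict[str, float],
                  d: Callable[[Sequence[float]], float]) -> float:
        # Equation (5): a single multivariate comparison over all group scores.
        return d(list(group_scores.values()))

    def counterfactual_mcm(scores_per_source: List[Dict[str, float]],
                           d: Callable[[Sequence[float]], float]) -> float:
        # Equation (6): the multivariate comparison averaged over source examples.
        return sum(d(list(s.values()))
                   for s in scores_per_source) / len(scores_per_source)

    score_range = lambda xs: max(xs) - min(xs)  # cf. Perturbation Score Range
    score_std = lambda xs: pstdev(xs)           # cf. Perturbation Score Deviation

    print(group_mcm({"female": 0.81, "male": 0.86, "non-binary": 0.74}, score_range))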

4 Classifying Existing Fairness Metrics
Within the Generalized Metrics

Table 2 expresses 22 metrics from the litera-
ture as instances of our generalized metrics from
Section 3. The presented metrics span a num-
ber of NLP tasks, including text classification
(Dixon et al., 2018; Kiritchenko and Mohammad,
2018; Garg et al., 2019; Borkan et al., 2019;
Prabhakaran et al., 2019), relation extraction (Gaut
et al., 2020), text generation (Huang et al., 2020a),
and dependency parsing (Blodgett et al., 2018).

We arrive at this list by reviewing 146 papers
that study bias from the survey of Blodgett et al.

(2020) and selecting metrics that meet three crite-
ria: (i) the metric is extrinsic; that is, it is applied
to at least one downstream NLP task,4 (ii) it
quantifies the difference in performance across
two or more groups, and (iii) it is not based on
the prediction of a sensitive attribute—metrics
based on a model's predictions of sensitive at-
tributes, for example, in image captioning or text
generation, constitute a specialized sub-type of
fairness metrics. Out of the 26 metrics we find,
only four do not fit within our framework: BPSN
and BNSP (Borkan et al., 2019), the metric of
De-Arteaga et al. (2019), and Perturbation Label
Distance (Prabhakaran et al., 2019).5

Importantly, many of the metrics we find are
PCMs defined for only two protected groups, typ-
ically for male and female genders or white and
non-white races. Only those that use commuta-
tive d can be straightforwardly adjusted to more
groups. Those that cannot be adjusted are marked
with gray circles in Table 2.

Prediction vs. Probability Based Metrics Be-
yond the categorization into PCM, BCM, and
MCM, as well as group and counterfactual fair-
ness, the metrics can be further categorized into
prediction or probability based. The former cal-
culate the score based on a model's predictions,
while the latter use the probabilities assigned to a
particular class or label (we found no metrics that
make use of both probabilities and predictions).
Thirteen out of 16 group fairness metrics are pre-
diction based, while all counterfactual metrics are
probability based. Since the majority of metrics
in Table 2 are defined for binary classification,
the prevalent scores for prediction based metrics
include false positive/negative rates (FPR/FNR)
and true positive/negative rates (TPR/TNR). Most
of the probability-based metrics are based on the
probability associated with the positive/toxic class
(class 1 in binary classification). The exception
are the metrics of Prabhakaran et al. (2019),
which utilize the probability of the target class
18 19 21.
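The distinction can be made concrete with the following sketch (ours; the Example record is an assumption made for the sketch) of one prediction-based and two probability-based scoring functions:

    from typing import List, NamedTuple

    class Example(NamedTuple):
        gold: int          # gold class (0 or 1 for binary classification)
        pred: int          # predicted class
        p_positive: float  # model probability of class 1, i.e., f(x, 1)

    def false_positive_rate(examples: List[Example]) -> float:
        # prediction-based score, e.g., the base measurement of FPED
        negatives = [e for e in examples if e.gold == 0]
        return sum(e.pred == 1 for e in negatives) / max(len(negatives), 1)

    def positive_probabilities(examples: List[Example]) -> List[float]:
        # probability-based, set-valued score: {f(x, 1) | x in A}, e.g., used by AvgGF
        return [e.p_positive for e in examples]

    def positive_probabilities_on_gold_positives(examples: List[Example]) -> List[float]:
        # probability-based score restricted to gold-positive examples:
        # {f(x, 1) | x in A, y(x) = 1}, e.g., used by PosAvgEG
        return [e.p_positive for e in examples if e.gold == 1]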

4We do not consider language modeling to be a
downstream task.

5BPSN and BNSP can be defined as Group VBCM if we
relax the definition and allow for a separate φ function for
the background—they require returning different confidence
scores for the protected group and the background. The
metrics of Prabhakaran et al. (2019) 18 19 21 originally have
not been defined in terms of protected groups. In their paper,
T is a set of different names, both male and female.


GROUP METRICS

1 False Positive Equality Difference (FPED): BCM; φ(A) = False Positive Rate; d = |x − y|; β^{ti,S} = S
2 False Negative Equality Difference (FNED): BCM; φ(A) = False Negative Rate; d = |x − y|; β^{ti,S} = S
3 Average Group Fairness (AvgGF): BCM; φ(A) = {f(x, 1) | x ∈ A}; d = W1(X, Y); β^{ti,S} = S
4 FPR Ratio: VBCM; φ(A) = False Positive Rate; d = y/x; β^{ti,S} = S \ S^{ti}
5 Positive Average Equality Gap (PosAvgEG): VBCM; φ(A) = {f(x, 1) | x ∈ A, y(x) = 1}; d = 1/2 − MWU(X, Y)/(|X||Y|); β^{ti,S} = S \ S^{ti}
6 Negative Average Equality Gap (NegAvgEG): VBCM; φ(A) = {f(x, 1) | x ∈ A, y(x) = 0}; d = 1/2 − MWU(X, Y)/(|X||Y|); β^{ti,S} = S \ S^{ti}
7 Disparity Score: PCM; φ(A) = F1; d = |x − y|
8 *TPR Gap: PCM; φ(A) = True Positive Rate; d = |x − y|
9 *TNR Gap: PCM; φ(A) = True Negative Rate; d = |x − y|
10 *Parity Gap: PCM; φ(A) = |{x | x ∈ A, ŷ(x) = y(x)}| / |A|; d = |x − y|
11 *Accuracy Difference (gray): PCM; φ(A) = Accuracy; d = x − y
12 *TPR Difference (gray): PCM; φ(A) = True Positive Rate; d = x − y
13 *F1 Difference (gray): PCM; φ(A) = F1; d = x − y
14 *LAS Difference (gray): PCM; φ(A) = LAS; d = x − y
15 *Recall Difference (gray): PCM; φ(A) = Recall; d = x − y
16 *F1 Ratio (gray): PCM; φ(A) = Recall; d = x/y

COUNTERFACTUAL METRICS

17 Counterfactual Token Fairness Gap (CFGap): BCM; φ(A) = f(x, 1), A = {x}; d = |x − y|; β^{ti,S'j} = {x_j}
18 Perturbation Score Sensitivity (PertSS): VBCM; φ(A) = f(x, y(x)), A = {x}; d = |x − y|; β^{ti,S'j} = {x_j}
19 Perturbation Score Deviation (PertSD): MCM; φ(A) = f(x, y(x)), A = {x}; d = std(X)
20 Perturbation Score Range (PertSR): MCM; φ(A) = f(x, y(x)), A = {x}; d = max(X) − min(X)
21 Average Individual Fairness (AvgIF): PCM; φ(A) = {f(x, 1) | x ∈ A}; d = W1(X, Y)
22 *Average Score Difference (gray): PCM; φ(A) = mean({f(x, 1) | x ∈ A}); d = x − y

Table 2: Existing fairness metrics and how they fit in our generalized metrics. f(x, c), y(x), and
ŷ(x) are the probability associated with a class c, the gold class, and the predicted class for exam-
ple x, respectively. MWU is the Mann-Whitney U test statistic and W1 is the Wasserstein-1 dis-
tance between the distributions of X and Y. Metrics marked with * have been defined in the context
of only two protected groups and do not define the normalizing factor. The metrics associated with
gray circles (gray) cannot be applied to more than two groups (see Section 4). 1 2 (Dixon et al., 2018),
3 21 (Huang et al., 2020a), 4 (Beutel et al., 2019), 5 6 (Borkan et al., 2019), 7 (Gaut et al., 2020),
8 (Beutel et al., 2017; Prost et al., 2019), 9 (Prost et al., 2019), 10 (Beutel et al., 2017),
11 (Blodgett and O'Connor, 2017; Bhaskaran and Bhallamudi, 2019), 12 (De-Arteaga et al., 2019),
13 (Stanovsky et al., 2019; Saunders and Byrne, 2020), 14 (Blodgett et al., 2018), 15 (Bamman et al., 2019),
16 (Webster et al., 2018), 17 (Garg et al., 2019), 18 19 20 (Prabhakaran et al., 2019),
22 (Kiritchenko and Mohammad, 2018; Popović et al., 2020).

Choice of φ and d For scalar-valued φ the
most common bivariate comparison function is
the (absolute) difference between two scores. As
exceptions, Beutel et al. (2019) 4 use the ratio of the
group score to the background score and Webster
et al. (2018) 16 use the ratio between the first
and the second group. Prabhakaran et al.'s (2019)
MCM metrics use multivariate d. Their Perturba-
tion Score Deviation metric 19 uses the standard
deviation of the scores, while their Perturbation


Score Range metric 20 uses the range of the scores
(difference between the maximum and minimum
scores). For set-valued φ, Huang et al. (2020a) 3 21
choose Wasserstein-1 distance (Jiang et al., 2020),
while Borkan et al. (2019) define their com-
parison function using the Mann-Whitney U test
statistic (Mann and Whitney, 1947).
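The comparison functions mentioned above can be written down compactly. The sketch below is ours and relies on SciPy for the Wasserstein-1 distance and the Mann-Whitney U statistic; the exact U convention used by the original metrics may differ from what SciPy returns, so the last function should be read as one plausible formulation.

    from statistics import pstdev
    from scipy.stats import mannwhitneyu, wasserstein_distance

    # bivariate comparisons for scalar-valued phi
    abs_diff = lambda x, y: abs(x - y)
    signed_diff = lambda x, y: x - y
    ratio = lambda x, y: x / y

    # multivariate comparisons for MCM-style metrics
    score_std = lambda xs: pstdev(xs)
    score_range = lambda xs: max(xs) - min(xs)

    # comparisons for set-valued phi (two samples of probability scores)
    def wasserstein1(xs, ys):
        return wasserstein_distance(xs, ys)

    def average_equality_gap(xs, ys):
        # 1/2 - MWU(X, Y) / (|X||Y|), following the AvgEG entries in Table 2
        u, _ = mannwhitneyu(xs, ys, alternative="two-sided")
        return 0.5 - u / (len(xs) * len(ys))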

5 Experimental Details

Having introduced our generalized framework and
classified the existing metrics, we now empirically
investigate their role in detecting the systemic
performance difference across the demographic
groups. We first discuss the relevant experimental
details before presenting our results and analyses
(Section 6).

Models We experiment on three RoBERTa (Liu
et al., 2019) based models:6 (i) a binary classifier
trained on SemEval-2018 valence classification
shared task data (Mohammad et al., 2018) pro-
cessed for binary classification (SemEval-2),7 (ii)
a 3-class classifier trained on SemEval-3, and (iii)
a NER model trained on the CoNLL 2003 Shared
Task data (Tjong Kim Sang and De Meulder,
2003), which uses RoBERTa to encode a text se-
quence and a Conditional Random Field (Lafferty
et al., 2001) to predict the tags. In NER experi-
ments we use the BILOU labeling scheme (Ratinov
and Roth, 2009) and, for the probability-based
metrics, we use the probabilities from the en-
coder's output. Table 5 reports the performance
on the official dev splits for the datasets the models
were trained on.

Evaluation Data For classification, we experi-
ment on seven sensitive attributes, and for each
attribute we devise a number of protected groups
(Table 3).8 We analyze bias within each attribute

6Our preliminary experiments also used models based on
Electra (Clark et al., 2020) as well as those trained on SST-2
and SST-3 datasets (Socher et al., 2013). For all models, we
observed similar trends in differences between the metrics.
Due to space constraints we omit those results and leave a
detailed cross-model bias comparison for future work.

7We process the SemEval data as is commonly done for
SST (Socher et al., 2013). For binary classification, we filter
out the neutral class and compress the multiple fine-grained
positive/negative classes into a single positive/negative class.
For 3-class classification we do not filter out the neutral class.

8For Disability and Race we used the groups from
Hutchinson et al. (2020) and from the Racial and
Ethnic Categories and Definitions for NIH Diversity Pro-
grams (https://grants.nih.gov/grants/guide

Sensitive attribute    Protected groups (T)

Gender                 aab, female, male, cis, many-genders, no-
                       gender, non-binary, trans
Sexual Orientation     asexual, homosexual, heterosexual, bisex-
                       ual, other
Religion               atheism, buddhism, baha'i-faith, christian-
                       ity, hinduism, islam, judaism, mormonism,
                       sikhism, taoism
Race                   african american, american indian, asian,
                       hispanic, pacific islander, white
Age                    young, adult, old
Disability             cerebral palsy, chronic illness, cognitive,
                       down syndrome, epilepsy, hearing, mental
                       health, mobility, physical, short stature,
                       sight, unspecified, none
Nationality            We define 6 groups by categorizing countries
                       based on their GDP.

Table 3: The list of sensitive attributes and
protected groups used in our experiments.

Protected group    Identity terms (I)

aab                AMAB, AFAB, DFAB, DMAB, female-
                   assigned, male-assigned
female             female (adj), female (n), woman
male               male (adj), male (n), man
many-genders       ambigender, ambigendered, androgynous,
                   bigender, bigendered, intersex, intersex-
                   ual, pangender, pangendered, polygender,
                   androgyne, hermaphrodite
no-gender          agender, agendered, genderless

Table 4: Examples of explicit identity terms for
the selected protected groups of Gender.

independently and focus on explicit mentions of
each identity. This is reflected in our choice of
identity terms, which we have gathered from
Wikipedia, Wiktionary, as well as Dixon et al.
(2018) and Hutchinson et al. (2020) (see Table 4
for an example). Additionally, for the Gender
attribute we also investigate implicit mentions—
female and male groups represented with names
typically associated with these genders. We ex-
periment on synthetic data created using hand-
crafted templates, as is common in the literature
(Dixon et al., 2018; Kiritchenko and Mohammad,
2018; Kurita et al., 2019; Huang et al., 2020a).
For each sensitive attribute we use 60 templates

/notice-files/not-od-15-089.html), respec-
tively. For the remaining attributes, we rely on Wikipedia
and Wiktionary, among other sources.


             SemEval-2    SemEval-3    CoNLL 2003
             Accuracy     Accuracy     F1
             0.90         0.73         0.94

Table 5: RoBERTa performance on the official
development splits for the three tasks.

with balanced classes: 20 negative, 20 neutral, and
20 positive templates. For each attribute we use
30 generic templates—with adjective and noun
phrase slots to be filled with identity terms—and
30 attribute-specific templates.9 In Table 6 we
present examples of both generic templates and
attribute-specific templates for Nationality. Note
that the slots of generic templates are designed
to be filled with terms that explicitly reference
an identity (Table 4), and are unsuitable for ex-
periments on female/male names. For this reason,
for names we design an additional 30 name-specific
templates (60 in total). We present examples of
those templates in Table 6.

For NER, we only experiment on Nationality
and generate the evaluation data from 22 tem-
plates with a missing {country} slot, for which we
manually assign a BILOU tag to each token. The
{country} slot is initially labeled as U-LOC and
is later automatically adjusted to a sequence of
labels if a country name filling the slot spans more
than one token, for example, B-LOC L-LOC for New
Zealand.
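This label adjustment can be expressed in a few lines. The sketch below is our own and assumes whitespace tokenization of the country name:

    def bilou_tags_for_slot(country, label="LOC"):
        # U-LOC for a single-token country name; otherwise B-LOC, I-LOC..., L-LOC.
        tokens = country.split()
        if len(tokens) == 1:
            return ["U-" + label]
        return ["B-" + label] + ["I-" + label] * (len(tokens) - 2) + ["L-" + label]

    print(bilou_tags_for_slot("France"))       # ['U-LOC']
    print(bilou_tags_for_slot("New Zealand"))  # ['B-LOC', 'L-LOC']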

Metrics We experiment on metrics that sup-
port more than two protected groups (i.e., the
white-circled metrics in Table 2). As described in
Section 2.2, for each source example we allow for
a number of variations for each group. Thus, for
counterfactual metrics that require only one exam-
ple per group (all counterfactual metrics but Av-
erage Individual Fairness 21) we evaluate on the
|T|-ary Cartesian products over the sets of varia-
tions for all groups. For groups with large |I| we
sample 100 elements from the Cartesian product,
without replacement. We convert Counterfactual
Token Fairness Gap 17 and Perturbation Score
Sensitivity 18 into PCMs because for templated
data there is no single real-world example.
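A sketch of how such evaluation tuples can be built for a single source example (our own illustration; a production implementation would avoid materializing very large products before sampling):

    import random
    from itertools import product

    def counterfactual_tuples(variations, max_tuples=100, seed=0):
        # All |T|-ary combinations of per-group variations for one source example,
        # sub-sampled without replacement when the full Cartesian product is large.
        groups = sorted(variations)
        full = [dict(zip(groups, combo))
                for combo in product(*(variations[g] for g in groups))]
        if len(full) <= max_tuples:
            return full
        return random.Random(seed).sample(full, max_tuples)

    variations = {"female": ["I like Anna.", "I like Mary.", "I like Liz."],
                  "male": ["I like Adam.", "I like Mark.", "I like Chris."]}
    print(len(counterfactual_tuples(variations)))  # 9 tuples for this small example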

Average Group Fairness 3, Counterfactual
Token Fairness Gap 17, and Average Individual
Fairness 21 calculate bias based on the probability

9The templates can be found with the code.

Generic templates

I liked that {identity adj} writer.
As {a:identity np}, I feel hopeful.
The summit to celebrate {identity adj} folk is a wonderful event.
I recently found out that Alex is {a:identity np}.
The main character was {identity adj}.
I watched a documentary about {identity adj} writers.
What a sad day to be {identity adj}.
It's scary how they treat {identity adj} people.
Being {identity adj} makes life really hard.

Nationality templates

The next ACL will be in the beautiful {country}!
The outstanding festival celebrated the best artists from {country}.
{Country} has done a great job in fighting economic inequality.
Alex always wanted to live in {country}.
Next speaker travelled all the way from {country}.
{Country} is UK's most important trading partner.
{Country} is the worst holiday destination.
{Country adj} Music Awards were a disaster.
The designs of this {country adj} based firm are distressing.

Name templates

{Person} teared up and it was heartwarming.
{Person} feels fulfilled as a parent.
You should read this ground-breaking paper by {person}.
As a child, {person} had big dreams.
{Person} received their high-school diploma in 1990.
{Person} came out with new strategies for managing and financing that program.
{Person} sadly didn't get the job.
Nothing special, {person} thought so too.
It's unfair that {person} got that reputable position without prior experience.

Table 6: Template examples.

of the positive/toxic class on all examples. We intro-
duce alternative versions of these metrics which
calculate bias only on examples with gold label c,
which we mark with a (TC) (for true class) suffix.
The original versions target demographic parity
(Dwork et al., 2012), while the TC versions target
equality of opportunity (Hardt et al., 2016) and
can pinpoint the existence of bias more precisely,
as we show later (Section 6).
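As an illustration of the difference, the sketch below (ours) computes a Wasserstein-1 comparison of positive-class probabilities either on all examples of a group (the original, demographic-parity-style version) or only on its gold-positive examples (the TC version); the example scores are placeholders.

    from scipy.stats import wasserstein_distance

    def positive_probabilities(examples, true_class=None):
        # examples: (gold label, probability of the positive class); passing
        # true_class restricts the score to examples with that gold label (TC variant).
        return [p for gold, p in examples
                if true_class is None or gold == true_class]

    group = [(1, 0.83), (0, 0.21), (1, 0.90)]
    background = [(1, 0.80), (0, 0.15), (1, 0.88), (0, 0.30)]

    original = wasserstein_distance(positive_probabilities(group),
                                    positive_probabilities(background))
    tc_positive = wasserstein_distance(positive_probabilities(group, true_class=1),
                                       positive_probabilities(background, true_class=1))
    print(original, tc_positive)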

5.1 Moving Beyond Binary Classification

Fourteen out of 15 white-circled metrics from
Table 2 are inherently classification metrics, 11
of which are defined exclusively for binary clas-
sification. We adapt binary classification metrics

Figure 1: BCM, PCM, and MCM metrics calculated for different sensitive attributes, for the positive class. Metrics
marked with (ALL) are inherently multiclass and are calculated for all classes. Superscripts P and * mark the
probability-based and correctly normalized metrics, respectively. We row-normalize the heatmap coloring, across
the whole figure, using maximum absolute value scaling.

to (i) multiclass classification and (ii) sequence
labeling to support a broader range of NLP tasks.

Multiclass Classification Probability-based
metrics that use the probability of the target class
(18, 19, 20) do not require any adaptations for
multiclass classification. For other metrics, we
measure bias independently for each class c, using
a one-vs-rest strategy for prediction-based metrics
and the probability of class c for the scores of
probability-based metrics (3, 5, 6, 17, 21).

Sequence Labeling We view sequence labeling
as a case of multiclass classification, with each
token being a separate classification decision. As
for multiclass classification, we compute the bias
measurements for each class independently. For
prediction-based metrics, we use a one-vs-rest strat-
egy and base the F1 and FNR scores on exact span
matching.10 For probability-based metrics, for
each token we accumulate the probability scores
for different labels of the same class. For example,
with the BILOU labeling scheme, the probabilities
for B-PER, I-PER, L-PER, and U-PER are summed to
obtain the probability for the class PER. Further,
for counterfactual metrics, to account for different
identity terms yielding different numbers of tokens,
we average the probability scores for all tokens of
multi-token identity terms.
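A sketch of both steps (ours; per-token label probabilities are assumed to be given as dictionaries):

    def class_probability(label_probs, cls):
        # Sum the probabilities of all BILOU labels of one class,
        # e.g., B-PER + I-PER + L-PER + U-PER -> PER.
        return sum(p for label, p in label_probs.items()
                   if label.endswith("-" + cls))

    def identity_term_class_probability(per_token_probs, cls):
        # Average the class probability over the tokens of a (possibly
        # multi-token) identity term.
        scores = [class_probability(t, cls) for t in per_token_probs]
        return sum(scores) / len(scores)

    token = {"B-PER": 0.20, "I-PER": 0.10, "L-PER": 0.05, "U-PER": 0.50, "O": 0.15}
    print(class_probability(token, "PER"))  # 0.85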

10We do not compute FPR based metrics, because false
positives are unlikely to occur for our synthetic data and are
less meaningful if they occur.


Figure 2: Results for BCM and VBCM metrics on the positive class on Gender for explicit (left) and implicit
identities, signaled through names (right).

6 Empirical Metric Comparison

Figure 1 shows the results for sentiment analy-
sis for all attributes on BCM, PCM, and MCM
metrics. In each table we report the original bias
measurements and row-normalize the heatmap
coloring using maximum absolute value scaling
to allow for some cross-metric comparison.11
Figure 1 gives evidence of unintended bias for
most of the attributes we consider, with Disability
and Nationality being the most and least affected
attributes, respectively. We highlight that because
we evaluate on simple synthetic data in which the
expressed sentiment is evident, even small per-
formance differences can be concerning. Figure 1
also gives an initial insight into how the bias
measurements vary across the metrics.

11Even after normalization, bias measurements across met-
rics are not fully comparable—different metrics use different
base measurements (TPR, TNR, etc.) and hence measure
different aspects of bias.

In Figure 2 we present the per-group results for
VBCM and BCM metrics for the example Gender
attribute.12 Similarly, in Figure 3 we show results
for NER for the relevant LOC class. The first set
of results indicates that the most problematic Gen-
der group is cis. For NER we observe a big gap
in the model's performance between the most af-
fluent countries and countries with lower GDP.
In the context of those empirical results we now
discuss how different parameter choices affect the
observed bias measurement.

12We omit the per-group results for the remaining attributes
due to the lack of space. For BCM, we do not include
accumulated values in the normalization.

Key Role of the Base Measurement Perhaps
the most important difference between the met-
rics lies in the parametrization of the scoring
function φ. The choice of φ determines what

type and aspect of bias is being measured, mak-
ing the metrics conceptually different. Consider,
for example, φ of Average Group Fairness
3—{f(x, 1) | x ∈ A}—and Positive Average
Equality Gap 5—{f(x, 1) | x ∈ A, y(x) = 1}.
They are both based on the probabilities associated
with class 1, but the former is computed on all ex-
amples in A, while the latter is computed on only
those examples that belong to the positive class
(i.e., have gold label 1). This difference causes
them to measure different types of bias—the first
targets demographic parity, the second equality of
opportunity.

Further, consider FPED 1 and FNED 2, which
use FPR and FNR for their score, respectively.
This difference alone can lead to entirely different
results. For example, in Figure 2a FNED reveals
prominent bias for the cis group while FPED
shows none. Taken together, these results signal
that the model's behavior for this group is notably
different from the other groups but this difference
manifests itself only on the positive examples.

Figure 3: Results for the NER model on the National-
ity attribute for six groups defined by categorizing
countries based on their GDP (six quantiles) for the
(most relevant) LOC class. We present group metrics at
the top and the counterfactual metrics at the bottom.
The probability-based metrics not marked with (TC)
use probability scores for LOC for all tokens, includ-
ing O; hence they are less meaningful than their TC
alternatives.

(In)Correct Normalization Next, we highlight
the importance of correct normalization. We ar-
gue that fairness metrics should be invariant to the
number of considered protected groups, otherwise
the bias measurements are incomparable and can
be misleadingly elevated. The latter is the case for
three metrics—FPED 1, FNED 2, and Disparity
Score 7. The first two lack any kind of nor-
malization, while Disparity Score is incorrectly
normalized—N is set to the number of groups,
rather than group pairs. In Figure 1 we present
the results on the original versions of those met-
rics and for their correctly normalized versions,
marked with *. The latter result in much lower
bias measurements. This is all the more important
for FPED and FNED, as they have been very in-
fluential, with many works relying exclusively on
these metrics (Rios, 2020; Huang et al., 2020b;
Gencoglu, 2021; Rios and Lwowski, 2020).

Relative vs Absolute Comparison Next, we ar-
gue that the results of metrics based on the relative
comparison, for example, FPR Ratio 4, can be
misleading and hard to interpret if the original
scores are not reported. In particular, the relative
comparison can amplify bias in cases when both
scores are low; in such scenarios even a very small
absolute difference can be relatively large. Such
amplification is evident in the FNR Ratio metric
(FNR equivalent of FPR Ratio) on female vs male
names for RoBERTa fine-tuned on SemEval-2
(Figure 2b). Similarly, when both scores are very
high, the bias can be underestimated—a signif-
icant difference between the scores can seem
relatively small if both scores are large. Indeed,
such effects have also been widely discussed
in the context of reporting health risks (Forrow
et al., 1992; Stegenga, 2015; Noordzij et al., 2017).
In contrast, the results of metrics based on abso-
lute comparison can be meaningfully interpreted,
even without the original scores, if the range of
the scoring function is known and interpretable
(which is the case for all metrics we review).

Importance of Per-Group Results Most group
metrics accumulate the results obtained for
different groups. Such accumulation leads to di-
luted bias measurements in situations where the
performance differs only for a small proportion
of all groups. This is evident in, for example, the
per-group NER results for correctly normalized
metrics (Figure 3). We emphasize the importance
of reporting per-group results whenever possible.

Prediction vs Probability Based In contrast
to prediction-based metrics, probability-based
metrics also capture more subtle performance
differences that do not lead to different predic-
tions. This difference can be seen, for example,
for the aab Gender group results for SemEval-2
(Figure 2a) and the results for female/male names
for SemEval-3 (Figure 2d). We contend that it
is beneficial to use both types of metrics to un-
derstand the effect of behavior differences on
predictions and to allow for detection of more
subtle differences.

Signed vs Unsigned Out of the 15 white-circled
metrics only two are signed; Positive and Negative
Average Equality Gap (AvgEG) 5 6. Using at
least one signed metric allows for quick identifica-
tion of the bias direction. For example, results for
Average Equality Gap reveal that examples men-
tioning the cis Gender group are considered less
positive than examples mentioning other groups
and, for NER, the probability of LOC is lower
for the richest countries (first and second quantiles
have negative signs).

True Class Evaluation We observe that the TC
versions of probability-based metrics allow for better un-
derstanding of bias location, compared with their
non-TC alternatives. Consider Average Group
Fairness 3 and its TC versions evaluated on
the positive class (PosAvgGF) and negative class
(NegAvgGF) for binary classification (Figure 2a).
The latter two reveal that the differences in
behavior apply solely to the positive examples.

6.1 Fairness Metrics vs Significance Tests

Just like fairness metrics, statistical significance
tests can also detect the presence of systematic
differences in the behavior of a model, and thus
are often used as alternative means to quantify
bias (Mohammad et al., 2018; Davidson et al.,
2019; Zhiltsova et al., 2019). However, in con-
trast to fairness metrics, significance tests do not
capture the magnitude of the differences. Rather,
they quantify the likelihood of observing given
differences under the null hypothesis. This is an
important distinction with clear empirical conse-
quences, as even very subtle differences between
the scores can be statistically significant.

To demonstrate this, we present p-values for
significance tests for which we use the probabil-
ity of the positive class as a dependent variable
(Table 7). Following Kiritchenko and Mohammad
(2018), we obtain a single probability score for

Attribute              SemEval-2        SemEval-3
Gender (names)         8.72 × 10^−1     3.05 × 10^−6
Gender                 1.41 × 10^−8     3.80 × 10^−24
Sexual Orientation     2.76 × 10^−9     9.49 × 10^−24
Religion               1.14 × 10^−23    8.24 × 10^−36
Nationality            1.61 × 10^−2     1.45 × 10^−14
Race                   2.18 × 10^−5     8.44 × 10^−5
Age                    4.86 × 10^−2     4.81 × 10^−8
Disability             9.67 × 10^−31    2.89 × 10^−44

Table 7: P-values for the Wilcoxon signed-rank
test (attribute Gender (names)) and the Friedman
test (all other attributes).

each template by averaging the results across all
identity terms per group. Because we evaluate on
synthetic data, which is balanced across all groups,
we use the scores for all templates regardless
of their gold class. We use the Friedman test
for all attributes with more than two protected
groups. For Gender with male/female names as
identity terms we use the Wilcoxon signed-rank
test. We observe that, despite the low absolute
values of the metrics obtained for the National-
ity attribute (Figure 1), the behavior of the models
across the groups is unlikely to be equal. The same
applies to the results for female vs male names
for SemEval-3 (Figure 2d). Utilizing a test for
statistical significance can capture such nuanced
presence of bias.
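Both tests are readily available in SciPy. The sketch below is ours, with randomly generated placeholder scores standing in for the per-template probabilities, and mirrors this setup:

    import numpy as np
    from scipy.stats import friedmanchisquare, wilcoxon

    # scores[g][j]: positive-class probability for group g on template j,
    # already averaged over that group's identity terms (placeholder values here).
    rng = np.random.default_rng(0)
    scores = {g: rng.uniform(0.4, 0.9, size=60) for g in ("young", "adult", "old")}

    stat, p = friedmanchisquare(*scores.values())           # more than two related groups
    w_stat, w_p = wilcoxon(scores["young"], scores["old"])  # paired two-group case
    print(p, w_p)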

Notably, Average Equality Gap metrics 5 6
occupy an atypical middle ground between being a
fairness metric and a significance test. In contrast
to other metrics from Table 2, they do not quantify
the magnitude of the differences, but the likelihood
of a group being considered less positive than the
background.

7 Which Metrics to Choose?

In the previous section we highlighted important
differences between the metrics which stem from
different parameter choices. In particular, we em-
phasized the difference between prediction and
probability-based metrics, in regards to their sen-
sitivity to bias, as well as the conceptual distinction
between the fairness metrics and significance tests.
We also stressed the importance of correct normal-
ization of metrics and reporting per-group results
whenever possible. However, one important ques-
tion still remains unanswered: Out of the many


different metrics that can be used, which ones are
the most appropriate? Unfortunately, there is no
easy answer. The choice of the metrics depends on
many factors, including the task, the particulars of
how and where the system is deployed, as well as
the goals of the researcher.

In line with the recommendations of Olteanu
et al. (2017) and Blodgett et al. (2020), we assert
that fairness metrics need to be grounded in the
application domain and carefully matched to the
type of studied bias to offer meaningful insights.
While we cannot be too prescriptive about the ex-
act metrics to choose, we advise against reporting
results for all the metrics presented in this paper.
Instead, we suggest a three-step process that helps
to narrow down the full range of metrics to those
that are the most applicable.

Step 1. Identifying the type of question to ask
and choosing the appropriate generalized met-
ric to answer it. As discussed in Section 3, each
generalized metric is most suitable in different
scenarios; for example, MCM metrics can be used
to investigate whether the attribute has any overall
effect on the model's performance and (V)BCM
allows us to investigate how the performance for
particular groups differs with respect to the model's
general performance.

Step 2. Identifying scoring functions that tar-
get the studied type and aspect of bias. At this
stage it is important to consider the practical conse-
quences behind potential base measurements. For
example, for sentiment classification, misclassi-
fying positive sentences mentioning a specific de-
mographic as negative can be more harmful than
misclassifying negative sentences as positive, as
it can perpetuate negative stereotypes. Conse-
quently, the most appropriate φ would be based
on FNR or the probability of the negative class.
In contrast, in the context of convicting low-
level crimes, a false positive has more serious
practical consequences than a false negative, since
it may have a long-term detrimental effect on a
person's life. Further, the parametrization of φ
should be carefully matched to the motivation of
the study and the assumed type/conceptualization
of bias.

Step 3. Making the remaining parameter
choices. In particular, deciding on the compar-
ison function most suitable for the selected φ and
the targeted bias; for example, absolute difference
if φ is scalar-valued or Wasserstein-1 distance
for set-valued φ.

The above three steps can identify the most
relevant metrics, which can be further filtered
down to the minimal set sufficient to identify the
studied bias. To get a complete understanding of
a model's (un)fairness, our general suggestion is
to consider at least one prediction-based metric
and one probability-based metric. Those can be
further complemented with a test for statistical
significance. Finally, it is essential that the results
of each metric are interpreted in the context of the
score employed by that metric (see Section 6). It is
also universally good practice to report the results
from all selected metrics, regardless of whether
they do or do not give evidence of bias.

8 Related Work

To our knowledge, we are the first to review
and empirically compare fairness metrics used
within NLP. Close to our endeavor are surveys
that discuss types, sources, and mitigation of bias
in NLP or AI in general. Surveys of Mehrabi et al.
(2019), Hutchinson and Mitchell (2019), and
Chouldechova and Roth (2020) cover a broad
scope of literature on algorithmic fairness. Shah
et al. (2020) offer both a survey of bias in NLP
as well as a conceptual framework for studying
bias. Sun et al. (2019) provide a comprehensive
overview of addressing gender bias in NLP. There
are also many task-specific surveys, for exam-
ple, for language generation (Sheng et al., 2021)
or machine translation (Savoldi et al., 2021). Fi-
nally, Blodgett et al. (2020) outline a number of
methodological issues, such as providing vague
motivations, which are common for papers on
bias in NLP.

We focus on measuring bias exhibited on classi-
fication and sequence labeling downstream tasks.
A related line of research measures bias present
in sentence or word representations (Bolukbasi
et al., 2016; Caliskan et al., 2017; Kurita et al.,
2019; Sedoc and Ungar, 2019; Chaloner and
Maldonado, 2019; Dev and Phillips, 2019; Gonen
and Goldberg, 2019; Hall Maudslay et al., 2019;
Liang et al., 2020; Shin et al., 2020; Liang et al.,
2020; Papakyriakopoulos et al., 2020). How-
ever, such intrinsic metrics have been recently
shown not to correlate with application bias
(Goldfarb-Tarrant et al., 2021). In yet another


line of research, Badjatiya et al. (2019) detect bias
through identifying bias-sensitive words.

Beyond the fairness metrics and significance
tests, some works quantify bias through calculat-
ing a standard evaluation metric, for example, F1
or accuracy, or a more elaborate measure indepen-
dently for each protected group or for each split
of a challenge dataset (Hovy and Søgaard, 2015;
Rudinger et al., 2018; Zhao et al., 2018; Garimella
et al., 2019; Sap et al., 2019; Bagdasaryan et al.,
2019; Stafanoviˇcs et al., 2020; Tan et al., 2020;
Mehrabi et al., 2020; Nadeem et al., 2020; Cao
and Daumé III, 2020).

9 Conclusion

We conduct a thorough review of existing fair-
ness metrics and demonstrate that they are simply
parametric variants of the three generalized fair-
ness metrics we propose, each suited to a different
type of a scientific question. Further, we empiri-
cally demonstrate that the differences in parameter
choices for our generalized metrics have direct
impact on the bias measurement. In light of our
results, we provide a range of concrete sugges-
tions to guide NLP practitioners in their metric
choices.

We hope that our work will facilitate further re-
search in the bias domain and allow the researchers
to direct their efforts towards bias mitigation.
Because our framework is language and model
agnostic, in the future we plan to experiment on
more languages and use our framework as prin-
cipled means of comparing different models with
respect to bias.

Acknowledgments

We would like to thank the anonymous reviewers
for their thoughtful comments and suggestions.
We also thank the members of Amazon AI for
many useful discussions and feedback.

References

Pinkesh Badjatiya, Manish Gupta, and Vasudeva
Varma. 2019. Stereotypical bias removal for
hate speech detection task using knowledge-
based generalizations. In The World Wide Web
Conference, pages 49–59. https://doi.org
/10.1145/3308558.3313504

Eugene Bagdasaryan, Omid Poursaeed, and Vitaly
Shmatikov. 2019. Differential privacy has dis-
parate impact on model accuracy. In Advances
in Neural Information Processing Systems,
pages 15479–15488.

David Bamman, Sejal Popat, and Sheng Shen.
2019. An annotated dataset of literary entities.
In Proceedings of the 2019 Conference of
the North American Chapter of the Associ-
ation for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and
Short Papers), pages 2138–2144, Minneapolis,
Minnesota. Association for Computational
Linguistics.

Alex Beutel, J. Chen, Zhe Zhao, and Ed Huai
hsin Chi. 2017. Data decisions and theoret-
ical implications when adversarially learning
fair representations. Workshop on Fairness,
Accountability, and Transparency in Machine
Learning (FAT/ML 2017).

Alex Beutel, Jilin Chen, Tulsee Doshi, Hai Qian,
Allison Woodruff, Christine Luu, Pierre
Kreitmann, Jonathan Bischof, and Ed H. Chi.
2019. Putting fairness principles into practice:
Challenges, metrics, and improvements. In
Proceedings of the 2019 AAAI/ACM Confer-
ence on AI, Ethics, and Society, pages 453–459.
https://doi.org/10.1145/3306618.3314234

Jayadev Bhaskaran and Isha Bhallamudi. 2019.
Good secretaries, bad truck drivers? Occupa-
tional gender stereotypes in sentiment analysis.
In Proceedings of the First Workshop on Gen-
der Bias in Natural Language Processing,
pages 62–68, Florence, Italy. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/W19-3809

Su Lin Blodgett, Solon Barocas, Hal Daumé III,
and Hanna Wallach. 2020. Language (technol-
ogy) is power: A critical survey of ''bias'' in
NLP. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics, pages 5454–5476, Online. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/2020.acl-main.485

Su Lin Blodgett and Brendan T. O'Connor. 2017.
Racial Disparity in Natural Language Process-
ing: A case study of social media African-
American English. Workshop on Fairness,
Accountability, and Transparency in Machine
Learning (FAT/ML 2017).

Su Lin Blodgett, Johnny Wei, and Brendan
O'Connor. 2018. Twitter universal dependency
parsing for African-American and mainstream
American English. In Proceedings of the 56th
Annual Meeting of the Association for Com-
putational Linguistics (Volume 1: Long Pa-
pers), pages 1415–1425, Melbourne, Australia.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/P18-1131

Tolga Bolukbasi, Kai-Wei Chang, James Zou,
Venkatesh Saligrama, and Adam Kalai. 2016.
Man is to computer programmer as woman
is to homemaker? Debiasing word embed-
dings. In Proceedings of the 30th International
Conference on Neural Information Process-
ing Systems, NIPS'16, pages 4356–4364, Red
Hook, NY, USA. Curran Associates Inc.

Daniel Borkan, Lucas Dixon, Jeffrey Sorensen,
Nithum Thain, and Lucy Vasserman. 2019.
Nuanced metrics for measuring unintended
bias with real data for text classification. In
Companion Proceedings of The 2019 World
Wide Web Conference, pages 491–500, San
Francisco, USA. ACM. https://doi.org
/10.1145/3308560.3317593

Aylin Caliskan, Joanna Bryson, and Arvind
Narayanan. 2017. Semantics derived automat-
ically from language corpora contain human-
like biases. Science, 356:183–186. https://
doi.org/10.1126/science.aal4230, PubMed:
28408601

Yang Trista Cao and Hal Daumé III. 2020.
Toward gender-inclusive coreference resolu-
tion. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguis-
tics, pages 4568–4595, Online. Association for
Computational Linguistics.

Kaytlin Chaloner and Alfredo Maldonado. 2019.
Measuring gender bias in word embeddings
across domains and discovering new gender
bias word categories. In Proceedings of the
First Workshop on Gender Bias in Natural Lan-
guage Processing, pages 25–32, Florence, Italy.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/W19-3804

Alexandra Chouldechova and Aaron Roth. 2020.
A snapshot of the frontiers of fairness in ma-
chine learning. Communications of the ACM,
63(5):82–89. https://doi.org/10.1145
/3376898

Kevin Clark, Minh-Thang Luong, Quoc V. Le,
and Christopher D. Manning. 2020. ELECTRA:
Pre-training text encoders as discriminators
rather than generators. The International Con-
ference on Learning Representations (ICLR).

Thomas Davidson, Debasmita Bhattacharya, and
Ingmar Weber. 2019. Racial bias in hate speech
and abusive language detection datasets. In Pro-
ceedings of the Third Workshop on Abusive
Language Online, pages 25–35, Florence, Italy.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/W19-3504

Maria De-Arteaga, Alexey Romanov, Hanna
Wallach, Jennifer Chayes, Christian Borgs,
Alexandra Chouldechova, Sahin Geyik,
Krishnaram Kenthapadi, and Adam Tauman
Kalai. 2019. Bias in bios: A case study of
semantic representation bias in a high-stakes
setting. In Proceedings of the Conference on
Fairness, Accountability, and Transparency,
FAT* '19, pages 120–128, New York, NY,
USA. Association for Computing Machinery.
https://doi.org/10.1145/3287560.3287572

Sunipa Dev and Jeff Phillips. 2019. Attenu-
ating bias in word vectors. In Proceedings of
the Twenty-Second International Confer-
ence on Artificial Intelligence and Statistics,
volume 89 of Proceedings of Machine Learn-
ing Research, pages 879–887. PMLR.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum
Thain, and Lucy Vasserman. 2018. Measuring
and mitigating unintended bias in text classi-
fication. In Proceedings of the 2018 AAAI/
ACM Conference on AI, Ethics, and Society,
AIES '18, pages 67–73, New York, NY,
USA. Association for Computing Machinery.
https://doi.org/10.1145/3278721.3278729


Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS '12, pages 214–226, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/2090236.2090255

Lachlan Forrow, William C. Taylor, and Robert M. Arnold. 1992. Absolutely relative: How research results are summarized can affect treatment decisions. The American Journal of Medicine, 92(2):121–124. https://doi.org/10.1016/0002-9343(92)90100-P

Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H. Chi, and Alex Beutel. 2019. Counterfactual fairness in text classification through robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES '19, pages 219–226, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3306618.3317950

Aparna Garimella, Carmen Banea, Dirk Hovy, and Rada Mihalcea. 2019. Women's syntactic resilience and men's grammatical luck: Gender-bias in part-of-speech tagging and dependency parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3493–3498, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1339

Andrew Gaut, Tony Sun, Shirlyn Tang, Yuxin Huang, Jing Qian, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. 2020. Towards understanding gender bias in relation extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2943–2953, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.265

Oguzhan Gencoglu. 2021. Cyberbullying detection with fairness constraints. IEEE Internet Computing, 25(01):20–29. https://doi.org/10.1109/MIC.2020.3032461

Seraphina Goldfarb-Tarrant, Rebecca Marchant, Ricardo Muñoz Sanchez, Mugdha Pandya, and Adam Lopez. 2021. Intrinsic bias metrics do not correlate with application bias. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.150

Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614, Minneapolis, Minnesota. Association for Computational Linguistics.

Hila Gonen and Kellie Webster. 2020. Automatically identifying gender issues in machine translation using perturbations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1991–1995, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.180

Rowan Hall Maudslay, Hila Gonen, Ryan Cotterell, and Simone Teufel. 2019. It's all in the name: Mitigating gender bias with name-based counterfactual data substitution. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5267–5275, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1530

Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of opportunity in supervised learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 3323–3331, Red Hook, NY, USA. Curran Associates Inc.

Dirk Hovy and Anders Søgaard. 2015. Tagging performance correlates with author age. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 483–488, Beijing, China. Association for Computational Linguistics.

Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Yogatama, and Pushmeet Kohli. 2020a. Reducing sentiment bias in language models via counterfactual evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 65–83, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.7

Xiaolei Huang, Linzi Xing, Franck Dernoncourt, and Michael J. Paul. 2020b. Multilingual twitter corpus and baselines for evaluating demographic bias in hate speech recognition. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1440–1448, Marseille, France. European Language Resources Association.

Ben Hutchinson and Margaret Mitchell. 2019. 50 years of test (un)fairness: Lessons for machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 49–58. https://doi.org/10.1145/3287560.3287600

Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. 2020. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.487

Abigail Z. Jacobs, Su Lin Blodgett, Solon Barocas, Hal Daumé, and Hanna Wallach. 2020. The meaning and measurement of bias: Lessons from natural language processing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* '20, page 706, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3351095.3375671

Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, and Silvia Chiappa. 2020. Wasserstein fair classification. In Ryan P. Adams and Vibhav Gogate, editors, Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115 of Proceedings of Machine Learning Research, pages 862–872. PMLR.

Svetlana Kiritchenko and Saif Mohammad. 2018. Examining gender and race bias in two hundred sentiment analysis systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 43–53, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/S18-2005

Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W. Black, and Yulia Tsvetkov. 2019. Measuring bias in contextualized word representations. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 166–172, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-3823

Matt J. Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. In Advances in Neural Information Processing Systems, volume 30, pages 4066–4076. Curran Associates, Inc.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Paul Pu Liang, Irene Mengze Li, Emily Zheng, Yao Chong Lim, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2020. Towards debiasing sentence representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5502–5515, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.488

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692. Version 1.

H. B. Mann and D. R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Statist., 18(1):50–60. https://doi.org/10.1214/aoms/1177730491

Ninareh Mehrabi, Thamme Gowda, Fred Morstatter, Nanyun Peng, and Aram Galstyan. 2020. Man is to person as woman is to location: Measuring gender bias in named entity recognition. In Proceedings of the 31st ACM Conference on Hypertext and Social Media, HT '20, pages 231–232, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3372923.3404804

Ninareh Mehrabi, Fred Morstatter, N. Saxena, Kristina Lerman, and A. Galstyan. 2019. A survey on bias and fairness in machine learning. CoRR, abs/1908.09635. Version 2.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in Tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 1–17, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/S18-1001

Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. StereoSet: Measuring stereotypical bias in pretrained language models. CoRR, abs/2004.09456. Version 1. https://doi.org/10.18653/v1/2021.acl-long.416

Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.154

M. Noordzij, M. van Diepen, F. Caskey, and K. Jager. 2017. Relative risk versus absolute risk: One cannot be interpreted without the other: Clinical epidemiology in nephrology. Nephrology Dialysis Transplantation, 32:ii13–ii18. https://doi.org/10.1093/ndt/gfw465, PubMed: 28339913

Alexandra Olteanu, Kartik Talamadupula, and Kush R. Varshney. 2017. The limits of abstract evaluation metrics: The case of hate speech detection. In Proceedings of the 2017 ACM on Web Science Conference, pages 405–406. https://doi.org/10.1145/3091478.3098871

Orestis Papakyriakopoulos, Simon Hegelich, Juan Carlos Medina Serrano, and Fabienne Marco. 2020. Bias in word embeddings. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* '20, pages 446–457, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3351095.3372843

Radomir Popović, Florian Lemmerich, and Markus Strohmaier. 2020. Joint multiclass debiasing of word embeddings. In Foundations of Intelligent Systems, pages 79–89, Cham. Springer International Publishing. https://doi.org/10.1007/978-3-030-59491-6_8

Vinodkumar Prabhakaran, Ben Hutchinson, and Margaret Mitchell. 2019. Perturbation sensitivity analysis to detect unintended model biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5740–5745, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1578

Flavien Prost, Nithum Thain, and Tolga Bolukbasi. 2019. Debiasing embeddings for reduced gender bias in text classification. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 69–75, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-3810


Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 147–155, Boulder, Colorado. Association for Computational Linguistics. https://doi.org/10.3115/1596374.1596399


Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.442

Anthony Rios. 2020. FuzzE: Fuzzy fairness evaluation of offensive language classifiers on African-American English. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 881–889. https://doi.org/10.1609/aaai.v34i01.5434

Anthony Rios and Brandon Lwowski. 2020. An empirical study of the downstream reliability of pre-trained word embeddings. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3371–3388, Barcelona, Spain (Online). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.299

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2002

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy. Association for Computational Linguistics.

Danielle Saunders and Bill Byrne. 2020. Reducing gender bias in neural machine translation as a domain adaptation problem. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7724–7736, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.690

Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. Gender bias in machine translation. Transactions of the Association for Computational Linguistics.

João Sedoc and Lyle Ungar. 2019. The role of protected class word lists in bias identification of contextualized word representations. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 55–61, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-3808

Deven Santosh Shah, H. Andrew Schwartz, and Dirk Hovy. 2020. Predictive biases in natural language processing models: A conceptual framework and overview. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5248–5264, Online. Association for Computational Linguistics.

Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng. 2020. Towards controllable biases in language generation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3239–3254, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.291

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2021. Societal biases in language generation: Progress and challenges. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Seungjae Shin, Kyungwoo Song, JoonHo Jang, Hyemi Kim, Weonyoung Joo, and Il-Chul Moon. 2020. Neutralizing gender bias in word embeddings with latent disentanglement and counterfactual generation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3126–3140, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.280


Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Artūrs Stafanovičs, Toms Bergmanis, and Mārcis Pinnis. 2020. Mitigating gender bias in machine translation with target gender annotations. In Proceedings of the 5th Conference on Machine Translation (WMT), pages 629–638. Association for Computational Linguistics.

Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. 2019. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1679–1684, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1164

Jacob Stegenga. 2015. Measuring effectiveness. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 54:62–71. https://doi.org/10.1016/j.shpsc.2015.06.003, PubMed: 26199055

Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. 2019. Mitigating gender bias in natural language processing: Literature review. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1630–1640, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1159

Samson Tan, Shafiq Joty, Min-Yen Kan, and Richard Socher. 2020. It's morphin' time! Combating linguistic discrimination with inflectional perturbations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2920–2935, Online. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147. https://doi.org/10.3115/1119176.1119195

Kellie Webster, Marta Recasens, Vera Axelrod, and Jason Baldridge. 2018. Mind the GAP: A balanced corpus of gendered ambiguous pronouns. Transactions of the Association for Computational Linguistics, 6:605–617. https://doi.org/10.1162/tacl_a_00240

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2003

A. Zhiltsova, S. Caton, and Catherine Mulway. 2019. Mitigation of unintended biases against non-native English texts in sentiment analysis. In Proceedings for the 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science.
