Fair Is Better than Sensational: Man Is to
Doctor as Woman Is to Doctor
Malvina Nissim
University of Groningen
Center for Language and Cognition
m.nissim@rug.nl
Rik van Noord
University of Groningen
Center for Language and Cognition
r.i.k.van.noord@rug.nl
Rob van der Goot
IT University of Copenhagen
Computer Science Department
robv@itu.dk
Analogies such as man is to king as woman is to X are often used to illustrate the amazing
power of word embeddings. Concurrently, they have also been used to expose how strongly
human biases are encoded in vector spaces trained on natural language, with examples like
man is to computer programmer as woman is to homemaker. Recent work has shown
that analogies are in fact not an accurate diagnostic for bias, but this does not mean that they are
not used anymore, or that their legacy is fading. Instead of focusing on the intrinsic problems of
the analogy task as a bias detection tool, we discuss a series of issues involving implementation as
well as subjective choices that might have yielded a distorted picture of bias in word embeddings.
We stand by the truth that human biases are present in word embeddings, and, of course, the
need to address them. But analogies are not an accurate tool to do so, and the way they have been
most often used has exacerbated some possibly non-existing biases and perhaps hidden others.
Because they are still widely popular, and some of them have become classics within and outside
the NLP community, we deem it important to provide a series of clarifications that should put
well-known, and potentially new analogies, into the right perspective.
1. Introduction
Word embeddings are distributed representations of texts that capture similarities be-
tween words. Besides improving a wide variety of NLP tasks, their power is often also
tested intrinsically. Mikolov et al. (2013) introduced the idea of testing the soundness of
embedding spaces via the analogy task. Analogies are equations of the form A : B :: C : D,
or simply A is to B as C is to D. Given the terms A, B, C, the model must return the
Submission received: 28 May 2019; revised version received: 9 December 2019; accepted for publication:
19 January 2020.
https://doi.org/10.1162/coli_a_00379
© 2020 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license
word that correctly stands for D in the given analogy. A most classic example is man
is to king as woman is to X, where the model is expected to return queen, by subtracting
“manness” from the concept of king to obtain some general royalty, and then re-adding
some “womanness” to obtain the concept of queen (king − man + woman = queen).
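As an informal illustration, the offset query can be reproduced with off-the-shelf tools; the following is a minimal sketch assuming the gensim library and a local copy of the pretrained GoogleNews word2vec vectors (the file name is an assumption):

from gensim.models import KeyedVectors

# Load the pretrained vectors (assumed to be available locally).
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman: most_similar implements the offset query; note that, by
# default, the input words themselves are excluded from the returned candidates.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))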
Besides showing this kind of seemingly magical power, analogies have been exten-
sively used to show that embeddings carry worrying biases present in our society and
thus encoded in language. This bias is often demonstrated by using the analogy task to
find stereotypical relations, such as the classic man is to doctor as woman is to nurse or man
is to computer programmer as woman is to homemaker.
The potential of the analogy task has been recently questioned, though. It has been
argued that what is observed through the analogy task might be mainly due to irrele-
vant neighborhood structure rather than to the vector offset that supposedly captures
the analogy itself (Linzen 2016; Rogers, Drozd, and Li 2017). Also, Drozd, Gladkova, and
Matsuoka (2016) have shown that the original and classically used 3COSADD method
(Mikolov et al. 2013) is not able to capture all linguistic regularities present in the em-
beddings. With the recently proposed contextual embeddings (Peters et al. 2018; Devlin
et al. 2019), it is non-trivial to evaluate on the analogy task, and it is thus not commonly
used. Recent research has shown that analogies are also not an accurate diagnostic to
detect bias in word embeddings (Gonen and Goldberg 2019). Nevertheless, analogies
are not only still widely used, but have also left a strong footprint, with some by-now-
classic examples often brought up as proof of human bias in language models. A case
in point is the opening speech by the ACL President at ACL 2019 in Florence, Italy,
where the issue of bias in embeddings is brought up showing biased analogies from a
2019 paper (Manzini et al. 2019b).1
This contribution thus aims at providing some clarifications over the past use of
analogies to hopefully raise further and broader awareness of their potential and their
limitations, and put well-known and possibly new analogies in the right perspective.2
First, we take a closer look at the concept of analogy together with requirements and
expectations. We look at how the original analogy structure was used to query embed-
dings, and some misconceptions that a simple implementation choice has caused. In the
original proportional analogy implementation, all terms of the equation A : B :: C : D are
distinct (Mikolov et al. 2013), that is, the model is forced to return a different concept than
any of the input ones. Given an analogy of the form A : B :: C : D, the code explicitly
prevents yielding any term D such that D == B, D == A, or D == C. Although this
constraint is helpful when all terms are expected to be different, it becomes a problem,
and even a dangerous artifact, when the terms could or should be the same.
Second, we discuss different analogy detection strategies/measures that have been
proposed, namely, the original 3COSADD measure, the 3COSMUL measure (Levy and
Goldberg 2014), and the Bolukbasi et al. (2016) formula, which introduces a different
take on the analogy construction, reducing the impact of subjective choices (Section 4.3).
Third, we highlight the role played by human biases in choosing which analogies
to search for, and which results to report. We also show that even when subjective
1 https://www.microsoft.com/en-us/research/uploads/prod/2019/08/ACL-MingZhou-50min-ming.
v9-5d5104dcbe73c.pdf, slide 29.
2 This work does not mean at all to downplay the presence and danger of human biases in word
embeddings. On the contrary: Embeddings do encode human biases (Caliskan, Bryson, and Narayanan
2017; Garg et al. 2018; Kozlowski, Taddy, and Evans 2019; Gonen and Goldberg 2019), and we agree that
this issue deserves the full attention of the field (Hovy and Spruit 2016).
choices are minimized in input (as in Bolukbasi et al. 2016), parameter tuning might
have consequences on the results, which should not go unnoticed or underestimated.
2. What Counts as Analogy?
In linguistics, analogies of the form A : B :: C : D can be conceived on two main levels
of analysis (Fischer 2019). The first one is morphological (so-called strict proportional
analogies), and they account for systematic language regularities. The second one is
more at the lexico-semantic level, and similarities can get looser and more subject to
interpretation (e.g., traffic is to street as water is to riverbed [Turney 2012]). The original,
widely used, analogy test set introduced by Mikolov et al. (2013) consists indeed of two
main categories: morphosyntactic analogies (car is to cars as table is to tables) and semantic
analogies (Paris is to France as Tokyo is to Japan). Within these, examples are classified in
more specific sub-categories.
There are two important aspects that must be considered following the above. First,
analogies are (traditionally) mostly conceived as featuring four distinct terms. Second,
we need to distinguish between cases where there is one specific, 预期的, 正确的
fourth term, and cases where there is not. Both aspects bear important methodological
consequences in the way we query and analyze (biased) analogies in word embeddings.
2.1 Should All Terms Be Different?
Two of the four constraints introduced by Turney in formally defining analogies indi-
rectly force the terms B and D to be different (Turney 2012, p. 540). Also, all examples
of the original analogy test (Mikolov et al. 2013) expect four different terms. Is this
always the case? Are expressions featuring the same term twice non-analogies?
Because most out-of-the-box word embeddings have no notion of senses, homo-
graphs are modeled as one unit. For example, the infinitive form and the past tense
of the verb to read will be represented by one single vector for the word read. A con-
sequence of this is that for certain examples, two terms would be identical, 尽管
they would be conceptually different. In strong verbs, infinitive and simple past can be
homographs (e.g., split/split), and countries or regions can be homographs with their
capitals (e.g., Singapore/Singapore). Other cases where all terms are not necessarily dis-
tinct include "is-a" relations (hypernyms, cat:animal :: dog:animal), and ordered concepts
(silver:gold :: bronze:silver). Moreover, the extended analogy test set created by Gladkova,
Drozd, and Matsuoka (2016) also includes examples where B is the correct answer, for
example country:language and thing:color. While these examples might not be conceived
as standard analogies, the issue with homographs remains.
2.2 Is There a Correct Answer?
In Mikolov’s analogy set, all the examples are such that given the first three terms,
there is one specific, 正确的 (预期的) fourth term. We can call such analogies “factual.”
While morphosyntactic analogies are in general indeed factual (but there are exceptions
due to homographical ambiguities), the picture is rather different for the semantic ones.
If we take man:computer programmer :: woman:X as a semantic analogy, what is the “cor-
rect” answer? Is there an expected, unbiased completion to this query? Compare it to the
case of he:actor :: 她:X—it seems straightforward to assume that X should be resolved
to actress. However, such resolution easily rescales the analogy to a morphosyntactic
rather than semantic level, thereby also ensuring a factual, unbiased answer.
The morphosyntactic and semantic levels are indeed not always distinct. When
querying man:doctor :: woman:X, is one after a morphosyntactic or a semantic an-
swer? Morphosyntactically, we should resolve to doctor, thereby violating the all-terms-
different constraint. If we take the semantic interpretation, there is no single predefined
term that “correctly” completes the analogy (or perhaps doctor does here too).3
In such nonfactual, more creative analogies, various terms could be used for com-
pletion depending on the implied underlying relation (特尼 2012), which could be
unclear or unspecified in the query. For the analogies used by Manzini et al. (2019b)
(see Table 2 later in the article), for example, it is rather unclear what one would expect
to find. Some of the returned terms might be biased, but in order to claim bias, one
should also conceive the expected unbiased term. So, if doctor is not eligible because it violates
the distinction constraint, what would the unbiased answer be?
When posing queries, all such aspects should be considered, and one should be
aware of what analogy algorithms and implementations are designed to detect. If the
correct or unbiased answer to man:woman :: doctor:X is expected to be doctor and the
model is not allowed to return any of the input terms as it would otherwise not abide by
the definition of analogy, then such a query should not be asked. If asked anyway under
such conditions, the model should not be charged with bias for not returning doctor.
3. Algorithms
We consider three strategies that have been used to capture analogies. We use the
standard 3COSADD function (Equation (1)) from Mikolov et al. (2013), and 3COSMUL,
introduced by Levy and Goldberg (2014) to overcome some of the shortcomings of
3COSADD, mainly ensuring that a single large term cannot dominate the expression
(Equation (2)):
\arg\max_{d} \big( \cos(d, c) - \cos(d, a) + \cos(d, b) \big)   (1)

\arg\max_{d} \frac{\cos(d, c)\,\cos(d, b)}{\cos(d, a) + 0.001}   (2)
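For reference, the two objectives can be written down directly; the following is a minimal numpy sketch of the scoring (a re-implementation of the formulas above, not any released code), where E is a matrix with one unit-normalized word vector per row and a, b, c are the query vectors:

import numpy as np

def cos_all(E, v):
    # Cosine similarity of v against every row of E (rows assumed unit-normalized).
    return E @ (v / np.linalg.norm(v))

def three_cos_add(E, a, b, c):
    # Equation (1): argmax_d cos(d, c) - cos(d, a) + cos(d, b)
    return int(np.argmax(cos_all(E, c) - cos_all(E, a) + cos_all(E, b)))

def three_cos_mul(E, a, b, c, eps=0.001):
    # Equation (2): argmax_d cos(d, c) * cos(d, b) / (cos(d, a) + eps).
    # Released implementations (e.g., gensim's most_similar_cosmul) typically
    # shift cosines into [0, 1] first to avoid problems with negative values.
    return int(np.argmax(cos_all(E, c) * cos_all(E, b) / (cos_all(E, a) + eps)))

# Neither function excludes the rows corresponding to a, b, or c: masking those
# rows out (e.g., setting their scores to -inf) gives the constrained behavior
# discussed in Section 4.1.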
Bolukbasi et al. (2016) designed another formula, specifically focused on finding pairs
B : D with a similar direction as A : C:
S_{(a,c)}(b, d) = \begin{cases} \cos(a - c,\ b - d) & \text{if } \|b - d\| \leq \delta \\ 0 & \text{otherwise} \end{cases}   (3)
They do not assume that B is known beforehand, and generate a ranked list of B : D
pairs, with the advantage of introducing less subjective bias in the input query (see
Section 4.2). To ensure that B and D are related, the threshold δ is introduced, and set to
1.0 in Bolukbasi et al. (2016). This corresponds to π/3 and in practice means that B and
D have to be closer together than two random embedding vectors. For convenience,
because B is known beforehand in our setup and we are interested in examining the
3 In this sense, it is admirable that Caliskan, Bryson, and Narayanan (2017) try to better understand their
results by checking them against actual job distributions between the two genders.
top-N output, we rewrite Equation (3) as Equation (4) (note that they yield the exact
same scores).
\arg\max_{d} \begin{cases} \cos(a - c,\ b - d) & \text{if } \|b - d\| \leq \delta \\ 0 & \text{otherwise} \end{cases}   (4)
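A corresponding numpy sketch of Equation (4), again a simplified re-implementation rather than the released code of Bolukbasi et al. (2016):

import numpy as np

def bolukbasi_top_d(E, a, b, c, delta=1.0):
    # Sketch of Equation (4): argmax_d cos(a - c, b - d) over candidates d
    # with ||b - d|| <= delta (all other candidates get score 0).
    # E holds one candidate vector per row; vectors are assumed unit-normalized,
    # so that delta = 1.0 corresponds to an angle of pi/3 as described above.
    direction = (a - c) / np.linalg.norm(a - c)
    diffs = b - E                                  # b - d for every candidate d
    norms = np.linalg.norm(diffs, axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        scores = (diffs @ direction) / norms       # cos(a - c, b - d)
    scores[(norms > delta) | (norms == 0)] = 0.0   # outside threshold, or d == b
    return int(np.argmax(scores))                  # row index of the best candidate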
Even though it is not part of the equations, in practice most implementations of
these optimization functions specifically ignore one or more input vectors. Most likely,
this is because the traditional definition of analogies expects all terms to be different
(see Section 2), and the original analogy test set reflects this. Without this constraint,
3COSADD for example would return B in absence of close neighbors. However, we have
seen that this is a strong constraint, both in morphosyntactic and semantic analogies.
Moreover, even though this constraint is mentioned in the original paper (Mikolov
et al. 2013) and in follow-up work (Linzen 2016; Bolukbasi et al. 2016; Rogers, Drozd,
and Li 2017; Goldberg 2017; Schluter 2018), we believe this is not common knowledge in
the field (analogy examples are still widely used), and even more so outside the field.4
4. Is the Bias in the Models, in the Implementation, or in the Queries?
In addition to preventing input vectors from being returned, other types of implemen-
tation choices (such as punctuation, capitalization, or word frequency cutoffs), 和
subjective decisions play a substantial role. So, what is the actual influence of such
choices on obtaining biased responses? In the following, unless otherwise specified, we
run all queries on the standard GoogleNews embeddings.5 All code to reproduce our
experiments is available: https://bitbucket.org/robvanderg/w2v.
4.1 Ignoring or Allowing the Input Words
In the default implementation of word2vec (Mikolov et al. 2013), gensim (Řehůřek and
Sojka 2010) as well as the code from Bolukbasi et al. (2016), the input terms of the
analogy query are not allowed to be returned.6 We adapted all these code-bases to allow
for the input words to be returned.7
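To make the difference concrete, the constrained behavior corresponds to gensim's default most_similar call, while an unconstrained variant can be obtained by ranking the offset vector against the whole vocabulary; a sketch assuming gensim (4.x) and the GoogleNews vectors:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Constrained: gensim drops the input words from the returned candidates.
print(kv.most_similar(positive=["doctor", "woman"], negative=["man"], topn=5))

# Unconstrained: build the offset vector from the unit-normalized inputs and rank
# the whole vocabulary against it, so that "doctor" itself can be returned.
offset = (kv.get_vector("doctor", norm=True) + kv.get_vector("woman", norm=True)
          - kv.get_vector("man", norm=True))
print(kv.similar_by_vector(offset, topn=5))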
We evaluated all three methods on the test set from Mikolov et al. (2013), in their
constrained and unconstrained versions. We observe a large drop in macro-accuracy for
3COSADD and 3COSMUL in the unconstrained setting (from 0.71 to 0.21 and from 0.73 to 0.45,
respectively). In most cases, this is because the second term is returned as answer (man
is to king as woman is to king, D == B), which happens if no close neighbor is found,
but in some cases it is the third term that gets returned (short is to shorter as new is to
new, D == C). A similar drop in performance was observed before by Linzen (2016)
and Schluter (2018). The Bolukbasi et al. (2016) method shows very low scores (0.06
constrained, 0.11 unconstrained), but this was to be expected, since their formula was
not specifically designed to capture factual analogies. But what is so different between
factual and biased analogies?
4 This was confirmed by the response we obtained when we uploaded a first version of the paper.
5 https://code.google.com/archive/p/word2vec/.
6 In Equation (3), in practice B will almost never be returned, as it will always be assigned a score of 0.0,
making it the last ranked candidate.
7 The 3COSADD unconstrained setting can be tested in an online demo: www.robvandergoot.com/embs.
Table 1
Example output of the three algorithms for their constrained (const.) and unconstrained
(unconst.) implementations for three well-known gender bias analogies. Each cell lists the
const. answer followed by the unconst. answer.

man is to doctor as woman is to X
    3COSADD: gynecologist / doctor    3COSMUL: gynecologist / doctor    BOLUKBASI: midwife / gynecologist
he is to doctor as she is to X
    3COSADD: nurse / doctor           3COSMUL: nurse / doctor           BOLUKBASI: nurse / nurse
man is to computer programmer as woman is to X
    3COSADD: homemaker / computer programmer    3COSMUL: homemaker / computer programmer    BOLUKBASI: – / schoolteacher
In Table 1, we report the results using the same settings for a small selection of
mainstream examples from the literature on embedding bias. It directly becomes clear
that removing constraints leads to different (and arguably less biased) results.8 More
precisely, for 3COSADD and 3COSMUL we obtain word B as answer, and using the
method described by Bolukbasi et al. (2016) we obtain different results because with the
vocabulary cutoff they used (50,000 most frequent words, see Section 4.3), gynecologist
(51,839) and computer programmer (57,255) were excluded.9
The analogy man is to doctor as woman is to nurse is a classic showcase of human bias
in word embeddings, reflecting gendered stereotypes in our society. This is meaningful,
however, only if the system were allowed to yield doctor (arguably the expected answer
in absence of bias, see Section 2) instead of nurse, and it does not. But using the original
analogy code, it is impossible to obtain man is to doctor as woman is to doctor (where
D == B). Under such settings, it is not exactly fair to claim that the embedding space is
biased because it does not return doctor.
4.2 Subjective Factors
Let us take a step back though, and ask: Why do people query man is to doctor as woman
is to X? In fairness, one should wonder how much bias leaks in from our own views,
preconceptions, and expectations. In this section we aim to show how these affect the
queries we pose and the results we get, and how the inferences we can draw depend
strongly on the choices we make in formulating queries and in reporting the outcome.
To start with, the large majority of the queries posed and found in the literature
imply human bias. People usually query for man:doctor :: woman:X, which in 3COSADD
and 3COSMUL is different from querying for woman:doctor :: 男人:X, both in results
and in assumptions. This issue also raises the major, usually unaddressed question
of what the unbiased, desired D term would be. Such bias-searching queries do
not pose factual, one-correct-answer, analogies, unless interpreted morphosyntactically
(Section 2).
Another subjective decision has to do with reporting results. One would think that
the top returned term should always be reported, or possibly the top five, if willing
8 This was noticed before: https://www.youtube.com/watch?v=25nC0n9ERq4, and https://medium.com/
artists-and-machine-intelligence/ami-residency-part-1-exploring-word-space-
and-projecting-meaning-onto-noise-98af7252f749.
9 Though man is to computer programmer as woman is to homemaker is used in the title of Bolukbasi et al. (2016),
this analogy is obtained using 3COSADD.
Table 2
Overview of reported biased analogies in Manzini et al. (2019b) and Manzini et al. (2019c),
obtained with 3COSADD without constraints, but their embeddings as they are. "Idx" refers to
the average position of the reported biased word as we find it in their five embedding sets
trained with different seeds (i.e., the same they used).

Analogy                        Reported        Idx    Top-5 answers (averaged)

Manzini et al. (2019b)
caucasian lawful black         criminal        2.0    lawful criminal defamation libel vigilante
asian yuppie caucasian         hillbilly       5.0    yuppie knighting pasty hipster hillbilly
black killer asian             engineer        5.2    addict aspie impostor killer engineer
jew liberal christian          conservative    2.0    liberal conservative progressive heterodox secular
jew journalist muslim          terrorist       1.6    terrorist purportedly journalist watchdog cia
muslim regressive christian    conservative    9.2    regressive progressive milquetoast liberal neoliberal

Manzini et al. (2019c)
black homeless caucasian       servicemen      211.6  homeless somalis unemployed bangladeshi nigerians
caucasian hillbilly asian      suburban        60.6   hillbilly hippy hick redneck hippie
asian laborer black            landowner       3.0    laborer landowner fugitive worker millionaire
jew greedy muslim              powerless       8.8    greedy corrupt rich marginalized complacent
christian familial muslim      warzone         7172   familial domestic marital bilateral mutual
muslim uneducated christian    intellectually  16.6   uneducated uninformed idealistic elitist arrogant
to provide a broader picture. However, subjective biases and result expectation might
lead one to discard returned terms that are not viewed as biased, and report biased terms
that appear further down in the list instead. This causes a degree of arbi-
trariness in reporting results that can be substantially misleading. As a case in point,
we discuss here the recent Manzini et al. paper, which is the work from which the
examples used in the opening presidential speech of ACL 2019 were taken (see footnote 1).
This paper was published in three subsequent versions, differing only in the analogy
queries used and the results reported. We discuss this to show how subjective the types
of choices above can be, and that transparency about methodology and implementation
is necessary.
Initially (Manzini et al. 2019a), the authors accidentally searched for the inverse of
the intended query: instead of A is to B as C is to X (black is to criminal as caucasian is to X),
they queried C is to B as A is to X (caucasian is to criminal as black is to X).10 Surprisingly,
they still managed to find biased examples by inspecting the top-N returned D terms.
换句话说, they reported the analogy black is to criminal as caucasian is to police to
support the hypothesis that there is cultural bias against the black, but they had in fact
found caucasian is to criminal as black is to police, so the complete opposite.
This mistake was fixed in subsequent versions (Manzini et al. 2019b,c), but it is
unclear which algorithm is used to obtain the analogies. We tried the three algorithms
in Section 3, and in Table 2 we show the results of 3COSADD, for which we could most
closely reproduce their results. For their second version, in five out of their six examples
the input word B would actually be returned before the reported answer D. For three of
the six analogies, they pick a term from the returned top-10 rather than the top one. In
their third version (Manzini et al. 2019c), the authors changed the list of tested analogies,
10 We confirmed this with the authors.
especially regarding the B terms. It is unclear under which assumption some of these
"new" terms were chosen (greedy associated to jew, for example: what is one expecting
to get—biased or non-biased—considering this is a negative stereotype to start with,
and the C term is muslim?). However, for each of the analogy algorithms, we cannot
reasonably reproduce four out of six analogies, even when inspecting the top 10 results.
Although qualitatively observing and weighing the bias of a large set of returned
answers can make sense, it can be misleading to cherry-pick and report very biased
terms in sensitive analogies. At the very least, when reporting term-N, one should
report the top-N terms to provide a more complete picture.
4.3 Other Constraints
Using the BOLUKBASI formula is much less prone to subjective choices. It takes as input
only two terms (A and C, like man and woman), thus reducing the bias present in the
query itself, and consequently the impact of human-induced bias expectation. At the
same time, though, starting with A : C, the formula requires some parameter tuning
in order to obtain (a) meaningful B : D pair(s). Such parameter values also affect the
outcome, possibly substantially, and must be weighed in when assessing bias.
As shown in Equation (3), Bolukbasi et al. (2016) introduce a threshold δ to ensure
that B and D are semantically similar. In their work, δ is set to 1 to ensure that B and
D are closer than two random vectors (see Section 3). Choosing alternative values for δ
will however yield quite different results, and it is not a straightforward parameter to
tune, since it cannot be done against some gold standard, “correct” examples.
Another common constraint that can have a substantial impact on the results is lim-
iting the embedding set to the top-N most frequent words. Both Bolukbasi et al. (2016)
and Manzini et al. (2019a) filter the embeddings to only the 50,000 most frequent words,
though no motivation for this need or this specific value is provided. Setting such an
arbitrary value might result in the exclusion of valid alternatives. Further processing
can also rule out potentially valid strings. For example, Manzini et al. (2019a) lowercase
all words before training, and remove words containing punctuation after training,
whereas Bolukbasi et al. (2016) keep only words that are shorter than 20 characters and
do not contain punctuation or capital letters (after training the embeddings).
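To see how much such a cutoff can matter, one can pose the same query with and without a vocabulary restriction, for example via gensim's restrict_vocab option (which searches only the first N vectors; for the GoogleNews file these are ordered by descending frequency); a minimal sketch:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

query = dict(positive=["doctor", "woman"], negative=["man"], topn=5)

# Full vocabulary versus only the 50,000 most frequent words: if the would-be
# answer (e.g., "gynecologist") falls outside the cutoff, a different and
# possibly more biased-looking word takes its place.
print(kv.most_similar(**query))
print(kv.most_similar(restrict_vocab=50000, **query))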
To briefly illustrate the impact of varying the values of the threshold and the vo-
cabulary size when using the BOLUKBASI formula, in Table 3 we show the results when
changing them for the query man is to doctor as woman is to X (given that B is known, we
Table 3
Influence of vocabulary size and threshold value for the method of Bolukbasi et al. (2016). With
extreme values for the threshold, and allowing query words to be returned, the answer becomes
"doctor" (≤0.5) and "she" (≥1.5). The original settings are a vocabulary size of 50,000 and δ = 1.0.

                            Threshold (δ)
Voc. size      0.8            0.9            1.0            1.1             1.2
10,000         doctor         nurse          nurse          nurse           woman
50,000         doctor         nurse          midwife        midwife         woman
100,000        gynecologist   gynecologist   gynecologist   gynecologist    gynecologist
500,000        gynecologist   gynecologist   gynecologist   nurse midwife   nurse midwife
full vocab.    gynecologist   gynecologist   gynecologist   nurse midwife   nurse midwife
use Equation (4)). The variety of answers, ranging from what can be considered to be
biased (nurse) to not biased at all (doctor), illustrates how important it is to be aware of
the influence of choices concerning implementation and parameter values.
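A Table 3-style sweep can be scripted directly; the following self-contained sketch (assuming gensim 4.x, numpy, and the GoogleNews vectors; the grid of values is illustrative) re-implements the Equation (4) scoring from Section 3 and varies both the vocabulary cutoff and δ:

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def best_d(vectors, words, a, b, c, delta):
    # Equation (4): argmax_d cos(a - c, b - d) with ||b - d|| <= delta, else score 0.
    direction = (a - c) / np.linalg.norm(a - c)
    diffs = b - vectors
    norms = np.linalg.norm(diffs, axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        scores = (diffs @ direction) / norms
    scores[(norms > delta) | (norms == 0)] = 0.0
    return words[int(np.argmax(scores))]

idx, words = kv.key_to_index, kv.index_to_key
vecs = kv.get_normed_vectors()      # unit-length vectors, as the threshold assumes
a, b, c = vecs[idx["man"]], vecs[idx["doctor"]], vecs[idx["woman"]]

for cutoff in (10000, 50000, len(words)):
    for delta in (0.8, 1.0, 1.2):
        print(cutoff, delta, best_d(vecs[:cutoff], words[:cutoff], a, b, c, delta))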
5. Final Remarks
If analogies might not be the most appropriate tool to capture certain relations, certain
matters have been made worse by the way that, consciously or not, they have been
used (Gonen and Goldberg [2019] have rightly dubbed them sensational "party tricks").
This is harmful for at least two reasons. One is that they get easily propagated both
in science itself (Jha and Mamidi 2017; Gebru et al. 2018; Mohammad et al. 2018;
Hall Maudslay et al. 2019), also outside NLP and artificial intelligence (McQuillan 2018)
and in popularized articles (Zou and Schiebinger 2018), where readers are usually in
no position to verify the reliability or significance of such examples. The other is that
they might mislead the search for bias and the application of debiasing strategies. And
although it is debatable whether we should aim at debiasing or rather at transparency
and awareness (Caliskan, Bryson, and Narayanan 2017; Gonen and Goldberg 2019), it
is crucial that we are clear and transparent about what analogies can and cannot do
as a diagnostic for embeddings bias, and about all the implications of subjective and
implementation choices. This is a strict prerequisite to truly understand how and to
what extent embeddings encode and reflect biases of our society, and how to cope with
this, both socially and computationally.
Acknowledgments
We would like to thank Hessel Haagsma for
providing us with interesting and
challenging examples, Gertjan van Noord for
reading previous versions of this paper, 和
St´ephan Tulkens for the discussions on the
analogy algorithms. We would also like to
thank the three anonymous reviewers for
their comments. Moreover, we are grateful
that we could exchange ideas with the
authors of the two papers that we discuss
mostly in this article, in particular Thomas
Manzini, Kai-Wei Chang, and Adam Kalai.
The opinions here expressed remain obviously
our own.
References
Bolukbasi, Tolga, Kai-Wei Chang, James Y.
Zou, Venkatesh Saligrama, and Adam T.
Kalai. 2016. Man is to computer
programmer as woman is to homemaker?
Debiasing word embeddings. In Advances
in Neural Information Processing Systems 29,
pages 4349–4357, Curran Associates, Inc.
Caliskan, Aylin, Joanna J. Bryson, and
Arvind Narayanan. 2017. Semantics
derived automatically from language
corpora contain human-like biases. Science,
356(6334):183–186.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee,
and Kristina Toutanova. 2019. BERT: Pre-
training of deep bidirectional transformers
for language understanding. In Proceedings
of the 2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers),
pages 4171–4186, Minneapolis, MN.
Drozd, Aleksandr, Anna Gladkova, and
Satoshi Matsuoka. 2016. Word
embeddings, analogies, and machine
learning: Beyond king − man + woman =
queen. In Proceedings of COLING 2016, the
26th International Conference on
Computational Linguistics: Technical Papers,
pages 3519–3530, Osaka.
Fischer, Olga. 2019. Analogy in Language and
Linguistics, Oxford Bibliographies in
Linguistics, Oxford University Press.
Garg, Nikhil, Londa Schiebinger, Dan
Jurafsky, and James Zou. 2018. Word
embeddings quantify 100 years of gender
and ethnic stereotypes. Proceedings of
the National Academy of Sciences,
115(16):E3635–E3644.
Gebru, Timnit, Jamie Morgenstern, Briana
Vecchione, Jennifer Wortman Vaughan,
Hanna Wallach, Hal Daumé III, and Kate
Crawford. 2018. Datasheets for datasets.
arXiv preprint arXiv:1803.09010v4.
Gladkova, Anna, Aleksandr Drozd, and
Satoshi Matsuoka. 2016. Analogy-based
detection of morphological and semantic
relations with word embeddings: What
works and what doesn't. In Proceedings of
the NAACL Student Research Workshop,
pages 8–15, San Diego, CA.
Goldberg, Yoav. 2017. Neural network
methods for natural language processing.
Synthesis Lectures on Human Language
Technologies, 10(1):1–309.
Gonen, Hila and Yoav Goldberg. 2019.
Lipstick on a pig: Debiasing methods
cover up systematic gender biases in word
embeddings but do not remove them. In
Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short
Papers), pages 609–614, Minneapolis, MN.
Hall Maudslay, Rowan, Hila Gonen, Ryan
Cotterell, and Simone Teufel. 2019. It's all in
the name: Mitigating gender bias with
name-based counterfactual data
substitution. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language
Processing (EMNLP-IJCNLP),
pages 5270–5278, Hong Kong.
Hovy, Dirk and Shannon L. Spruit. 2016.
The social impact of natural language
processing. In Proceedings of the 54th
Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 591–598.
Jha, Akshita and Radhika Mamidi. 2017.
When does a compliment become sexist?
Analysis and classification of ambivalent
sexism using Twitter data. In Proceedings
of the Second Workshop on NLP and
Computational Social Science,
pages 7–16, Vancouver.
Kozlowski, Austin C., Matt Taddy, and
James A. Evans. 2019. The geometry of
culture: Analyzing the meanings of class
through word embeddings. American
Sociological Review, 84(5):905–949.
Levy, Omer and Yoav Goldberg. 2014.
Linguistic regularities in sparse and
explicit word representations. In
Proceedings of the Eighteenth Conference on
Computational Natural Language Learning,
pages 171–180, Ann Arbor, MI.
Linzen, Tal. 2016. Issues in evaluating
semantic spaces using word analogies. In
Proceedings of the 1st Workshop on Evaluating
Vector-Space Representations for NLP,
pages 13–18, Berlin.
Manzini, Thomas, Yao Chong Lim, Yulia
Tsvetkov, and Alan W. Black. 2019b. Black
is to criminal as Caucasian is to police:
Detecting and removing multiclass bias in
word embeddings. arXiv preprint
arXiv:1904.04047v2.
Manzini, Thomas, Yao Chong Lim, Yulia
Tsvetkov, and Alan W. Black. 2019c. Black
is to criminal as Caucasian is to police:
Detecting and removing multiclass bias in
word embeddings. arXiv preprint
arXiv:1904.04047v3.
Manzini, Thomas, Lim Yao Chong, Alan W.
Black, and Yulia Tsvetkov. 2019a. Black is
to criminal as Caucasian is to police:
Detecting and removing multiclass bias in
word embeddings. In Proceedings of the
2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers),
pages 615–621, Minneapolis, MN.
McQuillan, Dan. 2018. People's councils for
ethical machine learning. Social Media +
Society, 4(2):1–10. SAGE Publications,
London, England.
Mikolov, Tomas, Kai Chen, Greg Corrado,
and Jeffrey Dean. 2013. Efficient estimation
of word representations in vector space.
In Proceedings of Workshop at ICLR,
Scottsdale, AZ.
Mohammad, Saif, Felipe Bravo-Marquez,
Mohammad Salameh, and Svetlana
Kiritchenko. 2018. SemEval-2018 Task 1:
Affect in tweets. In Proceedings of the
12th International Workshop on
Semantic Evaluation, pages 1–17,
New Orleans, LA.
Peters, Matthew, Mark Neumann, Mohit
Iyyer, Matt Gardner, Christopher Clark,
Kenton Lee, and Luke Zettlemoyer. 2018.
Deep contextualized word representations.
In Proceedings of the 2018 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers),
pages 2227–2237, New Orleans, LA.
Řehůřek, Radim and Petr Sojka. 2010.
Software framework for topic modeling
with large corpora. In Proceedings of the
LREC 2010 Workshop on New Challenges for
NLP Frameworks, pages 45–50, Valletta.
Rogers, Anna, Aleksandr Drozd, and Bofang
Li. 2017. The (too many) problems of
analogical reasoning with word vectors.
In Proceedings of the 6th Joint Conference on
Lexical and Computational Semantics
(*SEM 2017), pages 135–148, Vancouver.
Schluter, Natalie. 2018. The word analogy
testing caveat. In Proceedings of the 2018
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 2
(Short Papers), pages 242–246, New
Orleans, LA.
Turney, Peter D. 2012. Domain and function:
A dual-space model of semantic relations
and compositions. Journal of Artificial
Intelligence Research, 44:533–585.
Zou, James and Londa Schiebinger. 2018.
AI can be sexist and racist—it's time
to make it fair. Nature, 559(7714):324.