Hate Speech Classifiers Learn Normative Social Stereotypes
Aida Mostafazadeh Davani, Mohammad Atari, Brendan Kennedy, Morteza Dehghani
University of Southern California, USA
{mostafaz,atari,btkenned,mdehghan}@usc.edu
Abstract
Social stereotypes negatively impact individ-
uals’ judgments about different groups and
may have a critical role in understanding lan-
guage directed toward marginalized groups.
Here, we assess the role of social stereotypes
in the automated detection of hate speech
in the English language by examining the
impact of social stereotypes on annotation
behaviors, annotated datasets, and hate speech
classifiers. Specifically, we first investigate the
impact of novice annotators’ stereotypes on
their hate-speech-annotation behavior. Then,
we examine the effect of normative stereo-
types in language on the aggregated annota-
tors’ judgments in a large annotated corpus.
Finally, we demonstrate how normative stereo-
types embedded in language resources are
associated with systematic prediction errors
in a hate-speech classifier. The results dem-
onstrate that hate-speech classifiers reflect so-
cial stereotypes against marginalized groups,
which can perpetuate social inequalities when
propagated at scale. This framework, combin-
ing social-psychological and computational-
linguistic methods, provides insights into
sources of bias in hate-speech moderation,
informing ongoing debates regarding machine
learning fairness.
Introduction
Artificial Intelligence (AI) technologies are prone
to acquiring cultural, social, and institutional bi-
ases from the real-world data on which they are
trained (McCradden et al., 2020; Mehrabi et al.,
2021; Obermeyer et al., 2019). AI models trained
on biased datasets both reflect and amplify those
biases (Crawford, 2017). For example, the domi-
nant practice in modern Natural Language Pro-
cessing (NLP)—which is to train AI systems
on large corpora of human-generated text data—
leads to representational biases, such as preferring
European American names over African Amer-
ican names (Caliskan et al., 2017), associating
words with more negative sentiment with phrases
referencing persons with disabilities (Hutchinson
et al., 2020), making ethnic stereotypes by asso-
ciating Hispanics with housekeepers and Asians
with professors (Garg et al., 2018), and assign-
ing men to computer programming and women
to homemaking (Bolukbasi et al., 2016).
Moreover, NLP models are particularly suscep-
tible to amplifying biases when their task involves
evaluating language generated by or describing a
social group (Blodgett and O’Connor, 2017). For
example, previous research has shown that tox-
icity detection models associate documents con-
taining features of African American English with
higher offensiveness than text without those fea-
tures (Sap et al., 2019; Davidson et al., 2019).
Similarly, Dixon et al. (2018) demonstrate that
models trained on social media posts are prone
to erroneously classifying ‘‘I am gay’’ as hate
speech. Therefore, using such models for moder-
ating social-media platforms can yield dispropor-
tionate removal of social-media posts generated
by or mentioning marginalized groups (Davidson
et al., 2019). This unfair assessment negatively
impacts marginalized groups’ representation in
online platforms, which leads to disparate impacts
on historically excluded groups (Feldman et al.,
2015).
Mitigating biases in hate speech detection, nec-
essary for viable automated content moderation
(Davidson et al., 2017; Mozafari et al., 2020), has
recently gained momentum (Davidson et al., 2019;
Dixon et al., 2018; Sap et al., 2019; Kennedy et al.,
2020; Prabhakaran et al., 2019). Most current su-
pervised algorithms for hate speech detection rely
on data resources that potentially reflect real-
world biases: (1) text representation, which maps
textual data to their numeric representations in a
semantic space; and (2) human annotations, which
represent subjective judgments about
the hate
speech content of the text, constituting the train-
ing dataset. Both (1) and (2) can introduce biases
into the final model. First, a classifier may be-
come biased due to how the mapping of language
to numeric representations is affected by stereo-
typical co-occurrences in the training data of the
language model. For example, a semantic asso-
ciation between phrases referencing persons with
disabilities and words with more negative senti-
ment in the language model can impact a classi-
fier’s evaluation of a sentence about disability
(Hutchinson et al., 2020). Second, individual-level
biases of annotators can impact the classifier in
stereotypical directions. For example, a piece of
rhetoric about disability can be analyzed and
labeled differently depending upon annotators’
social biases.
Although previous research has documented
stereotypes in text representations (Garg et al.,
2018; Bolukbasi et al., 2016; Manzini et al., 2019;
Swinger et al., 2019; Charlesworth et al., 2021),
the impact of annotators’ biases on training data
and models remains largely unknown. Filling this
gap in our understanding of the effect of human
annotation on biased NLP models is the focus of
this work. As argued by Blodgett et al. (2020)
and Kiritchenko et al. (2021), a comprehensive
evaluation of human-like biases in hate speech
classification needs to be grounded in social psy-
chological theories of prejudice and stereotypes,
in addition to how they are manifested in lan-
guage. In this paper, we rely on the Stereotype
Content Model (SCM; Fiske et al., 2002) which
suggests that social perceptions and stereotyp-
ing form along two dimensions, namely, warmth
(e.g., trustworthiness, friendliness) and compe-
tence (e.g., capability, assertiveness). The SCM’s
main tenet is that perceived warmth and compe-
tence underlie group stereotypes. Hence, different
social groups can be positioned in different loca-
tions in this two-dimensional space, since much
of the variance in stereotypes of groups is ac-
counted for by these basic social psychological
dimensions.
In three studies presented in this paper, we study
the pipeline for training a hate speech classi-
fier, consisting of collecting annotations, aggre-
gating annotations for creating the training dataset,
and training the model. We investigate the effects
of social stereotypes on each step, namely, (1)
the relationship between social stereotypes and
hate speech annotation behaviors, (2) the relation-
ship between social stereotypes and aggregated
annotations of trained, expert annotators in cu-
rated datasets, and (3) social stereotypes as they
manifest in the biased predictions of hate speech
classifiers. Our work demonstrates that differ-
ent stereotypes along warmth and competence
differentially affect
individual annotators, cu-
rated datasets, and trained language classifiers.
Therefore, understanding the specific social bi-
ases targeting different marginalized groups is
essential for mitigating human-like biases of AI
models.
1 Study 1: Text Annotation
Here, we investigate the effect of individuals’
social stereotypes on their hate speech annotations.
Specifically, we aim to determine whether novice
annotators’ stereotypes (perceived warmth and/or
competence) of a mentioned social group lead
to a higher rate of labeling text as hate speech and
higher rates of disagreement with other annotators.
We conduct a study on a nationally stratified
sample (in terms of age, ethnicity, gender, and
political orientation) of US adults. First, we ask
participants to rate eight US-relevant social groups
on different stereotypical traits (e.g., friendliness).
Then, participants are presented with social media
posts mentioning the social groups and are asked
to label the content of each post based on whether
it attacks the dignity of that group. We expect
the perceived warmth and/or competence of the
social groups to be associated with participants’
annotation behaviors, namely, their rate of labeling
text as hate speech and disagreeing with other
annotators.
Participants To achieve a diverse set of an-
notations, we recruited a relatively large (N =
1,228) set of participants in a US sample stratified
across participants’ gender, age, ethnicity, and po-
litical ideology through Qualtrics Panels.1 After
filtering participants based on quality-check items
(described below), our final sample included 857
American adults (381 male, 476 female) rang-
ing in age from 18 to 70 (M = 46.7, SD =
16.4) years, about half Democrats (50.4%) and
half Republicans (49.6%), with diverse reported
race/ethnicity (67.8% White or European Amer-
ican, 17.5% Black or African American, 17.7%
Hispanic or Latino/Latinx, 9.6% Asian or Asian
American).
1 https://www.census.gov/quickfacts/fact/table/US/PST045221.
Stimuli To compile a set of stimuli items for this
study, we selected posts from the Gab Hate Cor-
pus (GHC; Kennedy et al., 2022), which includes
27,665 social-media posts collected from the cor-
pus of Gab.com (Gaffney, 2018), each annotated
for their hate speech content by at least three ex-
pert annotators. We collected all posts with high
disagreement among the GHC’s (original) anno-
tators (based on Equation 1 for quantifying item
disagreement) which mention at least one social
group. We searched for posts mentioning one of
the eight most frequently targeted social groups
in the GHC: (1) women; (2) immigrants; (3) Mus-
lims; (4) Jews; (5) communists; (6) liberals; (7)
African Americans; and (8) homosexual individ-
uals. We selected seven posts per group, result-
ing in a set of 56 items in total.
Explicit Stereotype Measure We assessed par-
ticipants’ warmth and competence stereotypes of
the 8 US social groups in our study based on
their perceived traits for a typical member of each
group. To this end, we followed social psycholog-
ical approaches for collecting these self-reported,
explicit stereotypes (Cuddy et al., 2008) and
asked participants to rate a typical member of
each social group (e.g., Muslims) based on their
‘‘friendliness’’, ‘‘helpfulness,’’ ‘‘peacefulness,’’
and ‘‘intelligence.’’ Following previous studies
of perceived stereotypes (Huesmann et al., 2012;
Cuddy et al., 2007), participants were asked to
rate these traits from low (e.g., ‘‘unfriendly’’) to
high (e.g., ‘‘friendly’’) using an 8-point semantic
differential scale. We considered the average of
the first three traits as the indicator of perceived
warmth2 and the fourth item as the perceived
competence.
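To make the scoring concrete, the following is a minimal sketch (not the authors' code) of how the warmth and competence scores and the reliability check could be computed; the DataFrame `ratings` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a (participants x items) matrix of ratings."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def add_stereotype_scores(ratings: pd.DataFrame) -> pd.DataFrame:
    """Warmth = mean of the three warmth traits; competence = the intelligence item."""
    out = ratings.copy()
    out["warmth"] = out[["friendliness", "helpfulness", "peacefulness"]].mean(axis=1)
    out["competence"] = out["intelligence"]
    return out

# Reliability of the warmth scale per social group (cf. footnote 2):
# for group, block in ratings.groupby("social_group"):
#     print(group, cronbach_alpha(block[["friendliness", "helpfulness", "peacefulness"]]))
```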
While explicit assessments are generally corre-
lated with implicit measures of attitude, in the case
of self-reporting social stereotypes, participants’
explicit answers can be less significantly corre-
lated with their implicit biases, potentially due
to motivational and cognitive factors (Hofmann
et al., 2005). Therefore, it should be noted that
this study relies on an explicit assessment of so-
cial stereotypes, and the results do not directly
explain the effects of implicit biases on annotat-
ing hate speech.
2Cronbach’s α’s ranged between .90 [women] and .95
[Muslims].
Hate Speech Annotation Task We asked par-
ticipants to annotate the 56 items based on a short
definition of hate speech (Kennedy et al., 2022):
‘‘Language that intends to attack the dignity of
a group of people, either through an incitement
to violence, encouragement of the incitement to
violence, or the incitement to hatred.’’
Participants could proceed with the study only
after they acknowledged understanding the pro-
vided definition of hate speech. We then tested
their understanding of the definition by plac-
ing three synthetic ‘‘quality-check’’ items among
survey items, two of which included clear and
explicit hateful language directly matching our
definition and one item that was simply infor-
mational (see Supplementary Materials). Overall,
371 out of the original 1,228 participants failed
to satisfy these conditions and their input was
removed from the data.3
Disagreement Throughout this paper, we assess annotation disagreement at different levels:
• Item disagreement, d^{(i)}: Motivated by Fleiss (1971), for each item i, item disagreement d^{(i)} is the number of annotator pairs that disagree on the item's label, divided by the number of all possible annotator pairs.4

$$ d^{(i)} = \frac{n^{(i)}_1 \times n^{(i)}_0}{\binom{n^{(i)}_1 + n^{(i)}_0}{2}} \qquad (1) $$

Here, n^{(i)}_1 and n^{(i)}_0 denote the number of hate and non-hate labels assigned to i, respectively.
• Participant item-level disagreement, d^{(p,i)}: For each participant p and each item i, we define d^{(p,i)} as the ratio of participants with whom p disagreed to the size of the set of participants who annotated the same item (P).

$$ d^{(p,i)} = \frac{\left|\{\, p' \mid p' \in P - \{p\},\; y_{p,i} \neq y_{p',i} \,\}\right|}{|P|} \qquad (2) $$

Here, y_{p,i} is the label that p assigned to i.
3The replication of our analyses with all participants
yielded similar results, reported in Supplementary Materials.
4We found this measure more suitable than a simple
percentage, as Fleiss captures the total number of annotators
as well as the disagreeing pairs.
• Group-level disagreement, d^{(p,S)}: For a specific set of items S and an annotator p, d^{(p,S)} captures how much p disagrees with others over items in S. We calculate d^{(p,S)} by averaging d^{(p,i)} over all items i ∈ S:

$$ d^{(p,S)} = \frac{1}{|S|} \sum_{i \in S} d^{(p,i)} \qquad (3) $$
Annotators’ Tendency To explore participants’
annotation behaviors relative to other participants,
we rely on the Rasch model (Rasch, 1993). The
Rasch model is a psychometric method that mod-
els participants’ responses—here, annotations—
to items by calculating two sets of parameters,
namely, the ability of each participant and the
difficulty of each item. Similar approaches, based
on Item Response Theory (IRT), have recently
been applied in evaluating NLP models (Lalor
et al., 2016) and for modeling the relative per-
formance of annotators (Hovy et al., 2013).
While, compared to Rasch models, IRT mod-
els can include more item-level parameters, our
choice of Rasch models is based on their ro-
bust estimations for annotators’ ability scores.
Specifically, Rasch models calculate the ability
score solely based on individuals’ performances
and independent from the sample set. In contrast,
in IRT-based approaches, individual annotators’
scores depend on the complete set of annotators
(Stemler and Naples, 2021). To provide an estima-
tion of these two sets of parameters (annotators’
ability and items’ difficulty), the Rasch model
iteratively fine-tunes parameters’ values to ulti-
mately fit the best probability model to partici-
pants’ responses to items. Here, we apply a Rasch
model to each set of items mentioning a specific
social group.
Figure 1: The overview of Study 1. Novice annotators are asked to label the hate speech content of each post. Then, their annotation behaviors, per social group token, are taken to be the number of posts they labeled as hate speech, their disagreement with other annotators, and their tendency to identify hate speech.
It should be noted that Rasch models treat each response as either correct or incorrect and estimate participants' ability and items' difficulty under the logic that subjects have a higher probability of correctly answering easier items. However, we assume no ‘‘ground truth’’ for the labels; ‘‘1’’s and ‘‘0’’s simply represent annotators' ‘‘hate’’ and ‘‘not hate’’ answers. Items' difficulty (which originally represents the probability of ‘‘0’’ labels) can therefore be interpreted as non-hatefulness (the probability of ‘‘non-hate’’ labels), and participants' ability (the probability of assigning a ‘‘1’’ to a difficult item) can be interpreted as their tendency toward labeling text as hate (labeling non-hateful items as hateful). Throughout this study we use tendency to refer to the ability parameter.
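As a rough illustration of this logic (the paper fit its Rasch models with the eRm package in R, which uses conditional maximum likelihood; the sketch below is a simplified joint maximum-likelihood version, not the authors' code), the tendency and non-hatefulness parameters can be estimated from a binary response matrix by gradient ascent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_rasch(X, n_iter=2000, lr=0.05):
    """Joint maximum-likelihood sketch of a Rasch model.

    X is a binary (participants x items) response matrix, and
    P(X[p, i] = 1) = sigmoid(tendency[p] - difficulty[i]).
    """
    n_participants, n_items = X.shape
    tendency = np.zeros(n_participants)   # "ability", read here as tendency to label hate
    difficulty = np.zeros(n_items)        # "difficulty", read here as non-hatefulness
    for _ in range(n_iter):
        p_hat = sigmoid(tendency[:, None] - difficulty[None, :])
        resid = X - p_hat                 # gradient of the Bernoulli log-likelihood
        tendency += lr * resid.sum(axis=1)
        difficulty -= lr * resid.sum(axis=0)
        difficulty -= difficulty.mean()   # location constraint for identifiability
    return tendency, difficulty

# Example with synthetic responses (50 annotators, 56 items):
# tendency, difficulty = fit_rasch(np.random.binomial(1, 0.4, size=(50, 56)))
```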
Analysis We estimate associations between par-
ticipants’ social stereotypes about each social
group with their annotation behaviors evaluated
on items mentioning that social group. Namely,
the dependent variables are (1) the number of hate
labels, (2) the tendency (via the Rasch model) to
detect hate speech relative to others, and (3) the
ratio of disagreement with other participants—as
quantified by group-level disagreement. To ana-
lyze annotation behaviors concerning each social
group, we considered each pair of participant
(N = 857) and social group (ngroup = 8) as
an observation (ntotal = 6,856). Each observa-
tion includes the social group’s perceived warmth
and competence based on the participant’s an-
swer to the explicit stereotype measure, as well
as their annotation behaviors on items that men-
tion that social group. Since each observation
is nested in and affected by annotator-level and
social-group level variable, we fit cross-classified
multi-level models to analyze the association
of annotation behaviors with social stereotypes.
Figure 1 illustrates our methodology in con-
ducting Study 1. All analyses were performed in
R (3.6.1), and the eRm (1.0.1) package was used
for the Rasch model.
Results We first investigated the relation be-
tween participants’ social stereotypes about each
social group and the number of hate speech la-
bels they assigned to items mentioning that group.
The result of a cross-classified multi-level Poisson
model, with the number of hate speech labels as
the dependent variable and participants’ percep-
tion of warmth and competence as independent
variables, shows that a higher number of items
are categorized as hate speech when participants
perceive that social group as high on competence
(β = 0.03, SE = 0.006, p < .001). In other words,
a one point increase in a participant’s rating of a
social group’s competence (on the scale of 1 to 8)
is associated with a 3.0% increase in the number
of hate labels they assigned to items mentioning
that social group. Perceived warmth scores were
not significantly associated with the number of
hate labels (β = 0.01, p = .128).
We then compared annotators’ relative ten-
dency to assign hate speech labels to items
mentioning each social group, calculated by the
Rasch models. We conducted a cross-classified
multi-level linear model to predict participants’
tendency as the dependent variable, and each
social group’s warmth and competence stereo-
types as independent variables. The result shows
that participants demonstrate higher tendency (to
assign hate speech labels) on items that men-
tion a social group they perceive as highly
competent (β = 0.07, SE = 0.013, p < .001).
However, perceived warmth scores were not sig-
nificantly associated with participants’ tendency
scores (β = 0.02, SE = 0.014, p = 0.080).
Finally, we analyzed participants’ group-level
disagreement for items that mention each social
group. We use a logistic regression model to pre-
dict disagreement ratio, which is a value between 0
and 1. The results of a cross-classified multi-level
logistic regression, with group-level disagreement
ratio as the dependent variable and warmth and
competence stereotypes as independent variables,
show that participants disagreed more on items
that mention a social group which they perceive
as low on competence (β = −0.29, SE = 0.001,
p < .001). In other words, a one point decrease
in a participant’s rating of a social group’s com-
petence (on the scale of 1 to 8) is associated with
a 25.2% increase in their odds of disagreement on items mentioning that social group. Perceived warmth scores were not significantly associated with the odds of disagreement (β = 0.05, SE = 0.050, p = .322).
Figure 2: The relationship between the stereotypical competence of social groups and (1) the number of hate labels annotators detected, (2) their tendency to detect hate speech, and (3) their ratio of disagreement with other participants (top to bottom).
In summary, as represented in Figure 2, the
results of Study 1 demonstrate that when novice
annotators perceive a mentioned social group as
high on competence they (1) assign more hate
speech labels, (2) show higher tendency for iden-
tifying hate speech, and (3) disagree less with
other annotators. These associations collectively
denote that when annotators stereotypically per-
ceive a social group as highly competent, they
tend to become more sensitive or alert about hate
speech directed toward that group. These results
support the idea that hate speech annotation is af-
fected by annotators’ stereotypes (specifically the
perceived competence) of target social groups.
2 Study 2: Ground-Truth Generation
The high levels of inter-annotator disagreements
in hate speech annotation (Ross et al., 2017) can
be attributed to numerous factors, including annotators' varying perception of the hateful language, or ambiguities of the text being annotated (Aroyo et al., 2019). However, aggregating these annotations into single ground-truth labels disregards the nuances of such disagreements (Uma et al., 2021) and even leads to disproportionate representation of individual annotators in annotated datasets (Prabhakaran et al., 2021). Here, we explore the effect of normative social stereotypes, as encoded in language, on the aggregated hate labels provided in a large annotated dataset.
Annotated datasets of hate speech commonly represent the aggregated judgments of annotators rather than individual annotators' annotation behaviors. Therefore, rather than being impacted by individual annotators' self-reported social stereotypes (as in Study 1), we expect aggregated labels to be affected by normative social stereotypes. Here, we rely on semantic representations of social groups in pre-trained language models, known to encode normative social stereotypes and biases of large text corpora (Bender et al., 2021). Figure 3 illustrates the methodology of Study 2.
Figure 3: The overview of Study 2. We investigate a hate speech dataset and evaluate the inter-annotator disagreement and majority label for each document in relation to stereotypes about mentioned social groups.
Data We analyzed the GHC (Kennedy et al., 2022, discussed in Study 1), which includes 27,665 social-media posts labeled for hate speech content by 18 annotators. This dataset includes 91,967 annotations in total, where each post is annotated by at least three coders. Based on our definition of item disagreement in Equation 1, we computed the inter-annotator disagreement and the majority vote for each of the posts and considered them as dependent variables in our analyses.
Quantifying Social Stereotypes To quantify social stereotypes about each social group from our list of social group tokens (Dixon et al., 2018), we calculated the semantic similarity of that social group term with lexicons (dictionaries) of competence and warmth (Pietraszkiewicz et al., 2019). The competence and warmth dictionaries consist of 192 and 184 tokens, respectively, and have been shown to measure linguistic markers of competence and warmth reliably in different contexts.
We calculated the similarity of each social group token with the entirety of words in the dictionaries of warmth and competence in a latent vector space, based on previous approaches (Caliskan et al., 2017; Garg et al., 2018). Specifically, for each social group token s and each word w in the dictionaries of warmth (Dw) or competence (Dc), we first obtain their numeric representations (R(s) ∈ R^t and R(w) ∈ R^t, respectively) from pre-trained English word embeddings (GloVe; Pennington et al., 2014). The representation function R() maps each word to a t-dimensional vector, trained based on word co-occurrences in a corpus of English Wikipedia articles. Then, the warmth and competence scores for each social group token were calculated by averaging the cosine similarity of the numeric representation of the social group token and the numeric representations of the words in the two dictionaries.
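The following sketch illustrates the scoring just described (it is not the authors' code); `glove` is assumed to be a dict mapping words to vectors, and `warmth_words` / `competence_words` are assumed to hold the dictionary word lists.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def dictionary_score(group_token, dictionary_words, glove):
    """Average cosine similarity between a group token and all in-vocabulary dictionary words."""
    g = glove[group_token]
    sims = [cosine(g, glove[w]) for w in dictionary_words if w in glove]
    return float(np.mean(sims))

# warmth_score = dictionary_score("immigrant", warmth_words, glove)
# competence_score = dictionary_score("immigrant", competence_words, glove)
```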
Results We examined the effects of the quantified social stereotypes on hate speech annotations captured in the dataset. Specifically, we compared post-level annotation disagreements with the mentioned social group's warmth and competence. For example, based on this method, ‘‘man’’ is the most semantically similar social group token to the dictionary of competence (Cman = 0.22), while ‘‘elder’’ is the social group token with the closest semantic representation to the dictionary of warmth (Welder = 0.19). Of note, we investigated the effect of these stereotypes on hate speech annotation of social media posts that mention at least one social group token (Nposts = 5,535). Since some posts mention more than one social group token, we considered each mentioned social group token as an observation
(Nobservation = 7550), and conducted a multi-
level model, with mentioned social group tokens
as the level-1 variable and posts as the level-2
variable. We conducted two logistic regression
analyses to assess the impact of (1) the warmth
and (2) the competence of the mentioned social
group as independent variables, and with the inter-
annotator disagreement as the dependent vari-
able. The results of the two models demonstrate
that both higher warmth (β = −2.62, SE = 0.76,
p < 0.001) and higher competence (β = −5.27,
SE = 0.62, p < 0.001) scores were associated
with lower disagreement. Similar multi-level lo-
gistic regressions with the majority hate label of
the posts as the dependent variable and consider-
ing either social groups’ warmth or competence
as independent variables show that competence
predicts lower hate (β = −7.77, SE = 3.47, p =
.025), but there was no significant relationship
between perceived warmth and the hate speech
content (β = −3.74, SE = 4.05, p = 0.355). We
note that controlling for the frequency of each social group's mentions in the dataset yields the same results (see Supplementary Materials).
In this study, we demonstrated that social ste-
reotypes (i.e., warmth and competence), as en-
coded into language resources, are associated with
annotator disagreement in an annotated dataset
of hate speech. As in Study 1, annotators agreed
more on their judgments about social media
posts that mention stereotypically more com-
petent groups. Moreover, we observed higher
inter-annotator disagreement on social media
posts that mentioned stereotypically cold social
groups (Figure 4). While Study 1 demonstrated
novice annotators’ higher tendency for detecting
hate speech targeting stereotypically competent
groups, we found a lower likelihood of hate labels
for posts that mention stereotypically competent
social groups in this dataset. The potential reasons
for this discrepancy are: (1) while both novice and
expert annotators have been exposed to the same
definition of hate speech (Kennedy et al., 2018),
expert annotators’ training focused more on the
consequences of hate speech targeting marginal-
ized groups; moreover, the lack of variance in
expert annotators’ socio-demographic background
(mostly young, educated, liberal adults) has led to their increased sensitivity to hate speech
directed toward specific stereotypically incompe-
tent groups; and (2) while Study 1 uses a set of
items with balanced representation for different
social groups, the dataset used in Study 2 includes disproportionate mentions of social groups. Therefore, the effect might be caused by the higher likelihood of hateful language appearing in GHC's social media posts mentioning stereotypically less competent groups.
Figure 4: Effects of social groups' stereotype content on majority hate labels and annotators' disagreement.
3 Study 3: Model Training
NLP models that are trained on human-annotated
datasets are prone to patterns of false predic-
tions associated with specific social group tokens
(Blodgett and O’Connor, 2017; Davidson et al.,
2019). For example, trained hate speech classi-
fiers may have a high probability of assigning a
hate speech label to a non-hateful post that men-
tions the word ‘‘gay.’’ Such patterns of false
predictions are known as prediction bias (Hardt
et al., 2016; Dixon et al., 2018), which impact
models’ performance on input data associated
with specific social groups. Previous research
has investigated several sources leading to pre-
diction bias, such as disparate representation of
specific social groups in the training data and
language models, or the choice of research de-
sign and machine learning algorithm (Hovy and
Prabhumoye, 2021). However, to our knowledge,
no study has evaluated prediction bias with re-
gard to the normative social stereotypes targeting
each social group. In Study 3, we investigate
whether social stereotypes influence hate speech
classifiers’ prediction bias toward those groups.
We define prediction bias as erroneous predictions of our text classifier model. We specifically focus on false positives (hate-speech labels assigned to non-hateful instances) and false negatives (non-hate-speech labels assigned to hateful instances) (Blodgett et al., 2020).
In the two previous studies, we demonstrated that variance in annotators' behaviors toward hate speech and imbalanced distribution of ground-truth labels in datasets are both associated with stereotypical perceptions about social groups. Accordingly, we expect hate speech classifiers, trained on the ground-truth labels, to be affected by stereotypes that provoke disagreements among annotators. If that is the case, we expect the classifier to perform less accurately and in a biased way on social-media posts that mention social groups with specific social stereotypes. To detect patterns of false predictions for specific social groups (i.e., prediction bias), we first train several models on different subsets of an annotated corpus of hate speech (GHC; described in Study 1 and 2). We then evaluate the frequency of false predictions provided for each social group and their association with the social groups' stereotypes. Figure 5 illustrates an overview of this study.
Figure 5: The overview of Study 3. In each iteration, the model is trained on a subset of the dataset. The false predictions of the model are then calculated for each social group token mentioned in test items.
Hate Speech Classifiers We implemented three hate speech classifiers; the first two models are based on pre-trained language models, BERT (Devlin et al., 2019) and RoBERTa (Zhuang et al., 2021). We implemented these two classification models using the transformers (v3.1) library of HuggingFace (Wolf et al., 2020) and fine-tuned both models for six epochs with a learning rate of 10^-7. The third model applies a Support Vector Machine (SVM; Cortes and Vapnik, 1995) with a linear kernel on Term Frequency-Inverse Document Frequency (TF-IDF) vector representations, implemented through the scikit-learn (Pedregosa et al., 2011) Python package.
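As one concrete example, the third (SVM) classifier can be set up with scikit-learn roughly as follows; the settings shown are library defaults, not necessarily the paper's, and the BERT/RoBERTa models would be fine-tuned separately with the transformers library (six epochs, learning rate 10^-7) rather than with this code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# TF-IDF features fed to a linear-kernel SVM, mirroring the description above.
svm_classifier = make_pipeline(
    TfidfVectorizer(),
    SVC(kernel="linear"),
)
# svm_classifier.fit(train_texts, train_labels)
# predictions = svm_classifier.predict(test_texts)
```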
Models were trained on subsets of the GHC, and their performance was evaluated on test items mentioning different social groups. To account for possible variations in the resulting models, caused by selecting different subsets of the dataset for training, we performed 100 iterations of model training and evaluation for each classifier. In each iteration, we trained the model on a randomly selected 80% of the dataset (ntrain = 22,132) and recorded the model predictions on the remaining 20% of the samples (ntest = 5,533). Then, we explored model predictions across all iterations (nprediction = 100 × 5,533) to capture false predictions for instances that mention at least one social group token. By comparing the model prediction with the majority vote for each instance provided in the GHC, we detected all ‘‘incorrect’’ predictions. For each social group, we specifically capture the number of false-negative (hate speech instances labeled as non-hateful) and false-positive (non-hateful instances labeled as hate speech) predictions. For each social group token, the false-positive and false-negative ratios are calculated by dividing the number of false predictions by the total number of posts mentioning the social group token.
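A schematic version of this loop is sketched below; the helper `fit_classifier` and a GHC DataFrame `ghc` with `text`, `majority_label`, and `group_tokens` columns are assumptions for illustration, not the authors' code.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

false_pos, false_neg, mentions = Counter(), Counter(), Counter()

for seed in range(100):                                   # 100 train/test iterations
    train_df, test_df = train_test_split(ghc, test_size=0.2, random_state=seed)
    clf = fit_classifier(train_df)                        # assumed: any of the three models
    preds = clf.predict(test_df["text"])
    for pred, gold, groups in zip(preds, test_df["majority_label"], test_df["group_tokens"]):
        for g in groups:                                  # a post may mention several groups
            mentions[g] += 1
            if pred == 1 and gold == 0:
                false_pos[g] += 1                         # false positive
            elif pred == 0 and gold == 1:
                false_neg[g] += 1                         # false negative

fp_ratio = {g: false_pos[g] / mentions[g] for g in mentions}
fn_ratio = {g: false_neg[g] / mentions[g] for g in mentions}
```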
Quantifying Social Stereotypes In each analysis, we considered either warmth or competence (calculated as in Study 2) of social groups as the independent variable to predict false-positive and false-negative predictions as dependent variables.
Classification Results On average, the classifiers based on BERT, RoBERTa, and SVM achieved F1 scores of 48.22% (SD = 3%), 47.69% (SD = 3%), and 35.4% (SD = 1%), respectively, on the test sets over the 100 iterations. Since the GHC includes a varying number of posts mentioning each social group token, the
predictions (nprediction = 553, 300) include a
varying number of items for each social group
token (M = 2,284.66, Mdn = 797.50, SD =
3,269.20). ‘‘White’’ as the most frequent social
group token appears in 16,155 of the predictions
and ‘‘non-binary’’ is the least frequent social
group token with only 13 observations. Since so-
cial group tokens have varying distributions in
the dataset, we considered the ratios of false pre-
dictions (rather than frequencies) in all regres-
sion models by adding the log-transform of the
number of test samples for each social group
token as the offset.
Analysis of Results The average false-positive ratio of social group tokens in the BERT-classifier was 0.58 (SD = 0.24), with a maximum false-positive ratio of 1.00 for several social groups, including ‘‘bisexual’’, and a minimum false-positive ratio of 0.03 for ‘‘Buddhist.’’ In other
words, BERT-classifiers always predicted incor-
rect hate speech labels for non-hateful social-
media posts mentioning ‘‘bisexuals’’ while rarely
making those mistakes for posts mentioning
‘‘Buddhists’’. The average false-negative ratio of
social group tokens in the BERT-classifier was
0.12 (SD = 0.11), with a maximum of 0.49
false-negative ratio associated with ‘‘homosex-
ual’’ and the minimum of 0.0 false-negative ratio
for several social groups including ‘‘Latino.’’ In
other words, BERT-classifiers predicted incorrect
non-hateful labels for social-media posts mention-
ing ‘‘homosexuals’’ while hardly making those
mistakes for posts mentioning ‘‘Latino’’. These
statistics are consistent with observations of pre-
vious findings (Davidson et al., 2017; Kwok and
Wang, 2013; Dixon et al., 2018; Park et al., 2018),
which identify false-positive errors as the more
critical issue with hate speech classifiers.
For each classifier, we assess the number
of false-positive and false-negative hate speech
predictions for social-media posts that mention
each social group. For analyzing each classifier,
two Poisson models were created, consider-
ing false-positive predictions as the dependent
variable and social groups’ (1) warmth or (2) com-
petence, calculated from a pre-trained language
model (see Study 2) as the independent variable.
The same settings were considered in two other
Poisson models to assess false-negative predic-
tions as the dependent variable, and either warmth
or competence as the independent variable.
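For illustration, one of these models could be specified in Python as follows; this is a hedged sketch, not the authors' code, and `group_stats` is an assumed per-group DataFrame holding the false-positive count, the stereotype score, and the number of test samples used as the offset.

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

fit = smf.glm(
    "n_false_positive ~ warmth",                 # or "~ competence" for the second model
    data=group_stats,
    family=sm.families.Poisson(),
    offset=np.log(group_stats["n_test"]),        # log-transformed sample size as offset
).fit()
# For a coefficient b, 100 * (exp(b) - 1) is the percent change in the
# false-prediction ratio per one-point increase in the stereotype score.
print(fit.params)
```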
                 False Positive          False Negative
                   W         C             W         C
BERT            −0.09**   −0.23**       −0.04**   −0.10**
RoBERTa         −0.04**   −0.15**        0.02      0.05*
SVM             −0.05**   −0.09**       −0.01     −0.09**

Table 1: Associations of erroneous predictions (false positives and false negatives) and social groups' warmth (W) and competence (C) stereotypes in predictions of three classifiers. ** and * represent p-values less than .001 and .05, respectively.
Table 1 reports the associations of social groups' warmth and competence stereotypes with the false hate speech labels predicted by the models. The results indicate that the number of false-positive predictions is negatively associated with the social groups' language-embedded warmth and competence scores in all three models. In other words, text that mentions social groups stereotyped as cold and incompetent is more likely to be misclassified as containing hate speech; for instance, in the BERT-classifier a one point increase in a social group's warmth and competence is, respectively, associated with an 8.4% and a 20.3% decrease in the model's false-positive error ratio. The number of false-negative predictions is also significantly associated with the social groups' competence scores; however, this association had varying directions among the three models. BERT and SVM classifiers are more likely to misclassify instances as not containing hate speech when texts mention stereotypically incompetent social groups, such that a one point increase in competence is associated with a 9.8% decrease in the BERT model's false-negative error ratio. In contrast, false-negative predictions of the RoBERTa model are more likely for text mentioning stereotypically competent social groups.
The discrepancy in the association of warmth and
competence stereotypes and false-negative errors
calls for further investigation. Figure 6 depicts
the associations of the two stereotype dimen-
sions with the proportions of false-positive and
false-negative predictions of the BERT classifier
for social groups.
In summary, this study demonstrates that erro-
neous predictions of hate speech classifiers are
associated with the normative stereotypes re-
garding the social groups mentioned in text.
Particularly, the results indicate that documents mentioning stereotypically colder and less competent social groups, which lead to higher disagreement among expert annotators based on Study 2, drive higher error rates in hate speech classifiers. This pattern of high false predictions (both false positives and false negatives) for social groups stereotyped as cold and incompetent implies that prediction bias in hate speech classifiers is associated with social stereotypes, and resembles the normative social biases that we documented in the previous studies.
Figure 6: Social groups' higher stereotypical competence and warmth are associated with lower false-positive and false-negative predictions in hate speech detection.
4 Discussion
Here, we integrate theory-driven and data-driven approaches (Wagner et al., 2021) to investigate human annotators' and normative social stereotypes as a source of bias in hate speech datasets and classifiers. In three studies, we combine social psychological frameworks and computational methods to make theory-driven predictions about hate-speech-annotation behavior and empirically test the sources of bias in hate speech classifiers. Overall, we find that hate speech annotation behaviors, often assumed to be objective, are impacted by social stereotypes, and that this in turn adversely influences automated content moderation.
In Study 1, we investigated the association between participants' self-reported social stereotypes against 8 different social groups and their annotation behavior on a small subset of social-media posts about those social groups. Our findings indicate that for novice annotators judging social groups as competent is associated with a higher tendency toward detecting hate and lower disagreement with other annotators. We reasoned that novice annotators prioritize protecting the groups they perceive as warm and competent. These results can be interpreted based on the Behaviors from Intergroup Affect and Stereotypes framework (BIAS; Cuddy et al., 2007): groups judged as competent elicit passive facilitation (i.e., obligatory association), whereas those judged as lacking competence elicit passive harm (i.e., ignoring). Here, novice annotators might tend to ‘‘ignore’’ social groups judged to be incompetent and not assign ‘‘hate speech’’ labels to inflammatory posts attacking these social groups.
However, Study 1's results may not uncover the pattern of annotation biases in hate speech datasets, as data curation efforts rely on annotator pools with imbalanced representation of different socio-demographic groups (Posch et al., 2018) and data selection varies among different datasets. In Study 2, we examined the role of social stereotypes in the aggregation process, where expert annotators' disagreements are discarded to create a large dataset containing the ground-truth hate-speech labels. We demonstrated that, similar to Study 1, texts that included groups stereotyped to be warm and competent were highly agreed upon. However, unlike Study 1, posts mentioning groups stereotyped as incompetent are more frequently marked as hate speech by the aggregated labels. In other words, novice annotators tend to focus on protecting groups they perceive as competent, whereas the majority vote of expert annotators tends to focus on common targets of hate in the corpus. We noted two potential reasons for this disparity: (1) Novice and expert annotators vary in their annotation behaviors; in many cases, hate speech datasets are labeled by expert annotators who are thoroughly trained for this specific task (Patton et al., 2019) and have specific experiences that affect their perception of online hate (Talat, 2016). GHC annotators were undergraduate psychology research assistants trained by first reading a typology and coding manual for studying hate-based rhetoric and then passing a curated test of about thirty messages designed for assessing their understanding of the annotation task (Kennedy et al., 2022). Therefore, their relatively higher familiarity with and experience in
annotating hate speech, compared to annotators
in Study 1, led to different annotation behaviors.
Moreover, dataset annotators are not usually rep-
resentative of the exact population that interacts
with social media content. As pointed out by Díaz et al. (2022), understanding the socio-cultural fac-
tors of an annotator pool can shed light on the
disparity of our results. In our case, identities and
lived experiences can significantly vary between
participants in Study 1 and GHC’s annotators in
Study 2, which impacts how annotation ques-
tions are interpreted and responded to. (2) Social
groups with specific stereotypes have imbalanced
presence in hate speech datasets; while in Study 1,
we collect a balanced set of items with equal rep-
resentation for each of the 8 social groups, social
media posts disproportionately include mentions
of different social groups, and the frequency of
each social group being targeted depends on mul-
tiple social and contextual factors.
To empirically demonstrate the effect of social
stereotypes on supervised hate speech classifiers,
in Study 3, we evaluated the performance and
biased predictions of such models when trained
on an annotated dataset. We used the ratio of
incorrect predictions to operationalize the clas-
sifiers’ unintended bias in assessing hate speech
toward specific groups (Hardt et al., 2016). Study
3’s findings suggested that social stereotypes of
a mentioned group, as captured in large language
models, are significantly associated with biased
classification of hate speech such that more false-
positive predictions are generated for documents
that mention groups that are stereotyped to be
cold and incompetent. However, we did not find
consistent trends in associations between social
groups’ warmth and competence stereotypes and
false-negative predictions among different mod-
els. These results demonstrate that false-positive
predictions are more frequent for the same social
groups that evoked more disagreements between
annotators in Study 2. Similar to Davani et al.
(2022), these findings challenge supervised learn-
ing approaches that only consider the majority
vote for training a hate speech classifier and dis-
pose of the annotation biases reflected in inter-
annotator disagreements.
It should be noted that while Study 1 assesses so-
cial stereotypes as reported by novice annotators,
Studies 2 and 3 rely on a semantic representa-
tion of such stereotypes. Since previous work on
language representation have shown that semantic
representations encode socially embedded biases,
in Studies 2 and 3 we referred to the construct
under study as normative social stereotypes. Our
comparison of results demonstrated that novice
annotators’ self-reported social stereotypes im-
pact their annotation behaviors, and the annotated
datasets and hate speech classifiers are prone to
being affected by normative stereotypes.
Our work is limited to the English language,
a single dataset of hate speech, and participants
from the US. Given that the increase in hate
speech is not limited to the US, it is important
to extend our findings in terms of research par-
ticipants and language resources. Moreover, we
applied SCM to quantify social stereotypes, but
other novel theoretical frameworks such as the
Agency-Beliefs-Communion model (Koch et al.,
2016) can be applied in the future to uncover
other sources of bias.
5 Related Work
Measuring Annotator Bias Annotators are bi-
ased in their interpretations of subjective language
understanding tasks (Aroyo et al., 2019; Talat
et al., 2021). Annotators’ sensitivity to toxic lan-
guage can vary based on their expertise (Talat,
2016), lived experiences (Patton et al., 2019), and
demographics (e.g., gender, race, and political
orientation) (Cowan et al., 2002; Norton and
Sommers, 2011; Carter and Murphy, 2015;
Prabhakaran et al., 2021; Jiang et al., 2021).
Sap et al. (2022) discovered associations between
annotators’ racist beliefs and their perceptions of
toxicity in anti-Black messages and text written
in African American English. Compared to pre-
vious efforts, our research takes a more general
approach to modeling annotators’ biases, which
is not limited to specific targets of hate.
Recent research efforts argue that annotators’
disagreements should not be treated solely as
noise in data (Pavlick and Kwiatkowski, 2019)
and call for alternative approaches for consid-
ering annotators as independent sources for in-
forming the modeling process in subjective tasks
(Prabhakaran et al., 2021). Such efforts tend to im-
prove data collection (Vidgen et al., 2021; Rottger
et al., 2022) and the modeling process in various
tasks, such as detecting sarcasm (Rajadesingan
et al., 2015), humor (Gultchin et al., 2019), senti-
ment (Gong et al., 2017), and hate speech (Kocoń
et al., 2021). For instance, Davani et al. (2022)
introduced a method for modeling individual an-
notators’ behaviors rather than their majority vote.
In another work, Akhtar et al. (2021) clustered
annotators into groups with high internal agree-
ment (similarly explored by Wich et al., 2020)
and redefined the task as modeling the aggre-
gated label of each group. Our findings especially
help such efforts by providing a framework for
incorporating annotators’ biases into hate speech
classifiers.
Measuring Hate Speech Detection Bias When
propagated into the modeling process, biases in
the annotated hate speech datasets cause group-
based biases in predictions (Sap et al., 2019) and
lack of robustness in results (Geva et al., 2019;
Arhin et al., 2021). Specifically, previous research
has shed light on unintended biases (Dixon et al.,
2018), which are generally defined as systemic
differences in performance for different demo-
graphic groups, potentially compounding existing
challenges to fairness in society at large (Borkan
et al., 2019). While a significant body of work has
been dedicated to mitigating unintended biases
in hate speech (and abusive language) classifi-
cation (Vaidya et al., 2020; Ahmed et al., 2022;
Garg et al., 2019; Nozza et al., 2019; Badjatiya
et al., 2019; Park et al., 2018; Mozafari et al.,
2020; Xia et al., 2020; Kennedy et al., 2020;
Mostafazadeh Davani et al., 2021; Chuang et al.,
2021), the choice of the exact bias metrics is not
consistent within all these studies. As demon-
strated by Czarnowska et al. (2021), various bias
metrics can be considered as different parametri-
zations of a generalized metric. In hate speech
detection in particular, disproportionate false pre-
dictions, especially false positive predictions, for
marginalized social groups have often been con-
sidered as an indicator of unintended bias in the
model. This is due to the fact that hate speech, by
definition, involves a social group as the target of
hate, and the disproportionate mentions of specific
social groups in hateful social media content have
led to imbalanced datasets and biased models.
Measuring Social Stereotypes The Stereotype
Content Model (SCM; Fiske et al., 2002) sug-
gests that to determine whether other people are
threats or allies, individuals make prompt assess-
ments about their warmth (good vs. ill intentions)
and competence (ability vs. inability to act on
intentions). Koch et al. (2016) proposed to fill
in an empirical gap in SCM by introducing the
ABC model of stereotype content. Based on this
model, people organize social groups primarily
based on their (A) agency (competence in SCM),
and (B) conservative-progressive beliefs. They
did not find (C) communion (warmth in SCM)
to be a dimension on its own, but rather an emergent quality of the other two dimensions.
Zou and Cheryan (2017) proposed that racial and
ethnic minority groups are disadvantaged along
two distinct dimensions of perceived inferiority
and perceived cultural foreignness, which can
explain the patterns of racial prejudice.
Among different social psychological frame-
works for social stereotypes, we relied on SCM
in this research, as it has been shown helpful for
predicting emotional and behavioral reactions to
outgroups. For instance, the Behaviors from Inter-
group Affect and Stereotypes framework (BIAS;
Cuddy et al., 2007), an extension of the SCM, ex-
plains how stereotypes shape behavioral tenden-
cies toward groups and individuals (Cuddy et al.,
2008). Based on this theoretical framework, per-
ceived warmth predicts active behaviors, while
perceived competence predicts passive behaviors
toward outgroups. Individuals tend to express ad-
miration for social groups stereotyped as warm
and competent and facilitate them both actively
(by protecting them) and passively (by cooper-
ating with them). In contrast, people tend to ex-
press hate toward social groups stereotyped as
cold and incompetent and harm them actively (by
attacking) and passively (by excluding).
6 Conclusion
Our findings suggest that hate speech classifiers
trained on human annotations will acquire partic-
ular prediction biases toward historically margin-
alized groups. Our results have two specific and
direct implications: First, one way to decrease un-
intended bias in classifiers is to study and diver-
sify annotation teams, so that annotators come
from different ideological and social backgrounds;
by doing so, coders may not agree on a label to
(over)protect an ingroup or ‘‘ally’’ groups (lead-
ing to false positives), or to ignore actually hateful
content targeting ‘‘blameworthy’’ groups (leading
to false negatives). Second, supervised learning
approaches may benefit from modeling annota-
tion biases, which are reflected in inter-annotator
disagreements, rather than the current practice,
which is to treat them as unexplained noise in
human judgment, to be disposed of through an-
notation aggregation. Our work is an example of
how well-established social scientific theories can
be applied to explain the novel interactions be-
tween algorithms and people. Large amounts of
data that are being constantly recorded in ever-
changing socio-technical environments call for
integrating social scientific theories for assessing
the novel technologies and their associated so-
cial impacts.
A Supplementary Materials
Study 1 – Test Items The three test items are:
(1) I want to see Sandirians raped to death col-
lectively. (2) Now if we could only get a few
jew heads on pikes that would be great. (3) Over
99% of Syrian refugees admitted to UK in 2017
were Muslims. We expect annotators to label (1)
and (2) as hate speech and label (3) as not hate
speech.
Study 1 – Analysis of All Annotators We repli-
cate the results of Study 1, on the whole set
of participants (N = 1,228). The result shows
that a higher number of items are categorized as
hate speech when participants perceive that so-
cial group as high on competence (β = 0.02,
SE = 0.005, p < .001). However, warmth scores
were not significantly associated with the number
of hate-speech labels (β = 0.01, SE = 0.006,
p = .286). Moreover, participants demonstrate
higher tendency (to assign hate speech labels) on
items that mention a social group they perceive as highly competent (β = 0.04, SE = 0.010, p < .001). Warmth scores were only marginally
associated with participants’ tendency scores (β =
0.02, SE = 0.010, p = 0.098). Lastly, participants
disagreed more on items that mention a social
group perceived as incompetent (β = −0.17,
SE = 0.034, p <.001). Contrary to the origi-
nal results, warmth scores were also significantly
associated with the odds of disagreement (β =
0.07, SE = 0.036, p = .044).
Study 1 and 2 – Stereotypes Table 2 reports the
calculated stereotype scores for each social group.
Study 2 assesses over 63 social groups; the
calculated warmth score varies from 0.01 to 0.19
(mean = 0.14, sd = 0.03), and competence varies
from −0.03 to 0.22 (mean = 0.14, sd = 0.04).
Group        Study 1 C   Study 1 W   Study 2 C   Study 2 W
Immigrant    6.8 (1.6)   7.2 (1.4)   5.0         4.7
Muslim       7.0 (1.7)   6.6 (1.8)   5.0         4.8
Communist    5.8 (2.0)   5.1 (2.0)   5.1         5.1
Liberal      6.7 (2.0)   6.6 (1.9)   4.9         4.9
Black        7.0 (1.7)   6.9 (1.6)   5.0         5.0
Gay          7.3 (1.5)   7.5 (1.4)   5.0         4.9
Jewish       7.7 (1.3)   7.3 (1.4)   5.1         5.1
Woman        7.6 (1.3)   7.5 (1.2)   5.2         5.2

Table 2: Perceived warmth (W) and competence (C) scores, varying from 1 (most negative trait) to 8 (most positive trait). The Study 1 columns report the mean and standard deviation of participants' responses; Study 2's values are scaled from [−1, 1] to [1, 8]. The correlations of perceived competence and warmth scores within the two studies are −0.07 and 0.09, respectively.
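The rescaling of Study 2 scores from [−1, 1] to [1, 8] mentioned in the caption is consistent with the standard affine map s′ = 1 + 7(s + 1)/2 (the exact transformation is not spelled out here, so this form is an assumption). Under this map, the Study 2 means of 0.14 for warmth and competence correspond to roughly 1 + 7(1.14)/2 ≈ 5.0 on the 1–8 scale, in line with the Study 2 columns of Table 2.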
Figure 7: The distribution of social groups on the warmth-competence space, based on the calculated scores used in Study 2.
Figure 7 plots the social groups on the warmth and
competence dimensions calculated in Study 2.
Study 2 – Frequency as a Control Variable
After adding social groups' frequency as a control variable, both higher warmth (β = −2.28, SE = 0.76, p < .01) and higher competence (β = −5.32, SE = 0.62, p < .001) scores were associated with lower disagreement. Competence also predicted lower amounts of hate speech content (β = −7.96, SE = 3.71, p = .032), but there was no significant relationship between perceived warmth and hate speech content (β = −2.95, SE = 3.89, p = .448).
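A control variable such as group frequency can be added to the same kind of regression sketched earlier simply by extending the model formula; the snippet below is again an illustrative sketch with assumed variable names and toy values, not the original analysis.

```python
# Minimal sketch: the robustness check above, with group frequency as a covariate.
# Column names and values are assumptions for illustration.
import pandas as pd
import statsmodels.formula.api as smf

groups = pd.DataFrame({
    "disagreement": [0.21, 0.35, 0.18, 0.30, 0.42, 0.25],  # toy disagreement rates
    "warmth":       [0.12, 0.09, 0.16, 0.14, 0.08, 0.15],
    "competence":   [0.13, 0.10, 0.18, 0.12, 0.06, 0.17],
    "frequency":    [120, 45, 300, 80, 60, 210],            # group mentions in the corpus
})

controlled = smf.ols("disagreement ~ warmth + competence + frequency", data=groups).fit()
print(controlled.params)  # warmth/competence effects net of group frequency
```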
Acknowledgments
We would like to thank Nils Karl Reimer,
Vinodkumar Prabhakaran, Stephen Read,
the
anonymous reviewers, and the action editor for
their suggestions and feedback.
References
Zo Ahmed, Bertie Vidgen, and Scott A. Hale.
2022. Tackling racial bias in automated on-
line hate detection: Towards fair and accur-
ate detection of hateful users with geometric
deep learning. EPJ Data Science, 11(1):8.
https://doi.org/10.1140/epjds/s13688
-022-00319-9
Sohail Akhtar, Valerio Basile, and Viviana Patti.
2021. Whose opinions matter? Perspective-
aware models to identify opinions of hate
speech victims in abusive language detection.
arXiv preprint arXiv:2106.15896.
Kofi Arhin,
Ioana Baldini, Dennis Wei,
Karthikeyan Natesan Ramamurthy,
and
Moninder Singh. 2021. Ground-truth, whose
truth? Examining the challenges with an-
notating toxic text datasets. arXiv preprint
arXiv:2112.03529.
Lora Aroyo, Lucas Dixon, Nithum Thain, Olivia
Redfield, and Rachel Rosen. 2019. Crowd-
sourcing subjective tasks: The case study of
understanding toxicity in online discussions.
In Companion Proceedings of The 2019 World
Wide Web Conference, pages 1100–1105.
https://doi.org/10.1145/3308560
.3317083
Pinkesh Badjatiya, Manish Gupta, and Vasudeva
Varma. 2019. Stereotypical bias removal for
hate speech detection task using knowledge-
based generalizations. In The World Wide Web
Conference, pages 49–59.
Emily M. Bender, Timnit Gebru, Angelina
McMillan-Major, and Shmargaret Shmitchell.
2021. On the dangers of stochastic parrots:
Can language models be too big? In Pro-
ceedings of
the 2021 ACM Conference on
Fairness, Accountability, and Transparency,
pages 610–623. https://doi.org/10.1145
/3442188.3445922
Su Lin Blodgett, Solon Barocas, Hal Daumé III,
and Hanna Wallach. 2020. Language (tech-
nology) is power: A critical survey of ‘‘bias’’
in NLP. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics, pages 5454–5476, Online. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/2020.acl-main.485
Su Lin Blodgett and Brendan O’Connor. 2017.
Racial disparity in natural language processing:
A case study of social media African Ameri-
can English. In Fairness, Accountability, and
Transparency in Machine Learning (FAT/ML)
Workshop, KDD.
Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou,
Venkatesh Saligrama, and Adam T. Kalai.
2016. Man is to computer programmer as
woman is to homemaker? Debiasing word em-
beddings. In Advances in Neural Information
Processing Systems, pages 4349–4357.
Daniel Borkan, Lucas Dixon, Jeffrey Sorensen,
Nithum Thain, and Lucy Vasserman. 2019. Nu-
anced metrics for measuring unintended bias
with real data for text classification. In Com-
panion Proceedings of the 2019 World Wide
Web Conference, pages 491–500.
Aylin Caliskan, Joanna J. Bryson, and Arvind
Narayanan. 2017. Semantics derived automati-
cally from language corpora contain human-like
biases. Science, 356(6334):183–186. https://
doi.org/10.1126/science.aal4230, PubMed:
28408601
Evelyn R. Carter and Mary C. Murphy. 2015.
Group-based differences in perceptions of
racism: What counts,
to whom, and why?
Social and Personality Psychology Com-
pass, 9(6):269–280. https://doi.org/10
.1111/spc3.12181
Tessa E. S. Charlesworth, Victor Yang, Thomas
C. Mann, Benedek Kurdi, and Mahzarin
R. Banaji. 2021. Gender stereotypes in natu-
ral language: Word embeddings show robust
consistency across child and adult language
corpora of more than 65 million words. Psy-
chological Science, 32:218–240. https://doi
.org/10.1177/0956797620963619, PubMed:
33400629
Yung-Sung Chuang, Mingye Gao, Hongyin Luo,
James Glass, Hung-yi Lee, Yun-Nung Chen,
and Shang-Wen Li. 2021. Mitigating biases
in toxic language detection through invari-
ant rationalization. In Proceedings of the 5th
Workshop on Online Abuse and Harms (WOAH
2021), pages 114–120, Online. Association for
Computational Linguistics.
Corinna Cortes and Vladimir Vapnik. 1995.
Support-vector networks. Machine Learning,
20(3):273–297. https://doi.org/10.1007
/BF00994018
Gloria Cowan, Miriam Resendez, Elizabeth
Marshall, and Ryan Quist. 2002. Hate speech
and constitutional protection: Priming values
of equality and freedom. Journal of Social
Issues, 58(2):247–263. https://doi.org
/10.1111/1540-4560.00259
Kate Crawford. 2017. The trouble with bias. In
Conference on Neural Information Processing
Systems, invited speaker.
Amy J. C. Cuddy, Susan T. Fiske, and Peter
Glick. 2007. The bias map: Behaviors from
intergroup affect and stereotypes. Journal of
Personality and Social Psychology, 92(4):
631–648. https://doi.org/10.1037/0022
-3514.92.4.631, PubMed: 17469949
Amy J. C. Cuddy, Susan T. Fiske, and Peter
Glick. 2008. Warmth and competence as uni-
versal dimensions of social perception: The
stereotype content model and the bias map.
Advances in Experimental Social Psychology,
40:61–149. https://doi.org/10.1016
/S0065-2601(07)00002-0
Paula Czarnowska, Yogarshi Vyas, and Kashif
Shah. 2021. Quantifying social biases in NLP:
A generalization and empirical comparison
of extrinsic fairness metrics. Transactions of
the Association for Computational Linguis-
tics, 9:1249–1267. https://doi.org/10
.1162/tacl_a_00425
Aida Mostafazadeh Davani, Mark Díaz, and
Vinodkumar Prabhakaran. 2022. Dealing with
disagreements: Looking beyond the majority
vote in subjective annotations. Transactions of
the Association for Computational Linguistics,
10:92–110. https://doi.org/10.1162
/tacl_a_00449
Thomas Davidson, Debasmita Bhattacharya,
and Ingmar Weber. 2019. Racial bias in
hate speech and abusive language detection
datasets. In Proceedings of the Third Workshop
on Abusive Language Online, pages 25–35,
Florence,
Italy. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/W19-3504
Thomas Davidson, Dana Warmsley, Michael
Macy, and Ingmar Weber. 2017. Automated
hate speech detection and the problem of offen-
sive language. In Eleventh International AAAI Conference on Web and Social Media.
J. Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In NAACL-HLT.
Mark Díaz,
Ian Kivlichan, Rachel Rosen,
Dylan Baker, Razvan Amironesei, Vinodkumar
Prabhakaran, and Emily Denton. 2022. Crowd-
worksheets: Accounting for
individual and
collective identities underlying crowdsourced
dataset annotation. In 2022 ACM Conference
on Fairness, Accountability, and Transparency,
pages 2342–2351. https://doi.org/10
.1145/3531146.3534647
Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73. https://doi.org/10.1145/3278721.3278729
Michael Feldman, Sorelle A. Friedler, John
Moeller, Carlos Scheidegger, and Suresh
Venkatasubramanian. 2015. Certifying and re-
moving disparate impact. In proceedings of
the 21th ACM SIGKDD international confer-
ence on knowledge discovery and data mining,
pages 259–268.
Susan T. Fiske, Amy J. C. Cuddy, P. Glick,
and J. Xu. 2002. A model of (often mixed)
stereotype content: Competence and warmth
respectively follow from perceived status and
competition. Journal of Personality and Social
Psychology, 82(6):878. https://doi.org
/10.1037/0022-3514.82.6.878
Joseph L. Fleiss. 1971. Measuring nominal
scale agreement among many raters. Psycho-
logical Bulletin, 76(5):378. https://doi
.org/10.1037/h0031619
Gavin Gaffney. 2018. Pushshift gab corpus. https://files.pushshift.io/gab/. Accessed: 2019-5-23.
Nikhil Garg, Londa Schiebinger, Dan Jurafsky,
and James Zou. 2018. Word embeddings
quantify 100 years of gender and ethnic stereo-
types. Proceedings of the National Academy
of Sciences, 115(16):E3635–E3644. https://
doi.org/10.1073/pnas.1720347115, PubMed:
29615513
Sahaj Garg, Vincent Perot, Nicole Limtiaco,
Ankur Taly, Ed H. Chi, and Alex Beutel.
2019. Counterfactual fairness in text classifi-
cation through robustness. In Proceedings of
the 2019 AAAI/ACM Conference on AI, Ethics,
and Society, pages 219–226. https://doi
.org/10.1145/3306618.3317950
Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In
Proceedings of the 2019 Conference on Empiri-
cal Methods in Natural Language Processing
and the 9th International Joint Conference
on Natural Language Processing (EMNLP-
IJCNLP), pages 1161–1166, Hong Kong, China.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D19
-1107
Lin Gong, Benjamin Haines, and Hongning
Wang. 2017. Clustered model adaption for
personalized sentiment analysis. In Proceed-
ings of the 26th International Conference on
World Wide Web, pages 937–946.
Limor Gultchin, Genevieve Patterson, Nancy
Baym, Nathaniel Swinger, and Adam Kalai.
2019. Humor in word embeddings: Cocka-
mamie gobbledegook for nincompoops.
In
International Conference on Machine Learn-
ing, pages 2474–2483. PMLR.
Moritz Hardt, Eric Price, and Nati Srebro. 2016.
Equality of opportunity in supervised learning.
In Advances in Neural Information Processing
Systems, pages 3315–3323.
Wilhelm Hofmann, Bertram Gawronski, Tobias Gschwendner, Huy Le, and Manfred Schmitt. 2005. A meta-analysis on the correlation between the implicit association test and explicit self-report measures. Personality and Social Psychology Bulletin, 31(10):1369–1385. https://doi.org/10.1177/0146167205275613, PubMed: 16143669
Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish
Vaswani, and Eduard Hovy. 2013. Learning
whom to trust with mace. In Proceedings of
the 2013 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 1120–1130.
Dirk Hovy and Shrimai Prabhumoye. 2021.
Five sources of bias in natural
language
processing. Language and Linguistics Com-
pass, 15(8):e12432. https://doi.org/10
.1111/lnc3.12432, PubMed: 35864931
L. Rowell Huesmann, Eric F. Dubow, Paul
Boxer, Violet Souweidane, and Jeremy Ginges.
2012. Foreign wars and domestic prejudice:
How media exposure to the Israeli-Palestinian
conflict predicts ethnic stereotyping by Jewish
and Arab American adolescents. Journal of
Research on Adolescence, 22(3):556–570.
https://doi.org/10.1111/j.1532-7795
.2012.00785.x, PubMed: 23243381
Ben Hutchinson, Vinodkumar Prabhakaran, Emily
Denton, Kellie Webster, Yu Zhong, and
Stephen Denuyl. 2020. Social biases in NLP
models as barriers for persons with disabili-
ties. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguis-
tics, pages 5491–5501, Online. Association
for Computational Linguistics.
Jialun Aaron Jiang, Morgan Klaus Scheuerman,
Casey Fiesler, and Jed R. Brubaker. 2021.
Understanding international perceptions of the
severity of harmful content online. PloS One,
16(8):e0256762. https://doi.org/10.1371
/journal.pone.0256762, PubMed: 34449815
Brendan Kennedy, Mohammad Atari, Aida
Mostafazadeh Davani, Leigh Yeh, Ali Omrani,
Yehsong Kim, Kris Coombs, Shreya Havaldar,
Gwenyth Portillo-Wightman, Elaine Gonzalez,
Joe Hoover, Aida Azatian, Alyzeh Hussain,
Austin Lara, Gabriel Cardenas, Adam Omary,
Christina Park, Xin Wang, Clarisa Wijaya,
Yong Zhang, Beth Meyerowitz, and Morteza
Dehghani. 2022.
Introducing the gab hate
corpus: Defining and applying hate-based rhet-
oric to social media posts at scale. Language
Resources and Evaluation, 56(1):79–108.
https://doi.org/10.1007/s10579-021
-09569-x
Brendan Kennedy, Xisen Jin, Aida Mostafazadeh
Davani, Morteza Dehghani, and Xiang Ren.
2020. Contextualizing hate speech classifiers
with post-hoc explanation. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 5435–5442,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/2020.acl-main.483
Brendan Kennedy, Drew Kogon, Kris Coombs,
Joseph Hoover, Christina Park, Gwenyth
Portillo-Wightman, Aida Mostafazadeh Davani,
Mohammad Atari, and Morteza Dehghani. 2018.
A typology and coding manual for the study of
hate-based rhetoric. PsyArXiv. July, 18.
Svetlana Kiritchenko,
Isar Nejadgholi,
and
Kathleen C. Fraser. 2021. Confronting abusive
language online: A survey from the ethical and
human rights perspective. Journal of Artificial
Intelligence Research, 71:431–478. https://
doi.org/10.1613/jair.1.12590
Alex Koch, Roland Imhoff, Ron Dotsch, Christian
Unkelbach, and Hans Alves. 2016. The abc of
stereotypes about groups: Agency/socioeconomic
success, conservative–progressive beliefs, and
communion. Journal of Personality and So-
cial Psychology, 110(5):675. https://doi
.org/10.1037/pspa0000046, PubMed:
27176773
Jan Kocoń, Marcin Gruza, Julita Bielaniewicz,
Damian Grimling, Kamil Kanclerz, Piotr
Miłkowski, and Przemysław Kazienko. 2021.
Learning personal human biases and rep-
resentations for subjective tasks in natural
language processing. In 2021 IEEE Interna-
tional Conference on Data Mining (ICDM),
pages 1168–1173. IEEE.
Irene Kwok and Yuzhou Wang. 2013. Locate the
hate: Detecting tweets against blacks. In Pro-
ceedings of the AAAI Conference on Artificial
Intelligence, volume 27. https://doi.org
/10.1609/aaai.v27i1.8539
John P. Lalor, Hao Wu, and Hong Yu. 2016. Building an evaluation scale using item response theory. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, volume 2016, page 648. NIH Public Access.
Thomas Manzini, Lim Yao Chong, Alan W.
Black, and Yulia Tsvetkov. 2019. Black is
to criminal as caucasian is to police: De-
tecting and removing multiclass bias in word
embeddings. In Proceedings of the 2019 Con-
ference of
the North American Chapter of
the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 615–621,
Minneapolis, Minnesota. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N19-1062
Melissa D. McCradden, Shalmali Joshi, James A.
Anderson, Mjaye Mazwi, Anna Goldenberg,
and Randi Zlotnik Shaul. 2020. Patient
safety and quality improvement: Ethical prin-
ciples for a regulatory approach to bias in
healthcare machine learning. Journal of the
American Medical
Informatics Association,
27(12):2024–2027. https://doi.org/10
.1093/jamia/ocaa085, PubMed: 32585698
Ninareh Mehrabi, Fred Morstatter, Nripsuta
Saxena, Kristina Lerman, and Aram Galstyan.
2021. A survey on bias and fairness in machine
learning. ACM Computing Surveys (CSUR),
54(6):1–35. https://doi.org/10.1145
/3457607
Aida Mostafazadeh Davani, Ali Omrani, Brendan
Kennedy, Mohammad Atari, Xiang Ren, and
Morteza Dehghani. 2021. Improving counter-
factual generation for fair hate speech de-
tection. In Proceedings of the 5th Workshop
on Online Abuse and Harms (WOAH 2021),
pages 92–101, Online. Association for Compu-
tational Linguistics.
Marzieh Mozafari, Reza Farahbakhsh, and Noël
Crespi. 2020. Hate speech detection and racial
bias mitigation in social media based on bert
model. PloS ONE, 15(8):e0237861. https://
doi.org/10.1371/journal.pone.0237861,
PubMed: 32853205
Michael I. Norton and Samuel R. Sommers. 2011.
Whites see racism as a zero-sum game that they
are now losing. Perspectives on Psychological
Science, 6(3):215–218. https://doi.org/10
.1177/1745691611406922, PubMed: 26168512
Debora Nozza, Claudia Volpetti, and Elisabetta
Fersini. 2019. Unintended bias in misogyny
detection. In IEEE/WIC/ACM International Conference on Web Intelligence, pages 149–155.
https://doi.org/10.1145/3350546
.3352512
Ziad Obermeyer, Brian Powers, Christine Vogeli,
and Sendhil Mullainathan. 2019. Dissect-
ing racial bias
in an algorithm used to
manage the health of populations. Science,
366(6464):447–453. https://doi.org/10
.1126/science.aax2342, PubMed: 31649194
Ji Ho Park, Jamin Shin, and Pascale Fung. 2018.
Reducing gender bias in abusive language de-
tection. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language
Processing, Brussels, Belgium. Association for
Computational Linguistics.
Desmond Patton, Philipp Blandfort, William
Frey, Michael Gaskell, and Svebor Karaman.
2019. Annotating social media data from vul-
nerable populations: Evaluating disagreement
between domain experts and graduate stu-
dent annotators. In Proceedings of the 52nd
Hawaii International Conference on System
Sciences. https://doi.org/10.24251
/HICSS.2019.260
Ellie Pavlick and Tom Kwiatkowski. 2019. Inher-
ent disagreements in human textual inferences.
Transactions of the Association for Computa-
tional Linguistics, 7:677–694. https://doi
.org/10.1162/tacl_a_00293
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, and David Cournapeau. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global
vectors for word representation. In Empiri-
cal Methods in Natural Language Processing
(EMNLP), pages 1532–1543. https://doi
.org/10.3115/v1/D14-1162
Agnieszka Pietraszkiewicz, Magdalena Formanowicz, Marie Gustafsson Sendén, Ryan L. Boyd, Sverker Sikström, and Sabine
Sczesny. 2019. The big two dictionaries:
Capturing agency and communion in natural
language. European Journal of Social Psychol-
ogy, 49(5):871–887. https://doi.org
/10.1002/ejsp.2561
Lisa Posch, Arnim Bleier, Fabian Flöck, and
Markus Strohmaier. 2018. Characterizing the
global crowd workforce: A cross-country com-
parison of crowdworker demographics.
In
Eleventh International AAAI Conference on
Web and Social Media.
Vinodkumar Prabhakaran, Ben Hutchinson, and
Margaret Mitchell. 2019. Perturbation sensitiv-
ity analysis to detect unintended model biases.
In Proceedings of the 2019 Conference on Em-
pirical Methods in Natural Language Process-
ing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-
IJCNLP), pages 5740–5745, Hong Kong,
China. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/D19-1578
Vinodkumar Prabhakaran, Aida Mostafazadeh
Davani, and Mark Diaz. 2021. On releasing
annotator-level labels and information in data-
sets. In Proceedings of The Joint 15th Lin-
guistic Annotation Workshop (LAW) and 3rd
Designing Meaning Representations (DMR)
Workshop, pages 133–138, Punta Cana,
Dominican Republic. Association for Compu-
tational Linguistics.
Ashwin Rajadesingan, Reza Zafarani, and Huan
Liu. 2015. Sarcasm detection on twitter: A
behavioral modeling approach. In Proceedings
of the eighth ACM international conference on
web search and data mining, pages 97–106.
https://doi.org/10.1145/2684822
.2685316
Georg Rasch. 1993. Probabilistic Models for
Some Intelligence and Attainment Tests. ERIC.
Björn Ross, Michael Rist, Guillermo Carbonell,
Benjamin Cabrera, Nils Kurowsky,
and
Michael Wojatzki. 2017. Measuring the reli-
ability of hate speech annotations: The case of
the european refugee crisis. In Proceedings of
the Workshop on Natural Language Processing
for Computer Mediated Communication.
Paul Rottger, Bertie Vidgen, Dirk Hovy, and
Janet Pierrehumbert. 2022. Two contrasting
data annotation paradigms for subjective NLP tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 175–190,
Seattle, United States. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2022.naacl-main.13
Maarten Sap, Dallas Card, Saadia Gabriel, Yejin
Choi, and Noah A. Smith. 2019. The risk of racial
bias in hate speech detection. In Proceedings of
the 57th Annual Meeting of the Association for
Computational Linguistics, pages 1668–1678.
Maarten Sap, Swabha Swayamdipta, Laura
Vianna, Xuhui Zhou, Yejin Choi, and Noah
Smith. 2022. Annotators with attitudes: How
annotator beliefs and identities bias toxic lan-
guage detection. In Proceedings of the 2022
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Seattle,
United States. Association for Computational
Linguistics.
Steven E. Stemler and Adam Naples. 2021. Rasch
measurement v. item response theory: Knowing
when to cross the line. Practical Assessment,
Research & Evaluation, 26:11.
Nathaniel Swinger, Maria De-Arteaga, Neil
Thomas Heffernan IV, Mark DM Leiserson,
and Adam Tauman Kalai. 2019. What are the bi-
ases in my word embedding? In Proceedings of
the 2019 AAAI/ACM Conference on AI, Ethics,
and Society, pages 305–311. https://doi
.org/10.1145/3306618.3314270
Zeerak Talat. 2016. Are you a racist or am I see-
ing things? Annotator influence on hate speech
detection on twitter. In Proceedings of the first
workshop on NLP and computational social
science, pages 138–142.
Zeerak Talat, Smarika Lulz, Joachim Bingel, and
Isabelle Augenstein. 2021. Disembodied ma-
chine learning: On the illusion of objectivity in
NLP. Anonymous preprint under review.
Alexandra N. Uma, Tommaso Fornaciari, Dirk
Hovy, Silviu Paun, Barbara Plank, and Massimo
Poesio. 2021. Learning from disagreement: A
survey. Journal of Artificial Intelligence Re-
search, 72:1385–1470. https://doi.org
/10.1613/jair.1.12752
Ameya Vaidya, Feng Mai, and Yue Ning. 2020.
Empirical analysis of multi-task learning for re-
ducing identity bias in toxic comment detection.
In Proceedings of the International AAAI Con-
ference on Web and Social Media, volume 14,
pages 683–693.
Bertie Vidgen, Tristan Thrush, Zeerak Talat,
and Douwe Kiela. 2021. Learning from the
worst: Dynamically generated datasets to im-
prove online hate detection. In Proceedings of
the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th
International Joint Conference on Natural Lan-
guage Processing (Volume 1: Long Papers),
pages 1667–1682, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2021.acl-long.132
Claudia Wagner, Markus Strohmaier, Alexandra
Olteanu, Emre Kıcıman, Noshir Contractor, and
Tina Eliassi-Rad. 2021. Measuring algorithmically
infused societies. Nature, 595(7866):197–204.
https://doi.org/10.1038/s41586
-021-03666-1, PubMed: 34194046
Maximilian Wich, Hala Al Kuwatly, and Georg
Groh. 2020. Investigating annotator bias with
a graph-based approach. In Proceedings of the
fourth workshop on online abuse and harms,
pages 191–199.
Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Remi Louf,
Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite,
Julien Plu, Canwen Xu, Teven Le Scao, Sylvain
Gugger, Mariama Drame, Quentin Lhoest,
and Alexander Rush. 2020. Transformers:
State-of-the-art natural language processing. In
Proceedings of the 2020 Conference on Em-
pirical Methods in Natural Language Pro-
cessing: System Demonstrations, pages 38–45,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.emnlp-demos.6
Mengzhou Xia, Anjalie Field, and Yulia Tsvetkov. 2020. Demoting racial bias in hate speech detection. In Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media,
pages 7–14, Online. Association for Compu-
tational Linguistics.
Liu Zhuang, Lin Wayne, Shi Ya, and Zhao Jun. 2021. A robustly optimized BERT pre-training approach with post-training. In Proceedings of the 20th Chinese National Conference on Computational Linguistics, pages 1218–1227, Huhhot, China. Chinese Information Processing Society of China.
Linda X. Zou and Sapna Cheryan. 2017. Two axes
of subordination: A new model of racial posi-
tion. Journal of Personality and Social Psychol-
ogy, 112(5):696–717. https://doi.org/10
.1037/pspa0000080, PubMed: 28240941