Annotation Error Detection: Analyzing the
Past and Present for a More Coherent Future

Jan-Christoph Klie∗
Ubiquitous Knowledge Processing Lab
Department of Computer Science
Technical University of Darmstadt
www.ukp.tu-darmstadt.de

Bonnie Webber
School of Informatics,
University of Edinburgh

Iryna Gurevych
UKP Lab / TU Darmstadt

Annotated data is an essential ingredient in natural language processing for training and
evaluating machine learning models. It is therefore very desirable for the annotations to be
of high quality. Recent work, however, has shown that several popular datasets contain a
surprising number of annotation errors or inconsistencies. To alleviate this issue, many
methods for annotation error detection have been devised over the years. While researchers show
that their approaches work well on their newly introduced datasets, they rarely compare their
methods to previous work or on the same datasets. This raises strong concerns about the methods'
general performance and makes it difficult to assess their strengths and weaknesses. We therefore
reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English
datasets for text classification as well as token and span labeling. In addition, we define a uniform
evaluation setup including a new formalization of the annotation error detection task, evaluation
protocol, and general best practices. To facilitate future research and reproducibility, we release
our datasets and implementations in an easy-to-use and open source software package.1

1. Introduction

Annotated corpora are an essential component in many scientific disciplines, including
natural language processing (NLP) (Gururangan et al. 2020; Peters, Ruder, and Smith

∗ Corresponding author.

1 https://github.com/UKPLab/nessie.

Action Editor: Nianwen Xue. Submission received: 7 June 2022; revised version received: 29 August 2022;
accepted for publication: 23 September 2022.

https://doi.org/10.1162/coli_a_00464

© 2022 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) Licence

Computational Linguistics

Volume 49, Number 1

2019), linguistics (Haselbach et al. 2012), language acquisition research (Behrens 2008),
and the digital humanities (Schreibman, Siemens, and Unsworth 2004). Corpora are
used to train and evaluate machine learning models, to deduce new knowledge, and to
suggest appropriate revisions to existing theories. Especially in machine learning,
high-quality datasets play a crucial role in advancing the field (Sun et al. 2017). It is often
taken for granted that gold standard corpora do not contain errors, but alas, this is
not always the case. Datasets are usually annotated by humans who can and do make
mistakes (Northcutt, Athalye, and Mueller 2021). Annotation errors can even be found
in corpora used for shared tasks such as CONLL-2003 (Tjong Kim Sang and De Meulder
2003). For example, Durban is annotated there as PER (person) and S.AFRICA as MISC
(miscellaneous), but both should be annotated as LOC (location).

Gold standard annotation is also subject to inconsistency, where words or phrases
that are intended to refer to the same type of thing (and so should be labeled in the
same way) are nevertheless assigned different labels (see, e.g., Hollenstein, Schneider,
and Webber 2016). For example, in CONLL-2003, when Fiorentina was used to refer to
the local football club, it was annotated as ORG, but when Japan was used to refer to the
Japanese national football team, it was inconsistently annotated as LOC. One reason for
annotation inconsistencies is that tokens can be ambiguous, either because they have
multiple senses (e.g., the word club can refer to an organization or to a weapon), or
because metonymy allows something to be referred to by one of its parts or attributes
(e.g., the Scottish curling team being referred to as Scotland, as in Scotland beat Canada
in the final match). We further define errors as well as inconsistencies and also discuss
ambiguity in detail in § 3.1.

Such annotation errors or inconsistencies can negatively impact a model's performance
or even lead to erroneous conclusions (Manning 2011; Northcutt, Athalye, and
Mueller 2021; Larson et al. 2020; Zhang et al. 2021). A deployed model that learned
errors during training can potentially cause harm, especially in critical applications like
medical or legal settings. High-quality labels are needed to evaluate machine learning
methods even if they themselves are robust to label noise (e.g., Song et al. 2020). Corpus
linguistics relies on correctly annotated data to develop and confirm new theories.
Learner corpora containing errors might be detrimental to the language learning
experience and teach wrong lessons. Thus, it is imperative for datasets to have high-quality
labels.

Cleaning the labels by hand, however, is expensive and time consuming. Therefore,
many automatic methods for annotation error detection (AED) have been devised over
the years. These methods enable dataset creators and machine learning practitioners to
narrow down the instances that need manual inspection and, if necessary, correction.
This reduces the overall work needed to find and fix annotation errors (see, e.g., Reiss
et al. 2020). As an example, AED has been used to discover that widely used benchmark
datasets contain errors and inconsistencies (Northcutt, Athalye, and Mueller 2021).
Around 2% of the samples (sometimes even more than 5%) have been found incorrectly
annotated in datasets like Penn Treebank (Dickinson and Meurers 2003a), sentiment
analysis datasets like SST, Amazon Reviews, or IMDb (Barnes, Øvrelid, and Velldal
2019; Northcutt, Athalye, and Mueller 2021), CoNLL-2003 (Wang et al. 2019; Reiss
et al. 2020), or relation extraction in TACRED (Alt, Gabryszak, and Hennig 2020;
Stoica, Platanios, and Poczos 2021). AED has likewise been used to find ambiguous
instances, for example, for part-of-speech (POS) annotation (Dickinson and Meurers
2003a). In addition, it has been shown that errors in automatically annotated (silver)
corpora can also be found and fixed with the help of AED (Rehbein 2014; Ménard and
Mougeot 2019).

Klie, Webber, and Gurevych

Annotation Error Detection

While AED methods have been applied successfully in the past (e.g., Reiss et al.
2020), there are several issues that hinder their widespread use. New approaches for
AED are often only evaluated on newly introduced datasets that are proprietary or
not otherwise available (e.g., Dligach and Palmer 2011; Amiri, Miller, and Savova 2018;
Larson et al. 2019). Also, newly introduced methods are rarely compared to previous
work or baselines. These issues make comparisons of AED methods very difficult. In
addition to that, there is neither agreement on how to evaluate AED methods, nor on which
metrics to use during their development and application. Consequently, it is often not clear
how well AED works in practice, especially which AED methods should be applied to
which kind of data and underlying tasks. To alleviate these issues, we define a unified
evaluation setup for AED, conduct a large-scale analysis of 18 AED methods, and apply
them to 9 datasets for text classification, token labeling, and span labeling. This work
focuses on errors and inconsistencies related to instance labels. We leave issues such
as boundary errors, sentence splitting, or tokenization for future work. The methods
presented in this article are particularly suited to the NLP community, but many of
them can also be adapted to other tasks (e.g., relation classification) and domains (like
computer vision). The research questions we answer are:

RQ1 Which methods work well across tasks and datasets?

RQ2 Does model calibration help to improve AED performance?

RQ3 To what extent are model and AED performance correlated?

RQ4 What (performance) impact does using cross-validation have?

The research reported in this article addresses the aforementioned issues by providing
the following contributions:

Evaluation Methodology To unify its findings and establish comparability, we first
define the task of AED and a standardized evaluation setup, including an improvement
for evaluating span labeling in this context (§ 3.1).

Easy to Use Reference Implementations We survey past work from the last 25 years
and implement the 18 most common and generally applicable AED methods (§ 3.2).
We publish our implementation in a Python package called NESSIE that is easy to use,
thoroughly tested, and extensible to new methods and tasks. We provide abstractions
for models and tasks, as well as helpers for cross-validation, to reduce the boilerplate code
needed to a minimum. In addition, we provide extensive documentation and code
examples. Our package therefore makes it significantly easier to get started with AED
for researchers and practitioners alike.

Benchmarking Datasets We identify, vet, and generate datasets for benchmarking AED
approaches, which results in 9 datasets for text classification, token labeling, and span
labeling (§ 4). We also publish the collected datasets to facilitate easy comparison and
reproducibility.

Evaluation and Analysis Using our implementation, we investigate several fundamental
research questions regarding AED (§ 5). We specifically focus on how to
achieve the best AED performance for each task and dataset, taking model calibration,
usage of cross-validation, as well as model selection into account. Based on our
results, we provide recipes and give recommendations on how to best use AED in
practice (§ 6).

2. Related Work

This section provides a brief overview of annotation error detection and its related tasks.

Annotation Error Detection. In most works, AED is used as a means to improve the quality
of an annotated corpus. As such, the method used is treated as secondary and possible
methods are not compared. The works of Amiri, Miller, and Savova (2018) and Larson
et al. (2020) are among the few that implement different methods and baselines, but
they only use newly introduced datasets. In other cases, AED is just discussed as a minor
contribution and not thoroughly evaluated (e.g., Swayamdipta et al. 2020; Rodriguez
et al. 2021).

Therefore, to the best of our knowledge, no large-scale evaluation of AED methods
exists. Closest to the current study is the work of Dickinson (2015), a survey about
the history of annotation error detection. However, that survey does not reimplement,
compare, or evaluate existing methods quantitatively. Its focus is also limited to
part-of-speech and dependency annotations. Our work fills the aforementioned gaps by
reimplementing 18 methods for AED, evaluating the methods against 9 datasets, and
investigating the setups in which they perform best.

Annotation Error Correction. After potential errors have been detected, the next step is to
have them corrected to obtain gold labels. This is usually done by human annotators
who carefully examine those instances that have been detected. Some AED methods
can also both detect and correct labels. Only a few groups have studied correction so
far (e.g., Květoň and Oliva 2002; Loftsson 2009; Dickinson 2006; Angle, Mishra, and
Sharma 2018; Qian et al. 2021). In this study, we focus on detection and leave an
in-depth treatment of annotation error correction for future work.

Error Type Classification. Even if errors are not corrected automatically, it may still be
worth identifying the type of each error. For example, Larson et al. (2020) investigate
the different errors for slot filling (e.g., incorrect span boundaries, incorrect labels,
or omissions). Alt, Gabryszak, and Hennig (2020) investigate error types for relation
classification. Yaghoub-Zadeh-Fard et al. (2019) collect tools and methods to find quality
errors in paraphrases used to train conversational agents. Barnes, Øvrelid, and Velldal
(2019) analyze the types of errors found in annotating for sentiment analysis. While
error type classification has not been explicitly addressed in the current study, such
classification requires good AED, so the results of the current study can contribute to
automating error type classification in the future.

Training with Label Noise. Related to AED is the task of training with noise: Given a
dataset that potentially contains label errors, train a model so that the performance
impact due to noisy labels is as low as possible (Song et al. 2020). The goal in this setting
is not to clean a dataset (labels are left as is), but to obtain a well-performing machine
learning model. An example application is learning directly from crowdsourced data
without adjudicating it (Rodrigues and Pereira 2018). Training with label noise and
AED have in common that they both enable models to be trained when only noisy
data is available. Evaluating these models still requires clean data, which AED can help
to produce.

3. Annotation Error Detection

In this section we introduce AED and formalize the concept, then categorize state-of-
the-art approaches according to our formalization.

3.1 Task Definition

Given an adjudicated dataset with one label per annotated instance, the goal of AED
is to find those instances that are likely labeled incorrectly or inconsistently. These
candidate instances then can be given to human annotators for manual inspection or
used in annotation error correction methods. The definition of instance depends on the
task and defines the granularity on which errors or inconsistencies are detected. In this
article, we consider AED in text classification (where instances are sentences), in token
labeling (instances are tokens; e.g., POS tagging), and in span labeling (instances are
spans; e.g., named entity recognition [NER]). AED can be and has been applied to many
domains and tasks, for example, sentiment analysis (Barnes, Øvrelid, and Velldal 2019;
Northcutt, Athalye, and Mueller 2021), relation extraction (Alt, Gabryszak, and Hennig
2020), POS tagging (Dickinson and Meurers 2003a; Loftsson 2009), image classification
(Northcutt, Athalye, and Mueller 2021), NER (Wang et al. 2019; Reiss et al. 2020), slot
filling (Larson et al. 2020), or speech classification (Northcutt, Athalye, and Mueller
2021).

We consider a label to be incorrect if there is a unique, true label that should be
assigned but it differs from the label that has been assigned. For example, there is a
named entity span Durban in CONLL-2003 which has been labeled PER, whereas in
context, it refers to a city in South Africa, so the label should be LOC.

Instances can also be ambiguous, that is, there are at least two different labels that
are valid given the context. For example, in the sentence They were visiting relatives,
visiting can either be a verb or an adjective. Ambiguous instances themselves are often
more difficult for machine learning models to learn from and predict. Choosing one
label over another is neither inherently correct nor incorrect. But ambiguous instances
can be annotated inconsistently. We consider a label inconsistent if there is more than
one potential label for an instance, but the choice of resolution was different for similar
instances. For example, in the sentence Stefan Edberg produced some of his vintage best
on Tuesday to extend his grand run at the Grand Slams by toppling Wimbledon champion
Richard Krajicek, the entity Wimbledon was annotated as LOC. But in the headline of this
article, Edberg extends Grand Slam run, topples Wimbledon champ, the entity Wimbledon
was annotated as MISC. We discuss the impact of ambiguity on AED further in § 3.1.
An instance that is neither incorrect nor inconsistent is correct. If not explicitly stated
otherwise, then we refer to both incorrect and inconsistent as incorrect or erroneous.

AED is typically used after a new dataset has been annotated and adjudicated. It
is assumed that no already cleaned data and no other data having the same annotation
scheme are available.

Flaggers vs. Scorers. We divide automatic methods for AED into two categories, which
we dub flaggers and scorers. Flagging means that methods cast a binary judgment
on whether the label for an instance is correct or incorrect. Scoring methods give an
estimate of how likely it is that an annotation is incorrect. These correspond to classification
and ranking, respectively.

While flaggers are explicit as to whether they consider an annotation to be incorrect,
they do not indicate the likelihood of that decision. On the other hand, while scorers
provide a likelihood, they require a threshold value to decide when an annotation is
considered an error—for example, those instances with a score above 80%. Those would
then be given to human evaluation. Scorers can also be used in settings similar to active
learning for error correction (Vlachos 2006).
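
The relationship between the two categories can be illustrated with a minimal sketch: the output of a scorer becomes the output of a flagger once a threshold is chosen, and it can alternatively be used to rank instances for inspection. The function names, the example scores, and the threshold of 0.8 are illustrative assumptions and are not taken from any particular AED method.

```python
from typing import List


def flag_by_threshold(scores: List[float], threshold: float = 0.8) -> List[bool]:
    """Turn scorer output into flagger output (True = considered an error)."""
    return [score > threshold for score in scores]


def rank_for_inspection(scores: List[float]) -> List[int]:
    """Return instance indices ordered from most to least suspicious."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical scores where higher means "more likely to be an annotation error".
scores = [0.10, 0.95, 0.40, 0.85]
flags = flag_by_threshold(scores)      # classification view
ranking = rank_for_inspection(scores)  # ranking view
```

The ranking view needs no threshold, which is one reason why flaggers and scorers call for different evaluation metrics.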

This distinction between flaggers and scorers regarding AED has not been made in
previous work, as typically approaches of one type or the other have been proposed
per paper. But it is key to understanding why different metrics need to be used when
evaluating flaggers compared to scorers, similarly to unranked and ranked evaluation
from information retrieval (see § 5).

Ambiguity. In certain NLP tasks, there exists more than one valid label per instance
(Kehler et al. 2007; Plank, Hovy, and Søgaard 2014b; Aroyo and Welty 2015; Pavlick
and Kwiatkowski 2019; Basile et al. 2021). While this might reduce the usefulness of
AED at first glance, gold labels are not required by AED, as it is about uncovering
problems independent of their cause and not assigning a gold label. Instances detected
this way are then marked for further processing. They can be, for example, inspected
for whether they are incorrectly or inconsistently annotated. Ambiguous or difficult
instances especially deserve additional scrutiny when creating a corpus; finding them
is therefore very useful. Once found, several alternatives are possible: (1) Ambiguous
cases can be corrected (e.g., Alt, Gabryszak, and Hennig 2020; Reiss et al. 2020); (2) they
can be removed (e.g., Jamison and Gurevych 2015); (3) their annotation guidelines can
be adjusted to reduce disagreement (e.g., Pustejovsky and Stubbs 2013); (4) the task can
eventually be redefined to use soft labels (Fornaciari et al. 2021) or used to learn from
disagreement (Paun et al. 2018). Finding such instances is hence very desirable and can
be achieved by AED. But similarly to past work on AED, we focus on detecting errors
and inconsistencies as a first step and leave evaluating ambiguity detection performance
for future work.

3.2 Survey of Existing AED Methods

Over the past three decades, several methods have been developed for AED. Here, we
group them by how they detect annotation errors and briefly describe each of them. In
this article, we focus on AED for natural language processing, but (as noted earlier in
§ 1), many of the presented methods can be and have been adjusted to different tasks
and modalities. An overview of the different methods is also given in Table 1.

3.2.1 Variation-based. Methods based on the variation principle leverage the observation
that similar surface forms are often annotated with only one or at most a few distinct
labels. If an instance is annotated with a different, rarer label, then it is possibly an
annotation error or an inconsistency. Variation-based methods are relatively easy to
implement and can be used in settings in which it is difficult to train a machine learning
model, such as low-resource scenarios or tasks that are difficult to train models for, for
example, detecting lexical semantic units (Hollenstein, Schneider, and Webber 2016).
The main disadvantage of variation-based methods is that they need similar surface
forms to perform well, which is not the case in settings like text classification or datasets
with diverse instances.

Table 1
Annotation error detection methods evaluated in this work. In most scorer methods, scorer
output and erroneous labels are positively correlated. Scorers marked with ∗ show negative
correlation.

                                              Tasks
Abbr.  Method Name                  Text  Token  Span   Proposed by

Flagger methods
CL     Confident Learning            ✓     ✓      ✓     Northcutt et al. (2021)
CS     Curriculum Spotter            ✓     ·      ·     Amiri et al. (2018)
DE     Diverse Ensemble              ✓     ✓      ✓     Loftsson (2009)
IRT    Item Response Theory          ✓     ✓      ✓     Rodriguez et al. (2021)
LA     Label Aggregation             ✓     ✓      ✓     Amiri et al. (2018)
LS     Leitner Spotter               ✓     ·      ·     Amiri et al. (2018)
PE     Projection Ensemble           ✓     ✓      ✓     Reiss et al. (2020)
RE     Retag                         ✓     ✓      ✓     van Halteren (2000)
VN     Variation N-Grams             ·     ✓      ✓     Dickinson and Meurers (2003a)

Scorer methods
BC     Borda Count                   ✓     ✓      ✓     Larson et al. (2020)
CU     Classification Uncertainty    ✓     ✓      ✓     Hendrycks and Gimpel (2017)
DM∗    Data Map Confidence           ✓     ·      ·     Swayamdipta et al. (2020)
DU     Dropout Uncertainty           ✓     ✓      ✓     Amiri et al. (2018)
KNN    k-Nearest Neighbor Entropy    ✓     ✓      ✓     Grivas et al. (2020)
LE     Label Entropy                 ·     ✓      ✓     Hollenstein et al. (2016)
MD     Mean Distance                 ✓     ✓      ✓     Larson et al. (2019)
PM∗    Prediction Margin             ✓     ✓      ✓     Dligach and Palmer (2011)
WD     Weighted Discrepancy          ·     ✓      ✓     Hollenstein et al. (2016)

Variation n-grams. The most frequently used method of this kind is variation n-grams,
which has been initially developed for POS tagging (Dickinson and Meurers 2003a)
and later extended to discontinuous constituents (Dickinson and Meurers 2005),
predicate-argument structures (Dickinson and Lee 2008), dependency parsing (Boyd,
Dickinson, and Meurers 2008), or slot filling (Larson et al. 2020). For each instance,
n-gram contexts of different sizes are collected and compared to each other. It is
considered incorrect if the label for an instance disagrees with labels from other instances
with the same n-gram context.
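
The variation principle can be sketched for token labeling as follows. This is a deliberately simplified illustration and not the algorithm of Dickinson and Meurers (2003a): it assumes a single, fixed context size instead of comparing n-grams of different sizes, and it assumes that only labels deviating from the majority label of an n-gram are flagged. The function name and the padding symbols are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, label) pairs


def variation_ngram_flags(corpus: List[Sentence], context: int = 1) -> List[List[bool]]:
    """Flag tokens whose label is a minority label for their n-gram context."""
    labels_per_ngram: Dict[Tuple[str, ...], Counter] = defaultdict(Counter)

    def ngram(tokens: List[str], i: int) -> Tuple[str, ...]:
        # Window of `context` tokens to the left and right of position i.
        padded = ["<s>"] * context + tokens + ["</s>"] * context
        return tuple(padded[i : i + 2 * context + 1])

    # First pass: count the labels of the center token for every n-gram.
    for sentence in corpus:
        tokens = [token for token, _ in sentence]
        for i, (_, label) in enumerate(sentence):
            labels_per_ngram[ngram(tokens, i)][label] += 1

    # Second pass: flag labels that disagree with the majority for that n-gram.
    flags = []
    for sentence in corpus:
        tokens = [token for token, _ in sentence]
        row = []
        for i, (_, label) in enumerate(sentence):
            counts = labels_per_ngram[ngram(tokens, i)]
            majority_label, _ = counts.most_common(1)[0]
            row.append(len(counts) > 1 and label != majority_label)
        flags.append(row)
    return flags


corpus = [
    [("in", "O"), ("Durban", "LOC"), (",", "O")],
    [("in", "O"), ("Durban", "LOC"), (",", "O")],
    [("in", "O"), ("Durban", "PER"), (",", "O")],
]
flags = variation_ngram_flags(corpus)
```

In this toy corpus, only the PER label on Durban in the third sentence is flagged, because the same trigram occurs twice with LOC.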

Label Entropy and Weighted Discrepancy. Hollenstein, Schneider, and Webber (2016) derive
metrics from the surface form and label counts that are then used as scorers. These are
the entropy over the label count distribution per surface form or the weighted difference
between most and least frequent labels. They apply their methods to find possible
annotation errors in datasets for multi-word expressions and super-sense tagging, which
are then reviewed manually for tokens that are actual errors.
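
As an illustration of the first of these two metrics, the following sketch computes the entropy of the label distribution per surface form. The function name and the choice of base-2 logarithms are assumptions made for this example. Weighted discrepancy is left out because its exact weighting is not specified in this section.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def label_entropy(annotations: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Entropy (in bits) of the label distribution of each surface form."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for surface_form, label in annotations:
        counts[surface_form][label] += 1

    entropies = {}
    for surface_form, label_counts in counts.items():
        total = sum(label_counts.values())
        entropies[surface_form] = -sum(
            (n / total) * math.log2(n / total) for n in label_counts.values()
        )
    return entropies


annotations = [("Japan", "LOC"), ("Japan", "ORG"), ("Durban", "LOC"), ("Durban", "LOC")]
entropies = label_entropy(annotations)
```

A surface form that always receives the same label has an entropy of zero, whereas a form whose labels are split evenly receives the highest score and would be inspected first.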

3.2.2 Model-based. Probabilistic classifiers trained on the to-be-corrected dataset can be
used to find annotation errors. Models in this context are usually trained via
cross-validation (CV) and the respective holdout set is used to detect errors. After all folds
have been used as holdout, the complete dataset is analyzed. Because some methods
described below directly use model probabilities, it is of interest whether these
accurately describe the belief of the model. This is not always true, as models often
are overconfident (Guo et al. 2017). Therefore, we will evaluate whether calibration,
that is, tuning probabilities so that they are closer to the observed accuracy, can improve
performance (see § 5.2). Several ways have been devised for model-based AED, which
are described below. Note that most model-based methods are agnostic to the task itself
and rely only on model predictions and confidences. This is why they can easily be used
with different tasks and modalities.
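
The cross-validation procedure described above can be sketched as follows. The sketch assumes a hypothetical `train` callable that returns a prediction function, uses a simple round-robin fold assignment without shuffling, and substitutes a toy model that always predicts the label prior for a real classifier. None of these names come from an existing library.

```python
from collections import Counter
from typing import Callable, List, Sequence

Predictor = Callable[[str], List[float]]  # instance -> class probabilities
Trainer = Callable[[Sequence[str], Sequence[int]], Predictor]


def out_of_fold_probabilities(
    instances: Sequence[str],
    labels: Sequence[int],
    train: Trainer,
    num_folds: int = 5,
) -> List[List[float]]:
    """Predict every instance with a model that did not see it during training."""
    n = len(instances)
    probabilities: List[List[float]] = [[] for _ in range(n)]
    for fold in range(num_folds):
        holdout = [i for i in range(n) if i % num_folds == fold]
        rest = [i for i in range(n) if i % num_folds != fold]
        model = train([instances[i] for i in rest], [labels[i] for i in rest])
        for i in holdout:
            probabilities[i] = model(instances[i])
    return probabilities


def train_prior_model(train_x: Sequence[str], train_y: Sequence[int],
                      num_classes: int = 2) -> Predictor:
    """Toy stand-in for a real classifier: always predicts the label prior."""
    counts = Counter(train_y)
    prior = [counts[c] / len(train_y) for c in range(num_classes)]
    return lambda instance: list(prior)


texts = ["a", "b", "c", "d"]
noisy_labels = [0, 0, 0, 1]
probs = out_of_fold_probabilities(texts, noisy_labels, train_prior_model, num_folds=2)
```

After the loop, every instance has a probability vector produced by a model that was trained without it, so the complete dataset can be analyzed.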

Re-tagging. A simple way to use a trained model for AED is to use model predictions
directly; when the predicted labels are different from the manually assigned ones,
instances are flagged as annotation errors (van Halteren 2000). Larson et al. (2020) apply
this using a conditional random field (CRF) tagger to find errors in crowdsourced
slot-filling annotations. Similarly, Amiri, Miller, and Savova (2018) use Retag for text
classification. Yaghoub-Zadeh-Fard et al. (2019) train machine learning models to classify
whether paraphrases contain errors and if they do, what kind of error it is. To reduce
the need for annotating instances twice for higher quality, Dligach and Palmer (2011)
train a model on the labels given by an initial annotator. If the model disagrees with
the instance's labeling, then it is flagged for re-annotation. For cleaning dependency
annotations in a Hindi treebank, Ambati et al. (2011) train a logistic regression classifier.
If the model's label does not agree with the original annotation and the model
confidence is above a predefined threshold, then the annotation is considered to be incorrect.
CrossWeigh (Wang et al. 2019) is similar to Retag with repeated CV. During CV, entity
disjoint filtering is used to force more model errors: Instances are flagged as erroneous if
the probability of their having the correct label falls below the respective threshold. As
it is computationally much more expensive than Retag while being very similar, we did
not include it in our comparison.

Classification Uncertainty. Probabilistic classification models assign probabilities that are
typically higher for instances that are correctly labeled compared with erroneous ones
(Hendrycks and Gimpel 2017). Therefore, the class probabilities of the noisy labels
can be used to score these for being an annotation error. Using model uncertainty is
basically identical to using the network loss (as, e.g., used by Amiri, Miller, and Savova
2018) because the cross-entropy function used to compute the loss is monotonic. The
probability formulation, however, allows us to use calibration more easily later (see
§ 5.2), which is why we adopt the former instead of using the loss.
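
The equivalence of the two formulations can be checked with a small sketch. It assumes that the score is defined as one minus the probability of the noisy label, so that higher values are more suspicious; the function names and example probabilities are hypothetical. Because the negative logarithm is strictly decreasing in the probability, both scores induce the same ranking.

```python
import math
from typing import List, Sequence


def classification_uncertainty(probs: Sequence[Sequence[float]],
                               noisy_labels: Sequence[int]) -> List[float]:
    """Score = 1 - p(noisy label); higher means more suspicious."""
    return [1.0 - p[y] for p, y in zip(probs, noisy_labels)]


def cross_entropy_loss(probs: Sequence[Sequence[float]],
                       noisy_labels: Sequence[int]) -> List[float]:
    """Per-instance cross-entropy loss with respect to the noisy label."""
    return [-math.log(p[y]) for p, y in zip(probs, noisy_labels)]


def ranking(scores: Sequence[float]) -> List[int]:
    """Instance indices ordered from most to least suspicious."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
noisy_labels = [0, 0, 1]
by_uncertainty = ranking(classification_uncertainty(probs, noisy_labels))
by_loss = ranking(cross_entropy_loss(probs, noisy_labels))
```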

Prediction Margin. Inspired by active learning, Prediction Margin uses the probabilities
of the two highest scoring labels for an instance. The resulting score is simply their
difference (Dligach and Palmer 2011). The intuition behind this is that samples with a
smaller margin are more likely to be an annotation error, since the smaller the decision
margin is, the more unsure the model was.
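
A minimal sketch of this score is given below; the function name and the example probabilities are illustrative. Note that, unlike most scorers, a smaller value indicates a more suspicious instance here.

```python
from typing import List, Sequence


def prediction_margin(probs: Sequence[Sequence[float]]) -> List[float]:
    """Difference between the two highest class probabilities per instance."""
    margins = []
    for p in probs:
        top, second = sorted(p, reverse=True)[:2]
        margins.append(top - second)
    return margins


probs = [[0.5, 0.4, 0.1], [0.9, 0.05, 0.05]]
margins = prediction_margin(probs)  # the first instance has the smaller margin
```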

Confident Learning. This method estimates the joint distribution of noisy and true labels
(Northcutt, Jiang, and Chuang 2021). A threshold is then learned (the average self-
confidence) and instances whose computed probability of having the correct label is
below the respective threshold are flagged as erroneous.
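
The thresholding step described above can be sketched as follows. This is a simplification that follows only the description in this section: it computes the average self-confidence per class and flags instances below it, and it omits the estimation of the joint distribution of noisy and true labels that the full method performs. The function name is hypothetical.

```python
from typing import List, Sequence


def self_confidence_flags(probs: Sequence[Sequence[float]],
                          noisy_labels: Sequence[int]) -> List[bool]:
    """Flag instances whose probability for their given label is below the
    average self-confidence of that label's class."""
    num_classes = len(probs[0])
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for p, y in zip(probs, noisy_labels):
        sums[y] += p[y]
        counts[y] += 1
    # Per-class threshold: mean probability of class k over instances labeled k.
    thresholds = [sums[k] / counts[k] if counts[k] else 0.0 for k in range(num_classes)]
    return [p[y] < thresholds[y] for p, y in zip(probs, noisy_labels)]


probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.3, 0.7]]
noisy_labels = [0, 0, 0, 1]
flags = self_confidence_flags(probs, noisy_labels)
```

In the example, the third instance is labeled as class 0 although the model assigns that class a probability of only 0.1, which is below the class threshold of about 0.6.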

Dropout Uncertainty. Amiri, Miller, and Savova (2018) use Monte Carlo dropout (Gal
and Ghahramani 2016) to estimate the uncertainty of an underlying model. There are
different acquisition methods to compute uncertainty from the stochastic passes. A
summary can be found in Shelmanov et al. (2021). The work of Amiri, Miller, and
Savova (2018) uses the probability variance averaged over classes.
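
Given the class probabilities of several stochastic forward passes for one instance, the probability variance averaged over classes can be sketched as below. How the stochastic passes are obtained depends on the model and is not shown; the function name and the use of the population variance (rather than the sample variance) are assumptions made for this example.

```python
from statistics import mean, pvariance
from typing import Sequence


def dropout_uncertainty(passes: Sequence[Sequence[float]]) -> float:
    """Variance of each class probability over the passes, averaged over classes.

    `passes[t][c]` is the probability of class c in stochastic pass t.
    """
    num_classes = len(passes[0])
    return mean(pvariance([p[c] for p in passes]) for c in range(num_classes))


stable = dropout_uncertainty([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])
unstable = dropout_uncertainty([[0.9, 0.1], [0.1, 0.9]])
```

An instance whose predictions barely change between passes receives a score close to zero, whereas strongly fluctuating predictions lead to a high score.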

Label Aggregation. Given T predictions obtained via Monte Carlo dropout, Amiri, Miller,
and Savova (2018) use MACE (Hovy et al. 2013), an aggregation technique from
crowdsourcing, to adjudicate the resulting repeated predictions.

3.2.3 Training Dynamics. Methods based on training dynamics use information derived
from how a model behaves during training and how predictions change over the course
of its training.

Curriculum and Leitner Spotter. Amiri, Miller, and Savova (2018) train a model via
curriculum learning, where the network trains on easier instances during earlier epochs
and is then gradually introduced to harder instances. Instances then are ranked by how
hard they were perceived during training. They also adapt the ideas of the Zettelkasten
(Ahrens 2017) and Leitner queue networks (Leitner 1974) to model training. There,
difficult instances are presented more often during training than easier ones. The
assumption behind both of these methods is that instances that are perceived harder or
misclassified more frequently are more often annotation errors than are easier ones.
These two methods require that the instances can be scheduled independently. This
is, for example, not the case for sequence labeling, as the model trains on complete
sentences and not individual tokens or spans. Even if they have different difficulties,
they would end up in the same batch nonetheless.

Data Map Confidence. Swayamdipta et al. (2020) use the class probability for each
instance's gold label across epochs as a measure of confidence. In their experiments, low
confidence correlates well with an item having an incorrect label.
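
A minimal sketch of this confidence measure follows. It assumes that the class probabilities of every instance have been recorded after each training epoch and that they are aggregated by taking the mean over epochs; the function name and the data layout are hypothetical.

```python
from typing import List, Sequence


def data_map_confidence(probs_per_epoch: Sequence[Sequence[Sequence[float]]],
                        gold_labels: Sequence[int]) -> List[float]:
    """Mean probability of each instance's gold label across training epochs.

    `probs_per_epoch[e][i][c]` is the probability of class c for instance i
    after epoch e. Low values indicate potentially mislabeled instances.
    """
    num_epochs = len(probs_per_epoch)
    return [
        sum(epoch[i][y] for epoch in probs_per_epoch) / num_epochs
        for i, y in enumerate(gold_labels)
    ]


probs_per_epoch = [
    [[0.6, 0.4], [0.5, 0.5]],  # epoch 1
    [[0.8, 0.2], [0.7, 0.3]],  # epoch 2
]
confidence = data_map_confidence(probs_per_epoch, gold_labels=[0, 1])
```

Here the second instance keeps a low probability for its gold label throughout training and therefore receives the lower confidence.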

3.2.4 Vector Space Proximity. Approaches of this kind leverage dense embeddings of
tokens, spans, and texts into a vector space and use their distribution therein. The
distance of an instance to semantically similar instances is expected to be smaller than
the distance to semantically different ones. Embeddings are typically obtained by using
BERT-type models for tokens and spans (Devlin et al. 2019) or S-BERT for sentences
(Reimers and Gurevych 2019).

Mean Distance. Larson et al. (2019) compute the centroid of each class by averaging
vector embeddings of the respective instances. Items are then scored by the distance
between their embedding vector and their centroid. The underlying assumption is that
semantically similar items should have the same label and be close together (and
thereby close to the centroid) in the vector space. In the original publication, this method
was only evaluated on detecting errors in sentence classification datasets, but we extend
it to token and span classification as well.
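
A minimal sketch of this scoring scheme, under the assumption that Euclidean distance is used and that embeddings are given as plain lists of floats, could look as follows. The function names and the toy vectors are hypothetical.

```python
import math
from collections import defaultdict


def class_centroids(embeddings, labels):
    """Average the embedding vectors of all instances that share a label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def mean_distance_scores(embeddings, labels):
    """Score each instance by its distance to the centroid of its given label."""
    centroids = class_centroids(embeddings, labels)
    return [
        math.dist(vec, centroids[label])
        for vec, label in zip(embeddings, labels)
    ]


# Toy data: the last instance lies in the "a" cluster but is labeled "b".
embeddings = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [0.5, 0.5]]
labels = ["a", "a", "a", "b", "b", "b"]
scores = mean_distance_scores(embeddings, labels)
most_suspicious = max(range(len(scores)), key=lambda i: scores[i])
```

Note that the mislabeled instance also pulls the centroid of its given class towards itself, which is one reason why such scores can be less sharp on real data than in this toy setting.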

k-Nearest-Neighbor Entropy. In the context of NER in clinical reports, Grivas et al. (2020)
leverage the work of Khandelwal et al. (2020) regarding nearest-neighbor language
models to find mislabeled named entities. First, all instances are embedded into a
vector space. Then, the k nearest neighbors of each instance according to their Euclidean
distance are retrieved. Their distances to the instance embedding vector are then used to
compute a distribution over labels by applying softmax. An instance’s score is then the
entropy of its distance distribution; if it is large, it indicates uncertainty, hinting at being
mislabeled. Grivas et al. (2020) only used this method qualitatively; we have turned
their qualitative approach into a method that can be used to score instances automatically
and evaluated it on detecting errors in both NER and sentence classification, the
latter using S-BERT embeddings. This method was only evaluated on detecting errors
in NER datasets, but we apply it to sentence classification as well by using S-BERT
embeddings.
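
The scoring step can be sketched as follows. This is only one possible reading of the description above: we assume that the softmax is taken over negative distances (so that closer neighbors receive more weight) and that the weights of neighbors sharing a label are summed before the entropy is computed. Function and variable names are hypothetical.

```python
import math
from collections import defaultdict


def knn_entropy(index, embeddings, labels, k=3):
    """Entropy of the label distribution induced by the k nearest neighbors."""
    query = embeddings[index]
    # Distances to all other instances, closest first.
    neighbours = sorted(
        (
            (math.dist(query, vec), labels[j])
            for j, vec in enumerate(embeddings)
            if j != index
        ),
        key=lambda pair: pair[0],
    )[:k]
    # Softmax over negative distances (assumption), shifted for stability.
    max_logit = max(-dist for dist, _ in neighbours)
    weights = [(math.exp(-dist - max_logit), label) for dist, label in neighbours]
    total = sum(weight for weight, _ in weights)
    label_probs = defaultdict(float)
    for weight, label in weights:
        label_probs[label] += weight / total
    return -sum(p * math.log(p) for p in label_probs.values() if p > 0)


# Toy data: two clusters and one instance halfway between them.
embeddings = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [2.5, 2.5]]
labels = ["PER", "PER", "PER", "LOC", "LOC", "LOC", "PER"]
entropy_inside_cluster = knn_entropy(0, embeddings, labels, k=2)
entropy_between_clusters = knn_entropy(6, embeddings, labels, k=4)
```

An instance whose neighbors all carry the same label obtains an entropy of zero, whereas the instance between the two clusters obtains a clearly positive score.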

3.2.5 Ensembling. Ensemble methods combine the scores or predictions of several indi-
vidual flaggers or scorers to obtain better performance than the sum of their parts.

Diverse Ensemble. Instead of using a single prediction like Retag does, the predictions
of several, architecturally different models are aggregated. If most of them disagree on
the label for an instance, then it is likely to be an annotation error. Alt, Gabryszak, and
Hennig (2020) use an ensemble of 49 different models to find annotation errors in the
TACRED relation extraction corpus. In their setup, instances are ranked by how often
a model suggests a label different from the original one. Barnes, Øvrelid, and Velldal
(2019) use three models to analyze error types on several sentiment analysis datasets;
they flag instances for which all models disagree with the gold label. Loftsson (2009)
and Angle, Mishra, and Sharma (2018) use an ensemble of different taggers to correct
POS tags.
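
Both variants described above, ranking by the number of disagreeing models and flagging by majority disagreement, can be sketched in a few lines. The function names and toy predictions are hypothetical, and the strict-majority threshold is an assumption of this sketch.

```python
def disagreement_counts(given_labels, predictions_per_model):
    """For each instance, count how many models predict a different label."""
    return [
        sum(model_preds[i] != given for model_preds in predictions_per_model)
        for i, given in enumerate(given_labels)
    ]


def flag_by_majority(given_labels, predictions_per_model):
    """Flag an instance if more than half of the models disagree with its label."""
    n_models = len(predictions_per_model)
    counts = disagreement_counts(given_labels, predictions_per_model)
    return [count > n_models / 2 for count in counts]


# Toy data: three models, three instances.
given = ["pos", "neg", "pos"]
predictions = [
    ["pos", "pos", "pos"],
    ["pos", "pos", "neg"],
    ["pos", "neg", "pos"],
]
counts = disagreement_counts(given, predictions)
flags = flag_by_majority(given, predictions)
```

The counts can be used directly as scores for ranking, while the Boolean flags correspond to the flagger variant.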

Projection Ensemble. In order to correct the CONLL-2003 named entity corpus, Reiss
et al. (2020) train 17 logistic regression models on different Gaussian projections of
BERT embeddings. The aggregated predictions that disagree with the dataset were then
corrected by hand.

Item Response Theory. Lord, Novick, and Birnbaum (1968) developed Item Response Theory
as a mathematical framework to model relationships between measured responses of
test subjects (e.g., answers to questions in an exam) for an underlying, latent trait (e.g.,
the overall grasp on the subject that is tested). It can also be used to estimate the
discriminative power of an item, namely, how well the response to a question can be
used to distinguish between subjects of different ability. In the context of AED, test
subjects are trained models, the observations are the predictions on the dataset, and
the latent trait is task performance. Rodriguez et al. (2021) have shown that items
that negatively discriminate (i.e., where a better response indicates being less skilled)
correlate with annotation errors.

Borda Count. Similarly to combining several flaggers into an ensemble, rankings ob-
tained from different scorers can be combined as well. For that, Dwork et al. (2001)
propose leveraging Borda counts, a voting scheme that assigns points based on their
ranking. For each scorer, given scores for N instances, the instance that is ranked the
highest is given N points, the second-highest N − 1, and so on (Szpiro 2010). The points
assigned by different scorers are then summed up for each instance and form the
aggregated ranking. Larson et al. (2019) use this to combine scores for runs of Mean
Distance with different embeddings and show that this improves overall performance
compared to only using individual scores.
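
The aggregation can be sketched as follows. The function names are hypothetical, and ties within one scorer are broken by instance order, which is an assumption of this sketch.

```python
def borda_points(scores):
    """Assign N points to the highest-scored instance, N - 1 to the next, etc."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    points = [0] * n
    for rank, i in enumerate(order):
        points[i] = n - rank
    return points


def borda_count(score_lists):
    """Sum the Borda points that each instance receives from every scorer."""
    totals = [0] * len(score_lists[0])
    for scores in score_lists:
        for i, points in enumerate(borda_points(scores)):
            totals[i] += points
    return totals


# Toy data: two scorers, four instances.
scorer_a = [0.9, 0.1, 0.5, 0.3]
scorer_b = [0.4, 0.2, 0.8, 0.6]
totals = borda_count([scorer_a, scorer_b])
aggregated_ranking = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
```

Because only ranks are used, scorers with differently scaled outputs can be combined without any normalization.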

3.2.6 Rule-based. Several studies leverage rules that describe which annotations are valid
and which are not. For example, to find errors in POS annotated corpora, Květoň and
Oliva (2002) developed a set of conditions that tags have to fulfill in order to be valid,
especially n-grams that are impossible based on the underlying lexical or morphological
information of their respective surface forms. Rule-based approaches for AED can be
very effective but are hand-tailored to the respective dataset, its domain, language, and
task. Our focus in this article is to evaluate generally applicable methods that can be
used for many different tasks and settings. Therefore, we do not discuss rule-based
methods further in the current work.

4. Datasets and Tasks

In order to compare the performance of AED methods on a large scale, we need datasets
with parallel gold and noisy labels. But even with previous work on correcting noisy
corpora, such datasets are hard to find.

We consider three kinds of approaches to obtain datasets that can be used for
evaluating AED. First, existing datasets can be used whose labels are then randomly
perturbed. Second, there exist adjudicated gold corpora for which the annotations
of single annotators exist. Noisy labels are then the unadjudicated annotations. These
kinds of corpora are mainly obtained from crowdsourcing experiments. Third, there
are manually corrected corpora for which both the clean and noisy parts have been made
public. Because only a few such datasets are available for AED, we have derived several
datasets of the first two types from existing corpora.

When injecting random noise we use flipped label noise (Zheng, Awadallah, and
Dumais 2021) with a noise level of 5%, which is in a similar range to error rates in
previously examined datasets like PENN TREEBANK (Dickinson and Meurers 2003b) or
CONLL-2003 (Reiss et al. 2020). In our setting, for a random subset of 5% of the instances,
this kind of noise uniformly assigns a different label from the tagset without taking the
original label into account. While randomly injecting noise is simple and can be applied
to any existing gold corpus, errors in these datasets are often easy to spot (Larson et al.
2019). This is because errors typically made by human annotators vary with the actual
label, which is not true for random noise (Hedderich, Zhu, and Klakow 2021). Note that
evaluating AED methods does not require knowing true labels: All that is required are
potentially noisy labels and whether or not they are erroneous. It is only correction that
needs true labels as well as noisy ones.
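
The noise injection described above can be sketched as follows. This is a minimal illustration under our reading of the procedure; the function name, the seed handling, and the toy tagset are hypothetical.

```python
import random


def inject_flipped_label_noise(labels, tagset, noise_level=0.05, seed=0):
    """Replace the labels of a random subset with a different, uniformly drawn label.

    Returns the noisy labels and, per instance, whether its label was changed.
    """
    rng = random.Random(seed)
    n_noisy = round(noise_level * len(labels))
    noisy_indices = set(rng.sample(range(len(labels)), n_noisy))
    noisy_labels = list(labels)
    for i in sorted(noisy_indices):
        # Every label except the current one is equally likely.
        alternatives = [tag for tag in tagset if tag != labels[i]]
        noisy_labels[i] = rng.choice(alternatives)
    is_error = [i in noisy_indices for i in range(len(labels))]
    return noisy_labels, is_error


# Toy data: 100 instances over a four-label tagset.
clean_labels = ["A", "B", "C", "D"] * 25
noisy_labels, is_error = inject_flipped_label_noise(
    clean_labels, ["A", "B", "C", "D"], noise_level=0.05, seed=13
)
```

The Boolean list is all that is needed for evaluating detection; the clean labels are only required when evaluating correction.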

As noted earlier, we will address AED in three broad NLP tasks: text classification,
token labeling, and span labeling. These have been the tasks most frequently evaluated
in AED and to which the majority of methods can be applied. Also, these tasks have
many different machine learning models available to solve them. This is crucial for
evaluating calibration (§ 5.2) and assessing whether well-performing models lead to
better task performance for model-based methods (§ 5.3). To foster reproducibility
and to obtain representative results, we then choose datasets that fulfill the following
requirements: (1) they are available openly and free of charge, (2) they are for common
and different NLP tasks, (3) they come from different domains, and (4) they have high
inter-annotator agreement and very few annotation errors. Based on these criteria, we
select 9 datasets. They are listed in Table 2 and are described in the following section.
We manually inspected and carefully analyzed the corpora to verify that the given gold
labels are of very high quality.

4.1 Text Classification

The goal of text classification is to assign a predefined category to a given text sequence
(here, a sentence, paragraph, or document). Example applications are news categorization,
sentiment analysis, or intent detection. For text classification, each individual
sentence or document is considered its own instance.

Table 2
Dataset statistics. We report the number of instances |I| and annotations |A| as well as the
number of mislabeled ones (|Iε| and |Aε|), their percentage, as well as the number of classes |C|.
For token and span labeling datasets, |A| counts the number of annotated tokens and spans,
respectively. Kind indicates whether the noisy part was created by randomly corrupting labels
(R), or by aggregation (A) from individual annotations like crowdsourcing, or whether the gold
labels stem from manual correction (M). Errors for span labeling are calculated via exact span
match. Source points to the work that introduced the dataset for use in AED if it was created via
manual correction and to the work proposing the initial dataset for aggregation or randomly
perturbed ones.

Name              |I|   |Iε|  |Iε|/|I| %      |A|   |Aε|  |Aε|/|A| %  |C|  Kind  Source

Text classification
ATIS            4,978    238        4.78    4,978    238        4.78   22  R     Hemphill et al. (1990)
IMDb           24,799    499        2.01   24,799    499        2.01    2  M     Northcutt et al. (2021)
SST             8,544    420        4.92    8,544    420        4.92    2  R     Socher et al. (2013)

Token labeling
GUM             7,397  3,920       52.99  137,605  6,835        4.97   18  R     Zeldes (2017)
Plank             500    373       74.60    7,876    931       11.82   13  A     Plank et al. (2014a)

Span labeling
CoNLL-2003      3,380    217        6.42    5,505    262        4.76    5  M     Reiss et al. (2020)
SI Companies      500    224       44.80    1,365    325       23.81   11  M     Larson et al. (2020)
SI Flights        500     43        8.60    1,196     49        4.10    7  M     Larson et al. (2020)
SI Forex          520     63       12.12    1,263     98        7.76    4  M     Larson et al. (2020)
je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi
/
c
o

je
je
/

je

un
r
t
je
c
e

p
d

F
/

/

/

/

4
9
1
1
5
7
2
0
6
8
9
8
0
/
c
o

je
je

_
un
_
0
0
4
6
4
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

ATIS contains transcripts of user interactions with travel inquiry systems, annotated
with intents and slots. For AED on intent classification, we have randomly per-
turbed the labels.

IMDb contains movie reviews labeled with sentiment. Northcutt, Athalye, and Mueller
(2021) discovered that it contains a non-negligible amount of annotation errors.
They applied Confident Learning to the test set and let crowdworkers check
whether the flags were genuine.

SST The STANFORD SENTIMENT TREEBANK is a dataset for sentiment analysis of movie
reviews from Rotten Tomatoes. We use it for binary sentiment classification and
randomly perturb the labels.

4.2 Token Labeling

The task of token labeling is to assign a label to each token. The most common task in
this category is POS tagging. As there are not many other tasks with easily obtainable
datasets, we only use two different POS tagging datasets. For token labeling, each
individual token is considered an instance.

GUM The GEORGETOWN UNIVERSITY MULTILAYER CORPUS is an open source corpus
annotated with several layers from the Universal Dependencies project (Nivre
et al. 2020). It has been collected by linguistics students at Georgetown University
as part of their course work. Here, the original labels have been perturbed with
random noise.

Plank POS contains Twitter posts that were annotated by Gimpel et al. (2011). Plank,
Hovy, and Søgaard (2014a) mapped their labels to Universal POS tags and had
500 tweets reannotated by two new annotators. We flag an instance as erroneous
if its two annotations disagree.

4.3 Span Labeling

Span labeling assigns labels not to single tokens, but to spans of text. Common tasks that
can be modeled that way are NER, slot filling, or chunking. In this work, we assume that
spans have already been identified, focusing only on finding label errors and leaving
detecting boundary errors and related issues for future work. We use the following
datasets:

CoNLL-2003 is a widely used dataset for NER (Tjong Kim Sang and De Meulder 2003).
It consists of news wire articles from the Reuters Corpus annotated by experts.
Reiss et al. (2020) discovered several annotation errors in the English portion of
the dataset. They developed Projection Ensembles and then manually corrected
the instances flagged by it. While errors concerning tokenization and sentence
splitting were also corrected, we ignore them here as being out of scope of the
current study. Therefore, we report slightly fewer instances and errors overall in
Table 2. Wang et al. (2019) also corrected errors in CONLL-2003 and named the
resulting corpus CONLL++. As they only re-annotated the test set and found fewer
errors, we use the corrected version of Reiss et al. (2020).

Slot Inconsistencies is a dataset that was created by Larson et al. (2020) to investi-
gate and classify errors in slot filling annotations. It contains documents of three
domains (COMPANIES, FOREX, FLIGHTS) that were annotated via crowdsourcing.
Errors were then manually corrected by experts.

Span labeling is typically indicated using Begin-Inside-Out (BIO) tags.2 When labeling
a span as X, tokens outside the span are labeled O, the token at the beginning of the span
is labeled B-X, and tokens within the span are labeled I-X. Datasets for span labeling
are also usually represented in this format.

This raises the issues of (1) boundary differences and (2) split entities. First, for
model-based methods, models might predict different spans and span boundaries from
the original annotations. In many evaluation datasets, boundary issues were also
corrected and therefore boundaries for the same span in the clean and noisy data can be
different, which makes evaluation difficult. Second, for scorers it does not make much
sense to order BIO tagged tokens independently of their neighbors or to alter only parts
of a sequence to a different label. This can lead to corrections that split entities, which is
often undesirable. Therefore, directly using BIO tags as the granularity of detection and
correction for span labeling is problematic.

Thus, we suggest converting the BIO tagged sequences back to a span representation
consisting of begin, end, and label. This first step solves the issue of entities
potentially being torn apart by detection and correction. Spans from the original data
and from the model predictions then need to be aligned for evaluation in order to reduce
boundary issues. This is depicted in Figure 1.
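
The conversion from BIO tags to a span representation can be sketched as follows. This is a minimal illustration; the function name is hypothetical, end offsets are exclusive, and treating an I-X tag without a preceding span of the same label as the start of a new span is an assumption of this sketch.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (begin, end, label) spans, end exclusive."""
    spans = []
    begin, label = None, None
    for i, tag in enumerate(tags):
        # A span starts at B-X, or at an I-X that does not continue the open span.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if begin is not None and (tag == "O" or starts_new):
            spans.append((begin, i, label))
            begin, label = None, None
        if starts_new:
            begin, label = i, tag[2:]
    if begin is not None:
        spans.append((begin, len(tags), label))
    return spans


# "Harvard University is located in the middle of Boston , Massachusetts ."
tags = ["B-ORG", "I-ORG", "O", "O", "O", "O", "O", "O", "B-LOC", "O", "B-LOC", "O"]
spans = bio_to_spans(tags)
```

Detection and correction can then operate on whole spans, so that an entity is never relabeled only in part.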

2 For simplicity, we describe the BIO tagging format. There are more advanced schemas like BIOES, but
our resulting task-specific evaluation is independent of the actual schema used.

Figure 1
Alignment between original or corrected spans A and noisy or predicted spans B. The goal is to
find an alignment that maximizes overlap. Spans that are in A but find no match in B are given a
match with the same offsets but a special, unique label that is different from all other labels (e.g.,
Massachusetts). Spans that are in B but find no match in A are dropped (e.g., located). Spans from
A that have no overlapping span in B are considered different and cannot be aligned (e.g., Boston
in A and Massachusetts in B). Span colors here indicate their labels.

We require a good alignment to (1) maximize overlap between aligned spans so that
the most likely spans are aligned, (2) be deterministic, (3) not use additional information
like probabilities, and (4) not align spans that have no overlap to avoid aligning things
that should not be aligned. If these properties are not given, then the alignment and
resulting confidences or representations that are computed based on this can be subpar.
This kind of alignment is related to evaluation, for example, for NER in the style of
MUC-5 (Chinchor and Sundheim 1993), especially for partial matching. Their alignment
does not, however, satisfy (1) and (3) in the case of multiple predictions overlapping
with a single gold entity. For example, if the gold entity is New York City and the
system predicted York and New York, then in most implementations, the first prediction
is chosen and other predictions that also could match are discarded. What prediction
is first depends on the order of predictions which is non-deterministic. This also does
not choose the optimal alignment with maximum span overlap, which requires a more
involved approach.

We thus adopt the following alignment procedure: Given a sequence of tokens, a set
of original spans A and predicted/noisy spans B, align both sets of spans and thereby
allow certain leeway of boundaries. The goal is to find an assignment that maximizes
overlap of spans in A and B; only spans of A that overlap in at least one token with
spans in B are considered. This can be formulated as a linear sum assignment problem:
Given two sets A, B of equal size and a function that assigns a cost to connect an element
of A with an element of B, find the assignment that minimizes the overall cost (Burkard,
Dell’Amico, and Martello 2012). It can happen that not all elements of A are assigned a
match in B and vice versa—we assign a special label that indicates missing alignment in
the first case and drop spans of B that have no overlap in A. For the latter, it is possible
to also assign a special label to indicate that a gold entity is missing; in this work, we
focus on correcting labels only and hence leave using this information to detect missing
spans for future work.
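
To make the procedure concrete, the following sketch solves the assignment problem by exhaustive search over permutations, which is only feasible for a handful of spans per sentence. It is not the implementation used in this work; for realistic sizes, a Hungarian-style solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the natural choice. Maximizing the total overlap is equivalent to minimizing its negation as cost. The names `align_spans` and `NO_MATCH` are hypothetical.

```python
from itertools import permutations

NO_MATCH = "<NO_MATCH>"  # hypothetical special label for unaligned spans of A


def overlap(a, b):
    """Number of tokens shared by two (begin, end, label) spans, end exclusive."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def align_spans(spans_a, spans_b):
    """Align spans of A to spans of B so that the total token overlap is maximal.

    Returns one (span_from_a, aligned_label) pair per span in A. Spans of B
    that remain unmatched are dropped.
    """
    n = max(len(spans_a), len(spans_b))

    def gain(i, j):
        # Padding entries (beyond either list) contribute no overlap.
        if i < len(spans_a) and j < len(spans_b):
            return overlap(spans_a[i], spans_b[j])
        return 0

    best_perm, best_total = tuple(range(n)), -1
    # Permutations are enumerated in a fixed order, so ties are broken
    # deterministically in favor of the first optimum found.
    for perm in permutations(range(n)):
        total = sum(gain(i, j) for i, j in enumerate(perm))
        if total > best_total:
            best_perm, best_total = perm, total

    aligned = []
    for i, span in enumerate(spans_a):
        j = best_perm[i]
        label = spans_b[j][2] if gain(i, j) > 0 else NO_MATCH
        aligned.append((span, label))
    return aligned


# Gold "New York City" versus the predictions "York" and "New York".
gold = [(0, 3, "LOC")]
predicted = [(1, 2, "PER"), (0, 2, "LOC")]
aligned = align_spans(gold, predicted)

# A span of A without any overlapping span in B receives the special label.
aligned_missing = align_spans(
    [(0, 2, "ORG"), (10, 11, "LOC")], [(0, 2, "ORG"), (3, 4, "PER")]
)
```

In the first example, the prediction with the larger overlap is chosen regardless of the order in which the predictions are listed.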

We are not aware of previous work that proposes a certain methodology for this.
While Larson et al. (2020) evaluate AED on slot filling, it is not clear on which granular-
ity they measure detection performance or whether and how they align. To the best of
our knowledge, we are the first to propose this span alignment approach for span-level
AED. Span alignment requires aggregating token probabilities into span probabilities,
which is described in § 1.1. This alignment approach can also be extended to other tasks
like object classification or matching boxes for optical character recognition. In that case,
the metric to optimize is the Jaccard index.

5. Experiments

In this section we first define the general evaluation setup, metrics to be used, and
the models that are leveraged for model-based AED. Details on how each method was
implemented for this work can be found in Appendix A. In § 5.1 through § 5.4, we then
describe our results for the experiments we perform to answer the research questions
raised in § 1.

Metrics. As described in § 3.1, we differentiate between two kinds of annotation
error detectors, flaggers and scorers. These need different metrics during evaluation,
similar to unranked and ranked evaluation from information retrieval (Manning,
Raghavan, and Schütze 2008). Flagging is a binary classification task. Therefore, we use
the standard metrics for this task, which are precision, recall, and F1. We also record the
percentage of instances flagged (Larson et al. 2020). Scoring produces a ranking, as in
information retrieval. We use average precision3 (AP), Precision@10%, and Recall@10%,
similarly to Amiri, Miller, and Savova (2018) and Larson et al. (2019). There are reasons
why both precision and recall can be considered the more important metric of the two.
A low precision leads to increased cost because many more instances than necessary
need to be inspected manually after detection. Similarly, a low recall leads to problems
because there still can be errors left after the application of AED. As both arguments
have merit, we will mainly use the aggregated metrics F1 and AP. Precision and recall at
10% evaluate a scenario in which a scorer was applied and the first 10% with the highest
score—most likely to be incorrectly annotated—are manually corrected. We use the
PYTREC-EVAL toolkit to compute these ranking metrics.4 Recall relies on knowing the
exact number of correctly and incorrectly annotated instances. While this information
may be available when developing and evaluating AED methods, it is generally not
available when actually applying AED to clean real data. One solution to computing
recall then is to have experts carefully annotate a subset of the data and then use it to
estimate recall overall.
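
For illustration, the ranking metrics can be written out in plain Python as below. This sketch does not use the PYTREC-EVAL toolkit; the function names are hypothetical, and rounding the cutoff to the nearest integer (with a minimum of one instance) is an assumption.

```python
def ranked(scores):
    """Instance indices sorted by decreasing score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def average_precision(scores, is_error):
    """Mean of the precision values at the rank of each true error."""
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(ranked(scores), start=1):
        if is_error[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def precision_recall_at(scores, is_error, fraction=0.1):
    """Precision and recall when inspecting the top `fraction` of the ranking."""
    cutoff = max(1, round(fraction * len(scores)))
    hits = sum(is_error[i] for i in ranked(scores)[:cutoff])
    total_errors = sum(is_error)
    recall = hits / total_errors if total_errors else 0.0
    return hits / cutoff, recall


# Toy data: ten instances, two of which are true errors.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_error = [True, False, True, False, False, False, False, False, False, False]
ap = average_precision(scores, is_error)
precision_at_10, recall_at_10 = precision_recall_at(scores, is_error, fraction=0.1)
```

In this toy example, the two errors sit at ranks 1 and 3, so AP is (1/1 + 2/3) / 2, and inspecting the top 10% (one instance) finds one of the two errors.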

In contrast to previous work, we explicitly do not use ROC AUC and discourage
its use for AED, as it heavily overestimates performance when applied to imbalanced
datasets (Davis and Goadrich 2006; Saito and Rehmsmeier 2015). Datasets needing AED
are typically very imbalanced because there are far more correct labels than incorrect
ones.

Models. We use multiple different neural and non-neural model types per task for
model-based AED. These are used to investigate the relationship between model and
method performances, whether model calibration can improve method performances
and for creating diverse ensembles.

For text classification we use seven different models: logistic regression as well as
gradient boosting machines (Ke et al. 2017) with either bag-of-word or S-BERT features
(Reimers and Gurevych 2019), transformer based on DistilRoBERTa (Sanh et al. 2019),

3 Also known as Area Under the Precision-Recall Curve (AUPR/AUPRC). In AED, AP is also identical to
mean average precision (mAP) used in other works.

4 https://github.com/cvangysel/pytrec_eval.

BiLSTM based on Flair (Akbik et al. 2019), and FastText (Joulin et al. 2017). For S-BERT,
we use all-mpnet-base-v2 as the underlying model, as it has been shown by its
creators to produce sentence embeddings of the highest quality overall.

For token and span labeling, we use four different models: CRFs with the hand-
crafted features as proposed by Gimpel et al. (2011), BiLSTM + CRF based on Flair
(Akbik et al. 2019), transformers with CRF based on DistilRoBERTa (Sanh et al. 2019),
and logistic regression (also called maximum entropy model). For the initialization of
Flair-based models we use a combination of GloVe (Pennington, Socher, and Manning
2014) as well as Byte-Pair Encoding embeddings (Heinzerling and Strube 2018) and a
hidden layer size of 256 for both text classification and sequence labeling. Note that
we do not perform extensive hyperparameter tuning for model selection because when
using AED in practice, no annotated in-domain data can be held out for tuning since all
data must be checked for errors. Also, when comparing models as we do here, it would
be prohibitively expensive to carry out hyperparameter tuning across all datasets and
model combinations. Instead, we use default configurations that have been shown to
work well on a wide range of tasks and datasets.

When using transformers for sequence labeling we use the probabilities of the first
subword token. We use 10-fold cross-validation to train each model and use the same
model weights for all methods evaluated on the same fold. Thereby, all methods applied
to the same fold use the predictions of the same model.

5.1 RQ1 – Which Methods Work Well across Tasks and Datasets?

We first report the scores resulting from the best setup as a reference to the upcoming
experiments. Then we describe the experiments and results that lead to this setup.
We do not apply calibration to any of the methods for the reported scores because it
only marginally improved performance (see § 5.2). For model-based methods, the best
performance for text classification and span labeling was achieved using transformers;
for token labeling, best performance was achieved using Flair (see § 5.3). Not using
cross-validation was found to substantially reduce recall for
model-based AED (see § 5.4), so we have used 10-fold cross-validation in comparing
model-based methods.

In Table 3, we present the overall performance in F1 and AP across all datasets and
tasks. Detailed results including scores for all metrics can be found in Appendix C.

First of all, it can be seen that in datasets with randomly injected noise (ATIS,
SST, and GUM), errors are easier to find than in aggregated or hand-corrected ones.
Especially in ATIS, many algorithms reach close-to-perfect scores, in particular scorers
(> 0.9 AP). We attribute this to the artificial noise injected. The more difficult datasets
usually have natural noise patterns that are often harder to detect (Amiri, Miller, and
Savova 2018; Larson et al. 2019; Hedderich, Zhu, and Klakow 2021). The three SLOT
INCONSISTENCIES datasets are also easy compared to CONLL-2003. On some datasets
with real errors—PLANK and SLOT INCONSISTENCIES—the performance of the best
methods is already quite good with F1 ≈ 0.5 and AP ≈ 0.4 for PLANK and F1, AP > 0.65
for SLOT INCONSISTENCIES.

Overall, methods that work well are Classification Uncertainty (CU), Confident Learning
(CL), Curriculum Spotter (CS), Datamap Confidence (DM), Diverse Ensemble (DE), Label
Aggregation (LA), Leitner Spotter (LS), Projection Ensemble (PE), and Retag (RE). Aggre-
gating scorer judgments via Borda Count (BC) can improve performance and deliver the
second-best AP score based on the harmonic mean. The downside here is very high total
runtime (the sum of runtimes of individual scores aggregated), as it requires training

172

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi
/
c
o

je
je
/

je

un
r
t
je
c
e

p
d

F
/

/

/

/

4
9
1
1
5
7
2
0
6
8
9
8
0
/
c
o

je
je

_
un
_
0
0
4
6
4
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Klie, Webber, and Gurevych

Annotation Error Detection

Table 3
F1 and AP for all implemented flaggers as well as scorers evaluated with the best overall
setups. We also report the harmonic mean H computed across all datasets. Label Aggregation,
Retag, Diverse Ensemble, and Borda Count perform especially well across tasks and datasets.
Datasets created via injecting random noise (ATIS, SST, and GUM) are comparatively easier to
detect errors in.

Task        Text    Text    Text   Token   Token    Span    Span    Span    Span
Method      ATIS    IMDb     SST     GUM   Plank   Comp.   CoNLL Flights   Forex       H

Flagger (F1)
CL          0.35    0.33    0.34    0.80    0.37    0.50    0.24    0.42    0.57    0.39
DE          0.72    0.30    0.33    0.74    0.48    0.57    0.28    0.55    0.64    0.45
IRT         0.00    0.01    0.02    0.00    0.12    0.41    0.29    0.02    0.62    0.00
LA          0.83    0.33    0.35    0.68    0.49    0.59    0.30    0.66    0.70    0.48
PE          0.54    0.18    0.34    0.58    0.50    0.56    0.25    0.29    0.56    0.36
RE          0.81    0.33    0.34    0.69    0.49    0.64    0.32    0.67    0.70    0.49
VN             ·       ·       ·    0.55    0.30    0.11    0.02    0.29    0.14    0.08

Scorer (AP)
BC          0.98    0.35    0.50    0.92    0.38    0.68    0.14    0.49    0.54    0.41
CS          0.97    0.29    0.21       ·       ·       ·       ·       ·       ·    0.33
CU          0.87    0.28    0.27    0.98    0.42    0.70    0.17    0.68    0.70    0.41
DM          0.98    0.25    0.49    0.95    0.27    0.66    0.14    0.35    0.61    0.36
DU          0.05    0.06    0.05    0.05    0.24    0.43    0.07    0.18    0.32    0.08
KNN         0.13    0.05    0.11    0.21    0.31    0.61    0.12    0.07    0.16    0.12
LE             ·       ·       ·    0.60    0.22    0.41    0.19    0.10    0.11    0.18
LS          0.91    0.31    0.46       ·       ·       ·       ·       ·       ·    0.46
MD          0.14    0.03    0.08    0.12    0.16    0.54    0.06    0.07    0.14    0.08
PM          0.06    0.05    0.05    0.05    0.23    0.54    0.06    0.12    0.25    0.08
WD             ·       ·       ·    0.53    0.39    0.45    0.16    0.11    0.14    0.20

instances of all scorers beforehand, which already perform very well (HAP of Borda
Count is 0.41 and the best individual scorer has HAP of 0.46). While aggregating scores
requires well performing scorers (3 in our setup, see § 1.2) it is more stable across tasks
than using individual methods on their own. Most model-based methods (Classification
Uncertainty, Confident Learning, Diverse Ensemble, Label Aggregation, Retag) perform very
well overall, but methods based on training dynamics that do not need cross-validation
(Curriculum Spotter, Datamap Confidence, Leitner Spotter) are on par or better. In particular,
Datamap Confidence shows a very solid performance and can keep up with the closely
related Classification Uncertainty, sometimes even outperforming it while not needing
CV. Confident Learning specifically has high precision for token and span labeling.
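
The harmonic mean H reported in Table 3 can be recomputed from the per-dataset scores, as sketched below for the F1 values of Confident Learning and Item Response Theory (in the column order ATIS, IMDb, SST, GUM, Plank, Companies, CoNLL, Flights, Forex). Returning zero as soon as one score is zero is an assumption of this sketch, chosen because it matches the reported H of 0.00 for Item Response Theory. Since the table entries are rounded, recomputed values for other rows may deviate in the second decimal.

```python
def harmonic_mean(values):
    """Harmonic mean; defined as 0.0 here if any value is zero (assumption)."""
    if any(value == 0 for value in values):
        return 0.0
    return len(values) / sum(1.0 / value for value in values)


# F1 per dataset as listed in Table 3.
cl_f1 = [0.35, 0.33, 0.34, 0.80, 0.37, 0.50, 0.24, 0.42, 0.57]
irt_f1 = [0.00, 0.01, 0.02, 0.00, 0.12, 0.41, 0.29, 0.02, 0.62]

h_cl = harmonic_mean(cl_f1)
h_irt = harmonic_mean(irt_f1)
```

Compared with the arithmetic mean, the harmonic mean is dominated by the weakest datasets, so a method only obtains a high H if it performs reasonably everywhere.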

Amiri, Miller, and Savova (2018) argue that prediction loss is not enough to detect
incorrect instances because easy ones still can have a large loss. Therefore, more intricate
methods like Leitner Spotter and Curriculum Spotter are needed. We do not observe a
large difference between Classification Uncertainty and the two, though. Datamap Confidence,
as a more complicated sibling of Classification Uncertainty, however, outperforms these
from time to time, indicating that training dynamics offers an advantage over simply
using class probabilities.

Variation n-grams (VN) has high precision and tends to be conservative in flagging
items, that is, it exhibits few false positives, especially for span classification. Weighted
Discrepancy works overall better than Label Entropy, but both methods almost always
perform worse than more intricate ones. When manually analyzing their scores, ils

173

Computational Linguistics

Volume 49, Nombre 1

mostly assign a score of 0.0 and rarely a different score (less than 10% from our ob-
servation, often even lower). This is because there are only very few instances with
both surface form overlap and different labels. While the scores for Prediction Margin
appear poor, the original paper (Dligach and Palmer 2011) reports a similarly
low performance while their implementation of Retag reaches scores that are around
two times higher (10% vs. 23% precision and 38% vs. 60% recall). This is similar to
our observations. One potential reason why Classification Uncertainty produces better
results than the related Prediction Margin is that the latter does not take the given label
into account; it always uses the difference between the two most probable classes.
Using a formulation of Projection Ensemble that uses the label did not improve results
significantly, though.
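
The difference between the two quantities can be illustrated with a small sketch. We assume here that Classification Uncertainty scores an instance by one minus the probability of its given label, which is our reading rather than a quoted definition; the function names and the toy distribution are hypothetical.

```python
def classification_uncertainty(probs, given_label):
    """One minus the probability of the given label (assumed formulation)."""
    return 1.0 - probs[given_label]


def prediction_margin(probs):
    """Difference between the two most probable classes; ignores the given label."""
    top_two = sorted(probs.values(), reverse=True)[:2]
    return top_two[0] - top_two[1]


# The model is confident that the token is a NOUN.
probs = {"NOUN": 0.95, "VERB": 0.04, "ADJ": 0.01}

cu_if_noun = classification_uncertainty(probs, "NOUN")
cu_if_verb = classification_uncertainty(probs, "VERB")
margin = prediction_margin(probs)
```

The margin is the same whether the instance is annotated as NOUN or as VERB, whereas the uncertainty of the given label separates the two cases clearly.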

Methods based on vector proximity, namely k-Nearest Neighbor Entropy (KNN) and Mean
Distance (MD), perform sub-par across tasks and datasets. We attribute this to issues
in distance calculation for high-dimensional data, as noted for instance by Cui Zhu,
Kitagawa, and Faloutsos (2005) in a related setting. In high-dimensional vector spaces,
everything can appear equidistant (curse of dimensionality). Another performance-
relevant issue is the embedding quality. In Grivas et al. (2020), KNN is used with
domain-specific embeddings for biomedical texts. These could have potentially im-
proved performance in their setting, but they do not report quantitative results, though,
which makes a comparison difficult. With regard to Mean Distance, we only achieve
H = 0.08. On real data for intent classification, Larson et al. (2019) achieve an average
precision of around 0.35. They report high recall and good average precision on datasets
with random labels but do not report precision on its own. Their datasets contain mainly
paraphrased intents, which makes it potentially easier to achieve good performance.
This is similar to how AED applied on our randomly perturbed ATIS dataset resulted
in high detection scores. Code and data used in their original publication are no longer
available. We were therefore not able to reproduce their reported performances with our
implementation and on our data.
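The concentration of distances in high dimensions can be made concrete with a small simulation. This is only an illustrative sketch on uniformly random points; the function name, the point count, and the dimensions are assumptions and are unrelated to the embeddings used in our experiments.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


low_dim = relative_contrast(dim=2)
high_dim = relative_contrast(dim=1000)
```

With many dimensions, the nearest and the farthest point are almost equally far from the query, so the relative contrast shrinks and neighborhood-based scores become less informative.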

Item Response Theory (IRT) does not perform well across datasets and tends to
overly flag instances. Therefore, it is preferable to use the model predictions in a
Diverse Ensemble, which yields much better performance. IRT is also relatively slow for
larger corpora as it is optimized via variational inference and needs many iterations to
converge. Our hypothesis is that Item Response Theory needs more subjects (in our case
models) to better estimate discriminability. Compared to our very few subjects (seven
for text classification and four for token and span labeling), Rodriguez et al. (2021) used
predictions of the SQuAD leaderboard with 161 development and 115 test subjects. To
validate this hypothesis, we rerun Item Response Theory on the unaggregated predictions
of Projection Ensemble. While this leads to slightly better performance, it still does not
work as well as using predictions in Diverse Ensemble or Projection Ensemble directly. As it
is often infeasible to have that many models providing predictions, we see Item Response
Theory as useful only in very specific scenarios.

Regarding Dropout Uncertainty, after extensive debugging with different models,
datasets, and formulations of the method, we were not able to achieve results
comparable to those of the other AED methods evaluated in this work. On real data, Amiri,
Miller, and Savova (2018) also report relatively low performances similar to ours. Our
implementation delivers results similar to Shelmanov et al. (2021) on misclassification
detection. In their paper, the reported scores appear to be very high. But we consider
their reported scores an overestimate, as they use ROC AUC (which is overconfident
for imbalanced datasets) and not AP to evaluate their experiments. Even when applying
the method on debug datasets with the most advantageous conditions that are solvable
by other methods with perfect scores, Dropout Uncertainty only achieves AP values of
around 0.2. The main reason we see for the overall low scores for Dropout Uncertainty
is that the different repeated prediction probabilities are highly correlated and do not
differ much overall. This is similar to the observations of Shelmanov et al. (2021).

Qualitative Analysis. To better understand for which kinds of errors methods work well
or fail, we manually analyze the instances in CONLL-2003. It is our dataset of choice
for three reasons: (1) span labeling datasets potentially contain many different errors,
(2) it is annotated and corrected by humans, and (3) it is quite difficult for AED to find
errors in it, based on our previous evaluation (see Table 3). For spans whose noisy labels
disagree with the correction, we annotate them as either being inconsistent, a true error,
an incorrect correction, or a hallucinated entity. Descriptions and examples for each type
of error are given in the following.

True errors are labels that are unambiguously incorrect, for example, in the sentence
NATO military chiefs to visit Iberia, the entity Iberia was annotated as ORG but
should be LOC, as it refers to the Iberian peninsula.

Inconsistencies are instances that were assigned different labels in similar contexts. In
CONLL-2003, these are mostly from sports teams that were sometimes annotated
as LOC and sometimes as ORG.

Incorrect correction In very few cases, the correction introduced a new error, for example,
United Nations was incorrectly corrected from ORG to LOC.

Hallucinated entities are spans that were labeled to contain an entity, but they should
not have been annotated at all. For example, in the sentence Returns on treasuries
were also in negative territory, treasuries was annotated as MISC but does not contain
a named entity. Sometimes, entities that should consist of one span were annotated
originally as two entities. This results in one unmatched entity after alignment. We
consider this a hallucinated entity as well.

CONLL-2003 was corrected manually by Reiss et al. (2020). After aligning (see § 4.3),
we find that there are in total 293 errors. We group them by difficulty based on how
often methods were able to detect them. For scorers, we consider the instances with
the highest 10% scores as flagged, similarly to how we evaluate precision and recall.
For span labeling, we implemented a total of 16 methods. The errors detected at least
by half of the methods (50%) are considered easy, the ones detected by at least four
methods (25%) are considered medium, and the rest, hard (25%). This results in 50 easy,
78 medium, and 165 hard instances. The distribution of error types across difficulty
levels is visualized in Figure 2. It can be seen that true errors are easier to detect
than inconsistencies by a significant margin: The easy partition consists of only 50%
inconsistencies, whereas the hard partition consists of around 75% inconsistent
instances. This can be intuitively explained by the fact that the inconsistencies are not
rare, but make up a large fraction of all corrections. It is therefore difficult for a method
to learn that they should be flagged when it is only given noisy labels.
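The grouping by difficulty can be sketched as follows. The helper names and the toy scores are hypothetical; only the top-10% rule for scorers and the thresholds (at least half of the 16 methods for easy, at least four for medium, otherwise hard) are taken from the description above.

```python
def flag_top_fraction(scores, fraction=0.10):
    """Treat the highest-scoring fraction of instances as flagged (for scorers)."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def difficulty(num_detecting, num_methods=16):
    """Bucket an error by the number of methods that detected it."""
    if num_detecting >= num_methods / 2:
        return "easy"
    if num_detecting >= 4:
        return "medium"
    return "hard"


scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.4, 0.2, 0.1, 0.3, 0.2]  # hypothetical scorer output
flagged = flag_top_fraction(scores)
levels = [difficulty(n) for n in (12, 8, 5, 4, 3, 0)]
```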

We further analyze how each method can deal with the different types of errors
across difficulty levels. The percentage of correctly detected errors per type and method
is depicted in Table 4. It again can be seen that true errors are easier for methods to detect
than inconsistencies; inconsistencies of hard difficulty were almost never detected.
Interestingly, scorers that are not model-based (k-Nearest Neighbor Entropy (KNN), Label
Entropy (LE), and Weighted Discrepancy (WD)) are able to better detect inconsistencies of


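Figure 2
Error counts per difficulty level in CONLL-2003. It can be seen that the number of inconsistencies
increases with the difficulty. This indicates that “real” annotation errors are easier to detect than
inconsistencies.
[Plot: number of errors n per difficulty level (easy, medium, hard), by error type (Error,
Inconsistent, Hallucinated entity, Incorrect correction).]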

Table 4
Percent of errors and inconsistencies detected on CONLL-2003 across methods and difficulty for
flaggers and scorers, grouped by error types. It can be seen that real errors (E) are more
often detected than inconsistencies (I). Some methods not relying on models (KNN, LE, WD) are
sometimes better in spotting inconsistencies than errors, whereas for model-based methods it is
the opposite. Note that errors concerning incorrect corrections (IC) and hallucinated entities
(HE) are quite rare and not reliable to draw conclusions from.

                  Flagger                            Scorer
Difficulty Error   CL   DE  IRT   LA   PE   RE   VN   BC   CU   DM   DU  KNN   LE   MD   WD   PM
Easy       E       66   96  100  100  100  100    3   92  100   66   25   40   29   14   29   25
           I       72  100  100   88   94   94   11   94  100   55   27   61   55   11   55   27
           HE      25  100  100  100  100  100    0  100  100  100  100    0    0   50    0  100
           IC       0  100  100  100  100  100    0  100  100  100    0    0    0  100    0    0
Medium     E       40   51   54   88   82   82    0   31   94   31   34   22    0   11    0   25
           I       13   26   31   26   52   31    0   15   36   23   21   44   63   15   63   13
           HE       0   75   75   25   75  100    0   75   50   75   25    0    0    0    0   50
           IC     100  100  100  100  100  100    0    0  100    0    0    0    0    0    0    0
Hard       E        0   17   13    0   65    0    0    0    0   17    0   17    0   17    0   13
           I        0    0    0    0    9    0    0    3    0    5    4   14    4    1    4    4
           HE       0   13   13    0   53    6    0    0    0    0    6    0    0    6    0    0
           IC       0   20   20   20   20    0    0    0    0    0   20    0    0   20    0    0

medium and sometimes hard difficulty but fail at detecting most errors. We explain this
for KNN by the fact that it relies on semantic vector space embeddings that do not rely
on the noisy label but on the semantics of its surface form in its context. As neighbors
in the space have the same meaning, it is possible to detect errors even though the label
is inconsistent in many cases. The same can be said about WD and LE, which rely only
on the surface form and how often it is annotated differently. If the correct label is the
majority, then it can still detect inconsistencies even if they are quite frequent. But both
still do not perform as well as other methods on easier instances; they only find 50%
of errors and inconsistencies whereas Classification Uncertainty or Retag detects almost
all (but again fail to find inconsistencies on medium difficulty). Variation n-grams (VN),
however, do not work well even for easy cases because they rely on contexts around
annotations that need to match exactly, which is very rare in this dataset.

To summarize, the methods that worked best overall across tasks and datasets are
Borda Count (BC), Diverse Ensemble (DE), Label Aggregation (LA), and Retag (RE).
Inconsistencies appear to be more difficult to detect for most methods, especially for
model-based ones. Methods that do not rely on the noisy labels like k-Nearest Neighbor
Entropy, Label Entropy, and Weighted Discrepancy were better in finding inconsistencies
on more difficult instances when manually analyzing CONLL-2003.

5.2 RQ2 – Does Model Calibration Improve Model-based Method Performance?

Several model-based AED methods, for example, Classification Uncertainty, directly
leverage probability estimates provided by a machine learning model (§ 3.2.2). Therefore,
it is of interest whether models output class probability distributions that are
accurate. For example, if a model makes predictions for 100 instances and states 80%
confidence for each of them, then the accuracy should be around 0.8. If this is the case
for a model, then it is called calibrated. Previous studies have shown that models are
often not calibrated very well, especially neural networks (Guo et al. 2017). To alleviate
this issue, a number of calibration algorithms have been developed. The most common
approaches are post hoc, which means that they are applied after the model has already
been trained.
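The notion of calibration described above can be checked numerically. The following sketch computes a binned expected calibration error; the choice of this particular measure, the function name, and the number of bins are assumptions for illustration and need not match the calibration error used in Appendix B.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# 100 predictions, each made with 80% confidence.
confidences = [0.8] * 100
ece_calibrated = expected_calibration_error(confidences, [True] * 80 + [False] * 20)
ece_overconfident = expected_calibration_error(confidences, [True] * 50 + [False] * 50)
```

When 80 of the 100 predictions are correct, confidence and accuracy agree and the error is essentially zero; when only 50 are correct, the model is overconfident by 0.3.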

Probabilities that are an under- or overestimate can lead to non-optimal AED results.
The question arises whether model-based AED methods can benefit from calibration
and, if so, to what extent. We are only aware of one study mentioning calibration
in the context of annotation error detection. Northcutt, Jiang, and Chuang (2021) claim
that their approach does not require calibration to work well, but they did not evaluate
it in detail. We only evaluate whether calibration helps for approaches that directly use
probabilities and can leverage CV, as calibration needs to be trained on a holdout set.
This, for example, excludes Curriculum Spotter, Leitner Spotter, and Datamap Confidence.
Methods that can benefit are Confident Learning, Classification Uncertainty, Dropout
Uncertainty, and Prediction Margin.

There are two groups of approaches for post hoc calibration: parametric (e.g., Platt
Scaling/Logistic Calibration [Platt 1999] or Temperature Scaling [Guo et al. 2017]) or
non-parametric (e.g., Histogram Binning [Zadrozny and Elkan 2001], Isotonic Regression
[Zadrozny and Elkan 2002], or Bayesian Binning into Quantiles [Naeini, Cooper, and
Hauskrecht 2015]). On a holdout corpus we evaluate several calibration methods to
determine which calibration method to use (see Appendix B). As a result, we apply the
best one (Platt Scaling) for all experiments that leverage calibration.

Calibration is normally trained on a holdout set. As we already perform cross-validation,
we use the holdout set both for training the calibration and for predicting
annotation errors. While this would not be optimal if we were interested in generalizing
calibrated probabilities to unseen data, we are more interested in downstream task
performance. Using an additional fold per round would be theoretically more sound. But
our preliminary experiments show that it has the issue of reducing the available training
data and thereby hurts the error detection performance more than the calibration helps.
Using the same fold for both calibration and applying AED, however, improves overall
task performance, which is what matters in our special task setting. We do not leak the
values for the downstream task (whether an instance is labeled incorrectly or not) but
only the labels for the primary task.
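A minimal sketch of this setup for the binary case is given below: a Platt-style sigmoid is fitted on the scores and given labels of the held-out fold and then applied to that same fold. The helper names, the toy scores, and the plain gradient-descent fit (without the target smoothing of Platt 1999) are assumptions made for illustration, not the procedure of our implementation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(scores, labels, a=1.0, b=0.0):
    """Mean negative log-likelihood of the labels under p = sigmoid(a * score + b)."""
    eps = 1e-12
    total = 0.0
    for s, y in zip(scores, labels):
        p = min(max(sigmoid(a * s + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)


def fit_platt(scores, labels, lr=0.05, steps=5000):
    """Fit the two parameters a and b by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = sum((sigmoid(a * s + b) - y) * s for s, y in zip(scores, labels)) / n
        grad_b = sum(sigmoid(a * s + b) - y for s, y in zip(scores, labels)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Hypothetical held-out fold: raw model scores (logits) and the given labels.
fold_scores = [4.0, 3.5, 5.0, 4.5, -4.0, -3.0, 5.0, -5.0, 4.0, -4.5]
fold_labels = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

a, b = fit_platt(fold_scores, fold_labels)
# The calibrator is applied to the very fold it was fitted on.
calibrated = [sigmoid(a * s + b) for s in fold_scores]
```

Only the given labels of the primary task enter the fit; whether a label is actually erroneous is never used.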

To evaluate whether calibration helps model-based methods that leverage probabilities,
we train models with cross-validation and then evaluate each applicable method
with and without calibration. The same model and therefore the same initial probabilities
are used for both. We measure the relative and total improvement in F1 (for flaggers)
and AP (for scorers), which are our main metrics. The results are depicted in Figure 3.
It can be seen that calibration has the potential of improving the performance of certain
methods by quite a large margin. For Confident Learning, the absolute gain is up to 3
percentage points (pp) F1 on text classification, 5 pp for token labeling, and up to 10
pp for span labeling. On the latter two tasks, though, there are also many cases with
performance reductions. A similar pattern can be seen for Classification Uncertainty with
up to 2 pp, no impact, and up to 8 pp, respectively. Dropout Uncertainty and Prediction
Margin do not perform well to begin with. But after calibration, they gain 5 to 10 pp AP,
especially for span and in some instances for token labeling. On median, calibration
does not hurt the overall performance in most cases.

In order to check whether the improvement using calibration is statistically significant,
we also use statistical testing. We choose the Wilcoxon signed-rank test (Wilcoxon
1945) because the data is not normally distributed, an assumption required by the more
powerful paired t-test. The alternative hypothesis is that calibration improves method
performance, resulting in a one-sided test.
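The one-sided test can be sketched with an exact enumeration of the null distribution, which is feasible for small numbers of paired results. The function name and the toy scores are hypothetical; zero differences are dropped and tied absolute differences receive average ranks, which is one common convention and an assumption here.

```python
from itertools import product


def wilcoxon_one_sided(after, before):
    """Exact one-sided Wilcoxon signed-rank test; alternative: after > before."""
    diffs = [a - b for a, b in zip(after, before) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis, each rank is positive or negative with probability 1/2.
    at_least_as_extreme = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return w_plus, at_least_as_extreme / 2 ** n


# Hypothetical F1 scores (in percent) of one method on five settings.
with_calibration = [52, 61, 47, 70, 66]
without_calibration = [50, 55, 48, 60, 63]
w_plus, p_value = wilcoxon_one_sided(with_calibration, without_calibration)
```

With only five pairs, even four improvements out of five yield p = 0.0625, which illustrates how few paired results limit what such a test can show.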

We do not perform a multiple-comparison correction as each experiment works
on different data. The p-values can be seen in Table 5. We can see that calibration
can improve performance significantly overall in two task and method combinations
(text classification + Confident Learning and span labeling + Classification Uncertainty).
For text classification and token labeling, the absolute gain is relatively small. For span
labeling, Classification Uncertainty benefits the most. The gains for Dropout Uncertainty
and Prediction Margin appear large, but these methods do not perform well in the first
place. Thus, our conclusion is that calibration can help model-based AED performance
but it is very task- and dataset-specific.

We do not see a clear tendency for which models or datasets benefit the most from
calibration. More investigation is needed regarding which calibrator works best for
which model and task. We chose the one that reduces the calibration error the most,
which is not necessarily the best choice for each setting.

5.3 RQ3 – To What Extent Are Model and Detection Performance Correlated?

Several AED methods directly use model predictions or probabilities to detect potential
annotation errors. This raises the question of how model performance impacts AED
performance. Reiss et al. (2020) state that they deliberately use simpler models to
find more potential errors in CONLL-2003 and therefore developed Projection Ensemble,
an ensemble of logistic regression classifiers that use BERT embeddings reduced by
different Gaussian projections. Their motivation is to obtain a diverse collection of
predictions to have disagreements. They conjecture that using very well-performing
models might be detrimental to AED performance as their predictions potentially
would not differ that much from the noisy labels because the models have learned to
predict the noise. In contrast to that, Barnes, Øvrelid, and Velldal (2019) use state-of-the-art
models to find annotation errors in different sentiment datasets. But neither Reiss et al.
(2020) nor Barnes, Øvrelid, and Velldal (2019) directly evaluate AED performance;
rather, they use AED to clean noisy datasets for which the gold labels are unknown.
Therefore, the question of how much model and detection performance are correlated
has not yet been thoroughly evaluated.
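One way to quantify such a relationship is a rank correlation between model scores and detection scores, for instance Kendall's τ, which is used later in this section. The sketch below implements the τ-b variant in plain Python; the variant, the function name, and the toy numbers are assumptions for illustration.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b: rank correlation with a correction for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = (x[i] > x[j]) - (x[i] < x[j])
            dy = (y[i] > y[j]) - (y[i] < y[j])
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
            elif dx == 0 and dy != 0:
                tied_x_only += 1
            elif dy == 0 and dx != 0:
                tied_y_only += 1
    denominator = math.sqrt(
        (concordant + discordant + tied_x_only)
        * (concordant + discordant + tied_y_only)
    )
    return (concordant - discordant) / denominator if denominator else float("nan")


# Hypothetical scores of four models and of one AED method using each model's outputs.
model_f1 = [0.70, 0.80, 0.85, 0.90]
aed_ap = [0.20, 0.25, 0.24, 0.40]
tau = kendall_tau_b(model_f1, aed_ap)
```

Here five of the six model pairs are ordered the same way by both scores and one pair is reversed, giving τ = (5 − 1) / 6.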

Figure 3
Relative and total improvement of model-based AED methods over different corpora, methods,
and models when calibrating probabilities, shown for (a) text classification, (b) token
labeling, and (c) span labeling. It can be seen that calibration can lead to good
improvements, while on median mostly not hurting performance. This plot is best viewed in the
electronic version of this paper. Not displayed are extreme (positive) outlier points.
[Plots per panel: relative gain (% Gain) and total gain for CL, CU, DU, and PM, broken down by
model and corpus.]

Table 5
p-values for the Wilcoxon signed-rank test. We check whether calibration improves AED
performance on a statistically significant level. Values marked with * are significant with
p < 0.05.

Method                            Text     Token    Span
Confident Learning (CL)           0.021*   0.230    0.665
Classification Uncertainty (CU)   0.121    0.320    0.003*
Dropout Uncertainty (DU)          0.750    0.188    0.320
Prediction Margin (PM)            0.064    0.273    0.628

For answering this question, we leverage the fact that we implemented several models
of varying performance for each task. We use two complementary approaches to analyze
this question. First, we measure the correlation between model and task performances
for the overall score, precision, and recall. Then, we analyze which models lead to the
best AED performance. Throughout this section, scores for flaggers and scorers are
coalesced; overall score corresponds to F1 and AP, precision to precision and
precision@10%, recall to recall and recall@10%. For reference, model performances are
shown in Figure C.1. We choose micro aggregation for measuring model performance, as
we are interested in the overall scores and not the scores per class. Using macro
aggregation yields qualitatively similar but less significant results.

Correlation. In order to determine whether there exists a positive or negative
relationship between model and method performances, we compute Kendall’s τ coefficient
(Kendall 1938) for each method and dataset. The results are depicted in Table 6. We see
that when the test is significant with p < 0.05, then there is almost always a moderate to
strong monotonic relationship.5 τ is zero or positive for classification and token labeling,
hinting that there is either no relationship or a positive one. For span labeling we observe
negative correlation for precision and overall. It is significant in one case only. One issue
with this test is its statistical power. In our setting, it is quite low due to the few samples
available per method and task.
It is therefore likely that the null hypothesis (in our case, the assumption that there is
no relationship between model and method performances) is not rejected even if it should
have been. Hence, we next perform additional analysis to see which models overall lead to
the best method performances.

5 |τ| > 0.07 indicates a weak, |τ| > 0.21 a moderate, |τ| > 0.35 indicates a strong monotonic relation.

Table 6
Kendall’s τ coefficient grouped by task and method measured across datasets. For p, the null
hypothesis is τ = 0 and the alternative hypothesis is τ ≠ 0. Significant p-values with p < 0.05
are marked with *. Positive and negative correlations are indicated by the sign of τ.

                Overall            Precision          Recall
Task   Method   τ        p         τ        p         τ        p
Text   CL       +0.495   0.002*    +0.657   0.000*    −0.373   0.018*
       CU       +0.486   0.002*    +0.396   0.013*    +0.705   0.000*
       DU       +0.333   0.602     +0.333   0.602     +0.333   0.602
       LA       +0.333   0.602     +1.000   0.117     +0.333   0.602
       PM       +0.143   0.365     +0.211   0.184     +0.230   0.147
       RE       +0.571   0.000*    +0.600   0.000*    +0.412   0.011*
Token  CL       +0.929   0.001*    +0.929   0.001*    +0.214   0.458
       CU       +0.714   0.013*    +0.786   0.006*    +0.714   0.013*
       DU       −0.333   0.497     −0.333   0.497     +0.333   0.497
       LA       +1.000   0.042*    +0.667   0.174     +0.667   0.174
       PM       +0.000   1.000     +0.000   1.000     +0.000   1.000
       RE       +0.857   0.003*    +0.714   0.013*    +0.357   0.216
Span   CL       −0.033   0.857     −0.183   0.322     +0.101   0.588
       CU       +0.017   0.928     −0.250   0.177     +0.300   0.105
       DU       −0.429   0.138     −0.429   0.138     +0.286   0.322
       LA       −0.357   0.216     −0.643   0.026*    +0.143   0.621
       PM       −0.317   0.087     −0.319   0.086     +0.202   0.279
       RE       −0.167   0.368     −0.217   0.242     −0.109   0.558

Which Models Lead to the Best Method Performances. In order to further analyze the
relationship between model and AED performances, we look at which model leads to the
best performance on a given dataset. In Figure 4 we show the results differentiated by
overall, precision, and recall scores. We observe that in most cases, the best or
second best models lead to the best method performances. It is especially clear for token
labeling, where using Flair leads to the best performance in all cases if we look at the
overall and precision score. Interestingly, Flair has better performance than transformers
for span labeling but the latter is preferred by most methods. Flair only leads to best
method performances for most of CONLL-2003 and parts of FLIGHTS. Besides the fact
that better models on average lead to better AED performance, we do not see a consistent
pattern that certain methods prefer certain models. A special case, however, is the recall
of Retag. We indeed observe, as assumed by Reiss et al. (2020), that the model with the
lowest recall often leads to the highest AED recall (see Figure 4). This is especially
pronounced for token and span labeling. For these tasks, Retag can use a low-recall model
to flag a large fraction of tokens because the model disagrees at many positions with the
input labels. This improves recall while being detrimental to precision.

To summarize, overall we see positive correlation between model and AED performances.
Using a well-performing model is a good choice for most model-based AED
approaches. Neural models perform especially well, although they are more expensive
to train. We therefore use transformers for text classification as well as span labeling and
Flair for token labeling.
Using a low-recall model for Retag leads to higher recall for token and span labeling,
as conjectured by Reiss et al. (2020). This, however, coincides with lower precision and
excessive flagging and thus more annotations need to be inspected.

Figure 4
Model-based methods and how often which model type leads to the best method performance
with respect to overall, precision, and recall score, shown for (a) text classification, (b) token
labeling, and (c) span labeling. A connection from left to right between a method and a model
indicates that using that method with outputs from that model leads to the best task
performance. The model axis is presented in descending order by model performance, aggregated
by Borda Count across datasets. The color of the connection indicates the chosen model (for
instance, Flair or Transformer). This figure is best viewed in color.

5.4 RQ4 – What Performance Impact Does Using (or not Using) Cross-validation Have?

Model-based AED approaches are typically used together with CV (e.g., Amiri, Miller,
and Savova 2018, Larson et al. 2020, Reiss et al. 2020). Northcutt, Jiang, and Chuang
(2021) explicitly state that Confident Learning should only be applied to out-of-sample
predicted probabilities. Amiri, Miller, and Savova (2018) do not mention that they used
CV for Dropout Uncertainty, Label Aggregation, or Classification Uncertainty.

When using AED with CV, models are trained on k − 1 splits and then detection is
done on the remaining k-th set. After all unique folds are processed, all instances are
checked. CV is often used in supervised learning where the goal is to find a model
configuration as well as hyperparameters that generalize on unseen data. The goal of
AED, however, is to find errors in the data at hand. Resulting models are just an
instrument and not used afterwards. They therefore will not be applied to unseen data and
need not generalize to data other than the one to clean. Hence, the question arises
whether CV is really necessary for AED, which has not been analyzed as of yet. Not using
CV has the advantage of being much faster and using less energy, since using CV
increases training time linearly with the number of folds. In the typical setup with
10-fold CV, this means an increase of training time by 10×.

To answer this question we train a single model on all instances and then predict on
the very same data. Then we use the resulting outputs to rerun methods that used CV
before, which are Classification Uncertainty, Confident Learning, Diverse Ensemble, Dropout
Uncertainty, Item Response Theory, Label Aggregation, Prediction Margin, Projection Ensemble,
and Retag. The results are listed in Table 7.

Table 7
Performance delta of model-based methods when training models with and without CV. Negative
values indicate that not using CV performs worse than using it, positive values the opposite.
It can be seen that overall recall is strongly impacted when not using CV but precision can
improve. Flagger and scorer results are separated by a gap.

Columns: Text (ATIS, IMDb, SST), Token (GUM, Plank), Span (Comp., CoNLL, Flights, Forex).
Rows: ∆ Precision (CL, DE, IRT, LA, PE, RE, CU, DU, PM); ∆ Recall (CL, DE, IRT, LA, PE, RE,
CU, DU, PM); ∆ % Flagged (CL, DE, IRT, LA, PE, RE).
Values (flattened, in source order):
+0.77 −0.22 +0.27 +0.16 +0.51 +0.23 +0.26 +0.21 +0.11 +0.41 +0.04 +0.03
+0.38 +0.00 +0.02 +0.28 −0.00 +0.23 +0.06 +0.26 +0.04 +0.01 +0.01 +0.00
+0.05 +0.29 +0.17 +0.15 +0.19 +0.78 −0.03 −0.12 −0.20 −0.00 −0.05 +0.34
−0.01 +0.14 +0.12 +0.34 −0.05 +0.14 +0.05 +0.04 +0.03 +0.09 +0.06 −0.04
−0.63 −0.82 −0.07 −0.25 −0.29 −0.25 −0.41 −0.10 −0.34 +0.44 +0.37 +0.63
−0.27 −0.64 −0.83 −0.00 −0.30 +0.00 −0.00 −0.00 −0.14 +0.00 −0.32 −0.64
−0.82 −0.00 −0.35 −0.07 −0.62 −0.40 −0.00 −0.04 +0.71 −0.05 +0.12 +0.25
+0.71 −0.23 +0.12 +0.09 +0.09 +0.06 +0.02 +0.34 +0.07 +0.01 +0.14 −0.03
−0.05 +0.00 −0.17 −0.18 −0.31 +0.26 −0.18 −0.13 −0.17 −0.01 +0.00 −0.04
−0.02 −0.06 −0.19 −0.02 −0.07 −0.09 −0.05 −0.05 −0.15 −0.03 −0.10 −0.31
+0.06 −0.91 +0.03 +0.13 +0.16 +0.15 −0.10 +0.00 −0.10 −0.03 −0.06 −0.19
−0.01 −0.00 −0.01 −0.00 −0.04 −0.15 −0.08 −0.04 −0.06 −0.19 −0.02 −0.10
+0.61 −0.26 −0.27 −0.05 +0.01 −0.10 −0.13 −0.03 −0.02 −0.17 −0.30 −0.31
−0.27 −0.06 −0.30 −0.24 −0.06 −0.04 −0.02 −0.06 −0.06 −0.05 −0.03 −0.05
+0.08 +0.59 +0.03 +0.29 +0.22 +0.19 −0.08 +0.07 +0.05 −0.24 −0.39 +0.76
−0.34 +0.00 −0.44 −0.22 +0.17 +0.12 −0.01 −0.06 +0.07 −0.03 −0.08 −0.03
+0.27 +0.36 −0.41 +0.32 +0.11 +0.29 −0.29 +0.14 +0.07 −0.46 −0.37 −0.11
−0.49 −0.11 −0.55 −0.28 +0.14 +0.07 −0.06 −0.11 +0.79 −0.09 −0.05 −0.09
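The two setups can be contrasted with a toy sketch in which a memorizing 1-nearest-neighbor classifier stands in for the model and Retag flags every instance whose prediction differs from its given label. The data, the fold assignment, and all names are hypothetical and only serve to illustrate the comparison.

```python
def nearest_neighbor_predict(train, x):
    """Toy 'model': return the label of the closest training point."""
    return min(train, key=lambda item: abs(item[0] - x))[1]


def retag_flags(data, k=None):
    """Flag instances whose predicted label differs from the given label.

    With k=None a single model is trained on all data and applied to the same data
    (no CV); otherwise every instance is predicted by a model that did not see its fold.
    """
    flags = []
    for i, (x, y) in enumerate(data):
        if k is None:
            train = data
        else:
            train = [item for j, item in enumerate(data) if j % k != i % k]
        flags.append(nearest_neighbor_predict(train, x) != y)
    return flags


# Class A lies around 0.0 to 0.4, class B around 1.0 to 1.3; index 5 carries an injected error.
data = [
    (0.0, "A"), (0.1, "A"), (0.2, "A"), (0.3, "A"), (0.4, "A"),
    (0.55, "B"),
    (1.0, "B"), (1.1, "B"), (1.2, "B"), (1.3, "B"),
]
flags_without_cv = retag_flags(data)
flags_with_cv = retag_flags(data, k=5)
```

Without CV the memorizing model reproduces every given label, so nothing is flagged; with out-of-fold predictions the injected error at index 5 is flagged.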
Overall, it can be seen that not using CV massively degrades recall for model-based methods while the precision im- proves. This can be intuitively explained by the fact that if the underlying models have already seen all the data, then they overfit to it and hence can re-predict it well. Due to the positive relationship between model and method performances (see § 5.3) this is also reflected downstream; fewer instances are predicted differently than the original labels. This reduces recall and thereby the chance of making errors, thus increasing precision. This can be seen by the reduction in the percentage of flagged instances for flaggers. Interestingly, Dropout Uncertainty and Prediction Margin are not impacted as much and sometimes even improve when not using CV across all scores, especially for 183 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 9 1 1 5 7 2 0 6 8 9 8 0 / c o l i _ a _ 0 0 4 6 4 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 49, Number 1 easier datasets. Recall of Item Response Theory also improves at the cost of more flagged items and a reduction in precision. Prediction Ensemble for text classification is relatively unaffected and for token and span labeling, the performance difference is around ±0.10 pp. Therefore, it might be a good tradeoff to not use CV with this method as it is already expensive due to its ensembling. To summarize, not using CV can negatively impact performance—in particular, degrading recall. We therefore recommend the use of CV, even though it increases runtime by the number of folds (in our case, by a factor of ten). In settings where this is an issue, we recommend using methods that inherently do not need CV. These include most heuristics and well-performing approaches like Datamap Confidence, Leitner Spotter, or Curriculum Spotter. 
If precision is more important than recall, then not using CV might be taken into
consideration.

6. Takeaways and Recommendations

This article has probed several questions related to annotation error detection. Our
findings show that it is usually better to use well-performing models for model-based
methods as they yield better detection performance on average. Using a worse model for
Retag improves recall at the cost of lower precision. For detection, these models should be
trained via cross-validation, otherwise the recall of downstream methods is heavily
degraded (while the precision improves). Calibration can improve these model-based
annotation error detection methods, but more research is needed to determine when
exactly it can be useful. Some model-method combinations achieved relatively large gains
after calibration while others did not improve.

Methods that are used frequently in practice, namely Retag and Classification Uncertainty,
performed well in our experiments. Others did not perform particularly well, especially
Dropout Uncertainty, Item Response Theory, k-Nearest Neighbor Entropy, Mean Distance, and
Prediction Margin. For Mean Distance in particular, Larson et al. (2019) reported AP of
> 0.6 and recall > 0.8 on corpora with artificial noise, which we could not reproduce.
Experiments with Dropout Uncertainty reported in Amiri, Miller, and Savova (2018)
reached scores similarly high to those of Curriculum Spotter, Leitner Spotter, or
Classification Uncertainty, but we were not able to make Dropout Uncertainty reach
scores as high as the others. Label Aggregation, though, which uses the same inputs,
performs exceedingly well. For the other methods, either no scores were reported or they
were similarly low as in our experiments.

Experiments on actual corpora have shown that AED methods still have room for
improvement. While looking promising on artificial corpora, there is a large performance
drop when applying them in practice. Overall, the methods that worked best
are Classification Uncertainty, Confident Learning, Curriculum Spotter, Datamap Confidence,
Diverse Ensemble, Label Aggregation, Leitner Spotter, Projection Ensemble, and Retag. More
complicated methods are not necessarily better. For example, Classification Uncertainty
and Retag perform well across tasks and datasets while being easy to implement.
Model-based methods require k-fold cross-validation. Therefore, if runtime is a concern, then
Datamap Confidence is a good alternative. It performs well while only needing to train
one model instead of k. In case the data or its corresponding task to correct is not
suitable for machine learning, methods like Label Entropy, k-Nearest-Neighbor Entropy,
or Variation n-grams can still be applied. As the latter usually has high precision, it is
often worthwhile to apply it whenever the data is suitable for it, that is, if the data has
sufficient surface form overlap. Individual scorer scores can be aggregated via Borda
Count, but this tremendously increases runtime. While not yielding significantly better
results in our experiments, results aggregated that way were much more stable across
datasets and tasks, while individual scorers sometimes had performance drops in certain
settings.

Manual analysis of CONLL-2003 showed that finding inconsistencies is often more
difficult than finding annotation errors. While model-based methods were often quite
good at the latter, they performed poorly when detecting inconsistencies. Methods that
do not rely on the noisy labels but on the surface form or semantics, like k-Nearest
Neighbor Entropy, Label Entropy, and Weighted Discrepancy, have shown the opposite
behavior. Each type has its own strengths, and it can be worth combining both types
of methods.

7. Conclusion

Having annotated corpora with high-quality labels is imperative for many branches
of science and for the training of well-performing and generalizing models. Previous
work has shown that even commonly used benchmark corpora contain non-negligible
numbers of annotation errors. In order to assist human annotators with detecting and
correcting these errors, many different methods for annotation error detection have
been developed. To date, however, methods have not been compared, so it has been
unclear what method to choose under what circumstances. To close this gap, we surveyed
the field of annotation error detection, reimplemented 18 methods, collected
and generated 9 datasets for text classification, token labeling, and span labeling, and
evaluated method performance in different settings. Our results show that AED can
already be useful in real use cases to support data cleaning efforts. But especially for
more difficult datasets, the performance ceiling is far from reached yet.

In the past, the focus of most works researching or using AED was to clean data and
not to develop a method. The method was only a means to achieve a cleaned corpus
and not the target itself. Also, several studies proposed algorithms for different use
cases, and AED was one application that was only mentioned briefly at the end without
in-depth evaluation, rendering it unclear how well the method performs. We therefore
strongly encourage authors who introduce new AED methods to compare their method
to previous work and on the same corpora to foster reproducibility and to bring the
performance of new methods into context. This article surveys, standardizes, and answers
several fundamental questions regarding AED so that future work has a stable footing
for research. For this, we also make our implementation and datasets publicly available.

Limitations and Future Work. While we thoroughly investigated many available methods
on different datasets and tasks, there are some limitations to our work. First, the datasets
that we used were only in English. Therefore, it would be interesting to investigate
AED on different languages. One first step could be the work by Hedderich, Zhu, and
Klakow (2021), who created a corpus for NER in Estonian with natural noise patterns.
Hand-curated datasets with explicitly annotated errors are rare. We therefore also used
existing, clean datasets and injected random noise, similarly to previous works. These
datasets with artificial errors have been shown to overestimate the ability of AED
methods, but are still a good estimator for the maximal performance of methods. The
next step is to create benchmark corpora that are designed from the ground up for the
evaluation of annotation error detection. As creating these requires effort and is costly, a
cheaper way is to aggregate raw crowdsourcing data. This is often not published along
with adjudicated corpora, so we urge researchers to also publish these alongside the
final corpus.

AED was evaluated on three different tasks with nine NLP datasets. The tasks
were chosen based on the number of datasets and model types available to answer our
research questions. Most AED methods are task-agnostic; previous work, for example,
investigated question answering (Amiri, Miller, and Savova 2018) or relation classification
(Alt, Gabryszak, and Hennig 2020; Stoica, Platanios, and Poczos 2021). Thus, AED
can be and has been applied in different fields like computer vision (Northcutt, Athalye,
and Mueller 2021). But these works are plagued by the same issues that most previous
AED works have (e.g., only limited comparison to other works and quantitative analysis,
code and data not available). Having several fundamental questions answered in
this article, future work can now readily apply and investigate AED on many different
tasks, domains, and in many different settings, while leveraging our findings (which are
summarized in § 6). It would especially be interesting to evaluate and apply AED on
more hierarchical and difficult tasks, such as semantic role labeling or natural language
inference.

While we investigated many relevant research questions, these were mostly about
model-based methods as well as flaggers. To date, scorers have been treated as a black
box, so it would be worth investigating what makes a good scorer, for example, what
makes Classification Uncertainty better than Prediction Margin. Also, leveraging scorers
as uncertainty estimates for correction is a promising application, similar to the works
of Dligach and Palmer (2011) or Angle, Mishra, and Sharma (2018).

This work also only focuses on errors, inconsistencies, and ambiguities related to
instance labels. Some datasets also benefit from finding errors concerning tokenization,
sentence splitting, or missing entities (e.g., Reiss et al. 2020). We also did not investigate
the specific kinds of errors made. This can be useful information and could be leveraged
by human annotators during manual corrections. It would be especially interesting to
investigate the kinds of errors certain models and configurations were able to correct,
for example, whether using no cross-validation finds more obvious errors but with a
higher precision. We leave detection of errors other than incorrect labels or error kind
detection for future work because we did not find a generic way to do it across the wide
range of evaluated datasets and tasks used in this article.

Finally, we implemented each method as described and performed only basic
hyperparameter tuning. We did not tune them further due to the prohibitive costs for our
large-scale setup. This is especially true for the considered machine learning models,
where we kept the parameters mostly default for all regardless of the dataset and
domain. We are sure that scores can be improved for each method, but our
implementations should still serve as a lower bound. However, we do not expect large
gains from further optimization and no large shifts in ranking between the methods.

Appendix A. Hyperparameter Tuning and Implementation Details

In this section we briefly describe implementation details for the different AED methods
used throughout this article. As the tuning data we select one corpus for each task type.
For text classification we subsample 5,000 instances from the training split of AG NEWS
(Zhang, Zhao, and LeCun 2015); the number of samples is chosen as it is around the
same data size as our other datasets. As the corpus for token labeling we choose the
English part of PARTUT (Sanguinetti and Bosco 2015) and its POS annotations and
inject 5% random noise. For span labeling we use CONLL-2003 (Reiss et al. 2020), to
which we apply 5% flipped label noise.

A.1 Aggregating Probabilities and Embeddings for Span Labeling

When converting BIO-tagged sequences to spans for alignment (see § 4.3), where a span
consists only of its start and end position as well as its label, the probabilities assigned
to each BIO tag representing the span need to be aggregated. The same needs to be done for
creating span embeddings from token embeddings. As an example, consider NER for persons
and locations with a tagset of B-PER, I-PER, B-LOC, I-LOC. The tag probabilities have to be
aggregated so that spans have labels PER or LOC. Consider a span of two tokens that has
been tagged as B-PER, I-PER. Then the probability for PER needs to be aggregated from the
B-PER and I-PER tags. We evaluate on our CONLL-2003 tuning data. We use a Maxent sequence
tagger to evaluate Confident Learning with 10-fold cross-validation for this hyperparameter
selection. In addition, for k-Nearest-Neighbor Entropy we evaluate aggregation schemes to
create span embeddings from individual token embeddings. Overall, we do not observe
a large difference between max, mean, or median aggregation. The results can be seen
in Table A.1. We choose aggregating via arithmetic mean because it is slightly better in
terms of F1 and AP than the other methods.
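The B-PER, I-PER example above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function name, the dictionary-based input format, and the choice to read the B- tag at the first token and the I- tag at all later tokens are hypothetical.

```python
from statistics import mean, median

# Aggregation functions compared in Table A.1.
AGGREGATORS = {"min": min, "max": max, "mean": mean, "median": median}


def span_label_probability(token_probs, label, how="mean"):
    """Aggregate per-token BIO tag probabilities into one span-level probability.

    token_probs: one dict per token in the span, mapping BIO tags to probabilities.
    label: the span label without prefix, for example "PER".
    Assumption: the first token contributes p(B-label), later tokens p(I-label).
    """
    tags = ["B-" + label] + ["I-" + label] * (len(token_probs) - 1)
    values = [probs[tag] for probs, tag in zip(token_probs, tags)]
    return AGGREGATORS[how](values)


# Hypothetical tagger output for a two-token span tagged B-PER, I-PER.
SPAN = [
    {"B-PER": 0.8, "I-PER": 0.1, "B-LOC": 0.05, "I-LOC": 0.05},
    {"B-PER": 0.1, "I-PER": 0.6, "B-LOC": 0.1, "I-LOC": 0.2},
]
PER_MEAN = span_label_probability(SPAN, "PER", how="mean")
```

The same aggregators could be applied dimension-wise to token embeddings to obtain a span embedding.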

A.2 Method Details

In the following we describe the choices we made when implementing the various AED
methods evaluated in this article.

Diverse Ensemble Our diverse ensemble uses the predictions of all different model
types trained for the task and dataset, similarly to Loftsson (2009), Alt, Gabryszak, and
Hennig (2020), and Barnes, Øvrelid, and Velldal (2019).

Spotter and Datamap Confidence The implementations of Datamap Confidence as well
as Curriculum Spotter and Leitner Spotter require callbacks or a similar functionality to
obtain predictions for every epoch which only HuggingFace Transformers provide. That
is why we only evaluate these methods in combination with a transformer.

Dropout Uncertainty In our implementation we use mean entropy, which we observed
in preliminary experiments to perform slightly better overall than the other version
evaluated by Shelmanov et al. (2021).
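As a rough illustration, one reading of mean entropy is the average of the entropies of the class distributions obtained from several stochastic forward passes with dropout enabled. The sketch below assumes this reading; the function names and the example distributions are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(passes):
    """Average the entropies of the distributions from several dropout passes.

    passes: list of class distributions for one instance, one per forward pass.
    """
    return sum(entropy(p) for p in passes) / len(passes)


# Hypothetical output of two forward passes for a binary task.
SCORE = mean_entropy([[1.0, 0.0], [0.5, 0.5]])
```

A higher score means the passes were less certain about the instance, which a scorer would treat as a more likely annotation error.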

Table A.1
Impact of different aggregation functions for span alignment and embeddings.

               CL                      KNN
Aggregation    P      R      F1        AP     P@10%  R@10%
min            0.149  0.594  0.238     0.307  0.325  0.326
max            0.761  0.881  0.817     0.308  0.325  0.326
mean           0.766  0.878  0.818     0.318  0.331  0.333
median         0.765  0.876  0.817     0.316  0.330  0.332

Variation n-grams We follow Wisniewski (2018) for our implementation and use generalized
suffix trees to find repetitions. If there are repetitions of length more than one
in the surface forms that are tagged differently, we look up the respective tag sequence
that occurs most often in the corpus and flag the positions of all other repetitions where
they disagree with the majority tags. We convert tokens and sentences to lower case to
slightly increase recall while slightly reducing precision. We do not flag an instance if
its label is the most common label. This yields far better results as the most common
label is most often correct and should not be flagged. When using Variation n-grams for
span labeling, we use a context of one token to the left and right of the span, similarly
to Larson et al. (2020).
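The flagging rule described above can be sketched in a simplified form. The sketch is an assumption-laden stand-in, not the implementation used in the article: it enumerates n-grams of one fixed length instead of using generalized suffix trees, and the function name and input format are hypothetical.

```python
from collections import Counter, defaultdict


def variation_ngram_flags(sentences, n=2):
    """Flag token positions whose tag disagrees with the majority tagging
    of a repeated surface n-gram.

    sentences: list of sentences, each a list of (token, tag) pairs.
    Returns a set of (sentence_index, token_index) positions.
    """
    # Lowercased surface n-gram -> list of (sentence index, start, tag tuple).
    occurrences = defaultdict(list)
    for s_idx, sent in enumerate(sentences):
        for start in range(len(sent) - n + 1):
            window = sent[start:start + n]
            surface = tuple(tok.lower() for tok, _ in window)
            tags = tuple(tag for _, tag in window)
            occurrences[surface].append((s_idx, start, tags))

    flagged = set()
    for occs in occurrences.values():
        counts = Counter(tags for _, _, tags in occs)
        if len(counts) < 2:
            continue  # the n-gram is always tagged the same way
        majority = counts.most_common(1)[0][0]
        for s_idx, start, tags in occs:
            for offset, (tag, maj) in enumerate(zip(tags, majority)):
                if tag != maj:
                    flagged.add((s_idx, start + offset))
    return flagged


# Hypothetical corpus: the third occurrence of "New York" is tagged differently.
CORPUS = [
    [("New", "B-LOC"), ("York", "I-LOC")],
    [("new", "B-LOC"), ("york", "I-LOC")],
    [("New", "B-LOC"), ("York", "O")],
]
FLAGS = variation_ngram_flags(CORPUS, n=2)
```

Occurrences that carry the majority tag sequence are never flagged, which mirrors the rule of not flagging the most common label.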

Projection Ensemble In our implementation we flag an instance if the majority label of
the ensemble disagrees with the given one.
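The majority rule above is simple enough to sketch directly. The function names and the example predictions are hypothetical, and ties are broken in favor of the label seen first, which is an assumption of this sketch.

```python
from collections import Counter


def majority_label(votes):
    """Most frequent label among the votes (first seen wins on ties)."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_flags(predictions_per_model, given_labels):
    """Flag an instance if the ensemble's majority label differs from its given label.

    predictions_per_model: one list of predicted labels per ensemble member.
    given_labels: the (possibly noisy) labels from the dataset.
    """
    flags = []
    for i, given in enumerate(given_labels):
        votes = [preds[i] for preds in predictions_per_model]
        flags.append(majority_label(votes) != given)
    return flags


# Hypothetical predictions of three ensemble members for three instances.
PREDICTIONS = [["A", "B", "A"], ["A", "B", "B"], ["B", "B", "A"]]
FLAGS = ensemble_flags(PREDICTIONS, ["A", "A", "A"])
```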

Label Aggregation In the original work that evaluated using Label Aggregation for AED
(Amiri, Miller, and Savova 2018), MACE (Hovy et al. 2013) was used. We use Dawid-Skene
(Dawid and Skene 1979), which has similar performance to MACE (Paun et al.
2018) but many more available implementations (we use Ustalov et al. (2021)). The
difference between the two (MACE modeling annotator spam) is not relevant here.

Mean Distance We compare different embedding methods and metrics for Mean Distance.
For that we use the Sentence Transformers6 library and evaluate S-BERT embeddings
(Reimers and Gurevych 2019), Universal Sentence Encoder (Cer et al. 2018), and
average GloVe embeddings (Pennington, Socher, and Manning 2014). We evaluate on
our AG NEWS tuning data. As our Universal Sentence Encoder implementation we use
distiluse-base-multilingual-cased-v1 from Sentence Transformers. The Universal
Sentence Encoder embeddings as used in the original implementation of Mean Distance
(Larson et al. 2019) overall do not perform better than the S-BERT embeddings. lof refers
to Local Outlier Factor, a clustering metric proposed by Breunig et al. (2000). Using
all-mpnet-base-v2 together with Euclidean distance works best here and we use this
throughout our experiments.

Item Response Theory We use the setup from Rodriguez et al. (2021), that is, a 2P IRT
model that is optimized via variational inference and the original code of the authors.
We optimize for 10,000 iterations. Item Response Theory uses the collected predictions of
all models for the respective task, similarly to Diverse Ensemble.

Label Entropy and Weighted Discrepancy We implement equations (2) and (3) in
Hollenstein, Schneider, and Webber (2016) but assign the minimum score (meaning no
error) if the current label is the most common label. This yields far better results because
the most common label is most often correct and should not be downranked.

K-Nearest-Neighbor Entropy To evaluate which embedding aggregation over transformer
layers works best for k-Nearest-Neighbor Entropy, we evaluate several different
configurations on our PARTUT tuning data. We chose this task and not span labeling,
as span labeling requires an additional aggregation step to combine token embeddings
into span embeddings (see § A.1). The transformer of choice is RoBERTa (Liu et al. 2019),
as it has better performance than BERT while still being fast enough. We also compare
with several non-transformer embeddings: GloVe 6B (Pennington, Socher, and Manning
2014), Byte-Pair Encoding (Heinzerling and Strube 2018), and a concatenation of both.
We follow Devlin et al. (2019) regarding which configurations to try. The results can
be seen in Table A.2. The best-scoring embedder is RoBERTa using only the last layer;
this is the configuration used throughout this work for obtaining token and span
embeddings. For sentence embeddings we use all-mpnet-base-v2 (see § A.2). To
compute the KNN entropy, we use the code of the original authors.

6 https://www.sbert.net/.

Table A.2
The performance impact of using different embedding types and configurations for KNN
entropy on UD ParTUT.

Embedder                 AP     P@10%  R@10%
Last Hidden              0.265  0.255  0.256
Sum All Layers           0.237  0.254  0.254
First Hidden             0.230  0.269  0.270
Sum Last 4 Hidden        0.227  0.246  0.246
Second-to-Last Hidden    0.224  0.244  0.244
Concat Last 4 Hidden     0.149  0.193  0.194
GloVe                    0.148  0.169  0.169
GloVe + BPE              0.148  0.175  0.175
BPE                      0.147  0.170  0.170

Table A.3
Evaluation of scorers and their aggregation via Borda Count for text classification and span
labeling. Highlighted in gray are the runs of Borda Count aggregation.

(a) Text

Method   AP     P@10%  R@10%
BCtop3   0.848  0.460  0.947
BCtop2   0.824  0.456  0.938
DM       0.819  0.448  0.922
BCtop5   0.794  0.454  0.934
LS       0.706  0.448  0.922
CU       0.521  0.426  0.877
MD       0.422  0.344  0.708
CS       0.296  0.390  0.802
BCall    0.268  0.244  0.502
KNN      0.055  0.062  0.128
DU       0.055  0.062  0.128
PM       0.050  0.046  0.095

(b) Span

Method   AP     P@10%  R@10%
DM       0.963  0.932  0.934
BCtop3   0.897  0.863  0.865
CU       0.881  0.837  0.839
BCtop2   0.881  0.837  0.839
BCtop5   0.716  0.625  0.626
WD       0.665  0.632  0.633
LE       0.567  0.579  0.580
BCall    0.350  0.378  0.379
MD       0.206  0.231  0.232
DU       0.103  0.104  0.104
PM       0.102  0.104  0.104
KNN      0.100  0.101  0.101

Borda Count In order to evaluate which scores to aggregate via Borda Count we evaluate
three settings on our AG NEWS tuning data. We either aggregate the five best,
three best, or all scorer outputs. As the underlying model for model-based methods we
use transformers, because we need repeated probabilities for Dropout Uncertainty. The
results are listed in Table A.3. It can be seen that aggregating only the three best scores
leads to far superior performance. Thus, we choose this setting when evaluating Borda
Count aggregation during our experiments.
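For readers unfamiliar with Borda Count, the aggregation can be sketched as below. This is a generic rank-sum formulation written for illustration. The function name, the point scheme (0 points for the lowest-scored instance, n − 1 for the highest), and the tie handling by input order are assumptions, not details taken from the released implementation.

```python
def borda_count(score_lists):
    """Aggregate several scorers' outputs by summing rank-based points.

    score_lists: one list of scores per scorer, all over the same instances;
    higher scores mean "more likely an annotation error".
    Returns one aggregated point total per instance (higher is more suspicious).
    """
    n = len(score_lists[0])
    totals = [0] * n
    for scores in score_lists:
        # Sort instance indices from least to most suspicious for this scorer.
        order = sorted(range(n), key=lambda i: scores[i])
        for points, idx in enumerate(order):
            totals[idx] += points
    return totals


# Hypothetical outputs of two scorers on three instances.
TOTALS = borda_count([[0.9, 0.1, 0.5], [0.8, 0.3, 0.2]])
```

Because only ranks enter the sum, scorers with differently scaled outputs can be combined without normalization. Every scorer has to be run first, which is where the runtime cost of the aggregation comes from.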

Appendix B. Calibration

From the most common calibration methods we select the best method by calibrating
probabilities for models trained on our AG NEWS tuning data. We use 10-fold
cross-validation where eight parts are used for training models, one for calibrating, and one
for evaluating the calibration. The results can be seen in Figure B.1. We follow Guo et al.
(2017) and use the Expected Calibration Error (ECE) (Naeini, Cooper, and Hauskrecht
2015) as the metric for calibration quality. We decide to use one method for all task types
and finally choose Logistic Calibration (also known as Platt Scaling), which performs well
across tasks. We use the implementations of Küppers et al. (2020).
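As a reminder of what the metric measures, a common binned formulation of ECE can be sketched as follows. It is an illustrative stand-in for the library implementation used in the experiments. The function name, the ten equal-width bins, and the left-closed bin boundaries are assumptions of this sketch.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between accuracy and confidence.

    confidences: predicted probability of the chosen class for each instance.
    correct: booleans, True if the prediction matched the reference label.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


# Hypothetical model that is 90% confident but only 75% accurate.
ECE = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

Figure B.1 reports the percentage decrease of this quantity after calibration, so a calibration method that halves the ECE would score 50 there.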

Figure B.1
Percentage decrease of the Expected
Calibration Error (ECE) after calibration,
when training models on our AG NEWS tuning
data. Higher is better.

Figure B.2
Average Precision of using Mean Distance
with different embedders and similarity
metrics on our AG NEWS tuning data.

[Figure B.1 heatmap axes: models (FastText, Flair, LightGBM BERT, LightGBM TF-IDF, MaxEnt SBERT, MaxEnt TF-IDF, Transformer) by calibration method (Bayesian Binning into Quantiles, Histogram Binning, Isotonic Regression, Logistic Calibration, Temperature Scaling). Figure B.2 heatmap axes: embedders (all-mpnet-base-v2, distiluse-base-multi-cased-v1, glove.6B.300d, multi-qa-MiniLM-L6-cos-v1, multi-qa-mpnet-base-dot-v1) by metric (cosine, dot, euclidean, lof).]

Appendix C. Best Scores

(a) Text classification

(b) Token labeling

(c) Sequence labeling

Figure C.1
Model performances across tasks and datasets. The model axis is ordered in descending order by
the respective models’ overall performance via Borda Count.

[Figure C.1 heatmaps: one row of panels (Overall, Precision, Recall) per task, with models on one axis and datasets on the other (ATIS, IMDb, SST for text classification; GUM, Plank for token labeling; Comp., CoNLL, Flights, Forex for span labeling).]

(a) Text classification

Flaggers
Corpus   Method  P     R     F1    %F
ATIS     CL      0.43  0.30  0.35  0.03
         DE      0.56  1.00  0.72  0.09
         IRT     0.00  0.00  0.00  0.91
         LA      0.71  1.00  0.83  0.07
         PE      0.37  1.00  0.54  0.13
         RE      0.67  1.00  0.81  0.07
IMDb     CL      0.23  0.63  0.33  0.06
         DE      0.19  0.73  0.30  0.08
         IRT     0.01  0.30  0.01  0.93
         LA      0.22  0.64  0.33  0.06
         PE      0.10  0.62  0.18  0.12
         RE      0.22  0.64  0.33  0.06
SST      CL      0.22  0.82  0.34  0.19
         DE      0.21  0.82  0.33  0.20
         IRT     0.01  0.16  0.02  0.82
         LA      0.22  0.84  0.35  0.19
         PE      0.22  0.84  0.34  0.19
         RE      0.21  0.82  0.34  0.19

Scorers
Corpus   Method  AP    P@10%  R@10%
ATIS     BC      0.98  0.48   1.00
         CS      0.97  0.48   1.00
         CU      0.87  0.48   1.00
         DM      0.98  0.48   1.00
         DU      0.05  0.06   0.13
         KNN     0.13  0.13   0.27
         LS      0.91  0.48   1.00
         MD      0.14  0.17   0.35
         PM      0.06  0.06   0.13
IMDb     BC      0.35  0.14   0.71
         CS      0.29  0.13   0.64
         CU      0.28  0.15   0.74
         DM      0.25  0.13   0.63
         DU      0.06  0.08   0.39
         KNN     0.05  0.05   0.27
         LS      0.31  0.13   0.67
         MD      0.03  0.04   0.18
         PM      0.05  0.07   0.35
SST      BC      0.50  0.37   0.74
         CS      0.21  0.27   0.54
         CU      0.27  0.29   0.59
         DM      0.49  0.33   0.67
         DU      0.05  0.04   0.09
         KNN     0.11  0.11   0.22
         LS      0.46  0.33   0.65
         MD      0.08  0.10   0.20
         PM      0.05  0.05   0.10

(b) Span labeling

Flaggers
Corpus   Method  P     R     F1    %F
Comp.    CL      0.83  0.35  0.50  0.18
         DE      0.57  0.58  0.57  0.43
         IRT     0.32  0.59  0.41  0.79
         LA      0.78  0.48  0.59  0.26
         PE      0.59  0.53  0.56  0.38
         RE      0.80  0.53  0.64  0.28
         VN      0.69  0.06  0.11  0.04
CoNLL    CL      0.39  0.18  0.24  0.02
         DE      0.26  0.30  0.28  0.06
         IRT     0.27  0.31  0.29  0.06
         LA      0.29  0.31  0.30  0.06
         PE      0.17  0.47  0.25  0.15
         RE      0.30  0.33  0.32  0.06
         VN      0.60  0.01  0.02  0.00
Flights  CL      0.92  0.27  0.42  0.01
         DE      0.41  0.80  0.55  0.07
         IRT     0.01  0.20  0.02  0.93
         LA      0.60  0.73  0.66  0.05
         PE      0.18  0.68  0.29  0.14
         RE      0.61  0.73  0.67  0.05
         VN      1.00  0.17  0.29  0.01
Forex    CL      0.73  0.46  0.57  0.06
         DE      0.52  0.81  0.64  0.16
         IRT     0.49  0.85  0.62  0.18
         LA      0.65  0.76  0.70  0.12
         PE      0.45  0.72  0.56  0.16
         RE      0.67  0.75  0.70  0.11
         VN      1.00  0.07  0.14  0.01

Scorers
Corpus   Method  AP    P@10%  R@10%
Comp.    BC      0.68  0.83   0.20
         CU      0.70  0.87   0.21
         DM      0.66  0.82   0.19
         DU      0.43  0.50   0.12
         KNN     0.61  0.75   0.18
         LE      0.41  0.38   0.09
         MD      0.54  0.59   0.14
         PM      0.54  0.64   0.15
         WD      0.45  0.48   0.11
CoNLL    BC      0.14  0.13   0.24
         CU      0.17  0.18   0.34
         DM      0.14  0.12   0.23
         DU      0.07  0.08   0.15
         KNN     0.12  0.13   0.24
         LE      0.19  0.17   0.32
         MD      0.06  0.05   0.09
         PM      0.06  0.07   0.14
         WD      0.16  0.17   0.32
Flights  BC      0.49  0.24   0.63
         CU      0.68  0.29   0.76
         DM      0.35  0.25   0.66
         DU      0.18  0.10   0.27
         KNN     0.07  0.07   0.20
         LE      0.10  0.10   0.27
         MD      0.07  0.11   0.29
         PM      0.12  0.12   0.32
         WD      0.11  0.12   0.32
Forex    BC      0.54  0.49   0.49
         CU      0.70  0.71   0.71
         DM      0.61  0.66   0.65
         DU      0.32  0.32   0.32
         KNN     0.16  0.14   0.14
         LE      0.11  0.09   0.09
         MD      0.14  0.20   0.20
         PM      0.25  0.30   0.30
         WD      0.14  0.20   0.20

(c) Token labeling

Flaggers
Corpus   Method  P     R     F1    %F
GUM      CL      0.73  0.90  0.80  0.06
         DE      0.59  1.00  0.74  0.08
         IRT     0.00  0.00  0.00  0.91
         LA      0.51  1.00  0.68  0.10
         PE      0.41  1.00  0.58  0.12
         RE      0.53  1.00  0.69  0.09
         VN      0.47  0.66  0.55  0.07
Plank    CL      0.47  0.31  0.37  0.08
         DE      0.45  0.51  0.48  0.13
         IRT     0.07  0.47  0.12  0.84
         LA      0.46  0.53  0.49  0.14
         PE      0.48  0.53  0.50  0.13
         RE      0.47  0.52  0.49  0.13
         VN      0.55  0.21  0.30  0.04

Scorers
Corpus   Method  AP    P@10%  R@10%
GUM      BC      0.92  0.47   0.95
         CU      0.98  0.50   1.00
         DM      0.95  0.49   0.98
         DU      0.05  0.05   0.10
         KNN     0.21  0.19   0.38
         LE      0.60  0.34   0.69
         MD      0.12  0.14   0.29
         PM      0.05  0.05   0.11
         WD      0.53  0.39   0.79
Plank    BC      0.38  0.42   0.36
         CU      0.42  0.51   0.43
         DM      0.27  0.37   0.31
         DU      0.24  0.28   0.24
         KNN     0.31  0.39   0.33
         LE      0.22  0.24   0.21
         MD      0.16  0.19   0.16
         PM      0.23  0.29   0.24
         WD      0.39  0.43   0.37

Figure C.2
AED results achieved using the respective best models across all flaggers and scorers
for text classification, span, and token labeling.
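The column headers in these tables can be read with the help of the following sketch. It assumes the usual definitions: P, R, F1, and the fraction of flagged instances (%F) for flaggers, and precision and recall among the 10% highest-scored instances for scorers. The function names and the rounding of the cutoff are assumptions of this sketch rather than details of the evaluation code.

```python
def flagger_metrics(flags, is_error):
    """Precision, recall, F1, and fraction flagged for a binary flagger."""
    true_pos = sum(1 for f, e in zip(flags, is_error) if f and e)
    flagged = sum(1 for f in flags if f)
    errors = sum(1 for e in is_error if e)
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / errors if errors else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"P": precision, "R": recall, "F1": f1, "%F": flagged / len(flags)}


def precision_recall_at(scores, is_error, fraction=0.1):
    """Precision and recall among the top `fraction` highest-scored instances."""
    k = max(1, round(len(scores) * fraction))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hits = sum(1 for i in top if is_error[i])
    errors = sum(1 for e in is_error if e)
    return hits / k, (hits / errors if errors else 0.0)


# Hypothetical flagger output and gold error annotations for four instances.
FLAGGER = flagger_metrics([True, True, False, False], [True, False, True, False])

# Hypothetical scorer output for ten instances, two of which are real errors.
SCORES = [0.9, 0.1, 0.2, 0.3, 0.1, 0.8, 0.2, 0.1, 0.3, 0.2]
ERRORS = [True, False, False, False, False, True, False, False, False, False]
P_AT_10, R_AT_10 = precision_recall_at(SCORES, ERRORS)
```

Under these definitions, P@10% cannot exceed the error rate divided by 0.1, which is consistent with scorers on low-noise datasets pairing a P@10% near 0.5 with an R@10% of 1.00.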

Acknowledgments
We thank Andreas Rücklé, Edwin Simpson,
Falko Helm, Ivan Habernal, Ji-Ung Lee,
Michael Bugert, Nafise Sadat Moosavi, Nils
Dycke, Richard Eckart de Castilho, Tobias
Mayer, and Yevgeniy Puzikov, and our
anonymous reviewers for the fruitful
discussions and helpful feedback that
improved this article. We are especially
grateful to Michael Bugert for his
implementation of the span matching via
linear sum assignment and to Amir Zeldes,
Andreas Grivas, Beatrice Alex, Hadi Amiri,
Jeremy Barnes, as well as Stefan Larson for
answering our questions regarding their
publications and for making datasets and
code available. This research work has been
funded by the “Data Analytics for the
Humanities” grant by the Hessian Ministry
of Higher Education, Research, Science and
the Arts.

Les références
Ahrens, Sönke. 2017. How to Take Smart Notes:

One Simple Technique to Boost Writing,
Learning and Thinking: For Students,
Academics and Nonfiction Book Writers.
CreateSpace, North Charleston, SC.
Akbik, Alan, Tanja Bergmann, Duncan

Blythe, Kashif Rasul, Stefan Schweter, et
Roland Vollgraf. 2019. FLAIR: Un
easy-to-use framework for state-of-the-art
NLP. In Proceedings of the 2019 Conference of
the North American Chapter of the Association
for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and
Short Papers), pages 54–59.

Alt, Christoph, Aleksandra Gabryszak, et

Leonhard Hennig. 2020. TACRED
revisited: A thorough evaluation of the
TACRED relation extraction task. Dans
Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics,
pages 1558–1569. https://est ce que je.org/10
.18653/v1/2020.acl-main.142

Ambati, Bharat Ram, Rahul Agarwal, Mridul
Gupta, Samar Husain, and Dipti Misra
Sharma. 2011. Error detection for treebank
validation. In Proceedings of the 9th
Workshop on Asian Language Resources,
pages 23–30.

Amiri, Hadi, Timothy Miller, and Guergana
Savova. 2018. Spotting spurious data with
neural networks. In Proceedings of the 2018
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long Papers), pages 2006–2016.
https://doi.org/10.18653/v1/N18-1182

Angle, Sachi, Pruthwik Mishra, and Dipti
Mishra Sharma. 2018. Automated error
correction and validation for POS tagging
of Hindi. In Proceedings of the 32nd Pacific
Asia Conference on Language, Information
and Computation, pages 11–18.

Aroyo, Lora and Chris Welty. 2015. Truth is a
lie: Crowd truth and the seven myths of
human annotation. AI Magazine,
36(1):15–24. https://doi.org/10.1609
/aimag.v36i1.2564

Barnes, Jeremy, Lilja Øvrelid, and Erik

Velldal. 2019. Sentiment analysis is not
solved! Assessing and probing sentiment
classification. In Proceedings of the 2019 ACL
Workshop BlackboxNLP: Analyzing and
Interpreting Neural Networks for NLP,
pages 12–23. https://est ce que je.org/10
.18653/v1/W19-4802

Basile, Valerio, Michael Fell, Tommaso
Fornaciari, Dirk Hovy, Silviu Paun,
Barbara Plank, Massimo Poesio, et
Alexandra Uma. 2021. We need to consider
disagreement in evaluation. In Proceedings
of the 1st Workshop on Benchmarking: Passé,
Present and Future, pages 15–21. https://
doi.org/10.18653/v1/2021.bppf-1.3

Behrens, Heike, editor. 2008. Corpora in
Language Acquisition Research: Histoire,
Methods, Perspectives, volume 6 of Trends in
Language Acquisition Research. John
Benjamins Publishing Company,
Amsterdam. https://doi.org/10.1075
/tilar.6.03beh

Boyd, Adriane, Markus Dickinson, et W.

Detmar Meurers. 2008. On detecting errors
in dependency treebanks. Research on
Language and Computation, 6(2):113–137.
https://doi.org/10.1007/s11168-008
-9051-9

Breunig, Markus M., Hans-Peter Kriegel,
Raymond T. Ng, and Jörg Sander. 2000.
LOF: Identifying density-based local
outliers. ACM SIGMOD Record,
29(2):93–104. https://doi.org/10.1145
/335191.335388

Burkard, Rainer, Mauro Dell’Amico, et
Silvano Martello. 2012. Assignment
Problems: Revised Reprint. Society for
Industrial and Applied Mathematics.
https://doi.org/10.1137/1
.9781611972238

Cer, Daniel, Yinfei Yang, Sheng-yi Kong, Nan
Hua, Nicole Limtiaco, Rhomni St. John,
Noah Constant, Mario Guajardo-Cespedes,
Steve Yuan, Chris Tar, Brian Strope, et
Ray Kurzweil. 2018. Universal sentence
encoder for English. In Proceedings of the
2018 Conference on Empirical Methods in

Natural Language Processing: System
Demonstrations, pages 169–174.
https://doi.org/10.18653/v1/D18-2029

Chinchor, Nancy and Beth Sundheim. 1993.
MUC-5 evaluation metrics. In Proceedings
of the Fifth Message Understanding
Conference (MUC-5), pages 69–78.
https://doi.org/10.3115/1072017
.1072026

Cui Zhu, H. Kitagawa, and C. Faloutsos.
2005. Example-based robust outlier
detection in high dimensional datasets. In
Fifth IEEE International Conference on Data
Mining (ICDM’05), pages 829–832.

Davis, Jesse and Mark Goadrich. 2006. The
relationship between precision-recall and
ROC curves. In Proceedings of the 23rd
International Conference on Machine
Learning – ICML ’06, pages 233–240.

Dawid, A. P. and A. M. Skene. 1979.
Maximum likelihood estimation of
observer error-rates using the EM
algorithm. Applied Statistics, 28(1):20–28.
https://doi.org/10.2307/2346806

Devlin, Jacob, Ming-Wei Chang, Kenton Lee,
and Kristina Toutanova. 2019. BERT:
Pre-training of deep bidirectional
transformers for language understanding.
In Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short
Papers), pages 4171–4186.

Dickinson, Markus. 2006. From detecting
errors to automatically correcting them. In
11th Conference of the European Chapter of the
Association for Computational Linguistics,
pages 265–272.

Dickinson, Markus. 2015. Detection of
annotation errors in corpora. Language
and Linguistics Compass, 9(3):119–138.
https://doi.org/10.1111/lnc3.12129

Dickinson, Markus and Chong Min Lee.
2008. Detecting errors in semantic
annotation. In Proceedings of the Sixth
International Conference on Language
Resources and Evaluation (LREC’08),
pages 605–610.

Dickinson, Markus and W. Detmar Meurers.
2003a. Detecting errors in part-of-speech
annotation. In Proceedings of the Tenth
Conference on European Chapter of the
Association for Computational Linguistics
Volume 1, EACL ’03, pages 107–114.
https://doi.org/10.3115/1067807
.1067823

Dickinson, Markus and W. Detmar Meurers.
2003b. Detecting inconsistencies in
treebanks. In Proceedings of the Second
Workshop on Treebanks and Linguistic
Theories, pages 1–12.

Dickinson, Markus and W. Detmar Meurers.
2005. Detecting errors in discontinuous
structural annotation. In Proceedings of the
43rd Annual Meeting of the Association for
Computational Linguistics – ACL ’05,
pages 322–329. https://doi.org/10
.3115/1219840.1219880

Dligach, Dmitriy and Martha Palmer. 2011.
Reducing the need for double annotation.
In Proceedings of the 5th Linguistic
Annotation Workshop, pages 65–73.

Dwork, Cynthia, Ravi Kumar, Moni Naor,
and D. Sivakumar. 2001. Rank aggregation
methods for the Web. In Proceedings of the
Tenth International Conference on World Wide
Web – WWW ’01, pages 613–622.

Fornaciari, Tommaso, Alexandra Uma, Silviu
Paun, Barbara Plank, Dirk Hovy, and
Massimo Poesio. 2021. Beyond black &
white: Leveraging annotator disagreement
via soft-label multi-task learning. In
Proceedings of the 2021 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 2591–2597.
https://doi.org/10.18653/v1/2021
.naacl-main.204

Gal, Yarin and Zoubin Ghahramani. 2016.
Dropout as a Bayesian approximation:
Representing model uncertainty in deep
learning. In Proceedings of the 33rd
International Conference on Machine
Learning, pages 1050–1059.

Gimpel, Kevin, Nathan Schneider, Brendan
O’Connor, Dipanjan Das, Daniel Mills,
Jacob Eisenstein, Michael Heilman, Dani
Yogatama, Jeffrey Flanigan, and Noah A.
Smith. 2011. Part-of-speech tagging for
Twitter: Annotation, features, and
experiments. In Proceedings of the 49th
Annual Meeting of the Association for
Computational Linguistics: Human Language
Technologies, pages 42–47. https://doi
.org/10.21236/ADA547371

Grivas, Andreas, Beatrice Alex, Claire
Grover, Richard Tobin, and William
Whiteley. 2020. Not a cute stroke: Analysis
of Rule- and Neural Network-based
Information Extraction Systems for Brain
Radiology Reports. In Proceedings of the
11th International Workshop on Health Text
Mining and Information Analysis,
pages 24–37. https://doi.org/10.18653
/v1/2020.louhi-1.4

Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian
Q. Weinberger. 2017. On calibration of
modern neural networks. In Proceedings of
the 34th International Conference on Machine
Learning, pages 1321–1330.

Gururangan, Suchin, Ana Marasović,
Swabha Swayamdipta, Kyle Lo, Iz Beltagy,
Doug Downey, and Noah A. Smith. 2020.
Don’t stop pretraining: Adapt language
models to domains and tasks. In
Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics,
pages 8342–8360. https://doi.org/10
.18653/v1/2020.acl-main.740

Haselbach, Boris, Kerstin Eckart, Wolfgang
Seeker, Kurt Eberle, and Ulrich Heid. 2012.
Approximating theoretical linguistics
classification in real data: The case of
German “nach” particle verbs. In
Proceedings of COLING 2012,
pages 1113–1128.

Hedderich, Michael A., Dawei Zhu, and
Dietrich Klakow. 2021. Analysing the noise
model error for realistic noisy label data.
Proceedings of the AAAI Conference on
Artificial Intelligence, 35(9):7675–7684.
https://doi.org/10.1609/aaai
.v35i9.16938

Heinzerling, Benjamin and Michael Strube.
2018. BPEmb: Tokenization-free
pre-trained subword embeddings in 275
languages. In Proceedings of the Eleventh
International Conference on Language
Resources and Evaluation (LREC 2018),
pages 2989–2993.

Hemphill, Charles T., John J. Godfrey, and
George R. Doddington. 1990. The ATIS
spoken language systems pilot corpus. In
Proceedings of the Workshop on Speech and
Natural Language, pages 96–101. https://
doi.org/10.3115/116580.116613

Hendrycks, Dan and Kevin Gimpel. 2017. A
baseline for detecting misclassified and
out-of-distribution examples in neural
networks. In Proceedings of International
Conference on Learning Representations,
pages 1–12.

Hollenstein, Nora, Nathan Schneider, and
Bonnie Webber. 2016. Inconsistency
detection in semantic annotation. In
Proceedings of the Tenth International
Conference on Language Resources and
Evaluation (LREC’16), pages 3986–3990.

Hovy, Dirk, Taylor Berg-Kirkpatrick, Ashish
Vaswani, and Eduard Hovy. 2013.
Learning whom to trust with MACE. In
Proceedings of the 2013 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 1120–1130.

Jamison, Emily and Iryna Gurevych. 2015.
Noise or additional information?
Leveraging crowdsource annotation item
agreement for natural language tasks. In
Proceedings of the 2015 Conference on
Empirical Methods in Natural Language
Processing, pages 291–297. https://doi
.org/10.18653/v1/D15-1035

Joulin, Armand, Edouard Grave, Piotr
Bojanowski, and Tomas Mikolov. 2017. Bag
of tricks for efficient text classification. In
Proceedings of the 15th Conference of the
European Chapter of the Association for
Computational Linguistics: Volume 2, Short
Papers, pages 427–431. https://doi.org
/10.18653/v1/E17-2068

Ke, Guolin, Qi Meng, Thomas Finley, Taifeng
Wang, Wei Chen, Weidong Ma, Qiwei Ye,
and Tie-Yan Liu. 2017. LightGBM: A
highly efficient gradient boosting decision
tree. In Proceedings of the 31st International
Conference on Neural Information Processing
Systems, pages 1–9.

Kehler, A., L. Kertz, H. Rohde, and J. L.
Elman. 2007. Coherence and coreference
revisited. Journal of Semantics, 25(1):1–44.
https://doi.org/10.1093/jos/ffm018,
PubMed: 22923856

Kendall, M. G. 1938. A new measure of rank
correlation. Biometrika, 30(1–2):81–93.
https://doi.org/10.1093/biomet/30
.1-2.81

Khandelwal, Urvashi, Omer Levy, Dan
Jurafsky, Luke Zettlemoyer, and Mike
Lewis. 2020. Generalization through
memorization: Nearest neighbor language
models. In International Conference on
Learning Representations (ICLR),
pages 1–13.

Küppers, Fabian, Jan Kronenberger,
Amirhossein Shantia, and Anselm
Haselhoff. 2020. Multivariate confidence
calibration for object detection. In 2nd
Workshop on Safe Artificial Intelligence for
Automated Driving (SAIAD), pages 1–9.

Květoň, Pavel and Karel Oliva. 2002.
(Semi-)automatic detection of errors in
PoS-tagged corpora. In COLING 2002:
The 19th International Conference on
Computational Linguistics, pages 1–7.
https://doi.org/10.3115/1072228
.1072249

Larson, Stefan, Adrian Cheung, Anish
Mahendran, Kevin Leach, and Jonathan K.
Kummerfeld. 2020. Inconsistencies in
crowdsourced slot-filling annotations: A
typology and identification methods. In
Proceedings of the 28th International
Conference on Computational Linguistics,
pages 5035–5046. https://doi.org/10
.18653/v1/2020.coling-main.442

Larson, Stefan, Anish Mahendran, Andrew
Lee, Jonathan K. Kummerfeld, Parker Hill,
Michael A. Laurenzano, Johann
Hauswald, Lingjia Tang, and Jason Mars.
2019. Outlier detection for improved data
quality and diversity in dialog systems. In
Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short
Papers), pages 517–527. https://doi.org
/10.18653/v1/N19-1051

Leitner, Sebastian. 1974. So Lernt Man Leben
[How to Learn to Live]. Droemer-Knaur,
Munich.

Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, and
Veselin Stoyanov. 2019. RoBERTa: A
robustly optimized BERT pretraining
approach. arxiv preprints 11692.

Loftsson, Hrafn. 2009. Correcting a
POS-Tagged corpus using three
complementary methods. In Proceedings of
the 12th Conference of the European Chapter of
the ACL (EACL 2009), pages 523–531.
https://doi.org/10.3115/1609067
.1609125

Lord, F. M., M. R. Novick, and Allan
Birnbaum. 1968. Statistical Theories of
Mental Test Scores. Addison-Wesley,
Oxford, England.

Manning, Christopher D. 2011.
Part-of-speech tagging from 97% to 100%:
Is it time for some linguistics? In
Computational Linguistics and Intelligent Text
Processing, volume 6608, pages 171–189.
https://doi.org/10.1007/978-3-642
-19400-9_14

Manning, Christopher D., Prabhakar
Raghavan, and Hinrich Schütze. 2008.
Introduction to Information Retrieval.
Cambridge University Press, New York.

Ménard, Pierre André and Antoine Mougeot.
2019. Turning silver into gold:
Error-focused corpus reannotation with
active learning. In Proceedings – Natural
Language Processing in a Deep Learning
World, pages 758–767. https://doi.org
/10.26615/978-954-452-056-4_088

Naeini, Mahdi Pakdaman, Gregory F.
Cooper, and Milos Hauskrecht. 2015.
Obtaining well calibrated probabilities
using Bayesian binning. In Proceedings of
the Twenty-Ninth AAAI Conference on
Artificial Intelligence, pages 2901–2907.

Nivre, Joakim, Marie-Catherine de Marneffe,
Filip Ginter, Jan Hajič, Christopher D.
Manning, Sampo Pyysalo, Sebastian
Schuster, Francis Tyers, and Daniel Zeman.
2020. Universal Dependencies v2: An
evergrowing multilingual treebank
collection. In Proceedings of the 12th
Language Resources and Evaluation
Conference, pages 4034–4043.

Northcutt, Curtis, Lu Jiang, and Isaac
Chuang. 2021. Confident learning:
Estimating uncertainty in dataset labels.
Journal of Artificial Intelligence Research,
70:1373–1411. https://doi.org/10
.1613/jair.1.12125

Northcutt, Curtis G., Anish Athalye, and
Jonas Mueller. 2021. Pervasive label errors
in test sets destabilize machine learning
benchmarks. In 35th Conference on Neural
Information Processing Systems Datasets and
Benchmarks Track, pages 1–13.

Paun, Silviu, Bob Carpenter, Jon
Chamberlain, Dirk Hovy, Udo Kruschwitz,
and Massimo Poesio. 2018. Comparing
Bayesian models of annotation.
Transactions of the Association for
Computational Linguistics, 6(0):571–585.
https://doi.org/10.1162/tacl_a_00040

Pavlick, Ellie and Tom Kwiatkowski. 2019.
Inherent disagreements in human textual
inferences. Transactions of the Association for
Computational Linguistics, 7:677–694.

Pennington, Jeffrey, Richard Socher, and
Christopher D. Manning. 2014. GloVe:
Global vectors for word representation. In
Proceedings of the 2014 Conference on
Empirical Methods in Natural Language
Processing (EMNLP), pages 1532–1543.
https://doi.org/10.3115/v1/D14-1162

Peters, Matthew E., Sebastian Ruder, and
Noah A. Smith. 2019. To tune or not to
tune? Adapting pretrained representations
to diverse tasks. In Proceedings of the 4th
Workshop on Representation Learning for
NLP (RepL4NLP-2019), pages 7–14.
https://doi.org/10.18653/v1/W19-4302

Plank, Barbara, Dirk Hovy, and Anders
Søgaard. 2014a. Learning part-of-speech
taggers with inter-annotator agreement
loss. In Proceedings of the 14th Conference of
the European Chapter of the Association for
Computational Linguistics, pages 742–751.

Plank, Barbara, Dirk Hovy, and Anders
Søgaard. 2014b. Linguistically debatable or
just plain wrong? In Proceedings of the 52nd
Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 507–511. https://doi
.org/10.3115/v1/P14-2083

Platt, John C. 1999. Probabilistic outputs for
support vector machines and comparisons
to regularized likelihood methods.
Advances in Large Margin Classifiers,
10(3):1–9.

Pustejovsky, J. and Amber Stubbs. 2013.
Natural Language Annotation for Machine
Learning. O’Reilly Media, Sebastopol, CA.

Qian, Kun, Ahmad Beirami, Zhouhan Lin,
Ankita De, Alborz Geramifard, Zhou Yu,
and Chinnadhurai Sankar. 2021.
Annotation inconsistency and entity bias
in MultiWOZ. In Proceedings of the 22nd
Annual Meeting of the Special Interest Group
on Discourse and Dialogue, pages 326–337.

Rehbein, Ines. 2014. POS error detection in
automatically annotated corpora. In
Proceedings of LAW VIII – the 8th Linguistic
Annotation Workshop, pages 20–28.

Reimers, Nils and Iryna Gurevych. 2019.
Sentence-BERT: Sentence embeddings
using Siamese BERT-networks. In
Proceedings of the 2019 Conference on
Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 3980–3990.
https://doi.org/10.18653/v1/D19-1410

Reiss, Frederick, Hong Xu, Bryan Cutler,
Karthik Muthuraman, and Zachary
Eichenberger. 2020. Identifying incorrect
labels in the CoNLL-2003 corpus. In
Proceedings of the 24th Conference on
Computational Natural Language Learning,
pages 215–226. https://doi.org/10
.18653/v1/2020.conll-1.16

Rodrigues, Filipe and Francisco Pereira. 2018.
Deep learning from crowds. In Proceedings
of the Thirty-Second AAAI Conference on
Artificial Intelligence, pages 1611–1618.

Rodriguez, Pedro, Joe Barrow, Alexander
Miserlis Hoyle, John P. Lalor, Robin Jia,
and Jordan Boyd-Graber. 2021. Evaluation
examples are not equally informative:
How should that change NLP
leaderboards? In Proceedings of the 59th
Annual Meeting of the Association for
Computational Linguistics and the 11th
International Joint Conference on Natural
Language Processing (Volume 1: Long Papers),
pages 4486–4503. https://doi.org/10
.18653/v1/2021.acl-long.346

Saito, Takaya and Marc Rehmsmeier. 2015.
The precision-recall plot is more
informative than the ROC plot when
evaluating binary classifiers on
imbalanced datasets. PLOS ONE,
10(3):1–21. https://doi.org/10.1371
/journal.pone.0118432, PubMed:
25738806

Sanguinetti, Manuela and Cristina Bosco.
2015. PartTUT: The Turin University
Parallel Treebank. In Basili, Roberto,
Cristina Bosco, Rodolfo Delmonte,
Alessandro Moschitti, and Maria Simi,
editors, Harmonization and Development of
Resources and Tools for Italian Natural
Language Processing within the PARLI
Project, volume 589, pages 51–69. https://
doi.org/10.1007/978-3-319-14206-7_3

Sanh, Victor, Lysandre Debut, Julien
Chaumond, and Thomas Wolf. 2019.
DistilBERT, a distilled version of BERT:
Smaller, faster, cheaper and lighter. In
Proceedings of the 5th Workshop on Energy
Efficient Machine Learning and Cognitive
Computing, pages 1–5.

Schreibman, Susan, Ray Siemens, and John
Unsworth, editors. 2004. A Companion to
Digital Humanities. Blackwell Publishing
Ltd, Malden, MA, USA.

Shelmanov, Artem, Evgenii Tsymbalov,
Dmitri Puzyrev, Kirill Fedyanin,
Alexander Panchenko, and Maxim Panov.
2021. How certain is your transformer? In
Proceedings of the 16th Conference of the
European Chapter of the Association for
Computational Linguistics: Main Volume,
pages 1833–1840. https://doi.org/10
.18653/v1/2021.eacl-main.157

Socher, Richard, Alex Perelygin, Jean Wu,
Jason Chuang, Christopher D. Manning,
Andrew Ng, and Christopher Potts. 2013.
Recursive deep models for semantic
compositionality over a sentiment
treebank. In Proceedings of the 2013
Conference on Empirical Methods in Natural
Language Processing, pages 1631–1642.

Song, Hwanjun, Minseok Kim, Dongmin
Park, Yooju Shin, and Jae-Gil Lee. 2020.
Learning from noisy labels with deep
neural networks: A survey. arxiv preprint,
2007.8199. https://doi.org/10.1109
/TNNLS.2022.3152527, PubMed: 35254993

Stoica, George, Emmanouil Antonios
Platanios, and Barnabas Poczos. 2021.
Re-TACRED: Addressing shortcomings of
the TACRED dataset. In Proceedings of the
35th AAAI Conference on Artificial
Intelligence 2021, pages 13843–13850.
https://doi.org/10.1609/aaai
.v35i15.17631

Sun, Chen, Abhinav Shrivastava, Saurabh
Singh, and Abhinav Gupta. 2017.
Revisiting unreasonable effectiveness of
data in deep learning era. In IEEE
International Conference on Computer Vision
(ICCV), pages 1–13.

Swayamdipta, Swabha, Roy Schwartz,
Nicholas Lourie, Yizhong Wang,
Hannaneh Hajishirzi, Noah A. Smith, and
Yejin Choi. 2020. Dataset cartography:
Mapping and diagnosing datasets with
training dynamics. In Proceedings of the
2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP),
pages 9275–9293. https://doi.org/10
.18653/v1/2020.emnlp-main.746

Szpiro, George. 2010. Numbers Rule: The
Vexing Mathematics of Democracy, from Plato
to the Present. Princeton University Press.
https://doi.org/10.1515
/9781400834440

Tjong Kim Sang, Erik F. and Fien De
Meulder. 2003. Introduction to the
CoNLL-2003 shared task:
Language-independent named entity
recognition. In Proceedings of the Seventh
Conference on Natural Language Learning at
HLT-NAACL 2003, pages 142–147.
https://doi.org/10.3115/1119176
.1119195

Ustalov, Dmitry, Nikita Pavlichenko,
Vladimir Losev, Iulian Giliazev, and
Evgeny Tulin. 2021. A general-purpose
crowdsourcing computational quality
control toolkit for Python. In the Ninth
AAAI Conference on Human Computation
and Crowdsourcing: Works-in-Progress and
Demonstration Track, pages 1–4.

van Halteren, Hans. 2000. The detection of
inconsistency in manually tagged text. In
Proceedings of the COLING-2000 Workshop
on Linguistically Interpreted Corpora,
pages 48–55.

Vlachos, Andreas. 2006. Active annotation.
In Proceedings of the Workshop on Adaptive
Text Extraction and Mining (ATEM 2006),
pages 64–71.

Wang, Zihan, Jingbo Shang, Liyuan Liu,
Lihao Lu, Jiacheng Liu, and Jiawei Han.
2019. CrossWeigh: Training named entity
tagger from imperfect annotations. In
Proceedings of the 2019 Conference on
Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 5153–5162.
https://doi.org/10.18653/v1/D19
-1519, PubMed: 31303768

Wilcoxon, Frank. 1945. Individual
comparisons by ranking methods.
Biometrics Bulletin, 1(6):80. https://
doi.org/10.2307/3001968

Wisniewski, Guillaume. 2018. Errator: A tool
to help detect annotation errors in the
Universal Dependencies project. In
Proceedings of the Eleventh International
Conference on Language Resources and
Evaluation (LREC 2018), pages 4489–4493.

Yaghoub-Zadeh-Fard, Mohammad Ali,
Boualem Benatallah, Moshe Chai Barukh,
and Shayan Zamanirad. 2019. A study of
incorrect paraphrases in crowdsourced
user utterances. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 295–306.
https://doi.org/10.18653/v1/N19-1026

Zadrozny, Bianca and Charles Elkan. 2001.
Obtaining calibrated probability estimates
from decision trees and naive Bayesian
classifiers. In Proceedings of the Eighteenth
International Conference on Machine
Learning, ICML ’01, pages 609–616.

Zadrozny, Bianca and Charles Elkan. 2002.
Transforming classifier scores into accurate
multiclass probability estimates. In
Proceedings of the Eighth ACM SIGKDD
International Conference on Knowledge
Discovery and Data Mining, pages 694–700.

Zeldes, Amir. 2017. The GUM corpus:
Creating multilayer resources in the
classroom. Language Resources and
Evaluation, 51(3):581–612. https://doi
.org/10.1007/s10579-016-9343-x

Zhang, Xiang, Junbo Zhao, and Yann LeCun.
2015. Character-level convolutional
networks for text classification. In
Proceedings of the 28th International
Conference on Neural Information Processing
Systems – Volume 1, pages 649–657.

Zhang, Xin, Guangwei Xu, Yueheng Sun,
Meishan Zhang, and Pengjun Xie. 2021.
Crowdsourcing learning as domain
adaptation: A case study on named entity
recognition. In Proceedings of the 59th
Annual Meeting of the Association for
Computational Linguistics and the 11th
International Joint Conference on Natural
Language Processing (Volume 1: Long Papers),
pages 5558–5570. https://doi.org/10
.18653/v1/2021.acl-long.432

Zheng, Guoqing, Ahmed Hassan Awadallah,
and Susan Dumais. 2021. Meta label
correction for noisy label learning. In
Proceedings of the Thirty-fifth AAAI
Conference on Artificial Intelligence 2021,
pages 11053–11061. https://doi.org/10
.1609/aaai.v35i12.17319
