Context-aware Adversarial Training for Name Regularity Bias in
Named Entity Recognition

Abbas Ghaddar, Philippe Langlais†, Ahmad Rashid, and Mehdi Rezagholizadeh

Huawei Noah’s Ark Lab, Montreal Research Center, Canada
†RALI/DIRO, Université de Montréal, Canada
abbas.ghaddar@huawei.com, felipe@iro.umontreal.ca
ahmad.rashid@huawei.com, mehdi.rezagholizadeh@huawei.com

Abstract

In this work, we examine the ability of NER
models to use contextual information when
predicting the type of an ambiguous entity.
We introduce NRB, a new testbed carefully
designed to diagnose Name Regularity Bias of
NER models. Our results indicate that all state-
of-the-art models we tested show such a bias;
BERT fine-tuned models significantly outper-
forming feature-based (LSTM-CRF) ones on
NRB, despite having comparable (sometimes
lower) performance on standard benchmarks.

To mitigate this bias, we propose a novel
model-agnostic training method that adds learn-
able adversarial noise to some entity mentions,
thus enforcing models to focus more strongly
on the contextual signal, leading to significant
gains on NRB. Combining it with two other
training strategies, data augmentation and pa-
rameter freezing, leads to further gains.

1 Introduction

Recent advances in language model pre-training
(Peters et al., 2018; Devlin et al., 2019; Liu et al.,
2019) have greatly improved the performance of
many Natural Language Understanding (NLU)
tasks. However, several studies (McCoy et al., 2019;
Clark et al., 2019; Utama et al., 2020b) revealed
that state-of-the-art NLU models often make use
of surface patterns in the data that do not gen-
eralize well. Named-Entity Recognition (NER),
a downstream task that consists in identifying
textual mentions and classifying them into a
predefined set of types, is no exception.

The robustness of modern NER models has
received considerable attention recently (Mayhew
et al., 2019; Mayhew et al., 2020; Agarwal et al.,
2020a; Zeng et al., 2020; Bernier-Colborne and
Langlais, 2020). Name Regularity Bias (Lin et al.,
2020; Agarwal et al., 2020b; Zeng et al., 2020)
in NER occurs when a model relies on a signal
coming from the entity name, and disregards
evidence within the local context. Figure 1 shows
examples where state-of-the-art models (Peters
et al., 2018; Akbik et al., 2018; Devlin et al.,
2019) fail to exploit contextual information. For
instance, the entity Gonzales in the first sentence
of the figure is wrongly recognized as a per-
son, while the context clearly signals that it is a
location (city).

To better highlight this issue, we propose NRB,
a testbed designed to accurately diagnose name
regularity bias of NER models by harvesting
natural sentences from Wikipedia that contain
challenging entities, such as those in Figure 1.
This is different from previous work that evalu-
ated models on artificial data obtained by either
randomizing (Lin et al., 2020) or substituting
entities by ones from a pre-defined list (Agarwal
et al., 2020a). NRB is compatible with any anno-
tation scheme, and is intended to be used as an
auxiliary validation set.

We conduct experiments with the feature-
based LSTM-CRF architecture (Peters et al.,
2018; Akbik et al., 2018) and the BERT (Devlin
et al., 2019) fine-tuning approach trained on stan-
dard benchmarks. The best LSTM-based model
we tested is able to correctly predict 38% of the
entities in NRB. BERT-based models are perform-
ing much better (+37%), even if they (slightly)
underperform on in-domain development and test
sets. This mismatch in performance between NRB
and standard benchmarks indicates that context
awareness of models is not rewarded by existing
benchmarks, thus justifying NRB as an additional
validation set.

We further propose a novel architecture-
agnostic adversarial training procedure (Miyato

et al., 2016) in which learnable noise vectors are
added to named-entity words, weakening their
signal, thus encouraging the model to pay more
attention to contextual information. Applying it
to both feature-based LSTM-CRF and fine-tuned
BERT models leads to consistent gains on NRB
(+13 points) while maintaining the same level of
performance on standard benchmarks.

The remainder of the paper is organized as
follows. We discuss related works in Section 2.
We describe how we built NRB in Section 3,
and its use in diagnosing named-entity bias of
state-of-the-art models in Section 4. In Section 5,
we present a novel adversarial training method
that we compare and combine with two simpler
ones. We further analyze these training methods
in Section 6, and conclude in Section 7.

Figure 1: Examples extracted from Wikipedia (title
in bold) that illustrate name regularity bias in NER.
Entities of interest are underlined, gold types are in blue
superscript, model predictions are in red subscript, and
context information is highlighted in purple. Models
used in this study disregard contextual information
and rely instead on some signal from the named-entity
itself.

2 Related Work

Robustness and out-of-distribution generalization
has always been a persistent concern in deep
learning applications such as computer vision
(Szegedy et al., 2013; Recht et al., 2019), speech
processing (Seltzer et al., 2013; Borgholt et al.,
2020), and NLU (Søgaard, 2013; Hendrycks
and Gimpel, 2017; Ghaddar and Langlais, 2017;
Yaghoobzadeh et al., 2019; Hendrycks et al.,
2020). One key challenge behind this issue in
NLU is the tendency of models to quickly lever-
age surface form features and annotation artifacts
(Gururangan et al., 2018), which is often referred
to as dataset biases (Dasgupta et al., 2018; Shah
et al., 2020). We discuss related works along two
axes: diagnosis and mitigation.

2.1 Diagnosing Bias

A growing number of studies (Zellers et al.,
2018; Poliak et al., 2018; Geva et al., 2019;
Utama et al., 2020b; Sanh et al., 2020) are show-
ing that NLU models rely heavily on spurious
correlations between output labels and surface
features (e.g., keywords, lexical overlap), impact-
ing their generalization performance. Therefore,
considerable attention has been paid to designing
diagnostic benchmarks where models relying
on bias would perform poorly. For example,
HANS (McCoy et al., 2019), FEVER Symmetric
(Schuster et al., 2019), and PAWS (Zhang et al.,
2019) are benchmarks that contain counterexam-
ples to well-known biases in the training data of
textual entailment (Williams et al., 2017), fact
verification (Thorne et al., 2018), and paraphrase
identification (Wang et al., 2018), respectively.

Naturally, many entity names have a strong
correlation with a single type (e.g., person or
location). Recent works have noted that over-relying
on entity name information negatively impacts NLU
tasks. Balasubramanian et al. (2020) found that
substituting named-entities in standard test sets of
natural language inference, coreference resolution,
and grammar error correction has a negative impact
on those tasks. In political claims detection
(Padó et al., 2019), Dayanik and Padó (2020) show
that claims made by frequently occurring politicians
in the training data are better recognized than those
made by less frequent ones.

Recently, Zeng et al. (2020) and Agarwal et al.
(2020b) conducted two separate analyses on the
decision making mechanism of NER models. Both
works found that context tokens do contribute
to system performance, but that entity names
play a major role in driving high performances.
Agarwal et al. (2020a) reported a performance
drop in NER models when entities in standard
test sets are substituted with other ones pulled
from pre-defined lists. Concurrently, Lin et al.
(2020) conducted an empirical analysis on the
robustness of NER models in the open domain


scenario. They show that models are biased by
strong entity name regularity, and train/test over-
lap in standard benchmarks. They observe a drop
in performance of 34% when entity mentions are
randomly replaced by other mentions.

The aforementioned studies certainly demon-
strate name regularity bias. Still, in many cases
the entity mention is the only key to infer its type,
as in ‘‘James won the league’’. Thus, randomly
swapping entity names, as proposed by Lin et al.
(2020), typically introduces false positive exam-
ples, which obscures observations. Furthermore,
creating artificial word sequences introduces
a mismatch between the pre-training and the
fine-tuning phases of large-scale language models.
NER is also challenging because of compound-
ing factors such as entity boundary detection
(Zheng et al., 2019), rare words and emerging
entities (Strauss et al., 2016), document-level
context (Durrett and Klein, 2014), capitaliza-
tion mismatch (Mayhew et al., 2019), unbalanced
datasets (Nguyen et al., 2020), and domain shift
(Alvarado et al., 2015; Augenstein et al., 2017).
It is unclear to us how randomizing mentions in
a corpus, as proposed by Lin et al. (2020), is
interfering with these factors.

NRB gathers genuine entities that appear in nat-
ural sentences extracted from Wikipedia. Exam-
ples are selected so that entity boundaries are easy
to identify, and their types can be inferred from the
local context, thus avoiding compounding many
factors responsible for lack of robustness.

2.2 Mitigating Bias

The prevailing approach to address dataset bias
consists in adjusting the training loss for biased
examples. A number of recent studies (Clark
et al., 2019; Belinkov et al., 2019; He et al.,
2019; Mahabadi et al., 2020; Utama et al., 2020a)
proposed to train a shallow model that exploits
manually designed biased features. A main model
is then trained in an ensemble with this pre-trained
model, in order to discourage the main model from
adopting the naive strategy of the shallow one.

Adversarial training (Miyato et al., 2016) is
a regularization method that has been shown to
improve not only robustness (Ebrahimi et al.,
2018; Bekoulis et al., 2018), but also general-
ization (Cheng et al., 2019; Zhu et al., 2019) in
NLU. It builds on the idea of adding adversarial
examples (Goodfellow et al., 2014; Fawzi et al.,
2016) to the training set, that is, small perturba-
tions of the data that can change the prediction of
a classifier. These perturbations for NLP tasks are
done at the token embedding level and are norm
bounded. Typically, adversarial training algo-
rithms can be defined as a minmax optimization
problem wherein the adversarial examples are
generated to maximize the loss, while the model
is trained to minimize it.

Belinkov et al. (2019) used adversarial training
to mitigate the hypothesis-only bias in textual
entailment models. Clark et al. (2020) adversar-
ially trained a low and a high capacity model
in an ensemble in order to ensure that the latter
model is focusing on patterns that should gen-
eralize better. Dayanik and Padó (2020) used an
extra adversarial loss in order to encourage a
political claims detection model to learn more
from samples with infrequent politician names.
Le Bras et al. (2020) proposed an adversarial tech-
nique to filter out biased examples from training
material. Models trained on the filtered datasets
show improved out-of-distribution performances
on various computer vision and NLU tasks.

Data augmentation is another strategy for
enhancing robustness. It was successfully used in
Min et al. (2020) and Moosavi et al. (2020) to
improve textual entailment performances on the
HANS benchmark. The former approach proposes
to append original training sentences with their
corresponding predicate-arguments triplets gen-
erated by a semantic role labelling tagger; whereas
the latter generates new examples by applying
syntactic transformations to the original training
instances.

Zeng et al. (2020) created new examples by
randomly replacing an entity by another one of
the same type that occurs in the training data.
New examples are considered valid if the type
of the replaced entity is correctly predicted by
a NER model trained on the original dataset.
Similarly, Dai and Adel (2020) explored different
entity substitution techniques for data augmen-
tation tailored to NER. Both studies conclude
that data augmentation techniques based on entity
substitution improve the overall performance
on low resource biomedical NER.

Studies discussed above have the potential to
mitigate name regularity bias of NER models.
Still, we are not aware of any dedicated work that
shows it is so. In this work, we propose ways of


mitigating name regularity bias for NER, includ-
ing an elaborate adversarial method that forces the
model to capture more signal from the context. Our
methods do not require an extra training stage, or
to manually characterize biased features. They are
therefore conceptually simpler, and can potentially
be combined with any of the discussed techniques.
Furthermore, our proposed methods are effective
under both low and high resource settings.

3 The NRB Benchmark

NRB is a diagnosing testbed exclusively dedicated
to name regularity bias in NER. To this end, it
gathers named-entities that satisfy 4 criteria:

1. Must be real-world entities within natural
sentences → We select sentences from
Wikipedia articles.

2. Must be compatible with any annotation
scheme → We restrict our focus on the 3 most
common types found in NER benchmarks:
person, location, and organization.

3. Boundary detection (segmentation) should
not be a bottleneck → We only select single
word entities that start with a capital letter.

4. Supporting evidence of the type must be
restricted to the local context only (a window of
2 to 4 tokens) → We developed a primitive
context-only tagger to filter out entities with
no close-context signal.

The strategy used to gather examples in NRB is
illustrated in Figure 2. We first select Wikipedia
articles that are listed in a disambiguation page.
Disambiguation pages group different topics that
could be referred to by the same query term.1
The query term Bromwich in Figure 2 has its own
disambiguation page that contains a link to the
city of West Bromwich, West Bromwich Albion
Football Club, and Kenny Bromwich the rugby
league player.

We associate each article in a disambiguation
page to the entity type found in its corresponding
Freebase page (Bollacker et al., 2008), consid-
ering only articles whose Freebase type can be
mapped to a person, a location, or an organization.
We assume that occurrences of the query term

1 https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Disambiguation_pages.


Figure 2: Selection of a sentence in NRB.

within the article are of this type. This assump-
tion was found accurate in previous work on
Wikipedia distant supervision for NER (Ghaddar
and Langlais, 2016, 2018). The sentence in our
example is extracted from the Kenny Bromwich
article, whose Freebase type can be mapped to a
person. Therefore, we assume Bromwich in this
sentence to be a person.

To decide whether a sentence containing a
query term is worth being included in NRB, we
rely on two NER taggers. One is a popular NER
system that provides a confidence score for each
prediction, and that acts as a weak supervisor; the
other is a context-only tagger we designed specif-
ically (see Section 3.1) to detect entities with a
strong signal from their local context. A sentence
is selected if the query term is incorrectly labeled
with high confidence (score > 0.85) by the former
tagger, while the latter labels it correctly with
high confidence (a gap of at least 0.25 in probabil-
ity between the first and second predicted types).
This is the case of the sentence in Figure 2, where
Bromwich is incorrectly labeled as an organization
by the weak supervision tagger, but correctly
labeled as a person by the context-only tagger.
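To make this rule concrete, the following minimal Python sketch (ours, not the
original implementation) expresses the two-tagger filter under simplifying
assumptions: both taggers are treated as black boxes that return a probability
distribution over the three types for the query term, and all names below are
hypothetical.

# Minimal sketch of the NRB selection rule described above; both taggers are
# assumed to expose a probability distribution over {PER, LOC, ORG} for the
# query term. All identifiers are illustrative, not from the original code.

def select_for_nrb(weak_probs, context_probs, gold_type,
                   weak_threshold=0.85, context_gap=0.25):
    """Return True if the sentence qualifies for NRB.

    weak_probs / context_probs: dict mapping type -> probability, produced by
    the weak-supervision tagger and the context-only tagger respectively.
    gold_type: the type obtained from the Freebase page of the Wikipedia article.
    """
    # The weak tagger must be wrong, with high confidence.
    weak_pred = max(weak_probs, key=weak_probs.get)
    weak_wrong = weak_pred != gold_type and weak_probs[weak_pred] > weak_threshold

    # The context-only tagger must be right, with a clear margin over the runner-up.
    context_pred = max(context_probs, key=context_probs.get)
    ranked = sorted(context_probs.values(), reverse=True)
    context_right = (context_pred == gold_type
                     and ranked[0] - ranked[1] >= context_gap)

    return weak_wrong and context_right

For the Bromwich sentence of Figure 2, a weak tagger that outputs ORG with
probability above 0.85 and a context-only distribution concentrated on PER
would pass this filter.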

3.1 Implementation

We used the Stanford CoreNLP (Manning et al.,
2014) tagger as our weak supervision tagger and
developed a simple yet efficient method to build a
context-only tagger. For this, we first applied the
Stanford tagger to the entire Wikipedia dump and
replaced all entity mentions identified by their tag.
Then, we train a 5-gram language model on the
resulting corpus using kenLM (Heafield, 2011).


Figure 3: Illustration of a language model used as a
context-only tagger.

Figure 3 illustrates how this model is deployed
as an entity tagger: The mention is replaced by
an empty slot and the language model is queried
for each type. We rank the tags using the per-
plexity score given by the model to the resulting
phrases, then we normalize those scores to get
a probability distribution over types.
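The following sketch illustrates this tagger, assuming the kenlm Python bindings
and a 5-gram ARPA model trained on the type-replaced Wikipedia corpus described
above; the model path and the use of inverse perplexities for normalization are
our own illustrative choices.

# Context-only tagger sketch: slot each type tag into the mention position and
# score the resulting phrase with the 5-gram language model.
import kenlm

TYPES = ("PER", "LOC", "ORG")
lm = kenlm.Model("wiki_type_slots.arpa")  # hypothetical path to the trained model

def context_only_distribution(left_context, right_context):
    scores = {}
    for t in TYPES:
        phrase = f"{left_context} {t} {right_context}".strip()
        # Lower perplexity means the type fits the surrounding context better.
        scores[t] = 1.0 / lm.perplexity(phrase)
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}

# e.g. context_only_distribution("Obama is a city located in", "prefecture , Japan .")
# should place most of its probability mass on LOC.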

We downloaded the Wikipedia dump of June
2020, which contains 30k disambiguation pages.
These pages contain links to 263k articles, where
only 107k (40%) of them have a type in Freebase
that can be mapped to the 3 types of interest.
The Stanford tagger identified 440k entities that
match the query term of the disambiguation pages.
The thresholds discussed previously were chosen
to select around 5000 of the most challenging
examples in terms of name regularity bias. This
figure aligns with the number of entities present in
the test set of the well-studied CONLL benchmark
(Tjong Kim Sang and De Meulder, 2003).

We assessed the annotation quality by asking
a human to filter out noisy examples. A sentence
was removed if it contains an annotation error, or
if the type of the query term cannot be inferred
from the local context. Only 1.3% of the examples
were removed, which confirms the accuracy of
our automatic procedure. NRB is composed of
5275 examples, and each sentence contains a
single annotation (see Figure 1 for examples).

3.2 Control Set (WTS)

In addition to NRB, we collected a set of domain
control sentences—called WTS for WITNESS—that
contain the very same query terms selected in
NRB, but that were correctly labeled by both
the Stanford (score > 0.85) and the context-only
taggers. We selected examples with a small gap
(< 0.1) between the first and second ranked type
assigned to the query term by the latter tagger.
Thus, examples in WTS should be easy to tag. For
example, because Obama the Japanese city (see
Figure 3) is selected among the query terms in
NRB, we added an instance of Obama the president.
Performing poorly on such examples2 indicates a
domain shift between NRB (Wikipedia) and whatever
dataset a model is trained on (we call it the
in-domain corpus). WTS is composed of 5192
sentences that have also been manually checked.

2 That is, a system that fails to tag Obama the president as a person.

4 Diagnosing Bias

4.1 Data

To be comparable with state-of-the-art models, we
consider two standard benchmarks for NER:
CONLL-2003 (Tjong Kim Sang and De Meulder, 2003)
and ONTONOTES 5.0 (Pradhan et al., 2012), which
include 4 and 18 types of named-entities,
respectively. ONTONOTES is 4 times larger than
CONLL, and both benchmarks mainly cover the news
domain. We run experiments on the official
train/dev/test splits, and report mention-level F1
scores, following previous work. Since in NRB there
is only one entity per sentence to annotate, a
system is evaluated on its ability to correctly
identify the boundaries of this entity and its type.
When we train on ONTONOTES (18 types) and evaluate
on NRB (3 types), we perform type mapping using the
scheme of Augenstein et al. (2017).

4.2 Systems

Following Devlin et al. (2019), we term all
approaches that learn the encoder from scratch as
feature-based, as opposed to the ones that fine-tune
a pre-trained model for the downstream task. We
conduct experiments using 3 feature-based and 2
fine-tuning approaches for NER:

• Flair-LSTM An LSTM-CRF model that uses FLAIR
(Akbik et al., 2018) contextualized embeddings as
main features.

• ELMo-LSTM The LSTM-CRF tagging model of Peters
et al. (2018) that uses ELMo contextualized
embeddings at the input layer.

• BERT-LSTM Similar to the previous model, but
replacing ELMo by a representation gathered from
the last four layers of BERT.

• BERT-base The fine-tuning approach proposed by
Devlin et al. (2019) using the BERT-base model.

• BERT-large The fine-tuning approach using the
BERT-large model.

We used Flair-LSTM off-the-shelf,3 and re-implemented
other approaches using the default settings proposed
in the respective papers. For our reimplementations,
we used early stopping based on performance on the
development set, and report average performance over
5 runs. For BERT-based solutions, we adopt spanBERT
(Joshi et al., 2020) as a backbone model because it
was found by Li et al. (2020) to perform better on NER.

3 https://github.com/flairNLP/flair.

Model        |        CONLL              |       ONTONOTES
             | Dev    Test   NRB   WTS   | Dev    Test   NRB   WTS
Feature-based
Flair-LSTM   |  -     93.03  27.56 99.58 |  -     89.06  33.67 93.98
ELMo-LSTM    | 96.69  92.47  31.65 98.24 | 88.31  89.38  34.34 94.90
BERT-LSTM    | 95.94  91.94  38.34 98.08 | 86.12  87.28  43.07 92.04
Fine-tuning
BERT-base    | 96.18  92.19  75.54 98.67 | 87.23  88.19  75.34 94.22
BERT-large   | 96.90  92.86  75.55 98.51 | 89.26  89.93  75.41 95.06

Table 1: Mention level F1 scores of models on CONLL and ONTONOTES,
as well as on NRB and WTS.

4.3 Results

Table 1 shows the mention level F1 score of the systems considered.
FLAIR-LSTM and BERT-large are the best performing
models on in-domain test sets, the maximum gap with
other models being 1.1 and 2.7 on CONLL and
ONTONOTES respectively. These figures are in line
with previous work. What is more interesting is the
performance on NRB. Feature-based models do poorly;
Flair-LSTM underperforms compared to other models
(F1 scores of 27.6 and 33.7 when trained on CONLL
and ONTONOTES respectively). Fine-tuned BERT models
clearly perform better (around 75), but far from
in-domain results (92.9 and 89.9 on CONLL and
ONTONOTES, respectively). Domain shift is not a
reason for those results, since the performances on
WTS are rather high (92 or higher). Furthermore, we
found that the boundary detection (segmentation)
performance on NRB is above 99.2% across all
settings. Because errors made on NRB are neither due
to segmentation nor to domain shift, they must be
imputed to name regularity bias of models.

It is worth noting that BERT-LSTM outperforms
ELMo-LSTM on NRB, despite underperforming on
in-domain test sets. This may be because BERT was
pre-trained on Wikipedia (the same domain as NRB),
while ELMo embeddings were trained on the One
Billion Word corpus (Chelba et al., 2014). Also, we
observe that switching from BERT-base to BERT-large,
or training on 4 times more data (CONLL versus
ONTONOTES), does not help on NRB. This suggests that
name regularity bias is neither a data nor a model
capacity issue.

4.4 Feature-based vs. Fine-tuning

In this section, we analyze reasons for the drastic
superiority of fine-tuned models on NRB. First, the
large gap between BERT-LSTM and BERT-base on NRB
suggests that this is not related to the
representations being used at the input layer.
Second, we tested several configurations of
ELMo-LSTM where we scale up the number of LSTM
layers and hidden units. We observed a degradation
of performance on dev, test, and NRB sets, mostly
due to over-parameterized models. We also trained
9-, 6-, and 4-layer BERT-base models,4 and still
noticed a large advantage of BERT models on NRB.5
This suggests that the higher capacity of BERT alone
cannot explain all the gains. Third, since by
design, evidence on the entity type in NRB resides
within the local context, it is unlikely that gains
on this set come from the ability of Transformers
(Vaswani et al., 2017) to better handle long
dependencies than LSTMs (Hochreiter and Schmidhuber,
1997). To further validate this statement, we
fine-tuned BERT models with randomly initialized
weights, except the embedding layer. We noticed that
this time, the performances on NRB fall into the
same range as those of feature-based models, with a
drastic decrease (12%–15%) on standard benchmarks.
These observations are in keeping with results from
Hendrycks et al. (2020) on the out-of-distribution
robustness of fine-tuning pre-trained transformers,
and also confirm observations made by Agarwal et al.
(2020b).

4 We used early exit (Xin et al., 2020) at the kth layer.
5 The 4-layer model has 53M parameters and performs at 52% on NRB.
From these analyses, we conclude that the Masked
Language Model (MLM) objective (Devlin et al., 2019)
that the BERT models were pre-trained with is a key
factor driving the superior performance of the
fine-tuned models on NRB. In most cases, the target
word is masked or randomly selected, therefore the
model must rely on the context to predict the
correct target, which is what a model should do to
correctly predict the type of entities in NRB. We
think that in fine-tuning, training for a few epochs
with a small learning rate helps the model to
preserve the contextual behavior induced by the MLM
objective. Nevertheless, fine-tuned models,
recording at best an F1 score of 75.6 on NRB, do
show some name regularity bias, and fail to capture
useful local contextual information.

5 Mitigating Bias

In this section, we investigate training procedures
that are designed to enhance the contextual
awareness of a model, leading to better performance
on NRB without impacting in-domain performance.
These training procedures are not supposed to use
any external data. In fact, NRB is only used as a
diagnosing corpus, once the model is trained. We
propose 3 training procedures that can be combined;
two of them are architecture-agnostic, and one is
specific to fine-tuning BERT.

5.1 Entity Masking

Inspired by the masking strategy applied during the
pre-training phase of BERT, we propose a data
augmentation approach that introduces a special
[MASK] token in some of the training examples.
Specifically, we search for entities in the training
material that are preceded or followed by 3
non-entity words. This criterion applies to 35% and
39% of entities in the training data of CONLL and
ONTONOTES, respectively. For each such entity, we
create a new training example (new sentence) by
replacing the entity by [MASK], thus forcing the
model to infer the type of masked tokens from the
context. We call this procedure mask.

5.2 Parameter Freezing

Another simple strategy, specific to fine-tuning
BERT, consists of freezing part of the network. More
precisely, we freeze the bottom half of BERT,
including the embedding layer. The intuition is to
preserve part of the predicting-by-context mechanism
that BERT has acquired during the pre-training
phase. This training procedure is expected to
enforce the contextual ability of the model, thus
adding to our analysis on the critical role of the
MLM objective in pre-training BERT. We name this
method freeze.

5.3 Adversarial Noise

We propose an adversarial learning algorithm that
makes entity type patterns in the input
representation less reliable for the model, thus
enforcing it to rely more aggressively on the
context. To do so, we add a learnable adversarial
noise vector (only) to the input representation of
entities. We refer to this method as adv.

Let T = {t_1, t_2, ..., t_K} be a predefined set of
types, such as PER, LOC, and ORG in our case. Let
x = x_1, x_2, ..., x_n be the input sequence of
length n, y = y_1, y_2, ..., y_n be the gold label
sequence following the IOB6 tagging scheme, and
y' = y'_1, y'_2, ..., y'_n be a sequence obtained by
adding noise to y at the mention level, that is, by
randomly replacing the type of mentions in y with
some noisy type sampled from T.

Let Y_ij(t) = y_i, ..., y_j be a mention of type
t ∈ T, spanning the sequence of indices i to j in y.
We derive a noisy mention Y'_ij in y' from Y_ij(t) as follows:

$$
Y'_{ij} =
\begin{cases}
Y_{ij}(t') & \text{if } p \sim U(0,1) \le \lambda,\quad t' \sim \operatorname{Cat}_{\gamma \in T\setminus\{t\}}\bigl(\gamma \mid \xi = \tfrac{1}{K-1}\bigr)\\
Y_{ij}(t) & \text{otherwise}
\end{cases}
$$

where λ is a threshold parameter, U(0,1) refers to
the uniform distribution in the range [0,1],
Cat(γ | ξ = 1/(K−1)) is the categorical distribution
whose outcomes are equally likely with probability
ξ, and the set T \ {t} = {t' : t' ∈ T ∧ t' ≠ t}
stands for the set T excluding type t.

6 This naturally applies to other schemes, such as BILOU, which Ratinov and Roth (2009) found more informative.

Figure 4: Illustration of our adversarial method
applied on the entity New York. First, we generate a
noisy type (PER), and then add a learnable noise
embedding (LOC→PER) to the input representation of
that entity. This will make entity patterns (hashed
rectangles) unreliable for the model, hence forcing
it to collect evidence (dotted arrow) from the
context. The noise embedding matrix and the noise
label projection layer weights (dotted rectangle)
are trained independently from the model parameters.

The above procedure only applies to the entities
that are preceded or followed by 3 context words.
For instance, in Figure 4, we produce a noisy type
for New York (PER), but not for John (p > λ).
Also, note that we generate a different sequence
y' from y at each training epoch.

Next, we define a learnable noisy embedding
matrix E' ∈ R^{m×d}, where m = |T| × (|T| − 1) is
the number of valid type switching possibilities,
and d is the dimension of the input representations
of x. For each token with a noisy label, we add
the corresponding noisy embedding to its input
representation. For other tokens, we simply add
a zero vector of size d. As depicted in Figure 4,
the noisy type of the entity New York is PER,
therefore we add the noise embedding at index
LOC→PER to its input representation.

Then, the input representation of the sequence is
fed to an encoder followed by an output layer, such
as the LSTM-CRF in Peters et al. (2018) or BERT-
Softmax in Devlin et al. (2019). First, we extend
the aforementioned models by generating an extra
logit f' using a projection layer parametrized by
W' and followed by a softmax function. As shown
in Figure 4, for each token the model produces
two logits relative to the true and noisy tags. Then,
we train the entire model to minimize two losses:
L_true(θ) and L_noisy(θ'), where θ is the original set
of parameters and θ' = {E', W'} is the extra set
we added (dotted boxes in Figure 4). L_true(θ) is
the regular loss on the true tags, while L_noisy(θ')
is the loss on the noisy tags, defined as follows:

$$
\mathcal{L}_{noisy}(\theta') = \sum_{i=1}^{n} \mathbb{1}\bigl(y'_i \neq y_i\bigr)\, \mathrm{CE}\bigl(f'_i, y'_i\bigr)
$$

where CE is the cross-entropy loss function.
Both losses are minimized using gradient descent.
It is worth mentioning that λ is the only
hyper-parameter of our adv method. It controls
how often noisy embeddings are added during
training. Higher values of λ increase the amount
of uncertainty around salient patterns in the
input representation of entities, hence preventing
the model from overfitting those patterns, and
therefore pushing it to rely more on context
information. We tried values of λ between 0.3 and
0.9, and found λ = 0.8 to be the best one based
on the CONLL and ONTONOTES development sets.
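The following condensed PyTorch sketch illustrates adv under simplifying
assumptions (a HuggingFace-style encoder that accepts inputs_embeds, K = 3
types, single-token mentions); it is our illustration of the idea, not the
implementation used in the experiments, which operates on IOB spans and trains
the noise parameters θ' separately from θ. All class and function names are ours.

# Sketch of the adv training procedure (illustrative, with stated assumptions).
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 3                        # number of entity types (PER, LOC, ORG)
NUM_SWITCHES = K * (K - 1)   # valid (true type -> noisy type) switches
LAMBDA = 0.8                 # probability of corrupting an eligible mention

def switch_index(t_true, t_noisy):
    """Map a (true, noisy) type pair to an index in 1..NUM_SWITCHES (0 = no switch)."""
    offset = t_noisy if t_noisy < t_true else t_noisy - 1
    return t_true * (K - 1) + offset + 1

def corrupt(types):
    """Sample a noisy type for each eligible mention with probability LAMBDA."""
    noisy = []
    for t in types:
        if t is not None and random.random() <= LAMBDA:
            noisy.append(random.choice([k for k in range(K) if k != t]))
        else:
            noisy.append(t)
    return noisy

class AdvNoiseTagger(nn.Module):
    def __init__(self, encoder, hidden_dim, num_labels):
        super().__init__()
        self.encoder = encoder
        # padding_idx=0 keeps a fixed zero vector for tokens whose label is unchanged.
        self.noise_emb = nn.Embedding(NUM_SWITCHES + 1, hidden_dim, padding_idx=0)
        self.true_head = nn.Linear(hidden_dim, num_labels)   # logits for true tags
        self.noisy_head = nn.Linear(hidden_dim, num_labels)  # extra logits f'

    def forward(self, input_embeds, switch_ids):
        # switch_ids: (batch, seq), 0 for unchanged tokens, >0 for corrupted mentions.
        input_embeds = input_embeds + self.noise_emb(switch_ids)
        hidden = self.encoder(inputs_embeds=input_embeds).last_hidden_state
        return self.true_head(hidden), self.noisy_head(hidden)

def total_loss(true_logits, noisy_logits, y, y_noisy):
    num_labels = true_logits.size(-1)
    l_true = F.cross_entropy(true_logits.view(-1, num_labels), y.view(-1))
    # L_noisy only counts positions whose label was actually switched.
    switched = (y_noisy != y).view(-1)
    if switched.any():
        l_noisy = F.cross_entropy(noisy_logits.view(-1, num_labels)[switched],
                                  y_noisy.view(-1)[switched])
    else:
        l_noisy = torch.zeros((), device=true_logits.device)
    return l_true + l_noisy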

5.4 Results

We trained models on CONLL and ONTONOTES,
and evaluated them on their respective TEST set.7
Recall that NRB and WTS are only used as auxil-
iary diagnosing sets. Table 2 shows the impact of
our training methods when fine-tuning the BERT-
large model (the one that performs best on NRB).
First, we observe that each training method
significantly improves the performance on NRB.

7Performances on DEV show very similar trends.


            CONLL                ONTONOTES
Method      Test  NRB   WTS     Test  NRB   WTS
BERT-lrg    92.8  75.6  98.6    89.9  75.4  95.1
+mask       92.9  82.9  98.4    89.8  77.3  96.5
+freeze     92.7  83.1  98.4    89.9  79.8  96.0
+adv        92.7  86.1  98.3    90.1  85.8  95.2
+f&m        92.8  85.5  97.8    89.9  80.6  95.9
+a&m        92.8  87.7  98.1    89.7  87.6  95.9
+a&f        92.7  88.4  98.2    90.0  88.1  95.7
+a&m&f      92.8  89.7  97.9    89.9  88.8  95.6

Table 2: Impact of training methods on
BERT-large models fine-tuned on CONLL or
ONTONOTES.

            CONLL                ONTONOTES
Method      Test  NRB   WTS     Test  NRB   WTS
E-LSTM      92.5  31.7  98.2    89.4  34.3  94.9
+mask       92.4  40.8  97.5    89.3  38.8  95.3
+adv        92.4  42.4  97.8    89.4  40.7  95.0
+a&m        92.4  45.7  96.8    89.3  46.6  93.7

Table 3: Impact of training methods on the
ELMo-LSTM trained on CONLL or ONTONOTES.

Adding adversarial noise is notably the best
performing method on NRB, with an additional
gain of 10.5 and 10.4 F1 points over the respective
baselines. On the other hand, we observe minor
variations on in-domain test sets, as well as on
WTS. The paired sample t-test (Cohen, 1996)
confirms that these variations are not statistically
significant (p > 0.05). After all, the number of
decisions that differ between the baseline and the
best model on a given in-domain set is less than 20.
Second, we observe that combining methods
always leads to improvements on NRB; the
best configuration being when we combine all 3
methods. It is interesting to note that combining
training methods leads to a performance on NRB
which does not depend much on the training set
used: CONLL (89.7) and ONTONOTES (88.8). This
suggests that name regularity bias is a modeling
issue, and not the effect of factors such as training
data size, domain, or type granularity.

In order to validate that our training methods
are not specific to the fine-tuning approach, we
replicated the same experiments with the ELMo-
LSTM. Table 3 shows the performance of the
mask and adv procedures (the freeze method
does not apply here). The results are in line
with those observed with BERT-large: significant


gains on NRB of 14 and 12 points for CONLL
and ONTONOTES models, respectively, and no sta-
tistically significant changes on in-domain test
sets. Again, combining training methods leads to
systematic gains on NRB (13 points on average).
Differently from fine-tuning BERT, we observe a
slight drop in performance of 1.2% on WTS when
both methods are used.

The performance of ELMo-LSTM on NRB
does not rival the one obtained by fine-tuning the
BERT-large model, which confirms that BERT
is a key factor to enhance robustness, even if in-
domain performance is not necessarily rewarded
(McCoy et al., 2019; Hendrycks et al., 2020).

6 Analysis

So far, we have shown that state-of-the-art mod-
els do suffer from name regularity bias, and we
proposed model-agnostic training methods that
are able to mitigate this bias to some extent.
In Section 6.1, we provide further evidence
that our training methods force the BERT-large
model to better concentrate on contextual cues. In
Section 6.2, we replicate the evaluation protocol of
Lin et al. (2020) in order to clear out the possibility
that our training methods are only valid on NRB.
Last, we perform extensive experiments on name
regularity bias under low resource (Section 6.3)
and multilingual (Section 6.4) settings.

6.1 Attention Heads

We leverage the attention map of BERT to better
understand how our method enhances context
encoding. To this end, we calculate the average
number of attention heads that point to the entity
mentions being predicted at each layer. We con-
duct this experiment on NRB with the BERT-large
model (24 layers with 16 attention heads at each
layer) fine-tuned on CONLL.

At each layer, we average the number of
heads which have their highest attention weight
(argmax) pointing to the entity name.8 Figure 5
shows the average number of attention heads
that point to an entity mention in the BERT-
large model fine-tuned without our methods, with
the adversarial noise method (adv), and with all
three methods.

8We used the weights of the first sub-token since NRB

only contains single word entities.
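A small sketch of this statistic is given below, assuming a HuggingFace BERT
checkpoint with output_attentions enabled and NRB-style single-token entities;
since the exact query positions used for the argmax are not detailed here, the
sketch aggregates the attention each token receives over all query positions,
which is one possible reading. The checkpoint names are placeholders.

# Count, per layer, the heads whose strongest attention target is the entity token.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-large-cased")   # placeholder checkpoint
model = AutoModel.from_pretrained("bert-large-cased", output_attentions=True)
model.eval()

@torch.no_grad()
def heads_pointing_to(sentence, entity_word):
    enc = tok(sentence, return_tensors="pt")
    attentions = model(**enc).attentions     # one (1, heads, seq, seq) tensor per layer
    # Index of the entity's first sub-token (the paper scores the first sub-token).
    first_piece = tok(entity_word, add_special_tokens=False).input_ids[0]
    ent_idx = enc.input_ids[0].tolist().index(first_piece)
    counts = []
    for layer_att in attentions:
        att = layer_att[0]                   # (heads, seq, seq)
        # A head "points to" the entity when the token receiving its largest
        # total attention mass is the entity sub-token (one reading of the statistic).
        received = att.sum(dim=1)            # (heads, seq): attention received per token
        counts.append(int((received.argmax(dim=-1) == ent_idx).sum()))
    return counts                            # number of heads per layer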


Figure 5: Average number of attention heads (y-axis)
pointing to NRB entity mentions at each layer (x-axis)
of the BERT-large model fine-tuned on CONLL.

Figure 6: Performance on NRB of BERT-large models
as a function of the number of sentences used to
fine-tune them.

Method              π(dev)   π(test)
BERT-large          23.45    25.46
+adv                31.98    31.99
+adv&mask           35.02    34.09
+adv&mask&freeze    40.39    38.62

Table 4: F1 scores of BERT-large models
fine-tuned on CONLL and evaluated on
randomly permuted versions of the dev and
test sets: π(dev) and π(test).

We observe an increasing number of heads
pointing to entity names when we get closer to
the output layer: at the bottom layers (left part of
the figure) only a few heads are pointing to entity
names, in contrast to the last 2 layers (right part)
where almost all heads do so. This observation
is in line with Jawahar et al. (2019), who show
that bottom and intermediate BERT layers mainly
encode lexical and syntactic information, whereas
top layers represent task-related information. Our
training methods lead to fewer heads at top layers
pointing to entity mentions, suggesting the model
is focusing more on contextual information.

6.2 Random Permutations

Following the protocol described in Lin et al.
(2020), we modified dev and test sets of standard
benchmarks by randomly permuting dataset-wise
mentions of entities, keeping the types untouched.
For example, the span of a specific mention of a
person can be replaced by a span of a location,
whenever it appears in the dataset. These ran-
domized tests are highly challenging, as discussed
in Section 2, since here the context is the only
available clue to solve the task, and many false
positive examples are introduced that way.
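A minimal sketch of this permutation protocol is shown below; the (tokens,
spans) data format and the helper name are ours, chosen for illustration.

# Swap mention surface forms dataset-wise while keeping the gold types in place.
import random

def permute_mentions(dataset, seed=13):
    """dataset: list of (tokens, spans) where spans = [(start, end, type), ...]."""
    rng = random.Random(seed)
    # Collect every mention surface form in the dataset, regardless of its type.
    surfaces = [tokens[s:e] for tokens, spans in dataset for (s, e, _) in spans]
    rng.shuffle(surfaces)
    permuted, cursor = [], 0
    for tokens, spans in dataset:
        new_tokens, new_spans, offset = list(tokens), [], 0
        for (s, e, t) in sorted(spans):
            repl = surfaces[cursor]
            cursor += 1
            new_tokens[s + offset:e + offset] = repl
            new_spans.append((s + offset, s + offset + len(repl), t))  # type unchanged
            offset += len(repl) - (e - s)
        permuted.append((new_tokens, new_spans))
    return permuted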

Table 4 shows the results of the BERT-large
model fine-tuned on CONLL and evaluated on
the permuted in-domain dev and test sets. F1

scores are much lower here, confirming this is
a hard testbed, but they do provide evidence of
the name regularity bias of BERT. Our training
methods improve the model F1 score by 17% and
13% on permuted dev and test sets, respectively,
an increase much in line with what we observed
on NRB.

6.3 Low Resource Setting

Similarly to Zhou et al. (2019) and Ding et al.
(2020), we simulate a low resource setting by ran-
domly sampling tiny subsets of the training data.
Since our focus is to measure the contextual learn-
ing ability of models, we first selected sentences
of CONLL training data that contain at least one
entity followed or preceded by 3 non-entity words.
Then, we randomly sampled k ∈ {100, 500,
1000, 2000} sentences9 with which we fine-tuned
BERT-large. Figure 6 shows the performance of
the resulting models on NRB. Expectedly, F1
scores of models fine-tuned with few examples
are rather low on NRB as well as on the in-domain
test set. Not shown in Figure 6, fine-tuning on 100
and 2000 sentences leads to performance of 14%
and 45%, respectively, on the CONLL test set.
Nevertheless, we observe that our training meth-
ods, and adv in particular, improve performances
on NRB even under extremely low resource set-
tings. On CONLL test and WTS sets, scores vary
in a range of ±0.5 and ±0.7, respectively, when
our methods are added to BERT-large.

6.4 Multilingual Setting

6.4.1 Experimental Protocol
For in-domain data, we use the German, Spanish,
and Dutch CONLL-2002 (Tjong Kim Sang, 2002)
NER datasets. Those benchmarks—also from the
news domain—come with a train/dev/test split,
and the training material is comparable in size

9{0.7, 3.5, 7.1, 14.3}% of the training sentences.


     NRB   WTS          NRB   WTS
de   37%   44%     fi   53%   62%
es   20%   22%     da   19%   24%
nl   20%   24%     hr   39%   48%
                   af   26%   32%

Table 5: Percentage of translated sentences from NRB and WTS discarded
for each language.

to the English CONLL dataset. In addition, we
experiment with four non-CONLL benchmarks:
Finnish (Luoma et al., 2020), Danish (Hvingelby
et al., 2020), Croatian (Ljubešić et al., 2018), and
Afrikaans (Eiselen, 2016) data. These corpora
have more diversified text genres, yet mainly
follow the CONLL annotation scheme.10 Finnish
and Afrikaans datasets have a size comparable to
English CONLL, Danish is 60% smaller, whereas
the Croatian one is twice as large. We use the provided
train/dev/test splits for Danish and Finnish, and
we randomly split (80/10/10) the Croatian and
Afrikaans datasets.

Because NRB and WTS are in English, we
designed a simple yet generic method for pro-
jecting them to another language. First, both test
sets are translated to the target language using
an online translation service. In order to ensure a
high-quality corpus, we eliminate a sentence if the
BLEU score (Papineni et al., 2002) between the
original (English) sentence and the back-translated
one is below 0.65.
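The filter can be sketched as follows, with a hypothetical translate() client
standing in for the (unspecified) online translation service and sacrebleu
providing sentence-level BLEU; note that sacrebleu scores are on a 0–100 scale,
so the 0.65 threshold becomes 65 here.

# Round-trip translation filter for projecting NRB/WTS to another language.
from sacrebleu.metrics import BLEU

bleu = BLEU(effective_order=True)

def keep_sentence(en_sentence, target_lang, translate, threshold=65.0):
    """translate(text, src, tgt) -> str is a placeholder for the MT service."""
    translated = translate(en_sentence, src="en", tgt=target_lang)
    back = translate(translated, src=target_lang, tgt="en")
    return bleu.sentence_score(back, [en_sentence]).score >= threshold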

Table 5 reports the percentage of discarded sen-
tences for each language. While for the Finnish
(fi), Croatian (hr), and German (de) languages
we remove a large proportion of sentences, we
found our translation approach simpler and more
systematic than generating an NRB corpus from
scratch for each language. The latter approach
depends on the robustness of the weak tagger, the
number of Wikipedia articles and disambiguation
pages per language, as well as the existence of
type information. This is left as future work.

For experiments with fine-tuning, we use
language-specific BERT models11 for German
(Chan et al., 2020), Spanish (Canete et al., 2020),

10The Finnish data is tagged with EVENT, PRODUCT,

and DATE in addition to the CONLL 4 classes.

11Language-specific models have been reported more
accurate than multilingual ones in a monolingual setting
(Martin et al., 2019; Le et al., 2020; Delobelle et al., 2020;
Virtanen et al., 2019).


Dutch (de Vries et al., 2019), Finnish (Virtanen
et al., 2019), Danish,12 Croatian (Ulčar and
Robnik-Šikonja, 2020), while we use mBERT
(Devlin et al., 2019) for Afrikaans.

For feature-based approaches, we use the same
architecture for ELMo-LSTM (Peters et al., 2018)
except that we replace English word embeddings
by language-specific ones: FastText (Bojanowski
et al., 2017) for static representations, and the
aforementioned BERT-base models for contextu-
alized ones.

6.4.2 Results

Table 6 reports the performances on test, NRB,
and WTS sets for both feature-based and fine-
tuning approaches with and without our training
methods. We used the hyper-parameters of the
English CONLL experiments with no further
tuning. We selected the best performing mod-
els based on development set scores, and report
average results on 5 runs.

Mainly due to implementation details and hy-
perparameter settings, our fine-tuned BERT-base
models perform better on the CONLL test sets
for German (83.8 vs. 80.4) and Dutch (91.8 vs.
90.0) and slightly worse on Spanish (88.0 vs.
88.4) compared to the results reported in their
respective BERT papers.

Consistent with the results obtained on English
for feature-based (Table 1) and fine-tuned
(Table 3) models, the latter approach performs
better on NRB, although by a smaller margin
compared to English (+37%). More precisely, we
observe a gain of +28% and +26% on German
and Croatian respectively, and a gain ranging
between 11% and 15% for other languages.

Nevertheless, our training methods lead to
systematic and often drastic improvements on
NRB coupled with a statistically nonsignificant
overall decrease on in-domain test sets. They do,
however, incur a slight but significant drop of
around 2 F1 score points on WTS for feature-
based models. Similar to what was previously
observed, the best scores on NRB are obtained
by BERT models when the training methods are
combined. For the Dutch language, we observe
that once trained with our methods, the type of
models used (feature-based vs. BERT fine-tuned)
leads to much less difference on NRB.

12https://github.com/botxo/nordic bert.


Model       German          Spanish         Dutch           Finnish         Danish          Croatian        Afrikaans
            TEST NRB  WTS   TEST NRB  WTS   TEST NRB  WTS   TEST NRB  WTS   TEST NRB  WTS   TEST NRB  WTS   TEST NRB  WTS
Feature-based
BERT-LSTM   78.9 36.4 84.2  85.6 59.9 90.8  84.9 45.4 85.7  76.0 38.9 84.5  76.4 42.6 78.1  78.0 28.4 79.3  76.2 39.7 65.8
+adv        78.2 44.1 82.8  85.0 65.8 90.2  84.3 57.8 83.5  75.1 52.9 81.0  75.4 47.2 76.9  77.5 35.2 75.5  75.7 42.3 63.3
+adv&mask   78.1 47.6 82.9  84.9 72.2 88.7  84.0 62.8 83.5  74.6 54.3 81.8  75.1 48.4 76.6  76.9 36.8 76.7  75.1 52.8 63.1
Fine-tuning
BERT-base   83.8 64.0 93.3  88.0 72.3 93.9  91.8 56.1 92.0  91.3 64.6 91.9  83.6 56.6 86.2  89.7 54.7 95.6  80.4 54.3 91.6
+adv        83.7 68.9 93.6  87.9 75.9 93.9  91.9 58.3 91.8  90.2 66.4 92.5  82.7 58.4 86.5  89.5 57.9 95.5  79.7 60.2 92.1
+a&m&f      83.2 73.3 94.0  87.4 81.6 93.7  91.2 63.6 91.0  89.8 67.4 92.7  82.3 63.1 85.4  88.8 59.6 94.9  79.4 64.2 91.6

Table 6: Mention level F1 scores of 7 multilingual models trained on their respective training data,
and tested on their respective in-domain test, NRB, and WTS sets.

Altogether, these results demonstrate that name
regularity bias is not specific to a particular lan-
guage, even if its degree of severity varies from
one language to another, and that the training
methods proposed notably mitigate this bias.

7 Conclusion

In this work, we focused on the name regularity
bias of NER models, a problem first discussed in
Lin et al. (2020). We propose NRB, a benchmark
we specifically designed to diagnose such a bias.
As opposed to existing strategies devised to mea-
sure it, NRB is composed of real sentences with
easy to identify mentions.

We show that current state-of-the-art models
perform from poorly (feature-based) to decently
(fine-tuned BERT) on NRB. In order to mit-
igate this bias, we propose a novel adversarial
training method based on adding some learnable
noise vectors to entity words. These learnable
vectors encourage the model to better incorporate
contextual information. We demonstrate that this
approach greatly improves the contextual ability
of existing models, and that it can be combined
with other training methods we proposed. Signif-
icant gains are observed in both low-resource and
multilingual settings. To foster research on NER
robustness, we encourage others to report results
on NRB and WTS.13

This study opens up new avenues of inves-
tigation. Conducting a large-scale multilingual
experiment, characterizing the name regularity
bias of more diversified morphological language
families is one of them, possibly leveraging mas-
sively multilingual resources such as WikiAnn
(Pan et al., 2017), Polyglot-NER (Al-Rfou et al.,

13English and multilingual NRB and WTS are available at
http://rali.iro.umontreal.ca/rali/?q=en
/wikipedia-nrb-ner.

2015), or Universal Dependencies (Nivre et al.,
2016). We can also develop a more challenging
NRB by selecting sentences with multi-word
entities.

Also, non-sequential labeling approaches for
NER like the ones of Li et al. (2020) and Yu et al.
(2020) have reported impressive results on both
flat and nested NER. We plan to measure their
bias on NRB and study the benefits of applying
our training methods to those approaches. Finally,
we want to investigate whether our adversarial
training method can be successfully applied to
other NLP tasks.

Acknowledgments

We are grateful to the reviewers of this work
for their constructive comments that greatly
contributed to improving this paper.

References

Oshin Agarwal, Yinfei Yang, Byron C. Wallace,
and Ani Nenkova. 2020a. Entity-switched
datasets: an approach to auditing the in-domain
robustness of named entity recognition models.
arXiv preprint arXiv:2004.04123.

Oshin Agarwal, Yinfei Yang, Byron C. Wallace,
and Ani Nenkova. 2020b. Interpretability anal-
ysis for named entity recognition to understand
system predictions and how they can im-
prove. arXiv preprint arXiv:2004.04564. DOI:
https://doi.org/10.1162/coli_a_00397

Alan Akbik, Duncan Blythe, and Roland Vollgraf.
2018. Contextual string embeddings for
sequence labeling. In Proceedings of the 27th
International Conference on Computational
Linguistics, pages 1638–1649.


Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi,
and Steven Skiena. 2015. Polyglot-ner: Massive
multilingual named entity recognition. In Pro-
ceedings of the 2015 SIAM International Con-
ference on Data Mining, pages 586–594. SIAM.
DOI: https://doi.org/10.1137/1.9781611974010.66

Julio Cesar Salinas Alvarado, Karin Verspoor, et
Timothy Baldwin. 2015. Domain adaption of
named entity recognition to support credit risk
assessment. In Proceedings of the Australasian
Language Technology Association Workshop
2015, pages 84–90.

Isabelle Augenstein, Leon Derczynski, and Kalina
Bontcheva. 2017. Generalisation in named en-
tity recognition: A quantitative analysis. Com-
puter Speech & Language, 44:61–83. DOI:
https://doi.org/10.1016/j.csl.2017.01.012

Sriram Balasubramanian, Naman Jain, Gaurav
Jindal, Abhijeet Awasthi, and Sunita Sarawagi.
2020. What’s in a name? Are BERT named
entity representations just as good for any other
name? arXiv preprint arXiv:2007.06897. DOI:
https://doi.org/10.18653/v1/2020.repl4nlp-1.24

Giannis Bekoulis, Johannes Deleu, Thomas
Demeester, and Chris Develder. 2018. Adver-
sarial training for multi-context joint entity and
relation extraction. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 2830–2836. DOI:
https://doi.org/10.18653/v1/D18-1307

Yonatan Belinkov, Adam Poliak, Stuart M.
Shieber, Benjamin Van Durme, and Alexander
M. Rush. 2019. On adversarial removal of
hypothesis-only bias in natural language
inference. In Proceedings of the Eighth Joint
Conference on Lexical and Computational
Semantics (*SEM 2019), pages 256–262. DOI:
https://doi.org/10.18653/v1/S19-1028

Gabriel Bernier-Colborne and Phillippe Langlais.
2020. HardEval: Focusing on challenging to-
kens to assess robustness of NER. In Pro-
ceedings of The 12th Language Resources
and Evaluation Conference, pages 1697–1704,

Marseille, France. European Language Re-
sources Association.

Piotr Bojanowski, Edouard Grave, Armand
Joulin, and Tomas Mikolov. 2017. Enriching
word vectors with subword information. Trans-
actions of the Association for Computational
Linguistics, 5:135–146. DOI: https://
doi.org/10.1162/tacl_a_00051

Kurt Bollacker, Colin Evans, Praveen Paritosh,
Tim Sturge, and Jamie Taylor. 2008.
Freebase: A collaboratively created graph
database for structuring human knowl-
edge. In Proceedings of the 2008 ACM
SIGMOD International Conference on Man-
agement of Data, pages 1247–1250. DOI:
https://doi.org/10.1145/1376616.1376746

Lasse Borgholt, Jakob D. Havtorn, Anders
Søgaard, Zeljko Agic, Lars Maaløe, and
Christian Igel. 2020. Do end-to-end speech
recognition models care about context? In
Proceedings of Interspeech. DOI: https://doi.org
/10.21437/Interspeech.2020-1750

José Cañete, Gabriel Chaperon, Rodrigo Fuentes,
and Jorge Pérez. 2020. Spanish pre-trained
BERT model and evaluation data. PML4DC
at ICLR, 2020.

2020. German’s

Branden Chan, Stefan Schweter, and Timo
Möller.
langue
model. arXiv preprint arXiv:2010.10906. EST CE QUE JE:
https://doi.org/10.18653/v1/2020
.coling-main.598

next

Ciprian Chelba, Tomas Mikolov, Mike Schuster,
Qi Ge, Thorsten Brants, Phillipp Koehn, and
Tony Robinson. 2014. One billion word bench-
mark for measuring progress in statistical
language modeling. In Fifteenth Annual Con-
ference of the International Speech Communi-
cation Association.

Yong Cheng, Lu Jiang, and Wolfgang Macherey.
2019. Robust neural machine translation with
doubly adversarial inputs. In Proceedings of
the 57th Annual Meeting of the Association for
Computational Linguistics, pages 4324–4333.
DOI: https://doi.org/10.18653/v1
/P19-1425


Christopher Clark, Mark Yatskar, and Luke
Zettlemoyer. 2019. Dont take the easy way
dehors: Ensemble based methods for avoiding
known dataset biases. In Proceedings of the
2019 Conference on Empirical Methods in
Natural Language Processing and the 9th
International Joint Conference on Natural
Language
(EMNLP-IJCNLP),
pages 4060–4073.

Processing

Christopher Clark, Mark Yatskar, and Luke
Zettlemoyer. 2020. Learning to model and
ignore dataset bias with mixed capacity
ensembles. arXiv preprint arXiv:2011.03856.

Paul R. Cohen. 1996. Empirical methods for
artificial intelligence. IEEE Intelligent Systems.

Xiang Dai and Heike Adel. 2020. An analysis
of simple data augmentation for named entity
reconnaissance. In Proceedings of the 28th Inter-
national Conference on Computational Lin-
guistics, pages 3861–3867.

Ishita Dasgupta, Demi Guo, Andreas Stuhlm¨uller,
Samuel J. Gershman, and Noah D. Homme bon.
2018. Evaluating compositionality in sentence
embeddings. arXiv preprint arXiv:1802.04302.

Erenay Dayanik and Sebastian Padó. 2020. Masking actor information leads to fairer political claims detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4385–4391. DOI: https://doi.org/10.18653/v1/2020.acl-main.404

Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. BERTje: A Dutch BERT model. arXiv preprint arXiv:1912.09582.

Pieter Delobelle, Thomas Winters, and Bettina Berendt. 2020. RobBERT: A Dutch RoBERTa-based language model. arXiv preprint arXiv:2001.06286. DOI: https://doi.org/10.18653/v1/2020.findings-emnlp.292

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Bosheng Ding, Linlin Liu, Lidong Bing, Canasai Kruengkrai, Thien Hai Nguyen, Shafiq Joty, Luo Si, and Chunyan Miao. 2020. DAGA: Data augmentation with a generation approach for low-resource tagging tasks. arXiv preprint arXiv:2011.01549. DOI: https://doi.org/10.18653/v1/2020.emnlp-main.488

Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the Association for Computational Linguistics, 2:477–490. DOI: https://doi.org/10.1162/tacl_a_00197

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36. DOI: https://doi.org/10.18653/v1/P18-2006

Roald Eiselen. 2016. Government domain named
entity recognition for South African languages.
In Proceedings of the Tenth International Con-
ference on Language Resources and Evaluation
(LREC’16), pages 3344–3348.

Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. 2016. Robustness of classifiers: From adversarial to random noise. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1632–1640.

Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166. DOI: https://doi.org/10.18653/v1/D19-1107

Abbas Ghaddar and Phillippe Langlais. 2016. Coreference in Wikipedia: Main concept resolution. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 229–238. DOI: https://doi.org/10.18653/v1/K16-1023

Abbas Ghaddar and Phillippe Langlais. 2017.
WiNER: A Wikipedia annotated corpus for
named entity recognition. In Proceedings of
the Eighth International Joint Conference on
Natural Language Processing (Volume 1: Long
Papers), pages 413–422.

Abbas Ghaddar and Philippe Langlais. 2018.
Transforming Wikipedia into a large-scale fine-
grained entity type corpus. In Proceedings
of the Eleventh International Conference on
Language Resources and Evaluation (LREC
2018).

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112. DOI: https://doi.org/10.18653/v1/N18-2017

He He, Sheng Zha, and Haohan Wang. 2019.
Unlearn dataset bias in natural language infer-
ence by fitting the residual. EMNLP-IJCNLP
2019, page 132. DOI: https://doi.org/10.18653/v1/D19-6115

Kenneth Heafield. 2011. KenLM: Faster and
smaller language model queries. In Proceedings
of the EMNLP 2011 Sixth Workshop on Sta-
tistical Machine Translation, pages 187–197.
Edinburgh, Scotland, United Kingdom.

Dan Hendrycks and Kevin Gimpel. 2017. A
baseline for detecting misclassified and out-
of-distribution examples in neural networks.
Proceedings of International Conference on
Learning Representations.

Dan Hendrycks, Xiaoyuan Liu, Eric Wallace,
Adam Dziedzic, Rishabh Krishnan, and Dawn
Song. 2020. Pretrained transformers improve
out-of-distribution robustness. arXiv preprint
arXiv:2004.06100. DOI: https://doi.org/10.18653/v1/2020.acl-main.244

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. DOI: https://doi.org/10.1162/neco.1997.9.8.1735, PMID: 9377276

Rasmus Hvingelby, Amalie Brogaard Pauli, Maria Barrett, Christina Rosted, Lasse Malm Lidegaard, and Anders Søgaard. 2020. DaNE: A named entity resource for Danish. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4597–4604.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What Does BERT Learn about the Structure of Language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657. DOI: https://doi.org/10.18653/v1/P19-1356

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77. DOI: https://doi.org/10.1162/tacl_a_00300

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2020. FlauBERT: Unsupervised language model pre-training for French. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 2479–2490.

Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal, and Yejin Choi. 2020. Adversarial filters of dataset biases. In International Conference on Machine Learning, pages 1078–1088. PMLR.

Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2020.
A unified MRC framework for named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5849–5859.

Hongyu Lin, Yaojie Lu, Jialong Tang, Xianpei
Han, Le Sun, Zhicheng Wei, and Nicholas Jing
Yuan. 2020. A rigorous study on named entity
reconnaissance: Can fine-tuning pretrained model
lead to the promised land? In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 7291–7300.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint
arXiv:1907.11692.

Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, and Tomaž Erjavec. 2018. Training corpus hr500k 1.0. Slovenian language resource repository CLARIN.SI.

Jouni Luoma, Miika Oinonen, Maria Pyykönen, Veronika Laippala, and Sampo Pyysalo. 2020. A broad-coverage corpus for Finnish named entity recognition. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4615–4624.

Rabeeh Karimi Mahabadi, Yonatan Belinkov, and James Henderson. 2020. End-to-end bias mitigation by modelling biases in corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8706–8716. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.acl-main.769

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In ACL (System Demonstrations), pages 55–60. DOI: https://doi.org/10.3115/v1/P14-5010

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2019. CamemBERT: A tasty French language model. arXiv preprint arXiv:1911.03894. DOI: https://doi.org/10.18653/v1/2020.acl-main.645

Stephen Mayhew, Gupta Nitish, and Dan Roth.
2020. Robust named entity recognition with
truecasing pretraining. In Proceedings of the
AAAI Conference on Artificial Intelligence,
pages 8480–8487. DOI: https://doi.org/10.1609/aaai.v34i05.6368

Stephen Mayhew, Tatiana Tsygankova, and Dan
Roth. 2019. ner and pos when nothing is capi-
talized. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 6257–6262. DOI: https://doi.org/10.18653/v1/D19-1650

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448. DOI: https://doi.org/10.18653/v1/P19-1334

Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, and Tal Linzen. 2020.
Syntactic data augmentation increases robust-
ness to inference heuristics. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 2339–2352.

Takeru Miyato, Andrew M. Dai, and Ian
Goodfellow. 2016. Adversarial training meth-
ods for semi-supervised text classification.
arXiv preprint arXiv:1605.07725.

Nafise Sadat Moosavi, Marcel de Boer, Prasetya
Ajie Utama, and Iryna Gurevych. 2020. Im-
proving robustness by augmenting training
sentences with predicate-argument structures.
arXiv preprint arXiv:2010.12510.

Thong Nguyen, Duy Nguyen, and Pramod Rao.
2020. Adaptive Name Entity Recognition
under highly unbalanced data. arXiv preprint
arXiv:2003.10296.

Joakim Nivre, Marie-Catherine De Marneffe, Jan Hajic, Filip Ginter, Yoav Goldberg, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Sebastian Padó, André Blessing, Nico Blokker, Erenay Dayanik, Sebastian Haunss, and Jonas Kuhn. 2019. Who sides with whom? Towards computational construction of discourse networks for political debates. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2841–2847.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. DOI: https://doi.org/10.3115/1073083.1073135

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. DOI: https://doi.org/10.18653/v1/N18-1202

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191. DOI: https://doi.org/10.18653/v1/S18-2023

Sameer Pradhan, Alessandro Moschitti, Nianwen
Xue, Olga Uryupina, and Yuchen Zhang. 2012.
CoNLL-2012 shared task: Modeling multi-
lingual unrestricted coreference in OntoNotes.
In Joint Conference on EMNLP and CoNLL-
Shared Task, pages 1–40.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1596374.1596399

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389–5400. PMLR.

Victor Sanh, Thomas Wolf, Yonatan Belinkov,
and Alexander M. Rush. 2020. Learning
from others’ mistakes: Avoiding dataset
biases without modeling them. arXiv preprint
arXiv:2012.01300.

Tal Schuster, Darsh Shah, Yun Jie Serene Yeo, Daniel Roberto Filizzola Ortiz, Enrico Santus, and Regina Barzilay. 2019. Towards debiasing fact verification models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3410–3416. DOI: https://doi.org/10.18653/v1/D19-1341

Michael L. Seltzer, Dong Yu, and Yongqiang Wang. 2013. An investigation of deep neural networks for noise robust speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7398–7402. IEEE. DOI: https://doi.org/10.1109/ICASSP.2013.6639100

Deven Santosh Shah, H. Andrew Schwartz, and
Dirk Hovy. 2020. Predictive biases in natural
language processing models: A conceptual
framework and overview. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 5248–5264.

Anders Søgaard. 2013. Part-of-speech tagging
with antagonistic adversaries. In Proceedings
of the 51st Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short
Papers), pages 640–644.

Benjamin Strauss, Bethany Toma, Alan Ritter,
Marie-Catherine de Marneffe, and Wei Xu.
2016. Results of the WNUT16 named entity
recognition shared task. In Proceedings of the
2nd Workshop on Noisy User-generated Text
(WNUT), pages 138–144.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819. DOI: https://doi.org/10.18653/v1/N18-1074

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002). DOI: https://doi.org/10.3115/1118853.1118877

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 142–147. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1119176.1119195

Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT: Less is more in multilingual models. arXiv preprint arXiv:2006.07890. DOI: https://doi.org/10.1007/978-3-030-58323-1_11

Prasetya Ajie Utama, Nafise Sadat Moosavi, and Iryna Gurevych. 2020a. Mind the trade-off: Debiasing NLU models without degrading the in-distribution performance. arXiv preprint arXiv:2005.00315.

Prasetya Ajie Utama, Nafise Sadat Moosavi, and Iryna Gurevych. 2020b. Towards debiasing NLU models from unknown biases. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7597–7610.

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni
Luoma, Juhani Luotolahti, Tapio Salakoski,
Filip Ginter, and Sampo Pyysalo. 2019.
Multilingual is not enough: BERT for Finnish.
arXiv preprint arXiv:1912.07076.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355. DOI: https://doi.org/10.18653/v1/W18-5446

Adina Williams, Nikita Nangia, and Samuel R.
Bowman. 2017. A broad-coverage challenge
corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426. DOI: https://doi.org/10.18653/v1/N18-1101

Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. 2020. DeeBERT: Dynamic early exiting for accelerating BERT inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2246–2251, Online. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.acl-main.204

Yadollah Yaghoobzadeh, Remi Tachet, Timothy J. Hazen, and Alessandro Sordoni. 2019.
Robust natural language inference models with example forgetting. arXiv preprint arXiv:1911.03861.

Juntao Yu, Bernd Bohnet, and Massimo Poesio.
2020. Named entity recognition as dependency
parsing. arXiv preprint arXiv:2005.07150.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104. DOI: https://doi.org/10.18653/v1/D18-1009

Xiangji Zeng, Yunliang Li, Yuchen Zhai, and Yin Zhang. 2020. Counterfactual generator: A weakly-supervised method for named entity recognition. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7270–7280. DOI: https://doi.org/10.18653/v1/2020.emnlp-main.590

Yuan Zhang, Jason Baldridge, and Luheng He. 2019. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1298–1308.

Changmeng Zheng, Yi Cai, Jingyun Xu, Ho-fung Leung, and Guandong Xu. 2019. A boundary-aware neural model for nested named entity recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 357–366. DOI: https://doi.org/10.18653/v1/D19-1034

Joey Tianyi Zhou, Hao Zhang, Di Jin, Hongyuan
Zhu, Meng Fang, Rick Siow Mong Goh,
and Kenneth Kwok. 2019. Dual adversarial
neural transfer for low-resource named entity
recognition. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 3461–3471.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Thomas Goldstein, and Jingjing Liu. 2019. FreeLB: Enhanced adversarial training for language understanding. arXiv preprint arXiv:1909.11764.
