Context-aware Adversarial Training for Name Regularity Bias in
Named Entity Recognition

Abbas Ghaddar, Philippe Langlais†, Ahmad Rashid, and Mehdi Rezagholizadeh

Huawei Noah’s Ark Lab, Montreal Research Center, Canada
†RALI/DIRO, Université de Montréal, Canada
abbas.ghaddar@huawei.com, felipe@iro.umontreal.ca
ahmad.rashid@huawei.com, mehdi.rezagholizadeh@huawei.com

Abstract

In this work, we examine the ability of NER
models to use contextual information when
predicting the type of an ambiguous entity.
We introduce NRB, a new testbed carefully
designed to diagnose Name Regularity Bias of
NER models. Our results indicate that all state-of-the-art
models we tested show such a bias; BERT fine-tuned models
significantly outperform feature-based (LSTM-CRF) ones on
NRB, despite having comparable (sometimes lower)
performance on standard benchmarks.

To mitigate this bias, we propose a novel
model-agnostic training method that adds learn-
able adversarial noise to some entity mentions,
thus enforcing models to focus more strongly
on the contextual signal, leading to significant
gains on NRB. Combining it with two other
training strategies, data augmentation and pa-
rameter freezing, leads to further gains.

1 Introduction

Recent advances in language model pre-training
(Peters et al., 2018; Devlin et al., 2019; Liu et al.,
2019) have greatly improved the performance of
many Natural Language Understanding (NLU)
tasks. Yet, several studies (McCoy et al., 2019;
Clark et al., 2019; Utama et al., 2020b) revealed
that state-of-the-art NLU models often make use
of surface patterns in the data that do not gen-
eralize well. Named-Entity Recognition (NER),
a downstream task that consists in identifying
textual mentions and classifying them into a
predefined set of types, is no exception.

The robustness of modern NER models has received
considerable attention recently (Mayhew et al., 2019;
Mayhew et al., 2020; Agarwal et al., 2020a; Zeng et al.,
2020; Bernier-Colborne and Langlais, 2020). Name Regularity
Bias (Lin et al., 2020; Agarwal et al., 2020b; Zeng et al.,
2020) in NER occurs when a model relies on a signal coming
from the entity name, and disregards evidence within the
local context. Figure 1 shows examples where
state-of-the-art models (Peters et al., 2018; Akbik et al.,
2018; Devlin et al., 2019) fail to exploit contextual
information. For instance, the entity Gonzales in the first
sentence of the figure is wrongly recognized as a person,
while the context clearly signals that it is a location
(city).

To better highlight this issue, we propose NRB,
a testbed designed to accurately diagnose name
regularity bias of NER models by harvesting
natural sentences from Wikipedia that contain
challenging entities, such as those in Figure 1.
This is different from previous work that evalu-
ated models on artificial data obtained by either
randomizing (Lin et al., 2020) or substituting
entities by ones from a pre-defined list (Agarwal
et al., 2020a). NRB is compatible with any annotation
scheme, and is intended to be used as an
auxiliary validation set.

We conduct experiments with the feature-
based LSTM-CRF architecture (Peters et al.,
2018; Akbik et al., 2018) and the BERT (Devlin
et al., 2019) fine-tuning approach trained on standard
benchmarks. The best LSTM-based model we tested is able to
correctly predict 38% of the entities in NRB. BERT-based
models perform much better (+37%), even if they (slightly)
underperform on in-domain development and test
sets. This mismatch in performance between NRB
and standard benchmarks indicates that context
awareness of models is not rewarded by existing
benchmarks, thus justifying NRB as an additional
validation set.

Transactions of the Association for Computational Linguistics, vol. 9, pp. 586–604, 2021. https://doi.org/10.1162/tacl_a_00386
Action Editor: Miguel Ballesteros. Submission batch: 10/2020; Revision batch: 1/2021; Published 7/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

We further propose a novel architecture-agnostic adversarial
training procedure (Miyato et al., 2016) in which learnable
noise vectors are added to named-entity words, weakening
their signal, thus encouraging the model to pay more
attention to contextual information. Applying it to both
feature-based LSTM-CRF and fine-tuned BERT models leads to
consistent gains on NRB (+13 points) while maintaining the
same level of performance on standard benchmarks.

Figure 1: Examples extracted from Wikipedia (title in bold)
that illustrate name regularity bias in NER. Entities of
interest are underlined, gold types are in blue superscript,
model predictions are in red subscript, and context
information is highlighted in purple. Models used in this
study disregard contextual information and rely instead on
some signal from the named-entity itself.

The remainder of the paper is organized as follows. We
discuss related works in Section 2. We describe how we built
NRB in Section 3, and its use in diagnosing named-entity
bias of state-of-the-art models in Section 4. In Section 5,
we present a novel adversarial training method that we
compare and combine with two simpler ones. We further
analyze these training methods in Section 6, and conclude in
Section 7.

2 Related Work

Robustness and out-of-distribution generalization have
always been a persistent concern in deep learning
applications such as computer vision (Szegedy et al., 2013;
Recht et al., 2019), speech processing (Seltzer et al.,
2013; Borgholt et al., 2020), and NLU (Søgaard, 2013;
Hendrycks and Gimpel, 2017; Ghaddar and Langlais, 2017;
Yaghoobzadeh et al., 2019; Hendrycks et al., 2020). One key
challenge behind this issue in NLU is the tendency of models
to quickly leverage surface form features and annotation
artifacts (Gururangan et al., 2018), which is often referred
to as dataset biases (Dasgupta et al., 2018; Shah et al.,
2020). We discuss related works along two axes: diagnosis
and mitigation.

2.1 Diagnosing Bias

A growing number of studies (Zellers et al., 2018; Poliak
et al., 2018; Geva et al., 2019; Utama et al., 2020b; Sanh
et al., 2020) are showing that NLU models rely heavily on
spurious correlations between output labels and surface
features (e.g., keywords, lexical overlap), impacting their
generalization performance. Therefore, considerable
attention has been paid to designing diagnostic benchmarks
where models relying on bias would perform poorly. For
instance, HANS (McCoy et al., 2019), FEVER Symmetric
(Schuster et al., 2019), and PAWS (Zhang et al., 2019) are
benchmarks that contain counterexamples to well-known biases
in the training data of textual entailment (Williams et al.,
2017), fact verification (Thorne et al., 2018), and
paraphrase identification (Wang et al., 2018), respectively.

Naturally, many entity names have a strong correlation with
a single type. Recent works have noted that over-relying on
entity name information negatively impacts NLU tasks.
Balasubramanian et al. (2020) found that substituting
named-entities in standard test sets of natural language
inference, coreference resolution, and grammar error
correction has a negative impact on those tasks. In
political claims detection (Padó et al., 2019), Dayanik and
Padó (2020) show that claims made by frequently occurring
politicians in the training data are better recognized than
those made by less frequent ones.

Recently, Zeng et al. (2020) and Agarwal et al. (2020b)
conducted two separate analyses on the decision making
mechanism of NER models. Both works found that context
tokens do contribute to system performance, but that entity
names play a major role in driving high performances.
Agarwal et al. (2020a) reported a performance drop in NER
models when entities in standard test sets are substituted
with other ones pulled from pre-defined lists. Concurrently,
Lin et al. (2020) conducted an empirical analysis on the
robustness of NER models in the open domain scenario. They
show that models are biased by strong entity name
regularity, and train/test overlap in standard benchmarks.
They observe a drop in performance of 34% when entity
mentions are randomly replaced by other mentions.

The aforementioned studies certainly demonstrate name
regularity bias. Yet, in many cases the entity mention is
the only key to infer its type, as in "James won the
league". Thus, randomly swapping entity names, as proposed
by Lin et al. (2020), typically introduces false positive
examples, which obscures observations. Furthermore,
creating artificial word sequences introduces a mismatch
between the pre-training and the fine-tuning phases of
large-scale language models. NER is also challenging because
of compounding factors such as entity boundary detection
(Zheng et al., 2019), rare words and emerging entities
(Strauss et al., 2016), document-level context (Durrett and
Klein, 2014), capitalization mismatch (Mayhew et al., 2019),
unbalanced datasets (Nguyen et al., 2020), and domain shift
(Alvarado et al., 2015; Augenstein et al., 2017). It is
unclear to us how randomizing mentions in a corpus, as
proposed by Lin et al. (2020), interferes with these
factors.

NRB gathers genuine entities that appear in nat-
ural sentences extracted from Wikipedia. Exam-
ples are selected so that entity boundaries are easy
to identify, and their types can be inferred from the
local context, thus avoiding compounding many
factors responsible for lack of robustness.

2.2 Mitigating Bias

The prevailing approach to address dataset bias consists in
adjusting the training loss for biased examples. A number of
recent studies (Clark et al., 2019; Belinkov et al., 2019;
He et al., 2019; Mahabadi et al., 2020; Utama et al., 2020a)
proposed to train a shallow model that exploits manually
designed biased features. A main model is then trained in an
ensemble with this pre-trained model, in order to discourage
the main model from adopting the naive strategy of the
shallow one.

Adversarial training (Miyato et al., 2016) is a
regularization method that has been shown to improve not
only robustness (Ebrahimi et al., 2018; Bekoulis et al.,
2018), but also generalization (Cheng et al., 2019; Zhu
et al., 2019) in NLU. It builds on the idea of adding
adversarial examples (Goodfellow et al., 2014; Fawzi et al.,
2016) to the training set, that is, small perturbations of
the data that can change the prediction of a classifier.
These perturbations for NLP tasks are done at the token
embedding level and are norm bounded. Typically, adversarial
training algorithms can be defined as a minmax optimization
problem wherein the adversarial examples are generated to
maximize the loss, while the model is trained to minimize
it.

Belinkov et al. (2019) used adversarial training to mitigate
the hypothesis-only bias in textual entailment models. Clark
et al. (2020) adversarially trained a low and a high
capacity model in an ensemble in order to ensure that the
latter model is focusing on patterns that should generalize
better. Dayanik and Padó (2020) used an extra adversarial
loss in order to encourage a political claims detection
model to learn more from samples with infrequent politician
names. Le Bras et al. (2020) proposed an adversarial
technique to filter out biased examples from training
material. Models trained on the filtered datasets show
improved out-of-distribution performances on various
computer vision and NLU tasks.

Data augmentation is another strategy for enhancing
robustness. It was successfully used in Min et al. (2020)
and Moosavi et al. (2020) to improve textual entailment
performances on the HANS benchmark. The former approach
proposes to append original training sentences with their
corresponding predicate-arguments triplets generated by a
semantic role labelling tagger, while the latter generates
new examples by applying syntactic transformations to the
original training instances.

Zeng et al. (2020) created new examples by randomly
replacing an entity by another one of the same type that
occurs in the training data. New examples are considered
valid if the type of the replaced entity is correctly
predicted by a NER model trained on the original dataset.
Similarly, Dai and Adel (2020) explored different entity
substitution techniques for data augmentation tailored to
NER. Both studies conclude that data augmentation techniques
based on entity substitution improve the overall
performances on low resource biomedical NER.

Studies discussed above have the potential to mitigate name
regularity bias of NER models. Yet, we are not aware of any
dedicated work that shows it is so. In this work, we propose
ways of mitigating name regularity bias for NER, including
an elaborate adversarial method that forces the model to
capture more signal from the context. Our methods do not
require an extra training stage, or to manually characterize
biased features. They are therefore conceptually simpler,
and can potentially be combined with any of the discussed
techniques. Moreover, our proposed methods are effective
under both low and high resource settings.

3 The NRB Benchmark

NRB is a diagnosing testbed exclusively dedicated
to name regularity bias in NER. To this end, it
gathers named-entities that satisfy 4 criteria:

1. Must be real-world entities within natural
sentences → We select sentences from
Wikipedia articles.

2. Must be compatible with any annotation
scheme → We restrict our focus on the 3 most
common types found in NER benchmarks:
person, location, and organization.

3. Boundary detection (segmentation) should
not be a bottleneck → We only select single
word entities that start with a capital letter.

4. Supporting evidence of the type must be
restricted to local context only (a window of
2 to 4 tokens) → We developed a primitive
context-only tagger to filter out entities with
no close-context signal.

The strategy used to gather examples in NRB is
illustrated in Figure 2. We first select Wikipedia
articles that are listed in a disambiguation page.
Disambiguation pages group different topics that
could be referred to by the same query term.1
The query term Bromwich in Figure 2 has its own
disambiguation page that contains a link to the
city of West Bromwich, West Bromwich Albion
Football Club, and Kenny Bromwich the rugby
league player.

We associate each article in a disambiguation
page to the entity type found in its corresponding
Freebase page (Bollacker et al., 2008), consid-
ering only articles whose Freebase type can be
mapped to a person, a location, or an organization.
We assume that occurrences of the query term within the
article are of this type. This assumption was found accurate
in previous work on Wikipedia distant supervision for NER
(Ghaddar and Langlais, 2016, 2018). The sentence in our
example is extracted from the Kenny Bromwich article, whose
Freebase type can be mapped to a person. Therefore, we
assume Bromwich in this sentence to be a person.

1 https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Disambiguation_pages.

Figure 2: Selection of a sentence in NRB.

To decide whether a sentence containing a query term is
worth being included in NRB, we rely on two NER taggers. One
is a popular NER system that provides a confidence score to
each prediction, and that acts as a weak supervisor; the
other is a context-only tagger we designed specifically (see
Section 3.1) to detect entities with a strong signal from
their local context. A sentence is selected if the query
term is incorrectly labeled with high confidence (score >
0.85) by the former tagger, while the latter one labels it
correctly with high confidence (a gap of at least 0.25 in
probability between the first and second predicted types).
This is the case of the sentence in Figure 2, where Bromwich
is incorrectly labeled as an organization by the weak
supervision tagger, but correctly labeled as a person by the
context-only tagger.
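
The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and constant names (keep_for_nrb, top_two, WEAK_CONFIDENCE, CONTEXT_GAP) are hypothetical, and the tagger outputs are assumed to be already available as a predicted type with a score (weak tagger) and a probability per type (context-only tagger). Only the two thresholds come from the text.

```python
from typing import Dict, Tuple

WEAK_CONFIDENCE = 0.85  # score threshold for the weak supervision tagger (from the text)
CONTEXT_GAP = 0.25      # minimum gap between the top-2 context-only probabilities (from the text)


def top_two(probs: Dict[str, float]) -> Tuple[str, float, float]:
    """Return the best type, its probability, and the runner-up probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    best_type, best_p = ranked[0]
    second_p = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_type, best_p, second_p


def keep_for_nrb(gold: str, weak_type: str, weak_score: float,
                 context_probs: Dict[str, float]) -> bool:
    """Hypothetical helper: True if a sentence passes both selection tests."""
    # The weak tagger must be wrong, and confident about it.
    weak_confidently_wrong = weak_type != gold and weak_score > WEAK_CONFIDENCE
    # The context-only tagger must be right, with a clear margin.
    ctx_type, best_p, second_p = top_two(context_probs)
    context_confidently_right = ctx_type == gold and (best_p - second_p) >= CONTEXT_GAP
    return weak_confidently_wrong and context_confidently_right
```

With made-up scores, a call such as keep_for_nrb("PER", "ORG", 0.93, {"PER": 0.6, "ORG": 0.25, "LOC": 0.15}) accepts the sentence, whereas lowering the weak tagger score to 0.80 or narrowing the context gap below 0.25 rejects it.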

3.1 Implementation

We used the Stanford CoreNLP (Manning et al.,
2014) tagger as our weak supervision tagger and
developed a simple yet efficient method to build a
context-only tagger. For this, we first applied the
Stanford tagger to the entire Wikipedia dump and
replaced all entity mentions identified by their tag.
Then, we train a 5-gram language model on the
resulting corpus using kenLM (Heafield, 2011).

Figure 3: Illustration of a language model used as a
context-only tagger.

Figure 3 illustrates how this model is deployed
as an entity tagger: The mention is replaced by
an empty slot and the language model is queried
for each type. We rank the tags using the per-
plexity score given by the model to the resulting
sentences, then we normalize those scores to get
a probability distribution over types.
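
As a rough sketch of this querying step, the Python function below fills the mention slot with each candidate tag, scores the resulting sentence, and normalizes the scores. Several things here are assumptions rather than details from the paper: the perplexity function is passed in as a plain callable (for instance a wrapper around the trained language model), the name context_only_probs is hypothetical, and the normalization (inverse perplexity divided by the sum of inverse perplexities) is one simple choice, since the text does not specify how scores are normalized.

```python
from typing import Callable, Dict, List

TYPES = ["PER", "LOC", "ORG"]


def context_only_probs(tokens: List[str], mention_index: int,
                       perplexity: Callable[[str], float],
                       types: List[str] = TYPES) -> Dict[str, float]:
    """Score each type by the perplexity of the sentence with the mention replaced by its tag.

    Assumption: lower perplexity means a better fit, so we normalize inverse
    perplexities. The paper does not state the exact normalization.
    """
    inverse = {}
    for tag in types:
        # Replace the mention (a single token here) by the candidate tag.
        filled = tokens[:mention_index] + [tag] + tokens[mention_index + 1:]
        inverse[tag] = 1.0 / perplexity(" ".join(filled))
    total = sum(inverse.values())
    return {tag: value / total for tag, value in inverse.items()}


def toy_perplexity(sentence: str) -> float:
    """Stand-in for a real language model; the values are invented for illustration."""
    table = {"PER": 50.0, "LOC": 200.0, "ORG": 200.0}
    for tag, ppl in table.items():
        if tag in sentence.split():
            return ppl
    return 1000.0


# Made-up sentence; index 0 is the mention slot.
probs = context_only_probs(["Bromwich", "scored", "two", "tries"], 0, toy_perplexity)
```

The returned dictionary sums to one, so the gap between the first and second ranked types can be compared directly against a threshold such as the 0.25 used for selecting NRB sentences.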

We downloaded the Wikipedia dump of June 2020, which
contains 30k disambiguation pages. These pages contain links
to 263k articles, of which only 107k (40%) have a type in
Freebase that can be mapped to the 3 types of interest. The
Stanford tagger identified 440k entities that match the
query term of the disambiguation pages. The thresholds
discussed previously were chosen to select around 5000 of
the most challenging examples in terms of name regularity
bias. This figure aligns with the number of entities present
in the test set of the well-studied CONLL benchmark (Tjong
Kim Sang and De Meulder, 2003).

We assessed the annotation quality by asking a human to
filter out noisy examples. A sentence was removed if it
contains an annotation error, or if the type of the query
term cannot be inferred from the local context. Only 1.3% of
the examples were removed, which confirms the accuracy of
our automatic procedure. NRB is composed of 5275 examples,
and each sentence contains a single annotation (see Figure 1
for examples).

3.2 Control Set (WTS)

In addition to NRB, we collected a set of domain
control sentences—called WTS for WITNESS—that
contain the very same query terms selected in
NRB, but that were correctly labeled by both
the Stanford (score > 0.85) and the context-only
taggers. We selected examples with a small gap
(< 0.1) between the first and second ranked type assigned to
the query term by the latter tagger. Thus, examples in WTS
should be easy to tag. For example, because Obama the
Japanese city (see Figure 3) is selected among the query
terms in NRB, we added an instance of Obama the president.
Performing poorly on such examples2 indicates a domain shift
between NRB (Wikipedia) and whatever dataset a model is
trained on (we call it the in-domain corpus). WTS is
composed of 5192 sentences that have also been manually
checked.

2 That is, a system that fails to tag Obama the president as a person.

4 Diagnosing Bias

4.1 Data

To be comparable with state-of-the-art models, we consider
two standard benchmarks for NER: CONLL-2003 (Tjong Kim Sang
and De Meulder, 2003) and ONTONOTES 5.0 (Pradhan et al.,
2012), which include 4 and 18 types of named-entities,
respectively. ONTONOTES is 4 times larger than CONLL, and
both benchmarks mainly cover the news domain. We run
experiments on the official train/dev/test splits, and
report mention-level F1 scores, following previous work.
Since in NRB, there is only one entity per sentence to
annotate, a system is evaluated on its ability to correctly
identify the boundaries of this entity and its type. When we
train on ONTONOTES (18 types) and evaluate on NRB (3 types),
we perform type mapping using the scheme of Augenstein
et al. (2017).

4.2 Systems

Following Devlin et al. (2019), we term all approaches that
learn the encoder from scratch as feature-based, as opposed
to the ones that fine-tune a pre-trained model for the
downstream task. We conduct experiments using 3
feature-based and 2 fine-tuning approaches for NER:

• Flair-LSTM An LSTM-CRF model that uses FLAIR (Akbik et al.,
2018) contextualized embeddings as main features.

• ELMo-LSTM The LSTM-CRF tagging model of Peters et al.
(2018) that uses ELMo contextualized embeddings at the input
layer.

• BERT-LSTM Similar to the previous model, but replacing
ELMo by a representation gathered from the last four layers
of BERT.

• BERT-base The fine-tuning approach proposed by Devlin
et al. (2019) using the BERT-base model.

• BERT-large The fine-tuning approach using the BERT-large
model.

                      CONLL                        ONTONOTES
Model         Dev    Test   NRB    WTS     Dev    Test   NRB    WTS
Feature-based
Flair-LSTM    -      93.03  27.56  99.58   -      89.06  33.67  93.98
ELMo-LSTM     96.69  92.47  31.65  98.24   88.31  89.38  34.34  94.90
BERT-LSTM     95.94  91.94  38.34  98.08   86.12  87.28  43.07  92.04
Fine-tuning
BERT-base     96.18  92.19  75.54  98.67   87.23  88.19  75.34  94.22
BERT-large    96.90  92.86  75.55  98.51   89.26  89.93  75.41  95.06

Table 1: Mention level F1 scores of models on CONLL and
ONTONOTES, as well as on NRB and WTS.

We used Flair-LSTM off-the-shelf,3 and re-implemented other
approaches using the default settings proposed in the
respective papers. For our reimplementations, we used early
stopping based on performance on the development set, and
report average performance over 5 runs. For BERT-based
solutions, we adopt spanBERT (Joshi et al., 2020) as a
backbone model because it was found by Li et al. (2020) to
perform better on NER.

3 https://github.com/flairNLP/flair.

4.3 Results

Table 1 shows the mention level F1 score of the systems
considered. FLAIR-LSTM and BERT-large are the best
performing models on in-domain test sets, the maximum gap
with other models being 1.1 and 2.7 on CONLL and ONTONOTES
respectively. These figures are in line with previous work.
What is more interesting is the performance on NRB.
Feature-based models do poorly; Flair-LSTM underperforms the
other models (F1 score of 27.6 and 33.7 when trained on
CONLL and ONTONOTES respectively).
Fine-tuned BERT models clearly perform better (around 75),
but remain far from in-domain results (92.9 and 89.9 on
CONLL and ONTONOTES, respectively). Domain shift is not a
reason for those results, since the performances on WTS are
rather high (92 or higher). Furthermore, we found that the
boundary detection (segmentation) performance on NRB is
above 99.2% across all settings. Because errors made on NRB
are neither due to segmentation nor to domain shift, they
must be imputed to name regularity bias of models.

It is worth noting that BERT-LSTM outperforms ELMo-LSTM on
NRB, despite underperforming on in-domain test sets. This
may be because BERT was pre-trained on Wikipedia (same
domain as NRB), while ELMo embeddings were trained on the
One Billion Word corpus (Chelba et al., 2014). Also, we
observe that switching from BERT-base to BERT-large, or
training on 4 times more data (CONLL versus ONTONOTES) does
not help on NRB. This suggests that name regularity bias is
neither a data nor a model capacity issue.

4.4 Feature-based vs. Fine-tuning

In this section, we analyze reasons for the drastic
superiority of fine-tuned models on NRB. First, the large
gap between BERT-LSTM and BERT-base on NRB suggests that
this is not related to the representations being used at the
input layer. Second, we tested several configurations of
ELMo-LSTM where we scale up the number of LSTM layers and
hidden units. We observed a degradation of performance on
dev, test, and NRB sets, mostly due to over-parameterized
models. We also trained 9-, 6-, and 4-layer BERT-base
models,4 and still noticed a large advantage of BERT models
on NRB.5 This suggests that the higher capacity of BERT
alone cannot explain all the gains. Third, since by design,
evidence on the entity type in NRB resides within the local
context, it is unlikely that gains on this set come from the
ability of Transformers (Vaswani et al., 2017) to better
handle long dependencies than LSTM (Hochreiter and
Schmidhuber, 1997).
To further validate this statement, we fine-tuned BERT
models with randomly initialized weights, except the
embedding layer. We noticed that this time, the performances
on NRB fall into the same range as those of feature-based
models, and we observed a drastic decrease (12%–15%) on
standard benchmarks. These observations are in keeping with
results from Hendrycks et al. (2020) on the
out-of-distribution robustness of fine-tuning pre-trained
transformers, and also confirm observations made by Agarwal
et al. (2020b).

4 We used early exit (Xin et al., 2020) at the kth layer.
5 The 4-layer model has 53M parameters and performs 52% on NRB.

From these analyses, we conclude that the Masked Language
Model (MLM) objective (Devlin et al., 2019) that the BERT
models were pre-trained with is a key factor driving
superior performance of the fine-tuned models on NRB. In
most cases, the target word is masked or randomly selected,
therefore the model must rely on the context to predict the
correct target, which is what a model should do to correctly
predict the type of entities in NRB. We think that in
fine-tuning, training for a few epochs with a small learning
rate helps the model to preserve the contextual behavior
induced by the MLM objective.

Nevertheless, fine-tuned models recording at best an F1
score of 75.6 on NRB do show some name regularity bias, and
fail to capture useful local contextual information.

5 Mitigating Bias

In this section, we investigate training procedures that are
designed to enhance the contextual awareness of a model,
leading to better performance on NRB without impacting
in-domain performance. These training procedures are not
supposed to use any external data.
In fact, NRB is only used as a diagnosing corpus, once the
model is trained. We propose 3 training procedures that can
be combined; two of them are architecture-agnostic, and one
is specific to fine-tuning BERT.

5.1 Entity Masking

Inspired by the masking strategy applied during the
pre-training phase of BERT, we propose a data augmentation
approach that introduces a special [MASK] token in some of
the training examples. Specifically, we search for entities
in the training material that are preceded or followed by 3
non-entity words. This criterion applies to 35% and 39% of
entities in the training data of CONLL and ONTONOTES,
respectively. For each such entity, we create a new training
example (new sentence) by replacing the entity by [MASK],
thus forcing the model to infer the type of masked tokens
from the context. We call this procedure mask.

5.2 Parameter Freezing

Another simple strategy, specific to fine-tuning BERT,
consists of freezing part of the network. More precisely, we
freeze the bottom half of BERT, including the embedding
layer. The intuition is to preserve part of the
predicting-by-context mechanism that BERT has acquired
during the pre-training phase. This training procedure is
expected to enforce the contextual ability of the model,
thus adding to our analysis on the critical role of the MLM
objective in pre-training BERT. We name this method freeze.

5.3 Adversarial Noise

We propose an adversarial learning algorithm that makes
entity type patterns in the input representation less
reliable for the model, thus enforcing it to rely more
aggressively on the context. To do so, we add a learnable
adversarial noise vector (only) to the input representation
of entities. We refer to this method as adv.

Let T = {t_1, t_2, ..., t_K} be a predefined set of types
such as PER, LOC, and ORG in our case. Let x = x_1, x_2, ...,
x_n be the input sequence of length n, y = y_1, y_2, ..., y_n
be the gold label sequence following the IOB6 tagging
scheme, and y′ = y′_1, y′_2, ..., y′_n be a sequence obtained
by adding noise to y at the mention-level, that is, by
randomly replacing the type of mentions in y with some noisy
type sampled from T.

Let Y_ij(t) = y_i, ..., y_j be a mention of type t ∈ T,
spanning the sequence of indices i to j in y. We derive a
noisy mention Y′_ij in y′ from Y_ij(t) as follows:

$$
Y'_{ij} =
\begin{cases}
Y_{ij}(t'), \quad t' \sim \mathrm{Cat}_{\gamma \in T \setminus \{t\}}\left(\gamma \mid \xi = \tfrac{1}{K-1}\right) & \text{if } p \sim U(0,1) \le \lambda \\
Y_{ij}(t) & \text{otherwise}
\end{cases}
$$

where λ is a threshold parameter, U(0, 1) refers to the
uniform distribution in the range [0, 1], Cat(γ | ξ =
1/(K−1)) is the categorical distribution whose outcomes are
equally likely with the probability of ξ, and the set
T \ {t} = {t′ : t′ ∈ T ∧ t′ ≠ t} stands for the set T
excluding type t.

6 Naturally applies to other schemes, such as BILOU that Ratinov and Roth (2009) found more informative.

Figure 4: Illustration of our adversarial method applied on
the entity New York. First, we generate a noisy type (PER),
and then add a learnable noise embedding (LOC→PER) to the
input representation of that entity. This will make entity
patterns (hashed rectangles) unreliable for the model, hence
forcing it to collect evidence (dotted arrow) from the
context. The noise embedding matrix and the noise label
projection layer weights (dotted rectangle) are trained
independently from the model parameters.

The above procedure only applies to the entities that are
preceded or followed by 3 context words. For instance, in
Figure 4, we produce a noisy type for New York (PER), but
not for John (p > λ). Also, note that we generate a
different sequence y′ from y at each training epoch.
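
The mention-level noising step can be illustrated with a short Python sketch. It is a minimal reading of the definition above, not the authors' implementation: mentions are represented as hypothetical (start, end, type) triples, the eligibility flags (whether a mention is preceded or followed by 3 context words) are assumed to be computed elsewhere, and the function name noisy_mentions is invented for the example.

```python
import random
from typing import List, Tuple

TYPES = ["PER", "LOC", "ORG"]
Mention = Tuple[int, int, str]  # (start index, end index, type); hypothetical representation


def noisy_mentions(mentions: List[Mention], eligible: List[bool],
                   lam: float, rng: random.Random) -> List[Mention]:
    """Return a copy of the mentions where some types are switched to a noisy type."""
    result = []
    for (start, end, gold_type), ok in zip(mentions, eligible):
        # p ~ U(0, 1); the type is switched only when p <= lambda.
        if ok and rng.random() <= lam:
            # Uniform choice among the K - 1 other types.
            noisy_type = rng.choice([t for t in TYPES if t != gold_type])
            result.append((start, end, noisy_type))
        else:
            result.append((start, end, gold_type))
    return result


# A fresh call per epoch yields a different noisy sequence for the same sentence.
rng = random.Random(13)
gold = [(0, 0, "PER"), (4, 5, "LOC")]
epoch_noise = noisy_mentions(gold, eligible=[False, True], lam=0.8, rng=rng)
```

Mention boundaries are never changed, only the type, so the noisy tag sequence can be rebuilt from these triples with the same tagging scheme as the gold one.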

Prossimo, we define a learnable noisy embedding
matrix E(cid:4) ∈ Rm×d where m = |T | × (|T | 1) È
the number of valid type switching possibilities,
and d is the dimension of the input representations
of x. For each token with a noisy label, we add
the corresponding noisy embedding to its input
representation. For other tokens, we simply add
a zero vector of size d. As depicted in Figure 4,
the noisy type of the entity New York is PER,
therefore we add the noise embedding at index
LOC → PER to its input representation.
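To make the indexing concrete, here is a small sketch of how the |T| × (|T| − 1) type switches could be mapped to rows of the noise embedding matrix and added to token representations. The helper names and the plain-list representation of vectors are assumptions for illustration; in a real model the matrix would be a trainable parameter.

```python
def build_switch_index(types):
    """Map every ordered pair (true_type, noisy_type), true != noisy, to a row index."""
    pairs = [(s, t) for s in types for t in types if s != t]
    return {pair: row for row, pair in enumerate(pairs)}

def add_noise(inputs, true_types, noisy_types, switch_index, noise_matrix):
    """Add the noise row to tokens whose type was switched; add zeros elsewhere.

    inputs:       list of d-dimensional token vectors.
    true_types:   gold type per token (None for non-entity tokens).
    noisy_types:  noisy type per token (None for non-entity tokens).
    noise_matrix: list of m rows of size d (stands in for the learnable matrix).
    """
    d = len(inputs[0])
    out = []
    for vec, t, t_noisy in zip(inputs, true_types, noisy_types):
        if t is not None and t_noisy != t:
            row = noise_matrix[switch_index[(t, t_noisy)]]
        else:
            row = [0.0] * d  # zero vector for tokens without a noisy label
        out.append([v + r for v, r in zip(vec, row)])
    return out
```

With the four CONLL types, `build_switch_index` yields 12 rows, one per valid switch such as LOC → PER.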

Then, the input representation of the sequence is
fed to an encoder followed by an output layer, such
as LSTM-CRF in Peters et al. (2018), or BERT-
Softmax in Devlin et al. (2019). First, we extend
the aforementioned models by generating an extra
logit $f'$ using a projection layer parametrized by
$W'$ and followed by a softmax function. As shown
in Figure 4, for each token the model produces
two logits relative to the true and noisy tags. Then,
we train the entire model to minimize two losses:
$L_{true}(\theta)$ and $L_{noisy}(\theta')$, where $\theta$ is the original set
of parameters and $\theta' = \{E', W'\}$ is the extra set
we added (dotted boxes in Figure 4). $L_{true}(\theta)$ is
the regular loss on the true tags, while $L_{noisy}(\theta')$
is the loss on the noisy tags defined as follows:

$$
L_{noisy}(\theta') = \sum_{i=1}^{n} \mathbb{1}(y'_i \neq y_i) \, \mathrm{CE}(f'_i, y'_i)
$$

where CE is the cross-entropy loss function.
Both losses are minimized using gradient descent.
It is worth mentioning that λ is the only
hyper-parameter of our adv method. It controls
how often noisy embeddings are added during
training. Higher values of λ increase the amount
of uncertainty around salient patterns in the
input representation of entities, hence preventing
the model from overfitting those patterns, and
therefore pushing it to rely more on context
information. We tried values of λ between 0.3 and
0.9, and found λ = 0.8 to be the best one based
on CONLL and ONTONOTES development sets.
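The noisy-tag loss only counts tokens whose noisy tag differs from the gold tag. A minimal pure-Python sketch is given below; the function names and the list-of-logits format are assumptions for illustration, and an actual implementation would use the tensor operations of the training framework.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` against the class index `target`."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def noisy_loss(noisy_logits, gold_tags, noisy_tags):
    """Sum the cross-entropy over tokens whose noisy tag differs from the gold tag."""
    return sum(
        cross_entropy(f_i, y_noisy)
        for f_i, y, y_noisy in zip(noisy_logits, gold_tags, noisy_tags)
        if y_noisy != y  # indicator: token carries a switched label
    )
```

Tokens that keep their gold tag contribute nothing, so a sentence without any switched mention has a noisy loss of zero.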

5.4 Results

We trained models on CONLL and ONTONOTES,
and evaluated them on their respective TEST set.7
Recall that NRB and WTS are only used as auxil-
iary diagnosing sets. Table 2 shows the impact of
our training methods when fine-tuning the BERT-
large model (the one that performs best on NRB).
First, we observe that each training method
significantly improves the performance on NRB.

7Performances on DEV show very similar trends.

                 CONLL               ONTONOTES
Method        Test  NRB   WTS     Test  NRB   WTS
BERT-lrg      92.8  75.6  98.6    89.9  75.4  95.1
+mask         92.9  82.9  98.4    89.8  77.3  96.5
+freeze       92.7  83.1  98.4    89.9  79.8  96.0
+adv          92.7  86.1  98.3    90.1  85.8  95.2
+F&M          92.8  85.5  97.8    89.9  80.6  95.9
+A&M          92.8  87.7  98.1    89.7  87.6  95.9
+A&F          92.7  88.4  98.2    90.0  88.1  95.7
+A&M&F        92.8  89.7  97.9    89.9  88.8  95.6

Table 2: Impact of training methods on
BERT-large models fine-tuned on CONLL or
ONTONOTES.

                 CONLL               ONTONOTES
Method        Test  NRB   WTS     Test  NRB   WTS
E-LSTM        92.5  31.7  98.2    89.4  34.3  94.9
+mask         92.4  40.8  97.5    89.3  38.8  95.3
+adv          92.4  42.4  97.8    89.4  40.7  95.0
+A&M          92.4  45.7  96.8    89.3  46.6  93.7

Table 3: Impact of training methods on the
ELMo-LSTM trained on CONLL or ONTONOTES.

Adding adversarial noise is notably the best
performing method on NRB, with an additional
gain of 10.5 and 10.4 F1 points over the respective
baselines. On the other hand, we observe minor
variations on in-domain test sets, as well as on
WTS. The paired sample t-test (Cohen, 1996)
confirms that these variations are not statistically
significant (p > 0.05). After all, the number of
decisions that differ between the baseline and the
best model on a given in-domain set is less than 20.
Second, we observe that combining methods
always leads to improvements on NRB, the
best configuration being when we combine all 3
methods. It is interesting to note that combining
training methods leads to a performance on NRB
which does not depend much on the training set
used: CONLL (89.7) and ONTONOTES (88.8). This
suggests that name regularity bias is a modeling
issue, and not the effect of factors such as training
data size, domain, or type granularity.

In order to validate that our training methods
are not specific to the fine-tuning approach, we
replicated the same experiments with the ELMo-
LSTM. Table 3 shows the performance of the
mask and adv procedures (the freeze method
does not apply here). The results are in line
with those observed with BERT-large: significant
gains on NRB of 14 and 12 points for CONLL
and ONTONOTES models, respectively, and no
statistically significant changes on in-domain test
sets. Again, combining training methods leads to
systematic gains on NRB (13 points on average).
Differently from fine-tuning BERT, we observe a
slight drop in performance of 1.2% on WTS when
both methods are used.

The performance of ELMo-LSTM on NRB
does not rival the one obtained by fine-tuning the
BERT-large model, which confirms that BERT
is a key factor to enhance robustness, even if in-
domain performance is not necessarily rewarded
(McCoy et al., 2019; Hendrycks et al., 2020).

6 Analysis

So far, we have shown that state-of-the-art mod-
els do suffer from name regularity bias, and we
proposed model-agnostic training methods that
are able to mitigate this bias to some extent.
In Section 6.1, we provide further evidence
that our training methods force the BERT-large
model to better concentrate on contextual cues. In
Section 6.2, we replicate the evaluation protocol of
Lin et al. (2020) in order to rule out the possibility
that our training methods are only valid on NRB.
Last, we perform extensive experiments on name
regularity bias under low resource (Section 6.3)
and multilingual (Section 6.4) settings.

6.1 Attention Heads

We leverage the attention map of BERT to better
understand how our method enhances context
encoding. To this end, we calculate the average
number of attention heads that point to the entity
mentions being predicted at each layer. We
conduct this experiment on NRB with the BERT-large
model (24 layers with 16 attention heads at each
layer) fine-tuned on CONLL.

At each layer, we average the number of
heads which have their highest attention weight
(argmax) pointing to the entity name.8 Figure 5
shows the average number of attention heads
that point to an entity mention in the BERT-
large model fine-tuned without our methods, with
the adversarial noise method (adv), and with all
three methods.
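The head-counting measure can be sketched as below. This is only an illustration under stated assumptions: attention weights are given as nested lists indexed as `attention[layer][head][query][key]`, and the query position whose attention row is inspected is passed explicitly, since the text does not spell out which row is used. The function names are hypothetical.

```python
def heads_pointing_to_entity(attention, query_pos, entity_pos):
    """Count, for each layer, the heads whose argmax attention lands on the entity.

    attention[layer][head][query][key] holds attention weights (assumed layout).
    query_pos:  position whose attention row is inspected (an assumption).
    entity_pos: position of the first sub-token of the entity name.
    """
    counts = []
    for layer in attention:
        n = 0
        for head in layer:
            row = head[query_pos]
            argmax = max(range(len(row)), key=row.__getitem__)
            if argmax == entity_pos:
                n += 1
        counts.append(n)
    return counts

def average_per_layer(per_sentence_counts):
    """Average the per-layer head counts over all sentences."""
    n_layers = len(per_sentence_counts[0])
    n_sent = len(per_sentence_counts)
    return [sum(c[l] for c in per_sentence_counts) / n_sent for l in range(n_layers)]
```

For BERT-large, each per-sentence result would hold 24 values, each between 0 and 16.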

8We used the weights of the first sub-token since NRB
only contains single word entities.


Figure 5: Average number of attention heads (y-axis)
pointing to NRB entity mentions at each layer (x-axis)
of the BERT-large model fine-tuned on CONLL.

Figure 6: Performance on NRB of BERT-large models
as a function of the number of sentences used to
fine-tune them.

Method               π(dev)   π(test)
BERT-large           23.45    25.46
+adv                 31.98    31.99
+adv&mask            35.02    34.09
+adv&mask&freeze     40.39    38.62

Table 4: F1 scores of BERT-large models
fine-tuned on CONLL and evaluated on
randomly permuted versions of the dev and
test sets: π(dev) and π(test).

We observe an increasing number of heads
pointing to entity names when we get closer to
the output layer: at the bottom layers (left part of
the figure) only a few heads are pointing to entity
names, in contrast to the last 2 layers (right part)
where almost all heads do so. This observation
is in line with Jawahar et al. (2019), who show
that bottom and intermediate BERT layers mainly
encode lexical and syntactic information, whereas
top layers represent task-related information. Our
training methods lead to fewer heads at top layers
pointing to entity mentions, suggesting the model
is focusing more on contextual information.

6.2 Random Permutations

Following the protocol described in Lin et al.
(2020), we modified dev and test sets of standard
benchmarks by randomly permuting dataset-wise
mentions of entities, keeping the types untouched.
For instance, the span of a specific mention of a
person can be replaced by a span of a location,
whenever it appears in the dataset. These ran-
domized tests are highly challenging, as discussed
in Section 2, since here the context is the only
available clue to solve the task, and many false
positive examples are introduced that way.
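One way to picture the dataset-wise permutation is the sketch below. It is an illustration only, not the protocol's reference code: sentences are assumed to be lists of `(text, type)` chunks with `type` set to `None` for non-entity text, and the function name is hypothetical. Every distinct mention string is mapped to another mention string, the same mapping is applied wherever the mention occurs, and the gold types stay untouched.

```python
import random

def permute_mentions(sentences, rng):
    """Replace each distinct mention string by another one, consistently across the dataset.

    sentences: list of sentences, each a list of (text, type) chunks;
               type is None for non-entity text (assumed format).
    """
    names = sorted({text for sent in sentences for text, typ in sent if typ is not None})
    shuffled = names[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(names, shuffled))  # one global mapping for the whole dataset
    return [
        [(mapping[text], typ) if typ is not None else (text, typ) for text, typ in sent]
        for sent in sentences
    ]
```

Because the mapping ignores types, a person slot may end up filled with a location name, which is what makes the context the only reliable clue.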

Table 4 shows the results of the BERT-large
model fine-tuned on CONLL and evaluated on
the permuted in-domain dev and test sets. F1
scores are much lower here, confirming this is
a hard testbed, but they do provide evidence of
the name regularity bias of BERT. Our training
methods improve the model F1 score by about 17 and
13 points on permuted dev and test sets, respectively,
an increase much in line with what we observed
on NRB.

6.3 Low Resource Setting

Similarly to Zhou et al. (2019) and Ding et al.
(2020), we simulate a low resource setting by ran-
domly sampling tiny subsets of the training data.
Since our focus is to measure the contextual learn-
ing ability of models, we first selected sentences
of CONLL training data that contain at least one
entity followed or preceded by 3 non-entity words.
Then, we randomly sampled k ∈ {100, 500,
1000, 2000} sentences9 with which we fine-tuned
BERT-large. Figure 6 shows the performance of
the resulting models on NRB. Expectedly, F1
scores of models fine-tuned with few examples
are rather low on NRB as well as on the in-domain
test set. Not shown in Figure 6, fine-tuning on 100
and 2000 sentences leads to performance of 14%
and 45%, respectively, on the CONLL test set.
Nevertheless, we observe that our training
methods, and adv in particular, improve performances
on NRB even under extremely low resource
settings. On CONLL test and WTS sets, scores vary
in a range of ±0.5 and ±0.7, respectively, when
our methods are added to BERT-large.
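The sentence selection used to build the low-resource subsets can be sketched as follows, assuming IOB-style tag lists where `"O"` marks non-entity words. The function names and the `(tokens, tags)` example format are assumptions for illustration, and adjacent entities are treated as one run for the purpose of the context check.

```python
import random

def has_contextual_entity(tags, window=3):
    """True if some entity run is preceded or followed by `window` consecutive "O" tags."""
    n = len(tags)
    i = 0
    while i < n:
        if tags[i] == "O":
            i += 1
            continue
        j = i
        while j < n and tags[j] != "O":
            j += 1  # advance to the end of the run of entity tags
        before = i >= window and all(t == "O" for t in tags[i - window:i])
        after = j + window <= n and all(t == "O" for t in tags[j:j + window])
        if before or after:
            return True
        i = j
    return False

def sample_low_resource(dataset, k, rng, window=3):
    """Sample up to k eligible (tokens, tags) examples without replacement."""
    eligible = [ex for ex in dataset if has_contextual_entity(ex[1], window)]
    return rng.sample(eligible, min(k, len(eligible)))
```

With `k` in {100, 500, 1000, 2000}, this would produce the four training subsets described above.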

6.4 Multilingual Setting

6.4.1 Experimental Protocol
For in-domain data, we use the German, Spanish,
and Dutch CONLL-2002 (Tjong Kim Sang, 2002)
NER datasets. Those benchmarks—also from the
news domain—come with a train/dev/test split,
and the training material is comparable in size

9{0.7, 3.5, 7.1, 14.3}% of the training sentences.


      NRB   WTS         NRB   WTS
de    37%   44%    fi   53%   62%
es    20%   22%    da   19%   24%
nl    20%   24%    hr   39%   48%
                   af   26%   32%

Table 5: Percentage of translated
sentences from NRB and WTS discarded
for each language.

to the English CONLL dataset. In addition, we
experiment with four non-CONLL benchmarks:
Finnish (Luoma et al., 2020), Danish (Hvingelby
et al., 2020), Croatian (Ljubešić et al., 2018), and
Afrikaans (Eiselen, 2016) data. These corpora
have more diversified text genres, yet mainly
follow the CONLL annotation scheme.10 Finnish
and Afrikaans datasets have comparable size to
English CONLL, Danish is 60% smaller, while
the Croatian is twice as large. We use the provided
train/dev/test splits for Danish and Finnish, and
we randomly split (80/10/10) the Croatian and
Afrikaans datasets.

Because NRB and WTS are in English, we
designed a simple yet generic method for
projecting them to another language. First, both test
sets are translated to the target language using
an online translation service. In order to ensure a
high quality corpus, we eliminate a sentence if the
BLEU score (Papineni et al., 2002) between the
original (English) sentence and the back translated
one is below 0.65.
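The back-translation filter can be sketched as a small function. Everything passed in here is a placeholder: `translate`, `back_translate`, and `score` are hypothetical callables standing in for the online translation service and a BLEU implementation, neither of which is specified beyond what the text says.

```python
def filter_by_back_translation(sentences, translate, back_translate, score, threshold=0.65):
    """Keep (source, translation) pairs whose back-translation scores at least `threshold`.

    translate:      callable mapping an English sentence to the target language (placeholder).
    back_translate: callable mapping the translation back to English (placeholder).
    score:          callable returning a similarity in [0, 1], e.g. BLEU (placeholder).
    """
    kept = []
    for src in sentences:
        tgt = translate(src)
        # A sentence is discarded when the score falls below the threshold.
        if score(src, back_translate(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept
```

The share of sentences dropped by such a filter is what a per-language discard percentage would measure.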

Table 5 reports the percentage of discarded
sentences for each language. While for the Finnish
(fi), Croatian (hr), and German (de) languages
we remove a large proportion of sentences, we
found our translation approach simpler and more
systematic than generating an NRB corpus from
scratch for each language. The latter approach
depends on the robustness of the weak tagger, the
number of Wikipedia articles and disambiguation
pages per language, as well as the existence of
type information. This is left as future work.

For experiments with fine-tuning, we use
language-specific BERT models11 for German
(Chan et al., 2020), Spanish (Canete et al., 2020),

10The Finnish data is tagged with EVENT, PRODUCT,
and DATE in addition to the CONLL 4 classes.

11Language-specific models have been reported more
accurate than multilingual ones in a monolingual setting
(Martin et al., 2019; Le et al., 2020; Delobelle et al., 2020;
Virtanen et al., 2019).


Dutch (de Vries et al., 2019), Finnish (Virtanen
et al., 2019), Danish,12 and Croatian (Ulčar and
Robnik-Šikonja, 2020), while we use mBERT
(Devlin et al., 2019) for Afrikaans.

For feature-based approaches, we use the same
architecture for ELMo-LSTM (Peters et al., 2018)
except that we replace English word embeddings
by language-specific ones: FastText (Bojanowski
et al., 2017) for static representations, and the
aforementioned BERT-base models for contextu-
alized ones.

6.4.2 Results

Table 6 reports the performances on test, NRB,
and WTS sets for both feature-based and fine-
tuning approaches with and without our training
methods. We used the hyper-parameters of the
English CONLL experiments with no further
tuning. We selected the best performing
models based on development set scores, and report
average results over 5 runs.

Mainly due to implementation details and hy-
perparameter settings, our fine-tuned BERT-base
models perform better on the CONLL test sets
for German (83.8 vs. 80.4) and Dutch (91.8 vs.
90.0) and slightly worse on Spanish (88.0 vs.
88.4) compared to the results reported in their
respective BERT papers.

Consistent with the results obtained on English
for feature-based (Table 1) and fine-tuned
(Table 3) models, the latter approach performs
better on NRB, although by a smaller margin
compared to English (+37%). More precisely, we
observe a gain of +28% and +26% on German
and Croatian respectively, and a gain ranging
between 11% and 15% for other languages.

Nevertheless, our training methods lead to
systematic and often drastic improvements on
NRB coupled with a statistically nonsignificant
overall decrease on in-domain test sets. They do,
however, incur a slight but significant drop of
around 2 F1 score points on WTS for feature-
based models. Similar to what was previously
observed, the best scores on NRB are obtained
by BERT models when the training methods are
combined. For the Dutch language, we observe
that once trained with our methods, the type of
models used (feature-based vs. BERT fine-tuned)
leads to much less difference on NRB.

12https://github.com/botxo/nordic_bert.


Model         German            Spanish           Dutch             Finnish           Danish            Croatian          Afrikaans
              TEST  NRB   WTS   TEST  NRB   WTS   TEST  NRB   WTS   TEST  NRB   WTS   TEST  NRB   WTS   TEST  NRB   WTS   TEST  NRB   WTS
Feature-based
BERT-LSTM     78.9  36.4  84.2  85.6  59.9  90.8  84.9  45.4  85.7  76.0  38.9  84.5  76.4  42.6  78.1  78.0  28.4  79.3  76.2  39.7  65.8
+adv          78.2  44.1  82.8  85.0  65.8  90.2  84.3  57.8  83.5  75.1  52.9  81.0  75.4  47.2  76.9  77.5  35.2  75.5  75.7  42.3  63.3
+adv&mask     78.1  47.6  82.9  84.9  72.2  88.7  84.0  62.8  83.5  74.6  54.3  81.8  75.1  48.4  76.6  76.9  36.8  76.7  75.1  52.8  63.1
Fine-tuning
BERT-base     83.8  64.0  93.3  88.0  72.3  93.9  91.8  56.1  92.0  91.3  64.6  91.9  83.6  56.6  86.2  89.7  54.7  95.6  80.4  54.3  91.6
+adv          83.7  68.9  93.6  87.9  75.9  93.9  91.9  58.3  91.8  90.2  66.4  92.5  82.7  58.4  86.5  89.5  57.9  95.5  79.7  60.2  92.1
+A&M&F        83.2  73.3  94.0  87.4  81.6  93.7  91.2  63.6  91.0  89.8  67.4  92.7  82.3  63.1  85.4  88.8  59.6  94.9  79.4  64.2  91.6

Table 6: Mention level F1 scores of 7 multilingual models trained on their respective training data,
and tested on their respective in-domain test, NRB, and WTS sets.

Altogether, these results demonstrate that name
regularity bias is not specific to a particular lan-
guage, even if its degree of severity varies from
one language to another, and that the training
methods proposed notably mitigate this bias.

7 Conclusion

In this work, we focused on the name regularity
bias of NER models, a problem first discussed in
Lin et al. (2020). We propose NRB, a benchmark
we specifically designed to diagnose such a bias.
As opposed to existing strategies devised to mea-
sure it, NRB is composed of real sentences with
easy to identify mentions.

We show that current state-of-the-art models
perform from poorly (feature-based) to decently
(fine-tuned BERT) on NRB. In order to mitigate
this bias, we propose a novel adversarial
training method based on adding some learnable
noise vectors to entity words. These learnable
vectors encourage the model to better incorporate
contextual information. We demonstrate that this
approach greatly improves the contextual ability
of existing models, and that it can be combined
with other training methods we proposed. Signif-
icant gains are observed in both low-resource and
multilingual settings. To foster research on NER
robustness, we encourage others to report results
on NRB and WTS.13

This study opens up new avenues of
investigation. Conducting a large-scale multilingual
experiment, characterizing the name regularity
bias of more diversified morphological language
families, is one of them, possibly leveraging
massively multilingual resources such as WikiAnn
(Pan et al., 2017), Polyglot-NER (Al-Rfou et al.,

13English and multilingual NRB and WTS are available at
http://rali.iro.umontreal.ca/rali/?q=en/wikipedia-nrb-ner.

2015), or Universal Dependencies (Nivre et al.,
2016). We can also develop a more challenging
NRB by selecting sentences with multi-word
entities.

Also, non-sequential labeling approaches for
NER like the ones of Li et al. (2020) and Yu et al.
(2020) have reported impressive results on both
flat and nested NER. We plan to measure their
bias on NRB and study the benefits of applying
our training methods to those approaches. Finally,
we want to investigate whether our adversarial
training method can be successfully applied to
other NLP tasks.

Acknowledgments

We are grateful to the reviewers of this work
for their constructive comments that greatly
contributed to improving this paper.

References

Oshin Agarwal, Yinfei Yang, Byron C. Wallace,
and Ani Nenkova. 2020a. Entity-switched
datasets: an approach to auditing the in-domain
robustness of named entity recognition models.
arXiv preprint arXiv:2004.04123.

Oshin Agarwal, Yinfei Yang, Byron C. Wallace,
and Ani Nenkova. 2020b. Interpretability
analysis for named entity recognition to understand
system predictions and how they can
improve. arXiv preprint arXiv:2004.04564. DOI:
https://doi.org/10.1162/coli_a_00397

Alan Akbik, Duncan Blythe, and Roland Vollgraf.
2018. Contextual string embeddings for
sequence labeling. In Proceedings of the 27th
International Conference on Computational
Linguistics, pages 1638–1649.


Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi,
and Steven Skiena. 2015. Polyglot-ner: Massive
multilingual named entity recognition. In
Proceedings of the 2015 SIAM International
Conference on Data Mining, pages 586–594. SIAM.
DOI: https://doi.org/10.1137/1.9781611974010.66

Julio Cesar Salinas Alvarado, Karin Verspoor, and
Timothy Baldwin. 2015. Domain adaption of
named entity recognition to support credit risk
assessment. In Proceedings of the Australasian
Language Technology Association Workshop
2015, pages 84–90.

Isabelle Augenstein, Leon Derczynski, and Kalina
Bontcheva. 2017. Generalisation in named en-
tity recognition: A quantitative analysis. Com-
puter Speech & Language, 44:61–83. DOI:
https://doi.org/10.1016/j.csl.2017
.01.012

Sriram Balasubramanian, Naman Jain, Gaurav
Jindal, Abhijeet Awasthi, and Sunita Sarawagi.
2020. What’s in a name? Are BERT named
entity representations just as good for any other
name? arXiv preprint arXiv:2007.06897. DOI:
https://doi.org/10.18653/v1/2020
.repl4nlp-1.24

Giannis Bekoulis, Johannes Deleu, Thomas
Demeester, and Chris Develder. 2018.
Adversarial training for multi-context joint entity and
relation extraction. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 2830–2836. DOI:
https://doi.org/10.18653/v1/D18-1307

Yonatan Belinkov, Adam Poliak, Stuart M.
Shieber, Benjamin Van Durme, and Alexander
M. Rush. 2019. On adversarial removal of
hypothesis-only bias in natural language
inference. In Proceedings of the Eighth Joint
Conference on Lexical and Computational
Semantics (SEM 2019), pages 256–262. DOI:
https://doi.org/10.18653/v1/S19-1028

Gabriel Bernier-Colborne and Philippe Langlais.
2020. HardEval: Focusing on challenging
tokens to assess robustness of NER. In
Proceedings of The 12th Language Resources
and Evaluation Conference, pages 1697–1704,
Marseille, France. European Language
Resources Association.

Piotr Bojanowski, Edouard Grave, Armand
Joulin, and Tomas Mikolov. 2017. Enriching
word vectors with subword information.
Transactions of the Association for Computational
Linguistics, 5:135–146. DOI:
https://doi.org/10.1162/tacl_a_00051

Kurt Bollacker, Colin Evans, Praveen Paritosh,
Tim Sturge, and Jamie Taylor. 2008.
Freebase: A collaboratively created graph
database for structuring human knowledge.
In Proceedings of the 2008 ACM
SIGMOD International Conference on
Management of Data, pages 1247–1250. DOI:
https://doi.org/10.1145/1376616.1376746

Lasse Borgholt, Jakob D. Havtorn, Anders
Søgaard, Zeljko Agic, Lars Maaløe, and
Christian Igel. 2020. Do end-to-end speech
recognition models care about context? In
Proceedings of Interspeech.

José Canete, Gabriel Chaperon, Rodrigo Fuentes,
and Jorge Pérez. 2020. Spanish pre-trained
bert model and evaluation data. PML4DC
at ICLR, 2020. DOI: https://doi.org
/10.21437/Interspeech.2020-1750

Branden Chan, Stefan Schweter, and Timo
Möller. 2020. German’s next language
model. arXiv preprint arXiv:2010.10906. DOI:
https://doi.org/10.18653/v1/2020.coling-main.598

Ciprian Chelba, Tomás Mikolov, Mike Schuster,
Qi Ge, Thorsten Brants, Phillipp Koehn, and
Tony Robinson. 2014. One billion word bench-
mark for measuring progress in statistical
language modeling. In Fifteenth Annual Con-
ference of the International Speech Communi-
cation Association.

Yong Cheng, Lu Jiang, and Wolfgang Macherey.
2019. Robust neural machine translation with
doubly adversarial inputs. In Proceedings of
the 57th Annual Meeting of the Association for
Computational Linguistics, pages 4324–4333.
DOI: https://doi.org/10.18653/v1/P19-1425


Christopher Clark, Mark Yatskar, and Luke
Zettlemoyer. 2019. Don't take the easy way
out: Ensemble based methods for avoiding
known dataset biases. In Proceedings of the
2019 Conference on Empirical Methods in
Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 4060–4073.

Christopher Clark, Mark Yatskar, and Luke
Zettlemoyer. 2020. Learning to model and
ignore dataset bias with mixed capacity
ensembles. arXiv preprint arXiv:2011.03856.

Paul R. Cohen. 1996. Empirical methods for
artificial intelligence. IEEE Intelligent Systems.

Xiang Dai and Heike Adel. 2020. An analysis
of simple data augmentation for named entity
recognition. In Proceedings of the 28th Inter-
national Conference on Computational Lin-
guistics, pages 3861–3867.

Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller,
Samuel J. Gershman, and Noah D. Goodman.
2018. Evaluating compositionality in sentence
embeddings. arXiv preprint arXiv:1802.04302.

Erenay Dayanik and Sebastian Padó. 2020.
Masking actor information leads to fairer political
claims detection. In Proceedings of the 58th
Annual Meeting of the Association for
Computational Linguistics, pages 4385–4391. DOI:
https://doi.org/10.18653/v1/2020.acl-main.404

Wietse de Vries, Andreas van Cranenburgh,
Arianna Bisazza, Tommaso Caselli, Gertjan
van Noord,
and Malvina Nissim. 2019.
Bertje: A Dutch BERT model. arXiv preprint
arXiv:1912.09582.

Pieter Delobelle, Thomas Winters, and Bettina
Berendt. 2020. Robbert: A Dutch roberta-based
language model. arXiv preprint arXiv:2001
.06286. DOI: https://doi.org/10.18653
/v1/2020.findings-emnlp.292

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186.

Bosheng Ding, Linlin Liu, Lidong Bing,
Canasai Kruengkrai, Thien Hai Nguyen,
Shafiq Joty, Luo Si, and Chunyan Miao.
2020. Daga: Data augmentation with a
generation approach for low-resource tagging
tasks. arXiv preprint arXiv:2011.01549. DOI:
https://doi.org/10.18653/v1/2020.emnlp-main.488

Greg Durrett and Dan Klein. 2014. A joint model
for entity analysis: Coreference, typing, and
linking. Transactions of the Association for
Computational Linguistics, 2:477–490. DOI:
https://doi.org/10.1162/tacl_a_00197

Javid Ebrahimi, Anyi Rao, Daniel Lowd,
and Dejing Dou. 2018. Hotflip: White-box
adversarial examples for text classification.
In Proceedings of the 56th Annual Meeting
of the Association for Computational
Linguistics (Volume 2: Short Papers), pages 31–36.
DOI: https://doi.org/10.18653/v1
/P18-2006

Roald Eiselen. 2016. Government domain named
entity recognition for south african languages.
In Proceedings of the Tenth International Con-
ference on Language Resources and Evaluation
(LREC’16), pages 3344–3348.

Alhussein Fawzi, Seyed-Mohsen Moosavi-
Dezfooli, and Pascal Frossard. 2016. Robust-
ness of classifiers: from adversarial to random
noise. In Proceedings of the 30th International
Conference on Neural Information Processing
Systems, pages 1632–1640.

Mor Geva, Yoav Goldberg, and Jonathan Berant.
2019. Are we modeling the task or the
annotator? An investigation of annotator bias
in natural language understanding datasets.
In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 1161–1166. DOI:
https://doi.org/10.18653/v1/D19-1107


Abbas Ghaddar and Philippe Langlais. 2016.
Coreference in Wikipedia: Main concept
resolution. In Proceedings of The 20th
SIGNLL Conference on Computational
Natural Language Learning, pages 229–238. DOI:
https://doi.org/10.18653/v1/K16-1023

Abbas Ghaddar and Philippe Langlais. 2017.
Winer: A Wikipedia annotated corpus for
named entity recognition. In Proceedings of
the Eighth International Joint Conference on
Natural Language Processing (Volume 1: Long
Papers), pages 413–422.

Abbas Ghaddar and Philippe Langlais. 2018.
Transforming Wikipedia into a large-scale
fine-grained entity type corpus. In Proceedings
of the Eleventh International Conference on
Language Resources and Evaluation (LREC
2018).

Ian J. Goodfellow, Jonathon Shlens, and
Christian Szegedy. 2014. Explaining and
harnessing adversarial examples. arXiv preprint
arXiv:1412.6572.

Suchin Gururangan, Swabha Swayamdipta, Omer
Levy, Roy Schwartz, Samuel Bowman, and
Noah A. Smith. 2018. Annotation artifacts in
natural language inference data. In Proceedings
of the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 2 (Short Papers), pages 107–112. DOI:
https://doi.org/10.18653/v1/N18-2017

He He, Sheng Zha, and Haohan Wang. 2019.
Unlearn dataset bias in natural language infer-
ence by fitting the residual. EMNLP-IJCNLP
2019, page 132. DOI: https://doi.org
/10.18653/v1/D19-6115

Kenneth Heafield. 2011. KenLM: Faster and
smaller language model queries. In Proceedings
of the EMNLP 2011 Sixth Workshop on Sta-
tistical Machine Translation, pages 187–197.
Edinburgh, Scotland, United Kingdom.

Dan Hendrycks and Kevin Gimpel. 2017. A
baseline for detecting misclassified and out-
of-distribution examples in neural networks.
Proceedings of International Conference on
Learning Representations.

Dan Hendrycks, Xiaoyuan Liu, Eric Wallace,
Adam Dziedzic, Rishabh Krishnan, and Dawn
Song. 2020. Pretrained transformers improve
out-of-distribution robustness. arXiv preprint
arXiv:2004.06100. DOI: https://doi.org
/10.18653/v1/2020.acl-main.244

Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural computation,
9(8):1735–1780. DOI: https://doi.org
/10.1162/neco.1997.9.8.1735, PMID:
9377276

Rasmus Hvingelby, Amalie Brogaard Pauli,
Maria Barrett, Christina Rosted, Lasse Malm
Lidegaard, and Anders Søgaard. 2020. Dane:
A named entity resource for Danish. In
Proceedings of the 12th Language Resources
and Evaluation Conference, pages 4597–4604.

Ganesh Jawahar, Benoît Sagot, and Djamé
Seddah. 2019. What Does BERT Learn about
the Structure of Language? In Proceedings of
the 57th Annual Meeting of the Association for
Computational Linguistics, pages 3651–3657.
DOI: https://doi.org/10.18653/v1/P19-1356

Mandar Joshi, Danqi Chen, Yinhan Liu,
Daniel S. Weld, Luke Zettlemoyer, and Omer
Levy. 2020. Spanbert: Improving pre-training
by representing and predicting spans.
Transactions of the Association for Computational
Linguistics, 8:64–77. DOI:
https://doi.org/10.1162/tacl_a_00300

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne,
Maximin Coavoux, Benjamin Lecouteux,
Alexandre Allauzen, Benoit Crabbé, Laurent
Besacier, and Didier Schwab. 2020. Flaubert:
Unsupervised language model pre-training for
French. In Proceedings of The 12th
Language Resources and Evaluation Conference,
pages 2479–2490.

Ronan Le Bras, Swabha Swayamdipta,
Chandra Bhagavatula, Rowan Zellers, Matthew
Peters, Ashish Sabharwal, and Yejin Choi.
2020. Adversarial filters of dataset biases.
In International Conference on Machine
Learning, pages 1078–1088. PMLR.

Xiaoya Li, Jingrong Feng, Yuxian Meng,
Qinghong Han, Fei Wu, and Jiwei Li. 2020.


A unified MRC framework for named entity
recognition. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics, pages 5849–5859.

Hongyu Lin, Yaojie Lu, Jialong Tang, Xianpei
Han, Le Sun, Zhicheng Wei, and Nicholas Jing
Yuan. 2020. A rigorous study on named entity
recognition: Can fine-tuning pretrained model
lead to the promised land? In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 7291–7300.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. Roberta: A robustly optimized
bert pretraining approach. arXiv preprint
arXiv:1907.11692.

Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk
Batanović, and Tomaž Erjavec. 2018. Training
corpus hr500k 1.0. Slovenian language resource
repository CLARIN.SI.

Jouni Luoma, Miika Oinonen, Maria Pyykönen,
Veronika Laippala, and Sampo Pyysalo. 2020.
A broad-coverage corpus for finnish named
entity recognition. In Proceedings of The
12th Language Resources and Evaluation
Conference, pages 4615–4624.

Rabeeh Karimi Mahabadi, Yonatan Belinkov,
and James Henderson. 2020. End-to-end bias
mitigation by modelling biases in corpora.
In Proceedings of the 58th Annual Meeting
of the Association for Computational
Linguistics, pages 8706–8716. Association
for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/2020.acl-main.769

Christopher D. Manning, Mihai Surdeanu, John
Bauer, Jenny Rose Finkel, Steven Bethard, and
David McClosky. 2014. The Stanford CoreNLP
Natural Language Processing Toolkit. In ACL
(System Demonstrations), pages 55–60.
DOI: https://doi.org/10.3115/v1/P14-5010

Louis Martin, Benjamin Muller, Pedro
Javier Ortiz Suárez, Yoann Dupont, Laurent
Romary, Éric Villemonte de la Clergerie,
Djamé Seddah, and Benoît Sagot. 2019.
Camembert: A tasty French language
model. arXiv preprint arXiv:1911.03894.
DOI: https://doi.org/10.18653/v1/2020.acl-main.645

Stephen Mayhew, Gupta Nitish, and Dan Roth.
2020. Robust named entity recognition with
truecasing pretraining. In Proceedings of the
AAAI Conference on Artificial Intelligence,
pages 8480–8487.
DOI: https://doi.org/10.1609/aaai.v34i05.6368

Stephen Mayhew, Tatiana Tsygankova, and Dan
Roth. 2019. ner and pos when nothing is
capitalized. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 6257–6262.
DOI: https://doi.org/10.18653/v1/D19-1650

Tom McCoy, Ellie Pavlick, and Tal Linzen.
2019. Right for the wrong reasons: Diagnosing
syntactic heuristics in natural language
inference. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 3428–3448.
DOI: https://doi.org/10.18653/v1/P19-1334

Junghyun Min, R. Thomas McCoy, Dipanjan
Das, Emily Pitler, and Tal Linzen. 2020.
Syntactic data augmentation increases robustness
to inference heuristics. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 2339–2352.

Takeru Miyato, Andrew M. Dai, and Ian
Goodfellow. 2016. Adversarial training meth-
ods for semi-supervised text classification.
arXiv preprint arXiv:1605.07725.

Nafise Sadat Moosavi, Marcel de Boer, Prasetya
Ajie Utama, and Iryna Gurevych. 2020. Im-
proving robustness by augmenting training
sentences with predicate-argument structures.
arXiv preprint arXiv:2010.12510.

Thong Nguyen, Duy Nguyen, and Pramod Rao.
2020. Adaptive Name Entity Recognition
under highly unbalanced data. arXiv preprint
arXiv:2003.10296.

Joakim Nivre, Marie-Catherine De Marneffe,
Filip Ginter, Yoav Goldberg, Jan Hajic,
Christopher D. Manning, Ryan McDonald,
Slav Petrov, Sampo Pyysalo, Natalia Silveira,
and others. 2016. Universal dependencies v1:
A multilingual treebank collection. In Proceedings
of the Tenth International Conference
on Language Resources and Evaluation
(LREC’16), pages 1659–1666.

Sebastian Padó, André Blessing, Nico Blokker,
Erenay Dayanik, Sebastian Haunss, and
Jonas Kuhn. 2019. Who sides with
whom? Towards computational construction
of discourse networks for political debates. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 2841–2847.

Xiaoman Pan, Boliang Zhang, Jonathan May,
Joel Nothman, Kevin Knight, and Heng Ji.
2017. Cross-lingual name tagging and linking
for 282 languages. In Proceedings of the
55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long
Papers), pages 1946–1958.

Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. Bleu: A method for
automatic evaluation of machine translation. In
Proceedings of the 40th annual meeting of
the Association for Computational Linguistics,
pages 311–318.
DOI: https://doi.org/10.3115/1073083.1073135

Matthew Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep Contextualized
Word Representations. In Proceedings of
the 2018 Conference of the North
American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers),
pages 2227–2237.
DOI: https://doi.org/10.18653/v1/N18-1202

Adam Poliak, Jason Naradowsky, Aparajita
Haldar, Rachel Rudinger, and Benjamin
Van Durme. 2018. Hypothesis only baselines in
natural language inference. In Proceedings of
the Seventh Joint Conference on Lexical and
Computational Semantics, pages 180–191.
DOI: https://doi.org/10.18653/v1/S18-2023

Sameer Pradhan, Alessandro Moschitti, Nianwen
Xue, Olga Uryupina, and Yuchen Zhang. 2012.
CoNLL-2012 shared task: Modeling multi-
lingual unrestricted coreference in OntoNotes.
In Joint Conference on EMNLP and CoNLL-
Shared Task, pages 1–40.

Lev Ratinov and Dan Roth. 2009. Design
challenges and misconceptions in named
entity recognition. In Proceedings of the Thirteenth
Conference on Computational Natural
Language Learning, pages 147–155. Association
for Computational Linguistics.
DOI: https://doi.org/10.3115/1596374.1596399

Benjamin Recht, Rebecca Roelofs, Ludwig
Schmidt, and Vaishaal Shankar. 2019. Do
imagenet classifiers generalize to imagenet?
In International Conference on Machine
Learning, pages 5389–5400. PMLR.

Victor Sanh, Thomas Wolf, Yonatan Belinkov,
and Alexander M. Rush. 2020. Learning
from others’ mistakes: Avoiding dataset
biases without modeling them. arXiv preprint
arXiv:2012.01300.

Tal Schuster, Darsh Shah, Yun Jie Serene Yeo,
Daniel Roberto Filizzola Ortiz, Enrico Santus,
and Regina Barzilay. 2019. Towards debiasing
fact verification models. In Proceedings of
the 2019 Conference on Empirical Methods
in Natural Language Processing and the
9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 3410–3416.
DOI: https://doi.org/10.18653/v1/D19-1341

Michael L. Seltzer, Dong Yu, and Yongqiang
Wang. 2013. An investigation of deep
neural networks for noise robust speech
recognition. In 2013 IEEE International
Conference on Acoustics, Speech and Signal
Processing, pages 7398–7402. IEEE.
DOI: https://doi.org/10.1109/ICASSP.2013.6639100

Deven Santosh Shah, H. Andrew Schwartz, and
Dirk Hovy. 2020. Predictive biases in natural
language processing models: A conceptual
framework and overview. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 5248–5264.

Anders Søgaard. 2013. Part-of-speech tagging
with antagonistic adversaries. In Proceedings
of the 51st Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short
Papers), pages 640–644.

Prasetya Ajie Utama, Nafise Sadat Moosavi,
and Iryna Gurevych. 2020a. Mind the trade-off:
Debiasing NLU models without degrading
the in-distribution performance. arXiv preprint
arXiv:2005.00315.

Benjamin Strauss, Bethany Toma, Alan Ritter,
Marie-Catherine de Marneffe, and Wei Xu.
2016. Results of the WNUT16 named entity
recognition shared task. In Proceedings of the
2nd Workshop on Noisy User-generated Text
(WNUT), pages 138–144.

Christian Szegedy, Wojciech Zaremba, Ilya
Sutskever, Joan Bruna, Dumitru Erhan, Ian
Goodfellow, and Rob Fergus. 2013. Intriguing
properties of neural networks. arXiv preprint
arXiv:1312.6199.

James Thorne, Andreas Vlachos, Christos
Christodoulopoulos, and Arpit Mittal. 2018.
Fever: A large-scale dataset for fact extraction
and verification. In Proceedings of the 2018
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long Papers), pages 809–819.
DOI: https://doi.org/10.18653/v1/N18-1074

Erik F. Tjong Kim Sang. 2002. Introduction
to the CoNLL-2002 shared task: Language-independent
named entity recognition. In
COLING-02: The 6th Conference on Natural
Language Learning 2002 (CoNLL-2002).
DOI: https://doi.org/10.3115/1118853.1118877

Erik F. Tjong Kim Sang and Fien De Meulder.
2003. Introduction to the CoNLL-2003 shared
task: Language-independent named entity
recognition. In Proceedings of the seventh
conference on Natural language learning at
HLT-NAACL 2003-Volume 4, pages 142–147.
Association for Computational Linguistics.
DOI: https://doi.org/10.3115/1119176.1119195

Matej Ulčar and Marko Robnik-Šikonja. 2020.
Finest BERT and crosloengual BERT: Less is
more in multilingual models. arXiv preprint
arXiv:2006.07890.
DOI: https://doi.org/10.1007/978-3-030-58323-1_11

Prasetya Ajie Utama, Nafise Sadat Moosavi, and
Iryna Gurevych. 2020b. Towards debiasing
NLU models from unknown biases. In Proceedings
of the 2020 Conference on Empirical
Methods in Natural Language Processing
(EMNLP), pages 7597–7610.

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in
Neural Information Processing Systems,
pages 5998–6008.

Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni
Luoma, Juhani Luotolahti, Tapio Salakoski,
Filip Ginter, and Sampo Pyysalo. 2019.
Multilingual is not enough: BERT for Finnish.
arXiv preprint arXiv:1912.07076.

Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel Bowman.
2018. Glue: A multi-task benchmark and
analysis platform for natural language
understanding. In Proceedings of the 2018 EMNLP
Workshop BlackboxNLP: Analyzing and
Interpreting Neural Networks for NLP,
pages 353–355.
DOI: https://doi.org/10.18653/v1/W18-5446

Adina Williams, Nikita Nangia, and Samuel R.
Bowman. 2017. A broad-coverage challenge
corpus for sentence understanding through
inference. arXiv preprint arXiv:1704.05426.
DOI: https://doi.org/10.18653/v1/N18-1101

Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu,
and Jimmy Lin. 2020. DeeBERT: Dynamic
early exiting for accelerating BERT inference.
In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics, pages 2246–2251, Online. Association
for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/2020.acl-main.204

Yadollah Yaghoobzadeh, Remi Tachet, Timothy
J. Hazen, and Alessandro Sordoni. 2019.
Robust natural language inference models
with example forgetting. arXiv preprint
arXiv:1911.03861.

Juntao Yu, Bernd Bohnet, and Massimo Poesio.
2020. Named entity recognition as dependency
parsing. arXiv preprint arXiv:2005.07150.

Rowan Zellers, Yonatan Bisk, Roy Schwartz,
and Yejin Choi. 2018. SWAG: A large-scale
adversarial dataset for grounded commonsense
inference. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 93–104.
DOI: https://doi.org/10.18653/v1/D18-1009

Xiangji Zeng, Yunliang Li, Yuchen Zhai, and
Yin Zhang. 2020. Counterfactual generator:
A weakly-supervised method for named entity
recognition. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 7270–7280.
DOI: https://doi.org/10.18653/v1/2020.emnlp-main.590

Yuan Zhang, Jason Baldridge, and Luheng He.
2019. PAWS: Paraphrase adversaries from
word scrambling. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 1298–1308.

Changmeng Zheng, Yi Cai, Jingyun Xu, Ho-fung
Leung, and Guandong Xu. 2019. A boundary-
aware neural model for nested named entity
recognition. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing and the 9th International
Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 357–366.
DOI: https://doi.org/10.18653/v1/D19-1034

Joey Tianyi Zhou, Hao Zhang, Di Jin, Hongyuan
Zhu, Meng Fang, Rick Siow Mong Goh,
and Kenneth Kwok. 2019. Dual adversarial
neural transfer for low-resource named entity
recognition. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 3461–3471.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi
Sun, Thomas Goldstein, and Jingjing Liu.
2019. Freelb: Enhanced adversarial training
for language understanding. arXiv preprint
arXiv:1909.11764.
