MasakhaNER: Named Entity Recognition for African Languages
David Ifeoluwa Adelani1∗, Jade Abbott2∗, Graham Neubig3, Daniel D’souza4∗,
Julia Kreutzer5∗, Constantine Lignos6∗, Chester Palen-Michel6∗, Happy Buzaaba7∗,
Shruti Rijhwani3, Sebastian Ruder8, Stephen Mayhew9, Israel Abebe Azime10∗,
Shamsuddeen H. Muhammad11,12∗, Chris Chinenye Emezue13∗,
Joyce Nakatumba-Nabende14∗, Perez Ogayo15∗, Aremu Anuoluwapo16∗, Catherine Gitau∗,
Derguene Mbaye∗, Jesujoba Alabi17∗, Seid Muhie Yimam18, Tajuddeen Rabiu Gwadabe19∗,
Ignatius Ezeani20∗, Rubungo Andre Niyongabo21∗, Jonathan Mukiibi14, Verrah Otiende22∗,
Iroro Orife23∗, Davis David∗, Samba Ngom∗, Tosin Adewumi24∗, Paul Rayson20,
Mofetoluwa Adeyemi∗, Gerald Muriuki14, Emmanuel Anebi∗,
Chiamaka Chukwuneke20, Nkiruka Odu25, Eric Peter Wairagala14, Samuel Oyerinde∗,
Clemencia Siro∗, Tobius Saul Bateesa14, Temilola Oloyede∗, Yvonne Wambui∗,
Victor Akinode∗, Deborah Nabagereka14, Maurice Katusiime14, Ayodele Awokoya26∗,
Mouhamadane MBOUP∗, Dibora Gebreyohannes∗, Henok Tilaye∗, Kelechi Nwaike∗,
Degaga Wolde∗, Abdoulaye Faye∗, Blessing Sibanda27∗, Orevaoghene Ahia28∗,
Bonaventure F. P. Dossou29∗, Kelechi Ogueji30∗, Thierno Ibrahima DIOP∗,
Abdoulaye Diallo∗, Adewale Akinfaderin∗, Tendai Marengereke∗, and Salomey Osei10∗
∗Masakhane NLP, 1Spoken Language Systems Group (LSV), Saarland University, Germany,
2Retro Rabbit, South Africa, 3Language Technologies Institute, Carnegie Mellon University, United
States, 4ProQuest, United States, 5Google Research, Canada, 6Brandeis University, United States,
7Graduate School of Systems and Information Engineering, University of Tsukuba, Japan,
8DeepMind, United Kingdom, 9Duolingo, United States, 10African Institute for Mathematical Sciences
(AIMS-AMMI), Ethiopia, 11University of Porto, Portugal, 12Bayero University, Kano, Nigeria,
13Technical University of Munich, Germany, 14Makerere University, Kampala, Uganda, 15African
Leadership University, Rwanda, 16University of Lagos, Nigeria, 17Max Planck Institute for Informatics,
Germany, 18LT Group, Universität Hamburg, Germany, 19University of Chinese Academy of Sciences,
China, 20Lancaster University, United Kingdom, 21University of Electronic Science and Technology of
China, China, 22United States International University – Africa (USIU-A), Kenya, 23Niger-Volta LTI,
24Luleå University of Technology, Sweden, 25African University of Science and Technology, Abuja,
Nigeria, 26University of Ibadan, Nigeria, 27Namibia University of Science and Technology, Namibia,
28Instadeep, Nigeria, 29Jacobs University Bremen, Germany, 30University of Waterloo, Canada
Abstract
We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. Finally, we release the data, code, and models to inspire future research on African NLP.1
1https://git.io/masakhane-ner.
1 Introduction
Africa has over 2,000 spoken languages (Eberhard et al., 2020); however, these languages are scarcely represented in existing natural language processing (NLP) datasets, research, and tools (Martinus and Abbott, 2019). ∀ et al. (2020) investigate the reasons for these disparities by examining how NLP for low-resource languages is constrained by several societal factors. One of these factors is the geographical and language diversity of NLP researchers. For example, of the 2,695 affiliations of authors whose works were published at the five major NLP conferences in 2019, only five were from African institutions (Caines, 2019). In contrast, many NLP tasks such
as machine translation, text classification, part-of-speech tagging, and named entity recognition would benefit from the knowledge of native speakers who are involved in the development of datasets and models.
In this work, we focus on named entity recognition (NER), one of the most impactful tasks in NLP (Sang and De Meulder, 2003; Lample et al., 2016). NER is an important information extraction task and an essential component of numerous products, including spell-checkers, localization of voice and dialogue systems, and conversational agents. It also enables identifying African names, places, and organizations for information retrieval. African languages are underrepresented in this crucial task due to a lack of datasets, reproducible results, and researchers who understand the challenges that such languages present for NER.
In this paper, we take an initial step towards improving representation for African languages for the NER task, making the following contributions:

(i) We bring together language speakers, dataset curators, NLP practitioners, and evaluation experts to address the challenges facing NER for African languages. Based on the availability of online news corpora and language annotators, we develop NER datasets, models, and evaluation covering ten widely spoken African languages.

(ii) We curate NER datasets from local sources to ensure relevance of future research for native speakers of the respective languages.

(iii) We train and evaluate multiple NER models for all ten languages. Our experiments provide insights into the transfer across languages, and highlight open challenges.

(iv) We release the datasets, code, and models to facilitate future research on the specific challenges raised by NER for African languages.
2 Related Work
African NER Datasets NER is a well-studied sequence labeling task (Yadav and Bethard, 2018) and has been the subject of many shared tasks in different languages (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003; Sangal et al., 2008; Shaalan, 2014; Benikova et al., 2014). However, most of the available datasets are in high-resource languages. Although there have been efforts to create NER datasets for lower-resourced languages, such as the WikiAnn corpus (Pan et al., 2017) covering 282 languages, such datasets consist of "silver-standard" labels created by transferring annotations from English to other languages through cross-lingual links in knowledge bases. Because the WikiAnn corpus data comes from Wikipedia, it includes some African languages, though most have fewer than 10k tokens.
Other NER datasets for African languages include SADiLaR (Eiselen, 2016) for ten South African languages based on government data, and small corpora of fewer than 2K sentences for Yorùbá (Alabi et al., 2020) and Hausa (Hedderich et al., 2020). In addition, the LORELEI language packs (Strassel and Tracey, 2016) include some African languages (Yorùbá, Hausa, Amharic, Somali, Twi, Swahili, Wolof, Kinyarwanda, and Zulu), but are not publicly available.
NER Models Popular sequence labeling models for NER include the CRF (Lafferty et al., 2001), CNN-BiLSTM (Chiu and Nichols, 2016), BiLSTM-CRF (Huang et al., 2015), and CNN-BiLSTM-CRF (Ma and Hovy, 2016). The traditional CRF makes use of hand-crafted features like part-of-speech tags, context words, and word capitalization. Neural NER models, on the other hand, are initialized with word embeddings like Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017). More recently, pre-trained language models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and LUKE (Yamada et al., 2020) have been applied to produce state-of-the-art results for the NER task. Multilingual variants of these models like mBERT and XLM-RoBERTa (Conneau et al., 2020) make it possible to train NER models for several languages using transfer learning. Language-specific parameters and adaptation to unlabeled data of the target language have yielded further gains (Pfeiffer et al., 2020a,b).
3 Focus Languages

Table 1 provides an overview of the languages considered in this work, their language family, number of speakers, and the regions in Africa where they are spoken.
Language          Family                       Speakers   Region
Amharic           Afro-Asiatic-Ethio-Semitic   33M        East
Hausa             Afro-Asiatic-Chadic          63M        West
Igbo              Niger-Congo-Volta-Niger      27M        West
Kinyarwanda       Niger-Congo-Bantu            12M        East
Luganda           Niger-Congo-Bantu            7M         East
Luo               Nilo-Saharan                 4M         East
Nigerian-Pidgin   English Creole               75M        West
Swahili           Niger-Congo-Bantu            98M        Central & East
Wolof             Niger-Congo-Senegambia       5M         West & NW
Yorùbá            Niger-Congo-Volta-Niger      42M        West

Table 1: Language, family, number of speakers (Eberhard et al., 2020), and regions in Africa.
We chose to focus on these languages due to the availability of online news corpora and annotators, and most importantly because they are widely spoken native African languages. Both region and language family might indicate a notion of proximity for NER, either because of linguistic features shared within that family, or because data sources cover a common set of locally relevant entities. We highlight language specifics for each language to illustrate the diversity of this selection of languages in Section 3.1, and then showcase the differences in named entities across these languages in Section 3.2.
3.1 Language Characteristics

Amharic (amh) uses the Fidel script consisting of 33 basic scripts (ሀ (hä), ለ (lä), መ (mä), ሠ (šä), …), each of them with at least 7 vowel sequences (such as ሀ (hä), ሁ (hu), ሂ (hi), ሃ (ha), ሄ (hē), ህ (hǝ), ሆ (ho)). This results in more than 231 characters or Fidels. Numbers and punctuation marks are also represented uniquely with specific Fidels (፩ (1), ፪ (2), … and ። (.), ! (!), ፤ (;)).
Hausa (hau) has 23–25 consonants, depending on the dialect, and five short and five long vowels. Hausa has labialized phonemic consonants, as in /gw/ (e.g., 'agwagwa'). As found in some African languages, implosive consonants also exist in Hausa (e.g., 'b, 'd, etc., as in 'barna'). Similarly, the Hausa approximant 'r' is realized in two distinct manners: roll and trill, as in 'rai' and 'ra'ayi', respectively.
Igbo (ibo) is an agglutinative language, with many frequent suffixes and prefixes (Emenanjo, 1978). A single stem can yield many word-forms by the addition of affixes that extend its original meaning (Onyenwe and Hepple, 2016). Igbo is also tonal, with two distinctive tones (high and low) and a down-stepped high tone in some cases. The alphabet consists of 28 consonants and 8 vowels (A, E, I, Ị, O, Ọ, U, Ụ). In addition to the Latin letters (except c), Igbo contains the following digraphs: (ch, gb, gh, gw, kp, kw, nw, ny, sh).
Kinyarwanda (kin) makes use of 24 Latin characters, with 5 vowels similar to English and 19 consonants excluding q and x. In addition, Kinyarwanda has 74 additional complex consonants (such as mb, mpw, and njyw) (Government, 2014). It is a tonal language with three tones: low (no diacritic), high (signaled by "/"), and falling (signaled by "∧"). The default word order is subject-verb-object.
Luganda (lug) is a tonal language with subject-verb-object word order. The Luganda alphabet is composed of 24 letters that include 17 consonants (p, v, f, m, d, t, l, r, n, z, s, j, c, g, …), 5 vowel sounds represented by the five alphabetical symbols (a, e, i, o, u), and 2 semi-vowels (w, y). It also has a special consonant ŋ.
Luo (luo) is a tonal language with 4 tones (high, low, falling, rising), although the tonality is not marked in the orthography. It has 26 Latin consonants without the Latin letters (c, q, v, x, and z) and additional consonants (ch, dh, mb, nd, ng', ng, ny, nj, th, sh). There are nine vowels (a, e, i, o, u, ɛ, ɔ, ɪ, ʊ), which are distinguished primarily by advanced tongue root (ATR) harmony (De Pauw et al., 2007).

Nigerian-Pidgin (pcm) is a largely oral, national lingua franca with a distinct phonology
from English, its lexifier language. Portuguese, French, and especially indigenous languages form the substrate of lexical, phonological, syntactic, and semantic influence on Nigerian-Pidgin (NP). English lexical items absorbed by NP are often phonologically closer to indigenous Nigerian languages, notably in the realization of vowels. As a rapidly evolving language, the NP orthography is undergoing codification and indigenization (Offiong Mensah, 2012; Onovbiona, 2012; Ojarikre, 2013).
Table 2: Example of named entities in different languages. PER, LOC, and DATE are in purple, orange, and green, respectively.
Swahili (swa) is the most widely spoken language on the African continent. It has 30 letters, including 24 Latin letters without the characters (q and x) and six additional consonants (ch, dh, gh, ng', sh, th) unique to Swahili pronunciation.
Wolof (wol) has an alphabet similar to that of French. It consists of 29 characters, including all letters of the French alphabet except h, v, and z. It also includes the characters Ŋ ("ng", lowercase: ŋ) and Ñ ("gn" as in Spanish). Accents are present, but limited in number (À, É, Ë, Ó). However, unlike many other Niger-Congo languages, Wolof is not a tonal language.
Yorùbá (yor) has 25 Latin letters without the Latin characters (c, q, v, x, and z) and with additional letters (ẹ, gb, ṣ, ọ). Yorùbá is a tonal language with three tones: low ("\"), middle ("−", optional) and high ("/"). The tonal marks and underdots are referred to as diacritics, and they are needed for the correct pronunciation of a word. Yorùbá is a highly isolating language and the sentence structure follows subject-verb-object.
3.2 Named Entities

Most of the work on NER is centered around English, and it is unclear how well existing models can generalize to other languages in terms of sentence structure or surface forms. In Hu et al.'s (2020) evaluation of cross-lingual generalization for NER, only two African languages were considered, and it was seen that transformer-based models particularly struggled to generalize to named entities in Swahili. To highlight the differences across our focus languages, Table 2 shows an English2 example sentence, with color-coded PER, LOC, and DATE entities, and the corresponding translations. The following characteristics of the languages in our dataset could pose challenges for NER systems developed for English:
• Amharic shares no lexical overlap with the English source sentence.

• While "Zhang" is identical across all Latin-script languages, "Kano" features accents in Wolof and Yorùbá due to its localization.

• The Fidel script has no capitalization, which could hinder transfer from other languages.

• Igbo, Wolof, and Yorùbá all use diacritics, which are not present in the English alphabet.

• The surface form of named entities (NE) is the same in English and Nigerian-Pidgin, but there exist lexical differences (e.g., in terms of how time is realized).

• Between the 10 African languages, "Nigeria" is spelled in 6 different ways.

• Numerical "18": Igbo, Wolof, and Yorùbá write out their numbers, resulting in different numbers of tokens for the entity span.
2The original sentence is from BBC Pidgin https://www.bbc.com/pidgin/tori-51702073.
Language         Data Source             Train/dev/test   #Anno.  PER    ORG    LOC    DATE   % Entities  #Tokens
Amharic          DW & BBC                1750/250/500     4       730    403    1,420  580    15.13       37,032
Hausa            VOA Hausa               1903/272/545     3       1,490  766    2,779  922    12.17       80,152
Igbo             BBC Igbo                2233/319/638     6       1,603  1,292  1,677  690    13.15       61,668
Kinyarwanda      IGIHE news              2110/301/604     2       1,366  1,038  2,096  792    12.85       68,819
Luganda          BUKEDDE news            2003/200/401     3       1,868  838    943    574    14.81       46,615
Luo              Ramogi FM news          644/92/185       2       666    286    557    343    14.95       26,303
Nigerian-Pidgin  BBC Pidgin              2100/300/600     5       2,602  1,042  1,317  1,242  13.25       76,063
Swahili          VOA Swahili             2104/300/602     6       1,702  960    2,842  940    12.48       79,272
Wolof            Lu Defu Waxu & Saabal   1,871/267/536    2       731    245    836    206    6.02        52,872
Yorùbá           GV & VON news           2124/303/608     5       1,039  835    1,627  853    11.57       83,285

Table 3: Statistics of our datasets including their source, number of sentences in each split, number of annotators, number of entities of each label type, percentage of tokens that are named entities, and total number of tokens.
4 Data and Annotation Methodology

Our data were obtained from local news sources, in order to ensure the relevance of the dataset for native speakers from those regions. The dataset was annotated using the ELISA tool (Lin et al., 2018) by native speakers who come from the same regions as the news sources and volunteered through the Masakhane community.3 Annotators were not paid but are all included as authors of this paper. The annotators were trained on how to perform NER annotation using the MUC-6 annotation guide.4 We annotated four entity types: personal name (PER), location (LOC), organization (ORG), and date & time (DATE). The annotated entities were inspired by the English CoNLL-2003 corpus (Tjong Kim Sang and De Meulder, 2003). We replaced the MISC tag with the DATE tag following Alabi et al. (2020), as the MISC tag may be ill-defined and cause disagreement among non-expert annotators. We report the number of annotators as well as general statistics of the datasets in Table 3. For each language, we divided the annotated data into training, development, and test splits consisting of 70%, 10%, and 20% of the data, respectively.
A key objective of our annotation procedure was to create high-quality datasets by ensuring high annotator agreement. To achieve high agreement scores, we ran collaborative workshops for each language, which allowed annotators to discuss any disagreements. ELISA provides an entity-level F1-score and also an interface for annotators to correct their mistakes, making it easy to achieve inter-annotator agreement scores between 0.96 and 1.0 for all languages.
3https://www.masakhane.io.
4https://cs.nyu.edu/∼grishman/muc6.html.
Dataset   Token Fleiss' κ   Entity Fleiss' κ   Disagreement from Type
amh       0.987             0.959              0.044
hau       0.988             0.962              0.097
ibo       0.995             0.983              0.071
kin       1.000             1.000              0.000
lug       0.997             0.990              0.023
luo       1.000             1.000              0.000
pcm       0.989             0.966              0.048
swa       1.000             1.000              0.000
wol       1.000             1.000              0.000
yor       0.990             0.964              0.079

Table 4: Inter-annotator agreement for our datasets calculated using Fleiss' kappa (κ) at the token and entity level. Disagreement from type refers to the proportion of all entity-level disagreements that are due only to type mismatch.
We report inter-annotator agreement scores in Table 4 using Fleiss' kappa (Fleiss, 1971) at both the token and entity level. The latter considers each span an annotator proposed as an entity. As a result of our workshops, all our datasets have exceptionally high inter-annotator agreement. For Kinyarwanda, Luo, Swahili, and Wolof, we report perfect inter-annotator agreement scores (κ = 1). For each of these languages, two annotators annotated each token and were instructed to discuss and resolve conflicts among themselves. The Appendix provides a detailed entity-level confusion matrix in Table 11.
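As a rough illustration of the token-level agreement computation, the sketch below shows how Fleiss' kappa can be derived from an annotation matrix with the statsmodels library; the tiny label matrix is a hypothetical example, not our annotation data.

```python
# Sketch: token-level Fleiss' kappa for NER annotations (toy example,
# not our actual data). Requires: pip install statsmodels numpy
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = tokens, columns = annotators; values are label ids, e.g.,
# 0=O, 1=B-PER, 2=I-PER, 3=B-LOC, ... (any consistent integer coding works).
labels = np.array([
    [1, 1],  # both annotators tag token 1 as B-PER
    [0, 0],
    [3, 3],
    [0, 1],  # one disagreement
    [0, 0],
])

# aggregate_raters turns the (tokens x annotators) matrix into per-token
# counts of annotators per category, the input format fleiss_kappa expects.
counts, _ = aggregate_raters(labels)
print(f"Token-level Fleiss' kappa: {fleiss_kappa(counts, method='fleiss'):.3f}")
```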
5 Experimental Setup

5.1 NER Baseline Models

To evaluate baseline performance on our dataset, we experiment with three popular NER models: CNN-BiLSTM-CRF, multilingual BERT (mBERT), and XLM-RoBERTa (XLM-R). The latter two models are implemented using the HuggingFace transformers toolkit (Wolf et al., 2019). For each language, we train the models on the in-language training data and evaluate on its test data.
CNN-BiLSTM-CRF This architecture was proposed for NER by Ma and Hovy (2016). For each input sequence, we first compute the vector representation for each word by concatenating character-level encodings from a CNN and vector embeddings for each word. Following Rijhwani et al. (2020), we use randomly initialized word embeddings, since we do not have high-quality pre-trained embeddings for all the languages in our dataset. Our model is implemented using the DyNet toolkit (Neubig et al., 2017).
mBERT We fine-tune multilingual BERT (Devlin et al., 2019) on our NER corpus by adding a linear classification layer to the pre-trained transformer model, and train it end-to-end. mBERT was trained on 104 languages, including only two African languages: Swahili and Yorùbá. We use the mBERT-base cased model with 12-layer Transformer blocks, a hidden size of 768, and 110M parameters.
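For concreteness, a minimal sketch of this setup with the HuggingFace transformers toolkit is shown below; AutoModelForTokenClassification adds the linear classification layer on top of the pre-trained encoder. The example sentence, label indices, and training step are illustrative rather than our exact configuration.

```python
# Sketch: fine-tuning mBERT for NER by adding a linear classification layer
# (AutoModelForTokenClassification) and training end-to-end.
# Labels follow the BIO scheme over our four entity types.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-DATE", "I-DATE"]

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS))

# One toy sentence: word-level BIO tags aligned to subword tokens
# (first subword keeps the tag, remaining subwords are masked with -100).
words = ["Ahmed", "traveled", "to", "Kano", "yesterday"]
tags = [1, 0, 0, 5, 7]  # B-PER, O, O, B-LOC, B-DATE
enc = tok(words, is_split_into_words=True, return_tensors="pt")
aligned, prev = [], None
for wid in enc.word_ids():
    aligned.append(-100 if wid is None or wid == prev else tags[wid])
    prev = wid

loss = model(**enc, labels=torch.tensor([aligned])).loss  # one step's loss
loss.backward()  # plug into an optimizer loop (e.g., AdamW) to train end-to-end
```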
XLM-R XLM-R (Conneau et al., 2020) was trained on 100 languages, including Amharic, Hausa, and Swahili. The major differences between XLM-R and mBERT are (1) XLM-R was trained on Common Crawl while mBERT was trained on Wikipedia; (2) XLM-R is based on RoBERTa, which is trained with a masked language model (MLM) objective, while mBERT was additionally trained with a next sentence prediction objective. We make use of the XLM-R base and large models for the baseline models. The XLM-R-base model consists of 12 layers, with a hidden size of 768 and 270M parameters. On the other hand, XLM-R-large has 24 layers, with a hidden size of 1024 and 550M parameters.
MeanE-BiLSTM This is a simple BiLSTM model with an additional linear classifier. For each input sequence, we first extract a sentence embedding from the mBERT or XLM-R language model (LM) before passing it into the BiLSTM model. Following Reimers and Gurevych (2019), we make use of the mean of the 12-layer output embeddings of the LM (i.e., MeanE). This has been shown to provide better sentence representations than the embedding of the [CLS] token used for fine-tuning mBERT and XLM-R.
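A sketch of this feature extraction is shown below, assuming a frozen LM that exposes its per-layer hidden states; the hidden dimensions follow the base models (768), while the BiLSTM size is an illustrative choice rather than our exact configuration.

```python
# Sketch: extracting MeanE features, i.e., the mean of the 12 transformer
# layer outputs, to feed a BiLSTM + linear classifier (LM kept frozen).
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
lm = AutoModel.from_pretrained("xlm-roberta-base", output_hidden_states=True)

enc = tok("Ahmed traveled to Kano", return_tensors="pt")
with torch.no_grad():  # the LM is not fine-tuned; only the BiLSTM is trained
    hidden = lm(**enc).hidden_states  # tuple: embeddings + one tensor per layer
feats = torch.stack(hidden[1:]).mean(dim=0)  # mean over 12 layers: (1, seq, 768)

num_labels = 9
bilstm = torch.nn.LSTM(768, 256, bidirectional=True, batch_first=True)
classifier = torch.nn.Linear(2 * 256, num_labels)
logits = classifier(bilstm(feats)[0])  # per-token label scores
```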
Language BERT The mBERT and XLM-R models only support two and three of the languages under study, respectively. One effective approach to adapt pre-trained transformer models to new domains is "domain-adaptive fine-tuning" (Howard and Ruder, 2018; Gururangan et al., 2020), that is, fine-tuning on unlabeled data in the new domain, which also works very well when adapting to a new language (Pfeiffer et al., 2020a; Alabi et al., 2020). For each of the African languages, we performed language-adaptive fine-tuning on available unlabeled corpora, mostly from JW300 (Agić and Vulić, 2019), indigenous news sources, and the XLM-R Common Crawl corpora (Conneau et al., 2020). The Appendix provides the details of the unlabeled corpora in Table 10. This approach is quite useful for languages whose scripts are not supported by the multilingual transformer models, like Amharic, where we replace the vocabulary of mBERT with an Amharic vocabulary before we perform language-adaptive fine-tuning, similar to Alabi et al. (2020).
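The sketch below outlines language-adaptive fine-tuning as continued masked language modeling on unlabeled target-language text, using standard transformers components; the corpus path, epochs, and masking probability are placeholders rather than our exact settings.

```python
# Sketch: language-adaptive fine-tuning, i.e., continuing masked language
# model (MLM) training on unlabeled target-language text before the NER
# fine-tuning step (file path and hyperparameters are illustrative).
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

dataset = LineByLineTextDataset(tokenizer=tok, file_path="unlabeled_wolof.txt",
                                block_size=256)  # one sentence per line
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15)

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="xlmr-wol",
                                         num_train_epochs=3),
                  data_collator=collator, train_dataset=dataset)
trainer.train()  # the adapted LM is then fine-tuned on the NER data as before
```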
5.2 Improving the Baseline Models

In this section, we consider techniques to improve the baseline models, such as utilizing gazetteers, transfer learning from other domains and languages, and aggregating NER datasets by region. For these experiments, we focus on the PER, ORG, and LOC categories, because the gazetteers from Wikipedia do not contain DATE entities and some source domains and languages that we transfer from do not have the DATE annotation. We apply these modifications to the XLM-R model because it generally outperforms mBERT in our experiments (see Section 6).
5.2.1 Gazetteers for NER

Gazetteers are lists of named entities collected from manually crafted resources such as
GeoNames or Wikipedia. Before the widespread adoption of neural networks, NER methods used gazetteer-based features to improve performance (Ratinov and Roth, 2009). These features are created for each n-gram in the dataset and are typically binary-valued, indicating whether that n-gram is present in the gazetteer.
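The sketch below illustrates such binary n-gram features; the gazetteer, tokens, and maximum n-gram order are toy values, and in practice these features are combined with the neural model rather than used alone.

```python
# Sketch: binary gazetteer features for each n-gram in a sentence, in the
# spirit of the gazetteer features described above (toy gazetteer).
def gazetteer_features(tokens, gazetteer, max_n=3):
    """For every span of up to max_n tokens starting at each position, mark
    whether the span (case-insensitive) appears in the gazetteer list."""
    entries = {e.lower() for e in gazetteer}
    feats = []
    for i in range(len(tokens)):
        span_feats = []
        for n in range(1, max_n + 1):
            if i + n > len(tokens):  # span would run past the sentence end
                span_feats.append(0)
                continue
            span = " ".join(tokens[i:i + n]).lower()
            span_feats.append(1 if span in entries else 0)
        feats.append(span_feats)  # one binary feature per n-gram order
    return feats

loc_gazetteer = ["Kano", "Lagos", "Addis Ababa"]  # hypothetical Wikipedia list
print(gazetteer_features(["She", "visited", "Addis", "Ababa"], loc_gazetteer))
# -> [[0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]
```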
Recently, Rijhwani et al. (2020) showed that augmenting the neural CNN-BiLSTM-CRF model with gazetteer features can improve NER performance for low-resource languages. We conduct similar experiments on the languages in our dataset, using entity lists from Wikipedia as gazetteers. For Luo and Nigerian-Pidgin, which do not have their own Wikipedia, we use entity lists from English Wikipedia.
5.2.2 Transfer Learning

Here, we focus on cross-domain transfer from Wikipedia to the news domain, and cross-lingual transfer from English and Swahili NER datasets to the other languages in our dataset.
Domain Adaptation from WikiAnn We make use of the WikiAnn corpus (Pan et al., 2017), which is available for five of the languages in our dataset: Amharic, Igbo, Kinyarwanda, Swahili, and Yorùbá. For each language, the corpus contains 100 sentences in each of the training, development, and test splits, except for Swahili, which contains 1K sentences in each split. For each language, we train on the corresponding WikiAnn training set and either zero-shot transfer to our respective test set or additionally fine-tune on our training data.
Cross-lingual Transfer For training the cross-lingual transfer models, we use the CoNLL-20035 NER dataset in English, with over 14K training sentences, and our annotated corpus. The reason for choosing CoNLL-2003 is that it is in the same news domain as our annotated corpus. We also make use of the languages that are supported by the XLM-R model and are widely spoken in East and West Africa, like Swahili and Hausa. The English corpus has been shown to transfer very well to low-resource languages (Hedderich et al., 2020; Lauscher et al., 2020). We first train on either the English CoNLL-2003 data or our training data in Swahili, Hausa, or Nigerian-Pidgin before testing on the target African languages.

5We also tried OntoNotes 5.0 by combining FAC & ORG as "ORG" and GPE & LOC as "LOC" and others as "O" except "PER", but it gave lower performance in zero-shot transfer (19.38 F1) while CoNLL-2003 gave 37.15 F1.
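Conceptually, the zero-shot protocol amounts to running a source-language NER model directly on target-language text, as in the sketch below; the checkpoint path is a placeholder for a model produced by fine-tuning on, e.g., the Hausa training data, and the example sentence is illustrative.

```python
# Sketch of zero-shot cross-lingual transfer: a model fine-tuned on a source
# NER corpus (e.g., Hausa) is applied directly to a target-language sentence,
# with no target-language training data.
from transformers import pipeline

ner = pipeline("token-classification",
               model="path/to/xlmr-ner-hausa",   # placeholder checkpoint,
               aggregation_strategy="simple")    # fine-tuned on hau only

# A target-language test sentence (illustrative); entity predictions come
# purely from cross-lingual transfer in the shared encoder.
print(ner("Ahmed traveled to Kano"))
```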
5.3 Aggregating Languages by Regions

As previously illustrated in Table 2, several entities have the same form in different languages, while some entities may be more common in the region where the language is spoken. To study the performance of NER models across geographical areas, we combine languages based on the region of Africa in which they are spoken (see Table 1): (1) East region with Kinyarwanda, Luganda, Luo, and Swahili; (2) West region with the Hausa, Igbo, Nigerian-Pidgin, Wolof, and Yorùbá languages; (3) East and West regions: all languages except Amharic, because of its distinct writing system.
6 Results

6.1 Baseline Models

Table 5 gives the F1-score obtained by the CNN-BiLSTM-CRF, mBERT, and XLM-R models on the test sets of the ten African languages when training on our in-language data. We additionally indicate whether the language is supported by the pre-trained language models (✓). The percentage of entities that are out-of-vocabulary (OOV; entities in the test set that are not present in the training set) is also reported alongside the results of the baseline models. In general, the datasets with greater numbers of OOV entities have lower performance with the CNN-BiLSTM-CRF model, while those with lower OOV rates (Hausa, Igbo, Swahili) have higher performance. We find that the CNN-BiLSTM-CRF model performs worse than fine-tuning the mBERT and XLM-R models end-to-end (FTune). We expect performance to be better (e.g., for Amharic and Nigerian-Pidgin, with over 18 F1 points of difference) when using pre-trained word embeddings for the initialization of the BiLSTM model rather than random initialization (we leave this for future work, as discussed in Section 7).
Interestingly, the pre-trained language models (PLM) have reasonable performance even on languages they were not trained on, such as Igbo, Kinyarwanda, Luganda, Luo, and Wolof. However, languages supported by the PLM tend to have better performance overall.
Lang.            In mBERT?  In XLM-R?  % OOV in test  CNN-BiLSTM-  mBERT-base     XLM-R-base     XLM-R-large  Lang. BERT  Lang. XLM-R
                                       entities       CRF          MeanE / FTune  MeanE / FTune  FTune        FTune       FTune
amh              ✗          ✓          72.94          52.08        0.0 / 0.0      63.57 / 70.62  76.18        60.89       77.97
hau              ✗          ✓          33.40          83.52        81.49 / 86.65  86.06 / 89.50  90.54        91.31       91.47
ibo              ✗          ✗          46.56          80.02        76.17 / 85.19  73.47 / 84.78  84.12        86.75       87.74
kin              ✗          ✗          57.85          62.97        65.85 / 72.20  63.66 / 73.32  73.75        77.57       77.76
lug              ✗          ✗          61.12          74.67        70.38 / 80.36  68.15 / 79.69  81.57        83.44       84.70
luo              ✗          ✗          65.18          65.98        56.56 / 74.22  52.57 / 74.86  73.58        75.59       75.27
pcm              ✗          ✗          61.26          67.67        81.87 / 87.23  81.93 / 87.26  89.02        89.95       90.00
swa              ✓          ✓          40.97          78.24        83.08 / 86.80  84.33 / 87.37  89.36        89.36       89.46
wol              ✗          ✗          69.73          59.70        57.21 / 64.52  54.97 / 63.86  67.90        69.43       68.31
yor              ✓          ✗          65.99          67.44        74.28 / 78.97  67.45 / 78.26  78.89        82.58       83.66
avg                                    57.50          69.23        64.69 / 71.61  69.62 / 78.96  80.49        80.69       82.63
avg (excl. amh)                        55.78          71.13        71.87 / 79.88  70.29 / 79.88  80.97        82.89       83.15
Table 5: NER model comparison, showing F1-score on the test sets after 50 epochs, averaged over 5 runs. This result is for all 4 tags in the dataset: PER, ORG, LOC, DATE. Bold marks the top score (tied if within the range of SE). mBERT and XLM-R are trained in two ways: (1) MeanE: mean output embeddings from the 12 LM layers are used to initialize a BiLSTM + linear classifier, and (2) FTune: the LM is fine-tuned end-to-end with a linear classifier. Lang. BERT & Lang. XLM-R (base) are models fine-tuned after language-adaptive fine-tuning.
We observe that fine-tuned XLM-R-base models have significantly better performance on five languages; two of these languages (Amharic and Swahili) are supported by the pre-trained XLM-R. Similarly, fine-tuning mBERT has better performance for Yorùbá, since the language is part of the PLM's training corpus. Although mBERT is trained on Swahili, XLM-R-base shows better performance. This observation is consistent with Hu et al. (2020) and could be because XLM-R is trained on more Swahili text (Common Crawl with 275M tokens) whereas mBERT is trained on a smaller corpus from Wikipedia (6M tokens6).
Another observation is that mBERT tends to have better performance for the non-Bantu Niger-Congo languages (i.e., Igbo, Wolof, and Yorùbá), while XLM-R-base works better for Afro-Asiatic languages (i.e., Amharic and Hausa), Nilo-Saharan (i.e., Luo), and Bantu languages like Kinyarwanda and Swahili. We also note that the writing script is one of the primary factors influencing the transfer of knowledge in PLMs with regard to the languages they were not trained on. For example, mBERT achieves an F1-score of 0.0 on Amharic because it has not encountered the script during pre-training. Overall, we find the fine-tuned XLM-R-large (with 550M parameters) to be better than XLM-R-base (with 270M parameters) and mBERT (with 110M parameters) on almost all languages. However, mBERT models perform slightly better for Igbo, Luo, and Yorùbá despite having fewer parameters.

6https://github.com/mayhewsw/multilingual-data-stats.
We further analyze the transfer abilities of mBERT and XLM-R by extracting sentence embeddings from the LMs to train a BiLSTM model (MeanE-BiLSTM) instead of fine-tuning them end-to-end. Table 5 shows that languages that are not supported by mBERT or XLM-R generally perform worse than the CNN-BiLSTM-CRF model (despite it being randomly initialized), except for kin. Also, sentence embeddings extracted from mBERT often lead to better performance than XLM-R for languages they both do not support (like ibo, kin, lug, luo, and wol).
Lastly, we train NER models using language BERT models that have been adapted to each of the African languages via language-specific fine-tuning on unlabeled text. In all cases, fine-tuning language BERT and language XLM-R models achieves a 1%−7% improvement in F1-score over fine-tuning mBERT-base and XLM-R-base, respectively. This approach is still effective for small-sized pre-training corpora, provided they are of good quality. For example, the Wolof monolingual corpus, which contains fewer than 50K sentences (see Table 10 in the Appendix), still improves performance by over 4% F1. Further, we obtain over 60% improvement in performance for Amharic BERT because mBERT does not recognize the Amharic script.
Method           amh    hau    ibo    kin    lug    luo    pcm    swa    wol    yor    avg
CNN-BiLSTM-CRF   50.31  84.64  81.25  60.32  75.66  68.93  62.60  77.83  61.84  66.48  68.99
 + Gazetteers    49.51  85.02  80.40  64.54  73.85  65.44  66.54  80.16  62.44  65.49  69.34

Table 6: Improving NER models using gazetteers. The result is only for 3 tags: PER, ORG, & LOC. Models trained for 50 epochs. Result is an average over 5 runs.
Method                  amh
XLM-R-base              69.71
WikiAnn zero-shot       27.68
eng-CoNLL zero-shot     –
pcm zero-shot           –
swa zero-shot           –
hau zero-shot           –
WikiAnn + finetune      70.92
eng-CoNLL + finetune    –
pcm + finetune          –
swa + finetune          –
hau + finetune          –
combined East Langs.    –
combined West Langs.    –
combined 9 Langs.       –

Table 7: Transfer learning results (F1-score) for three tags: PER, ORG, & LOC. WikiAnn, eng-CoNLL, and the annotated datasets are trained for 50 epochs; fine-tuning is only for 10 epochs. Results are averaged over 5 runs, and the total average (avg) is computed over the ibo, kin, lug, luo, wol, and yor languages. The overall highest F1-score is in bold, and the best F1-score in zero-shot settings is indicated with an asterisk (*). [Only the amh column could be recovered from the flattened extraction; the hau–yor and avg columns are omitted here.]
6.2 Evaluation of Gazetteer Features
Table 6 shows the performance of the CNN-BiLSTM-CRF model with the addition of gazetteer features, as described in Section 5.2.1. On average, the model that uses gazetteer features performs better than the baseline. Overall, languages with larger gazetteers, such as Swahili (16K entities in the gazetteer) and Nigerian-Pidgin (for which we use an English gazetteer with 2M entities), have more improvement in performance than those with fewer gazetteer entries, such as Amharic and Luganda (2K and 500 gazetteer entities, respectively). This indicates that having high-coverage gazetteers is important for the model to take advantage of the gazetteer features.
6.3 Transfer Learning Experiments

Table 7 shows the results for the different transfer learning approaches, which we discuss individually in the following sections. We make use of the XLM-R-base model for all the experiments in this subsection, because the performance difference if we use XLM-R-large is small (<2%), as shown in Table 5, and because it is faster to train.

Source Language   PER     ORG     LOC
eng-CoNLL         36.17   21.50   55.00
pcm               52.67   27.00   65.33
swa               69.67   57.50   50.50
hau               68.17   46.00   48.50

Table 8: Average per-named-entity F1-score for zero-shot NER using the XLM-R model. The average is computed over the ibo, kin, lug, luo, wol, and yor languages.
6.3.1 Cross-domain Transfer

We evaluate cross-domain transfer from Wikipedia to the news domain for the five languages that are available in the WikiAnn (Pan et al., 2017) dataset. In the zero-shot setting, the NER F1-score is low: less than 40 F1 for all
                 CNN-BiLSTM                            mBERT-base                            XLM-R-base
Language         all    0-freq  Δ       long   Δ       all    0-freq  Δ      long   Δ        all    0-freq  Δ      long   Δ
amh              52.89  40.98  −11.91   45.16  −7.73   –      –       –      –      –        70.96  68.91  −2.05   64.86  −6.10
hau              83.70  78.52  −5.18    66.21  −17.49  87.34  79.41  −7.93   67.67  −19.67   89.44  85.48  −3.96   76.06  −13.38
ibo              78.48  70.57  −7.91    53.93  −24.55  85.11  78.41  −6.70   60.46  −24.65   84.51  77.42  −7.09   59.52  −24.99
kin              64.61  55.89  −8.72    40.00  −24.61  70.98  65.57  −5.41   55.39  −15.59   73.93  66.54  −7.39   54.96  −18.97
lug              74.31  67.99  −6.32    58.33  −15.98  80.56  76.27  −4.29   65.67  −14.89   80.71  73.54  −7.17   63.77  −16.94
luo              66.42  58.93  −7.49    54.17  −12.25  72.65  72.85  0.20    66.67  −5.98    75.14  72.34  −2.80   69.39  −5.75
pcm              66.43  59.73  −6.70    47.80  −18.63  87.78  82.40  −5.38   77.12  −10.66   87.39  83.65  −3.74   74.67  −12.72
swa              79.26  64.74  −14.52   44.78  −34.48  86.37  78.77  −7.60   45.55  −40.82   87.55  80.91  −6.64   53.93  −33.62
wol              60.43  49.03  −11.40   26.92  −33.51  66.10  59.54  −6.56   19.05  −47.05   64.38  57.21  −7.17   38.89  −25.49
yor              67.07  56.33  −10.74   64.52  −2.55   78.64  73.41  −5.23   74.34  −4.30    77.58  72.01  −5.57   76.14  −1.44
avg (excl. amh)  69.36  60.27  −9.09    50.18  −19.18  79.50  74.07  −5.43   59.10  −20.40   79.15  73.80  −5.36   63.22  −15.94
Table 9: F1 score for two varieties of hard-to-identify entities: zero-frequency entities that do not appear
in the training corpus, and longer entities of four or more words.
languages, with Kinyarwanda and Yorùbá having less than 10 F1. This is likely due to the number of training sentences present in WikiAnn: there are only 100 sentences in the datasets of Amharic, Igbo, Kinyarwanda, and Yorùbá. Although the Swahili corpus has 1,000 sentences, the 35 F1-score shows that transfer is not very effective. In general, cross-domain transfer is a challenging problem, and is even harder when the number of training examples from the source domain is small. Fine-tuning on the in-domain news NER data does not improve over the baseline (XLM-R-base).
6.3.2 Cross-Lingual Transfer

Zero-shot In the zero-shot setting, we evaluated NER models trained on the English eng-CoNLL03 dataset, and on the Nigerian-Pidgin (pcm), Swahili (swa), and Hausa (hau) annotated corpora. We excluded the MISC entity in the eng-CoNLL03 corpus because it is absent in our target datasets. Table 7 shows the results for the (zero-shot) transfer performance. We observe that the closer the source and target languages are geographically, the better the performance. The pcm model (trained on only 2K sentences) obtains similar transfer performance as the eng-CoNLL03 model (trained on 14K sentences). swa performs better than pcm and eng-CoNLL03, with an improvement of over 14 F1 on average. We found that, on average, transferring from Hausa provided the best F1, with an improvement of over 16% and 1% compared to using the eng-CoNLL and swa data, respectively. Per-entity analysis in Table 8 shows that the largest improvements are obtained for ORG. The pcm data were more effective in transferring to LOC and ORG, while swa and hau performed better when transferring to PER. In general, zero-shot transfer is most effective when transferring from Hausa and Swahili.
Fine-tuning We use the target language corpus to fine-tune the NER models previously trained on eng-CoNLL, pcm, and swa. On average, there is only a small improvement when compared to the XLM-R base model. In particular, we see significant improvement for Hausa, Igbo, Kinyarwanda, Nigerian-Pidgin, Wolof, and Yorùbá using either swa or hau as the source NER model.
6.4 Regional Influence on NER

We evaluate whether combining different language training datasets by region affects the performance for individual languages. Table 7 shows that all languages spoken in West Africa (ibo, wol, pcm, yor) except hau have slightly better performance (0.1–2.6 F1) when we train on their combined training data. However, for the East-African languages, the F1 score only improved (0.8–2.3 F1) for three languages (kin, lug, luo). Training the NER model on all nine languages leads to better performance on all languages except Swahili. On average over six languages (ibo, kin, lug, luo, wol, yor), the performance improves by 1.6 F1.
6.5 Error Analysis

Finally, to better understand the types of entities that were successfully identified and those that were missed, we performed a fine-grained analysis of our baseline methods mBERT and XLM-R using the method of Fu et al. (2020), with results shown in Table 9. Specifically, we found that across all languages, entities that were not contained in the training data (zero-frequency entities) and entities consisting of more than three words (long entities) were particularly difficult; compared to the F1 score over all entities, the scores dropped by around 5 points when evaluated on zero-frequency entities, and by around 20 points when evaluated on long entities. Future work on low-resource NER or cross-lingual representation learning may further improve on these hard cases.
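A sketch of the entity bucketing behind this analysis is given below; it follows the definitions above (zero-frequency entities and entities of four or more words) on toy entity lists, with per-bucket F1 then computed by restricting evaluation to each bucket.

```python
# Sketch: bucketing test entities into the two hard cases analyzed above:
# zero-frequency entities (never seen in training) and long entities
# (more than three words). F1 can then be computed per bucket.
def bucket_entities(train_entities, test_entities):
    """Both arguments are lists of entity surface strings."""
    seen = {e.lower() for e in train_entities}
    zero_freq = [e for e in test_entities if e.lower() not in seen]
    long_ents = [e for e in test_entities if len(e.split()) >= 4]
    return zero_freq, long_ents

train = ["Kano", "Muhammadu Buhari", "BBC"]                      # toy data
test = ["Kano", "Aliko Dangote", "University of Science and Technology"]
zf, lg = bucket_entities(train, test)
print(zf)  # ['Aliko Dangote', 'University of Science and Technology']
print(lg)  # ['University of Science and Technology']
```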
7 Conclusion and Future Work

We address the NER task for African languages by bringing together a variety of stakeholders to create a high-quality NER dataset for ten African languages. We evaluate multiple state-of-the-art NER models and establish strong baselines. We have released one of our best models, which can recognize named entities in ten African languages, on the HuggingFace Model Hub.7 We also investigate cross-domain transfer with experiments on five languages with the WikiAnn dataset, along with cross-lingual transfer for low-resource NER using the English CoNLL-2003 dataset and other languages supported by XLM-R. In the future, we plan to use pretrained word embeddings such as GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017) instead of random initialization for the CNN-BiLSTM-CRF, increase the number of annotated sentences per language, and expand the dataset to more African languages.
Acknowledgments
We would like to thank Heng Ji and Ying Lin for
providing the ELISA NER tool used for annota-
tion. We also thank the Spoken Language Systems
Chair, Dietrich Klakow at Saarland University,
for providing GPU resources to train the models.
We thank Adhi Kuncoro and the anonymous reviewers for their useful feedback on a draft of this
paper. David Adelani acknowledges the support of
the EU-funded H2020 project COMPRISE under
grant agreement no. 3081705. Finally, we thank
Mohamed Ahmed for proofreading the draft.
7https://huggingface.co/Davlan/xlm-roberta-large-masakhaner.
References
D. Adelani, Dana Ruiter, J. Alabi, Damilola Adebonojo, Adesina Ayeni, Mofetoluwa Adeyemi, Ayodele Awokoya, and C. España-Bonet. 2021. MENYO-20k: A multi-domain English-Yorùbá corpus for machine translation and domain adaptation. ArXiv, abs/2103.08647.
Željko Agić and Ivan Vulić. 2019. JW300: A
wide-coverage parallel corpus for low-resource
languages. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 3204–3210, Florence, Italy.
Association for Computational Linguistics.
Jesujoba Alabi, Kwabena Amponsah-Kaakyire, David Adelani, and Cristina España-Bonet. 2020. Massive vs. curated embeddings for low-resourced languages: The case of Yorùbá and Twi. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2754–2762, Marseille, France. European Language Resources Association.
Darina Benikova, Chris Biemann, and Marc
Reznicek. 2014. NoSta-D named entity anno-
tation for German: Guidelines and dataset. In
Proceedings of the Ninth International Con-
ference on Language Resources and Evalua-
tion (LREC’14), pages 2524–2531, Reykjavik,
Iceland. European Language Resources Asso-
ciation (ELRA).
Piotr Bojanowski, Edouard Grave, Armand Joulin,
and Tomas Mikolov. 2017. Enriching word vec-
tors with subword information. Transactions of
the Association for Computational Linguistics,
5:135–146. https://doi.org/10.1162/tacl_a_00051
Andrew Caines. 2019. The geographic diversity
of NLP conferences.
Jason P.C. Chiu and Eric Nichols. 2016.
Named entity recognition with bidirectional
LSTM-CNNs. Transactions of the Association
for Computational Linguistics, 4:357–370.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th
Annual Meeting of the Association for Compu-
tational Linguistics, pages 8440–8451, Online.
Association for Computational Linguistics.
Guy De Pauw, Peter W Wagacha, and Dorothy
Atieno Abade. 2007. Unsupervised induction of
Dholuo word classes using maximum entropy
learning. Proceedings of the First International
Computer Science and ICT Conference, page 8.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the
Association for Computational Linguistics: Hu-
man Language Technologies, Volume 1 (Long
and Short Papers). Minneapolis, Minnesota.
Association for Computational Linguistics.
David M. Eberhard, Gary F. Simons, and Charles
D. Fennig (eds.). 2020. Ethnologue: Languages
of the World. 23rd edition.
Roald Eiselen. 2016. Government domain named
entity recognition for South African languages.
In Proceedings of the Tenth International Con-
ference on Language Resources and Evalua-
tion (LREC'16), pages 3344–3348, Portorož,
Slovenia. European Language Resources Asso-
ciation (ELRA).
Ahmed El-Kishky, Vishrav Chaudhary, Francisco
Guzmán, and Philipp Koehn. 2020. CCAligned:
A massive collection of cross-lingual web-
document pairs. In Proceedings of the 2020
Conference on Empirical Methods in Nat-
ural Language Processing (EMNLP 2020),
pages 5960–5969, Online. Association for
Computational Linguistics.
Nolue Emenanjo. 1978. Elements of Modern Igbo
Grammar - a descriptive approach. Ibadan,
Nigeria. Oxford University Press.
Ignatius Ezeani, Paul Rayson, I. Onyenwe, C. Uchechukwu, and M. Hepple. 2020. Igbo-English machine translation: An evaluation benchmark. ArXiv, abs/2004.00648.
Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378. https://doi.org/10.1037/h0031619
Jinlan Fu, Pengfei Liu, and Graham Neubig.
2020. Interpretable multi-dataset evaluation for
named entity recognition. In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 6058–6069, Online. Association for
Computational Linguistics.
Rwanda Government. 2014. Official gazette
number 41 bis of 13/10/2014.
Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.
Michael A. Hedderich, David Adelani, Dawei
Zhu, Jesujoba Alabi, Udia Markus, and Dietrich
Klakow. 2020. Transfer learning and distant su-
pervision for multilingual transformer models:
A study on African languages. In Proceedings
of the 2020 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP),
pages 2580–2591, Online. Association for
Computational Linguistics.
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of ACL 2018.
Junjie Hu, Sebastian Ruder, Aditya Siddhant,
Graham Neubig, Orhan Firat, and Melvin
Johnson. 2020. XTREME: A massively mul-
tilingual multi-task benchmark for evaluating
cross-lingual generalization. In Proceedings of
ICML 2020.
Zhiheng Huang, W. Xu, and Kailiang Yu. 2015.
Bidirectional LSTM-CRF models for sequence
tagging. ArXiv, abs/1508.01991.
John D. Lafferty, Andrew McCallum, and
Fernando C. N. Pereira. 2001. Conditional ran-
dom fields: Probabilistic models for segmenting
and labeling sequence data. In Proceedings
of the Eighteenth International Conference on
Machine Learning, ICML ’01, pages 282–289,
San Francisco, CA, USA. Morgan Kaufmann
Publishers Inc.
Guillaume Lample, Miguel Ballesteros, Sandeep
Subramanian, Kazuya Kawakami, and Chris
Dyer. 2016. Neural architectures for named
entity recognition. In Proceedings of NAACL-
HLT 2016.
Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. 2020. From zero to hero: On
the limitations of zero-shot language transfer
with multilingual Transformers. In Proceedings
of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 4483–4499, Online. Association for
Computational Linguistics.
Ying Lin, Cash Costello, Boliang Zhang, Di Lu,
Heng Ji, James Mayfield, and Paul McNamee.
2018. Platforms for non-speakers annotating
names in any language. In Proceedings of
ACL 2018, System Demonstrations, pages 1–6,
Melbourne, Australia. Association for Compu-
tational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.
Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany. Association for Computational Linguistics.
Laura Martinus and Jade Z. Abbott. 2019. A
focus on neural machine translation for African
languages. arXiv preprint arXiv:1906.05685.
MBS. 2020. Téereb Injiil: La Bible Wolof – Ancien Testament. http://biblewolof.com/.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg
S. Corrado, and Jeff Dean. 2013. Distributed
representations of words and phrases and their
compositionality. In Advances in Neural In-
formation Processing Systems, volume 26,
pages 3111–3119. Curran Associates, Inc.
Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro Orife, Ignatius Ezeani, Idris Abdulkadir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, Ghollah Kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, and Abdallah Bashir. 2020. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. Online.
Graham Neubig, Chris Dyer, Y. Goldberg, A. Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Manish Kumar, Chaitanya Malaviya, Paul Michel, Y. Oda, M. Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. ArXiv, abs/1701.03980.
Rubungo Andre Niyongabo, Qu Hong, Julia
Kreutzer, and Li Huang. 2020. KINNEWS and
KIRNEWS: Benchmarking cross-lingual text
classification for Kinyarwanda and Kirundi.
In Proceedings of
the 28th International
Conference on Computational Linguistics,
pages 5507–5521, Barcelona, Spain (Online).
International Committee on Computational
Linguistics.
Eyo Offiong Mensah. 2012. Grammaticalization in Nigerian Pidgin. Íkala, Revista de Lenguaje y Cultura, 17(2):167–179.
Anthony Ojarikre. 2013. Perspectives and prob-
lems of codifying Nigerian Pidgin English
orthography. Perspectives, 3(12).
Ijite Blessing Onovbiona. 2012. Serial verb
construction in Nigerian Pidgin.
Ikechukwu E. Onyenwe and Mark Hepple. 2016.
Predicting morphologically-complex unknown
words in Igbo. In Text, Speech, and Dialogue,
pages 206–214, Cham. Springer International
Publishing.
Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020a. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In Proceedings of EMNLP 2020.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020b. UNKs everywhere: Adapting multilingual language models to new scripts. arXiv preprint arXiv:2012.15562.
Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 147–155, Boulder, Colorado. Association for Computational Linguistics.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Shruti Rijhwani, Shuyan Zhou, Graham Neubig,
and Jaime Carbonell. 2020. Soft gazetteers
for low-resource named entity recognition. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguis-
tics, pages 8118–8123, Online. Association for
Computational Linguistics.
Rajeev Sangal, Dipti Misra Sharma, and Anil Kumar Singh. 2008. Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages.
Khaled Shaalan. 2014. A survey of Arabic named entity recognition and classification. Computational Linguistics, 40:469–510. https://doi.org/10.1162/COLI_a_00178
Stephanie Strassel and Jennifer Tracey. 2016. LORELEI language packs: Data, tools, and resources for technology development in low resource languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3273–3280, Portorož, Slovenia. European Language Resources Association (ELRA).
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).
Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).
Erik F. Tjong Kim Sang and Fien De Meulder.
2003. Introduction to the CoNLL-2003 shared
task: Language-independent named entity re-
cognition. In Proceedings of the Seventh Con-
ference on Natural Language Learning at
HLT-NAACL 2003, pages 142–147.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
Vikas Yadav and Steven Bethard. 2018. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2145–2158, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Ikuya Yamada, Akari Asai, Hiroyuki Shindo,
Hideaki Takeda, and Yuji Matsumoto. 2020.
LUKE: Deep contextualized entity representa-
tions with entity-aware self-attention. In Pro-
ceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing
(EMNLP), pages 6442–6454, Online. Associ-
ation for Computational Linguistics.
A Appendix
A.1 Annotator Agreement
To shed more light on the few cases where annotators disagreed, we provide entity-level confusion matrices across all ten languages in Table 11. The most common disagreement is between organizations and locations.
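For readers who want to reproduce this kind of analysis, the following is a minimal Python sketch of how an entity-level confusion matrix between two annotators can be computed. The span-keyed annotation format, the function name, and the toy inputs are our illustrative assumptions, not the exact format of our annotation tooling.

```python
# A minimal sketch: count label pairs for entities whose token spans match
# across two annotators. The span-keyed dict format is an assumption;
# MasakhaNER annotations are stored in CoNLL format.
from collections import Counter

def entity_confusion(anno_a, anno_b):
    """Each annotation maps (start, end) token spans to one of
    DATE, LOC, ORG, PER. Returns a Counter over (label_a, label_b)."""
    confusion = Counter()
    for span, label_a in anno_a.items():
        label_b = anno_b.get(span)
        if label_b is not None:  # both annotators marked this span
            confusion[(label_a, label_b)] += 1
    return confusion

# Toy example: annotator B labels one of A's LOC entities as ORG.
a = {(0, 2): "LOC", (5, 6): "PER"}
b = {(0, 2): "ORG", (5, 6): "PER"}
print(entity_confusion(a, b))  # Counter({('LOC', 'ORG'): 1, ('PER', 'PER'): 1})
```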
A.2 Model Hyperparameters for
Reproducibility
For fine-tuning mBERT and XLM-R, we used the base and large models with a maximum sequence length of 164 for mBERT and 200 for XLM-R, a batch size of 32, a learning rate of 5e-5, and 50 epochs. For the MeanE-BiLSTM model, the hyperparameters are similar to those used for fine-tuning the LMs, except that the learning rate is set to 5e-4. The BiLSTM hyperparameters are: an input dimension of 768 (the embedding size of mBERT and XLM-R) in each direction of the LSTM, one hidden layer, a hidden layer size of 64, and a dropout probability of 0.3 before the last linear layer. All experiments were performed on a single GPU (Nvidia V100).
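As an illustration, the following PyTorch sketch shows a BiLSTM tagging head with these hyperparameters. The nine-label BIO tag set, the choice of Adam, and the assumption that the 768-dimensional embeddings are precomputed are ours; the released code remains the reference implementation.

```python
import torch
import torch.nn as nn

NUM_LABELS = 9  # B-/I- tags for DATE, LOC, ORG, PER plus O (assumed BIO scheme)

class MeanEBiLSTM(nn.Module):
    """BiLSTM tagging head over frozen 768-d mBERT/XLM-R embeddings."""
    def __init__(self, embed_dim=768, hidden_size=64, num_labels=NUM_LABELS):
        super().__init__()
        # One hidden layer, bidirectional, hidden size 64 per direction.
        self.bilstm = nn.LSTM(embed_dim, hidden_size, num_layers=1,
                              bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(p=0.3)  # dropout before the last linear layer
        self.classifier = nn.Linear(2 * hidden_size, num_labels)

    def forward(self, embeddings):  # embeddings: (batch, seq_len, 768)
        out, _ = self.bilstm(embeddings)
        return self.classifier(self.dropout(out))  # per-token logits

model = MeanEBiLSTM()
# Learning rate 5e-4 as in the text; the choice of Adam is our assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
logits = model(torch.randn(32, 200, 768))  # batch size 32, XLM-R max length 200
print(logits.shape)  # torch.Size([32, 200, 9])
```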
A.3 Monolingual Corpora for Language
Adaptive Fine-tuning
Table 10 shows the monolingual corpora we used for the language adaptive fine-tuning, with details of the source of the data and their sizes. For most of the languages, we make use of JW3008 and CC-100.9 Where CC-Aligned (El-Kishky et al., 2020) was used, we removed duplicated sentences from CC-100. For fine-tuning the language models, we make use of the HuggingFace (Wolf et al., 2019) code with a learning rate of 5e-5. However, for the Amharic BERT we use a smaller learning rate of 5e-6, since the multilingual BERT vocabulary was replaced by an Amharic vocabulary, allowing us to slowly adapt the mBERT LM to Amharic text. All language BERT models were pre-trained for 3 epochs (‘‘ibo’’, ‘‘kin’’, ‘‘lug’’, ‘‘luo’’, ‘‘pcm’’, ‘‘swa’’, ‘‘yor’’) or 10 epochs (‘‘amh’’, ‘‘hau’’, ‘‘wol’’), depending on their convergence. The models can be found on the HuggingFace Model Hub.10
8https://opus.nlpl.eu/.
9http://data.statmt.org/cc-100/.
10https://huggingface.co/Davlan.
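For concreteness, here is a minimal sketch of this language adaptive (masked-LM) fine-tuning using the HuggingFace Transformers Trainer. The corpus file path, output directory, block size, and masking probability are illustrative placeholders; the exact training scripts are those we release with the paper.

```python
# A minimal sketch of language adaptive fine-tuning with HuggingFace
# Transformers. Paths and the choice of mBERT checkpoint are illustrative;
# per-language learning rates and epoch counts are given in the text above.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Monolingual corpus, one sentence per line (path is hypothetical).
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="mono/yor.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="mbert-yor",
                         num_train_epochs=3,   # 10 for amh, hau, wol
                         learning_rate=5e-5)   # 5e-6 for the Amharic BERT
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
```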
Language | Source | Size (MB) | No. sentences
amh | CC-100 (Conneau et al., 2020) | 889.7 | 3,124,760
hau | CC-100 | 318.4 | 3,182,277
ibo | JW300 (Agić and Vulić, 2019), CC-100, CC-Aligned (El-Kishky et al., 2020), and IgboNLP (Ezeani et al., 2020) | 118.3 | 1,068,263
kin | JW300, KIRNEWS (Niyongabo et al., 2020), and BBC Gahuza | 123.4 | 726,801
lug | JW300, CC-100, and BUKEDDE News | 54.0 | 506,523
luo | JW300 | 12.8 | 160,904
pcm | JW300 and BBC Pidgin | 56.9 | 207,532
swa | CC-100 | 1,800 | 12,664,787
wol | OPUS (Tiedemann, 2012) (excl. CC-Aligned), Wolof Bible (MBS, 2020), and news corpora (Lu Defu Waxu, Saabal, and Wolof Online) | 3.8 | 42,621
yor | JW300, Yoruba Embedding Corpus (Alabi et al., 2020), MENYO-20k (Adelani et al., 2021), CC-100, CC-Aligned, and news corpora (BBC Yoruba, Asejere, and Alaroye) | 117.6 | 910,628

Table 10: Monolingual corpora, their sources, size, and number of sentences.
     | DATE   | LOC    | ORG    | PER
DATE | 32,978 | –      | –      | –
LOC  | 10     | 70,610 | –      | –
ORG  | 0      | 52     | 35,336 | –
PER  | 2      | 48     | 12     | 64,216

Table 11: Entity-level confusion matrix between annotators, calculated over all ten languages.