MasakhaNER: Named Entity Recognition for African Languages

David Ifeoluwa Adelani1∗, Jade Abbott2∗, Graham Neubig3, Daniel D’souza4∗,
Julia Kreutzer5∗, Constantine Lignos6∗, Chester Palen-Michel6∗, Happy Buzaaba7∗,
Shruti Rijhwani3, Sebastian Ruder8, Stephen Mayhew9, Israel Abebe Azime10∗,
Shamsuddeen H. Muhammad11,12∗, Chris Chinenye Emezue13∗,
Joyce Nakatumba-Nabende14∗, Perez Ogayo15∗, Aremu Anuoluwapo16∗, Catherine Gitau∗,
Derguene Mbaye∗, Jesujoba Alabi17∗, Seid Muhie Yimam18, Tajuddeen Rabiu Gwadabe19∗,
Ignatius Ezeani20∗, Rubungo Andre Niyongabo21∗, Jonathan Mukiibi14, Verrah Otiende22∗,
Iroro Orife23∗, Davis David∗, Samba Ngom∗, Tosin Adewumi24∗, Paul Rayson20,
Mofetoluwa Adeyemi∗, Gerald Muriuki14, Emmanuel Anebi∗,
Chiamaka Chukwuneke20, Nkiruka Odu25, Eric Peter Wairagala14, Samuel Oyerinde∗,
Clemencia Siro∗, Tobius Saul Bateesa14, Temilola Oloyede∗, Yvonne Wambui∗,
Victor Akinode∗, Deborah Nabagereka14, Maurice Katusiime14, Ayodele Awokoya26∗,
Mouhamadane Mboup∗, Dibora Gebreyohannes∗, Henok Tilaye∗, Kelechi Nwaike∗,
Degaga Wolde∗, Abdoulaye Faye∗, Blessing Sibanda27∗, Orevaoghene Ahia28∗,
Bonaventure F. P. Dossou29∗, Kelechi Ogueji30∗, Thierno Ibrahima Diop∗,
Abdoulaye Diallo∗, Adewale Akinfaderin∗, Tendai Marengereke∗, and Salomey Osei10∗

∗Masakhane NLP, 1Spoken Language Systems Group (LSV), Saarland University, Germany,
2Retro Rabbit, South Africa, 3Language Technologies Institute, Carnegie Mellon University, United
States, 4ProQuest, United States, 5Google Research, Canada, 6Brandeis University, United States,
7Graduate School of Systems and Information Engineering, University of Tsukuba, Japan,
8DeepMind, United Kingdom, 9Duolingo, United States, 10African Institute for Mathematical Sciences
(AIMS-AMMI), Ethiopia, 11University of Porto, Portugal, 12Bayero University, Kano, Nigeria,
13Technical University of Munich, Germany, 14Makerere University, Kampala, Uganda, 15African
Leadership University, Rwanda, 16University of Lagos, Nigeria, 17Max Planck Institute for Informatics,
Germany, 18LT Group, Universität Hamburg, Germany, 19University of Chinese Academy of Sciences,
China, 20Lancaster University, United Kingdom, 21University of Electronic Science and Technology of
China, China, 22United States International University – Africa (USIU-A), Kenya, 23Niger-Volta LTI,
24Luleå University of Technology, Sweden, 25African University of Science and Technology, Abuja,
Nigeria, 26University of Ibadan, Nigeria, 27Namibia University of Science and Technology, Namibia,
28Instadeep, Nigeria, 29Jacobs University Bremen, Germany, 30University of Waterloo, Canada

Abstract
We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. Finally, we release the data, code, and models to inspire future research on African NLP.1

1https://git.io/masakhane-ner.

Transactions of the Association for Computational Linguistics, vol. 9, pp. 1116–1131, 2021. https://doi.org/10.1162/tacl_a_00416. Action Editor: Miguel Ballesteros. Submission batch: 5/2021; Revision batch: 7/2021; Published 10/2021. © 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

1 Introduction

Africa has over 2,000 spoken languages (Eberhard et al., 2020); however, these languages are scarcely represented in existing natural language processing (NLP) datasets, research, and tools (Martinus and Abbott, 2019). ∀ et al. (2020) investigate the reasons for these disparities by examining how NLP for low-resource languages is constrained by several societal factors. One of these factors is the geographical and language diversity of NLP researchers. For example, of the 2695 affiliations of authors whose works were published at the five major NLP conferences in 2019, only five were from African institutions (Caines, 2019). In contrast, many NLP tasks such


as machine translation, text classification, part-of-speech tagging, and named entity recognition would benefit from the knowledge of native speakers who are involved in the development of datasets and models.

In this work, we focus on named entity recognition (NER)—one of the most impactful tasks in NLP (Sang and De Meulder, 2003; Lample et al., 2016). NER is an important information extraction task and an essential component of numerous products including spell-checkers, localization of voice and dialogue systems, and conversational agents. It also enables identifying African names, places, and organizations for information retrieval. African languages are under-represented in this crucial task due to a lack of datasets, reproducible results, and researchers who understand the challenges that such languages present for NER.

In this paper, we take an initial step towards improving representation for African languages for the NER task, making the following contributions:

(i) We bring together language speakers, dataset curators, NLP practitioners, and evaluation experts to address the challenges facing NER for African languages. Based on the availability of online news corpora and language annotators, we develop NER datasets, models, and evaluation covering ten widely spoken African languages.

(ii) We curate NER datasets from local sources to ensure the relevance of future research for native speakers of the respective languages.

(iii) We train and evaluate multiple NER models for all ten languages. Our experiments provide insights into the transfer across languages, and highlight open challenges.

(iv) We release the datasets, code, and models to facilitate future research on the specific challenges raised by NER for African languages.

2 Related Work

African NER Datasets NER is a well-studied sequence labeling task (Yadav and Bethard, 2018) and has been the subject of many shared tasks in different languages (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003; Sangal et al., 2008; Shaalan, 2014; Benikova et al., 2014). However, most of the available datasets are in high-resource languages. Although there have been efforts to create NER datasets for lower-resourced languages, such as the WikiAnn corpus (Pan et al., 2017) covering 282 languages, such datasets consist of ‘‘silver-standard’’ labels created by transferring annotations from English to other languages through cross-lingual links in knowledge bases. Because the WikiAnn corpus data comes from Wikipedia, it includes some African languages, though most have fewer than 10k tokens.

Other NER datasets for African languages include SADiLaR (Eiselen, 2016) for ten South African languages based on government data, and small corpora of fewer than 2K sentences for Yorùbá (Alabi et al., 2020) and Hausa (Hedderich et al., 2020). In addition, the LORELEI language packs (Strassel and Tracey, 2016) include some African languages (Yorùbá, Hausa, Amharic, Somali, Twi, Swahili, Wolof, Kinyarwanda, and Zulu), but are not publicly available.

NER Models Popular sequence labeling models for NER include the CRF (Lafferty et al., 2001), CNN-BiLSTM (Chiu and Nichols, 2016), BiLSTM-CRF (Huang et al., 2015), and CNN-BiLSTM-CRF (Ma and Hovy, 2016). The traditional CRF makes use of hand-crafted features like part-of-speech tags, context words, and word capitalization. Neural NER models, on the other hand, are initialized with word embeddings like Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017). More recently, pre-trained language models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and LUKE (Yamada et al., 2020) have been applied to produce state-of-the-art results for the NER task. Multilingual variants of these models, like mBERT and XLM-RoBERTa (Conneau et al., 2020), make it possible to train NER models for several languages using transfer learning. Language-specific parameters and adaptation to unlabeled data of the target language have yielded further gains (Pfeiffer et al., 2020a,b).

3 Focus Languages

Table 1 provides an overview of the languages considered in this work: their language family,


Language           Family                        Speakers   Region
Amharic            Afro-Asiatic-Ethio-Semitic    33M        East
Hausa              Afro-Asiatic-Chadic           63M        West
Igbo               Niger-Congo-Volta-Niger       27M        West
Kinyarwanda        Niger-Congo-Bantu             12M        East
Luganda            Niger-Congo-Bantu             7M         East
Luo                Nilo-Saharan                  4M         East
Nigerian-Pidgin    English Creole                75M        West
Swahili            Niger-Congo-Bantu             98M        Central & East
Wolof              Niger-Congo-Senegambia        5M         West & NW
Yorùbá             Niger-Congo-Volta-Niger       42M        West

Table 1: Language, family, number of speakers (Eberhard et al., 2020), and regions in Africa.

number of speakers, and the regions in Africa where they are spoken. We chose to focus on these languages due to the availability of online news corpora and annotators, and most importantly because they are widely spoken native African languages. Both region and language family might indicate a notion of proximity for NER, either because of linguistic features shared within that family, or because data sources cover a common set of locally relevant entities. We highlight language specifics for each language to illustrate the diversity of this selection of languages in Section 3.1, and then showcase the differences in named entities across these languages in Section 3.2.

3.1 Language Characteristics

Amharic (amh) uses the Fidel script, consisting of 33 basic scripts (hä, lä, mä, šä, …), each of them with at least 7 vowel sequences (such as ha, hu, hi, hä, hē, hī, ho). This results in more than 231 characters, or Fidels. Numbers and punctuation marks are also represented uniquely with specific Fidels.

Hausa (hau) has 23–25 consonants, depending on the dialect, and five short and five long vowels. Hausa has labialized phonemic consonants, as in /gw/ (e.g., ‘agwagwa’). As found in some African languages, implosive consonants also exist in Hausa (e.g., ɓ, ɗ, etc., as in ‘ɓarna’). Similarly, the Hausa approximant ‘r’ is realized in two distinct manners: roll and trill, as in ‘rai’ and ‘ra’ayi’, respectively.

Igbo (ibo) is an agglutinative language, with many frequent suffixes and prefixes (Emenanjo, 1978). A single stem can yield many word-forms by the addition of affixes that extend its original meaning (Onyenwe and Hepple, 2016). Igbo is also tonal, with two distinctive tones (high and low) and a down-stepped high tone in some cases. The alphabet consists of 28 consonants and 8 vowels (a, e, i, ị, o, ọ, u, ụ). In addition to the Latin letters (except c), Igbo contains the following digraphs: ch, gb, gh, gw, kp, kw, nw, ny, sh.

Kinyarwanda (kin) makes use of 24 Latin characters, with 5 vowels similar to English and 19 consonants excluding q and x. In addition, Kinyarwanda has 74 additional complex consonants (such as mb, mpw, and njyw) (Government, 2014). It is a tonal language with three tones: low (no diacritic), high (signaled by ‘‘/’’), and falling (signaled by ‘‘∧’’). The default word order is subject-verb-object.

Luganda (lug) is a tonal language with subject-verb-object word order. The Luganda alphabet is composed of 24 letters that include 17 consonants (p, v, f, m, d, t, l, r, n, z, s, j, c, g, …), 5 vowel sounds represented by the five alphabetical symbols (a, e, i, o, u), and 2 semi-vowels (w, y). It also has a special consonant ŋ.

Luo (luo) is a tonal language with 4 tones (high, low, falling, rising), although the tonality is not marked in the orthography. It has 26 Latin consonants, without the Latin letters c, q, v, x, and z, and additional consonants (ch, dh, mb, nd, ng’, ng, ny, nj, th, sh). There are nine vowels (a, e, i, o, u, ɛ, ɪ, ɔ, ʊ), which are distinguished primarily by advanced tongue root (ATR) harmony (De Pauw et al., 2007).

Nigerian-Pidgin (pcm) is a largely oral, national lingua franca with a distinct phonology from English, its lexifier language. Portuguese, French, and especially indigenous languages form the substrate of lexical, phonological, syntactic, and semantic influence on Nigerian-Pidgin (NP). English lexical items absorbed by NP are often phonologically closer to indigenous Nigerian languages, notably in the realization of vowels. As a rapidly evolving language, the NP orthography is undergoing codification and indigenization (Offiong Mensah, 2012; Onovbiona, 2012; Ojarikre, 2013).


Table 2: Example of named entities in different languages. PER, LOC, and DATE entities are shown in purple, orange, and green, respectively.

Swahili (swa) is the most widely spoken language on the African continent. It has 30 letters, including 24 Latin letters (without the characters q and x) and six additional consonants (ch, dh, gh, ng’, sh, th) unique to Swahili pronunciation.

Wolof (wol) has an alphabet similar to that of French. It consists of 29 characters, including all letters of the French alphabet except h, v, and z. It also includes the characters Ŋ (‘‘ng’’, lowercase: ŋ) and Ñ (‘‘gn’’, as in Spanish). Accents are present, but limited in number (à, é, ë, ó). However, unlike many other Niger-Congo languages, Wolof is not a tonal language.

Yorùbá (yor) has 25 Latin letters, without the Latin characters c, q, v, x, and z, and with additional letters (ẹ, gb, ṣ, ọ). Yorùbá is a tonal language with three tones: low (‘‘\’’), middle (‘‘−’’, optional), and high (‘‘/’’). The tonal marks and underdots are referred to as diacritics, and they are needed for the correct pronunciation of a word. Yorùbá is a highly isolating language, and the sentence structure follows subject-verb-object.

3.2 Named Entities

Most of the work on NER is centered around English, and it is unclear how well existing models can generalize to other languages in terms of sentence structure or surface forms. In Hu et al.’s (2020) evaluation of cross-lingual generalization for NER, only two African languages were considered, and transformer-based models particularly struggled to generalize to named entities in Swahili. To highlight the differences across our focus languages, Table 2 shows an English2 example sentence, with color-coded PER, LOC, and DATE entities, and the corresponding translations. The following characteristics of the languages in our dataset could pose challenges for NER systems developed for English:

• Amharic shares no lexical overlap with the English source sentence.

• While ‘‘Zhang’’ is identical across all Latin-script languages, ‘‘Kano’’ features accents in Wolof and Yorùbá due to its localization.

• The Fidel script has no capitalization, which could hinder transfer from other languages.

• Igbo, Wolof, and Yorùbá all use diacritics, which are not present in the English alphabet.

• The surface form of named entities (NE) is the same in English and Nigerian-Pidgin, but there exist lexical differences (e.g., in terms of how time is realized).

• Among the 10 African languages, ‘‘Nigeria’’ is spelled in 6 different ways.

• Numerical ‘‘18’’: Igbo, Wolof, and Yorùbá write out their numbers, resulting in different numbers of tokens for the entity span.

2Although the original sentence is from BBC Pidgin: https://www.bbc.com/pidgin/tori-51702073.

Language         Data Source             Train/dev/test   # Anno.   PER    ORG    LOC    DATE   % Entity Tokens   # Tokens
Amharic          DW & BBC                1750/250/500     4         730    403    1,420  580    15.13             37,032
Hausa            VOA Hausa               1903/272/545     3         1,490  766    2,779  922    12.17             80,152
Igbo             BBC Igbo                2233/319/638     6         1,603  1,292  1,677  690    13.15             61,668
Kinyarwanda      IGIHE news              2110/301/604     2         1,366  1,038  2,096  792    12.85             68,819
Luganda          BUKEDDE news            2003/200/401     3         1,868  838    943    574    14.81             46,615
Luo              Ramogi FM news          644/92/185       2         557    286    666    343    14.95             26,303
Nigerian-Pidgin  BBC Pidgin              2100/300/600     5         2,602  1,042  1,317  1,242  13.25             76,063
Swahili          VOA Swahili             2104/300/602     6         1,702  960    2,842  940    12.48             79,272
Wolof            Lu Defu Waxu & Saabal   1,871/267/536    2         731    245    836    206    6.02              52,872
Yorùbá           GV & VON news           2124/303/608     5         1,039  835    1,627  853    11.57             83,285

Table 3: Statistics of our datasets including their source, number of sentences in each split, number of annotators, number of entities of each label type, percentage of tokens that are named entities, and total number of tokens.

4 Data and Annotation Methodology

Our data were obtained from local news sources, in order to ensure the relevance of the dataset for native speakers from those regions. The dataset was annotated using the ELISA tool (Lin et al., 2018) by native speakers who come from the same regions as the news sources and volunteered through the Masakhane community.3 Annotators were not paid but are all included as authors of this paper. The annotators were trained on how to perform NER annotation using the MUC-6 annotation guide.4 We annotated four entity types: personal name (PER), location (LOC), organization (ORG), and date & time (DATE). The annotated entities were inspired by the English CoNLL-2003 corpus (Tjong Kim Sang and De Meulder, 2003). We replaced the MISC tag with the DATE tag following Alabi et al. (2020), as the MISC tag may be ill-defined and cause disagreement among non-expert annotators. We report the number of annotators as well as general statistics of the datasets in Table 3. For each language, we divided the annotated data into training, development, and test splits consisting of 70%, 10%, and 20% of the data, respectively.

A key objective of our annotation procedure was to create high-quality datasets by ensuring high annotator agreement. To achieve high agreement scores, we ran collaborative workshops for each language, which allowed annotators to discuss any disagreements. ELISA provides an entity-level F1-score and also an interface for annotators to correct their mistakes, making it easy to achieve inter-annotator agreement scores between 0.96 and 1.0 for all languages.

3https://www.masakhane.io.
4https://cs.nyu.edu/∼grishman/muc6.html.

Dataset   Token Fleiss’ κ   Entity Fleiss’ κ   Disagreement from Type
amh       0.987             0.959              0.044
hau       0.988             0.962              0.097
ibo       0.995             0.983              0.071
kin       1.000             1.000              0.000
lug       0.997             0.990              0.023
luo       1.000             1.000              0.000
pcm       0.989             0.966              0.048
swa       1.000             1.000              0.000
wol       1.000             1.000              0.000
yor       0.990             0.964              0.079

Table 4: Inter-annotator agreement for our datasets calculated using Fleiss’ kappa (κ) at the token and entity level. Disagreement from type refers to the proportion of all entity-level disagreements that are due only to type mismatch.

We report inter-annotator agreement scores in Table 4 using Fleiss’ kappa (Fleiss, 1971) at both the token and entity level. The latter considers each span an annotator proposed as an entity. As a result of our workshops, all our datasets have exceptionally high inter-annotator agreement. For Kinyarwanda, Luo, Swahili, and Wolof, we report perfect inter-annotator agreement scores (κ = 1). For each of these languages, two annotators annotated each token and were instructed to discuss and resolve conflicts among themselves. The Appendix provides a detailed entity-level confusion matrix in Table 11.
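For reference, the token-level kappa in Table 4 follows Fleiss (1971); a minimal sketch (assuming a count matrix, not the authors’ evaluation code):

```python
# Fleiss' kappa sketch. `ratings` is an N x k matrix: for each of N tokens,
# the number of annotators (out of m, assumed constant) who assigned each of
# the k label categories.
import numpy as np

def fleiss_kappa(ratings):
    ratings = np.asarray(ratings, dtype=float)
    n_items = ratings.shape[0]
    m = ratings.sum(axis=1)[0]                       # annotators per item
    p_j = ratings.sum(axis=0) / (n_items * m)        # category proportions
    P_i = ((ratings ** 2).sum(axis=1) - m) / (m * (m - 1))  # per-item agreement
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)
```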


5 Experimental Setup

5.1 NER Baseline Models

To evaluate baseline performance on our dataset, we experiment with three popular NER models: CNN-BiLSTM-CRF, multilingual BERT (mBERT), and XLM-RoBERTa (XLM-R). The latter two models are implemented using the HuggingFace transformers toolkit (Wolf et al., 2019). For each language, we train the models on the in-language training data and evaluate on its test data.

CNN-BiLSTM-CRF This architecture was proposed for NER by Ma and Hovy (2016). For each input sequence, we first compute the vector representation of each word by concatenating character-level encodings from a CNN with vector embeddings for each word. Following Rijhwani et al. (2020), we use randomly initialized word embeddings, since we do not have high-quality pre-trained embeddings for all the languages in our dataset. Our model is implemented using the DyNet toolkit (Neubig et al., 2017).
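A minimal PyTorch sketch of the word representation in this architecture (the paper’s implementation uses DyNet; all dimensions here are illustrative assumptions, and the CRF layer on top is omitted for brevity):

```python
# Character-level CNN encodings concatenated with randomly initialized word
# embeddings, fed into a BiLSTM; a CRF layer would score the BiLSTM outputs.
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_chars, n_words, char_dim=30, char_filters=30,
                 word_dim=100, hidden=200):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.word_emb = nn.Embedding(n_words, word_dim, padding_idx=0)  # random init
        self.bilstm = nn.LSTM(word_dim + char_filters, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_ids):
        # char_ids: (batch, seq_len, max_word_len)
        b, s, c = char_ids.shape
        chars = self.char_emb(char_ids.view(b * s, c)).transpose(1, 2)
        char_feat = self.char_cnn(chars).max(dim=2).values.view(b, s, -1)
        words = torch.cat([self.word_emb(word_ids), char_feat], dim=-1)
        out, _ = self.bilstm(words)   # (batch, seq_len, 2 * hidden)
        return out
```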

mBERT We fine-tune multilingual BERT (Devlin et al., 2019) on our NER corpus by adding a linear classification layer to the pre-trained transformer model, and train it end-to-end. mBERT was trained on 104 languages, including only two African languages: Swahili and Yorùbá. We use the mBERT-base cased model with 12 Transformer blocks, a hidden size of 768, and 110M parameters.

XLM-R XLM-R (Conneau et al., 2020) was trained on 100 languages, including Amharic, Hausa, and Swahili. The major differences between XLM-R and mBERT are: (1) XLM-R was trained on Common Crawl, while mBERT was trained on Wikipedia; (2) XLM-R is based on RoBERTa, which is trained with a masked language model (MLM) objective, while mBERT was additionally trained with a next-sentence prediction objective. We make use of the XLM-R base and large models for the baseline models. The XLM-R-base model consists of 12 layers, with a hidden size of 768 and 270M parameters; XLM-R-large has 24 layers, with a hidden size of 1024 and 550M parameters.

MeanE-BiLSTM This is a simple BiLSTM model with an additional linear classifier. For each input sequence, we first extract a sentence embedding from the mBERT or XLM-R language model (LM) before passing it into the BiLSTM model. Following Reimers and Gurevych (2019), we make use of the mean of the 12-layer output embeddings of the LM (i.e., MeanE). This has been shown to provide better sentence representations than the embedding of the [CLS] token used for fine-tuning mBERT and XLM-R.
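A sketch of the MeanE extraction under these assumptions (illustrative, not the authors’ released code): the representations are the mean of the 12 transformer layer outputs, excluding the embedding layer, and are then fed to the BiLSTM + linear classifier described above.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def mean_layer_embeddings(sentence):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**enc, output_hidden_states=True)
    # hidden_states: embedding layer + 12 transformer layers;
    # average the 12 transformer layer outputs
    layers = torch.stack(out.hidden_states[1:], dim=0)
    return layers.mean(dim=0)   # (1, seq_len, 768), input to the BiLSTM
```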

Language BERT The mBERT and XLM-R models only support two and three of the languages under study, respectively. One effective approach to adapt pre-trained transformer models to new domains is ‘‘domain-adaptive fine-tuning’’ (Howard and Ruder, 2018; Gururangan et al., 2020)—fine-tuning on unlabeled data in the new domain—which also works very well when adapting to a new language (Pfeiffer et al., 2020a; Alabi et al., 2020). For each of the African languages, we performed language-adaptive fine-tuning on available unlabeled corpora, mostly from JW300 (Agić and Vulić, 2019), indigenous news sources, and the XLM-R Common Crawl corpora (Conneau et al., 2020). The Appendix provides the details of the unlabeled corpora in Table 10. This approach is quite useful for languages whose scripts are not supported by the multilingual transformer models, like Amharic, where we replace the vocabulary of mBERT with an Amharic vocabulary before performing language-adaptive fine-tuning, similar to Alabi et al. (2020).

5.2 Improving the Baseline Models

In this section, we consider techniques to improve the baseline models, such as utilizing gazetteers, transfer learning from other domains and languages, and aggregating NER datasets by region. For these experiments, we focus on the PER, ORG, and LOC categories, because the gazetteers from Wikipedia do not contain DATE entities, and some source domains and languages that we transfer from do not have DATE annotations. We apply these modifications to the XLM-R model because it generally outperforms mBERT in our experiments (see Section 6).

5.2.1 Gazetteers for NER

Gazetteers are lists of named entities collected from manually crafted resources such as

GeoNames or Wikipedia. Before the widespread adoption of neural networks, NER methods used gazetteer-based features to improve performance (Ratinov and Roth, 2009). These features are created for each n-gram in the dataset and are typically binary-valued, indicating whether that n-gram is present in the gazetteer.
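A minimal sketch of such binary features (following the classic design of Ratinov and Roth (2009); the soft gazetteers of Rijhwani et al. (2020), used below, are a more refined variant):

```python
# Binary gazetteer features: for each n-gram in a sentence, one feature per
# gazetteer type indicating membership; matched tokens are marked.
def gazetteer_features(tokens, gazetteers, max_ngram=3):
    """gazetteers: dict mapping a type (e.g., 'LOC') to a set of entity strings."""
    feats = [[0] * len(gazetteers) for _ in tokens]
    for n in range(1, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            span = " ".join(tokens[i:i + n])
            for g, names in enumerate(gazetteers.values()):
                if span in names:
                    for j in range(i, i + n):
                        feats[j][g] = 1
    return feats

feats = gazetteer_features(["Zhang", "visited", "Kano"],
                           {"PER": {"Zhang"}, "LOC": {"Kano"}})
```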

Recently, Rijhwani et al. (2020) showed that augmenting the neural CNN-BiLSTM-CRF model with gazetteer features can improve NER performance for low-resource languages. We conduct similar experiments on the languages in our dataset, using entity lists from Wikipedia as gazetteers. For Luo and Nigerian-Pidgin, which do not have their own Wikipedia, we use entity lists from English Wikipedia.

5.2.2 Transfer Learning

Here, we focus on cross-domain transfer from Wikipedia to the news domain, and cross-lingual transfer from English and Swahili NER datasets to the other languages in our dataset.

Domain Adaptation from WikiAnn We make use of the WikiAnn corpus (Pan et al., 2017), which is available for five of the languages in our dataset: Amharic, Igbo, Kinyarwanda, Swahili, and Yorùbá. For each language, the corpus contains 100 sentences in each of the training, development, and test splits, except for Swahili, which contains 1K sentences in each split. For each language, we train on the corresponding WikiAnn training set and either zero-shot transfer to our respective test set or additionally fine-tune on our training data.

Cross-lingual Transfer For training the cross-lingual transfer models, we use the CoNLL-20035 NER dataset in English, with over 14K training sentences, and our annotated corpus. We chose CoNLL-2003 because it is in the same news domain as our annotated corpus. We also make use of the languages that are supported by the XLM-R model and are widely spoken in East and West Africa, like Swahili and Hausa. The English corpus has been shown to transfer very well to low-resource languages (Hedderich et al., 2020; Lauscher et al., 2020). We first train on either the English CoNLL-2003 data or our training data in Swahili, Hausa, or Nigerian-Pidgin before testing on the target African languages.

5We also tried OntoNotes 5.0 by combining FAC & ORG as ‘‘ORG’’ and GPE & LOC as ‘‘LOC’’ and others as ‘‘O’’ except ‘‘PER’’, but it gave lower performance in zero-shot transfer (19.38 F1), while CoNLL-2003 gave 37.15 F1.
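A sketch of this zero-shot protocol (the predict() helper is a stand-in for whichever fine-tuned model is being evaluated; seqeval implements the entity-level CoNLL scoring used throughout):

```python
# Fine-tune on a source-language NER corpus (e.g., Swahili), then evaluate
# directly on a target language's test set with entity-level F1.
from seqeval.metrics import f1_score

def zero_shot_eval(source_model, target_test_sentences, target_gold_tags):
    pred_tags = [source_model.predict(s) for s in target_test_sentences]
    return f1_score(target_gold_tags, pred_tags)

# e.g., train XLM-R on swa as in Section 5.1, then call
# zero_shot_eval(swa_model, yor_test_tokens, yor_test_tags)
```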

5.3 Aggregating Languages by Regions

As previously illustrated in Table 2, several entities have the same form in different languages, while some entities may be more common in the region where the language is spoken. To study the performance of NER models across geographical areas, we combine languages based on the region of Africa where they are spoken (see Table 1): (1) East region, with Kinyarwanda, Luganda, Luo, and Swahili; (2) West region, with Hausa, Igbo, Nigerian-Pidgin, Wolof, and Yorùbá; (3) East and West regions—all languages except Amharic, because of its distinct writing system.
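A sketch of this aggregation (the load_train() function is a hypothetical loader mapping a language code, as in Table 1, to its training sentences):

```python
# Concatenate per-language training sets before fine-tuning a single model.
REGIONS = {
    "east": ["kin", "lug", "luo", "swa"],
    "west": ["hau", "ibo", "pcm", "wol", "yor"],
}
REGIONS["east_west"] = REGIONS["east"] + REGIONS["west"]  # all except amh

def combined_training_set(load_train, region):
    combined = []
    for lang in REGIONS[region]:
        combined.extend(load_train(lang))
    return combined
```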

6 Results

6.1 Baseline Models

Table 5 gives the F1-score obtained by the CNN-BiLSTM-CRF, mBERT, and XLM-R models on the test sets of the ten African languages when training on our in-language data. We additionally indicate whether the language is supported by the pre-trained language models (✓). The percentage of entities that are out-of-vocabulary (OOV; entities in the test set that are not present in the training set) is also reported alongside the results of the baseline models. In general, the datasets with greater numbers of OOV entities have lower performance with the CNN-BiLSTM-CRF model, while those with lower OOV rates (Hausa, Igbo, Swahili) have higher performance. We find that the CNN-BiLSTM-CRF model performs worse than fine-tuning the mBERT and XLM-R models end-to-end (FTune). We expect performance to be better (e.g., for Amharic and Nigerian-Pidgin, with over 18 F1 points difference) when using pre-trained word embeddings for the initialization of the BiLSTM model rather than random initialization (we leave this for future work, as discussed in Section 7).

Interestingly, the pre-trained language models (PLM) have reasonable performance even on languages they were not trained on, such as Igbo, Kinyarwanda, Luganda, Luo, and Wolof. However, languages supported by the PLM tend to have better performance overall.


Lang.  In      In      % OOV in   CNN-     mBERT-base      XLM-R-base      XLM-R-large  Lang. BERT  Lang. XLM-R
       mBERT?  XLM-R?  Test Ent.  BiLSTM-  MeanE / FTune   MeanE / FTune   FTune        FTune       FTune
                                  CRF
amh    –       ✓       72.94      52.08    0.0   / 0.0     63.57 / 70.62   76.18        60.89       77.97
hau    –       ✓       33.40      83.52    81.49 / 86.65   86.06 / 89.50   90.54        91.31       91.47
ibo    –       –       46.56      80.02    76.17 / 85.19   73.47 / 84.78   84.12        86.75       87.74
kin    –       –       57.85      62.97    65.85 / 72.20   63.66 / 73.32   73.75        77.57       77.76
lug    –       –       61.12      74.67    70.38 / 80.36   68.15 / 79.69   81.57        83.44       84.70
luo    –       –       65.18      65.98    56.56 / 74.22   52.57 / 74.86   73.58        75.59       75.27
pcm    –       –       61.26      67.67    81.87 / 87.23   81.93 / 87.26   89.02        89.95       90.00
swa    ✓       ✓       40.97      78.24    83.08 / 86.80   84.33 / 87.37   89.36        89.36       89.46
wol    –       –       69.73      59.70    57.21 / 64.52   54.97 / 63.86   67.90        69.43       68.31
yor    ✓       –       65.99      67.44    74.28 / 78.97   67.45 / 78.26   78.89        82.58       83.66

avg                    57.50      69.23    64.69 / 71.61   69.62 / 78.96   80.49        80.69       82.63
avg (excl. amh)        55.78      71.13    71.87 / 79.88   70.29 / 79.88   80.97        82.89       83.15

Table 5: NER model comparison, showing F1-score on the test sets after 50 epochs, averaged over 5 runs. This result is for all 4 tags in the dataset: PER, ORG, LOC, DATE. mBERT and XLM-R are trained in two ways: (1) MeanE: mean output embeddings from the 12 LM layers are used to initialize a BiLSTM + linear classifier, and (2) FTune: the LM is fine-tuned end-to-end with a linear classifier. Lang. BERT & Lang. XLM-R (base) are models fine-tuned after language-adaptive fine-tuning.

We observe that the fine-tuned XLM-R-base models have significantly better performance on five languages; two of these languages (Amharic and Swahili) are supported by the pre-trained XLM-R. Similarly, fine-tuning mBERT has better performance for Yorùbá, since the language is part of the PLM’s training corpus. Although mBERT is trained on Swahili, XLM-R-base shows better performance. This observation is consistent with Hu et al. (2020) and could be because XLM-R is trained on more Swahili text (Common Crawl with 275M tokens), whereas mBERT is trained on a smaller corpus from Wikipedia (6M tokens6).

Another observation is that mBERT tends to have better performance for the non-Bantu Niger-Congo languages (i.e., Igbo, Wolof, and Yorùbá), while XLM-R-base works better for Afro-Asiatic languages (i.e., Amharic and Hausa), Nilo-Saharan (i.e., Luo), and Bantu languages like Kinyarwanda and Swahili. We also note that the writing script is one of the primary factors influencing the transfer of knowledge in PLMs with regard to the languages they were not trained on. For example, mBERT achieves an F1-score of 0.0 on Amharic because it has not encountered the script during pre-training. Overall, we find the fine-tuned XLM-R-large (with 550M parameters) to be better than XLM-R-base (with 270M parameters) and mBERT (with 110M parameters) on almost all languages. However, mBERT models perform slightly better for Igbo, Luo, and Yorùbá despite having fewer parameters.

We further analyze the transfer abilities of mBERT and XLM-R by extracting sentence embeddings from the LMs to train a BiLSTM model (MeanE-BiLSTM) instead of fine-tuning them end-to-end. Table 5 shows that languages that are not supported by mBERT or XLM-R generally perform worse than the CNN-BiLSTM-CRF model (despite its random initialization), except for kin. Also, sentence embeddings extracted from mBERT often lead to better performance than XLM-R for languages they both do not support (like ibo, kin, lug, luo, and wol).

Lastly, we train NER models using language BERT models that have been adapted to each of the African languages via language-specific fine-tuning on unlabeled text. In all cases, fine-tuning language BERT and language XLM-R models achieves a 1%–7% improvement in F1-score over fine-tuning mBERT-base and XLM-R-base, respectively. This approach is still effective for small pre-training corpora, provided they are of good quality. For example, the Wolof monolingual corpus, which contains fewer than 50K sentences (see Table 10 in the Appendix), still improves performance by over 4% F1. Further, we obtain an improvement of over 60 F1 points for Amharic BERT because mBERT does not recognize the Amharic script.

6https://github.com/mayhewsw/multilingual-data-stats.

Method            amh    hau    ibo    kin    lug    luo    pcm    swa    wol    yor    avg
CNN-BiLSTM-CRF    50.31  84.64  81.25  60.32  75.66  68.93  62.60  77.83  61.84  66.48  68.99
+ Gazetteers      49.51  85.02  80.40  64.54  73.85  65.44  66.54  80.16  62.44  65.49  69.34

Table 6: Improving NER models using gazetteers. The result is only for 3 tags: PER, ORG, & LOC. Models trained for 50 epochs. Result is an average over 5 runs.

Table 7: Transfer learning results (F1-score; three tags: PER, ORG, & LOC) for amh, hau, ibo, kin, lug, luo, pcm, swa, wol, and yor. Rows compare the XLM-R-base baseline; zero-shot transfer from WikiAnn, eng-CoNLL, pcm, swa, and hau; the same five source settings with additional fine-tuning on the target-language training data; and models trained on the combined East, combined West, and combined 9-language training sets. WikiAnn, eng-CoNLL, and the annotated datasets are trained for 50 epochs; fine-tuning is only for 10 epochs. Results are averaged over 5 runs, and the total average (avg) is computed over the ibo, kin, lug, luo, wol, and yor languages. The overall highest F1-score is in bold, and the best F1-score in zero-shot settings is indicated with an asterisk (*).

6.2 Evaluation of Gazetteer Features

Table 6 shows the performance of the CNN-BiLSTM-CRF model with the addition of gazetteer features, as described in Section 5.2.1. On average, the model that uses gazetteer features performs better than the baseline. In general, languages with larger gazetteers, such as Swahili (16K entities in the gazetteer) and Nigerian-Pidgin (for which we use an English gazetteer with 2M entities), have larger improvements in performance than those with fewer gazetteer entries, such as Amharic and Luganda (2K and 500 gazetteer entities, respectively). This indicates that having high-coverage gazetteers is important for the model to take advantage of the gazetteer features.

6.3 Transfer Learning Experiments

Table 7 shows the results for the different transfer learning approaches, which we discuss individually in the following sections. We make use of the XLM-R-base model for all the experiments in this sub-section, because the performance difference if
we use XLM-R-large is small (<2%), as shown in Table 5, and because it is faster to train.

6.3.1 Cross-domain Transfer

We evaluate cross-domain transfer from Wikipedia to the news domain for the five languages that are available in the WikiAnn (Pan et al., 2017) dataset. In the zero-shot setting, the NER F1-score is low: less than 40 F1 for all languages, with Kinyarwanda and Yorùbá having less than 10 F1. This is likely due to the number of training sentences present in WikiAnn: there are only 100 sentences in the datasets of Amharic, Igbo, Kinyarwanda, and Yorùbá. Although the Swahili corpus has 1,000 sentences, the 35 F1-score shows that transfer is not very effective. In general, cross-domain transfer is a challenging problem, and is even harder when the number of training examples from the source domain is small. Fine-tuning on the in-domain news NER data does not improve over the baseline (XLM-R-base).

6.3.2 Cross-Lingual Transfer

Zero-shot In the zero-shot setting, we evaluated NER models trained on the English eng-CoNLL03 dataset, and on the Nigerian-Pidgin (pcm), Swahili (swa), and Hausa (hau) annotated corpus. We excluded the MISC entity in the eng-CoNLL03 corpus because it is absent in our target datasets. Table 7 shows the results for the (zero-shot) transfer performance. We observe that the closer the source and target languages are geographically, the better the performance. The pcm model (trained on only 2K sentences) obtains similar transfer performance to the eng-CoNLL03 model (trained on 14K sentences). swa performs better than pcm and eng-CoNLL03, with an improvement of over 14 F1 on average.
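The per-entity breakdown reported below (Table 8) corresponds to seqeval’s per-type scores; a minimal sketch (the predict() helper is a stand-in for the trained source-language model):

```python
# Per-entity-type precision/recall/F1 for zero-shot evaluation, as in Table 8.
from seqeval.metrics import classification_report

def per_entity_report(source_model, test_tokens, test_tags):
    pred_tags = [source_model.predict(sentence) for sentence in test_tokens]
    # Reports scores for PER, ORG, and LOC individually
    print(classification_report(test_tags, pred_tags))
```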
We found that, on average, transferring from Hausa provided the best F1, with an improvement of over 16% and 1% compared to using the eng-CoNLL and swa data, respectively. Per-entity analysis in Table 8 shows that the largest improvements are obtained for ORG. The pcm data were more effective in transferring to LOC and ORG, while swa and hau performed better when transferring to PER. In general, zero-shot transfer is most effective when transferring from Hausa and Swahili. Fine-tuning We use the target language corpus to fine-tune the NER models previously trained on eng-CoNLL, pcm, and swa. On average, there is only a small improvement when compared to the XLM-R base model. In particular, we see signifi- cant improvement for Hausa, Igbo, Kinyarwanda, Nigerian-Pidgin, Wolof, and Yor`ub´a using either swa or hau as the source NER model. 6.4 Regional Influence on NER We evaluate whether combining different lan- guage training datasets by region affects the per- formance for individual languages. Table 7 shows that all languages spoken in West Africa (ibo, wol, pcm, yor) except hau have slightly better performance (0.1–2.6 F1) when we train on their combined training data. However, for the East-African languages, the F1 score only im- proved (0.8–2.3 F1) for three languages (kin, lug, luo). Training the NER model on all nine languages leads to better performance on all languages except Swahili. On average over six languages (ibo, kin, lug, luo, wol, yor), the performance improves by 1.6 F1. 6.5 Error Analysis Finally, to better understand the types of entities that were successfully identified and those that were missed, we performed fine-grained analy- sis of our baseline methods mBERT and XLM-R 1125 using the method of Fu et al. (2020), with re- sults shown in Table 9. Specifically, we found that across all languages, entities that were not contained in the training data (zero-frequency en- tities), and entities consisting of more than three words (long entities) were particularly difficult in all languages; compared to the F1 score over all entities, the scores dropped by around 5 points when evaluated on zero-frequency entities, and by around 20 points when evaluated on long entities. Future work on low-resource NER or cross-lingual representation learning may further improve on these hard cases. 7 Conclusion and Future Work We address the NER task for African languages by bringing together a variety of stakeholders to create a high-quality NER dataset for ten African languages. We evaluate multiple state-of-the-art NER models and establish strong baselines. We have released one of our best models that can recognize named entities in ten African languages on HuggingFace Model Hub.7 We also investi- gate cross-domain transfer with experiments on five languages with the WikiAnn dataset, along with cross-lingual transfer for low-resource NER using the English CoNLL-2003 dataset and other languages supported by XLM-R. In the future, we plan to use pretrained word embeddings such as GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017) instead of random ini- tialization for the CNN-BiLSTM-CRF, increase the number of annotated sentences per language, and expand the dataset to more African languages. Acknowledgments We would like to thank Heng Ji and Ying Lin for providing the ELISA NER tool used for annota- tion. We also thank the Spoken Language Systems Chair, Dietrich Klakow at Saarland University, for providing GPU resources to train the models. 
We thank Adhi Kuncoro and the anonymous re- viewers for their useful feedback on a draft of this paper. David Adelani acknowledges the support of the EU-funded H2020 project COMPRISE under grant agreement no. 3081705. Finally, we thank Mohamed Ahmed for proofreading the draft. 7https://huggingface.co/Davlan/xlm-roberta -large-masakhaner. References D. Adelani, Dana Ruiter, J. Alabi, Damilola Adebonojo, Adesina Ayeni, Mofetoluwa Adeyemi, Ayodele Awokoya, and C. Espa˜na- Bonet. 2021. MENYO-20k: A multi-domain english-yor`ub´a corpus for machine translation and domain adaptation. ArXiv, abs/2103.08647. ˇZeljko Agi´c and Ivan Vuli´c. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210, Florence, Italy. Association for Computational Linguistics. Jesujoba Alabi, Kwabena Amponsah-Kaakyire, David Adelani, and Cristina Espa˜na-Bonet. 2020. Massive vs. curated embeddings for low-resourced languages: The case of Yor`ub´a and Twi. In Proceedings of the 12th Lan- guage Resources and Evaluation Conference, pages 2754–2762, Marseille, France. European Language Resources Association. Darina Benikova, Chris Biemann, and Marc Reznicek. 2014. NoSta-D named entity anno- tation for German: Guidelines and dataset. In Proceedings of the Ninth International Con- ference on Language Resources and Evalua- tion (LREC’14), pages 2524–2531, Reykjavik, Iceland. European Language Resources Asso- ciation (ELRA). Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vec- tors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146. https://doi.org/10.1162 /tacl a 00051 Andrew Caines. 2019. The geographic diversity of NLP conferences. Jason P.C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm´an, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representa- tion learning at scale. In Proceedings of the 58th 1126 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 1 6 1 9 6 6 2 0 1 / / t l a c _ a _ 0 0 4 1 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Annual Meeting of the Association for Compu- tational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics. Guy De Pauw, Peter W Wagacha, and Dorothy Atieno Abade. 2007. Unsupervised induction of Dholuo word classes using maximum entropy learning. Proceedings of the First International Computer Science and ICT Conference, page 8. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Con- ference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota. Association for Computational Linguistics. David M. Eberhard, Gary F. Simons, and Charles D. Fennig (eds.). 2020. Ethnologue: Languages of the World. 23rd edition. Roald Eiselen. 2016. Government domain named entity recognition for South African languages. 
In Proceedings of the Tenth International Con- ference on Language Resources and Evalua- tion (LREC’16), pages 3344–3348, Portoroˇz, Slovenia. European Language Resources Asso- ciation (ELRA). Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzm´an, and Philipp Koehn. 2020. CCAligned: A massive collection of cross-lingual web- document pairs. In Proceedings of the 2020 Conference on Empirical Methods in Nat- ural Language Processing (EMNLP 2020), pages 5960–5969, Online. Association for Computational Linguistics. Nolue Emenanjo. 1978. Elements of Modern Igbo Grammar - a descriptive approach. Ibadan, Nigeria. Oxford University Press. Ignatius Ezeani, Paul Rayson, I. Onyenwe, C. Uchechukwu, and M. Hepple. 2020. Igbo- english machine translation: An evaluation benchmark. ArXiv, abs/2004.00648. https:// doi.org/10.1037/h0031619 Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378. Jinlan Fu, Pengfei Liu, and Graham Neubig. 2020. Interpretable multi-dataset evaluation for named entity recognition. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6058–6069, Online. Association for Computational Linguistics. Rwanda Government. 2014. Official gazette number 41 bis of 13/10/2014. Suchin Gururangan, Ana Marasovi´c, Swabha Iz Beltagy, Doug Swayamdipta, Kyle Lo, Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Associ- ation for Computational Linguistics. Michael A. Hedderich, David Adelani, Dawei Zhu, Jesujoba Alabi, Udia Markus, and Dietrich Klakow. 2020. Transfer learning and distant su- pervision for multilingual transformer models: A study on African languages. In Proceedings of the 2020 Conference on Empirical Meth- ods in Natural Language Processing (EMNLP), pages 2580–2591, Online. Association for Computational Linguistics. Jeremy Howard and Sebastian Ruder. 2018. Uni- language model fine-tuning for text versal classification. In Proceedings of ACL 2018. Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively mul- tilingual multi-task benchmark for evaluating cross-lingual generalization. In Proceedings of ICML 2020. Zhiheng Huang, W. Xu, and Kailiang Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. ArXiv, abs/1508.01991. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional ran- dom fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. 1127 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 1 6 1 9 6 6 2 0 1 / / t l a c _ a _ 0 0 4 1 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL- HLT 2016. Anne Lauscher, Vinit Ravishankar, Ivan Vuli´c, and Goran Glavaˇs. 2020. From zero to hero: On the limitations of zero-shot language transfer with multilingual Transformers. 
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4483–4499, Online. Association for Computational Linguistics. Ying Lin, Cash Costello, Boliang Zhang, Di Lu, Heng Ji, James Mayfield, and Paul McNamee. 2018. Platforms for non-speakers annotating names in any language. In Proceedings of ACL 2018, System Demonstrations, pages 1–6, Melbourne, Australia. Association for Compu- tational Linguistics. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM- CNNs-CRF. In Proceedings of the 54th Annual the Association for Computa- Meeting of tional Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany. Associa- tion for Computational Linguistics. Laura Martinus and Jade Z. Abbott. 2019. A focus on neural machine translation for African languages. arXiv preprint arXiv:1906.05685. MBS. 2020. T´eereb Injiil: La Bible Wolof – Ancien Testament. http://biblewolof .com/. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural In- formation Processing Systems, volume 26, pages 3111–3119. Curran Associates, Inc. Wilhelmina Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Nekoto, Iroro Orife, Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Julia Kelechi Ogueji, Kathleen Siminyu, Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Ignatius Ezeani, Idris Abdulkadir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, Ghollah Kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp ¨Oktem, Adewale Akinfaderin, and Abdallah Bashir. 2020. Participatory research for low-resourced machine translation: A case study in African the Association languages. for Computational Linguistics: EMNLP 2020. Online. In Findings of Graham Neubig, Chris Dyer, Y. Goldberg, A. Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Manish Kumar, Chaitanya Malaviya, Paul Michel, Y. Oda, M. Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. Dynet: The dynamic neural network toolkit. ArXiv, abs/1701.03980. Rubungo Andre Niyongabo, Qu Hong, Julia Kreutzer, and Li Huang. 2020. KINNEWS and KIRNEWS: Benchmarking cross-lingual text classification for Kinyarwanda and Kirundi. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5507–5521, Barcelona, Spain (Online). International Committee on Computational Linguistics. Eyo Offiong Mensah. 2012. Grammaticalization in Nigerian Pidgin. ´Ikala, revista de lenguaje y cultura, 17(2):167–179. Anthony Ojarikre. 2013. Perspectives and prob- lems of codifying Nigerian Pidgin English orthography. Perspectives, 3(12). 
1128 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 1 6 1 9 6 6 2 0 1 / / t l a c _ a _ 0 0 4 1 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Ijite Blessing Onovbiona. 2012. Serial verb construction in Nigerian Pidgin. Ikechukwu E. Onyenwe and Mark Hepple. 2016. Predicting morphologically-complex unknown words in Igbo. In Text, Speech, and Dialogue, pages 206–214, Cham. Springer International Publishing. Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual the Association for Computa- Meeting of tional Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada. Associ- ation for Computational Linguistics. Jeffrey Socher, Pennington, Richard and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Meth- ods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics. Jonas Pfeiffer, Ivan Vuli, Iryna Gurevych, and Sebastian Ruder. 2020a. MAD-X: An Adapter- based Framework for Multi-task Cross-lingual Transfer. In Proceedings of EMNLP 2020. Jonas Pfeiffer, Ivan Vuli´c, Iryna Gurevych, and Sebastian Ruder. 2020b. Unks everywhere: Adapting multilingual language models to new scripts. arXiv preprint arXiv:2012.15562. Lev Ratinov and Dan Roth. 2009. Design in named challenges and misconceptions the entity recognition. Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 147–155, Boulder, Colorado. Association for Computational Linguistics. In Proceedings of Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using Siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. Shruti Rijhwani, Shuyan Zhou, Graham Neubig, and Jaime Carbonell. 2020. Soft gazetteers for low-resource named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguis- tics, pages 8118–8123, Online. Association for Computational Linguistics. Erik F. Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recogni- tion. In Proceedings of CoNLL 2003. Rajeev Sangal, Dipti Misra Sharma, and Anil Kumar Singh. 2008. In Proceedings of the IJCNLP-08 workshop on named entity recogni- tion for south and south east Asian languages. K. Shaalan. 2014. A survey of Arabic named entity recognition and classification. Computational Linguistics, 40:469–510. https://doi.org /10.1162/COLI a 00178 for technology development Stephanie Strassel and Jennifer Tracey. 2016. LORELEI language packs: Data, tools, and resources in low resource languages. In Proceedings of the Tenth International Conference on Lan- guage Resources and Evaluation (LREC’16), pages 3273–3280. Portoroˇz, Slovenia. Euro- pean Language Resources Association (ELRA). J¨org Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Lan- guage Resources and Evaluation (LREC’12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA). Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language- independent named entity recognition. 
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Vikas Yadav and Steven Bethard. 2018. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2145–2158, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics.

A Appendix

A.1 Annotator Agreement

To shed more light on the few cases where annotators disagreed, we provide an entity-level confusion matrix, computed over all ten languages, in Table 11. The most common disagreement is between organizations and locations.

A.2 Model Hyperparameters for Reproducibility

For fine-tuning mBERT and XLM-R, we used the base and large models with a maximum sequence length of 164 for mBERT and 200 for XLM-R, a batch size of 32, a learning rate of 5e-5, and 50 training epochs. For the MeanE-BiLSTM model, the hyperparameters are the same as for fine-tuning the LM, except that the learning rate is set to 5e-4. The BiLSTM hyperparameters are: an input dimension of 768 in each direction of the LSTM (since the embedding size of mBERT and XLM-R is 768), one hidden layer with a hidden size of 64, and a dropout probability of 0.3 before the last linear layer. All experiments were performed on a single GPU (Nvidia V100). A minimal code sketch of this classification head is given below.
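As a concrete illustration of these settings, the following PyTorch sketch assembles the MeanE-BiLSTM head. It is our own minimal reconstruction, not the released code: the class and argument names are invented, the mean-pooled subword embeddings from mBERT/XLM-R are assumed to be computed upstream, "hidden size of 64" is read as 64 units per direction, and the optimizer is not named in the appendix.

    import torch
    import torch.nn as nn

    class MeanEBiLSTM(nn.Module):
        """Minimal reconstruction of the MeanE-BiLSTM head (Appendix A.2).

        Expects mean-pooled subword embeddings from mBERT or XLM-R,
        shape (batch, seq_len, 768); all names here are illustrative.
        """

        def __init__(self, num_labels, embed_dim=768, hidden_size=64, dropout=0.3):
            super().__init__()
            # One BiLSTM layer with 64 hidden units per direction.
            self.bilstm = nn.LSTM(embed_dim, hidden_size, num_layers=1,
                                  batch_first=True, bidirectional=True)
            # Dropout probability of 0.3 before the last linear layer.
            self.dropout = nn.Dropout(dropout)
            self.classifier = nn.Linear(2 * hidden_size, num_labels)

        def forward(self, embeddings):
            hidden, _ = self.bilstm(embeddings)           # (batch, seq_len, 128)
            return self.classifier(self.dropout(hidden))  # per-token label logits

    # Example: a batch of 32 sentences at mBERT's max length of 164;
    # 9 labels = B-/I- tags for DATE, LOC, ORG, PER plus O.
    model = MeanEBiLSTM(num_labels=9)
    x = torch.randn(32, 164, 768)
    print(model(x).shape)  # torch.Size([32, 164, 9])

Training this head with the reported settings would use a learning rate of 5e-4 and a batch size of 32; everything else (optimizer, scheduler) is left unspecified in the appendix.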
A.3 Monolingual Corpora for Language Adaptive Fine-tuning

Table 10 lists the monolingual corpora we used for language adaptive fine-tuning, together with their sources and sizes. For most languages, we make use of JW300 and CC-100. In some cases, CC-Aligned (El-Kishky et al., 2020) was used as well; in those cases, we removed duplicated sentences from CC-100. For fine-tuning the language model, we make use of the HuggingFace (Wolf et al., 2019) code with a learning rate of 5e-5. However, for the Amharic BERT we use a smaller learning rate of 5e-6, since the multilingual BERT vocabulary was replaced by an Amharic vocabulary and the mBERT LM therefore has to be adapted slowly to Amharic text. All language BERT models were pre-trained for 3 epochs ("ibo", "kin", "lug", "luo", "pcm", "swa", "yor") or 10 epochs ("amh", "hau", "wol"), depending on their convergence. The models can be found on the HuggingFace Model Hub. A minimal fine-tuning sketch is given after Table 11 below.

8https://opus.nlpl.eu/.
9http://data.statmt.org/cc-100/.
10https://huggingface.co/Davlan.

Language | Source | Size (MB) | No. sentences
amh | CC-100 (Conneau et al., 2020) | 889.7 | 3,124,760
hau | CC-100 | 318.4 | 3,182,277
ibo | JW300 (Agić and Vulić, 2019), CC-100, CC-Aligned (El-Kishky et al., 2020), and IgboNLP (Ezeani et al., 2020) | 118.3 | 1,068,263
kin | JW300, KIRNEWS (Niyongabo et al., 2020), and BBC Gahuza | 123.4 | 726,801
lug | JW300, CC-100, and BUKEDDE News | 54.0 | 506,523
luo | JW300 | 12.8 | 160,904
pcm | JW300 and BBC Pidgin | 56.9 | 207,532
swa | CC-100 | 1,800 | 12,664,787
wol | OPUS (Tiedemann, 2012) (excl. CC-Aligned), Wolof Bible (MBS, 2020), and news corpora (Lu Defu Waxu, Saabal, and Wolof Online) | 3.8 | 42,621
yor | JW300, Yoruba Embedding Corpus (Alabi et al., 2020), MENYO-20k (Adelani et al., 2021), CC-100, CC-Aligned, and news corpora (BBC Yoruba, Asejere, and Alaroye) | 117.6 | 910,628

Table 10: Monolingual corpora, their sources, size, and number of sentences.

     | DATE   | LOC    | ORG    | PER
DATE | 32,978 | –      | –      | –
LOC  | 10     | 70,610 | –      | –
ORG  | 0      | 52     | 35,336 | –
PER  | 2      | 48     | 12     | 64,216

Table 11: Entity-level confusion matrix between annotators, calculated over all ten languages.
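To make the language adaptive fine-tuning recipe in A.3 concrete, the sketch below continues masked-language-model training of mBERT on one monolingual corpus with the HuggingFace Transformers Trainer. It is an assumed reconstruction rather than the released script: the corpus path, output directory, batch size, and 15% masking rate are placeholders or standard defaults, while the learning rates and epoch counts come from the appendix.

    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)
    from datasets import load_dataset

    # Placeholder corpus file; the actual corpora per language are in Table 10.
    dataset = load_dataset("text", data_files={"train": "mono_corpus.txt"})
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

    def tokenize(batch):
        # Truncation at 512 subwords, mBERT's maximum (assumed, not stated).
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset["train"].map(tokenize, batched=True,
                                     remove_columns=["text"])

    # 15% masking is the standard BERT MLM rate; the appendix does not say otherwise.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir="mbert-laft",         # placeholder
        learning_rate=5e-5,              # 5e-6 for the Amharic model (Appendix A.3)
        num_train_epochs=3,              # 10 for amh, hau, wol
        per_device_train_batch_size=32,  # batch size not stated for LAFT; illustrative
    )

    trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                      data_collator=collator)
    trainer.train()

Running one such job per language, with the epoch counts and learning rates from A.3, would yield the adapted language models released on the Model Hub.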