Improving Candidate Generation
for Low-resource Cross-lingual Entity Linking

Shuyan Zhou, Shruti Rijhwani, John Wieting
Jaime Carbonell, Graham Neubig

Language Technologies Institute
Carnegie Mellon University
{shuyanzh,srijhwan,jwieting,jgc,gneubig}@cs.cmu.edu

Transactions of the Association for Computational Linguistics, vol. 8, pp. 109–124, 2020. https://doi.org/10.1162/tacl_a_00303. Action Editor: Radu Florian. © 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 License.

Abstract

Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on resources from Wikipedia have proven successful in the realm of relatively high-resource languages, but these do not extend well to low-resource languages with few, if any, Wikipedia pages. Recently, transfer learning methods have been shown to reduce the demand for resources in low-resource languages by utilizing resources in closely related languages, but the performance still lags far behind their high-resource counterparts. In this paper, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios. The methods are simple, but effective: We experiment with our approach on seven XEL datasets and find that they yield an average gain of 16.9% in top-30 gold candidate recall, compared with state-of-the-art baselines. Our improved model also yields an average gain of 7.9% in in-KB accuracy of end-to-end XEL.1

1Code and data will be released.

1 Introduction

Entity linking (EL; Bunescu and Paşca, 2006; Cucerzan, 2007; Dredze et al., 2010; Hoffart et al., 2011) associates entity mentions in a
document with their entries in a knowledge base
(KB). In this work, we focus on cross-lingual
entity linking (XEL; McNamee et al., 2011; Ji et al., 2015), where the documents are in a source language that differs from the KB language (the target). XEL is an important component task for information extraction in languages that do not have extensive KB resources, and can potentially benefit downstream applications such as building cross-lingual question answering systems (Veyseh, 2016) or supporting international humanitarian assistance efforts in areas that do not speak English (Strassel et al., 2017; Min et al., 2019). Following Sil et al. (2018) and Upadhyay et al. (2018a), we consider the target-language KB to be English Wikipedia.

Given a document and named entity mentions
identified by a Named Entity Recognition (NER)
model, there are two primary steps in an XEL
system: (1) candidate generation, in which a model retrieves a short list of plausible KB entities for each mention, and (2) disambiguation, in which
a model selects the most likely KB entity from
the candidate list. The quality of candidate lists
will influence the performance of the end-to-end
XEL system, as correct entities not included in this
list will not be recovered by the disambiguation
model.

In monolingual EL, candidate generation has often
been considered trivial (Shen et al., 2015). Simple
approaches using string similarity or Wikipedia
anchor-text links produce mention-entity lookup
tables with high candidate recall (e.g., in the 90% range), and thus most work focuses on methods for downstream entity disambiguation (Globerson et al., 2016; Yamada et al., 2017; Ganea and Hofmann, 2017; Sil et al., 2018; Radhakrishnan

et al., 2018). String similarity (e.g., edit distance) cannot easily extend to XEL because the surface forms of entities often differ significantly across the source and target language, particularly when
the languages are in different scripts. Wikipedia
link methods can be extended to XEL by using
inter-language links between the two languages
to redirect entities to the English KB (Spitkovsky
and Chang, 2012; Sil and Florian, 2016; Sil et al.,
2018; Upadhyay et al., 2018a). This method works
to some extent, but often under-performs on low-
resource languages due to the lack of source
language Wikipedia resources.

Although scarce, there are some methods that propose to improve entity candidate generation by training translation models with low-resource language (LRL)-English entity gazetteers (Pan et al., 2017), or by learning neural string matching models based on an entity gazetteer in a related high-resource language (HRL), which are then applied to the LRL (Rijhwani et al., 2019) (more in §2). However, even with these relatively sophisticated methods, top-30 candidate recall still falls far behind that of high-resource counterparts, lagging by as much as 70% absolute.

In this work, we perform a systematic study to understand and address the limitations of previous XEL candidate generation models. First, in §3 we examine the sources of error in the state-of-the-art candidate generation model of Rijhwani et al. (2019), and identify a number of potential reasons for failure. Specifically, we find that two common sources of error are (1) mismatch between the entity name in the KB and the entity mention in the text, and (2) failure of the string matching model itself. In Figure 1, we show an example of linking Marathi, a low-resource language spoken in Western India, to English, which we will use as a running example throughout the paper (although our method is broadly applicable, as noted in the experiments). In this case, errors of the first type are due to the fact that the English entity Cobie Smulders is mentioned as (green: Smulders) or (yellow: Jacoba Francisca Maria Smulders) in the text. Errors of the second type are simple recognition errors, such as where the mention (blue: Cobie Smulders) is recognized as the English entity Cobie Sikkens. We proceed to


Figure 1: The candidate generation process for various
mentions corresponding to the gold entity ‘‘Cobie
Smulders’’. Strings on the left are mentions in the
document, and the pronunciation in IPA of each string
is written below it. The candidate entities in the English
KB generated by the candidate generation model are
shown on the right.

propose methodological improvements that resolve these major issues.

The first set of improvements handles the mismatch between the unique entity name that appears in the English KB and the many different realizations of it in the source text. First, we note that the training data used in learning-based methods for XEL candidate generation (Pan et al., 2017; Rijhwani et al., 2019) is made of entity-entity pairs, which fail to capture this variation. We experiment with adding mention-entity pairs to the training data to provide explicit supervision, helping the model better capture the differences between mentions and entities (§4.1). Second, we note that many of the variations in the source language are actually similar to how the entity varies in English, and thus we can use English-language resources to capture this variation. To this effect, we collect entity aliases from English Wikidata2 and allow the model to also look up these aliases during the candidate generation process (§4.2).

2https://www.wikidata.org/wik.

The second contribution of this work is a
better modeling strategy for strings that represent
mentions and entities (§4.3). We posit that part
of the reason why the LSTM-based model of
Rijhwani et al. (2019) fails to properly model all
words in a string is because it is not the ideal
architecture to learn from limited training data,
and as a result, it erroneously learns that some
words in the mention can be ignored. To solve
this problem, we replace the LSTM with a more
direct model based on the sum of character n-gram


embeddings (Wieting et al., 2016), which we
posit is more likely to generalize to this difficult
learning setting.

We evaluate our proposed methods on four real-
world XEL datasets provided by DARPA LORELEI
(Strassel and Tracey, 2016), as well as three
other datasets we create with Wikipedia anchor-
text and inter-language links (§5). Although our methods are simple, they are highly effective: our proposed model leads to gains ranging from 7.4–33.3% in top-30 gold candidate recall compared with Rijhwani et al. (2019) on seven LRLs.
Because our model provides downstream disam-
biguation models with a much larger headroom for
improvement, we find that simply changing the
candidate generation process yields an average
gain of 7.9% in end-to-end XEL in-KB accuracy
in four LRLs, pushing low-resource XEL a step
towards high-resource XEL performance.

2 Background

2.1 Problem Formulation

Given a set of mentions M = {m1, m2, . . . , mN }
extracted from multiple documents in the source
language, and an English KB KEN that contains
millions of entities with unique names, the goal of
a candidate generation model is to retrieve a list of
possible candidate entities ei = {ei,1, ei,2, . . . , ei,n}
from KEN for each mi ∈ M. In consideration of
the computational cost of the more complicated
downstream disambiguation model, n is often 30
or smaller (Sil et al., 2018; Upadhyay et al., 2018a). The performance of candidate generation
is measured by the gold candidate recall, lequel
is the proportion of retrieved candidate lists that contain the correct entity. It is critical that this
number is high, as any time the correct entity is
excluded, the disambiguation model will be unable
to recover it. Formally, if we denote the correct entity of each mention m as ê, the gold candidate recall r is defined as:

$$r = \frac{\sum_{i=1}^{N} \delta(\hat{e}_i \in e_i)}{N}$$

where δ(·) is the indicator function, which is 1 if its argument is true and 0 otherwise, and N is the total number of mentions among all documents. We follow Yamada et al. (2017) and Ganea and Hofmann (2017) in ignoring mentions whose linked entity does not exist in the KB in this work.3
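As a concrete illustration, here is a minimal Python sketch of this metric; the entity names and candidate lists below are invented for the example:

```python
def gold_candidate_recall(gold_entities, candidate_lists):
    """Fraction of mentions whose gold entity appears in its candidate list."""
    hits = sum(1 for gold, cands in zip(gold_entities, candidate_lists)
               if gold in cands)
    return hits / len(gold_entities)

# Hypothetical example: the gold entity is recovered for 2 of 3 mentions.
recall = gold_candidate_recall(
    ["Cobie Smulders", "Marathi", "Mumbai"],
    [["Cobie Smulders", "Cobie Sikkens"], ["Marathi"], ["Mumba Devi Temple"]],
)
assert abs(recall - 2 / 3) < 1e-9
```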

We use ‘‘EN’’ to denote the target language
English, ‘‘HRL’’ to denote any high-resource
language and ‘‘LRL’’ to denote any low-resource
language. For example, KHRL is a KB in an HRL (e.g., Spanish Wikipedia), and eHRL is an entity in
KHRL. Because our focus is on low-resource XEL,
the source language is always an LRL. We also
refer to the HRL as the ‘‘pivoting’’ language
below.

2.2 Baseline Candidate Generation Models

In this section, we introduce two existing categories of techniques for candidate generation.

Direct Wikipedia-based Models: WIKIMENTION is a popular candidate generation model used by most state-of-the-art work in XEL (Sil and Florian, 2016; Sil et al., 2018; Upadhyay et al., 2018a). Specifically, this model first extracts a monolingual mLRL-eLRL map from anchor-text links. For example, if the mention (Smulders) is linked to the entity (Cobie Smulders) on some Marathi Wikipedia pages, the latter will be treated as a candidate entity of the former. These Marathi entities are then redirected to their English counterparts through Wikipedia LRL-English inter-language links. For example, (Cobie Smulders) will be redirected to Cobie Smulders. However, the reliance on the coverage of the LRL Wikipedia strongly constrains this method in low-resource settings.

TRANSLATION is another Wikipedia-based candidate generation model, proposed by Pan et al. (2017). Instead of building a monolingual map, which requires access to anchor-text links in an LRL Wikipedia, this model translates any mLRL to mEN word-by-word and retrieves candidate entities from an existing mEN − eEN map. The word-by-word translations are induced from LRL-English inter-language links. Even though TRANSLATION is less sensitive to the availability of resources (to some extent), its dependency on LRL-English inter-language links still limits its performance in low-resource settings.
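To make the lookup-based pipeline concrete, here is a minimal sketch of WIKIMENTION-style candidate generation, assuming the anchor-text map and inter-language links have already been parsed into plain dictionaries (all function and variable names are illustrative, not from the original implementation):

```python
def wikimention_candidates(mention, anchor_map, interlanguage_links, n=30):
    """anchor_map: LRL mention string -> list of LRL entity titles (mined
    from anchor-text links); interlanguage_links: LRL title -> English title."""
    candidates = []
    for e_lrl in anchor_map.get(mention, []):
        e_en = interlanguage_links.get(e_lrl)  # redirect to the English KB
        if e_en is not None and e_en not in candidates:
            candidates.append(e_en)
    return candidates[:n]
```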


3The predictions of these mentions will always be wrong.
This could be fixed by either designing mechanisms to predict
‘‘not linkable’’ or expanding the KB, which are beyond the
scope of this work.

Pivoting-based Entity Linking: Instead of relying on LRL resources, pivoting-based entity linking (PBEL; Rijhwani et al., 2019) learns to perform cross-lingual string matching based on an entity gazetteer between a related HRL and English. This model consists of two BI-LSTMs, namely the HRL-BI-LSTM and the EN-BI-LSTM. The training data is a collection of entity pairs (eHRL − eEN). Each of the BI-LSTMs reads in an entity name eHRL (eEN) and encodes it to an embedding vHRL (vEN). The learning objective is to maximize the similarity between the two entities of each pair. The trained HRL model is used as-is to encode the LRL mentions to vLRL, relying on the similarity between the languages to achieve a reasonably accurate encoding. vLRL is compared with every entity embedding in KEN, and the entities with the top-n highest similarity scores are retrieved as the candidate entities. To compensate for the accuracy degradation due to transfer, this work also considers the similarity between mLRL and eHRL, where eHRL is the counterpart of eEN in KHRL. Thus, the score between mLRL and entity eEN is defined as:

$$\mathrm{score}(m_{LRL}, e_{EN}) = \max\big(\mathrm{sim}(m_{LRL}, e_{EN}),\ \mathrm{sim}(m_{LRL}, e_{HRL})\big) \qquad (1)$$

where sim(x, y) = cosine(v_x, v_y). When eHRL does not exist, sim(mLRL, eHRL) is set to −∞.
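In code, Equation (1) reduces to a max over cosine similarities; the sketch below assumes the mention and entity vectors have already been produced by the trained encoders:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pbel_score(v_mention, v_entity_en, v_entity_hrl=None):
    """Equation (1): similarity to the English entity name, or to its HRL
    counterpart when one exists (otherwise that term is effectively -inf)."""
    score = cosine(v_mention, v_entity_en)
    if v_entity_hrl is not None:
        score = max(score, cosine(v_mention, v_entity_hrl))
    return score
```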

PBEL removes the reliance on LRL resources, and currently represents the state of the art for candidate generation in low-resource XEL. However, as we analyze in detail in the following §3, it still faces a number of challenges.

3 Failures of Existing Models

In this section, we perform a systematic analysis of failure cases in PBEL (§3.1), focusing specifically on two error types: entity-mention mismatch (§3.2) and string matching failures (§3.3).

3.1 Mention Types and Analysis

We apply a PBEL model trained with eHRL − eEN pairs to generate candidate entities for mentions extracted from LRL documents. For the LRLs we use Tigrinya, Oromo, Marathi, and Lao, and for the HRLs we use Amharic, Hindi, Hindi, and Thai, respectively. The details of the datasets are in §5. We randomly sample 100 system outputs from each LRL and manually annotate their mention types according to a typology created while performing the analysis. The mention types are as follows, where the comparison is between the mention in an LRL and the entity string in English:

DIRECT: The mention is a direct transliteration of the entity. For example, a mention of Cobie Smulders as (Cobie Smulders).

ALIAS: The mention is another full proper name that is different from the entity name in the English KB. For example, a mention of Cobie Smulders as (Jacoba Francisca Maria Smulders).

TRANS: The mention and the entity have a word-by-word alignment; however, the mention contains regular words (e.g., university, union) that cannot be transliterated directly.

EXTRA SRC: There is at least one extra word in the mention that is not a proper noun (e.g., (Sir)); or there is at least one extra syllable in the mention, which is often due to the morphology of the source language.

EXTRA ENG: There is at least one extra word
in the English entity that is not a proper noun.

BAD SPAN: The mention span is not an entity due to mis-annotation or non-standard anchor text in Wikipedia; the annotated linked entity is wrong; or the mention is in a language other than our testing language.

We consider three situations for each sample: (1) in top-1: the model ranks the correct entity the highest, the ideal case; (2) in top-2 to 30: the model ranks the correct entity in the top-2 to top-30, which is less ideal, but will still potentially allow a downstream disambiguation model to predict the correct entity; and (3) not in top-30: the model does not rank the entity in the top-30, which will certainly lead to an error.

Figure 2 shows the mention types of the 400 samples and PBEL's performance within each of the mention types. In the following sections, we examine, in depth, two major causes of error: mention-entity mismatch (largely responsible for errors in the ALIAS, EXTRA SRC, and EXTRA ENG categories), and model failure (largely responsible for errors in DIRECT).

Figure 2: The distribution of mention types in 400 samples and the baseline model's performance with respect to each of the mention types.

Lang               am     so     hi     th
|eHRL| = |eEN|     82.9   80.7   83.4   56.8
|mHRL| = |eEN|     71.1   58.0   56.8   55.8

Table 1: Proportion of entries (%) where HRL strings have the same number of words as their English counterparts.

3.2 Failures due to Mention-Entity Mismatch

As demonstrated in Figure 1, a single English entity can have different realizations in the source-language document. Consequently, many of these realizations will not match lexically or phonetically with the entity in the KB. This poses a serious problem for matching methods that rely on graphemic or phonemic similarity, such as PBEL. One typical pattern in mention-entity variation is additional words, as noted in the EXTRA SRC and EXTRA ENG classes. We examine this more systematically across the whole corpus by comparing the number of words on each side, which gives a rough lower bound on the amount of this mismatch. The first row in Table 1 is the comparison between eHRL and eEN, which presumably have better word-by-word alignment (and were used in training of previous XEL methods). The second row displays the comparison between mHRL and eEN. It is obvious that entity-entity pairs have more consistent lengths in words, while this consistency is not preserved in mention-entity pair data. Thus, even if the previous PBEL model could easily learn exact string matches from the entity-entity training data, to successfully associate mention-entity pairs the model would need to capture more complex patterns (e.g., ignoring some words).4

4Low numbers for th are due to the lack of explicit word boundaries marked by spaces.

The diverse realizations of a single entity bring another, more serious, challenge to models that mainly learn string matches: In reality, a realization does not necessarily have significant overlap with the entity name in Wikipedia. Sometimes the mention does not have any overlap with the entity name at all, as noted in the ALIAS class. This common pattern reflects the limitation of using eEN as the unique representation on the English side.
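The word-count statistic behind Table 1 can be computed directly; a small sketch with invented toy pairs:

```python
def equal_word_count_pct(pairs):
    """Percentage of (source string, English string) pairs whose word
    counts match -- a rough lower bound on mention-entity mismatch."""
    equal = sum(1 for src, en in pairs if len(src.split()) == len(en.split()))
    return 100.0 * equal / len(pairs)

entity_pairs = [("Kobi Smulders", "Cobie Smulders")]                     # e_HRL - e_EN
mention_pairs = [("Jacoba Francisca Maria Smulders", "Cobie Smulders")]  # m_HRL - e_EN
print(equal_word_count_pct(entity_pairs))   # 100.0
print(equal_word_count_pct(mention_pairs))  # 0.0
```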

3.3 Failures in Direct Transliteration

Even in seemingly easy cases where the entity is a perfect transliteration of the mention (DIRECT), we found the LSTM to fail frequently in our low-data scenario. Among all DIRECT errors, we made the interesting observation that the BILSTM often only properly captures the first word (or the first few characters) and ignores the existence of the second and further-on words. For example, the model ranks Cobie Sikken higher than Cobie Smulders for (Cobie Smulders).

To better understand this behavior, we manually annotated 100 training pairs in Hindi and measured how often the second or later words in eHRL do not match their counterparts in eEN phonologically.5 We find that whereas 93 examples share a phonologically similar first word, about 40 of them have second and further-on words that are not phonological matches: While most pairs have word-by-word mappings, their second or later words often match with each other only semantically. That is, there are regular words (e.g., district, university) that have very different pronunciations across the HRL and English, and are therefore difficult to predict unless they are explicitly seen in the training data. The BILSTM, which is a flexible model with little inductive bias, seems to overfit and erroneously learn that later words in the string do not need to be mapped directly. This is a straightforward explanation for why the model learns to ignore the second and further-on words.

5The phonological similarity of names across languages is vital to the success of cross-lingual mention-entity matching.

To sum up, the failures of the PBEL model can be mainly attributed to (1) a lack of explicit supervision; (2) a lack of external resources to assist in cases where the mention and entity name diverge significantly; and (3) the BILSTM's inability to properly match the whole string.


4 Improved Candidate Generation

Based on the results of this empirical study,
we propose three methods to resolve the main
problems inherent in the baseline PBEL model.

4.1 Eliminating Train-Test Discrepancy

The mention-entity discrepancy naturally leads to our first simple but effective improvement to the baseline model: We extend the original eHRL − eEN pairs with mHRL − eEN pairs. We first collect mHRL − eHRL pairs from anchor-text links in an HRL Wikipedia and then redirect these entities to their parallels in English Wikipedia. As a result, we get the desired mHRL − eEN pairs. For example, if (Smulders) is linked to (Cobie Smulders) on some Marathi Wikipedia pages, and the latter can be redirected to Cobie Smulders in English, then (Smulders) and Cobie Smulders form one mention-entity pair. Although this is perhaps obvious in hindsight, to our knowledge all previous works that explicitly train XEL candidate retrieval models do so on eHRL − eEN pairs (Pan et al., 2017; Rijhwani et al., 2019), which are mostly word-by-word mappings.
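A minimal sketch of this pair-collection step, assuming the anchor-text links and inter-language links are available as in-memory structures (names are illustrative):

```python
def build_mention_entity_pairs(anchor_links, interlanguage_links):
    """anchor_links: iterable of (mention string, HRL entity title) pairs
    mined from HRL Wikipedia anchor text; interlanguage_links: HRL entity
    title -> English entity title. Returns m_HRL - e_EN training pairs."""
    pairs = []
    for mention, e_hrl in anchor_links:
        e_en = interlanguage_links.get(e_hrl)
        if e_en is not None:  # keep only entities that have an English page
            pairs.append((mention, e_en))
    return pairs
```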

4.2 Utilizing English Entity Aliases

The training method introduced in the previous section will render the model more capable of dealing with minor differences between mentions and entities. However, it would still struggle to match strings with significant differences, such as the examples of ''Cobie Smulders'' and ''Pope Paul V'' shown in Section 3.2. To mitigate this, we propose using Wikidata, a crowd-edited knowledge base similar to Wikipedia, which provides an ''also known as'' section that lists common aliases of each entity.6 Our second method is based on the observation that Wikidata resources can serve as an off-the-shelf alias lookup table with better coverage than simply using the entity's canonical Wikipedia title. An example of how this lookup table can increase coverage is indicated in Figure 2. In our analysis, we found that more than 50% of the ALIAS mentions could be covered by this table. There is a map between Wikipedia entities and Wikidata entities, so we can redirect entities from Wikipedia to Wikidata to retrieve these aliases.7

6For example, https://www.wikidata.org/wiki/Q200566.

7Other resources such as bold terms, link anchors, disambiguation pages, and surnames of mentions could potentially increase the coverage of Wikidata.

At test time, we treat the aliases of an entity equally with its main Wikipedia entity name, allowing the model to match the target mention to these aliases as well. Consequently, sim(mLRL, eEN) in Equation (1) is modified as:

$$\mathrm{sim}(m_{LRL}, e_{EN}) = \max_{a_i \in A} \mathrm{sim}(m_{LRL}, a_i)$$
where A is the combination of the entity's Wikipedia title and its aliases.8 Note that although one may consider using aliases in languages other than English, we found that they are very scarce, so we did not attempt to expand entity names on the HRL side.

8Note that incorporating aliases results in a small amount of extra computation by multiplying the effective size of the KB by a, the average number of aliases per mention. However, in Wikidata, a = 1.2, so we believe this is a reasonable cost-benefit trade-off, given the gains afforded by incorporating these aliases for many languages.
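The alias-aware similarity is then a max over the encoded title and aliases; a sketch under the assumption that all strings have been encoded to vectors beforehand:

```python
import numpy as np

def alias_similarity(v_mention, alias_vectors):
    """sim(m_LRL, e_EN) as the max over A: the entity's Wikipedia title and
    its Wikidata aliases, each encoded to a vector in advance."""
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return max(cosine(v_mention, v_a) for v_a in alias_vectors)
```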

4.3 More Explicit String Encoding

As mentioned previously, while BI-LSTMs have proven powerful for modeling sequential data, we argue that they are not an ideal string encoder for this setting. This is because our training data contain a nontrivial number of pairs with less predictable word mappings (e.g., translations). With such large freedom in the face of insufficient and noisy training data, this encoder seemingly overfits, resulting in poor generalization. Previous researchers (Dai and Le, 2015; Wieting et al., 2016a) have noticed similar problems when using LSTMs for representation learning.

As an alternative, we propose the use of the
CHARAGRAM model (Wieting et al., 2016) as the
string encoder. This model scans the string with
various window sizes and produces a bag of
character n-grams. It then maps these n-grams to
their corresponding embeddings through a lookup
table. The final embedding of the string is the
sum of all the n-gram embeddings followed by a
nonlinear activation function. Figure 3 shows an
illustration of the model.

Officiellement, we denote a string as a sequence
of characters x = [x1, x2, . . . , xm] that includes


space characters as well as special start and end symbols. We use x_i^j to denote the sub-sequence from position i to position j inclusive; for example, x_i^j = [xi, xi+1, . . . , xj]. The embedding v of a string x is:

$$v = \tanh\Big(b + \sum_{i=1}^{m} \sum_{n \in N} \mathbb{1}\big(x^{i}_{i+1-n} \in V\big)\, W_{x^{i}_{i+1-n}}\Big)$$

where N is a set of predefined window sizes, b ∈ R^d, V is the set of all n-grams seen in the training data, W ∈ R^{|V|×d} is the embedding lookup table, and W_{x_i^j} ∈ R^d is the embedding of x_i^j. 1(·) is the indicator function; if an n-gram is not in V, we simply discard it.

Figure 3: The architecture of CHARAGRAM.

Compared with the BI-LSTM, the advantages of CHARAGRAM are four-fold. First, the complexity of memorizing short character strings in the model is reduced: CHARAGRAM learns multi-character subsequences by simply adding them to an embedding table, whereas the LSTM learns them in a multi-step recurrent process. Second, because of their relatively higher expressiveness, LSTMs overfit to the noisy and relatively small training data provided by Wikipedia bilingual entity maps, the likely reason for the LSTM only considering the start word in errors from the DIRECT category. In contrast, CHARAGRAM does not consider order information, giving it an explicit inductive bias that forces it to rely on character n-gram matching for all n-grams in the sequence. Third, CHARAGRAM's simple architecture eases the learning process. For example, the LSTM needs O(m) steps to propagate gradients from start to finish (Vaswani et al., 2017), while CHARAGRAM requires only O(1) steps to do so. Finally, although not a performance-based advantage, the CHARAGRAM model is more interpretable, which makes our further analysis easier to perform (see Section 5).

We follow Wieting et al. (2016) and Rijhwani et al. (2019) and use negative sampling with a max-margin loss to train the model:

$$L = \sum_{i=1}^{B} \max\big(0,\ 1 - \mathrm{sim}(m, e_{EN+}) + \mathrm{sim}(m, e^{i}_{EN-})\big)$$

where eEN+ is the linked entity of m, eEN− is a randomly sampled English entity, and B is the number of negative samples for each positive pair.
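The following NumPy sketch illustrates the encoder and training loss described above. It is a simplified illustration under our own naming, not the authors' released implementation:

```python
import numpy as np

class Charagram:
    """Sum of character n-gram embeddings followed by tanh (Wieting et al., 2016)."""
    def __init__(self, ngram_vocab, dim=300, windows=(2, 3, 4, 5), seed=0):
        rng = np.random.default_rng(seed)
        self.V = {g: i for i, g in enumerate(ngram_vocab)}  # n-grams seen in training
        self.W = rng.normal(0.0, 0.1, size=(len(self.V), dim))
        self.b = np.zeros(dim)
        self.windows = windows

    def encode(self, s):
        s = "<" + s + ">"  # special start and end symbols
        v = self.b.copy()
        for n in self.windows:
            for i in range(len(s) - n + 1):
                idx = self.V.get(s[i:i + n])  # n-grams not in V are discarded
                if idx is not None:
                    v += self.W[idx]
        return np.tanh(v)

def margin_loss(sim_pos, sim_negs):
    """Max-margin objective over B negative samples for one positive pair."""
    return sum(max(0.0, 1.0 - sim_pos + s) for s in sim_negs)
```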


5 Experiments

5.1 Datasets

We evaluate our model on the following datasets, spanning seven low-resource languages.

DARPA-LRL: The data for the first four languages are news articles, blogs, and social media annotated with entity spans and links by LDC as part of the DARPA LORELEI program. The documents are in four low-resource languages: Tigrinya (ti; a Semitic language spoken in Eritrea and Ethiopia, written in Ethiopian script), Oromo (om; an Afroasiatic language spoken in the Horn of Africa, written in Roman script), Kinyarwanda (rw; a language of the Niger-Congo family spoken in Rwanda, written in Roman script), and Sinhala (si; an Indo-Aryan language spoken in Sri Lanka, written in its own script). These are naturally occurring real-world data annotated and linked to a KB, containing information about disasters and humanitarian crises. We use these as the ''gold standard'' datasets for our evaluation.

WIKI: One disadvantage of the DARPA-LRL dataset, however, is that it is not publicly distributed at the time of this writing. In order to allow for direct comparison with our method by researchers without access to the DARPA-LRL data, we additionally create three datasets from Wikipedia, as described in §4.1. Specifically, these include Marathi (mr; an Indo-Aryan language spoken in Western India, written in Devanagari script), Lao (lo; a Kra-Dai language written in Lao script), and Telugu (te; a Dravidian language spoken in southeastern India, written in Telugu script). As Wikipedia is created through crowd-sourcing, the anchor-text links are similar to those appearing in realistic XEL datasets. It is notable that entity mentions in WIKI often closely match the Wikipedia entity titles, and thus this dataset is nominally easier than the DARPA-LRL dataset.


5.2 Training Details

In the CHARAGRAM model, we use character n-grams with n ∈ {2, 3, 4, 5} and an embedding size of 300. We train the model with stochastic gradient descent with a batch size of 64 and a learning rate of 0.1. For the BI-LSTM model, we follow Rijhwani et al. (2019) for hyperparameter selection.

We also compare our model with a character-based CNN with sum-pooling (CHARCNN; Zhang et al., 2015; Wieting et al., 2016), whose parameters are set to be roughly comparable in size to our CHARAGRAM model. The embedding size of each character is set to 1024; the kernel sizes are set to 2, 3, 4, and 5, each with 4800 feature maps. The output of the sum-pooling layer, with a dimension of 19,200 (4800 × 4), is fed to a fully connected layer that produces a vector of size 300. The dropout is set to 0.5.9

For each training language, we set aside a
small subset of training data (mHRL − eEN) as
our development set. For all models, we stop
training if top-30 gold candidate recall on the
development set does not increase for 50 epochs,
and the maximum number of training epochs is
set to 200.

We select the HRL that has the highest character n-gram overlap with the source LRL, a decision we discuss more in §5.4. Rijhwani et al. (2019) used phoneme-based representations to help deal with the fact that different languages use different scripts, and we do so as well, using Epitran (Mortensen et al., 2018) to convert strings to International Phonetic Alphabet (IPA) symbols. The selection of the HRL and the representation for each LRL is shown in Table 2. Epitran has relatively wide and growing coverage (55 languages at the time of this writing). Our method could also potentially be used with other tools such as the romanizer uroman,10 which provides a less accurate phonetic representation than Epitran but covers most languages in the world. However, testing different romanizers is somewhat orthogonal to the main claims of this paper, and thus we have not explicitly performed experiments on this.


9We also tried smaller architectures, with the embedding size set to 64 and the number of feature maps set to 300. This configuration yields worse performance than the larger model.

10https://www.isi.edu/∼ulf/uroman.html.

LRL   HRL              Representation
ti    Amharic (am)     Phoneme
om    Indonesian (id)  Grapheme
rw    Tagalog (tl)     Phoneme
si    Hindi (hi)       Phoneme
mr    Hindi (hi)       Grapheme
lo    Thai (th)        Phoneme
te    Hindi (hi)       Phoneme

Table 2: The HRL for each LRL. For phoneme representations, all input strings in the LRL, HRL, and English are converted to IPA. For grapheme representations, strings preserve their original form.

Our HRL pool contains 38 languages, specifically those that have more than 10k Wikipedia pages and are supported by Epitran. We do not consider Swedish and Cebuano because most Wikipedia pages in these two languages are bot-generated.11 We also remove all languages that do not achieve a candidate recall of 75% on the HRL development set, indicating that the model may not be trained well.

5.3 Main Results

Starting from the PBEL model, we gradually replace the baseline components with our proposed improvements to reach our complete model. The results are shown in the second block of Table 3. To put the results in context, we also list the Wikipedia size and the hyperlink count of every language. The Wikipedia size corresponds to the number of entities recorded in that Wikipedia, and the hyperlink count roughly reflects the richness of the content of each page.

Overall, the model with the three proposed improvements yields significantly better performance than the baseline, bringing a 7.4–33.3% improvement in top-30 gold candidate recall on six LRLs, with the exception of te. We will discuss the failure of te in §5.4.12 Next, we can see that CHARAGRAM brings the first major improvement, improving over both the BILSTM and CHARCNN baselines. Even trained with eHRL − eEN pairs, CHARAGRAM generalizes better to the test data (mLRL − eEN), where the patterns to be matched are different from the training data.

11https://en.wikipedia.org/wiki/Lsjbot.
12Although not a direct target of our paper, we note
that the three methodological improvements, especially the
introduction of CHARAGRAM, also improve the baseline model
in HRL settings. We often observe more than a 20% gain in
top-30 gold candidate recall in the development set, which is
derived from the same HRL as the training set.



                       DARPA-LRL                    WIKI
Model                  ti     om     rw     si      lo     mr     te     avg
WIKIMENTION            21.9   45.3   59.6   66.6    -      -      -      -
TRANSLATION            13.4   20.9   25.3   21.0    -      -      -      -
ee + BILSTM = PBEL     54.1   18.1   57.5   34.5    21.0   53.5   40.7   40.7
ee + CHARCNN           53.8   13.0   55.9   30.8    18.0   47.7   24.6   34.8
ee + CHARAGRAM         70.6   20.4   60.2   17.5    40.1   63.4   23.8   43.2
ee + me + CHARAGRAM    74.4   41.3   64.6   50.7    54.4   72.8   34.3   56.6
  + aka = Ours         75.1   46.0   64.9   51.1    54.3   77.5   34.4   57.6
Wikipedia Size         168    775    2K     15K     3K     50K    70K
Hyperlink Count        188    4K     7K     63K     11K    610K   165K

Table 3: Top-30 gold candidate recall (%) of different models. First block: performance of direct Wikipedia-based models that use LRL resources; second block: performance of pivoting-based models that do not require any LRL resource. ee means using entity-entity pairs as training data and me means using mention-entity pairs as training data. Bold numbers are the best performance for the corresponding languages.



This result suggests that, as we hypothesize, the model structure of CHARAGRAM makes it better able to learn string mappings in the face of relatively small and noisy data. We note that we also tried many variations of the two baseline models. For example, we used the average of the hidden states instead of the last hidden state of the BILSTM to represent a string, and we replaced the sum-pooling layer with a max-pooling layer in CHARCNN. These variations yield comparable or worse recall compared with the current baselines.

In addition, introducing mHRL − eEN pairs brings further improvements on all seven languages. This is perhaps not surprising; these data provide explicit supervision that matches the actual task of entity-mention matching that we are faced with at test time.

The influence of entity aliases varies from language to language. Although they offer significant gains on om and mr, they do not largely change the other languages. We suspect this is because of the diverse properties of the languages used in our datasets. For example, Marathi speakers may also speak English frequently and be familiar with English entity names, due to English being a national language of India. This may lead them to follow conventions similar to the English aliases that are available in Wikidata. Speakers of other languages might either not use as many aliases, or their aliases may not match well with those included in Wikidata.

Moreover, we quantify how our proposed methods reduce the failures present in the baseline system.


Figure 4: The distribution of mention types and the performance of our proposed model (right bars), compared with the baseline (left bars).

We use the 400 samples of §3 and compare the error distribution with the original one in Figure 4. From the results, we can see that our model eliminates a large number of the errors by ranking the correct entities the highest. It significantly reduces DIRECT and ALIAS errors, which demonstrates the effectiveness of our proposed methods. As a side benefit, a number of the TRANS errors are also resolved. In addition, when the proposed model fails to rank the correct entity the highest, it is able to increase the number of correct entities in the top-30 candidate list, providing a downstream disambiguation model with larger improvement headroom. A few concrete examples are shown in Table 4.

Error Type   Ours                          PBEL
ALIAS        Beaver Creek Resort           Beaver Creek State Forest (New York)
ALIAS        Gajanan Digambar Madgulkar    Ghada Amer
DIRECT       Hermann Staudinger            Herman Heuser
DIRECT       Muscoline                     Benito Mussolini
TRANS        Khmer Empire                  Khmer Issarak
TRANS        European Union                Yuri Petunin

Table 4: Successful cases, where the top-1 candidate entity retrieved by our model improves over that of the baseline model.

Up until this point, we have been comparing models that are purely zero-shot: they need no training data in the source LRL. However, even for low-resource languages there is often some Wikipedia data that can be used to create models. Using this data, we additionally compare our model with the two Wikipedia-based models that are not zero-shot (§2.2) on the four DARPA-LRL datasets in the first block of Table 3.13 Our model consistently beats TRANSLATION on all four datasets without relying on any LRL resources. Moreover, it outperforms WIKIMENTION by a large margin on the three datasets with relatively small Wikipedias, evidencing the advantage of zero-shot learning in resource-scarce settings. For si, with over 15K Wikipedia pages, our model lags behind the resource-heavy WIKIMENTION model by about 15% in gold candidate recall. This is perhaps expected, as our model does not rely on any LRL resources, and it is possible that explicitly training our model with these resources could further improve its accuracy. In addition, we observe that our model can serve as a complement to WIKIMENTION and bring further gains in gold candidate recall. We discuss this in detail in Section 5.6.

13For the 3 WIKI datasets, the way we create these datasets is exactly the same as the way we generate mHRL − eEN lookup tables, and thus WIKIMENTION would achieve 100% recall. We skip the unfair comparison on these datasets.

5.4 Pivoting Language Selection

Choosing a closely related HRL and directly applying the model trained on that HRL to the LRL has been a popular transfer learning paradigm in low-resource settings (Täckström et al., 2012; Zhang et al., 2016; Cotterell and Heigold, 2017; Rijhwani et al., 2019; Lin et al., 2019; Rahimi et al., 2019). Related languages are often chosen


heuristically based on linguistic intuition, although some works have recently examined training models to select languages automatically (Lin et al., 2019; Rahimi et al., 2019). In our case, we would like to choose both a pivoting language and a string representation: phonemes or graphemes. This doubles the search space and increases the search difficulty.

LRL   Linguistics        n-gram Overlap     δ
ti    ˆam, 63.9 (60.8)   am, 74.2 (70.9)    10.3
om    ˆso, 28.0 (63.7)   ˆid, 40.9 (75.8)   12.9
rw    ˆrn, 46.4 (62.9)   tl, 64.6 (79.0)    18.2
si    hi, 50.4 (63.1)    hi, 50.4 (63.1)    0
lo    th, 51.4 (78.8)    th, 51.4 (78.8)    0
mr    ˆhi, 72.8 (83.3)   ˆhi, 72.8 (83.3)   0
te    ta, 12.6 (32.3)    hi, 32.6 (45.1)    20.0

Table 5: The pivoting language and its performance (with its n-gram overlap % with the LRL in parentheses), as selected by different criteria. The δ column shows the top-30 candidate recall improvement (%) from using n-gram overlap. Languages with a hat (ˆ) use grapheme representations, while the remaining ones use phoneme representations.

We devise a simple yet strong heuristic for picking HRLs for transfer: choosing the language that shares the largest number of character n-grams with the LRL. This is an automatic process that does not need any domain or linguistic knowledge; a sketch is given at the end of this subsection. Table 5 shows the performance gap between this criterion and manual selection based on linguistic features, which has been used in previous work on XEL (Rijhwani et al., 2019). Notably, to eliminate the variance caused by the different numbers of inter-language links possessed by different HRLs, we compare the similarity between mLRL and eEN directly, without the comparison



between mLRL and eHRL. More specifically, we replace Equation (1) with score(mLRL, eEN) = sim(mLRL, eEN).

It is clear that selecting proper pivoting languages and string representations is important; failing to do so can cause a performance degradation of as much as 20%. However, while our heuristic selection method is empirically better than manual selection with linguistic features, it is notable that the pivoting languages and representations selected in this way do not necessarily yield the best performance. We observe that choosing a pivoting language with slightly less n-gram overlap yields better performance for some LRLs. For example, while om has about 43% character n-gram overlap with am, using the model trained with am yields a gold candidate recall of 45.0% (compared to 40.9% with ˆid). This indicates that accuracy could be further improved with more sophisticated pivoting language selection criteria.

Regarding the importance of n-gram sharing, we suspect that the relatively low recall of te compared to the baseline model results from a lack of shared character n-grams with its pivot language hi. Whereas most other language pairs have over 60% character n-gram overlap, te and hi share only 45.1%, meaning that vm encodes less than half of the n-grams the mention contains. In contrast, the character-level embeddings used by the BI-LSTM are less sparse than higher-order n-grams, and thus the BI-LSTM suffers less information loss.
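A sketch of the overlap heuristic mentioned above; the corpora are simply lists of strings, and the normalization details are assumptions of this example:

```python
def char_ngram_overlap(strings_lrl, strings_hrl, ns=(2, 3, 4, 5)):
    """Percentage of the LRL's character n-gram types also seen in the HRL."""
    def ngram_types(strings):
        grams = set()
        for s in strings:
            for n in ns:
                grams.update(s[i:i + n] for i in range(len(s) - n + 1))
        return grams
    lrl, hrl = ngram_types(strings_lrl), ngram_types(strings_hrl)
    return 100.0 * len(lrl & hrl) / len(lrl)

# Pick the pivot with the highest overlap (hrl_corpora is a hypothetical
# dict mapping each candidate HRL to its list of entity strings):
# best = max(hrl_corpora, key=lambda h: char_ngram_overlap(lrl_corpus, hrl_corpora[h]))
```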

5.5 Properties of Learned n-grams

As discussed in the previous sections, the objective of CHARAGRAM is to learn n-gram mappings between the HRL and English. To more concretely understand the model's behavior, we randomly sample a few English n-gram embeddings and retrieve their five nearest neighbors from the HRL side. Table 6 lists these most similar n-grams.

HRL   EN    5 Nearest Neighbors
am    ma    -
hi    bi    -
th    bi    -
so    Uni   maca, amac, Jaam, macad, Jaam
so    bi    bi, mbi, arbee, inho, biya

Table 6: Randomly sampled English n-grams and their five nearest neighbors in the n-gram embedding space.

CHARAGRAM is able to correctly associate n-grams that have close pronunciations across languages. Because the pronunciation of the same syllable can vary in the context of different words, n-grams with small variances in vowels can still be reasonable approximations. For example, ''li'' can be pronounced as both ''li'' and ''le'' in different words. One thing worth mentioning is that CHARAGRAM is able to correctly recognize some mappings of non-transliterated words. For example, ''Jaamacadda'' in so is the parallel of ''University'' in English, and the model was able to correctly align the n-grams corresponding to these words. This result demonstrates one way in which CHARAGRAM alleviates the TRANS errors that the BI-LSTM suffers from.

5.6 Improving End-to-end XEL Systems

To investigate how our candidate generation
model influences the end-to-end XEL system,
we use its candidate lists in the disambiguation
model BURN proposed by Zhou et al. (2019). BURN
creates a fully connected graph for each document
and performs joint inference on all mentions in
the document. To the best of our knowledge,
it
que
is currently the disambiguation model
has demonstrated the strongest empirical results
for XEL without any targeted LRL resources.
Donc, we believe it is the most reasonable
choice in our low-resource scenario. For details,



we encourage readers to refer to the original paper.14

14It is notable that we assume the XEL system can access oracle NER outputs. In reality, the F1 scores of low-resource NER are often in the range of 70%. We leave the evaluation of, and possible improvements with, non-perfect NER systems as future work.

To make the best use of scarce but existing resources, we follow Zhou et al. (2019) and concatenate the candidate lists generated by WIKIMENTION with the candidate lists of both the baseline and our method. The score of each candidate entity is calculated in the following way:

$$\mathrm{score}_{merge}(e_{EN}) = \alpha \cdot \mathrm{score}_{wm}(e_{EN}) + (1 - \alpha) \cdot \mathrm{score}'_{ca}(e_{EN})$$

$$\mathrm{score}'_{ca}(e_{EN}) = \mathrm{softmax}\big(\beta \cdot \mathrm{score}_{ca}(e_{EN})\big)$$

where score_wm is the score from WIKIMENTION, score_ca is the original score from CHARAGRAM, and score'_ca is the scaled score over the top-30 candidate list. We omit mLRL in all score functions for simplicity. In our experiments, α is set to 0.6 and β is set to 100.
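A sketch of the merging step for a single mention's top-30 candidate list (the two score arrays are assumed to be aligned over the same candidates):

```python
import numpy as np

def merge_scores(score_wm, score_ca, alpha=0.6, beta=100.0):
    """score_merge = alpha * score_wm + (1 - alpha) * softmax(beta * score_ca),
    with the softmax taken over the top-30 candidate list."""
    z = beta * (np.asarray(score_ca) - np.max(score_ca))  # shift for stability
    score_ca_scaled = np.exp(z) / np.exp(z).sum()
    return alpha * np.asarray(score_wm) + (1.0 - alpha) * score_ca_scaled
```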

Table 7 lists the end-to-end XEL results. Compared with the baseline model, our model recovers more candidate entities missed by WIKIMENTION and significantly benefits the downstream disambiguation model, as well as the end-to-end system. Even though incorporating WIKIMENTION narrows the gap in gold candidate recall (compared to Table 3), our model still beats the baseline model by a large margin. While the baseline candidate generation model only reaches a recall in the range of 60% on average, ours yields a recall in the range of 70%, closer to the high-resource counterparts, which are often in the range of 80%. Consequently, the end-to-end XEL in-KB accuracy increases for all four languages, with gains from 1.3% to 16.7%. This is significant for extremely low-resource languages like ti, indicating the potential of our model in truly resource-scarce settings.

       ee + BILSTM    Ours           δ
ti     50.8 (55.4)    67.5 (75.8)    16.7 (20.4)
om     53.2 (61.3)    59.2 (67.9)    6.0 (6.6)
rw     61.5 (67.5)    68.9 (73.9)    7.4 (6.4)
si     70.9 (76.1)    72.2 (78.0)    1.3 (1.9)
avg    59.1 (65.1)    67.0 (73.9)    7.9 (8.8)

Table 7: In-KB accuracy (with the top-30 gold candidate recall of the merged candidate lists in brackets; both in %) of the end-to-end XEL system with different candidate generation models. δ shows the in-KB accuracy degradation (%) from using the baseline candidate generation model.15

15These results are not comparable to Rijhwani et al. (2019), as we only consider the subset of mentions whose linked entity exists in Wikipedia.

6 Related Work

Candidate generation for entity linking: In most work, candidate generation for monolingual entity linking relies on string matching and Wikipedia anchor-text lookup (Shen et al., 2015). For cross-lingual entity linking, inter-language


links from Wikipedia and bilingual lexicons are used to translate the given entity mentions into the language of the KB (often English) in order to generate candidates (Tsai and Roth, 2016; Pan et al., 2017; Upadhyay et al., 2018a).
More recently, Rijhwani et al. (2019) use orthographic and phonological similarity to high-resource languages to generate candidates for low-resource test languages. For the related task of clustering entities, Blissett and Ji (2019) use RNNs for measuring the orthographic similarity of entity mentions.

Transliteration: There has also been work on transliterating named entities from one language to another (Knight and Graehl, 1998; Li et al., 2004). Although similar to our current task of selecting candidates from an English KB, transliteration poses different challenges, as it involves generating the English entity name itself. Upadhyay et al. (2018b) use a sequence-to-sequence model and a bootstrapping method to transliterate low-resource entity mentions using extremely limited training data. Tsai and Roth (2018) combine the standard translation method for XEL candidate generation with a transliteration score to improve XEL candidate recall on several languages.

Bilingual lexicon induction: Another related task is bilingual lexicon induction, where a mapping between words in two languages is predicted by a learned model (Haghighi et al., 2008). Although such a mapping can be used to



translate entities from the source test language to English for XEL candidate generation, most existing lexicon induction methods assume the availability of a large amount of monolingual data in both the source and target languages (Conneau et al., 2017; Chen and Cardie, 2018; Artetxe et al., 2018). Although such data is readily available in English, this is unrealistic for many low-resource languages, diminishing the utility of such methods for the low-resource XEL task.

7 Conclusion

In this work, we perform a systematic analysis to study and address the limitations of a previous candidate generation model in low-resource settings. We propose three methodological improvements to resolve two main problems of the baseline model, namely, the mismatch between mentions and entities and sub-optimal string modeling. For the first problem, we introduce mention-entity pairs into the training process to provide explicit supervision. We additionally collect entity aliases from English Wikidata to further bridge this gap. To solve the second problem, we replace the LSTM with a more direct model, CHARAGRAM. These methods form our proposed candidate generation model. We experiment with seven realistic datasets in LRLs. Our model yields an average gain of 16.9% in top-30 gold candidate recall. We also evaluate the influence of our candidate generation model in the context of end-to-end low-resource XEL, where it brings an average gain of 7.9% across four LRLs.

An immediate future focus is finding a way to properly combine multiple models trained on different HRLs to obtain better character n-gram coverage, and thus improve model performance across different LRLs. Another interesting avenue is to investigate how to efficiently compare mentions with a large number of entities (e.g., 2M in Wikipedia) in high-dimensional space. Currently, our model calculates the cosine similarity between a mention and every entity in the KB, which takes a few minutes for each test set. However, there is much existing work (Rajaraman and Ullman, 2011; Johnson et al., 2019) on efficient similarity search in high-dimensional space for billion-scale datasets. It is likely that combining these algorithms with our retrieval method will allow it to scale well and reduce the computation time to a few seconds. In addition, other interesting future directions are examining how to balance the trade-off between gold candidate recall and disambiguation difficulty, and how to apply our model to settings where the target language is not English.

Acknowledgments

We would like to thank Radu Florian and the anonymous reviewers for their useful feedback. This material is based on work supported in part by the Defense Advanced Research Projects Agency Information Innovation Office (I2O) Low Resource Languages for Emergent Incidents (LORELEI) program under contract no. HR0011-15-C0114. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. Shruti Rijhwani is supported by a Bloomberg Data Science Ph.D. Fellowship.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre.
2018. A robust self-learning method for fully
unsupervised cross-lingual mappings of word
embeddings. In Proceedings of the 56th Annual
Meeting of the Association for Computational
Linguistics, pages 789–798.

Kevin Blissett and Heng Ji. 2019. Cross-lingual
NIL entity clustering for low-resource languages.
In Proceedings of the Second Workshop on
Computational Models of Reference, Anaphora
and Coreference, pages 20–25.

Razvan Bunescu and Marius Paşca. 2006. Using encyclopedic knowledge for named entity disambiguation. In 11th Conference of the European Chapter of the Association for Computational Linguistics.

Xilun Chen and Claire Cardie. 2018. Unsupervised
multilingual word embeddings. In Proceedings
of the 2018 Conference on Empirical Methods
in Natural Language Processing, pages 261–270.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou.


2017. Word translation without parallel data. International Conference on Learning Representations.

Ryan Cotterell and Georg Heigold. 2017. Cross-
lingual character-level neural morphological
tagging. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language
Processing, pages 748–759. Association for
Computational Linguistics.

Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.

Mark Dredze, Paul McNamee, Delip Rao,
Adam Gerber, and Tim Finin. 2010. Entity
disambiguation for knowledge base population.
In Proceedings of
the 23rd International
Conference on Computational Linguistics,
pages 277–285.

Octavian-Eugen Ganea and Thomas Hofmann.
2017. Deep joint entity disambiguation with
local neural attention. In Proceedings of the
2017 Conference on Empirical Methods in Nat-
ural Language Processing, pages 2619–2629.

Amir Globerson, Nevena Lazic, Soumen Chakrabarti, Amarnag Subramanya, Michael
Ringaard, and Fernando Pereira. 2016. Collective
entity resolution with multi-focal attention. Dans
Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics,
pages 621–631.

Aria Haghighi, Percy Liang, Taylor Berg-
Kirkpatrick, and Dan Klein. 2008. Apprentissage
bilingual lexicons from monolingual corpora.
In Proceedings of the 46th Annual Meeting of
the Association for Computational Linguistics,
pages 771–779.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust
disambiguation of named entities in text. Dans
Proceedings of the Conference on Empirical
Methods in Natural Language Processing,
pages 782–792. Association for Computational
Linguistics.

Heng Ji, Joel Nothman, Ben Hachey, and Radu
Florian. 2015. Overview of TAC-KBP 2015
tri-lingual entity discovery and linking. In Text
Analysis Conference.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.

Kevin Knight and Jonathan Graehl. 1998. Ma-
chine transliteration. Computational Linguis-
tics, 24:599–612.

Haizhou Li, Min Zhang, and Jian Su. 2004. UN
joint source-channel model for machine trans-
literation. In Proceedings of the 42nd Annual
Meeting of the Association for Computational
Linguistics, pages 159–166.

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui
Li, Yuyan Zhang, Mengzhou Xia, Shruti
Rijhwani, Junxian He, Zhisong Zhang, Xuezhe
Ma, Antonios Anastasopoulos, Patrick Littell,
and Graham Neubig. 2019. Choosing transfer
languages for cross-lingual learning. In The
57th Annual Meeting of the Association for
Computational Linguistics.

Paul McNamee, James Mayfield, Dawn Lawrie,
Douglas Oard, and David Doermann. 2011.
Cross-language entity linking. In Proceedings
of 5th International Joint Conference on
Natural Language Processing, pages 255–263.

Bonan Min, Yee Seng Chan, Haoling Qiu,
and Joshua Fasching. 2019. Towards machine
reading for interventions from humanitarian-
assistance program literature. In Proceedings
of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing, pages 6443–6447.

David R. Mortensen, Siddharth Dalmia, and Patrick Littell. 2018. Epitran: Precision G2P for many languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel
Nothman, Kevin Knight, and Heng Ji. 2017.
Cross-lingual name tagging and linking for 282
languages. In Proceedings of the 55th Annual
Meeting of the Association for Computational
Linguistics, pages 1946–1958.

Priya Radhakrishnan, Partha Talukdar, and Vasudeva Varma. 2018. ELDEN: Improved entity linking using densified knowledge graphs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1844–1853.

Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019. Massively multilingual transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 151–164.

Anand Rajaraman and Jeffrey David Ullman.
2011. Mining of Massive Datasets, Cambridge
Presse universitaire.

Shruti Rijhwani, Jiateng Xie, Graham Neubig, and Jaime Carbonell. 2019. Zero-shot neural transfer for cross-lingual entity linking. In Thirty-Third AAAI Conference on Artificial Intelligence (AAAI).

Wei Shen, Jianyong Wang, and Jiawei Han. 2015. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, pages 443–460.

Avirup Sil and Radu Florian. 2016. One for all: Towards language independent named entity linking. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 2255–2264.

Avirup Sil, Gourab Kundu, Radu Florian, and Wael Hamza. 2018. Neural cross-lingual entity linking. In Thirty-Second AAAI Conference on Artificial Intelligence.

Valentin I. Spitkovsky and Angel X. Chang. 2012. A cross-lingual dictionary for English Wikipedia concepts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, pages 3168–3175.

Stephanie Strassel and Jennifer Tracey. 2016. LORELEI language packs: Data, tools, and resources for technology development in low resource languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, pages 3273–3280.

Stephanie M. Strassel, Ann Bies, and Jennifer Tracey. 2017. Situational awareness for low resource languages: The LORELEI situation frame annotation task. In SMERP@ECIR, pages 32–41.

Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 477–487.

Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 589–598.

Chen-Tse Tsai and Dan Roth. 2018. Learning better name translation for cross-lingual wikification. In Thirty-Second AAAI Conference on Artificial Intelligence.

Shyam Upadhyay, Nitish Gupta, and Dan Roth. 2018a. Joint multilingual supervision for cross-lingual entity linking. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2486–2495.

Shyam Upadhyay, Jordan Kodner, and Dan Roth. 2018b. Bootstrapping transliteration with constrained discovery for low-resource languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 501–511.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Amir Pouran Ben Veyseh. 2016. Cross-lingual question answering using common semantic space. In Proceedings of TextGraphs-10: The Workshop on Graph-based Methods for Natural Language Processing, pages 15–19.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016a. Towards universal paraphrastic sentence embeddings. In Proceedings of the International Conference on Learning Representations.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016b. Charagram: Embedding words and sentences via character n-grams. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1504–1515.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2017. Learning distributed representations of texts and entities from knowledge base. Transactions of the Association for Computational Linguistics, 5:397–411.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015.
Character-level convolutional networks for text
classification. In Advances in Neural Infor-
mation Processing Systems, pages 649–657.

Yuan Zhang, David Gaddy, Regina Barzilay, and Tommi Jaakkola. 2016. Ten pairs to tag—multilingual POS tagging via coarse mapping between embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1307–1317.

Shuyan Zhou, Shruti Rijhwani, and Graham Neubig. 2019. Towards zero-resource cross-lingual entity linking. In Workshop on Deep Learning for Low-resource NLP.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
3
0
3
1
9
2
3
8
0
5

/

/
t

je

un
c
_
un
_
0
0
3
0
3
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

124
Télécharger le PDF