Improving Candidate Generation
for Low-resource Cross-lingual Entity Linking

Shuyan Zhou, Shruti Rijhwani, John Wieting
Jaime Carbonell, Graham Neubig

Language Technologies Institute
Carnegie Mellon University
{shuyanzh,srijhwan,jwieting,jgc,gneubig}@cs.cmu.edu

Abstract

Cross-lingual entity linking (XEL) is the
task of finding referents in a target-language
knowledge base (KB) for mentions extracted
from source-language texts. The first step of
(X)EL is candidate generation, which retrieves
a list of plausible candidate entities from
the target-language KB for each mention. Approaches
based on resources from Wikipedia
have proven successful in the realm of relatively
high-resource languages, but these do
not extend well to low-resource languages
with few, if any, Wikipedia pages. Recently,
transfer learning methods have been shown to
reduce the demand for resources in the low-resource
languages by utilizing resources in
closely related languages, but the performance
still lags far behind their high-resource counterparts.
In this paper, we first assess the problems
faced by current entity candidate generation
methods for low-resource XEL, then propose
three improvements that (1) reduce the disconnect
between entity mentions and KB entries,
and (2) improve the robustness of the
model to low-resource scenarios. The methods
are simple, but effective: we experiment with
our approach on seven XEL datasets and find
that they yield an average gain of 16.9% in
top-30 gold candidate recall, compared with
state-of-the-art baselines. Our improved model
also yields an average gain of 7.9% in in-KB
accuracy of end-to-end XEL.1

1 Introduction

Entity linking (EL; Bunescu and Paşca, 2006;
Cucerzan, 2007; Dredze et al., 2010; Hoffart

1Code and data will be released.


et al., 2011) associates entity mentions in a
document with their entries in a knowledge base
(KB). In this work, we focus on cross-lingual
entity linking (XEL; McNamee et al., 2011;
Ji et al., 2015), where the documents are in a
source language that differs from the KB language
(the target). XEL is an important component task for
information extraction in languages that do not
have extensive KB resources, and can potentially
benefit downstream applications such as building
cross-lingual question answering systems
(Veyseh, 2016), or supporting international humanitarian
assistance efforts in areas that do not
speak English (Strassel et al., 2017; Min et al.,
2019). Following Sil et al. (2018) and Upadhyay
et al. (2018a), we consider the target-language KB
to be English Wikipedia.

Given a document and named entity mentions
identified by a Named Entity Recognition (NER)
model, there are two primary steps in an XEL
system: (1) candidate generation, in which a
model retrieves a short list of plausible KB entities
for each mention, and (2) disambiguation, in which
a model selects the most likely KB entity from
the candidate list. The quality of the candidate lists
influences the performance of the end-to-end
XEL system, as correct entities not included in a
list cannot be recovered by the disambiguation
model.

In monolingual EL, candidate generation has often
been considered trivial (Shen et al., 2015). Simple
approaches using string similarity or Wikipedia
anchor-text links produce mention-entity lookup
tables with high candidate recall (e.g., in the 90%
range), and thus most work focuses on methods
for downstream entity disambiguation (Globerson
et al., 2016; Yamada et al., 2017; Ganea and
Hofmann, 2017; Sil et al., 2018; Radhakrishnan

et al., 2018). String similarity (e.g., edit distance)
cannot easily extend to XEL because surface
forms of entities often differ significantly across
the source and target languages, particularly when
the languages are in different scripts. Wikipedia
link methods can be extended to XEL by using
inter-language links between the two languages
to redirect entities to the English KB (Spitkovsky
and Chang, 2012; Sil and Florian, 2016; Sil et al.,
2018; Upadhyay et al., 2018a). This method works
to some extent, but often under-performs on low-resource
languages due to the lack of source-language
Wikipedia resources.

Although scarce, there are some methods that
propose to improve entity candidate generation
by training translation models with low-resource-language
(LRL)-English entity gazetteers (Pan
et al., 2017), or learning neural string matching
models based on an entity gazetteer in a related
high-resource language (HRL), which is then
applied to the LRL (Rijhwani et al., 2019)
(more in §2). However, even with these relatively
sophisticated methods, top-30 candidate recall still
falls far behind high-resource counterparts,
lagging by as much as 70% absolute.

In this work, we perform a systematic study to
understand and address the limitations of previous
XEL candidate generation models. First, in §3 we
examine the sources of error in the state-of-the-art
candidate generation model of Rijhwani et al.
(2019), and identify a number of potential reasons
for failure. Specifically, we find that two common
sources of error are (1) mismatch between the
entity name in the KB and the entity mention in
the text, and (2) failure of the string matching
model itself. In Figure 1, we show an example of
linking Marathi, a low-resource language spoken
in Western India, to English, which we will use as
a running example throughout the paper (although
our method is broadly applicable, as noted in the
experiments). In this case, errors of the first
type are due to the fact that the English entity
Cobie Smulders is mentioned as (green;
Smulders) or
(yellow; Jacoba Francisca Maria Smulders) in
the text. Errors of the second type are simple
recognition errors, such as where the mention
(blue; Cobie Smulders) is recognized
as the English entity Cobie Sikkens. We proceed to


Figure 1: The candidate generation process for various
mentions corresponding to the gold entity ‘‘Cobie
Smulders’’. Strings on the left are mentions in the
documento, and the pronunciation in IPA of each string
is written below it. The candidate entities in the English
KB generated by the candidate generation model are
shown on the right.

propose methodological improvements that resolve
these major issues.

The first set of improvements handles the
mismatch between the unique entity name that
appears in the English KB and the many different
realizations of it in the source text. First, we note
that the training data used in learning-based methods
for XEL candidate generation (Pan et al., 2017;
Rijhwani et al., 2019) is made of entity-entity
pairs, which fail to capture this variation. We
experiment with adding mention-entity pairs to
the training data to provide explicit supervision,
helping the model better capture the differences
between mentions and entities (§4.1). Second,
we note that many of the variations in the source
language are actually similar to how the entity
varies in English, and thus we can use English-language
resources to capture this variation. To
this end, we collect entity aliases from English
Wikidata2 and allow the model to also look
up these aliases during the candidate generation
process (§4.2).

The second contribution of this work is a
better modeling strategy for the strings that represent
mentions and entities (§4.3). We posit that part
of the reason why the LSTM-based model of
Rijhwani et al. (2019) fails to properly model all
words in a string is that it is not the ideal
architecture for learning from limited training data,
and as a result, it erroneously learns that some
words in the mention can be ignored. To solve
this problem, we replace the LSTM with a more
direct model based on the sum of character n-gram
2https://www.wikidata.org/wik.

embeddings (Wieting et al., 2016), which we
posit is more likely to generalize in this difficult
learning setting.

We evaluate our proposed methods on four real-world
XEL datasets provided by DARPA LORELEI
(Strassel and Tracey, 2016), as well as three
other datasets we create with Wikipedia anchor
text and inter-language links (§5). Although our
methods are simple, they are highly effective: our
proposed model leads to gains ranging from
7.4–33.3% in top-30 gold candidate recall compared
with Rijhwani et al. (2019) on seven LRLs.
Because our model provides downstream disambiguation
models with much larger headroom for
improvement, we find that simply changing the
candidate generation process yields an average
gain of 7.9% in end-to-end XEL in-KB accuracy
on four LRLs, pushing low-resource XEL a step
towards high-resource XEL performance.

2 Background

2.1 Problem Formulation

Given a set of mentions M = {m1, m2, . . . , mN}
extracted from multiple documents in the source
language, and an English KB KEN that contains
millions of entities with unique names, the goal of
a candidate generation model is to retrieve a list of
possible candidate entities ei = {ei,1, ei,2, . . . , ei,n}
from KEN for each mi ∈ M. In consideration of
the computational cost of the more complicated
downstream disambiguation model, n is often 30
or smaller (Sil et al., 2018; Upadhyay et al.,
2018a). The performance of candidate generation
is measured by the gold candidate recall, which
is the proportion of retrieved candidate lists that
contain the correct entity. It is critical that this
number be high, as any time the correct entity is
excluded, the disambiguation model will be unable
to recover it. Formally, if we denote the correct
entity of each mention m as ê, the gold candidate
recall r is defined as:

$$ r = \frac{\sum_{i=1}^{N} \delta(\hat{e}_i \in e_i)}{N} $$

where δ(·) is the indicator function, which is 1 if true
and 0 otherwise, and N is the total number of mentions
among all documents. We follow Yamada et al.
(2017) and Ganea and Hofmann (2017) and ignore
mentions whose linked entity does not exist in the
KB in this work.3
We use ‘‘EN’’ to denote the target language
English, ‘‘HRL’’ to denote any high-resource
language, and ‘‘LRL’’ to denote any low-resource
language. For example, KHRL is a KB in an HRL
(e.g., Spanish Wikipedia), and eHRL is an entity in
KHRL. Because our focus is on low-resource XEL,
the source language is always an LRL. We also
refer to the HRL as the ‘‘pivoting’’ language
below.

2.2 Baseline Candidate Generation Models

In this section, we introduce two existing categories
of techniques for candidate generation.

Direct Wikipedia-based Models WIKIMENTION
is a popular candidate generation model used
by most state-of-the-art work in XEL (Sil and
Florian, 2016; Sil et al., 2018; Upadhyay et al.,
2018a). Specifically, this model first extracts a
monolingual mLRL-eLRL map from anchor-text
links. For example, if the mention (Smulders)
is linked to the entity (Cobie Smulders)
on some Marathi Wikipedia pages, the latter
will be treated as a candidate entity of the former.
These Marathi entities are then redirected to their
English counterparts by Wikipedia LRL-English
inter-language links. For example,
(Cobie Smulders) will be redirected to Cobie
Smulders. However, the reliance on the coverage
of the LRL Wikipedia strongly constrains this method
in low-resource settings.
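Once the anchor-text map and the inter-language links have been extracted from Wikipedia dumps, the pipeline reduces to two table lookups; a minimal sketch (the data structures below are illustrative, not the authors' implementation):

```python
# mention string -> LRL entity titles, harvested from anchor-text links
lrl_anchor_map = {"Smulders": {"mr:Cobie_Smulders"}}
# LRL entity title -> English entity title, from inter-language links
interlang = {"mr:Cobie_Smulders": "Cobie Smulders"}

def wikimention_candidates(mention):
    """English candidates = inter-language redirects of the LRL
    entities that the mention was ever linked to."""
    return {interlang[e]
            for e in lrl_anchor_map.get(mention, set())
            if e in interlang}

print(wikimention_candidates("Smulders"))  # {'Cobie Smulders'}
```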

TRANSLATION is another Wikipedia-based candidate
generation model, proposed by Pan et al.
(2017). Instead of building a monolingual map
that requires access to anchor-text links in an LRL
Wikipedia, this model translates any mLRL to mEN
word-by-word and retrieves candidate entities
from an existing mEN − eEN map. The word-by-word
translations are induced from LRL-English
inter-language links. Even though TRANSLATION
is less sensitive to the availability of resources
(to some extent), its dependency on LRL-English
inter-language links still limits its performance in
low-resource settings.


3The predictions for these mentions will always be wrong.
This could be fixed either by designing mechanisms to predict
‘‘not linkable’’ or by expanding the KB, both of which are beyond the
scope of this work.


Pivoting-based Entity Linking Instead of
relying on LRL resources, pivoting-based entity
linking (PBEL; Rijhwani et al., 2019) learns to
perform cross-lingual string matching based on
an entity gazetteer between a related HRL and
English. This model consists of two BI-LSTMs,
namely the HRL-BI-LSTM and the EN-BI-LSTM.
The training data is a collection of entity pairs
(eHRL − eEN). Each of the BI-LSTMs reads in
an entity name eHRL (eEN) and encodes it into an
embedding vHRL (vEN). The learning objective
is to maximize the similarity between the two
entities of each pair. The trained HRL model is
used as-is to encode LRL mentions to vLRL,
relying on the similarity between the languages
to achieve a reasonably accurate encoding. A
vLRL is compared with every entity embedding in
KEN, and the entities with the top-n highest similarity
scores are retrieved as the candidate entities. To
compensate for the accuracy degradation due to
transfer, this work also considers the similarity
between mLRL and eHRL, where eHRL is the
counterpart of eEN in KHRL. Thus, the score
between mLRL and entity eEN is defined as:

$$ \mathrm{score}(m_{\mathrm{LRL}}, e_{\mathrm{EN}}) = \max\big(\mathrm{sim}(m_{\mathrm{LRL}}, e_{\mathrm{EN}}),\ \mathrm{sim}(m_{\mathrm{LRL}}, e_{\mathrm{HRL}})\big) \qquad (1) $$

where sim(X, y) = cosine(vx, vy). When eHRL
does not exist, sim(mLRL, eHRL) is set to −∞.
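In code, Equation (1) is a maximum over two cosine similarities, with the HRL term dropping out when the English entity has no HRL counterpart; a sketch over precomputed embeddings (the function names are ours):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def score(v_lrl_mention, v_en_entity, v_hrl_entity=None):
    """Equation (1): fall back to -inf when e_HRL does not exist."""
    s_hrl = (cosine(v_lrl_mention, v_hrl_entity)
             if v_hrl_entity is not None else float("-inf"))
    return max(cosine(v_lrl_mention, v_en_entity), s_hrl)
```

Candidate generation then keeps the n entities with the highest scores under this function.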

PBEL removes the reliance on LRL resources,
and currently represents the state of the art for
candidate generation in low-resource XEL. However,
as we analyze in detail in §3, it still
faces a number of challenges.

3 Failures of Existing Models

In this section, we perform a systematic analysis
of failure cases in PBEL (§3.1), focusing specifically
on two error types: entity-mention
mismatch (§3.2) and string matching failures (§3.3).

3.1 Mention Types and Analysis

We apply a PBEL model trained with eHRL − eEN
pairs to generate candidate entities for mentions
extracted from LRL documents. For LRLs we use
Tigrinya, Oromo, Marathi, and Lao, and for HRLs
we use Amharic, Hindi, Hindi, and Thai, respectively.
The details of the datasets are in §5. We
randomly sample 100 system outputs from each
LRL and manually annotate their mention type
according to a typology created while performing

the analysis. The mention types are as follows,
where the comparison is between the mention in
the LRL and the entity string in English:

DIRECT: The mention is a direct transliteration
of the entity. For example, a mention
of Cobie Smulders as (Cobie
Smulders).

ALIAS: The mention is another full proper name
that is different from the entity name in the
English KB. For example, a mention of Cobie
Smulders as
(Jacoba Francisca Maria Smulders).

TRANS: The mention and the entity have word-by-word
alignment; however, the mention
contains regular words (e.g., university,
union) that cannot be transliterated directly.

EXTRA SRC: There is at least one extra word in
the mention that is not a proper noun (e.g.,
(Sir)); or there is at least one extra syllable
in the mention, which is often due to the
morphology of the source language.

EXTRA ENG: There is at least one extra word
in the English entity that is not a proper noun.

BAD SPAN: The mention span is not an entity
due to mis-annotation or non-standard anchor
text in Wikipedia; the annotated linked
entity is wrong; or the mention is in a
language other than our testing language.

We consider three situations for each sample:
(1) in top-1: the model ranks the correct entity the
highest, the ideal case; (2) in top-2 to 30: the model
ranks the correct entity in the top-2 to top-30,
which is less ideal but will still potentially allow
a downstream disambiguation model to predict the
correct entity; and (3) not in top-30: the model does
not rank the entity in the top-30, which will certainly
lead to an error.

Figure 2 shows the mention types of the 400
samples and PBEL's performance within each of the
mention types. In the following sections, we
examine, in depth, two major causes of error:
mention-entity mismatch (largely responsible for errors
in the ALIAS, EXTRA SRC, and EXTRA ENG categories)
and model failure (largely responsible for errors
in DIRECT).


Cifra 2: The distribution of mention types in 400
samples and the baseline model’s performance with
respect to each of the mention types.

Lang             am     so     hi     th

|eHRL| = |eEN|   82.9   80.7   83.4   56.8
|mHRL| = |eEN|   71.1   58.0   56.8   55.8

Table 1: Proportion of entries where HRL
strings have the same number of words as
their English counterparts.

3.2 Failures due to Mention-Entity Mismatch

As demonstrated in Figure 1, a single English
entity can have different realizations in the
source-language document. As a result, many
of these realizations will not match lexically or
phonetically with the entity in the KB. This poses a
serious problem for matching methods that rely on
graphemic or phonemic similarity, such as PBEL.
One typical pattern in mention-entity variation
is additional words, as noted in the EXTRA SRC
and EXTRA ENG classes. We examine this more
systematically across the whole corpus by comparing
the number of words on each side, which gives a rough
lower bound on the amount of this mismatch. The
first row in Table 1 is the comparison between
eHRL and eEN, which presumably have better word-by-word
alignment (and were used in the training
of previous XEL methods). The second row
displays the comparison between mHRL and eEN.
It is obvious that entity-entity pairs have more
consistent length in words, while this consistency
is not preserved in mention-entity pair data. Thus,
even if the previous PBEL model could easily
learn exact string matches from the entity-entity
training data, to successfully associate mention-entity
pairs, the model would need to capture more
complex patterns (e.g., ignoring some words).4

The diverse realizations of a single entity bring
another, more serious, challenge to models that
mainly learn string matches: In reality, a realization
does not necessarily have significant overlap
with the entity name in Wikipedia. Sometimes, the
mention does not have any overlap with the entity
name at all, as noted in the ALIAS class. This
common pattern reflects the limitation of using eEN
as the unique representation on the English side.

3.3 Failures in Direct Transliteration

Even in seemingly easy cases where the entity is a
perfect transliteration of the mention (DIRECT),
we found the LSTM to fail frequently in our low-data
scenario. Among all DIRECT errors, we
made the interesting observation that the BILSTM
often properly captures only the first word (or the
first few characters) and ignores the existence
of the second and further-on words. For example,
the model ranks Cobie Sikkens higher than Cobie
Smulders for (Cobie Smulders).

To better understand this behavior, we manually
annotated 100 training pairs in Hindi and measured
how often the second or later words in eHRL do not
match their counterparts in eEN phonologically.5
We find that whereas 93 examples share a
phonologically similar first word, about 40 of
them have second and further-on words that
are not phonological matches: While most pairs
have word-by-word mappings, their second or
later words often match each other only
semantically; that is, there are regular words
(e.g., district, university) that have very different
pronunciations across the HRL and English, and
are therefore difficult to predict unless they are
explicitly seen in the training data. The BILSTM,
a flexible model with little inductive bias, seems
to overfit and erroneously learn that latter words
in the string do not need to be mapped directly.
This is a straightforward explanation for why the
model learns to ignore the second and further-on words.

To sum up, the failures of the PBEL model
can mainly be attributed to (1) lack of explicit
supervision; (2) lack of external resources to assist in
cases where the mention and entity name diverge
significantly; and (3) the BILSTM's inability to
properly match the whole string.

4Low numbers for th are due to the lack of explicit word
boundaries marked by spaces.

5The phonological similarity of names across languages is
vital to the success of cross-lingual mention-entity matching.



4 Improved Candidate Generation

Based on the results of this empirical study,
we propose three methods to resolve the main
problems inherent in the baseline PBEL model.

4.1 Eliminating Train-Test Discrepancy

The mention-entity discrepancy naturally leads
to our first simple but effective improvement
to the baseline model: We extend the original
eHRL − eEN pairs with mHRL − eEN pairs. We
first collect mHRL − eHRL pairs from anchor-text
links in an HRL Wikipedia and then redirect these
entities to their parallels in English Wikipedia. As
a result, we get the desired mHRL − eEN pairs.
For example, if (Smulders) is linked to
(Cobie Smulders) on some Marathi
Wikipedia pages, and the latter can be redirected to
Cobie Smulders in English, then (Smulders) and
Cobie Smulders form one mention-entity pair. Although
this is perhaps obvious in hindsight, to our
knowledge, all previous works that explicitly
train XEL candidate retrieval models do so on
eHRL − eEN pairs (Pan et al., 2017; Rijhwani et al.,
2019), which are mostly word-by-word mappings.
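A sketch of this data-collection step, assuming the anchor-text pairs and inter-language links have already been parsed from the HRL Wikipedia dump (the input formats are illustrative):

```python
def build_mention_entity_pairs(anchor_pairs, interlang):
    """anchor_pairs: (mention_hrl, entity_hrl) tuples from anchor text.
    interlang: dict mapping entity_hrl -> entity_en.
    Returns the desired (mention_hrl, entity_en) training pairs."""
    pairs = []
    for mention_hrl, entity_hrl in anchor_pairs:
        entity_en = interlang.get(entity_hrl)
        if entity_en is not None:  # skip entities with no English page
            pairs.append((mention_hrl, entity_en))
    return pairs
```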

4.2 Utilizing English Entity Aliases

The training method introduced in the previous
section will render the model more capable of
dealing with minor differences between mentions
and entities. However, it would still struggle to
match strings with significant differences, such as
the examples of ‘‘Cobie Smulders’’ and ‘‘Pope
Paul V’’ shown in §3.2. To mitigate
this, we propose using Wikidata, a crowd-edited
knowledge base similar to Wikipedia, which
provides an ‘‘also known as’’ section that lists
common aliases of each entity.6 Our second
method is based on the observation that Wikidata
can serve as an off-the-shelf alias lookup
table with better coverage than simply using the
entity's canonical Wikipedia title. An example of
how this lookup table can increase coverage is
indicated in Figure 2. In our analysis, we found
that more than 50% of the ALIAS mentions could
be covered by this table. There is a map between

6For example, https://www.wikidata.org/wiki/Q200566.

Wikipedia entities and Wikidata entities, so we
can redirect each Wikipedia entity to its Wikidata
entry to retrieve these aliases.7

At test time, we treat the aliases of an entity
equally with its main Wikipedia entity name,
allowing the model to match the target mention
against these aliases as well. As a result,
sim(mLRL, eEN) in Equation (1) is modified as:

$$ \mathrm{sim}(m_{\mathrm{LRL}}, e_{\mathrm{EN}}) = \max_{a_i \in A} \mathrm{sim}(m_{\mathrm{LRL}}, a_i) $$

where A is the combination of the entity's Wikipedia title
and its aliases.8 Note that although one might
consider using aliases in languages other than
English, we found these to be very scarce, so
we did not attempt to expand entity names on the
HRL side.
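In effect, scoring an entity becomes a maximum over its surface forms; a sketch (the alias table is shown inline for illustration, and `encode` stands for whichever string encoder is in use):

```python
alias_table = {  # from Wikidata "also known as" (illustrative entry)
    "Cobie Smulders": ["Jacoba Francisca Maria Smulders"],
}

def sim_with_aliases(sim, encode, v_mention, entity_title):
    """Modified Equation (1) term: max over the title and its aliases."""
    surface_forms = [entity_title] + alias_table.get(entity_title, [])
    return max(sim(v_mention, encode(name)) for name in surface_forms)
```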

4.3 More Explicit String Encoding

As mentioned previously, while BI-LSTMs have
proven powerful for modeling sequential data in
the literature, we argue that they are not an ideal
string encoder for this setting. This is because
our training data contain a nontrivial number of
pairs with less predictable word mappings
(e.g., translations). With such large freedom in
the face of insufficient and noisy training data,
this encoder seemingly overfits, resulting in poor
generalization. Previous researchers (Dai and Le,
2015; Wieting et al., 2016a) have noticed similar
problems when using LSTMs for representation
learning.

As an alternative, we propose the use of the
CHARAGRAM model (Wieting et al., 2016) as the
string encoder. This model scans the string with
various window sizes and produces a bag of
character n-grams. It then maps these n-grams to
their corresponding embeddings through a lookup
table. The final embedding of the string is the
sum of all the n-gram embeddings, followed by a
nonlinear activation function. Figure 3 shows an
illustration of the model.

Formalmente, we denote a string as a sequence
of characters x = [x1, x2, . . . , xm] that includes

7Other resources, such as bold terms, link anchors,
disambiguation pages, and surnames of mentions, could
potentially increase the coverage of Wikidata.

8Note that incorporating aliases results in a small amount
of extra computation, multiplying the effective size of
the KB by a, the average number of aliases per mention.
However, in Wikidata, a = 1.2, so we believe this is a
reasonable cost-benefit trade-off, given the gains afforded by
incorporating these aliases for many languages.


Figure 3: The architecture of CHARAGRAM.

We use $x_i^j$ to denote the sub-sequence from
position i to position j inclusive, i.e.,
$x_i^j = [x_i, x_{i+1}, \ldots, x_j]$. The embedding v of a
string x is:

$$ \mathbf{v} = \tanh\Big( \mathbf{b} + \sum_{i=1}^{m} \sum_{n \in N} \mathbb{1}\big(x_i^{i+n-1} \in V\big)\, \mathbf{W}_{x_i^{i+n-1}} \Big) $$

where N is a set of predefined window sizes,
b ∈ R^d, V is the set of all n-grams seen in the training data,
W ∈ R^{|V|×d} is the embedding lookup table, and
W_{x_i^{i+n-1}} ∈ R^d is the embedding of x_i^{i+n-1}.
Here 1(·) is the indicator function: if an n-gram is not in V,
we simply discard it.

Compared with the BI-LSTM, the advantages
of CHARAGRAM are four-fold. First, the complexity
of memorizing short character strings is
reduced: CHARAGRAM learns multi-character
subsequences by simply adding them
to an embedding table, whereas the LSTM learns
them in a multi-step recurrent process. Second,
because of their relatively higher expressiveness,
LSTMs overfit to the noisy and relatively small
training data provided by Wikipedia bilingual
entity maps, the likely reason that LSTMs consider
only the first word in errors from the
DIRECT category. In contrast, CHARAGRAM does
not consider order information, giving it an
explicit inductive bias that forces it to rely on
character n-gram matching for all n-grams in the
sequence. Third, CHARAGRAM's simple architecture
eases the learning process: an LSTM needs O(m)
steps to propagate gradients from start to finish
(Vaswani et al., 2017), while CHARAGRAM requires
only O(1) steps to do so. Finally, although not a
performance-based advantage, the CHARAGRAM model is more
interpretable, which makes our further analysis
easier to perform (see §5).

We follow Wieting et al. (2016) and Rijhwani
et al. (2019) and use negative sampling with a
max-margin loss to train the model:

$$ L = \sum_{i=1}^{B} \max\big(0,\ 1 - \mathrm{sim}(m, e_{\mathrm{EN}+}) + \mathrm{sim}(m, e^{i}_{\mathrm{EN}-})\big) $$

where e_{EN+} is the linked entity of m, e_{EN−} is a
randomly sampled English entity, and B is the number
of negative samples for each positive pair.
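The following PyTorch sketch puts the encoder and the objective together; the hyperparameters (n ∈ {2, 3, 4, 5}, d = 300) follow §5.2, but the class and function names are ours, not the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Charagram(nn.Module):
    def __init__(self, ngram_vocab, dim=300, ns=(2, 3, 4, 5)):
        super().__init__()
        self.vocab = ngram_vocab            # n-gram string -> row in W
        self.ns = ns
        self.W = nn.Embedding(len(ngram_vocab), dim)
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, s):
        s = "^" + s + "$"                   # special start/end symbols
        # collect all character n-grams seen in training; discard the rest
        ids = [self.vocab[s[i:i + n]]
               for n in self.ns
               for i in range(len(s) - n + 1)
               if s[i:i + n] in self.vocab]
        total = self.W(torch.tensor(ids)).sum(0) if ids else 0.0
        return torch.tanh(self.b + total)

def max_margin_loss(model, mention, pos_entity, neg_entities):
    """Hinge loss over B randomly sampled negative entities."""
    v_m = model(mention)
    s_pos = F.cosine_similarity(v_m, model(pos_entity), dim=0)
    return sum(torch.clamp(1 - s_pos +
                           F.cosine_similarity(v_m, model(e), dim=0), min=0)
               for e in neg_entities)
```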

5 Experiments

5.1 Datasets

We evaluate our model on the following datasets,
spanning seven low-resource languages.

DARPA-LRL: The data for the first four
languages are news articles, blogs, and social
media posts annotated with entity spans and links
by the LDC as part of the DARPA LORELEI
program. The documents are in four low-resource
languages: Tigrinya (ti; a Semitic
language spoken in Eritrea and Ethiopia, written
in Ethiopic script), Oromo (om; an Afroasiatic
language spoken in the Horn of Africa, written
in Roman script), Kinyarwanda (rw; a language
of the Niger-Congo family spoken in Rwanda,
written in Roman script), and Sinhala (si;
an Indo-Aryan language spoken in Sri Lanka,
written in its own script). These are naturally
occurring real-world data annotated and linked to
a KB, containing information about disasters and
humanitarian crises. We use these as the ‘‘gold
standard’’ datasets for our evaluation.

WIKI: One disadvantage of the DARPA-LRL
dataset, however, is that it is not publicly
distributed at the time of this writing. In order
to allow researchers without access to the DARPA-LRL
data to compare directly with our method,
we additionally create three datasets from
Wikipedia, as described in §4.1. Specifically, these
cover Marathi (mr; an Indo-Aryan language
spoken in Western India, written in Devanagari
script), Lao (lo; a Kra-Dai language written in
Lao script), and Telugu (te; a Dravidian language
spoken in southeastern India, written in Telugu
script). As Wikipedia is created through crowdsourcing,
the anchor-text links are similar to those
appearing in realistic XEL datasets. It is notable
that entity mentions in WIKI often closely match
the Wikipedia entity titles, and thus this dataset is
nominally easier than the DARPA-LRL dataset.



5.2 Training Details

In the CHARAGRAM model, we use character
n-grams with n ∈ {2, 3, 4, 5} and an embedding size
of 300. We train the model with stochastic gradient
descent with batch size 64 and a learning rate of
0.1. For the BI-LSTM model, we follow Rijhwani
et al. (2019) for hyperparameter selection.

We also compare our model with a character-based
CNN with sum-pooling (CHARCNN; Zhang
et al., 2015; Wieting et al., 2016), whose parameters
are set to be roughly comparable in size
to our CHARAGRAM model. The embedding size of
each character is set to 1024; the kernel sizes are
set to 2, 3, 4, and 5, each with 4800 feature maps. The
output of the sum-pooling layer, with a dimension of
19,200 (4800 × 4), is fed to a fully connected layer,
resulting in a vector of size 300. The dropout is
set to 0.5.9

For each training language, we set aside a
small subset of the training data (mHRL − eEN) as
our development set. For all models, we stop
training if the top-30 gold candidate recall on the
development set does not increase for 50 epochs,
and the maximum number of training epochs is
set to 200.

We select the HRL that has the highest character
n-gram overlap with the source LRL, a decision
we discuss further in §5.4. Rijhwani et al. (2019)
used phoneme-based representations to help deal
with the fact that different languages use different
scripts, and we do so as well, using Epitran
(Mortensen et al., 2018) to convert strings to
International Phonetic Alphabet (IPA) symbols. The
selection of the HRL and the representation for each
LRL are shown in Table 2. Epitran has relatively
wide and growing coverage (55 languages at the
time of this writing). Our method could also potentially
be used with other tools, such as the romanizer
uroman,10 which provides a less accurate phonetic
representation than Epitran but covers most languages
in the world. However, testing different romanizers
is somewhat orthogonal to the main claims of this
paper, and thus we have not explicitly performed
experiments on this.

Our HRL pool contains 38 languages, specifically
those that have more than 10k Wikipedia
pages and are supported by Epitran. We do

9We also tried smaller architectures with the embedding size
set to 64 and the number of feature maps set to 300. This
configuration yields worse performance than the larger
model.

10https://www.isi.edu/∼ulf/uroman.html.

LRL   HRL               Representation

ti    Amharic (am)      Phoneme
om    Indonesian (id)   Grapheme
rw    Tagalog (tl)      Phoneme
si    Hindi (hi)        Phoneme
mr    Hindi (hi)        Grapheme
lo    Thai (th)         Phoneme
te    Hindi (hi)        Phoneme

Table 2: The HRL for each LRL. For
phoneme representations, all input strings
in the LRL, HRL, and English are converted to
IPA. For grapheme representations, strings
preserve their original form.

not consider Swedish and Cebuano because most
Wikipedia pages in these two languages are bot-generated.11
We also remove all languages that
do not achieve a candidate recall of 75% on the
HRL development set, indicating that the
model may not be trained well.

5.3 Main Results

Starting from the PBEL model, we gradually replace
the baseline components with our proposed
improvements to reach our complete model. The
results are shown in the second block of Table 3.
To put the results in context, we also list the
Wikipedia size and the hyperlink count for every
language. The Wikipedia size corresponds to the
number of entities recorded in that Wikipedia, and
the hyperlink count roughly reflects the richness
of the content of each page.

Overall, the model with the three proposed improvements
yields significantly better performance
than the baseline, bringing a 7.4–33.3%
improvement in top-30 gold candidate recall on six
LRLs, with the exception of te; we discuss
the failure of te in §5.4.12 Next, we can see that
CHARAGRAM brings the first major improvement,
improving over both the BILSTM
and CHARCNN baselines. Even when trained with eHRL − eEN pairs,
CHARAGRAM generalizes better to the test data
(mLRL − eEN), where the patterns to be matched
are different from the training data. This result

11https://en.wikipedia.org/wiki/Lsjbot.
12Although not a direct target of our paper, we note
that the three methodological improvements, especially the
introduction of CHARAGRAM, also improve the baseline model
in HRL settings. We often observe more than a 20% gain in
top-30 gold candidate recall on the development set, which is
derived from the same HRL as the training set.


                        DARPA-LRL                      WIKI
Model                   ti     om     rw     si     lo     mr     te     avg

WIKIMENTION             21.9   45.3   59.6   66.6   –      –      –      –
TRANSLATION             13.4   20.9   25.3   21.0   –      –      –      –

ee + BILSTM = PBEL      54.1   18.1   57.5   34.5   21.0   53.5   40.7   40.7
ee + CHARCNN            53.8   13.0   55.9   30.8   18.0   47.7   24.6   34.8
ee + CHARAGRAM          70.6   20.4   60.2   17.5   40.1   63.4   23.8   43.2
ee + me + CHARAGRAM     74.4   41.3   64.6   50.7   54.4   72.8   34.3   56.6
+ aka = Ours            75.1   46.0   64.9   51.1   54.3   77.5   34.4   57.6

Wikipedia Size          168    775    2K     15K    3K     50K    70K    20K
Hyperlink Count         188    4K     7K     63K    11K    300K   610K   165K

Table 3: Top-30 gold candidate recall (%) of different models. First block: performance
of direct Wikipedia-based models that use LRL resources; second block: performance of
pivoting-based models that do not require any LRL resources. ee means using entity-entity
pairs as training data and me means using mention-entity pairs as training data.

suggests that, as we hypothesize, the model structure
of CHARAGRAM makes it better able to learn
string mappings in the face of relatively small and
noisy data. We note that we also tried many variations
of the two baseline models. For example, we
used the average of the hidden states instead of the last
hidden state of the BILSTM to represent a string, and
we replaced the sum-pooling layer with a max-pooling
layer in CHARCNN. These variations yield
comparable or worse recall compared with the
current baselines.

Moreover, introducing mHRL − eEN pairs brings
further improvement on all seven languages. This
is perhaps not surprising; these data provide
explicit supervision that matches the actual task
of mention-entity matching that we face
at test time.

The influence of entity aliases varies from
language to language. Although they offer
significant gains on om and mr, they do not change
the other languages much. We suspect this is because
of the diverse properties of the languages used in
our datasets. For example, Marathi speakers
may also speak English frequently
and be familiar with English entity names, English
being a national language of India; this
may lead them to follow conventions similar to
the English aliases available in Wikidata.
Speakers of other languages might either not use
as many aliases, or their aliases may not match
well with those included in Wikidata.

Furthermore, we quantify how our proposed methods
reduce the failures observed in the baseline system.


Cifra 4: The distribution of mention types and the
performance of our proposed model (right bars),
compared with the baseline (left bars).

We use the 400 samples of §3 and compare the
error distribution with the original one in Figure 4.
From the results, we can see that our model
eliminates a large number of the errors by ranking
the correct entities the highest. It significantly
reduces DIRECT and ALIAS errors, which
demonstrates the effectiveness of our proposed
methods. As a side benefit, a number of the TRANS
errors are also resolved. Moreover, even when the
proposed model fails to rank the correct entity the
highest, it increases the number of correct
entities in the top-30 candidate lists, providing
a downstream disambiguation model with larger
improvement headroom. A few concrete examples
are shown in Table 4.

Error Type   Mention   IPA   Ours                         PBEL

ALIAS        —         —     Beaver Creek Resort          Beaver Creek State Forest (New York)
ALIAS        —         —     Gajanan Digambar Madgulkar   Ghada Amer
DIRECT       —         —     Hermann Staudinger           Herman Heuser
DIRECT       —         —     Muscoline                    Benito Mussolini
TRANS        —         —     Khmer Empire                 Khmer Issarak
TRANS        —         —     European Union               Yuri Petunin

Table 4: Successful cases, where the top-1 candidate entity retrieved by our model improves over that
of the baseline model.

Up until this point, we have been comparing
models that are purely zero-shot; they need no
training data in the source LRL. However, even
for low-resource languages there is often some
Wikipedia data that can be used to create models.
Using these data, we additionally compare our
model with the two Wikipedia-based models that
are not zero-shot (§2.2) on the four DARPA-LRL
datasets, in the first block of Table 3.13 Our
model consistently beats TRANSLATION on all four
datasets without relying on any LRL resources.
Moreover, it outperforms WIKIMENTION by a large
margin on the three datasets with relatively small
Wikipedias, evidencing the advantage of zero-shot
learning in resource-scarce settings. On si,
with over 15K Wikipedia pages, our model lags
behind the resource-heavy WIKIMENTION model
by about 15% in gold candidate recall. This is
perhaps expected, as our model does not rely on
any LRL resources, and it is possible that
explicitly training our model with these resources
could further improve its accuracy. Moreover,
we observe that our model can serve as a complement
to WIKIMENTION and bring further gains in
gold candidate recall. We discuss this in detail in
§5.6.

5.4 Pivoting Language Selection

Choosing a closely related HRL and directly
applying the model trained on that HRL to the
LRL has been a popular transfer learning paradigm
in low-resource settings (Täckström et al., 2012;
Zhang et al., 2016; Cotterell and Heigold, 2017;
Rijhwani et al., 2019; Lin et al., 2019; Rahimi
et al., 2019). Related languages are often chosen

13For the three WIKI datasets, the way we create the datasets
is exactly the same as the way we generate the mHRL − eEN lookup
tables, and thus WIKIMENTION would achieve 100% recall. We
skip this unfair comparison on these datasets.

LRL   Linguistic         n-gram Overlap     δ

ti    ˆam, 63.9 (60.8)   am, 74.2 (70.9)    10.3
om    ˆso, 28.0 (63.7)   ˆid, 40.9 (75.8)   12.9
rw    ˆrn, 46.4 (62.9)   tl, 64.6 (79.0)    18.2
si    hi, 50.4 (63.1)    hi, 50.4 (63.1)    0
lo    th, 51.4 (78.8)    th, 51.4 (78.8)    0
mr    ˆhi, 72.8 (83.3)   ˆhi, 72.8 (83.3)   0
te    vs, 12.6 (32.3)    hi, 32.6 (45.1)    20.0

Table 5: The pivoting language and its performance (with
its n-gram overlap % with the LRL in parentheses) selected
by different criteria. The δ column shows the top-30
candidate recall improvement (%) from using the n-gram
overlap criterion. Languages with a hat use grapheme representations;
the remaining ones use phoneme
representations.

heuristically based on linguistic intuition, although
some recent work has examined
training models to select transfer languages automatically
(Lin et al., 2019; Rahimi et al., 2019). In our
case, we would like to choose both a pivoting
language and a string representation (phonemes
or graphemes), which doubles the search space and
increases the search difficulty.

We devise a simple yet strong heuristic for picking
HRLs for transfer: picking the language that
shares the largest number of character n-grams
with the LRL. This is an automatic process that
does not need any domain or linguistic knowledge.
Table 5 shows the performance gap between
this criterion and manual selection based on linguistic
features, which has been used in previous
work on XEL (Rijhwani et al., 2019). Notably,
to eliminate the variance caused by the different
numbers of inter-language links possessed by different
HRLs, we compare the similarity between
mLRL and eEN directly, without the comparison
between mLRL and eHRL; more specifically, we
replace Equation (1) with score(mLRL, eEN) =
sim(mLRL, eEN).

HRL   EN n-gram   5 Nearest Neighbors

am    ma          —
hi    bi          —
th    bi          —
so    Uni         maca, amac, Jaam, macad, Jaam
so    bi          bi, mbi, arbee, inho, biya

Table 6: Randomly sampled English n-grams and their
five nearest neighbors in n-gram embedding space.

It is clear that selecting proper pivoting languages
and string representations is important;
failing to do so can cause performance degradation
of as much as 20%. However, while our heuristic
selection method is empirically better than manual
selection with linguistic features, it is notable
that the pivoting languages and representations
selected in this way do not necessarily yield
the best performance. We observe that choosing
a pivoting language with slightly less n-gram
overlap yields better performance for some LRLs.
For example, while om has about 43% character
n-gram overlap with am, using the model trained
on am yields a gold candidate recall of 45.0%
(compared to 40.9% with ˆid). This indicates that
accuracy could be further improved with more
sophisticated pivoting-language selection criteria.
Regarding the importance of n-gram sharing,
we suspect the relatively low recall of te
compared to the baseline model results from a
lack of shared character n-grams with its pivot
language hi. Whereas most other language pairs
have over 60% character n-gram overlap, te and
hi have only 45.1%, meaning vm encodes
less than half of the n-grams the mention contains. By contrast,
the character-level embeddings used by the BI-LSTM are
less sparse than higher-order n-grams, and therefore
the BI-LSTM suffers less information loss.

5.5 Properties of Learned n-grams

As discussed in the previous sections, the objective
of CHARAGRAM is to learn n-gram mappings
between the HRL and English. To more concretely
understand our model's behavior, we randomly
sample a few English n-gram embeddings and
retrieve their five nearest neighbors from the HRL
side. Table 6 lists these most similar n-grams.

CHARAGRAM is able to correctly associate n-grams
that have close pronunciations in different languages.
Because the pronunciation of
the same syllable can vary in the context of
different words, n-grams with small variations in
vowels can still be reasonable approximations;
for example, ‘‘li’’ can be pronounced as both
‘‘li’’ and ‘‘le’’ in different words. It is also worth
mentioning that CHARAGRAM is able
to correctly recognize some mappings of non-transliterated
words. For example, ‘‘Jaamacadda’’
in so is the parallel of ‘‘University’’ in English,
and the model was able to correctly align n-grams
corresponding to these words. This result demonstrates
one way in which CHARAGRAM alleviates the
TRANS errors that the BI-LSTM suffers from.

5.6 Improving End-to-end XEL Systems

To investigate how our candidate generation
model influences the end-to-end XEL system,
we use its candidate lists in the disambiguation
model BURN proposed by Zhou et al. (2019). BURN
creates a fully connected graph for each document
and performs joint inference over all mentions in
the document. To the best of our knowledge, it
is currently the disambiguation model that
has demonstrated the strongest empirical results
for XEL without any targeted LRL resources.
Therefore, we believe it is the most reasonable
choice in our low-resource scenario. For more details,


we encourage readers to refer to the original
paper.14

To make the best use of scarce but existing
resources, we follow Zhou et al. (2019) and
concatenate candidate lists generated by WIKIMENTION
with the candidate lists of both the baseline and our
method. The score of each candidate entity is
calculated in the following way:

$$ \mathrm{score}_{\mathrm{merge}}(e_{\mathrm{EN}}) = \alpha \times \mathrm{score}_{\mathrm{wm}}(e_{\mathrm{EN}}) + (1 - \alpha) \times \mathrm{score}'_{\mathrm{ca}}(e_{\mathrm{EN}}) $$
$$ \mathrm{score}'_{\mathrm{ca}}(e_{\mathrm{EN}}) = \mathrm{softmax}\big(\beta \times \mathrm{score}_{\mathrm{ca}}(e_{\mathrm{EN}})\big) $$

where score_wm is the score from WIKIMENTION,
score_ca is the original score from CHARAGRAM,
and score'_ca is the scaled score over the top-30
candidate list. We omit mLRL in all score functions
for simplicity. In our experiments, α is set to 0.6
and β is set to 100.
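A sketch of the merging step over one mention's candidate list, with α = 0.6 and β = 100 as above (the dictionary-based interface is our own, and entities absent from the WIKIMENTION list are given a score of 0 here, an assumption of this sketch):

```python
import numpy as np

def merge_scores(wm_scores, ca_scores, alpha=0.6, beta=100.0):
    """wm_scores, ca_scores: entity -> score from WIKIMENTION / CHARAGRAM.
    CHARAGRAM scores are softmax-scaled over the top-30 list, then mixed."""
    ents = list(ca_scores)
    z = np.exp(beta * np.array([ca_scores[e] for e in ents]))
    scaled = z / z.sum()                 # softmax over the candidate list
    return {e: alpha * wm_scores.get(e, 0.0) + (1 - alpha) * float(s)
            for e, s in zip(ents, scaled)}
```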

Table 7 lists the end-to-end XEL results.
Compared with the baseline model, our model
recovers more candidate entities missed by
WIKIMENTION and significantly benefits the downstream
disambiguation model, as well as the
end-to-end system. Even though incorporating
WIKIMENTION narrows the gap in gold candidate
recall (compared to Table 3), our model still beats
the baseline model by a large margin. Whereas the
baseline candidate generation model only reaches
a recall in the range of 60% on average, ours
yields a recall in the range of 70%, closer to the
high-resource counterparts, which are often in the
range of 80%. As a result, the end-to-end XEL
in-KB accuracy increases over all four languages,
with gains from 1.3% to 16.7%. This is significant
for extremely low-resource languages like ti,
indicating the potential of our model in truly
resource-scarce settings.

6 Related Work

Candidate generation for entity linking: En
most work, candidate generation for monolingual
entity linking relies on string matching and
Wikipedia anchor text lookup (Shen et al., 2015).
For cross-lingual entity linking, inter-language

14It is notable that we assume that the XEL system has
access to oracle NER outputs. In reality, the F1 scores of
low-resource NER are often in the range of 70%. We leave
the evaluation and possible improvement with non-perfect
NER systems as future work.

      ee + BILSTM    Ours           δ

ti    50.8 (55.4)    67.5 (75.8)    16.7 (20.4)
om    53.2 (61.3)    59.2 (67.9)    6.0 (6.6)
rw    61.5 (67.5)    68.9 (73.9)    7.4 (6.4)
si    70.9 (76.1)    72.2 (78.0)    1.3 (1.9)

avg   59.1 (65.1)    67.0 (73.9)    7.9 (8.8)

Table 7: In-KB accuracy (with top-30 gold candidate
recall of the merged candidate lists in
brackets; both in %) of the
end-to-end XEL system with different candidate
generation models. δ shows the in-KB accuracy
degradation (%) from using the baseline candidate generation
model.15

links from Wikipedia and bilingual lexicons
are used to translate the given entity mentions
into the language of the KB (often English)
in order to generate candidates (Tsai and Roth,
2016; Pan et al., 2017; Upadhyay et al., 2018a).
More recently, Rijhwani et al. (2019) use orthographic
and phonological similarity to high-resource
languages to generate candidates for
low-resource test languages. For the related task
of clustering entities, Blissett and Ji (2019) use
RNNs to measure the orthographic similarity of
entity mentions.

to our current

Transliteration: There has also been work in
transliterating named entities from one language
a otro (Knight and Graehl, 1998; Le et al.,
2004). Although similar
tarea
of selecting candidates from an English KB,
transliteration poses different challenges as it
involves generating the English entity name
sí mismo. Upadhyay et al.. (2018b) use a sequence-
to-sequence model and a bootstrapping method
to transliterate low-resource entity mentions using
extremely limited training data. Tsai and Roth
(2018) combine the standard translation method
for XEL candidate generation with a transliteration
score to improve XEL candidate recall on several
idiomas.

Bilingual lexicon induction: Another related
task is bilingual lexicon induction, where a
mapping between words in two languages is
predicted by a learned model (Haghighi et al.,
2008). Although such a mapping can be used to
15These results are not comparable to Rijhwani et al. (2019),
as we only consider the subset of mentions whose linked entity
exists in Wikipedia.


translate entities from the source test language
to English for XEL candidate generation, most
existing lexicon induction methods assume the
availability of a large amount of monolingual data
in both the source and target languages (Conneau
et al., 2017; Chen and Cardie, 2018; Artetxe
et al., 2018). Although such data is readily available
in English, this is unrealistic for many low-resource
languages, diminishing the utility of such methods
for the low-resource XEL task.

7 Conclusion

In this work, we perform a systematic analysis to
study and address the limitations of a previous candidate
generation model in low-resource settings.
We propose three methodological improvements
to resolve the two main problems of the baseline
model, namely the mismatch between mentions and
entities and sub-optimal string modeling. For the
first problem, we introduce mention-entity pairs
into the training process to provide supervision,
and we additionally collect entity aliases from English
Wikidata to further bridge this gap. For the
second problem, we replace the LSTM with the
more direct CHARAGRAM model. These methods
form our proposed candidate generation model.
We experiment with seven realistic datasets in
LRLs; our model yields an average gain of 16.9%
in top-30 gold candidate recall. We also evaluate
the influence of our candidate generation model
in the context of end-to-end low-resource XEL, where it
brings an average gain of 7.9% across four LRLs.

An immediate future focus is finding a way to
properly combine multiple models trained on different
HRLs so as to obtain better character
n-gram coverage and thus improve model performance
on different LRLs. Another interesting
avenue is to investigate how to efficiently compare
mentions against a large number of entities
(e.g., 2M in Wikipedia) in high-dimensional
space. Currently, our model calculates the cosine
similarity between a mention and every entity
in the KB, which takes a few minutes for each
test set. However, there is much existing work
(Rajaraman and Ullman, 2011; Johnson et al.,
2019) on efficient similarity search in high-dimensional
space for billion-scale datasets. It is
likely that combining these algorithms with
our retrieval method would allow it to scale well
and reduce the computation time to a few seconds.
Moreover, other interesting future directions are
examining how to balance the trade-off between
gold candidate recall and disambiguation
difficulty, and how to apply our model to settings
where the target language is not English.

Acknowledgments

We would like to thank Radu Florian and the anonymous
reviewers for their useful feedback. This
material is based on work supported in part by
the Defense Advanced Research Projects Agency
Information Innovation Office (I2O) Low Resource
Languages for Emergent Incidents (LORELEI)
program under contract no. HR0011-15-C0114.
The views and conclusions contained in this document
are those of the authors and should not be
interpreted as representing the official policies,
either expressed or implied, of the U.S. Government.
The U.S. Government is authorized to reproduce
and distribute reprints for Government purposes
notwithstanding any copyright notation hereon.
Shruti Rijhwani is supported by a Bloomberg
Data Science Ph.D. Fellowship.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre.
2018. A robust self-learning method for fully
unsupervised cross-lingual mappings of word
embeddings. In Proceedings of the 56th Annual
Meeting of the Association for Computational
Linguistics, pages 789–798.

Kevin Blissett and Heng Ji. 2019. Multilingual
NIL entity clustering for low-resource languages.
In Proceedings of the Second Workshop on
Computational Models of Reference, Anaphora
and Coreference, pages 20–25.

Razvan Bunescu and Marius Paşca. 2006. Using
encyclopedic knowledge for named entity
disambiguation. In 11th Conference of the European
Chapter of the Association for Computational
Linguistics.

Xilun Chen and Claire Cardie. 2018. Unsupervised
multilingual word embeddings. In Proceedings
of the 2018 Conference on Empirical Methods
in Natural Language Processing, pages 261–270.

Alexis Conneau, Guillaume Lample, Marc'Aurelio
Ranzato, Ludovic Denoyer, and Hervé Jégou.

2017. Word translation without parallel data.
Conferencia Internacional sobre Aprendizaje Repre-
sentaciones.

Ryan Cotterell and Georg Heigold. 2017. Cross-
lingual character-level neural morphological
tagging. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language
Processing, pages 748–759. Association for
Computational Linguistics.

Silviu Cucerzan. 2007. Large-scale named entity
disambiguation based on Wikipedia data. In
Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language
Processing and Computational Natural Language
Learning.

Andrew M. Dai and Quoc V. Le. 2015. Semi-
supervised sequence learning. In Advances in
Neural Information Processing Systems,
pages 3079–3087.

Mark Dredze, Paul McNamee, Delip Rao, Adam
Gerber, and Tim Finin. 2010. Entity
disambiguation for knowledge base population.
In Proceedings of the 23rd International
Conference on Computational Linguistics,
pages 277–285.

Octavian-Eugen Ganea and Thomas Hofmann.
2017. Deep joint entity disambiguation with
local neural attention. In Proceedings of the
2017 Conference on Empirical Methods in
Natural Language Processing, pages 2619–2629.

Amir Globerson, Nevena Lazic, Soumen
Chakrabarti, Amarnag Subramanya, Michael
Ringgaard, and Fernando Pereira. 2016. Collective
entity resolution with multi-focal attention. In
Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics,
pages 621–631.

Aria Haghighi, Percy Liang, Taylor Berg-
Kirkpatrick, and Dan Klein. 2008. Learning
bilingual lexicons from monolingual corpora.
In Proceedings of the 46th Annual Meeting of
the Association for Computational Linguistics,
pages 771–779.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria
Bordino, Hagen Fürstenau, Manfred Pinkal,
Marc Spaniol, Bilyana Taneva, Stefan Thater,
and Gerhard Weikum. 2011. Robust
disambiguation of named entities in text. In
Proceedings of the Conference on Empirical
Methods in Natural Language Processing,
pages 782–792. Association for Computational
Linguistics.

Heng Ji, Joel Nothman, Ben Hachey, and Radu
Florian. 2015. Overview of TAC-KBP 2015
tri-lingual entity discovery and linking. In Text
Analysis Conference.

Jeff Johnson, Matthijs Douze, and Hervé Jégou.
2019. Billion-scale similarity search with GPUs.
IEEE Transactions on Big Data.

Kevin Knight and Jonathan Graehl. 1998. Machine
transliteration. Computational Linguistics,
24:599–612.

Haizhou Li, Min Zhang, and Jian Su. 2004. A
joint source-channel model for machine
transliteration. In Proceedings of the 42nd
Annual Meeting of the Association for
Computational Linguistics, pages 159–166.

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui
Li, Yuyan Zhang, Mengzhou Xia, Shruti
Rijhwani, Junxian He, Zhisong Zhang, Xuezhe
Ma, Antonios Anastasopoulos, Patrick Littell,
and Graham Neubig. 2019. Choosing transfer
languages for cross-lingual learning. In The
57th Annual Meeting of the Association for
Computational Linguistics.

Paul McNamee, James Mayfield, Dawn Lawrie,
Douglas Oard, and David Doermann. 2011.
Cross-language entity linking. In Proceedings
of the 5th International Joint Conference on
Natural Language Processing, pages 255–263.

Bonan Min, Yee Seng Chan, Haoling Qiu, and
Joshua Fasching. 2019. Towards machine
reading for interventions from humanitarian-
assistance program literature. In Proceedings
of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing, pages 6443–6447.

David R. Mortensen, Siddharth Dalmia, and
Patrick Littell. 2018. Epitran: Precision G2P
for many languages. In Proceedings of the
Eleventh International Conference on Language
Resources and Evaluation.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel
Nothman, Kevin Knight, and Heng Ji. 2017.
Cross-lingual name tagging and linking for 282
languages. In Proceedings of the 55th Annual
Meeting of the Association for Computational
Linguistics, pages 1946–1958.

Priya Radhakrishnan, Partha Talukdar, and
Vasudeva Varma. 2018. ELDEN: Improved entity
linking using densified knowledge graphs. In
Proceedings of the 2018 Conference of the
North American Chapter of the Association
for Computational Linguistics: Human
Language Technologies, Volume 1 (Long
Papers), pages 1844–1853.

Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019.
Massively multilingual transfer for NER. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 151–164.

Anand Rajaraman and Jeffrey David Ullman.
2011. Mining of Massive Datasets. Cambridge
University Press.

Shruti Rijhwani, Jiateng Xie, Graham Neubig,
and Jaime Carbonell. 2019. Zero-shot neural
transfer for cross-lingual entity linking. In
Thirty-Third AAAI Conference on Artificial
Intelligence (AAAI).

Wei Shen, Jianyong Wang, and Jiawei Han.
2015. Entity linking with a knowledge base:
Issues, techniques, and solutions. IEEE
Transactions on Knowledge and Data
Engineering, pages 443–460.

Avirup Sil and Radu Florian. 2016. One for all:
Towards language independent named entity
linking. In Proceedings of the 54th Annual
Meeting of the Association for Computational
Linguistics, pages 2255–2264.

Avirup Sil, Gourab Kundu, Radu Florian, and
Wael Hamza. 2018. Neural cross-lingual entity
linking. In Thirty-Second AAAI Conference on
Artificial Intelligence.

Valentin I. Spitkovsky and Angel X. Chang.
2012. A cross-lingual dictionary for English
Wikipedia concepts. In Proceedings of the
Eighth International Conference on Language
Resources and Evaluation, pages 3168–3175.

Stephanie Strassel and Jennifer Tracey. 2016.
LORELEI language packs: Data, tools, and
resources for technology development in low
resource languages. In Proceedings of the Tenth
International Conference on Language Resources
and Evaluation, pages 3273–3280.

Stephanie M. Strassel, Ann Bies, and Jennifer
Tracey. 2017. Situational awareness for low
resource languages: The LORELEI situation
frame annotation task. In SMERP@ECIR,
pages 32–41.

Oscar Täckström, Ryan McDonald, and Jakob
Uszkoreit. 2012. Cross-lingual word clusters
for direct transfer of linguistic structure. In
Proceedings of the 2012 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 477–487.

Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual
wikification using multilingual embeddings. In
Proceedings of the 2016 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 589–598.

Chen-Tse Tsai and Dan Roth. 2018. Aprendiendo
better name translation for cross-lingual wiki-
fication. In Thirty-Second AAAI Conference on
Artificial Intelligence.

Shyam Upadhyay, Nitish Gupta, and Dan Roth.
2018a. Joint multilingual supervision for cross-
lingual entity linking. In Proceedings of the
2018 Conference on Empirical Methods in
Natural Language Processing, pages 2486–2495.

Shyam Upadhyay, Jordan Kodner, and Dan
Roth. 2018b. Bootstrapping transliteration with
constrained discovery for low-resource
languages. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 501–511.

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in
Neural Information Processing Systems,
pages 5998–6008.

Amir Pouran Ben Veyseh. 2016. Cross-lingual
question answering using common semantic
space. In Proceedings of TextGraphs-10: the
Workshop on Graph-based Methods for Natural
Language Processing, pages 15–19.

John Wieting, Mohit Bansal, Kevin Gimpel, and
Karen Livescu. 2016a. Towards universal
paraphrastic sentence embeddings. In
Proceedings of the International Conference on
Learning Representations.

John Wieting, Mohit Bansal, Kevin Gimpel, and
Karen Livescu. 2016b. Charagram: Embedding
words and sentences via character n-grams. In
Proceedings of the 2016 Conference on
Empirical Methods in Natural Language
Processing, pages 1504–1515.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda,
and Yoshiyasu Takefuji. 2017. Learning
distributed representations of texts and entities
from knowledge base. Transactions of the
Association for Computational Linguistics,
5:397–411.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015.
Character-level convolutional networks for text
classification. In Advances in Neural
Information Processing Systems, pages 649–657.

Yuan Zhang, David Gaddy, Regina Barzilay, and
Tommi Jaakkola. 2016. Ten pairs to tag—
multilingual POS tagging via coarse mapping
between embeddings. In Proceedings of the
2016 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 1307–1317.

Shuyan Zhou, Shruti Rijhwani, and Graham
Neubig. 2019. Towards zero-resource
cross-lingual entity linking. In Workshop on
Deep Learning for Low-resource NLP.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
3
0
3
1
9
2
3
8
0
5

/

/
t

yo

a
C
_
a
_
0
0
3
0
3
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

124
Descargar PDF