Improving Candidate Generation
for Low-resource Cross-lingual Entity Linking

Shuyan Zhou, Shruti Rijhwani, John Wieting
Jaime Carbonell, Graham Neubig

Language Technologies Institute
Carnegie Mellon University
{shuyanzh,srijhwan,jwieting,jgc,gneubig}@cs.cmu.edu

Abstract

Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on resources from Wikipedia have proven successful in the realm of relatively high-resource languages, but these do not extend well to low-resource languages with few, if any, Wikipedia pages. Recently, transfer learning methods have been shown to reduce the demand for resources in the low-resource languages by utilizing resources in closely related languages, but the performance still lags far behind their high-resource counterparts. In this paper, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios. The methods are simple, but effective: We experiment with our approach on seven XEL datasets and find that they yield an average gain of 16.9% in top-30 gold candidate recall, compared with state-of-the-art baselines. Our improved model also yields an average gain of 7.9% in in-KB accuracy of end-to-end XEL.1

1 Introduction

Entity linking (EL; Bunescu and Pas¸ca, 2006; Cucerzan, 2007; Dredze et al., 2010; Hoffart et al., 2011) associates entity mentions in a document with their entries in a knowledge base (KB). In this work, we focus on cross-lingual entity linking (XEL; McNamee et al., 2011; Ji et al., 2015), where the documents are in a source language that differs from the KB language (the target). XEL is an important component task for information extraction in languages that do not have extensive KB resources, and can potentially benefit downstream applications such as building cross-lingual question answering systems (Veyseh, 2016) or supporting international humanitarian assistance efforts in areas that do not speak English (Strassel et al., 2017; Min et al., 2019). Following Sil et al. (2018) and Upadhyay et al. (2018a), we consider the target-language KB to be English Wikipedia.

1 Code and data will be released.

Given a document and named entity mentions identified by a Named Entity Recognition (NER) model, there are two primary steps in an XEL system: (1) candidate generation, in which a model retrieves a short list of plausible KB entities for each mention, and (2) disambiguation, in which a model selects the most likely KB entity from the candidate list. The quality of the candidate lists influences the performance of the end-to-end XEL system, as correct entities not included in these lists cannot be recovered by the disambiguation model.

In monolingual EL, candidate generation has often been considered trivial (Shen et al., 2015). Simple approaches using string similarity or Wikipedia anchor-text links produce mention-entity lookup tables with high candidate recall (e.g., in the 90% range), and thus most work focuses on methods for downstream entity disambiguation (Globerson et al., 2016; Yamada et al., 2017; Ganea and Hofmann, 2017; Sil et al., 2018; Radhakrishnan


et al., 2018). String similarity (e.g., edit distance) cannot easily extend to XEL because the surface forms of entities often differ significantly across the source and target languages, particularly when the languages are in different scripts. Wikipedia link methods can be extended to XEL by using inter-language links between the two languages to redirect entities to the English KB (Spitkovsky and Chang, 2012; Sil and Florian, 2016; Sil et al., 2018; Upadhyay et al., 2018a). This method works to some extent, but often underperforms on low-resource languages due to the lack of source-language Wikipedia resources.

Although scarce, there are some methods that propose to improve entity candidate generation by training translation models with low-resource language (LRL)-English entity gazetteers (Pan et al., 2017), or by learning neural string matching models based on an entity gazetteer in a related high-resource language (HRL), which are then applied to the LRL (Rijhwani et al., 2019) (more in §2). However, even with these relatively sophisticated methods, top-30 candidate recall still falls far behind that of their high-resource counterparts, lagging by as much as 70% absolute.

In this work, we perform a systematic study to understand and address the limitations of previous XEL candidate generation models. First, in §3 we examine the sources of error in the state-of-the-art candidate generation model of Rijhwani et al. (2019), and identify a number of potential reasons for failure. Specifically, we find that two common sources of error are (1) mismatch between the entity name in the KB and the entity mention in the text, and (2) failure of the string matching model itself. In Figure 1, we show an example of linking Marathi, a low-resource language spoken in Western India, to English, which we will use as a running example throughout the paper (although our method is broadly applicable, as noted in the experiments). In this case, errors of the first type are due to the fact that the English entity Cobie Smulders is mentioned as (green, Smulders) or (yellow, Jacoba Francisca Maria Smulders) in the text. Errors of the second type are simple recognition errors, such as where the mention (blue, Cobie Smulders) is recognized as the English entity Cobie Sikkens. We proceed to


Figure 1: The candidate generation process for various
mentions corresponding to the gold entity ‘‘Cobie
Smulders’’. Strings on the left are mentions in the
document, and the pronunciation in IPA of each string
is written below it. The candidate entities in the English
KB generated by the candidate generation model are
shown on the right.

propose methodological improvements that resolve these major issues.

The first set of improvements handles the mismatch between the unique entity name that appears in the English KB and the many different realizations of it in the source text. First, we note that the training data used in learning-based methods for XEL candidate generation (Pan et al., 2017; Rijhwani et al., 2019) is made of entity-entity pairs, which fail to capture this variation. We experiment with adding mention-entity pairs to the training data to provide explicit supervision, helping the model better capture the differences between mentions and entities (§4.1). Second, we note that many of the variations in the source language are actually similar to how the entity varies in English, and thus we can use English-language resources to capture this variation. To this effect, we collect entity aliases from English Wikidata2 and allow the model to also look up these aliases during the candidate generation process (§4.2).

The second contribution of this work is a better modeling strategy for the strings that represent mentions and entities (§4.3). We posit that part of the reason why the LSTM-based model of Rijhwani et al. (2019) fails to properly model all words in a string is that it is not the ideal architecture to learn from limited training data, and as a result, it erroneously learns that some words in the mention can be ignored. To solve this problem, we replace the LSTM with a more direct model based on the sum of character n-gram

2https://www.wikidata.org/wik.

embeddings (Wieting et al., 2016b), which we posit is more likely to generalize to this difficult learning setting.

We evaluate our proposed methods on four real-world XEL datasets provided by DARPA LORELEI (Strassel and Tracey, 2016), as well as three other datasets we create with Wikipedia anchor-text and inter-language links (§5). Although our methods are simple, they are highly effective—our proposed model leads to gains ranging from 7.4–33.3% in top-30 gold candidate recall compared with Rijhwani et al. (2019) on seven LRLs. Because our model provides downstream disambiguation models with much larger headroom for improvement, we find that simply changing the candidate generation process yields an average gain of 7.9% in end-to-end XEL in-KB accuracy in four LRLs, pushing low-resource XEL a step towards high-resource XEL performance.

2 Background

2.1 Problem Formulation

Given a set of mentions M = {m1, m2, . . . , mN} extracted from multiple documents in the source language, and an English KB KEN that contains millions of entities with unique names, the goal of a candidate generation model is to retrieve a list of possible candidate entities ei = {ei,1, ei,2, . . . , ei,n} from KEN for each mi ∈ M. In consideration of the computational cost of the more complicated downstream disambiguation model, n is often 30 or smaller (Sil et al., 2018; Upadhyay et al., 2018a). The performance of candidate generation is measured by the gold candidate recall, which is the proportion of retrieved candidate lists that contain the correct entity. It is critical that this number is high, as any time the correct entity is excluded, the disambiguation model will be unable to recover it. Formally, if we denote the correct entity of each mention m as ê, the gold candidate recall r is defined as:

r = (1/N) Σ_{i=1}^{N} δ(êi ∈ ei)

where δ(·) is the indicator function, which is 1 if true and 0 otherwise, and N is the total number of mentions among all documents. We follow Yamada et al. (2017) and Ganea and Hofmann (2017) and ignore mentions whose linked entity does not exist in the KB in this work.3
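To make the metric concrete, the following is a minimal sketch of gold candidate recall over pre-computed candidate lists (the function and argument names are illustrative, not from the paper's released code):

```python
from typing import Sequence

def gold_candidate_recall(candidate_lists: Sequence[Sequence[str]],
                          gold_entities: Sequence[str]) -> float:
    # r = (1/N) * sum_i delta(gold_i in candidates_i), over all mentions.
    assert len(candidate_lists) == len(gold_entities)
    hits = sum(gold in cands
               for cands, gold in zip(candidate_lists, gold_entities))
    return hits / len(gold_entities)
```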

We use ‘‘EN’’ to denote the target language English, ‘‘HRL’’ to denote any high-resource language, and ‘‘LRL’’ to denote any low-resource language. For example, KHRL is a KB in an HRL (e.g., Spanish Wikipedia), and eHRL is an entity in KHRL. Because our focus is on low-resource XEL, the source language is always an LRL. We also refer to the HRL as the ‘‘pivoting’’ language below.

2.2 Baseline Candidate Generation Models

In this section, we introduce two existing categories of techniques for candidate generation.

Direct Wikipedia-based Models  WIKIMENTION is a popular candidate generation model used by most state-of-the-art work in XEL (Sil and Florian, 2016; Sil et al., 2018; Upadhyay et al., 2018a). Specifically, this model first extracts a monolingual mLRL-eLRL map from anchor-text links. For example, if the mention (Smulders) is linked to the entity (Cobie Smulders) in some Marathi Wikipedia pages, the latter will be treated as a candidate entity of the former. These Marathi entities are then redirected to their English counterparts by Wikipedia LRL-English inter-language links. For example, the Marathi (Cobie Smulders) will be redirected to the English Cobie Smulders. However, the reliance on the coverage of the LRL Wikipedia strongly constrains this method in low-resource settings.
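As a rough illustration of this pipeline, the sketch below assumes the anchor-text map and the inter-language links have already been extracted from Wikipedia dumps into plain dictionaries; `anchor_map` and `interlang` are hypothetical names of our own:

```python
from typing import Dict, List

def wikimention_candidates(mention_lrl: str,
                           anchor_map: Dict[str, List[str]],
                           interlang: Dict[str, str],
                           n: int = 30) -> List[str]:
    # Look the mention up in a monolingual mention -> LRL-entity map
    # built from LRL Wikipedia anchor text, then redirect each LRL
    # entity to its English counterpart via inter-language links.
    candidates = []
    for entity_lrl in anchor_map.get(mention_lrl, []):
        entity_en = interlang.get(entity_lrl)
        if entity_en is not None and entity_en not in candidates:
            candidates.append(entity_en)
    return candidates[:n]
```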

TRANSLATION is another Wikipedia-based candidate generation model, proposed by Pan et al. (2017). Instead of building a monolingual map that requires accessing anchor-text links in an LRL Wikipedia, this model translates any mLRL to mEN word-by-word and retrieves candidate entities from an existing mEN − eEN map. The word-by-word translations are induced by LRL-English inter-language links. Even though TRANSLATION is less sensitive to the availability of resources (to some extent), its dependency on LRL-English inter-language links still limits its performance in low-resource settings.


3The predictions of these mentions will always be wrong.
This could be fixed by either designing mechanisms to predict
‘‘not linkable’’ or expanding the KB, which are beyond the
scope of this work.


Pivoting-based Entity Linking  Instead of relying on LRL resources, pivoting-based entity linking (PBEL; Rijhwani et al., 2019) learns to perform cross-lingual string matching based on an entity gazetteer between a related HRL and English. This model consists of two BI-LSTMs, namely, the HRL-BI-LSTM and the EN-BI-LSTM. The training data is a collection of entity pairs (eHRL − eEN). Each of the BI-LSTMs reads in an entity name eHRL (eEN) and encodes it to an embedding vHRL (vEN). The learning objective is to maximize the similarity between the two entities of each pair. The trained HRL model is used as-is to encode the LRL mentions to vLRL, relying on the similarity between the languages to achieve a reasonably accurate encoding. A vLRL is compared with every entity embedding in KEN, and the entities with the top-n highest similarity scores are retrieved as the candidate entities. To compensate for the accuracy degradation due to transfer, this work also considers the similarity between mLRL and eHRL, where eHRL is the counterpart of eEN in KHRL. Hence, the score between mLRL and an entity eEN is defined as:

score(mLRL, eEN) = max(sim(mLRL, eEN), sim(mLRL, eHRL))    (1)

where sim(x, y) = cosine(vx, vy). When eHRL does not exist, sim(mLRL, eHRL) is set to −∞.
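A minimal sketch of this scoring rule, assuming the mention and entity embeddings have already been produced by the trained encoders (the helper names are ours, not from the original implementation):

```python
from typing import Optional
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def pbel_score(v_mention: np.ndarray,
               v_entity_en: np.ndarray,
               v_entity_hrl: Optional[np.ndarray]) -> float:
    # Equation (1): similarity to the English entity name, or to its
    # HRL counterpart when one exists (treated as -inf otherwise).
    score = cosine(v_mention, v_entity_en)
    if v_entity_hrl is not None:
        score = max(score, cosine(v_mention, v_entity_hrl))
    return score
```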

PBEL removes the reliance on LRL resources, and currently represents the state of the art for candidate generation in low-resource XEL. However, as we analyze in detail in the following §3, it still faces a number of challenges.

3 Failures of Existing Models

In this section, we perform a systematic analysis of the failure cases of PBEL (§3.1), and specifically focus on two error types: mention-entity mismatch (§3.2) and string matching failures (§3.3).

3.1 Mention Types and Analysis

We apply a PBEL model trained with eHRL − eEN pairs to generate candidate entities for mentions extracted from LRL documents. For LRLs we use Tigrinya, Oromo, Marathi, and Lao, and for HRLs we use Amharic, Hindi, Hindi, and Thai, respectively. The details of the datasets are in §5. We randomly sample 100 system outputs from each LRL and manually annotate their mention type according to a typology created while performing the analysis. The mention types are as follows, where the comparison is between the mention in an LRL and the entity string in English:

DIRECT: The mention is a direct transliteration of the entity. For example, a mention of Cobie Smulders as (Cobie Smulders).

ALIAS: The mention is another full proper name that is different from the entity name in the English KB. For example, a mention of Cobie Smulders as (Jacoba Francisca Maria Smulders).

TRANS: The mention and the entity have word-by-word alignment; however, the mention contains regular words (e.g., university, union) that cannot be transliterated directly.

EXTRA SRC: There is at least one extra word in the mention that is not a proper noun (e.g., (Sir)); or there is at least one extra syllable in the mention, which is often due to the morphology of the source language.

EXTRA ENG: There is at least one extra word in the English entity that is not a proper noun.

BAD SPAN: The mention span is not an entity due to mis-annotation or non-standard anchor text in Wikipedia; the annotated linked entity is wrong; or the mention is in a language other than our testing language.

We consider three situations for each sample: (1) in top-1: the model ranks the correct entity the highest, the ideal case; (2) in top-2 to 30: the model ranks the correct entity in the top-2 to top-30, which is less ideal, but will still potentially allow a downstream disambiguation model to predict the correct entity; and (3) not in top-30: the model does not rank the entity in the top-30, which will certainly lead to an error.

Figure 2 shows the mention types of the 400 samples and PBEL performance within each of the mention types. In the following sections, we examine, in depth, two major causes of error: mention-entity mismatch (largely responsible for errors in the ALIAS, EXTRA SRC, and EXTRA ENG categories), and model failure (largely responsible for errors in DIRECT).


Figure 2: The distribution of mention types in 400 samples and the baseline model's performance with respect to each of the mention types.

3.2 Failures due to Mention-Entity Mismatch

As demonstrated in Figure 1, a single English entity can have different realizations in the source-language document. Consequently, many of these realizations will not match lexically or phonetically with the entity in the KB. This poses a serious problem for matching methods that rely on graphemic or phonemic similarity, such as PBEL.

One typical pattern in mention-entity variation is additional words, as noted in the EXTRA SRC and EXTRA ENG classes. We examine this more systematically across the whole corpus by comparing the number of words on each side, which gives a rough lower bound on the amount of this mismatch. The first row in Table 1 is the comparison between eHRL and eEN, which presumably have better word-by-word alignment (and were used in training of previous XEL methods). The second row displays the comparison between mHRL and eEN. It is obvious that entity-entity pairs have more consistent length in words, while this consistency is not preserved in mention-entity pair data. Thus, even if the previous PBEL model could easily learn exact string matches from the entity-entity training data, to successfully associate mention-entity pairs, the model would need to capture more complex patterns (e.g., ignoring some words).4

Lang            am     so     hi     th
|eHRL|=|eEN|   82.9   80.7   83.4   56.8
|mHRL|=|eEN|   71.1   58.0   56.8   55.8

Table 1: Proportion of entries where HRL strings have the same number of words as their English counterparts.

4 Low numbers for th are due to the lack of explicit word boundaries marked by spaces.

The diverse realizations of a single entity bring another, more serious, challenge to models that mainly learn string matches: in fact, a realization does not necessarily have significant overlap with the entity name in Wikipedia. Sometimes, the mention does not have any overlap with the entity name at all, as noted in the ALIAS class. This common pattern reflects the limitation of using eEN as the unique representation on the English side.

3.3 Failures in Direct Transliteration

Even in seemingly easy cases where the entity is a perfect transliteration of the mention (DIRECT), we found the LSTM to fail frequently in our low-data scenario. Among all DIRECT errors, we observed that the BILSTM often properly captures only the first word (or first few characters) and ignores the existence of the second and further-on words. For example, the model ranks Cobie Sikkens higher than Cobie Smulders for (Cobie Smulders).

To better understand this behavior, we manually annotated 100 training pairs in Hindi and measured how often the second or later words in eHRL do not match their counterparts in eEN phonologically.5 We find that whereas 93 examples share a phonologically similar first word, about 40 of them have second and further-on words that are not phonological matches: While most pairs have word-by-word mappings, their second or later words often match each other only semantically—that is, there are regular words (e.g., district, university) that have very different pronunciations across the HRL and English, and are therefore difficult to predict unless they are explicitly seen in the training data. The BILSTM, which is a flexible model with little inductive bias, seems to overfit and erroneously learn that latter words in the sentence do not need to be mapped directly. This is a straightforward explanation for why the model learns to ignore the second and further-on words.

In summary, the failures of the PBEL model can be mainly attributed to (1) lack of explicit supervision; (2) lack of external resources to assist in cases where the mention and entity name diverge significantly; and (3) the BILSTM's inability to properly match the whole string.

5 The phonological similarity of names across languages is vital to the success of cross-lingual mention-entity matching.



4 Improved Candidate Generation

Based on the results of this empirical study,
we propose three methods to resolve the main
problems inherent in the baseline PBEL model.

4.1 Eliminating Train-Test Discrepancy

The mention-entity discrepancy naturally leads to our first simple but effective improvement to the baseline model: we extend the original eHRL − eEN pairs with mHRL − eEN pairs. We first collect mHRL − eHRL pairs from anchor-text links in an HRL Wikipedia and then redirect these entities to their parallels in English Wikipedia. As a result, we get the desired mHRL − eEN pairs. For example, if (Smulders) is linked to (Cobie Smulders) in some Marathi Wikipedia pages, and the latter can be redirected to Cobie Smulders in English, then the mention (Smulders) and the entity Cobie Smulders form one mention-entity pair. Although this is perhaps obvious in hindsight, to our knowledge, all previous works that explicitly train XEL candidate retrieval models do so on eHRL − eEN pairs (Pan et al., 2017; Rijhwani et al., 2019), which are mostly word-by-word mappings.
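Under the assumption that the anchor-text links and inter-language links are already available as in-memory structures (real Wikipedia dumps require substantially more preprocessing, and all names here are hypothetical), the construction can be sketched as:

```python
from typing import Dict, Iterable, List, Tuple

def mention_entity_pairs(anchor_links: Iterable[Tuple[str, str]],
                         interlang: Dict[str, str]) -> List[Tuple[str, str]]:
    # Turn HRL anchor-text links (mention_hrl -> entity_hrl) into
    # m_HRL - e_EN training pairs via HRL-English inter-language links.
    pairs = []
    for mention_hrl, entity_hrl in anchor_links:
        entity_en = interlang.get(entity_hrl)  # redirect to English title
        if entity_en is not None:
            pairs.append((mention_hrl, entity_en))
    return pairs
```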

4.2 Utilizing English Entity Aliases

The training method introduced in the previous section will render the model more capable of dealing with minor differences between mentions and entities. However, it would still struggle to match strings with significant differences, such as the examples of ‘‘Cobie Smulders’’ and ‘‘Pope Paul V’’ shown in Section 3.2. To mitigate this, we propose using Wikidata, a crowd-edited knowledge base similar to Wikipedia, which provides an ‘‘also known as’’ section that lists common aliases of each entity.6 Our second method is based on the observation that Wikidata resources can serve as an off-the-shelf alias lookup table with better coverage than simply using the entity's canonical Wikipedia title. An example of how this lookup table can increase coverage is indicated in Figure 2. In our analysis, we found that more than 50% of the ALIAS mentions could be covered by this table. There is a map between

6 For example, https://www.wikidata.org/wiki/Q200566.

Wikipedia entities and Wikidata entities, so we can follow it from Wikipedia to Wikidata to retrieve these aliases.7

At test time, we treat the aliases of an entity equally with its main Wikipedia entity name, allowing the model to match the target mention to these aliases as well. Consequently, sim(mLRL, eEN) in Equation (1) is modified as:

sim(mLRL, eEN) = max_{ai∈A} sim(mLRL, ai)

where A is the combination of the entity's Wikipedia title and its entity aliases.8 Note that although one may consider using aliases in languages other than English, we found that they are very scarce, so we did not attempt to expand entity names on the HRL side.

7 Other resources such as bold terms, link anchors, disambiguation pages, and surnames of mentions could potentially increase the coverage of Wikidata.

8 Note that incorporating aliases results in a small amount of extra computation, multiplying the effective size of the KB by a, the average number of aliases per mention. However, in Wikidata, a = 1.2, so we believe this is a reasonable cost-benefit trade-off, given the gains afforded by incorporating these aliases for many languages.
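A small sketch of the modified similarity, assuming the embeddings of the entity's title and aliases (the set A) are precomputed:

```python
import numpy as np

def sim_with_aliases(v_mention: np.ndarray, alias_vectors) -> float:
    # A = {Wikipedia title} plus the Wikidata "also known as" aliases;
    # score the mention against every member and keep the maximum.
    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    return max(cosine(v_mention, v_a) for v_a in alias_vectors)
```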

4.3 More Explicit String Encoding

As mentioned previously, while BI-LSTMs have proven powerful for modeling sequential data in the literature, we argue that they are not an ideal string encoder for this setting. This is because our training data contain a nontrivial number of pairs with less predictable word mappings (e.g., translations). With such large freedom in the face of insufficient and noisy training data, this encoder seemingly overfits, resulting in poor generalization. Previous researchers (Dai and Le, 2015; Wieting et al., 2016a) have noticed similar problems when using LSTMs for representation learning.

As an alternative, we propose the use of the CHARAGRAM model (Wieting et al., 2016b) as the string encoder. This model scans the string with various window sizes and produces a bag of character n-grams. It then maps these n-grams to their corresponding embeddings through a lookup table. The final embedding of the string is the sum of all the n-gram embeddings followed by a nonlinear activation function. Figure 3 shows an illustration of the model.

Formally, we denote a string as a sequence of characters x = [x1, x2, . . . , xm] that includes space characters as well as special start and end symbols. We use x_i^j to denote the sub-sequence from position i to position j inclusive, i.e., x_i^j = [xi, xi+1, . . . , xj]. The embedding v of a string x is:

v = tanh(b + Σ_{i=1}^{m} Σ_{n∈N} 1(x_i^{i+n−1} ∈ V) W_{x_i^{i+n−1}})

where N is a set of predefined window sizes, b ∈ R^d, V is the set of all n-grams seen in the training data, W ∈ R^{|V|×d} is the embedding lookup table, and W_{x_i^{i+n−1}} ∈ R^d is the embedding of x_i^{i+n−1}. Here 1(·) is the indicator function; if an n-gram is not in V, we simply discard it.

Figure 3: The architecture of CHARAGRAM.
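The following is a minimal PyTorch sketch of such an encoder; details the text leaves open (the exact start/end symbols, the behavior when no n-gram is in the vocabulary) are our own assumptions:

```python
import torch
import torch.nn as nn

class Charagram(nn.Module):
    """Sketch of a CHARAGRAM-style encoder: a string is embedded as
    tanh(b + sum of the embeddings of its character n-grams)."""
    def __init__(self, vocab: dict, dim: int = 300, ns=(2, 3, 4, 5)):
        super().__init__()
        self.vocab, self.ns = vocab, ns              # n-gram -> row index
        self.emb = nn.Embedding(len(vocab), dim)     # the lookup table W
        self.bias = nn.Parameter(torch.zeros(dim))   # the bias b

    def forward(self, string: str) -> torch.Tensor:
        s = f"^{string}$"  # assumed start/end symbols (not specified)
        ids = [self.vocab[s[i:i + n]]
               for n in self.ns
               for i in range(len(s) - n + 1)
               if s[i:i + n] in self.vocab]  # unseen n-grams are discarded
        if not ids:
            return torch.tanh(self.bias)
        return torch.tanh(self.bias + self.emb(torch.tensor(ids)).sum(0))
```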

Compared with the BI-LSTM, the advantages of CHARAGRAM are four-fold. First, the complexity of memorizing short character strings in the model is reduced: CHARAGRAM learns multi-character subsequences by simply adding them to an embedding table, whereas the LSTM learns them in a multi-step recurrent process. Second, because of their relatively higher expressiveness, LSTMs overfit to the noisy and relatively small training data provided by Wikipedia bilingual entity maps, the likely reason for LSTMs considering only the start word in errors from the DIRECT category. In contrast, CHARAGRAM does not consider order information, giving it an explicit inductive bias that forces it to rely on character n-gram matching for all n-grams in the sequence. Third, CHARAGRAM's simple architecture eases the learning process. For example, LSTMs need O(m) steps to propagate gradients from start to finish (Vaswani et al., 2017), while CHARAGRAM requires only O(1) steps to do so. Finally, although not a performance-based advantage, the CHARAGRAM model is more interpretable, which makes our further analysis easier to perform (see Section 5).

We follow Wieting et al. (2016b) and Rijhwani et al. (2019) and use negative sampling with a max-margin loss to train the model:

L = Σ_{i=1}^{B} max(0, 1 − sim(m, eEN+) + sim(m, e_i^{EN−}))

where eEN+ is the linked entity of m, eEN− is a randomly sampled English entity, and B is the number of negative samples for each positive pair.
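A compact sketch of this objective, assuming the cosine similarities to the positive entity and to the B sampled negatives are computed elsewhere (e.g., with the encoder above):

```python
import torch

def max_margin_loss(sim_pos: torch.Tensor,
                    sim_neg: torch.Tensor) -> torch.Tensor:
    # sim_pos: (batch,) similarities to the linked entity e_EN+;
    # sim_neg: (batch, B) similarities to B sampled negatives e_EN-.
    # Hinge loss with margin 1, summed over all negatives.
    return torch.clamp(1.0 - sim_pos.unsqueeze(1) + sim_neg, min=0.0).sum()
```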


5 Experiments

5.1 Datasets

We evaluate our model on the following datasets, spanning seven low-resource languages.

DARPA-LRL: The data for the first four languages are news articles, blogs, and social media annotated with entity spans and links by LDC as part of the DARPA LORELEI program. The documents are in four low-resource languages: Tigrinya (ti; a Semitic language spoken in Eritrea and Ethiopia, written in Ethiopian script), Oromo (om; an Afroasiatic language spoken in the Horn of Africa, written in Roman script), Kinyarwanda (rw; a language of the Niger-Congo family spoken in Rwanda, written in Roman script), and Sinhala (si; an Indo-Aryan language spoken in Sri Lanka, written in its own script). These are naturally occurring real-world data annotated and linked to a KB, containing information about disasters and humanitarian crises. We use these as the ‘‘gold standard’’ datasets for our evaluation.

WIKI: One disadvantage of the DARPA-LRL dataset, however, is that it is not publicly distributed at the time of this writing. In order to allow for direct comparison with our method by researchers without access to the DARPA-LRL data, we additionally create three datasets from Wikipedia, as described in §4.1. Specifically, these include Marathi (mr; an Indo-Aryan language spoken in Western India, written in Devanagari script), Lao (lo; a Kra-Dai language written in Lao script), and Telugu (te; a Dravidian language spoken in southeastern India, written in Telugu script). As Wikipedia is created through crowdsourcing, the anchor-text links are similar to those appearing in realistic XEL datasets. It is notable that entity mentions in WIKI often closely match the Wikipedia entity titles, and thus this dataset is nominally easier than the DARPA-LRL dataset.

3


5.2 Training Details

In the CHARAGRAM model, we use character n-grams with n ∈ {2, 3, 4, 5} and an embedding size of 300. We train the model with stochastic gradient descent with batch size 64 and a learning rate of 0.1. For the BI-LSTM model, we follow Rijhwani et al. (2019) for hyperparameter selection.

We also compare our model with a character-based CNN with sum-pooling (CHARCNN; Zhang et al., 2015; Wieting et al., 2016b), where parameters are set to be roughly comparable in size to our CHARAGRAM model. The embedding size of each character is set to 1024; the kernel sizes are set to 2, 3, 4, and 5, each with 4800 feature maps. The output of the sum-pooling layer, with a dimension of 19,200 (4800×4), is fed to a fully connected layer, resulting in a vector of size 300. The dropout is set to 0.5.9

For each training language, we set aside a small subset of the training data (mHRL − eEN) as our development set. For all models, we stop training if the top-30 gold candidate recall on the development set does not increase for 50 epochs, and the maximum number of training epochs is set to 200.

We select the HRL that has the highest character n-gram overlap with the source LRL, a decision we discuss further in §5.4. Rijhwani et al. (2019) used phoneme-based representations to help deal with the fact that different languages use different scripts, and we do so as well, using Epitran (Mortensen et al., 2018) to convert strings to International Phonetic Alphabet (IPA) symbols. The selection of the HRL and the representation for each LRL is shown in Table 2. Epitran has relatively wide and growing coverage (55 languages at the time of this writing). Our method could also potentially be used with other tools such as the romanizer uroman,10 which provides a less accurate phonetic representation than Epitran but covers most languages in the world. However, testing different romanizers is somewhat orthogonal to the main claims of this paper, and thus we have not explicitly performed experiments on this.

Our HRL pool contains 38 languages, specifically those that have more than 10k Wikipedia pages and are supported by Epitran. We do not consider Swedish and Cebuano because most Wikipedia pages in these two languages are bot-generated.11 We also remove all languages that do not achieve a candidate recall of 75% on the development set for the HRL, indicating that the model may not be trained well.

9 We also try smaller architectures with the embedding size set to 64 and the number of feature maps set to 300. This configuration yields worse performance than the larger model.

10https://www.isi.edu/∼ulf/uroman.html.

LRL   HRL              Representation
ti    Amharic (am)     Phoneme
om    Indonesian (id)  Grapheme
rw    Tagalog (tl)     Phoneme
si    Hindi (hi)       Phoneme
mr    Hindi (hi)       Grapheme
lo    Thai (th)        Phoneme
te    Hindi (hi)       Phoneme

Table 2: The HRL for each LRL. For phoneme representations, all input strings in the LRL, HRL, and English are converted to IPA. For grapheme representations, strings preserve their original representation.


5.3 Main Results

Starting from the PBEL model, we gradually replace the baseline components with our proposed improvements to reach our complete model. The results are shown in the second section of Table 3. To put the results in context, we also list the Wikipedia size and the hyperlink count of every language. The Wikipedia size corresponds to the number of entities recorded in that Wikipedia, and the hyperlink count roughly reflects the richness of the content of each page.

Overall, the model with the three proposed improvements yields significantly better performance than the baseline. It brings a 7.4–33.3% improvement in top-30 gold candidate recall on six LRLs, with the exception of te; we discuss the failure on te in §5.4.12 Next, we can see that CHARAGRAM brings the first major improvement, improving over both the BILSTM and CHARCNN baselines. Even when trained with eHRL − eEN pairs, CHARAGRAM generalizes better to the test data (mLRL − eEN), where the patterns to be matched differ from those in the training data.

11https://en.wikipedia.org/wiki/Lsjbot.
12 Although not a direct target of our paper, we note that the three methodological improvements, particularly the introduction of CHARAGRAM, also improve the baseline model in HRL settings. We often observe more than a 20% gain in top-30 gold candidate recall on the development set, which is derived from the same HRL as the training set.


                            DARPA-LRL                   WIKI
Model                   ti     om     rw     si     lo     mr     te     avg

WIKIMENTION            21.9   45.3   59.6   66.6    -      -      -      -
TRANSLATION            13.4   20.9   25.3   21.0    -      -      -      -

ee + BILSTM = PBEL     54.1   18.1   57.5   34.5   21.0   53.5   40.7   40.7
ee + CHARCNN           53.8   13.0   55.9   30.8   18.0   47.7   24.6   34.8
ee + CHARAGRAM         70.6   20.4   60.2   17.5   40.1   63.4   23.8   43.2
ee + me + CHARAGRAM    74.4   41.3   64.6   50.7   54.4   72.8   34.3   56.6
+ aka = Ours           75.1   46.0   64.9   51.1   54.3   77.5   34.4   57.6

Wikipedia Size          168    775    2K    15K     3K    50K    70K    20K
Hyperlink Count         188    4K     7K    63K    11K   300K   610K   165K


Table 3: Top-30 gold candidate recall (%) of different models. First block: performance of direct Wikipedia-based models that use LRL resources; second block: performance of pivoting-based models that do not require any LRL resources. ee means using entity-entity pairs as training data and me means using mention-entity pairs as training data. Bold numbers in the original are the best performance for the corresponding language.

This result suggests that, as we hypothesize, the model structure of CHARAGRAM makes it better able to learn string mappings in the face of relatively small and noisy data. We note that we also try many variations of the two baseline models. For example, we use the average of the hidden states instead of the last hidden state of the BILSTM to represent a string, and we replace the sum-pooling layer with a max-pooling layer in CHARCNN. These variations yield comparable or worse recall compared with the current baselines.

Furthermore, introducing mHRL − eEN pairs brings further improvement on all seven languages. This is perhaps not surprising; these data provide explicit supervision that matches the actual task of mention-entity matching that we are faced with at test time.

The influence of entity aliases varies from language to language. Although they offer significant gains on om and mr, they do not largely change the other languages. We suspect this is because of the diverse properties of the languages used in our datasets. For example, Marathi speakers may also speak English frequently and be familiar with English entity names, due to English being a national language of India. This may lead them to follow conventions similar to the English aliases that are available in Wikidata. Speakers of other languages might either not use as many aliases, or their aliases may not match well with those included in Wikidata.

Moreover, we quantify how our proposed methods reduce the failures of the baseline system.


Figure 4: The distribution of mention types and the performance of our proposed model (right bars), compared with the baseline (left bars).

We use the 400 samples of §3 and compare the error distribution with the original one in Figure 4. From the results, we can see that our model eliminates a large number of the errors by ranking the correct entities the highest. It significantly reduces DIRECT and ALIAS errors, which demonstrates the effectiveness of our proposed methods. As a side benefit, a number of the TRANS errors are also resolved. In addition, when the proposed model fails to rank the correct entity the highest, it is still able to increase the number of correct entities in the top-30 candidate list, providing a downstream disambiguation model with larger improvement headroom. A few concrete examples are shown in Table 4.

Error Type    Ours                          PBEL

ALIAS         Beaver Creek Resort           Beaver Creek State Forest (New York)
              Gajanan Digambar Madgulkar    Ghada Amer
DIRECT        Hermann Staudinger            Herman Heuser
              Muscoline                     Benito Mussolini
TRANS         Khmer Empire                  Khmer Issarak
              European Union                Yuri Petunin

[The Mention and IPA columns, which contained source-script strings, did not survive extraction.]

桌子 4: Successful cases, where the top-1 candidate entity retrieved by our model improves over that
of the baseline model.

Up until this point, we have been comparing models that are purely zero-shot—they need no training data in the source LRL. However, even for low-resource languages there is often some Wikipedia data that can be used to create models. Using this data, we additionally compare our model with the two Wikipedia-based models that are not zero-shot (§2.2) on the four DARPA-LRL datasets in the first section of Table 3.13 Our model consistently beats TRANSLATION on all four datasets without relying on any LRL resources. Moreover, it outperforms WIKIMENTION by a large margin on the three datasets with relatively small Wikipedias, evidencing the advantage of zero-shot learning in resource-scarce settings. For si, with over 15K Wikipedia pages, our model lags behind the resource-heavy WIKIMENTION model by about 15% in gold candidate recall. This is perhaps expected, as our model does not rely on any LRL resources, and it is possible that explicitly training our model with these resources could further improve its accuracy. In addition, we observe that our model can serve as a complement to WIKIMENTION and bring further gains in gold candidate recall. We discuss this in detail in Section 5.6.

5.4 Pivoting Language Selection

Choosing a closely related HRL and directly applying the model trained on that HRL to the LRL has been a popular transfer learning paradigm in low-resource settings (T¨ackstr¨om et al., 2012; Zhang et al., 2016; Cotterell and Heigold, 2017; Rijhwani et al., 2019; Lin et al., 2019; Rahimi et al., 2019). Related languages are often chosen heuristically based on linguistic intuition, although some works have recently examined training models to select languages automatically (Lin et al., 2019; Rahimi et al., 2019). In our case, we would like to choose both a pivoting language and a string representation: phonemes or graphemes. This doubles the search space and increases the search difficulty.
We devise a simple yet strong heuristic for picking HRLs for transfer: we pick the language that shares the largest number of character n-grams with the LRL. This is an automatic process that does not need any domain or linguistic knowledge. Table 5 shows the performance gap between this criterion and manual selection with linguistic features, which has been used in previous work on XEL (Rijhwani et al., 2019). Notably, to eliminate the variance caused by the different numbers of inter-language links possessed by different HRLs, we compare the similarity between mLRL and eEN directly, without the comparison between mLRL and eHRL. More specifically, we replace Equation (1) with score(mLRL, eEN) = sim(mLRL, eEN).

13 For the 3 WIKI datasets, the way we create these datasets is exactly the same as the way we generate the mHRL − eEN lookup tables, and thus WIKIMENTION will achieve 100% recall. We skip this unfair comparison on these datasets.

LRL   Linguistics         n-gram Overlap      δ
ti    ˆam, 63.9 (60.8)    am, 74.2 (70.9)    10.3
om    ˆso, 28.0 (63.7)    ˆid, 40.9 (75.8)   12.9
rw    ˆrn, 46.4 (62.9)    tl, 64.6 (79.0)    18.2
si    hi, 50.4 (63.1)     hi, 50.4 (63.1)     0
lo    th, 51.4 (78.8)     th, 51.4 (78.8)     0
mr    ˆhi, 72.8 (83.3)    ˆhi, 72.8 (83.3)    0
te    ta, 12.6 (32.3)     hi, 32.6 (45.1)    20.0

Table 5: The pivoting language and its performance (with its n-gram overlap % with the LRL in brackets) selected by different criteria. The δ column shows the top-30 candidate recall improvement (%) from using n-gram overlap. Languages with a hat use grapheme representations, while the remaining ones use phoneme representations.
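The selection heuristic itself is easy to sketch; since the paper does not spell out the exact overlap statistic (which n values are pooled, which side normalizes the intersection), the set-based variant below is an assumption:

```python
def ngram_overlap(strings_a, strings_b, ns=(2, 3, 4, 5)) -> float:
    # Character n-gram overlap between two string collections (e.g.,
    # IPA- or grapheme-represented entity names in two languages);
    # returns |A intersect B| / |A| as a percentage.
    def ngrams(strings):
        grams = set()
        for s in strings:
            for n in ns:
                grams.update(s[i:i + n] for i in range(len(s) - n + 1))
        return grams
    a, b = ngrams(strings_a), ngrams(strings_b)
    return 100.0 * len(a & b) / max(len(a), 1)
```

The HRL maximizing this overlap with the LRL (under either representation) would then be picked as the pivot.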


HRL   EN    5 Nearest Neighbors
so    Uni   maca, amac, Jaam, macad, Jaam
so    bi    mbi, arbee, inho, biya

Table 6: Randomly sampled English n-grams and their five nearest neighbors in n-gram embedding space. [Rows for am, hi, and th contained non-Latin n-grams that are not recoverable from the extracted text.]


It is clear that selecting proper pivoting languages and string representations is important; failing to do so can cause performance degradation of as much as 20%. However, while our heuristic selection method is empirically better than manual selection with linguistic features, it is notable that the pivoting languages and representations selected in this way do not necessarily yield the best performance. We observe that choosing a pivoting language with slightly less n-gram overlap yields better performance for some LRLs. For example, while om has about 43% character n-gram overlap with am, using the model trained with am yields a gold candidate recall of 45.0% (compared with 40.9% with ˆid). This indicates that accuracy could be further improved with more sophisticated pivoting language selection criteria. Regarding the importance of n-gram sharing, we suspect the relatively low recall of te compared with the baseline model results from a lack of shared character n-grams with its pivot language hi. Whereas most other language pairs have over 60% character n-gram overlap, te and hi have only 45.1%, meaning vm encodes less than half of the n-grams it contains. In contrast, the character-level embeddings used by the BI-LSTM are less sparse than higher-order n-grams, and thus the BI-LSTM suffers less information loss.

5.5 Properties of Learned n-grams

As discussed in the previous sections, the objective of CHARAGRAM is to learn n-gram mappings between the HRL and English. To more concretely understand our model's behavior, we randomly sample a few English n-gram embeddings and retrieve their five nearest neighbors from the HRL side. Table 6 lists these most similar n-grams. CHARAGRAM is able to correctly associate n-grams that have close pronunciations in different languages. Because the pronunciation of the same syllable can vary in the context of different words, n-grams with small variances in vowels can still be reasonable approximations. For example, ‘‘li’’ can be pronounced as both ‘‘li’’ and ‘‘le’’ in different words. One thing that is worth mentioning is that CHARAGRAM is able to correctly recognize some mappings of non-transliterated words. For example, ‘‘Jaamacadda’’ in so is the parallel of ‘‘University’’ in English, and the model was able to correctly align the n-grams corresponding to these words. This result demonstrates one way in which CHARAGRAM alleviates the TRANS errors that the BI-LSTM suffers from.
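The probe behind Table 6 can be sketched as a brute-force nearest-neighbor search, assuming the learned n-gram embedding tables are exposed as plain dictionaries (the names here are ours):

```python
import numpy as np

def nearest_ngrams(en_ngram: str, en_emb: dict, hrl_emb: dict, k: int = 5):
    # Retrieve the k HRL n-grams whose embeddings have the highest
    # cosine similarity to a given English n-gram's embedding.
    q = en_emb[en_ngram]
    q = q / np.linalg.norm(q)
    sims = {g: float(q @ (v / np.linalg.norm(v))) for g, v in hrl_emb.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]
```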

5.6 Improving End-to-end XEL Systems

To investigate how our candidate generation model influences the end-to-end XEL system, we use its candidate lists in the disambiguation model BURN proposed by Zhou et al. (2019). BURN creates a fully connected graph for each document and performs joint inference over all mentions in the document. To the best of our knowledge, it is currently the disambiguation model that has demonstrated the strongest empirical results for XEL without any targeted LRL resources. Therefore, we believe it is the most reasonable choice in our low-resource scenario. For details, we encourage readers to refer to the original paper.14


To make the best use of scarce but existing resources, we follow Zhou et al. (2019) and concatenate the candidate lists generated by WIKIMENTION with the candidate lists of both the baseline and our method. The score of each candidate entity is calculated in the following way:

scoremerge(eEN) = α × scorewm(eEN) + (1 − α) × score′ca(eEN)
score′ca(eEN) = softmax(β × scoreca(eEN))

where scorewm is the score from WIKIMENTION, scoreca is the original score from CHARAGRAM, and score′ca is the scaled score over the top-30 candidate list. We omit mLRL in all score functions for simplicity. In our experiments, α is set to 0.6 and β is set to 100.
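A small sketch of this merging step, assuming the two score arrays are aligned over the same top-30 candidate list:

```python
import numpy as np

def merged_scores(scores_wm: np.ndarray, scores_ca: np.ndarray,
                  alpha: float = 0.6, beta: float = 100.0) -> np.ndarray:
    # Softmax-rescale the CHARAGRAM scores with temperature beta,
    # then interpolate with the WIKIMENTION scores using weight alpha.
    z = beta * scores_ca
    z = np.exp(z - z.max())   # numerically stable softmax
    rescaled = z / z.sum()
    return alpha * scores_wm + (1.0 - alpha) * rescaled
```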

Table 7 lists the end-to-end XEL results. Compared with the baseline model, our model recovers more candidate entities missed by WIKIMENTION and significantly benefits the downstream disambiguation model, as well as the end-to-end system. Even though incorporating WIKIMENTION narrows the gap in gold candidate recall (compared with Table 3), our model still beats the baseline model by a large margin. While the baseline candidate generation model only reaches recall in the range of 60% on average, ours yields recall in the range of 70%, closer to the high-resource counterparts, which are often in the range of 80%. As a result, the end-to-end XEL in-KB accuracy increases on all four languages, with gains from 1.3% to 16.7%. This is significant for extremely low-resource languages like ti, indicating the potential of our model in truly resource-scarce settings.

6 Related Work

Candidate generation for entity linking: In most work, candidate generation for monolingual entity linking relies on string matching and Wikipedia anchor-text lookup (Shen et al., 2015). For cross-lingual entity linking, inter-language links from Wikipedia and bilingual lexicons are used to translate the given entity mentions into the language of the KB (often English) in order to generate candidates (Tsai and Roth, 2016; Pan et al., 2017; Upadhyay et al., 2018a).
14 It is notable that we assume that the XEL system can access oracle NER outputs. In fact, the F1 scores of low-resource NER are often in the range of 70%. We leave the evaluation and possible improvement with imperfect NER systems as future work.

        ee + BILSTM    Ours           δ
ti      50.8 (55.4)    67.5 (75.8)    16.7 (20.4)
om      53.2 (61.3)    59.2 (67.9)     6.0 (6.6)
rw      61.5 (67.5)    68.9 (73.9)     7.4 (6.4)
si      70.9 (76.1)    72.2 (78.0)     1.3 (1.9)
avg     59.1 (65.1)    67.0 (73.9)     7.9 (8.8)

Table 7: In-KB accuracy (with the top-30 gold candidate recall of the merged candidate lists in brackets; both in %) of the end-to-end XEL system with different candidate generation models. δ shows the in-KB accuracy degradation (%) from using the baseline candidate generation model.15

Recently, Rijhwani et al. (2019) use orthographic and phonological similarity to high-resource languages to generate candidates for low-resource test languages. For the related task of clustering entities, Blissett and Ji (2019) use RNNs for measuring the orthographic similarity of entity mentions.

Transliteration: There has also been work on transliterating named entities from one language to another (Knight and Graehl, 1998; Li et al., 2004). Although similar to our current task of selecting candidates from an English KB, transliteration poses different challenges, as it involves generating the English entity name itself. Upadhyay et al. (2018b) use a sequence-to-sequence model and a bootstrapping method to transliterate low-resource entity mentions using extremely limited training data. Tsai and Roth (2018) combine the standard translation method for XEL candidate generation with a transliteration score to improve XEL candidate recall on several languages.

Bilingual lexicon induction: Another related task is bilingual lexicon induction, where a mapping between words in two languages is predicted by a learned model (Haghighi et al., 2008). Although such a mapping can be used to translate entities from the source test language to English for XEL candidate generation, most existing lexicon induction methods assume the availability of a large amount of monolingual data in both the source and target languages (Conneau et al., 2017; Chen and Cardie, 2018; Artetxe et al., 2018). Although this data is readily available in English, it is unrealistic for many low-resource languages, diminishing the utility of such methods for the low-resource XEL task.

15 These results are not comparable to Rijhwani et al. (2019), as we only consider the subset of mentions whose linked entity exists in Wikipedia.


7 Conclusion

In this work, we perform a systematic analysis to study and address the limitations of a previous candidate generation model in low-resource settings. We propose three methodological improvements to resolve two main problems of the baseline model, namely, the mismatch between mentions and entities and sub-optimal string modeling. For the first problem, we introduce mention-entity pairs into the training process to provide supervision. We additionally collect entity aliases from English Wikidata to further bridge this gap. To solve the second problem, we replace the LSTM with a more direct model, CHARAGRAM. These methods form our proposed candidate generation model. We experiment with seven realistic datasets in LRLs. Our model yields an average gain of 16.9% in top-30 gold candidate recall. We also evaluate the influence of our candidate generation model in the context of end-to-end low-resource XEL, where it brings an average gain of 7.9% in four LRLs.

An immediate future focus is finding a way to properly combine multiple models trained on different HRLs in order to obtain better character n-gram coverage and thus improve model performance on different LRLs. Another interesting avenue is to investigate how to efficiently compare mentions with a large number of entities (e.g., 2M in Wikipedia) in high-dimensional space. For now, our model calculates the cosine similarity between a mention and every entity in the KB, which takes a few minutes for each test set. However, there is much existing work (Rajaraman and Ullman, 2011; Johnson et al., 2019) on efficient similarity search in high-dimensional space for billion-scale datasets. It is likely that combining these algorithms with our retrieval method would allow it to scale well and reduce the computation time to a few seconds. In addition, other interesting future directions include examining how to balance the trade-off between gold candidate recall and disambiguation difficulty, and how to apply our model to settings where the target language is not English.

Acknowledgments

We would like to thank Radu Florian and the anonymous reviewers for their useful feedback. This material is based on work supported in part by the Defense Advanced Research Projects Agency Information Innovation Office (I2O) Low Resource Languages for Emergent Incidents (LORELEI) program under contract no. HR0011-15-C0114. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. Shruti Rijhwani is supported by a Bloomberg Data Science Ph.D. Fellowship.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 789–798.

Kevin Blissett and Heng Ji. 2019. Cross-lingual NIL entity clustering for low-resource languages. In Proceedings of the Second Workshop on Computational Models of Reference, Anaphora and Coreference, pages 20–25.

Razvan Bunescu and Marius Pas¸ca. 2006. Using encyclopedic knowledge for named entity disambiguation. In 11th Conference of the European Chapter of the Association for Computational Linguistics.

Xilun Chen and Claire Cardie. 2018. Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 261–270.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herv´e J´egou.


2017. Word translation without parallel data. In International Conference on Learning Representations.

Ryan Cotterell and Georg Heigold. 2017. Cross-lingual character-level neural morphological tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 748–759. Association for Computational Linguistics.

Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.

Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, and Tim Finin. 2010. Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 277–285.

Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep joint entity disambiguation with local neural attention. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2619–2629.

Amir Globerson, Nevena Lazic, Soumen Chakrabarti, Amarnag Subramanya, Michael Ringaard, and Fernando Pereira. 2016. Collective entity resolution with multi-focal attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 621–631.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, pages 771–779.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen F¨urstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics.

Heng Ji, Joel Nothman, Ben Hachey, and Radu Florian. 2015. Overview of TAC-KBP 2015 tri-lingual entity discovery and linking. In Text Analysis Conference.

Jeff Johnson, Matthijs Douze, and Herv´e J´egou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.

Kevin Knight and Jonathan Graehl. 1998. Machine transliteration. Computational Linguistics, 24:599–612.

Haizhou Li, Min Zhang, and Jian Su. 2004. A joint source-channel model for machine transliteration. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 159–166.

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In The 57th Annual Meeting of the Association for Computational Linguistics.

Paul McNamee, James Mayfield, Dawn Lawrie, Douglas Oard, and David Doermann. 2011. Cross-language entity linking. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 255–263.

Bonan Min, Yee Seng Chan, Haoling Qiu, and Joshua Fasching. 2019. Towards machine reading for interventions from humanitarian-assistance program literature. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 6443–6447.

David R. Mortensen, Siddharth Dalmia, and Patrick Littell. 2018. Epitran: Precision G2P for many languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation.


Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1946–1958.

Priya Radhakrishnan, Partha Talukdar, and Vasudeva Varma. 2018. ELDEN: Improved entity linking using densified knowledge graphs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1844–1853.

Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019. Massively multilingual transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 151–164.

Anand Rajaraman and Jeffrey David Ullman. 2011. Mining of Massive Datasets. Cambridge University Press.

Shruti Rijhwani, Jiateng Xie, Graham Neubig, and Jaime Carbonell. 2019. Zero-shot neural transfer for cross-lingual entity linking. In Thirty-Third AAAI Conference on Artificial Intelligence (AAAI).

Wei Shen, Jianyong Wang, and Jiawei Han. 2015. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, pages 443–460.

Avirup Sil and Radu Florian. 2016. One for all: Towards language independent named entity linking. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 2255–2264.

Avirup Sil, Gourab Kundu, Radu Florian, and Wael Hamza. 2018. Neural cross-lingual entity linking. In Thirty-Second AAAI Conference on Artificial Intelligence.

Valentin I. Spitkovsky and Angel X. Chang. 2012. A cross-lingual dictionary for English Wikipedia concepts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, pages 3168–3175.

Stephanie Strassel and Jennifer Tracey. 2016. LORELEI language packs: Data, tools, and resources for technology development in low resource languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, pages 3273–3280.

Stephanie M. Strassel, Ann Bies, and Jennifer Tracey. 2017. Situational awareness for low resource languages: the LORELEI situation frame annotation task. In SMERP@ECIR, pages 32–41.

Oscar T¨ackstr¨om, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 477–487.

Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 589–598.

Chen-Tse Tsai and Dan Roth. 2018. Learning better name translation for cross-lingual wikification. In Thirty-Second AAAI Conference on Artificial Intelligence.

Shyam Upadhyay, Nitish Gupta, and Dan Roth. 2018a. Joint multilingual supervision for cross-lingual entity linking. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2486–2495.

Shyam Upadhyay, Jordan Kodner, and Dan Roth. 2018b. Bootstrapping transliteration with constrained discovery for low-resource languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 501–511.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Amir Pouran Ben Veyseh. 2016. Cross-lingual question answering using common semantic

space. In Proceedings of TextGraphs-10: The Workshop on Graph-based Methods for Natural Language Processing, pages 15–19.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016a. Towards universal paraphrastic sentence embeddings. In Proceedings of the International Conference on Learning Representations.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016b. Charagram: Embedding words and sentences via character n-grams. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1504–1515.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2017. Learning distributed representations of texts and entities from knowledge base. Transactions of the Association for Computational Linguistics, 5:397–411.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.

Yuan Zhang, David Gaddy, Regina Barzilay, and Tommi Jaakkola. 2016. Ten pairs to tag—multilingual POS tagging via coarse mapping between embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1307–1317.

Shuyan Zhou, Shruti Rijhwani, and Graham Neubig. 2019. Towards zero-resource cross-lingual entity linking. In Workshop on Deep Learning for Low-resource NLP.
