MKQA: A Linguistically Diverse Benchmark for
Multilingual Open Domain Question Answering

Shayne Longpre
Apple Inc.
slongpre@mit.edu

Yi Lu
Apple Inc.
ylu7@apple.com

Joachim Daiber
Apple Inc.
jodaiber@apple.com

Abstract

Progress in cross-lingual modeling depends on challenging, realistic, and diverse evaluation sets. We introduce Multilingual Knowledge Questions and Answers (MKQA), an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). Answers are based on a heavily curated, language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering. We benchmark a variety of state-of-the-art methods and baselines for generative and extractive question answering, trained on Natural Questions, in zero shot and translation settings. Results indicate this dataset is challenging even in English, but especially in low-resource languages.1

1 Introduction

Training and evaluation data for question answering (QA) is severely lacking outside of high-resource languages like English. As unsupervised, transfer learning and zero/few-shot methods narrow the multilingual performance gap with English (Conneau et al., 2020; Lee and Lee, 2019; Cui et al., 2019a; Lewis et al., 2020), their real progress is hard to measure without challenging, realistic, and linguistically diverse evaluation sets. Existing multilingual QA datasets are realistic and challenging, but they lack linguistic diversity, comparable evaluation between languages, and are often limited to passages provided with the dataset (see Table 2).

We introduce Multilingual Knowledge Questions and Answers (MKQA) for evaluation of open-domain question answering. MKQA selects 10k realistic English queries from the Natural Questions dataset (NQ, Kwiatkowski et al., 2019) and human translates them into 25 additional languages and dialects. Accompanying these query translations, we replace NQ's passage-embedded answer spans with high-quality, language- and retrieval-independent answer annotations, linked directly against Wikidata entities and a limited set of well-defined value types (numbers, dates, strings, etc.).2

See one full example in Table 1. More flexible than existing multilingual datasets, MKQA's grading procedure ensures these labels are sufficient to evaluate any QA method, including knowledge graph and generative approaches. The objective of this evaluation set is to facilitate fair comparison between languages, without imposing assumptions on the underlying QA approach. We see MKQA as a useful tool enabling practitioners to benchmark a variety of multilingual open domain question answering methods against the widest range of available languages yet. Below, we discuss its central properties as an evaluation benchmark.

Realistic and Reliable Annotations Of crucial importance to any evaluation set is (a) how well it reflects realistic, real-world settings, and (b) the reliability of its annotations. To ensure the English queries, which form the basis of our dataset, are realistic, we use Natural Questions, formulated by real users, independent of passages or answers. To ensure these queries are realistic in other languages we employ expert bilingual translators, guided by strict localization criteria. We confirm that a large majority of these queries are geographically invariant, meaning that their answer is not culturally or geographically dependent (we found that less

than 4% of answers are rendered incorrect by geographical and cultural context; for more details see Section 4.2). To ensure annotation reliability, we enforce minimum inter-grader agreement, conduct quality checks, and obtain re-annotations from expert graders where necessary. Further, the Wikidata entity identifiers (QIDs) ground the answer annotations in structured data. This can be used for other knowledge graph-specific metrics, to retrieve other valid answer strings, and for trivial entity translation into hundreds of languages beyond the scope of MKQA.

1MKQA data and evaluation scripts are available at https://github.com/apple/ml-mkqa.

2Wikidata is a collaboratively edited open knowledge graph: https://www.wikidata.org/.

Transactions of the Association for Computational Linguistics, vol. 9, pp. 1389–1406, 2021. https://doi.org/10.1162/tacl_a_00433
Action Editor: Partha Talukdar. Submission batch: 3/2021; Revision batch: 6/2021; Published 12/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Table 1: Questions and answers in all supported languages for one instance in MKQA. The IETF BCP-47 language codes specify the language and locale. The Entity ID corresponds to Wikidata (see for instance https://www.wikidata.org/wiki/Q794).

Parallel Questions Our evaluation set is fully aligned, or ‘‘parallel’’, across all available languages, meaning the same examples exist in all languages. This is accomplished by a mixture of expert human translation and using multilingual data from Wikidata. This property enables direct comparison between all 26 languages for fully cross-lingual or zero-shot systems. While Clark et al. (2020) point out the natural query distribution varies by language and geography, we reserve our assessment to geographically invariant queries for the purpose of more fair comparison between methods.

Retrieval-Independent Annotations Existing training and evaluation sets are oriented to ‘‘extractive’’ QA, providing specific passages and passage-dependent answer annotations (Clark et al., 2020; Lewis et al., 2020; Artetxe et al., 2020b; Liu et al., 2019a). These types of annotations are of limited use with varying retrieval systems, knowledge graph approaches, or even generative approaches because the answers are tied to the particular phrasing of their passage. Translating annotations from English passages may also introduce ‘‘translationese artifacts’’ as the translation is implicitly influenced by the original English structure (Artetxe et al., 2020a). These artifacts render the task easier for methods relying on English supervision or machine translation techniques. As we shall discuss in Section 3, the MKQA collection procedure yields primarily entity and structured ‘‘atomic’’ answer types. We contend retrieval-independent (and particularly entity-oriented) annotations minimize the risk of translation artifacts, and remove limitations on the underlying QA approach.

Linguistic Diversity Lastly, MKQA has broad linguistic diversity, covering 26 languages and dialects from 14 language family branches. Languages from MKQA cover half of the world population's native languages, and more than 90% of the world population lives in a country where one of these languages is an official language (see Section 4.1 for more details). It is to our knowledge both the largest and most linguistically diverse open-domain QA evaluation set currently available (see Tables 2 and 3).


Multilingual QA                 Answer         Parallel    Language Fam.   Languages   Total
Evaluation Set                  Independence   Questions   Branches                    Examples

XQA (Liu et al., 2019a)              ✓             ✗            5              9         28k
MLQA (Lewis et al., 2020)            ✗             ✓            6              7         46k
XQuAD (Artetxe et al., 2020b)        ✗             ✓           11             11         13k
TyDi (Clark et al., 2020)            ✗             ✗           11             11        204k
Xor-QA (Asai et al., 2021)           ✗             ✗            7              7         40k
MKQA (This work)                     ✓             ✓           14             26        260k

Table 2: Comparison of multilingual QA evaluation sets. Answer independence indicates whether the gold answer is independent of a retrieved document, and parallel questions indicates whether examples are the same across languages.

MKQA makes two important contributions to

the field of multilingual question answering:

• Our answer collection procedure renders the evaluation set highly reliable, independent, and unbiased towards the QA technique used. This unique setup allows us to fairly compare the performance of techniques as distinct as knowledge graph-based, dense and sparse retrieval, and generative QA techniques on a large number of languages (see Section 5).

• Our dataset provides fully aligned examples
in the largest yet number of typologically di-
verse languages, enabling comparable eval-
uation across many languages.

We find MKQA is innately more challenging than Natural Questions, from which it was derived, due to the multi-stage re-annotation process. The best model obtains only 52.3% F1 in English, and only 5.7% above a naive baseline on the lowest resource language. Given these qualities, our dataset facilitates broad and reliable evaluation of multilingual, open-domain question answering.

2 Related Work

Cross-Lingual Modeling Recent work trains cross-lingual representations with unsupervised language modeling over many languages, including Multilingual BERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020), and Multilingual T5 (Xue et al., 2021). Transfer learning techniques are often applied to these cross-lingual representations to overcome the dearth of non-English data (Cui et al., 2019a; Hsu et al., 2019; Lee and Lee, 2019; Kumar et al., 2019). Recent investigations into cross-lingual modeling have revealed ‘‘translation artifacts’’ in datasets where machine translation systems are used, or human translation tasks are not carefully curated (Artetxe et al., 2020a; Wintner, 2016; Rabinovich and Wintner, 2015). ‘‘Translationese’’ results in hidden linguistic cues in translated text that render the task easier than a natural translation.

English QA Resources A majority of question answering research focuses on English, which offers an ample selection of evaluation datasets, including SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017), and Natural Questions (Kwiatkowski et al., 2019). Open Domain QA, pioneered by Green et al. (1986), is the task of answering open questions using external knowledge sources. A common approach is to combine retrieval and extractive techniques (Chen et al., 2016, 2017; Dhingra et al., 2017; Cui et al., 2017).

Monolingual QA Resources Non-English question answering resource options remain comparatively rare, with most options spanning only one other language, and rarely low-resource languages. DuReader (He et al., 2018), CMRC (Cui et al., 2019b), and DRCD (Shao et al., 2018) all offer high-quality Chinese QA datasets. Similarly, XCMRC (Liu et al., 2019b) and BiPar (Jing et al., 2019) present parallel, cross-lingual QA datasets between English and Chinese. Exploring slightly less resource-rich languages, numerous works have derived new datasets from SQuAD, employing varying degrees of human or semi-automatic translation techniques to non-English target languages: ARCD for Arabic (Mozannar et al., 2019), KorQuAD-1.0 for Korean (Lim et al., 2019), and MMQA for Hindi (Gupta et al., 2018).

Multilingual QA Resources Table 2 compares
the largest publicly available multilingual question
answering evaluation sets. The table highlights the

following properties of each dataset: whether available gold answers are independent of retrieved documents, whether examples are aligned across languages, and the number of languages and examples provided. MLQA (Lewis et al., 2020) and XQuAD (Artetxe et al., 2020b) are examples of SQuAD-style extractive datasets, employing human translators to create parallel examples. Both MLQA and XQuAD ensure that all answers are answerable (discarding ‘‘No Answer’’ examples), and derive answers from provided documents. XQA (Liu et al., 2019a), one of the few retrieval-independent QA datasets, offers cloze-style questions, leveraging Wikipedia's daily questions and entity answers to populate document-independent answers. TyDi (Clark et al., 2020), like MKQA, focuses on typological diversity in its wide language selection. While TyDi offers a more natural distribution of questions, its annotations are based on the retrieval system used by the authors (Google search); thus their answers are actually start and end indices for spans of text within a given passage. Xor-QA (Asai et al., 2021) explores cross-lingual subtasks by re-annotating 40k TyDi examples, over 7 languages, sourcing answers from English documents and translating them back to the target language. Many of these multilingual resources have been bundled into cross-lingual benchmarks, such as XTREME (Hu et al., 2020) and XGLUE (Liang et al., 2020).

2.1 Comparison to Native Speaker Datasets

There are key advantages to datasets such as TyDi (Clark et al., 2020) and Xor-QA (Asai et al., 2021), which use native speakers' questions, particularly in the naturalness and cultural authenticity of the corpora. However, there are also key disadvantages to these datasets that MKQA circumvents with language alignment, to provide more challenging and fair model evaluations across languages.

TyDi (Clark et al., 2020) and MKQA both target high typological diversity, highlight the importance of sourcing realistic questions (with answers unseen), and incorporate a broader distribution of question types than competing datasets (including ‘‘No Answer’’ and ‘‘Yes’’/‘‘No’’ answers). There are three main differences between MKQA and TyDi: (a) question alignment across languages, (b) answer distribution, and (c) annotation retrieval independence (closely tied with the notions of ‘‘open’’ and ‘‘closed’’ domain). TyDi provides a different set of natural questions per language, at the expense of direct comparability across languages. Not only are the TyDi questions different between languages, but the percentage of answerable passages varies dramatically, from 22% in Korean to 69% in Arabic. XorQA-TyDi (Asai et al., 2021) partially resolves this issue by sourcing answers from English documents, but this may in turn re-introduce cultural biases. This suggests that the conceptual difficulty of these questions may also vary dramatically, as consumers from different locales cater their questions based on their existing beliefs of the quality of the virtual assistants in their language. As a result, it is difficult to interpret the core reasons why a multilingual system's performance varies between languages. To ensure this property, MKQA verifies its questions are predominantly geographically invariant, and thus the answers will not change due to geographical or cultural factors.

The second difference between the datasets is the answer distribution. MKQA answers (a) are predominantly entities (42.2%) or atomic answers such as dates, binary, or numbers with units, and (b) use a different definition of ‘‘Unanswerable’’. Xor-QA focuses only on answerable queries, and TyDi's definition conditions on the presence of the answer in the passage, whereas MKQA's definition is based on the ability of a human to find a succinct answer to a question on the web, that is, whether it is human answerable. As a result, our annotations are not limited by the quality of selected passages, and provide higher answer coverage (67.58% as opposed to the TyDi language average of 38%).

Lastly, while MKQA does not expect an answer to be derived from a single source document, TyDi is an extractive QA dataset. Consequently, its answer annotations are defined as spans, tied directly to particular Wikipedia documents and the fixed index from which they were retrieved. As an evaluation set, we contend the flexibility of document-independent answers is critical to not restrain which approaches can be evaluated in future research.

3 Dataset Collection

We aim for certain properties of our evaluation set: (i) realistic questions, (ii) reliable annotations

(e.g., via inter-annotator agreement), and (iii) a flexible task setup that makes as few assumptions as possible about the underlying modeling techniques, enabling fair comparison between any methods.

3.1 Query Selection

Our evaluation set collection pipeline begins with the Answer Curation steps outlined in Figure 1. These are designed to yield high-consensus answer labels, with normalized textual formats, expressive alias sets for robust comparison, and grounding in structured information for entity disambiguation or more informative analysis. For the first step, we sample 10,000 queries from Natural Questions (NQ) (Kwiatkowski et al., 2019), as this is one of the few QA datasets based on realistic queries, generated by information-seeking users.

3.2 Raw Answer Collection

At the raw answer collection stage, 5 annotators are independently shown the query and asked to search the web to either copy or generate an ideal answer. They are asked to select an answer type (radio buttons) from the options shown below, and input the answer (text box) according to format instructions per answer type. The formatting constraints allow us to automatically link Wikidata entities for the units in ‘‘number with units’’ and to gather well-structured data for answers such as dates, to save annotator time.

For each query, the graders select a typed answer from the following taxonomy:

• Atomic value: This category includes dates, numbers, and number ranges with or without a unit (meters, years, . . . ).

• Entities: Entities are annotated with Wikidata QIDs and include generic entities, people, objects, and most locations.

• Yes/No: Type representing yes/no answers.

• Short answer: Answers which cannot be encapsulated in an atomic value, entity, or binary (yes/no) answer, but are still a short phrase.

• Long answer: The long answer category indicates that no simple factual answer or short phrase answers this question, and a longer or visual explanation is required. During evaluation we treat these as ‘‘Unanswerable’’ for simplicity.

• Unanswerable: This category indicates that
the query is not answerable, potentially be-
cause it is ill-formed or because no clear an-
swer is available.
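This taxonomy maps naturally onto a typed record. The sketch below is purely illustrative; the field and type names are ours and are not taken from the released MKQA files.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AnswerType(Enum):
    """Answer types from the MKQA annotation taxonomy."""
    ATOMIC_VALUE = "atomic_value"   # dates, numbers, number ranges (with or without unit)
    ENTITY = "entity"               # grounded in a Wikidata QID
    YES_NO = "yes_no"
    SHORT_ANSWER = "short_answer"
    LONG_ANSWER = "long_answer"     # treated as Unanswerable during evaluation
    UNANSWERABLE = "unanswerable"


@dataclass
class GoldAnswer:
    """One curated answer for a query (hypothetical schema, for illustration only)."""
    answer_type: AnswerType
    text: Optional[str] = None          # canonical surface form, if any
    wikidata_qid: Optional[str] = None  # e.g. "Q794" for entity answers
    aliases: List[str] = field(default_factory=list)  # alternative valid strings
```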

3.3 Answer Resolution

Given the query and a candidate answer from the previous stage, annotators are next asked to normalize date/number formats and resolve the answer text against Wikidata entities, where feasible. To resolve short textual answers against Wikidata entities, we apply an internal entity linking system to the answer string to generate Wikidata candidate entities.3 The top 10 entity suggestions and their descriptions, along with the original query and short answer, are then presented to 3 graders, who are asked to pick the correct reference entity or ‘‘None of the above.’’ In cases where graders do not achieve sufficient agreement or where the correct entity is not in the list, a domain expert (one of the MKQA authors/designers) provides the correct reference. Overall, this step enables us to disambiguate homonyms and collect valid answer synonyms/aliases, for more robustly measuring annotator agreement and prediction accuracy.
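The entity linking system used for candidate generation is internal; as a rough, publicly reproducible stand-in (cf. footnote 3), candidates could instead be fetched from Wikidata's public wbsearchentities endpoint and then shown to graders. A minimal sketch, assuming network access:

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wikidata_candidates(answer_text: str, limit: int = 10) -> list:
    """Return up to `limit` (QID, label, description) candidates for an answer string."""
    params = {
        "action": "wbsearchentities",
        "search": answer_text,
        "language": "en",
        "format": "json",
        "limit": limit,
    }
    resp = requests.get(WIKIDATA_API, params=params, timeout=10)
    resp.raise_for_status()
    return [(hit["id"], hit.get("label", ""), hit.get("description", ""))
            for hit in resp.json().get("search", [])]


# The top suggestions, plus a "None of the above" option, would then be shown
# to 3 graders, as described above.
print(wikidata_candidates("Autumn Erhard"))
```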

3.4 Answer Verification

Up until this stage, 5 raw answers were collected per query, and subsequently format-normalized and resolved against Wikidata. In the fourth stage of Answer Curation (Figure 1) any normalized answer given by at least 2 annotators is admitted to the final set as a gold answer. For those annotations that did not achieve the required agreement from at least two annotators, a domain expert (one of the MKQA authors/designers) with access to all 5 preliminary annotations is tasked to provide a final decision. This second manual round was afforded as much time per decision as necessary to obtain a satisfactory answer. The instructions permit the selection of existing normalized answer(s), modifying them slightly, or overriding them if necessary.
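A minimal sketch of the agreement rule described above: any normalized answer supported by at least 2 of the 5 annotators is admitted, and queries with no such answer are flagged for expert adjudication. The helper below is illustrative, not the production pipeline.

```python
from collections import Counter
from typing import List, Tuple

MIN_AGREEMENT = 2  # an answer needs support from at least 2 of the 5 annotators


def verify_answers(normalized_answers: List[str]) -> Tuple[List[str], bool]:
    """Return (gold answers meeting the agreement threshold, needs_expert_review)."""
    counts = Counter(normalized_answers)
    gold = [answer for answer, votes in counts.items() if votes >= MIN_AGREEMENT]
    return gold, len(gold) == 0


# Three annotators agree after normalization; the query is admitted with gold "66".
gold, needs_review = verify_answers(["66", "66", "66", "64", "about 60"])
assert gold == ["66"] and needs_review is False
```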

3.5 Answer Localization

In the last two stages of MKQA curation shown in Figure 1, we translate, or ‘‘localize’’, the English queries and answers into the target languages. Given the special care we took to avoid them in

1393

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
4
3
3
1
9
7
6
1
8
7

/

/
t

A
C
_
A
_
0
0
4
3
3
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
4
3
3
1
9
7
6
1
8
7

/

/
t

A
C
_
A
_
0
0
4
3
3
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

our methodology, and since we only localize short answers and queries (no context passages), we believe translation artifacts are likely to be minimal in MKQA.

3This step can be replicated using an off-the-shelf entity linker such as spaCy, available at https://spacy.io/api/entitylinker.

Figure 1: Data Collection Process. A depiction of the 6 sequential steps in our data collection pipeline. The first four steps involve Answer Curation, and the last two localize questions and answers into 26 target languages.

Verified answers are localized into the target language by a combination of methods. For Wikidata-resolved answers, we leverage Wikidata's names and aliases for the target language. These names and aliases are transcribed in the native alphabet where appropriate, reflecting the expected answer in each language. Atomic answer types, including numeric, number with unit, and date types, were also translated by this method, maintaining Arabic numerals for all languages, but naturalizing unit terms such as ‘‘November’’, ‘‘century’’, ‘‘b.c’’, ‘‘acres’’, and ‘‘light years’’. For date types specifically, for every combination of year, month, and day, we generate template answers in each language, accommodating both American and European date formats, as well as numeric and written-out versions for months.
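As an illustration of the Wikidata-based localization above, a minimal sketch using the public wbgetentities endpoint (the internal pipeline may differ):

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def localized_answer_strings(qid: str, languages: list) -> dict:
    """Map each language code to the Wikidata label and aliases of `qid`."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|aliases",
        "languages": "|".join(languages),
        "format": "json",
    }
    entity = requests.get(WIKIDATA_API, params=params, timeout=10).json()["entities"][qid]
    labels, aliases = entity.get("labels", {}), entity.get("aliases", {})
    out = {}
    for lang in languages:
        label = labels.get(lang, {}).get("value")
        lang_aliases = aliases.get(lang, []) if isinstance(aliases, dict) else []
        out[lang] = ([label] if label else []) + [a["value"] for a in lang_aliases]
    return out


# e.g. the entity from Table 1:
print(localized_answer_strings("Q794", ["de", "ja", "th"]))
```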

In cases where a Wikidata link could not be found, or where answers were not available for a given language code, professional bilingual human translators were used to provide the native equivalent. For this task, human translators are given access to the English query, the English answer, and where available the Wikidata link and Wikipedia page for the entity. We found localization quality improved when bilingual translators are shown several examples prior to grading, covering each of the localization options:

Localization Options:

• Transliteration is a type of conversion of a text from one script to another that involves swapping letters (thus trans- + liter-) in predictable ways (such as α → a, χ → ch, or æ → ae).

• Translation is the communication of the
meaning of a source-language text by means
of an equivalent target-language text.

• Unchanged is selected if the entity name
does not need to be localized as it is com-
monly used as is.

• Mix transliteration/translation/unchanged if the entity is localized using more than one technique.


3.6 Query Localization


The final stage of MKQA construction, as shown in Figure 1, is query localization. As with answer localization, bilingual translators were asked to translate each query ensuring the query's meaning is maximally preserved, while naturally phrased. Translators were further instructed to use localized names of named entities if they exist in the target language and to transliterate names otherwise. Our translators, who are native speakers of the target language, are verified to live in the targeted region and are required to pass an entrance exam to verify a high level of fluency in English. Translators received a standard hourly wage varying with the target region and were not compensated per completed task, as is usual with alternative public services such as Amazon Mechanical Turk. On average, approximately 16 translators participated in the translation of the 10k source queries from English into each target language.

4 Dataset Quality and Analysis

Given our dataset collection and methodology, we evaluate the effect of our choices, and the properties of the final set, including the selected languages, annotation quality, geographical invariance, and answer type distribution as compared to NQ.

4.1 Language Selection

We select a set of languages meeting both academic and practical considerations, by maximizing typological diversity as well as the share of the world population that understands at least one of the languages in the set. Table 3 shows the languages selected for our dataset with the corresponding branch of their language family. We also show the language's reach, that is, the percentage of the world population that speaks the language either as a first or second language (based on Ethnologue data, Simons and Fennig, 2018). Since combined first- and second-language speaker statistics are not readily available, it is not straightforward to accurately determine what share of the world population can be covered by the languages in this set (e.g., a native speaker of German may also be fluent in English). A practical option is to calculate the share of the world population that lives in a country where one of the languages in our set is recognized as an official language. By this measure, 90.62% of the world population

live in a country with an official language covered by the languages in our set.4 With the large number of diverse language families covered and the reach of the selected languages, MKQA addresses both academic and practical requirements for a wide and diverse question answering benchmark. Finally, we note that the Wikidata IDs provided for a large portion of our gold answers allow these answers to be further localized into Wikipedia languages beyond those in MKQA, should practitioners wish to expand their analysis.

Family           Branch          Language      Reach

Indo-European    Germanic        English       16.46%
                                 German         1.70%
                                 Dutch          0.38%
                                 Swedish        0.17%
                                 Danish         0.08%
                                 Norwegian      0.07%
                 Italic          Spanish        6.99%
                                 French         3.59%
                                 Portuguese     3.28%
                                 Italian        0.87%
                 Balto-Slavic    Russian        3.35%
                                 Polish         0.58%
Sino-Tibetan     Sinitic         Mandarin      14.54%
                                 Cantonese      1.10%
Afro-Asiatic     Semitic         Arabic         4.44%
                                 Hebrew         0.12%
Austronesian     Malayo-Poly.    Malay          3.47%
Japonic          Japonic         Japanese       1.64%
Austroasiatic    Vietic          Vietnamese     1.00%
                 Khmer           Khmer          0.21%
Turkic           Comm. Turkic    Turkish        1.10%
Kra–Dai          Tai             Thai           0.78%
Koreanic         Han             Korean         1.03%
Uralic           Finnic          Finnish        0.07%
                 Ugric           Hungarian      0.17%

Table 3: Languages with their corresponding language families and speakers. Reach indicates the combined number of first-language (L1) and second-language (L2) speakers as a percentage of the world population (Ethnologue, Simons and Fennig, 2018).

4We determine this percentage based on Wikidata as the combined population (Wikidata property ‘‘P1082’’) of all countries that have an official language (Wikidata property ‘‘P37’’) in our dataset divided by the combined population of all countries in Wikidata.
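As a hedged illustration only, the covered population in this footnote's computation can be gathered from the Wikidata Query Service using the same two properties; the language QIDs below are examples, not the full MKQA list, and the result will drift with Wikidata edits.

```python
import requests

# Official language (P37) and population (P1082), as in footnote 4.
# Q1860 = English, Q188 = German (illustrative; the full list covers 26 languages).
QUERY = """
SELECT ?country ?population WHERE {
  ?country wdt:P37 ?lang ;
           wdt:P1082 ?population .
  VALUES ?lang { wd:Q1860 wd:Q188 }
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "mkqa-reach-sketch/0.1 (research use)"},
    timeout=60,
)
rows = resp.json()["results"]["bindings"]

# Keep one population figure per country (P1082 may carry several statements).
covered = {}
for row in rows:
    covered[row["country"]["value"]] = float(row["population"]["value"])

covered_population = sum(covered.values())
# Dividing by the combined population of all countries (a second, unfiltered query)
# approximates the 90.62% figure reported above.
print(f"Population covered by the sampled languages: {covered_population:,.0f}")
```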

Language              Query Translation     Answer
                      Acceptance Rate       Acceptance Rate

English                       –             97.03%
German                     99.01%           91.08%
Spanish                    99.01%           92.07%
Thai                       96.04%           91.09%
Chinese (simpl.)           92.24%           89.32%

Table 4: Query translation and retrieval-agnostic answer quality in various languages. Query translation acceptance rate is the percentage of query translations judged as acceptable. Answer acceptance rate is the percentage of answers graders found acceptable in response to the translated target-language query.

4.2 Translation and Answer Quality

The quality and reliability of our dataset is highly dependent on two factors: (a) how well our professional translators were able to translate the English queries into each target language, and (b) how well our language-independent answer representations transfer to each target language.

We run a small-scale grading experiment, grading just above 1% of the total data, to estimate the quality of the query translations and how well the meaning of our language-independent answer annotation is preserved across languages (geographical invariance). We present graders with the localized query and its answer annotations and ask them to judge whether (a) the localized query is an acceptable translation of the original English query, and (b) whether the provided answer (entities are shown with their QID and description, and a short explanation is added to each other answer type) is acceptable for the translated target-language query. In addition, we also ask graders to judge the answer quality for the original English queries as a baseline.

Table 4 shows the acceptance rates for query translations and answers for a small selection of languages. The table shows that query translations are consistently judged as acceptable in German, Spanish, and Thai, while the quality for Chinese translations was judged as lower in comparison. Most translation issues are related to the localization of entities and to domain-specific terms (e.g., sports terminology such as ‘‘receptions’’ in football). As expected, the acceptability of answers is judged to be higher for English than other languages, but it is still at or above 90% even for languages as linguistically distant from English as Thai. Note that errors in answer acceptance rate and query translation acceptance rate heavily overlap, since incorrect query translations will most likely mean that the existing language-independent answer will not match. Answer quality issues fall into the following categories (illustrated with German examples):

(1) Answer differs based on cultural context (44%) This includes cases where the localized version of an entity may have different properties. For example, the English-language TV show ‘‘Man vs Food’’ has 8 seasons while the German version has 5. Similarly, a character in a movie such as ‘‘Finding Nemo’’ may be voiced by a different voice actor in the German version of the same movie.

(2) Generic annotation issues (33%) The second biggest source of errors is answer quality issues that will hold across languages. Examples include answers that are time-sensitive, such as the answer to the question ‘‘when was the oldest person in the world born’’, and questions with ambiguous answers in the data, such as ‘‘is northern ireland a part of great britain.’’

(3) Entities transliterated incorrectly (11%) Names for entities may be transliterated incorrectly if they do not exist in the target language (‘‘who wrote the book clear and present danger’’).

(4) Generic translation artifacts (11%) Generic translation errors may lead to a mismatch between the question and the language-independent answer. In one example, the English ‘‘words to’’, meaning ‘‘lyrics’’, was translated into German as the literal ‘‘Worte’’, which would be an uncommon phrasing in a question about lyrics.

Translation artifacts are a recognized problem in multilingual datasets, and manual grading of the data in Table 4 shows that the human translation step may introduce more or fewer query–answer discrepancies depending on the target language. In an alternative scenario, annotation could be performed directly on native queries from each language; however, such data is not readily available and might additionally suffer from other downsides, such as relatively small user bases in less frequently spoken languages (see Section 2.1 for further discussion). Similar to our evaluation, the authors of NQ perform a manual precision
grading of their data and find an overall data precision of 84% for short answers. While we hope that future work can improve on data quality further, comparatively, even for the language with the most severe translation artifacts in our evaluation, Simplified Chinese, the resulting data quality (answer acceptance rate of 89%) is still within an acceptable range. Moreover, our dataset provides the only available source of question answering evaluation in many languages.

Figure 2: Answer Type Breakdown. Compares the distribution of answer types between MKQA and Natural Questions (NQ) for the 10k examples in the evaluation set.

We encourage authors of future multilingual datasets that use any translation methods to report and detail their geographical invariance, as we have done, and to benchmark the reliability of examples and the presence of translation artifacts.

4.3 Annotation Breakdowns

Next, we compare the distribution of answer types between the original NQ dataset and those newly assigned in MKQA. As Figure 2 shows, 50% of NQ are completely ‘‘Unanswerable’’ by retrieved passages and another 13% require long passage answers. In the short answer setup for NQ, both of these are considered unanswerable, amounting to 63% of all questions. In comparison, only 32.4% of examples are of the ‘‘Unanswerable’’ or ‘‘Long’’ answer type in MKQA. This is due to a shift in definition from whether a passage contains an answer, to whether a question is (succinctly) answerable by a human, with full web access. Given that the answer types in MKQA are not dependent on a learned retrieval system, they reflect the properties of the question only.

We later show that this ‘‘unanswerable’’ definition yields more challenging evaluation because (i) correctly answering questions is on average harder than learning when to abstain, and (ii) many of the most difficult questions were unanswerable in NQ but are answerable in MKQA. This suggests the property of ‘‘retrieval independent annotations’’, currently not used in any other multilingual QA benchmarks except XQA, is highly desirable for (a) constructing more challenging QA evaluation sets, and (b) yielding annotations useful to evaluate any QA approach, not just extractive QA models.

We also encourage future QA benchmarks to
mimic our multi-stage data collection framework
in providing supplementary metadata per example
(answer type and Wikidata QIDs). Beyond basic
comparison of systems, our evaluation tools al-
low practitioners to perform further error analysis
with more interpretable metrics.

5 Experiments

5.1 Task Definition

Given a question q^l in language l, the task is to produce a prediction p^l ∈ {No Answer, Text Answer}, where a Text Answer is a sequence of tokens in the corresponding language. p^l can be obtained by any method: extracted from a document, generated, or derived from a knowledge graph.

For evaluation using MKQA gold answers, every question q^l_i for i ∈ [1, 10000] is accompanied by a set of valid annotations a^l_i per language. Every prediction p^l_i is scored based on exact match (EM) and token overlap F1, as with previous open-retrieval QA datasets. The official evaluation script also ingests a ‘‘No Answer probability’’ for each example. If the probability is above a chosen threshold value, then the prediction defaults to No Answer instead of the provided Textual Answer. As this threshold varies from 0 to 1, the

predictions shift from entirely No Answer to all textual answers. We follow NQ in reporting the best F1 over the range of thresholds, to remove threshold tuning as a factor in evaluation. A best threshold is computed and applied per language, where each example receives a ‘‘textual’’ (token overlap) F1 after language-specific normalization (removing whitespace, punctuation, and articles) is applied to both the prediction and gold answers. Finally, the official per-language F1 is computed as the mean of example F1s, and the official Macro Average F1 is the mean of per-language F1 scores.
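The sketch below mirrors this metric: token-overlap F1 after light normalization, a fallback to No Answer when the supplied probability exceeds a threshold, and the best mean F1 over a threshold sweep for one language. It is a simplified illustration, not the official evaluation script (which applies additional language-specific normalization rules).

```python
import string
from typing import List

ARTICLES = {"a", "an", "the"}  # English articles; other languages use their own lists


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and articles, and split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return [tok for tok in text.split() if tok not in ARTICLES]


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def example_f1(prediction: str, gold_answers: List[str]) -> float:
    """Score against the best-matching gold answer (aliases included)."""
    if not gold_answers:                  # unanswerable example
        return float(prediction == "")    # "" stands for No Answer
    return max(token_f1(prediction, g) for g in gold_answers)


def best_f1_over_thresholds(predictions, no_answer_probs, gold, thresholds=None):
    """Sweep the No Answer threshold and report the best mean F1 for one language."""
    thresholds = thresholds or [i / 100 for i in range(101)]
    best = 0.0
    for t in thresholds:
        scores = []
        for pred, p_na, answers in zip(predictions, no_answer_probs, gold):
            final = "" if p_na > t else pred
            scores.append(example_f1(final, answers))
        best = max(best, sum(scores) / len(scores))
    return best
```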

5.2 Baseline Approaches

To benchmark our evaluation set, we combine state-of-the-art approaches in retrieval, machine translation, extractive QA, and generative QA. All retriever models are off-the-shelf, and all reader models are finetuned on Natural Questions, including XLM-ROBERTA LARGE (Conneau et al., 2020) and M-BERT (Devlin et al., 2019) for extractive QA, and MT5-LARGE (Xue et al., 2021) for generative QA.5 In each case, tokenization is handled by the multilingual model used (sentencepiece for XLM-R and MT5-LARGE, WordPiece for M-BERT), each with vocabularies initialized from their specific pre-training implementations. Further, all query and prediction translations in our approaches use Zhang et al.'s (2020) open source many-to-many, encoder-decoder machine translation system, trained on the OPUS multilingual corpus, covering 100 languages.

Retrieval Corpora Our baselines operate on a Wikipedia document corpus from December 07, 2020, following previous work in open-domain question answering (Kwiatkowski et al., 2019; Asai et al., 2021; Clark et al., 2020). We use the language-specific Wikipedia corpora for Elasticsearch and the English versions for other baselines. Using Wikipedia as this base corpus is a pragmatic choice based on several aspects: 1) it provides comparability across baselines and previous work, and 2) compared to large web document corpora, such as Common Crawl, it requires less data cleaning and is computationally more tractable, which improves the replicability of our results and helps to ensure that the major variable being evaluated is model performance (rather than engineering effort). Thus, while we believe that using a web-scale corpus, such as Common Crawl, would potentially enable even stronger baselines, we leave such experiments to future work.

5Note that we exclude the 10k examples used in our evaluation set from this training set.

Elasticsearch → XLM-R We benchmark a fully multilingual retriever approach using Elasticsearch followed by XLM-R as the extractive reader. Elasticsearch leverages language-specific tokenizers and analyzers with BM25 to search for native passages in the target language's Wikipedia dump. We used its built-in language-specific analyzers, which include stopwords and a stemmer for each language.6 We took the Wikipedia dump from December 7, 2020, for each language as source documents. The languages Hebrew, Khmer, Korean, Malay, and Vietnamese are not part of the Elasticsearch baseline as they are not natively supported by Elasticsearch.
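A minimal sketch of this retrieval step, assuming a local Elasticsearch instance, the 7.x Python client, and one index per language whose text field uses the corresponding built-in analyzer; the index and field names below are our own.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Index creation with a built-in language analyzer (stopwords + stemming).
es.indices.create(
    index="wiki_de",
    body={"mappings": {"properties": {"text": {"type": "text", "analyzer": "german"}}}},
)

# ... bulk-index the German Wikipedia passages into "wiki_de" here ...


def retrieve(question: str, index: str, k: int = 10):
    """BM25 search over the language-specific index; returns the top-k passages."""
    resp = es.search(index=index, body={"query": {"match": {"text": question}}, "size": k})
    return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]


passages = retrieve("wer hat die meisten tore in der bundesliga geschossen", "wiki_de")
```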

DPR → RoBERTa We benchmark an approach that utilizes state-of-the-art English retrieval and reader systems, enabled by translating the incoming query into English, and the outgoing prediction into the target language. We use off-the-shelf Dense Passage Retrieval (DPR, Karpukhin et al., 2020), followed by ROBERTA (Liu et al., 2019c) to extract a prediction.7

Gold NQ → Extractive QA For this set of baselines, optimal English retrieval is simulated via the passages provided with NQ. We illustrate baselines that leverage these provided ‘‘Gold’’ English documents, machine translation, and extractive QA models. We vary the type of QA model (M-BERT vs. XLM-R) and the train/test approach, comparing common zero shot, translate test, and translate train approaches.

In zero shot transfer, each multilingual model is finetuned with NQ's default English questions Qen and passages Pen. At test time the model receives MKQA questions Qxx in language xx, paired with English passages Pen.

For translate test, at train time the model uses NQ's default English. At test time, MKQA questions are translated into English Qxx→en, and the passage remains in English Pen. Passages remain in English for both training and inference.

6https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#arabic-analyzer.

7We use the trained ‘‘Multiset’’ DPR model available in https://github.com/facebookresearch/DPR.

Retriever        Reader     Setting            R@1           Mean A∈D F1    Mean A∉D F1    En F1   Mean F1

Naive baseline (always NO ANSWER)                                                           32.4    32.4

MULTILINGUAL RETRIEVER
Elasticsearch*   XLM-R      –                  42.57 ± 1.2   25.18 ± 3.8     7.24 ± 2.5    34.99   34.13 ± 0.4

TRANSLATE-TEST ENGLISH RETRIEVER
DPR              RoBERTa    translate test     53.62 ± 2.2   20.33 ± 4.1    10.24 ± 1.8    45.19   36.81 ± 1.2

GOLD NQ PASSAGES
Gold NQ          M-BERT     zero shot          80.22         20.13 ± 5.5     7.56 ± 1.7    51.97   37.8 ± 2.0
Gold NQ          M-BERT     translate test     80.22         28.10 ± 6.5    12.1 ± 2.1             41.4 ± 2.2
Gold NQ          M-BERT     translate train    80.22         32.21 ± 6.0    14.8 ± 1.9             44.1 ± 1.8
Gold NQ          XLM-R      zero shot          80.22         38.81 ± 3.2    20.05 ± 2.6    52.27   45.5 ± 1.4
Gold NQ          XLM-R      translate test     80.22         34.23 ± 5.0    16.38 ± 2.6            42.9 ± 2.1
Gold NQ          XLM-R      translate train    80.22         40.28 ± 3.1    20.93 ± 2.7            46.0 ± 1.4

GENERATIVE MODELS
Query-only       MT5        –                  –             –              –              43.8    35.0 ± 1.2
Gold NQ          MT5        –                  80.22         36.8 ± 6.2     17.07 ± 2.6    47.6    38.5 ± 2.2

Table 5: Results for each baseline, broken down by retrieval metrics (Recall @ K passages), answerable question metrics (F1 at the best confidence threshold), and end-to-end metrics (F1 at the best confidence threshold). A naive approach, predicting exclusively NO ANSWER, achieves a lower bound score of 32.42% F1. Translate-Train using NQ's Gold passages and an XLM-R reader outperforms all alternate settings. A ∈ D denotes metrics for where the answer A exists in the top retrieved document D (exact match). A ∉ D denotes metrics for where the answer A does not exist in the top retrieved document D (exact match). ∗ The Elasticsearch benchmark does not include Hebrew, Khmer, Korean, Malay, and Vietnamese.

For translate train, at train time, questions are translated into the target language Qen→xx. At test time the model is given queries in the target language Qxx and passages Pen in the default English from NQ. Passages are always in English.
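The three extractive settings differ only in which side of the data is machine translated; the summary below restates them compactly (xx denotes the target language, and the arrow marks machine translation).

```python
# Question language at train / test time for the Gold NQ extractive baselines.
# Passages are English in every setting.
SETTINGS = {
    "zero_shot":       {"train_q": "en",       "test_q": "xx",       "passages": "en"},
    "translate_test":  {"train_q": "en",       "test_q": "xx -> en", "passages": "en"},
    "translate_train": {"train_q": "en -> xx", "test_q": "xx",       "passages": "en"},
}
```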

Query-only mT5 We benchmark a ‘‘closed-book’’, query-only generative QA approach, based on Roberts et al. (2020). This approach allows us to circumvent retrieval and machine translation entirely, using parametric knowledge within MT5 LARGE. Simply, the query is fed to the model, which is trained to generate the localized answer directly.
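A minimal inference sketch of this closed-book setup with the public Hugging Face checkpoint; note that google/mt5-large is only pre-trained, so in practice the model is first finetuned on Natural Questions question–answer pairs as described above.

```python
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-large")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-large")

question = "wie viele zähne hat ein salzwasserkrokodil"  # query in the target language
inputs = tokenizer(question, return_tensors="pt")

# The (finetuned) model generates the localized answer directly from the query,
# without any retrieved passage.
output_ids = model.generate(**inputs, max_length=32, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```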

Gold NQ → mT5 We benchmark a stronger generative QA approach that also has access to the English Gold NQ passages. Based on open-source implementations for the MLQA and XQuAD datasets, the model is fed the non-English query and (in this case) the English gold passage, and generates the predicted answer.8

8Implementation and hyperparameters based on https://github.com/google-research/multilingual-t5.

5.3 Results

Table 5 presents retrieval and end-to-end metrics for each baseline, as the mean across all 26 languages. Retrieval metrics include recall at K, measuring if the correct answer appears anywhere in the top K retrieved passages, as traditionally used in information retrieval settings. Note that these metrics are computed by looking for an exact match of the text-normalized gold answer in the text-normalized passage. We find that translation followed by English DPR outperforms the Elasticsearch multilingual sparse retrievers. This is consistent with results observed in XOR-QA (Asai et al., 2021), which shows the surprising under-performance of multilingual retrievers. Errors are likely a combination of no answer being present in smaller non-English Wikipedia indexes, and the weak performance of sparse retrieval. The Gold NQ documents contain a valid answer 80.22% of the time. However, this is likely an upper bound, as these documents are often very long and noisy, such that NQ annotators often marked them as not containing an answer to the question, even though we find the gold answer string is present.
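The retrieval metric itself reduces to normalized substring containment; a minimal sketch, reusing the same style of normalization as the answer F1 above:

```python
import string


def normalize_str(text: str) -> str:
    """Lowercase and strip punctuation, mirroring the answer normalization above."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(text.split())


def recall_at_k(gold_answers, retrieved_passages, k: int) -> bool:
    """True if any text-normalized gold answer appears in a top-k normalized passage."""
    passages = [normalize_str(p) for p in retrieved_passages[:k]]
    return any(normalize_str(a) in p for a in gold_answers for p in passages)
```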

For end-to-end metrics, we measure F1 just
for English (‘‘EN F1’’), which omits the impact

of machine translation, and mean F1 over all 26 languages. The naive baseline of only predicting No Answer achieves a lower bound score of 32.42%. We chose to combine both Unanswerable and Long Answers into the No Answer category for evaluation to focus MKQA on short, factoid answers that can be evaluated automatically and robustly. Unsurprisingly, we observe models with access to NQ gold documents achieve the best results, with Translate Train XLM-R achieving the best mean F1 of 46.0 ± 1.4. Among these methods, XLM-R outperforms M-BERT, and Translate-Train outperforms Translate-Test and Zero Shot. Generative approaches using MT5 perform fairly well, even under zero shot conditions (trained only on English), or without any passage provided (query-only).

Figure 3: F1 by Language. XLM-R Zero-Shot performance ranked by language. Unanswerable F1 (red) corresponds to the proportion of the Aggregate F1 obtained from predicting No Answer. The Unanswerable proportion is calculated as the percentage of unanswerable examples (32.42%) multiplied by the Unanswerable F1.

We also measure the F1 scores for the subset of answerable questions to measure the ability of the retrievers and readers to find the right answer. We separately report the average all-language F1 for (i) questions in which a gold answer appears in the top retrieved document, and (ii) questions in which none are found. As expected, performance is much higher for both extractive and generative models where the retriever has succeeded. Translate Train with XLM-R still achieves the best performance. XLM-R also performs well on the correct outputs (A ∈ D) of the weakest retriever, Elasticsearch, though there are fewer of them. Compared with end-to-end metrics, which include unanswerable questions, answerable questions are more difficult to answer.

Overall, these results show how collecting relevant passages remains a challenging bottleneck in multilingual open-retrieval QA. Multilingual retrievers, English state-of-the-art retrievers, and generative QA models all fail to overcome this problem, and even when gold passages are provided, multilingual readers and machine translation still fail to consistently produce localized answers (with generous evaluation settings).

In Figure 3 we compare cross-lingual performance between languages, ranked by F1 score. We plot XLM-R Zero Shot to minimize the noise from machine translation. As expected, the XLM-R model performs fairly well on English (52.3) and on common non-English languages, including the most common Indo-European Germanic and Italic languages, but poorly on languages from lower-resourced families. Note that the minimum F1 score is 32.42%, where a threshold of 0 predicts No Answer to every question. Interestingly, as the Aggregate F1 decreases, the Unanswerable F1 rises on average from ∼27% to ∼29%, abstaining from an answer more often. Given the parallel questions property of MKQA, these metrics allow a practitioner to specifically identify languages with weak model performance, and answer abstention behavior for commonly used reader models, such as XLM-R. Even before considering a cultural shift in query distribution, these metrics allow us to isolate performance on geographically invariant queries, and the general effectiveness of transfer learning for particular languages and training regimes.


5.4 Unanswerable vs. Long Answers

As discussed in Section 4.3, following the Short Answer setup for Natural Questions (Kwiatkowski et al., 2019), we define Unanswerable as a query without a short answer (i.e., examples with long or unanswerable answer types) for our task. Although evaluating long answers is important, it is out of the scope of MKQA. The primary benefit of this decision is that it enforces the retrieval-independent annotations property of MKQA, since long answers have an unbounded number of correct answer strings. Here we investigate whether long and ‘‘truly’’ unanswerable examples in MKQA are treated differently by our baseline models.

To answer this question, we break down the larger Unanswerable set into the long and ‘‘truly’’ unanswerable examples, comprising 56% and 44%, respectively. We then compute the final performance (F1) by model type and by language for each of these two categories. We find the results vary according to the quality of the model and the language (as does performance on answerable queries), but the differences between the long answer and truly unanswerable scores are marginal. For example, XLM-R Translate Train, using Gold NQ passages, achieves 84.2% F1 on long, and 84.7% on truly unanswerable examples, with a mean difference over all 26 languages of only 0.5%. These differences are similarly negligible across other baselines. This finding suggests standard open-domain QA systems, trained on short answer datasets like Natural Questions, have learned to consider long answers as unanswerable, and do not appear to find one set more challenging than the other.

6 Discussion

Difficulty of MKQA Our baselines represent a strong and diverse set of methods that score competitively with the state of the art on similar open domain question answering datasets. Nevertheless, on English alone, the best system receives an F1 score of only 52.3%, less than the same methods achieve on the open datasets Natural Questions and TriviaQA, or other standard benchmarks for this task. These comparative results demonstrate MKQA is highly challenging and leaves ample room for improvement in both English and the long tail of natural languages. In this section we explain why, with a detailed comparison to its closest set, Natural Questions.

Why is MKQA so challenging for state-of-the-art approaches, even for English open-domain QA? To shed light on this, we compare the difficulty of the English-only annotations between Natural Questions (NQ) and MKQA. In Figure 4 we use the same BERT-LARGE English model (trained on NQ, using Gold NQ passages) and evaluate it on both sets of annotations. The ‘‘F1 by Answer Type’’ diagram shows unanswerable examples in MKQA (red line) are easier than the unanswerable examples in NQ (red dashed line), as the model maintains higher performance at all No Answer confidence thresholds. The opposite relationship is observed for answerable examples.

We hypothesize that this is due to the Retrieval-Independence property and high coverage of our re-annotation process (described in Section 3). Due to the annotation procedures NQ uses, there are several cases that can lead to a potential answer missing from the dataset: (a) the initial retrieval may have not produced a candidate, (b) the answer may have not been in Wikipedia, or (c) NQ graders may have missed a valid answer. MKQA annotations are not susceptible to (a) and (b), and likely less impacted by (c). As a result, the most challenging questions migrated from unanswerable in NQ to answerable in MKQA, shifting the unanswerable distribution from 63% to 32% (as shown in Figure 2). Consider the following examples.

(a) NQ retrieval failure In this example, the NQ retrieved document does not contain an answer to the question, causing no long or short answer (No Answer) in NQ. There exists a better Wikipedia document (Wheel of Fortune) that does contain the MKQA answer ‘‘Autumn Erhard’’.

• Q: Who won the most money on wheel of fortune?

• NQ URL: Wikipedia: American game show winnings records.

• NQ Answer: No Answer

• MKQA Answers: ‘‘Autumn Erhard’’

(b) No Wikipedia answer This is also an answerable query, labelled as no answer by NQ, because the answer is not found on Wikipedia (either by NQ or our best efforts). However, an

answer can be found by MKQA graders from other websites and sources.

• Q: How many teeth does a saltwater crocodile have?

• NQ URL: Wikipedia: Saltwater Crocodile.

• NQ Answer: No Answer

• MKQA Answers: ‘‘66’’

Figure 4: Comparing MKQA and NQ English Annotations. The performance of the same English BERT-LARGE model on each of Natural Questions (NQ) annotations and MKQA annotations, using the MKQA evaluation metrics. For all plots the y-axis is F1 score and the x-axis is the value of the threshold over No Answer probabilities. F1 by Answer Type (left diagram) compares the accuracy of the model on Answerable and Unanswerable examples for each dataset, showing Unanswerable examples are on average easier in MKQA, and Answerable examples are on average harder in MKQA. NQ F1 Proportions (middle) and MKQA F1 Proportions (right) show what proportion of the aggregate F1 score is derived from each Answer Type. These plots demonstrate MKQA is more difficult than NQ because there is a higher proportion of answerable questions, which are harder on average.

(c) Annotator misses valid answer For this query, the answer is clearly visible in the provided Wikipedia article, but NQ's annotation process yields no answer.

• Q: What language do they speak in the ukraine?

• NQ URL: Wikipedia: Languages of Ukraine.

• NQ Answer: No Answer

• MKQA Answers: ‘‘Ukrainian’’

Given the answers to these queries are not easily found in the corpus, by retrieval, or by human annotators, they are likely more challenging on average. As such, their label shift from no answer in NQ to answerable in MKQA likely explains why there is a higher mean difficulty of answerable questions in MKQA, as observed in Figure 4. To understand the prevalence of each error type, we compute how often any MKQA answer appears in the retrieved document for which the NQ label says no answer exists. We find a valid answer appears in 70.4% of these documents, suggesting category (c), annotator error, is the largest source of such unanswerable queries in NQ (and the largest source of improvement in label quality for MKQA).

The middle and right diagrams in Figure 4 normalize the answer types by their proportion within the dataset, so we can compare their relative contributions to the aggregate F1 (the sum of answerable and unanswerable). NQ labels enable a much higher aggregate F1 score (69.38% at the best threshold) than MKQA (52.08% at the best threshold), primarily due to the higher proportion of unanswerable examples, which are easier on average than answerable examples. By comparing the ratio of unanswerable to answerable examples attempted at the best thresholds in each of the middle and right diagrams (the blue regions vs. the red regions), we see that the MKQA task is more oriented to answering questions rather than abstaining.

Due to the Parallel Question property of MKQA, the dataset is similarly challenging in all 26 languages. There is also a noticeable gap between the performance on English and on lower-resourced languages (Figure 3). For Korean and Arabic the best F1 score is only 6% higher than the lower bound score of 32.42% obtained from predicting exclusively ‘‘unanswerable.’’ This demonstrates that existing transfer learning methods have significant deficits to overcome for low-resource multilingual QA to match English performance. MKQA offers a challenging benchmark to measure this cross-language progress specifically.

Future Work The parallel questions property of MKQA offers alternative task setups in addition to typical open domain question answering. Lewis et al. (2020) suggest a generalized cross-lingual transfer task (G-XLT) where the question and answer languages are intentionally different. Alternatively, future work might assume we are given the English question-answer pairs, and attempt to propagate these answers into other languages by localizing the questions and answers.

We anticipate that this dataset will enable industry practitioners and researchers to rapidly test and compare novel cutting-edge techniques for QA against existing techniques in a more fair, comparable, and precise manner than previous benchmarks. Furthermore, we hope that the linguistic diversity and large number of languages will inspire more researchers to treat model performance across many (partially less-resourced) languages as an important and worthy goal in itself. As MKQA offers the only open-QA option for many of these languages, we also hope to spark important research in these monolingual, non-English settings.

7 Conclusion

In this work, we introduce a multilingual open
domain question answering evaluation set. Its
properties, including geographical invariance,
language-parallel questions, retrieval-independent
annotations, and linguistic diversity, set it apart
from existing resources in terms of annotation
quality, difficulty, and flexibility to evaluate new
methods. We encourage future multilingual
benchmarks to adopt these data collection and
annotation principles to promote higher-quality and
more informative evaluation practices. We evaluate
several baselines, based on state-of-the-art methods,
and demonstrate ample room for improvement
both in English and in the tail of lower-resourced
languages. We hope that this evaluation set enables
wider exploration of cross-lingual and monolingual
methods in non-English QA.

Acknowledgments

We would like to thank Chris DuBois, who has
been instrumental to releasing this data. Ilya
Chatsviorkin, Xiao Ling, Nikhil Ramesh, Ni Lao,
Agatha Downey, Silviana Ciurea-Ilcus, Anthony
Chen, and Russ Webb have provided invaluable
feedback on early versions of this paper. Thanks
to Ivan Montero for testing out early versions of
the data. Thanks to Pablo N. Mendes and Charles
Srisuwananukorn for guidance and support, as
well as to Noriyo Sakamoto for help in data
collection. This work would not have been possible
without the TryRating annotation platform.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre.
2020A. Translation artifacts in cross-lingual
transfer learning. In Proceedings of the 2020
Conference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 7674–7684.
https://doi.org/10.18653/v1/2020
.emnlp-main.618

Mikel Artetxe, Sebastian Ruder, and Dani
Yogatama. 2020乙. On the cross-lingual trans-
ferability of monolingual representations. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics,
pages 4623–4637. https://doi.org/10
.18653/v1/2020.acl-main.421

Akari Asai, Jungo Kasai, Jonathan H. Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. 2021. XOR QA: Cross-lingual open-retrieval question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 547–564. https://doi.org/10.18653/v1/2021.naacl-main.46

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2358–2367, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1223

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver,
Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1171

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470. https://doi.org/10.1162/tacl_a_00317

Alexis Conneau, Kartikay Khandelwal, Naman
Goyal, Vishrav Chaudhary, Guillaume Wenzek,
Francisco Guzm´an, ´Edouard Grave, Myle Ott,
Luke Zettlemoyer, and Veselin Stoyanov.
2020. Unsupervised cross-lingual representa-
tion learning at scale. In Proceedings of the
58th Annual Meeting of the Association for
计算语言学, pages 8440–8451.
https://doi.org/10.18653/v1/2020
.acl-main.747

Yiming Cui, Wanxiang Che, Ting Liu, Bing
Qin, Shijin Wang, and Guoping Hu. 2019A.
Cross-lingual machine reading comprehen-
sion. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 1586–1595.

Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-attention neural networks for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 593–602, Vancouver, Canada. Association for Computational Linguistics.

Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao,
Zhipeng Chen, Wentao Ma, Shijin Wang, 和
Guoping Hu. 2019乙. A span-extraction dataset
for Chinese machine reading comprehension.
In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Pro-
cessing and the 9th International Joint Con-
ference on Natural Language Processing
(EMNLP-IJCNLP), pages 5886–5891.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang,
William Cohen, and Ruslan Salakhutdinov.
2017. Gated-attention readers for text compre-
hension. In Proceedings of the 55th Annual
Meeting of
the Association for Computa-
tional Linguistics (体积 1: Long Papers),
pages 1832–1846, Vancouver, Canada. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/P17-1168

B. Green, A. Wolf, C. Chomsky, and K. Laughery. 1986. BASEBALL: An Automatic Question Answerer, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Deepak Gupta, Surabhi Kumari, Asif Ekbal,
and Pushpak Bhattacharyya. 2018. MMQA: A
multi-domain multi-lingual question-answering
framework for English and Hindi. In Proceed-
ings of the Eleventh International Conference
on Language Resources and Evaluation (LREC
2018).

Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi
Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang,
Hua Wu, Qiaoqiao She, 和别的. 2018.
Dureader: A Chinese machine reading compre-
hension dataset from real-world applications.
In Proceedings of the Workshop on Machine
Reading for Question Answering, pages 37–46.
https://doi.org/10.18653/v1/W18
-2605

Tsung-Yuan Hsu, Chi-Liang Liu, and Hung-yi
Lee. 2019. Zero-shot reading comprehension
by cross-lingual transfer learning with multi-
lingual language representation model. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 5933–5940, Hong Kong, China. Associ-
ation for Computational Linguistics.

Junjie Hu, Sebastian Ruder, Aditya Siddhant,
Graham Neubig, Orhan Firat, and Melvin

Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

Yimin Jing, Deyi Xiong, and Zhen Yan. 2019.
Bipar: A bilingual parallel dataset for multi-
lingual and cross-lingual reading comprehen-
sion on novels. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 2452–2462.
https://doi.org/10.18653/v1/D19
-1249

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611. https://doi.org/10.18653/v1/P17-1147

Vladimir Karpukhin, Barlas Oguz, Sewon Min,
Patrick Lewis, Ledell Wu, Sergey Edunov,
Danqi Chen, and Wen-tau Yih. 2020. Dense
passage retrieval for open-domain question
answering. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 6769–6781.
https://doi.org/10.18653/v1/2020
.emnlp-main.550

Vishwajeet Kumar, Nitish Joshi, Arijit Mukherjee, Ganesh Ramakrishnan, and Preethi Jyothi. 2019. Cross-lingual training for automatic question generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4863–4872. https://doi.org/10.18653/v1/P19-1481

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466. https://doi.org/10.1162/tacl_a_00276

Chia-Hsuan Lee and Hung-Yi Lee. 2019. Cross-lingual transfer learning for question answering. arXiv preprint arXiv:1907.06042.

Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020. MLQA: Evaluating cross-lingual extractive question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7315–7330. https://doi.org/10.18653/v1/2020.acl-main.653

Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Zhang, Rahul Agrawal, Edward Cui, Sining Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Daniel Campos, Rangan Majumder, and Ming Zhou. 2020. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6008–6018. https://doi.org/10.18653/v1/2020.emnlp-main.484

Seungyoung Lim, Myungji Kim, and Jooyoul
Lee. 2019. KorQuAD1.0: Korean QA dataset
for machine reading comprehension. arXiv
preprint arXiv:1909.07005.

Jiahua Liu, Yankai Lin, Zhiyuan Liu, 和
Maosong Sun. 2019A. XQA: A cross-lingual
open-domain question answering dataset. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 2358–2368.

Pengyuan Liu, Yuning Deng, Chenghao Zhu,
and Han Hu. 2019乙. XCMRC: Evaluating
cross-lingual machine reading comprehension.
In CCF International Conference on Natural
Language Processing and Chinese Computing,
pages 552–564. Springer. https://doi.org
/10.1007/978-3-030-32233-5 43

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,

Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019C. RoBERTa: A robustly opti-
mized BERT pretraining approach. arXiv preprint
arXiv:1907.11692.

Chih Chieh Shao, Trois Liu, Yuting Lai, Yiying
Tseng, and Sam Tsai. 2018. DRCD: A Chinese
machine reading comprehension dataset. arXiv
preprint arXiv:1806.00920.

Hussein Mozannar, Elie Maamary, Karl El Hajal,
and Hazem Hajj. 2019. Neural Arabic question
answering. In Proceedings of the Fourth Ara-
bic Natural Language Processing Workshop,
pages 108–118. https://doi.org/10.18653
/v1/W19-4612

Ella Rabinovich and Shuly Wintner. 2015.
Unsupervised identification of translationese.
Transactions of the Association for Computa-
tional Linguistics, 3:419–432. https://doi.org/10.1162/tacl_a_00148

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. https://doi.org/10.18653/v1/D16-1264

Adam Roberts, Colin Raffel, and Noam Shazeer.
2020. How much knowledge can you pack into
the parameters of a language model? In Pro-
ceedings of the 2020 Conference on Empiri-
cal Methods in Natural Language Processing
(EMNLP), pages 5418–5426. https://土井
.org/10.18653/v1/2020.emnlp-main.437

Gary F. Simons and Charles D. Fennig. 2018. Ethnologue: Languages of the World, twenty-first edition. Dallas, Texas: SIL International. Online version: http://www.ethnologue.com

Shuly Wintner. 2016. Translationese: 之间
human and machine translation. In Proceed-
ings of COLING 2016, the 26th International
Conference on Computational Linguistics: Tu-
torial Abstracts, pages 18–19, Osaka, Japan.
The COLING 2016 Organizing Committee.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498.

Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639. https://doi.org/10.18653/v1/2020.acl-main.148
