MIRACL: A Multilingual Retrieval Dataset Covering
18 Diverse Languages
Xinyu Zhang1∗, Nandan Thakur1∗, Odunayo Ogundepo1, Ehsan Kamalloo1†,
David Alfonso-Hermelo2, Xiaoguang Li3, Qun Liu3, Mehdi Rezagholizadeh2, Jimmy Lin1
1David R. Cheriton School of Computer Science, University of Waterloo, Canada
2Huawei Noah’s Ark Lab, Canada
3Huawei Noah’s Ark Lab, China
Abstract
MIRACL is a multilingual dataset for ad hoc
retrieval across 18 languages that collectively
encompass over three billion native speakers
around the world. This resource is designed
to support monolingual retrieval tasks, where
the queries and the corpora are in the same
language. In total, we have gathered over
726k high-quality relevance judgments for 78k
queries over Wikipedia in these languages,
where all annotations have been performed by
native speakers hired by our team. MIRACL
covers languages that are both typologically
close as well as distant from 10 language fami-
lies and 13 sub-families, associated with vary-
ing amounts of publicly available resources.
Extensive automatic heuristic verification and
manual assessments were performed during
the annotation process to control data quality.
In total, MIRACL represents an investment of
around five person-years of human annotator
effort. Our goal is to spur research on improv-
ing retrieval across a continuum of languages,
thus enhancing information access capabili-
ties for diverse populations around the world,
particularly those that have traditionally been
underserved. MIRACL is available at
http://miracl.ai/.
1 Introduction
Information access is a fundamental human right.
Specifically, the Universal Declaration of Human
Rights by the United Nations articulates that ‘‘ev-
eryone has the right to freedom of opinion and
expression’’, which includes the right ‘‘to seek,
receive, and impart information and ideas through
any media and regardless of frontiers’’ (Article 19).
Information access capabilities such as search,
question answering, summarization, and recommendation
are important technologies for safeguarding
these ideals.
∗ Equal contribution.
† Work done while at Huawei Noah’s Ark Lab.
With the advent of deep learning in NLP, IR,
and beyond, the importance of large datasets as
drivers of progress is well understood (Lin et al.,
2021b). For retrieval models in English, the MS
MARCO datasets (Bajaj et al., 2018; Craswell
et al., 2021; Lin et al., 2022) have had a transformative
impact in advancing the field. Similarly,
for question answering (QA), there exist many
resources in English, such as SQuAD (Rajpurkar
et al., 2016), TriviaQA (Joshi et al., 2017), and
Natural Questions (Kwiatkowski et al., 2019).
We have recently witnessed many efforts in
building resources for non-English languages,
for example, CLIRMatrix (Sun and Duh, 2020),
XTREME (Hu et al., 2020), MKQA (Longpre
et al., 2021), mMARCO (Bonifacio et al., 2021),
TYDI QA (Clark et al., 2020), XOR-TYDI (Asai
et al., 2021), and Mr. TYDI (Zhang et al., 2021).
These initiatives complement multilingual re-
trieval evaluations from TREC, CLEF, NTCIR,
and FIRE that focus on specific language pairs.
Nevertheless, there remains a paucity of resources
for languages beyond English. Existing datasets
are far from sufficient to fully develop informa-
tion access capabilities for the 7000+ languages
spoken on our planet (Joshi et al., 2020). Our goal
is to take a small step towards addressing these
issues.
To stimulate further advances in multilingual
retrieval, we have built the MIRACL dataset on
top of Mr. TYDI (Zhang et al., 2021), comprising
human-annotated passage-level relevance judg-
ments on Wikipedia for 18 languages, totaling
over 726k query–passage pairs for 78k queries.
These languages are written using 11 distinct
scripts, originate from 10 different language fam-
ilies, and collectively encompass more than three
Transactions of the Association for Computational Linguistics, vol. 11, pp. 1114–1131, 2023. https://doi.org/10.1162/tacl_a_00595
Action Editor: Xiaodong He. Submission batch: 4/2023; Revision batch: 4/2023; Published 9/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
billion native speakers around the world. They
include what would typically be characterized as
high-resource languages as well as low-resource
languages. Figure 1 shows a sample Thai query
with a relevant and a non-relevant passage. In total,
the MIRACL dataset represents over 10k hours,
or about five person-years, of annotator effort.
Along with the dataset, our broader efforts included
organizing a competition at the WSDM
2023 conference that provided a common evaluation
methodology, a leaderboard, and a venue for
a competition-style event with prizes. To provide
starting points that the community can rapidly
build on, we also share reproducible baselines in
the Pyserini IR toolkit (Lin et al., 2021a).
Compared to existing datasets, MIRACL provides
more thorough and robust annotations and
broader coverage of the languages, which include
both typologically diverse and similar language
pairs. We believe that MIRACL can serve as a
high-quality training and evaluation dataset for
the community, advance retrieval effectiveness in
diverse languages, and answer interesting scientific
questions about cross-lingual transfer in the
multilingual retrieval context.
Figure 1: Examples of annotated query–passage pairs
in Thai (th) from MIRACL.
2 Background and Related Work
The focus of this work is the standard ad hoc retrieval
task in information retrieval, where given
a corpus C, the system’s task is to return for a
given query q an ordered list of top-k passages
from C that maximizes some standard quality
metric such as nDCG. A query q is a well-formed
natural language question in some language L_n
and the passages draw from the same language
C_n. Thus, our focus is monolingual retrieval across
diverse languages, where the queries and the corpora
are in the same language (e.g., Thai queries
searching Thai passages), as opposed to cross-lingual
retrieval, where the queries and the corpora
are in different languages (e.g., searching a
Swahili corpus with Arabic queries).
As mentioned in the Introduction, there has
been work over the years on building resources
for retrieval in non-English languages. Below, we
provide a thorough comparison between MIRACL
and these efforts, with an overview in Table 1.
2.1 Comparison to Traditional IR
Collections
Historically, there have been monolingual retrieval
evaluations of search tasks in non-English
languages, for example, at TREC, FIRE, CLEF,
and NTCIR. These community evaluations typically
release test collections built from newswire
articles, which typically provide only dozens of
queries with modest amounts of relevance judgments,
and are insufficient for training neural
retrieval models. The above organizations also
provide evaluation resources for cross-lingual retrieval,
but they cover relatively few language
pairs. For example, the recent TREC 2022 NeuCLIR
Track (Lawrie et al., 2023) evaluates only
three languages (Chinese, Persian, and Russian)
in a cross-lingual setting.
2.2 Comparison to Multilingual QA Datasets
There are also existing datasets for multilingual
QA. For example, XOR-TYDI (Asai et al., 2021) is
a cross-lingual QA dataset built on TYDI by annotating
answers in English Wikipedia for questions
that TYDI considers unanswerable in the original
source (non-English) language. This setup, unfortunately,
does not allow researchers to examine
monolingual retrieval in non-English languages.
Another point of comparison is MKQA
(Longpre et al., 2021), which comprises 10k
question–answer pairs aligned across 26 typologically
diverse languages. Questions are paired with exact
answers in the different languages, and evaluation
is conducted in the open-retrieval setting by
matching those answers in retrieved text; thus,
MKQA is not a ‘‘true’’ retrieval dataset. Furthermore,
because the authors translated questions to
achieve cross-lingual alignment, the translations
may not be ‘‘natural’’, as pointed out by Clark
et al. (2020).
Dataset Name                       Natural   Natural    Human    # Lang   Avg # Q   Avg          Total      Training?
                                   Queries   Passages   Labels                      # Labels/Q   # Labels
NeuCLIR (Lawrie et al., 2023)      ✓         ✓          ✓        3        160       32.74        5.2k       ×
MKQA (Longpre et al., 2021)        ×         ✓          ✓        26       10k       1.35         14k        ×
mMARCO (Bonifacio et al., 2021)    ×         ×          ✓        13       808k      0.66         533k       ✓
CLIRMatrix (Sun and Duh, 2020)     ×         ✓          ×        139      352k      693          34B        ✓
Mr. TYDI (Zhang et al., 2021)      ✓         ✓          ✓        11       6.3k      1.02         71k        ✓
MIRACL (our work)                  ✓         ✓          ✓        18       4.3k      9.23         726k       ✓
Table 1: Comparison of select multilingual retrieval datasets. Natural Queries and Natural Passages
indicate whether the queries and passages are ‘‘natural’’, i.e., generated by native speakers (vs. human-
or machine-translated), and for queries, in natural language (vs. keywords or entities); Human Labels
indicates human-generated labels (vs. synthetically generated labels); # Lang is the number of languages
supported; Avg # Q is the average number of queries for each language; Avg # Labels/Q is the average
number of labels provided per query; Total # Labels is the total number of human labels (both positive
and negative) across all languages (including synthetic labels in CLIRMatrix). Training? indicates
whether the dataset provides sufficient data for training neural models.
2.3 Comparison to Synthetic Datasets
Since collecting human relevance labels is laborious
and costly, other studies have adopted
workarounds to build multilingual datasets. For
example, Bonifacio et al. (2021) automatically
translated the MS MARCO dataset (Bajaj et al.,
2018) from English into 13 other languages. However,
translation is known to cause inadvertent
artifacts such as ‘‘translationese’’ (Clark et al.,
2020; Lembersky et al., 2012; Volansky et al.,
2015; Avner et al., 2016; Eetemadi and Toutanova,
2014; Rabinovich and Wintner, 2015) and may
lead to training data of questionable value.
Alternatively, Sun and Duh (2020) built synthetic
bilingual retrieval datasets in a resource
called CLIRMatrix based on the parallel structure
of Wikipedia that covers 139 languages.
Constructing datasets automatically by exploiting
heuristics has the virtue of not requiring expensive
human annotations and can be easily scaled up
to cover many languages. However, such datasets
are inherently limited by the original resource they
are built from. For example, in CLIRMatrix, the
queries are the titles of Wikipedia articles, which
tend to be short phrases such as named entities.
Also, multi-degree judgments in the dataset are directly
converted from BM25 scores, which creates
an evaluation bias towards lexical approaches.
2.4 Comparison to Mr. TYDI
Since MIRACL inherits from Mr. TYDI, it makes
sense to discuss important differences between
the two. Mr. TYDI (Zhang et al., 2021) is a human-labeled
retrieval dataset built atop TYDI QA (Clark
et al., 2020), covering 11 typologically diverse languages.
While Mr. TYDI enables the training and
evaluation of monolingual retrieval, it has three
shortcomings we aimed to address in MIRACL.
Figure 2: Examples of missing relevant passages in
TYDI QA (and thus Mr. TYDI) for the query ‘‘Which
is the largest marine animal?’’ Since only relevant
passages in the top Wikipedia article are included,
other relevant passages are missed.
Limited Positive Passages. In TYDI QA, and thus
Mr. TYDI, candidate passages for annotation are
selected only from the top-ranked Wikipedia article
based on a Google search. Consequently, a considerable
number of relevant passages that exist
in other Wikipedia articles are ignored; Figure 2
illustrates an instance of this limitation. In contrast,
MIRACL addresses this issue by sourcing can-
didate passages from all of Wikipedia, ensuring
that our relevant passages are diverse.
Moreover, in MIRACL, we went a step further
and asked annotators to assess the top-10 candidate
passages from an ensemble model per query,
resulting in richer annotations compared to those
of Mr. TYDI, which were mechanistically generated
from TYDI QA as no new annotations were
performed. Furthermore, we believe that explicitly
judged negative examples are quite valuable, compared
to, for example, implicit negatives in MS
MARCO sampled from BM25 results, as recent
work has demonstrated the importance of so-called
‘‘hard negative’’ mining (Karpukhin et al., 2020;
Xiong et al., 2021; Qu et al., 2021; Santhanam
et al., 2021; Formal et al., 2021). As the candidate
passages come from diverse models, they are particularly
suitable for feeding various contrastive
techniques.
Inconsistent Passages. Since passage-level relevance
annotations in Mr. TYDI were derived from
TYDI QA, it retained exactly those same passages.
However, as TYDI QA was not originally designed
for retrieval, it did not provide consistent passage
segmentation for all Wikipedia articles. Thus, the
Mr. TYDI corpora comprised a mix of TYDI QA
passages and custom segments that were heuristically
adjusted to ‘‘cover’’ the entire raw Wikipedia
dumps. This inconsistent segmentation is a weakness
of Mr. TYDI that we rectified in MIRACL by
re-segmenting the Wikipedia articles provided by
TYDI QA and re-building the relevance mapping
for these passages (more details in Section 4.2).
No Typologically Similar Languages. The 11
languages included in TYDI QA encompass a
broad range of linguistic typologies by design,
belonging to 11 different sub-families from 9 different
families. However, this means that they are
all quite distant from one another. In contrast,
new languages added to MIRACL include typologically
similar languages. The inclusion of these
languages is crucial because their presence can
better foster research on cross-lingual comparisons.
For example, cross-lingual transfer can be
more effectively studied when more similar languages
are present. We explore this issue further
in Section 5.3.
Summary. Overall, not only is MIRACL an order
of magnitude larger than Mr. TYDI, as shown
in Table 1, but MIRACL goes beyond simply
scaling up the dataset to correcting many shortcomings
observed in Mr. TYDI. The result is a
larger, higher-quality, and richer dataset to support
monolingual retrieval in diverse languages.
3 Dataset Overview
MIRACL is a multilingual retrieval dataset that
spans 18 different languages, focusing on the
monolingual retrieval task. In total, we have
gathered over 726k manual relevance judgments
(i.e., query–passage pairs) for 78k queries across
Wikipedia in these languages, where all assessments
have been performed by native speakers.
There are sufficient examples in MIRACL to train
and evaluate neural retrieval models. Detailed
statistics for MIRACL are shown in Table 2.
Among the 18 languages in MIRACL, 11 are
existing languages that are originally covered
by Mr. TYDI, where we take advantage of the
Mr. TYDI queries as a starting point. Resources
for the other 7 new languages are created from
scratch. We generated new queries for all languages,
but since we inherited Mr. TYDI queries,
fewer new queries were generated for the 11 existing
languages as these languages already had
sizable training and development sets. For all languages,
we provide far richer annotations with
respect to Wikipedia passages. That is, compared
to Mr. TYDI, where each query has on average
only a single positive (relevant) passage (Zhang
et al., 2021), MIRACL provides far more positive
as well as negative labels for all queries.
Here, we provide an overview of the data splits;
details of fold construction will be introduced in
Section 4.2.
• Training and development sets: For the
11 existing Mr. TYDI languages, the training
(development) sets comprise subsets of the
queries from the Mr. TYDI training (development)
sets. The main difference is that
MIRACL provides richer annotations for more
passages. For the new languages, the training
(development) data consist entirely of
newly generated queries.
• Test-A sets: Similar to the training and development
sets, the test-A sets align with the
test sets in Mr. TYDI. However, the test-A
sets exist only for the existing languages.
Lang        ISO  Train # Q  Train # J  Dev # Q  Dev # J  Test-A # Q  Test-A # J  Test-B # Q  Test-B # J   # Passages  # Articles  Avg. Q Len  Avg. P Len
Arabic      ar       3,495     25,382    2,896   29,197         936       9,325       1,405      14,036    2,061,414     656,982           6          54
Bengali     bn       1,631     16,754      411    4,206         102       1,037       1,130      11,286      297,265      63,762           7          56
English     en       2,863     29,416      799    8,350         734       5,617       1,790      18,241   32,893,221   5,758,285           7          65
Finnish     fi       2,897     20,350    1,271   12,008       1,060      10,586         711       7,100    1,883,509     447,815           5          41
Indonesian  id       4,071     41,358      960    9,668         731       7,430         611       6,098    1,446,315     446,330           5          49
Japanese    ja       3,477     34,387      860    8,354         650       6,922       1,141      11,410    6,953,614   1,133,444          17         147
Korean      ko         868     12,767      213    3,057         263       3,855       1,417      14,161    1,486,752     437,373           4          38
Russian     ru       4,683     33,921    1,252   13,100         911       8,777         718       7,174    9,543,918   1,476,045           6          46
Swahili     sw       1,901      9,359      482    5,092         638       6,615         465       4,620      131,924      47,793           7          36
Telugu      te       3,452     18,608      828    1,606         594       5,948         793       7,920      518,079      66,353           5          51
Thai        th       2,972     21,293      733    7,573         992      10,432         650       6,493      542,166     128,179          42         358
Spanish*    es       2,162     21,531      648    6,443           –           –       1,515      15,074   10,373,953   1,669,181           8          66
Persian*    fa       2,107     21,844      632    6,571           –           –       1,476      15,313    2,207,172     857,827           8          49
French*     fr       1,143     11,426      343    3,429           –           –         801       8,008   14,636,953   2,325,608           7          55
Hindi*      hi       1,169     11,668      350    3,494           –           –         819       8,169      506,264     148,107          10          69
Chinese*    zh       1,312     13,113      393    3,928           –           –         920       9,196    4,934,368   1,246,389          11         121
German*     de           –          –      305    3,144           –           –         712       7,317   15,866,222   2,651,352           7          58
Yoruba*     yo           –          –      119    1,188           –           –         288       2,880       49,043      33,094           8          28
Total               40,203    343,177   13,495  130,408       7,611      76,544      17,362     174,496  106,332,152  19,593,919          10          77
Table 2: Descriptive statistics for all languages in MIRACL, organized by split. # Q: number of queries;
# J: number of labels (relevant and non-relevant); # Passages: number of passages; # Articles: number
of Wikipedia articles from which the passages were drawn; Avg. Q Len: average number of tokens
per query; Avg. P Len: average number of tokens per passage. Tokens are based on characters for th,
ja, and zh, otherwise delimited by whitespace. The new languages in MIRACL (underlined in the
original layout) are marked here with an asterisk.
• Test-B sets: For all languages, the test-B sets
are composed entirely of new queries that
have never been released before (compared
to test-A sets, whose queries ultimately draw
from TYDI QA and thus have been publicly
available for quite some time now). These
queries can be viewed as a true held-out
test set.
Although the training, development, and test-A
sets of MIRACL languages that overlap with
Mr. TYDI align with Mr. TYDI, in some cases
there are fewer queries in MIRACL than in their
corresponding Mr. TYDI splits. This is because
our annotators were unable to find relevant pas-
sages for some queries from Mr. TYDI; we call
these ‘‘invalid’’ queries and removed them from
the corresponding MIRACL splits (details in
Section 5.2).
To evaluate the quality of system outputs,
we use standard information retrieval metrics,
nDCG@10 and Recall@100, which are defined
as follows:
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}    (1)
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{iDCG@}k}    (2)
\mathrm{Recall@}k = \frac{\sum_{i=1}^{k} rel_i}{k}    (3)
where rel_i = 1 if d_i, the passage at rank i, is
relevant to the query and 0 otherwise, and iDCG@k
is the DCG@k of the ideal ranked list (i.e., with
binary judgments, all relevant documents appear
before non-relevant documents). We set k to 10
and 100 for the two metrics, respectively, to arrive
at nDCG@10 and Recall@100. These serve as the
official metrics of MIRACL.
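As an illustration of Equations (1) and (2), the following is a minimal Python sketch of nDCG@k with binary judgments. The function names and the `num_relevant` argument (the number of passages judged relevant for the query) are assumptions made for this example; it is not the official evaluation code.

```python
import math


def dcg_at_k(rels, k):
    """DCG@k for a ranked list of binary labels (rank 1 first), as in Eq. (1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(rels, num_relevant, k):
    """nDCG@k as in Eq. (2).

    The ideal list places all judged-relevant passages ahead of the
    non-relevant ones, so it holds min(num_relevant, k) ones in the top k.
    """
    ideal = [1] * min(num_relevant, k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(rels, k) / idcg if idcg > 0 else 0.0


# Relevant passages at ranks 1 and 3, two relevant passages overall.
score = ndcg_at_k([1, 0, 1], num_relevant=2, k=10)
```

A ranking that places every relevant passage first scores 1.0; pushing a relevant passage further down the list lowers the score through the logarithmic discount.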
In addition to releasing the evaluation resources,
we organized a WSDM Cup challenge at the
WSDM 2023 conference to encourage participa-
tion from the community, where the test-B sets
were used for final evaluation. Among the 18 lan-
guages, German (de) and Yoruba (yo), the bottom
block in Table 2, were selected as surprise languages,
while the rest are known languages. This
distinction was created for the official WSDM
competition. Whereas data from the known lan-
guages were released in October 2022, the iden-
tity of the surprise languages was concealed until
two weeks before the competition deadline in
January 2023. For the known languages, participants
were given ample data and time to train
language-specific models. On the other hand, for
the surprise languages, no training splits were
provided to specifically evaluate retrieval under a
limited data and time condition. However, since
the WSDM Cup challenge has concluded, the distinction
between surprise and known languages is
no longer relevant.
Figure 3: Diagram illustrating the MIRACL annotation workflow.
4 Dataset Construction
To build MIRACL, we hired native speakers as
annotators to provide high-quality queries and
relevance judgments. At a high level, our workflow
comprised two phases: First, the annotators
were asked to generate well-formed queries based
on ‘‘prompts’’ (Clark et al., 2020) (details in
Section 4.2). Then, they were asked to assess the
relevance of the top-k query–passage pairs produced
by an ensemble baseline retrieval system.
An important feature of MIRACL is that our
dataset was not constructed via crowd-sourced
workers, unlike other previous efforts such as
SQuAD (Rajpurkar et al., 2016). Instead, we hired
31 annotators (both part-time and full-time) across
all languages. Each annotator was interviewed
prior to being hired and was verified to be a native
speaker of the language they were working in. Our
team created a consistent onboarding process that
included carefully crafted training sessions. We
began interviewing annotators in mid-April 2022;
dataset construction began in late April 2022 and
continued until the end of September 2022.
Throughout the annotation process, we con-
stantly checked randomly sampled data to monitor
annotation quality (see Section 4.3). Whenever
issues were detected, we promptly communicated
with the annotators to clarify the problems so that
they were resolved expeditiously. Constant inter-
actions with the annotators helped minimize er-
rors and misunderstandings. We believe that this
design yielded higher quality annotations than what
could have been obtained by crowd-sourcing.
In total, MIRACL represents over 10k hours
of assessor effort, or around five person-years.
We offered annotators an hourly rate of $18.50 (converted into USD). For reference, the local minimum wage is $11.50 USD/hr.
4.1 Corpora Preparation
For each MIRACL language, we prepared a pre-
segmented passage corpus from a raw Wikipedia
dump. For the existing languages in Mr. TYDI,
we used exactly the same raw Wikipedia dump
as Mr. TYDI and TYDI QA from early 2019. For
the new languages, we used releases from March
2022. We parsed the Wikipedia articles using
WikiExtractor1 and segmented them into passages
based on natural discourse units using two consec-
utive newlines in the wiki markup as the delimiter.
Query–passage pairs formed the basic annotation
unit (Step
in Figure 3).
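The segmentation rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name is hypothetical, the input is assumed to be the plain text of one article after extraction, and treating runs of more than two newlines as a single delimiter is an assumption not stated in the text.

```python
import re


def segment_passages(article_text):
    """Split one article into passages at two (or more) consecutive newlines.

    Assumption: empty chunks and surrounding whitespace are discarded.
    """
    chunks = re.split(r"\n{2,}", article_text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


passages = segment_passages("First paragraph.\n\nSecond paragraph.\n\n\nThird.")
```

Each resulting passage would then be paired with a query to form one annotation unit.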
4.2 Annotation Workflow
MIRACL was created using a two-phase annotation
process adapted from TYDI QA, which was
in turn built on best practices derived from Clark
et al. (2020). The two phases are query generation
and relevance assessment.
Query Generation. In Phase I, annotators were
shown ‘‘prompts’’, comprising the first 100 words
of randomly selected Wikipedia articles that provide
contexts to elicit queries. Following Clark
et al. (2020), the prompts are designed to help
annotators write queries for which they ‘‘seek an
answer.’’
1https://github.com/attardi/wikiextractor.
To generate high-quality queries, we asked
annotators to avoid generating queries that are
directly answerable by the prompts themselves.
They were asked to generate well-formed natural
language queries likely (in their opinion) to have
precise, unambiguous answers. Using prompts
also alleviates the issue where the queries may
be overly biased towards their personal experiences.
We also gave the annotators the option of
skipping any prompt that did not ‘‘inspire’’ any
queries. This process corresponds to Step
in Figure 3.
Note that in this phase, annotators were asked
to generate queries in batch based on the prompts,
and did not proceed to the next phase before finishing
tasks in this phase. Thus, the annotators
had not yet examined any retrieval results (i.e.,
Wikipedia passages) at this point. Moreover, we
did not suggest to annotators during the entire process
that the queries should be related to Wikipedia
in order to prevent the annotators from writing
oversimplified or consciously biased queries.
Therefore, it could be the case that their queries
cannot be readily answered by the information
contained in the corpora, which we discuss in
Section 5.2.
Relevance Assessment. In the second phase,
for each query from the previous phase, we asked
the annotators to judge the binary relevance of
the top-k candidate passages (k = 10) from an
ensemble retrieval system that combines three
separate models, which corresponds to Step
in Figure 3:
• BM25 (Robertson and Zaragoza, 2009), a
traditional retrieval algorithm based on lex-
ical matching, which has been shown to be
a robust baseline when evaluated zero-shot
across domains and languages (Thakur et al.,
2021; Zhang et al., 2021). We used the im-
plementation in Anserini (Yang et al., 2018),
which is based on the open-source Lucene
search library, with default parameters and
the corresponding language-specific analyzer
in Lucene if it exists. If not, we simply used
the white space tokenizer.
• mDPR (Karpukhin et al., 2020; Zhang et al.,
2021), a single-vector dense retrieval method
that has proven to be effective for many
retrieval tasks. This model was trained using
the Tevatron toolkit (Gao et al., 2023) starting
from an mBERT checkpoint and then fine-
tuned using the training set of MS MARCO
Passage. Retrieval was performed in a zero-
shot manner.
• mColBERT (Khattab and Zaharia, 2020;
Bonifacio et al., 2021), a multi-vector dense
retrieval model that has been shown to be ef-
fective in both in-domain and out-of-domain
contexts (Thakur et al., 2021; Santhanam
et al., 2021). This model was trained using the
authors’ official repository.2 Same as mDPR,
the model was initialized from mBERT and
fine-tuned on MS MARCO Passage; retrieval
was also performed zero shot.
For each query, we retrieved the top 1000 pas-
sages using each model, then performed ensemble
fusion by first normalizing all retrieval scores to
the range [0, 1] and then averaging the scores. A fi-
nal ranked list was then generated from these new
scores. Based on initial experiments, we found that
annotating the top 10 passages per query yielded
a good balance in terms of obtaining diverse
passages and utilizing annotator effort.
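The fusion step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text says only that scores were normalized to [0, 1] and averaged, so the use of min-max normalization, the choice to count a passage missing from a model's list as contributing 0, and all function names are assumptions of this example.

```python
def min_max_normalize(scores):
    """Rescale one model's scores to [0, 1] (min-max scaling is an assumption)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pid: 1.0 for pid in scores}  # degenerate case: all scores equal
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def fuse(runs, top_k=10):
    """Average normalized scores across models and return the top_k passages.

    `runs` is a list of {passage_id: score} dicts, one per model. A passage
    absent from a model's list contributes 0 for that model (assumption).
    """
    normalized = [min_max_normalize(run) for run in runs if run]
    if not normalized:
        return []
    totals = {}
    for run in normalized:
        for pid, score in run.items():
            totals[pid] = totals.get(pid, 0.0) + score
    averaged = {pid: total / len(normalized) for pid, total in totals.items()}
    # Sort by descending fused score, breaking ties by passage id.
    return sorted(averaged.items(), key=lambda item: (-item[1], item[0]))[:top_k]


bm25 = {"p1": 10.0, "p2": 5.0, "p3": 0.0}
dense = {"p2": 0.9, "p1": 0.3, "p4": 0.1}
fused = fuse([bm25, dense], top_k=2)
```

In the setting described above, each model would contribute its top 1000 passages and `top_k` would be 10.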
As mentioned in Section 2.4, we rectified the
inconsistent passage segmentation in Mr. TYDI
by re-segmenting the Wikipedia articles of TYDI
QA. The downside, unfortunately, is that anno-
tated passages from Mr. TYDI may no longer exist
in the MIRACL corpora. Thus, to take advantage
of existing annotations, for queries from the 11
existing languages in Mr. TYDI, we augmented the
set of passages to be assessed with ‘‘projected’’
relevant passages. This allowed us to take ad-
vantage of existing annotations transparently in
our workflow. To accomplish this, we used rele-
vant passages from Mr. TYDI as queries to search
the corresponding MIRACL corpus using BM25.
As the passages in both corpora differ in terms
of segmentation but not content, the top retrieved
passages are likely to have substantial overlap with
the annotated passage in Mr. TYDI (and hence are
also likely to be relevant). This is shown as Step
in Figure 3.
2https://github.com/stanford-futuredata/ColBERT#colbertv1.
We used a simple heuristic to determine how
many of these results to re-assess. If the score of
the top retrieved passage from MIRACL is 50%
higher than the score of the passage ranked sec-
ond, we have reasonable confidence (based on
initial explorations) that the top passage is a good
match for the original relevant passage. In this
case, we only add the top passage to the set of
candidates that the assessor considers. Otherwise,
we add the top 5 passages.
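A small Python sketch of this selection heuristic is given below. The function name is hypothetical, and reading ‘‘50% higher’’ as ‘‘at least 1.5 times the second score’’ is an assumption; BM25 scores are assumed to be positive and the input list sorted by descending score.

```python
def select_projected(ranked, ratio=1.5, fallback=5):
    """Choose which BM25 results to add as ''projected'' candidates.

    `ranked` is a list of (passage_id, bm25_score) sorted by descending score.
    If the top score is at least `ratio` times the second score, only the top
    passage is kept; otherwise the top `fallback` passages are kept.
    """
    if not ranked:
        return []
    if len(ranked) == 1 or ranked[0][1] >= ratio * ranked[1][1]:
        return [ranked[0][0]]
    return [pid for pid, _ in ranked[:fallback]]


confident = select_projected([("a", 30.0), ("b", 10.0), ("c", 9.0)])
uncertain = select_projected(
    [("a", 12.0), ("b", 10.0), ("c", 9.0), ("d", 8.0), ("e", 7.0), ("f", 6.0)]
)
```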
Once the candidate passages are prepared, the
annotators are asked to provide a binary label
for each query–passage pair (1 = ‘‘relevant’’ and
0 = ‘‘not relevant’’). This is shown in Step
in Figure 3. Note that the ‘‘projected’’ passages
prepared from Step are not identified to the
annotators. In all cases, they simply receive a set
of passages to label per query, without any explicit
knowledge of where the passages came from.
In our design, each query–passage pair was
only labeled by one annotator, which we believe
is a better use of limited resources, compared to
the alternative of multiple judgments over fewer
queries. The manual assessment proceeded in
batches of approximately 1000 query–passage
pairs. Our annotators differed greatly in speed,
but averaged over all languages and individuals,
each batch took roughly 4.25 hours to complete.
Fold Creation and Data Release. At a high-
level, MIRACL contains two classes of queries:
those inherited from Mr. TYDI and those that were
created from scratch, from which four splits were
prepared. During the annotation process, all que-
ries followed the same workflow. After the anno-
tation process concluded, we divided MIRACL
into training sets, development sets, test-A sets,
and test-B sets, as described in Section 3. For
existing languages, the training, development, and
test-A sets align with the training, development,
and test sets in Mr. TYDI, and the test-B sets
are formed by the newly generated queries. For
the new (known) languages, all generated queries
were split into training, development, and test-B
sets with a split ratio of 50%, 15%, and 35%. Note
that there are no test-A sets for these languages.
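As an illustration of the 50%/15%/35% split for the new languages, the sketch below partitions a list of query identifiers. The text does not say how queries were assigned to folds, so the seeded random shuffle, the function name `split_queries`, and the rounding of fold sizes are assumptions of this example.

```python
import random


def split_queries(query_ids, ratios=(0.50, 0.15, 0.35), seed=0):
    """Split query ids into train / dev / test-B folds by the given ratios.

    A seeded random shuffle is an assumption of this sketch; the source
    only states the ratios.
    """
    ids = list(query_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * ratios[0])
    n_dev = round(len(ids) * ratios[1])
    return {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        # Whatever remains goes to test-B, so no query is dropped.
        "test-B": ids[n_train + n_dev:],
    }
```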
4.3 Quality Control
To ensure data quality, we implemented two
processes: automatic heuristic verification that in-
cludes simple daily checks and manual periodic
assessment executed by human reviewers on a
random subset of annotations.
4.3.1 Automatic Verification
Automatic heuristic verification was applied to
both phases daily, using language-specific heuris-
tics to flag ‘‘obviously bad’’ annotations. In Phase
I, we flagged the generated query if it was: (1)
left empty; (2) similar to previous queries in the
same language;3 (3) missing interrogative indica-
tors;4 or (4) overly short or long.5 In Phase II, we
flagged the label if it was left empty or an invalid
value (i.e., values that are not 0 or 1).
The most extreme cases of bad annotations
(e.g., empty or duplicate queries) were removed
from the dataset, while minor issues were flagged
but the data were retained. For both cases, when-
ever the metrics dropped below a reasonable
threshold,6 we scheduled a meeting with the an-
notator in question to discuss. This allowed us to
minimize the number of obvious errors while giv-
ing and receiving constructive feedback.
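The Phase I checks can be sketched as follows, using the thresholds given in the footnotes (similarity above 75%, character ranges for zh, ja, and th, token range otherwise). Several details are assumptions of this example rather than the authors' implementation: similarity is taken to be one minus the Levenshtein distance divided by the longer string's length, the interrogative check is reduced to a caller-supplied tuple of indicator strings, and the names `flag_query`, `similarity`, and `levenshtein` are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


CHAR_RANGES = {"zh": (6, 40), "ja": (6, 40), "th": (6, 120)}  # measured in characters
TOKEN_RANGE = (3, 15)  # measured in whitespace-delimited tokens


def flag_query(query, lang, previous, indicators=("?",)):
    """Return the list of heuristic flags raised for one generated query."""
    text = query.strip()
    if not text:
        return ["empty"]
    flags = []
    if any(similarity(text, p) > 0.75 for p in previous):
        flags.append("similar")
    # Real indicators are language-specific; "?" is only a placeholder.
    if not any(ind in text for ind in indicators):
        flags.append("no_interrogative")
    if lang in CHAR_RANGES:
        low, high = CHAR_RANGES[lang]
        length = len(text)
    else:
        low, high = TOKEN_RANGE
        length = len(text.split())
    if not low <= length <= high:
        flags.append("length")
    return flags
```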
4.3.2 Manual Assessment
Manual assessment was also applied to both
phases. To accomplish this, we hired another
group of native speakers of each language as
reviewers (with minor overlap). Similarly, we
provided consistent onboarding training to each
reviewer.
Phase I.
In this phase, reviewers were given
both the prompts and the generated queries, and
asked to apply a checklist to determine whether
the queries met our requirements. Criteria include
the examination of the query itself (spelling or
syntax errors, fluency, etc.) and whether the query
could be answered directly by the prompt, which
we wished to avoid (see Section 4.2).
How fluent are the queries? Review results
showed that approximately 12% of the questions
3 We measured similarity using Levenshtein distance. A
query was flagged if its similarity score to any previous query
was greater than 75%.
4 For example, writing system- and language-specific
interrogative punctuation and particles.
5 For zh, ja, and th, length is measured in characters,
where the expected range is [6, 40] for zh and ja, and [6, 120]
for th. Otherwise, length is measured in tokens delimited
by white space, with expected range [3, 15].
6 In practice, we set this threshold to 0.8 in terms of the
percentage of questionable annotations in each submission.
had spelling or syntax errors, or ‘‘sound artificial’’.
However, almost all of these flaws (over 99%)
did not affect the understanding of the question
itself. We thus retained these ‘‘flawed’’ queries to
reflect real-world distributions.
How related are the queries to the prompts? We
measured the lexical overlap between the queries
and their corresponding prompts to understand
their connections. Our analysis shows approximately
two words of overlap on average, which
is consistent with statistics reported in Clark et al.
(2020). Overlaps primarily occur in entities or
stopwords. We thus conclude that the generated
queries are reasonably different from the given
prompts.
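A lexical-overlap measurement of this kind could be sketched as below. The tokenization (lowercased `\w+` matches) and the choice to count distinct shared word types are assumptions of this example; the text does not specify how overlap was computed, and the function names are hypothetical.

```python
import re


def lexical_overlap(query, prompt):
    """Number of distinct word types shared by a query and its prompt."""
    query_words = set(re.findall(r"\w+", query.lower()))
    prompt_words = set(re.findall(r"\w+", prompt.lower()))
    return len(query_words & prompt_words)


def mean_overlap(pairs):
    """Average overlap over an iterable of (query, prompt) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (query, prompt) pair")
    return sum(lexical_overlap(q, p) for q, p in pairs) / len(pairs)
```

Whitespace-free scripts such as Chinese, Japanese, and Thai would need a language-specific tokenizer in place of the regular expression.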
Phase II.
In this phase, reviewers were provided
the same guidance as annotators performing the
relevance assessment. They were asked to (inde-
pendently) label a randomly sampled subset of
the query–passage pairs. The degree of agreement
on the overlapping pairs is used to quantify the
quality of the relevance labels. As a high-level
summary, we observe average agreements of over
80% on query–passage relevance.
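Assuming that ‘‘agreement’’ here means simple percent agreement on the pairs labeled by both parties (the text does not name a specific statistic), the computation can be sketched as follows. The dictionary layout keyed by `(query_id, passage_id)` and the function name are hypothetical.

```python
def agreement_rate(annotator_labels, reviewer_labels):
    """Fraction of shared query-passage pairs that received the same label.

    Both arguments map (query_id, passage_id) to a binary label (0 or 1).
    Only pairs present in both mappings are compared.
    """
    shared = annotator_labels.keys() & reviewer_labels.keys()
    if not shared:
        raise ValueError("no overlapping query-passage pairs")
    matches = sum(annotator_labels[key] == reviewer_labels[key] for key in shared)
    return matches / len(shared)
```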
Why are there disagreements? As is well-
known from the IR literature dating back many
decades (Voorhees, 1998), real-world annotators
disagree due to a number of ‘‘natural’’ reasons.
To further understand why, we randomly sampled
five query–passage pairs per language for closer
examination. After some iterative sense-making,
we arrived at the conclusion that disagreements
come from both mislabeling and partially rele-
vant pairs. We show examples of the two cases
in Figure 4a and Figure 4b.
Figure 4a depicts a common mislabeling case,
where a passage starting with misleading informa-
tion is labeled as non-relevant. We observe more
mislabeling errors from the reviewers compared
to the annotators, which may be attributed to the
annotators’ greater commitment to the task and
more experience in performing it. That is, in these
cases, the reviewers are more often ‘‘wrong’’ than
the original assessors.
Partially relevant annotations can take on vari-
ous forms. The most frequent case is exemplified
by Figure 4b, where a passage is related to the
query but does not provide a precise answer. Differences
in ‘‘thresholds’’ for relevance are inevitable
under a binary relevance system. In addition,
disagreements often arise when an annotator labels
a passage as non-relevant, but a reviewer
labels it as relevant. These cases suggest that annotators
exercise greater caution and are more
‘‘strict’’ when they are uncertain. Overall, these
analyses convinced us that the generated queries
and annotations are of high quality.
Figure 4: Examples of disagreements between annotators
and reviewers.
5 MIRACL Analysis
To better characterize MIRACL, we present three
separate analyses of the dataset we have built.
5.1 Question Words
We first analyze question types, following Clark
et al. (2020). Table 3 shows the distribution
of question words in the English dataset of
MIRACL, along with SQuAD (Rajpurkar et al.,
2016) as a point of reference. The MIRACL statis-
tics are grouped by splits in Table 3. The test-B
column shows the distribution of the new que-
ries generated with the workflow described in
Section 4.2, whereas the other columns corre-
spond to queries derived from Mr. TYDI.
                     MIRACL
          Train   Dev   Test-A   Test-B   SQuAD
HOW        15%    17%    19%       6%      12%
WHAT       26%    27%    31%      34%      51%
WHEN       26%    25%    20%       6%       8%
WHERE       6%     7%     4%       6%       5%
WHICH       1%     1%     2%      25%       5%
WHO        13%    11%     9%      19%      11%
WHY         1%     1%     1%       1%       2%
YES/NO      6%     7%     4%       6%      <1%
Table 3: Query distribution of each split in English
MIRACL, compared to SQuAD.
We observe a more balanced distribution of
question words in MIRACL compared to SQuAD.
In addition, the question words highlight a dis-
tribution shift between Mr. TYDI and the new
queries. More specifically, the test-B split contains
more WHICH and WHO queries, while containing
fewer HOW and WHEN queries, compared to the
other splits inherited from Mr. TYDI. This also
shows that exactly replicating a query generation
process is challenging, even when closely follow-
ing steps outlined in prior work.
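A distribution like the one in Table 3 could be approximated with a simple first-token rule, sketched below for English queries. The rule itself, the auxiliary-verb list used to detect yes/no questions, and the ‘‘OTHER’’ bucket are assumptions of this example; the text only states that the analysis follows Clark et al. (2020).

```python
from collections import Counter

QUESTION_WORDS = ("HOW", "WHAT", "WHEN", "WHERE", "WHICH", "WHO", "WHY")
# Hypothetical list of auxiliaries that open English yes/no questions.
AUXILIARIES = {
    "is", "are", "was", "were", "do", "does", "did", "can", "could",
    "will", "would", "has", "have", "had", "should",
}


def question_type(query):
    """Classify an English query by its first token."""
    tokens = query.strip().split()
    if not tokens:
        return "OTHER"
    first = tokens[0].strip("?,.!\"'").lower()
    if first.upper() in QUESTION_WORDS:
        return first.upper()
    if first in AUXILIARIES:
        return "YES/NO"
    return "OTHER"


def question_type_distribution(queries):
    """Map each question type to its share of the given queries."""
    counts = Counter(question_type(q) for q in queries)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

Queries that fall outside the eight listed categories land in ‘‘OTHER’’, which is consistent with the columns of Table 3 not summing to 100%.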
5.2 Queries with No Relevant Passages
Recall that since the query generation and rel-
evance assessment phases were decoupled, for
some fraction of queries in each language, an-
notators were not able to identify any relevant
passage in the pairs provided to them. Because
we aimed to have at least one relevant passage
per query, these queries—referred to as ‘‘invalid’’
for convenience—were discarded from the final
dataset.
For each language, we randomly sampled five
invalid queries and spot-checked them using a
variety of tools, including interactive searching
on the web. From this, we identified a few com-
mon reasons for the lack of relevant passages in
Wikipedia: (1) query not asking for factual infor-
mation, for example, ‘‘Kuinka nuoret voivat osal-
listua suomalaisessa yhteiskunnassa?’’ (Finnish,
‘‘How can young people participate in Finnish
society?’’); (2) query being too specific, for example,
a Chinese query asking ‘‘What is the new population
of Washington in 2021?’’; (3) inadequate Wikipedia content,
where the query seems reasonable, but the rele-
vant information could not be found in Wikipedia
due to the low-resource nature of the language
(even after interactive search with Google), for
example, ‘‘Bawo ni aaye Tennis se tobi to?’’
(Yoruba, ‘‘How big is the tennis court?’’).
We note that just because no relevant passage
was found in the candidates presented to the an-
notators, it is not necessarily the case that no rel-
evant passage exists in the corpus. However,
short of exhaustively assessing the corpus (obvi-
ously impractical), we cannot conclude this with
certainty. Nevertheless, for dataset consistency
we decided to discard ‘‘invalid’’ queries from
MIRACL.
5.3 Discussion of MIRACL Languages
We provide a discussion of various design choices
made in the selection of languages included in
MIRACL, divided into considerations of linguistic
diversity and demographic diversity.
Linguistic Diversity. One important considera-
tion in the selection of languages is diversity from
the perspective of linguistic characteristics. This
was a motivating design principle in TYDI QA,
which we built on in MIRACL. Our additional
languages introduced similar pairs as well, open-
ing up new research opportunities. A summary
can be found in Table 4.
Families and sub-families provide a natural
approach to group languages based on historical
origin. Lineage in this respect explains a large
amount of resemblance between languages, for
example, house in English and haus in German
(Greenberg, 1960). The 18 languages in MIRACL
are from 10 language families and 13 sub-families,
shown in the first two columns of Table 4.
The notion of synthesis represents an impor-
tant morphological typology, which categorizes
languages based on how words are formed from
morphemes, the smallest unit of meaning. The
languages in MIRACL are balanced across the
three synthesis types (analytic, agglutinative, and
fusional), which form a spectrum with increasing
complexity in word formation when moving from
analytic to fusional languages. See more discus-
sion in Plank (1999), Dawson et al. (2016), and
Nübling (2020). Additional features of the languages
are summarized in Table 4, including the
written scripts, word order, use of white space for
delimiting tokens, and grammatical gender.
Table 4: Characteristics of languages in MIRACL, including language (sub-)family, script, linguistic
typologies (synthesis, word order, and gender), number of speakers (L1 and L2), and number of
Wikipedia articles. Under White Space, a check mark indicates the language uses white space as the token
delimiter; under Gender, a check mark indicates that gender is evident in the language. Under # Speakers and
Wikipedia Size, cells are highlighted column-wise with the color gradient ranging from green (high
values) to red (low values), where the Wikipedia Size column is identical to the # Articles column in
Table 2. Underlined are the new languages not included in Mr. TYDI.
In our context, linguistic diversity is critical to
answering important questions about the trans-
fer capabilities of multilingual language models
(Pires et al., 2019), including whether certain typo-
logical characteristics are intrinsically challenging
for neural language models (Gerz et al., 2018) and
how to incorporate typologies to improve the ef-
fectiveness of NLP tasks (Ponti et al., 2019; Jones
et al., 2021). While researchers have examined
these research questions, most studies have not
been in the context of retrieval specifically.
In this respect, MIRACL can help advance the
state of the art. Consider language family, for ex-
ample: Our resource can be used to compare the
transfer capacity between languages at different
‘‘distances’’ in terms of language kinship. While
Mr. TYDI provides some opportunities to explore
this question, the additional languages in MIR-
ACL enrich the types of studies that are possible.
For example, we might consider both contrastive
pairs (i.e., those that are very typologically dif-
ferent) as well as similar pairs (i.e., those that
are closer) in the context of multilingual language
models.
Similar questions abound for synthesis char-
acteristics, written script, word order, etc. For
example, we have three exemplars in the Indo-
Iranian sub-family (Bengali, Hindi, and Persian):
Despite lineal similarities, these languages use
different scripts. How does ‘‘cross-script’’ rel-
evance transfer work in multilingual language
models? We begin to explore some of these
questions in Section 6.2.
Demographic Diversity. The other important
consideration in our choice of languages is the
demographic distribution of language speakers.
As noted in the Introduction, information access
is a fundamental human right that ought to extend
to every inhabitant of our planet, regardless of the
languages they speak.
We attempt to quantify this objective in the
final two columns of Table 4, which presents
statistics of language speakers and Wikipedia ar-
ticles. We count both L1 and L2 speakers.7 Both
columns are highlighted column-wise based on
the value, from green (high values) to red (low
values). We can use the size of Wikipedia in that
language as a proxy for the amount of language
resources that are available. We see that in many
7 L1 speakers are native speakers; L2 speakers are those
who learned the language later in life.
                           nDCG@10                                    Recall@100
ISO      BM25   mDPR   Hyb.   mCol.  mCon.  in-L.    BM25   mDPR   Hyb.   mCol.  mCon.  in-L.
ar       0.481  0.499  0.673  0.571  0.525  0.649    0.889  0.841  0.941  0.908  0.925  0.904
bn       0.508  0.443  0.654  0.546  0.501  0.593    0.909  0.819  0.932  0.913  0.921  0.917
en       0.351  0.394  0.549  0.388  0.364  0.413    0.819  0.768  0.882  0.801  0.797  0.751
fi       0.551  0.472  0.672  0.465  0.602  0.649    0.891  0.788  0.895  0.832  0.953  0.907
id       0.449  0.272  0.443  0.298  0.392  0.414    0.904  0.573  0.768  0.669  0.802  0.823
ja       0.369  0.439  0.576  0.496  0.424  0.570    0.805  0.825  0.904  0.895  0.878  0.880
ko       0.419  0.419  0.609  0.487  0.483  0.472    0.783  0.737  0.900  0.722  0.875  0.807
ru       0.334  0.407  0.532  0.477  0.391  0.521    0.661  0.797  0.874  0.866  0.850  0.850
sw       0.383  0.299  0.446  0.358  0.560  0.644    0.701  0.616  0.725  0.692  0.911  0.909
te       0.494  0.356  0.602  0.462  0.528  0.781    0.831  0.762  0.857  0.830  0.961  0.957
th       0.484  0.358  0.599  0.481  0.517  0.628    0.887  0.678  0.823  0.845  0.936  0.902
es       0.319  0.478  0.641  0.426  0.418  0.409    0.702  0.864  0.948  0.842  0.841  0.783
fa       0.333  0.480  0.594  0.460  0.215  0.469    0.731  0.898  0.937  0.910  0.654  0.821
fr       0.183  0.435  0.523  0.267  0.314  0.376    0.653  0.915  0.965  0.730  0.824  0.823
hi       0.458  0.383  0.616  0.470  0.286  0.458    0.868  0.776  0.912  0.884  0.646  0.777
zh       0.180  0.512  0.526  0.398  0.410  0.515    0.560  0.944  0.959  0.908  0.903  0.883
de       0.226  0.490  0.565  0.334  0.408  –        0.572  0.898  0.898  0.803  0.841  –
yo       0.406  0.396  0.374  0.561  0.415  –        0.733  0.715  0.715  0.917  0.770  –
K. Avg   0.394  0.415  0.578  0.441  0.433  0.535    0.787  0.788  0.889  0.828  0.855  0.856
S. Avg   0.316  0.443  0.470  0.448  0.412  –        0.653  0.807  0.807  0.860  0.806  –
Avg      0.385  0.418  0.566  0.441  0.431  –        0.772  0.790  0.880  0.832  0.849  –
Table 5: Baseline results on the MIRACL dev set, where ‘‘K. Avg’’ and ‘‘S. Avg’’ indicate the average
scores over the known (ar–zh) and surprise languages (de–yo). Hyb.: Hybrid results of BM25 and
mDPR; mCol.: mColBERT; mCon.: mContriever; in-L.: in-language fine-tuned mDPR.
cases, there are languages that have many speakers
but are poor in resources. Particularly noteworthy
examples include Telugu, Indonesian, Swahili,
Yoruba, Thai, Bengali, and Hindi. We hope that
the inclusion of these languages will catalyze interest
in multilingual retrieval and in turn benefit
large populations whose languages have
been historically overlooked by the mainstream
IR research community.
6 Experimental Results
6.1 Baselines
As neural retrieval models have gained in sophis-
tication in recent years, the ‘‘software stack’’ for
end-to-end systems has grown more complex. This
has increased the barrier to entry for ‘‘newcom-
ers’’ who wish to start working on multilingual
retrieval. We believe that the growth of the diver-
sity of languages introduced in MIRACL should
be accompanied by an increase in the diversity of
participants.
To that end, we make available in the popular
Pyserini IR toolkit (Lin et al., 2021a) several base-
lines to serve as foundations that others can build
on. Baseline scores for these retrieval models are
shown in Table 5 in terms of the two official
retrieval metrics of MIRACL. The baselines in-
clude the three methods used in the ensemble
system introduced in Section 4.2 (BM25, mDPR,
mColBERT) plus the following approaches:
• Hybrid combines the scores of BM25 and
mDPR results. For each query–passage pair,
the hybrid score is computed as
$s_{\text{Hybrid}} = \alpha \cdot s_{\text{BM25}} + (1 - \alpha) \cdot s_{\text{mDPR}}$,
where we set $\alpha = 0.5$ without tuning. Scores of BM25
and mDPR ($s_{\text{BM25}}$ and $s_{\text{mDPR}}$) are first
normalized to $[0, 1]$.
• mContriever (Izacard et al., 2022) adopts
additional pretraining with contrastive loss
based on unsupervised data prepared from
CCNet (Wenzek et al., 2020), which demon-
strates improved effectiveness in down-
stream IR tasks. We used the authors’
released multilingual checkpoint, where the
model was further fine-tuned on English MS
MARCO after additional pretraining.8
8 https://huggingface.co/facebook/mcontriever-msmarco
Figure 5: Case study of how language kinship affects cross-lingual transfer. Results are grouped according to
the target language and each bar indicates a source language (used for fine-tuning mDPR). Within each panel,
the relationship between the source and target languages moves from closer to more distant from left to right
(self, same language sub-family, same language family, different language family). Striped bars denote that the
source language has a different script from the target language. The exact nDCG@10 score is shown at the top
of each bar.
• In-language fine-tuned mDPR follows the
same model configuration as the mDPR base-
line, but we fine-tuned each model on the
MIRACL training set of the target language
rather than MS MARCO. Here, the negative
examples are sampled from the labeled neg-
atives and the unlabeled top-30 candidates
from BM25.
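The score interpolation used by the Hybrid baseline above can be sketched as follows. The text states only that scores are normalized to [0, 1] and combined with α = 0.5; the use of min-max normalization per query, the treatment of a passage missing from one result list as having a normalized score of 0, and the function names are assumptions of this example and may differ from the released Pyserini baselines.

```python
def min_max_normalize(scores):
    """Rescale a {passage_id: score} mapping to [0, 1] (assumed min-max scheme)."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores equal: treat every passage as maximally scored.
        return {pid: 1.0 for pid in scores}
    return {pid: (s - low) / (high - low) for pid, s in scores.items()}


def hybrid_scores(bm25, mdpr, alpha=0.5):
    """Interpolate normalized BM25 and mDPR scores for one query."""
    norm_bm25 = min_max_normalize(bm25)
    norm_mdpr = min_max_normalize(mdpr)
    passage_ids = norm_bm25.keys() | norm_mdpr.keys()
    return {
        # A passage absent from one list contributes 0 from that retriever.
        pid: alpha * norm_bm25.get(pid, 0.0) + (1 - alpha) * norm_mdpr.get(pid, 0.0)
        for pid in passage_ids
    }
```

Sorting the returned mapping by descending value gives the fused ranking for the query.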
Note that all languages in MIRACL are included
in the pretraining corpus of mBERT, which pro-
vides the backbone of the mDPR and mColBERT
models. However, three languages (fa, hi, yo)
are not included in CCNet (Wenzek et al., 2020),
the dataset used by mContriever for additional
pretraining. Code, documentation, and instruc-
tions for reproducing these baselines have been
released together with the MIRACL dataset and
can be found on the MIRACL website.
These results provide a snapshot of the current
state of research. We see that across these di-
verse languages, mDPR does not substantially
outperform decades-old BM25 technology, al-
though BM25 exhibits much lower effective-
ness in French and Chinese. Nevertheless, the
BM25–mDPR hybrid provides a strong zero-shot
baseline that outperforms all the other individ-
ual models on average nDCG@10. Interestingly,
the BM25–mDPR hybrid even outperforms in-
language fine-tuned mDPR on most of the lan-
guages, except for the ones that are comparatively
under-represented in mBERT pretraining (e.g.,
sw, th, hi). Overall, these results show that
plenty of work remains to be done to advance
multilingual retrieval.
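For readers who want to sanity-check numbers like those in Table 5, the two metrics can be computed per query as sketched below and then averaged over queries. This uses the standard definitions with binary relevance labels; it is an illustration under that assumption, not the official evaluation script, and the function names are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary gains: each relevant passage contributes 1/log2(rank+1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, pid in enumerate(ranked_ids[:k], start=1)
        if pid in relevant_ids
    )
    # Ideal ranking places all relevant passages (up to k) at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the relevant passages that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)
```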
6.2 Cross-Lingual Transfer Effects
As suggested in Section 5.3, MIRACL enables
the exploration of the linguistic factors influ-
encing multilingual transfer; in this section, we
present a preliminary study. Specifically, we eval-
uate cross-lingual transfer effectiveness on two
groups of target languages: (A) {bn, hi, fa} and
(B) {es, fr}, where the models are trained on dif-
ferent source languages drawn from (with respect
to the target language): (1) different language
families, (2) same language family but different
sub-families, and (3) same sub-family. We add the
‘‘self’’ condition as an upper-bound, where the
model is trained on data in the target language
itself.
The evaluation groups are divided based on
the script of the language: in group (A), all the
source languages are in a different script from
the target languages, whereas in group (B), all
the source languages are in the same script as the
target language (Latin script in this case). In
these experiments, we reused checkpoints from
the ‘‘in-language fine-tuned mDPR’’ condition in
Section 6.1. For example, when the source lan-
guage is hi and the target language is bn, we
encode the bn query and the bn corpus using the
hi checkpoint; this corresponds to the hi row,
in-L. column in Table 5.
The results are visualized in Figure 5, where
each panel corresponds to the target language in-
dicated in the header. Each bar within a panel
represents a different source language and the
y-axis shows the zero-shot nDCG@10 scores.
Within each panel, from left to right, the kinship
relationship between the source and target lan-
guages moves from close to distant, where the
source languages that are at the same distance to
the target languages are colored the same (self,
same language sub-family, same language family,
different language family). Striped bars denote
that the source language is in a different script
from the target language.
We find that languages from the same sub-
families show better transfer capabilities in gen-
eral, regardless of the underlying script. Among
the five languages studied, those from the same
sub-families achieve the best transfer scores (i.e.,
the orange bars are the tallest). The only excep-
tion appears to be when evaluating bn using fi
as the source language (the leftmost panel). Inter-
estingly, the source language being in the same
family (but a different sub-family) does not ap-
pear to have an advantage over the languages
in different families (i.e., the green bars versus
the pink bars). This suggests that transfer effects
only manifest when the languages are closely
related within a certain degree of kinship.
We also observe that transfer effects do not
appear to be symmetrical between languages. For
example, the model trained on bn achieves 0.483
and 0.549 on hi and fa, respectively, which are
over 90% of the ‘‘self’’ score on hi and fa (the
first orange bar in the second and third panels).
However, models trained on hi and fa do not
generalize on bn to the same degree, scoring less
than 80% of the ‘‘self’’ score on bn (the two
orange bars in the first panel).
We emphasize that this is merely a preliminary
study on cross-lingual transfer effects with re-
spect to language families and scripts, but our
experiments show the potential of MIRACL for
further understanding multilingual models.
7 Conclusion
In this work, we present MIRACL, a new
high-quality multilingual retrieval dataset that
represents approximately five person-years of
annotation effort. We provide baselines and present
initial explorations demonstrating the potential of
MIRACL for studying interesting scientific ques-
tions. Although the WSDM Cup challenge asso-
ciated with our efforts has ended, we continue to
host a leaderboard to encourage continued parti-
cipation from the community.
While MIRACL represents a significant stride
towards equitable information access, further
efforts are necessary to accomplish this objective.
One obvious future direction is to extend
MIRACL to include more languages, espe-
cially low-resource ones. Another possibility
is to augment MIRACL with cross-lingual re-
trieval support. However, pursuing these avenues
demands additional expenses and manual labor.
Nevertheless, MIRACL already provides a
valuable resource to support numerous research
directions: It offers a solid testbed for building and
evaluating multilingual versions of dense retrieval
models (Karpukhin et al., 2020), late-interaction
models (Khattab and Zaharia, 2020), as well
as reranking models (Nogueira and Cho, 2019;
Nogueira et al., 2020), and will further acceler-
ate progress in multilingual retrieval research (Shi
et al., 2020; MacAvaney et al., 2020; Nair et al.,
2022; Zhang et al., 2022). Billions of speakers
of languages that have received relatively little
attention from researchers stand to benefit from
improved information access.
Acknowledgments
The authors wish to thank the anonymous re-
viewers for their valuable feedback. We would
also like to thank our annotators, without whom
MIRACL could not have been built. This research
was supported in part by the Natural Sciences
and Engineering Research Council (NSERC) of
Canada, a gift from Huawei, and Cloud TPU sup-
port from Google’s TPU Research Cloud (TRC).
References
Akari Asai, Jungo Kasai, Jonathan Clark, Kenton
Lee, Eunsol Choi, and Hannaneh Hajishirzi.
2021. XOR QA: Cross-lingual open-retrieval
question answering. In Proceedings of the
2021 Conference of the North American Chapter
of the Association for Computational
Linguistics: Human Language Technologies,
pages 547–564, Online. https://doi.org/10.18653/v1/2021.naacl-main.46
Ehud Alexander Avner, Noam Ordan, and Shuly
Wintner. 2016. Identifying translationese at
the word and sub-word level. Digital Scholarship
in the Humanities, 31(1):30–54. https://doi.org/10.1093/llc/fqu047
Payal Bajaj, Daniel Campos, Nick Craswell, Li
Deng, Jianfeng Gao, Xiaodong Liu, Rangan
Majumder, Andrew McNamara, Bhaskar Mitra,
Tri Nguyen, Mir Rosenberg, Xia Song, Alina
Stoica, Saurabh Tiwary, and Tong Wang.
2018. MS MARCO: A human generated MAchine
Reading COmprehension dataset. arXiv:1611.09268v3.
Luiz Henrique Bonifacio, Israel Campiotti,
Vitor Jeronymo, Roberto Lotufo, and Rodrigo
Nogueira. 2021. mMARCO: A multilingual
version of the MS MARCO passage ranking
dataset. arXiv:2108.13897.
Jonathan H. Clark, Eunsol Choi, Michael Collins,
Dan Garrette, Tom Kwiatkowski, Vitaly
Nikolaev, and Jennimaria Palomaki. 2020.
TyDi QA: A benchmark for information-seeking
question answering in typologically
diverse languages. Transactions of the Association
for Computational Linguistics, 8:454–470.
https://doi.org/10.1162/tacl_a_00317
Nick Craswell, Bhaskar Mitra, Daniel Campos,
Emine Yilmaz, and Jimmy Lin. 2021. MS
MARCO: Benchmarking ranking models in the
large-data regime. In Proceedings of the 44th
Annual International ACM SIGIR Conference
on Research and Development in Information
Retrieval (SIGIR 2021), pages 1566–1576.
https://doi.org/10.1145/3404835.3462804
Hope Dawson, Antonio Hernandez, and Cory
Shain. 2016. Morphological types of languages.
In Language Files: Materials for an Introduc-
tion to Language and Linguistics, 12th Edition.
Department of Linguistics, The Ohio State
University. https://doi.org/10.26818/9780814252703
Sauleh Eetemadi and Kristina Toutanova. 2014.
Asymmetric features of human generated trans-
lation. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 159–164, Doha,
Qatar. https://doi.org/10.3115/v1/D14-1018
Thibault Formal, Benjamin Piwowarski, and
Stéphane Clinchant. 2021. SPLADE: Sparse
lexical and expansion model for first stage
ranking. In Proceedings of the 44th International
ACM SIGIR Conference on Research
and Development in Information Retrieval,
pages 2288–2292, New York, NY, USA.
Luyu Gao, Xueguang Ma, Jimmy Lin, and
Jamie Callan. 2023. Tevatron: An efficient
and flexible toolkit for Neural Retrieval. In
Proceedings of the 46th International ACM SI-
GIR Conference on Research and Development
in Information Retrieval, pages 3120–3124.
Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti,
Roi Reichart, and Anna Korhonen. 2018. On the
relation between linguistic typology and (limi-
tations of) multilingual language modeling. In
Proceedings of the 2018 Conference on Empir-
ical Methods in Natural Language Processing,
pages 316–327, Brussels, Belgium. https://doi.org/10.18653/v1/D18-1029
Joseph Harold Greenberg. 1960. A quantita-
tive approach to the morphological typology
of language. International Journal of American
Linguistics, 26:178–194. https://doi.org/10.1086/464575
Junjie Hu, Sebastian Ruder, Aditya Siddhant,
Graham Neubig, Orhan Firat, and Melvin
Johnson. 2020. XTREME: A massively mul-
tilingual multi-task benchmark for evaluating
cross-lingual generalisation. In Proceedings of
the 37th International Conference on Machine
Learning, pages 4411–4421.
Gautier Izacard, Mathilde Caron, Lucas Hosseini,
Sebastian Riedel, Piotr Bojanowski, Armand
Joulin, and Edouard Grave. 2022. Unsuper-
vised dense information retrieval with con-
trastive learning. Transactions on Machine
Learning Research.
Alexander Jones, William Yang Wang, and Kyle
Mahowald. 2021. A massively multilingual
analysis of cross-linguality in shared embed-
ding space. In Proceedings of the 2021 Con-
ference on Empirical Methods in Natural
Language Processing, pages 5833–5847, On-
line and Punta Cana, Dominican Republic.
https://doi.org/10.18653/v1/2021.emnlp-main.471
Mandar Joshi, Eunsol Choi, Daniel Weld, and
Luke Zettlemoyer. 2017. TriviaQA: A large
scale distantly supervised challenge dataset
for reading comprehension. In Proceedings
of the 55th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1:
Long Papers), pages 1601–1611, Vancouver,
Canada. https://doi.org/10.18653/v1/P17-1147
Pratik Joshi, Sebastin Santy, Amar Budhiraja,
Kalika Bali, and Monojit Choudhury. 2020.
The state and fate of linguistic diversity and
inclusion in the NLP world. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 6282–6293,
Online. https://doi.org/10.18653/v1/2020.acl-main.560
Vladimir Karpukhin, Barlas Oguz, Sewon Min,
Patrick Lewis, Ledell Wu, Sergey Edunov,
Danqi Chen, and Wen-tau Yih. 2020. Dense
passage retrieval for open-domain question
answering. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 6769–6781.
https://doi.org/10.18653/v1/2020.emnlp-main.550
Omar Khattab and Matei Zaharia. 2020. Col-
BERT: Efficient and effective passage search
via contextualized late interaction over BERT.
In Proceedings of the 43rd Annual Interna-
tional ACM SIGIR Conference on Research
and Development in Information Retrieval
(SIGIR 2020), pages 39–48. https://doi
.org/10.1145/3397271.3401075
Tom Kwiatkowski, Jennimaria Palomaki, Olivia
Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein, Illia Polosukhin,
Jacob Devlin, Kenton Lee, Kristina Toutanova,
Llion Jones, Matthew Kelcey, Ming-Wei
Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc
Le, and Slav Petrov. 2019. Natural Questions:
A benchmark for question answering research.
Transactions of the Association for Computa-
tional Linguistics, 7:452–466. https://doi
.org/10.1162/tacl_a_00276
Dawn Lawrie, Sean MacAvaney, James Mayfield,
Paul McNamee, Douglas W. Oard, Luca
Soldaini, and Eugene Yang. 2023. Overview of
the TREC 2022 NeuCLIR track. In Proceedings
of the 31st Text REtrieval Conference.
Gennadi Lembersky, Noam Ordan, and Shuly
Wintner. 2012. Language models for machine
translation: Original vs. translated texts. Compu-
tational Linguistics, 38(4):799–825. https://
doi.org/10.1162/COLI_a_00111
Jimmy Lin, Daniel Campos, Nick Craswell,
Bhaskar Mitra, and Emine Yilmaz. 2022. Fos-
tering coopetition while plugging leaks: The
design and implementation of the MS MARCO
leaderboards. In Proceedings of the 45th An-
nual International ACM SIGIR Conference
on Research and Development in Information
Retrieval (SIGIR 2022), pages 2939–2948,
Madrid, Spain. https://doi.org/10.1145
/3477495.3531725
Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin,
Jheng-Hong Yang, Ronak Pradeep, and
Rodrigo Nogueira. 2021a. Pyserini: A Python
toolkit for reproducible information retrieval
research with sparse and dense representations.
In Proceedings of the 44th Annual Interna-
tional ACM SIGIR Conference on Research
and Development in Information Retrieval
(SIGIR 2021), pages 2356–2362. https://doi
.org/10.1145/3404835.3463238
Jimmy Lin, Rodrigo Nogueira, and Andrew Yates.
2021b. Pretrained Transformers for Text Rank-
ing: BERT and Beyond. Morgan & Claypool
Publishers. https://doi.org/10.1007
/978-3-031-02181-7
Shayne Longpre, Yi Lu, and Joachim Daiber.
2021. MKQA: A linguistically diverse bench-
mark for multilingual open domain question
answering. Transactions of the Association
for Computational Linguistics, 9:1389–1406.
https://doi.org/10.1162/tacl_a_00433
Sean MacAvaney, Luca Soldaini, and Nazli
Goharian. 2020. Teaching a new dog old tricks:
Resurrecting multilingual retrieval using zero-
shot learning. In Proceedings of the 42nd Eu-
ropean Conference on Information Retrieval,
Part II (ECIR 2020), pages 246–254. https://
doi.org/10.1007/978-3-030-45442-5_31
Suraj Nair, Eugene Yang, Dawn Lawrie, Kevin
Duh, Paul McNamee, Kenton Murray, James
Mayfield, and Douglas W. Oard. 2022. Transfer
learning approaches for building cross-language
dense retrieval models. In Proceedings of the
44th European Conference on Information Re-
trieval (ECIR 2022), Part I, pages 382–396,
Stavanger, Norway. https://doi.org/10
.1007/978-3-030-99736-6_26
Rodrigo Nogueira and Kyunghyun Cho. 2019.
Passage re-ranking with BERT. arXiv:1901.04085.
Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep,
and Jimmy Lin. 2020. Document ranking with
a pretrained sequence-to-sequence model. In
Findings of the Association for Computational
Linguistics: EMNLP 2020, pages 708–718,
Online. https://doi.org/10.18653/v1
/2020.findings-emnlp.63
Damaris Nübling. 2020. Inflectional morphology.
In The Cambridge Handbook of Germanic
Linguistics. Cambridge University Press.
https://doi.org/10.1017/9781108378291
.011
Telmo Pires, Eva Schlinger, and Dan Garrette.
2019. How multilingual is multilingual BERT?
In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 4996–5001, Florence, Italy. https://
doi.org/10.18653/v1/P19-1493
Frans Plank. 1999. Split morphology: How agglu-
tination and flexion mix. Linguistic Typology,
3:279–340. https://doi.org/10.1515
/lity.1999.3.3.279
Edoardo Maria Ponti, Helen O’Horan, Yevgeni
Berzak, Ivan Vulić, Roi Reichart, Thierry
Poibeau, Ekaterina Shutova, and Anna
Korhonen. 2019. Modeling language variation
and universals: A survey on typological
linguistics for natural language processing.
Computational Linguistics, 45(3):559–601.
https://doi.org/10.1162/coli_a_00357
Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu,
Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong,
Hua Wu, and Haifeng Wang. 2021. Rocket-
QA: An optimized training approach to dense
passage retrieval for open-domain question an-
swering. In Proceedings of the 2021 Conference
of the North American Chapter of the Associ-
ation for Computational Linguistics: Human
Language Technologies, pages 5835–5847.
https://aclanthology.org/2021.naacl
-main.466
Ella Rabinovich and Shuly Wintner. 2015.
Unsupervised identification of translationese.
Transactions of the Association for Computa-
tional Linguistics, 3:419–432. https://doi
.org/10.1162/tacl_a_00148
Pranav Rajpurkar, Jian Zhang, Konstantin
Lopyrev, and Percy Liang. 2016. SQuAD:
100,000+ questions for machine comprehension
of text. In Proceedings of the 2016
Conference on Empirical Methods in Natu-
ral Language Processing, pages 2383–2392,
Austin, Texas. https://doi.org/10.18653
/v1/D16-1264
Stephen Robertson and Hugo Zaragoza. 2009.
The probabilistic relevance framework: BM25
and beyond. Foundations and Trends in In-
formation Retrieval, 3(4):333–389. https://
doi.org/10.1561/1500000019
Keshav Santhanam, Omar Khattab, Jon Saad-
Falcon, Christopher Potts, and Matei Zaharia.
2021. ColBERTv2: Effective and efficient re-
trieval via lightweight late interaction. In Pro-
ceedings of the 2022 Conference of the North
American Chapter of the Association for Com-
putational Linguistics: Human Language Tech-
nologies, pages 3715–3734, Seattle, United
States. https://doi.org/10.18653/v1
/2022.naacl-main.272
Peng Shi, He Bai, and Jimmy Lin. 2020. Cross-
lingual training of neural models for document
ranking. In Findings of the Association for
Computational Linguistics: EMNLP 2020,
pages 2768–2773, Online. https://doi.org
/10.18653/v1/2020.findings-emnlp.249
Shuo Sun and Kevin Duh. 2020. CLIRMatrix: A
massively large collection of bilingual and mul-
tilingual datasets for cross-lingual information
retrieval. In Proceedings of the 2020 Confer-
ence on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 4160–4170,
Online. https://doi.org/10.18653/v1
/2020.emnlp-main.340
Nandan Thakur, Nils Reimers, Andreas Rücklé,
Abhishek Srivastava, and Iryna Gurevych.
2021. BEIR: A heterogeneous benchmark for
zero-shot evaluation of information retrieval
models. In Neural Information Processing
Systems: Datasets and Benchmarks Track.
Vered Volansky, Noam Ordan, and Shuly
Wintner. 2015. On the features of translation-
ese. Digital Scholarship in the Humanities,
30(1):98–118. https://doi.org/10.1093
/llc/fqt031
Ellen M. Voorhees. 1998. Variations in relevance
judgments and the measurement of retrieval
effectiveness. In Proceedings of the
21st Annual International ACM SIGIR Con-
ference on Research and Development in Infor-
mation Retrieval (SIGIR 1998), pages 315–323,
Melbourne, Australia. https://doi.org
/10.1145/290941.291017
Guillaume Wenzek, Marie-Anne Lachaux, Alexis
Conneau, Vishrav Chaudhary, Francisco
Guzmán, Armand Joulin, and Edouard Grave.
2020. CCNet: Extracting high quality mono-
lingual datasets from web crawl data. In Pro-
ceedings of the Twelfth Language Resources
and Evaluation Conference, pages 4003–4012,
Marseille, France.
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung
Tang, Jialin Liu, Paul N. Bennett, Junaid
Ahmed, and Arnold Overwijk. 2021. Approx-
imate nearest neighbor negative contrastive
learning for dense text retrieval. In Proceed-
ings of the 9th International Conference on
Learning Representations (ICLR 2021).
Peilin Yang, Hui Fang, and Jimmy Lin. 2018.
Anserini: Reproducible ranking baselines us-
ing Lucene. Journal of Data and Information
Quality, 10(4):Article 16. https://doi.org
/10.1145/3239571
Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy
Lin. 2021. Mr. TyDi: A multi-lingual bench-
mark for dense retrieval. In Proceedings of
the 1st Workshop on Multilingual Represen-
tation Learning, pages 127–137, Punta Cana,
Dominican Republic. https://doi.org
/10.18653/v1/2021.mrl-1.12
Xinyu Zhang, Kelechi Ogueji, Xueguang Ma, and
Jimmy Lin. 2022. Towards best practices for
training multilingual dense retrieval models.
arXiv:2204.02363.