MIRACL: A Multilingual Retrieval Dataset Covering
18 Diverse Languages

Xinyu Zhang1∗, Nandan Thakur1∗, Odunayo Ogundepo1, Ehsan Kamalloo1†,

David Alfonso-Hermelo2, Xiaoguang Li3, Qun Liu3, Mehdi Rezagholizadeh2, Jimmy Lin1
1David R. Cheriton School of Computer Science, University of Waterloo, Canada

2Huawei Noah’s Ark Lab, Canada
3Huawei Noah’s Ark Lab, China

Abstract

MIRACL is a multilingual dataset for ad hoc
retrieval across 18 languages that collectively
encompass over three billion native speakers
around the world. This resource is designed
to support monolingual retrieval tasks, where
the queries and the corpora are in the same
language. In total, we have gathered over
726k high-quality relevance judgments for 78k
queries over Wikipedia in these languages,
where all annotations have been performed by
native speakers hired by our team. MIRACL
covers languages that are both typologically
close as well as distant from 10 language fami-
lies and 13 sub-families, associated with vary-
ing amounts of publicly available resources.
Extensive automatic heuristic verification and
manual assessments were performed during
the annotation process to control data quality.
In total, MIRACL represents an investment of
around five person-years of human annotator
effort. Our goal is to spur research on improv-
ing retrieval across a continuum of languages,
thus enhancing information access capabili-
ties for diverse populations around the world,
particularly those that have traditionally been
underserved. MIRACL is available at http://
miracl.ai/.

1 Introduction

Information access is a fundamental human right.
Specifically, the Universal Declaration of Human
Rights by the United Nations articulates that ‘‘ev-
eryone has the right to freedom of opinion and
expression’’, which includes the right ‘‘to seek,
receive, and impart information and ideas through
any media and regardless of frontiers’’ (Article 19).
Information access capabilities such as search,

∗ Equal contribution.
† Work done while at Huawei Noah’s Ark Lab.

question answering, summarization, and recom-
mendation are important technologies for safe-
guarding these ideals.

With the advent of deep learning in NLP, IR,
and beyond, the importance of large datasets as
drivers of progress is well understood (Lin et al.,
2021b). For retrieval models in English, the MS
MARCO datasets (Bajaj et al., 2018; Craswell
et al., 2021; Lin et al., 2022) have had a transfor-
mative impact in advancing the field. Similarly,
for question answering (QA), there exist many
resources in English, such as SQuAD (Rajpurkar
et al., 2016), TriviaQA (Joshi et al., 2017), and
Natural Questions (Kwiatkowski et al., 2019).

We have recently witnessed many efforts in
building resources for non-English languages,
for example, CLIRMatrix (Sun and Duh, 2020),
XTREME (Hu et al., 2020), MKQA (Longpre
et al., 2021), mMARCO (Bonifacio et al., 2021),
TYDI QA (Clark et al., 2020), XOR-TYDI (Asai
et al., 2021), and Mr. TYDI (Zhang et al., 2021).
These initiatives complement multilingual re-
trieval evaluations from TREC, CLEF, NTCIR,
and FIRE that focus on specific language pairs.
Nevertheless, there remains a paucity of resources
for languages beyond English. Existing datasets
are far from sufficient to fully develop informa-
tion access capabilities for the 7000+ languages
spoken on our planet (Joshi et al., 2020). Our goal
is to take a small step towards addressing these
issues.

To stimulate further advances in multilingual
retrieval, we have built the MIRACL dataset on
top of Mr. TYDI (Zhang et al., 2021), comprising
human-annotated passage-level relevance judg-
ments on Wikipedia for 18 languages, totaling
over 726k query–passage pairs for 78k queries.
These languages are written using 11 distinct
scripts, originate from 10 different language fam-
ilies, and collectively encompass more than three
billion native speakers around the world. They include what would typically be characterized as high-resource languages as well as low-resource languages. Figure 1 shows a sample Thai query with a relevant and a non-relevant passage. In total, the MIRACL dataset represents over 10k hours, or about five person-years, of annotator effort.

Along with the dataset, our broader efforts included organizing a competition at the WSDM 2023 conference that provided a common evaluation methodology, a leaderboard, and a venue for a competition-style event with prizes. To provide starting points that the community can rapidly build on, we also share reproducible baselines in the Pyserini IR toolkit (Lin et al., 2021a).

Compared to existing datasets, MIRACL provides more thorough and robust annotations and broader coverage of the languages, which include both typologically diverse and similar language pairs. We believe that MIRACL can serve as a high-quality training and evaluation dataset for the community, advance retrieval effectiveness in diverse languages, and answer interesting scientific questions about cross-lingual transfer in the multilingual retrieval context.

Figure 1: Examples of annotated query–passage pairs in Thai (th) from MIRACL.

2 Background and Related Work

The focus of this work is the standard ad hoc retrieval task in information retrieval, where given a corpus C, the system's task is to return for a given query q an ordered list of top-k passages from C that maximizes some standard quality metric such as nDCG. A query q is a well-formed natural language question in some language Ln, and the passages are drawn from a corpus Cn in the same language. Thus, our focus is monolingual retrieval across diverse languages, where the queries and the corpora are in the same language (e.g., Thai queries searching Thai passages), as opposed to cross-lingual retrieval, where the queries and the corpora are in different languages (e.g., searching a Swahili corpus with Arabic queries).

As mentioned in the Introduction, there has been work over the years on building resources for retrieval in non-English languages. Below, we provide a thorough comparison between MIRACL and these efforts, with an overview in Table 1.

2.1 Comparison to Traditional IR Collections

Historically, there have been monolingual retrieval evaluations of search tasks in non-English languages, for example, at TREC, FIRE, CLEF, and NTCIR. These community evaluations typically release test collections built from newswire articles, which provide only dozens of queries with modest amounts of relevance judgments and are insufficient for training neural retrieval models. The above organizations also
provide evaluation resources for cross-lingual re-
trieval, but they cover relatively few language
pairs. For example, the recent TREC 2022 Neu-
CLIR Track (Lawrie et al., 2023) evaluates only
three languages (Chinese, Persian, and Russian)
in a cross-lingual setting.

2.2 Comparison to Multilingual QA Datasets

There are also existing datasets for multilingual
QA. For example, XOR-TYDI (Asai et al., 2021) is
a cross-lingual QA dataset built on TYDI by anno-
tating answers in English Wikipedia for questions
that TYDI considers unanswerable in the original
source (non-English) language. This setup, unfor-
tunately, does not allow researchers to examine
monolingual retrieval in non-English languages.

Another point of comparison is MKQA
(Longpre et al., 2021), which comprises 10k ques-
tion–answer pairs aligned across 26 typologically
diverse languages. Questions are paired with exact
answers in the different languages, and evalua-
tion is conducted in the open-retrieval setting by
matching those answers in retrieved text—thus,
MKQA is not a ‘‘true’’ retrieval dataset. Further-
more, because the authors translated questions to
achieve cross-lingual alignment, the translations
may not be ‘‘natural’’, as pointed out by Clark
et al. (2020).


| Dataset Name | Natural Queries | Natural Passages | Human Labels | # Lang | Avg # Q | Avg # Labels/Q | Total # Labels | Training? |
|---|---|---|---|---|---|---|---|---|
| NeuCLIR (Lawrie et al., 2023) | ✓ | ✓ | ✓ | 3 | 160 | 32.74 | 5.2k | × |
| MKQA (Longpre et al., 2021) | × | ✓ | ✓ | 26 | 10k | 1.35 | 14k | × |
| mMARCO (Bonifacio et al., 2021) | × | × | ✓ | 13 | 808k | 0.66 | 533k | ✓ |
| CLIRMatrix (Sun and Duh, 2020) | × | ✓ | × | 139 | 352k | 693 | 34B | ✓ |
| Mr. TYDI (Zhang et al., 2021) | ✓ | ✓ | ✓ | 11 | 6.3k | 1.02 | 71k | ✓ |
| MIRACL (our work) | ✓ | ✓ | ✓ | 18 | 4.3k | 9.23 | 726k | ✓ |

Table 1: Comparison of select multilingual retrieval datasets. Natural Queries and Natural Passages indicate whether the queries and passages are ‘‘natural’’, i.e., generated by native speakers (vs. human- or machine-translated), and for queries, in natural language (vs. keywords or entities); Human Labels indicates human-generated labels (vs. synthetically generated labels); # Lang is the number of languages supported; Avg # Q is the average number of queries for each language; Avg # Labels/Q is the average number of labels provided per query; Total # Labels is the total number of human labels (both positive and negative) across all languages (including synthetic labels in CLIRMatrix). Training? indicates whether the dataset provides sufficient data for training neural models.

2.3 Comparison to Synthetic Datasets

Since collecting human relevance labels is la-
borious and costly, other studies have adopted
workarounds to build multilingual datasets. For
example, Bonifacio et al. (2021) automatically
translated the MS MARCO dataset (Bajaj et al.,
2018) from English into 13 other languages. How-
ever, translation is known to cause inadvertent
artifacts such as ‘‘translationese’’ (Clark et al.,
2020; Lembersky et al., 2012; Volansky et al.,
2015; Avner et al., 2016; Eetemadi and Toutanova,
2014; Rabinovich and Wintner, 2015) and may
lead to training data of questionable value.

Alternatively, Sun and Duh (2020) built syn-
thetic bilingual retrieval datasets in a resource
called CLIRMatrix based on the parallel struc-
ture of Wikipedia that covers 139 languages.
Constructing datasets automatically by exploiting
heuristics has the virtue of not requiring expen-
sive human annotations and can be easily scaled up
to cover many languages. However, such datasets
are inherently limited by the original resource they
are built from. For instance, in CLIRMatrix, the
queries are the titles of Wikipedia articles, which
tend to be short phrases such as named entities.
Also, multi-degree judgments in the dataset are di-
rectly converted from BM25 scores, which creates
an evaluation bias towards lexical approaches.

2.4 Comparison to Mr. TYDI

Since MIRACL inherits from Mr. TYDI, it makes sense to discuss important differences between the two: Mr. TYDI (Zhang et al., 2021) is a human-labeled retrieval dataset built atop TYDI QA (Clark et al., 2020), covering 11 typologically diverse languages. While Mr. TYDI enables the training and evaluation of monolingual retrieval, it has three shortcomings we aimed to address in MIRACL.

Figure 2: Examples of missing relevant passages in TYDI QA (and thus Mr. TYDI) for the query ‘‘Which is the largest marine animal?’’ Since only relevant passages in the top Wikipedia article are included, other relevant passages are missed.

Limited Positive Passages.
In TYDI QA, and thus
Mr. TYDI, candidate passages for annotation are
selected only from the top-ranked Wikipedia article
based on a Google search. Consequently, a con-
siderable number of relevant passages that exist
in other Wikipedia articles are ignored; Figure 2
illustrates an instance of this limitation. In contrast,
MIRACL addresses this issue by sourcing can-
didate passages from all of Wikipedia, ensuring
that our relevant passages are diverse.

Moreover, in MIRACL, we went a step further
and asked annotators to assess the top-10 candi-
date passages from an ensemble model per query,
resulting in richer annotations compared to those
of Mr. TYDI, which were mechanistically gener-
ated from TYDI QA as no new annotations were
performed. Furthermore, we believe that explicitly
judged negative examples are quite valuable, com-
pared to, for example, implicit negatives in MS
MARCO sampled from BM25 results, as recent
work has demonstrated the importance of so-called
‘‘hard negative’’ mining (Karpukhin et al., 2020;
Xiong et al., 2021; Qu et al., 2021; Santhanam
et al., 2021; Formal et al., 2021). As the candidate
passages come from diverse models, they are par-
ticularly suitable for feeding various contrastive
techniques.

Inconsistent Passages. Since passage-level rel-
evance annotations in Mr. TYDI were derived from
TYDI QA, it retained exactly those same passages.
However, as TYDI QA was not originally designed
for retrieval, it did not provide consistent passage
segmentation for all Wikipedia articles. Thus, the
Mr. TYDI corpora comprised a mix of TYDI QA
passages and custom segments that were heuristi-
cally adjusted to ‘‘cover’’ the entire raw Wikipedia
dumps. This inconsistent segmentation is a weak-
ness of Mr. TYDI that we rectified in MIRACL by
re-segmenting the Wikipedia articles provided by
TYDI QA and re-building the relevance mapping
for these passages (more details in Section 4.2).

No Typologically Similar Languages. The 11
languages included in TYDI QA encompass a
broad range of linguistic typologies by design,
belonging to 11 different sub-families from 9 dif-
ferent families. However, this means that they are
all quite distant from one another. In contrast,
new languages added to MIRACL include typo-
logically similar languages. The inclusion of these
languages is crucial because their presence can
better foster research on cross-lingual compari-
sons. For example, cross-lingual transfer can be
more effectively studied when more similar lan-
guages are present. We explore this issue further
in Section 5.3.

Summary. Overall, not only is MIRACL an or-
der of magnitude larger than Mr. TYDI, as shown

in Table 1, but MIRACL goes beyond simply
scaling up the dataset to correcting many short-
comings observed in Mr. TYDI. The result is a
larger, higher-quality, and richer dataset to support
monolingual retrieval in diverse languages.

3 Dataset Overview

MIRACL is a multilingual retrieval dataset that
spans 18 different languages, focusing on the
monolingual retrieval
task. In total, we have
gathered over 726k manual relevance judgments
(i.e., query–passage pairs) for 78k queries across
Wikipedia in these languages, where all assess-
ments have been performed by native speakers.
There are sufficient examples in MIRACL to train
and evaluate neural retrieval models. Detailed
statistics for MIRACL are shown in Table 2.

Among the 18 languages in MIRACL, 11 are
existing languages that are originally covered
by Mr. TYDI, where we take advantage of the
Mr. TYDI queries as a starting point. Resources
for the other 7 new languages are created from
scratch. We generated new queries for all lan-
guages, but since we inherited Mr. TYDI queries,
fewer new queries were generated for the 11 ex-
isting languages as these languages already had
sizable training and development sets. For all lan-
guages, we provide far richer annotations with
respect to Wikipedia passages. That is, compared
to Mr. TYDI, where each query has on average
only a single positive (relevant) passage (Zhang
et al., 2021), MIRACL provides far more pos-
itive as well as negative labels for all queries.
Here, we provide an overview of the data splits;
details of fold construction will be introduced in
Section 4.2.

• Training and development sets: For the
11 existing Mr. TYDI languages, the training
(development) sets comprise subsets of the
queries from the Mr. TYDI training (devel-
opment) sets. The main difference is that
MIRACL provides richer annotations for more
passages. For the new languages, the train-
ing (development) data consist entirely of
newly generated queries.

• Test-A sets: Similar to the training and de-
velopment sets, the test-A sets align with the
test sets in Mr. TYDI. However, the test-A
sets exist only for the existing languages.


| Lang | ISO | Train # Q | Train # J | Dev # Q | Dev # J | Test-A # Q | Test-A # J | Test-B # Q | Test-B # J | # Passages | # Articles | Avg. Q Len | Avg. P Len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Arabic | ar | 3,495 | 25,382 | 2,896 | 29,197 | 936 | 9,325 | 1,405 | 14,036 | 2,061,414 | 656,982 | 6 | 54 |
| Bengali | bn | 1,631 | 16,754 | 411 | 4,206 | 102 | 1,037 | 1,130 | 11,286 | 297,265 | 63,762 | 7 | 56 |
| English | en | 2,863 | 29,416 | 799 | 8,350 | 734 | 5,617 | 1,790 | 18,241 | 32,893,221 | 5,758,285 | 7 | 65 |
| Finnish | fi | 2,897 | 20,350 | 1,271 | 12,008 | 1,060 | 10,586 | 711 | 7,100 | 1,883,509 | 447,815 | 5 | 41 |
| Indonesian | id | 4,071 | 41,358 | 960 | 9,668 | 731 | 7,430 | 611 | 6,098 | 1,446,315 | 446,330 | 5 | 49 |
| Japanese | ja | 3,477 | 34,387 | 860 | 8,354 | 650 | 6,922 | 1,141 | 11,410 | 6,953,614 | 1,133,444 | 17 | 147 |
| Korean | ko | 868 | 12,767 | 213 | 3,057 | 263 | 3,855 | 1,417 | 14,161 | 1,486,752 | 437,373 | 4 | 38 |
| Russian | ru | 4,683 | 33,921 | 1,252 | 13,100 | 911 | 8,777 | 718 | 7,174 | 9,543,918 | 1,476,045 | 6 | 46 |
| Swahili | sw | 1,901 | 9,359 | 482 | 5,092 | 638 | 6,615 | 465 | 4,620 | 131,924 | 47,793 | 7 | 36 |
| Telugu | te | 3,452 | 18,608 | 828 | 1,606 | 594 | 5,948 | 793 | 7,920 | 518,079 | 66,353 | 5 | 51 |
| Thai | th | 2,972 | 21,293 | 733 | 7,573 | 992 | 10,432 | 650 | 6,493 | 542,166 | 128,179 | 42 | 358 |
| Spanish | es | 2,162 | 21,531 | 648 | 6,443 | – | – | 1,515 | 15,074 | 10,373,953 | 1,669,181 | 8 | 66 |
| Persian | fa | 2,107 | 21,844 | 632 | 6,571 | – | – | 1,476 | 15,313 | 2,207,172 | 857,827 | 8 | 49 |
| French | fr | 1,143 | 11,426 | 343 | 3,429 | – | – | 801 | 8,008 | 14,636,953 | 2,325,608 | 7 | 55 |
| Hindi | hi | 1,169 | 11,668 | 350 | 3,494 | – | – | 819 | 8,169 | 506,264 | 148,107 | 10 | 69 |
| Chinese | zh | 1,312 | 13,113 | 393 | 3,928 | – | – | 920 | 9,196 | 4,934,368 | 1,246,389 | 11 | 121 |
| German | de | – | – | 305 | 3,144 | – | – | 712 | 7,317 | 15,866,222 | 2,651,352 | 7 | 58 |
| Yoruba | yo | – | – | 119 | 1,188 | – | – | 288 | 2,880 | 49,043 | 33,094 | 8 | 28 |
| Total | | 40,203 | 343,177 | 13,495 | 130,408 | 7,611 | 76,544 | 17,362 | 174,496 | 106,332,152 | 19,593,919 | 10 | 77 |

Table 2: Descriptive statistics for all languages in MIRACL, organized by split. # Q: number of queries; # J: number of labels (relevant and non-relevant); # Passages: number of passages; # Articles: number of Wikipedia articles from which the passages were drawn; Avg. Q Len: average number of tokens per query; Avg. P Len: average number of tokens per passage. Tokens are based on characters for th, ja, and zh, otherwise delimited by whitespace. The new languages in MIRACL are es, fa, fr, hi, zh, de, and yo; dashes mark splits that do not exist for a language.

• Test-B sets: For all languages, the test-B sets
are composed entirely of new queries that
have never been released before (compared
to test-A sets, whose queries ultimately draw
from TYDI QA and thus have been publicly
available for quite some time now). These
queries can be viewed as a true held-out
test set.

Although the training, development, and test-A
sets of MIRACL languages that overlap with
Mr. TYDI align with Mr. TYDI, in some cases
there are fewer queries in MIRACL than in their
corresponding Mr. TYDI splits. This is because
our annotators were unable to find relevant pas-
sages for some queries from Mr. TYDI; we call
these ‘‘invalid’’ queries and removed them from
the corresponding MIRACL splits (details in
Section 5.2).

To evaluate the quality of system outputs,
we use standard information retrieval metrics,
nDCG@10 and Recall@100, which are defined
as follows:

DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}    (1)

nDCG@k = \frac{DCG@k}{iDCG@k}    (2)

Recall@k = \frac{1}{R} \sum_{i=1}^{k} rel_i    (3)

where rel_i = 1 if d_i (the passage at rank i) is relevant to the query and 0 otherwise, R is the total number of passages judged relevant for the query, and iDCG@k is the DCG@k of the ideal ranked list (i.e., with binary judgments, all relevant documents appear before non-relevant documents). We set k to 10 and 100 for the two metrics, respectively, to arrive at nDCG@10 and Recall@100. These serve as the official metrics of MIRACL.
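To make the definitions above concrete, the following is a minimal sketch of how the two metrics can be computed for a single query from binary judgments. It is not the official evaluation script (the released baselines rely on standard TREC-style evaluation tooling), and the function and variable names are illustrative.

```python
import math

def ndcg_at_k(ranked_rels, num_relevant, k=10):
    # ranked_rels: binary relevance (1/0) of the system's ranked passages.
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    # Ideal DCG: all relevant passages ranked ahead of non-relevant ones.
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(num_relevant, k)))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_rels, num_relevant, k=100):
    # Fraction of the query's relevant passages found in the top k.
    return sum(ranked_rels[:k]) / num_relevant if num_relevant > 0 else 0.0

# Example: 3 relevant passages exist; the system retrieves two near the top.
print(ndcg_at_k([1, 0, 1, 0, 0], num_relevant=3))   # about 0.70
print(recall_at_k([1, 0, 1, 0, 0], num_relevant=3)) # about 0.67
```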

In addition to releasing the evaluation resources,
we organized a WSDM Cup challenge at the
WSDM 2023 conference to encourage participa-
tion from the community, where the test-B sets
were used for final evaluation. Among the 18 lan-
guages, German (de) and Yoruba (yo), the bottom
block in Table 2, were selected as surprise lan-
guages, while the rest are known languages. This
distinction was created for the official WSDM
competition. Whereas data from the known lan-
guages were released in October 2022, the iden-
tity of the surprise languages was concealed until
two weeks before the competition deadline in

1118

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
5
9
5
2
1
5
7
3
4
0

/

/
t

l

a
c
_
a
_
0
0
5
9
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
5
9
5
2
1
5
7
3
4
0

/

/
t

l

a
c
_
a
_
0
0
5
9
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

January 2023. For the known languages, participants were given ample data and time to train language-specific models. On the other hand, for the surprise languages, no training splits were provided to specifically evaluate retrieval under a limited data and time condition. However, since the WSDM Cup challenge has concluded, the distinction between surprise and known languages is no longer relevant.

Figure 3: Diagram illustrating the MIRACL annotation workflow.

4 Dataset Construction

To build MIRACL, we hired native speakers as
annotators to provide high-quality queries and
relevance judgments. At a high level, our work-
flow comprised two phases: First, the annotators
were asked to generate well-formed queries based
on ‘‘prompts’’ (Clark et al., 2020) (details in
Section 4.2). Then, they were asked to assess the
relevance of the top-k query–passage pairs pro-
duced by an ensemble baseline retrieval system.

An important feature of MIRACL is that our
dataset was not constructed via crowd-sourced
workers, unlike other previous efforts such as
SQuAD (Rajpurkar et al., 2016). Instead, we hired
31 annotators (both part-time and full-time) across
all languages. Each annotator was interviewed
prior to being hired and was verified to be a native
speaker of the language they were working in. Our
team created a consistent onboarding process that
included carefully crafted training sessions. We
began interviewing annotators in mid-April 2022;
dataset construction began in late April 2022 and
continued until the end of September 2022.

Throughout the annotation process, we con-
stantly checked randomly sampled data to monitor
annotation quality (see Section 4.3). Whenever
issues were detected, we promptly communicated

with the annotators to clarify the problems so that
they were resolved expeditiously. Constant inter-
actions with the annotators helped minimize er-
rors and misunderstandings. We believe that this
design yielded higher quality annotations than what
could have been obtained by crowd-sourcing.

In total, MIRACL represents over 10k hours
of assessor effort, or around five person-years.
We offered annotators an hourly rate of $18.50 (converted into USD); for reference, the local minimum wage is $11.50 USD/hr.

4.1 Corpora Preparation

For each MIRACL language, we prepared a pre-
segmented passage corpus from a raw Wikipedia
dump. For the existing languages in Mr. TYDI,
we used exactly the same raw Wikipedia dump
as Mr. TYDI and TYDI QA from early 2019. For
the new languages, we used releases from March
2022. We parsed the Wikipedia articles using
WikiExtractor1 and segmented them into passages
based on natural discourse units using two consec-
utive newlines in the wiki markup as the delimiter.
Query–passage pairs formed the basic annotation unit (see Figure 3).

1https://github.com/attardi/wikiextractor.
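As a minimal sketch of the segmentation step just described, the snippet below splits the plain text produced by WikiExtractor into passages on blank lines. The function name and the minimum-length filter are illustrative assumptions rather than the exact preprocessing code.

```python
def segment_article(article_text, min_chars=10):
    # Split on blank lines, mirroring the two-consecutive-newline delimiter
    # used to mark natural discourse units.
    passages = [p.strip() for p in article_text.split("\n\n")]
    # Drop empty or trivially short fragments (threshold is an assumption).
    return [p for p in passages if len(p) >= min_chars]

article = "First discourse unit about a topic.\n\nSecond unit with more detail."
print(segment_article(article))  # -> two passages
```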

4.2 Annotation Workflow

MIRACL was created using a two-phase annota-
tion process adapted from TYDI QA, which was
in turn built on best practices derived from Clark
et al. (2020). The two phases are query generation
and relevance assessment.

Query Generation.
In Phase I, annotators were
shown ‘‘prompts’’, comprising the first 100 words
of randomly selected Wikipedia articles that pro-
vide contexts to elicit queries. Following Clark
et al. (2020), the prompts are designed to help
annotators write queries for which they ‘‘seek an
answer.’’

To generate high-quality queries, we asked
annotators to avoid generating queries that are
directly answerable by the prompts themselves.
They were asked to generate well-formed natural
language queries likely (in their opinion) to have
precise, unambiguous answers. Using prompts
also alleviates the issue where the queries may
be overly biased towards their personal experi-
ences. We also gave the annotators the option of
skipping any prompt that did not ‘‘inspire’’ any queries. This process corresponds to the query generation step in Figure 3.

Note that in this phase, annotators were asked
to generate queries in batch based on the prompts,
and did not proceed to the next phase before fin-
ishing tasks in this phase. Thus, the annotators
had not yet examined any retrieval results (i.e.,
Wikipedia passages) at this point. Moreover, we
did not suggest to annotators during the entire pro-
cess that the queries should be related to Wikipedia
in order to prevent
the annotators from writ-
ing oversimplified or consciously biased queries.
Therefore, it could be the case that their queries
cannot be readily answered by the information
contained in the corpora, which we discuss in
Section 5.2.

Relevance Assessment.
In the second phase,
for each query from the previous phase, we asked
the annotators to judge the binary relevance of
the top-k candidate passages (k = 10) from an
ensemble retrieval system that combines three separate models, which corresponds to the relevance assessment step in Figure 3:

• BM25 (Robertson and Zaragoza, 2009), a
traditional retrieval algorithm based on lex-
ical matching, which has been shown to be
a robust baseline when evaluated zero-shot
across domains and languages (Thakur et al.,
2021; Zhang et al., 2021). We used the im-
plementation in Anserini (Yang et al., 2018),
which is based on the open-source Lucene
search library, with default parameters and
the corresponding language-specific analyzer
in Lucene if it exists. If not, we simply used
the white space tokenizer.

• mDPR (Karpukhin et al., 2020; Zhang et al.,
2021), a single-vector dense retrieval method
that has proven to be effective for many
retrieval tasks. This model was trained using
the Tevatron toolkit (Gao et al., 2023) starting
from an mBERT checkpoint and then fine-
tuned using the training set of MS MARCO
Passage. Retrieval was performed in a zero-
shot manner.

• mColBERT (Khattab and Zaharia, 2020;
Bonifacio et al., 2021), a multi-vector dense
retrieval model that has been shown to be ef-
fective in both in-domain and out-of-domain
contexts (Thakur et al., 2021; Santhanam
et al., 2021). This model was trained using the
authors’ official repository.2 Same as mDPR,
the model was initialized from mBERT and
fine-tuned on MS MARCO Passage; retrieval
was also performed zero shot.

For each query, we retrieved the top 1000 pas-
sages using each model, then performed ensemble
fusion by first normalizing all retrieval scores to
the range [0, 1] and then averaging the scores. A fi-
nal ranked list was then generated from these new
scores. Based on initial experiments, we found that
annotating the top 10 passages per query yielded
a good balance in terms of obtaining diverse
passages and utilizing annotator effort.
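The fusion step can be sketched as follows: min-max normalize each model's scores to [0, 1], then average them. How passages missing from a given run are scored is not specified in the text, so treating them as 0 here is an assumption.

```python
def minmax_normalize(run):
    # run: dict mapping passage id -> retrieval score for one model's top-1000.
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {pid: 0.0 for pid in run}
    return {pid: (s - lo) / (hi - lo) for pid, s in run.items()}

def fuse(runs, k=10):
    # Average normalized scores across models and keep the top-k passages.
    normed = [minmax_normalize(r) for r in runs]
    pids = set().union(*(r.keys() for r in normed))
    fused = {p: sum(r.get(p, 0.0) for r in normed) / len(normed) for p in pids}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]

# Usage: fuse([bm25_run, mdpr_run, mcolbert_run]) yields the 10 candidates
# sent to annotators for one query.
```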

As mentioned in Section 2.4, we rectified the
inconsistent passage segmentation in Mr. TYDI
by re-segmenting the Wikipedia articles of TYDI
QA. The downside, unfortunately, is that anno-
tated passages from Mr. TYDI may no longer exist
in the MIRACL corpora. Thus, to take advantage
of existing annotations, for queries from the 11
existing languages in Mr. TYDI, we augmented the
set of passages to be assessed with ‘‘projected’’
relevant passages. This allowed us to take ad-
vantage of existing annotations transparently in
our workflow. To accomplish this, we used rele-
vant passages from Mr. TYDI as queries to search
the corresponding MIRACL corpus using BM25.
As the passages in both corpora differ in terms
of segmentation but not content, the top retrieved
passages are likely to have substantial overlap with
the annotated passage in Mr. TYDI (and hence are
also likely to be relevant). This projection step is also shown in Figure 3.

2https://github.com/stanford-futuredata/ColBERT#colbertv1.


We used a simple heuristic to determine how
many of these results to re-assess. If the score of
the top retrieved passage from MIRACL is 50%
higher than the score of the passage ranked sec-
ond, we have reasonable confidence (based on
initial explorations) that the top passage is a good
match for the original relevant passage. In this
case, we only add the top passage to the set of
candidates that the assessor considers. Otherwise,
we add the top 5 passages.
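A sketch of this heuristic is shown below, assuming hits are score-sorted results with .score attributes (as returned, for example, by a Pyserini searcher); the 50% margin corresponds to a 1.5x ratio between the top two scores, and the function name is illustrative.

```python
def projected_candidates(hits, margin=1.5, fallback_k=5):
    # hits: BM25 results over the MIRACL corpus, queried with the text of a
    # relevant Mr. TYDI passage, sorted by descending score.
    if len(hits) >= 2 and hits[0].score >= margin * hits[1].score:
        # Confident match for the original relevant passage: re-assess only it.
        return hits[:1]
    # Otherwise, add the top five passages to the assessment pool.
    return hits[:fallback_k]
```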

Once the candidate passages are prepared, the
annotators are asked to provide a binary label
for each query–passage pair (1 = ‘‘relevant’’ and
0 = ‘‘not relevant’’). This labeling step is also shown in Figure 3. Note that the ‘‘projected’’ passages prepared in the earlier projection step are not identified to the annotators. In all cases, they simply receive a set
of passages to label per query, without any explicit
knowledge of where the passages came from.

In our design, each query–passage pair was
only labeled by one annotator, which we believe
is a better use of limited resources, compared to
the alternative of multiple judgments over fewer
queries. The manual assessment proceeded in
batches of approximately 1000 query–passage
pairs. Our annotators differed greatly in speed,
but averaged over all languages and individuals,
each batch took roughly 4.25 hours to complete.

Fold Creation and Data Release. At a high level, MIRACL contains two classes of queries:
those inherited from Mr. TYDI and those that were
created from scratch, from which four splits were
prepared. During the annotation process, all que-
ries followed the same workflow. After the anno-
tation process concluded, we divided MIRACL
into training sets, development sets, test-A sets,
and test-B sets, as described in Section 3. For
existing languages, the training, development, and
test-A sets align with the training, development,
and test sets in Mr. TYDI, and the test-B sets
are formed by the newly generated queries. For
the new (known) languages, all generated queries
were split into training, development, and test-B
sets with a split ratio of 50%, 15%, and 35%. Note
that there are no test-A sets for these languages.
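The 50%/15%/35% split for the new known languages can be sketched as follows; the shuffling and fixed seed are assumptions for illustration rather than the exact procedure used.

```python
import random

def split_new_language_queries(query_ids, seed=0):
    ids = sorted(query_ids)
    random.Random(seed).shuffle(ids)       # assumption: random assignment
    n_train = int(0.50 * len(ids))
    n_dev = int(0.15 * len(ids))
    return {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test-B": ids[n_train + n_dev:],   # remaining ~35%
    }
```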

4.3 Quality Control

To ensure data quality, we implemented two
processes: automatic heuristic verification that in-
cludes simple daily checks and manual periodic

assessment executed by human reviewers on a
random subset of annotations.

4.3.1 Automatic Verification

Automatic heuristic verification was applied to
both phases daily, using language-specific heuris-
tics to flag ‘‘obviously bad’’ annotations. In Phase
I, we flagged the generated query if it was: (1)
left empty; (2) similar to previous queries in the
same language;3 (3) missing interrogative indica-
tors;4 or (4) overly short or long.5 In Phase II, we
flagged the label if it was left empty or an invalid
value (i.e., values that are not 0 or 1).
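The Phase I checks can be sketched as follows. difflib's ratio is used here as a stand-in for the Levenshtein-based similarity described in the footnotes at the end of this subsection, the length ranges follow those footnotes, and the language-specific interrogative-indicator check is omitted for brevity.

```python
from difflib import SequenceMatcher

def flag_query(query, previous_queries, lang):
    flags = []
    if not query.strip():
        flags.append("empty")
    # Near-duplicate check (>75% similarity to any earlier query).
    if any(SequenceMatcher(None, query, q).ratio() > 0.75 for q in previous_queries):
        flags.append("near-duplicate")
    # Length check: characters for zh/ja/th, whitespace tokens otherwise.
    if lang in {"zh", "ja"}:
        length, lo, hi = len(query), 6, 40
    elif lang == "th":
        length, lo, hi = len(query), 6, 120
    else:
        length, lo, hi = len(query.split()), 3, 15
    if not (lo <= length <= hi):
        flags.append("length-out-of-range")
    return flags
```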

The most extreme cases of bad annotations (e.g., empty or duplicate queries) were removed from the dataset, while minor issues were flagged but the data were retained. For both cases, whenever the metrics dropped below a reasonable threshold,6 we scheduled a meeting with the annotator in question to discuss. This allowed us to minimize the number of obvious errors while giving and receiving constructive feedback.

3We measured similarity using Levenshtein distance. A query was flagged if its similarity score to any previous query was greater than 75%.
4For example, writing system- and language-specific interrogative punctuation and particles.
5For zh, ja, and th, length is measured in characters, where the expected range is [6, 40] for zh and ja, and [6, 120] for th. Otherwise, length is measured in tokens delimited by white space, with expected range [3, 15].
6In practice, we set this threshold to 0.8 in terms of the percentage of questionable annotations in each submission.

4.3.2 Manual Assessment

Manual assessment was also applied to both
phases. To accomplish this, we hired another
group of native speakers of each language as
reviewers (with minor overlap). Similarly, we
provided consistent onboarding training to each
reviewer.

Phase I.
In this phase, reviewers were given
both the prompts and the generated queries, and
asked to apply a checklist to determine whether
the queries met our requirements. Criteria include
the examination of the query itself (spelling or
syntax errors, fluency, etc.) and whether the query
could be answered directly by the prompt, which
we wished to avoid (see Section 4.2).

How fluent are the queries? Review results
showed that approximately 12% of the questions
had spelling or syntax errors, or ‘‘sound artifi-
cial’’. However, almost all these flaws (over 99%)
did not affect the understanding of the question
itself. We thus retained these ‘‘flawed’’ queries to
reflect real-world distributions.

How related are the queries to the prompts? We
measured the lexical overlap between the queries
and their corresponding prompts to understand
their connections. Our analysis shows approxi-
mately two words of overlap on average, which
is consistent with statistics reported in Clark et al.
(2020). Overlaps primarily occur in entities or
stopwords. We thus conclude that the generated
queries are reasonably different from the given
prompts.

Phase II.
In this phase, reviewers were provided
the same guidance as annotators performing the
relevance assessment. They were asked to (inde-
pendently) label a randomly sampled subset of
the query–passage pairs. The degree of agreement
on the overlapping pairs is used to quantify the
quality of the relevance labels. As a high-level
summary, we observe average agreements of over
80% on query–passage relevance.

Why are there disagreements? As is well-
known from the IR literature dating back many
decades (Voorhees, 1998), real-world annotators
disagree due to a number of ‘‘natural’’ reasons.
To further understand why, we randomly sampled
five query–passage pairs per language for closer
examination. After some iterative sense-making,
we arrived at the conclusion that disagreements
come from both mislabeling and partially rele-
vant pairs. We show examples of the two cases
in Figure 4a and Figure 4b.

Figure 4a depicts a common mislabeling case,
where a passage starting with misleading informa-
tion is labeled as non-relevant. We observe more
mislabeling errors from the reviewers compared
to the annotators, which may be attributed to the
annotators’ greater commitment to the task and
more experience in performing it. That is, in these
cases, the reviewers are more often ‘‘wrong’’ than
the original assessors.

Partially relevant annotations can take on vari-
ous forms. The most frequent case is exemplified
by Figure 4b, where a passage is related to the
query but does not provide a precise answer. Dif-
ferences in ‘‘thresholds’’ for relevance are inevi-
table under a binary relevance system. Furthermore,
disagreements often arise when an annotator labels a passage as non-relevant, but a reviewer labels it as relevant. These cases suggest that annotators exercise greater caution and are more ‘‘strict’’ when they are uncertain. Overall, these analyses convinced us that the generated queries and annotations are of high quality.

Figure 4: Examples of disagreements between annotators and reviewers.

5 MIRACL Analysis

To better characterize MIRACL, we present three
separate analyses of the dataset we have built.

5.1 Question Words

We first analyze question types, following Clark
et al. (2020). Table 3 shows the distribution
of the question word in the English dataset of
MIRACL, along with SQuAD (Rajpurkar et al.,
2016) as a point of reference. The MIRACL statis-
tics are grouped by splits in Table 3. The test-B
column shows the distribution of the new que-
ries generated with the workflow described in
Section 4.2, whereas the other columns corre-
spond to queries derived from Mr. TYDI.


Table 3: Query distribution of each split in English MIRACL, compared to SQuAD. The table reports, for each of the question words HOW, WHAT, WHEN, WHERE, WHICH, WHO, WHY, and YES/NO, the percentage of queries in the MIRACL Train, Dev, Test-A, and Test-B splits and in SQuAD.

We observe a more balanced distribution of question words in MIRACL compared to SQuAD.

In addition, the question words highlight a distribution shift between Mr. TYDI and the new queries. More specifically, the test-B split contains more WHICH and WHO queries, while containing fewer HOW and WHEN queries, compared to the other splits inherited from Mr. TYDI. This also shows that exactly replicating a query generation process is challenging, even when closely following steps outlined in prior work.

5.2 Queries with No Relevant Passages

Recall that since the query generation and relevance assessment phases were decoupled, for some fraction of queries in each language, annotators were not able to identify any relevant passage in the pairs provided to them. Because we aimed to have at least one relevant passage per query, these queries (referred to as ‘‘invalid’’ for convenience) were discarded from the final dataset.

For each language, we randomly sampled five invalid queries and spot-checked them using a variety of tools, including interactive searching on the web. From this, we identified a few common reasons for the lack of relevant passages in Wikipedia: (1) the query does not ask for factual information, for example, ‘‘Kuinka nuoret voivat osallistua suomalaisessa yhteiskunnassa?’’ (Finnish, ‘‘How can young people participate in Finnish society?’’); (2) the query is too specific, for example, a Chinese query asking ‘‘What is the new population of Washington in 2021?’’; and (3) inadequate Wikipedia content, where the query seems reasonable, but the relevant information could not be found in Wikipedia due to the low-resource nature of the language (even after interactive search with Google), for example, ‘‘Bawo ni aaye Tennis se tobi to?’’ (Yoruba, ‘‘How big is the tennis court?’’).

We note that just because no relevant passage was found in the candidates presented to the annotators, it is not necessarily the case that no relevant passage exists in the corpus. However, short of exhaustively assessing the corpus (obviously impractical), we cannot conclude this with certainty. Nevertheless, for dataset consistency we decided to discard ‘‘invalid’’ queries from MIRACL.

5.3 Discussion of MIRACL Languages

We provide a discussion of various design choices made in the selection of languages included in MIRACL, divided into considerations of linguistic diversity and demographic diversity.

Linguistic Diversity. One important consideration in the selection of languages is diversity from the perspective of linguistic characteristics. This was a motivating design principle in TYDI QA, which we built on in MIRACL. Our additional languages introduced similar pairs as well, opening up new research opportunities. A summary can be found in Table 4.

Families and sub-families provide a natural approach to group languages based on historical origin. Lineage in this respect explains a large amount of resemblance between languages, for example, house in English and haus in German (Greenberg, 1960). The 18 languages in MIRACL are from 10 language families and 13 sub-families, shown in the first two columns of Table 4.

The notion of synthesis represents an important morphological typology, which categorizes languages based on how words are formed from morphemes, the smallest units of meaning.
The languages in MIRACL are balanced across the three synthesis types (analytic, agglutinative, and fusional), which form a spectrum with increasing complexity in word formation when moving from analytic to fusional languages. See more discussion in Plank (1999), Dawson et al. (2016), and Nübling (2020). Additional features of the languages are summarized in Table 4, including the written scripts, word order, use of white space for delimiting tokens, and grammatical gender.

Table 4: Characteristics of languages in MIRACL, including language (sub-)family, script, linguistic typologies (synthesis, word order, and gender), number of speakers (L1 and L2), and number of Wikipedia articles. Under White Space, ✓ indicates the language uses white space as the token delimiter; under Gender, ✓ indicates that gender is evident in the language. Under # Speakers and Wikipedia Size, cells are highlighted column-wise with the color gradient ranging from green (high values) to red (low values), where the Wikipedia Size column is identical to the # Articles column in Table 2. The new languages not included in Mr. TYDI are es, fa, fr, hi, zh, de, and yo.

In our context, linguistic diversity is critical to answering important questions about the transfer capabilities of multilingual language models (Pires et al., 2019), including whether certain typological characteristics are intrinsically challenging for neural language models (Gerz et al., 2018) and how to incorporate typologies to improve the effectiveness of NLP tasks (Ponti et al., 2019; Jones et al., 2021). While researchers have examined these research questions, most studies have not been in the context of retrieval specifically.

In this respect, MIRACL can help advance the state of the art. Consider language family, for example: Our resource can be used to compare the transfer capacity between languages at different ‘‘distances’’ in terms of language kinship. While Mr. TYDI provides some opportunities to explore this question, the additional languages in MIRACL enrich the types of studies that are possible. For example, we might consider both contrastive pairs (i.e., those that are very typologically different) as well as similar pairs (i.e., those that are closer) in the context of multilingual language models.

Similar questions abound for synthesis characteristics, written script, word order, etc. For example, we have three exemplars in the Indo-Iranian sub-family (Bengali, Hindi, and Persian): Despite lineal similarities, these languages use different scripts. How does ‘‘cross-script’’ relevance transfer work in multilingual language models? We begin to explore some of these questions in Section 6.2.

Demographic Diversity. The other important consideration in our choice of languages is the demographic distribution of language speakers. As noted in the Introduction, information access is a fundamental human right that ought to extend to every inhabitant of our planet, regardless of the languages they speak.
We attempt to quantify this objective in the final two columns of Table 4, which present statistics of language speakers and Wikipedia articles. We count both L1 and L2 speakers.7 Both columns are highlighted column-wise based on the value, from green (high values) to red (low values). We can use the size of Wikipedia in that language as a proxy for the amount of language resources that are available. We see that in many cases, there are languages with many speakers that are nonetheless poor in resources. Particularly noteworthy examples include Telugu, Indonesian, Swahili, Yoruba, Thai, Bengali, and Hindi. We hope that the inclusion of these languages will catalyze interest in multilingual retrieval and in turn benefit large populations for whom the languages have been historically overlooked by the mainstream IR research community.

7L1 are the native speakers; L2 includes other speakers who learned the language later in life.

6 Experimental Results

6.1 Baselines

As neural retrieval models have gained in sophistication in recent years, the ‘‘software stack’’ for end-to-end systems has grown more complex. This has increased the barrier to entry for ‘‘newcomers’’ who wish to start working on multilingual retrieval. We believe that the growth of the diversity of languages introduced in MIRACL should be accompanied by an increase in the diversity of participants.

To that end, we make available in the popular Pyserini IR toolkit (Lin et al., 2021a) several baselines to serve as foundations that others can build on.
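As a minimal sketch of running a BM25 baseline with Pyserini (assuming a recent Pyserini release; the prebuilt index name and the placeholder query string are assumptions and may need to be adapted to the released MIRACL artifacts):

```python
# pip install pyserini   (the Lucene backend requires a Java 11+ runtime)
from pyserini.search.lucene import LuceneSearcher

# The prebuilt index name is an assumption; alternatively, build a Lucene
# index locally from the released MIRACL corpus for the target language.
searcher = LuceneSearcher.from_prebuilt_index("miracl-v1.0-ar")
searcher.set_language("ar")  # use Lucene's Arabic analyzer, as in Section 4.2

hits = searcher.search("...", k=100)  # replace "..." with an Arabic query
for rank, hit in enumerate(hits[:10], start=1):
    print(rank, hit.docid, round(hit.score, 3))
```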
nDCG@10:

| ISO | BM25 | mDPR | Hyb. | mCol. | mCon. | in-L. |
|---|---|---|---|---|---|---|
| ar | 0.481 | 0.499 | 0.673 | 0.571 | 0.525 | 0.649 |
| bn | 0.508 | 0.443 | 0.654 | 0.546 | 0.501 | 0.593 |
| en | 0.351 | 0.394 | 0.549 | 0.388 | 0.364 | 0.413 |
| fi | 0.551 | 0.472 | 0.672 | 0.465 | 0.602 | 0.649 |
| id | 0.449 | 0.272 | 0.443 | 0.298 | 0.392 | 0.414 |
| ja | 0.369 | 0.439 | 0.576 | 0.496 | 0.424 | 0.570 |
| ko | 0.419 | 0.419 | 0.609 | 0.487 | 0.483 | 0.472 |
| ru | 0.334 | 0.407 | 0.532 | 0.477 | 0.391 | 0.521 |
| sw | 0.383 | 0.299 | 0.446 | 0.358 | 0.560 | 0.644 |
| te | 0.494 | 0.356 | 0.602 | 0.462 | 0.528 | 0.781 |
| th | 0.484 | 0.358 | 0.599 | 0.481 | 0.517 | 0.628 |
| es | 0.319 | 0.478 | 0.641 | 0.426 | 0.418 | 0.409 |
| fa | 0.333 | 0.480 | 0.594 | 0.460 | 0.215 | 0.469 |
| fr | 0.183 | 0.435 | 0.523 | 0.267 | 0.314 | 0.376 |
| hi | 0.458 | 0.383 | 0.616 | 0.470 | 0.286 | 0.458 |
| zh | 0.180 | 0.512 | 0.526 | 0.398 | 0.410 | 0.515 |
| de | 0.226 | 0.490 | 0.565 | 0.334 | 0.408 | – |
| yo | 0.406 | 0.396 | 0.374 | 0.561 | 0.415 | – |
| K. Avg | 0.394 | 0.415 | 0.578 | 0.441 | 0.433 | 0.535 |
| S. Avg | 0.316 | 0.443 | 0.470 | 0.448 | 0.412 | – |
| Avg | 0.385 | 0.418 | 0.566 | 0.441 | 0.431 | – |

Recall@100:

| ISO | BM25 | mDPR | Hyb. | mCol. | mCon. | in-L. |
|---|---|---|---|---|---|---|
| ar | 0.889 | 0.841 | 0.941 | 0.908 | 0.925 | 0.904 |
| bn | 0.909 | 0.819 | 0.932 | 0.913 | 0.921 | 0.917 |
| en | 0.819 | 0.768 | 0.882 | 0.801 | 0.797 | 0.751 |
| fi | 0.891 | 0.788 | 0.895 | 0.832 | 0.953 | 0.907 |
| id | 0.904 | 0.573 | 0.768 | 0.669 | 0.802 | 0.823 |
| ja | 0.805 | 0.825 | 0.904 | 0.895 | 0.878 | 0.880 |
| ko | 0.783 | 0.737 | 0.900 | 0.722 | 0.875 | 0.807 |
| ru | 0.661 | 0.797 | 0.874 | 0.866 | 0.850 | 0.850 |
| sw | 0.701 | 0.616 | 0.725 | 0.692 | 0.911 | 0.909 |
| te | 0.831 | 0.762 | 0.857 | 0.830 | 0.961 | 0.957 |
| th | 0.887 | 0.678 | 0.823 | 0.845 | 0.936 | 0.902 |
| es | 0.702 | 0.864 | 0.948 | 0.842 | 0.841 | 0.783 |
| fa | 0.731 | 0.898 | 0.937 | 0.910 | 0.654 | 0.821 |
| fr | 0.653 | 0.915 | 0.965 | 0.730 | 0.824 | 0.823 |
| hi | 0.868 | 0.776 | 0.912 | 0.884 | 0.646 | 0.777 |
| zh | 0.560 | 0.944 | 0.959 | 0.908 | 0.903 | 0.883 |
| de | 0.572 | 0.898 | 0.898 | 0.803 | 0.841 | – |
| yo | 0.733 | 0.715 | 0.715 | 0.917 | 0.770 | – |
| K. Avg | 0.787 | 0.788 | 0.889 | 0.828 | 0.855 | 0.856 |
| S. Avg | 0.653 | 0.807 | 0.807 | 0.860 | 0.806 | – |
| Avg | 0.772 | 0.790 | 0.880 | 0.832 | 0.849 | – |

Table 5: Baseline results on the MIRACL dev set, where ‘‘K. Avg’’ and ‘‘S. Avg’’ indicate the average scores over the known (ar–zh) and surprise (de–yo) languages. Hyb.: hybrid results of BM25 and mDPR; mCol.: mColBERT; mCon.: mContriever; in-L.: in-language fine-tuned mDPR.

Baseline scores for these retrieval models are shown in Table 5 in terms of the two official retrieval metrics of MIRACL. The baselines include the three methods used in the ensemble system introduced in Section 4.2 (BM25, mDPR, mColBERT) plus the following approaches:

• Hybrid combines the scores of BM25 and mDPR results. For each query–passage pair, the hybrid score is computed as sHybrid = α · sBM25 + (1 − α) · smDPR, where we set α = 0.5 without tuning. Scores of BM25 and mDPR (sBM25 and smDPR) are first normalized to [0, 1].

• mContriever (Izacard et al., 2022) adopts additional pretraining with contrastive loss based on unsupervised data prepared from CCNet (Wenzek et al., 2020), which demonstrates improved effectiveness in downstream IR tasks. We used the authors' released multilingual checkpoint, where the model was further fine-tuned on English MS MARCO after additional pretraining.8

• In-language fine-tuned mDPR follows the same model configuration as the mDPR baseline, but we fine-tuned each model on the MIRACL training set of the target language rather than MS MARCO. Here, the negative examples are sampled from the labeled negatives and the unlabeled top-30 candidates from BM25.

Note that all languages in MIRACL are included in the pretraining corpus of mBERT, which provides the backbone of the mDPR and mColBERT models. However, three languages (fa, hi, yo) are not included in CCNet (Wenzek et al., 2020), the dataset used by mContriever for additional pretraining. Code, documentation, and instructions for reproducing these baselines have been released together with the MIRACL dataset and can be found on the MIRACL website.

These results provide a snapshot of the current state of research. We see that across these diverse languages, mDPR does not substantially outperform decades-old BM25 technology, although BM25 exhibits much lower effectiveness in French and Chinese. Nevertheless, the BM25–mDPR hybrid provides a strong zero-shot baseline that outperforms all the other individual models on average nDCG@10. Interestingly, the BM25–mDPR hybrid even outperforms in-language fine-tuned mDPR on most of the languages, except for the ones that are comparatively under-represented in mBERT pretraining (e.g., sw, th, hi). Overall, these results show that plenty of work remains to be done to advance multilingual retrieval.

8https://huggingface.co/facebook/mcontriever-msmarco.

Figure 5: Case study of how language kinship affects cross-lingual transfer. Results are grouped according to the target language and each bar indicates a source language (used for fine-tuning mDPR). Within each panel, the relationship between the source and target languages moves from closer to more distant from left to right (self, same language sub-family, same language family, different language family). Striped bars denote that the source language has a different script from the target language. The exact nDCG@10 score is shown at the top of each bar.

6.2 Cross-Lingual Transfer Effects

As suggested in Section 5.3, MIRACL enables the exploration of the linguistic factors influencing multilingual transfer; in this section, we present a preliminary study. Specifically, we
Specifically, we eval- uate cross-lingual transfer effectiveness on two groups of target languages: (A) {bn, hi, fa} and (B) {es, fr}, where the models are trained on dif- ferent source languages drawn from (with respect to the target language): (1) different language families, (2) same language family but different sub-families, and (3) same sub-family. We add the ‘‘self’’ condition as an upper-bound, where the model is trained on data in the target language itself. language (Latin script The evaluation groups are divided based on the script of the language: in group (A), all the source languages are in a different script from the target languages, whereas in group (B), all the source languages are in the same script as the in this case). In target these experiments, we reused checkpoints from the ‘‘in-language fine-tuned mDPR’’ condition in Section 6.1. For example, when the source lan- guage is hi and the target language is bn, we encode the bn query and the bn corpus using the hi checkpoint; this corresponds to the hi row, in-L. column in Table 5. The results are visualized in Figure 5, where each panel corresponds to the target language in- dicated in the header. Each bar within a panel represents a different source language and the y-axis shows the zero-shot nDCG@10 scores. Within each panel, from left to right, the kinship 1126 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 9 5 2 1 5 7 3 4 0 / / t l a c _ a _ 0 0 5 9 5 p d . f b y g u e s t t o n 0 9 S e p e m b e r 2 0 2 3 relationship between the source and target lan- guages moves from close to distant, where the source languages that are at the same distance to the target languages are colored the same (self, same language sub-family, same language family, different language family). Striped bars denote that the source language is in a different script from the target language. We find that languages from the same sub- families show better transfer capabilities in gen- eral, regardless of the underlying script. Among the five languages studied, those from the same sub-families achieve the best transfer scores (i.e., the orange bars are the tallest). The only excep- tion appears to be when evaluating bn using fi as the source language (the leftmost panel). Inter- estingly, the source language being in the same family (but a different sub-family) does not ap- pear to have an advantage over the languages in different families (i.e., the green bars versus the pink bars). This suggests that transfer effects only manifest when the languages are closely related within a certain degree of kinship. We also observe that transfer effects do not appear to be symmetrical between languages. For example, the model trained on bn achieves 0.483 and 0.549 on hi and fa, respectively, which are over 90% of the ‘‘self’’ score on hi and fa (the first orange bar in the second and third panels). However, models trained on hi and fa do not generalize on bn to the same degree, scoring less than 80% of the ‘‘self’’ score on bn (the two orange bars in the first panel). We emphasize that this is merely a preliminary study on cross-lingual transfer effects with re- spect to language families and scripts, but our experiments show the potential of MIRACL for further understanding multilingual models. 
7 Conclusion In this work, we present MIRACL, a new high-quality multilingual retrieval dataset that represents approximately five person-years of an- notation effort. We provide baselines and present initial explorations demonstrating the potential of MIRACL for studying interesting scientific ques- tions. Although the WSDM Cup challenge asso- ciated with our efforts has ended, we continue to host a leaderboard to encourage continued parti- cipation from the community. While MIRACL represents a significant stride further towards equitable information access, efforts are necessary to accomplish this objec- tive. One obvious future direction is to extend MIRACL to include more languages, espe- cially low-resource ones. Another possibility is to augment MIRACL with cross-lingual re- trieval support. However, pursuing these avenues demands additional expenses and manual labor. Nevertheless, MIRACL already provides a valuable resource to support numerous research directions: It offers a solid testbed for building and evaluating multilingual versions of dense retrieval models (Karpukhin et al., 2020), late-interaction models (Khattab and Zaharia, 2020), as well as reranking models (Nogueira and Cho, 2019; Nogueira et al., 2020), and will further acceler- ate progress in multilingual retrieval research (Shi et al., 2020; MacAvaney et al., 2020; Nair et al., 2022; Zhang et al., 2022). Billions of speakers of languages that have received relatively little attention from researchers stand to benefit from improved information access. Acknowledgments The authors wish to thank the anonymous re- viewers for their valuable feedback. We would also like to thank our annotators, without whom MIRACL could not have been built. This research was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada, a gift from Huawei, and Cloud TPU sup- port from Google’s TPU Research Cloud (TRC). References Akari Asai, Jungo Kasai, Jonathan Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. 2021. XOR QA: Cross-lingual open-retrieval question answering. In Proceedings of the 2021 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies, pages 547–564, Online. https://doi.org /10.18653/v1/2021.naacl-main.46 Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. 2016. Identifying translationese at the word and sub-word level. Digital Scholar- ship in the Humanities, 31(1):30–54. https:// doi.org/10.1093/llc/fqu047 1127 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 9 5 2 1 5 7 3 4 0 / / t l a c _ a _ 0 0 5 9 5 p d . f b y g u e s t t o n 0 9 S e p e m b e r 2 0 2 3 Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A human generated MA- chine Reading COmprehension dataset. arXiv: 1611.09268v3. Luiz Henrique Bonifacio, Israel Campiotti, Vitor Jeronymo, Roberto Lotufo, and Rodrigo Nogueira. 2021. mMARCO: A multilingual version of the MS MARCO passage ranking dataset. arXiv:2108.13897. Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information- seeking question answering in typologically diverse languages. 
Nick Craswell, Bhaskar Mitra, Daniel Campos, Emine Yilmaz, and Jimmy Lin. 2021. MS MARCO: Benchmarking ranking models in the large-data regime. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 1566–1576. https://doi.org/10.1145/3404835.3462804
Hope Dawson, Antonio Hernandez, and Cory Shain. 2016. Morphological types of languages. In Language Files: Materials for an Introduction to Language and Linguistics, 12th Edition. Department of Linguistics, The Ohio State University. https://doi.org/10.26818/9780814252703
Sauleh Eetemadi and Kristina Toutanova. 2014. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164, Doha, Qatar. https://doi.org/10.3115/v1/D14-1018
Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse lexical and expansion model for first stage ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2288–2292, New York, NY, USA.
Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2023. Tevatron: An efficient and flexible toolkit for neural retrieval. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3120–3124.
Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. https://doi.org/10.18653/v1/D18-1029
Joseph Harold Greenberg. 1960. A quantitative approach to the morphological typology of language. International Journal of American Linguistics, 26:178–194. https://doi.org/10.1086/464575
Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In Proceedings of the 37th International Conference on Machine Learning, pages 4411–4421.
Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2022. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research.
Alexander Jones, William Yang Wang, and Kyle Mahowald. 2021. A massively multilingual analysis of cross-linguality in shared embedding space. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5833–5847, Online and Punta Cana, Dominican Republic. https://doi.org/10.18653/v1/2021.emnlp-main.471
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. https://doi.org/10.18653/v1/P17-1147
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. https://doi.org/10.18653/v1/2020.acl-main.560
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781. https://doi.org/10.18653/v1/2020.emnlp-main.550
Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), pages 39–48. https://doi.org/10.1145/3397271.3401075
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466. https://doi.org/10.1162/tacl_a_00276
Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, and Eugene Yang. 2023. Overview of the TREC 2022 NeuCLIR track. In Proceedings of the 31st Text REtrieval Conference.
Gennadi Lembersky, Noam Ordan, and Shuly Wintner. 2012. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825. https://doi.org/10.1162/COLI_a_00111
Jimmy Lin, Daniel Campos, Nick Craswell, Bhaskar Mitra, and Emine Yilmaz. 2022. Fostering coopetition while plugging leaks: The design and implementation of the MS MARCO leaderboards. In Proceedings of the 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022), pages 2939–2948, Madrid, Spain. https://doi.org/10.1145/3477495.3531725
Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021a. Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 2356–2362. https://doi.org/10.1145/3404835.3463238
Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2021b. Pretrained Transformers for Text Ranking: BERT and Beyond. Morgan & Claypool Publishers. https://doi.org/10.1007/978-3-031-02181-7
Shayne Longpre, Yi Lu, and Joachim Daiber. 2021. MKQA: A linguistically diverse benchmark for multilingual open domain question answering. Transactions of the Association for Computational Linguistics, 9:1389–1406. https://doi.org/10.1162/tacl_a_00433
Sean MacAvaney, Luca Soldaini, and Nazli Goharian. 2020. Teaching a new dog old tricks: Resurrecting multilingual retrieval using zero-shot learning. In Proceedings of the 42nd European Conference on Information Retrieval, Part II (ECIR 2020), pages 246–254. https://doi.org/10.1007/978-3-030-45442-5_31
Suraj Nair, Eugene Yang, Dawn Lawrie, Kevin Duh, Paul McNamee, Kenton Murray, James Mayfield, and Douglas W. Oard. 2022. Transfer learning approaches for building cross-language dense retrieval models. In Proceedings of the 44th European Conference on Information Retrieval (ECIR 2022), Part I, pages 382–396, Stavanger, Norway. https://doi.org/10.1007/978-3-030-99736-6_26
Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. arXiv:1901.04085.
Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 708–718, Online. https://doi.org/10.18653/v1/2020.findings-emnlp.63
Damaris Nübling. 2020. Inflectional morphology. In The Cambridge Handbook of Germanic Linguistics. Cambridge University Press. https://doi.org/10.1017/9781108378291.011
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. https://doi.org/10.18653/v1/P19-1493
Frans Plank. 1999. Split morphology: How agglutination and flexion mix. Linguistic Typology, 3:279–340. https://doi.org/10.1515/lity.1999.3.3.279
Edoardo Maria Ponti, Helen O'Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, Thierry Poibeau, Ekaterina Shutova, and Anna Korhonen. 2019. Modeling language variation and universals: A survey on typological linguistics for natural language processing. Computational Linguistics, 45(3):559–601. https://doi.org/10.1162/coli_a_00357
Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2021. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5835–5847. https://aclanthology.org/2021.naacl-main.466
Ella Rabinovich and Shuly Wintner. 2015. Unsupervised identification of translationese. Transactions of the Association for Computational Linguistics, 3:419–432. https://doi.org/10.1162/tacl_a_00148
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. https://doi.org/10.18653/v1/D16-1264
Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389. https://doi.org/10.1561/1500000019
Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2021. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3715–3734, Seattle, United States. https://doi.org/10.18653/v1/2022.naacl-main.272
Peng Shi, He Bai, and Jimmy Lin. 2020. Cross-lingual training of neural models for document ranking. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2768–2773, Online. https://doi.org/10.18653/v1/2020.findings-emnlp.249
Shuo Sun and Kevin Duh. 2020. CLIRMatrix: A massively large collection of bilingual and multilingual datasets for cross-lingual information retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4160–4170, Online. https://doi.org/10.18653/v1/2020.emnlp-main.340
Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Neural Information Processing Systems: Datasets and Benchmarks Track.
Vered Volansky, Noam Ordan, and Shuly Wintner. 2015. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118. https://doi.org/10.1093/llc/fqt031
Ellen M. Voorhees. 1998. Variations in relevance judgments and the measurement of retrieval effectiveness. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1998), pages 315–323, Melbourne, Australia. https://doi.org/10.1145/290941.291017
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France.
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021).
Peilin Yang, Hui Fang, and Jimmy Lin. 2018. Anserini: Reproducible ranking baselines using Lucene. Journal of Data and Information Quality, 10(4):Article 16. https://doi.org/10.1145/3239571
Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. 2021. Mr. TyDi: A multi-lingual benchmark for dense retrieval. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 127–137, Punta Cana, Dominican Republic. https://doi.org/10.18653/v1/2021.mrl-1.12
Xinyu Zhang, Kelechi Ogueji, Xueguang Ma, and Jimmy Lin. 2022. Towards best practices for training multilingual dense retrieval models. arXiv:2204.02363.