Quality at a Glance:
An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer1,2, Isaac Caswell3, Lisa Wang3,4, Ahsan Wahab5,47, Daan van Esch6,
Nasanbayar Ulzii-Orshikh7, Allahsera Tapo8,9, Nishant Subramani10,11, Artem Sokolov4,
Claytone Sikasote12,13, Monang Setyawan14, Supheakmungkol Sarin14,
Sokhar Samb15,16, Benoît Sagot17, Clara Rivera18, Annette Rios19, Isabel Papadimitriou20,
Salomey Osei21,22, Pedro Ortiz Suarez17,23, Iroro Orife10,24, Kelechi Ogueji2,25,
Andre Niyongabo Rubungo26,27, Toan Q. Nguyen28, Mathias Müller19, André Müller19,
Shamsuddeen Hassan Muhammad29,30, Nanda Muhammad30, Ayanda Mnyakeni31,
Jamshidbek Mirzakhalov5,32, Tapiwanashe Matangira33, Colin Leong10, Nze Lawson14,
Sneha Kudugunta3, Yacine Jernite10,34, Mathias Jenny19, Orhan Firat3,5,
Bonaventure F. P. Dossou35,36, Sakhile Dlamini14, Nisansa de Silva37,
Sakine Çabuk Ballı19, Stella Biderman38, Alessia Battisti19, Ahmed Baruwa10,39,
Ankur Bapna3, Pallavi Baljekar1, Israel Abebe Azime40,41, Ayodele Awokoya29,42,
Duygu Ataman19,43, Orevaoghene Ahia10,44, Oghenefego Ahia14,
Sweta Agrawal45, Mofetoluwa Adeyemi29,46
1Google Research, Canada, 2Masakhane NLP, USA, 3Google Research, USA, 4Google Research,
Germany, 5Turkic Interlingua, 6Google Research, The Netherlands, 7Haverford College, USA,
8Masakhane NLP, Mali, 9RobotsMali, Mali, 10Masakhane NLP, USA, 11Allen Institute for Artificial
Intelligence, USA, 12Masakhane NLP, Zambia, 13University of Zambia, Zambia, 14Google, USA,
15Masakhane NLP, Senegal, 16AIMS-AMMI, Senegal, 17Inria, France, 18Google Research, UK,
19University of Zurich, Switzerland, 20Stanford University, USA, 21Masakhane NLP, Ghana, 22Kwame
Nkrumah University of Science and Technology, Ghana, 23Sorbonne Université, France, 24Niger-Volta
LTI, USA, 25University of Waterloo, Canada, 26Masakhane NLP, Spain, 27Universitat Politècnica de
Catalunya, Spain, 28University of Notre Dame, USA, 29Masakhane NLP, Nigeria, 30Bayero University
Kano, Nigeria, 31Google, South Africa, 32University of South Florida, USA, 33Google, Canada,
34Hugging Face, USA, 35Masakhane NLP, Germany, 36Jacobs University Bremen, Germany,
37University of Moratuwa, Sri Lanka, 38EleutherAI, USA, 39Obafemi Awolowo University, Nigeria,
40Masakhane NLP, Ethiopia, 41AIMS-AMMI, Ethiopia, 42University of Ibadan, Nigeria, 43Turkic
Interlingua, Switzerland, 44Instadeep, Nigeria, 45University of Maryland, USA, 46Defence Space
Administration Abuja, Nigeria, 47University of South Florida, USA
Abstract
With the success of large-scale pre-training
and multilingual modeling in Natural Language
Processing (NLP), recent years have
seen a proliferation of large, Web-mined
text datasets covering hundreds of languages.
We manually audit the quality of 205
language-specific corpora released with five
major public datasets (CCAligned, ParaCrawl,
WikiMatrix, OSCAR, mC4). Lower-resource
corpora have systematic issues: At least 15
corpora have no usable text, and a significant
fraction contains less than 50% sentences
of acceptable quality. In addition, many are
mislabeled or use nonstandard/ambiguous language
codes. We demonstrate that these issues
are easy to detect even for non-proficient
speakers, and supplement the human audit with
automatic analyses. Finally, we recommend
techniques to evaluate and improve multilingual
corpora and discuss potential risks that
come with low-quality data releases.
1 Introduction
Access to multilingual datasets for NLP research
has vastly improved over the past years. A variety
of Web-derived collections for hundreds of lan-
guages is available for anyone to download,
such as ParaCrawl (Esplà et al., 2019; Bañón
et al., 2020), WikiMatrix (Schwenk et al., 2021),
CCAligned (El-Kishky et al., 2020), OSCAR
(Ortiz Suárez et al., 2019; Ortiz Suárez et al.,
2020), and several others. These have in turn enabled
a variety of highly multilingual models, like
mT5 (Xue et al., 2021), M2M-100 (Fan et al.,
2020), and M4 (Arivazhagan et al., 2019).
Transactions of the Association for Computational Linguistics, vol. 10, pp. 50–72, 2022. https://doi.org/10.1162/tacl_a_00447
Action Editor: Sebastian Padó. Submission batch: 6/2021; Revision batch: 9/2021; Published 1/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Curating such datasets relies on the Web sites
giving clues about the language of their con-
tents (e.g., a language identifier in the URL) and
on automatic language classification (LangID).
It is commonly known that these automatically
crawled and filtered datasets tend to have over-
all lower quality than hand-curated collections
(Koehn et al., 2020), but their quality is rarely
measured directly, and is rather judged through
the improvements they bring to downstream
applications (Schwenk et al., 2021).
Building NLP technologies with automatically
crawled datasets is promising. This is especially
true for low-resource languages, because data
scarcity is one of the major bottlenecks for deep
learning approaches. However, there is a problem:
There exists very little research on evaluating both
data collections and automatic crawling and filter-
ing tools for low-resource languages. As a result,
although many low-resource languages are cov-
ered by the latest multilingual crawl data releases,
their quality and thus usability is unknown.
To shed light on the quality of data crawls
for the lowest resource languages, we perform
a manual data audit for 230 per-language subsets
of five major crawled multilingual datasets:1
CCAligned (El-Kishky et al., 2020), ParaCrawl
(Esplà et al., 2019; Bañón et al., 2020), WikiMatrix
(Schwenk et al., 2021), OSCAR (Ortiz
Suárez et al., 2019; Ortiz Suárez et al., 2020),
and mC4 (Xue et al., 2021). We propose solutions
for effective, low-effort data auditing (Section 4),
including an error taxonomy. Our quantitative
analysis reveals surprisingly low amounts of valid
in-language data, and identifies systematic issues
across datasets and languages. In addition, we
find that a large number of datasets is labeled with
nontransparent or incorrect language codes (Sec-
tion 5). This leads us to reflect on the potential
harm of low-quality data releases for low-resource
languages (Section 6), and provide a set of recommendations
for future multilingual data releases
(Section 7).
1 Annotations are available for download (last accessed: 12 Oct 2021).
2 Related Work
Corpora collected by web crawlers are known to
be noisy (Junczys-Dowmunt, 2019; Luccioni and
Viviano, 2021). In highly multilingual settings,
past work found that web-crawls of lower-resource
languages have serious issues, especially with
segment-level LangID (Caswell et al., 2020).
Cleaning and filtering web-crawls can boost gen-
eral language modeling (Gao et al., 2020; Brown
et al., 2020; Raffel et al., 2020) and downstream
task performance (Moore and Lewis, 2010;
Rarrick et al., 2011; Xu and Koehn, 2017;
Khayrallah and Koehn, 2018; Brown et al., 2020).
As the scale of ML research grows, it becomes
increasingly difficult to validate automatically col-
lected and curated datasets (Biderman and Scheirer,
2020; Birhane and Prabhu, 2021; Bender et al.,
2021). Several works have focused on advanc-
ing methodologies and best practices to address
these challenges. Bender and Friedman (2018)
introduced data statements, a documentary frame-
work for NLP datasets that seeks to provide a
universal minimum bar for dataset description.
Similar work has been done on systematizing
documentation in other areas in data science
and machine learning, including work focusing
on online news (Kevin et al., 2018), data ethics
(Sun et al., 2019), and data exploration (Holland
et al., 2018), as well as generalist work such as
Gebru et al. (2018). Data quality is also im-
plicitly documented by successes of filtering
methods. There is a large literature on filtering
data for various NLP tasks, for example, Axelrod
et al. (2011), Moore and Lewis (2010), Rarrick
et al. (2011), Wang et al. (2018), Kamholz et al.
(2014), Junczys-Dowmunt (2018), and Caswell
et al. (2020).
Closest to our work is the analysis of a highly
multilingual (non-publicly available) web-crawl
and LangID-related quality issues by Caswell
et al. (2020). They perform a brief analysis of
the quality of OSCAR with the focus only on
the presence of in-language content. Dodge et al.
(2021) automatically documented and analyzed the
contents and sources of C4 (Raffel et al., 2020),
the English counterpart of mC4, which surfaced
the presence of machine-translated contents and
NLP benchmark data.

                 Parallel                                           Monolingual
                 CCAligned     ParaCrawl v7.1      WikiMatrix   OSCAR        mC4
#languages       137           41                  85           166          101
Source           CC 2013–2020  selected Web sites  Wikipedia    CC 11/2018   CC all
Filtering level  document      sentence            sentence     document     document
Langid           FastText      CLD2                FastText     FastText     CLD3
Alignment        LASER         Vec/Hun/BLEU-Align  LASER        –            –
Evaluation       TED-6         WMT-5               TED-45       POS/DEP-5    XTREME

Table 1: Comparison of parallel and monolingual corpora extracted from web documents, including
their downstream evaluation tasks. All parallel corpora are evaluated for machine translation (BLEU).
TED-6: da, cr, sl, sk, lt, et; TED-45: 45-language subset of (Qi et al., 2018); WMT-5: cs, de,
fi, lv, ro. POS/DEP-5: part-of-speech labeling and dependency parsing for bg, ca, da, fi, id.
3 Multilingual Corpora
Table 1 provides an overview of the corpora of
interest in this work. We selected the corpora for
their multilinguality and the inclusion of under-
studied languages in NLP. With the exception of
WikiMatrix and ParaCrawl, all corpora are derived
from CommonCrawl (CC).2
CCAligned (El-Kishky et al., 2020) is a parallel
dataset built off 68 CC snapshots. Documents
are aligned if they are in the same language ac-
cording to FastText LangID (Joulin et al., 2016,
2017), and have the same URL but for a differing
language code. These alignments are refined with
cross-lingual LASER embeddings (Artetxe and
Schwenk, 2019). For sentence-level data, they
split on newlines and align with LASER, but
perform no further filtering. Human annotators
evaluated the quality of document alignments for
six languages (de, zh, ar, ro, et, my) selected
for their different scripts and amount of retrieved
documents, reporting precision of over 90%. The
quality of the extracted parallel sentences was
evaluated in a machine translation (MT) task on
six European (da, cr, sl, sk, lt, et) languages
of the TED corpus (Qi et al., 2018), where it
compared favorably to systems built on crawled
sentences from WikiMatrix and ParaCrawl v6.
Multilingual C4 (mC4) (Xue et al., 2021) is
a document-level dataset used for training the
mT5 language model. It consists of monolingual
text in 101 languages and is generated from 71 CC
snapshots. It filters out pages that contain fewer than
three lines of at least 200 characters and pages that
contain bad words.3 Since this is a document-level
dataset, we split it by sentence and deduplicate it
before rating. For language identification, it uses
CLD3 (Botha et al., 2017),4 a small feed-forward
neural network that was trained to detect 107
languages. The mT5 model pre-trained on mC4 is
evaluated on 6 tasks of the XTREME benchmark
(Hu et al., 2020) covering a variety of languages
and outperforms other multilingual pre-trained
language models such as mBERT (Devlin et al.,
2019) and XLM-R (Conneau et al., 2020).
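The page-level filter described in this paragraph can be sketched as follows. This is only an illustration of the rule as stated in the text (at least three lines of 200 or more characters, and no bad words); it is not the mC4 implementation. The function name `keep_page`, the whitespace tokenization, and the placeholder word list are assumptions made for the example.

```python
def keep_page(text, bad_words, min_lines=3, min_chars=200):
    """Return True if a page passes the two heuristics described in the text.

    Assumptions (not taken from the mC4 code): lines are split on newlines,
    words are split on whitespace, and bad_words is a set of lowercase words.
    """
    lines = text.splitlines()
    # Rule 1: require at least `min_lines` lines with `min_chars` or more characters.
    long_lines = [ln for ln in lines if len(ln.strip()) >= min_chars]
    if len(long_lines) < min_lines:
        return False
    # Rule 2: drop the page if any token matches the bad-word list.
    tokens = {tok.strip(".,;:!?\"'()").lower() for tok in text.split()}
    return tokens.isdisjoint(bad_words)


# Hypothetical usage with a placeholder word list.
page = "\n".join(["x" * 200] * 3)
print(keep_page(page, {"badword"}))
```

Note that both rules operate on whole pages, so a page that passes can still contain individual short or non-linguistic lines once it is split into sentences.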
OSCAR (Ortiz Suárez et al., 2019; Ortiz Suárez
et al., 2020) is a set of monolingual corpora extracted
from CC snapshots, specifically from the
plain text WET format distributed by CC which
removes all the HTML tags and converts the text
to UTF-8. It is deduplicated and follows the approach
by Grave et al. (2018) of using FastText
LangID (Joulin et al., 2016, 2017) on a line-level.5
No other filtering was applied. For five languages
(bg, ca, da, fi, id), OSCAR was used by its
original authors to train language models which
were then evaluated on parsing and POS tagging
(Ortiz Suárez et al., 2020). OSCAR has also been
used in independent studies to train monolingual
or multilingual language models (ar, as, bn, de,
el, fr, gu, he, hi, kn, ml, mr, nl, or, pa,
ro, ta, te) and subsequently evaluate them on
various downstream tasks (Antoun et al., 2021;
Kakwani et al., 2020; Wilie et al., 2020; Chan
et al., 2020; Koutsikakis et al., 2020; Martin
et al., 2020; Chriqui and Yahav, 2021; Seker et al.,
2021; Delobelle et al., 2020; Dumitrescu et al.,
2020; Masala et al., 2020).
2 http://commoncrawl.org/.
3 https://github.com/LDNOOBW/.
4 https://github.com/google/cld3/.
5 https://fasttext.cc/docs/en/language-identification.html.
ParaCrawl v7.1 is a parallel dataset with 41 lan-
guage pairs primarily aligned with English (39 out
of 41) and mined using the parallel-data-crawling
tool Bitextor (Esplà et al., 2019; Bañón et al.,
2020) which includes downloading documents,
preprocessing and normalization, aligning doc-
uments and segments, and filtering noisy data
via Bicleaner.6 ParaCrawl focuses on European
languages, but also includes 9 lower-resource,
non-European language pairs in v7.1. Sentence
alignment and sentence pair filtering choices were
optimized for five languages (mt, et, hu, cs,
de) by training and evaluating MT models on
the resulting parallel sentences. An earlier version
(v5) was shown to improve translation quality on
WMT benchmarks for cs, de, fi, lv, ro.
WikiMatrix (Schwenk et al., 2021) is a public
dataset containing 135M parallel sentences in
1620 language pairs (85 languages) mined from
Wikipedia. Out of the 135M parallel sentences,
34M are aligned with English. The text is ex-
tracted from Wikipedia pages, split into sentences,
and duplicate sentences are removed. FastText
LangID is used before identifying bitext with
LASER’s distance-based mining approach. The
margin threshold is optimized by training and
evaluating downstream MT models on four WMT
benchmarks (de-en, de-fr, cs-de, cs-fr).
The final dataset is used to train translation models
that are then evaluated by automatically measur-
ing the quality of their translations against human
translations of TED talks in 45 languages, with
highest quality for translations between English
and, for example, pt, es, da, and lowest for sr,
ja, mr, zh_TW. In the audit we focus on language
pairs with English on one side.
4 Auditing Data Quality
None of the above datasets has been evaluated
for quality on the sentence level (exception: sev-
eral languages in ParaCrawl v3), and downstream
evaluations are centered around a small fraction
of higher-resource languages. This is insufficient
for drawing conclusions about the quality of
individual or aligned sentences, and about the
entirety of languages. In addition, publication
bias might prevent negative results obtained with
the lower-quality corpora among the above from
being published.
To close this gap, we conduct a human data
quality audit focused on the lowest-resource and
most under-evaluated languages, but also covering
mid- and high-resource languages for comparison.
4.1 Auditing Process
Participants We recruited 51 volunteers from
the NLP community, covering about 70 languages
with proficient language skills.7 Each sentence
is annotated by one rater. To verify our hypothesis
that those annotations can largely be done by
non-native speakers, we repeat a set of language
expert annotations by a non-expert, and measure
the accuracy of the non-expert.
Sample Selection For each language in each
dataset, we took a random sample of 100 lines,
which may be anywhere from single words to
short paragraphs depending on segmentation. We
manually annotated them according to the error
taxonomy described below. For WikiMatrix and
CCAligned, we selected those languages that are
paired with English, and for ParaCrawl, we also in-
cluded those paired with Spanish (‘‘total’’ counts
in Table 3). We did not annotate all languages,
but focused on the ones with the smallest number
of sentences in each dataset (at least the smallest
10) and languages for which we found proficient
speakers. Since we annotate the same maximum
number of sentences8 across all chosen languages
regardless of their total number of sentences, the
annotated samples are not an unbiased sample
from the whole dataset.
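The sampling step can be sketched as below, under stated assumptions: the helper name `sample_for_audit`, the fixed seed, and the in-memory list of lines are all hypothetical and only illustrate drawing at most 100 lines per language.

```python
import random


def sample_for_audit(lines, k=100, seed=0):
    """Draw up to k lines uniformly at random, without replacement.

    Corpora with fewer than k lines are returned whole, mirroring the note
    that some languages had fewer than 100 sentences.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    if len(lines) <= k:
        return list(lines)
    return rng.sample(lines, k)


# Hypothetical corpora of very different sizes receive the same sample size,
# so the sampling rate differs widely between them.
small = [f"sentence {i}" for i in range(40)]
large = [f"sentence {i}" for i in range(100_000)]
for name, corpus in [("small", small), ("large", large)]:
    sample = sample_for_audit(corpus)
    print(name, len(sample), len(sample) / len(corpus))
```

Because every language contributes the same maximum number of lines, small corpora are sampled at a far higher rate than large ones, which is why the pooled annotations are not an unbiased sample of a whole dataset.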
Non-expert Labeling Strategies Although
many of the volunteers were familiar with the
languages in question or spoke related languages,
in cases where no speaker of a relevant language
could be found, volunteers used dictionaries and
Internet search to form educated guesses. We
discuss this in more depth in Appendix C to highlight
how much of this low-resource focused evaluation
can actually be done by non-proficient speakers
with relatively low effort. In general, we aim to
find an upper bound on quality, so we encouraged
annotators to be forgiving of translation mistakes
when the overall meaning of the sentence or large
parts thereof are conveyed, or when most of the
sentence is in the correct language.
6 https://github.com/bitextor/bicleaner.
7 This surprisingly high number comes in part because
there are many closely related languages, e.g., one person
may be proficient enough to rate many different Slavic or
Turkic languages even if only one is their native language.
8 Some languages had fewer than 100 sentences.

Correct Codes
C: Correct translation, any (combined label for CC, CB, CS)
CC: Correct translation, natural sentence
    en   The Constitution of South Africa
    nso  Molaotheo wa Rephabliki ya Afrika Borwa

    en   Transforming your swimming pool into a pond
    de   Umbau Ihres Swimmingpools zum Teich
CB: Correct translation, Boilerplate or low quality
    en   Reference number: 13634
    ln   Motango ya référence: 13634

    en   Latest Smell Stop Articles
    fil  Pinakabagong mga Artikulo Smell Stop
CS: Correct translation, Short
    en   movies, dad
    it   cinema, papà

    en   Halloween – without me
    ay   Hallowen –janiw nayampejj
Error Codes
X: Incorrect translation, but both correct languages
    en   A map of the arrondissements of Paris
    kg   Paris kele mbanza ya kimfumu ya Fwalansa.

    en   Ask a question
    tr   Soru sor Kullanıma göre seçim
WL: Source OR target wrong language, but both still linguistic content
    en   The ISO3 language code is zho
    zza  Táim eadra bracach mar bhionns na frogannaidhe.

    en   Der Werwolf—sprach der gute Mann,
    de   des Weswolfs, Genitiv sodann,
NL: Not a language: at least one of source and target are not linguistic content
    en   EntryScan 4
    tn   TSA PM704

    en   organic peanut butter
    ckb  ? ? ? ? ? ? ?

Table 2: Annotation codes for parallel data with sentence pair examples. The language code before each
sentence indicates the language it is supposed to be in.
Effort The individual effort depended on
the quality and complexity of the data, and on
the annotator’s knowledge of the language(s). For
example, it ranged from less than two minutes for
a native English speaker to pass through 100
well-formed English sentences (or similarly to annotate
languages with 0% in-language sentences),
to two hours of ‘‘detective work’’ for well-formed
content in languages with which the annotator had
no familiarity.
Taxonomy In order to quantify errors, we developed
a simple error taxonomy. Sentences and
sentence pairs were annotated according to a
simple rubric with error classes of Incorrect Trans-
lation (X, excluded for monolingual data), Wrong
Language (WL), and Non-Linguistic Content (NL).
Of correct sentences (C), we further mark single
words or phrases (CS) and boilerplate contents
(CB). In addition, we asked annotators to flag
offensive or pornographic content. Table 2 pro-
vides examples for parallel data, and Appendix B
contains detailed annotation instructions.
4.2 Human Audit Results
Interpretation of Results For each language,
we compute the percentage of each label within
the 100 audited sentences. Then, we either aggregate
the labels across languages with equal
weights (macro-average), or weight them according
to their presence in the overall dataset
(micro-average). Results are shown in Table 3.
The statistics for the correct codes (CC, CB, CS)
are combined as C. The number of languages,
the numbers of sentences per language, and the
choice of languages differ across datasets, both
in the original release and in the selection for
our audit, so the comparison of numbers across
datasets has to be taken with a grain of salt.
Since the numbers are based on a small sample
of sentences that were partially annotated by
non-experts, the error statistics are only rough
estimates. Our audit captures a decent ratio of
languages (25–55%, 2nd row in Table 3), but
only a tiny fraction of the overall number of
sentences (0.00004–0.002%). When we speak
of ‘‘low-’’ and ‘‘high’’-resource languages, we
mean languages with smaller or larger representation
in the datasets at hand. When reporting
language-specific results we use the original language
identifiers of the datasets.

                          Parallel                                     Monolingual
                          CCAligned    ParaCrawl v7.1  WikiMatrix   OSCAR         mC4
#langs audited / total    65 / 119     21 / 38         20 / 78      51 / 166      48 / 108
%langs audited            54.62%       55.26%          25.64%       30.72%        44.44%
#sents audited / total    8037 / 907M  2214 / 521M     1997 / 95M   3517 / 8.4B   5314 / 8.5B
%sents audited            0.00089%     0.00043%        0.00211%     0.00004%      0.00006%
macro  C                  29.25%       76.14%          23.74%       87.21%        72.40%
macro  X                  29.46%       19.17%          68.18%       –             –
macro  WL                 9.44%        3.43%           6.08%        6.26%         15.98%
macro  NL                 31.42%       1.13%           1.60%        6.54%         11.40%
macro  offensive          0.01%        0.00%           0.00%        0.14%         0.06%
macro  porn               5.30%        0.63%           0.00%        0.48%         0.36%
micro  C                  53.52%       83.00%          50.58%       98.72%        92.66%
micro  X                  32.25%       15.27%          47.10%       –             –
micro  WL                 3.60%        1.04%           1.35%        0.52%         2.33%
micro  NL                 10.53%       0.69%           0.94%        0.75%         5.01%
micro  offensive          0.00%        0.00%           0.00%        0.18%         0.03%
micro  porn               2.86%        0.33%           0.00%        1.63%         0.08%
#langs =0% C              7            0               1            7             0
#langs <50% C             44           4               19           11            9
#langs >50% NL            13           0               0            7             1
#langs >50% WL            1            0               0            3             4

Table 3: Averages of sentence-level annotations across datasets and selected languages. Macro-avg:
Each language is weighted equally in the aggregation, regardless of its size. Micro-avg: Each label
is weighted by the fraction of sentences for that language in the overall annotated corpus, i.e., the
annotations for higher-represented languages are upweighted, and annotations for lower-represented
languages are downweighted. The bottom rows contain the number of languages that have 0% labeled
C, etc. Note that these are not true expectations since the languages audited were not randomly sampled.
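The difference between the macro- and micro-average used in this section can be illustrated with a short sketch. The function names and the three toy "languages" with their sizes and percentages are made up for illustration and are not taken from the audit.

```python
def macro_average(pct_by_lang):
    """Every language counts equally, regardless of corpus size."""
    return sum(pct_by_lang.values()) / len(pct_by_lang)


def micro_average(pct_by_lang, size_by_lang):
    """Each language is weighted by its share of sentences in the dataset."""
    total = sum(size_by_lang[lang] for lang in pct_by_lang)
    weighted = sum(pct_by_lang[lang] * size_by_lang[lang] for lang in pct_by_lang)
    return weighted / total


# Toy numbers (hypothetical): percent of sentences labeled C, and corpus sizes.
pct_correct = {"big": 98.0, "mid": 60.0, "small": 10.0}
corpus_size = {"big": 9_000_000, "mid": 900_000, "small": 100_000}

print(round(macro_average(pct_correct), 2))               # 56.0
print(round(micro_average(pct_correct, corpus_size), 2))  # 93.7
```

In this toy case the largest language dominates the micro-average, so a nearly unusable small corpus barely moves the overall number, while the macro-average exposes it.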
Which Datasets Have Quality Issues? The
macro-averaged results show that the ratio of
correct samples (C) ranges from 24% to 87%,
with a large variance across the five audited
datasets. Particularly severe problems were found
in CCAligned and WikiMatrix, with 44 of the 65
languages that we audited for CCAligned con-
taining under 50% correct sentences, and 19
of the 20 in WikiMatrix. In total, 15 of the
205 language-specific samples (7.3%) contained
not a single correct sentence. For the parallel
datasets we are also interested in the quantity of
misaligned/mistranslated sentences (X). For Wiki-
Matrix, two-thirds of the audited samples were on
average misaligned. We noticed that sentences
were often similar in structure, but described dif-
ferent facts (see Table 6). This might originate
from the nature of the underlying Wikipedia arti-
cles, since they are often comparable rather than
parallel (Schwenk et al., 2021).
Figure 1 illustrates per-corpus correctness more
completely, showing for each dataset what per-
cent of audited corpora are under each possible
threshold of correctness.
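The quantity plotted per dataset, the share of audited corpora whose percentage of correct sentences falls under a threshold, can be computed as in the sketch below. The function names and the toy percentages are hypothetical; the real per-language values are in the paper's appendix, not reproduced here.

```python
def fraction_below(pct_correct_by_lang, threshold):
    """Fraction of audited languages whose %C is strictly below `threshold`."""
    values = list(pct_correct_by_lang.values())
    return sum(v < threshold for v in values) / len(values)


def threshold_curve(pct_correct_by_lang, thresholds=range(0, 101, 10)):
    """Pairs (threshold, fraction of languages below it), one per threshold."""
    return [(t, fraction_below(pct_correct_by_lang, t)) for t in thresholds]


# Toy per-language %C values for one dataset (made up for illustration).
toy = {"aa": 0.0, "bb": 12.0, "cc": 48.0, "dd": 75.0, "ee": 99.0}
for threshold, fraction in threshold_curve(toy):
    print(threshold, fraction)
```

Reading the curve at a threshold of 50 gives the same kind of number as the "#langs <50% C" row of Table 3, expressed as a fraction rather than a count.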
Figure 1: Fraction of languages in each dataset below
a given quality threshold (percent correct).
Why Haven’t These Problems Been Reported
Before? The findings above are averaged on
a per-language basis (i.e., macro-average), and
therefore give low- and high-resource languages
equal weight. If we instead estimate the quality
on a per-sentence basis (i.e., down-weight
lower-resource languages in the computation of
the average), the numbers paint a more optimistic
picture (‘‘micro’’ block in Table 3). This
is especially relevant for the monolingual datasets
because they contain audits for English, which
makes up 43% of all sentences in OSCAR
and 36% in mC4. To illustrate the effect of this
imbalance: A random sample from the entire mC4
dataset has an over 63% chance of coming from one of
the 8 largest languages (en, ru, es, de, fr, it,
pt, pl, >100M sentences each), all of which have
near perfect quality. Analogously, evaluation and
tuning of web mining pipelines and resulting corpora
in downstream applications focused largely
on higher-resource languages (Section 3), so the
low quality of underrepresented languages might
go unnoticed if there is no dedicated evaluation, or
no proficient speakers are involved in the curation
(Nekoto et al., 2020).
How Much Content is Nonlinguistic or in the
Wrong Language? Nonlinguistic content is a
more common problem than wrong-language content.
Among the parallel datasets, CCAligned
contains the highest percentage of nonlinguistic
content, at 31.42% on average across all rated
corpora, and also the highest percentage of wrong-language
content, at 9.44%. Among the monolingual
datasets, mC4 contains the highest ratio
both of sentences in incorrect languages (15.98%
average) and nonlinguistic content (11.40% average),
with 4 of the 48 audited languages having
more than 50% contents in other languages. The
low amount of wrong language in ParaCrawl
shows the benefits of selecting domains by the
amount of in-language text, but the dataset also
covers the smallest number of languages. The
low ratio of wrong language samples in OSCAR
may reflect the success of line-level LangID filtering.
These numbers provide evidence that more
research in LangID could improve the overall
quality, especially with respect to nonlinguistic
content.
Which Languages Got Confused? The languages
that were confused were frequently related
higher-resource languages. However, there were
also a significant number of ‘‘out-of-model
cousin’’ cases, where languages not supported by
the LangID model ended up in a similar-seeming
language. For instance in mC4, much of the
Shona (sn, Bantu language spoken in Zimbabwe
and Mozambique) corpus is actually Kinyarwanda
(rw, Bantu language spoken mostly in Rwanda
and Uganda), and, peculiarly, much of the
Hawaiian (haw, Polynesian language spoken in
Hawaii) is actually Twi (tw/ak, Central Tano
language spoken mostly in Ghana).
Do Low-resource Languages Have Lower
Quality? Low-resource datasets tend to have
lower human-judged quality. The Spearman rank
correlation between quality (%C) and size is
positive in all cases. The trend is strongest for
mC4 (r = 0.66), and gradually declines for
CCAligned (r = 0.53), WikiMatrix (r = 0.49),
ParaCrawl (r = 0.43), and OSCAR (r = 0.37).
Figure 2 compares the number of sentences for
each language against the proportion of correct
sentences: Not all higher-resource languages
(> 10^6 sentences) have high quality, in particular
for CCAligned (e.g., Javanese (en-jv_ID)
with 5%C, or Tagalog (en-tl_XX) with 13%C).
For mid-resource languages (10^4–10^6 sentences)
the picture is inconclusive, with some languages
having high quality, and others having extremely
low quality, even within the same datasets (e.g.,
Urdu in CCAligned en-ur_PK has 100%C vs.
its romanized counterpart en-ur_PK_rom 0.5%
C). For individual error codes trends are less clear
(not depicted).
Figure 2: Percentage of sentences labeled as correct vs. log N sentences for all audited languages.
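The Spearman rank correlation between corpus size and quality (%C) can be computed as in the following sketch. It is a plain-Python illustration (Pearson correlation of average ranks), not the authors' analysis code; the helper names and the five toy data points are assumptions.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): number of sentences and %C for five languages.
sizes = [1_000, 50_000, 200_000, 3_000_000, 100_000_000]
pct_c = [5.0, 40.0, 30.0, 90.0, 99.0]
print(round(spearman(sizes, pct_c), 2))  # 0.9
```

Because only ranks enter the computation, the result is the same whether raw sentence counts or their logarithms (as on the x-axis of Figure 2) are used.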
        es_XX  bm_ML  yo_NG  tr_TR  ku_TR  zh_CN  af_ZA  jv_ID  zh_TW  it_IT  mean
Acc-6   0.58   0.73   0.41   0.45   0.43   0.55   0.65   0.55   0.46   0.55   0.66
Acc-4   0.77   0.73   0.60   0.55   0.56   0.72   0.72   0.57   0.58   0.66   0.72
Acc-2   0.91   0.96   0.72   0.64   0.71   0.79   0.77   0.92   0.81   0.69   0.79

Table 4: Rater evaluation for a subset of audits from CCAligned (translated from English) measured
by the accuracy (Acc-n) of annotations by non-proficient speakers against annotations by proficient
speakers.
Which Languages Have the Lowest Quality?
Across datasets we observe that the quality is
particularly poor for languages that are included
in romanized script (_rom/_latn), but are more
commonly written in other scripts (e.g., Urdu
(ur), Japanese (ja), Arabic (ar)). These are not
transliterations of other scripts, but mostly contain
non-linguistic material or wrong languages
(e.g., the romanized Japanese corpus in mC4
(ja_latn) contains Spanish, French, English,
Portuguese, among others). In terms of geography,
the poorest quality is found for African
languages (Bambara (bm), Fula (ff), Kikongo
(kg), Luganda (lg), Lingala (ln), Northern Sotho
(nso), Oromo (om), Shona (sn), Somali (so),
Tswana (tn), Wolof (wo)), minority languages
in Europe and the Middle East that are closely
related to higher-resource languages (Azerbaijani
(az-IR), North Frisian (frr), Neapolitan (nap),
Silesian (szl), Zaza (zza)), lesser spoken Chinese
languages sharing a script with Mandarin
(Yue (yue), Wu (wuu)), four major Austronesian
languages (Central Bikol (bcl), Chavacano (cbk),
Javanese (jv), Sundanese (su)), and some South-Asian
languages, in particular Sinhala (si). Appendix D
contains the detailed per-language statistics for
all corpora.
What Is the Incidence of Offensive and
Pornographic Content? Overall, the sampled
sentences did not contain a large amount of
offensive content. However, there were notable
amounts of pornographic content (> 10%) found
in CCAligned for 11 languages.
Annotation Quality For a subset of audited
languages from CCAligned and OSCAR we mea-
sure the accuracy (Acc) of the labels assigned by
non-proficient speakers against the labels assigned
by proficient speakers for all audited sentences.
This can be understood as a directed measure of
annotator agreement for the special case where one
rater is an expert and the other is not. Results for
varying label granularity are reported in Tables 4
and 5. For n = 6 all classes of the taxonomy
were distinguished, for n = 4 the C subclasses
were combined, and for n = 2 it is a binary decision
between C and the rest of the error classes.
With the full 6-class taxonomy (Acc-6) we find
a mean accuracy of 0.66 for CCAligned audits,
and 0.98 for OSCAR audits. With a binary tax-
onomy (Acc-2) distinguishing C from the rest, the
accuracy further increases to 0.79 for CCAligned.
This provides strong evidence that good quality
annotations are not limited to those proficient in
a language.

        tyv    rm     bar    eml    zh     la     mean
Acc-6   1.0    0.98   1.0    1.0    0.86   1.0    0.98
Acc-4   1.0    1.0    1.0    1.0    0.87   1.0    0.98
Acc-2   1.0    1.0    1.0    1.0    0.87   1.0    0.98

Table 5: Rater evaluation for a subset of audits
from OSCAR measured by the accuracy (Acc-n)
of annotations by non-proficient speakers against
annotations by proficient speakers.

However, the significant drop of accuracy for
finer-grained labels suggests that our taxonomy
can be further improved, especially for parallel
sentences. The error taxonomy lacks at least one
category of error, namely, ‘‘correct/in-language
but unnatural’’. Similarly, the definitions of
‘‘correct-short’’ and ‘‘correct-boilerplate’’ were
not understood equally by all annotators and the
concept of ‘‘correct-short’’ has potential issues
for agglutinative languages like Turkish. Finally,
it was unclear what to do with related dialects, for
example, when a sentence is ‘‘almost correct but
wrong dialect’’ or when it is unclear which dialect
a sentence belongs to. We recommend including
these categories for future audits.
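The Acc-n measure used in the annotation quality analysis can be illustrated with a minimal sketch. It assumes the six classes are coded as CC, CB, CS (the three C subclasses), X, WL, and NL; the function names and the toy labels are hypothetical and are not taken from the audit tooling.

```python
from typing import Sequence

# Assumed label codes: CC / CB / CS are the "correct" subclasses,
# X is a mistranslation, WL is wrong language, NL is non-linguistic.
C_SUBCLASSES = {"CC", "CB", "CS"}


def collapse(label: str, n: int) -> str:
    """Map a 6-class label onto the n-class taxonomy (n in {6, 4, 2})."""
    if n == 6:
        return label
    if n == 4:
        return "C" if label in C_SUBCLASSES else label
    if n == 2:
        return "C" if label in C_SUBCLASSES else "not-C"
    raise ValueError("n must be 6, 4, or 2")


def acc_n(non_proficient: Sequence[str], proficient: Sequence[str], n: int) -> float:
    """Share of sentences where the non-proficient label matches the
    proficient (reference) label after collapsing to n classes."""
    if len(non_proficient) != len(proficient) or not proficient:
        raise ValueError("label lists must be non-empty and of equal length")
    hits = sum(
        collapse(a, n) == collapse(b, n)
        for a, b in zip(non_proficient, proficient)
    )
    return hits / len(proficient)


# Toy labels, invented for illustration only.
expert = ["CC", "CS", "X", "WL", "NL", "CB"]
novice = ["CC", "CC", "X", "NL", "NL", "CS"]
for n in (6, 4, 2):
    print(f"Acc-{n}: {acc_n(novice, expert, n):.2f}")
```

Because coarser taxonomies merge classes, Acc-n can only stay equal or increase as n decreases, which matches the pattern reported above.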
4.3 Automatic Filtering
Given the frequency of WL and NL annotations,
it might be tempting to use open-source LangID
models to post-filter data on a per-sentence(-pair)
level, as OSCAR does. Unfortunately, this turns
out to have its own issues.
Sentence-level n-gram LangID Filtering We
classify all sentence pairs of CCAligned with
CLD3, an n-gram based LangID model. By comparing
its predictions to the audit labels, we
evaluate its quality on the subset of annotated
samples: The classifier should detect both correct
languages when the pair is annotated as C or X,
and should detect incorrect languages in the pair
when it is annotated as WL or NL. On this task, the CLD3 classifier
achieves an average precision of only 40.6%.
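One plausible reading of this evaluation is sketched below. It is an assumption-laden illustration rather than the exact protocol: a pair counts as accepted when the predicted language pair equals the expected one, and as truly in-language when its audit label is a C subclass or X. The data class, label codes, and toy predictions are hypothetical; no LangID model is actually called.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed audit codes for pairs whose two sides are in the expected languages.
IN_LANGUAGE_LABELS = {"CC", "CB", "CS", "X"}


@dataclass
class AuditedPair:
    expected: Tuple[str, str]   # language codes the corpus claims
    predicted: Tuple[str, str]  # language codes output by a LangID model
    audit_label: str            # human audit label for the pair


def langid_precision(pairs: List[AuditedPair]) -> float:
    """Among pairs the LangID model accepts as in-language, return the
    share that the human audit also considers in-language."""
    accepted = [p for p in pairs if p.predicted == p.expected]
    if not accepted:
        return 0.0
    true_positives = sum(p.audit_label in IN_LANGUAGE_LABELS for p in accepted)
    return true_positives / len(accepted)


# Toy predictions, invented for illustration only.
pairs = [
    AuditedPair(("en", "ln"), ("en", "ln"), "CC"),
    AuditedPair(("en", "ln"), ("en", "ln"), "WL"),  # model fooled by wrong language
    AuditedPair(("en", "ln"), ("en", "fr"), "WL"),
    AuditedPair(("en", "ln"), ("en", "ln"), "X"),
    AuditedPair(("en", "ln"), ("en", "en"), "CS"),
]
print(f"precision: {langid_precision(pairs):.3f}")
```

Averaging this quantity over languages would give a single summary figure; how the averaging was done for the reported number is not specified here.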
Sentence-level Transformer LangID Filtering
n-gram LangID models like CLD3 have known
problems. However, Caswell et al. (2020) demon-
strate that semi-supervised Transformer-based
LangID models strongly out-perform them. We
train a comparable Transformer-based LangID
model and apply it to our annotated CCAligned
data. We find that filtering noisy corpora (< 50%
correct) on LangID for both source and target
leads to gains in median precision, rising from
13.8% pre-filter to 43.9% post-filter. However,
this comes at a steep cost of 77.5% loss in re-
call. The biggest winners were Lingala, whose
precision climbs from 8% to 80%, and Oromo,
which soars from 2% to 33% in-language. Both
of these, however, come at the cost of losing
50% of the correct in-language sentences, be-
ing reduced from 22k sentences to 3k and 1k
sentences, respectively, which would likely be
too small for building downstream models. The
moral is that, at least at the current stage, there
is no one-size-fits-all approach for sentence-level
LangID filtering.
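The precision and recall trade-off described above can be made concrete with a small sketch. It assumes each audited sentence has a boolean audit outcome (correct and in-language or not) and a boolean filter decision (kept or dropped); the function name and toy values are hypothetical.

```python
from typing import Dict, Sequence


def filter_report(is_correct: Sequence[bool], kept: Sequence[bool]) -> Dict[str, float]:
    """Compare the share of correct sentences before and after filtering,
    and measure how many correct sentences the filter throws away."""
    if len(is_correct) != len(kept) or not is_correct:
        raise ValueError("inputs must be non-empty and of equal length")
    total_correct = sum(is_correct)
    kept_labels = [c for c, k in zip(is_correct, kept) if k]
    precision_pre = total_correct / len(is_correct)
    precision_post = sum(kept_labels) / len(kept_labels) if kept_labels else 0.0
    recall = sum(kept_labels) / total_correct if total_correct else 0.0
    return {
        "precision_pre": precision_pre,
        "precision_post": precision_post,
        "recall": recall,
        "recall_loss": 1.0 - recall,
    }


# Toy audit: 2 of 10 sentences are correct; the filter keeps 2 sentences,
# one of which is correct. Values are invented for illustration.
is_correct = [True, False, False, False, True, False, False, False, False, False]
kept = [True, False, True, False, False, False, False, False, False, False]
report = filter_report(is_correct, kept)
print(report)
```

In this toy case precision rises while half of the correct sentences are lost, the same kind of trade-off reported for Lingala and Oromo.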
5 Dataset Mis-labeling
Standardized and unambiguous representations
of language codes are important for practical
data use and exchange. The standard used
by most academic and industry applications
is BCP-47 (Phillips and Davis, 2005), which
builds off the two-letter ISO639-1 codes and
three-letter ISO639-3 codes, but also allows for
adding subtags for scripts (e.g., Hindi in Latin
script: hi-Latn) or regional varieties (e.g.,
French spoken in Canada: fr-CA). It would en-
hance transparency and interoperability if adopted
consistently, especially with growing language
diversity in NLP.
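To illustrate the subtag structure, the following sketch splits a tag into language, script, and region. It deliberately covers only the simple language[-script][-region] shape used in the examples above; the full BCP-47 grammar also allows extended language subtags, variants, extensions, and private-use subtags, which this regular expression does not handle. The names `LangTag` and `parse_tag` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern: 2-3 letter language, optional 4-letter script,
# optional region (2 letters or 3 digits). Not the full BCP-47 grammar.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


class LangTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_tag(tag: str) -> LangTag:
    """Split a simple tag into subtags, using conventional casing."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"not a simple language[-script][-region] tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    return LangTag(
        match.group("language").lower(),
        script.title() if script else None,
        region.upper() if region else None,
    )


print(parse_tag("hi-Latn"))  # Hindi in Latin script
print(parse_tag("fr-CA"))    # French as spoken in Canada
```

A tag written with an underscore, or a dataset-specific code that does not fit this shape, raises `ValueError`, which makes it a cheap first check for nonstandard labels.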
We find a variety of errors and inconsistencies
in language code usage, ranging from serious mis-
labelings to small transgressions against standard
conventions. For this analysis, we also include the
JW300 (Agić and Vulić, 2019) dataset, a multilingual
dataset crawled from jw.org. In summary,
we find 8 nonstandard codes in CCAligned, 3 in
OSCAR, 1 in mC4, 1 in WikiMatrix, and 70 in
JW300, for 83 in total. This does not include the
59 codes affected by superset issues. Full details
are given in Appendix A.
Inconsistent Language Codes One common is-
sue is simply using nonstandard or invented codes.
For example, CCAligned uses only two-letter
codes, so when the BCP-47 code for a language
is three letters it is either shortened (e.g., zza →
zz) or invented (shn → qa). Similarly, OSCAR
contains data labeled as als (BCP-47 for Tosk
Albanian) that is actually in gsw (Alemannic).9
Twenty-two additional language codes in JW300
have similar issues, including 12 codes that start
with jw but are not Javanese.
False Sign Languages Twelve percent (48/417)
of JW300 carry language codes for sign languages.
Instead of sign language transcripts they are texts
in another high-resource language, mostly English
or Spanish. For example, the en-zsl (Zambian
sign language) data is actually English-English
parallel data (copies); details are in Appendix A.
This was likely caused by videos with sign lan-
guage interpretation embedded on the crawled
Web sites.10
Mysterious Supersets When datasets contain
language codes that are supersets of other lan-
guage codes, it is difficult to determine which
particular language the text contains. WikiMatrix
has Serbian (sr), Croatian (hr), Bosnian (bs),
and Serbo-Croatian (sh)—their superset.11 The
issue of codes that are supersets of others is com-
mon enough to include a small table dedicated to it
(Appendix Table 7). In some cases this may not be
an issue, as with Arabic, where ar conventionally
refers to Modern Standard Arabic, even though
the code technically encompasses all dialects. In
many cases, the nature of the data in the superset
code remains a mystery.
Deprecated Codes Finally, there are several
deprecated codes that are used: sh in WikiMatrix,
iw in mC4, sh and eml in OSCAR, and daf in
JW300.
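A lightweight way to act on these findings is a lookup that flags or remaps problematic codes before a corpus is used. The sketch below only encodes the examples named in this section; the table names and the `check_code` helper are hypothetical, and a real pipeline would need the complete lists from Appendix A.

```python
from typing import Dict, Set, Tuple

# Remappings taken from the examples in this section (dataset, code) -> code.
KNOWN_REMAPS: Dict[Tuple[str, str], str] = {
    ("CCAligned", "zz"): "zza",  # shortened three-letter code
    ("CCAligned", "qa"): "shn",  # invented code
    ("OSCAR", "als"): "gsw",     # data is Alemannic, not Tosk Albanian
}

# Deprecated codes reported per dataset in this section.
DEPRECATED: Dict[str, Set[str]] = {
    "WikiMatrix": {"sh"},
    "mC4": {"iw"},
    "OSCAR": {"sh", "eml"},
    "JW300": {"daf"},
}


def check_code(dataset: str, code: str) -> Tuple[str, str]:
    """Return (suggested_code, status) where status is one of
    'remapped', 'deprecated', or 'ok'."""
    if (dataset, code) in KNOWN_REMAPS:
        return KNOWN_REMAPS[(dataset, code)], "remapped"
    if code in DEPRECATED.get(dataset, set()):
        return code, "deprecated"
    return code, "ok"


for dataset, code in [("CCAligned", "zz"), ("OSCAR", "als"), ("mC4", "iw"), ("mC4", "fr")]:
    print(dataset, code, "->", check_code(dataset, code))
```

Deprecated codes are only flagged, not rewritten, because the appropriate replacement depends on what the data under that code actually contains.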
9 This is a result of the language code used by the
Alemannic Wikipedia and affects any corpus or tool that
uses Wikipedia data without correcting for this, like FastText.
10 Kudos to Rebecca Knowles for this explanation.
11 https://iso639-3.sil.org/code/hbs.
6 Risks of Low-Quality Data
Low Quality in Downstream Applications
Text corpora today are building blocks for many
downstream NLP applications like question an-
swering and text summarization. For instance, a
common approach is to first train translation mod-
els on such data and then automatically translate
training data for downstream models (Conneau
et al., 2018). If the data used for the original sys-
tems is flawed, derived technology may fail for
those languages far down the line without know-
ing the causes. This risk of undesired downstream
effects calls for future studies with a careful treatment
of intertwined effects such as data size and
domain, language-specific phenomena, evaluation
data and metric biases. To give the reader a brief
glimpse of the impact of data quality for the
example of translation, we compare the C% met-
ric from our audit with the translation quality
(sentencepiece-BLEU, spBLEU) of the multilin-
gual translation model M2M124 for 124 languages
(Goyal et al., 2021). It was trained on WikiMa-
trix and CCAligned, and similar data collected
with the same tools, which we expect to show
similar biases. Translation quality is evaluated on
the trusted, human-translated FloReS benchmark
(Goyal et al., 2021). For the 21 languages present
in both the audit and the FloReS benchmark, we
found a positive correlation (Spearman) between
the data quality scores and spBLEU of ρ = 0.44
(p = 0.041). This is not as large as the correlation
with data size (ρ = 0.66, p = 0.00078), but it
nonetheless helps to explain translation quality—
the correlation between the product of C% and
data size (in other words, the expected total num-
ber of good sentences in the dataset), is the highest
yet, with a value of ρ = 0.73 (p = 0.00013).12
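The three rank correlations compared above can be computed as in the following sketch. It is a plain-Python Spearman implementation (Pearson correlation of average ranks) so that it runs without external libraries; the per-language numbers are invented placeholders, not values from the audit or from FloReS, and no p-values are computed.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5  # undefined if either input is constant


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-language values, for illustration only.
c_fraction = [0.10, 0.45, 0.80, 0.95]         # share of correct sentences (C%)
size = [50_000, 2_000_000, 300_000, 5_000_000]  # number of sentences
spbleu = [2.1, 14.0, 9.5, 30.2]                # translation quality
good_sentences = [c * s for c, s in zip(c_fraction, size)]

print("quality vs spBLEU:       ", spearman(c_fraction, spbleu))
print("size vs spBLEU:          ", spearman(size, spbleu))
print("quality*size vs spBLEU:  ", spearman(good_sentences, spbleu))
```

The product of C% and data size is simply the expected number of good sentences per language, which is why it can be ranked and correlated in the same way as the other two quantities.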
Representation Washing Since there are
datasets that contain many low-resource lan-
guages, the community may feel a sense of
progress and growing equity, despite the actual
quality of the resources for these languages. Sim-
ilarly, if low-quality datasets are used as bench-
marks they may exaggerate model performance,
making low-resource NLP appear more solved
than it is. Conversely, if models perform
poorly when trained with such data, it may be
wrongly assumed that the task of learning models
for these languages is harder than it actually is or
infeasible given current resources. These effects
could result in productive effort being redirected
away from these tasks and languages.
12 For the translation from English, BLEU scores are less
comparable but the trend holds nonetheless, with values of
(ρ = 0.32, p = 0.14), (ρ = 0.74, p = 0.000078), and
(ρ = 0.80, p = 0.0000087), respectively.
Trust in Incorrect ‘‘Facts’’ We found many
instances of parallel-looking sentences that are
structurally and semantically similar, but not fac-
tually correct translations (Table 6). They can
cause models to produce plausible ‘‘translations’’
that are factually wrong, but users may still trust
them (algorithmic trust) without verifying the
information. Similarly, automation bias (Skitka
et al., 1999), referring to humans favoring deci-
sions made by automated systems over decisions
made by humans, might amplify the issues of
inaccurate translations caused by misalignments.

en   The prime minister of the UK is Boris Johnson.
nl   De minister-president van Nederland is Mark Rutte.
     en: The prime minister of the Netherlands is Mark Rutte.

en   24 March 2018
pt   14 Novembro 2018
     en: 14 November 2018

en   The current local time in Sarasota is 89 minutes.
nn   Den lokale tiden i Miami er 86 minutt.
     en: The local time in Miami is 86 minutes.

en   In 1932 the highway was extended north to LA.
bar  1938 is de Autobahn bei Inglstod fertig gstellt.
     en: The highway near Inglstod was completed in 1938.

Table 6: Examples of ‘‘parallel’’ data where the
translation has a different meaning than the source,
but the form looks the same. (We added transla-
tions of the non-English side.) Such data may
encourage hallucinations of fake ‘‘facts’’.

7 Future Work and Recommendations
Of the five multilingual corpora evaluated, we
consistently found severe issues with quality, es-
pecially in the lower-resource languages. We rated
samples of 205 languages, and found that 87 of
them had under 50% usable data, with a full
15 languages at 0% in-language. We furthermore
found consistent issues with mislabeled data and
nonstandard language codes, particularly in the
JW300 dataset, and identified 83 affected cor-
pora, at least 48 of which were entirely spurious
(Section 5). While there might have been anecdo-
tal evidence of insufficient quality for some of the
datasets, the majority of these quality issues had
not been reported, nor been investigated in depth.
These issues might go unnoticed for languages
that are not represented in the evaluation of the
crawling methods, and cause harm in downstream
applications (Khayrallah and Koehn, 2018).
There are a variety of ways to improve both the
ease and accuracy of human evaluation, as well
as a few classes of issues we ignored in this paper,
like close dialects. Ideally we would like to build
a standard suite of automatic metrics for datasets,
but more research is necessary to determine what
the appropriate metrics would be. One important
area missing from our analyses, however, is the
estimated portion of a dataset which has been gen-
erated by MT (Rarrick et al., 2011), LM systems,
or bots/templates, as for example in the analysis of
C4 (Dodge et al., 2021). The information captured
in machine-generated content might still be useful
for modeling, but might falsely overrepresent typ-
ical generation patterns and introduce linguistic
errors or unnatural artifacts.
We therefore strongly recommend looking at
samples of any dataset before using it or releasing
it to the public. As we have shown, one does not
need to be proficient in a language to see when
there are serious quality issues, and a quick scan
of 100 sentences can be sufficient to detect major
problems. Moreover, going through and annotat-
ing a small sample of data can bring actionable
insights about new ways to filter or use it.
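Drawing such a sample does not require loading the whole corpus. The sketch below uses reservoir sampling (Algorithm R) to pick 100 lines uniformly at random from any iterable of lines, such as an open file; the function name, the default seed, and the file name mentioned in the comment are hypothetical.

```python
import random
from typing import Iterable, List, Optional


def sample_lines(lines: Iterable[str], k: int = 100, seed: Optional[int] = 0) -> List[str]:
    """Return up to k lines chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i < k:
            reservoir.append(line)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


# In practice the iterable would be an open corpus file, for example
# open("corpus.xx.txt", encoding="utf-8") with a hypothetical file name.
demo_corpus = (f"sentence {i}" for i in range(10_000))
sample = sample_lines(demo_corpus, k=100)
print(len(sample), "sentences sampled for manual inspection")
```

Sampling at random rather than reading the first 100 lines matters because crawled corpora are often ordered by source, so the head of a file can be unrepresentative.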
If data quality issues are found, a wide variety
of techniques can be explored, from filtering on
length-ratio, LangID, TF-IDF wordlists (Caswell
et al., 2020), or dictionaries (Kamholz et al., 2014)
to neural approaches like LM scoring (Axelrod
et al., 2011; Moore and Lewis, 2010; Wang et al.,
2018). Unfortunately, none of these provides a
quick and easy fix, especially for low-resource
languages: data cleaning is no trivial task!
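As an example of the simplest of these techniques, a length-ratio filter for sentence pairs might look like the sketch below. The whitespace tokenization and the threshold of 2.0 are assumptions chosen for illustration; languages without whitespace segmentation, or language pairs with very different morphology, would need character-based lengths or a pair-specific threshold.

```python
def length_ratio_ok(source: str, target: str, max_ratio: float = 2.0) -> bool:
    """Keep a pair only if neither side is empty and the longer side has
    at most max_ratio times as many whitespace tokens as the shorter."""
    len_src = len(source.split())
    len_tgt = len(target.split())
    if len_src == 0 or len_tgt == 0:
        return False
    return max(len_src, len_tgt) / min(len_src, len_tgt) <= max_ratio


# Toy pairs, invented for illustration.
pairs = [
    ("The weather is nice today .", "Het weer is mooi vandaag ."),
    ("Click here", "Lorem ipsum dolor sit amet consectetur adipiscing elit"),
    ("", "Orphaned target sentence"),
]
kept = [pair for pair in pairs if length_ratio_ok(*pair)]
print(f"kept {len(kept)} of {len(pairs)} pairs")
```

Such a filter removes grossly misaligned pairs but cannot detect the same-length, wrong-content pairs shown in Table 6.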
Noisy datasets are by no means useless, at least
if they contain some desirable content. Therefore
an alternative to filtering can be documentation
(Bender et al., 2021). This can take the form
of a per-language quality score and notes about
known issues, a datasheet (Gebru et al., 2018) or
nutrition label (Holland et al., 2018). However,
we suggest researchers not release corpora with
near-zero in-language content, as this may give
the mistaken impression of usable resources.
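A per-language quality note of the kind suggested above could be as small as the record sketched here. The field names and all values are hypothetical placeholders, not a proposed standard and not results from this audit.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class LanguageQualityNote:
    dataset: str
    language_code: str
    sample_size: int          # number of sentences that were inspected
    correct_fraction: float   # share of the sample judged correct
    known_issues: List[str] = field(default_factory=list)


# Placeholder values for illustration only.
note = LanguageQualityNote(
    dataset="ExampleCorpus",
    language_code="xx",
    sample_size=100,
    correct_fraction=0.42,
    known_issues=["contains boilerplate", "some sentences in a related language"],
)
print(json.dumps(asdict(note), indent=2))
```

Shipping such a record alongside each language split lets users decide whether the data is usable for their purpose without repeating the audit.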
Finally, we encourage the community to con-
tinue conducting evaluations and audits of public
datasets—similar to system comparison papers.
Acknowledgments
We would like to thank the TACL editors and
reviewers, and AfricaNLP and Google reviewers
who have helped us shape this paper. Furthermore,
we are grateful for Ahmed El-Kishky’s support
and help with CCAligned and WikiMatrix size
statistics.
References
Željko Agić and Ivan Vulić. 2019. JW300: A
wide-coverage parallel corpus for low-resource
languages. In Proceedings of the 57th Annual
Meeting of the Association for Computa-
tional Linguistics, pages 3204–3210, Florence,
Italy. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1/P19-1310
Wissam Antoun, Fady Baly, and Hazem Hajj.
2021. AraELECTRA: Pre-training text discrim-
inators for Arabic language understanding. In
Proceedings of the Sixth Arabic Natural Lan-
guage Processing Workshop, pages 191–195,
Kyiv, Ukraine (Virtual). Association for Com-
putational Linguistics.
Naveen Arivazhagan, Ankur Bapna, Orhan Firat,
Dmitry Lepikhin, Melvin Johnson, Maxim
Krikun, Mia Xu Chen, Yuan Cao, George F.
Foster, Colin Cherry, Wolfgang Macherey,
Zhifeng Chen, and Yonghui Wu. 2019. Mas-
sively multilingual neural machine translation
in the wild: Findings and challenges. arXiv
preprint arXiv:1907.05019.
Mikel Artetxe and Holger Schwenk. 2019. Mas-
sively multilingual sentence embeddings for
zero-shot cross-lingual
transfer and beyond.
Transactions of the Association for Computa-
tional Linguistics, 7:597–610. https://doi
.org/10.1162/tacl_a_00288
Amittai Axelrod, Xiaodong He, and Jianfeng
Gao. 2011. Domain adaptation via pseudo
in-domain data selection. In Proceedings of
the 2011 Conference on Empirical Methods in
Natural Language Processing, pages 355–362,
Edinburgh, Scotland, UK. Association for
Computational Linguistics.
Marta Bañón, Pinzhen Chen, Barry Haddow,
Kenneth Heafield, Hieu Hoang, Miquel
Esplà-Gomis, Mikel L. Forcada, Amir Kamran,
Faheem Kirefu, Philipp Koehn, Sergio
Ortiz Rojas, Leopoldo Pla Sempere, Gema
Ramírez-Sánchez, Elsa Sarrías, Marek Strelec,
Brian Thompson, William Waites, Dion
Wiggins, and Jaume Zaragoza. 2020. ParaCrawl:
Web-scale acquisition of parallel
corpora. In Proceedings of the 58th Annual
Meeting of the Association for Computa-
tional Linguistics, pages 4555–4567, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.acl-main.417
Emily M. Bender and Batya Friedman. 2018.
Data statements for natural language pro-
cessing: Toward mitigating system bias and
enabling better science. Transactions of the
Association for Computational Linguistics,
6:587–604. https://doi.org/10.1162/tacl_a_00041
Emily M. Bender, Timnit Gebru, Angelina
McMillan-Major, and Shmargaret Shmitchell.
2021. On the dangers of stochastic parrots:
Can language models be too big? In Pro-
ceedings of
the 2021 ACM Conference on
Fairness, Accountability, and Transparency,
pages 610–623, New York, NY, USA. Asso-
ciation for Computing Machinery. https://
doi.org/10.1145/3442188.3445922
Stella Biderman and Walter J. Scheirer. 2020.
Pitfalls in machine learning research: Reexam-
ining the development cycle. arXiv preprint
arXiv:2011.02832.
Abeba Birhane and Vinay Uday Prabhu. 2021.
Large image datasets: A pyrrhic win for com-
puter vision? In 2021 IEEE Winter Conference
on Applications of Computer Vision (WACV),
pages 1536–1546. https://doi.org/10
.1109/WACV48630.2021.00158
Jan A. Botha, Emily Pitler, Ji Ma, Anton
Bakalov, Alex Salcianu, David Weiss, Ryan
McDonald, and Slav Petrov. 2017. Natural
language processing with small feed-forward
networks. In Proceedings of the 2017 Con-
ference on Empirical Methods
in Natu-
ral Language Processing, pages 2879–2885,
Copenhagen, Denmark. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/D17-1309
Tom Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared D. Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya
Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens
Winter, Chris Hesse, Mark Chen, Eric Sigler,
Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever,
and Dario Amodei. 2020. Language models
are few-shot learners. In Advances in Neural
Information Processing Systems, volume 33,
pages 1877–1901. Curran Associates, Inc.
Isaac Caswell, Theresa Breiner, Daan van
Esch, and Ankur Bapna. 2020. Language
ID in the wild: Unexpected challenges on
the path to a thousand-language web text
corpus. In Proceedings of the 28th Inter-
national Conference on Computational Lin-
guistics, pages 6588–6608, Barcelona, Spain
(Online). International Committee on Compu-
tational Linguistics.
https://doi.org/10.18653/v1/2020.coling-main.579
Branden Chan, Stefan Schweter, and Timo
Möller. 2020. German’s next language model.
In Proceedings of
the 28th International
Conference on Computational Linguistics,
pages 6788–6796, Barcelona, Spain (On-
line). International Committee on Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2020.coling-main.598
Avihay Chriqui and Inbal Yahav. 2021. HeBERT
& HebEMO: A Hebrew BERT Model and
a Tool for Polarity Analysis and Emotion
Recognition. arXiv preprint arXiv:2102.01909.
Alexis Conneau, Kartikay Khandelwal, Naman
Goyal, Vishrav Chaudhary, Guillaume Wenzek,
Francisco Guzmán, Edouard Grave, Myle Ott,
Luke Zettlemoyer, and Veselin Stoyanov.
2020. Unsupervised cross-lingual representa-
tion learning at scale. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, pages 8440–8451,
Online. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.acl-main.747
Alexis Conneau, Ruty Rinott, Guillaume Lample,
Adina Williams, Samuel Bowman, Holger
Schwenk, and Veselin Stoyanov. 2018. XNLI:
Evaluating cross-lingual sentence representa-
tions. In Proceedings of the 2018 Conference
on Empirical Methods
in Natural Lan-
guage Processing, pages 2475–2485, Brussels,
Belgium. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/D18-1269
Pieter Delobelle, Thomas Winters, and Bettina
Berendt. 2020. RobBERT: a Dutch RoBERTa-
based Language Model. In Findings of the
Association for Computational Linguistics:
EMNLP 2020, pages 3255–3265, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.findings-emnlp.292
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.
Jesse Dodge, Maarten Sap, Ana Marasovic,
William Agnew, Gabriel
Ilharco, Dirk
Groeneveld, and Matt Gardner. 2021. Docu-
menting the english colossal clean crawled
corpus. arXiv preprint arXiv:2104.08758.
Stefan Dumitrescu, Andrei-Marius Avram, and
Sampo Pyysalo. 2020. The birth of Romanian
BERT. In Findings of the Association for
Computational Linguistics: EMNLP 2020,
pages 4324–4328, Online. Association for
Computational Linguistics.
https://doi.org/10.18653/v1/2020.findings-emnlp.387
Ahmed El-Kishky, Vishrav Chaudhary, Francisco
Guzmán, and Philipp Koehn. 2020. CCAligned:
A massive collection of cross-lingual web-
document pairs. In Proceedings of the
2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP),
pages 5960–5969, Online. Association for
Computational Linguistics.
https://doi.org/10.18653/v1/2020.emnlp-main.480
Miquel Esplà, Mikel Forcada, Gema
Ramírez-Sánchez, and Hieu Hoang. 2019.
ParaCrawl: Web-scale parallel corpora for
the languages of the EU. In Proceedings of
Machine Translation Summit XVII: Translator,
Project and User Tracks, pages 118–119,
Dublin, Ireland. European Association for
Machine Translation.
Angela Fan, Shruti Bhosale, Holger Schwenk,
Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal,
Mandeep Baines, Onur Celebi, Guillaume
Wenzek, Vishrav Chaudhary, Naman Goyal,
Tom Birch, Vitaliy Liptchinsky, Sergey
Edunov, Edouard Grave, Michael Auli, and
Armand Joulin. 2020. Beyond English-centric
multilingual machine translation. arXiv preprint
arXiv:2010.11125.
Wilhelmina Nekoto, Vukosi Marivate,
Tshinondiwa Matsila, Timi Fasubaa, Taiwo
Fagbohungbe, Solomon Oluwole Akinola,
Shamsuddeen Muhammad, Salomon Kabongo
Kabenamualu, Salomey Osei, Freshia Sackey,
Rubungo Andre Niyongabo, Ricky Macharm,
Perez Ogayo, Orevaoghene Ahia, Musie
Meressa Berhe, Mofetoluwa Adeyemi, Masabata
Mokgesi-Selinga, Lawrence Okegbemi, Laura
Martinus, Kolawole Tajudeen, Kevin Degila,
Kelechi Ogueji, Kathleen Siminyu, Julia
Kreutzer, Jason Webster, Jamiil Toure Ali,
Jade Abbott, Iroro Orife, Ignatius Ezeani,
Idris Abdulkadir Dangana, Herman Kamper,
Hady Elsahar, Goodness Duru, Ghollah Kioko,
Murhabazi Espoir, Elan van Biljon, Daniel
Whitenack, Christopher Onyefuluchi, Chris
Chinenye Emezue, Bonaventure F. P. Dossou,
Blessing Sibanda, Blessing Bassey, Ayodele
Olabiyi, Arshath Ramkilowan, Alp Öktem,
Adewale Akinfaderin, and Abdallah Bashir.
2020. Participatory research for low-resourced
machine translation: A case study in African
languages. In Findings of the Association for
Computational Linguistics: EMNLP 2020.
Online. https://doi.org/10.18653/v1/2020.findings-emnlp.195
Leo Gao, Stella Biderman, Sid Black, Laurence
Golding, Travis Hoppe, Charles Foster, Jason
Phang, Horace He, Anish Thite, Noa
Nabeshima, Shawn Presser and Connor Leahy.
2020. The pile: An 800gb dataset of diverse
text for language modeling. arXiv preprint
arXiv:2101.00027.
Timnit Gebru,
Jamie Morgenstern, Briana
Vecchione, Jennifer Wortman Vaughan, Hanna
Wallach, Hal Daumé III, and Kate Crawford.
2018. Datasheets for datasets. arXiv preprint
arXiv:1803.09010.
Naman Goyal, Cynthia Gao, Vishrav Chaudhary,
Peng-Jen Chen, Guillaume Wenzek, Da Ju,
Sanjana Krishnan, Marc’Aurelio Ranzato,
Francisco Guzmán, and Angela Fan. 2021.
The FLORES-101 evaluation benchmark for
low-resource and multilingual machine transla-
tion. arXiv preprint arXiv:2106.03193.
Edouard Grave, Piotr Bojanowski, Prakhar Gupta,
Armand Joulin, and Tomas Mikolov. 2018.
Learning word vectors for 157 languages.
In Proceedings of the Eleventh International
Conference on Language Resources and Evalu-
ation (LREC 2018), Miyazaki, Japan. European
Language Resources Association (ELRA).
Sarah Holland, Ahmed Hosny, Sarah Newman,
Joshua Joseph, and Kasia Chmielinski. 2018.
The dataset nutrition label: A framework to
drive higher data quality standards. arXiv
preprint arXiv:1805.03677.
Junjie Hu, Sebastian Ruder, Aditya Siddhant,
Graham Neubig, Orhan Firat, and Melvin
Johnson. 2020. XTREME: A massively mul-
tilingual multi-task benchmark for evaluating
cross-lingual generalisation. In Proceedings of
the 37th International Conference on Machine
Learning, volume 119 of Proceedings of Ma-
chine Learning Research, pages 4411–4421.
PMLR.
Armand Joulin, Edouard Grave, Piotr Bojanowski,
Matthijs Douze, Hervé Jégou, and Tomás
Mikolov. 2016. Fasttext.zip: Compressing text
classification models. arXiv preprint arXiv:
1612.03651.
Armand Joulin, Edouard Grave, Piotr Bojanowski,
and Tomas Mikolov. 2017. Bag of tricks for
efficient text classification. In Proceedings of
the 15th Conference of the European Chapter
of the Association for Computational Linguis-
tics: Volume 2, Short Papers, pages 427–431,
Valencia, Spain. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/E17-2068
Marcin Junczys-Dowmunt. 2018. Dual condi-
tional cross-entropy filtering of noisy parallel
corpora. In Proceedings of the Third Confer-
ence on Machine Translation: Shared Task
Papers, pages 888–895, Belgium, Brussels.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/W18
-6478
Marcin
Junczys-Dowmunt.
2019. Microsoft
translator at WMT 2019: Towards large-scale
document-level neural machine translation. In
Proceedings of the Fourth Conference on Ma-
chine Translation (Volume 2: Shared Task
Papers, Day 1), pages 225–233, Florence,
Italy. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/W19-5321
Divyanshu Kakwani, Anoop Kunchukuttan,
Satish Golla, Gokul N. C., Avik Bhattacharyya,
Mitesh M. Khapra, and Pratyush Kumar. 2020.
IndicNLPSuite: Monolingual corpora, evalua-
tion benchmarks and pre-trained multilingual
language models for Indian languages. In Find-
ings of the Association for Computational
Linguistics: EMNLP 2020, pages 4948–4961,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.findings-emnlp.445
David Kamholz, Jonathan Pool, and Susan
Colowick. 2014. PanLex: Building a resource
for panlingual lexical translation. In Proceed-
ings of the Ninth International Conference on
Language Resources and Evaluation (LREC’14),
pages 3145–3150, Reykjavik, Iceland.
European Language Resources Association
(ELRA).
Vincentius Kevin, Birte Högden, Claudia
Schwenger, Ali Şahan, Neelu Madan, Piush
Aggarwal, Anusha Bangaru, Farid Muradov,
and Ahmet Aker. 2018. Information nutrition
labels: A plugin for online news evaluation.
In Proceedings of
the First Workshop on
Fact Extraction and VERification (FEVER),
pages 28–33, Brussels, Belgium. Association
for Computational Linguistics. https://
doi.org/10.18653/v1/W18-5505
Huda Khayrallah and Philipp Koehn. 2018. On
the impact of various types of noise on
neural machine translation. In Proceedings
of the 2nd Workshop on Neural Machine
Translation and Generation, pages 74–83,
Melbourne, Australia. Association for Compu-
tational Linguistics.
https://doi.org/10.18653/v1/W18-2709
Philipp Koehn, Vishrav Chaudhary, Ahmed
El-Kishky, Naman Goyal, Peng-Jen Chen,
and Francisco Guzmán. 2020. Findings of
the WMT 2020 shared task on parallel cor-
pus filtering and alignment. In Proceedings
of the Fifth Conference on Machine Transla-
tion, pages 726–742, Online. Association for
Computational Linguistics.
John Koutsikakis,
Ilias Chalkidis, Prodromos
Malakasiotis, and Ion Androutsopoulos. 2020.
Greek-bert: The greeks visiting sesame street.
In 11th Hellenic Conference on Artificial In-
telligence, SETN 2020, pages 110–117, New
York, NY, USA. Association for Computing
Machinery. https://doi.org/10.1145
/3411408.3411440
Alexandra Sasha Luccioni
and Joseph D.
Viviano. 2021. What’s in the box? an analy-
sis of undesirable content
in the common
crawl corpus. arXiv preprint arXiv:2105.02732.
https://doi.org/10.18653/v1/2021
.acl-short.24
Louis Martin, Benjamin Muller, Pedro Javier
Ortiz Suárez, Yoann Dupont, Laurent Romary,
Éric de la Clergerie, Djamé Seddah, and
Benoît Sagot. 2020. CamemBERT: A tasty
French language model. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, pages 7203–7219,
Online. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.acl-main.645
Mihai Masala, Stefan Ruseti, and Mihai Dascalu.
2020. RoBERT – a Romanian BERT model.
In Proceedings of the 28th International
Conference on Computational Linguistics,
pages 6626–6637, Barcelona, Spain (On-
line). International Committee on Computa-
tional Linguistics.
https://doi.org/10.18653/v1/2020.coling-main.581
Robert C. Moore and William Lewis. 2010.
Intelligent selection of language model train-
ing data. In Proceedings of
the ACL 2010
Conference Short Papers, pages 220–224,
Uppsala, Sweden. Association for Computa-
tional Linguistics.
Pedro Javier Ortiz Suárez, Laurent Romary,
and Benoît Sagot. 2020. A monolingual approach
to contextualized word embeddings for
mid-resource languages. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, pages 1703–1714,
Online. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.acl-main.156
Pedro Javier Ortiz Suárez, Benoît Sagot, and
Laurent Romary. 2019. Asynchronous pipelines
for processing huge corpora on medium to low
resource infrastructures. In Proceedings of the
Workshop on Challenges in the Management
of Large Corpora (CMLC-7) 2019. Cardiff,
22nd July 2019, pages 9–16, Mannheim.
Leibniz-Institut für Deutsche Sprache.
Addison Phillips and Mark Davis. 2005. Tags
for Identifying Languages. Internet Engineering
Task Force. Work in Progress.
Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna
Padmanabhan, and Graham Neubig. 2018.
When and why are pre-trained word embed-
dings useful for neural machine translation?
In Proceedings of the 2018 Conference of
the North American Chapter of the Associ-
ation for Computational Linguistics: Human
Language Technologies, Volume 2 (Short Pa-
pers), pages 529–535, New Orleans, Louisiana.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N18-2084
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21:1–67.
Spencer Rarrick, Chris Quirk, and Will Lewis.
2011. MT detection in Web-scraped paral-
lel corpora. In Proceedings of MT Summit
XIII. Asia-Pacific Association for Machine
Translation.
Holger Schwenk, Vishrav Chaudhary, Shuo Sun,
Hongyu Gong, and Francisco Guzmán. 2021.
WikiMatrix: Mining 135M parallel sentences
in 1620 language pairs from Wikipedia.
In Proceedings of the 16th Conference of
the European Chapter of the Association
for Computational Linguistics: Main Volume,
pages 1351–1361, Online. Association for
Computational Linguistics.
https://doi.org/10.18653/v1/2021.eacl-main.115
Amit Seker, Elron Bandel, Dan Bareket, Idan
Brusilovsky, Refael Shaked Greenfeld, and
Reut Tsarfaty. 2021. AlephBERT: A Hebrew
large pre-trained language model to start-off
your Hebrew NLP application with. arXiv
preprint arXiv:2104.04052.
Linda J. Skitka, Kathleen L. Mosier, and Mark
Burdick. 1999. Does automation bias decision-
making? International Journal of Human-
Computer Studies, 51(5):991–1006. https://
doi.org/10.1006/ijhc.1999.0252
Chenkai Sun, Abolfazl Asudeh, H. V. Jagadish,
Bill Howe, and Julia Stoyanovich. 2019.
Mithralabel: Flexible dataset nutritional labels
for responsible data science. In Proceedings
of
the 28th ACM International Conference
on Information and Knowledge Management,
pages 2893–2896, New York, NY, USA.
Association for Computing Machinery.
Wei Wang, Taro Watanabe, Macduff Hughes,
Tetsuji Nakagawa, and Ciprian Chelba. 2018.
Denoising neural machine translation train-
ing with trusted data and online data selec-
tion. In Proceedings of the Third Conference
on Machine Translation: Research Papers,
pages 133–143, Brussels, Belgium. Association
for Computational Linguistics. https://
doi.org/10.18653/v1/W18-6314
Bryan Wilie, Karissa Vincentio, Genta Indra
Winata, Samuel Cahyawijaya, Xiaohong Li,
Zhi Yuan Lim, Sidik Soleman, Rahmad
Mahendra, Pascale Fung, Syafri Bahar, and
Ayu Purwarianti. 2020. IndoNLU: Benchmark
and resources for evaluating Indonesian natural
language understanding. In Proceedings of the
1st Conference of the Asia-Pacific Chapter of
the Association for Computational Linguistics
and the 10th International Joint Conference on
Natural Language Processing, pages 843–857,
Suzhou, China. Association for Computational
Linguistics.
Hainan Xu and Philipp Koehn. 2017. Zippo-
rah: A fast and scalable data cleaning system
for noisy Web-crawled parallel corpora. In
Proceedings of the 2017 Conference on Empir-
ical Methods in Natural Language Processing,
pages 2945–2950, Copenhagen, Denmark.
Association for Computational Linguistics.
Linting Xue, Noah Constant, Adam Roberts,
Mihir Kale, Rami Al-Rfou, Aditya Siddhant,
Aditya Barua, and Colin Raffel. 2021. mT5:
A massively multilingual pre-trained text-to-text
transformer. In Proceedings of the
2021 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 483–498, Online. Association for
Computational Linguistics.
| Dataset | Supercode | Subcode(s) |
|---|---|---|
| JW300 | kg | kwy |
| JW300 | mg | tdx |
| JW300 | qu | que, qug, qus, quw, quy, quz, qvi, qvz |
| JW300 | sw | swc |
| OSCAR | ar | arz |
| OSCAR | az | azb |
| OSCAR | sh | bs, hr, sr |
| OSCAR | ku | ckb |
| OSCAR | ms | id, min |
| OSCAR | no | nn |
| OSCAR | sq | als∗ |
| OSCAR | zh | yue, wuu |
| WikiMatrix | ar | arz |
| WikiMatrix | sh | bs, hr, sr |
| WikiMatrix | zh | wuu |

Table 7: Situations where two language codes
are represented, but one is a superset of another
by the ISO standard, leading to uncertainty about
the data in the supercode dataset. ∗The als
dataset is actually in gsw.

| Actual language | Code in JW300 |
|---|---|
| cs | cse |
| de | gsg |
| el | gss |
| en | ase, asf, bfi, ins, psp, sfs, zib, zsl |
| es | aed, bvl, csf, csg, csn, csr, ecs, esn, gsm, hds, lsp, mfs, ncs, prl, pys, ssp, vsl |
| fi | fse |
| fr | fcs, fsl |
| hu | hsh |
| id | inl |
| it | ise |
| ja | jsl |
| ko | kvk |
| pl | pso |
| pt | bzs, mzy, psr, sgn_AO |
| ro | rms |
| ru | rsl |
| sk | svk |
| sq | sql |
| st | jw_ssa |
| zh | csl, tss |
Table 8: There are 48 languages in the JW300
corpus with language codes that correspond
to sign languages, but in reality are unrelated
high-resource languages (usually the most spoken
language in the country of origin of the sign
language). This table shows the actual language of the
data corresponding to each sign language code.

A Details on Language Code Issues

Table 7 provides a complete list of the corpora
where one code is defined as a superset of the
other by the ISO standard, and Table 8 provides
a complete list of the language codes in
JW300 that purport to be sign languages but are
actually unrelated high-resource languages.

Special attention needs to be given to the JW300
dataset, which, in addition to the sign language
and superset code issues, has a variety of other
peculiarities. These problems seem to originate in
the codes used by jw.org,13 which were apparently
not checked in the creation of the JW300
dataset. An overview is provided in Table 9, and
the following paragraphs give specifics.

13 The jw.org Web site seems to use correct BCP-47
extensions now, however, and entering a code such as
‘‘jw_dmr’’ redirects to ‘‘naq_x_dmr’’.

Twelve languages in JW300 have codes starting
in jw_, suggesting they are varieties of
Javanese (ISO639-1 jw), but are instead attempts
to represent language dialects for which
there are no BCP-47 codes. These codes seem
to have been updated in jw.org to appropriate
BCP-47 private-use extensions in the form
supertag_x_subtag (for example, naq_x_dmr), which
are provided in Table 9.

In summary: twelve languages have codes starting
in jw_, which are mis-parsed private-use extensions;
three codes appear in addition to equivalent ISO
codes, making it unclear which languages they
are; one language uses a deprecated ISO code; and
four languages use the ISO639-3 code instead of
the ISO639-2 code, and therefore are not BCP-47.

In addition to the jw_ tags, there are two other
misused private subtags: hy_arevmda, which in
addition to lacking the mandatory _x_ appears to
represent standard Western Armenian (hyw); and
rmy_AR, which, rather than being Romany from
Argentina, is Kalderash Romany.

There are also a few anomalies where private-use
extensions should have been used but other
methods were found to convey the distinctions.
Three codes appear in addition to equivalent ISO
codes, making it unclear which languages they are.
Two of these are equivalencies between ISO639-2
and ISO639-3 (nya and ny are both Chichewa,
qu and que are both Quechua), and one is a
script equivalency (kmr and kmr_latn are both
in Latin script). In these three cases the two codes
do represent different languages, so a private-use
extension would have been appropriate.

Finally, there is the more minor issue that four
languages use the ISO639-3 code instead of the
ISO639-2 code, and therefore are not BCP-47.

| Code in JW300 | BCP-47 code | Actual Language Name |
|---|---|---|
| Incorrect private-use extensions | | |
| hy_arevmda | hyw | Western Armenian |
| jw_dgr | os_x_dgr | Digor Ossetian |
| jw_dmr | naq_x_dmr | Damara Khoekhoe |
| jw_ibi | yom_x_ibi | Ibinda Kongo |
| jw_paa | pap_x_paa | Papiamento (Aruba) |
| jw_qcs | qxl | Salasaca Highland Kichwa |
| jw_rmg | rmn_x_rmg | Greek Romani (South) |
| jw_rmv | rmy_x_rmv | Vlax Romani, Russia |
| jw_spl | nso_x_spl | Sepulana |
| jw_ssa | st_ZA | Sesotho (South Africa) |
| jw_tpo | pt_PT | Portuguese (Portugal) |
| jw_vlc | ca_x_vlc | Catalan (Valencia) |
| jw_vz | skg_x_vz | Vezo Malagasy |
| rmy_AR | rmy_x_? | Kalderash |
| Equivalent codes used in place of extensions | | |
| kmr_latn | kmr_x_rdu | Kurmanji (Caucasus) |
| nya | ny_x_? | Chinyanja (Zambia) |
| que | qu_x_? | Quechua (Ancash) |
| Deprecated codes | | |
| daf | dnj/lda | Dan |
| ISO-639-3 used in place of ISO-639-2 | | |
| cat | ca | Catalan |
| gug | gn | Guarani |
| run | rn | Kirundi |
| tso_MZ | ts_MZ | Changana (Mozambique) |

Table 9: Language code issues in the JW300
dataset for 22 language varieties not covered by
Tables 7 and 8. Private-use extensions are given
as they appear in jw.org, and specified as ‘?’ if
they are absent from jw.org.

| Dataset | Code in Corpus | Correct Code |
|---|---|---|
| CCAligned | zz | zza |
| CCAligned | sz | szl |
| CCAligned | ns | nso |
| CCAligned | cb | ckb |
| CCAligned | tz | ber |
| CCAligned | qa | shn |
| CCAligned | qd | kac |
| CCAligned | cx | ceb |
| mC4 | iw | he |
| OSCAR | eml | egl |
| OSCAR | als | gsw |
| OSCAR | sh | hbs |
| WikiMatrix | sh | hbs |

Table 10: Miscellaneous errors in language codes.

In addition to the JW300-specific errors,
Table 10 summarizes miscellaneous errors in
CCAligned and OSCAR that were detailed in
Section 5.
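The corrections in Table 10 lend themselves to a simple lookup. The following is a minimal sketch rather than anything shipped with the datasets or used in the audit: the name `normalize_code` and the dictionary layout are illustrative assumptions, while the code pairs themselves are copied from Table 10.

```python
# Hypothetical helper (not part of any audited dataset's tooling):
# map a published (dataset, code) pair to the corrected code from Table 10.
CORRECTIONS = {
    ("CCAligned", "zz"): "zza",
    ("CCAligned", "sz"): "szl",
    ("CCAligned", "ns"): "nso",
    ("CCAligned", "cb"): "ckb",
    ("CCAligned", "tz"): "ber",
    ("CCAligned", "qa"): "shn",
    ("CCAligned", "qd"): "kac",
    ("CCAligned", "cx"): "ceb",
    ("mC4", "iw"): "he",
    ("OSCAR", "eml"): "egl",
    ("OSCAR", "als"): "gsw",
    ("OSCAR", "sh"): "hbs",
    ("WikiMatrix", "sh"): "hbs",
}


def normalize_code(dataset: str, code: str) -> str:
    """Return the corrected code if Table 10 lists one, else the code unchanged."""
    return CORRECTIONS.get((dataset, code), code)
```

Keying on the dataset as well as the code is a deliberate choice in this sketch: the same short code can be wrong in one corpus and acceptable in another, so a global code-to-code mapping would risk over-correcting.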
B Complete Error Taxonomy and Instructions
In addition to the examples given in Table 2, raters
were provided with the following verbal notes on
the error codes:
• CC: Correct translation, natural sentence:
It’s OK if it’s a sentence fragment instead of
a whole sentence, as long as it is not too short
(about 5 words or greater). The translation
does not have to be perfect.
• CS: Correct translation, but single word or
short phrase: Also includes highly repeated
short phrases, like ‘‘the cat the cat the cat the
cat the cat …’’
• CB: Correct translation, but boilerplate:
This can be auto-generated or formulaic con-
tent, or content that one deems ‘‘technically
correct but generally not very useful to NLP
models’’. Unfortunately, it’s often not clear
what should be counted as boilerplate…do
your best.
• X: Incorrect translation [for parallel
sentences]: both source and target are in the
correct language, but they are not adequate
translations.
• WL: Wrong language: For short sentences,
especially with proper nouns, there is often
a fine line between ‘‘Wrong language’’ and
‘‘Not language’’. Do your best.
• NL: Not language: At least one of source
and target is not linguistic content. Any
sentence consisting only of a proper noun
(e.g. ‘‘Tyrone Ping’’) should be marked as
NL.
• U: Unknown: For sentences that need verification
by a native speaker. This is an auxiliary
label that is resolved in most cases.
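To make the label set concrete, the taxonomy above could be encoded as follows. This is an illustrative sketch only: the `Label` enum and the `is_correct` helper are assumed names and are not part of the annotation tooling used in the audit. The grouping of CC, CS, and CB as "correct" follows the C column described in Appendix D.

```python
from enum import Enum


class Label(Enum):
    """Error codes given to raters (descriptions abbreviated from the notes above)."""
    CC = "Correct translation, natural sentence"
    CS = "Correct translation, but single word or short phrase"
    CB = "Correct translation, but boilerplate"
    X = "Incorrect translation"
    WL = "Wrong language"
    NL = "Not language"
    U = "Unknown"


# The three correct sub-codes that are aggregated into the C column.
CORRECT = frozenset({Label.CC, Label.CS, Label.CB})


def is_correct(label: Label) -> bool:
    """True if the label counts toward the aggregated C (correct) column."""
    return label in CORRECT
```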
C Methodological Notes
A surprising amount of work can be done without
being an expert in the languages involved. The
easiest approach is simply to search the internet
for the sentence, which usually results in finding
the exact page the sentence came from, which in
turn frequently contains clues like language codes
in the URL, or a headline like News in X language,
sometimes with references to a translated version
of the same page. However, for the cases where
this is insufficient, here are a few tips, tricks, and
observations.
No Skills Required: Things that do not require
knowledge of the language(s) in question.
1. ‘‘Not language’’ can usually be identified by
anyone who can read the script, though there
are tricky cases with proper nouns.
2. Frequently, ‘‘parallel’’ sentences contain dif-
ferent numbers in the source and target
(especially autogenerated content), and are
easy to disqualify.
3. Errors tend to repeat. If a word is mistrans-
lated once, it will often be mistranslated many
more times throughout a corpus, making it
easy to spot.
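The number check in item 2 above can be partly automated. The sketch below is a rough heuristic under stated assumptions, not the procedure used by the annotators: it only looks at ASCII digits, it ignores numbers written out as words, and the function names `numbers_in` and `numbers_match` are made up for illustration.

```python
import re

# ASCII digit runs, optionally grouped with "." or "," (e.g. 1,000 or 3.5).
_NUMBER = re.compile(r"[0-9]+(?:[.,][0-9]+)*")


def numbers_in(text: str) -> list:
    """Return the sorted digit strings in text, with separators removed.

    Dropping "." and "," lets 1,000 and 1.000 compare equal, since
    conventions for thousands and decimal separators differ by locale.
    """
    return sorted(re.sub(r"[.,]", "", match) for match in _NUMBER.findall(text))


def numbers_match(source: str, target: str) -> bool:
    """True if source and target contain the same multiset of numbers."""
    return numbers_in(source) == numbers_in(target)
```

A pair for which `numbers_match` returns False is only a candidate for disqualification; a human check is still needed, for instance when one side spells a number out in words.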
Basic Research Required: Things that do not
require knowledge of the language(s) in question
but can be done with basic research.
1. If it’s written in the wrong script it’s consid-
ered wrong language. (Sometimes the writing
system is indicated in the published corpus,
e.g., bg-Latn, but usually the language has
a ‘‘default’’ script defined by ISO.)
2. Some types of texts come with inherent labels
or markers, such as enumerators or verse
numbers.
3. When all else fails, search the internet for the
whole sentence or n-grams thereof! If the
whole sentence can be found, frequently
the language is betrayed by the web page (the
language’s autonym is useful in this case).
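The wrong-script check in item 1 above can be approximated with the Python standard library. This is a crude sketch under stated assumptions: it treats the first word of each character's Unicode name (such as LATIN or CYRILLIC) as a proxy for its script, which is not a real Unicode Script property lookup, and the function name `dominant_script` is hypothetical.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Guess the most frequent script among the alphabetic characters of text.

    Uses the first word of the Unicode character name (e.g. "CYRILLIC")
    as a rough stand-in for the script. Returns "" if there are no letters.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else ""
```

For example, a sentence published under bg but written in Latin characters would yield LATIN rather than CYRILLIC, flagging it for a closer look. Mixed-script text and scripts whose character names do not start with the script name would need a proper script database.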
D Complete Audit Results
Tables 11, 12, 13, 14, and 15 give the complete
annotation percentages for CCAligned, Wiki-
Matrix, ParaCrawl, mC4 and OSCAR, respec-
tively. For each annotation label, we report the
ratio of the annotated sentences (of max 100
sentences) that were assigned that label by the
primary annotator. Repeated annotations done for
agreement measurement are not included. The C
column aggregates all correct sub-codes (CC, CS,
CB). We also report the total number of sentences
that each dataset contains for each language and
the average sentence length for the audited sen-
tences to illustrate differences across languages.
The original language codes as they are published
with the datasets are maintained for the sake of
consistency (but should be handled with care in
future work, see Section 5), and those with less
than 20% correct sentences are highlighted.
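As a worked illustration of how the reported percentages relate to the raw annotations, the sketch below computes per-label shares for one language and applies the 20% threshold. It is an assumption-laden example, not the scripts used for the audit: the function names are hypothetical, and labels are represented as plain strings.

```python
from collections import Counter

# Sub-codes aggregated into the C (correct) column.
CORRECT_LABELS = ("CC", "CS", "CB")


def label_percentages(labels):
    """Map each label to its share, in percent, of the audited sentences."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(labels)
    return {label: 100.0 * n / total for label, n in counts.items()}


def percent_correct(labels):
    """Aggregate the correct sub-codes (CC, CS, CB) into a single C value."""
    shares = label_percentages(labels)
    return sum(shares.get(label, 0.0) for label in CORRECT_LABELS)


def is_highlighted(labels, threshold=20.0):
    """True if fewer than `threshold` percent of the sentences are correct."""
    return percent_correct(labels) < threshold
```

For instance, a language with 10 CC, 2 CS, 80 X, and 8 WL annotations out of 100 has C = 12%, which falls below the threshold and would be highlighted.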
en-sz PL
en-mt MT
en-tz MA
en-zz TR
en-kg AO
en-qa MM
en-bm ML
en-az IR
en-qd MM
en-ay BO
en-ak GH
en-st ZA
en-ve ZA
en-ts ZA
en-or IN
en-ns ZA
en-lg UG
en-ln CD
en-om KE
en-ss SZ
en-te IN rom
en-cb IQ
en-tn BW
en-ff NG
en-sn ZW
en-wo SN
en-br FR
en-zu ZA
en-ku TR
en-ig NG
en-kn IN
en-yo NG
en-ky KG
en-tg TJ
en-ha NG
en-am ET
en-km KH
en-ne NP
en-su ID
en-ur PK rom
en-ht HT
en-mn MN
en-te IN
en-kk KZ
en-be BY
en-af ZA
en-jv ID
en-hi IN rom
en-lv LV
en-ar AR rom
en-tl XX
en-uk UA
en-zh TW
en-el GR
en-nl NL
en-da DK
en-vi VN
en-sv SE
en-zh CN
en-tr TR
en-ja XX
en-pt XX
en-it IT
en-de DE
en-es XX
NL
X
0.00%
8.00% 67.00%
1.43%
3.96%
2.97%
9.57%
3.00%
6.00% 31.00% 46.00%
#sentences
porn
WL
CB
CS
CC
C
12
0.00%
0.00%
8.33% 91.67%
0.00%
0.00%
0.00%
26
0.00%
0.00% 50.00% 26.92% 19.23%
3.85%
0.00%
3.85%
33
0.00%
0.00% 45.45% 36.36%
6.06%
6.06%
6.06%
12.12%
34
0.00%
8.82% 61.76% 29.41%
0.00%
0.00%
0.00%
0.00%
74
0.00%
2.70% 81.08%
0.00% 14.86%
1.35%
0.00%
1.35%
136
0.00%
3.68% 13.24%
1.47% 72.06%
3.68%
5.88%
11.03%
149
0.00%
6.71% 60.40%
0.00% 26.85%
2.01%
4.03%
6.04%
158
0.00%
0.00% 20.79% 13.86% 58.42%
0.00%
6.93%
6.93%
179
0.00%
3.96%
0.99% 81.19%
6.93%
7.92%
1.98%
4.95%
475
0.00%
0.00% 29.00%
3.00% 17.00%
51.00% 33.00% 18.00%
478
0.00%
0.00% 46.86% 19.25% 19.67%
0.63%
14.23% 13.60%
904
0.00%
9.29%
6.43% 40.71%
48.57% 42.14%
0.00%
1555
0.00%
6.93%
8.91% 28.71%
60.40% 29.70% 21.78%
1967
0.00%
4.95%
4.95% 40.59%
51.49% 34.65% 11.88%
5526
0.00%
6.09% 24.35% 12.17% 38.26%
9.57%
42.61%
14138
4.00%
2.00% 23.00% 15.00% 58.00%
2.00%
0.00%
4.00%
14701
2.00%
0.00% 68.00% 17.00%
0.00%
9.00%
6.00%
6.00%
21562
4.00%
4.00% 74.00%
1.00% 14.00%
4.00%
3.00%
8.00%
22206
0.00% 31.00% 38.00% 29.00% 24.00%
2.00%
0.00%
2.00%
22960
0.00% 13.25% 24.10% 50.00% 13.86%
9.04%
3.61%
12.65%
25272
0.00% 25.00%
0.00%
5.00%
0.00%
0.00%
52297
0.00% 30.00% 18.00% 48.00% 11.00%
1.00%
3.00%
4.00%
71253
6.90%
8.97% 63.45% 10.34%
0.00%
0.00%
0.00%
0.00%
73022
2.00%
8.00% 92.00%
0.00%
0.00%
0.00%
0.00%
0.00%
86868
1.00% 81.00% 14.00%
1.00%
0.00%
0.00%
3.00%
5.00%
88441
3.31% 94.98% 18.46%
0.00%
0.00%
1.71%
0.00%
0.00%
115128
1.00%
1.00% 13.00% 37.00% 14.00% 32.00%
17.00%
3.00%
126101
3.00%
8.00%
55.00% 39.00%
7.00%
3.00% 13.00% 30.00%
137874
1.74%
1.74%
36.52% 12.17% 13.04% 11.30% 33.04% 28.70%
148146
0.00%
1.00%
6.00% 29.00% 12.00%
58.00% 49.00%
163921
4.00%
9.00%
46.00%
5.00%
2.00%
175192
0.00%
6.16% 10.96% 17.81% 34.93% 12.33% 17.81%
34.93%
240657
1.96% 33.33% 22.55%
0.98%
0.00%
44.12% 24.51% 17.65%
251865
2.94% 32.35% 20.59%
4.90%
0.98%
46.08% 18.63% 24.51%
339176
1.00%
9.00% 12.00%
3.00%
30.00% 25.00%
2.00% 49.00%
346517
0.00%
0.49%
2.96%
59.11% 35.47%
2.46% 21.18% 37.44%
412381
1.02%
56.12% 12.24% 33.67% 10.20% 42.86%
0.00%
0.00%
487155
8.00% 30.00% 14.00%
47.00% 10.00% 13.00% 24.00% 15.00%
494142
0.00%
35.00% 15.00% 15.00%
5.00% 13.00% 13.00% 39.00%
513123
5.47%
0.00%
0.50%
0.50%
0.00% 18.91% 27.36% 53.23%
558167
6.19%
8.25% 10.31% 37.11% 35.05%
55.67%
1.03%
3.09%
566885
7.00% 18.00% 12.00%
33.00%
8.00% 14.00% 11.00% 42.00%
581651
1.00%
3.00%
1.00%
69.00% 42.00% 11.00% 16.00% 27.00%
689651
1.98%
3.96%
8.91%
8.91% 18.81%
68.32% 40.59% 18.81%
1125772
0.00%
0.00%
90.00% 57.00% 13.00% 20.00% 10.00%
2.00%
1504061
4.00% 12.00%
0.00% 31.00%
63.00% 40.00% 23.00%
2.00%
1513974
8.08%
3.03% 25.25% 10.10% 59.60%
1.01%
1.01%
5.05%
3789571
8.00%
0.00%
1.00%
1.00% 39.00% 21.00% 39.00%
0.00%
4850957
3.00% 14.00%
9.00% 13.00% 31.00%
59.00% 37.00%
7.00%
5584724
4.00%
4.00% 96.00%
0.00%
0.00%
0.00%
0.00%
0.00%
6593250
5.00%
4.00% 24.00% 26.00% 37.00%
3.00%
13.00%
6.00%
8547348
5.00%
1.00%
8.00% 13.00% 35.00%
63.00% 42.00%
1.00%
8778971
1.00%
6.00%
4.00% 47.00%
46.00% 11.00% 31.00%
1.00%
8.00%
3.00% 10.00%
5.00% 29.00% 38.00%
49.00% 15.00%
8878492
0.00% 36324231
3.00%
2.00%
0.00% 49.00%
46.00% 27.00% 19.00%
7.00% 10738582
5.00% 12.00%
5.00% 29.00%
54.00% 31.00% 18.00%
6.00% 12394379
1.00% 14.00%
0.00% 13.00% 54.00%
31.00% 18.00%
0.00% 12544075
3.00%
97.00% 91.00%
0.00%
0.00%
3.00%
3.00%
1.04% 15181410
1.04% 10.42%
57.29% 22.92% 12.50% 21.88% 31.25%
4.00% 20282339
5.50%
5.00%
45.00% 14.50% 14.00% 16.50% 44.50%
0.00% 26201214
0.00%
6.00%
57.00% 35.00% 21.00%
1.00% 34.00%
0.00% 46525410
8.91%
3.96%
66.34% 36.63% 10.89% 18.81% 20.79%
0.00% 58022366
3.00%
1.00%
36.00% 14.00% 18.00%
4.00% 60.00%
2.00% 92597196
2.00%
62.00% 29.00% 14.00% 19.00% 28.00%
8.00%
4.95% 98351611
2.97% 15.84%
58.42% 16.83% 25.74% 15.84% 22.77%
avg target length
71.42
12.58
57.33
46.53
29.20
55.28
32.19
115.85
60.34
92.19
45.85
111.83
82.99
73.93
71.39
33.52
15.83
28.80
23.83
25.30
24.21
30.04
16.80
33.59
102.59
27.25
41.68
79.32
90.51
83.42
70.20
75.01
69.56
75.31
60.78
58.29
71.35
79.14
57.08
18.41
101.95
44.43
97.95
72.36
118.45
105.45
18.34
18.13
83.67
16.69
37.03
67.88
24.89
54.90
85.95
73.99
74.19
103.91
33.55
83.80
34.44
87.20
97.44
78.08
72.18
Table 11: Audit results for a sample of 100 sentences from CCAligned for each language pair, compared
to the number of sentences available in the dataset. If fewer than 100 sentences were available, all
sentences were audited. Language codes are as originally published. The length is measured in number
of characters and averaged across the audited portion of each corpus. Languages with less than 20%
correct sentences are boldfaced.
| Language pair | C | CC | CS | CB | X | WL | NL | porn | # sentences | avg target length |
|---|---|---|---|---|---|---|---|---|---|---|
| **en-ug** | 12.87% | 8.91% | 1.98% | 1.98% | 72.28% | 9.90% | 1.98% | 0.00% | 22012 | 95.55 |
| en-mwl | 27.00% | 26.00% | 0.00% | 1.00% | 73.00% | 0.00% | 0.00% | 0.00% | 33899 | 135.26 |
| **en-tg** | 0.00% | 0.00% | 0.00% | 0.00% | 95.10% | 3.92% | 0.98% | 0.00% | 37975 | 88.87 |
| **en-ne** | 13.00% | 7.00% | 6.00% | 0.00% | 60.00% | 23.00% | 4.00% | 0.00% | 40549 | 69.26 |
| **en-ka** | 11.88% | 2.97% | 2.97% | 5.94% | 73.27% | 10.89% | 2.97% | 0.00% | 41638 | 144.74 |
| **en-lmo** | 12.75% | 11.76% | 0.00% | 0.98% | 81.37% | 4.90% | 0.98% | 0.00% | 43790 | 89.38 |
| en-io | 28.00% | 27.00% | 0.00% | 1.00% | 69.00% | 2.00% | 1.00% | 0.00% | 45999 | 83.26 |
| **en-jv** | 13.73% | 9.80% | 0.00% | 3.92% | 70.59% | 12.75% | 2.94% | 0.00% | 48301 | 91.87 |
| en-wuu | 23.23% | 14.14% | 7.07% | 2.02% | 65.66% | 7.07% | 4.04% | 0.00% | 51024 | 34.77 |
| **br-en** | 8.70% | 7.61% | 1.09% | 0.00% | 82.61% | 4.35% | 0.00% | 0.00% | 58400 | 90.68 |
| **bar-en** | 6.00% | 6.00% | 0.00% | 0.00% | 75.00% | 16.00% | 3.00% | 0.00% | 67394 | 103.51 |
| **en-kk** | 5.00% | 2.00% | 2.00% | 1.00% | 81.00% | 14.00% | 0.00% | 0.00% | 109074 | 56.03 |
| en-sw | 33.33% | 27.27% | 4.04% | 2.02% | 64.65% | 2.02% | 0.00% | 0.00% | 138590 | 111.61 |
| **en-nds** | 1.96% | 1.96% | 0.00% | 0.00% | 95.10% | 1.96% | 0.98% | 0.00% | 178533 | 91.95 |
| be-en | 26.00% | 24.00% | 2.00% | 0.00% | 73.00% | 1.00% | 0.00% | 0.00% | 257946 | 121.22 |
| en-hi | 36.27% | 32.35% | 0.98% | 2.94% | 59.80% | 0.98% | 2.94% | 0.00% | 696125 | 96.77 |
| en-ko | 48.04% | 33.33% | 2.94% | 11.76% | 48.04% | 2.94% | 0.98% | 0.00% | 1345630 | 55.18 |
| en-uk | 87.00% | 84.00% | 2.00% | 1.00% | 10.00% | 1.00% | 2.00% | 0.00% | 2576425 | 104.39 |
| en-it | 42.00% | 42.00% | 0.00% | 0.00% | 58.00% | 0.00% | 0.00% | 0.00% | 4626048 | 140.27 |
| en-simple | 37.62% | 24.75% | 0.00% | 12.87% | 56.44% | 2.97% | 2.97% | 0.00% | N/A | 77.53 |

Table 12: Audit results for a sample of 100 sentences from WikiMatrix for each language pair,
compared to the number of sentences available in the dataset. Language codes are as originally
published. The length is measured in number of characters and averaged across the audited portion of
each corpus. Languages with less than 20% correct sentences are boldfaced.
| Language pair | C | CC | CS | CB | X | WL | NL | porn | # sentences | avg target length |
|---|---|---|---|---|---|---|---|---|---|---|
| en-so | 80.81% | 61.62% | 1.01% | 18.18% | 14.14% | 5.05% | 0.00% | 0.00% | 14879 | 189.83 |
| en-ps | 72.00% | 53.00% | 9.00% | 10.00% | 17.00% | 10.00% | 0.00% | 0.00% | 26321 | 141.01 |
| en-my | 45.00% | 9.00% | 16.00% | 20.00% | 32.00% | 9.00% | 14.00% | 0.00% | 31374 | 147.07 |
| en-km | 76.00% | 51.00% | 13.00% | 12.00% | 18.00% | 6.00% | 0.00% | 0.00% | 65113 | 121.20 |
| en-ne | 73.00% | 48.00% | 1.00% | 24.00% | 23.00% | 2.00% | 0.00% | 0.00% | 92084 | 153.42 |
| en-sw | 85.00% | 60.00% | 15.00% | 10.00% | 11.00% | 2.00% | 2.00% | 0.00% | 132517 | 167.34 |
| en-si | 37.00% | 31.00% | 6.00% | 0.00% | 62.00% | 0.00% | 1.00% | 0.00% | 217407 | 123.06 |
| en-nn | 35.92% | 24.27% | 8.74% | 2.91% | 49.51% | 13.59% | 0.97% | 0.00% | 323519 | 56.24 |
| es-eu | 88.00% | 66.00% | 15.00% | 7.00% | 10.00% | 1.00% | 1.00% | 0.00% | 514610 | 121.31 |
| es-gl | 89.00% | 46.00% | 6.00% | 37.00% | 4.00% | 7.00% | 0.00% | 0.00% | 1222837 | 107.88 |
| en-ru | 81.00% | 73.00% | 6.00% | 2.00% | 19.00% | 0.00% | 0.00% | 6.00% | 5377911 | 101.28 |
| en-bg | 95.15% | 85.44% | 0.97% | 8.74% | 4.85% | 0.00% | 0.00% | 0.97% | 6470710 | 112.29 |
| es-ca | 80.00% | 54.00% | 19.00% | 7.00% | 11.00% | 9.00% | 0.00% | 5.00% | 6870183 | 107.21 |
| en-el | 91.59% | 68.22% | 0.93% | 22.43% | 7.48% | 0.93% | 0.00% | 0.00% | 9402646 | 135.66 |
| en-pl | 94.12% | 76.47% | 0.98% | 16.67% | 3.92% | 1.96% | 0.00% | 0.98% | 13744860 | 95.95 |
| en-nl | 49.00% | 32.00% | 17.00% | 0.00% | 46.00% | 3.00% | 2.00% | 0.00% | 31295016 | 95.05 |
| en-pt | 93.07% | 92.08% | 0.00% | 0.99% | 4.95% | 1.98% | 0.00% | 0.00% | 31486963 | 108.68 |
| en-it | 60.82% | 36.08% | 16.49% | 8.25% | 38.14% | 0.00% | 1.03% | 0.00% | 40798278 | 127.55 |
| en-es | 87.00% | 54.00% | 20.00% | 13.00% | 12.00% | 0.00% | 1.00% | 0.50% | 78662122 | 119.72 |
| en-de | 82.83% | 64.65% | 13.13% | 5.05% | 13.13% | 3.03% | 1.01% | 0.00% | 82638202 | 111.43 |
| en-fr | 89.62% | 82.08% | 4.72% | 2.83% | 10.38% | 0.00% | 0.00% | 0.00% | 104351522 | 144.20 |

Table 13: Audit results for a sample of 100 sentences from ParaCrawl for each language pair,
compared to the number of sentences available in the dataset. Language codes are as originally
published. The length is measured in number of characters and averaged across the audited portion of
each corpus.
yo
st
haw
ig
sm
ha
su
sn
mg
pa
ga
co
zu
jv
km
kn
fy
te
la
be
af
lb
ne
sr
gl
bn
mr
sl
hi
bg
uk
ro
sv
zh
ja
tr
nl
pl
pt
it
fr
de
ru
en
bg latn
ja latn
ru latn
zh latn
C
CC
WL
# sentences
NL
porn
CS
CB
46214
1.02% 0.00%
2.04% 11.22% 14.29%
84.69% 71.43%
66837
0.00% 35.05%
8.25% 0.00%
56.70% 42.27% 14.43%
84312
9.18% 33.67% 21.43% 1.02%
1.02%
44.90% 34.69%
92909
3.94%
0.00% 44.09% 0.79%
55.91% 41.73% 10.24%
98467
0.00% 27.55% 12.24% 0.00%
2.04%
60.20% 58.16%
247479
0.00% 14.14%
5.05% 2.02%
1.01%
80.81% 79.80%
280719
0.00% 25.25% 15.15% 2.02%
1.01%
59.60% 58.59%
326392
4.95% 0.00%
0.99% 58.42%
2.97%
36.63% 32.67%
345040
0.00% 18.00% 25.00% 0.00%
0.00%
57.00% 57.00%
363399
3.77%
78.30% 68.87%
4.72% 10.38% 0.00%
5.66%
465670
6.06% 12.12% 10.10% 13.13% 0.00%
76.77% 58.59%
494913
2.00% 48.00% 19.00% 0.00%
2.00%
33.00% 29.00%
555458
1.00% 30.00% 19.00% 0.00%
51.00% 48.00%
2.00%
581528
7.27% 1.82%
52.73% 19.09% 19.09% 14.55% 40.00%
756612
0.00% 0.00%
7.14%
0.00%
92.86% 92.86%
0.00%
1056849
9.90% 0.00%
2.97%
3.96%
85.15% 73.27%
7.92%
1104359
3.85% 0.00%
2.88% 39.42%
3.85%
56.73% 50.00%
1188243
8.00% 0.00%
9.00%
89.00% 76.00%
3.00%
4.00%
674463
7.69% 0.00%
6.15% 10.77% 10.00%
82.31% 65.38%
1742030
3.54% 0.00%
4.42%
2.65%
2.65%
92.04% 86.73%
2152243
0.00% 15.00%
0.00%
76.00% 76.00%
9.00% 0.00%
2740336
7.77% 74.76% 0.00%
0.00%
0.00%
17.48% 17.48%
2942785
0.00% 0.00%
0.00% 21.65%
1.03%
78.35% 77.32%
3398483
0.00% 0.00%
5.41%
0.90%
7.21%
93.69% 85.59%
4549465
0.00% 13.33% 17.14% 0.00%
67.62% 57.14% 10.48%
7444098
6.00%
1.00%
93.00% 86.00%
4.00% 0.00%
7774331
1.90% 49.52% 10.48% 0.00%
2.86%
40.00% 35.24%
8499456
4.95% 0.00%
4.95%
4.95%
92.08% 82.18%
2.97%
18507273
0.00% 2.53%
2.53% 19.70%
1.01%
80.30% 76.77%
23409799
2.01% 17.09% 0.00%
2.51%
2.51%
80.90% 75.88%
38556465
2.51% 0.00%
2.01%
6.53%
95.48% 81.41%
7.54%
45738857
2.02% 0.00%
3.03%
4.04%
94.95% 78.79% 12.12%
8570979
3.92% 1.96%
4.90%
3.92%
2.94%
91.18% 84.31%
54542308
7.00% 0.00%
1.00%
4.00%
1.00%
92.00% 87.00%
87337884
1.00% 1.00%
0.00%
4.00%
6.00%
99.00% 89.00%
87595290
0.51% 0.00%
3.54%
7.07%
0.00%
95.96% 88.89%
96210458
5.94% 0.00%
1.98%
0.00%
6.93%
92.08% 85.15%
126164277
2.00%
7.00%
7.00%
96.00% 82.00%
2.00% 0.00%
169239084
2.00% 12.00% 1.00%
3.00%
4.00%
86.00% 79.00%
186404508
7.00% 0.00%
1.00%
4.00%
9.00%
92.00% 79.00%
332674575
7.00% 0.00%
1.00%
3.00%
7.00%
92.00% 82.00%
397006993
1.96% 0.00%
6.86%
91.18% 77.45%
5.88%
7.84%
4.88% 0.00%
4.07%
91.06% 69.11% 11.38% 10.57%
755585265
8.08%
2.02%
93.94% 83.84%
5.05% 0.00% 3079081989
1.01%
0.00%
0.00% 51.52% 39.39% 1.01%
9.09%
9.09%
2.00% 60.00% 27.00% 0.00%
13.00%
4.00%
7.00%
0.93% 34.58% 28.97% 0.93%
36.45% 25.23% 10.28%
0.00% 64.00% 31.00% 0.00%
1.00%
4.00%
5.00%
N/A
N/A
N/A
N/A
3.00%
avg length
117.71
132.13
129.99
98.03
126.42
155.76
107.10
145.59
116.23
134.43
147.35
195.30
137.81
97.96
162.57
105.39
234.25
108.49
67.25
110.86
99.52
481.68
102.88
131.72
151.45
92.60
281.94
149.45
105.54
93.86
116.79
130.08
114.45
94.77
59.94
152.75
103.67
170.70
133.51
180.26
143.69
107.71
109.28
130.97
139.92
218.92
123.14
186.84
Table 14: Audit results for a sample of 100 sentences from mC4 for each language, compared to the
number of sentences available in the dataset. Language codes are as originally published. The length is
measured in number of characters and averaged across the audited portion of each corpus. Languages
with less than 20% correct sentences are boldfaced.
C
CS
CC
CB
0.00%
0.00%
0.00%
# sentences
WL
NL
porn
1
0.00% 0.00%
0.00%
100.00% 100.00% 0.00% 0.00%
1
0.00% 100.00% 0.00%
0.00% 0.00% 0.00%
1
0.00% 0.00%
0.00% 0.00% 0.00% 100.00%
2
0.00% 0.00%
0.00%
4
75.00% 0.00%
0.00%
5
0.00% 0.00%
0.00%
7
42.86% 0.00%
0.00%
0.00% 0.00% 0.00% 57.14%
7
0.00% 0.00%
57.14% 57.14% 0.00% 0.00% 42.86%
9
0.00% 100.00% 0.00%
0.00% 0.00% 0.00%
0.00%
10
70.00% 0.00%
30.00% 30.00% 0.00% 0.00%
0.00%
11
40.00% 0.00%
30.00% 30.00% 0.00% 0.00% 30.00%
17
0.00% 0.00%
0.00%
100.00% 100.00% 0.00% 0.00%
26
3.85% 0.00%
0.00%
96.15% 96.15% 0.00% 0.00%
29
0.00% 0.00%
79.31% 75.86% 0.00% 3.45% 20.69%
37
0.00% 0.00%
0.00%
100.00% 100.00% 0.00% 0.00%
41
0.00% 0.00%
0.00%
100.00% 97.56% 0.00% 2.44%
42
71.43% 0.00%
0.00% 0.00% 0.00% 28.57%
47
0.00% 0.00%
0.00%
100.00% 100.00% 0.00% 0.00%
60
0.00%
0.00% 0.00%
100.00% 96.67% 0.00% 3.33%
61
0.00% 100.00% 0.00%
0.00%
0.00% 0.00% 0.00%
64
0.00% 0.00%
1.54%
98.46% 96.92% 0.00% 1.54%
81
16.05% 0.00%
2.47%
81.48% 81.48% 0.00% 0.00%
81
8.64% 0.00%
0.00%
91.36% 91.36% 0.00% 0.00%
83
4.82% 0.00%
3.61%
91.57% 90.36% 0.00% 1.20%
86
1.16% 0.00%
0.00% 0.00% 0.00% 98.84%
0.00%
104
57.43% 0.00%
0.00%
42.57% 42.57% 0.00% 0.00%
104
8.65% 0.00%
1.92%
89.42% 21.15% 0.00% 68.27%
180
9.00% 0.00%
6.00% 0.00% 58.00% 27.00%
64.00%
425
0.00% 0.00%
0.00%
100.00% 98.97% 0.00% 1.03%
676
1.00% 0.00%
0.00%
99.00% 99.00% 0.00% 0.00%
2350
2.00% 0.00%
1.00%
97.00% 86.00% 0.00% 11.00%
7997
1.00% 0.00%
6.00%
93.00% 93.00% 0.00% 0.00%
33838
0.00% 0.00%
2.00%
98.00% 98.00% 0.00% 0.00%
34244
0.00% 0.00%
2.00%
98.00% 98.00% 0.00% 0.00%
35032
0.00% 0.00%
2.97%
97.03% 95.05% 0.00% 1.98%
40066
2.00% 0.00%
0.00%
98.00% 98.00% 0.00% 0.00%
61941
0.00% 0.00%
0.00%
100.00% 96.00% 0.00% 4.00%
67762
1.00% 0.00%
2.00%
97.00% 97.00% 0.00% 0.00%
287142
0.00% 0.00%
81.09% 79.10% 0.00% 1.99% 18.91%
517353
0.00% 0.00%
0.00%
100.00% 100.00% 0.00% 0.00%
1099498
0.00% 0.00%
0.00%
100.00% 98.00% 0.00% 2.00%
1430527
0.00% 0.00%
2.00%
98.00% 94.00% 0.00% 4.00%
1685185
1.01% 1.01%
0.00%
98.99% 93.94% 1.01% 4.04%
2719851
0.00% 0.00%
0.00%
100.00% 100.00% 0.00% 0.00%
0.00% 0.00%
1.00%
99.00% 91.00% 0.00% 8.00%
13292843
0.00% 4.00% 126067610
98.00% 94.00% 2.00% 2.00%
2.00%
0.99% 1.98% 210348435
87.13% 71.29% 1.98% 13.86% 11.88%
0.00% 1.00% 232673578
0.00%
100.00% 97.00% 0.00% 3.00%
0.00% 5.00% 461349575
0.00%
100.00% 93.00% 0.00% 7.00%
0.00% 3.00% 488616724
0.00%
100.00% 94.00% 0.00% 6.00%
1.00% 1.00% 3809525119
0.00%
99.00% 96.00% 0.00% 3.00%
diq
bcl
cbk
pam 100.00% 100.00% 0.00% 0.00%
25.00% 25.00% 0.00% 0.00%
bar
myv
100.00% 100.00% 0.00% 0.00%
yue
mwl
frr
ht
ie
scn
tyv
mai
bxr
dsb
so
rm
nah
nap
yo
gn
vec
kw
wuu
eml
bh
min
qu
su
jv
als
la
uz
nds
sw
br
fy
am
af
eu
mn
te
kk
ca
nl
it
zh
fr
es
en
avg length
131.00
623.00
519.00
139.00
53.50
127.00
177.00
141.00
231.56
329.10
121.70
155.59
167.96
141.17
160.76
155.15
208.24
137.66
164.53
152.11
281.57
234.95
184.90
162.75
157.15
177.88
137.17
649.85
167.27
221.00
203.08
375.44
224.11
369.99
344.74
196.70
239.56
340.23
267.43
339.18
330.93
309.94
412.31
318.93
333.38
305.01
393.66
195.60
306.62
268.07
364.65
Table 15: Audit results for a sample of 100 sentences from OSCAR for each language, compared
to the number of sentences available in the dataset. If fewer than 100 sentences were available, all
sentences were audited. Language codes are as originally published. Length is measured in number
of characters. Languages with less than 20% correct sentences are boldfaced.