Quality at a Glance:
An Audit of Web-Crawled Multilingual Datasets

Julia Kreutzer1,2, Isaac Caswell3, Lisa Wang3,4, Ahsan Wahab5,47, Daan van Esch6,
Nasanbayar Ulzii-Orshikh7, Allahsera Tapo8,9, Nishant Subramani10,11, Artem Sokolov4,
Claytone Sikasote12,13, Monang Setyawan14, Supheakmungkol Sarin14,
Sokhar Samb15,16, Benoît Sagot17, Clara Rivera18, Annette Rios19, Isabel Papadimitriou20,
Salomey Osei21,22, Pedro Ortiz Suarez17,23, Iroro Orife10,24, Kelechi Ogueji2,25,
Andre Niyongabo Rubungo26,27, Toan Q. Nguyen28, Mathias Müller19, André Müller19,
Shamsuddeen Hassan Muhammad29,30, Nanda Muhammad30, Ayanda Mnyakeni31,
Jamshidbek Mirzakhalov5,32, Tapiwanashe Matangira33, Colin Leong10, Nze Lawson14,
Sneha Kudugunta3, Yacine Jernite10,34, Mathias Jenny19, Orhan Firat3,5,
Bonaventure F. P. Dossou35,36, Sakhile Dlamini14, Nisansa de Silva37,
Sakine Çabuk Ballı19, Stella Biderman38, Alessia Battisti19, Ahmed Baruwa10,39,
Ankur Bapna3, Pallavi Baljekar1, Israel Abebe Azime40,41, Ayodele Awokoya29,42,
Duygu Ataman19,43, Orevaoghene Ahia10,44, Oghenefego Ahia14,
Sweta Agrawal45, Mofetoluwa Adeyemi29,46

1Google Research, Canada, 2Masakhane NLP, USA, 3Google Research, USA, 4Google Research,
Germany, 5Turkic Interlingua, 6Google Research, The Netherlands, 7Haverford College, USA,
8Masakhane NLP, Mali, 9RobotsMali, Mali, 10Masakhane NLP, USA, 11Allen Institute for Artificial
Intelligence, USA, 12Masakhane NLP, Zambia, 13University of Zambia, Zambia, 14Google, USA,
15Masakhane NLP, Senegal, 16AIMS-AMMI, Senegal, 17Inria, France, 18Google Research, United Kingdom,
19University of Zurich, Switzerland, 20Stanford University, USA, 21Masakhane NLP, Ghana, 22Kwame
Nkrumah University of Science and Technology, Ghana, 23Sorbonne Université, France, 24Niger-Volta
LTI, USA, 25University of Waterloo, Canada, 26Masakhane NLP, Spain, 27Universitat Politècnica de
Catalunya, Spain, 28University of Notre Dame, USA, 29Masakhane NLP, Nigeria, 30Bayero University
Kano, Nigeria, 31Google, South Africa, 32University of South Florida, USA, 33Google, Canada,
34Hugging Face, USA, 35Masakhane NLP, Germany, 36Jacobs University Bremen, Germany,
37University of Moratuwa, Sri Lanka, 38EleutherAI, USA, 39Obafemi Awolowo University, Nigeria,
40Masakhane NLP, Ethiopia, 41AIMS-AMMI, Ethiopia, 42University of Ibadan, Nigeria, 43Turkic
Interlingua, Switzerland, 44Instadeep, Nigeria, 45University of Maryland, USA, 46Defence Space
Administration Abuja, Nigeria, 47University of South Florida, USA

Abstract

With the success of large-scale pre-training and multilingual modeling in Natural
Language Processing (NLP), recent years have seen a proliferation of large,
Web-mined text datasets covering hundreds of languages. We manually audit the
quality of 205 language-specific corpora released with five major public datasets
(CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have
systematic issues: At least 15 corpora have no usable text, and a significant
fraction contains less than 50% sentences of acceptable quality. In addition,
many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate
that these issues are easy to detect even for non-proficient speakers, and
supplement the human audit with automatic analyses. Finally, we recommend
techniques to evaluate and improve multilingual corpora and discuss potential
risks that come with low-quality data releases.

1 Introduction

Access to multilingual datasets for NLP research
has vastly improved over the past years. A variety
of Web-derived collections for hundreds of lan-
guages is available for anyone to download,
such as ParaCrawl (Esplà et al., 2019; Bañón
et al., 2020), WikiMatrix (Schwenk et al., 2021),


Transactions of the Association for Computational Linguistics, vol. 10, pp. 50–72, 2022. https://doi.org/10.1162/tacl_a_00447
Action Editor: Sebastian Padó. Submission batch: 6/2021; Revision batch: 9/2021; Published 1/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

CCAligned (El-Kishky et al., 2020), OSCAR
(Ortiz Suárez et al., 2019; Ortiz Suárez et al.,
2020), and several others. These have in turn en-
abled a variety of highly multilingual models, such as
mT5 (Xue et al., 2021), M2M-100 (Fan et al.,
2020), and M4 (Arivazhagan et al., 2019).

Curating such datasets relies on the Web sites
giving clues about the language of their con-
tents (e.g., a language identifier in the URL) and
on automatic language classification (LangID).
It is commonly known that these automatically
crawled and filtered datasets tend to have over-
all lower quality than hand-curated collections
(Koehn et al., 2020), but their quality is rarely
measured directly, and is rather judged through
the improvements they bring to downstream
applications (Schwenk et al., 2021).

Building NLP technologies with automatically
crawled datasets is promising. This is especially
true for low-resource languages, because data
scarcity is one of the major bottlenecks for deep
learning approaches. However, there is a problem:
There exists very little research on evaluating both
data collections and automatic crawling and filter-
ing tools for low-resource languages. As a result,
although many low-resource languages are cov-
ered by the latest multilingual crawl data releases,
their quality and thus usability is unknown.

To shed light on the quality of data crawls
for the lowest resource languages, we perform
a manual data audit for 230 per-language sub-
sets of five major crawled multilingual datasets:1
CCAligned (El-Kishky et al., 2020), ParaCrawl
(Esplà et al., 2019; Bañón et al., 2020), Wiki-
Matrix (Schwenk et al., 2021), OSCAR (Ortiz
Suárez et al., 2019; Ortiz Suárez et al., 2020),
and mC4 (Xue et al., 2021). We propose solutions
for effective, low-effort data auditing (Section 4),
including an error taxonomy. Our quantitative
analysis reveals surprisingly low amounts of valid
in-language data, and identifies systematic issues
across datasets and languages. In addition, we
find that a large number of datasets is labeled with
nontransparent or incorrect language codes (Sec-
tion 5). This leads us to reflect on the potential
harm of low-quality data releases for low-resource

1Annotations are available for download (last accessed: 12 Oct 2021).


languages (Section 6), and provide a set of recom-
mendations for future multilingual data releases
(Section 7).

2 Related Work

Corpora collected by web crawlers are known to
be noisy (Junczys-Dowmunt, 2019; Luccioni and
Viviano, 2021). In highly multilingual settings,
past work found that web-crawls of lower-resource
languages have serious issues, especially with
segment-level LangID (Caswell et al., 2020).
Cleaning and filtering web-crawls can boost gen-
eral language modeling (Gao et al., 2020; Brown
et al., 2020; Raffel et al., 2020) and downstream
task performance (Moore and Lewis, 2010;
Rarrick et al., 2011; Xu and Koehn, 2017;
Khayrallah and Koehn, 2018; Brown et al., 2020).
As the scale of ML research grows, it becomes
increasingly difficult to validate automatically col-
lected and curated datasets (Biderman and Scheirer,
2020; Birhane and Prabhu, 2021; Bender et al.,
2021). Several works have focused on advanc-
ing methodologies and best practices to address
these challenges. Bender and Friedman (2018)
introduced data statements, a documentary frame-
work for NLP datasets that seeks to provide a
universal minimum bar for dataset description.
Similar work has been done on systematizing
documentation in other areas in data science
and machine learning, including work focusing
on online news (Kevin et al., 2018), data ethics
(Sun et al., 2019), and data exploration (Holanda
et al., 2018), as well as generalist work such as
Gebru et al. (2018). Data quality is also im-
plicitly documented by successes of filtering
methods. There is a large literature on filtering
data for various NLP tasks, for example, Axelrod
et al., 2011, Moore and Lewis (2010), Rarrick
et al., 2011, Wang et al. (2018), Kamholz et al.
(2014), Junczys-Dowmunt (2018), and Caswell
et al., 2020.

Closest to our work is the analysis of a highly
multilingual (non-publicly available) web-crawl
and LangID-related quality issues by (Caswell
et al., 2020). They perform a brief analysis of
the quality of OSCAR with the focus only on
the presence of in-language content. Dodge et al.
(2021) automatically documented and analyzed the
contents and sources of C4 (Raffel et al., 2020),
the English counterpart of mC4, which surfaced

                         Parallel                                        Monolingual
                         CCAligned      ParaCrawl v7.1       WikiMatrix  OSCAR        mC4

#languages               137            41                   85          166          101
Source                   CC 2013–2020   selected Web sites   Wikipedia   CC 11/2018   CC all
Filtering level          document       sentence             sentence    document     document
LangID                   FastText       CLD2                 FastText    FastText     CLD3
Alignment                LASER          Vec/Hun/BLEU-Align   LASER       –            –
Evaluation               TED-6          WMT-5                TED-45      POS/DEP-5    XTREME

Table 1: Comparison of parallel and monolingual corpora extracted from web documents, including
their downstream evaluation tasks. All parallel corpora are evaluated for machine translation (BLEU).
TED-6: da, cr, sl, sk, lt, et; TED-45: 45-language subset of Qi et al. (2018); WMT-5: cs, de,
fi, lv, ro. POS/DEP-5: part-of-speech labeling and dependency parsing for bg, ca, da, fi, id.

the presence of machine-translated contents and
NLP benchmark data.

3 Multilingual Corpora

Table 1 provides an overview of the corpora of
interest in this work. We selected the corpora for
their multilinguality and the inclusion of under-
studied languages in NLP. With the exception of
WikiMatrix and ParaCrawl, all corpora are derived
from CommonCrawl (CC).2

CCAligned (El-Kishky et al., 2020) is a parallel
dataset built off 68 CC snapshots. Documents
are aligned if they are in the same language ac-
cording to FastText LangID (Joulin et al., 2016,
2017), and have the same URL but for a differing
language code. These alignments are refined with
cross-lingual LASER embeddings (Artetxe and
Schwenk, 2019). For sentence-level data, they
split on newlines and align with LASER, but
perform no further filtering. Human annotators
evaluated the quality of document alignments for
six languages (de, zh, ar, ro, et, my) selected
for their different scripts and amount of retrieved
documents, reporting precision of over 90%. The
quality of the extracted parallel sentences was
evaluated in a machine translation (MT) task on
six European (da, cr, sl, sk, lt, et) languages
of the TED corpus (Qi et al., 2018), where it
compared favorably to systems built on crawled
sentences from WikiMatrix and ParaCrawl v6.
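For illustration, the URL-based pairing heuristic described above can be sketched in a few lines; the regular expression and normalization below are assumptions made for this sketch, not the exact rules used to build CCAligned.

    import re

    # Hypothetical illustration: two URLs are treated as candidate document pairs
    # if they become identical once a single language segment in the path is removed.
    LANG_SEGMENT = re.compile(r"/([a-z]{2}(?:[-_][A-Za-z]{2,4})?)/")

    def strip_language_segment(url: str) -> str:
        return LANG_SEGMENT.sub("/", url, count=1)

    def are_alignment_candidates(url_a: str, url_b: str) -> bool:
        return strip_language_segment(url_a) == strip_language_segment(url_b)

    # "https://example.com/en/about" and "https://example.com/fr/about" -> True
    print(are_alignment_candidates("https://example.com/en/about",
                                   "https://example.com/fr/about"))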

Multilingual C4 (mC4) (Xue et al., 2021) is
a document-level dataset used for training the
mT5 language model. It consists of monolingual
text in 101 languages and is generated from 71 CC
snapshots. It filters out pages that contain less than
three lines of at least 200 characters and pages that
contain bad words.3 Since this is a document-level
dataset, we split it by sentence and deduplicate it
before rating. For language identification, it uses
CLD3 (Botha et al., 2017),4 a small feed-forward
neural network that was trained to detect 107
languages. The mT5 model pre-trained on mC4 is
evaluated on 6 tasks of the XTREME benchmark
(Hu et al., 2020) covering a variety of languages
and outperforms other multilingual pre-trained
language models such as mBERT (Devlin et al.,
2019) and XLM-R (Conneau et al., 2020).
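The audit preparation mentioned above (splitting document-level mC4 text into sentences and deduplicating before rating) can be sketched roughly as follows; the regex-based splitter is only an illustrative assumption, not the exact segmentation used.

    import re

    def split_sentences(document: str) -> list:
        # Naive splitter on sentence-final punctuation; a stand-in for whatever
        # segmentation is actually applied.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

    def prepare_for_audit(documents):
        seen, sentences = set(), []
        for doc in documents:
            for sent in split_sentences(doc):
                if sent not in seen:          # exact-match deduplication
                    seen.add(sent)
                    sentences.append(sent)
        return sentences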

OSCAR (Ortiz Suárez et al., 2019; Ortiz Suárez
et al., 2020) is a set of monolingual corpora ex-
tracted from CC snapshots, specifically from the
plain text WET format distributed by CC which
removes all the HTML tags and converts the text
to UTF-8. It is deduplicated and follows the ap-
proach by Grave et al. (2018) of using FastText
LangID (Joulin et al., 2016, 2017) on a line-level.5
No other filtering was applied. For five languages
(bg, ca, da, fi, id), OSCAR was used by its
original authors to train language models which
were then evaluated on parsing and POS tagging
(Ortiz Suárez et al., 2020). OSCAR has also been
used in independent studies to train monolingual
or multilingual language models (ar, as, bn, de,
el, fr, gu, he, hi, kn, ml, mr, nl, or, pa,
ro, ta, te) and subsequently evaluate them on
various downstream tasks (Antoun et al., 2021;

2http://commoncrawl.org/.
3https://github.com/LDNOOBW/.
4https://github.com/google/cld3/.
5https://fasttext.cc/docs/en/language-identification.html.


Kakwani et al., 2020; Wilie et al., 2020; Chan
et al., 2020; Koutsikakis et al., 2020; Martin
et al., 2020; Chriqui and Yahav, 2021; Seker et al.,
2021; Delobelle et al., 2020; Dumitrescu et al.,
2020; Masala et al., 2020).

ParaCrawl v7.1 is a parallel dataset with 41 lan-
guage pairs primarily aligned with English (39 out
of 41) and mined using the parallel-data-crawling
tool Bitextor (Esplà et al., 2019; Bañón et al.,
2020) which includes downloading documents,
preprocessing and normalization, aligning doc-
uments and segments, and filtering noisy data
via Bicleaner.6 ParaCrawl focuses on European
languages, but also includes 9 lower-resource,
non-European language pairs in v7.1. Sentence
alignment and sentence pair filtering choices were
optimized for five languages (mt, et, hu, cs,
de) by training and evaluating MT models on
the resulting parallel sentences. An earlier version
(v5) was shown to improve translation quality on
WMT benchmarks for cs, de, fi, lv, ro.

WikiMatrix (Schwenk et al., 2021) is a public
dataset containing 135M parallel sentences in
1620 language pairs (85 languages) mined from
Wikipedia. Out of the 135M parallel sentences,
34M are aligned with English. The text is ex-
tracted from Wikipedia pages, split into sentences,
and duplicate sentences are removed. FastText
LangID is used before identifying bitext with
LASER's distance-based mining approach. The
margin threshold is optimized by training and
evaluating downstream MT models on four WMT
benchmarks (de-en, de-fr, cs-de, cs-fr).
The final dataset is used to train translation models
that are then evaluated by automatically measur-
ing the quality of their translations against human
translations of TED talks in 45 languages, with
highest quality for translations between English
and, for example, pt, es, da, and lowest for sr,
ja, mr, zh_TW. In the audit we focus on language
pairs with English on one side.

4 Auditing Data Quality

None of the above datasets has been evaluated
for quality on the sentence level (exception: sev-
eral languages in ParaCrawl v3), and downstream
evaluations are centered around a small fraction
of higher-resource languages. This is insufficient

for drawing conclusions about
the quality of
individual or aligned sentences, and about the
entirety of languages. Furthermore, there might
be a publication bias: negative results obtained
with lower-quality versions of any of the above
corpora may simply not have been published.

To close this gap, we conduct a human data
quality audit focused on the lowest-resource and
most under-evaluated languages, but also covering
mid- and high-resource languages for comparison.

4.1 Auditing Process

Participants We recruited 51 volunteers from
the NLP community, covering about 70 languages
with proficient language skills.7 Each sentence
is annotated by one rater. To verify our hypoth-
esis that those annotations can largely be done by
non-native speakers, we repeat a set of language
expert annotations by a non-expert, and measure
the accuracy of the non-expert.

Sample Selection For each language in each
dataset, we took a random sample of 100 lines,
which may be anywhere from single words to
short paragraphs depending on segmentation. We
manually annotated them according to the error
taxonomy described below. For WikiMatrix and
CCAligned, we selected those languages that are
paired with English, and for ParaCrawl, we also in-
cluded those paired with Spanish (''total'' counts
in Table 3). We did not annotate all languages,
but focused on the ones with the least number
of sentences in each dataset (at least the smallest
10) and languages for which we found proficient
speakers. Since we annotate the same maximum
number of sentences8 across all chosen languages
regardless of their total number of sentences, the
annotated samples are not an unbiased sample
from the whole dataset.
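A minimal sketch of this per-language sampling, assuming each language-specific corpus is available as a list of lines; as noted, corpora with fewer than 100 sentences are simply taken in full.

    import random

    SAMPLE_SIZE = 100

    def sample_for_audit(lines, seed=0):
        # Draw up to 100 random lines from one language-specific corpus.
        rng = random.Random(seed)
        if len(lines) <= SAMPLE_SIZE:
            return list(lines)
        return rng.sample(lines, SAMPLE_SIZE)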

Non-expert Labeling Strategies Although
many of the volunteers were familiar with the
languages in question or spoke related languages,
in cases where no speaker of a relevant language
could be found, volunteers used dictionaries and
Internet search to form educated guesses. We
discuss this deeper in Appendix C to highlight
how much of this low-resource focused evaluation

6https://github.com/bitextor/bicleaner.
7This surprisingly high number comes in part because there are many closely related languages, e.g., one person may be proficient enough to rate many different Slavic or Turkic languages even if only one is their native language.
8Some languages had fewer than 100 sentences.


C: Correct translation, any (combined label for CC, CB, CS)

Correct Codes

CC: Correct translation, natural sentence
    en The Constitution of South Africa / nso Molaotheo wa Rephabliki ya Afrika Borwa
    en Transforming your swimming pool into a pond / de Umbau Ihres Swimmingpools zum Teich

CB: Correct translation, Boilerplate or low quality
    en Reference number: 13634 / ln Motango ya référence: 13634
    en Latest Smell Stop Articles / fil Pinakabagong mga Artikulo Smell Stop

CS: Correct translation, Short
    en movies, dad / it cinema, papà
    en Halloween without me / ay Hallowen –janiw nayampejj

Error Codes

X: Incorrect translation, but both correct languages
    en A map of the arrondissements of Paris / kg Paris kele mbanza ya kimfumu ya Fwalansa.
    en Ask a question / tr Soru sor Kullanıma göre seçim

WL: Source OR target wrong language, but both still linguistic content
    en The ISO3 language code is zho / zza Táim eadra bracach mar bhionns na frogannaidhe.
    en Der Werwolf—sprach der gute Mann, / de des Weswolfs, Genitiv sodann,

NL: Not a language: at least one of source and target are not linguistic content
    en EntryScan 4 / tn TSA PM704
    en organic peanut butter / ckb [unrenderable characters]

Table 2: Annotation codes for parallel data with sentence pair examples. The language code before each
sentence indicates the language it is supposed to be in.

can actually be done by non-proficient speakers
with relatively low effort. In general, we aim to
find an upper bound on quality, so we encouraged
annotators to be forgiving of translation mistakes
when the overall meaning of the sentence or large
parts thereof are conveyed, or when most of the
sentence is in the correct language.

Effort The individual effort was dependent on
the quality and complexity of the data, and on
the annotator's knowledge of the language(s). For
example, it took from less than two minutes for
an English native speaker to pass through 100
well-formed English sentences (or similarly to an-
notate languages with 0% in-language sentences),
to two hours of ''detective work'' for well-formed
content in languages with which the annotator had
no familiarity.

Taxonomy In order to quantify errors, we de-
veloped a simple error taxonomy. Sentences and
sentence pairs were annotated according to a
simple rubric with error classes of Incorrect Trans-
lation (X, excluded for monolingual data), Wrong
Language (WL), and Non-Linguistic Content (NL).
Of correct sentences (C), we further mark single
words or phrases (CS) and boilerplate contents
(CB). In addition, we asked annotators to flag
offensive or pornographic content. Table 2 pro-
vides examples for parallel data, and Appendix B
contains detailed annotation instructions.
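For reference, the taxonomy can be expressed as a small data structure; the class and field names below are our own illustrative choices, not part of the released annotations.

    from dataclasses import dataclass
    from enum import Enum

    class Label(Enum):
        CC = "correct translation, natural sentence"
        CB = "correct translation, boilerplate or low quality"
        CS = "correct translation, short"
        X  = "incorrect translation (parallel data only)"
        WL = "wrong language"
        NL = "not a language"

    CORRECT = {Label.CC, Label.CB, Label.CS}   # reported jointly as "C"

    @dataclass
    class Annotation:
        label: Label
        offensive: bool = False
        porn: bool = False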

4.2 Human Audit Results

Interpretation of Results For each language,
we compute the percentage of each label within
the 100 audited sentences. Then, we either ag-
gregate the labels across languages with equal
weights (macro-average), or weight them ac-
cording to their presence in the overall dataset
(micro-average). Results are shown in Table 3.
The statistics for the correct codes (CC, CB, CS)
are combined as C.

                         Parallel                                    Monolingual
                         CCAligned     ParaCrawl v7.1   WikiMatrix   OSCAR        mC4

#langs audited / total   65 / 119      21 / 38          20 / 78      51 / 166     48 / 108
%langs audited           54.62%        55.26%           25.64%       30.72%       44.44%
#sents audited / total   8037 / 907M   2214 / 521M      1997 / 95M   3517 / 8.4B  5314 / 8.5B
%sents audited           0.00089%      0.00043%         0.00211%     0.00004%     0.00006%

macro  C                 29.25%        76.14%           23.74%       87.21%       72.40%
       X                 29.46%        19.17%           68.18%       –            –
       WL                9.44%         3.43%            6.08%        6.26%        15.98%
       NL                31.42%        1.13%            1.60%        6.54%        11.40%
       offensive         0.01%         0.00%            0.00%        0.14%        0.06%
       porn              5.30%         0.63%            0.00%        0.48%        0.36%

micro  C                 53.52%        83.00%           50.58%       98.72%       92.66%
       X                 32.25%        15.27%           47.10%       –            –
       WL                3.60%         1.04%            1.35%        0.52%        2.33%
       NL                10.53%        0.69%            0.94%        0.75%        5.01%
       offensive         0.00%         0.00%            0.00%        0.18%        0.03%
       porn              2.86%         0.33%            0.00%        1.63%        0.08%

#langs =0% C             7             0                1            7            0
#langs <50% C            44            4                19           11           9
#langs >50% NL           13            0                0            7            1
#langs >50% WL           1             0                0            3            4

Table 3: Averages of sentence-level annotations across datasets and selected languages. Macro-avg:
Each language is weighted equally in the aggregation, regardless of its size. Micro-avg: Each label
is weighted by the fraction of sentences for that language in the overall annotated corpus, i.e., the
annotations for higher-represented languages are upweighted, and annotations for lower-represented
languages are downweighted. The bottom rows contain the number of languages that have 0% labeled
C, etc. Note that these are not true expectations since the languages audited were not randomly sampled.

The number of languages, the numbers of sentences
per language, and the choice of languages differ
across datasets, both in the original release and in
the selection for our audit, so the comparison of
numbers across datasets has to be taken with a
grain of salt. Since the numbers are based on a
small sample of sentences that were partially
annotated by non-experts, the error statistics are
only rough estimates. Our audit captures a decent
ratio of languages (25–55%, 2nd row in Table 3),
but only a tiny fraction of the overall number of
sentences (0.00004–0.002%). When we speak of
''low-'' and ''high''-resource languages, we mean
languages with smaller or larger representation in
the datasets at hand. When reporting language-specific
results we use the original language identifiers
of the datasets.
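The two aggregations can be written down in a few lines; label_fractions (per-language label percentages from the audit) and corpus_sizes (per-language sentence counts in the overall dataset) are hypothetical containers used only for illustration.

    def macro_average(label_fractions, label):
        # Every audited language counts equally, regardless of corpus size.
        return sum(f[label] for f in label_fractions.values()) / len(label_fractions)

    def micro_average(label_fractions, corpus_sizes, label):
        # Each language is weighted by its share of sentences in the overall dataset.
        total = sum(corpus_sizes.values())
        return sum(label_fractions[lang][label] * corpus_sizes[lang] / total
                   for lang in label_fractions)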

Which Datasets Have Quality Issues? The
macro-averaged results show that the ratio of
correct samples (C) ranges from 24% to 87%,
with a large variance across the five audited
datasets. Particularly severe problems were found
in CCAligned and WikiMatrix, with 44 of the 65
languages that we audited for CCAligned con-
taining under 50% correct sentences, and 19
of the 20 in WikiMatrix. In total, 15 of the
205 language-specific samples (7.3%) contained
not a single correct sentence. For the parallel
datasets we are also interested in the quantity of
misaligned/mistranslated sentences (X). For Wiki-
Matrix, two-thirds of the audited samples were on
average misaligned. We noticed that sentences
were often similar in structure, but described dif-
ferent facts (see Table 6). This might originate
from the nature of the underlying Wikipedia arti-
cles, since they are often comparable rather than
parallel (Schwenk et al., 2021).

Figure 1 illustrates per-corpus correctness more
completely, showing for each dataset what per-
cent of audited corpora are under each possible
threshold of correctness.


Figure 1: Fraction of languages in each dataset below a given quality threshold (percent correct).

Why Haven't These Problems Been Reported Before? The findings above are averaged on
a per-language basis (i.e., macro-average), and therefore give low- and high-resource languages
equal weight. If we instead estimate the quality on a per-sentence basis (i.e., down-weight
lower-resource languages in the computation of the average), the numbers paint a more opti-
mistic picture (''micro'' block in Table 3). This is especially relevant for the monolingual
datasets because they contain audits for English, which makes up 43% of all sentences in OSCAR
and 36% in mC4. To illustrate the effect of this imbalance: A random sample from the entire mC4
dataset will with over 63% chance be from one of the 8 largest languages (en, ru, es, de, fr,
it, pt, pl, >100M sentences each), all of which have near perfect quality. Analogously,
evaluation and tuning of web mining pipelines and resulting corpora in downstream applications
focused largely on higher-resource languages (Section 3), so the low quality of underrepresented
languages might go unnoticed if there is no dedicated evaluation, or no proficient speakers are
involved in the curation (Nekoto et al., 2020).

How Much Content is Nonlinguistic or in the Wrong Language? Nonlinguistic content is a
more common problem than wrong-language content. Among the parallel datasets, CCAligned
contains the highest percentage of nonlinguistic content, at 31.42% on average across all rated
corpora, and also the highest percentage of wrong-language content, at 9.44%. Among the mono-
lingual datasets, mC4 contains the highest ratio both of sentences in incorrect languages
(15.98% average) and nonlinguistic content (11.40% average), with 4 of the 48 audited languages
having more than 50% contents in other languages. The low amount of wrong language in ParaCrawl
shows the benefits of selecting domains by the amount of in-language text, but the dataset also
covers the smallest number of languages. The low ratio of wrong-language samples in OSCAR may
reflect the success of line-level LangID filtering. These numbers provide evidence that more
research in LangID could improve the overall quality, especially with respect to nonlinguistic
content.

Which Languages Got Confused? The languages that were confused were frequently related
higher-resource languages. However, there were also a significant number of ''out-of-model
cousin'' cases, where languages not supported by the LangID model ended up in a similar-seeming
language. For instance in mC4, much of the Shona (sn, Bantu language spoken in Zimbabwe and
Mozambique) corpus is actually Kinyarwanda (rw, Bantu language spoken mostly in Rwanda and
Uganda)—and, peculiarly, much of the Hawaiian (haw, Polynesian language spoken in Hawaii) is
actually Twi (tw/ak, Central Tano language spoken mostly in Ghana).

Do Low-resource Languages Have Lower Quality? Low-resource datasets tend to have lower
human-judged quality. The Spearman rank correlation between quality (%C) and size is positive
in all cases. The trend is strongest for mC4 (r = 0.66), and gradually declines for CCAligned
(r = 0.53), WikiMatrix (r = 0.49), ParaCrawl (r = 0.43), and OSCAR (r = 0.37). Figure 2
compares the number of sentences for each language against the proportion of correct sentences:
Not all higher-resource languages (> 10^6 sentences) have high quality, in particular for
CCAligned (e.g., Javanese (en–jv_ID) with 5% C, or Tagalog (en–tl_XX) with 13% C). For
mid-resource languages (10^4–10^6 sentences) the picture is inconclusive, with some languages
having high quality, and others having extremely low quality, even within the same datasets
(e.g., Urdu in CCAligned en–ur_PK has 100% C vs. its romanized counterpart en–ur_PK_rom with
0.5% C). For individual error codes trends are less clear (not depicted).

Figure 2: Percentage of sentences labeled as correct vs. log N sentences for all audited languages.

        es_XX  bm_ML  yo_NG  tr_TR  ku_TR  zh_CN  af_ZA  jv_ID  zh_TW  it_IT  mean
Acc-6   0.58   0.73   0.41   0.45   0.43   0.55   0.65   0.55   0.46   0.55   0.66
Acc-4   0.77   0.73   0.60   0.55   0.56   0.72   0.72   0.57   0.58   0.66   0.72
Acc-2   0.91   0.96   0.72   0.64   0.71   0.79   0.77   0.92   0.81   0.69   0.79

Table 4: Rater evaluation for a subset of audits from CCAligned (translated from English) measured
by the accuracy (Acc-n) of annotations by non-proficient speakers against annotations by proficient
speakers.

Which Languages Have the Lowest Quality?
Across datasets we observe that the quality is
particularly poor for languages that are included
in romanized script (_rom/_latn), but are more
commonly written in other scripts (e.g., Urdu
(ur), Japanese (ja), Arabic (ar)). These are not
transliterations of other scripts, but mostly con-
tain non-linguistic material or wrong languages
(e.g., the romanized Japanese corpus in mC4
(ja_latn) contains Spanish, French, English,
Portuguese, among others). In terms of geog-
raphy, the poorest quality is found for African
languages (Bambara (bm), Fula (ff), Kikongo
(kg), Luganda (lg), Lingala (ln), Northern Sotho
(nso), Oromo (om), Shona (sn), Somali (so),
Tswana (tn), Wolof (wo)), minority languages
in Europe and the Middle East that are closely
related to higher-resource languages (Azerbaijani
(az-IR), North Frisian (frr), Neapolitan (nap),
Silesian (szl), Zaza (zza)), lesser spoken Chi-
nese languages sharing a script with Mandarin
(Yue (yue), Wu (wuu)), four major Austronesian
languages (Central Bikol (bcl), Chavacano (cbk),
Javanese (jv), Sundanese (su)), and some South-Asian
languages, in particular Sinhala (si). Appendix D
contains the detailed per-language statistics for
all corpora.

What Is the Incidence of Offensive and
Pornographic Content? Overall, the sampled
sentences did not contain a large amount of
offensive content. However, there were notable
amounts of pornographic content (> 10%) found
in CCAligned for 11 languages.

Annotation Quality For a subset of audited
languages from CCAligned and OSCAR we mea-
sure the accuracy (Acc) of the labels assigned by
non-proficient speakers against the labels assigned
by proficient speakers for all audited sentences.
This can be understood as a directed measure of
annotator agreement for the special case where one
rater is an expert and the other is not. Results for
varying label granularity are reported in Tables 4
and 5. For n = 6 all classes of the taxonomy
were distinguished, for n = 4 the C subclasses
were combined, and for n = 2 it is a binary deci-
sion between C and the rest of the error classes.
With the full 6-class taxonomy (Acc-6) we find
a mean accuracy of 0.66 for CCAligned audits,
and 0.98 for OSCAR audits. With a binary tax-
onomy (Acc-2) distinguishing C from the rest, the
accuracy further increases to 0.79 for CCAligned.
This provides strong evidence that good quality

        tyv   rm    bar   eml   zh    la    mean
Acc-6   1.0   0.98  1.0   1.0   0.86  1.0   0.98
Acc-4   1.0   1.0   1.0   1.0   0.87  1.0   0.98
Acc-2   1.0   1.0   1.0   1.0   0.87  1.0   0.98

Table 5: Rater evaluation for a subset of audits
from OSCAR measured by the accuracy (Acc-n)
of annotations by non-proficient speakers against
annotations by proficient speakers.

annotations are not limited to those proficient in
a language.
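A sketch of how the accuracies at the three label granularities can be computed; the label strings follow Table 2, and the expert and non-expert lists are hypothetical inputs.

    def collapse(label, n):
        # n=6: full taxonomy; n=4: merge the C subclasses; n=2: correct vs. any error.
        if n <= 4 and label in {"CC", "CB", "CS"}:
            label = "C"
        if n == 2 and label != "C":
            label = "error"
        return label

    def accuracy(expert_labels, non_expert_labels, n):
        matches = sum(collapse(e, n) == collapse(a, n)
                      for e, a in zip(expert_labels, non_expert_labels))
        return matches / len(expert_labels)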

However, the significant drop of accuracy for
finer-grained labels hints that our taxonomy
can be further improved, especially for parallel
sentences. The error taxonomy lacks at least one
category of error, namely, ''correct/in-language
but unnatural''. Similarly, the definitions of
''correct-short'' and ''correct-boilerplate'' were
not understood equally by all annotators and the
concept of ''correct-short'' has potential issues
for agglutinative languages like Turkish. Finally,
it was unclear what to do with related dialects, for
example, when a sentence is ''almost correct but
wrong dialect'' or when it is unclear which dialect
a sentence belongs to. We recommend including
these categories for future audits.

4.3 Automatic Filtering

Given the frequency of WL and NL annotations,
it might be tempting to use open-source LangID
models to post-filter data on a per-sentence(-pair)
level, as OSCAR does. Unfortunately, this turns
out to have its own issues.

Sentence-level n-gram LangID Filtering We
classify all sentence pairs of CCAligned with
CLD3, an n-gram based LangID model. By com-
paring its predictions to the audit labels, we
evaluate its quality on the subset of annotated
samples: The classifier should detect both correct
languages when the pair is annotated as C and X,
and should detect incorrect languages in the pair
when WL and NL. On this task, the CLD3 classifier
achieves an average precision of only 40.6%.
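The evaluation described above can be sketched as follows: a prediction counts as correct if it confirms both expected languages for pairs labeled C or X, and flags a mismatch for pairs labeled WL or NL. The predict_language callable is a stand-in for whatever LangID model is used (CLD3 in this experiment) and is an assumption of this sketch.

    def langid_agrees_with_audit(src, tgt, src_lang, tgt_lang, audit_label,
                                 predict_language):
        # predict_language(text) -> predicted language code (e.g., a CLD3 wrapper).
        both_in_language = (predict_language(src) == src_lang
                            and predict_language(tgt) == tgt_lang)
        if audit_label in {"C", "X"}:    # pair should be in the expected languages
            return both_in_language
        if audit_label in {"WL", "NL"}:  # at least one side should be flagged
            return not both_in_language
        raise ValueError(f"unknown audit label: {audit_label}")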

Sentence-level Transformer LangID Filtering
n-gram LangID models like CLD3 have known
problems. However, Caswell et al. (2020) demon-
strate that semi-supervised Transformer-based
LangID models strongly out-perform them. We
train a comparable Transformer-based LangID
model and apply it to our annotated CCAligned
data. We find that filtering noisy corpora (< 50% correct) on LangID for both source and target
leads to gains in median precision, rising from 13.8% pre-filter to 43.9% post-filter. However,
this comes at a steep cost of 77.5% loss in recall. The biggest winners were Lingala, whose
precision climbs from 8% to 80%, and Oromo, which soars from 2% to 33% in-language. Both of
these, however, come at the cost of losing 50% of the correct in-language sentences, being
reduced from 22k sentences to 3k and 1k sentences, respectively, which would likely be too small
for building downstream models. The moral is that, at least at the current stage, there is no
one-size-fits-all approach for sentence-level LangID filtering.

5 Dataset Mis-labeling

Standardized and unambiguous representations of language codes are important for practical data
use and exchange. The standard used by most academic and industry applications is BCP-47
(Phillips and Davis, 2005), which builds off the two-letter ISO639-2 codes and three-letter
ISO639-3 codes, but also allows for adding subtags for scripts (e.g., Hindi in Latin script:
hi-Latn) or regional varieties (e.g., French spoken in Canada: fr-CA). It would enhance
transparency and interoperability if adopted consistently, especially with growing language
diversity in NLP.

We find a variety of errors and inconsistencies in language code usage, ranging from serious
mislabelings to small transgressions against standard conventions. For this analysis, we also
include the JW300 (Agić and Vulić, 2019) dataset, a multilingual dataset crawled from jw.org.
In summary, we find 8 nonstandard codes in CCAligned, 3 in OSCAR, 1 in mC4, 1 in WikiMatrix,
and 70 in JW300, for 83 in total. This does not include the 59 codes affected by superset
issues. Full details are given in Appendix A.

Inconsistent Language Codes One common issue is simply using nonstandard or invented codes.
For example, CCAligned uses only two-letter codes, so when the BCP-47 code for a language is
three letters it is either shortened (e.g., zza → zz) or invented (shn → qa). Similarly, OSCAR
contains data labeled as als (BCP-47 for Tosk Albanian) that is actually in gsw (Allemannic).9
Twenty-two additional language codes in JW300 have similar issues, including 12 codes that
start with jw but are not Javanese.

False Sign Languages Twelve percent (48/417) of JW300 carry language codes for sign languages.
Instead of sign language transcripts they are texts in another high-resource language, mostly
English or Spanish—for example, the en-zsl (Zambian sign language) data is actually
English-English parallel data (copies), details in Appendix A. This was likely caused by videos
with sign language interpretation embedded on the crawled Web sites.10

Mysterious Supersets When datasets contain language codes that are supersets of other language
codes, it is difficult to determine which particular language the text contains. WikiMatrix has
Serbian (sr), Croatian (hr), Bosnian (bs), and Serbo-Croatian (sh)—their superset.11 The issue
of codes that are supersets of others is common enough to include a small table dedicated to it
(Appendix Table 7).
In some cases this may not be an issue, as with Arabic, where ar conventionally refers to Modern
Standard Arabic, even though the code technically encompasses all dialects. In many cases, the
nature of the data in the superset code remains a mystery.

Deprecated Codes Finally, there are several deprecated codes that are used: sh in WikiMatrix,
iw in mC4, sh and eml in OSCAR, and daf in JW300.

9This is a result of the language code used by the Alemannic Wikipedia and affects any corpus or tool that uses Wikipedia data without correcting for this, like FastText.
10Kudos to Rebecca Knowles for this explanation.
11https://iso639-3.sil.org/code/hbs.

6 Risks of Low-Quality Data

Low Quality in Downstream Applications Text corpora today are building blocks for many
downstream NLP applications like question answering and text summarization—for instance, a
common approach is to first train translation models on such data and then automatically
translate training data for downstream models (Conneau et al., 2018). If the data used for the
original systems is flawed, derived technology may fail for those languages far down the line
without knowing the causes. This risk of undesired downstream effects calls for future studies
with a careful treatment of intertwined effects such as data size and domain, language-specific
phenomena, evaluation data and metric biases. To give the reader a brief glimpse of the impact
of data quality for the example of translation, we compare the C% metric from our audit with
the translation quality (sentencepiece-BLEU, spBLEU) of the multilingual translation model
M2M124 for 124 languages (Goyal et al., 2021). It was trained on WikiMatrix and CCAligned, and
similar data collected with the same tools, which we expect to show similar biases. Translation
quality is evaluated on the trusted, human-translated FLORES benchmark (Goyal et al., 2021).
For the 21 languages present in both the audit and the FLORES benchmark, we found a positive
correlation (Spearman) between the data quality scores and spBLEU of ρ = 0.44 (p = 0.041).
This is not as large as the correlation with data size (ρ = 0.66, p = 0.00078), but it
nonetheless helps to explain translation quality—the correlation between the product of C% and
data size (in other words, the expected total number of good sentences in the dataset) is the
highest yet, with a value of ρ = 0.73 (p = 0.00013).12

Representation Washing Since there are datasets that contain many low-resource languages, the
community may feel a sense of progress and growing equity, despite the actual quality of the
resources for these languages. Similarly, if low-quality datasets are used as benchmarks they
may exaggerate model performance, making low-resource NLP appear more solved than it is—or
conversely, if models perform poorly when trained with such data, it may be wrongly assumed
that the task of learning models for these languages is harder than it actually is or
infeasible given current resources. These effects could result in productive effort being
redirected away from these tasks and languages.

Trust in Incorrect ''Facts'' We found many instances of parallel-looking sentences that are
structurally and semantically similar, but not factually correct translations (Table 6).
They can cause models to produce plausible ''translations'' that are factually wrong, but users
may still trust them (algorithmic trust) without verifying the information. Similarly,
automation bias (Skitka et al., 1999), referring to humans favoring decisions made by automated
systems over decisions made by humans, might amplify the issues of inaccurate translations
caused by misalignments.

12For the translation from English, BLEU scores are less comparable but the trend holds nonetheless, with values of (ρ = 0.32, p = 0.14), (ρ = 0.74, p = 0.000078), and (ρ = 0.80, p = 0.0000087), respectively.

en   The prime minister of the UK is Boris Johnson.
nl   De minister-president van Nederland is Mark Rutte.
     en: The prime minister of the Netherlands is Mark Rutte.

en   24 March 2018
pt   14 Novembro 2018
     en: 14 November 2018

en   The current local time in Sarasota is 89 minutes.
nn   Den lokale tiden i Miami er 86 minutt.
     en: The local time in Miami is 86 minutes.

en   In 1932 the highway was extended north to LA.
bar  1938 is de Autobahn bei Inglstod fertig gstellt.
     en: The highway near Inglstod was completed in 1938.

Table 6: Examples of ''parallel'' data where the translation has a different meaning than the
source, but the form looks the same. (We added translations of the non-English side.) Such data
may encourage hallucinations of fake ''facts''.

7 Future Work and Recommendations

Of the five multilingual corpora evaluated, we consistently found severe issues with quality,
especially in the lower-resource languages. We rated samples of 205 languages, and found that
87 of them had under 50% usable data, with a full 15 languages at 0% in-language. We
furthermore found consistent issues with mislabeled data and nonstandard language codes,
particularly in the JW300 dataset, and identified 83 affected corpora, at least 48 of which
were entirely spurious (Section 5). While there might have been anecdotal evidence of
insufficient quality for some of the datasets, the majority of these quality issues had not
been reported, nor been investigated in depth. These issues might go unnoticed for languages
that are not represented in the evaluation of the crawling methods, and cause harm in
downstream applications (Khayrallah and Koehn, 2018).

There are a variety of ways to improve both the ease and accuracy of human evaluation, as well
as a few classes of issues we ignored in this paper, like close dialects. Ideally we would like
to build a standard suite of automatic metrics for datasets, but more research is necessary to
determine what the appropriate metrics would be. One important area missing from our analyses,
however, is the estimated portion of a dataset which has been generated by MT (Rarrick et al.,
2011), LM systems, or bots/templates, as for example in the analysis of C4 (Dodge et al.,
2021). The information captured in machine-generated content might still be useful for
modeling, but might falsely overrepresent typical generation patterns and introduce linguistic
errors or unnatural artifacts. We therefore strongly recommend looking at samples of any
dataset before using it or releasing it to the public.
As we have shown, one does not need to be proficient in a language to see when there are
serious quality issues, and a quick scan of 100 sentences can be sufficient to detect major
problems. Moreover, going through and annotating a small sample of data can bring actionable
insights about new ways to filter or use it.

If data quality issues are found, a wide variety of techniques can be explored, from filtering
on length-ratio, LangID, TF-IDF wordlists (Caswell et al., 2020), or dictionaries (Kamholz
et al., 2014), to neural approaches like LM scoring (Axelrod et al., 2011; Moore and Lewis,
2010; Wang et al., 2018). Unfortunately, none of these provides a quick and easy fix,
especially for low-resource languages—data cleaning is no trivial task!

Noisy datasets are by no means useless, at least if they contain some desirable content.
Therefore an alternative to filtering can be documentation (Bender et al., 2021). This can take
the form of a per-language quality score and notes about known issues, a datasheet (Gebru
et al., 2018) or nutrition label (Holland et al., 2018). However, we suggest researchers not
release corpora with near-zero in-language content, as this may give the mistaken impression of
usable resources.

Finally, we encourage the community to continue conducting evaluations and audits of public
datasets—similar to system comparison papers.

Acknowledgments

We would like to thank the TACL editors and reviewers, and AfricaNLP and Google reviewers who
have helped us shape this paper. Furthermore, we are grateful for Ahmed El-Kishky's support and
help with CCAligned and WikiMatrix size statistics.

References

Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1310
Wissam Antoun, Fady Baly, and Hazem Hajj. 2021. AraELECTRA: Pre-training text discriminators for Arabic language understanding. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 191–195, Kyiv, Ukraine (Virtual). Association for Computational Linguistics.
Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George F. Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019.
Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610. https://doi.org/10.1162/tacl_a_00288
Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355–362, Edinburgh, Scotland, UK. Association for Computational Linguistics.
Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, and Jaume Zaragoza. 2020. ParaCrawl: Web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4555–4567, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.417
Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604. https://doi.org/10.1162/tacl_a_00041
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922
Stella Biderman and Walter J. Scheirer. 2020. Pitfalls in machine learning research: Reexamining the development cycle. arXiv preprint arXiv:2011.02832.
Abeba Birhane and Vinay Uday Prabhu. 2021. Large image datasets: A pyrrhic win for computer vision? In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1536–1546. https://doi.org/10.1109/WACV48630.2021.00158
Jan A. Botha, Emily Pitler, Ji Ma, Anton Bakalov, Alex Salcianu, David Weiss, Ryan McDonald, and Slav Petrov. 2017. Natural language processing with small feed-forward networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2879–2885, Copenhagen, Denmark. Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1309
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
Isaac Caswell, Theresa Breiner, Daan van Esch, and Ankur Bapna. 2020. Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6588–6608, Barcelona, Spain (Online). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.579
Branden Chan, Stefan Schweter, and Timo Möller. 2020. German's next language model. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6788–6796, Barcelona, Spain (Online). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.598
Avihay Chriqui and Inbal Yahav. 2021. HeBERT & HebEMO: A Hebrew BERT model and a tool for polarity analysis and emotion recognition. arXiv preprint arXiv:2102.01909.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.747
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1269
Pieter Delobelle, Thomas Winters, and Bettina Berendt. 2020. RobBERT: A Dutch RoBERTa-based language model. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3255–3265, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.292
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, and Matt Gardner. 2021. Documenting the English colossal clean crawled corpus. arXiv preprint arXiv:2104.08758.
Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.387
Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, and Philipp Koehn. 2020. CCAligned: A massive collection of cross-lingual web-document pairs. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5960–5969, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.480
Miquel Esplà, Mikel Forcada, Gema Ramírez-Sánchez, and Hieu Hoang. 2019. ParaCrawl: Web-scale parallel corpora for the languages of the EU. In Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks, pages 118–119, Dublin, Ireland. European Association for Machine Translation.
Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2020. Beyond English-centric multilingual machine translation. arXiv preprint arXiv:2010.11125.
Wilhelmina Nekoto, Iroro Orife, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Ignatius Ezeani, Idris Abdulkadir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, Ghollah Kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, and Abdallah Bashir. 2020. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online. https://doi.org/10.18653/v1/2020.findings-emnlp.195
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. arXiv preprint arXiv:1803.09010.
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2021. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. arXiv preprint arXiv:2106.03193.
Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2018. The dataset nutrition label: A framework to drive higher data quality standards. arXiv preprint arXiv:1805.03677.
Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4411–4421. PMLR.
Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomás Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics. https://doi.org/10.18653/v1/E17-2068
Marcin Junczys-Dowmunt. 2018. Dual conditional cross-entropy filtering of noisy parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 888–895, Belgium, Brussels. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-6478
Marcin Junczys-Dowmunt. 2019. Microsoft translator at WMT 2019: Towards large-scale document-level neural machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 225–233, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-5321
Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N. C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948–4961, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.445
David Kamholz, Jonathan Pool, and Susan Colowick. 2014. PanLex: Building a resource for panlingual lexical translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3145–3150, Reykjavik, Iceland. European Language Resources Association (ELRA).
Vincentius Kevin, Birte Högden, Claudia Schwenger, Ali Şahan, Neelu Madan, Piush Aggarwal, Anusha Bangaru, Farid Muradov, and Ahmet Aker. 2018. Information nutrition labels: A plugin for online news evaluation. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 28–33, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5505
Huda Khayrallah and Philipp Koehn. 2018. On the impact of various types of noise on neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 74–83, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-2709
Philipp Koehn, Vishrav Chaudhary, Ahmed El-Kishky, Naman Goyal, Peng-Jen Chen, and Francisco Guzmán. 2020. Findings of the WMT 2020 shared task on parallel corpus filtering and alignment. In Proceedings of the Fifth Conference on Machine Translation, pages 726–742, Online. Association for Computational Linguistics.
John Koutsikakis, Ilias Chalkidis, Prodromos Malakasiotis, and Ion Androutsopoulos. 2020. GREEK-BERT: The Greeks visiting Sesame Street. In 11th Hellenic Conference on Artificial Intelligence, SETN 2020, pages 110–117, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3411408.3411440
Alexandra Sasha Luccioni and Joseph D. Viviano. 2021. What's in the box? An analysis of undesirable content in the Common Crawl corpus. arXiv preprint arXiv:2105.02732. https://doi.org/10.18653/v1/2021.acl-short.24
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: A tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.645
Mihai Masala, Stefan Ruseti, and Mihai Dascalu. 2020. RoBERT—a Romanian BERT model. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6626–6637, Barcelona, Spain (Online). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.581
Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220–224, Uppsala, Sweden. Association for Computational Linguistics.
Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.156
Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019, Cardiff, 22nd July 2019, pages 9–16, Mannheim. Leibniz-Institut für Deutsche Sprache.
Addison Phillips and Mark Davis. 2005. Tags for Identifying Languages. Internet Engineering Task Force. Work in Progress.
Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2084
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.
Spencer Rarrick, Chris Quirk, and Will Lewis. 2011. MT detection in Web-scraped parallel corpora. In Proceedings of MT Summit XIII. Asia-Pacific Association for Machine Translation.
Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2021. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1351–1361, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-main.115
Amit Seker, Elron Bandel, Dan Bareket, Idan Brusilovsky, Refael Shaked Greenfeld, and Reut Tsarfaty. 2021. AlephBERT: A Hebrew large pre-trained language model to start-off your Hebrew NLP application with. arXiv preprint arXiv:2104.04052.
Linda J. Skitka, Kathleen L. Mosier, and Mark Burdick. 1999. Does automation bias decision-making? International Journal of Human-Computer Studies, 51(5):991–1006. https://doi.org/10.1006/ijhc.1999.0252
Chenkai Sun, Abolfazl Asudeh, H. V. Jagadish, Bill Howe, and Julia Stoyanovich. 2019. MithraLabel: Flexible dataset nutritional labels for responsible data science.
A Details on Language Code Issues

Table 7 provides a complete list of the corpora where one code is defined as a superset of the other by the ISO standard, and in Table 8 we provide a complete list of the language codes in JW300 which purport to be sign languages but are actually unrelated high-resource languages.

Table 7: Situations where two language codes are represented, but one is a superset of another by the ISO standard, leading to unclarity about the data in the supercode dataset. *The als dataset is actually in gsw.

Dataset      Supercode   Subcode(s)
JW300        kg          kwy
JW300        mg          tdx
JW300        qu          que, qug, qus, quw, quy, quz, qvi, qvz
JW300        sw          swc
OSCAR        ar          arz
OSCAR        az          azb
OSCAR        sh          bs, hr, sr
OSCAR        ku          ckb
OSCAR        ms          id, min
OSCAR        no          nn
OSCAR        sq          als*
OSCAR        zh          yue, wuu
WikiMatrix   ar          arz
WikiMatrix   sh          bs, hr, sr
WikiMatrix   zh          wuu

Table 8: There are 48 languages in the JW300 corpus with language codes that correspond to sign languages, but in reality are unrelated high-resource languages (usually the most spoken language in the country of origin of the sign language). This table shows the actual language of the data corresponding to each sign language code.

Actual language   Sign language code(s) in JW300
cs                cse
de                gsg
el                gss
en                ase, asf, bfi, ins, psp, sfs, zib, zsl
es                aed, bvl, csf, csg, csn, csr, ecs, esn, gsm, hds, lsp, mfs, ncs, prl, pys, ssp, vsl
fi                fse
fr                fcs, fsl
hu                hsh
id                inl
it                ise
ja                jsl
ko                kvk
pl                pso
pt                bzs, mzy, psr, sgn_AO
ro                rms
ru                rsl
sk                svk
sq                sql
st                jw_ssa
zh                csl, tss
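To make the supercode problem of Table 7 concrete, the following is a minimal sketch of how a release could be screened for it. The SUPERSETS mapping is a hand-transcribed subset of the Table 7 relations, and the function name is illustrative, not part of any released tool.

# Minimal sketch: flag releases that ship both a macrolanguage ("supercode")
# and one of its individual languages, as in Table 7.
SUPERSETS = {
    "qu": {"que", "qug", "qus", "quw", "quy", "quz", "qvi", "qvz"},
    "sh": {"bs", "hr", "sr"},
    "ms": {"id", "min"},
    "zh": {"yue", "wuu"},
    "no": {"nn"},
}

def ambiguous_supercodes(released_codes):
    """Yield (supercode, overlapping subcodes) pairs present in one release."""
    released = set(released_codes)
    for supercode, subcodes in SUPERSETS.items():
        if supercode in released:
            overlap = released & subcodes
            if overlap:
                yield supercode, sorted(overlap)

# Example: an OSCAR-style release containing both 'zh' and 'yue'/'wuu'.
for sup, subs in ambiguous_supercodes(["en", "zh", "yue", "wuu", "no", "nn"]):
    print(f"'{sup}' is a superset of {subs}: contents of the '{sup}' corpus are ambiguous")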
Special attention needs to be given to the JW300 dataset, which, in addition to the sign language and superset code issues, has a variety of other peculiarities. These problems seem to originate in the codes used by jw.org,13 which were apparently not checked in the creation of the JW300 dataset. An overview is provided in Table 9, and the following paragraphs give specifics.

Twelve languages in JW300 have codes starting in jw_, suggesting they are varieties of Javanese (ISO639-1 jw), but are instead attempts to represent language dialects for which there are no BCP-47 codes. These codes seem to have been updated on jw.org to appropriate BCP-47 private-use extensions (of the form base_x_variety), which are provided in Table 9.

In addition to the jw_ tags, there are two other mis-used private subtags: hy_arevmda, which, in addition to lacking the mandatory x, appears to represent standard Western Armenian (hyw); and rmy_AR, which, rather than being Romany from Argentina, is Kalderash Romany.

There are also a few anomalies where private-use extensions should have been used but other methods were found to convey the distinctions. Three codes appear in addition to equivalent ISO codes, making it unclear which languages they are. Two of these are equivalencies between ISO639-2 and ISO639-3 (nya and ny are both Chichewa, qu and que are both Quechua), and one is a script equivalency (kmr and kmr_latn are both in Latin script). In these three cases the two codes do represent different languages, so a private-use extension would have been appropriate.

Finally, there is the more minor issue that three languages use the ISO639-3 code instead of the ISO639-2 code, and therefore are not BCP-47.

13 The jw.org Web site seems to use correct BCP-47 extensions now, however, and entering a code such as ''jw_dmr'' redirects to ''naq_x_dmr''.

Table 9: Language code issues in the JW300 dataset for 22 language varieties not covered by Tables 7 and 8. Twelve languages have codes starting in jw_, suggesting they are varieties of Javanese, but are instead mis-parsed private-use extensions. Three codes appear in addition to equivalent ISO codes, making it unclear which languages they are. One language uses a deprecated ISO code. Four languages use the ISO639-3 code instead of the ISO639-2 code, and therefore are not BCP-47. Private-use extensions are given as they appear on jw.org, and specified as '?' if they are absent from jw.org.

Code in JW300   BCP-47 code   Actual language name

Incorrect private-use extensions
hy_arevmda      hyw           Western Armenian
jw_dgr          os_x_dgr      Digor Ossetian
jw_dmr          naq_x_dmr     Damara Khoekhoe
jw_ibi          yom_x_ibi     Ibinda Kongo
jw_paa          pap_x_paa     Papiamento (Aruba)
jw_qcs          qxl           Salasaca Highland Kichwa
jw_rmg          rmn_x_rmg     Greek Romani (South)
jw_rmv          rmy_x_rmv     Vlax Romani (Russia)
jw_spl          nso_x_spl     Sepulana
jw_ssa          st_ZA         Sesotho (South Africa)
jw_tpo          pt_PT         Portuguese (Portugal)
jw_vlc          ca_x_vlc      Catalan (Valencia)
jw_vz           skg_x_vz      Vezo Malagasy
rmy_AR          rmy_x_?       Kalderash

Equivalent codes used in place of extensions
kmr_latn        kmr_x_rdu     Kurmanji (Caucasus)
nya             ny_x_?        Chinyanja (Zambia)
que             qu_x_?        Quechua (Ancash)

Deprecated codes
daf             dnj/lda       Dan

ISO639-3 used in place of ISO639-2
cat             ca            Catalan
gug             gn            Guarani
run             rn            Kirundi
tso_MZ          ts_MZ         Changana (Mozambique)

In addition to the JW300-specific errors, Table 10 summarizes miscellaneous errors in CCAligned and OSCAR that were detailed in Section 5.

Table 10: Miscellaneous errors in language codes.

Dataset      Code in Corpus   Correct Code
CCAligned    zz               zza
CCAligned    sz               szl
CCAligned    ns               nso
CCAligned    cb               ckb
CCAligned    tz               ber
CCAligned    qa               shn
CCAligned    qd               kac
CCAligned    cx               ceb
mC4          iw               he
OSCAR        eml              egl
OSCAR        als              gsw
OSCAR        sh               hbs
WikiMatrix   sh               hbs
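A minimal sketch of how the corrections in Table 10 might be applied when loading these releases. The dictionary is transcribed from Table 10; the function name is illustrative, not part of any dataset's tooling.

# Normalise nonstandard published codes before using language labels downstream.
CODE_FIXES = {
    "CCAligned":  {"zz": "zza", "sz": "szl", "ns": "nso", "cb": "ckb",
                   "tz": "ber", "qa": "shn", "qd": "kac", "cx": "ceb"},
    "mC4":        {"iw": "he"},
    "OSCAR":      {"eml": "egl", "als": "gsw", "sh": "hbs"},
    "WikiMatrix": {"sh": "hbs"},
}

def corrected_code(dataset: str, published_code: str) -> str:
    """Return the corrected language code, falling back to the published one."""
    return CODE_FIXES.get(dataset, {}).get(published_code, published_code)

assert corrected_code("mC4", "iw") == "he"
assert corrected_code("OSCAR", "fr") == "fr"   # unaffected codes pass through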

B Complete Error Taxonomy and Instructions

In addition to the examples given in Table 2, raters were provided with the following verbal notes on the error codes:

• CC: Correct translation, natural sentence: It's OK if it's a sentence fragment instead of a whole sentence, as long as it is not too short (about 5 words or greater). The translation does not have to be perfect.

• CS: Correct translation, but single word or short phrase: Also includes highly repeated short phrases, like ''the cat the cat the cat the cat the cat ...''

• CB: Correct translation, but boilerplate: This can be auto-generated or formulaic content, or content that one deems ''technically correct but generally not very useful to NLP models''. Unfortunately, it's often not clear what should be counted as boilerplate... do your best.

• X: Incorrect translation: [for parallel sentences] Both source and target are in the correct language, but they are not adequate translations.

• WL: Wrong language: For short sentences, especially with proper nouns, there is often a fine line between ''Wrong language'' and ''Not language''. Do your best.

• NL: Not language: At least one of source and target are not linguistic content. Any sentence consisting only of a proper noun (e.g. ''Tyrone Ping'') should be marked as NL.

• U: Unknown: For sentences that need verification by a native speaker. This is an auxiliary label that is resolved in most cases.
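A minimal sketch of the taxonomy above expressed as data, which can be used to validate annotation files before they are aggregated; the names are illustrative and not from any released tool.

CORRECT_LABELS = {"CC", "CS", "CB"}   # correct: natural / short / boilerplate
ERROR_LABELS = {"X", "WL", "NL"}      # incorrect translation / wrong language / not language
AUXILIARY_LABELS = {"U"}              # needs verification by a native speaker
ALL_LABELS = CORRECT_LABELS | ERROR_LABELS | AUXILIARY_LABELS

def check_annotation(label: str) -> bool:
    """Reject typos such as 'CL' before they reach the aggregated statistics."""
    return label in ALL_LABELS

def is_correct(label: str) -> bool:
    """CC, CS, and CB all count towards the aggregate C column."""
    return label in CORRECT_LABELS

assert check_annotation("CB") and is_correct("CB")
assert not check_annotation("CL")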

C Methodological Notes

A surprising amount of work can be done without being an expert in the languages involved. The easiest approach is simply to search the internet for the sentence, which usually results in finding the exact page the sentence came from, which in turn frequently contains clues like language codes in the URL, or a headline like ''News in X language'', sometimes with references to a translated version of the same page. However, for the cases where this is insufficient, here are a few tips, tricks, and observations.

No Skills Required: Things that do not require knowledge of the language(s) in question.

1. ''Not language'' can usually be identified by anyone who can read the script, though there are tricky cases with proper nouns.

2. Frequently, ''parallel'' sentences contain different numbers in the source and target (especially auto-generated content), and are easy to disqualify; see the sketch after this list.

3. Errors tend to repeat. If a word is mistranslated once, it will often be mistranslated many more times throughout a corpus, making it easy to spot.
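A minimal sketch of heuristic 2, assuming plain Arabic digits on both sides (differing number formats and non-Latin digits would need extra handling); the function names are illustrative.

import re

def numbers(text: str) -> list:
    """Extract numeric tokens, so '1994' and '3.5' are compared as strings."""
    return sorted(re.findall(r"\d+(?:[.,]\d+)*", text))

def mismatched_numbers(source: str, target: str) -> bool:
    """True if the two sides of a 'parallel' pair do not contain the same numbers."""
    return numbers(source) != numbers(target)

# A likely-misaligned pair: the dates do not match.
print(mismatched_numbers("The treaty was signed in 1994.",
                         "Le traite a ete signe en 1995."))   # True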

Basic Research Required: Things that do not require knowledge of the language(s) in question but can be done with basic research.

1. If it's written in the wrong script it's considered wrong language; see the sketch after this list. (Sometimes the writing system is indicated in the published corpus, e.g., bg-Latn, but usually the language has a ''default'' script defined by ISO.)

2. Some types of texts come with inherent labels or markers, such as enumerators or verse numbers.

3. When all else fails, search the internet for the whole sentence or n-grams thereof! If the whole sentence can be found, frequently the language is betrayed by the web page (the language's autonym is useful in this case).

D Complete Audit Results

Tables 11, 12, 13, 14, and 15 give the complete annotation percentages for CCAligned, WikiMatrix, ParaCrawl, mC4, and OSCAR, respectively. For each annotation label, we report the ratio of the annotated sentences (of max 100 sentences) that were assigned that label by the primary annotator. Repeated annotations done for agreement measurement are not included. The C column aggregates all correct sub-codes (CC, CS, CB). We also report the total number of sentences that each dataset contains for each language and the average sentence length for the audited sentences to illustrate differences across languages. The original language codes as they are published with the datasets are maintained for the sake of consistency (but should be handled with care in future work, see Section 5), and those with less than 20% correct sentences are highlighted.
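A minimal sketch of how a row of these tables is derived from the raw labels, including the C aggregation over CC, CS, and CB and the 20% highlighting rule; variable and function names are illustrative.

from collections import Counter

def audit_row(labels):
    """Per-label ratios over the audited sample (at most 100 sentences)."""
    counts = Counter(labels)
    total = len(labels)
    ratios = {label: 100.0 * counts[label] / total
              for label in ("CC", "CS", "CB", "X", "WL", "NL")}
    ratios["C"] = ratios["CC"] + ratios["CS"] + ratios["CB"]
    ratios["low_quality"] = ratios["C"] < 20.0   # boldfaced in the tables
    return ratios

sample = ["CC"] * 10 + ["CB"] * 5 + ["WL"] * 60 + ["NL"] * 25
row = audit_row(sample)
print(round(row["C"], 1), row["low_quality"])    # 15.0 True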


Table 11: Audit results for a sample of 100 sentences from CCAligned for each language pair, compared to the number of sentences available in the dataset. If fewer than 100 sentences were available, all sentences were audited. Language codes are as originally published. The length is measured in number of characters and averaged across the audited portion of each corpus. Languages with less than 20% correct sentences are boldfaced.

[Per-pair audit percentages (C, CC, CS, CB, X, WL, NL, porn), sentence counts, and average target lengths for the 65 audited CCAligned language pairs, from en-sz_PL up to en-es_XX.]

Table 12: Audit results for a sample of 100 sentences from WikiMatrix for each language pair, compared to the number of sentences available in the dataset. Language codes are as originally published. The length is measured in number of characters and averaged across the audited portion of each corpus. Languages with less than 20% correct sentences are boldfaced.

Pair        C       CC      CS      CB      X       WL      NL      porn    # sentences   avg target length
en-ug       12.87%  8.91%   1.98%   1.98%   72.28%  9.90%   1.98%   0.00%   22012         95.55
en-mwl      27.00%  26.00%  0.00%   1.00%   73.00%  0.00%   0.00%   0.00%   33899         135.26
en-tg       0.00%   0.00%   0.00%   0.00%   95.10%  3.92%   0.98%   0.00%   37975         88.87
en-ne       13.00%  7.00%   6.00%   0.00%   60.00%  23.00%  4.00%   0.00%   40549         69.26
en-ka       11.88%  2.97%   2.97%   5.94%   73.27%  10.89%  2.97%   0.00%   41638         144.74
en-lmo      12.75%  11.76%  0.00%   0.98%   81.37%  4.90%   0.98%   0.00%   43790         89.38
en-io       28.00%  27.00%  0.00%   1.00%   69.00%  2.00%   1.00%   0.00%   45999         83.26
en-jv       13.73%  9.80%   0.00%   3.92%   70.59%  12.75%  2.94%   0.00%   48301         91.87
en-wuu      23.23%  14.14%  7.07%   2.02%   65.66%  7.07%   4.04%   0.00%   51024         34.77
br-en       8.70%   7.61%   1.09%   0.00%   82.61%  4.35%   0.00%   0.00%   58400         90.68
bar-en      6.00%   6.00%   0.00%   0.00%   75.00%  16.00%  3.00%   0.00%   67394         103.51
en-kk       5.00%   2.00%   2.00%   1.00%   81.00%  14.00%  0.00%   0.00%   109074        56.03
en-sw       33.33%  27.27%  4.04%   2.02%   64.65%  2.02%   0.00%   0.00%   138590        111.61
en-nds      1.96%   1.96%   0.00%   0.00%   95.10%  1.96%   0.98%   0.00%   178533        91.95
be-en       26.00%  24.00%  2.00%   0.00%   73.00%  1.00%   0.00%   0.00%   257946        121.22
en-hi       36.27%  32.35%  0.98%   2.94%   59.80%  0.98%   2.94%   0.00%   696125        96.77
en-ko       48.04%  33.33%  2.94%   11.76%  48.04%  2.94%   0.98%   0.00%   1345630       55.18
en-uk       87.00%  84.00%  2.00%   1.00%   10.00%  1.00%   2.00%   0.00%   2576425       104.39
en-it       42.00%  42.00%  0.00%   0.00%   58.00%  0.00%   0.00%   0.00%   4626048       140.27
en-simple   37.62%  24.75%  0.00%   12.87%  56.44%  2.97%   2.97%   0.00%   N/A           77.53

Table 13: Audit results for a sample of 100 sentences from ParaCrawl for each language pair, compared to the number of sentences available in the dataset. Language codes are as originally published. The length is measured in number of characters and averaged across the audited portion of each corpus.

Pair     C       CC      CS      CB      X       WL      NL      porn    # sentences   avg target length
en-so    80.81%  61.62%  1.01%   18.18%  14.14%  5.05%   0.00%   0.00%   14879         189.83
en-ps    72.00%  53.00%  9.00%   10.00%  17.00%  10.00%  0.00%   0.00%   26321         141.01
en-my    45.00%  9.00%   16.00%  20.00%  32.00%  9.00%   14.00%  0.00%   31374         147.07
en-km    76.00%  51.00%  13.00%  12.00%  18.00%  6.00%   0.00%   0.00%   65113         121.20
en-ne    73.00%  48.00%  1.00%   24.00%  23.00%  2.00%   0.00%   0.00%   92084         153.42
en-sw    85.00%  60.00%  15.00%  10.00%  11.00%  2.00%   2.00%   0.00%   132517        167.34
en-si    37.00%  31.00%  6.00%   0.00%   62.00%  0.00%   1.00%   0.00%   217407        123.06
en-nn    35.92%  24.27%  8.74%   2.91%   49.51%  13.59%  0.97%   0.00%   323519        56.24
es-eu    88.00%  66.00%  15.00%  7.00%   10.00%  1.00%   1.00%   0.00%   514610        121.31
es-gl    89.00%  46.00%  6.00%   37.00%  4.00%   7.00%   0.00%   0.00%   1222837       107.88
en-ru    81.00%  73.00%  6.00%   2.00%   19.00%  0.00%   0.00%   6.00%   5377911       101.28
en-bg    95.15%  85.44%  0.97%   8.74%   4.85%   0.00%   0.00%   0.97%   6470710       112.29
es-ca    80.00%  54.00%  19.00%  7.00%   11.00%  9.00%   0.00%   5.00%   6870183       107.21
en-el    91.59%  68.22%  0.93%   22.43%  7.48%   0.93%   0.00%   0.00%   9402646       135.66
en-pl    94.12%  76.47%  0.98%   16.67%  3.92%   1.96%   0.00%   0.98%   13744860      95.95
en-nl    49.00%  32.00%  17.00%  0.00%   46.00%  3.00%   2.00%   0.00%   31295016      95.05
en-pt    93.07%  92.08%  0.00%   0.99%   4.95%   1.98%   0.00%   0.00%   31486963      108.68
en-it    60.82%  36.08%  16.49%  8.25%   38.14%  0.00%   1.03%   0.00%   40798278      127.55
en-es    87.00%  54.00%  20.00%  13.00%  12.00%  0.00%   1.00%   0.50%   78662122      119.72
en-de    82.83%  64.65%  13.13%  5.05%   13.13%  3.03%   1.01%   0.00%   82638202      111.43
en-fr    89.62%  82.08%  4.72%   2.83%   10.38%  0.00%   0.00%   0.00%   104351522     144.20


Table 14: Audit results for a sample of 100 sentences from mC4 for each language, compared to the number of sentences available in the dataset. Language codes are as originally published. The length is measured in number of characters and averaged across the audited portion of each corpus. Languages with less than 20% correct sentences are boldfaced.

[Per-language audit percentages (C, CC, CS, CB, WL, NL, porn), sentence counts, and average lengths for the 48 audited mC4 corpora, including the romanized variants bg-Latn, ja-Latn, ru-Latn, and zh-Latn.]


Table 15: Audit results for a sample of 100 sentences from OSCAR for each language, compared to the number of sentences available in the dataset. If fewer than 100 sentences were available, all sentences were audited. Language codes are as originally published. Length is measured in number of characters. Languages with less than 20% correct sentences are boldfaced.

[Per-language audit percentages (C, CC, CS, CB, WL, NL, porn), sentence counts, and average lengths for the audited OSCAR corpora; for several of the smallest corpora only a handful of sentences were available.]
