Samanantar: The Largest Publicly Available
Parallel Corpora Collection for 11 Indic Languages
Gowtham Ramesh1∗ Sumanth Doddapaneni1∗ Aravinth Bheemaraj2,5
Mayank Jobanputra3 Raghavan AK4 Ajitesh Sharma2,5 Sujit Sahoo2,5
Harshita Diddee4 Mahalakshmi J4 Divyanshu Kakwani3,4 Navneet Kumar2,5
Aswin Pradeep2,5 Srihari Nagaraj2,5 Kumar Deepak2,5 Vivek Raghavan5
Anoop Kunchukuttan4,6 Pratyush Kumar1,3,4 Mitesh Shantadevi† Khapra1,3,4‡
1RBCDSAI, India
2Tarento Technologies, India
3IIT Madras, India
4AI4Bharat, India
5EkStep Foundation, India
6Microsoft, India
Abstract
We present Samanantar, the largest publicly
available parallel corpora collection for Indic
languages. The collection contains a total of
49.7 million sentence pairs between English
and 11 Indic languages (from two language
families). Specifically, we compile 12.4 mil-
lion sentence pairs from existing, publicly
available parallel corpora, and additionally
mine 37.4 million sentence pairs from the
Web, resulting in a 4× increase. We mine
the parallel sentences from the Web by com-
bining many corpora, tools, and methods: (a)
Web-crawled monolingual corpora, (b) doc-
ument OCR for extracting sentences from
scanned documents, (c) multilingual repre-
sentation models for aligning sentences, and
(d) approximate nearest neighbor search for
searching in a large collection of sentences.
Human evaluation of samples from the newly
mined corpora validates the high quality of the
parallel sentences across 11 languages. Fur-
ther, we extract 83.4 million sentence pairs
between all 55 Indic language pairs from the
English-centric parallel corpus using English
as the pivot language. We trained multilingual
NMT models spanning all these languages
on Samanantar which outperform existing
models and baselines on publicly available
benchmarks, such as FLORES, establishing
the utility of Samanantar. Our data and mod-
els are available publicly at Samanantar and we
hope they will help advance research in NMT
and multilingual NLP for Indic languages.
1 Introduction
The advent of deep-learning (DL) based neural
encoder-decoder models has led to significant
∗The first two authors have contributed equally.
†Corresponding author: miteshk@cse.iitm.ac.in.
‡Dedicated to the loving memory of my grandmother.
progress in machine translation (MT) (Bahdanau
et al., 2015; Wu et al., 2016; Sennrich et al.,
2016a,b; Vaswani et al., 2017). While this has
been favorable for resource-rich languages, there
has been limited benefit for resource-poor lan-
guages which lack parallel corpora, monolingual
corpora, and evaluation benchmarks (Koehn and
Knowles, 2017). Multilingual models can im-
prove performance on resource-poor languages
via transfer learning from resource-rich languages
(Firat et al., 2016; Johnson et al., 2017b; Kocmi
and Bojar, 2018), more so when the resource-rich
and resource-poor languages are related (Nguyen
and Chiang, 2017; Dabre et al., 2017). However, it
is difficult to achieve this with limited in-language
data (Guzmán et al., 2019), particularly when an
entire group of related languages is low-resource,
making transfer-learning infeasible.
A case in point is that of languages from the
Indian subcontinent, a very linguistically diverse
region. India has 22 constitutionally listed lan-
guages spanning 4 major language families. Other
countries in the subcontinent also have their share
of widely spoken languages. These languages are
closely related both genetically and through contact,
which has led to significant sharing of vocabulary
and linguistic features (Emeneau, 1956). These
languages account for a collective speaker base of
over 1 billion speakers. The demand for quality,
publicly available translation systems in a mul-
tilingual society like India is obvious. However,
there is very limited publicly available parallel
data for Indic languages. Given this situation, an
obvious question to ask is: What does it take to im-
prove MT on the large set of related low-resource
Indic languages? The answer is straightforward:
create large parallel datasets and train proven
DL models. However, collecting new data with
Transactions of the Association for Computational Linguistics, vol. 10, pp. 145–162, 2022. https://doi.org/10.1162/tacl_a_00452
Action Editor: Philipp Koehn. Submission batch: 6/2021; Revision batch: 9/2021; Published 2/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Pair     Existing Sources   New Sources   Total    Increase Factor
en-as       108                  34          141        1.3
en-bn     3,496               5,109        8,605        2.5
en-gu       611               2,457        3,068        5
en-hi     2,818               7,308       10,126        3.6
en-kn       472               3,622        4,094        8.7
en-ml     1,237               4,687        5,924        4.8
en-mr       758               2,869        3,627        4.8
en-or       229                 769          998        4.4
en-pa       631               2,349        2,980        4.7
en-ta     1,456               3,809        5,265        3.6
en-te       593               4,353        4,946        8.3
Total    12,408              37,366       49,774        4

Table 1: Summary statistics of the Samanantar corpus. All numbers are in thousands.
manual translations at the scale necessary to train
large DL models would be slow and expensive.
Instead, several recent works have proposed min-
ing parallel sentences from the web (Schwenk
et al., 2019a, 2020; El-Kishky et al., 2020).
The representation of Indic languages in these
works is poor, however (e.g., CCMatrix con-
tains parallel data for only 6 Indic languages).
In this work, we aim to significantly increase the
amount of parallel data on Indic languages by
combining the benefits of many recent contribu-
tions: large Indic monolingual corpora (Kakwani
et al., 2020; Ortiz Suarez et al., 2019), accurate
multilingual representation learning (Feng et al.,
2020; Artetxe and Schwenk, 2019), scalable ap-
proximate nearest neighbor search (Johnson et al.,
2017a; Subramanya et al., 2019; Guo et al., 2020),
and optical character recognition (OCR) of Indic
scripts in rich text documents. By combining these
methods, we propose different pipelines to collect
parallel data from three different types of sources:
(a) non-machine-readable sources like scanned
parallel documents, (b) machine-readable sources
like news Web sites with multilingual content,
(c) IndicCorp (Kakwani et al., 2020), the largest
corpus of monolingual data for Indic languages.
Combining existing datasets and the new
datasets that we mine from the above-mentioned
sources, we present Samanantar,1 the largest
publicly available parallel corpora collection
for Indic languages. Samanantar contains ∼
49.7M parallel sentences between English and
11 Indic languages, ranging from 141K pairs be-
tween English-Assamese to 10.1M pairs between
English-Hindi. Of these, 37.4M pairs are newly
mined as a part of this work and 12.4M are com-
piled from existing sources. Thus, the newly mined
data is about 3 times the existing data. Table 1
shows the language-wise statistics. Figure 1 shows
the relative contribution of different sources from
which new parallel sentences were mined. The
largest contributor is data mined from IndicCorp,
1Samanantar in Sanskrit means semantically similar.
Figure 1: Total number of newly mined En-X parallel
sentences in Samanantar from different sources.
which accounts for 67% of the total corpus. From
this English-centric corpus, we mine 83.4M paral-
lel sentences between the 55 (i.e., C(11,2)) Indic language
pairs using English as the pivot. To evaluate the
quality of the mined sentences we collect human
judgments from 38 annotators for a total of 9,566
sentence pairs across 11 languages. The annota-
tions attest to the high quality of the mined parallel
corpus and validate our design choices.
To evaluate if Samanantar advances the state
of the art for Indic NMT, we train a multilin-
gual model, called IndicTrans, using Samanantar.
We compare IndicTrans with (a) commercial
translation systems (Google, Microsoft), (b) pub-
licly available translation systems OPUS-MT
(Tiedemann and Thottingal, 2020a), mBART50
(Tang et al., 2020), and CVIT-ILMulti (Philip
et al., 2020), and (c) models trained on all
existing sources of parallel data between Indic
languages. Across multiple publicly available test
sets spanning 10 Indic languages, we observe
that IndicTrans performs better than all existing
open source models and even outperforms com-
mercial systems on many benchmarks, thereby
establishing the utility of Samanantar.
The three main contributions of this work,
namely, (i) Samanantar, the largest collection
of parallel corpora for Indic languages, (ii) In-
dicTrans, a multilingual model for translating
from En-Indic and Indic-En, and (iii) human
judgments on cross-lingual textual similarity
for about 9,566 sentence pairs, are publicly
available (https://indicnlp.ai4bharat
.org/samanantar/).
2 Samanantar: A Parallel Corpus for
Indic Languages
Samanantar contains parallel sentences between
English and 11 Indic languages, that is, Assamese
(as), Bengali (bn), Gujarati (gu), Hindi (hi), Kan-
nada (kn), Malayalam (ml), Marathi (mr), Odia
(or), Punjabi (pa), Tamil (ta), and Telugu (te).
Additionally, it also contains parallel sentences
between the C(11,2) = 55 Indic language pairs ob-
tained by pivoting through English (en). To build
tained by pivoting through English (en). To build
this corpus, we first collated all existing public
sources of parallel data for Indic languages that
have been released over the years, as described in
Section 2.1. We then expand this corpus further
by mining parallel sentences from three types of
sources from the web as described in Sections 2.2
to 2.4.
2.1 Collation from Existing Sources
We first briefly describe the existing sources of
parallel sentences for Indic languages. The In-
dic NLP Catalog2 helped identify many of these
sources. Recently, the WAT 2021 MultiIndicMT
shared task (Nakazawa et al., 2021) also compiled
many existing Indic language parallel corpora.
Some sentence-aligned corpora were collected
from OPUS3 (Tiedemann, 2012) on 21 March
2021. These include localization data (GNOME,
KDE4, Ubuntu, Mozilla-I10n), religious text
(JW300 (Agić and Vulić, 2019), Bible-uedin
(Christodouloupoulos and Steedman, 2015), Tan-
zil), GlobalVoices, OpenSubtitles (Lison and
Tiedemann, 2016), TED2020 (Reimers and
Gurevych, 2020), WikiMatrix (Schwenk et al.,
2019a), Tatoeba, and ELRC 2922.
We also collated parallel data from the follow-
ing non-OPUS sources (URLs can be found in the
Indic NLP Catalog): ALT (Riza et al., 2016),
BanglaNMT (Hasan et al., 2020), CVIT-PIB
2https://github.com/AI4Bharat/indicnlp
_catalog.
3URLs to the original sources can be found on the OPUS
Web site: https://opus.nlpl.eu.
147
(Philip et al., 2020), IITB (Kunchukuttan et al.,
2018), MTEnglish2Odia, NLPC, OdiEnCorp 2.0
(Parida et al., 2020), PMIndia V1 (Haddow and
Kirefu, 2020), SIPC (Post et al., 2012), TICO19
(Anastasopoulos et al., 2020), UFAL (Ramasamy
et al., 2012), URST (Shah and Bakrola, 2019),
and WMT (Barrault et al., 2019), which provided
the training set for en-gu.
As shown in Table 1, these sources4 collated to-
gether result in a total of 12.4M parallel sentences
(after removing duplicates) between English and
11 Indic languages. It is interesting that no pub-
licly available MT system has been trained using
parallel data from all these existing sources.
We observed that some existing sources, such
as JW300, were extremely noisy, containing
many sentence pairs that were not transla-
tions of each other. However, we chose not to
clean/post-process any of the existing sources,
beyond what was already done by the pub-
lic repositories that released these datasets. As
future work, we plan to study different data filter-
ing (Junczys-Dowmunt, 2018) and data sampling
techniques (Bengio et al., 2009) and their impact
on the performance of the NMT model being
trained. For example, we could sort the sources
by their quality and feed sentences from only very
high quality sources during the later epochs while
training the model.
2.2 Mining Parallel Sentences from Machine
Readable Comparable Corpora
We identified several news websites which pub-
lish articles in multiple Indic languages (see
Table 2). For a given Web site, the articles across
languages are not necessarily translations of each
other. However, content within a given date range
is often similar as the sources are India-centric
with a focus on local events, personalities, advi-
sories, etc. For example, news about guidelines for
CoViD-19 vaccination get published in multiple
Indic languages. Even if such a news article in
Hindi is not a sentence-by-sentence translation, it
may contain some sentences which are acciden-
tally or intentionally parallel to sentences from a
corresponding English article. Hence, we consider
4We have not included CCMatrix (Schwenk et al., 2020)
and CCAligned (El-Kishky et al., 2020) in the current ver-
sion of Samanantar. CCMatrix is not publicly available and
CCAligned has been criticized by some recent work (Caswell
et al., 2021).
Mykhel, DD national + sports, Punjab govt, Pranabmukherjee, Catchnews,
Drivespark, Financial Express, Gujarati govt, General corpus, Kolkata24×7,
Good returns, Indian Express, Zeebiz, Sakshi, Business Standard, NewsOnAir,
Asianetnews, The Wire, Nouns dictionary, YouTube science channels,
The Times of India, Marketfeed, The Bridge, PIB, Prothomalo, Nativeplanet,
Jagran, The Better India, PIB archives, Khan Academy, NPTEL, Wikipedia, Coursera

Table 2: Machine readable sources in Samanantar.
such news Web sites to be good sources of par-
allel sentences.
We also identified some sources from the
education domain (NPTEL5, Coursera6, Khan
Academy7) and some science Youtube channels
that provide educational videos with parallel hu-
man translated subtitles in different Indic languages.
We use the following steps to extract parallel
sentences from the above sources:
Article Extraction. For every news Web site,
we build custom extractors using BeautifulSoup8
or Selenium9 to extract the main article content.
For NPTEL, Youtube science channels, and Khan
Academy, we use youtube-dl10 to collect Indic
and English subtitles for every video. We skip
the auto-generated youtube captions to ensure
that we only get high quality translations. We
collected subtitles for all available courses/videos
on March 7, 2021. For Coursera, we identify
courses which have manually created Indic and
English subtitles and then use coursera-dl11 to
extract these subtitles.
Tokenization. We split the main content of
the articles into sentences using the Indic NLP
Library12 (Kunchukuttan, 2020), with a few addi-
tional heuristics to account for Indic punctuation
characters, sentence delimiters and non-breaking
prefixes.
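The kind of delimiter-and-prefix heuristic described above can be sketched as follows. This is a minimal illustration, not the Indic NLP Library's actual implementation; the delimiter set (including the Devanagari danda) and the non-breaking prefixes are assumptions for the example:

```python
import re

# Illustrative sentence delimiters: '.', '?', '!' and the Indic danda (U+0964).
_DELIMS = r"[.?!\u0964]"
# Illustrative non-breaking prefixes (abbreviations ending with a period).
_NO_BREAK = {"Dr", "Mr", "Mrs", "Shri", "No"}

def split_sentences(text: str) -> list[str]:
    """Split `text` at delimiters, except when the delimiter follows a
    known non-breaking prefix such as 'Dr.'."""
    sentences, start = [], 0
    for m in re.finditer(_DELIMS, text):
        words = text[start:m.start()].split()
        if words and words[-1] in _NO_BREAK:
            continue  # e.g., the '.' in 'Dr.' does not end a sentence
        sent = text[start:m.end()].strip()
        if sent:
            sentences.append(sent)
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

In practice the Indic NLP Library handles many more language-specific delimiters and prefixes than this toy list.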
Parallel Sentence Extraction. At the end of the
above step, we have sentence tokenized articles in
English and a target language (say, Hindi). Fur-
ther, all these news Web sites contain metadata
5https://nptel.ac.in.
6https://www.coursera.org/.
7https://www.khanacademy.org.
8https://www.crummy.com/software/BeautifulSoup.
9https://pypi.org/project/selenium.
10https://github.com/tpikonen/youtube-dl.
11https://github.com/coursera-dl/coursera-dl.
12https://github.com/anoopkunchukuttan
/indic nlp library.
based on which we can cluster the articles accord-
ing to the month in which they were published
(say, January 2021). We assume that to find a
match for a given Hindi sentence we only need
to consider all English sentences which belong to
articles published in the same month as the article
containing the Hindi sentence. This is a reasonable
assumption as content of news articles is tempo-
ral in nature. Note that such clustering based on
dates is not required for the education sources as
there we can find matching sentences in bilingual
captions belonging to the same video.
Let S = {s1, s2, . . . , sm} be the set of all
sentences across all English articles in a par-
ticular month (or in the English caption file
corresponding to a given video). Similarly, let
T = {t1, t2, . . . , tn} be the set of all sentences
across all Hindi articles in that same month (O
in the Hindi caption file corresponding to the
same video). Let f(s, t) be a scoring function
which assigns a score indicating how likely it is
that s ∈ S, t ∈ T form a translation pair. For a
given Hindi sentence ti ∈ T , the matching English
sentence can be found as:
s∗ = arg max_{s ∈ S} f(s, t_i)
We chose f to be the cosine similarity function
on embeddings of s and t. We compute these em-
beddings using LaBSE (Feng et al., 2020), which is
a state-of-the-art multilingual sentence embedding
model that encodes text from different languages
into a shared embedding space. We refer to the
cosine similarity between the LaBSE embeddings
of s, t as the LaBSE Alignment Score (LAS).
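The matching step above reduces to a cosine-similarity argmax over candidate embeddings. A NumPy sketch, taking the LaBSE embeddings as already computed (the function name is illustrative):

```python
import numpy as np

def best_match(tgt_emb: np.ndarray, src_embs: np.ndarray) -> tuple[int, float]:
    """Return the index and LAS of the candidate English sentence whose
    embedding is most cosine-similar to a given Indic sentence embedding.
    `src_embs` has one row per candidate; all vectors are assumed to come
    from the same embedding model (LaBSE in the paper)."""
    # Normalize so that a dot product equals cosine similarity.
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb)
    scores = src @ tgt            # cosine similarity f(s, t_i) for every s
    i = int(np.argmax(scores))    # s* = arg max_{s in S} f(s, t_i)
    return i, float(scores[i])
```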
Post Processing. Using the above described
process, we find the top matching English sen-
tence, s∗, for every Hindi sentence, ti. We now
apply a threshold and select only those pairs for
which the cosine similarity is greater than a thresh-
old t. Across different sources, we found 0.75 to
be a good threshold. We refer to this as the
LAS threshold. Next, we remove duplicates in the
data. We consider two pairs (si, ti) and (sj, tj)
to be duplicate if si = sj and ti = tj. We also
remove any sentence pair where the English sen-
tence is less than 4 words. Lastly, we use a lan-
guage identifier13 and eliminate pairs where the
language identified for si or ti does not match the
intended language.
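Taken together, the post-processing described here is a small filter pipeline. A sketch (the `lang_id` callable is a stand-in for the polyglot-based language identifier used in the paper):

```python
def postprocess(scored_pairs, threshold=0.75, lang_id=None):
    """Filter (english, indic, LAS) candidates: apply the LAS threshold,
    drop exact duplicates, drop pairs whose English side has fewer than
    4 words, and drop pairs that fail language identification."""
    seen, kept = set(), []
    for en, xx, las in scored_pairs:
        if las <= threshold:
            continue                          # below the LAS threshold
        if (en, xx) in seen:
            continue                          # duplicate: s_i = s_j and t_i = t_j
        if len(en.split()) < 4:
            continue                          # English side shorter than 4 words
        if lang_id is not None and not lang_id(en, xx):
            continue                          # language mismatch
        seen.add((en, xx))
        kept.append((en, xx, las))
    return kept
```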
2.3 Mining Parallel Sentences from
Non-Machine Readable Comparable
Corpora
While Web sources are machine readable, there
are official documents that are generated which are
not always machine readable. For example, pro-
ceedings of the legislative assemblies of different
Indian states in English as well as the official
language of the state are published as PDFs. In
this work, we considered 3 such public sources:
(a) documents from the Tamil Nadu government14
(en-ta), (b) speeches from the Bangladesh Parlia-
ment15 and West Bengal Legislative Assembly16
(en-bn), and (c) speeches from the Andhra Pradesh17
and Telangana Legislative Assemblies18 (en-te).
Most of these documents either contained scanned
images of the original document or contained
proprietary encodings (non-UTF8) due to legacy
issues. Consequently, standard PDF parsers cannot
be used to extract text from them. We use the fol-
lowing pipeline for extracting parallel sentences
from such sources.
Optical Character Recognition (OCR). We
used Google's Vision API, which supports English
as well as the 11 Indic languages considered, A
extract text from each document.
Tokenization. We use the same tokenization
process as described in the previous section on
the extracted text with extra heuristics to merge
an incomplete sentence at the bottom of one page
with an incomplete sentence at the top of the
next page.
Parallel Sentence Extraction. Unlike the pre-
vious section, we have exact information about
which documents are parallel. This information
13https://github.com/aboSamoor/polyglot.
14https://www.tn.gov.in/documents/deptname.
15http://www.parliament.gov.bd.
16http://www.wbassembly.gov.in.
17https://www.aplegislature.org, https://
www.apfinance.gov.in.
18https://finance.telangana.gov.in.
is typically encoded in the URL of the document
itself (per esempio., https://tn.gov.in/en/budget
.pdf and https://tn.gov.in/ta/budget
.pdf). Hence, for a given Tamil sentence, ti
we only need to consider the sentences S =
{s1, s2, . . . , sm} which appear in the correspond-
ing English article. For a given ti, we identify the
matching sentence, s∗, from the candidate set S,
using LAS as described in Section 2.2.
Post-Processing. We use the same post-
processing as described in Section 2.2.
2.4 Mining Parallel Sentences from Web
Scale Monolingual Corpora
Recent work (Schwenk et al., 2019b; Feng et al.,
2020) has shown that it is possible to align
parallel sentences in large monolingual corpora
(e.g., CommonCrawl) by computing the similarity
between them in a shared multilingual embed-
ding space. In this work, we consider IndicCorp
(Kakwani et al., 2020), the largest collection of
monolingual corpora for Indic languages (ranging
from 1.39M sentences for Assamese to 100.6M
sentences for English). The idea is to take an Indic
sentence and find its matching En sentence from a
large collection of En sentences. To perform this
search efficiently, we use FAISS (Johnson et al.,
2017a), which does efficient indexing, clustering,
semantic matching, and retrieval of dense vectors
as explained below.
Indexing. We compute the sentence embedding
using LaBSE for all English sentences in In-
dicCorp. We create a FAISS index where these
embeddings are stored in 100k clusters. We use
Product Quantization (Jégou et al., 2011) to re-
duce the space required to store these embeddings
by quantizing the 768-dimensional LaBSE em-
bedding into an m-dimensional vector (m = 64)
where each dimension is represented using an
8-bit integer value.
Retrieval. For every Indic sentence (say, a Hindi
sentence) we first compute the LaBSE embedding
and then query the FAISS index for its nearest
neighbor based on normalized inner product (i.e.,
cosine similarity). FAISS first finds the top-p clus-
ters by computing the distance between each of the
cluster centroids and the given Hindi sentence. We
set the value of p to 1024. Within each of these
clusters, FAISS searches for the nearest neigh-
bors. This retrieval is highly optimized to scale.
In our implementation, on average we were able
to perform 1100 nearest neighbourhood searches
per second on the index containing 100.6M
En sentences.
Recomputing Cosine Similarity. Note that
FAISS computes cosine similarity on the quan-
tized vectors (of dimension m = 64). We found
that while the relative ranking produced by FAISS
is good, the similarity scores on the quantized
vectors vary widely and do not accurately capture
the cosine similarity between the original 768d
LaBSE embeddings. Hence, it is difficult to choose
an appropriate threshold on the similarity of the
quantized vector. However, the relative ranking
provided by FAISS is still good. For example, for
all 100 query Hindi sentences that we analyzed,
FAISS retrieved the correct matching English sen-
tence from an index of 100.6 M sentences at the
top-1 position. Based on this observation, we fol-
low a two-step approach: First, we retrieve the
top-1 matching sentence from FAISS using the
quantized vector. Then, we compute the LAS be-
tween the full LaBSE embeddings of the retrieved
sentence pair. On the computed LAS, we apply
a LAS threshold of 0.80 (slightly higher than the
one used for comparable sources described ear-
lier) for filtering. This modified FAISS mining,
combining quantized vectors for efficient search-
ing and full embeddings from LaBSE for accurate
thresholding, was crucial for mining a large num-
ber of parallel sentences.
Post-processing. We follow the same post-
processing steps as described in Section 2.2.
We also used the above process to extract par-
allel sentences from Wikipedia by treating it as a
collection of monolingual sentences in different
languages. We were able to mine more parallel
sentences using this approach as opposed to using
Wikipedia’s interlanguage links for article alignment
followed by inter-article parallel sentence mining.
Note that we chose this LaBSE based align-
ment method over existing methods like Vecalign
(Thompson and Koehn, 2019) and Bleualign
(Sennrich and Volk, 2011) as these methods as-
sume/require parallel documents. However, for
IndicCorp, such a parallel alignment of docu-
ments is not available and may not even exist.
Further, LaBSE is trained on 17 billion mono-
lingual sentences and 6 billion bilingual sentence
pairs from 109 languages, including all the
11 Indic languages considered in this work. The
authors have shown that it produces state of the
art results on multiple parallel text retrieval tasks
and is effective even for low-resource languages.
Given these advantages of LaBSE embeddings
and to have a uniform scoring mechanism (i.e.,
LAS) across sources, we use the same LaBSE
based mechanism for mining parallel sentences
from all the sources that we considered.
2.5 Mining Inter-Indic Language Corpora
So far, we have discussed mining parallel corpora
between English and Indic languages. Following
Freitag and Firat (2020) and Rios et al. (2020),
we now use English as a pivot to mine par-
allel sentences between Indic languages from
all the English-centric corpora described ear-
lier in this section. Most of the sources that
we crawled data from for creating Samanantar
were English-centric, questo è, they contain data
in English and one or more Indian languages.
Hence we chose English as the pivot language.
For example, let (s_en, t_hi) and (ŝ_en, t_ta) be mined
parallel sentences between en-hi and en-ta, respec-
tively. If s_en = ŝ_en then we extract (t_hi, t_ta) as a
Hindi-Tamil parallel sentence pair. Further, we
use a very strict de-duplication criterion to avoid
the creation of very similar parallel sentences. For
example, if an en sentence is aligned to m hi
sentences and n ta sentences, then we would get
mn hi-ta pairs. We retain only 1 randomly chosen
pair out of these mn pairs, since these mn pairs
are likely to be similar. We mined 83.4M parallel
sentences between the C(11,2) = 55 Indic language pairs,
resulting in a 5.33× increase in publicly avail-
able sentence pairs between these languages (see
Table 3).
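The pivoting step with the strict de-duplication rule can be sketched as follows (an illustrative helper; one random Indic-Indic pair is kept per shared English pivot sentence):

```python
import random
from collections import defaultdict

def pivot_pairs(en_hi, en_ta, seed=0):
    """Given mined (en, hi) and (en, ta) sentence pairs, produce (hi, ta)
    pairs by pivoting on identical English sentences. If one English
    sentence aligns to m Hindi and n Tamil sentences, only 1 of the
    m*n candidate pairs is retained, chosen at random."""
    hi_by_en, ta_by_en = defaultdict(list), defaultdict(list)
    for en, hi in en_hi:
        hi_by_en[en].append(hi)
    for en, ta in en_ta:
        ta_by_en[en].append(ta)
    rng = random.Random(seed)
    out = []
    for en in hi_by_en.keys() & ta_by_en.keys():   # shared pivot sentences only
        out.append((rng.choice(hi_by_en[en]), rng.choice(ta_by_en[en])))
    return out
```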
3 Analysis of the Quality of the Mined
Parallel Corpus
We now describe the intrinsic evaluation of the
data that we mined as a part of this work us-
ing the methods described in Sections 2.2, 2.3,
and 2.4. This evaluation was performed by ask-
ing human annotators to estimate cross-lingual
Semantic Textual Similarity (STS) of the mined
parallel sentences.
3.1 Annotation Task and Setup
We sampled 9,566 sentence pairs (English and In-
dic) from the mined data across 11 Indic languages
      as   bn     gu     hi     kn     ml     mr     or   pa     ta     te     Total
as    −    356    142    162    193    227    162    70   108    214    206    1,839
bn         −      1,576  2,627  2,137  2,876  1,847  592  1,126  2,432  2,350  17,920
gu                −      2,465  2,053  2,349  1,757  529  1,135  2,054  2,302  16,361
hi                       −      2,148  2,747  2,086  659  1,637  2,501  2,434  19,466
kn                              −      2,869  1,819  533  1,123  2,498  2,796  18,168
ml                                     −      1,827  558  1,122  2,584  2,671  19,829
mr                                            −      581  1,076  2,113  2,225  15,493
or                                                   −    507    1,076  1,114  6,218
pa                                                        −      1,747  1,756  11,336
ta                                                               −      2,599  19,816
te                                                                      −      20,453

Table 3: The number of parallel sentences (in thousands) between Indic
language pairs. The 'Total' column indicates the aggregate parallel corpus for
the language in a row with other Indic languages.
and several sources. The sampling was stratified
to have equal number of sentences from three sets:
• Definite accept: sentence pairs with LAS
exceeding the chosen threshold by more than 0.1.
• Marginal accept: sentence pairs with LAS
larger than but within 0.1 of the chosen
threshold.
• Reject: sentence pairs with LAS smaller than
but within 0.1 of the chosen threshold.
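Under these definitions, assigning a mined pair to a sampling stratum is a simple function of its LAS and the per-source threshold (a sketch; the 0.1 band is as stated above, the function name is illustrative):

```python
def stratum(las: float, threshold: float, band: float = 0.1) -> str:
    """Classify a sentence pair for annotation sampling:
    'definite accept' if LAS exceeds the threshold by more than `band`,
    'marginal accept' if above the threshold but within `band` of it,
    'reject' if below the threshold but within `band` of it,
    and 'not sampled' otherwise."""
    if las > threshold + band:
        return "definite accept"
    if las > threshold:
        return "marginal accept"
    if las > threshold - band:
        return "reject"
    return "not sampled"
```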
The sampled sentences were shuffled randomly
such that no ordering is preserved across sources
or LAS. We then divided the language-wise sen-
tence pairs into annotation batches of 30 parallel
sentences each.
For defining the annotation scores, we refer to
the SemEval-2016 Task 1 (Agirre et al., 2016),
wherein crosslingual semantic textual similarity is
characterized by six ordinal levels ranging from
complete semantic equivalence (5) to complete
semantic dissimilarity (0). These guidelines were
explained to 38 annotators across 11 Indic lan-
guages, with a minimum of 2 annotators per
language. Each annotator is a native speaker of
the language assigned and is also fluent in En-
glish. The annotators have experience of 1 to 20
years in working on language tasks, with a mean
of 5 years. The annotation task was performed on
Google forms: Each form consisted of 30 sentence
pairs from an annotation batch. Annotators were
shown one sentence pair at a time and were asked
to score it in the range of 0 to 5. The SemEval-2016
guidelines were visible to annotators at all times.
After annotating 30 parallel sentences, the anno-
tators submitted the form and resumed again with
a new form. Annotators were compensated at the
rate of Rs 100 to Rs 150 (1.38 to 2.06 USD) per
100 words read.
3.2 Annotation Results and Discussion
The results of the annotation of the 9,566 sentence
pairs and almost 30,000 annotations are shown
language-wise in Table 4. Over 85% of the sen-
tence pairs are such that annotators agree within
a semantic similarity score of 1 of each other.
We make the following key observations from
the data.
Sentence Pairs Included in Samanantar Have
High Semantic Similarity. Overall, the ‘All ac-
cept’ sentence pairs received a mean STS score
of 4.27 and a median of 5. On a scale of 0 to
5, where 5 represents perfect semantic similar-
ity, these statistics indicate that annotators rated
sentence pairs that are included in Samanantar to
be of high quality. Moreover, the chosen LAS
thresholds sensitively regulate quality: the ‘Def-
inite accept’ sentence pairs have a high average
STS score of 4.63, which reduces to 3.89 with
‘Marginal accept’, and significantly falls to 2.94
with the ‘Reject’ sets.
LaBSE Alignment and Annotator Scores are
Moderately Correlated. The Spearman corre-
lation coefficient between LAS and STS is a
Language    #Bitext  #Annota-  --- Semantic Textual Similarity score ---   - Spearman correlation coefficient -
            pairs    tions     All acc.  Def. acc.  Marg. acc.  Reject     LAS,STS  LAS,sent.len  STS,sent.len
Assamese      689     1,972     3.52      3.86       3.11        2.18       0.25     −0.39          0.19
Bengali       957     3,797     4.59      4.86       4.31        3.53       0.45     −0.43         −0.16
Gujarati      779     2,298     4.08      4.54       3.59        2.67       0.49     −0.31         −0.08
Hindi       1,276     4,616     4.50      4.84       4.14        3.15       0.48     −0.18         −0.12
Kannada       957     2,838     4.20      4.61       3.78        2.81       0.39     −0.38         −0.09
Malayalam     948     2,760     4.00      4.46       3.55        2.45       0.40     −0.33          0.03
Marathi       779     1,984     4.07      4.52       3.54        2.67       0.40     −0.36         −0.04
Odia          500     1,264     4.49      4.63       4.34        4.33       0.15     −0.42         −0.05
Punjabi       688     2,222     4.23      4.67       3.74        2.32       0.43     −0.25          0.06
Tamil       1,044     2,882     4.29      4.62       3.95        2.57       0.35     −0.40         −0.14
Telugu        949     2,516     4.62      4.87       4.34        3.62       0.36     −0.40         −0.09
Overall     9,566    29,149     4.27      4.63       3.89        2.94       0.37     −0.35         −0.04

Table 4: Results of the annotation task to evaluate the semantic similarity between sentence pairs
across 11 languages. Human judgments confirm that the mined sentences (All accept) have high
semantic similarity, with a moderately high correlation between the human judgments and LAS.
moderately positive value of 0.37, that is, sentence
pairs with a higher LAS are more likely to be
rated as semantically similar. However, the
correlation coefficient is not very high
(say, > 0.5), indicating potential for further
improvement in learning multilingual representations
with LaBSE-like models. Further, the two
languages with the smallest correlations (as
and or) also have the smallest resource sizes,
indicating potential for improvement in alignment
methods for low-resource languages.
LaBSE Alignment is Negatively Correlated
with Sentence Length, while Annotator Scores
Are Not. To be consistent across languages,
sentence length is computed for the English sen-
tence in each pair. We find that sentence length is
negatively correlated with LAS, with a Spearman
correlation coefficient of −0.35, while it is almost
uncorrelated with STS, with a Spearman correlation
coefficient of −0.04. In other words, pairs
with longer sentences are less likely to have high
alignment on LaBSE representations.
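The Spearman coefficients reported above can be computed from ranks alone; the following is a minimal, stdlib-only sketch of that computation (not the paper's actual analysis code), with average ranks for ties:

```python
# Spearman's rho is the Pearson correlation of the ranks of the two
# variables. Tied values receive the average of the ranks they span.
def _ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Perfectly monotone pairs give rho = 1; reversed pairs give rho = -1.
assert abs(spearman([1, 2, 3, 4], [10, 20, 30, 40]) - 1.0) < 1e-9
assert abs(spearman([1, 2, 3, 4], [40, 30, 20, 10]) + 1.0) < 1e-9
```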
Error Analysis of Mined Corpora. For error
analysis, we considered a sentence pair accurate
if it had (a) LAS greater than
the threshold, that is, in either the Marginal accept
or Definite accept bucket, and (b) a human annotation score
greater than or equal to 4. We found that the extraction
accuracy is 79.5% overall, while the extraction
accuracy for the Definite accept bucket is 90.1%.
This shows that LAS-based mining and
filtering can yield high-quality parallel corpora
with high accuracy. In Table 5 we call out different
styles of errors for each of the three buckets. In
the Marginally Reject (MR) bucket, we find cases
where the English and Indic sentences
differ in meaning and cannot be treated
as parallel sentences at all. In the Marginally
Accept (MA) and Definitely Accept (DA) buckets,
we find more minor errors, for instance differences
in quantities/numbers and mistaken alignment of
specific words, such as ‘Quarter finals’ (in English) being
aligned to ‘Semi finals’ (in the Indic languages).
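The accuracy criterion above can be sketched as a small function; the LAS threshold below is illustrative, not the paper's exact value:

```python
# Sketch of the error-analysis criterion: a mined pair counts as accurate
# if its LAS clears the mining threshold AND annotators rated it >= 4.
# The threshold value 0.75 is a hypothetical placeholder.
def extraction_accuracy(pairs, las_threshold=0.75):
    """pairs: iterable of (las, sts) tuples for annotated sentence pairs."""
    mined = [(las, sts) for las, sts in pairs if las >= las_threshold]
    if not mined:
        return 0.0
    accurate = sum(1 for _, sts in mined if sts >= 4)
    return accurate / len(mined)

# Of the three pairs above the threshold, two have STS >= 4.
sample = [(0.9, 5), (0.8, 3), (0.7, 5), (0.85, 4)]
assert extraction_accuracy(sample) == 2 / 3
```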
In summary, the annotation task established
that the parallel sentences in Samanantar are of
high quality and validated the chosen thresholds.
The task also established that LaBSE-based align-
ment should be further improved for low-resource
languages (like as, or) and for longer sentences.
We will release the human judgments on these
9,566 sentence pairs as a dataset for evaluating
cross-lingual semantic similarity between
English and Indic languages.
4 IndicTrans: Multilingual, Single Indic
Script Models
The languages in the Indian subcontinent exhibit
many lexical and syntactic similarities on account
of genetic and contact relatedness (Abbi, 2012;
Subbārāo, 2012). Genetic relatedness manifests
in the two major language groups considered
in this work:
the Indo-Aryan branch of the
Indo-European family and the Dravidian fam-
ily. Owing to the long history of contact between
these language groups, the Indian subcontinent is
a linguistic area (Emeneau, 1956) exhibiting con-
vergence of many linguistic properties between
| English Sentence | LAS | Bucket | Error |
|---|---|---|---|
| So, what are their strengths? | 0.70 | MR | Should be ‘‘So, what are these containers?’’ |
| Johor, also spelled as Johore, is a state of Malaysia in the south of the Malay Peninsula. | 0.70 | MR | Should be ‘‘Located in the north-western part of the state of Johor, the district of this city is also known as Muwar.’’ |
| The flights between Hyderabad and Gorakhpur will begin from 30 April | 0.68 | MR | Should be ‘‘This Hyderabad-Dubai tour will start from Shamshabad airport in Hyderabad’’ |
| the province is divided into ten districts. | 0.81 | MA | Should be ‘‘six districts’’ |
| The Sensex was trading with gains of 150 points, while the Nifty rose 52 points in trade. | 0.76 | MA | Mistake with numbers |
| Parupalli Kashyap advanced to Korea Open semifinals. | 0.88 | DA | Semifinals became quarter finals |
| Inflammation of the interior portion of the eye, known as uveitis, can cause blurred vision and eye pain, especially when exposed to light (photophobia). | 0.89 | DA | It should be ‘‘when exposed to high amounts of light’’ |

[The Indic-language sentence column of the original table did not survive extraction.]
Table 5: Examples of the various errors for different classes in LaBSE-based alignment.
languages of these groups. Hence, we explore
multilingual models spanning all these Indic lan-
guages to enable transfer from high resource to
low resource languages on account of genetic re-
latedness (Nguyen and Chiang, 2017) or contact
relatedness (Goyal et al., 2020). We trained two
types of multilingual models for translation involving
Indic languages: (i) One-to-Many for
English to Indic language translation (O2M: 11
pairs) and (ii) Many-to-One for Indic language to
English translation (M2O: 11 pairs).
Data Representation. We made a design choice
to represent all the Indic language data in a single
script (using the Indic NLP Library). The scripts
for these Indic languages are all derived from
the ancient Brahmi script. Though each of these
scripts has its own Unicode codepoint range, it
is possible to get a 1:1 mapping between characters
in these different scripts, since the Unicode standard
takes into account the similarities between
them. Hence, we convert all the Indic data
to the Devanagari script. This allows better lexical
sharing between languages for transfer learning,
prevents fragmentation of the subword vocabu-
lary between Indic languages and allows using a
smaller subword vocabulary.
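The single-script idea rests on the parallel layout of the Brahmi-derived Unicode blocks: corresponding characters sit at the same offset within each script's block, so conversion is a fixed codepoint shift. The paper uses the Indic NLP Library for this; the sketch below is a simplification (real converters special-case a few characters), with block starts taken from the Unicode standard:

```python
# Start of each Brahmi-derived script's Unicode block. Devanagari occupies
# U+0900..U+097F; sibling scripts occupy parallel 0x80-wide blocks.
SCRIPT_BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}

def to_devanagari(text: str, source_script: str) -> str:
    """Shift characters from the source script's block into Devanagari."""
    start = SCRIPT_BLOCK_START[source_script]
    offset = start - SCRIPT_BLOCK_START["devanagari"]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + 0x80:
            out.append(chr(cp - offset))
        else:
            out.append(ch)  # leave punctuation, digits, spaces untouched
    return "".join(out)

def from_devanagari(text: str, target_script: str) -> str:
    """Inverse shift, used to restore the target script after decoding."""
    offset = SCRIPT_BLOCK_START[target_script] - SCRIPT_BLOCK_START["devanagari"]
    return "".join(
        chr(ord(ch) + offset) if 0x0900 <= ord(ch) < 0x0980 else ch
        for ch in text
    )

# Bengali "bharat" maps into Devanagari and back losslessly.
bn = "\u09AD\u09BE\u09B0\u09A4"
assert to_devanagari(bn, "bengali") == "\u092D\u093E\u0930\u0924"
assert from_devanagari(to_devanagari(bn, "bengali"), "bengali") == bn
```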
The first token of the source sentence is a
special
token indicating the source language
(Tan et al., 2019; Tang et al., 2020). The model
can make a decision on the transfer learning be-
tween these languages based on both the source
language tag and the similarity of representations.
When multiple target languages are involved, we
follow the standard approach of using a special
token in the input sequence to indicate the target
language (Johnson et al., 2017b). Other standard
pre-processing done on the data are Unicode nor-
malization and tokenization. When the target lan-
guage is Indic, the output in Devanagari script is
converted back to the corresponding Indic script.
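The tagging scheme described above can be sketched as follows; the token names are illustrative, not the exact ones used by IndicTrans:

```python
# Hypothetical sketch of language tagging for multilingual NMT: prepend a
# source-language token, and (when multiple target languages are involved)
# a target-language token, to the subword sequence fed to the model.
def tag_example(src_tokens, src_lang, tgt_lang=None):
    tags = [f"__src__{src_lang}__"]
    if tgt_lang is not None:
        tags.append(f"__tgt__{tgt_lang}__")
    return tags + src_tokens

# A Hindi-to-English example carries both a source and a target tag.
assert tag_example(["a", "b"], "hi", "en") == ["__src__hi__", "__tgt__en__", "a", "b"]
# A many-to-one model needs only the source tag.
assert tag_example(["a", "b"], "ta") == ["__src__ta__", "a", "b"]
```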
Training Data. We use all the Samanantar par-
allel data between English and Indic languages and
remove overlaps with any test or validation data
using a very strict criterion. For the purpose of overlap
identification only, we work with lower-cased
data with all punctuation characters removed. We
remove any translation pair, (en, t), from the training
data if (i) the English sentence en appears in
the validation/test data of any En-X language pair
O (ii) the Indic sentence t appears in the valida-
tion/test data of the corresponding En-X language
pair. Note that, since we train a joint model it
is important to ensure that no en sentence in the
test/validation data appears in any of the En-X
training sets. For instance, if there is an en sen-
tence in the En-Hi validation/test data then any
pair containing this sentence should not be in any
of the En-X training sets. We do not use any data
sampling while training and leave the exploration
of these strategies for future work (Arivazhagan
et al., 2019).
Validation Data. We used all the validation data
from the benchmarks described in Section 5.1.
Vocabulary. We learn separate vocabularies for
English and Indic languages from English-centric
training data using 32K BPE merge operations
each using subword-nmt (Sennrich et al., 2016b).
Network and Training. We use fairseq (Ott
et al., 2019) for training transformer-based mod-
els. We use 6 encoder and decoder layers, input
embeddings of size 1536 con 16 attention heads
and feedforward dimension of 4096. We opti-
mized the cross entropy loss using the Adam
optimizer with a label-smoothing of 0.1 and gra-
dient clipping of 1.0. We use mixed precision
training with Nvidia Apex.19 We use an initial
learning rate of 5e-4, 4000 warmup steps, and the
learning rate annealing schedule as proposed in
Vaswani et al. (2017). We use a global batch size
of 64k tokens. We train each model on 8 V100
GPUs and use early stopping with the patience of
5 epochs.
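The learning-rate schedule referenced above (Vaswani et al., 2017, in its fairseq "inverse_sqrt" form) warms up linearly and then decays with the inverse square root of the step; a small sketch with the hyperparameters stated in the text:

```python
# Linear warmup to the peak learning rate over `warmup` steps, then
# inverse-square-root decay, matching the schedule described in the text
# (peak lr 5e-4, 4000 warmup steps).
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup=4000):
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup ** 0.5) * (step ** -0.5)

assert abs(inverse_sqrt_lr(2000) - 2.5e-4) < 1e-12   # halfway through warmup
assert abs(inverse_sqrt_lr(4000) - 5e-4) < 1e-12     # peak at end of warmup
assert abs(inverse_sqrt_lr(16000) - 2.5e-4) < 1e-12  # decays as 1/sqrt(step)
```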
Decoding. We use beam search with a beam
size of 5 and length penalty set to 1.
5 Experimental Setup
We evaluate the usefulness of Samanantar by
comparing the performance of a translation system
trained using it with existing state of the art models
on a wide variety of benchmarks.
5.1 Benchmarks
We use the following publicly available bench-
marks for evaluating all the models: WAT2020
Indic task (Nakazawa et al., 2020), WAT2021
MultiIndicMT task,20 WMT test sets (Bojar et al.,
2014; Barrault et al., 2019; Barrault et al., 2020),
UFAL Entam (Ramasamy et al., 2012), and the
recently released FLORES test set (Goyal et al.,
2021). We also create a testset consisting of 1000
validation and 2000 test samples for the en-as pair
from PMIndia corpus (Haddow and Kirefu, 2020).
5.2 Evaluation Metrics
We use BLEU scores for evaluating the mod-
els. To ensure consistency and reproducibility
across the models, we provide SacreBLEU sig-
natures in the footnote for Indic-English21 and
English-Indic22 evaluations. For Indic-English,
we use the in-built, default mteval-v13a tok-
enizer. For En-Indic, since SacreBLEU tokenizer
does not support Indic languages,23 we first tok-
enize using the IndicNLP tokenizer before running
SacreBLEU. The evaluation script will be made
available for reproducibility.
19https://github.com/NVIDIA/apex.
20https://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2021
/index.html.
21BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.5.1.
22BLEU+case.mixed+numrefs.1+smooth.exp+tok.none+version.1.5.1.
23We plan to submit a pull request in sacrebleu for indic
tokenizers.
5.3 Models
We compare the following models:
Commercial MT Systems. We use the trans-
lation APIs provided by Google Cloud Platform
(v2) (GOOG) and Microsoft Azure Cognitive Ser-
vices (v3) (MSFT) to translate all the sentences in
the test set of the benchmarks described above.
Publicly Available MT Systems. We consider
the following publicly available NMT systems:
OPUS-MT24 (OPUS): These models were
trained using all parallel sources available from
OPUS as described in Section 2.1. We refer the
readers to Tiedemann and Thottingal (2020b) for
further details about the training data.
mBART5025 (mBART): This is a multilingual
many-to-many model which can translate be-
tween any pair of 50 languages. This model is
first pre-trained on large amounts of monolingual
data from all the 50 languages and then jointly
fine-tuned using parallel data between multiple
language pairs. We refer the readers to the original
paper for details of the monolingual pre-training
data and the bilingual fine-tuning data (Tang
et al., 2020).
Models Trained on All Existing Parallel Data.
To evaluate the usefulness of the parallel sentences
in Samanantar, we train a few well studied models
using all parallel data available prior to this work.
Transformer(TF): We train one transformer model
each for every en-Indic language pair and one for
every Indic-en language pair (22 models in all).
We follow the TransformerBASE model de-
scribed in Vaswani et al. (2017). We use byte pair
encoding (BPE) with a vocabulary size of ≈32K
for every language. We use the same learning rate
schedule as proposed in Vaswani et al. (2017). We
train each model on 8 V100 GPUs and use early
stopping with the patience set to 5 epochs.
mT5 (mT5): We finetune the pre-trained mT5BASE
model (Xue et al., 2021) for the translation task
using all existing sources of parallel data. We
finetune one model for every language pair of
interesse (18 pairs). We train each model on 1 v3
TPU and use early stopping with a patience of
25K steps.
24https://huggingface.co/Helsinki-NLP.
25https://huggingface.co/transformers/model_doc/mbart.html.
[The body of Table 6 was flattened beyond recovery in extraction and is omitted. The table compares GOOG, MSFT, CVIT, OPUS, mBART, TF, mT5, and IT in both the x-en and en-x directions on the WAT2021 (bn, gu, hi, kn, ml, mr, or, pa, ta, te), WAT2020 (bn, gu, hi, ml, mr, ta, te), WMT (hi, gu, ta), UFAL (ta), and PMI (as) benchmarks.]
Table 6: BLEU scores for En-X and X-En translation across different available testsets. Δ represents
the difference between IndicTrans and the best results from the other models. We bold the best public
model and underline the overall best model.
Models Trained Using Samanantar (IT26).
We train the proposed IndicTrans model from
scratch using the entire Samanantar corpus.
For all the models trained/finetuned as a part of
this work, we ensured that there is no overlap be-
tween the training set and the test/validation sets.
6 Results and Discussion
The results of our experiments on Indic-En and
En-Indic translation are reported in Table 6 and
Table 7. Below, we list the main observations
from our experiments.
Compilation of Existing Resources was a
Fruitful Exercise. We observe that current
state-of-the-art models trained on all existing par-
allel data (curated as a subset of Samanantar)
perform competitively with other models.
IndicTrans Trained on Samanantar Outper-
forms All Publicly Available Open Source
26IT is trained on the Samanantar-v0.3 corpus.
Models. From Tables 6 and 7, we observe that
IndicTrans trained on Samanantar outperforms
nearly all existing models for all the languages in
both directions. In all cases, except for the
WMT benchmarks and the UFAL en-ta benchmark,
IndicTrans trained on Samanantar improves upon
all existing systems. The absolute gain in BLEU
score is higher for the Indic-En direction as com-
pared to the En-Indic direction. This is on account
of better transfer in many to one settings compared
to one-to-many settings (Aharoni et al., 2019)
and better language model on the target side.
In particular, in Table 7, we observe that Indic-
Trans trained on Samanantar clearly outperforms
IndicTrans trained only on existing resources.
Note that the results reported in Table 7 are on the
FLORES test set, a more balanced test set in com-
parison to the other test sets in Table 6 (which are
primarily from NEWS sources and have similar
distributions as the corresponding training sets).
The good performance of our model trained on
Samanantar on the independently created FLO-
RES test set clearly demonstrates the utility of
[The body of Table 7 was flattened beyond recovery in extraction and is omitted. The table compares GOOG, MSFT, CVIT, OPUS, mBART, IT†, and IT in both the x-en and en-x directions on the FLORES devtest benchmark for as, bn, gu, hi, kn, ml, mr, or, pa, ta, and te.]
Table 7: BLEU scores for En-X and X-En translation on the FLORES devtest benchmark. IT† is
IndicTrans trained only on existing data. We bold the best public model and underline the overall
best model.
Samanantar in improving the performance of
MT models on a wide variety of domains.
IndicTrans Trained on Samanantar Outper-
forms Commercial Systems on Most Datasets.
From Table 6, we observe that IndicTrans trained
on Samanantar outperforms commercial mod-
els (GOOG and MSFT) on most benchmarks.
On the FLORES dataset our models are still a
few points behind the commercial systems. IL
higher performance of the commercial NMT sys-
tems on the FLORES dataset indicates that the
in-house training datasets for these systems better
capture the domain and data distributions of the
FLORES dataset.
Performance Gains are Higher for Low Re-
source Languages. We observe significant gains
for low resource languages such as or and kn,
especially in the Indic-En direction. These lan-
guages benefit from other related languages with
more resources due to multilingual training.
Pre-training Needs Further Investigation. mT5,
which is pre-trained on large amounts of mono-
lingual corpora from multiple languages, does not
always outperform a TransformerBASE model that
is just trained on existing parallel data without
any pre-training. While this does not invalidate
the value of pre-training, it does suggest that
pre-training needs to be optimized for the specific
languages. As future work, we would like to ex-
plore pre-training using the monolingual corpora
on Indic languages available from IndicCorp. Fur-
ther, we would like to pre-train a single script
mT5- or mBART-like model for Indic languages
and then fine-tune on MT using Samanantar.
7 Conclusion
We present Samanantar,
the largest publicly
available collection of parallel corpora for In-
dic languages. In particular, we mine 37.4M
parallel sentences by leveraging Web crawled
monolingual corpora as well as recent ad-
vances in multilingual representation learning,
approximate nearest neighbor search, and optical
character recognition. We also mine 83.4M par-
allel sentences between 55 Indic language pairs
from this English-centric corpus. We collect hu-
man judgments for 9,566 sentence pairs from
Samanantar and show that the newly mined pairs
are of high quality. Our multilingual single-script
modello, IndicTrans, trained on Samanantar out-
performs existing models on a wide variety
of benchmarks, demonstrating that our paral-
lel corpus mining approaches can contribute to
high-quality MT models for Indic languages.
To further improve the parallel corpora and
the translation quality for Indian languages, the
following areas need further exploration: (a) im-
proving LaBSE representations for low-resource
languages and longer sentences, especially ben-
efiting from human judgments, (b) optimizing
training schedules and objectives such that
they utilize data-quality information and linguistic
similarity, and (c) pre-training multilingual
models.
We hope that
the three main contributions
of this work—Samanantar, IndicTrans, and a
manually annotated dataset for cross-lingual simi-
larity—will contribute to further research on NMT
and multilingual NLP for Indic languages.
Acknowledgments
We would like to thank the TACL editors and
reviewers, who have helped us shape this pa-
per. We would like to thank EkStep Foundation
for their generous grant which went into hir-
ing human resources as well as cloud resources
needed for this work. We would like to thank
the Robert Bosch Center for Data Science and
Artificial Intelligence for supporting Sumanth
and Gowtham through their Post Baccalaureate
Fellowship Program. We would like to thank
Google for their generous grant through their TPU
Research Cloud Program. We would also like
to thank the following members from Tarento
Technologies for providing logistical and tech-
nical support: Sivaprakash Ramasamy, Amritha
Devadiga, Karthickeyan Chandrasekar, Naresh
Kumar, Dhiraj D, Vishal Mahuli, Dhanvi Desai,
Jagadeesh Lachannagari, Dhiraj Suthar, Promodh
Pinto, Sajish Sasidharan, Roshan Prakash Shah,
and Abhilash Seth.
References
Anvita Abbi. 2012. Languages of India and
India as a Linguistic Area. http://www.andamanese.net/LanguagesofIndiaandIndiaasalinguisticarea.pdf.
Retrieved November 15, 2015.
ˇZeljko Agi´c and Ivan Vuli´c. 2019. JW300: UN
wide-coverage parallel corpus for low-resource
languages. In Proceedings of the 57th Annual
Riunione dell'Associazione per il Computazionale
Linguistica, pages 3204–3210. Florence, Italy.
Associazione per la Linguistica Computazionale.
Eneko Agirre, Carmen Banea, Daniel Cer,
Mona Diab, Aitor Gonzalez-Agirre, Rada
Mihalcea, German Rigau, and Janyce Wiebe.
2016. SemEval-2016 task 1: Semantic tex-
tual similarity, monolingual and cross-lingual
evaluation. In Proceedings of the 10th Inter-
national Workshop on Semantic Evaluation
(SemEval-2016), pages 497–511, San Diego,
California. Association for Computational
Linguistics.
Antonios Anastasopoulos, Alessandro Cattelan,
Zi-Yi Dou, Marcello Federico, Christian
Federmann, Dmitriy Genzel, Francisco Guzmán,
Junjie Hu, Macduff Hughes, Philipp Koehn,
Rosie Lazar, Will Lewis, Graham Neubig,
Mengmeng Niu, Alp Öktem, Eric Paquin, Grace
Tang, and Sylwia Tur. 2020. Tico-19: The
translation initiative for COVID-19.
Naveen Arivazhagan, Ankur Bapna, Orhan
Firat, Dmitry Lepikhin, Melvin Johnson,
Maxim Krikun, Mia Xu Chen, Yuan Cao,
George Foster, Colin Cherry, Wolfgang
Macherey, Zhifeng Chen, and Yonghui Wu.
2019. Massively multilingual neural machine
translation in the wild: Findings and challenges.
Mikel Artetxe and Holger Schwenk. 2019.
Massively multilingual sentence embeddings
for zero-shot cross-lingual transfer and beyond.
Transactions of the Association for
Computational Linguistics, 7:597–610.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In 3rd
International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA,
May 7-9, 2015, Conference Track Proceedings.
Loïc Barrault, Magdalena Biesialska, Ondřej
Bojar, Marta R. Costa-jussà, Christian Federmann,
Yvette Graham, Roman Grundkiewicz, Barry
Haddow, Matthias Huck, Eric Joanis, Tom
Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola
Ljubešić, Christof Monz, Makoto Morishita,
Masaaki Nagata, Toshiaki Nakazawa, Santanu
Pal, Matt Post, and Marcos Zampieri. 2020.
Findings of the 2020 conference on machine
translation (WMT20). In Proceedings of the
Fifth Conference on Machine Translation,
pages 1–55, Online. Association for Computational
Linguistics.
Roee Aharoni, Melvin Johnson, and Orhan Firat.
2019. Massively multilingual neural machine
translation. In Proceedings of the 2019 Conference
of the North American Chapter of the Association
for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and
Short Papers), pages 3874–3884, Minneapolis,
Minnesota. Association for Computational
Linguistics.
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà,
Christian Federmann, Mark Fishel, Yvette
Graham, Barry Haddow, Matthias Huck,
Philipp Koehn, Shervin Malmasi, Christof
Monz, Mathias Müller, Santanu Pal, Matt
157
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
4
5
2
1
9
8
7
0
1
0
/
/
T
l
UN
C
_
UN
_
0
0
4
5
2
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Post, and Marcos Zampieri. 2019. Findings of
the 2019 conference on machine translation
(WMT19). In Proceedings of the Fourth Conference
on Machine Translation (Volume 2:
Shared Task Papers, Day 1), pages 1–61,
Florence, Italy. Association for Computational
Linguistics. https://doi.org/10.18653/v1/W19-5301
Yoshua Bengio, J. Louradour, Ronan Collobert,
and J. Weston. 2009. Curriculum learning. In
ICML ’09. https://doi.org/10.1145
/1553374.1553380
Ondˇrej Bojar, Christian Buck, Christian
Federmann, Barry Haddow, Philipp Koehn,
Johannes Leveling, Christof Monz, Pavel
Pecina, Matt Post, Herve Saint-Amand, Radu
Soricut, Lucia Specia, and Aleš Tamchyna.
2014. Findings of the 2014 workshop on statis-
tical machine translation. In Proceedings of the
Ninth Workshop on Statistical Machine Translation,
pages 12–58, Baltimore, Maryland, USA.
Association for Computational Linguistics.
https://doi.org/10.3115/v1/W14-3302
Isaac Caswell, Julia Kreutzer, Lisa Wang,
Ahsan Wahab, Daan van Esch, Nasanbayar
Ulzii-Orshikh, Allahsera Tapo, Nishant
Subramani, Artem Sokolov, Claytone Sikasote,
Monang Setyawan, Supheakmungkol Sarin,
Sokhar Samb, Benoît Sagot, Clara Rivera,
Annette Rios, Isabel Papadimitriou, Salomey
Osei, Pedro Javier Ortiz Suárez, Iroro Orife,
Kelechi Ogueji, Rubungo Andre Niyongabo,
Toan Q. Nguyen, Mathias Müller, André
Müller, Shamsuddeen Hassan Muhammad,
Nanda Muhammad, Ayanda Mnyakeni,
Jamshidbek Mirzakhalov, Tapiwanashe Matangira,
Colin Leong, Nze Lawson, Sneha Kudugunta,
Yacine Jernite, Mathias Jenny, Orhan Firat,
Bonaventure F. P. Dossou, Sakhile Dlamini,
Nisansa de Silva, Sakine Çabuk Ballı, Stella
Biderman, Alessia Battisti, Ahmed Baruwa,
Ankur Bapna, Pallavi Baljekar, Israel Abebe
Azime, Ayodele Awokoya, Duygu Ataman,
Orevaoghene Ahia, Oghenefego Ahia, Sweta
Agrawal, and Mofetoluwa Adeyemi. 2021.
Quality at a glance: An audit of Web-crawled
multilingual datasets. CoRR, abs/2103.12028.
Christos Christodouloupoulos and Mark
Steedman. 2015. A massively parallel corpus:
The Bible in 100 languages. Language
Resources and Evaluation, 49(2):375–395.
https://doi.org/10.1007/s10579-014-9287-y,
PubMed: 26321896
Raj Dabre, Tetsuji Nakagawa, and Hideto
Kazawa. 2017. An empirical study of language
relatedness for transfer learning in neural ma-
chine translation. In Proceedings of the 31st
Pacific Asia Conference on Language, Infor-
mation and Computation, pages 282–286. The
National University (Philippines).
Ahmed El-Kishky, Vishrav Chaudhary, Francisco
Guzmán, and Philipp Koehn. 2020. CCAligned:
A massive collection of cross-lingual web-document
pairs. In Proceedings of the 2020
Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 5960–5969,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.emnlp-main.480
Murray B. Emeneau. 1956. India as a linguistic
area. Language. https://doi.org/10.2307/410649
Fangxiaoyu Feng, Yinfei Yang, Daniel Cer,
Naveen Arivazhagan, and Wei Wang. 2020.
Language-agnostic BERT sentence embedding.
Orhan Firat, Kyunghyun Cho, and Yoshua Bengio.
2016. Multi-way, multilingual neural machine
translation with a shared attention mechanism.
In Proceedings of the 2016 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 866–875, San Diego,
California. Association for Computational
Linguistics.
Markus Freitag and Orhan Firat. 2020. Complete
multilingual neural machine translation.
In Proceedings of the Fifth Conference on
Machine Translation, pages 550–560, Online.
Association for Computational Linguistics.
Naman Goyal, Cynthia Gao, Vishrav Chaudhary,
Peng-Jen Chen, Guillaume Wenzek, Da Ju,
Sanjana Krishnan, Marc’Aurelio Ranzato,
Francisco Guzmán, and Angela Fan. 2021.
The FLORES-101 evaluation benchmark for
low-resource and multilingual machine translation.
CoRR, abs/2106.03193.
Vikrant Goyal, Anoop Kunchukuttan, Rahul
Kejriwal, Siddharth Jain, and Amit Bhagwat.
2020. Contact relatedness can help improve
158
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
4
5
2
1
9
8
7
0
1
0
/
/
T
l
UN
C
_
UN
_
0
0
4
5
2
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
multilingual NMT: Microsoft STCI-MT @
WMT20. In Proceedings of the Fifth Confer-
ence on Machine Translation, pages 202–206,
Online. Association for Computational Linguistics.
Ruiqi Guo, Philip Sun, Erik Lindgren, Quan
Geng, David Simcha, Felix Chern, and Sanjiv
Kumar. 2020. Accelerating large-scale inference
with anisotropic vector quantization. In Inter-
national Conference on Machine Learning.
Francisco Guzmán, Peng-Jen Chen, Myle
Ott, Juan Pino, Guillaume Lample, Philipp
Koehn, Vishrav Chaudhary, and Marc'Aurelio
Ranzato. 2019. The FLORES evaluation datasets
for low-resource machine translation:
Nepali–English and Sinhala–English. In Proceedings
of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 6098–6111, Hong Kong, China. Association
for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1632
Barry Haddow and Faheem Kirefu. 2020.
PMIndia – a collection of parallel corpora of
languages of India.
Tahmid Hasan, Abhik Bhattacharjee, Kazi Samin,
Masum Hasan, Madhusudan Basak, M. Sohel
Rahman, and Rifat Shahriyar. 2020. Not low-
resource anymore: Aligner ensembling, batch
filtering, and new datasets for Bengali-English
machine translation. https://doi.org/10
.18653/v1/2020.emnlp-main.207
Jeff Johnson, Matthijs Douze, and Hervé Jégou.
2017a. Billion-scale similarity search with
GPUs. arXiv preprint arXiv:1702.08734.
Melvin Johnson, Mike Schuster, Quoc V. Le,
Maxim Krikun, Yonghui Wu, Zhifeng Chen,
Nikhil Thorat, Fernanda Viégas, Martin
Wattenberg, Greg Corrado, Macduff Hughes,
and Jeffrey Dean. 2017b. Google's multilingual
neural machine translation system: Enabling
zero-shot translation. Transactions of
the Association for Computational Linguistics,
5:339–351. https://doi.org/10.1162/tacl_a_00065
Marcin Junczys-Dowmunt. 2018. Dual condi-
tional cross-entropy filtering of noisy parallel
corpora. In Proceedings of the Third Confer-
ence on Machine Translation: Shared Task
Papers, pages 888–895, Brussels, Belgium.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/W18
-6478
Hervé Jégou, Matthijs Douze, and Cordelia
Schmid. 2011. Product quantization for near-
est neighbor search. IEEE Transactions on
Pattern Analysis and Machine Intelligence,
33(1):117–128. https://doi.org/10.1109
/TPAMI.2010.57, PubMed: 21088323
Divyanshu Kakwani, Anoop Kunchukuttan,
Satish Golla, Gokul N.C., Avik Bhattacharyya,
Mitesh M. Khapra, and Pratyush Kumar. 2020.
IndicNLPSuite: Monolingual corpora, evalua-
tion benchmarks and pre-trained multilingual
language models for Indian languages. In
Findings of the Association for Computational
Linguistics: EMNLP 2020, pages 4948–4961,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.findings-emnlp.445
Tom Kocmi and Ondřej Bojar. 2018. Trivial
transfer learning for low-resource neural ma-
chine translation. In Proceedings of the Third
Conference on Machine Translation: Research
Papers, pages 244–252, Brussels, Belgium.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/W18
-6325
Philipp Koehn and Rebecca Knowles. 2017. Six
challenges for neural machine translation. In
Proceedings of the First Workshop on Neural
Machine Translation, pages 28–39, Vancouver.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/W17
-3204
Anoop Kunchukuttan. 2020. The IndicNLP Library.
https://github.com/anoopkunchukuttan
/indic_nlp_library/blob/master/docs
/indicnlp.pdf.
Anoop Kunchukuttan, Pratik Mehta, and Pushpak
Bhattacharyya. 2018. The IIT Bombay English-
Hindi parallel corpus.
Pierre Lison and Jörg Tiedemann. 2016. Open-
Subtitles2016: Extracting large parallel corpora
from movie and TV subtitles. In Proceedings
of the Tenth International Conference on Lan-
guage Resources and Evaluation (LREC'16),
pages 923–929, Portorož, Slovenia. European
Language Resources Association (ELRA).
Toshiaki Nakazawa, Hideki Nakayama, Chenchen
Ding, Raj Dabre, Shohei Higashiyama, Hideya
Mino, Isao Goto, Win Pa Pa, Anoop
Kunchukuttan, Shantipriya Parida, Ondřej
Bojar, Chenhui Chu, Akiko Eriguchi, Kaori
Abe, Yusuke Oda, and Sadao Kurohashi. 2021.
Overview of the 8th workshop on Asian trans-
lation. In Proceedings of the 8th Workshop on
Asian Translation, Bangkok, Thailand. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/2021.wat-1.1
Toshiaki Nakazawa, Hideki Nakayama, Chenchen
Ding, Raj Dabre, Shohei Higashiyama, Hideya
Mino, Isao Goto, Win Pa Pa, Anoop
Kunchukuttan, Shantipriya Parida, Ondřej
Bojar, and Sadao Kurohashi. 2020. Overview
of the 7th workshop on Asian translation.
In Proceedings of the 7th Workshop on
Asian Translation, pages 1–44, Suzhou, China.
Association for Computational Linguistics.
Toan Q. Nguyen and David Chiang. 2017.
Transfer learning across low-resource, related
languages for neural machine translation. In
International Joint Conference on Natural
Language Processing.
Pedro Javier Ortiz Suárez, Benoît Sagot, and
Laurent Romary. 2019. Asynchronous pipelines
for processing huge corpora on medium to low
resource infrastructures. In Proceedings of the
Workshop on Challenges in the Management
of Large Corpora, pages 9–16, Mannheim.
Leibniz-Institut für Deutsche Sprache.
Myle Ott, Sergey Edunov, Alexei Baevski,
Angela Fan, Sam Gross, Nathan Ng, David
Grangier, and Michael Auli. 2019. fairseq:
A fast, extensible toolkit for sequence mod-
eling. In Proceedings of NAACL-HLT 2019:
Demonstrations.
Shantipriya Parida, Satya Ranjan Dash, Ondřej
Bojar, Petr Motlicek, Priyanka Pattnaik, and
Debasish Kumar Mallick. 2020. OdiEnCorp
2.0: Odia-English parallel corpus for machine
translation. In Proceedings of the WILDRE5–
5th Workshop on Indian Language Data: Re-
sources and Evaluation, pages 14–19, Marseille,
France. European Language Resources Associ-
ation (ELRA).
Jerin Philip, Shashank Siripragada, Vinay P.
Namboodiri, and C. V. Jawahar. 2020. Revi-
siting low resource status of Indian languages
in machine translation. 8th ACM IKDD CODS
and 26th COMAD. https://doi.org/10
.1145/3430984.3431026
Matt Post, Chris Callison-Burch, and Miles
Osborne. 2012. Constructing parallel corpora
for six Indian languages via crowdsourcing. In
Proceedings of the Seventh Workshop on Sta-
tistical Machine Translation, pages 401–409,
Montréal, Canada. Association for Computa-
tional Linguistics.
Loganathan Ramasamy, Ondřej Bojar, and
Zdeněk Žabokrtský. 2012. Morphological pro-
cessing for English-Tamil statistical machine
translation. In Proceedings of the Workshop
on Machine Translation and Parsing in Indian
Languages (MTPIL-2012), pages 113–122.
Nils Reimers and Iryna Gurevych. 2020. Making
monolingual sentence embeddings multilingual
using knowledge distillation. arXiv preprint
arXiv:2004.09813. https://doi.org/10
.18653/v1/2020.emnlp-main.365
Annette Rios, Mathias Müller, and Rico
Sennrich. 2020. Subword segmentation and a
single bridge language affect zero-shot neu-
ral machine translation. In Proceedings of
the Fifth Conference on Machine Transla-
tion, pages 528–537, Online. Association for
Computational Linguistics.
H. Riza, M. Purwoadi, Gunarso, T. Uliniansyah,
A. A. Ti, S. M. Aljunied, L. C. Mai, V. T.
Thang, N. P. Thai, V. Chea, R. Sun, S. Sam,
S. Seng, K. M. Soe, K. T. Nwet, M. Utiyama,
and C. Ding. 2016. Introduction of the Asian
language treebank. In 2016 Conference of The
Oriental Chapter of International Commit-
tee for Coordination and Standardization of
Speech Databases and Assessment Techniques
(O-COCOSDA), pages 1–6. https://doi
.org/10.1109/ICSDA.2016.7918974
Holger Schwenk, Vishrav Chaudhary, Shuo Sun,
Hongyu Gong, and Francisco Guzmán. 2019a.
WikiMatrix: Mining 135M parallel sentences in
1620 language pairs from Wikipedia.
Holger Schwenk, Guillaume Wenzek, Sergey
Edunov, Edouard Grave, and Armand Joulin.
2019b. CCMatrix: Mining billions of high-
quality parallel sentences on the web. CoRR,
abs/1911.04944.
Holger Schwenk, Guillaume Wenzek, Sergey
Edunov, Edouard Grave, and Armand Joulin.
2020. CCMatrix: Mining billions of high-
quality parallel sentences on the web.
https://doi.org/10.18653/v1
/2021.acl-long.507
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016a. Improving neural machine trans-
lation models with monolingual data. In
Proceedings of the 54th Annual Meeting of
the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 86–96,
Berlin, Germany. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P16-1009
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016b. Neural machine translation of
rare words with subword units. In Proceedings
of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 1715–1725, Berlin, Germany.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/P16
-1162
Rico Sennrich and Martin Volk. 2011. Iterative,
MT-based sentence alignment of parallel texts.
In Proceedings of the 18th Nordic Conference of
Linguistica computazionale (NODALIDA 2011),
pages 175–182, Riga, Latvia. Northern Eu-
ropean Association for Language Technology
(NEALT).
P. Shah and V. Bakrola. 2019. Neural ma-
chine translation system of Indic languages—an
attention based approach. In 2019 Second
International Conference on Advanced Com-
putational and Communication Paradigms
(ICACCP), pages 1–5. https://doi.org
/10.1109/ICACCP.2019.8882969
Kārumūri V. Subbārāo. 2012. South Asian Lan-
guages: A Syntactic Typology. Cambridge
Stampa universitaria. https://doi.org/10
.1017/CBO9781139003575
Suhas Jayaram Subramanya, F. Devvrit, H. V.
Simhadri, R. Krishnawamy, and R. Kadekodi.
2019. DiskANN: Fast accurate billion-point
nearest neighbor search on a single node. In
Advances in Neural Information Processing
Systems, volume 32, pages 13771–13781.
Xu Tan, Jiale Chen, Di He, Yingce Xia,
Tao Qin, and Tie-Yan Liu. 2019. Multilin-
gual neural machine translation with language
clustering. In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural
Language Processing and the 9th Interna-
tional Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 963–973,
Hong Kong, China. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D19-1089
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen
Chen, Naman Goyal, Vishrav Chaudhary,
Jiatao Gu, and Angela Fan. 2020. Multilingual
translation with extensible multilingual pre-
training and finetuning.
Brian Thompson and Philipp Koehn. 2019. Vec-
align: Improved sentence alignment in linear
time and space. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 1342–1348,
Hong Kong, China. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D19-1136
Jörg Tiedemann and Santhosh Thottingal. 2020a.
OPUS-MT — building open translation services
for the world. In Proceedings of the 22nd
Annual Conference of the European Associ-
ation for Machine Translation, pages 479–480,
Lisbon, Portugal. European Association for
Machine Translation.
Jörg Tiedemann and Santhosh Thottingal. 2020b.
OPUS-MT—Building open translation services
for the World. In Proceedings of the 22nd
Annual Conference of the European Associa-
tion for Machine Translation (EAMT). Lisbon,
Portugal.
Jörg Tiedemann. 2012. Parallel data, tools and
interfaces in OPUS. In Proceedings of the
Eighth International Conference on Language
Resources and Evaluation (LREC'12). Istan-
bul, Turkey. European Language Resources
Association (ELRA).
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need.
Yonghui Wu, Mike Schuster, Zhifeng Chen,
Quoc V. Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao,
Klaus Macherey, Jeff Klingner, Apurva Shah,
Melvin Johnson, Xiaobing Liu, Lukasz Kaiser,
Stephan Gouws, Yoshikiyo Kato, Taku Kudo,
Hideto Kazawa, Keith Stevens, George Kurian,
Nishant Patil, Wei Wang, Cliff Young, Jason
Smith, Jason Riesa, Alex Rudnick, Oriol
Vinyals, Greg Corrado, Macduff Hughes, and
Jeffrey Dean. 2016. Google's neural ma-
chine translation system: Bridging the gap be-
tween human and machine translation. CoRR,
abs/1609.08144.
Linting Xue, Noah Constant, Adam Roberts,
Mihir Kale, Rami Al-Rfou, Aditya Siddhant,
Aditya Barua, and Colin Raffel. 2021. mT5: A
massively multilingual pre-trained text-to-text
transformer.