PARSINLU: A Suite of Language Understanding Challenges for Persian

Daniel Khashabi1 Arman Cohan1 Siamak Shakeri2 Pedram Hosseini3
Pouya Pezeshkpour4 Malihe Alikhani5 Moin Aminnaseri6 Marzieh Bitaab7
Faeze Brahman8 Sarik Ghazarian9 Mozhdeh Gheini9 Arman Kabiri10
Rabeeh Karimi Mahabadi11 Omid Memarrast12 Ahmadreza Mosallanezhad7
Erfan Noury13 Shahab Raji14 Mohammad Sadegh Rasooli15 Sepideh Sadeghi2
Erfan Sadeqi Azer2 Niloofar Safi Samghabadi16 Mahsa Shafaei17
Saber Sheybani18 Ali Tazarv4 Yadollah Yaghoobzadeh19

1Allen Institute for AI, USA 2Google, USA 3George Washington University, USA 4UC Irvine, USA
5University of Pittsburgh, USA 6TaskRabbit, USA 7Arizona State University, USA 8UC Santa Cruz, USA
9University of Southern California, USA 10IMRSV Data Labs, Canada 11EPFL, Switzerland
12University of Illinois – Chicago, USA 13University of Maryland Baltimore County, USA
14Rutgers University, USA 15University of Pennsylvania, USA 16Expedia Inc., USA
17University of Houston, USA 18Indiana University – Bloomington, USA 19Microsoft, Canada

Abstract

Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains concentrated on resource-rich languages like English. This work focuses on Persian, one of the most widely spoken languages in the world, for which few NLU datasets are available. The availability of high-quality evaluation datasets is a necessity for reliable assessment of progress on different NLU tasks and domains. We introduce PARSINLU, the first benchmark for the Persian language that includes a range of language understanding tasks: reading comprehension, textual entailment, and so on. These datasets are collected in a multitude of ways, often involving manual annotations by native speakers. This results in over 14.5k new instances across 6 distinct NLU tasks. Additionally, we present the first results of state-of-the-art monolingual and multilingual pre-trained language models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope PARSINLU fosters further research and advances in Persian language understanding.1

1 Introduction

In recent years, considerable progress has been made in building stronger NLU models, particularly supported by high-quality benchmarks (Bowman et al., 2015; Rajpurkar et al., 2016; Wang et al., 2019) for resourceful languages like English. However, in many other languages such benchmarks remain scarce, unfortunately stagnating progress toward language understanding in these languages.

In this work, we focus on developing natural language understanding (NLU) benchmarks for Persian (also known as Farsi). This language has many attributes that make it distinct from other well-studied languages. In terms of script, Persian is similar to Semitic languages (e.g., Arabic). Linguistically, however, Persian is an Indo-European language (Masica, 1993) and thus distantly related to most of the languages of Europe as well as the northern part of the Indian subcontinent. Such attributes make Persian a unique case to study in terms of language technologies. Although Persian is a widely spoken language (Simons and Fennig, 2017), our ability to evaluate performance and measure the progress of NLU models on this language remains limited. This is mainly due to the lack of major language understanding benchmarks that can evaluate progress on a diverse range of tasks.

In this work, we present PARSINLU, a collection of NLU challenges for Persian.2 PARSINLU contains challenges for reading comprehension, multiple-choice question-answering, textual entailment, sentiment analysis, question paraphrasing, and machine translation (examples in Figure 1).
1https://git.io/JIuRO.
* The points of view of the authors are their own and not attributable to the companies they work for.

2We focus on the standard Iranian Persian, spoken by over 80 million people. There are other dialects of Persian spoken in other countries, e.g., Afghanistan and Tajikistan.


Transactions of the Association for Computational Linguistics, vol. 9, pp. 1147–1162, 2021. https://doi.org/10.1162/tacl_a_00419
Action Editor: Mark Johnson. Submission batch: 3/21; Revision batch: 6/21; Published 10/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


PARSINLU offers data for tasks that have never been explored before in the context of the Persian language. We are not aware of any publicly available dataset for Persian question answering (§3.2.2), reading comprehension (§3.2.1), and paraphrasing (§3.2.5). For the rest of the tasks, we improve at least one aspect of the existing datasets (e.g., better data construction, more comprehensive evaluation, and evaluation of less investigated genres or domains). To ensure the quality of the presented challenge tasks, we rely on annotations from native Persian speakers or novel data collection techniques, such as search-engine auto-complete (§3.2.1) and past collegiate exams (§3.2.2). To the best of our knowledge, this is the first comprehensive collection of its kind, composed of a variety of Persian NLU tasks.

We conduct a collection of empirical work (§4) to establish the difficulty of PARSINLU. We benchmark each PARSINLU task by evaluating state-of-the-art multilingual and monolingual language models (LMs), as well as estimating the human upper-bound scores. The gap between the human and machine baselines indicates the need for further research and stronger models for Persian. We hope that the release of PARSINLU will encourage more research on Persian NLP.

2 Related Work

Cross-lingual Benchmarks. There are several recent cross-lingual benchmarks; however, almost none includes Persian: XNLI (Conneau et al., 2018) for entailment; PAWS-X (Yang et al., 2019) for paraphrasing; XCOPA (Ponti et al., 2020) for choice of plausible alternatives; and XQuAD, MLQA, TyDI, and MKQA (Artetxe et al., 2020b; Lewis et al., 2020; Clark et al., 2020a; Longpre et al., 2020) for reading comprehension. These datasets have also been integrated into multitask multilingual evaluation suites such as XTREME (Hu et al., 2020) and XGLUE (Liang et al., 2020). Unfortunately, the Persian portion of the former benchmark covers only two tagging tasks (POS and NER), and the latter does not cover Persian.

NLU Benchmarks for Other Languages. Benchmarks like GLUE (Wang et al., 2019) encourage the development of better and stronger models on a diverse set of challenges. There have been several efforts to create GLUE-like benchmarks for other languages; for example, CLUE for Chinese (Xu et al., 2020), GLUECoS for Hindi (Khanuja et al., 2020), and RussianSuperGLUE (Shavrina et al., 2020). We view PARSINLU as belonging to the same family of benchmarks, dedicated to the Persian language.

NLU Datasets for Persian. Prior work on creating evaluation resources for the Persian language has focused on low-level tasks in narrow domains (e.g., datasets for POS [Bijankhan, 2004], NER [Shahshahani et al., 2019], and parsing [Seraji et al., 2013]). Complementary to these efforts, we aim at providing an NLU evaluation benchmark for Persian, consisting of a wide variety of tasks. Below we mention several related works and how we build upon them.

FarsTail (Amirkhani et al., 2020) is a concurrent work on the entailment task, where the dataset is constructed semi-automatically based on existing multiple-choice exams. Different from this work, our entailment datasets are built with the annotations of native speakers of Persian and some use of machine translation (§3.2.4). Therefore, we hypothesize that our construction represents a slightly different distribution than that of FarsTail.

There is a rich set of works on Persian sentiment analysis. We build upon these works and differ from them in the following ways: (a) The existing work mainly focuses on document-level sentiment identification, which does not capture nuanced judgments with respect to aspects and entities of the context (HosseinzadehBendarkheili et al., 2019; Sharami et al., 2020, inter alia). In addition to such document-level annotations, we provide aspect-level sentiment annotations (§3.2.3). (b) The majority of existing resources, such as MirasOpinion (Ashrafi Asli et al., 2020), focus on binary or ternary sentiment classes. However, our annotations contain a more granular sentiment intensity with five labels (§3.2.3). (c) Compared to the aspect-level datasets of Hosseini et al. (2018) and Ataei et al. (2019), we cover two relatively less investigated domains: food & beverages and movies, each posing new challenges for Persian sentiment analysis.

Machine translation of Persian ↔ English is one of the few tasks that has enjoyed decent attention (Tiedemann and Nygaard, 2004; Mohaghegh et al., 2010; Pilevar et al., 2011; Mohaghegh et al., 2011; Rasooli et al., 2013; Karimi et al., 2018; Kashefi, 2018; Khojasteh et al., 2020). Unfortunately, most published work on this task focuses on niche domains and datasets.

Figure 1: Examples of the PARSINLU tasks. For each task (other than machine translation, which already contains English phrases), we show the English translations for ease of communication to non-Persian readers. The purple tags indicate the example category, according to their construction (explained in the main text under Section 3.2).

Our contribution to this task is compiling a set of high-quality evaluation sets from a broad range of domains, based on existing datasets as well as datasets introduced in this work. The hope is that this will help future work on Persian MT to evaluate systems on a variety of domains and obtain a more realistic measure of machine translation quality.

To the best of our knowledge, this is the first work that publishes an evaluation benchmark for the Persian language, promoting future studies on several NLU tasks such as question answering (§3.2.2), reading comprehension (§3.2.1), and paraphrasing (§3.2.5), among others.

3 PARSINLU

3.1 Design Considerations

We now discuss possible design choices for constructing the dataset and the underlying reasons.

Naturally Occurring Instances. A common way of collecting data for low-resource languages has been automated translation of the benchmark datasets of high-resource languages (Artetxe et al., 2020b; Ponti et al., 2020). This can be a poor practice, as recent investigations have shown translation artifacts in data gathered via translation of existing tasks (Artetxe et al., 2020a). It is important for any NLP dataset to reflect the natural distribution of the target language's tokens and their associated cultural contexts. Therefore, one should avoid over-reliance on automatic conversion of resources from high-resource languages, to minimize unnatural instances or artifacts (Khvalchik and Malkin, 2020).

Experts Over Crowdworkers. While crowdsourcing has been the common approach for building datasets, we choose to work with a few native Persian speakers to construct the dataset. Crowdworkers are difficult to train and often generate noisier annotations, whereas expert annotators who are closely familiar with the task at hand often generate better-quality annotations. Using crowdworkers is further complicated by the fact that crowdsourcing platforms do not have an active community of Persian-speaking workers, due to limited international financial transactions and crowdsourcing platforms. A study by Pavlick et al. (2014, Table 6) shows that there are almost no crowdworkers for Persian on the Amazon Mechanical Turk platform.

3.2 Constructing PARSINLU tasks

Examples are shown in Figure 1. We now explain
the data construction of each task.

3.2.1 Reading Comprehension
We use the commonly used definition of the reading-comprehension task: extracting a substring from a given context paragraph that answers a given question.

SQuAD (Rajpurkar et al., 2016) is one of the most popular reading comprehension datasets in English. Datasets similar to SQuAD have been developed for other languages using varying degrees of human or semi-automatic translation: KorQuAD for Korean (Lim et al., 2019), MMQA for Hindi (Gupta et al., 2018), and so on. For constructing our reading comprehension task, we avoid using SQuAD as a source and instead use a process resembling that of Kwiatkowski et al. (2019), which leads to more natural questions.

Collecting Questions. Our efforts to translate questions from the English dataset indicated that such questions are often about topics that are not of much importance in Persian. For example, there are many questions in SQuAD (Rajpurkar et al., 2016) about major US sports events (e.g., the Super Bowl, the NFL) or western civilization history that might not be common among Persian speakers. Instead, we follow a pipeline that is more similar to the one introduced by Kwiatkowski et al. (2019), setting our goal to annotate answers for an existing naturalistic set of questions in Persian, as opposed to writing questions for existing paragraphs.

Unlike Kwiatkowski et al. (2019), we do not have direct access to query logs. Thus we follow the approach of Berant et al. (2013) and Khashabi et al. (2021), which relies on a query auto-completion API for collecting questions. Specifically, we use Google's auto-completion,3 which enables us to mine a rich yet natural set of questions in Persian, as it is reflective of popular questions posed by users of Google.

We start with a seed set of question terms (e.g., [che kasI] meaning "who", and [kojA] meaning "where"). We bootstrap based on this set by repeatedly querying parts of previously extracted questions, in order to discover a longer and richer set of questions. We hypothesize that the questions extracted from the auto-complete algorithm are highly reflective of popular questions posed by Persian-speaking users of Google.
3http://google.com/complete/search?client=chrome&q=….


We filter out any results shorter than 5 tokens, as they are often incomplete questions. This process yields over 50k questions.
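The mining loop can be sketched as follows. This is an illustrative Python sketch (not the authors' code) built around the auto-complete endpoint from footnote 3; the helper names, the re-query heuristic, and the response parsing are our assumptions.

```python
import json
import time
import requests

SUGGEST_URL = "http://google.com/complete/search"  # endpoint from footnote 3

def suggestions(prefix):
    """Return auto-complete suggestions for a (partial) Persian query."""
    resp = requests.get(SUGGEST_URL, params={"client": "chrome", "q": prefix})
    resp.raise_for_status()
    # With client=chrome the body is a JSON array: [query, [suggestions], ...]
    return json.loads(resp.text)[1]

def bootstrap_questions(seed_terms, rounds=3, min_tokens=5):
    """Grow a question pool by re-querying prefixes of harvested questions."""
    pool, frontier = set(), list(seed_terms)  # seeds: Persian "who", "where", ...
    for _ in range(rounds):
        next_frontier = []
        for prefix in frontier:
            for candidate in suggestions(prefix):
                if candidate not in pool:
                    pool.add(candidate)
                    # re-query the first few tokens of the newly found question
                    next_frontier.append(" ".join(candidate.split()[:3]))
            time.sleep(0.5)  # stay gentle with the endpoint
        frontier = next_frontier
    # drop very short strings, which are usually incomplete questions
    return [q for q in pool if len(q.split()) >= min_tokens]
```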

Afterwards, we automatically filter out open-ended questions with no concrete answers (e.g., [nætIdZe ye bAzI bA ZApon?] meaning "What is the result of the game with Japan?"). Our filtering was guided by the observation that more complete questions typically lead to Google results that include well-established sources (such as Wikipedia). Hence, we perform this filtering by retrieving the Google search results4 for each question and checking whether any of the top 10 search results overlap with a pre-defined list of credible websites.5 We keep only the questions that match this criterion.
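The credibility filter can be sketched as below; the domain list and the `top_urls` helper (standing in for the search-result retrieval of footnote 4) are illustrative assumptions, not the authors' implementation.

```python
from urllib.parse import urlparse

CREDIBLE_DOMAINS = {"fa.wikipedia.org", "bbcpersian.com"}  # see footnote 5

def is_credible(url):
    """True if the URL's host is (a subdomain of) a trusted domain."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in CREDIBLE_DOMAINS)

def keep_question(question, top_urls, k=10):
    """Keep a question only if one of its top-k search hits is credible.

    `top_urls` is an assumed helper: question -> list of result URLs.
    """
    return any(is_credible(u) for u in top_urls(question)[:k])
```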

Annotating Paragraphs and Answers. In this step, native speakers of Persian select a paragraph and an answer span within the paragraph that answers each of the questions. First, the annotators read the question and correct any grammatical errors and typos (e.g., [otsAn] is corrected to [ostAn] "state"). Next, they annotate all the minimal and coherent spans that contain the answer to the question, from a paragraph obtained from a relevant web page (from the Google search results retrieved in an earlier step). Whenever possible, we annotate all valid spans as the answer (for example, [hæmedAn] and [ostAn e hæmedAn], as shown in Figure 1). The paragraph that contains this answer is also annotated as the context of the question.

Overall, 6 native-speaker annotators annotated a collection of 1.3k question-answer-paragraph triplets (Table 2).

Annotation Quality. To ensure the quality of the annotations, the answers to each question were labeled by two independent annotators. Any misalignment of the answer spans or omission of a valid span was marked as a disagreement. Such disagreements were resolved in further adjudication.

3.2.2 Multiple-Choice QA
Multiple-choice questions are one of the common formats for evaluating fact retrieval and reasoning (Richardson et al., 2013; Clark et al., 2020b).

4https://github.com/MarioVilas/googlesearch.
5fa.wikipedia.org, bbcpersian.com, etc..

Following prior work, we define the task as: given a natural language question, pick the correct answer among a list of candidates. A key difference from reading comprehension (§3.2.1) is that the instances are open-domain (i.e., no context paragraph is provided). Hence, a model would either need to retrieve external supporting documents or have stored the necessary knowledge internally to be able to answer the question.

Sources of Questions. We use existing sources of multiple-choice questions, rather than annotating new ones. We collect the questions from a variety of sources: (i) the literature questions of the annual college entrance exams in Iran, for the past 15 years; these questions often involve the understanding of poetry and its implied meaning, knowledge of Persian grammar, and the history of literature; (ii) employment exams that are expected to assess an individual's depth in various topics (accounting, teaching, mathematics, logic, etc.); (iii) common-knowledge questions, which involve topics such as basic science, history, or geography.

Most of these sources are scanned copies of the original exams in image format. We use an existing Persian OCR tool to convert the image data to a textual format.6 Then 4 annotators fix any mistakes made by the OCR system and convert the result into a structured format. Overall, this yields 2460 questions with an average of 4.0 candidate answers (Table 2). Additionally, each question comes with a label indicating the type of knowledge it requires: 'literature' (understanding of literary expressions), 'common-knowledge' (encyclopedic knowledge or everyday activities), and 'math & logic' (logical or mathematical problems). Examples from each category of questions are included in Figure 1.

Annotation Quality. To further examine the quality of the annotations, we randomly sampled 100 questions from the annotations and cross-checked the OCR output with the original data. We discovered that 94 of these questions exactly matched the original data, and the rest required only minor modifications. We thus conclude that the annotated data is of high quality.

3.2.3 Aspect-Based Sentiment Analysis
Sentiment Analysis (SA) is the study of opinions (i.e., positive, negative, or neutral sentiment) expressed in a given text (Liu, 2012).
6https://www.sobhe.ir/alefba/.


Aspect-based Sentiment Analysis (ABSA) is a more fine-grained form of SA that aims to extract the aspects of entities mentioned in the text and determine the sentiment toward these aspects (Pontiki et al., 2014). For instance, "it tastes good but it's so expensive …" (Figure 1) conveys positive and negative sentiments with respect to the taste and price aspects of the mentioned product (entity), respectively.

Annotation Scheme. We follow the existing ABSA scheme (Pontiki et al., 2014). For every review, we perform two types of annotation: (1) We assign an overall sentiment to each review, selecting one of the following values: very negative, negative, neutral, positive, very positive, and mixed. The mixed category indicates reviews where no sentiment is dominant (a mix of positive and negative, or borderline cases), and hence it is hard to detect the primary sentiment of the review. We also assign the neutral label to reviews that express no clear sentiment toward an entity or any aspect of it. (2) We annotate pairs (a, s), where a is an aspect that belongs to a predefined set of aspects for each domain and s expresses the sentiment toward the aspect a.

Collecting Reviews. We collect reviews from two different domains: (1) food & beverages and (2) movies. We chose these domains since they are relatively less investigated in the existing literature (see §2 for past work). For the food & beverages category, we extracted7 reviews from the online grocery section of Digikala,8 and for the movie reviews category, we crawled reviews from Tiwall.9 Both of these are well-known and popular websites among Persian speakers.

Defining Aspects. Following the ABSA scheme, we predefined a set of aspects for each domain. For food & beverages, we crawled Digikala and retrieved all listed aspects for product reviews in the food & beverages category. Afterwards, we manually aggregated the extracted aspects and merged those with significant semantic overlap. We also added taste/smell as a new aspect category, because users frequently commented on this aspect. For movie reviews, we created an initial list of aspects based on the movie review aspects defined by Thet et al. (2010). In consultation with a movie critic, we resolved the potential overlaps among aspect categories and created a set of aspects that capture various perspectives of movie reviews.
7https://github.com/rajabzz/digikala-crawler.
8https://www.digikala.com/.
9https://www.tiwall.com/.

Table 1: The predefined sentiment aspects (§3.2.3).

Task / Statistic

Reading Comprehension
  # of instances: 1300
  avg. question length (tokens): 6.3
  avg. paragraph length (tokens): 94.6
  avg. answer length (tokens): 7.6

Multiple-Choice QA
  # of instances: 2460
  # of 'literature' questions: 834
  # of 'common-knowledge' questions: 949
  # of 'math & logic' questions: 677
  avg. # of candidates: 4.0

Sentiment Analysis
  # of instances: 2423
  # of 'food & beverages' reviews: 1917
  # of 'movie' reviews: 506
  avg. length of reviews (words): 22.01
  # of annotated (aspect, sentiment) pairs: 2539

Textual Entailment
  # of instances: 2,700
  # of 'natural' instances: 1,370
  # of 'mnli' instances: 1,330
  avg. length of premises (tokens): 23.4
  avg. length of hypotheses (tokens): 11.8

Question Paraphrasing
  # of instances: 4,644
  # of 'natural' instances: 2,521
  # of 'qqp' instances: 2,123
  avg. length of Q1 (tokens): 10.7
  avg. length of Q2 (tokens): 11.0

Machine Translation
  # of instances: 47,745
  # of 'QP' subset: 489
  # of 'Quran' subset: 6,236
  # of 'Bible' subset: 31,020
  # of 'Mizan' subset (eval. only): 10,000

Table 2: Statistics on various subsets of the dataset.

Overall, this process resulted in 6 and 7 aspects for the food & beverages and movie review domains, respectively (Table 1).

After defining the sentiment aspects, we trained four native-speaker annotators for the final round of annotations. This results in 2423 instances for the sentiment task (Table 2).

Annotation Quality. To measure the quality of the annotations, we randomly selected 100 samples from each domain and calculated the Inter-Annotator Agreement (IAA) using Cohen's kappa (Cohen, 1960) on annotations elicited from two independent annotators. Based on the computed IAA values, there is substantial agreement on sub-task 1 (0.76) and moderate agreement on sub-tasks 2 and 3 (0.49 and 0.47, respectively).
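As a concrete illustration of the agreement computation (not the authors' script), Cohen's kappa between two annotators can be computed with scikit-learn; the label values below are made up for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Document-level sentiment labels from two independent annotators on the
# same sampled reviews (toy values; the paper reports kappa = 0.76 here).
annotator_1 = ["very positive", "positive", "mixed", "neutral", "negative"]
annotator_2 = ["very positive", "positive", "neutral", "neutral", "negative"]

print(cohen_kappa_score(annotator_1, annotator_2))
```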


Figure 2: The distribution of the overall sentiment labels (document-level).


Distribution of the Labels. Here we report the distribution of the labels for this task. Figure 2 shows the distribution of the document-level sentiment labels. As expected, most reviews are associated with extreme sentiments (very positive or very negative) and a relatively small portion of them are neutral. There is also a non-negligible portion of reviews that contain mixed sentiments (partially positive and partially negative).

3.2.4 Textual Entailment

Textual entailment (Dagan et al., 2013; Bowman et al., 2015) is typically defined as a 3-way classification task: determining whether a hypothesis sentence entails, contradicts, or is neutral with respect to a given premise sentence.

We construct two subsets: (i) based on available natural sentences, and (ii) based on an available English entailment dataset. The former approach yields higher-quality instances but is a relatively slower annotation task; the latter is slightly easier but yields less interesting instances.

Based on Natural Sentences. We start with randomly sampled raw sentences, selected from 3 different resources: Miras,10 Persian Wikipedia, and the VOA corpus.11 In this random sampling process, we specifically sample sentences that contain conjunctive adverbs (e.g., [amA] meaning "but"), along with their preceding sentences. We choose such examples because there is a higher chance that these sentences naturally contain inference relationships.

Figure 3: The distribution of the labels for the entailment task.

We ask annotators to consider both sentences and write a premise and corresponding entailing, contradicting, and neutral sentences, whichever they deem appropriate. To minimize annotation artifacts and avoid creating an artificially easy dataset, we specifically instruct annotators to avoid simple modifications, such as negating a sentence or changing a word to its synonym. For the rest of this work, we refer to this set as the 'natural' set.

Based on Existing Datasets. In this approach, we use existing datasets in English. We start with the MNLI dataset (Williams et al., 2018) and translate it with the publicly available Google Translate API.12 Subsequently, expert annotators carefully review and fix inaccurate translations. Additionally, each translated document is reviewed by a native-speaker annotator to correct translation mistakes. Our annotations show that about 66.4% of the translated documents have gone through some form of correction by our annotators. For the rest of the draft, we refer to this set as 'mnli'.

Overall, our two-pronged construction with 6 annotators results in 2.7k entailment instances (Table 2). Examples from each collected subset are included in Figure 1.

Annotation Quality. To verify the annotation quality, we quantify the agreement of 3 independent annotators on 150 random examples. On this subset, we observe a Fleiss kappa (Fleiss, 1971) of 0.77, indicating substantial inter-annotator agreement (Landis and Koch, 1977).

Distribution of the Labels. As the label distribution (Figure 3) shows, the distribution of the labels across the three categories is not far from uniform.
12https://cloud.google.com/translate.



3.2.5 Question Paraphrasing

This task is defined as determining whether two given questions are paraphrases of each other. This task has previously been used to improve downstream applications like document retrieval (Zukerman and Raskutti, 2002; Callison-Burch et al., 2006; Duboue and Chu-Carroll, 2006).

Similar to the construction of the entailment task (§3.2.4), we take two different approaches: (i) based on available natural sentences, and (ii) based on an existing English question paraphrasing dataset.

Based on Natural Sentences. We start with questions mined using Google auto-complete (§3.2.1), as well as an additional set of questions mined from Persian discussion forums.13 We create pairs of questions with high token overlap, and each pair is annotated as paraphrase or not-paraphrase by native speakers. We drop a pair if either of its questions is incomplete. For the rest of this document, we refer to this subset as 'natural'.
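A minimal sketch of how such high-overlap candidate pairs could be formed before manual labeling is given below; the Jaccard measure and the threshold are our assumptions rather than details given in the paper.

```python
from itertools import combinations

def token_jaccard(q1, q2):
    """Jaccard overlap between the token sets of two questions."""
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b)

def candidate_pairs(questions, threshold=0.5):
    """Yield question pairs whose token overlap exceeds the threshold."""
    for q1, q2 in combinations(questions, 2):
        if token_jaccard(q1, q2) >= threshold:
            yield q1, q2
```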

Based on Existing Datasets. We start with the QQP dataset,14 which is a dataset of English question pairs, and translate it with the Google Translate API. Expert annotators then carefully review the translations and amend any inaccuracies. We observe that about 65.6% of the translated documents have gone through some form of correction by our annotators.

Overall, the annotations involved 4 annotators and resulted in 4682 question paraphrasing instances (Table 2). Examples from each collected subset are included in Figure 1.

Annotation Quality. After the annotation of the earlier steps, the examples were reviewed by another annotator familiar with the task. Disagreements were labeled and adjudicated among the annotators, in order to ensure the quality of the resulting labels.

Distribution of the Labels. As the label distribution shows (Figure 4), the label distributions of the two splits ('qqp' vs. 'natural') are not much different.

13http://javabkoo.com/.
14https://www.kaggle.com/c/quora-question-pairs.

Figure 4: Label distribution for the question paraphrasing task.

3.2.6 Machine Translation

We consider the task of translating a given English sentence into Persian, and vice versa.

This task is one of the few for which several resources are available in the literature (Kashefi, 2018; Prokopidis et al., 2016; Pilevar et al., 2011). One major limitation is that there is no widely adopted comprehensive assessment of this task: most works are limited to narrow domains, and generalization across different styles of text is rarely studied. Our contribution is to put together a collection of evaluation sets from various domains to encourage a more holistic evaluation.

Our proposed evaluation sets consist of the following: (i) Quran: The Quran has been translated into many languages, including English and Persian (Tiedemann and Nygaard, 2004). We use several different translations of the Quran to create high-quality evaluation sets (10 gold-standard translations for each direction). Having multiple gold standards is particularly helpful for the automatic evaluation of machine translation, since such metrics work best when provided with several gold standards (Gupta et al., 2019). (ii) Bible: Similarly, we use the Persian and English versions of the Bible15 as another evaluation set. (iii) QQP: We use the data obtained in the construction of the question paraphrasing task (§3.2.5) to create an evaluation set for translating questions. (iv) Mizan: We use the evaluation subset of the Mizan corpus (Kashefi, 2018), which was acquired based on a manual alignment of famous literary works and their published Persian translations. Overall, the combination of these four high-quality subsets yields an evaluation set that contains 47k sentences from 4 different domains (Table 2).

15https://github.com/christos-c/bible-corpus.



While our main contribution here is providing a more comprehensive evaluation of machine translation, we also provide training/dev sets to let future work create experiments comparable to ours. We compile our training set as the union of the following datasets: (i) questions obtained from the question paraphrasing task (§3.2.5, by translating the QQP instances), (ii) the training set of the Mizan dataset (Kashefi, 2018), and (iii) the TEP dataset (Pilevar et al., 2011) and the Global Voices dataset (Prokopidis et al., 2016). The latter two are not included in our evaluation set because of their noisy translations, to prevent inaccurate evaluations. Note that the Quran and Bible documents are intentionally not included in the training data, in order to measure models' generalization to unseen documents.

4 Experiments

We experiment with several recent LMs to assess the difficulty of the PARSINLU tasks (compared to human expert performance) and also to establish baseline performance of state-of-the-art mono- and multilingual pre-trained models.

All the baseline models used in this work are available online.16

Evaluation Metrics. For each task, we pick a common set of existing metrics. For reading comprehension, we use F1 between the gold answer and the response string (Rajpurkar et al., 2016); for question paraphrasing, textual entailment, and multiple-choice question-answering, we use accuracy. For the first two sub-tasks of sentiment analysis (document-level sentiment, aspect extraction), we use macro-F1; for the third sub-task (aspect-specific sentiment), we use accuracy as our target evaluation metric (Angelidis and Lapata, 2018; Sun et al., 2019). For machine translation, we use SacreBLEU (Post, 2018).
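Two of these metrics are illustrated below: a whitespace-tokenized sketch of the SQuAD-style token-level F1 used for reading comprehension, and SacreBLEU scoring with multiple references (as available for the Quran subset). The tokenization and example strings are ours, not taken from the paper's evaluation code.

```python
from collections import Counter
import sacrebleu

def token_f1(prediction, gold):
    """Token-level F1 between a predicted answer span and a gold span."""
    pred, ref = prediction.split(), gold.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# SacreBLEU with two reference sets (each aligned with the hypotheses):
hypotheses = ["this is a small test"]
references = [["this is a small test"], ["this is a tiny test"]]
print(token_f1("the small test", "this is a small test"))
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```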

Task Splits. For each task, we provide statistics on the eval, train, and dev splits in Table 3. In doing so, we have ensured that enough instances are included in our evaluation sets.

16Included in the repository mentioned in footnote 1.

Task                        Train    Dev    Eval
Reading Comprehension       600      125    575
Multiple-Choice             1271     139    1050
Sentiment Analysis          1894     235    294
Textual Entailment          756      271    1,751
Question Paraphrasing       1,830    898    1,916
Machine Translation         1.6m     2k     47k

Table 3: Split sizes for different tasks.

Human Performance. To estimate the difficulty of the challenges, we report human performance on a random subset (100-150 instances) of each task. Similar to Wang et al. (2019), we collect annotations from three human annotators, adjudicate the inconsistencies, and evaluate the result against the gold labels to estimate human performance for each task.

Models. For evaluation of our baselines, we use state-of-the-art LMs. Multilingual BERT (mBERT) (Devlin et al., 2019) is pre-trained on the masked LM task over 104 languages. Additionally, we use two specialized variants of BERT for Persian: WikiBERT17 (trained on Persian Wikipedia) and ParsBERT (Farahani et al., 2020).18 We also use mT5 (Xue et al., 2021), which is a multilingual variant of T5 (Raffel et al., 2020).

Model Selection. We train each model with various hyperparameters and select the best one according to development-set performance. For the BERT-based models, we fine-tune over the cross product of the following hyperparameters: (1) batch sizes: {8, 16} for small/base models and {1, 2} for large models; (2) training epochs: {3, 7}; (3) learning rates: {3 × 10−5, 5 × 10−5}. For mT5 models, we fine-tune for 20k steps, dumping checkpoints every 1k steps. For the translation task, we train the models for 200k steps, since the task has much larger training data. We use a 10−3 learning rate.
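The selection procedure for the BERT-style models amounts to a grid search over that cross product; a schematic Python sketch is below, with `fine_tune_and_eval` as a placeholder for the actual fine-tuning and dev-set evaluation (not the authors' training script).

```python
from itertools import product

def fine_tune_and_eval(batch_size, num_epochs, lr):
    """Placeholder: fine-tune with these hyperparameters, return dev score."""
    return 0.0  # replace with actual training + evaluation

grid = product([8, 16],        # batch sizes ({1, 2} for large models)
               [3, 7],         # training epochs
               [3e-5, 5e-5])   # learning rates
best_score, best_config = -1.0, None
for batch_size, num_epochs, lr in grid:
    score = fine_tune_and_eval(batch_size, num_epochs, lr)
    if score > best_score:
        best_score, best_config = score, (batch_size, num_epochs, lr)
print(best_config, best_score)
```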

Input/Output Encoding. We formulate the question paraphrasing (§3.2.5) and entailment (§3.2.4) tasks as text classification tasks.19 For sentiment analysis (§3.2.3), we follow the formulation of Sun et al. (2019) and encode the instances as questions per aspect. The expected output is the sentiment polarity of the input review with respect to the input aspect-specific question. This formulation has the benefit that it is not restricted to a particular domain and its associated set of aspects, unlike alternatives such as multiclass classification.

17https://github.com/TurkuNLP/wikibert.
18https://github.com/hooshvare/parsbert.
19https://git.io/JYTNr.
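A small sketch of this aspect-as-question encoding (in the spirit of Sun et al., 2019) is given below; the English template, the label, and the function name stand in for the Persian ones actually used in the dataset.

```python
def encode_absa_instance(review, aspect, entity="the product"):
    """Pair a review with an aspect-specific question; the model predicts the
    sentiment polarity of the review with respect to that question."""
    question = f"What is the reviewer's opinion on the {aspect} of {entity}?"
    return f"question: {question} review: {review}"

encoded = encode_absa_instance(
    review="it tastes good but it's so expensive ...",
    aspect="price",
)
# expected output label: "negative" (the polarity toward the price aspect)
print(encoded)
```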

Model / Task →                 Reading Compr.   Multiple-Choice QA                     Textual Entailment    Question Paraphrasing
Subtask →                      all              literature   com-know   math & logic   natural    mnli       natural    qqp

Trained on Persian (our data)
  mBERT (base)                 49.0             30.1         28.7       33.8           48.7       51.6       80.4       75.3
  WikiBERT (base)              39.2             36.9         30.2       34.1           52.8       52.6       80.0       75.5
  ParsBERT (base)              40.7             33.4         28.6       32.5           51.8       53.9       79.4       72.0
  mT5 (small)                  30.9             33.7         23.7       39.1           51.9       51.0       75.2       72.0
  mT5 (base)                   42.6             34.0         24.0       36.9           57.8       59.9       79.1       75.1
  mT5 (large)                  49.2             32.6         27.1       38.9           69.1       71.6       84.6       76.6
  mT5 (XL)                     70.4             33.7         27.7       38.9           77.2       74.5       88.6       80.3

Trained on English
  mT5 (small)                  33.0             20.9         25.7       28.9           45.1       55.6       73.5       75.1
  mT5 (base)                   53.4             23.4         23.4       24.3           44.4       43.3       83.2       81.8
  mT5 (large)                  67.4             27.4         33.1       25.4           46.5       54.9       88.1       86.6
  mT5 (XL)                     68.2             28.3         38.6       22.0           66.2       77.8       89.2       87.0

Trained on Persian + English
  mT5 (small)                  45.3             30.9         24.9       36.6           53.3       56.2       77.9       71.3
  mT5 (base)                   63.9             32.3         24.0       37.7           57.8       63.9       80.2       73.4
  mT5 (large)                  73.6             30.6         28.9       38.6           70.9       72.5       85.3       78.9
  mT5 (XL)                     74.7             38.0         33.7       38.0           75.5       78.7       88.2       80.3

Human                          86.2             80.0         85.0       85.0           87.1       90.2       92.3       88.4

Model / Task →                 Sentiment (sentence sent.)   Sentiment (aspect ext.)   Sentiment (aspect sent.)   MT (Eng → Per)                   MT (Per → Eng)
Subtask →                      food     movies              food     movies           food     movies            quran   bible   qqp    mizan     quran   bible   qqp    mizan

Trained on Persian (our data)
  mBERT (base)                 55.2     48.6                87.1     73.24            53.9     34.7              –       –       –      –         –       –       –      –
  WikiBERT (base)              52.0     58.5                91.9     78.0             56.5     41.6              –       –       –      –         –       –       –      –
  ParsBERT (base)              59.1     56.8                91.1     76.8             53.9     37.6              –       –       –      –         –       –       –      –
  mT5 (small)                  54.6     49.4                86.4     78.6             52.4     40.6              10.2    2.1     22.2   8.4       20.6    2.5     22.9   14.6
  mT5 (base)                   56.6     52.9                88.6     80.5             52.9     46.5              11.4    2.1     27.3   9.4       22.8    2.5     34.6   14.9
  mT5 (large)                  62.9     72.5                92.2     85.0             58.1     53.5              11.9    2.1     24.8   10.6      24.7    2.4     35.1   16.4
  mT5 (XL)                     63.1     70.6                92.0     85.8             58.9     54.5              13.5    2.2     20.0   11.0      30.0    2.6     33.7   19.3

Trained on English
  mT5 (small)                  –        –                   –        –                –        –                 –       –       –      –         6.6     1.9     7.7    3.7
  mT5 (base)                   –        –                   –        –                –        –                 –       –       –      –         11.5    2.1     14.0   5.7
  mT5 (large)                  –        –                   –        –                –        –                 –       –       –      –         20.2    2.3     21.0   7.4
  mT5 (XL)                     –        –                   –        –                –        –                 –       –       –      –         25.6    2.3     30.7   9.7

Trained on Persian + English
  mT5 (small)                  –        –                   –        –                –        –                 –       –       –      –         19.2    2.5     25.6   12.1
  mT5 (base)                   –        –                   –        –                –        –                 –       –       –      –         24.1    2.4     36.0   14.8
  mT5 (large)                  –        –                   –        –                –        –                 –       –       –      –         29.9    2.6     36.5   18.1
  mT5 (XL)                     –        –                   –        –                –        –                 –       –       –      –         33.4    2.6     41.0   18.2

Human                          88.4     90.3                93.1     91.6             71.0     61.6              –       –       –      –         –       –       –      –

Table 4: Evaluation of Persian-only models (top block), English-only (middle block), and Persian+English (bottom block) models on the PARSINLU tasks.


Experimental Setups. First, we fine-tune our models on Persian (our dataset). The results of this setup are listed in the top segment of Table 4.

Following recent work on generalization across languages (Artetxe et al., 2020b), we also evaluate English-trained models on our Persian benchmark. We use commonly used English datasets to supervise mT5 on each task and evaluate the resulting model on the evaluation section of PARSINLU. The English datasets used here are as follows: SQuAD 1.1 (Rajpurkar et al., 2016) for reading comprehension (size: 88k); the union of ARC (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), and CommonsenseQA (Talmor et al., 2019) for multiple-choice question-answering (size: 18k); SNLI (Bowman et al., 2015) for textual entailment (size: 550k); QQP20 for question paraphrasing (size: 350k); and the Arabic-English subset of OPUS-100 (Zhang et al., 2020) for machine translation (size: 1m). We do not do such mixing for sentiment analysis, because existing English datasets are not quite compatible with our sentiment schema. The results are reported in the middle section of Table 4.
Finally, we train models on the union of the Persian and English datasets. Since English datasets tend to be much larger than the Persian ones, we make sure that the batches of training data, on average, contain the same number of instances from each language. Similar treatments of task mixing have also been adopted by Khashabi et al. (2020) and Raffel et al. (2020). The results of this setup are in the bottom segment of Table 4.

20See footnote 14.



4.1 Results

Below are key insights from the empirical work:

Humans Do Well on PARSINLU. As shown in the last row of Table 4, the human upper-bound scores are relatively high across the board. This indicates a reasonable degree of consensus between the ground truth and the judgments of native speakers and, hence, the quality of our dataset.

Models Haven't Solved PARSINLU Yet. The majority of the models significantly lag behind human performance. This is especially true for the mid-sized ('large' or smaller) models that are commonly used. It is encouraging that our largest model (mT5-XL) achieves close to human performance for certain tasks (e.g., question paraphrasing), although this model is prohibitively large and requires a massive amount of compute. Nevertheless, even these large models still struggle on most of the remaining tasks, particularly multiple-choice QA.

English Models Successfully Transfer to Persian. Consistent with prior observations (Artetxe et al., 2020b), multilingual models (mT5, in this case) trained with English data show a surprising degree of generalization to other languages (to Persian, in our case). Training on English data is particularly helpful for challenges that were originally translated from English datasets (such as 'qqp' and 'mnli').

Joint Training on English and Persian Helps. For most of the tasks, combining Persian and English yields better results than training solely on Persian or English data.

While joint training generally helps, such combinations are not guaranteed to yield positive gains every time. Whether the "Eng + Per" models beat either the Persian-only or English-only models depends on whether their strengths (the large size of "Eng" and the distributional alignment of "Per") align or work against each other. Because of this, the combined models are not always better than the individual models.

5 Discussion

We now discuss several limitations of the current dataset and experiments, and then outline several directions for future work.

Beyond Current Models. As shown in the earlier experiments, for most of the tasks the current mid-sized models perform significantly worse than humans. This is particularly pronounced for the multiple-choice QA task, where there is over a 40% gap between model and human performance, and increasing the model size (number of parameters) shows minimal benefits.

We hypothesize that the difficulty of our multiple-choice questions (and, to some extent, the other tasks) for the models is partly due to the reasoning and abstraction needed to answer them. For example, the 'literature' questions often demand making connections between several pieces of poetry, based on abstract interpretations of their meanings. Likewise, most of the 'math & logic' questions require several 'hops' of algebraic operations to get to the final answer. We hypothesize that these challenges (multi-hop reasoning over high-level abstractions of language) cannot solely be addressed with more training data, and likely require a dramatic rethinking of our architecture designs. For example, the poor performance on 'math & logic' questions might be due to models' inability to comprehend Persian numbers and do logical reasoning with them, a topic that has been briefly studied in English (Geva et al., 2020). There might also be value in exploring multitask setups across our various tasks (Zaremoodi et al., 2018), which we delegate to future work. We hope this benchmark will encourage more such studies, especially in the context of the Persian language.

Coverage of Dialects. There are other dialects of Persian, including the Dari and Tajiki dialects, that are not covered by our dataset. We acknowledge this limitation and hope that future work will create broader and more inclusive collections.

6 Conclusion

This work introduced PARSINLU, a benchmark for high-level language understanding tasks in Persian. We presented a careful set of steps that we followed to construct each of the tasks with the help of native speakers (§3.2). We presented human scores to establish estimated upper bounds for each task. This is followed by evaluating state-of-the-art models on each task and quantifying the human–machine gap (§4).



To the best of our knowledge, this is the first work that publishes a language understanding benchmark for the Persian language. We hope that PARSINLU inspires more activity on Persian NLU tasks, as well as contributing to the latest efforts in multilingual NLU.

Acknowledgments

The authors would like to thank Alireza Nourian for providing the OCR system used in this work, and the anonymous reviewers for their constructive feedback. Thanks to Google's TensorFlow Research Cloud (TFRC) for making research TPUs available.

References

Hossein Amirkhani, Mohammad Azari Jafari, Azadeh Amirak, Zohreh Pourjafari, Soroush Faridan Jahromi, and Zeinab Kouhkan. 2020. FarsTail: A Persian natural language inference dataset. arXiv preprint arXiv:2009.08820.

Stefanos Angelidis and Mirella Lapata. 2018. Multiple instance learning networks for fine-grained sentiment analysis. Transactions of the Association for Computational Linguistics, 6:17–31. https://doi.org/10.1162/tacl_a_00002

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2020a. Translation artifacts in cross-lingual transfer learning. In Proceedings of EMNLP, pages 7674–7684. https://doi.org/10.18653/v1/2020.emnlp-main.618

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020b. On the cross-lingual transferability of monolingual representations. In Proceedings of ACL, pages 4623–4637. https://doi.org/10.18653/v1/2020.acl-main.421

Seyed Arad Ashrafi Asli, Behnam Sabeti, Zahra Majdabadi, Preni Golazizian, Reza Fahmi, and Omid Momenzadeh. 2020. Optimizing annotation effort using active learning strategies: A sentiment analysis case study in Persian. In Proceedings of LREC, pages 2855–2861.

Taha Shangipour Ataei, Kamyar Darvishi, Soroush Javdan, Behrouz Minaei-Bidgoli, and Sauleh Eetemadi. 2019. Pars-ABSA: An aspect-based sentiment analysis dataset for Persian. arXiv preprint arXiv:1908.01815.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of EMNLP, pages 1533–1544.

Mahmood Bijankhan. 2004. The role of the corpus in writing a grammar: An introduction to a software. Iranian Journal of Linguistics, 19(2):48–67.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP. https://doi.org/10.18653/v1/D15-1075

Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006. Improved statistical machine translation using paraphrases. In Proceedings of NAACL, pages 17–24. https://doi.org/10.3115/1220835.1220838

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020a. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470. https://doi.org/10.1162/tacl_a_00317

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

Peter Clark, Oren Etzioni, Tushar Khot, Daniel Khashabi, Bhavana Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, Dirk Groeneveld, Michal Guerquin, and Michael Schmitz. 2020b. From 'F' to 'A' on the NY Regents Science Exams: An overview of the Aristo Project. AI Magazine, 41(4):39–53. https://doi.org/10.1609/aimag.v41i4.5304


Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46. https://doi.org/10.1177/001316446002000104

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of EMNLP, pages 2475–2485. https://doi.org/10.18653/v1/D18-1269

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing Textual Entailment: Models and Applications. Morgan & Claypool Publishers. https://doi.org/10.2200/S00509ED1V01Y201305HLT023

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171–4186.

Pablo Duboue and Jennifer Chu-Carroll. 2006. Answering the question you wish they had asked: The impact of paraphrasing for question answering. In Proceedings of NAACL, pages 33–36. https://doi.org/10.3115/1614049.1614058

Prakhar Gupta, Shikib Mehri, Tiancheng Zhao, Amy Pavel, Maxine Eskenazi, and Jeffrey P. Bigham. 2019. Investigating evaluation of open-domain dialogue systems with human generated multiple references. In Proceedings of SIGDIAL, pages 379–391. https://doi.org/10.18653/v1/W19-5944

Pedram Hosseini, Ali Ahmadian Ramaki, Hassan Maleki, Mansoureh Anvari, and Seyed Abolghasem Mirroshandel. 2018. SentiPers: A sentiment analysis corpus for Persian. arXiv preprint arXiv:1801.07737.

Fatemeh HosseinzadehBendarkheili, Rezvan MohammadiBaghmolaei, and Ali Ahmadi. 2019. Product quality assessment using opinion mining in Persian online shopping. In Proceedings of ICEE, pages 1917–1921. IEEE.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In Proceedings of ICML, pages 4411–4421. PMLR.

Akbar Karimi, Ebrahim Ansari, and Bahram Sadeghi Bigham. 2018. Extracting an English-Persian parallel corpus from comparable corpora. In Proceedings of LREC.

Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, and Mohammad Manthouri. 2020. ParsBERT: Transformer-based model for Persian language understanding. arXiv preprint arXiv:2005.12515.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378. https://doi.org/10.1037/h0031619

Omid Kashefi. 2018. Mizan: A large Persian-English parallel corpus. arXiv preprint arXiv:1801.02107.

Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, and Monojit Choudhury. 2020. GLUECoS: An evaluation benchmark for code-switched NLP. In Proceedings of ACL. https://doi.org/10.18653/v1/2020.acl-main.329

Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting numerical reasoning skills into language models. In Proceedings of ACL, pages 946–958. https://doi.org/10.18653/v1/2020.acl-main.89

Deepak Gupta, Surabhi Kumari, Asif Ekbal, and Pushpak Bhattacharyya. 2018. MMQA: A multi-domain multi-lingual question-answering framework for English and Hindi. In Proceedings of LREC.

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. In Proceedings of EMNLP (Findings), pages 1896–1907. https://doi.org/10.18653/v1/2020.findings-emnlp.171

Daniel Khashabi, Amos Ng, Tushar Khot, Ashish Sabharwal, Hannaneh Hajishirzi, and Chris Callison-Burch. 2021. GooAQ: Open question answering with diverse answer types. arXiv preprint arXiv:2104.08727.


answering with diverse answer types. arXiv
preprint arXiv:2104.08727.

https://doi.org/10.18653/v1/2020
.emnlp-main.484

Hadi Abdi Khojasteh, Ebrahim Ansari, and Mahdi
Bohlouli. 2020. LSCP: Enhanced large scale
colloquial Persian language understanding. En
Proceedings of LREC, pages 6323–6327.

Seungyoung Lim, Myungji Kim, and Jooyoul Lee.
2019. Korquad1.0: Korean QA dataset for ma-
chine reading comprehension. arXiv preprint
arXiv:1909.07005.

Maria Khvalchik and Mikhail Malkin. 2020.
Departamento de nosotros: How machine trans-
lated corpora affects language models in mrc
tareas. En procedimientos de
the Workshop on
Hybrid Intelligence for Natural Language Pro-
cessing Tasks (HI4NLP 2020) co-located with
24th European Conference on Artificial Intel-
ligence (ECAI 2020): Santiago de Compostela,
España, Agosto 29, 2020, pages 29–33. CEUR
Workshop Proceedings.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466. https://doi.org/10.1162/tacl_a_00276

J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174. https://doi.org/10.2307/2529310

Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020. MLQA: Evaluating cross-lingual extractive question answering. In Proceedings of ACL, pages 7315–7330. https://doi.org/10.18653/v1/2020.acl-main.653

Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Zhang, Rahul Agrawal, Edward Cui, Sining Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Daniel Campos, Rangan Majumder, and Ming Zhou. 2020. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. In Proceedings of EMNLP, pages 6008–6018. https://doi.org/10.18653/v1/2020.emnlp-main.484

Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167. https://doi.org/10.2200/S00416ED1V01Y201204HLT016

Shayne Longpre, Yi Lu, and Joachim Daiber. 2020. MKQA: A linguistically diverse benchmark for multilingual open domain question answering. arXiv preprint arXiv:2007.15207.

Colin P. Masica. 1993. The Indo-Aryan Languages. Cambridge University Press.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of EMNLP. https://doi.org/10.18653/v1/D18-1260

Mahsa Mohaghegh, Abdolhossein Sarrafzadeh, and Tom Moir. 2010. Improved language modeling for English-Persian statistical machine translation. In Proceedings of the Workshop on Syntax and Structure in Statistical Translation, pages 75–82.

Mahsa Mohaghegh, Abdolhossein Sarrafzadeh, and Tom Moir. 2011. Improving Persian-English statistical machine translation: Experiments in domain adaptation. In Proceedings of the Workshop on South Southeast Asian Natural Language Processing (WSSANLP), pages 9–15.

Ellie Pavlick, Matt Post, Ann Irvine, Dmitry Kachaev, and Chris Callison-Burch. 2014. The language demographics of Amazon Mechanical Turk. Transactions of the Association for Computational Linguistics, 2:79–92. https://doi.org/10.1162/tacl_a_00167

Mohammad Taher Pilevar, Heshaam Faili, and Abdol Hamid Pilevar. 2011. TEP: Tehran English-Persian parallel corpus. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 68–79. Springer. https://doi.org/10.1007/978-3-642-19437-5_6


Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of EMNLP, pages 2362–2376.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35. https://doi.org/10.3115/v1/S14-2004

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of WMT, pages 186–191. https://doi.org/10.18653/v1/W18-6319

Prokopis Prokopidis, Vassilis Papavassiliou, y
Stelios Piperidis. 2016. Parallel global voices:
A collection of multilingual corpora with cit-
izen media stories. In Proceedings of LREC,
pages 900–905.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP. https://doi.org/10.18653/v1/D16-1264

Mohammad Sadegh Rasooli, Ahmed El Kholy, and Nizar Habash. 2013. Orthographic and morphological processing for Persian-to-English statistical machine translation. In Proceedings of IJCNLP, pages 1047–1051.

Matthew Richardson, Christopher J. C. Burges,
and Erin Renshaw. 2013. MCTest: A challenge
dataset for the open-domain machine compre-
hension of text. In Proceedings of EMNLP,
pages 193–203.

Mojgan Seraji, Carina Jahani, Beáta Megyesi, and Joakim Nivre. 2013. Uppsala Persian dependency treebank annotation guidelines. Technical report, Uppsala University.

Mahsa Sadat Shahshahani, Mahdi Mohseni, Azadeh Shakery, and Heshaam Faili. 2019. Payma: A tagged corpus of Persian named entities. Signal and Data Processing, 16(1):91–110. https://doi.org/10.29252/jsdp.16.1.91

Javad PourMostafa Roshan Sharami, Parsa Abbasi Sarabestani, and Seyed Abolghasem Mirroshandel. 2020. DeepSentiPers: Novel deep learning models trained over proposed augmented Persian sentiment corpus. arXiv preprint arXiv:2004.05328.

Tatiana Shavrina, Alena Fenogenova, Anton Emelyanov, Denis Shevelev, Ekaterina Artemova, Valentin Malykh, Vladislav Mikhailov, Maria Tikhonova, Andrey Chertok, and Andrey Evlampiev. 2020. RussianSuperGLUE: A Russian language understanding evaluation benchmark. In Proceedings of EMNLP, pages 4717–4726. https://doi.org/10.18653/v1/2020.emnlp-main.381

Gary F. Simons and Charles D. Fennig. 2017. Ethnologue: Languages of Asia. SIL International, Dallas.

Chi Sun, Luyao Huang, and Xipeng Qiu. 2019. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In Proceedings of NAACL, pages 380–385.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of NAACL, pages 4149–4158.

Tun Thura Thet, Jin-Cheon Na, and Christopher S. G. Khoo. 2010. Aspect-based sentiment analysis of movie reviews on discussion boards. Journal of Information Science, 36(6):823–848. https://doi.org/10.1177/0165551510388123

Jörg Tiedemann and Lars Nygaard. 2004. The OPUS corpus - parallel and free: http://logos.uio.no/opus. In Proceedings of LREC.


Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Proceedings of NeurIPS, pages 3266–3280.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL, pages 1112–1122. https://doi.org/10.18653/v1/N18-1101

Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Richardson, and Zhenzhong Lan. 2020. CLUE: A Chinese language understanding evaluation benchmark. In Proceedings of COLING, pages 4762–4772.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498.

Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of EMNLP-IJCNLP, pages 3678–3683. https://doi.org/10.18653/v1/D19-1382

Poorya Zaremoodi, Wray Buntine, and Gholamreza Haffari. 2018. Adaptive knowledge sharing in multi-task learning: Improving low-resource neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 656–661. https://doi.org/10.18653/v1/P18-2104

Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of ACL, pages 1628–1639. https://doi.org/10.18653/v1/2020.acl-main.148

Ingrid Zukerman and Bhavani Raskutti. 2002.
Lexical query paraphrasing for document re-
trieval. In Proceedings of COLING. https://
doi.org/10.3115/1072228.1072389
