On the Difficulty of Translating Free-Order Case-Marking Languages


Arianna Bisazza

Ahmet Üstün
Center for Language and Cognition
University of Groningen, The Netherlands
{a.bisazza, a.ustun}@rug.nl, research@spor.tel

Stephan Sportel

Abstract

Identifying factors that make certain languages
harder to model than others is essential to reach
language equality in future Natural Language
Processing technologies. Free-order case-marking
languages, such as Russian, Latin, or Tamil,
have proved more challenging than fixed-order
languages for the tasks of syntactic parsing
and subject-verb agreement prediction. In this
work, we investigate whether this class of lan-
guages is also more difficult to translate by
state-of-the-art Neural Machine Translation
(NMT) models. Using a variety of synthetic
languages and a newly introduced translation
challenge set, we find that word order flexi-
bility in the source language only leads to a
very small loss of NMT quality, even though
the core verb arguments become impossible
to disambiguate in sentences without seman-
tic cues. The latter issue is indeed solved by
the addition of case marking. However, in
medium- and low-resource settings, the overall
NMT quality of fixed-order languages remains
unmatched.

1 Introduction

Despite the tremendous advances achieved in less
than a decade, Natural Language Processing re-
mains a field where language equality is far from
being reached (Joshi et al., 2020). In the field
of Machine Translation, modern neural models
have attained remarkable quality for high-resource
language pairs like German-English, Chinese-
English, or English-Czech, with a number of stud-
ies even claiming human parity (Hassan et al.,
2018; Bojar et al., 2018; Barrault et al., 2019;
Popel et al., 2020). These results may lead to the
unfounded belief that Neural Machine Transla-
tion (NMT) methods will perform equally well
in any language pair, provided similar amounts
of training data. In fact, several studies suggest
the opposite (Platanios et al., 2018; Ataman and
Federico, 2018; Bugliarello et al., 2020).

Why, then, do some language pairs have lower
translation accuracy? And, more specifically: Are
certain typological profiles more challenging for
current state-of-the-art NMT models? Every lan-
guage has its own combination of typological
properties, including word order, morphosyntac-
tic features, and more (Dryer and Haspelmath,
2013). Identifying language properties (or combi-
nations thereof) that pose major problems to the
current modeling paradigms is essential to reach
language equality in future MT (and other NLP)
technologies (Joshi et al., 2020), in a way that is
orthogonal to data collection efforts. Among oth-
ers, natural languages adopt different mechanisms
to disambiguate the role of their constituents:
Flexible order typically correlates with the pres-
ence of case marking and, vice versa, fixed order
is observed in languages with little or no case
marking (Comrie, 1981; Sinnemäki, 2008; Futrell
et al., 2015b). Morphologically rich languages
in general are known to be challenging for MT
at least since the times of phrase-based statisti-
cal MT (Birch et al., 2008) due to their larger
and sparser vocabularies, and remain challenging
even for modern neural architectures (Ataman and
Federico, 2018; Belinkov et al., 2017). In con-
trast, the relation between word order flexibility
and MT quality has not been directly studied to
our knowledge.

In this paper, we study this relationship using
strictly controlled experimental setups. Specifi-
cally, we ask:

• Are current state-of-the-art NMT systems

biased towards fixed-order languages?

• To what extent does case marking compen-
sate for the lack of a fixed order in the source
language?

Unfortunately, parallel data are scarce in most
of the world languages (Guzmán et al., 2019), and



Fixed
  VSO: follows the little cat the friendly dog
  VOS: follows the friendly dog the little cat

Free+Case
  follows the little cat#S the friendly dog#O
  follows the friendly dog#O the little cat#S

Translation
  de kleine kat volgt de vriendelijke hond

Table 1: Example sentence in different fixed/
flexible-order English-based synthetic languages
and their SVO Dutch translation. The subject in
each sentence is underlined. Artificial case mark-
ers start with #.

corpora in different languages are drawn from dif-
ferent domains. Exceptions exist, like the widely
used Europarl (Koehn, 2005), but represent a small
fraction of the large variety of typological feature
combinations attested in the world. This makes it
very difficult to run a large-scale comparative
study and isolate the factors of interest from, for
example, domain mismatch effects. As a solu-
tion, we propose to evaluate NMT on synthetic
languages (Gulordava and Merlo, 2016; Wang
and Eisner, 2016; Ravfogel et al., 2019) that dif-
fer from each other only by specific properties,
namely: the order of main constituents, or the
presence and nature of case markers (see example
in Table 1).

We use this approach to isolate the impact of
various source-language typological features on
MT quality and to remove the typical confounders
of corpus size and domain. Using a variety of syn-
thetic languages and a newly introduced challenge
set, we find that state-of-the-art NMT has little to
no bias towards fixed-order languages, but only
when a sizeable training set is available.

2 Free-order Case-marking Languages

The word order profile of a language is usually
represented by the canonical order of its main
constituents, (S)ubject, (O)bject, (V)erb. For in-
stance, English and French are SVO languages,
while Turkish and Hindi are SOV. Other, less com-
monly attested, word orders are VSO and VOS,
whereas OSV and OVS are extremely rare (Dryer,
2013). Although many other word order features
exist (e.g., noun/adjective), they often correlate
with the order of main constituents (Greenberg,
1963).

A different, but likewise important dimension
is that of word order freedom (or flexibility).

Languages that primarily rely on the position
of a word to encode grammatical roles typically
display rigid orders (like English or Mandarin
Chinese), while languages that rely on case mark-
ing can be more flexible allowing word order to
express discourse-related factors like topicaliza-
tion. Examples of highly flexible-order languages
include languages as diverse as Russian, Hungar-
ian, Latin, Tamil, and Turkish.1

In the field of psycholinguistics, due to the his-
torical influence of English-centered studies, word
order has long been considered the primary and
most natural device through which children learn
to infer syntactic relationships in their language
(Slobin, 1966). However, cross-linguistic studies
have later revealed that children are equally pre-
pared to acquire both fixed-order and inflectional
languages (Slobin and Bever, 1982).

Coming to computational linguistics, data-
driven MT and other NLP approaches were also
historically developed around languages with re-
markably fixed orders and very simple to moder-
ately simple morphological systems, like English
or French. Luckily, our community has been giv-
ing increasing attention to more and more lan-
guages with diverse typologies, especially in the
last decade. So far, previous work has found that
free-order languages are more challenging for
parsing (Gulordava and Merlo, 2015, 2016) and
subject-verb agreement prediction (Ravfogel
et al., 2019) than their fixed-order counterparts.
This raises the question of whether word order
flexibility also negatively affects MT quality.

Before the advent of modern NMT, Birch et al.
(2008) used the Europarl corpus to study how
various language properties affected the quality
of phrase-based Statistical MT. Amount of re-
ordering, target morphological complexity, and
historical relatedness of source and target lan-
guages were identified as strong predictors of
MT quality. Recent work by Bugliarello et al.
(2020), however, has failed to show a correlation
between NMT difficulty (measured by a novel
information-theoretic metric) and several linguis-
tic properties of source and target language,
including Morphological Counting Complexity
(Sagot, 2013) and Average Dependency Length
(Futrell et al., 2015a). While that work specifically

1See Futrell et al. (2015b) for detailed figures of word
order freedom (measured by the entropy of subject and
object dependency relation order) in a diverse sample of 34
languages.


aimed at ensuring cross-linguistic comparability,
the sample on which the linguistic properties
could be computed (Europarl) was rather small
and not very typologically diverse, leaving our re-
search questions open to further investigation. In
this paper, we therefore opt for a different meth-
odology: namely, synthetic languages.

3 Methodology

Synthetic Languages This paper presents two
sets of experiments: In the first (§4), we create
parallel corpora using very simple and predict-
able artificial grammars and small vocabularies
(Lupyan and Christiansen, 2002). See an example
in Table 1. By varying the position of subject/
verb/object and introducing case markers to the
source language, we study the biases of two NMT
architectures in optimal training data conditions
and a fully controlled setup, that is, without any
other linguistic cues that may disambiguate con-
stituent roles. In the second set of experiments
(§5), we move to a more realistic setup using syn-
thetic versions of the English language that differ
from it in only one or few selected typological
features (Ravfogel et al., 2019). For example, the
original sentence’s order (SVO) is transformed to
different orders, like SOV or VSO, based on its
syntactic parse tree.
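To make this transformation concrete, the sketch below reorders the subject, object, and verb of a clause in a dependency parse so that they follow a chosen template (e.g., SOV). It is only a minimal illustration of the idea, not the actual generator of Ravfogel et al. (2019): the (index, word, head, deprel) tuples are assumed to come from any dependency parser (we use Stanza in §5), only the top-level clause is handled, and the verb block simply absorbs all remaining material.

```python
# Minimal sketch: reorder a flat dependency parse into a target constituent order.
# Assumes tokens are (index, word, head_index, deprel) tuples with 1-based heads,
# as produced by a typical UD-style parser. Illustrative only.

def reorder_clause(tokens, order="SOV"):
    """Reorder subject, object, and verb (with their dependents) of the root clause."""
    def subtree(root_i):
        # Collect a head and all its (transitive) dependents, in original order.
        keep, changed = {root_i}, True
        while changed:
            changed = False
            for i, _, head, _ in tokens:
                if head in keep and i not in keep:
                    keep.add(i)
                    changed = True
        return [t for t in tokens if t[0] in keep]

    root = next(t for t in tokens if t[3] == "root")             # main verb
    subj = next((t for t in tokens if t[2] == root[0] and t[3] == "nsubj"), None)
    obj = next((t for t in tokens if t[2] == root[0] and t[3] == "obj"), None)
    if subj is None or obj is None:
        return [w for _, w, _, _ in tokens]                      # leave clause untouched

    blocks = {"S": subtree(subj[0]), "O": subtree(obj[0])}
    covered = {t[0] for b in blocks.values() for t in b}
    blocks["V"] = [t for t in tokens if t[0] not in covered]     # verb + everything else
    out = []
    for label in order:                                          # e.g., "S", "O", "V"
        out.extend(w for _, w, _, _ in blocks[label])
    return out

# Example ("the dog follows the cat"):
sent = [(1, "the", 2, "det"), (2, "dog", 3, "nsubj"), (3, "follows", 0, "root"),
        (4, "the", 5, "det"), (5, "cat", 3, "obj")]
print(" ".join(reorder_clause(sent, "SOV")))   # the dog the cat follows
print(" ".join(reorder_clause(sent, "VOS")))   # follows the cat the dog
```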

In both cases, typological variations are intro-
duced in the source side of the parallel corpora,
while the target language remains fixed. In this
way, we avoid the issue of non-comparable BLEU
scores across different target languages. Lastly,
we make the simplifying assumption that, when
verb-argument order varies from the canonical or-
der in a flexible-order language, it does so in a
totally arbitrary way. Although this is rarely true in
practice, as word order may be predictable given
pragmatics or other factors, we focus here on ‘‘the
extent to which word order is conditioned on the
syntactic and compositional semantic properties
of an utterance’’ (Futrell et al., 2015b).

Translation Models We consider two widely
used NMT architectures that crucially differ in
their encoding of positional information: (i) Re-
current sequence-to-sequence BiLSTM with at-
tention (Bahdanau et al., 2015; Luong et al., 2015)
processes the input symbols sequentially and has
each hidden state directly conditioned on that
of the previous (or following, for the backward
LSTM) timestep (Elman, 1990; Hochreiter and
Schmidhuber, 1997). (ii) The non-recurrent, fully
attention-based Transformer (Vaswani et al.,
2017) processes all input symbols in parallel re-
lying on dedicated embeddings to encode each
input’s position.2 Transformer has nowadays
surpassed recurrent encoder-decoder models in
terms of generic MT quality. Moreover, Choshen
and Abend (2019) have recently shown that
Transformer-based NMT models are indifferent
to the absolute order of source words, at least
when equipped with learned positional embed-
dings. On the other hand, the lack of recurrence
in Transformers has been linked to a limited abil-
ity to capture hierarchical structure (Tran et al.,
2018; Hahn, 2020). To our knowledge, no pre-
vious work has studied the biases of either ar-
chitecture towards fixed-order languages in a
systematic manner.
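For reference, the sketch below reproduces the sinusoidal position encoding of Vaswani et al. (2017) that our Transformer models rely on (see footnote 2): each position is mapped to a fixed vector of sines and cosines at geometrically spaced frequencies, which is added to the token embeddings. This is a minimal NumPy illustration of the published formula, not our training code (all models are built with OpenNMT-py).

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """Fixed position encodings from Vaswani et al. (2017):
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                            # added to token embeddings

# A 4-token sentence with model dimension 512 (the size used in §5.1):
print(sinusoidal_positions(4, 512).shape)                # (4, 512)
```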

4 Toy Parallel Grammar

We start by evaluating our models on a pair of toy
languages inspired by the English-Dutch pair and
created using a Synchronous Context-Free Gram-
mar (Chiang and Knight, 2006). Each sentence
consists of a simple clause with a transitive verb,
subject, and object. Both arguments are singu-
lar and optionally modified by an adjective. The
source vocabulary contains 6 nouns, 6 verbs, 6
adjectives, and the complete corpus contains 10k
generated sentence pairs. Working with such a
small, finite grammar allows us to simulate an
otherwise impossible situation where the NMT
model can be trained on (almost) the totality of a
language’s utterances, canceling out data sparsity
effects.3

Source Language Variants We consider three
source language variants, illustrated in Table 1:

• fixed-order VSO;

• fixed-order VOS;

• mixed-order (randomly chosen between VSO
or VOS) with nominal case marking.

2We use sinusoidal embeddings (Vaswani et al., 2017). All
our models are built using OpenNMT: https://github
.com/OpenNMT/OpenNMT-py.

3Data and code to replicate the toy grammar experiments
in this section are available at https://github.com
/573phn/cm-vs-wo.


Figure 1: Toy language NMT sentence-level accuracy on validation set by number of training epochs. Source
languages: fixed-order VSO, fixed-order VOS, and mixed-order (VSO/VOS) with case marking. Target language:
always fixed SVO. Each experiment is repeated five times, and averaged results are shown.

We choose these word orders so that, in the
flexible-order corpus, the only way to disambig-
uate argument roles is case marking, realized by
simple unambiguous suffixes (#S and #O). The
target language is always fixed SVO. The same
random split (80/10/10% training/validation/test)
is applied to the three corpora.
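As a concrete illustration of how such a corpus can be produced, the sketch below samples parallel sentence pairs in the spirit of Table 1: a fixed-order source variant (VSO or VOS) or a mixed-order variant with #S/#O case suffixes, always paired with an SVO Dutch-like target. The word lists and helper names are our own illustrative stand-ins, not the exact grammar used in the experiments.

```python
import random

# Hypothetical mini-lexicon; the real grammar uses 6 nouns/verbs/adjectives per side.
LEX = {"cat": "kat", "dog": "hond", "little": "kleine", "friendly": "vriendelijke",
       "follows": "volgt"}

def noun_phrase():
    adj = random.choice(["little", "friendly"])
    noun = random.choice(["cat", "dog"])
    return ["the", adj, noun]

def translate_np(np_tokens):
    return ["de"] + [LEX[w] for w in np_tokens[1:]]

def sample_pair(variant="VSO"):
    subj, obj, verb = noun_phrase(), noun_phrase(), "follows"
    target = " ".join(translate_np(subj) + [LEX[verb]] + translate_np(obj))  # always SVO
    if variant == "mixed_case":                       # free order + case suffixes
        subj = subj[:-1] + [subj[-1] + "#S"]
        obj = obj[:-1] + [obj[-1] + "#O"]
        order = random.choice(["VSO", "VOS"])
    else:
        order = variant
    args = [subj, obj] if order == "VSO" else [obj, subj]
    source = " ".join([verb] + args[0] + args[1])
    return source, target

random.seed(0)
print(sample_pair("VSO"))
print(sample_pair("mixed_case"))
```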

NMT Setup As recurrent model, we trained
a 2-layer BiLSTM with attention (Luong et al.,
2015) with 500 hidden layer size. As Transformer
models, we trained one using the standard 6-layer
configuration (Vaswani et al., 2017) and a smaller
one with only 2 layers, given the simplicity of
the languages. All models are trained at the word
level using the complete vocabulary. More hyper-
parameters are provided in Appendix A.1. Note
that our goal is not to compare LSTM and Trans-
former accuracy to each other, but rather to
observe the different trends across fixed- and
flexible-order language variants. Given the small
vocabulary, we use sentence-level accuracy in-
stead of BLEU for evaluation.
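Concretely, sentence-level accuracy here is simply the fraction of test sentences whose output matches the reference exactly, as in the following sketch (our own formulation of the metric):

```python
def sentence_accuracy(hypotheses, references):
    """Fraction of sentences translated exactly right (whitespace-normalized)."""
    assert len(hypotheses) == len(references)
    correct = sum(h.split() == r.split() for h, r in zip(hypotheses, references))
    return correct / len(references)

print(sentence_accuracy(["de kleine kat volgt de hond"],
                        ["de kleine kat volgt de hond"]))   # 1.0
```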

Results As shown in Figure 1, all models
achieve perfect accuracy on all language pairs after
1000 training steps, except for the Large Trans-
former on the free-order language, likely due to
overparametrization (Sankararaman et al., 2020).
These results demonstrate that our NMT archi-
tectures are equally capable of modeling trans-
lation of both types of language, when all other
factors of variation are controlled for.

However, a pattern emerges when looking
at the learning curves within each plot: While
the two fixed-order languages have very similar
learning curves, the free-order language with case
markers always requires slightly more training
steps to converge. This is also the case, albeit to

a lesser extent, when the mixed-order corpus is
pre-processed by splitting all case suffixes from
the nouns (extra experiment not shown in the plot).
This trend is noteworthy, given the simplicity of
our grammars and the transparency of the case
system. As our training sets cover a large majority
of the languages, this result might suggest that
free-order natural languages need larger training
datasets to reach a translation quality similar to
that of their fixed-order counterparts. In §5 we validate
this hypothesis on more naturalistic language data.

5 Synthetic English Variants

Experimenting with toy languages has its short-
comings, like the small vocabulary size and non-
realistic distribution of words and structures. In
this section, we follow the approach of Ravfogel
et al. (2019) to validate our findings in a less con-
trolled but more realistic setup. Specifically, we
create several variants of the Europarl English-
French parallel corpus where the source sentences
are modified by changing word order and adding
artificial case markers. We choose French as target
language because of its fixed order, SVO, and its
relatively simple morphology.4 As Indo-European
languages, English and French are moderately re-
lated in terms of syntax and vocabulary while
being sufficiently distant to avoid a word-by-word
translation strategy in many cases.

Source language variants are obtained by trans-
forming the syntactic tree of the original sentences.
While Ravfogel et al. (2019) could rely on the
Penn Treebank (Marcus et al., 1993) for their
monolingual task of agreement prediction, we

4According to the Morphological Counting Complexity
(Sagot, 2013) values reported by Cotterell et al. (2018),
English scores 6 (least complex), Dutch 26, French 30,
Spanish 71, Czech 195, and Finnish 198 (most complex).


Original (no case):
The woman says her sisters often invited her for dinner.

SOV (no case):
The woman her sisters her often invited for dinner say.

SOV, syncretic case marking (overt):
The woman.arg.sg her sisters.arg.pl she.arg.sg often in-
vited.arg.pl for dinner say.arg.sg.

SOV, unambiguous case marking (overt):
The woman.nsubj.sg her sisters.nsubj.pl she.dobj.sg often
invited.dobj.sg.nsubj.pl for dinner say.nsubj.sg.

SOV, unambiguous case (implicit):
The womankar her sisterskon shekin often invitedkinkon
for dinner saykar.

SOV, unambiguous case (implicit with declensions):
The womankar her sisterspon shekit often invitedkitpon for
dinner saykar.

French translation:
La femme dit que ses soeurs l’invitaient souvent à dîner.

Table 2: Examples of synthetic English variants
and their (common) French translation. The full
list of suffixes is provided in Appendix A.3.

instead need parallel data. For this reason, we
parse the English side of the Europarl v.7 cor-
pus (Koehn, 2005) using the Stanza dependency
parser (Qi et al., 2020; Manning et al., 2014).
After parsing, we adopt a modified version of the
synthetic language generator by Ravfogel et al.
(2019) to create the following English variants:5

• Fixed-order: either SVO, SOV, VSO, or
VOS;6

• Free-order: for each sentence in the corpus,
one of the six possible orders of (Subject,
Object, Verb) is chosen randomly;

• Shuffled words: all source words are shuf-
fled regardless of their syntactic role. This is
our lower bound, measuring the reordering
ability of a model in the total absence of
source-side order cues (akin to bag-of-words
input).

To allow for a fair comparison with the artifi-
cial case-marking languages, we remove number
agreement features from verbs in all the above
variants (cf. says → say in Table 2).

To answer our second research question, we
experiment with two artificial case systems pro-
posed by Ravfogel et al. (2019) and illustrated in
Table 2 (overt suffixes):

• Unambiguous case system: suffixes indi-
cating argument role (subject/object/indirect
object) and number (singular/plural) are
added to the heads of noun and verb phrases;

• Syncretic case system: suffixes indicating
number but not grammatical function are
added to the heads of main arguments, pro-
viding only partial disambiguation of argu-
ment roles. This system is inspired by
subject/object syncretism in Russian.

Syncretic case systems were found to be roughly as
common as non-syncretic ones in a large sample of
almost 200 world languages (Baerman and Brown,
2013). Case marking is always combined with the
fully flexible order of main constituents. As in
Ravfogel et al. (2019), English number marking
is removed from verbs and their arguments before
adding the artificial suffixes.
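The sketch below illustrates the difference between the two marking schemes on noun heads: the unambiguous system appends a role+number suffix, while the syncretic one appends a number-only suffix, so that subject and object look alike. The suffix shapes follow the overt examples in Table 2; the helper itself is only illustrative and assumes that argument heads and their features have already been identified by the parser (verbs, which in the full scheme also copy their arguments' features, are omitted here).

```python
def mark_noun_head(word, role, number, system="unambiguous"):
    """Attach an artificial case suffix to an argument's head noun.

    role:   "nsubj" (subject) or "dobj" (object); number: "sg" or "pl".
    Suffix shapes follow the overt examples in Table 2.
    """
    if system == "unambiguous":
        return f"{word}.{role}.{number}"   # role and number both recoverable
    return f"{word}.arg.{number}"          # syncretic: subject and object look alike

print(mark_noun_head("woman", "nsubj", "sg"))                  # woman.nsubj.sg
print(mark_noun_head("she", "dobj", "sg"))                     # she.dobj.sg
print(mark_noun_head("woman", "nsubj", "sg", "syncretic"))     # woman.arg.sg
print(mark_noun_head("she", "dobj", "sg", "syncretic"))        # she.arg.sg
```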

5Our revised language generator is available at https://

github.com/573phn/rnn typology.

6To keep the number of experiments manageable, we
omit object-initial languages, which are significantly less
attested among world languages (Dryer, 2013).

5.1 NMT Setup

Models As recurrent model, we used a 3-layer
BiLSTM with hidden size of 512 and MLP at-
tention (Bahdanau et al., 2015). The Transformer
model has the standard 6-layer configuration with
hidden size of 512, 8 attention heads, and sinu-
soidal positional encoding (Vaswani et al., 2017).
All models use subword representation based on
32k BPE merge operations (Sennrich et al., 2016),
except in the low-resource setup where this is
reduced to 10k operations. More hyperparameters
are provided in Appendix A.1.
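For reference, BPE segmentation (Sennrich et al., 2016) builds its subword vocabulary by iteratively merging the most frequent adjacent symbol pair; a toy learner in that style is sketched below. This is our own simplification for illustration, not the subword-nmt implementation used in practice, and the real models use 32k (or 10k) merge operations rather than a handful.

```python
import re
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE learner in the style of Sennrich et al. (2016): repeatedly merge
    the most frequent adjacent symbol pair over a word-frequency dictionary."""
    vocab = Counter({" ".join(w) + " </w>": c for w, c in Counter(words).items()})
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = Counter({pattern.sub("".join(best), w): c for w, c in vocab.items()})
    return merges

print(learn_bpe(["low", "lower", "lowest", "low"], 3))
```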

Data and Evaluation We train our models on
various subsets of the English-French Europarl
corpus: 1.9M sentence pairs (high-resource), 100k
(medium-resource), and 10K (low-resource). For
evaluation, we use 5K sentences randomly held
out from the same corpus. Given the importance
of word order to assess the correct translation
of verb arguments into French, we compute the


reordering-focused RIBES7 metric (Isozaki et al.,
2010) in addition to the more commonly used
BLEU (Papineni et al., 2002). In each experi-
ment, the source side of training and test data is
transformed using the same procedure whereas the
target side remains unchanged. We repeat each ex-
periment 3 times (or 4 for languages with random
order choice) and report the averaged results.
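As a rough intuition for why RIBES is more order-sensitive than BLEU, the sketch below computes a simplified rank-correlation score: it maps each hypothesis word to its position in the reference (first unused match) and measures Kendall's tau over those positions. This is our own simplified illustration, not the official RIBES implementation, which additionally weights the correlation by unigram precision and a brevity penalty.

```python
from itertools import combinations

def simplified_order_score(hypothesis, reference):
    """Kendall's tau over the reference positions of matched hypothesis words,
    rescaled to [0, 1]. A crude stand-in for the word-order part of RIBES."""
    used, ranks = set(), []
    ref = reference.split()
    for word in hypothesis.split():
        for j, ref_word in enumerate(ref):
            if ref_word == word and j not in used:
                used.add(j)
                ranks.append(j)
                break
    if len(ranks) < 2:
        return 0.0
    concordant = sum(a < b for a, b in combinations(ranks, 2))
    pairs = len(ranks) * (len(ranks) - 1) / 2
    tau = 2 * concordant / pairs - 1
    return (tau + 1) / 2

ref = "le président remercie le ministre"
print(simplified_order_score("le président remercie le ministre", ref))  # 1.0
print(simplified_order_score("le ministre remercie le président", ref))  # 0.5
```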

5.2 Challenge Set

Besides syntactic structure, natural language of-
ten contains semantic and collocational cues that
help disambiguate the role of an argument. Small
BLEU/RIBES differences between our language
variants may indicate actual robustness of a model
to word order flexibility, but may also indicate that
a model relies on those cues rather than on syntac-
tic structure (Gulordava et al., 2018). To discern
these two hypotheses, we create a challenge set of
7,200 simple affirmative and negative sentences
where swapping subject and object leads to an-
other plausible sentence.8 Each English sentence
and its reverse are included in the test set together
with the respective translations, as for example:

(1)

(a) The president thanks the minister. /
Le président remercie le ministre.

(b) The minister thanks the president. /
Le ministre remercie le président.

The source side is then processed as explained in
§5 and translated by the NMT model trained on
the corresponding language variant. Thus, trans-
lation quality on this set reflects the extent to
which NMT models have robustly learned to de-
tect verb arguments and their roles independently
from other cues, which we consider an important
sign of linguistic generalization ability. For space
constraints we only present RIBES scores on the
challenge set.9
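A minimal sketch of how such pairs can be generated: fill a transitive template with two role-swappable nouns, and emit both the original and the subject/object-swapped sentence together with their French references. The word lists below are illustrative stand-ins, not the actual vocabulary of the released challenge set.

```python
# Hypothetical mini-vocabulary; the released set uses a larger list of
# human-denoting nouns so that both role assignments remain plausible.
NOUNS = {"president": "président", "minister": "ministre"}
VERBS = {"thanks": "remercie"}

def challenge_pair(subj, obj, verb):
    en = f"The {subj} {verb} the {obj}."
    fr = f"Le {NOUNS[subj]} {VERBS[verb]} le {NOUNS[obj]}."
    return en, fr

pairs = []
for a, b in [("president", "minister")]:
    for verb in VERBS:
        pairs.append(challenge_pair(a, b, verb))   # original sentence
        pairs.append(challenge_pair(b, a, verb))   # subject/object swapped
for en, fr in pairs:
    print(en, "/", fr)
```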

7BLEU captures local word-order errors only indirectly
(lower precision of higher-order n-grams) and does not cap-
ture long-range word-order errors at all. By contrast, RIBES
directly measures correlation between the word ranks in the
reference and those in the MT output.

8More details can be found in Appendix A.2. We release
the challenge set at https://github.com/arianna
-bis/freeorder-mt.

9We also computed BLEU scores: They strongly correlate
with RIBES but fluctuate more due to the larger effect of
lexical choice.

5.3 High-Resource Results

Table 3 reports the high-resource setting results.
The first row (original English to French) is given
only for reference and shows the overall highest
results. The BLEU drop observed when moving
to any of the fixed-order variants (including SVO)
is likely due to parsing flaws resulting in awkward
reorderings. As this issue affects all our synthetic
variants, it does not undermine the validity of our
findings. For clarity, we center our main discus-
sion on the Transformer results and comment on
the BiLSTM results at the end of this section.

Fixed-Order Variants All four tested fixed-
order variants obtain very similar BLEU/RIBES
scores on the Europarl-test. This is in line with pre-
vious work in NMT showing that linguistically
motivated pre-ordering leads to small gains (Zhao
et al., 2018) or none at all (Du and Way, 2017),
and that Transformer-based models are not bi-
ased towards monotonic translation (Choshen and
Abend, 2019). On the challenge set, scores are
slightly more variable but a manual inspection
reveals that this is due to different lexical choices,
while word order is always correct for this group
of languages. To sum up, in the high-resource
setup, our Transformer models are perfectly able
to disambiguate the core argument roles when
these are consistently encoded by word order.

Fixed-Order vs Random-Order Somewhat
surprisingly, the Transformer results are only
marginally affected by the random ordering of
verb and core arguments. Recall that in the ‘Ran-
dom’ language all six possible permutations of
(S,V,O) are equally likely. Thus, Transformer
shows an excellent ability to reconstruct the cor-
rect constituent order in the general-purpose test
set. The picture is very different on the challenge
set, where RIBES drops severely from 97.6 to
74.1. These low results were to be expected given
the challenge set design (it is impossible even for
a human to recognize subject from object in the
‘Random, no case’ challenge set). Nevertheless,
they demonstrate that the general-purpose set can-
not tell us whether an NMT model has learned to
reliably exploit syntactic structure of the source
language, because of the abundant non-syntactic
cues. In fact, even when all source words are
shuffled, Transformer still achieves a respectable
25.8/71.2 BLEU/RIBES on the Europarl-test.


English*→French             BI-LSTM                            TRANSFORMER
Large Training (1.9M)       Europarl-Test       Challenge      Europarl-Test       Challenge
                            BLEU      RIBES     RIBES          BLEU      RIBES     RIBES

Original English            39.4      85.0      98.0           38.3      84.9      97.7

Fixed Order:
S-V-O                       38.3      84.5      98.1           37.7      84.6      98.0
S-O-V                       37.6      84.2      97.7           37.9      84.5      97.2
V-S-O                       38.0      84.2      97.8           37.8      84.6      98.0
V-O-S                       37.8      84.0      98.0           37.6      84.3      97.2
Average (fixed orders)      37.9±0.4  84.2±0.3  97.9±0.2       37.8±0.1  84.5±0.1  97.6±0.4

Flexible Order:
Random, no case             37.1      83.7      75.1           37.5      84.2      74.1
Random + syncretic case     36.9      83.6      75.4           37.3      84.2      84.4
Random + unambig. case      37.3      83.9      97.7           37.3      84.4      98.1

Shuffle all words           18.5      65.2      79.4           25.8      71.2      83.2

Table 3: Translation quality from various English-based synthetic languages into standard French,
using the largest training data (1.9M sentences). NMT architectures: 3-layer BiLSTM seq-to-seq with
attention; 6-layer Transformer. Europarl-Test: 5K held-out Europarl sentences; Challenge set: see §5.2.
All scores are averaged over three training runs.

Case Marking The key comparison in our
study lies between fixed-order and free-order
case-marking languages. Here, we find that case
marking can indeed restore near-perfect accuracy
on the challenge set (98.1 RIBES). However,
this only happens when the marking system is
completely unambiguous, which, as already men-
tioned, is true for only about a half of the real
case-marking languages (Baerman and Brown,
2013). Indeed, the syncretic system visibly im-
proves quality on the challenge set (74.1 to 84.4
RIBES) but remains far behind the fixed-order
score (97.6). In terms of overall NMT quality
(Europarl-test), fixed-order languages score only
marginally higher than the free-order case-
marking ones, regardless of the unambiguous/
syncretic distinction. Thus our finding that Trans-
former NMT systems are equally capable of mod-
eling the two types of languages (§4) is also
confirmed with more naturalistic language data.
That said, we will show in Section 5.4 that this
positive finding is conditional on the availability
of large amounts of training samples.

BiLSTM vs Transformer The LSTM-based re-
sults generally correlate with the Transformer
results discussed above, however our recurrent

models appear to be slightly more sensitive to
changes in the source-side order, in line with
previous findings (Choshen and Abend, 2019).
Specifically, translation quality on Europarl-test
fluctuates slightly more than Transformer among
different fixed orders, with the most monotonic
order (SVO) leading to the best results. When all
words are randomly shuffled, BiLSTM scores
drop much more than Transformer. However,
when comparing the fixed-order variants to the
ones with free order of main constituents, BiL-
STM shows only a slightly stronger preference
for fixed-order, compared to Transformer. This
suggests that, by experimenting with arbitrary
permutations, Choshen and Abend (2019) might
have overestimated the bias of recurrent NMT
towards more monotonic translation, while the
more realistic combination of constituent-level re-
ordering with case marking used in our study is
not so problematic for this type of model.

Interestingly, on the challenge set, BiLSTM and
Transformer perform on par, with the notable ex-
ception that syncretic case is much more difficult
for the BiLSTM model. Our results agree with the
large drop of subject-verb agreement prediction
accuracy observed by Ravfogel et al. (2019) when
experimenting with the random order of main


constituents. However, their scores were also low
for SOV and VOS, which is not the case in our
NMT experiments. Besides the fact that our chal-
lenge set only contains short sentences (hence no
long dependencies and few agreement attractors),
our task is considerably different in that agreement
only needs to be predicted in the target language,
which is fixed-order SVO.

Summary Our results so far suggest that state-
of-the-art NMT models, especially if Transformer-
based, have little or no bias towards fixed-order
languages. In what follows, we study whether this
finding is robust to differences in data size, type
of morphology, and target language.

5.4 Effect of Data Size and
Morphological Features

Data Size The results shown in Table 3 represent
a high-resource setting (almost 2M training sen-
tences). While recent successes in cross-lingual
transfer learning alleviate the need for labeled
data (Liu et al., 2020), their success still de-
pends on the availability of large unlabeled data
as well as other, yet to be explained, language
properties (Joshi et al., 2020). We then ask: Do
free-order case-marking languages need more data
than fixed-order non-case-marking ones to reach
similar NMT quality? We simulate a medium-
and low-resource scenario by sampling 100K and
10K training sentences, respectively, from the
full Europarl data. To reduce the number of ex-
periments, we only consider Transformer with
one fixed-order language variant (SOV)10 and ex-
clude syncretic case marking. To disentangle the
effect of word order from that of case mark-
ing on low-resource translation quality, we also
experiment with a language variant combining
fixed-order (SOV) and case marking. Results are
shown in Figure 2 and discussed below.

Morphological Features The artificial case
systems used so far included easily separable
suffixes with a 1:1 mapping between grammat-
ical categories and morphemes (e.g., .nsubj.sg,
.dobj.pl) reminiscent of agglutinative morpholo-
gies. Many world languages, however, do not
comply to this 1:1 mapping principle but dis-
play exponence (multiple categories conveyed by

10We choose SOV because it is a commonly attested word
order and is different from that of the target language, thereby
requiring some non-trivial reorderings during translation.

one morpheme) and/or flexivity (the same cat-
egory expressed by various, lexically determined,
morphemes). Well-studied examples of languages
with case+number exponence include Russian and
Finnish, while flexive languages include, again,
Russian and Latin. Motivated by previous findings
on the impact of fine-grained morphological fea-
tures on language modeling difficulty (Gerz et al.,
2018), we experiment with three types of suffixes
(see examples in Table 2):

• Overt: number and case are denoted by easily
separable suffixes (e.g., .nsubj.sg, .dobj.pl)
similar to agglutinative languages (1:1);

• Implicit: the combination of number and
case is expressed by unique suffixes without
internal structure (e.g., kar for .nsubj.sg, ker
for .dobj.pl) similar to fusional languages.
This system displays exponence (many:1);

• Implicit with declensions: like the previous,
but with three different paradigms each ar-
bitrarily assigned to a different subset of the
lexicon. This system displays exponence and
flexivity (many:many).

A complete overview of our morphological
paradigms is provided in Appendix A.3. All our
languages have moderate inflectional synthesis
and, in terms of fusion, are exclusively concatena-
tive. Despite this, the effect on vocabulary size is
substantial: 180% increase by overt and implicit
case marking, 250% by implicit marking with
declensions (in the full data setting).
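To make the three suffix types concrete, the sketch below maps a (case, number) pair to a suffix under each system. The kar/kon/kin/ker/pon/kit forms come from Table 2 and the text above; the remaining declension forms and the lexeme-to-paradigm assignment are invented placeholders (the full inventory is listed in Appendix A.3), so this is an illustration of the mechanism rather than our exact suffix tables.

```python
# Illustrative suffix tables; the real forms are listed in Appendix A.3.
OVERT = lambda case, num: f".{case}.{num}"                     # 1:1, e.g. ".nsubj.sg"
IMPLICIT = {("nsubj", "sg"): "kar", ("nsubj", "pl"): "kon",    # many:1 (exponence)
            ("dobj", "sg"): "kin", ("dobj", "pl"): "ker"}
DECLENSIONS = [IMPLICIT,                                       # many:many (flexivity)
               {("nsubj", "sg"): "kar", ("nsubj", "pl"): "pon",
                ("dobj", "sg"): "kit", ("dobj", "pl"): "pes"},
               {("nsubj", "sg"): "kur", ("nsubj", "pl"): "pan",
                ("dobj", "sg"): "kot", ("dobj", "pl"): "pos"}]

def mark(word, case, num, system="overt", lex_class=None):
    if system == "overt":
        return word + OVERT(case, num)
    if system == "implicit":
        return word + IMPLICIT[(case, num)]
    # declensions: the paradigm depends (arbitrarily) on the lexical item
    return word + DECLENSIONS[lex_class(word)][(case, num)]

lex_class = lambda w: len(w) % 3     # arbitrary lexeme-to-paradigm assignment
print(mark("woman", "nsubj", "sg"))                            # woman.nsubj.sg
print(mark("woman", "nsubj", "sg", "implicit"))                # womankar
print(mark("sisters", "nsubj", "pl", "declension", lex_class)) # sisterspon
```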

Results Results are shown in the plots of
Figure 2 (detailed numerical scores are given
in Appendix A.4). We find that reducing train-
ing size has, not surprisingly, a major effect
on translation quality. Among source language
variants, fixed-order obtains the highest qual-
ity across all setups. In terms of BLEU (2(a)),
the spread among variants increases somewhat
with less data; however, differences are small.
A clearer picture emerges from RIBES (2(b)),
where less data clearly leads to more disparity.
This is already visible in the 100k setup, with
the fixed SOV language dominating the others.
Case marking, despite being necessary to disam-
biguate argument roles in the absence of semantic
cues, does not improve translation quality and


Figure 2: EN*-FR Transformer NMT quality versus training data size (x-axis). Source language variants:
Fixed-order (SOV) and free-order (random) with different case systems (r+overt/implicit/declens). Scores
averaged over three training runs. Detailed numerical results are provided in Appendix A.4.

even degrades it in the low-resource setup. Look-
ing at the challenge set results (2(c)) we see
that the free-order case-marking languages are
clearly disadvantaged: In the mid-resource setup,
case marking improves substantially over the
underspecified random, no-case language but re-
mains far behind fixed-order. In low-resource,
case marking notably hurts quality even in
comparison with the underspecified language.
These results thus demonstrate that free-order
case-marking languages require more data than
their fixed-order counterparts to be accurately
translated by state-of-the-art NMT.11 Our ex-
periments also show that this greater learning
difficulty is not only due to case marking (and
subsequent data sparsity), but also to word or-
der flexibility (compare sov+overt to r+overt in
Figure 2).

Regarding different morphology types, we do
not observe a consistent trend in terms of overall
translation quality (Europarl-test): in some cases,
the richest morphology (with declensions) slightly
outperforms the one without declensions—a re-
sult that would deserve further exploration. On
the other hand, results on the challenge set, where

11In the light of this finding, it would be interesting to
revisit the evaluation of Bugliarello et al. (2020) in relation
to varying data sizes.

Figure 3: Transformer results for more target languages
(100k training size). Scores averaged over 2 runs.

most words are case-marked, show that morpho-
logical richness inversely correlates with transla-
tion quality when data is scarce. We postulate that
our artificial morphologies may be too limited in
scope (only 3-way case and number marking) to
impact overall translation quality and leave the
investigation of richer inflectional synthesis to
future work.

5.5 Effect of Target Language

All results so far involved translation into a fixed-
order (SVO) language without case marking. To
verify the generality of our findings, we repeat a
subset of experiments with the same synthetic En-
glish variants, but using Czech or Dutch as target
languages. Czech has rich fusional morphology
including case marking, and very flexible order.
Dutch has simple morphology (no case marking)
and moderately flexible, syntactically determined
order.12

Figure 3 shows the results with 100k train-
ing sentences. In terms of BLEU, differences are
even smaller than in English-French. In terms of
RIBES, trends are similar across target languages,
with the fixed SOV source language obtaining
best results and the case-marked source language
obtaining worst results. This suggests that the ma-
jor findings of our study are not due to the specific
choice of French as the target language.

6 Trabajo relacionado

The effect of word order flexibility on NLP model
performance has been mostly studied in the field
of syntactic parsing, for example, using Aver-
age Dependency Length (Gildea and Temperley,
2010; Futrell et al., 2015a) or head-dependent or-
der entropy (Futrell et al., 2015b; Gulordava and
Merlo, 2016) as syntactic correlates of word order
freedom. Related work in language modeling has
shown that certain languages are intrinsically more
difficult to model than others (Cotterell et al.,
2018; Mielke et al., 2019) and has furthermore
studied the impact of fine-grained morphology
features (Gerz et al., 2018) on LM perplexity.

Regarding the word order biases of seq-to-seq
modelos, Chaabouni et al. (2019) use miniature
languages similar to those of Section 4 to study
the evolution of LSTM-based agents in a sim-
ulated iterated learning setup. Their results in a
standard ‘‘individual learning’’ setup show, as
ours, that a free-order case-marking toy language
can be learned just as well as a fixed-order one,
confirming earlier results obtained by simple
Elman networks trained for grammatical role
classification (Lupyan and Christiansen, 2002).
Transformer was not included in these studies.
Choshen and Abend (2019) measure the ability
of LSTM- and Transformer-based NMT to model
a language pair where the same arbitrary (non-
syntactically motivated) permutation is applied to
all source sentences. They find that Transformer
is largely indifferent to the order of source words
(provided this is fixed and consistent across train-
ing and test set) but nonetheless struggles to

12Dutch word order is very similar to German, with the

position of S, V, and O depending on the type of clause.

translate long dependencies actually occurring in
natural data. They do not directly study the effect
of order flexibility.

The idea of permuting dependency trees to
generate synthetic languages was introduced in-
dependently by Gulordava and Merlo (2016)
(discussed above) and by Wang and Eisner (2016),
the latter with the aim of diversifying the set
of treebanks currently available for language
adaptation.

7 Conclusions

We have presented an in-depth analysis of how
Neural Machine Translation difficulty is affected
by word order flexibility and case marking in the
source language. Although these common lan-
guage properties were previously shown to neg-
atively affect parsing and agreement prediction
atively affect parsing and agreement prediction
accuracy, our main results show that state-of-the-
art NMT models, especially Transformer-based
ones, have little or no bias towards fixed-order lan-
guages. Our simulated low-resource experiments,
however, reveal a different picture, that is: Free-
order case-marking languages require more data
to be translated as accurately as their fixed-order
counterparts. Because parallel data (like labeled
data in general) are scarce for most of the world
languages (Guzmán et al., 2019; Joshi et al., 2020),
we believe this should be considered as a fur-
ther obstacle to language equality in future NLP
technologies.

In future work, our analysis should be extended
to target language variants using principled alter-
natives to BLEU (Bugliarello et al., 2020), and
to other typological features that are likely to
affect MT performance, such as inflectional syn-
thesis and degree of fusion (Gerz et al., 2018).
Finally, the synthetic languages and challenge set
proposed in this paper could be used to evalu-
ate syntax-aware NMT models (Eriguchi et al.,
2016; Bisk and Tran, 2018; Currey and Heafield,
2019), which promise to better capture linguistic
structure, especially in low-resource scenarios.

Acknowledgments

Arianna Bisazza was partly funded by the
Netherlands Organization for Scientific Research
(NWO) under project number 639.021.646. We
would like to thank the Center for Information
Technology of the University of Groningen for
providing access to the Peregrine HPC cluster,


and the anonymous reviewers for their helpful
comments.

References

Duygu Ataman and Marcello Federico. 2018.
An evaluation of two vocabulary reduction
methods for neural machine translation. In
Proceedings of the 13th Conference of the Asso-
ciation for Machine Translation in the Americas
(Volume 1: Research Papers), pages 97–110.

Matthew Baerman and Dunstan Brown. 2013.
Case syncretism. In Matthew S. Dryer and
Martin Haspelmath, editors, The World Atlas
of Language Structures Online. Max Planck In-
stitute for Evolutionary Anthropology, Leipzig.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In 3rd
International Conference on Learning Repre-
sentations, ICLR 2015.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà,
Christian Federmann, Mark Fishel, Yvette
Graham, Barry Haddow, Matthias Huck,
Philipp Koehn, Shervin Malmasi, Christof
Monz, Mathias Müller, Santanu Pal, Matt
Post, and Marcos Zampieri. 2019. Findings of
the 2019 conference on machine translation
(WMT19). In Proceedings of the Fourth Con-
ference on Machine Translation (Volume 2:
Shared Task Papers, Day 1), pages 1–61,
Florence, Italy. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/W19-5301

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi,
Hassan Sajjad, and James Glass. 2017. What
do neural machine translation models learn
about morphology? In Proceedings of the 55th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers),
pages 861–872, Vancouver, Canada. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/P17-1080

Alexandra Birch, Miles Osborne, and Philipp
Koehn. 2008. Predicting success in machine
translation. In Proceedings of the 2008 Con-
ference on Empirical Methods in Natural Lan-
guage Processing, pages 745–754, Honolulu,
Hawaii. Association for Computational Lin-
guistics. https://doi.org/10.3115
/1613715.1613809

Yonatan Bisk and Ke Tran. 2018. Inducing
grammars with and for neural machine transla-
tion. In Proceedings of the 2nd Workshop on
Neural Machine Translation and Generation,
pages 25–35, Melbourne, Australia. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/W18-2704

Ondřej Bojar, Christian Federmann, Mark Fishel,
Yvette Graham, Barry Haddow, Philipp Koehn,
and Christof Monz. 2018. Findings of the 2018
conference on machine translation (WMT18).
In Proceedings of the Third Conference on
Machine Translation: Shared Task Papers,
pages 272–303, Belgium, Brussels. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/W18-6401

Emanuele Bugliarello, Sabrina J. Mielke,
Antonios Anastasopoulos, Ryan Cotterell, and
Naoaki Okazaki. 2020. It’s easier to translate
out of English than into it: Measuring neural
translation difficulty by cross-mutual informa-
tion. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational
Linguistics, pages 1640–1649, Online. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/2020.acl-main
.149

Rahma Chaabouni, Eugene Kharitonov,
Alessandro Lazaric, Emmanuel Dupoux, and
Marco Baroni. 2019. Word-order biases in
deep-agent emergent communication. In Pro-
ceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 5166–5175, Florence, Italy. Association
for Computational Linguistics.

David Chiang and Kevin Knight. 2006. An
introduction to synchronous grammars. Tuto-
rial available at http://www.isi.edu/
~chiang/papers/synchtut.pdf.

Leshem Choshen and Omri Abend. 2019. Auto-
matically extracting challenge sets for non-local
phenomena in neural machine translation. In
Proceedings of the 23rd Conference on Compu-
tational Natural Language Learning (CoNLL),


pages 291–303, Hong Kong, China. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/P19-1509


Bernard Comrie. 1981. Language Universals and
Linguistic Typology. Blackwell. Book.

Ryan Cotterell, Sabrina J. Mielke, Jason
Eisner, and Brian Roark. 2018. Are all lan-
guages equally hard to language-model? In
Proceedings of the 2018 Conference of the
North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, Volume 2 (Short Papers),
pages 536–541, New Orleans, Louisiana. Asso-
ciation for Computational Linguistics.

Anna Currey and Kenneth Heafield. 2019. Incor-
porating source syntax into transformer-based
neural machine translation. In Proceedings of
the Fourth Conference on Machine Translation
(Volume 1: Research Papers), pages 24–33,
Florence, Italy. Association for Computational
Linguistics. https://doi.org/10
.18653/v1/W19-5203

Matthew S. Dryer. 2013. Order of subject, ob-
ject and verb. In Matthew S. Dryer and Martin
Haspelmath, editors, The World Atlas of Lan-
guage Structures Online. Max Planck Institute
for Evolutionary Anthropology, Leipzig.

Matthew S. Dryer and Martin Haspelmath,
editors. 2013. WALS Online. Max Planck In-
stitute for Evolutionary Anthropology, Leipzig.
https://wals.info/

Jinhua Du and Andy Way. 2017. Pre-reordering
for neural machine translation: Helpful or harm-
ful? The Prague Bulletin of Mathematical
Linguistics, 108(1):171–182. https://doi
.org/10.1515/pralin-2017-0018

Jeffrey L. Elman. 1990. Finding structure in time.
Cognitive Science, 14(2):179–211. https://
doi.org/10.1207/s15516709cog1402 1

Akiko Eriguchi, Kazuma Hashimoto, and
Yoshimasa Tsuruoka. 2016. Tree-to-sequence
attentional neural machine translation. In Pro-
ceedings of the 54th Annual Meeting of
the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 823–833,
Berlin, Germany. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P16-1078

Richard Futrell, Kyle Mahowald, and Edward
Gibson. 2015a. Large-scale evidence of depen-
dency length minimization in 37 languages.
Proceedings of the National Academy of Sci-
ences, 112(33):10336–10341. https://doi
.org/10.1073/pnas.1502134112, PubMed:
26240370

Richard Futrell, Kyle Mahowald, and Edward
Gibson. 2015b. Quantifying word order free-
dom in dependency corpora. In Proceed-
ings of the Third International Conference
on Dependency Linguistics (Depling 2015),
pages 91–100, Uppsala, Sweden. Uppsala
University, Uppsala, Sweden.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti,
Roi Reichart, and Anna Korhonen. 2018. On the
relation between linguistic typology and (limi-
tations of) multilingual language modeling. In
Proceedings of the 2018 Conference on Empir-
ical Methods in Natural Language Processing,
pages 316–327, Brussels, Belgium. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/D18-1029

Daniel Gildea and David Temperley. 2010. Do
grammars minimize dependency length? Cog-
nitive Science, 34(2):286–310. https://doi
.org/10.1111/j.1551-6709.2009.01073.x,
PubMed: 21564213

Joseph H. Greenberg. 1963. Some universals
of grammar with particular reference to the
order of meaningful elements. In Joseph H.
Greenberg, editor, Universals of Human Lan-
guage, pages 73–113. MIT Press, Cambridge,
MA.

Kristina Gulordava, Piotr Bojanowski, Edouard
Grave, Tal Linzen, and Marco Baroni. 2018.
Colorless green recurrent networks dream
hierarchically. In Proceedings of the 2018
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long Papers), pages 1195–1205, New Orleans,
Louisiana. Association for Computational
Linguistics.


Kristina Gulordava and Paola Merlo. 2015. Di-
achronic trends in word order freedom and
dependency length in dependency-annotated
corpora of Latin and ancient Greek. In Pro-
ceedings of the Third International Conference
on Dependency Linguistics (Depling 2015),
pages 121–130, Uppsala, Sweden. Uppsala
University, Uppsala, Sweden. https://doi
.org/10.18653/v1/N18-1108

Kristina Gulordava and Paola Merlo. 2016.
Multi-lingual dependency parsing evaluation:
A large-scale analysis of word order prop-
erties using artificial data. Transactions of
the Association for Computational Linguistics,
4:343–356. https://doi.org/10.1162
/tacl_a_00103

Francisco Guzmán, Peng-Jen Chen, Myle
Ott, Juan Pino, Guillaume Lample, Philipp
Koehn, Vishrav Chaudhary, and Marc’Aurelio
Ranzato. 2019. The FLORES Evaluation Da-
tasets for Low-Resource Machine Translation:
Nepali–English and Sinhala–English. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 6100–6113. https://doi.org/10
.18653/v1/D19-1632

Michael Hahn. 2020. Theoretical limitations of
self-attention in neural sequence models. Trans-
actions of the Association for Computational
Linguistics, 8:156–171. https://doi.org
/10.1162/tacl_a_00306

Hany Hassan, Anthony Aue, Chang Chen,
Vishal Chowdhary, Jonathan Clark, Christian
Federmann, Xuedong Huang, Marcin Junczys-
Dowmunt, William Lewis, Mu Li, Shujie Liu,
Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao
Qin, Frank Seide, Xu Tan, Fei Tian, Lijun
Wu, Shuangzhi Wu, Yingce Xia, Dongdong
zhang, Zhirui Zhang, y Ming Zhou. 2018.
Achieving human parity on automatic Chi-
nese to English news translation. arXiv preprint
arXiv:1803.05567.

Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Computation,
9(8):1735–1780. https://doi.org/10
.1162/neco.1997.9.8.1735, PubMed:
9377276

Hideki Isozaki, Tsutomu Hirao, Kevin Duh,
Katsuhito Sudoh, and Hajime Tsukada. 2010.
Automatic evaluation of translation quality for
distant language pairs. In Proceedings of the
2010 Conference on Empirical Methods in
Natural Language Processing, pages 944–952,
Cambridge, MA. Association for Computa-
tional Linguistics.

Pratik Joshi, Sebastin Santy, Amar Budhiraja,
Kalika Bali, and Monojit Choudhury. 2020.
The state and fate of linguistic diversity and
inclusion in the NLP world. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 6282–6293,
Online. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.acl-main.560

Philipp Koehn. 2005. Europarl: A parallel cor-
pus for statistical machine translation. In The
Tenth Machine Translation Summit Proceed-
ings of Conference, pages 79–86. International
Association for Machine Translation.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian
Li, Sergey Edunov, Marjan Ghazvininejad,
Mike Lewis, and Luke Zettlemoyer. 2020.
Multilingual denoising pre-training for neu-
ral machine translation. Transactions of the
Association for Computational Linguistics,
8:726–742. https://doi.org/10.1162
/tacl_a_00343

Minh-Thang Luong, Hieu Pham, and Christopher
D. Manning. 2015. Effective approaches to
attention-based neural machine translation. In
Empirical Methods in Natural Language Pro-
cessing (EMNLP), pages 1412–1421, Lisbon,
Portugal. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/D15-1166

Gary Lupyan and Morten H. Christiansen. 2002.
Case, word order, and language learnabil-
ity: Insights from connectionist modeling. In
Proceedings of the Twenty-Fourth Annual
Conference of the Cognitive Science Society.

Christopher D. Manning, Mihai Surdeanu, John
Bauer, Jenny Finkel, Steven J. Bethard, y
David McClosky. 2014. The Stanford CoreNLP

natural language processing toolkit. In Associ-
ation for Computational Linguistics (ACL) Sys-
tem Demonstrations, pages 55–60. https://
doi.org/10.3115/v1/P14-5010

Mitchell Marcus, Beatrice Santorini, and Mary
Ann Marcinkiewicz. 1993. Building a large
annotated corpus of English: The Penn
Treebank. https://doi.org/10.21236
/ADA273556

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135

Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 425–435. https://doi.org/10.18653/v1/D18-1039

Martin Popel, Marketa Tomkova, Jakub Tomek, Łukasz Kaiser, Jakob Uszkoreit, Ondřej Bojar, and Zdeněk Žabokrtský. 2020. Transforming machine translation: A deep learning system reaches news translation quality comparable to human professionals. Nature Communications, 11(1):4381. https://doi.org/10.1038/s41467-020-18073-9, PubMed: 32873773

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019. Studying the inductive biases of RNNs with synthetic variations of natural languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3532–3542, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1356

Benoît Sagot. 2013. Comparing complexity measures. In Computational Approaches to Morphological Complexity. Paris, France. Surrey Morphology Group.

Karthik Abinav Sankararaman, Soham De, Zheng Xu, W. Ronny Huang, and Tom Goldstein. 2020. Analyzing the effect of neural network architecture on training performance. In Proceedings of Machine Learning and Systems 2020, pages 9834–9845.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1162

Kaius Sinnemäki. 2008. Complexity trade-offs in core argument marking. Language Complexity, pages 67–88. John Benjamins. https://doi.org/10.1075/slcs.94.06sin

Dan I. Slobin. 1966. The acquisition of Russian as
a native language. The Genesis of Language: A
Psycholinguistic Approach, pages 129–148.

Dan I. Slobin and Thomas G. Bever. 1982. Children use canonical sentence schemas: A crosslinguistic study of word order and inflections. Cognition, 12(3):229–265. https://doi.org/10.1016/0010-0277(82)90033-6

Ke Tran, Arianna Bisazza, and Christof Monz. 2018. The importance of being recurrent for modeling hierarchical structure. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4731–4736, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1503

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 5998–6008.

Dingquan Wang and Jason Eisner. 2016. The galactic dependencies treebanks: Getting more data by synthesizing new languages. Transactions of the Association for Computational Linguistics, 4:491–505. https://doi.org/10.1162/tacl_a_00113

Adina Williams, Tiago Pimentel, Hagen Blix, Arya D. McCarthy, Eleanor Chodroff, and Ryan Cotterell. 2020. Predicting declension class from form and meaning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6682–6695, Online. Association for Computational Linguistics.

Yang Zhao, Jiajun Zhang, and Chengqing Zong. 2018. Exploiting pre-ordering for neural machine translation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

A Appendices

A.1 NMT Hyperparameters
In the toy parallel grammar experiments (§4), a batch size of 64 (sentences) and 1K max update steps are used for all models. We train the BiLSTM with a learning rate of 1, and the Transformer with a learning rate of 2 together with 40 warm-up steps, using noam learning rate decay. Dropout ratios of 0.3 and 0.1 are used in the BiLSTM and Transformer models, respectively. In the synthetic English variants experiments (§5), we set a constant learning rate of 0.001 for the BiLSTM. We also increased the batch size to 128, the number of warm-up steps to 80K, and the update steps to 2M for all models. Finally, for the 100k and 10k data-size experiments, we decreased the warm-up steps to 4K. During evaluation we chose the best-performing model on the validation set.
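For reference, these settings can be summarized schematically as follows; the dictionary layout and field names are ours and are not tied to any particular NMT toolkit's option names.

```python
# Schematic summary of the hyperparameters listed in A.1.
# Field names are illustrative only; they do not mirror a specific toolkit's flags.

TOY_GRAMMAR = {                 # toy parallel grammar experiments (Section 4)
    "batch_size": 64,           # sentences
    "max_updates": 1_000,
    "bilstm": {"learning_rate": 1.0, "dropout": 0.3},
    "transformer": {"learning_rate": 2.0, "dropout": 0.1,
                    "lr_schedule": "noam", "warmup_steps": 40},
}

SYNTHETIC_ENGLISH = {           # synthetic English variant experiments (Section 5)
    "batch_size": 128,
    "max_updates": 2_000_000,
    "warmup_steps": 80_000,     # lowered to 4,000 for the 100k / 10k data sizes
    "bilstm": {"learning_rate": 0.001, "lr_schedule": "constant"},
    "selection": "best checkpoint on the validation set",
}
```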

A.2 Challenge Set

The English-French challenge set used in this paper, available at https://github.com/arianna-bis/freeorder-mt, is generated by a small synchronous context-free grammar and contains 7,200 simple sentences consisting of a subject, a transitive verb, and an object (see Table 4). All sentences are in the present tense; half are affirmative, and half negative. All nouns in the grammar can plausibly act as both subject and object of the verbs, so that an MT system must rely on sentence structure to get perfect translation accuracy. The sentences are from a general domain, but we specifically choose nouns and verbs with little translation ambiguity that are well represented in the Europarl corpus: most have thousands of occurrences, while the rarest word has about 80. Example sentence (English side): 'The teacher does not respect the student.' and its reverse: 'The student does not respect the teacher.'

Nouns                          Verbs
president / président          thank / remercier
man / homme                    support / soutenir
woman / femme                  represent / représenter
minister / ministre            defend / défendre
candidate / candidat           welcome / saluer
secretary / secrétaire         invite / inviter
commissioner / commissaire     attack / attaquer
child / enfant                 respect / respecter
teacher / enseignant           replace / remplacer
student / étudiant             exploit / exploiter

Table 4: The English/French vocabulary used to generate the challenge set. Both singular and plural forms are used for each noun.
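As an illustration, the following sketch enumerates the English side of such a set from the vocabulary in Table 4. The helper names and the plural/agreement rules are our own assumptions; requiring the object lemma to differ from the subject lemma yields exactly 10 x 2 x 9 x 2 x 10 x 2 = 7,200 sentences.

```python
# Sketch: enumerate the English side of the challenge set (7,200 sentences).
# Vocabulary follows Table 4; helper names and morphology rules are our own assumptions.

NOUNS = ["president", "man", "woman", "minister", "candidate",
         "secretary", "commissioner", "child", "teacher", "student"]
VERBS = ["thank", "support", "represent", "defend", "welcome",
         "invite", "attack", "respect", "replace", "exploit"]

IRREGULAR_PLURALS = {"man": "men", "woman": "women", "child": "children"}

def pluralize(noun: str) -> str:
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith("y"):
        return noun[:-1] + "ies"        # secretary -> secretaries
    return noun + "s"

def noun_phrase(lemma: str, plural: bool) -> str:
    return "the " + (pluralize(lemma) if plural else lemma)

def generate():
    for subj in NOUNS:
        for subj_pl in (False, True):
            for obj in NOUNS:
                if obj == subj:          # assume distinct subject and object lemmas
                    continue
                for obj_pl in (False, True):
                    for verb in VERBS:
                        for negative in (False, True):
                            if negative:
                                aux = "do not" if subj_pl else "does not"
                                vp = f"{aux} {verb}"
                            else:
                                vp = verb if subj_pl else verb + "s"
                            yield (f"{noun_phrase(subj, subj_pl)} {vp} "
                                   f"{noun_phrase(obj, obj_pl)}.".capitalize())

sentences = list(generate())
assert len(sentences) == 7200
```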

A.3 Morphological Paradigms

The complete list of morphological paradigms used in this work is shown in Table 5. The implicit language with exponence (many:1) uses only the suffixes of the 1st (default) declension. The implicit language with exponence and flexivity (many:many) uses three declensions, assigned as
follows: First, the list of lemmas extracted from the training set is randomly split into three classes,13 with distribution 1st: 60%, 2nd: 30%, 3rd: 10%. Then, each core verb argument occurring in the corpus is marked with the suffix corresponding to its lemma's declension.

               Overt         Implicit
                             1st (default)   2nd     3rd
Unambiguous    .nsubj.sg     kar             par     pa
               .nsubj.pl     kon             pon     after
               .dobj.sg      kin             he      kit
               .dobj.pl      ker             et      ket
               .iobj.sg      ken             kez     ke
               .iobj.pl      kre             kr      re
Syncretic      .arg.sg       -               -       -
               .arg.pl       -               -       -

Table 5: The artificial morphological paradigms used in this work, extended from Ravfogel et al. (2019). 1st, 2nd, and 3rd are the declensions in the flexive language.
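For concreteness, this procedure can be sketched as follows; the function names and data layout are ours, and only the 1st-declension suffixes from Table 5 are spelled out.

```python
import random

def assign_declensions(lemmas, seed=0):
    """Randomly split the training-set lemmas into three declension
    classes with the 60% / 30% / 10% distribution described above."""
    rng = random.Random(seed)
    shuffled = list(lemmas)
    rng.shuffle(shuffled)
    n = len(shuffled)
    classes = {}
    for i, lemma in enumerate(shuffled):
        if i < 0.6 * n:
            classes[lemma] = 1          # 1st (default) declension
        elif i < 0.9 * n:
            classes[lemma] = 2          # 2nd declension
        else:
            classes[lemma] = 3          # 3rd declension
    return classes

def mark_argument(lemma, role, number, declension, suffixes):
    """Append the case suffix of the lemma's declension class, e.g.
    mark_argument('teacher', 'nsubj', 'sg', 1, SUFFIXES) -> 'teacherkar'."""
    return lemma + suffixes[declension][(role, number)]

# 1st-declension suffixes from Table 5 (the other declensions follow the same scheme):
SUFFIXES = {1: {("nsubj", "sg"): "kar", ("nsubj", "pl"): "kon",
                ("dobj", "sg"): "kin", ("dobj", "pl"): "ker",
                ("iobj", "sg"): "ken", ("iobj", "pl"): "kre"}}
```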

A.4 Effect of Data Size and Morphological Features: Detailed Results

Table 6 shows the detailed numerical results corresponding to the plots of Figure 2 in the main text.

Eparl-BLEU          1.9M    100k    10k
original            38.3    26.9    11.0
SOV                 37.9    25.3     8.8
SOV+overt           37.4    24.6     8.4
random              37.5    24.6     8.5
random+overt        37.3    24.1     7.8
random+implicit     37.3    24.3     7.1
random+declens      37.4    23.1     7.7

Eparl-RIBES         1.9M    100k    10k
original            84.9    80.1    67.5
SOV                 84.5    78.7    64.1
SOV+overt           84.5    78.4    63.1
random              84.2    77.7    61.7
random+overt        84.4    77.6    61.6
random+implicit     84.3    77.4    59.8
random+declens      84.3    77.1    61.3

Challenge-RIBES     1.9M    100k    10k
original            97.7    92.2    74.2
SOV                 97.2    89.8    69.5
SOV+overt           97.7    86.5    64.9
random              74.1    72.9    63.1
random+overt        98.1    84.5    57.4
random+implicit     97.5    85.4    54.8
random+declens      97.6    84.4    53.1

Table 6: Detailed results corresponding to the plots of Figure 2: EN*-FR Transformer NMT quality versus training data size (1.9M, 100k, or 10k sentence pairs). Source language variants: fixed-order (SOV) and free-order (random) with different case systems (+overt/implicit/declens). Scores averaged over three training runs.

13See Williams et al. (2020) for an interesting account of
how declension classes are actually partly predictable from
form and meaning.
