On the Difficulty of Translating Free-Order Case-Marking Languages
Arianna Bisazza
Ahmet Üstün
Stephan Sportel
Center for Language and Cognition
University of Groningen, The Netherlands
{a.bisazza, a.ustun}@rug.nl, research@spor.tel

Abstract
Identifying factors that make certain languages
harder to model than others is essential to reach
language equality in future Natural Language
Processing technologies. Free-order case-marking
languages, such as Russian, Latin, or Tamil,
have proved more challenging than fixed-order
languages for the tasks of syntactic parsing
and subject-verb agreement prediction. In this
work, we investigate whether this class of lan-
guages is also more difficult to translate by
state-of-the-art Neural Machine Translation
(NMT) models. Using a variety of synthetic
languages and a newly introduced translation
challenge set, we find that word order flexi-
bility in the source language only leads to a
very small loss of NMT quality, even though
the core verb arguments become impossible
to disambiguate in sentences without seman-
tic cues. The latter issue is indeed solved by
the addition of case marking. However, in
medium- and low-resource settings, the overall
NMT quality of fixed-order languages remains
unmatched.
1 Introduction
Despite the tremendous advances achieved in less
than a decade, Natural Language Processing re-
mains a field where language equality is far from
being reached (Joshi et al., 2020). In the field
of Machine Translation, modern neural models
have attained remarkable quality for high-resource
language pairs like German-English, Chinese-
English, or English-Czech, with a number of stud-
ies even claiming human parity (Hassan et al.,
2018; Bojar et al., 2018; Barrault et al., 2019;
Popel et al., 2020). These results may lead to the
unfounded belief that Neural Machine Transla-
tion (NMT) methods will perform equally well
in any language pair, provided similar amounts
of training data. In fact, several studies suggest
the opposite (Platanios et al., 2018; Ataman and
Federico, 2018; Bugliarello et al., 2020).
Why, then, do some language pairs have lower
translation accuracy? And, more specifically: Are
certain typological profiles more challenging for
current state-of-the-art NMT models? Every lan-
guage has its own combination of typological
properties, including word order, morphosyntac-
tic features, and more (Dryer and Haspelmath,
2013). Identifying language properties (or combi-
nations thereof) that pose major problems to the
current modeling paradigms is essential to reach
language equality in future MT (and other NLP)
technologies (Joshi et al., 2020), in a way that is
orthogonal to data collection efforts. Among oth-
ers, natural languages adopt different mechanisms
to disambiguate the role of their constituents:
Flexible order typically correlates with the pres-
ence of case marking and, vice versa, fixed order
is observed in languages with little or no case
marking (Comrie, 1981; Sinnemäki, 2008; Futrell
et al., 2015b). Morphologically rich languages
in general are known to be challenging for MT
at least since the times of phrase-based statisti-
cal MT (Birch et al., 2008) due to their larger
and sparser vocabularies, and remain challenging
even for modern neural architectures (Ataman and
Federico, 2018; Belinkov et al., 2017). By con-
trast, the relation between word order flexibility
and MT quality has not been directly studied to
our knowledge.
In this paper, we study this relationship using
strictly controlled experimental setups. Specifi-
cally, we ask:
• Are current state-of-the-art NMT systems
biased towards fixed-order languages?
• To what extent does case marking compen-
sate for the lack of a fixed order in the source
language?
Unfortunately, parallel data are scarce in most
of the world's languages (Guzmán et al., 2019), and
Transactions of the Association for Computational Linguistics, Vol. 9, pp. 1233–1248, 2021. https://doi.org/10.1162/tacl_a_00424
Action Editor: Alexandra Birch. Submission batch: 1/2021; Revision batch: 5/2021; Published 11/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Fixed:
  VSO: follows the little cat the friendly dog
  VOS: follows the friendly dog the little cat
Free+Case:
  follows the little cat#S the friendly dog#O
  OR: follows the friendly dog#O the little cat#S
Translation:
  de kleine kat volgt de vriendelijke hond

Table 1: Example sentence in different fixed/
flexible-order English-based synthetic languages
and their SVO Dutch translation. The subject in
each sentence is underlined. Artificial case mark-
ers start with #.
corpora in different languages are drawn from dif-
ferent domains. Exceptions exist, like the widely
used Europarl (Koehn, 2005), but represent a small
fraction of the large variety of typological feature
combinations attested in the world. This makes it
very difficult to run a large-scale comparative
study and isolate the factors of interest from, for
example, domain mismatch effects. As a solu-
tion, we propose to evaluate NMT on synthetic
languages (Gulordava and Merlo, 2016; Wang
and Eisner, 2016; Ravfogel et al., 2019) that dif-
fer from each other only by specific properties,
namely: the order of main constituents, or the
presence and nature of case markers (see example
in Table 1).
We use this approach to isolate the impact of
various source-language typological features on
MT quality and to remove the typical confounders
of corpus size and domain. Using a variety of syn-
thetic languages and a newly introduced challenge
set, we find that state-of-the-art NMT has little to
no bias towards fixed-order languages, but only
when a sizeable training set is available.
2 Free-order Case-marking Languages
The word order profile of a language is usually
represented by the canonical order of its main
constituents, (S)ubject, (O)bject, (V)erb. For in-
stance, English and French are SVO languages,
while Turkish and Hindi are SOV. Other, less com-
monly attested, word orders are VSO and VOS,
whereas OSV and OVS are extremely rare (Dryer,
2013). Although many other word order features
exist (e.g., noun/adjective), they often correlate
with the order of main constituents (Greenberg,
1963).
A different, but likewise important dimension
is that of word order freedom (or flexibility).
Languages that primarily rely on the position
of a word to encode grammatical roles typically
display rigid orders (like English or Mandarin
Chinese), while languages that rely on case mark-
ing can be more flexible allowing word order to
express discourse-related factors like topicaliza-
tion. Examples of highly flexible-order languages
include languages as diverse as Russian, Hungar-
ian, Latin, Tamil, and Turkish.1
In the field of psycholinguistics, due to the his-
torical influence of English-centered studies, word
order has long been considered the primary and
most natural device through which children learn
to infer syntactic relationships in their language
(Slobin, 1966). However, cross-linguistic studies
have later revealed that children are equally pre-
pared to acquire both fixed-order and inflectional
languages (Slobin and Bever, 1982).
Coming to computational linguistics, data-
driven MT and other NLP approaches were also
historically developed around languages with re-
markably fixed orders and very simple to moder-
ately simple morphological systems, like English
or French. Luckily, our community has been giv-
ing increasing attention to more and more lan-
guages with diverse typologies, especially in the
last decade. So far, previous work has found that
free-order languages are more challenging for
parsing (Gulordava and Merlo, 2015, 2016) and
subject-verb agreement prediction (Ravfogel
et al., 2019) than their fixed-order counterparts.
This raises the question of whether word order
flexibility also negatively affects MT quality.
Before the advent of modern NMT, Birch et al.
(2008) used the Europarl corpus to study how
various language properties affected the quality
of phrase-based Statistical MT. Amount of re-
ordering, target morphological complexity, and
historical relatedness of source and target lan-
guages were identified as strong predictors of
MT quality. Recent work by Bugliarello et al.
(2020), however, has failed to show a correlation
between NMT difficulty (measured by a novel
information-theoretic metric) and several linguis-
tic properties of source and target languages,
including Morphological Counting Complexity
(Sagot, 2013) and Average Dependency Length
(Futrell et al., 2015a). While that work specifically
1See Futrell et al. (2015b) for detailed figures of word
order freedom (measured by the entropy of subject and
object dependency relation order) in a diverse sample of 34
languages.
aimed at ensuring cross-linguistic comparability,
the sample on which the linguistic properties
could be computed (Europarl) was rather small
and not very typologically diverse, leaving our re-
search questions open to further investigation. In
this paper, we therefore opt for a different meth-
odology: namely, synthetic languages.
3 Methodology
Synthetic Languages This paper presents two
sets of experiments: In the first (§4), we create
parallel corpora using very simple and predict-
able artificial grammars and small vocabularies
(Lupyan and Christiansen, 2002). See an example
in Table 1. By varying the position of subject/
verb/object and introducing case markers to the
source language, we study the biases of two NMT
architectures in optimal training data conditions
and a fully controlled setup, that is, without any
other linguistic cues that may disambiguate con-
stituent roles. In the second set of experiments
(§5), we move to a more realistic setup using syn-
thetic versions of the English language that differ
from it in only one or few selected typological
features (Ravfogel et al., 2019). For example, the
original sentence’s order (SVO) is transformed to
different orders, like SOV or VSO, based on its
syntactic parse tree.
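The constituent reordering can be sketched as follows. This is an illustrative reimplementation with invented names and a simplified clause representation, not the authors' actual generator (which extends the tool of Ravfogel et al., 2019):

```python
# Illustrative sketch: permute the main constituents of a clause whose
# (S)ubject, (V)erb, and (O)bject spans were identified by a parser.
def reorder(clause, order):
    """clause maps 'S'/'V'/'O' to word lists; order is e.g. 'SOV'."""
    return " ".join(word for role in order for word in clause[role])

clause = {"S": ["the", "woman"], "V": ["says"], "O": ["her", "sisters"]}
print(reorder(clause, "SOV"))  # the woman her sisters says
print(reorder(clause, "VSO"))  # says the woman her sisters
```

In the real pipeline the reordering is applied recursively to every clause of the dependency tree, not just the main one.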
In both cases, typological variations are intro-
duced in the source side of the parallel corpora,
while the target language remains fixed. In this
way, we avoid the issue of non-comparable BLEU
scores across different target languages. Lastly,
we make the simplifying assumption that, when
verb-argument order varies from the canonical or-
der in a flexible-order language, it does so in a
totally arbitrary way. Although this is rarely true in
practice, as word order may be predictable given
pragmatics or other factors, we focus here on ‘‘the
extent to which word order is conditioned on the
syntactic and compositional semantic properties
of an utterance’’ (Futrell et al., 2015b).
Translation Models We consider two widely
used NMT architectures that crucially differ in
their encoding of positional information: (i) The re-
current sequence-to-sequence BiLSTM with at-
tention (Bahdanau et al., 2015; Luong et al., 2015)
processes the input symbols sequentially and has
each hidden state directly conditioned on that
of the previous (or following, for the backward
LSTM) timestep (Elman, 1990; Hochreiter and
Schmidhuber, 1997). (ii) The non-recurrent, fully
attention-based Transformer (Vaswani et al.,
2017) processes all input symbols in parallel, re-
lying on dedicated embeddings to encode each
input's position.2 Transformer has nowadays
surpassed recurrent encoder-decoder models in
terms of generic MT quality. Moreover, Choshen
and Abend (2019) have recently shown that
Transformer-based NMT models are indifferent
to the absolute order of source words, at least
when equipped with learned positional embed-
dings. On the other hand, the lack of recurrence
in Transformers has been linked to a limited abil-
ity to capture hierarchical structure (Tran et al.,
2018; Hahn, 2020). To our knowledge, no pre-
vious work has studied the biases of either ar-
chitecture towards fixed-order languages in a
systematic manner.

4 Toy Parallel Grammar

We start by evaluating our models on a pair of toy
languages inspired by the English-Dutch pair and
created using a Synchronous Context-Free Gram-
mar (Chiang and Knight, 2006). Each sentence
consists of a simple clause with a transitive verb,
subject, and object. Both arguments are singu-
lar and optionally modified by an adjective. The
source vocabulary contains 6 nouns, 6 verbs, 6
adjectives, and the complete corpus contains 10k
generated sentence pairs. Working with such a
small, finite grammar allows us to simulate an
otherwise impossible situation where the NMT
model can be trained on (almost) the totality of a
language's utterances, canceling out data sparsity
effects.3

Source Language Variants We consider three
source language variants, illustrated in Table 1:

• fixed-order VSO;

• fixed-order VOS;

• mixed-order (randomly chosen between VSO
or VOS) with nominal case marking.

2We use sinusoidal embeddings (Vaswani et al., 2017). All
our models are built using OpenNMT: https://github
.com/OpenNMT/OpenNMT-py.
3Data and code to replicate the toy grammar experiments
in this section are available at https://github.com
/573phn/cm-vs-wo.
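A toy grammar of this kind can be sketched roughly as follows. The vocabulary items are invented, and for brevity every argument receives an adjective (in the actual grammar adjectives are optional):

```python
import random

# Hypothetical vocabulary standing in for the 6-word classes of the grammar.
nouns = ["cat", "dog", "bird"]
adjs = ["little", "friendly", "happy"]
verbs = ["follows", "sees", "greets"]

def sample_clause(order, case_marking=False):
    """Generate one transitive clause in the given constituent order."""
    subj = ["the", random.choice(adjs), random.choice(nouns)]
    obj = ["the", random.choice(adjs), random.choice(nouns)]
    if case_marking:  # unambiguous nominal suffixes, as in Table 1
        subj[-1] += "#S"
        obj[-1] += "#O"
    parts = {"S": subj, "V": [random.choice(verbs)], "O": obj}
    return " ".join(w for role in order for w in parts[role])

print(sample_clause("VSO"))
# mixed-order variant: the order is drawn anew for each sentence
print(sample_clause(random.choice(["VSO", "VOS"]), case_marking=True))
```

Pairing each generated source clause with its fixed SVO Dutch rendering yields the parallel corpus.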
Figure 1: Toy language NMT sentence-level accuracy on validation set by number of training epochs. Source
languages: fixed-order VSO, fixed-order VOS, and mixed-order (VSO/VOS) with case marking. Target language:
always fixed SVO. Each experiment is repeated five times, and averaged results are shown.
We choose these word orders so that, in the
flexible-order corpus, the only way to disambig-
uate argument roles is case marking, realized by
simple unambiguous suffixes (#S and #O). The
target language is always fixed SVO. The same
random split (80/10/10% training/validation/test)
is applied to the three corpora.
NMT Setup As recurrent model, we trained
a 2-layer BiLSTM with attention (Luong et al.,
2015) with a hidden layer size of 500. As Transformer
models, we trained one using the standard 6-layer
configuration (Vaswani et al., 2017) and a smaller
one with only 2 layers, given the simplicity of
the languages. All models are trained at the word
level using the complete vocabulary. More hyper-
parameters are provided in Appendix A.1. Note
that our goal is not to compare LSTM and Trans-
former accuracy to each other, but rather to
observe the different trends across fixed- and
flexible-order language variants. Given the small
vocabulary, we use sentence-level accuracy in-
stead of BLEU for evaluation.
Results As shown in Figure 1, all models
achieve perfect accuracy on all language pairs after
1000 training steps, except for the Large Trans-
former on the free-order language, likely due to
overparametrization (Sankararaman et al., 2020).
These results demonstrate that our NMT archi-
tectures are equally capable of modeling trans-
lation of both types of language, when all other
factors of variation are controlled for.
Nevertheless, a pattern emerges when looking
at the learning curves within each plot: Während
the two fixed-order languages have very similar
learning curves, the free-order language with case
markers always requires slightly more training
steps to converge. This is also the case, albeit to
a lesser extent, when the mixed-order corpus is
pre-processed by splitting all case suffixes from
the nouns (extra experiment not shown in the plot).
This trend is noteworthy, given the simplicity of
our grammars and the transparency of the case
system. As our training sets cover a large majority
of the languages, this result might suggest that
free-order natural languages need larger training
datasets to reach a translation quality similar to
that of their fixed-order counterparts. In §5 we validate
this hypothesis on more naturalistic language data.
5 Synthetic English Variants
Experimenting with toy languages has its short-
comings, like the small vocabulary size and non-
realistic distribution of words and structures. In
this section, we follow the approach of Ravfogel
et al. (2019) to validate our findings in a less con-
trolled but more realistic setup. Specifically, we
create several variants of the Europarl English-
French parallel corpus where the source sentences
are modified by changing word order and adding
artificial case markers. We choose French as target
language because of its fixed order, SVO, and its
relatively simple morphology.4 As Indo-European
languages, English and French are moderately re-
lated in terms of syntax and vocabulary while
being sufficiently distant to avoid a word-by-word
translation strategy in many cases.
Source language variants are obtained by trans-
forming the syntactic tree of the original sentences.
While Ravfogel et al. (2019) could rely on the
Penn Treebank (Marcus et al., 1993) for their
monolingual task of agreement prediction, we
4According to the Morphological Counting Complexity
(Sagot, 2013) values reported by Cotterell et al. (2018),
English scores 6 (least complex), Dutch 26, French 30,
Spanish 71, Czech 195, and Finnish 198 (most complex).
Original (no case):
The woman says her sisters often invited her for dinner.
SOV (no case):
The woman her sisters her often invited for dinner say.
SOV, syncretic case marking (overt):
The woman.arg.sg her sisters.arg.pl she.arg.sg often in-
vited.arg.pl for dinner say.arg.sg.
SOV, unambiguous case marking (overt):
The woman.nsubj.sg her sisters.nsubj.pl she.dobj.sg often
invited.dobj.sg.nsubj.pl for dinner say.nsubj.sg.
SOV, unambiguous case (implicit):
The womankar her sisterskon shekin often invitedkinkon
for dinner saykar.
SOV, unambiguous case (implicit with declensions):
The womankar her sisterspon shekit often invitedkitpon for
dinner saykar.
French translation:
La femme dit que ses sœurs l'invitaient souvent à dîner.
Table 2: Examples of synthetic English variants
and their (common) French translation. The full
list of suffixes is provided in Appendix A.3.
instead need parallel data. For this reason, we
parse the English side of the Europarl v.7 cor-
pus (Koehn, 2005) using the Stanza dependency
parser (Qi et al., 2020; Manning et al., 2014).
After parsing, we adopt a modified version of the
synthetic language generator by Ravfogel et al.
(2019) to create the following English variants:5
• Fixed-order: either SVO, SOV, VSO, or
VOS;6

• Free-order: for each sentence in the corpus,
one of the six possible orders of (Subject,
Object, Verb) is chosen randomly;

• Shuffled words: all source words are shuf-
fled regardless of their syntactic role. This is
our lower bound, measuring the reordering
ability of a model in the total absence of
source-side order cues (akin to bag-of-words
input).

To allow for a fair comparison with the artifi-
cial case-marking languages, we remove number
agreement features from verbs in all the above
variants (cf. says → say in Table 2).

To answer our second research question, we
experiment with two artificial case systems pro-
posed by Ravfogel et al. (2019) and illustrated in
Table 2 (overt suffixes):

• Unambiguous case system: suffixes indi-
cating argument role (subject/object/indirect
object) and number (singular/plural) are
added to the heads of noun and verb phrases;

• Syncretic case system: suffixes indicating
number but not grammatical function are
added to the heads of main arguments, pro-
viding only partial disambiguation of argu-
ment roles. This system is inspired from
subject/object syncretism in Russian.

Syncretic case systems were found to be roughly as
common as non-syncretic ones in a large sample of
almost 200 world languages (Baerman and Brown,
2013). Case marking is always combined with the
fully flexible order of main constituents. As in
Ravfogel et al. (2019), English number marking
is removed from verbs and their arguments before
adding the artificial suffixes.

5Our revised language generator is available at https://
github.com/573phn/rnn_typology.
6To keep the number of experiments manageable, we
omit object-initial languages, which are significantly less
attested among world languages (Dryer, 2013).

5.1 NMT Setup
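The contrast between the two overt suffix schemes can be sketched as follows. This is a minimal illustration: the suffix spellings follow Table 2, but the function itself is our own simplification:

```python
def mark(word, role, number, syncretic=False):
    """Append an artificial case suffix to an argument head.
    Unambiguous system: role + number (e.g. woman.nsubj.sg).
    Syncretic system: number only, leaving the role ambiguous."""
    if syncretic:
        return f"{word}.arg.{number}"
    return f"{word}.{role}.{number}"

print(mark("woman", "nsubj", "sg"))                    # woman.nsubj.sg
print(mark("sisters", "nsubj", "pl", syncretic=True))  # sisters.arg.pl
```

Under the syncretic scheme, subject and object heads with the same number receive identical suffixes, which is precisely what makes their roles recoverable only from other cues.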
Models As recurrent model, we used a 3-layer
BiLSTM with hidden size of 512 and MLP at-
tention (Bahdanau et al., 2015). The Transformer
model has the standard 6-layer configuration with
hidden size of 512, 8 attention heads, and sinu-
soidal positional encoding (Vaswani et al., 2017).
All models use subword representation based on
32k BPE merge operations (Sennrich et al., 2016),
except in the low-resource setup where this is
reduced to 10k operations. More hyperparameters
are provided in Appendix A.1.
Data and Evaluation We train our models on
various subsets of the English-French Europarl
corpus: 1.9M sentence pairs (high-resource), 100K
(medium-resource), and 10K (low-resource). For
evaluation, we use 5K sentences randomly held
out from the same corpus. Given the importance
of word order to assess the correct translation
of verb arguments into French, we compute the
reordering-focused RIBES7 metric (Isozaki et al.,
2010) in addition to the more commonly used
BLEU (Papineni et al., 2002). In each experi-
ment, the source side of training and test data is
transformed using the same procedure whereas the
target side remains unchanged. We repeat each ex-
periment 3 times (or 4 for languages with random
order choice) and report the averaged results.
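RIBES combines a rank-correlation term over aligned word positions with precision-based penalties. The sketch below shows only the core rank-correlation idea, a normalized Kendall's tau computed over words that occur exactly once in both sentences; it is not the full metric:

```python
from itertools import combinations

def normalized_kendall_tau(ref_words, hyp_words):
    """Fraction of word pairs whose relative order in the hypothesis
    matches the reference, over words unique to both sentences."""
    ranks = [ref_words.index(w) for w in hyp_words
             if hyp_words.count(w) == 1 and ref_words.count(w) == 1]
    pairs = list(combinations(ranks, 2))
    if not pairs:
        return 0.0
    return sum(a < b for a, b in pairs) / len(pairs)

ref = "le president remercie le ministre".split()
print(normalized_kendall_tau(ref, ref))  # 1.0: monotone output
print(normalized_kendall_tau(
    ref, "le ministre remercie le president".split()))  # 0.0: arguments swapped
```

This is why RIBES is sensitive to exactly the subject/object inversions that BLEU's n-gram precision can miss.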
5.2 Challenge Set
Besides syntactic structure, natural language of-
ten contains semantic and collocational cues that
help disambiguate the role of an argument. Small
BLEU/RIBES differences between our language
variants may indicate actual robustness of a model
to word order flexibility, but may also indicate that
a model relies on those cues rather than on syntac-
tic structure (Gulordava et al., 2018). To discern
these two hypotheses, we create a challenge set of
7,200 simple affirmative and negative sentences
where swapping subject and object leads to an-
other plausible sentence.8 Each English sentence
and its reverse are included in the test set together
with the respective translations, as for example:
(1)
(a) The president thanks the minister. /
Le président remercie le ministre.
(b) The minister thanks the president. /
Le ministre remercie le président.
The source side is then processed as explained in
§5 and translated by the NMT model trained on
the corresponding language variant. Thus, trans-
lation quality on this set reflects the extent to
which NMT models have robustly learned to de-
tect verb arguments and their roles independently
from other cues, which we consider an important
sign of linguistic generalization ability. For space
constraints we only present RIBES scores on the
challenge set.9
7BLEU captures local word-order errors only indirectly
(lower precision of higher-order n-grams) and does not cap-
ture long-range word-order errors at all. In contrast, RIBES
directly measures correlation between the word ranks in the
reference and those in the MT output.
8More details can be found in Appendix A.2. We release
the challenge set at https://github.com/arianna
-bis/freeorder-mt.
9We also computed BLEU scores: They strongly correlate
with RIBES but fluctuate more due to the larger effect of
lexical choice.
5.3 High-Resource Results
Table 3 reports the high-resource setting results.
The first row (original English to French) is given
only for reference and shows the overall highest
results. The BLEU drop observed when moving
to any of the fixed-order variants (including SVO)
is likely due to parsing flaws resulting in awkward
reorderings. As this issue affects all our synthetic
variants, it does not undermine the validity of our
findings. For clarity, we center our main discus-
sion on the Transformer results and comment on
the BiLSTM results at the end of this section.
Fixed-Order Variants All four tested fixed-
order variants obtain very similar BLEU/RIBES
scores on the Europarl-test. This is in line with pre-
vious work in NMT showing that linguistically
motivated pre-ordering leads to small gains (Zhao
et al., 2018) or none at all (Du and Way, 2017),
and that Transformer-based models are not bi-
ased towards monotonic translation (Choshen and
Abend, 2019). On the challenge set, scores are
slightly more variable but a manual inspection
reveals that this is due to different lexical choices,
while word order is always correct for this group
of languages. To sum up, in the high-resource
setup, our Transformer models are perfectly able
to disambiguate the core argument roles when
these are consistently encoded by word order.
Fixed-Order vs Random-Order Somewhat sur-
prisingly, the Transformer results are only
marginally affected by the random ordering of
verb and core arguments. Recall that in the ‘Ran-
dom’ language all six possible permutations of
(S,V,O) are equally likely. Thus, Transformer
shows an excellent ability to reconstruct the cor-
rect constituent order in the general-purpose test
set. The picture is very different on the challenge
set, where RIBES drops severely from 97.6 Zu
74.1. These low results were to be expected given
the challenge set design (it is impossible even for
a human to recognize subject from object in the
‘Random, no case’ challenge set). Nevertheless,
they demonstrate that the general-purpose set can-
not tell us whether an NMT model has learned to
reliably exploit syntactic structure of the source
language, because of the abundant non-syntactic
cues. Indeed, even when all source words are
shuffled, Transformer still achieves a respectable
25.8/71.2 BLEU/RIBES on the Europarl-test.
English*→French, Large Training (1.9M)

                              BI-LSTM                        TRANSFORMER
                       Europarl-Test  Challenge       Europarl-Test  Challenge
                       BLEU   RIBES   RIBES           BLEU   RIBES   RIBES
Original English       39.4   85.0    98.0            38.3   84.9    97.7
Fixed Order:
  S-V-O                38.3   84.5    98.1            37.7   84.6    98.0
  S-O-V                37.6   84.2    97.7            37.9   84.5    97.2
  V-S-O                38.0   84.2    97.8            37.8   84.6    98.0
  V-O-S                37.8   84.0    98.0            37.6   84.3    97.2
  Average (fixed)   37.9±0.4 84.2±0.3 97.9±0.2     37.8±0.1 84.5±0.1 97.6±0.4
Flexible Order:
  Random, no case      37.1   83.7    75.1            37.5   84.2    74.1
  Random + syncretic case 36.9 83.6   75.4            37.3   84.2    84.4
  Random + unambig. case 37.3 83.9    97.7            37.3   84.4    98.1
  Shuffle all words    18.5   65.2    79.4            25.8   71.2    83.2

Table 3: Translation quality from various English-based synthetic languages into standard French,
using the largest training data (1.9M sentences). NMT architectures: 3-layer BiLSTM seq-to-seq with
attention; 6-layer Transformer. Europarl-Test: 5K held-out Europarl sentences; Challenge set: see §5.2.
All scores are averaged over three training runs.
Case Marking The key comparison in our
study lies between fixed-order and free-order
case-marking languages. Here, we find that case
marking can indeed restore near-perfect accuracy
on the challenge set (98.1 RIBES). However,
this only happens when the marking system is
completely unambiguous, which, as already men-
tioned, is true for only about a half of the real
case-marking languages (Baerman and Brown,
2013). Indeed, the syncretic system visibly im-
proves quality on the challenge set (74.1 Zu 84.4
RIBES) but remains far behind the fixed-order
score (97.6). In terms of overall NMT quality
(Europarl-test), fixed-order languages score only
marginally higher than the free-order case-
marking ones, regardless of the unambiguous/
syncretic distinction. Thus our finding that Trans-
former NMT systems are equally capable of mod-
eling the two types of languages (§4) is also
confirmed with more naturalistic language data.
That said, we will show in Section 5.4 that this
positive finding is conditional on the availability
of large amounts of training samples.
BiLSTM vs Transformer The LSTM-based re-
sults generally correlate with the Transformer
results discussed above, however our recurrent
models appear to be slightly more sensitive to
changes in the source-side order, in line with
previous findings (Choshen and Abend, 2019).
Specifically, translation quality on Europarl-test
fluctuates slightly more than Transformer among
different fixed orders, with the most monotonic
order (SVO) leading to the best results. When all
words are randomly shuffled, BiLSTM scores
drop much more than Transformer. However,
when comparing the fixed-order variants to the
ones with free order of main constituents, BiL-
STM shows only a slightly stronger preference
for fixed-order, compared to Transformer. This
suggests that, by experimenting with arbitrary
permutations, Choshen and Abend (2019) might
have overestimated the bias of recurrent NMT
towards more monotonic translation, whereas the
more realistic combination of constituent-level re-
ordering with case marking used in our study is
not so problematic for this type of model.
Interestingly, on the challenge set, BiLSTM and
Transformer perform on par, with the notable ex-
ception that syncretic case is much more difficult
for the BiLSTM model. Our results agree with the
large drop of subject-verb agreement prediction
accuracy observed by Ravfogel et al. (2019) when
experimenting with the random order of main
constituents. However, their scores were also low
for SOV and VOS, which is not the case in our
NMT experiments. Besides the fact that our chal-
lenge set only contains short sentences (hence no
long dependencies and few agreement attractors),
our task is considerably different in that agreement
only needs to be predicted in the target language,
which is fixed-order SVO.
Summary Our results so far suggest that state-
of-the-art NMT models, especially if Transformer-
based, have little or no bias towards fixed-order
languages. In what follows, we study whether this
finding is robust to differences in data size, type
of morphology, and target language.
5.4 Effect of Data Size and
Morphological Features
Data Size The results shown in Table 3 represent
a high-resource setting (almost 2M training sen-
tences). While recent successes in cross-lingual
transfer learning alleviate the need for labeled
data (Liu et al., 2020), their success still de-
pends on the availability of large unlabeled data
as well as other, yet to be explained, language
properties (Joshi et al., 2020). We then ask: Do
free-order case-marking languages need more data
than fixed-order non-case-marking ones to reach
similar NMT quality? We simulate a medium-
and low-resource scenario by sampling 100K and
10K training sentences, respectively, from the
full Europarl data. To reduce the number of ex-
periments, we only consider Transformer with
one fixed-order language variant (SOV)10 and ex-
clude syncretic case marking. To disentangle the
effect of word order from that of case mark-
ing on low-resource translation quality, we also
experiment with a language variant combining
fixed-order (SOV) and case marking. Results are
shown in Figure 2 and discussed below.
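The subsampling step above can be illustrated with a minimal Python sketch; the function name and fixed seed are our own, not part of the paper's pipeline:

```python
import random

def sample_parallel(src_lines, tgt_lines, n, seed=42):
    """Draw n aligned sentence pairs from a parallel corpus,
    simulating a medium- (100K) or low-resource (10K) setting.
    A fixed seed keeps the subsample identical across runs."""
    assert len(src_lines) == len(tgt_lines)
    rng = random.Random(seed)
    idx = rng.sample(range(len(src_lines)), n)
    return [src_lines[i] for i in idx], [tgt_lines[i] for i in idx]

# toy corpus standing in for the full Europarl data
src = [f"src sentence {i}" for i in range(1000)]
tgt = [f"tgt sentence {i}" for i in range(1000)]
small_src, small_tgt = sample_parallel(src, tgt, 100)
# every sampled pair keeps its original alignment
assert all(s.split()[-1] == t.split()[-1] for s, t in zip(small_src, small_tgt))
```

Sampling the same sentence indices for every source-language variant keeps the comparison across variants controlled.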
Morphological Features The artificial case
systems used so far included easily separable
suffixes with a 1:1 mapping between grammat-
ical categories and morphemes (e.g., .nsubj.sg,
.dobj.pl) reminiscent of agglutinative morpholo-
gies. Many world languages, Jedoch, do not
comply to this 1:1 mapping principle but dis-
play exponence (multiple categories conveyed by
10We choose SOV because it is a commonly attested word
order and is different from that of the target language, thereby
requiring some non-trivial reorderings during translation.
one morpheme) and/or flexivity (the same cat-
egory expressed by various, lexically determined,
morphemes). Well-studied examples of languages
with case+number exponence include Russian and
Finnish, while flexive languages include, again,
Russian and Latin. Motivated by previous findings
on the impact of fine-grained morphological fea-
tures on language modeling difficulty (Gerz et al.,
2018), we experiment with three types of suffixes
(see examples in Table 2):
• Overt: number and case are denoted by easily
separable suffixes (e.g., .nsubj.sg, .dobj.pl)
similar to agglutinative languages (1:1);
• Implicit: the combination of number and
case is expressed by unique suffixes without
internal structure (e.g., kar for .nsubj.sg, ker
for .dobj.pl) similar to fusional languages.
This system displays exponence (many:1);
• Implicit with declensions: like the previous,
but with three different paradigms each ar-
bitrarily assigned to a different subset of the
lexicon. This system displays exponence and
flexivity (many:many).
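The three suffix systems above can be sketched as simple lookup tables. This is our own illustration, with suffix strings taken from the paradigms in Appendix A.3 where shown; only subject and object cases are included:

```python
# Illustration of the three artificial case systems. The suffix strings
# follow the paradigms listed in Appendix A.3; the helper itself is our
# own simplification (subject/object cases only).

OVERT = {("nsubj", "sg"): ".nsubj.sg", ("nsubj", "pl"): ".nsubj.pl",
         ("dobj", "sg"): ".dobj.sg",  ("dobj", "pl"): ".dobj.pl"}

# Implicit (many:1): one unanalyzable suffix per case+number combination.
IMPLICIT = {("nsubj", "sg"): "kar", ("nsubj", "pl"): "kon",
            ("dobj", "sg"): "kin",  ("dobj", "pl"): "ker"}

# Implicit with declensions (many:many): the suffix additionally depends
# on the lexically assigned declension class of the noun.
DECLENSIONS = {1: IMPLICIT,
               2: {("nsubj", "sg"): "par", ("nsubj", "pl"): "pon",
                   ("dobj", "sg"): "it",   ("dobj", "pl"): "et"},
               3: {("nsubj", "sg"): "pa",  ("nsubj", "pl"): "po",
                   ("dobj", "sg"): "kit",  ("dobj", "pl"): "ket"}}

def mark(noun, case, number, system="overt", declension=1):
    if system == "overt":
        return noun + OVERT[(case, number)]
    if system == "implicit":
        return noun + IMPLICIT[(case, number)]
    return noun + DECLENSIONS[declension][(case, number)]

print(mark("teacher", "nsubj", "sg"))                # teacher.nsubj.sg
print(mark("teacher", "nsubj", "sg", "implicit"))    # teacherkar
print(mark("teacher", "nsubj", "sg", "declens", 3))  # teacherpa
```

The three systems mark exactly the same grammatical information; only the transparency of the form-to-category mapping changes.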
A complete overview of our morphological
paradigms is provided in Appendix A.3. All our
languages have moderate inflectional synthesis
and, in terms of fusion, are exclusively concatenative.
Despite this, the effect on vocabulary size is
substantial: a 180% increase with overt and implicit
case marking, and a 250% increase with implicit
marking with declensions (in the full data setting).
Results Results are shown in the plots of
Figur 2 (detailed numerical scores are given
in Appendix A.4). We find that reducing train-
ing size has, unsurprisingly, a major effect
on translation quality. Among source language
variants, fixed-order obtains the highest qual-
ity across all setups. In terms of BLEU (2(A)),
the spread among variants increases somewhat
with less data; however, differences are small.
A clearer picture emerges from RIBES (2(B)),
where less data clearly leads to more disparity.
This is already visible in the 100k setup, with
the fixed SOV language dominating the others.
Case marking, despite being necessary to disam-
biguate argument roles in the absence of semantic
cues, does not improve translation quality and
Figure 2: EN*-FR Transformer NMT quality versus training data size (x-axis). Source language variants:
Fixed-order (SOV) and free-order (random) with different case systems (r+overt/implicit/declens). Scores
averaged over three training runs. Detailed numerical results are provided in Appendix A.4.
even degrades it in the low-resource setup. Look-
ing at the challenge set results (2(C)) we see
that the free-order case-marking languages are
clearly disadvantaged: In the mid-resource setup,
case marking improves substantially over the
underspecified random,no-case language but re-
mains far behind fixed-order. In low-resource,
case marking notably hurts quality even in
comparison with the underspecified language.
These results thus demonstrate that free-order
case-marking languages require more data than
their fixed-order counterparts to be accurately
translated by state-of-the-art NMT.11 Our ex-
periments also show that this greater learning
difficulty is not only due to case marking (and
subsequent data sparsity), but also to word or-
der flexibility (compare sov+overt to r+overt in
Figure 2).
Regarding different morphology types, we do
not observe a consistent trend in terms of overall
translation quality (Europarl-test): in some cases,
the richest morphology (with declensions) slightly
outperforms the one without declensions—a re-
sult that would deserve further exploration. On
the other hand, results on the challenge set, where
11In the light of this finding, it would be interesting to
revisit the evaluation of Bugliarello et al. (2020) in relation
to varying data sizes.
Figure 3: Transformer results for more target languages
(100k training size). Scores averaged over 2 runs.
most words are case-marked, show that morpho-
logical richness inversely correlates with transla-
tion quality when data is scarce. We postulate that
our artificial morphologies may be too limited in
scope (only 3-way case and number marking) to
impact overall translation quality and leave the
investigation of richer inflectional synthesis to
future work.
5.5 Effect of Target Language
All results so far involved translation into a fixed-order
(SVO) language without case marking. To
verify the generality of our findings, we repeat a
subset of experiments with the same synthetic En-
glish variants, but using Czech or Dutch as target
languages. Czech has rich fusional morphology
including case marking, and very flexible order.
Dutch has simple morphology (no case marking)
and moderately flexible, syntactically determined
order.12
Figure 3 shows the results with 100k train-
ing sentences. In terms of BLEU, differences are
even smaller than in English-French. In terms of
RIBES, trends are similar across target languages,
with the fixed SOV source language obtaining
best results and the case-marked source language
obtaining worst results. This suggests that the ma-
jor findings of our study are not due to the specific
choice of French as the target language.
6 Related Work
The effect of word order flexibility on NLP model
performance has been mostly studied in the field
of syntactic parsing, for example, using Aver-
age Dependency Length (Gildea and Temperley,
2010; Futrell et al., 2015A) or head-dependent or-
der entropy (Futrell et al., 2015B; Gulordava and
Merlo, 2016) as syntactic correlates of word order
freedom. Related work in language modeling has
shown that certain languages are intrinsically more
difficult to model than others (Cotterell et al.,
2018; Mielke et al., 2019) and has furthermore
studied the impact of fine-grained morphology
Merkmale (Gerz et al., 2018) on LM perplexity.
Regarding the word order biases of seq-to-seq
Modelle, Chaabouni et al. (2019) use miniature
languages similar to those of Section 4 to study
the evolution of LSTM-based agents in a sim-
ulated iterated learning setup. Their results in a
standard ‘‘individual learning’’ setup show, like
ours, that a free-order case-marking toy language
can be learned just as well as a fixed-order one,
confirming earlier results obtained by simple
Elman networks trained for grammatical role
classification (Lupyan and Christiansen, 2002).
Transformer was not included in these studies.
Choshen and Abend (2019) measure the ability
of LSTM- and Transformer-based NMT to model
a language pair where the same arbitrary (nicht-
syntactically motivated) permutation is applied to
all source sentences. They find that Transformer
is largely indifferent to the order of source words
(provided this is fixed and consistent across train-
ing and test set) but nonetheless struggles to
12Dutch word order is very similar to German, with the
position of S, V, and O depending on the type of clause.
translate long dependencies actually occurring in
natural data. They do not directly study the effect
of order flexibility.
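Their setup can be illustrated with a short sketch (our own minimal reading of it, not their code): one arbitrary permutation is generated once and applied to every source sentence, so the order is non-syntactic but fixed and consistent across training and test data.

```python
import random

def make_fixed_permutation(max_len, seed=0):
    """One arbitrary, non-syntactic permutation of source positions,
    generated once and reused for every sentence (our reading of the
    setup, not the authors' code)."""
    rng = random.Random(seed)
    perm = list(range(max_len))
    rng.shuffle(perm)
    return perm

def apply_perm(tokens, perm):
    # positions beyond the sentence length are skipped, so the same
    # permutation table covers shorter sentences as well
    return [tokens[i] for i in perm if i < len(tokens)]

perm = make_fixed_permutation(10)
s1 = "the teacher respects the student".split()
s2 = "the student thanks the teacher".split()
p1, p2 = apply_perm(s1, perm), apply_perm(s2, perm)
# the scrambling is consistent: the same positions move the same way
assert apply_perm(s1, perm) == p1
assert sorted(p1) == sorted(s1) and len(p2) == len(s2)
```

This contrasts with the free-order languages of our study, where each sentence is permuted independently at random.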
The idea of permuting dependency trees to
generate synthetic languages was introduced in-
dependently by Gulordava and Merlo (2016)
(discussed above) and by Wang and Eisner (2016),
the latter with the aim of diversifying the set
of treebanks currently available for language
adaptation.
7 Conclusions
We have presented an in-depth analysis of how
Neural Machine Translation difficulty is affected
by word order flexibility and case marking in the
source language. Although these common lan-
guage properties were previously shown to neg-
atively affect parsing and agreement prediction
accuracy, our main results show that state-of-the-
art NMT models, especially Transformer-based
ones, have little or no bias towards fixed-order lan-
guages. Our simulated low-resource experiments,
however, reveal a different picture, that is: Free-
order case-marking languages require more data
to be translated as accurately as their fixed-order
counterparts. Because parallel data (like labeled
data in general) are scarce for most of the world
languages (Guzmán et al., 2019; Joshi et al., 2020),
we believe this should be considered as a fur-
ther obstacle to language equality in future NLP
technologies.
In future work, our analysis should be extended
to target language variants using principled alter-
natives to BLEU (Bugliarello et al., 2020), Und
to other typological features that are likely to
affect MT performance, such as inflectional syn-
thesis and degree of fusion (Gerz et al., 2018).
Finally, the synthetic languages and challenge set
proposed in this paper could be used to evalu-
ate syntax-aware NMT models (Eriguchi et al.,
2016; Bisk and Tran, 2018; Currey and Heafield,
2019), which promise to better capture linguistic
structure, especially in low-resource scenarios.
Acknowledgments
Arianna Bisazza was partly funded by the
Netherlands Organization for Scientific Research
(NWO) under project number 639.021.646. We
would like to thank the Center for Information
Technology of the University of Groningen for
providing access to the Peregrine HPC cluster,
and the anonymous reviewers for their helpful
comments.
References
Duygu Ataman and Marcello Federico. 2018.
An evaluation of two vocabulary reduction
methods for neural machine translation. In
Proceedings of the 13th Conference of the Asso-
ciation for Machine Translation in the Americas
(Volumen 1: Research Papers), pages 97–110.
Matthew Baerman and Dunstan Brown. 2013.
Case syncretism. In Matthew S. Dryer and
Martin Haspelmath, editors, The World Atlas
of Language Structures Online. Max Planck In-
stitute for Evolutionary Anthropology, Leipzig.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In 3rd
International Conference on Learning Representations, ICLR 2015.
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà,
Christian Federmann, Mark Fishel, Yvette
Graham, Barry Haddow, Matthias Huck,
Philipp Koehn, Shervin Malmasi, Christof
Monz, Mathias Müller, Santanu Pal, Matt
Post, and Marcos Zampieri. 2019. Findings of
the 2019 conference on machine translation
(WMT19). In Proceedings of the Fourth Con-
ference on Machine Translation (Volumen 2:
Shared Task Papers, Day 1), pages 1–61,
Florence, Italy. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/W19-5301
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi,
Hassan Sajjad, and James Glass. 2017. What
do neural machine translation models learn
about morphology? In Proceedings of the 55th
Annual Meeting of the Association for Compu-
tational Linguistics (Volumen 1: Long Papers),
pages 861–872, Vancouver, Canada. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/P17-1080
Alexandra Birch, Miles Osborne, and Philipp Koehn. 2008. Predicting success in machine translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 745–754, Honolulu, Hawaii. Association for Computational Linguistics. https://doi.org/10.3115/1613715.1613809

Yonatan Bisk and Ke Tran. 2018. Inducing grammars with and for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 25–35, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-2704

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-6401

Emanuele Bugliarello, Sabrina J. Mielke, Antonios Anastasopoulos, Ryan Cotterell, and Naoaki Okazaki. 2020. It’s easier to translate out of English than into it: Measuring neural translation difficulty by cross-mutual information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1640–1649, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.149

Rahma Chaabouni, Eugene Kharitonov, Alessandro Lazaric, Emmanuel Dupoux, and Marco Baroni. 2019. Word-order biases in deep-agent emergent communication. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5166–5175, Florence, Italy. Association for Computational Linguistics.

David Chiang and Kevin Knight. 2006. An introduction to synchronous grammars. Tutorial available at http://www.isi.edu/˜chiang/papers/synchtut.pdf.

Leshem Choshen and Omri Abend. 2019. Automatically extracting challenge sets for non-local phenomena in neural machine translation. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL),
pages 291–303, Hong Kong, China. Association
for Computational Linguistics.
Bernard Comrie. 1981. Language Universals and Linguistic Typology. Blackwell.

Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

Anna Currey and Kenneth Heafield. 2019. Incorporating source syntax into transformer-based neural machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 24–33, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-5203

Matthew S. Dryer. 2013. Order of subject, object and verb. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig. https://wals.info/

Jinhua Du and Andy Way. 2017. Pre-reordering for neural machine translation: Helpful or harmful? The Prague Bulletin of Mathematical Linguistics, 108(1):171–182. https://doi.org/10.1515/pralin-2017-0018

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211. https://doi.org/10.1207/s15516709cog1402_1

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1078

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015a. Large-scale evidence of dependency length minimization in 37 languages. Proceedings of the National Academy of Sciences, 112(33):10336–10341. https://doi.org/10.1073/pnas.1502134112, PubMed: 26240370

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015b. Quantifying word order freedom in dependency corpora. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 91–100, Uppsala, Sweden. Uppsala University.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1029

Daniel Gildea and David Temperley. 2010. Do grammars minimize dependency length? Cognitive Science, 34(2):286–310. https://doi.org/10.1111/j.1551-6709.2009.01073.x, PubMed: 21564213

Joseph H. Greenberg. 1963. Some universals of grammar with particular reference to the order of meaningful elements. In Joseph H. Greenberg, editor, Universals of Human Language, pages 73–113. MIT Press, Cambridge, MA.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1108
Kristina Gulordava and Paola Merlo. 2015. Diachronic trends in word order freedom and dependency length in dependency-annotated corpora of Latin and ancient Greek. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 121–130, Uppsala, Sweden. Uppsala University.
Kristina Gulordava and Paola Merlo. 2016.
Multi-lingual dependency parsing evaluation:
A large-scale analysis of word order prop-
erties using artificial data. Transactions of
the Association for Computational Linguistics,
4:343–356. https://doi.org/10.1162
/tacl_a_00103
Francisco Guzmán, Peng-Jen Chen, Myle
Ott, Juan Pino, Guillaume Lample, Philipp
Koehn, Vishrav Chaudhary, and Marc’Aurelio
Ranzato. 2019. The FLORES Evaluation Da-
tasets for Low-Resource Machine Translation:
Nepali–English and Sinhala–English. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 6100–6113. https://doi.org/10
.18653/v1/D19-1632
Michael Hahn. 2020. Theoretical limitations of
self-attention in neural sequence models. Trans-
actions of the Association for Computational
Linguistik, 8:156–171. https://doi.org
/10.1162/tacl_a_00306
Hany Hassan, Anthony Aue, Chang Chen,
Vishal Chowdhary, Jonathan Clark, Christian
Federmann, Xuedong Huang, Marcin Junczys-
Dowmunt, William Lewis, Mu Li, Shujie Liu,
Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao
Qin, Frank Seide, Xu Tan, Fei Tian, Lijun
Wu, Shuangzhi Wu, Yingce Xia, Dongdong
Zhang, Zhirui Zhang, and Ming Zhou. 2018.
Achieving human parity on automatic Chi-
nese to English news translation. arXiv preprint
arXiv:1803.05567.
Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Computation,
9(8):1735–1780. https://doi.org/10
.1162/neco.1997.9.8.1735, PubMed:
9377276
Hideki Isozaki, Tsutomu Hirao, Kevin Duh,
Katsuhito Sudoh, and Hajime Tsukada. 2010.
Automatic evaluation of translation quality for
distant language pairs. In Proceedings of the
2010 Conference on Empirical Methods in
Natural Language Processing, pages 944–952,
Cambridge, MA. Association for Computa-
tional Linguistics.
Pratik Joshi, Sebastin Santy, Amar Budhiraja,
Kalika Bali, and Monojit Choudhury. 2020.
The state and fate of linguistic diversity and
inclusion in the NLP world. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 6282–6293,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1
/2020.acl-main.560
Philipp Koehn. 2005. Europarl: A parallel cor-
pus for statistical machine translation. In The
Tenth Machine Translation Summit Proceed-
ings of Conference, pages 79–86. International
Association for Machine Translation.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian
Li, Sergey Edunov, Marjan Ghazvininejad,
Mike Lewis, and Luke Zettlemoyer. 2020.
Multilingual denoising pre-training for neu-
ral machine translation. Transactions of the
Association for Computational Linguistics,
8:726–742. https://doi.org/10.1162
/tacl_a_00343
Minh-Thang Luong, Hieu Pham, and Christopher
D. Manning. 2015. Effective approaches to
attention-based neural machine translation. In
Empirical Methods in Natural Language Pro-
cessing (EMNLP), pages 1412–1421, Lisbon,
Portugal. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/D15-1166
Gary Lupyan and Morten H. Christiansen. 2002.
Case, word order, and language learnability:
Insights from connectionist modeling. In
Proceedings of
the Twenty-Fourth Annual
Conference of the Cognitive Science Society.
Christopher D. Manning, Mihai Surdeanu, John
Bauer, Jenny Finkel, Steven J. Bethard, and
David McClosky. 2014. The Stanford CoreNLP
natural language processing toolkit. In Associ-
ation for Computational Linguistics (ACL) Sys-
tem Demonstrations, pages 55–60. https://
doi.org/10.3115/v1/P14-5010
Mitchell Marcus, Beatrice Santorini, and Mary
Ann Marcinkiewicz. 1993. Building a large
annotated corpus of English: The Penn
Treebank. Computational Linguistics,
19(2):313–330. https://doi.org/10.21236
/ADA273556
Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman,
Brian Roark, and Jason Eisner. 2019. What
kind of language is hard to language-model?
In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 4975–4989, Florence, Italy. Association
for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, Und
Wei-Jing Zhu. 2002. BLEU: A method for
automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting
on Association for Computational Linguistics,
ACL ’02, pages 311–318, Stroudsburg, PA,
USA. Association for Computational Linguistics.
Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 425–435. https://doi.org/10.18653/v1/D18-1039
Martin Popel, Marketa Tomkova, Jakub Tomek,
Łukasz Kaiser, Jakob Uszkoreit, Ondřej Bojar,
and Zdeněk Žabokrtský. 2020. Transforming
machine translation: a deep learning system
reaches news translation quality comparable
to human professionals. Nature Communica-
tionen, 11(1):4381. https://doi.org/10
.1038/s41467-020-18073-9, PubMed:
32873773
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason
Bolton, and Christopher D. Manning. 2020.
Stanza: A Python natural language process-
ing toolkit for many human languages. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics:
System Demonstrations.
Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019. Studying the inductive biases of RNNs with synthetic variations of natural languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3532–3542, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1356
Benoît Sagot. 2013. Comparing complexity measures.
In Computational Approaches to Morphological
Complexity. Paris, France. Surrey
Morphology Group.
Karthik Abinav Sankararaman, Soham De, Zheng
Xu, W. Ronny Huang, and Tom Goldstein.
2020. Analyzing the effect of neural network
architecture on training performance. In Pro-
ceedings of Machine Learning and Systems
2020, pages 9834–9845.
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Neural machine translation of
rare words with subword units. In Proceedings
of the 54th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1:
Long Papers), pages 1715–1725, Berlin,
Germany. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/P16-1162
Kaius Sinnemäki. 2008. Complexity trade-offs in
core argument marking. Language Complexity,
pages 67–88. John Benjamins. https://doi
.org/10.1075/slcs.94.06sin
Dan I. Slobin. 1966. The acquisition of Russian as
a native language. The Genesis of Language: A
Psycholinguistic Approach, pages 129–148.
Dan I. Slobin and Thomas G. Bever. 1982.
Children use canonical sentence schemas: A
crosslinguistic study of word order and inflec-
tionen. Cognition, 12(3):229–265. https://
doi.org/10.1016/0010-0277(82)90033-6
Ke Tran, Arianna Bisazza, and Christof Monz.
2018. The importance of being recurrent for
modeling hierarchical structure. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4731–4736, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1503
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in Neu-
ral Information Processing Systems 30: Annual
Conference on Neural Information Process-
ing Systems 2017, 4–9 December 2017, Long
Beach, CA, USA, pages 5998–6008.
Dingquan Wang and Jason Eisner. 2016. The galactic dependencies treebanks: Getting more data by synthesizing new languages. Transactions of the Association for Computational Linguistics, 4:491–505. https://doi.org/10.1162/tacl_a_00113

Adina Williams, Tiago Pimentel, Hagen Blix, Arya D. McCarthy, Eleanor Chodroff, and Ryan Cotterell. 2020. Predicting declension class from form and meaning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6682–6695, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.597

Yang Zhao, Jiajun Zhang, and Chengqing Zong. 2018. Exploiting pre-ordering for neural machine translation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).

A Appendices

A.1 NMT Hyperparameters

In the toy parallel grammar experiments (§4), a batch size of 64 (sentences) and a maximum of 1K update steps are used for all models. We train BiLSTM with a learning rate of 1, and Transformer with a learning rate of 2 together with 40 warm-up steps, using noam learning rate decay. Dropout ratios of 0.3 and 0.1 are used in the BiLSTM and Transformer models, respectively. In the synthetic English variants experiments (§5), we set a constant learning rate of 0.001 for BiLSTM. We also increased the batch size to 128, the number of warm-up steps to 80K, and the number of update steps to 2M for all models. Finally, for the 100k and 10k data-size experiments, we decreased the warm-up steps to 4K. During evaluation we chose the best-performing model on the validation set.

A.2 Challenge Set

The English-French challenge set used in this paper, available at https://github.com/arianna-bis/freeorder-mt, is generated by a small synchronous context-free grammar and contains 7,200 simple sentences consisting of a subject, a transitive verb, and an object (see Table 4). All sentences are in the present tense; half are affirmative, and half negative. All nouns in the grammar can plausibly act as both subject and object of the verbs, so that an MT system must rely on sentence structure to get perfect translation accuracy. The sentences are from a general domain, but we specifically choose nouns and verbs with little translation ambiguity that are well represented in the Europarl corpus: Most have thousands of occurrences, while the rarest word has about 80. Sentence example (English side): ‘The teacher does not respect the student.’ and its reverse: ‘The student does not respect the teacher.’

Nouns: president / président, man / homme, woman / femme, minister / ministre, candidate / candidat, secretary / secrétaire, commissioner / commissaire, child / enfant, teacher / enseignant, student / étudiant

Verbs: thank / remercier, support / soutenir, represent / représenter, defend / défendre, welcome / saluer, invite / inviter, attack / attaquer, respect / respecter, replace / remplacer, exploit / exploiter

Table 4: The English/French vocabulary used to generate the challenge set. Both singular and plural forms are used for each noun.

A.3 Morphological Paradigms

The complete list of morphological paradigms used in this work is shown in Table 5. The implicit language with exponence (many:1) uses only the suffixes of the 1st (default) declension. The implicit language with exponence and flexivity (many:many) uses three declensions, assigned as
follows: First, the list of lemmas extracted from the training set is randomly split into three classes,13 with distribution 1st:60%, 2nd:30%, 3rd:10%. Then, each core verb argument occurring in the corpus is marked with the suffix corresponding to its lemma’s declension.

              Overt       Implicit
                          1st (default)   2nd    3rd
Unambiguous:  .nsubj.sg   kar             par    pa
              .nsubj.pl   kon             pon    po
              .dobj.sg    kin             it     kit
              .dobj.pl    ker             et     ket
              .iobj.sg    ken             kez    ke
              .iobj.pl    kre             kr     re
Syncretic:    .arg.sg     –               –      –
              .arg.pl     –               –      –

Table 5: The artificial morphological paradigms used in this work, extended from Ravfogel et al. (2019). 1st, 2nd, and 3rd are the declensions in the flexive language.
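The assignment procedure above can be sketched as follows (a simplified illustration; the function names are ours and only two cells of each paradigm are shown):

```python
import random
from collections import Counter

# Only two cells of each paradigm are shown; suffix values as in Table 5.
SUFFIX = {1: {"nsubj.sg": "kar", "dobj.sg": "kin"},
          2: {"nsubj.sg": "par", "dobj.sg": "it"},
          3: {"nsubj.sg": "pa",  "dobj.sg": "kit"}}

def assign_declensions(lemmas, seed=0):
    """Randomly split lemmas into three declension classes with the
    60/30/10 distribution described above."""
    rng = random.Random(seed)
    lemmas = list(lemmas)
    rng.shuffle(lemmas)
    cut1, cut2 = int(0.6 * len(lemmas)), int(0.9 * len(lemmas))
    return {lemma: (1 if i < cut1 else 2 if i < cut2 else 3)
            for i, lemma in enumerate(lemmas)}

def add_suffix(lemma, slot, classes):
    # every core argument is marked according to its lemma's class
    return lemma + SUFFIX[classes[lemma]][slot]

classes = assign_declensions(["teacher", "student", "minister", "child",
                              "woman", "man", "candidate", "president",
                              "secretary", "commissioner"])
counts = Counter(classes.values())
assert (counts[1], counts[2], counts[3]) == (6, 3, 1)
```

Because the class of a lemma is arbitrary, a model must memorize it per lexical item, which is exactly what makes the many:many system harder than the many:1 one.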
A.4 Effect of Data Size and Morphological
Features: Detailed Results
Tisch 6 shows the detailed numerical results
corresponding to the plots of Figure 2 im
main text.
Eparl-BLEU           1.9M   100k   10k
original             38.3   26.9   11.0
SOV                  37.9   25.3    8.8
SOV+overt            37.4   24.6    8.4
random               37.5   24.6    8.5
random+overt         37.3   24.1    7.8
random+implicit      37.3   24.3    7.1
random+declens       37.4   23.1    7.7

Eparl-RIBES          1.9M   100k   10k
original             84.9   80.1   67.5
SOV                  84.5   78.7   64.1
SOV+overt            84.5   78.4   63.1
random               84.2   77.7   61.7
random+overt         84.4   77.6   61.6
random+implicit      84.3   77.4   59.8
random+declens       84.3   77.1   61.3

Challenge-RIBES      1.9M   100k   10k
original             97.7   92.2   74.2
SOV                  97.2   89.8   69.5
SOV+overt            97.7   86.5   64.9
random               74.1   72.9   63.1
random+overt         98.1   84.5   57.4
random+implicit      97.5   85.4   54.8
random+declens       97.6   84.4   53.1

Table 6: Detailed results corresponding to the plots of Figure 2: EN*-FR Transformer NMT quality versus training data size (1.9M, 100K, or 10K sentence pairs). Source language variants: fixed-order (SOV) and free-order (random) with different case systems (+overt/implicit/declens). Scores averaged over three training runs.
13See Williams et al. (2020) for an interesting account of
how declension classes are actually partly predictable from
form and meaning.