On the Difficulty of Translating Free-Order Case-Marking Languages

Arianna Bisazza
Ahmet Üstün
Stephan Sportel

Center for Language and Cognition
University of Groningen, The Netherlands
{a.bisazza, a.ustun}@rug.nl, research@spor.tel

Abstract

Identifying factors that make certain languages harder to model than others is essential to reach language equality in future Natural Language Processing technologies. Free-order case-marking languages, such as Russian, Latin, or Tamil, have proved more challenging than fixed-order languages for the tasks of syntactic parsing and subject-verb agreement prediction. In this work, we investigate whether this class of languages is also more difficult to translate by state-of-the-art Neural Machine Translation (NMT) models. Using a variety of synthetic languages and a newly introduced translation challenge set, we find that word order flexibility in the source language only leads to a very small loss of NMT quality, even though the core verb arguments become impossible to disambiguate in sentences without semantic cues. The latter issue is indeed solved by the addition of case marking. However, in medium- and low-resource settings, the overall NMT quality of fixed-order languages remains unmatched.

1 Introduction

Despite the tremendous advances achieved in less than a decade, Natural Language Processing remains a field where language equality is far from being reached (Joshi et al., 2020). In the field of Machine Translation, modern neural models have attained remarkable quality for high-resource language pairs like German-English, Chinese-English, or English-Czech, with a number of studies even claiming human parity (Hassan et al., 2018; Bojar et al., 2018; Barrault et al., 2019; Popel et al., 2020). These results may lead to the unfounded belief that Neural Machine Translation (NMT) methods will perform equally well in any language pair, provided similar amounts of training data. In fact, several studies suggest the contrary (Platanios et al., 2018; Ataman and Federico, 2018; Bugliarello et al., 2020).

Why, then, do some language pairs have lower translation accuracy? And, more specifically: Are certain typological profiles more challenging for current state-of-the-art NMT models? Every language has its own combination of typological properties, including word order, morphosyntactic features, and more (Dryer and Haspelmath, 2013). Identifying language properties (or combinations thereof) that pose major problems to the current modeling paradigms is essential to reach language equality in future MT (and other NLP) technologies (Joshi et al., 2020), in a way that is orthogonal to data collection efforts. Among others, natural languages adopt different mechanisms to disambiguate the role of their constituents: Flexible order typically correlates with the presence of case marking and, vice versa, fixed order is observed in languages with little or no case marking (Comrie, 1981; Sinnemäki, 2008; Futrell et al., 2015b). Morphologically rich languages in general are known to be challenging for MT at least since the times of phrase-based statistical MT (Birch et al., 2008) due to their larger and sparser vocabularies, and remain challenging even for modern neural architectures (Ataman and Federico, 2018; Belinkov et al., 2017). By contrast, the relation between word order flexibility and MT quality has not been directly studied to our knowledge.

In this paper, we study this relationship using strictly controlled experimental setups. Specifically, we ask:

• Are current state-of-the-art NMT systems biased towards fixed-order languages?

• To what extent does case marking compensate for the lack of a fixed order in the source language?

Unfortunately, parallel data are scarce in most of the world languages (Guzmán et al., 2019), and

Fixed
  VSO: follows the little cat the friendly dog
  VOS: follows the friendly dog the little cat

Free+Case
  follows the little cat#S the friendly dog#O
  or
  follows the friendly dog#O the little cat#S

Translation
  de kleine kat volgt de vriendelijke hond

Table 1: Example sentence in different fixed/flexible-order English-based synthetic languages and their SVO Dutch translation. The subject in each sentence is underlined. Artificial case markers start with #.

corpora in different languages are drawn from different domains. Exceptions exist, like the widely used Europarl (Koehn, 2005), but represent a small fraction of the large variety of typological feature combinations attested in the world. This makes it very difficult to run a large-scale comparative study and isolate the factors of interest from, for example, domain mismatch effects. As a solution, we propose to evaluate NMT on synthetic languages (Gulordava and Merlo, 2016; Wang and Eisner, 2016; Ravfogel et al., 2019) that differ from each other only by specific properties, namely: the order of main constituents, or the presence and nature of case markers (see example in Table 1).

We use this approach to isolate the impact of various source-language typological features on MT quality and to remove the typical confounders of corpus size and domain. Using a variety of synthetic languages and a newly introduced challenge set, we find that state-of-the-art NMT has little to no bias towards fixed-order languages, but only when a sizeable training set is available.

2 Free-order Case-marking Languages

The word order profile of a language is usually represented by the canonical order of its main constituents, (S)ubject, (O)bject, (V)erb. For instance, English and French are SVO languages, while Turkish and Hindi are SOV. Other, less commonly attested, word orders are VSO and VOS, whereas OSV and OVS are extremely rare (Dryer, 2013). Although many other word order features exist (e.g., noun/adjective), they often correlate with the order of main constituents (Greenberg, 1963).

A different, but likewise important dimension is that of word order freedom (or flexibility). Languages that primarily rely on the position of a word to encode grammatical roles typically display rigid orders (like English or Mandarin Chinese), while languages that rely on case marking can be more flexible, allowing word order to express discourse-related factors like topicalization. Examples of highly flexible-order languages include languages as diverse as Russian, Hungarian, Latin, Tamil, and Turkish.1

1See Futrell et al. (2015b) for detailed figures of word order freedom (measured by the entropy of subject and object dependency relation order) in a diverse sample of 34 languages.

In the field of psycholinguistics, due to the historical influence of English-centered studies, word order has long been considered the primary and most natural device through which children learn to infer syntactic relationships in their language (Slobin, 1966). However, cross-linguistic studies have later revealed that children are equally prepared to acquire both fixed-order and inflectional languages (Slobin and Bever, 1982).

Coming to computational linguistics, data-driven MT and other NLP approaches were also historically developed around languages with remarkably fixed orders and very simple to moderately simple morphological systems, like English or French. Luckily, our community has been giving increasing attention to more and more languages with diverse typologies, especially in the last decade. To date, previous work has found that free-order languages are more challenging for parsing (Gulordava and Merlo, 2015, 2016) and subject-verb agreement prediction (Ravfogel et al., 2019) than their fixed-order counterparts. This raises the question of whether word order flexibility also negatively affects MT quality.

Before the advent of modern NMT, Birch et al. (2008) used the Europarl corpus to study how various language properties affected the quality of phrase-based Statistical MT. Amount of reordering, target morphological complexity, and historical relatedness of source and target languages were identified as strong predictors of MT quality. Recent work by Bugliarello et al. (2020), however, has failed to show a correlation between NMT difficulty (measured by a novel information-theoretic metric) and several linguistic properties of source and target languages, including Morphological Counting Complexity (Sagot, 2013) and Average Dependency Length (Futrell et al., 2015a). While that work specifically

aimed at ensuring cross-linguistic comparability, the sample on which the linguistic properties could be computed (Europarl) was rather small and not very typologically diverse, leaving our research questions open to further investigation. In this paper, we therefore opt for a different methodology: namely, synthetic languages.

3 Methodology

Synthetic Languages This paper presents two sets of experiments: In the first (§4), we create parallel corpora using very simple and predictable artificial grammars and small vocabularies (Lupyan and Christiansen, 2002). See an example in Table 1. By varying the position of subject/verb/object and introducing case markers to the source language, we study the biases of two NMT architectures in optimal training data conditions and a fully controlled setup, that is, without any other linguistic cues that may disambiguate constituent roles. In the second set of experiments (§5), we move to a more realistic setup using synthetic versions of the English language that differ from it in only one or few selected typological features (Ravfogel et al., 2019). For example, the original sentence's order (SVO) is transformed to different orders, like SOV or VSO, based on its syntactic parse tree.

In both cases, typological variations are introduced in the source side of the parallel corpora, while the target language remains fixed. In this way, we avoid the issue of non-comparable BLEU scores across different target languages. Finally, we make the simplifying assumption that, when verb-argument order varies from the canonical order in a flexible-order language, it does so in a totally arbitrary way. Although this is rarely true in practice, as word order may be predictable given pragmatics or other factors, we focus here on ''the extent to which word order is conditioned on the syntactic and compositional semantic properties of an utterance'' (Futrell et al., 2015b).

Translation Models We consider two widely used NMT architectures that crucially differ in their encoding of positional information: (i) Recurrent sequence-to-sequence BiLSTM with attention (Bahdanau et al., 2015; Luong et al., 2015) processes the input symbols sequentially and has each hidden state directly conditioned on that of the previous (or following, for the backward LSTM) timestep (Elman, 1990; Hochreiter and Schmidhuber, 1997). (ii) The non-recurrent, fully attention-based Transformer (Vaswani et al., 2017) processes all input symbols in parallel, relying on dedicated embeddings to encode each input's position.2 Transformer has nowadays surpassed recurrent encoder-decoder models in terms of generic MT quality. Moreover, Choshen and Abend (2019) have recently shown that Transformer-based NMT models are indifferent to the absolute order of source words, at least when equipped with learned positional embeddings. On the other hand, the lack of recurrence in Transformers has been linked to a limited ability to capture hierarchical structure (Tran et al., 2018; Hahn, 2020). To our knowledge, no previous work has studied the biases of either architecture towards fixed-order languages in a systematic manner.

4 Toy Parallel Grammar

We start by evaluating our models on a pair of toy languages inspired by the English-Dutch pair and created using a Synchronous Context-Free Grammar (Chiang and Knight, 2006). Each sentence consists of a simple clause with a transitive verb, a subject, and an object. Both arguments are singular and optionally modified by an adjective. The source vocabulary contains 6 nouns, 6 verbs, 6 adjectives, and the complete corpus contains 10k generated sentence pairs. Working with such a small, finite grammar allows us to simulate an otherwise impossible situation where the NMT model can be trained on (almost) the totality of a language's utterances, canceling out data sparsity effects.3

Source Language Variants We consider three
source language variants, illustrated in Table 1:

• fixed-order VSO;

• fixed-order VOS;

• mixed-order (randomly chosen between VSO or VOS) with nominal case marking.
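To make this setup concrete, here is a minimal sketch of how such toy parallel pairs could be generated. The tiny lexicon, the #S/#O suffixes, and the fixed SVO Dutch-like target follow Table 1, but the snippet is only illustrative: it is not the authors' released generator (see footnote 3 for the actual data and code), and the word lists are placeholders.

```python
import random

# Hypothetical toy lexicon (English/Dutch pairs); the real grammar uses
# 6 nouns, 6 verbs, and 6 adjectives.
NOUNS = [("cat", "kat"), ("dog", "hond")]
ADJS = [("little", "kleine"), ("friendly", "vriendelijke")]
VERBS = [("follows", "volgt")]

def noun_phrase():
    adj_en, adj_nl = random.choice(ADJS)
    noun_en, noun_nl = random.choice(NOUNS)
    return f"the {adj_en} {noun_en}", f"de {adj_nl} {noun_nl}"

def make_pair(variant):
    """variant: 'vso', 'vos', or 'mixed' (random VSO/VOS with case suffixes)."""
    subj_en, subj_nl = noun_phrase()
    obj_en, obj_nl = noun_phrase()
    verb_en, verb_nl = random.choice(VERBS)
    target = f"{subj_nl} {verb_nl} {obj_nl}"          # target is always fixed SVO
    if variant == "vso":
        source = f"{verb_en} {subj_en} {obj_en}"
    elif variant == "vos":
        source = f"{verb_en} {obj_en} {subj_en}"
    else:  # mixed order: roles are only recoverable from the #S / #O markers
        subj_en, obj_en = subj_en + "#S", obj_en + "#O"
        args = [subj_en, obj_en]
        random.shuffle(args)                          # VSO or VOS, chosen at random
        source = f"{verb_en} {args[0]} {args[1]}"
    return source, target

print(make_pair("mixed"))
```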

2We use sinusoidal embeddings (Vaswani et al., 2017). All our models are built using OpenNMT: https://github.com/OpenNMT/OpenNMT-py.

3Data and code to replicate the toy grammar experiments in this section are available at https://github.com/573phn/cm-vs-wo.

Figure 1: Toy language NMT sentence-level accuracy on validation set by number of training epochs. Source languages: fixed-order VSO, fixed-order VOS, and mixed-order (VSO/VOS) with case marking. Target language: always fixed SVO. Each experiment is repeated five times, and averaged results are shown.

We choose these word orders so that, in the flexible-order corpus, the only way to disambiguate argument roles is case marking, realized by simple unambiguous suffixes (#S and #O). The target language is always fixed SVO. The same random split (80/10/10% training/validation/test) is applied to the three corpora.

NMT Setup As recurrent model, we trained a 2-layer BiLSTM with attention (Luong et al., 2015) and 500 hidden layer size. As Transformer models, we trained one using the standard 6-layer configuration (Vaswani et al., 2017) and a smaller one with only 2 layers, given the simplicity of the languages. All models are trained at the word level using the complete vocabulary. More hyperparameters are provided in Appendix A.1. Note that our goal is not to compare LSTM and Transformer accuracy to each other, but rather to observe the different trends across fixed- and flexible-order language variants. Given the small vocabulary, we use sentence-level accuracy instead of BLEU for evaluation.

Results As shown in Figure 1, all models achieve perfect accuracy on all language pairs after 1000 training steps, except for the Large Transformer on the free-order language, likely due to overparametrization (Sankararaman et al., 2020). These results demonstrate that our NMT architectures are equally capable of modeling translation of both types of language, when all other factors of variation are controlled for.

Nonetheless, a pattern emerges when looking at the learning curves within each plot: While the two fixed-order languages have very similar learning curves, the free-order language with case markers always requires slightly more training steps to converge. This is also the case, albeit to a lesser extent, when the mixed-order corpus is pre-processed by splitting all case suffixes from the nouns (extra experiment not shown in the plot). This trend is noteworthy, given the simplicity of our grammars and the transparency of the case system. As our training sets cover a large majority of the languages, this result might suggest that free-order natural languages need larger training datasets to reach a translation quality similar to that of their fixed-order counterparts. In §5 we validate this hypothesis on more naturalistic language data.

5 Synthetic English Variants

Experimenting with toy languages has its shortcomings, like the small vocabulary size and non-realistic distribution of words and structures. In this section, we follow the approach of Ravfogel et al. (2019) to validate our findings in a less controlled but more realistic setup. Specifically, we create several variants of the Europarl English-French parallel corpus where the source sentences are modified by changing word order and adding artificial case markers. We choose French as target language because of its fixed order (SVO) and its relatively simple morphology.4 As Indo-European languages, English and French are moderately related in terms of syntax and vocabulary while being sufficiently distant to avoid a word-by-word translation strategy in many cases.

Source language variants are obtained by transforming the syntactic tree of the original sentences. While Ravfogel et al. (2019) could rely on the Penn Treebank (Marcus et al., 1993) for their monolingual task of agreement prediction, we

4According to the Morphological Counting Complexity (Sagot, 2013) values reported by Cotterell et al. (2018), English scores 6 (least complex), Dutch 26, French 30, Spanish 71, Czech 195, and Finnish 198 (most complex).

Original (no case):
The woman says her sisters often invited her for dinner.

SOV (no case):
The woman her sisters her often invited for dinner say.

SOV, syncretic case marking (overt):
The woman.arg.sg her sisters.arg.pl she.arg.sg often invited.arg.pl for dinner say.arg.sg.

SOV, unambiguous case marking (overt):
The woman.nsubj.sg her sisters.nsubj.pl she.dobj.sg often invited.dobj.sg.nsubj.pl for dinner say.nsubj.sg.

SOV, unambiguous case (implicit):
The womankar her sisterskon shekin often invitedkinkon for dinner saykar.

SOV, unambiguous case (implicit with declensions):
The womankar her sisterspon shekit often invitedkitpon for dinner saykar.

French translation:
La femme dit que ses soeurs l'invitaient souvent à dîner.

Table 2: Examples of synthetic English variants and their (common) French translation. The full list of suffixes is provided in Appendix A.3.

instead need parallel data. For this reason, we parse the English side of the Europarl v.7 corpus (Koehn, 2005) using the Stanza dependency parser (Qi et al., 2020; Manning et al., 2014). After parsing, we adopt a modified version of the synthetic language generator by Ravfogel et al. (2019) to create the following English variants:5

• Fixed-order: either SVO, SOV, VSO or VOS;6

• Free-order: for each sentence in the corpus, one of the six possible orders of (Subject, Object, Verb) is chosen randomly;

• Shuffled words: all source words are shuffled regardless of their syntactic role. This is our lower bound, measuring the reordering ability of a model in the total absence of source-side order cues (akin to bag-of-words input).

To allow for a fair comparison with the artificial case-marking languages, we remove number agreement features from verbs in all the above variants (cf. says → say in Table 2).

To answer our second research question, we experiment with two artificial case systems proposed by Ravfogel et al. (2019) and illustrated in Table 2 (overt suffixes):

• Unambiguous case system: suffixes indicating argument role (subject/object/indirect object) and number (singular/plural) are added to the heads of noun and verb phrases;

• Syncretic case system: suffixes indicating number but not grammatical function are added to the heads of main arguments, providing only partial disambiguation of argument roles. This system is inspired by subject/object syncretism in Russian.

Syncretic case systems were found to be roughly as common as non-syncretic ones in a large sample of almost 200 world languages (Baerman and Brown, 2013). Case marking is always combined with the fully flexible order of main constituents. As in Ravfogel et al. (2019), English number marking is removed from verbs and their arguments before adding the artificial suffixes.
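The following is a minimal sketch of the kind of transformation described above, applied to a single simple clause. It is not the modified Ravfogel et al. (2019) generator used in the paper (footnote 5): the real system reorders full Stanza dependency subtrees, whereas here each core constituent is assumed to be already identified and is a single placeholder string, and only singular suffixes are shown.

```python
import random

ORDERS = ["SVO", "SOV", "VSO", "VOS", "OSV", "OVS"]

def transform(subj, verb, obj, order=None, case=None):
    """order: one of ORDERS for a fixed-order variant, or None for free order
    (one of the six permutations picked at random per sentence).
    case: None, 'unambiguous' (role + number), or 'syncretic' (number only)."""
    if case == "unambiguous":                      # e.g. woman.nsubj.sg
        subj, obj, verb = subj + ".nsubj.sg", obj + ".dobj.sg", verb + ".nsubj.sg"
    elif case == "syncretic":                      # number only: roles stay ambiguous
        subj, obj, verb = subj + ".arg.sg", obj + ".arg.sg", verb + ".arg.sg"
    order = order or random.choice(ORDERS)
    slots = {"S": subj, "V": verb, "O": obj}
    return " ".join(slots[c] for c in order)

# Free order with unambiguous case, as in the 'Random + unambig. case' variant:
print(transform("the woman", "says", "the truth", order=None, case="unambiguous"))
```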

5Our revised language generator is available at https://github.com/573phn/rnn_typology.

6To keep the number of experiments manageable, we omit object-initial languages, which are significantly less attested among world languages (Dryer, 2013).

5.1 NMT Setup

Models As recurrent model, we used a 3-layer BiLSTM with hidden size of 512 and MLP attention (Bahdanau et al., 2015). The Transformer model has the standard 6-layer configuration with hidden size of 512, 8 attention heads, and sinusoidal positional encoding (Vaswani et al., 2017). All models use subword representation based on 32k BPE merge operations (Sennrich et al., 2016), except in the low-resource setup where this is reduced to 10k operations. More hyperparameters are provided in Appendix A.1.

Data and Evaluation We train our models on various subsets of the English-French Europarl corpus: 1.9M sentence pairs (high-resource), 100K (medium-resource), and 10K (low-resource). For evaluation, we use 5K sentences randomly held out from the same corpus. Given the importance of word order to assess the correct translation of verb arguments into French, we compute the

reordering-focused RIBES7 metric (Isozaki et al., 2010) in addition to the more commonly used BLEU (Papineni et al., 2002). In each experiment, the source side of training and test data is transformed using the same procedure whereas the target side remains unchanged. We repeat each experiment 3 times (or 4 for languages with random order choice) and report the averaged results.

5.2 Challenge Set

Besides syntactic structure, natural language often contains semantic and collocational cues that help disambiguate the role of an argument. Small BLEU/RIBES differences between our language variants may indicate actual robustness of a model to word order flexibility, but may also indicate that a model relies on those cues rather than on syntactic structure (Gulordava et al., 2018). To discern between these two hypotheses, we create a challenge set of 7,200 simple affirmative and negative sentences where swapping subject and object leads to another plausible sentence.8 Each English sentence and its reverse are included in the test set together with the respective translations, as for example:

(1) (a) The president thanks the minister. / Le président remercie le ministre.

    (b) The minister thanks the president. / Le ministre remercie le président.

The source side is then processed as explained in §5 and translated by the NMT model trained on the corresponding language variant. Thus, translation quality on this set reflects the extent to which NMT models have robustly learned to detect verb arguments and their roles independently from other cues, which we consider an important sign of linguistic generalization ability. For space constraints we only present RIBES scores on the challenge set.9

7BLEU captures local word-order errors only indirectly (lower precision of higher-order n-grams) and does not capture long-range word-order errors at all. In contrast, RIBES directly measures correlation between the word ranks in the reference and those in the MT output.

8More details can be found in Appendix A.2. We release the challenge set at https://github.com/arianna-bis/freeorder-mt.

9We also computed BLEU scores: They strongly correlate with RIBES but fluctuate more due to the larger effect of lexical choice.

5.3 High-Resource Results

Table 3 reports the high-resource setting results. The first row (original English to French) is given only for reference and shows the overall highest results. The BLEU drop observed when moving to any of the fixed-order variants (including SVO) is likely due to parsing flaws resulting in awkward reorderings. As this issue affects all our synthetic variants, it does not undermine the validity of our findings. For clarity, we center our main discussion on the Transformer results and comment on the BiLSTM results at the end of this section.

Fixed-Order Variants All four tested fixed-order variants obtain very similar BLEU/RIBES scores on the Europarl-test. This is in line with previous work in NMT showing that linguistically motivated pre-ordering leads to small gains (Zhao et al., 2018) or none at all (Du and Way, 2017), and that Transformer-based models are not biased towards monotonic translation (Choshen and Abend, 2019). On the challenge set, scores are slightly more variable but a manual inspection reveals that this is due to different lexical choices, while word order is always correct for this group of languages. In summary, in the high-resource setup, our Transformer models are perfectly able to disambiguate the core argument roles when these are consistently encoded by word order.

Fixed-Order vs Random-Order Somewhat surprisingly, the Transformer results are only marginally affected by the random ordering of verb and core arguments. Recall that in the 'Random' language all six possible permutations of (S,V,O) are equally likely. Hence, Transformer shows an excellent ability to reconstruct the correct constituent order in the general-purpose test set. The picture is very different on the challenge set, where RIBES drops severely from 97.6 to 74.1. These low results were to be expected given the challenge set design (it is impossible even for a human to recognize subject from object in the 'Random, no case' challenge set). Nonetheless, they demonstrate that the general-purpose set cannot tell us whether an NMT model has learned to reliably exploit syntactic structure of the source language, because of the abundant non-syntactic cues. In fact, even when all source words are shuffled, Transformer still achieves a respectable 25.8/71.2 BLEU/RIBES on the Europarl-test.

English*→French            BI-LSTM                            TRANSFORMER
Large Training (1.9M)      Europarl-Test        Challenge     Europarl-Test        Challenge
                           BLEU      RIBES      RIBES         BLEU      RIBES      RIBES

Original English           39.4      85.0       98.0          38.3      84.9       97.7

Fixed Order:
  S-V-O                    38.3      84.5       98.1          37.7      84.6       98.0
  S-O-V                    37.6      84.2       97.7          37.9      84.5       97.2
  V-S-O                    38.0      84.2       97.8          37.8      84.6       98.0
  V-O-S                    37.8      84.0       98.0          37.6      84.3       97.2
  Average (fixed orders)   37.9±0.4  84.2±0.3   97.9±0.2      37.8±0.1  84.5±0.1   97.6±0.4

Flexible Order:
  Random, no case          37.1      83.7       75.1          37.5      84.2       74.1
  Random + syncretic case  36.9      83.6       75.4          37.3      84.2       84.4
  Random + unambig. case   37.3      83.9       97.7          37.3      84.4       98.1

Shuffle all words          18.5      65.2       79.4          25.8      71.2       83.2

Table 3: Translation quality from various English-based synthetic languages into standard French, using the largest training data (1.9M sentences). NMT architectures: 3-layer BiLSTM seq-to-seq with attention; 6-layer Transformer. Europarl-Test: 5K held-out Europarl sentences; Challenge set: see §5.2. All scores are averaged over three training runs.

Case Marking The key comparison in our study lies between fixed-order and free-order case-marking languages. Here, we find that case marking can indeed restore near-perfect accuracy on the challenge set (98.1 RIBES). However, this only happens when the marking system is completely unambiguous, which, as already mentioned, is true for only about a half of the real case-marking languages (Baerman and Brown, 2013). Indeed, the syncretic system visibly improves quality on the challenge set (74.1 to 84.4 RIBES) but remains far behind the fixed-order scores (97.6). In terms of overall NMT quality (Europarl-test), fixed-order languages score only marginally higher than the free-order case-marking ones, regardless of the unambiguous/syncretic distinction. Thus our finding that Transformer NMT systems are equally capable of modeling the two types of languages (§4) is also confirmed with more naturalistic language data. That said, we will show in Section 5.4 that this positive finding is conditional on the availability of large amounts of training samples.

BiLSTM vs Transformer The LSTM-based results generally correlate with the Transformer results discussed above; however, our recurrent models appear to be slightly more sensitive to changes in the source-side order, in line with previous findings (Choshen and Abend, 2019). Specifically, translation quality on Europarl-test fluctuates slightly more than Transformer among different fixed orders, with the most monotonic order (SVO) leading to the best results. When all words are randomly shuffled, BiLSTM scores drop much more than Transformer. However, when comparing the fixed-order variants to the ones with free order of main constituents, BiLSTM shows only a slightly stronger preference for fixed order, compared to Transformer. This suggests that, by experimenting with arbitrary permutations, Choshen and Abend (2019) may have overestimated the bias of recurrent NMT towards more monotonic translation, whereas the more realistic combination of constituent-level reordering with case marking used in our study is not so problematic for this type of model.

Interestingly, on the challenge set, BiLSTM and Transformer perform on par, with the notable exception that syncretic case is much more difficult for the BiLSTM model. Our results agree with the large drop of subject-verb agreement prediction accuracy observed by Ravfogel et al. (2019) when experimenting with the random order of main

constituents. However, their scores were also low for SOV and VOS, which is not the case in our NMT experiments. Besides the fact that our challenge set only contains short sentences (hence no long dependencies and few agreement attractors), our task is considerably different in that agreement only needs to be predicted in the target language, which is fixed-order SVO.

Summary Our results so far suggest that state-of-the-art NMT models, especially if Transformer-based, have little or no bias towards fixed-order languages. In the following, we study whether this finding is robust to differences in data size, type of morphology, and target language.

5.4 Effect of Data Size and Morphological Features

Data Size The results shown in Table 3 represent a high-resource setting (almost 2M training sentences). While recent successes in cross-lingual transfer learning alleviate the need for labeled data (Liu et al., 2020), their success still depends on the availability of large unlabeled data as well as other, yet to be explained, language properties (Joshi et al., 2020). We then ask: Do free-order case-marking languages need more data than fixed-order non-case-marking ones to reach similar NMT quality? We simulate a medium- and low-resource scenario by sampling 100K and 10K training sentences, respectively, from the full Europarl data. To reduce the number of experiments, we only consider Transformer with one fixed-order language variant (SOV)10 and exclude syncretic case marking. To disentangle the effect of word order from that of case marking on low-resource translation quality, we also experiment with a language variant combining fixed order (SOV) and case marking. Results are shown in Figure 2 and discussed below.

10We choose SOV because it is a commonly attested word order and is different from that of the target language, thus requiring some non-trivial reorderings during translation.

Morphological Features The artificial case systems used so far included easily separable suffixes with a 1:1 mapping between grammatical categories and morphemes (e.g., .nsubj.sg, .dobj.pl), reminiscent of agglutinative morphologies. Many world languages, however, do not comply with this 1:1 mapping principle but display exponence (multiple categories conveyed by one morpheme) and/or flexivity (the same category expressed by various, lexically determined, morphemes). Well-studied examples of languages with case+number exponence include Russian and Finnish, while flexive languages include, again, Russian and Latin. Motivated by previous findings on the impact of fine-grained morphological features on language modeling difficulty (Gerz et al., 2018), we experiment with three types of suffixes (see examples in Table 2):

• Overt: number and case are denoted by easily separable suffixes (e.g., .nsubj.sg, .dobj.pl) similar to agglutinative languages (1:1);

• Implicit: the combination of number and case is expressed by unique suffixes without internal structure (e.g., kar for .nsubj.sg, ker for .dobj.pl) similar to fusional languages. This system displays exponence (many:1);

• Implicit with declensions: like the previous, but with three different paradigms, each arbitrarily assigned to a different subset of the lexicon. This system displays exponence and flexivity (many:many).
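As a rough illustration of how the three systems differ, the sketch below maps a (case, number) pair to a suffix under each one. The overt pattern and the 1st-declension implicit values follow Tables 2 and 5; the remaining 2nd/3rd-declension cells are hypothetical placeholders (marked in the comments), not values from the paper.

```python
# Overt (1:1): each grammatical category is a separable, transparent suffix.
def overt(case, number):
    return f".{case}.{number}"            # e.g. ('nsubj', 'sg') -> '.nsubj.sg'

# Implicit (many:1, exponence): one fused, unanalyzable suffix per combination.
IMPLICIT_1ST = {("nsubj", "sg"): "kar", ("nsubj", "pl"): "kon",
                ("dobj", "sg"): "kin", ("dobj", "pl"): "ker"}

# Implicit with declensions (many:many, exponence + flexivity): the suffix also
# depends on the lemma's arbitrary declension class. Only 'pon' and 'kit' below
# are attested in Table 2; the other class-2/3 values are placeholders.
DECLENSIONS = {
    1: IMPLICIT_1ST,
    2: {("nsubj", "sg"): "par", ("nsubj", "pl"): "pon",
        ("dobj", "sg"): "pin", ("dobj", "pl"): "per"},   # pin/per: hypothetical
    3: {("nsubj", "sg"): "tar", ("nsubj", "pl"): "ton",
        ("dobj", "sg"): "kit", ("dobj", "pl"): "ter"},   # tar/ton/ter: hypothetical
}

def implicit(case, number, declension=1):
    return DECLENSIONS[declension][(case, number)]
```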

A complete overview of our morphological paradigms is provided in Appendix A.3. All our languages have moderate inflectional synthesis and, in terms of fusion, are exclusively concatenative. Despite this, the effect on vocabulary size is substantial: 180% increase by overt and implicit case marking, 250% by implicit marking with declensions (in the full data setting).

Results Results are shown in the plots of Figure 2 (detailed numerical scores are given in Appendix A.4). We find that reducing training size has, unsurprisingly, a major effect on translation quality. Among source language variants, fixed-order obtains the highest quality across all setups. In terms of BLEU (2(a)), the spread among variants increases somewhat with less data; however, differences are small. A clearer picture emerges from RIBES (2(b)), whereby less data clearly leads to more disparity. This is already visible in the 100k setup, with the fixed SOV language dominating the others. Case marking, despite being necessary to disambiguate argument roles in the absence of semantic cues, does not improve translation quality and
Figure 2: EN*-FR Transformer NMT quality versus training data size (x-axis). Source language variants: Fixed-order (SOV) and free-order (random) with different case systems (r+overt/implicit/declens). Scores averaged over three training runs. Detailed numerical results are provided in Appendix A.4.

even degrades it in the low-resource setup. Looking at the challenge set results (2(c)) we see that the free-order case-marking languages are clearly disadvantaged: In the mid-resource setup, case marking improves substantially over the underspecified random, no-case language but remains far behind fixed-order. In low-resource, case marking notably hurts quality even in comparison with the underspecified language. These results thus demonstrate that free-order case-marking languages require more data than their fixed-order counterparts to be accurately translated by state-of-the-art NMT.11 Our experiments also show that this greater learning difficulty is not only due to case marking (and subsequent data sparsity), but also to word order flexibility (compare sov+overt to r+overt in Figure 2).

Regarding different morphology types, we do not observe a consistent trend in terms of overall translation quality (Europarl-test): In some cases, the richest morphology (with declensions) slightly outperforms the one without declensions—a result that would deserve further exploration. On the other hand, results on the challenge set, where most words are case-marked, show that morphological richness inversely correlates with translation quality when data is scarce. We postulate that our artificial morphologies may be too limited in scope (only 3-way case and number marking) to impact overall translation quality and leave the investigation of richer inflectional synthesis to future work.

11In the light of this finding, it would be interesting to revisit the evaluation of Bugliarello et al. (2020) in relation to varying data sizes.

Figure 3: Transformer results for more target languages (100k training size). Scores averaged over 2 runs.

5.5 Effect of Target Language

All results so far involved translation into a fixed-order (SVO) language without case marking. To verify the generality of our findings, we repeat a subset of experiments with the same synthetic English variants, but using Czech or Dutch as target language. Czech has rich fusional morphology including case marking, and very flexible order. Dutch has simple morphology (no case marking) and moderately flexible, syntactically determined order.12

12Dutch word order is very similar to German, with the position of S, V, and O depending on the type of clause.

Figure 3 shows the results with 100k training sentences. In terms of BLEU, differences are even smaller than in English-French. According to RIBES, trends are similar across target languages, with the fixed SOV source language obtaining best results and the case-marked source language obtaining worst results. This suggests that the major findings of our study are not due to the specific choice of French as the target language.

6 Related Work

The effect of word order flexibility on NLP model performance has been mostly studied in the field of syntactic parsing, for example, using Average Dependency Length (Gildea and Temperley, 2010; Futrell et al., 2015a) or head-dependent order entropy (Futrell et al., 2015b; Gulordava and Merlo, 2016) as syntactic correlates of word order freedom. Related work in language modeling has shown that certain languages are intrinsically more difficult to model than others (Cotterell et al., 2018; Mielke et al., 2019) and has furthermore studied the impact of fine-grained morphology features (Gerz et al., 2018) on LM perplexity.

Regarding the word order biases of seq-to-seq models, Chaabouni et al. (2019) use miniature languages similar to those of Section 4 to study the evolution of LSTM-based agents in a simulated iterated learning setup. Their results in a standard ''individual learning'' setup show, like ours, that a free-order case-marking toy language can be learned just as well as a fixed-order one, confirming earlier results obtained by simple Elman networks trained for grammatical role classification (Lupyan and Christiansen, 2002). Transformer was not included in these studies. Choshen and Abend (2019) measure the ability of LSTM- and Transformer-based NMT to model a language pair where the same arbitrary (non-syntactically motivated) permutation is applied to all source sentences. They find that Transformer is largely indifferent to the order of source words (provided this is fixed and consistent across training and test set) but nonetheless struggles to translate long dependencies actually occurring in natural data. They do not directly study the effect of order flexibility.

The idea of permuting dependency trees to
generate synthetic languages was introduced in-
dependently by Gulordava and Merlo (2016)
(discussed above) and by Wang and Eisner (2016),
the latter with the aim of diversifying the set
of treebanks currently available for language
adaptation.

7 Conclusions

We have presented an in-depth analysis of how Neural Machine Translation difficulty is affected by word order flexibility and case marking in the source language. Although these common language properties were previously shown to negatively affect parsing and agreement prediction accuracy, our main results show that state-of-the-art NMT models, especially Transformer-based ones, have little or no bias towards fixed-order languages. Our simulated low-resource experiments, however, reveal a different picture, that is: Free-order case-marking languages require more data to be translated as accurately as their fixed-order counterparts. Because parallel data (like labeled data in general) are scarce for most of the world languages (Guzmán et al., 2019; Joshi et al., 2020), we believe this should be considered as a further obstacle to language equality in future NLP technologies.

In future work, our analysis should be extended to target language variants using principled alternatives to BLEU (Bugliarello et al., 2020), and to other typological features that are likely to affect MT performance, such as inflectional synthesis and degree of fusion (Gerz et al., 2018). Finally, the synthetic languages and challenge set proposed in this paper could be used to evaluate syntax-aware NMT models (Eriguchi et al., 2016; Bisk and Tran, 2018; Currey and Heafield, 2019), which promise to better capture linguistic structure, especially in low-resource scenarios.

Acknowledgments

Arianna Bisazza was partly funded by the Netherlands Organization for Scientific Research (NWO) under project number 639.021.646. We would like to thank the Center for Information Technology of the University of Groningen for providing access to the Peregrine HPC cluster,

and the anonymous reviewers for their helpful
comments.

References

Duygu Ataman and Marcello Federico. 2018.
An evaluation of two vocabulary reduction
methods for neural machine translation. 在
Proceedings of the 13th Conference of the Asso-
ciation for Machine Translation in the Americas
(Volume 1: Research Papers), pages 97–110.

Matthew Baerman and Dunstan Brown. 2013. Case syncretism. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-5301

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1080

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2008. Predicting success in machine translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 745–754, Honolulu, Hawaii. Association for Computational Linguistics. https://doi.org/10.3115/1613715.1613809

Yonatan Bisk and Ke Tran. 2018. Inducing grammars with and for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 25–35, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-2704

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Belgium, Brussels. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-6401

Emanuele Bugliarello, Sabrina J. Mielke, Antonios Anastasopoulos, Ryan Cotterell, and Naoaki Okazaki. 2020. It's easier to translate out of English than into it: Measuring neural translation difficulty by cross-mutual information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1640–1649, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.149

Rahma Chaabouni, Eugene Kharitonov, Alessandro Lazaric, Emmanuel Dupoux, and Marco Baroni. 2019. Word-order biases in deep-agent emergent communication. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5166–5175, Florence, Italy. Association for Computational Linguistics.

David Chiang and Kevin Knight. 2006. An introduction to synchronous grammars. Tutorial available at http://www.isi.edu/~chiang/papers/synchtut.pdf.


Leshem Choshen and Omri Abend. 2019. Automatically extracting challenge sets for non-local phenomena in neural machine translation. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL),

pages 291–303, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1509


Bernard Comrie. 1981. Language Universals and Linguistic Typology. Blackwell.

Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

Anna Currey and Kenneth Heafield. 2019. Incorporating source syntax into transformer-based neural machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 24–33, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-5203

Matthew S. Dryer. 2013. Order of subject, object and verb. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig. https://wals.info/

Jinhua Du and Andy Way. 2017. Pre-reordering for neural machine translation: Helpful or harmful? The Prague Bulletin of Mathematical Linguistics, 108(1):171–182. https://doi.org/10.1515/pralin-2017-0018

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211. https://doi.org/10.1207/s15516709cog1402_1

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1078

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015a. Large-scale evidence of dependency length minimization in 37 languages. Proceedings of the National Academy of Sciences, 112(33):10336–10341. https://doi.org/10.1073/pnas.1502134112, PubMed: 26240370

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015b. Quantifying word order freedom in dependency corpora. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 91–100, Uppsala, Sweden. Uppsala University, Uppsala, Sweden.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1029

Daniel Gildea and David Temperley. 2010. Do grammars minimize dependency length? Cognitive Science, 34(2):286–310. https://doi.org/10.1111/j.1551-6709.2009.01073.x, PubMed: 21564213

Joseph H. Greenberg. 1963. Some universals of grammar with particular reference to the order of meaningful elements. In Joseph H. Greenberg, editor, Universals of Human Language, pages 73–113. MIT Press, Cambridge, MA.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.

Kristina Gulordava and Paola Merlo. 2015. Diachronic trends in word order freedom and dependency length in dependency-annotated corpora of Latin and ancient Greek. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 121–130, Uppsala, Sweden. Uppsala University, Uppsala, Sweden. https://doi.org/10.18653/v1/N18-1108

Kristina Gulordava and Paola Merlo. 2016.
Multi-lingual dependency parsing evaluation:
A large-scale analysis of word order prop-
erties using artificial data. Transactions of
the Association for Computational Linguistics,
4:343–356. https://doi.org/10.1162
/tacl_a_00103

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6100–6113. https://doi.org/10.18653/v1/D19-1632

Michael Hahn. 2020. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171. https://doi.org/10.1162/tacl_a_00306

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735, PubMed: 9377276

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 944–952, Cambridge, MA. Association for Computational Linguistics.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.560

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In The Tenth Machine Translation Summit Proceedings of Conference, pages 79–86. International Association for Machine Translation.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742. https://doi.org/10.1162/tacl_a_00343

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1166

Gary Lupyan and Morten H. Christiansen. 2002. Case, word order, and language learnability: Insights from connectionist modeling. In Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP

natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60. https://doi.org/10.3115/v1/P14-5010

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. https://doi.org/10.21236/ADA273556

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1491

Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 425–435. https://doi.org/10.18653/v1/D18-1039

Martin Popel, Marketa Tomkova, Jakub Tomek, Łukasz Kaiser, Jakob Uszkoreit, Ondřej Bojar, and Zdeněk Žabokrtský. 2020. Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nature Communications, 11(1):4381. https://doi.org/10.1038/s41467-020-18073-9, PubMed: 32873773

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019. Studying the inductive biases of RNNs with synthetic variations of natural languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3532–3542, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1356

Benoît Sagot. 2013. Comparing complexity measures. In Computational Approaches to Morphological Complexity. Paris, France. Surrey Morphology Group.

Karthik Abinav Sankararaman, Soham De, Zheng Xu, W. Ronny Huang, and Tom Goldstein. 2020. Analyzing the effect of neural network architecture on training performance. In Proceedings of Machine Learning and Systems 2020, pages 9834–9845.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1162

Kaius Sinnemäki. 2008. Complexity trade-offs in core argument marking. Language Complexity, pages 67–88. John Benjamins. https://doi.org/10.1075/slcs.94.06sin

Dan I. Slobin. 1966. The acquisition of Russian as
a native language. The Genesis of Language: A
Psycholinguistic Approach, pages 129–148.

Dan I. Slobin and Thomas G. Bever. 1982. Children use canonical sentence schemas: A crosslinguistic study of word order and inflections. Cognition, 12(3):229–265. https://doi.org/10.1016/0010-0277(82)90033-6

Ke Tran, Arianna Bisazza, and Christof Monz.
2018. The importance of being recurrent for

modeling hierarchical structure. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4731–4736, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1503

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 5998–6008.


Dingquan Wang and Jason Eisner. 2016. The galactic dependencies treebanks: Getting more data by synthesizing new languages. Transactions of the Association for Computational Linguistics, 4:491–505. https://doi.org/10.1162/tacl_a_00113


Adina Williams, Tiago Pimentel, Hagen Blix, Arya D. McCarthy, Eleanor Chodroff, and Ryan Cotterell. 2020. Predicting declension class from form and meaning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6682–6695, Online. Association for Computational Linguistics.

Yang Zhao, Jiajun Zhang, and Chengqing Zong. 2018. Exploiting pre-ordering for neural machine translation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018). https://doi.org/10.18653/v1/2020.acl-main.597

A Appendices

A.1 NMT Hyperparameters

In the toy parallel grammar experiments (§4), batch size of 64 (sentences) and 1K max update steps are used for all models. We train BiLSTM with learning rate 1, and Transformer with learning rate of 2 together with 40 warm-up steps by using noam learning rate decay. Dropout ratio of 0.3 and 0.1 are used in BiLSTM and Transformer models respectively. In the synthetic English variants experiments (§5), we set a constant learning rate of 0.001 for BiLSTM. We also increased batch size to 128, number of warm-up steps to 80K and update steps to 2M for all models. Finally, for the 100k and 10k data-size experiments, we decreased the warm-up steps to 4K. During evaluation we chose the best performing model on the validation set.

A.2 Challenge Set

The English-French challenge set used in this paper, and available at https://github.com/arianna-bis/freeorder-mt, is generated by a small synchronous context-free grammar and contains 7,200 simple sentences consisting of a subject, a transitive verb, and an object (see Table 4). All sentences are in the present tense; half are affirmative, and half negative. All nouns in the grammar can plausibly act as both subject and object of the verbs, so that an MT system must rely on sentence structure to get perfect translation accuracy. The sentences are from a general domain, but we specifically choose nouns
are well represented in the Europarl corpus: 最多
have thousands of occurrences, while the rarest
word has about 80. Sentence example (英语
边): ‘The teacher does not respect the student.’
and its reverse: ‘The student does not respect the
teacher.’

A.3 Morphological Paradigms

The complete list of morphological paradigms used in this work is shown in Table 5. The implicit language with exponence (many:1) uses only the suffixes of the 1st (default) declension. The implicit language with exponence and flexivity (many:many) uses three declensions, assigned as

Overt suffixes (Unambiguous): .nsubj.sg, .nsubj.pl, .dobj.sg, .dobj.pl, .iobj.sg, .iobj.pl
Overt suffixes (Syncretic): .arg.sg, .arg.pl

Implicit suffixes, 1st (default) declension: kar, kon, kin, ker, ken, kre
Implicit suffixes, 2nd declension: par, pon, et, kez, kr
Implicit suffixes, 3rd declension: pa, kit, ket, ke, re

Table 5: The artificial morphological paradigms used in this work, extended from Ravfogel et al. (2019). 1st, 2nd and 3rd are the declensions in the flexive language.

as follows: First, the list of lemmas extracted from the training set is randomly split into three classes,13 with distribution 1st: 60%, 2nd: 30%, 3rd: 10%. Then, each core verb argument occurring in the corpus is marked with the suffix corresponding to its lemma's declension.

A.4 Effect of Data Size and Morphological Features: Detailed Results

Table 6 shows the detailed numerical results corresponding to the plots of Figure 2 in the main text.

Eparl-BLEU             1.9M    100k    10k
original               38.3    26.9    11.0
SOV                    37.9    25.3     8.8
SOV+overt              37.4    24.6     8.4
random                 37.5    24.6     8.5
random+overt           37.3    24.1     7.8
random+implicit        37.3    24.3     7.1
random+declens         37.4    23.1     7.7

Eparl-RIBES            1.9M    100k    10k
original               84.9    80.1    67.5
SOV                    84.5    78.7    64.1
SOV+overt              84.5    78.4    63.1
random                 84.2    77.7    61.7
random+overt           84.4    77.6    61.6
random+implicit        84.3    77.4    59.8
random+declens         84.3    77.1    61.3

Challenge-RIBES        1.9M    100k    10k
original               97.7    92.2    74.2
SOV                    97.2    89.8    69.5
SOV+overt              97.7    86.5    64.9
random                 74.1    72.9    63.1
random+overt           98.1    84.5    57.4
random+implicit        97.5    85.4    54.8
random+declens         97.6    84.4    53.1

Table 6: Detailed results corresponding to the plots of Figure 2: EN*-FR Transformer NMT quality versus training data size (1.9M, 100K, or 10K sentence pairs). Source language variants: Fixed-order (SOV) and free-order (random) with different case systems (+overt/implicit/declens). Scores averaged over three training runs.

13See Williams et al. (2020) for an interesting account of
how declension classes are actually partly predictable from
form and meaning.
