Data-driven Model Generalizability in Crosslinguistic Low-resource
Morphological Segmentation
Zoey Liu
Department of Computer Science
Boston College, United States
zoey.liu@bc.edu
Emily Prud’hommeaux
Department of Computer Science
Boston College, United States
prudhome@bc.edu
Abstract
Common designs of model evaluation typically focus on monolingual settings, where different models are compared according to their performance on a single data set that is assumed to be representative of all possible data for the task at hand. While this may be reasonable for a large data set, this assumption is difficult to maintain in low-resource scenarios, where artifacts of the data collection can yield data sets that are outliers, potentially making conclusions about model performance coincidental. To address these concerns, we investigate model generalizability in crosslinguistic low-resource scenarios. Using morphological segmentation as the test case, we compare three broad classes of models with different parameterizations, taking data from 11 languages across six language families. In each experimental setting, we evaluate all models on a first data set, then examine their performance consistency when introducing new randomly sampled data sets with the same size and when applying the trained models to unseen test sets of varying sizes. The results demonstrate that the extent of model generalization depends on the characteristics of the data set, and does not necessarily rely heavily on the data set size. Among the characteristics that we studied, the ratio of morpheme overlap and that of the average number of morphemes per word between the training and test sets are the two most prominent factors. Our findings suggest that future work should adopt random sampling to construct data sets with different sizes in order to make more responsible claims about model evaluation.
1 Introduction
In various natural language processing (NLP) studies, when evaluating or comparing the performance of several models for a specific task, the current common practice tends to examine these models with one particular data set that is
deemed representative. The models are usually
trained and evaluated using a predefined split
of training/test sets (with or without a develop-
ment set), or using cross-validation. Given the
results from the test set, authors typically con-
clude that certain models are better than others
(Devlin et al., 2019), report that one model is gen-
erally suitable for particular tasks (Rajpurkar et al.,
2016; Wang et al., 2018; Kondratyuk and Straka,
2019), or state that large pretrained models have
‘‘mastered’’ linguistic knowledge at human levels
(Hu et al., 2020).
Consequently, the ‘‘best’’ models, while not
always surpassing other models by a large mar-
gin, are subsequently cited as the new baseline
or adopted for state-of-the-art comparisons in
follow-up experiments. These new experiments
might not repeat the comparisons that were carried
out in the work that identified the best model(s).
This model might be extended to similar tasks
in different domains or languages, regardless
of whether these new tasks have characteris-
tics comparable to those where the model was
demonstrated to be effective.
That being said, just because a self-proclaimed
authentic Chinese restaurant is very good at cook-
ing orange chicken, does not mean that its orange
chicken will be delicious on every visit to the
restaurant. Similarly, common implementations
of model comparisons have been called into
question. Recent work by Gorman and Bedrick
(2019) (see also Szymański and Gorman [2020]
and Søgaard et al. [2021]) analyzed a set of the
best part-of-speech (POS) taggers using the
Wall Street Journal (WSJ) from the Penn Tree-
bank (Marcus et al., 1993). They demonstrated
inconsistent model performance with randomly
generated splits of the WSJ data in contrast to the
predefined splits as used in previous experiments
(Collins, 2002).
Transactions of the Association for Computational Linguistics, vol. 10, pp. 393–413, 2022. https://doi.org/10.1162/tacl_a_00467
Action Editor: Luke Zettlemoyer. Submission batch: 9/21; Revision batch: 12/2021; Published 4/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Others (see Linzen [2020] for a review) have
argued that the majority of model evaluations are
conducted with data drawn from the same distri-
bution—that is, the training and the test sets bear
considerable resemblance in terms of their statis-
tical characteristics—and therefore these models
will not generalize well to new data sets from dif-
ferent domains or languages (McCoy et al., 2019,
2020; Jia and Liang, 2017). In other words, to evaluate the authenticity of a Chinese restaurant,
one would need to not only try the orange chicken,
but also order other traditional dishes.
Previous work on model evaluation, however, has focused on high-resource scenarios, in which
there is abundant training data. Very little ef-
fort has been devoted to exploring whether the
performance of the same model will generalize
in low-resource scenarios (Hedderich et al., 2021;
Ramponi and Plank, 2020). Concerns about model
evaluation and generalizability are likely to be
even more fraught when there is a limited amount
of data available for training and testing.
Imagine a task in a specific domain or language
that typically utilizes just one specific data set,
which is common in low-resource cases. We will
refer to the data set as the first data set. When
this first data set is of a reasonable size, it may
well be the case (though it might not be true)
that the data is representative of the population
for the domain or language, and that the training
and the test sets are drawn more or less from the
same distribution. The performance of individual
models and the rankings of different models can
be expected to hold, at the very least, for new data
from the same domain.
Now extend that supposition to a task in a specific domain or language situated in a low-resource scenario. Say this task also resorts to just one particular data set, which we again call the first data set. Considering the limitation that the data set
size is relatively small, one might be skeptical of
the claim that this first data set could represent the
population distribution of the domain or language.
Accordingly, within the low-resource scenario,
model evaluations obtained from just the first data
set or any one data set could well be coincidental
or random. These results could be specific to the
characteristics of the data set, which might not
necessarily reflect the characteristics of the do-
main or language more broadly given the small
size of the data. This means that even within the
same domain or language, the same model might
not perform well when applied to additional data
sets of the same sizes beyond the first data set, and
the best models based on the first data set might
not outperform other candidates when facing new
data. Continuing the Chinese restaurant analogy,
the same orange chicken recipe might not turn out
well with a new shipment of oranges.
As an empirical illustration of our concerns
raised above, this paper studies model generaliz-
ability in crosslinguistic low-resource scenarios.
For our test case, we use morphological segmen-
tation, the task of decomposing a given word into
individual morphemes (e.g., modeling → model +
ing). In particular, we focus on surface segmen-
tation, where the concatenation of the segmented
sub-strings stays true to the orthography of the
word (Cotterell et al., 2016b).1 This also involves
words that stand alone as free morphemes (e.g.,
free → free).
Leveraging data from 11 languages across six
language families, we compare the performance
consistency of three broad classes of models
(with varying parameterizations) that have gained
in popularity in the literature of morphologi-
cal segmentation. In each of the low-resource
experimental settings (Section 6.1), we ask:
(1) To what extent do individual models gener-
alize to new randomly sampled data sets of
the same size as the first data set?
(2) For the model that achieves the best results
overall across data sets, how does its perfor-
mance vary when applied to new test sets of
different sizes?
2 Prior Work: Model Generalizability
A number of studies have noted inconsistent per-
formance of the same models across a range of
generalization and evaluation tasks. For example,
McCoy et al. (2020) demonstrated that the same
pretrained language models, when paired with
classifiers that have different initial weights or are
fine-tuned with different random seeds, can lead to
1This includes segmentation for both inflectional and
derivational morphology; we used the term ‘‘sub-strings’’,
since not all elements in the segmented forms will be a
linguistically defined morpheme due to the word formation
processes. Therefore surface segmentation is in opposition
to canonical segmentation; see Cotterell et al. (2016b) for
detailed descriptions of the differences between the two
tasks.
drastically unstable results (McCoy et al., 2019).
This pattern seems to hold for both synthetic
(Weber et al., 2018; McCoy et al., 2018) and natu-
ralistic language data (Zhou et al., 2020; Reimers
and Gurevych, 2017). Using approximately 15,000 artificial training examples, Lake and Baroni (2018) showed that sequence-to-sequence models do not generalize well with small data sets
(see also Akyurek and Andreas, 2021). Through a
series of case studies, Dodge et al. (2019) demon-
strated that model performance showed a different
picture depending on the amount of computation
associated with each model (e.g., number of it-
erations, computing power). They suggested that
providing detailed documentation of results on
held-out data is necessary every time changes in
the hyperparameters are applied.
Another line of research has focused on test-
ing the generalizability of large language models
with controlled psycholinguistic stimuli such as
subject-verb agreement (Linzen et al., 2016),
filler-gap dependencies (Wilcox et al., 2019), et
garden-path sentences (Futrell et al., 2019), where
the stimuli data share statistical similarities with
the training data of the language models to dif-
ferent degrees (Wilcox et al., 2020; Hu et al.,
2020; Thrush et al., 2020). This targeted evalua-
tion paradigm (Marvin and Linzen, 2018; Linzen
and Baroni, 2021) tends to compare model perfor-
mance to behaviors of language users. It has been
claimed that this approach is able to probe the
linguistic capabilities of neural models, as well as
shed light on humans’ ability to perform gram-
matical generalizations. Thus far, related studies
have mainly investigated English, with notable
exceptions of crosslinguistic extensions to other
typologically diverse languages, such as Hebrew
(Mueller et al., 2020), Finnish (Dhar and Bisazza,
2021), and Mandarin Chinese (Wang et al., 2021;
Xiang et al., 2021).
As fruitful as the prior studies are, they are
mostly targeted towards evaluating models that
have access to a large or at least reasonable amount
of training data. In contrast, there is, as yet, no
concrete evidence regarding whether and to what
extent models generalize when facing limited data.
In addition, although recent work has pro-
posed different significance testing methods in
order to add statistical rigor to model evalua-
tion (Gorman and Bedrick, 2019; Szymański and
Gorman, 2020; Dror et al., 2018), most tests make
assumptions on the distributional properties or the
sampling process of the data (e.g., data points are
sampled independently [Søgaard, 2013]), and a
number of these tests such as bootstrapping are
not suitable when data set size is extremely small
due to lack of power (see also Card et al., 2020).
3 Morphological Surface Segmentation
Why morphological surface segmentation? First,
previous work has demonstrated that morpho-
logical supervision is useful for a variety of
NLP tasks, including but not limited to machine
translation (Clifton and Sarkar, 2011), depen-
dency parsing (Seeker and Çetinoğlu, 2015),
bilingual word alignment (Eyig¨oz et al., 2013),
and language modeling (Blevins and Zettlemoyer,
2019). For languages with minimal training re-
sources, surface or subword-based segmentation
is able to effectively mitigate data sparsity issues
(Tachbelie et al., 2014; Ablimit et al., 2010).
For truly low-resource cases, such as endangered and indigenous languages, morpheme-level knowledge has the potential to advance development of language technologies such as automatic speech recognition (Afify et al., 2006) in order to facilitate community language documentation.
Second, information about morphological struc-
tures is promising for language learning. Espe-
cially for indigenous groups, prior studies have
proposed incorporating morphological annotation
into the creation of online dictionaries as well
as preparing teaching materials, with the goal
of helping the community’s language immersion
programs (Garrett, 2011; Spence, 2013).
Third, despite its utility in different tasks, semi-linguistically informed subword units are not as widely used as they might be, since acquiring labeled data for morphological segmentation in general, including the case of surface segmentation, requires extensive linguistic knowledge from native or advanced speakers of the language. Linguistic expertise can be much more difficult to find for critically endangered languages, which, by definition
(Meek, 2012), are spoken by very few peo-
ple, many of whom are not native speakers of
the language. These aforementioned limitations
necessitate better understanding of the perfor-
mance for different segmentation models and
how to reliably estimate their effectiveness.
Language    | Language family | Morphological feature(s) | N of types in initial data | Data set sizes                           | New test set sizes
Yorem Nokki | Yuto-Aztecan    | Polysynthetic            | 1,050                      | {500, 1,000}                             | –
Nahuatl     | Yuto-Aztecan    | Polysynthetic            | 1,096                      | {500, 1,000}                             | –
Wixarika    | Yuto-Aztecan    | Polysynthetic            | 1,350                      | {500, 1,000}                             | {50, 100}
English     | Indo-European   | Fusional                 | 1,686                      | {500, 1,000, 1,500}                      | {50, 100}
German      | Indo-European   | Fusional                 | 1,751                      | {500, 1,000, 1,500}                      | {50, 100}
Persian     | Indo-European   | Fusional                 | 32,292                     | {500, 1,000, 1,500, 2,000, 3,000, 4,000} | {50, 100, 500, 1,000}
Russian     | Indo-European   | Fusional                 | 95,922                     | {500, 1,000, 1,500, 2,000, 3,000, 4,000} | {50, 100, 500, 1,000}
Turkish     | Turkic          | Agglutinative            | 1,760                      | {500, 1,000, 1,500}                      | {50, 100}
Finnish     | Uralic          | Agglutinative            | 1,835                      | {500, 1,000, 1,500}                      | {50, 100}
Zulu        | Niger-Congo     | Agglutinative            | 10,040                     | {500, 1,000, 1,500, 2,000, 3,000, 4,000} | {50, 100, 500, 1,000}
Indonesian  | Austronesian    | Agglutinative            | 3,500                      | {500, 1,000, 1,500, 2,000, 3,000}        | {50, 100, 500}
Table 1: Descriptive statistics for the languages in our experiments. Data set sizes apply to both sampling
strategies (with and without replacement). New test set sizes (sampled without replacement) refer to the
different sizes of constructed test sets described in Section 6.2.
4 Meet the Languages
A total of eleven languages from six language
families were invited to participate in our
experiments. Following recently proposed scien-
tific practices for computational linguistics and
NLP (Bender and Friedman, 2018; Gebru et al.,
2018), we would like to introduce these languages
and their data sets explored in our study (Table 1).
First we have three indigenous languages
from the Yuto-Aztecan language family (Baker, 1997), including Yorem Nokki (Southern dialect) (Freeze, 1989) spoken in the Mexican states of Sinaloa and Sonora, Nahuatl (Eastern branch) (de Suárez, 1980) spoken in Northern
Puebla, and Wixarika (Gómez and López, 1999)
spoken in Central West Mexico. These three
Mexican languages are highly polysynthetic with
agglutinative morphology. The data for these
languages was originally digitized from the book
collections of Archive of Indigenous Language.
Data collections were carried out and made
publicly available by the authors of Kann et al.
(2018) given the narratives in their work.
Next are four members from the Indo-European
language family: English, German, Persian, et
Russian, all with fusional morphological char-
acteristics. The data for English was provided
by the 2010 Morpho Challenge (Kurimo et al.,
2010), a shared task targeted towards unsupervised
learning of morphological segmentation. The
German data came from the CELEX lexical data-
base (Baayen et al., 1996) and was made available
by Cotterell et al. (2015). The Persian data from
Ansari et al. (2019a) (see also Ansari et al., 2019b) contains crowd-sourced annotations, while the
data for Russian was extracted from an online
dictionary (Sorokin and Kravtsova, 2018).
The remaining languages each serve as a repre-
sentative of a different language family. The data
sets for Turkish from the Turkic language fam-
ily and Finnish from the Uralic language family
were again provided by the 2010 Morpho Chal-
lenge (Kurimo et al., 2010). The data collection
for both Zulu and Indonesian was carried out by
Cotterell et al. (2015). The Zulu words were taken
from the Ukwabelana Corpus (Spiegler et al.,
2010), and the segmented Indonesian words were
derived via application of a rule-based morphological analyzer to an Indonesian-English bilingual corpus.2 All four of these languages
have agglutinative morphology.
5 Meet the Models
We compared three categories of models:
sequence-to-sequence (Seq2seq) (Bahdanau et al.,
2015), conditional random field (CRF) (Lafferty
et al., 2001), and Morfessor (Creutz and Lagus,
2002).
Seq2seq The first guest in our model suite is a
character-based Seq2seq recurrent neural network
(RNN) (Elman, 1990). Previous work (Kann et al.,
2018; Liu et al., 2021) has demonstrated that this
model is able to do well for polysynthetic indige-
nous languages even with very limited amounts
of data. Consider the English word papers as an
illustration. With this word as input, the task of
the Seq2seq model is to perform:
papers → paper + s.
As we are interested in individual models,
the training setups are constant for the data of all languages. We adopted an attention-based encoder-decoder (Bahdanau et al., 2015) where the encoder consists of a bidirectional gated recurrent unit (GRU) (Cho et al., 2014), and the
decoder is composed of a unidirectional GRU.
The encoder and the decoder both have two hid-
den layers with 100 hidden states in each layer. All
embeddings have 300 dimensions. Model training
was carried out with OpenNMT (Klein et al.,
2017), using ADADELTA (Zeiler, 2012) and a
batch size of 16.
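To make the Seq2seq input/output format concrete, the sketch below (our own illustration in Python, not the authors' released preprocessing) converts (word, segmentation) pairs into the space-separated character sequences that OpenNMT-style toolkits typically consume; the "+" boundary marker and the file layout are assumptions made purely for illustration.

```python
# Hypothetical preprocessing sketch: turn (word, morphemes) pairs into
# character-level source/target lines for a Seq2seq segmentation model.
# The "+" boundary symbol and the formatting are illustrative assumptions.

def to_char_seq2seq(pairs):
    """pairs: iterable of (word, morphemes), e.g. ("papers", ["paper", "s"])."""
    sources, targets = [], []
    for word, morphemes in pairs:
        sources.append(" ".join(word))                               # "p a p e r s"
        targets.append(" + ".join(" ".join(m) for m in morphemes))   # "p a p e r + s"
    return sources, targets

if __name__ == "__main__":
    src, tgt = to_char_seq2seq([("papers", ["paper", "s"]), ("free", ["free"])])
    print(src[0])  # p a p e r s
    print(tgt[0])  # p a p e r + s
```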
Order-k CRF The second guest is the order-k
CRF (hereafter k-CRF) (Lafferty et al., 2001;
Ruokolainen et al., 2013), a type of log-linear
discriminative model that treats morphological
segmentation as an explicit sequence tagging task.
Given a character wt within a word w, where t is
the index of the character, CRF gradually predicts
the label yt of the character from a designed feature
set xt that is composed of local (sub-)strings.
2https://github.com/desmond86/Indonesian-English-Bilingual-Corpus
In detail, k-CRF takes into consideration the
label(s) of k previous characters in a word
(Cuong et al., 2014; Cotterell et al., 2015). This means that when k = 0 (i.e., 0-CRF), the label yt
is dependent on just the feature set xt of the current
character wt. On the other hand, when k ≥ 1, dans
our settings, the prediction of yt is context-driven
and additionally considers a number of previous
labels {yt−k, . . . , yt−1}.
Prior work (Vieira et al., 2016) (see also Mueller
et al., 2013) has claimed that, with regard to choosing a value of k, 2-CRF usually yields good results; increasing k from 0 to 1 leads to large improvements, while increasing from 1 to 2 results in a significant yet weaker boost in model
performance. With that in mind, we investigated
five different values of k (k = {0, 1, 2, 3, 4}).
In training the CRFs, the feature set for every
character in a word is constructed as follows. Each word is first appended with a start (⟨w⟩) and an end (⟨/w⟩) symbol. Then every position t of the word is assigned a label and a feature set. The feature set consists of the substring(s) occurring both on the left and on the right side of the current position up to a maximum length, δ. As a demonstration, take the same English word above (papers) as an example. For the third character p in the word, with δ having a value of 3, the set of substrings on the left and right side would be, respectively, {a, pa, ⟨w⟩pa} and {p, pe, per}. The concatenation of
these two sets would be the full feature set of the
character p. In our experiments, we set δ to have
a constant value of 4 across all languages.
Depending on the position index t and the mor-
pheme that this index occurs in, each character can
have one of six labels: {START (start of a word),
B (beginning of a multi-character morpheme), M (middle of a multi-character morpheme), E (end of a multi-character morpheme), S (single-character morpheme), END (end of a word)}. Therefore,
papers has the following segmentation label
representation, and the training objective of the
CRF model is to learn to correctly label each
character given its corresponding feature set.
⟨w⟩     p   a   p   e   r   s   ⟨/w⟩
START   B   M   M   M   E   S   END
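The sketch below illustrates our reading of the feature and label construction just described (a hedged illustration, not the authors' exact CRFsuite feature templates): every padded position receives left and right substrings up to length δ, and every character receives one of the six labels.

```python
# Illustrative sketch of the delta-bounded substring features and the
# START/B/M/E/S/END labels described above; the exact feature strings passed
# to CRFsuite in the original experiments may differ.

def char_features(word, delta=4):
    chars = ["<w>"] + list(word) + ["</w>"]
    feats = []
    for t in range(len(chars)):
        left = ["".join(chars[max(0, t - d):t]) for d in range(1, delta + 1)]
        right = ["".join(chars[t:t + d]) for d in range(1, delta + 1)]
        feats.append({"left": [s for s in left if s], "right": right})
    return chars, feats

def char_labels(morphemes):
    labels = ["START"]
    for m in morphemes:
        labels.extend(["S"] if len(m) == 1 else ["B"] + ["M"] * (len(m) - 2) + ["E"])
    labels.append("END")
    return labels

if __name__ == "__main__":
    chars, feats = char_features("papers", delta=3)
    print(feats[3]["left"])   # ['a', 'pa', '<w>pa']  (third character p)
    print(feats[3]["right"])  # ['p', 'pe', 'per']
    print(char_labels(["paper", "s"]))
    # ['START', 'B', 'M', 'M', 'M', 'E', 'S', 'END']
```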
In our study, all linear-chain CRFs were implemented with CRFsuite (Okazaki, 2007) with modifications. Model parameters were estimated
with L-BFGS (Liu and Nocedal, 1989) and L2 reg-
ularization. Again, the training setups were kept
the same for all languages.
Supervised Morfessor The last participating
model
is the supervised variant of Morfes-
sor (Creutz and Lagus, 2002; Rissanen, 1998)
which uses algorithms similar to those of the
semi-supervised variant (Kohonen et al., 2010),
except that, given our crosslinguistic low-resource
scenarios, we did not resort to any additional un-
labeled data. For the data of every language, all
Morfessor models were trained with the default
parameters.
In the remainder of this paper, to present clear
narratives, in cases where necessary, we use the
term model alternative(s) to refer to any of the
seven alternatives: Morfessor, Seq2seq, 0-CRF,
1-CRF, 2-CRF, 3-CRF, and 4-CRF.
6 Experiments
6.1 Generalizability Across Data Sets
Our first question concerns the generalizability of
results from the first data set across other data sets
of the same size. For data set construction, we
took an approach similar to the synthetic method
from Berg-Kirkpatrick et al. (2012). Given the
initial data for each language, we first calculated
the number of unique word types; after comparing
this number across languages, we decided on a
range of data set sizes that are small in general
(Table 1), then performed random sampling to
construct data sets accordingly.
The sampling process is as follows, using Rus-
sian as an example. For each data set size (e.g.,
500), we randomly sampled words with replace-
ment to build one data set of this size; this data
set was designated the first data set; and then an-
other 49 data sets were sampled in the same way.
We repeated a similar procedure using sampling
without replacement. Each sampled data set was
randomly assigned to training/test sets at a 3:2
ratio, five times. In what follows, we use the term
experimental setting to refer to each unique com-
bination of a specific data set size and a sampling
strategy.
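A minimal sketch of this sampling scheme, assuming the initial data is simply a list of word types (seeding and bookkeeping are illustrative choices, not the exact experimental code):

```python
import random

def build_datasets(words, size, n_datasets=50, replacement=True, seed=0):
    """Draw n_datasets samples of a given size, with or without replacement."""
    rng = random.Random(seed)
    datasets = []
    for _ in range(n_datasets):
        if replacement:
            sample = [rng.choice(words) for _ in range(size)]
        else:
            sample = rng.sample(words, size)
        datasets.append(sample)
    return datasets  # datasets[0] plays the role of the "first data set"

def random_splits(dataset, n_splits=5, train_ratio=0.6, seed=0):
    """Split one data set into training/test sets at a 3:2 ratio, n_splits times."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        shuffled = dataset[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_ratio)
        splits.append((shuffled[:cut], shuffled[cut:]))
    return splits
```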
Compared with Berg-Kirkpatrick et al. (2012),
our data set constructions differ in three aspects.
First, our sampling strategies include both with and without replacement in order to simulate realistic settings (e.g., real-world endangered lan-
guage documentation activities) where the training
and the test sets may or may not have overlapping
items. Second, in each experimental setting with a
specific sampling strategy and a data set size, un
total of 250 models were trained given each model
alternative (5 random splits ∗ 50 data sets); this
contrasts with the number of models for different
tasks in Berg-Kirkpatrick et al. (2012), ranging
from 20 for word alignment to 150 for machine
translation. Third, we aim to build data sets of very
small sizes to suit our low-resource scenarios.
For every model trained on each random split,
five metrics were computed to evaluate its perfor-
mance on the test set: full form accuracy, mor-
pheme precision, recall, and F1 (Cotterell et al.,
2016a; van den Bosch and Daelemans, 1999),
and average Levenshtein distance (Levenshtein,
1966). The average of every metric across the
five random splits of each data set was then com-
puted. For each experimental setting, given each
metric, we measured the consistency of model
performance from the first data set as follows:
(1) the proportion of times the best model alter-
native based on the first data set is the best
across the 50 data sets;
(2) the proportion of times the model ranking of
the first data set holds across the 50 data sets.
If a particular model A is indeed better than
alternative B based on results from the first data set, we would expect to see that hold across the 50 data sets. Furthermore, the model ranking drawn
from the first data set would be the same for the
other data sets as well.
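As a concrete, hedged reading of the per-word metrics above, the sketch below computes morpheme precision, recall, and F1 over the multisets of predicted versus gold morphemes, plus the Levenshtein distance between the predicted and gold segmented strings; the original implementations may differ in detail.

```python
from collections import Counter

def morpheme_prf(pred, gold):
    """pred, gold: lists of morphemes for one word, e.g. ["paper", "s"]."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    p = overlap / len(pred) if pred else 0.0
    r = overlap / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Full form accuracy is simply whether the predicted segmentation of the whole
# word matches the gold segmentation exactly.
print(morpheme_prf(["paper", "s"], ["pape", "rs"]))  # (0.0, 0.0, 0.0)
print(levenshtein("paper s", "pape rs"))             # 2
```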
While we examined the results of all metrics,
for the sake of brevity, we focus on F1 scores
when presenting results (Section 7).3
6.2 Generalizability Across New Test Sets
In addition to evaluating model performance
across the originally sampled data sets of the
same sizes, we also investigated the generalizabil-
ity of the best model alternatives from Section 6.1
when facing new unseen test sets. Taking into
consideration the type counts of the initial data for
every language and the sizes of their sampled data
sets in our experiments, a range of test set sizes
that would be applicable to all data set sizes was
decided, shown in Table 1. (Note that no new test
sets were created for Yorem Nokki and Nahuatl,
3Code and full results are in quarantine at https://github.com/zoeyliu18/orange_chicken.
Figure 1: Generalizability results for the Russian example in Section 7.1.
since it would be feasible for these two languages
only when the data set size is 500.) Then, for every
data set, we selected all unique words from the
initial data that did not overlap with those in the
data set. From these words, given a test set size,
we performed random sampling without replace-
ment 100 times. In other words, each data set had
correspondingly 100 new test sets of a particu-
lar size.
After constructing new test sets, for each ex-
perimental setting, we picked the overall best
performing model alternative based on average
observations across the 50 data sets from Sec-
tion 6.1. For each data set, this model alternative
was trained five times (via 5 random splits). The trained models were then evaluated with each of the 100 new test sets for the data set and the same five metrics of segmentation performance were computed. Again, due to space constraints, we focus on F1 in presentations of the results.
7 Results
7.1 A Walk-through
To demonstrate the two ways of evaluating
model generalizability outlined above, take Rus-
sian again as an example, when the data set size is
2,000 and the evaluation metric is F1 score. For the
first data set, when sampled with replacement, the
best model is Seq2seq, achieving an average F1
score of 74.26. As shown in Figure 1a, the propor-
tion of times Seq2seq is the best model across the
50 data sets is 12%. The model performance rank-
ing for the first data set is Seq2seq > 1-CRF >
4-CRF > 2-CRF > 3-CRF > 0-CRF > Mor-
fessor, a ranking that holds for 2% of all 50
data sets.
By comparison, when the first data set was
sampled without replacement, the best model is
1-CRF with an average F1 score of 74.62; the
proportion of times this model has the best results
across the 50 data sets is 18%. The best model
ranking is 1-CRF > 3-CRF > 2-CRF > 4-CRF >
Seq2seq > 0-CRF > Morfessor, and this pattern
accounts for 2% of all model rankings considering
the 50 data sets together.
Taking a closer look at the performance of each
model for the first data set given each sampling
strategy (Table 2), while the average F1 scores of
Morfessor and 0-CRF are consistently the lowest,
the performance of the other models is quite simi-
lar; the difference of F1 scores between each pair
of models is less than one.
Now turn to patterns across the 50 data sets as
a whole. It seems that for the Russian example,
whether sampling with or without replacement,
the best model is instead 4-CRF (Table 2), which
is different from the observations derived from the
respective first data set of each sampling method.
The differences in the proportion of times 4-CRF
ranks as the best model and that of other model al-
ternatives are quite distinguishable, although their
mean F1 scores are comparable except for Mor-
fessor and 0-CRF. With CRF models, it does not
Sampling            | Model     | Avg. F1 for the first data set | Avg. F1 across the 50 data sets | F1 range across the 50 data sets | F1 std. across the 50 data sets | % of times as the best model
with replacement    | Morfessor | 34.61 | 36.52 | (34.61, 38.36) | 0.76 | 0
with replacement    | 0-CRF     | 58.75 | 59.59 | (57.83, 62.65) | 0.87 | 0
with replacement    | 1-CRF     | 74.20 | 75.38 | (73.93, 76.66) | 0.65 | 14
with replacement    | 2-CRF     | 74.02 | 75.46 | (74.02, 76.68) | 0.65 | 22
with replacement    | 3-CRF     | 73.71 | 75.48 | (73.71, 76.52) | 0.67 | 14
with replacement    | 4-CRF     | 74.06 | 75.54 | (74.06, 76.73) | 0.68 | 38
with replacement    | Seq2seq   | 74.26 | 74.86 | (73.48, 76.45) | 0.85 | 12
without replacement | Morfessor | 34.16 | 36.04 | (34.16, 38.17) | 0.88 | 0
without replacement | 0-CRF     | 59.10 | 59.41 | (57.15, 60.95) | 0.96 | 0
without replacement | 1-CRF     | 74.62 | 75.08 | (73.72, 76.72) | 0.72 | 18
without replacement | 2-CRF     | 74.53 | 75.15 | (73.75, 76.89) | 0.81 | 16
without replacement | 3-CRF     | 74.60 | 75.18 | (73.77, 76.76) | 0.79 | 18
without replacement | 4-CRF     | 74.48 | 75.21 | (73.79, 77.14) | 0.83 | 38
without replacement | Seq2seq   | 74.00 | 74.39 | (72.54, 76.99) | 0.89 | 10
Table 2: Summary statistics of model performance for the Russian example in Section 7.1.
appear to be the case that a higher-order model
leads to significant improvement in performance.
Comparing more broadly, CRF models are better
overall than the Seq2seq models.
In spite of the small differences on average
between the better-performing model alternatives,
every model alternative presents variability in
its performance: The score range (i.e., the dif-
ference of the highest and the lowest metric
score across the 50 data sets of each experi-
mental setting) spans from 2.66 for 2-CRF to 2.97
for Seq2seq. The difference between the average
F1 for the first data set and the highest average
F1 score of the other 49 data sets is also no-
ticeable. When sampling with replacement, the value spans from 2.19 for Seq2seq to 3.90 for
0-CRF; and when sampling without replacement,
the value ranges from 1.85 for 0-CRF to 2.99
for Seq2seq.
In contrast, in this Russian example, we see
a different picture for each model alternative if
comparing their results for the first data set to
the averages across all the data sets instead; the largest mean F1 difference computed this way is 1.91 when sampling with replacement, and 1.89 for sampling without replacement, both from
scores of Morfessor; and these difference val-
ues are smaller than the score range from all
data sets.
While we focused on F1 scores, we analyzed
the other four metrics in the same way. Given
each metric, the generalizability of the observa-
tions from the first data set, regardless of the
specific score of each model alternative and their
differences, is still highly variable; the most gen-
eralizable case, where the best model from the first
data set holds across the 50 data sets, occurs when relying on
full form accuracy when sampling with replace-
ment (36%). When contrasting results from the
first data set to the other 49 data sets, there are
again noticeable differences between the average
score of the former and the highest average score
of the latter. For example, when using morpheme
precision and sampling with replacement, the dif-
ference ranges from 1.98 for Seq2seq to 4.00 for
Morfessor; with full form accuracy and sampling
without replacement, the difference spans from
2.12 for 0-CRF to 4.78 for Morfessor.
Given each sampling method, we applied the
250 trained models of the best performing model
alternative across the 50 data sets (Table 2) to new
test sets of four different sizes. As presented in
Figure 1b, for this Russian example, the distribu-
tions of the F1 scores demonstrate variability to
different extents for different test set sizes. When the data was sampled with replacement, the results are the most variable with a test set size of 50, where the F1 scores span from 57.44 to 90.46; and the least variable F1 scores are derived when the test set size is 1,000, ranging from 70.62 to 79.36. On the other hand, the average F1 across the different test sizes is comparable, spanning from 75.30 when the test size is 50 to 75.41 when the test size is 500. Again, we observed similar results when the data sets were sampled without
replacement.
7.2 An Overall Look
The results for the data sets of all languages were
analyzed in exactly the same way as described
for the Russian example given in Section 7.1.
When not considering how different the average
F1 score of each model alternative is compared to
every other alternative within each experimental
setting (82 settings in total; 41 for each sampling
method), in most of the settings the performance
of the model alternatives for the first data set is
not consistent across the other data sets; the pro-
portions of cases where the best model alternative
of the first data set holds for the other 49 data
sets are mostly below 50%, except for most of the
cases for Zulu (data set size ≥ 1,000) where the
proportion values approximate 100%. The obser-
vations for model rankings are also quite variable;
the proportion of times the model ranking of the
first data set stays true across the 50 data sets
va de 2% for those in Russian containing
2,000 words sampled with replacement, à 56%
for data sets in Zulu including 4,000 words sam-
pled without replacement. (The high variability
derived this way persists regardless of the particular metric applied.)
Considering all the experimental settings, when
sampling with replacement, the two best per-
forming models for the first data sets are 4-CRF
(13 / 41 = 31.71%) and Seq2seq (36.59%); 2-CRF
(26.83%) and 4-CRF (36.59%) are more favorable
if sampling without replacement. The differences
between the best model alternatives are in general
small; the largest difference score between the
top and second best models is 1.89 for data sets
in Zulu with 1,500 words sampled with replace-
ment. Across the 50 data sets within each setting,
overall, higher-order CRFs are the best per-
forming models.
Similar findings of performance by different
model alternatives were observed for other metrics
as well, except for average Levenshtein distance
where CRF models are consistently better than
Seq2seq, even though the score differences are
mostly small. This is within expectation given
that our task is surface segmentation, and Seq2seq
models are not necessarily always able to ‘‘copy’’
every character from the input sequence to the
output. Previous work has tried to alleviate this
problem via multi-task learning (Kann et al., 2018;
Liu et al., 2021), where, in addition to the task
of morphological segmentation, the model is also
asked to learn to output sequences that are exactly
the same as the input. We opted not to explore
multi-task learning here in order to have more
experimental control for data across languages.
When analyzing the performance range of the
models, we see variability as well. Figure 2a
presents the distribution of F1 score ranges aggregating the results for all the models across all experimental settings; the score ranges of individual model alternatives are additionally demonstrated in Figure 3. Larger score ranges
exist especially when data sets have a size of 500,
irrespective of the specific model (whether it is
non-neural or neural). For example, 3-CRF has a
score range of 11.50 for data in Persian sampled
with replacement and the range of Seq2seq in this
setting is 11.24. As the data set sizes increase, the
score ranges become smaller.
Across all experimental settings, we also an-
alyzed the difference between the average F1
score of the first data set and that across all 50
data sets (Figure 2b). While in most of the cases
the differences for the same model alternative
are very small, the values of the differences ap-
pear to be more distinguishable when the data
set size is small. On the other hand, as shown
in Figure 3, these differences are by comparison
(much) smaller than the average F1 score range
(the triangles are consistently below the circles),
an observation that also aligns with what we have
seen from the Russian example.
Again, similar results hold for the other four
evaluation metrics as well; different model alter-
natives demonstrate large score range across all
experimental settings regardless of the particular
metric. As an illustration, we performed linear
Figure 2: Some distribution statistics of the average F1 score of all models across all experimental settings.
Figure 3: Some summary statistics for the performance of all models across all experimental settings. Within each setting, the circle represents the average F1 score range; variability exists in the score range regardless of the model alternative applied or the data set size, though the range appears to be larger when data set sizes are small. By comparison, the triangle represents the difference between the average F1 score of the first data set and that across all 50 data sets; the values of the differences seem to be more noticeable when data set sizes are small, yet these
differences are in general smaller than the score range.
regression modeling analyzing the relationships
between other evaluation metrics and the average
F1 score. Take morpheme recall as an example.
Given all the data of a language, the regression
model predicts the score range of average mor-
pheme recall across data sets in an experimental
setting as a function of the score range of av-
erage F1 in the same experimental setting. Le
same procedure was carried out for each of the
other metrics as well. Significant effects were
found (p < 0.01 or p < 0.001) for the score
range of average F1, indicating strong compara-
bility of model performance in terms of different
evaluation metrics.
In addition, besides average F1, there are also
noticeable differences between a given metric
score for the first data set and that across all
50 data sets within each experimental setting,
yet these differences are also smaller than the
score range. Our regression models predicting
these differences as a function of the differences
of average F1 score also show significant
effects for the latter, again lending support to
the comparability between these metrics when
characterizing model evaluation in our study here.
When we applied the trained models of the
overall best model alternative to new test sets, all
settings demonstrate a great amount of variability
that seems to be dependent on both the size of the
data set used to train the models and that of the
new test sets. The extent of variation does not ap-
pear to be constrained by the sampling strategy.
The difference in the resulting F1 scores ranges
from around 6.87, such as the case with Indonesian
when the data set size is 3,000 sampled without
replacement and the test set size is 500, to approx-
imately 48.59, as is the case for Persian when the
data set size is 500 sampled with replacement and
the test set size is 50. That being said, as in the
Russian example, the average results among the
different test sizes for each setting of every lan-
guage are also comparable to each other.4
4In addition to testing the best model alternatives, for
every experimental setting of Persian, Russian, and Zulu, the
three languages that have the largest range of data set sizes and
new test set sizes in our study, we also experimented with the
second or third best model alternatives based on their average
performance and computing power. The observations of these
alternatives are comparable to those of the best models.
7.3 Regression Analysis
One question arises from the aforementioned find-
ings: Why is there variable performance for each
model alternative across all the experimental set-
tings of each language? With the observations in
our study thus far, it seems that the results are
dependent on the data set size or test set size. But
is that really the case, or is sample size simply a
confounding factor?
To address the questions above, we studied
several data characteristics and how they affect
model performance. It is important to note that,
as discussed in Section 1, we do not wish to
claim these features are representative of the full
language profile (e.g., how morphologically com-
plex the language is as a whole). For instance, one
might expect that on average, languages that are
agglutinative or polysynthetic have larger num-
bers of morphemes per word when compared to
fusional languages. While that might be true for
cases with ample amounts of data, the same as-
sumption does not always seem to be supported by
the data sets in our experiments. For each experi-
mental setting, the average number of morphemes
per word across the 50 data sets of Indonesian
(classified as agglutinative) is mostly comparable
to that of Persian (fusional); on the other hand,
in all experimental settings that are applicable,
the average number of morphemes per word for
data sets of Russian (fusional) is always higher
than that for data of the Mexican indigenous lan-
guages (polysynthetic).
These patterns resonate again with our main
point, that when data set size is small, the first
data set or just one data set might not suffice to
reflect the language features overall. Thus here
we consider the characteristics to be specific to
the small data sets in our experiments. For each
random split (including a training and a test set) of
every data set, we investigated: (1) word overlap,
the proportion of words in the test set that also
occur in the training set (only applicable to sam-
pling with replacement); (2) morpheme overlap,
the proportion of morphemes in the test set that
also appear in the training set; (3) the ratio of the
average number of morphemes per word between
the training and test sets; (4) the distance between
the distribution of the average number of mor-
phemes per word in the training set and that in the
test set; for this feature we used the Wasserstein
distance for its ease of computation (Arjovsky
Language    | Word overlap | Morpheme overlap | Ratio of avg. N of morphemes | Distance between distributions of N of morphemes | Ratio of avg. morpheme length
Yorem Nokki | 11.68**  | 17.10    | 13.64**  | −7.19    | 4.81
Nahuatl     | 13.15**  | 49.56*** | 14.22*** | −3.16    | −1.15
Wixarika    | 21.57*** | 63.69*** | −2.58    | −0.70    | −11.06**
English     | 9.97**   | 50.35*** | 26.01*** | −6.78*   | 8.44*
German      | 10.03**  | 61.09*** | 21.44*** | 2.29     | 3.25
Persian     | 26.90*** | 21.56*** | 26.15*** | −3.82*   | −3.09
Russian     | 3.38     | 69.49*** | 11.96*** | −2.88**  | 3.19
Turkish     | 15.88*** | 44.31*** | 1.37     | −0.73    | 0.30
Finnish     | 9.47**   | 60.58*** | 10.49**  | −1.95    | −3.90
Zulu        | 15.48*** | 79.07*** | 11.34*** | −4.14*** | 4.11
Indonesian  | 8.12*    | 25.53*** | 19.55*** | −6.46**  | 7.64*
Table 3: Regression coefficients for random splits of data sets in all languages; a positive coefficient
value corresponds to higher metric scores; the numbers in bold indicate significant effects; the number
of * suggests significance level: * p < 0.05, ** p < 0.01, *** p < 0.001.
et al., 2017; Søgaard et al., 2021); and (5) the
ratio of the average morpheme length per word
between the training and the test sets.
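The sketch below shows one way to compute these five characteristics for a single training/test split (our reading of the definitions above; the exact formulations used in the original analysis may differ). Each item is assumed to be a (word, morphemes) pair.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def split_characteristics(train, test):
    """train, test: lists of (word, morphemes) items for one random split."""
    train_words = {w for w, _ in train}
    train_morphs = {m for _, ms in train for m in ms}
    test_morphs = [m for _, ms in test for m in ms]

    word_overlap = np.mean([w in train_words for w, _ in test])
    morpheme_overlap = np.mean([m in train_morphs for m in test_morphs])

    train_mpw = [len(ms) for _, ms in train]    # morphemes per word
    test_mpw = [len(ms) for _, ms in test]
    mpw_ratio = np.mean(train_mpw) / np.mean(test_mpw)
    mpw_distance = wasserstein_distance(train_mpw, test_mpw)

    train_mlen = [np.mean([len(m) for m in ms]) for _, ms in train]
    test_mlen = [np.mean([len(m) for m in ms]) for _, ms in test]
    mlen_ratio = np.mean(train_mlen) / np.mean(test_mlen)

    return {
        "word_overlap": word_overlap,
        "morpheme_overlap": morpheme_overlap,
        "morphemes_per_word_ratio": mpw_ratio,
        "morphemes_per_word_distance": mpw_distance,
        "avg_morpheme_length_ratio": mlen_ratio,
    }
```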
To measure the individual effect of each fea-
ture, we used linear
regression. Because of
the relatively large number of data sets that
we have and our areas of interest, we fit the
same regression model to the data of each lan-
guage rather than combining the data from all
languages. The regression predicts the metric
scores of all models for each random split as
a function of the five characteristics, described
above, of the random split. Meanwhile we con-
trolled for the roles of four other factors: the
model alternatives, the metrics used, sampling
methods, and data set size—the latter two of
which also had interaction terms with the five
characteristics.
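As an illustration of this kind of per-language regression, a hedged sketch using the statsmodels formula API is given below; the variable names, categorical coding, and exact interaction structure are assumptions on our part rather than the authors' specification.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_language_regression(df: pd.DataFrame):
    """df: one row per (model, metric, random split) with the split characteristics."""
    characteristics = ("word_overlap + morpheme_overlap + mpw_ratio"
                       " + mpw_distance + mlen_ratio")
    formula = (
        f"score ~ {characteristics} + C(model) + C(metric)"
        f" + C(sampling) * ({characteristics})"
        f" + dataset_size * ({characteristics})"
    )
    return smf.ols(formula, data=df).fit()

# Usage sketch: model = fit_language_regression(russian_df); print(model.summary())
```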
Given the components of our regression model,
a smaller/larger value of the coefficient for the
same factor here does not necessarily mean that
this factor has a weaker/stronger role. In other
words, the coefficients of the same feature are
not comparable across the data for each language.
Rather, our goal is simply to see whether a fea-
ture potentially influences metric scores when
other factors are controlled for (e.g., data set
size) within the context of the data for every
language.
The regression results are presented in Table 3.
It appears that the features with the most promi-
nent roles are the proportion of morpheme overlap
as well as the ratio of the average number of
morphemes per word. Word overlap also has sig-
nificant effects, though this is applicable only
to data sets sampled with replacement. In com-
parison, data set size does not appear to have
significant positive effects in all cases; the excep-
tions include the data for Yorem Nokki, Nahuatl,
Wixarika, and Turkish. In scenarios where data
set size does have an effect, its magnitude is
much smaller compared to the characteristics in
Table 3 that we studied. Thus, while data set size
potentially plays a role in the results, it does not
appear to be the sole or even the most important
factor. The range of model performance is more
dependent on the specific features of, or what is
available in, the data sets. For example, larger
training and test sets do not necessarily lead to
higher morpheme overlap ratio.
We adopted similar approaches to investigate
the variability in the results when trained segmen-
tation models were applied to new test sets of
different sizes (Section 6.2). In these cases, the
five characteristics described before were mea-
sured taking into account the training set of the
segmentation model and each of the new test sets
(100 in total for each training set). In contrast to
the previous regression model, since the new test
sets were sampled without replacement and since
we are focusing on predictions derived from the
best model alternative (but see Section 7.2), word
overlap ratios and the model alternatives were not
included in the regression. We additionally con-
trolled for the effect of the new test set size. Again
the same regression model was fit to the data of
each language. Comparing the different features,
morpheme overlap ratio and the average number
of morphemes per word are again the two most
pronounced factors in model performance, while
the role of test set size is much less pronounced or
even negligible.
7.4 Alternative Data Splits
Our experiments so far involved random splits.
While the heuristic or adversarial splits proposed
in Søgaard et al. (2021) are also valuable, the
success of these methods requires that the same
data could be split based on heuristics, or be
separated into training/test sets such that their
distributions are as divergent as possible.
To illustrate this point for morphological seg-
mentation, we examined the possibility of the data
sets being split heuristically and adversarially. For
the former, we relied on the metric of the average
number of morphemes per word, based on regres-
sion results from Section 7.3. Given each data
set within every experimental setting, we tried to
automatically find a metric threshold so that the
words in the data set are able to be separated by
this threshold into training/test sets at the same
3:2 ratio, or a similar ratio; words in which the
number of morphemes goes beyond this thresh-
old were placed in the test set. We note that it
is important (same as for adversarial splits) that
the number of words in the resulting training and
test sets follows a similar ratio to that for random
splits. This is in order to ensure that the size of
the training (or the test) set would not be a fac-
tor for potential differences in model performance
derived from different data split methods. In most
of the data sets across settings, however, such a
threshold for the average number of morphemes
per word does not seem to exist. The exceptions
are Wixarika, where the maximum number of
data sets that are splittable this way is 35, for
data sets containing 1,000 words sampled without
replacement; and certain cases in Finnish, where
the maximum number of data sets suitable for
heuristic splitting is 11, when the data sets have
500 words sampled with replacement.
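A small sketch of this feasibility check, assuming each item carries its gold morphemes (the tolerance around the 3:2 ratio is an illustrative choice, not the exact criterion used in our experiments):

```python
def find_threshold(items, test_share=0.4, tol=0.05):
    """items: list of (word, morphemes); returns a morphemes-per-word threshold
    that sends roughly 40% of the words to the test side, or None if none exists."""
    counts = sorted({len(ms) for _, ms in items})
    for thr in counts:
        above = sum(len(ms) > thr for _, ms in items)
        if abs(above / len(items) - test_share) <= tol:
            return thr
    return None  # no threshold yields a split close enough to 3:2
```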
For adversarial splitting, we split each data set
(five times, as well) via maximization of the
Wasserstein distance between the training and the
test sets (Søgaard et al., 2021). We then calculated
the word overlap ratio between the test sets of
the adversarial splits and those from the random
splits. Across most of the experimental settings,
the average word overlap ratios center around or
are lower than 50%. This suggests that the training
and test sets of adversarial splits are reasonably
different from those derived after random splits;
in other words, these data sets could be split
adversarially.
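One simple way to approximate such a split, shown below, is to draw many candidate random 3:2 splits and keep the one whose training and test distributions of morphemes per word are farthest apart under the Wasserstein distance; this is an assumption on our part, not necessarily the exact procedure of Søgaard et al. (2021).

```python
import random
from scipy.stats import wasserstein_distance

def approx_adversarial_split(items, train_ratio=0.6, n_candidates=1000, seed=0):
    """items: list of (word, morphemes); returns the candidate split that
    maximizes the Wasserstein distance between morphemes-per-word distributions."""
    rng = random.Random(seed)
    cut = int(len(items) * train_ratio)
    best_split, best_dist = None, -1.0
    for _ in range(n_candidates):
        shuffled = items[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:cut], shuffled[cut:]
        dist = wasserstein_distance([len(ms) for _, ms in train],
                                    [len(ms) for _, ms in test])
        if dist > best_dist:
            best_split, best_dist = (train, test), dist
    return best_split, best_dist
```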
Although it is not the focus in this study to com-
pare different ways of splitting data, we carried out
adversarial splits for the data of Persian and Rus-
sian, two of the languages with the largest range
of data set sizes and new test set sizes. The models
applied were second-order and fourth-order CRFs
because of their overall good performance (Sec-
tion 7). For each experimental setting, the results
are still highly variable. Compared to splitting data
randomly, the average metric scores of adversar-
ial splits are lower (e.g., the mean difference of
the average F1 scores between the two data split
methods is 17.85 across settings for the data of
Zulu, and 6.75 for the data of Russian), and the
score ranges as well as standard deviations are
higher. That being said, the results from these
two data split methods are, arguably, not directly
comparable, since with adversarial splits the test
sets are constructed to be as different or distant as
possible from the training sets.
When applying the trained models to the same
new test sets sampled from Section 6.2, the ob-
servations are similar to the descriptions above,
except the differences between the two data split
methods regarding the mean metric scores, score
ranges, and variances are much smaller. The most
pronounced average difference in F1 is approxi-
mately 7.0 when the experimental settings yield
small data set sizes. On the other hand, within each
setting, despite the model performance variability,
the average results of the different test sizes are
again comparable.
8 Discussion and Conclusion
Using morphological segmentation as the test
case, we compared three broad classes of models
with different parameterizations in crosslinguistic
low-resource scenarios. Leveraging data from 11
languages across six language families, our re-
sults demonstrate that the best models and model
rankings for the first data sets do not generalize
well to other data sets of the same size, though
the numerical differences in the results of the
better-performing model alternatives are small. In
particular, within the same experimental setting,
there are noticeable discrepancies in model predic-
tions for the first data set compared to the averages
across all data sets; and the performance of each
model alternative presents different yet signifi-
cant score ranges and variances. When examining
trained models on new test sets, considerable vari-
ability exists in the results. The patterns described
above speak to our concerns raised at the begin-
ning, namely, that when facing a limited amount
of data, model evaluation gathered from the first
data set—or essentially any one data set—could
fail to hold in light of new or unseen data.
To remedy the observed inconsistencies in
model performance, we propose that future work
should consider utilizing random sampling from
initial data for more realistic estimates. We draw
support for this proposal from two patterns in
our study. First, in each experimental setting, the
difference between the average F1 score of the
first data set and that across all 50 data sets is
in general smaller than the range of the scores.
Second, the average metric scores across unseen
test sets of varying sizes in every setting are com-
parable to each other; this holds for models trained
from random splits and those trained using adver-
sarial splits.
Therefore, depending on the initial amount of
data, it is important to construct data sets and new
test sets of different sizes, and then evaluate models
accordingly. Even if the initial data set is small,
as is the case for most endangered languages, it is
worthwhile to sample with replacement in order to
better understand model performance. Based on these
observations, it would be interesting to analyze the
potential factors that lead to the different degrees
of generalization, which could in turn provide
guidance on what should be included or sampled in
the training sets in the first place. Thorough
comparisons of different models should be encouraged
even as the amount of data grows, particularly if
the experimental settings remain within low-resource
scenarios for the task at hand (Hedderich et al., 2021).
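
For very small initial data sets, the sampling-with-replacement variant mentioned above can be sketched as follows; again, train_and_evaluate is a hypothetical placeholder, and the specific sizes and number of bootstrap samples are purely illustrative.

# Rough sketch (Python): bootstrap data sets of several sizes from a small
# initial data set (sampling with replacement) and summarize the scores.
import random
import statistics

def bootstrap_evaluation(initial_data, sizes, train_and_evaluate,
                         n_samples=50, test_fraction=0.2, seed=0):
    rng = random.Random(seed)
    summaries = {}
    for size in sizes:
        scores = []
        for _ in range(n_samples):
            # Draw `size` items with replacement, so a size may exceed
            # the number of items in the initial data.
            data = [rng.choice(initial_data) for _ in range(size)]
            n_test = max(1, int(size * test_fraction))
            train, test = data[n_test:], data[:n_test]
            scores.append(train_and_evaluate(train, test))
        summaries[size] = (statistics.mean(scores), statistics.stdev(scores))
    return summaries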
Lastly, while we did not perform adversarial
splits for all experimental settings here, in the
cases that we did investigate, model performance
under adversarial splits also showed high variability,
both across data sets of the same size and on new
test sets. Although our criteria for heuristic splits
were not applicable to the data explored, we would
point out that future work should at least check the
applicability of different data splits, as we did in
Section 7.4. As long as a data set possesses the
properties that allow it to be divided by different
split methods, it is worthwhile to further explore
the influence of these splits, coupled with random
sampling.
Acknowledgments
We are grateful to the reviewers and the action
editor for their insightful feedback. This material
is based upon work supported by the National
Science Foundation under Grant #2127309 to the
Computing Research Association for the CIFellows
Project, and Grant #1761562. Any opinions, findings,
and conclusions or recommendations expressed in
this material are those of the author(s) and do not
necessarily reflect the views of the National Science
Foundation or the Computing Research Association.
References
M. Ablimit, G. Neubig, M. Mimura, S. Mori,
T. Kawahara, and A. Hamdulla. 2010. Uyghur
morpheme-based language models and ASR.
In ICASSP, pages 581–584. https://doi
.org/10.1109/ICOSP.2010.5656065
Mohamed Afify, Ruhi Sarikaya, Hong-Kwang
Jeff Kuo, Laurent Besacier, and Yuqing Gao.
2006. On the use of morphological analy-
sis for dialectal Arabic speech recognition. In
Ninth International Conference on Spoken Lan-
guage Processing. https://doi.org/10
.21437/Interspeech.2006-87
Ekin Akyurek and Jacob Andreas. 2021. Lexicon
Learning for Few Shot Sequence Modeling.
In Proceedings of
the 59th Annual Meet-
ing of
the Association for Computational
Linguistics and the 11th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 4934–4946,
Online. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2021.acl-long.382
Ebrahim Ansari, Zdeněk Žabokrtský, Hamid Haghdoost, and Mahshid Nikravesh. 2019a. Persian Morphologically Segmented Lexicon 0.5. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
Ebrahim Ansari, Zdeněk Žabokrtský, Mohammad Mahmoudi, Hamid Haghdoost, and Jonáš Vidra. 2019b. Supervised morphological segmentation using rich annotated lexicon. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 52–61, Varna, Bulgaria. INCOMA Ltd. https://doi.org/10.26615/978-954-452-056-4_007
Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR.
R. Harald Baayen, Richard Piepenbrock, and Leon
Gulikers. 1996. The CELEX lexical database
(cd-rom).
Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.
Mark C. Baker. 1997. Complex predicates and
agreement in polysynthetic languages. Complex
predicates, pages 247–288.
Emily M. Bender and Batya Friedman. 2018.
Data statements for natural
language pro-
cessing: Toward mitigating system bias and
enabling better science. Transactions of the
Association for Computational Linguistics,
6:587–604. https://doi.org/10.1162
/tacl_a_00041
Taylor Berg-Kirkpatrick, David Burkett, and
Dan Klein. 2012. An empirical investigation
of statistical significance in NLP. In Pro-
ceedings of
the 2012 Joint Conference on
Empirical Methods in Natural Language Pro-
cessing and Computational Natural Language
Learning, pages 995–1005, Jeju Island, Korea.
Association for Computational Linguistics.
Terra Blevins and Luke Zettlemoyer. 2019.
Better character language modeling through
morphology. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 1606–1613, Florence, Italy.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/P19
-1156
Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9263–9274, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.745
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
Ann Clifton and Anoop Sarkar. 2011. Com-
bining morpheme-based machine translation
with post-processing morpheme prediction. In
Proceedings of the 49th Annual Meeting of
the Association for Computational Linguistics:
Human Language Technologies, pages 32–42,
Portland, Oregon, USA. Association for Com-
putational Linguistics.
Michael Collins. 2002. Discriminative training
methods for Hidden Markov Models: Theory
and experiments with perceptron algorithms. In
Proceedings of the 2002 Conference on Empir-
ical Methods in Natural Language Processing
(EMNLP 2002), pages 1–8. Association for
Computational Linguistics. https://doi
.org/10.3115/1118693.1118694
Ryan Cotterell, Arun Kumar, and Hinrich Schütze. 2016a. Morphological segmentation inside-out. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2325–2330, Austin, Texas. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1256
Ryan Cotterell, Thomas Müller, Alexander Fraser, and Hinrich Schütze. 2015. Labeled morphological segmentation with semi-Markov models. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 164–174, Beijing, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/K15-1017
Ryan Cotterell, Tim Vieira, and Hinrich Schütze. 2016b. A joint model of orthography and morphological segmentation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 664–669, San Diego, California. Association for Computational Linguistics. https://doi.org/10.18653/v1/N16-1080
Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 21–30. Association for Computational Linguistics. https://doi.org/10.3115/1118647.1118650
Nguyen Viet Cuong, Nan Ye, Wee Sun Lee, and
Hai Leong Chieu. 2014. Conditional random
field with high-order dependencies for sequence
labeling and segmentation. Journal of Machine
Learning Research, 15(1):981–1009.
Yolanda Lastra de Suárez. 1980. Náhuatl de Acaxochitlán (Hidalgo), El Colegio de México.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of
the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume
1 (Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.
Prajit Dhar and Arianna Bisazza. 2021. Understanding cross-lingual syntactic transfer in multilingual recurrent neural networks. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 74–85, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden.
Jesse Dodge, Suchin Gururangan, Dallas Card,
Roy Schwartz, and Noah A. Smith. 2019.
Show your work: Improved reporting of ex-
perimental results. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 2185–2194,
Hong Kong, China. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D19-1224
Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1128
Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211. https://doi.org/10.1207/s15516709cog1402_1
Elif Eyigöz, Daniel Gildea, and Kemal Oflazer. 2013. Simultaneous word-morpheme alignment for statistical machine translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 32–40, Atlanta, Georgia. Association for Computational Linguistics.
Ray A. Freeze. 1989. Mayo de Los Capomos, Sinaloa. El Colegio de México.
Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. Neural language models as psycholinguistic subjects: Representations of syntactic state. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume
1 (Long and Short Papers), pages 32–42,
Minneapolis, Minnesota. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N19-1004
Andrew Garrett. 2011. An online dictionary with texts and pedagogical tools: The Yurok language project at Berkeley. International Journal of Lexicography, 24(4):405–419. https://doi.org/10.1093/ijl/ecr018
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. In Proceedings of the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning.
Paula Gómez and Paula Gómez López. 1999. Huichol de San Andrés Cohamiata, Jalisco. El Colegio de México.
Kyle Gorman and Steven Bedrick. 2019. We need
to talk about standard splits. In Proceedings of
the 57th Annual Meeting of the Association for
Computational Linguistics, pages 2786–2791,
Florence,
Italy. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P19-1267
Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow. 2021. A survey on recent approaches for natural language processing in low-resource scenarios. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2545–2568, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.201
Jennifer Hu, Jon Gauthier, Peng Qian, Ethan
Wilcox, and Roger Levy. 2020. A systematic
assessment of syntactic generalization in neural
language models. In Proceedings of the 58th
Annual Meeting of the Association for Compu-
tational Linguistics, pages 1725–1744, Online.
Association for Computational Linguistics.
Robin Jia and Percy Liang. 2017. Adversarial ex-
amples for evaluating reading comprehension
systems. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language
Processing, pages 2021–2031, Copenhagen,
Denmark. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/D17-1215
Katharina Kann, Jesus Manuel Mager Hois, Ivan Vladimir Meza-Ruiz, and Hinrich Schütze. 2018. Fortification of neural morphological segmentation models for polysynthetic minimal-resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 47–57, New Orleans, Louisiana. Association for Computational Linguistics.
Guillaume Klein, Yoon Kim, Yuntian Deng,
Jean Senellart, and Alexander Rush. 2017.
OpenNMT: Open-source toolkit
for neural
machine translation. In Proceedings of ACL
2017, System Demonstrations, pages 67–72,
Vancouver, Canada. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P17-4012
Oskar Kohonen, Sami Virpioja, and Krista Lagus.
2010. Semi-supervised learning of concatena-
tive morphology. In Proceedings of the 11th
Meeting of the ACL Special Interest Group
on Computational Morphology and Phonol-
ogy, pages 78–86. https://doi.org/10
.18653/v1/P17-4012
Dan Kondratyuk and Milan Straka. 2019. 75
languages, 1 model: Parsing universal depen-
dencies universally. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 2779–2795,
Hong Kong, China. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D19-1279
Mikko Kurimo, Sami Virpioja, Ville Turunen,
and Krista Lagus. 2010. Morpho Challenge
2005-2010: Evaluations and results. In Pro-
ceedings of the 11th Meeting of the ACL Special
Interest Group on Computational Morphology
and Phonology, pages 87–95, Uppsala, Sweden.
Association for Computational Linguistics.
John D. Lafferty, Andrew McCallum, and
Fernando C. N. Pereira. 2001. Conditional ran-
dom fields: Probabilistic models for segmenting
and labeling sequence data. In Proceedings
of the Eighteenth International Conference on
Machine Learning, pages 282–289.
Brenden M. Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 2879–2888. PMLR.
Vladimir I. Levenshtein. 1966. Binary codes capa-
ble of correcting deletions, insertions, and rever-
sals. Soviet Physics Doklady, 10(8):707–710.
Tal Linzen. 2020. How can we accelerate
progress towards human-like linguistic gener-
alization? In Proceedings of the 58th Annual
Meeting of
the Association for Computa-
tional Linguistics, pages 5210–5217, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-main.465
Tal Linzen and Marco Baroni. 2021. Syntactic
structure from deep learning. Annual Review
of Linguistics, 7(1):195–212. https://doi
.org/10.1146/annurev-linguistics
-032020-051035
Tal Linzen, Emmanuel Dupoux, and Yoav
Goldberg. 2016. Assessing the ability of LSTMs
to learn syntax-sensitive dependencies. Trans-
actions of the Association for Computational
Linguistics, 4:521–535. https://doi.org
/10.1162/tacl_a_00115
Dong C. Liu and Jorge Nocedal. 1989. On
the limited memory BFGS method for large
scale optimization. Mathematical Program-
ming, 45(1):503–528. https://doi.org
/10.1007/BF01589116
Zoey Liu, Robert
Jimerson,
and Emily
Prud’hommeaux. 2021. Morphological seg-
mentation for Seneca. In Proceedings of the
First Workshop on Natural Language Process-
ing for Indigenous Languages of the Americas,
pages 90–101, Online. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2021.americasnlp-1.10
Mitchell Marcus, Beatrice Santorini, and Mary
Ann Marcinkiewicz. 1993. Building a large an-
notated corpus of English: The Penn Treebank.
Computational Linguistics,
19(2):313–330.
https://doi.org/10.21236/ADA273556
Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1151
R. Thomas McCoy, Robert Frank, and Tal Linzen. 2018. Revisiting the poverty of the stimulus: Hierarchical generalization without a hierarchical bias in recurrent neural networks. In Proceedings of the 40th Annual Conference of the Cognitive Science Society, pages 2093–2098, Madison, WI.
R. Thomas McCoy, Junghyun Min, and Tal Linzen. 2020. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 217–227, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.blackboxnlp-1.21
Tom McCoy, Ellie Pavlick, and Tal Linzen.
2019. Right for the wrong reasons: Diagnos-
ing syntactic heuristics in natural
language
inference. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 3428–3448, Florence, Italy.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/P19-1334
Barbra A. Meek. 2012. We Are Our Language:
An Ethnography of Language Revitalization in
a Northern Athabaskan Community. University
of Arizona Press.
Aaron Mueller, Garrett Nicolai, Panayiota Petrou-Zeniou, Natalia Talmina, and Tal Linzen. 2020. Cross-linguistic syntactic evaluation of word prediction models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5523–5539, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.490
Thomas Mueller, Helmut Schmid, and Hinrich Schütze. 2013. Efficient higher-order CRFs for
morphological tagging. In Proceedings of the
2013 Conference on Empirical Methods in Nat-
ural Language Processing, pages 322–332,
Seattle, Washington, USA. Association for
Computational Linguistics.
Naoaki Okazaki. 2007. CRFsuite: A fast implementation of conditional random fields (CRFs).
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Alan Ramponi and Barbara Plank. 2020. Neural unsupervised domain adaptation in NLP—A survey. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6838–6855, Barcelona, Spain (Online). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.603
Nils Reimers and Iryna Gurevych. 2017. Re-
porting score distributions makes a difference:
performance study of LSTM-networks for Se-
quence Tagging. In Proceedings of the 2017
Conference on Empirical Methods in Nat-
ural Language Processing, pages 338–348,
Copenhagen, Denmark. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/D17-1035
Jorma Rissanen. 1998. Stochastic Complexity in
Statistical Inquiry, volume 15. World scientific.
Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. 2013. Supervised morphological segmentation in a low-resource learning setting using conditional random fields. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 29–37, Sofia, Bulgaria. Association for Computational Linguistics.
Wolfgang Seeker and Özlem Çetinoğlu. 2015. A graph-based lattice dependency parser for joint morphological segmentation and syntactic analysis. Transactions of the Association for Computational Linguistics, 3:359–373. https://doi.org/10.1162/tacl_a_00144
Anders Søgaard. 2013. Estimating effect size
across datasets. In Proceedings of the 2013 Con-
ference of the North American Chapter of the
Association for Computational Linguistics: Hu-
man Language Technologies, pages 607–611.
Association for Computational Linguistics,
Atlanta, Georgia.
Anders Søgaard, Sebastian Ebert, Jasmijn Bastings, and Katja Filippova. 2021. We need to talk about random splits. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1823–1832, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-main.156
Alexey Sorokin and Anastasia Kravtsova. 2018. Deep convolutional networks for supervised morpheme segmentation of Russian language. In Conference on Artificial Intelligence and Natural Language, pages 3–10. Springer. https://doi.org/10.1007/978-3-030-01204-5_1
Justin Spence. 2013. Language Change, Contact,
and Koineization in Pacific Coast Athabaskan.
Ph.D. thesis, University of California, Berkeley.
Sebastian Spiegler, Andrew van der Spuy, and
Peter A. Flach. 2010. Ukwabelana - An
open-source morphological Zulu corpus. In
Proceedings of the 23rd International Con-
ference on Computational Linguistics (Coling
2010), pages 1020–1028, Beijing, China.
Coling 2010 Organizing Committee.
Piotr Szymański and Kyle Gorman. 2020. Is the best better? Bayesian statistical model comparison for natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2203–2212, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.172
Martha Tachbelie, Solomon Teferra Abate,
and Laurent Besacier. 2014. Using differ-
ent acoustic,
lexical and language model-
ing units for ASR of an under-resourced
language - Amharic. Speech Communica-
tion, 56. https://doi.org/10.1016/j
.specom.2013.01.008
Tristan Thrush, Ethan Wilcox, and Roger Levy. 2020. Investigating novel verb learning in BERT: Selectional preference classes and alternation-based syntactic generalization. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 265–275, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.blackboxnlp-1.25
Antal van den Bosch and Walter Daelemans. 1999. Memory-based morphological analysis. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 285–292, College Park, Maryland, USA. Association for Computational Linguistics. https://doi.org/10.3115/1034678.1034726
Tim Vieira, Ryan Cotterell, and Jason Eisner. 2016. Speed-accuracy tradeoffs in tagging with variable-order CRFs and structured sparsity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1973–1978, Austin, Texas. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1206
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5446
Yiwen Wang, Jennifer Hu, Roger Levy, and
Peng Qian. 2021. Controlled evaluation of
grammatical knowledge in Mandarin Chinese
language models. In Proceedings of the 2021
Conference on Empirical Methods in Natu-
ral Language Processing, pages 5604–5620,
Online and Punta Cana, Dominican Repub-
lic. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2021.emnlp-main.454
Noah Weber, Leena Shekhar, and Niranjan
Balasubramanian. 2018. The fine line be-
tween linguistic generalization and failure
in seq2seq-attention models. In Proceedings
of
the Workshop on Generalization in the
Age of Deep Learning, pages 24–27, New
Orleans, Louisiana. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/W18-1004
Ethan Wilcox, Peng Qian, Richard Futrell, Miguel Ballesteros, and Roger Levy. 2019. Structural supervision improves learning of non-local grammatical dependencies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3302–3312, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1334
Ethan Wilcox, Peng Qian, Richard Futrell, Ryosuke Kohita, Roger Levy, and Miguel Ballesteros. 2020. Structural supervision improves few-shot learning and syntactic generalization in neural language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4640–4652, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.375
Beilei Xiang, Changbing Yang, Yu Li, Alex Warstadt, and Katharina Kann. 2021. CLiMP: A benchmark for Chinese language model evaluation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2784–2790, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-main.242
Matthew D. Zeiler. 2012. Adadelta: An adap-
tive learning rate method. arXiv preprint
arXiv:1212.5701v1.
Xiang Zhou, Yixin Nie, Hao Tan, and Mohit Bansal. 2020. The curse of performance instability in analysis datasets: Consequences, source, and suggestions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8215–8228, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.659