Bayesian Learning of Latent Representations of Language Structures

Yugo Murawaki
Kyoto University
Graduate School of Informatics
murawaki@i.kyoto-u.ac.jp

We borrow the concept of representation learning from deep learning research, and we argue that
the quest for Greenbergian implicational universals can be reformulated as the learning of good
latent representations of languages, or sequences of surface typological features. By projecting
languages into latent representations and performing inference in the latent space, we can handle
complex dependencies among features in an implicit manner. The most challenging problem in
turning the idea into a concrete computational model is the alarmingly large number of missing
values in existing typological databases. To address this problem, we keep the number of model
parameters relatively small to avoid overfitting, adopt the Bayesian learning framework for its
robustness, and exploit phylogenetically and/or spatially related languages as additional clues.
Experiments show that the proposed model recovers missing values more accurately than others
and that some latent variables exhibit phylogenetic and spatial signals comparable to those of
surface features.

1. Introduction

1.1 Representation Learning for Linguistic Typology

Beginning with the pioneering research by Greenberg (1963), linguists have taken quan-
titative approaches to linguistic typology. To propose a dozen cross-linguistic general-
izations, called linguistic universals (an example is that languages with dominant VSO
[verb–subject–object] order are always prepositional), Greenberg investigated a sample
of 30 languages from around the world to correct for phylogenetic and areal effects.
Linguistic universals, including those formulated in absolute terms by Greenberg, are
rarely exceptionless (Dryer 1998), and therefore they are called statistical universals,
as opposed to absolute universals.

While a great amount of effort has been invested into theory construction and
careful analysis of field data, typologists have relied on elementary statistical concepts,
such as frequency, mode, and deviation from expectation (Nichols 1992; Cysouw 2003).
The limitations of superficial statistical analysis become evident especially when one
seeks diachronic explanations of cross-linguistic variation. Although Greenberg (1978)
proposed probabilistic models of language change over time, powerful statistical tools
for making inferences were not available. To do so, we need a model with predictive
power or the ability to draw from the present distribution generalizations that are appli-
cable to the past. To infer the states of languages in the past, or complex latent structures
in general, we need a robust statistical framework, powerful inference algorithms, and
large computational resources. A statistical package that meets all of these requirements
was not known to typologists in Greenberg’s day.

Today, the research community of computational linguistics knows a solution:
Bayesian models armed with computationally intensive Markov chain Monte Carlo
inference algorithms. Indeed, Bayesian models have been used extensively to un-
cover complex latent structures behind natural language text in the last decade or so
(Goldwater and Griffiths 2007; Griffiths, Steyvers, and Tenenbaum 2007; Goldwater,
Griffiths, and Johnson 2009).

Given this, it is somewhat surprising that with some notable exceptions (Daumé III
and Campbell 2007; Daumé III 2009), the application of Bayesian statistics to typological
problems has been done largely outside of the computational linguistics community,
although computational linguists have recognized the usefulness of typological infor-
mation for multilingual text processing (Bender 2016; O’Horan et al. 2016). In fact, it
is evolutionary biology that has offered solutions to typological questions (Dediu 2010;
Greenhill et al. 2010; Dunn et al. 2011; Maurits and Griffiths 2014; Greenhill et al. 2017).
In this article, we demonstrate that representation learning, a concept that com-
putational linguists have become familiar with over the past decade, is useful for the
study of linguistic typology (Bengio, Courville, and Vincent 2013). Although it has
been applied to genomics (Asgari and Mofrad 2015; Tan et al. 2016), to our knowledge,
representation learning has not been used in the context of evolution or applied to
language data.

The goal of representation learning is to learn useful latent representations of the
data (Bengio, Courville, and Vincent 2013). We assume that latent representations exist
behind surface representations, and we seek to let a model connect the two types of
representations.

To provide intuition, we consider handwritten digits represented by grayscale
28 × 28 images (LeCun et al. 1998). Each pixel takes one of 256 values. This means that
there are 256^(28×28) possible images. However, only a tiny portion of them look like natu-
ral digits. Such data points must be smoothly connected because natural digits usually
continue to look natural even if small modifications are added to them (for example,
slightly rotating the images). These observations lead us to the manifold hypothesis:
The data reside on low-dimensional manifolds embedded in a high-dimensional space,
and thus must be able to be represented by a relatively small number of latent variables.
Moreover, good latent representations must disentangle underlying abstract factors, as
illustrated in Figure 1. One latent variable represents, say, the angle, while another one
smoothly controls the width of digits (Chen et al. 2016). As these factors demonstrate,
modification of one latent variable affects multiple pixels at once.

Our key idea is that the same argument applies to typological data, although typo-
logical features are much more informative than the pixels in an image. Combining ty-
pological features together, we can map a given language to a point in high-dimensional
space. What Greenbergian universals indicate is that natural languages are not evenly
distributed in the space. Most Greenbergian universals are implicational, that is, given
in the form of “if x holds, then y also holds.” In other words, the combination of (x, ¬y) is
non-existent (absolute universals) or rare (statistical universals). Moreover, languages
have evolved gradually, and many of them are known to share common ancestors.
If we accept the uniformitarian hypothesis, that is, the assumption that universals
discovered in modern languages should also apply to past languages (Croft 2002),

Figure 1
Two examples of learned good latent representations for handwritten digits. For each row, one
latent variable is gradually changed from left to right whereas the other latent variables are
fixed. The latent variables manipulated in (1a) and (1b) appear to control the angle and width of
the digits, respectivamente. The figures are taken from Figure 2 of Chen et al. (2016).

Figure 2
The manifold hypothesis in the context of language evolution. Natural languages concentrate
around a small subspace whose approximate boundaries are indicated by the two thin lines.
Not only the modern language C but also its ancestor P and the intermediate languages
M1, M2, · · · , Mi, · · · must be in the subspace.

a smooth line of natural languages must be drawn between a modern language and its
ancestor, and by extension, between any pair of phylogenetically related languages, as
illustrated in Figure 2. Thus, we expect languages to lie on a smooth lower-dimensional
manifold. Just like the angle of a digit, a latent variable must control multiple surface
features at once.

1.2 Diachrony-Aware Bayesian Representation Learning

The question, then, is how to turn this idea into a concrete computational model. The
concept of representation learning was popularized in the context of deep learning, with
deep autoencoders being typical models (Hinton and Salakhutdinov 2006). Accordingly,
we previously proposed an autoencoder-based neural network model for typological
data (Murawaki 2015). With follow-up experiments, however, we later found that the
model suffered from serious problems.

The first problem is overfitting. Neural network methods are known to be data-
hungry. They can approximate a wide variety of functions by simply combining
general-purpose components (Hornik 1991), but the flexibility is obtained at the cost
of requiring very large amounts of data. The database available to us is a matrix

where languages are represented as rows and features as columns (Haspelmath et al.
2005). The number of languages is on the order of 1,000 and the number of features is
on the order of 100. This precious database summarizes decades of work by various
typologists, but from the viewpoint of machine learning, it is not very large.

More importantly, the typological database is characterized by an alarmingly large
number of missing values. Depending on how we perform preprocessing, only 20% to
30% of the items are present in the language–feature matrix. The situation is unlikely to
change in the foreseeable future for a couple of reasons. Of the thousands of languages
in the world, there is ample documentation for only a handful. Even if grammatical
sketches are provided by field linguists, it is not always easy for non-experts to deter-
mine the appropriate value for a given feature because typological features are highly
theory-dependent. If one manages to provide some missing values, they are also likely
to add previously uncovered languages to the database, with few features present. The
long tail remains long. Thus there seems to be no way of escaping the problem of
missing values.

Here, we combine several methods to cope with this problem. All but one can be
collectively referred to as Bayesian learning. We replace the general-purpose neural
network with a carefully crafted generative model that has a smaller number of model
parameters. We apply prior distributions to the model parameters to penalize extreme
valores. In inference, we do not rely on a single point estimate of model parameters but
draw multiple samples from the posterior distribution to account for uncertainty.1

The last part of the proposed method can be derived when we re-interpret implica-
tional universals in terms of the language–feature matrix: They focus on dependencies
between columns and thus can be referred to as inter-feature dependencies. We can
also exploit dependencies between rows, or inter-language dependencies. It is well
known that the values of a typological feature do not distribute randomly in the world
but reflect vertical (phylogenetic) transmissions from parents to children and horizon-
tal (spatial or areal) transmissions between populations (Nichols 1992). For example,
languages of mainland Southeast Asia, such as Hmong, Thai, and Vietnamese, are
known for having similar tonal systems even though they belong to different language
families (Enfield 2005). For this reason, combining inter-language dependencies with
inter-feature dependencies is a promising solution to the problem of missing values.
Whereas inter-feature dependencies are synchronic in nature, inter-language depen-
dencies reflect diachrony, at least in an indirect manner. Thus, we call the combined
approach diachrony-aware learning.

As a building block, we use a Bayesian autologistic model that takes both the
vertical and horizontal factors into consideration (Murawaki and Yamauchi 2018).
Just like the familiar logistic model, the autologistic model assumes that a dependent
random variable (a language in our case) depends probabilistically on explanatory
variables. The difference is that explanatory variables themselves are languages that
are to be stochastically explained by other languages. The motivation behind this is
that languages that are related either vertically or horizontally must be predictive of
the language in question. Thus, languages are dependent on each other and form a

1 The combination of neural networks and Bayesian learning is actively studied (Welling and Teh 2011)

and is applied to natural language tasks (Gan et al. 2017). One thing common to these studies, including
ours, is the use of Hamiltonian Monte Carlo (HMC) or variants of it for inference. Bayesian neural
networks use online extensions to HMC because scalability is a vital concern. In contrast, we use vanilla
HMC because our database is relatively small.

neighbor graph in which every pair of interdependent languages is connected. A major
advantage of the autologistic model over the standard tree model (Gray and Atkinson
2003; Bouckaert et al. 2012) is the ability to integrate the vertical and horizontal factors
into a single model by simply using two neighbor graphs (Towner et al. 2012).
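
To make the neighbor-graph idea concrete, the following minimal sketch (in Python) computes the conditional probability that a language takes the value 1 for one binary variable, given the values of its phylogenetic and spatial neighbors. It anticipates the autologistic model formalized in Section 4.2 (Equation (11)); the data structures and function name here are illustrative, not part of any released implementation.

```python
import math

def autologistic_conditional(z, l, phylo_neighbors, spatial_neighbors, v, h, u):
    """P(z_l = 1 | values of all other languages) for one binary linguistic variable.

    z: dict mapping language id -> current 0/1 value
    phylo_neighbors, spatial_neighbors: dicts mapping language id -> list of neighbor ids
    v, h, u: vertical stability, horizontal diffusibility, universality (cf. Section 4.2)
    """
    score = {}
    for b in (0, 1):
        agree_phylo = sum(1 for j in phylo_neighbors[l] if z[j] == b)
        agree_spatial = sum(1 for j in spatial_neighbors[l] if z[j] == b)
        score[b] = math.exp(v * agree_phylo + h * agree_spatial + u * b)
    return score[1] / (score[0] + score[1])
```

A large v thus pulls a language toward the value shared by its phylogenetic neighbors, a large h toward the value of its spatial neighbors, and u acts as a global bias toward 1.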

To do so, we make use of two additional resources: (1) a phylogenetic neighbor
graph of languages that can be generated by connecting every pair of languages in
each language family and (2) a spatial neighbor graph that connects languages within a
specified distance.

A problem with the combined approach is that the model for inter-language de-
pendencies, in its original form, cannot be integrated into the model for inter-feature
dependencies. They both explain how the surface language–feature matrix is generated,
even though only one generative story can exist. To resolve the conflict, we incorporate
the autologistic model at the level of latent representations, rather than surface features,
with the reasonable assumption that phylogenetically and/or spatially close languages
tend to share the same latent variables in addition to the same surface features. In the
end, the integrated Bayesian generative model first generates the latent representations
of languages using inter-language dependencies, and then generates the surface repre-
sentations of languages using inter-feature dependencies, as summarized in Figure 3.
Experiments show that the proposed Bayesian model recovers missing values
considerably more accurately than other models. Moreover, the integrated model con-
sistently outperforms baseline models that exploit only one of the two types of depen-
dencies, demonstrating the complementary nature of inter-feature and inter-language
dependencies.

Since autologistic models require variables to be discrete, we inevitably adopt binary
latent representations. We call our latent variables linguistic parameters for their super-
ficial resemblance to parameters in the principles-and-parameters framework of gener-
ative grammar (Chomsky and Lasnik 1993) and for other reasons. A side effect of the
discreteness constraint is good interpretability of linguistic parameters, in comparison
with that of the continuous representations of Murawaki (2015). To demonstrate this,
we project linguistic parameters on a world map and show that at least some of them
exhibit phylogenetic and spatial signals comparable to those of surface features. Also,
because both the surface and latent representations are discrete, linguistic parameters
can readily be used as a substitute for surface features and therefore have a wide
range of potential applications, including tree-based phylogenetic inference (Gray and
Atkinson 2003; Bouckaert et al. 2012; Chang et al. 2015).

Figure 3
Overview of the proposed Bayesian generative model. Dotted boxes indicate the latent and
surface representations of the same language. Solid arrows show the direction of stochastic
generation. Symbols used here are explained in Table 1.


2. Background

2.1 Greenbergian Universals

Since Greenberg (1963), various interdependencies among typological features have
been observed across the world’s languages. For example, if a language takes a verb
before an object (VO), then it takes postnominal relative clauses (NRel) (VO → NRel, in
shorthand), and a related universal, RelN → OV, also holds (Dryer 2011). Such cross-
linguistic generalizations are specifically called Greenbergian universals, as opposed
to Chomskyan universals, which we discuss in the next section. A Bayesian model for
discovering Greenbergian universals was presented by Daumé III and Campbell (2007).
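
As a rough illustration of what testing such a universal against a database involves, the sketch below counts, for a candidate implication such as VO → NRel, how many languages satisfy the antecedent and how many of those violate the consequent; a small but non-zero violation count is what makes a universal statistical rather than absolute. The table layout and feature names are hypothetical.

```python
def universal_counts(languages, antecedent, consequent):
    """Count support and violations of a candidate implicational universal.

    languages: iterable of dicts, e.g. {"verb_object": "VO", "relative_clause": "NRel"}
    antecedent, consequent: (feature_name, value) pairs; languages with either
    feature missing are skipped, as is typical with sparse typological databases.
    """
    (a_feat, a_val), (c_feat, c_val) = antecedent, consequent
    holds = violated = 0
    for lang in languages:
        if lang.get(a_feat) is None or lang.get(c_feat) is None:
            continue
        if lang[a_feat] == a_val:
            holds += 1
            if lang[c_feat] != c_val:
                violated += 1
    return holds, violated
```

For example, universal_counts(langs, ("verb_object", "VO"), ("relative_clause", "NRel")) would return the number of VO languages and, among them, the number lacking postnominal relative clauses.
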
Greenbergian universals indicate that certain combinations of features are unnatu-
ral. Moreover, Greenberg (1978) discussed how to extend the synchronic observations
to diachronic reasoning: Under the uniformitarian hypothesis, languages must have
changed in a way such that they avoid unnatural combinations of features.

Despite these highly influential observations, most computational models of ty-
pological data assume independence between features (Daumé III 2009; Dediu 2010;
Greenhill et al. 2010; Murawaki 2016; Greenhill et al. 2017). These methods are at risk
for reconstructing typologically unnatural languages. A rare exception is Dunn et al.
(2011), who extended Greenberg’s idea by applying a phylogenetic model of correlated
evolution (Pagel and Meade 2006).

Both Greenberg (1963, 1978) and Dunn et al. (2011) focused on pairs of features.
However, the dependencies between features are not limited to feature pairs (Tsunoda,
Ueda, and Itoh 1995; Itoh and Ueda 2004). The order of relative clauses, just mentioned
above, has connections to the order of adjective and noun (AdjN or NAdj), in addition
to the order of object and verb, as two universals, RelN → AdjN and NAdj → NRel, are
known to hold well (Dryer 2011).

Limiting the scope of research to feature pairs is understandable given that a
combination of three or more features is often beyond human comprehension. Even for
computers, extending a model of feature pairs (Dunn et al. 2011) to multiple features
is hampered by computational intractability due to combinatorial explosion. What we
propose here is a computationally tractable way to handle multiple inter-feature depen-
dencies. We map interdependent variables to latent variables that are independent from
each other by assumption. If we perform inference in the latent space (for example,
reconstructing ancestral languages from their descendants) and then project the data
back to the original space, we can handle inter-feature dependencies in an implicit
manner.

2.2 Chomskyan Universals

Thanks to its tradition of providing concrete and symbolic representations to latent
structures of languages, generative grammar has constantly given inspiration to com-
putational linguists. At the same time, however, it is a source of frustration because,
with Optimality Theory (Prince and Smolensky 2008) being a notable exception, it
rarely explores how disambiguation is performed. Linguistic typology is no exception.
Symbolic latent representations behind surface patterns are proposed (Baker 2001), but
they have trouble explaining disharmonic languages around the world (Boeckx 2014).

A Chomskyan explanation for typological variation is (macro)parameters, which
are part of the principles and parameters (P&P) framework (Chomsky and Lasnik
1993). In this framework, the structure of a language is explained by (1) a set of universal

principles that are common to all languages and (2) a set of parameters whose values
vary among languages. Here we skip the former because our focus is on structural
variability. According to P&P, if we give specific values to all the parameters, then we
obtain a specific language. Each parameter is binary and, in general, sets the values of
multiple surface features in a deterministic manner. For example, the head directionality
parameter is either head-initial or head-final. If head-initial is chosen, then sur-
face features are set to VO, NAdj, and Prepositions; otherwise, the language in question
becomes OV, AdjN, and Postpositions (Baker 2001). Baker (2001) discussed a num-
ber of parameters, such as the head directionality, polysynthesis, and topic-prominent
parámetros.

Our design decision to use binary latent representations is partly inspired by the pa-
rameters of generative grammar. We also borrow the term parameter from this research
field, due to a conflict in terminology. Features usually refer to latent representations
in the machine learning community (Griffiths and Ghahramani 2011; Bengio, Courville,
and Vincent 2013). Unfortunately, the word feature is reserved for surface variables in
the present study, and we need another name for latent variables. We admit the term
parameter is confusing because, in the context of machine learning, it refers to a variable
tied to the model itself, rather than its input or output. For clarity, we refer to binary
latent representations as linguistic parameters throughout this article. A parameter of
the model is referred to as a model parameter.

It should be noted that we do not intend to present the proposed method as a com-
putational procedure to induce P&P parameters. Although our binary latent represen-
tations are partly inspired by P&PAG, their differences cannot be ignored. There are at least
five differences between the P&P framework and the proposed model. First, while
the primary focus of generative linguists is put on morphosyntactic characteristics of
idiomas, the data sets we used in the experiments are not limited to them.

Second, Baker (2001) presented a hierarchical organization of parameters (see
Figure 6.4 of Baker [2001]). However, we assume independence between linguistic
parameters. Introducing a hierarchical structure to linguistic parameters is an interest-
ing direction to explore, but we leave it for future work.

Third, whereas P&P hypothesizes deterministic generation, the proposed model
stochastically generates a language’s features from its linguistic parameters. This choice
appears to be inevitable because obtaining exceptionless relations from real data is
virtually impossible.

Fourth, a P&P parameter typically controls a very small set of features. In contrast,
if our linguistic parameter is turned on, it more or less modifies all the feature genera-
tion probabilities. However, we can expect a small number of linguistic parameters to
dominate the probabilities because weights are drawn from a heavy-tailed distribution,
as we describe in Section 4.1.

Lastly, our linguistic parameters are asymmetric in the sense that they do not
operate at all if they are off. Although the marked–unmarked relation comes about as a
natural consequence of incorporating binary variables into a computational model, this
is not necessarily the case with P&P. For example, the head directionality parameter has
two values, head-initial and head-final, and it is not clear which one is marked and
which one is unmarked. In the Bayesian analysis of mixture models, mixture compo-
nents are known to be unidentifiable because the posterior distribution is invariant to
permutations in the labels (Jasra, Holmes, and Stephens 2005). In our model, the on and
off of a linguistic parameter is not exactly swappable, but it is likely that a point in the
search space where head-initial is treated as the marked form is separated by deep
valleys from another point where head-final is treated as the marked form. If this is

the case, a Gibbs sampler generally cannot cross the valleys. We discuss this point again
in Section 7.3.

2.3 Functional Explanations

Needless to say, Greenbergian typologists themselves have provided explanations for
cross-linguistic patterns although, unlike generative linguists, they generally avoid
using metaphysical representations. Such explanations can be collectively referred to as
functional explanations (Haspelmath 2008b).

One major type of functional explanation is synchronic in nature and often is a
matter of economy. For example, several languages exhibit an adnominal alienability
split, that is, the use of different possessive constructions for inalienable nouns (e.g.,
my arm) and alienable nouns (e.g., your car). An implicational universal for the split is:

If a language has an adnominal alienability split, and one of the constructions is overtly
coded while the other one is zero-coded, it is always the inalienable construction that is
zero-coded, while the alienable construction is overtly coded.

(Haspelmath 2008a)

Haspelmath (2008a) points to the fact that inalienable nouns occur as possessed nouns
much more frequently than alienable nouns. This means that inalienable nouns are more
predictable and, as a consequence, a shorter (even zero) marker is favored for efficiency.

Another type is diachronic explanations. According to this view, at least some
patterns observed in surface features arise from common paths of diachronic develop-
ment (Anderson 2016). An important factor of diachronic development is grammatical-
ization, by which content words change into function words (Heine and Kuteva 2007).
For example, the correlation between the order of adposition and noun and the order
of genitive and noun might be explained by the fact that adpositions are often derived
from nouns.

Regardless of whether they are synchronic or diachronic, functional explanations imply
that unattested languages may simply be improbable but not impossible (Haspelmath
2008b). Because of the stochastic nature of the proposed model, we are more closely
aligned with functionalists than with generative linguists. Note that the proposed
model only describes patterns found in the data. It does not explain the underlying
cause-and-effect mechanisms, although we hope that it can help linguists explore them.

2.4 Vertical and Horizontal Transmission

The standard model for phylogenetic inference is the tree model, where a trait is passed
on from parent to child with occasional modifications. In fact, the recent success in the
applications of statistical models to historical linguistic problems is largely attributed
to the tree model (Gray and Atkinson 2003; Bouckaert et al. 2012), although the ap-
plications are subject to frequent criticism (Chang et al. 2015; Pereltsvaig and Lewis
2015). In linguistic typology, however, a non-tree-like mode of evolution has emerged
as one of the central topics (Trubetzkoy 1928; Campbell 2006). Typological features, like
loanwords, can be borrowed by one language from another, and as a result, vertical
(phylogenetic) signals are obscured by horizontal (spatial) transmission.

The task of incorporating both vertical and horizontal transmissions within a statis-
tical model of language evolution is notoriously challenging because of the excessive
flexibility of horizontal transmissions. This is the reason why previously proposed
models are coupled with some very strong assumptions—for example, that a reference

206

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
5
2
1
9
9
1
8
0
9
7
8
2
/
C
oh

yo
i

_
a
_
0
0
3
4
6
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Murawaki

Bayesian Learning of Latent Representations of Language Structures

tree is given a priori (Nelson-Sathi et al. 2011) and that horizontal transmissions can be
modeled through time-invariant areal clusters (Daum´e III 2009).

Consequently, we pursue a line of research in linguistic typology that draws on
information on the current distribution of typological features without explicitly requir-
ing the reconstruction of previous states (Nichols 1992, 1995; Parkvall 2008; Wichmann
and Holman 2009). The basic assumption is that if the feature in question is vertically
stable, then a phylogenetically defined group of languages will tend to share the same
value. Similarly, if the feature in question is horizontally diffusible, then spatially close
languages would be expected to frequently share the same feature value. Because the
current distribution of typological features is more or less affected by these factors, the
model needs to take both vertical and horizontal factors into account.

Murawaki and Yamauchi (2018) adopted a variant of the autologistic model, which
had been widely used to model the spatial distribution of a feature (Besag 1974; Towner
et al. 2012). The model was also used to impute missing values because the phyloge-
netic and spatial neighbors of a language had some predictive power over its feature
values. Our assumption in this study is that the same predictive power applies to latent
representations.

3. Data and Preprocessing

3.1 Input Specifications

The proposed model requires three types of data as the input: (1) a language–feature
matrix, (2) a phylogenetic neighbor graph, and (3) a spatial neighbor graph. Table 1 lists
the major symbols used in this article.

Let L and N be the numbers of languages and surface features, respectively. The
language–feature matrix X ∈ NL×N contains discrete items. A substantial portion of the
items may be missing. xl,n denotes the value of feature n for language l. Features can be
classified into three types: (1) binary (xl,n ∈ {0, 1}), (2) categorical (xl,n ∈ {1, 2, · · · , Fn},
where Fn is the number of distinct values), and (3) count (xl,n ∈ {0, 1, 2, · · · }).
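
As a minimal illustration of this input format, the sketch below stores a toy language–feature matrix with NaN marking missing items and a per-feature record of its type; the container and names are our own illustration, not the format of any particular database.

```python
import numpy as np

# Per-feature metadata: the type and, for categorical features, the number of values F_n.
features = [
    {"name": "toy_binary",      "type": "binary"},
    {"name": "toy_categorical", "type": "categorical", "num_values": 7},
    {"name": "toy_count",       "type": "count"},
]

L, N = 4, len(features)        # L languages, N surface features
X = np.full((L, N), np.nan)    # language-feature matrix; NaN marks a missing item

X[0, 0] = 1                    # binary feature: 0 or 1
X[0, 1] = 3                    # categorical feature: one of 1, ..., F_n
X[2, 2] = 5                    # count feature: 0, 1, 2, ...
observed_mask = ~np.isnan(X)   # in the real databases only about 20-30% of items are present
```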

A neighbor graph is an undirected graph in which each node represents a language.
The graph connects every pair of languages that are related in some way and thus are
likely to be similar to some degree. A phylogenetic neighbor graph connects phyloge-

Table 1
Notations. Corresponding item indices are in parentheses.

L (l)    # of languages
K (k)    # of linguistic parameters (given a priori)
M (m)    # of model parameters of W, ˜Θ, and Θ
N (n)    # of surface discrete linguistic features
Fn       # of distinct values for categorical feature n

A = {(vk, hk, uk) | k ∈ {1, · · · , K}}    Model parameters for the autologistic models
Z ∈ {0, 1}L×K                              Binary latent parameter matrix
W ∈ RK×M                                   Weight matrix
˜Θ ∈ RL×M                                  Unnormalized model parameter matrix
Θ ∈ (0, 1)L×M                              Normalized model parameter matrix
X ∈ NL×N                                   Surface discrete linguistic feature matrix

netically related languages, while a spatial neighbor graph connects spatially close pairs
of languages.

3.2 Preprocessing

Although any database that meets the requirements described in Section 3.1 can be
used, we specifically tested two typological databases in the present study: (1) the online
edition2 of the World Atlas of Language Structures (WALS) (Haspelmath et al. 2005) and
(2) Autotyp 0.1.0 (Bickel et al. 2017).

WALS (Haspelmath et al. 2005) is a large database of typological features compiled
by dozens of typologists. Since its online version was released in 2008, it has been
occasionally used by the computational linguistics community (Naseem, Barzilay, and
Globerson 2012; O’Horan et al. 2016; Bender 2016). WALS covers a wide range of
linguistic domains (called “areas” in WALS), such as phonology, morphology, nominal
categories, and word order. All features are categorically coded. For example, Feature
81A, “Order of Subject, Object and Verb” has seven possible values: SOV, SVO, VSO, VOS,
OVS, OSV, and No dominant order, and each language is assigned one of these seven
values.

We downloaded a CSV file that contained metadata and feature values for each
language. Sign languages were dropped because they were too different from spoken
languages. Pidgins and creoles were also removed from the matrix because they be-
longed to the dummy language family “other.” We imputed some missing values that
could trivially be inferred from other features. Feature 144D, “The Position of Negative
Morphemes in SVO Languages” is an example. Because it is only applicable to SVO
languages, languages for which the value of Feature 81A is SOV are given the special
value Undefined. We then removed features that covered fewer than 150 languages. We
manually classified features into binary and categorical features (no count features were
present in WALS) and replaced text-format feature values with numerical ones.

WALS provides two-level phylogenetic groupings: family (upper) and genus
(lower). For example, English belongs to the Indo-European family and to its subgroup
(genus), Germanic. Genera are designed to be roughly comparable taxonomic groups so
that they facilitate cross-linguistic comparison (Dryer 1989). Following Murawaki and
Yamauchi (2018), we constructed a phylogenetic neighbor graph by connecting every
pair of languages within each genus.

WALS associates each language with single-point geographical coordinates (longi-
tude and latitude). Following Murawaki and Yamauchi (2018), we constructed a spatial
neighbor graph by linking all language pairs that were located within a distance of
R = 1,000 km.
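
A minimal sketch of how both graphs can be built from this metadata: every pair of languages within a genus is connected, and every pair located within R = 1,000 km is connected. The helper names and the great-circle distance computation are our own assumptions; WALS itself only supplies genus labels and single-point coordinates.

```python
import math
from itertools import combinations

def haversine_km(p, q):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def build_neighbor_graphs(langs, radius_km=1000.0):
    """langs: dict id -> {"genus": ..., "coord": (lat, lon)}. Returns two edge sets."""
    phylo, spatial = set(), set()
    for a, b in combinations(sorted(langs), 2):
        if langs[a]["genus"] == langs[b]["genus"]:
            phylo.add((a, b))      # connect every pair of languages within a genus
        if haversine_km(langs[a]["coord"], langs[b]["coord"]) <= radius_km:
            spatial.add((a, b))    # connect language pairs located within R km
    return phylo, spatial
```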

Autotyp (Bickel et al. 2017) is a smaller database, and it appears to be more coherent
because a smaller number of typologists led its construction. It is a mixture of raw data
and automatically aggregated data and covers finer-grained domains (called “modules”
in Autotyp) such as alignment per language, locus per language, and valence per lan-
guage. In Autotyp, domains are classified into three types: (1) single entry per language,
(2) single aggregated entry per language, and (3) multiple entries per language. We
only used the first two types of domains, in which a language is given a single value
per feature. As the name suggests, features belonging to the last type of domains have

2 http://wals.info/.

Table 2
Data set specifications after preprocessing.

                                                                   WALS       Autotyp
L (# of languages)                                                 2,607      1,063
K (# of linguistic parameters)                                     50 or 100  50 or 100
M (# of model parameters of W, ˜Θ, and Θ)                          760        958
N (# of discrete features)                                         152        372
  # of binary features                                             14         229
  # of categorical features                                        138        118
  # of count features                                              0          25
Proportion of items present in the language–feature matrix (%)     19.98      21.54
# of phylogenetic neighbors on average                             30.77      7.43
# of spatial neighbors on average                                  89.10      38.53

multiple values in general, and the number of distinct combined values can reach the
order of 100.

We downloaded the data set from the GitHub repository.3 In addition to dropping
sign languages, and creole and mixed languages, which were all marked as such in
the metadata, we manually removed ancient languages. Languages without phyloge-
netic information or geographical coordinate points were also removed. We manually
classified features into binary, categorical, and count features4 and assigned numerical
codes to the values of the binary and categorical features. We then removed features
that covered fewer than 50 languages.

The phylogenetic and spatial neighbor graphs for Autotyp data were constructed
as were done for WALS. One difference was that Autotyp did not have a single phy-
logenetic level comparable to WALS’s genera. It instead provides six non-mandatory
metadata fields: majorbranch, stock, subbranch, subsubbranch, lowestsubbranch, and
quasistock. We attempted to create genus-like groups by combining these fields, but
the results were far from perfect and are subject to future changes.

Table 2 summarizes the results of preprocessing (K and M are introduced later). We
can see that although we removed low-coverage features, only about 20% of items were
present in the language–feature matrices X. Figure 4 visualizes X. It is evident that we
were dealing with a type of missing values called missing not at random. Data gaps are
not random because both languages and features exhibit power-law behavior, that is, a
small number of high-coverage languages (features) are contrasted with heavy tails of
low-coverage languages (features). What is worse, the lack of one feature is predictive
of the lack of some others because typologists have coded multiple related features at
once.

3 https://github.com/autotyp/autotyp-data.
4 We discarded ratio features whose values ranged from 0 to 1, inclusive. Although it is possible to model

ratio data, it is generally not a good idea because the “raw” features from which the ratio is calculated are
more informative.

(a) WALS.

(b) Autotyp.

Figure 4
Missing values in the language–feature matrices. Items present are filled black.

4. Bayesian Generative Model

Our goal is to induce a binary latent parameter matrix Z ∈ {0, 1}L×K from the observed
portion of the language–feature matrix X. K is the number of linguistic parameters and
is to be specified a priori. To obtain Z, we first define a probabilistic generative model
that describes how Z is stochastically generated and how X is generated from Z. After
that, we devise a method to infer Z, as we explain in Section 5. For now, we do not need
to care about missing values.

As shown in Figure 3, the model assumes a two-step generation process: It exploits
inter-language dependencies for the first part and inter-feature dependencies for the
second part. Accordingly, the joint distribution is given as

P(A, Z, W, X) = P(A) P(Z | A) P(W) P(X | Z, W)    (1)

where hyperparameters are omitted for brevity. A is a set of model parameters that
control the generation of Z, whereas W is a weight matrix that connects Z and X.

For ease of description, we trace the generative story backward from X. Section 4.1

describes the second part, which is followed by Section 4.2 for the first part.
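
Read generatively, Equation (1) corresponds to the following skeleton. This is a sketch only: it replaces P(Z | A) by its simplified form (Equation (14) below, with vk = hk = 0), treats every feature as binary, and uses our own function and variable names; Sections 4.1 and 4.2 define the actual distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(L=10, K=5, M=8):
    """Skeleton of Equation (1): P(A, Z, W, X) = P(A) P(Z | A) P(W) P(X | Z, W)."""
    # P(A): vertical stability v_k, horizontal diffusibility h_k, universality u_k
    A = {"v": rng.gamma(1.0, 1.0, K),
         "h": rng.gamma(1.0, 1.0, K),
         "u": rng.normal(0.0, 10.0, K)}
    # P(Z | A), simplified: each z_{l,k} ~ Bernoulli(exp(u_k) / (1 + exp(u_k)))
    Z = (rng.random((L, K)) < 1.0 / (1.0 + np.exp(-A["u"]))).astype(int)
    # P(W): heavy-tailed weights, Student t with 1 degree of freedom
    W = rng.standard_t(df=1, size=(K, M))
    # P(X | Z, W): here every feature is treated as binary with a sigmoid link
    Theta = 1.0 / (1.0 + np.exp(-(Z @ W)))
    X = (rng.random((L, M)) < Theta).astype(int)
    return A, Z, W, X
```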

4.1 Inter-Feature Dependencies

In this section, we describe how the surface feature representations are generated from
the binary latent representations. Figure 5 illustrates the process. We use matrix factor-
ization (Srebro, Rennie, and Jaakkola 2005; Griffiths and Ghahramani 2011) to capture
inter-feature dependencies. Because the discrete feature matrix X cannot directly be

Figure 5
Stochastic parameter-to-feature generation. ˜Θ = ZW encodes inter-feature dependencies.

decomposed into two matrices, we instead decompose a closely related, unnormalized
model parameter matrix ˜Θ. It is ˜Θ that directly controls the stochastic generation of X.
Recall that xl,n can take a binary, categorical, or count value. As usual, we assume

that a binary feature is drawn from a Bernoulli distribution:

xl,n ∼ Bernoulli(θl,f(n,1))    (2)

where θl,f(n,1) ∈ (0, 1) is the corresponding model parameter. Because, as we discuss
subsequently, one feature can correspond to more than one linguistic parameter, feature
n is mapped to the corresponding model parameter index by the function f(n, i) ∈
{1, · · · , m, · · · , M}. A binary or count feature has one model parameter whereas a cate-
gorical feature with Fn distinct values has Fn model parameters. M is the total number
of these model parameters. θl,m is an item of the normalized model parameter matrix
Θ ∈ (0, 1)L×M.

A categorical value is generated from a categorical distribution:

xl,n ∼ Categorical(θl,f(n,1), · · · , θl,f(n,Fn))    (3)

where θl,f(n,i) ∈ (0, 1) and Σ_{i=1}^{Fn} θl,f(n,i) = 1. As you can see, the categorical feature n has
Fn model parameters.

A count value is drawn from a Poisson distribution:

xl,n ∼ Poisson(θl,f(n,1))    (4)

where θl,f(n,1) > 0. This distribution has mean and variance θl,f(n,1).5

Θ is obtained by normalizing ˜Θ ∈ RL×M. For binary features, we use the sigmoid
function:

θl,f(n,1) = sigmoid(˜θl,f(n,1)) = 1 / (1 + exp(−˜θl,f(n,1)))    (5)

5 Alternativamente, we can use a negative binomial distribution because it may provide a closer fit by

decoupling mean and variance. Another option is to use a Poisson hurdle distribution (Mullahy 1986),
which deals with the high occurrence of zeroes in the data.

Similarly, the softmax function is used for categorical features:

θl,f(n,i) = softmax_i(˜θl,f(n,1), · · · , ˜θl,f(n,Fn)) = exp(˜θl,f(n,i)) / Σ_{i′=1}^{Fn} exp(˜θl,f(n,i′))    (6)

and the softplus function for count features:

θl,f(n,1) = softplus(˜θl,f(n,1)) = log(1 + exp(˜θl,f(n,1)))    (7)

The unnormalized model parameter matrix ˜Θ is a product of the binary latent
parameter matrix Z and the weight matrix W. The generation of Z is described in
Section 4.2. Each item of ˜Θ, ˜θl,m, is language l’s m-th unnormalized model parameter.
It is affected only by linguistic parameters with zl,k = 1 because

˜θl,m = Σ_{k=1}^{K} zl,k wk,m    (8)
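
A sketch of this parameter-to-feature generation, assuming the simplified bookkeeping that each feature owns a block of model-parameter columns (the role of the index map f(n, i)); the variable names are ours. It computes ˜Θ = ZW once (Equation (8)) and then draws each feature through the link that matches its type (Equations (2)–(7)).

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    return np.log1p(np.exp(x))

def generate_features(Z, W, feature_specs):
    """Draw surface features X from Theta-tilde = Z W.

    feature_specs: list of (feature_type, column_indices) pairs, where the columns
    are the model parameters belonging to that feature (one column for binary and
    count features, F_n columns for a categorical feature with F_n values).
    """
    theta_tilde = Z @ W                       # L x M unnormalized model parameters
    L = Z.shape[0]
    X = np.empty((L, len(feature_specs)))
    for n, (ftype, cols) in enumerate(feature_specs):
        t = theta_tilde[:, cols]
        if ftype == "binary":                 # sigmoid link, Bernoulli draw
            X[:, n] = rng.random(L) < 1.0 / (1.0 + np.exp(-t[:, 0]))
        elif ftype == "categorical":          # softmax link, categorical draw over 1..F_n
            p = np.exp(t - t.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            X[:, n] = [rng.choice(len(cols), p=p[l]) + 1 for l in range(L)]
        else:                                 # count: softplus link, Poisson draw
            X[:, n] = rng.poisson(softplus(t[:, 0]))
    return X
```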

To investigate how categorical features are related to each other, we combine Equa-
tions (8) and (6). We obtain

θl,f(n,i) ∝ exp( Σ_{k=1}^{K} zl,k wk,f(n,i) ) = Π_{k=1}^{K} exp(zl,k wk,f(n,i))    (9)

We can see from Equation (9) that this is a product-of-experts model (Hinton 2002). Si
zl,k = 0, the linguistic parameter k has no effect on θl,F (norte,i) because exp(zl,kwk,F (norte,i)) = 1.
De lo contrario, if wk,F (norte,i) > 0, it makes θl,F (norte,i) más grande, and if wk,F (norte,i) < 0, it lowers θl,f (n,i). Suppose that for the linguistic parameter k, a certain group of languages takes zl,k = 1. If two categorical feature values (n1, i1) and (n2, i2) have large positive weights (wk,f (n1,i1 ) >
0 and wk,F (n2,i2 ) > 0), the pair must often co-occur in these languages. Asimismo, the fact
that two feature values do not co-occur can be encoded as a positive weight for one
value and a negative weight for the other.

Binary and count features are more straightforward because both the sigmoid and
softplus functions take a single argument and increase monotonically. For a binary
feature, if zl,k = 1 and wk,F (norte,i) > 0, then θl,F (norte,i) approaches 1. semana,F (norte,i) < 0 makes θl,f (n,i) closer to 0. The best fit for the count data is obtained when the value equals the mode of the Poisson distribution, which is close to θl,f (n,i). Each item of W, wk,m, is generated from a Student t-distribution with 1 degree of freedom. We choose this distribution for two reasons. First, it has heavier tails than the Gaussian distribution and allows some weights to fall far from 0. Second, our inference algorithm demands that the negative logarithm of the probability density function 212 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 2 1 9 9 1 8 0 9 7 8 2 / c o l i _ a _ 0 0 3 4 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Murawaki Bayesian Learning of Latent Representations of Language Structures Figure 6 Neighbor graphs and counting functions used to encode inter-language dependencies. be differentiable, as explained in Section 5.2. The t-distribution satisfies the condition whereas the Laplace distribution does not.6 4.2 Inter-Language Dependencies The autologistic model (Murawaki and Yamauchi 2018) for the linguistic parameter k generates a column of Z, z∗,k = (z1,k, · · · , zL,k). To construct the model, we use two neighbor graphs and the corresponding three counting functions, as illustrated in Figure 6. V(z∗,k) returns the number of pairs sharing the same value in the phylogenetic neighbor graph, and H(z∗,k) is the spatial equivalent of V(z∗,k). U(z∗,k) gives the number of languages that take the value 1. We now introduce the following variables: vertical stability vk > 0, horizontal dif-
fusibility hk > 0, and universality uk ∈ (−∞, ∞) for each linguistic parameter k. El
probability of z∗.k conditioned on vk, hk, and uk is given as

(cid:18)

(cid:19)

PAG(z∗,k | vk, hk, uk) =

(cid:80)

exp.

exp.

z(cid:48)
∗,k

vkV(z∗,k) + hkH(z∗,k) + ukU(z∗,k)
(cid:18)

vkV(z(cid:48)

∗,k) + hkH(z(cid:48)

∗,k) + ukU(z(cid:48)

∗,k)

(cid:19)

(10)

The denominator is a normalization term, ensuring that the sum of the distribution
equals one.

The autologistic model can be interpreted in terms of the competition associated
with the 2L possible assignments of z∗,k for the probability mass 1. If a given value, z∗,k,
has a relatively large V(z∗,k), then setting a large value for vk enables it to appropriate
fractions of the mass from its weaker rivals. Sin embargo, if too large a value is set for vk,
then it will be overwhelmed by its stronger rivals.

To acquire further insights into the model, let us consider the probability of
language l taking value b ∈ {0, 1}, conditioned on the rest of the languages, z−l,k =
(z1,k, · · · , zl−1,k, zl+1,k, · · · , zL,k):

PAG(zl,k = b | z−l,k, vk, hk, uk) ∝ exp (cid:0)vkVl,k,b + hkHl,k,b + ukb(cid:1)

(11)

6 Alternativamente, we can explicitly impose sparsity by generating another binary matrix ZW and replacing W

with ZW (cid:12) W., dónde (cid:12) denotes element-wise multiplication.

213

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
5
2
1
9
9
1
8
0
9
7
8
2
/
C
oh

yo
i

_
a
_
0
0
3
4
6
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

1001z∗,𝑘𝑘0??phylogenetic groups(ancestral states are unknown)phylogeneticneighbor graph𝑉𝑉z∗,𝑘𝑘=2𝐻𝐻z∗,𝑘𝑘=3spatialneighbor graph(connecting languageswithin 𝑅𝑅km)≤𝑅𝑅1001z∗,𝑘𝑘0𝑈𝑈z∗,𝑘𝑘=21001z∗,𝑘𝑘0(# of languages with z𝑙𝑙,𝑘𝑘=1)(no neighbor graph)

Ligüística computacional

Volumen 45, Número 2

where Vl,k,b is the number of language l’s phylogenetic neighbors that assume the
value b, and Hl,k,b is its spatial counterpart. PAG(zl,k = b | z−l,k, vk, hk, uk) is expressed by
the weighted linear combination of the three factors in the log-space. It will increase
with a rise in the number of phylogenetic neighbors that assume the value b. Sin embargo,
this probability depends not only on the phylogenetic neighbors of language l, pero
also depends on its spatial neighbors and on universality. How strongly these factors
affect the stochastic selection is controlled by vk, hk, and uk.

Recall that matrix Z has K columns. Respectivamente, we have K autologistic models:

PAG(z | A) =

k
(cid:89)

k=1

PAG(z∗,k | vk, hk, uk)

The model parameter set A can be decomposed in a similar manner:

PAG(A) =

k
(cid:89)

k=1

PAG(vk)PAG(hk)PAG(uk)

(12)

(13)

Their prior distributions are: vk ∼ Gamma(κ, i), hk ∼ Gamma(κ, i), and uk ∼ N (0, p2).
They complete the generative story. In the experiments, we set shape κ = 1, escala
θ = 1, and standard deviation σ = 10. These priors are not non-informative, pero ellos
are sufficiently gentle in the regions where these model parameters typically reside.

An extension of the model is to set z∗,K = (1, · · · , 1) for the last linguistic parameter
k. Como consecuencia, the autologistic model is dropped from the linguistic parameter K
(while K − 1 autologistic models remain). With this modification, the weight vector
wK,∗ = (wK,1, · · · , wK,METRO) is activated for all languages and serves as a bias term. We used
this version of the model in the experiments.

Finalmente, let us consider a simplified version of the model. If we set vk = hk = 0,

Ecuación (11) is reduced to

PAG(zl,k = b | z−l,k, vk, hk, uk) =P(zl,k = b | uk) =

exp.(ukb)
1 + exp.(uk)

(14)

We can see that the generation of zl,k no longer depends on z−l,k and is a simple Bernoulli
trial with probability exp(uk)/(1 + exp.(uk)).7

7 Indian buffet processes (Griffiths and Ghahramani 2011) are the natural choice for modeling binary latent
matrices (G ¨or ¨ur, J¨akel, and Rasmussen 2006; Knowles and Ghahramani 2007; Meeds et al. 2007; Doyle,
Bicknell, and Levy 2014). The Indian buffet process (IBP) is appealing for its ability to adjust the number
of linguistic parameters to data. A linguistic parameter k is called an active parameter if it has one or
more languages with zl,k = 1. The key property of the IBP is that although there are an unbounded
number of linguistic parameters, the number of active linguistic parameters K+ is finite. K+ changes
during posterior inference. It is decremented when one linguistic parameter becomes inactive. Similarmente,
it is incremented when zl,k changes from 0 a 1 for an inactive linguistic parameter k.

We modeled P(z | A) using an IBP in a preliminary study but later switched to the present model for
two reasons. Primero, it is difficult to extend an IBP to incorporate inter-language dependencies. It appears
that we have no choice but to replace the IBP with the product of the autologistic models. Segundo, el
nonparametric model’s adaptability did not work in our case. In theory, Gibbs sampling converges to a
stationary distribution after a sufficiently large number of iterations. Sin embargo, we observed that the
number of active linguistic parameters heavily depended on its initial value K0 because it was very rare
for additional linguistic parameters to survive as active linguistic parameters. Por esta razón, the number
of linguistic parameters for the present model, k, is given a priori and is fixed throughout posterior
inferencia.

214

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
5
2
1
9
9
1
8
0
9
7
8
2
/
C
oh

yo
i

_
a
_
0
0
3
4
6
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Murawaki

Bayesian Learning of Latent Representations of Language Structures

5. Posterior Inference

Once the generative model is defined, we want to infer the binary latent matrix Z
together with other latent variables. With a slight abuse of notation, let X be disjointly
decomposed into the observed portion Xobs and the remaining missing portion Xmis.
Formalmente, the posterior probability is given as

PAG(A, z, W., Xmis | Xobs) ∝ P(A, z, W., Xmis ∪ Xobs)

(15)

As usual, we use Gibbs sampling to draw samples from the posterior distribution.
Given observed values xl,norte, we iteratively update zl,k, vk, hk, uk, and wk,∗ as well as
missing values xl,norte.

Update missing xl,norte.
type of feature n.

SG,n is sampled from Equations (2), (3), o (4), depending on the

yo,∗ . We use the Metropolis-Hastings algorithm to update zl,k and xmis
Update zl,k and xmis
yo,∗ ,
the missing portion of xl,∗ = (SG,1, · · · , SG,norte ). We find that updating xmis
yo,∗ drastically im-
proves the mobility of zl,k. The proposal distribution first toggles the current zl,k to obtain
the proposal z(cid:48)
yo,k (1 if zl,k = 0; 0 de lo contrario). As the corresponding wk,∗ = (semana,1, · · · , semana,METRO)
yo,∗ = (i(cid:48)
gets activated or inactivated, i(cid:48)
yo,METRO) is also updated accordingly. We per-
form a Gibbs sampling scan on xmis
yo,∗ : Every missing xl,n is sampled from the correspond-
ing distribution with the proposal model parameter(s). The proposal is accepted with
probabilidad

yo,1, · · · , i(cid:48)

(cid:32)

mín.

1,

PAG(z(cid:48)
yo,k, X(cid:48)
yo,∗ | −)
PAG(zl,k, SG,∗ | −)

yo,∗ | z(cid:48)

q(zl,k, xmis
yo,k, xmis(cid:48)
q(z(cid:48)
yo,∗

yo,k, xmis(cid:48)
yo,∗ )
| zl,k, xmis
yo,∗ )

(cid:33)

(16)

where conditional parts are omitted for brevity. PAG(zl,k, SG,∗ | −) is the probability of gen-
erating the current state (zl,k, SG,∗), while P(z(cid:48)
yo,∗ | −) is the probability of generating
the proposed state (z(cid:48)
yo,k, X(cid:48)
yo,∗ is updated by the proposal distribution. Q is
the proposal function constructed as explained above. Ecuación (16) can be calculated
by combining Equations (11), (2), (3), y (4).

yo,∗), in which xmis

yo,k, X(cid:48)

Update vk, hk, and uk. We want to sample vk (and hk and uk) from P(vk | −) ∝ P(vk)P(z∗,k | vk, hk, uk). This belongs to a class of problems known as sampling from doubly intractable distributions (Møller et al. 2006; Murray, Ghahramani, and MacKay 2006). Although it remains a challenging problem in statistics, it is not difficult to approximately sample the variables if we give up theoretical rigor (Liang 2010). The details of the algorithm are described in Section 5.1.

Update wk,∗. The remaining problem is how to update wk,m. Because the number of weights is very large (K × M), the simple Metropolis-Hastings algorithm (Görür, Jäkel, and Rasmussen 2006; Doyle, Bicknell, and Levy 2014) is not a workable option. To address this problem, we block-sample wk,∗ = (wk,1, · · · , wk,M) using Hamiltonian Monte Carlo (HMC) (Neal 2011). We present a sketch of the algorithm in Section 5.2.


5.1 Approximate Sampling from Doubly Intractable Distributions

During inference, we want to sample vk (and hk and uk) from its posterior distribution, P(vk | −) ∝ P(vk)P(z∗,k | vk, hk, uk). Unfortunately, we cannot apply the standard Metropolis-Hastings (MH) sampler to this problem because P(z∗,k | vk, hk, uk) contains an intractable normalization term. Such a distribution is called a doubly intractable distribution because Markov chain Monte Carlo itself approximates the intractable distribution (Møller et al. 2006; Murray, Ghahramani, and MacKay 2006). This problem remains an active topic in the statistics literature to date. However, if we give up theoretical rigor, it is not difficult to draw samples from the posterior that are only approximately correct but work well in practice.

Specifically, we use the double MH sampler (Liang 2010). The key idea is to use an auxiliary variable to cancel out the normalization term. This sampler is based on the exchange algorithm of Murray, Ghahramani, and MacKay (2006), which samples vk in the following steps.

1. Propose v′k ∼ q(v′k | vk, z∗,k).

2. Generate an auxiliary variable z′∗,k ∼ P(z′∗,k | v′k, hk, uk) using an exact sampler.

3. Accept v′k with probability min{1, r(vk, v′k, z′∗,k | z∗,k)}, where

r(v_k, v'_k, z'_{*,k} \mid z_{*,k}) = \frac{P(v'_k)\, q(v_k \mid v'_k, z_{*,k})}{P(v_k)\, q(v'_k \mid v_k, z_{*,k})} \times \frac{P(z_{*,k} \mid v'_k, h_k, u_k)\, P(z'_{*,k} \mid v_k, h_k, u_k)}{P(z_{*,k} \mid v_k, h_k, u_k)\, P(z'_{*,k} \mid v'_k, h_k, u_k)} \qquad (17)

A problem lies in the second step. The exact sampling of z′∗,k is as difficult as the original problem. The double MH sampler approximates it with a Gibbs sampling scan of zl,k's starting from the current z∗,k. At each step of the Gibbs sampling scan, z′l,k is updated according to P(z′l,k | z′−l,k, v′k, hk, uk). Note that the auxiliary variable z′∗,k is only used to compute Equation (17).

We construct the proposal distributions q(v′k | vk, z∗,k) and q(h′k | hk, z∗,k) using a log-normal distribution, and q(u′k | uk, z∗,k) using a Gaussian distribution with mean uk.
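As an illustration of the double MH update, the sketch below samples vk for one linguistic parameter, assuming a simple agreement-counting autologistic energy over the vertical and horizontal neighbor graphs, an exponential prior on vk, and a log-normal random-walk proposal. The energy, prior, and proposal scale are our assumptions for the sake of a runnable example; the model's actual Equation (11) and priors may differ.

```python
import numpy as np

def log_unnorm_autologistic(z, v, h, u, vert_edges, horz_edges):
    """Unnormalized log P(z | v, h, u): v and h reward agreement along vertical
    (phylogenetic) and horizontal (spatial) edges, and u controls how often
    z_l = 1.  This parameterization is an illustrative assumption."""
    agree_v = sum(int(z[i] == z[j]) for i, j in vert_edges)
    agree_h = sum(int(z[i] == z[j]) for i, j in horz_edges)
    return v * agree_v + h * agree_h + u * int(z.sum())

def gibbs_scan(z, v, h, u, vert_edges, horz_edges, rng):
    """One full-conditional sweep over z, used in place of an exact sampler."""
    z = z.copy()
    nbrs_v = {i: [] for i in range(len(z))}
    nbrs_h = {i: [] for i in range(len(z))}
    for i, j in vert_edges:
        nbrs_v[i].append(j)
        nbrs_v[j].append(i)
    for i, j in horz_edges:
        nbrs_h[i].append(j)
        nbrs_h[j].append(i)
    for l in range(len(z)):
        # log-odds of z_l = 1 versus z_l = 0 given its neighbors
        logit = u
        logit += v * sum(1 if z[j] == 1 else -1 for j in nbrs_v[l])
        logit += h * sum(1 if z[j] == 1 else -1 for j in nbrs_h[l])
        z[l] = int(rng.random() < 1.0 / (1.0 + np.exp(-logit)))
    return z

def double_mh_update_v(z, v, h, u, vert_edges, horz_edges, rng, step=0.3):
    """One double MH update of v_k following steps 1-3 and Equation (17)."""
    v_new = v * np.exp(step * rng.standard_normal())  # log-normal random walk
    # Auxiliary variable: approximate the exact sampler by a Gibbs scan
    # started from the current z (the double MH shortcut).
    z_aux = gibbs_scan(z, v_new, h, u, vert_edges, horz_edges, rng)
    def f(zz, vv):
        return log_unnorm_autologistic(zz, vv, h, u, vert_edges, horz_edges)
    log_r = -v_new + v                     # Exp(1) prior ratio (an assumed prior)
    log_r += np.log(v_new) - np.log(v)     # q(v | v') / q(v' | v) for this proposal
    # The normalizing constants cancel in the exchange ratio, so unnormalized
    # densities suffice below (Equation (17)).
    log_r += f(z, v_new) + f(z_aux, v) - f(z, v) - f(z_aux, v_new)
    return v_new if np.log(rng.random()) < min(0.0, log_r) else v
```

The updates of hk and uk would follow the same pattern, with a log-normal proposal for hk and a Gaussian proposal centered at uk for uk, as stated above.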

5.2 Hamiltonian Monte Carlo

HMC (Neal 2011) is a Markov chain Monte Carlo method for drawing samples from a
probability density distribution. Unlike Metropolis-Hastings, it exploits gradient infor-
mation to propose a new state, which can be distant from the current state. If no numer-
ical error is involved, the new state proposed by HMC is accepted with probability 1.

HMC has a connection to Hamiltonian dynamics and the physical analogy is useful
for gaining an intuition. In HMC, the variable to be sampled, q ∈ RM, is seen as a
generalized coordinate of a system and is associated with a potential energy function
Ud.(q) = − log P(q), the negative logarithm of the (unnormalized) density function. El
coordinate q is tied with an auxiliary momentum variable p ∈ RM and a kinetic function
K(p). The momentum makes the object move. Because H(q, p) = U(q) + K(p), the sum of
the kinetic and potential energy, is constant with respect to time, the time evolution of
the system is uniquely defined given an initial state (q0, p0). The trajectory is computed
to obtain a state (q, p) at some time, and that q is the next sample we want.


Algorithm 1 HMC(U, ∇U, q0)
1: q ← q0
2: p0 ∼ N (µ = 0, Σ = I)
3: p ← p0
4: p ← p − ε∇U(q)/2
5: for s ← 1, S do
6:   q ← q + εp
7:   if s < S then
8:     p ← p − ε∇U(q)
9:   end if
10: end for
11: p ← p − ε∇U(q)/2
12: p ← −p
13: r ∼ Uniform[0, 1]
14: if min[1, exp(−U(q) + U(q0) − K(p) + K(p0))] > r then
15:   return q        ▷ accept
16: else
17:   return q0       ▷ reject
18: end if

Algorithm 1 shows the pseudo-code, which is adapted from Neal (2011). The momentum variable is drawn from the Gaussian distribution (line 2). The time evolution of the system is numerically simulated using the leapfrog method (lines 4–11), where ε and S are parameters of the algorithm to be tuned. This is followed by a Metropolis step to correct for numerical errors (lines 13–18).

Going back to the sampling of wk,∗, we need U(wk,∗) = − log P(wk,∗ | −) and its gradient ∇U(wk,∗) to run HMC. The unnormalized density function P(wk,∗ | −) is the product of (1) the probability of generating wk,m's from the t-distribution and (2) the probability of generating xl,n's for each language with zl,k = 1. Note that U(wk,∗) is differentiable because Equations (5), (6), and (7) as well as the t-distribution are differentiable.
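For concreteness, the following is a direct transcription of Algorithm 1 into runnable Python. The quadratic potential in the toy usage is only a stand-in; in the model, U and ∇U are the negative log density of wk,∗ and its gradient described above.

```python
import numpy as np

def hmc(U, grad_U, q0, eps=0.05, S=10, rng=None):
    """One HMC transition (Algorithm 1): leapfrog simulation followed by a
    Metropolis correction for numerical error."""
    if rng is None:
        rng = np.random.default_rng()
    q = q0.copy()
    p0 = rng.standard_normal(q0.shape)          # line 2: p0 ~ N(0, I)
    p = p0 - eps * grad_U(q) / 2.0              # lines 3-4: half step for momentum
    for s in range(1, S + 1):                   # lines 5-10: leapfrog steps
        q = q + eps * p
        if s < S:
            p = p - eps * grad_U(q)
    p = p - eps * grad_U(q) / 2.0               # line 11: final half step
    p = -p                                      # line 12: negate for reversibility
    K = lambda m: 0.5 * np.dot(m, m)            # kinetic energy for N(0, I) momentum
    log_accept = -U(q) + U(q0) - K(p) + K(p0)   # lines 13-18: Metropolis step
    if np.log(rng.random()) < min(0.0, log_accept):
        return q                                # accept
    return q0                                   # reject

# Toy usage with a quadratic potential, U(q) = 0.5 * ||q||^2 (standard Gaussian).
rng = np.random.default_rng(0)
sample = hmc(lambda q: 0.5 * np.dot(q, q), lambda q: q, np.zeros(5), rng=rng)
```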

6. Missing Value Imputation

6.1 Settings

Although our goal is to induce good latent representations, “goodness” is too subjective
to be measured. To quantitatively evaluate the proposed model, we use missing value
imputation as an approximate performance indicator. If the model predicts missing feature values better than reasonable baselines, we can say that the induced linguistic parameters are justified. Although no ground truth exists for the missing portion of a data set, missing value imputation can be evaluated by hiding some observed values and verifying the effectiveness of their recovery. We conducted a 10-fold cross-validation.

We ran the proposed model, now called SYNDIA, with two different settings: K = 50 and 100.8 We performed posterior inference for 500 iterations. Afterward, we collected

8 In preliminary experiments, we also tried K = 250 and 500, but it quickly became evident that a large K caused performance degradation.


100 samples of xl,n for each language, one per iteration. For each missing value xl,n, we output the most frequent value among the 100 samples. The HMC parameters ε and S were set to 0.05 and 10, respectively. We applied simulated annealing to the sampling of zl,k and xmis l,∗. For the first 100 iterations, the inverse temperature was increased from 0.1 to 1.0.
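The prediction and annealing bookkeeping can be summarized as follows; the linear shape of the temperature schedule is our assumption, as the text only specifies the endpoints (0.1 to 1.0 over the first 100 iterations).

```python
from collections import Counter

def inverse_temperature(iteration, warmup=100, start=0.1, end=1.0):
    """Annealing schedule for the z_{l,k} and x^mis updates: the inverse
    temperature rises from 0.1 to 1.0 over the first 100 iterations
    (assumed to be linear here)."""
    if iteration >= warmup:
        return end
    return start + (end - start) * iteration / warmup

def impute_by_mode(sampled_values):
    """Final prediction for one missing cell: the most frequent value among
    the collected samples (the last 100 of the 500 iterations)."""
    return Counter(sampled_values).most_common(1)[0][0]

# Example: three posterior samples for one missing feature value.
print(impute_by_mode(["SOV", "SVO", "SOV"]))   # -> "SOV"
```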

We also tested a simplified version of SYNDIA, SYN, from which vk and hk were removed. Equation (11) was replaced with Equation (14).

We compared SYN and SYNDIA with several baselines.

MFV For each feature n, always output the most frequent value among the observed portion of x∗,n = (x1,n, · · · , xL,n).

Surface-DIA The autologistic model applied to each surface feature n (Murawaki and Yamauchi 2018). After 500 burn-in iterations, we collected 500 samples with an interval of five iterations. Among the collected samples, we chose the most frequent feature value for each language as the output.9

DPMPM A Dirichlet process mixture of multinomial distributions with a truncated stick-breaking construction (Si and Reiter 2013), used by Blasi, Michaelis, and Haspelmath (2017) for missing value imputation. It assigned a single categorical latent variable to each language. As an implementation, we used the R package NPBayesImpute. We ran the model with several different values for the truncation level K∗. The best score is reported.

MCA A variant of multiple correspondence analysis (Josse et al. 2012) used by
Murawaki (2015) for missing value imputation. We used the imputeMCA function
of the R package missMDA.

MFV and Surface-DIA can be seen as models for inter-language dependencies, and DPMPM, MCA, and SYN are models for inter-feature dependencies. SYNDIA exploits both types of clues.
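For reference, the MFV baseline and the held-out accuracy used throughout this section can be written in a few lines; this is our own sketch of the evaluation protocol, not code taken from any of the packages above.

```python
import numpy as np

def mfv_impute(X, observed_mask):
    """MFV baseline: for each feature (column), fill every unobserved cell with
    the most frequent value among that column's observed cells (assumes each
    column has at least one observed value)."""
    X_hat = X.copy()
    for n in range(X.shape[1]):
        obs = X[observed_mask[:, n], n]
        values, counts = np.unique(obs, return_counts=True)
        X_hat[~observed_mask[:, n], n] = values[np.argmax(counts)]
    return X_hat

def imputation_accuracy(X_true, X_hat, heldout_mask):
    """Fraction of artificially hidden cells whose values are recovered exactly."""
    return float((X_true[heldout_mask] == X_hat[heldout_mask]).mean())
```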

6.2 Results

Table 3 shows the result. We can see that SYNDIA outperformed the rest by substantial margins for both data sets. The gains of SYNDIA over SYN were statistically significant in both data sets (p < 0.01 for WALS and p < 0.05 for Autotyp). Although the likelihood P(X | Z, W) went up as K increased, likelihood was not necessarily correlated with accuracy. Because of the high ratio of missing values, the model with a larger K can overfit the data. The best score was obtained with K = 100 for WALS and K = 50 for Autotyp, although the differences were not statistically significant in either data set. We conjecture that because WALS's features ranged over broader domains of language, a larger K was needed to cover major regular patterns in the data (Haspelmath 2008b).

The fact that SYN outperformed Surface-DIA suggests that inter-feature dependencies have more predictive power than inter-language dependencies in the data sets. However, they are complementary in nature, as SYNDIA outperformed SYN.

DPMPM performed poorly even if a small value was set to the truncation level K∗ to avoid overfitting. It divided the world's languages into a finite number of disjoint groups. In other words, the latent representation of a language was a single categorical value. As expected, DPMPM showed its limited expressive power.

MCA uses a more expressive representation for each language: a sequence of continuous variables. It outperformed DPMPM but was inferior to SYN by a large margin. We conjecture that MCA was more sensitive to initialization than the Bayesian model armed with Markov chain Monte Carlo sampling.

To investigate the effects of inter-language dependencies, we conducted ablation experiments. We removed vk, hk, or both from the model. Note that if both are removed, the resultant model is SYN. The result is shown in Table 4. Although not all modifications yielded statistically significant changes, we observed exactly the same pattern in both data sets: The removal of vertical and horizontal factors consistently degraded the performance, with hk having larger impacts than vk.

9 Whereas Murawaki and Yamauchi (2018) partitioned xobs ∗,n into 10 equal-sized subsets for each feature n, we applied the 10-fold cross-validation to Xobs as a whole, in order to allow comparison with other models. At the level of features, this resulted in slightly uneven distributions over folds but did not seem to have made a notable impact on the overall performance.

Table 3
Accuracy of missing value imputation. The first column indicates the types of dependencies the models exploit: inter-language dependencies, inter-feature dependencies, or both.

Type                          Model               WALS    Autotyp
Inter-language dependencies   MFV                 54.80   69.42
                              Surface-DIA         61.27   73.04
Inter-feature dependencies    DPMPM (K∗ = 50)     59.12   65.53
                              MCA                 65.61   76.29
                              SYN (K = 50)        72.71   81.24
                              SYN (K = 100)       72.78   81.20
Both                          SYNDIA (K = 50)     73.47   81.58
                              SYNDIA (K = 100)    73.57   81.33

Table 4
Ablation experiments for missing value imputation. † and †† indicate statistically significant changes from SYNDIA with p < 0.05 and p < 0.01, respectively.

Model                          WALS (K = 100)      Autotyp (K = 50)
Full model (SYNDIA)            73.57               81.58
  -vertical                    73.40 (−0.17)       81.33 (−0.25)
  -horizontal                  73.17 (−0.39)†      81.28 (−0.30)†
  -vertical -horizontal (SYN)  72.78 (−0.79)††     81.24 (−0.34)†
7. Discussion

7.1 Vertical Stability and Horizontal Diffusibility

Missing value imputation demonstrates that SYNDIA successfully captures regularities of the data. We now move on to the question of what the learned linguistic parameters look like. We performed posterior inference again, with the settings based on the results of missing value imputation: We chose K = 100 for WALS and K = 50 for Autotyp. The difference was that we no longer performed a 10-fold cross-validation but gave all observed data to the models.

Figure 7 ((a) WALS; (b) Autotyp)
The histogram of the proportions of the modal values of linguistic parameters. Since linguistic parameters are binary, the proportion is always greater than or equal to 0.5.

Figure 7 shows the proportion of the modal value of each linguistic parameter. For Autotyp, the largest histogram bin ranged from 0.95 to 1.0. This suggests that even with K = 50, macro-patterns were covered by a limited number of linguistic parameters. The rest focused on capturing micro-patterns. In contrast, linguistic parameters of WALS tended toward 0.5. There might have been more macro-patterns to be captured.

Next, we take advantage of the fact that both Surface-DIA and SYNDIA use the autologistic model to estimate vertical stability vk (vn for Surface-DIA) and horizontal diffusibility hk (hn). Large vk indicates that phylogenetically related languages would be expected to frequently share the same value for linguistic parameter k. Similarly, hk measures how frequently spatially close languages would be expected to share the parameter value. Because both models make use of the same neighbor graphs, we expect that if the latent representations successfully reflect surface patterns, vk's (hk's) are in roughly the same range as vn's (hn's). To put it differently, if the region in which most vk's reside is much closer to zero than that of most vn's, it serves as a signal of the model's failure.

Figures 8 and 9 summarize the results. For both WALS and Autotyp, most linguistic parameters were within the same ranges as surface features, both in terms of vertical stability and horizontal diffusibility. Although this result does not directly prove that the latent representations are good, we can at least confirm that the model did not fail in this regard.

There were some outliers to be explained. Autotyp's outliers were surface features and are easier to explain: Features with very large vn were characterized by heavy imbalances in their value distributions, and their minority values were confined to small language families. The outliers of WALS were linguistic parameters. They had much larger vk, much larger hk, or both, than those of surface features, but deviations from the typical range were more marked for vk. As we see in Figure 7(a), the linguistic parameters of WALS, including the outliers, focused on capturing macro-patterns and did not try to explain variations found in small language families. As we discuss later, these linguistic
parameters were associated with clear geographical distributions, which we believe may be too clear. We are unsure whether they successfully "denoised" minor variations found in surface features or simply overgeneralized macro-patterns to missing features.

Figure 8
Scatter plots of the surface features and induced linguistic parameters of WALS, with vertical stability vk (vn) as the y-axis and horizontal diffusibility hk (hn) as the x-axis. Larger vk (hk) indicates that linguistic parameter k is more stable (diffusible). Comparing the absolute values of a vk and an hk makes no sense because they are tied with different neighbor graphs. Features are classified into nine broad domains (called Area in WALS). vk (and hk) is the geometric mean of the 100 samples from the posterior.

Figure 9
Scatter plots of the surface features and induced linguistic parameters of Autotyp. See Figure 8 for details.

7.2 Manual Investigation of the Latent Representations

Figure 10 visualizes the weight matrix W. We can find some dashed horizontal lines, which indicate strong dependencies between feature values. However, it is difficult for humans to directly derive meaningful patterns from the noise-like images.

Figure 10 ((a) WALS; (b) Autotyp)
Weight matrix W. Each row represents a linguistic parameter. Linguistic parameters are sorted by uk in decreasing order. In other words, a linguistic parameter with a smaller index is more likely to be on.

A more effective way to visually explore linguistic parameters is to plot them on a world map. In fact, WALS Online provides the functionality of plotting a surface feature and dynamically combined features to help typologists discover geographical patterns (Haspelmath et al. 2005). As exemplified by Figure 11(a), some surface features show several geographic clusters of large size, revealing something about the evolutionary history of languages. We can do the same thing for linguistic parameters. An example from WALS is shown in Figure 11(b). Even with a large number of missing values, SYNDIA yielded geographic clusters of comparable size for some linguistic parameters. Needless to say, not all surface features were associated with clear geographic patterns, and neither were linguistic parameters.

For comparison, we also investigated linguistic parameters induced by SYN (not shown in this article), which did not exploit inter-language dependencies. Some geographic clusters were found, especially when the estimation of zl,k was stable. In our subjective evaluation, however, SYNDIA appeared to show clearer patterns than SYN. There were many low-coverage languages, and due to inherent uncertainty, zl,k swung
between 0 and 1 during posterior inference when inter-language dependencies were ignored. As a result, geographical signals were obscured.

Figure 11 ((a) Feature 31A, "Sex-based and Non-sex-based Gender Systems"; (b) a linguistic parameter of SYNDIA)
Geographical distributions of a surface feature and SYNDIA's linguistic parameter of WALS. Each point denotes a language. For the linguistic parameter, lighter nodes indicate higher frequencies of zl,k = 1 among the 100 samples from the posterior.

The world map motivates us to look back at the cryptic weight matrix. The linguistic parameter visualized in Figure 11(b), for example, was tied to weight vector wk,∗, which gave the largest positive values to the following feature–value pairs: Feature 31A, "Sex-based and Non-sex-based Gender Systems": No gender; Feature 44A, "Gender Distinctions in Independent Personal Pronouns": No gender distinctions; Feature 30A, "Number of Genders": None; and Feature 32A, "Systems of Gender Assignment": No gender. We can determine that the main function of this linguistic parameter is to disable a system of grammatical gender. It is interesting that not all genderless languages turned on this linguistic parameter. The value No gender for Feature 31A is given a relatively large positive weight by another linguistic parameter, which was characterized by a tendency to avoid morphological marking. In this way, we can draw from latent parameters hypotheses of linguistic typology that are to be tested through a more rigorous analysis of data.

7.3 Applications

Although we have demonstrated in the previous section how we can directly examine the learned model for a typological inquiry, this approach has limitations. As we briefly discussed in Section 2.2, the search space must have numerous modes. A trivial way to prove this is to swap a pair of linguistic parameters together with the corresponding weight vector pair. The posterior probability remains the same. Although permutation invariance poses little problem in practice, arbitrariness involving marked–unmarked relations can be problematic. The model has a certain degree of flexibility that allows it to choose a marked form from a dichotomy it uncovers, and the rather arbitrary decision affects other linguistic parameters. The net outcome of unidentifiability is that even if the model provides a seemingly good explanation for an observed pattern, another run of the model may provide another explanation.

We suggest that instead of investing too much effort in interpreting induced linguistic parameters, we can use the model as a black box that connects two or more surface representations via the latent space. A schematic illustration of the proposed approach is shown in Figure 12. We first use the model to map a given language xl,∗ into its latent representation zl,∗, then manipulate zl,∗ in the latent space to obtain zl′,∗, the latent representation of a new language l′, and finally project zl′,∗ back into the original space to obtain its surface representation xl′,∗. By comparing xl,∗ with xl′,∗, we can identify how manipulation in the latent space affects multiple surface features in general.
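The following sketch illustrates this black-box workflow: start from an inferred latent vector zl,∗, toggle one linguistic parameter, project back, and compare the resulting surface representations. The argmax decoder and the random weights are illustrative assumptions; in practice the mapping in both directions is performed with the trained SYNDIA model.

```python
import numpy as np

def decode(z, W, feature_slices):
    """Project a binary latent vector back to surface features by taking, for
    each feature, the highest-scoring value under theta = z W (an assumed
    argmax decoder standing in for Equations (2)-(4))."""
    theta = z @ W
    return [int(np.argmax(theta[sl])) for sl in feature_slices]

def manipulate(z, W, feature_slices, k):
    """Toggle linguistic parameter k and report which surface features change."""
    x_before = decode(z, W, feature_slices)
    z_new = z.copy()
    z_new[k] = 1 - z_new[k]
    x_after = decode(z_new, W, feature_slices)
    changed = [n for n, (a, b) in enumerate(zip(x_before, x_after)) if a != b]
    return x_after, changed

# Toy example with random weights: 4 latent parameters, two 3-valued features.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
feature_slices = [slice(0, 3), slice(3, 6)]
z = np.array([1, 0, 1, 0])
print(manipulate(z, W, feature_slices, k=1))
```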
A useful property of this approach is that because both the surface and latent representations are discrete, linguistic parameters can readily be used as a substitute for surface features. This means that methods that have been successfully applied to surface features are applicable to linguistic parameters as well. Undoubtedly, the most promising method is Bayesian phylogenetic analysis (Dediu 2010; Greenhill et al. 2010; Dunn et al. 2011; Maurits and Griffiths 2014; Greenhill et al. 2017). For example, we can reconstruct an ancestral language from its descendants in the latent space. By comparing the ancestor and a descendant in the original space, we can investigate how languages change over time without being trapped into unnatural combinations of features, as was envisioned by Greenberg (1978). Unlike Dunn et al. (2011), who focused on the dependency between a pair of binary features, the proposed framework has the ability to uncover correlated evolution involving multiple features (Murawaki 2018).

Figure 12
A schematic illustration of latent representation-based analysis.

8. Conclusions

In this article, we reformulated the quest for Greenbergian universals with representation learning. To develop a concrete model that is robust with respect to a large number of missing values, we adopted the Bayesian learning framework. We also exploited inter-language dependencies to deal with low-coverage languages. Missing value imputation demonstrates that the proposed model successfully captures regularities of the data. We plotted languages on a world map to show that some latent variables are associated with clear geographical patterns. The source code is publicly available at https://github.com/murawaki/lattyp.

The most promising application of latent representation-based analysis is Bayesian phylogenetic methods. The proposed model can be used to uncover correlated evolution involving multiple features.

Acknowledgments
The key ideas and early experimental results were presented at the Eighth International Joint Conference on Natural Language Processing (Murawaki 2017). This article, which presents updated results, is a substantially extended version of the earlier conference paper. This work was partly supported by JSPS KAKENHI grant 18K18104.

References
Anderson, Stephen R. 2016. Synchronic versus diachronic explanation and the nature of the language faculty. Annual Review of Linguistics, 2:1–425.
Asgari, Ehsaneddin and Mohammad R. K. Mofrad. 2015. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLOS ONE, 10(11):1–15.
Baker, Mark C. 2001. The Atoms of Language: The Mind's Hidden Rules of Grammar. Basic Books.
Bender, Emily M. 2016. Linguistic typology in natural language processing. Linguistic Typology, 20(3):645–660.
Bengio, Yoshua, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.
Besag, Julian. 1974. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 38(2):192–236.
Bickel, Balthasar, Johanna Nichols, Taras Zakharko, Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler, Lennart Bierkandt, Fernando Zúñiga, and John B. Lowe. 2017. The AUTOTYP typological databases. Version 0.1.0.
Blasi, Damián E., Susanne Maria Michaelis, and Martin Haspelmath. 2017. Grammars are robustly transmitted even during the emergence of creole languages. Nature Human Behaviour, 1(10):723–729.
Boeckx, Cedric. 2014. What principles and parameters got wrong. In Picallo, M. Carme, editor, Treebanks: Building and Using Parsed Corpora, Oxford University Press, pages 155–178.
Bouckaert, Remco, Philippe Lemey, Michael Dunn, Simon J. Greenhill, Alexander V. Alekseyenko, Alexei J. Drummond, Russell D. Gray, Marc A. Suchard, and Quentin D. Atkinson. 2012. Mapping the origins and expansion of the Indo-European language family. Science, 337(6097):957–960.
Campbell, Lyle. 2006. Areal linguistics. In Encyclopedia of Language and Linguistics, Second Edition. Elsevier, pages 454–460.
Chang, Will, Chundra Cathcart, David Hall, and Andrew Garrett. 2015. Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis. Language, 91(1):194–244.
Chen, Xi, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems 29, pages 2172–2180.
Chomsky, Noam and Howard Lasnik. 1993. The theory of principles and parameters. In Joachim Jacobs, Arnim von Stechow, Wolfgang Sternefeld, and Theo Vennemann, editors, Syntax: An International Handbook of Contemporary Research, 1. De Gruyter, pages 506–569.
Croft, William. 2002. Typology and Universals. Cambridge University Press.
Cysouw, Michael. 2003. Against implicational universals. Linguistic Typology, 7(1):89–101.
Daumé III, Hal. 2009. Non-parametric Bayesian areal linguistics. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 593–601, Boulder, CO.
Daumé III, Hal and Lyle Campbell. 2007. A Bayesian model for discovering typological implications. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 65–72, Prague.
Dediu, Dan. 2010. A Bayesian phylogenetic approach to estimating the stability of linguistic features and the genetic biasing of tone. Proceedings of the Royal Society of London B: Biological Sciences, 278(1704):474–479.
Doyle, Gabriel, Klinton Bicknell, and Roger Levy. 2014. Nonparametric learning of phonological constraints in optimality theory. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1094–1103, Baltimore, MD.
Dryer, Matthew S. 1998. Why statistical universals are better than absolute universals. Chicago Linguistic Society, 33(2):123–145.
Dryer, Matthew S. 1989. Large linguistic areas and language sampling. Studies in Language, 13:257–292.
Dryer, Matthew S. 2011. The evidence for word order correlations: A response to Dunn, Greenhill, Levinson and Gray's paper in Nature. Linguistic Typology, 15:335–380.
Dunn, Michael, Simon J. Greenhill, Stephen C. Levinson, and Russell D. Gray. 2011. Evolved structure of language shows lineage-specific trends in word-order universals. Nature, 473(7345):79–82.
Enfield, Nicholas J. 2005. Areal linguistics and Mainland Southeast Asia. Annual Review of Anthropology, 34:181–206.
Gan, Zhe, Chunyuan Li, Changyou Chen, Yunchen Pu, Qinliang Su, and Lawrence Carin. 2017. Scalable Bayesian learning of recurrent neural networks for language modeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 321–331, Vancouver.
Goldwater, Sharon, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54.
Goldwater, Sharon and Tom Griffiths. 2007. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 744–751, Prague.
Görür, Dilan, Frank Jäkel, and Carl Edward Rasmussen. 2006. A choice model with infinitely many latent features. In Proceedings of the 23rd International Conference on Machine Learning, pages 361–368, Pittsburgh, PA.
Gray, Russell D. and Quentin D. Atkinson. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426(6965):435–439.
Greenberg, Joseph H. 1963. Some universals of grammar with particular reference to the order of meaningful elements. In Joseph H. Greenberg, editor, Universals of Language, MIT Press, pages 73–113.
Greenberg, Joseph H. 1978. Diachrony, synchrony and language universals. In Joseph H. Greenberg, Charles A. Ferguson, and Edith A. Moravesik, editors, Universals of Human Language, volume 1. Stanford University Press, pages 61–91.
Greenhill, Simon J., Quentin D. Atkinson, Andrew Meade, and Russel D. Gray. 2010. The shape and tempo of language evolution. Proceedings of the Royal Society B: Biological Sciences, 277(1693):2443–2450.
Greenhill, Simon J., Chieh-Hsi Wu, Xia Hua, Michael Dunn, Stephen C. Levinson, and Russell D. Gray. 2017. Evolutionary dynamics of language systems. Proceedings of the National Academy of Sciences, U.S.A., 114(42):E8822–E8829.
Griffiths, Thomas L. and Zoubin Ghahramani. 2011. The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185–1224.
Griffiths, Thomas L., Mark Steyvers, and Joshua B. Tenenbaum. 2007. Topics in semantic representation. Psychological Review, 114(2):211–244.
Haspelmath, Martin. 2008a. Alienable vs. inalienable possessive constructions. Handout, Leipzig Spring School on Linguistic Diversity.
Haspelmath, Martin. 2008b. Parametric versus functional explanations of syntactic universals. In Theresa Biberauer, editor, The Limits of Syntactic Variation, John Benjamins, pages 75–107.
Haspelmath, Martin, Matthew Dryer, David Gil, and Bernard Comrie, editors. 2005. The World Atlas of Language Structures. Oxford University Press.
Heine, Bernd and Tania Kuteva. 2007. The Genesis of Grammar: A Reconstruction. Oxford University Press.
Hinton, Geoffrey E. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.
Hinton, Geoffrey E. and Ruslan R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
Hornik, Kurt. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.
Itoh, Yoshiaki and Sumie Ueda. 2004. The Ising model for changes in word ordering rules in natural languages. Physica D: Nonlinear Phenomena, 198(3):333–339.
Jasra, Ajay, Chris C. Holmes, and David A. Stephens. 2005. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, 20(1):50–67.
Josse, Julie, Marie Chavent, Benoît Liquet, and François Husson. 2012. Handling missing values with regularized iterative multiple correspondence analysis. Journal of Classification, 29(1):91–116.
Knowles, David and Zoubin Ghahramani. 2007. Infinite sparse factor analysis and infinite independent components analysis. In Proceedings of the 7th International Conference on Independent Component Analysis and Signal Separation, pages 381–388, London.
LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
Liang, Faming. 2010. A double Metropolis–Hastings sampler for spatial models with intractable normalizing constants. Journal of Statistical Computation and Simulation, 80(9):1007–1022.
Maurits, Luke and Thomas L. Griffiths. 2014. Tracing the roots of syntax with Bayesian phylogenetics. Proceedings of the National Academy of Sciences, U.S.A., 111(37):13576–13581.
Meeds, Edward, Zoubin Ghahramani, Radford Neal, and Sam Roweis. 2007. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems, 19:977–984.
Møller, Jesper, Anthony N. Pettitt, R. Reeves, and Kasper K. Berthelsen. 2006. An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika, 93(2):451–458.
Mullahy, John. 1986. Specification and testing of some modified count data models. Journal of Econometrics, 33(3):341–365.
Murawaki, Yugo. 2015. Continuous space representations of linguistic typology and their application to phylogenetic inference. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 324–334, Denver, CO.
Murawaki, Yugo. 2016. Statistical modeling of creole genesis. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1329–1339, San Diego, CA.
Murawaki, Yugo. 2017. Diachrony-aware induction of binary latent representations from typological features. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 451–461, Taipei.
Murawaki, Yugo. 2018. Analyzing correlated evolution of multiple features using latent representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4371–4382, Brussels.
Murawaki, Yugo and Kenji Yamauchi. 2018. A statistical model for the joint inference of vertical stability and horizontal diffusibility of typological features. Journal of Language Evolution, 3(1):13–25.
Murray, Iain, Zoubin Ghahramani, and David J. C. MacKay. 2006. MCMC for doubly-intractable distributions. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 359–366, Cambridge, MA.
Naseem, Tahira, Regina Barzilay, and Amir Globerson. 2012. Selective sharing for multilingual dependency parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 629–637, Seogwipo.
Neal, Radford M. 2011. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin L. Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo, CRC Press, pages 113–162.
Nelson-Sathi, Shijulal, Johann-Mattis List, Hans Geisler, Heiner Fangerau, Russell D. Gray, William Martin, and Tal Dagan. 2011. Networks uncover hidden lexical borrowing in Indo-European language evolution. Proceedings of the Royal Society B: Biological Sciences, 278:1794–1803.
Nichols, Johanna. 1992. Linguistic Diversity in Space and Time. University of Chicago Press.
Nichols, Johanna. 1995. Diachronically stable structural features. In Henning Andersen, editor, Historical Linguistics 1993. Selected Papers from the 11th International Conference on Historical Linguistics, Los Angeles 16–20 August 1993. John Benjamins Publishing Company, pages 337–355.
O'Horan, Helen, Yevgeni Berzak, Ivan Vulic, Roi Reichart, and Anna Korhonen. 2016. Survey on the use of typological information in natural language processing. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1297–1308, Osaka.
Pagel, Mark and Andrew Meade. 2006. Bayesian analysis of correlated evolution of discrete characters by reversible-jump Markov chain Monte Carlo. American Naturalist, 167(6):808–825.
Parkvall, Mikael. 2008. Which parts of language are the most stable? STUF-Language Typology and Universals Sprachtypologie und Universalienforschung, 61(3):234–250.
Pereltsvaig, Asya and Martin W. Lewis. 2015. The Indo-European Controversy: Facts and Fallacies in Historical Linguistics. Cambridge University Press.
Prince, Alan and Paul Smolensky. 2008. Optimality Theory: Constraint Interaction in Generative Grammar. John Wiley & Sons.
Si, Yajuan and Jerome P. Reiter. 2013. Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys. Journal of Educational and Behavioral Statistics, 38(5):499–521.
Srebro, Nathan, Jason D. M. Rennie, and Tommi S. Jaakkola. 2005. Maximum-margin matrix factorization. In Proceedings of the 17th International Conference on Neural Information Processing Systems, pages 1329–1336, Vancouver.
Tan, Jie, John H. Hammond, Deborah A. Hogan, and Casey S. Greene. 2016. ADAGE-based integration of publicly available Pseudomonas aeruginosa gene expression data with denoising autoencoders illuminates microbe-host interactions. mSystems, 1(1):e00025–15.
Towner, Mary C., Mark N. Grote, Jay Venti, and Monique Borgerhoff Mulder. 2012. Cultural macroevolution on neighbor graphs: Vertical and horizontal transmission among western North American Indian societies. Human Nature, 23(3):283–305.
Trubetzkoy, Nikolai Sergeevich. 1928. Proposition 16. In Acts of the First International Congress of Linguists, pages 17–18.
Tsunoda, Tasaku, Sumie Ueda, and Yoshiaki Itoh. 1995. Adpositions in word-order typology. Linguistics, 33(4):741–762.
Welling, Max and Yee W. Teh. 2011. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, Bellevue, WA.
Wichmann, Søren and Eric W. Holman. 2009. Temporal Stability of Linguistic Typological Features. Lincom Europa.