The Emergence of Argument Structure in Artificial Languages
Tom Bosc
Mila
Université de Montréal, Canada
bosct@mila.quebec
Pascal Vincent
Meta AI, Mila
Université de Montréal, Canada
CIFAR AI Chair
vincentp@iro.umontreal.ca
Abstract
Computational approaches to the study of lan-
guage emergence can help us understand how
natural languages are shaped by cognitive and
sociocultural factors. Previous work focused
on tasks where agents refer to a single entity. In
contrast, we study how agents predicate, that
is, how they express that some relation holds
between several entities. We introduce a setup
where agents talk about a variable number of
entities that can be partially observed by the lis-
tener. In the presence of a least-effort pressure,
they tend to discuss only entities that are not
observed by the listener. Thus we can obtain
artificial phrases that denote a single entity, as
well as artificial sentences that denote several
entities. In natural languages, if we ignore the
verb, phrases are usually concatenated, either
in a specific order or by adding case mark-
ers to form sentences. Our setup allows us to
quantify how much this holds in emergent lan-
guages using a metric we call concatenability.
We also measure transitivity, which quantifies
the importance of word order. We demon-
strate the usefulness of this new setup and
metrics for studying factors that influence ar-
gument structure. We compare agents having
access to input representations structured into
pre-segmented objects with properties, versus
unstructured representations. Our results in-
dicate that the awareness of object structure
yields a more natural sentence organization.
How do languages emerge and evolve? Zipf
(1949) viewed language as the result of an opti-
mization procedure balancing information trans-
mission maximization and effort minimization.
This view is amenable to formalization and sim-
ulation. An early example is Hurford’s (1989)
comparison of language acquisition strategies, assuming
that communication success gives an evolutionary
advantage. More generally, subsequent
research uses optimization procedures and evolu-
tionary mechanisms to create and study artificial
languages (Steels, 1997; Lazaridou and Baroni,
2020).
Such approaches are mainly used with two
objectives in mind: Firstly, to improve natural lan-
guage processing methods; secondly, to help us
understand the roles of cognitive and sociocultural
factors in shaping languages, such as
our drive to cooperate, pragmatic reasoning, and
imitation (Tomasello, 2010).
In the deep learning era, language emergence
researchers have focused on the referential func-
tion of language, namely, how agents commu-
nicate about single objects, using artificial noun
phrases equivalent to ‘‘blue triangle’’ or ‘‘red
circle’’ (Lazaridou et al., 2017; Kottur et al.,
2017). In contrast, we propose to study the predication
function of language, that is, the expression
of relations between entities (events). How
do artificial agents express events such as ‘‘the
blue triangle is above the red circle’’?
We introduce an experimental setup for study-
ing predication. The speaker communicates about
an event involving a variable number of entities
that are in a certain relation. Then, the listener
tries to reconstruct this event. To simplify, the
relation is observed by both agents.
Crucially, the listener is given a partial observa-
tion of the event, ranging from nothing to all but
one entity. In the presence of shared context, it is
unnecessary for the speaker to communicate the
whole event, and a least-effort penalty encourages
parsimony. Thus we obtain utterances that refer
to single entities in isolation, like phrases, and
utterances about several entities, like sentences.
Using these artificial phrases and sentences, we
can compute various metrics to quantify compositionality
(Szabó, 2020) at the sentence level. A
simple sentence typically contains a few phrases
that refer to entities. These phrases can generally
be understood in isolation, a property sometimes
called context-independence (Bogin et al., 2018).
Transactions of the Association for Computational Linguistics, vol. 10, pp. 1375–1391, 2022. https://doi.org/10.1162/tacl_a_00524
Action Editor: Daniel Gildea. Submission batch: 3/2022; Revision batch: 6/2022; Published 12/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: Overview of experimental setup. (From left to right) Proto-role dataset contains annotations (18 features
and a role) for each argument and a relation (SELL.01 and WAKE.02, respectively, observed by both speaker
S and listener L). Preprocessing: From the 1st annotation, 3 datapoints are created, where the number of entities
observed by L varies (see L’s mask and Partial F.&R. columns). The 2nd annotation contains a single object
so a single datapoint is created. Training: S produces a message. L reads it, and the pair of agents S, L is
jointly trained to minimize the reconstruction error and the length of the message. As a result of the objective,
S only talks about the entities not observed by L. Analysis: Informally, concatenability measures how interchangeable
the concatenations of messages m12 = m1 ⊕ m2 and/or m21 = m2 ⊕ m1 are with the actually sent
message m∗; transitivity measures how much one order is preferred compared to the other across the dataset
(cf. Sections 5, 6).
Moreover, the sentence is the concatenation of
these phrases along with the verb. Correspond-
ingly, we introduce concatenability metrics that
should be large for natural languages. Further-
more, we propose transitivity metrics to quantify
the importance of word order. A high-level over-
view of our setup is shown in Figure 1.
This setup enables us to analyze artificial lan-
guages without segmenting messages into constit-
uents. Segmentation introduces another layer of
complexity, to the extent that in practice, it is
not done at all: It is implicitly assumed that each
symbol is independently meaningful. However,
this assumption is flawed, because if letters or
phonemes are assumed to bear meaning, no natural
language is compositional.
Previous work has highlighted the influence of
input representations and architectures for lan-
guage emergence. Inappropriate representations
completely hinder evolution of a non-trivial language
with more than 2 words (Lazaridou et al.,
2017) or prevent agents from solving the task
altogether (Guo et al., 2019). This suggests that
specific inductive biases are still lacking for arti-
ficial agents to develop languages like ours.
We posit that the perception of objects as wholes
with properties is an important inductive bias. To
be able to produce sentences containing referential
phrases, it seems that agents need to be able to
attend to the referents of these phrases reliably, to
conceive of them as bounded objects with intrinsic
properties in the first place.
We demonstrate the usefulness of our setup
and our metrics for testing this hypothesis. We
implement an object-centric inductive bias using
attention (Bahdanau et al., 2014) over representations
of objects. We compare it to an architecture
which disregards the structure of the input, con-
sidering it merely a large unstructured feature
vector. The object-centric architecture yields more
natural languages—they are more concatenable.
Moreover, word order matters more with this
architecture than with the baseline. These results
are corroborated by our quantitative analysis and
measures of generalization outside of the train-
ing data.
Our contributions are two-fold. Firstly, on the
methodological front, we propose and motivate
a novel task and two new metrics. This task not
only explains the emergence of compositionality
from a functional perspective, but also enables
us to easily analyze the learned language, avoid-
ing the problem of segmentation.
Secondly, we provide evidence that when rep-
resentations reflect the perception of objects as
wholes with properties, emergent languages are
more natural than when they do not. With this
finding we hope to foster the use of more cogni-
tively plausible input representations for explain-
ing language emergence.
1 Task
We design a task for studying how artificial
agents predicate. It is an instance of a recon-
struction task (Lazaridou et al., 2017), where one
agent, the speaker, observes an input and produces
a message, that is, a sequence of symbols. The
message is then read by another agent, the lis-
tener, who tries to reconstruct the input observed
by the speaker.
We train several pairs of agents and study the
messages produced by the speakers. This training
procedure models language evolution and lan-
guage acquisition at once, unlike frameworks like
iterated learning (Kirby and Hurford, 2002).
The main novelty of our task is that agents are
trained to communicate about a variable number
of entities. In this section, we explain how the
inputs of the agents are obtained by preprocessing
the proto-role dataset (Reisinger et al., 2015).
Then, we argue that our task is realistic, yet simple
enough to permit an easy analysis of the messages.
1.1 The Proto-role Dataset
The data that are fed to agents are based on the
proto-role dataset built by Reisinger et al. (2015).
This dataset was created to evaluate Dowty’s
(1991) linking theory, a theory that predicts how
verb-specific roles are mapped to grammatical
relations in English.
To illustrate the annotation scheme, we use the
example from Figure 1, ‘‘the company sold a
portion of secured notes’’.
Firstly, a relation is extracted. Here, the verb
‘‘sold’’ corresponds to the PropBank (Kingsbury
and Palmer, 2002) label SELL.01, which identi-
fies the verb and its particular sense.
There are $n_{obj} = 2$ arguments of the verb,
‘‘the company’’ and ‘‘a portion of secured
notes’’. Each of these arguments is annotated
with $n_{feat} = 18$ features indicating various properties
of the referred entity. For instance, the first
feature indicates whether the entity caused the
event to happen, the second feature whether the
entity chose to be involved in the event, and so
forth (Reisinger et al., 2015). In this work, the
meaning of these features is irrelevant. These features
are encoded on a Likert scale from 1 to
5 or take a non-applicable (NA) value. Since the
description of each entity is a small feature vec-
tor, many different noun phrases correspond to
the same feature vector. Thus ‘‘Technology
stocks’’ and ‘‘a portion of secured notes’’ in
Figure 1 denote the same entity.
Moreover, each argument is also assigned one
of six mutually exclusive classical θ-roles. In
the example, the arguments respectively have
the θ-roles AGENT and PATIENT.
We define an event as (i) a relation and (ii) a
set of pairs of feature vectors and θ-roles.
1.2 Task Description
For each event in the proto-role dataset, we gather
the relation, and for each entity, its 18 features
and its role. The features are rescaled
from {1, 2, 3, 4, 5} to {1, 2, 3}, and we only retain
the arguments in the 3 most frequent θ-roles
(AGENT, PATIENT, and a MISC category containing
instruments, benefactives, attributes).
The speaker observes the following quantities:
• the PropBank relation β,
• entity features $I^S \in \{\mathrm{NA}, 1, 2, 3\}^{n_{obj} \times n_{feat}}$,
• entity roles $r^S \in R^{n_{obj}} = \{\mathrm{AGENT}, \mathrm{PATIENT}, \mathrm{MISC}\}^{n_{obj}}$,
• the listener’s mask: $\alpha \in \{0, 1\}^{n_{obj}}$.
The tensors $I^S$, $r^S$, and $\alpha$ are indexed by an
integer between 1 and $n_{obj}$, so they represent a set
$E^S$ of $n_{obj}$ triplets where each triplet $(I^S_i, r^S_i, \alpha_i)$
characterizes the $i$-th entity.
The i-th entity is said to be hidden iff αi = 1.
Hidden entities are not observed by the listener,
and the mask α indicates this to the speaker. Since
the listener tries to reconstruct the inputs of the
speaker, the mask essentially flags the entities that
the speaker should communicate about. Thus, the
listener observes:
• the PropBank relation β,
• partial entity features $I^L[1 - \alpha]$,
• partial entity roles $r^L[1 - \alpha]$,
• the speaker’s message: $m \in M$.
Here, $u[v]$ denotes the tensor obtained by
restricting $u$ to the rows $i$ such that $v_i = 1$.
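The row-restriction notation can be illustrated with a small sketch. It assumes, as one reading of the notation, that the listener's partial view is obtained by restricting the speaker's inputs to the non-hidden rows; the function name `restrict` and the toy event (3 features per entity instead of 18) are hypothetical.

```python
def restrict(u, v):
    """Return the rows u[i] for which v[i] == 1 (the u[v] notation)."""
    return [row for row, keep in zip(u, v) if keep == 1]

# Toy event with 2 entities and 3 features each (the paper uses 18 features).
I_S = [["NA", 1, 3], [2, 2, 1]]
r_S = ["AGENT", "PATIENT"]
alpha = [0, 1]  # alpha_i = 1 means entity i is hidden from the listener

# Assumption: the listener sees exactly the rows where 1 - alpha_i == 1.
visible = [1 - a for a in alpha]
I_L = restrict(I_S, visible)
r_L = restrict(r_S, visible)
```

In this toy case the listener only observes the AGENT entity, so the speaker needs to convey the PATIENT.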
The message space $M$ is defined as follows.
Let $V = \{1, \dots, n_V, \mathrm{eos}\}$ be the vocabulary containing
$n_V$ symbols, plus an end-of-sentence
(eos) token. Let $n_L$ be the maximum message
length (here, set to $n_L = 8$). $M$ contains all the
sequences of elements of $V$ with at most length
$n_L$ and ending with eos.
A datapoint is valid if $\sum_{i=1}^{n_{obj}} \alpha_i \geq 1$, that is, at
least one object is hidden and some information
needs to be conveyed. From each event, we add
as many valid datapoints as possible to our data-
set. In our example, as there are 2 entities, either
one or both can be hidden, yielding 3 datapoints.
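As a rough sketch of this preprocessing step, the valid masks for an event can be enumerated as all binary vectors with at least one hidden entity. The helper name `valid_masks` is hypothetical and the released preprocessing code may proceed differently.

```python
from itertools import product

def valid_masks(n_obj):
    """All masks alpha in {0, 1}^n_obj with at least one hidden entity."""
    return [mask for mask in product((0, 1), repeat=n_obj) if sum(mask) >= 1]

# The example event has 2 entities: either one or both can be hidden.
masks = valid_masks(2)
```

More generally, an event with n entities yields 2^n - 1 datapoints under this enumeration.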
Given its inputs and the sender’s message, the
listener tries to reconstruct the sender’s inputs.
The agents are jointly trained to minimize a re-
construction loss while minimizing the number of
symbols exchanged, as formalized in Section 2.2.
1.3 Motivations
All the aspects of the task can have a major
influence on the learned languages. In this section,
we argue that our task is realistic in important
aspects.
Our task is to convey semantic annotations of
sentences, not words or sentences directly, because
using linguistic data as input could be a methodological
mistake. Indeed, language-specific typological
properties might leak into the artificial
languages.1 We follow this principle, except for
our use of θ-roles. They are linguistic abstrac-
tions over relation-specific (participant) roles.
This limitation is discussed in Section 8.2.
In our task, agents have to communicate about
a realistic, variable number of entities. We posit
that this is a crucial characteristic for argument
structure to be natural. Indeed, if humans only
ever talked about two entities at once, grammar
would be simpler since a transitive construction
could be used everywhere. In our dataset, the
distribution of the number of entities talked about
is directly derived from an English corpus, and,
to our knowledge, the distribution of the number
of arguments does not vary much across
languages. Thus we do not expect a bias
towards English typology. In Mordatch and
Abbeel’s (2018) and Bogin et al.’s (2018) works,
agents also need to predicate. However, the event
structure is unrealistic as it is identical across
datapoints: The number of arguments is constant
and each argument has the same ‘‘type’’
(a landmark, an agent, a color, etc.).
1 For example, if the task were to transmit basic color terms
instead of, say, color represented as RGB triplets, the choice
of a language with only 3 basic color terms vs 11 color terms
(as in English) would yield different artificial languages. For
one thing, transmitting English color terms would require
agents to use more symbols.
The relation β is observed by both agents. As
a consequence, we do not expect artificial sentences
to contain the equivalent of a verb. The
main reason is that it greatly simplifies the anal-
ysis of the artificial languages.
We define the context as everything that is
observed by both agents: the relation and the
non-hidden entities. We now justify why agents
share context, and why the loss function includes
a penalty to minimize the number of symbols
sent (cf. Section 2.2).
First, let us argue that this is realistic. The
context is a coarse approximation of the notion
of common ground. According to Clark (1996),
common ground encompasses the cultural background
of the interlocutors, their sensory perceptions,
and their conversational history. In theory,
the speaker only needs to communicate the infor-
mation that is not part of the common ground,
but transferring more information than needed
is not ruled out. However, in practice, humans
try to be concise (cf. Grice’s [1975] maxim of
quantity). The penalty that we use encourages
parsimony. It could be seen as the result of
a more general principle governing cooperative
social activities (Grice, 1975) or even the whole
of human behavior (Zipf, 1949).
To illustrate, consider the following situation.
Upon seeing a broken window, one would ask
‘‘who/what broke the window?’’. A knowledge-
able interlocutor would answer ‘‘John’’ or ‘‘John
did’’. In our setup, the speaker is this knowl-
edgeable person, answering such questions about
unobserved entities. The context contains the bro-
ken window, and the speaker does not need to
refer to it since (i) the listener observes it, and
since (ii) the speaker knows that the listener ob-
serves it (via the mask α). While the speaker
could still refer to the window, the least-effort
penalty makes it costly to do so, so the speaker
avoids it. Even if the agents do not engage
in dialogues but in one-time interactions, the
mask α can be interpreted as simulating an inference
made by the speaker about the listener’s
knowledge.
This setup is not only realistic, it is also es-
pecially useful for the purpose of analysing the
emergent languages. By masking all but one en-
tity, we obtain an artificial phrase that denotes
a single entity. By masking all but two entities,
we obtain an artificial sentence relating two en-
tities. The metrics that we introduce rely on our
abilities to obtain such phrases and sentences.
The concatenability metrics can be seen as mea-
sures of systematicity, namely, how the meaning
of phrases is related to meaning of sentences
(Szabó, 2020).
Without this setup, one would need to some-
how segment sentences into phrases. To our
knowledge, the problem has not been addressed
in the language emergence literature, but is iden-
tified by Baroni (2020). For instance, applied to
English corpora, metrics for quantifying compo-
sitionality like Chaabouni et al.’s (2020) disen-
tanglement metrics would tell us that English is
not compositional, since single letters are not
meaningful.
2 Model and Objective
We present two Transformer-based (Vaswani
et al., 2017) variants of the model of the agents:
one that encodes an object-centric bias and one
that does not. Before delving into their differences,
let us describe their common features.
2.1 General Architecture
Both the speaker S and the listener L are imple-
mented as Transformers, each of them built out of
an encoder Tfme and a decoder Tfmd.
The inputs of the speaker are encoded into a
real-valued matrix V S, which differs in the two
variants of the model. For now, assume that V S
encodes the speaker’s inputs and similarly, that
V L encodes the listener’s inputs.
The speaker produces a message $m$ by first
encoding its input into
$$H = \mathrm{Tfm}^S_e(V^S), \quad (1)$$
then auto-regressively decoding the message
$$m_{t+1} \sim q(m_{t+1} \mid m_{1:t}, I^S, \alpha, \beta) = \mathrm{Tfm}^S_d(M_{1:t}, H)_t$$
with $M_{1:t}$ the sum of positional and value
embeddings of the previously decoded symbols
$m_{1:t}$.
At train time, the symbol is randomly sampled
according to q, whereas at test time, the most
likely symbol is picked greedily. If the maximum
length nL is reached, eos is appended to the
message and generation stops. Else, the genera-
tion process stops when eos is produced. In order
to backpropagate through the discrete sampling,
we use the Straight-Through Gumbel estimator
(Jang et al., 2016; Maddison et al., 2016).
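The generation loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: `next_symbol_probs` is a hypothetical stand-in for the speaker's decoder, `n_L` is taken to count non-eos symbols, and the Straight-Through Gumbel estimator used for backpropagation is left out entirely.

```python
import random

EOS = "eos"

def generate_message(next_symbol_probs, n_L=8, greedy=False, rng=None):
    """Generate symbols until eos is produced or n_L symbols were emitted.

    next_symbol_probs(prefix) is a hypothetical decoder stand-in returning
    a dict that maps each candidate symbol to its probability.
    """
    rng = rng or random.Random(0)
    message = []
    while len(message) < n_L:
        probs = next_symbol_probs(message)
        symbols = list(probs)
        if greedy:
            # Test time: pick the most likely symbol.
            symbol = max(symbols, key=lambda s: probs[s])
        else:
            # Train time: sample according to q.
            weights = [probs[s] for s in symbols]
            symbol = rng.choices(symbols, weights=weights, k=1)[0]
        if symbol == EOS:
            break
        message.append(symbol)
    return message + [EOS]  # eos always closes the message

def toy_probs(prefix):
    # Toy distribution: prefers symbol 1 for two steps, then prefers eos.
    if len(prefix) < 2:
        return {1: 0.6, 2: 0.3, EOS: 0.1}
    return {1: 0.1, 2: 0.1, EOS: 0.8}

greedy_message = generate_message(toy_probs, greedy=True)
sampled_message = generate_message(toy_probs, rng=random.Random(1))
```

Whether eos counts towards the maximum length is not settled by the text, so the length convention here should be read as one possible choice.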
L also embeds the message $m$ into a matrix
$M'$, and its decoder produces a matrix $O^L$:
$$H' = \mathrm{Tfm}^L_e(M'), \qquad O^L = \mathrm{Tfm}^L_d(V^L, H'). \quad (2)$$
$O^L$ is then used to predict the presence of the
objects as well as all the features of the objects.
This computation is slightly different depending
on the variant of the models and is detailed below.
Note that $\mathrm{Tfm}^S_d$ is invariant with respect to the
order of the objects in $V^S$, since we do not use
positional embeddings to create $V^S$, but rather
use the role information directly, as will be explained
for each model separately.2 On the other
hand, the message $m$ is embedded using both
token and positional embeddings in $M$ and $M'$,
so $\mathrm{Tfm}^S_d$ and $\mathrm{Tfm}^L_e$ are sensitive to word order.
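The order-invariance argument can be checked numerically on a toy attention layer. The sketch below is not the Transformer used in the paper: it assumes a single head with identity projections and no positional embeddings, and the names `attend`, `permute`, and `close` are hypothetical helpers.

```python
import math

def attend(queries, memory):
    """Dot-product attention of each query over the rows of memory.

    Single head and identity projections are simplifying assumptions.
    """
    d = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in memory]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * k[col] for w, k in zip(weights, memory)) / total
            for col in range(d)
        ])
    return outputs

def permute(rows, order):
    return [rows[i] for i in order]

def close(A, B, tol=1e-9):
    return all(abs(a - b) <= tol for ra, rb in zip(A, B) for a, b in zip(ra, rb))

X = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]  # three toy object vectors
Q = [[0.3, -0.2]]                           # one toy decoder query
order = [2, 0, 1]

# Self-attention without positions: permuting the rows permutes the outputs.
equivariant = close(attend(permute(X, order), permute(X, order)),
                    permute(attend(X, X), order))
# Cross-attention: permuting the attended rows leaves the output unchanged.
invariant = close(attend(Q, permute(X, order)), attend(Q, X))
```

Both checks hold up to floating-point tolerance, which matches the equivariance and invariance properties stated for the encoder and for the decoder's second argument.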
2.2 Loss
The loss function is a sum of two terms, a
reconstruction loss and a length penalty.
Reconstruction Loss: The input to reconstruct
is a set, the set of pairs of 18 features and a θ-role.
For each θ-role, we predict the corresponding
features as well as whether an object $i$ in this role
is present or not, denoted by $\gamma_i$.
For a given data point indexed by $j$, the reconstruction
loss is the sum over all objects $i$
$$l_j = \sum_i -\left[\log p(I^S_i \mid I^L, m, \beta) + \log p(\gamma_i \mid I^L, m, \beta)\right].$$
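As a small worked example of this sum, the sketch below takes, for each object, the probability the listener assigns to the true feature vector and to the true presence flag, and adds up the negative log-probabilities. The function name and the toy probabilities are illustrative assumptions, not values from the experiments.

```python
import math

def reconstruction_loss(feature_probs, presence_probs):
    """l_j = sum_i -[log p(I_i | ...) + log p(gamma_i | ...)].

    Each list holds, per object i, the probability assigned to the true value.
    """
    return sum(
        -(math.log(p_feat) + math.log(p_pres))
        for p_feat, p_pres in zip(feature_probs, presence_probs)
    )

# Two toy objects: -(log 0.5 + log 1.0) - (log 0.25 + log 0.5) = log 16.
loss = reconstruction_loss([0.5, 0.25], [1.0, 0.5])
```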
2 When used without positional embeddings, the encoder
of the Transformer is permutation-equivariant, i.e., for any
permutation matrix $P$, $\mathrm{Tfm}_e(PX) = P\,\mathrm{Tfm}_e(X)$; similarly,
the decoder is permutation-invariant in its second argument
(the encoded matrix $H$), i.e., $\mathrm{Tfm}_d(PX) = \mathrm{Tfm}_d(X)$. Permutations
are applied to the input matrices, the masks, and
the role vectors.
Length Penalty: As done by Chaabouni et al.
(2019), we penalize long messages. This can be
seen as an implementation of Zipf’s (1949) least-
effort principle. In its simplest form, the penalty
is a term $p_j = \lambda |m_j|$ where $\lambda$ is a hyperparameter,
and $|m_j|$ is the number of symbols in
the message.
However, we noticed that messages collapse
to empty messages early on during training. This
is similar to the well-known posterior collapse,
where the approximate posteriors of latents of
sequence-to-sequence VAEs collapse to their
priors (Bowman et al., 2016). We fix the issue
by adapting two well-known tricks: Pelsmaeker
and Aziz’s (2019) minimum desired rate and
Kingma et al.’s (2016) free bits. The penalty term
becomes
$$p_j = \mathbb{1}_{l_j < \tau}\,\mathbb{1}_{|m_j| > n_{min}}\,(\lambda |m_j|),$$
where $\mathbb{1}$ is the indicator function.
For this term to be non-zero, two conditions
need to be fulfilled. Firstly, the reconstruction
error must be below τ , which is analogous to
a minimum desired rate. This threshold can be
set without difficulty to a fraction of the re-
construction error incurred by the listener seeing
empty messages. In our case, this average error
is 18.6. We randomly choose the threshold in
{5, +∞} across runs, where +∞ essentially dis-
ables the trick.
Secondly, the penalty is above 0 only if the
message contains more than $n_{min}$ symbols. This
gives models $n_{min}$ ‘‘free’’ symbols for each
datapoint. Without this factor, we found that
speakers often utter empty messages (in particular,
when a single entity is hidden).
For a given data point indexed by j, the to-
tal loss to minimize is the sum lj + pj. During
training, the average is taken over a mini-batch
(n = 128), while during evaluation, it is taken
over the entire test split.
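The gated penalty and the per-datapoint loss can be sketched as follows. The function names are hypothetical, the default values are simply picked from the hyperparameter grids reported in the paper, and whether the eos token counts towards the message length is left open by the text.

```python
import math

def length_penalty(l_j, msg_len, lam, tau, n_min):
    """p_j = 1[l_j < tau] * 1[|m_j| > n_min] * lam * |m_j|."""
    if l_j < tau and msg_len > n_min:
        return lam * msg_len
    return 0.0

def total_loss(l_j, msg_len, lam=1.0, tau=5.0, n_min=1):
    """Per-datapoint loss l_j + p_j."""
    return l_j + length_penalty(l_j, msg_len, lam, tau, n_min)

penalized = total_loss(3.0, 4)                     # both conditions hold
too_inaccurate = total_loss(6.0, 4)                # l_j >= tau: no penalty
free_symbols = total_loss(3.0, 1)                  # |m_j| <= n_min: no penalty
trick_disabled = total_loss(6.0, 4, tau=math.inf)  # penalty always active
```

With the threshold set to infinity, the first indicator is always 1, which is the sense in which that setting disables the minimum-desired-rate trick.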
2.3 On the Perception of Objects
We demonstrate our setup and metrics by comparing
a model which is object-centric (OC), that
is, aware of objects as wholes with properties, to
a baseline model (flat attention, or FA), which
ignores the structure of the inputs.
We follow Gentner (1982), who argued that
perception of objects must be a strong, prelin-
guistic cognitive bias. She gives the example of
a bottle floating into a cave. She considers a
hypothetical language in which the bottle and the
mouth of the cave are construed as a single en-
tity, and argues that this language would be very
implausible. Across languages, the two entities
seem to always be referred to by separate phrases,
hinting at universals in the perception of objects.
More evidence is provided by Xu and Carey
(1996). They showed that infants use spatio-temporal
cues to individuate objects, that is,
to ‘‘establish the boundaries of objects’’. Only
around the start of language acquisition do children
start to rely on the properties or kinds of
objects to individuate. But could it be exposure
to language that drives infants to perceive the
properties and kinds of objects? Mendes et al.’s
(2008) experiments on apes suggest it is the other
way around, namely, that linguistic input is not
necessary to learn to individuate based on property
differences. Thus our hypothesis is that the
perception of objects as wholes is a prerequisite
for natural language to develop.
To implement the OC bias and the FA baseline,
we process the inputs in two ways and
obtain different $V^L$ and $V^S$ to plug into Equations
1 and 2. Embedding the matrices $I^L$ and
$I^S$ gives us real-valued 3-dimensional tensors.
But since Transformers consume matrices, we
need to reduce the dimensionality of $I^S$ and
$I^L$ by one dimension. It is this dimensionality-reduction
step that encodes the inductive biases
of OC and FA. We tried to minimize the differences
between the two models. Figure 2 shows
an overview of their differences.
2.3.1 Object-centric Variant
Let $I$ be either $I^S$ or $I^L$, where each row $I_i$
represents an object. Each $I_i$ is embedded using
a learned embedding matrix $\mathrm{Val}_j$ for each feature
$j$, and the result is concatenated, yielding
$$E_i = [\mathrm{Val}_1(I_{i,1})^T; \dots; \mathrm{Val}_{n_{feat}}(I_{i,n_{feat}})^T].$$
Then, this vector is transformed using a linear
function, followed by a ReLU (Nair and Hinton,
2010) and layer normalization (Ba et al., 2016).
We obtain $V^{(0)}$, a real-valued $n_{obj} \times d$ matrix with
$$V^{(0)}_i = \mathrm{LN}(\max(W E_i + b, 0)). \quad (3)$$
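Equation 3 can be traced on toy data with a dependency-free sketch. The sizes, the random initialization, and the names `embed_object` and `layer_norm` are assumptions made for illustration; in particular, the layer normalization here omits the learnable gain and bias that a real implementation would usually include.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def embed_object(features, val_tables, W, b):
    """V_i^(0) = LN(max(W E_i + b, 0)), E_i being the concatenated embeddings."""
    # One embedding table per feature j, shared across objects.
    E = [x for j, value in enumerate(features) for x in val_tables[j][value]]
    hidden = [max(sum(w * e for w, e in zip(row, E)) + bias, 0.0)
              for row, bias in zip(W, b)]
    return layer_norm(hidden)

rng = random.Random(0)
n_feat, emb, d = 3, 2, 4  # toy sizes; the paper uses n_feat = 18
values = ["NA", 1, 2, 3]
val_tables = [{v: [rng.gauss(0, 1) for _ in range(emb)] for v in values}
              for _ in range(n_feat)]
W = [[rng.gauss(0, 1) for _ in range(n_feat * emb)] for _ in range(d)]
b = [0.0] * d

# Two toy objects yield an n_obj x d matrix, one row per object.
V0 = [embed_object(obj, val_tables, W, b) for obj in [["NA", 1, 3], [2, 2, 1]]]
```

The key point is that each object ends up as a single row, so later attention layers assign one weight per object.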
Figure 2: Comparison of flat-attention (FA) and object-centric (OC) variants. The discrete-valued matrices $I^L$
and $I^S$ (upper-left) encode the features of entities. FA turns each datapoint into a $(n_{obj} \cdot n_{feat}) \times d$ continuous-valued
matrix (with $n_{obj} \cdot n_{feat}$ attention weights), while OC produces a $n_{obj} \times d$ continuous-valued matrix
(with $n_{obj}$ attention weights). Numbers index embedding matrices and show weight-sharing. The role information
is encoded afterwards and similarly for masking (not shown here).
As for hidden objects and padding objects,
they are represented using a single embedding
$V^{(0)}_i = v_h$ directly. A role embedding is added to
this representation to obtain
$$V^{(1)}_i = V^{(0)}_i + \mathrm{Role}(r^S_i).$$
Finally, $V$ is a $(n_{obj} + 1) \times d$ matrix, where
$d$ is the size of embedding vectors. $V$ is $V^{(1)}$
with an additional row vector, the β relation
embedding.
The listener cannot distinguish between hid-
den and padding objects, so the message should
encode the roles along with the entities’ features.
In order to reconstruct the speaker’s inputs,
the listener linearly transforms each row vector
$O^L_i$ (except the one corresponding to the relation)
to produce $p(I^S_i \mid I^L, m, \beta)$, the joint pmf
over the discrete features of object $i$, as well as
$p(\gamma_i \mid I^L, m, \beta)$.
2.3.2 Flat Attention Variant
In FA, the structure of the input—composed of
different objects with aligned features—is disre-
garded. Firstly, the input matrices I S and I L,
where each row corresponds to a single object,
are ‘‘flattened’’. Secondly, there is one attention
weight per feature and object pair, instead of a
single weight per object as in the OC variant.
Finally, each embedding matrix is specific to a
role and feature pair, instead of being specific to
a feature.
Formally, let $k$ be the index of a pair of object
indexed by $i$ and feature indexed by $j$. Using a
$k$-specific embedding matrix, we obtain
$$V^{(0)}_k = \mathrm{Val}_k(I_{i,j}),$$
with $V^{(0)}$ a real-valued $(n_{obj} \cdot n_{feat}) \times d$ matrix.
Again, hidden and padding objects are represented
by a special vector $V^{(0)}_k = v_h$. An index embedding
is added, similar to the role embedding:
$$V^{(1)}_k = V^{(0)}_k + \mathrm{Idx}(k).$$
As in the OC variant, we obtain $V$ by adding an
embedding of the relation β as a row to $V^{(1)}$.
To reconstruct the speaker’s inputs, $O^L$ is
linearly transformed and to each output vector
corresponds a specific feature of a specific object.
To predict $\gamma_i$, all the output vectors in
$O^L$ corresponding to the $i$-th object are mean-
and average-pooled, concatenated and linearly
transformed.
3 General Experimental Setup
In the next sections, we review various properties
of natural languages, and introduce metrics to
quantify these in artificial languages and compare
the effect of using OC versus FA on these metrics.
The training set contains 60% of the data, the
validation set 10%, and the test set the rest. We
denote the entire data set by $D$ and denote by $D_k$
the subsets of $D$ composed of examples for which
$\sum_i \alpha_i = k$, that is, the examples where $k$ objects
are hidden.
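The subsets $D_k$ amount to grouping datapoints by the number of hidden entities. A minimal sketch, assuming each datapoint is a dictionary with an `alpha` mask (a hypothetical layout, not necessarily the one used in the released code):

```python
from collections import defaultdict

def split_by_hidden_count(dataset):
    """Group datapoints into D_k, where k = sum_i alpha_i."""
    subsets = defaultdict(list)
    for datapoint in dataset:
        subsets[sum(datapoint["alpha"])].append(datapoint)
    return dict(subsets)

# Toy dataset; the ids are placeholders.
D = [
    {"id": "a", "alpha": [0, 1]},
    {"id": "b", "alpha": [1, 1]},
    {"id": "c", "alpha": [1, 0, 0]},
]
D_k = split_by_hidden_count(D)
```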
All the experiments use the EGG framework
(Kharitonov et al., 2021) based on the PyTorch
library (Paszke et al., 2019).3 The neural agents
are trained using Adam (Kingma and Ba, 2014).
There is a large number of hyperparameters so
we resort to random search (Bergstra and Bengio,
2012).4 Our hyperparameter search is deliberately
broad since we do not know a priori which hy-
perparameter choices are realistic. We expect to
obtain results with high variance, but a major advantage
is that we get more robust conclusions by
averaging over unknowns.
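A random search over the grids listed in footnote 4 could look like the sketch below. The dictionary keys and the function `sample_config` are hypothetical names; only the candidate values are taken from the footnote, and the actual search may cover further settings.

```python
import math
import random

# Candidate values as listed in footnote 4; key names are illustrative.
SEARCH_SPACE = {
    "n_layers": [1, 2, 3],
    "dim": [200, 400],
    "dropout": [0.1, 0.2, 0.3],
    "gumbel_temperature": [0.9, 1.0, 1.5],
    "lambda": [0.1, 0.3, 1, 3, 10],
    "n_min": [1, 2],
    "tau": [5, math.inf],
    "adam_beta2": [0.9, 0.99, 0.999],
}

def sample_config(rng):
    """Draw each hyperparameter uniformly and independently from its grid."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}

# One independent draw per run, seeded for reproducibility.
configs = [sample_config(random.Random(seed)) for seed in range(5)]
```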
We perform linear regressions to predict the
value of each metric given a binary variable indi-
cating whether OC is used. When the coefficient
for this variable is significantly different from 0
according to a t-test, then OC has a significant
effect.5 Additionally, we consider that the entropy
of the messages is a mediator that we control for.
For instance, the reconstruction error is indirectly
influenced by the vocabulary size and the sampling
temperature via the entropy. However, if
we observe that OC improves the generalization
error, we want to exclude the possibility that this
is because OC agents send messages with higher
entropy than FA agents, since it should be trivial
to also increase the entropy of the FA models by
modifying hyperparameters.
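To make the role of the binary indicator concrete, here is a simplified least-squares sketch. It deliberately omits two parts of the analysis described above, the entropy control variable and the HC3-based t-test, and the metric values are made up purely for illustration. With a single binary regressor, the fitted coefficient equals the difference between the mean metric of OC runs and that of FA runs.

```python
def ols_binary(metric, uses_oc):
    """Fit metric = a + b * uses_oc by least squares and return (a, b)."""
    n = len(metric)
    mean_x = sum(uses_oc) / n
    mean_y = sum(metric) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(uses_oc, metric))
    sxx = sum((x - mean_x) ** 2 for x in uses_oc)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up metric values for six runs (three FA, three OC), illustration only.
metric = [16.0, 17.0, 15.0, 14.0, 13.0, 15.0]
uses_oc = [0, 0, 0, 1, 1, 1]
intercept, slope = ols_binary(metric, uses_oc)
```

Here the intercept is the FA mean and the slope is the OC mean minus the FA mean; testing whether the slope differs from 0 is what the t-test on the coefficient addresses.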
We discard models with messages of average
length below 1 or above 6. Indeed, when the
average length is too small, many messages are
empty, and when it is too long, artificial sentences
are barely or not longer than artificial phrases.
These cases are considered a priori unnatural.
This leaves us with 100 out of 136 runs.
3The proto-role dataset
is available here: http://
decomp.io/projects/semantic-proto-roles/.
The code (including on-the-fly preprocessing of the data-
set) is available at https://github.com/tombosc
/EGG_f/tree/r1/egg/zoo/vd_reco.
4 Hyperparameters
(uniformly sampled): # Trans-
former layers ∈ {1, 2, 3}, and dimensions ∈ {200, 400},
dropout ∈ {0.1, 0.2, 0.3}, Gumbel-Softmax temperature
∈ {0.9, 1.0, 1.5}, λ ∈ {0.1, 0.3, 1, 3, 10}, nmin ∈ {1, 2},
τ ∈ {5, +∞}. Adam’s parameters: β1 = 0.9, β2 ∈
{0.9, 0.99, 0.999}.
5We manipulate data using the pandas package (The
Pandas Development Team, 2021; McKinney, 2010), and
perform linear regression with the statsmodels package
(Seabold and Perktold, 2010). We use HC3 covariance es-
timation to deal with heteroskedasticity (MacKinnon and
White, 1985; Long and Ervin, 2000).
       Arch.   1 hidden     2 hidden       3 hidden
iD     FA      6.5 ± 1.6    16 ± 3.6       28 ± 5.4
iD     OC      6.2 ± 1.9    14 ± 3.7***    25 ± 5.6**
OoD    FA      8.9 ± 2.1    24 ± 3.9       41 ± 5.5
OoD    OC      8.3 ± 2.4    21 ± 4.6**     39 ± 5.9
Table 1: Mean and stdev of test reconstruction
loss, in distribution (iD) and out of distribution (OoD).
Rows: models; columns: # of hidden entities. OC agents
generalize better than FA agents. (*: p-value <
0.05, **: p-value < 0.01, ***: p-value < 0.001).
Note that the length penalty works as expected.
Without the penalty, the messages all contain the
maximum number of symbols. With the penalty,
the average message length grows as the speaker
needs to send more and more information (on D1:
4.19, D2: 5.24, D3: 5.89).
4 Generalization Performance
Natural languages are often said to be productive
and systematic: There is an infinity of utter-
ances which we can understand without having
encountered them before (productivity), in par-
ticular when we understand constituents of the
novel sentence in other contexts (systematicity)
(Szab´o, 2020). Do emergent languages exhibit
these characteristics? In this section, we study
such generalization abilities. We measure how
well the listener can reconstruct the inputs when
the sender communicates about datapoints unseen
at train time.
Firstly, we test our models in distribution.
Secondly, we test our models out of distribution
(OoD), following Lazaridou et al. (2018). We
compute the empirical marginal distributions over
the number of hidden entities, the entities, the
roles, and the relations. Then, the OoD test set is
sampled from these marginals independently.
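The construction of the OoD test set can be sketched as follows. This is a simplified illustration, assuming each datapoint is a flat tuple of fields (number of hidden entities, entity, role, relation); the function name `sample_ood` and the toy data are hypothetical. Drawing a random element of a column is the same as sampling from that field's empirical marginal.

```python
import random


def sample_ood(dataset, n_samples, seed=0):
    """Sample each field independently from its empirical marginal distribution."""
    rng = random.Random(seed)
    columns = list(zip(*dataset))  # one tuple of observed values per field
    return [tuple(rng.choice(col) for col in columns) for _ in range(n_samples)]


# Toy training set: (number of hidden entities, entity, role, relation).
train = [
    (1, "u1", "AGENT", "r1"),
    (2, "u2", "PATIENT", "r2"),
    (2, "u3", "MISC", "r1"),
    (3, "u1", "PATIENT", "r3"),
]

ood = sample_ood(train, n_samples=6, seed=0)
for datapoint in ood:
    print(datapoint)
```

Because the fields are sampled independently, the resulting datapoints can combine values that never co-occur in the training data, which is what makes the test set out of distribution.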
We measure the reconstruction losses on sub-
sets where 1, 2, and 3 entities are hidden for a
finer-grained analysis.
Results: Table 1 contains the results. As ex-
pected, performance degrades when we evaluate
out of distribution. More interestingly, OC mod-
els perform better than FA models both in distri-
bution and out of distribution.
However, the performance difference between
OC and FA does not tell us much: Both OC
and FA agents could exchange messages that are
structured in very unnatural manners. In the next
two sections, we introduce metrics to shed light
on how the information about different entities
is packaged into a single message.
5 Concatenability
In natural languages, the verb encodes the relation
while arguments refer to entities, but roles do not
have direct equivalents in all languages. They are
encoded using three strategies, typically using a
mix of strategies within a single language.
In analytic languages like English or Chinese,
roles are encoded in word order and possibly
using adpositions or clitics, but the role does not
change the form of the arguments. For example,
in sentences (1a) and (1b), the arguments are
identical but are placed in a reverse order, so that
their roles are inverted, too:
(1) a. The old lady walks the dog.
b. The dog walks the old lady.
In more synthetic languages like Russian or Turk-
ish, case markings code for the roles. In Russian,
these markings are suffixes on nouns, adjectives,
and so forth, as can be seen in (2a) and (2b):
(2)
Finally, in polysynthetic languages (Caucasian
languages, Samoan languages, etc.), arguments
typically look like those in analytic languages,
but the roles are encoded using markers on the
verb.6 Since, in this work, relations are not com-
municated by agents, there is no artificial equiv-
alent of the verb. Therefore, this strategy cannot
emerge and we consider it no further.
Crucially, simple sentences are obtained by
concatenating a verb and one or several noun
phrases that refer to entities, whether roles are
signaled by word order or by case marking.
6This presentation is extremely simplified; see, for example,
Bakker and Siewierska (2009) for why and how these
three strategies generally coexist within a single language.
For a single event, by varying what informa-
tion is available to the listener through the mask
α, we get messages describing each of two entities in
isolation (phrases) as well as messages describ-
ing two entities at once (sentences). For exam-
ple, consider (I S, (1, 1, 0), rS, β) drawn from
D2, the subset of the data with two hidden ob-
jects. Let g be the function that transforms this
speaker’s inputs into a message via greedy de-
coding, and define
m∗ = g(I S, (1, 1, 0), rS, β).
We obtain the messages sent when L observes
the first or the second object in isolation as
m1 = g(I S, (1, 0, 0), rS, β),
m2 = g(I S, (0, 1, 0), rS, β).
We define the concatenated messages to be
m12 = m1 ⊕ m2 and m21 = m2 ⊕ m1, where
⊕ is the concatenation operator. This is shown
in Figure 1. We define P2 as the empirical dis-
tribution on the subset of D2 such that neither
m1 nor m2 is an empty message, implying that
m12 ≠ m21.
As argued above, in natural languages, m12
or m21 (or both, if word order is irrelevant)
should convey information at least as well as m∗.
Denote by l(m) the reconstruction loss incurred
by L if L had received the message m, that is,
l(m) = − log p(I S|I L, m, β). Then, concaten-
ability from the listener’s point of view is de-
fined as
\[ C^L = \mathbb{E}_{P_2}\big[\, l(m^*) - \min\big(l(m_{12}),\, l(m_{21})\big) \big]. \]
When CL is close to 0, on average, one of the two
concatenated messages (or both) is as informative
as the message actually uttered by the speaker for
reconstructing the inputs.
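As a rough illustration of this definition, the sketch below averages the gap between the loss of the message actually sent and the loss of the better of the two concatenations. It assumes messages are tuples of symbols and that the listener's loss is available as a callable `loss(example_id, message)`; both this interface and the toy loss values are hypothetical.

```python
def concatenability_listener(examples, loss):
    """Average of l(m*) - min(l(m12), l(m21)) over examples with non-empty phrases.

    examples: list of (m_star, m1, m2), each message a tuple of symbols.
    loss: callable (example_id, message) -> reconstruction loss in nats.
    """
    gaps = []
    for example_id, (m_star, m1, m2) in enumerate(examples):
        if not m1 or not m2:
            continue  # P2 keeps only examples where neither phrase is empty
        l_star = loss(example_id, m_star)
        l_best = min(loss(example_id, m1 + m2), loss(example_id, m2 + m1))
        gaps.append(l_star - l_best)
    if not gaps:
        raise ValueError("no example with two non-empty phrases")
    return sum(gaps) / len(gaps)


# Made-up losses standing in for a trained listener.
toy_losses = {
    (0, (3, 7, 5)): 1.0,  # m*, the message actually sent
    (0, (3, 5)): 1.5,     # m12 = m1 + m2
    (0, (5, 3)): 4.0,     # m21 = m2 + m1
}
examples = [
    ((3, 7, 5), (3,), (5,)),
    ((2,), (), (2,)),  # skipped: the first phrase is empty
]

c_l = concatenability_listener(examples, lambda i, m: toy_losses[(i, m)])
print(c_l)
```

In this toy case the best concatenation costs the listener 0.5 nats more than the original message, so the metric is negative.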
L can correctly reconstruct S’s inputs from a
concatenated message that S is unlikely to utter.
Conversely, a concatenated utterance can be highly
likely for S even if L might fail to reconstruct
S’s input from it. Therefore, there are actually
two symmetrical measures of concatenability, one
from the point of view of S and the other from
the point of view of L. A similar proposition was
made by Lowe et al. (2019) in the context of in-
teractive games. They have shown the usefulness
of distinguishing these two points of view.
The metric is defined similarly on the speak-
er’s side with a slight subtlety. Since sampled
messages have a maximum message length of
nL, the probability of a sequence longer than
nL is 0. However, concatenated messages are
sometimes longer than nL. We define q∞ as
the distribution generated by S without the con-
straint that probable sequences have length below
nL. We denote the conditional log-probability
of a message given a certain input by u(m) =
log q∞(m|I S, α, β). Then, concatenability from
the speaker’s point of view is defined as
\[ C^S = \mathbb{E}_{P_2}\big[ \max\big(u(m_{12}),\, u(m_{21})\big) - u(m^*) \big]. \]
It is close to 0 when, on average, one concate-
nation of the two messages (or both) has roughly
the same probability as the actual message.
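The role of q∞ can be illustrated with a toy autoregressive speaker. The sketch below scores a message by summing next-symbol log-probabilities, including the end-of-sequence symbol, without any cap on length, so that concatenated messages of any length receive a non-zero probability. The bigram table is a stand-in assumption for the actual speaker, which conditions on its inputs, and all names are hypothetical.

```python
import math

BOS, EOS = "<bos>", "<eos>"

# Toy next-symbol distributions, indexed by the previous symbol.
toy_speaker = {
    BOS: {"lady": 0.6, "dog": 0.4},
    "lady": {"dog": 0.7, EOS: 0.3},
    "dog": {"lady": 0.2, EOS: 0.8},
}


def log_prob(message, next_symbol_probs):
    """u(m): log-probability of a message and its end-of-sequence symbol."""
    total, prev = 0.0, BOS
    for symbol in list(message) + [EOS]:
        total += math.log(next_symbol_probs[prev][symbol])
        prev = symbol
    return total


m_star = ("lady", "dog")  # message uttered when both entities are hidden
m1, m2 = ("lady",), ("dog",)  # phrases uttered about each entity in isolation
u_star = log_prob(m_star, toy_speaker)
u_12 = log_prob(m1 + m2, toy_speaker)
u_21 = log_prob(m2 + m1, toy_speaker)

# Term inside the expectation of the speaker-side concatenability.
gap = max(u_12, u_21) - u_star
print(u_12, u_21, gap)
```

Here one concatenation coincides with the uttered message, so the gap is zero, which is the most concatenable case for this single example.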
To give an intuition, let us go back to our
examples. Take the speaker of a hypothetical
language, English without verbs. Suppose that
this speaker, when exposed to a given input
xS = (I S, (1, 1, 0), rS, β), produces a sentence
m∗ corresponding to (1a), ‘‘the old lady the dog’’.
By exposing the speaker to the same input, but
by changing the mask to (1, 0, 0), they produce
m1 = ‘‘the lady’’, while using the mask (0, 1, 0),
they produce m2 = ‘‘a golden retriever’’. CS
compares the log probability of m∗ with that
of m12 = ‘‘the lady a golden retriever’’ and
m21 = ‘‘a golden retriever the lady’’, whichever
is more probable. Since English without verbs is
rather concatenable, the speaker judges that m12
is roughly as likely as m∗ given the inputs. Thus,
the value inside the expectation of CS will be
high, close to 0.
Now, take an identical speaker, except that
they assign a very high probability to m′1 =
‘‘a shoebox’’, while the new m′12 and m′21 are
unlikely conditioned on xS. Then CS will be
low and negative. Perhaps (i) ‘‘a shoebox’’ has
different semantics when it is used alone in a
sentence, as compared to when it is used with a
second referent; or perhaps (ii) ‘‘a shoebox’’ is
never used with another referent in a sentence,
and the speaker would use ‘‘a lady’’ instead. In
any case, concatenability for this speaker would
be low, which corresponds to the intuition that
their language is unnatural and unsystematic.7
The same illustration holds for CL, and it can
be adapted to show why CS and CL should also
be high for more synthetic languages.
Results: We measure these metrics on the test
set. In our experiments, they always take negative
values: the concatenated messages are on average
worse than the actual messages. Some models
yield values close to 0, but this depends on the
choice of hyperparameters.
Table 2 shows that OC largely improves over
FA in terms of both CL and CS. For instance, the
reconstruction losses of OC models go up by 3.1
nats on average when the best concatenated mes-
sages are used instead of the actually sent mes-
sages. In contrast, FA models incur a loss that is
higher by 6.1 nats. Thus, languages obtained using
the OC architecture are more natural than those
emerging from FA in the sense of concatenability.
        CL ↑             CS ↑
FA      −6.1 ± 3.8       −29 ± 13
OC      −3.2 ± 2.4***    −26 ± 15
Table 2: Mean and stdev of concatenability met-
rics on OC and FA runs. (i) OC improves con-
catenability. Arrows indicate optimal direction.
(p-values: *: < 0.05, **: < 0.01, ***: < 0.001).
6 Word Order
6.1 Importance of Word Order
Concatenability metrics do not distinguish be-
tween the use of word order or some sort of case
marking strategy. Since both strategies are found
in natural languages, we claim that for all natural
languages, this metric should be high. But we also
want to know which particular strategy is used,
especially when concatenability is high.
First, note that it is difficult to detect the pres-
ence of case markings directly. Even for the
simplest forms of morphology, we are hindered
by the segmentation problem of identifying the
root and the affix, as mentioned in Section 1.3.8
7This example only illustrates the intuition. In reality,
it is not straightforward to apply these metrics on natural
language, because they require probability distributions for
the agents. We could learn models that map back and forth
between the semantics and the ground-truth utterances, but
the models would add some bias. Moreover, we only have
ground-truth utterances for English and any attempts to use
machine translation would add some more bias.
8It is generally even more complicated for several rea-
sons: a lexeme can have several roots, each morpheme can
simultaneously encode several semantic properties, and the
modification of the root can be non-concatenative (Stump,
2017).
Yet we can quantify on average how much ref-
erential phrases (messages about a single hidden
object) encode roles. We train a bigram classi-
fier on the training set and measure its test error,
the Role Prediction Error (RPE). If there are
case markings, this error will be low (but the
converse is not true).
Moreover, we introduce two transitivity met-
rics to directly measure the importance of word
order. T S is defined as
\[ T^S = \mathbb{E}_{P_2}\, |u(m_{12}) - u(m_{21})|. \]
This metric is 0 if the two concatenated mes-
sages are equally probable for S; and it is large
if one word order is much more likely than the
other for S. Similarly, T L is defined as
\[ T^L = \mathbb{E}_{P_2}\, |l(m_{12}) - l(m_{21})| \]
and has a similar interpretation.
These metrics are only interpretable when
concatenability metrics are high enough, so we
measured T S only for runs where CS is above the
median, and T L only for runs where CL is above
the median.
Results: As can be seen in Figure 3, when
transitivity is low and RPE is high, the recon-
struction loss is high (top-left corner), because
there is no efficient strategy to encode roles. There
is a lot of variance both for OC and FA, but OC
models tend to have higher transitivity, both on
average and in terms of maximal values. Thus
word order is more important for OC runs than
for FA runs. This is also confirmed by Table 3.
Table 3 also shows that OC and FA agents
have very similar RPE. This means that both
encode roles in referential phrases quantitatively
similarly. More work is needed to determine how
roles are encoded (when they are), that is, if there
are traces of morphology or if messages denot-
ing a single entity in different roles are unrelated.
Figure 3: Role prediction error (RPE) as a function of
transitivity T L. Color indicates reconstruction loss. (i)
(upper-left quadrant) Low T L and high RPE implies
a high reconstruction error, since roles are not encoded
properly. (ii) OC has higher average transitivity than
FA, but similar RPE.
        T L ↑        T S ↑       RPE ↓
FA      3.4 ± 3.2    10 ± 9      0.48 ± 0.18
OC      11 ± 10*     15 ± 12     0.52 ± 0.12***
Table 3: Mean and stdev of transitivity metrics
and RPE for OC and FA. T L (T S) statistics and
significance computed on runs scoring CL (CS)
above median. Arrows indicate optimal direc-
tion. OC uses word order more than FA. Con-
trols are discussed in the main text. (p-values:
*: < 0.05, **: < 0.01, ***: < 0.001).
6.2 Consistency of Word Order
To go further, we can study which word orders
are favored across different contexts. For every
pair of roles such as AGENT and PATIENT, is it the
message with the AGENT uttered first that is more
likely, or the opposite?
To answer the question, instead of looking at
the magnitude of the gap as does T S, we can
count which word orders maximize the gap. By
finding the most frequent order, we find for each
model the preference of the speaker P S, a binary
relation on R2. For example,
{(AGENT, PATIENT), (PATIENT, MISC), (MISC, AGENT)}    (4)
is such a relation. This is very crude, as it does
not distinguish the case where AGENT always pre-
cedes PATIENT from the case where AGENT pre-
cedes PATIENT 51% of the time, but we leave
more involved analyses for future work. We
define P L analogously, using the reconstruction
loss l instead of message probability u.
Results: We compute preferences P S and P L
for each run. Out of 100 runs, 29 runs have both
CS and CL higher than their median values, and
23 of these have equal P S and P L.
Entities
−, 8, 1
−, 8, 5
−, 8, 190
−, 8, 39
α
0, 1, 0
0, 0, 1
0, 1, 1
0, 1, 0
0, 0, 1
0, 1, 1
0, 1, 0
0, 0, 1
0, 1, 1
0, 1, 0
0, 0, 1
0, 1, 1
A
B
Entities
24, 79, 25
105, 16, 105
105 , 79, 24
24, 79, 25
19, 24
47, 79, 24, 25
24, 79, 25
16, 19
105, 47
34, 34
34 , 47
105, 47
18, 18
18 , 24
105, 47
19
16 , 79 , 39, 79
105 , 24
24, 79, 25
16, 44, 16, 72, 2
105, 47
16, 19
44, 16 , 59, 72
105 , 16
8, 4, −
8, 61, −
−, 132, 8
−, 287, 8
α
1,0,0
0,1,0
1,1,0
1,0,0
0,1,0
1,1,0
0,1,0
0,0,1
0,1,1
0,1,0
0,0,1
0,1,1
A
79, 24, 24, 79, 24
34, 34, 15
B
18, 1, 18
15, 34, 15
34 , 24, 79, 24, 79, 24
34, 34, 34 , 1, 18
79, 24, 79, 24, 24
94, 54, 25, 94, 72
18, 18, 19
16, 16, 25
94 , 121, 25 , 79, 24, 79, 24
16 , 19 , 24, 19, 18
19, 24, 19
79, 24, 72
24, 19 , 123, 19
35, 19
79, 24, 72
19, 59
47, 71, 105
18, 24, 59
19, 59, 16
47, 71, 105
16, 79 , 19, 35
24, 19, 59, 16
Table 4: A sample of messages exchanged about the same entity u8. Entities: list of entities (‘‘−’’:
no entity; numbers indicate the rank of the entity in the dataset; position in the list indicates the role: AGENT,
PATIENT, MISC). α: mask. A, B: Messages produced by speakers of models A and B. Symbols are
manually colored to identify phrases (first 2 rows in every block of 3 rows) in artificial sentences
(third row in every block). Relations are omitted but are different for each block.
Among all possible relations, some are not
transitive, such as (4). However, all the prefer-
ences we found are transitive, which is extremely
unlikely to be due to chance. A simple explanation is
that transitive relations allow agents to discuss
three entities with word order only. However, it
does not seem to be universally required by nat-
ural languages to have well-defined orders in the
presence of many roles. For instance, in English,
the use of different prepositions allows for dif-
ferent word orders, such as the dative alternation
which offers two different orders to talk about
three entities.
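The notion of a transitive preference can be checked mechanically. The sketch below tests relation (4) and a linear order, then counts how many of the possible complete pairwise preferences over the three roles are transitive. The helper names are hypothetical, and the enumeration is only an illustration of the combinatorics, not a reproduction of the analysis.

```python
from itertools import product

ROLES = ("AGENT", "PATIENT", "MISC")


def is_transitive(relation):
    """True if (a, b) and (b, d) in the relation always imply (a, d)."""
    rel = set(relation)
    return all((a, d) in rel for (a, b) in rel for (c, d) in rel if b == c)


# Relation (4) is cyclic, hence not transitive.
relation_4 = {("AGENT", "PATIENT"), ("PATIENT", "MISC"), ("MISC", "AGENT")}
# A linear order AGENT < PATIENT < MISC is transitive.
linear_order = {("AGENT", "PATIENT"), ("PATIENT", "MISC"), ("AGENT", "MISC")}

# Enumerate every way of choosing one order for each unordered pair of roles.
pairs = [(ROLES[i], ROLES[j]) for i in range(3) for j in range(i + 1, 3)]
preferences = [
    {(a, b) if keep else (b, a) for (a, b), keep in zip(pairs, flips)}
    for flips in product([True, False], repeat=len(pairs))
]
n_transitive = sum(is_transitive(p) for p in preferences)

print(is_transitive(relation_4), is_transitive(linear_order))
print(n_transitive, "transitive preferences out of", len(preferences))
```

The transitive preferences are exactly the linear orders of the three roles, and the remaining ones are the two cyclic relations, of which (4) is one.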
7 Qualitative Analysis
One can gain intuition about the metrics by look-
ing at messages exchanged by agents. In particu-
lar, we compare two models A and B which both
have relatively high concatenability, but A has
high transitivity scores whereas those of B are
low. The chosen models also have relatively close
reconstruction loss, so that the messages convey
roughly as much information about their inputs.
To simplify, we focus on one entity vector and
see how it is transmitted when it is in different
roles and in different contexts. Since feature vec-
tors are slightly sparse (with many NA values),
vectors which have many NAs are sometimes not
conveyed at all (the penalty makes it costly to
do so). We search for an entity that appears in
many different roles and that is sufficiently
dense. The 8th most frequent vector (u8) is the
most frequent vector that fits these criteria.
First, let us examine the left-hand side of
Table 4, which shows how u8 is talked about
in its most frequent role, the PATIENT role. In both
models, u8 is denoted by the same phrase very
consistently (first rows of each block). Thus the
context of u8 (entities and relation) does not seem
to influence the message. This property is some-
times called context-independence (Bogin et al.,
2018).
Although the vocabulary contains 128 sym-
bols, only a few symbols are used. This is due to
the difficulty of discrete optimization. We were
puzzled to find so many common symbols in
the two models, but it turns out that the selected
models have the same hyperparameters except
for the length-penalty coefficient (A: λ = 1,
B: λ = 10).
The last row of each block of three rows
shows an artificial sentence, where two entities
are hidden. We can see that most symbols in
these sentences also frequently appear in phrases
that denote individual entities (identified by their
colors). Some symbols from phrases are omitted
or in a different order in the sentence, but the
precise structure of these phrases is out of scope
for our work.
A is more consistent in its use of word order
than B: A almost always refers to MISC before
PATIENT, whereas the order varies for B. This
is evidence that the transitivity metrics correctly
measure the importance of word order, at least
when concatenability is high enough.
On the right-hand side of Table 4, u8 appears
in less frequent roles, and we see many more
irregularities. Firstly, the phrases denoting u8 in
isolation are less consistent across different con-
texts (more context-dependence), even though we
find a large overlap of symbols. Secondly, we
also found more empty phrases (not shown here).
Thirdly, we did not find evidence for a lower
transitivity of B in these roles, but the sample size
was smaller.
8 Discussion and Limitations
8.1 Partial Observability and Reference
Thanks to our experimental setup and metrics,
we avoid the problem of segmentation. How-
ever, concatenability and transitivity rely on a
crucial aspect of the task, partial observability,
which allows us to obtain messages about a single
‘‘thing’’ in isolation. In our case, this ‘‘thing’’ is
an entity and role pair, but instead, could it be a
single attribute like shape or color, as in simpler
referential games used in past research?
Such a setup would be similar to our setup (cf.
Section 1.2). However, (i) there would be no relation β;
(ii) I S, I L and α would be vectors of size n_feat;
(iii) in terms of models, we would use a simple
attention mechanism to select a subset of the
features to communicate about.
However, we do not think that this setup re-
alistically models real-life communicative situa-
tions. Visual properties like shape and color are
often perceived simultaneously. If, sometimes, we
fail to perceive colors (for example, at night) or
shapes (perhaps due to an occlusion), we rarely
need to inquire about these attributes. In general,
the missing attributes do not particularly matter,
but are useful to identify the kind of the entity.
For example, the white color and the circular
shape of an object tell us that it is a plate, which
is useful; but its particular appearance generally
does not matter once it has been catego-
rized. Thus, we generally infer the kind from the
observed attributes if possible, or else directly ask
for the kind.
By contrast, events are often partially observed,
which raises many questions. When one ob-
serves the consequences of a past action, one
often wonders who was the agent that caused it.
Similarly, since future events are indeterminate,
they are frequently discussed and negotiated. Thus
it is frequent to describe events partially.
In sum, the semantics of events are often con-
veyed partially whereas the semantics of entities
are more frequently packaged into the word for a
kind. Thus directly transposing this setup to the
referential case seems unrealistic. However, per-
haps it could be adapted to a discriminative setup
(Lazaridou et al., 2017), where the need to convey
partial features of objects is clearer.
8.2 On θ-roles
As inputs to our models, θ-roles are much more
salient than any of the 18 features associated
with entities: Each θ-role is associated with an
entire vector added to the keys and values used
by the attention mechanisms (cf. Role and Idx in
Sections 2.3.1 and 2.3.2). Moreover, there are
only three of them and they are mutually exclu-
sive. For these reasons, it is easy to attend over
each of them, which explains why many artificial
agents rely on θ-roles to structure their messages.
These θ-roles are groups of verb-specific roles
(sometimes called participant roles). For exam-
ple, the LOVER, the EATER, and the BUILDER verb-
specific roles are clustered into the verb-general
AGENT θ-role, while the LOVEE, the EATEE, and the
BUILDEE roles fall under the PATIENT θ-role. Dowty
(1991) shows that some θ-roles can be predicted
from a small set of features that are mostly re-
lated to physical notions of movement and to
causality.9 However, since humans perceive many
more features (for example, shapes, colors, tex-
tures, etc.), it is not clear why these particular
features are preferred to structure the grammars of
natural languages.
To answer this question, we might be able
to use pretrained unsupervised learning models
as feature extractors (Santoro et al., 2017; van
Steenkiste et al., 2018; Kipf et al., 2018). An
object-centric model like R-NEM (van Steenkiste
et al., 2018) can extract object representations
from videos of physically interacting objects. An
interaction model like NRI (Kipf et al., 2018) can
infer the relations between objects given object
representations over time, such that these rela-
tions are predictive of how the objects change
over time. By combining such models, it may be
possible to learn object, relation, and role repre-
sentations from videos. We could then use such
learned representations as inputs in our com-
munication games to study whether verb-general
roles emerge.
9These features are precisely the features that are used in
this paper to represent the semantics of the entities, but their
meaning is irrelevant in this work.
9 Conclusion
We have presented an experimental setup for
studying how probabilistic artificial agents pred-
icate, that is, how they convey that a relation
holds between entities. In our daily lives, events
are partially observed and predication is used to
share information about what is not observed, of-
ten in a parsimonious manner. Our task and loss
realistically reflect this function of language.
At the same time, this setup allows us to di-
rectly study argument structure while ignoring
the internal structure of phrases. Indeed, we can
easily obtain artificial phrases, that is, utterances
that refer to single entities, as well as artificial
sentences, utterances which express the relation
holding between different entities. Then, we can
study whether and how artificial phrases are sys-
tematically composed to form artificial sentences,
via our concatenability and transitivity metrics.
Thus we completely sidestep the need to seg-
ment artificial sentences into phrases, a compli-
cated problem that is unfortunately ignored in
previous works.
More precisely, we have argued that all nat-
ural languages should have high concatenabil-
ity, while transitivity is not necessarily high and
merely quantifies the importance of word order.
Equipped with this setup and these metrics, we
have compared a cognitively plausible architec-
ture that leverages the structure of the inputs into
objects with properties (OC) against an implau-
sible baseline that ignores this structure (FA).
Object-centric models yield more natural lan-
guages in terms of concatenability, while also
relying more on word order. Moreover, they gen-
eralize better than their implausible counterparts,
both in distribution and out of distribution.
These results confirm the importance of the in-
put representations and of the architectures lead-
ing to the discretization bottleneck, also reported
by Lazaridou et al. (2017) and Guo et al. (2019).
In our experiments, discrete low-dimensional
inputs were processed by task-specific architec-
tures. However, we believe that one can use
high-dimensional representations obtained from
pretrained models, as long as these representa-
tions are prelinguistic, as object-centric represen-
tations seem to be.
Our methods could be extended to investigate
other aspects of sentences. For instance, how
would agents convey relations? To answer this
question, we could use the representations learned
via relational unsupervised learning algorithms
as inputs. We could study how different relations
are discretized into one or several symbols, per-
haps the equivalent of verbs and adverbs. We
could also analyze how relation-specific roles
cluster in abstract roles (like θ-roles) and struc-
ture grammar.
Acknowledgments
Tom Bosc was financially supported for this re-
search by the Canada CIFAR AI Chair Program.
We also thank the Mila IDT team for the com-
putational infrastructure, as well as anonymous
reviewers and the action editors for their helpful
feedback.
References
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E.
Hinton. 2016. Layer normalization. arXiv pre-
print arXiv:1607.06450v1.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2014. Neural machine translation by
jointly learning to align and translate. arXiv
preprint arXiv:1409.0473v7.
Dik Bakker and Anna Siewierska. 2009. Case
and alternative strategies: Word order and
agreement marking. In The Oxford Hand-
book of Case, edited by Andrej Malchukov
and Andrew Spencer, pages 290–303.
https://doi.org/10.1093/oxfordhb/9780199206476.013.0020
Marco Baroni. 2020. Rat big, cat eaten! Ideas
for a useful deep-agent protolanguage. arXiv
preprint arXiv:2003.11922v1.
James Bergstra and Yoshua Bengio. 2012. Ran-
dom search for hyper-parameter optimization.
Journal of Machine Learning Research, 13(2).
Ben Bogin, Mor Geva, and Jonathan Berant.
2018. Emergence of communication in an in-
teractive world with consistent speakers. arXiv
preprint arXiv:1809.00549v1.
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals,
Andrew Dai, Rafal Jozefowicz, and Samy
Bengio. 2016. Generating sentences from a
continuous space. In Proceedings of the 20th
SIGNLL Conference on Computational Natu-
ral Language Learning, pages 10–21, Berlin,
Germany. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1/K16-1002
Rahma Chaabouni, Eugene Kharitonov, Diane
Bouchacourt, Emmanuel Dupoux, and Marco
Baroni. 2020. Compositionality and generaliza-
tion in emergent languages. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 4427–4442.
https://doi.org/10.18653/v1/2020.acl-main.407
Rahma Chaabouni, Eugene Kharitonov,
Emmanuel Dupoux, and Marco Baroni. 2019.
Anti-efficient encoding in emergent communi-
cation. In Advances in Neural Information Pro-
cessing Systems, volume 32, pages 6293–6303.
Curran Associates, Inc.
Herbert H. Clark. 1996. Using Language.
Cambridge University Press.
David Dowty. 1991. Thematic proto-roles and
argument selection. Language, 67(3):547–619.
https://doi.org/10.2307/415037,
https://doi.org/10.1353/lan.1991.0021
Dedre Gentner. 1982. Why nouns are learned
before verbs: Linguistic relativity versus natural
partitioning. Center for the Study of Reading
Technical Report; no. 257.
Herbert P. Grice. 1975. Logic and conversation.
Speech Acts, pages 41–58. Brill.
https://doi.org/10.1163/9789004368811_003
Shangmin Guo, Yi Ren, Serhii Havrylov, Stella
Frank, Ivan Titov, and Kenny Smith. 2019. The
Emergence of Compositional Languages for
Numeric Concepts Through Iterated Learning
in Neural Agents.
Eric Jang, Shixiang Gu, and Ben Poole. 2016.
Categorical reparameterization with gumbel-
softmax. arXiv preprint arXiv:1611.01144v5.
Eugene Kharitonov, Roberto Dess`ı, Rahma
Chaabouni, Diane Bouchacourt, and Marco
Baroni. 2021. EGG: A toolkit for research on
Emergence of lanGuage in Games.
https://github.com/facebookresearch/EGG.
Diederik P. Kingma and Jimmy Ba. 2014. Adam:
A method for stochastic optimization. arXiv
preprint arXiv:1412.6980v9.
Durk P. Kingma, Tim Salimans, Rafal
Jozefowicz, Xi Chen, Ilya Sutskever, and Max
Welling. 2016. Improved variational inference
with inverse autoregressive flow. In Advances
in Neural Information Processing Systems,
pages 4743–4751.
Paul R. Kingsbury and Martha Palmer. 2002.
From TreeBank to PropBank. In LREC,
pages 1989–1993. Citeseer.
Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang,
Max Welling, and Richard Zemel. 2018.
Neural relational inference for interacting sys-
tems. In International Conference on Machine
Learning, pages 2688–2697. PMLR.
Simon Kirby and James R. Hurford. 2002. The
emergence of linguistic structure: An overview
of the iterated learning model. Simulating
the Evolution of Language, pages 121–147.
https://doi.org/10.1007/978-1-4471-0663-0_6
Satwik Kottur, José Moura, Stefan Lee, and
Dhruv Batra. 2017. Natural language does not
emerge ‘naturally’ in multi-agent dialog. In
Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing,
pages 2962–2967, Copenhagen, Denmark.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D17-1321
Angeliki Lazaridou and Marco Baroni. 2020.
Emergent multi-agent communication in the
deep learning era. arXiv preprint arXiv:2006
.02419v1.
James R. Hurford. 1989. Biological evolu-
tion of the Saussurean sign as a component
of the language acquisition device. Lingua,
77(2):187–222. https://doi.org/10.1016
/0024-3841(89)90015-6
Angeliki Lazaridou, Karl Moritz Hermann, Karl
Tuyls, and Stephen Clark. 2018. Emergence
of linguistic communication from referential
games with symbolic and pixel input.
In 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC,
Canada, April 30–May 3, 2018, Conference
Track Proceedings. OpenReview.net.
Angeliki Lazaridou, Alexander Peysakhovich, and
Marco Baroni. 2017. Multi-agent cooperation
and the emergence of (natural) language. In 5th
International Conference on Learning Repre-
sentations, ICLR 2017, Toulon, France, April
24–26, 2017, Conference Track Proceedings.
OpenReview.net.
J. Scott Long and Laurie H. Ervin. 2000. Using
heteroscedasticity consistent standard errors
in the linear regression model. The American
Statistician, 54(3):217–224. https://doi
.org/10.2307/2685594, https://doi
.org/10.1080/00031305.2000.10474549
Ryan Lowe, Jakob N. Foerster, Y.-Lan Boureau,
Joelle Pineau, and Yann N. Dauphin. 2019. On
the pitfalls of measuring emergent communica-
tion. In Proceedings of the 18th International
Conference on Autonomous Agents and Multi-
Agent Systems, AAMAS ’19, Montreal, QC,
Canada, May 13–17, 2019, pages 693–701.
International Foundation for Autonomous
Agents and Multiagent Systems.
James G. MacKinnon and Halbert White.
1985. Some heteroskedasticity-consistent co-
variance matrix estimators with improved finite
sample properties. Journal of Econometrics,
29(3):305–325. https://doi.org/10.1016
/0304-4076(85)90158-7
Chris J. Maddison, Andriy Mnih, and Yee Whye
Teh. 2016. The concrete distribution: A con-
tinuous relaxation of discrete random varia-
bles. arXiv preprint arXiv:1611.00712v3.
Wes McKinney. 2010. Data structures for
statistical computing in Python. In Proceedings
of the 9th Python in Science Conference,
pages 56–61.
https://doi.org/10.25080/Majora-92bf1922-00a
Natacha Mendes, Hannes Rakoczy, and
Josep Call. 2008. Ape metaphysics: Object
individuation without language. Cognition,
106(2):730–749.
https://doi.org/10.1016/j.cognition.2007.04.007,
PubMed: 17537418
Igor Mordatch and Pieter Abbeel. 2018. Emergence
of grounded compositional language in
multi-agent populations. In Proceedings of the
Thirty-Second AAAI Conference on Artificial
Intelligence, (AAAI-18), the 30th Innovative
Applications of Artificial Intelligence
(IAAI-18), and the 8th AAAI Symposium on
Educational Advances in Artificial Intelligence
(EAAI-18), New Orleans, Louisiana, USA,
February 2–7, 2018, pages 1495–1502. AAAI
Press.
Vinod Nair and Geoffrey E. Hinton. 2010.
Rectified linear units improve restricted Boltzmann
machines. In Proceedings of the 27th
International Conference on Machine Learning,
pages 807–814.
Adam Paszke, Sam Gross, Francisco Massa,
Adam Lerer, James Bradbury, Gregory
Chanan, Trevor Killeen, Zeming Lin, Natalia
Gimelshein, Luca Antiga, Alban Desmaison,
Andreas Kopf, Edward Yang, Zachary DeVito,
Martin Raison, Alykhan Tejani, Sasank
Chilamkurthy, Benoit Steiner, Lu Fang,
Junjie Bai, and Soumith Chintala. 2019.
PyTorch: An imperative style, high-performance
deep learning library. In H. Wallach, H.
Larochelle, A. Beygelzimer, F. d’Alché-Buc,
E. Fox, and R. Garnett, editors, Advances
in Neural Information Processing Systems 32,
pages 8024–8035. Curran Associates, Inc.
Tom Pelsmaeker and Wilker Aziz. 2019. Effec-
tive estimation of deep generative language
models. arXiv preprint arXiv:1904.08194.
https://doi.org/10.18653/v1/2020
.acl-main.646
Drew Reisinger, Rachel Rudinger, Francis
Ferraro, Craig Harman, Kyle Rawlins, and
Benjamin Van Durme. 2015. Semantic proto-roles.
Transactions of the Association for
Computational Linguistics, 3:475–488.
https://doi.org/10.1162/tacl_a_00152
Adam Santoro, David Raposo, David G. Barrett,
Mateusz Malinowski, Razvan Pascanu, Peter
Battaglia, and Timothy Lillicrap. 2017. A
simple neural network module for relational
reasoning. Advances in Neural Information
Processing Systems, 30.
Skipper Seabold and Josef Perktold. 2010.
statsmodels: Econometric and statistical mod-
eling with Python. In 9th Python in Science
Conference. https://doi.org/10.25080
/Majora-92bf1922-011
Luc Steels. 1997. The synthetic modeling of lan-
guage origins. Evolution of Communication,
1(1):1–34. https://doi.org/10.1075
/eoc.1.1.02ste
Sjoerd van Steenkiste, Michael Chang, Klaus
Greff, and Jürgen Schmidhuber. 2018.
Relational Neural Expectation Maximization:
Unsupervised Discovery of Objects and their
Interactions. In International Conference on
Learning Representations.
Gregory T. Stump. 2017. Inflection. The Hand-
book of Morphology, pages 11–43. https://
doi.org/10.1002/9781405166348.ch1
Zoltán Gendler Szabó. 2020. Compositionality.
In Edward N. Zalta, editor, The Stanford
Encyclopedia of Philosophy, fall 2020 edition.
Metaphysics Research Lab, Stanford
University.
The Pandas Development Team. 2021. pandas-
dev/pandas: Pandas 1.2.3.
Michael Tomasello. 2010. Origins of Human
Communication. MIT Press.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is All you Need. In Advances in
Neural Information Processing Systems 30,
pages 5998–6008. Curran Associates, Inc.
Fei Xu and Susan Carey. 1996. Infants’
metaphysics: The case of numerical identity.
Cognitive Psychology, 30(2):111–153.
https://doi.org/10.1006/cogp.1996.0005,
PubMed: 8635312
George Kingsley Zipf. 1949. Human Behavior and
the Principle of Least Effort. Ravenio Books.