The Emergence of Argument Structure in Artificial Languages

Tom Bosc
Mila
Université de Montréal, Canada
bosct@mila.quebec

Pascal Vincent
Meta AI, Mila
Université de Montréal, Canada
CIFAR AI Chair
vincentp@iro.umontreal.ca

Abstract

Computational approaches to the study of lan-
guage emergence can help us understand how
natural languages are shaped by cognitive and
sociocultural factors. Previous work focused
on tasks where agents refer to a single entity. In
contrast, we study how agents predicate, that
is, how they express that some relation holds
between several entities. We introduce a setup
where agents talk about a variable number of
entities that can be partially observed by the lis-
tener. In the presence of a least-effort pressure,
they tend to discuss only entities that are not
observed by the listener. Thus we can obtain
artificial phrases that denote a single entity, as
well as artificial sentences that denote several
entities. In natural languages, if we ignore the
verb, phrases are usually concatenated, either
in a specific order or by adding case mark-
ers to form sentences. Our setup allows us to
quantify how much this holds in emergent lan-
guages using a metric we call concatenability.
We also measure transitivity, which quantifies
the importance of word order. We demon-
strate the usefulness of this new setup and
metrics for studying factors that influence ar-
gument structure. We compare agents having
access to input representations structured into
pre-segmented objects with properties, versus
unstructured representations. Our results in-
dicate that the awareness of object structure
yields a more natural sentence organization.

How do languages emerge and evolve? Zipf
(1949) viewed language as the result of an opti-
mization procedure balancing information trans-
mission maximization and effort minimization.
This view is amenable to formalization and sim-
ulation. An early example is Hurford’s (1989)
comparison of language acquisition strategies,
assuming that communication success gives an
evolutionary advantage. More generally, subsequent
research uses optimization procedures and
evolutionary mechanisms to create and study artificial
languages (Steels, 1997; Lazaridou and Baroni,
2020).

Such approaches are mainly used with two
objectives in mind: first, to improve natural
language processing methods; second, to help us
understand the roles of cognitive and sociocultural
factors in shaping languages, such as
our drive to cooperate, pragmatic reasoning, and
imitation (Tomasello, 2010).

In the deep learning era, language emergence
researchers have focused on the referential
function of language, namely, how agents
communicate about single objects, using artificial noun
phrases equivalent to ‘‘blue triangle’’ or ‘‘red
circle’’ (Lazaridou et al., 2017; Kottur et al.,
2017). In contrast, we propose to study the
predication function of language, that is, the
expression of relations between entities (events). How
do artificial agents express events such as ‘‘the
blue triangle is above the red circle’’?

We introduce an experimental setup for study-
ing predication. The speaker communicates about
an event involving a variable number of entities
that are in a certain relation. Then, the listener
tries to reconstruct this event. To simplify, the
relation is observed by both agents.

Crucially, the listener is given a partial observa-
tion of the event, ranging from nothing to all but
one entity. In the presence of shared context, it is
unnecessary for the speaker to communicate the
whole event, and a least-effort penalty encourages
parsimony. Thus we obtain utterances that refer
to single entities in isolation, like phrases, and
utterances about several entities, like sentences.

Using these artificial phrases and sentences, we
can compute various metrics to quantify
compositionality (Szabó, 2020) at the sentence level. A
simple sentence typically contains a few phrases
that refer to entities. These phrases can generally
be understood in isolation, a property sometimes
called context-independence (Bogin et al., 2018).

Transactions of the Association for Computational Linguistics, vol. 10, pp. 1375–1391, 2022. https://doi.org/10.1162/tacl_a_00524
Action Editor: Daniel Gildea. Submission batch: 3/2022; Revision batch: 6/2022; Published 12/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Figure 1: Overview of experimental setup. (From left to right) Proto-role dataset contains annotations (18
features and a role) for each argument and a relation (SELL.01 and WAKE.02, respectively, observed by both speaker
S and listener L). Preprocessing: From the 1st annotation, 3 datapoints are created, where the number of
entities observed by L varies (see L’s mask and Partial F.&R. columns). The 2nd annotation contains a single
object so a single datapoint is created. Training: S produces a message. L reads it, and the pair of agents S, L is
jointly trained to minimize the reconstruction error and the length of the message. As a result of the objective,
S only talks about the entities not observed by L. Analysis: Informally, concatenability measures how
interchangeable the concatenations of messages $m_{12} = m_1 \oplus m_2$ and/or $m_{21} = m_2 \oplus m_1$ are with the actually sent
message $m^*$; transitivity measures how much one order is preferred compared to the other across the dataset
(cf. Sections 5, 6).

Moreover, the sentence is the concatenation of
these phrases along with the verb. Correspondingly,
we introduce concatenability metrics that
should be large for natural languages. Furthermore,
we propose transitivity metrics to quantify
the importance of word order. A high-level
overview of our setup is shown in Figure 1.

This setup enables us to analyze artificial lan-
guages without segmenting messages into constit-
uents. Segmentation introduces another layer of
complexity, to the extent that in practice, it is
not done at all: It is implicitly assumed that each
symbol is independently meaningful. However,
this assumption is flawed, because if letters or
phonemes are assumed to bear meaning, no natural
language is compositional.

Previous work has highlighted the influence of
input representations and architectures for lan-
guage emergence. Inappropriate representations
completely hinder evolution of a non-trivial
language with more than 2 words (Lazaridou et al.,
2017) or prevent agents from solving the task
altogether (Guo et al., 2019). This suggests that
specific inductive biases are still lacking for
artificial agents to develop languages like ours.

We posit that the perception of objects as wholes
with properties is an important inductive bias. To
be able to produce sentences containing referential
phrases, it seems that agents need to be able to
attend to the referents of these phrases reliably, to
conceive of them as bounded objects with intrinsic
properties in the first place.

We demonstrate the usefulness of our setup
and our metrics for testing this hypothesis. We
implement an object-centric inductive bias using
attention (Bahdanau et al., 2014) over representa-
tions of objects. We compare it to an architecture
which disregards the structure of the input, con-
sidering it merely a large unstructured feature
vector. The object-centric architecture yields more
natural languages—they are more concatenable.
Moreover, word order matters more with this
architecture than for the baseline. These results
are corroborated by our quantitative analysis and
measures of generalization outside of the train-
ing data.

Our contributions are two-fold. First, on the
methodological front, we propose and motivate
a novel task and two new metrics. This task not
only explains the emergence of compositionality
from a functional perspective, but also enables
us to easily analyze the learned language,
avoiding the problem of segmentation.

Second, we provide evidence that when
representations reflect the perception of objects as
wholes with properties, emergent languages are
more natural than when they do not. With this
finding we hope to foster the use of more cogni-
tively plausible input representations for explain-
ing language emergence.

1 Task

We design a task for studying how artificial
agents predicate. It is an instance of a recon-
struction task (Lazaridou et al., 2017), where one
agent, the speaker, observes an input and
produces a message, that is, a sequence of symbols. The
message is then read by another agent, the
listener, who tries to reconstruct the input observed
by the speaker.

We train several pairs of agents and study the
messages produced by the speakers. This training
procedure models language evolution and lan-
guage acquisition at once, unlike frameworks like
iterated learning (Kirby and Hurford, 2002).

The main novelty of our task is that agents are
trained to communicate about a variable number
of entities. In this section, we explain how the
inputs of the agents are obtained by preprocessing
the proto-role dataset (Reisinger et al., 2015).
Then, we argue that our task is realistic, yet simple
enough to permit an easy analysis of the messages.

1.1 The Proto-role Dataset

The data that are fed to agents are based on the
proto-role dataset built by Reisinger et al. (2015).
This dataset was created to evaluate Dowty’s
(1991) linking theory, a theory that predicts how
verb-specific roles are mapped to grammatical
relations in English.

To illustrate the annotation scheme, we use the
example from Figure 1, ‘‘the company sold a
portion of secured notes’’.

First, a relation is extracted. Here, the verb
‘‘sold’’ corresponds to the PropBank (Kingsbury
and Palmer, 2002) label SELL.01, which identi-
fies the verb and its particular sense.

There are $n_{obj} = 2$ arguments of the verb,
‘‘the company’’ and ‘‘a portion of secured
notes’’. Each of these arguments is annotated
with $n_{feat} = 18$ features indicating various
properties of the referred entity. For example, the first
feature indicates whether the entity caused the
event to happen, the second feature whether the
entity chose to be involved in the event, and so
forth (Reisinger et al., 2015). In this work, the
meaning of these features is irrelevant. These
features are encoded on a Likert scale from 1 to
5 or take a non-applicable (NA) value. Since the
description of each entity is a small feature
vector, many different noun phrases correspond to
the same feature vector. Thus ‘‘Technology
stocks’’ and ‘‘a portion of secured notes’’ in
Figure 1 denote the same entity.

Moreover, each argument is also assigned one
of six mutually exclusive classical θ-roles. In
the example, the arguments respectively have
the θ-roles AGENT and PATIENT.

We define an event as (i) a relation and (ii) a
set of pairs of feature vectors and θ-roles.

1.2 Task Description

For each event in the proto-role dataset, we gather
the relation and, for each entity, its 18 features
and its role. The features are rescaled
from {1, 2, 3, 4, 5} to {1, 2, 3}, and we only
retain the arguments in the 3 most frequent θ-roles
(AGENT, PATIENT, and a MISC category containing
instruments, benefactives, attributes).

The speaker observes the following quantities:

• the PropBank relation β,
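• entity features $I^S \in \{\mathrm{NA}, 1, 2, 3\}^{n_{obj} \times n_{feat}}$,
• entity roles $r^S \in \mathcal{R}^{n_{obj}} = \{\mathrm{AGENT}, \mathrm{PATIENT}, \mathrm{MISC}\}^{n_{obj}}$,
• the listener’s mask: $\alpha \in \{0, 1\}^{n_{obj}}$.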

The tensors $I^S$, $r^S$, and $\alpha$ are indexed by an
integer between 1 and $n_{obj}$, so they represent a set
$E^S$ of $n_{obj}$ triplets where each triplet $(I^S_i, r^S_i, \alpha_i)$
characterizes the $i$-th entity.

The i-th entity is said to be hidden iff αi = 1.
Hidden entities are not observed by the listener,
and the mask α indicates this to the speaker. Since
the listener tries to reconstruct the inputs of the
speaker, the mask essentially flags the entities that
the speaker should communicate about. Therefore, the
listener observes:

• the PropBank relation β,
• partial entity features $I^L[1 - \alpha]$,

• partial entity roles $r^L[1 - \alpha]$,
• the speaker’s message: $m \in M$.

Here, $u[v]$ denotes the tensor obtained by
restricting $u$ to the rows $i$ such that $v_i = 1$.

The message space $M$ is defined as follows.
Let $V = \{1, \ldots, n_V, \mathrm{eos}\}$ be the vocabulary
containing $n_V$ symbols, plus an end-of-sentence
(eos) token. Let $n_L$ be the maximum message
length (here, set to $n_L = 8$). $M$ contains all the
sequences of elements of $V$ of length at most
$n_L$ that end with eos.

A datapoint is valid if $\sum_{i=1}^{n_{obj}} \alpha_i \geq 1$, that is, at
least one object is hidden and some information
needs to be conveyed. From each event, we add
as many valid datapoints as possible to our
dataset. In our example, as there are 2 entities, either
one or both can be hidden, yielding 3 datapoints.
Given its inputs and the sender’s message, the
listener tries to reconstruct the sender’s inputs.
The agents are jointly trained to minimize a re-
construction loss while minimizing the number of
symbols exchanged, as formalized in Section 2.2.
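
The construction of valid datapoints described above can be sketched in a few lines. This is only an illustration under assumptions: the helper names `valid_masks` and `restrict` are hypothetical and do not come from the authors' code. The first enumerates every mask α that hides at least one entity, the second implements the restriction u[v].

```python
from itertools import product

def valid_masks(n_obj):
    """Return every mask alpha in {0,1}^n_obj with at least one hidden entity."""
    return [alpha for alpha in product((0, 1), repeat=n_obj) if sum(alpha) >= 1]

def restrict(rows, selector):
    """u[v]: keep the rows i of `rows` for which selector[i] == 1."""
    return [row for row, keep in zip(rows, selector) if keep == 1]

# An event with 2 entities yields 3 valid datapoints.
masks = valid_masks(2)
print(masks)  # [(0, 1), (1, 0), (1, 1)]

# The listener sees the entities that are NOT hidden, i.e. u[1 - alpha].
entities = ["company", "notes"]  # placeholder labels standing in for feature rows
for alpha in masks:
    visible = restrict(entities, [1 - a for a in alpha])
    print(alpha, visible)
```

An event with n entities therefore contributes 2^n − 1 datapoints, one for each non-empty set of hidden entities.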

1.3 Motivations

All the aspects of the task can have a major
influence on the learned languages. In this section,
we argue that our task is realistic in important
respects.

Our task is to convey semantic annotations of
sentences, not words or sentences directly, because
using linguistic data as input could be a
methodological mistake. Indeed, language-specific
typological properties might leak into the artificial
languages.1 We follow this principle, except for
our use of θ-roles. They are linguistic abstrac-
tions over relation-specific (participant) roles.
This limitation is discussed in Section 8.2.

In our task, agents have to communicate about
a realistic, variable number of entities. We posit
that this is a crucial characteristic for argument
structure to be natural. Indeed, if humans only
ever talked about two entities at once, grammar
would be simpler since a transitive construction
could be used everywhere. In our dataset, the
distribution of the number of entities talked about
is directly derived from an English corpus, and,
to our knowledge, the distribution of the
number of arguments does not vary much across
languages. Thus we do not expect a bias
towards English typology. In Mordatch and
Abbeel’s (2018) and Bogin et al.’s (2018) works,
agents also need to predicate. However, the event
structure is unrealistic as it is identical across
datapoints: The number of arguments is
constant and each argument has the same ‘‘type’’
(a landmark, an agent, a color, etc.).

1 For example, if the task was to transmit basic color terms
instead of, say, color represented as RGB triplets, the choice
of a language with only 3 basic color terms vs 11 color terms
(as in English) would yield different artificial languages. For
one thing, transmitting English color terms would require
agents to use more symbols.

The relation β is observed by both agents. As
a consequence, we do not expect artificial
sentences to contain the equivalent of a verb. The
main reason is that it greatly simplifies the anal-
ysis of the artificial languages.

We define the context as everything that is
observed by both agents: the relation and the
non-hidden entities. We now justify why agents
share context, and why the loss function includes
a penalty to minimize the number of symbols
sent (cf. Section 2.2).

First, let us argue that this is realistic. The
context is a coarse approximation of the notion
of common ground. According to Clark (1996),
common ground encompasses the cultural
background of the interlocutors, their sensory
perceptions, and their conversational history. In theory,
the speaker only needs to communicate the
information that is not part of the common ground,
but transferring more information than needed
is not ruled out. However, in practice, humans
try to be concise (cf. Grice’s [1975] maxim of
quantity). The penalty that we use encourages
parsimony. It could be seen as the result of
a more general principle governing cooperative
social activities (Grice, 1975) or even the whole
of human behavior (Zipf, 1949).

To illustrate, consider the following situation.
Upon seeing a broken window, one would ask
‘‘who/what broke the window?’’. A knowledge-
able interlocutor would answer ‘‘John’’ or ‘‘John
did’’. In our setup, the speaker is this knowl-
edgeable person, answering such questions about
unobserved entities. The context contains the bro-
ken window, and the speaker does not need to
refer to it since (i) the listener observes it, and
since (ii) the speaker knows that the listener
observes it (via the mask α). While the speaker
could still refer to the window, the least-effort
serves it (via the mask α). While the speaker
could still refer to the window, the least-effort
penalty makes it costly to do so, so the speaker

1378

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
5
2
4
2
0
6
5
9
4
8

/

/
T

l

A
C
_
A
_
0
0
5
2
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

avoids it. Even if the agents do not engage
Die
in dialogues but
mask α can be interpreted as simulating an in-
ference made by the speaker about the listener’s
Wissen.

in one-time interactions,

This setup is not only realistic, it is also es-
pecially useful for the purpose of analysing the
emergent languages. By masking all but one en-
tity, we obtain an artificial phrase that denotes
a single entity. By masking all but two entities,
we obtain an artificial sentence relating two en-
tities. The metrics that we introduce rely on our
abilities to obtain such phrases and sentences.
The concatenability metrics can be seen as
measures of systematicity, namely, how the meaning
of phrases is related to the meaning of sentences
(Szabó, 2020).

Without this setup, one would need to somehow
segment sentences into phrases. To our
knowledge, the problem has not been addressed
in the language emergence literature, but is
identified by Baroni (2020). For example, applied to
English corpora, metrics for quantifying compo-
sitionality like Chaabouni et al.’s (2020) disen-
tanglement metrics would tell us that English is
not compositional, since single letters are not
meaningful.

2 Model and Objective

We present two Transformer-based (Vaswani
et al., 2017) variants of the model of the agents:
One that encodes an object-centric bias and one
that does not. Before delving into their
differences, let us describe their common features.

2.1 General Architecture

Both the speaker S and the listener L are
implemented as Transformers, each of them built out of
an encoder $\mathrm{Tfm}_e$ and a decoder $\mathrm{Tfm}_d$.

The inputs of the speaker are encoded into a
real-valued matrix $V^S$, which differs in the two
variants of the model. For now, assume that $V^S$
encodes the speaker’s inputs and similarly, that
$V^L$ encodes the listener’s inputs.

The speaker produces a message $m$ by first
encoding its input into

$$H = \mathrm{Tfm}^S_e(V^S), \tag{1}$$

then auto-regressively decoding the message

$$m_{t+1} \sim q(m_{t+1} \mid m_{1:t}, I^S, \alpha, \beta) = \mathrm{Tfm}^S_d(M_{1:t}, H)_t,$$

with $M_{1:t}$ the sum of positional and value
embeddings of the previously decoded
symbols $m_{1:t}$.

At train time, the symbol is randomly sampled
according to q, whereas at test time, the most
likely symbol is picked greedily. If the maximum
length $n_L$ is reached, eos is appended to the
message and generation stops. Else, the genera-
tion process stops when eos is produced. In order
to backpropagate through the discrete sampling,
we use the Straight-Through Gumbel estimator
(Jang et al., 2016; Maddison et al., 2016).
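
The generation procedure just described (sampling at train time, greedy choice at test time, stopping at eos or at the maximum length) can be sketched as follows. This is a simplified illustration, not the authors' implementation: `next_symbol_probs` is a hypothetical stand-in for the decoder distribution q, the sketch assumes that eos does not count towards the n_L symbols, and the Straight-Through Gumbel estimator used for backpropagation is not shown.

```python
import random

EOS = "eos"

def decode(next_symbol_probs, n_L=8, greedy=True, rng=None):
    """Generate at most n_L symbols, then terminate the message with eos.

    next_symbol_probs(prefix) is a hypothetical stand-in for q: it returns a
    dict mapping every symbol of the vocabulary (eos included) to a probability.
    """
    rng = rng or random.Random(0)
    message = []
    while len(message) < n_L:
        probs = next_symbol_probs(message)
        if greedy:
            # Test time: pick the most likely symbol.
            symbol = max(probs, key=probs.get)
        else:
            # Train time: sample a symbol according to q.
            symbols = list(probs)
            symbol = rng.choices(symbols, weights=[probs[s] for s in symbols])[0]
        if symbol == EOS:
            break
        message.append(symbol)
    # eos is appended whether generation stopped early or hit the length limit.
    return message + [EOS]

def toy_q(prefix):
    """Made-up distribution: prefers symbol 1, then eos after two symbols."""
    if len(prefix) >= 2:
        return {1: 0.1, 2: 0.2, EOS: 0.7}
    return {1: 0.6, 2: 0.3, EOS: 0.1}

print(decode(toy_q))  # [1, 1, 'eos']
```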

L also embeds the message $m$ into a matrix
$M'$, and its decoder produces a matrix $O^L$:

$$H' = \mathrm{Tfm}^L_e(M'), \qquad O^L = \mathrm{Tfm}^L_d(V^L, H'). \tag{2}$$

$O^L$ is then used to predict the presence of the
objects as well as all the features of the objects.
This computation is slightly different depending
on the variant of the models and is detailed below.

Note that $\mathrm{Tfm}^S_d$ is invariant with respect to the
order of the objects in $V^S$, since we do not use
positional embeddings to create $V^S$, but rather
use the role information directly, as will be
explained for each model separately.2 On the other
hand, the message $m$ is embedded using both
token and positional embeddings in $M$ and $M'$,
so $\mathrm{Tfm}^S_d$ and $\mathrm{Tfm}^L_e$ are sensitive to word order.
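
The permutation property invoked above can be checked numerically on a toy layer. The sketch below is a deliberately reduced stand-in for a Transformer encoder: a single self-attention head with identity projections and no positional embeddings (all of these simplifications are assumptions made for illustration). It verifies that permuting the input rows permutes the output rows in the same way.

```python
import math

def self_attention(X):
    """Single-head dot-product self-attention, identity projections, no positions."""
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) for k in X]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, X)) / total
                    for c in range(len(q))])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # three made-up object vectors
perm = [2, 0, 1]

permuted_then_encoded = self_attention([X[i] for i in perm])  # f(P X)
encoded = self_attention(X)
encoded_then_permuted = [encoded[i] for i in perm]            # P f(X)

equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(permuted_then_encoded, encoded_then_permuted)
    for a, b in zip(row_a, row_b)
)
print(equivariant)  # True
```

Adding positional embeddings to the rows before attention would break this property, which is why the message side of the model is sensitive to word order.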

2.2 Loss

The loss function is a sum of two terms, a
reconstruction loss and a length penalty.

Reconstruction Loss: The input to reconstruct
is a set, the set of pairs of 18 features and a θ-role.
For each θ-role, we predict the corresponding
features as well as whether an object i in this role
is present or not, denoted by γi.

For a given data point indexed by $j$, the
reconstruction loss is the sum over all objects $i$

$$l_j = \sum_i -\left[\log p(I^S_i \mid I^L, m, \beta) + \log p(\gamma_i \mid I^L, m, \beta)\right].$$

2 When used without positional embeddings, the encoder
of the Transformer is permutation-equivariant, i.e., for any
permutation matrix $P$, $\mathrm{Tfm}_e(PX) = P\,\mathrm{Tfm}_e(X)$; similarly,
the decoder is permutation-invariant in its second argument
(the encoded matrix $H$), i.e., $\mathrm{Tfm}_d(PX) = \mathrm{Tfm}_d(X)$.
Permutations are applied to the input matrices, the masks, and
the role vectors.

Length Penalty: As done by Chaabouni et al.
(2019), we penalize long messages. This can be
seen as an implementation of Zipf’s (1949) least-
effort principle. In its simplest form, the penalty
is a term $p_j = \lambda |m_j|$ where λ is a
hyperparameter, and $|m_j|$ is the number of symbols in
the message.

However, we noticed that messages collapse
to empty messages early on during training. This
is similar to the well-known posterior collapse,
where the approximate posteriors of latents of
sequence-to-sequence VAEs collapse to their
priors (Bowman et al., 2016). We fix the issue
by adapting two well-known tricks: Pelsmaeker
and Aziz’s (2019) minimum desired rate and
Kingma et al.’s (2016) free bits. The penalty term
becomes

$$p_j = \mathbb{1}_{l_j < \tau}\, \mathbb{1}_{|m_j| > n_{\min}}\, (\lambda |m_j|),$$

where $\mathbb{1}$ is the indicator function.
For this term to be non-zero, two conditions
need to be fulfilled. First, the reconstruction
error must be below $\tau$, which is analogous to
a minimum desired rate. This threshold can be
set without difficulty to a fraction of the
reconstruction error incurred by the listener seeing
empty messages. In our case, this average error
is 18.6. We randomly choose the threshold in
$\{5, +\infty\}$ across runs, where $+\infty$ essentially
disables the trick.

Second, the penalty is above 0 only if the
message contains more than $n_{\min}$ symbols. This
gives models $n_{\min}$ ‘‘free’’ symbols for each
datapoint. Without this factor, we found that
speakers often utter empty messages (in
particular, when a single entity is hidden).

For a given data point indexed by $j$, the
total loss to minimize is the sum $l_j + p_j$. During
training, the average is taken over a mini-batch
($n = 128$), while during evaluation, it is taken
over the entire test split.
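
To make the gating of the length penalty concrete, here is a minimal sketch of the per-datapoint loss. The function names and argument names are hypothetical; the reconstruction loss and the message length are taken as given numbers, and whether eos counts towards the message length is left to the caller.

```python
import math

def length_penalty(recon_loss, msg_len, lam, tau=math.inf, n_min=1):
    """p_j = 1[l_j < tau] * 1[|m_j| > n_min] * lam * |m_j|."""
    if recon_loss < tau and msg_len > n_min:
        return lam * msg_len
    return 0.0

def total_loss(recon_loss, msg_len, lam, tau=math.inf, n_min=1):
    """Per-datapoint objective l_j + p_j."""
    return recon_loss + length_penalty(recon_loss, msg_len, lam, tau, n_min)

# Reconstruction is good enough (4.0 < 5) and the message is longer than n_min:
print(length_penalty(4.0, 3, lam=0.5, tau=5))  # 1.5
# Reconstruction error still above the threshold, so no pressure on length yet:
print(length_penalty(6.0, 3, lam=0.5, tau=5))  # 0.0
# Messages within the n_min "free" symbols are never penalized:
print(length_penalty(4.0, 1, lam=0.5, tau=5))  # 0.0
```

With the default `tau=math.inf`, the first indicator is always satisfied, which corresponds to disabling the threshold as described above.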

2.3 On the Perception of Objects

We demonstrate our setup and metrics by
comparing a model which is object-centric (OC), that
is, aware of objects as wholes with properties, to
a baseline model (flat attention, or FA), which
ignores the structure of the inputs.

We follow Gentner (1982), who argued that
perception of objects must be a strong,
prelinguistic cognitive bias. She gives the example of
a bottle floating into a cave. She considers an
imaginary language in which the bottle and the
mouth of the cave are construed as a single
entity, and argues that this language would be very
implausible. Across languages, the two entities
seem to always be referred to by separate phrases,
hinting at universals in the perception of objects.
More evidence is provided by Xu and Carey
(1996). They showed that infants use
spatio-temporal cues to individuate objects, that is,
to ‘‘establish the boundaries of objects’’. Only
around the start of language acquisition do
children start to rely on the properties or kinds of
objects to individuate. But could it be exposure
to language that drives infants to perceive the
properties and kinds of objects? Mendes et al.’s
(2008) experiments on apes suggest it is the other
way around, namely, that linguistic input is not
necessary to learn to individuate based on
property differences. Thus our hypothesis is that the
perception of objects as wholes is a prerequisite
for natural language to develop.

To implement the OC bias and the FA
baseline, we process the inputs in two ways and
obtain different $V^L$ and $V^S$ to plug into
Equations 1 and 2. Embedding the matrices $I^L$ and
$I^S$ gives us real-valued 3-dimensional tensors.
But since Transformers consume matrices, we
need to reduce the dimensionality of $I^S$ and
$I^L$ by one dimension. It is this
dimensionality-reduction step that encodes the inductive biases
of OC and FA. We tried to minimize the
differences between the two models. Figure 2 shows
an overview of their differences.

2.3.1 Object-centric Variant

Let $I$ be either $I^S$ or $I^L$, where each row $I_i$
represents an object. Each $I_i$ is embedded using
a learned embedding matrix $\mathrm{Val}_j$ for each feature
$j$, and the result is concatenated, yielding

$$E_i = [\mathrm{Val}_1(I_{i,1})^T; \ldots; \mathrm{Val}_{n_{feat}}(I_{i,n_{feat}})^T].$$

Then, this vector is transformed using a linear
function, followed by a ReLU (Nair and Hinton,
2010) and layer normalization (Ba et al., 2016).
We obtain $V^{(0)}$, a real-valued $n_{obj} \times d$ matrix with

$$V^{(0)}_i = \mathrm{LN}(\max(W E_i + b, 0)). \tag{3}$$
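
The computation in Equation 3 can be written out step by step. The following pure-Python sketch is illustrative only: the embedding size per feature (`d_val`), the output size `d`, the random weights, and the layer normalization without learned gain and bias are all assumptions, not values or choices reported in the paper.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def oc_embed(I, val, W, b):
    """Map each object row I_i to V0_i = LN(max(W E_i + b, 0))."""
    V0 = []
    for row in I:
        # E_i: concatenation of the per-feature value embeddings Val_j(I_ij).
        E = [x for j, value in enumerate(row) for x in val[j][value]]
        pre = [sum(w * e for w, e in zip(W_row, E)) + b_k
               for W_row, b_k in zip(W, b)]
        V0.append(layer_norm([max(p, 0.0) for p in pre]))  # ReLU, then LN
    return V0

rng = random.Random(0)
n_feat, d_val, d = 18, 4, 8  # d_val and d are illustrative sizes
values = ["NA", 1, 2, 3]
# One embedding table per feature j, shared across objects.
val = [{v: [rng.gauss(0, 1) for _ in range(d_val)] for v in values}
       for _ in range(n_feat)]
W = [[rng.gauss(0, 0.1) for _ in range(n_feat * d_val)] for _ in range(d)]
b = [0.0] * d

I = [[rng.choice(values) for _ in range(n_feat)] for _ in range(2)]  # n_obj = 2
V0 = oc_embed(I, val, W, b)
print(len(V0), len(V0[0]))  # 2 8
```

The point of the sketch is the shape: each object ends up as a single row of size d, so attention later operates over n_obj positions rather than over individual features.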

Figure 2: Comparison of flat-attention (FA) and object-centric (OC) variants. The discrete-valued matrices $I^L$
and $I^S$ (upper-left) encode the features of entities. FA turns each datapoint into a $(n_{obj} \cdot n_{feat}) \times d$
continuous-valued matrix (with $n_{obj} \cdot n_{feat}$ attention weights), while OC produces a $n_{obj} \times d$ continuous-valued matrix
(with $n_{obj}$ attention weights). Numbers index embedding matrices and show weight-sharing. The role
information is encoded afterwards and similarly for masking (not shown here).

As for hidden objects and padding objects,
they are represented using a single embedding
$V^{(0)}_i = v_h$ directly. A role embedding is added to
this representation to obtain

$$V^{(1)}_i = V^{(0)}_i + \mathrm{Role}(r^S_i).$$

Finally, $V$ is a $(n_{obj} + 1) \times d$ matrix, where
$d$ is the size of embedding vectors. $V$ is $V^{(1)}$
with an additional row vector, the β relation
embedding.

The listener cannot distinguish between hid-
den and padding objects, so the message should
encode the roles along with the entities’ features.
In order to reconstruct the speaker’s inputs,
the listener linearly transforms each row vector
$O^L_i$ (except the one corresponding to the
relation) to produce $p(I^S_i \mid I^L, m, \beta)$, the joint pmf
over the discrete features of object $i$ as well as
$p(\gamma_i \mid I^L, m, \beta)$.

2.3.2 Flat Attention Variant

In FA, the structure of the input, which is composed of
different objects with aligned features, is
disregarded. First, the input matrices $I^S$ and $I^L$,
where each row corresponds to a single object,
are ‘‘flattened’’. Second, there is one attention
weight per feature and object pair, instead of a
single weight per object as in the OC variant.
Finally, each embedding matrix is specific to a
role and feature pair, instead of being specific to
a feature.

Formally, let $k$ be the index of a pair of object
indexed by $i$ and feature indexed by $j$. Using a
$k$-specific embedding matrix, we obtain

$$V^{(0)}_k = \mathrm{Val}_k(I_{i,j}),$$

with $V^{(0)}$ a real-valued $(n_{obj} \cdot n_{feat}) \times d$ matrix.
Again, hidden and padding objects are represented
by a special vector $V^{(0)}_k = v_h$. An index embedding
is added, similar to the role embedding:

$$V^{(1)}_k = V^{(0)}_k + \mathrm{Idx}(k).$$

As in the OC variant, we obtain $V$ by adding an
embedding of the relation β as a row to $V^{(1)}$.

To reconstruct the speaker’s inputs, $O^L$ is
linearly transformed and to each output vector
corresponds a specific feature of a specific
object. To predict $\gamma_i$, all the output vectors in
$O^L$ corresponding to the $i$-th object are mean-
and average-pooled, concatenated and linearly
transformed.

3 General Experimental Setup

In the next sections, we review various properties
of natural languages, and introduce metrics to
quantify these in artificial languages and compare
the effect of using OC versus FA on these metrics.
The training set contains 60% of the data, the
validation set 10%, and the test set the rest. We
denote the entire data set by $D$ and denote by $D_k$
the subsets of $D$ composed of examples for which
$\sum_i \alpha_i = k$, that is, the examples where $k$ objects
are hidden.
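
As a small illustration of this notation, the subsets $D_k$ can be obtained by grouping datapoints on the sum of their mask. The datapoint layout used here (a dict with an `"alpha"` key) is a hypothetical choice for the sketch, not the format of the released code.

```python
from collections import defaultdict

def split_by_hidden_count(dataset):
    """Group datapoints into D_k, where k = sum_i alpha_i (number of hidden entities)."""
    subsets = defaultdict(list)
    for datapoint in dataset:
        subsets[sum(datapoint["alpha"])].append(datapoint)
    return dict(subsets)

# Toy dataset: only the masks matter for the grouping.
D = [
    {"id": 0, "alpha": (0, 1)},
    {"id": 1, "alpha": (1, 0)},
    {"id": 2, "alpha": (1, 1)},
    {"id": 3, "alpha": (1,)},
]
D_k = split_by_hidden_count(D)
print({k: [dp["id"] for dp in subset] for k, subset in sorted(D_k.items())})
# {1: [0, 1, 3], 2: [2]}
```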

All the experiments use the EGG framework
(Kharitonov et al., 2021) based on the PyTorch
library (Paszke et al., 2019).3 The neural agents
are trained using Adam (Kingma and Ba, 2014).

There is a large number of hyperparameters so
we resort to random search (Bergstra and Bengio,
2012).4 Our hyperparameter search is deliberately
broad since we do not know a priori which hy-
perparameter choices are realistic. We expect to
obtain results with high variance, but a major
advantage is that we get more robust conclusions by
averaging over unknowns.
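
A random search of this kind amounts to drawing each hyperparameter independently and uniformly from its list of candidate values. The sketch below uses the candidate values listed in footnote 4; the dictionary keys and the function name are hypothetical and chosen only for readability.

```python
import random

# Candidate values as listed in footnote 4; the key names are illustrative.
SEARCH_SPACE = {
    "n_layers": [1, 2, 3],
    "dim": [200, 400],
    "dropout": [0.1, 0.2, 0.3],
    "gumbel_temperature": [0.9, 1.0, 1.5],
    "lam": [0.1, 0.3, 1, 3, 10],
    "n_min": [1, 2],
    "tau": [5, float("inf")],
    "adam_beta2": [0.9, 0.99, 0.999],
}

def sample_config(rng):
    """Draw one configuration, each hyperparameter uniformly and independently."""
    config = {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
    config["adam_beta1"] = 0.9  # fixed, not searched
    return config

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(3)]
for config in configs:
    print(config)
```

Because every run draws from the full grid, conclusions about OC versus FA are averaged over these choices rather than tied to one tuned configuration.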

We perform linear regressions to predict the
value of each metric given a binary variable indi-
cating whether OC is used. When the coefficient
for this variable is significantly different from 0
according to a t-test, then OC has a significant
effect.5 Additionally, we consider that the entropy
of the messages is a mediator that we control for.
For example, the reconstruction error is indirectly
influenced by the vocabulary size and the
sampling temperature via the entropy. However, if
we observe that OC improves the generalization
error, we want to exclude the possibility that this
is because OC agents send messages with higher
entropy than FA agents, since it should be trivial
to also increase the entropy of the FA models by
modifying hyperparameters.

We discard models with messages of average
length below 1 and above 6. Indeed, when the
average length is too small, many messages are
empty, and when it is too long, artificial sentences
are barely or not longer than artificial phrases.
These cases are considered a priori unnatural.
This leaves us with 100 out of 136 runs.

3 The proto-role dataset is available here: http://decomp.io/projects/semantic-proto-roles/.
The code (including on-the-fly preprocessing of the dataset) is available at https://github.com/tombosc/EGG_f/tree/r1/egg/zoo/vd_reco.

4 Hyperparameters (uniformly sampled): # Transformer layers ∈ {1, 2, 3}, and dimensions ∈ {200, 400},
dropout ∈ {0.1, 0.2, 0.3}, Gumbel-Softmax temperature ∈ {0.9, 1.0, 1.5}, λ ∈ {0.1, 0.3, 1, 3, 10},
nmin ∈ {1, 2}, τ ∈ {5, +}. Adam’s parameters: β1 = 0.9, β2 ∈ {0.9, 0.99, 0.999}.

5 We manipulate data using the pandas package (The
pandas development team, 2021; McKinney, 2010), and
perform linear regression with the statsmodels package
(Seabold and Perktold, 2010). We use HC3 covariance
estimation to deal with heteroskedasticity (MacKinnon and
White, 1985; Long and Ervin, 2000).

      Arch.   1 hidden     2 hidden       3 hidden
iD    FA      6.5 ± 1.6    16 ± 3.6       28 ± 5.4
iD    OC      6.2 ± 1.9    14 ± 3.7***    25 ± 5.6**
OoD   FA      8.9 ± 2.1    24 ± 3.9       41 ± 5.5
OoD   OC      8.3 ± 2.4    21 ± 4.6**     39 ± 5.9

Table 1: Mean and stdev of test reconstruction
loss, in distribution (iD) and out of distribution (OoD). Rows:
models; columns: # of hidden entities. OC agents
generalize better than FA agents. (*: p-value < 0.05, **: p-value < 0.01).

Note that the length penalty works as expected. Without the penalty, the messages all contain the maximum number of symbols. With the penalty, the average message length grows as the speaker needs to send more and more information (on $D_1$: 4.19, $D_2$: 5.24, $D_3$: 5.89).

4 Generalization Performance

Natural languages are often said to be productive and systematic: there is an infinity of utterances which we can understand without having encountered them before (productivity), in particular when we understand constituents of the novel sentence in other contexts (systematicity) (Szabó, 2020). Do emergent languages exhibit these characteristics? In this section, we study such generalization abilities.

We measure how well the listener can reconstruct the inputs when the sender communicates about datapoints unseen at train time. Firstly, we test our models in distribution. Secondly, we test our models out of distribution (OoD), following Lazaridou et al. (2018). We compute the empirical marginal distributions over the number of hidden entities, the entities, the roles, and the relations. Then, the OoD test set is sampled from these marginals independently.

We measure the reconstruction losses on subsets where 1, 2, and 3 entities are hidden for a finer-grained analysis.

Results: Table 1 contains the results. As expected, performance degrades when we evaluate out of distribution. More interestingly, OC models perform better than FA models both in distribution and out of distribution.

However, the performance difference between OC and FA does not tell us much: both OC and FA agents could exchange messages that are structured in very unnatural manners. In the next
two sections, we introduce metrics to shed light on how the information about different entities is packaged into a single message.

5 Concatenability

In natural languages, the verb encodes the relation while arguments refer to entities, but roles do not have direct equivalents in all languages. They are encoded using three strategies, typically using a mix of strategies within a single language.

In analytic languages like English or Chinese, roles are encoded in word order and possibly using adpositions or clitics, but the role does not change the form of the arguments. For example, in sentences (1a) and (1b), the arguments are identical but are placed in a reverse order, so that their roles are inverted, too:

(1) a. The old lady walks the dog.
    b. The dog walks the old lady.

In more synthetic languages like Russian or Turkish, case markings code for the roles. In Russian, these markings are suffixes on nouns, adjectives, and so forth, as can be seen in (2a) and (2b):

(2)

Finally, in polysynthetic languages (Caucasian languages, Samoan languages, etc.), arguments typically look like those in analytic languages, but the roles are encoded using markers on the verb.6 Since, in this work, relations are not communicated by agents, there is no artificial equivalent of the verb. Therefore, this strategy cannot emerge and we consider it no further.

6 This presentation is extremely simplified; see, for example, Bakker and Siewierska (2009) for why and how these three strategies generally coexist within a single language.

Crucially, simple sentences are obtained by concatenating a verb and one or several noun phrases that refer to entities, whether word order matters or word order does not matter and cases are marked.

For a single event, by varying what information is available to the listener through the mask $\alpha$, we get messages describing two entities in
isolation (phrases) as well as messages describing two entities at once (sentences). For example, consider $(I^S, (1, 1, 0), r^S, \beta)$ drawn from $D_2$, the subset of the data with two hidden objects. Let $g$ be the function that transforms this speaker's inputs into a message via greedy decoding, and define $m^* = g(I^S, (1, 1, 0), r^S, \beta)$. We obtain the messages sent when L observes the first or the second object in isolation as

$$m_1 = g(I^S, (1, 0, 0), r^S, \beta), \qquad m_2 = g(I^S, (0, 1, 0), r^S, \beta).$$

We define concatenated messages to be $m_{12} = m_1 \oplus m_2$ and $m_{21} = m_2 \oplus m_1$, where $\oplus$ is the concatenation operator. This is shown in Figure 1. We define $P_2$ as the empirical distribution on the subset of $D_2$ such that neither $m_1$ nor $m_2$ is an empty message, implying that $m_{12} \neq m_{21}$.

As argued above, in natural languages, $m_{12}$ or $m_{21}$ (or both, if word order is irrelevant) should convey information at least as well as $m^*$. Denote by $l(m)$ the reconstruction loss incurred by L if L had received the message $m$, that is, $l(m) = -\log p(I^S \mid I^L, m, \beta)$. Then, concatenability from the listener's point of view is defined as

$$C^L = \mathbb{E}_{P_2}\left[ l(m^*) - \min\big(l(m_{12}), l(m_{21})\big) \right].$$

When close to 0, on average, one of the two concatenated messages (or both) is as informative as the message actually uttered by the speaker for reconstructing the inputs.

L can correctly reconstruct S's inputs from a concatenated message that S is unlikely to utter. Inversely, a concatenated utterance can be highly likely for S even if L might fail to reconstruct S's input from it. Therefore, there are actually two symmetrical measures of concatenability, one from the point of view of S and the other from the point of view of L. A similar proposition was made by Lowe et al. (2019) in the context of interactive games. They have shown the usefulness of distinguishing these two points of view.

The metric is defined similarly on the speaker's side with a slight subtlety.
Since sampled messages have a maximum message length of $n_L$, the probability of a sequence longer than $n_L$ is 0. However, concatenated messages are sometimes longer than $n_L$. We define $q_\infty$ as the distribution generated by S without the constraint that probable sequences have length below $n_L$. We denote the conditional log-probability of a message given a certain input by $u(m) = \log q_\infty(m \mid I^S, \alpha, \beta)$. Then, concatenability from the speaker's point of view is defined as

$$C^S = \mathbb{E}_{P_2}\left[ \max\big(u(m_{12}), u(m_{21})\big) - u(m^*) \right].$$

It is close to 0 when, on average, one concatenation of the two messages (or both) has roughly the same probability as the actual message.

To give an intuition, let us go back to our examples. Take the speaker of a hypothetical language, English without verbs. Suppose that this speaker, when exposed to a given input $x^S = (I^S, (1, 1, 0), r^S, \beta)$, produces a sentence $m^*$ corresponding to (1a), "the old lady the dog". By exposing the speaker to the same input, but by changing the mask to $(1, 0, 0)$, they produce $m_1$ = "the lady", while using the mask $(0, 1, 0)$, they produce $m_2$ = "a golden retriever". $C^S$ compares the log probability of $m^*$ with that of $m_{12}$ = "the lady a golden retriever" and $m_{21}$ = "a golden retriever the lady", whichever is more probable. Since English without verbs is rather concatenable, the speaker judges that $m_{12}$ is roughly as likely as $m^*$ given the inputs. Thus, the value inside the expectation of $C^S$ will be high, close to 0.

Now, take an identical speaker, except that they assign a very high probability to $m'_1$ = "a shoebox", while the new $m'_{12}$ and $m'_{21}$ are unlikely conditioned on $x^S$. Then $C^S$ will be low and negative.
Perhaps (i) "a shoebox" has different semantics when it is used alone in a sentence, as compared to when it is used with a second referent; or perhaps (ii) "a shoebox" is never used with another referent in a sentence, and the speaker would use "a lady" instead. In any case, concatenability for this speaker would be low, which corresponds to the intuition that their language is unnatural and unsystematic.7

7 This example only illustrates the intuition. In reality, it is not straightforward to apply these metrics on natural language, because they require probability distributions for the agents. We could learn models that map back and forth between the semantics and the ground-truth utterances, but the models would add some bias. Moreover, we only have ground-truth utterances for English and any attempts to use machine translation would add some more bias.

The same illustration holds for $C^L$, and it can be adapted to show why $C^S$ and $C^L$ should also be high for more synthetic languages.

      $C^L$ ↑          $C^S$ ↑
FA    −6.1 ± 3.8       −29 ± 13
OC    −3.2 ± 2.4***    −26 ± 15

Table 2: Mean and stdev of concatenability metrics on OC and FA runs. (i) OC improves concatenability. Arrows indicate optimal direction. (p-values: *: < 0.05, **: < 0.01, ***: < 0.001).

Results: We measure these metrics on the test set. In our experiments, they always take negative values: the concatenated messages are on average worse than the actual messages. Some models yield values close to 0, but this depends on the choice of hyperparameters.

Table 2 shows that OC largely improves over FA in terms of both $C^L$ and $C^S$. For instance, the reconstruction losses of OC models go up by 3.1 nats on average when the best concatenated messages are used instead of the actually sent messages. In contrast, FA models incur a loss that is higher by 6.1 nats. Thus, languages obtained using the OC architecture are more natural than those emerging from FA in the sense of concatenability.

6 Word Order

6.1 Importance of Word Order

Concatenability metrics do not distinguish between the use of word order or some sort of case marking strategy. Since both strategies are found in natural languages, we claim that for all natural languages, this metric should be high. But we also want to know which particular strategy is used, in particular when concatenability is high.

First, note that it is difficult to detect the presence of case markings directly. Even for the simplest forms of morphology, we are hindered by the segmentation problem of identifying the root and the affix, as mentioned in Section 1.3.8 Yet we can quantify on average how much referential phrases (messages about a single hidden object) encode roles. We train a bigram classifier on the training set and measure its test error, the Role Prediction Error (RPE). If there are case markings, this error will be low (but the opposite is not true).

8 It is generally even more complicated for several reasons: a lexeme can have several roots, each morpheme can simultaneously encode several semantic properties, and the modification of the root can be non-concatenative (Stump, 2017).

Moreover, we introduce two transitivity metrics, to directly measure the importance of word order. $T^S$ is defined as:

$$T^S = \mathbb{E}_{P_2} \left| u(m_{12}) - u(m_{21}) \right|.$$

This metric is 0 if the two concatenated messages are equally probable for S; and it is large if one word order is much more likely than the other for S. Similarly, $T^L$ is defined as

$$T^L = \mathbb{E}_{P_2} \left| l(m_{12}) - l(m_{21}) \right|$$

and has similar interpretations. These metrics are only interpretable when concatenability metrics are high enough, so we measured $T^S$ only for runs where $C^S$ is above the median and similarly for $T^L$.

Figure 3: Role prediction error (RPE) as a function of transitivity $T^L$. Color indicates reconstruction loss. (i) (upper-left quadrant) Low $T^L$ and high RPE implies a high reconstruction error, since roles are not encoded properly. (ii) OC has higher average transitivity than FA, but similar RPE.

      $T^L$ ↑      $T^S$ ↑     RPE ↓
FA    3.4 ± 3.2    10 ± 9      0.48 ± 0.18
OC    11 ± 10*     15 ± 12     0.52 ± 0.12***

Table 3: Mean and stdev of transitivity metrics and RPE for OC and FA. $T^L$ ($T^S$) statistics and significance computed on runs scoring $C^L$ ($C^S$) above median. Arrows indicate optimal direction. OC uses word order more than FA. Controls are discussed in the main text. (p-values: *: < 0.05, **: < 0.01, ***: < 0.001).

Results: As can be seen in Figure 3, when transitivity is low and RPE is high, the reconstruction loss is poor (top-left corner), because there is no efficient strategy to encode roles. There is a lot of variance both for OC and FA, but OC models tend to have higher transitivity, both on average and in terms of maximal values. Thus word order is more important for OC runs than for FA runs. This is also confirmed by Table 3.

Table 3 also shows that OC and FA agents have very similar RPE. This means that both encode roles in referential phrases quantitatively similarly. More work is needed to determine how roles are encoded (when they are), that is, if there are traces of morphology or if messages denoting a single entity in different roles are unrelated.

6.2 Consistency of Word Order

To go further, we can study which word orders are favored across different contexts. For every pair of roles such as AGENT and PATIENT, is it the message with the AGENT uttered first that is more likely, or the opposite? To answer the question, instead of looking at the magnitude of the gap as does $T^S$, we can count which word orders maximize the gap. By finding the most frequent order, we find for each model the preference of the speaker $P^S$, a binary relation on $R^2$. For example,

$$\{(\text{AGENT}, \text{PATIENT}), (\text{PATIENT}, \text{MISC}), (\text{MISC}, \text{AGENT})\} \qquad (4)$$

is such a relation. This is very crude, as it does not distinguish the case where AGENT always precedes PATIENT from the case where AGENT precedes PATIENT 51% of the time, but we leave more involved analyses for future work. We define analogously $P^L$ using the reconstruction loss $l$ instead of message probability $u$.

Results: We compute preferences $P^S$ and $P^L$ for each run. Out of 100 runs, 29 runs have both $C^S$ and $C^L$ higher than their median values, and 23 of these have equal $P^S$ and $P^L$.

Among all possible relations, some are not transitive, such as (4). However, all the preferences we found are transitive, which is extremely unlikely due to chance. A simple explanation is that transitive relations allow agents to discuss three entities with word order only. However, it does not seem to be universally required by natural languages to have well-defined orders in the presence of many roles. For instance, in English, the use of different prepositions allows for different word orders, such as the dative alternation, which offers two different orders to talk about three entities.

7 Qualitative Analysis

One can gain intuition about the metrics by looking at messages exchanged by agents. In particular, we compare two models A and B which both have relatively high concatenability, but A has high transitivity scores whereas those of B are low. The chosen models also have relatively close reconstruction loss, so that the messages convey roughly as much information about their inputs.

To simplify, we focus on one entity vector and see how it is transmitted when it is in different roles and in different contexts. Since feature vectors are slightly sparse (with many NA values), vectors which have many NAs are sometimes not conveyed at all (the penalty makes it costly to do so). We search for an entity that appears in many different roles and that is not too sparse. The 8th most frequent vector ($u_8$) is the most frequent vector that fits these criteria.

Left-hand side
Entities: (−, 8, 1), (−, 8, 5), (−, 8, 190), (−, 8, 39)
α: 0, 1, 0 0, 0, 1 0, 1, 1 0, 1, 0 0, 0, 1 0, 1, 1 0, 1, 0 0, 0, 1 0, 1, 1 0, 1, 0 0, 0, 1 0, 1, 1
A B: 24, 79, 25 105, 16, 105 105 , 79, 24 24, 79, 25 19, 24 47, 79, 24, 25 24, 79, 25 16, 19 105, 47 34, 34 34 , 47 105, 47 18, 18 18 , 24 105, 47 19 16 , 79 , 39, 79 105 , 24 24, 79, 25 16, 44, 16, 72, 2 105, 47 16, 19 44, 16 , 59, 72 105 , 16

Right-hand side
Entities: (8, 4, −), (8, 61, −), (−, 132, 8), (−, 287, 8)
α: 1,0,0 0,1,0 1,1,0 1,0,0 0,1,0 1,1,0 0,1,0 0,0,1 0,1,1 0,1,0 0,0,1 0,1,1
A 79, 24, 24, 79, 24 34, 34, 15 B 18, 1, 18 15, 34, 15 34 , 24, 79, 24, 79, 24 34, 34, 34 , 1, 18 79, 24, 79, 24, 24 94, 54, 25, 94, 72 18, 18, 19 16, 16, 25 94 , 121, 25 , 79, 24, 79, 24 16 , 19 , 24, 19, 18 19, 24, 19 79, 24, 72 24, 19 , 123, 19 35, 19 79, 24, 72 19, 59 47, 71, 105 18, 24, 59 19, 59, 16 47, 71, 105 16, 79 , 19, 35 24, 19, 59, 16

Table 4: A sample of messages exchanged about the same entity $u_8$. Entities: list of entities ("−": no entity; numbers indicate the rank of the entity in the dataset; position in the list indicates the role: AGENT, PATIENT, MISC). α: mask. A, B: Messages produced by speakers of models A and B. Symbols are manually colored to identify phrases (first 2 rows in every block of 3 rows) in artificial sentences (third row in every block). Relations are omitted but are different for each block.

First, let us examine the left-hand side of Table 4, which shows how $u_8$ is talked about in its most frequent role, the PATIENT role. In both models, $u_8$ is denoted by the same phrase very consistently (first rows of each block).
Thus the context of $u_8$ (entities and relation) does not seem to influence the message. This property is sometimes called context-independence (Bogin et al., 2018).

Despite using a large vocabulary of 128 symbols, only a few symbols are used. This is due to the difficulty of discrete optimization. We were puzzled to find so many common symbols in the two models, but it turns out that the selected models have the same hyperparameters except for the length-penalty coefficient (A: λ = 1, B: λ = 10).

Each last row of each block of three lines shows an artificial sentence, where two entities are hidden. We can see that most symbols in these sentences also frequently appear in phrases that denote individual entities (identified by their colors). Some symbols from phrases are omitted or in a different order in the sentence, but the precise structure of these phrases is out of scope for our work.

A is more consistent in its use of word order than B: A almost always refers to MISC before PATIENT, whereas the order varies for B. This is evidence that the transitivity metrics correctly measure the importance of word order, at least when concatenability is high enough.

On the right-hand side of Table 4, $u_8$ appears in less frequent roles, and we see many more irregularities. Firstly, the phrases denoting $u_8$ in isolation are less consistent across different contexts (more context-dependence), even though we find a large overlap of symbols. Secondly, we also found more empty phrases (not shown here). Thirdly, we did not find evidence for a lower transitivity of B in these roles, but the sample size was smaller.

8 Discussion and Limitations

8.1 Partial Observability and Reference

Thanks to our experimental setup and metrics, we avoid the problem of segmentation. However, concatenability and transitivity rely on a crucial aspect of the task, partial observability, which allows us to obtain messages about a single "thing" in isolation. In our case, this "thing" is an entity and role pair, but instead, could it be a single attribute like shape or color, as in simpler referential games used in past research?

Such a setup would be similar to our setup (cf. Section 1.2). However, (i) there would be no relation $\beta$; (ii) $I^S$, $I^L$ and $\alpha$ would be vectors of size $n_{feat}$; (iii) in terms of models, we would use a simple attention mechanism to select a subset of the features to communicate about.

However, we do not think that this setup realistically models real-life communicative situations. Visual properties like shape and color are often perceived simultaneously. If, sometimes, we fail to perceive colors (for example, at night) or shapes (perhaps due to an occlusion), we rarely need to inquire about these attributes. In general, the missing attributes do not particularly matter, but are useful to identify the kind of the entity. For example, the white color and the circular shape of an object tell us that it is a plate, which is useful; but its particular appearance generally does not matter once it has been categorized. Thus, we generally infer the kind from the observed attributes if possible, or else directly ask for the kind.

By contrast, events are often partially observed, which raises many questions. When one observes the consequences of a past action, one often wonders who was the agent that caused it. Similarly, since future events are indeterminate, they are frequently discussed and negotiated. Thus it is frequent to describe events partially.

In sum, the semantics of events are often conveyed partially whereas the semantics of entities are more frequently packaged into the word for a kind.
Thus directly transposing this setup to the referential case seems unrealistic. However, perhaps it could be adapted to a discriminative setup (Lazaridou et al., 2017), where the need to convey partial features of objects is clearer.

8.2 On θ-roles

As inputs to our models, θ-roles are much more salient than any of the 18 features associated with entities: each θ-role is associated with an entire vector added to the keys and values used by the attention mechanisms (cf. Role and Idx in Sections 2.3.1 and 2.3.2). Moreover, there are only three of them and they are mutually exclusive. For these reasons, it is easy to attend over each of them, which explains why many artificial agents rely on θ-roles to structure their messages.

These θ-roles are groups of verb-specific roles (sometimes called participant roles). For example, the LOVER, the EATER, and the BUILDER verb-specific roles are clustered into the verb-general AGENT θ-role, while the LOVEE, the EATEE, and the BUILDEE roles fall under the PATIENT θ-role. Dowty (1991) shows that some θ-roles can be predicted from a small set of features that are mostly related to physical notions of movement and to causality.9 However, since humans perceive many more features (for example, shapes, colors, and textures), it is not clear why these particular features are preferred to structure the grammars of natural languages.

To answer this question, we might be able to use pretrained unsupervised learning models as feature extractors (Santoro et al., 2017; van Steenkiste et al., 2018; Kipf et al., 2018). An object-centric model like R-NEM (van Steenkiste et al., 2018) can extract object representations from videos of physically interacting objects. An interaction model like NRI (Kipf et al., 2018) can infer the relations between objects given object representations over time, such that these relations are predictive of how the objects change over time.
By combining such models, it may be possible to learn object, relation, and role representations from videos. We could then use such learned representations as inputs in our communication games to study whether verb-general roles emerge.

9 These features are precisely the features that are used in this paper to represent the semantics of the entities, but their meaning is irrelevant in this work.

9 Conclusion

We have presented an experimental setup for studying how probabilistic artificial agents predicate, that is, how they convey that a relation holds between entities. In our daily lives, events are partially observed and predication is used to share information about what is not observed, often in a parsimonious manner. Our task and loss realistically reflect this function of language.

At the same time, this setup allows us to directly study argument structure while ignoring the internal structure of phrases. Indeed, we can easily obtain artificial phrases, that is, utterances that refer to single entities, as well as artificial sentences, utterances which express the relation holding between different entities. Then, we can study whether and how artificial phrases are systematically composed to form artificial sentences, via our concatenability and transitivity metrics. Thus we completely sidestep the need to segment artificial sentences into phrases, a complicated problem that is unfortunately ignored in previous works.

More precisely, we have argued that all natural languages should have high concatenability, while transitivity is not necessarily high and merely quantifies the importance of word order.
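
As a compact recap of the two families of metrics, the following sketch computes listener-side and speaker-side concatenability and transitivity from per-example scores, and checks whether a word-order preference relation is transitive. It assumes that the reconstruction losses and log-probabilities of the actual and concatenated messages have already been computed; the function names and the toy numbers are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def concatenability_listener(loss_actual, loss_12, loss_21):
    """C^L: average of l(m*) - min(l(m12), l(m21))."""
    return mean([la - min(a, b)
                 for la, a, b in zip(loss_actual, loss_12, loss_21)])


def concatenability_speaker(logp_actual, logp_12, logp_21):
    """C^S: average of max(u(m12), u(m21)) - u(m*)."""
    return mean([max(a, b) - ua
                 for ua, a, b in zip(logp_actual, logp_12, logp_21)])


def transitivity(score_12, score_21):
    """T^L (with losses) or T^S (with log-probabilities): average |gap|."""
    return mean([abs(a - b) for a, b in zip(score_12, score_21)])


def is_transitive(relation):
    """True if (a, b) and (b, c) in the relation always imply (a, c)."""
    return all((a, d) in relation
               for a, b in relation
               for c, d in relation
               if b == c)


# Toy scores for three examples (made-up numbers).
loss_actual, loss_12, loss_21 = [2.0, 3.0, 1.0], [2.5, 6.0, 1.0], [4.0, 3.5, 3.0]
logp_actual, logp_12, logp_21 = [-1.0, -2.0, -0.5], [-3.0, -2.0, -4.5], [-5.0, -6.0, -1.5]

c_l = concatenability_listener(loss_actual, loss_12, loss_21)
c_s = concatenability_speaker(logp_actual, logp_12, logp_21)
t_l = transitivity(loss_12, loss_21)
t_s = transitivity(logp_12, logp_21)

# The cyclic relation (4) versus a consistent ordering of the three roles.
cyclic = {("AGENT", "PATIENT"), ("PATIENT", "MISC"), ("MISC", "AGENT")}
ordered = {("AGENT", "PATIENT"), ("PATIENT", "MISC"), ("AGENT", "MISC")}
```

Both concatenability values are at most 0 on these toy numbers, in line with the negative values reported in the experiments, and only the `ordered` relation passes the transitivity check.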
Equipped with this setup and these metrics, we have compared a cognitively plausible architecture that leverages the structure of the inputs into objects with properties (OC) against an implausible baseline that ignores this structure (FA). Object-centric models yield more natural languages in terms of concatenability, while also relying more on word order. Moreover, they generalize better than their implausible counterparts, both in distribution and out of distribution.

These results confirm the importance of the input representations and of the architectures leading to the discretization bottleneck, also reported by Lazaridou et al. (2017) and Guo et al. (2019). In our experiments, discrete low-dimensional inputs were processed by task-specific architectures. However, we believe that one can use high-dimensional representations obtained from pretrained models, as long as these representations are prelinguistic, as object-centric representations seem to be.

Our methods could be extended to investigate other aspects of sentences. For instance, how would agents convey relations? To answer this question, we could use the representations learned via relational unsupervised learning algorithms as inputs. We could study how different relations are discretized into one or several symbols, perhaps the equivalent of verbs and adverbs. We could also analyze how relation-specific roles cluster into abstract roles (like θ-roles) and structure grammar.

Acknowledgments

Tom Bosc was financially supported for this research by the Canada CIFAR AI Chair Program. We also thank the Mila IDT team for the computational infrastructure, as well as anonymous reviewers and the action editors for their helpful feedback.

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450v1.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014.
Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473v7.

Dik Bakker and Anna Siewierska. 2009. Case and alternative strategies: Word order and agreement marking. In The Oxford Handbook of Case, edited by Andrej Malchukov and Andrew Spencer, pages 290–303. https://doi.org/10.1093/oxfordhb/9780199206476.013.0020

Marco Baroni. 2020. Rat big, cat eaten! Ideas for a useful deep-agent protolanguage. arXiv preprint arXiv:2003.11922v1.

James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2).

Ben Bogin, Mor Geva, and Jonathan Berant. 2018. Emergence of communication in an interactive world with consistent speakers. arXiv preprint arXiv:1809.00549v1.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/K16-1002

Rahma Chaabouni, Eugene Kharitonov, Diane Bouchacourt, Emmanuel Dupoux, and Marco Baroni. 2020. Compositionality and generalization in emergent languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4427–4442. https://doi.org/10.18653/v1/2020.acl-main.407

Rahma Chaabouni, Eugene Kharitonov, Emmanuel Dupoux, and Marco Baroni. 2019. Anti-efficient encoding in emergent communication. In Advances in Neural Information Processing Systems, volume 32, pages 6293–6303. Curran Associates, Inc.

Herbert H. Clark. 1996. Using Language.
Cambridge University Press. David Dowty. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619. https://doi.org/10.2307/415037, https://doi.org/10.1353/lan.1991 .0021 Dedre Gentner. 1982. Why nouns are learned before verbs: Linguistic relativity versus natural partitioning. Center for the Study of Reading Technical Report; no. 257. Herbert P. Grice. 1975. Logic and conversation. Speech Acts, pages 41–58. Brill. https:// doi.org/10.1163/9789004368811 003 Shangmin Guo, Yi Ren, Serhii Havrylov, Stella Frank, Ivan Titov, and Kenny Smith. 2019. The Emergence of Compositional Languages for Numeric Concepts Through Iterated Learning in Neural Agents. Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel- softmax. arXiv preprint arXiv:1611.01144v5. Eugene Kharitonov, Roberto Dess`ı, Rahma Chaabouni, Diane Bouchacourt, and Marco Baroni. 2021. EGG: A toolkit for research on Emergence of lanGuage in Games. https:// github.com/facebookresearch/EGG. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980v9. Durk P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751. Paul R. Kingsbury and Martha Palmer. 2002. In LREC, From TreeBank to PropBank. pages 1989–1993. Citeseer. Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. 2018. Neural relational inference for interacting sys- tems. In International Conference on Machine Learning, pages 2688–2697. PMLR. Simon Kirby and James R. Hurford. 2002. The emergence of linguistic structure: An overview the iterated learning model. Simulating of the Evolution of Language, pages 121–147. https://doi.org/10.1007/978-1-4471 -0663-0 6 Satwik Kottur, Jos´e Moura, Stefan Lee, and Dhruv Batra. 2017. 
Natural language does not emerge ‘naturally’ in multi-agent dialog. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2962–2967, Copenhagen, Denmark. Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1321

Angeliki Lazaridou and Marco Baroni. 2020. Emergent multi-agent communication in the deep learning era. arXiv preprint arXiv:2006.02419v1.

James R. Hurford. 1989. Biological evolution of the Saussurean sign as a component of the language acquisition device. Lingua, 77(2):187–222. https://doi.org/10.1016/0024-3841(89)90015-6

Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. 2018. Emergence of linguistic communication from referential games with symbolic and pixel input. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Conference Track Proceedings. OpenReview.net.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net.

J. Scott Long and Laurie H. Ervin. 2000. Using heteroscedasticity consistent standard errors in the linear regression model. The American Statistician, 54(3):217–224. https://doi.org/10.2307/2685594, https://doi.org/10.1080/00031305.2000.10474549

Ryan Lowe, Jakob N. Foerster, Y.-Lan Boureau, Joelle Pineau, and Yann N. Dauphin. 2019. On the pitfalls of measuring emergent communication.
In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, Montreal, QC, Canada, May 13–17, 2019, pages 693–701. International Foundation for Autonomous Agents and Multiagent Systems.

James G. MacKinnon and Halbert White. 1985. Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics, 29(3):305–325. https://doi.org/10.1016/0304-4076(85)90158-7

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712v3.

Wes McKinney. 2010. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference, pages 56–61. https://doi.org/10.25080/Majora-92bf1922-00a

Natacha Mendes, Hannes Rakoczy, and Josep Call. 2008. Ape metaphysics: Object individuation without language. Cognition, 106(2):730–749. https://doi.org/10.1016/j.cognition.2007.04.007, PubMed: 17537418

Igor Mordatch and Pieter Abbeel. 2018. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2–7, 2018, pages 1495–1502. AAAI Press.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pages 807–814.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Tom Pelsmaeker and Wilker Aziz. 2019. Effective estimation of deep generative language models. arXiv preprint arXiv:1904.08194. https://doi.org/10.18653/v1/2020.acl-main.646

Drew Reisinger, Rachel Rudinger, Francis Ferraro, Craig Harman, Kyle Rawlins, and Benjamin Van Durme. 2015. Semantic proto-roles. Transactions of the Association for Computational Linguistics, 3:475–488. https://doi.org/10.1162/tacl_a_00152

Adam Santoro, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. 2017. A simple neural network module for relational reasoning. Advances in Neural Information Processing Systems, 30.

Skipper Seabold and Josef Perktold. 2010. statsmodels: Econometric and statistical modeling with Python. In 9th Python in Science Conference. https://doi.org/10.25080/Majora-92bf1922-011

Luc Steels. 1997. The synthetic modeling of language origins. Evolution of Communication, 1(1):1–34. https://doi.org/10.1075/eoc.1.1.02ste

Sjoerd van Steenkiste, Michael Chang, Klaus Greff, and Jürgen Schmidhuber. 2018.
Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions. In International Conference on Learning Representations.

Gregory T. Stump. 2017. Inflection. The Handbook of Morphology, pages 11–43. https://doi.org/10.1002/9781405166348.ch1

Zoltán Gendler Szabó. 2020. Compositionality. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, fall 2020 edition. Metaphysics Research Lab, Stanford University.

The Pandas Development Team. 2021. pandas-dev/pandas: Pandas 1.2.3.

Michael Tomasello. 2010. Origins of Human Communication. MIT Press.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Fei Xu and Susan Carey. 1996. Infants’ metaphysics: The case of numerical identity. Cognitive Psychology, 30(2):111–153. https://doi.org/10.1006/cogp.1996.0005, PubMed: 8635312

George Kingsley Zipf. 1949. Human behavior and the principle of least effort. Ravenio Books.
