Transparency Helps Reveal When Language Models Learn Meaning


Zhaofeng Wu ∗ William Merrill

Hao Peng

Iz Beltagy

Noah A. Smith

MIT

New York University

Allen Institute for Artificial Intelligence

Paul G. Allen School of Computer Science & Engineering, University of Washington
zfw@csail.mit.edu willm@nyu.edu {haop,beltagy,noah}@allenai.org

Abstract

Many current NLP systems are built from
language models trained to optimize unsupervised
objectives on large amounts of raw
text. Under what conditions might such a procedure
acquire meaning? Our systematic experiments
with synthetic data reveal that, for
languages where all expressions have context-independent
denotations (i.e., languages with
strong transparency), both autoregressive
and masked language models successfully
learn to emulate semantic relations between
expressions. However, when denotations are
changed to be context-dependent with the
language otherwise unmodified, this ability
degrades. Turning to natural language, our
experiments with a specific phenomenon—
referential opacity—add to the growing body
of evidence that current language models do
not represent natural language semantics well.
We show this failure relates to the context-dependent
nature of natural language form-meaning
mappings.

1 Introduction

Despite language models’ (LMs) centrality to
recent progress on NLP benchmarks, a formal
characterization of what can be learned from unsupervised
training on large text corpora, and
of what modern language models actually do
learn, remains elusive. Empirically, Tenney et al.
(2019), Kovaleva et al. (2019), and Wu et al. (2021),
among others, all discovered that pretrained LMs
possess unsatisfactory semantic representations.
Traylor et al. (2021) found co-variation between
form and meaning to be insufficient for an LM
to represent lexical semantics. Li et al. (2021), on
the other hand, identified evidence of LMs representing
dynamic semantics (Kamp, 1981; Heim,
1982; Groenendijk and Stokhof, 1991).

From first principles, Bender and Koller (2020)
argued that it is a priori impossible for an ungrounded
system that has access only to linguistic
forms to learn the mapping between those forms
and their grounded denotations. They claimed, as
a thought experiment, that a learner that has access
to all Java code (i.e., form) on GitHub can
never learn execution (i.e., meaning). They nevertheless
acknowledged that the existence of unit
tests, which assert the expected output given input
to blocks of code, could constitute a weak form
of grounding which potentially enables the learning
of meaning.

Formalizing this idea, Merrill et al. (2021) theoretically
proved the possibility of learning (or
more technically, emulating) semantic relations
between expressions in a certain class of formal
languages—those that are strongly transparent,
whose expressions have context-independent
denotations—using an assertion oracle, analogous
to the assertions in unit tests. Moreover,
with an example, they showed the existence of
non-emulatable languages even with an assertion
oracle.

Still, the practical implications of these theoretical
results have not been explored. While
assertions enable the emulation of strongly transparent
languages, it is unclear if existing LM architectures
and objectives achieve emulation given
training data with assertions. Moreover, we do
not know if natural language (NL) is similarly non-emulatable
as Merrill et al.’s (2021) constructed
example, especially since non-transparency does
not always imply non-emulatability. We thus pose
two research questions:

∗This work was done when Zhaofeng Wu was at AI2.
Our code and trained models are released at https://
github.com/ZhaofengWu/transparency.

RQ1. Can current LM architectures and pre-
training objectives emulate the meaning of
strongly transparent languages?



RQ2. Can modern LMs fully emulate the meaning
of natural language which is non-transparent?

We answer RQ1 in the positive (§3): On a
strongly transparent propositional logic language,
autoregressive and masked language models pretrained
on only expressions (form), à la GPT-2
(Radford et al., 2019) and RoBERTa (Liu et al.,
2019c), can consistently compare and evaluate
their values (meaning). We find that a certain
grounding of the pretraining data distribution is
crucial to this ability. We also investigate the
role of transparency for emulatability in a controlled
setting as an intermediate study before
analyzing non-transparent natural language. We
ablate strong transparency from the logic language
while keeping other factors unchanged. We
observe a substantial drop in the LMs’ ability to
emulate meaning, highlighting the importance of
transparency for emulatability.

We then turn to natural language (§4). Referential
opacity is an extensively studied phenomenon
in semantics and philosophy (Quine, 1956;
Kripke, 1972, among others) but has not been
examined in modern NLP. We prove that this
phenomenon entails non-transparency and analyze
how well existing LMs represent it. Our analyses
based on probing and sentence similarity
point to a lack of its representation in the largest
GPT-2 and BERT (Devlin et al., 2019) models
(RQ2). Theoretically, this is a natural language
parallel to the emulation difficulty for our non-transparent
formal language, and further reinforces
the connection between transparency and
meaning emulatability. Practically, through the
lens of strong transparency, our results supplement
prior studies that identified pretrained LMs’ insufficient
semantic representations (Tenney et al.,
2019; Yu and Ettinger, 2020, 2021; Wu et al.,
2021, among others).

2 Background

We follow Merrill et al.’s (2021) operationaliza-
tion of the learning of meaning by emulation and
their definition of strong transparency. We sum-
marize their nomenclature and theoretical re-
sults in this section and provide some examples.
We refer readers to Merrill et al. (2021) for more
details.

At a high level, we take an inferential (Speaks,
2021, §2.2.3) view of meaning. An LM is taken
to understand a language L if it can resolve semantic
relations (e.g., equivalence) between ex-
pressions in L.1 This is achieved through two
procedures: μL maps expressions into representa-
tions based on training data from L, and δ uses
the representations of two expressions to resolve
a semantic relation between them.

2.1 Idiomas
We consider a language L ⊆ Σ∗ over an alphabet
Σ and denote (Σ∗)2 = Σ∗ × Σ∗. We term
members of L sentences. We consider an expression
e ∈ Σ∗ with associated left and right
context κ = ⟨l, r⟩ ∈ (Σ∗)2, such that ler ∈ L is a sentence.
We denote the empty string with λ and the
empty context with λ2.

Definition 1 (Lt). We use the following context-free
grammar (CFG) to specify a propositional
logic language as a running example:

S → (e ∧ e) | (e ∨ e) | (¬e)
e → (e ∧ e) | (e ∨ e) | (¬e) | T | F    (1)

S is the distinguished start symbol and T and F
stand for True and False. We call this language
Lt, where t stands for ‘‘transparent’’ (see §2.5). It
underlies our investigation in §3.

For example, the sentence (((¬T) ∨ F) ∨ (¬T))
belongs to Lt because it can be generated by
this CFG using the steps illustrated in Figure 1.
In this sentence, the expression F has context
⟨(((¬T)∨ , ) ∨ (¬T))⟩.
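To make the running example concrete, here is a minimal Python sketch (our own illustration, not code from the released repository; the sampling probabilities are arbitrary assumptions) that generates sentences from the CFG in Eq. 1 and computes their denotations under conventional propositional logic semantics:

```python
import random

# Grammar of Eq. 1: S -> (e∧e) | (e∨e) | (¬e); e -> (e∧e) | (e∨e) | (¬e) | T | F
# The expansion probabilities below are illustrative assumptions, not the paper's.

def sample_e(depth=0, max_depth=6):
    """Sample an expression e; deeper recursion favors the literals T/F."""
    if depth >= max_depth or random.random() < 0.3:
        return random.choice(["T", "F"])
    op = random.choice(["and", "or", "not"])
    if op == "not":
        return f"(¬{sample_e(depth + 1, max_depth)})"
    left, right = sample_e(depth + 1, max_depth), sample_e(depth + 1, max_depth)
    symbol = "∧" if op == "and" else "∨"
    return f"({left}{symbol}{right})"

def sample_sentence():
    """S only expands to compositional expressions, never a bare literal."""
    while True:
        s = sample_e()
        if s not in ("T", "F"):  # S -> (e∧e) | (e∨e) | (¬e)
            return s

def evaluate(expr):
    """Denotation of an Lt expression: a context-independent truth value."""
    expr = expr.strip()
    if expr == "T":
        return True
    if expr == "F":
        return False
    assert expr[0] == "(" and expr[-1] == ")"
    inner = expr[1:-1]
    if inner.startswith("¬"):
        return not evaluate(inner[1:])
    # Find the top-level connective by tracking parenthesis depth.
    depth = 0
    for i, ch in enumerate(inner):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0 and ch in "∧∨":
            left, right = evaluate(inner[:i]), evaluate(inner[i + 1:])
            return (left and right) if ch == "∧" else (left or right)
    raise ValueError(f"Malformed expression: {expr}")

if __name__ == "__main__":
    s = sample_sentence()
    print(s, "=", "T" if evaluate(s) else "F")
    print(evaluate("(((¬T)∨F)∨(¬T))"))  # False: the Figure 1 sentence denotes F under Lt
```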

2.2 Significado

We consider the denotation of an expression e,
⟦e | κ⟧L, to be its meaning in the context κ.2 We
write ⟦e | κ⟧L = ∅ if e is invalid in κ.

The meaning of a propositional logic expression
can be the value derived from its conventional
semantics, i.e., either T or F. For example,
⟦(T ∧ (¬F))|λ2⟧Lt = T, and ⟦(¬F)|⟨(T∧ , )⟩⟧Lt =
T. For natural language, extensionally, the meaning
of a sentence is its truth value, also either

1This inferentialist perspective can be contrasted with
denotationalism, which says that ‘‘understanding’’ is the
task of mapping an expression to a logical representation of
its meaning (Speaks, 2021, §2.2.3). Inferentialism implic-
itly underlies natural language inference-based evaluation
of NLP models (e.g., Bowman et al., 2015).

2 We overload L to represent both the surface form and

a mapping between form and denotation.


be further reduced to all propositions: ‘‘Corgis
run.’’ is equivalent to ℵ(Corgis run., T) under the
extensional framework.3

2.4 ℵ-emulation: Learning Meaning

Merrill et al. (2021) say that a class of languages
L is ℵ-emulatable if, intuitively, a learner μL with
ℵL-access produces context-independent representations
that allow another function δ to check
the equivalence of any two expressions under any
context without further ℵL-access. Formally, L
is ℵ-emulatable if there exists an oracle Turing
machine μL (that can query ℵL) and a standard
Turing machine δ such that, for all L ∈ L, context
κ ∈ (Σ∗)2, and valid expressions e, e′ in κ,

⟦e|κ⟧L = ⟦e′|κ⟧L ⟺ δ(μL(e), μL(e′) | κ)    (3)

Back to Corgis, an English learner μ can observe
the equivalence of e = ‘‘Corgis’’ and e′ = ‘‘the
cutest dogs’’ in many different contexts κ and
develop their representations. We say that natural
language is emulated if there exists δ that can
decide the equivalence between such expressions
from the representations alone.

The standard pretraining-probing setup is an
intuitive instantiation of μL and δ. A model μL
can query ℵL while pretraining on language L,
and can then produce a representation μL(e)
for any expression e. An equivalence probe δ can
take the (frozen) representations of two expressions
and decide their equivalence in some context.
Importantly, because δ is frozen, it cannot
make any more queries to ℵL. We adopt this
paradigm for analysis in §3 and §4 and elaborate
below.

2.5 Strong Transparency

Definition 2. A language L is strongly transparent
if all of its expressions have context-independent
denotations. That is, for all e ∈ Σ∗,
κ ∈ (Σ∗)2, either ⟦e|κ⟧L = ⟦e|λ2⟧L ≠ ∅ or
⟦e|κ⟧L = ∅.

Under conventional propositional logic semantics,
Lt (Def. 1) is strongly transparent because
the value of every expression is determined by
itself and unaffected by its context. Natural language,
on the other hand, is non-transparent. We

3Assuming that propositions are more frequently true
than false, which tends to be the case pragmatically (Grice,
1975).


Figure 1: An example sentence in our propositional
logic language as specified in Eq. 1. The (¬T) node
c-commands the boxed T node, inverting its meaning in Ln
(§3.5). We mark the denotation of each node under Lt
or Ln.

T or F (Frege, 1892); intensionally, the meaning
is its truth condition, which could be viewed as a
set of possible worlds where the sentence is true
(Carnap, 1947). For a summary of the extension
and intension of other expressions in NL, see
Kearns (2011, §1.3). As an example in English,
extensionally, ⟦An author of this paper believes
that Corgis are the cutest dogs.|λ2⟧ = T.

2.3 Assertion Oracle

To represent assertions in unit tests, Merrill et al.
(2021) considered an assertion oracle which outputs
if two expressions have the same denotation
under the same context. Specifically, for expressions
e, e′ ∈ Σ∗ and κ ∈ (Σ∗)2, the assertion
oracle is defined as

ℵL(e, e′ | κ) = 1 if ⟦e|κ⟧L = ⟦e′|κ⟧L, and 0 otherwise.    (2)
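As an illustration, the oracle of Eq. 2 can be sketched on top of the evaluator from the §2.1 sketch; for the strongly transparent Lt the context argument plays no role (this is our own minimal rendering, not the paper's implementation):

```python
def assertion_oracle(e1, e2, context=("", "")):
    """ℵ_L(e1, e2 | κ): 1 iff the two expressions denote the same value in context κ.

    For Lt the context ⟨l, r⟩ can be ignored, since denotations are
    context-independent (Def. 2); `evaluate` is the Lt evaluator sketched in §2.1.
    """
    return int(evaluate(e1) == evaluate(e2))

# e.g. assertion_oracle("(T∧F)", "(F∨F)") == 1, mirroring the pretraining pair (T∧F)=(F∨F)
```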

LM pretraining corpora could provide ℵ-like
signals. For example, pretraining sequences of the
form e=e’ are a natural analog to an ℵ query.
We adopt this view to pretrain our propositional
logic language in §3. In English and many other
natural languages, copulas are a straightforward
counterpart: ‘‘Corgis are the cutest dogs.’’ is
equivalent to ‘‘Corgis=the cutest dogs.’’ This can


prove in §4 that the NL phenomenon of referen-
tial opacity violates strong transparency.

Merrill et al. (2021) theoretically proved that all
strongly transparent languages are ℵ-emulatable.
In other words, it is possible to learn to emulate
the meaning of these languages with only asser-
tion oracle access. The converse is not necessarily
true4 and hence there may be a weaker condi-
tion than strong transparency that also entails ℵ-
emulatability.

In what follows, we study how their theoretical
results realize empirically. We examine in §3 if
LM architectures and objectives can emulate the
meaning of a strongly transparent language. In
§4, we return to natural language, which is non-transparent,
so Merrill et al.’s (2021) results
do not predict its meaning emulatability.

3 How Well Do Language Models Fare?

While strongly transparent languages are in theory
ℵ-emulatable, it is unknown if existing LM archi-
tectures, coupled with their pretraining objectives,
are able to successfully achieve ℵ-emulation, or
more intuitively, to learn their meaning.

To test this, we synthetically create a strongly
transparent language based on propositional logic.
We pretrain LMs with the same architecture and
similar data scale as GPT-2 and RoBERTa on a
generated pretraining corpus. We then train an
equivalence probe to study if the pretrained rep-
resentations enable ℵ-emulation. The probe is
trained with a sentence pair binary classification
objective and tested on unseen sentences sampled
from the same grammar. Alternatively, we also try
to directly evaluate the value of unseen sentences,
without probe training. To isolate the effect of
strong transparency, we also minimally perturb
this language to be non-transparent and study how
this affects emulatability.

3.1 Data

probabilities are hand-designed. The denotation
of an expression can be computed according to
the conventional semantics of propositional logic,
which, as argued in §2.5, makes Lt transparent.
Figure 1 shows an example. See §A for more
details.

Our CFG rules prevent the atomic sentences
T and F from occurring in the corpus (and (T)
and (F) as well) and only allow compositional
sentences. This ensures the absence of pretraining
sequences like sentence=T and guarantees that
there is no direct grounding to denotations dur-
ing pretraining, but only indirect grounding via ℵ.
This makes the task more difficult than the ℵ-
emulation setup but more realistically transferable
to natural language (§5).

The dataset has 819.2M pretraining sequences
and 1M/10K/10K probe training/validation/test
sentence pairs. All splits have disjoint sentences.
The average sentence length is around 48.6. §A
contains more details including tokenization.

3.2 Pretraining

We pretrain from scratch an autoregressive LM
(ALM) and a masked LM (MLM), respectively
simulating GPT-2-small and RoBERTa-base6
with their original architecture, objective, and, to
the extent possible, hyperparameters. They have
near-identical model size hyperparameters, leading
to 86.8M ALM parameters and 87.0M for
MLM. We sample sentence pairs (a, b) with the
same denotation and format the pretraining sequences
in the form of a=b, such as (T∧F)=
(F∨F), simulating ℵ-access (but restricting queries
to be sentences, a more challenging setup; see
Eq. 2). §3.3 will discuss a necessary form of data
augmentation. We train for 100K steps, 20% of
RoBERTa-base’s training duration and hence data
size, which we found sufficient for convergence
on our data. §B summarizes hyperparameters.
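A minimal sketch of this sequence construction, reusing the sampler and evaluator sketched in §2.1 (rejection sampling here is an assumption; the released pipeline may sample pairs differently):

```python
def sample_pretraining_sequence():
    """Sample two sentences with equal denotation and format them as 'a=b'."""
    a = sample_sentence()
    while True:
        b = sample_sentence()
        if evaluate(a) == evaluate(b):  # ℵ-access restricted to sentence pairs
            return f"{a}={b}"

# e.g. '(T∧F)=(F∨F)' is a possible output, matching the example above
```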

We use a PCFG to construct our propositional
logic dataset because its recursive nature and
context-freeness bear some resemblance to nat-
ural language,5 and because it is convenient for
sampling. The rules are specified in Eq. 1 and the

3.3 Analysis: Probing Lt

Probing is a commonly adopted method to quan-
tify the extent to which a representation encodes
a particular type of linguistic information (Alain
and Bengio, 2017; Liu et al., 2019a; Hewitt and

4Consider, for example, a finite non-transparent language

whose denotation space can be learned by enumeration.

5There are aspects of natural language that a PCFG
does not capture, such as recursion constraints (Karlsson,
2010) and non-context-free phenomena (Shieber, 1985).
However, the goal of this research question is not to
maximally simulate NL, but rather to investigate the distributional
learnability of compositional semantics. Future work
could investigate the effect of moving away from a strict
PCFG.

6We do not follow BERT because next sentence prediction

is not applicable here, but they are otherwise similar.


Manning, 2019, among others). The represen-
tation is frozen, on top of which a lightweight
classifier is trained to predict the information of
interés. As shown in §2.4, this paradigm conve-
niently corresponds to the formalization in Merrill
et al. (2021), and hence we use it to investigate
whether or not pretrained representations encode
sufficient semantic information for equivalence
decisiones.

We probe semantic equivalence from the pre-
trained models for pairs of unseen sentences. We
embed each sentence separately through the pre-
trained model, taking the last token representa-
tion for ALM and the average for MLM.7 Voita
et al. (2019) and Haviv et al. (2022) have shown
that the positional information is diluted at the top
transformer layers of MLMs, but it is crucial for
the truth value in our language. We, therefore, take
a weighted sum (a.k.a. scalar mix) of all layers
for compensation for MLM.8 We also found that
these simple methods for sentence representations
sometimes do not perform well. We hence addi-
tionally consider a variant where the probe is an
attention-weighted mixture of all token positions.
We refer to these two representations as –ATTN
and +ATTN, respectively. See §B for more on their
details. We train a bilinear classifier probe on top
of the sentence representations (Li et al., 2021)
and evaluate it with accuracy on a held-out test
set. For each setting, we train the same probe
with five different random seeds and report their
mean and standard deviation. We report hyperpa-
rameters in §B.
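A simplified PyTorch sketch of the –ATTN sentence representations and the bilinear probe (an illustrative reconstruction under assumed tensor shapes, not the released code):

```python
import torch
import torch.nn as nn

class BilinearProbe(nn.Module):
    """Frozen-encoder equivalence probe: logit = bilinear(u, v) for two sentence vectors."""

    def __init__(self, dim):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, u, v):
        # Trained with binary cross-entropy on frozen representations.
        return self.bilinear(u, v).squeeze(-1)

def alm_sentence_repr(hidden_states):
    """ALM (-ATTN): last-token representation of the final layer.
    hidden_states: tensor of shape [num_layers, seq_len, dim]."""
    return hidden_states[-1, -1]

def mlm_sentence_repr(hidden_states, layer_weights):
    """MLM (-ATTN): token average of a learned scalar mix over all layers."""
    mix = (torch.softmax(layer_weights, dim=0)[:, None, None] * hidden_states).sum(0)
    return mix.mean(dim=0)
```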

Past work has cast doubt on whether probes
faithfully reflect the representation’s encoding of
the information of interest, or if they directly
learn the task (Hewitt and Liang, 2019). This is
an especially important issue here as our +ATTN
sentence representation injects additional train-
able parameters compared to a simple (bi)linear
classifier. To answer this question in our setting,
we follow previous studies (Conneau et al., 2018;
Tenney et al., 2019; Wu et al., 2021, among oth-
ers) and train a randomly initialized and similarly
frozen control model with the same architecture:

7The lack of a next sentence prediction task (Fn. 6) leads

to no supervision for a [CLS] token.

8Formally, μ’s output contains all layer representations.
9ALM Trained +ATTN Ln has a degenerate seed that led to
around 50% accuracy, hence the large variance. It is possible
that additional configuration-specific hyperparameter tuning,
which we did not perform, could reduce this instability.

                   ALM (à la GPT-2)             MLM (à la RoBERTa)
                   Random      Trained          Random      Trained

Probing: –ATTN
  Lt               49.9±0.3    98.8±0.0         50.0±0.4    50.1±0.2
  Ln               50.0±0.3    79.9±0.2         49.9±0.1    49.5±0.1

Probing: +ATTN
  Lt               49.9±0.6    100.0±0.0        50.0±0.4    63.8±1.7
  Ln               50.1±0.4    82.5±20.9        50.2±0.2    49.7±0.3

Direct evaluation
  Lt               50.0        97.0±6.8         50.0        95.4±4.7
  Ln               50.0        91.1±19.9        50.0        50.4±0.8

Table 1: Probing and direct evaluation accuracy
(%) on random and pretrained models with autoregressive
and masked LMs on our propositional
logic test set. We report the results with both our
transparent language Lt and the perturbed language
Ln (§3.5). Probing checks the equivalence
of two sentences, while direct evaluation computes
the value of one sentence. For probing, we
test two ways to obtain sentence representations,
reporting the mean and standard deviation across
five probe training seeds. For direct evaluation,
we report the mean and standard deviation across
our five templates.9

If LMs emulate meaning similarly to Merrill
et al.’s (2021) algorithm, we would expect the
pretrained model to yield higher probing accuracy
than the random model.

Results. The Lt rows in the top two sections
of Table 1 summarize the results. With a simple
sentence representation (–ATTN), the pretrained
ALM achieves near-perfect probing accuracy for
Lt, though MLM performs at chance level. An
attention-based sentence representation enables
63.8% accuracy10 for MLM and improves ALM’s
performance to 100%. Importantly, in this variant,
the random baselines still perform at chance level,
demonstrating that the additional parameters do
not lead to an overly powerful probe. We dis-
cuss the accuracy differences between ALM and
MLM in §5. These results demonstrate that pre-
training enables meaning emulation, though the
meaning representation can be more deeply en-
coded than what can be extracted with a (bi)linear

10With additional linear layers, it could go up to 83.4%±2.0
while the random model still performs at chance level. Nosotros
did not include this in Table 1 for consistency with other
settings.


                  –Symmetry                        +Symmetry
–Reflexivity      a=b                  50.5±0.4    a=b, b=a                 50.3±0.3
+Reflexivity      a=b, a=a, b=b        92.7±0.1    a=b, b=a, a=a, b=b       98.8±0.0

Table 2: ALM probing accuracy (–ATTN; %) on
our propositional logic test set with pretraining
data with different properties, where a, b are ex-
pressions in Lt. We report the mean and standard
deviation across five probe training seeds.

probe. We note that it is expected that the perfor-
mance of pretrained models does not reach 100%.
While Merrill et al. (2021) showed its theoreti-
cal possibility, their setup assumes active learning
with unlimited access to ℵ and allows the ‘‘probe’’
δ to be an arbitrarily powerful function, among
other differences.

Grounding. We found that independently sam-
pling pretraining sequences results in unsuccessful
emulation with probing performance at random.
Instead, it is crucial to ground = with reflexivity
and symmetry.11 We achieve this by augmenting
the pretraining data: if a=b is a pretraining
sequence, we ensure a=a, b=b (reflexivity), and
b=a (symmetry) are too. This imposes a constraint
on the pretraining data distribution that
eases the learning of =’s meaning. Table 2 shows
that both properties are important. We consider
the implication in §5.
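A sketch of this augmentation step (an assumed implementation of the described constraint, not the paper's exact code):

```python
def augment_with_grounding(pairs):
    """Given sampled pairs (a, b) with equal denotations, emit pretraining sequences
    that also ground '=' with symmetry (b=a) and reflexivity (a=a, b=b)."""
    sequences = []
    for a, b in pairs:
        sequences.extend([f"{a}={b}", f"{b}={a}", f"{a}={a}", f"{b}={b}"])
    return sequences
```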

3.4 Analysis: Direct Evaluation on Lt

The process of training a probe introduces ad-
ditional complexity, such as ±ATTN, that poten-
tially complicates our analysis. Therefore, we also
test a stronger condition where there is no addi-
tional classifier: Can the pretrained models evalu-
ate expressions, without any further training (e.g.,
a probe)? For MLM, it is the most straightforward
to compare if the model assigns a higher proba-
bility to T or F in sentence=[MASK]. However,
this is a sequence that never occurs in the pre-
training corpus since a standalone T or F is not
part of our language (Eq. 1). Therefore, we use
five templates on the right-hand side that are min-

11Reflexivity states that a = a, and symmetry a = b ⇒
b = a. Equality further requires transitivity: a = b ∧ b =
c ⇒ a = c, but it is not tested in our probing setup and
we found it unimportant for probing accuracy in preliminary
experiments.

imal in our language: (T∧[MASK]), (F∨[MASK]),
([MASK]∧T), ([MASK]∨F), (¬[MASK]). For the
first four templates, we expect the masked po-
sition to be filled with the truth value of the
proposition, and the negated value for the last one.
For ALM, we compare if the model assigns a
higher probability to the sequence where [MASK]
is filled in with T vs. F.
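A sketch of the MLM variant of this direct evaluation (illustrative; `mask_logits_fn` is a hypothetical stand-in for scoring the masked position with the pretrained model):

```python
def direct_eval_mlm(sentence, gold_value, mask_logits_fn, templates=None):
    """Score a sentence by comparing P(T) vs. P(F) at the [MASK] slot of each template.

    mask_logits_fn(sequence) is assumed to return a dict token -> logit at the
    masked position; gold_value is the sentence's denotation, 'T' or 'F'.
    """
    templates = templates or [
        ("(T∧[MASK])", False),   # second element: does the template negate the value?
        ("(F∨[MASK])", False),
        ("([MASK]∧T)", False),
        ("([MASK]∨F)", False),
        ("(¬[MASK])", True),
    ]
    correct = 0
    for template, negated in templates:
        sequence = f"{sentence}={template}"
        target = gold_value if not negated else ("F" if gold_value == "T" else "T")
        logits = mask_logits_fn(sequence)
        prediction = "T" if logits["T"] > logits["F"] else "F"
        correct += int(prediction == target)
    return correct / len(templates)
```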

Results. The bottom section of Table 1 shows
the mean and standard deviation of the evaluation
accuracy across our five templates. Without training,
a random model always has 50.0% accuracy
on expectation. Both ALM and MLM achieve a
high evaluation accuracy, above 95%, corroborating
the LMs’ capability to represent the meaning
of Lt.

These results respond to the argument in
Bender and Koller (2020):

    We let GPT-2 complete the simple arithmetic
    problem Three plus five equals.
    The five responses below […] show that
    this problem is beyond the current capability
    of GPT-2, and, we would argue,
    any pure LM.

We showed that form-only supervision does allow
such evaluation on a strongly transparent
language, at least when the supervising data distribution
satisfies symmetry and reflexivity.

3.5 Non-transparency

Building towards non-transparent natural language,
it is important to understand strong transparency’s
effect on emulatability. We design a
minimally perturbed version of Lt that is non-transparent,
Ln. The syntax stays the same, but
we change the semantics such that ¬ has a side
effect: When followed by T or F, it inverts the
meaning of these literals that occur in certain
other environments. Specifically, each (¬T) node
changes the meaning of all the literals T in its c-commanded
subtree (i.e., the e subtree headed by
the (¬T) node’s sibling, if there is one; Reinhart,
1976) to F. An additional (¬T) does not invert
back. Similarly, (¬F) changes the meaning of
the literal F to T. For example, in the sentence
in Figure 1, the boxed T node is c-commanded by (or,
a descendant of a sibling of) the e → (¬T)
node, so its meaning is changed to F. On
the other hand, the e → (¬ ) node does

not invert the meaning of the unboxed T because
they do not constitute a c-command relation. This
alternation is inspired by binding theory in generative
grammar (Chomsky, 1981, 1983), where
the (¬T) node is the binder that c-commands
the bindee. Since the meaning of T and F now
depends on the existence of a binder, Ln is non-
transparent.12
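The following Python sketch spells out one reading of Ln's semantics (our own illustration; in particular, resolving binders and flip targets by their surface literals is an assumption where the description leaves room for interpretation):

```python
# A sketch of the perturbed semantics Ln (§3.5): a (¬T) node flips every T literal
# in its sibling's subtree to F, and (¬F) flips F literals to T; repeated binders
# do not flip back.

class Node:
    def __init__(self, op, children=None, literal=None):
        self.op = op                 # 'lit', '¬', '∧', or '∨'
        self.children = children or []
        self.original = literal      # surface literal ('T'/'F') for op == 'lit'
        self.literal = literal       # current meaning, possibly flipped under Ln

def parse(expr):
    """Parse an expression of the grammar in Eq. 1 into a tree."""
    if expr in ("T", "F"):
        return Node("lit", literal=expr)
    inner = expr[1:-1]
    if inner.startswith("¬"):
        return Node("¬", [parse(inner[1:])])
    depth = 0
    for i, ch in enumerate(inner):
        depth += ch == "("
        depth -= ch == ")"
        if depth == 0 and ch in "∧∨":
            return Node(ch, [parse(inner[:i]), parse(inner[i + 1:])])
    raise ValueError(expr)

def is_binder(node):
    """A (¬T) or (¬F) node acting as a binder."""
    return node.op == "¬" and node.children[0].op == "lit"

def flip_literals(node, value):
    """Flip every literal whose surface form equals `value` in this subtree.
    Comparing against the surface literal means repeated binders cannot toggle back."""
    if node.op == "lit":
        if node.original == value:
            node.literal = "F" if value == "T" else "T"
    else:
        for child in node.children:
            flip_literals(child, value)

def apply_binders(node):
    """Apply the c-command side effect: a binder rewrites its sibling's subtree."""
    if node.op in ("∧", "∨"):
        left, right = node.children
        if is_binder(left):
            flip_literals(right, left.children[0].original)
        if is_binder(right):
            flip_literals(left, right.children[0].original)
    for child in node.children:
        apply_binders(child)

def evaluate_node(node):
    if node.op == "lit":
        return node.literal == "T"
    if node.op == "¬":
        return not evaluate_node(node.children[0])
    left, right = (evaluate_node(c) for c in node.children)
    return (left and right) if node.op == "∧" else (left or right)

def evaluate_Ln(expr):
    tree = parse(expr)
    apply_binders(tree)
    return evaluate_node(tree)

# The Figure 1 sentence denotes F under Lt but T under Ln, because the top-level
# (¬T) c-commands and flips the T inside the left (¬T):
# evaluate_Ln("(((¬T)∨F)∨(¬T))") -> True
```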

Results. We conduct the same pretraining/
probing/direct evaluation procedure on Ln. Table 1
reports the results. Non-transparency decreases
ALM’s probing accuracy with both –ATTN and
+ATTN, though not to random level. The variance
across different probe training seeds also increases
compared to Lt, indicating that the pretrained
representation is less robust. Directly evaluating
ALM with Ln similarly leads to both decreased
average accuracy and increased variance. MLM,
on the other hand, achieves random probing and
evaluation accuracy. Overall, the lack of strong
transparency reduces models’ meaning emulation
ability, though not always to chance performance.

4 What About Natural Language?

While existing LM architectures and objectives
are able to emulate the meaning of synthetic
languages, it is unclear how these observations
transfer to natural language (NL). Merrill et al.
(2021) hinted that, since NL is non-transparent
and likely more complex than their constructed
non-emulatable language, it is probable that a
pretraining procedure, even with ℵ-access, cannot
emulate its meaning either. This, however,
remained an untested hypothesis.

We formalize this intuition and prove that
a specific NL phenomenon, referential opacity,
makes NL non-transparent.13 This phenomenon
has been widely studied in semantics (Quine,
1956; Kripke, 1972, among others), yet it has re-
ceived little attention in modern NLP. We fill this
gap from the perspective of strong transparency
and study the representation of this phenomenon

12This is a straightforward way to introduce a ¬ with side
effect to a hierarchical structure. An alternative is to rely
on a linear structure and invert all literals linearly follow-
ing ¬. However, our version leverages the hierarchical
reasoning that the model originally needs to possess to eval-
uate an expression, while this version requires a new type
of reasoning that is linear. So that change would be less
minimal.

13Deictic expressions are another example, though they
have been extensively studied under coreference resolution.

in modern LMs with a probing-based and a
sentence similarity-based analysis.

4.1 Referential Opacity

To illustrate referential opacity, we use the classic
example in semantics:

Example 1.

(a) Lois Lane believes Superman is a hero.

(b) Lois Lane believes Clark Kent is a hero.

Note that (a) and (b) have different truth conditions:
Their truth values differ if Lois Lane
does not know Superman and Clark Kent are
the same person. Formally, ⟦Lois Lane believes
Superman is a hero.|λ2⟧ ≠ ⟦Lois Lane believes
Clark Kent is a hero.|λ2⟧.14 On the other hand,
⟦Superman|λ2⟧ = ⟦Clark Kent|λ2⟧.15 In other
words, two expressions that have the same denotation,
when embedded in the same context, yield
sentences with different truth conditions. Such
contexts are called referentially opaque, and, in
this case, they are induced by a propositional attitude
verb ‘‘believes’’ whose meaning depends
on the cognitive state of its subject (Anderson
and Owens, 1990).

Now we formalize referential opacity:

Definition 3. In natural language, an expression
e is contextually valid in κ = ⟨l, r⟩ if none of
⟦l|λ, er⟧, ⟦e|l, r⟧, ⟦r|le, λ⟧ is ∅.16

Definition 4. A context κ = ⟨l, r⟩ in natural
language is referentially opaque if there exist expressions
e1, e2, both contextually valid in κ, such
that ⟦e1|λ2⟧ = ⟦e2|λ2⟧ and ⟦le1r|λ2⟧ ≠ ⟦le2r|λ2⟧.
Def. 4 matches the linguistic phenomenon:
Let e1=‘‘Superman’’, e2=‘‘Clark Kent’’, and the
opaque context κ=⟨‘‘Lois Lane believes’’, ‘‘is a
hero.’’⟩, and we recover our analysis of Ex. 1
above.

14In this section we consider the language L to be English,
or any NL that exhibits this phenomenon, and ⟦·|·⟧ to be
intensions (§2.2). We drop the subscript L for brevity.

15It is possible to argue that ⟦Superman|λ2⟧ ≠
⟦Clark Kent|λ2⟧ if we consider their intension to be different.
However, we adopt the view of Heim and Kratzer
(1998, §12.3) to not introduce intensionality by default
(i.e., with κ = λ2), but rather to evoke it by context: ‘‘The
usual denotations are extensions. But for nonextensional
contexts, Intensional Functional Application allows a switch
to intensions. The switch is triggered by particular lexical
items […]’’.

16This is a technical detail needed for proving Theorem 1.


Now, we prove that the existence of referentially
opaque contexts implies non-transparency.
We assume compositionality, for which we provide
a working definition: ⟦ler|λ2⟧ = f(⟦l|λ, er⟧,
⟦e|l, r⟧, ⟦r|le, λ⟧) for some meaning composition
function f.17 Intuitively, the proof shows that
if all expressions have fixed meaning (i.e., are
strongly transparent), referential opacity would
not arise.

Theorem 1. A compositional language with
referentially opaque contexts is not strongly
transparent.

Proof. Suppose by contradiction we have such a
language L that is strongly transparent. Let e1, e2
be expressions in some opaque context ⟨l, r⟩ in L.

⟦le1r|λ2⟧ = f(⟦l|λ, e1r⟧, ⟦e1|l, r⟧, ⟦r|le1, λ⟧)    By compositionality
          = f(⟦l|λ, e2r⟧, ⟦e1|λ2⟧, ⟦r|le2, λ⟧)     By strong transparency
          = f(⟦l|λ, e2r⟧, ⟦e2|λ2⟧, ⟦r|le2, λ⟧)     By referential opacity premise
          = f(⟦l|λ, e2r⟧, ⟦e2|l, r⟧, ⟦r|le2, λ⟧)   By strong transparency
          = ⟦le2r|λ2⟧                              By compositionality

This violates ⟦le1r|λ2⟧ ≠ ⟦le2r|λ2⟧, the referential
opacity premise. So L is not strongly transparent.

Therefore, as a non-transparent example in NL,
we study whether referential opacity is reflected
in the representation of current LMs.

4.2 Data

We cast referential opacity as a sentence pair bi-
nary classification problem. We generate sentence
pairs like Ex. 1 as our dataset. Ex. 1 consists of
two parts that correspond to the two conditions in
Def. 4: two co-referring expressions (⟦e1|λ2⟧ =
⟦e2|λ2⟧), and a referentially opaque context
that embeds the entity (⟦le1r|λ2⟧ ≠ ⟦le2r|λ2⟧).
Next, we separately introduce how we generate
them. Our final dataset consists of 45K/6K/6K
training/development/testing sentence pairs for
GPT-2 and 97K/12K/12K for BERT. §C provides

17This is a mild assumption, considering the generality of
compositionality (Fodor and Pylyshyn, 1988) and that our
definition is weak, e.g., weaker than that of Andreas’s (2019).


more details, including more fine-grained dataset
statistics for different experimental settings below.

Co-referring Expressions. The co-referring
expressions in Ex. 1 are proper names, ‘‘Superman’’
and ‘‘Clark Kent.’’ Not only is this hard to
collect data for, but, due to the rigidity of proper
names (Kripke, 1972), it is also theoretically
more challenging to analyze as the classic in-
tensionality framework is more difficult to apply
(Von Fintel and Heim, 2011).18 We hence con-
sider co-referring expressions that are one proper
name and one definite description, such as ‘‘Yuri
Gagarin’’ and ‘‘the first person in space,’’ which
can be more straightforwardly accounted for with
intensionality (Heim and Kratzer, 1998, §12; Von
Fintel and Heim, 2011). We use the LAMA data-
colocar (Petroni et al., 2019), specifically the T-REx
dividir (Elsahar et al., 2018) following recent fac-
tual probing work (Jiang et al., 2020; Shin et al.,
2020; Zhong et al., 2021), to obtain a list of such
entidades. To make sure the model representation
captures the coreference, we follow Petroni et al.
(2019) and use LAMA to prompt the LM with
these equivalences and only keep entities that are
correctly predicted.19
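A sketch of this filtering step, assuming a HuggingFace masked LM and a simplified single-token cloze prompt in place of the actual LAMA/T-REx templates (illustrative only):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-large-cased").eval()

def keep_entity(name, definite_description):
    """Keep an entity only if the LM recovers the name from a cloze prompt about the
    definite description (a simplified stand-in for the LAMA/T-REx templates)."""
    prompt = f"{definite_description} is {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0, 0]
    predicted_id = logits[0, mask_index].argmax().item()
    return tokenizer.decode([predicted_id]).strip() in name

# e.g. keep_entity("Yuri Gagarin", "The first person in space")
```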

Contexts. We construct referentially opaque
and referentially transparent contexts to embed
these co-referring expressions. We only consider
referential opacity involving propositional attitude
verbs, where the context is referentially opaque
iff its main verb conveys propositional attitude.
There are other types of referential opacity, such
as counterfactuals (Von Fintel and Heim, 2011;
Kearns, 2011, §7) and substitutions that shift the
syntactic status of constituents (e.g., Fine, 1990),
that we omit in this work for simplicity, although
they could be targets of future studies. We manually
design two classes of templates, depending
on the verb’s argument structure. The first has an
embedded clause, e.g.,
Example 2. Label = non-equivalent20

18Though see Shabasson (2018) for a theorization.
19Prior work (Poerner et al., 2020; Dufter et al., 2021;
Cao et al., 2021) questioned whether such prompting mea-
sures model ‘‘understanding.’’ Our setup, though, does not
depend on ‘‘understanding’’, but only requires association.

20Consider, for example, if this person is Yuri’s neighbor
and wants to meet him for dinner, but, being an avid
flat-earther, is not fond of space traveling and is unaware
that he has been to space. She would say she wants to meet
Yuri Gagarin but has no interest in meeting the first person
in space.


(a) She wants to meet Yuri Gagarin.

(b) She wants to meet the first person in space.

The second contains only the main clause, such as

Example 3. Label = equivalent

(a) He speaks Lao.

(b) He speaks the official language of Laos.

The two sentences in a pair only differ by the
entity reference: one is a name and one is a definite
description. A sentence pair is non-equivalent iff
it has a referentially opaque context, or within our
scope of study, iff its main verb is a propositional
attitude verb. We gather the list of verbs from past
linguistic studies and verify with native speaker
judgment (see §C).

4.3 Models
We consider GPT-2-XL and BERT-large-cased21,
the largest variants in these two families, as repre-
sentative autoregressive and masked LMs. They
have 1.5B and 340M parameters, respectively.
We obtain sentence representations in the same
way as in §3, except without attention-weighting
and simply using the [CLS] embedding for BERT.

4.4 Analysis: Probing
We use the same bilinear probe in §3 as a bi-
nary classifier over sentence pairs, determining
the equivalence, or the referential transparency, of
each pair. However, because of the lexicalized na-
ture of referential opacity, the probe could easily
overfit and recognize not their equivalence but the
existence of a propositional attitude verb.

To overcome this, we introduce attractors
(Linzen et al., 2016; Gulordava et al., 2018; Pandia
and Ettinger, 2021, among others).22 We always
conjoin a clause with a propositional attitude verb
and one with a non-attitude verb, disallowing the
aforementioned heuristics. The equivalence label
now depends on if the entity alternation occurs
under the non-attitude verb, which would result
in an equivalent sentence pair, or the attitude

21Not RoBERTa as in §3, because BERT’s [CLS] token
can act as and is commonly taken to be the sentence representation
(Devlin et al., 2019; Karpukhin et al., 2020, among
others).

22Another option is to have disjoint training and testing
verbs. This did not work in preliminary experiments be-
cause verbs that induce referential opacity are semantically
closer, as they always convey propositional attitude. So
the model could use this similarity in the word embedding
space to extrapolate.

                 GPT-2           BERT

Simple
  Equiv.         100.0±0.00      100.0±0.00
  Non-equiv.     100.0±0.00      100.0±0.00
  Overall        100.0±0.00      100.0±0.00

Coord.
  Equiv.         85.3±0.03       72.4±0.03
  Non-equiv.     15.5±0.03       29.6±0.03
  Overall        50.4±0.00       51.0±0.00


Table 3: Probing accuracy (%) for referential
opacity on GPT-2-XL and BERT-large-cased. We
report the mean and standard deviation across 10
seeds. We consider two types of sentences, simple
sentences without attractors and coordinated
sentences with attractors. For each type, we show
both the label-specific accuracy (Equivalent/Non-equivalent)
and the overall accuracy.

verb, which would lead to non-equivalence. For
example:

Example 4. Label = equivalent

(a) He speaks Lao and she wants to meet Yuri

Gagarin.

(b) He speaks the official language of Laos and

she wants to meet Yuri Gagarin.

Example 5. Label = non-equivalent

(a) He speaks Lao and she wants to meet Yuri

Gagarin.

(b) He speaks Lao and she wants to meet the first

person in space.

Despite both examples having the same verbs, the
sentence pair in Ex. 4 is equivalent, but Ex. 5 is
not. We are not using attractors for out-of-domain
evaluation; instead, the training and test sets are
i.i.d., but we break down the test set performance
by categories.
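A sketch of how such coordinated pairs can be assembled from templates (the clause templates and entities here are illustrative placeholders, not the paper's resources; see §C for the actual construction):

```python
def make_attractor_pair(plain_clause, attitude_clause, plain_entities, attitude_entities,
                        alternate_under_attitude):
    """Build a coordinated sentence pair with one attitude clause and one plain clause.
    The pair is non-equivalent iff the name/description alternation happens under the
    propositional-attitude verb (Ex. 5); otherwise it is equivalent (Ex. 4)."""
    plain_name, plain_description = plain_entities
    att_name, att_description = attitude_entities
    if alternate_under_attitude:
        s1 = f"{plain_clause.format(plain_name)} and {attitude_clause.format(att_name)}"
        s2 = f"{plain_clause.format(plain_name)} and {attitude_clause.format(att_description)}"
        return s1, s2, "non-equivalent"
    s1 = f"{plain_clause.format(plain_name)} and {attitude_clause.format(att_name)}"
    s2 = f"{plain_clause.format(plain_description)} and {attitude_clause.format(att_name)}"
    return s1, s2, "equivalent"

# Mirroring Ex. 4 and Ex. 5:
# make_attractor_pair("He speaks {}", "she wants to meet {}",
#                     ("Lao", "the official language of Laos"),
#                     ("Yuri Gagarin", "the first person in space"),
#                     alternate_under_attitude=True)
```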

We train a probe on GPT-2-XL and BERT-large
over 10 random seeds. Details are in §D. Table 3
reports the results. As expected, both models
overfit with the attractor-less simple sentences,
achieving perfect accuracy. With attractors in coordinated
sentences, however, both models obtain
near-random performance overall. Because
the training and test sets are i.i.d., this means that
semantic equivalence based on referential opacity
cannot be probed in our setup from these two
models, suggesting an inadequate representation


of this phenomenon.23 Interestingly, both mod-
els tend to predict equivalence more than non-
equivalence (more prominent with GPT-2 than
BERT), likely due to the nuanced nature of this
task: Without
training, a human would likely
judge equivalence on referentially opaque sen-
tence pairs too.24 See §E for a set of experiments
that show that LMs can potentially learn to cap-
ture referential opacity with semantic supervision
following pretraining.

4.5 Analysis: Sentence Similarity
As in §3.4, the simplicity of a training-free anal-
ysis can be desirable. To this end, we directly
measure the cosine similarity between the two
sentence representations in a pair. While this se-
mantic similarity would be high for both groups of
sentences by our construction, equivalent sentence
pairs should have more similar representations
than those that are not. While factors other than
semantics, such as syntax, also affect sentence
representations, we strictly control them in our
synthetic data generation to be identical between
referentially transparent and opaque sentences.
We do not consider attractor sentences (§4.4) in
this analysis.

For significance testing, we employ an exact
permutation test (Fisher, 1935) and a bootstrap
test (Efron and Tibshirani, 1993) with 1,000 iterations,
performed across verbs, where the test
statistic is the difference between the averaged
cosine similarity of the two groups. Both tests
are two-sided with the null hypothesis being that
the model representation does not distinguish be-
tween the two classes of verbs. For GPT-2-XL,
the permutation test gives p = 0.64 and bootstrap
gives p = 0.66, barring us from rejecting the null
hypothesis. For BERT-large, they give p = 0.45
and p = 0.57 respectively, where we again ob-
serve no significant difference between the two
classes. However, we note that the inability to
reject the null hypothesis does not entail it is true.
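A sketch of these two tests over per-verb similarity scores (our own illustration; the exact permutation test enumerates group assignments, and the bootstrap follows Efron and Tibshirani's recentering recipe as an assumption about the specific variant used):

```python
import random
from itertools import combinations

def mean(xs):
    return sum(xs) / len(xs)

def test_statistic(opaque_sims, transparent_sims):
    """Difference between averaged per-verb cosine similarities of the two groups."""
    return mean(opaque_sims) - mean(transparent_sims)

def exact_permutation_test(opaque_sims, transparent_sims):
    """Two-sided exact permutation test; feasible for the small number of verbs here."""
    observed = abs(test_statistic(opaque_sims, transparent_sims))
    pooled, k = opaque_sims + transparent_sims, len(opaque_sims)
    extreme, total = 0, 0
    for idx in combinations(range(len(pooled)), k):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        extreme += abs(test_statistic(group_a, group_b)) >= observed
        total += 1
    return extreme / total

def bootstrap_test(opaque_sims, transparent_sims, iterations=1000, seed=0):
    """Two-sided bootstrap test: recenter both groups to the pooled mean,
    resample within each group, and compare to the observed statistic."""
    rng = random.Random(seed)
    observed = abs(test_statistic(opaque_sims, transparent_sims))
    pooled_mean = mean(opaque_sims + transparent_sims)
    a0 = [x - mean(opaque_sims) + pooled_mean for x in opaque_sims]
    b0 = [x - mean(transparent_sims) + pooled_mean for x in transparent_sims]
    extreme = 0
    for _ in range(iterations):
        a = [rng.choice(a0) for _ in a0]
        b = [rng.choice(b0) for _ in b0]
        extreme += abs(test_statistic(a, b)) >= observed
    return extreme / iterations
```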
Reimers and Gurevych (2019) noted that
computing sentence pair cosine similarity using
BERT’s [CLS] token, as we did, does not cor-
relate well with textual similarity benchmarks.

23There might still be other more complex heuristics, but
even so, the probe still fails. Hence we do not need addi-
tional attractors to rule out all possible heuristics.

24Though, with training, it is relatively straightforward to
perform this task for a human, so it is reasonable to test the
ability in LMs.

This phenomenon is commonly attributed to the
anisotropic nature of pretrained representations
(Ethayarajh, 2019). This does not undermine the
validity of our method, which instead relies on
the correlation between the cosine similarity and
the model’s representation of semantic close-
ness. We ensure this correlation by controlling
for all factors other than semantics (syntax, lexi-
cal choices, entities, etc.). Nevertheless, we also
postprocess BERT’s [CLS] representation using
BERT-flow (Li et al., 2020), which has been shown
to increase the correlation with textual similarity
benchmarks. We obtain a similar result: Bootstrap
gives p = 0.49. While the two-sided permutation
test gives p = 0.03 with potential significance,
the one-sided version gives p = 0.99; in other
palabras, the calibrated space represents opaque sen-
tence pairs to be more similar than transparent
unos, contrary to our expectation that equivalent
sentence pairs should be closer in the represen-
tation space than non-equivalent ones when all
other factors are controlled.

The results from these two sets of analyses in
§4.4 and §4.5 are consistent and show no evi-
dence of modern LMs representing referential
opacity, demonstrating that they cannot fully
emulate the meaning of NL. Our finding adds
to recent observations that pretrained LMs do
not represent semantic phenomena well (Tenney
et al., 2019; Kovaleva et al., 2019; Wu et al., 2021,
among others). Theoretically, it also strengthens
the connection between strong transparency and
meaning emulatability with NL-based empirical
evidence.

5 Discussion

Through analyses based on probing and direct
evaluation, we have seen that existing LM architectures
and objectives can learn to emulate
the meaning of a strongly transparent language
Lt when the training data reflects equivalence
relations. While non-transparency (Ln) causes
this ability to decrease, the trained models still
outperform a random model in certain setups. We
believe this result hints at the strength of current
LM architectures and objectives.25 There seems
to be a limit to this strength, though—in natural

25Especially since our setting is more challenging than
Merrill et al.’s (2021) algorithm, without their unlimited
ℵ-access, active learning, arbitrarily powerful δ, etc. Plus,
we restrict ℵ queries to be sentences and disallow compar-
ing a sentence with T or F using ℵ.


language, neither GPT-2 nor BERT represents
the non-transparent phenomenon of referential
opacity well.

Our results shed light on the relationship be-
tween the strong transparency of a language
and whether its semantics can be emulated. We
observed co-variation between the two: When
slightly perturbed to be non-transparent, our logic
language becomes harder to emulate; and there is
no evidence for LMs representing the semantics of
a non-transparent NL phenomenon. Nevertheless,
the above-random emulation performance with
Ln suggests that there could be language prop-
erties that potentially better predict emulatability,
leaving room for future theoretical endeavors.

We also found that, with a similar size and
training procedure (§3.2), ALM is more suitable
for representing the meaning of our propositional
logic languages than MLM, in our setup. ALM
achieves better probing accuracy than MLM under
both methods of obtaining sentence representa-
tions that we explored. Also, MLM completely
fails to emulate meaning facing non-transparency,
but not ALM. Ultimately, though, we hope to
understand if this difference transfers to natu-
ral language. Our NL investigation reveals that
both ALM (GPT-2) and MLM (BERT) achieve
chance-level probing performance on the one phe-
nomenon that we inspected,
likely due to its
difficulty. It would be interesting for future ef-
forts to further examine their differences, if any,
in learning and representing the meaning of other
NL phenomena.

Our results also lead to the question: Why can
LMs achieve above-random results on Ln but not
referential opacity? While it is entirely possible
that the latter is simply more difficult than our
synthetic non-transparency, there are other factors
at play. First of all, natural language is much more
variable than our synthetic language: Utterances
can be untruthful (though they are in general gov-
erned by Gricean quality; Grice, 1975), subjective
(such as our earlier claim about Corgis’ cute-
ness, §2.3), intensional (see Merrill et al., 2021
for a discussion), etc. But putting these variations
aside, we saw from §3 that even the synthetic lan-
guage requires an explicit grounding of = to enable
emulation, and this is missing from NL pretrain-
En g. It is certainly not the case that, for every
expression such as ‘‘Corgis are the cutest dogs.’’
that exists in the pretraining corpus, the vari-
ations ‘‘The cutest dogs are Corgis.’’, ‘‘Corgis

are Corgis.’’, ‘‘The cutest dogs are the cutest
dogs.’’ are also guaranteed to appear. So perhaps
there needs to be a more foundational change in
our pretraining objective. As Brown et al. (2020)
foretold, ‘‘A more fundamental limitation of […]
scaling up any LM-like model […] is that it may
eventually run into (or could already be running
into) the limits of the pretraining objective.’’ Our
results point to one such possibility: We believe
research into a more explicit representation of se-
mantic relations in future pretraining processes,
such as based on paraphrases, could be fruitful.

What we did not investigate, though, is whether
partial equivalence grounding enables emulation:
what if, for example, only 1% of the pretraining
data has this form of grounding, while the rest
does not? And the above format already exists for
certain sentences in NL. This, too, could be an
exciting future research question.

6 Related Work

Bender and Koller (2020) initiated the discussion
on the possibility of a learner acquiring mean-
ing from training on linguistic forms alone. From
first principles, they argued for its impossibility.
Empirically, Traylor et al. (2021) also found that
LMs cannot well-represent lexical-level symbols
when the pretraining data is distributionally con-
strained to supply relevant signals. Merrill et al.
(2021), on the other hand, proved theoretically that
it is possible to emulate the meaning of strongly
transparent languages with assertion oracle access.
We showed in this work that, empirically, LMs
also attain the capability. The work of Patel and
Pavlick (2022) is also conceptually similar to our
work, discovering that the internal representa-
tion of LMs is to a large extent isomorphic to the
conceptual spaces of directions and colors. They
adopted in-context learning (Brown y cols., 2020;
among others) to elicit the isomorphism, while we
used the more traditional probing paradigm.

Another line of work has inspected the extent
to which pretrained LMs encode various types of
semantic information. Some have examined the
representation of lexical semantics: Garí Soler
and Apidianaki (2021) found that BERT representations
reflect polysemy levels, and Vulić et al.
(2020) showed that they also capture abundant
type-level lexical knowledge. On the other hand,
Ettinger (2020) and Ravichander et al. (2020)
have discovered that pretrained LMs do not


satisfactorily encode negation and hypernymy,
respectively. Moving beyond the lexical level,
Wu et al. (2021) demonstrated that pretrained
BERT and RoBERTa models less readily surface
semantic dependency information than syntactic
dependencies, while Li et al. (2021) identified
evidence of dynamic semantics representation in
these models.

7 Conclusion

We have empirically shown that pretrained lan-
guage models are able to emulate the meaning of a
strongly transparent language through pretraining
on an assertion-inspired format, but this abil-
ity deteriorates when the language is minimally
perturbed to be no longer strongly transparent.
Moreover, we found no representation of ref-
erential opacity, which is significant for being a
non-transparent natural language phenomenon, in
pretrained LMs.

Acknowledgments

We thank the TACL reviewers and action edi-
tor for helpful feedback on this work. We thank
Kyle Richardson, Jesse Dodge, and other mem-
bers of AI2 for insightful discussions. This work
was funded in part by NSF award 1922658.
WM was supported by an NSF graduate research
fellowship.

References

Guillaume Alain and Yoshua Bengio. 2017.
Understanding intermediate layers using linear
classifier probes. In 5th International Confer-
ence on Learning Representations, Workshop
Track Proceedings.

C. Anthony Anderson and Joseph Owens, edi-
tores. 1990. Propositional Attitudes: The Role of
Content in Logic, Language, and Mind. CSLI
Lecture Notes; No. 20. Center for the Study of
Language and Information, Stanford, CA.

Jacob Andreas. 2019. Measuring compositionality
in representation learning. In Proceedings
of the International Conference on Learning
Representations.

and understanding in the age of data. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics,
pages 5185–5198, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.463

Samuel R. Bowman, Gabor Angeli, Christopher
Potts, and Christopher D. Manning. 2015. A
large annotated corpus for learning natural lan-
guage inference. In Proceedings of the 2015
Conference on Empirical Methods in Natural
Language Processing, pages 632–642, Lisbon,
Portugal. Association for Computational Linguistics.
https://doi.org/10.18653/v1
/D15-1075

Tom Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared D. Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel Ziegler, Jeffrey Wu, Clemens Winter,
Chris Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack
Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario
Amodei. 2020. Language models are few-shot
learners. In Proceedings of Advances in Neural
Information Processing Systems, volume 33,
pages 1877–1901. Curran Associates, Inc.

Boxi Cao, Hongyu Lin, Xianpei Han, Le Sun,
Lingyong Yan, Meng Liao, Tong Xue, and
Jin Xu. 2021. Knowledgeable or educated
guess? Revisiting language models as knowl-
edge bases. In Proceedings of the 59th Annual
Meeting of the Association for Computational
Linguistics and the 11th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 1860–1874,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1
/2021.acl-long.146

Rudolf Carnap. 1947. Meaning and Necessity: A Study in Semantics and Modal Logic. University of Chicago Press, Chicago.

Noam Chomsky. 1981. Lectures on Government
and Binding. Studies in Generative Grammar.
Foris Publications.

Noam Chomsky. 1983. Some Concepts and Consequences of the Theory of Government and Binding. Linguistic Inquiry Monograph 6. MIT Press.


Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1198

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423

Philipp Dufter, Nora Kassner, and Hinrich Schütze. 2021. Static embeddings as efficient knowledge bases? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2353–2363, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.186

Bradley Efron and Robert J. Tibshirani. 1993. An Introduction to the Bootstrap. Number 57 in Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton, Florida, USA.

Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. 2018. T-REx: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1006

Allyson Ettinger. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48. https://doi.org/10.1162/tacl_a_00298

Kit Fine. 1990. Quine on quantifying in. In Propositional Attitudes: The Role of Content in Logic, Language, and Mind, CSLI Lecture Notes; No. 20, pages 1–26. Center for the Study of Language and Information, Stanford, CA.

R. A. Fisher. 1935. The Design of Experiments. Oliver and Boyd.

Jerry A. Fodor and Zenon W. Pylyshyn. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1):3–71. https://doi.org/10.1016/0010-0277(88)90031-5, PubMed: 2450716

Gottlob Frege. 1892. Über Sinn und Bedeutung. Zeitschrift für Philosophie und Philosophische Kritik.

Aina Garí Soler and Marianna Apidianaki. 2021. Let’s play mono-poly: BERT can reveal words’ polysemy level and partitionability into senses. Transactions of the Association for Computational Linguistics, 9:825–844. https://doi.org/10.1162/tacl_a_00400

Herbert P. Grice. 1975. Logic and conversation. In Peter Cole and Jerry L. Morgan, editors, Syntax and Semantics: Vol. 3: Speech Acts, pages 41–58. Academic Press, New York. https://doi.org/10.1163/9789004368811_003

Jeroen Groenendijk and Martin Stokhof. 1991.
Dynamic predicate logic. Linguistics and Phi-
losophy, 14(1):39–100. https://doi.org
/10.1007/BF00628304

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1108


Adi Haviv, Ori Ram, Ofir Press, Peter Izsak,
and Omer Levy. 2022. Transformer language
models without positional encodings still learn
positional information. https://doi.org
/10.48550/arXiv.2203.16634

Irene Heim. 1982. The Semantics of Definite and Indefinite Noun Phrases. Ph.D. thesis, University of Massachusetts Amherst.

Fred Karlsson. 2010. Syntactic Recursion and Iteration. Studies in Generative Grammar. De Gruyter Mouton, Germany. https://doi.org/10.1515/9783110219258.43

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.550

Irene Heim and Angelika Kratzer. 1998. Semantics in Generative Grammar. Blackwell.

Kate Kearns. 2011. Semantics. Macmillan Modern Linguistics. Palgrave Macmillan.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1275

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1419

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438. https://doi.org/10.1162/tacl_a_00324

Hans Kamp. 1981. A theory of truth and semantic representation. In Jeroen Groenendijk, Theo Janssen, and Martin Stokhof, editors, Formal Methods in the Study of Language, pages 277–322. Mathematisch Centrum, Amsterdam.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1445

Saul A. Kripke. 1972. Naming and necessity. In Semantics of Natural Language, pages 253–355. Springer.

Belinda Z. Li, Maxwell Nye, and Jacob Andreas. 2021. Implicit representations of meaning in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1813–1827, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.143

Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li. 2020. On the sentence embeddings from pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9119–9130, Online. Association for Computational Linguistics.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535. https://doi.org/10.1162/tacl_a_00115

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019a. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1112

Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019b. Inoculation by fine-tuning: A method for analyzing challenge datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2171–2179, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1225

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019c. RoBERTa: A robustly optimized BERT pretraining approach. https://doi.org/10.48550/arXiv.1907.11692

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of International Conference on Learning Representations.

Qing Lyu, Zheng Hua, Daoxin Li, Li Zhang, Marianna Apidianaki, and Chris Callison-Burch. 2022. Is ‘‘my favorite new movie’’ my favorite movie? Probing the understanding of recursive noun phrases. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5286–5302, Seattle, United States. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.naacl-main.388

William Merrill, Yoav Goldberg, Roy Schwartz, and Noah A. Smith. 2021. Provable limitations of acquiring meaning from ungrounded form: What will future language models understand? Transactions of the Association for Computational Linguistics, 9:1047–1060. https://doi.org/10.1162/tacl_a_00412

Lalchand Pandia and Allyson Ettinger. 2021. Sorting through the noise: Testing robustness of information processing in pre-trained language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1583–1596, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.119

Roma Patel and Ellie Pavlick. 2022. Mapping language models to grounded conceptual spaces. In International Conference on Learning Representations.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1250

Nina Poerner, Ulli Waltinger, and Hinrich Schütze. 2020. E-BERT: Efficient-yet-effective entity embeddings for BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 803–818, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.71

Willard V. Quine. 1956. Quantifiers and propo-
sitional attitudes. The Journal of Philosophy,
53(5):177–187. https://doi.org/10.2307
/2022451

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask learners. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Abhilasha Ravichander, Eduard Hovy, Kaheer Suleman, Adam Trischler, and Jackie Chi Kit Cheung. 2020. On the systematicity of probing contextualized word representations: The case of hypernymy in BERT. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pages 88–102, Barcelona, Spain (Online). Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410

Tanya Reinhart. 1976. The Syntactic Domain of Anaphora. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge.

Daniel S. Shabasson. 2018. The Two Indexical Uses Theory of Proper Names and Frege’s Puzzle. Ph.D. thesis, City University of New York.

Stuart M. Shieber. 1985. Evidence against the
context-freeness of natural language. Linguis-
tics and Philosophy, 8:333–343. https://
doi.org/10.1007/BF00630917

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.346

Jeff Speaks. 2021. Theories of meaning. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, Spring 2021 edition. Metaphysics Research Lab, Stanford University.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of International Conference on Learning Representations.

Aaron Traylor, Roman Feiman, and Ellie Pavlick. 2021. AND does not mean OR: Using formal languages to study language models’ representations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 158–167, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-short.21

Elena Voita, Rico Sennrich, and Ivan Titov. 2019. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4396–4406, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1448

Kai von Fintel and Irene Heim. 2011. Intensional semantics. Unpublished lecture notes.

Ivan Vulić, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, and Anna Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7222–7240, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.586

Zhaofeng Wu, Hao Peng, and Noah A. Smith. 2021. Infusing finetuning with semantic dependencies. Transactions of the Association for Computational Linguistics, 9:226–242. https://doi.org/10.1162/tacl_a_00363

Lang Yu and Allyson Ettinger. 2020. Assessing phrasal representation and composition in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4896–4907. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.397


Lang Yu and Allyson Ettinger. 2021. On the interplay between fine-tuning and composition in transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2279–2293, Online. Association for Computational Linguistics.

Zexuan Zhong, Dan Friedman, and Danqi Chen. 2021. Factual probing is [MASK]: Learning vs. learning to recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5017–5033, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.398

A Propositional Logic Dataset Details

We hand-designed the PCFG probabilities in Eq. 1. To expand an e, the two binary rules each have 0.06 probability under Lt. The ¬ rule and the expansions to T and F divide the remaining probability mass, with T and F having the same probability, each half that of the ¬ rule. As S does not expand to T or F, the other three rules proportionally split the probability mass. We consider each of (, ), ∧, ∨, ¬, T, F, = as a separate token for tokenization. We enforce a maximum length of 248 tokens. We sample all sentences without replacement. The average Lt sentence length is ≈48.6 tokens. Sampling Ln results in slightly longer sentences, so we decrease the binary rule probabilities to 0.03 each, but the specification is otherwise the same. The resulting Ln sentence on average has ≈51.7 tokens. We sample 819.2M pretraining sentences and 1M/10K/10K probe training/validation/test sentences. Then, for each split, we sample sentence pairs, with the same number as the number of sentences in that split.
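
For concreteness, the following is a minimal sketch of a sampler consistent with these probabilities, assuming an expression grammar in which e expands to (e ∧ e), (e ∨ e), ¬e, T, or F; the exact production shapes and the assertion (=) format of the released data may differ, and the rule names and helpers below are ours.

```python
import random

# Assumed expansion probabilities for an expression e under Lt: the two binary
# rules get 0.06 each; from the remaining 0.88, negation gets twice the mass of
# T and of F (0.44 vs. 0.22 vs. 0.22), matching the description above.
E_RULES = [("and", 0.06), ("or", 0.06), ("not", 0.44), ("T", 0.22), ("F", 0.22)]
# S never expands to T or F; the other three rules split the mass proportionally.
S_RULES = [("and", 0.06 / 0.56), ("or", 0.06 / 0.56), ("not", 0.44 / 0.56)]

def expand(rules):
    """Recursively sample one token sequence from the assumed PCFG."""
    names, weights = zip(*rules)
    rule = random.choices(names, weights=weights, k=1)[0]
    if rule in ("T", "F"):
        return [rule]
    if rule == "not":
        return ["¬"] + expand(E_RULES)
    op = "∧" if rule == "and" else "∨"
    return ["("] + expand(E_RULES) + [op] + expand(E_RULES) + [")"]

def sample_sentence(max_len=248):
    """Rejection-sample an expression from S no longer than the token limit."""
    while True:
        tokens = expand(S_RULES)
        if len(tokens) <= max_len:
            return tokens

print(" ".join(sample_sentence()))
```

For Ln, the same sketch applies with the two binary rule probabilities lowered to 0.03 each and the remaining mass redistributed accordingly.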

B Propositional Logic Training Details

For pretraining, we mostly follow the original hyperparameters for GPT-2-small and RoBERTa-base. We train with batches of 8,192 sequences for 100k steps, equivalent to 1 epoch over our pretraining data. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with epsilon 10−8 for ALM and 10−6 for MLM, and β2 0.95 for ALM and 0.98 for MLM. We set the learning rate to 6 × 10−4, warmed up over 10k steps, with 0.1 weight decay.
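
A minimal PyTorch sketch of this optimizer configuration is given below; the decay shape after warmup and β1 = 0.9 are our assumptions, since the text only specifies the warmup, peak learning rate, weight decay, epsilon, and β2.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, objective="ALM", total_steps=100_000, warmup_steps=10_000):
    # Epsilon and beta2 differ between the autoregressive (ALM) and masked (MLM)
    # objectives, as described in the text; beta1 = 0.9 is an assumption.
    eps = 1e-8 if objective == "ALM" else 1e-6
    beta2 = 0.95 if objective == "ALM" else 0.98
    optimizer = AdamW(
        model.parameters(),
        lr=6e-4,
        betas=(0.9, beta2),
        eps=eps,
        weight_decay=0.1,
    )

    # Linear warmup over the first 10k steps; linear decay to zero afterwards
    # (the decay is an assumption, since the text only specifies the warmup).
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return optimizer, LambdaLR(optimizer, lr_lambda)
```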

For probing, +ATTN trains a query vector that interacts with the key representation of each token, obtained with a trained key matrix transformation, and the resulting attention weights are used to average the token embeddings. We train all probes for 3 epochs with batch size 8 and 1,000 warmup steps and select the checkpoint by validation accuracy. We use AdamW with a 10−5 learning rate, except only for Ln –ATTN ALM, which benefits from a different learning rate of 10−3. We clip gradients to unit norm.
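
A sketch of what this +ATTN pooling component could look like in PyTorch follows; the module and parameter names are ours, and for simplicity the pooled vector feeds a single linear classifier rather than the full pairwise probing setup.

```python
import torch
import torch.nn as nn

class AttentionPoolingProbe(nn.Module):
    """Sketch of an attention-pooled (+ATTN) probe over frozen token embeddings.

    A single trained query vector scores key-transformed token representations;
    the softmax-normalized scores weight an average of the token embeddings,
    which is then fed to a linear classifier.
    """

    def __init__(self, hidden_size: int, num_classes: int = 2):
        super().__init__()
        self.query = nn.Parameter(torch.randn(hidden_size))
        self.key = nn.Linear(hidden_size, hidden_size, bias=False)  # trained key matrix
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, token_embeddings, attention_mask):
        # token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        keys = self.key(token_embeddings)                       # (batch, seq_len, hidden)
        scores = keys @ self.query                              # (batch, seq_len)
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                 # attention over tokens
        pooled = (weights.unsqueeze(-1) * token_embeddings).sum(dim=1)
        return self.classifier(pooled)
```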

C Referential Opacity Dataset Details

We detail the generation of our referential opac-
ity dataset, separately discussing its two aspects
(§4.2).

C.1 Generating Co-referring Expressions

For fact probing on LAMA, we use a prompt of the form ‘‘The official language of Laos is known as ___’’, which we found appropriate for the entity types in T-REx. If the LM correctly predicts ‘‘Lao’’, we consider this equivalence, or fact, captured by the model. As LAMA was designed to have 1-token answers with BERT’s tokenization, we let BERT fill in the blank. This is not guaranteed for GPT-2’s tokenization, so we run decoding for the same number of steps as the true answer’s length, with beam size 5 and no sampling. To further ensure that the predictions are reliable and not due to noise, we only keep entity categories with overall prediction accuracy > 25%. The resulting categories are ‘‘P37 official language’’, ‘‘P364 original language of film or TV show’’, ‘‘P140 religion’’, ‘‘P103 native language’’, and ‘‘P36 capital’’. This procedure results in 1,606 facts for GPT-2 and 2,962 facts for BERT.
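
As an illustration of this GPT-2 decoding setup, a hedged sketch using the Hugging Face transformers API is shown below; the ‘‘gpt2’’ checkpoint name and the exact-match comparison are our assumptions, not the paper’s released code.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def predicts_fact(prompt: str, answer: str) -> bool:
    """Beam-decode exactly as many tokens as the gold answer and compare strings."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(" " + answer).input_ids   # leading space matches GPT-2 BPE
    output = model.generate(
        prompt_ids,
        max_new_tokens=len(answer_ids),  # same number of steps as the true answer
        num_beams=5,                     # beam size 5
        do_sample=False,                 # no sampling
    )
    prediction = tokenizer.decode(output[0, prompt_ids.shape[1]:]).strip()
    return prediction == answer

# e.g., predicts_fact("The official language of Laos is known as", "Lao")
```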

C.2 Generating Contexts
We generate two types of contexts (§4.2). The first type contains an embedded clause, for which we construct templates for each entity category
in §C.1. For language entities, for example, one template is ‘‘[PRONOUN] [VERB] to speak [ENTITY].’’ A sentence pair is formed by filling in [ENTITY] with a definite description vs. a proper name for a fact. We only consider the pronouns ‘‘She’’ and ‘‘He’’ in this work. We consider 6 referentially transparent verbs (‘‘starts’’, ‘‘begins’’, ‘‘ceases’’, ‘‘stops’’, ‘‘managed’’, ‘‘failed’’) and 6 referentially opaque verbs (‘‘wants’’, ‘‘intends’’, ‘‘hopes’’, ‘‘begs’’, ‘‘preferred’’, ‘‘suggested’’). The second type of context contains only the main clause. We use the referentially opaque template ‘‘[PRONOUN] dislikes [ENTITY].’’ and an entity category-specific referentially transparent template such as ‘‘[PRONOUN] speaks [ENTITY].’’ In total, we have 64,672 sentence pairs for GPT-2 and 121,768 for BERT.
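
A small sketch of how this template filling could be implemented is given below; the fact tuple, function names, and template string are illustrative placeholders rather than the paper’s generation scripts.

```python
from itertools import product

PRONOUNS = ["She", "He"]
TRANSPARENT_VERBS = ["starts", "begins", "ceases", "stops", "managed", "failed"]
OPAQUE_VERBS = ["wants", "intends", "hopes", "begs", "preferred", "suggested"]

# A fact pairs two co-referring expressions: a definite description and a proper
# name. The tuple below is illustrative only.
FACTS = [("the official language of Laos", "Lao")]

def embedded_clause_pairs(template="{pronoun} {verb} to speak {entity}."):
    """Yield (sentence with description, sentence with name, verb is opaque)."""
    verbs = [(v, False) for v in TRANSPARENT_VERBS] + [(v, True) for v in OPAQUE_VERBS]
    for (description, name), pronoun, (verb, opaque) in product(FACTS, PRONOUNS, verbs):
        fill = lambda entity: template.format(pronoun=pronoun, verb=verb, entity=entity)
        yield fill(description), fill(name), opaque

for pair in embedded_clause_pairs():
    print(pair)
```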

For our probing analysis, we also included attractors with coordinated sentences (§4.4). As there are a quadratic number of possible coordinations, we subsampled 59,548 such sentences for GPT-2 and 119,540 for BERT, similar to the number of attractor-less sentences. We split all sentence pairs 8/1/1 for training/validation/testing. For our similarity analysis, for a cleaner significance test, we only consider sentence pairs with an embedded clause. This leaves 58,776 sentence pairs for GPT-2 and 111,312 for BERT.

D Referential Opacity Training Details

The probe is trained similarly to §B except for 1
epoch with batch size 256 and learning rate 10−5.

E Can Language Models Learn to Represent Referential Opacity With Appropriate Supervision?

We showed in §4 that we do not observe evidence of pretrained language models representing the phenomenon of referential opacity. A natural question, then, is whether language models can learn to represent it. Following a similar setup as Lyu et al. (2022) and Liu et al. (2019b), we finetune the entire model on a portion of our training set for 1 epoch and conduct the same probing procedure on the resulting model. All training is done with the coordinated data introduced in §4.4. Finetuning uses the same hyperparameters as in §D. Similar to §4.4, we report the mean and standard deviation across 10 random seeds for each setting.

Figure 2: Probing accuracy after finetuning a pretrained LM on our (coordinated) referential opacity dataset with different numbers of finetuning examples. The mean and the standard deviation across 10 seeds are plotted. For clarity in visualizing the trend, the x-axis is not in linear scale.
We plot the probing accuracy against the number of finetuning examples in Figure 2. Both GPT-2 and BERT continue to be unable to perform above-random with up to 10,000 finetuning examples, further demonstrating their inadequate semantic representation of referential opacity. However, with enough finetuning examples, both models eventually achieve near-100% probing accuracy. It is therefore possible that they can learn to represent referential opacity with sufficient semantic supervision, though we note a caveat: while we introduced coordinated data to prevent an obvious shortcut that the model could take (§4.4), this does not eliminate all possible shortcuts. It could be the case that the additional capacity afforded by finetuning enables the model to exploit a more sophisticated shortcut (unknown to us) instead of truly capturing this phenomenon.
