Transparency Helps Reveal When Language Models Learn Meaning

Transparency Helps Reveal When Language Models Learn Meaning

Zhaofeng Wu ∗ William Merrill

Hao Peng

Iz Beltagy

Noah A. Forgeron

AVEC

New York University

Allen Institute for Artificial Intelligence

Paul G. Allen School of Computer Science & Engineering, University of Washington
zfw@csail.mit.edu willm@nyu.edu {haop,beltagy,noah}@allenai.org

Abstrait

Many current NLP systems are built from
language models trained to optimize unsu-
pervised objectives on large amounts of raw
text. Under what conditions might such a pro-
cedure acquire meaning? Our systematic ex-
periments with synthetic data reveal that, avec
languages where all expressions have context-
independent denotations (c'est à dire., languages with
strong transparency), both autoregressive
and masked language models successfully
learn to emulate semantic relations between
expressions. Cependant, when denotations are
changed to be context-dependent with the
language otherwise unmodified, this ability
degrades. Turning to natural language, notre
experiments with a specific phenomenon—
referential opacity—add to the growing body
of evidence that current language models do
not represent natural language semantics well.
We show this failure relates to the context-
dependent nature of natural language form-
meaning mappings.

1

Introduction

Despite language models’ (LMs) centrality to
recent progress on NLP benchmarks, a formal
characterization of what can be learned from un-
supervised training on large text corpora, et
of what modern language models actually do
learn, remains elusive. Empirically, Tenney et al.
(2019), Kovaleva et al. (2019), Wu et al. (2021),
entre autres, all discovered that pretrained LMs
possess unsatisfactory semantic representations.
Traylor et al. (2021) found co-variation between
form and meaning to be insufficient for an LM
to represent lexical semantics. Li et al. (2021), sur
the other hand, identified evidence of LMs repre-

senting dynamic semantics (Kamp, 1981; Heim,
1982; Groenendijk and Stokhof, 1991).

From first principles, Bender and Koller (2020)
argued that it is a priori impossible for an un-
grounded system that has access only to linguistic
forms to learn the mapping between those forms
and their grounded denotations. They claimed, comme
a thought experiment, that a learner that has ac-
cess to all Java code (c'est à dire., formulaire) on GitHub can
never learn execution (c'est à dire., meaning). They nev-
ertheless acknowledged that the existence of unit
tests, which assert the expected output given input
to blocks of code, could constitute a weak form
of grounding which potentially enables the learn-
ing of meaning.

Formalizing this idea, Merrill et al. (2021) le-
oretically proved the possibility of learning (ou
more technically, emulating) semantic relations
between expressions in a certain class of formal
languages—those that are strongly transpar-
ent whose expressions have context-independent
denotations—using an assertion oracle, analo-
gous to the assertions in unit tests. En outre,
with an example, they showed the existence of
non-emulatable languages even with an assertion
oracle.

Encore, the practical implications of these the-
oretical results have not been explored. While
assertions enable the emulation of strongly trans-
parent languages, it is unclear if existing LM archi-
tectures and objectives achieve emulation given
training data with assertions. En outre, we do
not know if natural language (NL) is similarly non-
emulatable as Merrill et al.’s (2021) constructed
example, especially since non-transparency does
not always imply non-emulatability. We thus pose
two research questions:

∗This work was done when Zhaofeng Wu was at AI2.
Our code and trained models are released at https://
github.com/ZhaofengWu/transparency.

RQ1. Can current LM architectures and pre-
training objectives emulate the meaning of
strongly transparent languages?

617

Transactions of the Association for Computational Linguistics, vol. 11, pp. 617–634, 2023. https://doi.org/10.1162/tacl a 00565
Action Editor: Marco Baroni. Submission batch: 12/2022; Revision batch: 2/2023; Published 6/2023.
c(cid:3) 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
6
5
2
1
3
8
3
5
0

/

/
t

je

un
c
_
un
_
0
0
5
6
5
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

RQ2. Can modern LMs fully emulate the meaning
of natural language which is non-transparent?

We answer RQ1 in the positive (§3): On a
strongly transparent propositional logic language,
autoregressive and masked language models pre-
trained on only expressions (formulaire), `a la GPT-2
(Radford et al., 2019) and RoBERTa (Liu et al.,
2019c), can consistently compare and evaluate
their values (meaning). We find that necessary
grounding of the pretraining data distribution is
crucial to this ability. We also investigate the
role of transparency for emulatability in a con-
trolled setting as an intermediate study before
analyzing non-transparent natural language. Nous
ablate strong transparency from the logic lan-
guage while keeping other factors unchanged. Nous
observe a substantial drop in the LMs’ ability to
emulate meaning, highlighting the importance of
transparency for emulatability.

We then turn to natural language (§4). Refer-
ential opacity is an extensively studied phenom-
enon in semantics and philosophy (Quine, 1956;
Kripke, 1972, entre autres) but has not been
examined in modern NLP. We prove that this
phenomenon entails non-transparency and ana-
lyze how well existing LMs represent it. Our ana-
lyses based on probing and sentence similarity
point to a lack of its representation in the largest
GPT-2 and BERT (Devlin et al., 2019) models
(RQ2). Theoretically, this is a natural language
parallel to the emulation difficulty for our non-
transparent formal
langue, and further rein-
forces the connection between transparency and
meaning emulatability. Practically, through the
lens of strong transparency, our results supplement
prior studies that identified pretrained LMs’ in-
sufficient semantic representations (Tenney et al.,
2019; Yu and Ettinger, 2020, 2021; Wu et al.,
2021, entre autres).

2 Background

We follow Merrill et al.’s (2021) operationaliza-
tion of the learning of meaning by emulation and
their definition of strong transparency. We sum-
marize their nomenclature and theoretical re-
sults in this section and provide some examples.
We refer readers to Merrill et al. (2021) for more
details.

At a high level, we take an inferential (Speaks,
2021, §2.2.3) view of meaning. An LM is taken

to understand a language L if it can resolve se-
mantic relations (par exemple., equivalence) between ex-
pressions in L.1 This is achieved through two
procedures: μL maps expressions into representa-
tions based on training data from L, and δ uses
the representations of two expressions to resolve
a semantic relation between them.

2.1 Languages
We consider a language L ⊆ Σ∗ over an alpha-
bet Σ and denote (Σ∗)2 = Σ∗ × Σ∗. We term
members of L sentences. We consider an ex-
pression e ∈ Σ∗ with associated left and right
context κ = (cid:6)je, r(cid:7) (Σ∗)2. ler ∈ L is a sen-
tence. We denote the empty string with λ and the
empty context with λ2.

Definition 1 (Lt). We use the following context-
free grammar (CFG) to specify a propositional
logic language as a running example:

S → (e ∧ e) | (e ∨ e) | (¬e)
e → (e ∧ e) | (e ∨ e) | (¬e) | T | F

(1)

S is the distinguished start symbol and T and F
stand for True and False. We call this language
Lt where t stands for ‘‘transparent’’ (see §2.5). Il
underlies our investigation in §3.

Par exemple, the sentence (((¬T) ∨ F) (¬T))
belongs to Lt because it can be generated by
this CFG using the steps illustrated in Figure 1.
In this sentence, the expression F has context
(cid:6)(((¬T) , ) (¬T))(cid:7).

2.2 Meaning

We consider the denotation of an expression e,
(cid:2)e | κ(cid:3)L, to be its meaning in the context κ.2 We
write (cid:2)e | κ(cid:3)L = ∅ if e is invalid in κ.

The meaning of a propositional logic expres-
sion can be the value derived from its conven-
tional semantics, c'est à dire., either T or F. Par exemple,
(cid:2)(T ∧ (¬F))|λ2(cid:3)Lt = T, et (cid:2)(¬F)|(cid:6)(T∧ , )(cid:7)(cid:3)Lt =
T. For natural language, extensionally, the mean-
ing of a sentence is its truth value, also either

1This inferentialist perspective can be contrasted with
denotationalism, which says that ‘‘understanding’’ is the
task of mapping an expression to a logical representation of
its meaning (Speaks, 2021, §2.2.3). Inferentialism implic-
itly underlies natural language inference-based evaluation
of NLP models (par exemple., Bowman et al., 2015).

2 We overload L to represent both the surface form and

a mapping between form and denotation.

618

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
6
5
2
1
3
8
3
5
0

/

/
t

je

un
c
_
un
_
0
0
5
6
5
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

be further reduced to all propositions: ‘‘Corgis
run.’’ is equivalent to ℵ(Corgis run., T) under the
extensional framework.3

2.4 ℵ-emulation: Learning Meaning

Merrill et al. (2021) say that a class of languages
L is ℵ-emulatable if, intuitively, a learner μL with
ℵL-access produces context-independent repre-
sentations that allow another function δ to check
the equivalence of any two expressions under any
context without further ℵL-access. Officiellement, L
is ℵ-emulatable if there exists an oracle Turing
machine μL (that can query ℵL) and a standard
Turing machine δ such that, for all L ∈ L, contexte
κ ∈ (Σ∗)2, and valid expressions e, e(cid:11) in κ,

(cid:2)e|κ(cid:3)L = (cid:2)e(cid:11)|κ(cid:3)L ⇐⇒ δ (μL(e), μL(e(cid:11)) | κ) (3)

Back to Corgis, an English learner μ can observe
the equivalence of e = ‘‘Corgis’’ and e(cid:11) = ‘‘the
cutest dogs’’ in many different contexts κ and
develop their representations. We say that natural
language is emulated if there exists δ that can
decide the equivalence between such expressions
from the representations alone.

The standard pretraining-probing setup is an
intuitive instantiation of μL and δ. A model μL
can query ℵL while pretraining on language L,
which can then produce a representation μL(e)
for any expression e. An equivalence probe δ can
take the (frozen) representation of two expres-
sions and decide their equivalence in some con-
text. Surtout, because δ is frozen, it cannot
make any more queries to ℵL. We adopt this
paradigm for analysis in §3 and §4 and elaborate
below.

2.5 Strong Transparency

Definition 2. A language L is strongly trans-
parent if all of its expressions have context-
independent denotations. C'est, for all e ∈ Σ∗,
κ ∈ (Σ∗)2, either (cid:2)e|κ(cid:3)L = (cid:2)e|λ2(cid:3)L (cid:15)= ∅ or
(cid:2)e|κ(cid:3)L = ∅.

Under conventional propositional logic seman-
tics, Lt (Def. 1) is strongly transparent because
the value of every expression is determined by
itself and unaffected by its context. Natural lan-
guage, on the other hand, is non-transparent. Nous

3Assuming that propositions are more frequently true
than false, which tends to be the case pragmatically (Grice,
1975).

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
6
5
2
1
3
8
3
5
0

/

/
t

je

un
c
_
un
_
0
0
5
6
5
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Chiffre 1: An example sentence in our propositional
node
logic language as specified in Eq. 1. Le

node, inverting its meaning in Ln
c-commands the
(§3.5). We mark the denotation of each node under Lt
or Ln.

T or F (Frege, 1892); intensionally, the meaning
is its truth condition, which could be viewed as a
set of possible worlds where the sentence is true
(Carnap, 1947). For a summary of the extension
and intension of other expressions in NL, voir
Kearns (2011, §1.3). As an example in English,
extensionally, (cid:2)An author of this paper believes
that Corgis are the cutest dogs.|λ2(cid:3) = T.

2.3 Assertion Oracle

To represent assertions in unit tests, Merrill et al.
(2021) considered an assertion oracle which out-
puts if two expressions have the same denotation
under the same context. Spécifiquement, for expres-
sions e, e(cid:11) ∈ Σ∗ and κ ∈ (Σ∗)2, the assertion
oracle is defined as

(cid:2)

ℵL (e, e(cid:11) | κ) =

1 si (cid:2)e|κ(cid:3)L = (cid:2)e(cid:11)|κ(cid:3)L
0 otherwise

(2)

LM pretraining corpora could provide ℵ-like
signals. Par exemple, pretraining sequences of the
form e=e’ are a natural analog to an ℵ query.
We adopt this view to pretrain our propositional
logic language in §3. In English and many other
natural languages, copulas are a straightforward
counterpart: ‘‘Corgis are the cutest dogs.’’ is
equivalent to ‘‘Corgis=the cutest dogs.’’ This can

619

prove in §4 that the NL phenomenon of referen-
tial opacity violates strong transparency.

Merrill et al. (2021) theoretically proved that all
strongly transparent languages are ℵ-emulatable.
Autrement dit, it is possible to learn to emulate
the meaning of these languages with only asser-
tion oracle access. The converse is not necessarily
true4 and hence there may be a weaker condi-
tion than strong transparency that also entails ℵ-
emulatability.

In what follows, we study how their theoretical
results realize empirically. We examine in §3 if
LM architectures and objectives can emulate the
meaning of a strongly transparent language. Dans
§4, we return to natural language which is non-
transparent and thus Merrill et al.’s (2021) résultats
do not predict its meaning emulatability.

3 How Well Do Language Models Fare?

While strongly transparent languages are in theory
ℵ-emulatable, it is unknown if existing LM archi-
tectures, coupled with their pretraining objectives,
are able to successfully achieve ℵ-emulation, ou
more intuitively, to learn their meaning.

To test this, we synthetically create a strongly
transparent language based on propositional logic.
We pretrain LMs with the same architecture and
similar data scale as GPT-2 and RoBERTa on a
generated pretraining corpus. We then train an
equivalence probe to study if the pretrained rep-
resentations enable ℵ-emulation. The probe is
trained with a sentence pair binary classification
objective and tested on unseen sentences sampled
from the same grammar. Alternativement, we also try
to directly evaluate the value of unseen sentences,
without probe training. To isolate the effect of
strong transparency, we also minimally perturb
this language to be non-transparent and study how
this affects emulatability.

3.1 Données

probabilities are hand-designed. The denotation
of an expression can be computed according to
the conventional semantics of propositional logic,
lequel, as argued in §2.5, makes Lt transparent.
Chiffre 1 shows an example. See §A for more
details.

Our CFG rules prevent the atomic sentences
T and F from occurring in the corpus (et (T)
et (F) aussi) and only allow compositional sen-
tences. This ensures the absence of pretraining
sequences like sentence=T and guarantees that
there is no direct grounding to denotations dur-
ing pretraining, but only indirect grounding via ℵ.
This makes the task more difficult than the ℵ-
emulation setup but more realistically transferable
to natural language (§5).

The dataset has 819.2M pretraining sequences
and 1M/10K/10K probe training/validation/test
sentence pairs. All splits have disjoint sentences.
The average sentence length is around 48.6. §A
contains more details including tokenization.

3.2 Pretraining

We pretrain from scratch an autoregressive LM
(ALM) and a masked LM (MLM), respectivement
simulating GPT-2-small and RoBERTa-base6
with their original architecture, objective, et, à
the extent possible, hyperparameters. They have
near-identical model size hyperparameters, lead-
ing to 86.8M ALM parameters and 87.0M for
MLM. We sample sentence pairs (un, b) avec le
same denotation and format the pretraining se-
quences in the form of a=b, tel que (T∧F)=
(F∨F), simulating ℵ-access (but restricting que-
ries to be sentences, a more challenging setup: voir
Eq. 2). §3.3 will discuss a necessary form of data
augmentation. We train for 100K steps, 20% de
RoBERTa-base’s training duration and hence data
size, which we found sufficient for convergence
on our data. §B summarizes hyperparameters.

We use a PCFG to construct our propositional
logic dataset because its recursive nature and
context-freeness bear some resemblance to nat-
ural language,5 and because it is convenient for
sampling. The rules are specified in Eq. 1 et le

3.3 Analysis: Probing Lt

Probing is a commonly adopted method to quan-
tify the extent to which a representation encodes
a particular type of linguistic information (Alain
and Bengio, 2017; Liu et al., 2019un; Hewitt and

4Consider, Par exemple, a finite non-transparent language

whose denotation space can be learned by enumeration.

5There are aspects of natural language that a PCFG
does not capture, such as recursion constraints (Karlsson,
2010) and non-context-free phenomena (Shieber, 1985).
Nevertheless, the goal of this research question is not to

maximally simulate NL, but rather investigate the distribu-
tional learnability of compositional semantics. Future work
could investigate the effect of moving away from a strict
PCFG.

6We do not follow BERT because next sentence prediction

is not applicable here, but they are otherwise similar.

620

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
6
5
2
1
3
8
3
5
0

/

/
t

je

un
c
_
un
_
0
0
5
6
5
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Manning, 2019, entre autres). The represen-
tation is frozen, on top of which a lightweight
classifier is trained to predict the information of
interest. As shown in §2.4, this paradigm conve-
niently corresponds to the formalization in Merrill
et autres. (2021), and hence we use it to investigate
whether or not pretrained representations encode
sufficient semantic information for equivalence
decisions.

We probe semantic equivalence from the pre-
trained models for pairs of unseen sentences. Nous
embed each sentence separately through the pre-
trained model, taking the last token representa-
tion for ALM and the average for MLM.7 Voita
et autres. (2019) and Haviv et al. (2022) have shown
that the positional information is diluted at the top
transformer layers of MLMs, but it is crucial for
the truth value in our language. Nous, donc, prendre
a weighted sum (a.k.a. scalar mix) of all layers
for compensation for MLM.8 We also found that
these simple methods for sentence representations
sometimes do not perform well. We hence addi-
tionally consider a variant where the probe is an
attention-weighted mixture of all token positions.
We refer to these two representations as –ATTN
and +ATTN, respectivement. See §B for more on their
details. We train a bilinear classifier probe on top
of the sentence representations (Li et al., 2021)
and evaluate it with accuracy on a held-out test
ensemble. For each setting, we train the same probe
with five different random seeds and report their
mean and standard deviation. We report hyperpa-
rameters in §B.

Past work has cast doubt on whether probes
faithfully reflect the representation’s encoding of
the information of interest, or if they directly
learn the task (Hewitt and Liang, 2019). This is
an especially important issue here as our +ATTN
sentence representation injects additional train-
able parameters compared to a simple (bi)linear
classifier. To answer this question in our setting,
we follow previous studies (Conneau et al., 2018;
Tenney et al., 2019; Wu et al., 2021, among oth-
ers) and train a randomly initialized and similarly
frozen control model with the same architecture:

7The lack of a next sentence prediction task (Fn. 6) leads

to no supervision for a [CLS] token.

8Officiellement, μ’s output contains all layer representations.
9ALM Trained +ATTN Ln has a degenerate seed that led to
autour 50% accuracy, hence the large variance. C'est possible
that additional configuration-specific hyperparameter tuning,
which we did not perform, could reduce this instability.

ALM (`a la GPT-2)

MLM (`a la RoBERTa)

Random Trained

Random Trained

Probing: –ATTN

Lt
Ln

Lt
Ln

Lt
Ln

49.9±0.3
50.0±0.3

98.8±0.0
79.9±0.2

50.0±0.4
49.9±0.1

50.1±0.2
49.5±0.1

Probing: +ATTN

49.9±0.6
50.1±0.4

100.0±0.0
82.5±20.9

50.0±0.4
50.2±0.2

63.8±1.7
49.7±0.3

50.0
50.0

Direct evaluation

97.0±6.8
91.1±19.9

50.0
50.0

95.4±4.7
50.4±0.8

Tableau 1: Probing and direct evaluation accuracy
(%) on random and pretrained models with autore-
gressive and masked LMs on our propositional
logic test set. We report the results with both our
transparent language Lt and the perturbed lan-
guage Ln (§3.5). Probing checks the equivalence
of two sentences, while direct evaluation com-
putes the value of one sentence. For probing, nous
test two ways to obtain sentence representations,
reporting the mean and standard deviation across
five probe training seeds. For direct evaluation,
we report the mean and standard deviation across
our five templates.9

If LMs emulate meaning similarly to Merrill
et al.’s (2021) algorithme, we would expect the
pretrained model to yield higher probing accuracy
than the random model.

Results. The Lt rows in the top two sections
of Table 1 summarize the results. With a simple
sentence representation (–ATTN), the pretrained
ALM achieves near-perfect probing accuracy for
Lt, though MLM performs at chance level. Un
attention-based sentence representation enables
63.8% accuracy10 for MLM and improves ALM’s
performance to 100%. Surtout, in this variant,
the random baselines still perform at chance level,
demonstrating that the additional parameters do
not lead to an overly powerful probe. We dis-
cuss the accuracy differences between ALM and
MLM in §5. These results demonstrate that pre-
training enables meaning emulation, though the
meaning representation can be more deeply en-
coded than what can be extracted with a (bi)linear

10With additional linear layers, it could go up to 83.4%±2.0
while the random model still performs at chance level. Nous
did not include this in Table 1 for consistency with other
settings.

621

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
6
5
2
1
3
8
3
5
0

/

/
t

je

un
c
_
un
_
0
0
5
6
5
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

–Symmetry

+Symmetry

–Reflexivity

+Reflexivity

a=b
50.5±0.4
a=b, b=a
50.3±0.3

a=b, a=a, b=b
92.7±0.1
a=b, b=a, a=a, b=b
98.8±0.0

Tableau 2: ALM probing accuracy (–ATTN; %) sur
our propositional logic test set with pretraining
data with different properties, où un, b are ex-
pressions in Lt. We report the mean and standard
deviation across five probe training seeds.

sonde. We note that it is expected that the perfor-
mance of pretrained models does not reach 100%.
While Merrill et al. (2021) showed its theoreti-
cal possibility, their setup assumes active learning
with unlimited access to ℵ and allows the ‘‘probe’’
δ to be an arbitrarily powerful function, among
other differences.

Grounding. We found that independently sam-
pling pretraining sequences results in unsuccessful
emulation with probing performance at random.
Plutôt, it is crucial to ground = with reflexivity
and symmetry.11 We achieve this by augment-
ing the pretraining data: if a=b is a pretraining
séquence, we ensure a=a, b=b (reflexivity), et
b=a (symmetry) are too. This imposes a con-
straint on the pretraining data distribution that
eases the learning of =’s meaning. Tableau 2 shows
that both properties are important. We consider
the implication in §5.

3.4 Analysis: Direct Evaluation on Lt

The process of training a probe introduces ad-
ditional complexity, such as ±ATTN, that poten-
tially complicates our analysis. Donc, we also
test a stronger condition where there is no addi-
tional classifier: Can the pretrained models evalu-
ate expressions, without any further training (par exemple.,
a probe)? For MLM, it is the most straightforward
to compare if the model assigns a higher proba-
bility to T or F in sentence=[MASK]. Cependant,
this is a sequence that never occurs in the pre-
training corpus since a standalone T or F is not
part of our language (Eq. 1). Donc, we use
five templates on the right-hand side that are min-

11Reflexivity states that a = a, and symmetry a = b ⇒
b = a. Equality further requires transitivity: a = b ∧ b =
c ⇒ a = c, but it is not tested in our probing setup and
we found it unimportant for probing accuracy in preliminary
experiments.

imal in our language: (T∧[MASK]), (F∨[MASK]),
([MASK]∧T), ([MASK]∨F), (¬[MASK]). For the
first four templates, we expect the masked po-
sition to be filled with the truth value of the
proposition, and the negated value for the last one.
For ALM, we compare if the model assigns a
higher probability to the sequence where [MASK]
is filled in with T vs. F.

Results. The bottom section of Table 1 shows
the mean and standard deviation of the evaluation
accuracy across our five templates. Without train-
ing, a random model always has 50.0% accuracy
on expectation. Both ALM and MLM achieve a
high evaluation accuracy, au-dessus de 95%, corroborat-
ing the LMs’ capability to represent the meaning
of Lt.

These results respond to the argument

dans

Bender and Koller (2020):

We let GPT-2 complete the simple arith-
metic problem Three plus five equals.
The five responses below […] show that
this problem is beyond the current capa-
bility of GPT-2, et, we would argue,
any pure LM.

We showed that form-only supervision does al-
low such evaluation on a strongly transparent
langue, at least when the supervising data dis-
tribution satisfies symmetry and reflexivity.

3.5 Non-transparency

Building towards non-transparent natural
lan-
guage, it is important to understand strong trans-
parency’s effect on emulatability. We design a
minimally perturbed version of Lt that is non-
transparent, Ln. The syntax stays the same, mais
we change the semantics such that ¬ has a side
effect: When followed by T or F, it inverts the
meaning of these literals that occur in certain
other environments. Spécifiquement, chaque (¬T) node
changes the meaning of all the literals T in its c-
commanded subtree (c'est à dire., the e subtree headed by
le (¬T) node’s sibling, if there is one; Reinhart,
1976) to F. An additional (¬T) does not invert
back. De la même manière, (¬F) changes the meaning of
the literal F to T. Par exemple, in the sentence
node is c-commanded by (ou,
in Figure 1, le
(¬T)
a descendant of a sibling of) le
node, so its meaning is changed to F. Sur
the e → (¬ ) node does
the other hand,

622

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
6
5
2
1
3
8
3
5
0

/

/
t

je

un
c
_
un
_
0
0
5
6
5
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

not invert the meaning of the unboxed T because
they do not constitute a c-command relation. Ce
alternation is inspired by binding theory in gen-
erative grammar (Chomsky, 1981, 1983), où
le (¬T) node is the binder that c-commands
the bindee. ince the meaning of T and F now
depends on the existence of a binder, Ln is non-
transparent.12

Results. We conduct
the same pretraining/
probing/direct evaluation procedure on Ln. Tableau 1
reports the results. on-transparency decreases
ALM’s probing accuracy with both –ATTN and
+ATTN, though not to random level. The variance
across different probe training seeds also increases
compared to Lt, indicating that the pretrained
representation is less robust. Directly evaluating
ALM with Ln similarly leads to both decreased
average accuracy and increased variance. MLM,
on the other hand, achieves random probing and
evaluation accuracy. Dans l'ensemble, the lack of strong
transparency reduces models’ meaning emulation
ability, though not always to chance performance.

4 What About Natural Language?

While existing LM architectures and objectives
are able to emulate the meaning of synthetic
languages, it is unclear how these observations
transfer to natural language (NL). Merrill et al.
(2021) hinted that, since NL is non-transparent
and likely more complex than their constructed
non-emulatable language, it is probable that a
pretraining procedure, even with ℵ-access, peut-
not emulate its meaning either. Ce, cependant,
remained an untested hypothesis.

We formalize this intuition and prove that
a specific NL phenomenon, referential opacity,
makes NL non-transparent.13 This phenomenon
has been widely studied in semantics (Quine,
1956; Kripke, 1972, entre autres), yet it has re-
ceived little attention in modern NLP. We fill this
gap from the perspective of strong transparency
and study the representation of this phenomenon

12This is a straightforward way to introduce a ¬ with side
effect to a hierarchical structure. An alternative is to rely
on a linear structure and invert all literals linearly follow-
ing ¬. Nevertheless, our version leverages the hierarchical
reasoning that the model originally needs to possess to eval-
uate an expression, while this version requires a new type
of reasoning that is linear. So that change would be less
minimal.

13Deictic expressions are another example, though they
have been extensively studied under coreference resolution.

in modern LMs with a probing-based and a
sentence similarity-based analysis.

4.1 Referential Opacity

To illustrate referential opacity, we use the classic
example in semantics:

Exemple 1.

(un) Lois Lane believes Superman is a hero.

(b) Lois Lane believes Clark Kent is a hero.

Note that (un) et (b) have different truth con-
ditions: Their truth values differ if Lois Lane
does not know Superman and Clark Kent are
the same person. Officiellement, (cid:2)Lois Lane believes
Superman is a hero.|λ2(cid:3) (cid:15)= (cid:2)Lois Lane believes
Clark Kent is a hero.|λ2(cid:3).14 On the other hand,
(cid:2)Superman|λ2(cid:3) = (cid:2)Clark Kent|λ2(cid:3).15 In other
words, two expressions that have the same deno-
tation, when embedded in the same context, yield
sentences with different truth conditions. Tel
contexts are called referentially opaque, et, dans
this case, they are induced by a propositional at-
titude verb ‘‘believes’’ whose meaning depends
on the cognitive state of its subject (Anderson
and Owens, 1990).

Now we formalize referential opacity:

Definition 3. In natural language, an expression
e is contextually valid in κ = (cid:6)je, r(cid:7) if none of
(cid:2)je|λ, er(cid:3), (cid:2)e|je, r(cid:3), (cid:2)r|le, λ(cid:3) is ∅.16

Definition 4. A context κ = (cid:6)je, r(cid:7) in natural
language is referentially opaque if there exist ex-
pressions e1, e2, both contextually valid in κ, tel
que (cid:2)e1|λ2(cid:3) = (cid:2)e2|λ2(cid:3) et (cid:2)le1r|λ2(cid:3) (cid:15)= (cid:2)le2r|λ2(cid:3).
Def. 4 matches the linguistic phenomenon:
Let e1=‘‘Superman’’, e2=‘‘Clark Kent’’, et le
opaque context κ=(cid:6)‘‘Lois Lane believes’’, ‘‘is a
hero.’’(cid:7), and we recover our analysis of Ex. 1
au-dessus de.

que

15Il

is possible

(cid:2)Superman|λ2(cid:3)

14In this section we consider the language L to be English,
or any NL that exhibits this phenomenon, et (cid:2)·|·(cid:3) to be
intensions (§2.2). We drop the subscript L for brevity.
to argue

(cid:15)=
(cid:2)Clark Kent|λ2(cid:3) if we consider their intension to be dif-
ferent. Nevertheless, we adopt the view of Heim and Kratzer
(1998, §12.3) to not
introduce intensionality by default
(c'est à dire., with κ = λ2), but rather to evoke it by context: ‘‘The
usual denotations are extensions. But for nonextensional
contexts, Intensional Functional Application allows a switch
to intensions. The switch is triggered by particular lexical
items […]’’.

16This is a technical detail needed for proving Theorem 1.

623

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
6
5
2
1
3
8
3
5
0

/

/
t

je

un
c
_
un
_
0
0
5
6
5
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Now, we prove that the existence of referen-
tially opaque contexts implies non-transparency.
We assume compositionality, for which we pro-
vide a working definition: (cid:2)ler|λ2(cid:3) = f ((cid:2)je|λ, er(cid:3),
(cid:2)e|je, r(cid:3), (cid:2)r|le, λ(cid:3)) for some meaning composition
function f .17 Intuitively, the proof shows that
if all expressions have fixed meaning (c'est à dire., sont
strongly transparent), referential opacity would
not arise.

Theorem 1. A compositional
language with
referentially opaque contexts is not strongly
transparent.

Proof. Suppose by contradiction we have such a
language L that is strongly transparent. Let e1, e2
be expressions in some opaque context (cid:6)je, r(cid:7) in L.

(cid:3)

(cid:3)

= f

(cid:2)le1r|λ2(cid:3) = f ((cid:2)je|λ, e1r(cid:3), (cid:2)e1|je, r(cid:3), (cid:2)r|le1, λ(cid:3))
By compositionality
(cid:4)
(cid:2)je|λ, e2r(cid:3), (cid:2)e1|λ2(cid:3), (cid:2)r|le2, λ(cid:3)
By strong transparency
(cid:4)
(cid:2)je|λ, e2r(cid:3), (cid:2)e2|λ2(cid:3), (cid:2)r|le2, λ(cid:3)
By referential opacity premise
= f ((cid:2)je|λ, e2r(cid:3), (cid:2)e2|je, r(cid:3), (cid:2)r|le2, λ(cid:3))
By strong transparency
(cid:2)le2r|λ2(cid:3)
By compositionality

= f

=

This violates (cid:2)le1r|λ2(cid:3) (cid:15)= (cid:2)le2r|λ2(cid:3), the referential
opacity premise. So L is not strongly transparent.

Donc, as a non-transparent example in NL,
we study whether referential opacity is reflected
in the representation of current LMs.

4.2 Données

We cast referential opacity as a sentence pair bi-
nary classification problem. We generate sentence
pairs like Ex. 1 as our dataset. Ex. 1 consists of
two parts that correspond to the two conditions in
Def. 4: two co-referring expressions ((cid:2)e1|λ2(cid:3) =
(cid:2)e2|λ2(cid:3)), and a referentially opaque context
that embeds the entity ((cid:2)le1r|λ2(cid:3) (cid:15)= (cid:2)le2r|λ2(cid:3)).
Suivant, we separately introduce how we generate
eux. Our final dataset consists of 45K/6K/6K
training/development/testing sentence pairs for
GPT-2 and 97K/12K/12K for BERT. §C provides

17This is a mild assumption, considering the generality of
compositionality (Fodor and Pylyshyn, 1988) and that our
definition is weak, par exemple., weaker than that of Andreas’s (2019).

624

more details, including more fine-grained dataset
statistics for different experimental settings below.

it

Co-referring Expressions. The co-referring
expressions in Ex. 1 are proper names, ‘‘Super-
man’’ and ‘‘Clark Kent.’’ Not only is this hard to
collect data for, mais, due to the rigidity of proper
names (Kripke, 1972),
is also theoretically
more challenging to analyze as the classic in-
tensionality framework is more difficult to apply
(Von Fintel and Heim, 2011).18 We hence con-
sider co-referring expressions that are one proper
name and one definite description, such as ‘‘Yuri
Gagarin’’ and ‘‘the first person in space,’’ which
can be more straightforwardly accounted for with
intensionality (Heim and Kratzer, 1998, §12; Von
Fintel and Heim, 2011). We use the LAMA data-
ensemble (Petroni et al., 2019), specifically the T-REx
split (Elsahar et al., 2018) following recent fac-
tual probing work (Jiang et al., 2020; Shin et al.,
2020; Zhong et al., 2021), to obtain a list of such
entities. To make sure the model representation
captures the coreference, we follow Petroni et al.
(2019) and use LAMA to prompt the LM with
these equivalences and only keep entities that are
correctly predicted.19

Contexts. We construct referentially opaque
and referentially transparent contexts to embed
these co-referring expressions. We only consider
referential opacity involving propositional attitude
verbs, where the context is referentially opaque
iff its main verb conveys propositional attitude.
There are other types of referential opacity, tel
as counterfactuals (Von Fintel and Heim, 2011;
Kearns, 2011, §7) and substitutions that shift the
syntactic status of constituents (par exemple., Fine, 1990),
that we omit in this work for simplicity, though
they could be targets of future studies. We man-
ually design two classes of templates, depending
on the verb’s argument structure. The first has an
embedded clause, par exemple.,
Exemple 2. Label = non-equivalent20

18Though see Shabasson (2018) for a theorization.
19Previous work (Poerner et al., 2020; Dufter et al., 2021;
Cao et al., 2021) questioned whether such prompting mea-
sures model ‘‘understanding.’’ Our setup, though, does not
depend on ‘‘understanding’’, but only requires association.

20Consider, Par exemple, if this person is Yuri’s neigh-
bor and wants to meet him for dinner, mais, being an avid
flat-earther, is not fond of space traveling and is unaware
that he has been to space. She would say she wants to meet
Yuri Gagarin but has no interest in meeting the first person
in space.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
6
5
2
1
3
8
3
5
0

/

/
t

je

un
c
_
un
_
0
0
5
6
5
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

(un) She wants to meet Yuri Gagarin.

(b) She wants to meet the first person in space.

The second contains only the main clause, tel que

Simple

Exemple 3. Label = equivalent

(un) He speaks Lao.

(b) He speaks the official language of Laos.

The two sentences in a pair only differ by the
entity reference: one is a name and one is a definite
description. A sentence pair is non-equivalent iff
it has a referentially opaque context, or within our
scope of study, iff its main verb is a propositional
attitude verb. We gather the list of verbs from past
linguistic studies and verify with native speaker
judgment (see §C).

4.3 Models
We consider GPT-2-XL and BERT-large-cased21,
the largest variants in these two families, as repre-
sentative autoregressive and masked LMs. Ils
have 1.5B and 340M parameters, respectivement.
We obtain sentence representations in the same
way as in §3, except without attention-weighting
and simply using the [CLS] embedding for BERT.

4.4 Analysis: Probing
We use the same bilinear probe in §3 as a bi-
nary classifier over sentence pairs, determining
the equivalence, or the referential transparency, de
each pair. Cependant, because of the lexicalized na-
ture of referential opacity, the probe could easily
overfit and recognize not their equivalence but the
existence of a propositional attitude verb.

To overcome this, we introduce attractors
(Linzen et al., 2016; Gulordava et al., 2018; Pandia
and Ettinger, 2021, entre autres).22 We always
conjoin a clause with a propositional attitude verb
and one with a non-attitude verb, disallowing the
aforementioned heuristics. The equivalence label
now depends on if the entity alternation occurs
under the non-attitude verb, which would result
in an equivalent sentence pair, or the attitude

21Not RoBERTa as in §3, because BERT’s [CLS] token
can act as and is commonly taken to be the sentence repre-
phrase (Devlin et al., 2019; Karpukhin et al., 2020, among
others).

22Another option is to have disjoint training and testing
verbs. This did not work in preliminary experiments be-
cause verbs that induce referential opacity are semantically
closer, as they always convey propositional attitude. Donc
the model could use this similarity in the word embedding
space to extrapolate.

Equiv.
Non-equiv.
Dans l'ensemble

Equiv.
Non-equiv.
Dans l'ensemble

GPT-2

BERT

100.0±0.00
100.0±0.00
100.0±0.00

85.3±0.03
15.5±0.03
50.4±0.00

100.0±0.00
100.0±0.00
100.0±0.00

72.4±0.03
29.6±0.03
51.0±0.00

Coord.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
6
5
2
1
3
8
3
5
0

/

/
t

je

un
c
_
un
_
0
0
5
6
5
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Tableau 3: Probing accuracy (%) for referential
opacity on GPT-2-XL and BERT-large-cased. Nous
report the mean and standard deviation across 10
seeds. We consider two types of sentences, sim-
ple sentences without attractors and coordinated
sentences with attractors. For each type, we show
both the label-specific accuracy (Equivalent/Non-
equivalent) and the overall accuracy.

verb, which would lead to non-equivalence. Pour
example:

Exemple 4. Label = equivalent

(un) He speaks Lao and she wants to meet Yuri

Gagarin.

(b) He speaks the official language of Laos and

she wants to meet Yuri Gagarin.

Exemple 5. Label = non-equivalent

(un) He speaks Lao and she wants to meet Yuri

Gagarin.

(b) He speaks Lao and she wants to meet the first

person in space.

Despite both examples having the same verbs, le
sentence pair in Ex. 4 is equivalent, but Ex. 5 est
pas. We are not using attractors for out-of-domain
evaluation; instead, the training and test sets are
i.i.d., but we break down the test set performance
by categories.

We train a probe on GPT-2-XL and BERT-large
over 10 random seeds. Details are in §D. Tableau 3
reports the results. As expected, both models
overfit with the attractor-less simple sentences,
achieving perfect accuracy. With attractors in co-
ordinated sentences, cependant, both models ob-
tain near-random performance overall. Because
the training and test sets are i.i.d., this means that
semantic equivalence based on referential opacity
cannot be probed in our setup from these two
models, suggesting an inadequate representation

625

of this phenomenon.23 Interestingly, both mod-
els tend to predict equivalence more than non-
equivalence (more prominent with GPT-2 than
BERT), likely due to the nuanced nature of this
task: Without
entraînement, a human would likely
judge equivalence on referentially opaque sen-
tence pairs too.24 See §E for a set of experiments
that show that LMs can potentially learn to cap-
ture referential opacity with semantic supervision
following pretraining.

4.5 Analysis: Sentence Similarity
As in §3.4, the simplicity of a training-free anal-
ysis can be desirable. To this end, we directly
measure the cosine similarity between the two
sentence representations in a pair. While this se-
mantic similarity would be high for both groups of
sentences by our construction, equivalent sentence
pairs should have more similar representations
than those that are not. While factors other than
semantics, such as syntax, also affect sentence
representations, we strictly control them in our
synthetic data generation to be identical between
referentially transparent and opaque sentences.
We do not consider attractor sentences (§4.4) dans
this analysis.

For significance testing, we employ an exact
permutation test (Pêcheur, 1935) and a bootstrap
test (Efron and Tibshirani, 1993) avec 1,000 it-
erations, performed across verbs, where the test
statistic is the difference between the averaged
cosine similarity of the two groups. Both tests
are two-sided with the null hypothesis being that
the model representation does not distinguish be-
tween the two classes of verbs. For GPT-2-XL,
the permutation test gives p = 0.64 and bootstrap
gives p = 0.66, barring us from rejecting the null
hypothèse. For BERT-large, they give p = 0.45
and p = 0.57 respectivement, where we again ob-
serve no significant difference between the two
classes. Néanmoins, we note that the inability to
reject the null hypothesis does not entail it is true.
Reimers and Gurevych (2019) noted that
computing sentence pair cosine similarity using
BERT’s [CLS] token, as we did, does not cor-
relate well with textual similarity benchmarks.

23There might still be other more complex heuristics, mais
even so, the probe still fails. Hence we do not need addi-
tional attractors to rule out all possible heuristics.

24Though, with training, it is relatively straightforward to
perform this task for a human, so it is reasonable to test the
ability in LMs.

This phenomenon is commonly attributed to the
anisotropic nature of pretrained representations
(Ethayarajh, 2019). This does not undermine the
validity of our method, which instead relies on
the correlation between the cosine similarity and
the model’s representation of semantic close-
ness. We ensure this correlation by controlling
for all factors other than semantics (syntax, lexi-
cal choices, entities, etc.). Nevertheless, we also
postprocess BERT’s [CLS] representation using
BERT-flow (Li et al., 2020) which has been shown
to increase the correlation with textual similarity
benchmarks. We obtain a similar result: Bootstrap
gives p = 0.49. While the two-sided permutation
test gives p = 0.03 with potential significance,
the one-sided version gives p = 0.99; in other
words, the calibrated space represents opaque sen-
tence pairs to be more similar than transparent
ones, contrary to our expectation that equivalent
sentence pairs should be closer in the represen-
tation space than non-equivalent ones when all
other factors are controlled.

The results from these two sets of analyses in
§4.4 and §4.5 are consistent and show no evi-
dence of modern LMs representing referential
opacity, demonstrating that
they cannot fully
emulate the meaning of NL. Our finding adds
to recent observations that pretrained LMs do
not represent semantic phenomena well (Tenney
et coll., 2019; Kovaleva et al., 2019; Wu et al., 2021,
entre autres). Theoretically, it also strengthens
the connection between strong transparency and
meaning emulatability with NL-based empirical
evidence.

5 Discussion

Through analyses based on probing and direct
evaluation, we have seen that existing LM ar-
chitectures and objectives can learn to emulate
the meaning of a strongly transparent language
Lt when the training data reflects equivalence
relations. While non-transparency (Ln) causes
this ability to decrease, the trained models still
outperform a random model in certain setups. Nous
believe this result hints at the strength of current
LM architectures and objectives.25 There seems
to be a limit to this strength, though—in natural

25Especially since our setting is more challenging than
Merrill et al.’s (2021) algorithme, without their unlimited
ℵ-access, active learning, arbitrarily powerful δ, etc.. Plus,
we restrict ℵ queries to be sentences and disallow compar-
ing a sentence with T or F using ℵ.

626

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
6
5
2
1
3
8
3
5
0

/

/
t

je

un
c
_
un
_
0
0
5
6
5
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

langue, neither GPT-2 nor BERT represents
the non-transparent phenomenon of referential
opacity well.

Our results shed light on the relationship be-
tween the strong transparency of a language
and whether its semantics can be emulated. Nous
observed co-variation between the two: Quand
slightly perturbed to be non-transparent, our logic
language becomes harder to emulate; and there is
no evidence for LMs representing the semantics of
a non-transparent NL phenomenon. Nevertheless,
the above-random emulation performance with
Ln suggests that there could be language prop-
erties that potentially better predict emulatability,
leaving room for future theoretical endeavors.

Nous avons également constaté que, with a similar size and
training procedure (§3.2), ALM is more suitable
for representing the meaning of our propositional
logic languages than MLM, in our setup. ALM
achieves better probing accuracy than MLM under
both methods of obtaining sentence representa-
tions that we explored. Aussi, MLM completely
fails to emulate meaning facing non-transparency,
but not ALM. Finalement, though, we hope to
understand if this difference transfers to natu-
ral language. Our NL investigation reveals that
both ALM (GPT-2) and MLM (BERT) achieve
chance-level probing performance on the one phe-
nomenon that we inspected,
likely due to its
difficulty. It would be interesting for future ef-
forts to further examine their differences, if any,
in learning and representing the meaning of other
NL phenomena.

Our results also lead to the question: Why can
LMs achieve above-random results on Ln but not
referential opacity? While it is entirely possible
that the latter is simply more difficult than our
synthetic non-transparency, there are other factors
at play. First of all, natural language is much more
variable than our synthetic language: Utterances
can be untruthful (though they are in general gov-
erned by Gricean quality; Grice, 1975), subjective
(such as our earlier claim about Corgis’ cute-
ness, §2.3), intensional (see Merrill et al., 2021
for a discussion), etc.. But putting these variations
aside, we saw from §3 that even the synthetic lan-
guage requires an explicit grounding of = to enable
emulation, and this is missing from NL pretrain-
ing. It is certainly not the case that, for every
expression such as ‘‘Corgis are the cutest dogs.’’
that exists in the pretraining corpus, la variété-
ations ‘‘The cutest dogs are Corgis.’’, ‘‘Corgis

are Corgis.’’, ‘‘The cutest dogs are the cutest
dogs.’’ are also guaranteed to appear. So perhaps
there needs to be a more foundational change in
our pretraining objective. As Brown et al. (2020)
foretold, ‘‘A more fundamental limitation of […]
scaling up any LM-like model […] is that it may
eventually run into (or could already be running
into) the limits of the pretraining objective.’’ Our
results point to one such possibility: We believe
research into a more explicit representation of se-
mantic relations in future pretraining processes,
such as based on paraphrases, could be fruitful.

What we did not investigate, though, is whether
partial equivalence grounding enables emulation:
what if, Par exemple, only 1% of the pretraining
data has this form of grounding, while the rest
does not? And the above format already exists for
certain sentences in NL. Ce, aussi, could be an
exciting future research question.

6 Related Work

Bender and Koller (2020) initiated the discussion
on the possibility of a learner acquiring mean-
ing from training on linguistic forms alone. Depuis
first principles, they argued for its impossibility.
Empirically, Traylor et al. (2021) also found that
LMs cannot well-represent lexical-level symbols
when the pretraining data is distributionally con-
strained to supply relevant signals. Merrill et al.
(2021), on the other hand, proved theoretically that
it is possible to emulate the meaning of strongly
transparent languages with assertion oracle access.
We showed in this work that, empirically, LMs
also attain the capability. The work of Patel and
Pavlick (2022) is also conceptually similar to our
travail, discovering that the internal representa-
tion of LMs is to a large extent isomorphic to the
conceptual spaces of directions and colors. Ils
adopted in-context learning (Brown et al., 2020;
entre autres) to elicit the isomorphism, while we
used the more traditional probing paradigm.

Another line of work has inspected the extent
to which pretrained LMs encode various types of
semantic information. Some have examined the
representation of lexical semantics: Gar´ı Soler
and Apidianaki (2021) found that BERT represen-
tations reflect polysemy levels, and Vuli´c et al.
(2020) showed that they also capture abundant
type-level lexical knowledge. On the other hand,
Ettinger (2020) and Ravichander et al. (2020)
have discovered that pretrained LMs do not

627

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
6
5
2
1
3
8
3
5
0

/

/
t

je

un
c
_
un
_
0
0
5
6
5
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

satisfactorily encode negation and hypernymy,
respectivement. Moving beyond the lexical level,
Wu et al. (2021) demonstrated that pretrained
BERT and RoBERTa models less readily surface
semantic dependency information than syntactic
dependencies, while Li et al. (2021) identified
evidence of dynamic semantics representation in
these models.

7 Conclusion

We have empirically shown that pretrained lan-
guage models are able to emulate the meaning of a
strongly transparent language through pretraining
on an assertion-inspired format, but this abil-
ity deteriorates when the language is minimally
perturbed to be no longer strongly transparent.
En outre, we found no representation of ref-
erential opacity, which is significant for being a
non-transparent natural language phenomenon, dans
pretrained LMs.

Remerciements

We thank the TACL reviewers and action edi-
tor for helpful feedback on this work. We thank
Kyle Richardson, Jesse Dodge, and other mem-
bers of AI2 for insightful discussions. This work
was funded in part by NSF award 1922658.
WM was supported by an NSF graduate research
fellowship.

Les références

Guillaume Alain and Yoshua Bengio. 2017.
Understanding intermediate layers using linear
classifier probes. In 5th International Confer-
ence on Learning Representations, Workshop
Track Proceedings.

C. Anthony Anderson and Joseph Owens, edi-
tors. 1990. Propositional Attitudes: The Role of
Content in Logic, Language, and Mind. CSLI
Lecture Notes; Non. 20. Center for the Study of
Language and Information, Stanford, Californie.

Jacob Andreas. 2019. Measuring composi-
tionality in representation learning. In Proceed-
ings of International Conference on Learning
Representations.

and understanding in the age of data.
Dans
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguis-
tics, pages 5185–5198, En ligne. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.463

Samuel R. Bowman, Gabor Angeli, Christophe
Potts, and Christopher D. Manning. 2015. UN
large annotated corpus for learning natural lan-
guage inference. In Proceedings of the 2015
Conference on Empirical Methods in Natural
Language Processing, pages 632–642, Lisbon,
Portugal. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/D15-1075

Tom Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared D. Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel Ziegler, Jeffrey Wu, Clemens Winter,
Chris Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack
Clark, Christopher Berner, Sam McCandlish,
Alec Radford,
Ilya Sutskever, and Dario
Amodei. 2020. Language models are few-shot
learners. In Proceedings of Advances in Neural
Information Processing Systems, volume 33,
pages 1877–1901. Curran Associates, Inc.

Boxi Cao, Hongyu Lin, Xianpei Han, Le Sun,
Lingyong Yan, Meng Liao, Tong Xue, et
Jin Xu. 2021. Knowledgeable or educated
guess? Revisiting language models as knowl-
edge bases. In Proceedings of the 59th Annual
Meeting of the Association for Computational
Linguistics and the 11th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 1860–1874,
En ligne. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2021.acl-long.146

Rudolf Carnap. 1947. Meaning and Necessity: UN
Study in Semantics and Modal Logic. Univer-
sity of Chicago Press, Chicago.

Noam Chomsky. 1981. Lectures on Government
and Binding. Studies in Generative Grammar.
Foris Publications.

Emily M. Bender and Alexander Koller. 2020.
Climbing towards NLU: On meaning, formulaire,

Noam Chomsky. 1983. Some Concepts and Consequences of the Theory of Government and Binding. Linguistic Inquiry Monograph 6. MIT Press.

Alexis Conneau, German Kruszewski, Guillaume
Lample, Loïc Barrault, and Marco Baroni.
2018. What you can cram into a single $&!#*
vector: Probing sentence embeddings for lin-
guistic properties. In Proceedings of the 56th
Annual Meeting of the Association for Com-
putational Linguistics (Volume 1: Long Pa-
pers), pages 2126–2136, Melbourne, Australia.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/P18
-1198

Jacob Devlin, Ming-Wei Chang, Kenton Lee, et
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N19-1423

Philipp Dufter, Nora Kassner, and Hinrich Schütze. 2021. Static embeddings as efficient knowledge bases? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2353–2363, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.186

Bradley Efron and Robert J. Tibshirani. 1993. An Introduction to the Bootstrap. Number 57 in Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton, Florida, USA.

Hady Elsahar, Pavlos Vougiouklis, Arslen
Remaci, Christophe Gravier, Jonathon Hare,
Frederique Laforest, and Elena Simperl. 2018.
T-REx: A large scale alignment of natural lan-
guage with knowledge base triples. In Proceed-
ings of the Eleventh International Conference
on Language Resources and Evaluation (LREC
2018), Miyazaki, Japan. European Language
Resources Association (ELRA).

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1006

Allyson Ettinger. 2020. What BERT is not:
Lessons from a new suite of psycholinguistic
diagnostics for language models. Transactions
of the Association for Computational Linguis-
tics, 8:34–48. https://doi.org/10.1162/tacl_a_00298

Kit Fine. 1990. Quine on quantifying in. In Propositional Attitudes: The Role of Content in Logic, Language, and Mind, CSLI Lecture Notes, No. 20, pages 1–26. Center for the Study of Language and Information, Stanford, CA.

R.. UN. Pêcheur. 1935. The Design of Experiments.
The Design of Experiments. Oliver and Boyd.

Jerry A. Fodor and Zenon W. Pylyshyn.
1988. Connectionism and cognitive architec-
ture: A critical analysis. Cognition, 28(1):3–71.
https://doi.org/10.1016/0010-0277(88)90031-5, PubMed: 2450716

Gottlob Frege. 1892. Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik.

Aina Garí Soler and Marianna Apidianaki. 2021. Let’s play mono-poly: BERT can reveal words’ polysemy level and partitionability into senses. Transactions of the Association for Computational Linguistics, 9:825–844. https://doi.org/10.1162/tacl_a_00400

Herbert P. Grice. 1975. Logic and conversation.
In Peter Cole and Jerry L. Morgan, editors,
Syntax and Semantics: Vol. 3: Speech Acts,
pages 41–58. Academic Press, New York.
https://doi.org/10.1163/9789004368811_003

Jeroen Groenendijk and Martin Stokhof. 1991.
Dynamic predicate logic. Linguistics and Phi-
losophy, 14(1):39–100. https://doi.org
/10.1007/BF00628304

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1108

Adi Haviv, Ori Ram, Ofir Press, Peter Izsak,
and Omer Levy. 2022. Transformer language
models without positional encodings still learn
positional information. https://doi.org
/10.48550/arXiv.2203.16634

Irene Heim. 1982. The Semantics of Definite and Indefinite Noun Phrases. Ph.D. thesis, University of Massachusetts Amherst.

Fred Karlsson. 2010. Syntactic Recursion and Iteration. Studies in Generative Grammar. De Gruyter Mouton, Germany. https://doi.org/10.1515/9783110219258.43

Vladimir Karpukhin, Barlas Oguz, Sewon Min,
Patrick Lewis, Ledell Wu, Sergey Edunov,
Danqi Chen, and Wen-tau Yih. 2020. Dense
passage retrieval for open-domain question
answering. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 6769–6781,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.emnlp-main.550

Irene Heim and Angelika Kratzer. 1998. Semantics in Generative Grammar. Blackwell.

Kate Kearns. 2011. Semantics. Macmillan Modern Linguistics. Palgrave Macmillan.

John Hewitt and Percy Liang. 2019. Designing and
interpreting probes with control tasks. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 2733–2743, Hong Kong, China. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1275

John Hewitt and Christopher D. Manning. 2019.
A structural probe for finding syntax in word
representations. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4129–4138,
Minneapolis, Minnesota. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N19-1419

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438. https://doi.org/10.1162/tacl_a_00324

Hans Kamp. 1981. A theory of truth and semantic representation. In Jeroen Groenendijk, Theo Janssen, and Martin Stokhof, editors, Formal Methods in the Study of Language, pages 277–322. Amsterdam: Mathematisch Centrum.
Olga Kovaleva, Alexey Romanov, Anna Rogers,
and Anna Rumshisky. 2019. Revealing the dark
secrets of BERT. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1445

Saul A. Kripke. 1972. Naming and necessity. In Semantics of Natural Language, pages 253–355. Springer.

Belinda Z. Li, Maxwell Nye, and Jacob Andreas. 2021. Implicit representations of meaning in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1813–1827, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.143

Bohan Li, Hao Zhou, Junxian He, Mingxuan
Wang, Yiming Yang, and Lei Li. 2020. On the
sentence embeddings from pre-trained language
models. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 9119–9130, Online. Association for Computational Linguistics.

Tal Linzen, Emmanuel Dupoux, and Yoav
Goldberg. 2016. Assessing the ability of
LSTMs to learn syntax-sensitive dependencies.
Transactions of the Association for Computa-
tional Linguistics, 4:521–535. https://doi.org/10.1162/tacl_a_00115

Nelson F. Liu, Matt Gardner, Yonatan Belinkov,
Matthew E. Peters, and Noah A. Smith. 2019a.
Linguistic knowledge and transferability of
contextual representations. In Proceedings of
the 2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technolo-
gies, Volume 1 (Long and Short Papers),
pages 1073–1094, Minneapolis, Minnesota.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N19
-1112

Nelson F. Liu, Roy Schwartz, and Noah A.
Smith. 2019b. Inoculation by fine-tuning: A method for analyzing challenge datasets. In Proceedings of the 2019 Conference of the
North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long and
Short Papers), pages 2171–2179, Minneapolis,
Minnesota. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/N19-1225

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019c. RoBERTa: A robustly opti-
mized BERT pretraining approach. https://
doi.org/10.48550/arXiv.1907.11692

Ilya Loshchilov and Frank Hutter. 2019. Decou-
pled weight decay regularization. In Proceed-
ings of International Conference on Learning
Representations.

Qing Lyu, Zheng Hua, Daoxin Li, Li Zhang,
Marianna Apidianaki, and Chris Callison-
Burch. 2022. Is ‘‘my favorite new movie’’
my favorite movie? Probing the understanding
of recursive noun phrases. In Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 5286–5302, Seattle, United States.
Association for Computational Linguistics.

https://doi.org/10.18653/v1/2022
.naacl-main.388

William Merrill, Yoav Goldberg, Roy Schwartz,
and Noah A. Smith. 2021. Provable limitations
of acquiring meaning from ungrounded form:
What will future language models understand?
Transactions of the Association for Compu-
tational Linguistics, 9:1047–1060. https://
doi.org/10.1162/tacl_a_00412

Lalchand Pandia and Allyson Ettinger. 2021.
Sorting through the noise: Testing robustness
of information processing in pre-trained lan-
guage models. In Proceedings of the 2021
Conference on Empirical Methods in Natu-
ral Language Processing, pages 1583–1596,
Online and Punta Cana, Dominican Republic.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.emnlp-main.119

Roma Patel and Ellie Pavlick. 2022. Mapping language models to grounded conceptual spaces. In International Conference on Learning Representations.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel,
Patrick Lewis, Anton Bakhtin, Yuxiang Wu,
and Alexander Miller. 2019. Language models
as knowledge bases? In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessation (EMNLP-IJCNLP), pages 2463–2473,
Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1250

Nina Poerner, Ulli Waltinger, and Hinrich
Schütze. 2020. E-BERT: Efficient-yet-effective
entity embeddings for BERT. In Findings of
the Association for Computational Linguis-
tics: EMNLP 2020, pages 803–818, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.findings-emnlp.71

Willard V. Quine. 1956. Quantifiers and propo-
sitional attitudes. The Journal of Philosophy,
53(5):177–187. https://doi.org/10.2307
/2022451

Alec Radford, Jeff Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.

Language models are unsupervised multitask
learners. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Abhilasha Ravichander, Eduard Hovy, Kaheer
Suleman, Adam Trischler, and Jackie Chi Kit
Cheung. 2020. On the systematicity of prob-
ing contextualized word representations: Le
case of hypernymy in BERT. In Proceed-
ings of the Ninth Joint Conference on Lexical
and Computational Semantics, pages 88–102,
Barcelona, Spain (Online). Association for
Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019.
Sentence-BERT: Sentence embeddings using
Siamese BERT-networks. In Proceedings of
the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 3982–3992, Hong Kong, China. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1410

Tanya Reinhart. 1976. The Syntactic Domain of
Anaphora. Ph.D. thesis, Massachusetts Insti-
tute of Technology, Cambridge.

Daniel S. Shabasson. 2018. The Two Indexical
Uses Theory of Proper Names and Frege’s
Puzzle. Ph.D. thesis, City University of New
York.

Stuart M. Shieber. 1985. Evidence against the
context-freeness of natural language. Linguis-
tics and Philosophy, 8:333–343. https://
doi.org/10.1007/BF00630917

Taylor Shin, Yasaman Razeghi, Robert L. Logan
IV, Eric Wallace, and Sameer Singh. 2020.
AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Confer-
ence on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 4222–4235,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.emnlp-main.346

Jeff Speaks. 2021. Theories of meaning. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, Spring 2021 edition. Metaphysics Research Lab, Stanford University.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang,
Adam Poliak, R.. Thomas McCoy, Najoung
Kim, Benjamin Van Durme, Sam Bowman,
Dipanjan Das, and Ellie Pavlick. 2019. What
do you learn from context? Probing for sen-
tence structure in contextualized word rep-
resentations. In Proceedings of International
Conference on Learning Representations.

Aaron Traylor, Roman Feiman, and Ellie Pavlick.
2021. AND does not mean OR: Using formal
languages to study language models’ repre-
sentations. In Proceedings of the 59th Annual
Meeting of the Association for Computational
Linguistics and the 11th International Joint
Conference on Natural Language Processing
(Volume 2: Short Papers), pages 158–167, Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.acl-short.21

Elena Voita, Rico Sennrich, and Ivan Titov.
2019. The bottom-up evolution of representa-
tions in the transformer: A study with machine
translation and language modeling objectives.
In Proceedings of the 2019 Conference on Em-
pirical Methods in Natural Language Process-
ing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-
IJCNLP), pages 4396–4406, Hong Kong, China.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D19
-1448

Kai von Fintel and Irene Heim. 2011. Intensional semantics. Unpublished lecture notes.

Ivan Vulić, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, and Anna Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7222–7240, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.586

Zhaofeng Wu, Hao Peng, and Noah A. Smith. 2021. Infusing finetuning with semantic dependencies. Transactions of the Association for Computational Linguistics, 9:226–242. https://doi.org/10.1162/tacl_a_00363

Lang Yu and Allyson Ettinger. 2020. Assessing phrasal representation and composition in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4896–4907, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.397

Lang Yu and Allyson Ettinger. 2021. On the
interplay between fine-tuning and composition
in transformers. In Findings of the Association
for Computational Linguistics: ACL-IJCNLP
2021, pages 2279–2293, Online. Association
for Computational Linguistics.

Zexuan Zhong, Dan Friedman, and Danqi Chen.
2021. Factual probing is [MASK]: Learn-
ing vs. learning to recall. In Proceedings of
the 2021 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 5017–5033, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2021.naacl-main.398

A Propositional Logic Dataset Details

We hand-designed the PCFG probabilities in Eq. 1. To expand an e, the two binary rules each have probability 0.06 under Lt. The ¬ rule and the expansions to T and F divide the remaining probability mass, with T and F having the same probability, each half that of the ¬ rule. As S does not expand to T or F, the other three rules proportionally split the probability mass. We consider each of (, ), ∧, ∨, ¬, T, F, and = as a separate token for tokenization. We enforce a maximum length of 248 tokens. We sample all sentences without replacement. The average Lt sentence length is ≈48.6 tokens. Sampling Ln results in slightly longer sentences, so we decrease the binary rule probabilities to 0.03 each, but the specification is otherwise the same. The resulting Ln sentences have ≈51.7 tokens on average. We sample 819.2M pretraining sentences and 1M/10K/10K probe training/validation/test sentences. Then, for each split, we sample sentence pairs, with the same number of pairs as there are sentences in that split.
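To make the sampling procedure concrete, the following is a minimal Python sketch of a recursive sampler that uses the probabilities above. It assumes ∧ and ∨ are the two binary connectives and uses rule and function names of our own; it is not the exact grammar of Eq. 1, and the assertion format built around the sampled expressions is omitted.

    import random

    # Expansion probabilities for e under L_t (hand-designed, as described above):
    # the two binary rules get 0.06 each; the remaining 0.88 is split so that the
    # negation rule (0.44) has twice the probability of T (0.22) and of F (0.22).
    E_RULES = [("and", 0.06), ("or", 0.06), ("not", 0.44), ("T", 0.22), ("F", 0.22)]

    # S never expands to T or F, so the other three rules proportionally share the mass.
    _total = 0.06 + 0.06 + 0.44
    S_RULES = [("and", 0.06 / _total), ("or", 0.06 / _total), ("not", 0.44 / _total)]

    def sample_expression(rules=E_RULES):
        """Recursively sample one expression as a list of tokens."""
        rule = random.choices([r for r, _ in rules], weights=[p for _, p in rules])[0]
        if rule == "and":
            return ["("] + sample_expression() + ["∧"] + sample_expression() + [")"]
        if rule == "or":
            return ["("] + sample_expression() + ["∨"] + sample_expression() + [")"]
        if rule == "not":
            return ["¬"] + sample_expression()
        return [rule]  # terminal: T or F

    def sample_sentence(max_len=248):
        """Sample a top-level expansion of S, rejecting sequences over the length cap."""
        while True:
            tokens = sample_expression(S_RULES)
            if len(tokens) <= max_len:
                return tokens
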

B Propositional Logic Training Details

For pretraining, we mostly follow the original hyperparameters for GPT-2-small and RoBERTa-base. We train with batches of 8,192 sequences for 100k steps, equivalent to 1 epoch over our pretraining data. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with epsilon 10−8 for ALM and 10−6 for MLM, and β2 0.95 for ALM and 0.98 for MLM. We set the learning rate to 6 × 10−4, warmed up over 10k steps, with a 0.1 weight decay.

For probing, +ATTN trains a query vector that in-
teracts with the key representation of each token,
obtained with a trained key matrix transforma-
tion, and the resulting attention weights are used
to average the token embeddings. We train all
probes for 3 epochs with batch size 8 et 1,000
warmup steps, and select the checkpoint with the best validation accuracy. We use AdamW with a 10−5 learning rate, except for Ln –ATTN ALM, which benefits from a different learning rate of 10−3. We clip gradients to unit norm.
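The +ATTN pooling mechanism can be sketched as follows in PyTorch; the module, its dimensions, and the single linear classification head are illustrative assumptions rather than the exact probe architecture.

    import torch
    import torch.nn as nn

    class AttnPoolProbe(nn.Module):
        """A trained query vector attends over key-transformed token embeddings;
        the attention weights are used to average the (frozen) token embeddings."""

        def __init__(self, hidden_dim, num_classes):
            super().__init__()
            self.query = nn.Parameter(torch.randn(hidden_dim))   # trained query vector
            self.key = nn.Linear(hidden_dim, hidden_dim)          # trained key transformation
            self.classifier = nn.Linear(hidden_dim, num_classes)  # probe head (assumed linear)

        def forward(self, token_embeddings, mask):
            # token_embeddings: (batch, seq, hidden); mask: (batch, seq), 1 for real tokens.
            keys = self.key(token_embeddings)                 # (batch, seq, hidden)
            scores = keys @ self.query                        # (batch, seq)
            scores = scores.masked_fill(mask == 0, float("-inf"))
            weights = scores.softmax(dim=-1)                  # attention weights over tokens
            pooled = (weights.unsqueeze(-1) * token_embeddings).sum(dim=1)
            return self.classifier(pooled)
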

C Referential Opacity Dataset Details

We detail the generation of our referential opac-
ity dataset, separately discussing its two aspects
(§4.2).

C.1 Generating Co-referring Expressions

For fact probing on LAMA, we use prompts of the form ‘‘The official language of Laos is known as ___,’’ which we found appropriate for the entity types in T-REx. If the LM correctly predicts
‘‘Lao’’, we consider this equivalence, or fact,
captured by the model. As LAMA was designed to
have 1-token answers with BERT’s tokenization,
we let BERT fill in the blank. This is not a guar-
antee for GPT-2’s tokenization, so we run de-
coding for the same number of steps as the true
answer’s length with beam size 5 and no sampling.
To further ensure that the predictions are reliable
and not due to noise, we only keep entity catego-
ries with overall prediction accuracy > 25%. Le
resulting categories are ‘‘P37 official language’’,
‘‘P364 original language of film or TV show’’,
‘‘P140 religion’’, ‘‘P103 native language’’, et
‘‘P36 capital’’. This procedure results in 1,606
facts for GPT-2 and 2,962 facts for BERT.
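The category-level filter can be summarized with the small sketch below; the record format is an assumption for illustration, and the prompt construction and decoding details described above are abstracted behind the `correct` flag.

    from collections import defaultdict

    def filter_categories(records, threshold=0.25):
        """records: (category_id, fact_id, correct) triples, where `correct` marks whether
        the LM completed the prompt (e.g., "The official language of Laos is known as ...")
        with the expected answer. Keep only categories with accuracy above the threshold."""
        totals, hits = defaultdict(int), defaultdict(int)
        for category, _, correct in records:
            totals[category] += 1
            hits[category] += int(correct)
        kept = {c for c in totals if hits[c] / totals[c] > threshold}
        return [r for r in records if r[0] in kept]
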


C.2 Generating Contexts

We generate two types of contexts (§4.2). The first type contains an embedded clause, for which we construct templates for each entity category
in §C.1. For language entities, for example, a template is ‘‘[PRONOUN] [VERB] to speak [ENTITY].’’ A sentence pair is formed by filling in [ENTITY] with a definite description vs. a proper name for a fact. We only consider the pronouns ‘‘She’’ and ‘‘He’’ in this work. We consider 6 referentially transparent verbs (‘‘starts’’, ‘‘begins’’, ‘‘ceases’’, ‘‘stops’’, ‘‘managed’’, ‘‘failed’’) and 6 referentially opaque verbs (‘‘wants’’, ‘‘intends’’, ‘‘hopes’’, ‘‘begs’’, ‘‘preferred’’, ‘‘suggested’’). The second type of context contains only the main clause. We use the referentially opaque template ‘‘[PRONOUN] dislikes [ENTITY].’’ and an entity category-specific referentially transparent template such as ‘‘[PRONOUN] speaks [ENTITY].’’ In total, we have 64,672 sentence pairs for GPT-2 and 121,768 for BERT.

For our probing analysis, we also included
attractors with coordinated sentences (§4.4). As
there are a quadratic number of possible coor-
dinations, we subsampled 59,548 such sentences
for GPT-2 and 119,540 for BERT, similar to the
number of attractor-less sentences. We split all
sentence pairs 8/1/1 for training/validation/testing.
For our similarity analysis, for a cleaner signif-
icance test, we only consider sentence pairs with
an embedded clause. This leaves 58,776 sentence
pairs for GPT-2 and 111,312 for BERT.
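The pair construction for embedded-clause contexts can be sketched as follows; the representation of a fact as a (definite description, proper name, verb phrase) triple and the function name are our own simplifications.

    from itertools import product

    PRONOUNS = ["She", "He"]
    TRANSPARENT_VERBS = ["starts", "begins", "ceases", "stops", "managed", "failed"]
    OPAQUE_VERBS = ["wants", "intends", "hopes", "begs", "preferred", "suggested"]

    def embedded_clause_pairs(facts):
        """facts: (definite_description, proper_name, verb_phrase) triples, e.g.,
        ("the official language of Laos", "Lao", "to speak"). Each fact yields one
        sentence pair per pronoun and verb, differing only in the [ENTITY] slot."""
        for (description, name, vp), pronoun in product(facts, PRONOUNS):
            for verb in TRANSPARENT_VERBS + OPAQUE_VERBS:
                opaque = verb in OPAQUE_VERBS
                yield (f"{pronoun} {verb} {vp} {description}.",
                       f"{pronoun} {verb} {vp} {name}.",
                       opaque)
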

D Referential Opacity Training Details

The probe is trained as in §B, except for 1 epoch with batch size 256 and a learning rate of 10−5.

E Can Language Models Learn to Represent Referential Opacity With Appropriate Supervision?

We showed in §4 that we do not observe evidence of pretrained language models representing the phenomenon of referential opacity. A natural question, then, is whether language models can learn to represent it. Following a setup similar to that of Lyu et al. (2022) and Liu et al. (2019b), we finetune the entire model on a portion of our training set for 1 epoch and conduct the same probing procedure on the resulting model. All training is done with the coordinated data introduced in §4.4. Finetuning uses the same hyperparameters as in §D. Similar to §4.4, we report the mean and standard deviation across 10 random seeds for each setting.

Figure 2: Probing accuracy after finetuning a pretrained LM on our (coordinated) referential opacity dataset with different numbers of finetuning examples. The mean and the standard deviation across 10 seeds are plotted. For clarity in visualizing the trend, the x-axis is not in linear scale.

We plot the probing accuracy along with the
number of finetuning examples in Figure 2. Both
GPT-2 and BERT continue to be unable to per-
form above-random with up to 10,000 finetuning
examples, further demonstrating their inadequate
semantic representation of referential opacity.
Nevertheless, with enough finetuning examples,
both models eventually achieve near-100% prob-
ing accuracy. It is therefore possible that they could learn to represent referential opacity with sufficient semantic supervision, though
we note a caveat: while we introduced coordi-
nated data to prevent an obvious shortcut that the
model could take (§4.4), it does not eliminate all
possible shortcuts. It could be the case that the
additional capacity afforded by finetuning enables
the model to exploit a more sophisticated short-
cut (unknown to us) instead of truly capturing this
phenomenon.
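The experiment in this appendix amounts to the loop sketched below; `finetune` and `probe_accuracy` stand in for the procedures of §D and §4.4, and the grid of finetuning sizes shown is illustrative rather than the exact one used.

    import statistics

    NUM_SEEDS = 10
    FINETUNE_SIZES = [100, 1000, 10000, 100000]  # illustrative grid

    def run_sweep(finetune, probe_accuracy, coordinated_train_data):
        """For each finetuning-set size, finetune for 1 epoch under several seeds,
        probe the resulting model, and report the mean and standard deviation."""
        results = {}
        for n in FINETUNE_SIZES:
            accs = [probe_accuracy(finetune(seed, coordinated_train_data[:n]))
                    for seed in range(NUM_SEEDS)]
            results[n] = (statistics.mean(accs), statistics.stdev(accs))
        return results
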
