How to Dissect a Muppet: The Structure of
Transformer Embedding Spaces

Timothee Mickus∗
University of Helsinki, Finland
timothee.mickus
@helsinki.fi

Denis Paperno
Utrecht University,
The Netherlands
d.paperno@uu.nl

Mathieu Constant
Université de Lorraine,
CNRS, ATILF, France
Mathieu.Constant
@univ-lorraine.fr

Abstract

Pretrained embeddings based on the Trans-
former architecture have taken the NLP
community by storm. We show that they can
mathematically be reframed as a sum of vector
factors and showcase how to use this refram-
ing to study the impact of each component.
We provide evidence that multi-head atten-
tions and feed-forwards are not equally useful
in all downstream applications, as well as a
quantitative overview of the effects of fine-
tuning on the overall embedding space. This
approach allows us to draw connections to a
wide range of previous studies, from vector
space anisotropy to attention weights.

1 Introduction

The Transformer architecture (Vaswani et al.,
2017) has taken the NLP community by storm.
Based on the attention mechanism (Bahdanau
et al., 2015; Luong et al., 2015), it was shown
to outperform recurrent architectures on a wide
variety of tasks. Another step was taken with
pretrained language models derived from this ar-
chitecture (BERT, Devlin et al., 2019, among
others): they now embody the default approach
to a vast swath of NLP applications. Success
breeds scrutiny; likewise the popularity of these
models has fostered research in explainable NLP
interested in the behavior and explainability of
pretrained language models (Rogers et al., 2020).
In this paper, we develop a novel decomposition
of Transformer output embeddings. Our approach
consists in quantifying the contribution of each
network submodule to the output contextual em-
bedding, and grouping those into four terms: (i)
what relates to the input for a given position, (ii)
what pertains to feed-forward submodules, (iii)

∗The work described in the present paper was conducted

chiefly while at ATILF.

what corresponds to multi-head attention, and (iv)
what is due to vector biases.

This allows us to investigate Transformer em-
beddings without relying on attention weights or
treating the entire model as a black box, as is most
often done in the literature. The usefulness of our
method is demonstrated on BERT: Our case study
yields enlightening connections to state-of-the-art
work on Transformer explainability, evidence that
multi-head attentions and feed-forwards are not
equally useful in all downstream applications, as
well as an overview of the effects of finetuning on
the embedding space. We also provide a simple
and intuitive measurement of the importance of
any term in this decomposition with respect to the
whole embedding.

We will provide insights on the Transformer
architecture in Section 2, and showcase how these
insights can translate into experimental investiga-
tions in Sections 3 to 6. We will conclude with
connections to other relevant works in Section 7
and discuss future perspectives in Section 8.

2 Additive Structure in Transformers

We show that the Transformer embedding et for
a token t is a sum of four terms:

$$e_t = i_t + h_t + f_t + c_t \quad (1)$$

where it can be thought of as a classical static em-
bedding, ft and ht are the cumulative contributions
at every layer of the feed-forward submodules and
the MHAs respectively, and ct corresponds to
biases accumulated across the model.

Equation (1) provides interpretable and quan-
tifiable terms that can explain the behavior of
specific components of the Transformer archi-
tecture. More precisely, it characterizes what is
the impact of adding another sublayer on top of
what was previously computed: the terms in Equa-
tion (1) are defined as sums across (sub)layers;



A                 matrix
(A)_{t,·}         t-th row of A
a                 (row) vector
a, α              scalars
W^{(M)}           item linked to submodule M
a ⊕ b             concatenation of vectors a and b
⊕_n a_n           a_1 ⊕ a_2 ⊕ · · · ⊕ a_n
a ⊙ b             element-wise multiplication of a and b
⊙_n a_n           a_1 ⊙ a_2 ⊙ · · · ⊙ a_n
1                 vector with all components set to 1
0_{m,n}           null matrix of shape m × n
I_n               identity matrix of shape n × n

Table 1: Notation.

hence we can track how a given sublayer trans-
forms its input, and show that this effect can be
thought of as adding another vector to a pre-
vious sum. This layer-wise sum of submodule
outputs also allows us to provide a first estimate
of which parameters are most relevant to the over-
all embedding space: a submodule whose output
is systematically negligible has its parameters set
so that its influence on subsequent computations
is minimal.

The formulation in Equation (1) more generally
relies on the additive structure of Transformer
embedding spaces. We start by reviewing the
Transformer architecture in Section 2.1, before
discussing our decomposition in greater detail in
Section 2.2 and known limitations in Section 2.3.

2.1 Transformer Encoder Architecture

Let’s start by characterizing the Transformer ar-
chitecture of Vaswani et al. (2017) in the notation
described in Table 1.

Transformers are often defined using three
hyperparameters: the number of layers L, the di-
mensionality of the hidden representations d, et
H, the number of attention heads in multi-head
attentions. Formally, a Transformer model is a
stack of sublayers. A visual representation is
shown in Figure 1. Two sublayers are stacked
to form a single Transformer layer: The first
corresponds to a multi-head attention mechanism
(MHA), and the second to a feed-forward (FF).
A Transformer with L layers contains Λ = 2L
sublayers. In Figure 1, two sublayers (in blue) are
grouped into one layer, and L layers are stacked
one after the other.

Figure 1: Overview of a Transformer encoder.

Each sublayer is centered around a specific sub-
layer function. Sublayer functions map an input
x to an output y, and can either be feed-forward
submodules or multi-head attention submodules.

FFs are subnets of the form:

$$y_t^{(FF)} = \phi\left(x_t W^{(FF,I)} + b^{(FF,I)}\right) W^{(FF,O)} + b^{(FF,O)}$$

where φ is a non-linear function, such as ReLU or GELU (Hendrycks and Gimpel, 2016). Here, (·, I) and (·, O) distinguish the input and output linear projections, and the index t corresponds to the token position. Input and output dimensions are equal, whereas the intermediary layer dimension (i.e., the size of the hidden representation to which the non-linear function φ is applied) is larger, typically b = 1024 or 2048. In other words, W(FF,I) is of shape d × b, b(FF,I) of size b, W(FF,O) is of shape b × d, and b(FF,O) of size d.
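
To make the shapes above concrete, here is a minimal PyTorch-style sketch of an FF sublayer function with illustrative dimensions; the variable names and random weights are ours, not BERT's:

    import torch

    d, b, n = 768, 2048, 12                # hidden size, FF inner size, sequence length (illustrative)
    W_in  = torch.randn(d, b) * 0.02       # W^(FF,I), shape d x b
    b_in  = torch.zeros(b)                 # b^(FF,I), size b
    W_out = torch.randn(b, d) * 0.02       # W^(FF,O), shape b x d
    b_out = torch.zeros(d)                 # b^(FF,O), size d

    x = torch.randn(n, d)                  # one input vector x_t per position t
    # y_t = phi(x_t W^(FF,I) + b^(FF,I)) W^(FF,O) + b^(FF,O), with phi = GELU
    y = torch.nn.functional.gelu(x @ W_in + b_in) @ W_out + b_out
    assert y.shape == (n, d)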

MHAs are concatenations of scaled-dot attention heads:

$$y_t^{(MHA)} = \left( \bigoplus_{h=1}^{H} (A_h)_{t,\cdot} \right) W^{(MHA,O)} + b^{(MHA,O)}$$

where (A_h)_{t,·} is the tth row vector of the following n × d/H matrix A_h:

$$A_h = \mathrm{softmax}\left( \frac{Q_h K_h^T}{\sqrt{d/H}} \right) V_h$$

with h an index tracking attention heads. The parameter matrix W(MHA,O) is of shape d × d, and the bias b(MHA,O) of size d. The queries Q_h, keys K_h and values V_h are simple linear projections


of shape n × (d/H), computed from all inputs
x1, . . . , xn:

$$(Q_h)_{t,\cdot} = x_t W^{(Q)}_h + b^{(Q)}_h \qquad (K_h)_{t,\cdot} = x_t W^{(K)}_h + b^{(K)}_h \qquad (V_h)_{t,\cdot} = x_t W^{(V)}_h + b^{(V)}_h$$

where the weight matrices W(Q)_h, W(K)_h and W(V)_h are of shape d × (d/H), with H the number of attention heads, and the biases b(Q)_h, b(K)_h and b(V)_h are of size d/H. This component is often analyzed in terms of attention weights α_h, which correspond to the softmax dot-product between keys and queries. In other words, the product softmax(Q_h K_h^T / √(d/H)) can be thought of as an n × n matrix of weights in an average over the transformed input vectors x_{t'} W(V)_h + b(V)_h (Kobayashi et al., 2020, Eqs. (1) to (4)): multiplying these weights with the value projection V_h yields a weighted sum of value projections:

$$(A_h)_{t,\cdot} = \sum_{t'=1}^{n} \alpha_{h,t,t'} (V_h)_{t',\cdot}$$

where α_{h,t,t'} is the component at row t and column t' of this attention weights matrix.
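
As a sanity check of this reading, the sketch below (toy dimensions, random weights of our choosing) computes one head both as softmax(QK^T/√(d/H))V and as the explicit weighted sum over positions, and verifies the two coincide:

    import torch

    n, d, H = 5, 16, 4
    dh = d // H                                      # per-head dimension d/H
    x = torch.randn(n, d)
    W_q, W_k, W_v = (torch.randn(d, dh) for _ in range(3))
    b_q, b_k, b_v = (torch.randn(dh) for _ in range(3))

    Q, K, V = x @ W_q + b_q, x @ W_k + b_k, x @ W_v + b_v
    alpha = torch.softmax(Q @ K.T / dh ** 0.5, dim=-1)   # n x n attention weights

    A_matmul = alpha @ V                                  # softmax(QK^T / sqrt(d/H)) V
    # same quantity written as a weighted sum of value projections over positions t'
    A_sum = torch.stack([sum(alpha[t, tp] * V[tp] for tp in range(n)) for t in range(n)])
    assert torch.allclose(A_matmul, A_sum, atol=1e-6)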

Lastly, after each sublayer function S, a residual connection and a layer normalization (LN, Ba et al., 2016) are applied:

$$y_t^{(LN)} = g \odot \frac{S(x_t) + x_t - m_t \cdot \mathbf{1}}{s_t} + b^{(LN)}$$

The gain g and bias b(LN) are learned parameters with d components each; m_t · 1 is the vector (1, · · · , 1) scaled by the mean component value m_t of the input vector S(x_t) + x_t; s_t is the standard deviation of the component values of this input. As such, a LN performs a z-scaling, followed by the application of the gain g and the bias b(LN).
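
The z-scaling reading of LN can be checked directly; the sketch below (our own illustration, not the model's code) recomputes a LayerNorm output from the mean m_t, standard deviation s_t, gain g and bias b(LN):

    import torch

    d = 8
    v = torch.randn(d)                       # S(x_t) + x_t, the vector entering the LN
    g, b_ln = torch.randn(d), torch.randn(d) # gain and bias of the LN

    m_t = v.mean()
    s_t = v.std(unbiased=False)              # LN uses the biased standard deviation
    y_manual = g * (v - m_t) / s_t + b_ln

    ln = torch.nn.LayerNorm(d, eps=0.0)
    ln.weight.data, ln.bias.data = g.clone(), b_ln.clone()
    assert torch.allclose(ln(v), y_manual, atol=1e-5)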

To kick-start computations, a sequence of static
vector representations x0,1 . . . x0,n with d compo-
nents each is fed into the first layer. This initial
input corresponds to the sum of a static lookup
word embedding and a positional encoding.1

1 In BERT (Devlin et al., 2019), additional terms to this
static input encode the segment a token belongs to, and a LN
is added before the very first sublayer. Other variants also
encode positions by means of an offset in attention heads
(Huang et al., 2018; Shaw et al., 2018).

2.2 Mathematical Re-framing

We now turn to the decomposition proposed in
Équation (1): et = it + ft + ht + ct.2 We provide
a derivation in Appendix A.

The term it corresponds to the input embedding (i.e., the positional encoding, the input word-type embedding, and the segment encoding in BERT-like models), after having gone through all the LN gains and rescaling:

$$i_t = \frac{\bigodot_{\lambda=1}^{\Lambda} g_\lambda}{\prod_{\lambda=1}^{\Lambda} s_{\lambda,t}} \odot x_{0,t} \quad (2)$$

where Λ = 2L ranges over all sublayers. Here, the gλ correspond to the learned gain parameters of the LNs, whereas the sλ,t scalars derive from the z-scaling performed in the λth LN, as defined above. The input x0,t consists of the sum of a static lookup embedding and a positional encoding—as such, it resembles an uncontextualized embedding.
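
In practice, the scaling applied to x0,t in Equation (2) can be accumulated while stepping through the sublayers; the sketch below is a simplified illustration with made-up gains and standard deviations, not an actual BERT forward pass:

    import torch

    d, n_sublayers = 8, 4                     # Lambda = 2L sublayers
    x0_t = torch.randn(d)                     # static input embedding for position t

    scale = torch.ones(d)                     # running product of g_lambda / s_lambda,t
    for lam in range(n_sublayers):
        g_lam = torch.randn(d)                # LN gain of sublayer lambda (made up here)
        s_lam_t = torch.rand(1) + 0.5         # z-scaling std for this token (made up here)
        scale = scale * g_lam / s_lam_t

    i_t = scale * x0_t                        # Eq. (2): the input term of the decomposition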

The next two terms capture the outputs of spe-
cific submodules, either FFs or MHAs. As such,
their importance and usefulness will differ from
task to task. The term ft is the sum of the outputs
of the FF submodules. Submodule outputs pass
through LNs of all the layers above, hence:

$$f_t = \sum_{l=1}^{L} \frac{\bigodot_{\lambda=2l}^{\Lambda} g_\lambda}{\prod_{\lambda=2l}^{\Lambda} s_{\lambda,t}} \odot \tilde{f}_{l,t} \quad (3)$$

where $\tilde{f}_{l,t} = \phi\left( x^{(FF)}_{l,t} W^{(FF,I)}_l + b^{(FF,I)}_l \right) W^{(FF,O)}_l$ is the unbiased output at position t of the FF submodule for layer l.

The term ht corresponds to the sum across layers of each MHA output, having passed through the relevant LNs. As MHAs are entirely linear, we can further describe each output as a sum over all H heads of a weighted bag-of-words of the input representations to that submodule. That is:

$$h_t = \sum_{l=1}^{L} \frac{\bigodot_{\lambda=2l-1}^{\Lambda} g_\lambda}{\prod_{\lambda=2l-1}^{\Lambda} s_{\lambda,t}} \odot \left( \sum_{h=1}^{H} \sum_{t'=1}^{n} \alpha_{l,h,t,t'} \, x_{l,t'} Z_{l,h} \right) \quad (4)$$
$$\text{with } Z_{l,h} = W^{(V)}_{l,h} M_h W^{(MHA,O)}_l$$

2 We empirically verified that components from attested embeddings et and those derived from Eq. (1) are systematically equal up to ±10−7.


where Z_{l,h} corresponds to passing an input embedding through the unbiased value projection W(V)_{l,h} of head h, then projecting it from a d/H-dimensional subspace onto a d-dimensional space using a zero-padded identity matrix:

$$M_h = \begin{pmatrix} 0_{d/H,(h-1) \times d/H} & I_{d/H} & 0_{d/H,(H-h) \times d/H} \end{pmatrix}$$

and finally passing it through the unbiased outer projection W(MHA,O)_l of the relevant MHA.
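
The zero-padded identity M_h simply places the d/H-dimensional head output into the h-th block of a d-dimensional vector; a small sketch with toy sizes and our own naming:

    import torch

    d, H = 12, 3
    dh = d // H

    def M(h):                                   # zero-padded identity for head h (1-indexed)
        left = torch.zeros(dh, (h - 1) * dh)
        right = torch.zeros(dh, (H - h) * dh)
        return torch.cat([left, torch.eye(dh), right], dim=1)   # shape d/H x d

    v = torch.randn(dh)                          # a single head's output at one position
    padded = v @ M(2)                            # placed in the second block of a d-dim vector
    assert torch.allclose(padded[dh:2 * dh], v) and padded[:dh].abs().sum() == 0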

In the last term ct, we collect all the biases. We don't expect these offsets to be meaningful but rather to depict a side-effect of the architecture:

$$c_t = \sum_{\lambda=1}^{\Lambda} \left( \frac{\bigodot_{\lambda'=\lambda+1}^{\Lambda} g_{\lambda'}}{\prod_{\lambda'=\lambda+1}^{\Lambda} s_{\lambda',t}} \odot b^{(LN)}_\lambda - \frac{\bigodot_{\lambda'=\lambda}^{\Lambda} g_{\lambda'}}{\prod_{\lambda'=\lambda}^{\Lambda} s_{\lambda',t}} \odot m_{\lambda,t} \cdot \mathbf{1} \right)$$
$$+ \sum_{l=1}^{L} \frac{\bigodot_{\lambda=2l-1}^{\Lambda} g_\lambda}{\prod_{\lambda=2l-1}^{\Lambda} s_{\lambda,t}} \odot \left( b^{(MHA,O)}_l + \left( \bigoplus_{h=1}^{H} b^{(V)}_{l,h} \right) W^{(MHA,O)}_l \right) + \sum_{l=1}^{L} \frac{\bigodot_{\lambda=2l}^{\Lambda} g_\lambda}{\prod_{\lambda=2l}^{\Lambda} s_{\lambda,t}} \odot b^{(FF,O)}_l \quad (5)$$

The concatenation ⊕_h b(V)_{l,h} here is equivalent to a sum of zero-padded identity matrices: Σ_h b(V)_{l,h} M_h. This term ct includes the biases b(LN)_λ and mean-shifts m_{λ,t} · 1 of the LNs, the outer projection biases of the FF submodules b(FF,O)_l, the outer projection bias in each MHA submodule b(MHA,O)_l, and the value projection biases, mapped through the outer MHA projection (⊕_h b(V)_{l,h}) W(MHA,O)_l.3
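
Before turning to the limitations, a toy single-sublayer example illustrates the bookkeeping behind this decomposition; we use a purely linear stand-in for the sublayer function, and this is not the verification script behind footnote 2:

    import torch

    d = 8
    x = torch.randn(d)                         # input to the sublayer
    W, b = torch.randn(d, d), torch.randn(d)   # a linear "sublayer function" S(x) = xW + b
    g, b_ln = torch.randn(d), torch.randn(d)   # LN gain and bias

    s_out = x @ W + b                          # S(x)
    resid = s_out + x                          # residual connection
    m, s = resid.mean(), resid.std(unbiased=False)
    y = g * (resid - m) / s + b_ln             # sublayer output after LN

    # additive decomposition of the same output
    i_term = (g / s) * x                       # what comes from the input
    h_term = (g / s) * (x @ W)                 # what the sublayer function adds (unbiased)
    c_term = (g / s) * b - (g / s) * m + b_ln  # biases and mean shift
    assert torch.allclose(y, i_term + h_term + c_term, atol=1e-5)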

2.3 Limitations of Equation (1)

The decomposition proposed in Equation (1) comes with a few caveats that are worth addressing explicitly. Most importantly, Equation (1) does not entail that the terms are independent from one another. For instance, the scaling factor 1/s_{λ,t} systematically depends on the magnitude of earlier hidden representations. Equation (1) only stresses

3In the case of relative positional embeddings applied to
value projections (Shaw et al., 2018), it is rather straight-
forward to follow the same logic so as to include relative
positional offset in the most appropriate term.


that a Transformer embedding can be decomposed
as a sum of the outputs of its submodules: It does
not fully disentangle computations. We leave the
precise definition of computation disentanglement
and its elaboration for the Transformer to future
research, and focus here on the decomposition
proposed in Equation (1).

In all, the major issue at hand is the ft term:
It is the only term that cannot be derived as a
linear composition of vectors, due to the non-linear
function used in the FFs. Aside from the ft term,
non-linear computations all devolve into scalar
corrections (namely, the LN z-scaling factors sλ,t
and mλ,t, and the attention weights αl,h). As such,
ft is the single bottleneck that prevents us from
entirely decomposing a Transformer embedding
as a linear combination of sub-terms.

As the non-linear functions used in Transform-
ers are generally either ReLU or GELU, which
both behave almost linearly for a high enough
input value, it is in principle possible that the FF
submodules can be approximated by a purely lin-
ear transformation, depending on the exact set of
parameters they converged onto. It is worth assess-
ing this possibility. Ici, we learn a least-squares
linear regression mapping the z-scaled inputs of
every FF to its corresponding z-scaled output. Nous
use the BERT base uncased model of Devlin et al.
(2019) and a random sample of 10,000 sentences
from the Europarl English section (Koehn, 2005),
or almost 900,000 word-piece tokens, and fit the
regressions using all 900,000 embeddings.
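
A minimal version of this check, assuming the z-scaled FF inputs and outputs for one layer have already been collected into arrays X and Y (the stand-in data, variable names and scikit-learn defaults below are ours):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 768))                    # z-scaled FF inputs for one layer (stand-in data)
    Y = np.tanh(X @ rng.normal(size=(768, 768)) * 0.05)   # stand-in non-linear FF outputs

    reg = LinearRegression().fit(X, Y)                    # least-squares linear map from inputs to outputs
    print(r2_score(Y, reg.predict(X)))                    # variance explained by the best linear approximation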

Figure 2: Fitting the ft term: r² across layers.

Figure 2 displays the quality of these linear approximations, as measured by an r² score. We see some variation across layers but never observe a perfect fit: 30% to 60% of the observed variance is
not explained by a linear map, suggesting BERT
actively exploits the non-linearity. That the model
doesn’t simply circumvent the non-linear func-
tion to adopt a linear behavior intuitively makes
sense: Adding the feed-forward terms is what pre-
vents the model from devolving into a sum of
bag-of-words and static embeddings. While such
approaches have been successful (Mikolov et al.,


2013; Mitchell and Lapata, 2010), a non-linearity
ought to make the model more expressive.


In all, the sanity check in Figure 2 highlights
that the interpretation of the ft term is the major
‘‘black box’’ unanalyzable component remaining
under Equation (1). En tant que tel, the recent interest in
analyzing these modules (par exemple., Geva et al., 2021;
Zhao et al., 2021; Geva et al., 2022) is likely to
have direct implications for the relevance of the
present work. When adopting the linear decom-
position approach we advocate, this problem can
be further simplified: We only require a com-
putationally efficient algorithm to map an input
weighted sum of vectors through the non-linearity
to an output weighted sum of vectors.4

Also remark that previous research stressed
that Transformer layers exhibit a certain degree
of commutativity (Zhao et al., 2021) and that additional computation can be injected between contiguous sublayers (Pfeiffer et al., 2020). This can be thought of as evidence pointing towards a certain independence of the computations done in each layer: If we can shuffle and add layers, then it
seems reasonable to characterize sublayers based
on what their outputs add to the total embedding,
as we do in Equation (1).

Beyond the expectations we may have,
it
remains to be seen whether our proposed method-
ology is of actual use, that is, whether it is conducive
to further research. The remainder of this article
presents some analyses that our decomposition
enables us to conduct.5

3 Visualizing the Contents of Embeddings

One major question is that of the relative relevance
of the different submodules of the architecture
with respect to the overall output embedding.
Studying the four terms it, ft, ht, and ct can
prove helpful in this endeavor. Given that Equations (2) to (5) are defined as sums across layers or sublayers, it is straightforward to adapt them to derive the decomposition for intermediate representations. Thus, we can study how relevant each of the four terms is to intermediary representations, and plot how this relevance evolves across layers.

4One could simply treat the effect of a non-linear acti-
vation as if it were an offset. For example, in the case of
ReLU:

ReLU (v) = v + z where z = ReLU (v) − v = ReLU(−v)
5 Code for our experiments is available at the following URL: https://github.com/TimotheeMickus/bert-splat.

To that end, we propose an importance metric
to compare one of the terms tt to the total et.
We require it to be sensitive to co-directionality
(c'est à dire., whether tt and et have similar directions)
and relative magnitude (whether tt is a major
component of et). A normalized dot-product of
the form:

$$m(e_t, \mathbf{t}_t) = e_t^T \mathbf{t}_t / \|e_t\|_2^2 \quad (6)$$

satisfies both of these requirements. As the dot-product distributes over addition (i.e., $\sum_i a^T b_i = a^T \sum_i b_i$) and the dot-product of a vector with itself is its magnitude squared (i.e., $a^T a = \|a\|_2^2$):

$$m(e_t, i_t) + m(e_t, f_t) + m(e_t, h_t) + m(e_t, c_t) = 1$$

Hence this function intuitively measures the im-
portance of a term relative to the total.
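
Concretely, the metric of Eq. (6) and its sum-to-one property can be computed as follows (toy vectors; term names mirror Eq. (1)):

    import torch

    d = 16
    i_t, h_t, f_t, c_t = (torch.randn(d) for _ in range(4))
    e_t = i_t + h_t + f_t + c_t                 # Eq. (1)

    def importance(e, term):
        # m(e_t, term) = e_t^T term / ||e_t||_2^2, Eq. (6)
        return torch.dot(e, term) / torch.dot(e, e)

    total = sum(importance(e_t, term) for term in (i_t, h_t, f_t, c_t))
    assert torch.allclose(total, torch.tensor(1.0), atol=1e-6)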

We use the same Europarl sample as in Sec-
tion 2.3. We contrast embeddings from three
related models: The BERT base uncased model
and fine-tuned variants on CONLL 2003 NER
(Tjong Kim Sang and De Meulder, 2003)6 and
SQuAD v2 (Rajpurkar et al., 2018).7

Figure 3 summarizes the relative importance
of the four terms of Eq. (1), as measured by
the normalized dot-product defined in Eq. (6);
ticks on the x-axis correspond to different layers.
Figures 3a to 3c display the evolution of our
proportion metric across layers for all three BERT
models, and Figures 3d to 3f display how our
normalized dot-product measurements correlate
across pairs of models using Spearman’s ρ.8

Looking at Figure 3a, we can make a few im-
portant observations. The input term it, which
corresponds to a static embedding, initially dom-
inates the full output, but quickly decreases in
prominence, until it reaches 0.045 at the last layer.
This should explain why lower layers of Trans-
formers generally give better performances on
static word-type tasks (Vulić et al., 2020, among
others). The ht term is not as prominent as one
could expect from the vast literature that focuses

6 https://huggingface.co/dslim/bert-base-NER-uncased.

7 https://huggingface.co/twmkn9/bert-base-uncased-squad2.

8 Layer 0 is the layer normalization conducted before the first sublayer, hence ft and ht are undefined here.


Figure 3: Relative importance of main terms.

on MHA. Its normalized dot-product is barely
above what we observe for ct, and never averages
above 0.3 across any layer. This can be partly
pinned down on the prominence of ft and its nor-
malized dot-product of 0.4 or above across most
layers. As FF submodules are always the last com-
ponent added to each hidden state, the sub-terms
of ft go through fewer LNs than those of ht, and thus undergo fewer scalar multiplications—which likely affects their magnitude. Lastly, the term ct
is far from negligible: At layer 11, it is the most
prominent term, and in the output embedding it
makes up for up to 23%. Note that ct defines a set
of offsets embedded in a 2Λ-dimensional hyper-
plane (cf. Appendix B). In BERT base, 23% of the
output can be expressed using a 50-dimensional
vector, or 6.5% of the 768 dimensions of the
model. This likely induces part of the anisotropy of
Transformer embeddings (e.g., Ethayarajh, 2019;
Timkey and van Schijndel, 2021), as the ct term
pushes the embedding towards a specific region
of the space.

The fine-tuned models in Figures 3b and 3c
are found to impart a much lower proportion of
the contextual embeddings to the it and ct terms.
While ft seems to dominate in the final embedding,
looking at the correlations in Figures 3d and 3e suggests that the ht terms are those that undergo
the most modifications. Proportions assigned to
the terms correlate with those assigned in the

non-finetuned model more in the case of lower
layers than higher layers (Figures 3d and 3e).
The required adaptations seem task-specific as the
two fine-tuned models do not correlate highly
with each other (Figure 3f). Lastly, updates in the
NER model impact mostly layer 8 and upwards
(Figure 3d), whereas the QA model (Figure 3e)
sees important modifications to the ht term at the
first layer, suggesting that SQuAD requires more
drastic adaptations than CONLL 2003.

4 The MLM Objective

An interesting follow-up question concerns which
of the four terms allow us to retrieve the tar-
get word-piece. We consider two approaches:
(a) Using the actual projection learned by the non-fine-tuned BERT model, or (b) learning a simple categorical regression for a specific term. We randomly select 15% of the word-pieces in our Europarl sample. As in the work of Devlin et al. (2019), 80% of these items are masked, 10% are replaced by a random word-piece, and 10% are left as is. Selected embeddings are then split between train (80%), validation (10%), and
test (10%).
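
For reference, the 80/10/10 masking scheme over the selected 15% of word-pieces can be reproduced along the lines below; this is a hedged sketch with made-up token IDs, and vocab_size and mask_id are placeholders rather than the actual BERT values used in the paper:

    import random

    random.seed(0)
    tokens = list(range(1000))                  # stand-in word-piece IDs for one corpus sample
    vocab_size, mask_id = 30_000, 103           # placeholder vocabulary size and [MASK] id

    selected = random.sample(range(len(tokens)), k=int(0.15 * len(tokens)))
    corrupted = list(tokens)
    for pos in selected:
        r = random.random()
        if r < 0.8:                             # 80%: replace with [MASK]
            corrupted[pos] = mask_id
        elif r < 0.9:                           # 10%: replace with a random word-piece
            corrupted[pos] = random.randrange(vocab_size)
        # remaining 10%: leave the token as is

    random.shuffle(selected)                    # 80/10/10 train/validation/test split of the targets
    n = len(selected)
    train = selected[: int(0.8 * n)]
    dev = selected[int(0.8 * n): int(0.9 * n)]
    test = selected[int(0.9 * n):]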

Results are displayed in Table 2. The first row
(‘‘Default’’) details predictions using the default
output projection on the vocabulary, that is, we test the performance of combinations of sub-terms


Table 2: Masked language model accuracy (in %). Cells in underlined bold font indicate best performance per setup across runs. Cell color indicates the ranking of setups within a run. Rows marked μ contain average performance; rows marked σ contain the standard deviation across runs.

under the circumstances encountered by the model
during training.9 The rows below (‘‘Learned’’)
correspond to learned linear projections; the row
marked μ displays the average performance across all 5 runs. Columns display the results of using the sum of 1, 2, 3, or 4 of the terms it, ht, ft and ct to derive representations; for example, the rightmost corresponds to it + ht + ft + ct (i.e., the
full embedding), whereas the leftmost corresponds
to predicting based on it alone. Focusing on the
default projection first, we see that it benefits
from a more extensive training: When using all
four terms, it is almost 2% more accurate than
learning one from scratch. On the other hand,
learning a regression allows us to consider more
specifically what can be retrieved from individual
terms, as is apparent from the behavior of the ft: When using the default output projection, we
get 1.36% accuracy, whereas a learned regression
yields 53.77%.

The default projection matrix is also highly
dependent on the normalization offsets ct and
the FF terms ft being added together: Removing
this ct term from any experiment using ft is highly detrimental to the accuracy. On the other hand, combining the two produces the highest
accuracy scores. Our logistic regressions show
that most of this performance can be imputed to
the ft term. Learning a projection from the ft
term already yields an accuracy of almost 54%.
On the other hand, a regression learned from
ct only has a limited performance of 9.72% on

9 We thank an anonymous reviewer for pointing out that
the BERT model ties input and output embeddings; we leave
investigating the implications of this fact for future work.

average. Interestingly, this is still above what one would observe if the model always predicted the most frequent word-piece (viz. the, 6% of the
test targets): even these very semantically bare
items can be exploited by a classifier. As ct
is tied to the LN z-scaling, this suggests that
the magnitude of Transformer embeddings is not
wholly meaningless.

In all, do FFs make the model more effective? The ft term is necessary to achieve the highest accuracy on the training objective of BERT. On its own, it doesn't achieve the highest performances: for that we also need to add the MHA outputs ht. However, the performances we can associate to ft on its own are higher than what we observe for ht, suggesting that FFs make the Transformer architecture more effective on the MLM objective. This result connects with the work of Geva et al. (2021, 2022), who argue that FFs update the distribution over the full vocabulary, hence it makes sense that ft would be most useful to the MLM task.

5 Lexical Contents and WSD

We now turn to look at how the vector spaces
are organized, and which term yields the most
linguistically appropriate space. We rely on
WSD, as distinct senses should yield different
representations.

We consider an intrinsic KNN-based setup and
an extrinsic probe-based setup. The former is in-
spired from Wiedemann et al. (2019): We assign
to a target the most common label in its neighbor-
hood. We restrict neighborhoods to words with
the same annotated lemma and use the 5 nearest


Table 3: Accuracy on SemCor WSD (in %).

neighbors using cosine distance. The latter is a
2-layer MLP similar to Du et al. (2019), where the
first layer is shared for all items and the second
layer is lemma-specific. We use the NLTK Semcor
dataset (Landes et al., 1998; Bird et al., 2009),
with an 80%–10%–10% split. We drop monose-
mous or OOV lemmas and sum over word-pieces
to convert them into single word representations.
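
The intrinsic setup can be summarized as follows, assuming per-token embeddings with lemma and sense annotations are already available; the data structures below are our own simplification:

    import numpy as np
    from collections import Counter

    def knn_sense(query_vec, query_lemma, train_vecs, train_lemmas, train_senses, k=5):
        """Assign the majority sense among the k nearest same-lemma neighbors (cosine similarity)."""
        idx = [i for i, lem in enumerate(train_lemmas) if lem == query_lemma]
        cands = train_vecs[idx]
        sims = cands @ query_vec / (np.linalg.norm(cands, axis=1) * np.linalg.norm(query_vec))
        top = np.argsort(-sims)[:k]
        return Counter(train_senses[idx[i]] for i in top).most_common(1)[0][0]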
Table 3 shows accuracy results. Selecting the most
frequent sense would yield an accuracy of 57%;
picking a sense at random, 24%. The terms it and
ct struggle to outperform the former baseline: rel-
evant KNN accuracy scores are lower, and corre-
sponding probe accuracy scores are barely above.
Overall, the same picture emerges from the
KNN setup and all 5 runs of the classifier setup.
The ft term does not yield the highest perfor-
mances in our experiment—instead, the ht term
systematically dominates. In single term models,
ht is ranked first and ft second. As for sums of two
terms, the setups ranked 1st, 2nd, and 3rd are those
that include ht; setups ranked 3rd to 5th, those that
include ft. Even more surprisingly, when sum-
ming three of the terms, the highest ranked setup
is the one where we exclude ft, and the lowest
corresponds to excluding ht. Removing ft system-
atically yields better performances than using the
full embedding. This suggests that ft is not neces-
sarily helpful to the final representation for WSD.
This contrasts with what we observed for MLM, where ht was found to be less useful than ft.

One argument that could be made here would
be to posit that the predictions derived from the
different sums of terms are intrinsically different,
hence a purely quantitative ranking might not cap-
ture this important distinction. To verify whether
this holds, we can look at the proportion of pre-
dictions that agree for any two models. Because

Figure 4: Prediction agreement for WSD models (in %). Upper triangle: agreement for KNNs; lower triangle: for learned classifiers.

our intent is to see what can be retrieved from
specific subterms of the embedding, we focus
solely on the most efficient classifiers across runs.
This is summarized in Figure 4: An individual
cell will detail the proportion of the assigned la-
bels shared by the models for that row and that
column. In short, we see that model predictions
tend to a high degree of overlap. For both KNN and
classifier setups, the three models that appear to
make the most distinct predictions turn out to be
computed from the it term, the ct term or their
sum: that is, the models that struggle to perform
better than the MFS baseline and are derived
from static representations.

6 Effects of Fine-tuning and NER

Downstream application can also be achieved
through fine-tuning, that is, restarting a model's training to derive better predictions on a narrower task. As we saw from Figures 3b and 3c, the
modifications brought upon this second round


Table 4: Macro-f1 on WNUT 2016 (in %).

of training are task-specific, meaning that an
exhaustive experimental survey is out of our reach.
We consider the task of Named Entity Recog-
nition, using the WNUT 2016 shared task dataset
(Strauss et al., 2016). We contrast the perfor-
mance of the non-finetuned BERT model to that
of the aforementioned variant fine-tuned on the
CONLL 2003 NER dataset using shallow probes.
Results are presented in Table 4. The very high
variance we observe across runs is likely due to the
smaller size of this dataset (46,469 training exam-
ples, as compared to the 142,642 of Section 5 or the
107,815 in Section 4). Fine-tuning BERT on an-
other NER dataset unsurprisingly has a systematic
positive impact: Average performance jumps up
by 5% or more. More interesting is the impact this
fine-tuning has on the ft term: When used as sole
input, the highest observed performance increases
by over 8%, and similar improvements are ob-
served consistently across all setups involving ft.
Still, the best average performances for fine-tuned and base embeddings correspond to ht (39.28% in tuned), it+ht (39.21%), and it+ht+ct (39.06%); in the base setting the highest average performances are reached with ht + ct (33.40%), it + ht + ct
(33.25%) and ht (32.91%)—suggesting that ft
might be superfluous for this task.
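
The agreement numbers discussed next (Figure 5) can be macro-averaged over labels roughly as follows, under the assumption that the two classifiers' predictions and the gold labels are available as flat lists:

    from collections import defaultdict

    def macro_agreement(preds_a, preds_b, gold):
        """Per-gold-label agreement rate between two prediction lists, macro-averaged."""
        per_label = defaultdict(lambda: [0, 0])          # label -> [matches, total]
        for a, b, g in zip(preds_a, preds_b, gold):
            per_label[g][0] += int(a == b)
            per_label[g][1] += 1
        return sum(m / t for m, t in per_label.values()) / len(per_label)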

We can also look at whether the highest scoring classifiers across runs produce different outputs. Given the high class imbalance of the dataset at hand, we macro-average the prediction overlaps by label. The result is shown in Figure 5; the upper triangle details the behavior of the untuned model, and the lower triangle details that of the NER-fine-tuned model.

Figure 5: NER prediction agreement (macro-average, in %). Upper triangle: agreement for untuned models; lower triangle: for tuned models.

In this round
of experiments, we see much more distinctly that
the it model, the ct model, and the it + ct model
behave markedly differently from the rest, with
ct yielding the most distinct predictions. As for
the NER-fine-tuned model (lower triangle), aside
from the aforementioned static representations,
most predictions display a degree of overlap much
higher than what we observe for the non-finetuned
model: Both FFs and MHAs are skewed towards
producing outputs more adapted to NER tasks.

7 Relevant Work

The derivation we provide in Section 2 ties in well
with other studies setting out to explain how Transformer embedding spaces are structured (Voita et al., 2019; Mickus et al., 2020; Vázquez et al., 2021, among others) and more broadly how they


behave (Rogers et al., 2020). For example, lower layers tend to yield higher performance on surface tasks (e.g., predicting the presence of a word, Jawahar et al. 2019) or static benchmarks (e.g., analogy, Vulić et al. 2020): This ties in with the vanishing prominence of it across layers. Likewise, probe-based approaches to unearth a linear structure matching with the syntactic structure of the input sentence (Raganato and Tiedemann, 2018; Hewitt and Manning, 2019, among others)
can be construed as relying on the explicit linear
dependence that we highlight here.

Another connection is with studies on embed-
ding space anisotropy (Ethayarajh, 2019; Timkey
and van Schijndel, 2021): Our derivation provides
a means of circumscribing which neural compo-
nents are likely to cause it. Also relevant is the
study on sparsifying Transformer representations
of Yun et al. (2021): The linearly dependent nature
of Transformer embeddings has some implications
when it comes to dictionary coding.

Also relevant are the works focusing on the in-
terpretation of specific Transformer components,
and feed-forward sublayers in particular (Geva
et al., 2021; Zhao et al., 2021; Geva et al., 2022). Lastly, our approach provides some quantitative
argument for the validity of attention-based studies
(Serrano and Smith, 2019; Jain and Wallace, 2019;
Wiegreffe and Pinter, 2019; Pruthi et al., 2020)
and expands on earlier works looking beyond
attention weights (Kobayashi et al., 2020).

8 Conclusions and Future Work

In this paper, we stress how Transformer embed-
dings can be decomposed linearly to describe the
impact of each network component. We show-
cased how this additive structure can be used to
investigate Transformers. Our approach suggests
a less central place for attention-based studies:
If multi-head attention only accounts for 30% of
embeddings, can we possibly explain what Trans-
formers do by looking solely at these submodules?
The crux of our methodology lies in that we de-
compose the output embedding by submodule
instead of layer or head. These approaches are not
mutually exclusive (cf. Section 3), hence our ap-
proach can easily be combined with other probing
protocols, providing the means to narrow in on
specific network components.

The experiments we have conducted in Sections 3 to 6 were designed so as to showcase whether our decomposition in Equation (1) could
yield useful results—or, as we put it earlier in
Section 2.3, whether this approach could be con-
ducive to future research. We were able to use
the proposed approach to draw insightful con-
nections. The noticeable anisotropy of contextual
embeddings can be connected to the prominent
trace of the biases in the output embedding: Comme
model biases make up an important part of the
whole embedding, they push it towards a specific
sub-region of the embedding. The diminishing
importance of it links back to earlier results on
word-type semantic benchmarks. We also report
novel findings, showcasing how some submodules
outputs may be detrimental in specific scenarios:
The output trace of FF modules was found to be
extremely useful for MLM, whereas the ht term
was found to be crucial for WSD. Our methodol-
ogy also allows for an overview of the impact of
finetuning (cf. Section 6): It skews components to-
wards more task-specific outputs, and its effects are
especially noticeable in upper layers (Figures 3d
and 3e).

Analyses in Sections 3 to 6 demonstrate the
immediate insight that our Transformer decom-
position can help achieve. This work therefore
opens a number of research perspectives, of which we name three. First, as mentioned in Sec-
tion 2.3, our approach can be extended further to
more thoroughly disentangle computations. Sec-
ond, while we focused here more on feed-forward
and multi-head attention components, extracting
the static component embeddings from it would
allow for a principled comparison of contextual
and static distributional semantics models. Last
but not least, because our analysis highlights the
different relative importance of Transformer com-
ponents in different tasks, it can be used to help
choose the most appropriate tools for further in-
terpretation of trained models among the wealth
of alternatives.

Acknowledgments

We are highly indebted to Marianne Clausel for
her significant help with how best to present the
mathematical aspects of this work. Our thanks also
go to Aman Sinha, as well as three anonymous
reviewers for their substantial comments towards
bettering this work.


This work was supported by a public grant
overseen by the French National Research


Agency (ANR) as part of the ‘‘Investissements
d'Avenir'' program: Idex Lorraine Université
d’Excellence (reference: ANR-15-IDEX-0004).
We also acknowledge the support by the FoTran
project, funded by the European Research Coun-
cil (ERC) under the European Union’s Horizon
2020 research and innovation programme (grant
agreement n◦ 771113).

A Step-by-step Derivation of Eq. (1)

Given that a Transformer consists of a stack of L layers, each comprising two sublayers, we
can treat a Transformer as a stack of Λ = 2L
sublayers. For notation simplicity, we link the
sublayer index λ to the layer index l: The first
sublayer of layer l is the (2l − 1)th sublayer, and the second is the (2l)th sublayer.10 All sublayers
include a residual connection before the final LN:

$$y_{\lambda,t} = g_\lambda \odot \frac{(S_\lambda(x_t) + x_t) - m_{\lambda,t} \cdot \mathbf{1}}{s_{\lambda,t}} + b^{(LN)}_\lambda$$

We can model the effects of the gain gλ and the
scaling 1/sλ,t as the d × d square matrix:

$$T_\lambda = \frac{1}{s_{\lambda,t}} \begin{pmatrix} (g_\lambda)_1 & & 0 \\ & \ddots & \\ 0 & & (g_\lambda)_d \end{pmatrix}$$

which we use to rewrite a sublayer output yλ,t as:

$$y_{\lambda,t} = \left( S_\lambda(x_t) + x_t - m_{\lambda,t} \cdot \mathbf{1} \right) T_\lambda + b^{(LN)}_\lambda = S_\lambda(x_t) T_\lambda + x_t T_\lambda - \left( m_{\lambda,t} \cdot \mathbf{1} \right) T_\lambda + b^{(LN)}_\lambda$$

We can then consider what happens to this addi-
tive structure in the next sublayer. We first de-
fine Tλ+1 as previously and remark that, as both
Tλ and Tλ+1 only contain diagonal entries:

$$T_\lambda T_{\lambda+1} = \frac{1}{s_{\lambda,t} \, s_{\lambda+1,t}} \begin{pmatrix} (g_\lambda \odot g_{\lambda+1})_1 & & 0 \\ & \ddots & \\ 0 & & (g_\lambda \odot g_{\lambda+1})_d \end{pmatrix}$$
This generalizes for any sequence of LNs as:

$$\prod_\lambda T_\lambda = \frac{1}{\prod_\lambda s_{\lambda,t}} \begin{pmatrix} \left( \bigodot_\lambda g_\lambda \right)_1 & & 0 \\ & \ddots & \\ 0 & & \left( \bigodot_\lambda g_\lambda \right)_d \end{pmatrix}$$

Let us now pass the input x through a complete layer, that is, through sublayers λ and λ + 1:

$$y_{\lambda+1,t} = S_{\lambda+1}(y_{\lambda,t}) T_{\lambda+1} + y_{\lambda,t} T_{\lambda+1} - \left( m_{\lambda+1,t} \cdot \mathbf{1} \right) T_{\lambda+1} + b^{(LN)}_{\lambda+1}$$

Substituting in the expression for yλ from above:

$$y_{\lambda+1,t} = S_{\lambda+1}\left( S_\lambda(x_t) T_\lambda + x_t T_\lambda - \left( m_{\lambda,t} \cdot \mathbf{1} \right) T_\lambda + b^{(LN)}_\lambda \right) T_{\lambda+1} + S_\lambda(x_t) \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} + x_t \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'}$$
$$- \left( m_{\lambda,t} \cdot \mathbf{1} \right) \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} + b^{(LN)}_\lambda T_{\lambda+1} - \left( m_{\lambda+1,t} \cdot \mathbf{1} \right) T_{\lambda+1} + b^{(LN)}_{\lambda+1}$$

As we are interested in the combined effects
of a layer, we only consider the case where Sλ
is a MHA mechanism and Sλ+1 a FF. We start
by reformulating the output of a MHA. Recall
that attention heads can be seen as weighted sums
of value vectors (Kobayashi et al., 2020). Due
to the softmax normalization, attention weights
αt,1, . . . , αt,n sum to 1 for any position t. Hence:


$$(A_h)_{t,\cdot} = \sum_{t'=1}^{n} \alpha_{h,t,t'} (V_h)_{t',\cdot} = \sum_{t'=1}^{n} \left( \alpha_{h,t,t'} x_{t'} W^{(V)}_h + \alpha_{h,t,t'} b^{(V)}_h \right) = \left( \sum_{t'=1}^{n} \alpha_{h,t,t'} x_{t'} W^{(V)}_h \right) + b^{(V)}_h$$

10 In the case of BERT, we also need to include a LN
before the first layer, which is straightforward if we index it
as λ = 0.

To account for all H heads in a MHA, we con-
catenate these head-specific sums and pass them


through the output projection W(MHA,O). As such, we can denote the unbiased output of the MHA and the associated bias as:

$$\tilde{h}_{l,t} = \sum_{h=1}^{H} \sum_{t'=1}^{n} \alpha_{l,h,t,t'} \, x_{l,t'} Z_{l,h} \qquad b^{(MHA)}_l = b^{(MHA,O)}_l + \left( \bigoplus_{h=1}^{H} b^{(V)}_{l,h} \right) W^{(MHA,O)}_l$$

with Zl,h as introduced in (4). By substituting the
actual sublayer functions in our previous equation:
$$y_{l,t} = \tilde{f}_{l,t} T_{\lambda+1} + b^{(FF,O)}_l T_{\lambda+1} + \tilde{h}_{l,t} \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} + b^{(MHA)}_l \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} + x_t \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'}$$
$$- \left( m_{\lambda,t} \cdot \mathbf{1} \right) \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} + b^{(LN)}_\lambda T_{\lambda+1} - \left( m_{\lambda+1,t} \cdot \mathbf{1} \right) T_{\lambda+1} + b^{(LN)}_{\lambda+1}$$

Here, given that there is only one FF for this layer, the output of the sublayer function at λ + 1 will correspond to the output of the FF for layer l, i.e., $\tilde{f}_{l,t} + b^{(FF,O)}_l$, and similarly the output for sublayer λ should be that of the MHA of layer l, or $\tilde{h}_{l,t} + b^{(MHA)}_l$. To match Eq. (1), rewrite as:

$$y_{l,t} = i_{\lambda+1,t} + h_{\lambda+1,t} + f_{\lambda+1,t} + c_{\lambda+1,t}$$
$$i_{\lambda+1,t} = x_{\lambda,t} \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} \qquad h_{\lambda+1,t} = \tilde{h}_{l,t} \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} \qquad f_{\lambda+1,t} = \tilde{f}_{l,t} T_{\lambda+1}$$
$$c_{\lambda+1,t} = b^{(FF,O)}_l T_{\lambda+1} + b^{(MHA)}_l \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} - \left( m_{\lambda,t} \cdot \mathbf{1} \right) \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} - \left( m_{\lambda+1,t} \cdot \mathbf{1} \right) T_{\lambda+1} + b^{(LN)}_\lambda T_{\lambda+1} + b^{(LN)}_{\lambda+1}$$

where xλ,t is the tth input for sublayer λ; that is,
the above characterizes the output of sublayer


λ + 1 with respect to the input of sublayer λ. Passing the output y_{l,t} into the next layer l + 1 (i.e., through sublayers λ + 2 and λ + 3) then gives:

$$y_{l+1,t} = i_{\lambda+3,t} + h_{\lambda+3,t} + f_{\lambda+3,t} + c_{\lambda+3,t}$$
$$i_{\lambda+3,t} = i_{\lambda+1,t} \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'} \qquad h_{\lambda+3,t} = h_{\lambda+1,t} \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'} + \tilde{h}_{l+1,t} \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'}$$
$$f_{\lambda+3,t} = f_{\lambda+1,t} \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'} + \tilde{f}_{l+1,t} T_{\lambda+3}$$
$$c_{\lambda+3,t} = c_{\lambda+1,t} \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'} + b^{(FF,O)}_{l+1} T_{\lambda+3} + b^{(MHA)}_{l+1} \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'} - \left( m_{\lambda+2,t} \cdot \mathbf{1} \right) \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'} - \left( m_{\lambda+3,t} \cdot \mathbf{1} \right) T_{\lambda+3} + b^{(LN)}_{\lambda+2} T_{\lambda+3} + b^{(LN)}_{\lambda+3}$$

This logic carries on across layers: Adding a layer corresponds to (i) mapping the existing terms through the two new LNs, (ii) adding new terms for the MHA and the FF, and (iii) tallying up the biases introduced in the current layer. Hence, the above generalizes to any number of layers k ≥ 1 as:11

$$y_{l+k,t} = i_{\lambda+2k-1,t} + h_{\lambda+2k-1,t} + f_{\lambda+2k-1,t} + c_{\lambda+2k-1,t}$$
$$i_{\lambda+2k-1,t} = x_{\lambda,t} \prod_{\lambda'=\lambda}^{\lambda+2k-1} T_{\lambda'}$$
$$h_{\lambda+2k-1,t} = \sum_{l'=l}^{l+k} \tilde{h}_{l',t} \prod_{\lambda'=2l'-1}^{2(l+k)} T_{\lambda'} \qquad f_{\lambda+2k-1,t} = \sum_{l'=l}^{l+k} \tilde{f}_{l',t} \prod_{\lambda'=2l'}^{2(l+k)} T_{\lambda'}$$

11 The edge case $\prod_{\lambda'=\lambda+1}^{\lambda} T_{\lambda'}$ is taken to be the identity matrix $I_d$, for notation simplicity.


$$c_{\lambda+2k-1,t} = \sum_{\lambda'=2l-1}^{\lambda+2k-1} \left( b^{(LN)}_{\lambda'} \prod_{\lambda''=\lambda'+1}^{\lambda+2k-1} T_{\lambda''} - \left( m_{\lambda',t} \cdot \mathbf{1} \right) \prod_{\lambda''=\lambda'}^{\lambda+2k-1} T_{\lambda''} \right) + \sum_{l'=l}^{l+k} \left( b^{(MHA)}_{l'} \prod_{\lambda'=2l'-1}^{\lambda+2k-1} T_{\lambda'} + b^{(FF,O)}_{l'} \prod_{\lambda'=2l'}^{\lambda+2k-1} T_{\lambda'} \right)$$

Lastly, recall that by construction, we have:

$$v \prod_\lambda T_\lambda = \frac{\bigodot_\lambda g_\lambda}{\prod_\lambda s_{\lambda,t}} \odot v$$

By recurrence over all layers and providing the initial input x0,t, we obtain Eqs. (1) to (5).

B Hyperplane Bounds of ct
We can re-write Eq. (5) to highlight that it is composed only of scalar multiplications applied to constant vectors. Let:

$$b^{(S)}_\lambda = \begin{cases} b^{(MHA,O)}_l + \left( \bigoplus_h b^{(V)}_{l,h} \right) W^{(MHA,O)}_l & \text{if } \lambda = 2l - 1 \\ b^{(FF,O)}_l & \text{if } \lambda = 2l \end{cases}$$

$$p_\lambda = \left( \bigodot_{\lambda'=\lambda+1}^{\Lambda} g_{\lambda'} \right) \odot \left( b^{(LN)}_\lambda + b^{(S)}_\lambda \right) \qquad q_\lambda = \bigodot_{\lambda'=\lambda+1}^{\Lambda} g_{\lambda'}$$

Using the above, Eq. (5) is equivalent to:

$$c_t = \sum_{\lambda=1}^{\Lambda} \frac{1}{\prod_{\lambda'=\lambda+1}^{\Lambda} s_{\lambda',t}} \cdot p_\lambda + \sum_{\lambda=1}^{\Lambda} \frac{-m_{\lambda,t}}{\prod_{\lambda'=\lambda+1}^{\Lambda} s_{\lambda',t}} \cdot q_\lambda$$

Note that pλ and qλ are constant across all inputs.
Assuming their linear independence puts an upper
bound of 2Λ vectors necessary to express ct.

C Computational Details

In Section 2.3, we use the default hyperpa-
rameters of scikit-learn (Pedregosa et al.,
2011). In Section 4, we learn categorical regres-
sions using an AdamW optimizer (Loshchilov and

993

Hutter, 2019) and iterate 20 times over the train set; hyperparameters (learning rate, weight decay, dropout, and the β1 and β2 AdamW hyperparameters) are set using Bayes Optimization (Snoek et al., 2012), with 50 hyperparameter samples and accuracy as objective. In Section 5, learning rate, dropout, weight decay, β1 and β2, and learning rate scheduling are selected with Bayes Optimization, using 100 samples and accuracy as objective. In Section 6, we learn shallow logistic regressions, setting hyperparameters with Bayes Optimization, using 100 samples and macro-f1 as the objective.
Experiments were run on a 4GB NVIDIA GPU.

D Ethical Considerations

The offset method of Mikolov et al. (2013) is known to also model social stereotypes (Bolukbasi et al., 2016, among others). Some of the sub-representations of our decomposition may exhibit stronger biases than the whole embedding et, and can yield higher performances than focusing on the whole embedding (e.g., Table 3).
This could provide an undesirable incentive to
deploy NLP models with higher performances
and stronger systemic biases.

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E.

Hinton. 2016. Layer normalization.

Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In 3rd
International Conference on Learning Rep-
resentations, ICLR 2015 ; Conference date:
07-05-2015 Through 09-05-2015.

Steven Bird, Ewan Klein, and Edward Loper.
2009. Natural Language Processing with
Python: Analyzing Text with the Natural
Language Toolkit. O’Reilly Media, Inc.

Tolga Bolukbasi, Kai-Wei Chang, James Y.
Zou, Venkatesh Saligrama, and Adam T.
Kalai. 2016. Man is to computer programmer
as woman is to homemaker? Debiasing word
embeddings. In Advances in Neural Informa-
tion Processing Systems, volume 29. Curran
Associates, Inc.


Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jiaju Du, Fanchao Qi, and Maosong Sun. 2019.
Using BERT for word sense disambiguation.
CoRR, abs/1909.08358.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1006

Mor Geva, Avi Caciularu, Kevin Ro Wang,
and Yoav Goldberg. 2022. Transformer
feed-forward layers build predictions by pro-
moting concepts in the vocabulary space.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.446

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELU). arXiv preprint arXiv:1606.08415.

John Hewitt and Christopher D. Manning. 2019.
A structural probe for finding syntax in word
representations. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4129–4138,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob
Uszkoreit, Noam Shazeer, Curtis Hawthorne,
Andrew M. Dai, Matthew D. Hoffman,
and Douglas Eck. 2018. An improved rela-
tive self-attention mechanism for transformer
with application to music generation. CoRR,
abs/1809.04281.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of
le 2019 Conference of the North American
Chapter of
the Association for Computa-
tional Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers),
pages 3543–3556, Minneapolis, Minnesota.
Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé
Seddah. 2019. What does BERT learn about
the structure of language? In Proceedings of
the 57th Annual Meeting of the Association for
Computational Linguistics, pages 3651–3657,
Florence, Italy. Association for Computational
Linguistics. https://doi.org/10.18653/v1/P19-1356

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi,
and Kentaro Inui. 2020. Attention is not only
a weight: Analyzing transformers with vector
norms. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 7057–7075, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.574

Philipp Koehn. 2005. Europarl: A parallel corpus
for statistical machine translation. In Proceed-
ings of Machine Translation Summit X: Papers,
pages 79–86. Phuket, Thailand.

Shari Landes, Claudia Leacock, and Randee I.
Tengi. 1998. Building semantic concordances.
WordNet: An Electronic Lexical Database,
chapter 8, pages 199–216. Bradford Books.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing,

pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1166

Timothee Mickus, Denis Paperno, Mathieu
Constant, and Kees van Deemter. 2020. What
do you mean, BERT? In Proceedings of the
Society for Computation in Linguistics 2020,
pages 279–290, New York, New York. Asso-
ciation for Computational Linguistics.

Tomas Mikolov, Wen-tau Yih, and Geoffrey
Zweig. 2013. Linguistic regularities in continu-
ous space word representations. In Proceedings
of the 2013 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 746–751, Atlanta, Georgia. Association
for Computational Linguistics.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429. https://doi.org/10.1111/j.1551-6709.2010.01106.x

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay.
2011. Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research,
12:2825–2830.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020. Adapterhub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020): Systems Demonstrations, pages 46–54, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-demos.7


Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5431

Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018. Know what you don’t know: Unanswer-
able questions for SQuAD. In Proceedings
of the 56th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 2:
Short Papers), pages 784–789, Melbourne,
Australia. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/P18-2124

Anna Rogers, Olga Kovaleva,

and Anna
Rumshisky. 2020. A primer in BERTology:
What we know about how BERT works. Trans-
actions of the Association for Computational
Linguistics, 8:842–866. https://doi.org
/10.1162/tacl_a_00349

Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1282

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. CoRR, abs/1803.02155.

Jasper Snoek, Hugo Larochelle, and Ryan P.
Adams. 2012. Practical Bayesian optimization
of machine learning algorithms. In Advances
in Neural Information Processing Systems,
volume 25. Curran Associates, Inc.

Danish Pruthi, Mansi Gupta, Bhuwan Dhingra,
Graham Neubig, and Zachary C. Lipton.
2020. Learning to deceive with attention-based
explanations. In Proceedings of the 58th An-
nual Meeting of the Association for Computa-
tional Linguistics, pages 4782–4793, Online.
Association for Computational Linguistics.

Benjamin Strauss, Bethany Toma, Alan Ritter,
Marie-Catherine de Marneffe, and Wei Xu.
2016. Results of the WNUT16 named entity
recognition shared task. In Proceedings of the
2nd Workshop on Noisy User-generated Text
(WNUT), pages 138–144, Osaka, Japan. The
COLING 2016 Organizing Committee.


William Timkey and Marten van Schijndel. 2021. All bark and no bite: Rogue dimensions in transformer language models obscure representational quality. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4527–4546, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.372

Erik F. Tjong Kim Sang and Fien De Meulder.
2003. Introduction to the CoNLL-2003 shared
task: Language-independent named entity re-
cognition. In Proceedings of the Seventh Con-
ference on Natural Language Learning at
HLT-NAACL 2003, pages 142–147. https://
doi.org/10.3115/1119176.1119195

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural
Information Processing Systems, volume 30.
Curran Associates, Inc.

Raúl Vázquez, Hande Celikkanat, Mathias Creutz,
and J¨org Tiedemann. 2021. On the differences
between BERT and MT encoder spaces and how
to address them in translation tasks. In Proceed-
ings of the 59th Annual Meeting of the Associa-
tion for Computational Linguistics and the 11th
International Joint Conference on Natural Lan-
guage Processing: Student Research Workshop,
pages 337–347, Online. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/2021.acl-srw.35

Elena Voita, Rico Sennrich, and Ivan Titov.
2019. The bottom-up evolution of represen-
tations in the transformer: A study with
machine translation and language modeling
objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4396–4406, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1448

Ivan Vulić, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, and Anna Korhonen. 2020. Prob-
ing pretrained language models for lexical
semantics. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 7222–7240,
Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.586

Gregor Wiedemann, Steffen Remus, Avi Chawla,
and Chris Biemann. 2019. Does BERT make
any sense? Interpretable word sense dis-
ambiguation with contextualized embeddings.
ArXiv, abs/1909.10430.

Sarah Wiegreffe and Yuval Pinter. 2019. Atten-
tion is not not explanation. In Proceedings
of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 11–20, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1002

Zeyu Yun, Yubei Chen, Bruno Olshausen, and Yann LeCun. 2021. Transformer visualization via dictionary learning: Contextualized embedding as a linear superposition of transformer factors. In Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 1–10, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.deelio-1.1

Sumu Zhao, Damián Pascual, Gino Brunner, and Roger Wattenhofer. 2021. Of non-linearity and commutativity in BERT. In 2021 International
Joint Conference on Neural Networks (IJCNN),
pages 1–8. https://doi.org/10.1109
/IJCNN52387.2021.9533563
