How to Dissect a Muppet: The Structure of
Transformer Embedding Spaces
Timothee Mickus∗
University of Helsinki, Finland
timothee.mickus@helsinki.fi
Denis Paperno
Utrecht University, The Netherlands
d.paperno@uu.nl
Mathieu Constant
Université de Lorraine, CNRS, ATILF, France
Mathieu.Constant@univ-lorraine.fr
Abstract
Pretrained embeddings based on the Trans-
former architecture have taken the NLP
community by storm. We show that they can
mathematically be reframed as a sum of vector
factors and showcase how to use this refram-
ing to study the impact of each component.
We provide evidence that multi-head atten-
tions and feed-forwards are not equally useful
in all downstream applications, as well as a
quantitative overview of the effects of fine-tuning
on the overall embedding space. This
approach allows us to draw connections to a
wide range of previous studies, from vector
space anisotropy to attention weights.
1 Introduction
The Transformer architecture (Vaswani et al.,
2017) has taken the NLP community by storm.
Based on the attention mechanism (Bahdanau
et al., 2015; Luong et al., 2015), it was shown
to outperform recurrent architectures on a wide
variety of tasks. Another step was taken with
pretrained language models derived from this ar-
chitecture (BERT, Devlin et al., 2019, among
others): they now embody the default approach
to a vast swath of NLP applications. Success
breeds scrutiny; likewise the popularity of these
models has fostered research in explainable NLP
interested in the behavior and explainability of
pretrained language models (Rogers et al., 2020).
In this paper, we develop a novel decomposition
of Transformer output embeddings. Our approach
consists in quantifying the contribution of each
network submodule to the output contextual embedding,
and grouping those into four terms: (i)
what relates to the input for a given position, (ii)
what pertains to feed-forward submodules, (iii)
what corresponds to multi-head attention, and (iv)
what is due to vector biases.
∗ The work described in the present paper was conducted
chiefly while at ATILF.
This allows us to investigate Transformer em-
beddings without relying on attention weights or
treating the entire model as a black box, as is most
often done in the literature. The usefulness of our
method is demonstrated on BERT: Our case study
yields enlightening connections to state-of-the-art
work on Transformer explainability, evidence that
multi-head attentions and feed-forwards are not
equally useful in all downstream applications, as
well as an overview of the effects of finetuning on
the embedding space. We also provide a simple
and intuitive measurement of the importance of
any term in this decomposition with respect to the
whole embedding.
We will provide insights on the Transformer
architecture in Section 2, and showcase how these
insights can translate into experimental investigations
in Sections 3 to 6. We will conclude with
connections to other relevant works in Section 7
and discuss future perspectives in Section 8.
2 Additive Structure in Transformers
We show that the Transformer embedding et for
a token t can be expressed as a sum of four terms:
\[ \mathbf{e}_t = \mathbf{i}_t + \mathbf{h}_t + \mathbf{f}_t + \mathbf{c}_t \tag{1} \]
where it can be thought of as a classical static em-
bedding, ft and ht are the cumulative contributions
at every layer of the feed-forward submodules and
the MHAs respectively, and ct corresponds to
biases accumulated across the model.
Equation (1) provides interpretable and quantifiable
terms that can explain the behavior of
specific components of the Transformer architecture.
More precisely, it characterizes the
impact of adding another sublayer on top of
what was previously computed: the terms in
Equation (1) are defined as sums across (sub)layers;
Transactions of the Association for Computational Linguistics, vol. 10, pp. 981–996, 2022. https://doi.org/10.1162/tacl_a_00501
Action Editor: Dani Yogatama. Submission batch: 3/2022; Revision batch: 5/2022; Published 9/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
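Table 1: Notation.
  $\mathbf{A}$                     matrix
  $(\mathbf{A})_{t,\cdot}$         t-th row of $\mathbf{A}$
  $\mathbf{a}$                     (row) vector
  $a$, $\alpha$                    scalars
  $\mathbf{W}^{(M)}$               item linked to submodule M
  $\mathbf{a} \oplus \mathbf{b}$   concatenation of vectors $\mathbf{a}$ and $\mathbf{b}$
  $\bigoplus_n \mathbf{a}_n$       $\mathbf{a}_1 \oplus \mathbf{a}_2 \oplus \cdots \oplus \mathbf{a}_n$
  $\mathbf{a} \odot \mathbf{b}$    element-wise multiplication of $\mathbf{a}$ and $\mathbf{b}$
  $\bigodot_n \mathbf{a}_n$        $\mathbf{a}_1 \odot \mathbf{a}_2 \odot \cdots \odot \mathbf{a}_n$
  $\vec{1}$                        vector with all components set to 1
  $\mathbf{0}_{m,n}$               null matrix of shape m × n
  $\mathbf{I}_n$                   identity matrix of shape n × n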
hence we can track how a given sublayer trans-
forms its input, and show that this effect can be
thought of as adding another vector to a pre-
vious sum. This layer-wise sum of submodule
outputs also allows us to provide a first estimate
of which parameters are most relevant to the over-
all embedding space: a submodule whose output
is systematically negligible has its parameters set
so that its influence on subsequent computations
is minimal.
The formulation in Equation (1) more generally
relies on the additive structure of Transformer
embedding spaces. We start by reviewing the
Transformer architecture in Section 2.1, before
discussing our decomposition in greater detail in
Section 2.2 and known limitations in Section 2.3.
2.1 Transformer Encoder Architecture
Let’s start by characterizing the Transformer ar-
chitecture of Vaswani et al. (2017) in the notation
described in Table 1.
Transformers are often defined using three
hyperparameters: the number of layers L, the dimensionality
of the hidden representations d, and
H, the number of attention heads in multi-head
attentions. Formally, a Transformer model is a
stack of sublayers. A visual representation is
shown in Figure 1. Two sublayers are stacked
to form a single Transformer layer: The first
corresponds to a multi-head attention mechanism
(MHA), and the second to a feed-forward (FF).
A Transformer with L layers contains Λ = 2L
sublayers. In Figure 1 two sublayers (in blue) are
grouped into one layer, and L layers are stacked
one after the other.
Figure 1: Overview of a Transformer encoder.
Each sublayer is centered around a specific sub-
layer function. Sublayer functions map an input
x to an output y, and can either be feed-forward
submodules or multi-head attention submodules.
FFs are subnets of the form:
\[ \mathbf{y}^{(\mathrm{FF})}_t = \phi\left( \mathbf{x}_t \mathbf{W}^{(\mathrm{FF,I})} + \mathbf{b}^{(\mathrm{FF,I})} \right) \mathbf{W}^{(\mathrm{FF,O})} + \mathbf{b}^{(\mathrm{FF,O})} \]
where φ is a non-linear function, such as ReLU or
GELU (Hendrycks and Gimpel, 2016). Here, (…,I)
and (…,O) distinguish the input and output linear
projections, and the index t corresponds to the
token position. Input and output dimensions are
equal, whereas the intermediary layer dimension
(i.e., the size of the hidden representations to
which the non-linear function φ will be applied)
is larger, typically b = 1024 or 2048. In other
words, $\mathbf{W}^{(\mathrm{FF,I})}$ is of shape d × b, $\mathbf{b}^{(\mathrm{FF,I})}$ of size b,
$\mathbf{W}^{(\mathrm{FF,O})}$ is of shape b × d, and $\mathbf{b}^{(\mathrm{FF,O})}$ of size d.
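As an illustration, the feed-forward sublayer function above can be sketched in NumPy. This is a minimal sketch under stated assumptions: the toy sizes, the random parameters, and the tanh approximation of GELU are choices made here for illustration, not taken from the original experiments.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; exact GELU uses the Gaussian CDF)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W_in, b_in, W_out, b_out, phi=gelu):
    """x: (n, d); W_in: (d, b); b_in: (b,); W_out: (b, d); b_out: (d,)."""
    return phi(x @ W_in + b_in) @ W_out + b_out

rng = np.random.default_rng(0)
n, d, b = 5, 8, 32                      # toy sizes (hypothetical)
x = rng.normal(size=(n, d))
W_in, b_in = rng.normal(size=(d, b)), rng.normal(size=b)
W_out, b_out = rng.normal(size=(b, d)), rng.normal(size=d)
y_ff = feed_forward(x, W_in, b_in, W_out, b_out)
```

The function is applied position by position: row t of the output depends only on row t of the input.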
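MHAs are concatenations of scaled-dot attention heads:
\[ \mathbf{y}^{(\mathrm{MHA})}_t = \left( \bigoplus_{h=1}^{H} (\mathbf{A}_h)_{t,\cdot} \right) \mathbf{W}^{(\mathrm{MHA,O})} + \mathbf{b}^{(\mathrm{MHA,O})} \]
where $(\mathbf{A}_h)_{t,\cdot}$ is the t-th row vector of the following
n × d/H matrix $\mathbf{A}_h$:
\[ \mathbf{A}_h = \mathrm{softmax}\left( \frac{\mathbf{Q}_h \mathbf{K}_h^{T}}{\sqrt{d/H}} \right) \mathbf{V}_h \]
with h an index tracking attention heads. The
parameter matrix $\mathbf{W}^{(\mathrm{MHA,O})}$ is of shape d × d, and
the bias $\mathbf{b}^{(\mathrm{MHA,O})}$ of size d. The queries $\mathbf{Q}_h$, keys
$\mathbf{K}_h$ and values $\mathbf{V}_h$ are simple linear projections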
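of shape n × (d/H), computed from all inputs
$\mathbf{x}_1, \ldots, \mathbf{x}_n$:
\[ (\mathbf{Q}_h)_{t,\cdot} = \mathbf{x}_t \mathbf{W}^{(\mathrm{Q})}_h + \mathbf{b}^{(\mathrm{Q})}_h \]
\[ (\mathbf{K}_h)_{t,\cdot} = \mathbf{x}_t \mathbf{W}^{(\mathrm{K})}_h + \mathbf{b}^{(\mathrm{K})}_h \]
\[ (\mathbf{V}_h)_{t,\cdot} = \mathbf{x}_t \mathbf{W}^{(\mathrm{V})}_h + \mathbf{b}^{(\mathrm{V})}_h \]
where the weight matrices $\mathbf{W}^{(\mathrm{Q})}_h$, $\mathbf{W}^{(\mathrm{K})}_h$ and $\mathbf{W}^{(\mathrm{V})}_h$
are of the shape d × (d/H), with H the number
of attention heads, and biases $\mathbf{b}^{(\mathrm{Q})}_h$, $\mathbf{b}^{(\mathrm{K})}_h$ and
$\mathbf{b}^{(\mathrm{V})}_h$ are of size d/H. This component is often
analyzed in terms of attention weights $\alpha_h$, which
correspond to the softmax dot-product between
keys and queries. In other words, the product
$\mathrm{softmax}(\mathbf{Q}_h \mathbf{K}_h^{T} / \sqrt{d/H})$ can be thought of as an
n × n matrix of weights in an average over
the transformed input vectors $\mathbf{x}_{t'} \mathbf{W}^{(\mathrm{V})}_h + \mathbf{b}^{(\mathrm{V})}_h$
(Kobayashi et al., 2020, Eqs. (1) to (4)): multiplying
these weights with the value projection
$\mathbf{V}_h$ yields a weighted sum of value projections:
\[ (\mathbf{A}_h)_{t,\cdot} = \sum_{t'=1}^{n} \alpha_{h,t,t'} \, (\mathbf{V}_h)_{t',\cdot} \]
where $\alpha_{h,t,t'}$ is the component at row t and column
t′ of this attention weights matrix.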
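The multi-head attention described above can be sketched as follows. This is a minimal NumPy sketch, assuming toy sizes and random parameters; the per-head weights are stored in stacked arrays of shape (H, d, d/H), which is a layout chosen here for readability rather than the layout of any particular library.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, bq, Wk, bk, Wv, bv, W_out, b_out):
    """x: (n, d); Wq, Wk, Wv: (H, d, d/H); bq, bk, bv: (H, d/H); W_out: (d, d)."""
    H, _, dh = Wq.shape
    heads, alphas = [], []
    for h in range(H):
        Q = x @ Wq[h] + bq[h]
        K = x @ Wk[h] + bk[h]
        V = x @ Wv[h] + bv[h]
        alpha = softmax(Q @ K.T / np.sqrt(dh))   # (n, n) attention weights
        heads.append(alpha @ V)                  # A_h: weighted sums of value rows
        alphas.append(alpha)
    # concatenate the heads, then apply the output projection
    y = np.concatenate(heads, axis=-1) @ W_out + b_out
    return y, np.stack(alphas)

rng = np.random.default_rng(0)
n, d, H = 5, 8, 2                                # toy sizes (hypothetical)
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(H, d, d // H)) for _ in range(3))
bq, bk, bv = (rng.normal(size=(H, d // H)) for _ in range(3))
W_out, b_out = rng.normal(size=(d, d)), rng.normal(size=d)
y_mha, alphas = multi_head_attention(x, Wq, bq, Wk, bk, Wv, bv, W_out, b_out)
```

Each row of every attention matrix sums to 1, which is what lets the value biases be factored out of the weighted sum later in the decomposition.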
Lastly, after each sublayer function S, a residual
connection and a layer normalization (LN, Ba
et al., 2016) are applied:
\[ \mathbf{y}^{(\mathrm{LN})}_t = \mathbf{g} \odot \left[ \frac{S(\mathbf{x}_t) + \mathbf{x}_t - m_t \cdot \vec{1}}{s_t} \right] + \mathbf{b}^{(\mathrm{LN})} \]
The gain $\mathbf{g}$ and bias $\mathbf{b}^{(\mathrm{LN})}$ are learned parameters
with d components each; $m_t \cdot \vec{1}$ is the vector
(1, · · · , 1) scaled by the mean component value
$m_t$ of the input vector $S(\mathbf{x}_t) + \mathbf{x}_t$; $s_t$ is the standard
deviation of the component values of this input.
As such, a LN performs a z-scaling, followed by
the application of the gain $\mathbf{g}$ and the bias $\mathbf{b}^{(\mathrm{LN})}$.
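The residual connection and layer normalization can be sketched in a few lines of NumPy. This sketch follows the formula above literally; practical implementations usually add a small epsilon to the variance for numerical stability, which is omitted here. The function name and toy inputs are hypothetical.

```python
import numpy as np

def sublayer_output(x, s_x, g, b_ln):
    """Residual connection followed by layer normalization.

    x, s_x: (n, d) sublayer input and sublayer function output S(x);
    g, b_ln: (d,) gain and bias. Also returns the per-token mean and std.
    """
    r = s_x + x                              # residual connection
    m = r.mean(axis=-1, keepdims=True)       # m_t, one scalar per token
    s = r.std(axis=-1, keepdims=True)        # s_t, population std (no epsilon)
    return g * (r - m) / s + b_ln, m, s

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
s_x = rng.normal(size=(4, 16))
y_ln, m, s = sublayer_output(x, s_x, g=np.ones(16), b_ln=np.zeros(16))
```

With a unit gain and a null bias, each output row has zero mean and unit standard deviation, which is the z-scaling mentioned above.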
To kick-start computations, a sequence of static
vector representations x0,1 . . . x0,n with d compo-
nents each is fed into the first layer. This initial
input corresponds to the sum of a static lookup
word embedding and a positional encoding.1
1 In BERT (Devlin et al., 2019), additional terms to this
static input encode the segment a token belongs to, and a LN
is added before the very first sublayer. Other variants also
encode positions by means of an offset in attention heads
(Huang et al., 2018; Shaw et al., 2018).
2.2 Mathematical Re-framing
We now turn to the decomposition proposed in
Equation (1): et = it + ft + ht + ct.2 We provide
a derivation in Appendix A.
The term it corresponds to the input embedding
(i.e., the positional encoding, the input
word-type embedding, and the segment encoding
in BERT-like models), after having gone through
all the LN gains and rescaling:
\[ \mathbf{i}_t = \frac{\bigodot_{\lambda=1}^{\Lambda} \mathbf{g}_\lambda}{\prod_{\lambda=1}^{\Lambda} s_{\lambda,t}} \odot \mathbf{x}_{0,t} \tag{2} \]
where λ ranges over all Λ = 2L sublayers. Here,
the $\mathbf{g}_\lambda$ correspond to the learned gain parameters
of the LNs, whereas the scalars $s_{\lambda,t}$ derive
from the z-scaling performed in the λth LN, as
defined above. The input $\mathbf{x}_{0,t}$ consists of the sum
of a static lookup embedding and a positional
encoding; as such, it resembles an uncontextualized
embedding.
The next two terms capture the outputs of spe-
cific submodules, either FFs or MHAs. As such,
their importance and usefulness will differ from
task to task. The term ft is the sum of the outputs
of the FF submodules. Submodule outputs pass
through LNs of all the layers above, hence:
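\[ \mathbf{f}_t = \sum_{l=1}^{L} \frac{\bigodot_{\lambda=2l}^{\Lambda} \mathbf{g}_\lambda}{\prod_{\lambda=2l}^{\Lambda} s_{\lambda,t}} \odot \tilde{\mathbf{f}}_{l,t} \tag{3} \]
where $\tilde{\mathbf{f}}_{l,t} = \phi\left( \mathbf{x}^{(\mathrm{FF})}_{l,t} \mathbf{W}^{(\mathrm{FF,I})}_l + \mathbf{b}^{(\mathrm{FF,I})}_l \right) \mathbf{W}^{(\mathrm{FF,O})}_l$
is the unbiased output at the position t of the FF
submodule for this layer l.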
The term ht corresponds to the sum across
layers of each MHA output, having passed through
the relevant LNs. As MHAs are entirely linear, we
can further describe each output as a sum over all
H heads of a weighted bag-of-words of the input
representations to that submodule. Or:
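\[ \mathbf{h}_t = \sum_{l=1}^{L} \left( \frac{\bigodot_{\lambda=2l-1}^{\Lambda} \mathbf{g}_\lambda}{\prod_{\lambda=2l-1}^{\Lambda} s_{\lambda,t}} \odot \left[ \sum_{h=1}^{H} \sum_{t'=1}^{n} \alpha_{l,h,t,t'} \, \mathbf{x}_{l,t'} \mathbf{Z}_{l,h} \right] \right), \qquad \mathbf{Z}_{l,h} = \mathbf{W}^{(\mathrm{V})}_{l,h} \mathbf{M}_h \mathbf{W}^{(\mathrm{MHA,O})}_l \tag{4} \]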
2 We empirically verified that components from attested
embeddings et and those derived from Eq. (1) are
systematically equal up to $\pm 10^{-7}$.
where $\mathbf{Z}_{l,h}$ corresponds to passing an input embedding
through the unbiased values projection
$\mathbf{W}^{(\mathrm{V})}_{l,h}$ of the head h, then projecting it from a
d/H-dimensional subspace onto a d-dimensional
space using a zero-padded identity matrix:
\[ \mathbf{M}_h = \begin{bmatrix} \mathbf{0}_{d/H,\,(h-1)\times d/H} & \mathbf{I}_{d/H} & \mathbf{0}_{d/H,\,(H-h)\times d/H} \end{bmatrix} \]
and finally passing it through the unbiased outer
projection $\mathbf{W}^{(\mathrm{MHA,O})}_l$ of the relevant MHA.
Figure 2: Fitting the ft term: r2 across layers.
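The role of the zero-padded identity matrices used in the definition of the ht term can be checked numerically: concatenating H head outputs gives the same vector as summing each head output mapped through its own padded identity. The snippet below is a small sketch with hypothetical toy sizes; heads are indexed from 1 as in the text.

```python
import numpy as np

def padded_identity(h, H, dh):
    """Zero-padded identity of shape (d/H, d) for head h (1-indexed)."""
    M = np.zeros((dh, H * dh))
    M[:, (h - 1) * dh : h * dh] = np.eye(dh)   # identity in the h-th block
    return M

rng = np.random.default_rng(0)
H, dh = 4, 3                                   # toy sizes (hypothetical)
heads = rng.normal(size=(H, dh))               # one d/H-dimensional row per head

concat = heads.reshape(-1)                     # concatenation of the H heads
summed = sum(heads[h - 1] @ padded_identity(h, H, dh) for h in range(1, H + 1))
```

Rewriting the concatenation as a sum is what allows the MHA output to be expressed as a sum over heads.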
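In the last term ct, we collect all the biases. We
do not expect these offsets to be meaningful but
rather to depict a side-effect of the architecture:
\[ \begin{aligned}
\mathbf{c}_t = {} & \sum_{\lambda=1}^{\Lambda} \left( \frac{\bigodot_{\lambda'=\lambda+1}^{\Lambda} \mathbf{g}_{\lambda'}}{\prod_{\lambda'=\lambda+1}^{\Lambda} s_{\lambda',t}} \odot \mathbf{b}^{(\mathrm{LN})}_{\lambda} - \frac{\bigodot_{\lambda'=\lambda}^{\Lambda} \mathbf{g}_{\lambda'}}{\prod_{\lambda'=\lambda}^{\Lambda} s_{\lambda',t}} \odot \left( m_{\lambda,t} \cdot \vec{1} \right) \right) \\
& + \sum_{l=1}^{L} \frac{\bigodot_{\lambda=2l-1}^{\Lambda} \mathbf{g}_{\lambda}}{\prod_{\lambda=2l-1}^{\Lambda} s_{\lambda,t}} \odot \left[ \mathbf{b}^{(\mathrm{MHA,O})}_l + \left( \bigoplus_{h=1}^{H} \mathbf{b}^{(\mathrm{V})}_{l,h} \right) \mathbf{W}^{(\mathrm{MHA,O})}_l \right] \\
& + \sum_{l=1}^{L} \frac{\bigodot_{\lambda=2l}^{\Lambda} \mathbf{g}_{\lambda}}{\prod_{\lambda=2l}^{\Lambda} s_{\lambda,t}} \odot \mathbf{b}^{(\mathrm{FF,O})}_l
\end{aligned} \tag{5} \]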
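The concatenation $\bigoplus_h \mathbf{b}^{(\mathrm{V})}_{l,h}$ here is equivalent
to a sum over heads of the biases mapped through zero-padded
identity matrices: $\sum_h \mathbf{b}^{(\mathrm{V})}_{l,h} \mathbf{M}_h$. This term ct
includes the biases $\mathbf{b}^{(\mathrm{LN})}_{\lambda}$
and mean-shifts $m_{\lambda,t} \cdot \vec{1}$ of the LNs, the
outer projection biases of the FF submodules
$\mathbf{b}^{(\mathrm{FF,O})}_l$, the outer projection bias in each MHA
submodule $\mathbf{b}^{(\mathrm{MHA,O})}_l$ and the value projection biases,
mapped through the outer MHA projection
$\left( \bigoplus_h \mathbf{b}^{(\mathrm{V})}_{l,h} \right) \mathbf{W}^{(\mathrm{MHA,O})}_l$.3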
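To make the decomposition concrete, the sketch below runs a toy post-LN Transformer encoder with random weights and tracks the four terms alongside the ordinary forward pass. It is not the authors' implementation: rather than evaluating the closed forms of Equations (2) to (5), it updates the terms recursively at each sublayer, which amounts to the same sums. Toy sizes, ReLU as the non-linearity, the absence of an initial LN, and the lack of an epsilon in the LN are all assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, H, b, L = 6, 16, 4, 32, 2          # toy sizes (hypothetical)
dh = d // H

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rand(*shape):
    return rng.normal(scale=0.5, size=shape)

def mha(x, p):
    """Return (unbiased output, bias part) of a multi-head attention."""
    heads = []
    for h in range(H):
        Q = x @ p["Wq"][h] + p["bq"][h]
        K = x @ p["Wk"][h] + p["bk"][h]
        alpha = softmax(Q @ K.T / np.sqrt(dh))
        heads.append(alpha @ (x @ p["Wv"][h]))   # value bias left out here
    unbiased = np.concatenate(heads, axis=-1) @ p["Wo"]
    # rows of alpha sum to 1, so value biases factor out of the weighted sum
    bias = p["bo"] + p["bv"].reshape(-1) @ p["Wo"]
    return unbiased, bias

def ff(x, p):
    """Return (unbiased output, bias part) of a feed-forward submodule."""
    return np.maximum(x @ p["Wi"] + p["bi"], 0.0) @ p["Wo"], p["bo"]

def make_layer():
    mha_p = dict(Wq=rand(H, d, dh), bq=rand(H, dh), Wk=rand(H, d, dh),
                 bk=rand(H, dh), Wv=rand(H, d, dh), bv=rand(H, dh),
                 Wo=rand(d, d), bo=rand(d))
    ff_p = dict(Wi=rand(d, b), bi=rand(b), Wo=rand(b, d), bo=rand(d))
    lns = [dict(g=1 + rand(d), b=rand(d)) for _ in range(2)]
    return mha_p, ff_p, lns

layers = [make_layer() for _ in range(L)]
x0 = rand(n, d)

# x is the ordinary forward pass; the four terms are tracked alongside it
x = x0.copy()
i_t, h_t, f_t, c_t = x0.copy(), np.zeros((n, d)), np.zeros((n, d)), np.zeros((n, d))
for mha_p, ff_p, lns in layers:
    for kind, fn, p, ln in (("mha", mha, mha_p, lns[0]), ("ff", ff, ff_p, lns[1])):
        unbiased, bias = fn(x, p)
        r = x + unbiased + bias                  # residual connection
        m = r.mean(axis=-1, keepdims=True)
        s = r.std(axis=-1, keepdims=True)
        x = ln["g"] * (r - m) / s + ln["b"]      # layer normalization
        scale = ln["g"] / s                      # gain over z-scaling factor
        i_t = scale * i_t
        h_t = scale * (h_t + (unbiased if kind == "mha" else 0.0))
        f_t = scale * (f_t + (unbiased if kind == "ff" else 0.0))
        c_t = scale * (c_t + bias - m) + ln["b"]

max_gap = np.abs(x - (i_t + h_t + f_t + c_t)).max()
```

On this toy model, `max_gap` stays at the level of floating-point rounding, which mirrors the check reported in footnote 2.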
2.3 Limitations of Equation (1)
The decomposition proposed in Equation (1)
comes with a few caveats that are worth addressing
explicitly. Most importantly, Equation (1) does
not entail that the terms are independent from one
another. For instance, the scaling factor $1 / \prod_{\lambda} s_{\lambda,t}$
systematically depends on the magnitude of earlier
hidden representations. Equation (1) only stresses
that a Transformer embedding can be decomposed
as a sum of the outputs of its submodules: It does
not fully disentangle computations. We leave the
precise definition of computation disentanglement
and its elaboration for the Transformer to future
research, and focus here on the decomposition
proposed in Equation (1).
3 In the case of relative positional embeddings applied to
value projections (Shaw et al., 2018), it is rather straightforward
to follow the same logic so as to include relative
positional offset in the most appropriate term.
In all, the major issue at hand is the ft term:
It is the only term that cannot be derived as a
linear composition of vectors, due to the non-linear
function used in the FFs. Aside from the ft term,
non-linear computations all devolve into scalar
corrections (namely, the LN z-scaling factors sλ,t
and mλ,t and the attention weights αl,h). As such,
ft is the single bottleneck that prevents us from
entirely decomposing a Transformer embedding
as a linear combination of sub-terms.
As the non-linear functions used in Transformers
are generally either ReLU or GELU, which
both behave almost linearly for a high enough
input value, it is in principle possible that the FF
submodules can be approximated by a purely linear
transformation, depending on the exact set of
parameters they converged onto. It is worth assessing
this possibility. Here, we learn a least-squares
linear regression mapping the z-scaled inputs of
every FF to its corresponding z-scaled output. We
use the BERT base uncased model of Devlin et al.
(2019) and a random sample of 10,000 sentences
from the Europarl English section (Koehn, 2005),
or almost 900,000 word-piece tokens, and fit the
regressions using all 900,000 embeddings.
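The linearity check described above can be sketched as follows. This is a hedged illustration on synthetic data rather than on BERT activations: the random ReLU network stands in for an FF submodule, and the r2 score is pooled over all output dimensions, which is an assumption since the exact averaging is not specified here.

```python
import numpy as np

def zscale(v):
    # per-vector z-scaling, as performed by a layer normalization
    return (v - v.mean(axis=-1, keepdims=True)) / v.std(axis=-1, keepdims=True)

def linear_fit_r2(X, Y):
    """Fit Y ~ [X, 1] B by least squares and return the pooled r2 score."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # add an intercept column
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    ss_res = ((Y - X1 @ B) ** 2).sum()
    ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))                      # synthetic FF inputs
W1, W2 = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
Y_linear = X @ rng.normal(size=(8, 8)) + 1.0        # purely affine map
Y_ff = np.maximum(X @ W1, 0.0) @ W2                 # toy ReLU feed-forward

r2_linear = linear_fit_r2(X, Y_linear)
r2_ff = linear_fit_r2(zscale(X), zscale(Y_ff))
```

An affine map is recovered perfectly, whereas the toy non-linear network leaves part of the variance unexplained, which is the kind of gap the r2 score is meant to expose.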
Figure 2 displays the quality of these linear approximations,
as measured by an r2 score. We see
some variation across layers but never observe a
perfect fit: 30% to 60% of the observed variance is
not explained by a linear map, suggesting BERT
actively exploits the non-linearity. That the model
doesn’t simply circumvent the non-linear func-
tion to adopt a linear behavior intuitively makes
sense: Adding the feed-forward terms is what pre-
vents the model from devolving into a sum of
bag-of-words and static embeddings. While such
approaches have been successful (Mikolov et al.,
2013; Mitchell and Lapata, 2010), a non-linearity
ought to make the model more expressive.
In all, the sanity check in Figure 2 highlights
that the interpretation of the ft term is the major
"black box" unanalyzable component remaining
under Equation (1). As such, the recent interest in
analyzing these modules (e.g., Geva et al., 2021;
Zhao et al., 2021; Geva et al., 2022) is likely to
have direct implications for the relevance of the
present work. When adopting the linear decomposition
approach we advocate, this problem can
be further simplified: We only require a computationally
efficient algorithm to map an input
weighted sum of vectors through the non-linearity
to an output weighted sum of vectors.4
Also remark that previous research stressed
that Transformer layers exhibit a certain degree
of commutativity (Zhao et al., 2021) and that
additional computation can be injected between
contiguous sublayers (Pfeiffer et al., 2020). This
can be thought of as evidence pointing towards a
certain independence of the computations done in
each layer: If we can shuffle and add layers, then it
seems reasonable to characterize sublayers based
on what their outputs add to the total embedding,
as we do in Equation (1).
Beyond the expectations we may have, it
remains to be seen whether our proposed methodology
is of actual use, that is, whether it is conducive
to further research. The remainder of this article
presents some analyses that our decomposition
enables us to conduct.5
4 One could simply treat the effect of a non-linear activation
as if it were an offset. For instance, in the case of
ReLU: ReLU(v) = v + z, where z = ReLU(v) − v = ReLU(−v).
5 Code for our experiments is available at the following
URL: https://github.com/TimotheeMickus/bert-splat.
3 Visualizing the Contents of Embeddings
One major question is that of the relative relevance
of the different submodules of the architecture
with respect to the overall output embedding.
Studying the four terms it, ft, ht, and ct can
prove helpful in this endeavor. Given that Equations
(2) to (5) are defined as sums across layers
or sublayers, it is straightforward to adapt them
to derive the decomposition for intermediate representations.
Hence, we can study how relevant
each of the four terms is to intermediary representations,
and plot how this relevance evolves
across layers.
To that end, we propose an importance metric
to compare one of the terms tt to the total et.
We require it to be sensitive to co-directionality
(i.e., whether tt and et have similar directions)
and relative magnitude (whether tt is a major
component of et). A normalized dot-product of
the form:
\[ \mu(\mathbf{e}_t, \mathbf{t}_t) = \mathbf{e}_t^{T} \mathbf{t}_t / \lVert \mathbf{e}_t \rVert_2^2 \tag{6} \]
satisfies both of these requirements. As dot-product
distributes over addition (i.e., $\mathbf{a}^{T} \sum_i \mathbf{b}_i = \sum_i \mathbf{a}^{T} \mathbf{b}_i$)
and the dot-product of a vector with
itself is its magnitude squared (i.e., $\mathbf{a}^{T} \mathbf{a} = \lVert \mathbf{a} \rVert_2^2$):
\[ \mu(\mathbf{e}_t, \mathbf{i}_t) + \mu(\mathbf{e}_t, \mathbf{f}_t) + \mu(\mathbf{e}_t, \mathbf{h}_t) + \mu(\mathbf{e}_t, \mathbf{c}_t) = 1 \]
Hence this function intuitively measures the importance
of a term relative to the total.
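The importance metric is straightforward to compute. The sketch below uses random vectors as stand-ins for the four terms (an assumption for illustration only) and checks that the four proportions sum to 1 for every token.

```python
import numpy as np

def importance(e, term):
    """Normalized dot-product of Eq. (6), computed row-wise for (n, d) arrays."""
    return (e * term).sum(axis=-1) / (e ** 2).sum(axis=-1)

rng = np.random.default_rng(0)
# random stand-ins for the i, f, h and c terms of 10 tokens (hypothetical data)
terms = {name: rng.normal(size=(10, 768)) for name in ("i", "f", "h", "c")}
e = sum(terms.values())                      # the full embedding is their sum

mu = {name: importance(e, t) for name, t in terms.items()}
total = sum(mu.values())                     # one value per token
```

Note that individual proportions can be negative or exceed 1 when a term points away from, or overshoots, the full embedding; only their sum is constrained.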
We use the same Europarl sample as in Section 2.3.
We contrast embeddings from three
related models: The BERT base uncased model
and fine-tuned variants on CONLL 2003 NER
(Tjong Kim Sang and De Meulder, 2003)6 and
SQuAD v2 (Rajpurkar et al., 2018).7
Figure 3 summarizes the relative importance
of the four terms of Eq. (1), as measured by
the normalized dot-product defined in Eq. (6);
ticks on the x-axis correspond to different layers.
Figures 3a to 3c display the evolution of our
proportion metric across layers for all three BERT
models, and Figures 3d to 3f display how our
normalized dot-product measurements correlate
across pairs of models using Spearman’s ρ.8
Looking at Figure 3a, we can make a few important
observations. The input term it, which
corresponds to a static embedding, initially dominates
the full output, but quickly decreases in
prominence, until it reaches 0.045 at the last layer.
This should explain why lower layers of Transformers
generally give better performances on
static word-type tasks (Vulić et al., 2020, among
others). The ht term is not as prominent as one
could expect from the vast literature that focuses
6 https://huggingface.co/dslim/bert-base-NER-uncased.
7 https://huggingface.co/twmkn9/bert-base-uncased-squad2.
8 Layer 0 is the layer normalization conducted before the
first sublayer, hence ft and ht are undefined here.
Figure 3: Relative importance of main terms.
on MHA. Its normalized dot-product is barely
above what we observe for ct, and never averages
above 0.3 across any layer. This can be partly
attributed to the prominence of ft and its normalized
dot-product of 0.4 or above across most
layers. As FF submodules are always the last component
added to each hidden state, the sub-terms
of ft go through fewer LNs than those of ht, and
thus undergo fewer scalar multiplications, which
likely affects their magnitude. Lastly, the term ct
is far from negligible: At layer 11, it is the most
prominent term, and in the output embedding it
accounts for up to 23%. Note that ct defines a set
of offsets embedded in a 2Λ-dimensional hyperplane
(cf. Appendix B). In BERT base, 23% of the
output can be expressed using a 50-dimensional
vector, or 6.5% of the 768 dimensions of the
model. This likely induces part of the anisotropy of
Transformer embeddings (e.g., Ethayarajh, 2019;
Timkey and van Schijndel, 2021), as the ct term
pushes the embedding towards a specific region
of the space.
The fine-tuned models in Figures 3b and 3c
are found to impart a much lower proportion of
the contextual embeddings to the it and ct terms.
While ft seems to dominate in the final embedding,
looking at the correlations in Figures 3d and 3e
suggests that the ht terms are those that undergo
the most modifications. Proportions assigned to
the terms correlate with those assigned in the
non-finetuned model more in the case of lower
layers than higher layers (Figures 3d and 3e).
The required adaptations seem task-specific as the
two fine-tuned models do not correlate highly
with each other (Figure 3f). Lastly, updates in the
NER model impact mostly layer 8 and upwards
(Figure 3d), whereas the QA model (Figure 3e)
sees important modifications to the ht term at the
first layer, suggesting that SQuAD requires more
drastic adaptations than CONLL 2003.
4 The MLM Objective
An interesting follow-up question concerns which
of the four terms allow us to retrieve the target
word-piece. We consider two approaches:
(a) using the actual projection learned by the
non-fine-tuned BERT model, or (b) learning a
simple categorical regression for a specific term.
We randomly select 15% of the word-pieces in
our Europarl sample. As in the work of Devlin
et al. (2019), 80% of these items are masked,
10% are replaced by a random word-piece, and
10% are left as is. Selected embeddings are then
split between train (80%), validation (10%), and
test (10%).
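The selection and corruption procedure described above can be sketched as follows. This is an illustrative approximation rather than the exact procedure used in the experiments: each word-piece is selected independently with probability 0.15, so the selected proportion is only approximately 15%, and the vocabulary size and mask identifier are hypothetical values.

```python
import numpy as np

def select_mlm_targets(token_ids, vocab_size, mask_id, rng, p_select=0.15):
    """Return (corrupted ids, boolean mask of selected positions)."""
    token_ids = np.asarray(token_ids)
    corrupted = token_ids.copy()
    selected = rng.random(token_ids.shape) < p_select
    action = rng.random(token_ids.shape)
    masked = selected & (action < 0.8)                        # 80%: mask token
    randomized = selected & (action >= 0.8) & (action < 0.9)  # 10%: random piece
    # the remaining 10% of selected positions are left as is
    corrupted[masked] = mask_id
    corrupted[randomized] = rng.integers(0, vocab_size, size=int(randomized.sum()))
    return corrupted, selected

rng = np.random.default_rng(0)
ids = rng.integers(5, 1000, size=10_000)     # toy word-piece ids (hypothetical)
corrupted, selected = select_mlm_targets(ids, vocab_size=1000, mask_id=4, rng=rng)
```

The embeddings at the selected positions are the ones that would then be split into train, validation, and test sets.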
Results are displayed in Table 2. The first row
("Default") details predictions using the default
output projection on the vocabulary, that is, we
test the performances of combinations of sub-terms
under the circumstances encountered by the model
during training.9 The rows below ("Learned")
correspond to learned linear projections; the row
marked μ displays the average performance across
all 5 runs. Columns display the results of using
the sum of 1, 2, 3, or 4 of the terms it, ht, ft
and ct to derive representations; for example, the
rightmost corresponds to it + ht + ft + ct (i.e., the
full embedding), whereas the leftmost corresponds
to predicting based on it alone. Focusing on the
default projection first, we see that it benefits
from a more extensive training: When using all
four terms, it is almost 2% more accurate than
learning one from scratch. On the other hand,
learning a regression allows us to consider more
specifically what can be retrieved from individual
terms, as is apparent from the behavior of
ft: When using the default output projection, we
get 1.36% accuracy, whereas a learned regression
yields 53.77%.
Table 2: Masked language model accuracy (in %). Cells in underlined bold font indicate best
performance per setup across runs. Cell color indicates the ranking of setups within a run. Rows marked
μ contain average performance; rows marked σ contain the standard deviation across runs.
The default projection matrix is also highly
dependent on the normalization offsets ct and
the FF terms ft being added together: Removing
this ct term from any experiment using ft is
highly detrimental to the accuracy. On the other
hand, combining the two produces the highest
accuracy scores. Our logistic regressions show
that most of this performance can be imputed to
the ft term. Learning a projection from the ft
term already yields an accuracy of almost 54%.
On the other hand, a regression learned from
ct alone has a limited performance of 9.72% on
average. Interestingly, this is still above what one
would observe if the model always predicted the
most frequent word-piece (viz. the, 6% of the
test targets): even these very semantically bare
items can be exploited by a classifier. As ct
is tied to the LN z-scaling, this suggests that
the magnitude of Transformer embeddings is not
wholly meaningless.
9 We thank an anonymous reviewer for pointing out that
the BERT model ties input and output embeddings; we leave
investigating the implications of this fact for future work.
In all, do FFs make the model more effective?
The ft term is necessary to achieve the highest
accuracy on the training objective of BERT. On its
own, it doesn’t achieve the highest performances:
for that we also need to add the MHA outputs
ht. However, the performances we can associate
to ft on its own are higher than what we observe
for ht, suggesting that FFs make the Transformer
architecture more effective on the MLM objec-
tive. This result connects with the work of Geva
et al. (2021, 2022) who argue that FFs update
the distribution over the full vocabulary, hence it
makes sense that ft would be most useful to the
MLM task.
5 Lexical Contents and WSD
We now turn to look at how the vector spaces
are organized, and which term yields the most
linguistically appropriate space. We rely on
WSD, as distinct senses should yield different
representations.
We consider an intrinsic KNN-based setup and
an extrinsic probe-based setup. The former is in-
spired from Wiedemann et al. (2019): We assign
to a target the most common label in its neighbor-
hood. We restrict neighborhoods to words with
the same annotated lemma and use the 5 nearest
neighbors using cosine distance. The latter is a
2-layer MLP similar to Du et al. (2019), where the
first layer is shared for all items and the second
layer is lemma-specific. We use the NLTK Semcor
dataset (Landes et al., 1998; Bird et al., 2009),
with an 80%–10%–10% split. We drop monosemous
or OOV lemmas and sum over word-pieces
to convert them into single word representations.
Table 3: Accuracy on SemCor WSD (in %).
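The KNN setup can be sketched as a majority vote among the nearest training items that share the target's lemma. This is a minimal sketch on hypothetical two-dimensional vectors; the function name, the toy senses, and the tie-breaking behavior (that of `collections.Counter`) are assumptions made here, not details from the experiments.

```python
import numpy as np
from collections import Counter

def knn_sense(target, target_lemma, train_vecs, train_lemmas, train_senses, k=5):
    """Most common sense among the k nearest same-lemma neighbors (cosine)."""
    idx = [j for j, lem in enumerate(train_lemmas) if lem == target_lemma]
    if not idx:
        return None                              # lemma unseen in training data
    cands = train_vecs[idx]
    cos = cands @ target / (np.linalg.norm(cands, axis=1) * np.linalg.norm(target))
    nearest = np.argsort(-cos)[:k]               # highest cosine similarity first
    votes = Counter(train_senses[idx[j]] for j in nearest)
    return votes.most_common(1)[0][0]

# toy data (hypothetical): two senses of "bank", one unrelated lemma
train_vecs = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
                       [0.1, 1.0], [0.2, 0.9], [0.0, 1.0], [1.0, 0.0]])
train_lemmas = ["bank"] * 6 + ["plant"]
train_senses = ["river"] * 3 + ["money"] * 3 + ["factory"]

pred = knn_sense(np.array([0.95, 0.05]), "bank",
                 train_vecs, train_lemmas, train_senses, k=3)
```

Restricting candidates to the same lemma means the "plant" item is never considered, even though its vector is close to the target.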
Table 3 shows accuracy results. Selecting the most
frequent sense would yield an accuracy of 57%;
picking a sense at random, 24%. The terms it and
ct struggle to outperform the former baseline: rel-
evant KNN accuracy scores are lower, and corre-
sponding probe accuracy scores are barely above.
Overall, the same picture emerges from the
KNN setup and all 5 runs of the classifier setup.
The ft term does not yield the highest perfor-
mances in our experiment—instead, the ht term
systematically dominates. In single term models,
ht is ranked first and ft second. As for sums of two
terms, the setups ranked 1st, 2nd, and 3rd are those
that include ht; setups ranked 3rd to 5th, those that
include ft. Even more surprisingly, when sum-
ming three of the terms, the highest ranked setup
is the one where we exclude ft, and the lowest
corresponds to excluding ht. Removing ft system-
atically yields better performances than using the
full embedding. This suggests that ft is not neces-
sarily helpful to the final representation for WSD.
This contrasts with what we observed for MLM,
where ht was found to be less useful than ft.
One argument that could be made here would
be to posit that the predictions derived from the
different sums of terms are intrinsically different,
hence a purely quantitative ranking might not cap-
ture this important distinction. To verify whether
this holds, we can look at the proportion of predictions
that agree for any two models. Because
our intent is to see what can be retrieved from
specific subterms of the embedding, we focus
solely on the most efficient classifiers across runs.
This is summarized in Figure 4: An individual
cell details the proportion of the assigned labels
shared by the models for that row and that
column. In short, we see that model predictions
tend to display a high degree of overlap. For both KNN and
classifier setups, the three models that appear to
make the most distinct predictions turn out to be
computed from the it term, the ct term or their
sum: that is, the models that struggle to perform
better than the MFS baseline and are derived
from static representations.
Figure 4: Prediction agreement for WSD models (in %).
Upper triangle: agreement for KNNs; lower triangle:
for learned classifiers.
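The pairwise agreement between models amounts to the share of test items on which two models assign the same label. A small sketch, with hypothetical model names and predictions:

```python
import numpy as np

def agreement_matrix(predictions):
    """predictions: dict mapping a model name to its labels on the same items.

    Returns the model names and a matrix of pairwise agreement (in %).
    """
    names = list(predictions)
    preds = np.array([predictions[name] for name in names])   # (models, items)
    agree = (preds[:, None, :] == preds[None, :, :]).mean(axis=-1)
    return names, 100.0 * agree

# toy predictions from three hypothetical setups over four test items
names, agree = agreement_matrix({
    "h": [0, 1, 1, 2],
    "f": [0, 1, 2, 2],
    "c": [1, 1, 1, 1],
})
```

The resulting matrix is symmetric with 100 on the diagonal, so two such matrices can share one figure by using the upper triangle for one setup and the lower triangle for another.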
6 Effects of Fine-tuning and NER
Downstream application can also be achieved
through fine-tuning, that is, restarting a model's
training to derive better predictions on a narrower
task. As we saw from Figures 3b and 3c, the
modifications brought about by this second round
of training are task-specific, meaning that an
exhaustive experimental survey is out of our reach.
Table 4: Macro-f1 on WNUT 2016 (in %).
We consider the task of Named Entity Recog-
nition, using the WNUT 2016 shared task dataset
(Strauss et al., 2016). We contrast the perfor-
mance of the non-finetuned BERT model to that
of the aforementioned variant fine-tuned on the
CONLL 2003 NER dataset using shallow probes.
Results are presented in Table 4. The very high
variance we observe across is likely due to the
smaller size of this dataset (46,469 training exam-
ples, as compared to the 142,642 of Section 5 or the
107,815 in Section 4). Fine-tuning BERT on an-
other NER dataset unsurprisingly has a systematic
positive impact: Average performance jumps up
by 5% or more. More interesting is the impact this
fine-tuning has on the ft term: When used as sole
input, the highest observed performance increases
by over 8%, and similar improvements are ob-
served consistently across all setups involving ft.
Yet, the best average performance for fine-tuned
and base embeddings correspond to ht (39.28% In
tuned), it+ht (39.21%), and it+ht+ct (39.06%);
in the base setting the highest average performance
are reached with ht + ct (33.40%), Esso + ht + ct
(33.25%) and ht (32.91%)—suggesting that ft
might be superfluous for this task.
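To make the metric concrete, the following sketch computes a macro-averaged F1 score in plain Python. It assumes that the label set is the union of gold and predicted labels and that a label with no true positives, false positives, or false negatives scores 0; the toy labels are hypothetical and do not come from WNUT 2016.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores, in %."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return 100.0 * sum(scores) / len(scores)


# Hypothetical, heavily imbalanced toy example: most tokens are "O".
gold = ["O", "O", "O", "O", "PER", "LOC"]
pred = ["O", "O", "O", "O", "O", "LOC"]
score = macro_f1(gold, pred)
accuracy = 100.0 * sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

On this toy example accuracy is 5 out of 6, whereas the macro-F1 is markedly lower because the missed "PER" label weighs as much as the frequent "O" label. This is the reason for preferring macro-averaging on imbalanced NER data.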
We can also look at whether the highest scoring
classifiers across runs produce different
outputs. Given the high class imbalance of
the dataset at hand, we macro-average the
prediction overlaps by label. The result is shown in
Figure 5; the upper triangle details the behavior of
the untuned model, and the lower triangle details
Figure 5: NER prediction agreement (macro-average,
in %). Upper triangle: agreement for untuned models;
lower triangle: for tuned models.
that of the NER-fine-tuned model. In this round
of experiments, we see much more distinctly that
the it model, the ct model, and the it + ct model
behave markedly differently from the rest, with
ct yielding the most distinct predictions. As for
the NER-fine-tuned model (lower triangle), apart
from the aforementioned static representations,
most predictions display a degree of overlap much
higher than what we observe for the non-finetuned
model: Both FFs and MHAs are skewed towards
producing outputs more adapted to NER tasks.
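One possible reading of the macro-averaged overlap is sketched below. It assumes that items are grouped by their gold label, that agreement between two models is measured within each group, and that the per-label rates are then averaged with equal weight. This grouping is an assumption made for illustration, and the toy data are hypothetical.

```python
from collections import defaultdict


def macro_agreement(gold, preds_a, preds_b):
    """Agreement between two models, averaged over gold labels (in %)."""
    per_label = defaultdict(lambda: [0, 0])  # label -> [agreements, items]
    for g, a, b in zip(gold, preds_a, preds_b):
        per_label[g][0] += int(a == b)
        per_label[g][1] += 1
    rates = [agree / total for agree, total in per_label.values()]
    return 100.0 * sum(rates) / len(rates)


# Hypothetical toy data: eight "O" tokens and two "PER" tokens.
gold = ["O"] * 8 + ["PER", "PER"]
model_a = ["O"] * 8 + ["PER", "O"]
model_b = ["O"] * 8 + ["O", "O"]

macro = macro_agreement(gold, model_a, model_b)
micro = 100.0 * sum(a == b for a, b in zip(model_a, model_b)) / len(gold)
print(macro, micro)  # 75.0 90.0
```

The plain (micro) agreement is dominated by the frequent "O" class, whereas the macro-average exposes the disagreement on the rare class.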
7 Relevant Work
The derivation we provide in Section 2 ties in well
with other studies setting out to explain how
Transformer embedding spaces are structured (Voita
et al., 2019; Mickus et al., 2020; Vázquez et al.,
2021, among others) and more broadly how they
behave (Rogers et al., 2020). For instance, lower
layers tend to yield higher performance on surface
tasks (e.g., predicting the presence of a word,
Jawahar et al. 2019) or static benchmarks (e.g.,
analogy, Vulić et al. 2020): This ties in with the
vanishing prominence of it across layers. Likewise,
probe-based approaches to unearth a linear
structure matching with the syntactic structure
of the input sentence (Raganato and Tiedemann,
2018; Hewitt and Manning, 2019, among others)
can be construed as relying on the explicit linear
dependence that we highlight here.
Another connection is with studies on embed-
ding space anisotropy (Ethayarajh, 2019; Timkey
and van Schijndel, 2021): Our derivation provides
a means of circumscribing which neural compo-
nents are likely to cause it. Also relevant is the
study on sparsifying Transformer representations
of Yun et al. (2021): The linearly dependent nature
of Transformer embeddings has some implications
when it comes to dictionary coding.
Also relevant are the works focusing on the in-
terpretation of specific Transformer components,
and feed-forward sublayers in particular (Geva
et al., 2021; Zhao et al., 2021; Geva et al., 2022).
Lastly, our approach provides some quantitative
argument for the validity of attention-based studies
(Serrano and Smith, 2019; Jain and Wallace, 2019;
Wiegreffe and Pinter, 2019; Pruthi et al., 2020)
and expands on earlier works looking beyond
attention weights (Kobayashi et al., 2020).
8 Conclusions and Future Work
In this paper, we stress how Transformer embeddings
can be decomposed linearly to describe the
impact of each network component. We showcased
how this additive structure can be used to
investigate Transformers. Our approach suggests
a less central place for attention-based studies:
If multi-head attention only accounts for 30% of
embeddings, can we possibly explain what Transformers
do by looking solely at these submodules?
The crux of our methodology lies in that we decompose
the output embedding by submodule
instead of layer or head. These approaches are not
mutually exclusive (cf. Section 3), hence our approach
can easily be combined with other probing
protocols, providing the means to narrow in on
specific network components.
The experiments we have conducted in Sections 3 to 6
were designed so as to showcase
whether our decomposition in Equation (1) could
yield useful results or, as we put it earlier in
Section 2.3, whether this approach could be conducive
to future research. We were able to use
the proposed approach to draw insightful connections.
The noticeable anisotropy of contextual
embeddings can be connected to the prominent
trace of the biases in the output embedding: As
model biases make up an important part of the
whole embedding, they push it towards a specific
sub-region of the embedding. The diminishing
importance of it links back to earlier results on
word-type semantic benchmarks. We also report
novel findings, showcasing how some submodule
outputs may be detrimental in specific scenarios:
The output trace of FF modules was found to be
extremely useful for MLM, whereas the ht term
was found to be crucial for WSD. Our methodology
also allows for an overview of the impact of
finetuning (cf. Section 6): It skews components towards
more task-specific outputs, and its effects are
especially noticeable in upper layers (Figures 3d
and 3e).
Analyses in Sections 3 to 6 demonstrate the
immediate insight that our Transformer decomposition
can help achieve. This work therefore
opens a number of research perspectives, of
which we name three. First, as mentioned in
Section 2.3, our approach can be extended further to
more thoroughly disentangle computations. Second,
while we focused here more on feed-forward
and multi-head attention components, extracting
the static component embeddings from it would
allow for a principled comparison of contextual
and static distributional semantics models. Last
but not least, because our analysis highlights the
different relative importance of Transformer components
in different tasks, it can be used to help
choose the most appropriate tools for further
interpretation of trained models among the wealth
of alternatives.
Acknowledgments
We are highly indebted to Marianne Clausel for
her significant help with how best to present the
mathematical aspects of this work. Our thanks also
go to Aman Sinha, as well as three anonymous
reviewers for their substantial comments towards
bettering this work.
This work was supported by a public grant
overseen by the French National Research
Agency (ANR) as part of the “Investissements
d’Avenir” program: Idex Lorraine Université
d’Excellence (reference: ANR-15-IDEX-0004).
We also acknowledge the support by the FoTran
project, funded by the European Research Council
(ERC) under the European Union’s Horizon
2020 research and innovation programme (grant
agreement no. 771113).
A Step-by-step Derivation of Eq. (1)
Given that a Transformer consists of a stack
of L layers, each comprising two sublayers, we
can treat a Transformer as a stack of Λ = 2L
sublayers. For notation simplicity, we link the
sublayer index λ to the layer index l: The first
sublayer of layer l is the (2l − 1)th sublayer, and
the second is the (2l)th sublayer.10 All sublayers
include a residual connection before the final LN:
$$
y_{\lambda,t} = g_\lambda \odot \left[ \frac{\left(S_\lambda(x) + x\right) - m_{\lambda,t} \cdot \vec{1}}{s_{\lambda,t}} \right] + b_\lambda^{(\mathrm{LN})}
$$
We can model the effects of the gain gλ and the
scaling 1/sλ,t as the d × d square matrix:
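$$
T_\lambda = \frac{1}{s_{\lambda,t}}
\begin{bmatrix}
(g_\lambda)_1 & 0 & \dots & 0 \\
0 & (g_\lambda)_2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \dots & 0 & (g_\lambda)_d
\end{bmatrix}
$$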
which we use to rewrite a sublayer output yλ,t as:
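$$
\begin{aligned}
y_{\lambda,t} &= \left( S_\lambda(x) + x - \left( m_{\lambda,t} \cdot \vec{1} \right) \right) T_\lambda + b_\lambda^{(\mathrm{LN})} \\
&= S_\lambda(x)\, T_\lambda + x\, T_\lambda - \left( m_{\lambda,t} \cdot \vec{1} \right) T_\lambda + b_\lambda^{(\mathrm{LN})}
\end{aligned}
$$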
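The equivalence between a layer normalization and the diagonal matrix form can be checked numerically. The sketch below uses NumPy with random stand-ins for the sublayer output and the LN parameters. It assumes that the scale is the plain standard deviation of the pre-LN vector (no epsilon term), and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
x = rng.normal(size=d)       # residual input to the sublayer
s_out = rng.normal(size=d)   # stand-in for the sublayer output S(x)
gain = rng.normal(size=d)    # LN gain g
bias = rng.normal(size=d)    # LN bias b^(LN)


def layer_norm(v, gain, bias):
    """Layer normalization without epsilon, as assumed in this sketch."""
    return gain * (v - v.mean()) / v.std() + bias


pre_ln = s_out + x
m, s = pre_ln.mean(), pre_ln.std()
T = np.diag(gain) / s        # d x d diagonal matrix combining gain and scaling

direct = layer_norm(pre_ln, gain, bias)
decomposed = s_out @ T + x @ T - (m * np.ones(d)) @ T + bias
assert np.allclose(direct, decomposed)
```

Since T is diagonal, right-multiplying a row vector by T amounts to an element-wise product with the gain divided by the scale, which is what makes the additive split possible.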
We can then consider what happens to this addi-
tive structure in the next sublayer. We first de-
fine Tλ+1 as previously and remark that, as both
Tλ and Tλ+1 only contain diagonal entries:
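$$
T_\lambda T_{\lambda+1} = \frac{1}{s_{\lambda,t}\, s_{\lambda+1,t}} \times
\begin{bmatrix}
(g_\lambda \odot g_{\lambda+1})_1 & \dots & 0 \\
\vdots & \ddots & \vdots \\
0 & \dots & (g_\lambda \odot g_{\lambda+1})_d
\end{bmatrix}
$$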
This generalizes for any sequence of LNs as:
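$$
\prod_\lambda T_\lambda = \frac{1}{\prod_\lambda s_{\lambda,t}} \times
\begin{bmatrix}
\left( \bigodot_\lambda g_\lambda \right)_1 & \dots & 0 \\
\vdots & \ddots & \vdots \\
0 & \dots & \left( \bigodot_\lambda g_\lambda \right)_d
\end{bmatrix}
$$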
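The closed form for a product of such diagonal matrices can be verified with a few lines of NumPy. The gains and scales below are random stand-ins rather than values from a trained model, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_sublayers = 5, 4
gains = rng.normal(size=(n_sublayers, d))          # one gain vector per LN
scales = rng.uniform(0.5, 2.0, size=n_sublayers)   # one scale s per LN

# Explicit product T_1 T_2 ... T_n of the diagonal matrices.
Ts = [np.diag(g) / s for g, s in zip(gains, scales)]
product = np.linalg.multi_dot(Ts)

# Closed form: element-wise product of gains over the product of scales.
closed_form = np.diag(gains.prod(axis=0)) / scales.prod()
assert np.allclose(product, closed_form)

# Applying the product to a vector is an element-wise rescaling.
v = rng.normal(size=d)
assert np.allclose(v @ product, gains.prod(axis=0) / scales.prod() * v)
```

In practice this means that the matrices never need to be materialized: keeping a running element-wise product of gains and a running product of scales is sufficient.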
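Let us now pass the input x through a complete
layer, that is, through sublayers λ and λ + 1:
$$
y_{\lambda+1,t} = S_{\lambda+1}\left( y_{\lambda,t} \right) T_{\lambda+1} + y_{\lambda,t}\, T_{\lambda+1} - \left( m_{\lambda+1,t} \cdot \vec{1} \right) T_{\lambda+1} + b_{\lambda+1}^{(\mathrm{LN})}
$$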
Substituting in the expression for yλ from above:
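$$
\begin{aligned}
y_{\lambda+1,t} ={}& S_{\lambda+1}\left( S_\lambda(x)\, T_\lambda + x\, T_\lambda - \left( m_{\lambda,t} \cdot \vec{1} \right) T_\lambda + b_\lambda^{(\mathrm{LN})} \right) T_{\lambda+1} \\
&+ S_\lambda(x) \left( \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} \right) + x \left( \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} \right) - \left( m_{\lambda,t} \cdot \vec{1} \right) \left( \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} \right) \\
&+ b_\lambda^{(\mathrm{LN})}\, T_{\lambda+1} - \left( m_{\lambda+1,t} \cdot \vec{1} \right) T_{\lambda+1} + b_{\lambda+1}^{(\mathrm{LN})}
\end{aligned}
$$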
As we are interested in the combined effects
of a layer, we only consider the case where Sλ
is a MHA mechanism and Sλ+1 a FF. We start
by reformulating the output of a MHA. Recall
that attention heads can be seen as weighted sums
of value vectors (Kobayashi et al., 2020). Due
to the softmax normalization, attention weights
αt,1, . . . αt,n sum to 1 for any position t. Hence:
$$
\begin{aligned}
(A_h)_{t,\cdot} &= \sum_{t'=1}^{n} \alpha_{h,t,t'}\, (V_h)_{t',\cdot} \\
&= \sum_{t'=1}^{n} \left[ \alpha_{h,t,t'}\, x_{t'} W_h^{(V)} + \alpha_{h,t,t'}\, b_h^{(V)} \right] \\
&= \left( \sum_{t'}^{n} \alpha_{h,t,t'}\, x_{t'} W_h^{(V)} \right) + b_h^{(V)}
\end{aligned}
$$
10 In the case of BERT, we also need to include a LN
before the first layer, which is straightforward if we index it
as λ = 0.
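The fact that the value bias can be pulled out of the attention-weighted sum follows from the softmax weights summing to 1. This can be checked on random data, as in the sketch below; shapes and variable names are illustrative assumptions, and the attention scores are random rather than computed from queries and keys.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, d_head = 4, 8, 3
X = rng.normal(size=(n, d))            # inputs x_1 ... x_n, one per row
W_v = rng.normal(size=(d, d_head))     # value projection of one head
b_v = rng.normal(size=d_head)          # value bias of that head

# Random attention scores, normalized row-wise with a softmax.
scores = rng.normal(size=(n, n))
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

V = X @ W_v + b_v                      # value vectors, bias included
head_direct = alpha @ V                # weighted sum of value vectors
head_split = alpha @ (X @ W_v) + b_v   # unbiased weighted sum, plus the bias

assert np.allclose(alpha.sum(axis=1), 1.0)
assert np.allclose(head_direct, head_split)
```

The bias therefore contributes the same vector at every position, regardless of the attention pattern, which is why it can be moved to the input-independent term.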
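To account for all H heads in a MHA, we concatenate
these head-specific sums and pass them
through the output projection $W^{(\mathrm{MHA,O})}$. As such,
we can denote the unbiased output of the MHA
and the associated bias as:
$$
\begin{aligned}
\tilde{h}_{l,t} &= \sum_{h} \sum_{t'} \alpha_{l,h,t,t'}\, x_{l,t'}\, Z_{l,h} \\
b_l^{(\mathrm{MHA})} &= b_l^{(\mathrm{MHA,O})} + \left( \bigoplus_{h}^{H} b_{l,h}^{(V)} \right) W_l^{(\mathrm{MHA,O})}
\end{aligned}
$$
with $Z_{l,h}$ as introduced in (4). By substituting the
actual sublayer functions in our previous equation:
$$
\begin{aligned}
y_{l,t} ={}& \tilde{f}_{l,t}\, T_{\lambda+1} + b_l^{(\mathrm{FF,O})}\, T_{\lambda+1} \\
&+ \tilde{h}_{l,t} \left( \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} \right) + b_l^{(\mathrm{MHA})} \left( \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} \right) \\
&+ x \left( \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} \right) - \left( m_{\lambda,t} \cdot \vec{1} \right) \left( \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} \right) \\
&+ b_\lambda^{(\mathrm{LN})}\, T_{\lambda+1} - \left( m_{\lambda+1,t} \cdot \vec{1} \right) T_{\lambda+1} + b_{\lambda+1}^{(\mathrm{LN})}
\end{aligned}
$$
Here, given that there is only one FF for this
layer, the output of sublayer function at λ + 1
will correspond to the output of the FF for layer
l, i.e., $\tilde{f}_{l,t} + b_l^{(\mathrm{FF,O})}$, and similarly the output for
sublayer λ should be that of the MHA of layer l,
or $\tilde{h}_{l,t} + b_l^{(\mathrm{MHA})}$. To match Eq. (1), rewrite as:
$$
\begin{aligned}
y_{l,t} &= i_{\lambda+1,t} + h_{\lambda+1,t} + f_{\lambda+1,t} + c_{\lambda+1,t} \\
i_{\lambda+1,t} &= x_{\lambda,t} \left( \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} \right) \\
h_{\lambda+1,t} &= \tilde{h}_{l,t} \left( \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} \right) \\
f_{\lambda+1,t} &= \tilde{f}_{l,t}\, T_{\lambda+1} \\
c_{\lambda+1,t} &= b_l^{(\mathrm{FF,O})}\, T_{\lambda+1} + b_l^{(\mathrm{MHA})} \left( \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} \right) \\
&\quad - \left( m_{\lambda,t} \cdot \vec{1} \right) \left( \prod_{\lambda'=\lambda}^{\lambda+1} T_{\lambda'} \right) - \left( m_{\lambda+1,t} \cdot \vec{1} \right) T_{\lambda+1} \\
&\quad + b_\lambda^{(\mathrm{LN})}\, T_{\lambda+1} + b_{\lambda+1}^{(\mathrm{LN})}
\end{aligned}
$$
where $x_{\lambda,t}$ is the t-th input for sublayer λ; that is,
the above characterizes the output of sublayer
λ + 1 with respect to the input of sublayer λ.
Passing the output $y_{l,t}$ into the next layer l + 1
(i.e., through sublayers λ + 2 and λ + 3) then gives:
$$
\begin{aligned}
y_{l+1,t} &= i_{\lambda+3,t} + h_{\lambda+3,t} + f_{\lambda+3,t} + c_{\lambda+3,t} \\
i_{\lambda+3,t} &= i_{\lambda+1,t} \left( \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'} \right) \\
h_{\lambda+3,t} &= h_{\lambda+1,t} \left( \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'} \right) + \tilde{h}_{l+1,t} \left( \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'} \right) \\
f_{\lambda+3,t} &= f_{\lambda+1,t} \left( \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'} \right) + \tilde{f}_{l+1,t}\, T_{\lambda+3} \\
c_{\lambda+3,t} &= c_{\lambda+1,t} \left( \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'} \right) + b_{l+1}^{(\mathrm{MHA})} \left( \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'} \right) + b_{l+1}^{(\mathrm{FF,O})}\, T_{\lambda+3} \\
&\quad - \left( m_{\lambda+2,t} \cdot \vec{1} \right) \left( \prod_{\lambda'=\lambda+2}^{\lambda+3} T_{\lambda'} \right) - \left( m_{\lambda+3,t} \cdot \vec{1} \right) T_{\lambda+3} \\
&\quad + b_{\lambda+2}^{(\mathrm{LN})}\, T_{\lambda+3} + b_{\lambda+3}^{(\mathrm{LN})}
\end{aligned}
$$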
This logic carries on across layers: Adding a
layer corresponds to (i) mapping the existing terms
through the two new LNs, (ii) adding new terms
for the MHA and the FF, (iii) tallying up biases
introduced in the current layer. Hence, the above
generalizes to any number of layers k ≥ 1 as:11
$$
\begin{aligned}
y_{l+k,t} &= i_{\lambda+2k-1,t} + h_{\lambda+2k-1,t} + f_{\lambda+2k-1,t} + c_{\lambda+2k-1,t} \\
i_{\lambda+2k-1,t} &= x_{\lambda,t} \left( \prod_{\lambda'=\lambda}^{\lambda+2k-1} T_{\lambda'} \right) \\
h_{\lambda+2k-1,t} &= \sum_{l'=l}^{l+k} \tilde{h}_{l',t} \left( \prod_{\lambda'=2l'-1}^{2(l+k)} T_{\lambda'} \right) \\
f_{\lambda+2k-1,t} &= \sum_{l'=l}^{l+k} \tilde{f}_{l',t} \left( \prod_{\lambda'=2l'}^{2(l+k)} T_{\lambda'} \right) \\
c_{\lambda+2k-1,t} &= \sum_{\lambda'=\lambda}^{\lambda+2k-1} \left[ b_{\lambda'}^{(\mathrm{LN})} \left( \prod_{\lambda''=\lambda'+1}^{\lambda+2k-1} T_{\lambda''} \right) - \left( m_{\lambda',t} \cdot \vec{1} \right) \left( \prod_{\lambda''=\lambda'}^{\lambda+2k-1} T_{\lambda''} \right) \right] \\
&\quad + \sum_{l'=l}^{l+k} \left[ b_{l'}^{(\mathrm{MHA})} \left( \prod_{\lambda'=2l'-1}^{\lambda+2k-1} T_{\lambda'} \right) + b_{l'}^{(\mathrm{FF,O})} \left( \prod_{\lambda'=2l'}^{\lambda+2k-1} T_{\lambda'} \right) \right]
\end{aligned}
$$
11 The edge case $\prod_{\lambda'=\lambda+1}^{\lambda} T_{\lambda'}$ is taken to be the identity
matrix $I_d$, for notation simplicity.
Lastly, recall that by construction, we have:
$$
v \left( \prod_\lambda T_\lambda \right) = \frac{\bigodot_\lambda g_\lambda}{\prod_\lambda s_{\lambda,t}} \odot v
$$
By recurrence over all layers and providing the
initial input x0,t, we obtain Eqs. (1) to (5).
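The recurrence can be sanity-checked end to end on a small random model. The sketch below is an illustration only. It assumes a post-LN encoder without the embedding-level LN of footnote 10, ReLU instead of GELU in the feed-forward sublayers, and no epsilon in the LN scale. All parameter names are hypothetical, and the four terms are tracked incrementally rather than through the closed-form products.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_heads, d_ff, n_layers = 5, 8, 2, 16, 3
d_head = d // n_heads

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ln_stats(v):
    """Per-position mean m and standard deviation s used by layer norm."""
    return v.mean(axis=-1, keepdims=True), v.std(axis=-1, keepdims=True)

def random_layer():
    shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
              "bq": d, "bk": d, "bv": d, "bo": d,
              "W_in": (d, d_ff), "b_in": d_ff, "W_out": (d_ff, d), "b_out": d,
              "g1": d, "b1": d, "g2": d, "b2": d}
    return {k: rng.normal(scale=0.5, size=shape) for k, shape in shapes.items()}

def attention(x, p, h):
    """Attention weights of head h and the slice of dimensions it uses."""
    sl = slice(h * d_head, (h + 1) * d_head)
    q, k = x @ p["Wq"][:, sl] + p["bq"][sl], x @ p["Wk"][:, sl] + p["bk"][sl]
    return softmax(q @ k.T / np.sqrt(d_head)), sl

def forward(x, layers):
    """Ordinary post-LN forward pass, used as the reference."""
    for p in layers:
        heads = []
        for h in range(n_heads):
            alpha, sl = attention(x, p, h)
            heads.append(alpha @ (x @ p["Wv"][:, sl] + p["bv"][sl]))
        pre = np.concatenate(heads, axis=-1) @ p["Wo"] + p["bo"] + x
        m, s = ln_stats(pre)
        x = p["g1"] * (pre - m) / s + p["b1"]
        pre = np.maximum(x @ p["W_in"] + p["b_in"], 0.0) @ p["W_out"] + p["b_out"] + x
        m, s = ln_stats(pre)
        x = p["g2"] * (pre - m) / s + p["b2"]
    return x

def mha_parts(x, p):
    """Unbiased MHA output and its input-independent bias."""
    out = np.zeros_like(x)
    for h in range(n_heads):
        alpha, sl = attention(x, p, h)
        out += alpha @ x @ p["Wv"][:, sl] @ p["Wo"][sl, :]
    return out, p["bo"] + p["bv"] @ p["Wo"]

def ff_parts(x, p):
    """Unbiased FF output and its output bias."""
    return np.maximum(x @ p["W_in"] + p["b_in"], 0.0) @ p["W_out"], p["b_out"]

def decompose(x0, layers):
    i, h, f, c = x0.copy(), np.zeros_like(x0), np.zeros_like(x0), np.zeros_like(x0)
    for p in layers:
        for parts, gain, ln_bias in ((mha_parts, p["g1"], p["b1"]),
                                     (ff_parts, p["g2"], p["b2"])):
            x = i + h + f + c
            out, bias = parts(x, p)
            m, s = ln_stats(out + bias + x)
            if parts is mha_parts:
                h = h + out          # new attention contribution
            else:
                f = f + out          # new feed-forward contribution
            c = c + bias - m         # sublayer bias and mean removal
            # Map every term through the new LN; its bias joins c.
            i, h, f, c = gain * i / s, gain * h / s, gain * f / s, gain * c / s + ln_bias
    return i, h, f, c

layers = [random_layer() for _ in range(n_layers)]
x0 = rng.normal(size=(n, d))
i_t, h_t, f_t, c_t = decompose(x0, layers)
assert np.allclose(i_t + h_t + f_t + c_t, forward(x0, layers))
```

Updating the four terms sublayer by sublayer mirrors the three steps listed above: existing terms are rescaled by the new LN, the new MHA or FF output is added to its own term, and biases and means are tallied in the correction term.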
B Hyperplane Bounds of ct
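We can re-write Eq. (5) to highlight that ct is composed
only of scalar multiplications applied to
constant vectors. Let:
$$
\begin{aligned}
b_\lambda^{(S)} &=
\begin{cases}
b_l^{(\mathrm{MHA,O})} + \left( \bigoplus_{h}^{H} b_{l,h}^{(V)} \right) W_l^{(\mathrm{MHA,O})} & \text{if } \lambda = 2l - 1 \\
b_l^{(\mathrm{FF,O})} & \text{if } \lambda = 2l
\end{cases} \\
p_\lambda &= \left( \bigodot_{\lambda'=\lambda+1}^{\Lambda} g_{\lambda'} \right) \odot \left( b_\lambda^{(\mathrm{LN})} + b_\lambda^{(S)} \right) \\
q_\lambda &= \bigodot_{\lambda'=\lambda+1}^{\Lambda} g_{\lambda'}
\end{aligned}
$$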
Using the above, Eq. (5) is equivalent to:
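$$
c_t = \sum_{\lambda}^{\Lambda} \left( \frac{1}{\prod_{\lambda'=\lambda+1}^{\Lambda} s_{\lambda',t}} \cdot p_\lambda \right) + \sum_{\lambda}^{\Lambda} \left( \frac{-m_{\lambda,t}}{\prod_{\lambda'=\lambda+1}^{\Lambda} s_{\lambda',t}} \cdot q_\lambda \right)
$$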
Note that pλ and qλ are constant across all inputs.
Assuming their linear independence puts an upper
bound of 2Λ vectors necessary to express ct.
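The bound can be illustrated numerically. In the sketch below, the constant vectors and the input-dependent scalars are random stand-ins (in the text the scalars are built from the LN means and scales), so this only demonstrates the linear-algebra argument and not the behavior of a trained model.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_sublayers, n_inputs = 32, 4, 200

P = rng.normal(size=(n_sublayers, d))   # stand-ins for the constant vectors p
Q = rng.normal(size=(n_sublayers, d))   # stand-ins for the constant vectors q

# Input-dependent scalar coefficients, drawn at random in this sketch.
a = rng.normal(size=(n_inputs, n_sublayers))
b = rng.normal(size=(n_inputs, n_sublayers))

C = a @ P + b @ Q                       # one simulated c_t per row
rank = np.linalg.matrix_rank(C)
assert rank <= 2 * n_sublayers
```

However many inputs are sampled, the simulated vectors never span more than twice the number of sublayers, even though the ambient dimension is larger.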
C Computational Details
In Section 2.3, we use the default hyperparameters
of scikit-learn (Pedregosa et al.,
2011). In Section 4, we learn categorical regressions
using an AdamW optimizer (Loshchilov and
Hutter, 2019) and iterate 20 times over the train
set; hyperparameters (learning rate, weight decay,
dropout, and the β1 and β2 AdamW hyperparameters)
are set using Bayes Optimization (Snoek
et al., 2012), with 50 hyperparameter samples and
accuracy as objective. In Section 5, learning rate,
dropout, weight decay, β1 and β2, and learning rate
scheduling are selected with Bayes Optimization,
using 100 samples and accuracy as objective. In
Section 6, we learn shallow logistic regressions,
setting hyperparameters with Bayes Optimization,
using 100 samples and macro-f1 as the objective.
Experiments were run on a 4GB NVIDIA GPU.
D Ethical Considerations
The offset method of Mikolov et al. (2013) is
known to also model social stereotypes (Bolukbasi
et al., 2016, among others). Some of the
sub-representations of our decomposition may
exhibit stronger biases than the whole embedding
et, and can yield higher performances than
focusing on the whole embedding (e.g., Table 3).
This could provide an undesirable incentive to
deploy NLP models with higher performances
and stronger systemic biases.
References
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E.
Hinton. 2016. Layer normalization.
Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In 3rd
International Conference on Learning Representations,
ICLR 2015; Conference date:
07-05-2015 Through 09-05-2015.
Steven Bird, Ewan Klein, and Edward Loper.
2009. Natural Language Processing with
Python: Analyzing Text with the Natural
Language Toolkit. O’Reilly Media, Inc.
Tolga Bolukbasi, Kai-Wei Chang, James Y.
Zou, Venkatesh Saligrama, and Adam T.
Kalai. 2016. Man is to computer programmer
as woman is to homemaker? Debiasing word
embeddings. In Advances in Neural Informa-
tion Processing Systems, volume 29. Curran
Associates, Inc.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Jiaju Du, Fanchao Qi, and Maosong Sun. 2019. Using BERT for word sense disambiguation. CoRR, abs/1909.08358.
Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1006
Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space.
Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.446
Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELU). arXiv preprint arXiv:1606.08415.
John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.
Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, and Douglas Eck. 2018. An improved relative self-attention mechanism for transformer with application to music generation. CoRR, abs/1809.04281.
Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, Minneapolis, Minnesota. Association for Computational Linguistics.
Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1356
Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2020. Attention is not only a weight: Analyzing transformers with vector norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7057–7075, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.574
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86. Phuket, Thailand.
Shari Landes, Claudia Leacock, and Randee I. Tengi. 1998. Building semantic concordances. WordNet: An Electronic Lexical Database, chapter 8, pages 199–216. Bradford Books.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing,
pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1166
Timothee Mickus, Denis Paperno, Mathieu Constant, and Kees van Deemter. 2020. What do you mean, BERT? In Proceedings of the Society for Computation in Linguistics 2020, pages 279–290, New York, New York. Association for Computational Linguistics.
Tomás Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia. Association for Computational Linguistics.
Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429. https://doi.org/10.1111/j.1551-6709.2010.01106.x
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020. Adapterhub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020): Systems Demonstrations, pages 46–54, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-demos.7
Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, and Zachary C. Lipton. 2020. Learning to deceive with attention-based explanations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4782–4793, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.432
Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5431
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-2124
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866. https://doi.org/10.1162/tacl_a_00349
Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1282
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. CoRR, abs/1803.02155.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc.
Benjamin Strauss, Bethany Toma, Alan Ritter, Marie-Catherine de Marneffe, and Wei Xu. 2016. Results of the WNUT16 named entity recognition shared task. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), pages 138–144, Osaka, Japan. The COLING 2016 Organizing Committee.
William Timkey and Marten van Schijndel. 2021. All bark and no bite: Rogue dimensions in transformer language models obscure representational quality. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4527–4546, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.372
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147. https://doi.org/10.3115/1119176.1119195
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Raúl Vázquez, Hande Celikkanat, Mathias Creutz, and Jörg Tiedemann. 2021. On the differences between BERT and MT encoder spaces and how to address them in translation tasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 337–347, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-srw.35
Elena Voita, Rico Sennrich, and Ivan Titov. 2019. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4396–4406, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1448
Ivan Vulić, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, and Anna Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7222–7240, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.586
Gregor Wiedemann, Steffen Remus, Avi Chawla, and Chris Biemann. 2019. Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings. ArXiv, abs/1909.10430.
Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1002
Zeyu Yun, Yubei Chen, Bruno Olshausen, and Yann LeCun. 2021. Transformer visualization via dictionary learning: Contextualized embedding as a linear superposition of transformer factors. In Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 1–10, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.deelio-1.1
Sumu Zhao, Damián Pascual, Gino Brunner, and Roger Wattenhofer. 2021. Of non-linearity and commutativity in bert. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8. https://doi.org/10.1109/IJCNN52387.2021.9533563