A Systematic Study of Inner-Attention-Based
Sentence Representations in Multilingual
Neural Machine Translation

Raúl Vázquez
University of Helsinki
Department of Digital Humanities
raul.vazquez@helsinki.fi

Alessandro Raganato
University of Helsinki
Department of Digital Humanities
alessandro.raganato@helsinki.fi

Mathias Creutz
University of Helsinki
Department of Digital Humanities
mathias.creutz@helsinki.fi

Jörg Tiedemann
University of Helsinki
Department of Digital Humanities
jorg.tiedemann@helsinki.fi

Neural machine translation has considerably improved the quality of automatic translations
by learning good representations of input sentences. In this paper, we explore a multilingual
translation model capable of producing fixed-size sentence representations by incorporating an
intermediate crosslingual shared layer, which we refer to as attention bridge. This layer exploits
the semantics from each language and develops into a language-agnostic meaning representation
that can be efficiently used for transfer learning. We systematically study the impact of the size of
the attention bridge and the effect of including additional languages in the model. In contrast to
related previous work, we demonstrate that there is no conflict between translation performance
and the use of sentence representations in downstream tasks. In particular, we show that larger
intermediate layers not only improve translation quality, especially for long sentences, but also
push the accuracy of trainable classification tasks. Nevertheless, shorter representations lead to
increased compression that is beneficial in non-trainable similarity tasks. Similarly, we show
that trainable downstream tasks benefit from multilingual models, whereas additional language
signals do not improve performance in non-trainable benchmarks. This is an important insight

Submission received: 21 March 2019; revised version received: 12 November 2019; accepted for publication:
29 January 2020.

https://doi.org/10.1162/COLI_a_00377

© 2020 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license


that helps to properly design models for specific applications. Finally, we also include an in-depth
analysis of the proposed attention bridge and its ability to encode linguistic properties.
We carefully analyze the information that is captured by individual attention heads and identify
interesting patterns that explain the performance of specific settings in linguistic probing tasks.

1. Introduction

Neural machine translation (NMT) has rapidly become the new machine translation
(MT) standard, significantly improving over the traditional statistical machine trans-
lation model (Bojar et al. 2018). In only about four years, several architectures and
approaches have been proposed, with increasing research efforts toward multilingual
machine translation (Firat et al. 2016; Lakew, Cettolo, and Federico 2018; Wang et al.
2018). Inasmuch as MT is described as the task of translating a sentence from one
language to another, at the recent conferences on MT (WMT18 and WMT19)1 much
interest was put on multilingualism, where a sub-track on multilingual systems was
introduced with the aim of exploiting a third language to improve a bilingual model.

Multilingual neural machine translation comes in many flavors with different archi-
tectures and ways of sharing parameters (Luong et al. 2016; Zoph and Knight 2016; Lee,
Cho, and Hofmann 2017; Dong et al. 2015; Firat, Cho, and Bengio 2016; Lu et al. 2018;
Blackwood, Ballesteros, and Ward 2018). The main motivation of multilingual models
is the effect of transfer learning that enables machine translation systems to benefit
from relationships between languages and training signals that come from different
data sets. Common techniques explore multisource encoders, multitarget decoders, or
combinations of both. Multilingual models can push the translation performance of
low-resource language pairs but also enable the translation between unseen language
pairs, so-called zero-shot translation (Ha, Niehues, and Waibel 2016; Johnson et al. 2017;
Gu et al. 2018a).

The effective computation of sentence representations using the translation task
as an auxiliary semantic signal has also drawn interest to MT models (Hill, Cho, and
Korhonen 2016; McCann et al. 2017; Schwenk and Douze 2017; Subramanian et al.
2018). Indeed, recent work makes use of machine translation models to capture syntactic
and semantic properties of the input sentences, later to be used for learning general-
purpose sentence representations (Shi, Padhi, and Knight 2016; Belinkov et al. 2017;
Dalvi et al. 2017; Poliak et al. 2018; Bau et al. 2019). An important feature that enables
an immediate use of the MT-based representations in other downstream tasks is the
effective reduction to a fixed-sized vector; it enables functionality, at the expense of
hampering the performance in the MT task (Britz, Guan, and Luong 2017; Cífka and
Bojar 2018). However, it is not fully clear how the properties of the fixed-sized vector
influence the tradeoff between the performance of the model in MT and the information
it encodes as a meaning representation vector. Recent studies either focus on the usage
of such MT-based vector representations in other tasks (Schwenk 2018), on translation
quality (Lu et al. 2018), on speed comparison (Britz, Guan, and Luong 2017), or only
explore a bilingual scenario (Cífka and Bojar 2018).

For this study, we focus on exploring a crosslingual intermediate shared layer
in an MT model. We apply an architecture based on shared inner-attention with

1 http://www.statmt.org/wmt18/translation-task.html.
http://www.statmt.org/wmt19/translation-task.html.


language-specific encoders and decoders that can easily scale to a large number of
languages (more details about the architecture in Section 2). Simultaneously, it addresses
the task of obtaining language-agnostic sentence embeddings (Lin et al. 2017; Cífka and
Bojar 2018; Lu et al. 2018) that can be straightforwardly applied to downstream tasks. In
Sections 4 and 5, we examine this model with a systematic evaluation on different sizes
of the shared layer and extensive experiments to study the abstractions it learns from
multiple translation tasks.

In contrast to previous work (Cífka and Bojar 2018), we demonstrate that there
is a direct relation between the translation performance and the scores attained on
trainable downstream tasks when adjusting the size of the intermediate layer. The
trend is different for non-trainable tasks that benefit from the increased compression
that denser representations achieve, which typically hurts the translation performance
because of the decreased capacity of the model. We also show that multilingual models
improve trainable downstream tasks, even further demonstrating the additional ab-
straction that is pushed into the representations through additional translation tasks
involved in training. This even holds in low-resource scenarios as we show empirically
in Section 4.4. Moreover, we find that multilingual training leads to a better encoding of
linguistic properties of the sentence, and that a larger size of the shared inner-attention
layer leads to a better syntactic rather than semantic understanding of the sentence (see
Section 5). Furthermore, we include an in-depth analysis of the attention bridge and its
ability to encode linguistic properties, investigating systematically each component
of the shared inner-attention layer.

In the following, we will first introduce the architecture that we apply in our exper-
iments. Thereafter, we will discuss translation quality before diving into the detailed
analyses of sentence representations and their applications, which will be the main
focus of this article.

2. Model Architecture

The model we use follows the standard set-up of an encoder–decoder model of machine
translation with a traditional attention mechanism (Bahdanau, Cho, and Bengio 2015;
Luong et al. 2016). However, to enable multilingual training we augment the network
with language-specific encoders and decoders trainable with a language-rotating sched-
uler (Dong et al. 2015; Schwenk and Douze 2017). We also incorporate an intermediate
inner-attention layer, which summarizes the encoder information in a fixed-size vector
representation, to serve as a language-agnostic layer (Cífka and Bojar 2018; Lu et al.
2018). Because of the attentive connection between encoders and decoders we call this
layer attention bridge, and its architecture is a multilingual adaptation from the model
proposed by Cífka and Bojar (2018). The overall architecture is illustrated in Figure 1.


2.1 Background: Attention Mechanism

Given an input X = (x_1, . . . , x_n), a sequence of tokens embedded into the vector space
R^{d_x}, our goal is to generate a translation Y = (y_1, . . . , y_m). For the sake of clarity, we as-
sume a recurrent encoder in the following even though the mechanism is not restricted
to this particular type of encoder. A recurrent neural network (RNN)-based encoder
reads each element in X to generate a context vector c. Generally, for each token the


Figure 1
Architecture of the proposed multilingual NMT system. Left: The attention bridge connects the
language-specific encoders and decoders. Center: Input x1 . . . xn is translated into the decoder
states s1 . . . st via the encoder states h1 . . . hn and the attention bridge M = m1 . . . mk. Right:
Computation of the fixed-size attentive matrix A.

RNN generates a hidden state h_t ∈ R^{d_h}, where the last hidden state of the RNN often
defines c:

    h_t = f(x_t, h_{t−1})                                          (1)
    c = h_n                                                        (2)

and f : R^{d_x} × R^{d_h} → R^{d_h} is a non-linear activation function. We use bidirectional long
short-term memory (LSTM) units (Graves and Schmidhuber 2005) as f in this article.
Then, the decoder network sequentially computes (y_1, . . . , y_m) by optimizing

    p(Y|X) = ∏_{t=1}^{m} p(y_t | c, Y_{t−1})                       (3)

where Y_{t−1} = (y_1, . . . , y_{t−1}). Each distribution p_t = p(y_t | c, Y_{t−1}) ∈ R^{d_v} is usually com-
puted with a softmax function over all the words in the vocabulary, taking into account
the current hidden state of the decoder s_t:

    p_t = softmax(y_{t−1}, s_t)                                    (4)
    s_t = ϕ(c, y_{t−1}, s_{t−1})                                   (5)

where ϕ is another non-linear activation function and d_v is the size of the vocabulary.
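To make the notation concrete, the following minimal PyTorch sketch mirrors Equations (1)–(5). It is an illustration under our own assumptions about layer sizes and the form of ϕ, not the authors' implementation (which is an OpenNMT-py fork):

```python
import torch
import torch.nn as nn

d_x, d_h, d_v = 512, 512, 32000   # embedding, hidden, and vocabulary sizes (illustrative)

# Equation (1): h_t = f(x_t, h_{t-1}); here f is a 2-layer bidirectional LSTM,
# so each direction uses d_h // 2 units and the two outputs are concatenated.
encoder = nn.LSTM(d_x, d_h // 2, num_layers=2, bidirectional=True, batch_first=True)

X = torch.randn(1, 7, d_x)         # one embedded source sentence with n = 7 tokens
H, _ = encoder(X)                  # H holds h_1 ... h_n, shape (1, n, d_h)
c = H[:, -1]                       # Equation (2): c = h_n (the attention-free context)

# Equations (3)-(5): the decoder factorizes p(Y|X) into per-step softmaxes.
decoder_cell = nn.LSTMCell(d_x + d_h, d_h)   # phi combines c, y_{t-1}, and s_{t-1}
proj = nn.Linear(d_h, d_v)                   # maps the decoder state to vocabulary logits

y_prev = torch.zeros(1, d_x)                              # embedding of y_{t-1}
s, mem = torch.zeros(1, d_h), torch.zeros(1, d_h)         # decoder state s_{t-1}
s, mem = decoder_cell(torch.cat([y_prev, c], dim=-1), (s, mem))   # Equation (5)
p_t = torch.softmax(proj(s), dim=-1)                      # Equation (4): distribution over d_v words
```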

Including an attention mechanism in the decoder implies that a different context
vector c_t will be computed at each step t, instead of fixing c as in Equation (2) for gen-
erating all output words. This alignment method allows the decoder to assign different
weights to each part of the input at every decoding step by defining c_t as the weighted
sum of the hidden states of the encoder, c_t = Σ_{i=1}^{n} α_{t,i} h_i, where α_{t,i} indicates how much the
i-th input word contributes to generating the t-th output word, and is usually defined as

    α_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{n} exp(e_{t,k})              (6)
    e_{t,i} = g(s_t, h_i)                                          (7)

where g is a feedforward neural network.
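As a minimal sketch of Equations (6) and (7), assuming one common (hypothetical) form for the scoring network g, the attention weights and the context vector can be computed as follows:

```python
import torch
import torch.nn as nn

d_h, n = 512, 7
H = torch.randn(n, d_h)        # encoder states h_1 ... h_n
s_t = torch.randn(1, d_h)      # current decoder state s_t

# Equation (7): e_{t,i} = g(s_t, h_i); we assume g is a small feedforward
# network over the concatenation [s_t; h_i] (one common choice).
g = nn.Sequential(nn.Linear(2 * d_h, d_h), nn.Tanh(), nn.Linear(d_h, 1))
e_t = g(torch.cat([s_t.expand(n, -1), H], dim=-1)).squeeze(-1)   # shape (n,)

# Equation (6): normalize the scores into attention weights alpha_{t,i}
alpha_t = torch.softmax(e_t, dim=-1)

# Context vector c_t as the weighted sum of the encoder states
c_t = alpha_t @ H              # shape (d_h,)
```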

2.2 Inner-Attention as Semantic Bridge

To enable multilingual training and the possibility to obtain a fixed-size sentence rep-
resentation from the model, we propose to extend the attention-based network (Sec-
tion 2.1) with the following modifications:

1.  the incorporation of the attention bridge: an inner-attention layer shared
    among all language pairs, which serves as a neural “interlingua”;

2.  the use of language-specific encoders and decoders for each language pair,
    trainable with a language-rotating scheduler; and

3.  the introduction of a penalty term in the loss function to avoid redundancy
    in the shared inner-attention.

(1) Attention bridge: Each encoder takes as input a sequence of tokens (x_1, . . . , x_n) and
produces n hidden states H = (h_1, . . . , h_n) with h_i ∈ R^{d_h}, in our case using a bidirectional
LSTM (Graves and Schmidhuber 2005).2 Next, we encode this variable-length sentence-
embedding matrix H into a fixed-size M ∈ R^{d_h×k} capable of focusing on k different
components of the sentence (Lin et al. 2017; Chen, Ling, and Zhu 2018; Cífka and Bojar
2018), using self-attention as follows:

    A = softmax(W_2 ReLU(W_1 H^T))                                 (8)
    M = A H                                                        (9)

where W_1 ∈ R^{d_w×d_h} and W_2 ∈ R^{k×d_w} are weight matrices, with d_w a hyperparameter set
arbitrarily, and k the number of attention heads in the attention bridge. Note that each
column of M, m_i, is a component focusing on a portion of the sentence, so all of them
together should reflect the overall semantics of the sentence.

Each decoder follows a common attention mechanism in NMT (Luong, Pham, and
Manning 2015), with an initial state computed by mean pooling over M, and using M
instead of the hidden states of the encoder for computing the context vector. Formally,
we only need to compute Equations (6) and (7) using the columns of M instead of the
encoder states h_i.
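A minimal sketch of the inner-attention computation in Equations (8) and (9) and of the mean-pooled decoder initialization (the dimensions follow the experiments reported below; the variable names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_h, d_w, k, n = 512, 1024, 10, 7
H = torch.randn(n, d_h)                  # encoder states of one sentence (n tokens)

W1 = nn.Linear(d_h, d_w, bias=False)     # W_1 in R^{d_w x d_h}
W2 = nn.Linear(d_w, k, bias=False)       # W_2 in R^{k x d_w}

# Equation (8): A = softmax(W_2 ReLU(W_1 H^T)); the softmax runs over the n token
# positions, so each of the k attention heads spreads its weight over the sentence.
A = F.softmax(W2(torch.relu(W1(H))).t(), dim=-1)    # shape (k, n)

# Equation (9): M = A H, a fixed-size representation with k head vectors of size d_h
M = A @ H                                           # shape (k, d_h)

# The decoder is initialized by mean pooling over M and then attends over the k
# vectors m_i instead of the n encoder states (Equations (6)-(7) with m_i for h_i).
init_state = M.mean(dim=0)                          # shape (d_h,)
```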
(2) Language-specific encoders and decoders: To deal with additional language pairs, we
incorporate an encoder for each input language and an attentive decoder for each

2 Note that the attention bridge is independent of the underlying encoder and decoder (Lu et al. 2018).
Although we use a BiLSTM, it could be replaced with a GRU (Cho et al. 2014), a transformer type
network (Vaswani et al. 2017), or with a CNN (Gehring et al. 2017).


output language to be connected via the attention bridge. This adjusts the parameters
of the bridge layer with multilingual information.

Figure 1 shows a basic diagram on the left-hand side to illustrate the use of sev-
eral encoders and decoders that are plugged in and out at every change of batch. A
avoid over-fitting the attention bridge layer toward one specific language-pair, we cycle
through the available target and source languages at each batch uniformly as in Lu et al.
(2018).
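The language-rotating scheduling can be sketched as follows; this is our own illustration of the idea, not the authors' exact training loop, and the encoder/decoder entries are placeholders for the language-specific modules:

```python
import random

languages = ["en", "de", "fr", "cs"]
encoders = {lang: f"encoder_{lang}" for lang in languages}   # stand-ins for the modules
decoders = {lang: f"decoder_{lang}" for lang in languages}

# All ordered pairs; A -> A "monolingual" directions can be added as in Section 3.1.2.
directions = [(s, t) for s in languages for t in languages if s != t]

def direction_scheduler(directions):
    """Yield translation directions so that every pair is visited equally often,
    keeping the shared attention bridge from overfitting to one language pair."""
    while True:
        random.shuffle(directions)
        for src, tgt in directions:
            yield src, tgt

schedule = direction_scheduler(list(directions))
for step in range(5):
    src, tgt = next(schedule)
    # plug in the matching encoder/decoder and train one batch in this direction
    print(step, src, tgt, encoders[src], decoders[tgt])
```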
(3) Penalty term: The attention bridge matrix M from Equation (9) could potentially suf-
fer from redundancy problems by learning repetitive information for different attention
heads. To address this issue, we add a penalty term to the loss function, proven effective
in related work (Lin et al. 2017; Chen, Ling, and Zhu 2018; Tao et al. 2018):

    L = −log p(Y|X) + ‖ A A^T − I ‖_F^2                            (10)

where A is as in Equation (8) and I is the identity matrix. Note that this term forces each
vector to focus on different aspects of the sentence by making the columns of A
approximately orthogonal in the Frobenius norm.
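The penalty can be computed directly from the matrix A of Equation (8); a minimal sketch:

```python
import torch

def bridge_penalty(A: torch.Tensor) -> torch.Tensor:
    """Equation (10) regularizer: || A A^T - I ||_F^2. Pushing A A^T toward the
    identity encourages the k attention heads to attend to different parts of
    the sentence instead of picking up redundant information."""
    eye = torch.eye(A.size(0), device=A.device)
    return ((A @ A.t() - eye) ** 2).sum()

# Usage: add the penalty to the negative log-likelihood of the translation, i.e.,
# loss = nll_loss + bridge_penalty(A)
A = torch.softmax(torch.randn(10, 7), dim=-1)   # toy attention matrix (k=10 heads, n=7 tokens)
print(bridge_penalty(A).item())
```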

The advantage of the fixed-size representation is its straightforward application in
downstream tasks. However, selecting a reasonable size of the attention bridge in terms
of attention heads is crucial for the performance both in a bilingual and a multilingual
scenario, as we will see in our experiments in Sections 3.2 and 4.

3. Translation Quality

Before applying and analyzing sentence representations that can be learned with the
proposed architecture from the previous section, we ought to verify that the model is
indeed capable of learning multilingual translation—the original training objective. For
this, we apply the model in two scenarios: a low-resource scenario with a multilingual
image caption translation task (Elliott et al. 2016) and the application to considerably
larger data sets based on experiments with Europarl (Koehn 2005) and news translation
tasks (Callison-Burch et al. 2007). In the following we will first discuss multilingual
transfer learning in the low-resource scenario before we analyze the effect of the atten-
tion bridge size on translation quality in the large-data setting.

3.1 Multilingual Translation of Image Captions

Multi30K (Elliott et al. 2016) is a parallel data set containing 29k image captions for
training and 1k sentences for validation in four European languages: Czech (cs), Ger-
man (de), French (fr), and English (en). We test the trained model with the flickr 2016
test data of the same data set and obtain BLEU scores using the sacreBLEU script3 (Post
2018). The preprocessing pipeline consists of lowercasing, normalizing, and tokenizing
using the scripts provided in the Moses decoder (Koehn et al. 2007), together with
learning and applying a 10k operations byte-pair-encoding (BPE) model per language
(Sennrich, Haddow, and Birch 2016). Each encoder consists of two stacked BiLSTMs
of size dh = 512 (i.e., the hidden states per direction are of size 256). Each decoder is
composed of two stacked unidirectional LSTMs with hidden states of size 512. For the

3 With signature BLEU+case.lc+numrefs.1+smooth.exp+tok.13a+version.1.2.11.


Table 1
BLEU scores obtained in experiments on the Multi30k data set. Left: Bilingual models, our
baselines. Center: Models trained on {De,Fr,Cs}↔En, with zero-shot translations in italics. Right:
Many-to-many model. Both zero-shot and M ↔ M translations improve significantly when
including monolingual data. (Best results shown in bold font).

BILINGUAL
src/tgt      EN      DE      CS      FR
EN            –     36.78   28.00   55.96
DE          39.00     –     23.44   38.22
CS          35.89   28.98     –     36.44
FR          49.54   32.92   25.98     –

{DE,FR,CS} ↔ EN
src/tgt      EN      DE      CS      FR
EN            –     37.85   29.51   57.87
DE          39.39     –      0.35    0.83
CS          37.20    0.65     –      1.02
FR          48.49    0.60    0.30     –

M ↔ M
src/tgt      EN      DE      CS      FR
EN            –     37.70   29.67   55.78
DE          40.68     –     26.78   41.07
CS          38.42   31.07     –     40.27
FR          49.92   34.63   26.92     –

BILINGUAL + ATT BRIDGE
src/tgt      EN      DE      CS      FR
EN            –     35.85   27.10   53.03
DE          38.19     –     23.97   37.40
CS          36.41   27.28     –     36.41
FR          48.93   31.70   25.96     –

{DE,FR,CS} ↔ EN + MONOLINGUAL
src/tgt      EN      DE      CS      FR
EN            –     38.92   30.27   57.87
DE          40.17     –     19.50   26.46
CS          37.30   22.13     –     22.80
FR          50.41   25.96   20.09     –

M ↔ M + MONOLINGUAL
src/tgt      EN      DE      CS      FR
EN            –     38.48   30.47   57.35
DE          41.82     –     26.90   41.49
CS          39.58   31.51     –     40.87
FR          50.94   35.25   28.80     –

model input and output, the word embeddings have dimension dx = dy = 512. We use
an attention bridge layer with k = 10 attention heads and dw = 1,024 for the dimensions of
W1 and W2 in Equation (8).

We use a stochastic gradient descent optimizer with a learning rate of 1.0 and batch
size 64, and for each experiment, we select the best model on the development set. We
implement our model on top of an OpenNMT-py (Klein et al. 2017) fork, which we make
available for reproducibility purposes.4

3.1.1 Baselines. The first experiment we conduct is to corroborate that the proposed
architecture works correctly, and we assess performance in a bilingual setting. We expect
the models to drop slightly in performance when the fixed-size attention bridge is
introduced, because there are no direct crosslingual attention links between the source
and target languages. However, we want to see whether the architecture is robust
enough to carry over the essential information needed for translation with the inclusion
of the additional intermediate abstraction layer.

In Table 1 we present a comparison of our architecture in contrast with a strong
bilingual baseline consisting of an architecture with the same specifications, without
the components of our model. The table presents the scores obtained for each of the
12 bilingual models trained on each language pair. In this case, we note that the basic
bilingual models without any attention bridge have a slightly better performance in
most cases. The most significant drop occurs when translating English to French, with a
difference of over 2 BLEU points, but this case is exceptional. Typically the BLEU score
decreases by less than 1 point.

This behavior is expected because the information from the encoder has to be
summarized in the 10 heads of the inner-attention layer without (multilingual) infor-
mation from other encoders to boost the states of this bridge. Nevertheless, these tests
justify the validity of the architecture; namely, that the attention bridge does not cause

4 https://github.com/Helsinki-NLP/OpenNMT-py/tree/neural-interlingua.


a significant problem for the translation model in the bilingual case. We will use the
results of bilingual models both with and without attention bridge as our baselines for
the comparison to the multilingual models that we describe subsequently.

3.1.2 Many-To-One and One-To-Many Models. The expected power of the attention bridge
comes from its ability to share information across various language pairs. We now
look at the effect of including additional languages during training on the translation
performance of individual language pairs. We start by training models that include
many-to-one and one-to-many settings with English as target and source, respectively.
This set-up makes it possible to study the ability of zero-shot translation, that is, the
translation between languages that have not been seen together in the training data. By
performing zero-shot translation, we can test the abstraction potential of the attention
bridge and its effectiveness in encoding multilingual information.

For the first experiment, we use the many-to-one and one-to-many strategy to
train a {De,Fr,Cs}↔En model. As depicted in Table 1, this attempt already results in
substantial improvements for the language pairs seen during training.

The model exceeds both bilingual baselines from the previous section. However,
this model is entirely incapable of performing zero-shot translations. We believe that
this inability of the model to generalize to unseen language pairs arises from the fact
that every non-English encoder (or decoder) only learns to process information that is
to be decoded into English (or encoded from English input). This finding is consistent
with Lu et al. (2018); so, to address this problem, we incorporate monolingual data
during training, that is, for each available language A, we train A → A with identical
copies of the input sentence as the target. Hence, we do not include any additional
data, but we reincorporate examples from the same parallel training corpus used in all
other experiments. As a consequence, we see a remarkable increase in the BLEU scores,
including a substantial boost for the language pairs not seen during training. In short,
the monolingual data informs the model that other languages can be produced besides
English, and that English is not the unique source language.

Additionally, there is a positive effect on the seen language pairs, the cause of
which is not immediately evident. One possibility may be that the shared layer acquires
additional information that can be included in the abstraction process yet not available
to the other models.

3.1.3 Many-to-Many Models. To further examine the capabilities of the proposed archi-
tecture we conduct two experiments under a many-to-many scenario.

First, we test the architecture in a many-to-many setting with all language pairs
included. Table 1 summarizes the results of our experiments. As in the previous case,
we compare settings that include monolingual data with their counterparts that do not
include it.

On a first note, the inclusion of language pairs results in an improved performance
when compared to the bilingual baselines, as well as the many-to-one and one-to-many
cases. The only exception is the En→Fr task. Moreover, the addition of monolingual
data during training leads to even higher scores, producing the overall best model.
The improvements in BLEU range from 1.40 to 4.43 compared to the standard bilingual
model.

Next, we perform a systematic evaluation on zero-shot translation. For this, we
train six different models where we include all but one of the available language pairs
(e.g., En↔De). Then, we test our models while also performing bidirectional zero-shot
translations for the unseen language pairs. Figure 2 summarizes the results.


Figure 2
For every language pair, we compare the BLEU scores between models trained and tested on the
Multi30k data set: (1) our best model (M ↔ M plus monolingual data), (2) the bilingual model of
that language pair, (3) the zero-shot translation of the many-to-many model trained without that
specific language pair, (4) the Johnson et al. (2017) model using language labels (LL) trained in a
many-to-many scenario, and (5) the zero-shot of LL without that specific language pair.

We observe that these zero-shot translation scores are generally better than the
ones from the previous {De,Fr,Cs}↔En model with monolingual data (Table 1). We also
note that the zero-shot models perform relatively well in comparison with the MANY-
TO-MANY model. Moreover, these zero-shot models almost reach the scores of the
bilingual models trained only on the zero-shot language pairs.

As a point of comparison, we also implemented the approach of Johnson et al.
(2017), using a language label at the beginning of the input sentence to specify the
required target language and a single shared model with joint vocabulary. We will refer
to this model as the LL approach herein. We used a combined 40k BPE operations model
trained on the combined corpora and the same architecture specifications from Section
3.1, without the components of the attention bridge model. The results are shown in
Figure 2 in the gray bars next to our attention bridge scores. We can see that the many-
to-many LL models perform slightly better than our attention bridge model. This is not
very surprising as they are based on a model architecture that also performs better in
the bilingual case as we have seen in the comparison between bilingual models with
and without attention bridge in Table 1. Section 3.2.2 will also show that this is basically
caused by long sentences that are not as well covered by the attention-bridge model.

A similar effect is visible in the zero-shot results that we obtain in the same way as
with our attention bridge model (i.e., leaving one language pair out of the training data).
The differences from our model are sometimes larger than in the supervised set-up. This
can be explained by the positive effect of sharing all encoder and decoder parameters
in the case of related languages. Having a small data set to start with, the additional
data from the other language pairs seems to be very beneficial and in some cases the
zero-shot performance comes very close to the supervised model with all data included.
In future work, we would like to investigate the effect on more distant languages and
increasing numbers of languages involved in our comparison.


Also note that the language labeling technique does not produce crosslingual
sentence representations, the main advantage of our approach, which we will test in
multilingual downstream tasks (see Section 4.2). The language label makes the en-
coder effectively dependent on the target language, which makes it difficult to apply
the representations produced by that system to unrelated downstream tasks. These
drawbacks and the fact that we produce competitive results with our architecture while
producing directly applicable crosslingual sentence representations motivate the use of
our architecture in multilingual set-ups. Moreover, we can also show that the drop in
performance mainly comes from long sentences that are not covered as well as shorter
ones. More details on this effect can be found in Section 3.2.2.

3.1.4 Effect of the Penalty Term. In order to study the effect of the penalty term, we
have trained additional bilingual and multilingual models, where the penalty term of
Equation (10) was excluded from the loss function. We re-ran all the 36 tests in the lower
row of sub-tables in Table 1. We then compared the BLEU scores between corresponding
set-ups where the penalty term was present and absent. We discovered that in 21 out
of the 36 tests (58%) the presence of the penalty term was beneficial. On average, the
penalty term improves the BLEU scores by 0.11 points, across all tested types of models
and language pairs.

As discussed in Lin et al. (2017), the quantitative effect of the penalty term might not
be significant for some tasks, yet still it maintains the positive effect of encouraging the
attentive matrix to be focused on different aspects of the sentence rather than picking up
redundant information. Indeed, as we will see in Section 5.3, adding the penalty term
effectively helps the model to spread the attention of the individual attention heads
once the sentence is covered with token-specific attention. This leads us to keep it in
the remaining experiments.

3.2 The Effect of the Attention Bridge Size on MT Quality

The study on the Multi30K data set demonstrates the general ability of the attention
bridge model to learn multilingual translation models capable of sharing knowledge
between the various language pairs also enabling zero-shot translation similar to other
multilingual NMT architectures. In the following, we investigate the impact of the size
of the attention bridge on translation performance. For this study, we choose a data set of
a realistic size and a more challenging benchmark with a larger vocabulary and a greater
variety of sentence lengths as one of the most crucial properties influencing the quality
of machine translation. In particular, we apply the Europarl Corpus v7 (Koehn 2005)
with a selection of four languages and news test sets from the ACL-WMT07 shared
task (Callison-Burch et al. 2007), using dev2006 as validation data and devtest2006 plus
test2006 as blind test data, ending up with 2K and 4K sentences, respectively. We focus
on six language pair directions: English–French (EN–FR), English–German (EN–DE),
and English–Spanish (EN–ES), with training data of approximately 2M sentences each.5
The data are pre-processed following the standard MT pipeline, including tokenization
and truecasing. Sentences are then encoded using BPE (Sennrich, Haddow, and Birch
2016), with 32,000 merge operations for each language. BLEU scores are computed case-
insensitively using SACREBLEU as before.

5 As before, we trained all models including monolingual data, and because of the small size of the Czech

Europarl data, we include Spanish instead.


Table 2
BLEU scores for bilingual Europarl models.

BILINGUAL MODELS
         baseline    k=1     k=10    k=25    k=50
EN→DE     22.72     15.04   20.25   21.26   21.87
EN→ES     30.28     22.8    27.3    28.52   29.15
EN→FR     25.88     18.97   23.49   24.42   25.07
DE→EN     24.28     17.22   22.53   23.18   23.59
ES→EN     28.16     19.33   25.2    26.49   28.16
FR→EN     25.39     17.46   22.1    22.4    24.22

We will first look at the impact of attention bridge size on bilingual and multilingual
models before we discuss the impact of sentence length on our model. In general, we
expect that the positive effect of transfer learning in translation will fade out as the
bilingual baseline models become stronger and outperform the attention bridge model
with their additional bottleneck of a fixed size intermediate representation. This will
mainly affect long sentences that are not properly summarized in the shared layer,
causing a less effective access to encoder information through the crosslingual attention
(more detailed analyses are presented in Section 3.2.2).

3.2.1 The Impact of Attention Bridge Size on Bilingual and Multilingual Models. For the
following experiments, we apply the same architecture and hyperparameters as in
Section 3.1. Regarding the attention bridge, we experiment with four different numbers
of attention heads: k = 1, 10, 25, 50. In the training we use the Adam optimizer (Kingma
and Ba 2015) with a learning rate of 0.0004 and batch size 256, for at most 100,000
steps per language pair. We select the best model according to the BLEU score on the
validation set. For multilingual systems, we select only one model with the best overall
BLEU score across the validation set of all the language pairs involved.

We adopt different training strategies: First we train bilingual models for the lan-
guage pairs of interest; then we train a {DE,ES,FR}↔EN model using the many-to-
one and one-to-many strategy; lastly we train a many-to-many model involving all
translation directions between the three languages (i.e., we also include DE–ES, DE–FR,
and ES–FR).6

Table 2 shows the BLEU scores of our models with varying k, the number of
attention heads in the attention bridge, compared with a baseline: a traditional encoder–
decoder model with attention mechanism (Luong, Pham, and Manning 2015). Among
the attention bridge models, we can see that the performance consistently increases
when k grows. The model with 50 heads achieves the best results among our models.
It obtains scores that range in the same ballpark as the baseline, only in a few cases
there is a degradation of around 1 BLEU point. Moreover, the performance of this
model compared with the one with one attention head is substantial: more than 6 BLEU
points on average, corroborating previous findings (Britz, Guan, and Luong 2017; Cífka
and Bojar 2018). Intuitively, a large number of attention heads manage to encode richer

6 Data coming from the same Europarl source: http://opus.nlpl.eu/Europarl.php.


Table 3
BLEU scores for multilingual Europarl models with various sizes k of the attention bridge. For
comparison, the table includes results of a multi-way multilingual NMT model (Firat, Cho, and
Bengio 2016) and a completely shared architecture with language labels: LL (Johnson et al. 2017).

                              M ↔ EN                              M ↔ M
         k=1     k=10    k=25    k=50    Firat    LL       k=1     k=50    LL
EN→DE   14.66   19.87   20.61   20.83   18.49   21.63     14.89   20.47   21.7
EN→ES   21.82   27.55   28.41   28.13   27.73   29.48     21.4    27.6    29.53
EN→FR   17.8    23.35   24.36   23.79   23.22   25.56     17.62   24.15   25.51
DE→EN   16.97   21.39   23.42   24      24.8    25.96     17.38   24.4    25.84
ES→EN   18.38   25.39   27.01   27.12   25.7    28.41     19.43   26.98   28.67
FR→EN   17.52   21.93   24.4    23.9    24.52   26.93     17.47   24.47   25.47

information about the source sentence improving the performance of the model for
MT. Those results verify that BLEU and meaning representations do not have to be
in opposition, as suggested by Cífka and Bojar (2018).

For the multilingual settings, we train a {DE,ES,FR}↔EN model using the many-to-
one and one-to-many strategy, and a many-to-many model as discussed in Sections 3.1.2
E 3.1.3. Tavolo 3 shows the comparison between the multilingual models. Generalmente,
we observe the same trend as in the bilingual evaluation concerning the size of the
attention bridge. Namely, more attention heads lead to a higher BLEU score. Notably,
we do not see any increase in translation quality from the {DE,ES,FR}↔EN model to the
many-to-many model; the BLEU scores for all six translation directions are statistically
equivalent. Besides, when we compare the bilingual and multilingual models for a
given k, we do not note any apparent degradation or improvement regarding the BLEU
score when incorporating multilingual data into the models.

For comparison, we again add results from the language-labeling approach by
Johnson et al. (2017) and also from another popular approach that has been proposed by
Firat, Cho, and Bengio (2016). The latter refers to a multi-way multilingual NMT system
with a shared crosslingual attention mechanism, a model that is quite similar in spirit
with our approach but without a fixed-size shared layer between encoder and decoder
that bridges the crosslingual attention.

The multiway architecture produces lower scores for most language pairs. Note
that we only show results for the {DE,ES,FR}↔EN set-up, as the available implementation
no longer meets current standards,7 and training is very slow and would be prohibitively
expensive in the many-to-many set-up. We expect that the trend would be the same, with
scores below our proposed architecture. The language-label approach by Johnson
et al. (2017), on the other hand, is again very effective and produces the overall best
results. Sharing all parameters is also beneficial in the Europarl experiments, similar
to what we have seen in the Multi30K results. Again, we have to note that the bilin-
gual baseline will be higher and that we also focus on related languages that
benefit from a strong overlap in linguistic properties. However, once again, we can see
that our model produces competitive results with the additional benefit of producing
crosslingual fixed-size representations that are directly applicable in downstream tasks
including crosslingual ones.

7 https://github.com/nyu-dl/dl4mt-multi.


3.2.2 Length Analysis. In the previous section, we could see that there is a strong cor-
relation between the size of the attention bridge and the quality of the translations
produced. We could also see that the attention bridge model is capable of translating
with a similar performance even though it creates an additional bottleneck of fixed-size
representations. Nevertheless, the performance drops slightly and, in this section, we
would like to investigate the reasons for that drop by looking at the effect on different
subsets of the test data.

One of the main motivations for having more attention heads lies in the better
support of longer sentences. To study the effect, following previous work (Bahdanau,
Cho, and Bengio 2015; Tu et al. 2017; Dou et al. 2018), we group sentences of similar
length and compute the BLEU score for each group. As we can see from Figure 3,
a larger number of attention heads has, indeed, a positive impact when translating
longer sentences. Long sentences do require a bigger attention bridge, and it affects
both bilingual and multilingual models. Interestingly enough, on sentences with up to
45 words, there is no real gap between the results of the baseline model and our bridge
models with a high number of attention heads. It looks like the performance drop of the
attention bridge models is entirely due to sentences longer than 45 words. The same is
true in comparison to the language-label approach. This also suggests that the increased
performance of that model is due to the better coverage of long sentences.

Furthermore, we notice that multilingual models with 50 attention heads lose more
on long sentences than bilingual ones. We hypothesize that this might be due to the
increasing syntactic divergences between the languages that have to be encoded. The
shared inner-attention layer needs to learn to focus on different parts of a sentence
depending on the language it reads and, with increasing sentence length, this
ability becomes harder to pick up from the data alone.
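The length-bucketed evaluation used above can be reproduced with a simple grouping by source length and a per-bucket corpus BLEU; a sketch using the sacreBLEU Python API, with the bucket edges following Figure 3 (inputs are placeholder lists of sentences):

```python
from collections import defaultdict
import sacrebleu

def bleu_by_length(sources, hypotheses, references, edges=(15, 30, 45)):
    """Group test sentences by source length and score each group separately,
    mirroring the buckets of Figure 3 (<=15, <=30, <=45, >45 words)."""
    buckets = defaultdict(lambda: ([], []))
    for src, hyp, ref in zip(sources, hypotheses, references):
        n = len(src.split())
        label = next((f"<={e}" for e in edges if n <= e), f">{edges[-1]}")
        hyps, refs = buckets[label]
        hyps.append(hyp)
        refs.append(ref)
    return {label: sacrebleu.corpus_bleu(hyps, [refs]).score
            for label, (hyps, refs) in buckets.items()}
```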

3.3 Discussion

Our results demonstrate that the attention bridge model proposed in this paper imple-
ments an effective approach to multilingual machine translation. The shared layer suc-
cessfully bridges language-dependent encoder and decoder networks enabling efficient
transfer learning and improved sentence representation learning. Using the Multi30K
benchmark, the results of the multilingual models consistently outperform a strong

Figure 3
The BLEU scores obtained by the models with respect to different sentence lengths. The left figure
shows the bilingual models, the middle one the many-to-one models, and the figure to the right
illustrates the many-to-many models.


bilingual model. This advantage, however, fades out with larger data sets. This is ex-
pected because of the limits of the fixed-size representations that bridge the gap between
the various languages. But our analysis shows that this is mainly due to the problem
with long sentences, an issue that needs to be addressed in future work. Our analysis
also reveals that the size of the attention bridge plays a crucial role on translation quality
and we will further discuss this below in the application of the sentence representations
to unrelated downstream tasks. This brings us to the main point of this article, namely,
the discussion of the quality of representations that can be learned from translations
using the proposed multilingual architecture.

4. MT-Based Representations in Downstream Tasks

The main motivation for our study is to investigate the sentence representations that
the MT model picks up during training. Therefore, the most important part is the
assessment of these representations in unrelated downstream tasks and the analyses of
the internal structure (which we will discuss in Section 5). In the following, we will first
briefly introduce the tasks we consider before applying our models to each of them. Our
MT models are trained on the Europarl data. However, in Section 4.4 we also include a
study on downstream tasks with representations learned from limited resources, using
the Multi30K data set, to further demonstrate that useful representations can be picked
up even from tiny data sets. This is in contrast to related work where huge amounts of
training data are typically applied to obtain reasonable performance.

Our assumption is that multilinguality contributes to a higher level of semantic
abstraction that can be learned from the translation objective. To test this claim, we apply
standard benchmarks collected in the SentEval toolkit (Conneau and Kiela 2018), the
XNLI evaluation corpus (Conneau et al. 2018c), as well as the Yelp challenge data set.8
The SentEval toolkit contains three benchmark types: classification, similarity, and
linguistic probing tasks. In the classification tasks, a classifier is trained on top of a
sentence embedding involving various data sets: CR—product reviews (Hu and Liu
2004), MR—movie reviews (Pang and Lee 2005), MPQA—opinion polarity (Wiebe,
Wilson, and Cardie 2005), SUBJ—subjectivity/objectivity status (Pang and Lee 2004),
SST—binary and fine-grained sentiment analysis (Socher et al. 2013), TREC—question-
type classification (Voorhees and Tice 2000), MRPC—paraphrase detection (Dolan,
Quirk, and Brockett 2004), and SICK and SNLI—textual entailment and natural lan-
guage inference (Marelli et al. 2014; Bowman et al. 2015).

In contrast to the classification tasks mentioned above, the similarity tasks do not
involve any training and, instead, correlate the cosine distance between two sentence
representations with a human labeled score using Pearson and Spearman coefficients.
The data sets come from the SemEval Semantic Textual Similarity (STS) task series, from
2012 to 2016 (Agirre et al. 2012, 2013, 2014, 2015, 2016). The only exceptions are the SICK
and STSB data sets (Marelli et al. 2014; Cer et al. 2017), where training data are provided.
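Since these tasks involve no training, the evaluation reduces to a cosine similarity between the two sentence vectors and a correlation with the human scores; a minimal sketch (the inputs are placeholders for the embedding pairs and gold labels):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def sts_correlations(vecs_a, vecs_b, gold_scores):
    """Correlate the cosine similarity of each sentence pair's embeddings with
    the human similarity judgments (Pearson and Spearman coefficients)."""
    a, b = np.asarray(vecs_a, dtype=float), np.asarray(vecs_b, dtype=float)
    cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return pearsonr(cos, gold_scores)[0], spearmanr(cos, gold_scores)[0]
```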
Moreover, the SentEval toolkit contains probing tasks to study how linguistic
features are encoded within a fixed-size vector (Conneau et al. 2018a).

All SentEval tasks are designed for English only. Therefore, we find it valuable
to evaluate our sentence representations on multilingual classification tasks as well.
For this purpose we make use of the XNLI evaluation corpus (Conneau et al. 2018c)

8 http://www.yelp.com/dataset.


for language transfer and crosslingual sentence classification, as well as a multilingual
subset of the Yelp challenge data set.

We run the evaluation following the recommended default settings, that is, training
a logistic regression classifier for the classification tasks, with the Adam optimizer (batch
size: 64, epoch size: 4). For the probing tasks we use a multilayer perceptron classifier
with sigmoid nonlinearity, 200 hidden units, and a 0.1 dropout rate. In order to obtain a
sentence vector out of multiple attention heads we apply mean pooling over M, as in
Lu et al. (2018).
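These settings map onto a SentEval configuration roughly as follows. This is a sketch under our own assumptions: the encoder call is a placeholder for the trained NMT model, and the mean pooling over the k attention-bridge vectors of M happens inside the batcher.

```python
import numpy as np
import senteval

def batcher(params, batch):
    """Encode a batch of sentences with the trained encoder + attention bridge
    and mean-pool the k head vectors of M into one fixed-size sentence vector."""
    sentences = [" ".join(tokens) for tokens in batch]
    M = params["encoder"].encode(sentences)     # placeholder call, shape (batch, k, d_h)
    return np.mean(M, axis=1)                   # mean pooling over M, shape (batch, d_h)

params = {
    "task_path": "SentEval/data",
    "usepytorch": True,
    "encoder": None,                            # the trained multilingual NMT model
    # classifier settings as described above (nhid=0 gives logistic regression;
    # nhid=200 with dropout 0.1 is the probing-task setting)
    "classifier": {"nhid": 0, "optim": "adam", "batch_size": 64,
                   "tenacity": 5, "epoch_size": 4},
}
se = senteval.engine.SE(params, batcher)
# results = se.eval(["CR", "MR", "MPQA", "SUBJ", "SST2", "TREC", "MRPC"])
```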

We present our experiments and their results in the following order: First we present
the classification tasks: the SentEval classification tasks on English (Section 4.1) as well
as multilingual classification based on XNLI and Yelp reviews (Section 4.2). Next, we
turn to the similarity tasks in SentEval (Section 4.3). In all these set-ups we use models
trained on Europarl data. Afterwards, in Section 4.4 we turn to a low-resource scenario
and study SentEval classification and similarity on the Multi30k data set. The SentEval
probing tasks are studied in depth as part of the analysis in Section 5.

4.1 SentEval Classification Tasks

Figure 4 shows the average performance of our models on the various classification
downstream tasks. The most frequent baseline achieves an average score of 48.19, which
all our models beat by a wide margin. We can see that the multilingual models work
best, with the many-to-many model clearly outperforming the rest on average. The
figure also illustrates the impact of increasing the number of attention heads. Let us
have a closer look at individual classification tasks to get a more detailed picture of the
performance in the various settings.

Tables 4 E 5 show the performance of our models on the different downstream
tasks. We report the accuracy on each individual test set, including the following
comparison scores: a baseline of the most frequent class; a bag-of-vectors baseline
obtained by averaging GloVe word embeddings (Pennington, Socher, and Manning
2014); an average of word embeddings as well as the CLS fixed-size sentence vector
representation obtained from the large-scale pretrained language model BERT (Devlin
et al. 2019; Reimers and Gurevych 2019); a state of the art general-purpose model that
exploits large-scale multitask learning on different tasks including machine translation
(Subramanian et al. 2018); and the performance from other MT systems by Hill, Cho,
and Korhonen (2016) and Conneau et al. (2018a).9

The experiments reveal two important findings:

1.  In contrast with the results from Cífka and Bojar (2018), our scores
    demonstrate that an increasing number of attention heads is beneficial for
    classification-based downstream tasks. All models perform best with more
    than one attention head and the general trend is that the accuracies
    improve with larger representations. The previous claim was that there is
    the opposite effect and that lower numbers of attention heads lead to

9 We only report the best result across the various NMT systems presented by Hill, Cho, and Korhonen

(2016) and Conneau et al. (2018a).


Figure 4
Average scores over the 10 SentEval classification tasks. Results shown for the different trained
models.

higher performances in downstream tasks, but we do not see that effect in
our set-up, at least not in the classification tasks.

2.  The second outcome is the positive effect of multilingual training. We can
    see that multilingual training objectives are generally helpful for the
    trainable downstream tasks.

Previous work has focused more on the evaluation of translation alone in the
multilingual set-up (Dong et al. 2015) and with our results we can now demonstrate
that multilinguality indeed boosts the abstraction power of a fixed-size sentence vector
that can be trained with the machine translation objective. Particularly interesting is the
fact that the many-to-many model performs best on average even though it does not add
any further training examples for English (compared to the other multilingual models),
which is the target language of the downstream tasks. This suggests that the model is
able to improve generalizations even from other language pairs (DE–ES, FR–ES, FR–DE)
that are not directly involved in training the representations of English sentences.

For completeness we also include a comparison with other approaches, even though
the comparison is only partly fair, for several reasons (different underlying architecture,
different set of hyperparameters, different training data and preprocessing techniques).
First of all, as a sanity check, we observe that our best model reaches far better
results than the majority-class baseline. Next, we can see that the results of our best
model are better than the best systems by Hill, Cho, and Korhonen (2016) and Cífka and
Bojar (2018).


Table 4
Accuracies of different models in eight different classification tasks. The average accuracy in the
right-most column illustrates the overall trend that a higher number of attention heads and
multilingual models are beneficial. Results marked with † taken from Cífka and Bojar (2018);
with ‡ from Conneau et al. (2018a); with ♯ from Reimers and Gurevych (2019).

                                   CR     MR     MPQA   SUBJ   SSTB   SSTF   TREC   MRPC   AVG
EN-DE  k=1                        75.47  68.10  87.49  85.25  71.77  37.15  77.60  70.84  71.71
EN-DE  k=10                       74.33  69.29  87.66  85.89  75.12  38.37  73.60  71.83  72.01
EN-DE  k=25                       73.93  69.38  87.86  86.13  72.98  38.19  81.40  72.46  72.79
EN-DE  k=50                       74.41  68.42  87.63  87.42  73.26  38.28  79.60  72.35  72.67
EN-ES  k=1                        74.41  66.67  86.95  84.75  70.90  35.93  78.40  70.67  71.08
EN-ES  k=10                       72.21  68.72  87.93  86.38  72.05  37.33  77.20  71.83  71.70
EN-ES  k=25                       73.72  67.96  87.75  85.79  73.59  36.83  80.00  72.87  72.31
EN-ES  k=50                       74.06  67.38  87.80  85.84  72.16  36.65  81.20  67.94  71.63
EN-FR  k=1                        75.68  68.77  87.27  85.18  71.72  36.97  77.40  70.38  71.67
EN-FR  k=10                       74.67  68.89  87.72  86.62  73.59  39.77  75.60  71.65  72.31
EN-FR  k=25                       74.41  67.98  87.67  86.33  74.19  38.64  80.00  71.13  72.54
EN-FR  k=50                       74.86  69.25  88.26  87.02  75.29  38.06  82.06  72.52  73.42
M ↔ EN k=1                        75.28  69.58  88.15  86.98  74.46  38.96  79.60  70.20  72.90
M ↔ EN k=10                       74.07  70.66  88.42  87.63  75.84  38.55  75.80  71.48  72.81
M ↔ EN k=25                       75.36  69.43  88.21  87.33  75.67  39.19  81.80  72.93  73.74
M ↔ EN k=50                       75.28  69.87  88.26  87.71  75.12  39.64  80.00  70.14  73.25
M ↔ M  k=1                        75.92  71.23  88.07  87.64  75.84  39.73  78.8   73.28  73.81
M ↔ M  k=50                       74.72  70.47  88.39  87.98  77.16  40.14  83.00  72.58  74.31

Most frequent baseline†           63.80  50.00  68.80  50.00  49.90  23.10  18.80  66.50  48.86
Hill, Cho, and Korhonen (2016) en→fr†   70.10  64.70  81.50  84.90   –      –    82.80  69.10   –
en→cs (2018)                      76.00  68.20  84.90  86.90  72.00  35.70  89.00  70.70  72.92
GloVe-BOW†                        78.20  77.00  87.90  91.10  81.00  44.40  82.00  72.30  76.74
Conneau et al. (2018a) en→fi‡     81.10  77.00  90.00  91.50  80.30  43.40  87.20  75.00   –
Subramanian et al. (2018)         88.60  82.40  90.70  93.80  85.10   –     94.00  78.30   –
Avg. BERT embeddings♯             86.25  78.66  88.66  94.37  84.40   –     92.80  69.45   –
BERT CLS-vector♯                  84.85  78.68  88.23  94.21  84.13   –     91.40  71.13   –

Hill, Cho, and Korhonen (2016) train a standard RNN encoder–decoder based sys-
tem (Cho et al. 2014) on all available English–French data from the 2015 Workshop on
Statistical Machine Translation (WMT’15).10 Similarly to our system, their training set
incorporates 2 million English–French sentence pairs from the Europarl corpus. They
use additional English–French data, whereas we train on additional English–German
and English–Spanish data. We outperform their system in every single classification
task (Table 4) when we use multilingual data. Even if we limit ourselves to English–
French data, as they did, we outperform them in all tasks but TREC. This suggests that
our model is superior both in its way of exploiting multilingual data and in its architecture.

10 www.statmt.org/wmt15/translation-task.html.


Table 5
Results of the two natural language inference (NLI) tasks in SentEval. SICKE = SICK entailment
set. Results marked with † taken from Cífka and Bojar (2018); with ‡ from Conneau et al. (2018a).

                        SNLI    SICKE
EN-DE   k=1            63.86   77.09
EN-DE   k=10           65.30   78.77
EN-DE   k=25           65.13   79.34
EN-DE   k=50           65.30   79.36
EN-ES   k=1            62.79   76.76
EN-ES   k=10           66.02   77.65
EN-ES   k=25           65.20   79.30
EN-ES   k=50           65.49   78.83
EN-FR   k=1            63.71   76.19
EN-FR   k=10           65.64   78.08
EN-FR   k=25           65.68   79.97
EN-FR   k=50           65.47   79.14
M ↔ EN  k=1            65.56   77.96
M ↔ EN  k=10           67.01   79.48
M ↔ EN  k=25           66.94   79.85
M ↔ EN  k=50           67.38   80.54
M ↔ M   k=1            66.92   77.82
M ↔ M   k=50           67.73   81.12

Most frequent baseline†        34.30   56.70
GloVe-BOW†                     66.00   78.20
en→cs (2018)                   69.30   80.80
en→fi (2018a)                    –     81.70
Subramanian et al. (2018)        –     87.40

Hill, Cho, and Korhonen (2016) use the last state of the encoder as their sentence
representation, whereas we use the attention bridge layer.

The model by Cífka and Bojar (2018) is based on a very similar architecture to ours,
but they train on bilingual data, 57 million English–Czech sentence pairs. We train on
a considerably smaller, but multilingual, data set (3 times 2 million sentence pairs of
EN–FR, EN–DE, and EN–ES). Yet our system outperforms theirs in six out of nine tasks
listed in Tables 4 and 5. This again demonstrates the power of multilingual models.

In further comparisons, we can see that our model outperforms the competitive
baseline of GloVe-BOW (Kruszewski et al. 2015; Arora, Liang, and Ma 2017; Adi et al.
2017) in five tasks out of ten. However, Conneau et al. (2018a) and Subramanian et al.
(2018) perform better than us in all the classification and NLI tasks. We believe that
the strong performance of the latter models is explained by orders of magnitude more
training data. GloVe-BOW and the Conneau et al. (2018a) model are based on
word embeddings, which have been pretrained on several billions of words of text.
The large vocabularies of the pretrained embeddings provide better representations for
low-frequency as well as out-of-vocabulary words. Subramanian et al. (2018) use 124
million sentence pairs for training, che è 20 times more than we have. The BERT
models, trained on 3.3 billion words, do not quite reach the level of Subramanian et al.
(2018).

Although our aim is not to beat the state of the art, but rather to understand the
impact of various sizes of attention heads in a bilingual and multilingual scenario, we
argue that a larger attention bridge and multilinguality constitute a preferable starting
point to learn more meaningful sentence representations. With this, we can contrast
and extend previous findings, leading the way to further extensions of the MT-based
framework for crosslingual representation learning.

4.2 Multilingual Classification Tasks

In the previous section, we focused on downstream tasks that consider English only.
The main point was to show that reasonable representations can be learned from the


translation objective and that multilingual data help to improve the abstractions that
can be derived. Even more intriguing is the fact that our shared representation combines
language-specific encoders with language agnostic representations. This makes it possi-
ble to directly test crosslingual downstream tasks, which we will focus on in this section.
The interest in crosslingual NLP has led to a number of benchmarks and downstream
applications, and here we consider the framework of crosslingual NLI as defined
by the XNLI challenge (Conneau et al. 2018c) and crosslingual review classification
as proposed by Lu et al. (2018). We start with the XNLI results and then turn to the
multilingual classifier based on Yelp reviews.11

4.2.1 XNLI. The idea of the XNLI challenge is that the provided corpus enables us to test
natural language inference across different languages. The test pairs are all translated
into 14 languages, which makes it possible to obtain comparable results across various
language pairs. Hence, a classifier can be trained on one language and be tested on
another one. In order to make this work, one essentially needs to produce crosslingual
sentence representations that are useful for the task in all test languages. Tavolo 6 sum-
marizes the results obtained for different settings. We rely on our multilingual attention
bridge model trained in a many-to-many fashion.

For comparison, we include representations obtained from large pretrained word
embeddings. Note that those embeddings are trained on vastly more data than our
model, which is trained on the parallel Europarl corpus. In particular, we use the
multilingual word embeddings from the fastText (Grave et al. 2018) and the MUSE
(Conneau et al. 2018b) libraries. The fastText algorithm is based on word2vec and
produces word embeddings compounded from character n-grams (Bojanowski et al.
2017), which is to be preferred for morphologically richer languages in a multilingual
setting. The fastText word vectors are pretrained on CommonCrawl and Wikipedia
using CBOW with position weights, whereas MUSE word embeddings are Wikipedia
fastText vectors from 30 languages aligned in a supervised way into a single vector
space. Because fastText vectors are not aligned into the same space, we only present the
accuracies on the relevant languages for each case. To obtain sentence representations,
we compute the average of the individual word vectors.
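As a rough illustration of this baseline, the sketch below (a simplified example rather than the
exact evaluation code; the file path and the whitespace tokenization are assumptions) averages
pretrained fastText/MUSE word vectors over the tokens of a sentence:

import numpy as np

def load_vectors(path, max_words=200000):
    """Load a fastText/MUSE .vec file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # first line of a .vec file holds vocabulary size and dimension
        for i, line in enumerate(f):
            if i >= max_words:
                break
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def average_embedding(sentence, vectors, dim=300):
    """Sentence representation = mean of the available word vectors (OOV tokens are skipped)."""
    tokens = sentence.lower().split()
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)

# e.g. en_vectors = load_vectors("wiki.multi.en.vec")
#      emb = average_embedding("This is a test sentence .", en_vectors)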

We use the XNLI corpus to train multilingual classifiers that are then to be tested for
zero-shot classification. Logistic regression classifiers are trained on top of the sentence
embeddings produced with the English, German, French, and Spanish training data,
or a combination of these, and then tested in all four languages. We observe that our
model is clearly better than the fastText and MUSE benchmarks. Besides, it reaches
results equal to the X-CBOW model presented in Conneau et al. (2018c)—a model that
incorporates an additional multilingual loss to enhance the performance in the task,
and is based on a feed-forward neural network classifier instead of simple logistic
regression. Note that the X-CBOW results are taken directly from the original paper
and are only available for a classifier trained on English. Even though the comparison with
the X-CBOW scores is not completely fair because the data sets we applied in training are
much smaller and narrower in domain (Europarl only), we reach similar performance. It
is certainly reassuring that our model is capable of creating language-agnostic repre-
sentations that are properly aligned with language-specific encoders. Inoltre, our
model outperforms the crosslingually aligned MUSE word embeddings by a significant
margin, which demonstrates the importance of a proper sentence encoder also in the

11 http://www.yelp.com/dataset.


Table 6
Accuracy obtained in the XNLI task comparing our multilingual model with pretrained vector
embeddings. Results marked with † taken from Conneau et al. (2018c) and use a feed-forward
neural network as a classifier as opposed to all others that use a logistic regression.

Classifier trained on                     EN     DE     FR     ES

EN              M ↔ M k=50              65.0   59.0   55.9   58.0
                MUSE                    55.9   47.8   42.5   42.8
                fastText                53.1    –      –      –
                X-CBOW†                 64.5   61.0   60.3   60.7

DE              M ↔ M k=50              58.9   62.6   54.8   57.9
                MUSE                    50.5   53.1   39.2   42.5
                fastText                 –     52.9    –      –

FR              M ↔ M k=50              62.6   60.9   63.5   61.9
                MUSE                    45.9   43.3   53.0   41.8
                fastText                 –      –     48.6    –

ES              M ↔ M k=50              61.2   59.0   56.3   63.8
                MUSE                    49.2   44.5   43.1   52.5
                fastText                 –      –      –     50.6

EN+DE           M ↔ M k=50              65.3   62.8   58.6   60.8
                MUSE                    55.4   53.4   40.1   43.1

EN+FR           M ↔ M k=50              65.2   61.6   63.8   61.7
                MUSE                    55.0   46.5   52.8   44.5

EN+ES           M ↔ M k=50              64.9   60.9   58.5   64.6
                MUSE                    54.5   46.9   43.4   52.0

EN+DE+FR+ES     M ↔ M k=50              64.8   62.8   63.0   63.6
                MUSE                    54.8   51.7   51.6   51.2

crosslingual setting. The same effect can be seen in comparison to the non-aligned
language-specific fastText word embeddings that have been trained on huge amounts
of training data.

Looking at the different combinations of training data and the difference between
supervised and zero-shot classification, we can see that our model is quite robust across
the different settings. The drop in performance when moving to zero-shot classification
is rather modest in most cases and in some of them we achieve scores that are close to
the fully supervised mode (see, for example, the results for Spanish with the classifier
trained on French). The results also show that all languages seem to be equally covered
and the supervised scores end up at around 63–65% accuracy, with zero-shot scores
ranging roughly from 56% to 62%. It is interesting to note that a combination of training
languages can lead to improvements in zero-shot classification. For example, results for
Spanish and French are improved when combining English and German in the training
data. The same happens for German and Spanish when combining English and French,
and so forth.
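The zero-shot protocol described above can be sketched as follows. This is a minimal
illustration rather than the exact experimental code: encode() is a hypothetical helper that
returns the fixed-size sentence representations of our model for a given language, and the
premise–hypothesis feature combination shown here is one common choice for NLI classifiers,
not necessarily the precise feature set we used.

import numpy as np
from sklearn.linear_model import LogisticRegression

def nli_features(u, v):
    # Combine premise and hypothesis embeddings (a standard feature set for NLI).
    return np.concatenate([u, v, np.abs(u - v), u * v], axis=-1)

def train_xnli_classifier(premises, hypotheses, labels, lang, encode):
    X = nli_features(encode(premises, lang), encode(hypotheses, lang))
    return LogisticRegression(max_iter=1000).fit(X, labels)

def evaluate(clf, premises, hypotheses, labels, lang, encode):
    X = nli_features(encode(premises, lang), encode(hypotheses, lang))
    return clf.score(X, labels)

# Supervised: train and test on English.
# clf_en = train_xnli_classifier(en_prem, en_hyp, en_y, "en", encode)
# Zero-shot: apply the same classifier to Spanish test data.
# acc_es = evaluate(clf_en, es_prem, es_hyp, es_y, "es", encode)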


4.2.2 Yelp. Another crosslingual task is the multilingual review classification task pro-
posed by Lu et al. (2018). The idea is to train a classifier to label online reviews based on
their ratings to decide whether reviews in another language are received in a positive
or in a negative way. Tavolo 7 shows the scores achieved by an English-reviews clas-
sifier when tested on French, German, and Spanish sentences. To make the results as
comparable as possible to Lu et al. (2018) we use the same settings as they did, with
the addition of including Spanish in our experiments. Namely, we took a subset of
5,000 reviews in English from the Yelp review data set (Round 13) to train a simple
logistic regression classifier, as well as test sets of French, English, German, and Spanish
reviews, of 1,000 sentences each. As in their approach, we extracted the non-English
reviews by applying a language detection tool (Joulin et al. 2017). We use binary review
scores: 4- and 5-star reviews are labeled as positive, 1- and 2-star reviews are labeled as
negative, and 3-star reviews are not used. We treat each review as a sentence and use
the shared intermediate representations produced by our multilingual systems as input
to the classifier. As before, we compare our results to MUSE and fastText baselines,
obtaining the sentence representations by averaging the word embeddings of a full
review.

In Table 7 we present models with different sizes of the attention bridge for the
many-to-English set-up and, finally, also the results for representations coming from a
many-to-many model. We only include classifiers trained on English as there are not
enough data for the other languages to train a reliable classifier.

The results show that our representations perform well across all languages with
scores mostly around 83–87% accuracy. The most interesting thing is that there is
basically no deterioration when moving to zero-shot classification for languages other
than English. Note that the class distribution is quite skewed here and the majority class
baseline is quite high, especially for German and French. This makes it rather difficult
to interpret the results at least for those languages with a heavy over-representation of
positive reviews. Spanish might be the most reliable zero-shot language in our test set
with a more balanced distribution and, here, we can see a clear improvement over the
majority-class baseline when applying the representations coming from our multilin-
gual translation model. We also clearly outperform MUSE in that case, demonstrating
the ability of our sentence encoders to create reliable crosslingual representations.
Again, MUSE and fastText are not directly comparable as they have been trained on
much larger and more diverse data sets. Nevertheless, we also outperform fastText-
based representations in the supervised case in all settings except one. The picture about

Table 7
Accuracy for crosslingual Yelp binary review classification.

                          EN     DE     FR     ES

% Positive               75.3   86.6   83.0   69.7
MUSE                     81.5   86.7   83.1   76.0
fastText                 82.5    –      –      –

M ↔ EN       k=1         84.9   84.6   83.5   79.6
M ↔ EN       k=10        83.7   84.4   83.6   83.6
M ↔ EN       k=25        81.2   85.7   76.1   83.8
M ↔ EN       k=50        83.3   79.6   83.8   83.4
M ↔ M        k=50        82.6   87.1   84.1   81.8


the perfect size for the attention bridge is not very clear. In some cases, a small size is
preferable whereas in others a larger size is beneficial. Also, the effect of additional
language pairs is not entirely clear but on average the many-to-many set-up produces
better scores compared with the similar set-up with many-to-English models. Some
additional studies might reveal further insights, which we will leave for future work.

4.3 SentEval Similarity Tasks

The next evaluation refers to the similarity tasks of SentEval—that is, English data only.
Tavolo 8 summarizes the results using Pearson’s correlation coefficient as well as the
average on all tasks. As a comparison, we include the bag-of-vectors baseline (GloVe-
BOW) as in the earlier SentEval classification tasks, the best model from Cífka and Bojar
(2018), and the InferSent model (Conneau et al. 2017) as a state-of-the-art model that is
pretrained on a natural language inference (NLI) task. As discussed earlier, note that the
SICK and STSB benchmarks provide training data where a classifier learns to predict the
probability distribution of the relatedness scores (Tai, Socher, and Manning 2015). Two
different trends become visible:

io) On the unsupervised textual similarity tasks, having fewer attention heads is
beneficial. Contrary to the results in the classification tasks, the best overall model is

Table 8
Results from seven similarity tasks, measured using Pearson's correlation coefficients; the
Spearman's coefficients exhibit the same trend. The average values are displayed in the
right-most column. Results marked with † taken from Cífka and Bojar (2018).

                     SICK   STSB   STS12  STS13  STS14  STS15  STS16    AVG

EN-DE    k=1         0.74   0.69   0.57   0.46   0.58   0.63   0.62    0.61
EN-DE    k=10        0.76   0.69   0.52   0.41   0.54   0.58   0.56    0.58
EN-DE    k=25        0.78   0.67   0.50   0.39   0.50   0.57   0.50    0.56
EN-DE    k=50        0.78   0.65   0.47   0.36   0.46   0.54   0.46    0.53
EN-ES    k=1         0.73   0.68   0.54   0.42   0.53   0.60   0.60    0.58
EN-ES    k=10        0.77   0.68   0.53   0.37   0.52   0.58   0.55    0.57
EN-ES    k=25        0.77   0.64   0.50   0.35   0.47   0.56   0.48    0.54
EN-ES    k=50        0.78   0.63   0.48   0.31   0.40   0.49   0.44    0.51
EN-FR    k=1         0.74   0.66   0.56   0.44   0.57   0.62   0.62    0.59
EN-FR    k=10        0.75   0.67   0.53   0.36   0.51   0.57   0.58    0.57
EN-FR    k=25        0.77   0.65   0.50   0.36   0.48   0.55   0.48    0.54
EN-FR    k=50        0.77   0.63   0.46   0.38   0.45   0.53   0.45    0.52
M ↔ EN   k=1         0.76   0.69   0.53   0.38   0.52   0.56   0.57    0.57
M ↔ EN   k=10        0.78   0.69   0.51   0.34   0.47   0.56   0.55    0.56
M ↔ EN   k=25        0.78   0.68   0.46   0.32   0.42   0.51   0.43    0.51
M ↔ EN   k=50        0.79   0.66   0.45   0.30   0.36   0.47   0.43    0.50
M ↔ M    k=1         0.77   0.71   0.53   0.39   0.52   0.58   0.59    0.59
M ↔ M    k=50        0.79   0.69   0.47   0.30   0.35   0.45   0.41    0.50

GloVe-BOW†           0.80   0.64   0.52   0.50   0.55   0.56   0.51    0.59
en→cs (2018)         0.81   0.73   0.46   0.32   0.45   0.53   0.47    0.54
InferSent†           0.88   0.76   0.59   0.59   0.70   0.71   0.71    0.70


provided by a bilingual setting with only one attention head. This is in line with the find-
ings of Cífka and Bojar (2018) and could also be expected as the model is more strongly
pushed into a dense semantic abstraction that is beneficial for measuring similarities
without further training. More surprising is the negative effect of the multilingual
models. We believe that the multilingual information encoded jointly in the attention
bridge hampers the results for the monolingual semantic similarity measured with the
cosine distance, while the task becomes easier in a bilingual scenario where the vector encodes
data from only one source language, English in this case.

ii) On the supervised textual similarity tasks, we find a similar trend as in the image-
caption models for SICK: Both a higher number of attention heads and multilinguality
contribute to better scores. For STSB, we notice a different pattern. Namely, including
multilingual data in the models helps the performance in this task. The many-to-many
models score better than the best bilingual models. Furthermore, when increasing k, the
multilingual models' scores are not as badly hampered as the bilingual ones.

Comparing against two baseline systems, GloVe-BOW and the best model by Cífka
and Bojar (2018), our best model on average achieves better results, showing the poten-
tial of our approach. The strength of the InferSent model is probably explained by the
pre-trained word embeddings extracted from billions of words as well as their training
dati, which was taken from the NLI domain rather than parallel corpora of translated
sentences.
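For reference, the non-trainable similarity evaluation used above boils down to comparing
cosine similarities of sentence embeddings with the gold similarity judgments; a minimal
sketch, with encode() as a hypothetical helper that returns our fixed-size sentence vector,
looks as follows:

import numpy as np
from scipy.stats import pearsonr, spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def sts_scores(sentence_pairs, gold_scores, encode):
    """Unsupervised STS: correlate embedding cosine similarities with gold judgments."""
    predicted = [cosine(encode(s1), encode(s2)) for s1, s2 in sentence_pairs]
    return pearsonr(predicted, gold_scores)[0], spearmanr(predicted, gold_scores)[0]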

4.4 Downstream Tasks in a Low-Resource Setting

Another important test is whether our model is capable of learning reasonable represen-
tations from limited resources. In order to study this, we applied the Multi30k models
from Section 3.1 to the SentEval tasks in the same way as the Europarl models discussed
above. Note that these scores are not directly comparable to other models trained on
large-scale data sets because the Multi30K models were trained on very limited data
sets; they contain 30k sentences on the specific domain of image captioning. Tables 9
E 10 summarize the scores on downstream tasks we obtain for bilingual and for
multilingual models. We ran each experiment with five different seeds, and we present
the average of these scores.

We notice that for the classification and NLI tasks of the SentEval collection, IL
sentence embeddings produced by the multilingual models show consistent improve-
menti, with only two exceptions. Inoltre, we observe that our many-to-many model
obtains better results in the SICK Relatedness (SICKR) and STS-Benchmark (STS-B), that

Table 9
Accuracies of models trained with limited data in eight different classification tasks. Including
more languages during training boosts the performance.

            CR     MR    MPQA   SUBJ   SSTB   SSTF   TREC   MRPC

EN-DE     68.37  59.76  73.19  75.26  61.87  31.15  67.44  69.13
EN-CS     67.79  59.71  73.16  75.32  61.92  30.43  67.75  68.23
EN-FR     68.52  60.08  73.51  77.25  61.91  30.55  61.04  70.96

M ↔ EN    68.32  60.4   72.98  78.64  62.02  32.10  69.84  68.83
M ↔ M     69.01  61.80  73.28  80.88  62.24  31.83  66.4   70.43


Table 10
SentEval NLI and semantic similarity task results obtained for the models trained with limited
data. We present the Pearson's correlation coefficient for the similarity tasks. Models trained
with more languages get better scores in trainable tasks, whereas the non-trainable tasks show a
different behavior.

            NLI TASKS       SEMANTIC SIMILARITY TASKS
                            TRAINABLE          NON-TRAINABLE
            SNLI   SICKE    SICKR   STS-B      STS12  STS13  STS14  STS15  STS16

EN-DE      61.45   72.82    0.618   0.564      0.393  0.265  0.426  0.489  0.430
EN-CS      61.75   73.89    0.652   0.616      0.385  0.234  0.380  0.505  0.422
EN-FR      60.95   74.85    0.646   0.574      0.359  0.220  0.428  0.476  0.430
M ↔ EN     64.52   75.46    0.659   0.618      0.323  0.190  0.353  0.418  0.375
M ↔ M      65.12   76.92    0.677   0.630      0.327  0.256  0.415  0.460  0.400

is, the trainable semantic similarity tasks. However, the results of the non-trainable (se-
mantic similarity) tasks exhibit a different behavior (see the rightmost part of Table 10),
which can be explained by the fact that the additional information encoded in multi-
lingually trained embeddings cannot be effectively separated from the information that
is necessary for monolingual similarity measures. In other words, the attention bridge
layer of the multilingual models outputs vectors that contain information shared across
languages, rendering them incomparable with the cosine similarity.

5. Linguistic Analyses of Inner-Attention-Based Sentence Representations

In this section, we take a closer look at the representations and what they actually
encode. For that, we use the probing tasks that are available in the SentEval toolkit,
which make it possible to study specific linguistic features of a given representation
(see Sections 5.1 E 5.2). Inoltre, we add a careful analysis of the individual
attention heads to further study the information that is encoded by the internal shared
representation layer (see Section 5.3).

5.1 Probing Tasks in SentEval

The SentEval probing tasks inspect three different linguistic categories: surface infor-
mation on the length and word content of the sentence (Adi et al. 2017); several probing
tasks regarding syntactic properties, such as word ordering, top constituent task (Shi,
Padhi, and Knight 2016); and semantic properties, such as subject/object number, odd
man out, and coordination inversion, to name a few. To assess how our representations
correlate with linguistic properties, we carry out an evaluation on the various categories
of these tasks. Table 11 shows the accuracies on the three linguistic classes.
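In practice, this kind of evaluation is wired up through the SentEval interface; the following
sketch shows the general pattern (the task path, parameter values, and the encode() helper that
returns one fixed-size vector per sentence, for instance the mean over the attention-bridge
heads, are assumptions for illustration rather than our exact configuration):

import numpy as np
import senteval  # https://github.com/facebookresearch/SentEval

from my_model import encode  # hypothetical: returns a fixed-size vector for a sentence

def prepare(params, samples):
    return  # nothing to fit for a fixed, pretrained encoder

def batcher(params, batch):
    sentences = [" ".join(tokens) for tokens in batch]
    return np.vstack([encode(s) for s in sentences])

params = {"task_path": "SentEval/data", "usepytorch": True, "kfold": 10}
se = senteval.engine.SE(params, batcher, prepare)
probing_tasks = ["Length", "WordContent", "Depth", "TopConstituents", "BigramShift",
                 "Tense", "SubjNumber", "ObjNumber", "OddManOut", "CoordinationInversion"]
results = se.eval(probing_tasks)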

The results show a clear correlation between the number of attention heads and
the syntactic power of the representations: An increased number of attention heads
corresponds to a higher syntactic score (up to 5 points). Intuitively, more attention
heads can learn long-range dependencies better, which are essential for a better syn-
tactic understanding of the sentence. For the semantic probing test, the best results
are provided by the models with one attention head, in line with the findings of


Table 11
Average accuracy of the syntactic and semantic probing tasks (Europarl models). The accuracy
on the surface information tasks is shown in the rightmost columns.

                    Average      Average        Surface info
                    syntactic    semantic      Length      WC

EN-DE    k=1          55.87        72.52        82.20     63.80
EN-DE    k=10         57.97        71.04        85.40     61.40
EN-DE    k=25         59.10        70.58        86.90     56.70
EN-DE    k=50         59.97        71.84        86.00     49.60
EN-ES    k=1          54.90        72.48        82.30     56.50
EN-ES    k=10         58.37        71.36        87.50     60.80
EN-ES    k=25         59.27        71.44        88.00     52.80
EN-ES    k=50         59.70        72.26        87.30     46.50
EN-FR    k=1          55.37        72.04        82.30     60.60
EN-FR    k=10         58.13        71.16        86.70     61.50
EN-FR    k=25         58.63        71.30        86.40     53.80
EN-FR    k=50         59.40        72.42        86.30     45.70
M ↔ EN   k=1          57.83        73.86        89.10     58.70
M ↔ EN   k=10         61.33        72.60        90.30     54.20
M ↔ EN   k=25         60.87        72.78        91.50     41.20
M ↔ EN   k=50         61.70        73.60        92.10     34.50
M ↔ M    k=1          59.43        74.20        91.30     58.90
M ↔ M    k=50         60.97        73.40        90.70     35.60

Section 4.3, except that we do not observe a strong degradation with more attention
heads. Regarding the surface information, it is clear that more attention heads con-
tribute to understanding the length of the sentence better, while forgetting about word
content.

5.2 Probing Tasks with Limited Resources

Similar to Section 4.4, we also apply the same probing tasks to the Multi30K models to
demonstrate the ability of the models to learn from limited resources.

Table 12 compares scores between bilingual and multilingual models. Again, we
observe improvements in the majority of cases when adding multiple languages to the
training procedure. Remarkably, we observe a significant increase in accuracy for
the specific tasks of Length (surface property), Top Constituents (syntactic property),
and Object Number (semantic information) when training the encoders with multilin-
gual data. Some of the tests result in better scores with the many-to-English setting
compared to the many-to-many set-up, which is slightly surprising. Nevertheless, mul-
tilingual models outperform the bilingual models in all but one test.

5.3 Analysis of Individual Attention Heads

So far we have evaluated sentence representations that are based on attention bridges
of different sizes, ranging from 1 A 50 attention heads. The attention bridge matrix as


Table 12
Scores for the SentEval probing tasks (Multi30k models): accuracies of the EN-DE, EN-CS,
EN-FR, M ↔ EN, and M ↔ M models on the Length, WC, Depth, TopConst, BShift, Tense,
SubjNum, ObjNum, SOMO, and CoordInv tasks.


a whole has been used in our experiments. An interesting further step is to investigate
whether specialized roles are assigned to the different attention heads (columns in the
matrix) and whether they effectively learn to focus on different parts of the sentence.

In order to analyze this, we conduct additional studies on the Europarl models,
and we further assess the performance of individual heads on the SentEval linguistic
probing tasks. Our aim is to identify whether some attention heads are particularly
important in some probing task, and whether we see differences in how the roles of the
heads are distributed depending on the size of the attention bridge.

We compare attention bridges of sizes 10, 25, E 50, E, to study the effect of the
individual attention heads, we detach one attention head at a time to be used as the
representation of the sentence and apply it in the various probing tasks. Figure 5
shows two scenarios, attention bridges of sizes 10 E 50 taken from the many-to-
many Europarl model. In both scenarios, the accuracy of each probing task is shown
for each attention head, separately. To test stability, we trained the probing tasks with
five different seeds and present the average accuracy score in the figure. The variance is
very small (of order ≤1e-1) in all cases and, hence, the trends shown in the illustration
are reliable.
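Operationally, detaching a head simply means selecting the vector associated with one attention
head from the attention-bridge output; a small sketch follows, where the axis that ranges over
the k heads and the array shapes are assumptions about the layout rather than the exact code:

import numpy as np

def single_head_representation(bridge_output, head_index, head_axis=0):
    """Select the vector of one attention head from the attention-bridge output
    of a sentence (an array with one axis ranging over the k heads)."""
    return np.take(bridge_output, head_index, axis=head_axis)

# For a corpus encoded as an array of shape (n_sentences, k, d), the probing features
# for head i are simply encodings[:, i, :].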

If the attention heads obtain different, specialized roles, we expect to see a higher
accuracy on one probing task for a specific attention head and a higher accuracy on
some other task for another one. In the case of 10 attention heads (k = 10), it is hard
to see any clear pattern. In contrast, with 50 heads (k = 50) there are some tasks in
which the performance of the heads with a higher index (> 40) is clearly better than
the performance of the ones with low indexes (< 15). The probing tasks in which this
phenomenon is particularly pronounced are word content, top constituents, number of the
subject (subjnumber), number of the object (objnumber), and possibly verb tense (tense).
However, rather than differentiation there is a strong relation between these tasks, such
that an attention head that performs well on one task also performs well on the other tasks.

Figure 5
Performance of each attention head on the probing tasks. Mean accuracy along different runs of
SentEval probing tasks using the trained {DE,FR,CS} ↔ EN models with k = 10 and 50.

From the illustration, we can see that there is an interesting pattern with increasing
performance on attention heads with higher index. This can be seen especially on tasks like
wordcontent, topconstituents, subjnumber, and objnumber. The differences are small but this
trend is still clearly visible and rather unexpected. In order to check whether this observation
is purely coincidental, we computed the average accuracy scores over all Europarl models in
our experiments, averaging the numbers from the ones with the same size of the attention
bridge (in number of heads). This includes bilingual, many-to-one, and many-to-many models.
Figure 6 shows the average accuracy scores and their variance on four tasks: wordcontent,
topconstituents, subjnumber, and objnumber. In this case, we also show figures from the k = 25
settings. The most striking outcome of this study is that the same positive trend can be
observed for all independently trained models with larger attention bridges, without aligning
the inner-attention heads or enforcing this behavior, instead of consistently having a large
range and a stationary mean accuracy, as one would expect from combining, without any
alignment, the results of the inner-attention layers of models trained independently. In the
smallest attention bridge (k = 10) it is difficult to see whether any head performs better than
any other. In the medium size attention bridge (k = 25), there is already a hint at higher
performance on the tasks topconstituents and subjnumber for the attention heads with the
highest indexes. In the largest attention bridge (k = 50), on all four tasks, accuracies clearly
increase with high indexes.

Figure 6
Mean accuracy across the different models with k = 10, 25, 50 on four probing tasks. Gray lines
surrounding the mean values are the min and max values obtained for each head across models.
We can also see that the variance is generally higher in the case of a small attention bridge,
while the scores become more stable with larger attention bridges and, to some extent, higher
attention head indexes. This seems to suggest that there are indeed certain attention heads
that specialize on specific tasks. The system does not have any explicit mechanism to ensure
that the attention heads will be aligned. This means that it is possible that exactly the same
information encoded by head i in one training would be encoded by head j during another
training. We observe that the first attention heads indeed follow this behavior, whereas the
later ones start behaving in a similar way. Why part of this specialization is happening in the
same order is less clear, and this is probably due to some internals of the underlying
implementation of the training algorithms.

To have a better understanding of why the last attention heads seem to be the best ones in
some of the probing tasks, we plotted heat maps of which areas of a sentence each of them
focuses on. An example sentence ("We cannot afford to lose more of the momentum that
existed in the beginning of the Nineties.") is shown in Figure 7. For each scenario
(k = 1, 10, 25, 50) the patterns for the attention heads are shown, such that every line
illustrates the main focus areas of one of them. We obtain similar pictures for other sentences
and different models (bilingual, many-to-one, or many-to-many), and the figure is just a good
example of the general picture we get.

Figure 7
Heat map showing where each attention head (rows) focuses, using an example sentence. Four
different attention bridges from the {DE,FR,CS} ↔ EN models are used in the visualization,
where k is the number of heads in the bridge: (a) k = 1, (b) k = 10, (c) k = 25, (d) k = 50.

The illustration suggests that the model learns to distribute attention to individual words or
bigrams among the attention heads until the sentence is sufficiently covered. The order of
positions seems rather random and not consistent, as the variance of the probing task results
suggests. Once there are additional attention heads available, the remaining heads start
picking up longer word sequences, predominantly from the beginning of the sentence. This
effect seems to explain the increasing performance on the probing tasks that we have observed
above. The main clause is often in the beginning of a sentence, which explains why we see
high accuracies on the tasks top constituents, number of subject, and number of object. The goal
in the top constituents task is to determine the high-level syntactic structure of the sentence in
terms of the top constituents immediately below the sentence (S) node in a phrase structure
grammar; an example of such a structure is: ADVP NP VP, as in "Then it happened." In the
number of subject task, we need to determine whether the subject of the main clause of the
sentence is in singular or plural number. The number of the subject is marked on the subject
itself (plural -s), but also in determiners (a, an) and sometimes in the predicate verb (third
person singular -s). Similarly, in the task number of object, we need to decide whether the
object of the main clause is in singular or plural.
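Heat maps like the ones in Figure 7 can be produced directly from the inner-attention weights;
a minimal sketch, assuming the k × n weight matrix A over source tokens is available for a
sentence, is given below:

import matplotlib.pyplot as plt

def plot_attention_heatmap(A, tokens):
    """A: inner-attention weight matrix of shape (k, n), one row per attention head
    and one column per source token."""
    fig, ax = plt.subplots(figsize=(0.4 * len(tokens) + 2, 0.25 * len(A) + 1))
    ax.imshow(A, aspect="auto", cmap="Blues")
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_ylabel("attention head")
    fig.tight_layout()
    return fig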
In order to excel in these tasks, it is crucial to be able to identify the area of the sentence where
the main clause is located, which apparently the last attention heads are often capable of doing.
Furthermore, especially in the tasks top constituents and word content it is necessary to analyze
multiple words. The aim of the word content task is to identify which words in a set of 1,000
preselected mid-frequency words are present in the sentence. An attention head covering a
larger area is naturally more likely to recognize a larger number of words and, hence, performs
better on that task. This also explains why the word content task works well with an attention
bridge consisting of only one attention head (k = 1). If there is only one head available, the
attention needs to be spread over a broader range of words, which is beneficial for some tasks
and detrimental for other tasks (see Section 5.1).

Our hypothesis of differentiated roles of the separate attention heads has been partly
confirmed. Different attention heads do focus on different areas of the sentence. However, they
do not seem to specialize on information that is beneficial for a specific downstream probing
task. Rather, we see significant correlation, such that some attention heads are particularly
strong at multiple probing tasks concurrently.

We can also observe that the attention of individual heads is in general very focused, with a
low kurtosis in its distribution, until the sentence is covered. Once this point is reached, the
penalty term forces the inner-attention to explore collocational relationships, typically starting
from the beginning of the sentence and incrementally extending its span. Interestingly, this
behavior is accentuated by the penalty term in the loss function that we introduced in
Section 2.2. To illustrate this, we computed the entropy over the attention heads of a model
trained with and without the penalty term using the following procedure: We randomly
sampled 10K sentences of length up to 40 tokens from the English monolingual NewsCrawl
corpus 2018,12 and calculated the entropy values over the weights A from Equation (6) using
the EN-DE models with 50 attention heads. In Table 13 we present three different entropy
values using (i) all the attention heads, k ∈ [1, 50], (ii) the initial ni attention heads until
reaching the length of the sentence, k ∈ [1, ni], and (iii) the final 50 − ni attention heads,
k ∈ [ni, 50], where ni is the length of sentence i. A higher entropy value indicates a broader
spread of the attention, which leads us to the conclusion that the penalty term is indeed
forcing the attention to scatter once the initial (position-specific) attention heads have covered
the whole sentence. Up to the length of the sentence (k ∈ [1, ni]) its presence does not make a
significant difference (entropy values 0.49 vs. 0.43).

12 http://www.statmt.org/wmt19/translation-task.html.

Table 13
Comparison of entropy values for models trained with and without penalty term.

                          k ∈ [1, 50]   k ∈ [1, ni]   k ∈ [ni, 50]
With penalty term             1.20          0.49          1.73
Without penalty term          0.63          0.43          0.82
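The entropy computation described above can be sketched as follows, assuming A is the
50 × n attention weight matrix of Equation (6) for a sentence of length n, with each row
normalized over source positions:

import numpy as np

def head_entropies(A, eps=1e-12):
    """Entropy of each head's attention distribution over the source tokens."""
    return -np.sum(A * np.log(A + eps), axis=1)

def entropy_summary(A, sentence_length):
    """Mean entropy over all heads, over heads 1..n_i, and over heads n_i..50 (cf. Table 13)."""
    H = head_entropies(A)
    return H.mean(), H[:sentence_length].mean(), H[sentence_length:].mean()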
However, for the attention heads beyond the sentence length (k ∈ [ni, 50]), we see that the
model without it ends up focusing on a much smaller portion of the sentence (entropy 0.82)
compared with the model incorporating the penalty term (entropy 1.73). Manual inspection of
selected heat maps verifies this result, with redundant attention mostly fixed on the first token
of the sentence in models where the penalty term is left out.

In future work, we would like to investigate the behavior of attention in our model with a
larger diversity of languages and tasks. An interesting question is whether we can see the
same trend with less related languages included in our data and whether we can force
specialization using certain constraints or augmented loss functions during training. We also
want to explore the effect of adding other tasks that can be modeled with the same
architecture, including speech recognition, sequence labeling, or parsing.

6. Related Work

Before concluding the paper, we briefly summarize related work. Multilingual NMT has been
widely studied and developed in different pathways during recent years (Dong et al. 2015;
Luong et al. 2016; Chen et al. 2017; Johnson et al. 2017). Work has been done with networks
that use language-specific encoders and decoders, such as Dong et al. (2015), who used a
separate attention mechanism for each decoder in one-to-many translation. Zoph and Knight
(2016) exploited a multi-way parallel corpus in a many-to-one multilingual scenario, while
Firat, Cho, and Bengio (2016) used language-specific encoders and decoders that share a
traditional attention mechanism in a many-to-many scheme. Another approach is the use of
universal encoder–decoder networks that share embedding spaces to improve the performance
of the model, like the one proposed by Gu et al. (2018b) for improving translation of
low-resourced languages and the one from Johnson et al. (2017), where the term zero-shot
translation was coined. Even though the latter model is proven to be effective and appealing
thanks to its simplicity, it does not scale well to the use of many languages. Sharing all
parameters of the model leads to strongly degrading performance when covering many
diverse languages, and new languages cannot easily be added on either the encoder or the
decoder side. Furthermore, it does not provide a straightforward meaning representation that
can be used for downstream tasks.

Sentence meaning representation has also been vastly studied under NMT settings. When
introducing the encoder–decoder architectures for MT, Sutskever, Vinyals, and Le (2014)
showed that seq2seq models are better at encoding the meaning of sentences into vector
spaces than the bag-of-words model. Recent work includes that of Schwenk and Douze (2017),
who used multiple encoders and decoders that are connected through a shared layer, albeit
with a different purpose than performing translations; Platanios et al.
(2018) showed an intermediate representation that can be decoded to any target language
while describing a parameter generation method for universal NMT; Britz, Guan, and Luong
(2017) made a computational efficiency analysis for MT using a fixed-size attention layer;
Artetxe and Schwenk (2019) used a shared LSTM with max pooling to learn sentence
embeddings on 93 translation directions; Cífka and Bojar (2018) introduced an architecture
with a self-attentive layer to extract sentence meaning representations of fixed size. Here we
use a similar architecture but in a multilingual setting.

Our work on multilingual MT and sentence representations is closely related to the study
from Lu et al. (2018). There, the authors attempt to build a neural interlingua by using
language-independent encoders and decoders that share an attentive LSTM layer. Our
approach differs in the choice of the crosslingual shared layer; we use a shared inner-attention
mechanism in contrast to having a feedforward layer on top of a shared LSTM. Additionally,
we also experiment in a multilingual many-to-many setting, instead of only exploring the
one-to-many or many-to-one scenarios.

7. Conclusions

We have shown that fixed-size sentence representations can effectively be learned with
multilingual machine translation using an inner-attention layer and scheduled training with
multiple translation tasks. The performance of the model heavily depends on the size of the
intermediate representation layer, and we can show that a higher number of attention heads
leads to improved translation and stronger representations in supervised downstream tasks, a
result that contradicts earlier findings. Multilinguality also helps boost the performance in the
aforesaid downstream tasks, although it does not necessarily contribute to improving
translation performance when trained with large data sets. However, multilinguality does
substantially benefit translation performance in low-resource scenarios. The multilingual
training objectives enable effective transfer learning that leads to an improvement of up to
4.43 absolute BLEU points in an established image caption translation task. We can observe
that even languages that are different from both the source and the target language of the test
case help the overall model to achieve a better translation quality. This indirectly hints at the
additional semantic abstraction that the model can pick up from the multilingual signal.

Our further analysis reveals that the attention bridge model mainly struggles with long
sentences. In fact, the decrease in performance that we observe on the Europarl domain and
the large-data scenario is entirely due to the drop in performance on sentences above
45 tokens. Here we can see the limits of the fixed-size attention bridge, and the results are not
particularly surprising. A future research direction is to study ways to prevent this behavior
and to investigate the impact of increasing the linguistic diversity on translation performance.

The main purpose of the model is to act as an efficient way of learning language-agnostic
representations using translation as the auxiliary training objective.
We verify on the larger data set that the system is able to produce sentence embeddings that
encode essential syntactic and semantic information in a sentence. The results on downstream
tasks in the large-data set-up follow the same trend as we have shown in the low-resource
scenario. In particular, multilinguality helps in supervised tasks. Generalizations can be
improved even from language pairs not involving the language we test on (English).

We make a careful study of the impact of the size of the attention bridge and the role of
individual attention heads in encoding linguistic information. In short, increasing the size of
the bridge not only leads to better translation performance but also to improved results in
supervised downstream tasks. To further study the impact of individual attention heads on
sentence encoding, we detach each of them in turn and test their ability to perform a number
of linguistic probing tasks. This experiment reveals a surprising trend in which the attention
heads with a high index in the attention bridge perform better on various tests. Looking at the
regions of sentences where these higher-index heads focus shows that they spread their
attention more broadly than their lower-index counterparts across the encoded sentence,
which is beneficial for those probing tasks. A similar explanation can be found for attention
bridges of limited size that perform well on the same probing tasks, in which an attention
head spreads the focus over larger regions of the sentence. This analysis provides important
insights about the process of encoding and its applicability in other tasks.

Our findings open many directions for future research. First of all, we would like to study the
impact of increasingly diverse training material on sentence representations that can be
learned in the translation set-up. This includes the use of additional languages from different
language families and the integration of multiple domains and textual genres, as well as the
effect of using resource-rich language pairs on the performance of low-resource languages.
We would like to further investigate the linguistic properties that a system picks up and, for
example, study the implicit relational structure (syntactic and semantic) that is captured by
inner-attention in a multilingual encoder. For this, we will also increase the complexity of the
shared architecture with additional layers and connections. We will systematically look at
different training schedules that vary the amounts of shared parameters depending on
linguistic and typological properties that we want to investigate.

Acknowledgments
This work is part of the FoTran project, funded by the European Research Council (ERC)
under the European Union's Horizon 2020 research and innovation programme (grant
agreement no 771113). The authors gratefully acknowledge the support of the Academy of
Finland through project 314062 from the ICT 2023 call on Computation, Machine Learning
and Artificial Intelligence and projects 270354 and 273457. Finally, we would also like to
acknowledge CSC – IT Center for Science, Finland, for computational resources, as well as
NVIDIA and their GPU grant.

References
Adi, Yossi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. 5th International Conference on Learning Representations, ICLR 2017, Conference Track (Poster), Toulon.
Agirre, Eneko, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea et al. 2015. SemEval-2015 Task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop 419 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 6 2 3 8 7 1 8 4 7 6 4 0 / c o l i _ a _ 0 0 3 7 7 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 46, Number 2 on Semantic Evaluation (SemEval 2015), pages 252–263, Denver, CO. Agirre, Eneko, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 Task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91, Dublin. Agirre, Eneko, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 Task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511, San Diego, CA. Agirre, Eneko, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, volume 1, pages 32–43, Atlanta, GA. Agirre, Eneko, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 Task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 385–393, Montreal. Arora, Sanjeev, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. 5th International Conference on Learning Representations, ICLR 2017, Conference Track (Poster), Toulon. Artetxe, Mikel and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. 3rd International Conference on Learning Representations, ICLR 2015, Conference Track, San Diego, Ca. Bau, Anthony, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2019. Identifying and controlling important neurons in neural 420 machine translation. 7th International Conference on Learning Representations, ICLR 2019, Conference Track (Poster), New Orleans, LA. Belinkov, Yonatan, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 861–872, Vancouver. Blackwood, Graeme, Miguel Ballesteros, and Todd Ward. 2018. Multilingual neural machine translation with task-specific attention. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3112–3122, Santa Fe, NM. 
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146. Bojar, Ondˇrej, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, and Karin Verspoor. 2018. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels. Bowman, Samuel R., Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon. Britz, Denny, Melody Guan, and Minh-Thang Luong. 2017. Efficient attention using a fixed-size memory representation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 392–400, Copenhagen. Callison-Burch, Chris, Philipp Koehn, Cameron Shaw Fordyce, and Christof Monz, editors. 2007. Proceedings of the Second Workshop on Statistical Machine Translation, Prague. Cer, Daniel, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 6 2 3 8 7 1 8 4 7 6 4 0 / c o l i _ a _ 0 0 3 7 7 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Vázquez et al. A Systematic Study of Inner-Attention-Based Sentence Representations in NMT 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver. Chen, Qian, Zhen-Hua Ling, and Xiaodan Zhu. 2018. Enhancing sentence embedding with generalized pooling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1815–1826, Santa Fe, NM. Chen, Yun, Yang Liu, Yong Cheng, and Victor O. K. Li. 2017. A teacher-student framework for zero-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1925–1935, Vancouver. Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha. Cífka, Ondˇrej and Ondˇrej Bojar. 2018. Are BLEU and meaning representation in opposition? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1362–1371. Conneau, Alexis and Douwe Kiela. 2018. SentEval: An Evaluation Toolkit for Universal Sentence Representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 1699–1704. Conneau, Alexis, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen. Conneau, Alexis, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018a. 
What you can cram into a single vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne. Conneau, Alexis, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018b. Word translation without parallel data. 6th International Conference on Learning Representations, ICLR 2018, Conference Track (Poster), Vancouver. Conneau, Alexis, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018c. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels. Dalvi, Fahim, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, and Stephan Vogel. 2017. Understanding and improving morphological learning in the neural machine translation decoder. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 142–151. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, MN. Dolan, Bill, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pages 350–356, Geneva. Dong, Daxiang, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1723–1732. Dou, Zi Yi, Zhaopeng Tu, Xing Wang, Shuming Shi, and Tong Zhang. 2018. Exploiting deep representations for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4253–4262. Elliott, D., S. Frank, K. Sima’an, and L. Specia. 2016. Multi30k: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74, Berlin. Firat, Orhan, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared 421 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 6 2 3 8 7 1 8 4 7 6 4 0 / c o l i _ a _ 0 0 3 7 7 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 46, Number 2 attention mechanism. In Proceedings of NAACL-HLT, pages 866–875, San Diego, CA. Firat, Orhan, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. 2016. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 268–277, Austin, TX. Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional Sequence to Sequence Learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pages 1243–1252, Sydney. 
Grave, Edouard, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), pages 3483–3487, Miyazaki.
Graves, Alex and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5–6):602–610.
Gu, Jiatao, Hany Hassan, Jacob Devlin, and Victor O. K. Li. 2018a. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344–354, New Orleans, LA.
Gu, Jiatao, Hany Hassan, Jacob Devlin, and Victor O. K. Li. 2018b. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344–354, New Orleans, LA.
Ha, Thanh Le, Jan Niehues, and Alexander Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. In Proceedings of the International Workshop on Spoken Language Translation, IWSLT 2016.
Hill, Felix, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367–1377, San Diego, CA.
Hu, Minqing and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177, Seattle, WA.
Johnson, Melvin, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia.
Kingma, Diederik P. and Jimmy Ba. 2015. Adam: A method for stochastic optimization. 3rd International Conference on Learning Representations, ICLR 2015, Conference Track (Poster), San Diego, CA.
Klein, Guillaume, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, Vancouver.
Koehn, Philipp. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86.
Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL), Demo and Poster Sessions, Prague.
Kruszewski, Germán, Angeliki Lazaridou, Marco Baroni, et al. 2015. Jointly optimizing word representations for lexical and sentential tasks with the C-PHRASE model. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 971–981, Beijing.
Lakew, Surafel Melaku, Mauro Cettolo, and Marcello Federico. 2018. A comparison of transformer and recurrent neural networks on multilingual neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 641–652, Santa Fe, NM.
Lee, Jason, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378.
Lin, Zhouhan, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In 5th International Conference on Learning Representations, ICLR 2017, Conference Track (Poster).
Lu, Yichao, Phillip Keung, Faisal Ladhak, Vikas Bhardwaj, Shaonan Zhang, and Jason Sun. 2018. A neural interlingua for multilingual machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 84–92, Brussels.
Luong, Minh Thang, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. 4th International Conference on Learning Representations, ICLR 2016, Conference Track (Poster), San Juan.
Luong, Thang, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon.
Marelli, Marco, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli, et al. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC, pages 216–223, Reykjavik.
McCann, Bryan, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305, Long Beach, CA.
Pang, Bo and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 271–278, Barcelona.
Pang, Bo and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 115–124, Ann Arbor, MI.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha.
Platanios, Emmanouil Antonios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 425–435, Brussels.
Poliak, Adam, Yonatan Belinkov, James Glass, and Benjamin Van Durme. 2018. On the evaluation of semantic phenomena in neural machine translation using natural language inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 513–523, New Orleans, LA.
Post, Matt. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels.
Reimers, Nils and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3973–3983, Hong Kong.
Schwenk, Holger. 2018. Filtering and mining parallel data in a joint multilingual space. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 228–234, Melbourne.
Schwenk, Holger and Matthijs Douze. 2017. Learning joint multilingual sentence representations with neural machine translation. In ACL Workshop on Representation Learning for NLP, pages 157–167, Vancouver.
Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin.
Shi, Xing, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, Austin, TX.
Socher, Richard, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, WA.
Subramanian, Sandeep, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. 6th International Conference on Learning Representations, ICLR 2018, Conference Track (Poster), Vancouver.
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Ghahramani, Z., M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, Curran Associates, Inc., pages 3104–3112.
Tai, Kai Sheng, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing.
Tao, Chongyang, Shen Gao, Mingyue Shang, Wei Wu, Dongyan Zhao, and Rui Yan. 2018. Get the point of my utterance! Learning towards effective responses with multi-head attention mechanism. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 4418–4424, Stockholm.
Tu, Zhaopeng, Yang Liu, Zhengdong Lu, Xiaohua Liu, and Hang Li. 2017. Context gates for neural machine translation. Transactions of the Association for Computational Linguistics, 5:87–99.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, Long Beach, CA.
Voorhees, Ellen M. and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200–207, Athens.
Wang, Yining, Jiajun Zhang, Feifei Zhai, Jingfang Xu, and Chengqing Zong. 2018. Three strategies to improve one-to-many multilingual translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2955–2960, Brussels.
Wiebe, Janyce, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2–3):165–210.
Zoph, Barret and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of NAACL-HLT, pages 30–34, San Diego, CA.