Efficient Content-Based Sparse Attention with Routing Transformers
Aurko Roy Mohammad Saffar Ashish Vaswani David Grangier
Google Research
{aurkor,msaffar,avaswani,grangier}@google.com
Abstract

Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic computation and memory requirements with respect to sequence length. Successful approaches to reduce this complexity focused on attending to local sliding windows or a small set of locations independent of content. Our work proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to attend to content unrelated to the query of interest. This work builds upon two lines of research: It combines the modeling flexibility of prior work on content-based sparse attention with the efficiency gains from approaches based on local, temporal sparse attention. Our model, the Routing Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall complexity of attention to O(n^1.5 d) from O(n^2 d) for sequence length n and hidden dimension d. We show that our model outperforms comparable sparse attention models on language modeling on Wikitext-103 (15.8 vs 18.3 perplexity), as well as on image generation on ImageNet-64 (3.43 vs 3.44 bits/dim) while using fewer self-attention layers. Moreover, we set a new state-of-the-art on the newly released PG-19 data-set, obtaining a test perplexity of 33.2 with a 22 layer Routing Transformer model trained on sequences of length 8192. We open-source the code for Routing Transformer in Tensorflow.1

1 https://github.com/google-research/google-research/tree/master/routing_transformer
1 Introduction

Generative models of sequences have witnessed rapid progress driven by the application of attention to neural networks. In particular, Bahdanau
et al. (2015), Cho et al. (2014), and Vaswani et al.
(2017) relied on attention to drastically improve
the state-of-the art in machine translation. Subse-
quent research (Radford et al., 2018; Devlin et al.,
2019; Liu et al., 2019; Yang et al., 2019) demon-
strated the power of self-attention in learning
powerful representations of language to address
several natural language processing tasks. Self-
attention also brought impressive progress for
generative modeling outside of language, for example, image (Parmar et al., 2018; Menick and Kalchbrenner, 2018; Child et al., 2019) and music generation (Huang et al., 2018; Child et al., 2019).
Self-attention operates over sequences in a
step-wise manner: At every time-step, attention assigns an attention weight to each previous input element (representation of past time-steps) and uses these weights to compute the representation of the current time-step as a weighted sum of the past input elements (Vaswani et al., 2017). Self-attention (Shaw et al., 2018) is a particular case
of attention (Bahdanau et al., 2015; Chorowski
et al., 2015; Luong et al., 2015).
Self-attention is commonly used in auto-
regressive generative models. These models gen-
erate observations step-by-step, modeling the
probability of the next symbol given the previously
generated ones. At every time step, self-attentive
generative models can directly focus on any part
of the previous context. In contrast, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have direct interactions with
only a local neighborhood of context around the
current time step.
This advantage, however, comes at a price: Unlike recurrent networks or convolutional networks, the time and space complexity of self-attention is quadratic in n, the length of the sequence. Specifically, for every position i ≤ n, self-attention computes weights for its whole context of length i − 1, which induces a complexity of Σ_{i≤n} (i − 1) = n(n − 1)/2. This makes it difficult
to scale attention-based models to modeling long
sequences. However, long sequences are the norm in many domains, including music, image, speech, video generation, and document-level machine translation.
Therefore, an important research direction is
to investigate sparse and memory efficient forms
of attention in order to scale to tasks with large
sequence lengths. Previous work has proposed
data independent or fixed sparsity patterns bound-
ing temporal dependencies, such as local or strided
attention. At each time step, the model attends
only to a fixed number of time steps in the past
(Child et al., 2019). Extensions to local attention
have suggested learning the length of the tem-
poral sparsity for each attention module in the
network (Sukhbaatar et al., 2019). These strat-
egies draw their inspiration from RNNs and
CNNs and bound their complexity by attend-
ing only to representations summarizing a local neighborhood of the current time step. Their attention matrices (matrices containing the attention weights for every pair of previous and current time-steps) are natively sparse and require instantiating only the non-zero entries. While these ap-
proaches have achieved good results, fixing the
sparsity pattern of a content based mechanism
such as self-attention can limit its ability to pool
in information from large contexts.
As an alternative to local attention, Correia
et al. (2019) consider content-based sparsity, an approach allowing for arbitrary sparsity patterns. This formulation, however, does require instan-
tiating a full dense attention matrix prior to spar-
sification through variants of L0-sparsity or
sparsemax approximations (Blondel et al., 2019).
The present work builds upon these two lines
of research and proposes to retain the modeling
flexibility of content-based sparse attention while
leveraging the efficiency of natively sparse at-
tention matrices. Our formulation avoids sparse-
max variants and relies on clustering of attention
instead. Each attention module considers a
clustering of the space: The current time-step only
attends to context belonging to the same cluster.
In other words, the current time-step query is
routed to a limited number of context elements
through its cluster assignment. This strategy draws
inspiration from the application of spherical k-
means clustering to the Maximum Inner Product
Search (MIPS) problem.
Our proposed model, Routing Transformer,
combines our efficient clustering-based sparse
attention with classical local attention to reach
excellent performance both for language and
image generation. These results are obtained with-
out the need to maintain attention matrices larger
than batch length which is the case with the seg-
ment level recurrence mechanism used in Dai et al.
(2019) and Sukhbaatar et al. (2019). We pres-
ent experimental results on language modeling
(enwik-8, Wikitext-103, and PG-19) y
unconditional image generation (CIFAR-10 and
ImageNet-64). Routing Transformer sets new
state-of-the-art, while having comparable or
fewer number of self-attention layers and heads,
on Wikitext-103 (15.8 vs 18.3 perplexity),
PG-19 (33.2 vs 33.6 perplexity), and on
ImageNet-64 (3.43 vs 3.44 bits/dim). We also report competitive results on enwik-8 (0.99 vs 0.98 bits per byte) and present ablations on
CIFAR-10.
2 Related Work
Attention with Temporal Sparsity: Research on
efficient attention neural models parallels the
advent of attention-based architectures. In the context of speech recognition, Jaitly et al. (2016) proposed the Neural Transducer, which segments sequences into non-overlapping chunks and performs attention in each chunk independently. Limiting attention to a fixed temporal context around the current prediction has also been explored in Chorowski et al. (2015), while Chiu and Raffel (2018) dynamically segment the sequence into variable-sized chunks.
Hierarchical attention strategies have also been
explored: The model first considers which part of
the inputs should be attended to before computing
full attention in a contiguous neighborhood of the
selected area (Gregor et al., 2015; Xu et al., 2015;
Luong et al., 2015). Later, hierarchical attention, simplified by Liu et al. (2018), alternates coarse layers (attending to the whole sequence at a lower temporal resolution) with local layers (attending to a neighborhood of the current prediction). This alternating strategy is also employed by Child et al. (2019), who introduce bounded and strided attention, namely, attending to a fixed
context in the past at a sub-sampled temporal
resolution. This work formalizes such a strategy
using a sparse attention formalism, showing how
it relates to full attention with a specific sparsity
pattern in the attention matrix. It shows that
sparse attention is sufficient to get state-of-the-art
results in modeling long sequences over language
modeling, image generation, and music generation.
Sukhbaatar et al. (2019) build upon this work and
show that it is possible to obtain further sparsity by
letting the model learn the length of the temporal
context for each attention module. This work also
makes use of the attention cache introduced in Dai
et al. (2019), a memory mechanism to train models
over temporal contexts which extend beyond the
length of the training batches.
Attention with Content-Based Sparsity: The above work mainly relies on two efficient ideas: attending to fewer elements by only considering a fixed bounded local context in the past, and attending to fewer elements by decreasing the temporal resolution of the context. These ideas do not allow arbitrary sparsity patterns in attention matrices. Content-based sparse attention has been introduced to allow for richer patterns and more expressive models. Martins and Kreutzer (2017) and Malaviya et al. (2018) propose to compute attention weights with variants of sparsemax. Correia et al. (2019) generalize this approach to every layer in a Transformer using entmax, which allows for more efficient inference. This line of work allows for learning arbitrary sparse attention patterns from data, based on the content of the current query and past context. However, sparsity here cannot be leveraged to improve space and time complexity because sparsemax/entmax formulations require instantiating the full attention matrix prior to sparsification. This is a drawback compared with temporal sparsity approaches. Our work is motivated by bridging this gap and allows for arbitrary sparsity patterns while only instantiating the non-zero entries of the attention matrices.
Contemporaneous to our work, Kitaev et al.
(2020) proposed to use Locality Sensitive Hashing
(LSH) using random hyperplanes to infer content
based sparsity patterns for attention: tokens that fall into the same hash bucket get to attend to each other. While similar in spirit to our approach, the approach of Kitaev et al. (2020) keeps the randomly initialized hyperplanes fixed throughout, while we use mini-batch spherical k-means to learn the space-partitioning centroids. The motivation in both approaches is to approximate MIPS in the context of dot product attention, for which both LSH and spherical k-means have been used in the literature. However, spherical k-means is typically known to outperform LSH for MIPS (see, e.g., Auvolat et al., 2015). This is borne out in the common task of ImageNet-64 generation, where Reformer gets around 3.65 bits/dim (Figure 3), while the Routing Transformer gets 3.43 bits/dim (see Table 4 for a comparison).
Sparse Computation beyond Attention:
Learning models with sparse representations/
activations for saving time and computation has
been addressed in the past in various contexts.
Previous work often refers to this goal as gating
for conditional computation. Gating techniques
relying on sampling and straight-through gradient
estimators are common (Bengio et al., 2013; Eigen
et al., 2013; Cho and Bengio, 2014). Conditional
computation can also be addressed with rein-
forcement learning (Denoyer and Gallinari, 2014;
Indurthi et al., 2019). Memory augmented neural
networks with sparse reads and writes have also
been proposed in Rae et al. (2016) as a way to scale
Neural Turing Machines (Graves et al., 2014). In the domain of language modeling, a related work is the sparsely gated Mixture-of-Experts (MoE) (Shazeer et al., 2017), where sparsity is induced by experts and a trainable gating network controls the routing strategy to each sub-network. Another related work is Lample et al. (2019), who use product quantization based key-value lookups to replace the feed-forward network in the Transformer. Our work differs from theirs in that we make use of dynamic key-value pairs to infer sparsity patterns, while their key-value pairs are the same across examples.
3 Self-Attentive Auto-regressive Sequence Modeling

Auto-regressive sequence models decompose the probability of a sequence x = (x1, . . . , xn) as

p(x) = pθ(x1) ∏_{i=2}^{n} pθ(xi | x<i).
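As a concrete illustration of this decomposition, the sequence negative log-likelihood is simply a sum of per-step conditional terms. The short sketch below is ours; the model callable and helper name are stand-ins, not part of the released implementation:

import numpy as np

def sequence_nll(x, model):
    """x: list of symbol ids; model(prefix) -> probability vector over the vocabulary."""
    nll = 0.0
    for i in range(len(x)):
        probs = model(x[:i])          # conditioned only on the past, x_<i
        nll -= np.log(probs[x[i]])    # -log p_theta(x_i | x_<i)
    return nll

# Toy usage: a "model" that is uniform over a vocabulary of size 4.
uniform = lambda prefix: np.full(4, 0.25)
print(sequence_nll([0, 3, 1], uniform))   # = 3 * ln 4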
pθ(xi|X 0, such that
follows that
(cid:8)Qi − μ(cid:8) , (cid:8)Kj − μ(cid:8) < ε. This implies via
triangle inequality that:
(13)
Thus, from Equation 12 it follows that, Q(cid:4)
i Kj >
1 − 2ε2. Por lo tanto, when two time steps i > j
are assigned the same cluster due to a small
(cid:8)Qi − μ(cid:8) , (cid:8)Kj − μ(cid:8) distancia, it also means that
their attention weight Q(cid:4)
i Kj is high, a saber, Kj
is an approximate solution to the MIPS objective
of Equation 9 for query Qi. This analysis shows
that our clustering routing strategy preserves large
attention weights as non-zero entries.
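The following is a quick numerical check of this bound; it is our own illustration, and it assumes that Equation 12 is the unit-norm identity ‖Qi − Kj‖² = 2 − 2 Qi⊤Kj (the sampling helper below is hypothetical):

import numpy as np

rng = np.random.default_rng(0)
eps, d = 0.1, 64
mu = rng.normal(size=d)
mu /= np.linalg.norm(mu)              # unit-norm centroid

def unit_vector_near(center, radius):
    # Rejection-sample a unit vector within `radius` of `center`.
    while True:
        v = center + radius * rng.normal(size=d) / np.sqrt(d)
        v /= np.linalg.norm(v)
        if np.linalg.norm(v - center) < radius:
            return v

Q_i, K_j = unit_vector_near(mu, eps), unit_vector_near(mu, eps)
# Triangle inequality gives ||Q_i - K_j|| < 2*eps, hence Q_i . K_j > 1 - 2*eps^2.
assert Q_i @ K_j > 1 - 2 * eps ** 2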
Because we route attention via spherical k-means clustering, we dub our model Routing Transformer. We give a detailed pseudo-code implementation for the routing attention computation in Algorithm 1. A visualization of the attention scheme and its comparison to local and strided attention is given in Figure 1. The computational complexity of this variant of sparse attention is O(nkd + n^2 d/k). Cluster assignments correspond to the first term, that is, it compares n routing vectors to all k centroids in a space of size d. Query/key dot products correspond to the second term, that is, assuming balanced clusters, each of the n queries is compared to the n/k keys in its cluster through a dot product of dimension d. Therefore the optimal choice of k is √n, as in Child et al. (2019), thereby reducing the overall memory and computational cost to O(n^1.5 d) instead of O(n^2 d) (Vaswani et al., 2017).
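For completeness, a brief derivation of this choice (a standard calculus argument, not reproduced from the paper itself): treating the cost as a function of the number of clusters,

f(k) = n k d + \frac{n^2 d}{k}, \qquad
f'(k) = n d - \frac{n^2 d}{k^2} = 0 \;\Longrightarrow\; k = \sqrt{n}, \qquad
f(\sqrt{n}) = 2\, n^{1.5} d = O(n^{1.5} d).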
Algorithm 1 Routing Attention
1: Queries, Keys and Values: Q, K, V ∈ R^{n×d}
2: Centroid: μ ∈ R^{k×d}
3: decay: λ
4: if left to right mask then
5:   K ← Q
6: ▷ Normalize to unit ball
7: Q ← LayerNorm(Q)   ▷ scale, bias disabled
8: K ← LayerNorm(K)   ▷ scale, bias disabled
9: Qprod ← μQ⊤   ▷ k × n
10: if not left to right mask then
11:   Kprod ← μK⊤   ▷ k × n
12: w ← n/k   ▷ attention window
13: Qidx ← top-k(Qprod, w)   ▷ k × w
14: Qidx ← sort(Qidx)   ▷ sort to preserve order
15: Kidx ← Qidx   ▷ k × w
16: if not left to right mask then
17:   Kidx ← top-k(Kprod, w)   ▷ k × w
18:   Kidx ← sort(Kidx)   ▷ sort to preserve order
19: Q′ ← gather(Q, Qidx)   ▷ k × w × d
20: K′ ← gather(K, Kidx)   ▷ k × w × d
21: V′ ← gather(V, Kidx)   ▷ k × w × d
22: A ← Q′(K′)⊤   ▷ k × w × w
23: if left to right mask then
24:   A ← ltr(A)
25: A ← softmax(A)   ▷ k × w × w
26: V′ ← einsum(kww, kwd → kwd, A, V′)
27: X ← scatter(Kidx, V′)
28: Qm ← one-hot(arg max(Qprod))   ▷ k × n
29: Km ← one-hot(arg max(Kprod))   ▷ k × n
30: ▷ Update centroids
31: μ ← λμ + (1 − λ)QmQ/2 + (1 − λ)KmK/2
32: return X
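To make Algorithm 1 concrete, the following is a minimal NumPy sketch of the causal case, in which keys are shared with queries (as discussed later in this section); it is an illustrative re-implementation under our reading of the pseudo-code, not the released TensorFlow code, and the helper names are ours. The centroid update of lines 28-31 is omitted here.

import numpy as np

def layer_norm(x, eps=1e-6):
    # LayerNorm with scale and bias disabled; all rows end up with the same norm,
    # so dot products against the centroids are directly comparable.
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def routing_attention(Q, V, mu):
    """Q: (n, d) queries (= keys), V: (n, d) values, mu: (k, d) centroids."""
    n, d = Q.shape
    k = mu.shape[0]
    w = n // k                                    # attention window per cluster
    Qn = layer_norm(Q)
    Q_prod = mu @ Qn.T                            # (k, n) centroid scores
    # Balanced assignment: each centroid takes its top-w highest-scoring tokens.
    Q_idx = np.argsort(-Q_prod, axis=-1)[:, :w]   # (k, w)
    Q_idx = np.sort(Q_idx, axis=-1)               # sort to preserve temporal order
    Qg = Qn[Q_idx]                                # (k, w, d) gathered queries/keys
    Vg = V[Q_idx]                                 # (k, w, d) gathered values
    A = np.einsum('kid,kjd->kij', Qg, Qg)         # (k, w, w) attention logits
    # Because Q_idx is sorted, a lower-triangular mask within each cluster
    # enforces attention to equal-or-earlier positions only.
    causal = np.tril(np.ones((w, w), dtype=bool))
    A = np.where(causal, A, -1e9)
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)            # softmax over the window
    Vo = np.einsum('kij,kjd->kid', A, Vg)         # (k, w, d) attended values
    # Scatter back to the original positions (overlapping assignments are summed).
    X = np.zeros_like(V)
    np.add.at(X, Q_idx.reshape(-1), Vo.reshape(-1, d))
    return X

X = routing_attention(np.random.randn(64, 16), np.random.randn(64, 16), np.random.randn(8, 16))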
In practice, we apply mini-batch k-means to train the cluster centroids. However, in order to infer balanced routing patterns, we define the sets Si to be of equal size, roughly n/k ∼ √n; that is, for every centroid μi we sort tokens by distance to μi and cluster membership is determined by this threshold (top-k). This adds an additional O(n log n) term to the cost; however, note that this is eclipsed by the dominating term of O(n^1.5 d). This strategy is simple and efficient. In particular, it guarantees that all clusters have the same size, which is extremely important in terms of computational efficiency on parallel hardware like graphic cards. As a downside, this assignment does not guarantee that each point belongs to a single cluster. In the future, we want to investigate using balanced variants of k-means (Banerjee and Ghosh, 2004; Malinen and Fränti, 2014), which are not common in an online setting.
Figure 1: 2-D attention schemes for the Routing Transformer compared to local attention and strided attention of Child et al. (2019). The rows represent the outputs while the columns represent the inputs. For local and strided attention, the colored squares represent the elements every output row attends to. For attention routed as in Section 4.1, the different colors represent cluster memberships for the output token.
During training, we update each cluster centroid μ by an exponentially moving average of all the keys and queries assigned to it:

μ ← λμ + (1 − λ)/2 · Σ_{i: μ(Qi)=μ} Qi + (1 − λ)/2 · Σ_{j: μ(Kj)=μ} Kj,

where λ is a decay parameter that we usually set to 0.999. In addition, we also exclude padding tokens from affecting the centroids.
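A small sketch of this update (our own illustration; the function and argument names are assumptions, and hard nearest-centroid assignments are used for the memberships):

import numpy as np

def update_centroids(mu, Q, K, padding_mask, decay=0.999):
    """mu: (k, d) centroids; Q, K: (n, d) unit-normalized queries and keys;
       padding_mask: (n,) boolean, True for padding tokens."""
    keep = ~padding_mask
    Qv, Kv = Q[keep], K[keep]                 # exclude padding tokens
    q_assign = np.argmax(Qv @ mu.T, axis=-1)  # nearest centroid per query
    k_assign = np.argmax(Kv @ mu.T, axis=-1)  # nearest centroid per key
    new_mu = mu.copy()
    for c in range(mu.shape[0]):
        q_sum = Qv[q_assign == c].sum(axis=0)
        k_sum = Kv[k_assign == c].sum(axis=0)
        new_mu[c] = decay * mu[c] + 0.5 * (1 - decay) * (q_sum + k_sum)
    return new_mu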
There is an additional nuance regarding clustering queries and keys that comes into play when using causal attention (i.e., left to right masking), as is usually the case in language models. When grouping queries and keys belonging to a certain cluster centroid μ, we may get as members queries Qi for keys Kj where time-step i ≤ j. This therefore requires an additional masking strategy on top of the lower triangular mask used for causal attention. One solution that avoids having to use an additional mask is to simply share keys and queries. Empirically, we have found that this works on par with or better than separate keys and queries together with an additional masking strategy in the causal attention setting. For encoder self-attention and encoder-decoder cross-attention, additional masking or sharing queries and keys is not necessary.
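For illustration, if separate queries and keys were routed in the causal setting, the additional mask within a cluster could be built from the gathered original positions as follows (a hypothetical helper, not part of the paper's implementation):

import numpy as np

def causal_mask_for_cluster(q_idx, k_idx):
    """q_idx: (w,) original positions of the gathered queries (sorted);
       k_idx: (w,) original positions of the gathered keys (sorted).
       Returns a (w, w) boolean mask, True where attention is allowed."""
    return k_idx[None, :] <= q_idx[:, None]

print(causal_mask_for_cluster(np.array([2, 5, 9]), np.array([1, 4, 7])))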
5 Experiments
We evaluate our sparse attention model on
various generative modeling tasks including text
and image generation. The following sections report our results on CIFAR-10, Wikitext-103 (Merity et al., 2017), enwik-8 (Mahoney, 2011), ImageNet-64, as well as PG-19 (Rae et al., 2020). We find that a scaled up version of local attention is a surprisingly strong baseline and that our Routing Transformer outperforms Transformer-XL (Dai et al., 2019) and the Sparse Transformer model of Child et al. (2019) on all tasks. On the recently released PG-19 data-set, we find that local attention again is a strong baseline, with a slightly worse performance compared to Transformer-XL (Dai et al., 2019). We also find that the Routing Transformer model outperforms both Transformer-XL (Dai et al., 2019) and Compressive Transformer (Rae et al., 2020), setting a new state-of-the-art result.

In all our models except the one used for PG-19, we allocate half the heads to do local attention and the other half to route attention as in Equation 8. For all our experiments except for PG-19, we use the Adam optimizer (Kingma and Ba, 2015) with learning rate 2 × 10−4 with β1 = 0.9 and β2 = 0.98, following the learning rate schedule described in Vaswani et al. (2017). We train all models on 128 TPUv3 cores. The setup used for PG-19 is described in Section 5.5.
5.1 CIFAR-10
CIFAR-10 is a widely used image data-set
that consists of 60,000 colored images of size
32 × 32. Since the sequence lengths in this case are relatively short (3072), we use this as a toy data-set to perform various ablations to tease apart the effect of various hyperparameter choices on the model performance.
Model                  Routing heads   Routing layers   Attention window   Bits/dim   Steps/sec
Transformer            0               0                3072               2.983      5.608
Local Transformer      0               0                512                3.009      9.023
Random Transformer     4 (random)      8 (random)       512                3.076      5.448
Routing Transformer    2               2                512                3.005      7.968
Routing Transformer    4               2                512                2.986      7.409
Routing Transformer    8               2                512                2.992      6.682
Routing Transformer    2               4                512                2.995      7.379
Routing Transformer    4               4                512                2.975      6.492
Routing Transformer    8               4                512                2.991      5.385
Routing Transformer    2               8                512                2.995      6.442
Routing Transformer    4               8                512                2.971      5.140
Routing Transformer    8               8                512                3.190      3.897
Routing Transformer    2               12               512                2.978      5.685
Routing Transformer    4               12               512                2.994      4.349
Routing Transformer    8               12               512                3.400      3.062
Routing Transformer    2               2                1024               2.975      7.344
Routing Transformer    4               2                1024               2.950      6.440
Routing Transformer    8               2                1024               2.982      5.192
Routing Transformer    2               4                1024               2.990      6.389
Routing Transformer    4               4                1024               2.958      5.112
Routing Transformer    8               4                1024               3.003      3.674
Routing Transformer    2               8                1024               2.991      5.057
Routing Transformer    4               8                1024               2.983      3.597
Routing Transformer    8               8                1024               3.131      2.329
Routing Transformer    2               12               1024               2.973      4.151
Routing Transformer    4               12               1024               3.005      2.788
Routing Transformer    8               12               1024               3.291      1.711
Table 1: Ablation studies of the Routing Transformer model on the CIFAR-10 data-set. All the models have a total of 12 attention layers and 8 heads. Routing layers, when present, are always added at the top of the model. A Routing Transformer model with fewer than 12 routing attention layers and fewer than 8 routing heads has the remaining layers and heads of type local attention. A Random Transformer model has a random attention head in place of the routing attention head. We report the performance in bits/dim on the test set; step times are reported on a TPUv3.
We train 12 layer models with a total of 8 attention heads, and
report a comparison of the effect of various hyper-
parameter choices on the performance and speed
on this data-set. In particular, the following hyper-
parameters are varied 1) the number of routing
attention heads, 2) the number of routing attention
capas, y 3) the size of the attention window.
For routing attention we use k = 6 while varying
the attention window, to see the effect on speed
and performance. All the CIFAR-10 models are
trained with a batch size of 32 and for a total of
200,000 steps. In addition, we also compare the
Routing Transformer to a Random Transformer,
where Kidx is randomly chosen rather than being
drawn from nearest neighbor search. For a fair
comparison, we take the best model from Table 1,
with an attention window of 512 and replace all
routing heads with random heads. We present the
ablation results in Table 1 and discuss it in more
detail in Section 6.
5.2 Wikitext-103
Wikitext-103 (Merity et al., 2017) is a large
public benchmark data-set for testing long term
dependencies in word-level language models. It
contains over 100 million tokens from 28K articles
extracted from Wikipedia with an average of 3.6K
tokens per article, which makes it a reference data-
set to model long-term textual dependencies. We
train a 10 layer Routing Transformer with 16 heads using the relative position encoding of Shaw et al. (2018) and with attention and ReLU dropout rates of 0.3 each. For routing attention as in Section 4.1 we choose k = 16 and an attention window of 256 during both training and evaluation. We describe our results in Table 2 and compare it to other recent work on sparse or recurrent attention such as Adaptive Inputs (Baevski and Auli, 2019) and TransformerXL (Dai et al., 2019), as well as a local attention with relative position encoding baseline (Huang et al., 2018). We find that local attention is a great inductive bias for sparse attention and is better than the adaptive methods proposed in Baevski and Auli (2019) and Sukhbaatar et al. (2019). Moreover, our Routing Transformer model is able to get a test perplexity of 15.8, improving on the 18.3 obtained by TransformerXL (Dai et al., 2019), while having fewer self-attention layers, and without the need for segment level recurrence.
5.3 enwik-8

enwik-8 (Mahoney, 2011) is a data-set to benchmark text compression algorithms in the context of the Hutter prize. This data-set consists of the first 100M bytes of unprocessed Wikipedia. It is typically used to evaluate character-level language models. Similar to the prior work of Dai et al. (2019) and Child et al. (2019) we use a sequence length n = 8192 and benchmark our results against various baselines including local attention. We train a 24 layer model with 8 attention heads, with attention and ReLU dropout rates of 0.4 each, and using the relative position encoding of Shaw et al. (2018). For routing attention as in Section 4.1 we set k = 32 and an attention window of 256. We report 0.99 bits per byte, matching TransformerXL and Sparse Transformer, and slightly behind the 0.98 of Adaptive Transformer.
5.4 ImageNet 64×64

In order to evaluate the ability of our model to capture long term dependencies on a modality other than text, we report results on the ImageNet 64 × 64 data-set as used in Child et al. (2019). For auto-regressive image generation, this data-set consists of images of 64 × 64 × 3 bytes represented as long sequences of length 12,288 presented in raster scan, red-green-blue order. We train a 24 layer model with 16 attention heads, with half the heads performing local attention, and the other half routing attention as in Section 3. For routing attention we set k = 8, attention window 2048, batch size 1, and train our model for roughly 70 epochs as in Child et al. (2019). We compare our model to a scaled-up ImageTransformer model with local attention (Parmar et al., 2018) and the SparseTransformer model of Child et al. (2019).
We find that local attention (Parmar et al.,
2018) is a strong baseline for image generation,
obtaining 3.48 bits/dim when scaled up to 24
layers and 16 heads, compared to later work like
Sub-scale Pixel Networks (SPN) (Menick and
Kalchbrenner, 2018). Our Routing Transformer
model achieves a performance of 3.425 bits/dim
(ver tabla 4) compared to the previous state-
of-the-art of 3.437 bits/dim (Child et al., 2019),
thereby showing the advantage of the content
based sparsity formulation of Section 4.1.
5.5 PG-19

PG-19 is a new data-set released by Rae et al. (2020) which is larger and longer than previous language modeling data-sets. The data-set is created from approximately 28,000 Project Gutenberg books published before 1919, consisting of 1.9 billion tokens, and comprises an average context size of roughly 69,000 words. This is text that is 10× longer in context than all prior data-sets such as Wikitext-103, with minimal pre-processing and an open vocabulary that makes it extremely challenging for long text modeling tasks. We use a subword vocabulary of size approximately 98,000 and report perplexities normalized by the token counts reported in Rae et al. (2020). On this data-set we train a 22 layer Routing Transformer model with 8 heads with a sequence length of 8192 and set a new state-of-the-art result on this data-set, improving on both Compressive Transformers (Rae et al., 2020), as well as Transformer-XL (Dai et al., 2019). For this data-set we change our training setup in three ways. Firstly, we use only 2 routing heads instead of sharing them equally with local heads. Secondly, we use routing heads only in the last two layers of the model instead of having them present in every layer. This is motivated by our empirical finding that long range attention is only needed in the last few layers; see also Rae and Razavi (2020). Finally, we use the Adafactor optimizer (Shazeer and Stern, 2018), which is more memory efficient than Adam in training larger models. We use a learning rate constant of 0.01 with a linear warmup over 10,000 steps followed by a rsqrt normalized decay. We do not make use of any dropout or weight decay. The hidden dimension of our model is 1032 and the batch size is 8192 tokens.
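For concreteness, the PG-19 setup described above can be collected into a single configuration sketch; the key names below are ours, and the routing layer indices are an assumption (the text only states that routing heads are used in the last two of the 22 layers):

pg19_config = {
    "num_layers": 22,
    "num_heads": 8,
    "sequence_length": 8192,
    "hidden_dim": 1032,
    "batch_size_tokens": 8192,
    "routing_heads": 2,                  # instead of an equal local/routing split
    "routing_layers": [20, 21],          # assumed 0-indexed last two layers
    "attention_window": 512,             # as reported in Table 7
    "optimizer": "Adafactor",
    "learning_rate_constant": 0.01,
    "warmup_steps": 10000,
    "lr_schedule": "rsqrt_normalized_decay",
    "dropout": 0.0,
    "weight_decay": 0.0,
    "vocabulary": "subword, approx. 98,000",
}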
Model                                              Layers   Heads   Perplexity
LSTMs (Grave et al., 2017)                         −        −       40.8
QRNNs (Merity et al., 2018)                        −        −       33.0
Adaptive Transformer (Sukhbaatar et al., 2019)     36       8       20.6
Local Transformer                                  16       16      19.8
Adaptive Input (Baevski and Auli, 2019)            16       16      18.7
TransformerXL (Dai et al., 2019)                   18       16      18.3
Routing Transformer                                10       16      15.8

Table 2: Results on language modeling on the Wikitext-103 data-set. Local Transformer refers to Transformer (Vaswani et al., 2017) with relative position encoding (Shaw et al., 2018) together with local attention. Perplexity is reported on the test set.
Model                                              Layers   Heads   Bits per byte
T64 (Al-Rfou et al., 2019)                         64       2       1.13
Local Transformer                                  24       8       1.10
TransformerXL (Dai et al., 2019)                   24       8       0.99
Sparse Transformer (Child et al., 2019)            30       8       0.99
Adaptive Transformer (Sukhbaatar et al., 2019)     24       8       0.98
Routing Transformer                                12       8       0.99

Table 3: Results on language modeling on the enwik-8 data-set. Local Transformer refers to Transformer (Vaswani et al., 2017) with relative position encoding (Shaw et al., 2018) together with local attention. Bits per byte (bpc) is reported on the test set.
Model                                              Layers   Heads   Bits/dim
Glow (Kingma and Dhariwal, 2018)                   −        −       3.81
PixelCNN (Van den Oord et al., 2016)               −        −       3.57
PixelSNAIL (Chen et al., 2018)                     −        −       3.52
SPN (Menick and Kalchbrenner, 2018)                −        −       3.52
ImageTransformer (Parmar et al., 2018)             24       16      3.48
Sparse Transformer (Child et al., 2019)            48       16      3.44
Reformer (Kitaev et al., 2020)                     −        −       3.65
Routing Transformer                                24       16      3.43

Table 4: Results on image generation on ImageNet-64 in bits/dim.
From Table 5, we see that Local Transformer again sets a very strong baseline, with a 24-layer local attention model obtaining a test set perplexity of 39.3, while a 36-layer Transformer-XL gets 36.3. Moreover, a 22-layer Routing Transformer model improves on the 36-layer Compressive Transformer, obtaining a test set perplexity of 33.2 compared to 33.6, while being able to generate sequences of length 8192.
6 Analysis

6.1 Local vs Global

As reported in Section 5, a scaled up version of local attention is a strong baseline for efficient attention over long sequences. From Table 1 we see that local attention is slightly worse than full attention: 3.009 vs 2.983 bits per dim.
Model                                              Layers   Heads   Perplexity
Local Transformer                                  24       8       39.3
TransformerXL (Dai et al., 2019)                   36       −       36.3
Compressive Transformer (Rae et al., 2020)         36       −       33.6
Routing Transformer                                22       8       33.2

Table 5: Results on language modeling on the PG-19 data-set. Local Transformer refers to Transformer (Vaswani et al., 2017) with relative position encoding (Shaw et al., 2018) together with local attention. Perplexity is normalized by the number of tokens reported in Rae et al. (2020) and is reported on the test set.
Layer      JSD(local‖local)      JSD(local‖routing)    JSD(routing‖routing)
layer 0    0.0038 ± 0.0018       0.4706 ± 0.0319       0.1579 ± 0.0576
layer 1    0.3071 ± 0.1217       0.6674 ± 0.0153       0.5820 ± 0.0104
layer 2    0.2164 ± 0.0803       0.5896 ± 0.0249       0.4015 ± 0.0121
layer 3    0.1163 ± 0.0336       0.6047 ± 0.0181       0.4144 ± 0.0264
layer 4    0.1840 ± 0.0562       0.6266 ± 0.0062       0.4191 ± 0.0879
layer 5    0.2284 ± 0.0225       0.6463 ± 0.0155       0.4687 ± 0.0449
layer 6    0.1901 ± 0.0525       0.6471 ± 0.0040       0.5175 ± 0.0469
layer 7    0.1566 ± 0.0685       0.5798 ± 0.0235       0.4350 ± 0.0139
layer 8    0.1638 ± 0.0739       0.5993 ± 0.0148       0.4268 ± 0.0291
layer 9    0.2095 ± 0.0560       0.6127 ± 0.0053       0.3581 ± 0.0019

Table 6: Jensen-Shannon divergence between the attention distributions of a random local attention head and a random head that routes attention as in Section 4.1, per layer, on the Wikitext-103 data-set. We report means and standard deviations computed over 10 runs and use the natural logarithm so that divergences are upper-bounded by 0.6931.
Adding 2 routing layers with 4 heads almost closes the gap with the performance of full attention, achieving 2.986 bits per dim. Adding more routing layers and heads improves performance up to a point, with the best performing model with an attention window of 512 having 4 routing layers and 4 routing heads, and achieving 2.975 bits per dim. Increasing the attention window from 512 to 1024 uniformly results in improvement in every setting. The best model on CIFAR-10 has an attention window of 1024 with 4 routing layers and 4 routing heads. Interestingly, the best Routing Transformer models perform better than full attention, but not by a large enough amount to rule out noise. More importantly, Table 1 shows the importance of local attention in building intermediate representations, with a model with only routing attention layers and heads with attention windows of 512 and 1024 achieving 3.400 and 3.291 bits per dim respectively.
Thus Table 1 shows us the importance of local
representaciones, as well as the benefit of adding
a few routing layers and heads to enforce a more
global representation. Because attention weights
are a probability distribution on the entire set of
tokens, we evaluate the difference in attention
patterns between local and routing attention by
computing the Jensen-Shannon divergence bet-
ween the two kinds of attention distributions for a
random subset of heads in our network on the
Wikitext-103 data-set. The divergence is
computed over the entire sequence length of 4096.
We average over 10 runs and report means and
standard deviations of the JSD in Table 6. Note
that the JSD is always non-negative and is upper-
bounded by 0.6931 when computed using the
natural logarithm. We observe that the divergence
between the different local heads is always very
low compared to the divergence between local
and routing attention heads, which is almost
always very close to the upper-bound of 0.6931.
Divergence between different routing attention
heads falls somewhere in between, being closer
to the upper-bound. This shows that the attention
distribution inferred by the routing attention of
Sección 4.1 is highly non-local in nature and
different heads specialize in attending to very
different parts of the input.
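As a brief illustration of this measurement (our own sketch, not the evaluation code used for Table 6), the Jensen-Shannon divergence between two attention distributions over the same context can be computed as follows:

import numpy as np

def jensen_shannon(p, q, eps=1e-12):
    """p, q: attention distributions over the same context for one query position."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps)))
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps)))
    return 0.5 * kl_pm + 0.5 * kl_qm

# Two near-disjoint attention patterns approach the natural-log upper bound of 0.6931.
local = np.array([0.5, 0.5, 0.0, 0.0])
routed = np.array([0.0, 0.0, 0.5, 0.5])
print(jensen_shannon(local, routed))   # ~0.693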
Model                  Dataset   Seq. length   Layers   Heads   Attention window   Steps/sec
Local Transformer      PG-19     8192          24       8       512                1.231
Routing Transformer    PG-19     8192          22       8       512                0.7236

Table 7: Step time comparison between Local Transformer and Routing Transformer on a TPUv3 for the PG-19 data-set.
Qualitatively, from the ablations in Table 1, we hypothesize that the reason for the strong performance of the Routing Transformer is due to the fact that it combines building local representations over several layers, together with enforcing global consistency for every token. This is achieved via an approximate MIPS over the entire set of tokens (see Section 4.1), and selecting pairs that have a high dot product for attention. This allows various entities such as gender, nouns, dates and names of places to be consistent throughout the entire sequence, since in expectation the dot product similarity between similar entities is high, while for differing entities it is expected to be low. Essentially, we conjecture that for every time step, the prediction depends on a small support of high value tokens: Local attention facilitates local consistency and fluency, while a full dot product attention would facilitate global consistency. However, for long sequences, since full attention is infeasible, we believe that using spherical k-means to perform a MIPS search over the global set of tokens and performing attention between these high dot product items is a good approximation to full dot product attention. The importance of the MIPS search to select high dot product items is highlighted by the ablation in Table 1, where we see that a Random Transformer performs worse compared to a Local Transformer and a Routing Transformer with the same configuration (3.076 vs 3.009 vs 2.971 bits/dim).
6.2 Recurrence vs Sparse Attention

We also note that sparse attention is an orthogonal approach to that of Transformer-XL and Compressive Transformer, which train on small sequences and, by performing careful cross attention over cached previous chunks, hope to generalize to longer sequences. By contrast, we directly train on long sequences from the beginning; for example, the Compressive Transformer trains on chunks of size 512 for PG-19, while we train on sequences of length 8192. The benefit of the Transformer-XL like approach is that it is less memory consuming and thus is able to scale to 36 layers. Sparse attention (including local attention), on the other hand, is more memory expensive since it trains directly on long sequences and therefore can scale to fewer layers for the same problem. However, as we demonstrate, it is competitive with the Transformer-XL like approaches even when using fewer layers and is guaranteed to generalize to the long sequence length that it was trained on.
6.3 Wall-Clock Time

We compare the step times for training the various sparse attention models on the CIFAR-10 data-set in Table 1, as well as on the PG-19 data-set in Table 7. For PG-19 we report only a comparison between the Local Transformer and the Routing Transformer, since sequence lengths are 8192 and performing full attention is infeasible. All the step time comparisons are made on a TPUv3, with the same number of cores and batch sizes to facilitate a fair comparison. As we see from Table 1, local attention is much faster than full attention, training at 9.023 steps per second compared to 5.608 steps per second. The Routing Transformer models on CIFAR-10 have step times that depend on the number of routing heads, with the best performing model with the same attention budget as local attention (i.e., an attention window of 512), which has 8 routing layers and 4 routing heads, training at 5.140 steps per second. Other Routing Transformer models are faster while still matching full attention; for example, 2 routing layers with 4 routing heads trains at 7.409 steps per second. Therefore, the Local Transformer is roughly between 1.22 − 1.76× faster than the best performing Routing Transformers. On the other hand, the full attention Transformer is between 0.76 − 1.09× faster than the best Routing Transformers.

On PG-19, we see from Table 7 that the Local Transformer is roughly 1.7× faster compared to the Routing Transformer, similar to the trend on CIFAR-10. This trade-off with respect to speed
compared to the Local Transformer is due to the
lack of support for sparse operations on the TPU;
on the GPU various sparse kernels have been pro-
posed which promise to significantly speed up
training of these models (Gale et al., 2020). Note
that our goal in this work is a memory efficient
version of sparse attention that can well approx-
imate full attention for long sequences; wall-clock
time efficiency is only a secondary goal.
7 Conclusion

Transformer models constitute the state-of-the-art in auto-regressive generative models for sequential data. Their space-time complexity is, however, quadratic in sequence length, due to their attention modules. Our work proposes a sparse attention model, the Routing Transformer. It relies on content-based sparse attention motivated by non-negative matrix factorization. Compared with local attention models, it does not require fixed attention patterns but enjoys similar space-time complexity. In contrast with prior work on content-based sparse attention, it does not require computing a full attention matrix but still selects sparsity patterns based on content similarity.

Our experiments over text and image generation draw two main conclusions. First, we show that a scaled up version of local attention establishes a strong baseline on modern benchmarks, even compared to recent state-of-the-art models. Second, we show that the Routing Transformer redefines the state-of-the-art in the large long sequence benchmarks of Wikitext-103, PG-19 and ImageNet-64, while being very close to doing so on enwik-8 as well. Our analysis also shows that routing attention modules offer complementary attention patterns when compared to local attention.

Overall, our work contributes an efficient attention mechanism that applies to the modeling of long sequences and redefines the state of the art for auto-regressive generative modeling. Our approach could prove useful in domains where the inputs are naturally sparse, such as 3D point clouds, social networks, or protein interactions.
Acknowledgments

The authors would like to thank Phillip Wang and Aran Komatsuzaki for a Pytorch implementation of Routing Transformer. The authors would also like to thank Yonghui Wu, Weikang Zhou, and Dehao Chen for helpful feedback in improving the implementation of this work. The authors would also like to thank anonymous reviewers and the Action Editor of TACL for their constructive comments, which helped improve the exposition of this work.
References

Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2019. Character-level language modeling with deeper self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3159–3166. DOI: https://doi.org/10.1609/aaai.v33i01.33013159

Alex Auvolat, Sarath Chandar, Pascal Vincent, Hugo Larochelle, and Yoshua Bengio. 2015. Clustering is efficient for approximate maximum inner product search. arXiv preprint arXiv:1507.05910.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In International Conference on Learning Representations.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Arindam Banerjee and Joydeep Ghosh. 2004. Frequency-sensitive competitive learning for scalable balanced clustering on high-dimensional hyperspheres. IEEE Transactions on Neural Networks, 15(3):702–719. DOI: https://doi.org/10.1109/TNN.2004.824416, PMID: 15384557

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

Mathieu Blondel, André F. T. Martins, and Vlad Niculae. 2019. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16–18 April 2019, Naha, Okinawa, Japan, pages 606–615.

Leon Bottou and Yoshua Bengio. 1995. Convergence properties of the k-means algorithms. In Advances in Neural Information Processing Systems, pages 585–592.

Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. 2018. PixelSNAIL: An improved autoregressive generative model. In International Conference on Machine Learning, pages 864–872.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Chung-Cheng Chiu and Colin Raffel. 2018. Monotonic chunkwise attention. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings. OpenReview.net.

Kyunghyun Cho and Yoshua Bengio. 2014. Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning. arXiv preprint arXiv:1406.7362.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577–585.

Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively sparse transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2174–2184. DOI: https://doi.org/10.18653/v1/D19-1223

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988.

Ludovic Denoyer and Patrick Gallinari. 2014. Deep sequential neural network. arXiv preprint arXiv:1410.0510.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1).

David Eigen, Marc'Aurelio Ranzato, and Ilya Sutskever. 2013. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314.

Trevor Gale, Matei Zaharia, Cliff Young, and Erich Elsen. 2020. Sparse GPU kernels for deep learning. arXiv preprint arXiv:2006.10901.

Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017. Improving neural language models with a continuous cache. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. 2015. DRAW: A recurrent neural network for image generation. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 1462–1471. JMLR.org.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2018. Music Transformer: Generating music with long-term structure. In International Conference on Learning Representations.

Sathish Reddy Indurthi, Insoo Chung, and Sangha Kim. 2019. Look harder: A neural machine translation model with hard attention. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 3037–3043. DOI: https://doi.org/10.18653/v1/P19-1290

Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, Ilya Sutskever, David Sussillo, and Samy Bengio. 2016. An online sequence-to-sequence model using partial conditioning. In Advances in Neural Information Processing Systems, pages 5067–5075.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.

Durk P. Kingma and Prafulla Dhariwal. 2018. Glow: Generative flow with invertible 1×1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In International Conference on Learning Representations.

Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2019. Large memory layers with product keys. In Advances in Neural Information Processing Systems, pages 8548–8559.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In International Conference on Learning Representations.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. DOI: https://doi.org/10.18653/v1/D15-1166

Matt Mahoney. 2011. Large text compression benchmark. URL: http://www.mattmahoney.net/text/text.html

Chaitanya Malaviya, Pedro Ferreira, and André F. T. Martins. 2018. Sparse and constrained attention for neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 370–376, Melbourne, Australia. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P18-2059

Mikko I. Malinen and Pasi Fränti. 2014. Balanced k-means for clustering. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 32–41. Springer. DOI: https://doi.org/10.1007/978-3-662-44415-3_4

André F. T. Martins and Julia Kreutzer. 2017. Learning what's easy: Fully differentiable neural easy-first taggers. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 349–362, Copenhagen, Denmark. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D17-1036

Jacob Menick and Nal Kalchbrenner. 2018. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In International Conference on Learning Representations.
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net.

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image Transformer. In International Conference on Machine Learning, pages 4055–4064.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/languageunderstandingpaper.pdf

Jack Rae, Jonathan J. Hunt, Ivo Danihelka, Timothy Harley, Andrew W. Senior, Gregory Wayne, Alex Graves, and Timothy Lillicrap. 2016. Scaling memory-augmented neural networks with sparse reads and writes. In Advances in Neural Information Processing Systems, pages 3621–3629.

Jack Rae and Ali Razavi. 2020. Do transformers need deep long-range memory? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7524–7529. Association for Computational Linguistics. Online. DOI: https://doi.org/10.18653/v1/2020.acl-main.672

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468. DOI: https://doi.org/10.18653/v1/N18-2074

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604.

Sainbayar Sukhbaatar, Édouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 331–335. DOI: https://doi.org/10.18653/v1/P19-1032

Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, and others. 2016. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 2048–2057. JMLR.org.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.