Adaptive Semiparametric Language Models
Dani Yogatama, Cyprien de Masson d’Autume, Lingpeng Kong
DeepMind
London, United Kingdom
{dyogatama,cyprien,lingpenk}@google.com
Abstract
We present a language model that combines
a large parametric neural network (i.e., a
transformer) with a non-parametric episodic
memory component in an integrated architec-
ture. Our model uses extended short-term con-
text by caching local hidden states—similar
to transformer-XL—and global
long-term
memory by retrieving a set of nearest neighbor
tokens at each timestep. We design a gat-
ing function to adaptively combine multiple
information sources to make a prediction.
This mechanism allows the model
to use
either local context, short-term memory, or
long-term memory (or any combination of
them) on an ad hoc basis depending on
the context. Experiments on word-based and
character-based language modeling datasets
demonstrate the efficacy of our proposed
method compared to strong baselines.
1 Introduction
Human language processing is facilitated by com-
plex systems interacting together. A core compo-
nent that enables such a process is human memory.
Memory in humans consists of specialized sys-
tems, which form a basis for intelligent behaviors
(Tulving, 1985; Rolls, 2000; Eichenbaum, 2012).
For language processing, working (short-term)
memory is a temporary storage that can be used to
comprehend sentences and follow conversations.
Episodic (long-term) memory stores individual
experience and events. Semantic memory stores
facts and knowledge about words and concepts.1
In artificial language processing systems (e.g.,
language models), a popular approach to design
a better model is by encoding all of the desired
knowledge (to produce grammatical sentences,
process long text, remember events, etc.) in the
1We refer readers to Nematzadeh et al. (2020) for
discussions on human and artificial language processing
memory systems.
weights of a large parametric neural network via end-to-end training. We see an increasingly larger transformer become a better language model (Radford et al., 2018, 2019; Shoeybi et al., 2019; Brown et al., 2020). In this scale approach,
the knowledge is implicitly represented in the
weights of a parametric neural network, and it is
not straightforward to interpret whether a model
contains a particular knowledge without asking
the model to produce a response—for example,
via a cloze-style question (Petroni et al., 2020) or
a prompt (Brown et al., 2020).
An alternative strategy is to design a modular
architecture that separates memory storage and
computational processing, where each module
has a clear purpose. Recent progress in memory-
augmented neural networks has given rise to
many variants of memory-augmented transformer
language models that fall under this category. For
example, attempts to incorporate extended local
context to a neural network—such as those found
in neural cache (Grave et al., 2017c), transformer-
XL (Dai et al., 2019), compressive transformer
(Rae et al., 2020), performers (Choromanski
et al., 2021), longformer (Beltagy et al., 2020),
and reformer (Kitaev et al., 2020)—can be seen as
models of working memory. Models of episodic
memory include kNN-LM (Khandelwal et al.,
2020) and architectures that are designed for more
complicated tasks such as question answering
(de Masson d’Autume et al., 2019; Guu et al.,
2020) and machine translation (Khandelwal et al.,
2021). In machine learning and natural language processing, the term memory-augmented neural networks is used to refer to all types of memory systems.
In this paper, inspired by the modular design
of human memory systems, we present a lan-
guage model architecture (SPALM) with storage
modules that resemble working and episodic
memory systems, which we combine with a large
parametric neural network that is responsible for
computation (§2). Our hypothesis is that encour-
aging each component to focus on a specific
function (e.g., storing long-term information, cap-
turing extended context, modeling local infor-
mation) facilitates easier training that produces an
overall better language model.2
Specifically, we follow transformer-XL (Dai
et al., 2019) to capture extended context by
caching hidden states in a temporary short-term
memory. For long-term context, we use a persis-
tent key-value database and perform sparse re-
trieval with (approximate) k-nearest neighbors. In
contrast to previous language models that either
interpolate output probabilities (Merity et al.,
2017; Grave et al., 2017c; Khandelwal et al.,
2020; Kassner and Schutze, 2020) or use input
concatenation (Guu et al., 2020; Xu et al., 2020)
to combine information from different sources,
we design a context-dependent gating mechanism
to incorporate local, extended, and global context.
We discuss similarities and differences to related
work in §3.
In language modeling, many tokens can be pre-
dicted from their local context without requiring
long-term information. Our model can adaptively
decide whether the current (local) context
is
enough, or whether it needs to use information
from the short-term and/or long-term memory.
In §4, we
compare SPALM with strong
baselines—including transformer-XL and kNN-
LM—on word-based and character-based lan-
guage modeling. Our positive results establish
the benefit of the proposed architecture. They also
indicate the generality of our approach and its
potential applicability to other sequence modeling
tasks.
We analyze how SPALM uses long vs. short-term
context (§5) to better understand how the model
operates when making predictions. We conclude
by discussing limitations and future directions
(§6).
2 Model
We consider a language model that takes as input
a sequence of words x≤t = {x0, . . . , xt} and
outputs a probability distribution of the next word
2We note that SPALM is not intended to be a model of the human language processing system. We merely take inspiration from human memory systems to design a better artificial language model.
Figure 1: Our language model architecture has three
main components: (i) a transformer that processes the
current local context, (ii) a short-term memory module
that stores hidden states from an extended context, and (iii) a key-value (hidden state-output token) database
that stores compressed long-term context. At each
timestep, our model combines the current context
and short-term memory with a mechanism similar to
transformer-XL. It then retrieves a set of past output
tokens that are used in a similar context from the long-
term memory module. These past output tokens are
then encoded and aggregated to a single vector that
represents long-term information. We use a context-
dependent gate to combine information from multiple
sources for making a final prediction.
p(xt+1 | x≤t; W). Given a corpus of T words,
the log likelihood of the corpus is:
L = \sum_{t=0}^{T} \log p(x_{t+1} \mid x_{\leq t}; W),
where x0 is the start of sentence symbol.
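As a worked illustration, the objective above is simply a sum of per-token log probabilities. The short Python sketch below assumes a hypothetical next_token_probs function that maps a prefix to a distribution over the vocabulary; it is not part of the paper's implementation.

import math

def corpus_log_likelihood(tokens, next_token_probs):
    # tokens[0] is the start-of-sentence symbol x_0.
    # next_token_probs is a hypothetical callable that returns a dict
    # {token: probability} for the next token given a prefix.
    total = 0.0
    for t in range(len(tokens) - 1):
        probs = next_token_probs(tokens[: t + 1])
        total += math.log(probs[tokens[t + 1]])
    return total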
SPALM consists of three main components: (i)
a large parametric neural network in the form
of a transformer to process local context, (ii) a
short-term memory to store extended context, and
(iii) a non-parametric episodic memory module
that stores information from long-term context.
We integrate these components in a single archi-
tecture with a gating mechanism. Figure 1 shows
an illustration of our model, which we discuss in
detail below.
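To make the data flow concrete before each component is described, here is a minimal NumPy sketch of a gated combination in the spirit of Figure 1. The aggregation of retrieved tokens and the exact form of the gate are illustrative assumptions (spalm_step, W_g, b_g, and W_agg are hypothetical names), not the paper's equations; only the overall structure follows the description above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spalm_step(h_t, retrieved_token_embs, W_g, b_g, W_agg):
    # h_t: hidden state for the current timestep from a transformer-XL
    #      style encoder over the local context plus cached states.
    # retrieved_token_embs: (k, d) embeddings of output tokens retrieved
    #      from the long-term key-value memory.
    # Aggregate the retrieved neighbors into a single long-term vector.
    m_t = np.tanh(W_agg @ retrieved_token_embs.mean(axis=0))
    # Context-dependent gate: per-dimension mixing weight in [0, 1].
    g_t = sigmoid(W_g @ h_t + b_g)
    # Values of g_t near 1 favor local/short-term context;
    # values near 0 favor long-term memory.
    z_t = g_t * h_t + (1.0 - g_t) * m_t
    return z_t, g_t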
2.1 Base Model
We use transformer (Vaswani et al., 2017) as our
base model. Given the input sequence x≤t, trans-
former performs multiple layers of self-attention
between every pair of tokens in the input sequence
to produce token representations.
A core limitation of transformer is that its
computational complexity is quadratic in the
input sequence length. As a result, instead of
considering all previous tokens x≤t, transformer
truncates the input to be the most recent N words
˜x≤t = {xt−N +1, . . . , xt} and only operates on
this fixed-length window in practice. A large
transformer, no matter how many parameters it
has, is limited by the input sequence length.
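As a minimal sketch of this fixed-length window (N is the context-length hyperparameter; the helper name is ours, not the paper's):

def truncate_context(tokens, N):
    # A vanilla transformer conditions only on the most recent N tokens;
    # transformer-XL additionally caches hidden states of older tokens.
    return tokens[-N:]

# e.g., truncate_context(["the", "cat", "sat", "on", "the", "mat"], N=4)
# returns ["sat", "on", "the", "mat"]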
2.2 Short-term Memory
We use transformer-XL (Dai et al., 2019) as our
working memory model. Given the current con-
text ˜x
While it
is difficult
to find consistent pat-
terns, we observe that SPALM is generally better
than both transformer and transformer-XL for
predicting (completing) common phrases and
named entities (that exist in the training set),
especially when they are encountered for the
first time and have not appeared in the extended
context (e.g., pulled their advertising
from, Liberal Democrat, Jo Swinson,
Boeing 787-9 Dreamliner).
On the other hand, we also see a few cases when
transformer-XL outperforms SPALM. These are
usually associated with scenarios where the same word has appeared in the extended context. While SPALM uses information from the extended context as well, the probability is smoothed over by information from the long-term memory, resulting in a more peaky distribution for transformer-XL.
5.4 Number of Neighbors
We use four neighbors for our word-based and
two neighbors for our character-based language
models. These values are chosen from preliminary
experiments on a small subset of the datasets.
We show SPALM perplexity on the WikiText-103 development set when we vary the number of
neighbors in Table 5. We see that using one nearest
neighbor is enough to obtain good performance,
with a slight advantage when we use four neigh-
bors. The performance starts to degrade as we use
8 and 16 neighbors. We choose to use four neigh-
bors in our experiments since kNN-LM–which
also uses the same set of neighbors–performs
better with four neighbors instead of one, and we
want to keep the comparison as fair as possible.
One notable difference between our neighbors
and those that are used in kNN-LM (Khandelwal
et al., 2020) is that we do not limit the search of the
neighbors to the same token as the current input
token (I(xi = xt)). While this allows the model
to combine information from related words (not
constrained to an exact match), it could introduce
noise when the number of neighbors is large.
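Conceptually, this retrieval step is a nearest-neighbor search over stored (hidden state, output token) pairs with no filter on token identity. The brute-force sketch below stands in for the approximate search used in practice; the function name and distance choice are our assumptions.

import numpy as np

def retrieve_neighbors(query_state, memory_keys, memory_tokens, k=4):
    # memory_keys: (M, d) hidden states stored from the training data.
    # memory_tokens: length-M list of the output tokens that followed them.
    # No constraint that retrieved tokens match the current input token
    # (in contrast to kNN-LM's I(x_i = x_t) restriction).
    distances = np.linalg.norm(memory_keys - query_state, axis=1)
    nearest = np.argsort(distances)[:k]
    return [memory_tokens[i] for i in nearest]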
We observe that our representation learning
model (i.e., the baseline transformer) is able to
retrieve relevant neighbors most of the time.
It retrieves the exact output token as the first neighbor 33%, 44%, and 70% of the time on the WikiText-103, WMT, and enwik8 development sets, respectively.
6 Discussion
Summary of Contributions. We present a semiparametric language model (SPALM) that combines local context, short-term memory, and long-term memory to make predictions. Experiments on word-based and character-based language models demonstrate the benefit of our proposed method.
Limitations. The biggest limitation is the necessity to retrieve neighbors for each training token. Such a process—even though it can be fully parallelized—is time consuming. In our experiments, it takes 6–8 hours to obtain neighbors for WikiText-103 and enwik8 with 1,000 CPUs and 18 hours for WMT with 9,000 CPUs.
Future Directions. Our modular approach that combines multiple memory systems at the architectural level opens up the possibility to incorporate additional memory from other modalities (e.g., images) or structured knowledge bases. We also envision a next-generation model that does not have to retrieve information from long-term memory for every token and only does it for those tokens that require global context. A model that learns how to do this would save a considerable amount of training and test time—since it would significantly reduce the number of searches that need to be performed. Our language model that integrates retrieval into training is a first step in this direction.
Figure 5: Distributions of values of g for WMT (left) and enwik8 (right) development sets.
5.3 Gate Vectors
Our model has a gating mechanism to regulate
information flow from the current context, short-
term, and long-term memory. We analyze the
values of the gate for tokens in WMT and enwik8.
Figure 5 shows histograms of the distribution of
gate values.
We observe different characteristics for WMT and enwik8. On enwik8, the gate values are concentrated around 1. This indicates that the
model relies on local context most of the time.
This can explain why kNN-LM does not work
well on this dataset. On WMT, the values are
less concentrated around 1. This suggests that
the model uses long-term memory more than on
enwik8. SPALM is able to learn when the long-
term memory is needed and when it is not in
both cases.
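The histograms in Figure 5 can be reproduced by reducing each token's gate vector to a scalar and binning those scalars; using the mean as the per-token summary statistic is our assumption in this small sketch.

import numpy as np

def gate_histogram(gate_vectors, bins=20):
    # gate_vectors: (num_tokens, d) array of g values in [0, 1].
    # Values near 1 indicate reliance on local/short-term context,
    # values near 0 indicate reliance on long-term memory.
    per_token = gate_vectors.mean(axis=1)
    return np.histogram(per_token, bins=bins, range=(0.0, 1.0))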
We next
look into the value of the gates
for a specific sequence in the development set
in Figure 6. We note that we only show a small subset of the gate vector's dimensions for readability, so we caution against drawing a
conclusion about how the model works from this.
Our goal is only to get a better understanding of
what happens when the model makes predictions.
Comparing WMT and enwik8, we see that in general on WMT the model tends to reserve some dimensions to propagate information from
the long-term memory, as indicated by vertical
red lines. On enwik8, the model relies on long-term information when completing a known word such as Egypt, as shown by more horizontal red patterns when forming this word. For other characters, the values of the gates are closer to one,
which shows that the model relies more on local
and extended short-term context.
Figure 6: Heatmaps of g values on a partial sequence from the WMT development set (left) and enwik8 (right). Each row is a token (word or character), and each column is a dimension of g. Blue indicates values closer to 1.0, whereas red indicates values closer to 0.0. The darker the shade, the closer the value is to the extreme. We see vertical patterns on WMT, indicating that these dimensions are reserved to flow information from long-term memory. Horizontal patterns on enwik8 indicate that the model relies on long-term memory to predict a target token (e.g., when forming the word Egypt). The g vector has 512 dimensions; we only zoom in on a small subset of dimensions here. There are more horizontal and vertical patterns on both datasets as a whole.
# NNs    Perplexity
1        18.0
2        18.0
4        17.9
8        18.2
16       18.4

Table 5: SPALM perplexity on the WikiText-103 development set with different numbers of neighbors.
Acknowledgments
We thank the action editor (Mihai Surdeanu) and
three anonymous reviewers for helpful comments
on an earlier draft of this article.
References
Alexei Baevski and Michael Auli. 2019. Adap-
tive input representations for neural language
modeling. In Proceedings of ICLR.
Ankur Bapna and Orhan Firat. 2019. Non-
parametric adaptation for neural machine
translation. In Proceedings of NAACL-HLT.
DOI: https://doi.org/10.18653/v1
/N19-1191
Iz Beltagy, Matthew E. Peters, and Arman Cohan.
2020. Longformer: The long-document trans-
former. arXiv preprint arXiv:2004.05150v2.
Tom B. Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah,
Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya
Ramesh, Daniel M. Ziegler,
Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark
Chen, Eric Sigler, Mateusz Litwin, Scott
Gray, Benjamin Chess, Jack Clark, Christopher
Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language
models are few-shot learners. In Proceedings
of NeurIPS.
Krzysztof Choromanski, Valerii Likhosherstov,
David Dohan, Xingyou Song, Andreea Gane,
Tamas Sarlos, Peter Hawkins, Jared Davis,
Afroz Mohiuddin, Lukasz Kaiser, David
Belanger, Lucy Colwell, and Adrian Weller.
2021. Rethinking attention with performers. In
Proceedings of ICLR.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime
Carbonell, Quoc V. Le,
and Ruslan
Salakhutdinov. 2019. Transformer-XL: Atten-
tive language models beyond a fixed-length
context. In Proceedings of ACL.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2018. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of NAACL.
Howard Eichenbaum. 2012. Memory systems.
Handbook of Psychology, Second Edition,
3. DOI: https://doi.org/10.1002
/9781118133880.hop203020
Edouard Grave, Moustapha M. Cisse, and
Armand Joulin. 2017a. Unbounded cache
model for online language modeling with open
vocabulary. In Proceedings of NeurIPS.
Edouard Grave, Armand Joulin, Moustapha Cisse,
David Grangier, and Herve Jegou. 2017b.
Efficient softmax approximation for GPUs. In
Proceedings of ICML.
Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017c. Improving neural language models with a continuous cache. In Proceedings of ICLR.
Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng,
David Simcha, Felix Chern, and Sanjiv Kumar.
2020. Accelerating large-scale inference with
anisotropic vector quantization. In Proceedings
of ICML.
Kelvin Guu, Tatsunori B. Hashimoto, Yonatan
Oren, and Percy Liang. 2018. Generating
sentences by editing prototypes. Transactions of
the Association for Computational Linguistics,
6:437–450. DOI: https://doi.org/10
.1162/tacl a 00030
Kelvin Guu, Kenton Lee, Zora Tung, Panupong
Pasupat, and Ming-Wei Chang. 2020. Realm:
Retrieval-augmented language model pre-
training. In Proceedings of ICML.
Marcus Hutter. 2012. The human knowl-
edge compression contest. http://prize
.hutter1.net/
Hakan Inan, Khashayar Khosravi, and Richard
Socher. 2017. Tying word vectors and word
classifiers: A loss framework for language
modeling. In Proceedings of ICLR.
Lukasz Kaiser, Ofir Nachum, Aurko Roy, and
Samy Bengio. 2017. Learning to remember rare
events. In Proceedings of ICLR.
Nora Kassner and Hinrich Schutze. 2020. BERT-kNN: Adding a kNN search component to pretrained language models for better QA. In Proceedings of Findings of EMNLP. DOI: https://doi.org/10.18653/v1/2020.findings-emnlp.307.
Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2021. Nearest neighbor machine translation. In Proceedings of ICLR.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky,
Luke Zettlemoyer, and Mike Lewis. 2020.
Generalization through memorization: Nearest
neighbor language models. In Proceedings of
ICLR.
Diederik P. Kingma and Jimmy Lei Ba. 2015.
Adam: A method for stochastic optimization.
In Proceedings of ICLR.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In Proceedings of ICLR.
Ben Krause, Emmanuel Kahembwe, Iain Murray,
and Steve Renals. 2018. Dynamic evaluation
of neural sequence models. In Proceedings of
ICML.
Ben Krause, Emmanuel Kahembwe, Iain Murray,
and Steve Renals. 2019. Dynamic evaluation
of transformer language models. arXiv preprint
arXiv:1904.08378v1.
Cyprien de Masson d’Autume, Sebastian Ruder,
Lingpeng Kong, and Dani Yogatama. 2019.
Episodic memory in lifelong language learning.
In Proceedings of NeurIPS.
Stephen Merity, Caiming Xiong, James Bradbury,
and Richard Socher. 2017. Pointer sentinel
mixture models. In Proceedings of ICLR.
Aida Nematzadeh, Sebastian Ruder, and Dani Yogatama. 2020. On memory in human and artificial language processing systems. In Proceedings of ICLR Workshop on Bridging AI and Cognitive Science.
Graham Neubig
and Chris Dyer.
2016.
Generalizing and hybridizing count-based
and neural language models. In Proceedings
of EMNLP. DOI: https://doi.org/10
.18653/v1/D16-1124
Fabio Petroni, Tim Rocktaschel, Patrick Lewis,
Anton Bakhtin, Yuxiang Wu, Alexander H.
Miller, and Sebastian Riedel. 2020. Language
models as knowledge bases? In Proceedings
of EMNLP. DOI: https://doi.org/10
.18653/v1/D19-1250
Mohammad Shoeybi, Mostofa Patwary, Raul
Puri, Patrick LeGresley, Jared Casper, and
Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053v4.
Alec Radford, Karthik Narasimhan, Tim Salimans,
and Ilya Sutskever. 2018.
Improving lan-
guage understanding by generative pre-training.
https://cdn.openai.com/research
-covers/language-unsupervised
/language understanding paper.pdf
Alec Radford, Jeff Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. https://cdn.openai.com/better
-language-models/language models
are unsupervised multitask learners
.pdf
Jack W. Rae, Anna Potapenko, Siddhant M.
Jayakumar, Chloe Hillier, and Timothy P.
Lillicrap. 2020. Compressive transformers for
long-range sequence modelling. In Proceedings
of ICLR.
Edmund T. Rolls. 2000. Memory systems in
the brain. Annual Review of Psychology,
51(1):599–630. DOI: https://doi.org/
10.1146/annurev.psych.51.1.599,
PMID: 10751982
E. Tulving. 1985. How many memory systems
are there? American Psychologist, 40:385–398.
DOI: https://doi.org/10.1037/0003
-066X.40.4.385
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Proceedings of
NIPS.
Wenhan Xiong, Xiang Lorraine Li, Srini Iyer,
Jingfei Du, Patrick Lewis, William Yang
Wang, Yashar Mehdad, Wen tau Yih, Sebastian
Riedel, Douwe Kiela, and Barlas Oguz. 2021.
Answering complex open-domain questions
with multi-hop dense retrieval. In Proceedings
of ICLR.
Peng Xu, Mostofa Patwary, Mohammad Shoeybi,
Raul Puri, Pascale Fung, Anima Anandkumar,
and Bryan Catanzaro. 2020. Megatron-CNTRL:
Controllable story generation with external
knowledge using large-scale language models.
In Proceedings of EMNLP.