Adaptive Semiparametric Language Models
Dani Yogatama, Cyprien de Masson d’Autume, Lingpeng Kong
DeepMind
London, UK
{dyogatama,cyprien,lingpenk}@google.com
Abstract
We present a language model that combines a large parametric neural network (i.e., a transformer) with a non-parametric episodic memory component in an integrated architecture. Our model uses extended short-term context by caching local hidden states—similar to transformer-XL—and global long-term memory by retrieving a set of nearest neighbor tokens at each timestep. We design a gating function to adaptively combine multiple information sources to make a prediction. This mechanism allows the model to use either local context, short-term memory, or long-term memory (or any combination of them) on an ad hoc basis depending on the context. Experiments on word-based and character-based language modeling datasets demonstrate the efficacy of our proposed method compared to strong baselines.
1 Introduction
Human language processing is facilitated by complex systems interacting together. A core component that enables such a process is human memory. Memory in humans consists of specialized systems, which form a basis for intelligent behaviors (Tulving, 1985; Rolls, 2000; Eichenbaum, 2012). For language processing, working (short-term) memory is a temporary storage that can be used to comprehend sentences and follow conversations. Episodic (long-term) memory stores individual experience and events. Semantic memory stores facts and knowledge about words and concepts.1
In artificial language processing systems (e.g., language models), a popular approach to design a better model is by encoding all of the desired knowledge (to produce grammatical sentences, process long text, remember events, etc.) in the weights of a large parametric neural network via end-to-end training. We see an increasingly larger transformer become a better language model (Radford et al., 2018, 2019; Shoeybi et al., 2019; Brown et al., 2020). In this scale approach, the knowledge is implicitly represented in the weights of a parametric neural network, and it is not straightforward to interpret whether a model contains a particular piece of knowledge without asking the model to produce a response—for example, via a cloze-style question (Petroni et al., 2020) or a prompt (Brown et al., 2020).

1 We refer readers to Nematzadeh et al. (2020) for discussions on human and artificial language processing memory systems.
An alternative strategy is to design a modular architecture that separates memory storage and computational processing, where each module has a clear purpose. Recent progress in memory-augmented neural networks has given rise to many variants of memory-augmented transformer language models that fall under this category. For example, attempts to incorporate extended local context into a neural network—such as those found in the neural cache (Grave et al., 2017c), transformer-XL (Dai et al., 2019), the compressive transformer (Rae et al., 2020), performers (Choromanski et al., 2021), longformer (Beltagy et al., 2020), and reformer (Kitaev et al., 2020)—can be seen as models of working memory. Models of episodic memory include kNN-LM (Khandelwal et al., 2020) and architectures that are designed for more complicated tasks such as question answering (de Masson d'Autume et al., 2019; Guu et al., 2020) and machine translation (Khandelwal et al., 2021). In machine learning and natural language processing, the term memory-augmented neural networks is used to refer to all types of memory systems.

In this paper, inspired by the modular design of human memory systems, we present a language model architecture (SPALM) with storage modules that resemble working and episodic memory systems, which we combine with a large parametric neural network that is responsible for
computation (§2). Our hypothesis is that encouraging each component to focus on a specific function (e.g., storing long-term information, capturing extended context, modeling local information) facilitates easier training that produces an overall better language model.2
Specifically, we follow transformer-XL (Dai et al., 2019) to capture extended context by caching hidden states in a temporary short-term memory. For long-term context, we use a persistent key-value database and perform sparse retrieval with (approximate) k-nearest neighbors. In contrast to previous language models that either interpolate output probabilities (Merity et al., 2017; Grave et al., 2017c; Khandelwal et al., 2020; Kassner and Schutze, 2020) or use input concatenation (Guu et al., 2020; Xu et al., 2020) to combine information from different sources, we design a context-dependent gating mechanism to incorporate local, extended, and global context. We discuss similarities and differences to related work in §3.
In language modeling, many tokens can be predicted from their local context without requiring long-term information. Our model can adaptively decide whether the current (local) context is sufficient, or whether it needs to use information from the short-term and/or long-term memory.

In §4, we compare SPALM with strong baselines—including transformer-XL and kNN-LM—on word-based and character-based language modeling. Our positive results establish the benefit of the proposed architecture. They also indicate the generality of our approach and its potential applicability to other sequence modeling tasks.

We analyze how SPALM uses long vs. short-term context (§5) to better understand how the model operates when making predictions. We conclude by discussing limitations and future directions (§6).
2 Model
We consider a language model that takes as input a sequence of words x≤t = {x0, . . . , xt} and outputs a probability distribution of the next word p(xt+1 | x≤t; W). Given a corpus of T words, the log likelihood of the corpus is:

L = \sum_{t=0}^{T} \log p(x_{t+1} \mid x_{\leq t}; W),

where x0 is the start of sentence symbol.

2 We note that SPALM is not intended to be a model of the human language processing system. We merely take inspiration from human memory systems to design a better artificial language model.

Figure 1: Our language model architecture has three main components: (i) a transformer that processes the current local context, (ii) a short-term memory module that stores hidden states from an extended context, and (iii) a key-value (hidden state–output token) database that stores compressed long-term context. At each timestep, our model combines the current context and short-term memory with a mechanism similar to transformer-XL. It then retrieves a set of past output tokens that were used in a similar context from the long-term memory module. These past output tokens are then encoded and aggregated into a single vector that represents long-term information. We use a context-dependent gate to combine information from multiple sources for making a final prediction.
SPALM consists of three main components: (i) a large parametric neural network in the form of a transformer to process local context, (ii) a short-term memory to store extended context, and (iii) a non-parametric episodic memory module that stores information from long-term context. We integrate these components in a single architecture with a gating mechanism. Figure 1 shows an illustration of our model, which we discuss in detail below.
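As a rough illustration of the gating idea described in Figure 1, the combination can be thought of as a learned, context-dependent interpolation between the (local plus short-term) transformer hidden state and an aggregated long-term memory vector. The sketch below is ours, not the exact SPALM parameterization; the names h_local, m_long, and W_g are illustrative assumptions.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gated_combination(h_local, m_long, W_g):
        """Combine a local/short-term hidden state with an aggregated
        long-term memory vector using an elementwise gate (sketch only).

        h_local: (d,) hidden state from the transformer + cached context
        m_long:  (d,) vector summarizing retrieved long-term neighbors
        W_g:     (d, d) learned gate parameters (hypothetical shape)
        """
        g = sigmoid(W_g @ h_local)            # context-dependent gate in (0, 1)^d
        z = g * h_local + (1.0 - g) * m_long  # g near 1 -> rely on local context
        return z  # fed to the output softmax to predict the next token

The direction of the gate here (values near 1 favoring local context) is chosen to match the analysis in §5.3, but the exact form of the gate in SPALM may differ.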
2.1 Base Model
We use the transformer (Vaswani et al., 2017) as our base model. Given the input sequence x≤t, the transformer performs multiple layers of self-attention between every pair of tokens in the input sequence to produce token representations.

A core limitation of the transformer is that its computational complexity is quadratic in the input sequence length. As a result, instead of considering all previous tokens x≤t, the transformer truncates the input to the most recent N words ˜x≤t = {xt−N+1, . . . , xt} and only operates on this fixed-length window in practice. A large transformer, no matter how many parameters it has, is limited by the input sequence length.
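For concreteness, the following is a minimal sketch of this fixed-window truncation, together with a transformer-XL-style cache of previous hidden states of the kind §2.2 builds on. The function and class names are ours and purely illustrative.

    def truncate_context(tokens, t, N):
        """Return the most recent N tokens ending at position t,
        i.e. the window {x_{t-N+1}, ..., x_t}."""
        start = max(0, t - N + 1)
        return tokens[start:t + 1]

    class ShortTermCache:
        """Transformer-XL-style short-term memory: keep the hidden states
        computed for previous windows so the current window can attend to
        them without recomputation (sketch only, not the paper's code)."""
        def __init__(self, max_len):
            self.max_len = max_len
            self.states = []  # per-token hidden state vectors

        def update(self, new_states):
            # Append the newest states and keep only the most recent max_len.
            self.states = (self.states + list(new_states))[-self.max_len:]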
2.2 Short-term Memory
We use transformer-XL (Dai et al., 2019) as our working memory model. Given the current context ˜x

While it is difficult to find consistent patterns, we observe that SPALM is generally better than both transformer and transformer-XL for predicting (completing) common phrases and named entities (that exist in the training set), especially when they are encountered for the first time and have not appeared in the extended context (e.g., pulled their advertising from, Liberal Democrat, Jo Swinson, Boeing 787-9 Dreamliner).
On the other hand, we also see a few cases where transformer-XL outperforms SPALM. These are usually associated with scenarios where the same word has appeared in the extended context. Although SPALM uses information from the extended context as well, the probability is smoothed over by information from the long-term memory, resulting in a more peaky distribution for transformer-XL.
5.4 Number of Neighbors
We use four neighbors for our word-based and two neighbors for our character-based language models. These values were chosen from preliminary experiments on a small subset of the datasets. We show SPALM perplexity on the WikiText-103 development set when we vary the number of neighbors in Table 5. We see that using one nearest neighbor is enough to obtain good performance, with a slight advantage when we use four neighbors. The performance starts to degrade as we use 8 and 16 neighbors. We choose to use four neighbors in our experiments since kNN-LM—which also uses the same set of neighbors—performs better with four neighbors instead of one, and we want to keep the comparison as fair as possible.

One notable difference between our neighbors and those used in kNN-LM (Khandelwal et al., 2020) is that we do not limit the search of the neighbors to the same token as the current input token (I(xi = xt)). While this allows the model to combine information from related words (not constrained to an exact match), it could introduce noise when the number of neighbors is large.

We observe that our representation learning model (i.e., the baseline transformer) is able to retrieve relevant neighbors most of the time. It retrieves the exact output token as the first neighbor 33%, 44%, and 70% of the time on the WikiText-103, WMT, and enwik8 development sets, respectively.
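To make the retrieval step concrete, here is a brute-force sketch of querying a key-value (hidden state → output token) datastore with the current hidden state. The function and variable names are ours; in practice an approximate nearest-neighbor index would replace the exhaustive distance computation for large datastores, and the distance metric is an assumption.

    import numpy as np

    def retrieve_neighbor_tokens(query, keys, values, k=4):
        """Return the output tokens whose stored context vectors (keys)
        are closest to the query hidden state.

        query:  (d,) hidden state of the current context
        keys:   (n, d) stored hidden states from the training corpus
        values: length-n list of output tokens paired with each key
        k:      number of neighbors (4 for the word-based models above)
        """
        # Distance to every stored key (brute force; swap in an
        # approximate index for corpus-scale datastores).
        dists = np.linalg.norm(keys - query, axis=1)
        nearest = np.argsort(dists)[:k]
        # Unlike kNN-LM, the search is not restricted to keys whose
        # paired value equals the current input token.
        return [values[i] for i in nearest]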
6 Discussion

Summary of Contributions. We present a semiparametric language model (SPALM) that combines local context, short-term memory, and long-term memory to make predictions. Experiments on word-based and character-based language models demonstrate the benefit of our proposed method.
Limitations. The biggest limitation is the necessity to retrieve neighbors for each training token. Such a process—even though it can be fully parallelized—is time consuming. In our experiments, it takes 6–8 hours to obtain neighbors for WikiText-103 and enwik8 with 1,000 CPUs and 18 hours for WMT with 9,000 CPUs.
Figure 5: Distributions of values of g for the WMT (left) and enwik8 (right) development sets.
5.3 Gate Vectors
Our model has a gating mechanism to regulate information flow from the current context, short-term memory, and long-term memory. We analyze the values of the gate for tokens in WMT and enwik8. Figure 5 shows histograms of the distribution of gate values.

We observe different characteristics for WMT and enwik8. On enwik8, the gate values are concentrated around 1. This indicates that the model relies on local context most of the time. This can explain why kNN-LM does not work well on this dataset. On WMT, the values are less concentrated around 1. This suggests that the model uses long-term memory more than on enwik8. In both cases, SPALM is able to learn when the long-term memory is needed and when it is not.
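A small sketch of how such an analysis could be reproduced is given below; the tensor names are hypothetical, and we assume the per-token gate vectors have already been collected during evaluation.

    import numpy as np

    def summarize_gate_usage(gates, threshold=0.9):
        """gates: (num_tokens, d) array of gate values in (0, 1).

        Returns the fraction of gate entries above `threshold`; a value
        near 1 means the model mostly ignores long-term memory and relies
        on local/extended context for these tokens."""
        gates = np.asarray(gates)
        return float((gates > threshold).mean())

    # Example (placeholder variables): compare two datasets' reliance
    # on local context.
    # summarize_gate_usage(wmt_gates) vs. summarize_gate_usage(enwik8_gates)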
We next look into the values of the gates for a specific sequence in the development set in Figure 6. We note that we only show a small subset of dimensions from the gate vector for readability, so we caution against drawing conclusions about how the model works from this. Our goal is only to get a better understanding of what happens when the model makes predictions.

Comparing WMT and enwik8, we see that in general on WMT the model tends to reserve some dimensions to propagate information from the long-term memory, as indicated by vertical red lines. On enwik8, the model relies on long-term information when completing a known word such as Egypt, as shown by more horizontal red patterns when forming this word. For other characters, the values of the gates are closer to one, which shows that the model relies more on local and extended short-term context.
Figure 6: Heatmaps of g values on a partial sequence from the WMT development set (left) and enwik8 (right). Each row is a token (word or character); each column is a dimension of g. Blue indicates values closer to 1.0, whereas red indicates values closer to 0.0. The darker the shade, the closer the value is to the extreme. We see vertical patterns on WMT, indicating that these dimensions are reserved to flow information from long-term memory. Horizontal patterns on enwik8 indicate that the model relies on long-term memory to predict a target token (e.g., when forming the word Egypt). The g vector has 512 dimensions; we only zoom in on a small subset of dimensions here. There are more horizontal and vertical patterns on both datasets as a whole.
# NNs    Perplexity
1        18.0
2        18.0
4        17.9
8        18.2
16       18.4

Table 5: SPALM perplexity on the WikiText-103 development set with different numbers of neighbors.
Future Directions. Our modular approach that combines multiple memory systems at the architectural level opens up the possibility of incorporating additional memory from other modalities (e.g., images) or structured knowledge bases. We also envision a next-generation model that does not have to retrieve information from long-term memory for every token and only does so for those tokens that require global context. A model that learns how to do this would save a considerable amount of training and test time, since it would significantly reduce the number of searches that need to be performed. Our language model that integrates retrieval into training is a first step in this direction.
Acknowledgments

We thank the action editor (Mihai Surdeanu) and three anonymous reviewers for helpful comments on an earlier draft of this article.
References
Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In Proceedings of ICLR.
Ankur Bapna and Orhan Firat. 2019. Non-parametric adaptation for neural machine translation. In Proceedings of NAACL-HLT. DOI: https://doi.org/10.18653/v1/N19-1191
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150v2.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proceedings of NeurIPS.
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. 2021. Rethinking attention with performers. In Proceedings of ICLR.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of ACL.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL.
Howard Eichenbaum. 2012. Memory systems. Handbook of Psychology, Second Edition, 3. DOI: https://doi.org/10.1002/9781118133880.hop203020
Edouard Grave, Moustapha M. Cisse, and Armand Joulin. 2017a. Unbounded cache model for online language modeling with open vocabulary. In Proceedings of NeurIPS.
Edouard Grave, Armand Joulin, Moustapha Cisse, David Grangier, and Herve Jegou. 2017b. Efficient softmax approximation for GPUs. In Proceedings of ICML.
Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017c. Improving neural language models with a continuous cache. In Proceedings of ICLR.
Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng,
David Simcha, Felix Chern, and Sanjiv Kumar.
2020. Accelerating large-scale inference with
anisotropic vector quantization. In Proceedings
of ICML.
Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. 2018. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450. DOI: https://doi.org/10.1162/tacl_a_00030
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. In Proceedings of ICML.
Marcus Hutter. 2012. The human knowledge compression contest. http://prize.hutter1.net/
Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying word vectors and word classifiers: A loss framework for language modeling. In Proceedings of ICLR.
Lukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. 2017. Learning to remember rare events. In Proceedings of ICLR.
Nora Kassner and Hinrich Schutze. 2020. BERT-kNN: Adding a kNN search component to pretrained language models for better QA. In Proceedings of Findings of EMNLP. DOI: https://doi.org/10.18653/v1/2020.findings-emnlp.307
Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2021. Nearest neighbor machine translation. In Proceedings of ICLR.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In Proceedings of ICLR.
Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In Proceedings of ICLR.
Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2018. Dynamic evaluation of neural sequence models. In Proceedings of ICML.
Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2019. Dynamic evaluation of transformer language models. arXiv preprint arXiv:1904.08378v1.
Cyprien de Masson d’Autume, Sebastian Ruder,
Lingpeng Kong, and Dani Yogatama. 2019.
Episodic memory in lifelong language learning.
In Proceedings of NeurIPS.
Stephen Merity, Caiming Xiong, James Bradbury,
and Richard Socher. 2017. Pointer sentinel
mixture models. In Proceedings of ICLR.
Aida Nematzadeh, Sebastian Ruder, and Dani Yogatama. 2020. On memory in human and artificial language processing systems. In Proceedings of the ICLR Workshop on Bridging AI and Cognitive Science.
Graham Neubig and Chris Dyer. 2016. Generalizing and hybridizing count-based and neural language models. In Proceedings of EMNLP. DOI: https://doi.org/10.18653/v1/D16-1124
Fabio Petroni, Tim Rocktaschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. Language models as knowledge bases? In Proceedings of EMNLP. DOI: https://doi.org/10.18653/v1/D19-1250
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053v4.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Jack W. Rae, Anna Potapenko, Siddhant M.
Jayakumar, Chloe Hillier, and Timothy P.
Lillicrap. 2020. Compressive transformers for
long-range sequence modelling. In Proceedings
of ICLR.
Edmund T. Rolls. 2000. Memory systems in the brain. Annual Review of Psychology, 51(1):599–630. DOI: https://doi.org/10.1146/annurev.psych.51.1.599, PMID: 10751982
E. Tulving. 1985. How many memory systems are there? American Psychologist, 40:385–398. DOI: https://doi.org/10.1037/0003-066X.40.4.385
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS.
Wenhan Xiong, Xiang Lorraine Li, Srini Iyer, Jingfei Du, Patrick Lewis, William Yang Wang, Yashar Mehdad, Wen-tau Yih, Sebastian Riedel, Douwe Kiela, and Barlas Oguz. 2021. Answering complex open-domain questions with multi-hop dense retrieval. In Proceedings of ICLR.
Peng Xu, Mostofa Patwary, Mohammad Shoeybi,
Raul Puri, Pascale Fung, Anima Anandkumar,
and Bryan Catanzaro. 2020. Megatron-CNTRL:
Controllable story generation with external
knowledge using large-scale language models.
In Proceedings of EMNLP.