Adaptive Semiparametric Language Models


Dani Yogatama, Cyprien de Masson d’Autume, Lingpeng Kong

DeepMind
London, United Kingdom
{dyogatama,cyprien,lingpenk}@google.com

Abstract

We present a language model that combines a large parametric neural network (i.e., a transformer) with a non-parametric episodic memory component in an integrated architecture. Our model uses extended short-term context by caching local hidden states—similar to transformer-XL—and global long-term memory by retrieving a set of nearest neighbor tokens at each timestep. We design a gating function to adaptively combine multiple information sources to make a prediction. This mechanism allows the model to use either local context, short-term memory, or long-term memory (or any combination of them) on an ad hoc basis depending on the context. Experiments on word-based and character-based language modeling datasets demonstrate the efficacy of our proposed method compared to strong baselines.

1 Introduction

Human language processing is facilitated by com-
plex systems interacting together. A core compo-
nent that enables such a process is human memory.
Memory in humans consists of specialized sys-
tems, which form a basis for intelligent behaviors
(Tulving, 1985; Rolls, 2000; Eichenbaum, 2012).
For language processing, working (short-term)
memory is a temporary storage that can be used to
comprehend sentences and follow conversations.
Episodic (long-term) memory stores individual
experience and events. Semantic memory stores
facts and knowledge about words and concepts.1
In artificial language processing systems (e.g., language models), a popular approach to designing a better model is to encode all of the desired knowledge (to produce grammatical sentences, process long text, remember events, etc.) in the

1We refer readers to Nematzadeh et al. (2020) for discussions on human and artificial language processing memory systems.


weights of a large parametric neural network via end-to-end training. We see an increasingly larger transformer become a better language model (Radford et al., 2018, 2019; Shoeybi et al., 2019; Brown et al., 2020). In this scaling approach,
the knowledge is implicitly represented in the
weights of a parametric neural network, and it is
not straightforward to interpret whether a model
contains a particular knowledge without asking
the model to produce a response—for example,
via a cloze-style question (Petroni et al., 2020) or a prompt (Brown et al., 2020).

An alternative strategy is to design a modular
architecture that separates memory storage and
computational processing, where each module
has a clear purpose. Recent progress in memory-
augmented neural networks has given rise to
many variants of memory-augmented transformer
language models that fall under this category. For example, attempts to incorporate extended local context into a neural network—such as those found in neural cache (Grave et al., 2017c), transformer-
XL (Dai et al., 2019), compressive transformer
(Rae et al., 2020), performers (Choromanski
et al., 2021), longformer (Beltagy et al., 2020),
and reformer (Kitaev et al., 2020)—can be seen as
models of working memory. Models of episodic
memory include kNN-LM (Khandelwal et al.,
2020) and architectures that are designed for more
complicated tasks such as question answering
(de Masson d’Autume et al., 2019; Guu et al.,
2020) and machine translation (Khandelwal et al.,
2021). In machine learning and natural language processing, the term memory-augmented neural networks is used to refer to all types of memory systems.

In this paper, inspired by the modular design
of human memory systems, we present a lan-
guage model architecture (SPALM) with storage
modules that resemble working and episodic
memory systems, which we combine with a large
parametric neural network that is responsible for


computation (§2). Our hypothesis is that encouraging each component to focus on a specific function (e.g., storing long-term information, capturing extended context, modeling local information) facilitates easier training that produces an overall better language model.2

Specifically, we follow transformer-XL (Dai et al., 2019) to capture extended context by caching hidden states in a temporary short-term memory. For long-term context, we use a persistent key-value database and perform sparse retrieval with (approximate) k-nearest neighbors. In contrast to previous language models that either interpolate output probabilities (Merity et al., 2017; Grave et al., 2017c; Khandelwal et al., 2020; Kassner and Schutze, 2020) or use input concatenation (Guu et al., 2020; Xu et al., 2020) to combine information from different sources, we design a context-dependent gating mechanism to incorporate local, extended, and global context.
We discuss similarities and differences to related
work in §3.

In language modeling, many tokens can be predicted from their local context without requiring long-term information. Our model can adaptively decide whether the current (local) context is sufficient, or whether it needs to use information from the short-term and/or long-term memory.

In §4, we compare SPALM with strong baselines—including transformer-XL and kNN-LM—on word-based and character-based language modeling. Our positive results establish the benefit of the proposed architecture. They also indicate the generality of our approach and its potential applicability to other sequence modeling tasks.

We analyze how SPALM uses long- vs. short-term context (§5) to better understand how the model operates when making predictions. We conclude by discussing limitations and future directions (§6).

2 Model

We consider a language model that takes as input
a sequence of words x≤t = {x0, . . . , xt} and
outputs a probability distribution of the next word

2We note that SPALM is not intended to be a model of the human language processing system. We merely take inspiration from human memory systems to design a better artificial language model.

Figure 1: Our language model architecture has three
main components: (i) a transformer that processes the
current local context, (ii) a short-term memory module
that stores hidden states from an extended context, and (iii) a key-value (hidden state-output token) database
that stores compressed long-term context. At each
timestep, our model combines the current context
and short-term memory with a mechanism similar to
transformer-XL. It then retrieves a set of past output
tokens that are used in a similar context from the long-
term memory module. These past output tokens are
then encoded and aggregated to a single vector that
represents long-term information. We use a context-
dependent gate to combine information from multiple
sources for making a final prediction.

p(xt+1 | x≤t; W). Given a corpus of T words, the log likelihood of the corpus is:

L = ∑_{t=0}^{T} log p(xt+1 | x≤t; W),

where x0 is the start of sentence symbol.
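
As a minimal illustration of this objective, the following sketch (in Python with NumPy; the array names and toy values are ours, not from the paper) accumulates the corpus log likelihood from per-step next-token distributions:

    import numpy as np

    def corpus_log_likelihood(next_token_probs, targets):
        # next_token_probs: (T, V) array, row t is the model's distribution
        #   p(x_{t+1} | x_{<=t}; W) over a vocabulary of size V.
        # targets: (T,) array of observed next-token indices.
        probs_of_targets = next_token_probs[np.arange(len(targets)), targets]
        return float(np.sum(np.log(probs_of_targets)))

    # Toy usage: T = 3 timesteps, vocabulary size V = 4.
    probs = np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.2, 0.5, 0.2, 0.1],
                      [0.1, 0.1, 0.1, 0.7]])
    targets = np.array([0, 1, 3])
    print(corpus_log_likelihood(probs, targets))  # log 0.7 + log 0.5 + log 0.7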

SPALM consists of three main components: (i)
a large parametric neural network in the form
of a transformer to process local context, (ii) a
short-term memory to store extended context, and (iii) a non-parametric episodic memory module that stores information from long-term context. We integrate these components in a single architecture with a gating mechanism. Figure 1 shows
an illustration of our model, which we discuss in
detail below.
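
Purely as an illustrative sketch of the gating idea described above (the names, shapes, and sigmoid parameterization below are our own assumptions, not the authors' exact formulation), the snippet combines a transformer hidden state with an aggregated long-term memory vector through an elementwise, context-dependent gate:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gated_combination(h_t, m_t, W_g):
        # h_t: hidden state carrying local + short-term context, shape (d,).
        # m_t: aggregated long-term memory vector, shape (d,).
        # W_g: (d, d) gate parameters (hypothetical placeholder).
        g_t = sigmoid(W_g @ h_t)             # context-dependent gate in (0, 1)^d
        z_t = g_t * h_t + (1.0 - g_t) * m_t  # gate near 1 -> rely on local context
        return z_t, g_t

    # Toy usage with d = 4.
    rng = np.random.default_rng(0)
    h, m = rng.normal(size=4), rng.normal(size=4)
    z, g = gated_combination(h, m, rng.normal(size=(4, 4)))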

2.1 Base Model

We use transformer (Vaswani et al., 2017) as our
base model. Given the input sequence x≤t, trans-
former performs multiple layers of self-attention
between every pair of tokens in the input sequence
to produce token representations.

A core limitation of the transformer is that its computational complexity is quadratic in the input sequence length. As a result, instead of considering all previous tokens x≤t, the transformer truncates the input to be the most recent N words ˜x≤t = {xt−N+1, . . . , xt} and only operates on this fixed-length window in practice. A large transformer, no matter how many parameters it has, is limited by the input sequence length.
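
A minimal sketch of this fixed-length truncation (the window size N and the token list below are illustrative):

    def truncate_context(tokens, N):
        # Keep only the most recent N tokens, as a vanilla transformer does
        # when the full history x_{<=t} is too long to attend over.
        return tokens[-N:]

    # Toy usage: with N = 4, only the last four tokens are visible to the model.
    history = ["the", "cat", "sat", "on", "the", "mat"]
    print(truncate_context(history, 4))  # ['sat', 'on', 'the', 'mat']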

2.2 Short-term Memory

We use transformer-XL (Dai et al., 2019) as our
working memory model. Given the current context ˜x

While it is difficult to find consistent patterns, we observe that SPALM is generally better than both transformer and transformer-XL for predicting (completing) common phrases and named entities (that exist in the training set), especially when they are encountered for the first time and have not appeared in the extended context (e.g., pulled their advertising from, Liberal Democrat, Jo Swinson, Boeing 787-9 Dreamliner).

On the other hand, we also see a few cases when transformer-XL outperforms SPALM. These are usually associated with scenarios where the same word has appeared in the extended context. While SPALM uses information from the extended context too, the probability is smoothed over by information from the long-term memory, resulting in a more peaky distribution for transformer-XL.

5.4 Number of Neighbors

We use four neighbors for our word-based and
two neighbors for our character-based language
models. These values are chosen from preliminary
experiments on a small subset of the datasets.

We show SPALM perplexity on the WikiText-103 development set when we vary the number of neighbors in Table 5. We see that using one nearest neighbor is enough to obtain good performance, with a slight advantage when we use four neighbors. The performance starts to degrade as we use 8 and 16 neighbors. We choose to use four neighbors in our experiments since kNN-LM—which also uses the same set of neighbors—performs better with four neighbors instead of one, and we want to keep the comparison as fair as possible.

One notable difference between our neighbors
and those that are used in kNN-LM (Khandelwal
et al., 2020) is that we do not limit the search of the
neighbors to the same token as the current input
token (I(xi = xt)). While this allows the model to combine information from related words (not
constrained to an exact match), it could introduce
noise when the number of neighbors is large.

We observe that our representation learning model (i.e., the baseline transformer) is able to retrieve relevant neighbors most of the time. It retrieves the exact output token as the first neighbor 33%, 44%, and 70% of the time on the WikiText-103, WMT, and enwik8 development sets, respectively.
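
As a rough sketch of the retrieval step discussed in this section, the snippet below performs exact nearest-neighbor search over a toy datastore of (hidden-state key, output-token value) pairs; the actual system uses approximate search over a much larger database, and all names and values here are ours:

    import numpy as np

    def retrieve_neighbors(query, keys, values, k=4):
        # query: (d,) hidden state for the current context.
        # keys: (M, d) hidden states stored from the training corpus.
        # values: length-M list of output tokens that followed each key.
        # Note: unlike kNN-LM, the search is not restricted to entries whose
        # current input token matches the query's token.
        dists = np.linalg.norm(keys - query, axis=1)  # Euclidean distances
        nearest = np.argsort(dists)[:k]
        return [values[i] for i in nearest]

    # Toy usage with a 3-entry datastore of 2-d keys.
    keys = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]])
    values = ["Swinson", "Dreamliner", "Democrat"]
    print(retrieve_neighbors(np.array([1.0, 0.05]), keys, values, k=2))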

6 Discussion

Summary of Contributions. We present a semiparametric language model (SPALM) that combines local context, short-term memory, and long-term memory to make predictions. Experiments on word-based and character-based language models demonstrate the benefit of our proposed method.

Limitations. The biggest limitation is the necessity to retrieve neighbors for each training token. Such a process—even though it can be fully parallelized—is time consuming. In our experiments, it takes 6–8 hours to obtain neighbors for WikiText-103 and enwik8 with 1,000 CPUs and 18 hours for WMT with 9,000 CPUs.

Future Directions. Our modular approach that combines multiple memory systems at the

Figure 5: Distributions of values of g for the WMT (left) and enwik8 (right) development sets.

5.3 Gate Vectors

Our model has a gating mechanism to regulate
information flow from the current context, corto-
term, and long-term memory. We analyze the
values of the gate for tokens in WMT and enwik8.
Cifra 5 shows histograms of the distribution of
gate values.

We observe different characteristics for WMT and enwik8. On enwik8, the gate values are concentrated around 1. This indicates that the model relies on local context most of the time. This can explain why kNN-LM does not work well on this dataset. On WMT, the values are less concentrated around 1. This suggests that the model uses long-term memory more than on enwik8. SPALM is able to learn when the long-term memory is needed and when it is not in both cases.
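
For reference, a small sketch of the kind of summary behind Figure 5: collect the per-token gate values logged during evaluation and measure how concentrated they are near 1 (the synthetic arrays below are stand-ins, not the paper's data):

    import numpy as np

    def gate_concentration(gate_values, threshold=0.9):
        # Fraction of gate entries above `threshold`; values near 1 mean the
        # model relies on local/short-term context rather than long-term memory.
        flat = np.asarray(gate_values).ravel()
        return float(np.mean(flat > threshold))

    # Toy usage: two synthetic gate distributions with different behaviour.
    enwik8_like = np.random.default_rng(0).beta(9, 1, size=10_000)  # skewed toward 1
    wmt_like = np.random.default_rng(1).beta(3, 2, size=10_000)     # more spread out
    print(gate_concentration(enwik8_like), gate_concentration(wmt_like))
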
We next look into the values of the gates for a specific sequence in the development set in Figure 6. We note that we only show a small dimension subset from the gate vector for readability, so we caution against drawing a conclusion about how the model works from this. Our goal is only to get a better understanding of what happens when the model makes predictions.
Comparing WMT and enwik8, we see that in general on WMT the model tends to reserve some dimensions to propagate information from the long-term memory, as indicated by vertical red lines. On enwik8, the model relies on long-term information when completing a known word such as Egypt, as shown by more horizontal red patterns when forming this word. For other characters, the values of the gates are closer to one, which shows that the model relies more on local and extended short-term context.


Figure 6: Heatmaps of g values on a partial sequence from the WMT development set (left) and enwik8 (right). Each row is a token (word or character); each column is a dimension of g. Blue indicates a value closer to 1.0, while red indicates a value closer to 0.0. The darker the shade, the closer the value is to the extreme. We see vertical patterns on WMT, indicating that these dimensions are reserved to flow information from long-term memory. Horizontal patterns on enwik8 indicate that the model relies on long-term memory to predict a target token (e.g., when forming the word Egypt). The g vector has 512 dimensions; we only zoom in to a small dimension subset here. There are more horizontal and vertical patterns on both datasets as a whole.

# NNs    Perplexity
1        18.0
2        18.0
4        17.9
8        18.2
16       18.4

Table 5: SPALM perplexity on the WikiText-103 development set with different numbers of neighbors.

architectural level opens up the possibility to incorporate additional memory from other modalities (e.g., images) or structured knowledge bases. We also envision a next-generation model that does not have to retrieve information from long-term memory for every token and only does so for those tokens that require global context. A model that learns how to do this would save a considerable amount of training and test time—since it would significantly reduce the number of searches that need to be performed. Our language model that integrates retrieval into training is a first step in this direction.

Acknowledgments

We thank the action editor (Mihai Surdeanu) and three anonymous reviewers for helpful comments on an earlier draft of this article.

References

Alexei Baevski and Michael Auli. 2019. Adap-
tive input representations for neural language
modeling. In Proceedings of ICLR.

Ankur Bapna and Orhan Firat. 2019. Non-parametric adaptation for neural machine translation. In Proceedings of NAACL-HLT.

DOI: https://doi.org/10.18653/v1
/N19-1191

Iz Beltagy, Matthew E. Peters, and Arman Cohan.
2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150v2.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proceedings of NeurIPS.

Krzysztof Choromanski, Valerii Likhosherstov,
David Dohan, Xingyou Song, Andreea Gane,
Tamas Sarlos, Peter Hawkins, Jared Davis,
Afroz Mohiuddin, Lukasz Kaiser, David
Belanger, Lucy Colwell, and Adrian Weller.
2021. Rethinking attention with performers. In Proceedings of ICLR.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime
Carbonell, Quoc V. Le,
and Ruslan
Salakhutdinov. 2019. Transformer-XL: Atten-
tive language models beyond a fixed-length
context. In Proceedings of ACL.


Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL.

Howard Eichenbaum. 2012. Memory systems.
Handbook of Psychology, Second Edition,
3. DOI: https://doi.org/10.1002
/9781118133880.hop203020

Edouard Grave, Moustapha M. Cisse, and
Armand Joulin. 2017a. Unbounded cache
model for online language modeling with open
vocabulary. In Proceedings of NeurIPS.

Edouard Grave, Armand Joulin, Moustapha Cisse,
David Grangier, and Herve Jegou. 2017b.
Efficient softmax approximation for GPUs. In Proceedings of ICML.

Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017c. Improving neural language models with a continuous cache. In Proceedings of ICLR.

Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng,
David Simcha, Felix Chern, and Sanjiv Kumar.
2020. Accelerating large-scale inference with
anisotropic vector quantization. In Proceedings of ICML.

Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. 2018. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450. DOI: https://doi.org/10.1162/tacl_a_00030

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. In Proceedings of ICML.

Marcus Hutter. 2012. The human knowl-
edge compression contest. http://prize
.hutter1.net/

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying word vectors and word classifiers: A loss framework for language modeling. In Proceedings of ICLR.

Lukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. 2017. Learning to remember rare
events. In Proceedings of ICLR.

Nora Kassner and Hinrich Schutze. 2020. BERT-kNN: Adding a kNN search component to pretrained language models for better QA. In Findings of EMNLP. DOI: https://doi.org/10.18653/v1/2020.findings-emnlp.307.

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2021. Nearest neighbor machine translation. In Proceedings of ICLR.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky,
Luke Zettlemoyer, and Mike Lewis. 2020.
Generalization through memorization: Nearest
neighbor language models. En procedimientos de
ICLR.

Diederik P. Kingma and Jimmy Lei Ba. 2015.
Adam: A method for stochastic optimization.
In Proceedings of ICLR.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In Proceedings of ICLR.

Ben Krause, Emmanuel Kahembwe, Iain Murray,
and Steve Renals. 2018. Dynamic evaluation
of neural sequence models. En procedimientos de
ICML.

Ben Krause, Emmanuel Kahembwe, Iain Murray,
and Steve Renals. 2019. Dynamic evaluation
of transformer language models. arXiv preprint
arXiv:1904.08378v1.

Cyprien de Masson d’Autume, Sebastian Ruder,
Lingpeng Kong, and Dani Yogatama. 2019.
Episodic memory in lifelong language learning.
In Proceedings of NeurIPS.

Stephen Merity, Caiming Xiong, James Bradbury,
and Richard Socher. 2017. Pointer sentinel
mixture models. In Proceedings of ICLR.

Aida Nematzadeh, Sebastian Ruder, and Dani Yogatama. 2020. On memory in human and artificial language processing systems. In Proceedings of ICLR Workshop on Bridging AI and Cognitive Science.

Graham Neubig and Chris Dyer. 2016. Generalizing and hybridizing count-based and neural language models. In Proceedings of EMNLP. DOI: https://doi.org/10.18653/v1/D16-1124


Fabio Petroni, Tim Rocktaschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. Language models as knowledge bases? In Proceedings of EMNLP. DOI: https://doi.org/10.18653/v1/D19-1250

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053v4.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Jack W. Rae, Anna Potapenko, Siddhant M.
Jayakumar, Chloe Hillier, and Timothy P.
Lillicrap. 2020. Compressive transformers for
long-range sequence modelling. In Proceedings of ICLR.

Edmund T. Rolls. 2000. Memory systems in
the brain. Annual Review of Psychology,
51(1):599–630. DOI: https://doi.org/
10.1146/annurev.psych.51.1.599,
PMID: 10751982

E. Tulving. 1985. How many memory systems are there? American Psychologist, 40:385–398. DOI: https://doi.org/10.1037/0003-066X.40.4.385

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS.

Wenhan Xiong, Xiang Lorraine Li, Srini Iyer,
Jingfei Du, Patrick Lewis, William Yang
Wang, Yashar Mehdad, Wen-tau Yih, Sebastian Riedel, Douwe Kiela, and Barlas Oguz. 2021. Answering complex open-domain questions with multi-hop dense retrieval. In Proceedings of ICLR.

Peng Xu, Mostofa Patwary, Mohammad Shoeybi,
Raul Puri, Pascale Fung, Anima Anandkumar,
and Bryan Catanzaro. 2020. Megatron-CNTRL:
Controllable story generation with external
knowledge using large-scale language models.
In Proceedings of EMNLP.
