Relational Memory-Augmented Language Models


Qi Liu2∗, Dani Yogatama1, and Phil Blunsom1,2

1DeepMind, United Kingdom,

2University of Oxford, United Kingdom

{qi.liu,phil.blunsom}@cs.ox.ac.uk
dyogatama@deepmind.com

Abstract

We present a memory-augmented approach to condition an autoregressive language model on a knowledge graph. We represent the graph as a collection of relation triples and retrieve relevant relations for a given context to improve text generation. Experiments on WikiText-103, WMT19, and enwik8 English datasets demonstrate that our approach produces a better language model in terms of perplexity and bits per character. We also show that relational memory improves coherence, is complementary to token-based memory, and enables causal interventions. Our model provides a simple yet effective way to combine an autoregressive language model and a knowledge graph for more coherent and logical generation.

1 Introduction

A core function of language is to communicate propositions (e.g., who did what to whom). As such, language models need to be able to generate this information reliably and coherently. Existing language models (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020) do not have explicit representations for such information and rely on it being implicitly encoded in their parameters (Liu et al., 2019; Petroni et al., 2019; Wang et al., 2020). This encoding mechanism makes it difficult to interpret what the language models know and often leads to generating illogical and contradictory contents. For example, Logan et al. (2019) observe that existing language models rely heavily on word correlation and fall short of logical reasoning. This causes the model to hallucinate—for example, that Barack Obama's wife is Hillary Clinton based on the high co-occurrence of the two entities. In another example, Lake and Murphy (2020) notice that GPT-2 (Radford et al., 2019) states that unicorns have four horns, directly after stating that unicorns have one horn.

∗Work completed during an internship at DeepMind.

In this work, we explore ways to combine an autoregressive language model with a knowledge graph. We design a memory-augmented architecture that stores relations from a knowledge graph and investigate the effect of conditioning on this relational memory in an autoregressive language model. In contrast to existing token-based memory-augmented language models that store context-target pairs (Khandelwal et al., 2020b; Yogatama et al., 2021), our memory stores relation triples (head entity, relation, tail entity). Relation triples form the basis of knowledge bases, empowering a wide range of applications such as question answering (Yasunaga et al., 2021), machine reading (Yang and Mitchell, 2019), and reasoning (Minervini et al., 2020). From a cognitive science perspective, we can consider the neural language model to be an instance of System 1, which performs fast inference, and the symbolic relational memory as a world model to support slow and logical reasoning of System 2 (Kahneman, 2011).1 We hypothesize that relational memory can improve performance and coherence of an autoregressive language model.

Given an observed context, we first run an entity tagger to identify entities in the context. We then use tf-idf (Ramos et al., 2003) to select salient entities. We retrieve relations (from a knowledge base) for the selected entities and design a gating function that allows the language model to adaptively combine information from extracted relations and observed textual context to predict the next token. Existing knowledge bases such as Freebase and Wikidata can be used as a source of information from which to retrieve relations. However, they are often incomplete and do not contain relations that are suitable for the particular

dataset that we want to work with. Instead of using these predefined knowledge bases, we choose to perform open information extraction (OpenIE) on each language modeling dataset to get relations. As a result, our model is able to move beyond simple co-occurrence statistics and generate text that is more grounded on real-world relations observed in a particular corpus.

1This view is also advocated in a parallel work by Nye et al. (2021), which presents a model for story generation and instruction following.

Our main contributions are as follows:

• We evaluate the model on three English language modeling datasets. We show that our model outperforms a strong transformer-XL baseline (Dai et al., 2019) on both word-level (WikiText-103 and WMT19) and character-level (enwik8) language modeling in terms of perplexity and bits per character, respectively (§3.3).

• We conduct comprehensive ablation and design choice studies to understand contributions of different components of our models (§4.1).

• We measure coherence with human evaluation and two automatic metrics (knowledge perplexity and knowledge F1) and demonstrate that relational memory improves coherence (§4.2).

• We study the relationship between our method and a typical memory-augmented language model that stores word tokens in its memory (Yogatama et al., 2021). We show that relational memory is complementary to token-based memory and combining them improves performance further (§3.3).

• We perform qualitative analysis by examining gate values and retrieved relations. In line with our main motivation, we find that the relational memory is particularly useful for predicting entities. Further, we demonstrate that such explicit propositional representations allow causal interventions and increase interpretability of language models (§4.3).

2 Model

An autoregressive language model defines the probability of a sequence of tokens p(x) = p(x_1, ..., x_T). It is common to factorize this joint probability as a product of conditional probabilities with the chain rule (Jelinek, 1980; Bengio et al., 2003):

    p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_0, \ldots, x_{t-1}),    (1)

where x_0 is a special start token.

Our language model is based on transformer-XL (§2.1), which is augmented with a relational memory (§2.2). We discuss them in detail below.

2.1 Transformer-XL

We use transformer-XL (Dai et al., 2019)—which is based on the transformer (Vaswani et al., 2017)—to parametrize the conditional probabilities in Eq. 1. The transformer stacks multiple self-attention layers to obtain contextualized representations.

Language modeling datasets usually consist of articles of different lengths. It is impractical to apply a transformer to encode long articles, as its computational complexity is quadratic in the sequence length. In practice, each article is usually truncated into fixed-length text segments {x_{t-N+1}, ..., x_t} of length N to train and evaluate the model. However, this approximation prevents the transformer from capturing long-term dependency beyond text segments. Transformer-XL reuses hidden states from previous text segments to extend the context window.

More specifically, denote the hidden state of x_t at layer \ell as h^\ell_t. Given a text segment {x_{t-N+1}, ..., x_t} and its extended context {x_{t-N-M+1}, ..., x_{t-N}} of length M, both the hidden states of the text segment {h^\ell_{t-N+1}, ..., h^\ell_t} and the hidden states of the extended context {h^\ell_{t-N-M+1}, ..., h^\ell_{t-N}} are used. When performing self-attention, each token in the text segment can attend to the preceding tokens in the text segment and all the tokens in the extended context, enabling longer-term dependency compared to a vanilla transformer. Importantly, transformer-XL does not backpropagate through the hidden states of the extended context during training (by adding stop gradient operators to all the hidden states in the extended context).
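To make the reuse of cached states concrete, the following minimal sketch is our own illustration and not the authors' implementation: it shows a single attention head, omits relative positional encodings, and assumes the array shapes given in the comments.

import jax
import jax.numpy as jnp

def segment_attention(h_seg, h_prev, w_q, w_k, w_v):
    # h_seg:  (N, d) hidden states of the current text segment at some layer
    # h_prev: (M, d) cached hidden states from the previous segment(s)
    # The cache is reused as context but never backpropagated through.
    context = jnp.concatenate([jax.lax.stop_gradient(h_prev), h_seg], axis=0)   # (M+N, d)
    q, k, v = h_seg @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / jnp.sqrt(q.shape[-1])                                    # (N, M+N)
    N, M = h_seg.shape[0], h_prev.shape[0]
    causal = jnp.tril(jnp.ones((N, N), dtype=bool))    # segment tokens attend only to the left
    mask = jnp.concatenate([jnp.ones((N, M), dtype=bool), causal], axis=1)
    scores = jnp.where(mask, scores, -1e30)
    return jax.nn.softmax(scores, axis=-1) @ v                                  # (N, d)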

2.2 Relational Memory

In this section, we first introduce how we obtain relation triples using OpenIE (§2.2.1). We then use tf-idf to score entities in the observed context and retrieve relation triples related to these entities (§2.2.2) to construct relational memory. Finally,

we show an integrated architecture that allows transformer-XL to incorporate the relational memory for predicting the next token (§2.2.3). We show our architecture in Figure 1. The pseudocode for training or evaluating with the relational memory is given in Algorithm 1. In the pseudocode, we use TRAIN(x_c, M) and EVAL(x_c, M) to refer to training with the cross-entropy loss and evaluating (e.g., calculating perplexity) on the text segment x_c conditioned on the relational memory M, respectively.

Figure 1: We identify salient entities in the previous text segment and extract relations to build our relational memory. We encode each relation with an LSTM encoder, aggregate the resulting representations into a vector, and use a gate mechanism that allows our language model to adaptively take advantage of relational information for predicting the next token.

2.2.1 Open Information Extraction

A key challenge of utilizing relational information for language modeling is obtaining high-quality relation triples. There are several well-established knowledge bases, such as Freebase (Bollacker et al., 2007) and YAGO (Rebele et al., 2016). However, existing knowledge bases suffer from missing relations and often do not contain relation triples related to observed contexts in a target corpus, even though research on knowledge base completion has resulted in significant advances (Bordes et al., 2013; Trouillon et al., 2016; Zhang et al., 2019).

In this work, we use OpenIE (Angeli et al., 2015; Etzioni et al., 2008) to obtain relation triples. Since OpenIE directly extracts relation triples from each dataset D, it provides a structured way to represent knowledge in D.2 Specifically, we perform OpenIE on the training set of D. Given an entity e, we retrieve a set of relation triples R_e = {r_1, . . . , r_O}, where e is either the head entity or the tail entity in these relation triples. Conceptually, R_e consists of all the relation triples from the one-hop subgraph centred at the entity e in the knowledge graph constructed from D. Therefore, R_e can provide ''global'' information about the entity.

2We provide a comparison of using relations extracted from OpenIE and Freebase in §4.1.

Dynamic OpenIE. Dynamic OpenIE takes advantage of the autoregressive nature of language modeling, where text segments are sequentially processed. In addition to extracting relations from the training set of D, we can also extract relations from previously seen text segments of our evaluation set. We refer to this extraction mechanism as dynamic OpenIE. After a text segment {x_{t-N+1}, . . . , x_t} has been evaluated, for example, after calculating perplexity on this text segment, we perform OpenIE on it to obtain new relation triples to be added to our knowledge graph. Note that we only perform OpenIE on previously seen text segments and do not use unseen text. We expect that the relation triples extracted from seen text segments are potentially useful for predicting the next tokens. This extraction mechanism will not violate the autoregressive nature of language modeling. Metrics such as perplexity and bits per character are calculated as usual. The idea of using seen text segments during evaluation to improve language modeling is related to dynamic evaluation (Krause et al., 2018, 2019). In dynamic evaluation, the model is adapted based on recent history during evaluation via gradient descent so that it can assign higher probabilities to re-occurring patterns. In contrast to dynamic evaluation, we do not update model parameters and


only extract new relations from seen text segments to enrich our corpus-specific knowledge graph.

Algorithm 1 Train/Eval w/ Relational Memory
1: procedure TRAIN/EVAL SPLIT(S)
2:   for each article A in S do
3:     Initialise M to empty
4:     for each text segment x_c in A do
5:       if S is train set then
6:         TRAIN(x_c, M)
7:       else
8:         EVAL(x_c, M)
9:         Run dynamic OpenIE on x_c
10:      end if
11:      Perform relation retrieval with x_c
12:      Update M with retrieved triples
13:    end for
14:  end for
15: end procedure
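For concreteness, a minimal Python rendering of Algorithm 1 is given below. It is a sketch under our own assumptions: the callables passed in (the train/eval step, the dynamic OpenIE extractor, and the relation retriever) stand in for the components described in §2.2.1-§2.2.3, and random subsampling of overly large retrievals is omitted.

from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]   # (head entity, relation, tail entity)

def run_split(articles: Iterable[List[str]],
              step_fn: Callable[[str, List[Triple]], None],
              is_train: bool,
              dynamic_openie: Callable[[str], None],
              retrieve_relations: Callable[[str], List[Triple]],
              capacity: int) -> None:
    """Process a train/eval split segment by segment with a relational memory (Algorithm 1)."""
    for article in articles:
        memory: List[Triple] = []          # M is re-initialized when a new article starts
        for segment in article:            # fixed-length text segments x_c
            step_fn(segment, memory)       # TRAIN(x_c, M) or EVAL(x_c, M)
            if not is_train:
                dynamic_openie(segment)    # grow the knowledge graph from seen text only
            retrieved = retrieve_relations(segment)
            memory.extend(retrieved)       # first-in-first-out memory of at most `capacity` triples
            memory[:] = memory[-capacity:]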

Mismatch between Training and Evaluation. As shown in Algorithm 1, because we do not use dynamic OpenIE during training due to its additional efficiency overhead (see speed comparison in §4.1), this results in a mismatch between training and evaluation. We extract all the relation triples from the training set of each dataset D before training on D. As a result, during training we may retrieve relation triples extracted from unseen text of the training set when performing relation retrieval (§2.2.2). We do not suffer from this issue during evaluation, as we extract relations from previously seen text of our evaluation set. We believe this mismatch is minor given the superior performance of our model in the experiments.

2.2.2 Relation Retrieval

Given a knowledge graph (represented as a collection of triples), an ideal relational memory consists of a set of triples that are relevant to the observed context. There are many choices to measure the relatedness between the observed context and relation triples in our knowledge graph—for example, based on keyword search or dense retrieval (Karpukhin et al., 2020; Guu et al., 2020; Yogatama et al., 2021).

In this work, we use keyword search because of its simplicity and leave methods based on dense retrieval to future work. Specifically, given the observed context, we perform entity recognition (Ratinov and Roth, 2009; Nadeau and Sekine, 2007) on this context and score the tagged entities with tf-idf (Ramos et al., 2003). The top-K scored entities (K is set to 5 in our experiments) are used to retrieve relations {R_{e_1}, . . . , R_{e_K}}. These retrieved relations are used to construct the relational memory M. Note that the entities are selected from the observed context, so that unseen text is not utilized. We limit the capacity of M to P. If the number of newly retrieved triples is larger than P, we randomly drop relations and only select P of them to be inserted into M. Otherwise, the relational memory operates with a first-in-first-out principle. When M is full, older retrieved relations will be overwritten by newly retrieved relations. The relational memory is re-initialized to empty when an article ends.

As shown in Algorithm 1, since we update M only after processing an entire text segment, all the tokens in the same text segment will be conditioned on the same relational memory. This approach is more efficient compared to updating M each time a new entity is encountered and is more amenable to batch training.
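The retrieval step can be sketched as follows. This is a simplified illustration under our own assumptions: the entity tagger and the document-frequency statistics come from elsewhere, and the helper names are ours, not the released code.

import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

def top_k_entities(context_entities: List[str],
                   doc_freq: Dict[str, int], num_docs: int, k: int = 5) -> List[str]:
    """Score entities tagged in the observed context with tf-idf and keep the top K."""
    tf = Counter(context_entities)
    scores = {e: tf[e] * math.log(num_docs / (1 + doc_freq.get(e, 0))) for e in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def retrieve_relations(context_entities: List[str],
                       kg: Dict[str, Set[Triple]],
                       doc_freq: Dict[str, int], num_docs: int, k: int = 5) -> List[Triple]:
    """Collect the one-hop triples R_e for each of the top-K salient entities."""
    triples: List[Triple] = []
    for e in top_k_entities(context_entities, doc_freq, num_docs, k):
        triples.extend(kg.get(e, ()))
    return triples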

2.2.3 Integration with Transformer-XL

We now show how we can integrate relational memory with transformer-XL. We refer to our model as RELATIONLM.

Relation Triple Encoding. We first discuss how we encode relation triples in the relational memory M. We treat relation triples as text and serialize each relation triple into a sequence; for example, (Barack Obama, president of, United States) is converted into the sequence ''Barack Obama, president of, United States''. This sequential representation can well capture the order of head entities and tail entities and is also adopted by KG-BERT (Yao et al., 2019) and KEPLER (Wang et al., 2021b). Because each example in a batch corresponds to P retrieved relations, we obtain B · P relation sequences for each batch, where B and P denote batch size and relational memory length, respectively. With hundreds of relation triples per example, this prevents us from using large models (e.g., a multi-layer transformer) to encode these sequences due to memory constraints. In our preliminary experiments, we compare LSTM (Hochreiter and Schmidhuber, 1997), GRU (Cho et al., 2014), and a one-layer transformer and find that LSTM performs marginally better. Therefore, for each relation triple r_p, we reuse the


transformer-XL word embedding matrix W_e to map each token in the sequence to its embedding vector. We then run the LSTM to encode the sequence and use the hidden representation of the last token as the relation representation r_p.

There are other approaches to encode relation triples, for example, embedding-based (Bordes et al., 2013; Trouillon et al., 2016) and graph-based (Schlichtkrull et al., 2018; Zhang and Chen, 2018) methods. We leave a comparison of these approaches to future work.

Dataset        # Train  # Valid  # Test  # Articles  # Vocab  # Entities  # Relations  # Relations/Entity
WikiText-103   103M     0.2M     0.2M    28,595      267,735  980k        8.9M         9.03
WMT19          151M     0.3M     0.3M    169,180     50,259   976k        7.8M         7.97
enwik8         94M      5M       5M      12,350      256      361k        2.4M         6.66

Table 1: Statistics of datasets used in our experiments. For each subset, we show the number of (sub)words for WikiText-103 and WMT19 or the number of characters for enwik8.

Integration. Given a text segment x_c = {x_{t-N+1}, . . . , x_t}, after L self-attention layers with transformer-XL, we obtain contextualized representations {h^L_{t-N+1}, . . . , h^L_t}. At each timestep t, we use its hidden representation h^L_t as the query vector to attend over the P encoded contents of M, i.e., {r_1, . . . , r_P}. We use a standard scaled dot-product attention (Vaswani et al., 2017) to aggregate all triples into a single vector:

    m_t = \sum_{p=1}^{P} \frac{\exp(h^L_t \cdot r_p / \sqrt{d})}{\sum_{j=1}^{P} \exp(h^L_t \cdot r_j / \sqrt{d})} \, r_p,

where d denotes the hidden size of our transformer-XL. Finally, we combine m_t and the transformer-XL representation h^L_t via a gate:

    g_t = \sigma(W_g [h^L_t, m_t])
    z_t = g_t \odot h^L_t + (1 - g_t) \odot m_t
    p(x_{t+1} \mid x_{\le t}) = \mathrm{softmax}(W_e z_t),

where \sigma is the sigmoid function, [·, ·] denotes concatenation of two vectors, \odot is element-wise multiplication, and W_e is the embedding matrix shared by both input and output embeddings (Inan et al., 2016). The only new parameters introduced by our method are an LSTM relation encoder and the gate matrix W_g. This gating mechanism allows our model to adaptively take advantage of relational information for predicting the next token.
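As a minimal numerical sketch of this readout (our illustration, not the released Haiku code: the LSTM encodings r_1, ..., r_P are assumed to be given, and a single unbatched timestep is shown):

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def relational_readout(h_t, R, W_g, W_e):
    # h_t: (d,) top-layer transformer-XL state; R: (P, d) encoded triples r_1..r_P
    # W_g: (d, 2d) gate parameters; W_e: (V, d) tied input/output embedding matrix
    d = h_t.shape[0]
    attn = softmax(R @ h_t / np.sqrt(d))              # scaled dot-product weights over triples
    m_t = attn @ R                                     # aggregated relation vector
    g_t = 1.0 / (1.0 + np.exp(-W_g @ np.concatenate([h_t, m_t])))   # sigmoid gate
    z_t = g_t * h_t + (1.0 - g_t) * m_t                # adaptive mixture of h_t and m_t
    return softmax(W_e @ z_t)                          # p(x_{t+1} | x_<=t)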


3 Experiments

Our experiments seek to evaluate the effect of augmenting language models with a relational memory. We introduce the datasets used for evaluation (§3.1), discuss implementation details (§3.2), and present our main results (§3.3). We then show ablation studies and further analysis of our model (§4).

3.1 Datasets and OpenIE

We use three English language modeling datasets: WikiText-103 (Merity et al., 2017), WMT19 (Barrault et al., 2019), and enwik8 (Hutter, 2012). Descriptive statistics of these datasets are shown in Table 1. WikiText-103 and WMT19 are (sub)word-level datasets, while enwik8 is a character-level dataset.

WikiText-103 is a knowledge-driven dataset consisting of featured articles from English Wikipedia. WMT19 contains English news from the WMT19 workshop.3 The news articles are segmented into months. We use the news from January to October for training, and news in November and December for development and test, respectively. Compared to Wikipedia articles, news contains more dynamic and temporal information, exposing new challenges for utilizing relational information. We reuse the vocabulary of GPT-2 (Radford et al., 2019) with 50,259 tokens to tokenize this dataset. enwik8 contains more than 100M bytes of Wikipedia text. Character-level language modeling has a much smaller vocabulary size than (sub)word-level language modeling.

We perform OpenIE on each dataset. For enwik8, OpenIE is performed after detokenizing its text into words. Statistics of extracted relations are also included in Table 1. Each entity from WikiText-103, WMT19, and enwik8 has 9.03, 7.97, and 6.66 relation triples on average, respectively.

3http://www.statmt.org/wmt19/.


3.2 Implementation Details

All models are implemented with JAX4 (Bradbury et al., 2018) and Haiku5 (Hennigan et al., 2020). We set the hidden size to 512 and the number of layers to 16 for all models. In (sub)word-level language modeling, we use adaptive softmax (Grave et al., 2017) for efficiency. We use GELU (Hendrycks and Gimpel, 2016) as our activation function and Adam (Kingma and Ba, 2015) as the optimizer. For training, we use batch size 128 and train the models on 64 16GB TPUs. We apply 4,000 warmup steps, before utilizing cosine annealing to decay the learning rate. Dropout (Srivastava et al., 2014) is applied during training with a rate of 0.25.

We set the lengths of the text segment N, extended context M, and relational memory P to (512, 512, 300), (384, 384, 800), and (768, 1536, 400) for WikiText-103, WMT19, and enwik8, respectively. These are determined by grid searches on development sets.

3.3 Main Results

We compare with a strong transformer-XL baseline trained under the same setting as our model. Our main results are shown in Table 2. We obtain three observations comparing transformer-XL and RELATIONLM. First, RELATIONLM consistently outperforms transformer-XL on all three datasets, demonstrating the effectiveness of relational memory. Note that a decrease of 0.01 is considerable on enwik8 with the bits per character metric. Second, relational memory not only improves language modeling on knowledge-driven articles (WikiText-103), but also generalizes to the challenging news domain (WMT19), where information is more dynamic and temporal. Last, the results indicate that relational memory improves both (sub)word-level and character-level language modeling.

Complementarity to SPALM. SPALM (Yogatama et al., 2021) is a state-of-the-art memory-augmented language model. Instead of retrieving relation triples, it retrieves a set of related tokens at each timestep. Specifically, it first stores (context, next token) pairs from training data. It then uses a pre-trained transformer language model to measure the similarities between the stored contexts and the observed context during training/evaluation.
The next tokens of similar contexts are retrieved and are integrated with the observed context via a gating mechanism for generation.

Dataset        Model             # Params  Dev   Test
WikiText-103   Transformer-XL    122M      19.0  19.9
               RELATIONLM        124M      18.5  19.2
               SPALM             122M      18.1  19.0
                + RELATIONLM     124M      17.7  18.6
WMT19          Transformer-XL    114M      21.7  21.5
               RELATIONLM        116M      21.0  20.7
               SPALM             114M      20.4  20.3
                + RELATIONLM     116M      19.8  19.6
enwik8         Transformer-XL    93M       1.05  1.03
               RELATIONLM        95M       1.04  1.02
               SPALM             93M       1.04  1.02
                + RELATIONLM     95M       1.03  1.01

Table 2: We use perplexity (↓) on WikiText-103 and WMT19 and bits per character (↓) on enwik8 for evaluation.

We investigate whether RELATIONLM is complementary to SPALM. Because SPALM also uses a gating mechanism for integrating the retrieved tokens, we first apply RELATIONLM to combine the transformer-XL output h^L_t with relational information to obtain z_t (as shown in §2.2.3), before using SPALM to integrate z_t with retrieved tokens. The results are shown in Table 2. SPALM outperforms transformer-XL and even performs comparably or better compared to RELATIONLM on the three datasets, demonstrating the effectiveness of retrieving related tokens. However, integrating RELATIONLM and SPALM further improves performance, indicating that these two models are not mutually exclusive. Therefore, retrieving relation triples brings complementary benefits to retrieving tokens.
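The composition of the two gates can be pictured with the following toy sketch. It is our own illustration with assumed shapes; y_t stands for SPALM's aggregated retrieved-token vector, whose exact computation follows Yogatama et al. (2021) and is not reproduced here.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combined_state(h_t, m_t, y_t, W_rel, W_tok):
    # h_t: transformer-XL state, m_t: aggregated relation vector (RELATIONLM),
    # y_t: aggregated retrieved-token vector (SPALM); all assumed of shape (d,).
    g_rel = sigmoid(W_rel @ np.concatenate([h_t, m_t]))
    z_t = g_rel * h_t + (1 - g_rel) * m_t        # RELATIONLM output (Section 2.2.3)
    g_tok = sigmoid(W_tok @ np.concatenate([z_t, y_t]))
    return g_tok * z_t + (1 - g_tok) * y_t       # state fed to the output softmax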

4 Analysis

In this section, we study several design choices of relational memory, including its knowledge source, input component, capacity, dynamic OpenIE, entity scoring method, and speed. We then show quantitative and qualitative analysis results to better understand our model.

4https://github.com/google/jax.
5https://github.com/deepmind/dm-haiku.

4.1 Ablations and Design Choice Studies

For the ablation studies, we use the development set of WikiText-103.


Model                      Dev
Transformer-XL             19.0
RELATIONLM + Freebase      19.0
RELATIONLM + OpenIE        18.5

Table 3: RELATIONLM with OpenIE or Freebase triples.

Model                      Dev
Transformer-XL             19.0
Triple − Relation − Tail   19.0
Triple − Relation          18.7
Triple                     18.5

Table 4: Ablating relation and/or tail entity from a relation triple.

Source of Relation Triples. We compare relation triples extracted from Freebase or using OpenIE. In the Freebase case, we use the Freebase API6 to obtain relation triples for each entity. For WikiText-103, there are 10.74 relations per entity on average, which is comparable to OpenIE relations (9.03 relations/entity). The results are shown in Table 3. Although Freebase relations have been observed to improve the performance on smaller datasets (e.g., WikiText-2; Logan et al., 2019) and particular domains (e.g., movies and actors; Ahn et al., 2016), we find that RELATIONLM with Freebase relations does not improve over transformer-XL on the much larger WikiText-103 dataset. We observe that a large portion of Freebase relations is from infoboxes of Wikipedia pages, which only cover information such as occupation, birth place, and religion. We believe these triples are too general to be useful for most contexts. The result of RELATIONLM with OpenIE shows the advantages of extracting relations from each dataset compared to using Freebase relations.

Ablating Relation Triples. We ablate relation and/or tail entity from a relation triple (head entity, relation, tail entity) to study the contribution brought by each component. The results are shown in Table 4. We find that ablating both relation and tail entity performs comparably to transformer-XL. As head entities are extracted from the observed context, we believe the extended memory of transformer-XL can offset the effect brought by conditioning on head entities. Ablating relation performs better than transformer-XL. This shows the advantage of introducing tail entities. Using complete relation triples performs the best, demonstrating the effectiveness of this triple representation of knowledge.

6https://developers.google.com/freebase.

Figure 2: Perplexity on WikiText-103 with different numbers of relation triples.

Figure 3: Increasing extended memory length.

Length of Relational Memory. We study how many relation triples need to be stored in the relational memory. As shown in Figure 2, the perplexity improves with more relation triples. However, the curve becomes flat with more than 300 relation triples.

Length of Transformer-XL Memory. As increasing the length of the context window can capture longer dependency, we study whether increasing the length of the extended (transformer-XL) memory removes the performance gap between RELATIONLM and transformer-XL. As shown in Figure 3, the performance of both RELATIONLM and transformer-XL improves with larger extended memory. However, RELATIONLM still outperforms transformer-XL even with extended


memory length 3072. We conclude that relational memory brings complementary benefits to simply expanding extended memory, since it provides global information about entities on each dataset.

Dynamic OpenIE. All our main results use dynamic OpenIE. We show results without dynamic OpenIE in Table 5. We include the results on three datasets for comparison. We can see that RELATIONLM with dynamic OpenIE performs comparably to RELATIONLM without dynamic OpenIE on WikiText-103 and enwik8, while larger improvements are obtained on WMT19. This indicates that dynamic OpenIE is more helpful for the news domain, which is more dynamic and temporal compared to knowledge-driven articles.

Entity Scoring. We study different entity scoring mechanisms for relation retrieval. We consider random selection (where entities extracted from the observed context are randomly selected), frequency-based scoring, and tf-idf scoring. As shown in Table 6, tf-idf performs the best.

Speed Comparison. The wall clock time for both training and evaluation is shown in Table 7. RELATIONLM is 1.5 and 2.1 times slower during training and evaluation, respectively. Evaluation slows down more due to dynamic OpenIE, as shown in Algorithm 1.

Model                 WikiText  WMT    enwik8
Transformer-XL        19.0      21.7   1.05
w/o Dynamic OpenIE    18.6      21.4   1.04
w/ Dynamic OpenIE     18.5      21.0   1.04

Table 5: Perplexity with and without dynamic OpenIE.

Model        Dev
Random       19.1
Frequency    18.7
tf-idf       18.5

Table 6: Perplexity with different entity scoring methods.

Model             Train   Eval
Transformer-XL    0.51    0.31
RELATIONLM        0.76    0.65

Table 7: The unit is second/step. We use batch size 128 for training and batch size 1 for evaluation.

Dataset     Subset   # Entity   # Non-Entity
WikiText    Dev      61.6k      155.9k
            Test     65.8k      179.7k
WMT         Dev      84.9k      262.2k
            Test     81.0k      256.6k
enwik8      Dev      1.7M       3.3M
            Test     1.7M       3.3M

Table 8: Statistics of entity and non-entity tokens.

Metric            Dataset     Model             Dev    Test
Knowledge PPL     WikiText    Transformer-XL    47.3   52.3
                              RELATIONLM        45.6   50.9
                  WMT         Transformer-XL    77.2   77.0
                              RELATIONLM        73.2   73.1
                  enwik8      Transformer-XL    2.25   2.21
                              RELATIONLM        2.22   2.19
Non-entity PPL    WikiText    Transformer-XL    13.3   13.8
                              RELATIONLM        13.0   13.4
                  WMT         Transformer-XL    14.4   14.4
                              RELATIONLM        14.2   14.3
                  enwik8      Transformer-XL    1.98   1.95
                              RELATIONLM        1.98   1.95

Table 9: Knowledge perplexity (↓) and non-entity perplexity (↓).

4.2 Does Relational Memory Improve Coherence?

For evaluating coherence, we use two automatic metrics—knowledge perplexity and knowledge F1—to investigate whether the models can faithfully use entities. We further perform a human evaluation to study whether language models can generate coherent and knowledgeable sequences. We believe the human evaluation is a reliable way of evaluating coherence; this claim is advocated in Barzilay and Lapata (2005). We note that question answering is also often used to evaluate coherence (Guu et al., 2020; Lin et al., 2021). We leave this to future work.

Knowledge Perplexity. While vanilla perplexity considers all words in an evaluation set, knowledge perplexity only considers entities for calculating perplexity. We use it to evaluate whether the model can assign higher probabilities to the correct entities under different contexts. Table 8 shows the numbers of entity words and non-entity words in our corpora. We show the results in Table 9. We observe that the gap between RELATIONLM and transformer-XL is larger on


knowledge perplexity. RELATIONLM only performs comparably or slightly better compared to transformer-XL on non-entity perplexity. This shows that relational memory is helpful for predicting entity words. Note that knowledge perplexity tends to be much higher than perplexity on non-entity words, indicating the difficulty of predicting entity words. This collection of results indicates that relational memory helps the model use entities coherently and consistently under different contexts.

Metric      Dataset    Model             Dev    Test
Knowledge   WikiText   Transformer-XL    9.4    9.9
F1                     RELATIONLM        11.2   11.4
            WMT        Transformer-XL    11.0   11.4
                       RELATIONLM        12.3   12.6
            enwik8     Transformer-XL    18.9   16.0
                       RELATIONLM        19.4   16.6

Table 10: Knowledge F1 (↑).
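As a minimal sketch of how such a restricted perplexity can be computed (our own illustration, assuming per-token natural-log probabilities from the model and entity flags from the same tagger used for retrieval):

import math
from typing import List

def restricted_perplexity(token_log_probs: List[float], keep: List[bool]) -> float:
    # keep[i] is True for entity tokens (knowledge perplexity) or, alternatively,
    # for non-entity tokens (non-entity perplexity).
    selected = [lp for lp, flag in zip(token_log_probs, keep) if flag]
    return math.exp(-sum(selected) / len(selected))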

Knowledge F1. We use knowledge F1 to explore whether our model generates tokens that are grounded to its contexts. Given a context as input, we sequentially generate 32 words (or 128 characters) for word-level (character-level) language modeling by sampling from the distribution of the next word (character). To reduce variance, we generate 100 continuations for each context. We then perform entity recognition for both the generated sequences and their corresponding ground-truth sequences and calculate an F1 score based on these two sets of entities. For example, given the context ''Ayola was nominated and shortlisted for the 'Female Performance in TV' award'', we compare the generated text and the ground truth ''in the 2006 Screen Nation Awards, for her role as Kyla Tyson in Holby City'' to calculate F1. The results are shown in Table 10. We notice that RELATIONLM performs better compared to transformer-XL. We conclude that models with relational memory can generate more coherent and logical text.
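The score for a single continuation can be sketched as follows (our illustration; how the 100 sampled continuations per context are aggregated is an assumption left out here):

from typing import Set

def knowledge_f1(generated_entities: Set[str], reference_entities: Set[str]) -> float:
    # Entities are produced by running the same entity recognizer on the
    # generated continuation and on the ground-truth continuation.
    if not generated_entities or not reference_entities:
        return 0.0
    overlap = len(generated_entities & reference_entities)
    if overlap == 0:
        return 0.0
    precision = overlap / len(generated_entities)
    recall = overlap / len(reference_entities)
    return 2 * precision * recall / (precision + recall)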

Human Evaluation. We conduct a human evaluation to study whether language models can generate coherent and knowledgeable sequences. We take 1,000 contexts from the test set of WikiText-103. We show the contexts, ground-truth sequences, and continuations generated by RELATIONLM and transformer-XL to five annotators. We use greedy decoding for both models. We shuffle the order of the continuations generated by RELATIONLM and transformer-XL so that the annotators are unaware of the sources of the sequences. We then pose the following questions to the annotators:

1. Coherent. Given the context and its ground-truth continuation for reference, which generated sequence is more logical and coherent?

2. Knowledgeable. Given the context and its ground-truth continuation, which generated sequence provides more insights and is more knowledgeable?

We show the results in Table 11. We find that RELATIONLM outperforms transformer-XL in the human evaluation. These results are consistent with the two automatic metrics, knowledge perplexity and knowledge F1. This corroborates our claim that relational memory improves coherence in language modeling.

Model             Coherent   Knowledgeable
Transformer-XL    388        416
RELATIONLM        612        584

Table 11: We show the number of contexts in which a continuation from a particular model is chosen by human evaluators for each evaluation criterion. Recall that the total number of contexts used for human evaluation is 1,000. Because we have five annotators, we use majority voting to decide the favored model for each continuation. We use the Kappa statistic to measure inter-annotator agreement. The statistic is 0.64, which shows substantial agreement among the annotators.

4.3 Qualitative Analysis

Gate Values. As we use a gating function to integrate transformer-XL with relational information, we study gate values in this section. The histogram of gate values is shown in Figure 5. We notice that the histogram concentrates around 0.9. This is expected because non-entity words, which account for a large portion of text (according to Table 8), benefit less from the relational memory and mainly rely on the observed context for prediction, as shown in §4.2. We further calculate the average gate values for entity words and non-entity words. The average gate value for entity

Figure 4: Heatmap of gate values.

Figure 5: Histogram of gate values g_t.

words is 0.87, while the average value is 0.92 for non-entity words. This confirms that entity words rely more on relational information for prediction compared to non-entity words. We also plot the heatmap of gate values, and a cherry-picked example is shown in Figure 4. Note that we randomly select 100 dimensions from the 512 dimensions for readability. We notice that the entities, Aberdeen and Alec Flett, use more relational information than other positions (as shown by the horizontal blue lines). These results demonstrate that RELATIONLM can adaptively incorporate relational information for prediction.

Table 12: Three examples of text segments and retrieved relations (based on previous text segments).

7https://en.wikipedia.org/wiki/Joe_Biden_2008_presidential_campaign.
8https://en.wikipedia.org/wiki/Joe_Biden.
9https://en.wikipedia.org/wiki/Joe_Biden_1988_presidential_campaign.

Example. We show three cherry-picked examples in Table 12. We take the first for illustration, which shows a text segment from the article Joe Biden 2008 presidential campaign7 and some retrieved relations. We find that the first two relations, (Joe Biden, senior Senator, Delaware) and (Joe Biden presidential campaign, began, January 7 2007), are extracted from previous text segments, while (Joe Biden, was nominated, vice president) and (Biden, withdrew nomination, 1987) are extracted from the other articles, Joe Biden8 and Joe Biden 1988 presidential campaign,9 respectively. We notice that the relation (Joe Biden, was nominated, vice president) is highly predictive of the sequence, ''Biden was selected to be Democratic presidential nominee Barack Obama's vice presidential running mate''. From the observed context, the model also identifies a closely related entity, Barack Obama, and retrieves the relation (Barack Obama, president of, United States). Therefore, we conclude that the relational memory can give a global picture of


related entities and provide relevant information
for language modeling.

Causal Intervention. We use causal intervention to study whether changing the contents of the relational memory will affect language model predictions. Given the relation (Obama, born in, Hawaii) along with other relations about Barack Obama, we let the model complete the sequence ''Obama was born in''. RELATIONLM outputs ''Obama was born in and raised in Hawaii.'' with greedy decoding. However, after modifying the relation to (Obama, born in, Kenya), we obtain ''Obama was born in Kenya and was the first African-American president.'' We further change the relation to (Obama, born in, Paris) and the model outputs ''Obama was born in Paris, France.'' This indicates that RELATIONLM can take advantage of relation triples for making predictions. While we can also use prompts as interventions for vanilla language models, it remains challenging to select appropriate prompts for different applications (Liu et al., 2021a).
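Such an intervention amounts to editing one entry of M before decoding; a tiny, model-agnostic sketch (our own illustration, with the decoding function passed in because it depends on the trained model):

from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]

def intervene(generate: Callable[[str, List[Triple]], str],
              prompt: str, memory: List[Triple],
              index: int, new_triple: Triple) -> Tuple[str, str]:
    """Greedy-decode a continuation before and after replacing one triple in M."""
    before = generate(prompt, memory)
    edited = list(memory)
    edited[index] = new_triple    # e.g., (Obama, born in, Hawaii) -> (Obama, born in, Kenya)
    after = generate(prompt, edited)
    return before, after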

5 Related Work

Knowledge-enhanced Architectures. Injecting symbolic knowledge into machine learning models is widely adopted to improve the performance of natural language understanding (Annervaz et al., 2018; Ostendorff et al., 2019), question answering (Zhang et al., 2018; Huang et al., 2019; Hixon et al., 2015), dialogue systems (Zhang et al., 2018; Moon et al., 2019; Guo et al., 2018; Liu et al., 2021b), and recommendation systems (Zhang et al., 2016; Wang et al., 2018a, 2019). Different from these models, we focus on using symbolic knowledge for language modeling. Existing language models are prone to generating illogical and contradictory contents. We believe that connecting language modeling and knowledge graphs is a promising direction to overcome this problem. Next we review previous knowledge-enhanced language models.

Knowledge-enhanced Language Models. Our model is closely related to previous work on grounding autoregressive language models with knowledge graphs (Ahn et al., 2016; Logan et al., 2019; Hayashi et al., 2020; Wang et al., 2021a). However, these models rely on complex and ad hoc preprocessing or rules to link text with knowledge bases (e.g., Freebase and Wikidata). As a result, previous work is more aligned with conditional language modeling, for example, graph-to-text generation p(x|G) in Wang et al. (2021a), which contrasts with the unconditional language modeling p(x) considered in this work. As the graph G is constructed with the unseen text x, predicting x given G is easier due to this information leakage for Wang et al. (2021a). Also, in Hayashi et al. (2020), topic entities are required for language modeling, which may not be available in most datasets, for example, the news domain. We do not compare with these previous models due to the different settings. In contrast, we adopt OpenIE relations and use a tf-idf search to retrieve relation triples for connecting language models and knowledge graphs. In the experiments, we demonstrate the effectiveness of our approach on three datasets, WikiText-103, WMT19, and enwik8.

There are language models incorporating entity information, such as entity coreference annotations (Ji et al., 2017; Clark et al., 2018), surface forms of entities (Kiddon et al., 2016; Yang et al., 2017; Cao et al., 2021), entity types (Parvez et al., 2018; Wang et al., 2018b), and entity descriptions (Bahdanau et al., 2017). Different from these models, we augment language models with a relational memory consisting of relation triples. We demonstrate the effectiveness of using relation triples by ablating tail entities and relations in §4.1.

Knowledge-enhanced Pretraining. Using knowledge information for pretraining language models (Peters et al., 2019; Sun et al., 2019; Liu et al., 2020; Guu et al., 2020; Wang et al., 2021b; Agarwal et al., 2021; Verga et al., 2021) has recently grown in popularity and has achieved substantial improvements on knowledge-driven tasks such as question answering and named entity recognition. Instead of using knowledge information for improving downstream knowledge-driven tasks, we focus on using knowledge information for improving the generation capability of the language model itself.

Retrieval-augmented Models. Retrieval-augmented models are now widely adopted in open-domain question answering (Chen et al., 2017; Lewis et al., 2020; de Masson d'Autume et al., 2019; Izacard and Grave, 2021), dialogue (Dinan et al., 2019; Fan et al., 2021; Thulke et al., 2021), and machine translation (Bapna and Firat, 2019;

Khandelwal et al., 2020a). We focus on retrieval augmentation for language modeling (Merity et al., 2017; Grave et al., 2016; Khandelwal et al., 2020b; Yogatama et al., 2021). These algorithms are specifically tailored for language modeling, where related tokens are retrieved to help predict the next token. In this work, we move beyond token augmentation and show the benefits of retrieving relation triples. We also demonstrate in the experiments that our model is complementary to a token augmentation model, SPALM (Yogatama et al., 2021).

6 Conclusion

We presented RELATIONLM, a language model that is augmented with relational memory. We showed how to obtain relevant knowledge graphs for a given corpus and how to combine them with a state-of-the-art language model such as transformer-XL. We demonstrated that our model improves performance and coherence on WikiText-103, WMT19, and enwik8. We also performed a comprehensive analysis to better understand how our model works. Our model provides a way to combine an autoregressive language model with general knowledge graphs.

Acknowledgments

We would like to thank our action editor (Xavier Carreras) and three anonymous reviewers for their insightful comments. We also thank Angeliki Lazaridou, Cyprien de Masson d'Autume, Lingpeng Kong, Laura Rimell, Aida Nematzadeh, and the DeepMind language team for their helpful discussions.

References

Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565. https://doi.org/10.18653/v1/2021.naacl-main.278

Sungjin Ahn, Heeyoul Choi, Tanel P¨arnamaa, and Yoshua Bengio. 2016. A neural knowledge language model. arXiv preprint arXiv:1608.00318.

Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D. Manning. 2015. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 344–354. The Association for Computer Linguistics. https://doi.org/10.3115/v1/P15-1034

K. M. Annervaz, Somnath Basu Roy Chowdhury, and Ambedkar Dukkipati. 2018. Learning beyond datasets: Knowledge graph augmented neural networks for natural language processing. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 313–322. Association for Computational Linguistics.

Dzmitry Bahdanau, Tom Bosc, Stanislaw Jastrzebski, Edward Grefenstette, Pascal Vincent, and Yoshua Bengio. 2017. Learning to compute word embeddings on the fly. CoRR, abs/1706.00286.

Ankur Bapna and Orhan Firat. 2019. Non-parametric adaptation for neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 1921–1931. Association for Computational Linguistics.

Lo¨ıc Barrault, Ondˇrej Bojar, Marta R. Costa-juss`a, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias M¨uller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61,


Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-5301

Regina Barzilay and Mirella Lapata. 2005. Modeling local coherence: An entity-based approach. In ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 25-30 June 2005, University of Michigan, USA, pages 141–148. The Association for Computer Linguistics. https://doi.org/10.3115/1219840.1219858

Yoshua Bengio, R´ejean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Kurt D. Bollacker, Robert P. Cook, and Patrick Tufts. 2007. Freebase: A shared database of structured general human knowledge. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada, pages 1962–1963. AAAI Press.

Antoine Bordes, Nicolas Usunier, Alberto Garc´ıa-Dur´an, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 2787–2795.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. 2018. JAX: Composable transformations of Python+NumPy programs.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2021. Autoregressive entity retrieval. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1870–1879. Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1171

KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259. https://doi.org/10.3115/v1/W14-4012

Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018. Neural text generation in stories using entity representations as context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2250–2260. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1204

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 2978–2988. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-powered conversational agents. In 7th Interna-


tional Conference on Learning Representations,
ICLR 2019, Nueva Orleans, LA, EE.UU, Puede 6-9,
2019. OpenReview.net.

Dan Hendrycks and Kevin Gimpel. 2016. Gaus-
sian error linear units (gelus). arXiv preprint
arXiv:1606.08415.

Oren Etzioni, Michele Banko, Stephen Soderland,
y Daniel S.. Weld. 2008. Open information
extraction from the web. Comunicaciones de
the ACM, 51(12):68–74. https://doi.org
/10.1145/1409360.1409378

Angela Fan, Claire Gardent, Chlo´e Braud, y
Antonio Bordes. 2021. Augmenting transform-
ers with KNN-based composite memory for
dialog. Transacciones de la Asociación de Compu-
lingüística nacional, 9:82–99. https://doi
.org/10.1162/tacl_a_00356

´Edouard Grave, Armand Joulin, Moustapha Ciss´e,
David Grangier, and Herv´e J´egou. 2017. Ef-
ficient softmax approximation for GPUs. En
Actas de
the 34th International Con-
ference on Machine Learning, volumen 70 de
Actas de investigación sobre aprendizaje automático,
pages 1302–1310. PMLR.

Edouard Grave, Armand Joulin, and Nicolas
Usunier. 2016.
idioma
models with a continuous cache. CORR,
abs/1612.04426.

Improving neural

Daya Guo, Duyu Tang, Nan Duan, Ming Zhou,
and Jian Yin. 2018. Dialog-to-action: Conver-
sational question answering over a large-scale
base de conocimientos. In Advances in Neural Infor-
mation Processing Systems 31: Annual Confer-
ence on Neural Information Processing Systems
2018, NeurIPS 2018, December 3-8, 2018,
Montr´eal, Canada, pages 2946–2955.

Kelvin Gu, Kenton Lee, Zora Tung, Panupong
Pasupat, and Ming-Wei Chang. 2020. REALM:
pre-
retrieval-augmented
training. CORR, abs/2002.08909.

modelo de lenguaje

Hiroaki Hayashi, Zecong Hu, Chenyan Xiong, y
Graham Neubig. 2020. Latent relation language
modelos. In The Thirty-Fourth AAAI Confer-
ence on Artificial Intelligence, AAAI 2020, El
Thirty-Second Innovative Applications of Arti-
ficial Intelligence Conference, IAAI 2020, El
Tenth AAAI Symposium on Educational Ad-
vances in Artificial Intelligence, EAAI 2020,
Nueva York, Nueva York, EE.UU, Febrero 7-12, 2020,
pages 7911–7918. AAAI Press. https://doi
.org/10.1609/aaai.v34i05.6298

Tom Hennigan, Trevor Cai, Tamara Norman, y
Igor Babuschkin. 2020. Haiku: Sonnet for JAX.

Ben Hixon, Peter Clark, and Hannaneh Hajishirzi.
2015. Learning knowledge graphs for ques-
tion answering through conversational dialog.
In NAACL HLT 2015, The 2015 Conference of
the North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, Denver, Colorado, USA,
May 31 – June 5, 2015, pages 851–861. The
Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Computation,
9(8):1735–1780. https://doi.org/10.1162
/neco.1997.9.8.1735, PubMed: 9377276

Xiao Huang, Jingyuan Zhang, Dingcheng Li, and
Ping Li. 2019. Knowledge graph embedding
based question answering. In Proceedings of
the Twelfth ACM International Conference on
Web Search and Data Mining, WSDM 2019,
Melbourne, VIC, Australia, February 11-15,
2019, pages 105–113. ACM. https://doi
.org/10.1145/3289600.3290956

Marcus Hutter. 2012. The human knowledge com-
pression contest. http://prize.hutter1.net, 6.

Hakan Inan, Khashayar Khosravi, and Richard
Socher. 2016. Tying word vectors and word
classifiers: A loss framework for language
modeling. CoRR, abs/1611.01462.

Gautier Izacard and Edouard Grave. 2021.
Leveraging passage retrieval with generative
models for open domain question answering.
In Proceedings of the 16th Conference of
the European Chapter of the Association
for Computational Linguistics: Main Volume,
EACL 2021, Online, April 19 – 23, 2021,
pages 874–880. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/2021.eacl-main.74

Frederick Jelinek. 1980. Interpolated estimation
of Markov source parameters from sparse data.
In Proceedings of Workshop on Pattern Rec-
ognition in Practice, 1980.

Yangfeng Ji, Chenhao Tan, Sebastian Martschat,
Yejin Choi, and Noah A. Smith. 2017. Dynamic
entity representations in neural language mod-
els. In Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Pro-
cessing, EMNLP 2017, Copenhagen, Denmark,
September 9-11, 2017, pages 1830–1839. Asso-
ciation for Computational Linguistics.

Daniel Kahneman. 2011. Thinking, Fast and Slow.
Farrar, Straus and Giroux.

Vladimir Karpukhin, Barlas Oguz, Sewon Min,
Patrick S. H. Lewis, Ledell Wu, Sergey Edunov,
Danqi Chen, and Wen-tau Yih. 2020. Dense
passage retrieval for open-domain question an-
swering. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing, EMNLP 2020, Online, November
16-20, 2020, pages 6769–6781. Association for
Computational Linguistics.

Urvashi Khandelwal, Angela Fan, Dan Jurafsky,
Luke Zettlemoyer, and Mike Lewis. 2020a.
Nearest neighbor machine translation. CoRR,
abs/2010.00710.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky,
Luke Zettlemoyer, and Mike Lewis. 2020b.
Generalization through memorization: Nearest
neighbor language models. In 8th International
Conference on Learning Representations, ICLR
2020, Addis Ababa, Ethiopia, April 26-30,
2020. OpenReview.net.

Chloé Kiddon, Luke Zettlemoyer, and Yejin
Choi. 2016. Globally coherent text generation
with neural checklist models. In Proceedings
of the 2016 Conference on Empirical Meth-
ods in Natural Language Processing, EMNLP
2016, Austin, Texas, USA, November 1-4, 2016,
pages 329–339. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/D16-1032

Diederik P. Kingma and Jimmy Ba. 2015. Adam:
A method for stochastic optimization. In 3rd
International Conference on Learning Repre-
sentations, ICLR 2015, San Diego, CA, USA,
May 7-9, 2015, Conference Track Proceedings.

Ben Krause, Emmanuel Kahembwe, Iain Murray,
and Steve Renals. 2018. Dynamic evaluation
of neural sequence models. In Proceedings of
the 35th International Conference on Machine
Learning, ICML 2018, Stockholmsmässan,
Stockholm, Sweden, July 10-15, 2018, volume
80 of Proceedings of Machine Learning Re-
search, pages 2771–2780. PMLR.

Ben Krause, Emmanuel Kahembwe, Iain Murray,
and Steve Renals. 2019. Dynamic evalua-
tion of transformer language models. CoRR,
abs/1904.08378.

Brenden M. Lake and Gregory L. Murphy. 2020.
Word meaning in minds and machines. CoRR,
abs/2008.01766.

Patrick S. H. Lewis, Ethan Perez, Aleksandra
Piktus, Fabio Petroni, Vladimir Karpukhin,
Naman Goyal, Heinrich Küttler, Mike Lewis,
Wen-tau Yih, Tim Rocktäschel, Sebastian
Riedel, and Douwe Kiela. 2020. Retrieval-
augmented generation for knowledge-intensive
NLP tasks. In Advances in Neural Information
Sistemas de procesamiento 33: Annual Conference on
Neural Information Processing Systems 2020,
NeurIPS 2020, December 6-12, 2020, virtual.

Stephanie Lin, Jacob Hilton, and Owain Evans.
2021. TruthfulQA: Measuring how models mimic
human falsehoods. CoRR, abs/2109.07958.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov,
Matthew E. Peters, and Noah A. Smith. 2019.
Linguistic knowledge and transferability of
contextual representations. In Proceedings of
the 2019 Conference of the North American
Chapter of the Association for Computa-
tional Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers),
pages 1073–1094, Minneapolis, Minnesota.
Association for Computational Linguistics.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao
Jiang, Hiroaki Hayashi, and Graham Neubig.
2021a. Pre-train, prompt, and predict: A sys-
tematic survey of prompting methods in natural
language processing. CoRR, abs/2107.13586.

Qi Liu, Lei Yu, Laura Rimell, and Phil Blunsom.
2021b. Pretraining the noisy channel model
for task-oriented dialogue. Transactions of
the Association for Computational Linguistics,
9:657–674. https://doi.org/10.1162
/tacl_a_00390

Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang,
Qi Ju, Haotang Deng, and Ping Wang. 2020.
K-BERT: Enabling language representation with
knowledge graph. In The Thirty-Fourth AAAI
Conference on Artificial Intelligence, AAAI 2020,
The Thirty-Second Innovative Applications of
Artificial Intelligence Conference, IAAI 2020,
The Tenth AAAI Symposium on Educational

569

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
7
6
2
0
2
0
7
2
1

/

/
t

yo

a
C
_
a
_
0
0
4
7
6
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
9
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Advances in Artificial Intelligence, EAAI 2020,
New York, NY, USA, February 7-12, 2020,
pages 2901–2908. AAAI Press. https://doi
.org/10.1609/aaai.v34i03.5681

Robert L. Logan, Nelson F. Liu, Matthew E.
Peters, Matt Gardner, and Sameer Singh. 2019.
Barack's wife Hillary: Using knowledge graphs
for fact-aware language modeling. In Proceed-
ings of the 57th Conference of the Association
for Computational Linguistics, ACL 2019, Flo-
rence, Italy, July 28 - August 2, 2019, Volume 1:
Long Papers, pages 5962–5971. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/P19-1598

Cyprien de Masson d’Autume, Sebastian Ruder,
Lingpeng Kong, and Dani Yogatama. 2019.
Episodic memory in lifelong language learning.
In Advances in Neural Information Processing
Systems.

Stephen Merity, Caiming Xiong, James Bradbury,
and Richard Socher. 2017. Pointer sentinel mix-
ture models. In 5th International Conference on
Learning Representations, ICLR 2017, Toulon,
France, April 24-26, 2017, Conference Track
Proceedings. OpenReview.net.

Pasquale Minervini, Matko Bosnjak, Tim
Rocktäschel, Sebastian Riedel, and Edward
Grefenstette. 2020. Differentiable reasoning
on large knowledge bases and natural lan-
guage. In The Thirty-Fourth AAAI Conference
on Artificial Intelligence, AAAI 2020, The
Thirty-Second Innovative Applications of Ar-
tificial Intelligence Conference, IAAI 2020, The
Tenth AAAI Symposium on Educational Ad-
vances in Artificial Intelligence, EAAI 2020,
New York, NY, USA, February 7-12, 2020,
pages 5182–5190. AAAI Press. https://
doi.org/10.1609/aaai.v34i04.5962

Seungwhan Moon, Pararth Shah, Anuj Kumar, and
Rajen Subba. 2019. OpenDialKG: Explainable
conversational reasoning with attention-based
walks over knowledge graphs. In Proceedings
of the 57th Annual Meeting of the Association
for Computational Linguistics.

David Nadeau and Satoshi Sekine. 2007. A sur-
vey of named entity recognition and classifica-
tion. Lingvisticae Investigationes, 30(1):3–26.
https://doi.org/10.1075/li.30.1
.03nad

Maxwell Nye, Michael Henry Tessler, Joshua
B. Tenenbaum, and Brenden M. Lake. 2021.
Improving coherence and consistency in neural
sequence models with dual-system, neuro-
symbolic reasoning. CoRR, abs/2107.02794.

Malte Ostendorff, Peter Bourgonje, Maria Berger,
Julián Moreno Schneider, Georg Rehm, and
Bela Gipp. 2019. Enriching BERT with knowl-
edge graph embeddings for document classifi-
cation. In Proceedings of the 15th Conference
on Natural Language Processing, KONVENS
2019, Erlangen, Germany, October 9-11, 2019.

Md. Rizwan Parvez, Saikat Chakraborty,
Baishakhi Ray, and Kai-Wei Chang. 2018.
Building language models for text with named
entities. In Proceedings of the 56th Annual
Meeting of the Association for Computational
Linguistics, ACL 2018, Melbourne, Australia,
July 15–20, 2018, Volume 1: Long Papers,
pages 2373–2383. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P18-1221

Matthew E. Peters, Mark Neumann, Robert L.
Logan IV, Roy Schwartz, Vidur Joshi, Sameer
Singh, and Noah A. Smith. 2019. Knowledge
enhanced contextual word representations. In
Proceedings of the 2019 Conference on Em-
pirical Methods in Natural Language Process-
ing and the 9th International Joint Conference
on Natural Language Processing, EMNLP-
IJCNLP 2019, Hong Kong, China, November
3-7, 2019, pages 43–54. Association for Com-
putational Linguistics. https://doi.org/10
.18653/v1/D19-1005

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel,
Patrick S. H. Lewis, Anton Bakhtin, Yuxiang
Wu, and Alexander H. Miller. 2019. Language
models as knowledge bases? In Proceedings
of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Nat-
ural Language Processing, EMNLP-IJCNLP
2019, Hong Kong, China, November 3-7, 2019,
pages 2463–2473. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D19-1250

Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI blog, 1(8):9.

Juan Ramos. 2003. Using tf-idf to determine word
relevance in document queries. In Proceed-
ings of the First Instructional Conference on
Machine Learning, 242, pages 29–48. Citeseer.

Lev Ratinov and Dan Roth. 2009. Design chal-
lenges and misconceptions in named entity
recognition. In Proceedings of the Thirteenth
Conference on Computational Natural Lan-
guage Learning (CoNLL-2009), pages 147–155.
https://doi.org/10.3115/1596374
.1596399

Thomas Rebele, Fabian M. Suchanek, Johannes
Hoffart, Joanna Biega, Erdal Kuzey, and
Gerhard Weikum. 2016. YAGO: A multilin-
gual knowledge base from Wikipedia, WordNet,
and GeoNames. In The Semantic Web – ISWC
2016 – 15th International Semantic Web Con-
ference, Kobe, Japan, October 17-21, 2016,
Proceedings, Part II, volume 9982 of Lecture
Notes in Computer Science, pages 177–185.
https://doi.org/10.1007/978-3-319
-46547-0_19

Michael Sejr Schlichtkrull, Thomas N. Kipf,
Peter Bloem, Rianne van den Berg, Ivan Titov,
and Max Welling. 2018. Modeling relational
data with graph convolutional networks. In The
Semantic Web – 15th International Conference,
ESWC 2018, Heraklion, Crete, Greece, June 3-7,
2018, Proceedings, volume 10843 of Lecture
Notes in Computer Science, pages 593–607.
Springer. https://doi.org/10.1007/978
-3-319-93417-4_38

Nitish Srivastava, Geoffrey E. Hinton, Alex
Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. 2014. Dropout: A simple way
to prevent neural networks from overfit-
ting. Journal of Machine Learning Research,
15(1):1929–1958.

Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun
Feng, Xuyi Chen, Han Zhang, Xin Tian,
Danxiang Zhu, Hao Tian, and Hua Wu.
2019. ERNIE: Enhanced representation through
knowledge integration. CoRR, abs/1904.09223.

David Thulke, Nico Daheim, Christian Dugast,
and Hermann Ney. 2021. Efficient retrieval aug-
mented generation from unstructured knowl-
edge for task-oriented dialog. arXiv preprint
arXiv:2102.04643.

Théo Trouillon, Johannes Welbl, Sebastian
Riedel, Éric Gaussier, and Guillaume Bouchard.
2016. Complex embeddings for simple link
prediction. In Proceedings of the 33rd Interna-
tional Conference on Machine Learning, ICML
2016, New York City, NY, USA, June 19-24,
2016, volume 48 of JMLR Workshop and
Conference Proceedings, pages 2071–2080.
JMLR.org.

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017. At-
tention is all you need. In Advances in Neural
Information Processing Systems 30: Annual
Conference on Neural Information Process-
ing Systems 2017, December 4-9, 2017, Long
Beach, CA, USA, pages 5998–6008.

Pat Verga, Haitian Sun, Livio Baldini Soares, and
William W. Cohen. 2021. Adaptable and inter-
pretable neural memory over symbolic knowl-
edge. In Proceedings of the 2021 Conference
of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human
Language Technologies, NAACL-HLT 2021,
Online, June 6-11, 2021, pages 3678–3691.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.naacl-main.288

Chenguang Wang, Xiao Liu, and Dawn Song.
2020. Language models are open knowledge
graphs. CoRR, abs/2010.11967.

Hongwei Wang, Fuzheng Zhang, Xing Xie,
and Minyi Guo. 2018a. DKN: Deep
knowledge-aware network for news recom-
mendation. In Proceedings of the 2018 World
Wide Web Conference on World Wide Web,
WWW 2018, Lyon, France, April 23-27, 2018,
pages 1835–1844. ACM. https://doi
.org/10.1145/3178876.3186175

Hongwei Wang, Fuzheng Zhang, Miao Zhao,
Wenjie Li, Xing Xie, and Minyi Guo. 2019.
Multi-task feature learning for knowledge graph
enhanced recommendation. In The World Wide
Web Conference, WWW 2019, San Francisco,
CA, USA, May 13-17, 2019, pages 2000–2010.
ACM.

Luyu Wang, Yujia Li, Ozlem Aslan, and Oriol
Vinyals. 2021a. WikiGraphs: A Wikipedia
text – knowledge graph paired dataset. In
Proceedings of the Fifteenth Workshop on
Graph-Based Methods for Natural Language
Processing (TextGraphs-15), pages 67–82,
Mexico City, Mexico. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2021.textgraphs-1.7

Qingyun Wang, Xiaoman Pan, Lifu Huang,
Boliang Zhang, Zhiying Jiang, Heng Ji, and
Kevin Knight. 2018b. Describing a knowledge
base. In Proceedings of the 11th International
Conference on Natural Language Generation,
Tilburg University, The Netherlands, November
5-8, 2018, pages 10–21. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/W18-6502

Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu,
Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and
Jian Tang. 2021b. KEPLER: A unified model
for knowledge embedding and pre-trained
language representation. Transactions of the
Association for Computational Linguistics,
9:176–194. https://doi.org/10.1162
/tacl_a_00360

Bishan Yang and Tom M. Mitchell. 2019. Lever-
aging knowledge bases in LSTMs for improving
machine reading. CoRR, abs/1902.09091.

Zichao Yang, Phil Blunsom, Chris Dyer, and
Wang Ling. 2017. Reference-aware language
models. In Proceedings of the 2017 Con-
ference on Empirical Methods in Natural
Language Processing, EMNLP 2017, Copen-
hagen, Denmark, September 9-11, 2017,
pages 1850–1859. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D17-1197

Liang Yao, Chengsheng Mao, and Yuan Luo.
2019. KG-BERT: BERT for knowledge graph
completion. CoRR, abs/1909.03193.

Michihiro Yasunaga, Hongyu Ren, Antoine
Bosselut, Percy Liang, and Jure Leskovec.
2021. QA-GNN: Reasoning with language
models and knowledge graphs for question
answering. CoRR, abs/2104.06378.
https://doi.org/10.18653/v1/2021
.naacl-main.45

Dani Yogatama, Cyprien de Masson d’Autume,
and Lingpeng Kong. 2021. Adaptive semi-
parametric language models. Transactions of
the Association for Computational Linguistics,
9:362–373. https://doi.org/10.1162
/tacl_a_00371

Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian,
Xing Xie, and Wei-Ying Ma. 2016. Col-
laborative knowledge base embedding for
recommender systems. In Proceedings of the
22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining,
San Francisco, CA, USA, August 13-17, 2016,
pages 353–362. ACM. https://doi.org
/10.1145/2939672.2939673

Muhan Zhang and Yixin Chen. 2018. Link pre-
diction based on graph neural networks. In
Advances in Neural Information Processing
Systems 31: Annual Conference on Neural In-
formation Processing Systems 2018, NeurIPS
2018, December 3-8, 2018, Montréal, Canada,
pages 5171–5181.

Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. 2019.
Quaternion knowledge graph embeddings. In
Advances in Neural Information Processing
Systems 32: Annual Conference on Neural In-
formation Processing Systems 2019, NeurIPS
2019, December 8-14, 2019, Vancouver, BC,
Canada, pages 2731–2741.

Yuyu Zhang, Hanjun Dai, Zornitsa Kozareva,
Alexander J. Smola, and Le Song. 2018.
Variational reasoning for question answering
with knowledge graph. In Proceedings of the
Thirty-Second AAAI Conference on Artificial
Intelligence, (AAAI-18), the 30th innovative Ap-
plications of Artificial Intelligence (IAAI-18),
and the 8th AAAI Symposium on Educational
Advances in Artificial Intelligence (EAAI-18),
New Orleans, Louisiana, USA, February 2-7,
2018, pages 6069–6076. AAAI Press.

Xiangyang Zhou, Lu Li, Daxiang Dong, Yi
Liu, Ying Chen, Wayne Xin Zhao, Dianhai
Yu, and Hua Wu. 2018. Multi-turn response
selection for chatbots with deep attention
matching network. In Proceedings of the 56th
Annual Meeting of the Association for Com-
putational Linguistics, ACL 2018, Melbourne,
Australia, July 15-20, 2018, Volume 1: Long
Papers, pages 1118–1127. Association for
Computational Linguistics.
