KEPLER: A Unified Model for Knowledge Embedding and
Pre-trained Language Representation

Xiaozhi Wang1, Tianyu Gao3, Zhaocheng Zhu4,5, Zhengyan Zhang1
Zhiyuan Liu1,2∗, Juanzi Li1,2, and Jian Tang4,6,7∗

1Department of CST, BNRist; 2KIRC, Institute for AI, Tsinghua University, Beijing, China
{wangxz20,zy-z19}@mails.tsinghua.edu.cn
{liuzy,lijuanzi}@tsinghua.edu.cn
3Department of Computer Science, Princeton University, Princeton, NJ, USA
tianyug@princeton.edu
4Mila-Québec AI Institute; 5Université de Montréal; 6HEC Montréal, Canada
zhaocheng.zhu@umontreal.ca, jian.tang@hec.ca
7CIFAR AI Research Chair

Abstract


Pre-trained language representation models (PLMs) cannot well capture factual knowledge from text. In contrast, knowledge embedding (KE) methods can effectively represent the relational facts in knowledge graphs (KGs) with informative entity embeddings, but conventional KE models cannot take full advantage of the abundant textual information. In this paper, we propose a unified model for Knowledge Embedding and Pre-trained LanguagE Representation (KEPLER), which can not only better integrate factual knowledge into PLMs but also produce effective text-enhanced KE with the strong PLMs. In KEPLER, we encode textual entity descriptions with a PLM as their embeddings, and then jointly optimize the KE and language modeling objectives. Experimental results show that KEPLER achieves state-of-the-art performances on various NLP tasks, and also works remarkably well as an inductive KE model on KG link prediction. Furthermore, for pre-training and evaluating KEPLER, we construct Wikidata5M1, a large-scale KG dataset with aligned entity descriptions, and benchmark state-of-the-art KE methods on it. It shall serve as a new KE benchmark and facilitate the research on large KG, inductive KE, and KG with text. The source code can be obtained from https://github.com/THU-KEG/KEPLER.

∗Correspondence to: Z. Liu and J. Tang.
1https://deepgraphlearning.github.io/project/wikidata5m.

1 Introduction

Recent pre-trained language representation models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019c) learn effective language representation from large-scale unstructured corpora with language modeling objectives and have achieved superior performances on various natural language processing (NLP) tasks. Existing PLMs learn useful linguistic knowledge from unlabeled text (Liu et al., 2019a), but they generally cannot capture the world facts well, which are typically sparse and have complex forms in text (Petroni et al., 2019; Logan et al., 2019).

On the contrary, knowledge graphs (KGs) contain extensive structural facts, and knowledge embedding (KE) methods (Bordes et al., 2013; Yang et al., 2015; Sun et al., 2019) can effectively embed them into continuous entity and relation embeddings. These embeddings can not only help with KG completion but also benefit various NLP applications (Yang and Mitchell, 2017; Zaremoodi et al., 2018; Han et al., 2018a). As shown in Figure 1, textual entity descriptions contain abundant information. Intuitively, KE methods can provide factual knowledge for PLMs, while the informative text data can also benefit KE.

Inspired by Xie et al. (2016), we use entity descriptions to bridge the gap between KE and PLM, and align the semantic space of text to the symbol space of KGs (Logeswaran et al., 2019). We propose KEPLER, a unified model for Knowledge Embedding and Pre-trained LanguagE Representation.

Transactions of the Association for Computational Linguistics, vol. 9, pp. 176–194, 2021. https://doi.org/10.1162/tacl_a_00360
Action Editor: Doug Downey. Submission batch: 7/2020; Revision batch: 10/2020; Published 3/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


Figure 1: An example of a KG with entity descriptions. The figure suggests that descriptions contain abundant information about entities and can help to represent the relational facts between them.

We encode the texts and entities into a unified semantic space with the same PLM as the encoder, and jointly optimize the KE and the masked language modeling (MLM) objectives. For the KE objective, we encode the entity descriptions as entity embeddings and then train them in the same way as conventional KE methods. For the MLM objective, we follow the approach of existing PLMs (Devlin et al., 2019; Liu et al., 2019c). KEPLER has the following strengths:

As a PLM, (1) KEPLER is able to integrate factual knowledge into language representation with the supervision from KG by the KE objective. (2) KEPLER inherits the strong ability of language understanding from PLMs by the MLM objective. (3) The KE objective enhances the ability of KEPLER to extract knowledge from text, since it requires the model to encode the entities from their corresponding descriptions. (4) KEPLER can be directly adopted in a wide range of NLP tasks without additional inference overhead compared to conventional PLMs, since we just add new training objectives without modifying the model structure.

There are also some recent works (Zhang et al., 2019; Peters et al., 2019; Liu et al., 2020) directly integrating fixed entity embeddings into PLMs to provide external knowledge. However, (1) their entity embeddings are learned by a separate KE model, and thus cannot be easily aligned with the language representation space. (2) They require an entity linker to link the text to the corresponding entities, making them suffer from the error propagation problem. (3) Compared to vanilla PLMs, their sophisticated mechanisms to link and use entity embeddings lead to additional inference overhead.

As a KE model, (1) KEPLER can take full advantage of the abundant information from entity descriptions with the help of the MLM objective. (2) KEPLER is capable of performing KE in the inductive setting, that is, it can produce embeddings for unseen entities from their descriptions, while conventional KE methods are inherently transductive and can only learn representations for the entities seen during training. Inductive KE is essential for many real-world applications, such as updating KGs with emerging entities and KG construction, and thus is worth more investigation.

For pre-training and evaluating KEPLER, we need a KG with (1) large amounts of knowledge facts, (2) aligned entity descriptions, and (3) a reasonable inductive-setting data split, which cannot be satisfied by existing KE benchmarks. Therefore, we construct Wikidata5M, containing about 5M entities, 20M triplets, and aligned entity descriptions from Wikipedia. To the best of our knowledge, it is the largest general-domain KG dataset. We also benchmark several classical KE methods and give data splits for both the transductive and the inductive settings to facilitate future research.

To summarize, our contribution is three-fold: (1) We propose KEPLER, a knowledge-enhanced PLM trained by jointly optimizing the KE and MLM objectives, which brings great improvements on a wide range of NLP tasks. (2) By encoding text descriptions as entity embeddings, KEPLER shows its effectiveness as a KE model, especially in the inductive setting. (3) We also introduce Wikidata5M, a new large-scale KG dataset, which shall promote the research on large-scale KG, inductive KE, and the interactions between KG and NLP.

2 KEPLER

As shown in Figure 2, KEPLER implicitly incorporates factual knowledge into language representations by jointly training with two objectives. In this section, we introduce in detail the encoder structure, the KE and MLM objectives, and how we combine the two as a unified model.

Figure 2: The KEPLER framework. We encode entity descriptions as entity embeddings and jointly train the knowledge embedding (KE) and masked language modeling (MLM) objectives on the same PLM.

2.1 Encoder

For the text encoder, we use the Transformer architecture (Vaswani et al., 2017) in the same way




as Devlin et al. (2019) and Liu et al. (2019c). The encoder takes a sequence of N tokens (x_1, ..., x_N) as inputs, and computes L layers of d-dimensional contextualized representations H_i ∈ R^{N×d}, 1 ≤ i ≤ L. Each layer of the encoder E_i is a combination of a multi-head self-attention network and a multilayer perceptron, and the encoder gets the representation of each layer by H_i = E_i(H_{i−1}). Eventually, we get a contextualized representation for each position, which can be further used in downstream tasks. Usually, a special token <s> is added to the beginning of the text, and its output is regarded as the representation of the whole sentence. We denote the representation function as E(·).
The encoder requires a tokenizer to convert plain text into sequences of tokens. Here we use the same tokenization as RoBERTa: Byte-Pair Encoding (BPE) (Sennrich et al., 2016).

Unlike previous knowledge-enhanced PLM works (Zhang et al., 2019; Peters et al., 2019), we do not modify the Transformer encoder structure to add external entity linkers or knowledge-integration layers. This means that our model has no additional inference overhead compared to vanilla PLMs, and it makes applying KEPLER to downstream tasks as easy as using RoBERTa.
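As a minimal illustration of how a description becomes an embedding under this design (not the released fairseq implementation), the sketch below assumes the HuggingFace transformers RoBERTa interface and treats the output at the <s> position as E(·):

```python
# Minimal sketch: encode an entity description into an entity embedding by
# taking the encoder output at the start-of-sequence token <s>.
# Assumes the HuggingFace `transformers` RoBERTa interface; KEPLER itself is
# implemented in fairseq, so this is only an illustration of the idea.
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")

@torch.no_grad()
def embed_description(description: str) -> torch.Tensor:
    """Return E(text): a d-dimensional embedding for one entity description."""
    inputs = tokenizer(description, truncation=True, max_length=512,
                       return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, d)
    return hidden[:, 0]                            # output at <s>, shape (1, d)

emb = embed_description("Johannes Kepler was a German astronomer and mathematician.")
print(emb.shape)   # torch.Size([1, 768]) for the BASE size
```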

2.2 Knowledge Embedding

To integrate factual knowledge into KEPLER, we adopt the knowledge embedding (KE) objective in our pre-training. KE encodes entities and relations in knowledge graphs (KGs) as distributed representations, which benefit lots of downstream tasks, such as link prediction and relation extraction.

We first define KGs: a KG is a graph with entities as its nodes and relations between entities as its edges. We use a triplet (h, r, t) to describe a relational fact, where h, t are the head entity and the tail entity, and r is the relation type within a pre-defined relation set R. In conventional KE models, each entity and relation is assigned a d-dimensional vector, and a scoring function is defined for training the embeddings and predicting links.

In KEPLER, instead of using stored embeddings, we encode entities into vectors by using their corresponding text. By choosing different textual data and different KE scoring functions, we have multiple variants for the KE objective of KEPLER. In this paper, we explore three simple but effective ways: entity descriptions as embeddings, entity and relation descriptions as embeddings, and entity embeddings conditioned on relations. We leave exploring advanced KE methods as future work.

Entity Descriptions as Embeddings   For a relational triplet (h, r, t), we have:

h = E(text_h),  t = E(text_t),  r = T_r,   (1)

where text_h and text_t are the descriptions for h and t, with a special token <s> at the beginning. T ∈ R^{|R|×d} is the relation embedding matrix, and h, t, and r are the embeddings for h, t, and r.



We use the loss from Sun et al. (2019) as our KE objective, which adopts negative sampling (Mikolov et al., 2013) for efficient optimization:

L_KE = −log σ(γ − d_r(h, t)) − (1/n) Σ_{i=1}^{n} log σ(d_r(h'_i, t'_i) − γ),   (2)

where (h'_i, r, t'_i) are negative samples, γ is the margin, σ is the sigmoid function, and d_r is the scoring function, for which we follow TransE (Bordes et al., 2013) for its simplicity:

d_r(h, t) = ||h + r − t||_p,   (3)

where we take the norm p as 1. The negative sampling policy is to fix the head entity and randomly sample a tail entity, and vice versa.
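To make the objective concrete, here is a minimal PyTorch sketch of Equations 2 and 3 under the stated choices (negative sampling with a TransE distance, p = 1); the tensor shapes and batching scheme are illustrative assumptions rather than the actual fairseq implementation:

```python
# Sketch of the KE objective (Equations 2 and 3): negative sampling loss with
# the TransE distance d_r(h, t) = ||h + r - t||_1. Tensor names, shapes, and
# the batching scheme are illustrative, not the released fairseq code.
import torch
import torch.nn.functional as F

def transe_distance(h, r, t):
    """d_r(h, t) with the L1 norm (p = 1); inputs broadcast over (..., d)."""
    return torch.norm(h + r - t, p=1, dim=-1)

def ke_loss(h, r, t, h_neg, t_neg, gamma=4.0):
    """
    h, r, t:      embeddings of positive triplets, shape (batch, d).
    h_neg, t_neg: n negative samples per triplet, shape (batch, n, d);
                  for each negative, either the head or the tail is corrupted.
    """
    pos = F.logsigmoid(gamma - transe_distance(h, r, t))              # (batch,)
    neg = F.logsigmoid(
        transe_distance(h_neg, r.unsqueeze(1), t_neg) - gamma)        # (batch, n)
    return -(pos + neg.mean(dim=1)).mean()
```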

Entity and Relation Descriptions as Embeddings   A natural extension of the last method is to encode the relation descriptions as relation embeddings as well. Formally, we have

r̂ = E(text_r),   (4)

where text_r is the description for the relation r. We then use r̂ to replace r in Equations 2 and 3.

Entity Embeddings Conditioned on Relations   In this manner, we use entity embeddings conditioned on r for better KE performance. The intuition is that the semantics of an entity may have multiple aspects, and different relations focus on different ones (Lin et al., 2015). So we have

h_r = E(text_{h,r}),   (5)

where text_{h,r} is the concatenation of the description for the entity h and the description for the relation r, with the special token <s> at the beginning and </s> in between. Correspondingly, we use h_r instead of h in Equations 2 and 3.
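The three variants differ only in what text is fed to the encoder and which embedding is replaced. A small sketch follows, assuming plain string concatenation with RoBERTa's <s>/</s> markers (the exact preprocessing in the released code may differ):

```python
# Sketch of how the three KE variants build their encoder inputs. The string
# concatenation with <s>/</s> markers is an illustrative simplification of the
# actual BPE preprocessing (descriptions are truncated to 512 tokens).
def ke_encoder_input(variant: str, entity_desc: str, relation_desc: str = "") -> str:
    if variant == "entity-desc":        # Equation 1: h = E(text_h)
        return "<s> " + entity_desc
    if variant == "relation-desc":      # Equation 4: r-hat = E(text_r)
        return "<s> " + relation_desc
    if variant == "conditioned":        # Equation 5: h_r = E(text_{h,r})
        return "<s> " + entity_desc + " </s> " + relation_desc
    raise ValueError(f"unknown variant: {variant}")
```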

2.3 Masked Language Modeling

The masked language modeling (MLM) objective is inherited from BERT and RoBERTa. During pre-training, MLM randomly selects some of the input positions, and the objective is to predict the tokens at these selected positions within a fixed dictionary.

To be more specific, MLM randomly selects 15% of the input positions, among which 80% are masked with the special token <mask>, 10% are replaced by other random tokens, and the rest remain unchanged. For each selected position j, the last-layer contextualized representation H_{L,j} is used for a W-way classification, where W is the size of the dictionary. At last, a cross-entropy loss L_MLM is calculated over these selected positions.
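The corruption policy can be sketched as follows (a standard BERT/RoBERTa-style implementation in PyTorch; the -100 label is the usual ignore index for cross-entropy, an implementation convention rather than something specified in the paper):

```python
# Sketch of the MLM corruption policy: 15% of positions are selected; of these,
# 80% become <mask>, 10% become a random token, and 10% are left unchanged.
import torch

def mask_tokens(input_ids, mask_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100                       # loss only on selected positions

    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_id                    # 80%: replace with <mask>

    randomized = (torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
                  & selected & ~masked)            # 10%: replace with a random token
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]
    return input_ids, labels                       # remaining 10% stay unchanged
```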

We initialize our model with the pre-trained checkpoint of RoBERTaBASE. However, we still keep MLM as one of our objectives to avoid catastrophic forgetting (McCloskey and Cohen, 1989) while training towards the KE objective. Indeed, as demonstrated in Section 5.1, using only the KE objective leads to poor results on NLP tasks.

2.4 Training Objectives

To incorporate factual knowledge and language understanding into one PLM, we design a multi-task loss, as shown in Figure 2 and Equation 6:

L = L_KE + L_MLM,   (6)

where L_KE and L_MLM are the losses for KE and MLM, respectively. Jointly optimizing the two objectives can implicitly integrate knowledge from external KGs into the text encoder, while preserving the strong abilities of PLMs for syntactic and semantic understanding. Note that the two tasks only share the text encoder, and for each mini-batch, the text data sampled for KE and MLM are not (necessarily) the same. This is because seeing a variety of text (instead of just entity descriptions) in MLM helps the model achieve better language understanding ability.
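A single joint training step therefore looks like the following sketch, where the two loss functions are placeholders for the KE and MLM pieces described above and both batches go through the same shared encoder:

```python
# Sketch of one joint training step for Equation 6. The two losses are computed
# on different mini-batches but share the same encoder parameters; the loss
# functions are placeholders for the KE and MLM pieces described above.
def joint_training_step(ke_loss_fn, mlm_loss_fn, ke_batch, mlm_batch, optimizer):
    loss = ke_loss_fn(ke_batch) + mlm_loss_fn(mlm_batch)   # L = L_KE + L_MLM
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```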

2.5 Variants and Implementations

We introduce the variants of KEPLER and the pre-
training implementations here. The fine-tuning
details will be introduced in Section 4.

KEPLER Variants

We implement multiple versions of KEPLER
in experiments to explore the effectiveness of
our pre-training framework. We use the same
denotations in Section 4 as below.


KEPLER-Wiki is the principal model in our experiments, which adopts Wikidata5M (Section 3) as the KG and the entity-description-as-embedding method (Equation 1). All other variants, if not specified, use the same settings. KEPLER-Wiki achieves the best performances on most tasks.

KEPLER-WordNet uses WordNet (Miller, 1995) as its KG source. WordNet is an English lexical graph, where nodes are lemmas and synsets, and edges are their relations. Intuitively, incorporating WordNet can bring in lexical knowledge and thus benefit NLP tasks. We use the same WordNet 3.0 as in KnowBert (Peters et al., 2019), which is extracted from the nltk2 package.

KEPLER-W+W takes both Wikidata5M and WordNet as its KGs. To jointly train with the two KG datasets, we modify the objective in Equation 6 as

L = L_Wiki + L_WordNet + L_MLM,   (7)

where L_Wiki and L_WordNet are the losses from Wikidata5M and WordNet, respectively.

KEPLER-Rel uses the entity-and-relation-descriptions-as-embeddings method (Equation 4). As the relation descriptions in Wikidata are short (11.7 words on average) and homogeneous, encoding relation descriptions as relation embeddings results in worse performance, as shown in Section 4.

KEPLER-Cond uses the entity-embedding-conditioned-on-relation method (Equation 5). This model achieves superior results in link prediction tasks, both transductive and inductive (Section 4.3).

KEPLER-OnlyDesc trains the MLM objective directly on the entity descriptions from the KE objective rather than on the English Wikipedia and BookCorpus used by the other versions of KEPLER. However, as the entity description data are smaller (2.3 GB vs. 13 GB) and homogeneous, this harms the general language understanding ability and thus performs worse (Section 4.2).

KEPLER-KE only adopts the KE objective
in pre-training, which is an ablated version of
KEPLER-Wiki. It is used to show the necessity of
the MLM objective for language understanding.

Pre-training Implementation
In practice, we choose RoBERTa (Liu et al., 2019c) as our base model and implement KEPLER

2https://www.nltk.org.


in the fairseq framework (Ott et al., 2019) for pre-training. Due to the computing resource limit, we choose the BASE size (L = 12, d = 768) and use the released roberta.base parameters for initialization, which is a common practice to save pre-training time (Zhang et al., 2019; Peters et al., 2019). For the MLM objective, we use the English Wikipedia (2,500M words) and BookCorpus (800M words) (Zhu et al., 2015) as our pre-training corpora (except for KEPLER-OnlyDesc). We extract text from these two corpora in the same way as Devlin et al. (2019). For the KE objective, we encode the first 512 tokens of entity descriptions from the English Wikipedia as entity embeddings.

We set the γ in Equation 2 as 4 and 9 for NLP and KE tasks respectively, and we use the models pre-trained for 10 and 30 epochs for NLP and KE. Specifically, γ is 1 for KEPLER-WordNet. The two hyperparameters are tuned by multiple trials with γ in {1, 2, 4, 6, 9} and the number of epochs in {5, 10, 20, 30, 40}, and we select the models by performance on TACRED (F-1) and inductive link prediction (HITS@10). We use gradient accumulation to achieve a batch size of 12,288.
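Gradient accumulation itself is straightforward; the sketch below shows the standard pattern (the per-step batch size and step counts are illustrative, not the exact pre-training configuration):

```python
# Sketch of gradient accumulation: each small batch's loss is scaled down so
# that after `accumulation_steps` backward passes the update is equivalent to
# one optimizer step on the large combined batch.
def train_with_accumulation(model, data_loader, optimizer, accumulation_steps):
    optimizer.zero_grad()
    for step, batch in enumerate(data_loader):
        loss = model(batch) / accumulation_steps   # average gradients over the big batch
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```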

3 Wikidata5M

As shown in Section 2, to train KEPLER, the KG dataset should (1) be large enough, (2) contain high-quality textual descriptions for its entities and relations, and (3) have a reasonable inductive setting, which most existing KG datasets do not provide. Thus, based on Wikidata3 and English Wikipedia,4 we construct Wikidata5M, a large-scale KG dataset with aligned text descriptions from corresponding Wikipedia pages, and also an inductive test set. In the following sections, we first introduce the data collection (Section 3.1) and the data split (Section 3.2), and then provide the results of representative KE methods on the dataset (Section 3.3).

3.1 Data Collection

We use the July 2019 dump of Wikidata and
Wikipedia. For each entity in Wikidata, we align
it to its Wikipedia page and extract the first section
as its description. Entities with no pages or with
descriptions fewer than 5 words are discarded.

3https://www.wikidata.org.
4https://en.wikipedia.org.


Dataset       #entity     #relation   #training    #validation   #test
FB15K         14,951      1,345       483,142      50,000        59,071
WN18          40,943      18          141,442      5,000         5,000
FB15K-237     14,541      237         272,115      17,535        20,466
WN18RR        40,943      11          86,835       3,034         3,134
Wikidata5M    4,594,485   822         20,614,279   5,163         5,133

Table 1: Statistics of Wikidata5M (transductive setting) compared with existing KE benchmarks.

Entity Type        Occurrence   Percentage
Human              1,517,591    33.0%
Taxon              363,882      7.9%
Wikimedia list     118,823      2.6%
Film               114,266      2.5%
Human Settlement   110,939      2.4%
Total              2,225,501    48.4%

Table 2: Top-5 entity categories in Wikidata5M.

We retrieve all the relational facts in Wikidata. A fact is considered to be valid when both of its entities are not discarded, and its relation has a non-empty page in Wikidata. The final KG contains 4,594,485 entities, 822 relations, and 20,624,575 triplets. Statistics of Wikidata5M along with four other widely used benchmarks are shown in Table 1. The top-5 entity categories are listed in Table 2. We can see that Wikidata5M is much larger than other KG datasets, covering various domains.
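The filtering rules above can be summarized with the following sketch; the input data structures (ID-to-page mappings) are hypothetical placeholders for the Wikidata and Wikipedia dumps, and the "first section" extraction is a simplification:

```python
# Sketch of the Wikidata5M filtering rules described in Section 3.1.
def build_wikidata5m(entity_pages, relation_pages, facts):
    descriptions = {}
    for entity_id, page_text in entity_pages.items():
        first_section = page_text.split("\n\n")[0]
        if len(first_section.split()) >= 5:        # discard descriptions < 5 words
            descriptions[entity_id] = first_section
    kept_facts = [
        (h, r, t) for (h, r, t) in facts
        if h in descriptions and t in descriptions  # both entities kept
        and relation_pages.get(r)                   # relation page is non-empty
    ]
    return descriptions, kept_facts
```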

3.2 Data Split

For Wikidata5M, we take two different settings: the transductive setting and the inductive setting. The transductive setting (shown in Table 1) is adopted in most KG datasets, where the entities are shared and the triplet sets are disjoint across training, validation, and test. In this case, KE models are expected to learn effective entity embeddings only for the entities in the training set. In the inductive setting (shown in Table 3), the entities and triplets are mutually disjoint across training, validation, and test. We randomly sample some connected subgraphs as the validation and test sets. In the inductive setting, the KE models should produce embeddings for the unseen entities given side features like descriptions, neighbors, etc. The inductive setting is more challenging and also

Subset       #entity     #relation   #triplet
Training     4,579,609   822         20,496,514
Validation   7,374       199         6,699
Test         7,475       201         6,894

Table 3: Statistics of the Wikidata5M inductive setting.

meaningful in real-world applications, where entities in KGs experience open-ended growth, and the inductive ability is crucial for online KE methods.

Although Wikidata5M contains massive entities and triplets, our validation and test sets are not large, which is limited by the standard evaluation method of link prediction (Section 3.3). Each episode of evaluation requires |E| × |T| × 2 KE score calculations, where |E| and |T| are the total number of entities and the number of triplets in the test set, respectively. As Wikidata5M contains massive entities, the evaluation is very time-consuming; hence we have to limit the test set to thousands of triplets to ensure tractable evaluations. This indicates that large-scale KE urges a more efficient evaluation protocol. We will leave exploring it to future work.

3.3 Benchmark

To assess the challenges of Wikidata5M, we benchmark several popular KE models on our dataset in the transductive setting (as they inherently do not support the inductive setting). Because their original implementations do not scale to Wikidata5M, we benchmark these methods with GraphVite (Zhu et al., 2019), a multi-GPU KE toolkit.

In the transductive setting, for each test triplet (h, r, t), the model ranks all the entities by scoring (h, r, t'), t' ∈ E, where E is the entity set excluding other correct t. The evaluation metrics, MRR (mean reciprocal rank), MR (mean rank), and HITS@{1,3,10}, are based on the rank of the



Method                              MR       MRR    HITS@1   HITS@3   HITS@10
TransE (Bordes et al., 2013)        109370   25.3   17.0     31.1     39.2
DistMult (Yang et al., 2015)        211030   25.3   20.8     27.8     33.4
ComplEx (Trouillon et al., 2016)    244540   28.1   22.8     31.0     37.3
SimplE (Kazemi and Poole, 2018)     115263   29.6   25.2     31.7     37.7
RotatE (Sun et al., 2019)           89459    29.0   23.4     32.2     39.0

Table 4: Performance of different KE models on Wikidata5M (% except MR).

correct tail entity t among all the entities in E.
Then we do the same thing for the head entities.
We report the average results over all test triplets
and over both head and tail entity predictions.
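For reference, the metrics can be computed from the per-triplet ranks as in the sketch below (a generic implementation; it assumes the candidate set has already been filtered to exclude other correct entities, as described above, and that lower scores are better, as with a TransE-style distance):

```python
# Sketch of the link prediction metrics used on Wikidata5M.
import numpy as np

def rank_of_correct(scores: np.ndarray, correct_idx: int) -> int:
    """1-based rank of the correct entity under ascending score order."""
    order = np.argsort(scores)
    return int(np.nonzero(order == correct_idx)[0][0]) + 1

def link_prediction_metrics(ranks):
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MR": ranks.mean(), "MRR": (1.0 / ranks).mean()}
    for k in (1, 3, 10):
        metrics[f"HITS@{k}"] = (ranks <= k).mean()
    return metrics
```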

Table 4 shows the results of popular KE methods on Wikidata5M, which are all significantly lower than on existing KG datasets such as FB15K-237 and WN18RR. This demonstrates that Wikidata5M is more challenging due to its large scale and high coverage. The results advocate for more efforts towards large-scale KE.

4 Experiments

In this section, we introduce the experiment settings and results of our model on various NLP and KE tasks, along with some analyses of KEPLER.

4.1 Experimental Setting

Baselines   In our experiments, RoBERTa is an important baseline since KEPLER is based on it (all mentioned models are of BASE size if not specified). As we cannot afford the full RoBERTa corpus (126 GB; we only use 13 GB) in KEPLER pre-training, we implement Our RoBERTa for direct comparisons to KEPLER. It is initialized with RoBERTaBASE and further trained with the MLM objective on the same corpora as KEPLER.

We also evaluate recent knowledge-enhanced PLMs, including ERNIEBERT (Zhang et al., 2019) and KnowBertBERT (Peters et al., 2019). As ERNIE and our principal model KEPLER-Wiki only use Wikidata, we take KnowBert-Wiki in the experiments to ensure fair comparisons with the same knowledge source. Considering that KEPLER is based on RoBERTa, we also reproduce the two models with RoBERTa (ERNIERoBERTa and KnowBertRoBERTa). The reproduction of KnowBert is based on its original implementation.5

5https://github.com/allenai/kb.

On relation classification, we also compare with
MTB (Baldini Soares et al., 2019), which adopts
‘‘matching the blank’’ pre-training. Different
from other baselines, the original MTB is based
on BERTLARGE (denoted by MTB (BERTLARGE)).
For a fair comparison under the same model size,
we reimplement MTB with BERTBASE (MTB).

Hyperparameters   The pre-training settings are described in Section 2.5. For fine-tuning on downstream tasks, we set the KEPLER hyperparameters the same as reported in KnowBert on TACRED and OpenEntity. On FewRel, we set the learning rate as 2e-5 and the batch size as 20 and 4 for the Proto and PAIR frameworks, respectively. For GLUE, we follow the hyperparameters reported in RoBERTa. For baselines, we keep their original hyperparameters unchanged or use the best trial in the KEPLER search space if no original settings are available.

4.2 NLP Tasks

In this section, we demonstrate the performance of KEPLER and its baselines on various NLP tasks.

Relation Classification

Relation classification requires models to classify relation types between two given entities from text. We evaluate KEPLER and other baselines on two widely used benchmarks: TACRED and FewRel.

TACRED (Zhang et al., 2017) has 42 relations and 106,264 sentences. Here we follow the settings of Baldini Soares et al. (2019), where we add four special tokens before and after the two entity mentions, and concatenate the representations at the beginnings of the two entities for classification. Note that the original KnowBert also takes entity types as inputs, which is different from Zhang et al. (2019) and Baldini Soares et al. (2019). To ensure fair comparisons, we re-evaluate KnowBert with the


Model               P      R      F-1
BERT                67.2   64.8   66.0
BERTLARGE           –      –      70.1
MTB                 69.7   67.9   68.8
MTB (BERTLARGE)     –      –      71.5
ERNIEBERT           70.0   66.1   68.0
KnowBertBERT        73.5   64.1   68.5
RoBERTa             70.4   71.1   70.7
ERNIERoBERTa        73.5   68.0   70.7
KnowBertRoBERTa     71.9   69.9   70.9
Our RoBERTa         70.8   69.6   70.2
KEPLER-Wiki         71.5   72.5   72.0
KEPLER-WordNet      71.4   71.3   71.3
KEPLER-W+W          71.1   72.0   71.5
KEPLER-Rel          71.3   70.9   71.1
KEPLER-Cond         72.1   70.7   71.4
KEPLER-OnlyDesc     72.3   69.1   70.7
KEPLER-KE           63.5   60.5   62.0

Table 5: Precision, recall, and F-1 on TACRED (%). KnowBert results are different from the original paper since different task settings are used.

same setting as other baselines, thus the reported
results are different from the original paper.
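A minimal sketch of this fine-tuning setup (entity markers plus a classifier over the concatenated start-marker representations) is given below; the module and argument names are illustrative, and the encoder is assumed to expose a HuggingFace-style `last_hidden_state`:

```python
# Sketch of the relation classification head for TACRED: the two mentions are
# wrapped in special markers, and the classifier takes the concatenated hidden
# states at the two start markers.
import torch
import torch.nn as nn

class MarkerRelationClassifier(nn.Module):
    def __init__(self, encoder, hidden_size=768, num_relations=42):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(2 * hidden_size, num_relations)

    def forward(self, input_ids, attention_mask, head_start, tail_start):
        # head_start / tail_start: positions of the markers before each mention.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        idx = torch.arange(hidden.size(0), device=hidden.device)
        pair = torch.cat([hidden[idx, head_start], hidden[idx, tail_start]], dim=-1)
        return self.classifier(pair)
```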

From the TACRED results in Table 5, we can observe that: (1) KEPLER-Wiki is the best among the KEPLER variants and significantly outperforms all the baselines, while other versions of KEPLER also achieve good results. This demonstrates the effectiveness of KEPLER in integrating factual knowledge into PLMs. Based on this result, we use KEPLER-Wiki as the principal model in the following experiments. (2) KEPLER-WordNet shows a marginal improvement over Our RoBERTa, while KEPLER-W+W underperforms KEPLER-Wiki. It suggests that pre-training with WordNet only has limited benefits in the KEPLER framework. We will explore how to better combine different KGs in our future work.
FewRel (Han et al., 2018b) is a few-shot relation classification dataset with 100 relations and 70,000 instances, which is constructed with Wikipedia text and Wikidata facts. Furthermore, Gao et al. (2019) propose FewRel 2.0, adding a domain adaptation challenge with a new medical-domain test set.

FewRel takes the N-way K-shot setting. Relations in the training and test sets are disjoint. For every evaluation episode, N relations, K supporting samples for each relation, and several query sentences are sampled from the test set. The models are required to classify queries into one of the N relations given only the sampled N × K instances.

We use two state-of-the-art few-shot frameworks: Proto (Snell et al., 2017) and PAIR (Gao et al., 2019). We replace the text encoders with our baselines and KEPLER and compare the performance. Because FewRel 1.0 is constructed with Wikidata, we remove all the triplets in its test set from Wikidata5M to avoid information leakage for KEPLER. However, we cannot control the KGs used in our baselines. We mark the models utilizing Wikidata, and thus having an information leakage risk, with † in Table 6.

As Table 6 shows, KEPLER-Wiki achieves the best performance among the BASE-size PLMs in most settings. From the results, we also have some interesting observations: (1) RoBERTa consistently outperforms BERT on various NLP tasks (Liu et al., 2019c), yet the RoBERTa-based models here are comparable to or even worse than BERT-based models in the PAIR framework. Because PAIR uses sentence concatenation, this result may be credited to the next sentence prediction (NSP) objective of BERT. (2) KEPLER brings improvements on FewRel 2.0, while ERNIE and KnowBert even degenerate in most of the settings. It indicates that the paradigms of ERNIE and KnowBert cannot generalize well to new domains, which may require very different entity linkers and entity embeddings. On the other hand, KEPLER not only learns better entity representations but also acquires a general ability to extract factual knowledge from the context across different domains. We further verify this in Section 5.5. (3) KnowBert underperforms ERNIE on FewRel, while it typically achieves better results on other tasks. This may be because it uses the TuckER (Balazevic et al., 2019) KE model while ERNIE and KEPLER follow TransE (Bordes et al., 2013). We will explore the effects of different KE methods in the future.

We also have two more observations with regard to ERNIE and MTB: (1) ERNIE performs the best in the 1-shot settings of FewRel 1.0. We believe this is because the knowledge embedding injection of ERNIE has particular advantages in this case, since it directly brings


knowledge about entities. When using 5-shot (where the supporting text provides more information) and FewRel 2.0 (where ERNIE does not have knowledge for biomedical entities), KEPLER outperforms ERNIE. (2) Though MTB (BERTLARGE) is the state-of-the-art model on FewRel, its BERTBASE version does not outperform other knowledge-enhanced PLMs, which suggests that using large models contributes much to its gain. We also notice that when combined with PAIR, MTB suffers an obvious performance drop, which may be because its pre-training objective degenerates on sentence-pair tasks.

Entity Typing

Entity typing requires models to classify given entity mentions into pre-defined types. For this task, we carry out evaluations on OpenEntity (Choi et al., 2018) following the settings in Zhang et al. (2019). OpenEntity has 6 entity types and 2,000 instances each for training, validation, and test.

To identify the entity mentions of interest, we add two special tokens before and after the entity spans, and use the representations of the first special tokens for classification. As shown in Table 7, KEPLER-Wiki achieves state-of-the-art results. Note that the KnowBert results are different from the original paper since we use KnowBert-Wiki here rather than KnowBert-W+W to ensure the same knowledge resource and fair comparisons. KEPLER does not perform linking or entity embedding pre-training like ERNIE and KnowBert, which brings them special advantages in entity span tasks. However, KEPLER still outperforms these baselines, which proves its effectiveness.

GLUE

The General Language Understanding Evaluation (GLUE) (Wang et al., 2019b) collects several natural language understanding tasks and is widely used for evaluating PLMs. In general, solving GLUE does not require factual knowledge (Zhang et al., 2019), and we use it to examine whether KEPLER harms the general language understanding ability.

Table 8 shows the GLUE results. We can observe that KEPLER-Wiki is close to Our RoBERTa, suggesting that while incorporating factual knowledge, KEPLER maintains a strong language understanding ability. However, there are significant performance drops for KEPLER-OnlyDesc, which indicates that the small-scale entity description data are not sufficient for training KEPLER with MLM.

For the small datasets STS-B, MRPC, and RTE, directly fine-tuning models on them typically results in unstable performance. Hence we fine-tune models on a large-scale dataset (here we use MNLI) first and then further fine-tune them on the small datasets. This method has been shown to be effective (Wang et al., 2019a) and is also used in the original RoBERTa paper (Liu et al., 2019c).

4.3 KE Tasks

We show how KEPLER works as a KE model, and evaluate it on Wikidata5M in both the transductive link prediction setting and the inductive setting.

Experimental Settings

In link prediction, the entity and relation embeddings of KEPLER are obtained as described in Sections 2.2 and 2.5. The evaluation method is described in Section 3.3. We also add RoBERTa and Our RoBERTa as baselines. They adopt Equations 1 and 4 to acquire entity and relation embeddings, and use Equation 3 as their scoring function.

In the transductive setting, we compare our models with TransE (Bordes et al., 2013). We set its dimension as 512, negative sampling size as 64, batch size as 2048, and learning rate as 0.001 after hyperparameter searching. The negative sampling size is crucial for the performance on KE tasks, but, limited by the model complexity, KEPLER can only take a negative sampling size of 1. For a direct comparison that intuitively shows the benefits of pre-training, we set a baseline TransE†, which also uses 1 as the negative sampling size and keeps the other hyperparameters unchanged.

Because conventional KE methods like TransE inherently cannot provide embeddings for unseen entities, we take DKRL (Xie et al., 2016) as our baseline in the KE experiments, which utilizes convolutional neural networks to encode entity descriptions as embeddings. We set its dimension as 768, negative sampling size as 64, batch size as 1024, and learning rate as 0.0005.

Transductive Setting

Table 9a shows the results in the transductive setting. We observe that:



Model                        FewRel 1.0                        FewRel 2.0
                             5-1     5-5     10-1    10-5      5-1     5-5     10-1    10-5
MTB (BERTLARGE)†             93.86   97.06   89.20   94.27     –       –       –       –
Proto (BERT)                 80.68   89.60   71.48   82.89     40.12   51.50   26.45   36.93
Proto (MTB)                  81.39   91.05   71.55   83.47     52.13   76.67   48.28   69.75
Proto (ERNIEBERT)†           89.43   94.66   84.23   90.83     49.40   65.55   34.99   49.68
Proto (KnowBertBERT)†        86.64   93.22   79.52   88.35     64.40   79.87   51.66   69.71
Proto (RoBERTa)              85.78   95.78   77.65   92.26     64.65   82.76   50.80   71.84
Proto (Our RoBERTa)          84.42   95.30   76.43   91.74     61.98   83.11   48.56   72.19
Proto (ERNIERoBERTa)†        87.76   95.62   80.14   91.47     54.43   80.48   37.97   66.26
Proto (KnowBertRoBERTa)†     82.39   93.62   76.21   88.57     55.68   71.82   41.90   58.55
Proto (KEPLER-Wiki)          88.30   95.94   81.10   92.67     66.41   84.02   51.85   73.60
PAIR (BERT)                  88.32   93.22   80.63   87.02     67.41   78.57   54.89   66.85
PAIR (MTB)                   83.01   87.64   73.42   78.47     46.18   70.50   36.92   55.17
PAIR (ERNIEBERT)†            92.53   94.27   87.08   89.13     56.18   68.97   43.40   54.35
PAIR (KnowBertBERT)†         88.48   92.75   82.57   86.18     66.05   77.88   50.86   67.19
PAIR (RoBERTa)               89.32   93.70   82.49   88.43     66.78   81.84   53.99   70.85
PAIR (Our RoBERTa)           89.26   93.71   83.32   89.02     63.22   77.66   49.28   65.97
PAIR (ERNIERoBERTa)†         87.46   94.11   81.68   87.83     59.29   72.91   48.51   60.26
PAIR (KnowBertRoBERTa)†      85.05   91.34   76.04   85.25     50.68   66.04   37.10   51.13
PAIR (KEPLER-Wiki)           90.31   94.28   85.48   90.51     67.23   82.09   54.32   71.01

Table 6: Accuracies (%) on the FewRel dataset. N-K indicates the N-way K-shot setting. MTB uses the LARGE size and all the other models use the BASE size. † indicates oracle models which may have seen facts in the FewRel 1.0 test set during pre-training.


Model                      P      R      F-1
UFET (Choi et al., 2018)   77.4   60.6   68.0
BERT                       76.4   71.0   73.6
ERNIEBERT                  78.4   72.9   75.6
KnowBertBERT               77.9   71.2   74.4
RoBERTa                    77.4   73.6   75.4
ERNIERoBERTa               80.3   70.2   74.9
KnowBertRoBERTa            78.7   72.7   75.6
Our RoBERTa                75.1   73.4   74.3
KEPLER-Wiki                77.8   74.6   76.2

Table 7: Entity typing results on OpenEntity (%).

(1) KEPLER underperforms TransE. This is reasonable since KEPLER is limited by its large model size, and thus cannot use a large negative sampling size (1 for KEPLER, while typical KE methods use 64 or more) or more training epochs (30 vs. 1000 for TransE), which are crucial for KE (Zhu et al., 2019). On the other hand, KEPLER and its variants perform much better than TransE† (with a negative sampling size of 1), showing that with the same negative sampling size, KEPLER can benefit from pre-trained language

Model              MNLI (m/mm)   QQP    QNLI   SST-2
                   392k          363k   104k   67k
RoBERTa            87.5/87.2     91.9   92.7   94.8
Our RoBERTa        87.1/86.8     90.9   92.5   94.7
KEPLER-Wiki        87.2/86.5     91.7   92.4   94.5
KEPLER-OnlyDesc    85.9/85.6     90.8   92.4   94.4

Model              CoLA   STS-B   MRPC   RTE
                   8.5k   5.7k    3.5k   2.5k
RoBERTa            63.6   91.2    90.2   80.9
Our RoBERTa        63.4   91.1    88.4   82.3
KEPLER-Wiki        63.6   91.2    89.3   85.2
KEPLER-OnlyDesc    55.8   90.2    88.5   78.3

Table 8: GLUE results on the dev set (%). All the results are medians over 5 runs. We report F-1 scores for QQP and MRPC, Spearman correlations for STS-B, and accuracy scores for the other tasks. The ‘‘m/mm’’ stands for matched/mismatched evaluation sets for MNLI (Williams et al., 2018).

representations and textual entity descriptions, and thus outperform TransE. In the future, we will explore reducing the model size of KEPLER to take advantage of both a large negative sampling size and pre-training.

(2) The vanilla RoBERTa models perform poorly in KE while KEPLER achieves favorable performances,


Model                           MR        MRR    HITS@1   HITS@3   HITS@10
TransE (Bordes et al., 2013)    109370    25.3   17.0     31.1     39.2
TransE†                         406957    6.0    1.8      8.0      13.6
DKRL (Xie et al., 2016)         31566     16.0   12.0     18.1     22.9
RoBERTa                         1381597   0.1    0.0      0.1      0.3
Our RoBERTa                     1756130   0.1    0.0      0.1      0.2
KEPLER-KE                       76735     8.2    4.9      8.9      15.1
KEPLER-Rel                      15820     6.6    3.7      7.0      11.7
KEPLER-Wiki                     14454     15.4   10.5     17.4     24.4
KEPLER-Cond                     20267     21.0   17.3     22.4     27.7

(a) Transductive results on Wikidata5M (% except MR). TransE† denotes a TransE model trained with the same negative sampling size (1) as KEPLER.

Model                           MR     MRR    HITS@1   HITS@3   HITS@10
DKRL (Xie et al., 2016)         78     23.1   5.9      32.0     54.6
RoBERTa                         723    7.4    0.7      1.0      19.6
Our RoBERTa                     1070   5.8    1.9      6.3      13.0
KEPLER-KE                       138    17.8   5.7      22.9     40.7
KEPLER-Rel                      35     33.4   15.9     43.5     66.1
KEPLER-Wiki                     32     35.1   15.4     46.9     71.9
KEPLER-Cond                     28     40.2   22.2     51.4     73.0

(b) Inductive results on Wikidata5M (% except MR).

Table 9: Link prediction results on Wikidata5M transductive and inductive settings.

which demonstrates the effectiveness of our multi-
task pre-training to infuse factual knowledge.

(3) Among the KEPLER variants, KEPLER-Cond has superior results, which substantiates the intuition in Section 2.2. KEPLER-Rel performs worst, which we believe is due to the short and homogeneous relation descriptions in Wikidata. KEPLER-KE significantly underperforms KEPLER-Wiki, which suggests that the MLM objective is also necessary for the KE tasks to build effective language representation.

(4) We also notice that DKRL performs well in the transductive setting and its results are close to KEPLER. We believe this is because DKRL takes a much smaller encoder (a CNN) and thus is easier to train. In the more difficult inductive setting, the gap between DKRL and KEPLER is larger, which better shows the language understanding ability of KEPLER in utilizing textual entity descriptions.

Inductive Setting

Table 9b shows the Wikidata5M inductive results. KEPLER outperforms DKRL and RoBERTa by a large margin, demonstrating the effectiveness of our joint training method. But KEPLER results are still far from the ideal performances required by practical applications (constructing KGs from scratch, etc.), which urges further efforts on inductive KE. Comparisons among the KEPLER variants are consistent with those in the transductive setting.

In addition, we clarify why results in the inductive setting are much higher than in the transductive setting, while the inductive setting is more difficult: as shown in Tables 1 and 3, the number of entities involved in the inductive evaluation is much smaller than in the transductive setting (7,475 vs. 4,594,485). Considering that the KE evaluation metrics are based on entity ranking, it is reasonable to see higher values in the inductive setting. The performance in different settings should not be directly compared.

5 Analysis

In this section, we analyze the effectiveness and efficiency of KEPLER with experiments. All the hyperparameters are the same as reported in Section 4.1, including for the models in the ablation study.



Model           P      R      F-1
Our RoBERTa     70.8   69.6   70.2
KEPLER-KE       63.5   60.5   62.0
KEPLER-Wiki     71.5   72.5   72.0

Table 10: Ablation study results on TACRED (%).

5.1 Ablation Study

As shown in Equation 6, KEPLER takes a multi-task loss. To demonstrate the effectiveness of the joint objective, we compare full KEPLER with models trained with only the MLM loss (Our RoBERTa) and only the KE loss (KEPLER-KE) on TACRED. As demonstrated in Table 10, compared to KEPLER-Wiki, both ablation models suffer significant drops. It suggests that the performance gain of KEPLER is credited to the joint training towards both objectives.

5.2 Knowledge Probing Experiment

Section 4.2 shows that KEPLER can achieve significant improvements on NLP tasks requiring factual knowledge. To further verify whether KEPLER can better integrate factual knowledge into PLMs and help to recall it, we conduct experiments on LAMA (Petroni et al., 2019), a widely used knowledge probe. LAMA examines PLMs' abilities of recalling relational facts through cloze-style questions. For example, given a natural language template ‘‘Paris is the capital of <mask>’’, PLMs are required to predict the masked token without fine-tuning. LAMA reports the micro-averaged precision at one (P@1) scores. However, Poerner et al. (2020) present that LAMA contains some easy questions which can be answered with superficial clues like entity names. Hence we also evaluate the models on LAMA-UHN (Poerner et al., 2020), which filters out the questionable templates from the Google-RE and T-REx corpora of LAMA.

The evaluation results are shown in Table 11, from which we have the following observations: (1) KEPLER consistently outperforms the vanilla PLM baseline Our RoBERTa in almost all the settings except ConceptNet, which focuses on commonsense knowledge rather than factual knowledge. It indicates that KEPLER can indeed better integrate factual knowledge. (2) Although KEPLER-W+W cannot outperform KEPLER-Wiki on NLP tasks (Section 4.2), it shows significant improvements on LAMA-UHN, which suggests that we should explore which kind of knowledge is needed in different scenarios in the future. (3) All the RoBERTa-based models perform worse than vanilla BERTBASE by a large margin, which is consistent with the results of Wang et al. (2020). This may be due to the different vocabularies used in BERT and RoBERTa, which again exposes the vulnerability of LAMA-style probing (Kassner and Schütze, 2020). We will leave developing a better knowledge probing framework as future work.

5.3 Running Time Comparison

Compared to vanilla PLMs, KEPLER does not introduce any additional parameters or computations during fine-tuning and inference, which makes it efficient for practical use. We compare the running time of KEPLER and other knowledge-enhanced PLMs (ERNIE and KnowBert) in Table 12. The time is evaluated on the TACRED training set for one epoch with one NVIDIA Tesla V100 (32 GB), and all models use a batch size of 32 and a sequence length of 128. The ‘‘entity linking’’ time of KnowBert is for entity candidate generation. We can observe that KEPLER requires much less running time since it does not need entity linking or entity embedding fusion, which will benefit time-sensitive applications.

5.4 Correlation with Entity Frequency

To better understand how KEPLER helps entity-centric tasks, we provide analyses of the correlations between KEPLER performance and entity frequency in this section. The motivation is to verify a natural hypothesis that KEPLER improvements mainly come from better representing the entity mentions in text, especially the rare entities, which do not show up frequently in the pre-training corpora and thus cannot be well learned by the language modeling objectives.

We perform entity linking for the TACRED dataset with BLINK (Wu et al., 2020) to link the entity mentions in text to their corresponding Wikipedia identifiers. Then we count the occurrences of the entities in Wikipedia via the hyperlinks in rich text, which gives the entity frequencies. We conduct two experiments to analyze the correlations between KEPLER performance and entity frequency: (1) In Table 13, we divide the



                   LAMA                                        LAMA-UHN
Model              Google-RE   T-REx   ConceptNet   SQuAD      Google-RE   T-REx
BERT               9.8         31.1    15.6         14.1       4.7         21.8
RoBERTa            5.3         24.7    19.5         9.1        2.2         17.0
Our RoBERTa        7.0         23.2    19.0         8.0        2.8         15.7
KEPLER-Wiki        7.3         24.6    18.7         14.3       3.3         16.5
KEPLER-W+W         7.3         24.4    17.6         10.8       4.1         17.1

Table 11: P@1 results on the knowledge probing benchmarks LAMA and LAMA-UHN.

Model              Entity Linking   Fine-tuning   Inference
ERNIERoBERTa       780s             730s          194s
KnowBertRoBERTa    190s             677s          235s
KEPLER             0s               508s          152s

Table 12: Three parts of running time for one epoch of the TACRED training set.

entity mentions into five parts by their frequencies, and compare the TACRED performances while only keeping the entities in one part and masking the others. (2) In Figure 3, we sequentially mask the entity mentions in ascending order of entity frequency and observe the F-1 changes.
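The bucketing used in Table 13 can be sketched as follows (data structures are illustrative; the real analysis uses BLINK links and Wikipedia hyperlink counts as described above):

```python
# Sketch of the frequency-bucket analysis behind Table 13: sort entities by
# their Wikipedia frequency, assign quintiles, and mask every mention whose
# entity falls outside the kept quintile.
def frequency_quintile(entity_freq):
    """Map each entity to a quintile index (0 = least frequent 20%)."""
    ordered = sorted(entity_freq, key=entity_freq.get)
    return {ent: min(5 * i // len(ordered), 4) for i, ent in enumerate(ordered)}

def mask_outside_quintile(tokens, mention_spans, quintile, keep, mask_token="<mask>"):
    """mention_spans: list of (entity_id, start, end) token spans to check."""
    tokens = list(tokens)
    for entity_id, start, end in mention_spans:
        if quintile.get(entity_id) != keep:
            tokens[start:end] = [mask_token] * (end - start)
    return tokens
```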

Figure 3: TACRED performance (F-1) of KEPLER and RoBERTa changing with the rate of entity mentions being masked.

From the results, we can observe that: (1) Figure 3 shows that when the entity masking rate is low, the improvements of KEPLER over RoBERTa are generally much higher than when the entity masking rate is high. It indicates that the improvements of KEPLER do mainly come from better modeling of entities in context. However, even when all the entity mentions are masked, KEPLER still outperforms RoBERTa. We claim this is because the KE objective can also help the model learn to understand fact-related text, since it requires the model to recall facts from textual descriptions. This claim is further substantiated in Section 5.5.

(2) From Table 13, we can observe that the improvement in the ‘‘0%-20%’’ setting is marginally higher than in the other settings, which demonstrates that KEPLER does have special advantages in modeling rare entities compared to vanilla PLMs. But the improvements in the frequent settings are also significant, and we cannot say that the overall improvements of KEPLER come mostly from the rare entities. In general, the results in Table 13 show that KEPLER can better model all the entities, no matter rare or frequent.

5.5 Understanding Text or Storing Knowledge

We argue that by jointly training the KE and the MLM objectives, KEPLER (1) can better understand fact-related text and better extract knowledge from text, and also (2) can remember factual knowledge. To investigate the two abilities of KEPLER quantitatively, we carry out an experiment on TACRED in which the head and tail entity mentions are masked (masked-entity, ME) or only the head and tail entity mentions are shown (only-entity, OE). The ME setting shows to what extent the models can extract facts only from the textual context without the clues in entity names. The OE setting demonstrates to what extent the models can store and predict factual knowledge, as only the entity names are given to the models.

As shown in Table 14, KEPLER-Wiki shows significant improvements over Our RoBERTa in both settings, which suggests that KEPLER has indeed possessed superior abilities in both extracting and storing knowledge compared to vanilla PLMs without knowledge infusion. And the KEPLER-KE model performs poorly in the ME setting but achieves marginal improvements in the OE setting. It indicates that without the help of the MLM objective, KEPLER only learns the entity description embeddings and degenerates in general language understanding, while it can still remember knowledge into entity names to some extent.

6 Related Work

Pre-training in NLP   There has been a long history of pre-training in NLP. Early works focus on distributed word representations (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014), many of which are often adopted in current models as word embeddings. These


Entity Frequency   0%-20%   20%-40%   40%-60%   60%-80%   80%-100%
KEPLER-Wiki        64.7     64.4      64.8      64.7      68.8
Our RoBERTa        64.1     64.3      64.5      64.3      68.5
Improvement        +0.6     +0.1      +0.3      +0.4      +0.3

Table 13: F-1 scores on TACRED (%) under different settings by entity frequencies. We sort the entity mentions in TACRED by their corresponding entity frequencies in Wikipedia. The ‘‘0%-20%’’ setting indicates only keeping the least frequent 20% of entity mentions and masking all the other entity mentions (for both training and validation), and so on. The results are averaged over 5 runs.

Model           ME     OE
Our RoBERTa     54.0   46.8
KEPLER-KE       40.2   47.0
KEPLER-Wiki     54.8   48.9

Table 14: Masked-entity (ME) and only-entity (OE) F-1 scores on TACRED (%).

pre-trained embeddings can capture the semantics of words from large-scale corpora and thus benefit NLP applications. Peters et al. (2018) push this trend a step forward by using a bidirectional LSTM to form contextualized word embeddings (ELMo) for richer semantic meanings under different circumstances.

Apart from word embeddings, there is another trend exploring pre-trained language models. Dai and Le (2015) propose to train an auto-encoder on unlabeled textual data and then fine-tune it on downstream tasks. Howard and Ruder (2018) propose a universal language model (ULMFiT). With the powerful Transformer architecture (Vaswani et al., 2017), Radford et al. (2018) demonstrate an effective pre-trained generative model (GPT). Later, Devlin et al. (2019) release a pre-trained deep Bidirectional Encoder Representation from Transformers (BERT), achieving state-of-the-art performance on a wide range of NLP benchmarks. After BERT, similar PLMs have sprung up recently. Yang et al. (2019) propose a permutation language model (XLNet). Later, Liu et al. (2019c) show that more data and more parameter tuning can benefit PLMs, and release a new state-of-the-art model (RoBERTa). Other works explore how to add more tasks (Liu et al., 2019b) and more parameters (Raffel et al., 2020; Lan et al., 2020) to PLMs.

Knowledge-Enhanced PLMs   Recently, many works have investigated how to incorporate knowledge into PLMs. MTB (Baldini Soares et al., 2019) takes a straightforward ‘‘matching the blank’’ pre-training objective to help the relation classification task. ERNIE (Zhang et al., 2019) identifies entity mentions in text and links pre-processed knowledge embeddings to the corresponding positions, which shows improvements on several NLP benchmarks. With a similar idea as ERNIE, KnowBert (Peters et al., 2019) incorporates an integrated entity linker in their model and adopts end-to-end training. Besides, Logan et al. (2019) and Hayashi et al. (2020) utilize relations between entities inside one sentence to train better generation models. Xiong et al. (2019) adopt entity replacement knowledge learning for improving entity-related tasks.

Some contemporaneous or following works try to inject factual knowledge into PLMs in different ways. E-BERT (Poerner et al., 2020) aligns entity embeddings with word embeddings and then directly adds the aligned embeddings into BERT to avoid additional pre-training. K-Adapter (Wang et al., 2020) injects knowledge with additional neural adapters to support continuous learning.



Knowledge Embedding KE methods have been
extensively studied. Conventional KE models
define different scoring functions for relational
triplets. Por ejemplo, TransE (Bordes et al., 2013)
treats tail entities as translations of head entities
and uses L1-norm or L2-norm to score triplets,
while DistMult (Yang et al., 2015) uses a bilinear
product with diagonal relation matrices and ComplEx
(Trouillon et al., 2016) extends it to complex-valued
embeddings. RotatE (Sun et al., 2019) combines the
advantages of both by modeling relations as rotations
in complex space.
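
To make the contrast between these scoring functions concrete, the following is a small NumPy sketch; the normalizations, margins, and negative-sampling losses used by the cited models are omitted, so it is illustrative rather than a faithful re-implementation:

    import numpy as np

    def transe_score(h, r, t, p=1):
        # TransE: t should be a translation of h by r; score with a negative Lp norm.
        return -float(np.linalg.norm(h + r - t, ord=p))

    def distmult_score(h, r, t):
        # DistMult: a bilinear form with a diagonal relation matrix.
        return float(np.sum(h * r * t))

    def complex_score(h, r, t):
        # ComplEx: the same form over complex-valued embeddings, keeping the real part.
        return float(np.real(np.sum(h * r * np.conj(t))))

    def rotate_score(h, r_phase, t):
        # RotatE: relations act as element-wise rotations in the complex plane.
        return -float(np.linalg.norm(h * np.exp(1j * r_phase) - t, ord=1))

    d = 8
    h, r, t = np.random.randn(d), np.random.randn(d), np.random.randn(d)
    print(transe_score(h, r, t), distmult_score(h, r, t))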

Inductive Embedding The above KE methods
learn entity embeddings only from the KG and are
inherently transductive, while some works (Wang
et al., 2014; Xie et al., 2016; Yamada et al., 2016;
Cao et al., 2017; Shi and Weninger, 2018; Cao
et al., 2018) incorporate textual metadata such as
entity names or descriptions to enhance the KE
methods and hence can do inductive KE to some
extent. Besides KG, it is also common for general
inductive graph embedding methods (Hamilton
et al., 2017; Bojchevski and Günnemann, 2018) to
utilize additional node features like text attributes,
degrees, and so on. KEPLER follows this line
of studies and takes full advantage of textual
information with an effective PLM.
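
As a rough illustration of this line of work (and of the description-encoding idea KEPLER builds on), the sketch below embeds an entity from its textual description with an off-the-shelf PLM and scores a triplet TransE-style. It uses the Hugging Face transformers API purely for convenience; it is not the released KEPLER implementation, and the relation vector is an untrained placeholder:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    encoder = AutoModel.from_pretrained("roberta-base")

    def embed_entity(description: str) -> torch.Tensor:
        # Encode the description and take the first token's representation as
        # the entity embedding, so unseen entities can be embedded inductively.
        inputs = tokenizer(description, return_tensors="pt",
                           truncation=True, max_length=128)
        with torch.no_grad():
            out = encoder(**inputs)
        return out.last_hidden_state[:, 0].squeeze(0)

    def triplet_score(head_desc, relation, tail_desc):
        # A TransE-style score over description-based embeddings; in joint
        # training the relation embeddings would be learned with the encoder.
        h, t = embed_entity(head_desc), embed_entity(tail_desc)
        return -torch.norm(h + relation - t, p=1)

    r = torch.zeros(encoder.config.hidden_size)  # placeholder relation embedding
    print(triplet_score("Johannes Kepler was a German astronomer.", r,
                        "Astronomy is a natural science."))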

Hamaguchi et al. (2017) and Wang et al. (2019C)
perform inductive KE by aggregating the trained
embeddings of the known neighboring nodes with
graph neural networks, and thus do not need
additional features. But these methods require the
unseen nodes to be surrounded by known nodes
and cannot embed new (sub)graphs. We leave
extending KEPLER to fully inductive KE without
additional features as future work.
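
For intuition, a bare-bones version of that neighbor-aggregation idea might look like the following hedged sketch (not the cited models): the embedding of an unseen entity is pooled from its known neighbors' trained embeddings, shifted by the connecting relation embeddings in a TransE-like fashion.

    import numpy as np

    def embed_unseen_entity(neighbors, entity_emb, relation_emb):
        # neighbors: iterable of (neighbor_id, relation_id, sign) triples, where
        # sign is +1 if the unseen entity is the tail of the triplet and -1 if
        # it is the head; entity_emb / relation_emb map ids to trained vectors.
        messages = [entity_emb[n] + sign * relation_emb[r]
                    for n, r, sign in neighbors]
        return np.mean(messages, axis=0)  # simple mean pooling as the aggregator

    # Toy usage with random "trained" embeddings.
    E = {0: np.random.randn(4), 1: np.random.randn(4)}
    R = {0: np.random.randn(4)}
    print(embed_unseen_entity([(0, 0, +1), (1, 0, -1)], E, R))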

7 Conclusion and Future Work

In this paper, we propose KEPLER, a simple but
effective unified model for knowledge embedding
and pre-trained language representation. We
train KEPLER with both the KE and MLM
objectives to align factual knowledge and
language representation in the same semantic
space, and experimental results on extensive tasks
demonstrate its effectiveness on both NLP and KE
applications. Besides, we propose Wikidata5M, a
large-scale KG dataset to facilitate future research.

In the future, we will (1) explore advanced
ways of more smoothly unifying the two semantic
spaces, including different KE forms and different
training objectives, and (2) investigate better
knowledge probing methods for PLMs to shed
light on knowledge-integrating mechanisms.

Expresiones de gratitud

This work is supported by the National Key
Research and Development Program of China
(No. 2018YFB1004503), the National Natural
Science Foundation of China (NSFC No.
U1736204, 61533018, 61772302, 61732008),
grants from the Institute for Guo Qiang, Tsinghua
University (2019GQB0003), and the Beijing Academy
of Artificial Intelligence (BAAI2019ZD0502).
Prof. Jian Tang is supported by the Natural
Sciences and Engineering Research Council
(NSERC) Discovery Grant and the Canada CIFAR
AI Chair Program. Xiaozhi Wang and Tianyu Gao
are supported by the Tsinghua University Initiative
Scientific Research Program. We also thank
our action editor, Prof. Doug Downey, and the
anonymous reviewers for their consistent help
and insightful suggestions.

References

Ivana Balazevic, Carl Allen, and Timothy Hospedales. 2019. TuckER: Tensor factorization for knowledge graph completion. In Proceedings of EMNLP-IJCNLP, pages 5185–5194. DOI: https://doi.org/10.18653/v1/D19-1522

Livio Baldini Soares, Nicholas FitzGerald,
Jeffrey Ling, and Tom Kwiatkowski. 2019.
Matching the blanks: Distributional similarity
for relation learning. In Proceedings of ACL,
pages 2895–2905. DOI: https://doi.org
/10.18653/v1/P19-1279

Aleksandar Bojchevski and Stephan Günnemann.
2018. Deep Gaussian embedding of graphs:
Unsupervised inductive learning via ranking.
In Proceedings of ICLR.

Antoine Bordes, Nicolas Usunier, Alberto
Garcia-Duran,
Jason Weston, and Oksana
Yakhnenko. 2013. Translating embeddings for
modeling multi-relational data. In Advances in
Neural Information Processing Systems (NIPS),
pages 2787–2795.

Yixin Cao, Lei Hou, Juanzi Li, Zhiyuan Liu,
Chengjiang Li, Xu Chen, and Tiansi Dong.
2018. Joint representation learning of cross-lingual words and entities via attentive distant supervision. In Proceedings of EMNLP, pages 227–237. DOI: https://doi.org/10.18653/v1/D18-1021

Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge text and knowledge by learning multi-prototype entity mention embedding. In Proceedings of ACL, pages 1623–1633. DOI: https://doi.org/10.18653/v1/P17-1149

Eunsol Choi, Omer Levy, Yejin Choi, and Luke
Zettlemoyer. 2018. Ultra-fine entity typing. In
Proceedings of ACL, pages 87–96.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML, pages 160–167.

Andrew M. Dai and Quoc V. Le. 2015. Semi-
supervised sequence learning. In Advances in
Neural Information Processing Systems (NIPS),
pages 3079–3087.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu,
Peng Li, Maosong Sun, and Jie Zhou. 2019.
FewRel 2.0: Towards more challenging few-shot relation classification. In Proceedings of EMNLP-IJCNLP, pages 6251–6256.

Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. 2017. Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach. In Proceedings of IJCAI, pages 1802–1808. DOI: https://doi.org/10.24963/ijcai.2017/250

William L. Hamilton, Rex Ying, and Jure
Leskovec. 2017. Inductive representation learn-
ing on large graphs. In Advances in Neu-
ral Information Processing Systems (NIPS),
pages 1025–1035.

Xu Han, Zhiyuan Liu, and Maosong Sun. 2018a.
Neural knowledge acquisition via mutual
attention between knowledge graph and text.
In Proceedings of AAAI, pages 4832–4839.

Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang,
Yuan Yao, Zhiyuan Liu, and Maosong Sun.
2018b. FewRel: A large-scale supervised few-
shot relation classification dataset with state-of-
the-art evaluation. In Proceedings of EMNLP,
pages 4803–4809. DOI: https://doi.org
/10.18653/v1/D18-1514

Hiroaki Hayashi, Zecong Hu, Chenyan Xiong,
and Graham Neubig. 2020. Latent relation
language models. In Proceedings of AAAI,
pages 7911–7918. DOI: https://doi.org
/10.1609/aaai.v34i05.6298

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of ACL, pages 328–339. DOI: https://doi.org/10.18653/v1/P18-1031, PMID: 28889062

Nora Kassner and Hinrich Schütze. 2020. Negated
and misprimed probes for pretrained language
modelos: Birds can talk, but cannot fly. En
Proceedings of ACL, pages 7811–7818. DOI:
https://doi.org/10.18653/v1/2020
.acl-main.698

Seyed Mehran Kazemi and David Poole. 2018. SimplE embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems (NeurIPS), pages 4284–4295.

Zhenzhong Lan, Mingda Chen, Sebastian
Goodman, Kevin Gimpel, Piyush Sharma,
and Radu Soricut. 2020. ALBERT: A lite
BERT for self-supervised learning of language
representaciones. In Proceedings of ICLR.

Yankai Lin, Zhiyuan Liu, Maosong Sun,
Yang Liu, and Xuan Zhu. 2015. Learning
entity and relation embeddings for knowledge
graph completion. In Proceedings of AAAI,
pages 2181–2187.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov,
Matthew E. Peters, and Noah A. Smith. 2019a.
Linguistic knowledge and transferability of
contextual representations. In Proceedings of
NAACL-HLT, pages 1073–1094.

Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020. K-BERT: Enabling language representation with knowledge graph. In Proceedings of AAAI, pages 2901–2908. DOI: https://doi.org/10.1609/aaai.v34i03.5681

Xiaodong Liu, Pengcheng He, Weizhu Chen, y
Jianfeng Gao. 2019b. Multi-task deep neural
networks for natural language understanding.
In Proceedings of ACL, pages 4487–4496.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019c. RoBERTa: A robustly
optimized BERT pretraining approach. CoRR,
cs.CL/1907.11692v1.

Robert Logan, Nelson F. Liu, Matthew E.
Peters, Matt Gardner, and Sameer Singh.
2019. Barack’s Wife Hillary: Using knowledge
graphs for fact-aware language modeling. In
Proceedings of ACL, pages 5962–5971. DOI:
https://doi.org/10.18653/v1/P19
-1598

Lajanugen Logeswaran, Ming-Wei Chang,
Kenton Lee, Kristina Toutanova, Jacob Devlin,
and Honglak Lee. 2019. Zero-shot entity
linking by reading entity descriptions.
In
Proceedings of ACL, pages 3449–3460. DOI:
https://doi.org/10.18653/v1/P19
-1335

Michael McCloskey and Neal J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier. DOI: https://doi.org/10.1016/S0079-7421(08)60536-8

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), pages 3111–3119.

George A. Miller. 1995. WordNet: A lexical
database for English. Communications of the
ACM, 38(11):39–41. DOI: https://doi
.org/10.1145/219717.219748

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT (Demonstrations), pages 48–53. DOI: https://doi.org/10.18653/v1/N19-4009

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543. DOI: https://doi.org/10.3115/v1/D14-1162

Matthew Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextu-
alized word representations. En procedimientos
of NAACL-HLT, pages 2227–2237. DOI:
https://doi.org/10.18653/v1/N18
-1202

Matthew E. Peters, Mark Neumann, Robert Logan,
Roy Schwartz, Vidur Joshi, Sameer Singh, and
Noah A. Herrero. 2019. Knowledge enhanced
contextual word representations. En curso-
ings of EMNLP-IJCNLP, pages 43–54. DOI:
https://doi.org/10.18653/v1/D19
-1005, PMID: 31383442

Fabio Petroni, Tim Rocktäschel, Sebastian
Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang
Wu, and Alexander Miller. 2019. Language
models as knowledge bases? In Proceedings
of EMNLP-IJCNLP, pages 2463–2473. DOI:
https://doi.org/10.18653/v1/D19
-1250

Nina Poerner, Ulli Waltinger, and Hinrich
Schütze. 2020. E-BERT: Efficient-yet-effective
entity embeddings for BERT. In Findings of
the Association for Computational Linguis-
tics: EMNLP 2020, pages 803–818. DOI:
https://doi.org/10.18653/v1/2020
.findings-emnlp.71

Alec Radford, Karthik Narasimhan, Tim
Salimans, and Ilya Sutskever. 2018. Improving
language understanding by generative pre-training. Technical report, OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL, pages 1715–1725. DOI: https://doi.org/10.18653/v1/P16-1162

Baoxu Shi and Tim Weninger. 2018. Open-world
knowledge graph completion. En procedimientos
of AAAI, pages 1957–1964.

Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), pages 4077–4087.

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, y
Jian Tang. 2019. RotatE: Knowledge graph
embedding by relational rotation in complex
espacio. In Proceedings of ICLR.

Théo Trouillon, Johannes Welbl, Sebastian
Riedel, Éric Gaussier, and Guillaume Bouchard.
2016. Complex embeddings for simple link
prediction. In Proceedings of ICML, pages
2071–2080.

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in
Neural Information Processing Systems (NIPS),
pages 5998–6008.

Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, Berlin Chen, Benjamin Van Durme, Edouard Grave, Ellie Pavlick, and Samuel R. Bowman. 2019a. Can you tell me how to get past sesame street? Sentence-level pretraining beyond language modeling. In Proceedings of ACL, pages 4465–4476. DOI: https://doi.org/10.18653/v1/P19-1439

Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel Bowman.
2019b. GLUE: A multi-task benchmark and
analysis platform for natural language understanding. In Proceedings of ICLR. DOI: https://doi.org/10.18653/v1/W18-5446

PeiFeng Wang, Jialong Han, Chenliang Li,
and Rong Pan. 2019C. Logic attention based
neighborhood aggregation for inductive knowledge graph embedding. In Proceedings of AAAI, pages 7152–7159. DOI: https://
doi.org/10.1609/aaai.v33i01.33017152

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu
Wei, Xuanjing Huang, Jianshu Ji, Cuihong Cao,
Daxin Jiang, and Ming Zhou. 2020. K-Adapter:
Infusing knowledge into pre-trained models
with adapters. CoRR, cs.CL/2002.01808v3.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and
Zheng Chen. 2014. Knowledge graph and text
jointly embedding. In Proceedings of EMNLP,
pages 1591–1601. DOI: https://doi.org
/10.3115/v1/D14-1167

Adina Williams, Nikita Nangia, and Samuel
Bowman. 2018. A broad-coverage challenge
corpus for sentence understanding through
inference. In Proceedings of NAACL-HLT,
pages 1112–1122. DOI: https://doi.org
/10.18653/v1/N18-1101

Ledell Wu, Fabio Petroni, Martin Josifoski,
Sebastian Riedel, and Luke Zettlemoyer. 2020.
Scalable zero-shot entity linking with dense
entity retrieval. In Proceedings of EMNLP,
pages 6397–6407.

Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo
Luan, and Maosong Sun. 2016. Representation
learning of knowledge graphs with entity
descriptions. In Proceedings of AAAI,
pages 2659–2665.

Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. 2019. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In Proceedings of ICLR.

Ikuya Yamada, Hiroyuki Shindo, Hideaki
Takeda, and Yoshiyasu Takefuji. 2016. Joint
learning of
the embedding of words and
entities for named entity disambiguation. In
Proceedings of CoNLL, pages 250–259. DOI:
https://doi.org/10.18653/v1/K16
-1025

Bishan Yang and Tom Mitchell. 2017. Leveraging
knowledge bases in LSTMs for improving
machine reading. In Proceedings of ACL,
pages 1436–1446. DOI: https://doi.org
/10.18653/v1/P17-1132

Bishan Yang, Scott Wen-tau Yih, Xiaodong He,
Jianfeng Gao, and Li Deng. 2015. Embedding
entities and relations for learning and inference
in knowledge bases. In Proceedings of ICLR.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G.
Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. 2019. XLNet: Generalized autoregressive
pretraining for language understanding. In
Advances in Neural Information Processing
Systems (NeurIPS), pages 5754–5764.


Yuhao Zhang, Victor Zhong, Danqi Chen,
Gabor Angeli, and Christopher D. Manning.
2017. Position-aware attention and supervised
data improve slot filling. In Proceedings of
EMNLP, pages 35–45. DOI: https://doi
.org/10.18653/v1/D17-1004

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin
Jiang, Maosong Sun, and Qun Liu. 2019.
ERNIE: Enhanced language representation
with informative entities. In Proceedings of
ACL, pages 1441–1451. DOI: https://doi
.org/10.18653/v1/P19-1139

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan
Salakhutdinov, Raquel Urtasun, Antonio
Torralba, and Sanja Fidler. 2015. Aligning
books and movies: Towards story-like visual
explanations by watching movies and reading
books. In Proceedings of ICCV, pages 19–27.
DOI: https://doi.org/10.1109/ICCV.2015.11

Poorya Zaremoodi, Wray Buntine, and Gholamreza Haffari. 2018. Adaptive knowledge sharing in multi-task learning: Improving low-resource neural machine translation. In Proceedings of ACL, pages 656–661. DOI: https://doi.org/10.18653/v1/P18-2104

Zhaocheng Zhu, Shizhen Xu, Jian Tang, and Meng
Qu. 2019. GraphVite: A high-performance
CPU-GPU hybrid system for node embedding.
In Proceedings of WWW, pages 2494–2504.
