KEPLER: A Unified Model for Knowledge Embedding and
Pre-trained Language Representation
Xiaozhi Wang1, Tianyu Gao3, Zhaocheng Zhu4,5, Zhengyan Zhang1
Zhiyuan Liu1,2∗, Juanzi Li1,2, and Jian Tang4,6,7∗
1Department of CST, BNRist; 2KIRC, Institute for AI, Tsinghua University, Beijing, China
{wangxz20,zy-z19}@mails.tsinghua.edu.cn
{liuzy,lijuanzi}@tsinghua.edu.cn
3Department of Computer Science, Princeton University, Princeton, NJ, USA
tianyug@princeton.edu
4Mila – Québec AI Institute; 5Université de Montréal; 6HEC Montréal, Canada
zhaocheng.zhu@umontreal.ca, jian.tang@hec.ca
7CIFAR AI Research Chair
Abstract
Pre-trained language representation models (PLMs) cannot well capture factual knowledge from text. In contrast, knowledge embedding (KE) methods can effectively represent the relational facts in knowledge graphs (KGs) with informative entity embeddings, but conventional KE models cannot take full advantage of the abundant textual information. In this paper, we propose a unified model for Knowledge Embedding and Pre-trained LanguagE Representation (KEPLER), which can not only better integrate factual knowledge into PLMs but also produce effective text-enhanced KE with the strong PLMs. In KEPLER, we encode textual entity descriptions with a PLM as their embeddings, and then jointly optimize the KE and language modeling objectives. Experimental results show that KEPLER achieves state-of-the-art performances on various NLP tasks, and also works remarkably well as an inductive KE model on KG link prediction. Moreover, for pre-training and evaluating KEPLER, we construct Wikidata5M1, a large-scale KG dataset with aligned entity descriptions, and benchmark state-of-the-art KE methods on it. It shall serve as a new KE benchmark and facilitate the research on large KG, inductive KE, and KG with text. The source code can be obtained from https://github.com/THU-KEG/KEPLER.
∗Correspondence to: Z. Liu and J. Tang.
1https://deepgraphlearning.github.io/project/wikidata5m.
1 Introduction
Recent pre-trained language representation models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019c) learn effective language representation from large-scale unstructured corpora with language modeling objectives and have achieved superior performances on various natural language processing (NLP) tasks. Existing PLMs learn useful linguistic knowledge from unlabeled text (Liu et al., 2019a), but they generally cannot capture the world facts well, which are typically sparse and have complex forms in text (Petroni et al., 2019; Logan et al., 2019).
By contrast, knowledge graphs (KGs) contain extensive structural facts, and knowledge embedding (KE) methods (Bordes et al., 2013; Yang et al., 2015; Sun et al., 2019) can effectively embed them into continuous entity and relation embeddings. These embeddings can not only help with KG completion but also benefit various NLP applications (Yang and Mitchell, 2017; Zaremoodi et al., 2018; Han et al., 2018a).
As shown in Figure 1, textual entity descriptions contain abundant information. Intuitively, KE methods can provide factual knowledge for PLMs, while the informative text data can also benefit KE.

Figure 1: An example of a KG with entity descriptions. The figure suggests that descriptions contain abundant information about entities and can help to represent the relational facts between them.

Inspired by Xie et al. (2016), we use entity descriptions to bridge the gap between KE and PLM, and align the semantic space of text to the symbol space of KGs (Logeswaran et al., 2019). We propose KEPLER, a unified model for Knowledge Embedding and Pre-trained LanguagE Representation. We encode the texts and entities into a unified semantic space with the same PLM as the encoder, and jointly optimize the KE and the masked language modeling (MLM) objectives. For the KE objective, we encode the entity descriptions as entity embeddings and then train them in the same way as conventional KE methods. For the MLM objective, we follow the approach of existing PLMs (Devlin et al., 2019; Liu et al., 2019c). KEPLER has the following strengths:

As a PLM, (1) KEPLER is able to integrate factual knowledge into language representation with the supervision from KG by the KE objective. (2) KEPLER inherits the strong ability of language understanding from PLMs by the MLM objective. (3) The KE objective enhances the ability of KEPLER to extract knowledge from text since it requires the model to encode the entities from their corresponding descriptions. (4) KEPLER can be directly adopted in a wide range of NLP tasks without additional inference overhead compared to conventional PLMs since we just add new training objectives without modifying model structures.

There are also some recent works (Zhang et al., 2019; Peters et al., 2019; Liu et al., 2020) directly integrating fixed entity embeddings into PLMs to provide external knowledge. However, (1) their entity embeddings are learned by a separate KE model, and thus cannot be easily aligned with the language representation space. (2) They require an entity linker to link the text to the corresponding entities, making them suffer from the error propagation problem. (3) Compared to vanilla PLMs, their sophisticated mechanisms to link and use entity embeddings lead to additional inference overhead.
As a KE model, (1) KEPLER can take full
advantage of the abundant information from entity
descriptions with the help of the MLM objective.
(2) KEPLER is capable of performing KE in the
inductive setting, that is, it can produce embeddings for unseen entities from their descriptions, while conventional KE methods are inherently transductive and can only learn representations for entities seen during training.
Inductive KE is essential for many real-world
applications, such as updating KGs with emerging
entities and KG construction, and thus is worth
more investigation.
For pre-training and evaluating KEPLER, we
need a KG with (1) large amounts of knowledge
facts, (2) aligned entity descriptions, E (3)
a reasonable inductive-setting data split, which
cannot be satisfied by existing KE benchmarks.
Therefore, we construct Wikidata5M, containing
about 5M entities, 20M triplets, and aligned entity
descriptions from Wikipedia. To the best of our
knowledge, it is the largest general-domain KG
dataset. We also benchmark several classical
KE methods and give data splits for both the
transductive and the inductive settings to facilitate
future research.
To summarize, our contribution is three-fold:
(1) We propose KEPLER, a knowledge-enhanced
PLM by jointly optimizing the KE and MLM
objectives, which brings great improvements on
a wide range of NLP tasks. (2) By encoding
text descriptions as entity embeddings, KEPLER
shows its effectiveness as a KE model, particolarmente
in the inductive setting. (3) We also introduce
Wikidata5M, a new large-scale KG dataset, which
shall promote the research on large-scale KG,
inductive KE, and the interactions between KG
and NLP.
2 KEPLER

As shown in Figure 2, KEPLER implicitly incorporates factual knowledge into language representations by jointly training with two objectives. In this section, we introduce in detail the encoder structure, the KE and MLM objectives, and how we combine the two as a unified model.

Figure 2: The KEPLER framework. We encode entity descriptions as entity embeddings and jointly train the knowledge embedding (KE) and masked language modeling (MLM) objectives on the same PLM.

2.1 Encoder

For the text encoder, we use the Transformer architecture (Vaswani et al., 2017) in the same way
as Devlin et al. (2019) and Liu et al. (2019c). The encoder takes a sequence of N tokens (x_1, ..., x_N) as inputs, and computes L layers of d-dimensional contextualized representations H_i ∈ R^{N×d}, 1 ≤ i ≤ L. Each layer of the encoder E_i is a combination of a multi-head self-attention network and a multilayer perceptron, and the encoder gets the representation of each layer by H_i = E_i(H_{i−1}). Eventually, we get a contextualized representation for each position, which can be further used in downstream tasks. Usually, a special token <s> is added to the beginning of the text, and the output at <s> is regarded as the sentence representation. We denote the representation function as E(·).
The encoder requires a tokenizer to convert
plain texts into sequences of tokens. Here we use
the same tokenization as RoBERTa: the Byte-Pair
Encoding (BPE) (Sennrich et al., 2016).
Unlike previous knowledge-enhanced PLM
works (Zhang et al., 2019; Peters et al., 2019), we
do not modify the Transformer encoder structure
to add external entity linkers or knowledge-
integration layers. It means that our model has
no additional inference overhead compared to
vanilla PLMs, and it makes applying KEPLER in
downstream tasks as easy as RoBERTa.
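As a rough illustration of this encoder usage, the minimal sketch below encodes an entity description and takes the output at the leading <s> token as its embedding. It is written against the HuggingFace transformers API for brevity; the released KEPLER implementation is built on fairseq instead.

```python
# Minimal sketch: encode an entity description with a RoBERTa-style encoder
# and take the representation at the leading <s> token as its embedding.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def encode_description(text: str) -> torch.Tensor:
    """Return the <s> (first-position) representation as the text embedding."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]  # shape: (1, d)

entity_embedding = encode_description("Johannes Kepler was a German astronomer ...")
```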
2.2 Knowledge Embedding
To integrate factual knowledge into KEPLER, we adopt the knowledge embedding (KE) objective in our pre-training. KE encodes entities and relations in knowledge graphs (KGs) as distributed representations, which benefits many downstream tasks, such as link prediction and relation extraction.
We first define KGs: A KG is a graph with
entities as its nodes and relations between entities
as its edges. We use a triplet (H, R, T) to describe a
relational fact, where h, t are the head entity and
the tail entity, and r is the relation type within
a pre-defined relation set R. In conventional KE
models, each entity and relation is assigned a
d-dimensional vector, and a scoring function is
defined for training the embeddings and predicting
links.
In KEPLER, instead of using stored embed-
dings, we encode entities into vectors by using
their corresponding text. By choosing different
textual data and different KE scoring functions,
we have multiple variants for the KE objective
of KEPLER. In this paper, we explore three
simple but effective ways: entity descriptions as
embeddings, entity and relation descriptions as
embeddings, and entity embeddings conditioned
on relations. We leave exploring advanced KE
methods as our future work.
Entity Descriptions as Embeddings  For a relational triplet (h, r, t), we have

    h = E(text_h),  t = E(text_t),  r = T_r,    (1)

where text_h and text_t are the descriptions for h and t, each with the special token <s> at the beginning. T ∈ R^{|R|×d} is the relation embedding matrix, and h, t, r are the embeddings for h, t, and r.
We use the loss from Sun et al. (2019) as our KE objective, which adopts negative sampling (Mikolov et al., 2013) for efficient optimization:

    L_KE = − log σ(γ − d_r(h, t)) − (1/n) Σ_{i=1}^{n} log σ(d_r(h'_i, t'_i) − γ),    (2)

where (h'_i, r, t'_i) are negative samples, γ is the margin, σ is the sigmoid function, and d_r is the scoring function, for which we choose to follow TransE (Bordes et al., 2013) for its simplicity,

    d_r(h, t) = ||h + r − t||_p,    (3)

where we take the norm p as 1. The negative sampling policy is to fix the head entity and randomly sample a tail entity, and vice versa.
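A minimal sketch of this KE objective (Equations 2 and 3) is given below; the tensor shapes and batching scheme are illustrative assumptions rather than the exact released implementation.

```python
# Sketch of the KE objective: TransE-style scoring (Equation 3) with the
# negative-sampling loss of Equation 2. h, t come from encoded entity
# descriptions; r is looked up from the relation embedding table.
import torch
import torch.nn.functional as F

def transe_score(h, r, t, p=1):
    # d_r(h, t) = ||h + r - t||_p
    return torch.norm(h + r - t, p=p, dim=-1)

def ke_loss(h, r, t, neg_h, neg_t, gamma=4.0):
    # h, r, t: (B, d); neg_h, neg_t: (B, n, d) corrupted heads/tails
    pos = -F.logsigmoid(gamma - transe_score(h, r, t))                     # positive term
    neg = -F.logsigmoid(transe_score(neg_h, r.unsqueeze(1), neg_t) - gamma)  # (B, n)
    return (pos + neg.mean(dim=1)).mean()
```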
Entity and Relation Descriptions as Embed-
dings A natural extension for the last method
is to encode the relation descriptions as relation
embeddings as well. Formally, we have

    r̂ = E(text_r),    (4)

where text_r is the description for the relation r. Then we use r̂ to replace r in Equations 2 and 3.
Entity Embeddings Conditioned on Relations
In this manner, we use entity embeddings con-
ditioned on r for better KE performance. The
intuition is that semantics of an entity may have
multiple aspects, and different relations focus on
different ones (Lin et al., 2015). So we have,
    h_r = E(text_{h,r}),    (5)

where text_{h,r} is the concatenation of the description for the entity h and the description for the relation r, with the special token <s> at the beginning and </s> in between. Correspondingly, we use h_r instead of h in Equations 2 and 3.
2.3 Masked Language Modeling
The masked language modeling (MLM) objective
is inherited from BERT and RoBERTa. During
pre-training, MLM randomly selects some of the
input positions, and the objective is to predict the
tokens at these selected positions within a fixed
dictionary.
To be more specific, MLM randomly selects 15% of the input positions; among the selected positions, 80% are masked with the special token <mask>, 10% are replaced by other random tokens, and the rest remain unchanged. For each selected position j, the last layer of the contextualized representation H_{L,j} is used for a W-way classification, where W is the size of the dictionary. At last, a cross-entropy loss L_MLM is calculated over these selected positions.
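For concreteness, a minimal sketch of this corruption scheme (15% selection with the 80/10/10 split) is shown below; function and argument names are illustrative, and special tokens would be excluded from selection in practice.

```python
# Sketch of MLM corruption: select 15% of positions; of those, 80% become
# <mask>, 10% become a random token, 10% stay unchanged. Labels are -100
# (ignored by the loss) at unselected positions.
import torch

def mask_tokens(input_ids, mask_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100                          # predict only selected positions

    replace_mask = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[replace_mask] = mask_id                 # 80%: <mask>

    random_mask = selected & ~replace_mask & (torch.rand(input_ids.shape) < 0.5)
    input_ids[random_mask] = torch.randint(vocab_size, input_ids.shape)[random_mask]  # 10%: random
    # the remaining 10% of selected positions keep their original token
    return input_ids, labels
```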
We initialize our model with the pre-trained
checkpoint of RoBERTaBASE. Tuttavia, we still
keep MLM as one of our objectives to avoid
catastrophic forgetting (McCloskey and Cohen,
1989) while training towards the KE objective.
Actually, as demonstrated in Section 5.1, only
using the KE objective leads to poor results in
NLP tasks.
2.4 Training Objectives
To incorporate factual knowledge and language
understanding into one PLM, we design a
multi-task loss as shown in Figure 2 and Equation 6:

    L = L_KE + L_MLM,    (6)

where L_KE and L_MLM are the losses for KE and MLM, respectively. Jointly optimizing the
two objectives can implicitly integrate knowledge
from external KGs into the text encoder, while
preserving the strong abilities of PLMs for
syntactic and semantic understanding. Note that
those two tasks only share the text encoder, E
for each mini-batch, text data sampled for KE
and MLM are not (necessarily) the same. This is
because seeing a variety of text (instead of just
entity descriptions) in MLM can help the model
to have better language understanding ability.
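A schematic training step under this multi-task loss might look as follows; `ke_loss` and `mlm_loss` are assumed helper methods standing in for Equations 2-3 and the MLM cross-entropy, and the two batches are drawn from different data streams as described above.

```python
# Sketch of one joint optimization step (Equation 6): the KE batch and the
# MLM batch are encoded by the same PLM, and their losses are summed.
def training_step(model, ke_batch, mlm_batch, optimizer):
    loss_ke = model.ke_loss(ke_batch)     # Equations 2-3 on entity descriptions
    loss_mlm = model.mlm_loss(mlm_batch)  # cross-entropy over masked positions
    loss = loss_ke + loss_mlm             # L = L_KE + L_MLM
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```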
2.5 Variants and Implementations
We introduce the variants of KEPLER and the pre-
training implementations here. The fine-tuning
details will be introduced in Section 4.
KEPLER Variants
We implement multiple versions of KEPLER
in experiments to explore the effectiveness of
our pre-training framework. We use the same
denotations in Section 4 as below.
KEPLER-Wiki is the principal model in our experiments, which adopts Wikidata5M (Section 3) as the KG and the entity-description-as-embedding method (Equation 1). All other
variants, if not specified, use the same settings.
KEPLER-Wiki achieves the best performances on
most tasks.
KEPLER-WordNet uses WordNet (Miller,
1995) as its KG source. WordNet is an English lex-
ical graph, where nodes are lemmas and synsets,
and edges are their relations. Intuitively, incor-
porating WordNet can bring lexical knowledge
and thus benefits NLP tasks. We use the same
WordNet 3.0 as in KnowBert (Peters et al., 2019),
which is extracted from the nltk2 package.
KEPLER-W+W takes both Wikidata5M and
WordNet as its KGs. To jointly train with two KG
datasets, we modify the objective in Equation 6 as

    L = L_Wiki + L_WordNet + L_MLM,    (7)

where L_Wiki and L_WordNet are the losses from Wikidata5M and WordNet, respectively.
KEPLER-Rel uses the entity and relation
descriptions as embeddings method (Equation 4).
As the relation descriptions in Wikidata are
short (11.7 words on average) and homogeneous,
encoding relation descriptions as relation embed-
dings results in worse performance as shown in
Section 4.
KEPLER-Cond uses the entity-embedding-
conditioned-on-relation method (Equation 5).
This model achieves superior results in link
prediction tasks, both transductive and inductive
(Section 4.3).
KEPLER-OnlyDesc trains the MLM objective directly on the entity descriptions from the KE objective, rather than on the English Wikipedia and BookCorpus used by other versions of KEPLER.
Tuttavia, as the entity description data are smaller
(2.3 GB vs 13 GB) and homogeneous, it harms the
general language understanding ability and thus
performs worse (Section 4.2).
KEPLER-KE only adopts the KE objective
in pre-training, which is an ablated version of
KEPLER-Wiki. It is used to show the necessity of
the MLM objective for language understanding.
Pre-training Implementation
In practice, we choose RoBERTa (Liu et al.,
2019c) as our base model and implement KEPLER
2https://www.nltk.org.
in the fairseq framework (Ott et al., 2019) for pre-
training. Due to the computing resource limit,
we choose the BASE size (L = 12, d = 768)
and use the released roberta.base parameters
for initialization, which is a common practice
to save pre-training time (Zhang et al., 2019;
Peters et al., 2019). For the MLM objective, we
use the English Wikipedia (2,500M words) E
BookCorpus (800M words) (Zhu et al., 2015)
as our pre-training corpora (except KEPLER-
OnlyDesc). We extract text from these two corpora
in the same way as Devlin et al. (2019). For the
KE objective, we encode the first 512 tokens of
entity descriptions from the English Wikipedia as
entity embeddings.
We set the γ in Equation 2 COME 4 E 9 for NLP
and KE tasks respectively, and we use the models
pre-trained with 10 E 30 epochs for NLP and
KE. Specially, the γ is 1 for KEPLER-WordNet.
The two hyperparameters are tuned by multiple
trials for γ in {1, 2, 4, 6, 9} and the number of
epochs in {5, 10, 20, 30, 40}, and we select the
model by performances on TACRED (F-1) E
inductive link prediction (HITS@10). We use
gradient accumulation to achieve a batch size of
12,288.
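A minimal sketch of gradient accumulation, the technique used here to reach the large effective batch size, is shown below; the model interface and batch format are illustrative assumptions.

```python
# Sketch of gradient accumulation: gradients from several micro-batches are
# summed before a single optimizer step, so that a large effective batch size
# (e.g., 12,288) fits into limited GPU memory.
def accumulated_step(model, micro_batches, optimizer):
    optimizer.zero_grad()
    for batch in micro_batches:
        loss = model(batch) / len(micro_batches)  # scale so gradients average
        loss.backward()
    optimizer.step()
```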
3 Wikidata5M
As shown in Section 2, to train KEPLER, the KG
dataset should (1) be large enough, (2) contain
high-quality textual descriptions for its entities
and relations, E (3) have a reasonable inductive
setting, which most existing KG datasets do not
provide. Thus, based on Wikidata3 and English
Wikipedia,4 we construct Wikidata5M, a large-
scale KG dataset with aligned text descriptions
from corresponding Wikipedia pages, and also an
inductive test set. In the following sections, we
first introduce the data collection (Section 3.1) and the data split (Section 3.2), and then provide the results of representative KE methods on the dataset (Section 3.3).
3.1 Data Collection
We use the July 2019 dump of Wikidata and
Wikipedia. For each entity in Wikidata, we align
it to its Wikipedia page and extract the first section
as its description. Entities with no pages or with
descriptions fewer than 5 words are discarded.
3https://www.wikidata.org.
4https://en.wikipedia.org.
Dataset      #entity    #relation  #training   #validation  #test
FB15K        14,951     1,345      483,142     50,000       59,071
WN18         40,943     18         141,442     5,000        5,000
FB15K-237    14,541     237        272,115     17,535       20,466
WN18RR       40,943     11         86,835      3,034        3,134
Wikidata5M   4,594,485  822        20,614,279  5,163        5,133

Table 1: Statistics of Wikidata5M (transductive setting) compared with existing KE benchmarks.
Entity Type       Occurrence  Percentage
Human             1,517,591   33.0%
Taxon             363,882     7.9%
Wikimedia list    118,823     2.6%
Film              114,266     2.5%
Human Settlement  110,939     2.4%
Total             2,225,501   48.4%

Table 2: Top-5 entity categories in Wikidata5M.
We retrieve all the relational facts in Wikidata.
A fact is considered to be valid when both of
its entities are not discarded, and its relation
has a non-empty page in Wikidata. The final
KG contains 4,594,485 entities, 822 relations
E 20,624,575 triplets. Statistics of Wikidata5M
along with four other widely used benchmarks
are shown in Table 1. Top-5 entity categories are
listed in Table 2. We can see that Wikidata5M
is much larger than other KG datasets, covering
various domains.
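A simplified sketch of the filtering rules described in this section is given below; the data structures are illustrative placeholders, not the actual construction scripts.

```python
# Sketch of the Wikidata5M filtering rules: keep an entity only if its aligned
# Wikipedia description has at least five words, and keep a fact only if both
# entities survive and the relation has a non-empty page in Wikidata.
def build_wikidata5m(entity2desc, triplets, relation_has_page):
    kept_entities = {e for e, desc in entity2desc.items()
                     if desc and len(desc.split()) >= 5}
    kept_triplets = [(h, r, t) for h, r, t in triplets
                     if h in kept_entities and t in kept_entities
                     and relation_has_page(r)]
    return kept_entities, kept_triplets
```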
3.2 Data Split
For Wikidata5M, we take two different settings:
the transductive setting and the inductive setting.
The transductive setting (shown in Table 1) is adopted in most KG datasets, where the entities are shared and the triplet sets are disjoint across training, validation, and test. In this case, KE models are expected to learn effective entity embeddings only for the entities in the training set. In the
inductive setting (shown in Table 3), the entities
and triplets are mutually disjoint across training,
validation and test. We randomly sample some
connected subgraphs as the validation and test set.
In the inductive setting, the KE models should
produce embeddings for the unseen entities given
side features like descriptions, neighbors, and so on. The inductive setting is more challenging and also meaningful in real-world applications, where entities in KGs experience open-ended growth, and the inductive ability is crucial for online KE methods.

Subset       #entity    #relation  #triplet
Training     4,579,609  822        20,496,514
Validation   7,374      199        6,699
Test         7,475      201        6,894

Table 3: Statistics of the Wikidata5M inductive setting.
Although Wikidata5M contains massive enti-
ties and triplets, our validation and test set are not
large, which is limited by the standard evaluation
method of link prediction (Section 3.3). Each
episode of evaluation requires |E| × |T | × 2 times
of KE score calculation, Dove |E| E |T | are
the total number of entities and the number of
triplets in test set respectively. As Wikidata5M
contains massive entities, the evaluation is very
time-consuming, hence we have to limit the test
set to thousands of triplets to ensure tractable
evaluations. This indicates that large-scale KE urges a more efficient evaluation protocol. We
will leave exploring it to future work.
3.3 Benchmark
To assess the challenges of Wikidata5M, we
benchmark several popular KE models on our dataset in the transductive setting (as they inherently do not support the inductive setting).
ting). Because their original implementations do
not scale to Wikidata5M, we benchmark these
methods with GraphVite (Zhu et al., 2019), a
multi-GPU KE toolkit.
In the transductive setting, for each test triplet (h, r, t), the model ranks all the entities by scoring (h, r, t′), t′ ∈ E, where E is the entity set excluding other correct t. The evaluation metrics, MRR (mean reciprocal rank), MR (mean rank), and HITS@{1,3,10}, are based on the rank of the
correct tail entity t among all the entities in E. Then we do the same thing for the head entities. We report the average results over all test triplets and over both head and tail entity predictions.

Method                            MR      MRR   HITS@1  HITS@3  HITS@10
TransE (Bordes et al., 2013)      109370  25.3  17.0    31.1    39.2
DistMult (Yang et al., 2015)      211030  25.3  20.8    27.8    33.4
ComplEx (Trouillon et al., 2016)  244540  28.1  22.8    31.0    37.3
SimplE (Kazemi and Poole, 2018)   115263  29.6  25.2    31.7    37.7
RotatE (Sun et al., 2019)         89459   29.0  23.4    32.2    39.0

Table 4: Performance of different KE models on Wikidata5M (% except MR).
Tavolo 4 shows the results of popular KE meth-
ods on Wikidata5M, which are all significantly
lower than on existing KG datasets like FB15K-
237, WN18RR, and so forth. It demonstrates that
Wikidata5M is more challenging due to its large
scale and high coverage. The results advocate for
more efforts towards large-scale KE.
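For reference, a minimal sketch of the ranking-based evaluation protocol described in this section (MR, MRR, and HITS@k over ranked candidates) is given below; it assumes a generic scoring function and omits the bookkeeping that excludes other correct answers.

```python
# Sketch of link-prediction evaluation: score every candidate tail for a test
# triplet, record the rank of the correct one, and summarize MR, MRR, HITS@k.
import torch

def rank_of_tail(score_fn, h, r, t_index, candidate_tails):
    scores = score_fn(h, r, candidate_tails)          # lower = better for TransE
    order = torch.argsort(scores)
    return (order == t_index).nonzero(as_tuple=True)[0].item() + 1  # 1-based rank

def summarize(ranks, ks=(1, 3, 10)):
    ranks = torch.tensor(ranks, dtype=torch.float)
    metrics = {"MR": ranks.mean().item(), "MRR": (1.0 / ranks).mean().item()}
    for k in ks:
        metrics[f"HITS@{k}"] = (ranks <= k).float().mean().item()
    return metrics
```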
4 Experiments
In this section, we introduce the experiment
settings and results of our model on various
NLP and KE tasks, along with some analyses
on KEPLER.
4.1 Experimental Setting
Baselines
In our experiments, RoBERTa is an
important baseline since KEPLER is based on it
(all mentioned models are of BASE size if not
specified). As we cannot afford the full RoBERTa
corpora (126 GB, and we only use 13 GB)
in KEPLER pre-training, we implement Our
RoBERTa for direct comparisons to KEPLER.
It is initialized by RoBERTaBASE and is further
trained with the MLM objective on the same
corpora as KEPLER.
We also evaluate recent knowledge-enhanced
PLMs, including ERNIEBERT (Zhang et al., 2019)
and KnowBertBERT (Peters et al., 2019). As
ERNIE and our principal model KEPLER-Wiki
only use Wikidata, we take KnowBert-Wiki in the
experiments to ensure fair comparisons with the
same knowledge source. Considering KEPLER
is based on RoBERTa, we reproduce the two
models with RoBERTa too (ERNIERoBERTa and
KnowBertRoBERTa). The reproduction of Know-
Bert is based on its original implementation.5
5https://github.com/allenai/kb.
On relation classification, we also compare with
MTB (Baldini Soares et al., 2019), which adopts
‘‘matching the blank’’ pre-training. Different
from other baselines, the original MTB is based
on BERTLARGE (denoted by MTB (BERTLARGE)).
For a fair comparison under the same model size,
we reimplement MTB with BERTBASE (MTB).
Hyperparameter The pre-training settings are
in Section 2.5. For fine-tuning on downstream
compiti, we set KEPLER hyperparameters the same
as reported in KnowBert on TACRED and
OpenEntity. On FewRel, we set
the learning
rate as 2e-5 and batch size as 20 E 4 for the
Proto and PAIR frameworks respectively. For
GLUE, we follow the hyperparameters reported
in RoBERTa. For baselines, we keep their original
hyperparameters unchanged or use the best trial in
KEPLER searching space if no original settings
are available.
4.2 NLP Tasks
In this section, we demonstrate the performance of
KEPLER and its baselines on various NLP tasks.
Relation Classification
Relation classification requires models to classify
relation types between two given entities from
testo. We evaluate KEPLER and other baselines
on two widely used benchmarks: TACRED and
FewRel.
TACRED (Zhang et al., 2017) ha 42 relations
E 106,264 sentences. Here we follow the settings
of Baldini Soares et al. (2019), where we add four
special tokens before and after the two entity
mentions, and concatenate the representations at
the beginnings of the two entities for classification.
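A minimal sketch of this input construction is shown below; the marker strings and helper names are illustrative, not the exact tokens used in the experiments.

```python
# Sketch of relation classification inputs: wrap the two entity mentions with
# marker tokens and concatenate the hidden states at the two start markers.
import torch

def mark_entities(tokens, head_span, tail_span):
    # spans are half-open token index ranges [start, end)
    inserts = [(head_span[0], "[E1]"), (head_span[1], "[/E1]"),
               (tail_span[0], "[E2]"), (tail_span[1], "[/E2]")]
    out = list(tokens)
    for pos, marker in sorted(inserts, reverse=True):  # right-to-left keeps indices valid
        out.insert(pos, marker)
    return out

def relation_features(hidden_states, e1_start_idx, e2_start_idx):
    # concatenate the representations at the two entity start markers
    return torch.cat([hidden_states[e1_start_idx], hidden_states[e2_start_idx]], dim=-1)
```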
Note that the original KnowBert also takes entity
types as inputs, which is different from Zhang et al.
(2019); Baldini Soares et al. (2019). To ensure fair
comparisons, we re-evaluate KnowBert with the
same setting as other baselines; thus the reported results differ from the original paper.

Model             P     R     F-1
BERT              67.2  64.8  66.0
BERTLARGE         −     −     70.1
MTB               69.7  67.9  68.8
MTB (BERTLARGE)   −     −     71.5
ERNIEBERT         70.0  66.1  68.0
KnowBertBERT      73.5  64.1  68.5
RoBERTa           70.4  71.1  70.7
ERNIERoBERTa      73.5  68.0  70.7
KnowBertRoBERTa   71.9  69.9  70.9
Our RoBERTa       70.8  69.6  70.2
KEPLER-Wiki       71.5  72.5  72.0
KEPLER-WordNet    71.4  71.3  71.3
KEPLER-W+W        71.1  72.0  71.5
KEPLER-Rel        71.3  70.9  71.1
KEPLER-Cond       72.1  70.7  71.4
KEPLER-OnlyDesc   72.3  69.1  70.7
KEPLER-KE         63.5  60.5  62.0

Table 5: Precision, recall, and F-1 on TACRED (%). KnowBert results are different from the original paper since different task settings are used.
From the TACRED results in Table 5, we can
observe that: (1) KEPLER-Wiki is the best one
among KEPLER variants and significantly out-
performs all the baselines, while other versions
of KEPLER also achieve good results. It demon-
strates the effectiveness of KEPLER on integrating
factual knowledge into PLMs. Based on the
result, we use KEPLER-Wiki as the principal
model in the following experiments. (2) KEPLER-
WordNet shows a marginal improvement over
Our RoBERTa, while KEPLER-W+W underper-
forms KEPLER-Wiki. It suggests that pre-training
with WordNet only has limited benefits in the
KEPLER framework. We will explore how to
better combine different KGs in our future work.
FewRel (Han et al., 2018b) is a few-shot
relation classification dataset with 100 relations
E 70,000 instances, which is constructed with
Wikipedia text and Wikidata facts. Moreover,
Gao et al. (2019) propose FewRel 2.0, adding
a domain adaptation challenge with a new
medical-domain test set.
FewRel takes the N-way K-shot setting.
Relations in the training and test sets are disjoint.
For every evaluation episode, N relations, K
supporting samples for each relation, and several
query sentences are sampled from the test set. The
models are required to classify queries into one of
the N relations only given the sampled N × K
instances.
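A simplified sketch of sampling one such episode is given below; the data layout is an illustrative assumption.

```python
# Sketch of sampling one N-way K-shot FewRel episode: draw N relations, then
# K support instances per relation plus query instances to classify.
# `data` is assumed to map each relation to a list of sentences.
import random

def sample_episode(data, n_way=5, k_shot=1, n_query=5):
    relations = random.sample(list(data.keys()), n_way)
    support, query = [], []
    for label, rel in enumerate(relations):
        instances = random.sample(data[rel], k_shot + n_query)
        support += [(sent, label) for sent in instances[:k_shot]]
        query += [(sent, label) for sent in instances[k_shot:]]
    return support, query  # classify each query among the n_way relations
```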
We use two state-of-the-art few-shot frame-
works: Proto (Snell et al., 2017) and PAIR (Gao
et al., 2019). We replace the text encoders with our
baselines and KEPLER and compare the perfor-
mance. Because FewRel 1.0 is constructed with
Wikidata, we remove all the triplets in its test set
from Wikidata5M to avoid information leakage
for KEPLER. Tuttavia, we cannot control the
KGs used in our baselines. We mark the models
utilizing Wikidata and have information leakage
risk with † in Table 6.
As Table 6 shows, KEPLER-Wiki achieves
the best performance over the BASE-size PLMs
in most settings. From the results, we also
have some interesting observations: (1) RoBERTa
consistently outperforms BERT on various NLP
tasks (Liu et al., 2019c), yet the RoBERTa-based
models here are comparable or even worse than
BERT-based models in the PAIR framework.
Because PAIR uses sentence concatenation, Questo
result may be credited to the next sentence
prediction (NSP) objective of BERT. (2) KEPLER
brings improvements on FewRel 2.0, while
ERNIE and KnowBert even degenerate in most
of the settings. It indicates that the paradigms of
ERNIE and KnowBert cannot well generalize to
new domains which may require much different
entity linkers and entity embeddings. On the
other hand, KEPLER not only learns better entity
representations but also acquires a general ability
to extract factual knowledge from the context
across different domains. We further verify this in
Section 5.5. (3) KnowBert underperforms ERNIE
in FewRel while it typically achieves better results
on other tasks. This may be because it uses the
TuckER (Balazevic et al., 2019) KE model while
ERNIE and KEPLER follow TransE (Bordes et al.,
2013). We will explore the effects of different KE
methods in the future.
We also have another two observations with
regard to ERNIE and MTB: (1) ERNIE performs
the best on 1-shot settings of FewRel 1.0.
We believe this is because the knowledge
embedding injection of ERNIE has particular
advantages in this case, since it directly brings
knowledge about entities. When using 5-shot
(supporting text provides more information) E
FewRel 2.0 (ERNIE does not have knowledge
for biomedical entities), KEPLER outperforms
ERNIE. (2) Though MTB (BERTLARGE) is the
state-of-the-art model on FewRel, its BERTBASE
version does not outperform other knowledge-
enhanced PLMs, which suggests that using large
models contributes much to its gain. We also
notice that when combined with PAIR, MTB
suffers an obvious performance drop, which may
be because its pre-training objective degenerates
sentence-pair tasks.
Entity Typing
Entity typing requires models to classify given entity
mentions into pre-defined types. For this task, we
carry out evaluations on OpenEntity (Choi et al.,
2018) following the settings in Zhang et al. (2019).
OpenEntity has 6 entity types and 2,000 instances
for training, validation and test each.
To identify the entity mentions of interest,
we add two special tokens before and after the
entity spans, and use the representations of the
first special tokens for classification. As shown
in Table 7, KEPLER-Wiki achieves state-of-the-
art results. Note that the KnowBert results are
different from the original paper since we use
KnowBert-Wiki here rather than KnowBert-W+W
to ensure the same knowledge resource and fair
comparisons. KEPLER does not perform linking
or entity embedding pre-training like ERNIE and
KnowBert, which bring them special advantages
in entity span tasks. Tuttavia, KEPLER still
outperforms these baselines, which proves its
effectiveness.
GLUE
The General Language Understanding Evalu-
ation (GLUE) (Wang et al., 2019B) collects
several natural language understanding tasks and
is widely used for evaluating PLMs. Generally,
solving GLUE does not require factual knowl-
edge (Zhang et al., 2019) and we use it to examine
whether KEPLER harms the general language
understanding ability.
Tavolo 8 shows the GLUE results. We can
observe that KEPLER-Wiki
is close to Our
RoBERTa, suggesting that while incorporating
factual knowledge, KEPLER maintains a strong
language understanding ability. Tuttavia, there
are significant performance drops of KEPLER-
OnlyDesc, which indicates that the small-scale
entity description data are not sufficient
for
training KEPLER with MLM.
For the small datasets STS-B, MRPC and RTE,
directly fine-tuning models on them typically results in unstable performance. Hence we fine-tune models on a large-scale dataset (here we use MNLI) first and then further fine-tune them on the small datasets. This method has been shown to be effective (Wang et al., 2019a) and is also used in the original RoBERTa paper (Liu et al., 2019c).
4.3 KE Tasks
We show how KEPLER works as a KE model, E
evaluate it on Wikidata5M in both the transductive
link prediction setting and the inductive setting.
Experimental Settings
In link prediction, the entity and relation embeddings of KEPLER are obtained as described in Sections 2.2 and 2.5. The evaluation method is
described in Section 3.3. We also add RoBERTa
and Our RoBERTa as baselines. They adopt
Equations 1 E 4 to acquire entity and relation
embeddings, and use Equation 3 as their scoring
function.
In the transductive setting, we compare our
models with TransE (Bordes et al., 2013). We set
its dimension as 512, negative sampling size as 64,
batch size as 2048, and learning rate as 0.001 after
hyperparameter searching. The negative sampling
size is crucial for the performance on KE tasks,
but limited by the model complexity, KEPLER
can only take a negative size of 1. For a direct
comparison to intuitively show the benefits of pre-
training, we set a baseline TransE†, which also
uses 1 as the negative sampling size and keeps the
other hyperparameters unchanged.
Because conventional KE methods like TransE
inherently cannot provide embeddings for unseen
entities, we take DKRL (Xie et al., 2016) as our
baseline in the KE experiments, which utilizes
convolutional neural networks to encode entity
descriptions as embeddings. We set its dimension
COME 768, negative sampling size as 64, batch size as
1024, and learning rate as 0.0005.
Transductive Setting
Table 9a shows the results of the transductive setting. We observe that:
Model                     FewRel 1.0                      FewRel 2.0
                          5-1    5-5    10-1   10-5       5-1    5-5    10-1   10-5
MTB (BERTLARGE)†          93.86  97.06  89.20  94.27      −      −      −      −
Proto (BERT)              80.68  89.60  71.48  82.89      40.12  51.50  26.45  36.93
Proto (MTB)               81.39  91.05  71.55  83.47      52.13  76.67  48.28  69.75
Proto (ERNIEBERT)†        89.43  94.66  84.23  90.83      49.40  65.55  34.99  49.68
Proto (KnowBertBERT)†     86.64  93.22  79.52  88.35      64.40  79.87  51.66  69.71
Proto (RoBERTa)           85.78  95.78  77.65  92.26      64.65  82.76  50.80  71.84
Proto (Our RoBERTa)       84.42  95.30  76.43  91.74      61.98  83.11  48.56  72.19
Proto (ERNIERoBERTa)†     87.76  95.62  80.14  91.47      54.43  80.48  37.97  66.26
Proto (KnowBertRoBERTa)†  82.39  93.62  76.21  88.57      55.68  71.82  41.90  58.55
Proto (KEPLER-Wiki)       88.30  95.94  81.10  92.67      66.41  84.02  51.85  73.60
PAIR (BERT)               88.32  93.22  80.63  87.02      67.41  78.57  54.89  66.85
PAIR (MTB)                83.01  87.64  73.42  78.47      46.18  70.50  36.92  55.17
PAIR (ERNIEBERT)†         92.53  94.27  87.08  89.13      56.18  68.97  43.40  54.35
PAIR (KnowBertBERT)†      88.48  92.75  82.57  86.18      66.05  77.88  50.86  67.19
PAIR (RoBERTa)            89.32  93.70  82.49  88.43      66.78  81.84  53.99  70.85
PAIR (Our RoBERTa)        89.26  93.71  83.32  89.02      63.22  77.66  49.28  65.97
PAIR (ERNIERoBERTa)†      87.46  94.11  81.68  87.83      59.29  72.91  48.51  60.26
PAIR (KnowBertRoBERTa)†   85.05  91.34  76.04  85.25      50.68  66.04  37.10  51.13
PAIR (KEPLER-Wiki)        90.31  94.28  85.48  90.51      67.23  82.09  54.32  71.01

Table 6: Accuracies (%) on the FewRel dataset. N-K indicates the N-way K-shot setting. MTB uses the LARGE size and all the other models use the BASE size. † indicates oracle models which may have seen facts in the FewRel 1.0 test set during pre-training.
Model                     P     R     F-1
UFET (Choi et al., 2018)  77.4  60.6  68.0
BERT                      76.4  71.0  73.6
ERNIEBERT                 78.4  72.9  75.6
KnowBertBERT              77.9  71.2  74.4
RoBERTa                   77.4  73.6  75.4
ERNIERoBERTa              80.3  70.2  74.9
KnowBertRoBERTa           78.7  72.7  75.6
Our RoBERTa               75.1  73.4  74.3
KEPLER-Wiki               77.8  74.6  76.2

Table 7: Entity typing results on OpenEntity (%).
(1) KEPLER underperforms TransE. It is reasonable since KEPLER is limited by its large
model size, and thus cannot use a large negative
sampling size (1 for KEPLER, while typical KE
methods use 64 or more) and more training epochs
(30 vs. 1000 for TransE), which are crucial for
KE (Zhu et al., 2019). On the other hand, KEPLER
and its variants perform much better than TransE†
(with a negative sampling size of 1), showing
that using the same negative sampling size,
KEPLER can benefit from pre-trained language
representations and textual entity descriptions so that it outperforms TransE. In the future, we will explore reducing the model size of KEPLER to take advantage of both a large negative sampling size and pre-training.
(2) The vanilla RoBERTa performs poorly in KE while KEPLER achieves favorable performances, which demonstrates the effectiveness of our multi-task pre-training to infuse factual knowledge.
Model                         MR       MRR   HITS@1  HITS@3  HITS@10
TransE (Bordes et al., 2013)  109370   25.3  17.0    31.1    39.2
TransE†                       406957   6.0   1.8     8.0     13.6
DKRL (Xie et al., 2016)       31566    16.0  12.0    18.1    22.9
RoBERTa                       1381597  0.1   0.0     0.1     0.3
Our RoBERTa                   1756130  0.1   0.0     0.1     0.2
KEPLER-KE                     76735    8.2   4.9     8.9     15.1
KEPLER-Rel                    15820    6.6   3.7     7.0     11.7
KEPLER-Wiki                   14454    15.4  10.5    17.4    24.4
KEPLER-Cond                   20267    21.0  17.3    22.4    27.7

(a) Transductive results on Wikidata5M (% except MR). TransE† denotes a TransE model trained with the same negative sampling size (1) as KEPLER.

Model                    MR    MRR   HITS@1  HITS@3  HITS@10
DKRL (Xie et al., 2016)  78    23.1  5.9     32.0    54.6
RoBERTa                  723   7.4   0.7     1.0     19.6
Our RoBERTa              1070  5.8   1.9     6.3     13.0
KEPLER-KE                138   17.8  5.7     22.9    40.7
KEPLER-Rel               35    33.4  15.9    43.5    66.1
KEPLER-Wiki              32    35.1  15.4    46.9    71.9
KEPLER-Cond              28    40.2  22.2    51.4    73.0

(b) Inductive results on Wikidata5M (% except MR).

Table 9: Link prediction results on Wikidata5M transductive and inductive settings.
(3) Among the KEPLER variants, KEPLER-
Cond has superior results, which substantiates
the intuition in Section 2.2. KEPLER-Rel per-
forms worst, which we believe is due to the short
and homogeneous relation descriptions of Wiki-
data. KEPLER-KE significantly underperforms
KEPLER-Wiki, which suggests that the MLM
objective is necessary as well for the KE tasks to
build effective language representation.
(4) We also notice that DKRL performs well on
the transductive setting and the result is close to
KEPLER. We believe this is because DKRL takes
a much smaller encoder (CNN) and thus is easier
to train. In the more difficult inductive setting, the gap between DKRL and KEPLER is larger, which
better shows the language understanding ability
of KEPLER to utilize textual entity descriptions.
Inductive Setting

Table 9b shows the Wikidata5M inductive results. KEPLER outperforms DKRL and RoBERTa by a large margin, demonstrating the effectiveness of our joint training method. But KEPLER results are still far from the ideal performance required by practical applications (constructing KG from scratch, etc.), which urges further efforts on inductive KE. Comparisons among KEPLER variants are consistent with those in the transductive setting.

Furthermore, we clarify why results in the inductive setting are much higher than in the transductive setting, while the inductive setting is more difficult: as shown in Tables 1 and 3, far fewer entities are involved in the inductive evaluation than in the transductive setting (7,475 vs. 4,594,485). Considering that the KE evaluation metrics are based on entity ranking, it is reasonable to see higher values in the inductive setting. The performance in different settings should not be directly compared.

5 Analysis
In this section, we analyze the effectiveness and
efficiency of KEPLER with experiments. Tutto
the hyperparameters are the same as reported
in Section 4.1, including models in the ablation
study.
Model        P     R     F-1
Our RoBERTa  70.8  69.6  70.2
KEPLER-KE    63.5  60.5  62.0
KEPLER-Wiki  71.5  72.5  72.0

Table 10: Ablation study results on TACRED (%).
5.1 Ablation Study
As shown in Equation 6, KEPLER takes a multi-
task loss. To demonstrate the effectiveness of the
joint objective, we compare full KEPLER with
models trained with only the MLM loss (Our
RoBERTa) and only the KE loss (KEPLER-
KE) on TACRED. As demonstrated in Table 10,
compared to KEPLER-Wiki, both ablation models
suffer significant drops. It suggests that the performance gain of KEPLER is credited to the joint training towards both objectives.
5.2 Knowledge Probing Experiment
Sezione 4.2 shows that KEPLER can achieve
significant improvements on NLP tasks requiring
factual knowledge. To further verify whether
KEPLER can better integrate factual knowledge
into PLMs and help to recall them, we conduct
experiments on LAMA (Petroni et al., 2019), a widely used knowledge probe. LAMA examines PLMs' abilities on recalling relational facts by cloze-style questions. For instance, given a natural language template ''Paris is the capital of <mask>'', PLMs are required to predict the masked token without fine-tuning. LAMA
reports the micro-averaged precision at one (P@1)
scores. Tuttavia, Poerner et al. (2020) present that
LAMA contains some easy questions which can be
answered with superficial clues like entity names.
Hence we also evaluate the models on LAMA-
UHN (Poerner et al., 2020), which filters out the
questionable templates from the Google-RE and
T-REx corpora of LAMA.
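For illustration, the following sketch runs LAMA-style cloze probing with a masked language model and computes P@1; it uses the HuggingFace fill-mask pipeline rather than the original LAMA toolkit, and the probe format is simplified.

```python
# Sketch of cloze-style knowledge probing: fill the <mask> slot of each
# template with the PLM's top prediction (no fine-tuning) and compute P@1.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

def precision_at_1(probes):
    """probes: list of (template_with_mask, gold_token) pairs."""
    hits = 0
    for template, gold in probes:
        top = fill_mask(template, top_k=1)[0]["token_str"].strip()
        hits += int(top == gold)
    return hits / len(probes)

# Example: precision_at_1([("Paris is the capital of <mask>.", "France")])
```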
The evaluation results are shown in Table 11, from which we have the following observations:

Model        LAMA                                   LAMA-UHN
             Google-RE  T-REx  ConceptNet  SQuAD    Google-RE  T-REx
BERT         9.8        31.1   15.6        14.1     4.7        21.8
RoBERTa      5.3        24.7   19.5        9.1      2.2        17.0
Our RoBERTa  7.0        23.2   19.0        8.0      2.8        15.7
KEPLER-Wiki  7.3        24.6   18.7        14.3     3.3        16.5
KEPLER-W+W   7.3        24.4   17.6        10.8     4.1        17.1

Table 11: P@1 results on knowledge probing benchmark LAMA and LAMA-UHN.
(1) KEPLER consistently outperforms the vanilla
PLM baseline Our RoBERTa in almost all
the settings except ConceptNet, which focuses
on commonsense knowledge rather than factual
knowledge. It indicates that KEPLER can indeed
better integrate factual knowledge. (2) Although
KEPLER-W+W cannot outperform KEPLER-
it shows
Wiki on NLP tasks (Sezione 4.2),
significant improvements in LAMA-UHN, Quale
suggests that we should explore which kind of
knowledge is needed on different scenarios in
the future. (3) All the RoBERTa-based models
perform worse than vanilla BERTBASE by a
large margin, which is consistent with the results
of Wang et al. (2020). This may be due to different
vocabularies used in BERT and RoBERTa, which
presents the vulnerability of LAMA-style probing
again (Kassner and Schütze, 2020). We will leave
developing a better knowledge probing framework
as our future work.
5.3 Running Time Comparison
Compared to vanilla PLMs, KEPLER does
non
introduce any additional parameters or
computations during fine-tuning and inference,
which is efficient for practice use. We compare the
running time of KEPLER and other knowledge-
enhanced PLMs (ERNIE and KnowBert)
In
Tavolo 12. The time is evaluated on TACRED
training set for one epoch with one NVIDIA
Tesla V100 (32 GB), and all models use a batch size of 32 and a sequence length of 128. The ''entity
linking’’ time of KnowBert is for entity candidate
generation. We can observe that KEPLER requires
much less running time since it does not need
entity linking or entity embedding fusion, which
will benefit time-sensitive applications.

Model             Entity Linking  Fine-tuning  Inference
ERNIERoBERTa      780s            730s         194s
KnowBertRoBERTa   190s            677s         235s
KEPLER            0s              508s         152s

Table 12: Three parts of running time for one epoch of the TACRED training set.
5.4 Correlation with Entity Frequency
To better understand how KEPLER helps the
entity-centric tasks, we provide analyses on
the correlations between KEPLER performance
and entity frequency in this
section. IL
motivation is to verify a natural hypothesis that
KEPLER improvements mainly come from better
representing the entity mentions in text, especially
the rare entities, which do not show up frequently
in the pre-training corpora and thus cannot be well
learned by the language modeling objectives.
We perform entity linking for the TACRED
dataset with BLINK (Wu et al., 2020) to link
the entity mentions in text to their correspond-
ing Wikipedia identifiers. Then we count
IL
occurrences of the entities in Wikipedia with the
hyperlinks in rich text, denoting the entity frequen-
cies. We conduct two experiments to analyze the
correlations between KEPLER performance and
entity frequency: (1) In Table 13, we divide the
entity mentions into five parts by their frequencies,
and compare the TACRED performances while
only keeping entities in one part and masking the
other. (2) In Figure 3, we sequentially mask the entity mentions in the ascending order of entity frequencies and see the F-1 changes.

Entity Frequency  0%-20%  20%-40%  40%-60%  60%-80%  80%-100%
KEPLER-Wiki       64.7    64.4     64.8     64.7     68.8
Our RoBERTa       64.1    64.3     64.5     64.3     68.5
Improvement       +0.6    +0.1     +0.3     +0.4     +0.3

Table 13: F-1 scores on TACRED (%) under different settings by entity frequencies. We sort the entity mentions in TACRED by their corresponding entity frequencies in Wikipedia. The ''0%-20%'' setting indicates only keeping the least frequent 20% entity mentions and masking all the other entity mentions (for both training and validation), and so on. The results are averaged over 5 runs.

Figure 3: TACRED performance (F-1) of KEPLER and RoBERTa changes with the rate of entity mentions being masked.
From the results, we can observe that:
(1) Figura 3 shows that when the entity masking
rate is low, the improvements of KEPLER over
RoBERTa are generally much higher than when
the entity masking rate is high. It indicates that
the improvements of KEPLER do mainly come
from better modeling entities in context. Tuttavia,
even when all the entity mentions are masked,
KEPLER still outperforms RoBERTa. We claim
this is because the KE objective can also help
to learn to understand fact-related text since it
requires the model to recall facts from textual
descriptions. This claim is further substantiated in
Section 5.5.
(2) From Table 13, we can observe that
the improvement in the ‘‘0%-20%’’ setting is
marginally higher than the other settings, Quale
demonstrates that KEPLER does have special
advantages on modeling rare entities compared
to vanilla PLMs. But the improvements in the
frequent settings are also significant and we cannot
say that the overall improvements of KEPLER are
mostly from the rare entities. Generally, the results
in Table 13 show that KEPLER can better model
all the entities, no matter rare or frequent.
5.5 Understanding Text or Storing Knowledge

We argue that by jointly training the KE and
the MLM objectives, KEPLER (1) can better
understand fact-related text and better extract
knowledge from text, and also (2) can remember
factual knowledge. To investigate the two abilities
of KEPLER in a quantitative aspect, we carry out
an experiment on TACRED, in which the head and
tail entity mentions are masked (masked-entity,
ME) or only head and tail entity mentions are
shown (only-entity, OE). The ME setting shows
to what extent the models can extract facts only
from the textual context without the clues in entity
names. The OE setting demonstrates to what
extent the models can store and predict factual
knowledge, as only the entity names are given to
the models.
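One plausible way to construct the two settings from a tokenized TACRED example is sketched below; the exact masking tokens and separators are assumptions for illustration.

```python
# Sketch of building the masked-entity (ME) and only-entity (OE) inputs:
# ME hides the two mentions, OE keeps nothing but the two mentions.
def make_me_oe(tokens, head_span, tail_span):
    (hs, he), (ts, te) = head_span, tail_span  # half-open token index ranges
    me = ["<mask>" if hs <= i < he or ts <= i < te else tok
          for i, tok in enumerate(tokens)]
    oe = tokens[hs:he] + ["</s>"] + tokens[ts:te]
    return me, oe
```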
Model        ME    OE
Our RoBERTa  54.0  46.8
KEPLER-KE    40.2  47.0
KEPLER-Wiki  54.8  48.9

Table 14: Masked-entity (ME) and only-entity (OE) F-1 scores on TACRED (%).

As shown in Table 14, KEPLER-Wiki shows significant improvements over Our RoBERTa in both settings, which suggests that KEPLER
has indeed possessed superior abilities on both
extracting and storing knowledge compared to
vanilla PLMs without knowledge infusion. E
the KEPLER-KE model performs poorly on the
ME setting but achieves marginal improvements
on the OE setting. It indicates that without the help
of the MLM objective, KEPLER only learns the
entity description embeddings and degenerates in
general language understanding, while it can still
remember knowledge into entity names to some
extent.
6 Related Work
Pre-training in NLP  There has been a long history of pre-training in NLP. Early works focus on distributed word representations (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014), many of which are often adopted in current models as word embeddings. These pre-trained embeddings can capture the semantics of words from large-scale corpora and thus benefit NLP applications. Peters et al. (2018) push this trend a step forward by using a bidirectional LSTM to form contextualized word embeddings (ELMo) for richer semantic meanings under different circumstances.

Apart from word embeddings, there is another trend exploring pre-trained language models. Dai and Le (2015) propose to train an auto-encoder on unlabeled textual data and then fine-tune it on downstream tasks. Howard and Ruder (2018) propose a universal language model (ULMFiT). With the powerful Transformer architecture (Vaswani et al., 2017), Radford et al. (2018) demonstrate an effective pre-trained generative model (GPT). Later, Devlin et al. (2019) release a pre-trained deep Bidirectional Encoder Representation from Transformers (BERT), achieving state-of-the-art performance on a wide range of NLP benchmarks. After BERT, similar PLMs have sprung up recently. Yang et al. (2019) propose a permutation language model (XLNet). Later, Liu et al. (2019c) show that more data and more parameter tuning can benefit PLMs, and release a new state-of-the-art model (RoBERTa). Other works explore how to add more tasks (Liu et al., 2019b) and more parameters (Raffel et al., 2020; Lan et al., 2020) to PLMs.
Knowledge-Enhanced PLMs Recently, many
works have investigated how to incorporate
knowledge into PLMs. MTB (Baldini Soares
et al., 2019) takes a straightforward ‘‘matching
the blank’’ pre-training objective to help the
relation classification task. ERNIE (Zhang et al.,
2019) identifies entity mentions in text and links
pre-processed knowledge embeddings to the cor-
responding positions, which shows improvements
on several NLP benchmarks. With a similar idea
as ERNIE, KnowBert (Peters et al., 2019) incor-
porates an integrated entity linker in their model
and adopts end-to-end training. Besides, Logan
et al. (2019) and Hayashi et al. (2020) utilize
relations between entities inside one sentence to
train better generation models. Xiong et al. (2019)
adopt entity replacement knowledge learning for
improving entity-related tasks.
Some contemporaneous or following works try
to inject factual knowledge into PLMs in different
ways. E-BERT (Poerner et al., 2020) aligns entity
embeddings with word embeddings and then
directly adds the aligned embeddings into BERT
to avoid additional pre-training. K-Adapter (Wang
et al., 2020) injects knowledge with additional
neural adapters to support continuous learning.
Knowledge Embedding KE methods have been
extensively studied. Conventional KE models
define different scoring functions for relational
triplets. Per esempio, TransE (Bordes et al., 2013)
treats tail entities as translations of head entities
and uses L1-norm or L2-norm to score triplets,
while DistMult (Yang et al., 2015) uses matrix
multiplications and ComplEx (Trouillon et al.,
2016) adopts complex operations based on it.
RotatE (Sun et al., 2019) combines the advantages
of both of them.
Inductive Embedding Above KE methods
learn entity embeddings only from KG and are
inherently transductive, while some works (Wang
et al., 2014; Xie et al., 2016; Yamada et al., 2016;
Cao et al., 2017; Shi and Weninger, 2018; Cao
et al., 2018) incorporate textual metadata such as
entity names or descriptions to enhance the KE
methods and hence can do inductive KE to some
extent. Besides KG, it is also common for general
inductive graph embedding methods (Hamilton
et al., 2017; Bojchevski and Günnemann, 2018) to
utilize additional node features like text attributes,
degrees, and so on. KEPLER follows this line
of studies and takes full advantage of textual
information with an effective PLM.
Hamaguchi et al. (2017) and Wang et al. (2019c)
perform inductive KE by aggregating the trained
embeddings of the known neighboring nodes with
graph neural networks, and thus do not need
additional features. But these methods require the
unseen nodes to be surrounded by known nodes
and cannot embed new (sub)graphs. We leave
how to develop KEPLER to do fully inductive KE
without additional features as future work.
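The following toy sketch illustrates the neighbor-aggregation idea; a plain mean aggregator with TransE-style messages stands in for the learned graph neural networks of these works, and all names and embedding tables are illustrative assumptions.

```python
import numpy as np

def embed_unseen_entity(edges, entity_emb, relation_emb):
    """Estimate an embedding for an unseen node from its trained neighbors.

    edges: iterable of (neighbor_id, relation_id, direction) around the unseen
    node, where direction is +1 if the unseen node is the tail and -1 otherwise.
    """
    # TransE-style messages: each known neighbor, shifted by the connecting relation.
    messages = [entity_emb[n] + d * relation_emb[r] for n, r, d in edges]
    # A mean stands in for a learned aggregator; the estimate only exists when
    # the unseen node has at least one known neighbor, which is exactly the
    # limitation discussed above.
    return np.mean(messages, axis=0)

entity_emb = np.random.randn(100, 16)    # trained entity embeddings
relation_emb = np.random.randn(10, 16)   # trained relation embeddings
new_node_edges = [(3, 0, +1), (42, 7, -1)]
print(embed_unseen_entity(new_node_edges, entity_emb, relation_emb).shape)
```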
7 Conclusion and Future Work
In this paper, we propose KEPLER, a simple but effective unified model for knowledge embedding and pre-trained language representation. We train KEPLER with both the KE and MLM objectives to align factual knowledge and language representation in the same semantic space, and experimental results on extensive tasks demonstrate its effectiveness on both NLP and KE applications. Besides, we propose Wikidata5M, a large-scale KG dataset, to facilitate future research.
In the future, we will (1) explore more advanced ways of unifying the two semantic spaces, including different KE forms and different training objectives, and (2) investigate better knowledge probing methods for PLMs to shed light on knowledge-integration mechanisms.
Acknowledgments
This work is supported by the National Key Research and Development Program of China (No. 2018YFB1004503), the National Natural Science Foundation of China (NSFC No. U1736204, 61533018, 61772302, 61732008), grants from the Institute for Guo Qiang, Tsinghua University (2019GQB0003), and the Beijing Academy of Artificial Intelligence (BAAI2019ZD0502).
Prof. Jian Tang is supported by the Natural
Sciences and Engineering Research Council
(NSERC) Discovery Grant and the Canada CIFAR
AI Chair Program. Xiaozhi Wang and Tianyu Gao
are supported by Tsinghua University Initiative
Scientific Research Program. We also thank
our action editor, Prof. Doug Downey, and the
anonymous reviewers for their consistent help
and insightful suggestions.
References
Ivana Balazevic, Carl Allen, and Timothy Hospedales. 2019. TuckER: Tensor factorization for knowledge graph completion. In Proceedings of EMNLP-IJCNLP, pages 5185–5194. DOI: https://doi.org/10.18653/v1/D19-1522
Livio Baldini Soares, Nicholas FitzGerald,
Jeffrey Ling, and Tom Kwiatkowski. 2019.
Matching the blanks: Distributional similarity
for relation learning. In Proceedings of ACL,
pages 2895–2905. DOI: https://doi.org
/10.18653/v1/P19-1279
Aleksandar Bojchevski and Stephan Günnemann.
2018. Deep Gaussian embedding of graphs:
Unsupervised inductive learning via ranking.
In Proceedings of ICLR.
Antoine Bordes, Nicolas Usunier, Alberto
Garcia-Duran,
Jason Weston, and Oksana
Yakhnenko. 2013. Translating embeddings for
modeling multi-relational data. In Advances in
Neural Information Processing Systems (NIPS),
pages 2787–2795.
Yixin Cao, Lei Hou, Juanzi Li, Zhiyuan Liu,
Chengjiang Li, Xu Chen, and Tiansi Dong.
2018. Joint representation learning of cross-lingual words and entities via attentive distant supervision. In Proceedings of EMNLP, pages 227–237. DOI: https://doi.org/10.18653/v1/D18-1021
Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge text and knowledge by learning multi-prototype entity mention embedding. In Proceedings of ACL, pages 1623–1633. DOI: https://doi.org/10.18653/v1/P17-1149
Eunsol Choi, Omer Levy, Yejin Choi, and Luke
Zettlemoyer. 2018. Ultra-fine entity typing. In
Proceedings of ACL, pages 87–96.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML, pages 160–167.
Andrew M. Dai and Quoc V. Le. 2015. Semi-
supervised sequence learning. In Advances in
Neural Information Processing Systems (NIPS),
pages 3079–3087.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, E
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of NAACL-HLT,
pages 4171–4186.
Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu,
Peng Li, Maosong Sun, and Jie Zhou. 2019.
FewRel 2.0: Towards more challenging few-
shot relation classification. In Proceedings of
EMNLP-IJCNLP, pages 6251–6256.
Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. 2017. Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach. In Proceedings of IJCAI, pages 1802–1808. DOI: https://doi.org/10.24963/ijcai.2017/250
William L. Hamilton, Rex Ying, and Jure
Leskovec. 2017. Inductive representation learn-
ing on large graphs. In Advances in Neu-
ral Information Processing Systems (NIPS),
pages 1025–1035.
Xu Han, Zhiyuan Liu, and Maosong Sun. 2018UN.
Neural knowledge acquisition via mutual
attention between knowledge graph and text.
In Proceedings of AAAI, pages 4832–4839.
Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang,
Yuan Yao, Zhiyuan Liu, and Maosong Sun.
2018B. FewRel: A large-scale supervised few-
shot relation classification dataset with state-of-
the-art evaluation. In Proceedings of EMNLP,
pages 4803–4809. DOI: https://doi.org
/10.18653/v1/D18-1514
Hiroaki Hayashi, Zecong Hu, Chenyan Xiong,
and Graham Neubig. 2020. Latent relation
language models. In Proceedings of AAAI,
pages 7911–7918. DOI: https://doi.org
/10.1609/aaai.v34i05.6298
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of ACL, pages 328–339. DOI: https://doi.org/10.18653/v1/P18-1031, PMID: 28889062
Nora Kassner and Hinrich Schütze. 2020. Negated
and misprimed probes for pretrained language
models: Birds can talk, but cannot fly. In
Proceedings of ACL, pages 7811–7818. DOI:
https://doi.org/10.18653/v1/2020
.acl-main.698
Seyed Mehran Kazemi and David Poole. 2018.
SimplE embedding for
link prediction in
knowledge graphs. In Advances in Neural
Information Processing Systems (NeurIPS),
pages 4284–4295.
Zhenzhong Lan, Mingda Chen, Sebastian
Goodman, Kevin Gimpel, Piyush Sharma,
and Radu Soricut. 2020. ALBERT: A lite
BERT for self-supervised learning of language
representations. In Proceedings of ICLR.
Yankai Lin, Zhiyuan Liu, Maosong Sun,
Yang Liu, and Xuan Zhu. 2015. Apprendimento
entity and relation embeddings for knowledge
graph completion. In Proceedings of AAAI,
pages 2181–2187.
Nelson F. Liu, Matt Gardner, Yonatan Belinkov,
Matthew E. Peters, and Noah A. Smith. 2019UN.
Linguistic knowledge and transferability of
contextual representations. Negli Atti di
NAACL-HLT, pages 1073–1094.
Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020. K-BERT: Enabling language representation with knowledge graph. In Proceedings of AAAI, pages 2901–2908. DOI: https://doi.org/10.1609/aaai.v34i03.5681
Xiaodong Liu, Pengcheng He, Weizhu Chen, E
Jianfeng Gao. 2019B. Multi-task deep neural
networks for natural language understanding.
In Proceedings of ACL, pages 4487–4496.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019C. RoBERTa: A robustly
optimized BERT pretraining approach. CoRR,
cs.CL/1907.11692v1.
Robert Logan, Nelson F. Liu, Matthew E.
Peters, Matt Gardner, and Sameer Singh.
2019. Barack’s wife Hillary: Using knowledge
graphs for fact-aware language modeling. In
Proceedings of ACL, pages 5962–5971. DOI:
https://doi.org/10.18653/v1/P19
-1598
Lajanugen Logeswaran, Ming-Wei Chang,
Kenton Lee, Kristina Toutanova, Jacob Devlin,
and Honglak Lee. 2019. Zero-shot entity
linking by reading entity descriptions.
In
Proceedings of ACL, pages 3449–3460. DOI:
https://doi.org/10.18653/v1/P19
-1335
Michael McCloskey and Neal J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier. DOI: https://doi.org/10.1016/S0079-7421(08)60536-8
Tomás Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), pages 3111–3119.
George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41. DOI: https://doi.org/10.1145/219717.219748
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT (Demonstrations), pages 48–53. DOI: https://doi.org/10.18653/v1/N19-4009
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543. DOI: https://doi.org/10.3115/v1/D14-1162
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237. DOI: https://doi.org/10.18653/v1/N18-1202
Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of EMNLP-IJCNLP, pages 43–54. DOI: https://doi.org/10.18653/v1/D19-1005, PMID: 31383442
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP-IJCNLP, pages 2463–2473. DOI: https://doi.org/10.18653/v1/D19-1250
Nina Poerner, Ulli Waltinger, and Hinrich Schütze. 2020. E-BERT: Efficient-yet-effective entity embeddings for BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 803–818. DOI: https://doi.org/10.18653/v1/2020.findings-emnlp.71
Alec Radford, Karthik Narasimhan, Tim
Salimans, and Ilya Sutskever. 2018. Improving
language understanding by generative pre-training. Technical report, OpenAI.
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21(140):1–67.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL, pages 1715–1725. DOI: https://doi.org/10.18653/v1/P16-1162
Baoxu Shi and Tim Weninger. 2018. Open-world
knowledge graph completion. In Proceedings
of AAAI, pages 1957–1964.
Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), pages 4077–4087.
Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, E
Jian Tang. 2019. RotatE: Knowledge graph
embedding by relational rotation in complex
space. In Proceedings of ICLR.
Th´eo Trouillon,
Johannes Welbl, Sebastian
Riedel, ´Eric Gaussier, and Guillaume Bouchard.
2016. Complex embeddings for simple link prediction. In Proceedings of ICML, pages 2071–2080.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
Neural Information Processing Systems (NIPS),
pages 5998–6008.
Alex Wang, Patrick Xia, Jan Hula, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, Berlin Chen, Benjamin Van Durme, Edouard Grave, Ellie Pavlick, and Samuel R. Bowman. 2019a. Can you tell me how to get past sesame street? Sentence-level pretraining beyond language modeling. In Proceedings of ACL, pages 4465–4476. DOI: https://doi.org/10.18653/v1/P19-1439
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR. DOI: https://doi.org/10.18653/v1/W18-5446
PeiFeng Wang, Jialong Han, Chenliang Li, and Rong Pan. 2019c. Logic attention based neighborhood aggregation for inductive knowledge graph embedding. In Proceedings of AAAI, pages 7152–7159. DOI: https://doi.org/10.1609/aaai.v33i01.33017152
Ruize Wang, Duyu Tang, Nan Duan, Zhongyu
Wei, Xuanjing Huang, Jianshu Ji, Cuihong Cao,
Daxin Jiang, and Ming Zhou. 2020. K-Adapter:
Infusing knowledge into pre-trained models
with adapters. CoRR, cs.CL/2002.01808v3.
Zhen Wang, Jianwen Zhang, Jianlin Feng, E
Zheng Chen. 2014. Knowledge graph and text
jointly embedding. In Proceedings of EMNLP,
pages 1591–1601. DOI: https://doi.org
/10.3115/v1/D14-1167
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, pages 1112–1122. DOI: https://doi.org/10.18653/v1/N18-1101
Ledell Wu, Fabio Petroni, Martin Josifoski,
Sebastian Riedel, and Luke Zettlemoyer. 2020.
Scalable zero-shot entity linking with dense
entity retrieval. In Proceedings of EMNLP,
pages 6397–6407.
Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. 2016. Representation learning of knowledge graphs with entity descriptions. In Proceedings of AAAI, pages 2659–2665.
Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. 2019. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In Proceedings of ICLR.
Ikuya Yamada, Hiroyuki Shindo, Hideaki
Takeda, and Yoshiyasu Takefuji. 2016. Joint
learning of
the embedding of words and
entities for named entity disambiguation. In
Proceedings of CoNLL, pages 250–259. DOI:
https://doi.org/10.18653/v1/K16
-1025
Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In Proceedings of ACL, pages 1436–1446. DOI: https://doi.org/10.18653/v1/P17-1132
Bishan Yang, Scott Wen-tau Yih, Xiaodong He,
Jianfeng Gao, and Li Deng. 2015. Embedding
entities and relations for learning and inference
in knowledge bases. In Proceedings of ICLR.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems (NeurIPS), pages 5754–5764.
Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of EMNLP, pages 35–45. DOI: https://doi.org/10.18653/v1/D17-1004
Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin
Jiang, Maosong Sun, and Qun Liu. 2019.
ERNIE: Enhanced language representation
with informative entities. In Proceedings of
ACL, pages 1441–1451. DOI: https://doi
.org/10.18653/v1/P19-1139
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan
Salakhutdinov, Raquel Urtasun, Antonio
Torralba, and Sanja Fidler. 2015. Aligning
books and movies: Towards story-like visual
explanations by watching movies and reading
books. In Proceedings of ICCV, pages 19–27.
DOI: https://doi.org/10.1109/ICCV.2015.11
Poorya Zaremoodi, Wray Buntine, and Gholamreza Haffari. 2018. Adaptive knowledge sharing in multi-task learning: Improving low-resource neural machine translation. In Proceedings of ACL, pages 656–661. DOI: https://doi.org/10.18653/v1/P18-2104
Zhaocheng Zhu, Shizhen Xu, Jian Tang, and Meng
Qu. 2019. GraphVite: A high-performance
CPU-GPU hybrid system for node embedding.
In Proceedings of WWW, pages 2494–2504.