
KEPLER: A Unified Model for Knowledge Embedding and
Pre-trained Language Representation

Xiaozhi Wang1, Tianyu Gao3, Zhaocheng Zhu4,5, Zhengyan Zhang1
Zhiyuan Liu1,2∗, Juanzi Li1,2, and Jian Tang4,6,7∗

1Department of CST, BNRist; 2KIRC, Institute for AI, Tsinghua University, Beijing, China
{wangxz20,zy-z19}@mails.tsinghua.edu.cn
{liuzy,lijuanzi}@tsinghua.edu.cn
3Department of Computer Science, Princeton University, Princeton, NJ, USA
tianyug@princeton.edu
4Mila-Québec AI Institute; 5Université de Montréal; 6HEC Montréal, Canada
zhaocheng.zhu@umontreal.ca, jian.tang@hec.ca
7CIFAR AI Research Chair

Abstract


Pre-trained language representation models (PLMs) cannot well capture factual knowledge from text. In contrast, knowledge embedding (KE) methods can effectively represent the relational facts in knowledge graphs (KGs) with informative entity embeddings, but conventional KE models cannot take full advantage of the abundant textual information. In this paper, we propose a unified model for Knowledge Embedding and Pre-trained LanguagE Representation (KEPLER), which can not only better integrate factual knowledge into PLMs but also produce effective text-enhanced KE with the strong PLMs. In KEPLER, we encode textual entity descriptions with a PLM as their embeddings, and then jointly optimize the KE and language modeling objectives. Experimental results show that KEPLER achieves state-of-the-art performances on various NLP tasks, and also works remarkably well as an inductive KE model on KG link prediction. Moreover, for pre-training and evaluating KEPLER, we construct Wikidata5M1, a large-scale KG dataset with aligned entity descriptions, and benchmark state-of-the-art KE methods on it. It shall serve as a new KE benchmark and facilitate the research on large KG, inductive KE, and KG with text. The source code can be obtained from https://github.com/THU-KEG/KEPLER.

∗Correspondence to: Z. Liu and J. Tang.
1https://deepgraphlearning.github.io/project/wikidata5m.

1 Introduction

Recent pre-trained language representation models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019c) learn effective language representation from large-scale unstructured corpora with language modeling objectives and have achieved superior performances on various natural language processing (NLP) tasks. Existing PLMs learn useful linguistic knowledge from unlabeled text (Liu et al., 2019a), but they generally cannot capture the world facts well, which are typically sparse and have complex forms in text (Petroni et al., 2019; Logan et al., 2019).

In contrast, knowledge graphs (KGs) contain extensive structural facts, and knowledge embedding (KE) methods (Bordes et al., 2013; Yang et al., 2015; Sun et al., 2019) can effectively embed them into continuous entity and relation embeddings. These embeddings can not only help with the KG completion but also benefit various NLP applications (Yang and Mitchell, 2017; Zaremoodi et al., 2018; Han et al., 2018a). As shown in Figure 1, textual entity descriptions contain abundant information. Intuitively, KE methods can provide factual knowledge for PLMs, while the informative text data can also benefit KE.

Figure 1: An example of a KG with entity descriptions. The figure suggests that descriptions contain abundant information about entities and can help to represent the relational facts between them.

Inspired by Xie et al. (2016), we use entity descriptions to bridge the gap between KE and PLM, and align the semantic space of text to the symbol space of KGs (Logeswaran et al., 2019). We propose KEPLER, a unified model for Knowledge Embedding and Pre-trained LanguagE Representation. We encode the texts and entities into a unified semantic space with the same PLM as the encoder, and jointly optimize the KE and the masked language modeling (MLM) objectives. For the KE objective, we encode the entity descriptions as entity embeddings and then train them in the same way as conventional KE methods. For the MLM objective, we follow the approach of existing PLMs (Devlin et al., 2019; Liu et al., 2019c). KEPLER has the following strengths:



As a PLM, (1) KEPLER is able to integrate factual knowledge into language representation with the supervision from KG by the KE objective. (2) KEPLER inherits the strong ability of language understanding from PLMs by the MLM objective. (3) The KE objective enhances the ability of KEPLER to extract knowledge from text since it requires the model to encode the entities from their corresponding descriptions. (4) KEPLER can be directly adopted in a wide range of NLP tasks without additional inference overhead compared to conventional PLMs since we just add new training objectives without modifying model structures.

There are also some recent works (Zhang et al., 2019; Peters et al., 2019; Liu et al., 2020) directly integrating fixed entity embeddings into PLMs to provide external knowledge. However, (1) their entity embeddings are learned by a separate KE model, and thus cannot be easily aligned with the language representation space. (2) They require an entity linker to link the text to the corresponding entities, making them suffer from the error propagation problem. (3) Compared to vanilla PLMs, their sophisticated mechanisms to link and use entity embeddings lead to additional inference overhead.

As a KE model, (1) KEPLER can take full advantage of the abundant information from entity descriptions with the help of the MLM objective. (2) KEPLER is capable of performing KE in the inductive setting, that is, it can produce embeddings for unseen entities from their descriptions, while conventional KE methods are inherently transductive and can only learn representations for the entities seen during training. Inductive KE is essential for many real-world applications, such as updating KGs with emerging entities and KG construction, and thus is worth more investigation.

For pre-training and evaluating KEPLER, we need a KG with (1) large amounts of knowledge facts, (2) aligned entity descriptions, and (3) a reasonable inductive-setting data split, which cannot be satisfied by existing KE benchmarks. Therefore, we construct Wikidata5M, containing about 5M entities, 20M triplets, and aligned entity descriptions from Wikipedia. To the best of our knowledge, it is the largest general-domain KG dataset. We also benchmark several classical KE methods and give data splits for both the transductive and the inductive settings to facilitate future research.

To summarize, our contribution is three-fold: (1) We propose KEPLER, a knowledge-enhanced PLM trained by jointly optimizing the KE and MLM objectives, which brings great improvements on a wide range of NLP tasks. (2) By encoding text descriptions as entity embeddings, KEPLER shows its effectiveness as a KE model, especially in the inductive setting. (3) We also introduce Wikidata5M, a new large-scale KG dataset, which shall promote the research on large-scale KG, inductive KE, and the interactions between KG and NLP.

2 KEPLER

As shown in Figure 2, KEPLER implicitly incorporates factual knowledge into language representations by jointly training with two objectives. In this section, we introduce in detail the encoder structure, the KE and MLM objectives, and how we combine the two as a unified model.

Figure 2: The KEPLER framework. We encode entity descriptions as entity embeddings and jointly train the knowledge embedding (KE) and masked language modeling (MLM) objectives on the same PLM.

2.1 Encoder



For the text encoder, we use the Transformer architecture (Vaswani et al., 2017) in the same way as Devlin et al. (2019) and Liu et al. (2019c). The encoder takes a sequence of N tokens (x1, ..., xN) as inputs, and computes L layers of d-dimensional contextualized representations H_i ∈ R^{N×d}, 1 ≤ i ≤ L. Each layer of the encoder E_i is a combination of a multihead self-attention network and a multilayer perceptron, and the encoder gets the representation of each layer by H_i = E_i(H_{i−1}). Finally, we get a contextualized representation for each position, which could be further used in downstream tasks. Conventionally, there is a special token added to the beginning of the text, and the output at this token is regarded as the sentence representation. We denote the representation function as E(·). The encoder requires a tokenizer to convert plain texts into sequences of tokens. Here we use the same tokenization as RoBERTa: the Byte-Pair Encoding (BPE) (Sennrich et al., 2016).
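The description-as-embedding idea in Section 2.2 builds directly on this encoder interface. As a rough illustration (not the authors' fairseq implementation), the following sketch uses the Hugging Face transformers package, an assumption on our side, to take the representation at the beginning-of-sequence token of an entity description and use it as the entity embedding:

# Minimal sketch (not the official KEPLER code): encode an entity description
# with a RoBERTa-style encoder and take the representation of the first
# (beginning-of-sequence) token as the entity embedding, as in Equation 1.
import torch
from transformers import AutoModel, AutoTokenizer  # assumed dependency

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def encode_description(text: str) -> torch.Tensor:
    """Return a d-dimensional embedding for one entity description."""
    inputs = tokenizer(text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # outputs.last_hidden_state has shape (1, N, d); position 0 is the
    # special token added at the beginning of the text.
    return outputs.last_hidden_state[:, 0].squeeze(0)

emb = encode_description("Johannes Kepler was a German astronomer ...")
print(emb.shape)  # torch.Size([768]) for the BASE size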

Unlike previous knowledge-enhanced PLM
works (Zhang et al., 2019; Peters et al., 2019), we
do not modify the Transformer encoder structure
to add external entity linkers or knowledge-
integration layers. It means that our model has
no additional inference overhead compared to
vanilla PLMs, and it makes applying KEPLER in
downstream tasks as easy as RoBERTa.

2.2 Knowledge Embedding

To integrate factual knowledge into KEPLER, we adopt the knowledge embedding (KE) objective in our pre-training. KE encodes entities and relations in knowledge graphs (KGs) as distributed representations, which benefits lots of downstream tasks, such as link prediction and relation extraction.

We first define KGs: A KG is a graph with entities as its nodes and relations between entities as its edges. We use a triplet (h, r, t) to describe a relational fact, where h, t are the head entity and the tail entity, and r is the relation type within a pre-defined relation set R. In conventional KE models, each entity and relation is assigned a d-dimensional vector, and a scoring function is defined for training the embeddings and predicting links.

In KEPLER, instead of using stored embeddings, we encode entities into vectors by using their corresponding text. By choosing different textual data and different KE scoring functions, we have multiple variants for the KE objective of KEPLER. In this paper, we explore three simple but effective ways: entity descriptions as embeddings, entity and relation descriptions as embeddings, and entity embeddings conditioned on relations. We leave exploring advanced KE methods as our future work.

Entity Descriptions as Embeddings For a relational triplet (h, r, t), we have:

h = E(text_h),  t = E(text_t),  r = T_r,   (1)

where text_h and text_t are the descriptions for h and t, with a special token at the beginning. T ∈ R^{|R|×d} is the relation embedding matrix and h, t, r are the embeddings for h, t, and r.


We use the loss from Sun et al. (2019) as our KE objective, which adopts negative sampling (Mikolov et al., 2013) for efficient optimization:

L_KE = − log σ(γ − d_r(h, t)) − Σ_{i=1}^{n} (1/n) log σ(d_r(h′_i, t′_i) − γ),   (2)

where (h′_i, r, t′_i) are negative samples, γ is the margin, σ is the sigmoid function, and d_r is the scoring function, for which we choose to follow TransE (Bordes et al., 2013) for its simplicity,

d_r(h, t) = ‖h + r − t‖_p,   (3)

where we take the norm p as 1. The negative sampling policy is to fix the head entity and randomly sample a tail entity, and vice versa.
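For concreteness, Equations 2 and 3 can be sketched in plain PyTorch as below. This is an illustrative sketch, not the released implementation: it assumes the positive and negative entity embeddings have already been produced by the text encoder, and the tensor names and shapes are our own.

# Sketch of the KE objective (Equations 2-3); h, r, t are embeddings of a
# positive triplet, and (h_neg, t_neg) hold the heads/tails of n negative
# triplets per example (in practice one side keeps the original entity and
# the other is resampled). Names and shapes are illustrative only.
import torch
import torch.nn.functional as F

def transe_score(h, r, t, p=1):
    # d_r(h, t) = ||h + r - t||_p  (Equation 3)
    return torch.norm(h + r - t, p=p, dim=-1)

def ke_loss(h, r, t, h_neg, t_neg, gamma=4.0):
    """h, r, t: (B, d); h_neg, t_neg: (B, n, d)."""
    pos = F.logsigmoid(gamma - transe_score(h, r, t))            # (B,)
    neg = transe_score(h_neg, r.unsqueeze(1), t_neg)             # (B, n)
    neg = F.logsigmoid(neg - gamma).mean(dim=1)                  # (B,)
    return -(pos + neg).mean()

B, n, d = 8, 1, 768
h, r, t = (torch.randn(B, d) for _ in range(3))
h_neg, t_neg = torch.randn(B, n, d), torch.randn(B, n, d)
print(ke_loss(h, r, t, h_neg, t_neg))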

Entity and Relation Descriptions as Embeddings A natural extension for the last method is to encode the relation descriptions as relation embeddings as well. Formally, we have,

r̂ = E(text_r),   (4)

where text_r is the description for the relation r. Then we use r̂ to replace r in Equations 2 and 3.

Entity Embeddings Conditioned on Relations In this manner, we use entity embeddings conditioned on r for better KE performances. The intuition is that the semantics of an entity may have multiple aspects, and different relations focus on different ones (Lin et al., 2015). So we have,

h_r = E(text_{h,r}),   (5)

where text_{h,r} is the concatenation of the description for the entity h and the description for the relation r, with special tokens at the beginning and in between. Correspondingly, we use h_r instead of h for Equations 2 and 3.
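As an illustration of how the conditioned variant could assemble its input, one can feed the two descriptions as a text pair so that the tokenizer places its own separator between them; the exact special tokens and truncation rules of the released KEPLER code may differ, so this is only a hedged sketch:

# Illustrative only: build the input for the relation-conditioned entity
# embedding h_r (Equation 5) from an entity description and a relation
# description. The released implementation may assemble this differently.
from transformers import AutoTokenizer  # assumed dependency

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
inputs = tokenizer("Johannes Kepler was a German astronomer ...",
                   "country of citizenship: the state of which a person is a citizen",
                   truncation=True, max_length=512, return_tensors="pt")
# h_r is then the first-token representation of the encoder output,
# exactly as in the sketch for Equation 1.
print(inputs["input_ids"].shape)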

2.3 Masked Language Modeling

The masked language modeling (MLM) objective is inherited from BERT and RoBERTa. During pre-training, MLM randomly selects some of the input positions, and the objective is to predict the tokens at these selected positions within a fixed dictionary.

To be more specific, MLM randomly selects 15% of input positions, among which 80% are masked with the special mask token, 10% are replaced by other random tokens, and the rest remain unchanged. For each selected position j, the last layer of the contextualized representation H_{L,j} is used for a W-way classification, where W is the size of the dictionary. At last, a cross-entropy loss L_MLM is calculated over these selected positions.

We initialize our model with the pre-trained checkpoint of RoBERTaBASE. However, we still keep MLM as one of our objectives to avoid catastrophic forgetting (McCloskey and Cohen, 1989) while training towards the KE objective. Actually, as demonstrated in Section 5.1, only using the KE objective leads to poor results in NLP tasks.
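A minimal sketch of the standard 80/10/10 corruption rule described above is given below; it is our own simplification (it ignores special tokens and padding, which a real implementation must respect):

# Minimal sketch of the MLM corruption rule: select 15% of positions; of
# those, 80% become the mask token, 10% become a random token, and 10%
# are left unchanged. Special tokens and padding are ignored here.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size,
                mlm_prob=0.15, seed=None):
    g = torch.Generator()
    if seed is not None:
        g.manual_seed(seed)
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape, generator=g) < mlm_prob
    labels[~selected] = -100          # ignored by the cross-entropy loss
    roll = torch.rand(input_ids.shape, generator=g)
    corrupted = input_ids.clone()
    corrupted[selected & (roll < 0.8)] = mask_token_id
    random_ids = torch.randint(vocab_size, input_ids.shape, generator=g)
    use_random = selected & (roll >= 0.8) & (roll < 0.9)
    corrupted[use_random] = random_ids[use_random]
    # the remaining selected positions keep their original token
    return corrupted, labels

ids = torch.randint(5, 100, (2, 10))
corrupted, labels = mask_tokens(ids, mask_token_id=3, vocab_size=100, seed=0)
print(corrupted)
print(labels)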

2.4 Training Objectives

To incorporate factual knowledge and language understanding into one PLM, we design a multi-task loss as shown in Figure 2 and Equation 6,

L = L_KE + L_MLM,   (6)

where L_KE and L_MLM are the losses for KE and MLM correspondingly. Jointly optimizing the two objectives can implicitly integrate knowledge from external KGs into the text encoder, while preserving the strong abilities of PLMs for syntactic and semantic understanding. Note that those two tasks only share the text encoder, and for each mini-batch, text data sampled for KE and MLM are not (necessarily) the same. This is because seeing a variety of text (instead of just entity descriptions) in MLM can help the model to have better language understanding ability.
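The multi-task setup can be made concrete with the schematic training step below. The two loss functions are trivial stand-ins for L_KE and L_MLM (Sections 2.2 and 2.3), and the single linear layer is a stand-in for the shared encoder; the actual fairseq-based implementation batches and schedules the two tasks differently.

# Schematic joint optimization of Equation 6: the two objectives share the
# same encoder parameters and their losses are simply summed. The losses
# below are toy placeholders, not the real objectives.
import torch

encoder = torch.nn.Linear(16, 16)            # stand-in for the shared encoder
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

def toy_mlm_loss(batch):                     # placeholder for L_MLM
    return encoder(batch).pow(2).mean()

def toy_ke_loss(batch):                      # placeholder for L_KE
    return (encoder(batch) - 1.0).abs().mean()

mlm_batch = torch.randn(4, 16)               # sampled from the MLM corpora
ke_batch = torch.randn(4, 16)                # sampled from the KG descriptions

optimizer.zero_grad()
loss = toy_ke_loss(ke_batch) + toy_mlm_loss(mlm_batch)   # L = L_KE + L_MLM
loss.backward()
optimizer.step()
print(float(loss))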

2.5 Variants and Implementations

We introduce the variants of KEPLER and the pre-
training implementations here. The fine-tuning
details will be introduced in Section 4.

KEPLER Variants

We implement multiple versions of KEPLER
in experiments to explore the effectiveness of
our pre-training framework. We use the same
denotations in Section 4 as below.


KEPLER-Wiki is the principal model in our experiments, which adopts Wikidata5M (Section 3) as the KG and the entity-description-as-embedding method (Equation 1). All other variants, if not specified, use the same settings. KEPLER-Wiki achieves the best performances on most tasks.

KEPLER-WordNet uses WordNet (Miller, 1995) as its KG source. WordNet is an English lexical graph, where nodes are lemmas and synsets, and edges are their relations. Intuitively, incorporating WordNet can bring lexical knowledge and thus benefits NLP tasks. We use the same WordNet 3.0 as in KnowBert (Peters et al., 2019), which is extracted from the nltk2 package.

KEPLER-W+W takes both Wikidata5M and WordNet as its KGs. To jointly train with two KG datasets, we modify the objective in Equation 6 as

L = L_Wiki + L_WordNet + L_MLM,   (7)

where L_Wiki and L_WordNet are losses from Wikidata5M and WordNet respectively.

KEPLER-Rel uses the entity and relation descriptions as embeddings method (Equation 4). As the relation descriptions in Wikidata are short (11.7 words on average) and homogeneous, encoding relation descriptions as relation embeddings results in worse performance, as shown in Section 4.

KEPLER-Cond uses the entity-embedding-conditioned-on-relation method (Equation 5). This model achieves superior results in link prediction tasks, both transductive and inductive (Section 4.3).

KEPLER-OnlyDesc trains the MLM objective directly on the entity descriptions from the KE objective rather than using the English Wikipedia and BookCorpus as the other versions of KEPLER do. However, as the entity description data are smaller (2.3 GB vs. 13 GB) and homogeneous, it harms the general language understanding ability and thus performs worse (Section 4.2).

KEPLER-KE only adopts the KE objective
in pre-training, which is an ablated version of
KEPLER-Wiki. It is used to show the necessity of
the MLM objective for language understanding.

Pre-training Implementation
In practice, we choose RoBERTa (Liu et al., 2019c) as our base model and implement KEPLER
2https://www.nltk.org.


in the fairseq framework (Ott et al., 2019) for pre-training. Due to the computing resource limit, we choose the BASE size (L = 12, d = 768) and use the released roberta.base parameters for initialization, which is a common practice to save pre-training time (Zhang et al., 2019; Peters et al., 2019). For the MLM objective, we use the English Wikipedia (2,500M words) and BookCorpus (800M words) (Zhu et al., 2015) as our pre-training corpora (except KEPLER-OnlyDesc). We extract text from these two corpora in the same way as Devlin et al. (2019). For the KE objective, we encode the first 512 tokens of entity descriptions from the English Wikipedia as entity embeddings.

We set the γ in Equation 2 as 4 and 9 for NLP and KE tasks respectively, and we use the models pre-trained with 10 and 30 epochs for NLP and KE. Specially, the γ is 1 for KEPLER-WordNet. The two hyperparameters are tuned by multiple trials for γ in {1, 2, 4, 6, 9} and the number of epochs in {5, 10, 20, 30, 40}, and we select the model by performances on TACRED (F-1) and inductive link prediction (HITS@10). We use gradient accumulation to achieve a batch size of 12,288.

3 Wikidata5M

As shown in Section 2, to train KEPLER, the KG dataset should (1) be large enough, (2) contain high-quality textual descriptions for its entities and relations, and (3) have a reasonable inductive setting, which most existing KG datasets do not provide. Thus, based on Wikidata3 and English Wikipedia,4 we construct Wikidata5M, a large-scale KG dataset with aligned text descriptions from corresponding Wikipedia pages, and also an inductive test set. In the following sections, we first introduce the data collection (Section 3.1) and the data split (Section 3.2), and then provide the results of representative KE methods on the dataset (Section 3.3).

3.1 Data Collection

We use the July 2019 dump of Wikidata and
Wikipedia. For each entity in Wikidata, we align
it to its Wikipedia page and extract the first section
as its description. Entities with no pages or with
descriptions fewer than 5 words are discarded.

3https://www.wikidata.org.
4https://en.wikipedia.org.


Dataset       #entity     #relation   #training    #validation   #test
FB15K         14,951      1,345       483,142      50,000        59,071
WN18          40,943      18          141,442      5,000         5,000
FB15K-237     14,541      237         272,115      17,535        20,466
WN18RR        40,943      11          86,835       3,034         3,134
Wikidata5M    4,594,485   822         20,614,279   5,163         5,133

Table 1: Statistics of Wikidata5M (transductive setting) compared with existing KE benchmarks.

Entity Type        Occurrence   Percentage
Human              1,517,591    33.0%
Taxon              363,882      7.9%
Wikimedia list     118,823      2.6%
Film               114,266      2.5%
Human Settlement   110,939      2.4%
Total              2,225,501    48.4%

Table 2: Top-5 entity categories in Wikidata5M.

We retrieve all the relational facts in Wikidata. A fact is considered to be valid when both of its entities are not discarded, and its relation has a non-empty page in Wikidata. The final KG contains 4,594,485 entities, 822 relations and 20,624,575 triplets. Statistics of Wikidata5M along with four other widely used benchmarks are shown in Table 1. The top-5 entity categories are listed in Table 2. We can see that Wikidata5M is much larger than other KG datasets, covering various domains.

3.2 Data Split

For Wikidata5M, we take two different settings: the transductive setting and the inductive setting. The transductive setting (shown in Table 1) is adopted in most KG datasets, where the entities are shared and the triplet sets are disjoint across training, validation and test. In this setting, KE models are expected to learn effective entity embeddings only for the entities in the training set. In the inductive setting (shown in Table 3), the entities and triplets are mutually disjoint across training, validation and test. We randomly sample some connected subgraphs as the validation and test set. In the inductive setting, the KE models should produce embeddings for the unseen entities given side features like descriptions, neighbors, etc. The inductive setting is more challenging and also more meaningful in real-world applications, where entities in KGs experience open-ended growth, and the inductive ability is crucial for online KE methods.

Subset       #entity     #relation   #triplet
Training     4,579,609   822         20,496,514
Validation   7,374       199         6,699
Test         7,475       201         6,894

Table 3: Statistics of the Wikidata5M inductive setting.

Although Wikidata5M contains massive entities and triplets, our validation and test sets are not large, which is limited by the standard evaluation method of link prediction (Section 3.3). Each episode of evaluation requires |E| × |T| × 2 times of KE score calculation, where |E| and |T| are the total number of entities and the number of triplets in the test set respectively. As Wikidata5M contains massive entities, the evaluation is very time-consuming, hence we have to limit the test set to thousands of triplets to ensure tractable evaluations. This indicates that large-scale KE urges a more efficient evaluation protocol. We will leave exploring it to future work.

3.3 Benchmark

To assess the challenges of Wikidata5M, we benchmark several popular KE models on our dataset in the transductive setting (as they inherently do not support the inductive setting). Because their original implementations do not scale to Wikidata5M, we benchmark these methods with GraphVite (Zhu et al., 2019), a multi-GPU KE toolkit.

In the transductive setting, for each test triplet (h, r, t), the model ranks all the entities by scoring (h, r, t′), t′ ∈ E, where E is the entity set excluding other correct t. The evaluation metrics, MRR (mean reciprocal rank), MR (mean rank), and HITS@{1,3,10}, are based on the rank of the correct tail entity t among all the entities in E. Then we do the same thing for the head entities. We report the average results over all test triplets and over both head and tail entity predictions.

Method                             MR       MRR    HITS@1   HITS@3   HITS@10
TransE (Bordes et al., 2013)       109370   25.3   17.0     31.1     39.2
DistMult (Yang et al., 2015)       211030   25.3   20.8     27.8     33.4
ComplEx (Trouillon et al., 2016)   244540   28.1   22.8     31.0     37.3
SimplE (Kazemi and Poole, 2018)    115263   29.6   25.2     31.7     37.7
RotatE (Sun et al., 2019)          89459    29.0   23.4     32.2     39.0

Table 4: Performance of different KE models on Wikidata5M (% except MR).


Table 4 shows the results of popular KE methods on Wikidata5M, which are all significantly lower than on existing KG datasets like FB15K-237, WN18RR, and so on. It demonstrates that Wikidata5M is more challenging due to its large scale and high coverage. The results advocate for more efforts towards large-scale KE.
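For concreteness, the ranking metrics used throughout this paper can be computed from the rank of each correct entity as in the simplified sketch below; it assumes the ranks have already been obtained with the filtered candidate set E described above.

# Simplified sketch: compute MR, MRR, and HITS@{1,3,10} from the ranks of
# the correct entities (head and tail predictions pooled together).
def ranking_metrics(ranks):
    n = len(ranks)
    mr = sum(ranks) / n
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(r <= k for r in ranks) / n for k in (1, 3, 10)}
    return {"MR": mr, "MRR": mrr,
            "HITS@1": hits[1], "HITS@3": hits[3], "HITS@10": hits[10]}

print(ranking_metrics([1, 4, 12, 2, 700]))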

4 Experiments

In this section, we introduce the experiment settings and results of our model on various NLP and KE tasks, along with some analyses on KEPLER.

4.1 Experimental Setting

Baselines In our experiments, RoBERTa is an important baseline since KEPLER is based on it (all mentioned models are of BASE size if not specified). As we cannot afford the full RoBERTa corpora (126 GB, and we only use 13 GB) in KEPLER pre-training, we implement Our RoBERTa for direct comparisons to KEPLER. It is initialized by RoBERTaBASE and is further trained with the MLM objective on the same corpora as KEPLER.

We also evaluate recent knowledge-enhanced PLMs, including ERNIEBERT (Zhang et al., 2019) and KnowBertBERT (Peters et al., 2019). As ERNIE and our principal model KEPLER-Wiki only use Wikidata, we take KnowBert-Wiki in the experiments to ensure fair comparisons with the same knowledge source. Considering KEPLER is based on RoBERTa, we reproduce the two models with RoBERTa too (ERNIERoBERTa and KnowBertRoBERTa). The reproduction of KnowBert is based on its original implementation.5

5https://github.com/allenai/kb.

On relation classification, we also compare with MTB (Baldini Soares et al., 2019), which adopts ‘‘matching the blank’’ pre-training. Different from other baselines, the original MTB is based on BERTLARGE (denoted by MTB (BERTLARGE)). For a fair comparison under the same model size, we reimplement MTB with BERTBASE (MTB).

Hyperparameter The pre-training settings are in Section 2.5. For fine-tuning on downstream tasks, we set KEPLER hyperparameters the same as reported in KnowBert on TACRED and OpenEntity. On FewRel, we set the learning rate as 2e-5 and batch size as 20 and 4 for the Proto and PAIR frameworks respectively. For GLUE, we follow the hyperparameters reported in RoBERTa. For baselines, we keep their original hyperparameters unchanged or use the best trial in the KEPLER searching space if no original settings are available.

4.2 NLP Tasks

In this section, we demonstrate the performance of
KEPLER and its baselines on various NLP tasks.

Relation Classification

Relation classification requires models to classify
relation types between two given entities from
text. We evaluate KEPLER and other baselines
on two widely used benchmarks: TACRED and
FewRel.

TACRED (Zhang et al., 2017) has 42 relations and 106,264 sentences. Here we follow the settings of Baldini Soares et al. (2019), where we add four special tokens before and after the two entity mentions, and concatenate the representations at the beginnings of the two entities for classification. Note that the original KnowBert also takes entity types as inputs, which is different from Zhang et al. (2019); Baldini Soares et al. (2019). To ensure fair comparisons, we re-evaluate KnowBert with the same setting as other baselines, thus the reported results are different from the original paper.

Model               P      R      F-1
BERT                67.2   64.8   66.0
BERTLARGE           -      -      70.1
MTB                 69.7   67.9   68.8
MTB (BERTLARGE)     -      -      71.5
ERNIEBERT           70.0   66.1   68.0
KnowBertBERT        73.5   64.1   68.5
RoBERTa             70.4   71.1   70.7
ERNIERoBERTa        73.5   68.0   70.7
KnowBertRoBERTa     71.9   69.9   70.9
Our RoBERTa         70.8   69.6   70.2
KEPLER-Wiki         71.5   72.5   72.0
KEPLER-WordNet      71.4   71.3   71.3
KEPLER-W+W          71.1   72.0   71.5
KEPLER-Rel          71.3   70.9   71.1
KEPLER-Cond         72.1   70.7   71.4
KEPLER-OnlyDesc     72.3   69.1   70.7
KEPLER-KE           63.5   60.5   62.0

Table 5: Precision, recall, and F-1 on TACRED (%). KnowBert results are different from the original paper since different task settings are used.


From the TACRED results in Table 5, we can observe that: (1) KEPLER-Wiki is the best one among KEPLER variants and significantly outperforms all the baselines, while other versions of KEPLER also achieve good results. It demonstrates the effectiveness of KEPLER on integrating factual knowledge into PLMs. Based on the results, we use KEPLER-Wiki as the principal model in the following experiments. (2) KEPLER-WordNet shows a marginal improvement over Our RoBERTa, while KEPLER-W+W underperforms KEPLER-Wiki. It suggests that pre-training with WordNet only has limited benefits in the KEPLER framework. We will explore how to better combine different KGs in our future work.

FewRel (Han et al., 2018b) is a few-shot relation classification dataset with 100 relations and 70,000 instances, which is constructed with Wikipedia text and Wikidata facts. Moreover, Gao et al. (2019) propose FewRel 2.0, adding a domain adaptation challenge with a new medical-domain test set.

FewRel takes the N-way K-shot setting. Relations in the training and test sets are disjoint. For every evaluation episode, N relations, K supporting samples for each relation, and several query sentences are sampled from the test set. The models are required to classify queries into one of the N relations only given the sampled N × K instances.

We use two state-of-the-art few-shot frameworks: Proto (Snell et al., 2017) and PAIR (Gao et al., 2019). We replace the text encoders with our baselines and KEPLER and compare the performances. Because FewRel 1.0 is constructed with Wikidata, we remove all the triplets in its test set from Wikidata5M to avoid information leakage for KEPLER. However, we cannot control the KGs used in our baselines. We mark the models utilizing Wikidata and having an information leakage risk with † in Table 6.

As Table 6 shows, KEPLER-Wiki achieves the best performance over the BASE-size PLMs in most settings. From the results, we also have some interesting observations: (1) RoBERTa consistently outperforms BERT on various NLP tasks (Liu et al., 2019c), yet the RoBERTa-based models here are comparable or even worse than BERT-based models in the PAIR framework. Because PAIR uses sentence concatenation, this result may be credited to the next sentence prediction (NSP) objective of BERT. (2) KEPLER brings improvements on FewRel 2.0, while ERNIE and KnowBert even degenerate in most of the settings. It indicates that the paradigms of ERNIE and KnowBert cannot well generalize to new domains which may require much different entity linkers and entity embeddings. On the other hand, KEPLER not only learns better entity representations but also acquires a general ability to extract factual knowledge from the context across different domains. We further verify this in Section 5.5. (3) KnowBert underperforms ERNIE on FewRel while it typically achieves better results on other tasks. This may be because it uses the TuckER (Balazevic et al., 2019) KE model while ERNIE and KEPLER follow TransE (Bordes et al., 2013). We will explore the effects of different KE methods in the future.

We also have another two observations with regard to ERNIE and MTB: (1) ERNIE performs the best in the 1-shot settings of FewRel 1.0. We believe this is because the knowledge embedding injection of ERNIE has particular advantages in this case, since it directly brings
knowledge about entities. When using 5-shot (supporting text provides more information) and FewRel 2.0 (ERNIE does not have knowledge for biomedical entities), KEPLER outperforms ERNIE. (2) Though MTB (BERTLARGE) is the state-of-the-art model on FewRel, its BERTBASE version does not outperform other knowledge-enhanced PLMs, which suggests that using large models contributes much to its gain. We also notice that when combined with PAIR, MTB suffers an obvious performance drop, which may be because its pre-training objective degenerates on sentence-pair tasks.

Entity Typing

Entity typing requires classifying given entity mentions into pre-defined types. For this task, we carry out evaluations on OpenEntity (Choi et al., 2018) following the settings in Zhang et al. (2019). OpenEntity has 6 entity types and 2,000 instances for training, validation and test each.

To identify the entity mentions of interest, we add two special tokens before and after the entity spans, and use the representations of the first special tokens for classification. As shown in Table 7, KEPLER-Wiki achieves state-of-the-art results. Note that the KnowBert results are different from the original paper since we use KnowBert-Wiki here rather than KnowBert-W+W to ensure the same knowledge resource and fair comparisons. KEPLER does not perform linking or entity embedding pre-training like ERNIE and KnowBert, which bring them special advantages in entity span tasks. However, KEPLER still outperforms these baselines, which proves its effectiveness.

GLUE

The General Language Understanding Evaluation (GLUE) (Wang et al., 2019b) collects several natural language understanding tasks and is widely used for evaluating PLMs. Generally, solving GLUE does not require factual knowledge (Zhang et al., 2019) and we use it to examine whether KEPLER harms the general language understanding ability.

Table 8 shows the GLUE results. We can observe that KEPLER-Wiki is close to Our RoBERTa, suggesting that while incorporating factual knowledge, KEPLER maintains a strong language understanding ability. However, there are significant performance drops of KEPLER-OnlyDesc, which indicates that the small-scale entity description data are not sufficient for training KEPLER with MLM.

For the small datasets STS-B, MRPC and RTE, directly fine-tuning models on them typically results in unstable performance. Hence we fine-tune models on a large-scale dataset (here we use MNLI) first and then further fine-tune them on the small datasets. The method has been shown to be effective (Wang et al., 2019a) and is also used in the original RoBERTa paper (Liu et al., 2019c).

Model              MNLI (m/mm)   QQP    QNLI   SST-2   CoLA   STS-B   MRPC   RTE
                   392K          363K   104K   67K     8.5K   5.7K    3.5K   2.5K
RoBERTa            87.5/87.2     91.9   92.7   94.8    63.6   91.2    90.2   80.9
Our RoBERTa        87.1/86.8     90.9   92.5   94.7    63.4   91.1    88.4   82.3
KEPLER-Wiki        87.2/86.5     91.7   92.4   94.5    63.6   91.2    89.3   85.2
KEPLER-OnlyDesc    85.9/85.6     90.8   92.4   94.4    55.8   90.2    88.5   78.3

Table 8: GLUE results on the dev set (%). All the results are medians over 5 runs. We report F-1 scores for QQP and MRPC, Spearman correlations for STS-B, and accuracy scores for the other tasks. The ‘‘m/mm’’ stands for matched/mismatched evaluation sets for MNLI (Williams et al., 2018).

4.3 KE Tasks

We show how KEPLER works as a KE model, and evaluate it on Wikidata5M in both the transductive link prediction setting and the inductive setting.

Experimental Settings

In link prediction, the entity and relation embeddings of KEPLER are obtained as described in Sections 2.2 and 2.5. The evaluation method is described in Section 3.3. We also add RoBERTa and Our RoBERTa as baselines. They adopt Equations 1 and 4 to acquire entity and relation embeddings, and use Equation 3 as their scoring function.

In the transductive setting, we compare our models with TransE (Bordes et al., 2013). We set its dimension as 512, negative sampling size as 64, batch size as 2048, and learning rate as 0.001 after hyperparameter searching. The negative sampling size is crucial for the performance on KE tasks, but limited by the model complexity, KEPLER can only take a negative sampling size of 1. For a direct comparison to intuitively show the benefits of pre-training, we set a baseline TransE†, which also uses 1 as the negative sampling size and keeps the other hyperparameters unchanged.

Because conventional KE methods like TransE inherently cannot provide embeddings for unseen entities, we take DKRL (Xie et al., 2016) as our baseline in the KE experiments, which utilizes convolutional neural networks to encode entity descriptions as embeddings. We set its dimension as 768, negative sampling size as 64, batch size as 1024, and learning rate as 0.0005.

Transductive Setting

Table 9a shows the results of the transductive
setting. We observe that:


Model                       FewRel 1.0                          FewRel 2.0
                            5-1     5-5     10-1    10-5        5-1     5-5     10-1    10-5
MTB (BERTLARGE)†            93.86   97.06   89.20   94.27       -       -       -       -
Proto (BERT)                80.68   89.60   71.48   82.89       40.12   51.50   26.45   36.93
Proto (MTB)                 81.39   91.05   71.55   83.47       52.13   76.67   48.28   69.75
Proto (ERNIEBERT)†          89.43   94.66   84.23   90.83       49.40   65.55   34.99   49.68
Proto (KnowBertBERT)†       86.64   93.22   79.52   88.35       64.40   79.87   51.66   69.71
Proto (RoBERTa)             85.78   95.78   77.65   92.26       64.65   82.76   50.80   71.84
Proto (Our RoBERTa)         84.42   95.30   76.43   91.74       61.98   83.11   48.56   72.19
Proto (ERNIERoBERTa)†       87.76   95.62   80.14   91.47       54.43   80.48   37.97   66.26
Proto (KnowBertRoBERTa)†    82.39   93.62   76.21   88.57       55.68   71.82   41.90   58.55
Proto (KEPLER-Wiki)         88.30   95.94   81.10   92.67       66.41   84.02   51.85   73.60
PAIR (BERT)                 88.32   93.22   80.63   87.02       67.41   78.57   54.89   66.85
PAIR (MTB)                  83.01   87.64   73.42   78.47       46.18   70.50   36.92   55.17
PAIR (ERNIEBERT)†           92.53   94.27   87.08   89.13       56.18   68.97   43.40   54.35
PAIR (KnowBertBERT)†        88.48   92.75   82.57   86.18       66.05   77.88   50.86   67.19
PAIR (RoBERTa)              89.32   93.70   82.49   88.43       66.78   81.84   53.99   70.85
PAIR (Our RoBERTa)          89.26   93.71   83.32   89.02       63.22   77.66   49.28   65.97
PAIR (ERNIERoBERTa)†        87.46   94.11   81.68   87.83       59.29   72.91   48.51   60.26
PAIR (KnowBertRoBERTa)†     85.05   91.34   76.04   85.25       50.68   66.04   37.10   51.13
PAIR (KEPLER-Wiki)          90.31   94.28   85.48   90.51       67.23   82.09   54.32   71.01

Table 6: Accuracies (%) on the FewRel dataset. N-K indicates the N-way K-shot setting. MTB uses the LARGE size and all the other models use the BASE size. † indicates oracle models which may have seen facts in the FewRel 1.0 test set during pre-training.


Model                      P      R      F-1
UFET (Choi et al., 2018)   77.4   60.6   68.0
BERT                       76.4   71.0   73.6
ERNIEBERT                  78.4   72.9   75.6
KnowBertBERT               77.9   71.2   74.4
RoBERTa                    77.4   73.6   75.4
ERNIERoBERTa               80.3   70.2   74.9
KnowBertRoBERTa            78.7   72.7   75.6
Our RoBERTa                75.1   73.4   74.3
KEPLER-Wiki                77.8   74.6   76.2

Table 7: Entity typing results on OpenEntity (%).

(1) KEPLER underperforms TransE. It is reasonable since KEPLER is limited by its large model size, and thus cannot use a large negative sampling size (1 for KEPLER, while typical KE methods use 64 or more) and more training epochs (30 vs. 1000 for TransE), which are crucial for KE (Zhu et al., 2019). On the other hand, KEPLER and its variants perform much better than TransE† (with a negative sampling size of 1), showing that with the same negative sampling size, KEPLER can benefit from pre-trained language representations and textual entity descriptions so that it outperforms TransE. In the future, we will explore reducing the model size of KEPLER to take advantage of both a large negative sampling size and pre-training.

(2) The vanilla RoBERTa performs poorly in KE while KEPLER achieves favorable performances, which demonstrates the effectiveness of our multi-task pre-training to infuse factual knowledge.

Model                            MR        MRR    HITS@1   HITS@3   HITS@10
TransE (Bordes et al., 2013)     109370    25.3   17.0     31.1     39.2
TransE†                          406957    6.0    1.8      8.0      13.6
DKRL (Xie et al., 2016)          31566     16.0   12.0     18.1     22.9
RoBERTa                          1381597   0.1    0.0      0.1      0.3
Our RoBERTa                      1756130   0.1    0.0      0.1      0.2
KEPLER-KE                        76735     8.2    4.9      8.9      15.1
KEPLER-Rel                       15820     6.6    3.7      7.0      11.7
KEPLER-Wiki                      14454     15.4   10.5     17.4     24.4
KEPLER-Cond                      20267     21.0   17.3     22.4     27.7

(a) Transductive results on Wikidata5M (% except MR). TransE† denotes a TransE model trained with the same negative sampling size (1) as KEPLER.

Model                     MR     MRR    HITS@1   HITS@3   HITS@10
DKRL (Xie et al., 2016)   78     23.1   5.9      32.0     54.6
RoBERTa                   723    7.4    0.7      1.0      19.6
Our RoBERTa               1070   5.8    1.9      6.3      13.0
KEPLER-KE                 138    17.8   5.7      22.9     40.7
KEPLER-Rel                35     33.4   15.9     43.5     66.1
KEPLER-Wiki               32     35.1   15.4     46.9     71.9
KEPLER-Cond               28     40.2   22.2     51.4     73.0

(b) Inductive results on Wikidata5M (% except MR).

Table 9: Link prediction results on Wikidata5M transductive and inductive settings.

(3) Among the KEPLER variants, KEPLER-Cond has superior results, which substantiates the intuition in Section 2.2. KEPLER-Rel performs worst, which we believe is due to the short and homogeneous relation descriptions of Wikidata. KEPLER-KE significantly underperforms KEPLER-Wiki, which suggests that the MLM objective is necessary as well for the KE tasks to build effective language representation.

(4) We also notice that DKRL performs well in the transductive setting and its result is close to KEPLER. We believe this is because DKRL takes a much smaller encoder (CNN) and thus is easier to train. In the more difficult inductive setting, the gap between DKRL and KEPLER is larger, which better shows the language understanding ability of KEPLER to utilize textual entity descriptions.

Inductive Setting

Table 9b shows the Wikidata5M inductive results. KEPLER outperforms DKRL and RoBERTa by a large margin, demonstrating the effectiveness of our joint training method. But KEPLER results are still far from the ideal performances required by practical applications (constructing KG from scratch, etc.), which urges further efforts on inductive KE. Comparisons among KEPLER variants are consistent with the transductive setting.

Furthermore, we clarify why results in the inductive setting are much higher than in the transductive setting, while the inductive setting is more difficult: As shown in Tables 1 and 3, the number of entities involved in the inductive evaluation is much smaller than in the transductive setting (7,475 vs. 4,594,485). Considering the KE evaluation metrics are based on entity ranking, it is reasonable to see higher values in the inductive setting. The performances in different settings should not be directly compared.

5 Analysis

In this section, we analyze the effectiveness and efficiency of KEPLER with experiments. All the hyperparameters are the same as reported in Section 4.1, including models in the ablation study.


Model          P      R      F-1
Our RoBERTa    70.8   69.6   70.2
KEPLER-KE      63.5   60.5   62.0
KEPLER-Wiki    71.5   72.5   72.0

Table 10: Ablation study results on TACRED (%).

5.1 Ablation Study

As shown in Equation 6, KEPLER takes a multi-task loss. To demonstrate the effectiveness of the joint objective, we compare full KEPLER with models trained with only the MLM loss (Our RoBERTa) and only the KE loss (KEPLER-KE) on TACRED. As demonstrated in Table 10, compared to KEPLER-Wiki, both ablation models suffer significant drops. It suggests that the performance gain of KEPLER is credited to the joint training towards both objectives.

5.2 Knowledge Probing Experiment

Section 4.2 shows that KEPLER can achieve significant improvements on NLP tasks requiring factual knowledge. To further verify whether KEPLER can better integrate factual knowledge into PLMs and help to recall them, we conduct experiments on LAMA (Petroni et al., 2019), a widely used knowledge probe. LAMA examines PLMs' abilities on recalling relational facts by cloze-style questions. For example, given a natural language template ‘‘Paris is the capital of <mask>’’, PLMs are required to predict the masked token without fine-tuning. LAMA reports the micro-averaged precision at one (P@1) scores. However, Poerner et al. (2020) present that LAMA contains some easy questions which can be answered with superficial clues like entity names. Hence we also evaluate the models on LAMA-UHN (Poerner et al., 2020), which filters out the questionable templates from the Google-RE and T-REx corpora of LAMA.

The evaluation results are shown in Table 11, from which we have the following observations: (1) KEPLER consistently outperforms the vanilla PLM baseline Our RoBERTa in almost all the settings except ConceptNet, which focuses on commonsense knowledge rather than factual knowledge. It indicates that KEPLER can indeed better integrate factual knowledge. (2) Although KEPLER-W+W cannot outperform KEPLER-Wiki on NLP tasks (Section 4.2), it shows significant improvements in LAMA-UHN, which suggests that we should explore which kind of knowledge is needed in different scenarios in the future. (3) All the RoBERTa-based models perform worse than vanilla BERTBASE by a large margin, which is consistent with the results of Wang et al. (2020). This may be due to the different vocabularies used in BERT and RoBERTa, which presents the vulnerability of LAMA-style probing again (Kassner and Schütze, 2020). We will leave developing a better knowledge probing framework as our future work.

5.3 Running Time Comparison

Compared to vanilla PLMs, KEPLER does not introduce any additional parameters or computations during fine-tuning and inference, which is efficient for practical use. We compare the running time of KEPLER and other knowledge-enhanced PLMs (ERNIE and KnowBert) in Table 12. The time is evaluated on the TACRED training set for one epoch with one NVIDIA Tesla V100 (32 GB), and all models use 32 batch size and 128 sequence length. The ‘‘entity linking’’ time of KnowBert is for entity candidate generation. We can observe that KEPLER requires much less running time since it does not need entity linking or entity embedding fusion, which will benefit time-sensitive applications.

5.4 Correlation with Entity Frequency

To better understand how KEPLER helps the entity-centric tasks, we provide analyses on the correlations between KEPLER performance and entity frequency in this section. The motivation is to verify a natural hypothesis that KEPLER improvements mainly come from better representing the entity mentions in text, especially the rare entities, which do not show up frequently in the pre-training corpora and thus cannot be well learned by the language modeling objectives.

We perform entity linking for the TACRED dataset with BLINK (Wu et al., 2020) to link the entity mentions in text to their corresponding Wikipedia identifiers. Then we count the occurrences of the entities in Wikipedia with the hyperlinks in rich text, denoting the entity frequencies. We conduct two experiments to analyze the correlations between KEPLER performance and entity frequency: (1) In Table 13, we divide the entity mentions into five parts by their frequencies, and compare the TACRED performances while only keeping entities in one part and masking the others. (2) In Figure 3, we sequentially mask the entity mentions in the ascending order of entity frequencies and see the F-1 changes.

               LAMA                                      LAMA-UHN
Model          Google-RE   T-REx   ConceptNet   SQuAD    Google-RE   T-REx
BERT           9.8         31.1    15.6         14.1     4.7         21.8
RoBERTa        5.3         24.7    19.5         9.1      2.2         17.0
Our RoBERTa    7.0         23.2    19.0         8.0      2.8         15.7
KEPLER-Wiki    7.3         24.6    18.7         14.3     3.3         16.5
KEPLER-W+W     7.3         24.4    17.6         10.8     4.1         17.1

Table 11: P@1 results on the knowledge probing benchmarks LAMA and LAMA-UHN.

Model             Entity Linking   Fine-tuning   Inference
ERNIERoBERTa      780s             730s          194s
KnowBertRoBERTa   190s             677s          235s
KEPLER            0s               508s          152s

Table 12: Three parts of running time for one epoch of the TACRED training set.

Figure 3: TACRED performance (F-1) of KEPLER and RoBERTa changing with the rate of entity mentions being masked.

Entity Frequency   0%-20%   20%-40%   40%-60%   60%-80%   80%-100%
KEPLER-Wiki        64.7     64.4      64.8      64.7      68.8
Our RoBERTa        64.1     64.3      64.5      64.3      68.5
Improvement        +0.6     +0.1      +0.3      +0.4      +0.3

Table 13: F-1 scores on TACRED (%) under different settings by entity frequencies. We sort the entity mentions in TACRED by their corresponding entity frequencies in Wikipedia. The ‘‘0%-20%’’ setting indicates only keeping the least frequent 20% entity mentions and masking all the other entity mentions (for both training and validation), and so on. The results are averaged over 5 runs.

From the results, we can observe that: (1) Figure 3 shows that when the entity masking rate is low, the improvements of KEPLER over RoBERTa are generally much higher than when the entity masking rate is high. It indicates that the improvements of KEPLER do mainly come from better modeling entities in context. However, even when all the entity mentions are masked, KEPLER still outperforms RoBERTa. We claim this is because the KE objective can also help to learn to understand fact-related text since it requires the model to recall facts from textual descriptions. This claim is further substantiated in Section 5.5.

(2) From Table 13, we can observe that the improvement in the ‘‘0%-20%’’ setting is marginally higher than in the other settings, which demonstrates that KEPLER does have special advantages on modeling rare entities compared to vanilla PLMs. But the improvements in the frequent settings are also significant and we cannot say that the overall improvements of KEPLER are mostly from the rare entities. Generally, the results in Table 13 show that KEPLER can better model all the entities, no matter rare or frequent.

5.5 Understanding Text or Storing Knowledge

We argue that by jointly training the KE and the MLM objectives, KEPLER (1) can better understand fact-related text and better extract knowledge from text, and also (2) can remember factual knowledge. To investigate the two abilities of KEPLER in a quantitative aspect, we carry out an experiment on TACRED, in which the head and tail entity mentions are masked (masked-entity, ME) or only head and tail entity mentions are shown (only-entity, OE). The ME setting shows to what extent the models can extract facts only from the textual context without the clues in entity names. The OE setting demonstrates to what extent the models can store and predict factual knowledge, as only the entity names are given to the models.

As shown in Table 14, KEPLER-Wiki shows significant improvements over Our RoBERTa in both settings, which suggests that KEPLER has indeed possessed superior abilities on both extracting and storing knowledge compared to vanilla PLMs without knowledge infusion. And the KEPLER-KE model performs poorly in the ME setting but achieves marginal improvements in the OE setting. It indicates that without the help of the MLM objective, KEPLER only learns the entity description embeddings and degenerates in general language understanding, while it can still remember knowledge into entity names to some extent.

Model          ME     OE
Our RoBERTa    54.0   46.8
KEPLER-KE      40.2   47.0
KEPLER-Wiki    54.8   48.9

Table 14: Masked-entity (ME) and only-entity (OE) F-1 scores on TACRED (%).

6 Related Work

Pre-training in NLP There has been a long history of pre-training in NLP. Early works focus on distributed word representations (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014), many of which are often adopted in current models as word embeddings.



These pre-trained embeddings can capture the semantics of words from large-scale corpora and thus benefit NLP applications. Peters et al. (2018) push this trend a step forward by using a bidirectional LSTM to form contextualized word embeddings (ELMo) for richer semantic meanings under different circumstances.

Apart from word embeddings, there is another trend exploring pre-trained language models. Dai and Le (2015) propose to train an auto-encoder on unlabeled textual data and then fine-tune it on downstream tasks. Howard and Ruder (2018) propose a universal language model (ULMFiT). With the powerful Transformer architecture (Vaswani et al., 2017), Radford et al. (2018) demonstrate an effective pre-trained generative model (GPT). Later, Devlin et al. (2019) release a pre-trained deep Bidirectional Encoder Representation from Transformers (BERT), achieving state-of-the-art performance on a wide range of NLP benchmarks. After BERT, similar PLMs spring up recently. Yang et al. (2019) propose a permutation language model (XLNet). Later, Liu et al. (2019c) show that more data and more parameter tuning can benefit PLMs, and release a new state-of-the-art model (RoBERTa). Other works explore how to add more tasks (Liu et al., 2019b) and more parameters (Raffel et al., 2020; Lan et al., 2020) to PLMs.

Knowledge-Enhanced PLMs Recently, many works have investigated how to incorporate knowledge into PLMs. MTB (Baldini Soares et al., 2019) takes a straightforward ‘‘matching the blank’’ pre-training objective to help the relation classification task. ERNIE (Zhang et al., 2019) identifies entity mentions in text and links pre-processed knowledge embeddings to the corresponding positions, which shows improvements on several NLP benchmarks. With a similar idea as ERNIE, KnowBert (Peters et al., 2019) incorporates an integrated entity linker in their model and adopts end-to-end training. Besides, Logan et al. (2019) and Hayashi et al. (2020) utilize relations between entities inside one sentence to train better generation models. Xiong et al. (2019) adopt entity replacement knowledge learning for improving entity-related tasks.

Some contemporaneous or following works try to inject factual knowledge into PLMs in different ways. E-BERT (Poerner et al., 2020) aligns entity embeddings with word embeddings and then directly adds the aligned embeddings into BERT to avoid additional pre-training. K-Adapter (Wang et al., 2020) injects knowledge with additional neural adapters to support continuous learning.


Knowledge Embedding KE methods have been extensively studied. Conventional KE models define different scoring functions for relational triplets. For example, TransE (Bordes et al., 2013) treats tail entities as translations of head entities and uses L1-norm or L2-norm to score triplets, while DistMult (Yang et al., 2015) uses matrix multiplications and ComplEx (Trouillon et al., 2016) adopts complex operations based on it. RotatE (Sun et al., 2019) combines the advantages of both of them.

Inductive Embedding The above KE methods learn entity embeddings only from KG and are inherently transductive, while some works (Wang et al., 2014; Xie et al., 2016; Yamada et al., 2016; Cao et al., 2017; Shi and Weninger, 2018; Cao et al., 2018) incorporate textual metadata such as entity names or descriptions to enhance the KE methods and hence can do inductive KE to some extent. Besides KG, it is also common for general inductive graph embedding methods (Hamilton et al., 2017; Bojchevski and Günnemann, 2018) to utilize additional node features like text attributes, degrees, and so on. KEPLER follows this line of studies and takes full advantage of textual information with an effective PLM.

Hamaguchi et al. (2017) and Wang et al. (2019c) perform inductive KE by aggregating the trained embeddings of the known neighboring nodes with graph neural networks, and thus do not need additional features. But these methods require the unseen nodes to be surrounded by known nodes and cannot embed new (sub)graphs. We leave how to develop KEPLER to do fully inductive KE without additional features as future work.

7 Conclusion and Future Work

In this paper, we propose KEPLER, a simple but effective unified model for knowledge embedding and pre-trained language representation. We train KEPLER with both the KE and MLM objectives to align the factual knowledge and language representation into the same semantic space, and experimental results on extensive tasks demonstrate its effectiveness on both NLP and KE applications. Besides, we propose Wikidata5M, a large-scale KG dataset to facilitate future research. In the future, we will (1) explore advanced ways for more smoothly unifying the two semantic spaces, including different KE forms and different training objectives, and (2) investigate better knowledge probing methods for PLMs to shed light on knowledge-integrating mechanisms.

Acknowledgments

This work is supported by the National Key Research and Development Program of China (No. 2018YFB1004503), the National Natural Science Foundation of China (NSFC No. U1736204, 61533018, 61772302, 61732008), grants from the Institute for Guo Qiang, Tsinghua University (2019GQB0003), and the Beijing Academy of Artificial Intelligence (BAAI2019ZD0502). Prof. Jian Tang is supported by the Natural Sciences and Engineering Research Council (NSERC) Discovery Grant and the Canada CIFAR AI Chair Program. Xiaozhi Wang and Tianyu Gao are supported by the Tsinghua University Initiative Scientific Research Program. We also thank our action editor, Prof. Doug Downey, and the anonymous reviewers for their consistent help and insightful suggestions.

References

Ivana Balazevic, Carl Allen, and Timothy Hospedales. 2019. TuckER: Tensor factorization for knowledge graph completion. In Proceedings of EMNLP-IJCNLP, pages 5185–5194. DOI: https://doi.org/10.18653/v1/D19-1522

Livio Baldini Soares, Nicholas FitzGerald,
Jeffrey Ling, and Tom Kwiatkowski. 2019.
Matching the blanks: Distributional similarity
for relation learning. In Proceedings of ACL,
pages 2895–2905. DOI: https://doi.org
/10.18653/v1/P19-1279

Aleksandar Bojchevski and Stephan Günnemann.
2018. Deep Gaussian embedding of graphs:
Unsupervised inductive learning via ranking.
In Proceedings of ICLR.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems (NIPS), pages 2787–2795.

Yixin Cao, Lei Hou, Juanzi Li, Zhiyuan Liu, Chengjiang Li, Xu Chen, and Tiansi Dong. 2018. Joint representation learning of cross-lingual words and entities via attentive distant supervision. In Proceedings of EMNLP, pages 227–237. DOI: https://doi.org/10.18653/v1/D18-1021

Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge text and knowledge by learning multi-prototype entity mention embedding. In Proceedings of ACL, pages 1623–1633. DOI: https://doi.org/10.18653/v1/P17-1149

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML, pages 160–167.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems (NIPS), pages 3079–3087.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2019. FewRel 2.0: Towards more challenging few-shot relation classification. In Proceedings of EMNLP-IJCNLP, pages 6251–6256.

Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. 2017. Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach. In Proceedings of IJCAI, pages 1802–1808. DOI: https://doi.org/10.24963/ijcai.2017/250

William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NIPS), pages 1025–1035.


Xu Han, Zhiyuan Liu, and Maosong Sun. 2018A.
Neural knowledge acquisition via mutual
attention between knowledge graph and text.
In Proceedings of AAAI, pages 4832–4839.

Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang,
Yuan Yao, Zhiyuan Liu, and Maosong Sun.
2018乙. FewRel: A large-scale supervised few-
shot relation classification dataset with state-of-
the-art evaluation. In Proceedings of EMNLP,
pages 4803–4809. DOI: https://doi.org
/10.18653/v1/D18-1514

Hiroaki Hayashi, Zecong Hu, Chenyan Xiong,
and Graham Neubig. 2020. Latent relation
language models. In Proceedings of AAAI,
pages 7911–7918. DOI: https://doi.org
/10.1609/aaai.v34i05.6298

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of ACL, pages 328–339. DOI: https://doi.org/10.18653/v1/P18-1031, PMID: 28889062

Nora Kassner and Hinrich Schütze. 2020. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. In Proceedings of ACL, pages 7811–7818. DOI: https://doi.org/10.18653/v1/2020.acl-main.698

Seyed Mehran Kazemi and David Poole. 2018. SimplE embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems (NeurIPS), pages 4284–4295.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of ICLR.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of AAAI, pages 2181–2187.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019a. Linguistic knowledge and transferability of contextual representations. In Proceedings of NAACL-HLT, pages 1073–1094.

Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020. K-BERT: Enabling language representation with knowledge graph. In Proceedings of AAAI, pages 2901–2908. DOI: https://doi.org/10.1609/aaai.v34i03.5681

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019b. Multi-task deep neural networks for natural language understanding. In Proceedings of ACL, pages 4487–4496.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019C. RoBERTa: A robustly
optimized BERT pretraining approach. CoRR,
cs.CL/1907.11692v1.

Robert Logan, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh. 2019. Barack’s wife Hillary: Using knowledge graphs for fact-aware language modeling. In Proceedings of ACL, pages 5962–5971. DOI: https://doi.org/10.18653/v1/P19-1598

Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, and Honglak Lee. 2019. Zero-shot entity linking by reading entity descriptions. In Proceedings of ACL, pages 3449–3460. DOI: https://doi.org/10.18653/v1/P19-1335

Michael McCloskey and Neal J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier. DOI: https://doi.org/10.1016/S0079-7421(08)60536-8

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), pages 3111–3119.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41. DOI: https://doi.org/10.1145/219717.219748

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT (Demonstrations), pages 48–53. DOI: https://doi.org/10.18653/v1/N19-4009

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543. DOI: https://doi.org/10.3115/v1/D14-1162

Matthew Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextu-
alized word representations. In Proceedings
of NAACL-HLT, pages 2227–2237. DOI:
https://doi.org/10.18653/v1/N18
-1202

Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of EMNLP-IJCNLP, pages 43–54. DOI: https://doi.org/10.18653/v1/D19-1005, PMID: 31383442

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP-IJCNLP, pages 2463–2473. DOI: https://doi.org/10.18653/v1/D19-1250

Nina Poerner, Ulli Waltinger, and Hinrich Schütze. 2020. E-BERT: Efficient-yet-effective entity embeddings for BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 803–818. DOI: https://doi.org/10.18653/v1/2020.findings-emnlp.71

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL, pages 1715–1725. DOI: https://doi.org/10.18653/v1/P16-1162

Baoxu Shi and Tim Weninger. 2018. Open-world
knowledge graph completion. In Proceedings
of AAAI, pages 1957–1964.

Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), pages 4077–4087.

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge graph embedding by relational rotation in complex space. In Proceedings of ICLR.

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In Proceedings of ICML, pages 2071–2080.

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
Neural Information Processing Systems (NIPS),
pages 5998–6008.

Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, Berlin Chen, Benjamin Van Durme, Edouard Grave, Ellie Pavlick, and Samuel R. Bowman. 2019a. Can you tell me how to get past sesame street? Sentence-level pretraining beyond language modeling. In Proceedings of ACL, pages 4465–4476. DOI: https://doi.org/10.18653/v1/P19-1439

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR. DOI: https://doi.org/10.18653/v1/W18-5446

PeiFeng Wang, Jialong Han, Chenliang Li, and Rong Pan. 2019c. Logic attention based neighborhood aggregation for inductive knowledge graph embedding. In Proceedings of AAAI, pages 7152–7159. DOI: https://doi.org/10.1609/aaai.v33i01.33017152

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu
Wei, Xuanjing Huang, Jianshu Ji, Cuihong Cao,
Daxin Jiang, and Ming Zhou. 2020. K-Adapter:
Infusing knowledge into pre-trained models
with adapters. CoRR, cs.CL/2002.01808v3.

Zhen Wang, Jianwen Zhang, Jianlin Feng, 和
Zheng Chen. 2014. Knowledge graph and text
jointly embedding. In Proceedings of EMNLP,
pages 1591–1601. DOI: https://doi.org
/10.3115/v1/D14-1167

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, pages 1112–1122. DOI: https://doi.org/10.18653/v1/N18-1101

Ledell Wu, Fabio Petroni, Martin Josifoski,
Sebastian Riedel, and Luke Zettlemoyer. 2020.
Scalable zero-shot entity linking with dense
entity retrieval. In Proceedings of EMNLP,
pages 6397–6407.

Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. 2016. Representation learning of knowledge graphs with entity descriptions. In Proceedings of AAAI, pages 2659–2665.

Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. 2019. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In Proceedings of ICLR.


Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint learning of the embedding of words and entities for named entity disambiguation. In Proceedings of CoNLL, pages 250–259. DOI: https://doi.org/10.18653/v1/K16-1025

Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In Proceedings of ACL, pages 1436–1446. DOI: https://doi.org/10.18653/v1/P17-1132

Bishan Yang, Scott Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of ICLR.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems (NeurIPS), pages 5754–5764.

Poorya Zaremoodi, Wray Buntine, and Gholamreza Haffari. 2018. Adaptive knowledge sharing in multi-task learning: Improving low-resource neural machine translation. In Proceedings of ACL, pages 656–661. DOI: https://doi.org/10.18653/v1/P18-2104

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of EMNLP, pages 35–45. DOI: https://doi.org/10.18653/v1/D17-1004

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of ACL, pages 1441–1451. DOI: https://doi.org/10.18653/v1/P19-1139

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of ICCV, pages 19–27. DOI: https://doi.org/10.1109/ICCV.2015.11

Zhaocheng Zhu, Shizhen Xu, Jian Tang, and Meng Qu. 2019. GraphVite: A high-performance CPU-GPU hybrid system for node embedding. In Proceedings of WWW, pages 2494–2504.
