How Can We Know What Language Models Know?


Zhengbao Jiang1∗ Frank F. Xu1∗

Jun Araki2 Graham Neubig1

1Language Technologies Institute, Carnegie Mellon University
2Bosch Research North America

{zhengbaj,fangzhex,gneubig}@cs.cmu.edu

jun.araki@us.bosch.com

Abstract

Recent work has presented intriguing results examining the knowledge contained in language models (LMs) by having the LM fill in the blanks of prompts such as ''Obama is a ___ by profession''. These prompts are usually manually created, and quite possibly sub-optimal; another prompt such as ''Obama worked as a ___'' may result in more accurately predicting the correct profession. Because of this, given an inappropriate prompt, we might fail to retrieve facts that the LM does know, and thus any given prompt only provides a lower bound estimate of the knowledge contained in an LM. In this paper, we attempt to more accurately estimate the knowledge contained in LMs by automatically discovering better prompts to use in this querying process. Specifically, we propose mining-based and paraphrasing-based methods to automatically generate high-quality and diverse prompts, as well as ensemble methods to combine answers from different prompts. Extensive experiments on the LAMA benchmark for extracting relational knowledge from LMs demonstrate that our methods can improve accuracy from 31.1% to 39.6%, providing a tighter lower bound on what LMs know. We have released the code and the resulting LM Prompt And Query Archive (LPAQA) at https://github.com/jzbjyb/LPAQA.

1 Introduction

Recent years have seen the primary role of language models (LMs) transition from generating or evaluating the fluency of natural text (Mikolov and Zweig, 2012; Merity et al., 2018; Melis et al., 2018; Gamon et al., 2005) to being a powerful tool for text understanding. This understanding has mainly been achieved through the use of language modeling as a pre-training task for feature extractors, where the hidden vectors learned through a language modeling objective are then used in down-stream language understanding systems (Dai and Le, 2015; Melamud et al., 2016; Peters et al., 2018; Devlin et al., 2019).

∗ The first two authors contributed equally.

Interestingly, it is also becoming apparent that LMs1 themselves can be used as a tool for text understanding by formulating queries in natural language and either generating textual answers directly (McCann et al., 2018; Radford et al., 2019), or assessing multiple choices and picking the most likely one (Zweig and Burges, 2011; Rajani et al., 2019). For example, LMs have been used to answer factoid questions (Radford et al., 2019), answer common sense queries (Trinh and Le, 2018; Sap et al., 2019), or extract factual knowledge about relations between entities (Petroni et al., 2019; Baldini Soares et al., 2019). Regardless of the end task, the knowledge contained in LMs is probed by providing a prompt, and letting the LM either generate the continuation of a prefix (e.g., ''Barack Obama was born in ___''), or predict missing words in a cloze-style template (e.g., ''Barack Obama is a ___ by profession'').

However, while this paradigm has been used to achieve a number of intriguing results regarding the knowledge expressed by LMs, these results usually rely on prompts that were manually created based on the intuition of the experimenter. These manually created prompts (e.g., ''Barack Obama was born in ___'') might be sub-optimal because LMs might have learned target knowledge from substantially different contexts (e.g., ''The birth place of Barack Obama is Honolulu, Hawaii.'') during their training. Thus it is quite possible that a fact that the LM does know cannot be retrieved due to the prompts not being effective queries for the fact. Hence, existing results are simply a lower bound on the extent of knowledge contained

1Some models we use in this paper, e.g., BERT (Devlin et al., 2019), are bi-directional, and do not directly define a probability distribution over text, which is the underlying definition of an LM. Nonetheless, we call them LMs for simplicity.



in LMs, and in fact, LMs may be even more knowledgeable than these initial results indicate. In this paper we ask the question: ''How can we tighten this lower bound and get a more accurate estimate of the knowledge contained in state-of-the-art LMs?'' This is interesting both scientifically, as a probe of the knowledge that LMs contain, and from an engineering perspective, as it will result in higher recall when using LMs as part of a knowledge extraction system.

In particular, we focus on the setting of Petroni et al. (2019), who examine extracting knowledge regarding the relations between entities (definitions in § 2). We propose two automatic methods to systematically improve the breadth and quality of the prompts used to query the existence of a relation (§ 3). Specifically, as illustrated in Figure 1, these are mining-based methods inspired by previous relation extraction methods (Ravichandran and Hovy, 2002), and paraphrasing-based methods that take a seed prompt (either manually created or automatically mined) and paraphrase it into several other semantically similar expressions. Further, because different prompts may work better when querying for different subject-object pairs, we also investigate lightweight ensemble methods to combine the answers from different prompts together (§ 4).

We experiment on the LAMA benchmark (Petroni et al., 2019), which is an English-language benchmark devised to test the ability of LMs to retrieve relations between entities (§ 5). We first demonstrate that improved prompts significantly improve accuracy on this task, with the one-best prompt extracted by our method raising accuracy from 31.1% to 34.1% on BERT-base (Devlin et al., 2019), with similar gains being obtained with BERT-large as well. We further demonstrate that using a diversity of prompts through ensembling further improves accuracy to 39.6%. We perform extensive analysis and ablations, gleaning insights both about how to best query the knowledge stored in LMs and about potential directions for incorporating knowledge into LMs themselves. Finally, we have released the resulting LM Prompt And Query Archive (LPAQA) to facilitate future experiments on probing knowledge contained in LMs.

2 Knowledge Retrieval from LMs

Retrieving factual knowledge from LMs is quite different from querying standard declarative knowledge bases (KBs). In standard KBs, users formulate their information needs as a structured query defined by the KB schema and query language. For example, SELECT ?y WHERE {wd:Q76 wdt:P19 ?y} is a SPARQL query to search for the birth place of Barack Obama. In contrast, LMs must be queried by natural language prompts, such as ''Barack Obama was born in ___'', and the word assigned the highest probability in the blank will be returned as the answer. Unlike deterministic queries on KBs, this provides no guarantees of correctness or success.

While the idea of prompts is common to methods for extracting many varieties of knowledge from LMs, in this paper we specifically follow the formulation of Petroni et al. (2019), where factual knowledge is in the form of triples ⟨x, r, y⟩. Here x indicates the subject, y indicates the object, and r is their corresponding relation. To query the LM, r is associated with a cloze-style prompt t_r consisting of a sequence of tokens, two of which are placeholders for subjects and objects (e.g., ''x plays at y position''). The existence of the fact in the LM is assessed by replacing x with the surface form of the subject, and letting the model predict the missing object (e.g., ''LeBron James plays at ___ position''):2

ŷ = argmax_{y′ ∈ V} P_LM(y′ | x, t_r),

2We can also go the other way around by filling in the objects and predicting the missing subjects. Since our focus is on improving prompts, we choose to be consistent with Petroni et al. (2019) to make a fair comparison, and leave exploring other settings to future work. Also notably, Petroni et al. (2019) only use objects consisting of a single token, so we only need to predict one word for the missing slot.

Figure 1: Top-5 predictions and their log probabilities using different prompts (manual, mined, and paraphrased) to query BERT. The correct answer is underlined.


where V is the vocabulary, and P_LM(y′ | x, t_r) is the LM probability of predicting y′ in the blank conditioned on the other tokens (i.e., the subject and the prompt).3 We say that an LM has knowledge of a fact if ŷ is the same as the ground-truth y. Because we would like our prompts to most effectively elicit any knowledge contained in the LM itself, a ''good'' prompt should trigger the LM to predict the ground-truth objects as often as possible.
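To make this querying procedure concrete, the following minimal sketch fills the blank of a cloze prompt and keeps the highest-probability word as ŷ. It assumes the HuggingFace transformers library and the bert-base-cased checkpoint, and the [X]/[Y] placeholder convention is our own illustration rather than the exact implementation used in our experiments.

# Sketch of cloze-style querying (assumes HuggingFace transformers is installed;
# the checkpoint and placeholder format below are illustrative choices).
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")

def query(prompt_with_blank, subject):
    # Instantiate the prompt for a subject and replace the blank with BERT's mask token.
    text = prompt_with_blank.replace("[X]", subject).replace("[Y]", fill.tokenizer.mask_token)
    predictions = fill(text)          # list of {token_str, score, ...}, highest score first
    return predictions[0]["token_str"], predictions[0]["score"]

print(query("[X] was born in [Y].", "Barack Obama"))   # e.g., ('Hawaii', ...)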

In prior work (McCann et al., 2018; Radford et al., 2019; Petroni et al., 2019), t_r has been a single manually defined prompt based on the intuition of the experimenter. As noted in the introduction, this method has no guarantee of being optimal, and thus we propose methods that learn effective prompts from a small set of training data consisting of gold subject-object pairs for each relation.

3 Prompt Generation

First, we tackle prompt generation: the task of generating a set of prompts {t_{r,i}}_{i=1}^T for each relation r, where at least some of the prompts effectively trigger LMs to predict ground-truth objects. We employ two practical methods to either mine prompt candidates from a large corpus (§ 3.1) or diversify a seed prompt through paraphrasing (§ 3.2).

3.1 Mining-based Generation

Our first method is inspired by template-based relation extraction methods (Agichtein and Gravano, 2000; Ravichandran and Hovy, 2002), which are based on the observation that words in the vicinity of the subject x and object y in a large corpus often describe the relation r. Based on this intuition, we first identify all the Wikipedia sentences that contain both subjects and objects of a specific relation r using the assumption of distant supervision, then propose two methods to extract prompts.

Middle-word Prompts Following the observation that words in the middle of the subject and object are often indicative of the relation, we

3We restrict to masked LMs in this paper because the
missing slot might not be the last token in the sentence and
computing this probability in traditional left-to-right LMs
using Bayes’ theorem is not tractable.

directly use those words as prompts. For example, ''Barack Obama was born in Hawaii'' is converted into a prompt ''x was born in y'' by replacing the subject and the object with placeholders.
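As a rough illustration of this strategy (our sketch, not the released extraction code), a middle-word prompt can be carved out of any distantly supervised sentence that contains both arguments:

# Sketch: turn a sentence containing a subject and object into a middle-word prompt.
def middle_word_prompt(sentence, subject, obj):
    s, o = sentence.find(subject), sentence.find(obj)
    if s == -1 or o == -1 or s >= o:      # only handle "subject ... object" order here
        return None
    middle = sentence[s + len(subject):o].strip()
    return f"x {middle} y" if middle else None

print(middle_word_prompt("Barack Obama was born in Hawaii", "Barack Obama", "Hawaii"))
# -> 'x was born in y'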

Dependency-based Prompts Toutanova et al. (2015) note that in cases of templates where words do not appear in the middle (e.g., ''The capital of France is Paris''), templates based on syntactic analysis of the sentence can be more effective for relation extraction. We follow this insight in our second strategy for prompt creation, which parses sentences with a dependency parser to identify the shortest dependency path between the subject and object, then uses the phrase spanning from the leftmost word to the rightmost word in the dependency path as a prompt. For example, the dependency path in the above example is ''France ←pobj− of ←prep− capital ←nsubj− is −attr→ Paris'', where the leftmost and rightmost words are ''capital'' and ''Paris'', giving a prompt of ''capital of x is y''.
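This variant can be sketched with an off-the-shelf parser; the snippet below is our illustration (assuming spaCy's en_core_web_sm model and networkx, not the parser used in the paper) and uses a crude head-word anchor for the two arguments:

# Sketch: dependency-based prompt extraction (assumes spaCy and networkx are installed).
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def dependency_prompt(sentence, subject, obj):
    doc = nlp(sentence)
    g = nx.Graph()
    # One undirected edge per dependency arc (skip the root's self-loop).
    g.add_edges_from((tok.i, tok.head.i) for tok in doc if tok.i != tok.head.i)
    subj = [t for t in doc if t.text == subject.split()[-1]][0]   # crude anchor on last word
    objt = [t for t in doc if t.text == obj.split()[-1]][0]
    path = nx.shortest_path(g, subj.i, objt.i)
    left, right = min(path), max(path)
    words = [("x" if t.i == subj.i else "y" if t.i == objt.i else t.text)
             for t in doc[left:right + 1]]
    return " ".join(words)

print(dependency_prompt("The capital of France is Paris", "France", "Paris"))
# -> roughly 'capital of x is y'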

Notably, these mining-based methods do not rely on any manually created prompts, and can thus be flexibly applied to any relation where we can obtain a set of subject-object pairs. This will result in diverse prompts, covering a wide variety of ways that the relation may be expressed in text. However, it may also be prone to noise, as many prompts acquired in this way may not be very indicative of the relation (e.g., ''x, y''), even if they are frequent.

3.2 Paraphrasing-based Generation

Our second method for generating prompts is more targeted: it aims to improve lexical diversity while remaining relatively faithful to the original prompt. Specifically, we do so by paraphrasing the original prompt into other semantically similar or identical expressions. For example, if our original prompt is ''x shares a border with y'', it may be paraphrased into ''x has a common border with y'' and ''x adjoins y''. This is conceptually similar to query expansion techniques used in information retrieval that reformulate a given query to improve retrieval performance (Carpineto and Romano, 2012).

Although many methods could be used for
paraphrasing (Romano et al., 2006; Bhagat and
Ravichandran, 2008), we follow the simple


method of using back-translation (Sennrich et al., 2016; Mallinson et al., 2017) to first translate the initial prompt into B candidates in another language, each of which is then back-translated into B candidates in the original language. We then rank the B^2 candidates based on their round-trip probability (i.e., P_forward(t̄ | t̂) · P_backward(t | t̄), where t̂ is the initial prompt, t̄ is the translated prompt in the other language, and t is the final prompt), and keep the top T prompts.
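The round-trip ranking step can be sketched compactly as below. The two translation helpers are hypothetical stand-ins for whatever scored translation system is available (our experiments use WMT'19 English-German models); each is assumed to return (candidate, log-probability) pairs for its top beam hypotheses.

# Sketch of paraphrase generation by back-translation and round-trip ranking.
# `to_german` and `to_english` are hypothetical helpers returning a list of
# (translation, log_prob) pairs for the top-B beam candidates.
def paraphrase(seed_prompt, to_german, to_english, B=7, T=40):
    scored = []
    for de, lp_fwd in to_german(seed_prompt, beam=B):
        for en, lp_bwd in to_english(de, beam=B):
            # Round-trip score P_forward(t_bar | t_hat) * P_backward(t | t_bar), in log space.
            scored.append((en, lp_fwd + lp_bwd))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in scored[:T]]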

4 Prompt Selection and Ensembling

In the previous section, we described methods to generate a set of candidate prompts {t_{r,i}}_{i=1}^T for a particular relation r. Each of these prompts may be more or less effective at eliciting knowledge from the LM, and thus it is necessary to decide how to use these generated prompts at test time. In this section, we describe three methods to do so.

4.1 Top-1 Prompt Selection

For each prompt, we can measure its accuracy of predicting the ground-truth objects (on a training dataset) using:

A(t_{r,i}) = ( Σ_{⟨x,y⟩ ∈ R} δ(y = argmax_{y′} P_LM(y′ | x, t_{r,i})) ) / |R|,

where R is a set of subject-object pairs with relation r, and δ(·) is Kronecker's delta function, returning 1 if the internal condition is true and 0 otherwise. In the simplest method for querying the LM, we choose the prompt with the highest accuracy and query using only this prompt.

4.2 Rank-based Ensemble

Next we examine methods that use not only the top-1 prompt, but combine multiple prompts together. The advantage of this is that the LM may have observed different entity pairs in different contexts within its training data, and having a variety of prompts may allow for elicitation of knowledge that appeared in these different contexts.

Our first method for ensembling is a parameter-free method that averages the predictions of the top-ranked prompts. We rank all the prompts based on their accuracy of predicting the objects on the training set, and use the average log probabilities4 from the top K prompts to calculate the probability of the object:

s(y | x, r) = (1/K) Σ_{i=1}^{K} log P_LM(y | x, t_{r,i}),    (1)

P(y | x, r) = softmax(s(· | x, r))_y,    (2)

where t_{r,i} is the prompt ranked at the i-th position. Here, K is a hyper-parameter, where a small K focuses on the few most accurate prompts, and a large K increases the diversity of the prompts.

4.3 Optimized Ensemble

The above method treats the top K prompts equally, which is sub-optimal given that some prompts are more reliable than others. Thus, we also propose a method that directly optimizes prompt weights. Formally, we re-define the score in Equation 1 as:

s(y | x, r) = Σ_{i=1}^{T} P_{θ_r}(t_{r,i} | r) log P_LM(y | x, t_{r,i}),    (3)

where P_{θ_r}(t_{r,i} | r) = softmax(θ_r) is a distribution over prompts parameterized by θ_r, a T-sized real-valued vector. For every relation, we learn to score a different set of T candidate prompts, so the total number of parameters is T times the number of relations. The parameter θ_r is optimized to maximize the probability of the gold-standard objects P(y | x, r) over the training data.
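Both ensembling strategies can be sketched in a few lines of PyTorch; the tensor of per-prompt log probabilities is assumed to be precomputed by querying the LM with each candidate prompt, and the code is our illustration rather than the released implementation.

# Sketch: rank-based and optimized prompt ensembles (assumes PyTorch).
# log_probs: [N, T, V] tensor of log P_LM(y'|x, t_{r,i}) for N training pairs,
# T candidate prompts (sorted by training accuracy), and vocabulary size V.
# gold: [N] tensor of gold object ids.
import torch

def rank_based_scores(log_probs, K):
    # Equations 1-2: average the log probabilities of the K top-ranked prompts.
    s = log_probs[:, :K, :].mean(dim=1)
    return torch.log_softmax(s, dim=-1)

def optimize_weights(log_probs, gold, steps=200, lr=0.1):
    # Equation 3: learn a softmax-normalized weight per prompt for this relation.
    theta = torch.zeros(log_probs.size(1), requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        w = torch.softmax(theta, dim=0)                      # P_theta(t_{r,i} | r)
        s = (w[None, :, None] * log_probs).sum(dim=1)        # weighted sum of log probs
        loss = torch.nn.functional.cross_entropy(s, gold)    # maximize log P(y|x, r)
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.softmax(theta.detach(), dim=0)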

5 Main Experiments

5.1 Experimental Settings

In this section, we assess the extent to which our prompts can improve fact prediction performance, raising the lower bound on the knowledge we discern is contained in LMs.

Dataset As data, we use the T-REx subset (ElSahar et al., 2018) of the LAMA benchmark (Petroni et al., 2019), which has a broader set of 41 relations (compared with the Google-RE subset, which only covers 3). Each relation is associated with at most 1000 subject-object pairs from Wikidata, and a single manually designed

4Intuitively, because we are combining scores in log space, this has the effect of penalizing objects that are very unlikely given any one prompt in the collection. We also compare with linear combination in ablations in § 5.3.


prompt. To learn to mine prompts (§ 3.1), rank prompts (§ 4.2), or learn ensemble weights (§ 4.3), we create a separate training set of subject-object pairs, also from Wikidata, for each relation, with no overlap with the T-REx dataset. We denote this training set as T-REx-train. For consistency with the T-REx dataset in LAMA, T-REx-train is also chosen to contain only single-token objects. To investigate the generality of our method, we also report the performance of our methods on the Google-RE subset,5 which takes a similar form to T-REx but is relatively small and only covers three relations.

Pörner et al. (2019) note that some facts in LAMA can be recalled solely based on surface forms of entities, without memorizing facts. They filter out those easy-to-guess facts and create a more difficult benchmark, denoted as LAMA-UHN. We also conduct experiments on the T-REx subset of LAMA-UHN (i.e., T-REx-UHN) to investigate whether our methods can still obtain improvements on this harder benchmark. Dataset statistics are summarized in Table 1.

Models As for the models to probe, in our main experiments we use the standard BERT-base and BERT-large models (Devlin et al., 2019). We also perform some experiments with other pre-trained models enhanced with external entity representations, namely, ERNIE (Zhang et al., 2019) and KnowBert (Peters et al., 2019), which we believe may do better on recall of entities.

Evaluation Metrics We use two metrics to evaluate the success of prompts in probing LMs. The first evaluation metric, micro-averaged accuracy, follows the LAMA benchmark6 in calculating the accuracy over all subject-object pairs for relation r:

(1/|R|) Σ_{⟨x,y⟩ ∈ R} δ(ŷ = y),

where ŷ is the prediction and y is the ground truth. We then average across all relations.
5https://code.google.com/archive/p/relation-extraction-corpus/.

6In LAMA, this is called ''P@1.'' There might be multiple correct answers in some cases, e.g., a person speaking multiple languages, but we only use one ground truth. We leave exploring more advanced evaluation methods to future work.

Properties         T-REx   T-REx-UHN   T-REx-train
#sub-obj pairs     830.2   661.1       948.7
#unique subjects   767.8   600.8       880.1
#unique objects    150.9   120.5       354.6
object entropy     3.6     3.4         4.4

Table 1: Dataset statistics. All the values are averaged across 41 relations.

However, we found that the object distributions of some relations are extremely skewed (e.g., more than half of the objects in relation native language are French). This can lead to deceptively high scores, even for a majority-class baseline that picks the most common object for each relation, which achieves a score of 22.0%. To mitigate this problem, we also report macro-averaged accuracy, which computes accuracy for each unique object separately, then averages them together to get the relation-level accuracy:

(1/|uni_obj(R)|) Σ_{y′ ∈ uni_obj(R)} ( Σ_{⟨x,y⟩ ∈ R, y = y′} δ(ŷ = y) ) / |{y | ⟨x, y⟩ ∈ R, y = y′}|,

where uni_obj(R) returns the set of unique objects of relation r. This is a much stricter metric, with the majority-class baseline only achieving a score of 2.2%.
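For clarity, both metrics can be computed from (gold, predicted) pairs of a single relation as in this short sketch (ours, not the benchmark's evaluation script); the toy data also shows why a skewed relation inflates the micro score:

# Sketch: micro- vs. macro-averaged accuracy for one relation.
from collections import defaultdict

def micro_accuracy(golds, preds):
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def macro_accuracy(golds, preds):
    per_object = defaultdict(list)
    for g, p in zip(golds, preds):
        per_object[g].append(g == p)          # group examples by their unique gold object
    return sum(sum(v) / len(v) for v in per_object.values()) / len(per_object)

golds = ["French"] * 6 + ["German"] * 2
preds = ["French"] * 8                        # a majority-class guesser
print(micro_accuracy(golds, preds), macro_accuracy(golds, preds))  # 0.75 vs. 0.5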

Methods We attempted different methods for prompt generation and selection/ensembling, and compare them with the manually designed prompts used in Petroni et al. (2019). Majority refers to predicting the majority object for each relation, as mentioned above. Man is the baseline from Petroni et al. (2019) that only uses the manually designed prompts for retrieval. Mine (§ 3.1) uses the prompts mined from Wikipedia through both middle words and dependency paths, and Mine+Man combines them with the manual prompts. Mine+Para (§ 3.2) paraphrases the highest-ranked mined prompt for each relation, while Man+Para uses the manual one instead.

The prompts are combined either by averaging the log probabilities from the TopK highest-ranked prompts (§ 4.2) or by using the weights after optimization (§ 4.3; Opti.). Oracle represents the upper bound of the performance of the generated prompts, where a fact is judged as correct if any one of the prompts allows the LM to successfully predict the object.

Implementation Details We use the T = 40 most frequent prompts either generated through mining

or paraphrasing in all experiments, and the number of candidates in back-translation is set to B = 7. We remove prompts that contain only stopwords/punctuation or that are longer than 10 words to reduce noise. We use the round-trip English-German neural machine translation models pre-trained on WMT'19 (Ng et al., 2019) for back-translation, as English-German is one of the most highly resourced language pairs.7 When optimizing ensemble parameters, we use Adam (Kingma and Ba, 2015) with default parameters and a batch size of 32.

Prompts      Top1   Top3   Top5   Opti.  Oracle
BERT-base (Man=31.1)
Mine         31.4   34.2   34.7   38.9   50.7
Mine+Man     31.6   35.9   35.1   39.6   52.6
Mine+Para    32.7   34.0   34.5   36.2   48.1
Man+Para     34.1   35.8   36.6   37.3   47.9
BERT-large (Man=32.3)
Mine         37.0   37.0   36.4   43.7   54.4
Mine+Man     39.4   40.6   38.4   43.9   56.1
Mine+Para    37.8   38.6   38.6   40.1   51.8
Man+Para     35.9   37.3   38.0   38.8   50.0

5.2 Evaluation Results

Micro- and macro-averaged accuracies of different methods are reported in Tables 2 and 3, respectively.

Single Prompt Experiments When only one prompt is used (the first Top1 column in both tables), the best of the proposed prompt generation methods increases micro-averaged accuracy from 31.1% to 34.1% on BERT-base, and from 32.3% to 39.4% on BERT-large. This demonstrates that the manually created prompts are a somewhat weak lower bound; there are other prompts that further improve the ability to query knowledge from LMs. Table 4 shows some of the mined prompts that resulted in a large performance gain compared with the manual ones. For the relation religion, ''x who converted to y'' improved 60.0% over the manually defined prompt ''x is affiliated with the y religion'', and for the relation subclass of, ''x is a type of y'' raised the accuracy by 22.7% over ''x is a subclass of y''. The largest gains from using mined prompts seem to occur in cases where the manually defined prompt is more complicated syntactically (e.g., the former), or uses less common wording (e.g., the latter), than the mined prompt.

Prompt Ensembling Next we turn to experiments that use multiple prompts to query the LM. Comparing the single-prompt results in column 1 to the ensembled results in the following three columns, we can see that ensembling multiple prompts almost always leads to better performance. The simple average used in Top3 and

7https://github.com/pytorch/fairseq/tree/master/examples/wmt19.

Table 2: Micro-averaged accuracy of different methods (%). Majority gives us 22.0%. Italics indicate the best single-prompt accuracy, and bold indicates the best non-oracle accuracy overall.

Prompts      Top1   Top3   Top5   Opti.  Oracle
BERT-base (Man=22.8)
Mine         20.7   22.7   23.9   25.7   36.2
Mine+Man     21.3   23.8   24.8   26.6   38.0
Mine+Para    21.2   22.4   23.0   23.6   34.1
Man+Para     22.8   23.8   24.6   25.0   34.9
BERT-large (Man=25.7)
Mine         26.4   26.3   25.9   30.1   40.7
Mine+Man     28.1   28.3   27.3   30.7   42.2
Mine+Para    26.2   27.1   27.0   27.1   38.3
Man+Para     25.9   27.8   28.3   28.0   39.3

Table 3: Macro-averaged accuracy of different methods (%). Majority gives us 2.2%. Italics indicate the best single-prompt accuracy, and bold indicates the best non-oracle accuracy overall.

Top5 outperforms Top1 across different prompt generation methods. The optimized ensemble further raises micro-averaged accuracy to 38.9% and 43.7% on BERT-base and BERT-large, respectively, outperforming the rank-based ensemble by a large margin. These two sets of results demonstrate that diverse prompts can indeed query the LM in different ways, and that the optimization-based method is able to find weights that effectively combine different prompts together.

We list the learned weights of top-3 mined
prompts and accuracy gain over only using
the top-1 prompt in Table 5. Weights tend to
concentrate on one particular prompt, and the other
prompts serve as complements. We also depict the
performance of the rank-based ensemble method


ID     Relation                 Manual Prompts                        Mined Prompts              Acc. Gain
P140   religion                 x is affiliated with the y religion   x who converted to y       +60.0
P159   headquarters location    The headquarter of x is in y          x is based in y            +4.9
P20    place of death           x died in y                           x died at his home in y    +4.6
P264   record label             x is represented by music label y     x recorded for y           +17.2
P279   subclass of              x is a subclass of y                  x is a type of y           +22.7
P39    position held            x has the position of y               x is elected y             +7.9

Table 4: Micro-averaged accuracy gain (%) of the mined prompts over the manual prompts.

ID     Relation       Prompts and Weights                                                               Acc. Gain
P127   owned by       x is owned by y (.485)  x was acquired by y (.151)  x division of y (.151)        +7.0
P140   religion       x who converted to y (.615)  y tirthankara x (.190)  y dedicated to x (.110)      +12.2
P176   manufacturer   y introduced the x (.594)  y announced the x (.286)  x attributed to the y (.111) +7.0

Table 5: Weights of the top-3 mined prompts, and the micro-averaged accuracy gain (%) over using the top-1 prompt.

with respect to the number of prompts in Figure 2.
For mined prompts, top-2 or top-3 usually gives
us the best results, while for paraphrased prompts,
top-5 is the best. Incorporating more prompts does
not always improve accuracy, a finding consistent
with the rapidly decreasing weights learned by
the optimization-based method. The gap between
Oracle and Opti. indicates that there is still space
for improvement using better ensemble methods.

Mining vs. Paraphrasing For the rank-based ensembles (Top1, 3, 5), prompts generated by paraphrasing usually perform better than mined prompts, while for the optimization-based ensemble (Opti.), mined prompts perform better. We conjecture this is because mined prompts exhibit more variation compared to paraphrases, and proper weighting is of central importance. This difference in variation can be observed in the average edit distance between the prompts of each class, which is 3.27 and 2.73 for mined and paraphrased prompts, respectively. However, the improvement led by ensembling paraphrases is still significant over just using one prompt (Top1 vs. Opti.), raising micro-averaged accuracy from 32.7% to 36.2% on BERT-base, and from 37.8% to 40.1% on BERT-large. This indicates that even small modifications to prompts can result in relatively large changes in predictions. Table 6 demonstrates cases where modification of one word (either a function or a content word) leads to significant accuracy


Figure 2: Performance for different top-K ensembles.

ID     Modifications                Acc. Gain
P413   x plays in→at y position     +23.2
P495   x was created→made in y      +10.8
P495   x was→is created in y        +10.0
P361   x is a part of y             +2.7
P413   x plays in y position        +2.2

Table 6: Small modifications (update, insert, and delete) in paraphrases lead to large accuracy gains (%).

improvements, indicating that large-scale LMs
are still brittle to small changes in the ways they
are queried.

Middle-word vs. Dependency-based We com-
pare the performance of only using middle-
word prompts and concatenating them with
dependency-based prompts in Table 7. The


Prompts    Top1   Top3   Top5   Opti.  Oracle
Mid        30.7   32.7   31.2   36.9   45.1
Mid+Dep    31.4   34.2   34.7   38.9   50.7

Table 7: Ablation study of middle-word and dependency-based prompts on BERT-base.

Model      Man    Mine   Mine+Man   Mine+Para   Man+Para
BERT       31.1   38.9   39.6       36.2        37.3
ERNIE      32.1   42.3   43.8       40.1        41.1
KnowBert   26.2   34.1   34.6       31.9        32.1

Table 8: Micro-averaged accuracy (%) of various LMs.

improvements confirm our intuition that words
belonging to the dependency path but not in the
middle of the subject and object are also indicative
of the relation.

Micro vs. Macro Comparing Tables 2 and 3, we can see that macro-averaged accuracy is much lower than micro-averaged accuracy, indicating that macro-averaged accuracy is a more challenging metric that evaluates how many unique objects LMs know. Our optimization-based method improves macro-averaged accuracy from 22.8% to 25.7% on BERT-base, and from 25.7% to 30.1% on BERT-large. This again confirms the effectiveness of ensembling multiple prompts, but the gains are somewhat smaller. Notably, in our optimization-based method, the ensemble weights are optimized on each example in the training set, which is more conducive to optimizing micro-averaged accuracy. Optimization to improve macro-averaged accuracy is a potentially interesting direction for future work that may result in prompts more generally applicable to different types of objects.

Performance of Different LMs In Table 8, we compare BERT with ERNIE and KnowBert, which are enhanced with external knowledge by explicitly incorporating entity embeddings. ERNIE outperforms BERT by 1 point even with the manually defined prompts, but our prompt generation methods further emphasize the difference between the two models, with the highest accuracy numbers differing by 4.2 points using the Mine+Man method. This

Model        Man    Mine   Mine+Man   Mine+Para   Man+Para
BERT-base    21.3   28.7   29.4       26.8        27.0
BERT-large   24.2   34.5   34.5       31.6        29.8

Table 9: Micro-averaged accuracy (%) on LAMA-UHN.

Model        Man    Mine   Mine+Man   Mine+Para   Man+Para
BERT-base    9.8    10.0   10.4       9.6         10.0
BERT-large   10.5   10.6   11.3       10.4        10.7

Table 10: Micro-averaged accuracy (%) on Google-RE.

suggests that if LMs are queried effectively, the differences between highly performant models may become clearer. KnowBert underperforms BERT on LAMA, which is opposite to the observation made in Peters et al. (2019). This is probably because multi-token subjects/objects are used to evaluate KnowBert in Peters et al. (2019), while LAMA contains only single-token objects.

LAMA-UHN Evaluation The performances on the LAMA-UHN benchmark are reported in Table 9. Although the overall performances drop dramatically compared to those on the original LAMA benchmark (Table 2), optimized ensembles can still outperform manual prompts by a large margin, indicating that our methods are effective in retrieving knowledge that cannot be inferred based on surface forms.

5.3 Analysis

Next, we perform further analysis to better understand what types of prompts are most suitable for facilitating retrieval of knowledge from LMs.

Prediction Consistency by Prompt We first analyze the conditions under which prompts will yield different predictions. We define the divergence between the predictions of two prompts t_{r,i} and t_{r,j} using the following equation:

Div(t_{r,i}, t_{r,j}) = ( Σ_{⟨x,y⟩ ∈ R} δ(C(x, y, t_{r,i}) ≠ C(x, y, t_{r,j})) ) / |R|,

where C(x, y, t_{r,i}) = 1 if prompt t_{r,i} can successfully predict y and 0 otherwise, and δ(·) is


Figure 3: Correlation of edit distance between prompts and their prediction divergence.

Figure 4: Ranking position distribution of prompts with different patterns. Lower is better.

x/y V y/x  |  x/y V P y/x  |  x/y V W* P y/x

V = verb particle? adv?
W = (noun | adj | adv | pron | det)
P = (prep | particle | inf. marker)

Table 11: Three part-of-speech-based regular expressions used in ReVerb to identify relational phrases.

Kronecker's delta. For each relation, we normalize the edit distance between two prompts into [0, 1] and bucket the normalized distance into five bins with intervals of 0.2. We plot a box chart for each bin to visualize the distribution of prediction divergence in Figure 3, with the green triangles representing mean values and the green bars in the box representing median values. As the edit distance becomes larger, the divergence increases, which confirms our intuition that very different prompts tend to cause different prediction results. The Pearson correlation coefficient is 0.25, which shows that there is a weak correlation between these two quantities.
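The divergence statistic itself is simple to compute once each prompt's per-example correctness is known; a brief sketch (ours):

# Sketch: prediction divergence between two prompts over a relation's test pairs.
def divergence(correct_i, correct_j):
    # correct_i / correct_j: lists of 0/1 flags, one per subject-object pair,
    # indicating whether prompts t_{r,i} and t_{r,j} predicted the gold object.
    assert len(correct_i) == len(correct_j)
    return sum(a != b for a, b in zip(correct_i, correct_j)) / len(correct_i)

print(divergence([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5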

Performance on Google-RE We also report the performance of the optimized ensemble on the Google-RE subset in Table 10. Again, ensembling diverse prompts improves accuracies for both the BERT-base and BERT-large models. The gains are somewhat smaller than those on the T-REx subset, which might be caused by the fact that there are only three relations and one of them (predicting the birth date of a person) is particularly hard, to the extent that only one prompt yields non-zero accuracy.


POS-based Analysis Next, we try to examine which types of prompts tend to be effective in the abstract, by examining the part-of-speech (POS) patterns of prompts that successfully extract knowledge from LMs. In open information extraction systems (Banko et al., 2007), manually defined patterns are often leveraged to filter out noisy relational phrases. For example, ReVerb (Fader et al., 2011) incorporates the three syntactic constraints listed in Table 11 to improve the coherence and informativeness of the mined relational phrases. To test whether these patterns are also indicative of the ability of a prompt to retrieve knowledge from LMs, we use these three patterns to group prompts generated by our methods into four clusters, where the ''other'' cluster contains prompts that do not match any pattern. We then calculate the rank of each prompt within the extracted prompts, and plot the distribution of ranks using box plots in Figure 4.8 We can see that the average rank of prompts matching these patterns is better than that of those in the ''other'' group, confirming our intuition that good prompts should conform to these patterns. Some of the best performing prompts' POS signatures are ''x VBD VBN IN y'' (e.g., ''x was born in y'') and ''x VBZ DT NN IN y'' (e.g., ''x is the capital of y'').
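The grouping step can be approximated as follows; this is a rough sketch of ours that assumes NLTK with the averaged perceptron tagger installed, and the regular expression is only a simplified stand-in for ReVerb's first constraint, not ReVerb's actual implementation.

# Sketch: check whether a prompt's POS tag sequence matches a ReVerb-like pattern.
import re
from nltk import pos_tag

def pos_signature(prompt):
    words = prompt.split()
    tags = pos_tag(words)                       # tag the whole sequence for better context
    # Keep the x/y placeholders themselves; replace every other word by its POS tag.
    return " ".join(w if w in ("x", "y") else t for w, t in tags)

# Simplified stand-in for the "x V P y" style constraint (verbs/adverbs/prepositions only).
PATTERN = re.compile(r"^x( (VB[DZNPG]?|RB|IN|TO))+ y$")

for p in ["x was born in y", "x , y"]:
    sig = pos_signature(p)
    print(p, "->", sig, bool(PATTERN.match(sig)))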

Cross-model Consistency Finally, it is of interest to know whether the prompts we are extracting are highly tailored to a

8We use the ranking position of a prompt to represent its quality instead of its accuracy because the accuracy distributions of different relations might span different ranges, making accuracy not directly comparable across relations.

Test         BERT-base          BERT-large
Train        base      large    large     base
Mine         38.9      38.7     43.7      42.2
Mine+Man     39.6      40.1     43.9      42.2
Mine+Para    36.2      35.6     40.1      39.0
Man+Para     37.3      35.6     38.8      37.5

Table 12: Cross-model micro-averaged accuracy (%). The first row is the model to test, and the second row is the model on which prompt weights are learned.

Test         BERT               ERNIE
Train        BERT      ERNIE    ERNIE     BERT
Mine         38.9      38.0     42.3      38.7
Mine+Man     39.6      39.5     43.8      40.5
Mine+Para    36.2      34.2     40.1      39.0
Man+Para     37.3      35.2     41.1      40.3

Table 13: Cross-model micro-averaged accuracy (%). The first row is the model to test, and the second row is the model on which prompt weights are learned.

specific model, or whether they can generalize across models. To do so, we use two settings: one compares BERT-base and BERT-large, the same model architecture with different sizes; the other compares BERT-base and ERNIE, different model architectures with comparable sizes. In each setting, we compare the case where the optimization-based ensembles are trained on the same model with the case where they are trained on one model and tested on the other. As shown in Tables 12 and 13, we found that in general there is usually some drop in performance in the cross-model scenario (third and fifth columns), but the losses tend to be small, and the highest performance when querying BERT-base is actually achieved by the weights optimized on BERT-large. Notably, the best accuracies of 40.1% and 42.2% (Table 12) and 39.5% and 40.5% (Table 13) with the weights optimized on the other model are still much higher than those obtained with the manual prompts, indicating that optimized prompts still afford large gains across models. Another interesting observation is that the drop in performance on ERNIE (last two columns in Table 13) is larger than that on BERT-large (last two columns in Table 12) using weights optimized on BERT-base, indicating that models sharing the same architecture benefit more from the same prompts.

Linear vs. Log-linear Combination As mentioned in § 4.2, we use a log-linear combination of probabilities in our main experiments. However, it is also possible to calculate probabilities through regular linear interpolation:

P(y | x, r) = (1/K) Σ_{i=1}^{K} P_LM(y | x, t_{r,i}).    (4)


Figure 5: Performance of two interpolation methods.

We compare these two ways to combine pre-
dictions from multiple mined prompts in Figure 5
(§ 4.2). We assume that log-linear combination
outperforms linear combination because log prob-
abilities make it possible to penalize objects that
are very unlikely given any certain prompt.
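This effect can be seen with two toy prompt distributions (a small numpy illustration of ours, not an experiment from the paper): averaging probabilities lets one confident prompt dominate, while averaging log probabilities vetoes objects that any prompt considers very unlikely.

# Sketch: linear vs. log-linear combination of two prompts' object distributions.
import numpy as np

p1 = np.array([0.98, 0.01, 0.01])    # prompt 1 is very confident in object 0
p2 = np.array([0.001, 0.50, 0.499])  # prompt 2 thinks object 0 is very unlikely

linear = (p1 + p2) / 2                               # Equation 4
log_linear = np.exp((np.log(p1) + np.log(p2)) / 2)   # Equations 1-2 (unnormalized)

print(linear.argmax(), log_linear.argmax())          # 0 vs. 1: log space vetoes object 0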

6 Omitted Design Elements

Finally, in addition to the elements of our main proposed methodology in § 3 and § 4, we experimented with a few additional methods that did not prove highly effective, and thus were omitted from our final design. We briefly describe these below, along with cursory experimental results.

6.1 LM-aware Prompt Generation

We examined methods to generate prompts by solving an optimization problem that maximizes the probability of producing the ground-truth objects with respect to the prompts:

t*_r = argmax_{t_r} P_LM(y | x, t_r),

where P_LM(y | x, t_r) is parameterized with a pre-trained LM. In other words, this method directly searches for a prompt that causes the LM to assign the ground-truth objects the highest probability.


Prompts      Top1   Top3   Top5   Opti.  Oracle
Mine         31.9   34.5   33.8   38.1   47.9
Paraphrase   30.2   32.5   34.7   37.5   50.8

Table 14: Micro-averaged accuracy (%) before and after LM-aware prompt fine-tuning.

Features     macro   micro   macro   micro
Forward      25.2    38.1    25.0    37.3
+Backward    25.5    38.2    25.2    37.4

Table 15: Performance (%) of using forward and backward features with BERT-base.

Solving this problem of finding text sequences that optimize some continuous objective has been studied both in the context of end-to-end sequence generation (Hoang et al., 2017), and in the context of making small changes to an existing input for adversarial attacks (Ebrahimi et al., 2018; Wallace et al., 2019). However, we found that directly optimizing prompts guided by gradients was unstable and often yielded prompts in unnatural English in our preliminary experiments. Thus, we instead resorted to a more straightforward hill-climbing method that starts with an initial prompt, then masks out one token at a time and replaces it with the most probable token conditioned on the other tokens, inspired by the mask-predict decoding algorithm used in non-autoregressive machine translation (Ghazvininejad et al., 2019):9

P_LM(w_i | t_r \ i) = ( Σ_{⟨x,y⟩ ∈ R} P_LM(w_i | x, t_r \ i, y) ) / |R|,

where w_i is the i-th token in the prompt and t_r \ i is the prompt with the i-th token masked out. We followed a simple rule that modifies a prompt from left to right, and this is repeated until convergence.
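The refinement loop can be sketched as below; avg_mask_distribution is a hypothetical helper standing in for the quantity in the equation above, i.e., the LM's distribution at the masked position averaged over the relation's training pairs.

# Sketch of the left-to-right mask-and-replace refinement loop.
# `avg_mask_distribution(prompt_tokens, i, train_pairs)` is a hypothetical helper
# returning {word: average P_LM(word | x, prompt with token i masked, y)}.
def refine_prompt(prompt_tokens, train_pairs, avg_mask_distribution, max_rounds=10):
    for _ in range(max_rounds):
        changed = False
        for i, old in enumerate(prompt_tokens):
            if old in ("x", "y"):                 # never overwrite the placeholders
                continue
            dist = avg_mask_distribution(prompt_tokens, i, train_pairs)
            best = max(dist, key=dist.get)        # most probable replacement token
            if best != old:
                prompt_tokens[i] = best
                changed = True
        if not changed:                           # repeat until convergence
            break
    return prompt_tokens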

We used this method to refine all the mined and manual prompts on the T-REx-train dataset, and display their performance on the T-REx dataset in Table 14. After fine-tuning, the oracle performance increased significantly, while the ensemble performances (both rank-based and optimization-based) dropped slightly. This suggests that LM-aware fine-tuning has the potential to discover better prompts, but some portion of the refined prompts may have over-fit to the training set upon which they were optimized.

9In theory, this algorithm can be applied to both masked LMs like BERT and traditional left-to-right LMs, since the masked probability can be computed using Bayes' theorem for traditional LMs. However, in practice, due to the large size of the vocabulary, it can only be approximated with beam search, or computed with more complicated continuous optimization algorithms (Hoang et al., 2017).

6.2 Forward and Backward Probabilities

Finally, given class imbalance and the propensity of the model to over-predict majority objects, we examine a method to encourage the model to predict subject-object pairs that are more strongly aligned. Inspired by the maximum mutual information objective used in Li et al. (2016a), we add the backward log probability log P_LM(x | y, t_{r,i}) of each prompt to our optimization-based scoring function in Equation 3. Due to the large search space for objects, we turn to an approximation that only computes the backward probability for the most probable B objects given by the forward probability, at both training and test time. As shown in Table 15, the improvement resulting from the backward probability is small, indicating that a diversity-promoting scoring function might not be necessary for knowledge retrieval from LMs.

7 Related Work

Much work has focused on understanding the internal representations in neural NLP models (Belinkov and Glass, 2019), either by using extrinsic probing tasks to examine whether certain linguistic properties can be predicted from those representations (Shi et al., 2016; Linzen et al., 2016; Belinkov et al., 2017), or by ablations to the models to investigate how behavior varies (Li et al., 2016b; Smith et al., 2017). For contextualized representations in particular, a broad suite of NLP tasks is used to analyze both syntactic and semantic properties, providing evidence that contextualized representations encode linguistic knowledge in different layers (Hewitt and Manning, 2019; Tenney et al., 2019a; Tenney et al., 2019b; Jawahar et al., 2019; Goldberg, 2019).

Different from analyses probing the representations themselves, our work follows Petroni et al. (2019) and Pörner et al. (2019) in probing for factual


knowledge. They use manually defined prompts, which may under-estimate the true performance obtainable by LMs. Concurrently to this work, Bouraoui et al. (2020) made a similar observation that using different prompts can help better extract relational knowledge from LMs, but they use models explicitly trained for relation extraction, whereas our methods examine the knowledge included in LMs without any additional training. Orthogonally, some previous works integrate external knowledge bases so that the language generation process is explicitly conditioned on symbolic knowledge (Ahn et al., 2016; Yang et al., 2017; Logan et al., 2019; Hayashi et al., 2020). Similar extensions have been applied to pre-trained LMs like BERT, where contextualized representations are enhanced with entity embeddings (Zhang et al., 2019; Peters et al., 2019; Pörner et al., 2019). In contrast, we focus on better knowledge retrieval through prompts from LMs as-is, without modifying them.

8 Conclusion

In this paper, we examined the importance of the prompts used in retrieving factual knowledge from language models. We propose mining-based and paraphrasing-based methods to systematically generate diverse prompts to query specific pieces of relational knowledge. Those prompts, when combined together, improve factual knowledge retrieval accuracy by 8%, outperforming manually designed prompts by a large margin. Our analysis indicates that LMs are indeed more knowledgeable than initially indicated by previous results, but they are also quite sensitive to how we query them. This indicates potential future directions such as (1) more robust LMs that can be queried in different ways but still return similar results, (2) methods to incorporate factual knowledge in LMs, and (3) further improvements in optimizing methods to query LMs for knowledge. Finally, we have released all our learned prompts to the community as the LM Prompt and Query Archive (LPAQA), available at: https://github.com/jzbjyb/LPAQA.

Acknowledgments

This work was supported by a gift from Bosch Research and NSF award no. 1815287. We would like to thank Paul Michel, Hiroaki Hayashi, Pengcheng Yin, and Shuyan Zhou for their insightful comments and suggestions.

References

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries, June 2-7, 2000, San Antonio, TX, USA, pages 85-94. ACM.

Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. 2016. A neural knowledge language model. CoRR, abs/1608.00318v2.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895-2905, Florence, Italy. Association for Computational Linguistics.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 6-12, 2007, pages 2670-2676.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861-872, Vancouver, Canada. Association for Computational Linguistics.

Yonatan Belinkov and James R. Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49-72.

Rahul Bhagat and Deepak Ravichandran. 2008. Large scale acquisition of paraphrases for learning surface patterns. In Proceedings of ACL-08: HLT, pages 674-682, Columbus, Ohio. Association for Computational Linguistics.


Zied Bouraoui, Jose Camacho-Collados, and Steven Schockaert. 2020. Inducing relational knowledge from BERT. In Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), New York, USA.

Claudio Carpineto and Giovanni Romano. 2012. A survey of automatic query expansion in information retrieval. ACM Computing Surveys, 44(1):1:1-1:50.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3079-3087.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171-4186.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31-36, Melbourne, Australia. Association for Computational Linguistics.

Hady ElSahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon S. Hare, Frédérique Laforest, and Elena Simperl. 2018. T-REx: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1535-1545.

Michael Gamon, Anthony Aue, and Martine Smets. 2005. Sentence-level MT evaluation without reference translations: Beyond language modeling. In Proceedings of EAMT, pages 103-111.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6114-6123, Hong Kong, China. Association for Computational Linguistics.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. CoRR, abs/1901.05287v1.

Hiroaki Hayashi, Zecong Hu, Chenyan Xiong, and Graham Neubig. 2020. Latent relation language models. In Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), New York, USA.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4129-4138.

Cong Duy Vu Hoang, Gholamreza Haffari, and Trevor Cohn. 2017. Towards decoding as continuous optimisation in neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 146-156, Copenhagen, Denmark. Association for Computational Linguistics.

Robert L. Logan IV, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh. 2019. Barack's wife Hillary: Using knowledge graphs for fact-aware language modeling. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 5962-5971.


Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 3651-3657.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Jiwei Li, Michelle Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 110-119.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016b. Understanding neural networks through representation erasure. CoRR, abs/1612.08220v3.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521-535.

Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2017. Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 881-893, Valencia, Spain. Association for Computational Linguistics.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. CoRR, abs/1806.08730v1.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, pages 51-61.

Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing LSTM language models. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), pages 234-239. IEEE.

Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019, Volume 2: Shared Task Papers, Day 1, pages 314-319.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227-2237.

Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference


on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 43-54, Hong Kong, China. Association for Computational Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463-2473, Hong Kong, China. Association for Computational Linguistics.

Nina Pörner, Ulli Waltinger, and Hinrich Schütze. 2019. BERT is not a knowledge base (yet): Factual knowledge vs. name-based reasoning in unsupervised QA. CoRR, abs/1911.03681v1.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932-4942, Florence, Italy. Association for Computational Linguistics.

Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 41-47. Association for Computational Linguistics.

Lorenza Romano, Milen Kouylekov, Idan Szpektor, Ido Dagan, and Alberto Lavelli. 2006. Investigating a generic paraphrase-based approach for relation extraction. In 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy. Association for Computational Linguistics.


Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. ATOMIC: An atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3027-3035.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526-1534, Austin, Texas. Association for Computational Linguistics.

Noah A. Smith, Chris Dyer, Miguel Ballesteros, Graham Neubig, Lingpeng Kong, and Adhiguna Kuncoro. 2017. What do recurrent neural network grammars learn about syntax? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pages 1249-1258.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4593-4601.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1499-1509.


Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. CoRR, abs/1806.02847v2.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153-2162, Hong Kong, China. Association for Computational Linguistics.

Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. 2017. Reference-aware language models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 1850-1859.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 1441-1451.

Geoffrey Zweig and Christopher J. C. Burges. 2011. The Microsoft Research sentence completion challenge. Microsoft Research, Redmond, WA, USA, Technical Report MSR-TR-2011-129.
