Measuring and Improving Consistency in Pretrained Language Models

Yanai Elazar1,2 Nora Kassner3 Shauli Ravfogel1,2 Abhilasha Ravichander4
Eduard Hovy4 Hinrich Schütze3 Yoav Goldberg1,2
1Computer Science Department, Bar-Ilan University, Israel
2Allen Institute for Artificial Intelligence, USA
3Center for Information and Language Processing (CIS), LMU Munich, Germany
4Language Technologies Institute, Carnegie Mellon University, USA
{yanaiela,shauli.ravfogel,yoav.goldberg}@gmail.com
kassner@cis.lmu.de {aravicha,hovy}@cs.cmu.edu

Abstract

Consistency of a model—that is, the invariance of its behavior under meaning-preserving alternations in its input—is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create PARAREL, a high-quality resource of cloze-style English query paraphrases. It contains a total of 328 paraphrases for 38 relations. Using PARAREL, we show that the consistency of all PLMs we experiment with is poor—though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge robustly. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness.1

1 Introduction

Pretrained Language Models (PLMs) are large neural networks that are used in a wide variety of NLP tasks. They operate under a pretrain-finetune paradigm: models are first pretrained over a large text corpus and then finetuned on a downstream task. PLMs are thought of as good language encoders, supplying basic language understanding capabilities that can be used with ease for many downstream tasks.

A desirable property of a good language un-
derstanding model is consistency: the ability to
make consistent decisions in semantically equiv-
alent contexts, reflecting a systematic ability to
generalize in the face of language variability.

1The code and resource are available at: https://github.com/yanaiela/pararel.

Examples of consistency include: predicting
the same answer in question answering and read-
ing comprehension tasks regardless of paraphrase
(Asai and Hajishirzi, 2020); making consistent
assignments in coreference resolution (Denis and
Baldridge, 2009; Chang et al., 2011); or making
summaries factually consistent with the original
document (Kryscinski et al., 2020). While consistency is important in many tasks, nothing in the training process explicitly targets it. One can hope that the unsupervised training signal from large corpora made available to PLMs such as BERT or RoBERTa (Devlin et al., 2019; Liu et al., 2019) is sufficient to induce consistency and transfer it to downstream tasks. In this paper, we show that this is not the case.

The recent rise of PLMs has sparked a discus-
sion about whether these models can be used as
Knowledge Bases (KBs) (Petroni et al., 2019;
2020; Davison et al., 2019; Peters et al., 2019;
Jiang et al., 2020; Roberts et al., 2020). Consis-
tency is a key property of KBs and is particularly
important for automatically constructed KBs. One of the biggest appeals of using a PLM as a KB is that we can query it in natural language—instead of relying on a specific KB schema. The expectation is that PLMs abstract away from language and map queries in natural language into meaningful representations, such that queries with identical intent but different language forms yield the same answer. For example, the query "Homeland premiered on [MASK]" should produce the same answer as "Homeland originally aired on [MASK]". Studying inconsistencies of PLM-KBs can also teach us about the organization of knowledge in the model, or lack thereof. Finally, failure to behave consistently may point to other representational issues such as the similarity between
antonyms and synonyms (Nguyen et al., 2016),
and overestimating events and actions (reporting
bias) (Shwartz and Choi, 2020).

In this work, we study the consistency of factual knowledge in PLMs, specifically in Masked Language Models (MLMs)—these are PLMs trained with the MLM objective (Devlin et al., 2019; Liu et al., 2019), as opposed to other strategies such as standard language modeling (Radford et al., 2019) or text-to-text (Raffel et al., 2020). We ask: Is the factual information we extract from PLMs invariant to paraphrasing? We use zero-shot evaluation since we want to inspect models directly, without adding biases through finetuning. This allows us to assess how much consistency was acquired during pretraining and to compare the consistency of different models. Overall, we find that the consistency of the PLMs we consider is poor, although there is a high variance between relations.

We introduce PARAREL, a new benchmark that enables us to measure consistency in PLMs (§3), by using factual knowledge that was found to be partially encoded in them (Petroni et al., 2019; Jiang et al., 2020). PARAREL is a manually curated resource that provides patterns—short textual prompts—that are paraphrases of one another, with 328 paraphrases describing 38 binary relations such as X born-in Y, X works-for Y (§4). We then test multiple PLMs for knowledge consistency, namely, whether a model predicts the same answer for all patterns of a relation. Figure 1 shows an overview of our approach. Using PARAREL, we probe for consistency in four PLM types: BERT, BERT-whole-word-masking, RoBERTa, and ALBERT (§5). Our experiments with PARAREL show that current models have poor consistency, although with high variance between relations (§6).

Finally, we propose a method that improves model consistency by introducing a novel consistency loss (§8). We demonstrate that, trained with this loss, BERT achieves better consistency performance on unseen relations. However, more work is required to achieve fully consistent models.

2 Background

There has been significant interest in analyzing
how well PLMs (Rogers et al., 2020) perform

Figure 1: Overview of our approach. We expect that a consistent model would predict the same answer for two paraphrases. In this example, the model is inconsistent on the Homeland paraphrases and consistent on the Seinfeld paraphrases.

on linguistic tasks (Goldberg, 2019; Hewitt and Manning, 2019; Tenney et al., 2019; Elazar et al., 2021), commonsense (Forbes et al., 2019; Da and Kasai, 2019; Zhang et al., 2020), and reasoning (Talmor et al., 2020; Kassner et al., 2020), usually assessed by measures of accuracy. However, accuracy is just one measure of PLM performance (Linzen, 2020). It is equally important that PLMs do not make contradictory predictions (cf. Figure 1), a type of error that humans rarely make. There has been relatively little research attention devoted to this question, that is, to analyzing whether models behave consistently. One example concerns negation: Ettinger (2020) and Kassner and Schütze (2020) show that models tend to generate facts and their negation, a type of inconsistent behavior. Ravichander et al. (2020) propose paired probes for evaluating consistency. Our work is broader in scope, examining the consistency of PLM behavior across a range of factual knowledge types and investigating how models can be made to behave more consistently.

Consistency has also been highlighted as a
desirable property in automatically constructed
KBs and downstream NLP tasks. We now briefly
review work along these lines.

Consistency in knowledge bases (KBs) has
been studied in theoretical frameworks in the
context of the satisfiability problem and KB
建造, and efficient algorithms for detect-
ing inconsistencies in KBs have been proposed
(Hansen and Jaumard, 2000; Andersen and
Pretolani, 2001). Other work aims to quantify the
degree to which KBs are inconsistent and de-
tects inconsistent statements (Thimm, 2009, 2013;
Muiño, 2011).

Consistency in question answering was studied by Ribeiro et al. (2019) in two tasks: visual question answering (Antol et al., 2015) and reading comprehension (Rajpurkar et al., 2016). They automatically generate questions to test the consistency of QA models. Their findings suggest that most models are not consistent in their predictions. Furthermore, they use data augmentation to create more robust models. Alberti et al. (2019) generate new questions conditioned on context and answer from a labeled dataset, filtering out answers that are not consistent with the original answer. They show that pretraining on these synthetic data improves QA results. Asai and Hajishirzi (2020) use data augmentation that complements questions with symmetricity and transitivity, as well as a regularizing loss that penalizes inconsistent predictions. Kassner et al. (2021b) propose a method to improve accuracy and consistency of QA models by augmenting a PLM with an evolving memory that records PLM answers and resolves inconsistency between answers.

Work on consistency in other domains includes Du et al. (2019), who improve the prediction of consistency in procedural text. Ribeiro et al. (2020) use consistency for more robust evaluation. Li et al. (2019) measure and mitigate inconsistency in natural language inference (NLI), and finally, Camburu et al. (2020) propose a method for measuring inconsistencies in NLI explanations (Camburu et al., 2018).

3 Probing PLMs for Consistency

In this section, we formally define consistency and describe our framework for probing the consistency of PLMs.

3.1 Consistency

We define a model as consistent if, given two cloze-phrases that are quasi-paraphrases, such as "Seinfeld originally aired on [MASK]" and "Seinfeld premiered on [MASK]", it makes non-contradictory predictions2 on N-1 relations over a large set of entities. A quasi-paraphrase—a

2We refer to non-contradictory predictions as predictions that, as the name suggests, do not contradict one another. For instance, predicting two different cities as the birth place of a person is considered contradictory, but predicting a city and its country is not.

concept introduced by Bhagat and Hovy (2013)—is a fuzzier version of a paraphrase. The concept does not rely on the strict, logical definition of paraphrase and allows us to operationalize concrete uses of paraphrases. This definition is in the spirit of the RTE definition (Dagan et al., 2005), which similarly supports a more flexible use of the notion of entailment. For example, a model that predicts NBC and ABC on the two aforementioned patterns is not consistent, since these two facts are contradictory. We define a cloze-pattern as a cloze-phrase that expresses a relation between a subject and an object. Note that consistency does not require the answers to be factually correct. While correctness is also an important property for KBs, we view it as a separate objective and measure it independently. We use the terms paraphrase and quasi-paraphrase interchangeably.

Many-to-many (N-M) relations (e.g., shares-border-with) can be consistent even with different answers (given they are correct). For example, two patterns that express the shares-border-with relation and predict Albania and Bulgaria for Greece are both correct. We do not consider such relations for measuring consistency. However, another requirement from a KB is determinism, that is, returning the results in the same order (when more than a single result exists). In this work, we focus on consistency, but also measure determinism of the models we inspect.

3.2 The Framework

An illustration of the framework is presented in Figure 2. Let Di be a set of subject-object KB tuples (e.g., ⟨Homeland, Showtime⟩) from some relation ri (e.g., originally-aired-on), accompanied by a set of quasi-paraphrase cloze-patterns Pi (e.g., X originally aired on Y). Our goal is to test whether the model consistently predicts the same object (e.g., Showtime) for a particular subject (e.g., Homeland).3 To this end, we substitute X with a subject from Di and Y with [MASK] in all of the patterns Pi of that relation (e.g., Homeland originally aired on [MASK] and Homeland premiered on [MASK]). A consistent model must predict the same entity.

3Although it is possible to also predict the subject from the object, in the case of N-1 relations more than a single answer would be possible, thus converting the test from measuring consistency to measuring determinism instead.
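As an illustration of this querying procedure, the following is a minimal sketch that populates two quasi-paraphrases with the same subject and compares the top mask predictions. It assumes the HuggingFace transformers library; the model name and patterns are illustrative, not the exact configuration released with PARAREL.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

def top_prediction(pattern: str, subject: str) -> str:
    """Fill [X] with the subject and [Y] with the mask token; return the top token for the mask."""
    text = pattern.replace("[X]", subject).replace("[Y]", tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    token_id = logits[0, mask_pos, :].argmax(dim=-1)
    return tokenizer.decode(token_id).strip()

# Two quasi-paraphrases of the originally-aired-on relation, populated with the same subject.
pred_1 = top_prediction("[X] originally aired on [Y].", "Homeland")
pred_2 = top_prediction("[X] premiered on [Y].", "Homeland")
print(pred_1, pred_2, "consistent" if pred_1 == pred_2 else "inconsistent")
```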

Figure 2: Overview of our framework for assessing model consistency. Di ("Data Pairs (D)" on the left) is a set of KB triples of some relation ri, which are coupled with a set of quasi-paraphrase cloze-patterns Pi ("Patterns (P)" on the right) that describe that relation. We then populate the subjects from Di, as well as a mask token, into all patterns Pi (shown in the middle) and expect a model to predict the same object across all pattern pairs.

Restricted Candidate Sets Since PLMs were not trained to serve as KBs, they often predict words that are not KB entities; for example, for the pattern "Showtime originally aired on [MASK]", a PLM may predict the noun 'tv'—a likely substitution under the language modeling objective, but not a valid KB fact completion. Therefore, following others (Xiong et al., 2020; Ravichander et al., 2020; Kassner et al., 2021a), we restrict the PLMs' output vocabulary to the set of possible gold objects for each relation from the underlying KB. For example, in the born-in relation, instead of inspecting the entire vocabulary of a model, we only keep objects from the KB, such as Paris, London, Tokyo, and so on.

Note that this setup makes the task easier for the PLM, especially in the context of KBs. However, poor consistency in this setup strongly implies that consistency would be even lower without restricting candidates.
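A sketch of this candidate-set restriction, under the same assumptions as above (HuggingFace transformers; the candidate list and pattern are illustrative):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased").eval()

# Gold objects of the relation (illustrative); only candidates that are a single
# token in the model's vocabulary are kept, as described in Section 5.1.
candidates = ["NBC", "ABC", "CBS", "Fox", "Showtime"]
vocab = tokenizer.get_vocab()
cand_ids = {c: vocab[c] for c in candidates if c in vocab}

text = f"Homeland originally aired on {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos, :].squeeze(0)

# Rank only the candidate tokens instead of the full vocabulary.
scores = {c: logits[i].item() for c, i in cand_ids.items()}
print(max(scores, key=scores.get))
```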

4 The PARAREL Resource

We now describe PARAREL, a resource designed for our framework (cf. Section 3.2). PARAREL is curated by experts, with a high level of agreement. It contains patterns for 38 relations4 from T-REx (Elsahar et al., 2018)—a large dataset containing KB triples aligned with Wikipedia abstracts—with an average of 8.63 patterns per relation. Table 1 gives statistics. We further analyze the paraphrases used in this resource, partly based on the types defined in Bhagat and Hovy (2013), and report this analysis in Appendix B.

4We use the 41 relations from LAMA (Petroni et al., 2019), leaving out three relations that are poorly defined or consist of mixed and heterogeneous entities.

Construction Method PARAREL was con-
structed in four steps. (1) We began with the
patterns provided by LAMA (Petroni et al., 2019)
(one pattern per relation, referred to as base-
pattern). (2) We augmented each base-pattern with
other patterns that are paraphrases from LPAQA
(Jiang et al., 2020). 然而, since LPAQA was
created automatically (either by back-translation
or by extracting patterns from sentences that con-
tain both subject and object), some LPAQA pat-
terns are not correct paraphrases. We therefore
only include the subset of correct paraphrases.
(3) Using SPIKE (Shlain et al., 2020),5 a search
engine over Wikipedia sentences that supports
syntax-based queries, we searched for additional
patterns that appeared in Wikipedia and added
them to PARAREL. Specifically, we searched for Wikipedia sentences containing a subject-object tuple from T-REx and then manually extracted patterns from the sentences. (4) Finally, we added
additional paraphrases of the base-pattern using
the annotators’ linguistic expertise. Two addi-
tional experts went over all the patterns and cor-
rected them, while engaging in a discussion until
reaching agreement, discarding patterns they
could not agree on.

Human Agreement To assess the quality of PARAREL, we run a human annotation study.

5https://spike.apps.allenai.org/.

# Relations                  38
# Patterns                   328
Min # patterns per rel.      2
Max # patterns per rel.      20
Avg # patterns per rel.      8.63
Avg syntax                   4.74
Avg lexical                  6.03

Table 1: Statistics of PARAREL. Last two rows: average number of unique syntactic/lexical variations of patterns for a relation.

For each relation, we sample up to five paraphrases, comparing each of the new patterns to the base-pattern from LAMA. That is, if relation ri contains the patterns p1, p2, p3, p4, and p1 is the base-pattern, then we compare the pairs (p1, p2), (p1, p3), (p1, p4).

We populate the patterns with random subject and object pairs from T-REx (Elsahar et al., 2018) and ask annotators whether these sentences are paraphrases. As a control, we also sample patterns from different relations to provide examples that are not paraphrases of each other. Each task contains five patterns that are thought to be paraphrases and two that are not.6 Overall, we collect annotations for 156 paraphrase candidates and 61 controls.

We asked NLP graduate students to annotate the pairs and collected one answer per pair.7 The agreement scores for the paraphrases and the controls are 95.5% and 98.3%, respectively, which is high and indicates PARAREL's high quality. We also inspected the disagreements and fixed many additional problems to further improve quality.

5 Experimental Setup

5.1 Models and Data

We experiment with four PLMs: BERT, BERT whole-word-masking8 (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2019). For BERT, RoBERTa, and ALBERT,
we use a base and a large version.9 We also report
a majority baseline that always predicts the most
common object for a relation. By construction,
this baseline is perfectly consistent.

We use knowledge graph data from T-REx
(Elsahar et al., 2018).10 To make the results com-
parable across models, we remove objects that are
not represented as a single token in all models’
vocabularies; 26,813 tuples remain.11 We further
split the data into N-M relations for which we
report determinism results (seven relations) and N-1 relations for which we report consistency (31 relations).
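The single-token filter mentioned above can be reproduced along these lines (a sketch assuming HuggingFace tokenizers; the model identifiers and tuples are illustrative):

```python
from transformers import AutoTokenizer

# Tokenizers of the models under comparison (illustrative identifiers).
names = ["bert-base-cased", "roberta-base", "albert-base-v2"]
tokenizers = [AutoTokenizer.from_pretrained(n) for n in names]

def is_single_token_everywhere(obj: str) -> bool:
    # Prepend a space so byte-level / sentencepiece tokenizers see a word-initial piece.
    return all(len(t.tokenize(" " + obj)) == 1 for t in tokenizers)

tuples = [("Homeland", "Showtime"), ("Wales", "Cardiff")]
kept = [(subj, obj) for subj, obj in tuples if is_single_token_everywhere(obj)]
```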

5.2 Evaluation
Our consistency measure for a relation ri (Consistency) is the percentage of consistent predictions over all pattern pairs pk, pj ∈ Pi of that relation, for all its KB tuples d ∈ Di. Thus, for each KB tuple from a relation ri that contains n patterns, we consider predictions for n(n − 1)/2 pairs.

We also report Accuracy, that is, the acc@1 of a model in predicting the correct object, using the original patterns from Petroni et al. (2019). In contrast to Petroni et al. (2019), we define it as the accuracy of the top-ranked object from the candidate set of each relation. Finally, we report Consistent-Acc, a new measure that evaluates individual objects as correct only if all patterns of the corresponding relation predict the object correctly. Consistent-Acc is much stricter and combines the requirements of both consistency (Consistency) and factual correctness (Accuracy). We report the average over relations (i.e., macro average), but note that the micro average produces similar results.
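Schematically, the three measures can be computed from per-pattern predictions as follows (the data structures are assumptions for illustration; the resulting per-relation scores are then macro-averaged):

```python
from itertools import combinations

def relation_scores(predictions, gold):
    """predictions: {subject: [pred_of_pattern_1, ..., pred_of_pattern_n]} for one relation
       gold:        {subject: gold_object}"""
    consistent_pairs = total_pairs = acc_hits = consistent_acc_hits = 0
    for subj, preds in predictions.items():
        # Consistency: fraction of the n(n-1)/2 pattern pairs that predict the same object.
        for a, b in combinations(preds, 2):
            total_pairs += 1
            consistent_pairs += int(a == b)
        # Accuracy: prediction of the original (base) pattern, assumed here to be at index 0.
        acc_hits += int(preds[0] == gold[subj])
        # Consistent-Acc: correct only if every pattern predicts the gold object.
        consistent_acc_hits += int(all(p == gold[subj] for p in preds))
    n_tuples = len(predictions)
    return (consistent_pairs / total_pairs,
            acc_hits / n_tuples,
            consistent_acc_hits / n_tuples)
```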

6 Experiments and Results

6.1 Knowledge Extraction through Different Patterns


We begin by assessing our patterns as well as the
degree to which they extract the correct entities.
These results are summarized in Table 2.

6The controls contain the same subjects and objects so
that only the pattern (not its arguments) can be used to solve
the task.

7We asked the annotators to re-annotate any mismatch
with our initial label, to allow them to fix random mistakes.
8BERT whole-word-masking is BERT’s version where
words that are tokenized into multiple tokens are masked
一起.

9For ALBERT we use the smallest and largest versions.
10We discard three poorly defined relations from T-REx.
11In a few cases, we filter entities from relations that contain multiple fine-grained relations, to make our patterns compatible with the data. For example, most of the instances for the genre relation describe music genres, thus we remove some of the tuples where the objects are non-music genres such as 'satire', 'sitcom', and 'thriller'.

Model             Succ-Patt   Succ-Objs   Unk-Const   Know-Const
majority          97.3±7.3    23.2±21.0   100.0±0.0   100.0±0.0
BERT-base         100.0±0.0   63.0±19.9   46.5±21.7   63.8±24.5
BERT-large        100.0±0.0   65.7±22.1   48.1±20.2   65.2±23.8
BERT-large-wwm    100.0±0.0   64.9±20.3   49.5±20.1   65.3±25.1
RoBERTa-base      100.0±0.0   56.2±22.7   43.9±15.8   56.3±19.0
RoBERTa-large     100.0±0.0   60.1±22.3   46.8±18.0   60.5±21.1
ALBERT-base       100.0±0.0   45.8±23.7   41.4±17.3   56.3±22.0
ALBERT-xxlarge    100.0±0.0   58.8±23.8   40.5±16.4   57.5±23.8

Table 2: Extractability measures for the different models we inspect. Best model for each measure highlighted in bold.

First, we report Succ-Patt, the percentage of patterns that successfully predicted the right object at least once. A high score suggests that the patterns are of high quality and enable the models to extract the correct answers. All PLMs achieve a perfect score. Next, we report Succ-Objs, the percentage of entities that were predicted correctly by at least one of the patterns. Succ-Objs quantifies the degree to which the models "have" the required knowledge. We observe that some tuples are not predicted correctly by any of our patterns: the scores vary between 45.8% for ALBERT-base and 65.7% for BERT-large. Since there are, with an average of 8.63 patterns per relation, multiple ways to extract the knowledge, we interpret these results as evidence that a large part of the T-REx knowledge is not stored in these models.

Finally, we measure Unk-Const, a consistency measure for the subset of tuples for which no pattern predicted the correct answer, and Know-Const, consistency for the subset where at least one of the patterns for a specific relation predicted the correct answer. This split into subsets is based on Succ-Objs. Overall, the results indicate that when the factual knowledge is successfully extracted, the model is also more consistent. For instance, for BERT-large, Know-Const is 65.2% and Unk-Const is 48.1%.

6.2 Consistency and Knowledge

In this section, we report the overall knowledge measure that was used in Petroni et al. (2019) (Accuracy), the consistency measure (Consistency), and Consistent-Acc, which combines knowledge and consistency. The results are summarized in Table 3.

We begin with the Accuracy results. The results range between 29.8% (ALBERT-base) and 48.7%

Model             Accuracy    Consistency   Consistent-Acc
majority          23.1±21.0   100.0±0.0     23.1±21.0
BERT-base         45.8±25.6   58.5±24.2     27.0±23.8
BERT-large        48.1±26.1   61.1±23.0     29.5±26.6
BERT-large-wwm    48.7±25.0   60.9±24.2     29.3±26.9
RoBERTa-base      39.0±22.8   52.1±17.8     16.4±16.4
RoBERTa-large     43.2±24.7   56.3±20.4     22.5±21.1
ALBERT-base       29.8±22.8   49.8±20.1     16.7±20.3
ALBERT-xxlarge    41.7±24.9   52.1±22.4     23.8±24.8

Table 3: Knowledge and consistency results. Best model for each measure in bold.

(BERT-large whole-word-masking). Notice that our numbers differ from Petroni et al. (2019), as we use a candidate set (§3) and only consider KB triples whose object is a single token in all the PLMs we consider (§5.1).

Next, we report Consistency (§5.2). The BERT models achieve the highest scores. There is a consistent improvement from the base to the large version of each model. In contrast to previous work that observed quantitative and qualitative improvements of RoBERTa-based models over BERT (Liu et al., 2019; Talmor et al., 2020), in terms of consistency, BERT is more consistent than RoBERTa and ALBERT. Still, the overall results are low (61.1% for the best model), even more remarkably so because the restricted candidate set makes the task easier. We note that the results vary considerably between models (performance on original-language varies between 52% and 90%) and between relations (BERT-large performance is 92% on capital-of and 44% on owned-by).

Finally, we report Consistent-Acc: the results are much lower than for Accuracy, as expected, but follow similar trends: RoBERTa-base performs worst (16.4%) and BERT-large best (29.5%).

Interestingly, we find strong correlations between Accuracy and Consistency, ranging from 67.3% for RoBERTa-base to 82.1% for BERT-large (all with small p-values ≪ 0.01).

A striking result of the model comparison is the clear superiority of BERT, both in knowledge accuracy (which was also observed by Shin et al., 2020) and knowledge consistency. We hypothesize that this result is caused by the different sources of training data: although Wikipedia is part of the training data for all models we consider, for BERT it is the main data source, whereas for RoBERTa and ALBERT it is only a small portion. Thus, when using additional data, some of the facts may be

Model                    Acc         Consistency   Consistent-Acc
majority                 23.1±21.0   100.0±0.0     23.1±21.0
RoBERTa-med-small-1M     11.2±9.4    37.1±11.0     2.8±4.0
RoBERTa-base-10M         17.3±15.8   29.8±12.7     3.2±5.1
RoBERTa-base-100M        22.1±17.1   31.5±13.0     3.7±5.3
RoBERTa-base-1B          38.0±23.4   50.6±19.8     18.0±16.0

Table 4: Knowledge and consistency results for different RoBERTas, trained on increasing amounts of data. Best model for each measure in bold.

Model             Diff-Syntax   No-Change
majority          100.0±0.0     100.0±0.0
BERT-base         67.9±30.3     76.3±22.6
BERT-large        67.5±30.2     78.7±14.7
BERT-large-wwm    63.0±31.7     81.1±9.7
RoBERTa-base      66.9±10.1     80.7±5.2
RoBERTa-large     69.7±19.2     80.3±6.8
ALBERT-base       62.3±22.8     72.6±11.5
ALBERT-xxlarge    51.7±26.0     67.3±17.1

Table 5: Consistency and standard deviation when only syntax differs (Diff-Syntax) and when syntax and lexical choice are identical (No-Change). Best model for each metric is highlighted in bold.

forgotten, or contradicted in the other corpora; this can diminish knowledge and compromise consistency. Thus, since Wikipedia is likely the largest unified source of factual knowledge that exists in unstructured data, giving it prominence in pretraining makes it more likely that the model will incorporate Wikipedia's factual knowledge well. These results may have a broader impact on models to come: training bigger models with more data (such as GPT-3 [Brown et al., 2020]) is not always beneficial.

Determinism We also measure determinism for N-M relations—that is, we use the same measure as Consistency, but since different predictions may all be factually correct, disagreements do not necessarily convey consistency violations; rather, they indicate non-determinism. For brevity, we do not present all results, but the trend is similar to the consistency results (although not comparable, as the relations are different): 52.9% and 44.6% for BERT-large and RoBERTa-base, respectively.

Effect of Pretraining Corpus Size Next, we study the question of whether the number of tokens used during pretraining contributes to consistency. We use the pretrained RoBERTa models from Warstadt et al. (2020) and repeat the experiments on four additional models. These are RoBERTa-based models, trained on a sample of Wikipedia and the book corpus, with varying training sizes and parameters. We use one of the three published models for each configuration and report the average accuracy over the relations for each model in Table 4. Overall, Accuracy and Consistent-Acc improve with more training data. However, there is an interesting outlier to this trend: the model that was trained on one million tokens is more consistent than the models trained on ten and one-hundred million tokens. A potentially crucial difference is that this model has many fewer parameters than the rest (to avoid


overfitting). It is nonetheless interesting that a
model that is trained on significantly less data can
achieve better consistency. On the other hand, its accuracy scores are lower, arguably due to the
model being exposed to less factual knowledge
during pretraining.

6.3 Do PLMs Generalize Over Syntactic Configurations?

Many papers have found neural models (especially PLMs) to naturally encode syntax (Linzen et al., 2016; Belinkov et al., 2017; Marvin and Linzen, 2018; Belinkov and Glass, 2019; Goldberg, 2019; Hewitt and Manning, 2019). Does this mean that PLMs have successfully abstracted knowledge and can comprehend and produce it regardless of syntactic variation? We consider two scenarios: (1) two patterns differ only in syntax; (2) both syntax and lexical choice are the same. As a proxy, we define two patterns as syntactically equivalent when the dependency paths between the subject and object are identical. We parse all patterns from PARAREL using a dependency parser (Honnibal et al., 2020)12 and retain the path between the entities, as sketched below. Success on (1) indicates that the model's knowledge processing is robust to syntactic variation. Success on (2) indicates that the model's knowledge processing is robust to variation in word order and tense.
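A rough sketch of this dependency-path proxy, assuming spaCy with an English pipeline (the placeholder fillers and the path heuristic are simplifications for illustration, not necessarily the exact procedure in the released code):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def dep_path(pattern: str) -> tuple:
    """Return the dependency labels on the path between the [X] and [Y] slots."""
    doc = nlp(pattern.replace("[X]", "France").replace("[Y]", "Paris"))
    subj = next(t for t in doc if t.text == "France")
    obj = next(t for t in doc if t.text == "Paris")
    # Walk from each slot up to the root and join the chains at the lowest common ancestor.
    subj_chain = [subj.i] + [t.i for t in subj.ancestors]
    obj_chain = [obj.i] + [t.i for t in obj.ancestors]
    common = next(i for i in subj_chain if i in obj_chain)
    up = [doc[i].dep_ for i in subj_chain[: subj_chain.index(common)]]
    down = [doc[i].dep_ for i in obj_chain[: obj_chain.index(common)]][::-1]
    return tuple(up + ["^"] + down)

# Two capital-of patterns count as syntactically equivalent iff their paths match.
print(dep_path("The capital of [X] is [Y].") == dep_path("[X]'s capital is [Y]."))
```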

Table 5 reports the results. While these and the main results on the entire dataset are not comparable, as the pattern subsets are different, they are higher than the general results: 67.5% for BERT-large when only the syntax differs and 78.7% when
12https://spacy.io/.

#  Subject               Object     Pattern #1                      Pattern #2                       Pattern #3                      Pred #1    Pred #2     Pred #3
1  Adriaan Pauw          Amsterdam  [X] was born in [Y].            [X] is native to [Y].            [X] is a [Y]-born person.       Amsterdam  Madagascar  Luxembourg
2  Nissan Livina Geniss  Nissan     [X] is produced by [Y].         [X] is created by [Y].           [X], created by [Y].            Nissan     Renault     Renault
3  Albania               Serbia     [X] shares border with [Y].     [Y] borders with [X].            [Y] shares the border with [X]  Greece     Turkey      Kosovo
4  iCloud                Apple      [X] is developed by [Y].        [X], created by [Y].             [X] was created by [Y]          Microsoft  Google      Sony
5  Yahoo! Messenger      Yahoo      [X], a product created by [Y]   [X], a product developed by [Y]  [Y], that developed [X]         Microsoft  Microsoft   Microsoft
6  Wales                 Cardiff    The capital of [X] is [Y].      [X]'s capital, [Y].              [X]'s capital city, [Y].        Cardiff    Cardiff     Cardiff

Table 6: Predictions of BERT-large-cased. "Subject" and "Object" are from T-REx (Elsahar et al., 2018). "Pattern #i" / "Pred #i": three different patterns from our resource and their predictions. The predictions are colored in blue if the model predicted correctly (out of the candidate list), and in red otherwise; if there is more than a single erroneous prediction, each is colored in a different red.

the syntax is identical. This demonstrates that
while PLMs have impressive syntactic abilities,
they struggle to extract factual knowledge in the
face of tense, word-order, and syntactic variation.
McCoy et al. (2019) show that supervised models trained on MNLI (Williams et al., 2018), an NLI dataset (Bowman et al., 2015), use superficial syntactic heuristics rather than more generalizable properties of the data. Our results indicate that PLMs have problems along the same lines: they are not robust to surface variation.

7 Analysis

7.1 Qualitative Analysis

To better understand the factors affecting consistent predictions, we inspect the predictions of BERT-large on the patterns shown in Table 6. We highlight several cases. The predictions in Example #1 are inconsistent, and correct for the first pattern (Amsterdam), but not for the other two (Madagascar and Luxembourg). The predictions in Example #2 also show a single pattern that predicted the right object; however, the two other patterns, which are lexically similar, predicted the same, wrong answer—Renault. Next, the patterns of Example #3 produced two factually correct answers out of three (Greece, Kosovo), which simply do not correspond to the gold object in T-REx (Albania), since this is an M-N relation; note that this relation is not part of the consistency evaluation, but of the determinism evaluation. The three different predictions in Example #4 are all incorrect. Finally, the last two examples demonstrate consistent predictions: Example #5 is consistent but factually incorrect (even though the correct answer is a substring of the subject), and Example #6 is consistent and factual.

Figure 3: t-SNE of the encoded patterns from the capital relation. The colors represent different subjects, while the shapes represent patterns. A knowledge-focused representation should cluster based on identical subjects (colors), but instead the clustering is according to identical patterns (shapes).

7.2 Representation Analysis

To provide insights on the models' representations, we inspect them after encoding the patterns. Motivated by previous work that found that words with the same syntactic structure cluster together (Chi et al., 2020; Ravfogel et al., 2020), we perform a similar experiment to test whether this behavior replicates with respect to knowledge: we encode the patterns, after filling the placeholders with subjects and masked tokens, and inspect the last-layer representations at the masked-token position. When plotting the results using t-SNE (Maaten and Hinton, 2008), we mainly observe clustering based on the patterns, which suggests that encoding of knowledge of the entity is not the main component of the representations. Figure 3 demonstrates this for BERT-large encodings of the capital relation, which is highly consistent.13
To provide a more quantitative assessment of this

13While some patterns are clustered based on the subjects
(upper-left part), most of them are clustered based on patterns.

phenomenon, we also cluster the representations and set the number of centroids based on:14 (1) the number of patterns in each relation, which aims to capture pattern-based clusters, and (2) the number of subjects in each relation, which aims to capture entity-based clusters. This would allow for a perfect clustering in the case of perfect alignment between the representation and the inspected property. We measure the purity of these clusters using V-measure and observe that the clusters are mostly grouped by the patterns, rather than the subjects. Finally, we compute the Spearman correlation between the consistency scores and the V-measure of the representations. However, the correlation between these variables is close to zero,15 and therefore does not explain the models' behavior. We repeated these experiments while inspecting the objects instead of the subjects, and found similar trends. This finding is interesting since it means that (1) these representations are not knowledge-focused, i.e., their main component does not relate to knowledge, and (2) the representation in its entirety does not explain the behavior of the model, and thus only a subset of the representation does. This finding is consistent with previous work that observed similar trends for linguistic tasks (Elazar et al., 2021). We hypothesize that this disparity between the representation and the behavior of the model may be explained by a situation where the distance between representations largely does not reflect the distance between predictions, but rather other, behaviorally irrelevant factors of a sentence.
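This clustering analysis can be reproduced schematically as follows (a sketch assuming scikit-learn, with KMeans and V-measure as mentioned above; the input arrays are assumptions about how the mask-position encodings were collected):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def cluster_purity(vectors: np.ndarray, labels, n_clusters: int) -> float:
    """Cluster mask-position representations and score cluster/label agreement (V-measure)."""
    assignments = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
    return v_measure_score(labels, assignments)

# vectors:     (num_instances, hidden_dim) encodings of the mask position for one relation
# pattern_ids: which pattern each instance was generated from
# subjects:    which subject each instance was generated from
# purity_by_pattern = cluster_purity(vectors, pattern_ids, n_clusters=len(set(pattern_ids)))
# purity_by_subject = cluster_purity(vectors, subjects, n_clusters=len(set(subjects)))
```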

8 Improving Consistency in PLMs

In the previous sections, we showed PLMs are
generally not consistent in their predictions, and
previous works have noticed the lack of this prop-
erty in a variety of downstream tasks. An ideal
model would exhibit the consistency property after
pretraining, and would then be able to transfer it
to different downstream tasks. We therefore ask:
Can we enhance current PLMs and make them
more consistent?

8.1 Consistency Improved PLMs

We propose to improve the consistency of PLMs
by continuing the pretraining step with a novel

consistency loss. We make use of the T-REx
tuples and the paraphrases from PARAREL .

For each relation ri, we have a set of para-
phrased patterns Pi describing that relation. We use a PLM to encode all patterns in Pi, after populating a subject that corresponds to the relation ri and a mask token. We expect the model to make the same prediction for the masked token across all patterns.

Consistency Loss Function As we evaluate the model using acc@1, the straightforward consistency loss would require these top-1 predictions to be identical:

min sim( argmax_i f_θ(P_n)[i] , argmax_j f_θ(P_m)[j] )

where f_θ(P_n) is the output of an encoding function (e.g., BERT) parameterized by θ (a vector) over input P_n, and f_θ(P_n)[i] is the score of the ith vocabulary item of the model.

However, this objective contains a comparison between the outputs of two argmax operations, making it discrete and discontinuous, and hard to optimize in a gradient-based framework. We instead relax the objective and require that the predicted distributions Q_n = softmax(f_θ(P_n)), rather than the top-1 predictions, be identical to each other. We use two-sided KL Divergence to measure the similarity between distributions: D_KL(Q_n^ri || Q_m^ri) + D_KL(Q_m^ri || Q_n^ri), where Q_n^ri is the predicted distribution for pattern P_n of relation ri.

As most of the vocabulary is not relevant for the predictions, we filter it down to the k tokens from the candidate set of each relation (§3.2). We want to maintain the original capabilities of the model—focusing on the candidate set helps to achieve this goal, since most of the vocabulary is not affected by our new loss.

To encourage a more general solution, we make use of all the paraphrases together, and enforce all predictions to be as close as possible. Thus, the consistency loss for all pattern pairs of a particular relation ri is:

L_c = Σ_{n=1}^{k} Σ_{m=n+1}^{k} D_KL(Q_n^ri || Q_m^ri) + D_KL(Q_m^ri || Q_n^ri)

14Using the KMeans algorithm.
15Except for BERT-large whole-word-masking, where the

correlation is 39.5 (p < 0.05). MLM Loss Since the consistency loss is dif- ferent from the Cross-Entropy loss the PLM is trained on, we find it important to continue the 1020 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 1 0 1 9 7 5 9 5 7 / / t l a c _ a _ 0 0 4 1 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 MLM loss on text data, similar to previous work (Geva et al., 2020). We consider two alternatives for continuing the pretraining objective: (1) MLM on Wikipedia and (2) MLM on the patterns of the relations used for the consistency loss. We found that the latter works better. We denote this loss by LM LM . Consistency Guided MLM Continual Training Combining our novel consistency loss with the regular MLM loss, we continue the PLM training by combining the two losses. The combination of the two losses is determined by a hyperparameter λ, resulting in the following final loss function: L = λLc + LM LM This loss is computed per relation, for one KB tuple. We have many of these instances, which we require to behave similarly. Therefore, we batch together l = 8 tuples from the same relation and apply the consistency loss function to all of them. 8.2 Setup Since we evaluate our method on unseen relations, we also split train and test by relation type (e.g., location-based relations, which are very common in T-REx). Moreover, our method is aimed to be simple, effective, and to require only mini- mal supervision. Therefore, we opt to use only three relations: original-language, named-after, and original-network; these were chosen ran- domly, out of the non-location related relations.16 For validation, we randomly pick three relations of the remaining relations and use the remaining 25 for testing. We perform minimal tuning of the parame- ters (λ ∈ 0.1, 0.5, 1) to pick the best model, train for three epochs, and select the best model based on Consistent-Acc on the validation set. For efficiency reasons, we use the base version of BERT. 8.3 Improved Consistency Results The results are presented in Table 7. We report ag- gregated results for the 25 relations in the test. We again report macro average (mean over relations) and standard deviation. We report the results of the majority baseline (first row), BERT-base (second row), and our new model (BERT-ft, third row). 16Many relations are location-based—not training on them prevents train-test leakage. Model majority BERT-base BERT-ft Accuracy Consistency Consistent-Acc 24.4±22.5 100.0±0.0 58.2±23.9 45.6±27.6 64.0±22.9 47.4±27.3 60.9±22.6 -consistency 46.9±27.6 62.0±21.2 46.5±27.1 -typed 80.8±27.1 16.9±21.1 -MLM 24.4±22.5 27.3±24.8 33.2±27.0 30.9±26.3 31.1±25.2 9.1±11.5 Table 7: Knowledge and consistency results for the baseline, BERT base, and our model. The results are averaged over the 25 test relations. Underlined: best performance overall, including ablations. Bold: Best performance for BERT-ft and the two baselines (BERT-base, majority). First, we note that our model significantly im- proves consistency: 64.0% (compared with 58.2% for BERT-base, an increase of 5.8 points). Accu- racy also improves compared to BERT-base, from 45.6% to 47.4%. Finally, and most importantly, we see an increase of 5.9 points in Consistent-Acc, which is achieved due to the improved consistency of the model. 
Notably, these improvements arise from training on merely three relations, meaning that the model improved its consistency ability and generalized to new relations. We measure the statistical significance of our method compared to the BERT baseline, using McNemar’s test (fol- lowing Dror et al. [2018, 2020]) and find all results to be significant (p (cid:4) 0.01). We also perform an ablation study to quantify the utility of the different components. First, we report on the finetuned model without the con- sistency loss (-consistency). Interestingly, it does improve over the baseline (BERT-base), but it lags behind our finetuned model. Second, applying our loss on the candidate set rather than on the entire vocabulary is beneficial (-typed). Finally, by not performing the MLM training on the generated patterns (-MLM), the consistency results improve significantly (80.8%); however, this also hurts Accuracy and Consistent-Acc. MLM training seems to serve as a regularizer that prevents ca- tastrophic forgetting. Our ultimate goal is to improve consistency in PLMs for better performance on downstream tasks. Therefore, we also experiment with fine- tuning on SQuAD (Rajpurkar et al., 2016), and evaluating on paraphrased questions from SQuAD (Gan and Ng, 2019) using our consistency model. However, the results perform on par with the base- line model, both on SQuAD and the paraphrase 1021 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 1 0 1 9 7 5 9 5 7 / / t l a c _ a _ 0 0 4 1 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 questions. More research is required to show that consistent PLMs can also benefit downstream tasks. to use PLMs as a starting point to more complex systems have promising results and address these problems (Thorne et al., 2020). 9 Discussion Consistency for Downstream Tasks The rise of PLMs has improved many tasks but has also brought a lot of expectations. The standard usage of these models is pretraining on a large cor- pus of unstructured text and then finetuning on a task of interest. The first step is thought of as providing a good language-understanding compo- nent, whereas the second step is used to teach the format and the nuances of a downstream task. As discussed earlier, consistency is a crucial component of many NLP systems (Du et al., 2019; Asai and Hajishirzi, 2020; Denis and Baldridge, 2009; Kryscinski et al., 2020) and obtaining this skill from a pretrained model would be extremely beneficial and has the potential to make special- ized consistency solutions in downstream tasks redundant. Indeed, there is an ongoing discussion about the ability to acquire ‘‘meaning’’ from raw text signal alone (Bender and Koller, 2020). Our new benchmark makes it possible to track the progress of consistency in pretrained models. Broader Sense of Consistency In this work we focus on one type of consistency, that is, con- sistency in the face of paraphrasing; however, consistency is a broader concept. For instance, previous work has studied the effect of nega- tion on factual statements, which can also be seen as consistency (Ettinger, 2020; Kassner and Sch¨utze, 2020). A consistent model is expected to return different answers to the prompts: ‘‘Birds can [MASK]’’ and ‘‘Birds cannot [MASK]’’. The inability to do so, as was shown in these works, also shows the lack of model consistency. Usage of PLMs as KBs Our work follows the setup of Petroni et al. 
(2019) and Jiang et al. (2020), where PLMs are being tested as KBs. While it is an interesting setup for probing models for knowledge and consistency, it lacks important properties of standard KBs: (1) the ability to return more than a single answer and (2) the ability to return no answer. Although some heuristics can be used for allowing a PLM to do so, for example, using a threshold on the probabilities, it is not the way that the model was trained, and thus may not be optimal. Newer approaches that propose In another approach, Shin et al. (2020) sug- gest using AUTOPROMPT to automatically generate prompts, or patterns, instead of creating them manually. This approach is superior to manual patterns (Petroni et al., 2019), or aggregation of patterns that were collected automatically (Jiang et al., 2020). Brittleness of Neural Models Our work also relates to the problem of the brittleness of neural networks. One example of this brittleness is the vulnerability to adversarial attacks (Szegedy et al., 2014; Jia and Liang, 2017). The other problem, closer to the problem we explore in this work, is the poor generalization to paraphrases. For ex- ample, Gan and Ng (2019) created a paraphrase version for a subset of SQuAD (Rajpurkar et al., 2016), and showed that model performance drops significantly. Ribeiro et al. (2018) proposed another method for creating paraphrases and performed a similar analysis for visual ques- tion answering and sentiment analysis. Recently, Ribeiro et al. (2020) proposed CHECKLIST, a sys- tem that tests a model’s vulnerability to several linguistic perturbations. PARAREL enables us to study the brittleness of PLMs, and separate facts that are robustly encoded in the model from mere ‘guesses’, which may arise from some heuristic or spurious correlations with certain patterns (Poerner et al., 2020). We showed that PLMs are susceptible to small perturbations, and thus, finetuning on a downstream task—given that training datasets are typically not large and do not contain equivalent examples—is not likely to perform better. Can We Expect from LMs to Be Consistent? The typical training procedure of an LM does not encourage consistency. The standard training solely tries to minimize the log-likelihood of an unseen token, and this objective is not always aligned with consistency of knowledge. Consider for example the case of Wikipedia texts, as op- posed to Reddit; their texts and styles may be very different and they may even describe contradic- tory facts. An LM can exploit the styles of each text to best fit the probabilities given to an unseen word, even if the resulting generations contradict each other. 1022 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 1 0 1 9 7 5 9 5 7 / / t l a c _ a _ 0 0 4 1 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Since the pretraining-finetuning procedure is currently the dominating one in our field, a great amount of the language capabilities that were learned during pre-training also propagates to the fine-tuned models. As such, we believe it is impor- tant to measure and improve consistency already in the pretrained models. Reasons Behind the (In)Consistency Since LMs are not expected to be consistent, what are the reasons behind their predictions, when being consistent, or inconsistent? In this work, we presented the predictions of multiple queries, and the representation space of one of the inspected models. 
However, this does not point to the origins of such behavior. In future work, we aim to inspect this question more closely. 10 Conclusion In this work, we study the consistency of PLMs with regard to their ability to extract knowledge. We build a high-quality resource named PARAREL that contains 328 high-quality patterns for 38 re- lations. Using PARAREL , we measure consistency in multiple PLMs, including BERT, RoBERTa, and ALBERT, and show that although the latter two are superior to BERT in other tasks, they fall short in terms of consistency. However, the consistency of these models is generally low. We release PARAREL along with data tuples from T-REx as a new benchmark to track knowledge consistency of NLP models. Finally, we propose a new simple method to improve model consis- tency, by continuing the pretraining with a novel loss. We show this method to be effective and to improve both the consistency of models as well as their ability to extract the correct facts. Acknowledgments We would like to thank Tomer Wolfson, Ido Dagan, Amit Moryossef, and Victoria Basmov for their helpful comments and discussions, and Alon Jacovi, Ori Shapira, Arie Cattan, Elron Bandel, Philipp Dufter, Masoud Jalili Sabet, Marina Speranskaya, Antonis Maronikolakis, Aakanksha Naik, Aishwarya Ravichander, and Aditya Potukuchi for the help with the annota- tions. We also thank the anonymous reviewers and the action editor, George Foster, for their valuable suggestions. Yanai Elazar is grateful to be supported by the PBC fellowship for outstanding PhD can- didates in Data Science and the Google PhD fellowship. This project has received funding from the European Research Council (ERC) un- der the European Union’s Horizon 2020 research and innovation programme, grant agreement no. 802774 (iEXTRACT). This work has been funded by the European Research Council (#740516) and by the German Federal Ministry of Edu- cation and Research (BMBF) under grant no. 01IS18036A. The authors of this work take full responsibility for its content. This research was also supported in part by grants from the National Science Foundation Secure and Trustworthy Com- puting program (CNS-1330596, CNS15-13957, CNS-1801316, CNS-1914486), and a DARPA Brandeis grant (FA8750-15-2-0277). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the NSF, DARPA, or the US Government. References Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. 2019. Synthetic QA corpora generation with roundtrip con- sistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6168–6173. https:// doi.org/10.18653/v1/P19-1620 Kim Allan Andersen and Daniele Pretolani. 2001. Easy cases of probabilistic satisfiability. Annals of Mathematics and Artificial Intelligence, 33(1):69–91. https://doi.org/10.1023 /A:1012332915908 Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433. https://doi.org/10 .1109/ICCV.2015.279 Akari Asai and Hannaneh Hajishirzi. 2020. Logic-guided data augmentation and regu- larization for consistent question answering. 
In Proceedings of the 58th Annual Meeting for Computational of Association the 1023 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 1 0 1 9 7 5 9 5 7 / / t l a c _ a _ 0 0 4 1 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 pages Linguistics, 5642–5650, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020 .acl-main.499 Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the the Association 55th Annual Meeting of for Computational Linguistics (Volume 1: Long Papers), pages 861–872. https://doi .org/10.18653/v1/P17-1080 Yonatan Belinkov and James Glass. 2019. Anal- ysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72. https:// doi.org/10.1162/tacl a 00254 Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198. https://doi.org/10 .18653/v1/2020.acl-main.463 Rahul Bhagat and Eduard Hovy. 2013. What is a paraphrase? Computational Linguistics, 39(3):463–472. https://doi.org/10.1162 /COLI a 00166 Lukas Biewald. 2020. Experiment tracking with weights and biases. Software available from wandb.com. Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural lan- guage inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. https://doi .org/10.18653/v1/D15-1075 Tom B. Brown, Benjamin Mann, Nick Ryder, Jared Kaplan, Prafulla Melanie Subbiah, Dhariwal, Arvind Neelakantan, Pranav Shyam, GirishSastry, AmandaAskell, SandhiniAgarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language mod- learners. arXiv preprint els are few-shot arXiv:2005.14165. Oana-Maria Camburu, Tim Rockt¨aschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. E-SNLI: Natural language inference with natural lan- guage explanations. In NeurIPS. Oana-Maria Camburu, Brendan Shillingford, Pasquale Minervini, Thomas Lukasiewicz, and Phil Blunsom. 2020. Make up your mind! Adversarial generation of inconsistent natural language explanations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4157–4165. https://doi.org/10.18653/v1/2020 .acl-main.382 Kai-Wei Chang, Rajhans Samdani, Alla Rozovskaya, Nick Rizzolo, Mark Sammons, Inference protocols and Dan Roth. 2011. for coreference resolution. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 40–44. Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. In Proceed- ings of the Association for Computational Linguistics, pages 5564–5577, Online. Association for Computational Linguistics. the 58th Annual Meeting of Jeff Da and Jungo Kasai. 2019. 
Cracking the contextual commonsense code: Under- standing commonsense reasoning aptitude of deep contextual representations. In Proceed- ings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 1–12, Hong Kong, China. Association for Computational Linguistics. Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The Pascal recognising textual In Machine Learn- entailment challenge. ing Challenges Workshop, pages 177–190. https://doi.org/10.1007 Springer. /11736790_9 1024 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 1 0 1 9 7 5 9 5 7 / / t l a c _ a _ 0 0 4 1 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceed- ings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Nat- ural Language Processing (EMNLP-IJCNLP), pages 1173–1178. https://doi.org/10 .18653/v1/D19-1109 Pascal Denis and Jason Baldridge. 2009. Global joint models for coreference resolution and named entity classification. Procesamiento del Lenguaje Natural, 42. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Con- ference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker’s guide to testing statistical significance in natural lan- guage processing. In Proceedings of the 56th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 1383–1392. https://doi.org/10 .18653/v1/P18-1128 Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart. 2020. Statistical significance testing for natural language pro- cessing. Synthesis Lectures on Human Lan- guage Technologies, 13(2):1–116. https:// doi.org/10.2200/S00994ED1V01Y202 002HLT045 Xinya Du, Bhavana Dalvi, Niket Tandon, Antoine Bosselut, Wen-tau Yih, Peter Clark, and Claire Cardie. 2019. Be consistent! Im- proving procedural text comprehension using label consistency. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2347–2356. Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. 2021. Amnesic Probing: Behavioral Explanation with Amnesic Coun- the Association terfactuals. Transactions of for Computational Linguistics, 9:160–175. https://doi.org/10.1162/tacl a 00359 Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. 2018. T-rex: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Con- ference on Language Resources and Evaluation (LREC 2018). Allyson Ettinger. 2020. What BERT is not: Lessons from a new suite of psycholin- guistic diagnostics language models. for Transactions of the Association for Com- putational Linguistics, 8:34–48. https:// doi.org/10.1162/tacl a 00298 Maxwell Forbes, Ari Holtzman, and Yejin Choi. 2019. Do neural language representations learn physical commonsense? In CogSci. 
Wee Chung Gan and Hwee Tou Ng. 2019. Improving the robustness of question answering systems to question paraphrasing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6065–6075.

Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting numerical reasoning skills into language models. In Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.acl-main.89

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. arXiv preprint arXiv:1901.05287.

Pierre Hansen and Brigitte Jaumard. 2000. Probabilistic satisfiability. In Handbook of Defeasible Reasoning and Uncertainty Management Systems, pages 321–367, Springer. https://doi.org/10.1007/978-94-017-1737-3_8

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). Association for Computational Linguistics.

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. https://doi.org/10.5281/zenodo.1212303

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031. https://doi.org/10.18653/v1/D17-1215

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438. https://doi.org/10.1162/tacl_a_00324

Nora Kassner, Philipp Dufter, and Hinrich Schütze. 2021a. Multilingual LAMA: Investigating knowledge in multilingual pretrained language models.

Nora Kassner, Benno Krojer, and Hinrich Schütze. 2020. Are pretrained language models symbolic reasoners over knowledge? In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 552–564, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.conll-1.45

Nora Kassner and Hinrich Schütze. 2020. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7811–7818, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.698

Nora Kassner, Oyvind Tafjord, Hinrich Schütze, and Peter Clark. 2021b. Enriching a model's notion of belief using a persistent memory. CoRR, abs/2104.08401.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346. https://doi.org/10.18653/v1/2020.emnlp-main.750

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
Tao Li, Vivek Gupta, Maitrey Mehta, and Vivek Srikumar. 2019. A logic-driven framework for consistency of neural models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3924–3935, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1405

Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217. https://doi.org/10.18653/v1/2020.acl-main.465

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535. https://doi.org/10.1162/tacl_a_00115

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1151

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448. https://doi.org/10.18653/v1/P19-1334

David Picado Muiño. 2011. Measuring and repairing inconsistency in probabilistic knowledge bases. International Journal of Approximate Reasoning, 52(6):828–840. https://doi.org/10.1016/j.ijar.2011.02.003

Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 454–459. https://doi.org/10.18653/v1/P16-2074

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 43–54. https://doi.org/10.18653/v1/D19-1005

Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. How context affects language models' factual predictions. In Automated Knowledge Base Construction.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1250

Nina Poerner, Ulli Waltinger, and Hinrich Schütze. 2020. E-BERT: Efficient-yet-effective entity embeddings for BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 803–818. https://doi.org/10.18653/v1/2020.findings-emnlp.71

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. https://doi.org/10.18653/v1/D16-1264

Shauli Ravfogel, Yanai Elazar, Jacob Goldberger, and Yoav Goldberg. 2020. Unsupervised distillation of syntactic information from contextualized word representations. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 91–106, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.blackboxnlp-1.9

Abhilasha Ravichander, Eduard Hovy, Kaheer Suleman, Adam Trischler, and Jackie Chi Kit Cheung. 2020. On the systematicity of probing contextualized word representations: The case of hypernymy in BERT. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pages 88–102, Barcelona, Spain (Online). Association for Computational Linguistics.

Marco Tulio Ribeiro, Carlos Guestrin, and Sameer Singh. 2019. Are red roses red? Evaluating consistency of question-answering models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6174–6184, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1621

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865. https://doi.org/10.18653/v1/P18-1079

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.442
Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426. https://doi.org/10.18653/v1/2020.emnlp-main.437

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866. https://doi.org/10.1162/tacl_a_00349

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.346

Micah Shlain, Hillel Taub-Tabib, Shoval Sadde, and Yoav Goldberg. 2020. Syntactic search by example. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 17–23, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-demos.3

Vered Shwartz and Yejin Choi. 2020. Do neural language models overcome reporting bias? In Proceedings of the 28th International Conference on Computational Linguistics, pages 6863–6870. https://doi.org/10.18653/v1/2020.coling-main.605

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations.

Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2020. oLMpics—on what language model pre-training captures. Transactions of the Association for Computational Linguistics, 8:743–758. https://doi.org/10.1162/tacl_a_00342

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601. https://doi.org/10.18653/v1/P19-1452

Matthias Thimm. 2009. Measuring inconsistency in probabilistic knowledge bases. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI'09), pages 530–537. AUAI Press.

Matthias Thimm. 2013. Inconsistency measures for probabilistic logics. Artificial Intelligence, 197:1–24. https://doi.org/10.1016/j.artint.2013.02.001

James Thorne, Majid Yazdani, Marzieh Saeidi, Fabrizio Silvestri, Sebastian Riedel, and Alon Halevy. 2020. Neural databases. arXiv preprint arXiv:2010.06973.

Alex Warstadt, Yian Zhang, Xiaocheng Li, Haokun Liu, and Samuel R. Bowman. 2020. Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 217–235, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.16

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference.
In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1101

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-demos.6

Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. 2020. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net.

Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan Roth. 2020. Do language embeddings capture scales? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4889–4896. https://doi.org/10.18653/v1/2020.findings-emnlp.439

A Implementation Details

We rely heavily on Hugging Face's Transformers library (Wolf et al., 2020) for all experiments involving the PLMs. We used Weights & Biases for tracking and logging the experiments (Biewald, 2020). Finally, we used sklearn (Pedregosa et al., 2011) for other ML-related experiments.
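For concreteness, the snippet below is a minimal sketch, not the code used in our experiments, of how a cloze-style query like the ones studied in this paper can be issued to a masked PLM with the Transformers library. The model name (bert-base-cased) and the number of returned candidates (top_k=5) are illustrative assumptions.

# Minimal illustrative sketch (not the paper's actual code): issue a
# cloze-style query to a masked PLM via Hugging Face's fill-mask pipeline.
# Model name and top_k are arbitrary choices made for this example.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-cased")

query = "Homeland premiered on [MASK]."
for prediction in unmasker(query, top_k=5):
    # Each prediction holds the predicted token string and its score.
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")

Comparing the top-ranked prediction across different paraphrases of the same query is, in essence, what the consistency evaluation in this paper measures.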
B Paraphrases Analysis

We provide a characterization of the paraphrase types included in our dataset, PARAREL. We sample 100 paraphrase pairs from the agreement evaluation that were labeled as paraphrases and annotate the paraphrase type. Notice that the paraphrases can be complex; as such, multiple transformations can be annotated for each pair. We mainly use a subset of the paraphrase types categorized by Bhagat and Hovy (2013), but also define new types that were not covered by that work.

We begin by briefly defining the types of paraphrases found in PARAREL that come from Bhagat and Hovy (2013) (more thorough definitions can be found in their paper), and then define the new types we observed.

1. Synonym substitution: Replacing a word/phrase by a synonymous word/phrase, in the appropriate context.

2. Function word variations: Changing the function words in a sentence/phrase without affecting its semantics, in the appropriate context.

3. Converse substitution: Replacing a word/phrase with its converse and inverting the relationship between the constituents of a sentence/phrase, in the appropriate context, presenting the situation from the converse perspective.

4. Change of tense: Changing the tense of a verb, in the appropriate context.

5. Change of voice: Changing a verb from its active to passive form and vice versa, resulting in a paraphrase of the original sentence/phrase.

6. Verb/Noun conversion: Replacing a verb by its corresponding nominalized noun form and vice versa, in the appropriate context.

7. External knowledge: Replacing a word/phrase by another word/phrase based on extra-linguistic (world) knowledge, in the appropriate context.

8. Noun/Adjective conversion: Replacing a noun by its corresponding adjective form and vice versa, in the appropriate context.

9. Change of aspect: Changing the aspect of a verb, in the appropriate context.

We also define several other types of paraphrases not covered in Bhagat and Hovy (2013), potentially because they did not occur in the corpora they inspected:

a. Irrelevant addition: Addition or removal of a word or phrase that does not affect the meaning of the sentence (as far as the relation of interest is concerned) and can be inferred from the context independently.

b. Topicalization transformation: A transformation from or to a topicalization construction. Topicalization is a construction in which a clause is moved to the beginning of its enclosing clause.

c. Apposition transformation: A transformation from or to an apposition construction. In an apposition construction, two noun phrases, where one identifies the other, are placed next to each other.

d. Other syntactic movements: Includes other types of syntactic transformations that are not part of the other categories, such as moving an element from a coordinate construction to the subject position, as in the last example in Table 8. Another such transformation appears in the paraphrase pair ''[X] plays in [Y] position.'' and ''[X] plays in the position of [Y].'', where a compound noun phrase is replaced with a prepositional phrase.

We report the number of occurrences of each type, along with examples of paraphrases, in Table 8. The most common paraphrase type is 'synonym substitution', followed by 'function word variations'; they occurred 41 and 16 times, respectively. The least common type is 'change of aspect', which occurred only once in the sample.

The full PARAREL resource can be found at: https://github.com/yanaiela/pararel/tree/main/data/pattern_data/graphs_json

Paraphrase Type | Pattern #1 | Pattern #2 | Relation | N.
Synonym substitution | [X] died in [Y]. | [X] expired at [Y]. | place of death | 41
Function word variations | [X] is [Y] citizen. | [X], who is a citizen of [Y]. | country of citizenship | 16
Converse substitution | [X] maintains diplomatic relations with [Y]. | [Y] maintains diplomatic relations with [X]. | diplomatic relation | 10
Change of tense | [X] is developed by [Y]. | [X] was developed by [Y]. | developer | 10
Change of voice | [X] is owned by [Y]. | [Y] owns [X]. | owned by | 7
Verb/Noun conversion | The headquarter of [X] is in [Y]. | [X] is headquartered in [Y]. | headquarters location | 7
External knowledge | [X] is represented by music label [Y]. | [X], that is represented by [Y]. | record label | 3
Noun/Adjective conversion | The official language of [X] is [Y]. | The official language of [X] is the [Y] language. | official language | 2
Change of aspect | [X] plays in [Y] position. | playing as an [X], [Y] | position played on team | 1
Irrelevant addition | [X] shares border with [Y]. | [X] shares a common border with [Y]. | shares border with | 11
Topicalization transformation | [X] plays in [Y] position. | playing as a [Y], [X] | position played on team | 8
Apposition transformation | [X] is the capital of [Y]. | [Y]'s capital, [X]. | capital of | 4
Other syntactic movements | [X] and [Y] are twin cities. | [X] is a twin city of [Y]. | twinned administrative body | 10

Table 8: Different types of paraphrases in PARAREL. We report examples from each paraphrase type, along with the type of relation and the number of examples of the specific transformation in a random subset of 100 pairs. Each pair can be classified into more than a single transformation (we report one for brevity); thus the sum of transformations is more than 100.