Measuring and Improving Consistency in Pretrained Language Models
Yanai Elazar1,2 Nora Kassner3 Shauli Ravfogel1,2 Abhilasha Ravichander4
Eduard Hovy4 Hinrich Schütze3 Yoav Goldberg1,2
1Computer Science Department, Bar Ilan University, Israel
2Allen Institute for Artificial Intelligence, United States
3Center for Information and Language Processing (CIS), LMU Munich, Germany
4Language Technologies Institute, Carnegie Mellon University, United States
{yanaiela,shauli.ravfogel,yoav.goldberg}@gmail.com
kassner@cis.lmu.de {aravicha,hovy}@cs.cmu.edu
Abstract
Consistency of a model—that is, the invari-
ance of its behavior under meaning-preserving
alternations in its input—is a highly desirable
property in natural language processing. In this
paper we study the question: Are Pretrained
Language Models (PLMs) consistent with re-
spect to factual knowledge? To this end, we
create PARAREL, a high-quality resource of
English cloze-style query paraphrases. It con-
tains a total of 328 paraphrases for 38 relations.
Using PARAREL , we show that the consistency
of all PLMs we experiment with is poor—
though with high variance between relations.
Our analysis of the representational spaces of
PLMs suggests that they have a poor structure
and are currently not suitable for represent-
ing knowledge robustly. Finally, we propose a
method for improving model consistency and
experimentally demonstrate its effectiveness.1
1 Introduction
Pretrained Language Models (PLMs) are large
neural networks that are used in a wide variety of
NLP tasks. They operate under a pretrain-finetune
paradigm: Models are first pretrained over a large
text corpus and then finetuned on a downstream
task. PLMs are thought of as good language en-
coders, supplying basic language understanding
capabilities that can be used with ease for many
downstream tasks.
A desirable property of a good language un-
derstanding model is consistency: the ability to
make consistent decisions in semantically equiv-
alent contexts, reflecting a systematic ability to
generalize in the face of language variability.
1The code and resource are available at: https://github.com/yanaiela/pararel.
Examples of consistency include: predicting
the same answer in question answering and read-
ing comprehension tasks regardless of paraphrase
(Asai and Hajishirzi, 2020); making consistent
assignments in coreference resolution (Denis and
Baldridge, 2009; Chang et al., 2011); or making
summaries factually consistent with the original
document (Kryscinski et al., 2020). While consis-
tency is important in many tasks, nothing in the
training process explicitly targets it. One could
hope that the unsupervised training signal from
large corpora made available to PLMs such as
BERT or RoBERTa (Devlin et al., 2019; Liu
et al., 2019) is sufficient to induce consistency
and transfer it to downstream tasks. In this paper,
we show that this is not the case.
The recent rise of PLMs has sparked a discus-
sion about whether these models can be used as
Knowledge Bases (KBs) (Petroni et al., 2019;
2020; Davison et al., 2019; Peters et al., 2019;
Jiang et al., 2020; Roberts et al., 2020). Consis-
tency is a key property of KBs and is particularly
important for automatically constructed KBs. One
of the biggest appeals of using a PLM as a KB
is that we can query it in natural language—
instead of relying on a specific KB schema. The
expectation is that PLMs abstract away from lan-
guage and map queries in natural language into
meaningful representations such that queries with
identical intent but different language forms yield
the same answer. For example, the query ‘‘Home-
land premiered on [MASK]’’ should produce the
same answer as ‘‘Homeland originally aired on
[MASK]’’. Studying inconsistencies of PLM-KBs
can also teach us about the organization of knowl-
edge in the model, or lack thereof. Finally, failure
to behave consistently may point to other repre-
sentational issues such as the similarity between
antonyms and synonyms (Nguyen et al., 2016),
and overestimating events and actions (reporting
bias) (Shwartz and Choi, 2020).
In this work, we study the consistency of factual
knowledge in PLMs, specifically in Masked Lan-
guage Models (MLMs)—these are PLMs trained
with the MLM objective (Devlin et al., 2019; Liu
et al., 2019), as opposed to other strategies such
as standard language modeling (Radford et al.,
2019) or text-to-text (Raffel et al., 2020). We ask:
Is the factual information we extract from PLMs
invariant to paraphrasing? We use zero-shot eval-
uation since we want to inspect models directly,
without adding biases through finetuning. This
allows us to assess how much consistency was
acquired during pretraining and to compare the
consistency of different models. Overall, we find
that the consistency of the PLMs we consider is
poor, although there is a high variance between
relations.
We introduce PARAREL , a new benchmark that
enables us to measure consistency in PLMs (§3),
by using factual knowledge that was found to
be partially encoded in them (Petroni et al., 2019;
Jiang et al., 2020). PARAREL is a manually curated
resource that provides patterns—short
textual
prompts—that are paraphrases of one another,
with 328 paraphrases describing 38 binary rela-
tions such as X born-in Y, X works-for Y (§4). We
then test multiple PLMs for knowledge consis-
tency, namely, whether a model predicts the same
answer for all patterns of a relation. Figure 1 shows
an overview of our approach. Using PARAREL ,
we probe for consistency in four PLM types:
BERT, BERT-whole-word-masking, RoBERTa,
and ALBERT (§5). Our experiments with
PARAREL
show that current models have poor
consistency, although with high variance between
relations (§6).
Finally, we propose a method that improves
model consistency by introducing a novel con-
sistency loss (§8). We demonstrate that, trained
with this loss, BERT achieves better consis-
tency performance on unseen relations. However,
more work is required to achieve fully consistent
models.
2 Background
There has been significant interest in analyzing
how well PLMs (Rogers et al., 2020) perform
Figure 1: Overview of our approach. We expect that
a consistent model would predict the same answer
for two paraphrases. In this example, the model is
inconsistent on the Homeland and consistent on the
Seinfeld paraphrases.
on linguistic tasks (Goldberg, 2019; Hewitt and
Manning, 2019; Tenney et al., 2019; Elazar et al.,
2021), commonsense (Forbes et al., 2019; Da and
Kasai, 2019; Zhang et al., 2020), and reasoning
(Talmor et al., 2020; Kassner et al., 2020), usu-
ally assessed by measures of accuracy. However,
accuracy is just one measure of PLM perfor-
mance (Linzen, 2020). It is equally important that
PLMs do not make contradictory predictions (cf.
Figure 1), a type of error that humans rarely make.
There has been relatively little research attention
devoted to this question, that is, to analyze if
models behave consistently. One example con-
cerns negation: Ettinger (2020) and Kassner and
Schütze (2020) show that models tend to generate
facts and their negation, a type of inconsistent be-
havior. Ravichander et al. (2020) propose paired
probes for evaluating consistency. Our work is
broader in scope, examining the consistency of
PLM behavior across a range of factual knowl-
edge types and investigating how models can be
made to behave more consistently.
Consistency has also been highlighted as a
desirable property in automatically constructed
KBs and downstream NLP tasks. We now briefly
review work along these lines.
Consistency in knowledge bases (KBs) has
been studied in theoretical frameworks in the
context of the satisfiability problem and KB
construction, and efficient algorithms for detect-
ing inconsistencies in KBs have been proposed
(Hansen and Jaumard, 2000; Andersen and
Pretolani, 2001). Other work aims to quantify the
degree to which KBs are inconsistent and de-
tects inconsistent statements (Thimm, 2009, 2013;
Muiño, 2011).
Consistency in question answering was stud-
ied by Ribeiro et al. (2019) in two tasks: visual
question answering (Antol et al., 2015) and read-
ing comprehension (Rajpurkar et al., 2016). They
automatically generate questions to test the con-
sistency of QA models. Their findings suggest
that most models are not consistent in their pre-
dictions. In addition, they use data augmentation
to create more robust models. Alberti et al. (2019)
generate new questions conditioned on context
and answer from a labeled dataset and by filtering
answers that do not provide a consistent result
with the original answer. They show that pretrain-
ing on these synthetic data improves QA results.
Asai and Hajishirzi (2020) use data augmentation
that complements questions with symmetricity
and transitivity, as well as a regularizing loss that
penalizes inconsistent predictions. Kassner et al.
(2021b) propose a method to improve accuracy
and consistency of QA models by augmenting
a PLM with an evolving memory that records
PLM answers and resolves inconsistency between
answers.
Work on consistency in other domains in-
cludes Du et al. (2019) where prediction of con-
sistency in procedural text is improved. Ribeiro
et al. (2020) use consistency for more robust
evaluation. Li et al. (2019) measure and miti-
gate inconsistency in natural language inference
(NLI), and finally, Camburu et al. (2020) propose
a method for measuring inconsistencies in NLI
explanations (Camburu et al., 2018).
3 Probing PLMs for Consistency
In this section, we formally define consistency and
describe our framework for probing consistency
of PLMs.
3.1 Consistency
We define a model as consistent
if, given
two cloze-phrases such as ‘‘Seinfeld originally
aired on [MASK]’’ and ‘‘Seinfeld premiered on
[MASK]’’ that are quasi-paraphrases, it makes
non-contradictory predictions2 on N-1 relations
over a large set of entities. A quasi-paraphrase—a
2We refer to non-contradictory predictions as predictions
that, as the name suggests, do not contradict one another.
For instance, predicting two different cities as the birth
place of a person is considered to be contradictory, but
predicting a city and its country is not.
concept introduced by Bhagat and Hovy (2013)—
is a fuzzier version of a paraphrase. The
concept does not rely on the strict, logical defini-
tion of paraphrase and allows us to operationalize
concrete uses of paraphrases. This definition is
in the spirit of the RTE definition (Dagan et al.,
2005), which similarly supports a more flexible
use of the notion of entailment. For instance, a
model that predicts NBC and ABC on the two
aforementioned patterns, is not consistent, since
these two facts are contradictory. We define a
cloze-pattern as a cloze-phrase that expresses a
relation between a subject and an object. Note
that consistency does not require the answers to
be factually correct. While correctness is also an
important property for KBs, we view it as a sep-
arate objective and measure it independently. We
use the terms paraphrase and quasi-paraphrase
interchangeably.
Many-to-many (N-M) relations (e.g., shares-
border-with) can be consistent even with different
answers (given they are correct). For instance, two
patterns that express the shares-border-with rela-
tion and predict Albania and Bulgaria for Greece
are both correct. We do not consider such relations
for measuring consistency. However, another re-
quirement from a KB is determinism, that is, re-
turning the results in the same order (when more
than a single result exists). In this work, we focus
on consistency, but also measure determinism of
the models we inspect.
3.2 The Framework
An illustration of the framework is presented
in Figure 2. Let Di be a set of subject-object
KB tuples (e.g., ⟨Homeland, Showtime⟩) of
some relation ri (e.g., originally-aired-on), ac-
companied with a set of quasi-paraphrase
cloze-patterns Pi (e.g., X originally aired on Y).
Our goal is to test whether the model consistently
predicts the same object (e.g., Showtime) for a
particular subject (e.g., Homeland).3 To this end,
we substitute X with a subject from Di and Y with
[MASK] in all of the patterns Pi of that relation
(e.g., Homeland originally aired on [MASK] and
Homeland premiered on [MASK]). A consistent
model must predict the same entity.
3Although it is possible to also predict the subject from the
object, in the cases of N-1 relations more than a single answer
would be possible, thus converting the test from measuring
consistency to measuring determinism instead.
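To make the framework concrete, the following minimal sketch (assuming the HuggingFace transformers fill-mask pipeline and an arbitrary bert-base-cased checkpoint; the two patterns and the subject are only illustrative) populates a pair of quasi-paraphrases with the same subject and compares their top-1 predictions:

    from transformers import pipeline

    # Populate each quasi-paraphrase with the same subject and a mask token,
    # query the MLM, and check whether the top-1 predictions agree.
    fill_mask = pipeline("fill-mask", model="bert-base-cased")  # illustrative checkpoint

    patterns = ["[X] originally aired on [MASK].", "[X] premiered on [MASK]."]
    subject = "Homeland"

    predictions = [fill_mask(p.replace("[X]", subject), top_k=1)[0]["token_str"]
                   for p in patterns]
    print(predictions, "consistent:", len(set(predictions)) == 1)

In the actual evaluation the comparison is restricted to the relation's candidate set, as described next.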
Figure 2: Overview of our framework for assessing model consistency. Di (‘‘Data Pairs (D)’’ on the left) is a set
of KB triplets of some relation ri, which are coupled with a set of quasi-paraphrase cloze-patterns Pi (‘‘Patterns
(P )’’ on the right) that describe that relation. We then populate the subjects from Di as well as a mask token into
all patterns Pi (shown in the middle) and expect a model to predict the same object across all pattern pairs.
Restricted Candidate Sets Since PLMs were
not trained for serving as KBs, they often predict
words that are not KB entities; for example, a PLM
may predict, for the pattern ‘‘Showtime originally
aired on [MASK]’’, the noun ‘tv’—which is also
a likely substitution for the language modeling
objective, but not a valid KB fact completion.
Therefore, following others (Xiong et al., 2020;
Ravichander et al., 2020; Kassner et al., 2021a),
we restrict the PLMs’ output vocabulary to the set
of possible gold objects for each relation from the
underlining KB. For example, in the born-in rela-
tion, instead of inspecting the entire vocabulary of
a model, we only keep objects from the KB, such
as Paris, London, Tokyo, and so forth.
Note that this setup makes the task easier for the
PLM, especially in the context of KBs. However,
poor consistency in this setup strongly implies
that consistency would be even lower without
restricting candidates.
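A rough sketch of this restriction (assuming transformers and PyTorch, a bert-base-cased checkpoint, and a toy candidate list; in the paper the candidates are the gold objects of each relation and are single tokens by construction):

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

    def best_candidate(sentence, candidates):
        # Score only the candidate objects at the masked position
        # instead of the full vocabulary.
        inputs = tokenizer(sentence, return_tensors="pt")
        mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
        with torch.no_grad():
            logits = model(**inputs).logits[0, mask_pos]
        cand_ids = tokenizer.convert_tokens_to_ids(candidates)  # candidates assumed single-token
        return candidates[int(torch.argmax(logits[cand_ids]))]

    print(best_candidate("Homeland originally aired on [MASK].",
                         ["Showtime", "NBC", "ABC", "CNN"]))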
4 The PARAREL Resource
We now describe PARAREL, a resource designed
for our framework (cf. Section 3.2). PARAREL is
curated by experts, with a high level of agreement.
It contains patterns for 38 relations4 from T-REx
(Elsahar et al., 2018)—a large dataset containing
KB triples aligned with Wikipedia abstracts—with
an average of 8.63 patterns per relation. Table 1
gives statistics. We further analyze the paraphrases
4Using the 41 relations from LAMA (Petroni et al., 2019),
leaving out three relations that are poorly defined, or consist
of mixed and heterogeneous entities.
used in this resource, partly based on the types
defined in Bhagat and Hovy (2013), and report
this analysis in Appendix B.
Construction Method PARAREL was con-
structed in four steps. (1) We began with the
patterns provided by LAMA (Petroni et al., 2019)
(one pattern per relation, referred to as base-
pattern). (2) We augmented each base-pattern with
other patterns that are paraphrases from LPAQA
(Jiang et al., 2020). However, since LPAQA was
created automatically (either by back-translation
or by extracting patterns from sentences that con-
tain both subject and object), some LPAQA pat-
terns are not correct paraphrases. We therefore
only include the subset of correct paraphrases.
(3) Using SPIKE (Shlain et al., 2020),5 a search
engine over Wikipedia sentences that supports
syntax-based queries, we searched for additional
patterns that appeared in Wikipedia and added
them to PARAREL . Specifically, we searched for
Wikipedia sentences containing a subject-object
tuple from T-REx and then manually extracted
patterns from the sentences. (4) Lastly, we added
additional paraphrases of the base-pattern using
the annotators’ linguistic expertise. Two addi-
tional experts went over all the patterns and cor-
rected them, while engaging in a discussion until
reaching agreement, discarding patterns they
could not agree on.
Human Agreement To assess the quality of
PARAREL , we run a human annotation study. For
5https://spike.apps.allenai.org/.
# Relations                    38
# Patterns                     328
Min # patterns per rel.        2
Max # patterns per rel.        20
Avg # patterns per rel.        8.63
Avg syntax                     4.74
Avg lexical                    6.03

Table 1: Statistics of PARAREL. Last two rows: average number of unique syntactic/lexical variations of patterns for a relation.
each relation, we sample up to five paraphrases,
comparing each of the new patterns to the base-
pattern from LAMA. That is, if relation ri con-
tains the following patterns: p1, p2, p3, p4, and p1
is the base-pattern, then we compare the following
pairs (p1, p2), (p1, p3), (p1, p4).
We populate the patterns with random subjects
and objects pairs from T-REx (Elsahar et al.,
2018) and ask annotators if these sentences are
paraphrases. We also sample patterns from dif-
ferent relations to provide examples that are not
paraphrases of each other, as a control. Each task
contains five patterns that are thought to be para-
phrases and two that are not.6 Overall, we collect
annotations for 156 paraphrase candidates and 61
controls.
We asked NLP graduate students to annotate
the pairs and collected one answer per pair.7
The agreement scores for the paraphrases and the
controls are 95.5% and 98.3%, respectively, which
is high and indicates PARAREL ’s high quality.
We also inspected the disagreements and fixed
many additional problems to further improve
quality.
5 Experimental Setup
5.1 Models and Data
We experiment with four PLMs: BERT, BERT
whole-word-masking8 (Devlin et al., 2019),
RoBERTa (Liu et al., 2019), and ALBERT (Lan
et al., 2019). For BERT, RoBERTa, and ALBERT,
we use a base and a large version.9 We also report
a majority baseline that always predicts the most
common object for a relation. By construction,
this baseline is perfectly consistent.
We use knowledge graph data from T-REx
(Elsahar et al., 2018).10 To make the results com-
parable across models, we remove objects that are
not represented as a single token in all models’
vocabularies; 26,813 tuples remain.11 We further
split the data into N-M relations for which we
report determinism results (seven relations) and
N-1 relations for which we report consistency (31
relations).
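The single-token filter can be sketched as follows (the tokenizer names are an illustrative subset, and BPE-based vocabularies may additionally require handling of leading-space markers):

    from transformers import AutoTokenizer

    names = ["bert-base-cased", "roberta-base", "albert-base-v2"]  # illustrative subset
    tokenizers = [AutoTokenizer.from_pretrained(n) for n in names]

    def single_token_everywhere(obj):
        # Keep a tuple only if its object is one token in every model's vocabulary.
        return all(len(t.tokenize(obj)) == 1 for t in tokenizers)

    triples = [("Homeland", "original-network", "Showtime"),
               ("Paris", "capital-of", "France")]                  # toy T-REx-style tuples
    kept = [t for t in triples if single_token_everywhere(t[2])]
    print(kept)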
5.2 Evaluation
Our consistency measure for a relation ri (Consis-
tency) is the percentage of consistent predictions
over all the pattern pairs p_k, p_l ∈ Pi of that relation,
for all its KB tuples d_j ∈ Di. Thus, for each KB
tuple from a relation ri that contains n patterns,
we consider predictions for n(n − 1)/2 pairs.
We also report Accuracy, that is, the acc@1
of a model in predicting the correct object, using
the original patterns from Petroni et al. (2019).
In contrast to Petroni et al. (2019), we define it
as the accuracy of the top-ranked object from the
candidate set of each relation. Finally, we report
Consistent-Acc, a new measure that evaluates indi-
vidual objects as correct only if all patterns of the
corresponding relation predict the object correctly.
Consistent-Acc is much stricter and combines the
requirements of both consistency (Consistency)
and factual correctness (Accuracy). We report the
average over relations (i.e., macro average), but
notice that the micro average produces similar
results.
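The three measures can be sketched as follows, assuming the top-1 predictions of every pattern have already been collected per subject, with the original LAMA pattern first so that Accuracy matches the setup of Petroni et al. (2019); macro-averaging over relations happens outside this function:

    from itertools import combinations

    def pairwise_agreement(predictions_for_tuple):
        # Fraction of the n(n-1)/2 pattern pairs that predict the same object.
        pairs = list(combinations(predictions_for_tuple, 2))
        return sum(a == b for a, b in pairs) / len(pairs)

    def relation_scores(preds, gold):
        # preds: {subject: [top-1 prediction of each pattern]}, base (LAMA) pattern first.
        # gold:  {subject: gold object}. Returns (Accuracy, Consistency, Consistent-Acc).
        n = len(preds)
        accuracy = sum(p[0] == gold[s] for s, p in preds.items()) / n
        consistency = sum(pairwise_agreement(p) for p in preds.values()) / n
        consistent_acc = sum(all(x == gold[s] for x in p) for s, p in preds.items()) / n
        return accuracy, consistency, consistent_acc

    print(relation_scores({"Homeland": ["Showtime", "Showtime", "NBC"]},
                          {"Homeland": "Showtime"}))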
6 Experiments and Results
6.1 Knowledge Extraction through
Different Patterns
We begin by assessing our patterns as well as the
degree to which they extract the correct entities.
These results are summarized in Table 2.
6The controls contain the same subjects and objects so
that only the pattern (not its arguments) can be used to solve
the task.
7We asked the annotators to re-annotate any mismatch
with our initial label, to allow them to fix random mistakes.
8BERT whole-word-masking is BERT’s version where
words that are tokenized into multiple tokens are masked
together.
9For ALBERT we use the smallest and largest versions.
10We discard three poorly defined relations from T-REx.
11In a few cases, we filter entities from certain relations that
contain multiple fine-grained relations to make our patterns
compatible with the data. For instance, most of the instances
for the genre relation describe music genres, thus we remove
some of the tuples where the objects include non-music genres
such as ‘satire’, ‘sitcom’, and ‘thriller’.
Model               Succ-Patt    Succ-Objs    Unk-Const    Know-Const
majority            97.3±7.3     23.2±21.0    100.0±0.0    100.0±0.0
BERT-base           100.0±0.0    63.0±19.9    46.5±21.7    63.8±24.5
BERT-large          100.0±0.0    65.7±22.1    48.1±20.2    65.2±23.8
BERT-large-wwm      100.0±0.0    64.9±20.3    49.5±20.1    65.3±25.1
RoBERTa-base        100.0±0.0    56.2±22.7    43.9±15.8    56.3±19.0
RoBERTa-large       100.0±0.0    60.1±22.3    46.8±18.0    60.5±21.1
ALBERT-base         100.0±0.0    45.8±23.7    41.4±17.3    56.3±22.0
ALBERT-xxlarge      100.0±0.0    58.8±23.8    40.5±16.4    57.5±23.8

Table 2: Extractability measures in the different models we inspect. Best model for each measure highlighted in bold.
First, we report Succ-Patt, the percentage of
patterns that successfully predicted the right ob-
ject at least once. A high score suggests that the
patterns are of high quality and enable the models
to extract the correct answers. All PLMs achieve
a perfect score. Next, we report Succ-Objs, the
percentage of entities that were predicted cor-
rectly by at least one of the patterns. Succ-Objs
quantifies the degree to which the models ‘‘have’’
the required knowledge. We observe that some
tuples are not predicted correctly by any of our
patterns: The scores vary between 45.8% for
ALBERT-base and 65.7% for BERT-large. With
an average number of 8.63 patterns per relation,
there are multiple ways to extract the knowledge;
we thus interpret these results as evidence that a
large part of T-REx knowledge is not stored in
these models.
Finally, we measure Unk-Const, a consistency
measure for the subset of tuples for which no pat-
tern predicted the correct answer; and Know-
Const, consistency for the subset where at least
one of the patterns for a specific relation pre-
dicted the correct answer. This split into subsets is
based on Succ-Objs. Overall, the results indicate
that when the factual knowledge is successfully
extracted, the model is also more consistent. For
instance, for BERT-large, Know-Const is 65.2%
and Unk-Const is 48.1%.
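This split can be sketched by reusing the pairwise_agreement helper from the evaluation sketch in §5.2 (the per-pattern predictions and gold objects are assumed given):

    def know_unk_consistency(preds, gold):
        # "Known" tuples: the gold object was retrieved by at least one pattern (Succ-Objs);
        # consistency is then averaged separately over the known and unknown subsets.
        known = [s for s, p in preds.items() if gold[s] in p]
        unknown = [s for s in preds if s not in set(known)]
        know_const = sum(pairwise_agreement(preds[s]) for s in known) / max(len(known), 1)
        unk_const = sum(pairwise_agreement(preds[s]) for s in unknown) / max(len(unknown), 1)
        return know_const, unk_const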
6.2 Consistency and Knowledge
In this section, we report the overall knowledge
measure that was used in Petroni et al. (2019) (Ac-
curacy), the consistency measure (Consistency),
and Consistent-Acc, which combines knowledge
and consistency. The results are
summarized in Table 3.
We begin with the Accuracy results. The results
range between 29.8% (ALBERT-base) and 48.7%
Model               Accuracy     Consistency    Consistent-Acc
majority            23.1±21.0    100.0±0.0      23.1±21.0
BERT-base           45.8±25.6    58.5±24.2      27.0±23.8
BERT-large          48.1±26.1    61.1±23.0      29.5±26.6
BERT-large-wwm      48.7±25.0    60.9±24.2      29.3±26.9
RoBERTa-base        39.0±22.8    52.1±17.8      16.4±16.4
RoBERTa-large       43.2±24.7    56.3±20.4      22.5±21.1
ALBERT-base         29.8±22.8    49.8±20.1      16.7±20.3
ALBERT-xxlarge      41.7±24.9    52.1±22.4      23.8±24.8

Table 3: Knowledge and consistency results. Best model for each measure in bold.
(BERT-large whole-word-masking). Notice that
our numbers differ from Petroni et al. (2019) as
we use a candidate set (§3) and only consider KB
triples whose object is a single token in all the
PLMs we consider (§5.1).
Next, we report Consistency (§5.2). The BERT
models achieve the highest scores. There is a con-
sistent improvement from base to large versions
of each model. In contrast to previous work that
observed quantitative and qualitative improve-
ments of RoBERTa-based models over BERT
(Liu et al., 2019; Talmor et al., 2020), in terms
of consistency, BERT is more consistent than
RoBERTa and ALBERT. Still, the overall results
are low (61.1% for the best model), even more
remarkably so because the restricted candidate set
makes the task easier. We note that the results are
highly variant between models (performance on
original-language varies between 52% and 90%),
and relations (BERT-large performance is 92% on
capital-of and 44% on owned-by).
Finally, we report Consistent-Acc: the results
are much lower than for Accuracy, as expected, but
follow similar trends: RoBERTa-base performs
worse (16.4%) and BERT-large best (29.5%).
Interestingly, we find strong correlations be-
tween Accuracy and Consistency, ranging from
67.3% for RoBERTa-base to 82.1% for BERT-
large (all with small p-values ≪ 0.01).
A striking result of the model comparison is
the clear superiority of BERT, both in knowledge
accuracy (which was also observed by Shin et al.,
2020) and knowledge consistency. We hypothe-
size this result is caused by the different sources
of training data: although Wikipedia is part of the
training data for all models we consider, for BERT
it is the main data source, but for RoBERTa and
ALBERT it is only a small portion. Thus, when
using additional data, some of the facts may be
Model                     Acc          Consistency    Consistent-Acc
majority                  23.1±21.0    100.0±0.0      23.1±21.0
RoBERTa-med-small-1M      11.2±9.4     37.1±11.0      2.8±4.0
RoBERTa-base-10M          17.3±15.8    29.8±12.7      3.2±5.1
RoBERTa-base-100M         22.1±17.1    31.5±13.0      3.7±5.3
RoBERTa-base-1B           38.0±23.4    50.6±19.8      18.0±16.0

Table 4: Knowledge and consistency results for different RoBERTas, trained on increasing amounts of data. Best model for each measure in bold.
Model               Diff-Syntax    No-Change
majority            100.0±0.0      100.0±0.0
BERT-base           67.9±30.3      76.3±22.6
BERT-large          67.5±30.2      78.7±14.7
BERT-large-wwm      63.0±31.7      81.1±9.7
RoBERTa-base        66.9±10.1      80.7±5.2
RoBERTa-large       69.7±19.2      80.3±6.8
ALBERT-base         62.3±22.8      72.6±11.5
ALBERT-xxlarge      51.7±26.0      67.3±17.1

Table 5: Consistency and standard deviation when only syntax differs (Diff-Syntax) and when syntax and lexical choice are identical (No-Change). Best model for each metric is highlighted in bold.
forgotten, or contradicted in the other corpora; this
can diminish knowledge and compromise consis-
tency behavior. Thus, since Wikipedia is likely the
largest unified source of factual knowledge that
exists in unstructured data, giving it prominence
in pretraining makes it more likely that the model
will incorporate Wikipedia’s factual knowledge
well. These results may have a broader impact
on models to come: Training bigger models with
more data (such as GPT-3 [Brown et al., 2020]) is
not always beneficial.
Determinism We also measure determinism for
N-M relations—that is, we use the same measure
as Consistency, but since different predictions
may be factually correct, these do not necessar-
ily convey consistency violations, but indicate
non-determinism. For brevity, we do not present
all results, but the trend is similar to the con-
sistency result (although not comparable, as the
relations are different): 52.9% and 44.6% for
BERT-large and RoBERTa-base, respectively.
Effect of Pretraining Corpus Size Next, we
study the question of whether the number of
tokens used during pretraining contributes to con-
sistency. We use the pretrained RoBERTa models
from Warstadt et al. (2020) and repeat the ex-
periments on four additional models. These are
RoBERTa-based models, trained on a sample of
Wikipedia and the book corpus, with varying train-
ing sizes and parameters. We use one of the three
published models for each configuration and re-
port the average accuracy over the relations for
each model in Table 4. Overall, Accuracy and
Consistent-Acc improve with more training data.
However, there is an interesting outlier to this
trend: The model that was trained on one mil-
lion tokens is more consistent than the models
trained on ten and one-hundred million tokens.
A potentially crucial difference is that this model
has many fewer parameters than the rest (to avoid
overfitting). It is nonetheless interesting that a
model that is trained on significantly less data can
achieve better consistency. On the other hand, its
accuracy scores are lower, arguably due to the
model being exposed to less factual knowledge
during pretraining.
6.3 Do PLMs Generalize Over Syntactic
Configurations?
Many papers have found neural models (especially
PLMs) to naturally encode syntax (Linzen et al.,
2016; Belinkov et al., 2017; Marvin and Linzen,
2018; Belinkov and Glass, 2019; Goldberg, 2019;
Hewitt and Manning, 2019). Does this mean that
PLMs have successfully abstracted knowledge
and can comprehend and produce it regardless of
syntactic variation? We consider two scenarios.
(1) Two patterns differ only in syntax. (2) Both
syntax and lexical choice are the same. As a proxy,
we define syntactic equivalence when the depen-
dency path between the subject and object is
identical. We parse all patterns from PARAREL
using a dependency parser (Honnibal et al.,
2020)12 and retain the path between the entities.
Success on (1) indicates that the model’s knowl-
edge processing is robust to syntactic variation.
Success on (2) indicates that the model’s knowl-
edge processing is robust to variation in word
order and tense.
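A sketch of this proxy, assuming spaCy with the en_core_web_sm model (the dummy entities used to fill the slots and the exact path encoding are illustrative):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed English model; any dependency parser works

    def dep_path(pattern, x="France", y="Paris"):
        # Dependency labels on the path between the [X] and [Y] slots, obtained by
        # filling the slots with dummy entities, parsing, and walking up to the
        # lowest common ancestor of the two slot tokens.
        doc = nlp(pattern.replace("[X]", x).replace("[Y]", y))
        tx = next(t for t in doc if t.text == x)
        ty = next(t for t in doc if t.text == y)
        anc_x = [tx] + list(tx.ancestors)
        anc_y = [ty] + list(ty.ancestors)
        lca = next(t.i for t in anc_x if t.i in {a.i for a in anc_y})
        up, down = [], []
        for t in anc_x:
            if t.i == lca:
                break
            up.append(t.dep_)
        for t in anc_y:
            if t.i == lca:
                break
            down.append(t.dep_)
        return tuple(up + ["^"] + list(reversed(down)))

    # Two patterns are treated as syntactically equivalent iff their paths match.
    print(dep_path("[X] was born in [Y]."), dep_path("[X] is native to [Y]."))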
Table 5 reports the results. While these and the
main results on the entire dataset are not compa-
rable as the pattern subsets are different, they are
higher than the general results: 67.5% for BERT-
large when only the syntax differs and 78.7% when
12https://spacy.io/.
1. Subject: Adriaan Pauw           Object: Amsterdam
   Pattern #1: [X] was born in [Y].             Pred #1: Amsterdam
   Pattern #2: [X] is native to [Y].            Pred #2: Madagascar
   Pattern #3: [X] is a [Y]-born person.        Pred #3: Luxembourg

2. Subject: Nissan Livina Geniss   Object: Nissan
   Pattern #1: [X] is produced by [Y].          Pred #1: Nissan
   Pattern #2: [X] is created by [Y].           Pred #2: Renault
   Pattern #3: [X], created by [Y].             Pred #3: Renault

3. Subject: Albania                Object: Serbia
   Pattern #1: [X] shares border with [Y].      Pred #1: Greece
   Pattern #2: [Y] borders with [X].            Pred #2: Turkey
   Pattern #3: [Y] shares the border with [X]   Pred #3: Kosovo

4. Subject: iCloud                 Object: Apple
   Pattern #1: [X] is developed by [Y].         Pred #1: Microsoft
   Pattern #2: [X], created by [Y].
   Pattern #3: [X] was created by [Y]           Pred #3: Sony

5. Subject: Yahoo! Messenger       Object: Yahoo
   Pattern #1: [X], a product created by [Y]    Pred #1: Microsoft
   Pattern #2: [X], a product developed by [Y]  Pred #2: Microsoft
   Pattern #3: [Y], that developed [X]          Pred #3: Microsoft

6. Subject: Wales                  Object: Cardiff
   Pattern #1: The capital of [X] is [Y].       Pred #1: Cardiff
   Pattern #2: [X]'s capital, [Y].              Pred #2: Cardiff
   Pattern #3: [X]'s capital city, [Y].         Pred #3: Cardiff

Table 6: Predictions of BERT-large-cased. "Subject" and "Object" are from T-REx (Elsahar et al., 2018). "Pattern #i" / "Pred #i": three different patterns from our resource and their predictions. The predictions are colored in blue if the model predicted correctly (out of the candidate list), and in red otherwise. If there is more than a single erroneous prediction, it is colored by a different red.
the syntax is identical. This demonstrates that
while PLMs have impressive syntactic abilities,
they struggle to extract factual knowledge in the
face of tense, word-order, and syntactic variation.
McCoy et al. (2019) show that supervised mod-
els trained on MNLI (Williams et al., 2018), an
NLI dataset (Bowman et al., 2015), use superficial
syntactic heuristics rather than more generalizable
properties of the data. Our results indicate that
PLMs have problems along the same lines: They
are not robust to surface variation.
7 Analysis
7.1 Qualitative Analysis
To better understand the factors affecting con-
sistent predictions, we inspect the predictions of
BERT-large on the patterns shown in Table 6. We
highlight several cases: The predictions in Exam-
ple #1 are inconsistent, and correct for the first
pattern (Amsterdam), but not for the other two
(Madagascar and Luxembourg). The predictions
in Example #2 also show a single pattern that
predicted the right object; however, the two other
patterns, which are lexically similar, predicted the
same, wrong answer—Renault. Next, the patterns
of Example #3 produced two factually correct an-
swers out of three (Greece, Kosovo), but simply
do not correspond to the gold object in T-REx
(Serbia), since this is an M-N relation. Note that
this relation is not part of the consistency evalu-
ation, but the determinism evaluation. The three
different predictions in example #4 are all incor-
rect. Finally, the two last predictions demonstrate
consistent predictions: Example #5 is consistent
but factually incorrect (even though the correct
answer is a substring of the subject), and finally,
Example #6 is consistent and factual.
Figure 3: t-SNE of the encoded patterns from the
capital relation. The colors represent the differ-
ent subjects, while the shapes represent patterns. A
knowledge-focused representation should cluster based
on identical subjects (color), but instead the clustering
is according to identical patterns (shape).
7.2 Representation Analysis
To provide insights on the models’ representa-
tions, we inspect these after encoding the patterns.
Motivated by previous work that found that
words with the same syntactic structure cluster
together (Chi et al., 2020; Ravfogel et al., 2020)
we perform a similar experiment to test if this be-
havior replicates with respect to knowledge: We
encode the patterns, after filling the placehold-
ers with subjects and masked tokens and inspect
the last layer representations in the masked token
position. When plotting the results using t-SNE
(Maaten and Hinton, 2008) we mainly observe
clustering based on the patterns, which suggests
that encoding of knowledge of the entity is not the
main component of the representations. Figure 3
demonstrates this for BERT-large encodings of
the capital relation, which is highly consistent.13
To provide a more quantitative assessment of this
13While some patterns are clustered based on the subjects
(upper-left part), most of them are clustered based on patterns.
phenomenon, we also cluster the representations
and set the number of centroids based on:14 (1) the
number of patterns in each relation, which aims to
capture pattern-based clusters, and (2) the number
of subjects in each relation, which aims to cap-
ture entity-based clusters. This would allow for
a perfect clustering, in the case of perfect align-
ment between the representation and the inspected
property. We measure the purity of these clusters
using V-measure and observe that the clusters
are mostly grouped by the patterns, rather than
the subjects. Finally, we compute the Spearman
correlation between the consistency scores and
the V-measure of the representations. However,
the correlation between these variables is close
to zero,15 therefore not explaining the models’
behavior. We repeated these experiments while
inspecting the objects instead of the subjects, and
found similar trends. This finding is interesting
since it means that (1) these representations are
not knowledge-focused, i.e., their main compo-
nent does not relate to knowledge, and (2) the
representation in its entirety does not explain the
behavior of the model, and thus only a subset
of the representation does. This finding is con-
sistent with previous work that observed similar
trends for linguistic tasks (Elazar et al., 2021).
We hypothesize that this disparity between the
representation and the behavior of the model may
be explained by a situation where the distance
between representations largely does not reflect
the distance between predictions, but rather other,
behaviorally irrelevant factors of a sentence.
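The clustering analysis can be sketched as follows (assuming transformers and scikit-learn; the checkpoint, patterns, and subjects are toy illustrations, and the paper additionally repeats the clustering with one centroid per subject):

    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.cluster import KMeans
    from sklearn.metrics import v_measure_score

    tok = AutoTokenizer.from_pretrained("bert-base-cased")
    enc = AutoModel.from_pretrained("bert-base-cased")

    def mask_vector(sentence):
        # Last-layer representation at the masked-token position.
        inputs = tok(sentence, return_tensors="pt")
        pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
        with torch.no_grad():
            return enc(**inputs).last_hidden_state[0, pos].numpy()

    patterns = ["The capital of [X] is [MASK].", "[X]'s capital is [MASK]."]  # toy relation
    subjects = ["France", "Canada", "Japan", "Kenya"]

    vectors, pattern_ids, subject_ids = [], [], []
    for pi, pattern in enumerate(patterns):
        for si, subject in enumerate(subjects):
            vectors.append(mask_vector(pattern.replace("[X]", subject)))
            pattern_ids.append(pi)
            subject_ids.append(si)

    clusters = KMeans(n_clusters=len(patterns)).fit_predict(vectors)
    print("V-measure w.r.t. patterns:", v_measure_score(pattern_ids, clusters))
    print("V-measure w.r.t. subjects:", v_measure_score(subject_ids, clusters))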
8 Improving Consistency in PLMs
In the previous sections, we showed PLMs are
generally not consistent in their predictions, and
previous works have noticed the lack of this prop-
erty in a variety of downstream tasks. An ideal
model would exhibit the consistency property after
pretraining, and would then be able to transfer it
to different downstream tasks. We therefore ask:
Can we enhance current PLMs and make them
more consistent?
8.1 Consistency Improved PLMs
We propose to improve the consistency of PLMs
by continuing the pretraining step with a novel
consistency loss. We make use of the T-REx
tuples and the paraphrases from PARAREL .
For each relation ri, we have a set of para-
phrased patterns Pi describing that relation. We
use a PLM to encode all patterns in Pi, after popu-
lating a subject that corresponds to the relation ri
and a mask token. We expect the model to make
the same prediction for the masked token for all
patterns.
Consistency Loss Function As we evaluate the
model using acc@1, the straight-forward consis-
tency loss would require these predictions to be
identical:
\min_{\theta} \ \mathrm{sim}\big(\arg\max_i f_\theta(P_n)[i], \ \arg\max_j f_\theta(P_m)[j]\big)
where fθ(Pn) is the output of an encoding function
(e.g., BERT) parameterized by θ (a vector) over
input Pn, and fθ(Pn)[i] is the score of the ith
vocabulary item of the model.
However, this objective contains a comparison
between the output of two argmax operations,
making it discrete and discontinuous, and hard to
optimize in a gradient-based framework. We in-
stead relax the objective, and require that the pre-
dicted distributions Qn = softmax(fθ(Pn)), rather
than the top-1 prediction, be identical to each
other. We use two-sided KL Divergence to measure
similarity between distributions: D_{KL}(Q^{r_i}_n \| Q^{r_i}_m) + D_{KL}(Q^{r_i}_m \| Q^{r_i}_n),
where Q^{r_i}_n is the predicted
distribution for pattern P_n of relation r_i.
As most of the vocabulary is not relevant for
the predictions, we filter it down to the k tokens
from the candidate set of each relation (§3.2). We
want to maintain the original capabilities of the
model—focusing on the candidate set helps to
achieve this goal since most of the vocabulary is
not affected by our new loss.
To encourage a more general solution, we make
use of all the paraphrases together, and enforce
all predictions to be as close as possible. Thus,
the consistency loss for all pattern pairs for a
particular relation ri is:
L_c = \sum_{n=1}^{k} \sum_{m=n+1}^{k} D_{KL}(Q^{r_i}_n \| Q^{r_i}_m) + D_{KL}(Q^{r_i}_m \| Q^{r_i}_n)
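A PyTorch sketch of this loss (the logits at the masked position of each pattern and the relation's candidate token ids are assumed given; the released implementation may differ in details):

    import torch
    import torch.nn.functional as F

    def consistency_loss(mask_logits, candidate_ids):
        # mask_logits: [n_patterns, vocab_size] scores at the masked position of each
        # paraphrase of one KB tuple; candidate_ids: the k candidate-object token ids.
        restricted = mask_logits[:, candidate_ids]          # keep only the candidate set
        log_q = F.log_softmax(restricted, dim=-1)
        q = log_q.exp()
        loss = mask_logits.new_zeros(())
        n = restricted.size(0)
        for a in range(n):
            for b in range(a + 1, n):                       # all n(n-1)/2 pattern pairs
                loss = loss + F.kl_div(log_q[a], q[b], reduction="sum") \
                            + F.kl_div(log_q[b], q[a], reduction="sum")
        return loss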
14Using the KMeans algorithm.
15Except for BERT-large whole-word-masking, where the
correlation is 39.5 (p < 0.05).
MLM Loss Since the consistency loss is dif-
ferent from the Cross-Entropy loss the PLM is
trained on, we find it important to continue the
MLM loss on text data, similar to previous work
(Geva et al., 2020).
We consider two alternatives for continuing the
pretraining objective: (1) MLM on Wikipedia and
(2) MLM on the patterns of the relations used
for the consistency loss. We found that the latter
works better. We denote this loss by LM LM .
Consistency Guided MLM Continual Training
We continue the PLM training by combining our
novel consistency loss with the regular MLM
loss. The combination of
the two losses is determined by a hyperparameter
λ, resulting in the following final loss function:
L = \lambda L_c + L_{MLM}
This loss is computed per relation, for one KB
tuple. We have many of these instances, which we
require to behave similarly. Therefore, we batch
together l = 8 tuples from the same relation and
apply the consistency loss function to all of them.
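One continual-training step could then be sketched as follows, reusing the consistency_loss sketch above (the checkpoint, optimizer, and learning rate are illustrative, λ comes from the grid in §8.2, and the MLM term here is only a rough stand-in for the regular masked language modeling objective on the pattern text, which would normally apply random masking, e.g., with DataCollatorForLanguageModeling):

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tok = AutoTokenizer.from_pretrained("bert-base-cased")     # illustrative BERT-base checkpoint
    model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative optimizer and lr
    lam = 0.5                                                   # λ, tuned over {0.1, 0.5, 1}

    def training_step(populated_patterns, candidate_ids):
        # populated_patterns: the same subject filled into every paraphrase of one
        # relation, each containing a single [MASK] in the object slot.
        batch = tok(populated_patterns, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        labels[labels == tok.mask_token_id] = -100              # do not train on the query slot
        labels[batch["attention_mask"] == 0] = -100              # ignore padding
        out = model(**batch, labels=labels)                      # rough stand-in for L_MLM
        rows, cols = (batch["input_ids"] == tok.mask_token_id).nonzero(as_tuple=True)
        mask_logits = out.logits[rows, cols]                     # [n_patterns, vocab_size]
        loss = lam * consistency_loss(mask_logits, candidate_ids) + out.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return float(loss)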
8.2 Setup
Since we evaluate our method on unseen relations,
we also split train and test by relation type (e.g.,
location-based relations, which are very common
in T-REx). Moreover, our method is aimed to
be simple, effective, and to require only mini-
mal supervision. Therefore, we opt to use only
three relations: original-language, named-after,
and original-network;
these were chosen ran-
domly, out of the non-location related relations.16
For validation, we randomly pick three relations
of the remaining relations and use the remaining
25 for testing.
We perform minimal tuning of the parame-
ters (λ ∈ 0.1, 0.5, 1) to pick the best model,
train for three epochs, and select the best model
based on Consistent-Acc on the validation set. For
efficiency reasons, we use the base version of
BERT.
8.3 Improved Consistency Results
The results are presented in Table 7. We report ag-
gregated results for the 25 relations in the test. We
again report macro average (mean over relations)
and standard deviation. We report the results of the
majority baseline (first row), BERT-base (second
row), and our new model (BERT-ft, third row).
16Many relations are location-based—not training on them
prevents train-test leakage.
Model            Accuracy     Consistency    Consistent-Acc
majority         24.4±22.5    100.0±0.0      24.4±22.5
BERT-base        45.6±27.6    58.2±23.9      27.3±24.8
BERT-ft          47.4±27.3    64.0±22.9      33.2±27.0
 -consistency    46.9±27.6    60.9±22.6      30.9±26.3
 -typed          46.5±27.1    62.0±21.2      31.1±25.2
 -MLM            16.9±21.1    80.8±27.1      9.1±11.5

Table 7: Knowledge and consistency results for the baseline, BERT base, and our model. The results are averaged over the 25 test relations. Underlined: best performance overall, including ablations. Bold: Best performance for BERT-ft and the two baselines (BERT-base, majority).
First, we note that our model significantly im-
proves consistency: 64.0% (compared with 58.2%
for BERT-base, an increase of 5.8 points). Accu-
racy also improves compared to BERT-base, from
45.6% to 47.4%. Finally, and most importantly,
we see an increase of 5.9 points in Consistent-Acc,
which is achieved due to the improved consistency
of the model. Notably, these improvements arise
from training on merely three relations, meaning
that the model improved its consistency ability
and generalized to new relations. We measure the
statistical significance of our method compared
to the BERT baseline, using McNemar’s test (fol-
lowing Dror et al. [2018, 2020]) and find all results
to be significant (p ≪ 0.01).
We also perform an ablation study to quantify
the utility of the different components. First, we
report on the finetuned model without the con-
sistency loss (-consistency). Interestingly, it does
improve over the baseline (BERT-base), but it lags
behind our finetuned model. Second, applying our
loss on the candidate set rather than on the entire
vocabulary is beneficial (-typed). Finally, by not
performing the MLM training on the generated
patterns (-MLM), the consistency results improve
significantly (80.8%); however, this also hurts
Accuracy and Consistent-Acc. MLM training
seems to serve as a regularizer that prevents ca-
tastrophic forgetting.
Our ultimate goal is to improve consistency
in PLMs for better performance on downstream
tasks. Therefore, we also experiment with fine-
tuning on SQuAD (Rajpurkar et al., 2016), and
evaluating on paraphrased questions from SQuAD
(Gan and Ng, 2019) using our consistency model.
However, the results perform on par with the base-
line model, both on SQuAD and the paraphrase
questions. More research is required to show that
consistent PLMs can also benefit downstream
tasks.
9 Discussion
Consistency for Downstream Tasks The rise
of PLMs has improved many tasks but has also
brought a lot of expectations. The standard usage
of these models is pretraining on a large cor-
pus of unstructured text and then finetuning on
a task of interest. The first step is thought of as
providing a good language-understanding compo-
nent, whereas the second step is used to teach the
format and the nuances of a downstream task.
As discussed earlier, consistency is a crucial
component of many NLP systems (Du et al., 2019;
Asai and Hajishirzi, 2020; Denis and Baldridge,
2009; Kryscinski et al., 2020) and obtaining this
skill from a pretrained model would be extremely
beneficial and has the potential to make special-
ized consistency solutions in downstream tasks
redundant. Indeed, there is an ongoing discussion
about the ability to acquire ‘‘meaning’’ from raw
text signal alone (Bender and Koller, 2020). Our
new benchmark makes it possible to track the
progress of consistency in pretrained models.
Broader Sense of Consistency In this work we
focus on one type of consistency, that is, con-
sistency in the face of paraphrasing; however,
consistency is a broader concept. For instance,
previous work has studied the effect of nega-
tion on factual statements, which can also be
seen as consistency (Ettinger, 2020; Kassner and
Schütze, 2020). A consistent model is expected to
return different answers to the prompts: ‘‘Birds
can [MASK]’’ and ‘‘Birds cannot [MASK]’’. The
inability to do so, as was shown in these works,
also shows the lack of model consistency.
Usage of PLMs as KBs Our work follows the
setup of Petroni et al. (2019) and Jiang et al.
(2020), where PLMs are being tested as KBs.
While it is an interesting setup for probing models
for knowledge and consistency, it lacks important
properties of standard KBs: (1) the ability to return
more than a single answer and (2) the ability to
return no answer. Although some heuristics can
be used for allowing a PLM to do so, for example,
using a threshold on the probabilities, it is not
the way that the model was trained, and thus may
not be optimal. Newer approaches that propose
to use PLMs as a starting point to more complex
systems have promising results and address these
problems (Thorne et al., 2020).
In another approach, Shin et al. (2020) sug-
gest using AUTOPROMPT to automatically generate
prompts, or patterns, instead of creating them
manually. This approach is superior to manual
patterns (Petroni et al., 2019), or aggregation of
patterns that were collected automatically (Jiang
et al., 2020).
Brittleness of Neural Models Our work also
relates to the problem of the brittleness of neural
networks. One example of this brittleness is the
vulnerability to adversarial attacks (Szegedy et al.,
2014; Jia and Liang, 2017). The other problem,
closer to the problem we explore in this work,
is the poor generalization to paraphrases. For ex-
ample, Gan and Ng (2019) created a paraphrase
version for a subset of SQuAD (Rajpurkar et al.,
2016), and showed that model performance drops
significantly. Ribeiro et al.
(2018) proposed
another method for creating paraphrases and
performed a similar analysis for visual ques-
tion answering and sentiment analysis. Recently,
Ribeiro et al. (2020) proposed CHECKLIST, a sys-
tem that tests a model’s vulnerability to several
linguistic perturbations.
PARAREL enables us to study the brittleness of
PLMs, and separate facts that are robustly encoded
in the model from mere ‘guesses’, which may arise
from some heuristic or spurious correlations with
certain patterns (Poerner et al., 2020). We showed
that PLMs are susceptible to small perturbations,
and thus, finetuning on a downstream task—given
that training datasets are typically not large and
do not contain equivalent examples—is not likely
to perform better.
Can We Expect LMs to Be Consistent?
The typical training procedure of an LM does
not encourage consistency. The standard training
solely tries to minimize the log-likelihood of an
unseen token, and this objective is not always
aligned with consistency of knowledge. Consider
for example the case of Wikipedia texts, as op-
posed to Reddit; their texts and styles may be very
different and they may even describe contradic-
tory facts. An LM can exploit the styles of each
text to best fit the probabilities given to an unseen
word, even if the resulting generations contradict
each other.
Since the pretraining-finetuning procedure is
currently the dominating one in our field, a great
amount of the language capabilities that were
learned during pre-training also propagates to the
fine-tuned models. As such, we believe it is impor-
tant to measure and improve consistency already
in the pretrained models.
Reasons Behind the (In)Consistency Since
LMs are not expected to be consistent, what are
the reasons behind their predictions, when being
consistent, or inconsistent?
In this work, we presented the predictions of
multiple queries, and the representation space of
one of the inspected models. However, this does
not point to the origins of such behavior. In future
work, we aim to inspect this question more closely.
10 Conclusion
In this work, we study the consistency of PLMs
with regard to their ability to extract knowledge.
We build a high-quality resource named PARAREL
that contains 328 patterns for 38 re-
lations. Using PARAREL , we measure consistency
in multiple PLMs, including BERT, RoBERTa,
and ALBERT, and show that although the latter
two are superior to BERT in other tasks, they
fall short in terms of consistency. However, the
consistency of these models is generally low. We
release PARAREL
along with data tuples from
T-REx as a new benchmark to track knowledge
consistency of NLP models. Finally, we propose
a new simple method to improve model consis-
tency, by continuing the pretraining with a novel
loss. We show this method to be effective and to
improve both the consistency of models as well as
their ability to extract the correct facts.
Acknowledgments
We would like to thank Tomer Wolfson, Ido
Dagan, Amit Moryossef, and Victoria Basmov
for their helpful comments and discussions, and
Alon Jacovi, Ori Shapira, Arie Cattan, Elron
Bandel, Philipp Dufter, Masoud Jalili Sabet,
Marina Speranskaya, Antonis Maronikolakis,
Aakanksha Naik, Aishwarya Ravichander, and
Aditya Potukuchi for the help with the annota-
tions. We also thank the anonymous reviewers
and the action editor, George Foster, for their
valuable suggestions.
Yanai Elazar is grateful to be supported by
the PBC fellowship for outstanding PhD can-
didates in Data Science and the Google PhD
fellowship. This project has received funding
from the European Research Council (ERC) un-
der the European Union’s Horizon 2020 research
and innovation programme, grant agreement no.
802774 (iEXTRACT). This work has been funded
by the European Research Council (#740516)
and by the German Federal Ministry of Edu-
cation and Research (BMBF) under grant no.
01IS18036A. The authors of this work take full
responsibility for its content. This research was
also supported in part by grants from the National
Science Foundation Secure and Trustworthy Com-
puting program (CNS-1330596, CNS15-13957,
CNS-1801316, CNS-1914486), and a DARPA
Brandeis grant (FA8750-15-2-0277). The views
and conclusions contained herein are those of
the authors and should not be interpreted as
necessarily representing the official policies or
endorsements, either expressed or implied, of the
NSF, DARPA, or the US Government.
References
Chris Alberti, Daniel Andor, Emily Pitler, Jacob
Devlin, and Michael Collins. 2019. Synthetic
QA corpora generation with roundtrip con-
sistency. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 6168–6173. https://
doi.org/10.18653/v1/P19-1620
Kim Allan Andersen and Daniele Pretolani. 2001.
Easy cases of probabilistic satisfiability. Annals
of Mathematics and Artificial
Intelligence,
33(1):69–91. https://doi.org/10.1023
/A:1012332915908
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu,
Margaret Mitchell, Dhruv Batra, C. Lawrence
Zitnick, and Devi Parikh. 2015. VQA: Visual
question answering. In Proceedings of the IEEE
International Conference on Computer Vision,
pages 2425–2433. https://doi.org/10
.1109/ICCV.2015.279
Akari Asai and Hannaneh Hajishirzi. 2020.
Logic-guided data augmentation and regu-
larization for consistent question answering.
In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics,
pages 5642–5650, Online. Association for Computational
Linguistics. https://doi.org/10.18653/v1/2020.acl-main.499
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi,
Hassan Sajjad, and James Glass. 2017. What
do neural machine translation models learn
about morphology? In Proceedings of the 55th
Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 861–872.
https://doi.org/10.18653/v1/P17-1080
Yonatan Belinkov and James Glass. 2019. Anal-
ysis methods in neural language processing:
A survey. Transactions of the Association for
Computational Linguistics, 7:49–72. https://doi.org/10.1162/tacl_a_00254
Emily M. Bender and Alexander Koller. 2020.
Climbing towards NLU: On meaning, form,
and understanding in the age of data.
In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics,
pages 5185–5198. https://doi.org/10
.18653/v1/2020.acl-main.463
Rahul Bhagat and Eduard Hovy. 2013. What
is a paraphrase? Computational Linguistics,
39(3):463–472. https://doi.org/10.1162/COLI_a_00166
Lukas Biewald. 2020. Experiment tracking with
weights and biases. Software available from
wandb.com.
Samuel R. Bowman, Gabor Angeli, Christopher
Potts, and Christopher D. Manning. 2015. A
large annotated corpus for learning natural lan-
guage inference. In Proceedings of the 2015
Conference on Empirical Methods in Natural
Language Processing (EMNLP). Association
for Computational Linguistics. https://doi
.org/10.18653/v1/D15-1075
Tom B. Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen,
Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever,
and Dario Amodei. 2020. Language models
are few-shot learners. arXiv preprint
arXiv:2005.14165.
Oana-Maria Camburu, Tim Rocktäschel, Thomas
Lukasiewicz, and Phil Blunsom. 2018. E-SNLI:
Natural language inference with natural lan-
guage explanations. In NeurIPS.
Oana-Maria Camburu, Brendan Shillingford,
Pasquale Minervini, Thomas Lukasiewicz, and
Phil Blunsom. 2020. Make up your mind!
Adversarial generation of inconsistent natural
language explanations. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, pages 4157–4165.
https://doi.org/10.18653/v1/2020
.acl-main.382
Kai-Wei Chang, Rajhans Samdani, Alla Rozovskaya,
Nick Rizzolo, Mark Sammons, and Dan Roth. 2011.
Inference protocols for coreference resolution.
In Proceedings of
the Fifteenth Conference on Computational
Natural Language Learning: Shared Task,
pages 40–44.
Ethan A. Chi, John Hewitt, and Christopher D.
Manning. 2020. Finding universal grammatical
relations in multilingual BERT. In Proceed-
ings of
the
Association for Computational Linguistics,
pages 5564–5577, Online. Association for
Computational Linguistics.
the 58th Annual Meeting of
Jeff Da and Jungo Kasai. 2019. Cracking
the contextual commonsense code: Under-
standing commonsense reasoning aptitude of
deep contextual representations. In Proceed-
ings of the First Workshop on Commonsense
Inference in Natural Language Processing,
pages 1–12, Hong Kong, China. Association
for Computational Linguistics.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The Pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer. https://doi.org/10.1007/11736790_9
Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1173–1178. https://doi.org/10.18653/v1/D19-1109
Pascal Denis and Jason Baldridge. 2009. Global
joint models for coreference resolution and
named entity classification. Procesamiento del
Lenguaje Natural, 42.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the
Association for Computational Linguistics: Hu-
man Language Technologies, Volume 1 (Long
and Short Papers), pages 4171–4186.
Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392. https://doi.org/10.18653/v1/P18-1128
Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart. 2020. Statistical significance testing for natural language processing. Synthesis Lectures on Human Language Technologies, 13(2):1–116. https://doi.org/10.2200/S00994ED1V01Y202002HLT045
Xinya Du, Bhavana Dalvi, Niket Tandon,
Antoine Bosselut, Wen-tau Yih, Peter Clark,
and Claire Cardie. 2019. Be consistent! Im-
proving procedural text comprehension using
label consistency. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 2347–2356.
Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. 2021. Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals. Transactions of the Association for Computational Linguistics, 9:160–175. https://doi.org/10.1162/tacl_a_00359
Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. 2018. T-rex: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Allyson Ettinger. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48. https://doi.org/10.1162/tacl_a_00298
Maxwell Forbes, Ari Holtzman, and Yejin Choi.
2019. Do neural language representations learn
physical commonsense? In CogSci.
Wee Chung Gan and Hwee Tou Ng. 2019.
Improving the robustness of question an-
swering systems to question paraphrasing. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 6065–6075.
Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting numerical reasoning skills into language models. In Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.acl-main.89
Yoav Goldberg. 2019. Assessing BERT’s syntactic abilities. arXiv preprint arXiv:1901.05287.
Pierre Hansen and Brigitte Jaumard. 2000. Probabilistic satisfiability. In Handbook of Defeasible Reasoning and Uncertainty Management Systems, pages 321–367. Springer. https://doi.org/10.1007/978-94-017-1737-3_8
John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In North American Chapter
of the Association for Computational Linguistics: Human Language Technologies (NAACL). Association for Computational Linguistics.
Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. https://doi.org/10.5281/zenodo.1212303
Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031. https://doi.org/10.18653/v1/D17-1215
Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438. https://doi.org/10.1162/tacl_a_00324
Nora Kassner, Philipp Dufter, and Hinrich Schütze. 2021a. Multilingual lama: Investigating knowledge in multilingual pretrained language models.
Nora Kassner, Benno Krojer, and Hinrich Schütze. 2020. Are pretrained language models symbolic reasoners over knowledge? In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 552–564, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.conll-1.45
Nora Kassner and Hinrich Schütze. 2020. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7811–7818, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.698
Nora Kassner, Oyvind Tafjord, Hinrich Schütze, and Peter Clark. 2021b. Enriching a model’s notion of belief using a persistent memory. CoRR, abs/2104.08401.
Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346. https://doi.org/10.18653/v1/2020.emnlp-main.750
Zhenzhong Lan, Mingda Chen, Sebastian
Goodman, Kevin Gimpel, Piyush Sharma, and
Radu Soricut. 2019. ALBERT: A lite BERT for
self-supervised learning of language representa-
tions. In International Conference on Learning
Representations.
Tao Li, Vivek Gupta, Maitrey Mehta, and Vivek
Srikumar. 2019. A logic-driven framework for
consistency of neural models. In Proceedings
of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 3924–3935, Hong Kong, China. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1405
Tal Linzen. 2020. How can we accelerate progress
towards human-like linguistic generalization?
In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguis-
tics, pages 5210–5217. https://doi.org
/10.18653/v1/2020.acl-main.465
Tal Linzen, Emmanuel Dupoux, and Yoav
Goldberg. 2016. Assessing the ability of LSTMs
to learn syntax-sensitive dependencies. Trans-
actions of the Association for Computational
Linguistics, 4:521–535. https://doi.org
/10.1162/tacl_a_00115
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du,
Mandar
Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly opti-
mized bert pretraining approach. arXiv preprint
arXiv:1907.11692.
Laurens van der Maaten and Geoffrey Hinton.
2008. Visualizing data using t-SNE. Journal of
Machine Learning Research, 9:2579–2605.
Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1151
Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448. https://doi.org/10.18653/v1/P19-1334
David Picado Muiño. 2011. Measuring and repairing inconsistency in probabilistic knowledge bases. International Journal of Approximate Reasoning, 52(6):828–840. https://doi.org/10.1016/j.ijar.2011.02.003
Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 454–459. https://doi.org/10.18653/v1/P16-2074
F. Pedregosa, G. Varoquaux, A. Gramfort, V.
Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J.
Vanderplas, A. Passos, D. Cournapeau, M.
Brucher, M. Perrot, and E. Duchesnay. 2011.
Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research,
12:2825–2830.
Matthew E. Peters, Mark Neumann, Robert
Logan, Roy Schwartz, Vidur Joshi, Sameer
Singh, and Noah A. Smith. 2019. Know-
ledge enhanced contextual word representa-
tions. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 43–54. https://
doi.org/10.18653/v1/D19-1005
Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. How context affects language models’ factual predictions. In Automated Knowledge Base Construction.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1250
Nina Poerner, Ulli Waltinger, and Hinrich Schütze. 2020. E-BERT: Efficient-yet-effective entity embeddings for bert. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 803–818. https://doi.org/10.18653/v1/2020.findings-emnlp.71
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI blog, 1(8):9.
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21:1–67.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. https://doi.org/10.18653/v1/D16-1264
Shauli Ravfogel, Yanai Elazar, Jacob Goldberger, and Yoav Goldberg. 2020. Unsupervised distillation of syntactic information from contextualized word representations. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 91–106, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.blackboxnlp-1.9
Abhilasha Ravichander, Eduard Hovy, Kaheer
Suleman, Adam Trischler, and Jackie Chi Kit
Cheung. 2020. On the systematicity of probing
contextualized word representations: The case
of hypernymy in BERT. In Proceedings of the
Ninth Joint Conference on Lexical and Compu-
tational Semantics, pages 88–102, Barcelona,
Spain (Online). Association for Computational
Linguistics.
Marco Tulio Ribeiro, Carlos Guestrin, and Sameer Singh. 2019. Are red roses red? Evaluating consistency of question-answering models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6174–6184, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1621
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865. https://doi.org/10.18653/v1/P18-1079
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.442
Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426. https://doi.org/10.18653/v1/2020.emnlp-main.437
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.346
Micah Shlain, Hillel Taub-Tabib, Shoval Sadde, and Yoav Goldberg. 2020. Syntactic search by example. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 17–23, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-demos.3
Vered Shwartz and Yejin Choi. 2020. Do neural language models overcome reporting bias? In Proceedings of the 28th International Conference on Computational Linguistics, pages 6863–6870. https://doi.org/10.18653/v1/2020.coling-main.605
Christian Szegedy, Wojciech Zaremba,
Ilya
Sutskever, Joan Bruna, Dumitru Erhan, Ian
Goodfellow, and Rob Fergus. 2014. Intriguing
properties of neural networks. In International
Conference on Learning Representations.
Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2020. oLMpics-on what language model pre-training captures. Transactions of the Association for Computational Linguistics, 8:743–758. https://doi.org/10.1162/tacl_a_00342
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601. https://doi.org/10.18653/v1/P19-1452
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in bertology: What we know about how bert works. Transactions of the Association for Computational Linguistics, 8:842–866. https://doi.org/10.1162/tacl_a_00349
Matthias Thimm. 2009. Measuring inconsistency
in probabilistic knowledge bases. In Proceed-
ings of the Twenty-Fifth Conference on Un-
certainty in Artificial Intelligence (UAI’09),
pages 530–537. AUAI Press.
Matthias Thimm. 2013. Inconsistency measures for probabilistic logics. Artificial Intelligence, 197:1–24. https://doi.org/10.1016/j.artint.2013.02.001
James Thorne, Majid Yazdani, Marzieh Saeidi,
Fabrizio Silvestri, Sebastian Riedel, and Alon
Halevy. 2020. Neural databases. arXiv preprint
arXiv:2010.06973.
Alex Warstadt, Yian Zhang, Xiaocheng Li, Haokun Liu, and Samuel R. Bowman. 2020. Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 217–235, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.16
Adina Williams, Nikita Nangia, and Samuel
Bowman. 2018. A broad-coverage challenge
corpus for sentence understanding through
inference. In Proceedings of the 2018 Con-
ference of
the North American Chapter of
the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long Papers), pages 1112–1122. Association
for Computational Linguistics. https://
doi.org/10.18653/v1/N18-1101
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-demos.6
Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. 2020. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan Roth. 2020. Do language embeddings capture scales? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4889–4896. https://doi.org/10.18653/v1/2020.findings-emnlp.439
A Implementation Details
We heavily rely on Hugging Face’s Transformers
library (Wolf et al., 2020) for all experiments in-
volving the PLMs. We used Weights & Biases for
tracking and logging the experiments (Biewald,
2020). Finally, we used sklearn (Pedregosa et al.,
2011) for other ML-related experiments.
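Querying a PLM with a cloze-style pattern can be reproduced with the Transformers fill-mask pipeline; the snippet below is a minimal sketch of such a query rather than the exact code used in our experiments, and the model name, pattern, and subject are illustrative placeholders, not entries taken from PARAREL.

```python
from transformers import pipeline

# Minimal sketch of a cloze-style query against a masked PLM.
# The model name, pattern, and subject below are illustrative only.
fill_mask = pipeline("fill-mask", model="bert-base-cased")

pattern = "[X] is the capital of [MASK]."  # PARAREL-style pattern with the object masked
subject = "Paris"                          # example subject

query = pattern.replace("[X]", subject)
for prediction in fill_mask(query, top_k=5):
    # each prediction contains the predicted token string and its probability
    print(prediction["token_str"], round(prediction["score"], 3))
```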
B Paraphrases Analysis
We provide a characterization of the paraphrase
types included in our dataset.
We analyze the type of paraphrases in PARAREL. We sample 100 paraphrase pairs from the agreement evaluation that were labeled as paraphrases and annotate the paraphrase type. Notice that the paraphrases can be complex; as such, multiple transformations can be annotated for each pair. We mainly use a subset of paraphrase types categorized by Bhagat and Hovy (2013), but also define new types that were not covered by that work. We begin by briefly defining the types of paraphrases found in PARAREL from Bhagat and Hovy (2013) (more thorough definitions can be found in their paper), and then define the new types we observed.
1. Synonym substitution: Replacing a word/
phrase by a synonymous word/phrase, in the
appropriate context.
2. Function word variations: Changing the func-
tion words in a sentence/phrase without
affecting its semantics, in the appropriate
context.
3. Converse substitution: Replacing a word/
phrase with its converse and inverting the
relationship between the constituents of a
sentence/phrase, in the appropriate context,
presenting the situation from the converse
perspective.
4. Change of tense: Changing the tense of a
verb, in the appropriate context.
5. Change of voice: Changing a verb from its ac-
tive to passive form and vice versa results in
a paraphrase of the original sentence/phrase.
6. Verb/Noun conversion: Replacing a verb by
its corresponding nominalized noun form
and vice versa, in the appropriate context.
7. External knowledge: Replacing a word/
phrase by another word/phrase based on
extra-linguistic (world) knowledge, in the
appropriate context.
8. Noun/Adjective conversion: Replacing a
verb by its corresponding adjective form and
vice versa, in the appropriate context.
9. Change of aspect: Changing the aspect of a
verb, in the appropriate context.
a. Irrelevant addition: Addition or removal of
a word or phrase, that does not affect the
meaning of the sentence (as far as the relation
of interest is concerned), and can be inferred
from the context independently.
b. Topicalization transformation: A transforma-
tion from or to a topicalization construction.
Topicalization is a construction in which
a clause is moved to the beginning of its
enclosing clause.
c. Apposition transformation: A transformation
from or to an apposition construction. In an
apposition construction, two noun phrases
where one identifies the other are placed one
next to each other.
d. Other syntactic movements: Includes other
types of syntactic transformations that are
not part of the other categories. This includes
cases such as moving an element from a co-
ordinate construction to the subject position
as in the last example in Table 8. Another
type of transformation is in the following
paraphrase: ‘‘[X] plays in [Y] position.’’ and
‘‘[X] plays in the position of [Y].’’ where
a compound noun-phrase is replaced with a
prepositional phrase.
We report the percentage of each type, along with examples of paraphrases, in Table 8. The most common paraphrase type is ‘synonym substitution’, followed by ‘function words variations’; these occurred 41 and 16 times, respectively. The least common paraphrase type is ‘change of aspect’, which occurred only once in the sample.
We also define several other types of paraphrases not covered in Bhagat and Hovy (2013) (potentially because they did not occur in the corpora they inspected).
The full PARAREL resource can be found at: https://github.com/yanaiela/pararel/tree/main/data/pattern_data/graphs_json
Paraphrase Type | Pattern #1 | Pattern #2 | Relation | N
Synonym substitution | [X] died in [Y]. | [X] expired at [Y]. | place of death | 41
Function words variations | [X] is [Y] citizen. | [X], who is a citizen of [Y]. | country of citizenship | 16
Converse substitution | [X] maintains diplomatic relations with [Y]. | [Y] maintains diplomatic relations with [X]. | diplomatic relation | 10
Change of tense | [X] is developed by [Y]. | [X] was developed by [Y]. | developer | 10
Change of voice | [X] is owned by [Y]. | [Y] owns [X]. | owned by | 7
Verb/Noun conversion | The headquarter of [X] is in [Y]. | [X] is headquartered in [Y]. | headquarters location | 7
External knowledge | [X] is represented by music label [Y]. | [X], that is represented by [Y]. | record label | 3
Noun/Adjective conversion | The official language of [X] is [Y]. | The official language of [X] is the [Y] language. | official language | 2
Change of aspect | [X] plays in [Y] position. | playing as an [X], [Y] | position played on team | 1
Irrelevant addition | [X] shares border with [Y]. | [X] shares a common border with [Y]. | shares border with | 11
Topicalization transformation | [X] plays in [Y] position. | playing as a [Y], [X] | position played on team | 8
Apposition transformation | [X] is the capital of [Y]. | [Y]’s capital, [X]. | capital of | 4
Other syntactic movements | [X] and [Y] are twin cities. | [X] is a twin city of [Y]. | twinned administrative body | 10
Table 8: Different types of paraphrases in PARAREL. We report examples from each paraphrase type, along with the type of relation, and the number of examples of the specific transformation from a random subset of 100 pairs. Each pair can be classified into more than a single transformation (we report one for brevity), thus the sum of transformations is more than 100.
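Pattern pairs such as those in Table 8 can be used directly as cloze queries for a consistency check. The sketch below is a simplified illustration, not the full evaluation protocol: it fills both patterns of one pair (the pair quoted above, "[X] plays in [Y] position." and "[X] plays in the position of [Y].") with the same subject, takes the model's top prediction for the masked object in each, and counts the pair as consistent if the two predictions match. The model name and the subject are illustrative, and the restriction of predictions to a closed candidate set of objects is omitted here.

```python
from transformers import pipeline

# Simplified consistency check over one paraphrase pair.
# Model and subject are illustrative; the restriction to a closed
# candidate set of objects is omitted in this sketch.
fill_mask = pipeline("fill-mask", model="bert-base-cased")

pattern_1 = "[X] plays in [MASK] position."
pattern_2 = "[X] plays in the position of [MASK]."
subject = "Lionel Messi"  # example subject, not necessarily in PARAREL's tuples

def top_prediction(pattern: str, subj: str) -> str:
    """Fill the subject slot and return the top-ranked token for the masked object."""
    query = pattern.replace("[X]", subj)
    return fill_mask(query, top_k=1)[0]["token_str"]

pred_1 = top_prediction(pattern_1, subject)
pred_2 = top_prediction(pattern_2, subject)
print(pred_1, pred_2, "consistent" if pred_1 == pred_2 else "inconsistent")
```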