Break, Perturb, Build: Automatic Perturbation of Reasoning Paths
Through Question Decomposition
Mor Geva, Tomer Wolfson, Jonathan Berant
School of Computer Science, Tel Aviv University, Israel
Allen Institute for Artificial Intelligence
{morgeva@mail,tomerwol@mail,joberant@cs}.tau.ac.il
Abstract
Recent efforts to create challenge benchmarks
that test the abilities of natural language un-
derstanding models have largely depended
on human annotations. In this work, we in-
troduce the ‘‘Break, Perturb, Build’’ (BPB)
framework for automatic reasoning-oriented
perturbation of question-answer pairs. BPB
represents a question by decomposing it into
the reasoning steps that are required to answer
it, symbolically perturbs the decomposition,
and then generates new question-answer pairs.
We demonstrate the effectiveness of BPB by
creating evaluation sets for
three reading
comprehension (RC) benchmarks, generating
thousands of high-quality examples without
human intervention. We evaluate a range of RC
models on our evaluation sets, which reveals
large performance gaps on generated exam-
ples compared to the original data. Moreover,
symbolic perturbations enable fine-grained
analysis of the strengths and limitations of
models. Last, augmenting the training data
with examples generated by BPB helps close
the performance gaps, without any drop on the
original data distribution.
1 Introduction
Evaluating natural language understanding (NLU)
systems has become a fickle enterprise. While
models outperform humans on standard bench-
marks, they perform poorly on a multitude of
distribution shifts (Jia and Liang, 2017; Naik et al.,
2018; McCoy et al., 2019, inter alia). To expose
such gaps, recent work has proposed to evaluate
models on contrast sets (Gardner et al., 2020), O
counterfactually-augmented data (Kaushik et al.,
2020), where minimal but meaningful pertur-
bations are applied to test examples. Tuttavia,
since such examples are manually written, col-
lecting them is expensive, and procuring diverse
perturbations is challenging (Joshi and He, 2021).
Recently, methods for automatic generation of
contrast sets were proposed. Tuttavia, current
methods are restricted to shallow surface pertur-
bations (Mille et al., 2021; Li et al., 2020), specific
reasoning skills (Asai and Hajishirzi, 2020), O
rely on expensive annotations (Bitton et al., 2021).
Così, automatic generation of examples that test
high-level reasoning abilities of models and their
robustness to fine semantic distinctions remains
an open challenge.
In this work, we propose the ‘‘Break, Perturb,
Build’’ (BPB) framework for automatic genera-
tion of reasoning-focused contrast sets for read-
ing comprehension (RC). Changing the high-level
semantics of questions and generating question-
answer pairs automatically is challenging. Primo, Esso
requires extracting the reasoning path expressed
in a question, in order to manipulate it. Secondo,
it requires the ability to generate grammatical and
coherent questions. In Figure 1, Per esempio, trans-
forming Q, which involves number comparison,
into Q1, which requires subtraction, leads to dra-
matic changes in surface form. Third, it requires
an automatic method for computing the answer to
the perturbed question.
Our insight is that perturbing question semantics
is possible when modifications are applied to a
structured meaning representation, piuttosto che
to the question itself. Specifically, we represent
questions with QDMR (Wolfson et al., 2020), UN
representation that decomposes a question into a
sequence of reasoning steps, which are written
in natural language and are easy to manipulate.
Relying on a structured representation lets us
develop a pipeline for perturbing the reasoning
path expressed in RC examples.
Our method (see Figure 1) has four steps. We
(1) parse the question into its QDMR decompo-
sition, (2) apply rule-based perturbations to the
decomposition, (3) generate new questions from
Transactions of the Association for Computational Linguistics, vol. 10, pp. 111–126, 2022. https://doi.org/10.1162/tacl_a_00450
Action Editor: Preslav Nakov. Submission batch: 8/2021; Revision batch: 9/2021; Published 2/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: An overview of BPB. Given a context (C), question (Q), and the answer (A) to the question, we
generate new examples by (1) parsing the question into its QDMR decomposition, (2) applying semantic
perturbations to the decomposition, (3) generating a question for each transformed decomposition, E (4)
computing answers/constraints to the new questions.
the perturbed decompositions, E (4) compute
their answers. In cases where computing the an-
swer is impossible, we compute constraints on
the answer, which are also useful for evaluation.
Per esempio, for Q4 in Figure 1, even if we can-
not extract the years of the described events, we
know the answer type of the question (Boolean).
Notably, aside from answer generation, all steps
depend on the question only, and can be applied to
other modalities, such as visual or table question
answering (QA).
Running BPB on the three RC datasets, DROP
(Dua et al., 2019), HOTPOTQA (Yang et al., 2018),
and IIRC (Ferguson et al., 2020), yields thousands
of semantically rich examples, covering a major-
ity of the original examples (63.5%, 70.2%, and
45.1%, respectively). Moreover, we validate ex-
amples using crowdworkers and find that ≥85%
of generated examples are correct.
We demonstrate the utility of BPB for compre-
hensive and fine-grained evaluation of multiple
RC models. Primo, we show that leading models,
such as UNIFIEDQA (Khashabi et al., 2020B) E
TASE (Segal et al., 2020), struggle on the gen-
erated contrast sets with a decrease of 13-36 F1
points and low consistency (<40). Moreover, an-
alyzing model performance per perturbation type
and constraints, reveals the strengths and weak-
nesses of models on various reasoning types. For
instance, (a) models with specialized architectures
are more brittle compared to general-purpose mod-
els trained on multiple datasets, (b) TASE fails to
answer intermediate reasoning steps on DROP, (c)
UNIFIEDQA fails completely on questions requir-
ing numerical computations, and (d) models tend
to do better when the numerical value of an answer
is small. Last, data augmentation with examples
generated by BPB closes part of the performance
gap, without any decrease on the original datasets.
In summary, we introduce a novel frame-
work for automatic perturbation of complex
reasoning questions, and demonstrate its efficacy
for generating contrast sets and evaluating models.
We expect that imminent improvements in question
generation, RC, and QDMR models will further
widen the accuracy and applicability of our
approach. The generated evaluation sets and
codebase are publicly available at
https://github.com/mega002/qdmr-based-question-generation.
2 Background
Our goal, given a natural language question q,
is to automatically alter its semantics, generating
perturbed questions ˆq for evaluating RC models.
This section provides background on the QDMR
representation and the notion of contrast sets.
Question Decomposition Meaning Representa-
tion (QDMR). To manipulate question seman-
tics, we rely on QDMR (Wolfson et al., 2020), a
structured meaning representation for questions.
The QDMR decomposition d = QDMR(q) is a
sequence of reasoning steps s1, . . . , s|d| required
to answer q. Each step si in d is an intermediate
question that is phrased in natural language and
annotated with a logical operation oi, such as se-
lection (e.g., ‘‘When was the Madison Woolen
Mill built?’’) or comparison (e.g., ‘‘Which is
highest of #1, #2?’’). Example QDMRs are shown
in Figure 1 (step 2). QDMR paves a path to-
wards controlling the reasoning path expressed
in a question by changing, removing, or adding
steps (§3.2).
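To make the representation concrete, a QDMR decomposition can be viewed as an ordered list of natural-language steps whose "#i" placeholders reference the answers of earlier steps. The following is a minimal illustrative sketch (ours, not the paper's code; the second step text is hypothetical):

```python
# Minimal sketch (not the authors' code): a QDMR decomposition as an ordered
# list of natural-language steps, each tagged with a logical operation.
# "#i" placeholders refer to the answers of earlier steps.
import re
from dataclasses import dataclass

@dataclass
class Step:
    text: str       # intermediate question, possibly containing "#i" references
    operation: str  # e.g., "selection", "comparison", "arithmetic"

def referenced_steps(step: Step) -> list[int]:
    """1-based indices of earlier steps referenced by this step."""
    return [int(m) for m in re.findall(r"#(\d+)", step.text)]

# Hypothetical decomposition of a comparison question:
qdmr = [
    Step("when was the Madison Woolen Mill built", "selection"),
    Step("when was the second mill built", "selection"),
    Step("which is highest of #1, #2", "comparison"),
]
assert referenced_steps(qdmr[0]) == []
assert referenced_steps(qdmr[2]) == [1, 2]
```

Representing each step this way is what makes the perturbations of §3.2 easy to express as list manipulations.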
Contrast Sets. Gardner et al. (2020) defined the
contrast set C(x) of an example x with a label y
as a set of examples with minimal perturbations
to x that typically affect y. Contrast sets evaluate
whether a local decision boundary around an ex-
ample is captured by a model. In this work, given
a question-context pair x = ⟨q, c⟩, we semanti-
cally perturb the question and generate examples
ˆx = ⟨ˆq, c⟩ ∈ C(⟨q, c⟩) that modify the original
answer a to ˆa.
3 BPB: Automatically Generating
Semantic Question Perturbations
We now describe the BPB framework. Given an
input x = ⟨q, c⟩ of question and context, and
the answer a to q given c, we automatically map
it to a set of new examples C(x) (Figure 1). Our
approach uses models for question decomposition,
question generation (QG), and RC.
3.1 Question Decomposition
The first step (Figure 1, step 1) is to represent q us-
ing a structured decomposition, d = QDMR(q).
To this end, we train a text-to-text model that
generates d conditioned on q. Specifically, we
fine-tune BART (Lewis et al., 2020) on the high-
level subset of the BREAK dataset (Wolfson et al.,
2020), which consists of 23.8K ⟨q, d⟩ pairs from
three RC datasets, including DROP and HOT-
POTQA.1 Our QDMR parser obtains a 77.3 SARI
score on the development set, which is near
state-of-the-art on the leaderboard.2
3.2 Decomposition Perturbation
A decomposition d describes the reasoning steps
necessary for answering q. By modifying d’s steps,
we can control the semantics of the question.
We define a ‘‘library’’ of rules for transforming
d → ˆd, and use it to generate questions ˆd → ˆq.
BPB provides a general method for creating a
wide range of perturbations. In practice, though,
deciding which rules to include is coupled with
the reasoning abilities expected from our models.
For example, there is little point in testing a
model on arithmetic operations if it had never
seen such examples. Thus, we implement rules
based on the reasoning skills required in current
RC datasets (Yang et al., 2018; Dua et al., 2019).
As future benchmarks and models tackle a wider
range of reasoning phenomena, one can expand
the rule library.
Table 1 provides examples for all QDMR
perturbations, which we describe next:
• AppendBool: When the question q re-
turns a numeric value, we transform its
QDMR by appending a ‘‘yes/no’’ com-
parison step. The comparison is against
the answer a of question q. As shown in
Table 1, the appended step compares the
previous step result (‘‘#3’’) to a constant
(‘‘is higher than 2’’). AppendBool per-
turbations are generated for 5 comparison
operators (>, <, ≤, ≥, ≠). For the compared
values, we sample from a set, based on the
answer a: {a + k, a − k, a/k, a × k} for
k ∈ {1, 2, 3}.
• ChangeLast: Changes the type of the last
QDMR step. This perturbation is applied
to steps involving operations over two refer-
enced steps. Steps with type {arithmetic,
comparison} have their type changed to
either {arithmetic, Boolean}. Table 1
shows a comparison step changed to an
arithmetic step, involving subtraction.
Below it, an arithmetic step is changed
to a yes/no question (Boolean).
• ReplaceArith: Given an arithmetic
step, involving either subtraction or addition,
we transform it by flipping its arithmetic
operation.
• ReplaceBool: Given a Boolean step,
verifying whether two statements are correct,
we transform it to verify if neither are correct.
• ReplaceComp: A comparison step
compares two values and returns the high-
est or lowest. Given a comparison step,
we flip its expression from ‘‘highest’’ to
‘‘lowest’’ and vice versa.
1We fine-tune BART-large for 10 epochs, using a learning
rate of 3e−5 with polynomial decay and a batch size of 32.
2https://leaderboard.allenai.org/breakhighlevel/.
• PruneStep: We remove one of the QDMR
steps. Following step pruning, we prune all
• Append Boolean step
  Question: Kadeem Jack is a player in a league that started with how many teams?
  QDMR: (1) league that Kadeem Jack is a player in; (2) teams that #1 started with; (3) number of #2
  Perturbed QDMR: (1) league that Kadeem Jack is a player in; (2) teams that #1 started with; (3) number of #2; (4) if #3 is higher than 2
  Perturbed question: If Kadeem Jack is a player in a league that started with more than two teams?

• Change last step (to arithmetic)
  Question: Which gallery was founded first, Hughes-Donahue Gallery or Art Euphoric?
  QDMR: (1) when was Hughes-Donahue Gallery founded; (2) when was Art Euphoric founded; (3) which was first of #1, #2
  Perturbed QDMR: (1) when was Hughes-Donahue Gallery founded; (2) when was Art Euphoric founded; (3) the difference of #1 and #2
  Perturbed question: How many years after Hughes-Donahue Gallery was founded was Art Euphoric founded?

• Change last step (to Boolean)
  Question: How many years after Madrugada's final concert did Sunday Driver become popular?
  QDMR: (1) year of Madrugada's final concert; (2) year when Sunday Driver become popular; (3) the difference of #2 and #1
  Perturbed QDMR: (1) year of Madrugada's final concert; (2) year when Sunday Driver become popular; (3) if #1 is the same as #2
  Perturbed question: Did Sunday Driver become popular in the same year as Madrugada's final concert?

• Replace arithmetic op.
  Question: How many more native Hindi speakers are there compared to native Kannada speakers?
  QDMR: (1) native Hindi speakers; (2) native Kannada speakers; (3) number of #1; (4) number of #2; (5) difference of #3 and #4
  Perturbed QDMR: (1) native Hindi speakers; (2) native Kannada speakers; (3) number of #1; (4) number of #2; (5) sum of #3 and #4
  Perturbed question: Of the native Hindi speakers and native Kannada speakers, how many are there in total?

• Replace Boolean op.
  Question: Can Stenocereus and Pachypodium both include tree like plants?
  QDMR: (1) if Stenocereus include tree like plants; (2) if Pachypodium include tree like plants; (3) if both #1 and #2 are true
  Perturbed QDMR: (1) if Stenocereus include tree like plants; (2) if Pachypodium include tree like plants; (3) if both #1 and #2 are false
  Perturbed question: Do neither Stenocereus nor Pachypodium include tree like plants?

• Replace comparison op.
  Question: Which group is smaller for the county according to the census: people or households?
  QDMR: (1) size of the people group in the county according to the census; (2) size of households group in the county according to the census; (3) which is smaller of #1, #2
  Perturbed QDMR: (1) size of the people group in the county according to the census; (2) size of households group in the county according to the census; (3) which is highest of #1, #2
  Perturbed question: According to the census, which group in the county is larger: people or households?

• Prune step
  Question: How many people comprised the total adult population of Cunter, excluding seniors?
  QDMR: (1) adult population of Cunter; (2) #1 excluding seniors; (3) number of #2
  Perturbed QDMR: (1) adult population of Cunter; (2) number of #2
  Perturbed question: How many adult population does Cunter have?

Table 1: The full list of semantic perturbations in BPB. For each perturbation, we provide an example
question and its decomposition. We highlight the altered decomposition steps, along with the generated
question.
other steps that are no longer referenced.
We apply only a single PruneStep per d.
Table 1 displays ˆd after its second step has
been pruned.
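To make the rule-based manipulation concrete, here is a small illustrative sketch (ours, not the authors' perturbation library) of two of the rules above, treating a decomposition as a list of step strings:

```python
# Illustrative sketch (not the paper's implementation) of two perturbation
# rules over a decomposition represented as a list of step strings.
def append_bool(steps, value):
    """AppendBool: append a yes/no step comparing the last step's result to a constant."""
    return steps + [f"if #{len(steps)} is higher than {value}"]

def replace_comp(steps):
    """ReplaceComp: flip 'highest' <-> 'lowest' in the final comparison step."""
    last = steps[-1]
    flipped = (last.replace("highest", "lowest") if "highest" in last
               else last.replace("lowest", "highest"))
    return steps[:-1] + [flipped]

def comparison_values(a, ks=(1, 2, 3)):
    """Candidate constants to compare against, derived from the gold answer a (§3.2)."""
    return {v for k in ks for v in (a + k, a - k, a / k, a * k)}

steps = ["teams that the league started with", "number of #1"]
assert append_bool(steps, 2)[-1] == "if #2 is higher than 2"
assert replace_comp(["x", "y", "which is highest of #1, #2"])[-1] == "which is lowest of #1, #2"
assert 5 in comparison_values(4)  # 4 + 1
```

The full library additionally covers the remaining comparison operators and the other perturbation types listed above.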
3.3 Question Generation
At this point (Figure 1, step 3), we parsed q
to its decomposition d and altered its steps to
produce the perturbed decomposition ˆd. The new ˆd
expresses a different reasoning process compared
to the original q. Next, we generate the perturbed
question ˆq corresponding to ˆd. To this end, we train
a QG model, generating questions conditioned on
the input QDMR. Using the same ⟨q, d⟩ pairs
used to train the QDMR parser (§3.1), we train a
separate BART model for mapping d → q.3
An issue with our QG model is that the per-
turbed ˆd may be outside the distribution the QG
3We use the same hyperparameters as detailed in §3.1,
except the number of epochs, which was set to 15.
• Original: How many interceptions did Matt Hasselbeck throw?
  Augmented: If Matt Hasselbeck throw less than 23 interceptions? (AppendBool)

• Original: How many touchdowns were there in the first quarter?
  Augmented: If there were two touchdowns in the first quarter? (AppendBool)

• Original: Are Giuseppe Verdi and Ambroise Thomas both Opera composers?
  Augmented: Are neither Giuseppe Verdi nor Ambroise Thomas Opera composers? (ReplaceBool)

• Original: Which singer is younger, Shirley Manson or Jim Kerr?
  Augmented: Which singer is older, Shirley Manson or Jim Kerr? (ReplaceComp)

Table 2: Example application of all textual pat-
terns used to generate questions qaug (perturbation
type highlighted). Boldface indicates the pattern
matched in q and the modified part in qaug.
Decompositions d and daug omitted for brevity.
model was trained on, e.g., applying Append-
Bool on questions from DROP results in yes/no
questions that do not occur in the original dataset.
This can lead to low-quality questions ˆq. To im-
prove our QG model, we use simple heuristics to
take ⟨q, d⟩ pairs from BREAK and generate addi-
tional pairs ⟨qaug, daug⟩. Specifically, we define 4
textual patterns, associated with the perturbations,
AppendBool, ReplaceBool or Replace-
Comp. We automatically generate examples
⟨qaug, daug⟩ from ⟨q, d⟩ pairs that match a pat-
tern. An example application of all patterns is
in Table 2. For example, in AppendBool, the
question qaug is inferred with the pattern ‘‘how
many . . . did’’. In ReplaceComp, generating
qaug is done by identifying the superlative in q and
fetching its antonym.
Overall, we generate 4,315 examples and train
our QG model on the union of BREAK and the
augmented data. As QG models have been rapidly
improving, we expect future QG models will be
able to generate high-quality questions for any
decomposition without data augmentation.
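The pattern-based augmentation can be illustrated with a simplified sketch (ours; the regular expression and the antonym table below are stand-ins, not the paper's exact patterns):

```python
# Illustrative sketch of the textual patterns used to create augmented
# question pairs (§3.3). The regex and antonym table are simplified stand-ins.
import re

SUPERLATIVE_ANTONYMS = {"younger": "older", "smaller": "larger", "first": "last"}

def augment_append_bool(question, threshold):
    """'How many X did Y ...?' -> 'If Y ... less than <threshold> X?' (simplified)."""
    m = re.match(r"How many (\w+) did (.+?)\?", question)
    if m is None:
        return None
    noun, rest = m.groups()
    return f"If {rest} less than {threshold} {noun}?"

def augment_replace_comp(question):
    """Swap a comparative word for its antonym, as in the ReplaceComp pattern."""
    for word, antonym in SUPERLATIVE_ANTONYMS.items():
        if word in question:
            return question.replace(word, antonym)
    return None

q = "How many interceptions did Matt Hasselbeck throw?"
assert augment_append_bool(q, 23) == "If Matt Hasselbeck throw less than 23 interceptions?"
assert augment_replace_comp("Which singer is younger, Shirley Manson or Jim Kerr?") == \
    "Which singer is older, Shirley Manson or Jim Kerr?"
```

The asserted outputs mirror the examples in Table 2; the corresponding decompositions daug are derived alongside the questions.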
3.4 Answer Generation

We converted the input question into a set of
perturbed questions without using the answer or
context. Therefore, this part of BPB can be applied
to any question, regardless of the context modality.
We now describe a RC-specific component for
answer generation that uses the textual context.

To get complete RC examples, we must com-
pute answers to the generated questions (Figure 1,
step 4). We take a two-step approach: For some
questions, we can compute the answer automati-
cally based on the type of applied perturbation. If
this fails, we compute the answer by answering
each step in the perturbed QDMR ˆd.

Answer Generation Methods. Let ⟨q, c, a⟩ be the
original RC example and denote by ˆq the generated
question. We use the following per-perturbation
rules to generate the new answer ˆa:

• AppendBool: The transformed ˆq compares
whether the answer a and a numeric value v
satisfy a comparison condition. As the values
of a and v are given (§3.2), we can com-
pute whether the answer is ‘‘yes’’ or ‘‘no’’
directly.

• ReplaceArith: This perturbation con-
verts an answer that is the sum (difference)
of numbers to an answer that is the difference
(sum). We can often identify the numbers by
looking for numbers x, y in the context c such
that a = x ± y and flipping the operation:
ˆa = |x ∓ y|. To avoid noise, we discard ex-
amples for which there is more than one pair
of numbers that result in a, and cases where
a < 10, as the computation may involve
explicit counting rather than an arithmetic
computation.

• ReplaceBool: This perturbation turns a
verification of whether two statements x, y
are true, to a verification of whether neither x
nor y are true. Therefore, if a is ‘‘yes’’ (i.e.,
both x, y are true), ˆa must be ‘‘no’’.

• ReplaceComp: This perturbation takes a
comparison question q that contains two can-
didate answers x, y, of which x is the answer
a. We parse q with spaCy4 and identify the
two answer candidates x, y, and return the
one that is not a.

4https://spacy.io/.
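As an illustration, the ReplaceArith answer rule can be sketched as follows (a simplified reading of the described procedure; the function and its inputs are our own framing, not the authors' code):

```python
# Sketch of the ReplaceArith answer rule (§3.4): find the unique pair of
# context numbers x, y with a = x ± y, then flip the operation. Examples are
# discarded when the pair is ambiguous or when a < 10 (possible counting).
from itertools import combinations

def replace_arith_answer(context_numbers, a, original_op):
    if a < 10:
        return None  # may involve explicit counting, not arithmetic
    pairs = [(x, y) for x, y in combinations(context_numbers, 2)
             if (x + y == a if original_op == "+" else abs(x - y) == a)]
    if len(pairs) != 1:
        return None  # ambiguous: more than one (or no) pair yields a
    x, y = pairs[0]
    return abs(x - y) if original_op == "+" else x + y

# Original answer 14 = 45 - 31; the flipped (sum) answer is 76.
assert replace_arith_answer([45, 31, 7], 14, "-") == 76
assert replace_arith_answer([45, 31, 7], 5, "-") is None  # a < 10, discarded
```

The AppendBool and ReplaceBool rules are simpler still, amounting to evaluating the comparison or negating the original yes/no answer.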
                                              DROP     HPQA    IIRC
development set size                           9,536    7,405   1,301
# of unique generated perturbations           65,675   10,541   3,119
# of generated examples                       61,231    8,488   2,450
# of covered development examples              6,053    5,199     587
% of covered development examples               63.5     70.2    45.1
Avg. contrast set size                          11.1      2.6     5.2
Avg. # of perturbations per example              1.2        1       1
% of answers generated by the QDMR evaluator     5.8     61.8    22.5
# of annotated contrast examples               1,235    1,325     559
% of valid annotated examples                     85       89    90.3

Table 3: Generation and annotation statistics for
the DROP, HOTPOTQA, and IIRC datasets.
Figure 2: Example execution of the QDMR evaluator.

QDMR Evaluator. When our heuristics do not
apply (e.g., arithmetic computations over more
than two numbers, PruneStep, and Change-
Last), we use a RC model and the QDMR
structure to directly evaluate each step of ˆd
and compute ˆa. Recall that each QDMR step si is
annotated with a logical operation oi (§2). To
evaluate ˆd, we go over it step-by-step, and for
each step either apply the RC model for op-
erations that require querying the context (e.g.,
selection), or directly compute the output for
numerical/set-based operations (e.g., compar-
ison). The answer computed for each step is
then used for replacing placeholders in subsequent
steps. An example is provided in Figure 2.

We discard the generated example when the
RC model predicted an answer that does not
match the expected argument type in a follow-
ing step for which the answer is an argument
(e.g., when a non-numerical span predicted by
the RC model is used as an argument for an
arithmetic operation), and when the generated
answer has more than 8 words. Also, we discard
operations that often produce noisy answers based
on manual analysis (e.g., project with a non-
numeric answer).

For our QDMR evaluator, we fine-tune a
ROBERTA-large model with a standard span-
extraction output head on SQUAD (Rajpurkar
et al., 2016) and BOOLQ (Clark et al., 2019).
BOOLQ is included to support yes/no answers.

3.5 Answer Constraint Generation

For some perturbations, even if we fail to generate
an answer, it is still possible to derive constraints
on the answer. Such constraints are valuable, as
they indicate cases of model failure. Therefore, in
addition to ˆa, we generate four types of answer
constraints: Numeric, Boolean, ≥, ≤.

When changing the last QDMR step to an arith-
metic or Boolean operation (Table 1, rows 2-3),
the new answer should be Numeric or Bool-
ean, respectively. An example of a Boolean
constraint is given in Q4 in Figure 1. When re-
placing an arithmetic operation (Table 1, row 4),
if an answer that is the sum (difference) of two
non-negative numbers is changed to the difference
(sum) of these numbers, the new answer must not
be greater (smaller) than the original answer. For
example, the answer to the question perturbed
by ReplaceArith in Table 1 (row 4) should
satisfy the ≥ constraint.

4 Generated Evaluation Sets

We run BPB on the RC datasets DROP (Dua et al.,
2019), HOTPOTQA (Yang et al., 2018), and IIRC
(Ferguson et al., 2020). Questions from the train-
ing sets of DROP and HOTPOTQA are included in
BREAK, and were used to train the decomposition
and QG models. Results on IIRC show BPB's
generalization to datasets for which we did not
observe ⟨q, d⟩ pairs. Statistics on the generated
contrast and constraint sets are in Tables 3, 4,
and 5.

Contrast Sets. Table 3 shows that BPB suc-
cessfully generates thousands of perturbations for
each dataset. For the vast majority of perturba-
tions, answer generation successfully produced a
result: for 61K out of 65K in DROP, 8.5K out
of 10.5K in HOTPOTQA, and 2.5K out of 3K in
IIRC. Overall, 61K/8.5K examples were created
                              DROP     HPQA    IIRC
AppendBool     contrast     56,205    2,754   1,884
               annotated       254      200     198
               % valid          98     84.5      98
ChangeLast     contrast        408       85      43
               annotated       200       69      43
               % valid        79.6     55.1    76.7
ReplaceArith   contrast        127        –       1
               annotated       127        –       1
               % valid        97.6        –       0
ReplaceBool    contrast          –      390       1
               annotated         –      191       1
               % valid           –     97.2     100
ReplaceComp    contrast        362    1,126      14
               annotated       200      245      14
               % valid        88.5     90.2    71.4
PruneStep      contrast      3,777    3,425     507
               annotated       399      476     302
               % valid        85.8     82.4    88.4

Table 4: Per-perturbation statistics for generation
and annotation of our datasets. Validation results
are in bold for perturbations with at least 40
examples.
from the development sets of DROP/ HOTPOTQA,
respectively, covering 63.5%/70.2% of the devel-
opment set. For the held-out dataset IIRC, not used
to train the QDMR parser and QG model, BPB
created a contrast set of 2.5K examples, which
covers almost half of the development set.
Table 4 shows the number of generated ex-
amples per perturbation. The distribution over
perturbations is skewed, with some perturbations
(AppendBool) 100x more frequent than others
(ReplaceArith). This is because the original
distribution over operations is not uniform and
each perturbation operates on different decom-
positions (e.g., AppendBool can be applied to
any question with a numeric answer, while Re-
placeComp operates on questions comparing
two objects).
Constraint Sets. Table 5 shows the number of
generated answer constraints for each dataset. The
constraint set for DROP is the largest, consist-
ing of 3.3K constraints, 8.9% of which cover
DROP examples for which we could not generate
a contrast set. This is due to the examples with
arithmetic operations, for which it is easier to gen-
erate constraints. The constraint sets of HOTPOTQA
and IIRC contain yes/no questions, for which we
use the Boolean constraint.
Estimating Example Quality To analyze the
quality of generated examples, we sampled
                                            DROP    HPQA   IIRC
# of constraints                            3,323    550     56
% of constraints that cover examples
  without a contrast set                      8.9     26   21.4
% of covered development examples            22.5    7.4      4
Numeric                                     2,398      –      –
Boolean                                         –    549     52
≥                                             825      –      1
≤                                             100      1      3

Table 5: Generation of constraints statistics for
the DROP, HOTPOTQA, and IIRC datasets.
200-500 examples from each perturbation and
dataset (unless fewer than 200 examples were
generated) and let crowdworkers validate their
correctness. We qualify 5 workers, and estab-
lish a feedback protocol where we review work
and send feedback after every annotation batch
(Nangia et al., 2021). Each generated example
was validated by three workers, and is consid-
ered valid if approved by the majority. Overall,
we observe a Fleiss Kappa (Fleiss, 1971) of
0.71, indicating substantial annotator agreement
(Landis and Koch, 1977).
Results are in Table 3 and 4. The vast majority
of generated examples (≥85%) were marked as
valid, showing that BPB produces high-quality
examples. Moreover (Table 4), we see variance
across perturbations, where some perturbations
reach >95% valid examples (AppendBool, Re-
placeBool), while others (ChangeLast) have
lower validity. Thus, overall quality can be
controlled by choosing specific perturbations.
Manual validation of generated contrast sets is
cheaper than authoring contrast sets from scratch:
The median validation time per example is 31
seconds, roughly an order of magnitude faster
than reported in Gardner et al. (2020). Thus, when
a very clean evaluation set is needed, BPB can
dramatically reduce the cost of manual annotation.
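The majority-vote validation scheme described above can be sketched as follows (an illustrative reading of the protocol; the function names are ours):

```python
# Sketch of the crowdsourced validation scheme (§4): each generated example
# is judged by three workers and kept if the majority approves it.
def is_valid(votes):
    """votes: list of three booleans from independent annotators."""
    return sum(votes) >= 2

def valid_fraction(all_votes):
    """Fraction of examples approved by a majority of their annotators."""
    return sum(is_valid(v) for v in all_votes) / len(all_votes)

votes = [
    [True, True, False],   # valid (2 of 3 approve)
    [True, False, False],  # invalid
    [True, True, True],    # valid
    [False, True, True],   # valid
]
assert is_valid([True, True, False])
assert valid_fraction(votes) == 0.75
```

Reported agreement over these three-way judgments is a Fleiss Kappa of 0.71, as noted above.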
Error Analysis of the QDMR Parser To study
the impact of errors by the QDMR parser on the
quality of generated examples, we (the authors)
took the examples annotated by crowdworkers,
and analyzed the generated QDMRs for 60 ex-
amples per perturbation from each dataset: 30
that were marked as valid by crowdworkers, and
30 that were marked as invalid. Specifically, for
each example, we checked whether the generated
QDMR faithfully expresses the reasoning path re-
quired to answer the question, and compared the
quality of QDMRs of valid and invalid examples.
For the examples that were marked as valid, we
observed that the accuracy of QDMR structures
is high: 89.5%, 92.7%, and 91.1% for DROP,
HOTPOTQA, and IIRC, respectively. This implies
that, overall, our QDMR parser generated faithful
and accurate representations for the input ques-
tions. Moreover, for examples marked as invalid,
the QDMR parser accuracy was lower but still
relatively high, with 82.0%, 82.9%, and 75.5%
valid QDMRs for DROP, HOTPOTQA, and IIRC,
respectively. This suggests that the impact of er-
rors made by the QDMR parser on generated
examples is moderate.
5 Experimental Setting
We use the generated contrast and constraint sets
to evaluate the performance of strong RC models.

5.1 Models

To evaluate our approach, we examine a suite of
models that perform well on current RC bench-
marks, and that are diverse in terms of their
architecture and the reasoning skills they address:

• TASE (Segal et al., 2020): A RoBERTa
model (Liu et al., 2019) with 4 specialized
output heads for (a) tag-based multi-span ex-
traction, (b) single-span extraction, (c) signed
number combinations, and (d) counting (un-
til 9). TASE obtains near state-of-the-art
performance when fine-tuned on DROP.

• UNIFIEDQA (Khashabi et al., 2020b): A
text-to-text T5 model (Raffel et al., 2020)
that was fine-tuned on multiple QA datasets
with different answer formats (yes/no, span,
etc.). UNIFIEDQA has demonstrated high
performance on a wide range of QA
benchmarks.

• READER (Asai et al., 2020): A BERT-based
model (Devlin et al., 2019) for RC with
two output heads for answer classification
to yes/no/span/no-answer, and span
extraction.

We fine-tune two TASE models, one on DROP
and another on IIRC, which also requires numeri-
cal reasoning. READER is fine-tuned on HOTPOTQA,
while separate UNIFIEDQA models are fine-tuned
on each of the three datasets. Moreover, we
evaluate UNIFIEDQA without fine-tuning, to ana-
lyze its generalization to unseen QA distributions.
We denote by UNIFIEDQA the model without
fine-tuning, and by UNIFIEDQAX the UNIFIEDQA
model fine-tuned on dataset X.

We consider a ‘‘pure’’ RC setting, where only
the context necessary for answering is given as
input. For HOTPOTQA, we feed the model with the
two gold paragraphs (without distractors), and for
IIRC we concatenate the input paragraph with the
gold evidence pieces from other paragraphs.

Overall, we study 6 model-dataset combina-
tions, with 2 models per dataset. For each model,
we perform a hyperparameter search and train 3-4
instances with different random seeds, using the
best configuration on the development set.

5.2 Evaluation

We evaluate each model in multiple settings: (a)
the original development set; (b) the generated
contrast set, denoted by CONT; (c) the subset
of CONT marked as valid by crowdworkers, de-
noted by CONTVAL. Notably, CONT and CONTVAL
have a different distribution over perturbations.
To account for this discrepancy, we also eval-
uate models on a sample from CONT, denoted
by CONTRAND, where sampling is according to
the perturbation distribution in CONTVAL. Last, to
assess the utility of constraint sets, we enrich the
contrast set of each example with its corresponding
constraints, denoted by CONT+CONST.

Performance is measured using the standard
F1 metric. Moreover, we measure consistency
(Gardner et al., 2020), that is, the fraction of
examples such that the model predicted the correct
answer to the original example as well as to all
examples generated for this example. A prediction
is considered correct if the F1 score, with respect
to the gold answer, is ≥ 0.8. Formally, for a set of
evaluation examples S = \{\langle q_i, c_i, a_i \rangle\}_{i=1}^{|S|}:

\[ \text{consistency}(S) = \frac{1}{|S|} \sum_{x \in S} g(C(x)) \]

\[ g(X) = \begin{cases} 1, & \text{if } \forall \langle \hat{x}, \hat{a} \rangle \in X : \mathrm{F1}(y(\hat{x}), \hat{a}) \geq 0.8 \\ 0, & \text{otherwise} \end{cases} \]
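The consistency metric can be sketched in code as follows; this is a minimal illustration, where `token_f1` is a simplified stand-in for the official token-level F1 and `predict` is a hypothetical model interface, not part of the released BPB code.

```python
def token_f1(pred: str, gold: str) -> float:
    """Simplified token-level F1 between a predicted and a gold answer."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    if not pred_toks or not gold_toks:
        return float(pred_toks == gold_toks)
    gold_counts = {}
    for t in gold_toks:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in pred_toks:
        if gold_counts.get(t, 0) > 0:
            gold_counts[t] -= 1
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_toks)
    recall = common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def consistency(contrast_sets, predict, threshold=0.8):
    """Fraction of original examples whose entire contrast set
    (which includes the original example) is answered correctly.

    contrast_sets: list of lists of (question, context, gold_answer).
    predict: function (question, context) -> predicted answer string.
    A prediction counts as correct when its F1 is >= threshold (0.8).
    """
    correct_sets = 0
    for cset in contrast_sets:
        if all(token_f1(predict(q, c), a) >= threshold for q, c, a in cset):
            correct_sets += 1
    return correct_sets / len(contrast_sets)
```

A model that answers every perturbation of an example correctly contributes 1 to the sum; a single failure within a contrast set zeroes out that example's contribution, which is why consistency is much stricter than average F1.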
Model       DEV F1       CONTVAL F1   CONTRAND F1   CONT F1      CONTVAL Cnst.   CONT Cnst.   CONT+CONST Cnst.
TASEDROP    83.5 ± 0.1   65.9 ± 1     57.3 ± 0.6    54.8 ± 0.4   55.7 ± 1.1      35.7 ± 0.5   33.7 ± 0.3
TASEDROP+   83.7 ± 1.1   75.2 ± 0.5   68 ± 1        66.5 ± 0.5   66.3 ± 0.4      48.9 ± 0.6   45 ± 0.4
TASEIIRC    69.9 ± 0.5   45 ± 5       41.2 ± 3.8    33.7 ± 2.2   23.7 ± 4.7      24.3 ± 5.3   24.3 ± 5.3
TASEIIRC+   68.8 ± 1.3   81.1 ± 4.6   78.2 ± 4.9    72.4 ± 5.7   50.4 ± 3.2      48.2 ± 2.5   48.2 ± 2.5

Table 6: Evaluation results of TASE on DROP and IIRC. For each dataset, we compare the model
trained on the original and augmented (marked with +) training data.
Model     DEV F1       CONTVAL F1   CONTRAND F1   CONT F1      CONTVAL Cnst.   CONT Cnst.   CONT+CONST Cnst.
READER    82.2 ± 0.2   58.1 ± 0.1   54.5 ± 0.7    49.9 ± 0.4   39.6 ± 0.6      43.1 ± 0.1   43 ± 0.1
READER+   82.7 ± 0.9   89.1 ± 0.4   86.6 ± 0.6    81.9 ± 0.3   65.6 ± 0.4      56.4 ± 0.4   56.3 ± 0.4

Table 7: Results of READER on HOTPOTQA, when trained on the original and augmented (marked with +)
data.
Model            DEV F1       CONTVAL F1   CONTRAND F1   CONT F1      CONTVAL Cnst.   CONT Cnst.   CONT+CONST Cnst.
UNIFIEDQA        28.2         38.1         35.1          34.9         5.3             4.4          2.2
UNIFIEDQADROP    33.9 ± 0.9   28.4 ± 0.8   26.9 ± 0.5    12.2 ± 1.6   8.1 ± 3.8       5.1 ± 0.7    4.4 ± 0.5
UNIFIEDQADROP+   32.9 ± 1.2   37.9 ± 1.4   35.9 ± 2.5    16.9 ± 0.2   10.5 ± 4.4      9.6 ± 0.2    8 ± 0.5
UNIFIEDQA        65.2         68.2         68.7          61.1         52.9            38.4         37.6
UNIFIEDQAHPQA    74.7 ± 0.2   60.3 ± 0.8   58.7 ± 0.9    61.9 ± 0.7   35.6 ± 1.1      40.2 ± 0.1   39.9 ± 0.1
UNIFIEDQAHPQA+   74.1 ± 0.2   60.3 ± 1.9   59.2 ± 1.5    62.3 ± 2.3   36.3 ± 0.7      41.6 ± 0.3   41.3 ± 0.4
UNIFIEDQA        44.5         57.2         36.5          29.8         21.6            28.1         28.1
UNIFIEDQAIIRC    50.2 ± 0.7   45.1 ± 2.1   42.5 ± 2.3    20.4 ± 2.9   24.9 ± 1.2      28.6 ± 0.8   28.5 ± 0.8
UNIFIEDQAIIRC+   51.7 ± 0.9   62.9 ± 2.9   54.5 ± 3.9    40.8 ± 5.4   30.2 ± 2.7      32.1 ± 1.9   32.1 ± 1.9

Table 8: Evaluation results of UNIFIEDQA on DROP, HOTPOTQA, and IIRC. We compare UNIFIEDQA
without fine-tuning, and after fine-tuning on the original training data and on the augmented training
data (marked with +).
where C(X) is the generated contrast set for exam-
ple x (which includes x),5 and y(ˆx) is the model’s
prediction for example ˆx. Constraint satisfaction
is measured using a binary 0-1 score.
Because yes/no questions do not exist in DROP,
we do not evaluate TASEDROP on AppendBool
examples, which have yes/no answers, as we
cannot expect the model to answer those correctly.
5.3 Results
Results are presented separately for each model,
in Tables 6, 7, and 8. Comparing performance on
the development sets (DEV F1) to the correspond-
ing contrast sets (CONT F1), we see a substantial
decrease in performance on the generated contrast
sets, across all datasets (e.g., 83.5 → 54.8 for
TASEDROP, 82.2 → 49.9 for READER, and 50.2 →
20.4 for UNIFIEDQAIIRC). Moreover, model con-
sistency (CONT Cnst.) is considerably lower than
the development scores (DEV F1); for example,
TASEIIRC obtains a 69.9 F1 score but only 24.3 con-
sistency. This suggests that, overall, the models
do not generalize to perturbations in the reasoning
path expressed in the original question.

5 With a slight abuse of notation, we overload the definition
of C(x) from §2, such that members of C(x) include not just
the question and context, but also an answer.
Comparing the results on the contrast sets and
their validated subsets (CONT vs. CONTVAL), per-
formance on CONTVAL is better than on CONT
(per esempio., 58.1 versus 49.9 for READER). These gaps
are due to (UN) the distribution mismatch between
the two sets, E (B) bad example generation.
To isolate the effect of bad example generation,
we can compare CONTVAL to CONTRAND, Quale
have the same distribution over perturbations, Ma
CONTRAND is not validated by humans. We see that
performance on CONTVAL is typically ≤10%
higher than on CONTRAND (e.g., 58.1 vs. 54.5 for
READER). Given that performance on the original
development set is dramatically higher, it seems
we can currently use automatically generated
contrast sets (without verification) to evaluate
robustness to reasoning perturbations.
Last, adding constraints to the generated con-
trast sets (CONT vs. CONT+CONST) often leads to
a decrease in model consistency, most notably on
DROP, where there are arithmetic constraints and
not only answer type constraints.
For
instance, consistency drops from 35.7
A 33.7 for TASE, and from 5.1 A 4.4 for
UNIFIEDQADROP. This shows that the generated
constraints expose additional flaws in current
models.
5.4 Data Augmentation
Results in §5.3 reveal clear performance gaps
in current QA models. A natural solution is to
augment the training data with examples from
the contrast set distribution, which can be done
effortlessly, since BPB is fully automatic.
We run BPB on the training sets of DROP,
HOTPOTQA, and IIRC. As BPB generates many
examples, it can shift the original training distri-
bution dramatically. Così, we limit the number
of examples generated by each perturbation by a
threshold τ . Specifically, for a training set S with
|S| = n examples, we augment S with τ ∗ n ran-
domly generated examples from each perturbation
(if fewer than τ ∗ n examples were generated we
add all of them). We experiment with three val-
ues τ ∈ {0.03, 0.05, 0.1}, and choose the trained
model with the best F1 on the contrast set.
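The capped augmentation procedure described above can be sketched as follows; `augment_training_set` and the dictionary of generated examples are hypothetical names chosen for illustration, not part of the released BPB code.

```python
import random


def augment_training_set(train_set, generated_by_pert, tau, seed=0):
    """Augment a training set with generated examples, capping each
    perturbation at tau * |train_set| examples so that the original
    training distribution does not shift too dramatically.

    train_set: list of original training examples.
    generated_by_pert: dict mapping perturbation name -> list of
        generated examples for that perturbation.
    tau: per-perturbation cap ratio (e.g., 0.03, 0.05, or 0.1).
    """
    rng = random.Random(seed)
    cap = int(tau * len(train_set))
    augmented = list(train_set)
    for pert, examples in generated_by_pert.items():
        if len(examples) > cap:
            # Too many generated examples: sample cap of them at random.
            augmented.extend(rng.sample(examples, cap))
        else:
            # Fewer than the cap: add all of them.
            augmented.extend(examples)
    return augmented
```

One would then train a model on each value of τ and keep the run with the best contrast-set F1, mirroring the model selection described above.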
Augmentation results are shown in Tables 6–8.
Consistency (CONT and CONTVAL) improves dra-
matically, with only a small change in the model’s
DEV performance, across all models. We ob-
serve an increase in consistency of 13 points for
TASEDROP, 24 for TASEIIRC, 13 for READER, E
1-4 points for the UNIFIEDQA models. Interest-
ingly, augmentation is less helpful for UNIFIEDQA
than for TASE and READER. We conjecture that
this is because UNIFIEDQA was trained on exam-
ples from multiple QA datasets and is thus less
affected by the augmented data.
Improvement on test examples sampled from
the augmented training distribution is expected.
To test whether augmented data improves robust-
ness on other distributions, we evaluate TASE+
and UNIFIEDQADROP+ on the DROP contrast set
manually collected by Gardner et al. (2020). Noi
find that training on the augmented training set
does not lead to a significant change on the man-
ually collected contrast set (F1 of 60.4 → 61.1 for
TASE, E 30 → 29.6 for UNIFIEDQADROP). Questo
agrees with findings that data augmentation with
respect to a phenomenon may not improve gen-
eralization to other out-of-distribution examples
(Kaushik et al., 2021; Joshi and He, 2021).
6 Performance Analysis
Analysis Across Perturbations. We compare
model performance on the original (ORIG) E
generated examples (CONT and CONTVAL) across
perturbations (Figures 3, 4, 5). Starting from mod-
els with specialized architectures (TASE and
READER), except for ChangeLast (discussed
later), models’ performance decreases on all
perturbations. Specifically, TASE (Figures 3, 5)
demonstrates brittleness to changes in comparison
questions (10-30 F1 decrease on ReplaceComp)
and arithmetic computations (∼30 F1 decrease
on ReplaceArith). The biggest decrease of
almost 50 points is on examples generated by
PruneStep from DROP (Figure 3), showing
that the model struggles to answer intermediate
reasoning steps.
READER (Figure 4) shows similar trends to
TASE, with a dramatic performance decrease
Di 80-90 points on yes/no questions created
by AppendBool and ReplaceBool. Inter-
estingly, READER obtains high performance on
PruneStep examples, as opposed to TASEDROP
(Figure 3), which has a similar span extraction
head that is required for these examples. Questo
is possibly due to the ‘‘train-easy’’ subset of
HOTPOTQA, which includes single-step selection
questions.
Moving to the general-purpose UNIFIEDQA
models, they perform on PruneStep at least
as well as on the original examples, showing their
ability to answer simple selection questions. They also
demonstrate robustness on ReplaceBool. Yet,
they struggle on numeric comparison questions
or arithmetic calculations: ∼65 points decrease
on ChangeLast on DROP (Figure 3), 10-30
F1 decrease on ReplaceComp and Append-
Bool (Figures 3, 4, 5), and almost 0 F1 on
ReplaceArith (Figure 3).
Performance on CONT and CONTVAL. Re-
sults on CONTVAL are generally higher than CONT
due to the noise in example generation. Tuttavia,
whenever results on ORIG are higher than CONT,
they are also higher than CONTVAL, showing that
the general trend can be inferred from CONT, due
to the large performance gap between ORIG and
CONT. An exception is ChangeLast in DROP
and HOTPOTQA, where performance on CONT is
lower than ORIG, but on CONTVAL is higher. This
is probably due to the noise in generation, es-
pecially for DROP, where example validity is at
55.1% (see Table 4).

Figure 3: Performance on DROP per perturbation: on
the generated contrast set (CONT), on the examples
from which CONT was generated (ORIG), and on the
validated subset of CONT (CONTVAL).

Figure 4: Performance on HOTPOTQA per perturbation:
on the generated contrast set (CONT), on the examples
from which CONT was generated (ORIG), and the
validated subset of CONT (CONTVAL).
Evaluation on Answer Constraints. Evaluating
whether the model satisfies answer constraints
can help assess the model’s skills. To this end, we
measure the fraction of answer constraints satis-
fied by the predictions of each model (we consider
only constraints with more than 50 examples).

Figure 5: Performance on IIRC per perturbation: on the
generated contrast set (CONT), on the examples from
which CONT was generated (ORIG), and the validated
subset of CONT (CONTVAL).
Models typically predict the correct answer
type; TASEDROP and UNIFIEDQA predict a number
for ≥ 86% of the generated numeric questions,
and READER and TASEIIRC successfully predict a
yes/no answer in ≥ 92% of the cases. Tuttavia,
fine-tuning UNIFIEDQA on HOTPOTQA and IIRC
reduces constraint satisfaction (94.7 → 76.3 for
UNIFIEDQAHPQA, 65.4 → 38.9 for UNIFIEDQAIIRC),
possibly since yes/no questions constitute fewer
than 10% of the examples (Yang et al., 2018;
Ferguson et al., 2020). Moreover, results on
DROP for the constraint ‘≥’ are considerably
lower than for ‘≤’ for UNIFIEDQA (83 → 67.4)
and UNIFIEDQADROP (81.8 → 65.9), indicating a
bias towards predicting small numbers.
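A constraint-satisfaction check of this kind can be sketched as follows; the tuple encoding of constraints (`("type", ...)`, `(">=", ...)`, `("<=", ...)`) is a hypothetical representation chosen only for illustration.

```python
def satisfies_constraint(prediction: str, constraint: tuple) -> bool:
    """Binary 0-1 check of an answer constraint against a prediction.

    Hypothetical constraint encodings:
      ("type", "number")  - prediction must parse as a number
      ("type", "boolean") - prediction must be a yes/no answer
      (">=", 12.0)        - numeric prediction must be >= 12.0
      ("<=", 12.0)        - numeric prediction must be <= 12.0
    """
    kind, value = constraint
    pred = prediction.strip().lower()
    if kind == "type":
        if value == "boolean":
            return pred in {"yes", "no"}
        if value == "number":
            try:
                float(pred)
                return True
            except ValueError:
                return False
        raise ValueError(f"unknown answer type: {value}")
    # Arithmetic constraints require a numeric prediction.
    try:
        num = float(pred)
    except ValueError:
        return False
    if kind == ">=":
        return num >= value
    if kind == "<=":
        return num <= value
    raise ValueError(f"unknown constraint kind: {kind}")
```

The fraction of constraints satisfied is then simply the mean of this binary score over all predictions paired with their constraints.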
7 Related Work
The evaluation crisis in NLU has led to wide inter-
est in challenge sets that evaluate the robustness
of models to input perturbations. Tuttavia, most
past approaches (Ribeiro et al., 2020; Gardner
et al., 2020; Khashabi et al., 2020UN; Kaushik et al.,
2020) involve a human-in-the-loop and are thus
costly.
Recently, more and more work has consid-
ered using meaning representations of language to
automatically generate evaluation sets. Past work
used an ERG grammar (Li et al., 2020) and AMR
(Rakshit and Flanigan, 2021) to generate rela-
tively shallow perturbations. In parallel to this
work, Ross et al. (2021) used control codes over
SRL to generate more semantic perturbations to
declarative sentences. We generate perturbations
at the level of the underlying reasoning process,
in the context of QA. Last, Bitton et al. (2021)
used scene graphs to generate examples for vi-
sual QA. Tuttavia, they assumed the existence of
a gold scene graph at the input. Overall, this body of
work represents an exciting new research program,
where structured representations are leveraged to
test and improve the blind spots of pre-trained
language models.
More broadly, interest in automatic creation of
evaluation sets that test out-of-distribution gener-
alization has skyrocketed, whether using heuristics
(Asai and Hajishirzi, 2020; Wu et al., 2021), dati
splits (Finegan-Dollak et al., 2018; Keysers et al.,
2020), adversarial methods (Alzantot et al., 2018),
or an aggregation of the above (Mille et al., 2021;
Goel et al., 2021).
Last, QDMR-to-question generation is broadly
related to work on text generation from struc-
tured data (Nan et al., 2021; Novikova et al.,
2017; Shu et al., 2021), and to passage-to-question
generation methods (Du et al., 2017; Wang et al.,
2020; Duan et al., 2017) that, in contrast to our
work, focused on simple questions not requiring
reasoning.
Finally, we showed that constraint sets are useful
for evaluation. Future work can use constraints as
a supervision signal, similar to Dua et al. (2021),
who leveraged dependencies between training
examples to enhance model performance.
8 Discussion

We propose the BPB framework for generating
high-quality reasoning-focused question perturba-
tions, and demonstrate its utility for constructing
contrast sets and evaluating RC models.

While we focus on RC, our method for per-
turbing questions is independent of the context
modality. Thus, porting our approach to other
modalities only requires a method for computing
the answer to perturbed questions. Moreover, BPB
provides a general-purpose mechanism for ques-
tion generation, which can be used outside QA
as well.

We provide a library of perturbations that is
a function of the current abilities of RC models.
As future RC models, QDMR parsers, and QG
models improve, we can expand this library to
support additional semantic phenomena.

Limitations. BPB represents questions with
QDMR, which is geared towards representing
complex factoid questions that involve multiple
reasoning steps. Thus, BPB cannot be used when
questions involve a single step; for example, one
cannot use BPB to perturb ‘‘Where was Barack
Obama born?’’. Inherently, the effectiveness of
our pipeline approach depends on the performance
of its modules: the QDMR parser, the QG model,
and the single-hop RC model used for QDMR
evaluation. However, our results suggest that cur-
rent models already yield high-quality examples,
and model performance is expected to improve
over time.

Acknowledgments

We thank Yuxiang Wu, Itay Levy, and Inbar Oren
for the helpful feedback and suggestions. This
research was supported in part by The Yandex
Initiative for Machine Learning, and The Euro-
pean Research Council (ERC) under the European
Union Horizons 2020 research and innovation pro-
gramme (grant ERC DELPHI 802800). This work
was completed in partial fulfillment for the Ph.D.
degree of Mor Geva.

References
Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and
Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings
of the 2018 Conference on Empirical Methods in Natural Language Processing,
pages 2890–2896, Brussels, Belgium. Association for Computational Linguistics.
https://doi.org/10.18653/v1/D18-1316

Akari Asai and Hannaneh Hajishirzi. 2020. Logic-guided data augmentation and
regularization for consistent question answering. In Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics, pages 5642–5650, Online.
Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.499
Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong.
2020. Learning to retrieve reasoning paths over Wikipedia graph for question answering.
In International Conference on Learning Representations.

Yonatan Bitton, Gabriel Stanovsky, Roy Schwartz, and Michael Elhadad. 2021. Automatic
generation of contrast sets from scene graphs: Probing the compositional consistency
of GQA. In Proceedings of the 2021 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, pages 94–105,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021.naacl-main.9

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and
Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no
questions. In Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 2924–2936.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
Pre-training of deep bidirectional transformers for language understanding. In North
American Association for Computational Linguistics (NAACL), pages 4171–4186,
Minneapolis, Minnesota.

Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question
generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pages 1342–1352,
Vancouver, Canada. Association for Computational Linguistics.

Dheeru Dua, Pradeep Dasigi, Sameer Singh, and Matt Gardner. 2021. Learning with
instance bundles for reading comprehension. arXiv preprint arXiv:2104.08735.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt
Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning
over paragraphs. In North American Chapter of the Association for Computational
Linguistics (NAACL).

Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. 2017. Question generation for question
answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural
Language Processing, pages 866–874, Copenhagen, Denmark. Association for
Computational Linguistics. https://doi.org/10.18653/v1/D17-1090

James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi.
2020. IIRC: A dataset of incomplete information reading comprehension questions. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1137–1147, Online. Association for Computational
Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.86

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh
Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-SQL evaluation
methodology. In Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 351–360, Melbourne,
Australia. Association for Computational Linguistics.
https://doi.org/10.18653/v1/P18-1033

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters.
Psychological Bulletin, 76(5):378. https://doi.org/10.1037/h0031619

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen,
Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh
Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu,
Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut
Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models’ local
decision boundaries via contrast sets. In Findings of the Association for Computational
Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational
Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.117
Karan Goel, Nazneen Fatema Rajani, Jesse Vig, Zachary Taschdjian, Mohit Bansal, and
Christopher Ré. 2021. Robustness gym: Unifying the NLP evaluation landscape. In
Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies: Demonstrations,
pages 42–55, Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021.naacl-demos.6

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading
comprehension systems. In Empirical Methods in Natural Language Processing (EMNLP).
https://doi.org/10.18653/v1/D17-1215

Nitish Joshi and He He. 2021. An investigation of the (in)effectiveness of
counterfactually augmented data. arXiv preprint arXiv:2107.00753.

Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2020. Learning the difference that
makes a difference with counterfactually-augmented data. In International Conference
on Learning Representations.

Divyansh Kaushik, Douwe Kiela, Zachary C. Lipton, and Wen-tau Yih. 2021. On the
efficacy of adversarial data collection for question answering: Results from a
large-scale randomized study. In Association for Computational Linguistics and
International Joint Conference on Natural Language Processing (ACL-IJCNLP).
https://doi.org/10.18653/v1/2021.acl-long.517

Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii
Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry
Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. 2020. Measuring compositional
generalization: A comprehensive method on realistic data. In International Conference
on Learning Representations.

Daniel Khashabi, Tushar Khot, and Ashish Sabharwal. 2020a. More bang for your buck:
Natural perturbation for robust question answering. In Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 163–170,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.emnlp-main.12

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter
Clark, and Hannaneh Hajishirzi. 2020b. UNIFIEDQA: Crossing format boundaries with a
single QA system. In Findings of the Association for Computational Linguistics:
EMNLP 2020, pages 1896–1907, Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.findings-emnlp.171

J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for
categorical data. Biometrics, 33(1):159–174. https://doi.org/10.2307/2529310,
PubMed: 843571

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence
pre-training for natural language generation, translation, and comprehension. In
Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, pages 7871–7880. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.acl-main.703

Chuanrong Li, Lin Shengshuo, Zeyu Liu, Xinyi Wu, Xuhui Zhou, and Shane
Steinert-Threlkeld. 2020. Linguistically-informed transformations (LIT): A method for
automatically generating contrast sets. In Proceedings of the Third BlackboxNLP
Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 126–135, Online.
Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly
optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons:
Diagnosing syntactic heuristics in natural language inference. In Proceedings of the
57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448,
Florence, Italy. Association for Computational Linguistics.
https://doi.org/10.18653/v1/P19-1334
Simon Mille, Kaustubh Dhole, Saad Mahamood, Laura Perez-Beltrachini, Varun Gangal,
Mihir Kale, Emiel van Miltenburg, and Sebastian Gehrmann. 2021. Automatic construction
of evaluation suites for natural language generation datasets. In Thirty-fifth
Conference on Neural Information Processing Systems Datasets and Benchmarks Track
(Round 1).

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig.
2018. Stress test evaluation for natural language inference. In Proceedings of the
27th International Conference on Computational Linguistics, pages 2340–2353, Santa Fe,
New Mexico, USA. Association for Computational Linguistics.

Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun
Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia
Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit
Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher, and
Nazneen Fatema Rajani. 2021. DART: Open-domain structured data record to text
generation. In Proceedings of the 2021 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, pages 432–447,
Online. Association for Computational Linguistics.

Nikita Nangia, Saku Sugawara, Harsh Trivedi, Alex Warstadt, Clara Vania, and Samuel R.
Bowman. 2021. What ingredients make for an effective crowdsourcing protocol for
difficult NLU data collection tasks? In Proceedings of the 59th Annual Meeting of the
Association for Computational Linguistics and the 11th International Joint Conference
on Natural Language Processing (Volume 1: Long Papers), pages 1221–1235, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021.acl-long.98

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New
challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial
Meeting on Discourse and Dialogue, pages 201–206, Saarbrücken, Germany. Association
for Computational Linguistics. https://doi.org/10.18653/v1/W17-5525

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer
learning with a unified text-to-text transformer. Journal of Machine Learning
Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD:
100,000+ questions for machine comprehension of text. In Empirical Methods in Natural
Language Processing (EMNLP).

Geetanjali Rakshit and Jeffrey Flanigan. 2021. ASQ: Automatically generating
question-answer pairs using AMRs. arXiv preprint arXiv:2105.10023.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond
accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, pages 4902–4912,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.acl-main.442

Alexis Ross, Tongshuang Wu, Hao Peng, Matthew E. Peters, and Matt Gardner. 2021.
Tailor: Generating and perturbing text with semantic controls. arXiv preprint
arXiv:2107.07150.

Elad Segal, Avia Efrat, Mor Shoham, Amir Globerson, and Jonathan Berant. 2020. A
simple and effective model for answering multi-span questions. In Proceedings of the
2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pages 3074–3080, Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.emnlp-main.248

Chang Shu, Yusen Zhang, Xiangyu Dong, Peng Shi, Tao Yu, and Rui Zhang. 2021.
Logic-consistency text generation from semantic parses. In Findings of the Association
for Computational Linguistics: ACL-IJCNLP 2021, pages 4414–4426, Online. Association
for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.388
Siyuan Wang, Zhongyu Wei, Zhihao Fan, Zengfeng Huang, Weijian Sun, Qi Zhang, and
Xuanjing Huang. 2020. PathQG: Neural question generation from facts. In Proceedings of
the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pages 9066–9075, Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.emnlp-main.729

Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and
Jonathan Berant. 2020. BREAK it down: A question understanding benchmark. Transactions
of the Association for Computational Linguistics (TACL), 8:183–198.
https://doi.org/10.1162/tacl_a_00309

Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel Weld. 2021. Polyjuice:
Generating counterfactuals for explaining, evaluating, and improving models. In
Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 6707–6723, Online. Association for Computational
Linguistics.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan
Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse,
explainable multi-hop question answering. In Empirical Methods in Natural Language
Processing (EMNLP). https://doi.org/10.18653/v1/D18-1259