Break, Perturb, Build: Automatic Perturbation of Reasoning Paths Through Question Decomposition

Mor Geva, Tomer Wolfson, Jonathan Berant
School of Computer Science, Tel Aviv University, Israel
Allen Institute for Artificial Intelligence

{morgeva@mail,tomerwol@mail,joberant@cs}.tau.ac.il

Abstract

Recent efforts to create challenge benchmarks that test the abilities of natural language understanding models have largely depended on human annotations. In this work, we introduce the ‘‘Break, Perturb, Build’’ (BPB) framework for automatic reasoning-oriented perturbation of question-answer pairs. BPB represents a question by decomposing it into the reasoning steps that are required to answer it, symbolically perturbs the decomposition, and then generates new question-answer pairs. We demonstrate the effectiveness of BPB by creating evaluation sets for three reading comprehension (RC) benchmarks, generating thousands of high-quality examples without human intervention. We evaluate a range of RC models on our evaluation sets, which reveals large performance gaps on generated examples compared to the original data. Moreover, symbolic perturbations enable fine-grained analysis of the strengths and limitations of models. Last, augmenting the training data with examples generated by BPB helps close the performance gaps, without any drop on the original data distribution.

1 Introduction

Evaluating natural language understanding (NLU) systems has become a fickle enterprise. While models outperform humans on standard benchmarks, they perform poorly on a multitude of distribution shifts (Jia and Liang, 2017; Naik et al., 2018; McCoy et al., 2019, inter alia). To expose such gaps, recent work has proposed to evaluate models on contrast sets (Gardner et al., 2020), or counterfactually-augmented data (Kaushik et al., 2020), where minimal but meaningful perturbations are applied to test examples. However, since such examples are manually written, collecting them is expensive, and procuring diverse perturbations is challenging (Joshi and He, 2021).


Recently, methods for automatic generation of contrast sets were proposed. However, current methods are restricted to shallow surface perturbations (Mille et al., 2021; Li et al., 2020), specific reasoning skills (Asai and Hajishirzi, 2020), or rely on expensive annotations (Bitton et al., 2021). Thus, automatic generation of examples that test high-level reasoning abilities of models and their robustness to fine semantic distinctions remains an open challenge.

In this work, we propose the ‘‘Break, Perturb, Build’’ (BPB) framework for automatic generation of reasoning-focused contrast sets for reading comprehension (RC). Changing the high-level semantics of questions and generating question-answer pairs automatically is challenging. First, it requires extracting the reasoning path expressed in a question, in order to manipulate it. Second, it requires the ability to generate grammatical and coherent questions. In Figure 1, for example, transforming Q, which involves number comparison, into Q1, which requires subtraction, leads to dramatic changes in surface form. Third, it requires an automatic method for computing the answer to the perturbed question.

Our insight is that perturbing question semantics is possible when modifications are applied to a structured meaning representation, rather than to the question itself. Specifically, we represent questions with QDMR (Wolfson et al., 2020), a representation that decomposes a question into a sequence of reasoning steps, which are written in natural language and are easy to manipulate. Relying on a structured representation lets us develop a pipeline for perturbing the reasoning path expressed in RC examples.

Our method (see Figure 1) has four steps. We (1) parse the question into its QDMR decomposition, (2) apply rule-based perturbations to the decomposition, (3) generate new questions from the perturbed decompositions, and (4) compute their answers.
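For concreteness, the following sketch outlines these four steps; the function names (parser, perturbations, question_generator, answer_generator) are illustrative placeholders rather than the released API:

    def bpb(question, context, answer, parser, perturbations,
            question_generator, answer_generator):
        """Sketch of the four BPB steps for a single RC example (illustrative only)."""
        decomposition = parser(question)                  # (1) question -> QDMR steps
        contrast_set = []
        for perturb in perturbations:                     # (2) symbolic perturbations
            for new_decomposition in perturb(decomposition):
                new_question = question_generator(new_decomposition)  # (3) QDMR -> question
                # (4) compute the new answer (or an answer constraint) when possible
                new_answer = answer_generator(new_question, new_decomposition, context, answer)
                if new_answer is not None:
                    contrast_set.append((new_question, context, new_answer))
        return contrast_set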

Transactions of the Association for Computational Linguistics, vol. 10, pp. 111–126, 2022. https://doi.org/10.1162/tacl_a_00450
Action Editor: Preslav Nakov. Submission batch: 8/2021; Revision batch: 9/2021; Published 2/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Figure 1: An overview of BPB. Given a context (C), question (Q), and the answer (A) to the question, we generate new examples by (1) parsing the question into its QDMR decomposition, (2) applying semantic perturbations to the decomposition, (3) generating a question for each transformed decomposition, and (4) computing answers/constraints to the new questions.

In cases where computing the answer is impossible, we compute constraints on the answer, which are also useful for evaluation. For example, for Q4 in Figure 1, even if we cannot extract the years of the described events, we know the answer type of the question (Boolean). Notably, aside from answer generation, all steps depend on the question only, and can be applied to other modalities, such as visual or table question answering (QA).

Running BPB on the three RC datasets, DROP (Dua et al., 2019), HOTPOTQA (Yang et al., 2018), and IIRC (Ferguson et al., 2020), yields thousands of semantically rich examples, covering a majority of the original examples (63.5%, 70.2%, and 45.1%, respectively). Moreover, we validate examples using crowdworkers and find that ≥85% of generated examples are correct.

We demonstrate the utility of BPB for comprehensive and fine-grained evaluation of multiple RC models. First, we show that leading models, such as UNIFIEDQA (Khashabi et al., 2020b) and TASE (Segal et al., 2020), struggle on the generated contrast sets with a decrease of 13-36 F1
points and low consistency (<40). Moreover, an- alyzing model performance per perturbation type and constraints, reveals the strengths and weak- nesses of models on various reasoning types. For instance, (a) models with specialized architectures are more brittle compared to general-purpose mod- els trained on multiple datasets, (b) TASE fails to answer intermediate reasoning steps on DROP, (c) UNIFIEDQA fails completely on questions requir- ing numerical computations, and (d) models tend to do better when the numerical value of an answer is small. Last, data augmentation with examples generated by BPB closes part of the performance gap, without any decrease on the original datasets. In summary, we introduce a novel frame- work for automatic perturbation of complex reasoning questions, and demonstrate its effi- cacy for generating contrast sets and evaluating improve- that models. We expect ments in question generation, RC, and QDMR models will further widen the accuracy and applicability of our approach. The generated eval- uation sets and codebase are publicly available at https://github.com/mega002/qdmr -based-question-generation. imminent 2 Background Our goal, given a natural language question q, is to automatically alter its semantics, generating perturbed questions ˆq for evaluating RC models. This section provides background on the QDMR representation and the notion of contrast sets. Question Decomposition Meaning Representa- tion (QDMR). To manipulate question seman- tics, we rely on QDMR (Wolfson et al., 2020), a structured meaning representation for questions. The QDMR decomposition d = QDMR(q) is a sequence of reasoning steps s1, . . . , s|d| required to answer q. Each step si in d is an intermediate question that is phrased in natural language and annotated with a logical operation oi, such as se- lection (e.g., ‘‘When was the Madison Woolen Mill built?’’) or comparison (e.g., ‘‘Which is highest of #1, #2?’’). Example QDMRs are shown 112 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 5 0 1 9 8 7 0 2 2 / / t l a c _ a _ 0 0 4 5 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 in Figure 1 (step 2). QDMR paves a path to- wards controlling the reasoning path expressed in a question by changing, removing, or adding steps (§3.2). Contrast Sets. Gardner et al. (2020) defined the contrast set C(x) of an example x with a label y as a set of examples with minimal perturbations to x that typically affect y. Contrast sets evaluate whether a local decision boundary around an ex- ample is captured by a model. In this work, given a question-context pair x = (cid:4)q, c(cid:5), we semanti- cally perturb the question and generate examples ˆx = (cid:4)ˆq, c(cid:5) ∈ C((cid:4)q, c(cid:5)) that modify the original answer a to ˆa. 3 BPB: Automatically Generating Semantic Question Perturbations We now describe the BPB framework. Given an input x = (cid:4)q, c(cid:5) of question and context, and the answer a to q given c, we automatically map it to a set of new examples C(x) (Figure 1). Our approach uses models for question decomposition, question generation (QG), and RC. 3.1 Question Decomposition The first step (Figure 1, step 1) is to represent q us- ing a structured decomposition, d = QDMR(q). To this end, we train a text-to-text model that generates d conditioned on q. 
Specifically, we fine-tune BART (Lewis et al., 2020) on the high- level subset of the BREAK dataset (Wolfson et al., 2020), which consists of 23.8K (cid:4)q, d(cid:5) pairs from three RC datasets, including DROP and HOT- POTQA.1 Our QDMR parser obtains a 77.3 SARI score on the development set, which is near state-of-the-art on the leaderboard.2 3.2 Decomposition Perturbation A decomposition d describes the reasoning steps necessary for answering q. By modifying d’s steps, we can control the semantics of the question. We define a ‘‘library’’ of rules for transforming d → ˆd, and use it to generate questions ˆd → ˆq. BPB provides a general method for creating a wide range of perturbations. In practice, though, deciding which rules to include is coupled with the reasoning abilities expected from our models. For example, there is little point in testing a model on arithmetic operations if it had never seen such examples. Thus, we implement rules based on the reasoning skills required in current RC datasets (Yang et al., 2018; Dua et al., 2019). As future benchmarks and models tackle a wider range of reasoning phenomena, one can expand the rule library. Table 1 provides examples for all QDMR perturbations, which we describe next: • AppendBool: When the question q re- turns a numeric value, we transform its QDMR by appending a ‘‘yes/no’’ com- parison step. The comparison is against the answer a of question q. As shown in Table 1, the appended step compares the previous step result (‘‘#3’’) to a constant (‘‘is higher than 2’’). AppendBool per- turbations are generated for 5 comparison operators (>, <, ≤, ≥, (cid:9)=). For the compared values, we sample from a set, based on the answer a: {a + k, a − k, a k , a × k} for k ∈ {1, 2, 3}. • ChangeLast: Changes the type of the last QDMR step. This perturbation is applied to steps involving operations over two refer- enced steps. Steps with type {arithmetic, comparison} have their type changed to either {arithmetic, Boolean}. Table 1 shows a comparison step changed to an arithmetic step, involving subtraction. Below it, an arithmetic step is changed to a yes/no question (Boolean). • ReplaceArith: Given an arithmetic step, involving either subtraction or addition, we transform it by flipping its arithmetic operation. • ReplaceBool: Given a Boolean step, verifying whether two statements are correct, we transform it to verify if neither are correct. • ReplaceComp: A comparison step compares two values and returns the high- est or lowest. Given a comparison step, we flip its expression from ‘‘highest’’ to ‘‘lowest’’ and vice versa. 1We fine-tune BART-large for 10 epochs, using a learning rate of 3e−5 with polynomial decay and a batch size of 32. 2https://leaderboard.allenai.org/breakhighlevel/. • PruneStep: We remove one of the QDMR steps. Following step pruning, we prune all 113 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 5 0 1 9 8 7 0 2 2 / / t l a c _ a _ 0 0 4 5 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Perturbation Question QDMR Perturbed QDMR Perturbed Question Append Boolean step Kadeem Jack is a player in a league that started with how many teams? (1) league that Kadeem Jack is a player in; (2) teams that #1 started with; (3) number of #2 Change last step (to arith- metic) Change last step (to Boolean) Replace arith- metic op. Replace Boolean op. Which gallery was foundedfirst,Hughes- Donahue Gallery or Art Euphoric? 
How many years after Madrugada’s final concert did Sunday Driver be- come popular? How many more na- tive Hindi speakers are there compared to native Kannada speakers? Stenocereus Can and Pachypodium both include tree like plants? Replace com- parison op. for Which group is smaller the county according to the census: people or households? Prune step How many people comprised the total adult population of Cunter, excluding seniors? (1) when was Hughes- Donahue Gallery founded; (2) when was Art Eupho- ric founded; (3) which was first of #1, #2 (1) year of Madrugada’s final concert; (2) year when Sunday Driver be- come popular; (3) the difference of #2 and #1 (1) native Hindi speak- ers; (2) native Kannada speakers; (3) number of #1; (4) number of #2; (5) difference of #3 and #4 if Stenocereus in- (1) clude tree like plants; (2) if Pachypodium in- clude treelike plants; (3) if both #1 and #2 are true (1) size of the people group in the county ac- cording to the census; (2) size of households group in the county according to the census; (3) which is smaller of #1, #2 (1) adult population of Cunter; (2) #1 excluding seniors; (3) number of #2 (1) league that Kadeem Jack is a player in; (2) teams that #1 started with; (3) number of #2; (4) if #3 is higher than 2 (1) when was Hughes- Donahue Gallery founded; (2) when was Art Eu- phoric founded; (3) the difference of #1 and #2 (1) year of Madrugada’s final concert; (2) year when Sunday Driver be- come popular; (3) if #1 is the same as #2 (1) native Hindi speak- ers; (2) native Kannada speakers; (3) number of #1; (4) number of #2; (5) sum of #3 and #4 if Stenocereus in- (1) clude tree like plants; (2) if Pachypodium in- clude treelike plants; (3) if both #1 and #2 are false (1) size of the people group in the county ac- cording to the census; (2) size of households group in the county according to the census; (3) which is highest of #1, #2 (1) adult population of Cunter; (2) number of #2 If Kadeem Jack is a player in a league thatstartedwithmore than two teams? How many years af- ter Hughes-Donahue Gallery was founded was Art Euphoric founded? Did Sunday Driver become popular in the same year as Madrugada’s final concert? Of the native Hindi speakers and native Kannada speakers, how many are there in total? Do neither Steno- cereus nor Pachy- podium include tree like plants? According to the census, which group in the county from the county is larger: people house- or holds? How many adult po- pulation does Cunter have? Table 1: The full list of semantic perturbations in BPB. For each perturbation, we provide an example question and its decomposition. We highlight the altered decomposition steps, along with the generated question. other steps that are no longer referenced. We apply only a single PruneStep per d. Table 1 displays ˆd after its second step has been pruned. 3.3 Question Generation At this point (Figure 1, step 3), we parsed q to its decomposition d and altered its steps to produce the perturbed decomposition ˆd. The new ˆd expresses a different reasoning process compared to the original q. Next, we generate the perturbed question ˆq corresponding to ˆd. To this end, we train a QG model, generating questions conditioned on the input QDMR. 
Using the same (cid:4)q, d(cid:5) pairs used to train the QDMR parser (§3.1), we train a separate BART model for mapping d → q.3 An issue with our QG model is that the per- turbed ˆd may be outside the distribution the QG 3We use the same hyperparameters as detailed in §3.1, except the number of epochs, which was set to 15. 114 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 5 0 1 9 8 7 0 2 2 / / t l a c _ a _ 0 0 4 5 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Original question Augmented question How many in- did terceptions Matt Hasselbeck throw? How many touch- downs were there in the first quarter? Are Giuseppe Verdi and Ambroise both Thomas Opera composers? Which singer is younger, Shirley Manson or Jim Kerr? If Matt Hasselbeck throw less than 23 intercep- tions? (AppendBool) If there were two touch- downs in the first quar- ter? (AppendBool) Are neither Giuseppe nor Ambroise Verdi Thomas Opera composers? (ReplaceBool) Which singer is older, Shirley Manson or Jim Kerr? (ReplaceComp) Table 2: Example application of all textual pat- terns used to generate questions qaug (perturbation type highlighted). Boldface indicates the pattern matched in q and the modified part in qaug. Decompositions d and daug omitted for brevity. model was trained on, e.g., applying Append- Bool on questions from DROP results in yes/no questions that do not occur in the original dataset. This can lead to low-quality questions ˆq. To im- prove our QG model, we use simple heuristics to take (cid:4)q, d(cid:5) pairs from BREAK and generate addi- tional pairs (cid:4)qaug, daug(cid:5). Specifically, we define 4 textual patterns, associated with the perturbations, AppendBool, ReplaceBool or Replace- Comp. We automatically generate examples (cid:4)qaug, daug(cid:5) from (cid:4)q, d(cid:5) pairs that match a pat- tern. An example application of all patterns is in Table 2. For example, in AppendBool, the question qaug is inferred with the pattern ‘‘how many . . . did’’. In ReplaceComp, generating qaug is done by identifying the superlative in q and fetching its antonym. Overall, we generate 4,315 examples and train our QG model on the union of BREAK and the augmented data. As QG models have been rapidly improving, we expect future QG models will be able to generate high-quality questions for any decomposition without data augmentation. 3.4 Answer Generation context. Therefore, this part of BPB can be applied to any question, regardless of the context modality. We now describe a RC-specific component for answer generation that uses the textual context. To get complete RC examples, we must com- pute answers to the generated questions (Figure 1, step 4). We take a two-step approach: For some questions, we can compute the answer automati- cally based on the type of applied perturbation. If this fails, we compute the answer by answering each step in the perturbed QDMR ˆd. Answer Generation Methods. Let (cid:4)q, c, a(cid:5) be the original RC example and denote by ˆq the generated question. We use the following per-perturbation rules to generate the new answer ˆa: • AppendBool: The transformed ˆq compares whether the answer a and a numeric value v satisfy a comparison condition. As the values of a and v are given (§3.2), we can com- pute whether the answer is ‘‘yes’’ or ‘‘no’’ directly. 
• ReplaceArith: This perturbation con- verts an answer that is the sum (difference) of numbers to an answer that is the difference (sum). We can often identify the numbers by looking for numbers x, y in the context c such that a = x ± y and flipping the operation: ˆa = |x ∓ y|. To avoid noise, we discard ex- amples for which there is more than one pair of numbers that result in a, and cases where a < 10, as the computation may involve explicit counting rather than an arithmetic computation. • ReplaceBool: This perturbation turns a verification of whether two statements x, y are true, to a verification of whether neither x nor y are true. Therefore, if a is ‘‘yes’’ (i.e., both x, y are true), ˆa must be ‘‘no’’. • ReplaceComp: This perturbation takes a comparison question q that contains two can- didate answers x, y, of which x is the answer a. We parse q with spaCy4 and identify the two answer candidates x, y, and return the one that is not a. We converted the input question into a set of perturbed questions without using the answer or 4https://spacy.io/. 115 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 5 0 1 9 8 7 0 2 2 / / t l a c _ a _ 0 0 4 5 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 development set size # of unique generated perturbations # of generated examples # of covered develop- ment examples % of covered develop- ment examples Avg. contrast set size Avg. # of perturbations per example % of answers generated by the QDMR evaluator # of annotated contrast examples % of valid annotated examples DROP 9,536 65,675 61,231 6,053 63.5 11.1 1.2 HPQA 7,405 10,541 8,488 5,199 IIRC 1,301 3,119 2,450 587 70.2 45.1 2.6 1 5.2 1 5.8 61.8 22.5 1,235 1,325 559 85 89 90.3 Table 3: Generation and annotation statistics for the DROP, HOTPOTQA, and IIRC datasets. When changing the last QDMR step to an arith- metic or Boolean operation (Table 1, rows 2-3), the new answer should be Numeric or Bool- ean, respectively. An example for a Boolean constraint is given in Q4 in Figure 1. When re- placing an arithmetic operation (Table 1, row 4), if an answer that is the sum (difference) of two non-negative numbers is changed to the difference (sum) of these numbers, the new answer must not be greater (smaller) than the original answer. For example, the answer to the question perturbed by ReplaceArith in Table 1 (row 4) should satisfy the ≥ constraint. 4 Generated Evaluation Sets We run BPB on the RC datasets DROP (Dua et al., 2019), HOTPOTQA (Yang et al., 2018), and IIRC (Ferguson et al., 2020). Questions from the train- ing sets of DROP and HOTPOTQA are included in BREAK, and were used to train the decomposition and QG models. Results on IIRC show BPB’s generalization to datasets for which we did not observe (cid:4)q, d(cid:5) pairs. Statistics on the generated contrast and constraint sets are in Table 3, 4, and 5. Contrast Sets. Table 3 shows that BPB suc- cessfully generates thousands of perturbations for each dataset. For the vast majority of perturba- tions, answer generation successfully produced a result—for 61K out of 65K in DROP, 8.5K out of 10.5K in HOTPOTQA, and 2.5K out of 3K in IIRC. Overall, 61K/8.5K examples were created Figure 2: Example execution of the QDMR evaluator. QDMR Evaluator. 
When our heuristics do not apply (e.g., arithmetic computations over more than two numbers, PruneStep, and Change- Last), we use a RC model and the QDMR structure to directly evaluate each step of ˆd and compute ˆa. Recall each QDMR step si is annotated with a logical operation oi (§2). To evaluate ˆd, we go over it step-by-step, and for each step either apply the RC model for op- erations that require querying the context (e.g., selection), or directly compute the output for numerical/set-based operations (e.g., compar- ison). The answer computed for each step is then used for replacing placeholders in subsequent steps. An example is provided in Figure 2. We discard the generated example when the RC model predicted an answer that does not match the expected argument type in a follow- ing step for which the answer is an argument (e.g., when a non-numerical span predicted by the RC model is used as an argument for an arithmetic operation), and when the generated answer has more than 8 words. Also, we discard operations that often produce noisy answers based on manual analysis (e.g., project with a non- numeric answer). For our QDMR evaluator, we fine-tune a ROBERTA-large model with a standard span- extraction output head on SQUAD (Rajpurkar et al., 2016) and BOOLQ (Clark et al., 2019). BOOLQ is included to support yes/no answers. 3.5 Answer Constraint Generation For some perturbations, even if we fail to generate an answer, it is still possible to derive constraints on the answer. Such constraints are valuable, as they indicate cases of model failure. Therefore, in addition to ˆa, we generate four types of answer constraints: Numeric, Boolean, ≥, ≤. 116 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 5 0 1 9 8 7 0 2 2 / / t l a c _ a _ 0 0 4 5 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 AppendBool ChangeLast contrast annotated % valid contrast annotated % valid contrast ReplaceArith annotated % valid contrast ReplaceBool annotated % valid contrast ReplaceComp annotated PruneStep % valid contrast annotated % valid 390 191 97.2 85 69 55.1 DROP HPQA 2,754 56,205 200 254 98 408 200 84.5 – – – 127 127 97.6 362 200 88.5 3,777 399 85.8 79.6 – – – 1,126 245 90.2 3,425 476 82.4 IIRC 1,884 198 98 43 43 76.7 1 1 0 1 1 100 14 14 71.4 507 302 88.4 Table 4: Per-perturbation statistics for generation and annotation of our datasets. Validation results are in bold for perturbations with at least 40 examples. from the development sets of DROP/ HOTPOTQA, respectively, covering 63.5%/70.2% of the devel- opment set. For the held-out dataset IIRC, not used to train the QDMR parser and QG model, BPB created a contrast set of 2.5K examples, which covers almost half of the development set. Table 4 shows the number of generated ex- amples per perturbation. The distribution over perturbations is skewed, with some perturbations (AppendBool) 100x more frequent than others (ReplaceArith). This is because the original distribution over operations is not uniform and each perturbation operates on different decom- positions (e.g., AppendBool can be applied to any question with a numeric answer, while Re- placeComp operates on questions comparing two objects). Constraint Sets. Table 5 shows the number of generated answer constraints for each dataset. The constraint set for DROP is the largest, consist- ing of 3.3K constraints, 8.9% of which covering DROP examples for which we could not generate a contrast set. 
This is due to the examples with arithmetic operations, for which it is easier to gen- erate constraints. The constraint sets of HOTPOTQA and IIRC contain yes/no questions, for which we use the Boolean constraint. Estimating Example Quality To analyze the quality of generated examples, we sampled 117 # of constraints % of constraints that cover examples without a contrast set % of covered develop- ment examples Numeric Boolean ≥ ≤ DROP 3,323 8.9 HPQA 550 26 IIRC 56 21.4 22.5 2,398 – 825 100 7.4 – 549 – 1 4 – 52 1 3 Table 5: Generation of constraints statistics for the DROP, HOTPOTQA, and IIRC datasets. 200-500 examples from each perturbation and dataset (unless fewer than 200 examples were generated) and let crowdworkers validate their correctness. We qualify 5 workers, and estab- lish a feedback protocol where we review work and send feedback after every annotation batch (Nangia et al., 2021). Each generated example was validated by three workers, and is consid- ered valid if approved by the majority. Overall, we observe a Fleiss Kappa (Fleiss, 1971) of 0.71, indicating substantial annotator agreement (Landis and Koch, 1977). Results are in Table 3 and 4. The vast majority of generated examples (≥85%) were marked as valid, showing that BPB produces high-quality examples. Moreover (Table 4), we see variance across perturbations, where some perturbations reach >95% valid examples (AppendBool, Re-
placeBool), while others (ChangeLast) have lower validity. Thus, overall quality can be controlled by choosing specific perturbations.

Manual validation of generated contrast sets is cheaper than authoring contrast sets from scratch: The median validation time per example is 31 seconds, roughly an order of magnitude faster than reported in Gardner et al. (2020). Thus, when a very clean evaluation set is needed, BPB can dramatically reduce the cost of manual annotation.

Error Analysis of the QDMR Parser To study the impact of errors by the QDMR parser on the quality of generated examples, we (the authors) took the examples annotated by crowdworkers, and analyzed the generated QDMRs for 60 examples per perturbation from each dataset: 30 that were marked as valid by crowdworkers, and 30 that were marked as invalid. Specifically, for each example, we checked whether the generated

QDMR faithfully expresses the reasoning path required to answer the question, and compared the quality of QDMRs of valid and invalid examples. For the examples that were marked as valid, we observed that the accuracy of QDMR structures is high: 89.5%, 92.7%, and 91.1% for DROP, HOTPOTQA, and IIRC, respectively. This implies that, overall, our QDMR parser generated faithful and accurate representations for the input questions. Moreover, for examples marked as invalid, the QDMR parser accuracy was lower but still relatively high, with 82.0%, 82.9%, and 75.5% valid QDMRs for DROP, HOTPOTQA, and IIRC, respectively. This suggests that the impact of errors made by the QDMR parser on generated examples is moderate.

5 Experimental Setting

We use the generated contrast and constraint sets to evaluate the performance of strong RC models.

5.1 Models

To evaluate our approach, we examine a suite of models that perform well on current RC benchmarks, and that are diverse in terms of their architecture and the reasoning skills they address:

• TASE (Segal et al., 2020): A ROBERTA model (Liu et al., 2019) with 4 specialized output heads for (a) tag-based multi-span extraction, (b) single-span extraction, (c) signed number combinations, and (d) counting (until 9). TASE obtains near state-of-the-art performance when fine-tuned on DROP.

• UNIFIEDQA (Khashabi et al., 2020b): A text-to-text T5 model (Raffel et al., 2020) that was fine-tuned on multiple QA datasets with different answer formats (yes/no, span, etc.). UNIFIEDQA has demonstrated high performance on a wide range of QA benchmarks.

• READER (Asai et al., 2020): A BERT-based model (Devlin et al., 2019) for RC with two output heads, for answer classification to yes/no/span/no-answer and for span extraction.

We fine-tune two TASE models, one on DROP and another on IIRC, which also requires numerical reasoning. READER is fine-tuned on HOTPOTQA, while separate UNIFIEDQA models are fine-tuned on each of the three datasets. Moreover, we evaluate UNIFIEDQA without fine-tuning, to analyze its generalization to unseen QA distributions. We denote by UNIFIEDQA the model without fine-tuning, and by UNIFIEDQAX the UNIFIEDQA model fine-tuned on dataset X.

We consider a ‘‘pure’’ RC setting, where only the context necessary for answering is given as input. For HOTPOTQA, we feed the model with the two gold paragraphs (without distractors), and for IIRC we concatenate the input paragraph with the gold evidence pieces from other paragraphs.

Overall, we study 6 model-dataset combinations, with 2 models per dataset. For each model, we perform a hyperparameter search and train 3-4 instances with different random seeds, using the best configuration on the development set.

5.2 Evaluation

We evaluate each model in multiple settings: (a) the original development set; (b) the generated contrast set, denoted by CONT; (c) the subset of CONT marked as valid by crowdworkers, denoted by CONTVAL. Notably, CONT and CONTVAL have a different distribution over perturbations. To account for this discrepancy, we also evaluate models on a sample from CONT, denoted by CONTRAND, where sampling is according to the perturbation distribution in CONTVAL. Last, to assess the utility of constraint sets, we enrich the contrast set of each example with its corresponding constraints, denoted by CONT+CONST.

Performance is measured using the standard F1 metric. Moreover, we measure consistency (Gardner et al., 2020), that is, the fraction of examples such that the model predicted the correct answer to the original example as well as to all examples generated for this example. A prediction is considered correct if the F1 score, with respect to the gold answer, is ≥ 0.8. Formally, for a set of evaluation examples S = {⟨q_i, c_i, a_i⟩}_{i=1}^{|S|}:

    consistency(S) = (1 / |S|) ∑_{x ∈ S} g(C(x))

    g(X) = 1 if ∀⟨x̂, â⟩ ∈ X : F1(y(x̂), â) ≥ 0.8, and 0 otherwise
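A minimal sketch of this metric, assuming each contrast set is given as a list of ⟨question, context, answer⟩ triples (the original example included) and predictions come from a predict(question, context) function; the helper names are illustrative, not part of the released codebase:

    from collections import Counter

    def token_f1(prediction: str, gold: str) -> float:
        """Simplified token-overlap F1 between a predicted and a gold answer string."""
        pred_tokens, gold_tokens = prediction.split(), gold.split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    def consistency(contrast_sets, predict, threshold=0.8):
        """Fraction of examples whose entire contrast set is answered with F1 >= threshold."""
        correct = sum(
            all(token_f1(predict(q, c), a) >= threshold for q, c, a in contrast_set)
            for contrast_set in contrast_sets
        )
        return correct / len(contrast_sets)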

Model        DEV F1        CONTVAL F1    CONTRAND F1   CONT F1       CONTVAL Cnst.   CONT Cnst.    CONT+CONST Cnst.
TASEDROP     83.5 ± 0.1    65.9 ± 1      57.3 ± 0.6    54.8 ± 0.4    55.7 ± 1.1      35.7 ± 0.5    33.7 ± 0.3
TASEDROP+    83.7 ± 1.1    75.2 ± 0.5    68 ± 1        66.5 ± 0.5    66.3 ± 0.4      48.9 ± 0.6    45 ± 0.4
TASEIIRC     69.9 ± 0.5    45 ± 5        41.2 ± 3.8    33.7 ± 2.2    23.7 ± 4.7      24.3 ± 5.3    24.3 ± 5.3
TASEIIRC+    68.8 ± 1.3    81.1 ± 4.6    78.2 ± 4.9    72.4 ± 5.7    50.4 ± 3.2      48.2 ± 2.5    48.2 ± 2.5

Table 6: Evaluation results of TASE on DROP and IIRC. For each dataset, we compare the model trained on the original and augmented (marked with +) training data.

Model      DEV F1        CONTVAL F1    CONTRAND F1   CONT F1       CONTVAL Cnst.   CONT Cnst.    CONT+CONST Cnst.
READER     82.2 ± 0.2    58.1 ± 0.1    54.5 ± 0.7    49.9 ± 0.4    39.6 ± 0.6      43.1 ± 0.1    43 ± 0.1
READER+    82.7 ± 0.9    89.1 ± 0.4    86.6 ± 0.6    81.9 ± 0.3    65.6 ± 0.4      56.4 ± 0.4    56.3 ± 0.4

Table 7: Results of READER on HOTPOTQA, when trained on the original and augmented (marked with +) data.

DEV
F1
28.2

CONTVAL CONTRAND

CONT
F1
34.9

CONTVAL
Cnst.
5.3

F1
35.1

F1
38.1
33.9 ± 0.9 28.4 ± 0.8 26.9 ± 0.5

UNIFIEDQA
8.1 ± 3.8 12.2 ± 1.6
UNIFIEDQADROP
UNIFIEDQADROP+ 32.9 ± 1.2 37.9 ± 1.4 35.9 ± 2.5 10.5 ± 4.4 16.9 ± 0.2
UNIFIEDQA
74.7 ± 0.2 60.3 ± 0.8 58.7 ± 0.9 61.9 ± 0.7 35.6 ± 1.1 40.2 ± 0.1
UNIFIEDQAHPQA
UNIFIEDQAHPQA+ 74.1 ± 0.2 60.3 ± 1.9 59.2 ± 1.5 62.3 ± 2.3 36.3 ± 0.7 41.6 ± 0.3
UNIFIEDQA
UNIFIEDQAIIRC
UNIFIEDQAIIRC+

50.2 ± 0.7 45.1 ± 2.1 42.5 ± 2.3 20.4 ± 2.9 24.9 ± 1.2 28.6 ± 0.8
51.7 ± 0.9 62.9 ± 2.9 54.5 ± 3.9 40.8 ± 5.4 30.2 ± 2.7 32.1 ± 1.9

65.2

29.8

44.5

28.1

57.2

36.5

68.2

68.7

61.1

52.9

21.6

CONT
Cnst.
4.4
5.1 ± 0.7
9.6 ± 0.2
38.4

CONT+CONST
Cnst.
2.2
4.4 ± 0.5
8 ± 0.5
37.6
39.9 ± 0.1
41.3 ± 0.4
28.1
28.5 ± 0.8
32.1 ± 1.9

Table 8: Evaluation results of UNIFIEDQA on DROP, HOTPOTQA, and IIRC. We compare UNIFIEDQA without fine-tuning, and after fine-tuning on the original training data and on the augmented training data (marked with +).

where C(x) is the generated contrast set for example x (which includes x),5 and y(x̂) is the model's prediction for example x̂. Constraint satisfaction is measured using a binary 0-1 score.

Because yes/no questions do not exist in DROP, we do not evaluate TASEDROP on AppendBool examples, which have yes/no answers, as we cannot expect the model to answer those correctly.

5.3 Results

Results are presented separately for each model, in Tables 6, 7, and 8. Comparing performance on the development sets (DEV F1) to the corresponding contrast sets (CONT F1), we see a substantial decrease in performance on the generated contrast sets, across all datasets (e.g., 83.5 → 54.8 for TASEDROP, 82.2 → 49.9 for READER, and 50.2 → 20.4 for UNIFIEDQAIIRC). Moreover, model consistency (CONT Cnst.) is considerably lower than the development scores (DEV F1); for example, TASEIIRC obtains a 69.9 F1 score but only 24.3 consistency. This suggests that, overall, the models do not generalize to perturbations in the reasoning path expressed in the original question.

5 With a slight abuse of notation, we overload the definition of C(x) from §2, such that members of C(x) include not just the question and context, but also an answer.

Comparing the results on the contrast sets and their validated subsets (CONT vs. CONTVAL), performance on CONTVAL is better than on CONT (e.g., 58.1 versus 49.9 for READER). These gaps are due to (a) the distribution mismatch between the two sets, and (b) bad example generation. To isolate the effect of bad example generation, we can compare CONTVAL to CONTRAND, which have the same distribution over perturbations, but CONTRAND is not validated by humans. We see that the performance on CONTVAL is typically ≤10% higher than on CONTRAND (e.g., 58.1 vs. 54.5 for READER). Given that performance on the original

development set is dramatically higher, it seems we can currently use automatically generated contrast sets (without verification) to evaluate robustness to reasoning perturbations.

Last, adding constraints to the generated contrast sets (CONT vs. CONT+CONST) often leads to a decrease in model consistency, most notably on DROP, where there are arithmetic constraints and not only answer type constraints. For instance, consistency drops from 35.7 to 33.7 for TASE, and from 5.1 to 4.4 for UNIFIEDQADROP. This shows that the generated constraints expose additional flaws in current models.

5.4 Data Augmentation
Results in §5.3 reveal clear performance gaps
in current QA models. A natural solution is to
augment the training data with examples from
the contrast set distribution, which can be done
effortlessly, since BPB is fully automatic.

We run BPB on the training sets of DROP, HOTPOTQA, and IIRC. As BPB generates many examples, it can shift the original training distribution dramatically. Thus, we limit the number of examples generated by each perturbation by a threshold τ. Specifically, for a training set S with |S| = n examples, we augment S with τ ∗ n randomly generated examples from each perturbation (if fewer than τ ∗ n examples were generated, we add all of them). We experiment with three values τ ∈ {0.03, 0.05, 0.1}, and choose the trained model with the best F1 on the contrast set.
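A minimal sketch of this capping scheme, assuming the generated training examples are already grouped by perturbation type (names are illustrative):

    import random

    def augment_training_set(train_set, generated_by_perturbation, tau, seed=0):
        """Add at most tau * |train_set| generated examples per perturbation type."""
        rng = random.Random(seed)
        cap = int(tau * len(train_set))
        augmented = list(train_set)
        for examples in generated_by_perturbation.values():
            # Sample down to the cap; if fewer examples were generated, add them all.
            sampled = rng.sample(examples, cap) if len(examples) > cap else list(examples)
            augmented.extend(sampled)
        rng.shuffle(augmented)
        return augmented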

Augmentation results are shown in Tables 6–8. Consistency (CONT and CONTVAL) improves dramatically, with only a small change in the model's DEV performance, across all models. We observe an increase in consistency of 13 points for TASEDROP, 24 for TASEIIRC, 13 for READER, and 1-4 points for the UNIFIEDQA models. Interestingly, augmentation is less helpful for UNIFIEDQA than for TASE and READER. We conjecture that this is because UNIFIEDQA was trained on examples from multiple QA datasets and is thus less affected by the augmented data.

Improvement on test examples sampled from the augmented training distribution is expected. To test whether augmented data improves robustness on other distributions, we evaluate TASE+ and UNIFIEDQADROP+ on the DROP contrast set manually collected by Gardner et al. (2020). We find that training on the augmented training set does not lead to a significant change on the manually collected contrast set (F1 of 60.4 → 61.1 for TASE, and 30 → 29.6 for UNIFIEDQADROP). This agrees with findings that data augmentation with respect to a phenomenon may not improve generalization to other out-of-distribution examples (Kaushik et al., 2021; Joshi and He, 2021).
(Kaushik et al., 2021; Joshi and He, 2021).

6 Performance Analysis

Analysis Across Perturbations. We compare model performance on the original (ORIG) and generated examples (CONT and CONTVAL) across perturbations (Figures 3, 4, 5). Starting from models with specialized architectures (TASE and READER), except for ChangeLast (discussed later), models' performance decreases on all perturbations. Specifically, TASE (Figures 3 and 5) demonstrates brittleness to changes in comparison questions (a 10-30 F1 decrease on ReplaceComp) and arithmetic computations (a ∼30 F1 decrease on ReplaceArith). The biggest decrease, of almost 50 points, is on examples generated by PruneStep from DROP (Figure 3), showing that the model struggles to answer intermediate reasoning steps.

READER (Figure 4) shows similar trends to TASE, with a dramatic performance decrease of 80-90 points on yes/no questions created by AppendBool and ReplaceBool. Interestingly, READER obtains high performance on PruneStep examples, as opposed to TASEDROP (Figure 3), which has a similar span extraction head that is required for these examples. This is possibly due to the ‘‘train-easy’’ subset of HOTPOTQA, which includes single-step selection questions.

Moving to the general-purpose UNIFIEDQA models, they perform on PruneStep at least as well as on the original examples, showing their ability to answer simple selection questions. They also demonstrate robustness on ReplaceBool. Yet, they struggle on numeric comparison questions or arithmetic calculations: a ∼65 point decrease on ChangeLast on DROP (Figure 3), a 10-30 F1 decrease on ReplaceComp and AppendBool (Figures 3, 4, 5), and almost 0 F1 on ReplaceArith (Figure 3).

Performance on CONT and CONTVAL. Results on CONTVAL are generally higher than on CONT due to the noise in example generation. However,

Figure 3: Performance on DROP per perturbation: on the generated contrast set (CONT), on the examples from which CONT was generated (ORIG), and on the validated subset of CONT (CONTVAL).

Figure 4: Performance on HOTPOTQA per perturbation: on the generated contrast set (CONT), on the examples from which CONT was generated (ORIG), and the validated subset of CONT (CONTVAL).

whenever results on ORIG are higher than CONT, they are also higher than CONTVAL, showing that the general trend can be inferred from CONT, due to the large performance gap between ORIG and CONT. An exception is ChangeLast in DROP and HOTPOTQA, where performance on CONT is lower than ORIG, but on CONTVAL is higher. This is probably due to the noise in generation, especially for DROP, where example validity is at 55.1% (see Table 4).

Evaluation on Answer Constraints Evaluating whether the model satisfies answer constraints can help assess the model's skills.


Figure 5: Performance on IIRC per perturbation: on the generated contrast set (CONT), on the examples from which CONT was generated (ORIG), and the validated subset of CONT (CONTVAL).

To this end, we measure the fraction of answer constraints satisfied by the predictions of each model (we consider only constraints with more than 50 examples).
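A minimal sketch of such a check, covering the four constraint types from §3.5 (Numeric, Boolean, ≥, ≤); the function is illustrative and assumes the ≥/≤ constraints are evaluated against the original numeric answer:

    def satisfies_constraint(prediction: str, constraint: str, original_answer: float = None) -> bool:
        """Return True if a predicted answer satisfies a single answer constraint."""
        if constraint == "Boolean":
            return prediction.strip().lower() in {"yes", "no"}
        try:
            value = float(prediction.replace(",", ""))
        except ValueError:
            return False  # the remaining constraint types all require a numeric prediction
        if constraint == "Numeric":
            return True
        if constraint == ">=":
            return value >= original_answer
        if constraint == "<=":
            return value <= original_answer
        raise ValueError(f"unknown constraint type: {constraint}")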

Models typically predict the correct answer type; TASEDROP and UNIFIEDQA predict a number for ≥ 86% of the generated numeric questions, and READER and TASEIIRC successfully predict a yes/no answer in ≥ 92% of the cases. However, fine-tuning UNIFIEDQA on HOTPOTQA and IIRC reduces constraint satisfaction (94.7 → 76.3 for UNIFIEDQAHPQA, 65.4 → 38.9 for UNIFIEDQAIIRC), possibly since yes/no questions constitute fewer than 10% of the examples (Yang et al., 2018; Ferguson et al., 2020). Moreover, results on DROP for the constraint ‘≥’ are considerably lower than for ‘≤’ for UNIFIEDQA (83 → 67.4) and UNIFIEDQADROP (81.8 → 65.9), indicating a bias towards predicting small numbers.

7 Related Work

The evaluation crisis in NLU has led to wide interest in challenge sets that evaluate the robustness of models to input perturbations. However, most past approaches (Ribeiro et al., 2020; Gardner et al., 2020; Khashabi et al., 2020a; Kaushik et al., 2020) involve a human-in-the-loop and are thus costly.

Recently, more and more work has considered using meaning representations of language to

automatically generate evaluation sets. Past work used an ERG grammar (Li et al., 2020) and AMR (Rakshit and Flanigan, 2021) to generate relatively shallow perturbations. In parallel to this work, Ross et al. (2021) used control codes over SRL to generate more semantic perturbations to declarative sentences. We generate perturbations at the level of the underlying reasoning process, in the context of QA. Last, Bitton et al. (2021) used scene graphs to generate examples for visual QA. However, they assumed the existence of a gold scene graph at the input. Overall, this body of work represents an exciting new research program, where structured representations are leveraged to test and improve the blind spots of pre-trained language models.

More broadly, interest in automatic creation of evaluation sets that test out-of-distribution generalization has skyrocketed, whether using heuristics (Asai and Hajishirzi, 2020; Wu et al., 2021), data splits (Finegan-Dollak et al., 2018; Keysers et al., 2020), adversarial methods (Alzantot et al., 2018), or an aggregation of the above (Mille et al., 2021; Goel et al., 2021).

Last, QDMR-to-question generation is broadly related to work on text generation from structured data (Nan et al., 2021; Novikova et al., 2017; Shu et al., 2021), and to passage-to-question generation methods (Du et al., 2017; Wang et al., 2020; Duan et al., 2017) that, in contrast to our work, focused on simple questions not requiring reasoning.

8 Discussion

We propose the BPB framework for generating high-quality reasoning-focused question perturbations, and demonstrate its utility for constructing contrast sets and evaluating RC models.

While we focus on RC, our method for perturbing questions is independent of the context modality. Thus, porting our approach to other modalities only requires a method for computing the answer to perturbed questions. Moreover, BPB provides a general-purpose mechanism for question generation, which can be used outside QA as well.

We provide a library of perturbations that is a function of the current abilities of RC models. As future RC models, QDMR parsers, and QG models improve, we can expand this library to support additional semantic phenomena.

Last, we showed that constraint sets are useful for evaluation. Future work can use constraints as a supervision signal, similar to Dua et al. (2021), who leveraged dependencies between training examples to enhance model performance.

Limitations BPB represents questions with QDMR, which is geared towards representing complex factoid questions that involve multiple reasoning steps. Thus, BPB cannot be used when questions involve a single step; for example, one cannot use BPB to perturb ‘‘Where was Barack Obama born?’’. Inherently, the effectiveness of our pipeline approach depends on the performance of its modules: the QDMR parser, the QG model, and the single-hop RC model used for QDMR evaluation. However, our results suggest that current models already yield high-quality examples, and model performance is expected to improve over time.

Acknowledgments

We thank Yuxiang Wu, Itay Levy, and Inbar Oren for the helpful feedback and suggestions. This research was supported in part by The Yandex Initiative for Machine Learning, and The European Research Council (ERC) under the European Union Horizons 2020 research and innovation programme (grant ERC DELPHI 802800). This work was completed in partial fulfillment for the Ph.D. degree of Mor Geva.

References

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2890–2896, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1316

Akari Asai and Hannaneh Hajishirzi. 2020.
Logic-guided data augmentation and regulari-
zation for consistent question answering. In
Proceedings of
the 58th Annual Meeting
of the Association for Computational Linguis-
tic, pages 5642–5650, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2020.acl-main.499

Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning to retrieve reasoning paths over Wikipedia graph for question answering. In International Conference on Learning Representations.

Yonatan Bitton, Gabriel Stanovsky, Roy Schwartz, and Michael Elhadad. 2021. Automatic generation of contrast sets from scene graphs: Probing the compositional consistency of GQA. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 94–105, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.9

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL), pages 4171–4186, Minneapolis, Minnesota.

Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1342–1352, Vancouver, Canada. Association for Computational Linguistics.

Dheeru Dua, Pradeep Dasigi, Sameer Singh,
and Matt Gardner. 2021. Learning with in-
stance bundles for
reading comprehension.
arXiv preprint arXiv:2104.08735.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi,
Gabriel Stanovsky, Sameer Singh, and Matt
Gardner. 2019. DROP: A reading comprehen-
sion benchmark requiring discrete reasoning
over paragraphs. In North American Chapter of

the Association for Computational Linguistics
(NAACL).

Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. 2017. Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 866–874, Copenhagen, Denmark. Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1090

James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi. 2020. IIRC: A dataset of incomplete information reading comprehension questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1137–1147, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.86

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-SQL evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1033

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378. https://doi.org/10.1037/h0031619

Matt Gardner, Yoav Artzi, Victoria Basmov,
Jonathan Berant, Ben Bogin, Sihao Chen,
Pradeep Dasigi, Dheeru Dua, Yanai Elazar,
Ananth Gottumukkala, Nitish Gupta, Hannaneh
Hajishirzi, Gabriel Ilharco, Daniel Khashabi,
Kevin Lin, Jiangming Liu, Nelson F. Liu,
Phoebe Mulcaire, Qiang Ning, Sameer Singh,
Noah A. Smith, Sanjay Subramanian, Reut
Tsarfaty, Eric Wallace, Ally Zhang, and Ben
Zhou. 2020. Evaluating models’ local deci-
sion boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.117

Karan Goel, Nazneen Fatema Rajani, Jesse Vig, Zachary Taschdjian, Mohit Bansal, and Christopher Ré. 2021. Robustness gym: Unifying the NLP evaluation landscape. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, pages 42–55, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-demos.6

Robin Jia and Percy Liang. 2017. Adversarial ex-
amples for evaluating reading comprehension
systems. In Empirical Methods in Natural Lan-
guage Processing (EMNLP). https://doi
.org/10.18653/v1/D17-1215

Nitish Joshi and He He. 2021. An investigation of
IL (In) effectiveness of counterfactually aug-
mented data. arXiv preprint arXiv:2107.00753.

Divyansh Kaushik, Eduard Hovy, and Zachary
Lipton. 2020. Learning the difference that
makes a difference with counterfactually-
augmented data. In International Conference
sulle rappresentazioni dell'apprendimento.

Divyansh Kaushik, Douwe Kiela, Zachary C.
Lipton, and Wen-tau Yih. 2021. On the
efficacy of adversarial data collection for ques-
tion answering: Results from a large-scale
randomized study. In Association for Com-
putational Linguistics and International Joint
Conference on Natural Language Process-
ing (ACL-IJCNLP). https://doi.org/10
.18653/v1/2021.acl-long.517

Daniel Keysers, Nathanael Sch¨arli, Nathan Scales,
Hylke Buisman, Daniel Furrer, Sergii Kashubin,
Nikola Momchev, Danila Sinopalnikov, Lukasz
Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao
Wang, Marc van Zee, and Olivier Bousquet.
2020. Measuring compositional generaliza-
zione: A comprehensive method on realistic
dati. In International Conference on Learning
Representations.

Daniel Khashabi, Tushar Khot, and Ashish Sabharwal. 2020a. More bang for your buck: Natural perturbation for robust question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 163–170, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.12

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020b. UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896–1907, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.171

J. Richard Landis and Gary G. Koch. 1977.
The measurement of observer agreement for
categorical data. Biometrics, 33(1):159–174.
https://doi.org/10.2307/2529310,
PubMed: 843571

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.703

Chuanrong Li, Lin Shengshuo, Zeyu Liu, Xinyi Wu, Xuhui Zhou, and Shane Steinert-Threlkeld. 2020. Linguistically-informed transformations (LIT): A method for automatically generating contrast sets. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 126–135, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly opti-
mized BERT pretraining approach. arXiv preprint
arXiv:1907.11692.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019.
Right for the wrong reasons: Diagnosing syn-
tactic heuristics in natural language inference.

In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 3428–3448, Florence, Italy. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/P19-1334

Simon Mille, Kaustubh Dhole, Saad Mahamood,
Laura Perez-Beltrachini, Varun Gangal, Mihir
Kale, Emiel van Miltenburg, and Sebastian
Gehrmann. 2021. Automatic construction of
evaluation suites for natural language generation datasets. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).

Aakanksha Naik, Abhilasha Ravichander, Norman
Sadeh, Carolyn Rose, and Graham Neubig.
2018. Stress test evaluation for natural lan-
guage inference. In Proceedings of the 27th
International Conference on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Linyong Nan, Dragomir Radev, Rui Zhang, Amrit
Rau, Abhinand Sivaprasad, Chiachun Hsieh,
Xiangru Tang, Aadit Vyas, Neha Verma,
Pranav Krishna, Yangxiaokang Liu, Nadia
Irwanto, Jessica Pan, Faiaz Rahman, Ahmad
Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit
Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin,
Caiming Xiong, Riccardo Socher, and Nazneen
Fatema Rajani. 2021. DART: Open-domain
structured data record to text generation. In
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 432–447, Online. Association for Computational Linguistics.

Nikita Nangia, Saku Sugawara, Harsh Trivedi,
Alex Warstadt, Clara Vania, and Samuel
R. Bowman. 2021. What ingredients make for an effective crowdsourcing protocol for difficult NLU data collection tasks? In
Proceedings of the 59th Annual Meeting of
the Association for Computational Linguistics
and the 11th International Joint Conference
on Natural Language Processing (Volume 1:
Documenti lunghi), pages 1221–1235, Online.
Associazione per la Linguistica Computazionale.
https://doi.org/10.18653/v1/2021
.acl-long.98

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, Saarbrücken, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-5525

Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).

Geetanjali Rakshit and Jeffrey Flanigan. 2021.
ASQ: Automatically generating question-
answer pairs using AMRs. arXiv preprint
arXiv:2105.10023.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos
Guestrin, and Sameer Singh. 2020. Beyond
accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.442

Alexis Ross, Tongshuang Wu, Hao Peng, Matthew
E. Peters, and Matt Gardner. 2021. Tailor:
Generating and perturbing text with semantic
controls. arXiv preprint arXiv:2107.07150.

Elad Segal, Avia Efrat, Mor Shoham, Amir
Globerson, and Jonathan Berant. 2020. A simple and effective model for answering multi-span questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3074–3080, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.248

Chang Shu, Yusen Zhang, Xiangyu Dong,
Peng Shi, Tao Yu, and Rui Zhang. 2021.

Logic-consistency text generation from semantic parses. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4414–4426, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.388

Siyuan Wang, Zhongyu Wei, Zhihao Fan,
Zengfeng Huang, Weijian Sun, Qi Zhang,
and Xuanjing Huang. 2020. PathQG: Neural
question generation from facts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9066–9075, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.729

Tomer Wolfson, Mor Geva, Ankit Gupta, Matt
Gardner, Yoav Goldberg, Daniel Deutch,
and Jonathan Berant. 2020. BREAK it
down: A question understanding benchmark. Transactions of the Association for Computational Linguistics (TACL), 8:183–198. https://doi.org/10.1162/tacl_a_00309

Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey
Heer, and Daniel Weld. 2021. Polyjuice: Gener-
ating counterfactuals for explaining, evaluating, and improving models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6707–6723, Online. Association for Computational Linguistics.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua
Bengio, William Cohen, Ruslan Salakhutdinov,
and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Empirical Methods in Natural Language Processing (EMNLP). https://doi.org/10.18653/v1/D18-1259
