Did Aristotle Use a Laptop?
A Question Answering Benchmark with Implicit Reasoning Strategies
Mor Geva1,2, Daniel Khashabi2, Elad Segal1, Tushar Khot2,
Dan Roth3, Jonathan Berant1,2
1Tel Aviv University
2Allen Institute for AI
3University of Pennsylvania
morgeva@mail.tau.ac.il, {danielk,tushark}@allenai.org,
elad.segal@gmail.com, danroth@seas.upenn.edu
joberant@cs.tau.ac.il
Abstract
A key limitation in current datasets for multi-hop reasoning is that the required steps for answering the question are mentioned in it explicitly. In this work, we introduce STRATEGYQA, a question answering (QA) benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. A fundamental challenge in this setup is how to elicit such creative questions from crowdsourcing workers, while covering a broad range of potential strategies. We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts. Moreover, we annotate each question with (1) a decomposition into reasoning steps for answering it, and (2) Wikipedia paragraphs that contain the answers to each step. Overall, STRATEGYQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs. Analysis shows that questions in STRATEGYQA are short, topic-diverse, and cover a wide range of strategies. Empirically, we show that humans perform well (87%) on this task, while our best baseline reaches an accuracy of ∼66%.
1 Introduction
Developing models that successfully reason over multiple parts of their input has attracted substantial attention recently, leading to the creation of many multi-step reasoning Question Answering (QA) benchmarks (Welbl et al., 2018; Talmor and Berant, 2018; Khashabi et al., 2018; Yang et al., 2018; Dua et al., 2019; Suhr et al., 2019).
Commonly, the language of questions in such benchmarks explicitly describes the process for deriving the answer. For instance (Figure 1, Q2), the question Was Aristotle alive when the laptop was invented? explicitly specifies the required reasoning steps. However, in real-life questions, reasoning is often implicit. For example, the question Did Aristotle use a laptop? (Q1) can be answered using the same steps, but the model must infer the strategy for answering the question (temporal comparison, in this case).
Answering implicit questions poses several challenges compared to answering their explicit counterparts. First, retrieving the context is difficult as there is little overlap between the question and its context (Figure 1, Q1 and ‘E’). Moreover, questions tend to be short, lowering the possibility of the model exploiting shortcuts in the language of the question. In this work, we introduce STRATEGYQA, a Boolean QA benchmark focusing on implicit multi-hop reasoning for strategy questions, where a strategy is the ability to infer from a question its atomic sub-questions. In contrast to previous benchmarks (Khot et al., 2020a; Yang et al., 2018), questions in STRATEGYQA are not limited to predefined decomposition patterns and cover a wide range of strategies that humans apply when answering questions.
Eliciting strategy questions using crowdsourcing is non-trivial. First, authoring such questions requires creativity. Past work often collected multi-hop questions by showing workers an entire context, which led to limited creativity and high lexical overlap between questions and contexts and consequently to reasoning shortcuts (Khot et al., 2020a; Yang et al., 2018). An alternative approach, applied in Natural Questions
(Kwiatkowski et al., 2019) and MS-MARCO (Nguyen et al., 2016), overcomes this by collecting real user questions. However, can we elicit creative questions independently of the context and without access to users?

Figure 1: Questions in STRATEGYQA (Q1) require implicit decomposition into reasoning steps (D), for which we annotate supporting evidence from Wikipedia (E). This is in contrast to multi-step questions that explicitly specify the reasoning process (Q2).
Second, an important property in STRATEGYQA
is that questions entail diverse strategies. While
the example in Figure 1 necessitates temporal
reasoning,
there are many possible strategies
for answering questions (Table 1). We want
a benchmark that exposes a broad range of
strategies. But crowdsourcing workers often use
repetitive patterns, which may limit question
diversity.
To overcome these difficulties, we use the
following techniques in our pipeline for eliciting
strategy questions: (a) we prime crowd workers
with random Wikipedia terms that serve as a
minimal context to inspire their imagination and
increase their creativity; (b) we use a large set of
annotators to increase question diversity, limiting
the number of questions a single annotator can
write; and (c) we continuously train adversarial
models during data collection, slowly increasing
the difficulty in question writing and preventing
recurring patterns (Bartolo et al., 2020).
Beyond the questions, as part of STRATEGYQA,
we annotated: (a) question decompositions: a
sequence of steps sufficient for answering the
question (‘D’ in Figure 1), and (b) evidence
paragraphs: Wikipedia paragraphs that contain
the answer to each decomposition step (‘E’ in
Figure 1). STRATEGYQA is the first QA dataset to
provide decompositions and evidence annotations
for each individual step of the reasoning process.
Our analysis shows that STRATEGYQA necessitates reasoning on a wide variety of knowledge domains (physics, geography, etc.) and logical operations (e.g., number comparison). Moreover, experiments show that STRATEGYQA poses a combined challenge of retrieval and QA, and while humans perform well on these questions, even strong systems struggle to answer them.

In summary, the contributions of this work are:
1. Defining strategy questions: a class of questions requiring implicit multi-step reasoning.

2. STRATEGYQA, the first benchmark for implicit multi-step QA that covers a diverse set of reasoning skills. STRATEGYQA consists of 2,780 questions, annotated with their decomposition and per-step evidence.
3. A novel annotation pipeline designed to
elicit quality strategy questions, with minimal
context for priming workers.
The dataset and codebase are publicly available at
https://allenai.org/data/strategyqa.
2 Strategy Questions
2.1 Desiderata
We define strategy questions by characterizing
their desired properties. Some properties, such as
whether the question is answerable, also depend
on the context used for answering the question.
In this work, we assume this context is a corpus
of documents, specifically, Wikipedia, which we
assume provides correct content.
Multi-step Strategy questions are multi-step
questions, that is, they comprise a sequence of
single-step questions. A single-step question is
either (a) a question that can be answered from
a short text fragment in the corpus (e.g., steps
1 and 2 in Figure 1), or (b) a logical operation
over answers from previous steps (e.g., step 3 in
Figure 1). A strategy question should have at least
two steps for deriving the answer. Example multi-
and single-step questions are provided in Table 2.
We define the reasoning process structure in §2.2.
Feasible Questions should be answerable from paragraphs in the corpus.
Question: Can one spot helium? (No)
Implicit facts: Helium is a gas, Helium is odorless, Helium is tasteless, Helium has no color.

Question: Would Hades and Osiris hypothetically compete for real estate in the Underworld? (Yes)
Implicit facts: Hades was the Greek god of death and the Underworld. Osiris was the Egyptian god of the Underworld.

Question: Would a monocle be appropriate for a cyclop? (Yes)
Implicit facts: Cyclops have one eye. A monocle helps one eye at a time.

Question: Should a finished website have lorem ipsum paragraphs? (No)
Implicit facts: Lorem Ipsum paragraphs are meant to be temporary. Web designers always remove lorem ipsum paragraphs before launch.

Question: Is it normal to find parsley in multiple sections of the grocery store? (Yes)
Implicit facts: Parsley is available in both fresh and dry forms. Fresh parsley must be kept cool. Dry parsley is a shelf stable product.

Table 1: Example strategy questions and the implicit facts needed for answering them.

Question: Was Barack Obama born in the United States? (Yes) [MS: no, IM: no]
Explanation: The question explicitly states the required information for the answer–the birth place of Barack Obama. The answer is likely to be found in a single text fragment in Wikipedia.

Question: Do cars use drinking water to power their engine? (No) [MS: no, IM: no]
Explanation: The question explicitly states the required information for the answer–the liquid used to power car engines. The answer is likely to be found in a single text fragment in Wikipedia.

Question: Are sharks faster than crabs? (Yes) [MS: yes, IM: no]
Explanation: The question explicitly states the required reasoning steps: 1) How fast are sharks? 2) How fast are crabs? 3) Is #1 faster than #2?

Question: Was Tom Cruise married to the female star of Inland Empire? (No) [MS: yes, IM: no]
Explanation: The question explicitly states the required reasoning steps: 1) Who is the female star of Inland Empire? 2) Was Tom Cruise married to #1?

Question: Are more watermelons grown in Texas than in Antarctica? (Yes) [MS: yes, IM: yes]
Explanation: The answer can be derived through geographical/botanical reasoning that the climate in Antarctica does not support growth of watermelons.

Question: Would someone with a nosebleed benefit from Coca? (Yes) [MS: yes, IM: yes]
Explanation: The answer can be derived through biological reasoning that Coca constricts blood vessels, and therefore, serves to stop bleeding.

Table 2: Example questions demonstrating the multi-step (MS) and implicit (IM) properties of strategy questions.
Specifically, for each reasoning step in the sequence, there should be sufficient evidence from the corpus to answer the question. For example, the answer to the question Would a monocle be appropriate for a cyclop? can be derived from paragraphs stating that cyclops have one eye and that a monocle is used by one eye at a time. This information is found in our corpus, Wikipedia, and thus the question is feasible. In contrast, the question Does Justin Bieber own a Zune? is not feasible, because answering it requires going through Bieber's belongings, and this information is unlikely to be found in Wikipedia.
Implicit A key property distinguishing strategy questions from prior multi-hop questions is their implicit nature. In explicit questions, each step in the reasoning process can be inferred from the language of the question directly. For example, in Q2 (Figure 1), the first two questions are explicitly stated, one in the main clause and one in the adverbial clause. Conversely,
reasoning steps in strategy questions require
going beyond the language of the question. Due
to language variability, a precise definition of
implicit questions based on lexical overlap is
elusive, but a good rule-of-thumb is the following:
If the question decomposition can be written with
a vocabulary limited to words from the questions,
their inflections, and function words, then it is
an explicit question. If new content words must
be introduced to describe the reasoning process,
the question is implicit. Examples for implicit and
explicit questions are in Table 2.
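To make this rule of thumb concrete, here is a minimal sketch of such a vocabulary check; the tokenizer, the function-word list, and the handling of inflections (ignored here) are illustrative choices, not the authors' procedure.

```python
import re

# Illustrative function-word list; a standard stopword list could be used instead.
FUNCTION_WORDS = {"is", "are", "was", "were", "did", "do", "does", "the", "a", "an",
                  "of", "in", "and", "or", "to", "when", "before", "after", "than",
                  "#1", "#2", "#3"}

def tokens(text):
    return set(re.findall(r"[a-z0-9#]+", text.lower()))

def is_explicit(question, decomposition):
    """Rule of thumb from Section 2.1: the question is explicit if the decomposition
    introduces no content words beyond the question's own vocabulary."""
    allowed = tokens(question) | FUNCTION_WORDS
    new_content_words = set().union(*(tokens(step) for step in decomposition)) - allowed
    return not new_content_words

decomposition = ["When was Aristotle alive?",
                 "When was the laptop invented?",
                 "Is #1 before #2?"]
print(is_explicit("Was Aristotle alive when the laptop was invented?", decomposition))  # True
print(is_explicit("Did Aristotle use a laptop?", decomposition))                        # False
```

Under this crude check, Q2 from Figure 1 comes out explicit while Q1 comes out implicit, since the decomposition introduces content words such as "alive" and "invented" that Q1 does not contain.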
Definite We wish to avoid non-definitive questions, such as Are hamburgers considered a sandwich? and Does chocolate taste better than vanilla?, for which there is no clear answer. We would like to collect questions where the answer is definitive or, at least, very likely, based on the corpus. For example, consider the question Does wood conduct electricity? Although it is possible that damp wood will conduct electricity, the answer is generally no.
To summarize, strategy questions are multi-
step questions with implicit reasoning (a strategy)
and a definitive answer that can be reached given
a corpus. We limit ourselves to Boolean yes/no
questions, which limits the output space, but lets us
focus on the complexity of the questions, which is
the key contribution. Example strategy questions
are in Table 1, and examples that demonstrate the
mentioned properties are in Table 2. Next (§2.2),
we describe additional structures annotated during
data collection.
2.2 Decomposing Strategy Questions
Strategy questions involve complex reasoning that
leads to a yes/no answer. To guide and evaluate
the QA process, we annotate every example with
a description of the expected reasoning process.
Prior work used rationales or supporting facts,
namely, text snippets extracted from the context
(DeYoung et al., 2020; Yang et al., 2018;
Kwiatkowski et al., 2019; Khot et al., 2020a) as
evidence for an answer. However, reasoning can
rely on elements that are not explicitly expressed
in the context. Moreover, answering a question
based on relevant context does not imply that
the model performs reasoning properly (Jiang and
Bansal, 2019).
Question: Did the Battle of Peleliu or the Seven Days Battles last longer?
Decomposition: (1) How long did the Battle of Peleliu last? (2) How long did the Seven Days Battle last? (3) Which is longer of #1, #2?

Question: Can the President of Mexico vote in New Mexico primaries?
Decomposition: (1) What is the citizenship requirement for voting in New Mexico? (2) What is the citizenship requirement of any President of Mexico? (3) Is #2 the same as #1?

Question: Can a microwave melt a Toyota Prius battery?
Decomposition: (1) What kind of battery does a Toyota Prius use? (2) What type of material is #1 made out of? (3) What is the melting point of #2? (4) Can a microwave’s temperature reach at least #3?

Question: Would it be common to find a penguin in Miami?
Decomposition: (1) Where is a typical penguin’s natural habitat? (2) What conditions make #1 suitable for penguins? (3) Are all of #2 present in Miami?

Table 3: Explicit (row 1) and strategy (rows 2–4) question decompositions. We mark words that are explicit (italic) or implicit in the input (bold).
Inspired by recent work (Wolfson et al.,
2020), we associate every question-answer pair
with a strategy question decomposition. A
decomposition of a question q is a sequence of n
steps ⟨s(1), s(2), . . . , s(n)⟩ required for computing
the answer to q. Each step s(i) corresponds to
a single-step question and may include special
references, which are placeholders
referring
to the result of a previous step s(j). The
last decomposition step (i.e., s(n)) returns the
final answer to the question. Table 3 shows
decomposition examples.
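For illustration, a decomposition and its references can be pictured as in the following sketch; the field names and paragraph identifiers are hypothetical and do not necessarily match the released data files, and the steps paraphrase Figure 1.

```python
import re

# A minimal sketch of one STRATEGYQA example; field names are illustrative,
# not necessarily the schema of the released dataset.
example = {
    "question": "Did Aristotle use a laptop?",
    "answer": False,
    "decomposition": [
        "When did Aristotle live?",          # retrieval step
        "When was the laptop invented?",     # retrieval step
        "Is #2 before #1?",                  # operation over previous answers
    ],
    # Evidence matched per step (hypothetical Wikipedia paragraph ids).
    "evidence_per_step": [["Aristotle-para-3"], ["Laptop-para-1"], ["operation"]],
}

def referenced_steps(step):
    """Return the indices of earlier steps that a step refers to via '#k' placeholders."""
    return [int(m) for m in re.findall(r"#(\d+)", step)]

print(referenced_steps(example["decomposition"][2]))  # [2, 1]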
Wolfson et al. (2020) targeted explicit multi-
step questions (first row in Table 3), where the
decomposition is restricted to a small vocabulary
derived almost entirely from the original question.
Conversely, decomposing strategy questions
requires using implicit knowledge, and thus decompositions can include any token that is needed for describing the implicit reasoning (rows 2–4 in Table 3). This makes the decomposition task significantly harder for strategy questions.

Figure 2: Overview of the data collection pipeline. First (CQW, §3.1), a worker is presented with a term (T) and an expected answer (A) and writes a question (Q) and the facts (F1, F2) required to answer it. Next, the question is decomposed (SQD, §3.2) into steps (S1, S2) along with Wikipedia page titles (P1, P2) that the worker expects to find the answer in. Last (EVM, §3.3), decomposition steps are matched with evidence from Wikipedia (E1, E2).
In this work, we distinguish between two types
of required actions for executing a step. Retrieval,
a step that requires retrieval from the corpus,
and operation, a logical function over answers to
previous steps. In the second row of Table 3, the
first two steps are retrieval steps, and the last step
is an operation. A decomposition step can require
both retrieval and an operation (see last row in
Table 3).
To verify that steps are valid single-step
questions that can be answered using the corpus
(Wikipedia), we collect supporting evidence for
each retrieval step and annotate operation steps.
A supporting evidence is one or more paragraphs
that provide an answer to the retrieval step.
In summary, each example in our dataset contains a) a strategy question, b) the strategy question decomposition, and c) supporting evidence per decomposition step. Collecting strategy questions and their annotations is the main challenge of this work, and we turn to this next.
3 Data Collection Pipeline
Our goal is to establish a procedure for collecting strategy questions and their annotations at scale. To this end, we build a multi-step crowdsourcing1 pipeline designed for encouraging worker creativity, while preventing biases in the data.
1We use Amazon Mechanical Turk as our framework.
We break the data collection into three tasks:
question writing (§3.1), question decomposition
(§3.2), and evidence matching (§3.3). In addition,
we implement mechanisms for quality assurance
(§3.4). An overview of the data collection pipeline
is in Figure 2.
3.1 Creative Question Writing (CQW)
Generating natural language annotations through crowdsourcing (e.g., question generation) is known to suffer from several shortcomings. First, when annotators generate many instances, they use recurring patterns that lead to biases in the data (Gururangan et al., 2018; Geva et al., 2019). Second, when language is generated conditioned on a long context, such as a paragraph, annotators use similar language (Kwiatkowski et al., 2019), leading to high lexical overlap and hence, inadvertently, to an easier problem. Moreover, a unique property of our setup is that we wish to cover a broad and diverse set of strategies. Thus, we must discourage repeated use of the same strategy.
We tackle these challenges on multiple fronts.
First, rather than using a long paragraph as
context, we prime workers to write questions
given single terms from Wikipedia, reducing the
overlap with the context to a minimum. Second,
to encourage diversity, we control the population
of annotators, making sure a large number of
annotators contribute to the dataset. Third, we use
model-in-the-loop adversarial annotations (Dua
et al., 2019; Khot et al., 2020a; Bartolo et al., 2020)
to filter our questions, and only accept questions
that fool our models. While some model-in-the-
loop approaches use fixed pre-trained models
to eliminate ‘‘easy’’ questions, we continuously
update the models during data collection to combat
the use of repeated patterns or strategies.
We now provide a description of the task, and
elaborate on these methods (Figure 2, upper row).
Task description Given a term (e.g., silk), a
description of the term, and an expected answer
(yes or no), the task is to write a strategy question
about the term with the expected answer, and the
facts required to answer the question.
Priming with Wikipedia Terms Writing strategy questions from scratch is difficult. To inspire worker creativity, we ask workers to write questions about terms they are familiar with or can easily understand. The terms are titles of ‘‘popular’’2 Wikipedia pages. We provide workers only with a short description of the given term. Then, workers use their background knowledge and Web search skills to form a strategy question.
Controlling the Answer Distribution We ask
workers to write questions where the answer is
set to be ‘yes’ or ‘no’. To balance the answer
distribution, the expected answer is dynamically
sampled inversely proportional to the ratio of ‘yes’
and ‘no’ questions collected until that point.
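A minimal sketch of one way to implement this inverse-proportional sampling (the paper does not specify the exact scheme beyond the description above):

```python
import random

def sample_expected_answer(num_yes, num_no):
    """Sample 'yes'/'no' with probability inversely proportional to how often
    each answer has been collected so far, so the dataset stays balanced."""
    total = num_yes + num_no
    if total == 0:
        return random.choice(["yes", "no"])
    p_yes = num_no / total  # few 'no' questions so far -> low chance of another 'yes'
    return "yes" if random.random() < p_yes else "no"

print(sample_expected_answer(num_yes=60, num_no=40))  # 'yes' with probability 0.4
```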
Model-in-the-Loop Filtering To ensure ques-
tions are challenging and reduce recurring
language and reasoning patterns, questions are
only accepted when verified by two sets of online
solvers. We deploy a set of 5 pre-trained models
(termed PTD) that check if the question is too easy.
If at least 4 out of 5 answer the question correctly,
it is rejected. Second, we use a set of 3 models
(called FNTD) that are continuously fine-tuned on
our collected data and are meant to detect biases
in the current question set. A question is rejected
if all 3 solvers answer it correctly. The solvers are
ROBERTA (Liu et al., 2019) models fine-tuned on
different auxiliary datasets; details in §5.1.
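The acceptance rule can be summarized as the following sketch; the solvers themselves are abstracted away and only the stated 4-of-5 and 3-of-3 thresholds are encoded.

```python
def accept_question(ptd_correct, fntd_correct):
    """ptd_correct / fntd_correct hold, per solver, whether it answered the
    candidate question correctly. A question is rejected if it is too easy for
    the pre-trained (PTD) solvers or for the continuously fine-tuned (FNTD) solvers."""
    if sum(ptd_correct) >= 4:   # at least 4 of the 5 PTD solvers answered correctly
        return False
    if all(fntd_correct):       # all 3 FNTD solvers answered correctly
        return False
    return True

print(accept_question([True, True, True, False, False], [True, True, False]))  # True
print(accept_question([True] * 5, [False] * 3))                                # False
```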
Auxiliary Sub-Task We ask workers to provide
the facts required to answer the question they have
written, for several reasons: 1) it helps workers
frame the question writing task and describe the
reasoning process they have in mind, 2) it helps
reviewing their work, and 3) it provides useful
information for the decomposition step (§3.2).
2We filter pages based on the number of contributors and the number of backward links from other pages.
3.2 Strategy Question Decomposition (SQD)
Once a question and the corresponding facts are written, we generate the strategy question decomposition (Figure 2, middle row). We annotate decompositions before matching evidence in order to avoid biases stemming from seeing the context.
The decomposition strategy for a question is
not always obvious, which can lead to undesirable
explicit decompositions. For example, a possible
explicit decomposition for Q1 (Figure 1) might
be (1) What items did Aristotle use? (2) Is laptop
in #1?; but the first step is not feasible. To guide
the decomposition, we provide workers with the
facts written in the CQW task to show the strategy
of the question author. Evidently, there can be
many valid strategies and the same strategy can
be phrased in multiple ways—the facts only serve
as a soft guidance.
Task Description Given a strategy question, a
yes/no answer, and a set of facts, the task is to
write the steps needed to answer the question.
Auxiliary Sub-task We observe that in some
cases, annotators write explicit decompositions,
which often lead to infeasible steps that cannot be
answered from the corpus. To help workers avoid
explicit decompositions, we ask them to specify,
for each decomposition step, a Wikipedia page
they expect to find the answer in. This encourages
workers to write decomposition steps for which it
is possible to find answers in Wikipedia, and leads
to feasible strategy decompositions, with only a
small overhead (the workers are not required to
read the proposed Wikipedia page).
3.3 Evidence Matching (EVM)
We now have a question and its decomposition.
To ground them in context, we add a third task of
evidence matching (Figure 2, bottom row).
Task Description Given a question and its
decomposition (a list of single-step questions), the
task is to find evidence paragraphs on Wikipedia
for each retrieval step. Operation steps that do not
require retrieval (§2.2) are marked as operation.
Controlling the Matched Context Workers
search for evidence on Wikipedia. We index
Wikipedia3 and provide a search interface where
workers can drag-and-drop paragraphs from the
results shown on the search interface. This guarantees that annotators choose paragraphs we included in our index, at a pre-determined paragraph-level granularity.

3We use the Wikipedia Cirrus dump from 11/05/2020.
3.4 Data Verification Mechanisms
Task Qualifications For each task, we hold
qualifications that test understanding of the task,
and manually review several examples. Workers
who follow the requirements are granted access
to our tasks. Our qualifications are open to workers from English-speaking countries who have high reputation scores. Additionally, the authors regularly review annotations to give feedback and prevent noisy annotations.
Real-time Automatic Checks For CQW, we
use heuristics to check question validity, for
example, whether it ends with a question mark,
and that it doesn’t use language that characterizes
explicit multi-hop questions (for instance, having
multiple verbs). For SQD, we check that
the
decomposition structure forms a directed acyclic
graph, that
is: (i) each decomposition step is
referenced by (at least) one of the following steps,
such that all steps are reachable from the last step;
and (ii) steps don’t form a cycle. In the EVM task,
a warning message is shown when the worker
marks an intermediate step as an operation (an
unlikely scenario).
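A sketch of the SQD structural check described above, assuming (as in Table 3) that steps are strings whose references appear as '#k' placeholders; the CQW heuristics are not included.

```python
import re

def valid_decomposition(steps):
    """Check the SQD constraints: references point only to earlier steps (which
    also rules out cycles), and all steps are reachable from the last step."""
    n = len(steps)
    refs = [set(int(m) for m in re.findall(r"#(\d+)", s)) for s in steps]

    # References must point to existing, earlier steps.
    for i, r in enumerate(refs, start=1):
        if any(j < 1 or j >= i for j in r):
            return False

    # Every step must be reachable from the last step by following references.
    reachable, stack = {n}, [n]
    while stack:
        i = stack.pop()
        for j in refs[i - 1]:
            if j not in reachable:
                reachable.add(j)
                stack.append(j)
    return reachable == set(range(1, n + 1))

steps = ["How long did the Battle of Peleliu last?",
         "How long did the Seven Days Battle last?",
         "Which is longer of #1, #2?"]
print(valid_decomposition(steps))  # True
```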
Inter-task Feedback At each step of
the
pipeline, we collect feedback about previous steps.
To verify results from the CQW task, we ask
workers to indicate whether the given answer is
incorrect (in the SQD, EVM tasks), or if the
question is not definitive (in the SQD task) (§2.1).
Similarly, to identify non-feasible questions or
decompositions, we ask workers to indicate if
there is no evidence for a decomposition step (in
the EVM task).
Evidence Verification Task After the EVM step, each example comprises a question, its answer, decomposition, and supporting evidence.
To verify that a question can be answered
by executing the decomposition steps against
the matched evidence paragraphs, we construct
an additional evidence verification task (EVV).
In this task, workers are given a question,
its decomposition and matched paragraphs, and
are asked to answer
the question in each
decomposition step purely based on the provided
paragraphs. Running EVV on a subset of examples
during data collection helps identify issues in the
pipeline and in worker performance.
4 The STRATEGYQA Dataset
We run our pipeline on 1,799 Wikipedia terms,
allowing a maximum of 5 questions per term. We
update our online fine-tuned solvers (FNTD) every
1K questions. Every question is decomposed once,
and evidence is matched for each decomposition
by 3 different workers. The cost of annotating a
full example is $4.
To encourage diversity in strategies used
in the questions, we recruited new workers
throughout data collection. Moreover, periodic
updates of the online solvers prevent workers
from exploiting shortcuts, since the solvers adapt
to the training distribution. Overall, there were 29
question writers, 19 decomposers, and 54 evidence
matchers participating in the data collection.
We collected 2,835 questions, out of which 55
were marked as having an incorrect answer during
SQD (§3.2). This results in a collection of 2,780
verified strategy questions, for which we create an
annotator-based data split (Geva et al., 2019). We
now describe the dataset statistics (§4.1), analyze
the quality of the examples (§4.2), and explore the
reasoning skills in STRATEGYQA (§4.3).
4.1 Dataset Statistics
We observe (Table 4) that the answer distribution
is roughly balanced (yes/no). Moreover, questions
are short (< 10 words), and the most common
trigram occurs in roughly 1% of the examples.
This indicates that the language of the questions
is both simple and diverse. For comparison,
the average question length in the multi-
hop datasets HOTPOTQA (Yang et al., 2018)
and COMPLEXWEBQUESTIONS (Talmor and Berant,
2018) is 13.7 words and 15.8 words, respectively.
Likewise, the top trigram in these datasets occurs
in 9.2% and 4.8% of their examples, respectively.
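One plausible way to compute the top-trigram statistic reported here; the actual tokenization and counting details are not specified in the paper, so this is only an assumption.

```python
from collections import Counter

def top_trigram_rate(questions):
    """Return the most frequent word trigram and the fraction of questions containing it."""
    trigram_counts = Counter()
    for q in questions:
        words = q.lower().rstrip("?").split()
        trigrams = {tuple(words[i:i + 3]) for i in range(len(words) - 2)}  # unique per question
        trigram_counts.update(trigrams)
    trigram, count = trigram_counts.most_common(1)[0]
    return " ".join(trigram), count / len(questions)

questions = ["Did Aristotle use a laptop?",
             "Would a monocle be appropriate for a cyclop?",
             "Can one spot helium?"]
print(top_trigram_rate(questions))
```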
More than half of the generated questions are
filtered by our solvers, pointing to the difficulty
of generating good strategy questions. We release
all 3,305 filtered questions as well.
To characterize the reasoning complexity
required to answer questions in STRATEGYQA, we
examine the decomposition length and the number
of evidence paragraphs. Figure 3 and Table 4 (bottom) show that the distributions of these properties
are centered around 3-step decompositions and 2 evidence paragraphs, but a considerable portion of the dataset requires more steps and paragraphs.

                                        Train    Test
# of questions                          2290     490
% ‘‘yes’’ questions                     46.8%    46.1%
# of unique terms                       1333     442
# of unique decomposition steps         6050     1347
# of unique evidence paragraphs         9251     2136
# of occurrences of the top trigram     31       5
# of question writers                   23       6
# of filtered questions                 2821     484
Avg. question length (words)            9.6      9.8
Avg. decomposition length (steps)       2.93     2.92
Avg. # of paragraphs per question       2.33     2.29

Table 4: STRATEGYQA statistics. Filtered questions were rejected by the solvers (§3.1). The train and test sets of question writers are disjoint. The ‘‘top trigram’’ is the most common trigram.

Figure 3: The distributions of decomposition length (left) and the number of evidence paragraphs (right). The majority of the questions in STRATEGYQA require a reasoning process comprised of ≥ 3 steps, of which about 2 steps involve retrieving external knowledge.
4.2 Data Quality
Do questions in STRATEGYQA require multi-
step implicit reasoning? To assess the quality
of questions, we sampled 100 random examples
from the training set, and had two experts (authors)
independently annotate whether the questions
satisfy the desired properties of strategy questions
(§2.1). We find that most of the examples (81%)
are valid multi-step implicit questions, 82% of
questions are implicit, and 95.5% are multi-step (Table 5).

              implicit    explicit    total
multi-step    81          14.5        95.5
single-step   1           3.5         4.5
total         82          18          100

Table 5: Distribution over the implicit and multi-step properties (§2) in a sample of 100 STRATEGYQA questions, annotated by two experts (we average the expert decisions). Most questions are multi-step and implicit. Annotator agreement is substantial for both the implicit (κ = 0.73) and multi-step (κ = 0.65) properties.
Do questions in STRATEGYQA have a definitive
answer? We let experts review the answers to
100 random questions, allowing access to the
Web. We then ask them to state for every question
whether they agree or disagree with the provided
answer. We find that the experts agree with the
answer in 94% of the cases, and disagree only
in 2%. For the remaining 4%, either the question
was ambiguous, or the annotators could not find a
definite answer on the Web. Overall, this suggests
that questions in STRATEGYQA have clear answers.
What is the quality of the decompositions?
We randomly sampled 100 decompositions and
asked experts to judge their quality. Experts
judged if the decomposition is explicit or utilizes a
strategy. We find that 83% of the decompositions
validly use a strategy to break down the question.
The remaining 17% decompositions are explicit,
however, in 14% of the cases the original question
is already explicit. Second, experts checked if
the phrasing of the decomposition is ‘‘natural’’,
that is, it reflects the decomposition of a person
that does not already know the answer. We
find that 89% of the decompositions express a
‘‘natural’’ reasoning process, while 11% may
depend on the answer. Last, we asked experts
to indicate any potential
logical flaws in the
decomposition, but no such cases occurred in
the sample.
Would different annotators use the same decomposition strategy? We sample 50 examples, and let two different workers decompose the questions. Comparing the decomposition pairs, we find that a) for all pairs,
the last step returns the same answer, b) in 44
out of 50 pairs, the decomposition pairs follow the
same reasoning path, and c) in the other 6 pairs, the
decompositions either follow a different reasoning
process (5 pairs) or one of the decompositions is
explicit (1 pair). This shows that different workers
usually use the same strategy when decomposing
questions.
Is the evidence for strategy questions in
Wikipedia? Another
important property is
whether questions
in STRATEGYQA can be
answered based on context from our corpus,
Wikipedia, given that questions are written
independently of the context. To measure evidence
coverage, in the EVM task (§3.3), we provide
workers with a checkbox for every decomposition
step,
indicating whether only partial or no
evidence could be found for that step. Recall
that three different workers match evidence for
each decomposition step. We find that 88.3% of
the questions are fully covered: Evidence was
matched for each step by some worker. Moreover,
in 86.9% of the questions, at least one worker
found evidence for all steps. Last, in only 0.5%
of the examples were all three annotators unable
to match evidence for any of the steps. This
suggests that overall, Wikipedia is a good corpus
for questions in STRATEGYQA that were written
independently of the context.
Do matched paragraphs provide evidence?
We assess the quality of matched paragraphs
by analyzing both example-level and step-level
annotations. First, we sample 217 decomposition
steps with their corresponding paragraphs matched by one of the three workers. We let 3 different crowdworkers decide whether the paragraphs provide evidence for the answer to that step. We find that in 93% of the cases, the majority vote is that the evidence is valid.4
Next, we analyze annotations of the verification task (§3.4), where workers are asked to answer all decomposition steps based only on the matched paragraphs. We find that the workers could answer sub-questions and derive the correct answer in 82 out of 100 annotations. Moreover, in 6 questions indeed there was an error in evidence matching, but another worker who annotated the example was able to compensate for the error, leading to 88% of the questions where evidence matching succeeds. In the last 12 cases indeed evidence is missing, and is possibly absent from Wikipedia.
Lastly, we let experts review the paragraphs
matched by one of the three workers to all the
decomposition steps of a question, for 100 random
questions. We find that for 79 of the questions the
matched paragraphs provide sufficient evidence
for answering the question. For 12 of the 21
questions without sufficient evidence, the experts
indicated they would expect to find evidence in
Wikipedia, and the worker probably could not find
it. For the remaining 9 questions, they estimated
that evidence is probably absent from Wikipedia.
In conclusion, 93% of the paragraphs matched
at the step-level were found to be valid. Moreover,
when considering single-worker annotations,
the questions are matched with
∼80% of
paragraphs that provide sufficient evidence for
all retrieval steps. This number increases to
88% when aggregating the annotations of three
workers.
Do different annotators match the same paragraphs? To compare the evidence paragraphs matched by different workers, we check whether, for a given decomposition step, the same paragraph IDs are
retrieved by different annotators. Given two non-
empty sets of paragraph IDs P1, P2, annotated by
two workers, we compute the Jaccard coefficient
J(P1, P2) = |P1 ∩ P2| / |P1 ∪ P2|. In addition, we take the sets
of corresponding Wikipedia page IDs T1, T2 for
the matched paragraphs, and compute J(T1, T2).
Note that a score of 1 is given to two identical
sets, while a score of 0 corresponds to sets that are
disjoint. The average similarity score is 0.43 for
paragraphs and 0.69 for pages. This suggests that
evidence for a decomposition step can be found in
more than one paragraph in the same page, or in
different pages.
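A minimal sketch of this agreement computation; the paragraph and page identifiers below are made up for illustration.

```python
def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|; both sets are assumed non-empty."""
    return len(a & b) / len(a | b)

# Paragraph IDs matched by two workers for the same decomposition step (made up).
p1 = {"Penguin-para-2", "Penguin-para-5"}
p2 = {"Penguin-para-2", "Emperor_penguin-para-1"}
# Corresponding Wikipedia page IDs.
t1 = {"Penguin"}
t2 = {"Penguin", "Emperor_penguin"}

print(jaccard(p1, p2))  # 0.333... : lower paragraph-level overlap
print(jaccard(t1, t2))  # 0.5      : higher page-level overlap
```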
4With moderate annotator agreement of κ = 0.42.
4.3 Data Diversity
We aim to generate creative and diverse questions.
We now analyze diversity in terms of the required
reasoning skills and question topic.
Reasoning Skills To explore the required
reasoning skills in STRATEGYQA, we sampled
100 examples and let
two experts (authors)
discuss and annotate each example with a) the
type of strategy for decomposing the question,
and b) the required reasoning and knowledge
skills per decomposition step. We then aggregate similar labels (e.g., botanical → biological) and compute the proportion of examples each strategy/reasoning skill is required for (an example can have multiple strategy labels).

Physical (13%): Can human nails carve a statue out of quartz?
Biological (11%): Is a platypus immune from cholera?
Historical (10%): Were mollusks an ingredient in the color purple?
Temporal (10%): Did the 40th president of the United States forward lolcats to his friends?
Definition (8%): Are quadrupeds represented on Chinese calendar?
Cultural (5%): Would a compass attuned to Earth’s magnetic field be a bad gift for a Christmas elf?
Religious (5%): Was Hillary Clinton’s deputy chief of staff in 2009 baptised?
Entertainment (4%): Would Garfield enjoy a trip to Italy?
Sports (4%): Can Larry King’s ex-wives form a water polo team?

Table 6: Top strategies in STRATEGYQA and their frequency in a 100 example subset (accounting for 70% of the analyzed examples).

Figure 4: Reasoning skills in STRATEGYQA; each skill is associated with the proportion of examples it is required for. Domain-related and logical reasoning skills are marked in blue and orange (italic), respectively.
Table 6 demonstrates the top strategies, showing that STRATEGYQA contains a broad set of strategies. Moreover, diversity is apparent in terms of both domain-related reasoning (Figure 4; e.g., biological and technological) and logical functions (e.g., set inclusion and ‘‘is member of’’). While the reasoning skills sampled from questions in STRATEGYQA do not necessarily reflect their prevalence in a ‘‘natural’’ distribution, we argue that promoting research on methods for inferring strategies is an important research direction.

Figure 5: The top 15 categories of terms used to prime workers for question writing and their proportion.
Question Topics As questions in STRATEGYQA
were triggered by Wikipedia terms, we use the
‘‘instance of’’ Wikipedia property to characterize
the topics of questions.5 Figure 5 shows the
distribution of topic categories in STRATEGYQA.
The distribution shows STRATEGYQA is very
diverse, with the top two categories (‘‘human’’ and ‘‘taxon’’; i.e., a group of organisms) covering only a quarter of the data, and a total of 609 topic categories.

5It is usually a 1-to-1 mapping from a term to a Wikipedia category. In cases of 1-to-many, we take the first category.

We further compare the diversity of STRATEGYQA to HOTPOTQA, a multi-hop QA
dataset over Wikipedia paragraphs. To this end,
we sample 739 pairs of evidence paragraphs
associated with a single question in both datasets,
and map the pair of paragraphs to a pair of
Wikipedia categories using the ‘‘instance of’’
property. We find that
there are 571 unique
category pairs in STRATEGYQA, but only 356
unique category pairs in HOTPOTQA. Moreover,
the top two category pairs in both of the datasets
(‘‘human-human’’, ‘‘taxon-taxon’’) constitute 8%
and 27% of
the cases in STRATEGYQA and
HOTPOTQA, respectively. This demonstrates the
creativity and breadth of category combinations
in STRATEGYQA.
4.4 Human Performance
To see how well humans answer
strategy
questions, we sample a subset of 100 questions
from STRATEGYQA and have experts (authors)
answer questions, given access to Wikipedia
articles and an option to reveal the decomposition
for every question. In addition, we ask them to
provide a short explanation for the answer, the
number of searches they conducted to derive the
answer, and to indicate whether they have used
the decomposition. We expect humans to excel at
coming up with strategies for answering questions.
Yet, humans are not necessarily an upper bound
because finding the relevant paragraphs is difficult
and could potentially be performed better by
machines.
Table 7 summarizes
the results. Overall,
humans infer the required strategy and answer the
questions with high accuracy. Moreover, the low
number of searches shows that humans leverage
background knowledge, as they can answer some
of the intermediate steps without search. An error
analysis shows that the main reason for failure
(10%) is difficulty to find evidence, and the rest
of the cases (3%) are due to ambiguity in the
question that could lead to the opposite answer.
5 Experimental Evaluation
In this section, we conduct experiments to
answer the following questions: a) How well
do pre-trained language models (LMs) answer
strategy questions? b) Is retrieval of relevant context helpful? and c) Are decompositions useful for answering questions that require implicit knowledge?

Answer accuracy         87%
Strategy match          86%
Decomposition usage     14%
Average # searches      1.25

Table 7: Human performance in answering questions. Strategy match is computed by comparing the explanation provided by the expert with the decomposition. Decomposition usage and the number of searches are computed based on information provided by the expert.
5.1 Baseline Models
Answering strategy questions requires external
knowledge that cannot be obtained by training
on STRATEGYQA alone. Therefore, our models and
online solvers (§3.1) are based on pre-trained
LMs, fine-tuned on auxiliary datasets that require
reasoning. Specifically, in all models we fine-tune
ROBERTA (Liu et al., 2019) on a subset of:
• BOOLQ (Clark et al., 2019): A dataset for
Boolean question answering.
• MNLI (Williams et al., 2018): A large natural
language inference (NLI) dataset. The task
is to predict if a textual premise entails,
contradicts, or is neutral with respect to the
hypothesis.
• TWENTY QUESTIONS (20Q): A collection of
50K short commonsense Boolean questions.6
• DROP (Dua et al., 2019): A large dataset for
numerical reasoning over paragraphs.
Models are trained in two configurations:
• No context : The model is fed with the
question only, and outputs a binary prediction
using the special CLS token.
• With context : We use BM25 (Robertson
et al., 1995) to retrieve context from our
corpus, while removing stop words from all
queries. We examine two retrieval methods:
a) question-based retrieval: by using the
question as a query and taking the top
k = 10 results, and b) decomposition-
based retrieval: by initiating a separate query
for each (gold or predicted) decomposition
step and concatenating the top k = 10
results of all steps (sorted by retrieval score). In both cases, the model is fed with the question concatenated to the retrieved context, truncated to 512 tokens (the maximum input length of ROBERTA), and outputs a binary prediction (both retrieval modes are sketched below).

6https://github.com/allenai/twentyquestions.
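A sketch of the two retrieval modes, using the rank_bm25 package as a stand-in for the BM25 index over Wikipedia; the corpus, tokenization, and stop-word list below are assumptions, not the actual setup.

```python
from rank_bm25 import BM25Okapi

STOP_WORDS = {"is", "the", "a", "an", "of", "in", "did", "was", "to"}  # illustrative

def tokenize(text):
    return [w for w in text.lower().rstrip("?").split() if w not in STOP_WORDS]

# Toy corpus standing in for Wikipedia paragraphs.
corpus = ["Aristotle was a Greek philosopher who died in 322 BC.",
          "The laptop was invented in the 1980s.",
          "Penguins live almost exclusively in the Southern Hemisphere."]
bm25 = BM25Okapi([tokenize(p) for p in corpus])

def question_based_retrieval(question, k=10):
    # One query built from the whole question.
    return bm25.get_top_n(tokenize(question), corpus, n=k)

def decomposition_based_retrieval(decomposition, k=10):
    # One query per decomposition step; simplified here to deduplicated
    # concatenation in step order (the paper sorts the concatenated results
    # by retrieval score).
    retrieved = []
    for step in decomposition:
        for p in bm25.get_top_n(tokenize(step), corpus, n=k):
            if p not in retrieved:
                retrieved.append(p)
    return retrieved[:k]

print(question_based_retrieval("Did Aristotle use a laptop?", k=2))
print(decomposition_based_retrieval(["When did Aristotle live?",
                                     "When was the laptop invented?"], k=2))
```

With a real paragraph index, these two functions correspond to the question-based and decomposition-based settings evaluated in §5.2.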
Model                          Solver group(s)
ROBERTA∅(20Q)                  PTD, FNTD
ROBERTA∅(20Q+BOOLQ)            PTD, FNTD
ROBERTA∅(BOOLQ)                PTD, FNTD
ROBERTAIR-Q(BOOLQ)             PTD
ROBERTAIR-Q(MNLI+BOOLQ)        PTD

Table 8: QA models used as online solvers during data collection (§3.1). Each model was fine-tuned on the datasets mentioned in its name.
Predicting Decompositions We train a seq-
to-seq model, termed BARTDECOMP, that, given a
question, generates its decomposition token-by-
token. Specifically, we fine-tune BART (Lewis
et al., 2020) on STRATEGYQA decompositions.
Baseline Models As our base model, we train
a model as follows: We take a ROBERTA (Liu
et al., 2019) model and fine-tune it on DROP, 20Q
and BOOLQ (in this order). The model is trained
on DROP with multiple output heads, as in Segal
et al. (2020), which are then replaced with a single
Boolean output.7 We call this model ROBERTA*.
We use ROBERTA* and ROBERTA to train the following models on STRATEGYQA: without context (ROBERTA*∅), with question-based retrieval (ROBERTA*IR-Q, ROBERTAIR-Q), and with predicted decomposition-based retrieval (ROBERTA*IR-D).
We also present four oracle models:
• ROBERTA*ORA-P: Uses the gold paragraphs
(no retrieval).
• ROBERTA*IR-ORA-D: Performs retrieval with
the gold decomposition.
• ROBERTA*last-step-ORA-P-D: Exploits both the gold
decomposition and the gold paragraphs. We
fine-tune ROBERTA on BOOLQ and SQUAD
(Rajpurkar et al., 2016) to obtain a model
that can answer single-step questions. We
then run this model on STRATEGYQA to obtain
answers for all decomposition sub-questions,
and replace all placeholder references with
the predicted answers (this substitution is sketched below, after Table 9). Last, we fine-tune ROBERTA* to answer the last decomposition step of STRATEGYQA, for which we have supervision.

• ROBERTA*last-step-raw-ORA-P-D: ROBERTA* that is fine-tuned to predict the answer from the gold paragraphs and the last step of the gold decomposition, without replacing placeholder references.

7For brevity, exact details on model training and hyper-parameters will be released as part of our codebase.

Model                            Accuracy       Recall@10
MAJORITY                         53.9           -
ROBERTA*∅                        63.6 ± 1.3     -
ROBERTAIR-Q                      53.6 ± 1.0     0.174
ROBERTA*IR-Q                     63.6 ± 1.0     0.174
ROBERTA*IR-D                     61.7 ± 2.2     0.195
ROBERTA*IR-ORA-D                 62.0 ± 1.3     0.282
ROBERTA*ORA-P                    70.7 ± 0.6     -
ROBERTA*last-step-raw-ORA-P-D    65.2 ± 1.4     -
ROBERTA*last-step-ORA-P-D        72.0 ± 1.0     -

Table 9: QA accuracy (with standard deviation across 7 experiments), and retrieval performance, measured by Recall@10, of baseline models on the test set.
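The placeholder substitution used by the last-step oracle models above can be sketched as follows; the single-step QA model that produces the intermediate answers is abstracted away, and the example answers are made up.

```python
import re

def resolve_last_step(decomposition, predicted_answers):
    """Replace '#k' references in the final decomposition step with the answers
    predicted for the earlier steps, yielding a self-contained Boolean question."""
    last_step = decomposition[-1]
    return re.sub(r"#(\d+)", lambda m: predicted_answers[int(m.group(1)) - 1], last_step)

decomposition = ["How long did the Battle of Peleliu last?",
                 "How long did the Seven Days Battles last?",
                 "Which is longer of #1, #2?"]
predicted = ["about two months", "seven days"]
print(resolve_last_step(decomposition, predicted))
# Which is longer of about two months, seven days?
```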
Online Solvers For the solvers integrated in the
data collection process (§3.1), we use three no-
context models and two question-based retrieval
models. The solvers are listed in Table 8.
5.2 Results
Strategy QA performance Table 9 summarizes
the results of all models (§5.1). ROBERTA*IR-Q
substantially outperforms ROBERTAIR-Q, indicat-
ing that fine-tuning on related auxiliary datasets
before STRATEGYQA is crucial. Hence, we focus
on ROBERTA* for all other results and analysis.
Strategy questions pose a combined challenge
of retrieving the relevant context, and deriving
the answer based on that context. Training
without context shows a large accuracy gain of
53.9 → 63.6 over the majority baseline. This
is far from human performance, but shows that
some questions can be answered by a large LM
fine-tuned on related datasets without retrieval.
On the other end, training with gold paragraphs
raises performance to 70.7. This shows that
high-quality retrieval lets the model effectively
reason over the given paragraphs. Last, using
both gold decompositions and retrieval further
increases performance to 72.0, showing the utility
of decompositions.
Focusing on retrieval-based methods, we
observe that question-based retrieval
reaches
an accuracy of 63.6 and retrieval with gold
decompositions results in an accuracy of 62.0. This
shows that the quality of retrieval even with gold
decompositions is not high enough to improve the
63.6 accuracy obtained by ROBERTA*∅, a model
that uses no context. Retrieval with predicted
decompositions results in an even lower accuracy
of 61.7. We also analyze predicted decompositions
below.
Retrieval Evaluation A question decomposi-
tion describes the reasoning steps for answering
the question. Therefore, using the decomposi-
tion for retrieval may help obtain the relevant
context and improve performance. To test this,
we directly compare performance of question-
and decomposition-based retrieval with respect
to the annotated gold paragraphs. We compute
Recall@10, that is, the fraction of the gold para-
graphs retrieved in the top-10 results of each
method. Since there are 3 annotations per ques-
tion, we compute Recall@10 for each annotation
and take the maximum as the final score. For a fair
comparison, in decomposition-based retrieval, we
use the top-10 results across all steps.
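An illustrative implementation of this metric, taking the maximum over the three gold annotations per question:

```python
def recall_at_10(retrieved_top10, gold_annotations):
    """retrieved_top10: the 10 paragraphs returned for a question.
    gold_annotations: one set of gold paragraphs per annotator (3 per question).
    Returns the best fraction of gold paragraphs covered by the top-10 results."""
    retrieved = set(retrieved_top10)
    return max(len(gold & retrieved) / len(gold) for gold in gold_annotations if gold)

gold = [{"p1", "p2"}, {"p2", "p3", "p4"}, {"p5"}]
print(recall_at_10(["p2", "p9", "p7"], gold))  # 0.5: best annotation has 1 of 2 covered
```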
Results (Table 9) show that retrieval performance is low, partially explaining why retrieval models do not improve performance
compared to ROBERTA*∅, and demonstrating
the retrieval challenge in our
setup. Gold
decomposition-based retrieval substantially out-
performs question-based retrieval, showing that
using the decomposition for retrieval is a promis-
ing direction for answering multi-step questions.
Still, predicted decomposition-based retrieval
does not improve retrieval compared to question-
based retrieval, showing better decomposition
models are needed.
To understand the low retrieval scores, we
analyzed the query results of 50 random
decomposition steps. Most
failure cases are
due to the shallow pattern matching done by
BM25—for example, failure to match synonyms.
This shows that indeed there is little word overlap between decomposition steps and the
evidence, as intended by our pipeline design. In
other examples, either a key question entity was
missing because it was represented by a reference
token, or the decomposition step had complex
language, leading to failed retrieval. This analysis
suggests that advances in neural retrieval might
be beneficial for STRATEGYQA.
Human Retrieval Performance To quantify
human performance in finding gold paragraphs,
we ask experts to find evidence paragraphs for
100 random questions. For half of the questions
we also provide decomposition. We observe
average Recall@10 of 0.586 and 0.513 with
and without the decomposition, respectively. This
shows that humans significantly outperform our
IR baselines. However, humans are still far from
covering the gold paragraphs, since there are
multiple valid evidence paragraphs (§4.2), and
retrieval can be difficult even for humans. Lastly,
using decompositions improves human retrieval,
showing decompositions indeed are useful for
finding evidence.
Predicted Decompositions Analysis shows that BARTDECOMP’s decompositions are grammatical and well-structured. Interestingly, the model generates strategies, but often applies them to questions incorrectly. For example, the question
Can a lifeboat rescue people in the Hooke Sea? is
decomposed to 1) What is the maximum depth of
the Hooke Sea? 2) How deep can a lifeboat dive?
3) Is #2 greater than or equal to #1?. While the
decomposition is well-structured, it uses a wrong
strategy (lifeboats do not dive).
6 Related Work
Prior work has typically let annotators write
questions based on an entire context (Khot et al.,
2020a; Yang et al., 2018; Dua et al., 2019;
Mihaylov et al., 2018; Khashabi et al., 2018).
In this work, we prime annotators with minimal
information (few tokens) and let them use their
imagination and own wording to create questions.
A related priming method was recently proposed
by Clark et al. (2020), who used the first 100
characters of a Wikipedia page.
Among multi-hop reasoning datasets, our dataset stands out in that it requires implicit decompositions. Two recent datasets (Khot et al.,
2020a; Mihaylov et al., 2018) have considered
questions requiring implicit facts. However, they
are limited to specific domain strategies, while in
our work we seek diversity in this aspect.
Most multi-hop reasoning datasets do not fully
annotate question decomposition (Yang et al.,
2018; Khot et al., 2020a; Mihaylov et al., 2018).
This issue has prompted recent work to create
question decompositions for existing datasets
(Wolfson et al., 2020), and to train models that
generate question decompositions (Perez et al.,
2020; Khot et al., 2020b; Min et al., 2019). In
this work, we annotate question decompositions
as part of the data collection.
7 Conclusion
We present STRATEGYQA,
the first dataset of
implicit multi-step questions requiring a wide range of reasoning skills. To build STRATEGYQA,
we introduced a novel annotation pipeline for
eliciting creative questions
that use simple
language, but cover a challenging range of
diverse strategies. Questions in STRATEGYQA are
annotated with decomposition into reasoning steps
and evidence paragraphs, to guide the ongoing
research towards addressing implicit multi-hop
reasoning.
Acknowledgments
We thank Tomer Wolfson for helpful feedback
and the REVIZ team at Allen Institute for
AI, particularly Michal Guerquin and Sam
Skjonsberg. This research was supported in part
by the Yandex Initiative for Machine Learning,
and the European Research Council (ERC) under
the European Union Horizons 2020 research
and innovation programme (grant ERC DELPHI
802800). Dan Roth is partly supported by ONR
contract N00014-19-1-2620 and DARPA contract
FA8750-19-2-1004, under the Kairos program.
This work was completed in partial fulfillment for
the PhD degree of Mor Geva.
References
Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2020. Beat the AI: Investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics, 8:662–678. DOI: https://doi.org/10.1162/tacl_a_00338
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics (TACL), 8:454–470. DOI: https://doi.org/10.1162/tacl_a_00317
Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.acl-main.408
Dheeru Dua, Yizhong Wang, Pradeep Dasigi,
Gabriel Stanovsky, Sameer Singh,
and
Matt Gardner. 2019. DROP: A reading
comprehension benchmark requiring discrete
reasoning over paragraphs. In North American
Chapter of the Association for Computational
Linguistics (NAACL).
Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the
annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166, Hong Kong, China. Association for Computational Linguistics.
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N18-2017
Yichen Jiang and Mohit Bansal. 2019. Avoiding reasoning shortcuts: Adversarial evaluation, training, and model development for multi-hop QA. In Association for Computational Linguistics (ACL). DOI: https://doi.org/10.18653/v1/P19-1262, PMID: 31353678
Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, New Orleans, Louisiana. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N18-1023
Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020a. QASC: A dataset for question answering via sentence composition. In AAAI. DOI: https://doi.org/10.1609/aaai.v34i05.6319

Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2020b. Text modular networks: Learning to decompose tasks in the language of existing models. arXiv preprint arXiv:2009.00751.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics (TACL), 7:453–466. DOI: https://doi.org/10.1162/tacl_a_00276
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.acl-main.703
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1260
Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6097–6109, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1613

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Workshop on Cognitive Computing at NIPS.

Ethan Perez, Patrick Lewis, Wen-tau Yih, Kyunghyun Cho, and Douwe Kiela. 2020. Unsupervised question decomposition for question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8864–8880.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP). DOI: https://doi.org/10.18653/v1/D16-1264

Stephen Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. 1995. Okapi at TREC-3. In Overview of the Third Text REtrieval Conference (TREC-3), pages 109–126. Gaithersburg, MD: NIST.

Elad Segal, Avia Efrat, Mor Shoham, Amir Globerson, and Jonathan Berant. 2020. A simple and effective model for answering multi-span questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3074–3080. DOI: https://doi.org/10.18653/v1/2020.emnlp-main.248

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6418–6428, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1644

Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In North American Chapter of the Association for Computational Linguistics (NAACL). DOI: https://doi.org/10.18653/v1/N18-1059

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics (TACL), 6:287–302. DOI: https://doi.org/10.1162/tacl_a_00021

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N18-1101

Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics (TACL). DOI: https://doi.org/10.1162/tacl_a_00309

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Empirical Methods in Natural Language Processing (EMNLP). DOI: https://doi.org/10.18653/v1/D18-1259, PMCID: PMC6156886