♫ MuSiQue: Multihop Questions via Single-hop Question Composition

Harsh Trivedi♫ Niranjan Balasubramanian♫ Tushar Khot† Ashish Sabharwal†

♫Stony Brook University, Stony Brook, USA
{hjtrivedi,niranjan}@cs.stonybrook.edu

†Allen Institute for AI, Seattle, USA
{tushark,ashishs}@allenai.org

Abstract

Multihop reasoning remains an elusive goal
as existing multihop benchmarks are known
to be largely solvable via shortcuts. Can we
create a question answering (QA) dataset that,
by construction, requires proper multihop rea-
soning? To this end, we introduce a bottom–up
approach that systematically selects compos-
able pairs of single-hop questions that are
连接的, 那是, where one reasoning step
critically relies on information from another.
This bottom–up methodology lets us explore
a vast space of questions and add stringent
filters as well as other mechanisms targeting
connected reasoning. It provides fine-grained
control over the construction process and the
properties of the resulting k-hop questions. We
use this methodology to create MuSiQue-Ans,
a new multihop QA dataset with 25K 2–4
hop questions. Relative to existing datasets,
MuSiQue-Ans is more difficult overall (3×
increase in human–machine gap), and harder
to cheat via disconnected reasoning (e.g., a
single-hop model has a 30-point drop in F1).
We further add unanswerable contrast ques-
tions to produce a more stringent dataset,
MuSiQue-Full. We hope our datasets will
help the NLP community develop models that
perform genuine multihop reasoning.1

1 Introduction

Multihop QA datasets are designed to support
the development and evaluation of models that
perform multiple steps of reasoning in order to
answer a question. Recent work, however, shows
that on existing datasets, models often need not
even connect information across all supporting

facts,2 because they can exploit reasoning short-
cuts and other artifacts to find the correct answers
and obtain high scores (Min et al., 2019a; Chen and
Durrett, 2019; Trivedi et al., 2020). Such short-
cuts arise from various factors, such as overly
specific sub-questions, train-test leakage, and in-
sufficient distractors. These factors allow models
to circumvent connected reasoning—they need
not read the context to find answers to previous
sub-question(s) or use these answers to answer the
later sub-questions that depend on them.

The left-hand side of Fig. 1 illustrates an instance of this problem in an actual question (Q) taken from the HotpotQA dataset (Yang et al., 2018). This question has the over-specification issue. At first glance, it appears to require a model to identify Kurt Vonnegut as the author of Armageddon in Retrospect, and then use this information to answer the final question about the famous satire novel he authored. However, this framing of the question is insufficient to enforce that models must perform connected multihop reasoning to arrive at the correct answer. A model can, in fact, find the correct answer to this question from the context without finding the answer to Q1. This is because, even if a model does not know that A1 refers to Kurt Vonnegut, there happens to be only one person best known for a satirical novel mentioned in the context.

Contrast this with the question on the right (Q′), which cannot be answered by simply returning a novel that someone was best known for. There are three possible answers in the context, and choosing between them requires knowing which author is referenced. This is a desirable multihop question that requires connected reasoning.

1 Code and datasets available at https://github.com/stonybrooknlp/musique.
2 E.g., they often don't even use information from one supporting fact to select another.


Transactions of the Association for Computational Linguistics, vol. 10, pp. 539–554, 2022. https://doi.org/10.1162/tacl_a_00475
Action Editor: Yulan He. Submission batch: 11/2021; Revision batch: 1/2022; Published 5/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Figure 1: Generating connected multihop questions by composing carefully chosen pairs of single-hop questions. Left: A HotpotQA question that would have been filtered out by our approach for not requiring connected reasoning; it can be answered using just Q2 without knowing the answer to Q1 (since there is only one person mentioned in the context as being best known for a satirical novel). Right: A connected question that forces models to reason through both intended hops (since there are multiple people mentioned in the context as being best known for some novel).

Prior work has characterized such reasoning, where a model arrives at the correct answer without using all supporting facts, as Disconnected Reasoning (Trivedi et al., 2020). While this characterization enables filtering or automatically transforming existing datasets (Trivedi et al., 2020), we ask a different question: How can we construct a new multihop dataset that, by design, enforces connected reasoning?

We make two main contributions towards this:

1) A new dataset construction approach: We introduce a bottom–up process for building challenging multihop reading comprehension QA datasets by carefully selecting and composing single-hop questions obtained from existing datasets. The key ideas behind our approach are: (i) Composing multihop questions from a large collection of single-hop questions, which allows a systematic exploration of a vast space of candidate multihop questions. (ii) Applying a stringent set of filters that ensure no sub-question can be answered without finding the answers to the previous sub-questions it is connected to (a key property we formally define as part of the MuSiQue condition, Eqn. (2)). (iii) Reducing train-test leakage at the level of each single-hop question, thereby mitigating the impact of simple memorization tricks. (iv) Adding distractor contexts that cannot be easily identified. (v) Creating unanswerable multihop questions at the sub-question level.

2) A new challenge dataset and empirical analysis: We build a new multihop QA dataset, MuSiQue-Ans (abbreviated as ♫-Ans), with ∼25K 2–4 hop questions with six different composition structures (cf. Table 1). We demonstrate that ♫-Ans is more challenging and less cheatable than two prior multihop reasoning datasets, HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020). In particular, it has 3× the human–machine gap, and a substantially lower disconnected reasoning (DiRe) score, which captures the extent to which a dataset can be cheated via disconnected reasoning (Trivedi et al., 2020). We also show how various features of our dataset construction pipeline help increase dataset difficulty and reduce cheatability. Finally, by incorporating the notion of insufficient contexts (Rajpurkar et al., 2018; Trivedi et al., 2020), we also release a variant of our dataset, ♫-Full, having ∼50K multihop questions that form contrasting pairs (Kaushik et al., 2019; Gardner et al., 2020) of answerable and unanswerable questions. ♫-Full is even more challenging and harder to cheat on.

We hope our bottom–up multihop dataset construction methodology and our challenging datasets with a mixed number of hops will help develop proper multihop reasoning systems and decomposition-based models.


2 Related Work


Multihop QA. ♫-Ans is closest to HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020). HotpotQA was constructed by directly crowdsourcing 2-hop questions without considering the difficulty of composition, and has been shown to be largely solvable without multihop reasoning (Min et al., 2019a; Chen and Durrett, 2019; Trivedi et al., 2020). While 2WikiMultihopQA was also constructed via composition, it uses a limited set of hand-authored compositional rules, making it easy for large language models. We show that ♫-Ans is harder and less cheatable than both of these. Other multihop datasets (Khashabi et al., 2018; Dua et al., 2019, inter alia) focus on different challenges such as multiple modalities (Chen et al., 2020; Talmor et al., 2021), open-domain QA (Geva et al., 2021;


Table 1: The six reasoning graph shapes (2-hop to 4-hop) present in MuSiQue, along with sample questions.

Khot et al., 2020), fact verification (Jiang et al., 2020), science explanations (Jansen et al., 2018), and relation extraction (Welbl et al., 2018), among others. Extending our ideas to these challenges is an interesting avenue for future work.

Unanswerable QA. Prior works have used unanswerable questions for robust reasoning in single-hop (Rajpurkar et al., 2018) and multihop (Ferguson et al., 2020; Trivedi et al., 2020) settings. IIRC (Ferguson et al., 2020) focuses on open-domain QA, where the unanswerable questions are identified by crowdsourcing questions whose relevant knowledge couldn't be retrieved from Wikipedia. Our idea to make unanswerable multihop questions by removing support paragraphs is most similar to Trivedi et al. (2020). While they rely on annotations (potentially incomplete) to identify these support paragraphs, we can use the bridge entities to remove any potential support paragraphs (containing the bridge entity) and better ensure unanswerability.

Question Decomposition and Composition. Multihop QA datasets have been decomposed into simpler questions (Min et al., 2019b; Talmor and Berant, 2018) and special meaning representations (Wolfson et al., 2020). Our dataset creation pipeline naturally provides question decompositions, which can help develop interpretable models (Min et al., 2019b; Khot et al., 2021).

Recent work has also used bottom–up approaches to create multihop questions (Pan et al., 2021; Yoran et al., 2021) using rule-based methods. However, their primary goal was data augmentation to improve on downstream datasets. The questions themselves haven't been shown to be challenging or less cheatable.

3 Multihop Reasoning Desiderata

Multihop question answering can be seen as a sequence of inter-dependent reasoning steps leading to the answer. In its most general form, these reasoning steps and the dependencies can be viewed as a directed acyclic graph (DAG), GQ. Each node qi in this graph represents a reasoning step or a "hop", for example, a single-hop question in multihop QA or a KB relation traversal in graph-based KBQA. An edge (qj, qi) ∈ edges(GQ) indicates that the reasoning step qi relies critically on the output of the predecessor step qj. For example, in Fig. 1, the single-hop question Q2′ depends on the answer to Q1′, and the graph GQ′ is a linear chain Q1′ → Q2′.

Given this framing, a key desirable property for multihop reasoning is connected reasoning: performing each step qi correctly should require the output of all its predecessor steps qj.

Analytical Intuition: Suppose a model M can answer each qi correctly with probability p, and it can also answer qi without the output of all its predecessor steps with probability r ≤ p. For simplicity, we assume these probabilities are independent across various qi. M can correctly answer a k-hop question Q by identifying and performing all its k reasoning steps. This will succeed with probability at most p^k. Alternatively, as an extreme case, it can "cheat" by identifying and performing only the last step qk (the "end question") without considering the output of qk−1 (or other steps) at all. This could succeed with probability as much as r, which does not decrease with k and is thus undesirable when constructing multihop datasets. Our goal is to create multihop questions that enforce connected reasoning, that is, where r ≪ p and, in particular, r < p^k, so that models have an incentive to perform all k reasoning steps.
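As a concrete illustration of this argument, the following minimal sketch (ours, with illustrative values of p and r, not measurements from the paper) contrasts the two success probabilities:

```python
# Sketch: success probability of connected k-hop reasoning (p^k) vs. a
# disconnected shortcut that only performs the final step (r, independent of k).
def connected_success(p: float, k: int) -> float:
    """All k steps must succeed; step probabilities assumed independent."""
    return p ** k

p, r = 0.8, 0.5  # illustrative values
for k in (2, 3, 4):
    print(f"k={k}: connected={connected_success(p, k):.3f} vs shortcut={r}")
# k=2: connected=0.640 vs shortcut=0.5
# k=3: connected=0.512 vs shortcut=0.5
# k=4: connected=0.410 vs shortcut=0.5  -> cheating pays off unless r << p
```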
Not surprisingly, the connected reasoning property is often not satisfied by existing datasets (Min et al., 2019a; Chen and Durrett, 2019; Trivedi et al., 2020), and never optimized for during dataset construction. As a consequence, models are able to exploit artifacts in existing datasets that allow them to achieve high scores while bypassing some of the reasoning steps, thus negating the main purpose of building multihop datasets. Prior work (Trivedi et al., 2020) has attempted to measure the extent of connected reasoning in current models and datasets. However, due to the design of existing datasets, this approach is only able to measure this by ablating the pre-requisites of each reasoning step, namely, the supporting facts. Rather than only measure, we propose a method to construct multihop QA datasets that directly optimize for this condition.

Consider question Q on the left-hand side of Fig. 1. It can be answered in two steps, Q1 and Q2. However, the information in Q2 itself is sufficient to uniquely identify A2 from the context, even without considering A1. That is, while there is an intended dependency between Q1 and Q2, Q2 can be answered correctly without requiring the output of its predecessor question Q1. Our approach constructs multihop questions that prevent this issue, and thereby require the desired connected reasoning. Specifically, we carefully choose which single-hop questions to compose and what context to use such that each constituent single-hop question necessitates the answers from one or more previous questions.

4 Connected Reasoning via Composition

The central issue we want to address is ensuring connected reasoning. Our solution is to use a bottom–up approach where we compose multihop questions from a large pool of single-hop questions. As we show later, this approach allows us to explore a large space of multihop questions and carefully select ones that require connected reasoning. Additionally, with each multihop question, we will have associated constituent questions, their answers, and supporting paragraphs, which can help develop more interpretable models. Here we describe the high-level process and describe the specifics in the next section.

4.1 Multihop via Single-Hop Composition

As mentioned earlier, multihop questions can be viewed as a sequence of reasoning steps where the answer from one reasoning step is used to identify the next reasoning step. Therefore, we can use single-hop questions containing answers from other questions to construct potential multihop questions. For example, in Fig. 1, Q2′ mentions A1′, and hence single-hop questions Q1′ and Q2′ can be composed to create a DAG Q1′ → Q2′ and multihop question Q′ (right). Concretely, to create a multihop question from two single-hop questions, we have a composability criteria: Two single-hop question answer tuples (q1, a1) and (q2, a2) are composable into a multihop question Q with a2 as a valid answer if a1 is a named entity and it is mentioned in q2. See §5:S2 for detailed criteria.

This process of composing multihop questions can be chained together to form candidate reasoning graphs of various shapes and sizes (examples in Table 1). Formally, each multihop question Q has an underlying DAG GQ representing the composition of the single-hop questions q1, q2, . . . , qn, which form the nodes of GQ. A directed edge (qj, qi) indicates that qi depends on the answer of the previous sub-question qj. ai is the answer to qi, and thereby, an is the answer to Q.
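A minimal sketch of this composition step, under the simplifying assumption that string containment stands in for entity-mention matching (the example questions and field names are our illustration):

```python
# Sketch: compose two single-hop (question, paragraph, answer) tuples into a
# 2-hop question when the first answer (the bridge entity) appears in q2.
from dataclasses import dataclass

@dataclass
class SingleHop:
    question: str
    paragraph: str  # id of the associated gold paragraph
    answer: str

def composable(q1: SingleHop, q2: SingleHop) -> bool:
    # Simplified form of the criteria: a1 is mentioned in q2, a2 does not
    # appear in q1, and the two paragraphs differ.
    return (q1.answer in q2.question
            and q2.answer not in q1.question
            and q1.paragraph != q2.paragraph)

def compose(q1: SingleHop, q2: SingleHop) -> dict:
    """Build a 2-hop question whose DAG is q1 -> q2; a2 is the final answer."""
    assert composable(q1, q2)
    # Replace the bridge-entity mention, forcing q1 to be resolved first
    # (crowdworkers later rewrite this into a fluent question; see S7).
    blended = q2.question.replace(q1.answer, f"[answer of: {q1.question}]")
    return {"question": blended, "decomposition": [q1, q2],
            "paragraphs": {q1.paragraph, q2.paragraph}, "answer": q2.answer}

q1 = SingleHop("Who wrote Armageddon in Retrospect?", "P1", "Kurt Vonnegut")
q2 = SingleHop("What satirical novel is Kurt Vonnegut best known for?", "P2",
               "Slaughterhouse-Five")
print(compose(q1, q2)["question"])
```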
4.2 Ensuring Connected Reasoning

Given the graph GQ associated with a question Q, ensuring connected reasoning requires ensuring that for each edge (qj, qi) ∈ edges(GQ), arriving at answer ai using qi necessitates the use of aj. In other words, without aj, there isn't sufficient information in qi to arrive at ai.

The existence of such information can be probed by training a strong QA model M on subquestions (qi) with the mention of their predecessor's answer (aj) masked out (removed). If, on held out data, the model can identify a subquestion's answer (ai) without its predecessor's answer (aj), we say the edge (qj, qi) is disconnected. Formally, we say Q requires connected reasoning if:

∀ (qj, qi) ∈ edges(GQ) : M(qi^mj) ≠ ai    (1)

where qi^mj denotes the subquestion formed from qi by masking out the mention of the answer aj.

Consider the masked questions Q2 and Q2′ in Fig. 1. While Q2 can easily be answered without answer A1, Q2′ can't be answered without A1′, and Q′ hence satisfies condition (1).

4.3 Reading Comprehension Setting

While our proposed framework makes no assumptions about the choice of the model, and is applicable to the open-domain setting, we focus on the Reading Comprehension (RC) setting, where we have a fixed set of paragraphs as context, C. In a RC setting, apart from requiring the dependence between the reasoning steps, we also want the model to depend on the context to answer each question. While this requirement seems unnecessary, previous works have shown that RC datasets often have artifacts that allow models to predict the answer without the context (Kaushik and Lipton, 2018) and can even memorize the answers (Lewis et al., 2021) due to train-test leakage. As we will show later, previous multihop RC datasets can be cheated via such shortcuts. To ensure the dependence between the question and context, we modify the required condition in Eqn. (1) to:

∀ (qj, qi) ∈ edges(GQ) : M(qi^mj; C) ≠ ai
∧ ∀ qi ∈ nodes(GQ) : M(qi; φ) ≠ ai    (2)

In summary, we want multihop reading comprehension questions that satisfy condition (2) for a strong trained model M. If it does, we say that the question satisfies the MuSiQue condition. Our dataset construction pipeline optimizes for this condition as described next.
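Operationally, the condition is a filter over the DAG's edges and nodes. A schematic check might look as follows, where `model` stands in for the strong trained model M and `mask` for answer-mention masking (both placeholders of ours):

```python
# Sketch: checking the MuSiQue condition (Eqn. 2) for one question DAG.
def satisfies_musique_condition(questions, edges, gold, mask, model, context):
    """questions: {i: text of q_i}; edges: (j, i) dependency pairs;
    gold: {i: answer a_i}; mask(i, j): q_i with the mention of a_j removed;
    model(question, context): prediction of M; context=None plays phi."""
    # Edge check: M(q_i^{m_j}; C) != a_i for every edge (q_j, q_i).
    for j, i in edges:
        if model(mask(i, j), context) == gold[i]:
            return False  # (q_j, q_i) is disconnected: answerable without a_j
    # Node check: M(q_i; phi) != a_i, i.e., not answerable without context.
    for i in questions:
        if model(questions[i], None) == gold[i]:
            return False
    return True
```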
5 Dataset Construction Pipeline

The high-level schematic of the pipeline is shown in Fig. 2.

Figure 2: MuSiQue construction pipeline. The MuSiQue pipeline takes single-hop questions from existing datasets, explores the space of multihop questions that can be composed from them, and generates a dataset of challenging multihop questions that are difficult to cheat on. The MuSiQue pipeline also makes unanswerable multihop questions that make the final dataset significantly more challenging.

We begin with a large set of RC single-hop questions from 5 English Wikipedia-based datasets, SQuAD (Rajpurkar et al., 2016), Natural Questions (Kwiatkowski et al., 2019), MLQA (en-en) (Lewis et al., 2020b), T-REx (ElSahar et al., 2018), and Zero Shot RE (Levy et al., 2017), where instances are of the form (qi, pi, ai) referring to the question, the associated paragraph, and the answer, respectively. For Natural Questions, as the context is very long (an entire Wikipedia page), we use the annotated long answer (usually a paragraph) from the dataset as the context, and the annotated short answer as the answer. Then, we take the following two steps:

S1. Find Good Single-Hop Questions. Even a tolerably small percentage of issues in single-hop questions can compound into an intolerably large percentage in the composed multihop questions. To mitigate this, we first remove questions that are likely annotation errors. Because manually identifying such questions at scale is laborious, we use a model-based approach. We remove the questions for which none of five large trained QA models3 can predict the associated answer with > 0 answer F1. In addition, we remove (i) erroneous questions where the answer spans are not in the context, (ii) questions with < 20 word context, as we found them to be too easy, and (iii) questions with > 300 word context, to prevent the final multihop question context from being too long for current long-range transformer models.

S2. Find Composable Single-Hop Pairs. To create 2-hop questions, we first collect distinct single-hop question pairs with a bridge entity. Specifically, we find pairs (q1, p1, a1) and (q2, p2, a2) such that (i) a1 is a named entity also mentioned in q2, (ii) a2 is not in q1, and (iii) p1 ≠ p2. Such pairs can be combined to form a 2-hop question (Q, {p1, p2}, a2). To ensure that the mentions (a1 and its occurrence in q2, denoted e2) refer to the same entity, we ensure: 1. The Spacy entity tagger (Honnibal et al., 2020) tags a1 and e2 as entities of the same type. 2. A Wikipedia search with a1 and e2 returns an identical 1st result. 3. A state-of-the-art (SOTA) Wikification model (Wu et al., 2020) returns the same result for a1 and e2. At a later step (S7), when humans write composed questions from DAGs, they get to remove questions containing erroneous pairs. Only 8% of the pairs are pruned in that step, indicating that step S2 is quite effective.
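For illustration, these agreement checks could be sketched as follows; only the entity-type check uses a real library (spaCy), while the Wikipedia-search and wikification checks are stubbed placeholders of ours:

```python
# Sketch: same-entity checks for a bridge pair (a1 and its mention e2 in q2).
from typing import Optional
import spacy

nlp = spacy.load("en_core_web_sm")  # any English NER pipeline works

def entity_type(text: str) -> Optional[str]:
    """NER label spaCy assigns when `text` is recognized as one entity."""
    ents = nlp(text).ents
    return ents[0].label_ if len(ents) == 1 else None

def same_entity(a1: str, e2: str,
                wiki_top_hit=lambda s: s.lower(),    # stub: Wikipedia search
                wikify=lambda s: s.lower()) -> bool:  # stub: SOTA wikifier
    t1, t2 = entity_type(a1), entity_type(e2)
    return (t1 is not None and t1 == t2
            and wiki_top_hit(a1) == wiki_top_hit(e2)
            and wikify(a1) == wikify(e2))
```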

S3. Filter Disconnected Single-Hop Pairs. We want connected 2-hop questions—questions that cannot be answered without using the answers to the constituent single-hop questions. The MuSiQue condition (2) states that for a 2-hop question to be connected, either sub-question qi should not be correctly answered without its context (M(qi; φ) ≠ ai), and the tail question q2 should not be correctly answered when a1 is removed from it (M(q2^m1; C) ≠ a2). Accordingly, we use a two-step filtering process to find connected 2-hop questions. For simplicity, and because the second condition already filters some tail questions, our current implementation enforces the first condition only on the head question, q1.

3 Two random-seed variants of RoBERTa-large (Liu et al., 2019), two random-seed variants of Longformer-Large (Beltagy et al., 2020), and one UnifiedQA (Khashabi et al., 2020).

Filtering Head Nodes: We collect all questions that appear at least once as the head of composable 2-hop questions (q1) to create a set of head nodes. We create 5-fold train-test splits of this set and train two Longformer-Large models (different seeds) per split (train on three, validate and test on one). We generate answer predictions using the 2 models on their corresponding test splits, resulting in 2 predictions per question. We accept a head question if, on average, the predicted answers' word overlap (computed using answer F1) with the answer label is < 0.5.
the answer label is < 0.5. Filtering Tail Nodes: We create a unique set of masked single-hop questions that occur as a tail node (q2) in any composable 2-hop question. If the same single-hop question occurs in two 2-hop questions with different masked entities, they both are added to the set. We combine the gold-paragraph with 9 distractor paragraphs (re- trieved4 using the question without the masked entities as query). As before, we create 5-fold train-test splits and use 2 Longformer-Large mod- els to obtain 2 answer and support predictions. We accept a tail question if either mean answer F1 ≤ 0.25, or if it’s ≤ 0.75 and mean support F1 < 1.0. The thresholds for head and tail node filtering were chosen via a manual inspection of a few predictions in various ranges of the parameters, and gauging at what F1 values does the model’s answer semantically match the correct answer (e.g., ‘‘Barack Obama’’ and ‘‘President Barack Obama’’ overlap with 0.8 answer F1). Control- ling these thresholds provides a way to trade off between the degree of cheatability allowed in the dataset and the size of the final dataset. We aim to limit cheatability while retaining a reasonable dataset size. Finally, only 2-hop questions for which both head and tail node are acceptable are kept. We call this process Disconnection Filtering. S4. Build Multihop Questions. We now have a set of connected 2-hop questions, which form directed edges of a graph. Any subset DAG of it can be used to create a connected multihop question. We use 6 types of reasoning graphs with 2–4 hops as shown in Table 1. To avoid very long questions, we limit single-hop questions to 4We use the BM25 algorithm via Elasticsearch. 544 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 7 5 2 0 2 0 6 9 4 / / t l a c _ a _ 0 0 4 7 5 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 ≤ 10 tokens, the total length of questions in 2, 3-hops to ≤ 15, and 3-hops to ≤ 20 tokens. To ensure diversity, we (1) cap the reuse of bridging entities and single-hop questions at 25 and 100 multihop questions respectively (2) remove any n-hop question that’s subset of any m-hop question (m > n > 1).

S4. Build Multihop Questions. We now have a set of connected 2-hop questions, which form directed edges of a graph. Any subset DAG of it can be used to create a connected multihop question. We use 6 types of reasoning graphs with 2–4 hops, as shown in Table 1. To avoid very long questions, we limit single-hop questions to ≤ 10 tokens, and the total length of questions in 2- and 3-hops to ≤ 15 and in 4-hops to ≤ 20 tokens. To ensure diversity, we (1) cap the reuse of bridging entities and single-hop questions at 25 and 100 multihop questions, respectively, and (2) remove any n-hop question that's a subset of any m-hop question (m > n > 1).

S5. Minimize Train-Test Leakage. We devise a procedure to create train, validation, and test splits such that models cannot achieve high scores via memorization enabled by train-test leakage, an issue observed in some existing datasets (Lewis et al., 2021). Our procedure ensures that the training set has no overlap with the validation or test sets, and tries to keep the overlap between validation and test sets minimal.

We consider two multihop questions Qi and Qj to overlap if any of the following are common between Qi and Qj: (i) a single-hop question, (ii) the answer to any single-hop question, (iii) the associated paragraph of any single-hop question. To minimize such overlap, we take a set of multihop questions, greedily find a subset of a given size (S) which least overlaps with its complement (S′), and then remove overlapping questions from S′, to get the train set (S) and the dev+test set (S′). Then, we split dev+test into dev and test similarly. We ensure the distribution of source datasets of single-hop questions in train, dev, and test is similar, and also control the proportion of 2–4 hop questions.
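The paper specifies only that a subset of a given size is greedily chosen to minimize overlap with its complement; one plausible greedy strategy, sketched with hypothetical field names, is to pull overlapping questions onto the train side so shared keys do not cross the split boundary:

```python
# Sketch: greedy low-overlap train / held-out split.
def overlap_keys(q) -> set:
    """Keys on which two multihop questions count as overlapping: shared
    single-hop questions, their answers, or their paragraphs."""
    keys = set()
    for hop in q["decomposition"]:
        keys |= {("q", hop["question"]), ("a", hop["answer"]),
                 ("p", hop["paragraph"])}
    return keys

def greedy_split(questions: list, target_size: int):
    remaining = list(questions)
    train = [remaining.pop(0)]          # arbitrary seed
    train_keys = overlap_keys(train[0])
    while len(train) < target_size and remaining:
        # Absorb the question sharing the most keys with the train side.
        best = max(remaining, key=lambda q: len(overlap_keys(q) & train_keys))
        remaining.remove(best)
        train.append(best)
        train_keys |= overlap_keys(best)
    # Drop any held-out question that still overlaps with train.
    heldout = [q for q in remaining if not (overlap_keys(q) & train_keys)]
    return train, heldout
```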

S6. Build Contexts for Questions. For an n-hop question, the context has 20 paragraphs containing: (i) the supporting paragraphs associated with its single-hop questions {p1, p2, . . . , pn}, and (ii) distractor paragraphs retrieved using a query that is a concatenation of the single-hop questions from which all intermediate answer mentions are removed. To make distractor paragraphs harder to identify, we retrieve them from the set of gold paragraphs for the filtered single-hop questions (S1).
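Schematically, with a `retrieve` stub standing in for the BM25/Elasticsearch index over the positive-paragraph pool (field names ours):

```python
# Sketch: assemble the 20-paragraph context for an n-hop question.
import random

def build_context(decomposition, retrieve, n_total=20):
    supporting = [hop["paragraph"] for hop in decomposition]
    # Query: concatenated sub-questions with intermediate answer mentions removed.
    query = " ".join(hop["masked_question"] for hop in decomposition)
    distractors = [p for p in retrieve(query, k=2 * n_total)
                   if p not in supporting][: n_total - len(supporting)]
    context = supporting + distractors
    random.shuffle(context)  # avoid positional artifacts
    return context
```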

S7. Crowdsource Question Compositions. We crowdsource question compositions on Amazon MTurk, where workers composed coherent questions from our final DAGs of single-hop questions. In the interface, workers could see a list of single-hop questions with their associated paragraphs and how they are connected via bridge entities. They were first asked to check whether all pairs of mentions of bridge entities indeed refer to the same underlying entity. If they answered 'yes' for each pair,5 they were asked to compose a natural language question, ensuring that information from all single-hop questions in the DAG is used and that the answer to the composed question is the same as that of the last single-hop question. If they answered 'no' for any of the pairs, we discarded that question. Our tutorial provided them with several handwritten good and bad examples for each of the 2–4 hop compositions. Workers were encouraged to write short questions and make implicit inferences when possible. They were allowed to split questions into two sentences if needed.

We carried out a qualification round where 100 workers participated to perform the aforementioned task on 20 examples each. We manually evaluated these annotations for correctness and coherence, and selected 17 workers to annotate the full dataset. To ensure dataset quality, we carried out crowdsourcing in 9 batches, reading 10–20 random examples from each worker after each batch and sending relevant feedback via email, if needed. Workers were paid 25, 40, and 60 cents for each 2-, 3-, and 4-hop question, amounting to ∼15 USD per hour, totaling ∼11K USD.

We refer to the dataset at this stage as MuSiQue-Ans or ♫-Ans.

S8. Add Unanswerable Questions. For each answerable multihop RC instance, we create a corresponding unanswerable multihop RC instance using a procedure similar to the one proposed in Trivedi et al. (2020). For a multihop question, we randomly sample any of its single-hop questions and make it unanswerable by ensuring the answer to that single-hop question doesn't appear in any of the paragraphs in the context (except for this requirement, the context is built as described in S6). Because one of the single-hop questions is unanswerable, the whole multihop question is unanswerable.

The task now is to predict whether the question is answerable, and to predict the answer and support if it's answerable. Given that the questions in an answerable and unanswerable pair are identical and the context changes only marginally, models that rely on shortcuts find this new task very difficult. We call the dataset at this stage MuSiQue-Full or ♫-Full, and both datasets together MuSiQue.
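A sketch of this construction, with hypothetical field names and a `distractor_pool` standing in for the S6 retrieval results:

```python
# Sketch: build the unanswerable twin of an answerable instance by choosing
# one sub-question and keeping only paragraphs that lack its answer.
import random

def make_unanswerable(question, decomposition, distractor_pool, n_total=20):
    hop = random.choice(decomposition)   # sub-question to break
    banned = hop["answer"]
    # Context is built as in S6, except no paragraph may contain `banned`;
    # in particular, the chosen hop's supporting paragraph is dropped.
    supporting = [h["paragraph"] for h in decomposition
                  if banned not in h["paragraph"]]
    fillers = [p for p in distractor_pool
               if banned not in p and p not in supporting]
    context = (supporting + fillers)[:n_total]
    random.shuffle(context)
    # The question text is unchanged; only answerability flips.
    return {"question": question, "context": context, "answerable": False}
```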

Final Dataset. The statistics for ♫-Ans (♫-Full has twice the number of questions in each cell) are shown in Table 2.

5 They answered 'yes' 92% of the time, on average.


        2-hop   3-hop   4-hop   Total (24,814)
Train   14376   4387    1175    19938
Dev     1252    760     405     2417
Test    1271    763     425     2459

Table 2: Dataset statistics of MuSiQue-Ans. MuSiQue-Full contains twice the number of questions in each category above—one answerable and one unanswerable.

MuSiQue constitutes 21020 unique single-hop questions, 4132 answers to multihop questions, 19841 answers to single-hop questions, and 7676 supporting paragraphs. MuSiQue has 6 types of reasoning graphs and 2–4 hops (cf. Table 1 for examples).

In summary, our construction pipeline allows us to produce a dataset with mixed hops, multiple types of reasoning graphs, and unanswerable sub-questions, all of which make for a more challenging and less cheatable dataset (as we will quantify in Section 8). Question decomposition, which is a natural outcome of our construction pipeline, can also be used to aid decomposition-based QA research (Min et al., 2019b; Khot et al., 2021).

6 Dataset Quality Assessment

Quality of ♫-Ans. To assess the quality of ♫-Ans, we first evaluate how well humans can answer questions in it. Note that we already have gold answers and supporting paragraphs from our construction pipeline. The goal here is therefore not to determine gold labels, but rather to measure how well humans perform on the task, treating our gold labels as correct.

We sample 125 questions from the ♫-Ans validation and test sets, and obtain 3 annotations (answer and supporting paragraphs) for each question. We used Amazon MTurk,6 selecting crowdsource workers as described in §7.3.

Workers were shown the question and all paragraphs in the context, and were asked to highlight the answer span and checkmark the supporting paragraphs. Our interface allowed for searching, sorting, and filtering the list of paragraphs easily with interactive text-overlap-based search queries. The instructions included worked-out examples.

6https://www.mturk.com.

             Score   UB      Agr
Answer F1    78.0    88.6    84.1
Support F1   93.9    97.3    91.4

Table 3: Human performance (score and upper bound) and agreement on MuSiQue-Ans.

We compute human performance by comparing against gold labels for answer and support in two ways: 1) Human Score—the most frequent answer and support among the three annotators, breaking ties at random (the strategy used by Rajpurkar et al. (2018)), and 2) Human Upper Bound (UB)—the answer and support that maximize the score (as done by Yang et al. (2018)).

In addition, to assess how well humans agree with each other (ignoring our gold labels), we also compute the Human Agreement (Agr) score (Rajpurkar et al., 2016; Yang et al., 2018). Specifically, we treat one of the 3 annotations, chosen randomly, as predicted, and evaluate it against the rest of the annotations, which are treated as correct.

Table 3 demonstrates that ♫-Ans is a high-quality dataset. Furthermore, as we will discuss in §7.3, we also compare our human performance with two other similar datasets (HotpotQA and 2WikiMultihopQA), and show that ♫-Ans is close to them under these metrics (§8).
Quality of ♫-Full. We perform an additional manual validation to assess the dataset quality of ♫-Full. Recall that ♫-Full shares its answerable questions with ♫-Ans, the only extra task in ♫-Full being determining the answerability of a question from the given context. To assess the validity of this task, we sampled 50 random instances from ♫-Full, and one of the authors determined the answerability of each question from its context. We found that in 45 out of 50 instances (90%) the human-predicted answerability matched the gold label, showing that ♫-Full is also a high-quality dataset.

Multihop Nature of MuSiQue. Finally, we assess the extent to which ♫-Ans satisfies the MuSiQue condition (Eqn. 2) for connected reasoning. To this end, we first estimate what percentage of head and tail questions in the validation set we would retain if we were to repeat our disconnection filtering procedure (S3) with models trained on the final training data. This captures


the fraction of the questions in ♫-Ans that satisfy the MuSiQue condition. We then compare it with the respective numbers from the original step S3. In the original disconnection filtering step, we retained only 26.5% of the tail questions, whereas we would have retained 79.0% of the tail questions had we filtered the final validation dataset. For the head questions, we see a less dramatic but still significant effect—we originally retained 74.5% of the questions, and would now have retained 87.7% had we filtered the final validation set. This shows that vastly more questions in ♫-Ans satisfy the MuSiQue condition than what we started with.

7 Experimental Setup

7.1 Datasets

We compare our datasets (MuSiQue-Ans and MuSiQue-Full) with two similar multihop RC datasets: the distractor setting of HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020).7 Both datasets have 10 paragraphs as context. HQ and 2W have 2-hop and 2,4-hop questions, respectively. In addition, HQ has sentence-level support and 2W has entity-relation tuple support, but we don't use this annotation in our training or evaluation, for a fair comparison.

HQ, 2W, and ♫-Ans have 90K, 167K, and 20K training instances, respectively. For a fair comparison, we use equal-sized training sets in all our experiments, obtained by randomly sampling 20K instances each from HQ and 2W, and referred to as HQ-20k and 2W-20k, respectively.

Notation. Instances in ♫-Ans, HQ, and 2W are of the form (Q, C; A, Ps). Given a question Q and context C consisting of a set of paragraphs, the task is to predict the answer A and identify the supporting paragraphs Ps ∈ C. ♫-Ans additionally has gold decompositions GQ (§3), which can be leveraged during training. Instances in ♫-Full are of the form (Q, C; A, Ps, S), where there's an additional binary classification task to predict S, the answerability of Q based on C, also referred to as context sufficiency (Trivedi et al., 2020).

Metrics. For ♫-Ans, HQ, and 2W, we report the standard F1-based metrics for answer (An) and support identification (Sp); see Yang et al. (2018) for details. To make a fair comparison across datasets, we use only paragraph-level support F1. For ♫-Full, we follow Trivedi et al. (2020) to combine sufficiency prediction S with An and Sp, denoted An+Sf and Sp+Sf. Instances in ♫-Full are evaluated in pairs. For each Q with a sufficient context C, there is a paired instance with Q and an insufficient context C′. For An+Sf, if a model incorrectly predicts context sufficiency (yes or no) for either of the instances in a pair, it gets 0 points on that pair. Otherwise, it gets the same An score on that pair as it gets on the answerable instance in that pair. Scores are averaged across all pairs of instances in the dataset. Likewise for Sp+Sf.

7 For brevity, we use HQ, 2W, and ♫-Ans/Full to refer to HotpotQA, 2WikiMultihopQA, and MuSiQue-Ans/Full, respectively.
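In code, the paired An+Sf scoring could look like the following sketch (field names ours; the bag-of-words `answer_f1` is a simplified stand-in for the standard answer F1):

```python
# Sketch: the paired An+Sf metric used for MuSiQue-Full. An instance pair
# shares the question Q; one context is sufficient, the other is not.
def answer_f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return 0.0
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

def an_plus_sf(pairs, predict) -> float:
    """`pairs`: (answerable, unanswerable) instance dicts with gold 'answer';
    `predict(inst)` returns (predicted_answer, predicted_sufficient)."""
    total = 0.0
    for ans_inst, unans_inst in pairs:
        a_pred, s_ans = predict(ans_inst)
        _, s_unans = predict(unans_inst)
        # Sufficiency must be right on BOTH members, else 0 for the pair.
        if s_ans and not s_unans:
            total += answer_f1(a_pred, ans_inst["answer"])
    return 100.0 * total / len(pairs)
```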

7.2 Models

Our models are Transformer-based (Vaswani et al., 2017) language models (Devlin et al., 2019), implemented using PyTorch (Paszke et al., 2019), HuggingFace Transformers (Wolf et al., 2019), and AllenNLP (Gardner et al., 2017). We experiment with 2 types of models: (1) Multihop Models, which are in principle capable of employing the desired reasoning, and have demonstrated competitive performance on previous multihop QA datasets. They help probe the extent to which a dataset can be solved by current models. (2) Artifact-based Models, which are restricted in some way that prohibits them from doing the desired reasoning (discussed shortly). They help probe the extent to which a dataset can be cheated. Next, we describe these models for ♫-Ans and ♫-Full. For HQ and 2W, they work similarly to ♫-Ans.

7.2.1 Multihop Models

End2End (EE) Model. This model takes (Q, C) as input, runs it through a transformer, and predicts (A, Ps) as the output for ♫-Ans and (A, Ps, S) for ♫-Full. We use Longformer-Large, as it's one of the few transformer architectures able to fit the full context, and follow Beltagy et al. (2020) for answer and support prediction. Answerability prediction is done via binary classification using the CLS token.

Note that our Longformer EE model is a strong model for multihop reasoning. When trained on the full datasets, its answer F1 is 78.4 (within 3 pts of the published SOTA (Groeneveld et al., 2020)) on HQ, and 87.7 (SOTA) on 2W.


Select+Answer (SA) Model. This model, inspired by Quark (Groeneveld et al., 2020) and SAE (Tu et al., 2020), has two parts. First, a selector ranks and selects the K most relevant paragraphs CK ⊆ C.8 Specifically, given (Q, C) as input, it classifies every paragraph P ∈ C as relevant or not, and is trained with the cross-entropy loss. Second, for MuSiQue-Ans, the answerer predicts the answer and supporting paragraphs based only on CK. For MuSiQue-Full, it additionally predicts answerability. Both components are trained individually using annotations available in the dataset. We implement the selector using RoBERTa-large (Liu et al., 2019), and the answerer using Longformer-Large.

Step Execution (EX) Model. Similar to prior work (Talmor and Berant, 2018; Min et al., 2019b; Qi et al., 2021; Khot et al., 2021), this model performs explicit, step-by-step multihop reasoning, by first decomposing Q into a DAG GQ of single-hop questions, and then calling a single-hop model repeatedly to execute this decomposition.

The decomposer is trained with gold decompositions, and is implemented with BART-large.

The executor takes C and the predicted DAG GQ, and outputs (A, Ps) for MuSiQue-Ans and (A, Ps, S) for MuSiQue-Full. It calls the single-hop model Ms repeatedly while traversing GQ along the edges and substituting the answers (sketched below).

Model Ms is trained on only single-hop instances—taking (qi, C) as input, and producing (ai, Pi) or (ai, Pi, Si) as the output. Here Pi refers to the supporting paragraph for qi, and Si refers to whether C is sufficient to answer qi. For MuSiQue-Full, the answerer predicts Q as having sufficient context if Ms predicts all qi to have sufficient context. We implement 2 such single-hop models Ms: End2End and Select+Answer, abbreviated as EX(EE) and EX(SA), respectively.

We don't experiment with this model on HQ, since it needs ground-truth decompositions and intermediate answers, which aren't available in HQ.

Baseline (RNN) Model. The filtering steps in our pipeline use transformer-based models, which could make MuSiQue particularly difficult for transformer-based models. A natural question then is: can a strong non-transformer model perform better on MuSiQue? To answer this, we evaluate our re-implementation of a strong RNN-based baseline (Yang et al., 2018) (see their original paper for details). To verify our implementation, we trained it on full HotpotQA and found its performance to be 64.0 An (answer F1) on the validation set, better than what's reported by Yang et al. (2018) (58.3 An). We thus use this model as a strong non-transformer baseline.

8 K is a hyperparameter, chosen from {3, 5, 7}.

7.2.2 Artifact-based Models

The Q-Only Model takes only Q as input (no C) and generates output A for ♫-Ans and (A, S) for ♫-Full. We implement this with BART-large (Lewis et al., 2020a). The C-Only Model takes only C as input (no Q) and predicts (A, Ps) for ♫-Ans and (A, Ps, S) for ♫-Full. We implement this with an EE Longformer-Large model with an empty Q. The 1-Para Model, like Min et al. (2019a) and Chen and Durrett (2019), is similar to the SA model with K = 1. Instead of training the selector to rank all of Ps the highest, we train it to rank any paragraph containing the answer A the highest. The answerer then takes as input one selected paragraph p ∈ Ps and predicts an answer to Q based solely on p. This model can't access the full supporting information, as all considered datasets have at least 2 supporting paragraphs.

7.2.3 Cheatability Score

We compute the DiRe score of all datasets, which measures the extent to which a dataset can be cheated by strong models via Disconnected Reasoning (Trivedi et al., 2020). We report scores based on the SA model because it performed the best.

7.3 Human Performance

Apart from assessing the human performance level on ♫-Ans, as discussed in §6, we also obtain human performance on HQ and 2W. For a fair comparison, we use the same crowdsourcing workers, annotation guidelines, and interface across the 3 datasets. We sample 125 questions from each dataset, shuffle them all into one set, and obtain 3 annotations per question for answer and support.

To select the workers, we ran a qualification round where each worker was required to identify answer and support for at least 25 questions. We then selected workers who had more than 75 An and Sp scores on all datasets. Seven out of 15 workers were qualified for the rest of the validation.

                  HQ-20K         2W-20K         ♫-Ans
                  An     Sp      An     Sp      An     Sp
Human
  Score           84.5   92.5    83.2   99.3    78.0   93.9
  UB              91.8   96.0    89.0   100     88.6   97.3
Models
  RNN             51.0   82.4    52.7   94.9    13.6   41.9
  EE              72.9   94.3    72.9   97.6    42.3   67.6
  SA              74.9   94.6    79.5   99.0    47.3   72.3
  EX(EE)          —      —       79.8   97.5    45.6   77.8
  EX(SA)          —      —       71.2   98.1    49.8   79.2
Artifact Models
  1-Para          64.8   —       60.1   —       32.0   —
  C-only          18.4   67.6    50.1   92.0    3.4    0.0
  Q-only          19.6   —       27.0   —       4.6    —
DiRe Score        68.8   93.0    63.4   98.5    37.8   63.4

Table 4: Compared to the other datasets considered, ♫-Ans has a much larger human-model gap (higher gap between the top and middle sections), and is much less cheatable (lower scores in the bottom two sections).


8 Empirical Findings

We now discuss our findings, demonstrating that MuSiQue is a challenging multihop dataset that is harder to cheat on than existing datasets (§8.1) and that the steps in the MuSiQue construction pipeline are individually valuable (§8.2). Finally, we explore avenues for future work (§8.3).

For HQ and 2W, we report validation set performance. For ♫-Ans and ♫-Full, Table 5 reports test set numbers; all else is on the validation set.

8.1 MuSiQue is a Challenging Dataset

Compared to HQ and 2W, both variants of MuSiQue are less cheatable via shortcuts and have a larger human-to-model gap.

Higher Human–Model Gap. The top two sections of Table 4 show that ♫-Ans has a significantly higher human–model gap (computed as the Human Score minus the best model score) than the other datasets, for both answer and supporting paragraph identification. In fact, for both the other datasets, supporting paragraph identification has even surpassed the human score, whereas for ♫-Ans, there is a 14-point gap. Furthermore, ♫-Ans has a ∼27-point gap in answer F1, whereas HQ and 2W have a gap of only 10 and 5 points, respectively.

            ♫-Ans           ♫-Full
            An     Sp       An+Sf   Sp+Sf
Models
  EE        40.7   69.4     24.0    25.6
  SA        52.3   75.2     34.8    42.1
  EX(EE)    46.4   78.1     32.2    44.2
  EX(SA)    49.0   80.6     32.2    44.3
Artifact Models
  1-Para    35.7   —        2.3     —
  C-only    3.7    0.0      1.6     1.1
  Q-only    4.6    —        0.0     —

Table 5: ♫-Full is harder (top section) and less cheatable (bottom section) than ♫-Ans. Note: ♫-Full has a stricter metric that operates over instance pairs (§7.1: Metrics).

Our best model, EX(SA), scores 57.9, 47.9, and 28.1 answer F1 on 2, 3, and 4-hop questions of ♫-Ans, respectively. The EE model, on the other hand, stays around 42% irrespective of the number of hops.

Lower Cheatability. The 3rd section of Table 4 shows that the performance of artifact-based models (§7.2.2) is much higher on HQ and 2W than on ♫-Ans. For example, the 1-Para model achieves 64.8 and 60.1 answer score on HQ and 2W, respectively, but only 32.0 on ♫-Ans. Support identification in both datasets can be done to a surprisingly high degree (67.6 and 92.0 F1) even without the question (C-only model), but fails on ♫-Ans.9

Similarly, the last row of Table 4 shows that the DiRe answer scores of HQ and 2W (68.8 and 63.4) are high, indicating that even disconnected reasoning (bypassing reasoning steps) can achieve such high scores. In contrast, this number is significantly lower (37.8) for ♫-Ans.

These results demonstrate that ♫-Ans is significantly less cheatable via shortcut-based reasoning.

MuSiQue-Full: Even More Challenging. Table 5 shows that ♫-Full is significantly more difficult and less cheatable than ♫-Ans.

Intuitively, because the answerable and unanswerable instances are very similar but have different labels, it's difficult for models to do well on both instances if they learn to rely on shortcuts (Kaushik et al., 2019; Gardner et al., 2020). All artifact-based models barely get any An+Sf or Sp+Sf score. For all multihop models too, the An drops by 14–17 pts and Sp by 33–44 pts.

9 Even when ♫-Ans is modified to have 10 paragraphs like HQ, the C-only support score remains low; cf. Table 7.


           1-Para        C-only        EE
           An    Sp      An    Sp     An    Sp
♫          32.0  —       3.4   0.0    42.3  67.6
♫ \ DF     59.2  —       8.6   22.4   60.6  71.1
♫ \ RL     85.1  —       69.5  42.3   87.3  79.3

Table 6: The Disconnection Filter (DF, step 3) and Reduced Train-Test Leakage (RL, step 5) of the MuSiQue pipeline are crucial for its difficulty (EE model) and lower cheatability (1-Para and C-only models).


8.2 Dataset Construction Steps are Valuable

Next, we show that the key steps of our dataset construction pipeline (§5) are valuable.

Disconnection Filter (step 3). To assess the effect of the Disconnection Filter (DF), we ablate it from the pipeline, that is, we skip the filtering of composable 2-hop questions down to connected 2-hop questions. As we don't have human-generated composed questions for the resulting questions, we use a seq2seq BART-large model that's trained (using MuSiQue) to compose questions from an input decomposition DAG. For a fair comparison, we randomly subsample a train set from the ablated pipeline to be of the same size as the original train set.

Table 6 shows that DF is crucial for increasing the difficulty and reducing the cheatability of the dataset. Without DF, both multihop and artifact-based models do much better on the resulting datasets.

Reduced Train-Test Leakage (step 5). To assess the effect of Reduced train-test Leakage (RL), we create a dataset the traditional way, with a random partition into train, validation, and test splits. For uniformity, we ensure the distribution of 2–4 hop questions in the development set of the resulting dataset from both ablated pipelines remains the same as in the original development set. As in the DF ablation, we also normalize train set sizes.

Table 6 shows that without a careful split, the dataset is highly solvable by multihop models (An = 87.3). Importantly, most of this high score can also be achieved by artifact-based models: 1-Para (An = 85.1) and C-only (An = 69.5), revealing the high cheatability of such a split.

Ctxt    Corpus   1-Para        C-only        EE
                 An    Sp      An    Sp     An    Sp
10      FW       42.5  —       12.5  77.7   57.2  87.6
10      PD       28.0  —       5.5   34.6   54.1  80.2
20      FW       41.7  —       12.4  66.4   50.3  80.8
20 (♫)  PD       32.0  —       3.4   0.0    42.3  67.6

Table 7: Positive Distractors (PD) are more effective than using Full Wikipedia (FW) for choosing distractors, as shown by the lower scores of models. The effect of using PD is more pronounced when combined with the use of 20 (rather than 10) distractor paragraphs.

Harder Distractors (step 6). To assess the effect of distractors in ♫-Ans, we create 4 variations. Two vary the number of distractors: (i) 10 paragraphs and (ii) 20 paragraphs; and two vary the source: (i) Full Wikipedia (FW)10 and (ii) gold context paragraphs from the good single-hop questions from step S1. We refer to the last setting as positive distractors (PD), as these paragraphs are likely to appear as supporting (positive) paragraphs in our final dataset.

Table 7 shows that all models find PD significantly harder than FW. In particular, PD makes support identification extremely difficult for C-only, whereas Table 4 showed that C-only succeeds on HQ and 2W to a high degree (67.6 and 92.0 Sp). This would have also been true for ♫-Ans (66.4 Sp) had we used Wikipedia as the distractor construction corpus like HQ and 2W. This underscores the value of selecting the right corpus for distractor selection, and of ensuring distributional shift can't be exploited to bypass reasoning.11

Second, using 20 paragraphs instead of 10 makes the dataset more difficult and less cheatable. Interestingly, the effect is stronger if we use PD, indicating the synergy between the two approaches to creating challenging distractors.

10 We used the Wikipedia corpus from Petroni et al. (2021).
11 Our single-hop datasets are Wikipedia-based, and we ensured retrieved contexts from FW are 20–300 words, like PD.

8.3 Potential Avenues for Improvement

Better Decomposition. We train our EX(SA) model using ground-truth decompositions. On ♫-Ans, (An, Sp) improve by (9.4, 7.3) points,


and on ♫-Full, (An+Sf, Sp+Sf) improve by (7.3, 6.9) points. The improvements with the EX(EE) model are slightly lower. This shows that although improving question decomposition will be helpful, it's insufficient to reach human parity on the dataset.

Better Transformer. While Longformer can fit a long context, there are arguably more effective pretrained transformers for shorter input, for example, T5. Moreover, since T5 uses relative position embeddings, it can be used for longer text, although at a significant memory and computation cost. We managed to train SA with T5-large on MuSiQue,12 but didn't use it for the rest of our experiments because of the high computational cost. Over Longformer SA, T5 SA showed a modest improvement of (6.1, 0.7) on ♫-Ans and (1.7, 2.0) on ♫-Full.

9 Conclusion

Constructing multihop datasets is a tricky process. It can introduce shortcuts and artifacts that models can exploit to circumvent the need for multihop reasoning. A bottom–up process of constructing multihop questions from single-hop questions allows systematic exploration of a large space of multihop candidates and greater control over which questions we compose. We showed how to use such a carefully controlled process to create a challenging dataset that, by design, requires connected reasoning, by reducing potential reasoning shortcuts, minimizing train-test leakage, and including harder distractor contexts. Empirical results show that ♫-Ans has a substantially higher human-model gap and is significantly less cheatable via disconnected reasoning than previous datasets. The dataset also comes with unanswerable questions and question decompositions, which we hope spur further work in developing models that get right answers for the right reasons.

Acknowledgments

The authors thank the action editor and reviewers for their valuable feedback. This work was supported in part by the National Science Foundation under grant IIS-1815358.

12 SA worked best with 7 selected paragraphs, where the answerer (T5) had to process ∼1100 wordpieces on average.

References

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.

Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In NAACL-HLT. https://doi.org/10.18653/v1/N19-1405

Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Wang. 2020. HybridQA: A dataset of multi-hop question answering over tabular and textual data. In Findings of EMNLP 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.91

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL.

Hady ElSahar, P. Vougiouklis, Arslen Remaci, C. Gravier, Jonathon S. Hare, F. Laforest, and E. Simperl. 2018. T-REx: A large scale alignment of natural language with knowledge base triples. In LREC.

James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi. 2020. IIRC: A dataset of incomplete information reading comprehension questions. In EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.86

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, A. Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets. In Findings of EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.117


Matt Gardner, Joel Grus, Mark Neumann, Oyvind
Tafjord, Pradeep Dasigi, Nelson F. 刘,
Matthew Peters, Michael Schmitz, and Luke S.
Zettlemoyer. 2017. AllenNLP: A deep se-
mantic natural language processing platform.
arXiv 预印本 arXiv:1803.07640. https://
doi.org/10.18653/v1/W18-2501

Mor Geva, Daniel Khashabi, Elad Segal, Tushar
Khot, Dan Roth, and Jonathan Berant. 2021.
Did Aristotle use a laptop? A question an-
swering benchmark with implicit reasoning
策略. 处理. https://doi.org/10
.1162/tacl_a_00370

Dirk Groeneveld, Tushar Khot, Mausam,
2020. A sim-
and Ashish Sabharwal.

ple yet strong pipeline for HotpotQA.
EMNLP. https://doi.org/10.18653
/v1/2020.emnlp-main.711

Xanh Ho, A. 阮, Saku Sugawara, 和
Akiko Aizawa. 2020. Constructing a multi-hop
QA dataset for comprehensive evaluation of
reasoning steps. In COLING.

Matthew Honnibal, Ines Montani, Sofie Van
Landeghem, and Adriane Boyd. 2020. SpaCy:
Industrial-strength natural language processing
in Python. https://doi.org/10.5281
/zenodo.1212303

彼得

Jansen, Elizabeth Wainwright, Steven
Marmorstein, and Clayton Morrison. 2018.
WorldTree: A corpus of explanation graphs
for elementary science questions
支持-
ing multi-hop inference. 在诉讼程序中
the Eleventh International Conference on
语言资源与评估 (LREC
2018), Miyazaki, 日本. European Language
Resources Association (ELRA).

Yichen Jiang, Shikha Bordia, Zheng Zhong,
Charles Dognin, Maneesh Singh, and Mohit
Bansal. 2020. HoVer: A dataset for many-hop
fact extraction and claim verification. In Findings of
EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.309

Divyansh Kaushik, Eduard Hovy, and Zachary
Lipton. 2019. Learning the difference that
makes a difference with counterfactually-
augmented data. In ICLR.

Divyansh Kaushik and Zachary C. Lipton. 2018.
How much reading does reading comprehension
require? A critical investigation of
popular benchmarks. In EMNLP.
https://doi.org/10.18653/v1/D18-1546

Daniel Khashabi, Snigdha Chaturvedi, Michael
Roth, Shyam Upadhyay, and Dan Roth. 2018.
Looking beyond the surface: A challenge set
for reading comprehension over multiple
sentences. In NAACL. https://doi.org/10.18653/v1/N18-1023

Daniel Khashabi, Sewon Min, Tushar Khot,
Ashish Sabharwal, Oyvind Tafjord, Peter
Clark, and Hannaneh Hajishirzi. 2020.
UnifiedQA: Crossing format boundaries
with a single QA system. In Findings of
EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.171

Tushar Khot, Peter Clark, Michal Guerquin, Peter
Jansen, and Ashish Sabharwal. 2020. QASC:
A dataset for question answering via sentence
composition. In AAAI. https://doi.org/10.1609/aaai.v34i05.6319

Tushar Khot, Daniel Khashabi, Kyle Richardson,
Peter Clark, and Ashish Sabharwal. 2021.
Text modular networks: Learning to de-
compose tasks in the language of existing
型号. In NAACL. https://doi.org/10
.18653/v1/2021.naacl-main.99

Tom Kwiatkowski, Jennimaria Palomaki,
Olivia Redfield, Michael Collins, Ankur P.
Parikh, Chris Alberti, Danielle Epstein, Illia
Polosukhin, Jacob Devlin, Kenton Lee,
Kristina Toutanova, Llion Jones, Matthew
Kelcey, Ming-Wei Chang, Andrew M. Dai,
Jakob Uszkoreit, Quoc V. Le, and Slav
Petrov. 2019. Natural Questions: A benchmark
for question answering research. TACL,
7:453–466. https://doi.org/10.1162/tacl_a_00276

Omer Levy, Minjoon Seo, Eunsol Choi, and
Luke Zettlemoyer. 2017. Zero-shot relation
extraction via reading comprehension.
In CoNLL. https://doi.org/10.18653/v1/K17-1034

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed,
Omer Levy, Veselin Stoyanov, and Luke
Zettlemoyer. 2020a. BART: Denoising
sequence-to-sequence pre-training for natural
language generation, translation, and comprehension.
In ACL. https://doi.org/10.18653/v1/2020.acl-main.703

Patrick Lewis, Barlas Oğuz, Ruty Rinott,
Sebastian Riedel, and Holger Schwenk. 2020b.
MLQA: Evaluating cross-lingual extractive
question answering. In ACL.
https://doi.org/10.18653/v1/2020.acl-main.653

Patrick Lewis, Pontus Stenetorp, and Sebastian
Riedel. 2021. Question and answer test-train
overlap in open-domain question answering
datasets. In EACL. https://doi.org/10
.18653/v1/2021.eacl-main.86

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly optimized
BERT pretraining approach. arXiv preprint
arXiv:1907.11692.

Sewon Min, Eric Wallace, Sameer Singh,
Matt Gardner, Hannaneh Hajishirzi, and Luke
Zettlemoyer. 2019a. Compositional questions
do not necessitate multi-hop reasoning. In
ACL. https://doi.org/10.18653/v1/P19-1416

Sewon Min, Victor Zhong, Luke S. Zettlemoyer,
and Hannaneh Hajishirzi. 2019乙. Multi-hop
reading comprehension through question de-
composition and rescoring. In ACL.

Liangming Pan, Wenhu Chen, Wenhan Xiong,
Min-Yen Kan, and William Yang Wang. 2021.
Unsupervised multi-hop question answering by
question generation. In NAACL.

Adam Paszke, Sam Gross, Francisco Massa,
Adam Lerer, James Bradbury, Gregory
Chanan, Trevor Killeen, Zeming Lin, Natalia
Gimelshein, Luca Antiga, Alban Desmaison,
Andreas Kopf, Edward Yang, Zachary
DeVito, Martin Raison, Alykhan Tejani, Sasank
Chilamkurthy, Benoit Steiner, Lu Fang, Junjie
Bai, and Soumith Chintala. 2019. PyTorch:
An imperative style, high-performance deep
learning library. In NeurIPS, pages 8024–8035.

Fabio Petroni, Aleksandra Piktus, Angela Fan,
Patrick Lewis, Majid Yazdani, Nicola De Cao,
James Thorne, Yacine Jernite, Vassilis
Plachouras, Tim Rocktäschel, and Sebastian
Riedel. 2021. KILT: A benchmark for knowledge
intensive language tasks. In NAACL.
https://doi.org/10.18653/v1/2021.naacl-main.200

Peng Qi, Haejun Lee, Oghenetegiri ``TG'' Sido,
and Christopher D. Manning. 2021. Answering
open-domain questions of varying reasoning
steps from text. In EMNLP.

Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018. Know what you don’t know: Unanswer-
able questions for SQuAD. In ACL. https://
doi.org/10.18653/v1/P18-2124

Pranav Rajpurkar, Jian Zhang, Konstantin
Lopyrev, and Percy Liang. 2016. SQuAD:
100,000+ questions for machine comprehension
of text. In EMNLP. https://doi.org/10.18653/v1/D16-1264

Alon Talmor and Jonathan Berant. 2018. The web
as a knowledge-base for answering complex
questions. In NAACL. https://doi.org/10.18653/v1/N18-1059

Alon Talmor, Ori Yoran, Amnon Catav, Dan
Lahav, Yizhong Wang, Akari Asai, Gabriel
Ilharco, Hannaneh Hajishirzi, and Jonathan
Berant. 2021. MultiModalQA: Complex question
answering over text, tables and images. In
ICLR.

Harsh Trivedi, Niranjan Balasubramanian, Tushar
Khot, and Ashish Sabharwal. 2020. Is multi-
hop QA in DiRe condition? Measuring and
reducing disconnected reasoning. In EMNLP.
https://doi.org/10.18653/v1/2020
.emnlp-main.712

Ming Tu, Kevin Huang, Guangtao Wang, Jing
黄, Xiaodong He, and Bowen Zhou. 2020.
Select, answer and explain: Interpretable mul-
tihop reading comprehension over multiple
文件. In AAAI.

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In NeurIPS,
pages 5998–6008.

Johannes Welbl, Pontus Stenetorp, and Sebastian
Riedel. 2018. Constructing datasets for multi-hop
reading comprehension across documents.
TACL, 6:287–302. https://doi.org/10.1162/tacl_a_00021

Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Rémi Louf,
Morgan Funtowicz, and Jamie Brew. 2019.
HuggingFace's Transformers: State-of-the-art
natural language processing. ArXiv,
abs/1910.03771. https://doi.org/10.18653/v1/2020.emnlp-demos.6

Tomer Wolfson, Mor Geva, Ankit Gupta, Matt
Gardner, Yoav Goldberg, Daniel Deutch, and
Jonathan Berant. 2020. Break it down: A question
understanding benchmark. TACL.
https://doi.org/10.1162/tacl_a_00309

Ledell Wu, Fabio Petroni, Martin Josifoski,
Sebastian Riedel, and Luke Zettlemoyer. 2020.
Zero-shot entity linking with dense entity
恢复. In EMNLP.

Zhilin Yang, Peng Qi, Saizheng Zhang,
Yoshua Bengio, William W. Cohen, Ruslan
Salakhutdinov, and Christopher D. Manning.
2018. HotpotQA: A dataset for diverse,
explainable multihop question answering. In
EMNLP. https://doi.org/10.18653/v1/D18-1259

Ori Yoran, Alon Talmor, and Jonathan Berant.
2021. Turning tables: Generating examples
from semi-structured tables for endowing lan-
guage models with reasoning skills. arXiv
preprint arXiv:2107.07261.
