♫ MuSiQue: Multihop Questions via Single-hop Question Composition
Harsh Trivedi* Niranjan Balasubramanian* Tushar Khot† Ashish Sabharwal†
*Stony Brook University, Stony Brook, USA
{hjtrivedi,niranjan}@cs.stonybrook.edu
†Allen Institute for AI, Seattle, USA
{tushark,ashishs}@allenai.org
Abstract
Multihop reasoning remains an elusive goal as existing multihop benchmarks are known to be largely solvable via shortcuts. Can we create a question answering (QA) dataset that, by construction, requires proper multihop reasoning? To this end, we introduce a bottom–up approach that systematically selects composable pairs of single-hop questions that are connected, that is, where one reasoning step critically relies on information from another. This bottom–up methodology lets us explore a vast space of questions and add stringent filters as well as other mechanisms targeting connected reasoning. It provides fine-grained control over the construction process and the properties of the resulting k-hop questions. We use this methodology to create MuSiQue-Ans, a new multihop QA dataset with 25K 2–4 hop questions. Relative to existing datasets, MuSiQue-Ans is more difficult overall (3× increase in human–machine gap), and harder to cheat via disconnected reasoning (e.g., a single-hop model has a 30-point drop in F1). We further add unanswerable contrast questions to produce a more stringent dataset, MuSiQue-Full. We hope our datasets will help the NLP community develop models that perform genuine multihop reasoning.1
1 Introduction
Multihop QA datasets are designed to support
the development and evaluation of models that
perform multiple steps of reasoning in order to
answer a question. Recent work, sin embargo, muestra
that on existing datasets, models often need not
even connect information across all supporting
hechos,2 because they can exploit reasoning short-
cuts and other artifacts to find the correct answers
and obtain high scores (Min et al., 2019a; Chen and
Durrett, 2019; Trivedi et al., 2020). Such short-
cuts arise from various factors, such as overly
specific sub-questions, train-test leakage, and in-
sufficient distractors. These factors allow models
to circumvent connected reasoning—they need
not read the context to find answers to previous
sub-question(s) or use these answers to answer the
later sub-questions that depend on them.
The left hand side of Fig. 1 illustrates an in-
stance of this problem in an actual question (q)
taken from the HotpotQA dataset (Yang et al.,
2018). This question has the over-specification is-
sue. At first glance, it appears to require a model to
identify Kurt Vonnegut as the author of Armaged-
don in Retrospect, and then use this information
to answer the final question about the famous
satire novel he authored. However, this framing of
the question is insufficient to enforce that models
must perform connected multihop reasoning to
arrive at the correct answer. A model can, in fact, find the correct answer to this question from the context without finding the answer to Q1. This
is because, even if a model does not know that
A1 refers to Kurt Vonnegut, there happens to be
only one person best known for a satirical novel
mentioned in the context.
Contrast this with the question on the right (Q’),
which cannot be answered by simply returning a
novel that someone was best known for. There are
three possible answers in the context and choosing
between them requires knowing which author is
referenced. This is a desirable multihop question
that requires connected reasoning.
1Code and datasets available at https://github.com/stonybrooknlp/musique.
2For example, they often don't even use information from one supporting fact to select another.
Prior work has characterized such reasoning, where a model arrives at the correct answer without using all supporting facts, as Disconnected Reasoning (Trivedi et al., 2020). While this characterization enables filtering or automatically transforming existing datasets (Trivedi et al., 2020), we ask a different question: How can we construct a new multihop dataset that, by design, enforces connected reasoning?
We make two main contributions towards this:
1) A new dataset construction approach: We introduce a bottom–up process for building challenging multihop reading comprehension QA datasets by carefully selecting and composing single-hop questions obtained from existing datasets. The key ideas behind our approach are: (i) Composing multihop questions from a large collection of single-hop questions, which allows a systematic exploration of a vast space of candidate multihop questions. (ii) Applying a stringent set of filters that ensure no sub-question can be answered without finding the answer to the previous sub-questions it is connected to (a key property we formally define as part of the MuSiQue condition, Eqn. (2)). (iii) Reducing train-test leakage at the level of each single-hop question, thereby mitigating the impact of simple memorization tricks. (iv) Adding distractor contexts that cannot be easily identified. (v) Creating unanswerable multihop questions at the sub-question level.
2) A new challenge dataset and empirical analysis: We build a new multihop QA dataset, MuSiQue-Ans (abbreviated as ♫-Ans), with ∼25K 2–4 hop questions with six different composition structures (cf. Table 1). We demonstrate that ♫-Ans is more challenging and less cheatable than two prior multihop reasoning datasets, HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020). In particular, it has 3× the human–machine gap, and a substantially lower disconnected reasoning (DiRe) score, which captures the extent to which a dataset can be cheated via disconnected reasoning (Trivedi et al., 2020). We also show how various features of our dataset construction pipeline help increase dataset difficulty and reduce cheatability. Lastly, by incorporating the notion of insufficient context (Rajpurkar et al., 2018; Trivedi et al., 2020), we also release a variant of our dataset, ♫-Full, having ∼50K multihop questions that form contrasting pairs (Kaushik et al., 2019; Gardner et al., 2020) of answerable and unanswerable questions. ♫-Full is even more challenging and harder to cheat on.
We hope our bottom–up multihop dataset construction methodology and our challenging datasets with a mixed number of hops will help develop proper multihop reasoning systems and decomposition-based models.
Figure 1: Generating connected multihop questions by composing carefully chosen pairs of single-hop questions. Left: A HotpotQA question that would have been filtered out by our approach for not requiring connected reasoning; it can be answered using just Q2 without knowing the answer to Q1 (since there is only one person mentioned in the context as being best known for a satirical novel). Right: A connected question that forces models to reason through both intended hops (since there are multiple people mentioned in the context as being best known for some novel).
2 Related Work
Multihop QA. ♫-Ans is closest to HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020). HotpotQA was constructed by directly crowdsourcing 2-hop questions without considering the difficulty of composition and has been shown to be largely solvable without multihop reasoning (Min et al., 2019a; Chen and Durrett, 2019; Trivedi et al., 2020). While 2WikiMultihopQA was also constructed via composition, it uses a limited set of hand-authored compositional rules, making it easy for large language models. We show that ♫-Ans is harder and less cheatable than both of these. Other multihop datasets (Khashabi et al., 2018; Dua et al., 2019, inter alia) focus on different challenges such as multiple modalities (Chen et al., 2020; Talmor et al., 2021), open-domain QA (Geva et al., 2021;
Khot et al., 2020), fact verification (Jiang et al., 2020), science explanations (Jansen et al., 2018), and relation extraction (Welbl et al., 2018), among others. Extending our ideas to these challenges is an interesting avenue for future work.
Table 1: The six reasoning graph shapes (2-hop to 4-hop) present in MuSiQue, along with sample questions.
Unanswerable QA. Prior works have used
unanswerable questions for robust reasoning in
single-hop (Rajpurkar et al., 2018) and multihop
(Ferguson et al., 2020; Trivedi et al., 2020) settings. IIRC (Ferguson et al., 2020) focuses on open-domain QA where the unanswerable questions are identified by crowdsourcing questions for which relevant knowledge couldn't be retrieved from Wikipedia. Our idea to make unanswerable
multihop questions by removing support para-
graphs is most similar to Trivedi et al. (2020).
While they rely on annotations (potentially in-
complete) to identify these support paragraphs,
we can use the bridge entities to remove any po-
tential support paragraphs (containing the bridge
entidad) and better ensure unanswerability.
Question Decomposition and Composition.
Multihop QA datasets have been decomposed
into simpler questions (Min et al., 2019b; Talmor
and Berant, 2018) and special meaning representations (Wolfson et al., 2020). Our dataset creation pipeline naturally provides question decompositions, which can help develop interpretable models (Min et al., 2019b; Khot et al., 2021).
Recent work has also used bottom–up approaches to create multihop questions (Pan et al., 2021; Yoran et al., 2021) using rule-based methods. However, their primary goal was data augmentation to improve on downstream datasets. The questions themselves haven't been shown to be challenging or less cheatable.
3 Multihop Reasoning Desiderata
Multihop question answering can be seen as a se-
quence of inter-dependent reasoning steps leading
to the answer. In its most general form, these rea-
soning steps and the dependencies can be viewed as a directed acyclic graph (DAG), GQ. Each node qi in this graph represents a reasoning step or a "hop", for example, a single-hop question in multihop QA or a KB relation traversal in graph-based KBQA. An edge (qj, qi) ∈ edges(GQ) indicates that the reasoning step qi relies critically on the output of the predecessor step qj. For example, in Fig. 1, the single-hop question Q2′ depends on the answer to Q1′, and the graph GQ′ is a linear chain Q1′ → Q2′.
Given this framing, a key desirable property
for multihop reasoning is connected reasoning:
Performing each step qi correctly should require
the output of all its predecessor steps qj.
Analytical Intuition: Suppose a model M can
answer each qi correctly with probability p, y
it can also answer qi without the output of all
its predecessor steps with probability r ≤ p.
Por simplicidad, we assume these probabilities are
independent across various qi. M can correctly
answer a k-hop question Q by identifying and
performing all its k reasoning steps. This will suc-
ceed with probability at most pk. Alternativamente,
as an extreme case, it can ‘‘cheat’’ by identifying
and performing only the last step qk (the ‘‘end
question’’) without considering the output of qk−1
(or other steps) en absoluto. This could succeed with
probability as much as r, which does not decrease
with k and is thus undesirable when constructing
multihop datasets. Our goal is to create multihop questions that enforce connected reasoning, that is, where r ≪ p and, in particular, r < pk, so that models have an incentive to perform all k reasoning steps.
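As a concrete illustration, the following small sketch contrasts the two success probabilities (p and r are assumed, purely illustrative values, not numbers from this paper):

    # Illustrative sketch: probability of performing all k hops (p**k) vs.
    # answering only the end question (r). The values of p and r are assumed.
    def connected_success(p: float, k: int) -> float:
        return p ** k

    def shortcut_success(r: float) -> float:
        return r

    p, r = 0.8, 0.6
    for k in (2, 3, 4):
        print(k, round(connected_success(p, k), 3), shortcut_success(r))
    # k=2: 0.64 vs 0.60; k=3: 0.512 vs 0.60 -- the shortcut overtakes connected
    # reasoning, which is why a dataset should instead enforce r < p**k.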
Not surprisingly, the connected reasoning prop-
erty is often not satisfied by existing datasets (Min
et al., 2019a; Chen and Durrett, 2019; Trivedi
et al., 2020), and never optimized for during
dataset construction. As a consequence, models
are able to exploit artifacts in existing datasets
that allow them to achieve high scores while
bypassing some of the reasoning steps,
thus
negating the main purpose of building multi-
hop datasets. Prior work (Trivedi et al., 2020)
has attempted to measure the extent of connected
reasoning in current models and datasets. How-
ever, due to the design of existing datasets, this
approach is only able to measure this by ab-
lating the pre-requisites of each reasoning step,
namely, the supporting facts. Rather than only
measure, we propose a method to construct
multihop QA datasets that directly optimize for
this condition.
Consider question Q on the left-hand side of
Fig. 1. It can be answered in two steps, Q1 and
Q2. However, the information in Q2 itself is suf-
ficient to uniquely identify A2 from the context,
even without considering A1. That is, while there
is an intended dependency between Q1 and Q2,
Q2 can be answered correctly without requiring
the output of its predecessor question Q1. Our
approach constructs multihop questions that pre-
vent this issue, and thereby require the desired
connected reasoning. Specifically, we carefully
choose which single-hop questions to compose
and what context to use such that each constituent
single-hop question necessitates the answers from
one or more previous questions.
4 Connected Reasoning via Composition
The central issue we want to address is ensur-
ing connected reasoning. Our solution is to use a
bottom–up approach where we compose multihop
questions from a large pool of single-hop ques-
tions. As we show later, this approach allows us
to explore a large space of multihop questions and
carefully select ones that require connected rea-
soning. Additionally, with each multihop question,
we will have associated constituent questions,
their answers and supporting paragraphs, which
can help develop more interpretable models. Here
we describe the high-level process and describe
the specifics in the next section.
4.1 Multihop via Single-Hop Composition
As mentioned earlier, multihop questions can be
viewed as a sequence of reasoning steps where
answer from one reasoning step is used to iden-
tify the next reasoning step. Therefore, we can
use single-hop questions containing answers from
other questions to construct potential multihop
questions. For example, in Fig. 1, Q2’ mentions
A1’, and hence single-hop questions Q1’ and Q2’
can be composed to create a DAG Q1(cid:6) → Q2(cid:6)
and multihop question Q’ (right). Concretely, to
create a multihop question from two single-hop
questions, we have a composability criterion: Two
single-hop question answer tuples (q1, a1) and
(q2, a2) are composable into a multihop question
Q with a2 as a valid answer if a1 is a named
entity and it is mentioned in q2. See §5:S2 for
detailed criteria.
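The criterion can be sketched as a simple predicate (a sketch only: `is_named_entity` is an assumed helper, e.g., a wrapped NER tagger, and plain substring matching stands in for the mention-matching checks detailed in §5:S2):

    def is_composable(q1: str, a1: str, q2: str, a2: str, is_named_entity) -> bool:
        """(q1, a1) and (q2, a2) compose into a 2-hop question answered by a2
        if a1 is a named entity mentioned in q2; step S2 additionally requires
        that a2 does not leak into q1."""
        return is_named_entity(a1) and (a1 in q2) and (a2 not in q1)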
This process of composing multihop questions
can be chained together to form candidate reason-
ing graphs of various shapes and sizes (examples
in Table 1). Formally, each multihop question Q
has an underlying DAG GQ representing the com-
position of the single-hop questions q1, q2, . . . , qn,
which form the nodes of GQ. A directed edge
(qj, qi) indicates that qi depends on the answer of
the previous sub-question qj. ai is the answer to
qi, and thereby, an is the answer to Q.
4.2 Ensuring Connected Reasoning
Given the graph GQ associated with a question Q,
ensuring connected reasoning requires ensuring
that for each edge (qj, qi) ∈ edges(GQ), arriving
at answer ai using qi, necessitates the use of aj.
In other words, without aj, there isn’t sufficient
information in qi to arrive at ai.
Figure 2: MuSiQue construction pipeline. The MuSiQue pipeline takes single-hop questions from existing datasets, explores the space of multihop questions that can be composed from them, and generates a dataset of challenging multihop questions that are difficult to cheat on. The MuSiQue pipeline also creates unanswerable multihop questions that make the final dataset significantly more challenging.
The existence of such information can be probed
by training a strong QA model M on subquestions
(qi) with the mention of their predecessor’s answer
(aj) masked out (removed). If, on held out data,
the model can identify a subquestion’s answer (ai)
without its predecessor’s answer (aj), we say the
edge (qj, qi) is disconnected. Formally, we say Q
requires connected reasoning if:
∀ (qj, qi) ∈ edges(GQ) : M(qi^{mj}) ≠ ai      (1)
where qi^{mj} denotes the subquestion formed from qi by masking out the mention of the answer aj.
Consider the masked questions Q2 and Q2’ in
Fig. 1. While Q2 can easily be answered without
answer A1, Q2’ can’t be answered without A1’
and Q’ hence satisfies condition (1).
4.3 Reading Comprehension Setting
While our proposed framework makes no as-
sumptions about the choice of the model, and is
applicable to open-domain setting, we focus on
the Reading Comprehension (RC) setting, where
we’ve a fixed set of paragraphs as context, C.
In a RC setting, apart from requiring the depen-
dence between the reasoning steps, we also want
the model to depend on the context to answer each
question. While this requirement seems unneces-
sary, previous works have shown that RC datasets
often have artifacts that allow models to pre-
dict the answer without the context (Kaushik and
Lipton, 2018) and can even memorize the answers
(Lewis et al., 2021) due to train-test leakage. As
we will show later, previous multihop RC datasets
543
can be cheated via such shortcuts. To ensure the
dependence between the question and context, we
modify the required condition in Eqn. (1) to:
∀ (qj, qi) ∈ edges(GQ) : M(qi^{mj}; C) ≠ ai
∧ ∀ qi ∈ nodes(GQ) : M(qi; φ) ≠ ai      (2)
In summary, we want multihop reading com-
prehension questions that satisfy condition (2) for
a strong trained model M . If it does, we say that
the question satisfies the MuSiQue condition.
Our dataset construction pipeline optimizes for
this condition as described next.
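Read as a filter, condition (2) can be sketched as below (a simplified rendering: `probe` and `mask` are assumed callables, and exact-match equality stands in for the F1 thresholds actually used in §5):

    def satisfies_musique_condition(edges, nodes, context, probe, mask):
        """Sketch of Eqn. (2). `edges` is a list of (q_j, a_j, q_i, a_i) tuples,
        `nodes` a list of (q_i, a_i) pairs, `probe(question, context)` a trained
        QA model returning an answer string, and `mask(q, a)` removes the
        mention of answer a from question q."""
        for q_j, a_j, q_i, a_i in edges:
            # Connected reasoning: with the predecessor's answer masked out,
            # the probe should fail even when given the full context.
            if probe(mask(q_i, a_j), context) == a_i:
                return False
        for q_i, a_i in nodes:
            # Context dependence: the question alone (empty context) should fail.
            if probe(q_i, None) == a_i:
                return False
        return True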
5 Dataset Construction Pipeline
The high-level schematic of the pipeline is shown
in Fig. 2. We begin with a large set of RC single-
hop questions from 5 English Wikipedia-based
datasets, SQuAD (Rajpurkar et al., 2016), Natu-
ral Questions (Kwiatkowski et al., 2019), MLQA
(en-en) (Lewis et al., 2020b), T-REx (ElSahar
et al., 2018), and Zero Shot RE (Levy et al.,
2017), where instances are of the form (qi, pi, ai)
referring to the question, the associated paragraph,
and the answer, respectively. For Natural Ques-
tions, as the context is very long (entire Wikipedia
page), we use the annotated long answer (usually
a paragraph) from the dataset as the context, and
the annotated short answer as the answer. Then,
we take the following two steps:
S1. Find Good Single-Hop Questions. Even a
tolerably small percentage of issues in single-hop
questions can compound into an intolerably large
percentage in the composed multihop questions.
To mitigate this, we first remove questions that
are likely annotation errors. Because manually
identifying such questions at scale is laborious,
we use a model-based approach. We remove the
questions for which none of five large trained
QA models3 can predict the associated answer
with > 0 answer F1. In addition, we remove (i) erroneous questions where the answer spans are not in the context, (ii) questions with < 20 word
context as we found them to be too easy, and (iii)
questions with > 300 word context to prevent final
multihop question context from being too long for
current long-range transformer models.
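The S1 filter amounts to the following sketch (the trained model ensemble and the F1 helper are assumed to be supplied; word counts here approximate the length limits described above):

    def keep_single_hop(question, context, answer, qa_models, answer_f1):
        """Sketch of the S1 filter: drop likely annotation errors as well as
        contexts that are too short or too long."""
        n_words = len(context.split())
        if answer not in context:            # answer span must occur in the context
            return False
        if n_words < 20 or n_words > 300:    # too easy / too long
            return False
        # keep only if at least one strong QA model recovers the answer (F1 > 0)
        return any(answer_f1(model(question, context), answer) > 0
                   for model in qa_models)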
S2. Find Composable Single-Hop Pairs. To create 2-hop questions, we first collect distinct single-hop question pairs with a bridge entity. Specifically, we find pairs (q1, p1, a1) and (q2, p2, a2) such that (i) a1 is a named entity also mentioned in q2, (ii) a2 is not in q1, and (iii) p1 ≠ p2. Such pairs can be combined to form a 2-hop question (Q, {p1, p2}, a2). To ensure that the mentions (a1 and its occurrence in q2, denoted e2) refer to the same entity, we ensure: 1. The spaCy entity tagger (Honnibal et al., 2020) tags a1 and e2 as entities of the same type. 2. A Wikipedia search with a1 and e2 returns an identical 1st result. 3. A state-of-the-art (SOTA) Wikification model (Wu et al., 2020) returns the same result for a1 and e2. At a later step (S7), when humans write composed questions from DAGs, they get to remove questions containing erroneous pairs. Only 8% of the pairs are pruned in that step, indicating that step S2 is quite effective.
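These three checks can be read as a conjunction, sketched below with the external tools abstracted behind assumed callables (spaCy, a Wikipedia search, and a wikification model are not re-implemented here):

    def same_bridge_entity(a1: str, e2: str, ner_type, wiki_top_hit, wikify) -> bool:
        """Sketch of the S2 mention-consistency checks: a1 (answer of q1) and
        e2 (its mention inside q2) must agree on entity type, top Wikipedia
        search result, and wikified entity."""
        return (ner_type(a1) == ner_type(e2)
                and wiki_top_hit(a1) == wiki_top_hit(e2)
                and wikify(a1) == wikify(e2))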
S3. Filter Disconnected Single-Hop Pairs. We want connected 2-hop questions—questions that cannot be answered without using the answers of the constituent single-hop questions. The MuSiQue condition (2) states that for a 2-hop question to be connected, each sub-question qi should not be correctly answerable without its context (M(qi; φ) ≠ ai), and the tail question q2 should not be correctly answerable when a1 is removed from it (M(q2^{m1}; C) ≠ a2). Accordingly, we use a two-step filtering process to find connected 2-hop questions. For simplicity, and because the second condition already filters some tail questions, our current implementation enforces the first condition only on the head question, q1.
3Two random-seed variants of RoBERTa-large (Liu et al., 2019), two random-seed variants of Longformer-Large (Beltagy et al., 2020), and one UnifiedQA (Khashabi et al., 2020).
Filtering Head Nodes: We collect all questions that appear at least once as the head of a composable 2-hop question (q1) to create a set of head nodes. We create 5-fold train-test splits of this set and train two Longformer-Large models (different seeds) per split (train on three, validate and test on one). We generate answer predictions using the 2 models on their corresponding test splits, resulting in 2 predictions per question. We accept a head question if, on average, the predicted answers' word overlap (computed using answer F1) with the answer label is < 0.5.
Filtering Tail Nodes: We create a unique set
of masked single-hop questions that occur as a
tail node (q2) in any composable 2-hop question.
If the same single-hop question occurs in two
2-hop questions with different masked entities,
they both are added to the set. We combine the
gold-paragraph with 9 distractor paragraphs (re-
trieved4 using the question without the masked
entities as query). As before, we create 5-fold
train-test splits and use 2 Longformer-Large mod-
els to obtain 2 answer and support predictions.
We accept a tail question if either mean answer
F1 ≤ 0.25, or if it’s ≤ 0.75 and mean support
F1 < 1.0.
The thresholds for head and tail node filtering
were chosen via a manual inspection of a few
predictions in various ranges of the parameters,
and gauging at what F1 values does the model’s
answer semantically match the correct answer
(e.g., ‘‘Barack Obama’’ and ‘‘President Barack
Obama’’ overlap with 0.8 answer F1). Control-
ling these thresholds provides a way to trade off
between the degree of cheatability allowed in the
dataset and the size of the final dataset. We aim
to limit cheatability while retaining a reasonable
dataset size.
Finally, only 2-hop questions for which both
head and tail node are acceptable are kept. We call
this process Disconnection Filtering.
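Taken together, the acceptance rules of this step reduce to the thresholds below (a sketch; the mean F1 values are the averaged cross-validated model predictions described above):

    def accept_head(mean_answer_f1: float) -> bool:
        # Head question q1 is kept only if models largely fail to answer it
        # without any context (mean predicted-answer overlap below 0.5).
        return mean_answer_f1 < 0.5

    def accept_tail(mean_answer_f1: float, mean_support_f1: float) -> bool:
        # Tail question q2 (with a1 masked) is kept if its answer is essentially
        # unrecoverable, or only partially recoverable with imperfect support.
        return mean_answer_f1 <= 0.25 or (mean_answer_f1 <= 0.75 and mean_support_f1 < 1.0)

    def accept_2hop(head_ans_f1, tail_ans_f1, tail_sup_f1) -> bool:
        # A composable pair survives Disconnection Filtering only if both
        # its head and its tail node are acceptable.
        return accept_head(head_ans_f1) and accept_tail(tail_ans_f1, tail_sup_f1)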
S4. Build Multihop Questions. We now have
a set of connected 2-hop questions, which form
directed edges of a graph. Any subset DAG of
it can be used to create a connected multihop
question. We use 6 types of reasoning graphs with
2–4 hops as shown in Table 1. To avoid very
long questions, we limit single-hop questions to
≤ 10 tokens, the total length of questions in 2- and 3-hop compositions to ≤ 15 tokens, and in 4-hop compositions to ≤ 20 tokens. To ensure diversity, we (1) cap the reuse of bridging entities and single-hop questions at 25 and 100 multihop questions respectively, and (2) remove any n-hop question that is a subset of any m-hop question (m > n > 1).
4We use the BM25 algorithm via Elasticsearch.
S5. Minimize Train-Test Leakage. We devise a procedure to create train, validation, and test splits such that models cannot achieve high scores via memorization enabled by train-test leakage, an issue observed in some existing datasets (Lewis et al., 2021). Our procedure ensures that the train-
ing set has no overlap with validation or the
test sets, and tries to keep the overlap between
validation and test sets minimal.
We consider two multihop questions Qi and
Qj to overlap if any of the following are com-
mon between Qi and Qj: (i) single-hop question,
(ii) answer to any single-hop question, (iii) asso-
ciated paragraph to any single-hop question. A
minimize such overlap, we take a set of multihop
preguntas, greedily find a subset of given size (S)
which least overlaps with its complement (S’),
and then remove overlapping questions from S’,
to get train (S) and dev+test set (S’). Entonces, nosotros
split dev+test to dev and test similarly. We ensure
the distribution of source datasets of single-hop
questions in train, dev and test are similar, y
also control the proportion of 2–4 hop questions.
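One plausible rendering of this greedy procedure is sketched below (the exact heuristic is not spelled out above, and the balancing of source datasets and hop counts is omitted; field names are assumptions):

    def overlaps(qa, qb):
        """Two multihop questions overlap if they share a constituent single-hop
        question, a single-hop answer, or an associated paragraph."""
        return bool(qa["subqs"] & qb["subqs"]
                    or qa["answers"] & qb["answers"]
                    or qa["paragraphs"] & qb["paragraphs"])

    def greedy_split(questions, train_size):
        """Grow a train set that overlaps as little as possible with its
        complement, then drop from the complement anything still overlapping."""
        train, rest = [], list(questions)
        while len(train) < train_size and rest:
            best = min(rest, key=lambda q: sum(overlaps(q, r) for r in rest if r is not q))
            rest.remove(best)
            train.append(best)
        rest = [q for q in rest if not any(overlaps(q, s) for s in train)]
        return train, rest   # `rest` is later split into dev and test the same way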
S6. Build Contexts for Questions. For an n-hop question, the context has 20 paragraphs containing: (i) the supporting paragraphs associated with its single-hop questions {p1, p2 . . . pn}, and (ii) distractor paragraphs retrieved using a query that is a concatenation of the single-hop questions from which all intermediate answer mentions are removed. To make distractor paragraphs harder to identify, we retrieve them from the set of gold paragraphs for the filtered single-hop questions (S1).
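The context assembly can be sketched as follows (`retrieve` is an assumed BM25 search over the S1 gold-paragraph pool; how the 20 paragraphs are finally ordered is left unspecified here):

    def build_context(support_paras, subquestions, intermediate_answers,
                      retrieve, total=20):
        """Sketch of S6: combine the supporting paragraphs with retrieved
        distractors drawn from the positive-distractor corpus."""
        query = " ".join(subquestions)
        for answer in intermediate_answers:       # hide bridge answers from the query
            query = query.replace(answer, " ")
        needed = total - len(support_paras)
        distractors = [p for p in retrieve(query, k=2 * total)
                       if p not in support_paras][:needed]
        return support_paras + distractors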
S7. Crowdsource Question Compositions. We
crowdsource question compositions on Amazon
MTurk, where workers composed coherent ques-
tions from our final DAGs of single-hop questions.
In the interface, workers could see a list of
single-hop questions with their associated para-
graphs and how they are connected via bridge
entities. They were first asked to check whether
all pairs of mentions of bridge entities indeed refer
to the same underlying entity. If they answered
‘yes’ for each pair,5 they were asked to compose a
natural language question ensuring that informa-
tion from all single-hop questions in the DAG is
usado, and the answer to the composed question is
the same as the last single-hop question. If they an-
swered ‘no’ for any of the pairs, we discarded that
question. Our tutorial provided them with several
handwritten good and bad examples for each of
the 2–4 hop compositions. Workers were encour-
aged to write short questions and make implicit
inferences when possible. They were allowed to
split questions into two sentences if needed.
We carried out a qualification round where 100
workers participated to perform the aforemen-
tioned task on 20 examples each. We manually
evaluated these annotations for correctness and co-
herence, and selected 17 workers to annotate the
full dataset. To ensure dataset quality, we carried
out crowdsourcing in 9 batches, reading 10–20
random examples from each worker after each
batch and sending relevant feedback via email, si
needed. Workers were paid 25, 40, and 60 cents
for each 2-, 3-, and 4-hop question, amounting to
∼15 USD per hour, totaling ∼11K USD.
We refer to the dataset at this stage as MuSiQue-Ans or ♫-Ans.
S8. Add Unanswerable Questions. For each
answerable multihop RC instance we create a cor-
responding unanswerable multihop RC instance
using a procedure similar to the one proposed in Trivedi et al. (2020). For a multihop question, we randomly sample one of its single-hop questions and make it unanswerable by ensuring that the answer to that single-hop question doesn't appear in any of the paragraphs in the context (apart from this requirement, the context is built as described in S6). Because one of the single-hop questions is unanswerable, the whole multihop question is unanswerable.
The task now is to predict whether the question is answerable, and to predict the answer and support if it is answerable. Given that the questions in an answerable and unanswerable pair are identical and the context changes only marginally, models that rely on shortcuts find this new task very difficult. We call the dataset at this stage MuSiQue-Full or ♫-Full, and both datasets together MuSiQue.
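A minimal sketch of this construction, assuming the candidate paragraphs are available as plain strings (helper and field names are illustrative, not the released code):

    import random

    def make_unanswerable(question, single_hops, paragraph_pool, total=20, rng=random):
        """Sketch of S8: sample one constituent single-hop question and build a
        context from which every paragraph containing its answer is excluded,
        so neither that sub-question nor the composed question is answerable.
        `single_hops` is a list of (sub_question, answer) pairs."""
        _, removed_answer = rng.choice(single_hops)
        context = [p for p in paragraph_pool if removed_answer not in p][:total]
        return {"question": question, "context": context, "answerable": False}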
5They answered yes 92% of the time, on average.
Final Dataset. The statistics for ♫-Ans (♫-Full has twice the number of questions in each
cell) are shown in Table 2. MuSiQue constitutes 21020 unique single-hop questions, 4132 answers to multihop questions, 19841 answers to single-hop questions, and 7676 supporting paragraphs. MuSiQue has 6 types of reasoning graphs and 2–4 hops (cf. Table 1 for examples).

          2-hop   3-hop   4-hop   Total (24,814)
Train     14376   4387    1175    19938
Dev       1252    760     405     2417
Test      1271    763     425     2459

Table 2: Dataset statistics of MuSiQue-Ans. MuSiQue-Full contains twice the number of questions in each category above—one answerable and one unanswerable.
In summary, our construction pipeline allows
us to produce a dataset with mixed hops, multi-
ple types of reasoning graphs, and unanswerable
sub-questions, all of which make for a more
challenging and less cheatable dataset (as we
will quantify in Section 8). Question decom-
posición, which is a natural outcome of our
construction pipeline, can also be used to aid
decomposition-based QA research (Min et al.,
2019b; Khot et al., 2021).
6 Dataset Quality Assessment
Quality of ♫-Ans. To assess the quality of ♫-Ans, we first evaluate how well humans can
answer questions in it. Note that we already have
gold answers and supporting paragraphs from our
construction pipeline. This goal is therefore not to
determine gold labels, but rather to measure how
well humans perform on the task treating our gold
labels as correct.
We sample 125 questions from the ♫-Ans validation and test sets, and obtain 3 annotations (answer
and supporting paragraphs) for each question.
We used Amazon MTurk,6 selecting crowdsource
workers as described in §7.3.
Workers were shown the question and all para-
graphs in the context, and were asked to highlight
the answer span and checkmark the supporting
paragraphs. Our interface allowed for searching,
sorting, and filtering the list of paragraphs easily
with interactive text-overlap-based search queries.
The instructions included worked out examples.
6https://www.mturk.com.
                Answer F1   Support F1
Human Score     78.0        93.9
Human UB        88.6        97.3
Human Agr       84.1        91.4

Table 3: Human performance (score and upper bound) and agreement on MuSiQue-Ans.
We compute human performance by compar-
ing against gold labels for answer and support in
two ways: 1) Human Score—the most frequent
answer and support among the three annotators
breaking ties at random (the strategy used by
Rajpurkar et al. (2018)), and 2) Human Upper Bound (UB)—the answer and support that maximizes the score (as done by Yang et al. (2018)).
In addition, to assess how well humans agree with each other (ignoring our gold labels), we also compute the Human Agreement (Agr) score (Rajpurkar et al., 2016; Yang et al., 2018). Specifically, we treat one of the 3 annotations, chosen randomly, as predicted, and evaluate it against the rest of the annotations, which are treated as correct.
Table 3 demonstrates that ♫-Ans is a high-quality dataset. Furthermore, as we will discuss in §7.3, we also compare our human performance with two other similar datasets (HotpotQA and 2WikiMultihopQA), and show that ♫-Ans is close to them under these metrics (§8).
Quality of ♫-Full. We perform an additional manual validation to assess the dataset quality of ♫-Full. Recall that ♫-Full shares the answerable questions with ♫-Ans, the only extra task in ♫-Full being determining the answerability of a question from the given context. To assess the validity of this task, we sampled 50 random instances from ♫-Full, and one of the authors determined the answerability of each question from its context. We found that in 45 out of 50 instances (90%) the human-predicted answerability matched the gold label, showing that ♫-Full is also a high-quality dataset.
Multihop Nature of MuSiQue. Finally, we assess the extent to which ♫-Ans satisfies the MuSiQue condition (Eqn. 2) for connected reasoning. To this end, we first estimate what percentage of head and tail questions in the validation set we would retain if we were to repeat our disconnection filtering procedure (S3) with models trained on the final training data. This captures
the fraction of the questions in ♫-Ans that satisfy the MuSiQue condition. We then compare it with the respective numbers from the original step S3. In the original disconnection filtering step, we retained only 26.5% of the tail questions, whereas we would have retained 79.0% of the tail questions had we filtered the final validation dataset. For head questions, we see a less dramatic but still significant effect—we originally retained 74.5% of the questions, and would now have retained 87.7% had we filtered the final validation set. This shows that vastly more questions in ♫-Ans satisfy the MuSiQue condition than what we started with.
7 Experimental Setup
7.1 Datasets
We compare our datasets (MuSiQue-Ans and MuSiQue-Full) with two similar multihop RC datasets: the distractor setting of HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020).7 Both datasets have 10 paragraphs as context. HQ and 2W have 2-hop and 2,4-hop questions, respectively. In addition, HQ has sentence-level support and 2W has entity-relation tuple support, but we don't use this annotation in our training or evaluation, for a fair comparison.
HQ, 2W, and ♫-Ans have 90K, 167K, and 20K training instances, respectively. For a fair comparison, we use equal-sized training sets in all our experiments, obtained by randomly sampling 20K instances each from HQ and 2W, referred to as HQ-20K and 2W-20K, respectively.
Notation. Instances in ♫-Ans, HQ, and 2W are of the form (Q, C; A, Ps). Given a question Q and a context C consisting of a set of paragraphs, the task is to predict the answer A and identify the supporting paragraphs Ps ∈ C. ♫-Ans additionally has a gold decomposition GQ (§3), which can be leveraged during training. Instances in ♫-Full are of the form (Q, C; A, Ps, S), where there is an additional binary classification task to predict S, the answerability of Q based on C, also referred to as context sufficiency (Trivedi et al., 2020).
Metrics. For ♫-Ans, HQ, and 2W, we report the standard F1-based metrics for answer (An) and support identification (Sp); see Yang et al. (2018) for details. To make a fair comparison across datasets, we use only paragraph-level support F1.
7For brevity, we use HQ, 2W, and ♫-Ans/Full to refer to HotpotQA, 2WikiMultihopQA, and MuSiQue-Ans/Full, respectively.
For ♫-Full, we follow Trivedi et al. (2020) to combine the sufficiency prediction S with An and Sp, denoted An+Sf and Sp+Sf. Instances in ♫-Full are evaluated in pairs. For each Q with a sufficient context C, there is a paired instance with Q and an insufficient context C′. For An+Sf, if a model incorrectly predicts context sufficiency (yes or no) for either of the instances in a pair, it gets 0 points on that pair. Otherwise, it gets the same An score on that pair as it gets on the answerable instance in that pair. Scores are averaged across all pairs of instances in the dataset. Likewise for Sp+Sf.
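The pairwise An+Sf computation can be sketched as follows (field names and the F1 helper are assumptions, not the released evaluation code; Sp+Sf is computed analogously):

    def answer_plus_sufficiency(pairs, answer_f1):
        """Each element of `pairs` holds the model's predictions on the
        answerable and the unanswerable version of the same question.
        A pair scores 0 unless sufficiency is predicted correctly on both
        instances; otherwise it scores the answer F1 on the answerable one."""
        total = 0.0
        for answerable, unanswerable in pairs:
            correct_sufficiency = (answerable["pred_sufficient"] is True
                                   and unanswerable["pred_sufficient"] is False)
            if correct_sufficiency:
                total += answer_f1(answerable["pred_answer"], answerable["gold_answer"])
        return total / len(pairs)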
7.2 Modelos
Our models are Transformer-based (Vaswani
et al., 2017) language models (Devlin et al., 2019),
implemented using PyTorch (Paszke et al.,
2019), HuggingFace Transformers (Wolf et al.,
2019), and AllenNLP (Gardner et al., 2017). We experiment with 2 types of models: (1) Multihop Models, which are in principle capable of employing the desired reasoning, and have demonstrated competitive performance on previous multihop QA datasets. They help probe the extent to which a dataset can be solved by current models. (2) Artifact-based Models, which are restricted in some way that prohibits them from doing the desired reasoning (discussed shortly). They help probe the extent to which a dataset can be cheated. Next, we describe these models for ♫-Ans and ♫-Full. For HQ and 2W, they work similarly to ♫-Ans.
7.2.1 Multihop Models
End2End (EE) Model. This model takes (Q, C) as input, runs it through a transformer, and predicts (A, Ps) as the output for ♫-Ans and (A, Ps, S) for ♫-Full. We use Longformer-Large as it is one of the few transformer architectures that can fit the full context, and follow Beltagy et al. (2020) for answer and support prediction. Answerability prediction is done via binary classification using the CLS token.
Note that our Longformer EE model is a strong model for multihop reasoning. When trained on the full datasets, its answer F1 is 78.4 (within 3 pts of the published SOTA [Groeneveld et al., 2020]) on HQ, and 87.7 (SOTA) on 2W.
Select+Answer (SA) Model. This model, inspired by Quark (Groeneveld et al., 2020) and SAE (Tu et al., 2020), has two parts. First, a selector ranks and selects the K most relevant paragraphs CK ⊆ C.8 Specifically, given (Q, C) as input, it classifies every paragraph P ∈ C as relevant or not, and is trained with the cross-entropy loss. Second, for MuSiQue-Ans, the answerer
predicts the answer and supporting paragraphs
based only on CK. For MuSiQue-Full, it addi-
tionally predicts answerability. Both components
are trained individually using annotations avail-
able in the dataset. We implement a selector using
RoBERTa-large (Liu et al., 2019), and an answerer
using Longformer-Large.
Step Execution (EX) Model. Similar to prior work (Talmor and Berant, 2018; Min et al., 2019b;
Qi et al., 2021; Khot et al., 2021), this model per-
forms explicit, step-by-step multihop reasoning,
by first decomposing the Q into a DAG GQ having
single-hop questions, and then calling single-hop
model repeatedly to execute this decomposition.
The decomposer is trained with gold decompo-
sitions, and is implemented with BART-large.
The executor takes C and the predicted DAG
GQ, and outputs (A, Ps) for MuSiQue-Ans and
(A, Ps, S) for MuSiQue-Full. It calls single-hop
model Ms repeatedly while traversing GQ along
the edges and substituting the answers.
Model Ms is trained on only single-hop instances—taking (qi, C) as input, and producing (ai, pi) or (ai, pi, si) as the output. Here pi refers to the supporting paragraph for qi and si refers to whether C is sufficient to answer qi. For MuSiQue-Full, the answerer predicts Q as having sufficient context if Ms predicts all qi to have sufficient context. We implement 2 such single-hop models Ms: End2End and Select+Answer, abbreviated as EX(EE) and EX(SA), respectively.
We don’t experiment with this model on HQ,
since it needs ground-truth decomposition and in-
termediate answers, which aren’t available in HQ.
8K is a hyperparameter, chosen from {3, 5, 7}.
Baseline (RNN) Model. The filtering steps in our pipeline use transformer-based models, which could make MuSiQue particularly difficult for transformer-based models. A natural question then is, can a strong non-transformer model perform better on MuSiQue? To answer this, we evaluate
our re-implementation of a strong RNN-based baseline (Yang et al., 2018) (see their original paper for details). To verify our implementation, we trained it on full HotpotQA and found its performance to be 64.0 An (answer F1) on the validation set, better than what's reported by Yang et al. (2018) (58.3 An). We thus use this model as a strong non-transformer baseline.
7.2.2 Artifact-based Models
The Q-Only Model takes only Q as input (No
C) and generates output A for (cid:2) -Ans and (A, S)
para (cid:2) -Lleno. We implement this with BART-large
(Lewis et al., 2020a). The C-Only Model takes
only C as input (no Q) and predicts (A, Ps) para
(cid:2) -Ans and (A, Ps, S) para (cid:2) -Lleno. We implement
this with an EE Longformer-Large model with
empty Q. The 1-Para Model, like Min et al.
(2019a) and Chen and Durrett (2019), is similar
to SA model with K = 1. Instead of training the
selector to rank all Ps the highest, we train it to
rank any paragraph containing the answer A as
the highest. The answerer then takes as input one
selected paragraph p ∈ Ps and predicts an answer
to Q based solely on p. This model can’t access full
supporting information as all considered datasets
have at least 2 supporting paragraphs.
7.2.3 Cheatability Score
We compute the DiRe score of all datasets, which measures the extent to which a dataset can be cheated by strong models via Disconnected Reasoning (Trivedi et al., 2020). We report scores based
on the SA model because it performed the best.
7.3 Human Performance
Apart from assessing the human performance level on ♫-Ans, as discussed in §6, we also obtain human performance on HQ and 2W. For a fair comparison, we use the same crowdsourcing workers, annotation guidelines, and interface across the 3 datasets. We sample 125 questions from each dataset, shuffle them all into one set, and obtain 3 annotations per question for answer and support.
To select the workers, we ran a qualification round where each worker was required to identify the answer and support for at least 25 questions. We then selected workers who had more than 75 An
and Sp scores on all datasets. Seven out of 15 workers were qualified for the rest of the validation.

                      HQ-20K         2W-20K         ♫-Ans
                      An     Sp      An     Sp      An     Sp
Human     Score       84.5   92.5    83.2   99.3    78.0   93.9
          UB          91.8   96.0    89.0   100     88.6   97.3
Multihop  RNN         51.0   82.4    52.7   94.9    13.6   41.9
Models    EE          72.9   94.3    72.9   97.6    42.3   67.6
          SA          74.9   94.6    79.5   99.0    47.3   72.3
          EX(EE)      —      —       79.8   97.5    45.6   77.8
          EX(SA)      —      —       71.2   98.1    49.8   79.2
Artifact  1-Para      64.8   —       60.1   —       32.0   —
Models    C-only      18.4   67.6    50.1   92.0    3.4    0.0
          Q-only      19.6   —       27.0   —       4.6    —
DiRe Score            68.8   93.0    63.4   98.5    37.8   63.4

Table 4: Compared to the other datasets considered, ♫-Ans has a much larger human–model gap (higher gap between the top and middle sections), and is much less cheatable (lower scores in the bottom two sections).
8 Empirical Findings
We now discuss our findings, demonstrating that MuSiQue is a challenging multihop dataset that is harder to cheat on than existing datasets (§8.1) and that the steps in the MuSiQue construction pipeline are individually valuable (§8.2). Finally, we explore avenues for future work (§8.3).
For HQ and 2W, we report validation set performance. For ♫-Ans and ♫-Full, Table 5 reports test set numbers; all else is on the validation set.
8.1 MuSiQue is a Challenging Dataset
Compared to HQ and 2W, both variants of
MuSiQue are less cheatable via shortcuts and
have a larger human-to-model gap.
Higher Human–Model Gap. The top two sections of Table 4 show that ♫-Ans has a significantly higher human–model gap (computed as the Human Score minus the best model score) than the other datasets, for both answer and supporting paragraph identification. In fact, for both of the other datasets, supporting paragraph identification has even surpassed the human score, whereas for ♫-Ans there is a 14-point gap. Moreover, ♫-Ans has a ∼27-point gap in answer F1, whereas HQ and 2W have a gap of only 10 and 5 points, respectively.
Our best model, EX(SA), scores 57.9, 47.9, and 28.1 answer F1 on 2-, 3-, and 4-hop questions of ♫-Ans, respectively. The EE model, on the other hand, stays around 42% irrespective of the number of hops.

                      ♫-Ans            ♫-Full
                      An      Sp       An+Sf   Sp+Sf
Multihop  EE          40.7    69.4     24.0    25.6
Models    SA          52.3    75.2     34.8    42.1
          Ex(EE)      46.4    78.1     32.2    44.2
          Ex(SA)      49.0    80.6     32.2    44.3
Artifact  1-Para      35.7    —        2.3     —
Models    C-only      3.7     0.0      1.6     1.1
          Q-only      4.6     —        0.0     —

Table 5: ♫-Full is harder (top section) and less cheatable (bottom section) than ♫-Ans. Note: ♫-Full has a stricter metric that operates over instance pairs (§7.1: Metrics).
Lower Cheatability. The 3rd section of Table 4 shows that the performance of artifact-based models (§7.2.2) is much higher on HQ and 2W than on ♫-Ans. For example, the 1-Para model achieves 64.8 and 60.1 answer score on HQ and 2W, respectively, but only 32.0 on ♫-Ans. Support identification in both datasets can be done to a surprisingly high degree (67.6 and 92.0 F1) even without the question (C-only model), but fails on ♫-Ans.9
9Even when ♫-Ans is modified to have 10 paragraphs like HQ, the C-only support score remains low; cf. Table 7.
Similarly, the last row of Table 4 shows that the DiRe answer scores of HQ and 2W (68.8 and 63.4) are high, indicating that even disconnected reasoning (bypassing reasoning steps) can achieve such high scores. In contrast, this number is significantly lower (37.8) for ♫-Ans.
These results demonstrate that ♫-Ans is significantly less cheatable via shortcut-based reasoning.
MuSiQue-Full: Even More Challenging. Table 5 shows that ♫-Full is significantly more difficult and less cheatable than ♫-Ans.
Intuitively, because the answerable and unanswerable instances are very similar but have different labels, it is difficult for models to do well on both instances if they learn to rely on shortcuts (Kaushik et al., 2019; Gardner et al., 2020). All artifact-based models barely get any
An+Sf or Sp+Sf score. For all multihop models too, the An drops by 14–17 pts and the Sp by 33–44 pts.

           1-Para         C-only         EE
           An     Sp      An     Sp      An     Sp
♫          32.0   —       3.4    0.0     42.3   67.6
♫ \ DF     59.2   —       8.6    22.4    60.6   71.1
♫ \ RL     85.1   —       69.5   42.3    87.3   79.3

Table 6: The Disconnection Filter (DF, step 3) and Reduced Train-Test Leakage (RL, step 5) of the MuSiQue pipeline are crucial for its difficulty (EE model) and lower cheatability (1-Para and C-only models).
8.2 Dataset Construction Steps are Valuable
Next, we show that the key steps of our dataset construction pipeline (§5) are valuable.
Disconnection Filter (step 3). To assess the effect of the Disconnection Filter (DF), we ablate it from the pipeline, that is, we skip the filtering of composable 2-hop questions down to connected 2-hop questions. As
we don’t have human-generated composed ques-
tions for the resulting questions, we use a seq2seq
BART-large model that’s trained (using MuSiQue)
to compose questions from input decomposition
DAG. For a fair comparison, we randomly sub-
sample train set from ablated pipeline to be of the
same size as the original train set.
Table 6 shows that DF is crucial for increasing
difficulty and reducing cheatability of the dataset.
Without DF, both multihop and artifact-based
models do much better on the resulting datasets.
Reduced Train-Test Leakage (step 5). To assess the effect of Reduced Train-Test Leakage (RL),
we create a dataset the traditional way, with a ran-
dom partition into train, validation, and test splits.
For uniformity, we ensure the distribution of 2–4
hop questions in development set of the resulting
dataset from both ablated pipelines remains the
same as in the original development set. Like DF
ablation, we also normalize train set sizes.
Table 6 shows that without a careful split, the dataset is highly solvable by multihop models (An = 87.3). Notably, most of this high score
can also be achieved by artifact-based models:
1-para (An = 85.1) and C-only (An = 69.5),
revealing the high cheatability of such a split.
Ctxt    Corpus   1-Para         C-Only         EE
                 An     Sp      An     Sp      An     Sp
10      FW       42.5   —       12.5   77.7    57.2   87.6
10      PD       28.0   —       5.5    34.6    54.1   80.2
20      FW       41.7   —       12.4   66.4    50.3   80.8
20 (♫)  PD       32.0   —       3.4    0.0     42.3   67.6

Table 7: Positive Distractors (PD) are more effective than using Full Wikipedia (FW) for choosing distractors, as shown by the lower model scores. The effect of using PD is more pronounced when combined with the use of 20 (instead of 10) distractor paragraphs.
Harder Distractors (step 6). To assess the effect of distractors in ♫-Ans, we create 4 variations. Two vary the number of distractors: (i) 10 paragraphs and (ii) 20 paragraphs; and two vary the source: (i) Full Wikipedia (FW)10 and (ii) gold context paragraphs from the good single-hop questions from step 1. We refer to the latter setting as positive distractors (PD), as these paragraphs are likely to appear as supporting (positive) paragraphs in our final dataset.
Table 7 shows that all models find PD significantly harder than FW. In particular, PD makes support identification extremely difficult for C-only, whereas Table 4 showed that C-only succeeds on HQ and 2W to a high degree (67.6 and 92.0 Sp). This would have also been true for ♫-Ans (66.4 Sp) had we used Wikipedia as the distractor construction corpus like HQ and 2W. This underscores the value of selecting the right corpus for distractor selection, and ensuring distributional shift can't be exploited to bypass reasoning.11
Second, using 20 paragraphs instead of 10 makes the dataset more difficult and less cheatable. Interestingly, the effect is stronger if we use PD, indicating the synergy between the two approaches to create challenging distractors.
10We used the Wikipedia corpus from Petroni et al. (2021).
11Our single-hop datasets are Wikipedia-based, and we ensured retrieved contexts from FW are 20–300 words, like PD.
8.3 Potential Avenues for Improvement
Better Decomposition. We train our EX(SA)
model using ground-truth decompositions. On
♫-Ans, (An, Sp) improve by (9.4, 7.3) points,
and on ♫-Full, (An+Sf, Sp+Sf) improve by (7.3, 6.9) points. The improvements with the EX(EE) model are slightly lower. This shows that although improving question decomposition will be helpful, it is insufficient to reach human parity on the dataset.
Better Transformer. While Longformer can fit a long context, there are arguably more effective pretrained transformers for shorter input, for example, T5. In addition, since T5 uses relative position embeddings, it can be used for longer text, although at a significant memory and computation cost. We managed to train SA with T5-large on MuSiQue,12 but didn't use it for the rest of our experiments because of the high computational cost. Over the Longformer SA, the T5 SA showed a modest improvement of (6.1, 0.7) on ♫-Ans and (1.7, 2.0) on ♫-Full.
9 Conclusion
Constructing multihop datasets is a tricky process. It can introduce shortcuts and artifacts
that models can exploit to circumvent the need
for multihop reasoning. A bottom–up process of
constructing multihop from single-hop questions
allows systematic exploration of a large space
of multihop candidates and greater control over
which questions we compose. We showed how
to use such a carefully controlled process to cre-
ate a challenging dataset that, por diseño, requires
connected reasoning by reducing potential reason-
ing shortcuts, minimizing train-test leakage, y
including harder distractor contexts. Empirical re-
sults show that ♫-Ans has a substantially higher
human-model gap and is significantly less cheat-
able via disconnected reasoning than previous
datasets. The dataset also comes with unan-
swerable questions, and question decompositions
which we hope spurs further work in developing
models that get right answers for the right reasons.
Acknowledgments
The authors thank the action editor and reviewers
for their valuable feedback. This work was sup-
ported in part by the National Science Foundation
under grant IIS-1815358.
12SA worked best for 7 selected paragraphs, where the answerer (T5) had to process ∼1100 wordpieces on average.
References
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.
Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In NAACL-HLT. https://doi.org/10.18653/v1/N19-1405
Wenhu Chen, Hanwen Zha, Zhiyu Chen,
Wenhan Xiong, Hong Wang, and William
Wang. 2020. Hybridqa: A dataset of multi-hop
question answering over tabular and textual
data. Findings of EMNLP 2020. https://
doi.org/10.18653/v1/2020.findings
-emnlp.91
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL.
Hady ElSahar, PAG. Vougiouklis, Arslen Remaci,
C. Gravier, Jonathon S. Hare, F. Laforest, and E. Simperl. 2018. T-REx: A large scale align-
ment of natural language with knowledge base
triples. In LREC.
James Ferguson, Matt Gardner, Hannaneh
Hajishirzi, Tushar Khot, and Pradeep Dasigi.
2020. IIRC: A dataset of incomplete infor-
mation reading comprehension questions. In
EMNLP. https://doi.org/10.18653
/v1/2020.emnlp-main.86
Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, A. Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets. In Findings of EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.117
Matt Gardner, Joel Grus, Mark Neumann, Oyvind
Tafjord, Pradeep Dasigi, Nelson F. Liu,
Matthew Peters, Michael Schmitz, and Luke S.
Zettlemoyer. 2017. AllenNLP: A deep se-
mantic natural language processing platform.
arXiv preimpresión arXiv:1803.07640. https://
doi.org/10.18653/v1/W18-2501
Mor Geva, Daniel Khashabi, Elad Segal, Tushar
Khot, Dan Roth, y jonathan berant. 2021.
Did Aristotle use a laptop? A question an-
swering benchmark with implicit reasoning
estrategias. TACL. https://doi.org/10
.1162/tacl_a_00370
Dirk Groeneveld, Tushar Khot, Mausam, and Ashish Sabharwal. 2020. A simple yet strong pipeline for HotpotQA. In EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.711
Xanh Ho, A. Nguyen, Saku Sugawara, y
Akiko Aizawa. 2020. Constructing a multi-hop
QA dataset for comprehensive evaluation of
reasoning steps. In COLING.
Matthew Honnibal, Ines Montani, Sofie Van
Landeghem, and Adriane Boyd. 2020. SpaCy:
Industrial-strength natural language processing
in Python. https://doi.org/10.5281
/zenodo.1212303
Peter Jansen, Elizabeth Wainwright, Steven Marmorstein, and Clayton Morrison. 2018. WorldTree: A corpus of explanation graphs for elementary science questions supporting multi-hop inference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, and Mohit Bansal. 2020. HoVer: A dataset for many-hop fact extraction and claim verification. In Findings of EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.309
Divyansh Kaushik, Eduard Hovy, and Zachary
Lipton. 2019. Learning the difference that
makes a difference with counterfactually-
augmented data. In ICLR.
Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In EMNLP. https://doi.org/10.18653/v1/D18-1546
Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In NAACL. https://doi.org/10.18653/v1/N18-1023
Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. In Findings of EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.171
Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020. QASC: A dataset for question answering via sentence composition. In AAAI. https://doi.org/10.1609/aaai.v34i05.6319
Tushar Khot, Daniel Khashabi, Kyle Richardson,
Peter Clark, and Ashish Sabharwal. 2021.
Text modular networks: Learning to de-
compose tasks in the language of existing
modelos. In NAACL. https://doi.org/10
.18653/v1/2021.naacl-main.99
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc V. Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. TACL, 7:453–466. https://doi.org/10.1162/tacl_a_00276
Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In CoNLL. https://doi.org/10.18653/v1/K17-1034
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020a. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL. https://doi.org/10.18653/v1/2020.acl-main.703
Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020b. MLQA: Evaluating cross-lingual extractive question answering. In ACL. https://doi.org/10.18653/v1/2020.acl-main.653
Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2021. Question and answer test-train overlap in open-domain question answering datasets. In EACL. https://doi.org/10.18653/v1/2021.eacl-main.86
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019a. Compositional questions do not necessitate multi-hop reasoning. In ACL. https://doi.org/10.18653/v1/P19-1416
Sewon Min, Victor Zhong, Luke S. Zettlemoyer,
and Hannaneh Hajishirzi. 2019b. Multi-hop
reading comprehension through question de-
composition and rescoring. In ACL.
Liangming Pan, Wenhu Chen, Wenhan Xiong, Min-Yen Kan, and William Yang Wang. 2021. Unsupervised multi-hop question answering by question generation. In NAACL.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035.
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2021. KILT: A benchmark for knowledge intensive language tasks. In NAACL. https://doi.org/10.18653/v1/2021.naacl-main.200
Peng Qi, Haejun Lee, Oghenetegiri ‘‘TG’’ Sido,
and Christopher D. Manning. 2021. Answering
open-domain questions of varying reasoning
steps from text. In EMNLP.
Pranav Rajpurkar, Robin Jia, y Percy Liang.
2018. Know what you don’t know: Unanswer-
able questions for SQuAD. In ACL. https://
doi.org/10.18653/v1/P18-2124
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP. https://doi.org/10.18653/v1/D16-1264
Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In NAACL. https://doi.org/10.18653/v1/N18-1059
Alon Talmor, Ori Yoran, Amnon Catav, Dan Lahav, Yizhong Wang, Akari Asai, Gabriel Ilharco, Hannaneh Hajishirzi, and Jonathan Berant. 2021. MultiModalQA: Complex question answering over text, tables and images. In ICLR.
Harsh Trivedi, Niranjan Balasubramanian, Tushar
Khot, and Ashish Sabharwal. 2020. Is multi-
hop QA in DiRe condition? Measuring and
reducing disconnected reasoning. In EMNLP.
https://doi.org/10.18653/v1/2020
.emnlp-main.712
Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xiaodong He, and Bowen Zhou. 2020. Select, answer and explain: Interpretable multihop reading comprehension over multiple documents. In AAAI.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS, pages 5998–6008.
Johannes Welbl, Pontus Stenetorp, and Sebastian
Riedel. 2018. Constructing datasets for multi-
hop reading comprehension across documents.
TACL, 6:287–302. https://doi.org/10
.1162/tacl_a_00021
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art
natural language processing. ArXiv, abs/1910.03771. https://doi.org/10.18653/v1/2020.emnlp-demos.6
Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. TACL. https://doi.org/10.1162/tacl_a_00309
Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2020. Zero-shot entity linking with dense entity retrieval. In EMNLP.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multihop question answering. In EMNLP. https://doi.org/10.18653/v1/D18-1259
Ori Yoran, Alon Talmor, y jonathan berant.
2021. Turning tables: Generating examples
from semi-structured tables for endowing lan-
guage models with reasoning skills. arXiv
preprint arXiv:2107.07261.