
Towards General Natural Language Understanding
with Probabilistic Worldbuilding

Abulhair Saparov and Tom M. Mitchell
Machine Learning Department,
Carnegie Mellon University, USA
{asaparov, tom.mitchell}@cs.cmu.edu

Abstract

We introduce the Probabilistic Worldbuilding
Model (PWM), a new fully symbolic Bayesian
model of semantic parsing and reasoning, as a
first step in a research program toward more
domain- and task-general NLU and AI. Hu-
mans create internal mental models of their
observations that greatly aid in their ability to
understand and reason about a large variety of
problems. In PWM, the meanings of sentences,
acquired facts about the world, and intermedi-
ate steps in reasoning are all expressed in a
human-readable formal language, with the de-
sign goal of interpretability. PWM is Bayesian,
designed specifically to be able to generalize
to new domains and new tasks. We derive and
implement an inference algorithm that reads
sentences by parsing and abducing updates to
its latent world model that capture the seman-
tics of those sentences, and evaluate it on two
out-of-domain question-answering datasets:
(1) ProofWriter and (2) a new dataset we call
FictionalGeoQA, designed to be more rep-
resentative of real language but still simple
enough to focus on evaluating reasoning abil-
ity, while being robust against heuristics. Our
method outperforms baselines on both, thereby
demonstrating its value as a proof-of-concept.

1 Introduction

Despite recent progress in AI and NLP producing
algorithms that perform well on a number of NLP
tasks, it is still unclear how to move forward and
develop algorithms that understand language as
well as humans do. In particular, large-scale lan-
guage models such as BERT (Devlin et al., 2019),
RoBERTa (Liu et al., 2019), GPT-3 (Brown et al.,
2020), XLNet (Yang et al., 2019), and others were
trained on a very large amount of text and can
then be applied to perform many different NLP
tasks after some fine-tuning. In the case of GPT-3,
some tasks require very few additional training
examples to achieve state-of-the-art performance.

As a result of training on text from virtually every
domain, these models are domain-general. This is
in contrast with NLP algorithms that are largely
trained on one or a small handful of domains,
and as such, are not able to perform well on new
domains outside of their training. Despite this
focus on domain-generality, there are still a large
number of tasks on which these large-scale lan-
guage models perform poorly (Dunietz et al.,
2020). Many limitations of today’s state-of-the-art
methods become evident when comparing with
the human ability to understand language (Lake
et al., 2016; Tamari et al., 2020; Bender and
Koller, 2020; Gardner et al., 2019; Linzen, 2020).
Many cognitive scientists posit that humans create
rich mental models of the world from their ob-
servations which provide superior explainability,
reasoning, and generalizability to new domains
and tasks. How do we, as a field, move from to-
day’s state-of-the-art to more general intelligence?
What are the next steps to develop algorithms that
can generalize to new tasks at the same level as
humans? The lack of interpretability in many of
these models makes these questions impossible to
answer precisely. One promising direction is to
change the evaluation metric: Brown et al. (2020),
Linzen (2020), and many others have suggested
zero-shot or few-shot accuracy to measure the
performance of algorithms (i.e., the algorithm is
evaluated with a new dataset, wholly separate
from its training; or in the case of few-shot learn-
ing, save for a few examples). While this shift is
welcome, it alone will not solve the above issues.
We introduce the Probabilistic Worldbuilding
Model (PWM), a probabilistic generative model
of reasoning and semantic parsing. Like some
past approaches, PWM explicitly builds an in-
ternal mental model, which we call the theory
(Tamari et al., 2020; Hogan et al., 2021; Mitchell
et al., 2018; Charniak and Goldman, 1993). The
theory constitutes what the algorithm believes to

Transactions of the Association for Computational Linguistics, vol. 10, pp. 325–342, 2022. https://doi.org/10.1162/tacl a 00463
Action Editor: Hinrich Schütze. Submission batch: 3/2021; Revision batch: 9/2021; Published 4/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Figure 1: The generative process and inference in our model, with an example of a theory, generating a proof of
a logical form which itself generates the sentence ‘‘Bob is a mammal.’’ During inference, only the sentences are
observed, whereas the theory and proofs are latent. Given sentence yi, the language module outputs the logical
form. The reasoning module then infers the proof πi of the logical form and updates the posterior of the theory T .

be true. PWM is fully symbolic and Bayesian,
using a single unified human-readable formal lan-
guage to represent all meaning, and is therefore
inherently interpretable. This is in contrast to
systems that use subsymbolic representations of
meaning for some or all of their components.
Every random variable in PWM is well-defined
with respect to other random variables and/or
grounded primitives. Prior knowledge such as
the rules of deductive inference, the structure
of English grammar, and knowledge of basic
physics and mathematics can be incorporated by
modifying the prior distributions of the random
variables in PWM. Incorporating prior knowledge
can greatly reduce the amount of training data re-
quired to achieve sufficient generalizability, comme
we will demonstrate. Extensibility is key to fu-
ture research that could enable more general NLU
and AI, as it provides a clearer path forward for
future exploration.

We present an implementation of inference
under the proposed model, called the Probabilis-
tic Worldbuilding from Language (PWL). While
PWM is an abstract mathematical description of
the underlying distribution of axioms, proofs, log-
ical forms, and sentences, PWL is the algorithm
that reads sentences, computes logical form rep-
resentations of their meaning, and updates the
axioms and proofs in the theory accordingly. See
Figure 1 for a high-level schematic diagram of
PWM and PWL. PWM describes the process de-
picted by the red arrows, whereas PWL is the
algorithm depicted by the green arrows. We em-
phasize that the reasoning in PWL is not a theorem
prover and is not purely deductive. Rather, PWL
solves a different problem of finding satisfying

abductive proofs, which is computationally easier
than deductive inference: Given a set of obser-
vations, work backwards to find a set of axioms
that deductively explain the observations. It is
these abduced axioms that constitute the internal
‘‘mental model.’’ Humans often rely on abductive
reasoning, for example in commonsense reasoning
(Bhagavatula et al., 2020; Furbach et al., 2015).

A core principle of our approach is to ensure
generality by design. Simplifying assumptions of-
ten trade away generality for tractability, such as
by restricting the representation of the meanings
of sentences, or number of steps during reasoning.
PWM is designed to be domain- and task-general,
and to this end, uses higher-order logic (i.e., lambda
calculus) (Church, 1940) as the formal language,
which we believe is sufficiently expressive to
capture the meaning of declarative and interrog-
ative sentences in natural language. Furthermore,
PWM uses natural deduction for reasoning, which
is complete in that if a logical form φ is true, there
is a proof of φ (Henkin, 1950).

In Section 3, we describe PWM and PWL more
precisely. In Section 4, as a proof-of-concept of
the value of further research, we run experiments
on two question-answering datasets: ProofWriter
(Tafjord et al., 2021) and a new dataset, called
FictionalGeoQA, which we specifically created
to evaluate the ability to reason over short para-
graphs while being robust against simpler heuristic
strategies. Unlike ProofWriter, the text in Fic-
tionalGeoQA was not template-generated and is
more realistic, but is still simple enough to focus
the evaluation on reasoning rather than parsing,
with many sentences having semantics that go
beyond the Horn clause fragment of first-order
logic. PWL outperforms the baselines with re-
spect to zero-shot accuracy (i.e., without looking
at any training examples). Our code and data are
freely available at github.com/asaparov/PWL and
github.com/asaparov/fictionalgeoqa.

In summary, the primary contributions of this
paper are the following:

• PWM, a new model for more general NLU,
and PWL, the implementation that reads sen-
tences, computes their logical forms, et
updates its theory accordingly.

• Introducing FictionalGeoQA, a new question-
answering dataset designed to evaluate the
ability to reason over language.

• Experiments on ProofWriter and Fictional-
GeoQA demonstrating that PWL outperforms
baseline methods on question-answering.

2 Related Work

Fully symbolic methods were commonplace in
earlier AI research (Newell and Simon, 1976;
Dreyfus, 1985). However, they were oftentimes
brittle: A new observation would contradict the
internal theory or violate an assumption, and it was
not clear how to resolve the impasse in a princi-
pled manner and proceed. But they do have some
key advantages: Symbolic approaches that use
well-studied human-readable formal languages
such as first-order logic, higher-order logic, type
theory, etc., enable humans to readily inspect and
understand the internal processing of these algo-
rithms, effecting a high degree of interpretability
(Dowty, 1981; Gregory, 2015; Cooper et al.,
2015). Symbolic systems can be made general
by design, by using a sufficiently expressive for-
mal language and ontology. Hybrid methods have
been explored to alleviate the brittleness of formal
systems while engendering their strengths, such
as interpretability and generalizability; for exam-
ple, the recent work in neuro-symbolic methods
(Yi et al., 2020; Saha et al., 2020; Tafjord et al.,
2021). Neural theorem provers are in this vein
(Rocktäschel and Riedel, 2017). However, the
proofs considered in these approaches are based
on backward chaining (Russell and Norvig, 2010),
which restricts the semantics to the Horn clause
fragment of first-order logic. Sun et al. (2020),

Ren et al. (2020), and Arakelyan et al. (2021)
extend coverage to the existential positive frag-
ment of first-order logic. In natural language,
sentences express more complex semantics such
as negation, nested universal quantification, et
higher-order structures. Our work explores the
other side of the tradeoff between tractability and
expressivity/generality. Theorem provers attempt
to solve the problem of deduction: finding a proof
of a given formula, given a set of axioms. In
contrast, the reasoning component of PWM is
abductive, and the abduced axioms can be used
in downstream tasks, such as question-answering,
and to better read new sentences in the context
of the world model, as we will demonstrate. We
posit that abduction is sufficient for more general
NLU (Hobbs, 2006; Hobbs et al., 1993). PWM
combines Bayesian statistical machine learning
with symbolic representations in order to handle
uncertainty in a principled manner, ‘‘smoothing
out’’ or ‘‘softening’’ the rigidity of a purely sym-
bolic approach. In PWM, the internal theory is a
random variable, so if a new observation is incon-
sistent with the theory, there may be other theories
in the probability space that are consistent with
the observation. The probabilistic approach pro-
vides a principled way to resolve these impasses.
PWM is certainly not the first to combine
symbolic and probabilistic methods. There is a
rich history of inductive logic programming (ILP)
(Muggleton, 1991; Cropper and Morel, 2021) and
probabilistic ILP languages (Muggleton, 1996;
Cussens, 2001; Sato et al., 2005; Bellodi and
Riguzzi, 2015). These languages could be used to
learn a ‘‘theory’’ from a collection of observa-
tions, but they are typically restricted to learning
rules in the form of first-order Horn clauses, for
tractability. In natural language, it is easy to ex-
press semantics beyond the Horn clause fragment
of first-order logic.

Knowledge bases (KBs) and cognitive archi-
tectures (Kotseruba and Tsotsos, 2020; Hogan
et al., 2021; Laird et al., 1987; Mitchell et al.,
2018) have attempted to explicitly model domain-
general knowledge in a form amenable to reason-
ing. Cognitive architectures aim to more closely
replicate human cognition. Some approaches
use probabilistic methods to handle uncertainty
(Niepert et al., 2012; Niepert and Domingos,
2015; Jain et al., 2019). However, many of these
approaches make strong simplifying assumptions
that restrict the expressive power of the formal
language that expresses facts in the KB. For ex-
ample, many KBs can be characterized as graphs,
where each entity corresponds to a vertex and
every fact corresponds to a labeled edge. For ex-
ample, the belief plays sport(s williams,
tennis) is representable as a directed edge
connecting the vertex s williams to the vertex
tennis, with the edge label plays sport.
While this assumption greatly aids tractability and
scalability, allowing many problems in reasoning
to be solved by graph algorithms, it greatly
hinders expressivity and generality, and there are
many kinds of knowledge that simply cannot be
expressed and represented in such KBs. PWM
does not make such restrictions on logical forms
in the theory, allowing for richer semantics, such
as definitions, universally quantified statements,
conditionals, etc.

3 Model

In this section, we provide a mathematical de-
scription of PWM. At a high level, the process
for generating a sentence sampled from this
probability distribution is:

1. Sample the theory T from a prior distribution
p(T ). T is a collection of logical forms in
higher-order logic that represent what PWL
believes to be true.

2. For each observation i, sample a proof πi
from p(πi | T ). The conclusion of the proof
is the logical form xi, which represents the
meaning of the ith sentence.

3. Sample the ith sentence yi from p(yi | xi).

Inference is effectively the inverse of this process,
and is implemented by PWL. During inference,
PWL is given a collection of observed sentences
y1, . . . , yn and the goal is to discern the value of the
latent variables: the logical form of each sentence
x ≜ {x1, . . . , xn}, the proofs for each logical
form π ≜ {π1, . . . , πn}, and the underlying the-
ory T . Both the generative process and inference
algorithm naturally divide into two modules:

• Language module: During inference, this
module’s purpose is to infer the logical form
of each observed sentence. That is, given the
input sentence yi, this module outputs the k

most-probable values of the logical form xi
(i.e., semantic parsing).

• Reasoning module: During inference, this
module’s purpose is to infer the underlying
theory that logically entails the observed log-
ical forms (and their proofs thereof). That is,
given an input collection of logical forms x,
this module outputs the posterior distribution
of the underlying theory T and the proofs π
of those logical forms.

Note that the yi need not necessarily be sen-
tences, and PWM can easily be generalized to
other kinds of data. For example, if a generative
model of images is available for p(yi | xi), then an
equivalent ‘‘vision module’’ may be defined. This
module may be used either in place of, or together
with, the language module. In the above genera-
tive process, PWM assumes each sentence to be
independent. A model of context is required to
properly handle inter-sentential anaphora or con-
versational settings. This can be done by allowing
the distribution on yi to depend on previous log-
ical forms or sentences: p(yi | x1, . . . , xi) (i.e.,
relaxing the i.i.d. assumption). For simplicity of
this proof-of-concept, this is left to future work.
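The three-step generative story above can be sketched directly. This is a minimal illustration with stand-in function names and a dictionary-based proof representation (our own assumptions, not the paper's implementation):

```python
import random

def generate_corpus(sample_theory, sample_proof, sample_sentence, n):
    """Sketch of PWM's generative story: sample one theory T, then for each
    observation a proof from p(pi_i | T) whose conclusion is the logical
    form x_i, and finally a sentence from p(y_i | x_i)."""
    theory = sample_theory()                          # T ~ p(T)
    corpus = []
    for _ in range(n):
        proof = sample_proof(theory)                  # pi_i ~ p(pi_i | T)
        logical_form = proof["conclusion"]            # x_i is the proof's conclusion
        corpus.append(sample_sentence(logical_form))  # y_i ~ p(y_i | x_i)
    return corpus

# Toy stand-ins for the three distributions:
sentences = generate_corpus(
    sample_theory=lambda: ["mammal(bob)"],
    sample_proof=lambda t: {"conclusion": random.choice(t)},
    sample_sentence=lambda x: f"Bob is a {x.split('(')[0]}.",
    n=3,
)
# sentences == ["Bob is a mammal."] * 3
```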

There is a vast design space for symbolic
representations of meaning. We are unable to
comprehensively list all of our design choices, mais
we describe two important ones below.

Neo-Davidsonian semantics (Parsons, 1990) is
used to represent meaning in all logical forms
(both in the theory and during semantic pars-
ing). As a concrete example, a simple
way to represent the meaning of ‘‘Jason traveled
to New York’’ could be with the logical form
travel(jason, nyc). In neo-Davidsonian se-
mantics, this would instead be represented with
three distinct atoms: travel(c1), arg1(c1) =
jason, and arg2(c1) = nyc. Here, c1 is a
constant that represents the ‘‘traveling event,’’
whose first argument is the constant representing
Jason, and whose second argument is the constant
representing New York City. This representation
allows the event to be more readily modified
by other logical expressions, such as in ‘‘Jason
quickly traveled to NYC before nightfall.’’
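As a concrete sketch of this encoding (the tuple representation here is our own illustration, not the paper's data structures), the example sentence becomes a set of atoms over a reified event constant:

```python
# "Jason traveled to New York" as three neo-Davidsonian atoms about
# the event constant c1 (illustrative representation only):
atoms = {
    ("travel", "c1"),         # travel(c1): c1 is a traveling event
    ("arg1", "c1", "jason"),  # arg1(c1) = jason: who traveled
    ("arg2", "c1", "nyc"),    # arg2(c1) = nyc: the destination
}

# Because the event is reified as c1, a modifier such as "quickly" is
# just one more atom, rather than a new higher-arity travel predicate:
atoms.add(("quickly", "c1"))
```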

Furthermore, PWM defers named entity linking
to the reasoning module (it is not done during
parsing). That is, the semantic parser does not
parse ‘‘Jason’’ directly into the constant jason.

Table 1: Design choices in the representation of the meaning of ‘‘Alex wrote a book.’’

Instead, named entities are parsed into exis-
tentially quantified expressions, for example,
∃j(name(j) = ‘‘Jason’’ ∧ . . .). This simplifies
the parser’s task and allows reasoning to aid in
entity linking. Table 1 details these design options.

3.1 Reasoning Module

3.1.1 Generative Process for the Theory p(T)

The theory T is a collection of axioms a1, a2, . . .
represented in higher-order logic. We choose a
fairly simple prior p(T ) for rapid prototyping, but
it is straightforward to substitute with a more com-
plex prior. Specifically a1, a2, . . . are distributed
according to a distribution Ga which is sampled
from a Dirichlet process (DP) (Ferguson, 1973),
an exchangeable non-parametric distribution.

Ga ∼ DP(Ha, α),       (1)
a1, a2, . . . ∼ Ga,    (2)

where Ha is the base distribution and α = 0.1 is
the concentration parameter. An equivalent per-
spective of the DP that better illustrates how the
samples are generated is the Chinese restaurant
process (Aldous, 1985):

ai ∼ Ha  with probability  α/(α + i − 1),    (3)
ai = aj  with probability  (Σ_{k=1}^{i−1} 1{ak = aj})/(α + i − 1).    (4)

That is, the ith sample is drawn from Ha with
probability proportional to α, or it is set to a
previous sample with probability proportional to
the number of times that sample has been drawn.
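The Chinese restaurant process view of equations 3 and 4 can be sketched directly (with a toy stand-in for the base distribution Ha; not the paper's implementation):

```python
import random

def sample_crp(n, alpha, base_sample):
    """Draw n samples via the Chinese restaurant process: the i-th sample is
    a fresh draw from the base distribution with probability
    alpha / (alpha + i - 1), and otherwise repeats a previous sample with
    probability proportional to how often that sample has been drawn."""
    samples = []
    for i in range(1, n + 1):
        if random.random() < alpha / (alpha + i - 1):
            samples.append(base_sample())           # new draw from H_a
        else:
            samples.append(random.choice(samples))  # reuse, proportional to count
    return samples

random.seed(0)
draws = sample_crp(100, alpha=0.1, base_sample=random.random)
# With a small concentration parameter, only a handful of distinct
# values appear among the 100 draws.
```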
The base distribution Ha recursively generates
logical forms in higher-order logic. Because any
formula can be written as a tree, they can be gen-
erated top-down, starting from the root. The type
of each node (conjunction ∧, disjunction ∨, nega-
tion ¬, quantification ∀x, etc.) is sampled from
a categorical distribution. If the type of the node
is selected to be an atom (e.g., book(c1)), then
its predicate is sampled from a non-parametric
distribution of predicate symbols Hp. The atom’s
argument(s) are each sampled as follows: If nV
is the number of available variables (from ear-
lier generated quantifiers), then sample a variable
uniformly at random, each with probability 1/(nV + 1);
otherwise, with probability 1/(nV + 1), sample a constant
from a non-parametric distribution of constant
symbols Hc. For brevity, we refer the reader to
our code for the specific forms of Hp and Hc.

Since PWM uses a neo-Davidsonian represen-
tation, another node type that Ha can generate
is an event argument (e.g., arg1(c1) = jason).
When this is selected, the event constant (c1 in the
example) is sampled in the same way an atom’s
argument is sampled, as described above: first by
trying to sample a variable, and otherwise sam-
pling a constant from Hc. The right side of the
equality (jason in the example) can either be
a variable, constant, string, or number, so PWM
first selects its type from a categorical distribution.
If the type is chosen to be a number, string, ou
variable, its value is sampled uniformly. If the type
is chosen to be a constant, it is sampled from Hc.
Names of entities are treated specially in this
prior: The number of names available to each
entity is sampled according to a very light-tailed
distribution i.i.d.: for entity ci the number of names
nN (ci) ≜ #{s : name(ci) = s} is distributed
according to p(nN (ci) = k) ∝ λ^(k²). This ensures
that entities tend not to have too many names.

Sets are also treated specially in this prior:
One kind of axiom that can be generated is one
that declares the size of a set—for example,
size(λx.planet(x)) = 8 denotes that the size
of the set of planets is 8. In the prior, the size of
each set is distributed according to a geometric
distribution with parameter 10−4. Sets can have
arity not equal to 1, in which case their elements
are tuples.

Deterministic Constraints: We also impose
hard constraints on the theory T . Most impor-
tantly, T is required to be globally consistent.
While this is a conceptually simple requirement,
it is computationally expensive (generally unde-
cidable even in first-order logic). PWL enforces
this constraint by keeping track of the known sets
in the theory (i.e., a set is known if its set size
axiom is used in a proof, or if the set appears as a
subset/superset in a universally-quantified axiom,
such as in ∀x(cat(x) → mammal(x)) where the
set λx.cat(x) is a subset of λx.mammal(x)).
For each set, PWL computes which elements are
provably members of that set. If the number of
provable members of a set is greater than its size,
or if an element is both provably a member and not
a member of a set, the theory is inconsistent. Re-
laxing this constraint would be valuable in future
research, perhaps instead by only considering the
relevant sets rather than all sets in the theory, or
deferring consistency checks altogether. We place
a handful of other constraints on the theory T :
The name of an entity must be a string (and not a
number or a constant). All constants are distinct;
that is, ci ≠ cj for all i ≠ j. This helps to alleviate
identifiability issues, as otherwise, there would be
a much larger number of logically equivalent the-
ories. No event can be an argument of itself (e.g.,
there is no constant ci such that arg1(ci) = ci).
If a theory T satisfies all constraints, we write
‘‘T valid.’’
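The two set-level checks described above can be sketched as follows (a simplification of PWL's consistency machinery; the function and argument names are ours):

```python
def set_consistent(declared_size, provable_members, provable_nonmembers):
    """Check the set constraints described above: a set may not have more
    provably distinct members than its declared size, and no element may be
    both provably a member and provably a non-member."""
    if len(provable_members) > declared_size:
        return False
    if provable_members & provable_nonmembers:  # membership contradiction
        return False
    return True

# size(λx.planet(x)) = 8 with nine provable planets is inconsistent:
assert not set_consistent(8, {f"p{i}" for i in range(9)}, set())
assert set_consistent(8, {"mercury", "venus"}, {"pluto"})
```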

These constraints do slightly complicate com-
putation of the prior, since the generative process
for generating T is conditioned on T being valid:

p(T | T valid) = p(T )/p(T valid),    (5)

where p(T valid) = Σ_{T ′: T ′ valid} p(T ′),    (6)

and the denominator is intractable to compute.
However, we show in Section 3.1.3 that for infer-
ence, it suffices to be able to efficiently compute
the ratio of prior probabilities:

p(T1 | T1 valid) / p(T2 | T2 valid) = p(T1) p(T2 valid) / (p(T2) p(T1 valid)) = p(T1)/p(T2).    (7)

Additionally note that because the above con-
straints do not depend on the order of the axioms,
constants, and so forth (c'est à dire., the constraints them-
selves are exchangeable), the distribution of T
conditioned on T being valid is exchangeable.

Properties of the Prior p(T ): We emphasize
that these distributions were chosen for simplicity
and ease of implementation, and they worked well
enough in experiments. Cependant, there are likely
many distributions that would work just as well.
The parameters in the above distributions are not
learned; they were set and fixed a priori. Never-
theless, this prior does exhibit useful properties for
a domain- and task-general model of reasoning:

• Occam’s razor: Smaller/simpler theories are
given higher probability than larger and more
complex theories.

• Consistency: Inconsistent theories are dis-

couraged or impossible.

• Entities tend to have a unique name. Our
prior above encodes one direction of this
prior belief: Each entity is unlikely to have
many names. However, the prior does not dis-
courage one name from referring to multiple
entities.

• Entities tend to have a unique type. Note,
however, that this does not discourage types
provable by subsumption. For example, if
the theory has the axioms novel(c1) and
∀x(novel(x) → book(x)), even though
book(c1) is provable, it is not an axiom
in this example and the prior only applies to
axioms.

3.1.2 Generative Process for Proofs p(πi | T )
PWM uses natural deduction, a well-studied proof
calculus, for the proofs (Gentzen, 1935, 1969).
Pfenning (2004) provides an accessible introduc-
tion. Chiffre 2 illustrates a simple example of a
natural deduction proof. Each horizontal line is a
deduction step, with the (zero or more) formulas
above the line being its premises, and the one
formula below the line being its conclusion. Each
deduction step has a label to the right of the line.
For example, the ‘‘∧I’’ step denotes conjunction
introduction: given that A and B are true, ce
step concludes that A ∧ B is true, where A and B
can be any formula. A natural deduction proof can
rely on axioms (denoted by ‘‘Ax’’). We can write


Figure 2: An example of a proof of ¬(A ∧ ¬A).

any natural deduction proof πi as a sequence of
deduction steps πi ≜ (πi,1, . . . , πi,k) by traversing
the proof tree in prefix order. We define a simple
generative process for πi:

1. First sample the length of the proof k from a
Poisson distribution with parameter 20.

2. For each j = 1, . . . , k: Select a deduction
rule from the proof calculus with a categor-
ical distribution. If the Ax rule is selected,
then simply take the next available axiom
from the theory T = a1, a2, . . . If the de-
duction rule requires premises, then each
premise is selected uniformly at random from
πi,1, . . . , πi,j−1.1

The above generative process may produce a forest
rather than a single proof tree. Thus, πi is sampled
conditioned on πi being a valid proof. Just as
with p(T ) in equation 5, this conditioning causes
p(πi | T ) to be intractable to compute. However,
only the ratio of the prior probability is needed for
inference, which can be computed efficiently:

p(πi | T, πi valid) / p(π′i | T, π′i valid) = p(πi | T ) p(π′i valid | T ) / (p(π′i | T ) p(πi valid | T )) = p(πi | T ) / p(π′i | T ).    (8)

Although PWL was initially implemented as-
suming classical logic, it is easy to adapt PWL to
use other logics, such as intuitionistic logic. Intu-
itionistic logic is identical to classical logic except
that the law of the excluded middle A ∨ ¬A is
not a theorem (see Figure 3 for an example where
the two logics disagree). The interpretable nature
of the reasoning module makes it easy to adapt
it to other kinds of logic or proof calculi. PWL
supports both classical and intuitionistic logic.

1Some deduction rules require additional parameters, and
we refer the reader to our code for details on how these
parameters are sampled.

Figure 3: An example from the Electricity1 section in
the ProofWriter dataset. Its label is unknown. Under
classical logic, the query is provably true from the
information in the 1st, 3rd, and 4th sentences.

3.1.3 Inference

Having described the generative process for the
theory T and proofs π, we now describe inference.
Given logical forms x, the goal is to compute the
posterior distribution of T and π such that the con-
clusion of each proof πi is xi. That is, PWL
aims to recover the latent theory and proofs that
explain/entail the given observed logical forms.
To this end, PWL uses Metropolis-Hastings
(MH) (Hastings, 1970; Robert and Casella, 2004).
PWL performs inference in a streaming fash-
ion, starting with the case n = 1 to obtain
MH samples from p(π1, T |x1). Then, for every
new logical form xn, PWL uses the last sample
from p(π1, . . . , πn−1, T |x1, . . . , xn−1) as a start-
ing point and then obtains MH samples from
p(π1, . . . , πn, T |x1, . . . , xn). This warm-start ini-
tialization serves to dramatically reduce the
number of iterations needed to mix the Markov
chain. To obtain the MH samples, the proof of
each new logical form π(0)i is initialized using
Algorithm 1, whereas the proofs of previous logi-
cal forms are kept from the last MH sample. The
axioms in these proofs constitute the theory sam-
ple T (0). Then, for each iteration t = 1, . . . , Niter,
MH proposes a mutation to one or more proofs in
π(t). The possible mutations are listed in Table 2.
This may change axioms in T (t). Let T ′, π′i be the
newly proposed theory and proofs. Then, compute
the acceptance probability:

(p(T ′)/p(T (t))) · ∏_{i=1}^{n} (p(π′i | T ′)/p(π(t)i | T (t))) · (g(T (t), π(t) | T ′, π′) / g(T ′, π′ | T (t), π(t))),    (9)

where g(T ′, π′ | T (t), π(t)) is the probability of
proposing the mutation from T (t), π(t) to T ′, π′,
and g(T (t), π(t) | T ′, π′) is the probability of the
inverse of this mutation. Because this quantity
depends only on the ratio of probabilities, it can
be computed efficiently (see equations 7 and
8). Once this quantity is computed, sample from
a Bernoulli with this quantity as its parameter.
If it succeeds, MH accepts the proposed the-
ory and proofs as the next sample: T (t+1) = T ′
and π(t+1)i = π′i. Otherwise, reject the proposal
and keep the old sample: T (t+1) = T (t) and
π(t+1)i = π(t)i. If every possible theory and proof is
reachable from the initial theory by a sequence of
mutations, then with sufficiently many iterations,
the samples T (t) and π(t)i will be distributed ac-
cording to the true posterior p(T, π|X). If only a
subset of possible theories and proofs are reach-
able from the initial theory, the MH samples
will be distributed according to the true posterior
conditioned on that subset. This may suffice for
many applications, particularly if the theories in
the subset have desirable properties such as better
tractability. But the subset cannot be too small
because then PWL would lose generality.
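The accept/reject loop described above can be sketched in a few lines. The following is an illustrative Python sketch, not PWL's actual implementation: it computes the acceptance ratio of equation 9 in log space, and `propose`, `log_p_theory`, and `log_p_proof` are hypothetical callbacks standing in for the theory prior, proof prior, and proposal distribution.

```python
import math
import random

def mh_step(theory, proofs, propose, log_p_theory, log_p_proof):
    """One Metropolis-Hastings step over (T, pi), following equation 9.

    `propose` returns a mutated (theory, proofs) pair together with the
    forward and reverse proposal log-probabilities log g(new|old) and
    log g(old|new). All of these callbacks are illustrative stand-ins.
    """
    new_theory, new_proofs, log_g_fwd, log_g_rev = propose(theory, proofs)
    # log acceptance ratio: prior ratio + per-proof likelihood ratios
    # + proposal ratio, as in equation 9
    log_alpha = (log_p_theory(new_theory) - log_p_theory(theory)
                 + log_g_rev - log_g_fwd)
    for old, new in zip(proofs, new_proofs):
        log_alpha += log_p_proof(new, new_theory) - log_p_proof(old, theory)
    # sample from a Bernoulli with parameter min(1, exp(log_alpha))
    if random.random() < math.exp(min(0.0, log_alpha)):
        return new_theory, new_proofs  # accept the proposal
    return theory, proofs              # reject and keep the old sample
```

Because only ratios appear, unnormalized priors suffice, which is what makes the efficient computation mentioned above possible.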

The function init_proof in Algorithm 1 recursively calls init_disproof. Due to space limitations, we refer the reader to our code for this function; it closely mirrors the structure of init_proof. The purpose of init_proof is to find some proof of a given higher-order formula, or return null if none exists. Its task is finding a satisfying abductive proof, which is easier than theorem proving, since it can create new axioms as needed. The returned proof need not be ''optimal'' because it serves as the initial state for MH, which will further refine the proof. The validity of the proofs is guaranteed by the fact that init_proof only returns valid proofs and the MH proposals preserve validity.

2 swap randomly selects an element in its input list to swap with the first element. The probability of moving an element c to the front of the list is computed as follows: Recursively inspect the atoms in the formula f(c) and count the number of ''matching'' atoms: An atom t(c) or c(t) is considered ''matching'' if it is provable in T. Next, count the number of ''mismatching'' axioms: for each atom t(c) in the formula f(c), an axiom t′(c) is ''mismatching'' if t ≠ t′. And similarly, for each atom c(t) in the formula f(c), an axiom c(t′) is ''mismatching'' if t ≠ t′. Let n be the number of ''matching'' atoms and m be the number of ''mismatching'' axioms; then the probability of moving c to the front of the list is proportional to exp{n − 2m}. This greatly increases the chance of finding a high-probability proof in the first iteration of the loop on line 31, and since this function is also used in an MH proposal, it dramatically improves the acceptance rate. This reduces the number of MH iterations needed to sufficiently mix the Markov chain.
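For concreteness, this weighting scheme can be sketched as follows; `num_matching` and `num_mismatching` are hypothetical callbacks computing the counts n and m for a candidate constant, standing in for the inspection of the theory's axioms described above.

```python
import math
import random

def swap_to_front(constants, num_matching, num_mismatching):
    """Sample a constant to move to the front of the candidate list, with
    probability proportional to exp(n - 2m), as in footnote 2. The count
    callbacks are illustrative stand-ins for checking f(c) against T."""
    weights = [math.exp(num_matching(c) - 2 * num_mismatching(c))
               for c in constants]
    r = random.random() * sum(weights)
    chosen = constants[-1]
    for c, w in zip(constants, weights):
        r -= w
        if r <= 0:
            chosen = c
            break
    # move the chosen constant to the front, keeping the rest in order
    return [chosen] + [c for c in constants if c is not chosen]
```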

Algorithm 1: Pseudocode for proof initialization. If any new axiom violates the deterministic constraints in Section 3.1.1, the function returns null.

 1  function init_proof(formula A)
 2    if A is a conjunction B1 ∧ ... ∧ Bn
 3      for i = 1 to n do
 4        φi = init_proof(Bi)
 5      return [φ1, ..., φn ⊢ B1 ∧ ... ∧ Bn] (∧I)
 6    else if A is a disjunction B1 ∨ ... ∨ Bn
 7      I = shuffle(1, ..., n)
 8      for i ∈ I do
 9        φi = init_proof(Bi)
10        if φi ≠ null
11          return [φi ⊢ B1 ∨ ... ∨ Bn] (∨I)
12    else if A is a negation ¬B
13      return init_disproof(B)
14    else if A is an implication B1 → B2
15      if using classical logic
16        I = shuffle(1, 2)
17        for i ∈ I do
18          if i = 1
19            φ1 = init_disproof(B1)
20            if φ1 = null continue
21            return [from B1 (Ax) and φ1, derive ⊥ (¬E), then B2 (⊥E), then B1 → B2] (→I)
22          else
23            φ2 = init_proof(B2)
24            if φ2 = null continue
25            return [φ2 ⊢ B1 → B2] (→I)
26      else if using intuitionistic logic
27        return [B1 → B2] (Ax)
28    else if A is an existential quantification ∃x.f(x)
29      let C be the set of known constants, numbers, and strings in T, and the new constant c*
30      I = swap2(shuffle(C))
31      for c ∈ I do
32        φc = init_proof(f(c))
33        if φc ≠ null
34          return [φc ⊢ ∃x.f(x)] (∃I)
35    else if A is a universal quantification ∀x.f(x)
36      return [∀x.f(x)] (Ax)
37    else if A is an equality B1 = B2
38      return [B1 = B2] (Ax)
39    else if A is an atom (e.g., book(great_gatsby))
40      return [A] (Ax)
41    else return null
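To make the control flow of Algorithm 1 concrete, here is a minimal runnable sketch of init_proof restricted to a propositional fragment (conjunctions, disjunctions, and atoms). The tuple-based formula encoding and dictionary-valued proofs are illustrative simplifications of PWL's higher-order representation, not its actual data structures.

```python
import random

# Formulas are nested tuples like ("and", A, B), ("or", A, B), or an atom
# string. Proofs are dicts recording the rule used and the subproofs.
# Atoms become new axioms, mirroring how abduction creates axioms as needed.
def init_proof(formula, axioms):
    if isinstance(formula, str):                      # atom: introduce an axiom
        axioms.add(formula)
        return {"rule": "Ax", "formula": formula}
    op = formula[0]
    if op == "and":                                   # prove every conjunct
        subs = [init_proof(b, axioms) for b in formula[1:]]
        if any(s is None for s in subs):
            return None
        return {"rule": "andI", "formula": formula, "premises": subs}
    if op == "or":                                    # prove one random disjunct
        for b in random.sample(formula[1:], len(formula) - 1):
            sub = init_proof(b, axioms)
            if sub is not None:
                return {"rule": "orI", "formula": formula, "premises": [sub]}
        return None
    return None                                       # other connectives omitted
```

As in Algorithm 1, the disjunction case shuffles the disjuncts and commits to the first provable premise, which is exactly the selection that the fourth MH proposal in Table 2 later re-samples.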

3.2 Language Module

For the language module, PWM uses the prob-
abilistic model of Saparov et al. (2017). Le
generative nature of their semantic parsing model
allows it to fit seamlessly into PWM and PWL.


Table 2: A list of the Metropolis-Hastings proposals implemented in PWL thus far, together with the probability of selecting each proposal. N, here, is a normalization term: N = |A| + |U| + |C| + |P| + α|M| + β|S|, where: A is the set of grounded atomic axioms in T (e.g., square(c1)), U is the set of universally-quantified axioms that can be eliminated by the second proposal, C is the set of axioms that declare the size of a set (e.g., size(A) = 4), P is the set of nodes of type ∨I, →I, or ∃I3 in the proofs π, M is the set of ''mergeable'' events (described below), and S is the set of ''splittable'' events. In our experiments, α = 2 and β = 0.001.

- (probability 1/N) Select a grounded atomic axiom (e.g., square(c1)) and propose to replace it with an instantiation of a universal quantification (e.g., ∀x(rectangle(x) ∧ rhombus(x) → square(x))), where the antecedent conjuncts are selected uniformly at random from the other grounded atomic axioms for the constant c1: rectangle(c1), rhombus(c1), etc.
- (probability 1/N) The inverse of the above proposal: select an instantiation of a universal quantification and replace it with a grounded atomic axiom.
- (probability 1/N) Select an axiom that declares the size of a set (e.g., of the form size(us_states) = 50), and propose to change the size of the set by sampling from the prior distribution, conditioned on the maximum and minimum consistent set size.
- (probability 1/N) Select a node from a proof tree of type ∨I, →I, or ∃I.3 These nodes were created in Algorithm 1 on lines 7, 16, and 30, respectively, where for each node, a single premise was selected out of a number of possible premises. This proposal naturally follows from the desire to explore other selections by re-sampling the proof: it simply calls init_proof again on the formula at this proof node.
- (probability α/N) Merge: Select a ''mergeable'' event; that is, three constants (ci, cj, ck) such that arg1(ci) = cj, arg2(ci) = ck, and t(ci) for some constant t are axioms, and there also exist constants (ci′, cj′, ck′) such that i′ > i, arg1(ci′) = cj′, arg2(ci′) = ck′, and t(ci′) are axioms. Next, propose to merge ci′ with ci by replacing all instances of ci′ with ci in the proof trees, cj′ with cj, and ck′ with ck. This proposal is not necessary in that these changes are reachable with other proposals, but those proposals may have low probability, and so this can help to more easily escape local maxima.
- (probability β/N) Split: The inverse of the above proposal.

The logical forms in their model are distributed
according to a semantic prior, which we replace
with our distribution of logical forms conditioned
on the theory p(πi|T ). Their parser is probabilistic
and finds the k-best logical forms that maximize
p(xi | yi, T ) for a given input sentence. Combined
with our reasoning module’s ability to compute
the probability of a logical form, the parser can
resolve ambiguous interpretations of sentences by
exploiting acquired knowledge. We will demon-
strate the utility of this property in resolving
lexical ambiguity.

Cependant, the semantic grammar in Saparov
et autres. (2017) was designed for a DATALOG rep-
resentation of logical forms. Ainsi, we designed
and implemented a new grammar for our more
domain-general formalism in higher-order logic.
Though their model induces preterminal produc-
tion rules from data (par exemple., N → ‘‘cat’’), nous
must manually specify the nonterminal produc-
tion rules (par exemple., NP → ADJP NP). Ceci permet

3Also disproofs of conjunctions, if using classical logic.

us to encode prior knowledge of the English
language into PWM, dramatically improving its
statistical efficiency and obviating the need for
massive training sets to learn English syntax.
It is nonetheless tedious to design these rules while maintaining domain-generality. Once specified, however, these rules can be re-used in new tasks and domains with minimal or no changes. We also improved their model to generalize over inflected forms of words. In the
generative process, instead of generating sentence
tokens directly (e.g., ''I am sleeping''), PWM generates word roots with flags indicating their inflection (e.g., ''I be[1ST,SG] sleep[PRS,PTCP]''). During parsing, this has the effect of performing morphological and semantic parsing jointly. We extracted the necessary comprehensive morphology information from Wiktionary (Wikimedia Foundation 2020).
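The root-plus-flags scheme can be illustrated with a toy sketch; the `LEXICON` below is a hypothetical stand-in for the morphology tables extracted from Wiktionary.

```python
# Illustrative sketch of the inflection step: the generative process emits
# word roots with inflection flags, and a morphology lexicon maps them to
# surface tokens. The two entries here are toy examples.
LEXICON = {
    ("be", ("1ST", "SG")): "am",
    ("sleep", ("PRS", "PTCP")): "sleeping",
}

def inflect(root, flags=()):
    # fall back to the root itself for uninflected words
    return LEXICON.get((root, tuple(flags)), root)

def realize(tokens):
    """Render a sequence of (root, flags) pairs as a surface sentence."""
    return " ".join(inflect(root, flags) for root, flags in tokens)
```

During parsing, the same mapping is inverted, which is why morphological and semantic parsing happen jointly.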

We train this new grammar to learn the pa-
rameters that govern the conditional distributions
and the preterminal production rules. To do so, we construct a small seed training set consisting of 55 labeled sentences, 47 nouns, 55 adjectives, and 20 verbs.4 We wrote and labeled these sentences by hand, largely in the domain of astronomy, with the
constructions. This small training set was suffi-
cient thanks to the statistical efficiency of PWM.
While PWL uses the same parsing algorithm as Saparov et al. (2017), we provide an easier-to-understand presentation. Given an input sentence yi, the parser aims to find the logical form(s) xi and derivation trees ti that maximize the posterior probability p(xi, ti|yi, T ). This discrete optimization is performed using branch-and-bound (Land and Doig, 1960): The
algorithm starts by considering the set of all
derivation trees and partitions it into a number
of subsets (the ‘‘branch’’ step). For each sub-
set S, the parser computes an upper bound on
the log probability of any derivation in S (le
‘‘bound’’ step). Having computed the bound for
each subset, the parser puts them into a priority
queue, prioritized by the bound. The parser then
dequeues the subset with the highest bound and
repeats this process, further subdividing this set,
computing the bound for each subdivision, et
adding them to the queue. Eventually, the parser
will dequeue a subset containing a single deriva-
tion whose log probability is at least the highest
priority in the queue. This derivation is optimal.
The algorithm can be continued to obtain the top-k
derivations/logical forms. Because this algorithm
operates over sets of logical forms (where each set
is possibly infinite), we implemented a data struc-
ture to sparsely represent such sets of higher-order
formulas, as well as algorithms to perform set
operations, such as intersection and subtraction.
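The branch-and-bound loop described above can be sketched generically; `branch`, `bound`, and `is_single` below are hypothetical callbacks standing in for the parser's operations on (possibly infinite) sets of derivations.

```python
import heapq

def branch_and_bound(root, branch, bound, is_single):
    """Best-first branch-and-bound: repeatedly dequeue the subset with the
    highest upper bound, subdivide it, and stop when a singleton subset is
    dequeued (its bound is then its exact score, so it is optimal).
    The callbacks are problem-specific, illustrative stand-ins."""
    queue = [(-bound(root), 0, root)]   # max-heap via negated bounds
    counter = 1                          # tie-breaker for equal bounds
    while queue:
        _, _, subset = heapq.heappop(queue)
        if is_single(subset):
            return subset
        for child in branch(subset):     # the ''branch'' step
            heapq.heappush(queue, (-bound(child), counter, child))
            counter += 1                 # the ''bound'' step is bound(child)
    return None
```

Continuing to pop the queue after the first singleton yields the top-k derivations, as described above.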

4 Experiments

4.1 ProofWriter

To demonstrate our implementation as a proof-of-concept, we evaluate it on two question-answering tasks. The first is the ProofWriter dataset (Tafjord et al., 2021), which itself is based on the earlier RuleTaker dataset (Clark et al., 2020). To evaluate and demonstrate the out-of-domain language understanding
and reasoning ability of PWL, we use the

Birds-Electricity ''open-world''5 portion of the dataset, as the authors evaluated their method on this portion zero-shot, just as we do (i.e., the algorithm did not see any example from this portion
during training). This portion of the data is subdi-
vided into 6 sections, each with varying degrees of
difficulty. An example from this dataset is shown
in Figure 3. For each example, PWL reads the
context and abduces a theory. Next, it parses the
query sentence yn+1 into a logical form xn+1 and
estimates its unnormalized probability:

\[
p(x_{n+1} \mid \boldsymbol{x}) \propto p(x_1, \ldots, x_{n+1}), \tag{10}
\]
\[
= \sum_{T,\; \pi_i \text{ proof of } x_i} p(T) \prod_{i=1}^{n+1} p(\pi_i \mid T), \tag{11}
\]
\[
\approx \sum_{T^{(t)},\, \pi_i^{(t)} \sim T, \pi \mid \boldsymbol{x}} p(T^{(t)}) \prod_{i=1}^{n+1} p(\pi_i^{(t)} \mid T^{(t)}). \tag{12}
\]
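The approximation in equation 12 amounts to summing the joint over distinct posterior samples, which is numerically a log-sum-exp; the sketch below uses a hypothetical `log_joint` callback standing in for log p(T) + Σᵢ log p(πᵢ | T).

```python
import math

def log_sum_exp(xs):
    # numerically stable log of a sum of exponentials
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def query_log_probability(samples, log_joint):
    """Approximate the unnormalized log probability of a query, as in
    equation 12: sum the joint over *distinct* theory/proof samples drawn
    by MH. `samples` is an iterable of hashable samples; `log_joint` is an
    illustrative stand-in for PWL's internal structures."""
    return log_sum_exp([log_joint(s) for s in set(samples)])
```

Deduplicating the samples is what makes the crude-looking sum work: the most probable theories dominate and are found repeatedly by the chain, but each contributes once.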

Here, x are the previously read logical forms
(the context). Since the quantity in equation 11
is intractable to compute, PWL approximates it
by sampling from the posterior T, π1, . . . , πn+1 |
x1, . . . , xn+1 and summing over distinct samples.
Although this approximation seems crude, the sum
is dominated by a small number of the most prob-
able theories and proofs, and MH is an effective
way to find them, as we observe in experiments.
MH is run for 400 iterations, and at every 100th
iteration, PWL re-initializes the Markov chain
by performing 20 ''exploratory'' MH steps (i.e.,
consisting of only the third and fourth propos-
als in Table 2 and accepting every proposal).
This re-initialization is analogous to a random
restart and can help to escape from local maxima.
Cependant, it may be promising to explore other
approaches to compute this quantity, such as Luo
et autres. (2020). Once PWL has computed this prob-
ability for the query sentence, it does the same
for the negation of the sentence. These unnormal-
ized probabilities are compared, and if they are
within 2000 in log probability, PWL returns the
label unknown. If the first probability is suffi-
ciently larger than the second, PWL returns true,
and otherwise, returns false. The parameters in
the prior were set by hand initially by choos-
ing values that we thought were reasonable (e.g.,
the average length of a natural deduction proof

4 The grammar, morphology data, code, as well as the seed training set are available in our Github repository.

5 The dataset comes in two flavors: one that makes the closed-world assumption, and one that does not.


Table 3: Zero-shot accuracy of PWL and baselines on the ProofWriter dataset.

for a sentence containing a simple subject noun
phrase, object noun phrase, and transitive verb is
around 20 steps, which is why the Poisson parameter for the proof length is set to 20). The
values were tweaked as necessary by running
the algorithm on toy examples during debug-
ging. Note that the sentences ‘‘Bill is a bird’’
and ‘‘Bill is not a bird’’ can still both be true if
each ‘‘Bill’’ refers to distinct entities. To avoid
this, we chose an extreme value of the prior parameter such that the log prior probability of a
theory with two entities having the same name is
2000 less than that of a theory where the name is
unique. It is for this reason 2000 was chosen as
the threshold for determining whether a query is
true/false vs unknown. This prior worked well
enough in our experiments, but the goal is to
have a single prior work well with any task, so further work to explore which priors work better across a wider variety of tasks is welcome. We
evaluated PWL using both classical and intuition-
istic logic, even though the ground truth labels
in the dataset were generated using intuitionistic
logic.
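The decision rule described above can be summarized as a small sketch; the function name and signature are ours, not PWL's actual API.

```python
def classify(logp_query, logp_negation, threshold=2000.0):
    """Three-way decision rule: compare the unnormalized log probabilities
    of a query and its negation. If they are within `threshold` of each
    other, answer unknown; otherwise answer whichever is more probable.
    The threshold matches the prior's log-probability penalty for a theory
    in which two entities share a name, as described above."""
    if abs(logp_query - logp_negation) <= threshold:
        return "unknown"
    return "true" if logp_query > logp_negation else "false"
```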

Table 3 lists the zero-shot accuracy of PWL,
comparing with baselines based on the T5 trans-
former (Raffel et al., 2020). We emphasize here
that PWL is not perfectly comparable to the base-
line, because they aim to demonstrate that their
method can learn to reason. We instead aim to
demonstrate that PWL’s ability to parse and rea-
son end-to-end generalizes to an out-of-domain
question-answering task. The baseline is trained
on other portions of the ProofWriter data, whereas
PWL is trained only on its seed training set. PWL
performed much better using intuitionistic logic
than classical logic, as expected since the ground
truth labels were generated using intuitionistic
semantics. Cependant, most real-world reasoning
tasks would take the law of the excluded middle

to be true, and classical logic would serve as a bet-
ter default. Although the task is relatively simple,
it nevertheless demonstrates the proof-of-concept
and the promise of further research.

4.2 FictionalGeoQA

The sentences in the ProofWriter experiment are
template-generated and have simple semantics.
For the sake of an evaluation more representative of real-world language, we introduce a
new question-answering dataset called Fictional-
GeoQA.6 To create this dataset, we took questions
from GeoQuery (Zelle and Mooney, 1996), and
for each question, we wrote a paragraph context
containing the information necessary to answer the
question. We added distractor sentences to make
the task more robust against heuristics. Whenever
possible, the sentences in this paragraph were
taken from Simple English Wikipedia. However,
some facts, such as the lengths of rivers, are not
expressed in sentences in Wikipedia (they typi-
cally appear in a table on the right side of the
page), so we wrote those sentences by hand: We took questions from GeoQuery that expressed the desired fact in interrogative form (e.g., ''What is the length of ... ?'') and converted them into declarative form (e.g., ''The length of ... is ... .''). The resulting
dataset contains 600 examples, where 67.4% of the sentences are from Simple English Wikipedia, and 90% of the examples contain at least one
sentence not from Wikipedia. We replaced all
place names with fictional ones to remove any
confounding effects from pretraining. To keep the
focus of the evaluation on reasoning ability, we chose to restrict the complexity of the language. In particular, each sentence is independent and can be

6Available at github.com/asaparov/fictionalgeoqa.


[Table 4: numeric grid garbled in extraction. Rows: N (number of examples), UnifiedQA, Boxer + E, Boxer + Vampire, PWL parser + E, PWL parser + Vampire, and PWL; columns: the dataset subsets described in the legend below, plus the full dataset. On all 600 examples, the recoverable accuracies are: UnifiedQA 33.8, Boxer + E 9.7, Boxer + Vampire 9.7, PWL parser + E 5.0, PWL parser + Vampire 9.0, and PWL 43.1.]

LEGEND:
superlative The subset of the dataset with examples that require reasoning over superlatives, i.e., ''longest river.''
subjective concept def. Subset with definitions of ''subjective'' concepts, i.e., ''Every river longer than 500 km is major.''
objective concept def. Subset with definitions of ''objective'' concepts, i.e., the population of a location is the number of people living there.
lexical ambiguity Subset with lexical ambiguity, i.e., ''has'' means different things in ''a state has a city named'' vs ''a state has an area of.''
negation Subset with examples that require reasoning with classical negation (negation-as-failure is insufficient).
large context Subset of examples where there are at least 100 sentences in the context.
arithmetic Subset with examples that require simple arithmetic.
counting Subset with examples that require counting.
n subset(s) Examples that belong to exactly n of the above subsets (no example is a member of more than 4 subsets).

Table 4: Zero-shot accuracy of PWL and baselines on the FictionalGeoQA dataset.

understood in isolation (e.g., no cross-sentential anaphora). The sentences are more complex than those in ProofWriter, having more of the complexities of real language, such as synonymy, lexical ambiguity (e.g., what is the semantics of ''has'' in ''a state has city'' vs ''a state has area''; or whether ''largest state'' refers to area or population), and syntactic ambiguity. This increased difficulty is evident in the results. This dataset is meant to evaluate out-of-domain generalizability, so we do not provide a separate training set for fine-tuning. An example is shown in Figure 4.

Figure 4: An example from FictionalGeoQA, a new fictional geography question-answering dataset that we created to evaluate reasoning in natural language understanding.

We compare PWL (using classical logic) with a number of baselines: (1) UnifiedQA (Khashabi et al., 2020), a QA system based on large-scale neural language models, (2) Boxer (Bos, 2015), a wide-coverage semantic parser, combined with Vampire 4.5.1 (Kovács and Voronkov, 2013), a theorem prover for full first-order logic, (3) Boxer combined with E 2.6 (Schulz et al., 2019), another theorem prover for full first-order logic, (4) the language module of PWL combined with Vampire, and (5) the language module of PWL combined with E. The results are shown in Table 4, along with a breakdown across multiple subsets of the dataset. UnifiedQA performs relatively well but fares more poorly on questions with negation and subjective concept definitions (e.g., ''Every river longer than 500km is major. . . What are the major rivers?''). Humans are easily able to understand and utilize such definitions, and the ability to do so is instrumental in learning about new concepts or words in new domains. PWL is able to fare better than UnifiedQA in examples with lexical ambiguity, as a result of the language module's ability to exploit acquired knowledge to resolve ambiguities. We find that Boxer has significantly higher coverage than PWL (100% vs 79.8%) but much lower precision. For example, Boxer uses the semantic representation in the Parallel Meaning Bank (Abzianidze et al., 2017), which has a simpler representation of superlatives, and is thus unable to capture the correct semantics of superlatives in examples of this dataset. We also find that for most examples, Boxer produces different semantics for the question vs. the context sentences, oftentimes predicting the incorrect semantic role for the interrogative words, which leads to the theorem provers being unable to find

a proof for these extra semantic roles. We also ex-
perimented with replacing our reasoning module
with a theorem prover and found that for almost all
examples, the search of the theorem prover would
explode combinatorially. This was due to the fact
that our semantic representation relies heavily on
sets, so a number of simple set theoretic axioms
are required for the theorem provers, but this
quickly causes the deduction problem to become
undecidable. Our reasoning module instead performs abduction, and is able to create axioms to
more quickly find an initial proof, and then re-
fine that proof using MH. Despite our attempt to
maximize the generalizability of the grammar in
PWL, there are a number of linguistic phenomena
that we did not yet implement, such as interroga-
tive subordinate clauses, wh-movement, spelling
or grammatical mistakes, and so forth, and this
led to the lower coverage on this dataset. Work
remains to be done to implement these missing
production rules in order to further increase the
coverage of the parser.

5 Conclusions and Future Work

We introduced PWM, a fully symbolic Bayesian
model of semantic parsing and reasoning, lequel
we hope serves as a compelling first step in
a research program toward more domain- and
task-general NLU. We derived PWL, an efficient
inference algorithm that reads sentences by pars-
ing and abducing updates to its latent world model
that capture the semantics of those sentences, and
empirically demonstrated its ability to generalize
to two out-of-domain question-answering tasks.
To do so, we created a new question-answering
dataset, FictionalGeoQA, designed specifically
to evaluate reasoning ability while capturing more
of the complexities of real language and being
robust against heuristic strategies. PWL is able
to read and understand sentences with richer semantics, such as definitions of new concepts. In
contrast with past deductive reasoning approaches,
PWL performs abduction, which is computation-
ally easier. The highly underspecified nature of the
problem of abduction is alleviated by the prob-
abilistic nature of PWL, as it gives a principled
way to find the most probable theories. We present
an inference strategy where Metropolis-Hastings
(MH) is performed on each sentence, in sequence,
where the previous sample of the theory and proofs

provides a warm-start for inference of the next
sentence, reducing the number of MH iterations.
There are many avenues for future work: A
simple prior was used for proofs p(πi|T ), and an
alternative is to use a compositional exchange-
able prior such as adaptor grammars (Johnson
et coll., 2006).

The first MH proposal in Table 2 is simple
but restrictive: The antecedent conjuncts and the
consequent are restricted to be atomic. MH would
be able to explore a much larger and semanti-
cally richer set of theories if the antecedent or
consequent could contain more complex formulas, including quantified formulas. In addition, the
inference algorithm sometimes becomes stuck in
local maxima. One way to improve the efficiency
of inference is to add a new MH proposal that
specifically proposes to split or merge types. For
example, if the theory has the axioms cat(c1)
and dog(c1), this proposal would split c1 into
two concepts: cat(c1) and dog(c2). This kind of
type-based Markov chain Monte Carlo is similar
in principle to Liang et al. (2010).
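As an illustration, the proposed split move could look like the following sketch over (predicate, constant) axiom pairs; all names here are hypothetical, since this proposal is future work.

```python
import itertools

fresh_ids = itertools.count(2)  # source of fresh constant names (illustrative)

def split_constant(axioms, constant, predicate):
    """Sketch of the proposed split move: reassign the axioms of one concept
    (here, a single predicate) from an overloaded constant to a fresh one,
    e.g., cat(c1), dog(c1) becomes cat(c1), dog(c2)."""
    new_constant = "c%d" % next(fresh_ids)
    return [(p, new_constant if (p == predicate and c == constant) else c)
            for (p, c) in axioms]
```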

As mentioned earlier, a model of context is
necessary in the language module to properly han-
dle cross-sentential anaphora and conversational
contexts. Real language very rarely consists of
sentences that are independent of context. There
are also many research questions on the issue of
scalability. Although PWL is able to scale to ex-
amples in FictionalGeoQA with more than 100
sentences, there are two main bottlenecks currently preventing it from scaling to significantly
larger theories: (1) the maintenance of global
consistency, et (2) the unfocused nature of the
current MH proposals. When checking for con-
sistency of a new axiom, rather than considering
all other axioms/sets in the theory, it would be
preferable to only consider the portion of the theory relevant to the new axiom. In addition, the
current MH proposals do not take into account
the goal of reasoning. Par exemple, if the current
task is to answer a question about geography, then
MH proposals for proofs unrelated to geography
are wasteful, and would increase the number of
MH steps needed. A more clever goal-aware ap-
proach for selecting proofs to mutate would help
to alleviate this problem and improve scalability.
PWM provides a path to incorporate information
from additional modalities in a principled fashion:
for example by adding a generative model of im-
ages, which would serve as a separate ‘‘vision


module.’’
Furthermore, even though PWL is
fully-symbolic, non-symbolic methods could be
used for expressive prior/proposal distributions
or approximate inference. There are many fas-
cinating research paths to pursue from here.

Acknowledgments

We thank the anonymous reviewers and the Ac-
tion Editor for their invaluable feedback. We also
thank Peter Clark, William W. Cohen, Rik van
Noord, Johan Bos, and Emmanouil A. Platanios
for their insightful and helpful discussion. Ce
research was funded in part by the Air Force Of-
fice of Scientific Research under AFOSR grant
FA95502010118.

References

Lasha Abzianidze, Johannes Bjerva, Kilian Evang, Hessel Haagsma, Rik van Noord, Pierre Ludmann, Duc-Duy Nguyen, and Johan Bos. 2017. The parallel meaning bank: Towards a multilingual corpus of translations annotated with compositional meaning representations. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 242–247. Association for Computational Linguistics. https://doi.org/10.18653/v1/E17-2039

David J. Aldous. 1985. Exchangeability and
related topics. In Lecture Notes in Mathemat-
ics, pages 1–198, Springer Berlin Heidelberg.
https://doi.org/10.1007/BFb0099421

Erik Arakelyan, Daniel Daza, Pasquale Minervini, and Michael Cochez. 2021. Complex query answering with neural link predictors. In International Conference on Learning Representations.

Elena Bellodi and Fabrizio Riguzzi. 2015. Structure learning of probabilistic logic programs by searching the clause space. Theory and Practice of Logic Programming, 15(2):169–212. https://doi.org/10.1017/S1471068413000689

Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 5185–5198. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.463

Chandra Bhagavatula, Ronan Le Bras, Chaitanya
Malaviya, Keisuke Sakaguchi, Ari Holtzman,
Hannah Rashkin, Doug Downey, Wen-tau Yih,
and Yejin Choi. 2020. Abductive commonsense
reasoning. In 8th International Conference on
Learning Representations, ICLR 2020, Addis
Ababa, Ethiopia, April 26–30, 2020.

Johan Bos. 2015. Open-domain semantic parsing
with boxer. In Proceedings of the 20th Nordic
Conference of Computational Linguistics,
NODALIDA 2015, Institute of the Lithuanian
Language, Vilnius, Lithuania, May 11-13, 2015,
volume 109 of Linköping Electronic Conference Proceedings, pages 301–304. Linköping
University Electronic Press / Association for
Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. CoRR, abs/2005.14165v4.

Eugene Charniak and Robert P. Goldman. 1993. A Bayesian model of plan recognition. Artificial Intelligence, 64(1):53–79.

Alonzo Church. 1940. A formulation of the simple
theory of types. Journal of Symbolic Logic,
5(2):56–68.

Peter Clark, Oyvind Tafjord, and Kyle Richardson.
2020. Transformers as soft reasoners over lan-
guage. In Proceedings of
the Twenty-Ninth
International Joint Conference on Artificial
Intelligence, IJCAI 2020, pages 3882–3890.
ijcai.org.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
4
6
3
2
0
0
6
9
8
4

/

/
t

je

un
c
_
un
_
0
0
4
6
3
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Robin Cooper, Simon Dobnik, Shalom Lappin, and Staffan Larsson. 2015. Probabilistic type theory and natural language semantics. In Linguistic Issues in Language Technology, Volume 10, 2015. CSLI Publications. https://doi.org/10.33011/lilt.v10i.1357

Andrew Cropper and Rolf Morel. 2021. Learning programs by learning from failures. Machine Learning, 110(4):801–856. https://doi.org/10.1007/s10994-020-05934-z

James Cussens. 2001. Parameter estimation in
stochastic logic programs. Machine Learning,
44(3):245–271. https://doi.org/10.1023/A:1010924021315

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

D. R. Dowty. 1981. Introduction to Montague Semantics, 1st ed. Studies in Linguistics and Philosophy, 11. Springer Netherlands, Dordrecht. https://doi.org/10.1007/978-94-009-9065-4_1

H. L. Dreyfus. 1985. From micro-worlds to knowledge representation: AI at an impasse. In R. J. Brachman and H. J. Levesque, editors, Readings in Knowledge Representation, pages 71–93. Kaufmann, Los Altos, CA.

Jesse Dunietz, Gregory Burnham, Akash
Bharadwaj, Owen Rambow, Jennifer Chu-
Carroll, and David A. Ferrucci. 2020. To test
machine comprehension, start by defining com-
prehension. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics, ACL 2020, En ligne, July 5–10, 2020,
pages 7839–7859. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.701

Thomas S. Ferguson. 1973. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230. https://doi.org/10.1214/aos/1176342360

Ulrich Furbach, Andrew S. Gordon, and Claudia Schon. 2015. Tackling benchmark problems of commonsense reasoning. In Proceedings of the Workshop on Bridging the Gap between Human and Automated Reasoning, a workshop of the 25th International Conference on Automated Deduction (CADE-25), Berlin, Germany, August 1, 2015, volume 1412 of CEUR Workshop Proceedings, pages 47–59. CEUR-WS.org.

Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. 2019. On making reading comprehension more comprehensive. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, MRQA@EMNLP 2019, Hong Kong, China, November 4, 2019, pages 105–112. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-5815

G. Gentzen. 1935. Untersuchungen über das logische Schließen I. Mathematische Zeitschrift, 39:176–210. https://doi.org/10.1007/BF01201363

G. Gentzen. 1969. Investigations into Logical Deduction. In M. E. Szabo, editor, The Collected Papers of Gerhard Gentzen, pages 68–213. North-Holland, Amsterdam. https://doi.org/10.1016/S0049-237X(08)70822-X

Howard Gregory. 2015. Language and Logics:
An Introduction to the Logical Foundations
of Language. Edinburgh University Press.

W. K. Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109. https://doi.org/10.1093/biomet/57.1.97

Leon Henkin. 1950. Completeness in the theory of types. Journal of Symbolic Logic, 15(2):81–91. https://doi.org/10.2307/2266967

Jerry R. Hobbs. 2006. Abduction in Natural Language Understanding, chapter 32. John Wiley & Sons, Ltd.

Jerry R. Hobbs, Mark E. Stickel, Douglas E. Appelt, and Paul A. Martin. 1993. Interpretation as abduction. Artificial Intelligence, 63(1–2):69–142. https://doi.org/10.1016/0004-3702(93)90015-4


Aidan Hogan, Eva Blomqvist, Michael Cochez,
Claudia d’Amato, Gerard de Melo, Claudio
Guti´errez, Sabrina Kirrane, Jos´e Emilio Labra
Gayo, Roberto Navigli, Sebastian Neumaier,
Axel-Cyrille Ngonga Ngomo, Axel Polleres,
Sabbir M. Rashid, Anisa Rula, Lukas
Schmelzeisen, Juan F. Sequeda, Steffen Staab,
and Antoine Zimmermann. 2021. Knowledge graphs. ACM Computing Surveys, 54(4):71:1–71:37. https://doi.org/10.1145/3447772

Arcchit Jain, Tal Friedman, Ondrej Kuzelka,
Guy Van den Broeck, and Luc De Raedt. 2019.
Scalable rule learning in probabilistic knowl-
edge bases. In 1st Conference on Automated
Knowledge Base Construction, AKBC 2019,
Amherst, MA, USA, May 20–22, 2019.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2006. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4–7, 2006, pages 641–648. MIT Press.

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, November 16–20, 2020, volume EMNLP 2020 of Findings of ACL, pages 1896–1907. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.171

Iuliia Kotseruba and John K. Tsotsos. 2020. 40 years of cognitive architectures: Core cognitive abilities and practical applications. Artificial Intelligence Review, 53(1):17–94. https://doi.org/10.1007/s10462-018-9646-y

Laura Kovács and Andrei Voronkov. 2013. First-order theorem proving and Vampire. In Computer Aided Verification - 25th International Conference, CAV 2013, Saint Petersburg, Russia, July 13–19, 2013, Proceedings, volume 8044 of Lecture Notes in Computer Science, pages 1–35. Springer.

John E. Laird, Allen Newell, and Paul S. Rosenbloom. 1987. SOAR: An architecture for general intelligence. Artificial Intelligence, 33(1):1–64. https://doi.org/10.1016/0004-3702(87)90050-6

Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. 2016. Building machines that learn and think like people. CoRR, abs/1604.00289v3.

UN. H. Land and A. G. Doig. 1960. An automatic
method of solving discrete programming prob-
lems. Econometrica, 28(3):497. https://
doi.org/10.2307/1910129

Percy Liang, Michael I. Jordan, and Dan Klein. 2010. Type-based MCMC. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2–4, 2010, Los Angeles, California, USA, pages 573–581. Association for Computational Linguistics.

Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, pages 5210–5217. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.465

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly op-
timized BERT pretraining approach. CoRR,
abs/1907.11692v1.

Yucen Luo, Alex Beatson, Mohammad Norouzi,
Jun Zhu, David Duvenaud, Ryan P. Adams,
and Ricky T. Q. Chen. 2020. SUMO: Unbiased
estimation of log marginal probability for latent
variable models. In 8th International Confer-
ence on Learning Representations, ICLR 2020,
Addis Ababa, Ethiopia, April 26–30, 2020.
OpenReview.net.

Tom M. Mitchell, William W. Cohen, Estevam R. Hruschka Jr., Partha P. Talukdar, Bo Yang,


Justin Betteridge, Andrew Carlson, Bhavana
Dalvi Mishra, Matt Gardner, Bryan Kisiel,
Jayant Krishnamurthy, Ni Lao, Kathryn
Mazaitis, Thahir Mohamed, Ndapandula
Nakashole, Emmanouil A. Platanios, Alan
Ritter, Mehdi Samadi, Burr Settles, Richard C.
Wang, Derry Wijaya, Abhinav Gupta, Xinlei
Chen, Abulhair Saparov, Malcolm Greaves,
and Joel Welling. 2018. Never-ending learning. Communications of the ACM, 61(5):103–115. https://doi.org/10.1145/3191513

Stephen Muggleton. 1991. Inductive logic programming. New Generation Computing, 8(4):295–318. https://doi.org/10.1007/BF03037089

Stephen Muggleton. 1996. Stochastic logic programs. In Advances in Inductive Logic Programming, pages 254–264.

Allen Newell and Herbert A. Simon. 1976. Computer science as empirical inquiry: Symbols and search. Communications of the ACM, 19(3):113–126. https://doi.org/10.1145/360018.360022

Mathias Niepert and Pedro M. Domingos. 2015.
Learning and inference in tractable probabilis-
tic knowledge bases. In Proceedings of the
Thirty-First Conference on Uncertainty in Arti-
ficial Intelligence, UAI 2015, July 12–16, 2015,
Amsterdam, The Netherlands, pages 632–641.
AUAI Press.

Mathias Niepert, Christian Meilicke, and Heiner Stuckenschmidt. 2012. Towards distributed MCMC inference in probabilistic knowledge bases. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, AKBC-WEKEX@NAACL-HLT 2012, Montréal, Canada, June 7–8, 2012, pages 1–6. Association for Computational Linguistics.

Terence Parsons. 1990. Events in the Semantics of English. MIT Press, Cambridge, MA.

Frank Pfenning. 2004. Natural deduction. Lecture
notes in 15-815 Automated Theorem Proving.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:140:1–140:67.

Hongyu Ren, Weihua Hu, and Jure Leskovec.
2020. Query2box: Reasoning over knowledge
graphs in vector space using box embeddings.
In 8th International Conference on Learn-
ing Representations, ICLR 2020, Addis Ababa,
Ethiopia, April 26–30, 2020. OpenReview.net.

Christian P. Robert and George Casella. 2004. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer. https://doi.org/10.1007/978-1-4757-4145-2

Tim Rocktäschel and Sebastian Riedel. 2017. End-to-end differentiable proving. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA, pages 3788–3800.

Stuart J. Russell and Peter Norvig. 2010. Artificial Intelligence - A Modern Approach, Third International Edition. Pearson Education.

Swarnadeep Saha, Sayan Ghosh, Shashank Srivastava, and Mohit Bansal. 2020. PRover: Proof generation for interpretable reasoning over rules. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16–20, 2020, pages 122–136. Association for Computational Linguistics.

Abulhair Saparov, Vijay A. Saraswat, and Tom M. Mitchell. 2017. A probabilistic generative grammar for semantic parsing. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3–4, 2017, pages 248–259. Association for Computational Linguistics. https://doi.org/10.18653/v1/K17-1026

Taisuke Sato, Yoshitaka Kameya, and Neng-Fa
Zhou. 2005. Generative modeling with fail-
ure in PRISM. In IJCAI-05, Proceedings of
the Nineteenth International Joint Conference
on Artificial Intelligence, Edinburgh, Scotland,


UK, July 30 – August 5, 2005, pages 847–852. Professional Book Center.

Stephan Schulz, Simon Cruanes, and Petar Vukmirovic. 2019. Faster, higher, stronger: E 2.3. In Automated Deduction - CADE 27 - 27th International Conference on Automated Deduction, Natal, Brazil, August 27–30, 2019, Proceedings, volume 11716 of Lecture Notes in Computer Science, pages 495–507. Springer. https://doi.org/10.1007/978-3-030-29436-6_29

Haitian Sun, Andrew O. Arnold, Tania Bedrax-Weiss, Fernando Pereira, and William W. Cohen. 2020. Faithful embeddings for knowledge base queries. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual.

Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. 2021. ProofWriter: Generating implications, proofs, and abductive statements over natural language. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1–6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 3621–3634. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.317

Ronen Tamari, Chen Shani, Tom Hope, Miriam R. L. Petruck, Omri Abend, and Dafna Shahaf. 2020. Language (re)modelling: Towards embodied language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, pages 6268–6281. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.559

Wikimedia Foundation. 2020. Wiktionary data

dumps.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada, pages 5754–5764.

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet
Kohli, Jiajun Wu, Antonio Torralba, and Joshua
B. Tenenbaum. 2020. CLEVRER: Collision
events for video representation and reasoning.
In 8th International Conference on Learn-
ing Representations, ICLR 2020, Addis Ababa,
Ethiopia, April 26–30, 2020.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and Eighth Innovative Applications of Artificial Intelligence Conference, AAAI 96, IAAI 96, Portland, Oregon, USA, August 4–8, 1996, Volume 2, pages 1050–1055. AAAI Press / The MIT Press.
