On Graph-based Reentrancy-free Semantic Parsing

Alban Petit and Caio Corro
Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France
{alban.petit,caio.corro}@lisn.upsaclay.fr

Abstract

We propose a novel graph-based approach
for semantic parsing that resolves two prob-
lems observed in the literature: (1) seq2seq
models fail on compositional generalization
tasks; (2) previous work using phrase structure
parsers cannot cover all the semantic parses
observed in treebanks. We prove that both
MAP inference and latent tag anchoring (re-
quired for weakly-supervised learning) are
NP-hard problems. We propose two optimiza-
tion algorithms based on constraint smoothing
and conditional gradient to approximately
solve these inference problems. Experimen-
tally, our approach delivers state-of-the-art re-
sults on GEOQUERY, SCAN, and CLEVR, both for
i.i.d. splits and for splits that test for compo-
sitional generalization.

1 Introduction

Semantic parsing aims to transform a natural lan-
guage utterance into a structured representation
that can be easily manipulated by a software (e.g.,
to query a database). As such, it is a central task
in human–computer interfaces. Andreas et al.
(2013) first proposed to rely on machine trans-
lation models for semantic parsing, where the tar-
get representation is linearized and treated as a
foreign language. Due to recent advances in deep
learning and especially in sequence-to-sequence
(seq2seq) with attention architectures for machine
translation (Bahdanau et al., 2015), it is appealing
to use the same architectures for standard struc-
tured prediction problems (Vinyals et al., 2015).
This approach is indeed common in semantic pars-
ing (Jia and Liang, 2016; Dong and Lapata, 2016;
Wang et al., 2020), as well as other domains. Un-
fortunately, there are well-known limitations to
seq2seq architectures for semantic parsing. First,
at test time, the decoding algorithm is typically
based on beam search as the model is autore-
gressive and does not make any independence
assumption. In case of prediction failure, it is

therefore unknown if this is due to errors in the
weighting function or to the optimal solution falling
out of the beam. Secondly, they are known
to fail when compositional generalization is re-
quired (Lake and Baroni, 2018; Finegan-Dollak
et al., 2018a; Keysers et al., 2020).

In order to bypass these problems, Herzig and
Berant (2021) proposed to represent the semantic
content associated with an utterance as a phrase
structure, i.e., using the same representation usu-
ally associated with syntactic constituents. As
such, their semantic parser is based on standard
span-based decoding algorithms (Hall et al., 2014;
Stern et al., 2017; Corro, 2020) with additional
well-formedness constraints from the semantic
formalism. Given a weighting function, MAP in-
ference is a polynomial time problem that can
be solved via a variant of the CYK algorithm
(Kasami, 1965; Younger, 1967; Cocke, 1970).
Experimentally, Herzig and Berant (2021) show
that their approach outperforms seq2seq models
in terms of compositional generalization, there-
fore effectively bypassing the two major problems
of these architectures.

The complexity of MAP inference for phrase
structure parsing is directly impacted by the con-
sidered search space (Kallmeyer, 2010). Impor-
tantly, (ill-nested) discontinuous phrase structure
parsing is known to be NP-hard, even with a
bounded block-degree (Satta, 1992). Herzig and
Berant (2021) explore two restricted inference al-
gorithms, both of which have a cubic time com-
plexity with respect to the input length. The first
one only considers continuous phrase structures,
that is, derived trees that could have been gener-
ated by a context-free grammar, and the second
one also considers a specific type of discontinui-
ties, see Corro (2020, Section 3.6). Both algo-
rithms fail to cover the full set of phrase structures
observed in semantic treebanks, see Figure 1.

In this work, we propose to reduce semantic
parsing without reentrancy (i.e., a given predicate
or entity cannot be used as an argument for two

Transactions of the Association for Computational Linguistics, vol. 11, pp. 703–722, 2023. https://doi.org/10.1162/tacl_a_00570
Action Editor: James Henderson. Submission batch: 8/2022; Revision batch: 12/2022; Published 6/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


• We propose a novel integer linear program-
ming formulation for this problem together
with an approximate solver based on condi-
tional gradient and constraint smoothing;
• We tackle the training problem using varia-
tional approximations of objective functions,
including the weakly-supervised scenario;
• We evaluate our approach on GEOQUERY,
SCAN, and CLEVR and observe that it outper-
forms baselines on both i.i.d. splits and splits
that test for compositional generalization.

Code to reproduce the experiments is available
online.1

2 Graph-based Semantic Parsing

We propose to reduce semantic parsing to pars-
ing the abstract syntax tree (AST) associated with
a semantic program. We focus on semantic pro-
grams whose ASTs do not have any reentrancy,
i.e., a single predicate or entity cannot be the ar-
gument of two different predicates. Moreover, we
assume that each predicate or entity is anchored
on exactly one word of the sentence and each
word can be the anchor of at most one predicate or
entity. As such, the semantic parsing problem can
be reduced to assigning predicates and entities to
words and identifying arguments via dependency
relations, see Figure 2. In order to formalize our
approach to the semantic parsing problem, we will
use concepts from graph theory. We therefore first
introduce the vocabulary and notions that will be
useful in the rest of this article. Notably, the no-
tions of cluster and generalized arborescence will
be used to formalize our prediction problem.

Notations and Definitions. Let G = ⟨V, A⟩
be a directed graph with vertices V and arcs
A ⊆ V × V. An arc in A from a vertex u ∈ V
to a vertex v ∈ V is denoted either a ∈ A or
u → v ∈ A. For any subset of vertices U ⊆ V,
we denote σ+_G(U) (resp., σ−_G(U)) the set of arcs
leaving one vertex of U and entering one vertex
of V \ U (resp., leaving one vertex of V \ U and
entering one vertex of U) in the graph G. Let
B ⊆ A be a subset of arcs. We denote V[B] the
cover set of B, i.e., the set of vertices that appear
as an extremity of at least one arc in B. A graph

Figure 1: Example of a semantic phrase structure from
GEOQUERY. This structure is outside of the search space
of the parser of Herzig and Berant (2021) as the con-
stituent in red is discontinuous and also has a discon-
tinuous parent (in red+green).

different predicates) to a bi-lexical dependency
parsing problem. As such, we tackle the same se-
mantic content as aforementioned previous work
but using a different mathematical representation
(Rambow, 2010). We identify two main benefits
to our approach: (1) as we allow crossing arcs,
i.e., ‘‘non-projective graphs’’, all datasets are
guaranteed to be fully covered and (2) it allows
us to rely on optimization methods to tackle in-
ference intractability of our novel graph-based
formulation of the problem. More specifically, in
our setting we need to jointly assign predicates/
entities to words that convey a semantic con-
tent and to identify arguments of predicates
via bi-lexical dependencies. We show that MAP
inference in this setting is equivalent to the maxi-
mum generalized spanning arborescence problem
(Myung et al., 1995) with supplementary con-
straints to ensure well-formedness with respect to
the semantic formalism. Although this problem is
NP-hard, we propose an optimization algorithm
that solves a linear relaxation of the problem and
can deliver an optimality certificate.

Our contributions can be summarized as

follows:

• We propose a novel graph-based approach
for semantic parsing without reentrancy;

• We prove the NP-hardness of MAP inference
and latent anchoring inference;

1https://github.com/alban-petit/semantic-dependency-parser


v ∈ σ−(W ) by an arc u → w and all the arcs
u → v ∈ σ+(W ) by an arc w → v. Given a
graph with partition π, the contracted graph is
the graph where each cluster in π has been con-
tracted. While contracting a graph may introduce
parallel arcs, it is not an issue in practice, even
for weighted graphs.

2.1 Semantic Grammar and AST

The semantic programs we focus on take the
form of a functional language, i.e., a representa-
tion where each predicate is a function that takes
other predicates or entities as arguments. The se-
mantic language is typed in the same sense as
in ‘‘typed programming languages’’. For exam-
ple, in GEOQUERY, the predicate capital 2 ex-
pects an argument of type city and returns an
object of type state. In the datasets we use, the
typing system disambiguates the position of argu-
ments in a function: For a given function, either
all arguments are of the same type or the order of
arguments is unimportant—an example of both is
the predicate intersection river in GEO-
QUERY that takes two arguments of type river,
but the result of the execution is unchanged if
the arguments are swapped.3

Formally, we define the set of valid seman-
tic programs as the set of programs that can be
produced with a semantic grammar G = ⟨E, T,
fTYPE, fARGS⟩ where:

• E is the set of predicates and entities, which
we will refer to as the set of tags—w.l.o.g.
we assume that ROOT ∉ E where ROOT is a
special tag used for parsing;

• T is the set of types;

• fTYPE : E → T is a typing function that

assigns a type to each tag;

• fARGS : E × T → N is a valency function that
assigns the number of expected arguments
of a given type to each tag.

A tag e ∈ E is an entity iff ∀t ∈ T : fARGS(e, t) =
0. Otherwise, e is a predicate.
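To make these definitions concrete, here is a minimal sketch of such a grammar encoded as plain Python tables; the tag and type names (and their underscore spelling) are illustrative placeholders rather than the full GEOQUERY inventory.

    # A minimal sketch of a semantic grammar <E, T, fTYPE, fARGS>.
    TYPES = {"state", "city", "river"}

    F_TYPE = {                       # fTYPE: tag -> type of the produced object
        "capital_2": "state",
        "cityid": "city",
        "intersection_river": "river",
        "riverid": "river",
    }

    F_ARGS = {                       # fARGS: (tag, type) -> number of expected arguments
        ("capital_2", "city"): 1,
        ("intersection_river", "river"): 2,
    }

    def is_entity(tag):
        # An entity expects no argument of any type; otherwise the tag is a predicate.
        return all(F_ARGS.get((tag, t), 0) == 0 for t in TYPES)

    assert is_entity("cityid") and not is_entity("capital_2")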

A semantic program in a functional language
can be equivalently represented as an AST, a

3There are a few corner cases like exclude river,
for which we simply assume arguments are in the same
order as they appear in the input sentence.

Figure 2: (top) The semantic program correspond-
ing to the sentence ‘‘What rivers do not run through
Tennessee?’’ in the GEOQUERY dataset. (middle) The
associated AST. (bottom) Two examples illustrating
the intuition of our model: We jointly assign predi-
cates/entities and identify argument dependencies. As
such, the resulting structure is strongly related to a
syntactic dependency parse, but where the dependency
structure does not cover all words.

G = ⟨V, A⟩ is an arborescence2 rooted at u ∈ V
if and only if (iff) it contains |V | − 1 arcs and
there is a directed path from u to each vertex in
V . In the rest of this work, we will assume that
the root is always vertex 0 ∈ V . Let B ⊆ A
be a set of arcs such that G′ = ⟨V[B], B⟩ is an
arborescence. Then G′ is a spanning arborescence
of G iff V[B] = V.

Let π = {V0, . . . , Vn} be a partition of V
containing n + 1 clusters. G′ is a generalized
not-necessarily-spanning arborescence (resp.,
generalized spanning arborescence) on the partition
π of G iff G′ is an arborescence and V[B] con-
tains at most one vertex per cluster in π (resp.,
contains exactly one).

Let W ⊆ V be a set of vertices. Contracting
W consists in replacing in G the set W by a
new vertex w ∉ V, replacing all the arcs u →

2In the NLP community, arborescences are often called
(directed) trees. We stick with the term arborescence as
it is more standard in the graph theory literature, see for
example Schrijver (2003). Using the term tree introduces
a confusion between two unrelated algorithms, Kruskal’s
maximum spanning tree algorithm (Kruskal, 1956) that oper-
ates on undirected graphs and Edmonds’ maximum spanning
arborescence algorithm (Edmonds, 1967) that operates on
directed graphs. Moreover, this prevents any confusion be-
tween the graph object called arborescence and the semantic
structure called AST.


Figure 3: (a) Example of a sentence and its associated AST (solid arcs) from the GEOQUERY dataset. The dashed
edges indicate predicates and entities anchors (note that this information is not available in the dataset). (b) The
corresponding generalized valency-constrained not-necessarily-spanning arborescence (red arcs). The root is the
isolated top left vertex. Adding ∅ tags and dotted orange arcs produces a generalized spanning arborescence.

graph where instances of predicates and entities
are represented as vertices and where arcs identify
arguments of predicates. Formally, an AST is
a labeled graph G = ⟨V, A, l⟩ where function
l : V → E assigns a tag to each vertex and arcs
identify the arguments of tags, see Figure 2. An
AST G is well-formed with respect to the grammar
G iff G is an arborescence and the valency and
type constraints are satisfied, i.e., ∀u ∈ V, t ∈ T:

    fARGS(l(u), t) = |σ+_G({u}, t)|,

where:

    σ+_G({u}, t) = { u → v ∈ σ+_G({u}) s.t. fTYPE(l(v)) = t }.
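As an illustration, the following sketch checks only the valency and type constraints of a candidate AST against a toy grammar; the tag and type tables are hypothetical, and the arborescence condition itself is not checked here.

    from collections import Counter

    # Toy grammar tables (illustrative names, not the full GEOQUERY inventory).
    TYPES = {"state", "city", "river"}
    F_TYPE = {"capital_2": "state", "cityid": "city"}
    F_ARGS = {("capital_2", "city"): 1}

    def is_well_formed(vertices, arcs, label):
        # arcs is a set of (u, v) pairs, label maps vertex -> tag.
        for u in vertices:
            # Count outgoing arcs per argument type.
            out_types = Counter(F_TYPE[label[v]] for (x, v) in arcs if x == u)
            for t in TYPES:
                if F_ARGS.get((label[u], t), 0) != out_types.get(t, 0):
                    return False
        return True

    assert is_well_formed([0, 1], {(0, 1)}, {0: "capital_2", 1: "cityid"})
    assert not is_well_formed([0], set(), {0: "capital_2"})   # missing city argument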

2.2 Problem Reduction and Complexity

In our setting, semantic parsing is a joint sen-
tence tagging and dependency parsing problem
(Bohnet and Nivre, 2012; Li et al., 2011; Corro
et al., 2017): each content word (i.e., words that
convey a semantic meaning) must be tagged
with a predicate or an entity, and dependen-
cies between content words identify arguments
of predicates, see Figure 2. However, our seman-
tic parsing setting differs from standard syntactic
analysis in two ways: (1) the resulting structure
is not-necessarily-spanning, there are words (e.g.,
function words) that must not be tagged and that
do not have any incident dependency—and those
words are not known in advance, they must be
identified jointly with the rest of the structure;
(2) the dependency structure is highly constrained
by the typing mechanism, that is, the predicted
structure must be a valid AST. Nevertheless,

similarly to aforementioned works, our parser is
graph-based, that is, for a given input we build a
(complete) directed graph and decoding is reduced
to computing a constrained subgraph of maximum
weight.

Given a sentence w = w1 . . . wn with n words
and a grammar G, we construct a clustered labeled
graph G = ⟨V, A, π, l̄⟩ as follows. The partition
π = {V0, . . . , Vn} contains n + 1 clusters, where
V0 is a root cluster and each cluster Vi, i ≠ 0, is
associated to word wi. The root cluster V0 = {0}
contains a single vertex that will be used as the root
and every other cluster contains |E| vertices. The
extended labeling function l̄ : V → E ∪ {ROOT}
assigns a tag in E to each vertex v ∈ V \ {0} and
ROOT to vertex 0. Distinct vertices in a cluster Vi
cannot have the same label, i.e., ∀u, v ∈ Vi : u ≠
v ⟹ l̄(u) ≠ l̄(v).
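The sketch below illustrates one plausible way to build this clustered graph for a sentence; the cluster indexing and tag inventory are assumptions made for illustration, and the vertex/arc weights produced by the neural network are left out.

    from itertools import product

    def build_graph(n_words, tags):
        clusters = {0: [0]}                 # root cluster V_0 = {0}
        label = {0: "ROOT"}
        next_id = 1
        for i in range(1, n_words + 1):     # one cluster per word, |E| vertices each
            clusters[i] = []
            for tag in tags:
                clusters[i].append(next_id)
                label[next_id] = tag
                next_id += 1
        # Arcs connect vertices of distinct clusters; no arc enters the root cluster.
        arcs = [(u, v)
                for i, j in product(clusters, repeat=2)
                if i != j and j != 0
                for u in clusters[i] for v in clusters[j]]
        return clusters, label, arcs

    clusters, label, arcs = build_graph(3, ["capital_2", "cityid", "stateid"])
    print(len(label), "vertices,", len(arcs), "arcs")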

Let B ⊆ A be a subset of arcs. The graph
G′ = ⟨V[B], B⟩ defines a 0-rooted generalized
valency-constrained not-necessarily-spanning ar-
borescence iff it is a generalized arborescence
of G, there is exactly one arc leaving 0, and the
sub-arborescence rooted at the destination of that
arc is a valid AST with respect to the grammar G.
As such, there is a one-to-one correspondence
between ASTs anchored on the sentence w and
generalized valency-constrained not-necessarily-
spanning arborescences in the graph G, see
Figure 3b.

For any sentence w, our aim is to find the
AST that most likely corresponds to it. Thus, af-
ter building the graph G as explained above, the
neural network described in Appendix B is used
to produce a vector of weights μ ∈ R^{|V|} asso-
ciated to the set of vertices V and a vector of


weights φ ∈ R^{|A|} associated to the set of arcs A.
Given these weights, graph-based semantic pars-
ing is reduced to an optimization problem called
the maximum generalized valency-constrained
not-necessarily-spanning arborescence (MGVC-
NNSA) in the graph G.

Theorem 1. The MGVCNNSA problem is NP-
hard.

The proof is in Appendix A.

2.3 Mathematical Program

Our graph-based approach to semantic parsing
has allowed us to prove the intrinsic hard-
ness of the problem. We follow previous work
on graph-based parsing (Martins et al., 2009;
Koo et al., 2010), and other topics, by proposing
an integer linear programming (ILP) formulation
in order to compute (approximate) solutions.

Remember that in the joint tagging and de-
pendency parsing interpretation of the semantic
parsing problem, the resulting structure is not-
necessarily-spanning, meaning that some words
may not be tagged. In order to rely on well-
known algorithms for computing spanning ar-
borescences as a subroutine of our approximate
solver, we first introduce the notion of extended
graph. Given a graph G = ⟨V, A, π, l̄⟩, we con-
struct an extended graph Ḡ = ⟨V̄, Ā, π̄, l̄⟩4 con-
taining n additional vertices {1̄, . . . , n̄} that are
distributed along clusters, i.e., π̄ = {V0, V1 ∪
{1̄}, . . . , Vn ∪ {n̄}}, and arcs from the root to
these extra vertices, i.e., Ā = A ∪ {0 → ī | 1 ≤
i ≤ n}. Let B ⊆ A be a subset of arcs such
that ⟨V[B], B⟩ is a generalized not-necessarily-
spanning arborescence on G. Let B̄ ⊆ Ā be
a subset of arcs defined as B̄ = B ∪ {0 →
ī | σ−_⟨V[B],B⟩(Vi) = ∅}. Then, there is a one-to-
one correspondence between generalized not-
necessarily-spanning arborescences ⟨V[B], B⟩
and generalized spanning arborescences ⟨V̄[B̄],
B̄⟩, see Figure 3b.

Let x ∈ {0, 1}^{|V̄|} and y ∈ {0, 1}^{|Ā|} be variable
vectors indexed by vertices and arcs such that a
vertex v ∈ V̄ (resp., an arc a ∈ Ā) is selected
iff xv = 1 (resp., ya = 1). The set of 0-rooted
generalized valency-constrained spanning arbo-
rescences on Ḡ can be written as the set of
variables ⟨x, y⟩ satisfying the following linear
constraints. First, we restrict y to structures that
4The labeling function is unchanged as there is no need

for types for vertices in V̄ \ V.

are spanning arborescences over Ḡ where clus-
ters have been contracted:

    Σ_{a ∈ σ−_Ḡ(V0)} ya = 0.    (1)

    Σ_{a ∈ σ−_Ḡ(∪_{U ∈ π′} U)} ya ≥ 1    ∀ π′ ⊆ π̄ \ {V0}.    (2)

    Σ_{a ∈ σ−_Ḡ(Vi)} ya = 1    ∀ Vi ∈ π \ {V0}.    (3)

Constraints (2) ensure that the contracted graph
is weakly connected. Constraints (3) force each
cluster to have exactly one incoming arc. The set
of vectors y that satisfy these three constraints
is exactly the set of 0-rooted spanning arbores-
cences on the contracted graph, see Schrijver
(2003, Section 52.4) for an in-depth analysis of
this polytope. The root vertex is always selected
and other vertices are selected iff they have one
incoming selected arc:

    x0 = 1.    (4)

    xu = Σ_{a ∈ σ−_Ḡ({u})} ya    ∀ u ∈ V̄ \ {0}.    (5)

Note that constraints (1)–(3) do not force selected
arcs to leave from a selected vertex as they oper-
ate at the cluster level. This property will be en-
forced via the following valency constraints:

    Σ_{u ∈ V \ {0}} y0→u = 1.    (6)

    Σ_{a ∈ σ+_G({u},t)} ya = xu fARGS(l(u), t)    ∀ t ∈ T, u ∈ V \ {0}.    (7)

Constraint (6) forces the root to have exactly one
outgoing arc into a vertex u ∈ V \ {0} (i.e.,
a vertex that is not part of the extra vertices
introduced in the extended graph) that will be
the root of the AST. Constraints (7) force the se-
lected vertices and arcs to produce a well-formed
AST with respect to the grammar G. Note that
these constraints are only defined for vertices in
V \ {0}, i.e., they are neither defined for the root
vertex nor for the extra vertices introduced in the
extended graph.
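For concreteness, the sketch below shows one way constraints (7) could be assembled into a matrix A and vector b (so that feasibility reads Az = b for z = [x; y]), which is the form manipulated by the smoothing approach of Section 4; the variable indexing scheme is an assumption made for illustration.

    import numpy as np

    def valency_constraint_matrix(vertices, arcs, label, types,
                                  f_type, f_args, vertex_index, arc_index, n_vars):
        # One row per (vertex u, type t) pair:
        # sum_{a in sigma+_G({u}, t)} y_a - fARGS(l(u), t) * x_u = 0.
        rows = []
        for u in vertices:
            for t in types:
                row = np.zeros(n_vars)
                row[vertex_index[u]] = -f_args.get((label[u], t), 0)
                for (src, dst) in arcs:
                    if src == u and f_type[label[dst]] == t:
                        row[arc_index[(src, dst)]] = 1.0
                rows.append(row)
        A = np.stack(rows)
        b = np.zeros(len(rows))        # constraint (7) is an equality with zero right-hand side
        return A, b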


To simplify notation, we introduce the follow-
ing sets:

    C(sa) = { ⟨x, y⟩ ∈ {0, 1}^V̄ × {0, 1}^Ā s.t. x and y satisfy (1)–(5) },
    C(val) = { ⟨x, y⟩ ∈ {0, 1}^V̄ × {0, 1}^Ā s.t. x and y satisfy (6)–(7) },

and C = C(sa) ∩ C(val). Given vertex weights
μ ∈ R^{|V̄|} and arc weights φ ∈ R^{|Ā|}, comput-
ing the MGVCNNSA is equivalent to solving the
following ILP:

    (ILP1)    max_{x,y}  μ⊤x + φ⊤y
              s.t.  ⟨x, y⟩ ∈ C(sa) and ⟨x, y⟩ ∈ C(val).

Without constraint (cid:3)x, y(cid:4) ∈ C(val), the problem
would be easy to solve. The set C(sa) is the set of
spanning arborescences over the contracted graph,
hence to maximize over this set we can simply:
(1) contract the graph and assign to each arc in
the contracted graph the weight of its correspond-
ing arc plus the weight of its destination vertex
in the original graph; (2) run the maximum
spanning arborescence algorithm (MSA, Edmonds,
1967; Tarjan, 1977) on the contracted graph,
which has an O(n²) time complexity. This pro-
cess is illustrated in Figure 5 (top). Note that
the contracted graph may have parallel arcs,
which is not an issue in practice as only the one
of maximum weight can appear in a solution of
the MSA.
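As an illustration of this two-step procedure, the following sketch contracts the clusters, keeps only the best parallel arc between each pair of clusters, and runs Edmonds' algorithm via networkx; it assumes the clusters/label/arcs structures from the earlier sketch and dictionaries mu and phi of vertex and arc weights.

    import networkx as nx

    def msa_on_contracted(clusters, arcs, mu, phi):
        # Map each vertex to its cluster.
        cluster_of = {u: i for i, members in clusters.items() for u in members}
        g = nx.DiGraph()
        g.add_nodes_from(clusters)
        # Keep only the best parallel arc between two clusters: its weight is the
        # original arc weight plus the weight of its destination vertex.
        best = {}
        for (u, v) in arcs:
            w = phi[(u, v)] + mu[v]
            key = (cluster_of[u], cluster_of[v])
            if key not in best or w > best[key][0]:
                best[key] = (w, u, v)
        for (i, j), (w, u, v) in best.items():
            g.add_edge(i, j, weight=w)
        msa = nx.maximum_spanning_arborescence(g, attr="weight")
        # Recover the original vertices and arcs selected by the contracted solution.
        return [best[(i, j)][1:] for (i, j) in msa.edges()]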

We have established that MAP inference in
our semantic parsing framework is an NP-hard
problem. We proposed an ILP formulation of the
problem that would be easy to solve if some con-
straints were removed. This property suggests the
use of an approximation algorithm that introduces
the difficult constraints as penalties. As a simi-
lar setting arises from our weakly supervised loss
function, the presentation of the approximation
algorithm is deferred until Section 4.

3 Training Objective Functions

3.1 Supervised Training Objective
We define the likelihood of a pair ⟨x, y⟩ ∈ C via
the Boltzmann distribution:

    pμ,φ(x, y) = exp(μ⊤x + φ⊤y − c(μ, φ)),

where c(μ, φ) is the log-partition function:

    c(μ, φ) = log Σ_{⟨x′,y′⟩ ∈ C} exp(μ⊤x′ + φ⊤y′).

During training, we aim to maximize the log-
likelihood of the training dataset. The log-
likelihood of an observation ⟨x, y⟩ is defined as:

    ℓ(μ, φ; x, y) = log pμ,φ(x, y) = μ⊤x + φ⊤y − c(μ, φ).

Unfortunately, computing the log-partition func-
tion is intractable as it requires summing over
all feasible solutions. Instead, we rely on a sur-
rogate lower-bound as an objective function. To
this end, we derive an upper bound (because it
is negated in ℓ) to the second term: a sum of
log-sum-exp functions that sums over each cluster
of vertices independently and over incoming arcs
in each cluster independently, which is tractable.
This loss can be understood as a generalization
of the head selection loss used in dependency
parsing (Zhang et al., 2017). We now detail the
derivation and prove that it is an upper bound to
the log-partition function.

Let U be a matrix such that each row contains
a pair ⟨x, y⟩ ∈ C and Δ^{|C|} be the simplex of di-
mension |C| − 1, i.e., the set of all stochastic
vectors of dimension |C|. The log-partition func-
tion can then be rewritten using its variational
formulation:

    c(μ, φ) = max_{p ∈ Δ^{|C|}} p⊤U[μ; φ] + H[p],

where H[p] = −Σ_i pi log pi is the Shannon
entropy. We refer the reader to Boyd and
Vandenberghe (2004, Example 3.25), Wainwright
and Jordan (2008, Section 3.6), and Beck (2017,
Section 4.4.10). Note that this formulation remains
impractical as p has an exponential size. Let
M = conv(C) be the marginal polytope, i.e., the
convex hull of the feasible integer solutions, we
can rewrite the above variational formulation as:

    = max_{m ∈ M} m⊤[μ; φ] + H_M[m],



where HM is a joint entropy function defined
such that the equality holds. The maximization


in this reformulation acts on the marginal proba-
bilities of parts (vertices and arcs) and has there-
fore a polynomial number of variables. We refer
the reader to Wainwright and Jordan (2008, 5.2.1)
and Blondel et al. (2020, Section 7) for more de-
tails. Unfortunately, this optimization problem is
hard to solve as M cannot be characterized in
an explicit manner and HM is defined indirectly
and lacks a polynomial closed form (Wainwright
and Jordan, 2008, Section 3.7). However, we
can derive an upper bound to the log-partition
function by decomposing the entropy term HM
(Cover, 1999, Property 4 on page 41, i.e., H is
an upper bound) and by using an outer approxi-
mation to the marginal polytope L ⊇ M (i.e.,
increasing the search space):

    ≤ max_{m ∈ L} m⊤[μ; φ] + H[m].

In particular, we observe that each pair (cid:3)x, y(cid:4)
∈ C has exactly one vertex selected per cluster
Vi ∈ π and one incoming arc selected per clus-
ter Vi ∈ π \ {V0}. We denote C(one) the set of
all the pairs (cid:3)x, y(cid:4) that satisfy these constraints.
By using L = conv(C(one)) as an outer approxi-
mation to the marginal polytope (see Figure 4)
the optimization problem can be rewritten as a
sum of independent problems. As each of these
problems is the variational formulation of a log-
sum-exp term, the upper bound on c(μ, φ) can
be expressed as a sum of log-sum-exp functions,
one over vertices in each cluster Vi ∈ π \ {V0}
and one over incoming arcs σ−_Ḡ(Vi) for each
cluster Vi ∈ π \ {V0}. Although this type of ap-
proximation may not result in a Bayes consistent
loss (Corro, 2023), it works well in practice.
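A minimal sketch of this surrogate bound, assuming dictionaries of clusters, of incoming arcs per cluster, and of vertex/arc scores; it simply accumulates one log-sum-exp per cluster over its vertices and one over its incoming arcs.

    from scipy.special import logsumexp

    def log_partition_upper_bound(clusters, incoming_arcs, mu, phi):
        # One log-sum-exp over the vertices of each cluster and one over its
        # incoming arcs; the root cluster V_0 is skipped.
        bound = 0.0
        for i, members in clusters.items():
            if i == 0:
                continue
            bound += logsumexp([mu[v] for v in members])
            bound += logsumexp([phi[a] for a in incoming_arcs[i]])
        return bound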

Figure 4: Polyhedron illustration. The solid lines rep-
resent the convex hull of feasible solutions of ILP1,
denoted M, whose vertices are feasible integer solu-
tions (black vertices). The dashed lines represent the
convex hull of feasible solutions of the linear relaxation
of ILP1, which has non-integral vertices (in white). Fi-
nally, the dotted lines represent the polyhedron L that
is used to approximate c(μ, φ). All its vertices are
integral, but some of them are not feasible solutions
of ILP1.

i.e., we marginalize over all the structures that
induce the gold AST. We can rewrite this loss as:

    = ( log Σ_{⟨x,y⟩ ∈ C∗} exp(μ⊤x + φ⊤y) ) − c(μ, φ).

The two terms are intractable. We approximate
the second term using the bound defined in
Section 3.1.

We now derive a tractable lower bound to the
first term. Let q be a proposal distribution such
that q(x, y) = 0 if ⟨x, y⟩ ∉ C∗. We derive the
following lower bound via Jensen’s inequality:

    log Σ_{⟨x,y⟩ ∈ C∗} exp(μ⊤x + φ⊤y)
        = log E_q[ exp(μ⊤x + φ⊤y) / q(x, y) ]
        ≥ E_q[ μ⊤x + φ⊤y ] + H[q].

3.2 Weakly Supervised Training Objective

Unfortunately, training data often does not in-
clude gold pairs (cid:3)x, y(cid:4) but instead only the AST,
without word anchors (or word alignment). This
is the case for the three datasets we use in our
experiments. We thus consider our training sig-
nal to be the set of all structures that induce the
annotated AST, which we denote C∗.

The weakly supervised loss is defined as:

    ℓ̃(μ, φ; C∗) = log Σ_{⟨x,y⟩ ∈ C∗} pμ,φ(x, y),

This bound holds for any distribution q satisfy-
ing the aforementioned condition. We choose to
maximize this lower bound using a distribution
that gives a probability of one to a single struc-
ture, as in ‘‘hard’’ EM (Neal and Hinton, 1998,
Section 6).

For a given sentence w, let G = ⟨V, A, π, l̄⟩
be a graph defined as in Section 2.2 and G′ =
⟨V′, A′, l′⟩ be an AST defined as in Section 2.1.
We aim to find the GVCNNSA in G of maxi-
mum weight whose induced AST is exactly G′.
This is equivalent to aligning each vertex in V′
with one vertex of V \ {0} s.t. there is at most


one vertex per cluster of π appearing in the align-
ment and where the weight of an alignment is
defined as:

1. for each vertex u′ ∈ V′, we add the weight of
the vertex u ∈ V it is aligned to—moreover,
if u′ is the root of the AST we also add the
weight of the arc 0 → u;

2. for each arc u′ → v′ ∈ A′, we add the weight
of the arc u → v where u ∈ V (resp. v ∈ V)
is the vertex u′ (resp. v′) it is aligned with.

Note that this latent anchoring inference consists
in computing a (partial) alignment between ver-
tices of G and G′, but the fact that we need to
take into account arc weights forbids the use of
the Kuhn–Munkres algorithm (Kuhn, 1955).

Theorem 2. Computing the anchoring of max-
imum weight of an AST with a graph G is
NP-hard.

The proof is in Appendix A.

Therefore, we propose an optimization-based
approach to compute the distribution q. Note that
the problem has a constraint requiring each cluster
Vi ∈ π to be aligned with at most one vertex
v′ ∈ V′, i.e., each word in the sentence can be
aligned with at most one vertex in the AST.
If we remove this constraint, then the problem
becomes tractable via dynamic programming.
Indeed, we can recursively construct a table
CHART[u′, u], u′ ∈ V′ and u ∈ V, containing
the score of aligning vertex u′ to vertex u plus the
score of the best alignment of all the descendants
of u′. To this end, we simply visit the vertices
V′ of the AST in reverse topological order, see
Algorithm 1. The best alignment can be retrieved
via back-pointers.

Computing q is therefore equivalent to solving
the following ILP:

    (ILP2)    max_{x,y}  μ⊤x + φ⊤y
              s.t.  ⟨x, y⟩ ∈ C∗(relaxed),
                    Σ_{u ∈ Vi} xu ≤ 1    ∀ Vi ∈ π,

where the set C∗(relaxed) is the set of feasible so-
lutions of the dynamic program in Algorithm 1,
whose convex hull can be described via linear
constraints (Martin et al., 1990).

4 Efficient Inference

In this section, we propose an efficient way to
solve the linear relaxations of MAP inference
(ILP1) and latent anchoring inference (ILP2) via
constraint smoothing and the conditional gradi-
ent method. We focus on problems of the follow-
ing form:

    max_z  f(z)
    s.t.   z ∈ conv(C(easy)),
           Az = b  or  Az ≤ b,

where the vector z is the concatenation of the
vectors x and y defined previously and conv
denotes the convex hull of a set. We explained
previously that if the set of constraints of form
Az = b for (ILP1) or Az ≤ b for (ILP2) was
absent, the problem would be easy to solve under
a linear objective function. In fact, there exists an
efficient linear maximization oracle (LMO), i.e., a
function that returns the optimal integral solution,
for the set conv(C(easy)). This setting covers both
(ILP1) and (ILP2) where we have C(easy) = C(sa) and
C(easy) = C∗(relaxed), respectively.

An appealing approach in this setting is to
introduce the problematic constraints as penalties
in the objective:

    max_z  f(z) − δS(Az)
    s.t.   z ∈ conv(C(easy)),

where δS is the indicator function of the set S:

    δS(Az) = 0 if Az ∈ S, +∞ otherwise.

In the equality case, we use S = {b} and in the
inequality case, we use S = {u|u ≤ b}.

4.1 Conditional Gradient Method

Given a proper, smooth, and differentiable func-
tion g and a nonempty, bounded, closed, and
convex set conv(C(easy)), the conditional gradient
method (a.k.a. Frank-Wolfe; Frank and Wolfe,
1956; Levitin and Polyak, 1966; Lacoste-Julien
and Jaggi, 2015) can be used to solve optimization
problems of the following form:

    max_z  g(z)
    s.t.   z ∈ conv(C(easy)).


Algorithm 1 Unconstrained alignment of maximum weight between a graph G and an AST G′

function DPALIGNMENT(G, G′)
    for u′ ∈ V′ in reverse topological order do
        for u ∈ {v ∈ V | l̄(v) = l′(u′)} do    ▷ We can only map u′ to vertices u if they have the same tag.
            CHART[u′, u] ← μu + Σ_{u′→v′ ∈ σ+_G′({u′})} max_{v ∈ V} ( CHART[v′, v] + φu→v )
    return max_{u ∈ V} CHART[r′, u] + φ0→u    ▷ Where r′ ∈ V′ is the root of the AST.
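The following Python sketch is a direct transcription of Algorithm 1; the input structures (the reverse topological order of the AST, its children and labels, the tag-indexed graph vertices, and the score dictionaries) are assumed to be precomputed, and back-pointers are omitted.

    def dp_alignment(ast_order, ast_children, ast_label, ast_root,
                     vertices_by_tag, mu, phi, root_vertex=0):
        chart = {}
        for u_ast in ast_order:                            # reverse topological order
            for u in vertices_by_tag[ast_label[u_ast]]:    # same-tag graph vertices only
                score = mu[u]
                for v_ast in ast_children[u_ast]:
                    # Best anchoring of the child, plus the corresponding arc weight.
                    score += max(chart[(v_ast, v)] + phi[(u, v)]
                                 for v in vertices_by_tag[ast_label[v_ast]])
                chart[(u_ast, u)] = score
        return max(chart[(ast_root, u)] + phi[(root_vertex, u)]
                   for u in vertices_by_tag[ast_label[ast_root]])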

Algorithm 2 Conditional gradient

function CONDITIONALGRADIENT(G, G′)
    Let z(0) ∈ conv(C(easy))
    for k ∈ {0, . . . , K} do    ▷ Where K is the maximum number of iterations
        d ← lmo_conv(C(easy))(∇g(z(k))) − z(k)    ▷ Compute the update direction
        if ∇g(z(k))⊤d ≤ ε then return z(k)    ▷ If the dual gap is small, z(k) is (almost) optimal
        γ ∈ arg max_{γ ∈ [0,1]} g(z(k) + γd)    ▷ Compute or approximate the optimal stepsize
        z(k+1) ← z(k) + γd    ▷ Update the current point
    return z(k)

Contrary to the projected gradient method, this
approach does not require to compute projections
onto the feasible set conv(C(easy)) which is, in
most cases, computationally expensive. Instead,
the conditional gradient method only relies on a
LMO:

    lmo_C(easy)(ψ) ∈ arg max_{z ∈ conv(C(easy))} ψ⊤z.

The algorithm constructs a solution to the original
problem as a convex combination of elements
returned by the LMO. The pseudo-code is given
in Algorithm 2. An interesting property of this
method is that its step size range is bounded, which
allows the use of simple linesearch techniques.
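A compact sketch of this loop, written for a generic smooth concave objective g and an abstract LMO; the default step size 2/(k+2) used when no line search is supplied is a standard choice, not necessarily the one used in the paper.

    import numpy as np

    def conditional_gradient(grad_g, lmo, z0, max_iter=100, eps=1e-6,
                             line_search=None):
        z = np.asarray(z0, dtype=float)
        for k in range(max_iter):
            grad = grad_g(z)
            d = lmo(grad) - z                      # update direction from the LMO
            if grad @ d <= eps:                    # small dual gap: (almost) optimal
                return z
            # Step size in [0, 1]; 2/(k+2) is a common default when no line search is given.
            gamma = line_search(z, d) if line_search is not None else 2.0 / (k + 2)
            z = z + gamma * d
        return z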

4.2 Smoothing

Unfortunately, the function g(z) = f(z) −
δS(Az) is non-smooth due to the indicator func-
tion term, preventing the use of the conditional
gradient method. We propose to rely on the frame-
work proposed by Yurtsever et al. (2018), where
the indicator function is replaced by a smooth
approximation. The indicator function of the set S
can be rewritten as:

    δS(Az) = δ**_S(Az) = sup_u u⊤Az − σS(u),

where δ**_S denotes the Fenchel biconjugate of the
indicator function and σS(u) = sup_{t ∈ S} u⊤t is
the support function of S. More details can be
found in Beck (2017, Sections 4.1 and 4.2).
In order to smooth the indicator function, we add
a β-parameterized convex regularizer −(β/2)‖·‖²₂ to
its Fenchel biconjugate:

    δ**_{S,β}(Az) = max_u u⊤Az − σS(u) − (β/2)‖u‖²₂,

where β > 0 controls the quality and the
smoothness of the approximation (Nesterov,
2005).

Equalities. In the case where S = {b},
with a few computations that are detailed by
Yurtsever et al. (2018), we obtain:

    δ**_{{b},β}(Az) = (1/(2β)) ‖Az − b‖²₂.

That is, we have a quadratic penalty term
in the objective. Note that this term is similar to
the term introduced in an augmented Lagrangian
(Nocedal and Wright, 1999, Equation 17.36), and
adds a penalty in the objective for vectors z s.t.
Az ≠ b.

Inequalities. In the case where S = {u | u ≤
b}, similar computations lead to:

    δ**_{≤b,β}(Az) = (1/(2β)) ‖[Az − b]₊‖²₂,

where [·]₊ denotes the Euclidean projection onto
the non-negative orthant (i.e., clipping negative
values). Similarly to the equality case, this term


Figure 5: Illustration of the approximate inference algorithm on the two-word sentence ‘‘List states’’, where
we assume the grammar has one entity state all and one predicate loc 1 that takes exactly one entity as
argument. The left graph is the extended graph for the sentence, including vertices and arcs weights (in black). If
we ignore constraints (6)–(7), inference is reduced to computing the MSA on the contracted graph (solid arcs in
the middle column). This may lead to solutions that do not satisfy constraints (6)–(7) on the expanded graph (top
example). However, the gradient of the smoothed constraint (7) will induce penalties (in red) to vertex and arc
scores that will encourage the loc 1 predicate to either be dropped from the solution or to have an outgoing arc
to a state all argument. Computing the MSA on the contracted graph with penalties results in a solution that
satisfies constraints (6)–(7) (bottom example).


introduces a penalty in the objective for vectors z
s.t. Az > b. This penalty function is also called
the Courant-Beltrami penalty function.
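The two smoothed penalties and their gradients are straightforward to implement; the sketch below assumes dense numpy arrays for A, b, and z, and returns the value and the gradient with respect to z.

    import numpy as np

    def equality_penalty(A, b, z, beta):
        # (1 / 2beta) * ||A z - b||^2 and its gradient w.r.t. z.
        r = A @ z - b
        return (r @ r) / (2.0 * beta), (A.T @ r) / beta

    def inequality_penalty(A, b, z, beta):
        # (1 / 2beta) * ||[A z - b]_+||^2 and its gradient w.r.t. z.
        r = np.maximum(A @ z - b, 0.0)
        return (r @ r) / (2.0 * beta), (A.T @ r) / beta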

Figure 5 (bottom) illustrates how the gradi-
ent of the penalty term can ‘‘force’’ the LMO
to return solutions that satisfy the smoothed
constraints.

4.3 Practical Details

Smoothness. In practice, we need to choose the
smoothness parameter β. We follow Yurtsever
et al. (2018) and use β(k) = β(0)/(k + 1), where k
is the iteration number and β(0) = 1.

Step Size. Another important choice in the al-
gorithm is the step size γ. We show that when
the smoothed constraints are equalities, comput-
ing the optimal step size has a simple closed
form solution if the function f is linear, which
is the case for (ILP1), i.e., MAP decoding. The
step size problem formulation at iteration k is
defined as:

    arg max_{γ ∈ [0,1]}  f(z(k) + γd) − (1/(2β)) ‖A(z(k) + γd) − b‖²₂.


By assumption, f is linear and can be written
as f(z) = θ⊤z. Ignoring the box constraints on
γ, by first order optimality conditions, we have:

    γ = ( −βθ⊤d + (Ad)⊤b − (Ad)⊤(Az(k)) ) / ‖Ad‖²₂.

We can then simply clip the result so that it
satisfies the box constraints. Unfortunately, in
the inequalities case, there is no simple closed
form solution. We approximate the step size using
10 iterations of the bisection algorithm for root
finding.
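A sketch of the resulting step size computation, derived directly from the first-order condition for the objective θ⊤z − (1/(2β))‖Az − b‖²₂ along direction d and clipped to [0, 1]; sign conventions may differ slightly from the displayed formula.

    import numpy as np

    def optimal_step_size(theta, A, b, beta, z, d):
        Ad = A @ d
        denom = Ad @ Ad
        if denom == 0.0:                      # the direction does not change A z
            return 1.0 if theta @ d > 0 else 0.0
        gamma = (beta * (theta @ d) + Ad @ b - Ad @ (A @ z)) / denom
        return float(np.clip(gamma, 0.0, 1.0))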

Non-integral Solutions. As we solve the
linear relaxation of original ILPs, the optimal
solutions may not be integral (Figure 4).
Therefore, we use simple heuristics to construct
a feasible solution to the original ILP in these
cases. For MAP inference, we simply solve
the ILP5 using CPLEX but introducing only
variables that have a non-null value in the linear
relaxation, leading to a very sparse problem which
is fast to solve. For latent anchoring, we simply
use the Kuhn–Munkres algorithm using the
non-integral solution as assignment costs.

5We use the multi-commodity flow formulation of
Martins et al. (2009) instead of the cycle breaking con-
straints (2).
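A sketch of this rounding step using SciPy's implementation of the Kuhn–Munkres algorithm (recent SciPy versions expose a maximize flag); the fractional score matrix frac is a hypothetical input built from the relaxed solution.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def round_anchoring(frac):
        # frac[i, j]: fractional score of anchoring AST vertex i on word j.
        rows, cols = linear_sum_assignment(frac, maximize=True)
        return list(zip(rows.tolist(), cols.tolist()))

    print(round_anchoring(np.array([[0.9, 0.1, 0.0],
                                    [0.2, 0.7, 0.1]])))   # [(0, 0), (1, 1)]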

5 Experiments

We compare our method to baseline systems
both on i.i.d. splits (IID) and splits that test for
compositional generalization for three datasets.
The neural network is described in Appendix B.

Datasets. SCAN (Lake and Baroni, 2018) con-
tains natural language navigation commands. We
use the variant of Herzig and Berant (2021) for
semantic parsing. The IID split is the simple split
(Lake and Baroni, 2018). The compositional splits
are primitive right (RIGHT) and primitive around
right (ARIGHT) (Loula et al., 2018).

GEOQUERY (Zelle and Mooney, 1996) uses the
FunQL formalism (Kate et al., 2005) and contains
questions about the US geography. The IID split
is the standard split and compositional general-
ization is evaluated on two splits: LENGTH where
the examples are split by program length and
TEMPLATE (Finegan-Dollak et al., 2018a) where
they are split such that all semantic programs hav-
ing the same AST are in the same split.

CLEVR (Johnson et al., 2017) contains synthetic
questions over object relations in images. CLO-
SURE (Bahdanau et al., 2019) introduces additional
question templates that require compositional gen-
eralization. We use the original split as our IID
split and the CLOSURE split as a compositional
split where the model is evaluated on CLOSURE.

Baselines. We compare our approach against
the architecture proposed by Herzig and Berant
(2021) (SPANBASEDSP) as well as the seq2seq
baselines they used. In SEQ2SEQ (Jia and Liang,
2016), the encoder is a bi-LSTM over pre-trained
GloVe embeddings (Pennington et al., 2014) or
ELMO (Peters et al., 2018) and the decoder is an
attention-based LSTM (Bahdanau et al., 2015).
BERT2SEQ replaces the encoder with BERT-
base. GRAMMAR is similar to SEQ2SEQ but
the decoding is constrained by a grammar. BART
(Lewis et al., 2020) is pre-trained as a denoising
autoencoder.

Results. We report the denotation accuracies
in Table 1. Our approach outperforms all other

methods. In particular, the seq2seq baselines suf-
fer from a significant drop in accuracy on splits
that require compositional generalization. While
SPANBASEDSP is able to generalize, our approach
outperforms it. Note that we observed that the
GEOQUERY execution script used to compute de-
notation accuracy in previous work contains sev-
eral bugs that overestimate the true accuracy.
Therefore, we also report denotation accuracy
with a corrected executor (see Appendix C) for
fair comparison with future work.

We also report exact match accuracy, with and
without the heuristic to construct integral solutions
from fractional ones. The exact match accuracy
is always lower or equal to the denotation ac-
curacy. This shows that our approach can some-
times provide the correct denotation even though
the prediction is different from the gold semantic
program. Importantly, while our approach out-
performs baselines, its accuracy is still signifi-
cantly worse on the split that requires to generalize
to longer programs.

6 Related Work

Graph-based Methods. Graph-based methods
have been popularized by syntactic dependency
parsing (McDonald et al., 2005) where MAP in-
ference is realized via the maximum spanning
arborescence algorithm (Chu and Liu, 1965;
Edmonds, 1967). A benefit of this algorithm is that
it has an O(n²) time complexity (Tarjan, 1977),
i.e., it is more efficient than algorithms explor-
ing more restricted search spaces (Eisner, 1997;
Gómez-Rodríguez et al., 2011; Pitler et al., 2012,
2013).

In the case of semantic structures, Kuhlmann
and Jonsson (2015) proposed an O(n³) algo-
rithm for the maximum non-necessarily-spanning
acyclic graphs with a noncrossing arc constraint.
Without the noncrossing constraint, the problem
is known to be NP-hard (Grötschel et al., 1985).
To bypass this computational complexity, Dozat
and Manning (2018) proposed to handle each de-
pendency as an independent binary classification
problem, that is they do not enforce any constraint
on the output structure. Note that, contrary to our
work, these approaches allow for reentrancy but
do not enforce well-formedness of the output with
respect to the semantic grammar. Lyu and Titov
(2018) use a similar approach for AMR parsing


                          SCAN                      GEOQUERY                      CLEVR
                          IID    RIGHT  ARIGHT      IID    TEMPLATE  LENGTH       IID    CLOSURE

Baselines (denotation accuracy only)
SEQ2SEQ                   99.9   11.6   0           78.5   46.0      24.3         100    59.5
  + ELMO                  100    54.9   41.6        79.3   50.0      25.7         100    64.2
BERT2SEQ                  100    77.7   95.3        81.1   49.6      26.1         100    56.4
GRAMMAR                   100    0.0    4.2         72.1   54.0      24.6         100    51.3
BART                      100    50.5   100         87.1   67.0      19.3         100    51.5
SPANBASEDSP               100    100    100         86.1   82.2      63.6         96.7   98.8

Our approach
Denotation accuracy       100    100    100         92.9   89.9      74.9         100    99.6
  Corrected executor      –      –      –           91.8   88.7      74.5         –      –
Exact match               100    100    100         90.7   86.2      69.3         100    99.6
  w/o CPLEX heuristic     100    100    100         90.0   83.0      67.5         100    98.0

Table 1: Denotation and exact match accuracy on the test sets. All the baseline results were taken from
Herzig and Berant (2021). For our approach, we also report exact match accuracy, i.e., the percentage
of sentences for which the prediction is identical to the gold program. The last line reports the exact
match accuracy without the use of CPLEX to round non integral solutions (Section 4.3).

where tags are predicted first, followed by arc
predictions and finally heuristics are used to en-
sure the output graph is valid. On the contrary,
we do not use a pipeline and we focus on joint
decoding where validity of the output is directly
encoded in the search space.

Previous work in the literature has also con-
sidered reduction to graph-based methods for
other problems, e.g., for discontinuous constit-
uency parsing (Fernández-González and Martins,
2015; Corro et al., 2017), lexical segmentation
(Constant and Le Roux, 2015), and machine trans-
lation (Zaslavskiy et al., 2009), inter alia.

Compositional Generalization. Several au-
thors observed that compositional generalization
insufficiency is an important source of error for
semantic parsers, especially ones based on seq2seq
architectures (Lake and Baroni, 2018; Finegan-
Dollak et al., 2018b; Herzig and Berant, 2019;
Keysers et al., 2020). Wang et al. (2021) pro-
posed a latent re-ordering step to improve com-
positional generalization, whereas Zheng and
Lapata (2021) relied on latent predicate tagging
in the encoder. There has also been an interest in
using data augmentation methods to improve gen-
eralization (Jia and Liang, 2016; Andreas, 2020;
Akyürek et al., 2021; Qiu et al., 2022; Yang et al.,
2022).

Recently, Herzig and Berant (2021) showed
that span-based parsers do not exhibit such prob-
lematic behavior. Unfortunately, these parsers fail
to cover the set of semantic structures observed
in English treebanks, and we hypothesize that this
would be even worse for free word order lan-
guages. Our graph-based approach does not ex-
hibit this downside. Previous work by Jambor
and Bahdanau (2022) also considered graph-based
methods for compositional generalization, but their
approach predicts each part independently without
any well-formedness or acyclicity constraint.

7 Conclusion

In this work, we focused on graph-based semantic
parsing for formalisms that do not allow reen-
trancy. We conducted a complexity study of two
inference problems that appear in this setting. We
proposed ILP formulations of these problems to-
gether with a solver for their linear relaxation
based on the conditional gradient method. Exper-
imentally, our approach outperforms comparable
baselines.

One downside of our semantic parser is speed
(we parse approximately 5 sentences per sec-
ond for GEOQUERY). However, we hope this work
will give a better understanding of the semantic


parsing problem together with a baseline for faster
methods.

Future research will investigate extensions for
(1) ASTs that contain reentrancies and (2) predic-
tion algorithms for the case where a single word
can be the anchor of more than one predicate or
entity. These two properties are crucial for seman-
tic representations like Abstract Meaning Repre-
sentation (Banarescu et al., 2013). Moreover, even
if our graph-based semantic parser provides bet-
ter results than previous work on length general-
ization, this setting is still difficult. A more general
research direction on neural architectures that
generalize better to longer sentences is important.

Acknowledgments

We thank François Yvon and the anonymous re-
viewers for their comments and suggestions. We
thank Jonathan Herzig and Jonathan Berant for
fruitful discussions. This work benefited from
computations done on the Saclay-IA platform and
on the HPC resources of IDRIS under the allo-
cation 2022-AD011013727 made by GENCI.

References

Ekin Akyürek, Afra Feyza Akyürek, and Jacob
Andreas. 2021. Learning to recombine and
resample data for compositional generaliza-
tion. In International Conference on Learning
Representations.

Jacob Andreas. 2020. Good-enough composi-
tional data augmentation. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 7556–7566,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.acl-main.676

Jacob Andreas, Andreas Vlachos, and Stephen
Clark. 2013. Semantic parsing as machine
translation. In Proceedings of the 51st Annual
Meeting of
the Association for Computa-
tional Linguistics (Volume 2: Short Papers),
pages 47–52, Sofia, Bulgaria. Association for
Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In 3rd
International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7–9, 2015,
Conference Track Proceedings. https://doi.org/10.48550/arXiv.1409.0473

Dzmitry Bahdanau, Harm de Vries, Timothy
J. O’Donnell, Shikhar Murty, Philippe
Beaudoin, Yoshua Bengio, and Aaron C.
Courville. 2019. CLOSURE: Assessing sys-
tematic generalization of CLEVR models.
CoRR, abs/1912.05783. https://doi.org
/10.48550/arXiv.1912.05783

Laura Banarescu, Claire Bonial, Shu Cai,
Madalina Georgescu, Kira Griffitt, Ulf
Hermjakob, Kevin Knight, Philipp Koehn,
Martha Palmer, and Nathan Schneider. 2013.
Abstract Meaning Representation for sembank-
ing. In Proceedings of the 7th Linguistic An-
notation Workshop and Interoperability with
Discourse, pages 178–186, Sofia, Bulgaria.
Association for Computational Linguistics.

Amir Beck. 2017. First-Order Methods in Op-
timization. SIAM. https://doi.org/10
.1137/1.9781611974997

Mathieu Blondel, André F. T. Martins, and Vlad
Niculae. 2020. Learning with Fenchel-Young
losses. Journal of Machine Learning Research,
21(35):1–69.

Bernd Bohnet and Joakim Nivre. 2012. A
transition-based system for joint part-of-speech
tagging and labeled non-projective dependency
parsing. In Proceedings of the 2012 Joint Con-
ference on Empirical Methods in Natural Lan-
guage Processing and Computational Natural
Language Learning, pages 1455–1465, Jeju
Island, Korea. Association for Computational
Linguistics.

Stephen Boyd and Lieven Vandenberghe. 2004.
Convex Optimization. Cambridge Uni-
versity Press. https://doi.org/10.1017
/CBO9780511804441

Yoeng-Jin Chu and Tseng-Hong Liu. 1965. On
the shortest arborescence of a directed graph.
Scientia Sinica, 14:1396–1400.

John Cocke. 1970. Programming Languages and
Their Compilers: Preliminary Notes. New York
University.


Matthieu Constant and Joseph Le Roux. 2015.
Dependency representations for lexical seg-
mentation. In 6th Workshop on Statistical
Parsing of Morphologically Rich Languages
(SPMRL 2015), Bilbao, Spain, July.

Caio Corro. 2020. Span-based discontinuous con-
stituency parsing: A family of exact chart-
based algorithms with time complexities from
O(n^6) down to O(n^3). In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 2753–2764, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.emnlp-main.219

Caio Corro. 2023. On the inconsistency of separa-
ble losses for structured prediction. In Proceed-
ings of the 17th Conference of the European
Chapter of the Association for Computational
Linguistics: Main Volume. Association for
Computational Linguistics.

Caio Corro, Joseph Le Roux, and Mathieu
Lacroix. 2017. Efficient discontinuous phrase-
structure parsing via the generalized maximum
spanning arborescence. In Proceedings of the
2017 Conference on Empirical Methods in Nat-
ural Language Processing, pages 1644–1654,
Copenhagen, Denmark. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/D17-1172

Thomas M. Cover. 1999. Elements of Informa-

tion Theory. John Wiley & Sons.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of
the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N19-1423

Li Dong and Mirella Lapata. 2016. Language
to logical form with neural attention. In Pro-
ceedings of the 54th Annual Meeting of the
Association for Computational Linguistics
(Volume 1: Long Papers), pages 33–43, Berlin,
Germany. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/P16-1004


Timothy Dozat and Christopher D. Manning.
2017. Deep biaffine attention for neural de-
pendency parsing. In International Conference
on Learning Representations.

Timothy Dozat and Christopher D. Manning.
2018. Simpler but more accurate semantic de-
pendency parsing. In Proceedings of the 56th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 2: Short Papers),
pages 484–490, Melbourne, Australia. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/P18-2077

Christophe Duhamel, Luis Gouveia, Pedro
Moura, and Mauricio Souza. 2008. Models
and heuristics for a minimum arborescence
problem. Networks, 51(1):34–47. https://
doi.org/10.1002/net.20194

Jack Edmonds. 1967. Optimum branchings. Jour-
nal of Research of the National Bureau of
Standards – B. Mathematics and Mathemati-
cal Physics, 71:233. https://doi.org/10
.6028/jres.071B.032

Jason Eisner. 1997. Bilexical grammars and a
cubic-time probabilistic parser. In Proceed-
ings of the Fifth International Workshop on
Parsing Technologies, pages 54–65, Boston/
Cambridge, Massachusetts, USA. Association
for Computational Linguistics.

Daniel Fernández-González and André F. T.
Martins. 2015. Parsing as reduction. In Pro-
ceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and
the 7th International Joint Conference on Nat-
ural Language Processing (Volume 1: Long
Papers), pages 1523–1533, Beijing, China.
Association for Computational Linguistics.
https://doi.org/10.3115/v1/P15-1147

Catherine Finegan-Dollak, Jonathan K.
Kummerfeld, Li Zhang, Karthik Ramanathan,
Sesh Sadasivam, Rui Zhang, and Dragomir
Radev. 2018a. Improving text-to-SQL evalua-
tion methodology. In Proceedings of the 56th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers),
pages 351–360, Melbourne, Australia. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/P18-1033

Catherine Finegan-Dollak, Jonathan K.
Kummerfeld, Li Zhang, Karthik Ramanathan,


Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018b. Improving text-to-SQL evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1033

Marguerite Frank and Philip Wolfe. 1956. An
algorithm for quadratic programming. Naval
Research Logistics Quarterly, 3(1–2):95–110.
https://doi.org/10.1002/nav.3800030109

Michael R. Garey and David S. Johnson. 1979.
Computers and Intractability: A Guide to the
Theory of NP-Completeness (Series of Books
in the Mathematical Sciences). W. H. Freeman.

Carlos Gómez-Rodríguez, John Carroll, and David Weir. 2011. Dependency parsing schemata and mildly non-projective dependency parsing. Computational Linguistics, 37(3):541–586. https://doi.org/10.1162/COLI_a_00060

Martin Grötschel, Michael Jünger, and Gerhard Reinelt. 1985. Acyclic Subdigraphs and Linear Orderings: Polytopes, Facets, and a Cutting Plane Algorithm, pages 217–264. Springer Netherlands, Dordrecht. https://doi.org/10.1007/978-94-009-5315-4_7

David Hall, Greg Durrett, and Dan Klein. 2014.
Less grammar, more features. In Proceedings
of the 52nd Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 228–237, Baltimore, Maryland.
Association for Computational Linguistics.
https://doi.org/10.3115/v1/P14-1022

Jonathan Herzig and Jonathan Berant. 2019.
Don’t paraphrase, detect! Rapid and effective
data collection for semantic parsing. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 3810–3820, Hong Kong, China. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1394

Jonathan Herzig and Jonathan Berant. 2021. Span-
based semantic parsing for compositional gen-
eralization. In Proceedings of the 59th Annual

Meeting of the Association for Computational
Linguistics and the 11th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 908–921,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2021.acl-long.74

Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Computation,
9(8):1735–1780. https://doi.org/10.1162
/neco.1997.9.8.1735, PubMed: 9377276

Dora Jambor and Dzmitry Bahdanau. 2022.
LAGr: Label aligned graphs for better sys-
tematic generalization in semantic parsing. In
Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 3295–3308,
Dublin,
Ireland. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2022.acl-long.233

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1002

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1988–1997. https://doi.org/10.1109/CVPR.2017.215

Laura Kallmeyer. 2010. Parsing Beyond Context-
Free Grammars. Springer Science & Business
Media. https://doi.org/10.1007/978
-3-642-14846-0

Tadao Kasami. 1965. An Efficient Recognition
and Syntax-Analysis Algorithm for Context-Free
Languages, Coordinated Science Laboratory,
University of Illinois at Urbana-Champaign.

Rohit J. Kate, Yuk Wah Wong, and Raymond J.
Mooney. 2005. Learning to transform natu-
ral to formal languages. In Proceedings of the
20th National Conference on Artificial Intelli-
gence – Volume 3, AAAI’05, pages 1062–1068.
AAAI Press.

Daniel Keysers, Nathanael Schärli, Nathan
Scales, Hylke Buisman, Daniel Furrer,
Sergii Kashubin, Nikola Momchev, Danila
Sinopalnikov, Lukasz Stafiniak, Tibor Tihon,
Dmitry Tsarkov, Xiao Wang, Marc van Zee, and
Olivier Bousquet. 2020. Measuring composi-
tional generalization: A comprehensive method
on realistic data. In International Conference
on Learning Representations.

Terry Koo, Alexander M. Rush, Michael Collins,
Tommi Jaakkola, and David Sontag. 2010.
Dual decomposition for parsing with non-
projective head automata. In Proceedings of the
2010 Conference on Empirical Methods in Nat-
ural Language Processing, pages 1288–1298,
Cambridge, MA. Association for Computa-
tional Linguistics.

Joseph B. Kruskal. 1956. On the shortest span-
ning subtree of a graph and the traveling sales-
man problem. Proceedings of the American
Mathematical Society, 7(1):48–50. https://
doi.org/10.1090/S0002-9939-1956
-0078686-7

Marco Kuhlmann and Peter Jonsson. 2015. Pars-
ing to noncrossing dependency graphs. Trans-
actions of the Association for Computational
Linguistics, 3:559–570. https://doi.org
/10.1162/tacl_a_00158

Harold W. Kuhn. 1955. The Hungarian method
for the assignment problem. Naval Research
Logistics Quarterly, 2(1–2):83–97. https://
doi.org/10.1002/nav.3800020109

Simon Lacoste-Julien and Martin Jaggi. 2015.
On the global linear convergence of Frank-
Wolfe optimization variants. In Advances in
Neural Information Processing Systems, vol-
ume 28. Curran Associates, Inc.

Brenden Lake and Marco Baroni. 2018. Generalization without systematicity: On the
compositional skills of sequence-to-sequence
recurrent networks. In Proceedings of the 35th
International Conference on Machine Learn-
ing, volume 80 of Proceedings of Machine
Learning Research, pages 2873–2882. PMLR.

Evgeny S. Levitin and Boris T. Polyak.
1966. Constrained minimization methods. USSR
Computational Mathematics and Mathematical
Physics, 6(5):1–50. https://doi.org/10
.1016/0041-5553(66)90114-5

Mike Lewis, Yinhan Liu, Naman Goyal,
Marjan Ghazvininejad, Abdelrahman Mohamed,
Omer Levy, Veselin Stoyanov, and Luke
Zettlemoyer. 2020. BART: Denoising sequence-
to-sequence pre-training for natural language
generation, translation, and comprehension. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguis-
tics, pages 7871–7880, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.703

Zhenghua Li, Min Zhang, Wanxiang Che, Ting
Liu, Wenliang Chen, and Haizhou Li. 2011.
Joint models for Chinese POS tagging and de-
pendency parsing. In Proceedings of the 2011
Conference on Empirical Methods in Natu-
ral Language Processing, pages 1180–1191,
Edinburgh, Scotland, UK. Association for
Computational Linguistics.

João Loula, Marco Baroni, and Brenden Lake.
2018. Rearranging the familiar: Testing com-
positional generalization in recurrent networks.
In Proceedings of the 2018 EMNLP Work-
shop BlackboxNLP: Analyzing and Interpreting
Neural Networks for NLP, pages 108–114,
Brussels, Belgium. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/W18-5413

Chunchuan Lyu and Ivan Titov. 2018. AMR pars-
ing as graph prediction with latent alignment.
In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 397–407,
Melbourne, Australia. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/P18-1037

Richard Kipp Martin, Ronald L. Rardin, and
Brian A. Campbell. 1990. Polyhedral charac-
terization of discrete dynamic programming.
Operations Research, 38(1):127–138. https://
doi.org/10.1287/opre.38.1.127

André Martins, Noah Smith, and Eric Xing. 2009.
Concise integer linear programming formula-
tions for dependency parsing. In Proceedings
of the Joint Conference of the 47th Annual
Meeting of the ACL and the 4th International
Joint Conference on Natural Language Pro-
cessing of the AFNLP, pages 342–350, Suntec,

Singapore. Association for Computational Lin-
guistics. https://doi.org/10.3115/1687878
.1687928

Ryan McDonald, Fernando Pereira, Kiril Ribarov,
and Jan Hajič. 2005. Non-projective depen-
dency parsing using spanning tree algorithms.
In Proceedings of Human Language Technol-
ogy Conference and Conference on Empirical
Methods in Natural Language Processing,
pages 523–530, Vancouver, British Columbia,
Canada. Association for Computational Lin-
guistics. https://doi.org/10.3115/1220575
.1220641

Ryan McDonald and Giorgio Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Proceedings of the Tenth International Conference on Parsing Technologies, pages 121–132, Prague, Czech Republic. Association for Computational Linguistics. https://doi.org/10.3115/1621410.1621426

Young-Soo Myung, Chang-Ho Lee, and Dong-Wan Tcha. 1995. On the generalized
minimum spanning tree problem. Networks,
26(4):231–241. https://doi.org/10.1002
/net.3230260407

Radford M. Neal and Geoffrey E. Hinton. 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer. https://doi.org/10.1007/978-94-011-5014-9_12

Yu Nesterov. 2005. Smooth minimization of
non-smooth functions. Mathematical Program-
ming, 103(1):127–152. https://doi.org
/10.1007/s10107-004-0552-5

Jorge Nocedal and Stephen J. Wright. 1999.
Numerical Optimization. Springer. https://
doi.org/10.1007/b98874

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global
vectors for word representation. In Proceedings
of the 2014 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP),
pages 1532–1543, Doha, Qatar. Association for
Computational Linguistics. https://doi
.org/10.3115/v1/D14-1162

Matthew E. Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,

and Luke Zettlemoyer. 2018. Deep contextu-
alized word representations. In Proceedings of
the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 2227–2237,
New Orleans, Louisiana. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N18-1202

Emily Pitler, Sampath Kannan, and Mitchell
Marcus. 2012. Dynamic programming for
higher order parsing of gap-minding trees. In
Proceedings of the 2012 Joint Conference on
Empirical Methods in Natural Language Pro-
cessing and Computational Natural Language
Learning, pages 478–488, Jeju Island, Korea.
Association for Computational Linguistics.

Emily Pitler, Sampath Kannan, and Mitchell
Marcus. 2013. Finding optimal 1-endpoint-
crossing trees. Transactions of the Associa-
tion for Computational Linguistics, 1:13–24.
https://doi.org/10.1162/tacl_a_00206

Linlu Qiu, Peter Shaw, Panupong Pasupat, Pawel Nowak, Tal Linzen, Fei Sha, and Kristina Toutanova. 2022. Improving compositional generalization with latent structure and data augmentation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4341–4362, Seattle, United States. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.naacl-main.323

Owen Rambow. 2010. The simple truth about
dependency and phrase structure representa-
tions: An opinion piece. In Human Language
Technologies: The 2010 Annual Conference of
the North American Chapter of the Association
for Computational Linguistics, pages 337–340,
Los Angeles, California. Association for Com-
putational Linguistics.

V. Venkata Rao and R. Sridharan. 2002. Minimum-weight rooted not-necessarily-spanning arborescence problem. Networks, 39(2):77–87. https://doi.org/10.1002/net.10015

Giorgio Satta. 1992. Recognition of linear context-free rewriting systems. In 30th Annual Meeting of the Association for Computational
Linguistics, pages 89–95, Newark, Delaware,
USA. Association for Computational Linguis-
tics. https://doi.org/10.3115/981967
.981979

Alexander Schrijver. 2003. Combinatorial Opti-
mization: Polyhedra and Efficiency, volume 24.
Springer.

Mitchell Stern, Jacob Andreas, and Dan Klein.
2017. A minimal span-based neural constitu-
ency parser. In Proceedings of the 55th Annual
Meeting of
the Association for Computa-
tional Linguistics (Volume 1: Long Papers),
pages 818–827, Vancouver, Canada. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/P17-1076

Robert Endre Tarjan. 1977. Finding optimum
branchings. Networks, 7(1):25–35. https://
doi.org/10.1002/net.3230070103

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav
Petrov, Ilya Sutskever, and Geoffrey Hinton.
2015. Grammar as a foreign language. In
Advances in Neural Information Processing
Systems, volume 28. Curran Associates, Inc.

Martin J. Wainwright and Michael Irwin Jordan.
2008. Graphical Models, Exponential Families,
and Variational Inference. Now Publishers Inc.
https://doi.org/10.1561/9781601981851

Bailin Wang, Mirella Lapata, and Ivan Titov.
2021. Structured reordering for modeling latent
alignments in sequence transduction. In Ad-
vances in Neural Information Processing Sys-
tems, volume 34, pages 13378–13391. Curran
Associates, Inc.

Bailin Wang, Richard Shin, Xiaodong Liu,
Oleksandr Polozov, and Matthew Richardson.
2020. RAT-SQL: Relation-aware schema en-
coding and linking for text-to-SQL parsers. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguis-
tics, pages 7567–7578, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.677

Jingfeng Yang, Le Zhang, and Diyi Yang. 2022. SUBS: Subtree substitution for compositional semantic parsing. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 169–174, Seattle, United States.

Association for Computational Linguistics.
https://doi.org/10.18653/v1/2022
.naacl-main.12

Daniel H. Younger. 1967. Recognition and parsing of context-free languages in time n³. Information and Control, 10(2):189–208.
https://doi.org/10.1016/S0019-9958
(67)80007-X

Alp Yurtsever, Olivier Fercoq, Francesco
Locatello, and Volkan Cevher. 2018. A conditional gradient framework for composite
convex minimization with applications to semi-
definite programming. In Proceedings of the
35th International Conference on Machine
Learning, volume 80 of Proceedings of Ma-
chine Learning Research, pages 5727–5736.
PMLR.

Mikhail Zaslavskiy, Marc Dymetman, and Nicola
Cancedda. 2009. Phrase-based statistical ma-
chine translation as a traveling salesman prob-
lem. In Proceedings of the Joint Conference
of the 47th Annual Meeting of the ACL and
the 4th International Joint Conference on
Natural Language Processing of the AFNLP,
pages 333–341, Suntec, Singapore. Association
for Computational Linguistics. https://doi
.org/10.3115/1687878.1687926

John M. Zelle and Raymond J. Mooney. 1996.
Learning to parse database queries using in-
ductive logic programming. In Proceedings of
the Thirteenth National Conference on Ar-
tificial Intelligence – Volume 2, AAAI’96,
pages 1050–1055. AAAI Press.

Xingxing Zhang, Jianpeng Cheng, and Mirella
Lapata. 2017. Dependency parsing as head
selection. In Proceedings of the 15th Confer-
ence of the European Chapter of the Associa-
tion for Computational Linguistics: Volume 1,
Long Papers, pages 665–676, Valencia, Spain.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/E17
-1063

Hao Zheng and Mirella Lapata. 2021. Compositional generalization via semantic tagging. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1022–1032, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-emnlp.88

A Proofs

Proof: Theorem 1. We prove Theorem 1 by reducing the maximum not-necessarily-spanning arborescence (MNNSA) problem, which is known to be NP-hard (Rao and Sridharan, 2002; Duhamel et al., 2008), to the MGVCNNSA.

Let G = ⟨V, A, ψ⟩ be a weighted graph where V = {0, . . . , n} and ψ ∈ R^{|A|} are arc weights. The MNNSA problem aims to compute the subset of arcs B ⊆ A such that ⟨V[B], B⟩ is an arborescence of maximum weight, where its weight is defined as ∑_{a∈B} ψ_a.

Let G = ⟨E, T, f_TYPE, f_ARGS⟩ be a grammar such that E = {0, . . . , n − 1}, T = {t} and ∀e ∈ E : f_TYPE(e) = t ∧ f_ARGS(e, t) = e. Intuitively, a tag e ∈ E will be associated to vertices that require exactly e outgoing arcs.

We construct a clustered labeled weighted graph G′ = ⟨V′, A′, π, l, ψ′⟩ as follows. π = {V′_0, . . . , V′_n} is a partition of V′ such that each cluster V′_i contains n − 1 vertices and represents the vertex i ∈ V. The labeling function l assigns a different tag to each vertex in a cluster, i.e., ∀V′_i ∈ π, ∀u′, v′ ∈ V′_i : u′ ≠ v′ ⇒ l(u′) ≠ l(v′). The set of arcs is defined as A′ = {u′ → v′ | ∃ i → j ∈ A s.t. u′ ∈ V′_i ∧ v′ ∈ V′_j}. The weight vector ψ′ ∈ R^{|A′|} is such that ∀ u′ → v′ ∈ A′ : u′ ∈ V′_u ∧ v′ ∈ V′_v ⇒ ψ′_{u′→v′} = ψ_{u→v}.

As such, there is a one-to-one correspondence between solutions of the MNNSA on graph G and solutions of the MGVCNNSA on graph G′.
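
To make the reduction concrete, the following Python sketch builds the clustered labeled weighted graph G′ from an input weighted graph G. It is only an illustration of the construction above, not code from the paper: the dictionary-based graph representation and the function name are ours, and assigning the tags 0 to n − 2 within each cluster is an illustrative labeling choice.

```python
from itertools import product

def mnnsa_to_mgvcnnsa(n, arcs):
    """Reduction sketch: build the clustered labeled graph G' from a
    weighted graph G with vertex set {0, ..., n}.

    `arcs` maps pairs (i, j) to the weight of arc i -> j.
    Returns the clusters, the labeling, and the weighted arcs of G'.
    """
    # One cluster per original vertex; each cluster holds n - 1 copies,
    # each copy carrying a distinct tag (tag e = "exactly e outgoing arcs").
    clusters = {i: [(i, e) for e in range(n - 1)] for i in range(n + 1)}
    labels = {(i, e): e for i in range(n + 1) for e in range(n - 1)}

    # An arc u' -> v' of G' exists iff i -> j is an arc of G, where u' lies
    # in cluster i and v' lies in cluster j; it inherits the weight of i -> j.
    new_arcs = {}
    for (i, j), weight in arcs.items():
        for u_prime, v_prime in product(clusters[i], clusters[j]):
            new_arcs[(u_prime, v_prime)] = weight
    return clusters, labels, new_arcs

# Tiny usage example: a graph with vertices {0, 1, 2, 3} (n = 3).
clusters, labels, new_arcs = mnnsa_to_mgvcnnsa(
    3, {(0, 1): 2.0, (1, 2): 1.5, (1, 3): -0.5}
)
print(len(clusters), len(new_arcs))  # 4 clusters, 12 arcs in G'
```
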

Proof: Theorem 2. We prove Theorem 2 by reducing the maximum directed Hamiltonian path problem, which is known to be NP-hard (Garey and Johnson, 1979, Appendix A1.3), to the latent anchoring problem.

Let G = ⟨V, A, ψ⟩ be a weighted graph where V = {1, . . . , n} and ψ ∈ R^{|A|} are arc weights. The maximum Hamiltonian path problem aims to compute the subset of arcs B ⊆ A such that V[B] = V and ⟨V[B], B⟩ is a path of maximum weight, where its weight is defined as ∑_{a∈B} ψ_a.

Let G = ⟨E, T, f_TYPE, f_ARGS⟩ be a grammar such that E = {0, 1}, T = {t} and ∀e ∈ E : f_TYPE(e) = t ∧ f_ARGS(e, t) = e.

We construct a clustered labeled weighted graph G′ = ⟨V′, A′, π, l, ψ′⟩ as follows. π = {V′_0, . . . , V′_n} is a partition of V′ such that V′_0 = {0} and each cluster V′_i ≠ V′_0 contains 2 vertices and represents the vertex i ∈ V. The labeling function l assigns a different tag to each vertex in a cluster except the root, i.e., ∀V′_i ∈ π, i > 0, ∀u′, v′ ∈ V′_i : u′ ≠ v′ ⇒ l(u′) ≠ l(v′). The set of arcs is defined as A′ = {0 → u′ | u′ ∈ V′ \ {0}} ∪ {u′ → v′ | i → j ∈ A ∧ u′ ∈ V′_i ∧ v′ ∈ V′_j}. The weight vector ψ′ ∈ R^{|A′|} is such that ∀ u′ → v′ ∈ A′ : u′ ∈ V′_u ∧ v′ ∈ V′_v ⇒ ψ′_{u′→v′} = ψ_{u→v}, and arcs leaving the root, 0, have null weights.

We construct an AST G′′ = ⟨V′′, A′′, l′⟩ such that V′′ = {1, . . . , n}, A′′ = {i → i + 1 | 1 ≤ i < n} and the labeling function l′ assigns the tag 0 to n and the tag 1 to every other vertex. Note that our proof considers that arcs leaving the root cluster satisfy constraints defined by the grammar, whereas we previously only required the root vertex to have a single outgoing arc. The latter constraint can be added directly in a grammar, but we omit its presentation for brevity. The constrained arity case presented by McDonald and Satta (2007) focuses on spanning arborescences with an arity constraint, by reducing the Hamiltonian path problem to their problem. While the arity constraint is similar in their problem and ours, our proof considers the not-necessarily-spanning case instead of the spanning one. Although the two problems seem related, they need to be studied separately: e.g., computing the maximum spanning arborescence is a polynomial-time problem, whereas computing the MNNSA is an NP-hard problem.

As such, there is a one-to-one correspondence between solutions of the maximum Hamiltonian path problem on graph G and solutions of the mapping of maximum weight of G′′ with G′.

B Experimental Setup

The neural architecture used in our experiments to produce the weights μ and φ is composed of: (1) an embedding layer of dimension 100 for SCAN, or BERT-base (Devlin et al., 2019) for the other datasets, followed by a bi-LSTM (Hochreiter and Schmidhuber, 1997) with a hidden size of 400; (2) a linear projection of dimension 500 over the output of the bi-LSTM, followed by a tanh activation and another linear projection of dimension |E| to obtain μ; (3) a linear projection of dimension 500 followed by a tanh activation and a bi-affine layer (Dozat and Manning, 2017) to obtain φ. We apply dropout with a probability of 0.3 over the outputs of BERT-base and the bi-LSTM, and after both tanh activations. The learning rate is 5 × 10⁻⁴ and each batch is composed of 30 examples. We keep the parameters that obtain the best accuracy on the development set after 25 epochs. Training the model takes between 40 minutes for GEOQUERY and 8 hours for CLEVR. Note, however, that the bottleneck is the conditional gradient method, which is computed on the CPU.
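
The PyTorch sketch below illustrates the general shape of this scoring architecture (encoder, tag scorer for μ, bi-affine arc scorer for φ). It is only an illustrative simplification: class and attribute names are chosen for this sketch, BERT-base is replaced by a plain embedding layer (as used for SCAN), and the exact bi-affine parameterization is not meant to match the released implementation.

```python
import torch
import torch.nn as nn

class WeightScorer(nn.Module):
    """Sketch: embeddings -> bi-LSTM -> (a) tag scores mu, (b) bi-affine arc scores phi."""

    def __init__(self, vocab_size, num_tags, emb_dim=100,
                 lstm_dim=400, proj_dim=500, dropout=0.3):
        super().__init__()
        # BERT-base replaces this embedding layer for all datasets except SCAN.
        self.embeddings = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, lstm_dim, batch_first=True,
                               bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        # Tag scores mu: projection + tanh + projection to |E| tags.
        self.tag_proj = nn.Linear(2 * lstm_dim, proj_dim)
        self.tag_out = nn.Linear(proj_dim, num_tags)
        # Arc scores phi: head/dependent projections + bi-affine layer.
        self.head_proj = nn.Linear(2 * lstm_dim, proj_dim)
        self.dep_proj = nn.Linear(2 * lstm_dim, proj_dim)
        self.biaffine = nn.Parameter(0.01 * torch.randn(proj_dim + 1, proj_dim + 1))

    def forward(self, tokens):
        # tokens: (batch, length) word indices.
        h, _ = self.encoder(self.dropout(self.embeddings(tokens)))
        h = self.dropout(h)
        # mu: (batch, length, num_tags).
        mu = self.tag_out(self.dropout(torch.tanh(self.tag_proj(h))))
        # phi: (batch, length, length) head-dependent scores; a bias
        # dimension is appended to both representations.
        head = self.dropout(torch.tanh(self.head_proj(h)))
        dep = self.dropout(torch.tanh(self.dep_proj(h)))
        ones = head.new_ones(head.shape[:-1] + (1,))
        head = torch.cat([head, ones], dim=-1)
        dep = torch.cat([dep, ones], dim=-1)
        phi = torch.einsum("bif,fg,bjg->bij", head, self.biaffine, dep)
        return mu, phi

# Usage sketch: score a batch of two sentences of length 5.
model = WeightScorer(vocab_size=1000, num_tags=20)
mu, phi = model(torch.randint(0, 1000, (2, 5)))
print(mu.shape, phi.shape)  # torch.Size([2, 5, 20]) torch.Size([2, 5, 5])
```
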
C GEOQUERY Denotation Accuracy Issue

The denotation accuracy is evaluated by checking whether the denotation returned by an executor is the same when given the gold semantic program and the prediction of the model. It can be higher than the exact match accuracy when different semantic programs yield the same denotation. When we evaluated our approach using the same executor as the baselines of Herzig and Berant (2021), we observed two main issues regarding the behavior of predicates: (1) several predicates have undefined behaviors (e.g., population_1 and traverse_2 in the case of an argument of type country), in the sense that they are not implemented; (2) the behavior of some predicates is incorrect with respect to their expected semantics (e.g., traverse_1 and traverse_2).

These two sources of errors result in incorrect denotations for several semantic programs, leading to an overestimation of the denotation accuracy when both the gold and predicted programs return by accident an empty denotation (potentially for different reasons, due to the aforementioned implementation issues).

We implemented a corrected executor addressing the issues that we found. It is available at https://github.com/alban-petit/geoquery-funql-executor.
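
For concreteness, the following sketch shows how the denotation accuracy described above can be computed from an executor. The function execute_program is a hypothetical stand-in for a FunQL executor (such as the corrected one linked above); it is not an actual function name from the released code.

```python
def denotation_accuracy(gold_programs, predicted_programs, execute_program):
    """Fraction of examples for which the executor returns the same
    denotation for the gold and the predicted semantic programs.

    `execute_program` is assumed to map a program string to a denotation
    (e.g., a set of entities), or to None on failure.
    """
    assert len(gold_programs) == len(predicted_programs)
    correct = 0
    for gold, pred in zip(gold_programs, predicted_programs):
        # If both calls return an empty denotation "by accident" (e.g.,
        # because of unimplemented predicates), the example is counted as
        # correct, which can overestimate the metric.
        if execute_program(gold) == execute_program(pred):
            correct += 1
    return correct / len(gold_programs)
```
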