On Graph-based Reentrancy-free Semantic Parsing

Alban Petit and Caio Corro
Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France
{alban.petit,caio.corro}@lisn.upsaclay.fr

Abstract

We propose a novel graph-based approach for semantic parsing that resolves two problems observed in the literature: (1) seq2seq models fail on compositional generalization tasks; (2) previous work using phrase structure parsers cannot cover all the semantic parses observed in treebanks. We prove that both MAP inference and latent tag anchoring (required for weakly-supervised learning) are NP-hard problems. We propose two optimization algorithms based on constraint smoothing and conditional gradient to approximately solve these inference problems. Experimentally, our approach delivers state-of-the-art results on GEOQUERY, SCAN, and CLEVR, both for i.i.d. splits and for splits that test for compositional generalization.

1 Introduction

Semantic parsing aims to transform a natural language utterance into a structured representation that can be easily manipulated by a software (e.g., to query a database). As such, it is a central task in human–computer interfaces. Andreas et al. (2013) first proposed to rely on machine translation models for semantic parsing, where the target representation is linearized and treated as a foreign language. Due to recent advances in deep learning and especially in sequence-to-sequence (seq2seq) with attention architectures for machine translation (Bahdanau et al., 2015), it is appealing to use the same architectures for standard structured prediction problems (Vinyals et al., 2015). This approach is indeed common in semantic parsing (Jia and Liang, 2016; Dong and Lapata, 2016; Wang et al., 2020), as well as other domains. Unfortunately, there are well-known limitations to seq2seq architectures for semantic parsing. First, at test time, the decoding algorithm is typically based on beam search as the model is autoregressive and does not make any independence assumption. In case of prediction failure, it is therefore unknown if this is due to errors in the weighting function or to the optimal solution falling out of the beam. Secondly, they are known to fail when compositional generalization is required (Lake and Baroni, 2018; Finegan-Dollak et al., 2018a; Keysers et al., 2020).

In order to bypass these problems, Herzig and Berant (2021) proposed to represent the semantic content associated with an utterance as a phrase structure, i.e., using the same representation usually associated with syntactic constituents. As such, their semantic parser is based on standard span-based decoding algorithms (Hall et al., 2014; Stern et al., 2017; Corro, 2020) with additional well-formedness constraints from the semantic formalism. Given a weighting function, MAP inference is a polynomial time problem that can be solved via a variant of the CYK algorithm (Kasami, 1965; Younger, 1967; Cocke, 1970). Experimentally, Herzig and Berant (2021) show that their approach outperforms seq2seq models in terms of compositional generalization, therefore effectively bypassing the two major problems of these architectures.

The complexity of MAP inference for phrase structure parsing is directly impacted by the considered search space (Kallmeyer, 2010). Importantly, (ill-nested) discontinuous phrase structure parsing is known to be NP-hard, even with a bounded block-degree (Satta, 1992). Herzig and Berant (2021) explore two restricted inference algorithms, both of which have a cubic time complexity with respect to the input length. The first one only considers continuous phrase structures, that is, derived trees that could have been generated by a context-free grammar, and the second one also considers a specific type of discontinuities, see Corro (2020, Section 3.6). Both algorithms fail to cover the full set of phrase structures observed in semantic treebanks, see Figure 1.

In this work, we propose to reduce semantic parsing without reentrancy (i.e., a given predicate or entity cannot be used as an argument for two different predicates) to a bi-lexical dependency parsing problem. As such, we tackle the same semantic content as aforementioned previous work but using a different mathematical representation (Rambow, 2010). We identify two main benefits to our approach: (1) as we allow crossing arcs, i.e., ‘‘non-projective graphs’’, all datasets are guaranteed to be fully covered, and (2) it allows us to rely on optimization methods to tackle inference intractability of our novel graph-based formulation of the problem. More specifically, in our setting we need to jointly assign predicates/entities to words that convey a semantic content and to identify arguments of predicates via bi-lexical dependencies. We show that MAP inference in this setting is equivalent to the maximum generalized spanning arborescence problem (Myung et al., 1995) with supplementary constraints to ensure well-formedness with respect to the semantic formalism. Although this problem is NP-hard, we propose an optimization algorithm that solves a linear relaxation of the problem and can deliver an optimality certificate.

Our contributions can be summarized as follows:

• We propose a novel graph-based approach for semantic parsing without reentrancy;

• We prove the NP-hardness of MAP inference and latent anchoring inference;

• We propose a novel integer linear programming formulation for this problem together with an approximate solver based on conditional gradient and constraint smoothing;

• We tackle the training problem using variational approximations of objective functions, including the weakly-supervised scenario;

• We evaluate our approach on GEOQUERY, SCAN, and CLEVR and observe that it outperforms baselines on both i.i.d. splits and splits that test for compositional generalization.

Code to reproduce the experiments is available online.¹

¹https://github.com/alban-petit/semantic-dependency-parser



2 Graph-based Semantic Parsing

We propose to reduce semantic parsing to parsing the abstract syntax tree (AST) associated with a semantic program. We focus on semantic programs whose ASTs do not have any reentrancy, i.e., a single predicate or entity cannot be the argument of two different predicates. Moreover, we assume that each predicate or entity is anchored on exactly one word of the sentence and each word can be the anchor of at most one predicate or entity. As such, the semantic parsing problem can be reduced to assigning predicates and entities to words and identifying arguments via dependency relations, see Figure 2. In order to formalize our approach to the semantic parsing problem, we will use concepts from graph theory. We therefore first introduce the vocabulary and notions that will be useful in the rest of this article. Notably, the notions of cluster and generalized arborescence will be used to formalize our prediction problem.

Notations and Definitions. Let G = ⟨V, A⟩ be a directed graph with vertices V and arcs A ⊆ V × V. An arc in A from a vertex u ∈ V to a vertex v ∈ V is denoted either a ∈ A or u → v ∈ A. For any subset of vertices U ⊆ V, we denote σ⁺_G(U) (resp., σ⁻_G(U)) the set of arcs leaving one vertex of U and entering one vertex of V \ U (resp., leaving one vertex of V \ U and entering one vertex of U) in the graph G. Let B ⊆ A be a subset of arcs. We denote V[B] the cover set of B, i.e., the set of vertices that appear as an extremity of at least one arc in B. A graph G = ⟨V, A⟩ is an arborescence² rooted at u ∈ V if and only if (iff) it contains |V| − 1 arcs and there is a directed path from u to each vertex in V. In the rest of this work, we will assume that the root is always vertex 0 ∈ V. Let B ⊆ A be a set of arcs such that G′ = ⟨V[B], B⟩ is an arborescence. Then G′ is a spanning arborescence of G iff V[B] = V.

Let π = {V₀, . . . , Vₙ} be a partition of V containing n + 1 clusters. G′ is a generalized not-necessarily-spanning arborescence (resp., generalized spanning arborescence) on the partition π of G iff G′ is an arborescence and V[B] contains at most one vertex per cluster in π (resp., contains exactly one).

Let W ⊆ V be a set of vertices. Contracting W consists in replacing in G the set W by a new vertex w ∉ V, replacing all the arcs u → v ∈ σ⁻(W) by an arc u → w and all the arcs u → v ∈ σ⁺(W) by an arc w → v. Given a graph with partition π, the contracted graph is the graph where each cluster in π has been contracted. While contracting a graph may introduce parallel arcs, it is not an issue in practice, even for weighted graphs.

²In the NLP community, arborescences are often called (directed) trees. We stick with the term arborescence as it is more standard in the graph theory literature, see for example Schrijver (2003). Using the term tree introduces a confusion between two unrelated algorithms, Kruskal's maximum spanning tree algorithm (Kruskal, 1956) that operates on undirected graphs and Edmonds' maximum spanning arborescence algorithm (Edmonds, 1967) that operates on directed graphs. Moreover, this prevents any confusion between the graph object called arborescence and the semantic structure called AST.

Figure 1: Example of a semantic phrase structure from GEOQUERY. This structure is outside of the search space of the parser of Herzig and Berant (2021) as the constituent in red is discontinuous and also has a discontinuous parent (in red+green).






2.1 Semantic Grammar and AST

The semantic programs we focus on take the form of a functional language, i.e., a representation where each predicate is a function that takes other predicates or entities as arguments. The semantic language is typed in the same sense as in ‘‘typed programming languages’’. For example, in GEOQUERY, the predicate capital_2 expects an argument of type city and returns an object of type state. In the datasets we use, the typing system disambiguates the position of arguments in a function: For a given function, either all arguments are of the same type or the order of arguments is unimportant—an example of both is the predicate intersection_river in GEOQUERY that takes two arguments of type river, but the result of the execution is unchanged if the arguments are swapped.³

Formally, we define the set of valid semantic programs as the set of programs that can be produced with a semantic grammar G = ⟨E, T, fTYPE, fARGS⟩ where:

• E is the set of predicates and entities, which we will refer to as the set of tags—w.l.o.g. we assume that ROOT ∉ E where ROOT is a special tag used for parsing;

• T is the set of types;

• fTYPE : E → T is a typing function that assigns a type to each tag;

• fARGS : E × T → N is a valency function that assigns the number of expected arguments of a given type to each tag.

A tag e ∈ E is an entity iff ∀t ∈ T : fARGS(e, t) = 0. Otherwise, e is a predicate.

³There are a few corner cases like exclude_river, for which we simply assume arguments are in the same order as they appear in the input sentence.
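To make this definition concrete, the tuple ⟨E, T, fTYPE, fARGS⟩ can be represented directly as data. Below is a minimal Python sketch with a toy GEOQUERY-like fragment; the class and field names are our own and not taken from the paper's released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    tags: frozenset      # E: predicates and entities
    types: frozenset     # T
    f_type: dict         # fTYPE: tag -> type
    f_args: dict         # fARGS: (tag, type) -> number of expected arguments

    def arity(self, tag, typ):
        # fARGS defaults to 0 for unlisted (tag, type) pairs
        return self.f_args.get((tag, typ), 0)

    def is_entity(self, tag):
        # a tag is an entity iff it expects no argument of any type
        return all(self.arity(tag, t) == 0 for t in self.types)

# Toy fragment: capital_2 expects a city and returns a state.
G = Grammar(
    tags=frozenset({"capital_2", "cityid"}),
    types=frozenset({"city", "state"}),
    f_type={"capital_2": "state", "cityid": "city"},
    f_args={("capital_2", "city"): 1},
)
assert G.is_entity("cityid") and not G.is_entity("capital_2")
```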

Figure 2: (top) The semantic program corresponding to the sentence ‘‘What rivers do not run through Tennessee?’’ in the GEOQUERY dataset. (middle) The associated AST. (bottom) Two examples illustrating the intuition of our model: We jointly assign predicates/entities and identify argument dependencies. As such, the resulting structure is strongly related to a syntactic dependency parse, but where the dependency structure does not cover all words.




Figure 3: (a) Example of a sentence and its associated AST (solid arcs) from the GEOQUERY dataset. The dashed edges indicate predicate and entity anchors (note that this information is not available in the dataset). (b) The corresponding generalized valency-constrained not-necessarily-spanning arborescence (red arcs). The root is the isolated top left vertex. Adding ∅ tags and dotted orange arcs produces a generalized spanning arborescence.

A semantic program in a functional language can be equivalently represented as an AST, a graph where instances of predicates and entities are represented as vertices and where arcs identify arguments of predicates. Formally, an AST is a labeled graph G = ⟨V, A, l⟩ where the function l : V → E assigns a tag to each vertex and arcs identify the arguments of tags, see Figure 2. An AST G is well-formed with respect to the grammar G iff G is an arborescence and the valency and type constraints are satisfied, i.e., ∀u ∈ V, t ∈ T:

    fARGS(l(u), t) = |σ⁺_G({u}, t)|,

where:

    σ⁺_G({u}, t) = {u → v ∈ σ⁺_G({u}) s.t. fTYPE(l(v)) = t}.
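As an illustration, the valency and type condition can be checked directly on a dictionary-based AST encoding. The sketch below reuses the hypothetical Grammar class from the previous example and only verifies the valency equation (the arborescence check is omitted).

```python
def satisfies_valency(grammar, labels, children):
    """Check fARGS(l(u), t) == |sigma+_G({u}, t)| for every vertex and type.

    labels:   dict vertex -> tag (the labeling function l)
    children: dict vertex -> list of child vertices (the arcs of the AST)
    """
    for u, tag in labels.items():
        for t in grammar.types:
            # |sigma+_G({u}, t)|: outgoing arcs towards vertices of type t
            n_args = sum(
                1 for v in children.get(u, ())
                if grammar.f_type[labels[v]] == t
            )
            if n_args != grammar.arity(tag, t):
                return False
    return True

# AST of capital_2(cityid): vertex 0 labeled capital_2, vertex 1 labeled cityid
assert satisfies_valency(G, {0: "capital_2", 1: "cityid"}, {0: [1]})
```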

2.2 Problem Reduction and Complexity

In our setting, semantic parsing is a joint sentence tagging and dependency parsing problem (Bohnet and Nivre, 2012; Le et al., 2011; Corro et al., 2017): each content word (i.e., words that convey a semantic meaning) must be tagged with a predicate or an entity, and dependencies between content words identify arguments of predicates, see Figure 2. However, our semantic parsing setting differs from standard syntactic analysis in two ways: (1) the resulting structure is not-necessarily-spanning; there are words (e.g., function words) that must not be tagged and that do not have any incident dependency—and those words are not known in advance, they must be identified jointly with the rest of the structure; (2) the dependency structure is highly constrained by the typing mechanism, that is, the predicted structure must be a valid AST. Nevertheless, similarly to aforementioned works, our parser is graph-based, that is, for a given input we build a (complete) directed graph and decoding is reduced to computing a constrained subgraph of maximum weight.

Given a sentence w = w₁ . . . wₙ with n words and a grammar G, we construct a clustered labeled graph G = ⟨V, A, π, l̄⟩ as follows. The partition π = {V₀, . . . , Vₙ} contains n + 1 clusters, where V₀ is a root cluster and each cluster Vᵢ, i ≠ 0, is associated to word wᵢ. The root cluster V₀ = {0} contains a single vertex that will be used as the root and every other cluster contains |E| vertices. The extended labeling function l̄ : V → E ∪ {ROOT} assigns a tag in E to each vertex v ∈ V \ {0} and ROOT to vertex 0. Distinct vertices in a cluster Vᵢ cannot have the same label, i.e., ∀u, v ∈ Vᵢ : u ≠ v ⟹ l̄(u) ≠ l̄(v).
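The following sketch shows one way to materialize this clustered graph. The integer vertex numbering and the dense arc set (all arcs between vertices of distinct word clusters, plus arcs leaving the root) are our assumptions, not a description of the authors' implementation.

```python
import itertools

def build_graph(n_words, tags):
    """Clustered labeled graph: V_0 = {0} plus one cluster of |E| vertices per word."""
    clusters = [[0]]
    labels = {0: "ROOT"}
    next_vertex = 1
    for _ in range(n_words):
        cluster = []
        for tag in tags:
            labels[next_vertex] = tag  # distinct vertices of a cluster carry distinct tags
            cluster.append(next_vertex)
            next_vertex += 1
        clusters.append(cluster)

    # arcs from the root to every vertex, and between distinct word clusters
    arcs = [(0, v) for v in range(1, next_vertex)]
    for ci, cj in itertools.permutations(clusters[1:], 2):
        arcs.extend((u, v) for u in ci for v in cj)
    return clusters, labels, arcs

clusters, labels, arcs = build_graph(3, sorted(G.tags))
assert len(clusters) == 4 and all(len(c) == len(G.tags) for c in clusters[1:])
```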

Let B ⊆ A be a subset of arcs. The graph G′ = ⟨V[B], B⟩ defines a 0-rooted generalized valency-constrained not-necessarily-spanning arborescence iff it is a generalized arborescence of G, there is exactly one arc leaving 0, and the sub-arborescence rooted at the destination of that arc is a valid AST with respect to the grammar G. As such, there is a one-to-one correspondence between ASTs anchored on the sentence w and generalized valency-constrained not-necessarily-spanning arborescences in the graph G, see Figure 3b.

For any sentence w, our aim is to find the AST that most likely corresponds to it. Thus, after building the graph G as explained above, the neural network described in Appendix B is used to produce a vector of weights μ ∈ R^|V| associated to the set of vertices V and a vector of weights φ ∈ R^|A| associated to the set of arcs A. Given these weights, graph-based semantic parsing is reduced to an optimization problem called the maximum generalized valency-constrained not-necessarily-spanning arborescence (MGVCNNSA) in the graph G.




Theorem 1. The MGVCNNSA problem is NP-hard.

The proof is in Appendix A.

2.3 Mathematical Program

Our graph-based approach to semantic parsing has allowed us to prove the intrinsic hardness of the problem. We follow previous work on graph-based parsing (Martins et al., 2009; Koo et al., 2010), and other topics, by proposing an integer linear programming (ILP) formulation in order to compute (approximate) solutions.

Remember that in the joint tagging and dependency parsing interpretation of the semantic parsing problem, the resulting structure is not-necessarily-spanning, meaning that some words may not be tagged. In order to rely on well-known algorithms for computing spanning arborescences as a subroutine of our approximate solver, we first introduce the notion of extended graph. Given a graph G = ⟨V, A, π, l̄⟩, we construct an extended graph Ḡ = ⟨V̄, Ā, π̄, l̄⟩⁴ containing n additional vertices {1̄, . . . , n̄} that are distributed along clusters, i.e., π̄ = {V₀, V₁ ∪ {1̄}, . . . , Vₙ ∪ {n̄}}, and arcs from the root to these extra vertices, i.e., Ā = A ∪ {0 → ī | 1 ≤ i ≤ n}. Let B ⊆ A be a subset of arcs such that ⟨V[B], B⟩ is a generalized not-necessarily-spanning arborescence on G. Let B̄ ⊆ Ā be a subset of arcs defined as B̄ = B ∪ {0 → ī | σ⁻_⟨V[B],B⟩(Vᵢ) = ∅}. Then, there is a one-to-one correspondence between generalized not-necessarily-spanning arborescences ⟨V[B], B⟩ and generalized spanning arborescences ⟨V̄[B̄], B̄⟩, see Figure 3b.

⁴The labeling function is unchanged as there is no need for types for vertices in V̄ \ V.

Let x ∈ {0, 1}^|V̄| and y ∈ {0, 1}^|Ā| be variable vectors indexed by vertices and arcs such that a vertex v ∈ V̄ (resp., an arc a ∈ Ā) is selected iff x_v = 1 (resp., y_a = 1). The set of 0-rooted generalized valency-constrained spanning arborescences on Ḡ can be written as the set of variables ⟨x, y⟩ satisfying the following linear constraints. First, we restrict y to structures that are spanning arborescences over Ḡ where clusters have been contracted:

    Σ_{a ∈ σ⁻_Ḡ(V₀)} y_a = 0,    (1)

    Σ_{a ∈ σ⁻_Ḡ(∪_{Ū ∈ π′} Ū)} y_a ≥ 1    ∀π′ ⊆ π̄ \ {V₀},    (2)

    Σ_{a ∈ σ⁻_Ḡ(Vᵢ)} y_a = 1    ∀Vᵢ ∈ π \ {V₀}.    (3)



Constraints (2) ensure that the contracted graph is weakly connected. Constraints (3) force each cluster to have exactly one incoming arc. The set of vectors y that satisfy these three constraints is exactly the set of 0-rooted spanning arborescences on the contracted graph; see Schrijver (2003, Section 52.4) for an in-depth analysis of this polytope. The root vertex is always selected and other vertices are selected iff they have one incoming selected arc:

    x₀ = 1,    (4)

    x_u = Σ_{a ∈ σ⁻_Ḡ({u})} y_a    ∀u ∈ V̄ \ {0}.    (5)

Note that constraints (1)–(3) do not force selected arcs to leave from a selected vertex as they operate at the cluster level. This property will be enforced via the following valency constraints:

    Σ_{u ∈ V \ {0}} y_{0→u} = 1,    (6)

    Σ_{a ∈ σ⁺_G({u}, t)} y_a = x_u fARGS(l(u), t)    ∀t ∈ T, u ∈ V \ {0}.    (7)

Constraint (6) forces the root to have exactly one outgoing arc into a vertex u ∈ V \ {0} (i.e., a vertex that is not part of the extra vertices introduced in the extended graph) that will be the root of the AST. Constraints (7) force the selected vertices and arcs to produce a well-formed AST with respect to the grammar G. Note that these constraints are only defined for vertices in V \ {0}, i.e., they are neither defined for the root vertex nor for the extra vertices introduced in the extended graph.



To simplify notation, we introduce the following sets:

    C(sa) = {⟨x, y⟩ ∈ {0, 1}^|V̄| × {0, 1}^|Ā| s.t. x and y satisfy (1)–(5)},

    C(valid) = {⟨x, y⟩ ∈ {0, 1}^|V̄| × {0, 1}^|Ā| s.t. x and y satisfy (6)–(7)},

and C = C(sa) ∩ C(valid). Given vertex weights μ ∈ R^|V̄| and arc weights φ ∈ R^|Ā|, computing the MGVCNNSA is equivalent to solving the following ILP:

    (ILP1)    max_{x,y}    μ⊤x + φ⊤y
              s.t.         ⟨x, y⟩ ∈ C(sa) and ⟨x, y⟩ ∈ C(valid).

Without the constraint ⟨x, y⟩ ∈ C(valid), the problem would be easy to solve. The set C(sa) is the set of spanning arborescences over the contracted graph, hence to maximize over this set we can simply: (1) contract the graph and assign to each arc in the contracted graph the weight of its corresponding arc plus the weight of its destination vertex in the original graph; (2) run the maximum spanning arborescence algorithm (MSA; Edmonds, 1967; Tarjan, 1977) on the contracted graph, which has a O(n²) time-complexity. This process is illustrated on Figure 5 (top). Note that the contracted graph may have parallel arcs, which is not an issue in practice as only the one of maximum weight can appear in a solution of the MSA.
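This two-step procedure can be reproduced with networkx's implementation of Edmonds' algorithm. In the sketch below, folding the destination-vertex weight into each arc and keeping only the heaviest parallel arc implement the two remarks above; the dictionary-based weight encoding (including weights for extra vertices and root arcs) is an assumption.

```python
import networkx as nx

def lmo_spanning_arborescence(clusters, arcs, mu, phi):
    """Maximize a linear objective over C(sa), i.e., constraints (1)-(5) only.

    mu: dict vertex -> weight; phi: dict arc -> weight (extended graph).
    Returns the arcs of the maximum spanning arborescence of the
    contracted graph, mapped back to arcs of the extended graph.
    """
    cluster_of = {u: i for i, c in enumerate(clusters) for u in c}
    best = {}  # (cluster_i, cluster_j) -> (weight, original arc)
    for (u, v) in arcs:
        w = phi[(u, v)] + mu[v]  # fold the destination vertex weight into the arc
        key = (cluster_of[u], cluster_of[v])
        if key not in best or w > best[key][0]:
            best[key] = (w, (u, v))  # only the heaviest parallel arc can be used

    H = nx.DiGraph()
    H.add_nodes_from(range(len(clusters)))
    for (ci, cj), (w, arc) in best.items():
        H.add_edge(ci, cj, weight=w, original=arc)
    msa = nx.maximum_spanning_arborescence(H, attr="weight", preserve_attrs=True)
    return {d["original"] for _, _, d in msa.edges(data=True)}
```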

We have established that MAP inference in our semantic parsing framework is an NP-hard problem. We proposed an ILP formulation of the problem that would be easy to solve if some constraints were removed. This property suggests the use of an approximation algorithm that introduces the difficult constraints as penalties. As a similar setting arises from our weakly supervised loss function, the presentation of the approximation algorithm is deferred until Section 4.

3 Training Objective Functions

3.1 Supervised Training Objective
We define the likelihood of a pair ⟨x, y⟩ ∈ C via the Boltzmann distribution:

    p_{μ,φ}(x, y) = exp(μ⊤x + φ⊤y − c(μ, φ)),

where c(μ, φ) is the log-partition function:

    c(μ, φ) = log Σ_{⟨x′,y′⟩ ∈ C} exp(μ⊤x′ + φ⊤y′).

During training, we aim to maximize the log-likelihood of the training dataset. The log-likelihood of an observation ⟨x, y⟩ is defined as:

    ℓ(μ, φ; x, y) = log p_{μ,φ}(x, y) = μ⊤x + φ⊤y − c(μ, φ).

Unfortunately, computing the log-partition function is intractable as it requires summing over all feasible solutions. Instead, we rely on a surrogate lower-bound as an objective function. To this end, we derive an upper bound (because c is negated in ℓ) to the second term: a sum of log-sum-exp functions that sums over each cluster of vertices independently and over incoming arcs in each cluster independently, which is tractable. This loss can be understood as a generalization of the head selection loss used in dependency parsing (Zhang et al., 2017). We now detail the derivation and prove that it is an upper bound to the log-partition function.

Let U be a matrix such that each row contains a pair ⟨x, y⟩ ∈ C and Δ^|C| be the simplex of dimension |C| − 1, i.e., the set of all stochastic vectors of dimension |C|. The log-partition function can then be rewritten using its variational formulation:

    c(μ, φ) = max_{p ∈ Δ^|C|} p⊤U [μ; φ] + H[p],

where H[p] = −Σᵢ pᵢ log pᵢ is the Shannon entropy. We refer the reader to Boyd and Vandenberghe (2004, Example 3.25), Wainwright and Jordan (2008, Section 3.6), and Beck (2017, Section 4.4.10). Note that this formulation remains impractical as p has an exponential size. Let M = conv(C) be the marginal polytope, i.e., the convex hull of the feasible integer solutions; we can rewrite the above variational formulation as:

    c(μ, φ) = max_{m ∈ M} m⊤ [μ; φ] + H_M[m],


,Fi(X, y) = exp(μ(cid:17)X + Fi(cid:17)y − c(μ, Fi)),

where HM is a joint entropy function defined
such that the equality holds. The maximization


in this reformulation acts on the marginal probabilities of parts (vertices and arcs) and has therefore a polynomial number of variables. We refer the reader to Wainwright and Jordan (2008, Section 5.2.1) and Blondel et al. (2020, Section 7) for more details. Unfortunately, this optimization problem is hard to solve as M cannot be characterized in an explicit manner and H_M is defined indirectly and lacks a polynomial closed form (Wainwright and Jordan, 2008, Section 3.7). However, we can derive an upper bound to the log-partition function by decomposing the entropy term H_M (Cover, 1999, Property 4 on page 41, i.e., H is an upper bound) and by using an outer approximation to the marginal polytope L ⊇ M (i.e., increasing the search space):

    c(μ, φ) ≤ max_{m ∈ L} m⊤ [μ; φ] + H[m].

In particular, we observe that each pair ⟨x, y⟩ ∈ C has exactly one vertex selected per cluster Vᵢ ∈ π and one incoming arc selected per cluster Vᵢ ∈ π \ {V₀}. We denote C(one) the set of all the pairs ⟨x, y⟩ that satisfy these constraints. By using L = conv(C(one)) as an outer approximation to the marginal polytope (see Figure 4), the optimization problem can be rewritten as a sum of independent problems. As each of these problems is the variational formulation of a log-sum-exp term, the upper bound on c(μ, φ) can be expressed as a sum of log-sum-exp functions, one over vertices in each cluster Vᵢ ∈ π \ {V₀} and one over incoming arcs σ⁻_Ḡ(Vᵢ) for each cluster Vᵢ ∈ π \ {V₀}. Although this type of approximation may not result in a Bayes consistent loss (Corro, 2023), it works well in practice.
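The resulting surrogate loss is straightforward to implement with standard tools. Below is a minimal PyTorch sketch assuming vertex scores and incoming-arc scores have already been gathered into one tensor per cluster; the tensor layout and names are assumptions.

```python
import torch

def surrogate_nll(vertex_scores, arc_scores, gold_vertex, gold_arc):
    """Head-selection style upper bound on the negative log-likelihood.

    vertex_scores: list of 1D tensors, scores of the vertices of each cluster V_i
    arc_scores:    list of 1D tensors, scores of the incoming arcs of each cluster
    gold_vertex:   gold vertex index within each cluster
    gold_arc:      gold incoming-arc index for each cluster
    """
    loss = 0.0
    for scores, gold in zip(vertex_scores, gold_vertex):
        # one log-sum-exp over the vertices of each cluster
        loss = loss - scores[gold] + torch.logsumexp(scores, dim=0)
    for scores, gold in zip(arc_scores, gold_arc):
        # and one log-sum-exp over the incoming arcs of each cluster
        loss = loss - scores[gold] + torch.logsumexp(scores, dim=0)
    return loss
```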

Figure 4: Polyhedron illustration. The solid lines represent the convex hull of feasible solutions of ILP1, denoted M, whose vertices are feasible integer solutions (black vertices). The dashed lines represent the convex hull of feasible solutions of the linear relaxation of ILP1, which has non-integral vertices (in white). Finally, the dotted lines represent the polyhedron L that is used to approximate c(μ, φ). All its vertices are integral, but some of them are not feasible solutions of ILP1.

3.2 Weakly Supervised Training Objective

Unfortunately, training data often does not include gold pairs ⟨x, y⟩ but instead only the AST, without word anchors (or word alignment). This is the case for the three datasets we use in our experiments. We thus consider our training signal to be the set of all structures that induce the annotated AST, which we denote C*.

The weakly supervised loss is defined as:

    ℓ̃(μ, φ; C*) = log Σ_{⟨x,y⟩ ∈ C*} p_{μ,φ}(x, y),

i.e., we marginalize over all the structures that induce the gold AST. We can rewrite this loss as:

    ℓ̃(μ, φ; C*) = (log Σ_{⟨x,y⟩ ∈ C*} exp(μ⊤x + φ⊤y)) − c(μ, φ).

The two terms are intractable. We approximate the second term using the bound defined in Section 3.1.

We now derive a tractable lower bound to the first term. Let q be a proposal distribution such that q(x, y) = 0 if ⟨x, y⟩ ∉ C*. We derive the following lower bound via Jensen's inequality:

    log Σ_{⟨x,y⟩ ∈ C*} exp(μ⊤x + φ⊤y)
        = log E_q[exp(μ⊤x + φ⊤y) / q(x, y)]
        ≥ E_q[μ⊤x + φ⊤y] + H[q].


This bound holds for any distribution q satisfying the aforementioned condition. We choose to maximize this lower bound using a distribution that gives a probability of one to a single structure, as in ‘‘hard’’ EM (Neal and Hinton, 1998, Section 6).

For a given sentence w, let G = ⟨V, A, π, l̄⟩ be a graph defined as in Section 2.2 and G′ = ⟨V′, A′, l′⟩ be an AST defined as in Section 2.1. We aim to find the GVCNNSA in G of maximum weight whose induced AST is exactly G′. This is equivalent to aligning each vertex in V′ with one vertex of V \ {0} s.t. there is at most



one vertex per cluster of π appearing in the alignment and where the weight of an alignment is defined as follows:

1. for each vertex u′ ∈ V′, we add the weight of the vertex u ∈ V it is aligned to—moreover, if u′ is the root of the AST we also add the weight of the arc 0 → u;

2. for each arc u′ → v′ ∈ A′, we add the weight of the arc u → v where u ∈ V (resp. v ∈ V) is the vertex u′ (resp. v′) is aligned with.

Note that this latent anchoring inference consists in computing a (partial) alignment between vertices of G and G′, but the fact that we need to take into account arc weights forbids the use of the Kuhn–Munkres algorithm (Kuhn, 1955).

Theorem 2. Computing the anchoring of maximum weight of an AST with a graph G is NP-hard.

The proof is in Appendix A.

Therefore, we propose an optimization-based approach to compute the distribution q. Note that the problem has a constraint requiring each cluster Vᵢ ∈ π to be aligned with at most one vertex v′ ∈ V′, i.e., each word in the sentence can be aligned with at most one vertex in the AST. If we remove this constraint, then the problem becomes tractable via dynamic programming. Indeed, we can recursively construct a table CHART[u′, u], u′ ∈ V′ and u ∈ V, containing the score of aligning vertex u′ to vertex u plus the score of the best alignment of all the descendants of u′. To this end, we simply visit the vertices V′ of the AST in reverse topological order, see Algorithm 1. The best alignment can be retrieved via back-pointers.

Computing q is therefore equivalent to solving the following ILP:

    (ILP2)    max_{x,y}    μ⊤x + φ⊤y
              s.t.         ⟨x, y⟩ ∈ C*(relaxed),
                           Σ_{u ∈ Vᵢ} x_u ≤ 1    ∀Vᵢ ∈ π,

where the set C*(relaxed) is the set of feasible solutions of the dynamic program in Algorithm 1, whose convex hull can be described via linear constraints (Martin et al., 1990).

4 Efficient Inference

In this section, we propose an efficient way to solve the linear relaxations of MAP inference (ILP1) and latent anchoring inference (ILP2) via constraint smoothing and the conditional gradient method. We focus on problems of the following form:

    max_z    f(z)
    s.t.     z ∈ conv(C(easy)),
             Az = b or Az ≤ b,

where the vector z is the concatenation of the vectors x and y defined previously and conv denotes the convex hull of a set. We explained previously that if the set of constraints of form Az = b for (ILP1) or Az ≤ b for (ILP2) were absent, the problem would be easy to solve under a linear objective function. Indeed, there exists an efficient linear maximization oracle (LMO), i.e., a function that returns the optimal integral solution, for the set conv(C(easy)). This setting covers both (ILP1) and (ILP2), where we have C(easy) = C(sa) and C(easy) = C*(relaxed), respectively.

An appealing approach in this setting is to introduce the problematic constraints as penalties in the objective:

    max_z    f(z) − δ_S(Az)
    s.t.     z ∈ conv(C(easy)),

where δ_S is the indicator function of the set S:

    δ_S(Az) = 0 if Az ∈ S, +∞ otherwise.

In the equality case, we use S = {b} and in the inequality case, we use S = {u | u ≤ b}.

4.1 Conditional Gradient Method

Given a proper, smooth, and differentiable function g and a nonempty, bounded, closed, and convex set conv(C(easy)), the conditional gradient method (a.k.a. Frank–Wolfe; Frank and Wolfe, 1956; Levitin and Polyak, 1966; Lacoste-Julien and Jaggi, 2015) can be used to solve optimization problems of the following form:

    max_z    g(z)
    s.t.     z ∈ conv(C(easy)).



Algorithm 1 Unconstrained alignment of maximum weight between a graph G and an AST G′

function DPALIGNMENT(G, G′)
    for u′ ∈ V′ in reverse topological order do
        for u ∈ {v ∈ V | l̄(v) = l′(u′)} do    ▷ We can only map u′ to vertices u if they have the same tag.
            CHART[u′, u] ← μ_u + Σ_{u′→v′ ∈ σ⁺_G′({u′})} max_{v ∈ V} (CHART[v′, v] + φ_{u→v})
    return max_{u ∈ V} (CHART[r′, u] + φ_{0→u})    ▷ Where r′ ∈ V′ is the root of the AST.
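For concreteness, here is a Python rendering of this dynamic program under the dictionary encodings used in the earlier sketches; the back-pointer bookkeeping needed to recover the alignment itself is omitted for brevity.

```python
def dp_alignment(vertices, labels, mu, phi, ast_labels, ast_children, ast_order, ast_root):
    """Weight of the best unconstrained alignment between G and an AST G'.

    ast_order must list the AST vertices in reverse topological order,
    i.e., children before their parent.
    """
    chart = {}
    for u_ast in ast_order:
        for u in vertices:
            if labels[u] != ast_labels[u_ast]:
                continue  # u' can only be mapped to vertices carrying the same tag
            score = mu[u]
            for v_ast in ast_children.get(u_ast, ()):
                # best alignment of each child sub-AST, paying the arc weight
                score += max(
                    (chart[(v_ast, v)] + phi[(u, v)]
                     for v in vertices
                     if (v_ast, v) in chart and (u, v) in phi),
                    default=float("-inf"),
                )
            chart[(u_ast, u)] = score
    # the AST root additionally pays for the arc leaving the root vertex 0
    return max(chart[(ast_root, u)] + phi[(0, u)]
               for u in vertices if (ast_root, u) in chart and (0, u) in phi)
```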

Algorithm 2 Conditional gradient

function CONDITIONALGRADIENT(g, conv(C(easy)))
    Let z(0) ∈ conv(C(easy))
    for k ∈ {0, . . . , K} do    ▷ Where K is the maximum number of iterations
        d ← lmo_conv(C(easy))(∇g(z(k))) − z(k)    ▷ Compute the update direction
        if ∇g(z(k))⊤d ≤ ε then return z(k)    ▷ If the dual gap is small, z(k) is (almost) optimal
        γ ∈ arg max_{γ ∈ [0,1]} g(z(k) + γd)    ▷ Compute or approximate the optimal stepsize
        z(k+1) ← z(k) + γd    ▷ Update the current point
    return z(k)

Contrary to the projected gradient method, this approach does not require computing projections onto the feasible set conv(C(easy)), which is, in most cases, computationally expensive. Instead, the conditional gradient method only relies on a LMO:

    lmo_C(easy)(ψ) ∈ arg max_{z ∈ conv(C(easy))} ψ⊤z.

The algorithm constructs a solution to the original problem as a convex combination of elements returned by the LMO. The pseudo-code is given in Algorithm 2. An interesting property of this method is that its step size range is bounded, which allows the use of simple linesearch techniques.
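In Python, the loop is only a few lines. Here the lmo argument would be instantiated with, e.g., the MSA-based oracle sketched in Section 2.3, and grad with the gradient of the smoothed objective of Section 4.2; the dense numpy representation is an assumption.

```python
import numpy as np

def conditional_gradient(z0, grad, lmo, linesearch, max_iter=100, eps=1e-6):
    """Frank-Wolfe loop: maximize a smooth g over conv(C(easy))."""
    z = np.asarray(z0, dtype=float)
    for _ in range(max_iter):
        g = grad(z)
        s = lmo(g)                 # best integral point for the linearized objective
        d = s - z                  # update direction
        if g @ d <= eps:           # small dual gap: z is (almost) optimal
            return z
        gamma = linesearch(z, d)   # (approximately) optimal step size in [0, 1]
        z = z + gamma * d
    return z
```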

4.2 Smoothing

Unfortunately, the function g(z) = f(z) − δ_S(Az) is non-smooth due to the indicator function term, preventing the use of the conditional gradient method. We propose to rely on the framework proposed by Yurtsever et al. (2018), where the indicator function is replaced by a smooth approximation. The indicator function of the set S can be rewritten as:

    δ_S(Az) = δ**_S(Az) = sup_u u⊤Az − σ_S(u),

where δ**_S denotes the Fenchel biconjugate of the indicator function and σ_S(u) = sup_{t ∈ S} u⊤t is the support function of S. More details can be found in Beck (2017, Sections 4.1 and 4.2). In order to smooth the indicator function, we add a β-parameterized convex regularizer −(β/2)‖·‖²₂ to its Fenchel biconjugate:

    δ**_{S,β}(Az) = max_u u⊤Az − σ_S(u) − (β/2)‖u‖²₂,

where β > 0 controls the quality and the smoothness of the approximation (Nesterov, 2005).

Equalities. In the case where S = {b}, with a few computations that are detailed by Yurtsever et al. (2018), we obtain:

    δ**_{{b},β}(Az) = (1/(2β)) ‖Az − b‖²₂.

That is, we have a quadratic penalty term in the objective. Note that this term is similar to the term introduced in an augmented Lagrangian (Nocedal and Wright, 1999, Equation 17.36), and adds a penalty in the objective for vectors z s.t. Az ≠ b.

Inequalities. In the case where S = {u | u ≤ b}, similar computations lead to:

    δ**_{≤b,β}(Az) = (1/(2β)) ‖[Az − b]₊‖²₂,

where [·]₊ denotes the Euclidean projection onto the non-negative orthant (i.e., clipping negative values). Similarly to the equality case, this term introduces a penalty in the objective for vectors z s.t. Az > b. This penalty function is also called the Courant-Beltrami penalty function.


Figure 5: Illustration of the approximate inference algorithm on the two-word sentence ‘‘List states’’, where we assume the grammar has one entity state_all and one predicate loc_1 that takes exactly one entity as argument. The left graph is the extended graph for the sentence, including vertex and arc weights (in black). If we ignore constraints (6)–(7), inference is reduced to computing the MSA on the contracted graph (solid arcs in the middle column). This may lead to solutions that do not satisfy constraints (6)–(7) on the expanded graph (top example). However, the gradient of the smoothed constraint (7) will induce penalties (in red) to vertex and arc scores that will encourage the loc_1 predicate to either be dropped from the solution or to have an outgoing arc to a state_all argument. Computing the MSA on the contracted graph with penalties results in a solution that satisfies constraints (6)–(7) (bottom example).


Figure 5 (bottom) illustrates how the gradient of the penalty term can ‘‘force’’ the LMO to return solutions that satisfy the smoothed constraints.
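Both penalties and their gradients are cheap to compute, which is all the conditional gradient loop needs; the gradient of the smoothed objective is then the gradient of f minus the penalty gradient below. A sketch, with our own naming:

```python
import numpy as np

def eq_penalty(A, z, b, beta):
    """Value and gradient of ||Az - b||^2 / (2*beta), for smoothed equalities."""
    r = A @ z - b
    return (r @ r) / (2 * beta), A.T @ r / beta

def ineq_penalty(A, z, b, beta):
    """Value and gradient of ||[Az - b]_+||^2 / (2*beta), for smoothed inequalities."""
    r = np.maximum(A @ z - b, 0.0)  # Courant-Beltrami: only violations are penalized
    return (r @ r) / (2 * beta), A.T @ r / beta
```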

4.3 Practical Details

Smoothness. In practice, we need to choose the smoothness parameter β. We follow Yurtsever et al. (2018) and use β(k) = β(0)/√(k + 1), where k is the iteration number and β(0) = 1.

Step Size. Another important choice in the algorithm is the step size γ. We show that when the smoothed constraints are equalities, computing the optimal step size has a simple closed form solution if the function f is linear, which is the case for (ILP1), i.e., MAP decoding. The step size problem formulation at iteration k is defined as:

    arg max_{γ ∈ [0,1]}    f(z(k) + γd) − (1/(2β)) ‖A(z(k) + γd) − b‖²₂.

By assumption, f is linear and can be written as f(z) = θ⊤z. Ignoring the box constraints on γ, by first order optimality conditions, we have:

    γ = (β θ⊤d + (Ad)⊤b − (Ad)⊤(Az(k))) / ‖Ad‖²₂.

We can then simply clip the result so that it satisfies the box constraints. Unfortunately, in the inequalities case, there is no simple closed form solution. We approximate the step size using 10 iterations of the bisection algorithm for root finding.
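Both step size rules are short to implement. The closed form below follows from the first-order optimality condition stated above; the bisection variant assumes the one-dimensional objective is concave in γ, so that its derivative is decreasing on [0, 1].

```python
import numpy as np

def step_equality(theta, A, b, z, d, beta):
    """Closed-form step size for a linear f and the smoothed equality penalty."""
    Ad = A @ d
    gamma = (beta * (theta @ d) + Ad @ b - Ad @ (A @ z)) / (Ad @ Ad)
    return float(np.clip(gamma, 0.0, 1.0))  # clip to the box constraints

def step_bisection(dg, iters=10):
    """Bisection on the derivative dg(gamma) of the objective over [0, 1]."""
    if dg(1.0) >= 0:  # objective still increasing at 1: take the full step
        return 1.0
    if dg(0.0) <= 0:  # objective decreasing at 0: stay put
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if dg(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```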

Non-integral Solutions. As we solve the linear relaxation of the original ILPs, the optimal solutions may not be integral (Figure 4). Therefore, we use simple heuristics to construct a feasible solution to the original ILP in these cases. For MAP inference, we simply solve the ILP⁵ using CPLEX but introducing only variables that have a non-null value in the linear relaxation, leading to a very sparse problem which is fast to solve. For latent anchoring, we simply use the Kuhn–Munkres algorithm using the non-integral solution as assignment costs.

⁵We use the multi-commodity flow formulation of Martins et al. (2009) instead of the cycle breaking constraints (2).
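The latent anchoring rounding can be done with SciPy's implementation of the Kuhn–Munkres algorithm; the fractional alignment matrix passed as input is an assumed encoding of the relaxed solution.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def round_alignment(q_frac):
    """Round a fractional AST-vertex/word alignment to a hard assignment.

    q_frac: array of shape (n_ast_vertices, n_words), fractional scores.
    Returns (ast_vertex, word) pairs with each word used at most once.
    """
    rows, cols = linear_sum_assignment(q_frac, maximize=True)
    return list(zip(rows.tolist(), cols.tolist()))
```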

5 Experiments

We compare our method to baseline systems both on i.i.d. splits (IID) and on splits that test for compositional generalization, for three datasets. The neural network is described in Appendix B.

Datasets. SCAN (Lake and Baroni, 2018) contains natural language navigation commands. We use the variant of Herzig and Berant (2021) for semantic parsing. The IID split is the simple split (Lake and Baroni, 2018). The compositional splits are primitive right (RIGHT) and primitive around right (ARIGHT) (Loula et al., 2018).

GEOQUERY (Zelle and Mooney, 1996) uses the FunQL formalism (Kate et al., 2005) and contains questions about the US geography. The IID split is the standard split and compositional generalization is evaluated on two splits: LENGTH, where the examples are split by program length, and TEMPLATE (Finegan-Dollak et al., 2018a), where they are split such that all semantic programs having the same AST are in the same split.

CLEVR (Johnson et al., 2017) contains synthetic questions over object relations in images. CLOSURE (Bahdanau et al., 2019) introduces additional question templates that require compositional generalization. We use the original split as our IID split and the CLOSURE split as a compositional split where the model is evaluated on CLOSURE.

Baselines. We compare our approach against the architecture proposed by Herzig and Berant (2021) (SPANBASEDSP) as well as the seq2seq baselines they used. In SEQ2SEQ (Jia and Liang, 2016), the encoder is a bi-LSTM over pre-trained GloVe embeddings (Pennington et al., 2014) or ELMO (Peters et al., 2018) and the decoder is an attention-based LSTM (Bahdanau et al., 2015). BERT2SEQ replaces the encoder with BERT-base. GRAMMAR is similar to SEQ2SEQ but the decoding is constrained by a grammar. BART (Lewis et al., 2020) is pre-trained as a denoising autoencoder.

Results. We report the denotation accuracies in Table 1. Our approach outperforms all other methods. In particular, the seq2seq baselines suffer from a significant drop in accuracy on splits that require compositional generalization. While SPANBASEDSP is able to generalize, our approach outperforms it. Note that we observed that the GEOQUERY execution script used to compute denotation accuracy in previous work contains several bugs that overestimate the true accuracy. Therefore, we also report denotation accuracy with a corrected executor (see Appendix C) for fair comparison with future work.

We also report exact match accuracy, with and without the heuristic to construct integral solutions from fractional ones. The exact match accuracy is always lower or equal to the denotation accuracy. This shows that our approach can sometimes provide the correct denotation even though the prediction is different from the gold semantic program. Importantly, while our approach outperforms baselines, its accuracy is still significantly worse on the split that requires generalizing to longer programs.

6 Related Work

Graph-based Methods. Graph-based methods have been popularized by syntactic dependency parsing (McDonald et al., 2005), where MAP inference is realized via the maximum spanning arborescence algorithm (Chu and Liu, 1965; Edmonds, 1967). A benefit of this algorithm is that it has a O(n²) time-complexity (Tarjan, 1977), i.e., it is more efficient than algorithms exploring more restricted search spaces (Eisner, 1997; Gómez-Rodríguez et al., 2011; Pitler et al., 2012, 2013).

In the case of semantic structures, Kuhlmann and Jonsson (2015) proposed a O(n³) algorithm for the maximum non-necessarily-spanning acyclic graphs with a noncrossing arc constraint. Without the noncrossing constraint, the problem is known to be NP-hard (Grötschel et al., 1985). To bypass this computational complexity, Dozat and Manning (2018) proposed to handle each dependency as an independent binary classification problem, that is, they do not enforce any constraint on the output structure. Note that, contrary to our work, these approaches allow for reentrancy but do not enforce well-formedness of the output with respect to the semantic grammar. Lyu and Titov (2018) use a similar approach for AMR parsing


| | SCAN IID | SCAN RIGHT | SCAN ARIGHT | GEOQUERY IID | GEOQUERY TEMPLATE | GEOQUERY LENGTH | CLEVR IID | CLEVR CLOSURE |
|---|---|---|---|---|---|---|---|---|
| Baselines (denotation accuracy only) | | | | | | | | |
| SEQ2SEQ | 99.9 | 11.6 | 0.0 | 78.5 | 46.0 | 24.3 | 100 | 59.5 |
| + ELMO | 100 | 54.9 | 41.6 | 79.3 | 50.0 | 25.7 | 100 | 64.2 |
| BERT2SEQ | 100 | 77.7 | 95.3 | 81.1 | 49.6 | 26.1 | 100 | 56.4 |
| GRAMMAR | 100 | 0.0 | 4.2 | 72.1 | 54.0 | 24.6 | 100 | 51.3 |
| BART | 100 | 50.5 | 100 | 87.1 | 67.0 | 19.3 | 100 | 51.5 |
| SPANBASEDSP | 100 | 100 | 100 | 86.1 | 82.2 | 63.6 | 96.7 | 98.8 |
| Our approach | | | | | | | | |
| Denotation accuracy | 100 | 100 | 100 | 92.9 | 89.9 | 74.9 | 100 | 99.6 |
| → Corrected executor | – | – | – | 91.8 | 88.7 | 74.5 | – | – |
| Exact match | 100 | 100 | 100 | 90.7 | 86.2 | 69.3 | 100 | 99.6 |
| → w/o CPLEX heuristic | 100 | 100 | 100 | 90.0 | 83.0 | 67.5 | 100 | 98.0 |

Table 1: Denotation and exact match accuracy on the test sets. All the baseline results were taken from Herzig and Berant (2021). For our approach, we also report exact match accuracy, i.e., the percentage of sentences for which the prediction is identical to the gold program. The last line reports the exact match accuracy without the use of CPLEX to round non-integral solutions (Section 4.3).

where tags are predicted first, followed by arc predictions, and finally heuristics are used to ensure the output graph is valid. On the contrary, we do not use a pipeline and we focus on joint decoding where validity of the output is directly encoded in the search space.

Previous work in the literature has also considered reduction to graph-based methods for other problems, e.g., for discontinuous constituency parsing (Fernández-González and Martins, 2015; Corro et al., 2017), lexical segmentation (Constant and Le Roux, 2015), and machine translation (Zaslavskiy et al., 2009), inter alia.

Compositional Generalization. Several authors observed that compositional generalization insufficiency is an important source of error for semantic parsers, especially ones based on seq2seq architectures (Lake and Baroni, 2018; Finegan-Dollak et al., 2018b; Herzig and Berant, 2019; Keysers et al., 2020). Wang et al. (2021) proposed a latent re-ordering step to improve compositional generalization, whereas Zheng and Lapata (2021) relied on latent predicate tagging in the encoder. There has also been an interest in using data augmentation methods to improve generalization (Jia and Liang, 2016; Andreas, 2020; Akyürek et al., 2021; Qiu et al., 2022; Yang et al., 2022).

Recently, Herzig and Berant (2021) showed that span-based parsers do not exhibit such problematic behavior. Unfortunately, these parsers fail to cover the set of semantic structures observed in English treebanks, and we hypothesize that this would be even worse for free word order languages. Our graph-based approach does not exhibit this downside. Previous work by Jambor and Bahdanau (2022) also considered graph-based methods for compositional generalization, but their approach predicts each part independently without any well-formedness or acyclicity constraint.

7 Conclusion

In this work, we focused on graph-based semantic parsing for formalisms that do not allow reentrancy. We conducted a complexity study of two inference problems that appear in this setting. We proposed ILP formulations of these problems together with a solver for their linear relaxation based on the conditional gradient method. Experimentally, our approach outperforms comparable baselines.

One downside of our semantic parser is speed (we parse approximately 5 sentences per second for GEOQUERY). However, we hope this work will give a better understanding of the semantic parsing problem, together with a baseline for faster methods.


Future research will investigate extensions for (1) ASTs that contain reentrancies and (2) prediction algorithms for the case where a single word can be the anchor of more than one predicate or entity. These two properties are crucial for semantic representations like Abstract Meaning Representation (Banarescu et al., 2013). Moreover, even if our graph-based semantic parser provides better results than previous work on length generalization, this setting is still difficult. A more general research direction on neural architectures that generalize better to longer sentences is important.

Acknowledgments

We thank François Yvon and the anonymous reviewers for their comments and suggestions. We thank Jonathan Herzig and Jonathan Berant for fruitful discussions. This work benefited from computations done on the Saclay-IA platform and on the HPC resources of IDRIS under the allocation 2022-AD011013727 made by GENCI.

References

Ekin Akyürek, Afra Feyza Akyürek, and Jacob Andreas. 2021. Learning to recombine and resample data for compositional generalization. In International Conference on Learning Representations.

Jacob Andreas. 2020. Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7556–7566, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.676

Jacob Andreas, Andreas Vlachos, and Stephen Clark. 2013. Semantic parsing as machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 47–52, Sofia, Bulgaria. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings. https://doi.org/10.48550/arXiv.1409.0473

Dzmitry Bahdanau, Harm de Vries, Timothy J. O'Donnell, Shikhar Murty, Philippe Beaudoin, Yoshua Bengio, and Aaron C. Courville. 2019. CLOSURE: Assessing systematic generalization of CLEVR models. CoRR, abs/1912.05783. https://doi.org/10.48550/arXiv.1912.05783

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria. Association for Computational Linguistics.

Amir Beck. 2017. First-Order Methods in Optimization. SIAM. https://doi.org/10.1137/1.9781611974997

Mathieu Blondel, André F. T. Martins, and Vlad Niculae. 2020. Learning with Fenchel-Young losses. Journal of Machine Learning Research, 21(35):1–69.

Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1455–1465, Jeju Island, Korea. Association for Computational Linguistics.

Stephen Boyd and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press. https://doi.org/10.1017/CBO9780511804441

Yoeng-Jin Chu and Tseng-Hong Liu. 1965. On
the shortest arborescence of a directed graph.
Scientia Sinica, 14:1396–1400.

John Cocke. 1970. Programming Languages and Their Compilers: Preliminary Notes. New York University.



Matthieu Constant and Joseph Le Roux. 2015. Dependency representations for lexical segmentation. In 6th Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL 2015), Bilbao, Spain, July.

Caio Corro. 2020. Span-based discontinuous constituency parsing: A family of exact chart-based algorithms with time complexities from O(n^6) down to O(n^3). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2753–2764, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.219

Caio Corro. 2023. On the inconsistency of separable losses for structured prediction. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics.

Caio Corro, Joseph Le Roux, and Mathieu Lacroix. 2017. Efficient discontinuous phrase-structure parsing via the generalized maximum spanning arborescence. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1644–1654, Copenhagen, Denmark. Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1172

Thomas M. Cover. 1999. Elements of Information Theory. John Wiley & Sons.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33–43, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1004


Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In International Conference on Learning Representations.

Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 484–490, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-2077

Christophe Duhamel, Luis Gouveia, Pedro Moura, and Mauricio Souza. 2008. Models and heuristics for a minimum arborescence problem. Networks, 51(1):34–47. https://doi.org/10.1002/net.20194

Jack Edmonds. 1967. Optimum branchings. Jour-
nal of Research of the National Bureau of
Standards – B. Mathematics and Mathemati-
cal Physics, 71:233. https://doi.org/10
.6028/jres.071B.032

Jason Eisner. 1997. Bilexical grammars and a cubic-time probabilistic parser. In Proceedings of the Fifth International Workshop on Parsing Technologies, pages 54–65, Boston/Cambridge, Massachusetts, USA. Association for Computational Linguistics.

Daniel Fernández-González and André F. T. Martins. 2015. Parsing as reduction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1523–1533, Beijing, China. Association for Computational Linguistics. https://doi.org/10.3115/v1/P15-1147

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018a. Improving text-to-SQL evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1033

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018b. Improving text-to-SQL evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1033

Marguerite Frank and Philip Wolfe. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110. https://doi.org/10.1002/nav.3800030109

Michael R. Garey and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness (Series of Books in the Mathematical Sciences). W. H. Freeman.

Carlos Gómez-Rodríguez, John Carroll, and David Weir. 2011. Dependency parsing schemata and mildly non-projective dependency parsing. Computational Linguistics, 37(3):541–586. https://doi.org/10.1162/COLI_a_00060

Martin Grötschel, Michael Jünger, and Gerhard Reinelt. 1985. Acyclic Subdigraphs and Linear Orderings: Polytopes, Facets, and a Cutting Plane Algorithm, pages 217–264. Springer Netherlands, Dordrecht. https://doi.org/10.1007/978-94-009-5315-4_7

David Hall, Greg Durrett, and Dan Klein. 2014. Less grammar, more features. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 228–237, Baltimore, Maryland. Association for Computational Linguistics. https://doi.org/10.3115/v1/P14-1022

Jonathan Herzig and Jonathan Berant. 2019. Don't paraphrase, detect! Rapid and effective data collection for semantic parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3810–3820, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1394

Jonathan Herzig and Jonathan Berant. 2021. Span-based semantic parsing for compositional generalization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 908–921, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.74

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735, PubMed: 9377276

Dora Jambor and Dzmitry Bahdanau. 2022. LAGr: Label aligned graphs for better systematic generalization in semantic parsing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3295–3308, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.233

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1002

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1988–1997. https://doi.org/10.1109/CVPR.2017.215

Laura Kallmeyer. 2010. Parsing Beyond Context-Free Grammars. Springer Science & Business Media. https://doi.org/10.1007/978-3-642-14846-0

Tadao Kasami. 1965. An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages. Coordinated Science Laboratory, University of Illinois at Urbana-Champaign.

Rohit J. Kate, Yuk Wah Wong, and Raymond J. Mooney. 2005. Learning to transform natural to formal languages. In Proceedings of the 20th National Conference on Artificial Intelligence – Volume 3, AAAI'05, pages 1062–1068. AAAI Press.

Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. 2020. Measuring compositional generalization: A comprehensive method on realistic data. In International Conference on Learning Representations.

Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola, and David Sontag. 2010. Dual decomposition for parsing with non-projective head automata. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1288–1298, Cambridge, MA. Association for Computational Linguistics.

Joseph B. Kruskal. 1956. On the shortest span-
ning subtree of a graph and the traveling sales-
man problem. Proceedings of the American
Mathematical Society, 7(1):48–50. https://
doi.org/10.1090/S0002-9939-1956
-0078686-7

Marco Kuhlmann and Peter Jonsson. 2015. Parsing to noncrossing dependency graphs. Transactions of the Association for Computational Linguistics, 3:559–570. https://doi.org/10.1162/tacl_a_00158

Harold W. Kuhn. 1955. The Hungarian method
for the assignment problem. Naval Research
Logistics Quarterly, 2(1–2):83–97. https://
doi.org/10.1002/nav.3800020109

Simon Lacoste-Julien and Martin Jaggi. 2015. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.

Brenden Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2873–2882. PMLR.

Evgeny S. Levitin and Boris T. Polyak.
1966. Constrained minimization methods. USSR
Computational Mathematics and Mathematical
Physics, 6(5):1–50. https://doi.org/10
.1016/0041-5553(66)90114-5

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.703

Zhenghua Li, Min Zhang, Wanxiang Che, Ting Liu, Wenliang Chen, and Haizhou Li. 2011. Joint models for Chinese POS tagging and dependency parsing. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1180–1191, Edinburgh, Scotland, UK. Association for Computational Linguistics.

João Loula, Marco Baroni, and Brenden Lake. 2018. Rearranging the familiar: Testing compositional generalization in recurrent networks. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 108–114, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5413

Chunchuan Lyu and Ivan Titov. 2018. AMR parsing as graph prediction with latent alignment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 397–407, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1037

Richard Kipp Martin, Ronald L. Rardin, y
Brian A. Campbell. 1990. Polyhedral charac-
terization of discrete dynamic programming.
Operations Research, 38(1):127–138. https://
doi.org/10.1287/opre.38.1.127

André Martins, Noah Smith, and Eric Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 342–350, Suntec, Singapore. Association for Computational Linguistics. https://doi.org/10.3115/1687878.1687928

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 523–530, Vancouver, British Columbia, Canada. Association for Computational Linguistics. https://doi.org/10.3115/1220575.1220641

Ryan McDonald and Giorgio Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Proceedings of the Tenth International Conference on Parsing Technologies, pages 121–132, Prague, Czech Republic. Association for Computational Linguistics. https://doi.org/10.3115/1621410.1621426

Young-Soo Myung, Chang-Ho Lee, and Dong-Wan Tcha. 1995. On the generalized minimum spanning tree problem. Networks, 26(4):231–241. https://doi.org/10.1002/net.3230260407

Radford M. Neal and Geoffrey E. Hinton. 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer. https://doi.org/10.1007/978-94-011-5014-9_12

Yu Nesterov. 2005. Smooth minimization of
non-smooth functions. Mathematical Program-
ming, 103(1):127–152. https://doi.org
/10.1007/s10107-004-0552-5

Jorge Nocedal and Stephen J. Wright. 1999. Numerical Optimization. Springer. https://doi.org/10.1007/b98874

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1162

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1202

Emily Pitler, Sampath Kannan, and Mitchell Marcus. 2012. Dynamic programming for higher order parsing of gap-minding trees. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 478–488, Jeju Island, Korea. Association for Computational Linguistics.

Emily Pitler, Sampath Kannan, and Mitchell Marcus. 2013. Finding optimal 1-endpoint-crossing trees. Transactions of the Association for Computational Linguistics, 1:13–24. https://doi.org/10.1162/tacl_a_00206

Linlu Qiu, Peter Shaw, Panupong Pasupat, Pawel Nowak, Tal Linzen, Fei Sha, and Kristina Toutanova. 2022. Improving compositional generalization with latent structure and data augmentation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4341–4362, Seattle, United States. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.naacl-main.323

Owen Rambow. 2010. The simple truth about dependency and phrase structure representations: An opinion piece. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 337–340, Los Angeles, California. Association for Computational Linguistics.

V. Venkata Rao and R. Sridharan. 2002. Minimum-weight rooted not-necessarily-spanning arborescence problem. Networks, 39(2):77–87. https://doi.org/10.1002/net.10015

Giorgio Satta. 1992. Recognition of linear context-free rewriting systems. In 30th Annual Meeting of the Association for Computational Linguistics, pages 89–95, Newark, Delaware, USA. Association for Computational Linguistics. https://doi.org/10.3115/981967.981979

Alexander Schrijver. 2003. Combinatorial Optimization: Polyhedra and Efficiency, volume 24. Springer.

Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 818–827, Vancouver, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1076

Robert Endre Tarjan. 1977. Finding optimum
branchings. Networks, 7(1):25–35. https://
doi.org/10.1002/net.3230070103

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.

Martin J. Wainwright and Michael Irwin Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc. https://doi.org/10.1561/9781601981851

Bailin Wang, Mirella Lapata, and Ivan Titov. 2021. Structured reordering for modeling latent alignments in sequence transduction. In Advances in Neural Information Processing Systems, volume 34, pages 13378–13391. Curran Associates, Inc.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7567–7578, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.677

Jingfeng Yang, Le Zhang, and Diyi Yang. 2022. SUBS: Subtree substitution for compositional semantic parsing. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 169–174, Seattle, United States. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.naacl-main.12

Daniel H. Younger. 1967. Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2):189–208. https://doi.org/10.1016/S0019-9958(67)80007-X

Alp Yurtsever, Olivier Fercoq, Francesco Locatello, and Volkan Cevher. 2018. A conditional gradient framework for composite convex minimization with applications to semidefinite programming. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5727–5736. PMLR.

Mikhail Zaslavskiy, Marc Dymetman, and Nicola Cancedda. 2009. Phrase-based statistical machine translation as a traveling salesman problem. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 333–341, Suntec, Singapore. Association for Computational Linguistics. https://doi.org/10.3115/1687878.1687926

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence – Volume 2, AAAI'96, pages 1050–1055. AAAI Press.

Xingxing Zhang, Jianpeng Cheng, and Mirella Lapata. 2017. Dependency parsing as head selection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 665–676, Valencia, Spain. Association for Computational Linguistics. https://doi.org/10.18653/v1/E17-1063

Hao Zheng and Mirella Lapata. 2021. Compositional generalization via semantic tagging. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1022–1032, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-emnlp.88

A Proofs

Proof: Theorem 1. We prove Theorem 1 by reducing the maximum not-necessarily-spanning arborescence (MNNSA) problem, which is known to be NP-hard (Rao and Sridharan, 2002; Duhamel et al., 2008), to the MGVCNNSA.

Let G = ⟨V, A, ψ⟩ be a weighted graph where V = {0, ..., n} and ψ ∈ R^|A| are arc weights. The MNNSA problem aims to compute the subset of arcs B ⊆ A such that ⟨V[B], B⟩ is an arborescence of maximum weight, where its weight is defined as ∑_{a∈B} ψ_a.

Let G = ⟨E, T, f_TYPE, f_ARGS⟩ be a grammar such that E = {0, ..., n−1}, T = {t} and ∀e ∈ E : f_TYPE(e) = t ∧ f_ARGS(e, t) = e. Intuitively, a tag e ∈ E will be associated with vertices that require exactly e outgoing arcs.

We construct a clustered labeled weighted graph G′ = ⟨V′, A′, π, l, ψ′⟩ as follows. π = {V′_0, ..., V′_n} is a partition of V′ such that each cluster V′_i contains n − 1 vertices and represents the vertex i ∈ V. The labeling function l assigns a different tag to each vertex in a cluster, i.e., ∀V′_i ∈ π, ∀u′, v′ ∈ V′_i : u′ ≠ v′ ⇒ l(u′) ≠ l(v′). The set of arcs is defined as A′ = {u′ → v′ | ∃i → j ∈ A s.t. u′ ∈ V′_i ∧ v′ ∈ V′_j}. The weight vector ψ′ ∈ R^|A′| is such that ∀u′ → v′ ∈ A′ : u′ ∈ V′_u ∧ v′ ∈ V′_v ⇒ ψ′_{u′→v′} = ψ_{u→v}.

As such, there is a one-to-one correspondence between solutions of the MNNSA on graph G and solutions of the MGVCNNSA on graph G′.
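To make the reduction concrete, the following Python sketch builds G′ from a graph given as a dictionary of arc weights. The function name and encoding are ours, not an implementation from the paper; in particular, we keep one distinctly tagged copy per out-degree 0, ..., n − 2 in each cluster, matching the cluster size used in the proof.

def mnnsa_to_mgvcnnsa(n, psi):
    # psi: dict mapping arcs (i, j), with i, j in {0, ..., n}, to weights.
    # Tag e marks a vertex copy that must have exactly e outgoing arcs;
    # cluster V'_i holds n - 1 such copies of vertex i.
    tags = range(n - 1)
    clusters = {i: [(i, e) for e in tags] for i in range(n + 1)}
    label = {(i, e): e for i in range(n + 1) for e in tags}
    # Every arc i -> j of G is duplicated between all pairs of copies,
    # with the original weight, so weights are preserved by the mapping.
    arcs = {((i, e), (j, f)): w
            for (i, j), w in psi.items()
            for e in tags for f in tags}
    return clusters, label, arcs

An arborescence of weight w in G then maps to a grammar-valid not-necessarily-spanning arborescence of the same weight in the returned clustered graph, and conversely, which is exactly the one-to-one correspondence stated above.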

Proof: Theorem 2. We prove Theorem 2 by reducing the maximum directed Hamiltonian path problem, which is known to be NP-hard (Garey and Johnson, 1979, Appendix A1.3), to the latent anchoring problem.

Let G = ⟨V, A, ψ⟩ be a weighted graph where V = {1, ..., n} and ψ ∈ R^|A| are arc weights. The maximum Hamiltonian path problem aims to compute the subset of arcs B ⊆ A such that V[B] = V and ⟨V[B], B⟩ is a path of maximum weight, where its weight is defined as ∑_{a∈B} ψ_a.

Let G = ⟨E, T, f_TYPE, f_ARGS⟩ be a grammar such that E = {0, 1}, T = {t} and ∀e ∈ E : f_TYPE(e) = t ∧ f_ARGS(e, t) = e.

We construct a clustered labeled weighted graph G′ = ⟨V′, A′, π, l, ψ′⟩ as follows. π = {V′_0, ..., V′_n} is a partition of V′ such that V′_0 = {0}, and each cluster V′_i ≠ V′_0 contains 2 vertices and represents the vertex i ∈ V. The labeling function l assigns a different tag to each vertex in a cluster except the root, i.e., ∀V′_i ∈ π, i > 0, ∀u′, v′ ∈ V′_i : u′ ≠ v′ ⇒ l(u′) ≠ l(v′). The set of arcs is defined as A′ = {0 → u′ | u′ ∈ V′ \ {0}} ∪ {u′ → v′ | i → j ∈ A ∧ u′ ∈ V′_i ∧ v′ ∈ V′_j}. The weight vector ψ′ ∈ R^|A′| is such that ∀u′ → v′ ∈ A′ : u′ ∈ V′_u ∧ v′ ∈ V′_v ⇒ ψ′_{u′→v′} = ψ_{u→v}, and arcs leaving the root 0 have null weights.
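Under the same dictionary encoding as the previous sketch, the construction of G′ for this reduction can be written as follows; again, the helper name is illustrative only.

def hampath_to_latent_anchoring(n, psi):
    # psi: dict mapping arcs (i, j), with i, j in {1, ..., n}, to weights.
    # Tag 0 asks for no outgoing arc, tag 1 for exactly one.
    root = (0, None)
    clusters = {0: [root]}
    label = {}
    for i in range(1, n + 1):
        clusters[i] = [(i, 0), (i, 1)]   # two distinctly tagged copies
        label[(i, 0)], label[(i, 1)] = 0, 1
    arcs = {(root, v): 0.0               # arcs leaving the root: null weight
            for i in range(1, n + 1) for v in clusters[i]}
    for (i, j), w in psi.items():        # copies keep the original weights
        for u in clusters[i]:
            for v in clusters[j]:
                arcs[(u, v)] = w
    return clusters, label, arcs

Selecting a mapping of maximum weight between this graph and the chain-shaped AST defined next then amounts to choosing a Hamiltonian path in G.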
We construct an AST G′′ = ⟨V′′, A′′, l′⟩ such that V′′ = {1, ..., n}, A′′ = {i → i + 1 | 1 ≤ i < n} and the labeling function l′ assigns the tag 0 to n and the tag 1 to every other vertex.

Note that our proof considers that arcs leaving the root cluster satisfy constraints defined by the grammar, whereas we previously only required the root vertex to have a single outgoing arc. The latter constraint can be added directly in a grammar, but we omit its presentation for brevity.

The constrained arity case presented by McDonald and Satta (2007) focuses on spanning arborescences with an arity constraint, by reducing the Hamiltonian path problem to their problem. While the arity constraint is similar in their problem and ours, our proof considers the not-necessarily-spanning case instead of the spanning one. Although the two problems seem related, they need to be studied separately, e.g., computing the maximum spanning arborescence is a polynomial time problem whereas computing the MNNSA is an NP-hard problem.

As such, there is a one-to-one correspondence between solutions of the maximum Hamiltonian path problem on graph G and solutions of the mapping of maximum weight of G′′ with G′.

B Experimental Setup

The neural architecture used in our experiments to produce the weights μ and φ is composed of: (1) an embedding layer of dimension 100 for SCAN, or BERT-base (Devlin et al., 2019) for the other datasets, followed by a bi-LSTM (Hochreiter and Schmidhuber, 1997) with a hidden size of 400; (2) a linear projection of dimension 500 over the output of the bi-LSTM, followed by a tanh activation and another linear projection of dimension |E|, to obtain μ; (3) a linear projection of dimension 500 followed by a tanh activation and a bi-affine layer (Dozat and Manning, 2017) to obtain φ. We apply dropout with a probability of 0.3 over the outputs of BERT-base and the bi-LSTM, and after both tanh activations. The learning rate is 5 × 10^−4 and each batch is composed of 30 examples. We keep the parameters that obtain the best accuracy on the development set after 25 epochs. Training the model takes between 40 minutes for GEOQUERY and 8 hours for CLEVR. Note, however, that the bottleneck is the conditional gradient method, which is computed on the CPU. A sketch of this architecture is given below.
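The following PyTorch sketch reflects steps (1)–(3); hyper-parameters follow the text above, but the module layout, names, and the exact bi-affine parameterization are our assumptions, not the released implementation.

import torch
import torch.nn as nn

class Scorer(nn.Module):
    # Hypothetical scorer: embeddings (dim 100, SCAN setting), bi-LSTM
    # (hidden 400), tag scores mu and bi-affine arc scores phi; p = 0.3
    # dropout as described above.
    def __init__(self, vocab_size, n_tags, emb=100, hidden=400, proj=500):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        self.drop = nn.Dropout(0.3)
        self.mu = nn.Sequential(  # (2): 500-dim projection, tanh, |E| scores
            nn.Linear(2 * hidden, proj), nn.Tanh(), nn.Dropout(0.3),
            nn.Linear(proj, n_tags))
        self.head = nn.Sequential(  # (3): 500-dim projections + bi-affine
            nn.Linear(2 * hidden, proj), nn.Tanh(), nn.Dropout(0.3))
        self.dep = nn.Sequential(
            nn.Linear(2 * hidden, proj), nn.Tanh(), nn.Dropout(0.3))
        self.biaffine = nn.Parameter(torch.zeros(proj + 1, proj + 1))

    def forward(self, tokens):
        # tokens: (batch, n) integer ids; for the non-SCAN datasets the
        # embedding layer would be replaced by BERT-base.
        h, _ = self.lstm(self.emb(tokens))  # (batch, n, 2 * hidden)
        h = self.drop(h)                    # dropout over bi-LSTM outputs
        mu = self.mu(h)                     # (batch, n, |E|)
        ones = h.new_ones(h.shape[0], h.shape[1], 1)
        hd = torch.cat([self.head(h), ones], dim=-1)  # append bias term
        dp = torch.cat([self.dep(h), ones], dim=-1)
        phi = torch.einsum("bif,fg,bjg->bij", hd, self.biaffine, dp)
        return mu, phi

The constant component appended to each projection provides the bias terms of the bi-affine product, in the spirit of Dozat and Manning (2017).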
C GEOQUERY Denotation Accuracy Issue

The denotation accuracy is evaluated by checking whether the denotation returned by an executor is the same when given the gold semantic program and the prediction of the model. It can be higher than the exact match accuracy when different semantic programs yield the same denotation. When we evaluated our approach using the same executor as the baselines of Herzig and Berant (2021), we observed two main issues regarding the behavior of predicates: (1) several predicates have undefined behaviors (e.g., population_1 and traverse_2 in the case of an argument of type country), in the sense that they are not implemented; (2) the behavior of some predicates is incorrect with respect to their expected semantics (e.g., traverse_1 and traverse_2). These two sources of errors result in incorrect denotations for several semantic programs, leading to an overestimation of the denotation accuracy when both the gold and predicted programs return by accident an empty denotation (potentially for different reasons, due to the aforementioned implementation issues).

We implemented a corrected executor addressing the issues that we found. It is available here: https://github.com/alban-petit/geoquery-funql-executor.
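As an illustration, exact match and denotation accuracy can be computed as below, where execute stands for a FunQL executor (e.g., the corrected one linked above); this is a hypothetical helper, not code from the repository.

def evaluate(golds, preds, execute):
    # Exact match: the predicted program equals the gold one.
    exact = sum(g == p for g, p in zip(golds, preds))
    # Denotation accuracy: both programs yield the same denotation.
    # Note the pitfall discussed above: if both executions accidentally
    # return an empty denotation, the pair still counts as correct.
    denot = sum(execute(g) == execute(p) for g, p in zip(golds, preds))
    n = len(golds)
    return exact / n, denot / n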