Learning an Executable Neural Semantic Parser

Jianpeng Cheng
University of Edinburgh
jianpeng.cheng@ed.ac.uk

Siva Reddy
Stanford University
sivar@stanford.edu

Vijay Saraswat
IBM T.J. Watson Research
vjsaraswat@stanford.edu

Mirella Lapata
University of Edinburgh
mlap@inf.ed.ac.uk

This article describes a neural semantic parser that maps natural language utterances onto
logical forms that can be executed against a task-specific environment, such as a knowledge base
or a database, to produce a response. The parser generates tree-structured logical forms with a
transition-based approach, combining a generic tree-generation algorithm with a domain-general
grammar defined by the logical language. The generation process is modeled by structured recurrent
neural networks, which provide a rich encoding of the sentential context and generation
history for making predictions. To tackle mismatches between natural language and logical
form tokens, various attention mechanisms are explored. Finally, we consider different training
settings for the neural semantic parser, including fully supervised training where annotated
logical forms are given, weakly supervised training where denotations are provided, and distant
supervision where only unlabeled sentences and a knowledge base are available. Experiments
across a wide range of data sets demonstrate the effectiveness of our parser.

1. Introduction

An important task in artificial intelligence is to develop systems that understand natural
language and enable interactions between computers and humans. Semantic parsing
has emerged as a key technology toward achieving this goal. Semantic parsers specify a
mapping between natural language utterances and machine-understandable meaning
Submission received: 6 November 2017; revised version received: 17 July 2018; accepted for publication:
10 August 2018.

doi:10.1162/coli_a_00342

© 2019 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license


Table 1
Examples of questions, corresponding logical forms, and their answers.

Environment: A database of US geography
Utterance: What is the longest river in Ohio?
Logical form: longest(and(type.river, location(Ohio)))
Denotation: Ohio River

Environment: Freebase
Utterance: How many daughters does Obama have?
Logical form: count(daughterOf(barack obama))
Denotation: 2

representations, commonly known as logical forms. A logical form can be executed
against a real-world environment, such as a knowledge base, to produce a response,
often called a denotation. Table 1 shows examples of natural language queries, their corresponding
logical forms, and denotations. The query What is the longest river in Ohio?
is represented by the logical form longest(and(type.river, location(Ohio))),
which when executed against a database of US geography returns the answer Ohio
River. In the second example, the logical form count(daughterOf(barack obama))
corresponds to the query How many daughters does Obama have? and is executed
against the Freebase knowledge base to return the answer 2.

In recent years, semantic parsing has attracted a great deal of attention because of its
utility in a wide range of applications, such as question answering (Kwiatkowski et al.
2011; Liang, Jordan, and Klein 2011), relation extraction (Krishnamurthy and Mitchell
2012), goal-oriented dialog (Wen et al. 2015), natural language interfaces (Popescu et al.
2004), robot control (Matuszek et al. 2012), and interpreting instructions (Chen and
Mooney 2011; Artzi and Zettlemoyer 2013).

Early statistical semantic parsers (Zelle and Mooney 1996; Zettlemoyer and Collins
2005; Wong and Mooney 2006; Kwiatkowksi et al. 2010) mostly require training data
in the form of utterances paired with annotated logical forms. More recently, alter-
native forms of supervision have been proposed to alleviate the annotation burden—
Por ejemplo, training on utterance-denotation pairs (Clarke et al. 2010; Kwiatkowski
et al. 2013; Liang 2016), or using distant supervision (Krishnamurthy and Mitchell
2012; Cai and Yates 2013). Despite different supervision signals, training and inference
procedures in conventional semantic parsers rely largely on domain-specific grammars
and engineering. A CKY-style chart parsing algorithm is commonly used to parse a
sentence in polynomial time.

The successful application of recurrent neural networks (Sutskever, Vinyals, and
Le 2014; Bahdanau, Cho, and Bengio 2015) to a variety of NLP tasks has provided
strong impetus to treat semantic parsing as a sequence transduction problem where
an utterance is mapped to a target meaning representation in string format (Dong and
Lapata 2016; Jia and Liang 2016; Kočiský et al. 2016). Neural semantic parsers generate
a sentence in linear time, while reducing the need for domain-specific assumptions,
grammar learning, and more generally extensive feature engineering. But this modeling
flexibility comes at a cost because it is no longer possible to interpret how meaning
composition is performed, given that logical forms are structured objects like trees or
graphs. Such knowledge plays a critical role in understanding modeling limitations so
as to build better semantic parsers. Moreover, without any task-specific knowledge, the
learning problem is fairly unconstrained, both in terms of the possible derivations to
consider and in terms of the target output, which can be syntactically invalid.


In this work we propose a neural semantic parsing framework that combines
recurrent neural networks and their ability to model long-range dependencies with a
transition system to generate well-formed and meaningful logical forms. The transition
system combines a generic tree-generation algorithm with a small set of domain-general
grammar rules pertaining to the logical language to guarantee correctness. Our neural parser
differs from conventional semantic parsers in two respects. First, it does not require
lexicon-level rules to specify the mapping between natural language and logical form
tokens. Instead, the parser is designed to handle cases where the lexicon is missing or
incomplete thanks to a neural attention layer, which encodes a soft mapping between
natural language and logical form tokens. This modeling choice greatly reduces the number
of grammar rules used during inference to those specifying only domain-general aspects.
Second, our parser is transition-based rather than chart-based. Although chart-based
inference has met with popularity in conventional semantic parsers, it has difficulty
leveraging sentence-level features because the dynamic programming algorithm
requires features defined over substructures. In comparison, our linear-time parser
allows us to generate parse structures incrementally, conditioned on the entire sentence.
We perform several experiments in downstream question-answering tasks and
demonstrate the effectiveness of our approach across different training scenarios. These
include full supervision with questions paired with annotated logical forms using the
GEOQUERY (Zettlemoyer and Collins 2005) data set, weak supervision with question-
answer pairs using the WEBQUESTIONS (Berant et al. 2013a) and GRAPHQUESTIONS
(Su et al. 2016) data sets, and distant supervision without question-answer pairs, using the
SPADES (Bisk et al. 2016) data set. Experimental results show that our neural semantic
parser is able to generate high-quality logical forms and answer real-world questions
on a wide range of domains.

The remainder of this article is structured as follows. Section 2 provides an overview
of related work. Section 3 introduces our neural semantic parsing framework and
discusses the various training scenarios to which it can be applied. Our experiments
are described in Section 4, together with detailed analysis of system output. Discussion
of future work concludes in Section 5.

2. Related Work

The proposed framework has connections to several lines of research, including various
formalisms for representing natural language meaning, semantic parsing models, and
the training regimes they adopt. We review related work in these areas here.

Semantic Formalism. Logical forms have played an important role in semantic parsing
systems since their inception in the 1970s (Winograd 1972; Woods, Kaplan, and Nash-
Webber 1972). The literature is rife with semantic formalisms that can be used to define
logical forms. Examples include lambda calculus (Montague 1973), which has been
used by many semantic parsers (Zettlemoyer and Collins 2005; Kwiatkowksi et al.
2010; Reddy, Lapata, and Steedman 2014) because of its expressiveness and flexibility
to construct logical forms of great complexity; Combinatory Categorial Grammar
(Steedman 2000); dependency-based compositional semantics (Liang, Jordan, and Klein
2011); frame semantics (Baker, Fillmore, and Lowe 1998); and abstract meaning
representations (Banarescu et al. 2013).

In this work, we adopt a database querying language as the semantic formalism,
namely, the functional query language (FunQL; Zelle 1995). FunQL maps first-order
logical forms into function-argument structures, resulting in recursive, tree-structured,


program representations. Although FunQL is less expressive than lambda calculus, it has a
modeling advantage for downstream tasks, because it is more natural to describe the
manipulation of a simple world as procedural programs. This modeling advantage has
been revealed in recent advances in neural programming: Recurrent neural networks
have demonstrated great capability in inducing compositional programs (Neelakantan,
Le, and Sutskever 2016; Reed and De Freitas 2016; Cai, Shin, and Song 2017). For
example, they learn to perform grade-school addition, bubble sort, and table comprehension
as procedures. Finally, some recent work (Iyer et al. 2017; Yin and Neubig 2017;
Zhong, Xiong, and Socher 2017) uses other programming languages, such as SQL,
as the semantic formalism.

Semantic Parsing Model. The problem of learning to map utterances to meaning
representations has been studied extensively in the NLP community. Most data-driven
semantic parsers consist of three key components: a grammar, a trainable model, and
a parsing algorithm. The grammar defines the space of derivations from sentences to
logical forms, and the model together with the parsing algorithm finds the most likely
derivation. The model—which can take, for example, the form of a support vector
machine (Kate and Mooney 2006), a structured perceptron (Zettlemoyer and Collins
2007; Lu et al. 2008; Reddy, Lapata, and Steedman 2014; Reddy et al. 2016), or a log-linear
model (Zettlemoyer and Collins 2005; Berant et al. 2013a)—scores the set of candidate
derivations generated from the grammar. During inference, a chart-based parsing algorithm
is commonly used to predict the most likely semantic parse for a sentence.

With recent advances in neural networks and deep learning, there is a trend of
reformulating semantic parsing as a machine translation problem. The idea is not
novel, because semantic parsing has been previously studied with statistical machine
translation approaches in both Wong and Mooney (2006) and Andreas, Vlachos, and
Clark (2013). However, this setup is worth revisiting since recurrent neural
networks have been shown to be extremely useful in context modeling and sequence
generation (Bahdanau, Cho, and Bengio 2015). Following this direction, Dong and
Lapata (2016) and Jia and Liang (2016) have developed neural semantic parsers that
treat semantic parsing as a sequence-to-sequence learning problem. Jia and Liang
(2016) also introduce a data augmentation approach that bootstraps a synchronous
grammar from existing data and generates artificial examples as extra training data.
Other related work extends the vanilla sequence-to-sequence model in various ways,
such as multi-task learning (Fan et al. 2017), parsing cross-domain queries (Herzig and
Berant 2017) and context-dependent queries (Suhr, Iyer, and Artzi 2018), and applying
the model to other formalisms such as AMR (Konstas et al. 2017) and SQL (Zhong,
Xiong, and Socher 2017).

The fact that logical forms have a syntactic structure has motivated some of the
recent work on exploring structured neural decoders to generate tree or graph struc-
tures and grammar-constrained decoders to ensure the outputs are meaningful and
executable. Related work includes Yin and Neubig (2017), who generate abstract syn-
tax trees for source code with a grammar-constrained neural decoder. Krishnamurthy,
Dasigi, and Gardner (2017) also introduce a neural semantic parser that decodes rules
in a grammar to obtain well-typed logical forms. Rabinovich, Stern, and Klein (2017)
propose abstract syntax networks with a modular decoder, whose multiple submodels
(one per grammar construct) are composed to generate abstract syntax trees in a top–
down manner.

Our work shares similar motivation: We generate tree-structured, syntactically valid
logical forms, but following a transition-based generation approach (Dyer et al. 2015,


2016). Our semantic parser is a generalization of the model presented in Cheng et al.
(2017). Whereas they focus solely on top–down generation using hard attention, the
parser presented in this work generates logical forms following either a top–down or
bottom–up generation order and introduces additional attention mechanisms (i.e., soft
and structured attention) for handling mismatches between natural language and log-
ical form tokens. We empirically compare generation orders and attention variants,
elaborate on model details, and formalize how the neural parser can be effectively
trained under different types of supervision.

Training Regimes. Various types of supervision have been explored to train semantic
parsers, ranging from full supervision with utterance-logical form pairs to unsupervised
semantic parsing without given utterances. Early work in statistical semantic
parsing has mostly used annotated training data consisting of utterances paired with
logical forms (Zelle and Mooney 1996; Kate, Wong, and Mooney 2005; Kate and Mooney
2006; Wong and Mooney 2006; Lu et al. 2008; Kwiatkowksi et al. 2010). The same applies
to some of the recent work on neural semantic parsing (Dong and Lapata 2016; Jia and
Liang 2016). This form of supervision is the most effective for training the parser, but is also
expensive to obtain. In order to write down a correct logical form, the annotator not only
needs to have expertise in the semantic formalism, but also has to ensure that the logical
form matches the utterance semantics and contains no grammatical mistakes. For this
reason, fully supervised training applies more to small, closed-domain problems, such as
querying the US geographical database (Zelle and Mooney 1996).

Over the past few years, developments have been made to train semantic parsers
with weak supervision from utterance-denotation pairs (Clarke et al. 2010; Liang, Jordan,
and Klein 2011; Berant et al. 2013a; Kwiatkowski et al. 2013; Pasupat and Liang 2015).
This approach enables more efficient data collection, since denotations (such as answers
to a question, or responses of a system) are much easier to obtain via crowdsourcing. For this
reason, semantic parsing can be scaled to handle large, complex, and open-domain problems.
Examples include learning semantic parsers from question-answer pairs on Freebase
(Liang, Jordan, and Klein 2011; Berant et al. 2013a; Berant and Liang 2014; Liang
et al. 2017; Cheng et al. 2017), from system feedback (Clarke et al. 2010; Chen and
Mooney 2011; Artzi and Zettlemoyer 2013), from abstract examples (Goldman et al. 2018),
and from human feedback (Iyer et al. 2017) or statements (Artzi and Zettlemoyer 2011).
Some work seeks more clever ways of gathering data and trains semantic parsers
with even weaker supervision. In a class of distant supervision methods, the input is
solely a knowledge base and a corpus of unlabeled sentences. Artificial training data are
generated from the given resources. For example, Cai and Yates (2013) generate utter-
ances paired with logical forms. Their approach searches for sentences containing cer-
tain entity pairs, and assumes (with some pruning technique) that the sentences express
a certain relation from the knowledge base. In Krishnamurthy and Mitchell (2012, 2014),
whose authors work with the CCG formalism, an extra source of supervision is added.
The semantic parser is trained to produce parses that syntactically agree with depen-
dency structures. Reddy, Lapata, and Steedman (2014) generate utterance-denotation
pairs by masking entity mentions in declarative sentences from a large corpus. A seman-
tic parser is then trained to predict the denotations corresponding to the masked entities.

3. Neural Semantic Parsing Framework

We present a neural network–based semantic parser that maps an utterance into a
logical form, which can be executed in the context of a knowledge base to produce


a response. Compared with traditional semantic parsers, our framework reduces the
number of manually engineered features and domain-specific rules. As semantic for-
malism, we choose the functional query language (FunQL), which is recursive and tree-
structured (Section 3.1). A transition-based tree generation algorithm is then defined
to generate FunQL logical forms (Sections 3.2–3.4). The process of generating logical
forms is modeled by recurrent neural networks—a powerful tool for encoding the
context of a sentence and the generation history for making predictions (Section 3.5).
We handle mismatches between natural language and the knowledge base through various
attention mechanisms (Section 3.7). Finally, we explore different training regimes
(Section 3.8), including a fully supervised setting where each utterance is labeled with
annotated logical forms, a weakly supervised setting where utterance-denotation pairs
are available, and a distant-supervision setting where only a collection of unlabeled
sentences and a knowledge base is given.

3.1 FunQL Semantic Representation

As mentioned earlier, we adopt FunQL as our semantic formalism. FunQL is a variable-
free recursive meaning representation language that maps simple first-order logical
forms to function-argument structures that abstract away from variables and quantifiers
(Kate and Mooney 2006). The language is also closely related to lambda DCS (Liang
2013), which makes existential quantifiers implicit. Lambda DCS is more compact in
the sense that it can use variables in rare cases to handle anaphora and build composite
binary predicates.

The FunQL logical forms we define contain the following primitive functional
operators. They overlap with simple lambda DCS (Berant et al. 2013a) but differ slightly
in syntax to ease recursive generation of logical forms. Let l denote a logical form,
⟦l⟧ represent its denotation, and K refer to a knowledge base.

Unary base case: An entity e (e.g., barack obama) is a unary logical form
whose denotation is a singleton set containing that entity:

$$\llbracket e \rrbracket = \{e\} \qquad (1)$$

Binary base case: A relation r (e.g., daughterOf) is a binary logical form
with denotation:

$$\llbracket r \rrbracket = \{(e_1, e_2) : (e_1, r, e_2) \in K\} \qquad (2)$$

A relation r can be applied to an entity e1 (written as r(e1)) and returns as
denotation the unary satisfying the relation:

$$\llbracket r(e_1) \rrbracket = \{e : (e_1, e) \in \llbracket r \rrbracket\} \qquad (3)$$

For example, the expression daughterOf(barack obama) corresponds to
the question “Who are Barack Obama’s daughters?”.

count returns the cardinality of the unary set u:

$$\llbracket \text{count}(u) \rrbracket = \{|\llbracket u \rrbracket|\} \qquad (4)$$


For example, count(daughterOf(barack obama)) represents the question
“How many daughters does Barack Obama have?”.

argmax or argmin return a subset of the unary set u whose specific relation
r is maximum or minimum:

$$\llbracket \text{argmax}(u, r) \rrbracket = \{e : e \in \llbracket u \rrbracket \wedge \forall e' \in \llbracket u \rrbracket,\ r(e) \geq r(e')\} \qquad (5)$$

For example, the expression argmax(daughterOf(barack obama), age)
corresponds to the utterance “Who is Barack Obama’s eldest daughter?”.

filter returns a subset of the unary set u where a comparative constraint
(=, !=, >, <, ≥, ≤) acting on the relation r is satisfied:

$$\llbracket \text{filter}_{>}(u, r, v) \rrbracket = \{e : e \in \llbracket u \rrbracket \wedge r(e) > v\} \qquad (6)$$

For example, the query filter>(daughterOf(barack obama), age, 5)
returns the daughters of Barack Obama who are older than five years.

and takes the intersection of two unary sets u1 and u2:

$$\llbracket \text{and}(u_1, u_2) \rrbracket = \llbracket u_1 \rrbracket \cap \llbracket u_2 \rrbracket \qquad (7)$$

and or takes their union:

$$\llbracket \text{or}(u_1, u_2) \rrbracket = \llbracket u_1 \rrbracket \cup \llbracket u_2 \rrbracket \qquad (8)$$

For example, the expression and(daughterOf(barack obama),
InfluentialTeensByYear(2014)) would correspond to the query “Which
daughter of Barack Obama was named Most Influential Teens in the year
2014?”.

The operators just defined give rise to compositional logical forms (e.g., count(and
(daughterOf(barack obama), InfluentialTeensByYear(2014)))).
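To make these definitions concrete, the following minimal sketch (not the authors' implementation) executes FunQL-style logical forms of this kind against a toy knowledge base; the triples and helper names are illustrative assumptions.

```python
# Minimal FunQL-style interpreter over a toy knowledge base (illustrative sketch).
# The KB is a set of (subject, relation, object) triples; unaries are Python sets.

KB = {
    ("barack obama", "daughterOf", "malia obama"),
    ("barack obama", "daughterOf", "sasha obama"),
    ("2014", "InfluentialTeensByYear", "malia obama"),
}

def unary(entity):                      # Equation (1): [[e]] = {e}
    return {entity}

def apply_relation(relation, arg_set):  # Equation (3): [[r(e1)]] = {e : (e1, e) in [[r]]}
    return {o for (s, r, o) in KB if r == relation and s in arg_set}

def count(u):                           # Equation (4): cardinality of the unary set
    return {len(u)}

def and_(u1, u2):                       # Equation (7): intersection
    return u1 & u2

def or_(u1, u2):                        # Equation (8): union
    return u1 | u2

# count(and(daughterOf(barack obama), InfluentialTeensByYear(2014)))
denotation = count(and_(apply_relation("daughterOf", unary("barack obama")),
                        apply_relation("InfluentialTeensByYear", unary("2014"))))
print(denotation)  # {1} for this toy knowledge base
```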

The reason for using FunQL in our framework lies in its recursive nature, which
allows us to model the process of generating logical form as a sequence of transition
operations that can be decoded by powerful recurrent neural networks. We next de-
scribe how our semantic formalism is integrated with a transition-based tree-generation
algorithm to produce tree-structured logical forms.

3.2 Tree Generation Algorithm

We introduce a generic tree generation algorithm that recursively generates tree con-
stituents with a set of transition operations. The key insight underlying our algorithm is
to define a canonical traversal or generation order, which generates a tree as a transition
sequence. A transition sequence for a tree is a sequence of configuration-transition pairs
[(c0, t0), (c1, t1), · · · , (cm, tm)]. In this work, we consider two commonly used generation
orders, namely top–down pre-order and bottom–up post-order.

The top–down system is specified by the tuple c = (Σ, π, σ, N, P), where Σ is a stack
used to store partially complete tree fragments, π is the non-terminal token to be generated,
σ is the terminal token to be generated, N is a stack of open non-terminals, and P is


Table 2
Transitions for the top–down and bottom–up generation system. Stack Σ is represented as a list
with its head to the right (with tail σ), same for stack N (with tail β).

Top–down Transitions

NT(X):   ([σ|X′], X, ε, [β|X′], P(X′))  ⇒  ([σ|X′, X], ε, ε, [β|X′, X], P(X))
TER(x):  ([σ|X′], ε, x, [β|X′], P(X′))  ⇒  ([σ|X′, x], ε, ε, [β|X′], P(X′))
RED:     ([σ|X′, X, x], ε, ε, [β|X′, X], P(X))  ⇒  ([σ|X′, X(x)], ε, ε, [β|X′], P(X′))

Bottom–up Transitions

TER(x):     (σ, ε, x)  ⇒  ([σ|x], ε, ε)
NT-RED(X):  ([σ|x], X, ε)  ⇒  ([σ|X(x)], ε, ε)

a function indexing the position of a non-terminal pointer. The pointer indicates where
subsequent children nodes should be attached (e.g., P(X) means that the pointer is
pointing to the non-terminal X). The initial configuration is c0 = ([], TOP, ε, [], ⊥), where
TOP stands for the root node of the tree, ε represents an empty string, and ⊥ represents
an unspecified function. The top–down system uses three transition operations, defined
in Table 2:

NT(X) creates a new subtree non-terminal node denoted by X. The
non-terminal X is pushed on top of the stack and written as X( (and
subsequent tree nodes are generated as children underneath X).

TER(X) creates a new child node denoted by x. The terminal x is pushed on
top of the stack, written as x.

RED is the reduce operation, which indicates that the current subtree being
generated is complete. The non-terminal root of the current subtree is
closed and subsequent children nodes will be attached to the predecessor
open non-terminal. Stack-wise, RED recursively pops children (cual
can be either terminals or completed subtrees) on top until an open
non-terminal is encountered. The non-terminal is popped as well, después
which a completed subtree is pushed back to the stack as a single closed
constituent, written for example as X1(X2, X3).

We define the bottom–up system by the tuple c = (Σ, π, σ), where Σ is a stack used
to store partially complete tree fragments, π is the non-terminal token to be generated,
and σ is the terminal token to be generated. We take the initial configuration to be c0 =
([], ε, xl), where xl stands for the leftmost terminal node of the tree, and ε represents
an empty string. The bottom–up generation uses two transition operations, defined in
Table 2:

TER(X) creates a new terminal node denoted by x. The terminal x is
pushed on top of the stack, written as x.


NT-RED(X) builds a new subtree by attaching a parent node (denoted by X)
to children nodes on top of the stack. The children nodes can be either
terminals or smaller subtrees. Similarly to RED in the top–down case,
children nodes are first popped from the stack, and subsequently
combined with the parent X to form a subtree. The subtree is pushed back
to the stack as a single constituent, written for example as X1(X2, X3).
A challenge with NT-RED(X) is to decide how many children should be
popped and included in the new subtree. In this work, the number of
children is dictated by the number of arguments expected by X, which is in
turn constrained by the logical language. For example, from the FunQL
grammar it is clear that count takes one argument and argmax takes two.
The language we use does not contain non-terminal functions with a
variable number of arguments.

Top–down traversal is defined by three generic operations, and bottom–up order
applies two operations only (since it combines reduce with non-terminal generation).
However, the operation predictions required are the same for the two systems. The
reason is that the reduce operation in the top–down system is deterministic when the
FunQL grammar is used as a constraint (we return to this point in Section 3.4).

3.3 Generating Tree-Structured Logical Forms

To generate tree-structured logical forms, we integrate the generic tree generation oper-
ations described earlier with FunQL, whose grammar determines the space of allowed
terminal and non-terminal symbols:

NT(X) includes an operation that generates relations, NT(relation), and
other domain-general operators in FunQL: NT(and), NT(or), NT(count),
NT(argmax), NT(argmin), and NT(filter). Note that NT(relation)
creates a placeholder for a relation, which is subsequently generated.

TER(X) includes two operations: TER(relation) for generating relations
and TER(entity) for generating entities. Both operations create a
placeholder for a relation or an entity, which is subsequently generated.

NT-RED(X) includes NT-RED(relation), NT-RED(and), NT-RED(or),
NT-RED(count), NT-RED(argmax), NT-RED(argmin), and NT-RED(filter).
Again, NT-RED(relation) creates a placeholder for a relation, which is
subsequently generated.

Table 3 illustrates the sequence of operations used by our parser in order to generate
the logical form count(and(daughterOf(barack obama), InfluentialTeens-
ByYear(2014))) top–down. Table 4 shows how the same logical form is generated
bottom–up. Note that the examples are simplified for illustration purposes; the logical
form is generated conditioned on an input utterance, such as “How many daughters of
Barack Obama were named Most Influential Teens in the year 2014?”.
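As a sanity check on these transition sequences, the sketch below (an illustrative reimplementation, not the authors' code) replays the top–down operations of Table 3 with an explicit stack and recovers the logical form string.

```python
# Replay the top-down transition sequence of Table 3 with an explicit stack
# (illustrative sketch; the real parser predicts these operations with a neural model).

def replay_top_down(steps):
    stack = []                                   # holds strings such as "count(" or closed subtrees
    for op, token in steps:
        if op.startswith("NT"):                  # open a new non-terminal, e.g. "count("
            stack.append(token + "(")
        elif op.startswith("TER"):               # push a terminal, e.g. "barack obama"
            stack.append(token)
        elif op == "RED":                        # pop children up to the open non-terminal and close it
            children = []
            while not stack[-1].endswith("("):
                children.append(stack.pop())
            parent = stack.pop()
            stack.append(parent + ", ".join(reversed(children)) + ")")
    return stack[0]

steps = [("NT(count)", "count"), ("NT(and)", "and"), ("NT(relation)", "daughterOf"),
         ("TER(entity)", "barack obama"), ("RED", None), ("NT(relation)", "InfluentialTeensByYear"),
         ("TER(entity)", "2014"), ("RED", None), ("RED", None), ("RED", None)]

print(replay_top_down(steps))
# count(and(daughterOf(barack obama), InfluentialTeensByYear(2014)))
```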

3.4 Constraints

A challenge in neural semantic parsing lies in generating well-formed and meaningful
logical forms. To this end, we incorporate two types of constraints in our system.

Table 3
Top–down generation of the logical form count(and(daughterOf(barack obama),
InfluentialTeensByYear(2014))). Elements on the stack are separated by || and the top of the
stack is on the right.

Operation        Logical form token        Stack

NT(count)        count                     count(
NT(and)          and                       count( || and(
NT(relation)     daughterOf                count( || and( || daughterOf(
TER(entity)      barack obama              count( || and( || daughterOf( || barack obama
RED                                        count( || and( || daughterOf(barack obama)
NT(relation)     InfluentialTeensByYear    count( || and( || daughterOf(barack obama) || InfluentialTeensByYear(
TER(entity)      2014                      count( || and( || daughterOf(barack obama) || InfluentialTeensByYear( || 2014
RED                                        count( || and( || daughterOf(barack obama) || InfluentialTeensByYear(2014)
RED                                        count( || and(daughterOf(barack obama), InfluentialTeensByYear(2014))
RED                                        count(and(daughterOf(barack obama), InfluentialTeensByYear(2014)))

Table 4
Bottom–up generation of the logical form count(and(daughterOf(barack obama),
InfluentialTeensByYear(2014))). Elements on the stack are separated by || and the top of the
stack is on the right.

Operation           Logical form token        Stack

TER(entity)         barack obama              barack obama
NT-RED(relation)    daughterOf                daughterOf(barack obama)
TER(entity)         2014                      daughterOf(barack obama) || 2014
NT-RED(relation)    InfluentialTeensByYear    daughterOf(barack obama) || InfluentialTeensByYear(2014)
NT-RED(and)         and                       and(daughterOf(barack obama), InfluentialTeensByYear(2014))
NT-RED(count)       count                     count(and(daughterOf(barack obama), InfluentialTeensByYear(2014)))

The first ones are structural constraints to ensure that the outputs are syntactically valid
logical forms. For the top–down system these constraints include:

•   The first operation must be NT;

•   RED cannot directly follow NT;

•   The maximum number of open non-terminal symbols allowed on the stack
    is 10. NT is disabled when the maximum number is reached;

•   The maximum number of (open and closed) non-terminal symbols
    allowed on the stack is 10. NT is disabled when the maximum number is
    reached.

Tree constraints for the bottom–up system are:

•   The first operation must be TER;

•   The maximum number of consecutive TERs allowed is 5;

•   The maximum number of terminal symbols allowed on the stack is the
    number of words in the sentence. TER is disallowed when the maximum
    number is reached.

The second type of constraints relates to the FunQL grammar itself, ensuring that the
generated logical forms are meaningful for execution:

•   The type of argument expected by each non-terminal symbol must follow
    the FunQL grammar;


•   The number of arguments expected by each non-terminal symbol must
    follow the FunQL grammar;

•   When the expected number of arguments for a non-terminal symbol is
    reached, a RED operation must be called for the top–down system; for the
    bottom–up system this constraint is built into the NT-RED operation,
    since it reduces the expected number of arguments based on a specific
    non-terminal symbol.
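One simple way to realize such constraints at decoding time is to mask the scores of disallowed operations before the softmax. The sketch below is an illustrative approximation of the top–down constraints above; the function and argument names are assumptions, not the authors' implementation.

```python
# Illustrative action mask for the top-down system: disallowed operations get -inf
# scores so that the softmax assigns them zero probability.

import math

MAX_OPEN_NT = 10

def top_down_action_mask(actions, last_op, n_open_nt, args_remaining):
    """Return an additive mask (0 or -inf) over candidate operations.
    last_op: previously applied operation (None at the start);
    n_open_nt: number of currently open non-terminals on the stack;
    args_remaining: arguments still expected by the innermost open non-terminal."""
    mask = {}
    for a in actions:
        if last_op is None:                                   # first operation must be NT
            allowed = a.startswith("NT")
        elif args_remaining == 0:                             # arity satisfied: grammar forces RED
            allowed = (a == "RED")
        elif a == "RED":
            allowed = not last_op.startswith("NT")            # RED cannot directly follow NT
        elif a.startswith("NT"):
            allowed = n_open_nt < MAX_OPEN_NT                 # cap on open non-terminals
        else:
            allowed = True
        mask[a] = 0.0 if allowed else -math.inf
    return mask

scores = {"NT(count)": 1.2, "TER(entity)": 0.4, "RED": -0.3}
mask = top_down_action_mask(scores, last_op=None, n_open_nt=0, args_remaining=None)
print({a: s + mask[a] for a, s in scores.items()})            # only NT(count) stays finite
```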

3.5 Neural Network Realizer

We model this logical form generation algorithm with a structured neural network
that encodes the utterance and the generation history, and then predicts a sequence of
transition operations as well as logical form tokens based on the encoded information.
In what follows, we present details for each component in the network.

Utterance Encoding. An utterance x is encoded with a bidirectional long short-term mem-
ory (LSTM) architecture (Hochreiter and Schmidhuber 1997). A bidirectional LSTM
comprises a forward LSTM and a backward LSTM. The forward LSTM processes a
variable-length sequence x = (x1, x2, · · · , xn) by incrementally adding new content into
a single memory slot, with gates controlling the extent to which new content should
be memorized, old content should be erased, and current content should be exposed.
At time step t, the memory $\vec{c}_t$ and the hidden state $\vec{h}_t$ are updated with the following
equations:

$$\begin{bmatrix} i_t \\ f_t \\ o_t \\ \hat{c}_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \Big( W \cdot [\vec{h}_{t-1}, x_t] \Big) \qquad (9)$$

$$\vec{c}_t = f_t \odot \vec{c}_{t-1} + i_t \odot \hat{c}_t \qquad (10)$$

$$\vec{h}_t = o_t \odot \tanh(\vec{c}_t) \qquad (11)$$
where i, f, and o are gate activations; W denotes the weight matrix. For simplicity, we
denote the recurrent computation of the forward LSTM as

$$\vec{h}_t = \overrightarrow{\text{LSTM}}(x_t, \vec{h}_{t-1}) \qquad (12)$$

After encoding, a list of token representations $[\vec{h}_1, \vec{h}_2, \cdots, \vec{h}_n]$ within the forward context
is obtained. Similarly, the backward LSTM computes a list of token representations
$[\overleftarrow{h}_1, \overleftarrow{h}_2, \cdots, \overleftarrow{h}_n]$ within the backward context as

$$\overleftarrow{h}_t = \overleftarrow{\text{LSTM}}(x_t, \overleftarrow{h}_{t+1}) \qquad (13)$$


Finally, each input token xi is represented by the concatenation of its forward and
backward LSTM state vectors, denoted by $h_i = [\vec{h}_i : \overleftarrow{h}_i]$. The list storing token vectors for
the entire utterance x can be considered as a buffer, in analogy to syntactic parsing. A
notable difference is that tokens in the buffer will not be removed, because their alignment
to logical form tokens is not predetermined in the general semantic parsing scenario.
We denote the buffer b as b = [h1, · · · , hk], where k denotes the length of the utterance.
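A minimal sketch of such a bidirectional encoder is given below, assuming PyTorch as the implementation library (the paper does not prescribe one); hyperparameters and class names are placeholders.

```python
# Bidirectional LSTM utterance encoder producing the buffer b = [h_1, ..., h_k]
# (illustrative PyTorch sketch).

import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True runs the forward and backward LSTMs of Equations (12)-(13)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, k) word indices of the utterance
        embeddings = self.embed(token_ids)          # (batch, k, emb_dim)
        buffer, _ = self.bilstm(embeddings)         # (batch, k, 2 * hidden_dim): concatenated states
        return buffer

encoder = UtteranceEncoder(vocab_size=5000)
buffer = encoder(torch.randint(0, 5000, (1, 8)))    # a single 8-token utterance
print(buffer.shape)                                  # torch.Size([1, 8, 300])
```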

Generation History Encoding. The generation history—that is, the partially completed
subtrees—is encoded with a variant of stack-LSTM (Dyer et al. 2015). Such an encoder
captures not only previously generated tree tokens but also tree structures. We first
discuss the stack-based LSTM in the top–down transition system and then present
modifications to account for the bottom–up system.

In top–down transitions, operations NT and TER change the stack-LSTM representa-

tion st as in a vanilla LSTM as

st = LSTM(yt, st−1)

(14)

where yt denotes the newly generated non-terminal or terminal token. A RED operation
recursively pops the stack-LSTM states as well as corresponding tree tokens on the
output stack. The popping stops when a non-terminal state is reached and popped,
after which the stack-LSTM reaches an intermediate state st−1:t.1 The representation of
the completed subtree u is then computed as

u = Wu · [pu : cu]

(15)

where pu denotes the parent (non-terminal) embedding of the subtree, cu denotes the av-
erage of the children (terminal or completed subtree) embeddings, and Wu denotes the
weight matrix. Note that cu can also be computed with more advanced methods, semejante
as a recurrent neural network (Kuncoro et al. 2017). Finalmente, the subtree embedding u
serves as the input to the LSTM and updates st−1:t to st as

st = LSTM(u, st−1:t)

(16)

Figure 1 provides a graphical view of how the three operations change the configuration
of a stack-LSTM.
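The following is a compact sketch of the stack-LSTM bookkeeping for the top–down case, assuming PyTorch and averaging the children embeddings as in Equation (15); the class and variable names are illustrative, not the authors' code.

```python
# Stack-LSTM sketch for the top-down system: hidden states are kept on a stack so
# that RED can pop back to the state preceding an open non-terminal (illustrative).

import torch
import torch.nn as nn

class StackLSTM(nn.Module):
    def __init__(self, dim=150):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)
        self.compose = nn.Linear(2 * dim, dim)            # W_u in Equation (15)
        self.states = [(torch.zeros(1, dim), torch.zeros(1, dim))]   # stack of (h, c) pairs
        self.tokens = []                                  # parallel stack of (embedding, is_open_nt)

    def push(self, emb, is_open_nt):                      # NT(X) or TER(x): Equation (14)
        h, c = self.cell(emb, self.states[-1])
        self.states.append((h, c))
        self.tokens.append((emb, is_open_nt))

    def reduce(self):                                     # RED: pop children and parent, push subtree
        children = []
        while not self.tokens[-1][1]:
            children.append(self.tokens.pop()[0])
            self.states.pop()
        parent = self.tokens.pop()[0]
        self.states.pop()                                  # the state now corresponds to s_{t-1:t}
        subtree = self.compose(torch.cat([parent, torch.stack(children).mean(0)], dim=-1))
        self.push(subtree, is_open_nt=False)               # Equation (16)

    def state(self):                                       # s_t used for prediction
        return self.states[-1][0]

dim = 150
enc = StackLSTM(dim)
enc.push(torch.randn(1, dim), is_open_nt=True)             # e.g. NT(count)
enc.push(torch.randn(1, dim), is_open_nt=False)            # e.g. TER(entity)
enc.reduce()
print(enc.state().shape)                                    # torch.Size([1, 150])
```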

In comparison, the bottom–up transition system uses the same TER operation to
update the stack-LSTM representation st when a terminal yt is newly generated:

st = LSTM(yt, st−1)

(17)

Differently, the effects of NT and RED are merged into an NT-RED(X) operation. When
NT-RED(X) is invoked, a non-terminal yt is first predicted and then the stack-LSTM
starts popping its states on the stack. The number of pops is decided by the number
of arguments expected by yt. Afterward, a subtree can be obtained by combining the non-

1 We use st−1:t to denote the intermediate transit state from time step t − 1 to t, after terminal tokens are

popped from the stack; st denotes the final LSTM state after the subtree representation is pushed back to
the stack (as explained subsequently in the text).


Figure 1
A stack-LSTM extends a standard LSTM with the addition of a stack pointer (shown as Top in
the figure). The example shows how the configuration of the stack changes when the operations
NT, TER, and RED are applied in sequence. The initial stack is presumed empty for illustration
purposes. We only show how the stack-LSTM updates its states, not how subsequent predictions
are made, which depend not only on the hidden state of the stack-LSTM, but also on the natural
language utterance.

intermediate state st−1:t. Similar to the top–down system, we compute the representa-
tion of the newly combined subtree u as

u = Wu · [pu : cu]

(18)

where pu denotes the parent (non-terminal) embedding of the subtree, cu denotes the
average of the children (terminal or completed subtree) embeddings, and Wu denotes
the weight matrix. Finalmente, the subtree embedding u serves as the input to the LSTM
and updates st−1:t to st as

st = LSTM(u, st−1:t)

(19)

The key difference here is that a non-terminal tree token is never pushed alone to update
the stack-LSTM, but rather as part of a completed subtree that does the update.

Making Predictions. Given encodings of the utterance and generation history, our model
makes two types of predictions pertaining to transition operations and logical form


tokens (see Tables 3 and 4). First, at every time step, the next transition operation ot+1 is
predicted based on the utterance encoding b and the generation history st:

ot+1 ∼ f (b, st)

(20)

where f is a neural network that computes the parameters of a multinomial distribution
over the action space which is restricted by the constraints discussed in Section 3.4.

Next, the logical form token underlying each generation operation must be emitted.
When the generation operation contains one of the domain-general non-terminals
count, argmax, argmin, and, or, and filter (e.g., NT(count)), the logical form token is
the corresponding non-terminal (e.g., count). When the generation operation involves
one of the placeholders for an entity or relation (e.g., NT(relation), NT-RED(relation),
TER(relation), and TER(entity)), a domain-specific logical form token yt+1 (i.e., an
entity or a relation) is predicted in a fashion similar to action prediction:

yt+1 ∼ g(b, st)

(21)

where g is a neural network that computes the parameters of a multinomial distribution
over the token space.

A remaining challenge lies in designing predictive functions f (for the next action)
and g (for the next logical form token) in the context of semantic parsing. We explore
various attention mechanisms that we discuss in the next sections.

3.6 Next Action Prediction

This section explains how we model function f for predicting the next action. We draw
inspiration from previous work on transition-based syntactic parsing and compute a
feature vector representing the current state of the generation system (Dyer et al. 2016).
This feature vector typically leverages the buffer, which stores unprocessed tokens in
the utterance, and the stack, which stores tokens in the partially completed parse tree.
A major difference in our semantic parsing context is that the buffer configuration does
not change deterministically with respect to the stack because the alignment between
natural language tokens and logical-form tokens is not explicitly specified. This gives
rise to the challenge of extracting features representing the buffer at different time steps.
To this end, we compute at each time step t a single adaptive representation of the
buffer ¯bt with a soft attention mechanism:

$$u_t^i = V \tanh(W_b b_i + W_s s_t) \qquad (22)$$

$$\alpha_t^i = \mathrm{softmax}(u_t^i) \qquad (23)$$

$$\bar{b}_t = \sum_i \alpha_t^i\, b_i \qquad (24)$$

where Wb and Ws are weight matrices and V is a weight vector. We then combine
the representation of the buffer and the stack with a feed-forward neural network


(Equation (25)) to yield a feature vector for the generation system. Finally, a softmax is
taken to obtain the parameters of the multinomial distribution over actions:

at+1 ∼ softmax(Woa tanh(Wf [¯bt, st]))

(25)

where Woa and Wf are weight matrices.
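A small illustrative sketch of the computation in Equations (22)–(25) follows, again assuming PyTorch; dimensions and parameter names are placeholders, not the authors' implementation.

```python
# Soft attention over the buffer followed by action prediction (Equations (22)-(25)),
# written as an illustrative PyTorch module.

import torch
import torch.nn as nn

class ActionPredictor(nn.Module):
    def __init__(self, buf_dim, stack_dim, hidden_dim, n_actions):
        super().__init__()
        self.W_b = nn.Linear(buf_dim, hidden_dim, bias=False)
        self.W_s = nn.Linear(stack_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)          # V in Equation (22)
        self.W_f = nn.Linear(buf_dim + stack_dim, hidden_dim)
        self.W_oa = nn.Linear(hidden_dim, n_actions)

    def forward(self, buffer, stack_state):
        # buffer: (k, buf_dim); stack_state: (stack_dim,)
        u = self.v(torch.tanh(self.W_b(buffer) + self.W_s(stack_state))).squeeze(-1)   # (k,)
        alpha = torch.softmax(u, dim=-1)                                                # Equation (23)
        b_bar = (alpha.unsqueeze(-1) * buffer).sum(dim=0)                               # Equation (24)
        scores = self.W_oa(torch.tanh(self.W_f(torch.cat([b_bar, stack_state]))))       # Equation (25)
        return torch.log_softmax(scores, dim=-1)                                        # log p(a_{t+1} | x, s_t)

model = ActionPredictor(buf_dim=300, stack_dim=150, hidden_dim=200, n_actions=10)
log_probs = model(torch.randn(8, 300), torch.randn(150))
print(log_probs.shape)    # torch.Size([10])
```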

3.7 Next Token Prediction

This section presents various functions g for predicting the next logical form token
(i.e., a specific entity or relation). A hurdle in semantic parsing concerns handling
mismatches between natural language and logical tokens in the target knowledge base.
For example, both utterances “Where did X graduate from” and “Where did X get his PhD”
would trigger the same predicate education in Freebase. Traditional semantic parsers
map utterances directly to domain-specific logical forms, relying exclusively on a set of
lexicons either predefined or learned for the target domain with only limited coverage.
Recent approaches alleviate this issue by first mapping the utterance to a domain-general
logical form that aims to capture language-specific semantic aspects, after which
ontology matching is performed to handle mismatches (Kwiatkowski et al. 2013; Reddy,
Lapata, and Steedman 2014; Reddy et al. 2016). Beyond efficiency considerations, it
remains unclear which domain-general representation is best suited to domain-specific
semantic parsing.

Neural networks provide an alternative solution: The matching between natural
language and domain-specific predicates is accomplished via an attention layer, which
encodes a context-sensitive probabilistic lexicon. This is analogous to the application
of the attention mechanism in machine translation (Bahdanau, Cho, and Bengio 2015),
which is used as an alternative to conventional phrase tables. In this work, we consider
a practical domain-specific semantic parsing scenario where we are given no lexicon.
We first introduce the basic form of attention used to predict logical form tokens and
then discuss various extensions, as shown in Figure 2.

Soft Attention. In the case where no lexicon is provided, we use a soft attention layer
similar to action prediction. The parameters of the soft attention layer prior to softmax
are shared with those used in action prediction:

$$u_t^i = V \tanh(W_b b_i + W_s s_t) \qquad (26)$$

$$\alpha_t^i = \mathrm{softmax}(u_t^i) \qquad (27)$$

$$\bar{b}_t = \sum_i \alpha_t^i\, b_i \qquad (28)$$

$$y_{t+1} \sim \mathrm{softmax}(W_{oy} \tanh(W_f\,[\bar{b}_t, s_t])) \qquad (29)$$


Figure 2
Different attention mechanisms for predicting the next logical form token. The example
utterance is which daughter of Barack Obama was named Most Influential Teens in the year
2014? and the corresponding logical form to be generated is and(daughterOf(barack obama),
InfluentialTeensByYear(2014)). The figure shows attention for predicting
InfluentialTeensByYear: soft attention distributes weight over all utterance tokens, hard
attention selects a single utterance token, and structured attention covers a contiguous subset of
utterance tokens. Darker shading indicates higher values.

which outputs the parameters of the multinomial distribution over logical form tokens
(either predicates or entities). When dealing with extremely large knowledge bases,
the output space can be pruned and restricted with an entity linking procedure. This
method requires us to identify potential entity candidates in the sentence, and then
generate only entities belonging to this subset and the relations linking them.

Structured Soft Attention. We also explored a structured attention layer (Kim et al. 2017;
Liu and Lapata 2018) to encourage the model to attend to contiguous natural language
phrases when generating a logical token, while still being differentiable.

The structured attention layer we adopt is a linear-chain conditional random field
(Lafferty, McCallum, and Pereira 2001). Assume that at time step t each token in the
buffer (e.g., the ith token) is assigned an attention label $A_t^i \in \{0, 1\}$. The conditional
random field defines $p(A_t)$, the probability of the sequence of attention labels at time
step t, as

$$p(A_t) = \frac{\exp \sum_i W_f \cdot \psi(A_t^{i-1}, A_t^i, b_i, s_t)}{\sum_{A_t^1, \cdots, A_t^n} \exp \sum_i W_f \cdot \psi(A_t^{i-1}, A_t^i, b_i, s_t)} \qquad (30)$$

where $\sum_i$ sums over all tokens and $\sum_{A_t^1, \cdots, A_t^n}$ sums over all possible sequences of
attention labels; $W_f$ is a weight vector and $\psi(A_t^{i-1}, A_t^i, b_i, s_t)$ is a feature vector. In this work
the feature vector is defined with three dimensions: the state feature for each token

$$u_t^i \cdot A_t^i \qquad (31)$$

where $u_t^i$ is the token-specific attention score computed in Equation (26); the transition
feature

$$A_t^{i-1} \cdot A_t^i \qquad (32)$$


and the context-dependent transition feature

$$u_t^i \cdot A_t^{i-1} \cdot A_t^i \qquad (33)$$

The marginal probability $p(A_t^i = 1)$ of each token being selected is computed with
the forward–backward message passing algorithm (Lafferty, McCallum, and Pereira
2001). The procedure is shown in Figure 3. To compare with standard soft attention,
we denote this procedure as

$$\alpha_t^i = \text{forward-backward}(u_t^i) \qquad (34)$$

The marginal probabilities are used as in standard soft attention to compute an
adaptive buffer representation:

$$\bar{b}_t = \sum_i \alpha_t^i\, b_i \qquad (35)$$

which is then used to compute a distribution of output logical form tokens:

$$y_{t+1} \sim \mathrm{softmax}(W_{oy} \tanh(W_f\,[\bar{b}_t, s_t])) \qquad (36)$$

Objective: Predict the next logical form token given the current stack representation
st and n input token representations in the buffer b1 · · · bn.
Steps:

1. Compute the logit $u_t^i$ of each input token $b_i$ as $u_t^i = V \tanh(W_b b_i + W_s s_t)$.
   The logit will be used to compute the first and third feature in ψ.

2. Forward algorithm: Initialize $\beta(A_t^1) = 1$.
   For $i \in \{2 \cdots n\}$, $A_t^i \in \{0, 1\}$: $\beta(A_t^i) = \sum_{A_t^{i-1} \in \{0,1\}} \beta(A_t^{i-1}) \times \psi(A_t^{i-1}, A_t^i, b_i, s_t)$,
   where ψ is the context-dependent feature vector.

3. Backward algorithm: Initialize $\gamma(A_t^n) = 1$.
   For $i \in \{1 \cdots (n-1)\}$, $A_t^i \in \{0, 1\}$: $\gamma(A_t^i) = \sum_{A_t^{i+1} \in \{0,1\}} \gamma(A_t^{i+1}) \times \psi(A_t^i, A_t^{i+1}, b_i, s_t)$,
   where ψ is the context-dependent feature vector.

4. Compute the marginal probability $\alpha_t^i$ of each input token $b_i$: $\alpha_t^i = \beta(A_t^i) \times \gamma(A_t^i)$.

5. Apply soft attention to compute an adaptive buffer representation: $\bar{b}_t = \sum_i \alpha_t^i b_i$.

6. Predict the next token: $y_{t+1} \sim \mathrm{softmax}(W_{oy} \tanh(W_f\,[\bar{b}_t, s_t]))$.

7. Compute the error and backpropagate.

Figure 3
The structured attention model for token prediction.


The structured attention layer is soft and fully differentiable and allows us to model
attention over phrases because the forward–backward algorithm implicitly sums over
an exponentially sized set of substructures through dynamic programming.
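A minimal sketch of the forward–backward computation of per-token attention marginals for a binary-label chain, corresponding to Equation (30) and Figure 3, is given below; the potentials and example numbers are illustrative assumptions.

```python
# Forward-backward marginals for the linear-chain CRF attention layer (Equation (30) and
# Figure 3), written as an illustrative numpy sketch with binary attention labels.

import numpy as np

def attention_marginals(unary, transition):
    """unary[i, a]: score of assigning label a in {0, 1} to token i (derived from u_t^i);
    transition[a, b]: score of moving from label a to label b. Returns p(A^i = 1)."""
    n = unary.shape[0]
    psi = np.exp(unary)                              # state potentials
    T = np.exp(transition)                           # transition potentials
    alpha = np.zeros((n, 2)); beta = np.zeros((n, 2))
    alpha[0] = psi[0]
    for i in range(1, n):                            # forward pass
        alpha[i] = psi[i] * (alpha[i - 1] @ T)
    beta[-1] = 1.0
    for i in range(n - 2, -1, -1):                   # backward pass
        beta[i] = T @ (psi[i + 1] * beta[i + 1])
    marginal = alpha * beta
    marginal /= marginal.sum(axis=1, keepdims=True)  # normalize per token
    return marginal[:, 1]                            # probability that each token is attended to

scores = np.array([0.1, 2.0, 1.8, -0.5])             # token-specific attention scores u_t^i
unary = np.stack([np.zeros_like(scores), scores], axis=1)
transition = np.array([[0.5, 0.0], [0.0, 0.5]])       # rewards staying in the same label
print(attention_marginals(unary, transition).round(2))
```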

Hard Attention. Soft attention learns a complete mapping between natural language
and logical tokens with a differentiable neural layer. At every time step, every natural
language token in the utterance is assigned the probability of triggering every logical
predicate. This offers little in the way of interpretability. In order to render the inner
workings of the model more transparent, we explore the use of a hard attention mecha-
nism as a means of rationalizing neural predictions.

At each time step, hard attention samples from the attention probability a single

natural language token xt:

$$u_t^i = V \tanh(W_b b_i + W_s s_t) \qquad (37)$$

$$x_t \sim \mathrm{softmax}(u_t^i) \qquad (38)$$

The representation of xt, denoted by bt, is then used to predict the logical token yt:

$$y_{t+1} \sim \mathrm{softmax}(W_{oy} \tanh(W_f\,[b_t, s_t])) \qquad (39)$$

Hard attention is nevertheless challenging to optimize; it requires sampling
symbols (i.e., non-differentiable representations) inside an end-to-end module, which may
incur high variance. In practice, we adopt a baseline method to reduce the variance of
the predictor, which we discuss in Section 3.8.1.

Binomial Hard Attention. Learning difficulties aside, a limitation of hard attention lies
in selecting a single token to attend to at each time step. In practice, a logical form
predicate is often triggered by a natural language phrase or a multi-word expression. A
way to overcome this limitation is to compute a binomial distribution for every token
separately, indicating the probability of the token being selected. An attention label
is then assigned to each token based on this probability (e.g., with threshold 0.5). Let $A_t^i \in
\{0, 1\}$ denote the attention label of the ith token at time step t. Using the unnormalized
attention score $u_t^i$ computed in Equation (26), we obtain the probability $p(A_t^i = 1)$ as

$$p(A_t^i = 1) = \mathrm{logistic}(u_t^i) \qquad (40)$$

where logistic denotes a logistic regression classifier. We compute the adaptive buffer
representation as an average of the selected token embeddings:

$$\bar{b}_t = \frac{1}{\sum_i A_t^i} \sum_i A_t^i\, b_i \qquad (41)$$

which is then used to compute a distribution over the output logical form tokens:

$$y_{t+1} \sim \mathrm{softmax}(W_{oy} \tanh(W_f\,[\bar{b}_t, s_t])) \qquad (42)$$
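A small sketch of binomial hard attention (Equations (40)–(42)) is shown below; the fallback to the highest-scoring token when no probability crosses the threshold is an added safeguard, not something specified in the paper.

```python
# Binomial hard attention sketch (Equations (40)-(42)): each token is selected
# independently with a logistic gate and the selected embeddings are averaged.

import torch

def binomial_hard_attention(u, buffer, threshold=0.5):
    """u: (k,) unnormalized attention scores; buffer: (k, d) token representations."""
    p_select = torch.sigmoid(u)                      # p(A^i = 1), Equation (40)
    labels = (p_select > threshold).float()          # per-token attention labels
    if labels.sum() == 0:                            # fallback (assumption): keep the best-scoring token
        labels[u.argmax()] = 1.0
    b_bar = (labels.unsqueeze(-1) * buffer).sum(0) / labels.sum()   # Equation (41)
    return labels, b_bar

labels, b_bar = binomial_hard_attention(torch.tensor([0.2, 1.5, -0.7, 0.9]), torch.randn(4, 300))
print(labels, b_bar.shape)    # tensor([1., 1., 0., 1.]) torch.Size([300])
```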


3.8 Model Training

We now discuss how our neural semantic parser can be trained under different
conditions, that is, with access to utterances annotated with logical forms, when only
denotations are provided, and, finally, when neither logical forms nor denotations are
available (see Table 5).

3.8.1 Learning from Utterance-Logical Form Pairs. The most straightforward training setup
is fully supervised, making use of utterance-logical form pairs. Consider utterance x
with logical form l, whose structure is determined by a sequence of transition operations
a and a sequence of logical form tokens y. Our ultimate goal is to maximize the
conditional likelihood of the logical form given the utterance for all training data:

$$L = \sum_{(x,l) \in T} \log p(l \mid x) \qquad (43)$$

which can be decomposed into the action likelihood and the token likelihood:

$$\log p(l \mid x) = \log p(a \mid x) + \log p(y \mid x, a) \qquad (44)$$

Soft Attention. Our objective consists of two terms, one for the action sequence,

$$L_a = \sum_{(x,l) \in T} \log p(a \mid x) = \sum_{(x,l) \in T} \sum_{t=1}^{n} \log p(a_t \mid x) \qquad (45)$$

and one for the logical form token sequence,

$$L_y = \sum_{(x,l) \in T} \log p(y \mid x, a) = \sum_{(x,l) \in T} \sum_{t=1}^{n} \log p(y_t \mid x, a_t) \qquad (46)$$

These constitute the training objective for fully differentiable neural semantic parsers,
when (basic or structured) soft attention is used.
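An illustrative sketch of this fully supervised objective (Equations (43)–(46)) as a loss over one training example follows; the predictor outputs are stand-ins for the functions f and g of Equations (20)–(21), and the tensor shapes are assumptions.

```python
# Fully supervised training objective (Equations (43)-(46)): the loss is the sum of
# the action negative log-likelihood and the token negative log-likelihood.

import torch

def supervised_loss(action_log_probs, gold_actions, token_log_probs, gold_tokens):
    """action_log_probs: (n, |A|) log p(a_t | x); token_log_probs: (m, |V|) log p(y_t | x, a_t);
    gold_actions / gold_tokens: index tensors of the annotated derivation."""
    loss_a = -action_log_probs[torch.arange(len(gold_actions)), gold_actions].sum()   # -L_a
    loss_y = -token_log_probs[torch.arange(len(gold_tokens)), gold_tokens].sum()      # -L_y
    return loss_a + loss_y                                                            # minimizing maximizes (44)

loss = supervised_loss(torch.log_softmax(torch.randn(10, 12), -1), torch.randint(0, 12, (10,)),
                       torch.log_softmax(torch.randn(6, 500), -1), torch.randint(0, 500, (6,)))
print(loss)
```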

Table 5
Example data for various semantic parsing training regimes.

Full supervision: utterance-logical form pairs
utterance: which daughter of Barack Obama was named Most Influential Teens in the year 2014?
logical form: y(daughterOf(barack obama), InfluentialTeensByYear(2014))

Weak supervision: utterance-denotation pairs
utterance: which daughter of Barack Obama was named Most Influential Teens in the year 2014?
denotation: Malia Obama

Distant supervision: entity-masked utterances
utterance: Malia Obama, the daughter of Barack Obama, was named Most Influential Teens in the year 2014.
artificial utterance: blank , the daughter of Barack Obama, was named Most Influential Teens in the year 2014.
denotation: Malia Obama


Hard Attention. When hard attention is used for token prediction, the objective L_a remains the same but L_y differs. This is because the hard attention layer is non-differentiable, so errors cannot be backpropagated through it. We instead use a REINFORCE-style algorithm (Williams 1992) for backpropagation. In this scenario, the neural attention layer acts as a policy predictor that emits an attention choice, while the subsequent neural layers act as the value function that computes a reward, namely a lower bound of the log likelihood log p(y|x, a). Let u_t denote the latent attention choice2 at each time step t; we maximize the expected log likelihood of the logical form tokens given the overall attention choice for all examples, which by Jensen's inequality is a lower bound on the log likelihood log p(y|x, a):

L_y = \sum_{(x,l) \in T} \sum_{u} p(u|x, a) \log p(y|u, x, a)
    \le \sum_{(x,l) \in T} \log \sum_{u} p(u|x, a)\, p(y|u, x, a)
    = \sum_{(x,l) \in T} \log p(y|x, a)    (47)

The gradient of L_y with respect to model parameters θ is given by

\frac{\partial L_y}{\partial \theta}
  = \sum_{(x,l) \in T} \sum_{u} \Big[ p(u|x, a) \frac{\partial \log p(y|u, x, a)}{\partial \theta} + \log p(y|u, x, a) \frac{\partial p(u|x, a)}{\partial \theta} \Big]
  = \sum_{(x,l) \in T} \sum_{u} \Big[ p(u|x, a) \frac{\partial \log p(y|u, x, a)}{\partial \theta} + \log p(y|u, x, a)\, p(u|x, a) \frac{\partial \log p(u|x, a)}{\partial \theta} \Big]
  = \sum_{(x,l) \in T} \sum_{u} p(u|x, a) \Big[ \frac{\partial \log p(y|u, x, a)}{\partial \theta} + \log p(y|u, x, a) \frac{\partial \log p(u|x, a)}{\partial \theta} \Big]    (48)

  \approx \sum_{(x,l) \in T} \frac{1}{K} \sum_{k=1}^{K} \Big[ \frac{\partial \log p(y|u^k, x, a)}{\partial \theta} + \log p(y|u^k, x, a) \frac{\partial \log p(u^k|x, a)}{\partial \theta} \Big]

which is estimated by the Monte Carlo estimator with K samples. This gradient estimator incurs high variance because the reward term \log p(y|u^k, x, a) is dependent on the samples of u^k. An input-dependent baseline is used to reduce the variance, which adjusts the gradient update as

\frac{\partial L_y}{\partial \theta} = \sum_{(x,l) \in T} \frac{1}{K} \sum_{k=1}^{K} \Big[ \frac{\partial \log p(y|u^k, x, a)}{\partial \theta} + \big(\log p(y|u^k, x, a) - b\big) \frac{\partial \log p(u^k|x, a)}{\partial \theta} \Big]    (49)

As baseline, we use the soft attention token predictor described earlier. The effect is to encourage attention samples that return a higher reward than standard soft attention, while discouraging those resulting in a lower reward. For each training case, we approximate the expected gradient with a single sample of u^k.

2 In standard hard attention, the choice is a single token in the sentence, whereas in binomial hard attention it is a phrase.
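A compact PyTorch-style sketch of this single-sample update is given below; the surrogate-loss formulation and the detached soft-attention baseline are our own assumptions about how Equation (49) might be implemented, not the authors' code.

    import torch

    def hard_attention_token_loss(attn_logits, token_logp_given_u, baseline_logp):
        # attn_logits:        [n] unnormalized attention scores over buffer tokens at this step
        # token_logp_given_u: callable u -> log p(y_t | u, x, a), differentiable in the model
        # baseline_logp:      log p(y_t | x, a) from the soft attention predictor (the baseline b)
        policy = torch.distributions.Categorical(logits=attn_logits)
        u = policy.sample()                              # latent attention choice u^k
        logp_y = token_logp_given_u(u)                   # reward term log p(y | u, x, a)
        advantage = (logp_y - baseline_logp).detach()    # (log p(y | u, x, a) - b), no gradient
        # Differentiating this surrogate reproduces the two terms of Equation (49).
        surrogate = logp_y + advantage * policy.log_prob(u)
        return -surrogate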

3.8.2 Learning from Utterance-Denotation Pairs. Unfortunately, training data consisting of utterances and their corresponding logical forms are difficult to obtain at large scale, and as a result are limited to a few domains with a small number of logical predicates. An alternative to full supervision is a weakly supervised setting in which the semantic parser is trained on utterance-denotation pairs and logical forms are treated as latent.

In the following we first provide a brief review of conventional weakly supervised
semantic parsing systems (Berant et al. 2013a), and then explain the extension of our
neural semantic parser to a similar setting. Conventional weakly supervised semantic
parsing systems separate the parser from the learner (Liang 2016). A chart-based (non-
parametrized) parser will recursively build derivations for each span of an utterance,
eventually obtaining a list of candidate derivations mapping the utterance to its logical
forma. The learner (which is often a log-linear model) defines features useful for scoring
and ranking the set of candidate derivations, and is trained based on the correctness
of their denotations. As noted in Liang (2016), chart-based parsing has the disadvantage that it does not support incremental contextual interpretation: features of a span can only depend on the sub-derivations within that span, a requirement imposed by dynamic programming.

Different from chart-based parsers, a neural semantic parser is itself a parametrized
model and is able to leverage global utterance features (via attention) for decoding.
Sin embargo, training the neural parser directly with utterance-denotation pairs is chal-
lenging because the decoder does not have access to gold standard logical forms for
backpropagation. Además, the neural decoder is a conditional generative model that
generates logical forms in a greedy fashion and therefore lacks the ability to make global
judgments of logical forms. Para tal fin, we follow the conventional set-up in integrating
our neural semantic parser with a log-linear ranker, to cope with the weak supervision
señal. The role of the neural parser is to generate a list of candidate logical forms,
while the ranker is able to leverage global features of utterance-logical form-denotation
triplets to select which candidate to use for execution.

The objective of the log-linear ranker is to maximize the log marginal likelihood of

the denotation d via latent logical forms l:

iniciar sesión p(d|X) = log

pag(yo|X)pag(d|X, yo)

(cid:88)

l∈L

(50)

where L denotes the set of candidate logical forms generated by the neural parser.
Note that p(d|X, yo) es igual 1 if the logical form executes to the correct denotation and
0 de lo contrario. Por esta razón, we can also write this equation as log (cid:80)
l∈L(C) pag(yo|X), dónde
l(C) is the set of consistent logical forms that execute to the correct denotation.

Specifically p(yo|X) is computed with a log-linear model:

pag(yo|X) =

(cid:80)

exp.(Fi(X, yo)i)
yo(cid:48)∈L exp(Fi(X, yo(cid:48))i)

(51)

where L is the set of candidate logical forms; φ is the feature function that maps
an utterance-logical form pair (and also the corresponding denotation) into a feature
vector; and θ denotes the weight parameter of the model.
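A small sketch of the ranker objective for a single utterance is given below, assuming the feature vectors of the candidate logical forms have already been extracted (array names and shapes are illustrative assumptions).

    import numpy as np

    def log_marginal_likelihood(features, consistent, theta):
        # features:   [m, k] feature vectors phi(x, l) for the m candidate logical forms
        # consistent: [m] boolean mask of candidates executing to the correct denotation
        # theta:      [k] weights of the log-linear model
        scores = features @ theta
        log_Z = np.logaddexp.reduce(scores)       # log sum_{l' in L} exp(phi(x, l') . theta)
        log_p_l = scores - log_Z                  # Equation (51) in log space
        # Equation (50): since p(d|x,l) is 0 or 1, marginalize over consistent candidates only.
        return np.logaddexp.reduce(log_p_l[consistent])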


Training such a system involves the following steps. Given an input utterance, el
neural parser first generates a list of candidate logical forms via beam search. Entonces
these candidate logical forms are executed and those that yield the correct denotation
are marked as consistent logical forms. The neural parser is then trained to maximize
the likelihood of these consistent logical forms, \sum_{l \in L(c)} \log p(l|x). Meanwhile, the ranker is trained to maximize the marginal likelihood of denotations, \log p(d|x).

Claramente, if the parser does not generate any consistent logical forms, no model
parameters will be updated. A challenge in this training paradigm is the fact that we
rely exclusively on beam search to find good logical forms from an exponential search
espacio. In the beginning of training, neural parameters are far from optimal, and as a
result good logical forms are likely to fall outside the beam. We alleviate this problem
by performing entity linking, which greatly reduces the search space. We determine the
identity of the entities mentioned in the utterance according to the knowledge base and
restrict the neural parser to generating logical forms containing only those entities.
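Schematically, one training step from an utterance-denotation pair proceeds as sketched below; the parser, ranker, and executor interfaces named here are placeholders for the components described above, not actual APIs.

    def weakly_supervised_step(parser, ranker, executor, utterance, denotation, beam_size=300):
        # All method names on `parser`, `ranker`, and `executor` are assumed placeholders.
        # 1. Generate candidate logical forms with beam search.
        candidates = parser.beam_search(utterance, beam_size)
        # 2. Execute each candidate and keep those yielding the correct denotation.
        consistent = [lf for lf in candidates if executor.execute(lf) == denotation]
        if not consistent:
            return  # no consistent logical form in the beam: no parameters are updated
        # 3. Train the parser to maximize the likelihood of the consistent logical forms.
        parser.maximize_likelihood(utterance, consistent)
        # 4. Train the ranker to maximize the marginal likelihood of the denotation.
        ranker.maximize_marginal_likelihood(utterance, candidates, consistent)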

3.8.3 Distant Supervision. Although it allows scaling semantic parsing to large open-domain problems (Kwiatkowski et al. 2013; Berant et al. 2013a; Yao and Van Durme 2014), the creation of utterance-denotation pairs still relies on labor-intensive crowdsourcing. A promising research direction is to use a form of distant supervision, where training data (e.g., artificial utterance-denotation pairs) are generated automatically from available resources (e.g., a knowledge base, Wikipedia documents). In this work, we ad-
ditionally train the weakly supervised neural semantic parser with a distant supervi-
sion approach proposed by Reddy, Lapata, and Steedman (2014). In this setting, el
given data is a corpus of entity-recognized sentences and a knowledge base. Utterance-
denotation pairs are artificially created by replacing entity mentions in the sentences
with variables. Entonces, the semantic parser is trained to predict the denotation for the
variable that includes the mentioned entity. Por ejemplo, given the declarative sen-
tence NVIDIA was founded by Jen-Hsun Huang and Chris Malachowsky, the distant
supervision approach creates the utterance NVIDIA was founded by Jen-Hsun Huang
and blank paired with the corresponding denotation Chris Malachowsky. En algunos
casos, even stronger constraints can be applied. Por ejemplo, if the mention is preceded
by the word the, then the correct denotation includes exactly one entity. En suma, el
approach converts the corpus of entity-recognized sentences into artificial utterance-
denotation pairs on which the weakly supervised model described in Section 3.8.2
can be trained. We also aim to evaluate whether this approach is helpful for practical
question answering.
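The data-creation step itself is simple string manipulation; a sketch is shown below (the blank placeholder follows Table 5, while the function and argument names are illustrative assumptions).

    def create_distant_examples(sentence, entity_mentions):
        # Turn one entity-recognized declarative sentence into artificial
        # utterance-denotation pairs by masking one entity mention at a time.
        examples = []
        for mention in entity_mentions:
            utterance = sentence.replace(mention, "blank", 1)
            examples.append({"utterance": utterance, "denotation": mention})
        return examples

    # create_distant_examples(
    #     "NVIDIA was founded by Jen-Hsun Huang and Chris Malachowsky",
    #     ["Jen-Hsun Huang", "Chris Malachowsky"])
    # yields, among others, the pair
    # ("NVIDIA was founded by Jen-Hsun Huang and blank", "Chris Malachowsky").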

4. experimentos

En esta sección, we present our experimental set-up for assessing the performance of
the neural semantic parsing framework. We present the data sets on which our model
was trained and tested, discuss implementation details, and finally report and analyze
semantic parsing results.

4.1 Data Sets

We evaluated our model on the following data sets, which cover different domains and
require different types of supervision.

GEOQUERY (Zelle and Mooney 1996) contiene 880 questions and database queries
about US geography. The utterances are compositional, but the language is simple and


vocabulary size small (698 entities and 24 relaciones). Model training on this data set is
fully supervised (Section 3.8.1).

WEBQUESTIONS (Berant et al. 2013b) contiene 5,810 question-answer pairs. Es
based on Freebase and the questions are not very compositional. Sin embargo, ellos son
real questions asked by people on the Web.

GRAPHQUESTIONS (Su et al. 2016) contiene 5,166 question-answer pairs that were
created by showing 500 Freebase graph queries to Amazon Mechanical Turk work-
ers and asking them to paraphrase them into natural language. Model training on
WEBQUESTIONS and GRAPHQUESTIONS is weakly supervised (Sección 3.8.2).

SPADES (Bisk et al. 2016) contiene 93,319 questions derived from CLUEWEB09
(Gabrilovich, Ringgaard, and Subramanya 2013) oraciones. Específicamente, the questions
were created by randomly removing an entity, thus producing sentence-denotation
pares (Reddy, Lapata, and Steedman 2014). The sentences include two or more entities
and although they are not very compositional, they constitute a large-scale data set for
neural network training with distant supervision (Sección 3.8.3).

4.2 Detalles de implementacion

Shared Parameters. Across training regimes, the dimensions of word vector, logical form
token vector, and LSTM hidden state are 50, 50, y 150, respectivamente. Word embeddings
were initialized with GloVe embeddings (Pennington, Socher, and Manning 2014). Todo
other embeddings were randomly initialized. We used one LSTM layer in forward and
backward directions. Dropout was used on the combined feature representation of the
buffer and the stack (Ecuación (25)), which computes the softmax activation of the next
action or token. The dropout rate was set to 0.5. Finalmente, momentum SGD (Sutskever
et al. 2013) was used as the optimization method to update the parameters of the model.
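For reference, the shared hyperparameters listed above can be summarized in a single configuration; the dictionary below is merely a reading convenience, not the authors' code.

    SHARED_HPARAMS = {
        "word_dim": 50,           # word embeddings, initialized with GloVe
        "lf_token_dim": 50,       # logical form token embeddings, randomly initialized
        "lstm_hidden_dim": 150,   # hidden state size of each LSTM direction
        "lstm_layers": 1,         # one layer in the forward and backward directions
        "dropout": 0.5,           # applied to the combined buffer-stack representation
        "optimizer": "momentum SGD",
    }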

Entity Resolution. Among the four data sets, only GEOQUERY contains annotated logical
forms that can be used to directly train a neural semantic parser. For the other three data
conjuntos, supervision is indirect via consistent logical forms validated on denotations (ver
Sección 3.8.2). As mentioned earlier, we use entity linking to reduce the search space for
consistent logical forms. Entity mentions in SPADES are automatically annotated with
Freebase entities (Gabrilovich, Ringgaard, and Subramanya 2013). For WEBQUESTIONS
and GRAPHQUESTIONS we perform entity linking following the procedure described
in Reddy et al. (2016). We identify potential entity spans using seven handcrafted
part-of-speech patterns and associate them with Freebase entities obtained from the
Freebase/KG API (http://developers.google.com/freebase/). For each candidate
entity span, we retrieve the top 10 entities according to the API. We treat each possibility
as a candidate entity to construct candidate utterances with beam search of size 500,
among which we look for the consistent logical forms.
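A schematic sketch of this candidate-generation step is given below; the part-of-speech patterns shown and the kb_lookup function are stand-ins for the handcrafted patterns and the knowledge-base API calls, whose exact definitions are not given here.

    def candidate_entities(tagged_tokens, pos_patterns, kb_lookup, top_k=10):
        # tagged_tokens: list of (word, pos_tag) pairs for the utterance
        # pos_patterns:  handcrafted POS patterns, e.g. {("NNP",), ("NNP", "NNP")}  (assumed examples)
        # kb_lookup:     span text -> ranked list of knowledge-base entity ids (stand-in for the API)
        candidates = {}
        n = len(tagged_tokens)
        for length in range(1, n + 1):
            for start in range(n - length + 1):
                span = tagged_tokens[start:start + length]
                if tuple(tag for _, tag in span) in pos_patterns:
                    text = " ".join(word for word, _ in span)
                    candidates[text] = kb_lookup(text)[:top_k]   # keep the top-10 entities per span
        return candidates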

Discriminative Ranker. For data sets that use denotations as supervision, our semantic
parsing system additionally includes a discriminative ranker, whose role is to select
the final logical form to execute from a list of candidates generated by the neural
semantic parser. en el momento de la prueba, the generation process is accomplished by beam search
with beam size 300. The ranker, which is a log-linear model, is trained with momentum
SGD (Sutskever et al. 2013). As features, we consider the embedding cosine similarity
between the utterance (excluding stop-words) and the logical form, the token overlap
count between the two, and also similar features between the lemmatized utterance and
the logical form. Además, we include as features the embedding cosine similarity


between the question words and the logical form, the similarity between the question
palabras (p.ej., qué, OMS, dónde, cuyo, fecha, cual, how many, count ) and relations
in the logical form, and the similarity between the question words and answer type as
indicated by the last word in the Freebase relation (Xu et al. 2016). Finalmente, we add as a
feature the length of the denotation given by the logical form (Berant et al. 2013a).
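A sketch of a few of these features is shown below; the exact feature set, preprocessing (stop-word removal, lemmatization), and embedding choices are simplified assumptions.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def ranker_features(utt_emb, lf_emb, utt_tokens, lf_tokens, denotation):
        return np.array([
            cosine(utt_emb, lf_emb),                       # utterance-logical form embedding similarity
            float(len(set(utt_tokens) & set(lf_tokens))),  # token overlap count
            float(len(denotation)),                        # length of the denotation (answer set size)
        ])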

4.3 Resultados

En esta sección, we present the experimental results of our Transition-based Neural
Semantic Parser (TNSP). We present various instantiations of our own model as well
as comparisons against semantic parsers proposed in the literature.

Experimental results on GEOQUERY are shown in Table 6. The first block con-
tains conventional statistical semantic parsers, previously proposed neural models are
presented in the second block, and variants of TNSP are shown in the third block.
Específicamente, we build various top–down and bottom–up TNSP models using the various
types of attention introduced in Section 3.7. We report accuracy, which is defined as
the proportion of utterances that correctly parsed to their gold standard logical forms.
Among TNSP models, a top–down system with structured (soft) attention performs
best. En general, we observe that differences between top–down and bottom–up systems
are small; it is mostly the attention mechanism that affects performance, with hard
attention performing worst and soft attention performing best for both top–down and
bottom–up systems. TNSP outperforms previously proposed neural semantic parsers
that treat semantic parsing as a sequence transduction problem and use LSTMs to map
utterances to logical forms (Dong and Lapata 2016; Jia and Liang 2016). TNSP brings

Mesa 6
Fully supervised experimental results on the GEOQUERY data set. For Jia and Liang (2016), nosotros
include two of their results: one is a standard neural sequence to sequence model; and the other
is the same model trained with a data augmentation algorithm on the labeled data (reported in
parentheses).

Modelos

Accuracy

Zettlemoyer and Collins (2005)
Zettlemoyer and Collins (2007)
Kwiatkowksi et al. (2010)
Kwiatkowski et al. (2011)
Kwiatkowski et al. (2013)
Zhao and Huang (2015)
Liang, Jordán, and Klein (2011)

79.3
86.1
87.9
88.6
88.0
88.9
91.1

Dong and Lapata (2016)
Jia and Liang (2016)
Rabinovich, Stern, and Klein (2017)

84.6
85.0 (89.1)
87.1

TNSP, soft attention, top–down
TNSP, soft structured attention, top–down
TNSP, hard attention, top–down
TNSP, binomial hard attention, top–down
TNSP, soft attention, bottom–up
TNSP, soft structured attention, bottom–up
TNSP, hard attention, bottom–up
TNSP, binomial hard attention, bottom–up

86.8
87.1
85.3
85.5
86.1
86.8
85.3
85.3


performance improvements over these systems when using comparable data sources
para entrenamiento. Jia and Liang (2016) achieve better results with synthetic data that expand
GEOQUERY; we could adopt their approach to improve model performance; however,
we leave this to future work. Our system is on par with the model of Rabinovich,
Stern, and Klein (2017), who also output well-formed trees in a top–down manner using
a decoder built of many submodels, each associated with a specific construct in the
underlying grammar.

Results for the weakly supervised training scenario are shown in Table 7. Para
all Freebase-related data sets we use average F1 (Berant et al. 2013a) as our evalua-
tion metric. We report results on WEBQUESTIONS (Table 7a) and GRAPHQUESTIONS
(Table 7b). The first block in the tables groups conventional statistical semantic parsers,
the second block presents related neural models, and the third block variants of TNSP.
For fair comparison, we also built a baseline sequence-to-sequence model enhanced
with an attention mechanism (Dong and Lapata 2016).

On WEBQUESTIONS, the best performing TNSP system generates logical forms
based on top–down pre-order while using soft attention. The same top–down system
with structured attention performs closely. Again we observe that bottom–up preorder
lags behind. En general, our semantic parser obtains performance on par with the best
symbolic systems (see the first block in Table 7a). It is important to note that Bast
and Haussmann (2015) develop a question-answering system, cual, contrary to ours,
cannot produce meaning representations, whereas Berant and Liang (2015) propose
a sophisticated agenda-based parser that is trained borrowing ideas from imitation
aprendiendo. Reddy et al. (2016) learn a semantic parser via intermediate representations
that they generate based on the output of a dependency parser. TNSP performs competi-
tively despite not having access to linguistically informed syntactic structure. Acerca de
neural systems (see the second block in Table 7a), our model outperforms the sequence-
to-sequence baseline and other related neural architectures using similar resources.
Xu et al. (2016) represent the state of the art on WEBQUESTIONS. Their system uses
Wikipedia to prune out erroneous candidate answers extracted from Freebase. Nuestro
model would also benefit from a similar post-processing.

With respect to GRAPHQUESTIONS, we report F1 for various TNSP models (tercero
block in Table 7), and conventional statistical semantic parsers (first block in Table 7b).
The first three systems are presented in Su et al. (2016). De nuevo, we observe that a top–
down variant of TNSP with soft attention performs best. It is superior to the sequence-to-
sequence baseline and obtains performance comparable to Reddy et al. (2017) sin
making use of an external syntactic parser. The model of Dong et al. (2017) is state of the
art on GRAPHQUESTIONS. Their method is trained end-to-end using question-answer
pairs as a supervision signal together with question paraphrases as a means of capturing
different ways of expressing the same content. En tono rimbombante, their system is optimized
with question answering in mind, and does not produce logical forms.

When learning from denotations, a challenge concerns the handling of an expo-
nentially large set of logical forms. In our approach, we rely on the neural semantic
parser to generate a list of candidate logical forms by beam search. Idealmente, we hope
the beam size is large enough to include good logical forms that will be subsequently
selected by the discriminative ranker. Cifra 4 shows the effect of varying beam size
on GRAPHQUESTIONS (development set) when training executes for two epochs using
the TNSP soft attention model with top–down generation order. We report the number
of utterances that are answerable (es decir., an utterance is considered answerable if the
beam includes one or more good logical forms leading to the correct denotation) y
the number of utterances that are correctly answered eventually. As the beam size


Mesa 7
Weakly supervised experimental results on two data sets. Results with additional resources are
shown in parentheses.

(a) WEBQUESTIONS

Modelos

SEMPRE (Berant et al. 2013b)
JACANA (Yao and Van Durme 2014)
PARASEMPRE (Berant and Liang 2014)
AQQU (Bast and Haussmann 2015)
AGENDAIL (Berant and Liang 2015)
DEPLAMBDA (Reddy et al. 2016)

SUBGRAPH (Bordes, chopra, and Weston 2014)
MCCNN (Dong et al. 2015)
STAGG (Yih et al. 2015)
MCNN (Xu et al. 2016)
Sequence-to-sequence

TNSP, soft attention, top–down
TNSP, soft structured attention, top–down
TNSP, hard attention, top–down
TNSP, binomial hard attention, top–down
TNSP, soft attention, bottom–up
TNSP, soft structured attention, bottom–up
TNSP, hard attention, bottom–up
TNSP, binomial hard attention, bottom–up

(b) GRAPHQUESTIONS

Modelos

sempre (Berant et al. 2013b)
parasempre (Berant and Liang 2014)
jacana (Yao and Van Durme 2014)
SimpleGraph (Reddy et al. 2016)
UDepLambda (Reddy et al. 2017)

Sequence-to-sequence
PARA4QA (Dong et al. 2017)

TNSP, soft attention, top–down
TNSP, soft structured attention, top–down
TNSP, hard attention, top–down
TNSP, binomial hard attention, top–down
TNSP, soft attention, bottom–up
TNSP, soft structured attention, bottom–up
TNSP, hard attention, bottom–up
TNSP, binomial hard attention, bottom–up

F1

35.7
33.0
39.9
49.4
49.7
50.3

39.2
40.8
52.5
53.3
48.3

50.1
49.8
49.4
48.7
49.6
49.5
48.4
48.7

F1

10.8
12.8
5.1
15.9
17.6

16.2
20.4

17.3
17.1
16.2
16.4
16.9
17.1
16.8
16.5

aumenta, the gap between utterances that are answerable and those that are answered
correctly becomes larger. The curve for correctly answered utterances gradually plateaus, and performance does not improve further. This indicates a trade-off between gen-
erating candidates that cover good logical forms and picking the best logical form for


Figure 4
Fraction of utterances that are answerable versus those correctly predicted with varying beam size on the GRAPHQUESTIONS development set. [Plot: beam size (0-500) on the x-axis, fraction on the y-axis; one curve for answerable utterances and one for correctly answered utterances.]

execution: When the beam size is large, there is a higher chance for good logical forms
to be included but also for the discriminative ranker to make mistakes.

GRAPHQUESTIONS consists of four types of questions. As shown in Table 8, el
first type are relational questions (denoted by relation). An example of a relational
question is what periodic table block contains oxygen; the second type contains count
preguntas (denoted by count). An example is how many firefighters does the new
york city fire department have; the third type includes aggregation questions requiring
argmax or argmin (denoted by aggregation). An example is what human stampede
injured the most people; the last type are filter questions that require comparisons by >,
≥, <, and ≤ (denoted by filter). An example is which presidents of the united states weigh not less than 80.0 kg. Table 8 shows the number of questions broken down by type, as well as the proportion of answerable and correctly answered questions. As the results reveal, relation questions are the simplest to answer, which is expected since relation questions are non-compositional and their logical forms are easy to find by beam search. The remaining types of questions are rather difficult to answer: Although the system is able to discover logical forms that lead to the correct denotation during beam search, the ranker is not able to identify the right logical forms to execute. Aside from the compositional nature of these questions, which makes them hard to answer, another difficulty is that such questions are a minority in the data set, posing a learning challenge for the ranker to identify them. As future work, we plan to train separate rankers for different question types.

Table 8
Breakdown of questions answered by type for the GRAPHQUESTIONS.

Question type    Number    % Answerable    % Correctly answered
relation         1,938     0.499           0.213
count            309       0.421           0.032
aggregation      226       0.363           0.075
filter           135       0.459           0.096
All              2,608     0.476           0.173

Table 9
Distantly supervised experimental results on the SPADES data set.

Models                                         F1
Unsupervised CCG (Bisk et al. 2016)            24.8
Semi-supervised CCG (Bisk et al. 2016)         28.4
Supervised CCG (Bisk et al. 2016)              30.9
Rule-based system (Bisk et al. 2016)           31.4
Sequence-to-sequence                           28.6
TNSP, soft attention, top–down                 32.4
TNSP, soft structured attention, top–down      32.1
TNSP, hard attention, top–down                 31.5
TNSP, binomial hard attention, top–down        29.8
TNSP, soft attention, bottom–up                32.1
TNSP, soft structured attention, bottom–up     31.4
TNSP, hard attention, bottom–up                30.7
TNSP, binomial hard attention, bottom–up       30.4

Finally, Table 9 presents experimental results on SPADES, which serves as a testbed for our distant supervision setting. Previous work on this data set has used a semantic parsing framework where natural language is converted to an intermediate syntactic representation and then grounded to Freebase. Specifically, Bisk et al. (2016) evaluate the effectiveness of four different CCG parsers on the semantic parsing task when varying the amount of supervision required. As can be seen, TNSP outperforms all CCG variants (from unsupervised to fully supervised) without having access to any manually annotated derivations or lexicons. Again, we observe that a top–down TNSP system with soft attention performs best and is superior to the sequence-to-sequence baseline. The results on SPADES hold promise for scaling semantic parsing by using distant supervision. In fact, artificial data could potentially help improve weakly supervised question-answering models trained on utterance-denotation pairs. To this end, we use the entity-masked declarative sentences paired with their denotations in SPADES as additional training data for GRAPHQUESTIONS. We train the neural semantic parser with the combined training data and evaluate on the GRAPHQUESTIONS. We use the top–down, soft-attention TNSP model with a beam search size of 300.
During each epoch of training, the model was first trained with a mixture of the additional SPADES data and the original training data. Figure 5 shows the fraction of answerable and correctly answered questions generated by the neural semantic parser on GRAPHQUESTIONS. Note that the original GRAPHQUESTIONS training set consists of 1,794 examples and we report numbers when different amounts of SPADES training data are used. As the figure shows, using artificial training data allows us to improve the neural semantic parser on a question-answering task to some extent. This suggests that distant supervision is a promising direction for building practical semantic parsing systems. Because artificial training data can be abundantly generated to fit a neural parser, the approach can be used for data augmentation when question-answer pairs are limited.

Figure 5
Fraction of answerable and correctly answered questions in the GRAPHQUESTIONS when different amounts of the SPADES data are used.

However, we observe that the maximum gain occurs when 1,000 extra training examples are used, a size comparable to the original training set. After that, no further improvements are made when more training examples are used. We hypothesize that this is due to the disparities between utterance-denotation pairs created in distant supervision and utterance-denotation pairs gathered from real users. For example, given the declarative sentence NVIDIA was founded by Jen-Hsun Huang and Chris Malachowsky, the distant supervision approach creates the utterance NVIDIA was founded by Jen-Hsun Huang and blank and the corresponding denotation Chris Malachowsky. However, the actual question users may ask is Who founded NVIDIA together with Jen-Hsun Huang. This poses a challenge if the neural network is trained on one type of utterance and tested on another. We observe that the distribution mismatch quickly outweighs the benefit of additional artificial data. Future work will focus on how to alleviate this problem by generating more realistic data with an advanced question generation module. Another factor limiting performance is that SPADES mainly consists of relational questions without high-level predicates, such as count, filter, and aggregation, which are substantially harder to answer correctly (see Table 8).

To summarize, across experiments and training regimes, we observe that TNSP performs competitively while producing meaningful and well-formed logical forms. One characteristic of the neural semantic parser is that it generates tree-structured representations in an arbitrary canonical order, as a sequence of transition operations. We investigated two such orders, top–down pre-order and bottom–up post-order. Experimentally, we observed that pre-order generation provides marginal benefits over post-order generation. One reason for this is that, compared with sibling information, which the bottom–up system uses, parent information, used by the top–down system, is more important for subtree prediction.
We explored three attention mechanisms in our work, including soft attention, hard attention, and structured attention. Quantitatively, we observe that soft attention always outperforms hard attention in all three training set-ups. This can be attributed to the differentiability of the soft attention layer. The structured attention layer is also differentiable because it computes the marginal probability of each token being selected with a dynamic programming procedure. We observe that on GEOQUERY, which represents the fully supervised setting, structured attention offers marginal gains over soft attention. But in other data sets where logical forms are not given, the more structurally aware attention mechanism does not improve over soft attention, possibly because of the weaker supervision signal. However, it should be noted that the structured attention layer at each decoding step requires the forward-backward algorithm, which has time complexity O(2n^2) (where n denotes the utterance length) and therefore is much slower than soft attention, which has linear O(n) complexity.

Table 10
Hard attention and structured attention when predicting the relation in each question. The corresponding logical predicate is shown in parentheses.

Hard attention

Good selections:
the brickyard 400 was hosted at what venue? (base.nascar.nascar venue)
christian faith branched from what religion? (religion.religion)
which paintings are discovered in lascaux? (base.caveart.painting)

Bad selections:
which violent events started on 1995-04-07? (base.disaster2.attack)
who was the aircraft designer of the b-747? (aviation.aircraft designer)
the boinc has been used in which services? (base.centreforeresearch.service)

Neutral selections:
how does ultram act in the body? (medicine.drug mechanism of action)
microsoft has created which programming languages? (computer.programming language)
find un agencies founded in 1957 (base.unitednations.united nations agency)

Structured attention

Good selections:
the brickyard 400 was hosted at what venue? (base.nascar.nascar venue)
which violent events started on 1995-04-07? (base.disaster2.attack)
how does ultram act in the body? (medicine.drug mechanism of action)

Bad selections:
what is ehrlich’s affiliation? (education.department)
for which war was the italian armistice signed? (base.morelaw.war)
the boinc has been used in which services? (base.centreforeresearch.service)

Neutral selections:
where was the brickyard 400 held? (base.nascar.nascar venue)
by whom was paul influenced? (influence.influence node)
how does ultram act in the body? (medicine.drug mechanism of action)
Table 10 shows examples of hard and structured attention when predicting the key relational logical predicate. These examples were selected from GRAPHQUESTIONS using the top–down TNSP system. The table contains both meaningful token selections (where the selected tokens denote an informative relation) and non-meaningful ones.

5. Conclusions

In this article, we described a general neural semantic parsing framework that operates with a functional query language and generates tree-structured logical forms with transition-based neural networks. To tackle mismatches between natural language and logical form tokens, we introduced various attention mechanisms in the generation process. We also considered different training regimes, including fully supervised training where annotated logical forms are given, weakly supervised training where denotations are provided, and distant supervision where only unlabeled sentences and a knowledge base are available. Compared with previous neural semantic parsers, our model generates well-formed logical forms, and is more interpretable: hard and structured attention can be used to inspect what the model has learned.

When the training data consists of utterance-denotation pairs, we use a generative parser-discriminative ranker framework: The role of the parser is to (beam) search for candidate logical forms, which are subsequently re-scored by the ranker. This is in contrast to recent work (Neelakantan et al. 2017) on weakly supervised neural semantic parsing, where the parser is directly trained by reinforcement learning using denotations as reward. Advantageously, our framework uses beam search (in contrast to greedy decoding) to increase the likelihood of discovering correct logical forms in a candidate set. Meanwhile, the discriminative ranker is able to leverage global features on utterance-logical form-denotation triplets to score logical forms. In the future, we will compare the presented parser-ranker framework with reinforcement learning-based parsers.

Directions for future work are many and varied. Because the current semantic parser generates tree-structured logical forms conditioned on an input utterance, we could additionally exploit input information beyond sequences, such as dependency tree representations, resembling a tree-to-tree transduction model. To tackle long-term dependencies in the generation process, an intra-attention mechanism could be used (Cheng, Dong, and Lapata 2016; Vaswani et al. 2017). Additionally, when learning from denotations, it is possible that the beam search output contains spurious logical forms that lead to correct answers accidentally but do not represent the actual meaning of an utterance. Such logical forms are misleading training signals and should be removed (e.g., with a generative neural network component [Cheng, Lopez, and Lapata 2017], which scores how well a logical form represents the utterance semantics). Last but not least, because our semantic parsing framework provides a decomposition between domain-generic tree generation and the selection of domain-specific constants, we would like to further explore training the semantic parser in a multi-domain set-up (Herzig and Berant 2017), where the domain-generic parameters are shared.
f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 45, Number 1 References Andreas, Jacob, Andreas Vlachos, and Stephen Clark. 2013. Semantic parsing as machine translation. In Proceedings of the Association for Computational Linguistics, pages 47–52, Sofia. Artzi, Yoav and Luke Zettlemoyer. 2011. Bootstrapping semantic parsers from conversations. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 421–432, Edinburgh. Artzi, Yoav and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1(1):49–62. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 4th International Conference on Learning Representations, San Diego. Baker, Collin F., Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90, Montreal. Banarescu, Laura, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia. Bast, Hannah and Elmar Haussmann. 2015. More accurate question answering on Freebase. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 1431–1440, Melbourne. Berant, Jonathan, Andrew Chou, Roy Frostig, and Percy Liang. 2013a. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, WA. Berant, Jonathan, Andrew Chou, Roy Frostig, and Percy Liang. 2013b. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, WA. 90 Berant, Jonathan and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, Baltimore, MD. Berant, Jonathan and Percy Liang. 2015. Imitation learning of agenda-based semantic parsers. Transactions of the Association for Computational Linguistics, 3:545–558. Bisk, Yonatan, Siva Reddy, John Blitzer, Julia Hockenmaier, and Mark Steedman. 2016. Evaluating induced CCG parsers on grounded semantic parsing. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2022–2027, Austin, TX. Bordes, Antoine, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 615–620, Doha. Cai, Jonathon, Richard Shin, and Dawn Song. 2017. Making neural programming architectures generalize via recursion. arXiv:1074.06611. Cai, Qingqing and Alexander Yates. 2013. Large-scale semantic parsing via schema matching and lexicon extension. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 423–433, Sofia. Chen, David L. and Raymond J. Mooney. 2011. 
Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th Conference on Artificial Intelligence, volume 2, pages 859–865, San Francisco, CA. Cheng, Jianpeng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 551–561, Austin, TX. Cheng, Jianpeng, Adam Lopez, and Mirella Lapata. 2017. A generative parser with a discriminative recognition algorithm. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 118–124, Vancouver. Cheng, Jianpeng, Siva Reddy, Vijay Saraswat, and Mirella Lapata. 2017. Learning structured natural language representations for semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 5 9 1 8 0 9 6 5 6 / c o l i _ a _ 0 0 3 4 2 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Cheng, Reddy, Saraswat, and Lapata Learning an Executable Neural Semantic Parser (Volume 1: Long Papers), pages 44–55, Vancouver. Clarke, James, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world’s response. In Proceedings of the 14th Conference on Computational Natural Language Learning, pages 18–27, Uppsala. Dong, Li and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33–43, Berlin. Dong, Li, Jonathan Mallinson, Siva Reddy, and Mirella Lapata. 2017. Learning to paraphrase for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 886–897, Copenhagen. Dong, Li, Furu Wei, Ming Zhou, and Ke Xu. 2015. Question answering over Freebase with multi-column convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 260–269, Beijing. Dyer, Chris, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing. Dyer, Chris, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, CA. Fan, Xing, Emilio Monti, Lambert Mathias, and Markus Dreyer. 2017. Transfer learning for neural semantic parsing. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 48–56, Vancouver. Gabrilovich, Evgeniy, Michael Ringgaard, and Amarnag Subramanya. 2013. FACC1: Freebase annotation of ClueWeb corpora, version 1 (release date 2013-06-26, format version 1, correction level 0). lemurproject.org/clueweb12/FACC1. Goldman, Omer, Veronica Latcinnik, Udi Naveh, Amir Globerson, and Jonathan Berant. 2018. Weakly-supervised semantic parsing with abstract examples, pages 1809–1819, Melbourne. 
Herzig, Jonathan and Jonathan Berant. 2017. Neural semantic parsing over multiple knowledge-bases. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 623–628, Vancouver. Hochreiter, Sepp and J ¨urgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. Iyer, Srinivasan, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 963–973, Vancouver. Jia, Robin and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin. Kate, Rohit J. and Raymond J. Mooney. 2006. Using string-kernels for learning semantic parsers. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 913–920, Sydney. Kate, Rohit J., Yuk Wah Wong, and Raymond J. Mooney. 2005. Learning to Transform Natural to Formal Languages. In Proceedings for the 20th National Conference on Artificial Intelligence, pages 1062–1068, Pittsburgh, PA. Kim, Yoon, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. Structured attention networks. In 5th International Conference on Learning Representations, Toulon. Konstas, Ioannis, Srinivasan Iyer, Mark Yatskar, Yejin Choi, and Luke Zettlemoyer. 2017. Neural AMR: Sequence-to-sequence models for parsing and generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 146–157, Vancouver. Koˇcisk ´y, Tom´aˇs, G´abor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, and Karl Moritz Hermann. 2016. Semantic parsing with semi-supervised sequential autoencoders. In Proceedings of 91 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 5 9 1 8 0 9 6 5 6 / c o l i _ a _ 0 0 3 4 2 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 45, Number 1 the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1078–1087, Austin, TX. Krishnamurthy, Jayant, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1516–1526, Copenhagen. Krishnamurthy, Jayant and Tom Mitchell. 2012. Weakly supervised training of semantic parsers. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 754–765, Jeju Island. Krishnamurthy, Jayant and Tom M. Mitchell. 2014. Joint syntactic and semantic parsing with combinatory categorial grammar. In Proceedings of the Association for Computational Linguistics, pages 1188–1198, Baltimore, MD. Kuncoro, Adhiguna, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. What do recurrent neural network grammars learn about syntax? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1249–1258, Valencia. 
Kwiatkowski, Tom, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1223–1233, Cambridge, MA.
Kwiatkowski, Tom, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1545–1556, Seattle, WA.
Kwiatkowski, Tom, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1512–1523, Edinburgh.
Lafferty, John, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282–298, San Francisco, CA.
Liang, Chen, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. 2017. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23–33, Vancouver.
Liang, Percy. 2013. Lambda dependency-based compositional semantics. arXiv preprint arXiv:1309.4408.
Liang, Percy. 2016. Learning executable semantic parsers for natural language understanding. Communications of the ACM, 59(9):68–76.
Liang, Percy, Michael Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 590–599, Portland, OR.
Liu, Yang and Mirella Lapata. 2018. Learning structured text representations. Transactions of the Association for Computational Linguistics, 6:63–75.
Lu, Wei, Hwee Tou Ng, Wee Sun Lee, and Luke S. Zettlemoyer. 2008. A generative model for parsing natural language to meaning representations. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 783–792, Honolulu, HI.
Matuszek, Cynthia, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. In Proceedings of the 29th International Conference on Machine Learning, pages 1671–1678, Edinburgh.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In K. J. J. Hintikka, J. M. E. Moravcsik, and P. Suppes, editors, Approaches to Natural Language, volume 49 of Synthese Library. Springer Netherlands, pages 221–242.
Neelakantan, Arvind, Quoc V. Le, Martin Abadi, Andrew McCallum, and Dario Amodei. 2017. Learning a natural language interface with neural programmer. In 5th International Conference on Learning Representations, Toulon.
Neelakantan, Arvind, Quoc V. Le, and Ilya Sutskever. 2016. Neural programmer: Inducing latent programs with gradient descent. In 4th International Conference on Learning Representations, San Juan.
Pasupat, Panupong and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470–1480, Beijing.
Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha.
Popescu, Ana-Maria, Alex Armanasu, Oren Etzioni, David Ko, and Alexander Yates. 2004. Modern natural language interfaces to databases: Composing statistical parsing with semantic tractability. In Proceedings of COLING 2004, pages 141–147, Geneva.
Rabinovich, Maxim, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1139–1149, Vancouver.
Reddy, Siva, Mirella Lapata, and Mark Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics, 2:377–392.
Reddy, Siva, Oscar Täckström, Michael Collins, Tom Kwiatkowski, Dipanjan Das, Mark Steedman, and Mirella Lapata. 2016. Transforming dependency structures to logical forms for semantic parsing. Transactions of the Association for Computational Linguistics, 4:127–140.
Reddy, Siva, Oscar Täckström, Slav Petrov, Mark Steedman, and Mirella Lapata. 2017. Universal semantic parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 89–101, Copenhagen.
Reed, Scott and Nando de Freitas. 2016. Neural programmer-interpreters. In 4th International Conference on Learning Representations, San Juan.
Steedman, Mark. 2000. The Syntactic Process. MIT Press.
Su, Yu, Huan Sun, Brian Sadler, Mudhakar Srivatsa, Izzeddin Gur, Zenghui Yan, and Xifeng Yan. 2016. On generating characteristic-rich question sets for QA evaluation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 562–572, Austin, TX.
Suhr, Alane, Srinivasan Iyer, and Yoav Artzi. 2018. Learning to map context-dependent sentences to executable formal queries. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2238–2249, New Orleans.
Sutskever, Ilya, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1139–1147, Atlanta, GA.
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112, Montreal.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008, Long Beach, CA.
Wen, Tsung-Hsien, Milica Gasic, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721, Lisbon.
Williams, Ronald J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256.
Winograd, Terry. 1972. Understanding natural language. Cognitive Psychology, 3(1):1–191.
Wong, Yuk Wah and Raymond Mooney. 2006. Learning for semantic parsing with statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 439–446, New York, NY.
Woods, William A., Ronald M. Kaplan, and Bonnie Nash-Webber. 1972. The Lunar Sciences Natural Language Information System: Final report. Technical Report BBN Report 2378, Bolt Beranek and Newman.
Xu, Kun, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on Freebase via relation extraction and textual evidence. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2326–2336, Berlin.
Yao, Xuchen and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 956–966, Baltimore, MD.
Yih, Wen-tau, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1321–1331, Beijing.
Yin, Pengcheng and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–450, Vancouver.
Zelle, John M. and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the 13th National Conference on Artificial Intelligence, pages 1050–1055, Portland, OR.
Zelle, John Marvin. 1995. Using inductive logic programming to automate the construction of natural language parsers. Ph.D. thesis, University of Texas at Austin.
Zettlemoyer, Luke and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 678–687, Prague.
Zettlemoyer, Luke S. and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pages 658–666, Edinburgh.
Zhao, Kai and Liang Huang. 2015. Type-driven incremental semantic parsing with polymorphism. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1416–1421, Denver, CO.
Zhong, Victor, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.