Memory-Based Semantic Parsing

Parag Jain and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK

parag.jain@ed.ac.uk

mlap@inf.ed.ac.uk

Abstract

We present a memory-based model for context-dependent semantic parsing. Previous approaches focus on enabling the decoder to copy or modify the parse from the previous utterance, assuming there is a dependency between the current and previous parses. In this work, we propose to represent contextual information using an external memory. We learn a context memory controller that manages the memory by maintaining the cumulative meaning of sequential user utterances. We evaluate our approach on three semantic parsing benchmarks. Experimental results show that our model can better process context-dependent information and demonstrates improved performance without using task-specific decoders.

1 Introduction

Semantic parsing is the task of converting natural language utterances into machine-interpretable meaning representations such as executable queries or logical forms. It has emerged as an important component in many natural language interfaces (Özcan et al., 2020) with applications in robotics (Dukes, 2014), question answering (Zhong et al., 2018; Yu et al., 2018b), dialogue systems (Artzi and Zettlemoyer, 2011), and the Internet of Things (Campagna et al., 2017).

Neural network based approaches have led to significant improvements in semantic parsing (Zhong et al., 2018; Kamath and Das, 2019; Yu et al., 2018b; Yavuz et al., 2018; Yu et al., 2018a) across domains and semantic formalisms. The majority of existing studies focus on parsing utterances in isolation, and as a result they cannot readily transfer to more realistic settings where users ask multiple inter-related questions to satisfy an information need. In this work, we study context-dependent semantic parsing, focusing specifically on text-to-SQL generation, which has emerged as a popular application area in recent years.

Figure 1 shows a sequence of utterances in an interaction. The discourse focuses on a specific topic serving a specific information need, namely, finding out which Continental flights leave from Chicago on a given date and time. Importantly, interpreting each of these utterances, and mapping them to a database query to retrieve an answer, needs to be situated in a particular context as the exchange proceeds. The topic further evolves as the discourse transitions from one utterance to the next and constraints (e.g., TIME or PLACE) are added or revised. For example, in Q2 the TIME constraint before 10am from Q1 is revised to before noon, and in Q3 to before 2pm. Aside from such topic extensions (Chai and Jin, 2004), the interpretation of Q2 and Q3 depends on Q1, as it is implied that the questions concern Continental flights that go from Chicago to Seattle, not just any Continental flights; however, the phrase from Chicago to Seattle is elided from Q2 and Q3. The interpretation of Q4 depends on Q3, which in turn depends on Q1. Interestingly, Q5 introduces information with no dependencies on previous discourse and, in this case, relying on information from previous utterances will lead to incorrect SQL queries.

The problem of contextual language processing has been most widely studied within dialogue systems where the primary goal is to incrementally fill pre-defined slot-templates, which can then be used to generate appropriate natural language responses (Gao et al., 2019). But the rich semantics of SQL queries makes the task of contextual text-to-SQL parsing substantially different. Previous approaches (Suhr et al., 2018; Zhang et al., 2019) tackle this problem by enabling the decoder to copy or modify the previous queries under the assumption that they contain all necessary context for generating the current SQL query. The

Transactions of the Association for Computational Linguistics, vol. 9, pp. 1197–1212, 2021. https://doi.org/10.1162/tacl_a_00422
Action Editor: Mike Lewis. Submission batch: 4/2021; Revision batch: 6/2021; Published 11/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Figure 1: Example utterances from a user interaction in the ATIS dataset. Utterance segments referring to the
same entity or objects are in the same color. SQL queries corresponding to Q2–Q5 follow a pattern similar to Q1 and
are not shown for the sake of brevity.

utterance history is encoded in a hierarchical manner and although this is a good enough approximation for most queries (in existing datasets), it is not sufficient to model long-range discourse phenomena (Grosz and Sidner, 1986).

Our own work draws inspiration from Kintsch and van Dijk's (1978) text comprehension model. In their system the process of comprehension involves three levels of operations. Firstly, smaller units of meaning (i.e., propositions) are extracted and organized into a coherent whole (microstructure); some of these are stored in a working memory buffer and make it possible to decide whether new input overlaps with already processed propositions. Secondly, the gist of the whole is condensed (macrostructure). And thirdly, the previous two operations generate new texts in working with the memory. In other words, the (short and long term) memory of the reader gives meaning to the text read. They propose three macro rules, namely, deletion, generalization, and construction, as essential to reduce and organize the detailed information of the microstructure of the text. Furthermore, previous knowledge and experience are central to the interpretation of text, enabling the reader to fill information gaps.

Our work borrows several key insights from Kintsch and van Dijk (1978) without being a direct implementation of their model. Specifically, we also break down input utterances into smaller units, namely, phrases, and argue that this information can be effectively utilized in maintaining contextual information in an interaction. Moreover, the notion of a memory buffer that can be used to store and process new and old information plays a prominent role in our approach. We propose a Memory-based ContExt model (which we call MemCE for short) for keeping track of contextual information, and learn a context memory controller that manages the memory. Each interaction (sequence of user utterances) maintains its context using a memory matrix. User utterances are segmented into a sequence of phrases representing either new information to be added into the memory (e.g., that have a meal in Figure 1) or old information which might conflict with current information in memory and needs to be updated (e.g., before 10am should be replaced with before noon in Figure 1). Our model can inherently add new content to memory, read existing content by accessing the memory, and update old information.

We evaluate our approach on the ATIS (Suhr et al., 2018; Dahl et al., 1994), SParC (Yu et al., 2019b), and CoSQL (Yu et al., 2019a) datasets. We observe performance improvements when we combine MemCE with existing models, underscoring the importance of more specialized mechanisms for processing context information. Moreover, our model brings interpretability in how the context is processed. We are able to inspect the learned memory controller and analyze whether important discourse phenomena such as coreference and ellipsis are modeled.

2 Related Work

Sequence-to-sequence neural networks (Bahdanau et al., 2015) have emerged as a general modeling framework for semantic parsing, achieving impressive results across different domains and semantic formalisms (Dong and Lapata, 2016; Jia and Liang, 2016; Iyer et al., 2017; Wang et al., 2020; Zhong et al., 2018; Yu et al., 2018b, inter alia). The majority of existing work has focused on mapping natural language utterances into machine-readable meaning representations in isolation without utilizing context information. While this is useful for environments consisting of one-shot interactions of users with a system (e.g., running QA queries on a database), many settings require extended interactions between a user and an automated assistant (e.g., booking a flight). This makes the one-shot parsing model inadequate for many scenarios.

In this paper we are concerned with the lesser studied problem of contextualized semantic parsing where previous utterances are taken into account in the interpretation of the current utterance. Earlier work (Miller et al., 1996; Zettlemoyer and Collins, 2009; Srivastava et al., 2017) has focused on symbolic features for representing context, for example, by explicitly modeling discourse referents or the flow of discourse. More recent neural methods extend the sequence-to-sequence architecture to incorporate contextual information either by modifying the encoder or the decoder. Context-aware encoders resort to concatenating the current utterance with the utterances preceding it (Suhr et al., 2018; Zhang et al., 2019) or focus on the history of the utterances most relevant to the current decoder state (Liu et al., 2020). The decoders take context representations as additional input and often copy segments from the previous query (Suhr et al., 2018; Zhang et al., 2019). Hybrid approaches (Iyyer et al., 2017; Guo et al., 2019; Liu et al., 2020; Lin et al., 2019) employ neural networks for representation learning but use a grammar for decoding (e.g., a sequence of actions or an intermediate representation).

A tremendous amount of work has taken place in the context of discourse modeling focusing on extended texts (Mann and Thompson, 1988; Hobbs, 1985) and dialogue (Grosz and Sidner, 1986). Kintsch and van Dijk (1978) study the mental operations underlying the comprehension and summarization of text. They introduce propositions as the basic unit of text representation, and a model of how incoming text is processed given memory limitations; texts are reduced to important propositions (to be recalled later) using macro-operators (e.g., addition, deletion). Their model has met with popularity in cognitive psychology (Baddeley, 2007) and has also found application in summarization (Fang and Teufel, 2016).

Our work proposes a new encoder for con-
textualized semantic parsing. At the heart of our
approach is a memory controller that keeps track
of context via writing new information and updat-
ing old information. Our memory-based approach
is inspired by Kintsch and van Dijk (1978) and is
closest to Santoro et al. (2016), who use a memory
augmented neural network (Weston et al., 2015;
Sukhbaatar et al., 2015) for meta-learning. Spe-
cifically, they introduce a method for accessing
external memory which functions as short-term
storage for meta-learning. Although we report ex-
periments solely on semantic parsing, our encoder
is fairly general and could be applied to other
context-dependent tasks such as conversational
information seeking (Dalton et al., 2020) and infor-
mation retrieval (Sun and Chai, 2007; Voorhees,
2004).

3 Model

Our model is based on the encoder-decoder architecture (Cho et al., 2015) with the addition of a memory component (Sukhbaatar et al., 2015; Santoro et al., 2016) for incorporating context. Let $I = [X_i, Y_i]_{i=1}^{n}$ denote an interaction such that $X_i$ is the input utterance and $Y_i$ is the output SQL at interaction turn $I[i]$. At each turn $i$, given $X_i$ and all previous turns $I[1 \ldots i-1]$, our task is to predict SQL output $Y_i$.

As shown in Figure 2, our model consists of four components: (1) a memory matrix that retains discourse information, (2) a memory controller, which learns to access and manipulate the memory such that correct discourse information is retained, (3) utterance and phrase encoders, and (4) a decoder that interacts with the memory and utterance encoder using an attention mechanism to generate SQL output.

Figure 2: Overview of model architecture. Utterances are broken down into segments. Each segment is encoded
with the same encoder (same weights) and is processed independently. The context update controller learns to
manipulate the memory such that correct discourse information is retained.

3.1 Input Encoder

Each input utterance $X_i = (x_{i,1} \ldots x_{i,|X_i|})$ is encoded using a bi-directional LSTM (Hochreiter and Schmidhuber, 1997),

$$h^U_{i,j} = \mathrm{biLSTM}^U(e_{i,j};\ h^U_{i,j-1}) \tag{1}$$

where $e_{i,j} = \phi(x_{i,j})$ is a learned embedding corresponding to input token $x_{i,j}$ and $h^U_{i,j}$ is the concatenation of the forward and backward LSTM hidden representations at step $j$. As mentioned earlier, $X_i$ is also segmented into a sequence of phrases denoted as $X_i = (p_i^1 \ldots p_i^K)$, where $K$ is the number of phrases for utterance $X_i$. We provide details on how utterances are segmented into phrases in Section 4. For now, suffice it to say that they are obtained from the output of a chunker with some minimal postprocessing (e.g., to merge postmodifiers with NPs or VPs). Each phrase consists of tokens $p_i^k = (x_{i,[s_k : s_k + |p_i^k|]})$, such that $k \in [1, K]$ and $s_k = \sum_{z=1}^{k-1} |p_i^z|$. Each phrase $p_i^k$ is separately encoded using a bi-directional LSTM,

$$h^P_{i,k,j} = \mathrm{biLSTM}^P(e_{i,j};\ h^P_{i,k,j-1}) \tag{2}$$

such that $j \in [s_k : s_k + |p_i^k|]$. As shown in Figure 2, every phrase $p_i^k$ in utterance $i$ is separately encoded using $\mathrm{biLSTM}^P$ to obtain a phrase representation $h^P_{i,k}$ by concatenating the final forward and backward hidden representations.
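
As a small illustration of the phrase indexing above, the following sketch computes the start offsets $s_k$ from the phrase lengths and slices the token sequence accordingly. The function name `phrase_spans` and the example segmentation are hypothetical and chosen only for illustration; offsets are 0-based with exclusive ends, and the actual segments come from the chunker described in Section 4.

```python
from itertools import accumulate


def phrase_spans(phrase_lengths):
    """Return (start, end) token offsets for each phrase.

    The start of phrase k is the total length of all preceding phrases,
    mirroring s_k = sum_{z<k} |p^z| (0-based here, end exclusive).
    """
    starts = [0] + list(accumulate(phrase_lengths))[:-1]
    return [(s, s + n) for s, n in zip(starts, phrase_lengths)]


# Hypothetical segmentation of one utterance into four phrases.
tokens = "what continental flights go from chicago to seattle".split()
phrases = [["what", "continental", "flights"], ["go"],
           ["from", "chicago"], ["to", "seattle"]]

spans = phrase_spans([len(p) for p in phrases])
# Slicing the utterance with the spans recovers the phrases.
recovered = [tokens[s:e] for s, e in spans]
```

Each span selects the token embeddings that would be passed to the phrase encoder for that phrase.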

3.2 Context Memory

Our context memory is a matrix $M_i \in \mathbb{R}^{L \times d}$ with $L$ memory slots, each of dimension $d$, where $i$ indexes the state of the memory matrix at the $i$th interaction turn. The goal of context memory is to maintain relevant information required to parse the input utterance at each turn. As shown in Figure 2, this is achieved by learning a context update controller, which is responsible for updating the memory at each turn.

For each phrase $p_i^k$ belonging to a sequence of phrases within utterance $X_i$, the controller decides whether it contains old information that conflicts with information present in the memory or new information that has to be added to the current context. When novel information is introduced, the controller should add it to an empty or least-used memory slot; otherwise the conflicting memory slot should be updated with the latest information. Let $t$ denote the memory update time step such that $t \in [1, n]$, where $n$ is the total number of phrases in interaction $I$. We simplify notation, using $h^P_t$ instead of $h^P_{i,k}$, to represent the hidden representation of a phrase at time $t$.
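
To make the controller's behavior concrete, here is a minimal pure-Python sketch of a single update step, following Equations (3) and (5)–(10) given in the remainder of this section. It rests on several simplifying assumptions that are not part of the model description: the Siamese projection is replaced by the identity, the gate parameters `w_sigma` and `b_sigma` are fixed rather than learned, the decay value `lam=0.9` is arbitrary, and all function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmin(xs):
    return softmax([-x for x in xs])


def cosine(a, b, eps=1e-8):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, eps)


def controller_step(memory, usage, h, w_sigma, b_sigma, tau=0.1, lam=0.9):
    """One memory update for phrase vector h; returns (memory, usage, w_write)."""
    # Eqs. (3), (5): cosine score against every slot, normalised with softmax.
    # Assumption: the Siamese projection is the identity in this sketch.
    w_s = softmax([cosine(h, slot) for slot in memory])
    # Eq. (7): slots with low accumulated usage receive high weight.
    w_lu = softmin(usage)
    # Eq. (8): scalar gate between the two candidate write distributions.
    gate_in = sum(w * s for w, s in zip(w_sigma, w_s)) + b_sigma
    mu = 1.0 / (1.0 + math.exp(-gate_in))
    # Eq. (9): convex combination, sharpened by temperature tau.
    w_write = softmax([(mu * s + (1.0 - mu) * l) / tau
                       for s, l in zip(w_s, w_lu)])
    # Eq. (10): add the phrase vector to each slot, scaled by its write weight.
    new_memory = [[x + w * hx for x, hx in zip(slot, h)]
                  for slot, w in zip(memory, w_write)]
    # Eq. (6): decay previous usage and add the current write weights.
    new_usage = [w + lam * u for w, u in zip(w_write, usage)]
    return new_memory, new_usage, w_write


# Toy state: three used slots and one empty, unused slot.
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
usage = [1.0, 1.0, 1.0, 0.0]
h = [0.9, 0.1, 0.0]  # phrase vector close to slot 0
new_memory, new_usage, w = controller_step(memory, usage, h, [0.0] * 4, 0.0)
```

With this untrained gate (μ = 0.5), most of the write weight falls on the slot most similar to the phrase and on the unused slot, which are the two candidate write locations the controller is meant to arbitrate between.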

Detecting Conflicts. Given phrase representation $h^P_t$ (see Equation (2)), we use a similarity module to detect conflicts between $h^P_t$ and every memory slot $M_i(m)$, where $m \in [1, L]$; $M_i(m)$ is the $m$th row representing a memory slot in the memory matrix. Intuitively, low similarity represents new information. Our similarity module is based on a Siamese network architecture (Bromley et al., 1994) that takes phrase hidden representation $h^P_t$ and memory slot $M_i(m)$ and computes a low-dimensional representation using the same neural network weights. The resulting low-dimensional representations are then compared using cosine similarity:

$$\hat{w}^{t,m}_c = \frac{\mathrm{sia}(h^P_t) \cdot \mathrm{sia}(M_i(m))}{\max\left(\lVert \mathrm{sia}(h^P_t) \rVert_2 \cdot \lVert \mathrm{sia}(M_i(m)) \rVert_2,\ \epsilon\right)} \tag{3}$$

where $\epsilon$ is a small value for numerical stability and sia is a multi-layer feed-forward network with a tanh activation function. For hidden representation $h$, sia is computed as:

$$\hat{h} = W(\tanh(W^l h + b^l) + b) \tag{4}$$

where $l$ represents the layer number and $W^l$, $b^l$, $W$, and $b$ are learnable parameters. We use $\hat{w}^{t,m}_c$ to obtain a similarity distribution $w^t_s$ for updating step $t$ over memory slots. $w^t_s$ represents the probability of dissimilarity (or conflict), which is calculated by computing softmax over cosine similarities with every memory slot $m \in [1..L]$:

$$w^t_s = \mathrm{softmax}([\hat{w}^{t,1}_c;\ \hat{w}^{t,2}_c;\ \ldots;\ \hat{w}^{t,L}_c]) \tag{5}$$

We compute softmax over cosine values so that the linear combination of $w^t_s$ with least used weights $w^t_{lu}$ (described below in the memory update paragraph) still represents the probability of update across each memory slot.

Adding New Information. To add new information to the memory (i.e., when there is no conflict with any locations), we need to ascertain which memory locations are either empty or rarely used. When the memory is full (i.e., all memory slots are used during previous updates), we update the slot which was least used. This is accomplished by maintaining memory usage weights $w^t_u \in \mathbb{R}^L$ at each update $t$; $w^t_u$ is initialized with zeros at $t = 0$ and is updated by combining previous memory usage weights $w^{t-1}_u$ with current write weight $w^t_w$ using a decay parameter $\lambda$:

$$w^t_u = w^t_w + \lambda w^{t-1}_u \tag{6}$$

where write weights $w^t_w$ are used to compute the write location and are described in the memory update paragraph below. The least used weight vector, $w^t_{lu}$, at update step $t$ is then calculated as:

$$w^t_{lu} = \mathrm{softmin}(w^{t-1}_u) \tag{7}$$

where for vector $x$ we calculate $\mathrm{softmin}(x) = \exp(-x) / \sum_j \exp(-x_j)$. Hard updates (i.e., using smallest instead of softmin) are also possible. However, we found softmin to be more stable during learning.

Memory Update. We wish to compute write location $w^t_w$ given least used weight vector $w^t_{lu}$ and conflict probability distribution $w^t_s$. Notice that $w^t_s$ and $w^t_{lu}$ are essentially two probability distributions each representing a candidate write location in memory. We learn a convex combination parameter $\mu$ that depends on $w^t_s$,

$$\mu = \sigma(W_\sigma w^t_s + b_\sigma) \tag{8}$$

$$w^t_w = \mathrm{softmax}\left((\mu w^t_s + (1 - \mu) w^t_{lu}) / \tau\right) \tag{9}$$

where temperature hyperparameter $\tau$ is used to peak the write location. Finally, the memory is updated with current phrase representation $h^P_t$ as

$$M^t_i(m) = M^{t-1}_i(m) + w^t_w(m)\, h^P_t, \quad \forall m \in [1, L] \tag{10}$$

3.3 Decoder

The output query is generated with an LSTM decoder. As shown in Figure 2, the decoder depends on the memory and utterance representations computed using Equations (10) and (1), respectively. The decoder state at time step $s$ is computed as:

$$h^D_s = \mathrm{LSTM}([\phi^o(y_{i,s-1});\ c^M_{s-1};\ c^U_{s-1}];\ h^D_{s-1}) \tag{11}$$

where $\phi^o$ is a learned embedding function for output tokens, $c^U_s$ is an utterance context vector, $c^M_{s-1}$ is a memory context vector, and $h^D_{s-1}$ is the previous decoder hidden state. $c^U_s$ is calculated as the weighted sum of all hidden states, where $\alpha^U_s$ is the utterance state attention score:

$$v_s(j) = h^U_{i,j} W^A h^D_s \tag{12}$$

$$\alpha^U_s = \mathrm{softmax}(v_s) \tag{13}$$

$$c^U_s = \sum_j h^U_{i,j}\, \alpha^U_s(j) \tag{14}$$
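
The attention computation in Equations (12)–(14) can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function name `attend`, the list-of-lists matrix layout, and the toy inputs are assumptions, and a real implementation would use batched tensor operations.

```python
import math


def attend(states, query, W):
    """Bilinear attention over `states` given a decoder state `query`.

    states: list of encoder vectors h_j (each of length n)
    query:  decoder state vector (length m)
    W:      n x m matrix as a list of rows (plays the role of W^A)
    """
    # W @ query, computed once and reused for every state.
    Wq = [sum(w * q for w, q in zip(row, query)) for row in W]
    # Eq. (12): one unnormalised score per encoder state.
    scores = [sum(h * x for h, x in zip(state, Wq)) for state in states]
    # Eq. (13): softmax over the scores (shifted for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Eq. (14): context vector as the attention-weighted sum of states.
    dim = len(states[0])
    context = [sum(a * state[d] for a, state in zip(alpha, states))
               for d in range(dim)]
    return alpha, context


# Toy example: two encoder states, identity weight matrix.
states = [[1.0, 0.0], [0.0, 1.0]]
alpha, context = attend(states, query=[1.0, 0.0], W=[[1.0, 0.0], [0.0, 1.0]])
```

Passing the rows of the memory matrix as `states` gives the corresponding memory attention, with a separate weight matrix in place of `W`.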

Memory state attention score $\alpha^M_s$ and memory context vector $c^M_s$ are computed in a similar manner using memory slots as hidden states.¹ The probability of output query tokens is computed as:

$$P(\hat{w}_{i,s} \mid X_i, Y_i, I[:i-1]) \propto \exp\left(\tanh([h^D_s;\ c^U_s;\ c^M_s] W^{\hat{o}}) W^o + b^o\right)$$

We further modify the decoder in order to deal with the large number of database values (e.g., city names) common in text-to-SQL semantic parsing tasks. As described in Suhr et al. (2018), we add anonymized token attention scores in the output vocabulary distribution, which enables copying anonymized tokens mentioned in input utterances. The final probability distribution over output vocabulary tokens and anonymized tokens is:

$$P(w_{i,s}) = \mathrm{softmax}(P(\hat{w}_{i,s}) \oplus P(\hat{a}_{i,s})) \tag{15}$$

where $\oplus$ represents concatenation and $P(\hat{a}_{i,s})$ are anonymized token attention scores in the attention distribution $\alpha^U_s$.

3.4 Training

Our model is trained in an end-to-end fashion using a cross-entropy loss. Given a training set of $N$ interactions $\{I^{(l)}\}_{l=1}^{N}$, such that each interaction $I^{(l)}$ consists of utterances $X^{(l)}_i = (x^{(l)}_{i,1} \ldots x^{(l)}_{i,|X_i|})$ paired with output queries $Y^{(l)}_i = (y^{(l)}_{i,1} \ldots y^{(l)}_{i,|Y_i|})$, we minimize token cross-entropy loss as:

$$\mathcal{L}(\hat{y}^{(l)}_{i,k}) = -\log P\left(\hat{y}^{(l)}_{i,k} \mid X^{(l)}_i, y^{(l)}_{i,k}, I[:i-1]\right) \tag{16}$$

where $\hat{y}^{(l)}_{i,k}$ denotes the predicted output token and $k$ is the gold output token index. The total loss is the average of the utterance level losses used for back-propagation.
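
As a worked illustration of Equation (16), the sketch below computes the negative log-likelihood of the gold tokens for each utterance and then averages over utterances. Whether the utterance-level loss sums or averages over tokens is not stated above; averaging over tokens is an assumption of this sketch, and the function names and toy probabilities are hypothetical.

```python
import math


def utterance_loss(token_probs, gold_ids):
    """Mean negative log-likelihood of the gold tokens of one output query.

    token_probs[k] is the predicted distribution at output step k,
    gold_ids[k] is the index of the gold token at that step.
    Assumption: token losses are averaged within an utterance.
    """
    losses = [-math.log(probs[g]) for probs, g in zip(token_probs, gold_ids)]
    return sum(losses) / len(losses)


def total_loss(utterance_losses):
    """Average of the utterance-level losses, as used for back-propagation."""
    return sum(utterance_losses) / len(utterance_losses)


# Two toy utterances with a three-token output vocabulary.
u1 = utterance_loss([[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]], gold_ids=[0, 1])
u2 = utterance_loss([[0.2, 0.2, 0.6]], gold_ids=[2])
loss = total_loss([u1, u2])
```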

4 Experimental Setup

We evaluated MemCE, our memory-based con-
text model, on various settings by integrating it
with multiple open-source models. We achieve
this by replacing the discourse component of re-
lated models with MemCE subject to minor or
no additional changes. All base models in our
experiments use a turn-level hierarchical encoder

¹ In experiments we found that using the (raw) memory directly is empirically better than encoding it with an LSTM.

Figure 3: Example of sentence segmentation using
chunking and rule-based merging.

to capture previous language context. For primary
evaluation, we use the ATIS (Hemphill et al.,
1990; Dahl et al., 1994) dataset but also present
results on SParC (Yu et al., 2019b) and CoSQL
(Yu et al., 2019a).

Utterance Segmentation. We segment each input utterance into a sequence of phrases with a pretrained chunker and then apply a simple rule-based merging procedure to create bigger chunks as an approximation to propositions (Kintsch and van Dijk, 1978). Figure 3 illustrates the process. We used the Flair chunker (Akbik et al., 2018) trained on CoNLL-2000 (Tjong Kim Sang and Buchholz, 2000) to identify NP and VP phrases without postmodifiers. Small chunks (e.g., from, before in the figure) were subsequently merged into segments using the following rules and NLTK's (Bird et al., 2009) tag-based regex merge:

R1: left = <VP.*>, right = <VP.*>
R2: left = <PP.*>|<NP.*>, right = <NP>+
R3: left = <NP.*>, right = <VB.*>
R4: left = <AD.*>, right = <NP.*>

The rules above are applied in order. For each
rule we find any chunk whose end matches the
left pattern followed by a chunk whose beginning
matches the right pattern. Chunks that satisfy this
criterion are merged.
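
The merging procedure can be sketched as follows. This is a simplified illustration and not the NLTK-based implementation used in the experiments: each chunk carries a list of labels, a rule fires when the last label of the left chunk and the first label of the right chunk match the rule's patterns, and the example chunks and their labels are hypothetical.

```python
import re

# (left pattern, right pattern) pairs in the order R1-R4. Patterns are
# matched against whole labels with re.fullmatch; the "+" of R2 is handled
# by repeatedly merging at the same position.
RULES = [
    (r"VP.*", r"VP.*"),
    (r"PP.*|NP.*", r"NP"),
    (r"NP.*", r"VB.*"),
    (r"AD.*", r"NP.*"),
]


def merge_chunks(chunks, rules=RULES):
    """chunks: list of (labels, tokens) pairs; returns the merged list."""
    chunks = [(list(labels), list(tokens)) for labels, tokens in chunks]
    for left, right in rules:  # rules are applied in order
        i = 0
        while i < len(chunks) - 1:
            (l_labels, l_tokens), (r_labels, r_tokens) = chunks[i], chunks[i + 1]
            if re.fullmatch(left, l_labels[-1]) and re.fullmatch(right, r_labels[0]):
                # Merge the pair and stay at i so that runs collapse fully.
                chunks[i:i + 2] = [(l_labels + r_labels, l_tokens + r_tokens)]
            else:
                i += 1
    return chunks


# Hypothetical chunker output for one utterance.
chunks = [
    (["NP"], ["what", "flights"]),
    (["VP"], ["leave"]),
    (["PP"], ["from"]),
    (["NP"], ["chicago"]),
    (["PP"], ["before"]),
    (["NP"], ["noon"]),
]
segments = [" ".join(tokens) for _, tokens in merge_chunks(chunks)]
```

In this example only R2 fires, attaching each preposition to the noun phrase that follows it.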

We segment utterances and anonymize entities independently and then match entities within segments deterministically. This step is necessary to robustly perform anonymization as in some rare cases, the chunking process will separate entities in two different phrases (e.g., in Long Beach California that is chunked as in Long Beach and California that). This is easily handled by a simple token number matching procedure between the anonymized utterance and corresponding phrases.

Model Configuration. Our model is implemented in PyTorch (Paszke et al., 2019). For all experiments, we used the Adam optimizer (Kingma and Ba, 2015) to minimize the loss function and the initial learning rate was set to 0.001. During training, we used the ReduceLROnPlateau learning rate scheduling strategy on the validation loss, with a decay rate of 0.8. We also applied dropout with 0.5 probability. Dimensions for the word embeddings were set to 300. Following previous work (Zhang et al., 2019) we use pretrained GloVe (Pennington et al., 2014) embeddings for our main experiments on the SParC and CoSQL datasets. For ATIS, word embeddings were not pretrained (Suhr et al., 2018; Zhang et al., 2019). Memory length was chosen as a hyperparameter from the range [15, 25] and the temperature parameter was chosen from {0.01, 0.1}. Best memory length values for ATIS, SParC, and CoSQL were 25, 16, and 20, respectively. The RNN decoder is a two-layer LSTM and the encoder is a single-layer LSTM. The Siamese network in the module which detects conflicting slots uses two hidden layers.

5 Results

In this section, we assess the effectiveness of the MemCE encoder at handling contextual information. We present our results, evaluation methodology, and comparisons against the state of the art.

5.1 Evaluation on ATIS

We primarily focus on ATIS because it contains
relatively long interactions (average length is 7)
compared with other datasets (p.ej., the average
length in SParC is 3). Longer interactions present
multiple challenges that require non-trivial processing
of context, some of which are discussed in
Section 6. We use the ATIS dataset split created
by Suhr et al. (2018). It contains 27 tables and
162K entries with 1,148/380/130 train/dev/test
interactions. The semantic representations are
in SQL.

Following Suhr et al. (2018), we measure query
accuracy, strict denotation accuracy, and relaxed
denotation accuracy. Query accuracy is the per-
centage of predicted queries that match the ref-
erence query. Strict denotation accuracy is the
percentage of predicted queries that when exe-
cuted produce the same results as the reference
query. Relaxed accuracy also gives credit to a
prediction query that fails to execute if the refer-
ence table is empty. In cases where the utterance
is ambiguous and there are multiple gold queries,
the query or table is considered correct if they
match any of the gold labels. We evaluate on
both development and test set, and select the best
model during training via a separate validation set
consisting of 5% of the training data.
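
The three metrics can be sketched as below. This is an illustrative reading of the definitions above rather than the official evaluation script: query matching is reduced to exact string equality, executed results are represented as lists of row tuples with `None` standing for a failed execution, and all names and toy data are hypothetical.

```python
def query_accuracy(predictions, references):
    """Percentage of predicted queries equal to any gold query.

    predictions: list of SQL strings
    references:  list of sets of gold SQL strings (one set per utterance)
    """
    hits = sum(pred in golds for pred, golds in zip(predictions, references))
    return 100.0 * hits / len(predictions)


def denotation_accuracy(pred_tables, gold_tables, relaxed=False):
    """Percentage of predictions whose execution result matches a gold result.

    pred_tables[i] is the executed result of prediction i, or None if the
    query failed to execute. gold_tables[i] lists the acceptable results.
    """
    hits = 0
    for pred, golds in zip(pred_tables, gold_tables):
        if pred is None:
            # Relaxed scoring credits a failed execution when a reference
            # table is empty; strict scoring never does.
            hits += relaxed and any(len(g) == 0 for g in golds)
        else:
            hits += any(pred == g for g in golds)
    return 100.0 * hits / len(pred_tables)


# Toy data: one correct result and two failed executions.
pred_tables = [[("UA1",)], None, None]
gold_tables = [[[("UA1",)]], [[]], [[("DL2",)]]]
strict = denotation_accuracy(pred_tables, gold_tables)
relaxed = denotation_accuracy(pred_tables, gold_tables, relaxed=True)
exact = query_accuracy(["a", "b"], [{"a"}, {"c", "d"}])
```

The second toy prediction fails to execute against an empty reference table, so it counts only under the relaxed metric.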

Table 1 presents a summary of our results. We compare our approach against a simple Seq2Seq model which is a baseline encoder-decoder without any access to contextual information. Seq2Seq+Concat is a strong baseline which consists of an encoder-decoder model with attention on the current and the previous three concatenated utterances. We also compare against the models of Suhr et al. (2018) and Zhang et al. (2019). The former uses a turn-level encoder on top of an utterance-level encoder in a hierarchical fashion together with a decoder which learns to copy complete SQL segments from the previous query (SQL segments between consecutive queries are aligned during training using a rule-based procedure). The latter enhances the turn-level encoder by employing an attention mechanism across different turns and additionally introduces a query editing mechanism which decides at each decoding step whether to copy from the previous query or insert a new token. Column Enc-Dec in Table 1 describes the various models in terms of the type of encoder/decoder used. LSTM is a vanilla encoder or decoder, HE is a turn-level hierarchical encoder, and Mem is the proposed memory-based encoder. SnipCopy and EditBased respectively refer to Suhr et al.'s (2018) and Zhang et al.'s (2019) decoders. We present two instantiations of our MemCE model with a simple LSTM decoder (Mem-LSTM) and SnipCopy (Mem-SnipCopy). For the sake of completeness, Table 1 also reports the results from Lin et al. (2019), who apply a grammar-based decoder to this task; they also incorporate the interaction history by concatenating the current utterance with the previous three utterances which are encoded with a bi-directional

| Model | Enc-Dec | Dev Query | Dev Denotation (Relaxed) | Dev Denotation (Strict) | Test Query | Test Denotation (Relaxed) | Test Denotation (Strict) |
|---|---|---|---|---|---|---|---|
| Seq2Seq | LSTM-LSTM | 28.7 | 48.8 | 43.2 | 35.7 | 56.4 | 53.8 |
| Seq2Seq+Concat | LSTM-LSTM | 35.1 | 59.4 | 56.7 | 42.2 | 66.6 | 65.8 |
| Suhr et al. (2018) | HE-LSTM | 36.0 | 59.5 | 58.3 | - | - | - |
| Suhr et al. (2018) | HE-SnipCopy | 37.5 | 63.0 | 62.5 | 43.6 | 69.3 | 69.2 |
| Zhang et al. (2019) | HE-EditBased | 36.2 | 60.5 | 60.0 | 43.9 | 68.5 | 68.1 |
| Lin et al. (2019) | LSTM-Grammar | 39.1 | - | 65.8 | 44.1 | - | 73.7 |
| MemCE | Mem-LSTM | 40.2 | 63.6 | 61.2 | 47.0 | 70.1 | 68.9 |
| MemCE | Mem-SnipCopy | 39.1 | 65.5 | 65.2 | 45.3 | 70.2 | 69.8 |

Table 1: Model accuracy on the ATIS dataset. HE is a hierarchical interaction encoder, while Mem is the proposed memory-based encoder. LSTM are vanilla encoder/decoder models, while SnipCopy copies SQL segments from the previous query and EditBased adopts a query editing mechanism. Cells marked - indicate results that are not available.

| Model | Enc-Dec | CoSQL(D) Q | CoSQL(D) I | CoSQL(T) Q | CoSQL(T) I | SParC(D) Q | SParC(D) I | SParC(T) Q | SParC(T) I | SParC-DI(T) Q | SParC-DI(T) I |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CDS2S | HE-LSTM | 13.8 | 2.1 | 13.9 | 2.6 | 21.9 | 8.1 | 23.2 | 7.5 | 39.5 | 20.1 |
| CDS2S | HE-SnipCopy | 12.3 | 2.1 | - | - | 21.7 | 9.5 | 20.3 | 8.1 | 38.7 | 24 |
| Liu et al. (2020) | HE-Grammar | 33.5 | 9.6 | - | - | 41.8 | 20.6 | - | - | 57.1 | 35.3 |
| MemCE+CDS2S | Mem-LSTM | 13.4 | 3.4 | - | - | 21.2 | 8.8 | - | - | 41.3 | 22.9 |
| MemCE+CDS2S | Mem-SnipCopy | 13.1 | 2.7 | - | - | 21.4 | 10.9 | - | - | 41.5 | 26.7 |
| MemCE+Liu et al. (2020) | Mem-Grammar | 32.8 | 10.6 | 28.4 | 6.2 | 42.4 | 21.1 | 40.3 | 16.7 | 55.7 | 36.3 |

Table 2: Query (Q) and Interaction (I) accuracy for SParC and CoSQL. We report results on the development (D) and test (T) sets. SParC-DI is our domain-independent split of SParC. HE is a hierarchical encoder and Mem is the proposed memory-based context encoder. LSTM is a vanilla decoder, SnipCopy copies SQL segments from the previous query, and Grammar refers to a decoder which outputs a sequence of grammar rules rather than tokens. Table cells are filled with - whenever results are not available.

LSTM. All models in Table 1 use entity anonymization; Lin et al. (2019) additionally use identifier linking, namely, string matching heuristic rules to link words or phrases in the input utterance to identifiers in the database (e.g., city name string -> "BOSTON").

As shown in Table 1, MemCE is able to out-
perform comparison systems. We observe a boost
in denotation accuracy when using the SnipCopy
decoder instead of an LSTM-based one, although
exact match does not improve. This is possibly because
SnipCopy makes it easier to generate long
SQL queries by copying segments, but at the same
time it suffers from spurious generation and error
propagation.

Table 3 presents various ablation studies which evaluate the contribution of individual model components. We use Mem-SnipCopy as our base model and report performance on the ATIS development set following the configuration described in Section 4. We first remove the proposed memory controller described in Section 3.2 and simplify Equation (9) using key-value based attention to calculate $w^t_w$ as

$$\alpha_j = M^{t-1}_i(j)\, W^P h^P_t \tag{17}$$

$$w^t_w = \mathrm{softmax}(\alpha) \tag{18}$$

We observe a decrease in performance (see second row in Table 3), indicating that the proposed memory controller is helpful in maintaining interaction context.
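The simplified addressing of Equations (17) and (18) can be illustrated with a few lines of plain Python. This is only a minimal sketch, not the authors' implementation: the function names, the two-dimensional toy vectors, and the identity projection matrix are all assumptions made for readability.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def attention_write_weights(memory, w_p, h_p):
    """alpha_j = M(j) . (W^P h^P), then w = softmax(alpha)."""
    projected = matvec(w_p, h_p)
    alpha = [sum(m * p for m, p in zip(slot, projected)) for slot in memory]
    return softmax(alpha)

# Hypothetical toy values: three memory slots, 2-d representations.
memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_p = [[1.0, 0.0], [0.0, 1.0]]   # identity projection, for illustration only
h_p = [2.0, 0.0]                 # phrase representation

weights = attention_write_weights(memory, w_p, h_p)
print(weights)
```

In this toy setting the first slot is most similar to the projected phrase and therefore receives the largest weight; nothing in the sketch decides whether a slot should be overwritten or kept, which is the role the full memory controller plays.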

We performed two ablation experiments to evaluate the usefulness of utterance segmentation. First, instead of the phrases extracted from our segmentation procedure, we employ a variant of our model which operates over individual tokens (see row "phrases are utterance tokens" in Table 3). As can be seen, this strategy is not optimal as results decrease across metrics. We believe


                               Query   Denotation
                                       Relaxed   Strict
MemCE+SnipCopy                 39.1    65.5      65.2
Without memory controller      34.3    58.7      58.1
Phrases are utterance tokens   37.2    61.9      61.7
Phrases are full utterances    36.8    64.2      63.9

Table 3: Ablation results with SnipCopy decoder on the ATIS development set.

operating directly on tokens can lead to ambiguity during update. For example, when processing the current phrase "to Boston" given the previous utterance "What Continental flights go from Chicago to Seattle", it is not obvious whether Boston should update Chicago or Seattle. Second, we do not use any segmentation at all, not even at the token level. Instead, we treat the entire utterance as a single phrase (see row "phrases are full utterances" in Table 3). If memory's only function is to simply store utterance encodings, then this model becomes comparable to a hierarchical encoder with attention. Again, we observe that performance decreases, which indicates that our system benefits from utterance segmentation. Overall, the ablation studies in Table 3 show that segmentation and its granularity matter. Our heuristic procedure works well for the task at hand, although a learning-based method would be more flexible and could potentially lead to further improvements. However, we leave this to future work.
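The token-level ambiguity described above can be made concrete with a toy example. The sketch below is an illustration under loud assumptions: it uses Jaccard token overlap as a stand-in for the learned similarity, and the phrase segmentation of the previous utterance is hypothetical rather than the output of the heuristic procedure.

```python
def jaccard(a, b):
    """Token-overlap similarity (a crude stand-in for a learned similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def best_slot(memory, item):
    """Return the index of the most similar slot and all scores (first slot wins ties)."""
    scores = [jaccard(slot, item) for slot in memory]
    best = max(range(len(memory)), key=lambda i: scores[i])
    return best, scores

# Hypothetical phrase-level and token-level memories for the previous utterance.
phrase_memory = ["What Continental flights", "go from Chicago", "to Seattle"]
token_memory = "What Continental flights go from Chicago to Seattle".split()

phrase_idx, phrase_scores = best_slot(phrase_memory, "to Boston")
token_idx, token_scores = best_slot(token_memory, "Boston")

print(phrase_memory[phrase_idx], phrase_scores)
print(token_scores)
```

With phrases, the shared preposition singles out the "to Seattle" slot as the one to update. With isolated tokens, "Boston" overlaps with neither "Chicago" nor "Seattle", so this simple similarity gives no signal about which city to replace.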

5.2 Evaluation on SParC and CoSQL

In this section we describe our results on SParC and CoSQL. Both datasets assume a cross-domain semantic parsing task in context with SQL as the meaning representation. In addition, for ambiguous utterances (which cannot be uniquely mapped to SQL given past context), CoSQL also includes clarification questions (and answers). We do not tackle these explicitly but consider them part of the utterance preceding them (e.g., please list the singers | did you mean list their names? | Yes). Since our primary objective is to study and measure context-dependent language understanding, we created a split of SParC that is denoted as SParC-DI,² where domains are all seen in the training, development, and test sets. In this way we

                Train   Dev   Test
#Interactions   2869    290   290
#Utterances     8535    851   821

Table 4: Statistics for the SParC-DI domain-independent split, which has 157 domains in total.

ensure that no model has the added advantage of being able to handle cross-domain instances while lacking context-dependent language understanding. Table 4 shows the statistics of our SParC-DI split, following a ratio of 80/10/10 percent for the training/development/test set.
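The text does not spell out how the split was constructed. One plausible way to obtain a split in which every domain is seen in all three sets is to divide the interactions of each domain separately. The following sketch assumes exactly that; the function name, the `(domain, interaction_id)` input format, and the rule of reserving at least one interaction per set are all hypothetical choices.

```python
import random
from collections import defaultdict

def domain_independent_split(interactions, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (domain, interaction_id) pairs per domain into train/dev/test.

    Domains with fewer than three interactions go entirely to train.
    """
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for domain, item in interactions:
        by_domain[domain].append(item)

    splits = {"train": [], "dev": [], "test": []}
    for domain in sorted(by_domain):
        items = by_domain[domain]
        rng.shuffle(items)
        n = len(items)
        # Reserve at least one interaction for dev and test when possible.
        n_dev = max(1, round(n * ratios[1])) if n >= 3 else 0
        n_test = max(1, round(n * ratios[2])) if n >= 3 else 0
        n_train = n - n_dev - n_test
        splits["train"] += [(domain, x) for x in items[:n_train]]
        splits["dev"] += [(domain, x) for x in items[n_train:n_train + n_dev]]
        splits["test"] += [(domain, x) for x in items[n_train + n_dev:]]
    return splits

# Toy data: three hypothetical domains with ten interactions each.
toy = [(f"db{d}", i) for d in range(3) for i in range(10)]
splits = domain_independent_split(toy)
print({name: len(items) for name, items in splits.items()})
```

Because rounding is applied per domain, the overall proportions only approximate the target ratio, which is in keeping with the counts in Table 4 not being an exact 80/10/10 division.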

We evaluate model output using exact set match accuracy (Yu et al., 2019b).³ We report two metrics: question accuracy, which is the accuracy considering all utterances independently, and interaction accuracy, which is the proportion of interactions that are parsed correctly. An interaction is marked as correct if all utterances in that interaction are correct. Because utterances in an interaction can be semantically complete (i.e., independent of context), we prefer interaction accuracy.
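The difference between the two metrics is easy to see on a small example. The sketch below assumes that per-utterance correctness (e.g., from exact set match) has already been computed; the function names and the toy results are illustrative only.

```python
def question_accuracy(interactions):
    """Fraction of utterances parsed correctly, pooling all interactions."""
    flags = [ok for interaction in interactions for ok in interaction]
    return sum(flags) / len(flags)

def interaction_accuracy(interactions):
    """Fraction of interactions in which every utterance is parsed correctly."""
    return sum(all(interaction) for interaction in interactions) / len(interactions)

# Each inner list holds per-utterance correctness flags for one interaction.
results = [
    [True, True, True],
    [True, False, True, True],
    [False, True],
]

q_acc = question_accuracy(results)      # 7 correct utterances out of 9
i_acc = interaction_accuracy(results)   # 1 fully correct interaction out of 3
print(q_acc, i_acc)
```

A single error anywhere in an interaction makes the whole interaction count as wrong, so interaction accuracy is the stricter measure of the two.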

Table 2 summarizes our results. CDS2S is the context-dependent cross-domain parsing model of Zhang et al. (2019). It is adapted from Suhr et al. (2018) to include a schema encoder, which is necessary for SParC and CoSQL. It also uses a turn-level hierarchical encoder to represent the interaction history. We also report model variants where the CDS2S encoder is combined with an LSTM-based decoder, SnipCopy (Suhr et al., 2018), and a grammar-based decoder (Liu et al., 2020). The latter decodes SQL queries as a sequence of grammar rules, rather than tokens. We compare the above systems with three variants of our MemCE model that differ in their use of an LSTM decoder, SnipCopy, and the grammar-based decoder of Liu et al. (2020).

Across models and datasets we observe that
MemCE improves performance, which suggests
that it better captures contextual information as
an independent language modeling component.
We observe that benefits from our memory-based
encoder persist across domains and data splits even

² We only considered training and development instances as the test set is not publicly available.

³ Predicted queries are decomposed into different SQL clauses and scores are computed for each clause separately.


                 MemCE                 Suhr et al. (2018)
                 Denotation   Query    Denotation   Query
Focus Shift      80.4         50.0     76.7         44.6
Referring Exp    80.0         40.0     70.0         20.0
Ellipsis         69.4         33.3     66.6         25.0
Independent      81.4         61.1     81.3         62.7

Table 5: Model accuracy on specific phenomena (20 interactions, ATIS dev set).

when sophisticated strategies like grammar-based
decoding are adopted.

6 Analysis

In this section, we analyze our model's ability to handle important discourse phenomena such as focus shift, referring expressions, and ellipsis. We also showcase its interpretability by examining the behavior of the (learned) memory controller.

6.1 Focus Shift

Our linguistic analysis took place on 20 interactions⁴ randomly sampled from the ATIS development set (134 utterances in total). Table 5 shows overall performance statistics for MemCE (Mem-LSTM) and Suhr et al. (2018) (HE-SnipCopy) on our sample. We annotated the focus of attention in each utterance (underlined in the example below), which we operationalized as the most salient entity (e.g., city) within the utterance (Grosz et al., 1995). Focus shift occurs when the attention transitions from one entity to another. In the interaction below the focus shifts from flights in Q2 to cities in Q3.

Q1: What flights are provided by American airlines
Q2: What flights are provided by Delta airlines
Q3: Which cities are serviced by both American and Delta airlines

Handling focus shift has been problematic in the context of semantic parsing (Suhr et al., 2018). In our sample, 41.8% of utterances displayed focus shift. Our model was able to correctly parse all utterances in the interaction above and is more apt at handling focus shifts compared to related systems (Suhr et al., 2018). Table 5 reports denotation and query accuracy on our analysis sample.

6.2 Referring Expressions and Ellipsis

Ellipsis refers to the omission of information from an utterance that can be recovered from the context. In the interaction below, Q2 and Q3 exemplify nominal ellipsis: the NP "all flights from Long Beach to Memphis" is elided and ideally should be recovered from the discourse in order to generate correct SQL queries. Q4 is an example of coreference: "they" refers to the answer of Q3. However, it can also be recovered by considering all previous utterances (i.e., Where do they [flights from Long Beach to Memphis; any day] stop). Because our model explicitly stores information in context, it is able to parse utterances like Q2 and Q4 correctly.

Q1: Please give me all flights from Long Beach to Memphis
Q2: What about 1993 June thirtieth
Q3: How about any day
Q4: Where do they stop

⁴ Interactions with fewer than two utterances were discarded.

In our ATIS sample, 26.8% of the utterances exhibited ellipsis and 7.5% contained referring expressions. Results in Table 5 show that MemCE is able to better handle both such cases.

6.3 Memory Interpretation

In this section we delve into the memory controller with the aim of understanding what kind of patterns it learns and where it fails. In Figure 4, we visualize the content of memory for an interaction (top row) from the ATIS development set consisting of seven utterances.⁵ Each column in Figure 4 shows the content of memory after processing the corresponding utterance in the interaction. The bottom row indicates whether the final output was correct (✓) or not (✗). For the purpose of clear visualization we took the max instead of softmax in Equation (8) to obtain the memory state at any time step.
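The effect of taking the max instead of the softmax can be sketched as follows. Equation (8) is not reproduced in this section, so the example simply assumes that it normalizes a vector of scores over memory slots; the scores and function names are hypothetical.

```python
import math

def softmax(scores):
    """Soft assignment: every slot receives some weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_max(scores):
    """Hard assignment: a one-hot vector on the best slot (first one on ties)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

scores = [0.2, 1.5, 0.9]   # hypothetical scores for three memory slots
soft = softmax(scores)
hard = hard_max(scores)
print(soft)
print(hard)
```

The hard version attributes each update to exactly one slot, which makes the memory content displayable as discrete phrases, whereas the soft version spreads the update over all slots.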

Q2 presents an interesting case for our model: it is not obvious whether Continental airlines from Q1 should be carried forward while processing Q2. The latter is genuinely ambiguous; it could be referring to Continental airlines flights or to flights by any carrier leaving from Seattle to Chicago. If we assume the second interpretation, then Q2 is more or less semantically complete and independent of Q1. Forty-four percent of utterances in our ATIS sample are semantically complete. Although we do not explicitly handle such utterances, our model is able to parse many
⁵ Q4 was repeated in the dataset. We do the same to maintain consistency and to observe the effect of repetition.


Figure 4: Visualization of the memory matrix. Rows represent memory content and columns represent the utterance time step. The top row shows the utterances being processed. Each row is marked with a memory slot number which represents the content of memory in that slot. Empty slots are marked with φ. The bottom row shows whether the utterance was parsed correctly (✓) or not (✗). Two further markers indicate stale content in memory with respect to the current utterance and incorrect substitution, respectively.

of them correctly because they usually repeat the information mentioned in previous discourse as a single query (see Table 5). Q2 also shows that the memory controller is able to learn the similarity between long phrases: on 1993 February twenty seventh ⇔ Show 1993 February twenty eighth flights. It also demonstrates a degree of semantic understanding: it replaces from Chicago with from Seattle in order to process utterance Q2, rather than simply relying on entity matching.

Figure 4 further shows the kind of mistakes the controller makes, which are mostly due to stale content in memory. In utterance Q6 the memory carries over the constraint after 1500 hours from the previous utterance, which is not valid since Q6 explicitly states Show all . . . flights on Continental. At the same time, the constraints from Seattle and to Chicago should carry forward. Knowing which content to keep or discard makes the task challenging.

Another cause of errors relates to reinstating previously nullified constraints. In the interaction below, Q3 reinstates from Seattle to Chicago; the focus shifts from flights in Q1 to ground transportation in Q2 and then again to flights in Q3.

Q1: Show flights from Seattle to Chicago
Q2: What ground transportation is available in Chicago
Q3: Show flights after 1500 hours

Handling these issues altogether necessitates a non-trivial way of managing context. Given that our model is trained in an end-to-end fashion, it is encouraging to observe a one-to-one correspondence between memory and the final output, which supports our hypothesis that explicitly modeling language context is helpful.


7 Conclusions

In this paper, we presented a memory-based model for context-dependent semantic parsing and evaluated its performance on a text-to-SQL task. Analysis of model output revealed that our approach is able to handle several discourse-related phenomena to a large extent. We also analyzed the behavior of the memory controller and observed that it correlates with the model's output decisions. Our study indicates that explicitly modeling context can be helpful for contextual language processing tasks. Our model manipulates information at the phrase level, which can be too rigid for fine-grained updates. In the future, we would like to experiment with learning the right level of utterance segmentation for context modeling as well as learning when to reinstate a constraint.

Acknowledgment

We thank Mike Lewis, Miguel Ballesteros, and our anonymous reviewers for their feedback. We are grateful to Alex Lascarides and Ivan Titov for their comments on the paper. This work was supported in part by Huawei and the UKRI Centre for Doctoral Training in Natural Language Processing (grant EP/S022481/1). Lapata acknowledges the support of the European Research Council (award number 681760, "Translating Multiple Modalities into Text").

References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Yoav Artzi and Luke Zettlemoyer. 2011. Bootstrapping semantic parsers from conversations. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 421–432, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Alan D. Baddeley. 2007. Working Memory, Thought, and Action, Oxford University Press, Oxford.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O'Reilly Media, Inc.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a "Siamese" time delay neural network. In Advances in Neural Information Processing Systems, volume 6, pages 737–744. Morgan-Kaufmann.

Giovanni Campagna, Rakesh Ramesh, Silei Xu, Michael Fischer, and Monica S. Lam. 2017. Almond: The architecture of an open, crowdsourced, privacy-preserving, programmable virtual assistant. In Proceedings of the 26th International Conference on World Wide Web, WWW '17, pages 341–350, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE.

Joyce Y. Chai and Rong Jin. 2004. Discourse structure for context question answering. In Proceedings of the Workshop on Pragmatics of Question Answering at HLT-NAACL 2004, pages 23–30, Boston, Massachusetts, USA. Association for Computational Linguistics.

Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875–1886. https://doi.org/10.1109/TMM.2015.2477044

Deborah A. Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8–11, 1994.

Jeffrey Dalton, Chenyan Xiong, Vaibhav Kumar, and Jamie Callan. 2020. CAsT-19: A dataset


for conversational information seeking. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, pages 1985–1988, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3397271.3401206

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33–43, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1004

Kais Dukes. 2014. SemEval-2014 task 6: Supervised semantic parsing of robotic spatial commands. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 45–53, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.3115/v1/S14-2006

Yimai Fang and Simone Teufel. 2016. Improving argument overlap for proposition-based summarisation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 479–485, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-2078

Jianfeng Gao, Michel Galley, and Lihong Li. 2019. Neural approaches to conversational AI. Foundations and Trends in Information Retrieval, 13(2–3):127–298. https://doi.org/10.1561/1500000074

Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. 1995. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–225. https://doi.org/10.21236/ADA324949

Barbara J. Grosz and Candace L. Sidner. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175–204.

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-SQL in cross-domain database with intermediate representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4524–4535, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1444

Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24–27, 1990. https://doi.org/10.3115/116580.116613

J. Hobbs. 1985. On the coherence and structure of discourse. CSLI, 85(37).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735, PubMed: 9377276

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 963–973, Vancouver, Canada. Association for Computational Linguistics.

Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. 2017. Search-based neural structured learning for sequential question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1821–1831, Vancouver, Canada. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1002

Aishwarya Kamath and Rajarshi Das. 2019. A survey on semantic parsing. In Automated Knowledge Base Construction (AKBC).


Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.

Walter Kintsch and Teun A. van Dijk. 1978. Toward a model of text comprehension and production. Psychological Review, 85(5):363–394. https://doi.org/10.1037/0033-295X.85.5.363

Kevin Lin, Ben Bogin, Mark Neumann, Jonathan Berant, and Matt Gardner. 2019. Grammar-based neural text-to-SQL generation. ArXiv, abs/1905.13326.

Qian Liu, Bei Chen, Jiaqi Guo, Jian-Guang Lou, Bin Zhou, and Dongmei Zhang. 2020. How far are we from effective context modeling? An exploratory study on semantic parsing in context. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 3580–3586. International Joint Conferences on Artificial Intelligence Organization. https://doi.org/10.24963/ijcai.2020/495

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text - Interdisciplinary Journal for the Study of Discourse, 8(3):243–281. https://doi.org/10.1515/text.1.1988.8.3.243

Scott Miller, David Stallard, Robert Bobrow, and Richard Schwartz. 1996. A fully statistical approach to natural language interfaces. In 34th Annual Meeting of the Association for Computational Linguistics, pages 55–61, Santa Cruz, California, USA. Association for Computational Linguistics. https://doi.org/10.3115/981863.981871

Fatma Özcan, Abdul Quamar, Jaydeep Sen, Chuan Lei, and Vasilis Efthymiou. 2020. State of the art and open challenges in natural language interfaces to data. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD '20, pages 2629–2636, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3318464.3383128

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, pages 8026–8037. Curran Associates, Inc.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1162

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1842–1850, New York, New York, USA. PMLR.

Shashank Srivastava, Amos Azaria, and Tom Mitchell. 2017. Parsing natural language conversations using contextual cues. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 4089–4095.

Alane Suhr, Srinivasan Iyer, and Yoav Artzi. 2018. Learning to map context-dependent sentences to executable formal queries. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2238–2249, New Orleans, Louisiana. Association for Computational Linguistics.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, volume 28, pages 2440–2448. Curran Associates, Inc.


Mingyu Sun and Joyce Y. Chai. 2007. Discourse processing for context question answering based on linguistic knowledge. Knowledge-Based Systems, 20(6):511–526. Special Issue On Intelligent User Interfaces. https://doi.org/10.1016/j.knosys.2007.04.005

Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task chunking. In Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop. https://doi.org/10.3115/1117601.1117631

Ellen M. Voorhees. 2004. Overview of TREC 2004. In Proceedings of the Thirteenth Text REtrieval Conference, TREC 2004, Gaithersburg, Maryland, USA, November 16–19, 2004, volume 500–261 of NIST Special Publication. National Institute of Standards and Technology (NIST).

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7567–7578, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.677

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.

Semih Yavuz, Izzeddin Gur, Yu Su, and Xifeng Yan. 2018. What it takes to achieve 100% condition accuracy on WikiSQL. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1702–1711, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1197

Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018a. SyntaxSQLNet: Syntax tree networks for complex and cross-domain text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1653–1663, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1193

Tao Yu, Rui Zhang, Heyang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, Youxuan Jiang, Michihiro Yasunaga, Sungrok Shim, Tao Chen, Alexander Fabbri, Zifan Li, Luyao Chen, Yuwen Zhang, Shreya Dixit, Vincent Zhang, Caiming Xiong, Richard Socher, Walter Lasecki, and Dragomir Radev. 2019a. CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1962–1979, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1204

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018b. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1425

Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene Li, Bo Pang, Tao Chen, Emily Ji, Shreya Dixit, David Proctor, Sungrok Shim, Jonathan Kraft, Vincent Zhang, Caiming Xiong, Richard Socher, and Dragomir Radev. 2019b. SParC: Cross-domain semantic parsing in context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4511–4523, Florence, Italy. Association for Computational Linguistics.

Luke Zettlemoyer and Michael Collins. 2009. Learning context-dependent mappings from sentences to logical form. In Proceedings of the


Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 976–984, Suntec, Singapore. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1443

Rui Zhang, Tao Yu, Heyang Er, Sungrok Shim, Eric Xue, Xi Victoria Lin, Tianze Shi, Caiming Xiong, Richard Socher, and Dragomir Radev. 2019. Editing-based SQL query generation for cross-domain context-dependent questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5338–5349, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1537

Victor Zhong, Caiming Xiong, and Richard Socher. 2018. Seq2SQL: Generating structured queries from natural language using reinforcement learning.
