Continual Learning for Grounded Instruction Generation
by Observing Human Following Behavior
Noriyuki Kojima, Alane Suhr, Yoav Artzi
Department of Computer Science and Cornell Tech, Cornell University, USA
nk654@cornell.edu {suhr, yoav}@cs.cornell.edu
Abstract
We study continual learning for natural lan-
guage instruction generation, by observing hu-
man users’ instruction execution. We focus
on a collaborative scenario, where the system
both acts and delegates tasks to human users
using natural language. We compare user exe-
cution of generated instructions to the original
system intent as an indication to the system’s
success communicating its intent. We show
how to use this signal to improve the system’s
ability to generate instructions via contex-
tual bandit learning. In interaction with real
users, our system demonstrates dramatic im-
provements in its ability to generate language
over time.
1 Introduction
Natural language provides an expressive and accessible avenue to
instruct non-expert users. The ability to generate instructions is
critical for systems that collaborate with users, for example, to
delegate tasks. In such scenarios, the system generates language to
communicate to the user a latent intent. When users are cooperative
and proficient in the language, whether they accomplish the system’s
intent provides an informative, albeit noisy, signal of the quality of
instruction generation.
This implicit signal is fundamentally different from supervised data,
including via active learning, in that it does not label the system’s
intent with a written instruction, but only provides evidence of the
quality of a given instruction in relaying this intent. As a natural
byproduct of interaction with users, it also differs from explicit
user feedback in not requiring user action beyond what they already do
as part of the interaction. Despite its potential and prevalence, this
signal is understudied for learning to generate natural language.
In this paper, we study this learning signal. We formalize continually
improving instruction generation by observing human users executing
generated instructions. We learn by comparing instruction execution to
the system intent, and demonstrate how this results in a system that
continually improves its natural language generation ability through
interaction with users. Figure 1 illustrates our learning process.
We design a task-oriented collaborative scenario using the CEREALBAR
game environment (Suhr et al., 2019). In CEREALBAR, two agents, a
leader and a follower, work together to complete tasks. The leader
plans the tasks to complete, and communicates goals to the follower
using natural language. CEREALBAR was originally introduced for
studying follower instruction execution. We modify it to focus on
generation of leader instructions, which are then executed by human
followers. The collaborative, embodied setup effectively engages
users, and aligns their incentives with executing the system’s
instructions to the best of their abilities.
A major challenge is inferring a learning signal from observed user
behavior. Given the user execution, we create positive and negative
examples, depending on how the user execution aligns with the system’s
plan and the user’s perceived correctness of their own execution. For
example, consider an execution that does not align well with the
system’s plan, but that the user considers correct given the
instruction. Because of the misalignment, we cannot consider the
instruction as a successful example given the system’s plan. However,
given the user’s perceived correctness, we can generate a positive
example treating the user’s execution as a plan paired with the
instruction. In contrast to supervised learning with gold-standard
per-token labels (Sutskever et al., 2014), such utterance-level binary
labels form a challenging signal for learning, because they do not
distinguish between correct and incorrect tokens.
We do not make the typical distinction between
training and deployment; as human users follow
Transactions of the Association for Computational Linguistics, vol. 9, pp. 1303–1319, 2021. https://doi.org/10.1162/tacl_a_00428
Action Editor: Andreas Vlachos. Submission batch: 5/2021; Revision batch: 8/2021; Published 12/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
2 Technical Overview and Notation
Our goal is to continually improve a natural lan-
guage instruction generation model, by observing
human executions of generated instructions.
Interaction Scenario We focus on a collaborative scenario, where two
agents, a leader and a follower, complete tasks in an environment. The
system is the leader, and the human user is the follower. The leader
plans tasks to accomplish, acts in the world, and instructs the
follower using natural language. We use a deterministic procedure for
planning and executing leader actions, and focus on learning the
leader instruction generation model. The human follower acts in the
world following the system instructions. We instantiate this scenario
using CEREALBAR (Section 3), a collaborative game, where two agents
collect sets of cards together by moving in a 3D environment.
Task A world state $s$ describes the current environment; in
CEREALBAR, this includes the location of landmarks, cards, and both
agents. A plan $\bar{p}$ is a sequence of poses
$\langle p_1, \dots, p_{|\bar{p}|} \rangle$ the system intends for the
human user to take starting from a start state $s_1$. In CEREALBAR, a
plan includes moving in the environment with the intent of collecting
cards; each pose $p_j$ is a tuple $(h_j, w_j, \alpha_j)$, where $h_j$
and $w_j$ are height and width coordinates, and $\alpha_j$ is a
discrete orientation angle. An instruction $\bar{x}$ is a sequence of
tokens $\langle x_1, \dots, x_{|\bar{x}|} \rangle$. An instruction
execution $\bar{e}$ is the sequence of poses
$\langle p_1, \dots, p_{|\bar{e}|} \rangle$ a user takes executing
$\bar{x}$, starting in a start state $s_1$. The generation
distribution $P(\bar{x} \mid s_1, \bar{p}; \theta)$ is parameterized
by $\theta$. The goal of instruction generation is that given a
generated instruction $\bar{x} \sim P(\cdot \mid s_1, \bar{p}; \theta)$,
the user execution $\bar{e}$ from $s_1$ will follow the plan
$\bar{p}$. The user does not have access to $\bar{p}$, but only to its
description $\bar{x}$.
Learning We use an encoder-decoder neural network model (Section 4),
which we continually improve by observing user behavior. This process
proceeds in rounds. At each round $r$, we first collect data and then
train our model by estimating the model parameters $\theta_r$. During
data collection in round $r$, we sample from our model to generate
instructions, and observe a human user’s execution of each
instruction. An execution of an instruction
$\bar{x} \sim P(\cdot \mid s_1, \bar{p}; \theta_r)$ generated for the
plan $\bar{p}$ with start state $s_1$ creates a tuple
$(s_1, \bar{p}, \bar{x}, \bar{e}, f)$, where $\bar{e}$
Figure 1: Diagram of our learning process. We initialize a generation
model using supervised learning, and continually learn through
interaction with users, by alternating between observing user
execution of generated instructions and training.
generated instructions, we continually collect new data, periodically
train using this data, and evaluate the system through the interaction
itself. We formalize learning as an off-policy contextual bandit
learning problem. We show that positive examples can be treated in a
manner that reduces to supervised learning, allowing for simple and
effective use of the data. However, using negative examples is more
challenging, because simply minimizing their likelihood gives an
unbounded negative loss. We weight negative examples using an inverse
propensity score (IPS; Horvitz and Thompson, 1952; Wang et al., 2017)
to address this issue.
We experiment with our approach through interaction with human users,
tracking both task performance and how the generated language changes.
We observe dramatic improvements in the quality of instructions
generated as reflected in users’ execution: Task completion in
accordance with the system intent increases from 44.7% to 79.3%. This
is accompanied by significant language change: The occurrence of
erroneous phrases decreases as desired, but the effective system
vocabulary gradually shrinks.
Although using user feedback for improving language generation has
been studied, as we discuss in Section 8, to the best of our
knowledge, this study is the first to show effective instruction
generation learning by observing user execution. Our experiments
demonstrate the effectiveness of our process, but also illustrate
limitations and important directions for future work. Code and data
are available at https://lil.nlp.cornell.edu/cerealbar/.
is the user execution and $f$ is structured user feedback solicited
using binary questions (e.g., about the grammaticality of $\bar{x}$).
The learner does not observe the user’s actions executing $\bar{x}$,
but only their poses along the execution. Given these tuples, we
create a dataset
$D_r = \{(s_1^{(i)}, \bar{\rho}^{(i)}, \bar{x}^{(i)}, y^{(i)})\}_{i=1}^{|D_r|}$,
where $y^{(i)} \in \{-1, +1\}$ is a binary label. Depending on the
user execution and feedback, the plan $\bar{\rho}^{(i)}$ is either the
original plan $\bar{p}^{(i)}$ used for generating $\bar{x}^{(i)}$ or
the user execution $\bar{e}^{(i)}$ of $\bar{x}^{(i)}$. We formulate
estimating $\theta_{r+1}$ as a contextual bandit learning problem with
$y$ as the reward. Section 5 describes the complete learning process.
Evaluation Throughout the system’s lifetime, we measure how well human
users complete tasks, and also use earth mover’s distance (EMD; Rubner
et al., 1998) to quantify the similarity of the user execution
$\bar{e}$ to the plan $\bar{p}$. We characterize language change over
time by tracking vocabulary size, instruction length, and other
statistics.
3 Interaction Scenario
Suhr et al. (2019) describe CEREALBAR in detail. CEREALBAR is a
two-player, turn-based game where a leader and follower collaborate to
collect sets of matching cards. The game objective is to collect as
many valid sets as possible in a 3D environment. The environment
includes landmarks (houses, mountains, ponds, etc.) that the players
must move around, and that may obscure a player’s view. A valid set
consists of three cards with three distinct colors, shapes, and
counts. Players move onto cards to select or deselect them. When the
selected cards comprise a valid set, the players earn a point, all
cards disappear,1 and new cards appear. The two players must
collaborate effectively using natural language. The leader observes
the entire environment, plans who should select which cards for the
next set, executes their own part of this plan, and issues
instructions to the follower. The follower executes leader
instructions, only seeing a partial first-person view of the
environment. Leader instructions must make use of the observed spatial
environment, including landmarks, for the follower to be able to
execute them given their partial view. Each interaction includes
multiple instructions. Figure 2 shows the game and example generated
instructions.
1 In Suhr et al. (2019), only the selected cards disappear. We
introduced this modification to minimize inter-turn effects for the
follower (i.e., memorizing card locations).
Figure 2: Interaction snapshot in CEREALBAR, with instructions
generated by our model. The current instruction is $\bar{x}_9$. The
leader plan is illustrated with red arrows in the leader’s view. The
user sees only the follower’s view during execution.
CEREALBAR was originally used for learning a follower instruction
execution model from human demonstrations (Suhr et al., 2019). In
contrast, we learn an instruction generation model for the leader,
with the human user as the follower. The generated instructions must
often specify multiple tasks to complete (i.e., when the follower is
to select multiple cards), and how to navigate to the target cards,
because the follower has only partial observability of the
environment. This includes references to landmarks, spatial relations,
and descriptions of paths. We focus on language generation, and use a
deterministic planner to generate the plan, including which cards to
select and how each player should move in their next turn, and execute
the planned leader actions. The system uses the model we learn to map
the follower’s part of the plan to a natural language instruction.
We learn through interactions with non-expert human followers, which
CEREALBAR is particularly suited for. The utility-maximizing game
objective, to earn a high score by collecting as many valid sets as
possible, incentivizes followers to execute the generated instructions
as accurately as possible. In addition, CEREALBAR players need no
expert knowledge to participate in the game, beyond familiarity with
the simple game rules.
4 Model
We design a relatively simple encoder-decoder
architecture to model the generation distribution
Figure 3: Model illustration. Section 4 describes the model.
$P(\cdot \mid s_1, \bar{p}; \theta)$, leaving more complex model
development for future work. The inputs are a start state $s_1$ and a
plan $\bar{p}$. The model parameters are $\theta$. Our design
considers the environment and plan to generate relevant, grounded
instructions. Figure 3 illustrates the model.
Inputs Similar to Suhr et al. (2019), we represent the world state
$s_1 \in \{0, 1\}^{P \times H \times W}$ as a binary 3D tensor, where
$P$ is the number of position properties, and $H$ and $W$ are the
environment’s height and width. Each of the $W \times H$ positions is
represented as a binary properties vector of length $P$ (encoding the
type of object in the position, its color, etc.). The system plan
$\bar{p} = \langle p_1, \dots, p_{|\bar{p}|} \rangle$ is a sequence of
follower poses along the intended execution. Each pose $p_j$ is a
tuple $(h_j, w_j, \alpha_j)$ of height $h_j$ and width $w_j$
coordinates, and a discrete orientation angle $\alpha_j$.
Encoder The encoder computes a set of hidden states, which the decoder
attends to during generation. We use a learned embedding function
$\phi_s$ to map each position vector to a dense embedding of size
$N_s$ by summing the embeddings of each of the position’s properties.
We combine the embeddings into a tensor
$S \in \mathbb{R}^{N_s \times H \times W}$, and compute
$S' = \mathrm{CNN}_1(S)$, where $\mathrm{CNN}_1$ is a learned
convolution and $S' \in \mathbb{R}^{N_{s'} \times H \times W}$.
Because the CEREALBAR environment is a grid of hexagons, we use
HEXACONV (Hoogeboom et al., 2018). We encode the plan positions into a
sequence of tensors
$\langle p^{s'}_1, \dots, p^{s'}_{|\bar{p}|} \rangle$ by cropping
$N_{s'} \times N_p \times N_p$-sized tensors from $S'$ centered around
each $(h_j, w_j)$ and rotated by $\alpha_j$. These tensors represent
the pose of the follower and its surroundings during execution. Each
$p^{s'}_j$ is encoded to $p_j = \mathrm{CNN}_2(p^{s'}_j)$, while
retaining the dimensionality of $p^{s'}_j$. We concatenate an
orientation embedding $\phi_\alpha(\alpha_j)$ to each $p_j$, and
process $[p_1; \phi_\alpha(\alpha_1)], \dots,
[p_{|\bar{p}|}; \phi_\alpha(\alpha_{|\bar{p}|})]$ with a bidirectional
LSTM to compute $h_1, \dots, h_{|\bar{p}|}$. We construct the set of
hidden states $\mathcal{P}$ the decoder attends to by concatenating
each $h_j$ with the $N_p \times N_p$ position vectors encoded in each
$p_j$:

$$\mathcal{P} = \big\{ [h_j; p_j[x, y]] \;\big|\; 1 \le j \le |\bar{p}|,\; 1 \le x, y \le N_p \big\}, \tag{1}$$

where $p_j[x, y]$ is a position vector of size $N_{s'}$.
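The construction of the attended set in Equation (1) can be sketched in a few lines. This is a minimal illustration only, assuming plain Python lists in place of tensors; the function name `build_attention_set` and all toy sizes are hypothetical and not taken from the released code.

```python
def build_attention_set(hidden_states, pose_encodings):
    """Return one vector [h_j; p_j[x, y]] per plan step j and crop cell (x, y).

    hidden_states: list of |plan| vectors h_j (the BiLSTM outputs).
    pose_encodings: list of |plan| grids; pose_encodings[j][x][y] is the
        position vector p_j[x, y].
    """
    attention_set = []
    for h_j, p_j in zip(hidden_states, pose_encodings):
        for row in p_j:
            for cell in row:
                # Concatenate the step-level state with the cell-level vector.
                attention_set.append(list(h_j) + list(cell))
    return attention_set


# Toy sizes (hypothetical): 2 plan steps, N_p = 3, |h_j| = 4, cell size 5.
plan_len, n_p, h_dim, cell_dim = 2, 3, 4, 5
h = [[float(j)] * h_dim for j in range(plan_len)]
p = [[[[0.5] * cell_dim for _ in range(n_p)] for _ in range(n_p)]
     for _ in range(plan_len)]
P = build_attention_set(h, p)
```

Under these assumptions the set holds one vector per plan step and crop cell, so its size grows as the plan length times the square of $N_p$.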
Decoder The decoder computes a probability distribution over token
types conditioned on the prefix generated so far and the set
$\mathcal{P}$, which represents the environment state and plan. The
decoder uses the first four layers of the GPT-2 Transformer
architecture (Radford et al., 2019), which enables initializing with
GPT-2 weights. We extend it with pseudo self attention (Ziegler et
al., 2019) to condition the generation on the encoder outputs
$\mathcal{P}$. This adds a linear layer that projects the encoder
outputs $\mathcal{P}$ into the decoder self-attention space.
Inference We decode instructions from
$P(\cdot \mid s_1, \bar{p}; \theta)$ using temperature sampling with a
temperature of $\tau$ (Kreutzer et al., 2018b). This sharpens the
sampling distribution, to focus on higher probability outputs. We do
not use beam search.
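To make the effect of the temperature concrete, the following minimal sketch applies temperature scaling to a toy next-token distribution. It assumes the standard formulation (a softmax over logits divided by $\tau$); the function names and the toy logits are hypothetical, and the value 0.5 is the temperature reported in Section 6.

```python
import math
import random


def temperature_probs(logits, tau):
    """Softmax over logits / tau; tau < 1 sharpens the distribution."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, tau, rng):
    """Draw one token index from the tempered distribution."""
    probs = temperature_probs(logits, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]                 # toy next-token scores
plain = temperature_probs(logits, 1.0)   # unmodified distribution
sharp = temperature_probs(logits, 0.5)   # more mass on the top token
token = sample_token(logits, 0.5, random.Random(0))
```

With $\tau < 1$ the most likely token receives a larger share of the probability mass than under the unmodified distribution, while sampling still leaves room for lower-probability outputs, unlike greedy or beam decoding.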
5 Learning
We continually improve our model by observ-
ing users following generated instructions and
re-estimating the model parameters. We initialize
the model parameters $\theta_1$ using an existing language model and
training on a static dataset of instructions $D_0$ (Section 5.1). We
then perform a series of rounds, where each round $r$ includes
deploying the model with human users and training on the collected
interactions (Section 5.2). In round $r$, we collect interactions
between our model parameterized by $\theta_r$ and human followers, to
create a dataset
$D_r = \{(s_1^{(i)}, \bar{\rho}^{(i)}, \bar{x}^{(i)}, y^{(i)})\}_{i=1}^{|D_r|}$
of start states $s_1^{(i)}$, plans $\bar{\rho}^{(i)}$, instructions
$\bar{x}^{(i)}$, and binary labels $y^{(i)}$. We estimate
$\theta_{r+1}$ using all data collected so far,
$\bigcup_{q=0}^{r} D_q$. Figure 1 illustrates our learning process.
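The round structure described above can be summarized as a short loop. This is a sketch only: `train` and `deploy` are hypothetical callables standing in for parameter estimation and for the interaction phase with users, and the stubs below exist purely to make the control flow executable.

```python
def _union(datasets):
    """Flatten D_0, ..., D_r into one list of examples."""
    return [example for dataset in datasets for example in dataset]


def continual_learning(train, deploy, d0, num_rounds):
    """Alternate deployment and training for `num_rounds` rounds.

    train(examples) -> parameters      (hypothetical interface)
    deploy(parameters, r) -> dataset   (hypothetical interface)
    """
    datasets = [list(d0)]                # D_0, the supervised seed data
    theta = train(_union(datasets))      # theta_1
    for r in range(1, num_rounds + 1):
        d_r = deploy(theta, r)           # the model is fixed while interacting
        datasets.append(list(d_r))
        theta = train(_union(datasets))  # theta_{r+1} from all data so far
    return theta, datasets


# Stubs: "parameters" are just the number of examples seen during training.
train_stub = len
deploy_stub = lambda theta, r: [("example", r, k) for k in range(2)]
theta, datasets = continual_learning(train_stub, deploy_stub, ["a", "b", "c"], 2)
```

Note that each call to `train` sees the union of all datasets collected so far, which mirrors estimating $\theta_{r+1}$ from $\bigcup_{q=0}^{r} D_q$ rather than from the latest round alone.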
5.1 Initialization
User interaction requires some level of minimal performance. Pilot
experiments showed that a poorly initialized system is likely to
frustrate users, who in turn provide little learning signal. Our
initialization provides a sufficient level of grammaticality and
plausibility to support user interaction, and thereby further
learning.
We initialize the decoder weights with the first four layers of GPT-2
(Radford et al., 2019). All other weights, including those of the
encoder and pseudo self-attention linear layers, are initialized
randomly. We then train with a supervised dataset
$D_0 = \{(s^{(i)}, \bar{\rho}^{(i)}, \bar{x}^{(i)}, y^{(i)})\}_{i=1}^{|D_0|}$
of human plans $\bar{\rho}^{(i)}$ starting at start states $s^{(i)}$
and instructions $\bar{x}^{(i)}$, all with positive labels
$y^{(i)} = +1$. We use limited data, just sufficient to effectively
interact with users for further learning. We estimate $\theta_1$ by
minimizing a supervised loss:

$$\mathcal{L}_I(\theta_1, D_0) = -\frac{1}{|D_0|} \sum_{i=1}^{|D_0|} \log P(\bar{x}^{(i)} \mid s^{(i)}, \bar{\rho}^{(i)}; \theta_1). \tag{2}$$
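As a small numeric illustration of the supervised loss in Equation (2), the sketch below computes the mean negative log-likelihood over a toy dataset. It assumes that an instruction's probability factorizes into per-token probabilities, as is standard for autoregressive decoders; the function names and the toy probabilities are hypothetical.

```python
import math


def sentence_log_prob(token_probs):
    """log P(x | s, rho; theta) as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def supervised_loss(batch_token_probs):
    """Mean negative log-likelihood over the dataset, as in Eq. (2)."""
    total = sum(sentence_log_prob(tp) for tp in batch_token_probs)
    return -total / len(batch_token_probs)


# Two toy instructions: one with two tokens at probability 0.5 each,
# one with a single token predicted with certainty.
loss = supervised_loss([[0.5, 0.5], [1.0]])
```

Because every log-probability is at most zero, this loss is bounded below by zero, which is the property that negative examples lack (Section 5.2).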
5.2 Learning from User Behavior
Learning from interacting with human users al-
ternates between generating instructions in inter-
action with users and training the model.
Interaction with Users In each round $r$, we first deploy the model
with parameters $\theta_r$ to interact with human users, with our
system as the leader and the user as the follower. We do not update
the model during this interaction phase.
The game environment is randomly generated for each interaction. Each
game continues until it concludes, either when the user leaves or the
turns are exhausted. A game often includes collecting multiple sets of
cards, and generating multiple instructions. Each instruction is
generated for the current state as the start state $s_1$;2 as both
agents move and change the status of cards, the environment state
changes throughout the game. At state $s_1$, we generate the plan
$\bar{p}$ using a deterministic planner that determines (a) which
cards should be selected or de-selected to make the next valid set,
and (b) the shortest paths the leader and follower should take to
visit all target cards. The actions the planner assigns to the
follower form the plan $\bar{p}$. The actions assigned to the leader
are executed by the leader agent deterministically during its turn.
The model is used to sample an instruction
$\bar{x} \sim P(\cdot \mid s_1, \bar{p}; \theta_r)$, which is
displayed to the user. The human user has no access to $\bar{p}$, the
set of target cards, or the game state $s_1$. They only observe the
instruction and what is ahead (Figure 2).

Figure 4: The binary questions displayed to the user at the end of
instruction execution.
During their turn, the user executes $\bar{x}$ to the best of their
ability, and indicates when done. If the user determines that the
instruction cannot be followed, they can terminate the execution,
which is treated just like marking the instruction as complete. The
user execution $\bar{e}$ is the entire sequence of poses they take
while following the instruction.
When the user concludes or terminates an instruction $\bar{x}$, we
show them a top-down view of the entire environment with their
execution path highlighted. They do not see the original system plan.
We ask the user two binary feedback questions about the perceived
correctness of their execution and grammaticality (Figure 4).
We create a tuple $(s_1, \bar{p}, \bar{x}, \bar{e}, f)$ for each
execution $\bar{e}$, where $s_1$ is the start state of the
environment, $\bar{p}$ is the plan generated in that state,
$\bar{x} \sim P(\cdot \mid s_1, \bar{p}; \theta_r)$ is the sampled
instruction, and $f$ is the set of responses to the feedback
questions.
2 For simplicity, we do not index the game time step.
Once the user submits the answers to the feedback questions, the next
instruction is generated.
Dataset Construction We use all interactions in round $r$ to construct
dataset $D_r$, which is made of tuples $(s_1, \bar{\rho}, \bar{x}, y)$,
where $\bar{\rho}$ is a plan and $y$ is a binary label. Given a tuple
$(s_1, \bar{p}, \bar{x}, \bar{e}, f)$, we use three heuristics to add
examples to $D_r$:
1. If any feedback answer in $f$ is negative, the instruction does not
reflect the user’s execution or is not well written (i.e.,
ungrammatical). We add a negative example to $D_r$ with the system
plan $\bar{p}$: $(s_1, \bar{p}, \bar{x}, -1)$.
2. If both feedback answers are positive, the user considers their
execution $\bar{e}$ accurate and the instruction well formed. This
does not necessarily indicate the execution follows the system plan,
but that we can treat the execution as a plan. We add a positive
example with the execution as the plan: $(s_1, \bar{e}, \bar{x}, +1)$.
3. If both answers are positive and the execution $\bar{e}$ follows
the plan $\bar{p}$,3 the instruction communicates the plan well. We
add a positive example with the system plan:
$(s_1, \bar{p}, \bar{x}, +1)$.
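The three heuristics above can be written as a single function over one interaction tuple. This is a minimal sketch under stated assumptions: the `Interaction` record and the `follows_plan` predicate are hypothetical names, and the exact-match predicate used in the demonstration is a stand-in for the card-based criterion described in footnote 3.

```python
from typing import List, NamedTuple, Sequence, Tuple

Pose = Tuple[int, int, int]  # (h, w, alpha)


class Interaction(NamedTuple):
    start_state: object
    plan: Sequence[Pose]          # system plan p
    instruction: str              # generated instruction x
    execution: Sequence[Pose]     # user execution e
    feedback: Tuple[bool, bool]   # (execution judged correct, grammatical)


def build_examples(item: Interaction, follows_plan) -> List[tuple]:
    """Map one (s1, p, x, e, f) tuple to labeled (s1, rho, x, y) examples."""
    s1, plan, x, e, f = item
    if not all(f):
        # Heuristic 1: any negative answer yields a negative example
        # paired with the system plan.
        return [(s1, plan, x, -1)]
    # Heuristic 2: the user's own execution is treated as the plan.
    examples = [(s1, e, x, +1)]
    if follows_plan(e, plan):
        # Heuristic 3: the instruction also conveyed the system plan.
        examples.append((s1, plan, x, +1))
    return examples


same_path = lambda e, p: list(e) == list(p)   # stand-in predicate
plan = [(0, 0, 0), (0, 1, 0)]
other = [(0, 0, 0), (1, 0, 0)]
negative = build_examples(Interaction("s", plan, "go", other, (True, False)), same_path)
execution_only = build_examples(Interaction("s", plan, "go", other, (True, True)), same_path)
both = build_examples(Interaction("s", plan, "go", plan, (True, True)), same_path)
```

A single interaction therefore contributes one negative example, one positive example, or two positive examples, and never a negative example conditioned on the user execution.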
Overall, we add examples to $D_r$ using both the original system plan
and the user execution. The heuristics utilize the observational
learning signal as much as possible while avoiding examples not
beneficial for learning. For example, we do not add negative examples
using the user execution, because these are less likely to be useful
for learning. Although such executions can form negative examples if
the user answered negatively to the correctness question, they tend to
be relatively arbitrary, and it is unlikely the model conditioned on
them will assign significant probability to the generated instruction,
which is the behavior negative examples are meant to suppress.
Parameter Estimation We estimate the model parameters for the next
round $\theta_{r+1}$ using all available data
$D = \bigcup_{q=0}^{r} D_q$. We re-train our model, starting with
GPT-2 parameters (Section 5.1).4
3 For instructions that target cards, we require getting the card
selection right, and ignore the follower position. For instructions
that require waiting (e.g., hold still), we require the position to
remain the same, but allow orientation deviation.
4Pilot studies showed re-training to be more stable than
fine-tuning given new data, and we conduct the majority of
We formulate learning as an offline contextual bandit problem,
treating the sentence labels $y$ as rewards. Learning from the
positive examples in $D$ forms a straightforward supervised learning
problem, albeit one where the data is generated from system
interaction. A key challenge is using the negative examples. Treating
them like supervised examples requires optimizing the probability of
their instructions to zero. Because
$\lim_{P(\cdot) \to 0} \log P(\cdot) = -\infty$, this leads to an
unbounded negative loss that quickly dominates the objective. This is
in contrast to positive examples, for which the loss is bounded by
zero. This issue is not present in existing work using offline
contextual bandits to improve machine translation (Lawrence et al.,
2017; Kreutzer et al., 2018b), where rewards are always non-negative.
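We address this issue by adding an inverse propensity score (IPS;
Horvitz and Thompson, 1952; Wang et al., 2017) coefficient to negative
examples in a policy gradient objective. The gradient for estimating
parameters $\theta_{r+1}$ is:

$$\nabla \mathcal{L}(\theta_{r+1}, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \ell^{(i)}_{\theta_{r+1}} \, y^{(i)} \, \nabla \log P(\bar{x}^{(i)} \mid s^{(i)}, \bar{\rho}^{(i)}; \theta_{r+1}), \tag{3}$$

where, given an example
$(s^{(i)}, \bar{\rho}^{(i)}, \bar{x}^{(i)}, y^{(i)})$ acquired in round
$q$ with parameters $\theta_q$, $\ell^{(i)}_{\theta}$ is:

$$\ell^{(i)}_{\theta} = \begin{cases} 1 & y = +1 \\[4pt] \dfrac{P(\bar{x}^{(i)} \mid s^{(i)}, \bar{p}^{(i)}; \theta)}{P(\bar{x}^{(i)} \mid s^{(i)}, \bar{p}^{(i)}; \theta_q)} & y = -1 \end{cases}. \tag{4}$$

Negative examples are always paired with the system plan, so
$\bar{\rho}^{(i)} = \bar{p}^{(i)}$ in the second case.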
As the probability of a negative example (i.e., $y = -1$) decreases,
so does its impact on the loss. While IPS is commonly used in bandit
learning to de-bias the loss estimate (Lawrence et al., 2017), our
motivation is different, and we do not add it to positive examples.
Because of the large combinatorial space, sentence probabilities are
generally small. The IPS coefficient of a positive example can become
very large as its probability increases during learning. Instead, we
use a supervised-like term, which is known to behave well.5
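The behavior of the coefficient in Equation (4) can be checked numerically. The sketch below is an illustration only, assuming sentence probabilities are available as log-probabilities; the function names are hypothetical, and the returned value is the scalar that multiplies $\nabla \log P$ for one example in Equation (3).

```python
import math


def ips_coefficient(label, logp_current, logp_behavior):
    """Coefficient of Eq. (4): 1 for positives, a probability ratio for negatives.

    logp_current:  log P(x | s, p; theta) under the parameters being estimated.
    logp_behavior: log P(x | s, p; theta_q) under the model that sampled x.
    """
    if label == +1:
        return 1.0
    return math.exp(logp_current - logp_behavior)


def gradient_weight(label, logp_current, logp_behavior):
    """Scalar multiplying grad log P for one example in Eq. (3)."""
    return ips_coefficient(label, logp_current, logp_behavior) * label


# A negative example that was sampled with log-probability -5.0: as training
# pushes its current log-probability down, its weight shrinks toward zero.
weights = [gradient_weight(-1, lp, -5.0) for lp in (-5.0, -10.0, -20.0)]
positive_weight = gradient_weight(+1, -3.0, -7.0)
```

The weight of a negative example starts at magnitude one when the current and sampling models agree and decays as the instruction becomes less likely, so no single negative example can dominate the update, whereas positive examples keep a constant weight of one.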
our experiments with this method. However, we also observe that our
process is overall robust to the initially observed instabilities of
fine-tuning (Section 7).
5 An alternative, and an important direction for future study, is to
add IPS to all examples, but clip it at a certain maximal value,
similar to clipping in PPO (Schulman et al., 2017).
6 Experimental Setup
Initialization Data We create the supervised initialization dataset
$D_0$ by sampling 360 interactions from the original CEREALBAR data
(Suhr et al., 2019), which was collected in a Wizard-of-Oz (WOZ;
Kelley, 1984) setup via human-human games. We select this number
through pilot studies and qualitative analysis to minimize the amount
of initialization data, while still maintaining sufficient model
performance for early interactions to facilitate learning. Our goal is
to use as little data as possible to study the target scenario where
investment in supervised data is minimal, and most learning is left to
interaction with users. This data includes 7,147 examples. We use the
human demonstrations in the original data as plans.
Evaluation Similar to Zhao et al. (2021), we observe that automated
metrics, such as BLEU (Papineni et al., 2002) or BERTScore (Zhang et
al., 2020), computed over a static held-out validation set are
unreliable for evaluating instruction generation. Instead, we focus on
task-completion measures via human execution. We measure task
completion by considering the user execution as completing the
intended task if the user visits all card locations included in the
system plan; or, if the plan includes no target cards, the user stays
in the starting position. We quantify the similarity of the user
execution to the path in the system plan by computing earth mover’s
distance (EMD; Rubner et al., 1998)6 between the two (Blukis et al.,
2019). We also track the user answers to the feedback questions
(Figure 4). We average each measure over the number of instructions in
each round.
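The task-completion criterion described above can be sketched as a simple predicate. This is a rough illustration, assuming poses are `(h, w, alpha)` tuples, card locations are given as a set of `(h, w)` positions, and the first pose of the execution is the starting position; the function name `completes_task` is hypothetical.

```python
def completes_task(execution, plan, card_locations):
    """Return True if the execution completes the task intended by the plan.

    execution, plan: sequences of (h, w, alpha) poses.
    card_locations: set of (h, w) positions that currently hold a card.
    """
    # Card locations that the system plan passes through are the targets.
    target_cards = {(h, w) for h, w, _ in plan} & set(card_locations)
    visited = [(h, w) for h, w, _ in execution]
    if not target_cards:
        # No target cards: the user must stay in the starting position
        # (orientation changes do not matter).
        return all(position == visited[0] for position in visited)
    return target_cards <= set(visited)


cards = {(2, 3), (5, 5)}
card_plan = [(0, 0, 0), (1, 1, 0), (2, 3, 0)]
wait_plan = [(0, 0, 0)]
reached = completes_task([(0, 0, 0), (0, 1, 60), (2, 3, 60)], card_plan, cards)
missed = completes_task([(0, 0, 0), (0, 1, 60)], card_plan, cards)
waited = completes_task([(0, 0, 0), (0, 0, 60)], wait_plan, cards)
moved = completes_task([(0, 0, 0), (1, 0, 0)], wait_plan, cards)
```

The predicate only checks which positions were visited, so an execution that reaches the target cards along a different path still counts as completing the task; path similarity is captured separately by EMD.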
Language Analysis We quantitatively analyze how generated instructions
change throughout training. For each round, we report mean instruction
length, vocabulary size, and three measures of syntactic complexity
using dependency trees (Xu and Reitter, 2016): (a) maximum depth: the
longest path from root to a leaf; (b) maximum width: the maximum
out-degree of any word in the tree; and (c) average branching factor:
the average out-degree of non-leaf words. We normalize the three
measures by instruction length. We qualitatively analyze errors in
generated instructions, by comparatively analyzing 100 randomly
sampled examples where the user failed to complete the intended task
from the first and final rounds.
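The three syntactic measures can be computed from a dependency tree given as a list of head indices. The sketch below is illustrative only: it assumes depth is counted in edges, takes the parse as given (the parser is not specified here), and uses a hypothetical function name and a hand-written toy parse.

```python
def tree_measures(heads):
    """Length-normalized depth, width, and branching factor of a dependency tree.

    heads[i] is the index of the head of word i, or -1 for the root.
    """
    n = len(heads)
    children = [[] for _ in range(n)]
    root = None
    for i, head in enumerate(heads):
        if head < 0:
            root = i
        else:
            children[head].append(i)

    def depth(node):
        # Longest path, in edges, from this node down to a leaf.
        if not children[node]:
            return 0
        return 1 + max(depth(child) for child in children[node])

    out_degrees = [len(c) for c in children]
    non_leaf = [d for d in out_degrees if d > 0]
    branching = sum(non_leaf) / len(non_leaf) if non_leaf else 0.0
    return {
        "max_depth": depth(root) / n,
        "max_width": max(out_degrees) / n,
        "avg_branching": branching / n,
    }


# Toy parse of "grab the red card": grab is the root, card depends on grab,
# and both "the" and "red" depend on card.
measures = tree_measures([-1, 3, 3, 0])
```

For this four-word toy parse the raw depth and width are both 2 and the average branching factor is 1.5 before dividing by the instruction length.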
Interaction Setup Except for initialization, learning and evaluation
are done through live interaction with users on Amazon MTurk. All
workers passed a tutorial and a qualification quiz. We pay $0.15 per
interaction, with a bonus of $0.10 per instruction to workers who
follow our guidelines.
Implementation Details Similar to performance evaluation, automated
measures are unreliable for model selection. Instead, for both
initialization and in each round, we train for N = 400 epochs, and
take the final model. We find N via qualitative analysis of the
initial model. We use an ensemble of four models. We uniformly sample
one of the four models to sample each instruction, and take its
probability to use in IPS for negative examples. We use a sampling
temperature τ = 0.5, and AdamW (Loshchilov and Hutter, 2018) for
learning.
7 Results and Analysis
We conduct a long-term experiment with 14 rounds using our approach,
and separate seven-round experiments to compare system variants. In
both experiments, we collect roughly 100 interactions for each system
per round. In the seven-round experiments, we deploy methods
simultaneously to ensure that our observations are not sensitive to
changes in user behavior, for example, because of adaptation and
increased expertise. We do not inform workers about the model they are
interacting with. We train each system only on data collected by the
same method in previous rounds.
7.1 Long-term Study
We experiment with our approach for 14 rounds. We collect a total of
27,031 instructions from 1,445 interactions, with 103.2 interactions
per round on average. The total cost is $2,895. Figure 5 shows both
performance measures and language trends. For task measures and user
feedback, we also break down performance according to the number of
target cards in the system plan to evaluate performance changes for
plans which may be more difficult to describe (e.g., because they
require specifying more cards).7

6 We use POT (Flamary et al., 2021) to compute EMD.
7 0-card plans target no cards (e.g., hold still).

Figure 5: The system’s lifetime statistics from the long-term
experiment (14 rounds). The system improves on task completion (↑),
EMD (↓), positive response rate for the two feedback questions (↑),
and game score (↑). Section 7.1 discusses these results in detail.

Our learning method significantly improves the system performance
across all measures. Task completion rate improves from 44.7% at round
one to 79.3% at round 14, while EMD decreases from 1.73 to 0.88,
showing increasing similarity between the execution and the plan. The
user perception of the system also improves: The positive response
rate for the perceived correctness question improves from 47.9% to
78.6%, and for grammaticality from 88.9% to 99.2%. The overall
collaborative system performance improves as well; the game score
increases from 4.5 to 10.4. The number of positive examples per round
gradually increases, as the system improves and the interactions
become longer. In contrast, the number of negative examples decreases
over time.
We observe that the initial model struggles to describe plans containing more target cards, with a particularly low task completion rate of 1.6% for 3-card plans in the first round. This is potentially because only 0.7% of human follower executions in D0 demonstrate picking up three cards, while the planner generates 3-card plans 7.9% of the time. While initial improvement is slow for 3-card instructions, it picks up around round eight, and reaches a 32.9% task completion rate.
Language Analysis We observe a consistent trend of decreasing sentence length and vocabulary size. Overall, these trends accompany a reduction in overgeneration of erroneous phrases that are not grounded well in the environment. We also qualitatively observe that the system gradually generates slightly more underspecified instructions, for example by dropping mentions of landmarks crucial for navigating to a target card. This may explain the slight decrease in 1-card task completion rate in later rounds (Figure 5), because the planner usually has the follower travel further for 1-card instructions, which requires relying more on landmarks. A potential explanation for the decrease in vocabulary size is the ever increasing presence of system-generated sentences in training, which reinforces the system’s word choices. Alternatively, our learning signal may not account for the need for more descriptive language. For example, humans may compensate with exploration for omitted descriptions, which is not distinguished by how we convert the observed behavior to a learning signal. These trends outline important directions for future work.
We observe a small increase in syntactic complexity over the system’s lifetime with regard to the branching factor, which shows a significant increase (p < 0.00001).8 We also see a slight decrease in maximum tree depth (p < 0.0001), and no significant change in max width.
Error Type | r = 1 | r = 14 | Example
Incorrect, missing, or extra cards | 75 | 39 | turn left and go to the yellow star triangles
Irrelevant landmarks | 13 | 1 | Head toward the windmill house. grab 2 red and triangle
Incorrect direction | 30 | 35 | grab the black heart to your left in front of you.
Incorrect actions or conditions | 28 | 14 | After the two red triangles, get the 3 red triangles.
Underspecification | 8 | 26 | turn right and go straight toward red trees collect two orange triangle.
Implausible instructions | 11 | 1 | Turn left and get the two pink hearts and the two pink hearts near the pink hearts.
Proportion of erroneous instructions | 68.5% | 26.8% |
Table 1: The types of errors observed in erroneous instructions generated during the first (r = 1) and final (r = 14) rounds of deployment. We show error counts from the 100 randomly-sampled erroneous instructions. Examples illustrate error categories; red strikethrough shows erroneous segments, and blue fragments show possible corrections. Instructions that fit into multiple categories are double counted.
Error Analysis We analyze errors in the generated instructions at the first and final rounds. For each round, we randomly sample 100 instructions that the user did not execute according to the plan or answered negatively to a feedback question. Table 1 shows error types and example instructions. Overall, the frequency of erroneous instructions decreases from 68.5% of instructions in the first round, to 26.8% in the final round. From the first to final round, we observe a noticeable decrease in errors related to grounding of cards and landmarks. The overall frequency of errors related to incorrect directions and incorrect actions or conditions also decreases, and implausible instructions diminish close to zero percent. However, there is an overall increase in underspecified instructions. This aligns with the decrease in the vocabulary size and landmark use we discuss above.
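To see how an error category can shrink in overall frequency even when its count among the sampled erroneous instructions grows, the Table 1 counts can be rescaled by the proportion of erroneous instructions in each round. The sketch below is a rough estimate under two assumptions of ours: that the 100 sampled instructions are representative of all erroneous instructions in a round, and that count/100 approximates the share of erroneous instructions showing a given error. The function name is hypothetical.

```python
# Rough estimate of how often each error type appears among ALL instructions
# in a round, from the Table 1 sample counts (out of 100 erroneous
# instructions) and the proportion of erroneous instructions in that round.
# Assumption (ours): the 100-instruction sample is representative.

SAMPLE_SIZE = 100

# (count at r = 1, count at r = 14), copied from Table 1.
ERROR_COUNTS = {
    "Incorrect, missing, or extra cards": (75, 39),
    "Irrelevant landmarks": (13, 1),
    "Incorrect direction": (30, 35),
    "Incorrect actions or conditions": (28, 14),
    "Underspecification": (8, 26),
    "Implausible instructions": (11, 1),
}

# Proportion of erroneous instructions in each round, from Table 1.
ERRONEOUS_RATE = {1: 0.685, 14: 0.268}


def overall_frequency(count: int, erroneous_rate: float) -> float:
    """Estimated fraction of all instructions in a round with this error type."""
    return (count / SAMPLE_SIZE) * erroneous_rate


if __name__ == "__main__":
    for error_type, (first, final) in ERROR_COUNTS.items():
        f1 = overall_frequency(first, ERRONEOUS_RATE[1])
        f14 = overall_frequency(final, ERRONEOUS_RATE[14])
        print(f"{error_type}: {100 * f1:.1f}% -> {100 * f14:.1f}%")
```

Under these assumptions, incorrect-direction errors would fall from roughly 21% of all instructions in the first round to roughly 9% in the final round, although their sample count rises from 30 to 35. Underspecification would rise from roughly 5% to roughly 7%.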
Confounding Factors We identify two mechanisms unrelated to our approach that could explain the observed performance changes. We deploy two additional systems alongside our system during the final round. For each interaction, one of the three systems is randomly chosen. We do not inform the workers of the identity of the model for each interaction. First, we deploy the system following initialization during the final round to study if performance might be explained by user improvement over time. Second, because we train with a fixed number of epochs, later rounds have many more gradient updates, which may allow for better parameter estimation, even with the same amount of data. We train a system on the initialization dataset D0 for the same number of gradient updates as when training the final full system.
Table 2 shows that these confounding factors do not explain the observed gains. We find minimal differences between evaluating the initial model (θ1) at the beginning and end of deployment, showing no significant effect from user improvement. Training the initial system longer (θ′1) shows a slight overall improvement, but it is negligible compared to the final system (θ14).
8 We use t-test (α = 0.01) comparing rounds 1 and 14.
Model | r | Overall | 0-card (footnote 7) | 1-card | 2-card | 3-card
θ1 | 1 | 44.8 | 84.1 | 64.9 | 9.6 | 1.7
θ1 | 14 | 45.1 | 84.5 | 62.1 | 9.3 | 0.8
θ′1 | 14 | 49.6 | 76.6 | 63.8 | 24.8 | 7.4
θ14 | 14 | 79.4 | 99.6 | 81.9 | 72.1 | 33.0
Table 2: The effect of confounding factors on task completion rate (%). The initial model θ1 is evaluated both in the first (r = 1) and final (r = 14) rounds, showing no effect of user adaptation. In the final round, we also evaluate θ′1, which is trained on the same data as θ1 but using more gradient updates. We also show results for the final-round model θ14.
7.2 System Variants Study
We vary different design decisions, and experiment for seven interaction rounds.9 We experiment with five system variants: (a) FULL: our full approach described in Section 5; (b) POS-ONLY: use only examples with positive labels y = +1; (c) TC-ONLY: ignore the feedback questions, instead if the user completes the task according to our task success measure we add positive examples with both the system plan and user execution, otherwise we add a negative example using the system plan; (d) NO-ENSEMBLE: train and deploy a single model each round, starting from an initial model randomly sampled from those we use for FULL; and (e) FINE-TUNING: train model parameters θr+1 on Dr for N epochs, starting from θr, avoiding overfitting with rehearsal (Rebuffi et al., 2017; Hawkins et al., 2020a). In rehearsal, in each batch, half the examples are sampled randomly from the previous datasets D0, ..., Dr−1. Except for the variations specified, the systems are identical. We do not deploy a system ablating IPS, because we observe that training with negative examples without IPS results in a largely unusable system.
9 This study is similar to ablation analysis, but aims to study different learning design decisions. Full-fledged repetitive ablations to identify the ideal system design are particularly challenging in this work, both because of experiment costs and the complex dynamics of interacting with users.
Figure 6: Comparison of system variants.
We collect a total of 63,189 instructions across all systems, with 3,173 interactions. Each round includes 453.2 interactions on average. The total cost is $7,165. All systems are used concurrently in each round, including re-deploying FULL starting from initialization. Figure 6 shows the results. Despite some differences between the system variants, our method is largely robust to variations in learning design decisions.
All systems achieve comparable improvements in task completion rate, except for POS-ONLY, which slightly underperforms. We observe a faster decrease in the vocabulary size and instruction length for POS-ONLY, which does not use negative examples. This is possibly because the loss from negative examples encourages a more uniform generation distribution, potentially slowing down the overall trend of making the generation distribution more peaky. TC-ONLY, which ignores the answers to user feedback questions when constructing the dataset, shows fewer positive responses to the perceived correctness feedback, although task completion remains comparable.
We observe that using a single model (NO-ENSEMBLE) rather than an ensemble leads to limited difference in overall performance. However, because of the challenge of identifying a good automated metric to stop training, the performance of models following training varies significantly. This can lead to deploying a bad model, which provides users with a poor experience. Using an ensemble of models incurs higher computational cost, but makes such a worst-case scenario less likely. For example, in our long-term experiment, the maximum task completion performance gap we observe between the best and worst models in each round is 13%.
Finally, we observe that fine-tuning (FINE-TUNING) works as well as our re-training approach (FULL), potentially with a more stable vocabulary size. This is in contrast to our initial experiments, which showed it is harder to get consistent improvements through fine-tuning. While the fine-tuning process is harder to design because it requires choosing the fine-tuning procedure (e.g., rehearsal (Robins, 1995) or KL regularization (Yu et al., 2013)) and carefully optimizing additional hyperparameters, it can work just as well as re-training. Because fine-tuning is faster to train between rounds, it may be preferable in future work.
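The rehearsal scheme used by the FINE-TUNING variant, where half of each batch comes from previous datasets, can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: the function name, the list-based datasets, uniform sampling with replacement over the pooled previous examples, and the handling of odd batch sizes are all assumptions.

```python
import random
from typing import List, Sequence, TypeVar

Example = TypeVar("Example")


def sample_rehearsal_batch(
    current: Sequence[Example],
    previous: Sequence[Sequence[Example]],
    batch_size: int,
    rng: random.Random,
) -> List[Example]:
    """Build one batch: half from the current dataset D_r, half from D_0..D_{r-1}.

    Assumptions (ours): previous datasets are pooled and sampled uniformly
    with replacement; for odd batch sizes the extra example comes from D_r;
    if there are no previous examples, the whole batch comes from D_r.
    """
    pooled_previous = [ex for dataset in previous for ex in dataset]
    num_previous = batch_size // 2 if pooled_previous else 0
    num_current = batch_size - num_previous
    batch = rng.choices(list(current), k=num_current)
    if num_previous:
        batch += rng.choices(pooled_previous, k=num_previous)
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    d0, d1, d2 = ["a0", "b0"], ["a1", "b1"], ["a2", "b2", "c2"]
    print(sample_rehearsal_batch(current=d2, previous=[d0, d1], batch_size=4, rng=rng))
```

Pooling the previous datasets weights each old example equally; sampling a previous round first and then an example within it would instead weight each round equally, which is a different design choice.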
7.3 Comparison to Supervised Learning
We also separately study the learning trends of our method compared to training on an equivalent amount of supervised WOZ data. Supervised data is fundamentally different from our bandit data, for two main reasons: (a) it is significantly costlier because it requires a dedicated instruction-writing effort, whereas our data arises naturally from the system’s interaction with users during deployment; and (b) it provides per-token labels, whereas our data includes only utterance-level binary labels. For the supervised system, after each round, we expand the dataset by randomly drawing an equivalent amount of additional data from the complete dataset of Suhr et al. (2019), which includes 19,112 examples from 960 interactions.10 This dataset allows for seven rounds. We concurrently deploy a no-ensemble variant of our continual learning system. We collect a total of 22,216 instructions across both systems, with 1,166 interactions. This experiment’s total cost is $2,230.
10 Interactions with the supervised system are not used for learning, but only for evaluation.
Figure 7: Comparison to supervised learning. The continual learning system is competitive in task completion rates with systems trained on an equivalent amount of supervised data.
Figure 7 shows our continual learning system consistently outperforms this supervised alternative in overall task completion rate. There are two potential explanations for this gap. First, the data our approach uses is made of examples the system is likely to generate, potentially providing a more effective learning signal. Second, there is a difference between the plans of human leaders and our planner. Our training is better suited to adapt to how the complete system is designed, whereas training on human-annotated data is bound to suffer from a distribution shift. However, the continual learning system did not consistently outperform the supervised alternative on 2-card and 3-card instructions, especially at early rounds. This is likely because the continual learning system generates few positive examples for more complex system plans (i.e., 2-card or 3-card) at earlier rounds. At later rounds, as the system improves, we observe more positive examples for such plans, creating an accelerating effect of improvement, which is best observed in our long-term experiment (Figure 5).
8 Related Work
Learning for instruction generation has been studied using supervised methods, with examples of task specifications (i.e., contexts) paired with human-written instructions (e.g., Daniele et al., 2016; Narayan-Chen et al., 2019), including to improve instruction following (Fried et al., 2018; Tan et al., 2019). We focus on continually learning by observing users executing generated instructions. This reduces annotation needs, and delegates much of the learning to interaction with users during system deployment. Language generation in context was also studied in scenarios that are not explicitly instructional, but aim to elicit specific behavior, such as negotiation games (e.g., Lewis et al., 2017) and referring expression generation (e.g., Dale and Reiter, 1995).
Gatt and Krahmer (2017) survey existing work on language generation, including using rule-based methods. Similar to our approach, some rule-based methods were evaluated with human followers in situated environments using task success (e.g., Koller et al., 2010; Janarthanam and Lemon, 2011). Such methods are accurate and reliable, but are limited to pre-specified rules and remain static following development. Our focus is on studying the potential for learning by observing human behavior. The two approaches can be combined, for example by using rule-based methods to generate initialization data for our approach.
Bandit learning has been studied with simulated user ratings for machine translation (Nguyen et al., 2017; Lawrence et al., 2017; Kreutzer et al., 2017) and semantic parsing (Lawrence and Riezler, 2018). We learn from real users, similar to recent studies in machine translation (Kreutzer et al., 2018a,b). In general, such learning assumes users can judge the system output, for example via proficiency in the language they wish to translate to. Our learning signal does not require such expertise, and is available naturally from the interaction.
Explicit human feedback has also been in-
corporated into reinforcement learning methods
(Knox and Stone, 2009; Pilarski et al., 2011;
Daniel et al., 2015; Mathewson and Pilarski, 2016;
Warnell et al., 2018; MacGlashan et al., 2017;
Arumugam et al., 2019), including in the context
of dialogue system learning (Liu et al., 2018).
Jaques et al. (2020) study forming a reward from
implicit feedback for non-task-oriented dialogue
language generation, by training multiple mod-
els to detect linguistic signals, such as sentiment
and lexical overlap, that correlate with explicit
user feedback. Learning from users has also been
studied by asking users to rank system outputs
(e.g., Wilson et al., 2012; Christiano et al., 2017),
including for instruction following (Wang et al.,
2016) and summarization (Stiennon et al., 2020).
Unlike our approach, such ranking requires know-
ing the true system intent, and is not part of the
system’s normal operation (i.e., instructing users
in our case).
Incorporating human users into learning is re-
lated to active learning (Settles, 2009), where a
policy selects examples for an oracle to label
during learning. Unlike common active learning
scenarios, we do not select examples from a static
underlying distribution (i.e., a training set) for an-
notation, but generate examples with the learned
model. This is similar to query synthesis active
learning (Angluin, 1988), where examples are gen-
erated for annotation, rather than being selected
from a set of unannotated examples. A more signi-
ficant difference is that active learning methods
solicit model output annotations by presenting an
oracle with model inputs. In contrast, our approach
exposes users to model outputs (i.e., generated in-
structions). It does not solicit written instructions,
as would be expected if requesting labels. We also
do not show model inputs (i.e., plans) to users.
Finally, our model interacts with users during sys-
tem operation, while completing its task. It does
not require oracle annotators.
Language learning from behavioral signals has
been studied in the cognitive science and psychol-
ogy literature.11 Krauss and Weinheimer (1966)
study two types of feedback in human studies:
concurrent linguistic feedback and behavioral in-
tent confirmation, and show how both influence
linguistic adaptation in an interaction over time.
Studies of reference games reproduced the effect
of confirmation feedback, showing that successful
intent communication reinforces convention for-
mation in the form of shorter references (Clark and
Wilkes-Gibbs, 1986; Hawkins et al., 2020b). Our
learning signal is a type of confirmation feedback.
However, our interaction procures and makes use
of more complex learning signals than a simple
binary intent communication success, by using
the path the listener takes in response to the
generated instruction as an alternative intent when
constructing data for learning (Section 5.2).12
9 Discussion
We propose a methodology to continually improve
an instruction generation model by observing hu-
man users executing natural language instructions,
and demonstrate its efficacy within a collaborative
instruction following scenario. Our study shows
that observation of user behavior is an infor-
mative signal for generating language to relay
instructional intent. To the best of our knowledge,
this type of learning signal has not been studied
before. This learning setting facilitates contin-
ual learning through interaction with users, and
is particularly compelling for interactions with
collaborative agents, including robots and soft-
ware agents. Such agents are likely to operate in
constantly changing environments (e.g., robots in
homes), where continual learning is necessary to
adjust to changes. Our continual learning approach
also provides systems the flexibility to co-adapt to
human users, who are likely to change preferences
and behaviors in response to system behavior.
Our experiments demonstrate the learning process is robust to various learning and process design choices. However, they also show it is accompanied by a reduction of language complexity, including reducing the effective vocabulary and sentence length. While much of the decrease in the effective vocabulary size throughout the system lifetime relates to generating fewer erroneous phrases, it also reduces the language diversity and descriptiveness. Our experiments show that this trend can be slowed down by using negative examples, and appears to be less pronounced when using fine-tuning. The combination of this decrease with the preference for shorter instructions makes it difficult for the system to describe longer, complex trajectories. Qualitatively, we observe this open problem is responsible for a significant portion of the remaining errors. An important direction for future work is experimenting with directly encouraging more diverse language. This can be combined with approaches that allow for introducing new word types, which is unlikely in our approach, even though it uses sub-word tokenization. A potential direction in this vein is combining our approach with active learning to solicit human-written oracle instructions for plans the system fails to communicate.
11 This review is not comprehensive, and only aims to highlight the relation to problems studied in related disciplines.
12 In more recent reference games (Hawkins et al., 2020b), unlike in Krauss and Weinheimer (1966), the choice of a bad referent can be seen as related to our use of listener execution.
Our work highlights several other directions
for future work. There is a strong need for a
reliable automated metric to evaluate instruction
generation. In the absence of such a metric, we use a
simple, but likely sub-optimal, stopping criterion for
learning. Beyond the learning signal we explored
in our experiments, there are additional potential
cues available during interaction. Examples include
using a continuous-valued similarity between system
intent and user execution, modeling follower quality
to discount the learning signal from interactions
with bad followers, or weighting the feedback
questions differently for a more nuanced reward.
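As one way to picture how such cues could be combined, the sketch below computes a continuous reward from a plan-execution similarity score, a follower-quality discount, and weighted feedback answers. Everything here is hypothetical and goes beyond the binary signal used in the experiments: the function name, the [0, 1] input ranges, the default weights, and the multiplicative discount are illustrative assumptions only.

```python
def shaped_reward(
    similarity: float,
    follower_quality: float,
    correctness_answer: bool,
    grammaticality_answer: bool,
    w_similarity: float = 0.6,
    w_correctness: float = 0.3,
    w_grammaticality: float = 0.1,
) -> float:
    """Hypothetical reward in [-1, 1] combining several interaction cues.

    similarity: plan-execution similarity in [0, 1] (1 = identical).
    follower_quality: estimated follower reliability in [0, 1].
    The two booleans are the answers to the two feedback questions.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    if not 0.0 <= follower_quality <= 1.0:
        raise ValueError("follower_quality must be in [0, 1]")
    total_weight = w_similarity + w_correctness + w_grammaticality
    score = (
        w_similarity * similarity
        + w_correctness * float(correctness_answer)
        + w_grammaticality * float(grammaticality_answer)
    ) / total_weight
    # Map the [0, 1] score to [-1, 1], then shrink it toward zero for
    # followers estimated to be unreliable.
    return follower_quality * (2.0 * score - 1.0)
```

With these illustrative defaults, a perfect execution by a fully trusted follower yields a reward of 1, while the same execution by a follower with estimated quality 0 contributes no learning signal at all.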
Finally, the decrease in utterance length and
vocabulary size mirrors similar trends observed
in studies of human communication (Clark and
Wilkes-Gibbs, 1986; Hawkins et al., 2020b). This
illustrates the potential of continual learning sys-
tems to reflect the dynamics of language change
human participants expect in natural language in-
teractions. Observations of human learning also
indicate the potential of integrating our approach
with conversational self-repair (Clark, 2020) and
partner reformulation (Clark, 2018), both impor-
tant components of child language acquisition that
likely provide better credit assignment for learning
compared to our binary bandit signal.
Acknowledgments
This research was supported by ARO W911NF-21-1-0106, a Google Focused Award, the Masason Foundation, a Facebook Fellowship, and NSF under grants no. 1750499 and DGE-1650441. We thank Jonathan Chang, Sasha Rush, the Cornell NLP Group, Robert Hawkins, Dipendra Misra, and John Langford for discussion and comments; Suyi Diao for Unity development; Anna Effenberger for code to compute syntax complexity; Ge Gao, Koji Shiono, and Takayuki Kojima for feedback on our interaction platform; and the crowdsourcing workers for participating in our data collection. Finally, we thank the action editor and the anonymous reviewers for detailed comments.
References
D. Angluin. 1988. Queries and concept learning.
Machine Learning, 2:319–342.
https://doi.org/10.1023/A:1022821128753,
https://doi.org/10.1007/BF00116828
Dilip Arumugam, Jun Ki Lee, Sophie Saskin,
and Michael L. Littman. 2019. Deep reinforcement
learning from policy-dependent human
feedback. CoRR, abs/1902.04257.
Valts Blukis, Eyvind Niklasson, Ross A. Knepper,
and Yoav Artzi. 2019. Learning to map natural
language instructions to physical quadcopter
control using simulated flight. In Proceedings
of the Conference on Robot Learning,
pages 1415–1438.
Paul Christiano, Jan Leike, Tom B. Brown, Miljan
Martic, Shane Legg, and Dario Amodei. 2017.
Deep reinforcement learning from human
preferences. In Proceedings of the Advances in
Neural Information Processing Systems. Curran
Associates, Inc.
Eve V. Clark. 2018. Conversation and language
acquisition: A pragmatic approach. Language
Learning and Development, 14:170–185.
https://doi.org/10.1080/15475441.2017.1340843
Eve V. Clark. 2020. Conversational repair and the
acquisition of language. Discourse Processes,
57:441–459.
https://doi.org/10.1080/0163853X.2020.1719795
Herbert H. Clark and Deanna Wilkes-Gibbs. 1986.
Referring as a collaborative process. Cognition,
22(1):1–39.
https://doi.org/10.1016/0010-0277(86)90010-7
Robert Dale and Ehud Reiter. 1995. Computational
interpretations of the Gricean maxims in
the generation of referring expressions.
Cognitive Science, 19:233–263.
https://doi.org/10.1207/s15516709cog1902_3
Christian Daniel, Oliver Kroemer, M. Viering,
Jan Metz, and Jan Peters. 2015. Active reward
learning with a novel acquisition function. Auto-
nomous Robots, 39:389–405. https://doi
.org/10.1007/s10514-015-9454-z
Andrea F. Daniele, Mohit Bansal, and Matthew R.
Walter. 2016. Natural language generation in
the context of providing indoor route instruc-
tions. In Proceedings of the Robotics: Science
and Systems Workshop on Model Learning for
Human-Robot Communication.
Rémi Flamary, Nicolas Courty, Alexandre Gramfort,
Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas
Chambon, Laetitia Chapel, Adrien Corenflos,
Kilian Fatras, Nemo Fournier, Léo Gautheron,
Nathalie T. H. Gayraud, Hicham Janati, Alain
Rakotomamonjy, Ievgen Redko, Antoine Rolet,
Antony Schutz, Vivien Seguy, Danica J.
Sutherland, Romain Tavenard, Alexander
Tong, and Titouan Vayer. 2021. POT: Python
optimal transport. Journal of Machine Learning
Research, 22(78):1–8.
Daniel Fried, Jacob Andreas, and Dan Klein.
2018. Unified pragmatic models for generating
and following instructions. In Proceedings of
the Conference of the North American Chapter
of the Association for Computational
Linguistics: Human Language Technologies,
pages 1951–1963.
https://doi.org/10.18653/v1/N18-1177
Albert Gatt and Emiel Krahmer. 2017. Survey
of the state of the art in natural language
generation: Core tasks, applications and
evaluation. Journal of Artificial Intelligence
Research, 61:65–170.
https://doi.org/10.1613/jair.5477
Robert Hawkins, Minae Kwon, Dorsa Sadigh, and
Noah Goodman. 2020a. Continual adaptation
for efficient machine communication. In Pro-
ceedings of the Conference on Computational
Natural Language Learning, pages 408–419.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.conll-1.33
Robert D. Hawkins, Michael C. Frank, and
Noah D. Goodman. 2020b. Characterizing
the dynamics of learning in repeated refer-
ence games. Cognitive Science, 44(6):e12845.
https://doi.org/10.1111/cogs.12845,
PubMed: 32496603
Emiel Hoogeboom, Jorn W. T. Peters, Taco S.
Cohen, and Max Welling. 2018. Hexaconv. In
Proceedings of the International Conference on
Learning Representations.
Daniel G. Horvitz and Donovan J. Thompson.
1952. A generalization of sampling without re-
placement from a finite universe. Journal of the
American Statistical Association, 47(260):663–685.
https://doi.org/10.1080/01621459
.1952.10483446
Srini Janarthanam and Oliver Lemon. 2011.
The GRUVE challenge: Generating routes un-
der uncertainty in virtual environments. In
Proceedings of
the European Workshop on
Natural Language Generation, pages 208–211.
Association for Computational Linguistics.
Natasha Jaques,
Judy Hanwen Shen, Asma
Ghandeharioun, Craig
Ferguson, Agata
Lapedriza, Noah Jones, Shixiang Gu, and
Rosalind Picard. 2020. Human-centric dialog
training via offline reinforcement learning. In
Proceedings of
the Conference on Empiri-
cal Methods in Natural Language Processing,
pages 3985–4003. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2020.emnlp-main.327
John F. Kelley. 1984. An iterative design method-
ology for user-friendly natural language office
information applications. ACM Transactions on
Information Systems, 2(1):26–41. https://
doi.org/10.1145/357417.357420
W. Bradley Knox and Peter Stone. 2009. Interac-
tively shaping agents via human reinforcement:
the TAMER framework. In Proceedings of the
fifth international conference on Knowledge
capture, pages 9–16.
Alexander Koller, Kristina Striegnitz, Andrew
Gargett, Donna Byron, Justine Cassell, Robert
Dale, Johanna Moore, and Jon Oberlander.
2010. Report on the second NLG challenge on
generating instructions in virtual environments
(GIVE-2). In Proceedings of International Nat-
ural Language Generation Conference. Asso-
ciation for Computational Linguistics.
Robert M. Krauss and Sidney Weinheimer. 1966.
Concurrent feedback, confirmation, and the
encoding of referents in verbal communica-
tion. Journal of Personality and Social Psy-
chology, 43:343–6. https://doi.org/10
.1037/h0023705, PubMed: 5969163
Julia Kreutzer, Shahram Khadivi, Evgeny
Matusov, and Stefan Riezler. 2018a. Can neu-
ral machine translation be improved with user
feedback? In Proceedings of the Conference of
the North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, pages 92–105. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/N18-3012
Julia Kreutzer, Artem Sokolov, and Stefan
Riezler. 2017. Bandit structured prediction
for neural sequence-to-sequence learning. In
Proceedings of the Annual Meeting of the
Association for Computational Linguistics,
pages 1503–1513. Association for Computational
Linguistics.
https://doi.org/10.18653/v1/P17-1138
Julia Kreutzer, Joshua Uyheng, and Stefan Riezler.
2018b. Reliability and learnability of human
bandit feedback for sequence-to-sequence
reinforcement learning. In Proceedings of the
Annual Meeting of the Association for
Computational Linguistics, pages 1777–1788.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/P18-1165
Carolin Lawrence and Stefan Riezler. 2018.
Improving a neural semantic parser by counter-
factual learning from human bandit feedback.
In Proceedings of the Annual Meeting of the
Association for Computational Linguistics,
pages 1820–1830. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P18-1169
Carolin Lawrence, Artem Sokolov, and Stefan
Riezler. 2017. Counterfactual learning from
bandit feedback under deterministic logging:
A case study in statistical machine translation.
In Proceedings of the Conference on Empirical
Methods in Natural Language Processing,
pages 2566–2576. Association for Computational
Linguistics.
https://doi.org/10.18653/v1/D17-1272
Mike Lewis, Denis Yarats, Yann Dauphin, Devi
Parikh, and Dhruv Batra. 2017. Deal or no deal?
End-to-end learning of negotiation dialogues.
In Proceedings of the Conference on Empiri-
cal Methods in Natural Language Processing,
pages 2443–2453. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D17-1259
Bing Liu, Gokhan Tür, Dilek Hakkani-Tür,
Pararth Shah, and Larry Heck. 2018. Dialogue
learning with human teaching and feedback
in end-to-end trainable task-oriented dialogue
systems. In Proceedings of the Conference of
the North American Chapter of the Associa-
tion for Computational Linguistics: Human
Language Technologies, pages 2060–2069.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N18
-1187
Ilya Loshchilov and Frank Hutter. 2018. De-
coupled weight decay regularization. In Pro-
ceedings of the International Conference on
Learning Representations.
James MacGlashan, Mark K. Ho, Robert Tyler
Loftin, Bei Peng, David L. Roberts, Matthew E.
Taylor, and Michael L. Littman. 2017. Inter-
active learning from policy-dependent human
feedback. In Proceedings of the International
Conference on Machine Learning.
K. Mathewson and P. Pilarski. 2016. Simultaneous
control and human feedback in the training of
a robotic agent with actor-critic reinforcement
learning. arXiv, abs/1606.06979.
Anjali Narayan-Chen, Prashant Jayannavar, and
Julia Hockenmaier. 2019. Collaborative dia-
logue in Minecraft. In Proceedings of the Annual
Meeting of the Association for Computational
Linguistics, pages 5405–5415. Association for
Computational Linguistics. https://doi.org/10.18653/v1/P19-1537
Khanh Nguyen, Hal Daumé III, and Jordan
Boyd-Graber. 2017. Reinforcement learning
for bandit neural machine translation with
simulated human feedback. In Proceedings of
the Conference on Empirical Methods in Nat-
ural Language Processing, pages 1464–1474.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D17-1153
Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. BLEU: A method
for automatic evaluation of machine transla-
tion. In Proceedings of the Annual Meeting of
the Association for Computational Linguistics,
pages 311–318. Association for Computational
Linguistics. https://doi.org/10.3115/1073083.1073135
P. M. Pilarski, M. R. Dawson, T. Degris, F.
Fahimi, J. P. Carey, and R. S. Sutton. 2011.
Online human training of a myoelectric pros-
thesis controller via actor-critic reinforcement
learning. In Proceedings of the International
Conference on Rehabilitation Robotics, pages 1–7.
https://doi.org/10.1109/ICORR.2011.5975338, PubMed: 22275543
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners.
Sylvestre-Alvise Rebuffi, Alexander Kolesnikov,
Georg Sperl, and Christoph H. Lampert. 2017.
iCaRL: Incremental classifier and representation
learning. In Proceedings of the Conference
on Computer Vision and Pattern Recognition,
pages 2001–2010. IEEE.
Anthony Robins. 1995. Catastrophic forgetting,
rehearsal and pseudorehearsal. Connection
Science, 7(2):123–146. https://doi.org/10.1080/09540099550039318
Yossi Rubner, Carlo Tomasi, and Leonidas J.
Guibas. 1998. A metric for distributions with
applications to image databases. In Proceed-
ings of the International Conference on Com-
puter Vision. IEEE.
John Schulman, Filip Wolski, Prafulla Dhariwal,
Alec Radford, and Oleg Klimov. 2017. Proximal
policy optimization algorithms. arXiv, abs/1707.06347.
Burr Settles. 2009. Active learning literature
survey.
Nisan Stiennon, Long Ouyang, Jeffrey Wu,
Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec
Radford, Dario Amodei, and Paul F.
Christiano. 2020. Learning to summarize with
human feedback. In Proceedings of the Advances
in Neural Information Processing Systems,
pages 3008–3021. Curran Associates, Inc.
Alane Suhr, Claudia Yan, Jack Schluger, Stanley
Yu, Hadi Khader, Marwa Mouallem, Iris
Zhang, and Yoav Artzi. 2019. Executing
instructions in situated collaborative interactions.
In Proceedings of the Conference on Empiri-
cal Methods in Natural Language Processing,
pages 2119–2130. Association for Computational
Linguistics. https://doi.org/10.18653/v1/D19-1218
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le.
2014. Sequence to sequence learning with neu-
ral networks. In Proceedings of the Advances
in Neural Information Processing Systems,
pages 3104–3112. Curran Associates, Inc.
Hao Tan, Licheng Yu, and Mohit Bansal. 2019.
Learning to navigate unseen environments:
Back translation with environmental dropout.
In Proceedings of the Conference of the North
American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 2610–2621. Association
for Computational Linguistics. https://doi.org/10.18653/v1/N19-1268
Sida I. Wang, Percy Liang, and Christopher D.
Manning. 2016. Learning language games
through interaction. In Proceedings of the Annual
Meeting of the Association for Computational
Linguistics, pages 2368–2378. Association for
Computational Linguistics. https://doi.org/10.18653/v1/P16-1224
Yu-Xiang Wang, Alekh Agarwal, and Miroslav
Dudík. 2017. Optimal and adaptive off-policy
evaluation in contextual bandits. In Proceed-
ings of International Conference on Machine
Learning, pages 3589–3597. Proceedings of
Machine Learning Research.
Garrett Warnell, Nicholas R. Waytowich, Vernon
Lawhern, and Peter Stone. 2018. Deep TAMER:
Interactive agent shaping in high-dimensional
state spaces. In Proceedings of the AAAI Con-
ference on Artificial Intelligence.
Aaron Wilson, Alan Fern, and Prasad Tadepalli.
2012. A Bayesian approach for policy learning
from trajectory preference queries. In Proceed-
ings of the Advances in Neural Information
Processing Systems. Curran Associates, Inc.
Yang Xu and David Reitter. 2016. Conver-
gence of syntactic complexity in conversation.
In Proceedings of the Annual Meeting of the
Association for Computational Linguistics,
pages 443–448. Association for Computational
Linguistics.
Dong Yu, Kaisheng Yao, Hang Su, Gang Li,
and Frank Seide. 2013. KL-divergence regularized
deep neural network adaptation for
improved large vocabulary speech recogni-
tion. In 2013 IEEE International Conference
on Acoustics, Speech and Signal Processing,
pages 7893–7897. IEEE.
Tianyi Zhang, Varsha Kishore, Felix Wu,
Kilian Q. Weinberger, and Yoav Artzi. 2020.
BERTScore: Evaluating text generation with
BERT. In Proceedings of the International Con-
ference on Learning Representations.
Ming Zhao, Peter Anderson, Vihan Jain, Su Wang,
Alex Ku, Jason Baldridge, and Eugene Ie.
2021. On the evaluation of vision-and-language
navigation instructions. In Proceedings of the
European Chapter of the Association for
Computational Linguistics, pages 1302–1316.
Association for Computational Linguistics.
Zachary M. Ziegler, Luke Melas-Kyriazi, Sebastian
Gehrmann, and Alexander M. Rush. 2019.
Encoder-agnostic adaptation for conditional
language generation. arXiv, abs/1908.06938.