Adaptive Generation in Dialogue Systems
Using Dynamic User Modeling
Srinivasan Janarthanam∗
Heriot-Watt University
Oliver Lemon∗∗
Heriot-Watt University
We address the problem of dynamically modeling and adapting to unknown users in
resource-scarce domains in the context of interactive spoken dialogue systems. As an example, we
show how a system can learn to choose referring expressions to refer to domain entities for users
with different levels of domain expertise, and whose domain knowledge is initially unknown
to the system. We approach this problem using a three-step process: collecting data using a
Wizard-of-Oz method, building simulated users, and learning to model and adapt to users using
Reinforcement Learning techniques.
We show that by using only a small corpus of non-adaptive dialogues and user knowledge
profiles it is possible to learn an adaptive user modeling policy using a sense-predict-adapt
approach. Our evaluation results show that the learned user modeling and adaptation strategies
performed better in terms of adaptation than some simple hand-coded baseline policies, with both
simulated and real users. With real users, the learned policy produced around a 20% increase in
adaptation in comparison to an adaptive hand-coded baseline. We also show that adaptation to
users’ domain knowledge results in improving task success (99.47% for the learned policy vs.
84.7% for a hand-coded baseline) and reducing dialogue time (11% relative
difference). We also compared the learned policy with a variety of carefully hand-crafted adaptive
policies that use the user knowledge profiles to adapt their choices of referring expressions
throughout a conversation. We show that the learned policy generalizes better to unseen user
profiles than these hand-coded policies, while having comparable performance on known user
profiles.
We discuss the overall advantages of this method and how it can be extended to other levels
of adaptation such as content selection and dialogue management, and to other domains where
adapting to users’ domain knowledge is useful, such as travel and healthcare.
∗ School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh.
E-mail: sc445@hw.ac.uk.
∗∗ School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh.
E-mail: o.lemon@hw.ac.uk.
Submission received: 16 November 2012; revised version received: 1 November 2013; accepted for publication:
18 January 2014.
doi:10.1162/COLI_a_00203
© 2014 Association for Computational Linguistics
1. Introduction
A user-adaptive spoken dialogue system in a technical support domain should be able
to generate instructions that are appropriate to the user’s level of domain expertise
(using appropriate referring expressions for domain entities, generating instructions
with appropriate complexity, etc.). The domain knowledge of users is often unknown
when a conversation starts. For example, a caller calling a helpdesk to troubleshoot
his laptop cannot be readily identified as a beginner, an intermediate, or an expert in the
domain. In natural human–human conversations, dialogue partners learn about each
other and adapt their language to suit their domain expertise (Isaacs and Clark 1987).
This kind of adaptation is called “Alignment through Audience Design” (Clark and
Murphy 1982; Bell 1984). Similar to this adaptive human behavior, a spoken dialogue
system (SDS) must also be capable of observing the user’s dialogue behavior, modeling
his/her domain knowledge, and adapting accordingly.
Although there are several levels at which systems can adapt to users’ domain
knowledge, here we focus on adaptively choosing the referring expressions that are used
in technical instructions given to users. We also discuss how our model can later be
extended to other levels of adaptation, such as content selection and dialogue
management. Referring expressions are linguistic expressions that are used to refer to
domain objects of interest. Traditionally, the referring expression generation (REG) task
includes selecting the type of expression (pronouns, proper nouns, common nouns,
etc.), selecting attributes (color, type, size, etc.), and realizing them in the form of a
linguistic expression (Reiter and Dale 2000). However, in this work, we focus on the
user modeling aspects of referring expression generation. Our objective is to choose a
referring expression (either a technical or a descriptive expression) that the user can
understand easily and efficiently. For this, we build a dynamic user model that represents
the user’s domain knowledge as estimated during the conversation. See Table 1
for some example utterances that we aim to generate using technical and descriptive
expressions or a combination of the two types.
We present an approach to learning user-adaptive behavior by sensing partial in-
formation about the user’s domain knowledge using unobtrusive information sensing
moves, populating the user model, and then predicting the rest of the user’s knowledge
using reinforcement learning techniques. We present a three-step process to learning
user-adaptive behavior in dialogue systems: data collection, building user simulations,
and learning adaptive behavior using reinforcement learning. We show that the learned
behavior performs better than a hand-coded adaptive behavior when evaluated with
real users, by adapting to them and thereby enabling them to finish their task faster and
more successfully. Our approach is corpus-driven and the system learns from a small
corpus (only 12 dialogues) of non-adaptive human–machine interaction.
In Section 2, we analyze the problem of dynamic user modeling in spoken dialogue
systems in detail. In Section 3, we present a technical support dialogue system that
Table 1
Variants of technical instructions to be generated by the system (with technical and descriptive
expressions in italics).
1: Please plug one end of the broadband cable into the broadband filter.
2: Please plug one end of the thin white cable with grey ends into the
small white box.
3: Please plug one end of the broadband cable into the small white box.
we use to build and experiment with our adaptive behavior learning model. We then
discuss data collection, building user simulations, and learning adaptive behavior in
Sections 4, 5, and 6. We present the results and analysis of the evaluations in Section 7.
Finally, we present an experiment in simulation comparing the learned policy to a smart
hand-coded policy, and discuss future work such as adapting at the level of content
selection and dialogue management and adapting to dynamic knowledge profiles in
Section 8.
2. Dynamic User Modeling
In order to adapt to the user, it is necessary for the system to have a model of the
user’s domain knowledge. This is currently taken into account by state-of-the-art REG
algorithms by using an internal user model (UM). The UM determines whether the user
would be able to relate the referring expression made by the system to the intended
referent. To be more specific, it is used to estimate whether the user knows or would
be able to determine whether an attribute-value pair applies to an object (Dale 1988;
Reiter and Dale 1992, 1995; Krahmer and Theune 2002; Krahmer, van Erk, and Verleg
2003; Belz and Varges 2007; Gatt and Belz 2008; Gatt and van Deemter 2009). So, if
the user model believes that the user cannot associate an attribute-value pair (e.g.,
⟨category, recliner⟩) with the target entity x, then it would return false. On the other hand,
if he can instead associate the pair (e.g., ⟨category, chair⟩) with x, the user model would
return true. This would inform the algorithm to choose the category “chair” in order to
refer to x. Therefore, using an accurate user model, an appropriate choice can be made
to suit the user. However, these models are static and are predefined before run-time.
How can a system adapt when the user’s knowledge is initially unknown at run-
time? There are many cases when accurate user models will not be available to the sys-
tem beforehand and therefore the state-of-the-art attribute selection algorithms cannot
be used in their present form. They need user modeling strategies that can cope with
unknown users. In order to deal with unknown users, a system should be able to do the
following (Mairesse and Walker 2010):
• Sense: Learn about the user’s domain knowledge during the course of
interaction and populate the user model.
• Adapt: Adapt to the user by using the information in the user model.
A smarter system should be able to predict the user’s domain knowledge from partial
information sensed earlier. In our approach we aim to sense partial information, predict
the rest, and adapt to the user. We refer to this as the sense-predict-adapt
approach. The more information the system has in its user model, the easier it is to
predict the unknown information about the user and choose appropriate expressions
accordingly. This is because there are different underlying knowledge patterns for
different types of users. Novice users may know technical expressions only for the most
commonplace domain objects. Intermediate users may have knowledge of a few related
concepts that form a subdomain within a larger domain (also called local expertise by
Paris [1984]). Experts may know names for almost all the domain objects. Therefore, by
knowing more about a user, the system can attempt to identify his/her expertise and
more accurately predict the user’s knowledge.
Sensing user knowledge can be done using explicit questions, or else implicitly
by observing the user’s responses to system instructions. In some dialogue systems,
explicit pre-task questions about the user’s knowledge level in the task domain (e.g.,
broadband Internet connections, troubleshooting laptop issues) are used so that the
system can produce adaptive utterances (McKeown, Robin, and Tanenblatt 1993). For
instance, “Are you an expert or a novice?” However, it is hard to decide which subset
of questions to ask in order to help prediction later even if we assume conceptual
dependencies between referring expressions. Another approach is to ask users explicit
questions during the conversation like “Do you know what a broadband filter is?"
(Cawsey 1993). Such measures are taken whenever inference is not possible during
the conversation. It is argued that asking such explicit sensing questions at appropriate
places in the conversation makes them less obtrusive. In large domains, a large number
of explicit sensing questions would need to be asked, which could be unwieldy. In
contrast, we aim to sense each user’s domain knowledge implicitly by using expert
technical (or “jargon”) expressions within the interaction.
Another issue in user modeling is to be able to use the sensed information to
predict unknown facts about the user’s knowledge. Rule-based and supervised learning
approaches have been proposed to solve the problem of adapting to users. Rule-based
approaches require task domain experts (i.e., those with a good understanding of the
task domain and its users) to hand-code the relationships between domain concepts
and rules to infer the user’s knowledge of one concept when his/her knowledge of
other concepts is established (Kass 1991; Cawsey 1993). Hand-coded policies can also
be designed by dialogue system designers to inform the system about when to seek
information in order to partially populate the user model (Cawsey 1993). However,
hand-coding such adaptation policies can be difficult for large and complex tasks that
contain a large number of domain objects. Similarly, supervised learning approaches
like Bayesian networks can be used to specify the relationship between different domain
concepts and can be used for prediction (Akiba and Tanaka 1994; Nguyen and Do 2009).
However, they require many annotated adaptive dialogues to train on. In gathering
such a corpus, the expert should have exhibited adaptive behavior with users of all
types. Moreover, annotating a large number of dialogues to learn user modeling
and adaptive strategies could be very expensive. Such an annotated corpus of expert–
layperson interactions is a scarce resource.
Another issue is that domain experts suffer from what psychologists call the curse
of expertise (Hinds 1999). This means that experts have difficulties communicating with
non-experts because their own expertise distorts their predictions about non-experts.
Such inaccurate predictions lead to underestimating or overestimating the non-expert’s
capabilities. Therefore, data collected using domain experts may not be ideal for systems
to learn adaptation strategies from. Instead, it would be beneficial if such predictive
rules for adaptation can be learned from non-adaptive dialogues, with little or no
input from task domain experts. One reason for this is that non-adaptive dialogues
may already be available or can be collected using existing troubleshooting scripts at
technical call centers. Because data gathering using techniques like “Wizard of Oz”
(WOZ) methods are expensive, we also investigate how adaptation strategies can be
learned from limited data.
Our objective in this study, therefore, is to build a model that can address the
following challenges:
1. Unobtrusive dynamic user modeling by implicitly sensing and predicting
user knowledge.
2. User modeling and adaptation using limited data and domain expertise.
Note that users may learn new referring expressions during the course of the
interaction, and therefore the user’s domain knowledge may be dynamically changing.
However, we restrict ourselves to modeling and adapting to the initial knowledge state
of the user. Modeling and adapting to a dynamically changing user knowledge state
would be an interesting extension to our current work, and we discuss this later in the
paper (see Section 8).
We chose to study the user modeling problem in a technical support dialogue
system that chooses between two kinds of expressions: jargon and descriptive. Jargon
expressions are very specific names given to an entity and are known only to ex-
perts in the domain (e.g., broadband filter). Descriptive expressions, as the name
suggests, are more descriptive and identify the referent using attributes like shape, size,
and color (e.g., small white box). Although the choice between jargon and
descriptive expressions may be motivated by many factors (learning gain, lexical
alignment/entrainment, etc.), we focus on enabling users with different domain knowl-
edge levels to identify the target entity efficiently. By domain knowledge, we mean the
user’s capability to identify domain objects when the system uses jargon expressions to
refer to them. This is also called domain communication knowledge (Rambow 1990;
Kittredge, Korelsky, and Rambow 1991). Therefore, this means that an expert user as
defined in this article will not necessarily be able to reason about domain entities in
terms of their functionality and how they relate with each other. It simply means that
she/he will be able to identify the domain entities using jargon expressions.
3. The Dialogue System
In order to explore the problem of dynamic user modeling, we built a “wizarded” tech-
nical support dialogue system that helps users to set up a home broadband connection.
The dialogue system consists of a dialogue manager, a user modeling component, a
natural language generation component, and a speech synthesizer. A human wizard
recognizes user utterances and transcribes them into dialogue acts, which are sent to the
dialogue manager. The dialogue manager decides the next dialogue move and sends a
dialogue act to the natural language generation (NLG) module, which generates system
utterances to be synthesized into speech by the speech synthesizer. The user modeling
component takes input from the dialogue manager, dynamically models the user, and
informs the NLG module which referring expressions to use based on its belief about
the user’s domain knowledge. The architecture of the system and its interaction with
the user is shown in Figure 1.
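To make the turn-level data flow concrete, a minimal Python sketch of how these modules could be chained is given below; the module and method names (dm.next_act, um.recommend, nlg.realize, tts.synthesize) are illustrative placeholders, not the authors’ implementation.

    def system_turn(user_dialogue_act, dm, um, nlg, tts):
        """One system turn: wizard-transcribed user act in, synthesized speech out."""
        system_act = dm.next_act(user_dialogue_act)   # dialogue manager picks the next move
        rec = um.recommend(system_act.referents)      # UM policy: {(R1, T1), ..., (Rn, Tn)}
        utterance = nlg.realize(system_act, rec)      # NLG fills a template with the chosen REs
        return tts.synthesize(utterance)              # speech synthesizer output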
3.1 Wizarded Speech Recognition and Language Understanding
We used a Wizard-of-Oz (WOZ) framework to both collect data and evaluate our
learned model with real users. WOZ frameworks are often used to collect dialogues
between real users and dialogue systems before actually implementing the dialogue
system (Fraser and Gilbert 1991). In this framework, participants interact with an expert
human operator (known as a “wizard”), who is disguised as an automated dialogue
sistema. These dialogue systems are called wizarded dialogue systems (Forbes-Riley
and Litman 2010). WOZ systems have been used extensively to collect data to learn and
test dialogue management policies (Whittaker, Walker, and Moore 2002; Hajdinjak and
Mihelič 2003; Cheng et al. 2004; Strauss, Hoffmann, and Scherer 2007; Rieser and Lemon
2011) and information presentation strategies (Demberg, Winterboer, and Moore 2011).
Figure 1
Wizarded spoken dialogue system.
In our system, the wizards played the role of intercepting, recognizing, and inter-
preting user speech into dialogue acts. Like Demberg, Winterboer, and Moore (2011),
wizards in our set-up did not make dialogue management decisions. These were
computed by the dialogue manager module based on the user dialogue act and the
current dialogue state. Usually, in fully automated dialogue systems, automatic speech
recognition (ASR) and natural language understanding (NLU) modules are used.
However, we use a human wizard to play the roles of the ASR and NLU modules, so
that we can focus on only the user modeling and NLG problem. ASR and NLU issues
may make user modeling more complicated and their interaction should be studied
carefully in future work.
The wizards were assisted by a tool called the Wizard Interpretation Tool (WIT),
which was used by the wizard to interpret the user’s utterances and generate the user
dialogue acts (see Figure 2). The GUI was divided into several panels.
a. System Response Panel – This panel displayed the dialogue-system-generated
response to the user’s previous utterance and the system’s referring expression (RE)
choices for the domain objects in the utterance. This is done to serve as context for
subsequent clarification requests from the user. It also displayed the strategy adopted
by the system in the current dialogue and a visual indicator of whether the system
response was being played back to the user.
b. Confirmation Request Panel – This panel enabled the wizard to handle issues in
communication (e.g., noise). The wizard can ask the user to repeat, speak louder,
confirm their responses, and so on. Appropriate pre-recorded messages were played
back to the user. There was also provision for the wizard to build custom messages and
send them to the user. Custom messages were converted to speech and played back
to the user.
c. Confirmation Panel – This panel enabled the wizard to handle confirmation questions
from the user. The wizard can choose yes or no or build a custom message. The
message was converted to speech and played back to the user.
d. Annotation Panel – This panel enabled the wizard to annotate the content of the
participant’s utterances. Participant responses ranging from answers to questions,
to acknowledging instructions, to requesting clarifications can be annotated. The
annotated dialogue act is sent to the dialogue system for response. Table 2 shows the
set of dialogue acts that can be annotated using this panel. In addition to these, other
behaviors, like remaining silent or saying irrelevant things, were also accommodated.
The WIT sent the generated dialogue act to the dialogue manager. For a more detailed
description of the tool, please refer to Janarthanam and Lemon (2009).
3.2 Dialogue Manager
The dialogue manager identifies the next dialogue act (As,t, where t denotes turn number
and s denotes system) to give to the user based on the dialogue management policy πdm.
The dialogue management policy is coded in the form of a finite state machine. It
represents a series of instructions to be given to the user in order to set up a home
broadband connection. In this dialogue task, the system provides instructions to either
observe or manipulate the environment. The user’s environment consists of several
domain entities such as broadband and Ethernet cables, a broadband filter, sockets on
the modem, and so on. These are referred to by the NLG module using either jargon or
descriptive expressions. If users ask for clarifications on jargon expressions, the system
Figure 2
Wizard interpretation tool.
Table 2
User dialogue acts.

Dialogue Act         Example
yes                  Yes it is on
no                   No, its not flashing
ok                   Ok. I did that
req description      Whats an Ethernet cable?
req location         Where is the filter?
req verify jargon    Is it the Ethernet cable?
req verify desc      Is it the white cable?
req repeat           Please repeat
req rephrase         What do you mean?
req wait             Give me a minute?
help                 I need help
other                I had a bad morning
silent               –
clarifies (using the dialogue act provide clarification) by giving information to enable the
user to associate the expression with the intended referent. If users respond positively
to the instructions given, the dialogue manager presents them with the next instruc-
tion, and so on. By “positive response,” we mean that users answered observation
questions correctly and they acknowledged following the manipulation instructions.
For any other user response, the previous instruction is simply repeated. The dialogue
manager is also responsible for updating and managing the system state Ss,t. The state
Ss,t is a set of variables that represents the current state of the conversation, which
includes the state of the environment (es decir., how much of the broadband set-up has been
finished).
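As an illustration, the instruction-giving behavior described above can be sketched in Python as follows; the act names follow Table 2, and the helper functions (is_positive, provide_clarification, instruction) are hypothetical.

    def next_dialogue_act(state, user_act):
        """Advance the set-up script one step per positive user response."""
        if user_act.type in ("req description", "req location"):
            return provide_clarification(user_act.referent)  # explain the jargon term
        if is_positive(user_act, state):          # correct answer or acknowledgement
            state.step += 1                       # move on to the next instruction
        return instruction(state.steps[state.step])  # otherwise repeat the current one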
3.3 User Modeling
A dynamic user modeling component incrementally updates a user model and informs
other modules of the system about its estimates of the user (Kobsa and Wahlster 1989).
In our system, the user modeling component maintains a user model UMs,t, cual
represents the system’s beliefs about what the user knows. The user model starts with
a state where the system does not have any knowledge about the user. It is then
updated dynamically based on the user’s dialogue behavior during the conversation.
Because the model is updated according to the user’s behavior, it may be inaccu-
rate if the user’s behavior was itself uncertain. The user model is represented as a
vector of n variables (K1, K2, . . ., Kn). A user’s knowledge of the technical name of each
entity i is represented by the variable Ki, which takes one of three values: true, false, and
unknown.
The variables are updated using a simple user model update algorithm after the
user’s response to each turn. Initially each variable is set to unknown. If the user
responds to an instruction containing the jargon expression for x with a clarification
request, then Kx is set to false (assuming that the user did not know the technical
name for the entity x). If the user responds appropriately to the
system’s instruction, Kx is set to true. Only the user’s initial knowledge is recorded.
This is based on the hypothesis (borne out by our evaluation) that an estimate of
the user’s initial knowledge helps to predict the user’s knowledge of the rest of the
entities.
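A Python sketch of this simple update rule follows; the dialogue act and value names are illustrative.

    def update_user_model(um, referent, re_type, user_act):
        """Record only the user's *initial* knowledge of each jargon expression."""
        if re_type != "jargon" or um[referent] != "unknown":
            return                                # nothing sensed, or already recorded
        if user_act == "req description":         # clarification request: term unknown
            um[referent] = False
        elif user_act in ("ok", "provide info"):  # appropriate instruction response
            um[referent] = True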
In order to update the user model and inform the NLG module about its estimates
of the user, the user modeling component recommends how an entity should be referred
to in the system utterances. This behavior is generated by what is called the UM policy
(πum). This is the policy that we attempt to learn. We will later show how the UM policy
interacts with other components of the dialogue system in order to populate the user
model and estimate users’ knowledge.
The UM policy (πum) is defined as

πum : UMs,t → RECs,t, where RECs,t = {(R1, T1), . . . , (Rn, Tn)}     (1)
The referring expression choices RECs,t is a set of pairs identifying the referent R
and the expression type T used in the current system utterance (s refers to system and t
to turn number). For example, the pair (broadband filter, desc) represents the descriptive
expression “small white box.” Because the expression type is specified individually for
each referent entity, it is possible to recommend jargon expressions for some entities
and descriptive expressions for others in the same utterance.
The user modeling module can be operated in two modes. Given a UM policy
(either hand-coded or learned), the task of this module is to recommend expressions
specified in RECs,t, depending on the user model state UMs,t. We call this the evaluation
mode. Por otro lado, the user modeling module can operate as a learning agent
in order to learn a UM policy, where it learns to associate the optimal RE choices
to the UM states. We discuss the implementation of user modeling states in detail in
Section 6.
3.4 NLG Module
The NLG module receives dialogue acts from the dialogue manager, retrieves an ap-
propriate template, and picks appropriate referring expressions for each of the domain
entities in the given dialogue act, based on recommendations from the user modeling
component as described earlier. The NLG module then embeds the expressions into
the templates to generate instructions.
3.5 Speech Synthesis Module
The utterances generated by the NLG module are then converted into speech by a
speech synthesizer. We use the Cereproc Text-To-Speech1 engine for this purpose.
1 http://www.cereproc.com/.
4. Data Collection
In this section, we present the first step of our three-step process for dynamic user
modeling in interactive systems. Using the wizarded dialogue system presented pre-
viously, we collected a dialogue corpus from a number of users. Because we did not
have an adaptive UM policy yet, we configured the user modeling module to generate
two separate non-adaptive strategies: All-Jargon and All-Descriptive. In the All-Jargon
policy, the system instructions contained only jargon expressions. Similarly, in the All-
Descriptive policy, the system instructions contained only descriptive expressions. We
collected half the dialogues with All-Jargon and the other half using All-Descriptive
policies to analyze how users respond to jargon and descriptive expressions based on
their domain knowledge.
Participants were presented with a box containing several objects (cables, modem,
etc.), a phone, a phone socket, and a desktop computer, which were needed for a home
broadband internet connection set-up. The modem consisted of several sockets that
were used in this set-up (see Figure 3). The participants were asked to put these objects
together in a specific pattern as instructed by the system. For example, the broadband
cable must connect the modem to the phone socket, the Ethernet cable must be used
to connect the desktop to the modem, and so on. The task had 16 steps to finish the
broadband set-up and there were references to 13 domain entities, some of which were
mentioned more than once in the dialogue. Users interacted with the system through a
headset using speech.
We followed a six-step process to collect data from the users. This process not only
collected the dialogue exchanges between the user and the system but also collected
other information, such as the user’s domain knowledge before and after the dialogue
task, the user’s interaction with the physical environment, and the user satisfaction
scores.
Step 1. Background of the user – The user was asked to fill in a pre-task background
questionnaire containing queries on their experience with computers, the internet, and
dialogue systems.
Figure 3
Domain objects for the broadband set-up.
Step 2. Knowledge pre-test – Each user’s initial domain knowledge was recorded by
asking each user to point to the domain object that was called out by the experimenter
by its jargon expression.
Step 3. Dialogue – The conversations between the user and the system were logged
as an XML file. The log contains system and user dialogue acts, times of system
utterances, system’s choice of REs, and its utterances at every turn. It also contains
the dialogue start time, total time elapsed, total number of turns, number of words in
system utterances, number of clarification requests, number of technical and descriptive
expressions, and number of confirmations.
Step 4. Knowledge gain post-test – Each user’s knowledge gain during the dialogue
task was measured by asking each user to redo the pointing task. The experimenter
read out the jargon expression (e.g., “broadband cable”) aloud and asked the users to
point to the domain entity referred to.
Step 5. Percentage of task completion – The experimenter examined the final set-up
on the user’s table to determine the percentage of task success using a form containing
declarative statements describing the ideal broadband set-up (e.g., “the broadband
filter is plugged into the phone socket on the wall”). The experimenter awarded one
point for every statement that was true of the user’s broadband set-up.
Step 6. User satisfaction questionnaire – The user was requested to fill in a post-
task questionnaire containing queries on the performance of the system during the
task. Statements about the conversation and the system, such as “Conversation with
the system was easy” and “I would use such a system in future,” were answered on a
four-point Likert scale indicating how strongly the user agreed or disagreed with each
statement.
The dialogue corpus was collected from 12 participants; knowledge profiles were
acquired from these participants, plus an additional 5 participants reserved for a study
of a tutorial policy. In total, there were 203 jargon and 202 descriptive expressions used
in the dialogues. More statistics are given in Table 3. The participants were students
and staff from various backgrounds (arts, humanities, science, medicine, etc.). Every
participant was paid £10 after the experiment was finished. Out of the 12 dialogues,
6 used the All-Jargon policy and 6 used the All-Descriptive policy.
Table 3
Corpus statistics (grouped by strategy).

Parameters             Jargon   Descriptive
No. dialogues          6        6
Task completion rate   98.3     98.3
Pre-task score         6.67     8.5
Post-task score        12.33    10.66
Turns                  28.17    25.83
Sys words              470.5    471.67
Time (min)             7.7      6.86
Time per turn (sec)    16.49    15.9
5. User Simulations
In this section, we present the second step of our process: building a user simu-
lation. We built a corpus-based user simulation model that simulates the dialogue
behavior of a real human user. User simulations are used in place of real users dur-
ing the training and testing phases of reinforcement learning agents for the following
reasons:
1. Training cycles typically require thousands of dialogue episodes to train
the agent, and training and testing cycles with real users can be very
expensive.
2. Real users could get frustrated with dialogue agents at the initial stage of
learning, as they tend to choose random actions that are not adapted to the
dialogue context.
Several user simulation models have been proposed for use in reinforcement learn-
ing of dialogue policies (Georgila, Henderson, and Lemon 2005; Schatzmann et al.
2006, 2007; Ai and Litman 2007). However, they are suited only for learning dialogue
management policies, and not for user-modeling policies (i.e., policies to populate the
user model and inform other modules of users’ domain knowledge). The following
user simulation model was therefore designed and implemented to satisfy three re-
quirements: (1) be sensitive to a system’s choice of referring expressions, (2) model
users’ domain knowledge, and (3) learn new expressions during the conversation.
Please note that this module is not a part of the actual dialogue system and is used
externally in the place of real users. In Sections 6 and 7, we show how the following user
simulation was used to train and evaluate the dynamic user modeling behavior of the
system.
The user simulation (US) receives the system action As,t and its referring expression
choices RECs,t at each turn as input. Note that the US does not receive as input the
natural language utterance from the system. The US responds with a user action
Au,t+1 (u denoting user) and an environment action EAu,t+1. The user action can either
be a clarification request (CR) or an instruction response (IR). The user simulation
combines three models to simulate the process of a user’s understanding of the
system’s instruction, executing it in the environment, and responding to the system.
These three models are for generating clarification requests, environment actions, y
instruction responses, as described below.
Clarification request model: This model produces a clarification request CR based on
the referent R, the type of the referring expression T (i.e., jargon/descriptive), and the
current domain knowledge of the user for the referring expression DKu,t(R, T) (i.e.,
true/false). The referents are classified into “easy” and “hard” in the following way.
First, the number of clarification requests per referent entity was calculated from the
corpus. Then, those entities whose jargon expressions led to clarification requests more
than the mean number of clarification requests were classified as hard and others as
easy entities. For example, power adaptor is easy – all users understood this expres-
sion; broadband filter is hard as there were more than the mean number of clarifica-
tion requests. The probability of generating a clarification request (CR) for a referring
expression depends on the class of the referent C(R), type of the expression used T,
and the user’s knowledge of the expression DKu,t(R, T) at time t, and is defined as
follows:

P(CRu,t+1(R, T) | C(R), T, DKu,t(R, T))     (2)
One should note that the actual literal expression was not used in the transaction.
Only the entity that it was referring to (R) and its type (T) were used. However,
this model simulated the process of interpreting and resolving the expression and
identifying the domain entity of interest in the instruction, thereby satisfying our first
requirement that the user simulation has to be sensitive to referring expressions used
by the system.
Environment action model: An environment action EAu,t was generated using a model
based on system dialogue action As,t. This is the probability that the user performed the
required action successfully.
P(EAu,t+1 | As,t)     (3)
Instruction response model: An instruction response was generated based on the
user’s environment action EAu,t+1 and the system action As,t. Instruction responses
are typical responses to system’s instructions and can be either provide info, acknowl-
edgement, or other. The probability of each of these responses is given by the following
model:
P(IRu,t+1 | EAu,t+1, As,t)     (4)
The user simulation combined the three models in the following manner. First, it
sampled from the clarification request model for each (R, T) in RECs,t. If a clarification
request was produced, it was returned as the user’s action (i.e., Au,t+1 = CRu,t+1(R, T)) and no
environment action was produced. If no clarification request was produced, it then
sampled from the environment action model (i.e., did the user perform the requested
action correctly?) and the instruction response model. The IRu,t+1 that was generated
was returned to the system as the user action.
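A Python sketch of one simulated user turn, combining Equations (2)–(4), is shown below; hardness and sample are assumed helper functions, and the three probability tables are assumed to hold the corpus-trained values (e.g., Table 4 for the clarification request model).

    import random

    def user_turn(system_act, rec_choices, dk, cr_model, ea_model, ir_model):
        """Return (user action, environment action) for one system turn."""
        for referent, re_type in rec_choices:
            ctx = (hardness(referent), re_type, dk[referent])  # (C(R), T, DK)
            if random.random() < cr_model[ctx]:                # Eq. (2)
                return ("req description", referent), None     # clarification request
        ea = sample(ea_model[system_act])                      # Eq. (3)
        ir = sample(ir_model[(ea, system_act)])                # Eq. (4)
        return ir, ea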
All of these models were trained on our corpus data using maximum likelihood esti-
mation and smoothed using a variant of Witten-Bell discounting. The corpus contained
12 dialogues between a non-adaptive dialogue system and real users. According to the
data, clarification requests are more likely when jargon expressions are used to refer to
the referents that belong to the hard class and which the user does not know about.
When the system uses expressions that the user knows, the user generally responds to
the instruction given by the system.
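For concreteness, the sketch below shows maximum likelihood estimation with standard Witten-Bell discounting over one context; the paper uses a variant whose exact form is not given, so this is only an approximation of the training step.

    from collections import Counter

    def witten_bell_prob(counts: Counter, response, n_possible):
        """P(response | context): discounted MLE with Witten-Bell smoothing."""
        n = sum(counts.values())        # observed tokens in this context
        t = len(counts)                 # distinct responses seen (types)
        if counts[response] > 0:
            return counts[response] / (n + t)   # seen: discounted relative frequency
        z = n_possible - t                      # unseen response types
        return t / (z * (n + t))                # reserved mass shared among unseen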
The trained probabilities are shown in Table 4. Clarification requests occurred only
for jargon-type expressions and not for descriptive expressions in our corpus. We
therefore set the probability of generating one for descriptive expressions to zero.
Using k-means clustering on pre-test knowledge patterns, we created five patterns
of users’ domain knowledge (k = 5). We set k to five so that we obtain three profiles to
train with and two additional profiles for testing the learned policy and examining how
well it generalizes to the two unseen user types. The models ranged from novices to
experts with three intermediate levels, as shown in Table 5. The value T represents that
Table 4
Trained clarification request model (probability of producing a clarification request).

Class (C)   Type (T)   User’s Domain Knowledge (DK)   P(CR)
Hard        Jargon     True                            5.84
Hard        Jargon     False                          84.52
Easy        Jargon     True                            2.04
Easy        Jargon     False                          15.04
Table 5
Domain knowledge of five different users (columns: Novice, Int1, Int2, Int3, Expert), over
the 13 domain entities: phone socket, Livebox, Livebox power socket, Livebox power
light, power adaptor, broadband cable, Ethernet cable, Livebox broadband light, Livebox
Ethernet light, Livebox ADSL socket, Livebox Ethernet socket, PC Ethernet socket, and
broadband filter. [The grid of T marks did not survive extraction; a T means that a user of
that type can identify the referent when the jargon expression is used. Per the text below,
the Novice column has a T only for power adaptor, the Expert column has a T for every
entity, and the intermediate profiles lie in between.]
a user of the type can identify the referent when the jargon expression is used. The user
domain knowledge DKu,t was initially set to one of these models at the start of every
conversation. A novice user knew only power adaptor, an expert knew all the jargon
expressions, and intermediate users knew some of them. We assumed that users can
interpret the descriptive expressions for all referents R and resolve their references (i.e.,
DKu,t(R, desc) = true). Therefore, they were not explicitly represented. We only
coded the user’s knowledge of jargon expressions using Boolean variables representing
whether the user knew the expression or not. The use of knowledge patterns satisfies
the second requirement that the user simulation must model the domain knowledge of
the user.
In our corpus of 17 users we had two each of beginners, experts, and int1, four int3
users, and seven int2 users (five users encountered a tutorial policy whose dialogues
were not used later on; only their knowledge profiles were used). Corpus data showed
that users can learn to associate new jargon expressions with domain entities during
the conversation. We modeled this using the knowledge update model. This satisfies
the third requirement of producing a learning effect and a dialogue behavior that is
consistent with an evolving domain knowledge DKu of the user. The domain knowledge
is updated based on two types of system dialogue actions. We observed in the dialogue
corpus that users always learned a jargon expression for a referent R when the system
provided the user with a clarification. This was modeled using the following update
rule:

if As,t == provide clarification(R), then DKu,t+1(R, jargon) = true     (5)
Users also learned when jargon expressions were repeatedly presented to them.
Learning by repetition followed a linear learning relationship (i.e., the greater the num-
ber of repetitions, the higher the likelihood of learning), which then converged after a
few repetitions. From post-test data, we found that when a jargon expression was given
to the user once, the probability that the user learned the association between the term
and the entity was 0.55. When it was presented twice or more, the probability was 1. The
probability that the user learns a jargon expression is given by a function of the referent
(R) and the number of times the jargon expression is repeated in the conversation,
denoted by n, as follows:
P(DKu,t+1(R, jargon) = true) = f(R, n)     (6)
We estimated f as a linear model based on the frequency of each jargon expression
and users’ post-task recognition scores. Due to the learning effect produced by the
system’s use of jargon expressions, the final state of the user’s domain knowledge
(DKu,final) may be different from the initial state (DKu,initial).
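A sketch of the simulated user's knowledge update, combining the deterministic rule (5) with a simplified version of the repetition model (6), is given below; in the paper f also depends on the referent R, which is collapsed here into the single corpus-level estimates (0.55 after one mention, 1 after two).

    import random

    def update_knowledge(dk, system_act, referent, n_mentions):
        """Update DKu after a system turn involving the referent's jargon term."""
        if system_act == "provide_clarification":
            dk[(referent, "jargon")] = True      # Eq. (5): clarification always teaches
        elif n_mentions >= 2 or random.random() < 0.55 * n_mentions:
            dk[(referent, "jargon")] = True      # Eq. (6): linear in n, converging at 1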
5.1 Evaluation of User Simulation
We measured dialogue divergence (DD) based on the Kullback-Leibler (KL) diver-
gence between real and simulated dialogues to show how realistic our user simulation
is. KL divergence, which is also called relative entropy, is a mea-
sure of how similar or different two probability distributions are (Kullback and Leibler
1951; Kullback 1959, 1987). Several recent studies have used this metric to evaluate
how closely their user simulation models replicate real user behavior (Cuayahuitl et al.
2005; Cuayahuitl 2009; Keizer et al. 2010). Because KL divergence is a non-symmetric
measure, DD is computed by taking the average of the KL divergence between the
simulated responses and the original responses (i.e., DKL(simulated||real)) and vice versa
(i.e., DKL(real||simulated)). DD between two models P and Q is defined as follows:
DKL(P||Q) = Σi=1..M pi log(pi/qi)     (7)

DD(P||Q) = (1/N) Σi=1..N [DKL(Pi||Qi) + DKL(Qi||Pi)] / 2     (8)
The metric measures the divergence between distributions P and Q in N dif-
ferent contexts (i.e., system’s dialogue action, entities mentioned, expression type
used, and user’s knowledge of those expressions) with M responses (i.e., user’s dia-
logue/environment action) per context. Ideally, the dialogue divergence between two
similar distributions is close to zero. The divergence of our dialogue action model
PAG(Au,t) and the environment action model P(EAu,t) with respect to the corpus data were
0.711 and 0.232, respectively. These results were comparable with other recent work
on user simulation (Cuayahuitl 2009; Keizer et al. 2010). For a more detailed analysis
of our simulation model, see Janarthanam (2011).
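A Python sketch of the metric of Equations (7)–(8); each distribution is a dict mapping responses to probabilities, and a small epsilon guards against zero probabilities for unseen responses.

    import math

    def kl(p, q, eps=1e-9):                       # Eq. (7)
        return sum(pi * math.log(pi / max(q.get(r, 0.0), eps))
                   for r, pi in p.items() if pi > 0)

    def dialogue_divergence(sim, real):           # Eq. (8)
        """Average symmetrized KL over the N shared contexts."""
        contexts = list(sim)
        return sum((kl(sim[c], real[c]) + kl(real[c], sim[c])) / 2
                   for c in contexts) / len(contexts)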
6. Learning User-Adaptive Behavior
The final step of our approach is to learn user-adaptive behavior. We used reinforcement
learning techniques in order for the system to learn a dynamic user modeling policy.
Reinforcement Learning (RL) is a set of machine learning techniques in which the
learning agent learns the optimal sequence of decisions through trial-and-error learning
based on feedback it gets from its environment (Kaelbling, Littman, and Moore 1996;
Sutton and Barto 1998). Figure 4 illustrates how a reinforcement learning agent interacts
with its environment. The agent is presented with a learning problem in the form
of a Markov Decision Process (MDP) consisting of a set of states S, a set of actions
A, transition probabilities T from one state to another (when an action is taken), and
rewards R associated with such transitions. The agent learns to solve the problem by
learning a policy π : S → A that optimally maps all the states to actions that lead to a
high expected cumulative reward. The state of the agent represents the environment as
observed by the agent.
Reinforcement learning has been widely used to learn dialogue management poli-
cies that decide what dialogue action the system should take in a given dialogue state
(Eckert, Levin, and Pieraccini 1997; Levin, Pieraccini, and Eckert 1997; Williams and
Young 2003; Cuayahuitl et al. 2005; Henderson, Lemon, and Georgila 2008). Recently,
Lemon (2008), Rieser and Lemon (2009), and Dethlefs and Cuayahuitl (2010) have
extended this approach to NLG, learning policies to choose the appropriate at-
tributes and strategies in information presentation tasks. However, to our knowledge,
the application of RL for dynamically modeling users’ domain knowledge and gen-
eration of referring expressions based on the user’s domain knowledge is novel. Figure 5
shows the interaction between the dialogue system and the user simulation (along with
the environment simulation). The user modeling component (as discussed in Section 3.3) is
the learning agent.
The user modeling module was trained using the user simulation presented in
Section 5 to learn UM policies that map referring expressions to entities based on the
estimated user expertise in the domain. The module was trained in learning mode using
the SARSA reinforcement learning algorithm (with linear function approximation)
(Shapiro and Langley 2002). The training produced approximately 5,000 dialogues.
The user simulation was calibrated to produce three types of users using the Novice,
Intermediate (Int2), and Expert profiles from Table 5, randomly but with equal
Figure 4
Reinforcement learning.
Figure 5
Interaction between the dialogue system and the user simulation (learning).
probability. We did not use all the profiles we had, because we wanted to evaluate how
well the learned policy generalizes to unseen intermediate profiles (i.e., Int1 and Int3).
The user modeling state (UMs,t) was implemented as follows. It consisted of two
variables for each jargon expression x: user knows x and user doesnt know x. They
were both initially set to 0. This signified that the agent did not have any information
about the user’s knowledge of the jargon expression x. The variables were updated
using a simple user model update algorithm. If the user responded to an instruction
containing the jargon expression x with a clarification request, then user doesnt know x
was set to 1. On the other hand, if the user responded with an instruction response
(IR) to the system’s instruction, the dialogue manager set user knows x to 1 and
user doesnt know x to 0. Each pair of these variables takes only three valid val-
ues (out of four possible values); therefore, the state space size for 13 entities is 3^13
(approximately 1.5 million states).
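This feature encoding can be sketched as follows (the entity list is abbreviated here; the full list contains the 13 entities of Table 5):

    ENTITIES = ["broadband_filter", "ethernet_cable"]  # ... 13 entities in total

    def state_vector(um):
        """26 binary features: (user_knows_x, user_doesnt_know_x) per entity."""
        s = []
        for x in ENTITIES:
            s.append(1 if um.get(x) is True else 0)    # user_knows_x
            s.append(1 if um.get(x) is False else 0)   # user_doesnt_know_x
        return s   # the pair (1, 1) never occurs, giving 3^13 reachable states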
The actions that were available to the agent were to choose either a jargon expres-
sion or a descriptive one for each entity. Once the policy is learned, the decision to
choose between using jargon expressions and descriptive expressions for each referent
will be made based on the Q-values of the two actions (i.e., choose jargon and choose desc)
in the given user model state. The action that gets the highest Q-value will be executed.
The Q-value of each action (a) is calculated using the following formula, where s is the
user model state with n variables:
Q(s, a) = θa · sT = Σi=1..n θa(i) s(i)     (9)
As explained earlier, there are 26 variables (i.e., n = 26) in the user model s (sT is the
transpose of s). For each action a, the learning agent learns θ values for each of these
variables in the user model (θa = θa(1), θa(2), . . ., θa(n)). Therefore, for each referent, the
agent learns two sets of θ values, one for each action. The θ values signify the relevance
of the user’s knowledge of various jargon expressions in the domain to its actions.
Estimating Q-values as a linear function allows the learning agent to generalize to states
not seen during the learning phase (see Section 8.2).
During the learning phase, initially, the θ values are set randomly and the UM
policy starts by choosing randomly between the referring expression types for each
domain entity in the system utterance, irrespective of the user model state. Once the
referring expressions were chosen, the system presented the user simulation with both
the dialogue act and referring expression choices. The choice of referring expression
affected the user’s dialogue behavior. Por ejemplo, choosing a jargon expression could
evoke a clarification request from the user, based on which the user model state (UMs,t)
was updated with the new information that the user was ignorant of the particular
expression. It should be noted that using a jargon expression is an information sensing
move that enables the user modeling module to estimate the user’s knowledge level.
The same process was repeated for every dialogue instruction. At the end of each
dialogue, the system was rewarded based on its choices of referring expressions (see
Section 6.1). The Q-values of a state-action pair are updated using the following SARSA
equation, where α is called the learning rate (0 < α < 1), which determines how fast
or slowly the algorithm learns from its experience, and γ is called the discount factor
(Sutton and Barto 1998):
Q(st, at) ← Q(st, at) + α[rt+1 + γQ(st+1, at+1) − Q(st, at)]     (10)
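A sketch of Equations (9)–(10) in Python: Q-values are linear in the state features, and the SARSA step adjusts the weights of the chosen action; the alpha and gamma values here are illustrative, not those used in training.

    def q_value(theta, s, a):                     # Eq. (9)
        return sum(theta[a][i] * s[i] for i in range(len(s)))

    def sarsa_update(theta, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
        """Eq. (10) with linear function approximation: theta_a += alpha*delta*s."""
        delta = r + gamma * q_value(theta, s_next, a_next) - q_value(theta, s, a)
        for i in range(len(s)):
            theta[a][i] += alpha * delta * s[i]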
In addition to choosing actions randomly, the agent can also choose actions based
on the Q-values of the state action pair. The former way of choosing actions is called
exploration and the latter is called exploitation. During exploration, the agent tried
out new state-action combinations to explore the possibility of greater future rewards.
The proportion of exploratory actions was higher at the beginning of the learning
phase, but over time the agent stopped exploring new state-action combinations and used
those actions that had high Q-values, which in turn contributed to higher expected
reward.
6.1 Reward Function
We wanted the system to learn a policy to present appropriate referring expressions to
the user—that is, to present jargon when the user knows it and descriptive otherwise.
If the system chose jargon expressions for novice users or descriptive expressions for
expert users, penalties were incurred and if the system chose REs appropriately, the
reward was high. Although experts might not actively complain about descriptive
expressions, they are likely to be less satisfied when the system gives them long in-
structions instead of using jargon that they can easily handle. Based on the general
principle of audience design, the maxim of manner (Gricean maxims of co-operative
conversation [Grice 1975]), and principle of sensitivity (Dale 1988), we consider pre-
senting descriptive expressions to experts to be less efficient than using the shorter
jargon/expert vocabulary. Although it is not easy to say whether presenting jargon to
novices should be weighed the same as presenting descriptive expressions to experts,
we use this model as an initial representation for measuring adaptation.
We designed a reward function for the goal of adapting to each user’s initial domain
knowledge. Our reward function is what we call the “Adaptation Accuracy” score (AA),
which calculates how accurately the agent chose the appropriate expressions for each
referent in a set of referents (X), with respect to the user’s initial knowledge DKu,initial.
As before, we use the pair (R, T) to represent a referring expression, where R represents
the referent and T represents the type of expression used. So, when the user knew the
jargon expression for the referent R, the appropriate expression to use was jargon, and
if she or he didn’t know the jargon, a descriptive expression was appropriate. This is
expressed as function f :
f((R, T), DKu,initial) = 1 if T = jargon and DKu,initial(R, jargon) == true,
                        1 if T = desc and DKu,initial(R, jargon) == false,
                        0 otherwise     (11)
We calculated independent accuracy per referent entity IA(x) and then calculated the overall mean adaptation accuracy (AA) over all referents, as shown in the following. By first calculating independent accuracy for each referent, we ensure that every referent is equally weighted in terms of adaptation when calculating the overall AA.

Where m is the total number of instances of referent R in the conversation, with each instance indexed by j, Independent Accuracy (IA) is defined as:

IA(R) = (1/m) Σ_{j=1..m} f((R, T)_j, DK_{u,initial})     (12)
Where |X| is the total number of distinct domain entities referred to in the conversations, Adaptation Accuracy (AA) is defined as:

AA = (1/|X|) Σ_{R∈X} IA(R)     (13)
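As a sketch of how this reward can be computed (the variable names are ours, for illustration): dk_initial maps (referent, "jargon") pairs to booleans, and choices lists the (referent, type) pairs used in one dialogue.

    def f(choice, dk_initial):
        # Equation (11): a choice is appropriate if its type matches what the
        # user's initial domain knowledge says about the jargon expression.
        referent, re_type = choice
        knows_jargon = dk_initial.get((referent, "jargon"), False)
        if re_type == "jargon" and knows_jargon:
            return 1
        if re_type == "desc" and not knows_jargon:
            return 1
        return 0

    def adaptation_accuracy(choices, dk_initial):
        # Equations (12) and (13): per-referent Independent Accuracy,
        # then the mean over all distinct referents.
        by_referent = {}
        for choice in choices:
            by_referent.setdefault(choice[0], []).append(choice)
        ia = {r: sum(f(c, dk_initial) for c in cs) / len(cs)
              for r, cs in by_referent.items()}
        return sum(ia.values()) / len(ia)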
Other definitions for adaptation accuracy are possible, and the automatic optimization would happen in exactly the same way. For instance, it could be defined as adapting to the dynamically changing user's domain knowledge (see Section 8.3). In such a case, adaptation accuracy must be calculated based on the current domain knowledge of the user (DK_{u,t}) instead of the initial domain knowledge (DK_{u,initial}).
Another possible metric for optimization would be to weigh each reference instance
equally, wherein there is no need to calculate Independent Accuracy for each entity
and then average them into Adaptation Accuracy, as shown earlier. However, such
an approach will lead the learning agent to ignore the entities that are least referred
to, and focus on getting the reference to the most frequently referred-to entities right.
Investigating other metrics for the reward function is left to future work. In the current
set-up, in order to maximize the AA, the system learned to associate the initial state of
the user’s knowledge with the optimal choice of referring expressions for all the entities
equally. We decided to treat each referent equally because the overall task (i.e., setting
up a broadband internet connection) would not be successful if even one of the referring
expressions fails.
6.2 Learned User Modeling Policy
The user modeling module learned to choose the appropriate referring expressions
based on the user model in order to maximize the overall adaptation accuracy, which
was our reward function. Figure 6 shows how the agent learned a policy using the
data-driven simulation during training. We can see in Figure 6 that towards the end of
training the curve plateaus, signifying that learning has converged.
Figure 6
Learning curve: Training.
The system learned a policy that maximizes the adaptation accuracy score by quickly sensing the user's domain knowledge level and adapting to it as early as possible. We call this the Learned-DS policy, as it was learned from interactions with the data-driven user simulation. The system learned that by using jargon expressions it could discover the user's knowledge of the domain, because users ask clarification questions when presented with jargon that they do not know. (Note that this relationship between jargon expressions and information sensing was never explicitly coded into the system.)
Because the agent started the conversation with no knowledge about the user, it learned to use jargon expressions as information-sensing moves. Although in the short term this behavior is not rewarding, it allows the system to quickly gather enough information to be able to adapt to the user and thereby obtain long-term rewards. For instance, using a jargon expression with a novice user may not be an adaptive move, but it will probably reveal the kind of user the system is dealing with. Because its goal was to maximize the adaptation accuracy, the agent also learned to restrict such sensing moves and start estimating the user's domain knowledge as soon as possible. By learning to trade off between information sensing and adaptation, the Learned-DS policy produced high adaptation scores for users with different domain knowledge levels.
It also learned the dependencies between users’ knowledge of domain entities as
evident in the knowledge profiles (as in Table 5). For instance, when the user asked
for clarification on some referring expressions (e.g., Ethernet cable), it used descriptive
expressions for related domain objects (such as Ethernet light and Ethernet socket).
This shows that the system learned that when a user knows Ethernet cable, he or she most likely also knows Ethernet light and Ethernet socket. Indeed, the knowledge profiles show that (assuming the different types of users are equally distributed) there is a 0.66 probability that a user knows Ethernet light given that he or she knows Ethernet cable, and so on. Therefore, by sensing the user's knowledge of one entity, it predicts
his or her knowledge of related entities. It also identified sets of unrelated entities during the conversation, and used this knowledge to sense again whenever a new set of unrelated entities was introduced in the conversation; for the other entities in the same set, it learned to use adaptive choices. It therefore identified the different intermediate users as well. Example dialogues (reconstructed from logged system and user dialogue acts) between real users and the learned policy are given in Appendix A.
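The dependency exploited here can be read directly off the knowledge profiles. A small sketch (with hypothetical profile data, ours for illustration, chosen so the estimate comes out near the 0.66 reported above) of the conditional estimate, assuming the user types behind the profiles are equally distributed:

    # Hypothetical knowledge profiles: entity -> does this user type know it?
    profiles = [
        {"ethernet_cable": True,  "ethernet_light": True},
        {"ethernet_cable": True,  "ethernet_light": True},
        {"ethernet_cable": True,  "ethernet_light": False},
        {"ethernet_cable": False, "ethernet_light": False},
    ]

    def p_knows_given(profiles, target, evidence):
        # Estimate P(user knows target | user knows evidence) over the profiles.
        knows_evidence = [p for p in profiles if p[evidence]]
        if not knows_evidence:
            return 0.0
        return sum(p[target] for p in knows_evidence) / len(knows_evidence)

    print(p_knows_given(profiles, "ethernet_light", "ethernet_cable"))  # ~0.66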
7. Evaluation
In this section, we present the details of the evaluation process, the baseline policies,
the metrics used, and the results. We evaluated the learned policy and several hand-
coded baselines with simulated users and found that the Learned-DS policy produced
higher adaptation accuracy than other policies. Another interesting observation is that
the evaluation results obtained in simulated environments transfer to evaluations with
real users.
7.1 Baseline Policies
In order to compare the performance of the learned policy with hand-coded UM
policies, four rule-based adaptive baseline policies were initially developed. We also
later developed and evaluated a more advanced baseline (see Section 8.2).
All-Descriptive: Used descriptive expressions for all referents by default.
Jargon-adapt: Used jargon for initial reference for all referents by default, but
changed to using descriptive expressions for those referents for which users asked for
clarifications. Table 6 provides an example dialogue.
Switching-adapt: This policy started with jargon expressions for initial references
and continued using them until the user requested clarification of any entity. After a
clarification request, it switched to descriptive expressions for all new referents and
continued to use them until the end. Table 7 provides an example dialogue.
Table 6
Jargon-adapt policy: An example dialogue.

Sys: Do you have a broadband cable in the package?
Usr: What is a broadband cable?
Sys: The broadband cable is the thin black cable with colorless plastic ends.
Usr: Yes. I have that.
.........
Sys: Please plug one end of the thin black cable with colorless plastic ends into the broadband filter.
Table 7
Switching-adapt policy: An example dialogue.

Sys: Do you have a broadband cable in the package?
Usr: What is a broadband cable?
Sys: The broadband cable is the thin black cable with colorless plastic ends.
Usr: Yes. I have that.
Sys: Do you have a small white box that has two sockets and a phone plug in the package?
.........
Sys: Please plug one end of the thin black cable with colorless plastic ends into the small white box that has two sockets and a phone plug.

Stereotypes: In this policy, we used the knowledge profiles from our data collection. The system started by using jargon expressions for the first n turns and then, based on the user's responses, classified the user into one of the five stereotypes (see Table 5), thereafter using the respective knowledge profile to choose the most appropriate referring expressions. For instance, if after n turns the user was classified as a novice, the system used the novice profile to choose expressions for the referents in the rest of the dialogue. We tested various values of n with simulated users (see Section 5) and used the one that produced the highest accuracy (n = 6). Note that as n increases from 1, accuracy initially increases, because more turns provide more evidence for classification. After a certain point, however, adaptation accuracy stabilizes, because further sensing is not more informative; it then falls slightly, because sensing moves come at the cost of adaptation moves (see Table 8).
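A minimal sketch (ours) of the first three rule-based baselines, assuming the system tracks the set of referents the user has asked to clarify so far:

    def all_descriptive(referent, clarified_referents):
        # All-Descriptive: descriptive expressions for every referent.
        return "desc"

    def jargon_adapt(referent, clarified_referents):
        # Jargon-adapt: jargon by default, descriptive for any referent the
        # user has previously asked to clarify.
        return "desc" if referent in clarified_referents else "jargon"

    def switching_adapt(referent, clarified_referents):
        # Switching-adapt: jargon until the first clarification request of
        # the dialogue, then descriptive for all new referents thereafter.
        return "desc" if clarified_referents else "jargon"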
Note that the Jargon-adapt and Switching-adapt policies exploit the user model in
their subsequent references. When the system knows that the user does (or does not)
know a particular expression, this knowledge is exploited in subsequent turns by using
the appropriate expressions; and, therefore, the system is adaptive.
We explore additional hand-crafted policies, also using the user profile information,
in Section 8.2.
7.2 Additional Evaluation Metrics
We used the adaptation accuracy (see Section 6.1) to measure the level of adaptation
to each user. In addition, we measured other interesting parameters of the
conversation (normalized learning gain, dialogue duration, and task completion) to
investigate how they are affected by adaptation.
Table 8
Stereotypes: n-values and Adaptation Accuracy (where n is the number of turns).

No. of steps (n)    Adaptation Accuracy % (AA)
3                   51.23
4                   58.18
5                   58.56
6                   72.46
7                   71.5
8                   71.0
9                   70.7
10                  69.23
11                  68.22
12                  67.04
Normalized learning gain (LG): We measured the learning effect on users of encountering unknown jargon expressions, using normalized learning gain (LG). This was calculated from the pre-test (PRE) and post-test (POST) scores for the user's domain knowledge (DK_u). Recall that for simulated runs, the domain knowledge of the user is updated during the interaction using the knowledge update rule; for real users, LG is calculated from their pre- and post-test scores.

Normalized Learning Gain: LG = (POST − PRE) / (1 − PRE)     (14)
Dialogue time (DT): This was the time taken for the user to complete the task. For simulated runs, we estimated the time taken (in minutes) to complete the task using a regression model (r² = 0.98, p = 0.000) derived from the corpus, based on the number of words #(W), the number of turns T, and the mean user response time URT.

Dialogue Time: DT = (19.75 + 0.6 · #(W) + 0.78 · URT · T) / 60     (15)
Task completion (TC): This was measured by examining the user’s broadband set-up
after the task was completed (i.e., the percentage of correct connections that they had
made in their final set-up). We used this measure for real users only.
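The two computed metrics above are straightforward to implement; a sketch (ours), with scores and times as plain floats:

    def learning_gain(pre, post):
        # Equation (14): normalized learning gain from pre/post-test scores
        # in [0, 1]; assumes pre < 1 so the denominator is non-zero.
        return (post - pre) / (1.0 - pre)

    def dialogue_time(n_words, n_turns, mean_user_response_time):
        # Equation (15): estimated dialogue time in minutes, from the
        # regression model fitted to the corpus (r^2 = 0.98).
        return (19.75 + 0.6 * n_words
                + 0.78 * mean_user_response_time * n_turns) / 60.0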
Although our primary objective is to adapt as much as possible to the user, we
believe these metrics could be used in future reward functions to achieve goals other
than simply adapting to users. For instance, a tutorial dialogue system would aim to optimize normalized learning gain and would not care much about dialogue time, adaptation, or perhaps even task completion.
7.3 Evaluation with Simulated Users
The user modeling module was operated in evaluation mode to produce 200 dialogues per policy, distributed equally over the five user groups (Novice, Int1, Int2, Int3, and Expert). The overall performance of the different policies in terms of Adaptation Accuracy (AA), Dialogue Time (DT), and Learning Gain (LG) is given in Table 9.
Figure 7 shows how the baseline policies as well as the Learned DS policy perform
with each user type. It shows that the Learned DS (LDS) policy generalizes well to
unseen user types (i.e., Int1 and Int3) and is more consistent than any baseline policy
with the different groups, especially for groups Int1 and Int3, whose profiles were not
available to the learning agent. This shows that a reinforcement learning agent can learn
a policy that generalizes well to unseen user types.
Table 9
Evaluation on five simulated user types.

Policies            AA (%)              DT (mins)    LG
Descriptive         46.15 (± 33.29)     7.44         0
Jargon-adapt        74.54 (± 17.9)      9.15         0.97
Switching-adapt     62.47 (± 17.58)     7.48         0.30
Stereotype (n=6)    72.46 (± 20.77)     8.15         0.49
Learned DS          79.99 (± 10.46)     8.08         0.63
Figure 7
Evaluation: Adaptation Accuracy vs. User types.
In Section 8.2, we further compare the learned policy to additional hand-crafted
baseline policies that utilize the user profiles in their adaptation.
A one-way ANOVA was used to test the difference between policies. We found that
the policies differed significantly in the adaptation accuracy (AA) metric (p < 0.0001).
We then used two-tailed paired t-tests (pairing user types) to compare the policies
further. We found that the LDS policy was the most accurate (Mean = 79.99, SD = 10.46)
in terms of adaptation to each user’s initial state of domain knowledge. It outperformed
all other policies: Switching-adapt (Mean = 62.47, SD = 14.18), Jargon-adapt (Mean =
74.54, SD = 17.9), Stereotype (Mean = 72.46, SD = 20.77), and Descriptive (Mean = 46.15,
SD = 33.29). Accuracy of adaptation of the LDS policy was significantly better than
Descriptive policy (p = 0.000, t = 9.11, SE = 37.413), Jargon-adapt policy (p = 0.01, t =
2.58, SE = 20.19), Stereotype policy (p = 0.000, t = 3.95, SE = 23.40), and Switching-
adapt policy (p = 0.000, t = 8.09, SE = 22.29). The LDS policy performed better than the
Jargon-adapt policy, because it was able to predict accurately the user’s knowledge of
referents unseen in the dialogue so far. It performed better than the Stereotype policy
because its adaptive behavior takes into account the uncertainty in the user’s dialogue
behavior. For instance, users did not always ask for clarification when they did not
know the jargon expression. They might instead go ahead and do something incorrectly.
Therefore, when there is no verbal feedback (i.e., no clarification request) from the user,
the system has no information on which a user profile can be picked. However, the
learned policy represents this uncertainty in its state transitions and is able to select
an appropriate adaptive action. Another point to note is that the LDS policy does not
pick a user profile but maps user model states directly to actions, generating either
a jargon or descriptive expression for each entity, and so adapts continuously until
the end of a dialogue, unlike the stereotype policy, which chooses a profile and sticks
with it.
The Jargon-adapt policy performed better than the Switching-adapt and Descriptive
policies (p < 0.05) in terms of adaptation accuracy. This was because the system can
learn more about the user by using more jargon expressions and then using that knowl-
edge to make its later choices more adaptive. Jargon-adapt performed slightly better
than the Stereotype policy but the increase in accuracy is not statistically significant
(p = 0.17). The Stereotype policy also performed significantly better than the Switching-
adapt and the Descriptive policies (p < 0.001). The Stereotype policy adapted to users
globally using their profiles. However, due to uncertainty in users' responses, it was not
always possible to pick the right profile for adaptation. This was probably why it out-
performed the Switching-adapt and the Descriptive policies and performed as well as
the Jargon-adapt policy but did not outperform the Learned-DS policy. The Switching-
adapt policy, on the other hand, quickly switched its policy (sometimes erroneously)
based on the user’s clarification requests but did not adapt appropriately to evidence
presented later during the conversation. Sometimes, this policy switched erroneously
because of uncertain user behaviors. The Descriptive policy performed very well with
novice users but not so with other user types.
In terms of dialogue time (DT), the Learned-DS policy was somewhat more time-consuming than the Switching-adapt and Descriptive policies, but less so than the Jargon-adapt and Stereotype policies. This was because learned policies use sensing
moves (giving rise to clarification requests) in order to learn more about the user. The
Descriptive policy was non-adaptive and therefore faster than other policies because
it only used descriptive expressions and therefore caused no clarification requests
from the users. Similarly, due to fewer clarification requests, the Switching-adapt pol-
icy also took less dialogue time. Learned policies spent more time in order to learn
about the users they interact with before they adapt to them. When the three high-
performing policies (by adaptation accuracy) are compared, the Learned-DS policy
had the shortest dialogue duration. This was due to better adaptation. The difference
between the Learned-DS and the Jargon-adapt policy is statistically significant (p <
0.05). However, the difference between the Learned-DS and the Stereotype policy is not
significant.
With respect to normalized learning gain (LG), the Jargon-adapt policy produced
the highest gain (LG = 0.97). This is because the policy used jargon expressions for
all referents at least once. The differences between the Jargon-adapt policy and the others were statistically significant at p < 0.0001. The LDS policy produced a learning gain of 0.63, a close second, because it used jargon expressions with novice users until it was ready to adapt to them. Although the use of jargon expressions with novices and intermediates sacrificed adaptation accuracy, it served to increase normalized learning gain as well as to populate the user model. Recall that normalized learning gain is not
what we aimed to optimize. We merely report this metric as we feel it is interesting to
see how adaptation affects learning gain and that this could itself be used as a reward
function in the future.
7.4 Evaluation with Real Users
We chose the two best performing policies from our evaluation with simulated users
for our final evaluation with real users. Thirty-eight university students from different
backgrounds (e.g., Arts, Humanities, Medicine, and Engineering) participated in the
evaluation. Seventeen users were given a system with the Jargon-adapt policy and 19
users interacted with a system with the Learned DS (LDS) policy. Data from two other
participants were unusable due to logging issues. Each user was given a pre-task recog-
nition test to record his/her initial domain knowledge. The mean pre-task recognition scores of the two groups were compared with a Mann-Whitney U test for two independent samples and found not to differ significantly (Jargon-adapt = 7.33, LDS = 7.45). Therefore, there was no bias towards either policy. The experimenter read out
a list of technical terms and the user was asked to point out the domain entities laid out
in front of them. They were then given one of the two systems, learned or baseline,
to interact with. Following the system instructions, they then attempted to set up the
broadband connection. When the dialogue had ended, the user was given a post-task
test where the recognition test was repeated and their responses were recorded. The
user’s broadband connection set-up was manually examined for task completion (i.e.,
the percentage of correct connections that they had made in their final set-up). The
user was given the task completion results and was then given a user satisfaction
questionnaire to evaluate the features of the system based on the conversation. Example
dialogues (reconstructed from logged system and user dialogue acts) between real users
and these two policies are given in Appendix A.
All users interacted with a wizarded system using one of the two UM policies.
The users’ responses were intercepted by a human interpreter (or “wizard”) and were
immediately annotated as dialogue acts, to which the automated dialogue manager
responded with a system dialogue action (the dialogue policy was fixed). The wizards
were not aware of the user modeling policy used by the system. The respective policies
chose the referring expressions to generate the system utterance for the given dialogue
action.
We compare the performance of the two policies on real users using objective pa-
rameters and subjective feedback scores. Tests for statistical significance were done us-
ing the Mann-Whitney U test for two independent samples (due to the non-parametric
nature of the data). Because we measure four metrics, namely, Adaptation Accuracy,
Learning Gain, Dialogue Time, and Task Completion Rate, we apply Bonferroni correc-
tion and set our α to 0.0125 (i.e., 0.05/4).
Table 10 presents the mean accuracy of adaptation (AA), learning gain (LG), di-
alogue time (DT), and task completion (TC) produced by the two policies. The LDS
policy produced more accurate adaptation than the Jargon-adapt policy (p = 0.000,
U = 9.0, r = –0.81). The use of the LDS policy resulted in less dialogue time (U = 73.0,
p = 0.008, r = –0.46) and higher task completion (U = 47.5, p = 0.0006, r = –0.72) than
the Jargon-adapt policy. However, there was no significant difference in LG. Another
important point to note is that the order of ranking in terms of adaptation accuracy
from the simulated user evaluation is preserved in the real user evaluation as well: the LDS policy scores better than the Jargon-adapt policy in terms of AA with both simulated and real users. We tested for correlation between the above metrics using Spearman's rho
Table 10
Evaluation with real users.

                            Jargon-adapt     Learned DS       Sig.
Adaptation Accuracy (%)     63.91 (± 8.4)    84.72 (± 4.72)   ∗
Learning Gain               0.71 (± 0.26)    0.74 (± 0.22)
Dialogue Time (mins)        7.86 (± 0.77)    6.98 (± 0.93)    ∗
Task Completion Rate (%)    84.7 (± 14.63)   99.47 (± 2.29)   ∗

∗ Statistical significance (p < 0.0125).
Table 11
Real user feedback.

                                     Jargon-adapt    Learned DS
Q1. Quality of voice                 3.11            3.36
Q2. Had to ask too many questions    2.23            1.89
Q3. System adapted very well         3.41            3.58
Q4. Easy to identify objects         2.94            3.42
Q5. Right amount of dialogue time    3.23            3.26
Q6. Learned useful terms             2.94            3.05
Q7. Conversation was easy            3.17            3.42
Q8. Future use                       3.23            3.47
correlation. We also found that AA correlates positively with task completion rate (TCR)
(r = 0.584, p = 0.000) and negatively with DT (r = −0.546, p = 0.001). These correlations
and our results suggest that as a system’s adaptation towards its users increases, the task
completion rate increases and dialogue duration decreases significantly.
Table 11 presents how the users subjectively scored different features of the system
on an agreement scale of 1 to 4 (with 1 = strongly disagree and 4 = strongly agree),
based on their conversations with the two different strategies. The difference in over-
all satisfaction score, calculated as the mean of all the questions Q1 to Q8 (with Q2
reversed), was not significant (Jargon = 3.1 ± 0.38, Learned = 3.35 ± 0.32, p = 0.058).
Although there are statistically significant differences between the policies in the objective metrics, there is no significant difference between them in any of the user ratings.
Users seemed unable to recognize the nuances in the way the system adapted to
them (Q3) and they did not rate the Learned-DS policy any higher than the Jargon-adapt
policy regarding whether it was easy to identify objects (Q4). They could have been
satisfied with the fact that both systems adapted at all. This adaptation, and the fact
that the system offered help when the users were confused in interpreting the technical
terms could have led the users to score the system well in terms of future use (Q8),
dialogue time (Q5), and ease of conversation (Q7); but in common with experiments in
dialogue management (Lemon, Georgila, and Henderson 2006), it seems that users find
it difficult to evaluate these improvements subjectively. The users were given only one of
the two strategies and therefore were not in a position to compare the two strategies and
judge which one was better. Results in Table 11 lead us to conclude that perhaps users
need to directly compare two or more strategies in order to better judge the differences
between strategies, or perhaps the differences are just too subtle for users to notice.
Another point to note is that the participants, although real humans, were performing
the task in a laboratory setting and not in a real setting (e.g., at home where they are
setting up their own home broadband connection).
8. Discussion
8.1 Application of Our Approach
Our approach could be generally useful in dialogue systems where users’ domain
knowledge influences the conversations between users and the system. Some systems
will simply aim to adapt to the user as much as possible and do not need to attend to
users’ learning, which is the approach we have taken in this article. For instance, a city
navigation system that interacts with locals and tourists (such as Rogers, Fiechter, and
Thompson 2000; Janarthanam et al. 2013) should use proper names and descriptions of
landmarks appropriately to different users to guide them around the city. A technical
support system helping expert and novice users (such as Boye 2007) should use
referring expressions and instructions appropriate to the user’s expertise. An Ambient
Intelligence Environment in a public space (e.g., museum) interacting with visitors
(such as Lopez-Cozar et al. 2005) can guide visitors and describe the exhibits in a
language that the user would appreciate and understand.
8.2 Comparison with More Intelligent Hand-Coded Policies
Although some of our hand-coded policies adapted to users, most of them did not use internal user models (the Stereotype policy being the exception). We therefore also compared the performance of our learned policy with a more intelligent hand-coded policy that uses all five user knowledge profiles: "Active Stereotype 5Profiles" (AS5). The AS5 policy made use of the knowledge profiles just like the Stereotype policy described in Section 7.1. The difference was that this policy used the stereotype information to actively select one of the five possible stereotypes to apply from the start of the conversation (unlike the Stereotype policy, which waited six turns before making a decision).
This is done through a process of elimination. Initially, all five stereotype profiles are
considered possible. The policy starts the conversation with jargon expressions, and as
evidence is gathered about the user’s knowledge of the jargon expressions, it eliminates
those profiles that are incompatible with the evidence. For instance, if the user knows the expression Livebox, the policy eliminates the beginner profile from the list of possibilities. It continues in this fashion until it has narrowed the possibilities down to one profile.
The last remaining profile was then used for adapting to the user. During this process
of elimination, it also continuously estimates the user’s domain knowledge based on
the stereotypes that are still under consideration. This is done so that if all profiles
under consideration indicate that the user does not know a particular jargon expression,
a descriptive expression can be used instead to improve adaptation. Otherwise, the
policy used jargon expressions as information sensing moves. This policy was run to
produce 200 dialogues with the user simulation (see Section 5). The user simulation
generated the behavior of all five types of user with equal probability. The average AA
of the AS5 policy was 77.52% (±23.36). We found no significant difference between
the means of the AS5 policy and the LDS policy using a paired t-test (pairing user
types). How these two policies compare for each user type can be seen in Figure 8.
Whereas there was no significant difference in means for Int2, Int3, and Expert users,
for Beginners, AS5 was better than LDS (AS5 = 83.72, LDS = 76.92, p = 0.009) and for
Int1, LDS was better (LDS = 77.03, AS5 = 65.72, p = 0.0001). Although it may seem that the learned policy is only as good as a smart hand-coded policy, it must be noted that the AS5 policy uses five user profiles whereas the LDS policy was trained using only three profiles.
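A sketch (ours) of the elimination step behind the AS5/AS3 policies, where each candidate profile maps domain entities to whether that user type knows their jargon names, and evidence records what has been observed about the current user:

    def active_stereotype_choice(referent, candidates, evidence):
        # Keep only the profiles consistent with the evidence gathered so far.
        consistent = [p for p in candidates
                      if all(p.get(entity) == known
                             for entity, known in evidence.items())]
        # If every remaining profile predicts the user does not know this
        # jargon term, use a descriptive expression; otherwise use jargon,
        # which doubles as an information-sensing move.
        if consistent and not any(p.get(referent, False) for p in consistent):
            return "desc", consistent
        return "jargon", consistent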
It therefore seems reasonable to compare the LDS policy with a version of the
active stereotype policy that only uses the same three user profiles (Beginner, Int2,
and Expert) that the learned policy had access to during training. We call this policy
“Active Stereotype 3Profiles” (AS3). It works the same way as the AS5 policy but
only has three profiles to start with. We ran this policy with the user simulation and
compared the adaptation accuracy produced to the LDS policy. The overall average
adaptation accuracy over all user types for the AS3 policy was 66.98 (±25.79). This
Figure 8
Evaluation - Adaptation Accuracy vs. User types (LDS vs. AS policies).
was significantly lower than the LDS policy (p = 0.0001). We also compared the two
policies per user type (see Figure 8). The AS3 policy was better than the LDS policy
for Beginners (LDS = 76.92, AS3 = 82.49, p = 0.02), no statistical difference was found
for Int2 and Experts, and the LDS policy was better than AS3 for Int1 (LDS = 77.03,
AS3 = 66.34, p = 0.0001) and Int3 users (LDS = 83.64, AS3 = 42.84, p = 0.0001). This
shows that the LDS policy is able to generalize well to unseen users (i.e., Int1 and
Int3), better than a smart hand-coded policy that had the same knowledge of user
profiles.
8.3 Learning to Adapt to a Dynamically Changing User Knowledge
In reality, users often learn during a technical conversation. This is how we modeled
users in our user simulation. However, we learned only a policy that adapts to the initial state of the user's knowledge. We see this as a first step towards learning a
more complex policy that will adapt to a dynamically changing user knowledge state.
Adapting to dynamically changing user knowledge requires additional representation
in the system’s user model regarding what users might learn during the conversation,
in addition to what they already know. Furthermore, the system will have to model
the nuances between expressions that are easy to learn and those that are harder to
learn, and also that users’ learning might be affected by how many times an entity
is repeatedly referred to in a conversation. The system may also need to model the
process of users forgetting recently learned expressions, especially in long conversations
involving many domain entities. There are several applications of this approach to user
modeling. For instance, an assistive health care system that interacts with patients to
educate and assist them in taking care of themselves (Bickmore and Giorgino 2004)
should be able to adapt to patients' initial levels of knowledge and, in subsequent dialogues, change its language according to the patient's improving understanding and knowledge of the domain. Similarly, a tutorial dialogue sys-
tem that tutors students or trains personnel in industry (such as Dzikovska et al.
2007) should adapt to the needs of the learner in terms of their levels of understand-
ing and expertise. Such systems pay attention to learning gain, but aim to keep the
user in the zone of proximal development (Vygotsky 1978) and therefore not use too
many jargon expressions or complex instructions. We leave these issues for future
research.
8.4 Extending Our Approach to Content Selection and Dialogue Management
In this section, we discuss how our approach to user modeling (that we have demon-
strated in choosing referring expressions) can be extended to content selection and
dialogue management. In the current set-up, for every instruction the dialogue manager
chooses to give the user, the NLG module uses the corresponding template. In other
words, there is a one-to-one mapping between the dialogue move and the content of
the system utterance; and there is no choice for the NLG module to make in terms of
choosing the content of the utterance (except for choosing the referring expressions).
Suppose there is more than one choice for every dialogue move, such as a detailed
instruction that anyone can process and an abstract one that only experts can handle
(see Table 12, for example). The choice between the two can be made based on a user
modeling policy similar to the one we used for choosing referring expressions. This may
require modeling user’s domain knowledge not only in terms of their ability to identify
domain objects using their technical names but also whether they understand domain
specific verbs (e.g., connect) and know domain-specific procedures, or are capable of
domain-specific planning and execution given a high-level instruction. For instance, for
the instruction, “Connect the modem to your computer using the Ethernet cable,” the
user must be able to not only identify the Ethernet socket, but must also be able to
plan or know two low-level moves: connecting the Ethernet cable to the modem and
connecting the other end of the cable to the computer, which in turn requires him/her
to know where (i.e., which socket and its location) in the modem and the computer
the Ethernet cable should be plugged in. This knowledge about the user needs to be
modeled in the user modeling state space.
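For illustration, a minimal sketch (ours, not part of the presented system) of such a content-selection choice, where the user model stores an estimated probability that the user can execute a given domain-specific procedure:

    def choose_instruction(user_model, skill, abstract_template, detailed_template,
                           threshold=0.5):
        # Give the abstract instruction only if the user model predicts the
        # user can plan and execute the domain-specific procedure; otherwise
        # fall back to the detailed step-by-step template.
        if user_model.get(skill, 0.0) >= threshold:
            return abstract_template
        return detailed_template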
Similarly, we can extend the user modeling policy to support dialogue manage-
ment. Dialogue management is the process of maintaining the dialogue context, and
based on that choosing an appropriate dialogue action at an appropriate time. In this
study, we presented a simple dialogue manager that uses a finite state machine to
present a sequence of detailed instructions to the user. However, it is reasonable to
sometimes group a smaller sequence of instructions and refer to such a procedure by
a technical name. For instance, what can be referred to as ping 192.168.0.1 can be broken
down into a sequence of instructions such as clicking on the Start button, searching
for and opening the command line, and typing ping 192.168.0.1. Although some users
know how to plan this sequence of actions given the technical action term ping, others
may need to be given step-by-step guidance (see Figure 9). The dialogue manager can
Table 12
Detailed vs. abstract instruction.
Detailed instruction
Sys: Please plug one end of the Ethernet cable into the Ethernet socket of the modem
and plug the other end of the Ethernet cable into the Ethernet socket of your computer.
Abstract instruction
Sys: Connect the modem to your computer using the Ethernet cable.
Instruct to carry out a domain procedure
Sys: Ping 192.168.0.1 to see if your modem is reachable.
Instruct to carry out a sequence of actions step by step
Sys: Open command prompt.
Usr: What is a command line?
Sys: Never mind. Click on the Start button.
Usr: Ok.
Sys: Search for command prompt.
Usr: Ok. Found it.
Sys: Execute command prompt.
Usr: Ok.
Sys: Type ping 192.168.0.1.
Figure 9
Domain procedure vs. step-by-step instruction.
decide between the two: Give procedure name with parameters or give detailed step-
by-step instructions, based on the user model. Just like content selection, this needs the
domain knowledge of the user to be modeled at a detailed level, such as knowledge
of domain specific procedures (e.g., pinging). Another interesting domain where user
modeling can help dialogue management is pedestrian navigation. Janarthanam et al.
(2013) use a similar approach where users who are locals may know how to get to most
well-known streets but not necessarily to their destinations. In most cases, users may
want to go to a destination, but may know how to navigate part-way by themselves. In
such cases, users could be asked to navigate to the well-known streets by themselves
and then navigate using turn-by-turn instructions until they reach their destination.
However, tourists may need step-by-step instructions right from the beginning. By
dynamically modeling users’ knowledge of the city, the dialogue manager can decide
whether to ask the user to self-navigate part-route, or ask him or her if he or she can
do that, or start navigating him or her right away. When adaptation happens at the
level of content selection or dialogue management, we argue that adaptive systems
such as ours can choose instructions of appropriate complexity and therefore save more
dialogue time than non-adaptive competitors such as Descriptive and Switching-adapt
policies.
Another way in which the dialogue manager can be involved in user modeling is to
explicitly question the user about the parameters modeled in the user model. Instead of
just using a jargon expression to sense if the user knows it or not, the system will now
have a choice to ask a probing question, such as “Do you know what a broadband filter
is?” or “Do you know how to connect the modem to your computer using a Ethernet
cable?” (as in Cawsey [1993] but learned from interactions with a user simulation).
The dialogue manager ideally has to choose between the two approaches to sense
information from the user or balance between them based on a long-term reward. This
reward may not be based just on adaptation accuracy but also other factors such as
time taken to complete the task, task success, user satisfaction, and so on. Here we see
an interesting trade-off between explicitly and implicitly sensing information from the
users about their domain knowledge. It should also be noted that evidence of users’
knowledge at one level could be used at other levels and should be modeled by the
system. For instance, if the system discovers that the user does not know Ethernet cable,
this evidence can be used at the content selection and dialogue management levels as well, in selecting appropriate actions.
8.5 Advantages of the Learning Approach
To summarize, the Reinforcement Learning approach to learning user modeling
policies has the following advantages compared with other approaches that could
deliver similar results:
Automatic optimization: The policy is learned under the guidance of the reward function. Note that by modifying the reward function, the agent can be made to learn a different policy automatically. In the given set-up, we could set LG instead of AA as the reward function, and the agent would learn a policy that maximizes LG without regard to adaptation. More complex policies can be learned when more than one metric
for adaptation. More complex policies can be learned when more than one metric
is combined (e.g., in the form of a weighted linear function) in the reward function
to balance different goals of the system (e.g., learning gain, lexical alignment). In
such cases, the agent learns a policy that trades off between the different metrics
optimally. For example, Rieser and Lemon (2011) have shown how a reinforcement
learning agent can learn a dialogue management policy to balance different metrics
such as task success and dialogue duration to optimize the overall user satisfaction
score.
Generalization: We have shown that the learned policy generalizes well to unseen
users. We have also shown that even a smart hand-coded policy utilizing user profiles
could not handle unseen users very effectively in comparison with the learned policy.
Learning from a small corpus: By factoring the user simulation into three models
(Clarification request, Environment, and Instruction Response), we were able to learn a
realistic user simulation from a small corpus of 12 dialogues and 17 user profiles.
Learning on-line: A recent study by Gašić et al. (2013) has shown that it is possible to learn such user modeling policies online by interacting with real users.
9. Conclusion
In this article we presented a novel sense-predict-adapt approach to dynamic user mod-
eling and adaptation in dialogue systems. In this approach, the system learned to choose
appropriate referring expressions to refer to domain objects based on users’ domain
knowledge levels using an RL framework and data-driven user simulations. It learned
an adaptive policy by interacting with simulated users with different levels of domain
knowledge from experts to novices, based on a small amount of training data. It learned
to trade off between adaptive moves and information sensing moves automatically to
maximize the overall adaptation accuracy. The learned policy started the conversation
with information sensing moves, learned a little about the user, and started adapting
dynamically as the conversation progressed. It also learned the dependencies between
entities in the users’ domain knowledge, which it uses effectively for predicting users’
knowledge of future entities. Although we focused on adaptively choosing referring
expressions in technical instructions, we discussed how our model can be extended to
other levels of adaptation as well such as content selection and aggregation. We dis-
cussed how this approach can be applied to spoken dialogue systems in domains such
as technical support and tutoring, when users’ domain knowledge can be dynamically
modeled and adapted to.
We presented a novel user simulation module that can be trained from a small
corpus of non-adaptive dialogues and simulate users’ dialogue behavior in a technical
conversation. It simulated users with different levels of domain knowledge. We trained
the system using the user simulation and evaluated the learned policy along with
several other hand-coded policies. In an evaluation with simulated users, we showed
that the learned policy generalizes to unseen users and outperforms several different
baseline policies, including a policy that makes use of user domain knowledge profiles
from the corpus. In our real user evaluation study, we showed that the learned policy
performs better than a simple adaptive hand-coded policy with real users in terms of
accuracy of adaptation, dialogue time, and successful task completion. This article also
provides further evidence that evaluation results obtained in simulated environments
can transfer reliably to evaluations with real users (Lemon, Georgila, and Henderson
2006). Finally, we also compared the learned policy to a variety of carefully hand-
crafted adaptive policies that utilize the user knowledge profiles to adapt their choices
of referring expressions throughout a conversation. We show that the learned policy
generalizes better to unseen user profiles than these hand-coded policies, while having
comparable performance on known user profiles.
Whether the learned policy would perform better than a hand-coded policy
that was painstakingly crafted by a domain expert (or learned using supervised
methods from an expert-layperson corpus) is an interesting question that needs further
exploration. It would also be interesting to make the learned policy account for
the user’s learning behavior and adapt accordingly.
Appendix A: Examples of Dialogue with Real Users
An example dialogue between a real user and the jargon-adapt policy
(System choices shown in boldface.)
(1) Sys: Are you ready?
(2) Usr: Yes.
(3) Sys: Do you have a livebox in the package?
(4) Usr: Yes.
(5) Sys: Do you have a broadband cable in the package?
(6) Usr: Which one is the broadband cable?
(7) Sys: The broadband cable is the thin black cable with colorless plastic ends.
(8) Usr: Yes.
(9) Sys: Do you have an Ethernet cable in the package?
(10) Usr: Yes.
(11) Sys: Do you have a power adaptor in the package?
(12) Usr: Yes.
(13) Sys: Do you have a broadband filter in the package?
(14) Usr: Which one is the broadband filter?
(15) Sys: The broadband filter is the small white box that has two sockets and a phone
plug.
(16) Usr: Yes.
(17) Sys: Good. We seem to have everything we need. Now follow my instructions to
set up the connections. Ready?
(18) Usr: I am ready.
(19) Sys: Please disconnect the phone from the phone socket on the wall.
(20) Usr: Done.
(21) Sys: Please take the power adaptor.
(22) Usr: Yes.
(23) Sys: Please plug the power adaptor into the two-pin mains power socket.
(24) Usr: Yes.
(25) Sys: Please connect the cable of the power adaptor firmly into the power socket of
the livebox. Observe the lights on the front panel of the livebox.
(26) Usr: Yes.
(27) Sys: Did all the lights on the front panel of the livebox turn on and go off after a few seconds?
(28) Usr: Yes.
(29) Sys: Place the livebox with the Orange label facing up. What is the status of the
power light on the livebox?
(30) Usr: It's flashing red.
(31) Sys: Please plug one end of the thin black cable with colorless plastic ends into
the ADSL socket on the livebox.
(32) Usr: Yes.
(33) Sys: Take the small white box that has two sockets and a phone plug.
(34) Usr: Yes.
(35) Sys: Please plug the other end of the thin black cable with colorless plastic ends
into the small white box that has two sockets and a phone plug.
(36) Usr: Yes.
(37) Sys: Please plug the small white box that has two sockets and a phone plug into
the phone socket on the wall.
(38) Usr: Yes.
(39) Sys: What is the status of the broadband light on the livebox?
(40) Usr: It's flashing red.
(41) Sys: Please plug one end of the Ethernet cable into the Ethernet socket on the
livebox.
(42) Usr: Yes.
(43) Sys: Please connect the other end of the Ethernet cable into the Ethernet socket on
the back panel of your computer.
(44) Usr: Yes.
(45) Sys: What is the status of the Ethernet light on the livebox?
(46) Usr: It's red.
(47) Sys: Please connect the phone cable into the small white box that has two sockets
and a phone plug that you plugged into the phone socket on the wall.
(48) Usr: Ok.
(49) Sys: Good. We have now finished setting up your broadband Internet connection.
In this dialogue, the Jargon-adapt policy learns in turns (6) and (14) that the user
does not know broadband cable and broadband filter. It uses this information to adapt to
the user by using descriptive expressions for those entities later in turns (31), (33), (35),
and (37). However, it does not use this information to adapt with respect to other entities
referred to in the conversation.
An example dialogue between a real user and the Learned-DS policy
(System choices shown in boldface.)
(1) Usr: Yes.
(2) Sys: Do you have a livebox in the package?
(3) Usr: Yes.
(4) Sys: Do you have a broadband cable in the package?
(5) Usr: Which one is the broadband cable?
(6) Sys: The broadband cable is the thin black cable with colorless plastic ends.
(7) Usr: Yes.
(8) Sys: Do you have an Ethernet cable in the package?
(9) Usr: Yes.
(10) Sys: Do you have a power adaptor in the package?
(11) Usr: Yes.
(12) Sys: Do you have a broadband filter in the package?
(13) Usr: Which one is the broadband filter?
(14) Sys: The broadband filter is the small white box that has two sockets and a phone
plug.
(15) Usr: Yes.
(16) Sys: Good. We seem to have everything we need. Now follow my instructions to
set up the connections. Ready?
(17) Usr: I am ready.
(18) Sys: Please disconnect the phone from the phone socket on the wall.
(19) Usr: Done.
(20) Sys: Please take the power adaptor.
(21) Usr: Yes.
(22) Sys: Please plug the power adaptor into the two-pin mains power socket.
(23) Usr: Yes.
(24) Sys: Please connect the cable of the power adaptor firmly into the power socket of
the livebox. Observe the lights on the front panel of the livebox.
(25) Usr: Yes.
(26) Sys: Did all the lights on the front panel of the livebox turn on and go off after a few seconds?
(27) Usr: Yes.
(28) Sys: Place the livebox with the Orange label facing up. What is the status of the
power light on the livebox?
(29) Usr: It's flashing red.
(30) Sys: Please plug one end of the thin black cable with colorless plastic ends into
the ADSL socket on the livebox.
(31) Usr: Yes.
(32) Sys: Take the small white box that has two sockets and a phone plug.
(33) Usr: Yes.
(34) Sys: Please plug the other end of the thin black cable with colorless plastic ends
into the small white box that has two sockets and a phone plug.
(35) Usr: Yes.
(36) Sys: Please plug the small white box that has two sockets and a phone plug into
the phone socket on the wall.
(37) Usr: Yes.
(38) Sys: What is the status of the second light from the left on the livebox?
(39) Usr: It's flashing red.
(40) Sys: Please plug one end of the Ethernet cable into the square socket with red
stripe on the livebox.
(41) Usr: Yes.
(42) Sys: Please connect the other end of the Ethernet cable into the Ethernet socket on
the back panel of your computer.
(43) Usr: Yes.
(44) Sys: What is the status of the fourth light from the left on the livebox?
(45) Usr: It's red.
(46) Sys: Please connect the phone cable into the small white box that has two sockets
and a phone plug that you plugged into the phone socket on the wall.
(47) Usr: Ok.
(48) Sys: Good. We have now finished setting up your broadband Internet connection.
In this dialogue, the Learned-DS policy adapts to a user who does not know the
expressions broadband filter and broadband cable. It uses this information to adapt by
using descriptive expressions for those entities later in the conversation just like the
Jargon-adapt policy. In addition, it also uses descriptive expressions for two other
entities: Ethernet light and broadband light (see turns (38) and (44)).
Acknowledgments
The research leading to these results has
received funding from British Council
(UKIERI PhD scholarships 2007–2010),
the European Community’s Seventh
Framework Programme (FP7/2007–2013)
under grant agreements no. 216594
(CLASSiC project www.classic-
project.org) and 270019 (SPACEBOOK
project www.spacebook-project.eu), and
from the EPSRC (project no. EP/G069840/1).
References
Ai, H. and D. Litman. 2007. Knowledge consistent user simulations for dialog systems. In Proceedings of Interspeech 2007, pages 2,697–2,700, Antwerp.
Akiba, T. and H. Tanaka. 1994. A Bayesian approach for user modeling in dialogue systems. In Proceedings of the 15th Conference on Computational Linguistics - Volume 2, Kyoto.
Bell, A. 1984. Language style as audience design. Language in Society, 13(2):145–204.
Belz, A. and S. Varges. 2007. Generation of repeated references to discourse entities. In Proceedings of ENLG-2007, pages 8–16, Schloss Dagstuhl.
Bickmore, T. and T. Giorgino. 2004. Some novel aspects of health communication from a dialogue systems perspective. In AAAI Fall Symposium on Dialogue Systems for Health Communication, pages 275–291, Washington, DC.
Boye, J. 2007. Dialogue management for automatic troubleshooting and other problem-solving applications. In Proceedings of SIGDial'07, pages 247–255, Antwerp.
Cawsey, A. 1993. User modeling in interactive explanations. User Modeling and User-Adapted Interaction, 3(3):221–247.
Cheng, H., H. Bratt, R. Mishra, E. Shriberg, S. Upson, J. Chen, F. Weng, S. Peters, L. Cavedon, and J. Niekrasz. 2004. A Wizard of Oz framework for collecting spoken human-computer dialogs. In Proceedings of the International Conference on Spoken Language Processing, pages 2,269–2,272, Jeju.
Clark, H. H. and G. L. Murphy. 1982. Audience design in meaning and reference. In J. F. Leny and W. Kintsch, editors, Language and Comprehension. North-Holland Publishing Company, Amsterdam.
Cuayahuitl, H. 2009. Hierarchical Reinforcement Learning for Spoken Dialogue Systems. Ph.D. thesis, University of Edinburgh.
Cuayahuitl, H., S. Renals, O. Lemon, and H. Shimodaira. 2005. Human-computer dialogue simulation using hidden Markov models. In Proceedings of ASRU 2005, Cancun.
Dale, R. 1988. Generating Referring Expressions in a Domain of Objects and Processes. Ph.D. thesis, University of Edinburgh.
Demberg, Vera, Andi Winterboer, and Johanna D. Moore. 2011. A strategy for information presentation in spoken dialog systems. Computational Linguistics, 37(3):489–539.
Dethlefs, N. and H. Cuayahuitl. 2010.
Hierarchical reinforcement learning for
adaptive text generation. In Proceedings
of the 6th International Natural Language
Generation Conference, pages 37–45, Dublin.
Dzikovska, M. O., C. Callaway, E. Farrow,
M. Marques-Pita, C. Matheson, and
J. D. Moore. 2007. Adaptive tutorial dialogue
systems using deep NLP techniques. In
Proceedings of Human Language Technologies:
The Annual Conference of the North American
Chapter of the Association for Computational
Linguistics 2007, pages 5–6, Morristown, NJ.
Eckert, W., E. Levin, and R. Pieraccini. 1997.
User modeling for spoken dialogue system
evaluation. In Proceedings of ASRU 1997,
pages 80–87, Santa Barbara, CA.
Forbes-Riley, K. and D. Litman. 2010.
Designing and evaluating a wizarded
uncertainty-adaptive spoken dialogue
tutoring system. Computer Speech and
Language, 25(1):105–126.
Fraser, N. and G. N. Gilbert. 1991. Simulating
speech systems. Computer Speech and
Language, 5:81–99.
Gašić, M., C. Breslin, M. Henderson, D. Kim,
M. Szummer, B. Thomson, P. Tsiakoulis,
and S. Young. 2013. On-line policy
optimisation of Bayesian spoken dialogue
systems via human interaction. In
Proceedings of ICASSP 2013, Canada.
Gatt, A. and A. Belz. 2008. Attribute selection for referring expression generation: New algorithms and evaluation methods. In Proceedings of INLG-2008, pages 50–58, Salt Fork, OH.
Gatt, A. and K. van Deemter. 2009. Generating plural NPs in discourse: Evidence from the GNOME corpus. In Proceedings of the Workshop on Production of Referring Expressions: Bridging Computational and Psycholinguistic Approaches (PRE-CogSci-09), Amsterdam.
Georgila, K., J. Henderson, and O. Lemon. 2005. Learning user simulations for information state update dialogue systems. In Proceedings of Eurospeech/Interspeech, pages 893–896, Lisbon.
Grice, H. P. 1975. Logic and conversation. Syntax and Semantics: Vol 3, Speech Acts, pages 43–58.
Hajdinjak, M. and F. Mihelič. 2003. The Wizard of Oz system for weather information retrieval. In Proceedings of the 6th International Conference TSD, pages 400–405, Czech Republic.
Henderson, J., O. Lemon, and K. Georgila. 2008. Hybrid reinforcement/supervised learning of dialogue policies from fixed datasets. Computational Linguistics, 34(4):487–512.
Hinds, P. 1999. The curse of expertise: The effects of expertise and debiasing methods on predictions of novice performance. Journal of Experimental Psychology: Applied, 5(2):205–221.
Issacs, E. A. and H. H. Clark. 1987. References in conversations between experts and novices. Journal of Experimental Psychology: General, 116:26–37.
Janarthanam, S. 2011. Learning User Modeling Strategies for Adaptive Referring Expression Generation in Spoken Dialogue Systems. Ph.D. thesis, University of Edinburgh.
Janarthanam, S. and O. Lemon. 2009. A Wizard-of-Oz environment to study referring expression generation in a situated spoken dialogue task. In Proceedings ENLG'09, pages 94–97, Athens.
Janarthanam, S., O. Lemon, P. Bartie, T. Dalmas, A. Dickinson, X. Liu, J. Mackaness, and W. Goetze. 2013. Evaluating a city exploration dialogue system with integrated question-answering and pedestrian navigation. In Proceedings of ACL 2013, pages 1,660–1,668.
Kaelbling, L. P., M. L. Littman, and A. W. Moore. 1996. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285.
Kass, R. 1991. Building a user model implicitly from a cooperative advisory dialogue. User Modeling and User-Adapted Interaction, 1:203–258.
Keizer, S., M. Gašić, F. Jurcicek, F. Mairesse,
B. Thomson, K. Yu, and S. Young. 2010.
Parameter estimation for agenda-based
user simulation. In Proceedings of SIGDial
2010, pages 116–123, Tokyo.
Kittredge, R., T. Korelsky, and O. Rambow.
1991. On the need for domain
communication knowledge. Computational
Intelligence, 7(4):305–314.
Kobsa, A. and W. Wahlster. 1989. User Models
in Dialog Systems. Springer Verlag, Berlin.
Krahmer, E. and M. Theune. 2002. Efficient
context-sensitive generation of referring
expressions. In K. van Deemter and
R. Kibble, editors, Information Sharing:
Reference and Presupposition in Language
Generation and Interpretation. CSLI,
Stanford, CA, pages 223–264.
Krahmer, E., S. van Erk, and A. Verleg.
2003. Graph-based generation of referring
expressions. Computational Linguistics,
29(1):53–72.
Kullback, S. 1959. Information Theory and
Statistics. John Wiley and Sons, New York.
Kullback, S. 1987. Letter to the Editor:
The Kullback-Leibler distance. The
American Statistician, 41(4):340–341.
Kullback, S. and R. A. Leibler. 1951. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86.
Lemon, O. 2008. Adaptive natural
language generation in dialogue using
reinforcement learning. In Proceedings
of SEMdial’08, pages 149–156, London.
Lemon, O., K. Georgila, and J. Henderson.
2006. Evaluating effectiveness and
portability of reinforcement learned
dialogue strategies with real users: The
TALK TownInfo evaluation. In Proceedings
of IEEE/ACL Spoken Language Technology
2006, pages 178–181, Palm Beach.
Levin, E., R. Pieraccini, and W. Eckert. 1997.
Learning dialogue strategies within the
Markov decision process framework. In
Proceedings of ASRU 1997, pages 72–79,
Santa Barbara, CA.
Lopez-Cozar, R., Z. Callejas, M. Gea,
and G. Montoro. 2005. Multimodal,
multilingual and adaptive dialogue
system for ubiquitous interaction in
educational space. In Proceedings of
Applied Spoken Language Interaction in
Distributed Environments (ASIDE), Aalborg.
Mairesse, F. and M. Walker. 2010. Towards
personality-based user adaptation:
Psychologically informed stylistic
language generation. User Modeling and
User-Adapted Interaction, 20(3):227–278.
McKeown, K., J. Robin, and M. Tanenblatt.
1993. Tailoring lexical choice to the user’s
vocabulary in multimedia explanation
generation. In Proceedings ACL 1993, pages
226–234, Columbus, OH.
Nguyen, L. and P. Do. 2009. Combination of
Bayesian network and overlay model in
user modeling. In Proceedings of the 9th
International Conference on Computational
Science, volume 5545/2009 of LNCS,
pages 5–14, Baton Rouge, LA.
Paris, C. L. 1984. Determining the level of
expertise. In First Annual Workshop on
Theoretical Issues in Conceptual Information
Processing, Atlanta, GA.
Rambow, O. 1990. Domain communication
knowledge. In Proceedings of the Fifth
International Workshop on Natural Language
Generation 1990, pages 87–94, Dawson, PA.
Reiter, E. and R. Dale. 1992. A fast
algorithm for the generation of referring
expressions. In Proceedings COLING-1992,
pages 232–238, Nantes.
Reiter, E. and R. Dale. 1995. Computational
interpretations of the Gricean maxims in
the generation of referring expressions.
Cognitive Science, 18:233–263.
Reiter, E. and R. Dale. 2000. Building Natural
Language Generation Systems. Cambridge
University Press.
Rieser, V. and O. Lemon. 2009. Natural
language generation as planning under
uncertainty for spoken dialogue systems.
In Proceedings EACL 2009, pages 683–691,
Athens.
Rieser, V. and O. Lemon. 2011. Learning and
evaluation of dialogue strategies for new
applications: Empirical methods for
optimization from small data sets.
Computational Linguistics, 37(1):153–196.
Rogers, S., C. Fiechter, and C. Thompson.
2000. Adaptive user interfaces for
automotive environments. In IEEE
Intelligent Vehicles Symposium,
pages 662–667, Dearborn, MI.
Schatzmann, J., B. Thomson, K. Weilhammer,
H. Ye, and S. J. Young. 2007. Agenda-based
user simulation for bootstrapping a
POMDP dialogue system. In Proceedings
of HLT/NAACL 2007, pages 149–152,
Rochester, NY.
Schatzmann, J., K. Weilhammer, M. N.
Stuttle, and S. J. Young. 2006. A survey
of statistical user simulation techniques
for reinforcement learning of dialogue
management strategies. Knowledge
Engineering Review, 21:97–126.
Shapiro, D. and P. Langley. 2002. Separating
skills from preference: Using learning to
program by reward. In Proceedings
ICML-02, pages 570–577, Sydney.
Strauss, P. M., H. Hoffmann, and S. Scherer.
2007. Evaluation and user acceptance of
a dialogue system using Wizard-of-Oz
recordings. In Proceedings of 3rd IET
International Conference on Intelligent
Environments, pages 521–524, Germany.
Sutton, R. and A. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
Vygotsky, L. S. 1978. Mind in Society: The
Development of Higher Psychological
Processes. Harvard University Press,
Cambridge, MA.
Whittaker, S., M. Walker, and J. Moore.
2002. Fish or fowl: A Wizard of Oz
evaluation of dialogue strategies in
the restaurant domain. In Language
Resources and Evaluation Conference,
pages 1,602–1,609, Las Palmas.
Williams, J. and S. J. Young. 2003. Using
Wizard-of-Oz simulations to bootstrap
reinforcement learning based dialogue
management systems. In Proceedings
SIGdial’03, pages 135–139, Sapporo.