Generating Tailored, Comparative
Descriptions with Contextually
Appropriate Intonation
Michael White∗
The Ohio State University
Robert A. J. Clark∗∗
University of Edinburgh
Johanna D. Moore†
University of Edinburgh
Generating responses that take user preferences into account requires adaptation at all levels of
the generation process. This article describes a multi-level approach to presenting user-tailored
information in spoken dialogues which brings together for the first time multi-attribute decision
models, strategic content planning, surface realization that incorporates prosody prediction, and
unit selection synthesis that takes the resulting prosodic structure into account. The system
selects the most important options to mention and the attributes that are most relevant to
choosing between them, based on the user model. Multiple options are selected when each offers a
compelling trade-off. To convey these trade-offs, the system employs a novel presentation strategy
which straightforwardly lends itself to the determination of information structure, as well as the
contents of referring expressions. During surface realization, the prosodic structure is derived
from the information structure using Combinatory Categorial Grammar in a way that allows
phrase boundaries to be determined in a flexible, data-driven fashion. This approach to choosing
pitch accents and edge tones is shown to yield prosodic structures with significantly higher
acceptability than baseline prosody prediction models in an expert evaluation. These prosodic
structures are then shown to enable perceptibly more natural synthesis using a unit selection
voice that aims to produce the target tunes, in comparison to two baseline synthetic voices. An
expert evaluation and f0 analysis confirm the superiority of the generator-driven intonation and
its contribution to listeners’ ratings.
1. Introduction
In an evaluation of nine spoken dialogue information systems, developed as part of the
DARPA Communicator program, the information presentation phase of the dialogues
∗ 1712 Neil Ave., Columbus, OH 43210, USA. Web: http://www.ling.ohio-state.edu/~mwhite/.
∗∗ 10 Crichton Street, Edinburgh, Scotland EH8 1AB, UK. Web:
http://www.cstr.ed.ac.uk/ssi/people/robert.html.
† 10 Crichton Street, Edinburgh, Scotland EH8 1AB, UK. Web: http://www.hcrc.ed.ac.uk/~jmoore/.
Submission received: 31 January 2008; revised submission received: 19 May 2009; accepted for publication:
24 September 2009.
© 2010 Association for Computational Linguistics
Figure 1
Typical information presentation phase of a Communicator dialogue.
was found to be the primary contributor to dialogue duration (Walker, Passonneau,
and Boland 2001). During this phase, the typical system sequentially presents the set of
options that match the user’s constraints, as shown in Figure 1. The user can then navi-
gate through these options and refine them by offering new constraints. When multiple
options are returned, this process can be exacting, leading to reduced user satisfaction.
As Walker et al. (2004) observe, having to access the set of available options sequen-
tially makes it hard for the user to remember information relevant to making a decision.
To reduce user memory load, we need alternatives to sequential presentation.
In particular, we require better algorithms for:
1. selecting the most relevant subset of options to mention, as well as the
attributes that are most relevant to choosing among them; and
2. determining how to organize and express the descriptions of the selected
options and attributes, in ways that are both easy to understand and
memorable.1
In this article, we describe how we have addressed these points in the FLIGHTS2
system, reviewing and extending the description given in Moore et al. (2004). FLIGHTS
follows previous work (Carberry, Chu-Carroll, and Elzer 1999; Carenini and Moore
2000; Walker et al. 2002) in applying decision-theoretic models of user preferences to
the generation of tailored descriptions of the most relevant available options. Multi-
attribute decision theory provides a detailed account of how models of user preferences
can be used in decision making (Edwards and Barron 1994). Such preference models
have been shown to enable systems to present information in ways that are concise and
1 An issue we do not address in this article is whether a multimodal system would be more effective than a
voice-only one. We believe that these needs, and in particular the need to express information with
contextually appropriate prosody, are also highly relevant for multimodal systems. We also note that
there is still strong demand for voice-oriented systems for eyes-busy use and for the blind.
2 FLIGHTS stands for Fancy Linguistically Informed Generation of Highly Tailored Speech.
tailored to the user’s interests (Carenini and Moore 2001; Walker et al. 2004; Carenini
and Moore 2006). Decision-theoretic models have also been commercially deployed in
Web systems.3
To present multiple options, we introduce a novel strategy where the best option
(with respect to the user model) is presented first, followed by the most compelling
remaining options, in terms of trade-offs between attributes that are important to the
user. (Multiple options are selected only when each offers a compelling trade-off.) An
important property of this strategy is that it naturally lends itself to the determination
of information structure, as well as the contents of referring expressions. Thus, to help
make the trade-offs among the selected options clear to the user, FLIGHTS (1) groups
attributes that are positively and negatively valued for the user, (2) chooses referring ex-
pressions that highlight the salient distinguishing attributes, (3) determines information
structure and prosodic structure that express contrasts intelligibly, and (4) synthesizes
utterances with a unit selection voice that takes the prosodic structure into account.
As such, FLIGHTS goes beyond previous systems in adapting its output according to
user preferences at all levels of the generation process, not just at the levels of content
selection and text planning.
Our approach to generating contextually appropriate intonation follows Prevost
(1995) and Steedman (2000a) in using Combinatory Categorial Grammar (CCG) to
convey the information structure of sentences via pitch accents and edge tones. To
adapt Prevost and Steedman’s approach to FLIGHTS, we operationalize the information
structural notion of theme to correspond to implicit questions that necessarily arise
in presenting the trade-offs among options. We also refine the way in which prosodic
structure is derived from information structure by allowing for a more flexible, one-to-
many mapping between themes or rhemes and intonational phrases, where the final
choice of the type and placement of edge tones is determined by n-gram models. To
investigate the impact of information structural grammatical constraints in our hybrid
rule-based, data-driven approach, we compare realizer outputs with those of baseline
n-gram models, and show that the realizer yields target prosodic structures with signif-
icantly higher acceptability than the baseline models in an expert evaluation.
The prosodic structure derived during surface realization is passed as prosodic
markup to the speech synthesizer. The synthesizer uses this prosodic markup in the
text analysis phase of synthesis in place of the structures that it would otherwise have
to predict from the text. The synthesizer then uses the context provided by the markup
to enforce the selection of suitable units from the database. To verify that the prosodic
markup yields improvements in the quality of synthetic speech, we present an exper-
iment which shows that listeners perceive a unit selection voice that aims to produce
the target prosodic structures as significantly more natural than either of two baseline
unit selection voices that do not use the markup. We also present an expert evaluation
and f0 analysis which confirm the superiority of the generator-driven intonation and its
contribution to listeners’ ratings.
The remainder of this article is structured as follows. Section 2 presents our ap-
proach to natural language generation (NLG) in the information presentation phase
of a FLIGHTS dialogue, including how multi-attribute decision models are used in
content selection; how rhetorical and information structure are determined during dis-
course planning; how lexical choice and referring expressions are handled in sentence
planning; how prosodic structures are derived in surface realization; and how these
3 See http://www.cogentex.com/solutions/recommender/index.shtml, Par exemple.
Figure 2
Tailored descriptions of the available flights for three different user models.
prosodic structures compare to those predicted by baseline n-gram models in an expert
evaluation. Section 3 describes how the prosodic structures are used in the unit selection
voice employed in the present study, and compares this voice to the baseline ones
used in our perception experiment. Section 4 provides the methods and results of the
perception experiment itself, along with the expert prosody evaluation and f0 analysis.
Section 5 compares our approach to related work. Finally, Section 6 concludes with a
summary and discussion of remaining issues.
2. NLG in FLIGHTS
2.1 Tailoring Flight Descriptions
To illustrate how decision-theoretic models of user preferences can be used to tailor
descriptions of the available options at many points in the generation process, let us
consider the following three hypothetical users of the FLIGHTS system:
Student (S) A student who cares most about price, all else being equal.
Frequent Flyer (FF) A business traveler who prefers business class, but cares most
about building up frequent-flyer miles on KLM.
Business Class (BC) Another business traveler who prefers KLM, but wants, above all,
to travel in business class.
Suppose that each user is interested in flying from Edinburgh to Brussels on a
certain day, and would like to arrive by five o’clock in the afternoon. FLIGHTS begins
the dialogue by gathering the details necessary to query the database for possible flights.
Next, it uses the preferences encoded in the user model to select the highest ranked flight
for each user, as well as those flights that offer interesting trade-offs. These flights are
then described to the user, as shown in Figure 2.4
For the student (S), the BMI flight is rated most highly, because it is a fairly in-
expensive, direct flight that arrives near the desired time. The Ryanair flight is also
4 The set of available flights was obtained by “screen scraping” data from several online sources. The
hypothetical user preferences were chosen with an eye towards making the example interesting. Actual
user preferences are specified as part of registering to use the FLIGHTS system.
mentioned as a possibility, as it has the best price; it ends up ranked lower overall than
the BMI flight though, because it requires a connection and arrives well in advance of
the desired arrival time. For the KLM frequent flyer (FF), life is a bit more complicated:
A KLM flight with a good arrival time is offered as the top choice, even though it is a
connecting flight with no availability in business class. As alternatives, the direct flight
on BMI (with no business class availability) and the British Airways flight with seats
available in business class (but requiring a connection) are described. Finally, for the
must-have-business-class traveler (BC), the British Airways flight with business class
available is presented first, despite its requiring a connection; the direct flight on BMI is
offered as another possibility.
Although user preferences have an immediately apparent impact on content se-
lection and ordering, they also have more subtle effects on many aspects of how the
selected content is organized and expressed, as explained subsequently.
Referring expressions: Rather than always referring to the available flights in the same
way, flights of interest are instead described using the attributes most relevant to
the user: for example, direct flight, cheapest flight, KLM flight.
Aggregation: For conciseness, multiple attributes may be given in a single sentence,
subject to the constraint that attributes whose values are positive (or negative)
for the user should be kept together. For example, in There’s a KLM flight arriving
Brussels at four fifty p.m., but business class is not available and you’d need to connect in
Amsterdam, the values of the attributes airline and arrival-time are considered
good, and thus are grouped together to contrast with the values of the attributes
fare-class and number-of-legs, which are considered bad.
Scalar terms: Scalar modifiers like good, as in good price, and just, as in just fifty pounds,
are chosen to characterize an attribute’s value to the user relative to values of the
same attribute for other options.
Discourse cues: Attributes with negative values for the user are acknowledged using
discourse cues, such as but and though. Interesting trade-offs can also be signaled
using cues such as if -conditionals.
Information structure and prosody: Compelling trade-offs are always indicated via
prosodic phrasing and emphasis, as in (There ARE seats in business class)theme (on
the British Airways flight)rheme (that arrives at four twenty p.m.)rheme, where the di-
vision of the sentence into theme and rheme phrases is shown informally using
parentheses, and contrastive emphasis (on ARE) is shown using small caps. (See
Section 2.6.2 for details of how emphasis and phrasing are realized by pitch
accents and edge tones.)
2.2 Architecture
The architecture of the FLIGHTS generator appears in Figure 3. OAA (Martin, Cheyer,
and Moran 1999) serves as a communications hub, with the following agents responsi-
ble for specific tasks: DIPPER (Bos et al. 2003) for dialogue management; a Java agent
that implements an additive multi-attribute value function (AMVF), a decision-theoretic
model of the user’s preferences (Carenini and Moore 2000, 2006), for user modeling;
OPlan (Currie and Tate 1991) for content planning; Xalan XSLT5 and OpenCCG (White
5 http://xml.apache.org/xalan-j/.
Figure 3
FLIGHTS generation architecture.
2004, 2006a, 2006b) for sentence planning and surface realization; and Festival (Taylor,
Black, and Caley 1998) for speech synthesis. The user modeling, content planning, sen-
tence planning, and surface realization agents are described in the ensuing subsections.
FLIGHTS follows a typical pipeline architecture (Reiter and Dale 2000) for NLG.
The NLG subsystem takes as input an abstract communicative goal from the dialogue
manager. In the information presentation phase of the dialogue, this goal is to describe
the available flights that best meet the user’s constraints and preferences. Given a
communicative goal, the content planner selects and arranges the information to convey
by applying the plan operators that implement its presentation strategy. In doing so,
it makes use of three further knowledge sources: the user model, the domain model,
and the dialogue history. Next, the content plan is sent to the sentence planner, which
uses XSLT templates to perform aggregation, lexicalization, and referring expression
generation. The output of sentence planning is a sequence of logical forms (LFs). The
use of LF templates represents a practical and flexible way to deal with the interaction
of decisions made at the sentence planning level, and further blurs the traditional
distinction between template-based and “real” NLG that van Deemter, Krahmer, and
Theune (2005) have called into question. Each LF is realized as a sentence using a CCG
lexico-grammar (Steedman 2000a, 2000b). Note that in contrast to the generation archi-
tectures of, Par exemple, Pan, McKeown, and Hirschberg (2002) and Walker, Rambow,
and Rogati (2002), the prosodic structure of the sentence is determined as an integral
part of surface realization, rather than in a separate prosody prediction component.
The prosodic structure is passed to the Festival speech synthesizer using Affective
Presentation Markup Language (de Carolis et al. 2004; Steedman 2004), or APML, an
XML markup language for the annotation of affect, information structure, and prosody.
Festival uses the Tones and Break Indices (Silverman et al. 1992), or ToBI,6 pitch
accents and edge tones—specified as APML annotations—in determining utterance
phrasing and intonation, and employs a custom synthetic voice to produce the system
utterances.
6 See http://www.ling.ohio-state.edu/~tobi/ for an introduction to ToBI and links to on-line
ressources.
2.3 User Modeling
FLIGHTS uses an additive multi-attribute value function (AMVF) to represent the
user’s preferences, as in the GEA real estate recommendation system (Carenini and
Moore 2000, 2006) and the MATCH restaurant recommendation system (Walker et al.
2004). Decision-theoretic models of this kind are based on the notion that, if something
is valued, it is valued for multiple reasons, where the relative importance of different
reasons may vary among users.
The first step is to identify good flights for a particular origin, destination, and
arrival or departure time. The following attributes contribute to this objective: arrival-
time, departure-time, number-of-legs, total-travel-time, price, airline, fare-
class, and layover-airport. As in MATCH, these attributes are arranged into a
one-level tree.
The second step is to define a value function for each attribute. A value function
maps from the features of a flight to a number between 0 and 1, representing the value
of that flight for that attribute, where 0 is the worst and 1 is the best. For example,
the function for total-travel-time computes the difference in minutes between the
flight’s arrival and departure times, and then multiplies the result by a scaling factor to
obtain an evaluation between 0 and 1. The functions for the airline, layover-airport,
and fare-class attributes make use of user-specified preferred or dispreferred values
for that attribute. In the current version of these functions, a preferred value is given a
score of 0.8, a dispreferred value 0.2, and all other values 0.5.7
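To make this concrete, the following Python sketch shows the two kinds of value function just described. The function names, the linear scaling against a worst-case duration, and the example call are our own illustrative assumptions; the article specifies only the 0.8/0.2/0.5 preference scores and that a scaling factor is applied to travel time.

```python
def travel_time_value(departure_mins, arrival_mins, worst_mins=600):
    """Value function for total-travel-time: scale the duration in
    minutes into [0, 1], shorter being better. The linear scaling
    against worst_mins is an illustrative choice; the article only
    says a scaling factor is applied."""
    duration = arrival_mins - departure_mins
    return max(0.0, 1.0 - duration / worst_mins)

def categorical_value(value, preferred=(), dispreferred=()):
    """Value function for airline, layover-airport, or fare-class:
    0.8 for a user-preferred value, 0.2 for a dispreferred one,
    and 0.5 otherwise, as in the current FLIGHTS functions."""
    if value in preferred:
        return 0.8
    if value in dispreferred:
        return 0.2
    return 0.5
```

For the frequent flyer (FF), for instance, categorical_value("KLM", preferred=("KLM",)) yields 0.8, while any neutral airline yields 0.5.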
The structure and weights of the user model represent a user’s dispositional biases
about flight selection. Situational features are incorporated in two ways. The requested
origin and destination are used as a filter when selecting the set of available options
by querying the database. In contrast, the requested arrival or departure time—if
specified—is used in the corresponding attribute’s evaluation function to give a higher
score to flights that are closer to the specified time. If an arrival or departure time is not
specified, the corresponding attribute is disabled in the user model.
As in previous work, the overall evaluation of an option is computed as the
weighted sum of its evaluation on each attribute. That is, if f represents the option being
evaluated, N is the total number of attributes, and wi and vi are, respectively, the weight
and the value for attribute i, then the evaluation v(f) of option f is computed as follows:
v( F ) =
N(cid:1)
je = 1
wivi( F )
To create a user model for a specific user, two types of information are required. The
user must rank the attributes in order of importance, and he or she must also specify
any preferred or dispreferred attribute values for the airline, layover-airport, et
fare-class attributes. In FLIGHTS, we also allow users to specify a partial ordering of
the rankings, so that several attributes can be given equal importance when registering
to use the system. Chiffre 4 shows the user models for the student (S), frequent-flyer (FF),
and business-class (BC) users discussed earlier; because no departure time is specified
in the sample query, departure-time is not included in these examples.
7 In informal experiments, we did not find the system to be particularly sensitive to the exact values used
in the attribute functions when selecting content.
Figure 4
Sample user models.
Based on the user’s ranking of the attributes, weights are assigned to each attribute.
As in previous work, we use Rank Order Centroid (ROC) weights (Edwards and Barron
1994). This allows weights to be assigned based on rankings, guaranteeing that the sum
will be 1. The nth ROC weight w^R_n of N total weights is computed as follows:

w^R_n = \frac{1}{N} \sum_{i=n}^{N} \frac{1}{i}
We extend these initial weights to the partial-ordering case as follows. If attributes
i . . . j all have the same ranking, then the weight of each will be the mean of the relevant
ROC weights; that is,

\left( \sum_{k=i}^{j} w^R_k \right) / ( j − i + 1)
As a concrete example, if there is a single highest-ranked attribute followed by a three-
way tie for second, then w_1 = w^R_1, and w_2 = w_3 = w_4 = \frac{1}{3}(w^R_2 + w^R_3 + w^R_4).
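The following sketch computes ROC weights and the tie-averaged extension; representing a partial ordering as a list of equally ranked groups is our own encoding choice.

```python
def roc_weights(n_attrs):
    """Rank Order Centroid weights: w_n = (1/N) * sum_{i=n}^{N} 1/i.
    The weights are ordered from most to least important and sum to 1."""
    return [sum(1.0 / i for i in range(n, n_attrs + 1)) / n_attrs
            for n in range(1, n_attrs + 1)]

def tied_weights(rank_groups):
    """Average ROC weights within each group of equally ranked
    attributes, e.g. [["price"], ["airline", "arrival-time"]]."""
    n = sum(len(group) for group in rank_groups)
    base, weights, pos = roc_weights(n), {}, 0
    for group in rank_groups:
        group_mean = sum(base[pos:pos + len(group)]) / len(group)
        for attr in group:
            weights[attr] = group_mean
        pos += len(group)
    return weights
```

For four attributes, roc_weights(4) gives approximately [0.521, 0.271, 0.146, 0.063]; in the three-way-tie example above, each tied attribute therefore receives (0.271 + 0.146 + 0.063)/3 ≈ 0.160.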
2.4 Content Planning
2.4.1 Content Selection. Once a specific user model has been created, the AMVF can be
used to select a set of flights to describe for that user, and to determine the features of
those flights that should be included in the descriptions. We use a novel strategy that
combines features of the Compare and Recommend strategies of Walker et al. (2004),
refining them with an enhanced method of selecting options to mention. In brief, the
idea behind the strategy is to select the top-ranked flight, along with any other highly
ranked flights that offer a compelling trade-off—that is, a better value (for the user)
on one of its attributes. As we shall see in Section 2.4.2, by guaranteeing that any
option beyond the first one offers such a trade-off, our strategy lends itself naturally
to the determination of information structure and the contents of referring expressions
identifying the option. In contrast, Walker et al.’s Recommend strategy only presents
a single option, and their Compare strategy does not present options in a ranked order
that facilitates making trade-offs salient. Furthermore, we may observe that with our
strategy, the user model need only contain a rough approximation of the user’s true
Figure 5
Algorithm for selecting the options to describe.
preferences in order for it to do its job of helping to identify good flights for the user
to consider.8 A similar observation underlies the candidate/critique model of Linden,
Hanks, and Lesh’s (1997) Web-based system.
Selecting the Options to Describe. In determining whether an option is worth mention-
ing, we make use of two measures. Firstly, we use the z-score of each option; this mea-
sures how far the evaluation v(f) of an option f is from the mean evaluation. Formally,
it is defined using the mean (µV) and standard deviation (σV) of all evaluations, as
follows:
z( F ) = (v( F ) − µV )/σV
We also make use of the compellingness measure described by Carenini and Moore
(2000, 2006), who provide a formal definition. Informally, the compellingness of an
attribute measures its strength in contributing to the overall difference between the
evaluation of two options, all other things being equal. For options f, g, and threshold
value kc, we define the set comp( F, g, kc) as the set of attributes that have a higher score
for f than for g, and for which the compellingness is above kc.
The set Sel of options to describe is constructed as follows. First, we include the top-
ranked option. Next, for all of the other options whose z-score is above a threshold kz,
we check whether there is an attribute of that option that offers a compelling trade-off
over the already selected options; if so, we add that option to the set.9 This algorithm is
presented in Figure 5.
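The following Python sketch renders our reading of this selection loop. The helper signatures are hypothetical, the threshold value is a placeholder, and we interpret the prose as requiring a compelling advantage over each already selected option; Figure 5 should be consulted for the authoritative formulation.

```python
from statistics import mean, pstdev

def select_options(options, evaluate, compelling_attrs, k_z=0.5):
    """Sketch of the option-selection loop described in the text.
    evaluate(f) returns v(f); compelling_attrs(f, g) returns the set
    comp(f, g, k_c) of attributes compellingly better for f than g."""
    scores = {f: evaluate(f) for f in options}
    ranked = sorted(options, key=scores.get, reverse=True)
    mu, sigma = mean(scores.values()), pstdev(scores.values())
    selected = [ranked[0]]          # always include the top-ranked option
    for f in ranked[1:]:
        if sigma == 0 or (scores[f] - mu) / sigma <= k_z:
            continue                # z-score not above threshold
        if all(compelling_attrs(f, g) for g in selected):
            selected.append(f)      # trade-off over each selected option
    return selected
```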
For the BC user model, for example, this algorithm proceeds as follows. First, it
selects the top-ranked flight: a connecting flight on British Airways with availability
in business class. The next-highest-ranked flight is a morning flight, which does not
have any attributes that are compellingly better than those of the top choice, and is
therefore skipped. However, the third option presents an interesting trade-off: even
though business class is not available, it is a direct flight, so it is also included. None
of the other options above the threshold present any interesting trade-offs, so only those
two flights are included.
8 Of course, if the user model could be relied upon to contain perfect information, the system could always
just recommend a single best flight; however, because we do not expect our models to capture a user’s
preferences perfectly, we have designed the system to let the user weigh the alternatives when there
appear to be interesting trade-offs among the available options.
9 The requirement that an option offer a compelling trade-off is similar to the exclusion of dominated
solutions in Linden, Hanks, and Lesh (1997).
The selected flights for the other sample user models show similar trade-offs, as
mentioned in the discussion of Figure 2. For FF, the selected flights are a connecting,
economy-class flight on the preferred airline; a direct, economy-class flight on a neutral
airline; and a connecting, business-class flight on a neutral airline. For S, the top choices
are a reasonably cheap direct flight that arrives near the desired time, and an even
cheaper, connecting flight that arrives much earlier in the day.
Selecting the Attributes to Include. When selecting the attributes to include in the de-
scription, we make use of an additional measure, s-compellingness. Informally, the
s-compellingness of an attribute represents the contribution of that attribute to the
evaluation of a single option; again, the formal definition is given by Carenini and
Moore (2000, 2006). Note that an attribute may be s-compelling in either a positive or a
negative way. For an option f and threshold kc, we define the set s-comp( F, kc) as the set
of attributes whose s-compellingness for f is greater than kc.
The set Atts of attributes is constructed in two steps. First, we add the most com-
pelling attributes of the top choice. Next, we add all attributes that represent a trade-off
between any two of the selected options; that is, attributes that are compellingly better
for one option than for another. The algorithm appears in Figure 6.
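A corresponding sketch of the attribute-selection step, with the same caveat that the helper functions (returning sets of attribute names) are hypothetical stand-ins for the measures defined by Carenini and Moore:

```python
def select_attributes(selected, s_comp, comp):
    """Sketch of the two-step attribute selection described in the text.
    s_comp(f) returns the set s-comp(f, k_c) for option f; comp(f, g)
    returns the attributes compellingly better for f than for g."""
    atts = set(s_comp(selected[0]))  # most compelling attributes of the top choice
    for f in selected:               # attributes marking a trade-off between
        for g in selected:           # any two selected options
            if f is not g:
                atts |= comp(f, g)
    return atts
```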
For the BC user model, the s-compelling attributes of the top choice are arrival-
time and fare-class; the latter is also a compelling advantage of this flight over the
second option. The advantage of the second option over the first is that it is direct, so
number-of-legs is also included. A similar process on the other user models results
in price, arrival-time, and number-of-legs being selected for S, and arrival-time,
fare-class, airline, and number-of-legs for FF.
2.4.2 Planning Texts with Information Structure. Based on the information returned by the
content selection process, together with further information from the user model and
the current dialogue context, the content planning agent develops a plan for presenting
the available options. A distinguishing feature of the resulting content plans is that
they contain specifications of the information structure of sentences (Steedman 2000a),
including sentence theme (roughly, the topic the sentence addresses) and sentence
rheme (roughly, the new contribution on a topic).
Steedman (2000un) characterizes the notions of theme and rheme more formally by
stating that a theme presupposes a rheme alternative set, in the sense of Rooth (1992),
while a rheme restricts this set. Because Steedman does not fully formalize the discourse
update semantics of sentence themes, we have chosen to operationalize themes in
FLIGHTS in one particular way, namely, as corresponding to implicit questions that
arise in the context of describing the available options. Our reasoning is as follows.
First, we note that a set of alternative answers corresponds formally to the meaning of a
question. Next, we observe that whenever a flight option presents a compelling trade-off
Figure 6
Algorithm for selecting the attributes to include.
for a particular attribute, it at least partially addresses the question of whether there are
any flights that have a desirable value for that attribute; moreover, whenever a flight
is presented that has a less-than-optimal value for an attribute, its mention implicitly
raises the question of whether any flights are available with a better value for that
attribute. Finally, we conclude that by specifying and realizing content as thematic, the
system can help the user understand why a flight option is being presented, since the
theme (by virtue of its presupposition) identifies what implicit question—that is, what
trade-off—the option is addressing.
In all, the content planner’s presentation strategy performs the following functions:
• marking the status of items as definite/indefinite and predications as
theme/rheme for information structure;
• determining contrast between options and attributes, or between groups of
options and attributes;
• grouping and ordering of similar options and attributes (e.g., presenting the
top scoring option first vs. last);
• choosing the contents of referring expressions (e.g., referring to a particular
option by airline); and
• decomposing strategies into basic dialogue acts and hierarchically organized
rhetorical speech acts.
The presentation strategy is specified via a small set (about 40) of content planning
operators. These operators present the selected flights as an ordered sequence of op-
tions, starting with the best one. Each flight is suggested and then further described. As
part of suggesting a flight, it is identified by its most compelling attribute, according to
the user model. (Recall that any selected flight options beyond the highest ranked one
must offer a compelling trade-off.) Flights are additionally identified by their airline,
which we deemed sufficiently significant to warrant special treatment (otherwise the
strategy is domain-independent).
The information for the first flight is presented as all rheme, as no direct link is
made to the preceding query. Each subsequent, alternative flight is presented with its
most compelling attribute in the theme, and the remaining attributes in the rheme. For
instance, consider the student example (S) in Figure 2. After presenting the BMI flight—
which has a good price, is direct, and arrives near the desired time—the question arises
whether there are any cheaper alternatives. Because the second option has price as its
most compelling attribute—and given that it is the least expensive flight available—it is
identified as the CHEAPEST flight, with this phrase forming the theme of the utterance.
As another illustration, consider the business-class traveler example (BC) in Figure 2.
After presenting the British Airways flight, which has availability in business class
but is not direct, the question arises whether there are any direct flights available. The
presentation of the second option addresses this implicit question, introducing the BMI
flight with the theme phrase There is a DIRECT flight.
The output of the content planner is derived from the hierarchical goal structure
produced during planning. Chiffre 7 shows the resulting content plan for the student
example. Note that, as part of suggesting the second flight option in the sequence (f2),
subgoals are introduced that identify the option by informing the user that the option
has type flight and that the option has the attribute cheapest, where this attribute is
the most compelling one for the user. Both subgoals are marked as part of the theme
Figure 7
Content plan for student example (S).
(rheme marking is the default); the subgoal for the option type is also marked for
definiteness (indefinite is the default), as the cheapest attribute uniquely identifies the
flight. The remaining information for the second flight is presented in terms of a contrast
between its positive and negative attributes, as determined by the user model.
The way in which our presentation strategy uses theme phrases to connect alterna-
tive flight suggestions to implicit questions is related to Prevost’s (1995) use of theme
phrases to link system answers to immediately preceding user questions. It is also
Figure 8
Semantic dependency graph produced by the sentence planner for (The CHEAPEST flight)theme (is
on RYANAIR)rheme.
related to Kruijff-Korbayová et al.’s (2003) use of theme phrases to link utterances with
questions under discussion (Ginzburg 1996; Roberts 1996) in an information-state based
dialogue system. An interesting challenge that remains for future work is to determine
to what extent our presentation strategy can be generalized to handle theme/rheme
partitioning, both for explicit questions across turns as well for implicit questions within
turns.
2.5 Sentence Planning
The sentence planning agent uses the Xalan XSLT processor to transform the output of
the content planner into a sequence of LFs that can be realized by the OpenCCG agent.
It is intended to be a relatively straightforward component, as the content planner has
been designed to implement the most important high-level generation choices. Its pri-
mary responsibility is to lexicalize the basic speech acts in the content plan—which may
appear in referring expressions—along with the rhetorical speech acts that connect them
ensemble. When alternative lexicalizations are available, all possibilities are included in
a packed structure (Foster and White 2004; White 2006a). The sentence planner is also
responsible for adding discourse markers such as also and but, adding pronouns, and
choosing sentence boundaries. It additionally implements a small number of rhetorical
restructuring operations for enhanced fluency.
The sentence planner makes use of approximately 50 XSLT templates to recursively
transform content plans into logical forms. An example logical form that results from
applying these templates to the content plan shown in Figure 7 appears in Figure 8 (with
alternative lexicalizations suppressed). As described further in Section 2.6.1, the logical
forms produced by the sentence planner are semantic dependency structures,10 which
make use of an info feature to encode theme/rheme partitioning, and a kon feature to
implement Steedman’s (2006) notion of kontrast (Vallduví and Vilkuna 1998). Following
Steedman, kontrast is assigned to the interpretations of words which contribute to
distinguishing the theme or rheme of the utterance from other alternatives that the
context makes available.
In order to trigger the inclusion of context-sensitive discourse markers such as also
and either, the sentence planner compares the inform acts for consecutive flight options
10 The nodes of the graph are typically labelled by lexical predicates. An exception here is the has-rel
predicate, which allows the predicative complement on Ryanair to introduce the ⟨AIRLINE⟩ role in a way
that is similar to the dependency structure for the Ryanair flight.
to see whether acts with the same type have the same value.11 These same checks can
also trigger de-accenting. Par exemple, when two consecutive flights have no seats in
business class, the second one can be described using the phrase it has NO AVAILABILITY
in business class either, where business class has been de-accented.
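The comparison itself is straightforward to sketch; the fragment below is our own illustration (the actual checks live in the XSLT templates), assuming inform acts are represented as mappings from attribute type to value.

```python
def redundancy_flags(prev_acts, cur_acts):
    """Mark which inform acts of the current option repeat the type and
    value of an act for the preceding option; a repeated pair licenses
    a marker such as `also`/`either` and de-accenting of the repeated
    material."""
    return {attr: prev_acts.get(attr) == value
            for attr, value in cur_acts.items()}

# Two consecutive flights with no seats in business class:
prev = {"fare-class": "no-availability", "number-of-legs": 1}
cur = {"fare-class": "no-availability", "number-of-legs": 2}
assert redundancy_flags(prev, cur) == {"fare-class": True, "number-of-legs": False}
```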
The development of the XSLT templates was made considerably easier by the ability
to invoke OpenCCG to parse a target sentence and then use the resulting logical form
as the basis of a template. (See Section 3 for a description of how the target sentences in
the FLIGHTS voice script were developed.) Using LF templates, rather than templates
at the string level, makes it simpler to uniformly handle discourse markers such as
also and either, which have different preferred positions within a clause, depending in
part on which verb they modify. LF templates also simplify the treatment of subject–
verb agreement. En plus, by employing LF templates with a single theme/rheme
partition, it becomes possible to underspecify whether the theme and rheme will be
realized by one intonational phrase each or by multiple phrases (see Section 2.6.2). À
the same time, certain aspects of the LF templates can be left alone, when there is no
need for further analysis and generalization.
As noted earlier, the use of LF templates further blurs the traditional distinction
between template-based and “real” NLG that van Deemter, Krahmer, and Theune (2005)
have called into question. In the case of referring expressions especially, LF templates in
FLIGHTS represent a practical and flexible way to deal with the interaction of decisions
made at the sentence planning level, as the speech acts identifying flight options are
considered together with the other basic and rhetorical speech acts in the applicability
conditions for the templates that structure clauses. In this way, options can be identified
not only in definite NPs, such as the CHEAPEST flight, but also in there-existentials and
conditionals, such as there is a DIRECT flight on BMI that . . . or if you prefer to fly DIRECT,
there’s a BMI flight that . . . . We may further observe that the traditional approach to gen-
erating referring expressions (Reiter and Dale 2000), where a distinguishing description
for an entity is constructed during sentence planning without regard to a user model,
would not fit in our architecture, where the user model drives the selection of a referring
expression’s content at the content planning level.
Although the use of LF templates in XSLT represents a practical approach to
handling sentence planning tasks that were not the focus of our research, it is not
one that promotes reuse, and thus it is worth noting which aspects of our sentence
planner would pose challenges for a more declarative and general treatment. The bulk
of the templates concern domain-specific lexicalization and are straightforward; given
the way these were developed from the results of OpenCCG parsing, it is conceiv-
able that this process could be largely automated from example input–output pairs.
The templates for adding pronouns and discourse markers require more expertise
but remain reasonably straightforward; the templates for rhetorical restructuring and
choosing sentence boundaries, in contrast, are fairly intricate. En principe, satisfactory
results might be obtained using a more general set of options for handling pronouns,
discourse markers, and sentence boundaries, together with an overgenerate-and-rank
methodology; we leave this possibility as a topic for future research.
2.6 Surface Realization
2.6.1 Chart Realization with OpenCCG. For surface realization, we use the OpenCCG open
source realizer (White 2004, 2006a, 2006b). A distinguishing feature of OpenCCG is that
11 In future work, these checks could be extended to apply across turns as well.
it implements a hybrid symbolic-statistical chart realization algorithm that combines
(1) a theoretically grounded approach to syntax and semantic composition, with (2) the
use of integrated language models for making choices among the options left open by
the grammar. In doing so, it brings together the symbolic chart realization (Kay 1996;
Shemtov 1997; Carroll et al. 1999; Moore 2002) and statistical realization (Knight and
Hatzivassiloglou 1995; Langkilde 2000; Bangalore and Rambow 2000; Langkilde-Geary
2002; Oh and Rudnicky 2002; Ratnaparkhi 2002) traditions. Another recent approach to
combining these traditions appears in Carroll and Oepen (2005), where parse selection
techniques are incorporated into an HPSG realizer.
Like other realizers, the OpenCCG realizer is partially responsible for determining
word order and inflection. Par exemple, the realizer determines that also should prefer-
ably follow the verb in There is also a very cheap flight on Air France, whereas in other cases
it typically precedes the verb, as in I also have a flight that leaves London at 3:45 p.m. It also
enforces subject–verb agreement, Par exemple, between is and flight, or between are and
seats. Less typically, in FLIGHTS and in the COMIC12 system, the OpenCCG realizer
additionally determines the prosodic structure, in terms of the type and placement of
pitch accents and edge tones, based on the information structure of its input logical
forms. Although OpenCCG’s algorithmic details have been described in the works cited
au-dessus de, details of how prosodic structure can be determined from information structure
in OpenCCG appear for the first time in this article.
The grammars used in the FLIGHTS and COMIC systems have been manually
written with the aim of achieving very high quality. Cependant, to streamline grammar
development, the grammar has been allowed to overgenerate in areas where rules are
difficult to write and where n-gram models can be reliable; in particular, it does not
sufficiently constrain modifier order, which in the case of adverb placement especially
can lead to a large number of possible orderings. En plus, it allows for a one-to-
many mapping from themes or rhemes to edge tones, yielding many variants that differ
only in boundary type or placement. As will be explained subsequently, we consider
this more flexible, data-driven approach to phrasing to be better suited to the needs of
generation than would be a more direct implementation of Steedman’s (2000un) théorie,
which would require all phrasing choices to be made at the sentence planning level.
We have built a language model for the system from the FLIGHTS speech corpus
described in Section 3.1. To enhance generalization, named entities and scalar adjectives
have been replaced with the names of their semantic classes (such as TIME, DATE, CITY,
AIRLINE, etc.), as is often done in limited domain systems. Note that in the corpus and
the model, pitch accents are treated as integral parts of words, whereas edge tones and
punctuation marks appear as separate words. The language model is a 4-gram back-off
model built with the SRI language modeling toolkit (Stolcke 2002), keeping all 1-counts
and using Ristad’s (1995) natural discounting law for smoothing. OpenCCG has its own
implementation for run-time scoring; in FLIGHTS, the model additionally incorporates
an a/an-filter, which assigns a score of zero to sequences containing a followed by a
vowel, or an followed by a consonant, subject to exceptions culled from bigram counts.
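A minimal sketch of such a filter appears below; the exception pairs shown are illustrative stand-ins for the exceptions FLIGHTS culls from bigram counts, and words are assumed to be lower-cased.

```python
VOWELS = set("aeiou")

def a_an_filter(words, exceptions=frozenset({("a", "united"), ("an", "hour")})):
    """Return 0.0 for a word sequence in which `a` precedes a
    vowel-initial word or `an` precedes a consonant-initial word,
    and 1.0 otherwise; the result multiplies into the n-gram score."""
    for det, nxt in zip(words, words[1:]):
        if (det, nxt) in exceptions:
            continue
        if det == "a" and nxt[0] in VOWELS:
            return 0.0
        if det == "an" and nxt[0] not in VOWELS:
            return 0.0
    return 1.0

assert a_an_filter("there is an early flight".split()) == 1.0
assert a_an_filter("there is a early flight".split()) == 0.0
```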
2.6.2 Deriving Prosody. CCG is a unification-based categorial framework that is both
linguistically and computationally attractive. We provide here a brief overview of CCG;
an extensive introduction appears in Steedman (2000b).
12 http://www.hcrc.ed.ac.uk/comic/.
Figure 9
A simple CCG derivation.
Figure 10
A CCG derivation with non-standard constituents.
A grammar in CCG is defined almost entirely in terms of the entries in the lexicon,
which are (possibly complex) categories bearing standard feature information (tel que
verb form, agreement, etc.) and subcategorization information. CCG has a small set of
rules which can be used to combine categories in derivations. The two most basic rules
are forward (>) et en arrière (<) function application. These rules are illustrated in
Figure 9, which shows the derivation of a simple copular sentence (with no prosodic
information).13 In the figure, the noun and proper name receive atomic categories with
labels n and np, respectively. The remaining words receive functional categories, such as
sdcl\np/(sadj\np) for the verb is; this category seeks a predicative adjective (sadj\np) to its
right and an np to its left, and returns the category sdcl, for a declarative sentence. Note
that dcl and adj are values for the form feature; other features, such as those for number
and case, have been suppressed in the figure (as has the feature label, form).
CCG also employs further rules based on the composition (B), type raising (T),
and substitution (S) combinators of combinatory logic. These rules add an element
of associativity to the grammar, making possible multiple derivations with the same
semantics. They are crucial for building the “non-standard” constituents that are the
hallmark of categorial grammars, and which are essential for CCG’s handling of coor-
dination, extraction, intonation, and other phenomena. For example, Figure 10 shows
how the category for there can be type-raised and composed with the category for is
in order to derive the constituent there is a direct flight (sdcl/(sadj\np)), which cuts across
the VP in traditional syntax. Because there is a direct flight corresponds to the theme
phrase, and on BMI to the rheme phrase, in the first clause of the second sentence in
13 These derivations are shown from the parsing perspective, starting with an ordered sequence of words at
the top. During realization, it is the semantics rather than the word sequence which is given; the word
sequence is determined during the search for a derivation that covers the input semantics. See the
discussion of Figure 11 (later in this section) for a discussion of the semantic representations used in
OpenCCG.
Figure 11
A derivation of theme and rheme phrases.
the business-class (BC) example in Figure 2, the derivation in Figure 10 also shows how
CCG’s flexible notion of constituency allows the intonation structure and information
structure to coincide, where the intonation coincides with surface structure, and the
information structure is built compositionally from the constituent analysis.
In Steedman (2000a), theme and rheme tunes are characterized by distinct patterns
of pitch accents and edge tones. The notation for pitch accents and edge tones is taken
from Pierrehumbert (1980) and ToBI (Silverman et al. 1992). We have implemented a re-
duced version of Steedman’s theory in OpenCCG, where theme tunes generally consist
of one or more L+H* pitch accents followed by a L-H% compound edge tone at the end
of the phrase,14 and rheme tunes consist of one or more H* pitch accents followed by
a L- or L-L% boundary at phrase end. Additionally, yes/no questions typically receive
a H-H% final boundary, and L-H% boundaries are often used as continuation rises to
mark the end of a non-final conjoined phrase.
The key implementation idea in Steedman’s (2000a) approach is to use features on
the syntactic categories to enforce information structural phrasing constraints—that is,
to ensure that the intonational phrases are consistent with the theme/rheme partition
in the semantics. For example, if there is a direct flight corresponds to the theme and on
BMI the rheme, then the syntactic features ensure that the intonation brackets the clause
as (there is a direct flight) (on BMI), rather than (there) (is a direct flight on BMI) or (there
is) (a direct flight on BMI), etc. To illustrate, Figure 11 shows, again from the parsing
perspective, how theme and rheme phrases are derived in OpenCCG for the subject
NP and VP of Figure 9, respectively. (To save space, pitch accents and edge tones will
henceforth be written using subscripts and string elements, rather than appearing on a
separate tonal tier.) In the figure, the category for cheapest has the info feature on each
14 To reduce ambiguity in the grammar, unmarked themes—that is, constituents which appear to be
thematic but which are not grouped intonationally into a separate phrase—are assumed to be
incorporated as background parts of the rheme, as suggested in Calhoun et al. (2005).
atomic category set to th(eme), and Ryanair has its info feature set to rh(eme), due to
the presence of the L+H* and H* accents, respectively. The remaining words have no
accents, and thus their categories have a variable (M, for EME) as the value of the info
feature on each atomic category.15 All the words also have a variable (O) as the value of
the owner feature, discussed subsequently, on each atomic category. As words combine
into phrases, the info variables serve to propagate the theme or rheme status of a
phrase; thus, the phrase the cheapestL+H∗ flight has category npth,O, whereas is on RyanairH∗
ends up with category sdcl,rh,O\nprh,O. Because these two phrases have incompatible info
values, they cannot combine before being “promoted” to intonational phrases by their
respective edge tones. In this way, the constraint that phrasing choices respect the
theme/rheme partition is enforced.
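The blocking effect of the info feature can be seen in a toy unification check; this is a schematic illustration of the mechanism, not OpenCCG's actual feature machinery.

```python
def unify(x, y):
    """Minimal feature unification: None acts as an unbound variable."""
    if x is None:
        return y
    if y is None:
        return x
    return x if x == y else "fail"

# A theme-marked phrase (info=th) cannot combine with a rheme-marked
# one (info=rh) until edge tones promote both to info=phr.
assert unify("th", "rh") == "fail"   # blocks combining theme and rheme directly
assert unify(None, "th") == "th"     # unaccented words join either partition
assert unify("phr", "phr") == "phr"  # complete intonational phrases may combine
```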
The edge tones allow complete intermediate or intonational phrases to combine by
changing the value of the info feature to phr(ase) in the result category.16 Note that
the argument categories of the edge tones do not select for a particular value of the
info feature; instead, L-H% has sh$1 as its argument category, whereas L-L% has ss$1,
where h(earer) and s(peaker) are the respective values of the owner feature.17 Combination
with an edge tone unifies the owner features throughout the phrase. In a prototypical
theme phrase, the hearer is the owner, whereas in a rheme phrase, the speaker is the
owner.
Like other compositional grammatical frameworks, CCG allows logical forms to
be built in parallel with the derivational process. Traditionally, the λ-calculus has been
used to express semantic interpretations, but OpenCCG instead makes use of a more
flexible representational framework, Hybrid Logic Dependency Semantics (Baldridge
and Kruijff 2002; Kruijff 2003), or HLDS. In HLDS, hybrid logic (Blackburn 2000) terms
are used to describe semantic dependency graphs, such as the one seen earlier in
Figure 8. As discussed in White (2006b), HLDS is well suited to the realization task, in
that it enables an approach to semantic construction that ensures semantic monotonicity,
simplifies equality tests, and avoids copying in coordinate constructions.
To illustrate, four lexical items from Figure 11 appear in Example (1). The three
words are derived by a lexical rule which adds pitch accents and information structure
features to the base forms. The format of the entries is lexeme ⊢ category, where the
category is itself a pair in the format syntax : logical form. The logical form is a conjunction
of elementary predications (EPs), which come in three varieties: lexical predications,
such as @Ebe; semantic features, such as @E⟨TENSE⟩pres; and dependency relations,
such as @E⟨ARG⟩X.
(1) a. cheapestL+H∗ ⊢ nX,th,O/nX,th,O : @X⟨HASPROP⟩P ∧ @Pcheapest ∧
@P⟨INFO⟩th ∧ @P⟨OWNER⟩O ∧ @P⟨KON⟩+
b. RyanairH∗ ⊢ npX,rh,O : @XRyanair ∧
@X⟨INFO⟩rh ∧ @X⟨OWNER⟩O ∧ @X⟨KON⟩+
c. is ⊢ sE,dcl,M,O\npX,M,O/(sP,adj,M,O\npX,M,O) : @Ebe ∧ @E⟨TENSE⟩pres ∧
@E⟨ARG⟩X ∧ @E⟨PROP⟩P ∧ @E⟨INFO⟩M ∧ @E⟨OWNER⟩O ∧ @E⟨KON⟩–
d. L-H% ⊢ sphr$1\sh$1
15 In practice, variables like M are replaced with “fresh” variables (such as M1) during lexical lookup, so
that the M variables are distinct for different words.
16 The $ variables range over a (possibly empty) stack of arguments, allowing the edge tones to employ a
single s$1\s$1 category to combine with s/(s\np), s\np, and so on. As indicated in the figure, the new
value of the info feature (phr) is automatically distributed throughout the arguments of the result category.
17 See Steedman (2000a) for a discussion of this notion of “ownership,” or responsibility.
In these entries, the indices (or nominals, in hybrid logic terms) in the logical forms,
such as X, P, and E, correspond to nodes in the semantic graph structure, and are
linked to the syntactic categories via the index feature. Similarly, the syntactic features
info and owner have associated ⟨INFO⟩ and ⟨OWNER⟩ features in the semantics. As
discussed previously, in derivations the values of the info and owner features are
propagated throughout an intonational phrase, which has the effect of propagating the
values of the ⟨INFO⟩ and ⟨OWNER⟩ semantic features to every node in the dependency
graph corresponding to the phrase. In this way, a distributed representation of the
theme/rheme partition is encoded, in a fashion reminiscent of Kruijff’s (2003) approach
to representing topic–focus articulation using hybrid logic. By contrast, the ⟨KON⟩ fea-
ture (cf. Section 2.5) is a purely local one, and thus appears only in the semantics. Note
that because the edge tones do not add any elementary predications, one or more edge
tones—and thus one or more intonational phrases—may be used to derive the same
theme or rheme in the logical form.
To simplify the semantic representations the sentence planner must produce, Open-
CCG includes default rules that (where applicable) propagate the value of the ⟨INFO⟩
feature to subtrees in the logical form, set the ⟨OWNER⟩ feature to its prototypical value,
and set the value of the ⟨KON⟩ feature to false (a toy sketch of these defaults appears
after Example (2)). When applied to the logical form for
the semantic dependency graph in Figure 8, the rules yield the HLDS term in Exam-
ple (2). After this logical form is flattened to a conjunction of EPs, lexical instantiation
looks up relevant lexical entries, such as those in Example (1), and instantiates the
variables to match those in the EPs. The chart-based search for complete realizations
proceeds from these instantiated lexical entries.
(2) @e(be ∧ ⟨TENSE⟩pres ∧ ⟨INFO⟩rh ∧ ⟨OWNER⟩s ∧ ⟨KON⟩– ∧
      ⟨ARG⟩(f ∧ flight ∧ ⟨DET⟩the ∧ ⟨NUM⟩sg ∧ ⟨INFO⟩th ∧ ⟨OWNER⟩h ∧ ⟨KON⟩– ∧
        ⟨HASPROP⟩(c ∧ cheapest ∧ ⟨INFO⟩th ∧ ⟨OWNER⟩h ∧ ⟨KON⟩+)) ∧
      ⟨PROP⟩(p ∧ has-rel ∧ ⟨INFO⟩rh ∧ ⟨OWNER⟩s ∧ ⟨KON⟩– ∧
        ⟨OF⟩f ∧
        ⟨AIRLINE⟩(r ∧ Ryanair ∧ ⟨INFO⟩rh ∧ ⟨OWNER⟩s ∧ ⟨KON⟩+)))
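
As a rough illustration of these default rules, the sketch below encodes a logical form as nested dictionaries (an encoding of our own, purely for exposition, not OpenCCG's HLDS representation) and fills in INFO, OWNER, and KON as described:

```python
# A simplified sketch of the default rules, using nested dicts as a
# stand-in for HLDS logical forms (this encoding is ours, for illustration).
from typing import Optional

PROTO_OWNER = {"th": "h", "rh": "s"}  # prototypical owners: hearer/speaker

def apply_defaults(node: dict, inherited_info: Optional[str] = None) -> dict:
    node = dict(node)
    node.setdefault("INFO", inherited_info)          # propagate INFO downward
    node.setdefault("OWNER", PROTO_OWNER.get(node["INFO"]))
    node.setdefault("KON", False)                    # KON defaults to false
    for rel, child in list(node.items()):
        if isinstance(child, dict):                  # dependency relations
            node[rel] = apply_defaults(child, node["INFO"])
    return node

lf = {"pred": "be", "INFO": "rh",
      "ARG": {"pred": "flight", "INFO": "th",
              "HASPROP": {"pred": "cheapest", "KON": True}}}
print(apply_defaults(lf))
# 'cheapest' inherits INFO 'th' (hence OWNER 'h'); 'be' defaults to OWNER 's'
```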
Given our focus in FLIGHTS on using intonation to express contrasts intelligibly,
we have chosen to employ hard constraints in the OpenCCG grammar on the choice
of pitch accents and on the use of edge tones to separate theme and rheme phrases.
However, the grammar only partially constrains the type of edge tone (as explained
previously), and allows the theme and rheme to be expressed by one or more intona-
tional phrases each; consequently, the final choice of the type and placement of edge
tones is determined by the n-gram model. To illustrate, consider Example (3), which
shows how the frequent flyer sentence seen in Figure 2 is divided into four intonational
phrases. Other possibilities (among many) allowed by the grammar include leaving out
the L-L% boundary between flight and arriving, which would yield a phrase that’s likely
to be perceived as too long, or adding a L- or L-L% boundary between there’s and a,
which would yield two unnecessarily short phrases.
(3) There’s a KLMH∗ flight L-L% arriving BrusselsH∗ at fourH∗ fiftyH∗ p.m.H∗ L-L%,
but businessH∗ classH∗ is notH∗ availableH∗ L-H% and you’d need to connectH∗ in
AmsterdamH∗ L-L%.
To allow for this flexible mapping between themes and rhemes and one or more into-
national phrases, we take advantage of our distributed approach to representing the
theme/rheme partition, where edge tones mark the ends of intonational phrases with-
out introducing their own elementary predications. As an alternative, we could have
associated a theme or rheme predication with the edge tones, which would be more
in line with Steedman’s (2000a) approach. However, doing so would make it necessary
to include one such predication per phrase in the logical forms, thereby anticipating
the desired number of output theme and rheme phrases in the realizer’s input. Given
that the naturalness of intonational phrasing decisions can depend on surface features
like phrase length, we consider the distributed approach to representing theme/rheme
status to be better suited to the needs of generation.
2.7 Comparison to Baseline Prosody Prediction Models
As noted in Section 2.2, spoken language dialogue systems often include a separate
prosody prediction component (Pan, McKeown, and Hirschberg 2002; Walker, Rambow,
and Rogati 2002), rather than determining prosodic structure as an integral part of
surface realization, as we do here. Although it is beyond the scope of this article to
compare our approach to a full-blown, machine learning–based prosody prediction
model, we do present in this section an expert evaluation that shows that our approach
outperforms strong baseline n-gram models. In particular, we show that the informa-
tion structural constraints in the grammar play an important role in producing target
prosodic boundaries, and that these boundary choices are preferred to those made by
an n-gram model in isolation.
According to Pan, McKeown, and Hirschberg (2002, page 472), word-based n-gram
models can be surprisingly good: “The word itself also proves to be a good predictor for
both accent and break index prediction. . . . Since the performance of this [word] model
is the best among all the [single-feature] accent prediction models investigated, it seems
to suggest that for a CTS [Concept-to-Speech] application created for a specific domain,
features like word can be quite effective in prosody prediction.” Indeed, although their
best accent prediction model exceeded the word-based one, the difference did not reach
statistical significance (page 485). Additionally, word predictability, measured by the
log probability of bigrams and trigrams, was found to significantly correlate with pitch
accent decisions (pages 482–483), and contributed to their best machine-learned models
for accent presence and boundary presence.18
As baseline n-gram models, we trained 1- to 4-gram models for predicting accents
and boundaries using the FLIGHTS speech corpus (Section 3.1), the same corpus used to
train the realizer’s language model. The baseline n-gram models are factored language
models (Bilmes and Kirchhoff 2003), with words, accents, and boundaries as factors.
The accent models have accent as the child variable and 1–4 words as parent variables,
starting with the current word, and including up to three previous words. The boundary
models are analogous, with boundary as the child variable. With these models, each
maximum likelihood prediction of pitch accent or edge tone is independent of all other
choices, so there is no need to perform a best-path search. The majority baseline predicts
no accent and no edge tone for each word.
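
To make the baselines concrete, a stripped-down, word-based variant can be sketched as follows; the models we actually trained are factored language models with backoff, so this plain maximum-likelihood version (with an assumed data format) is only an approximation:

```python
# A rough sketch of a word-based accent baseline: maximum-likelihood
# prediction from n-gram word contexts, each word predicted independently
# (no best-path search). Data format is assumed for illustration.
from collections import Counter, defaultdict

def train(corpus, n=2):
    """corpus: list of sentences, each a list of (word, accent) pairs,
    with accent 'NONE' for unaccented words."""
    counts = defaultdict(Counter)
    for sent in corpus:
        words = [w for w, _ in sent]
        for i, (_, accent) in enumerate(sent):
            # context = current word plus up to n-1 previous words
            ctx = tuple(words[max(0, i - n + 1): i + 1])
            counts[ctx][accent] += 1
    return counts

def predict(counts, words, n=2, backoff="NONE"):
    preds = []
    for i in range(len(words)):
        ctx = tuple(words[max(0, i - n + 1): i + 1])
        c = counts.get(ctx)
        preds.append(c.most_common(1)[0][0] if c else backoff)
    return preds

corpus = [[("the", "NONE"), ("cheapest", "L+H*"), ("flight", "NONE")]]
model = train(corpus, n=2)
print(predict(model, ["the", "cheapest", "flight"]))  # ['NONE', 'L+H*', 'NONE']
```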
18 Note that in our setting, it would be impossible to exactly replicate Pan et al.’s models, in which syntactic
boundaries play an important role, as CCG does not have a rigid notion of syntactic boundary.
Table 1
Baseline pitch accent and boundary prediction accuracy against target tunes.

Accuracy             Majority   1-gram   2-gram   3-gram   4-gram     N
Accent Presence        73.3%     98.0%    98.6%    97.4%    97.4%   344
Accent Type            73.3%     96.5%    97.4%    96.2%    96.8%   344
Boundary Presence      81.1%     91.9%    94.5%    95.4%    95.1%   344
Boundary Type          81.1%     91.0%    92.7%    93.0%    93.9%   344
We tested the realizer and the baseline n-gram models on the 31 sentences used to
synthesize the stimuli in the perception experiment described in Section 4 (see Figure 13
for an example mini-dialogue). None of the test sentences appear verbatim in the
FLIGHTS speech corpus. The test sentences contain target prosodic structures intended
to be appropriate for the discourse context. Given these structures, we can quantify
the accuracy with which the realizer is able to reproduce the pitch accent and edge
tone choices in the target sentences, and compare it to the accuracy with which n-gram
models predict these choices using maximum likelihood. Note that the target prosodic
structures may not represent the only natural choices, motivating the need for the expert
evaluation described further subsequently.
Although the realizer is capable of generating the target prosodic structure of each
test sentence exactly, the test sentence (with its target prosody) is not always the top-
ranked realization of the corresponding logical form, which may differ in word order
or choice of function words. Thus, to compare the realizer’s choices against the target
accents and boundaries, we generated n-best lists of realizations that included the target
realization, and compared this realization to others in the n-best list with the same
words in the same order (ignoring pitch accents and edge tones). In each case, the target
realization was ranked higher than all other realizations with the same word sequence,
and so we may conclude that the realizer reproduces the target accent and boundary
choices in the test sentences with 100% accuracy.
The accuracy with which the baseline n-gram models reproduce the target tunes
is shown in Table 1. As the test sentences are very similar to those in the FLIGHTS
speech corpus, the accent model performs remarkably well, with the bigram model
reproducing the exact accent type (including no accent) in 97.4% of the cases, and
agreeing on the choice of whether to accent the word at all in 98.6% of the cases. The
boundary model also performs well, though substantially worse than the realizer, with
the 4-gram model reproducing the boundary type (including no boundary) in 93.9% of
the cases, and agreeing on boundary presence in 95.1% of the cases.19
Inspired by Marsi’s (2004) work on evaluating optionality in prosody prediction, we
asked an expert ToBI annotator, who was unfamiliar with the experimental hypotheses
under investigation, to indicate for each test sentence the range of contextually appro-
priate tunes by providing all the pitch accents and edge tones that would be acceptable
19 Because the boundary models sometimes failed to include a boundary before a comma or full stop,
default L-L% boundaries were added in these cases.
Table 2
Examples comparing target tunes to baseline prosody prediction models, with expert corrections
(Items 07-2 and 06-1).

(a) target:   the only directL+H∗ flight L-H% leaves at 5:10H∗ L-L% .
    edits:    none
    n-grams:  the only directH∗ flight leaves at 5:10H∗ L-L% .
    edits:    the only directL+H∗ flight L- leaves at 5:10H∗ L-L% .

(b) target:   there ’s a directH∗ flight on British AirwaysH∗ with a goodH∗ price L-L% .
    edits:    there ’s a directH∗ flight on British AirwaysH∗ L- with a goodH∗ price L-L% .
    n-grams:  there ’s a directH∗ flight on British AirwaysH∗ L-L% with a goodL+H∗ price L-L% .
    edits:    none
for each word.20 However, our annotator found this task to be too difficult, in part
because of the difficulty of coming up with all possible acceptable tunes, and in part
because of dependencies between the choices made for each word. For this reason,
we instead chose to follow the approach taken with the Human Translation Error
Rate (HTER) post-edit metric in MT (Snover et al. 2006), and asked our annotator to
indicate, for the target tune and the n-gram baseline tune, which accents and boundaries
would need to change in order to yield a tune appropriate for the context. For the
n-gram baseline tune, we used the choices of the bigram accent model and the 4-gram
boundary model.
Examples of how the target tunes (and realizer choices) compare to those of the
baseline accent and boundary prediction models appear in Table 2, along with the
corrections provided by our expert ToBI annotator.21 In Table 2(a), the target tune,
which contains the theme phrase the only directL+H∗ flight L-H%, was considered fully
acceptable. By contrast, with the tune of the n-gram models, the H* accent on direct was
not considered acceptable for the context, and at least a minor phrase boundary was
considered necessary to delimit the theme phrase. In Table 2(b), we have an example
where the target tune was not considered fully acceptable: Although the target consisted
of a single, all-rheme intonational phrase, with no intermediate phrases, our annotator
indicated that at least a minor phrase break was necessary after British Airways. The
n-gram models assigned a major phrase break at this point, which was also considered
acceptable. Note also that the n-gram models had a L+H* accent on good, in contrast to
the target tune’s H*, but both choices were considered acceptable.
Table 3 shows the number of accent and boundary corrections for all 31 test sen-
tences at different levels of specificity. Overall, there were just 10 accent or boundary
corrections for the target tunes, versus 24 for those of the baseline models, out of 688
total accent and boundary choices, a significant difference (p = 0.01, Fisher’s Exact Test
[FET], 1-tailed). With the accents, there were fewer corrections for the target tunes,
but not many in either case, and the difference was not significant. Of course, with a
20 We thank one of the anonymous reviewers for this suggestion. Note that in Marsi’s study, having one
annotator indicate optionality was found to be a reasonable approximation of deriving optionality from
multiple annotations.
21 During realization, multiwords are treated as single words, such as British Airways and 5:10 (five ten a.m.).
Accents on multiwords are distributed to the individual words before the output is sent to the speech
synthesizer.
Table 3
Number of prosodic corrections of different types in all utterances and theme utterances for the
target tunes and the ones selected by n-gram models. Items in bold are significantly different at
p = 0.05 or less by a one-tailed Fisher’s Exact Test.

                          All Utts               Theme Utts
                     Target   n-grams        Target   n-grams
Total corrections       10       24              3        11
Accents                  4        7              2         5
  Presence               3        4              2         3
Boundaries               6       17              1         6
  Presence               2       13              0         4
  Major                  0        9              0         3
larger sample size, or with test sentences that are less similar to those in the corpus,
a significant difference could arise. With the boundaries, out of 344 choices, the target
tunes had six corrections, only two of which involved the presence of a boundary, and
none of which involved a missing major boundary; by contrast, the n-gram baseline
had 17, 13, and 9 such corrections, respectively, a significant difference in each case
(p = 0.02, p = 0.003, p = 0.002, respectively, FET). In the subset of 12 sentences involving
theme phrases, where the intonation is more marked than in the all-rheme sentences,
the target tunes again had significantly fewer corrections overall (3 vs. 11 corrections
out of 220 total choices; p = 0.03, FET), and the difference in boundary and boundary
presence corrections approached significance (p = 0.06 in each case, FET).
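
For reference, the overall comparison above can be reproduced with a standard statistical package; the sketch below uses SciPy, under the simplifying assumption that the 688 choices per condition are independent binary outcomes:

```python
# A sketch of the significance test over correction counts (SciPy assumed).
from scipy.stats import fisher_exact

n = 688  # accent + boundary choices per condition (344 words x 2 decisions)
table = [[10, n - 10],   # target tunes: corrected vs. accepted
         [24, n - 24]]   # n-gram tunes: corrected vs. accepted
_, p = fisher_exact(table, alternative="less")  # H1: target has fewer
print(f"one-tailed p = {p:.3f}")                # ~0.01, as reported above
```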
Having shown that the information structural constraints in the grammar play an
important role in producing target realizations with contextually appropriate prosodic
structures—in particular, in making acceptable boundary choices, where the choice is
not deterministically rule-governed—we now briefly demonstrate that the realizer’s
n-gram model (see Section 2.6.1) has an important role to play as well. Table 4 compares
the realizer’s output on the test sentences against its output with the language model
disabled, in which case an arbitrary choice is effectively made among those outputs
allowed by the grammar. Not surprisingly, given that the grammar has been allowed
to overgenerate, the realizer produces far more exact matches and far higher BLEU
(Papineni et al. 2001) scores with its language model than without. Looking at the dif-
ferences between the realizer’s highest scoring outputs and the target realizations, the
differences largely appear to represent cases of acceptable variation. The most frequent
difference concerns whether an auxiliary or copular verb is contracted or not, where
either choice seems reasonable. Most of the other differences represent minor variations
in word order, such as directH∗ Air FranceH∗ flight vs. directH∗ flight on Air FranceH∗. By
contrast, many of the outputs chosen with no language model scoring contain undesirable
variation in word order or phrasing; for example: Air FranceH∗ directH∗ flight
instead of directH∗ flight on Air FranceH∗, and you L- though would need L- to connectH∗ in
AmsterdamH∗ L- instead of you’d need to connectH∗ in AmsterdamH∗ though L-L%.

Table 4
Impact of language model on realizer choice.

            Exact Match     BLEU
Realizer        61.3%       0.9505
No LM            0.0%       0.6293
2.8 Interim Summary
In this section, we have introduced a presentation strategy for highlighting the most
compelling trade-offs for the user that straightforwardly lends itself to the determina-
tion of information structure, which then drives the choice of prosodic structure during
surface realization with CCG. We have also shown how phrase boundaries can be
determined in a flexible, data-driven fashion, while respecting the constraints imposed
by the grammar. An expert evaluation demonstrated that the approach yields prosodic
structures with significantly higher acceptability than strong n-gram baseline prosody
prediction models. In the next two sections, we show how the generator-driven prosody
can be used to produce perceptibly more natural synthetic speech.
3. Unit Selection Synthesis with Prosodic Markup
In this section we describe the unit selection voices that we employed in our perception
experiment (Section 4). Three voices in total were used in the evaluation: GEN, ALL,
and APML. Each was a state-of-the-art unit selection voice using the Festival Multisyn
speech synthesis engine (Clark, Richmond, and King 2007). The GEN voice, used as
a baseline, is a general-purpose unit selection voice. The ALL voice is a voice built
using the same data as the GEN voice but with the addition of the data from the
FLIGHTS speech corpus described in Section 3.1. The APML voice augments the in-
domain data in the ALL voice with prosodic markup, which is then used at run-time by
the synthesizer in conjunction with marked-up input to guide the selection of units.
To provide in-domain data for the ALL and APML voices, we needed to record
a suitable data set from the original speaker used for the GEN voice. We describe the
process of constructing this speech corpus for FLIGHTS next, in Section 3.1. Then, in
Section 3.2, we describe the unit selection voices in detail, along with how the in-domain
data was used in building them.
3.1 The FLIGHTS Speech Corpus
The FLIGHTS speech corpus was intended to have a version of each word that needs to
be spoken by the system, recorded in the context in which it will be spoken. This context
can be thought of as a three-word window centered around the given word, together
with the word’s target pitch accent and edge tone, if any. This would theoretically
provide more than sufficient speech data for a full limited domain voice, and a voice
using this data would be guaranteed to have a unit sequence for any sentence generated
by FLIGHTS where each individual unit is a tight context match for the target unit.
Because the system was still limited in its generation capabilities at the time of
recording the voice data for FLIGHTS, we developed a recording script by combining
the generated data that was available with additional utterances that we anticipated
the finished system would generate. To do so, we assembled a set of around 50 utter-
ance templates that describe flight availability—with slots to be filled by times, dates,
amounts, airlines, airports, cities, and flight details—to which we added approximately
two hundred individual or single-slot utterances, which account for the introductions,
questions, confirmations, and other responses that the system makes. We then made use
of an algorithm designed by Baker (2003) to iterate through the filler combinations for
each utterance template to provide a suitable recording script.22
To illustrate, Example (4) shows an example utterance template, along with two
utterances in the recording script created from this template. The example demonstrates
that having multiple slots in a template makes it possible to reduce the number of sen-
tences that need to be recorded to cover all possible combinations of fillers. With (b) and
(c) recorded as illustrated to fill the template in (a), we would then have the recorded
data to mix and match the slot fillers with two different values each to synthesize eight
different sentences. It is also often possible to use the fillers for a slot in one utterance
as the fillers for a slot in another utterance, if the phrase structure is sufficiently similar.
A more complex case is shown in Example (5), involving adjacent slots. In this case, it
is not sufficient to just record one example of each possible filler, as the context around
that filler is changed by the presence of other slots; instead, combinations of fillers must
be considered (Baker 2003). (A toy sketch of the simple slot-filling case follows Example (5).)
(4) a. It arrives at (cid:5)TIME(cid:6)H∗ L-H% and costs just (cid:5)AMOUNT(cid:6)H∗ L-L%, but it requires
a connectionH∗ in (cid:5)CITY(cid:6)H∗ L-L%.
b. It arrives at sixH∗ p.m.H∗ L-H% and costs just fiftyH∗ poundsH∗ L-L%, but it
requires a connectionH∗ in ParisH∗ L-L%.
c. It arrives at nineH∗ a.m.H∗ L-H% and costs just eightyH∗ poundsH∗ L-L%, but it
requires a connectionH∗ in PisaH∗ L-L%.
(5) a. There are (cid:5)NUM(cid:6)H∗ (cid:5)MOD(cid:6)H∗ flights L-L% from (cid:5)CITY(cid:6)H∗ to (cid:5)CITY(cid:6)H∗
today L-L%.
b. There are twoH∗ earlierH∗ flights L-L% from BordeauxH∗ to AmsterdamH∗
today L-L%.
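
To give a flavor of this script-construction step, the toy sketch below (a hypothetical template format of our own devising, not Baker's algorithm) pairs slot fillers column-wise so that each value is recorded once rather than enumerating the full cross-product, as in the single-slot-per-context case of Example (4):

```python
# A toy sketch of building recording-script utterances from a slotted
# template (hypothetical format; the real script also handled adjacent
# slots, which require filler combinations, cf. Example (5)).
from itertools import zip_longest

TEMPLATE = ("It arrives at {TIME} L-H% and costs just {AMOUNT} L-L%, "
            "but it requires a connection in {CITY} L-L%.")

FILLERS = {
    "TIME": ["six p.m.", "nine a.m."],
    "AMOUNT": ["fifty pounds", "eighty pounds"],
    "CITY": ["Paris", "Pisa"],
}

def recording_script(template, fillers):
    slots = list(fillers)
    columns = [fillers[s] for s in slots]
    # One utterance per "row" covers one new value of every slot, instead
    # of recording len(TIME) * len(AMOUNT) * len(CITY) full sentences.
    for row in zip_longest(*columns, fillvalue=None):
        values = {s: (v if v is not None else fillers[s][0])
                  for s, v in zip(slots, row)}
        yield template.format(**values)

for utt in recording_script(TEMPLATE, FILLERS):
    print(utt)  # two recorded sentences license eight synthesized ones
```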
The recording script presented similar utterances in blocks. It made use of small
caps to indicate intended word-level emphasis and punctuation to partially indicate
desired phrasing, as the speaker was not familiar with ToBI annotation. The intended
dialogue context for the utterances was discussed with our speaker, but the recordings
did not consist of complete dialogues, as using complete dialogues would have been too
time consuming. The recording took place during two afternoon sessions and yielded
approximately two hours of recorded speech. The resulting speech database consisted
of 1,237 utterances, the bulk of which was derived from the utterances that describe
flight availability.
The recordings were automatically split into sentence-length utterances, each of
which had a generated APML file with the pitch accents and edge tones predicted for
the utterance. The prosodic structures were based on intuition, as there was no corpus
of human–human dialogues sufficiently similar to what the system produces that could
have been used to inform prosodic choice.
The recordings were then manually checked against the predicted APML, by listen-
ing to the audio and looking at the f0 contours to see if the pitch accents and edge
tones matched the predicted ones. In the interest of consistency, changes were only
22 The target prosody for each utterance template was based on the developers’ intuitions, with input from
Mark Steedman. As one of the anonymous reviewers observed, an alternative approach might have been
to conduct an elicitation study in a Wizard-of-Oz setup to determine target tunes. This strategy may have
yielded more natural prosodic variation, but perhaps at the expense of employing less distinctive tunes.
For this project, it was not practical to take the extra time required to conduct such an elicitation study.
made to the APML files when there were clear cases of the speaker diverging from the
predicted phrasing or emphasis; nevertheless, the changes did involve mismatches in
both the presence and the type of the accents and boundaries. As an example of the kind
of changes that were made, in Example (4) we had predicted that the speaker would
use a L-H% boundary (a continuation rise) prior to the word but; however, she instead
consistently used a low boundary tone in this location. Consequently, we changed all
the APML files for sentences with this pattern to include a L-L% compound edge tone
at that point. Further details of this data cleanup process are given in Rocha (2004).
3.2 The Synthetic Voices
Ideally, we would like to build a general purpose unit selection voice where we can
fully specify the intonation in terms of ToBI accents and boundaries and then have the
synthesizer generate this for us. This, however, is currently not a fully viable option for
the reasons discussed subsequently, so instead we built a number of different voices
using the unit selection technique to investigate how working towards a voice with
full intonation control would function. There are a number of ways in which prosodic
generation for unit selection can be approached (Aylett 2005; van Santen et al. 2005;
Clark and King 2006), with most systems designed to produce the best general prosody.
This contrasts with the framework chosen here, which is designed to make maximum
use of in-domain data, in an attempt to produce a system that realizes prosody as well
as is possible in that domain. This can be considered an upper bound for what more
general systems could achieve under optimal conditions.
The ideal way to use intonation within the unit selection framework requires three
components: the database of speech, which must be fully annotated for intonation;
the predicted intonation for a target being synthesized; and a target cost that ensures
suitable candidates are found to match the target. Each of these requirements proves to
be difficult to achieve successfully.
Manually labeling a large speech database accurately and consistently with intona-
tion labels is both difficult and expensive, whereas automatic labeling techniques tend
to produce mediocre results at best. Producing accent labeling for the flight information
script is somewhat easier because of the limited number of templates that the script is
based upon, and once the templates are labeled, intonation for individual sentences can
be derived. To build a general purpose unit selection voice for the FLIGHTS domain,
we would want to combine the FLIGHTS data with speech data designed to provide
more general coverage. Providing accent labeling for the extra, non-FLIGHTS, data is
the main problem here.
Predicting intonation for a target utterance is generally a difficult task, and can
only really be done well when the domain of the speech synthesis is constrained. For
example, a number of statistical modeling techniques may be able to provide adequate
accent prediction for a system to read the news or forecast the weather, because the sen-
tence structure is usually limited to a sequence of simple statements. The task becomes
difficult when the sentence structure is more variable, for example in dialogue where a
number of different question forms may exist along with contrastive statements, and so
forth. In the FLIGHTS domain we can side-step the prediction problem by providing a
specification for the intonation of a sentence as part of language generation.
In one sense, defining a target-cost component to direct the search towards finding
suitable intonation is not difficult, as a simple penalty for not matching the target
intonation suffices. However, standard unit selection techniques only ever take into
account local effects, and no provision is made to ensure a suitable global intonation
contour.
These problems, and a number of technical issues relating to the use of markup,
automatic segment alignment, and the voice building process, prohibit us from building
a general purpose unit selection voice where we can specify the required intonation.
One of the questions that this study is attempting to answer is whether it is worth trying
to resolve these issues to build such a system in the future. To address this question
we present a number of voices designed to investigate the issues of producing natural
speech synthesis in the FLIGHTS domain.
3.2.1 The GEN Voice. The GEN voice uses a database of approximately 2,000 sentences
of read newspaper speech. The Festival Multisyn unit selection engine selects diphone
units from the database by minimizing a combination of a target cost and a join cost.
The target cost scores the linguistic context of the diphone in terms of stress, position
of the diphone in the current word, syllable and phrase, phonetic context, and part of
speech. The join cost scores the continuity between selected units in terms of spectrum,
energy, and f0. There is no explicit modeling of prosody in this system.
The GEN voice was used as a baseline. There is nothing specific to the FLIGHTS
domain associated with any part of this voice, and the quality of the resulting synthesis
is comparable with a typical unit selection speech synthesiser.
3.2.2 The ALL voice. The ALL voice was created by augmenting the GEN voice with
the speech data recorded as part of the FLIGHTS speech corpus described herein. The
motivation here is to attempt to provide a system which would have the best possible
quality that a general purpose unit selection synthesiser could have when working in
this domain. The additional data increases the availability of units in the exact contexts
that would be required by the FLIGHTS system. Having examples of airport and airline
names in the database, for example, increases the likelihood that when these words
are synthesized there are appropriate, often consecutive, units available to synthesize
them. Naturalness is improved both by the better context match and by the need for
fewer joins in constructing these words and the utterances they are used in.
3.2.3 The APML voice. The APML voice is different from the ALL voice in that it is
designed to take APML input from the FLIGHTS system rather than text input. The
APML voice comprises the same speech data as the ALL voice, but also includes
the APML annotation for the FLIGHTS part of the corpora. The target cost for the
synthesizer is augmented with a prosodic component which penalizes any mismatch
between supplied APML input and the APML specification associated with a particular
unit. As the read newspaper component of this voice does not have accompanying
APML markup, and because the voice is required to work for text input (although this
capability is not used here), the target cost penalizes (1) the synthesis of an APML-
specified target with a unit that does not have accompanying APML markup; (2) the
synthesis of a target without an APML specification with APML-specified units; and (3)
synthesis where there is an APML mismatch between the target and candidate. It is
important that these prosodic mismatches are only discouraged, rather than forbidden,
to allow the system to synthesize out of the original domain if required.
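
Schematically, this prosodic component of the target cost can be pictured as follows; the sketch is our own illustration, with invented penalty weights, and the actual Festival Multisyn target cost combines many additional features:

```python
# A schematic sketch of an APML-sensitive target cost (illustrative
# weights; not the actual Festival Multisyn implementation).
from typing import Optional

APML_MISMATCH = 1.0       # (3) target and candidate marked, but differently
UNMARKED_CANDIDATE = 0.5  # (1) APML-specified target, unmarked candidate
UNMARKED_TARGET = 0.5     # (2) unmarked target, APML-specified candidate

def prosodic_cost(target: Optional[str], candidate: Optional[str]) -> float:
    if target is None and candidate is None:
        return 0.0
    if candidate is None:
        return UNMARKED_CANDIDATE
    if target is None:
        return UNMARKED_TARGET
    return 0.0 if target == candidate else APML_MISMATCH

def target_cost(base: float, target: Optional[str], cand: Optional[str]) -> float:
    # Mismatches are only penalized, never forbidden, so a best unit
    # sequence always exists, even for out-of-domain input.
    return base + prosodic_cost(target, cand)

print(target_cost(0.2, "H*", None))    # case (1): 0.7
print(target_cost(0.2, "H*", "L+H*"))  # case (3): 1.2
```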
As work on the FLIGHTS system progressed, we found numerous cases where the
speech corpus failed to anticipate all the possible outputs of the system. For example,
although we included an utterance template for conveying price in a conjoined verb
phrase, as in Example (4), we did not include a template for conveying price in a
single independent clause, as in It costs just fifty pounds. Had the system been further
along in its development when we recorded the speech data, we could have pursued a
strategy of selecting utterances to record from generated outputs, in order to maximize
some coverage criterion.23 In either case, though, we consider it difficult to develop
a recording script that will completely cover all three-word sequences that a system
will ever generate, especially if further development is considered. For this reason, we
believe it to be essential to have a strategy to handle the cases where generation needs
go beyond the original plan. The use of an augmented general purpose unit selection
system—rather than, for example, a strictly limited domain system—means that there
is no difficulty in synthesizing extra material as needed, although there is the risk that it
will not sound quite as good as that which is closer in context to the original in-domain
recordings. For an APML voice where the input is “new” APML, if there are no suitable
units within the APML-annotated section of the corpus, units will be chosen from the
main portion of the corpus. There will be a penalty in terms of target cost for doing so,
but the best sequence will still be found.
4. Synthesis Evaluation
In this section, we describe a perceptual experiment which was carried out to deter-
mine whether the prosodic structures generated by the FLIGHTS system actually result
in improved naturalness in speech synthesis. We also describe an evaluation of the
prosody in the synthesized speech that makes use of an expert annotator’s assessments
of the acceptability of the perceived tunes for the given context, and an evaluation that
examines objective differences in the f0 contours of the theme phrases in the synthesized
speech. The three types of evaluation (subject ratings, expert annotation, and objective
measures) all show the APML voice outperforming the ALL voice, both on the complete
set of test utterances and on the subset containing theme phrases. Additionally, the
three types of evaluation support each other in that differences in ratings of particular
items correspond to differences in acceptability annotations and to differences in the
objective measures, as will be explained in the following.
4.1 Perception Experiment
4.1.1 Methodology. Our experimental hypothesis was that listeners would prefer the
APML voice, used with contextually appropriate intonation markup, over the ALL and
GEN control voices. We further hypothesized that the preference would be larger for
the utterances containing theme phrases, where the intonation is more marked than it
is in the all-rheme utterances.
Subjects were presented with mini-dialogues consisting of a summary of the user’s
request in text and three versions of the system’s response, one for each of the three
voices, as shown in Figure 12. System utterances were presented side-by-side, with
each system turn comprising two to four utterances, and each voice labeled as A, B,
or C. The label assignments were balanced across the mini-dialogues so that each voice
appeared an equal number of times labeled as A, B, or C, and the mini-dialogues were
presented in an individually randomized order. Subjects were asked to assign ratings
to each version of each utterance on a 1–7 scale, with 7 corresponding to “completely
natural” and 1 corresponding to “completely unnatural.” Ratings were gathered on-line
23 For example, in developing a custom voice for the COMIC system, we selected from generated utterances
in order to maximize bigram coverage with high priority named entities.
Figure 12
Screenshot of webexp2 interface for gathering listener ratings.
using webexp2.24 Subjects were allowed to play the sound files any number of times in
any order, but were required to assign ratings to all the utterances before proceeding to
the next screen. In assigning their ratings, subjects were instructed to pay attention to
the context given by the summary of the user’s request, keeping the following questions
in mind:
- Does the utterance make it clear how well the flight (or flights) in question
  meet the user’s needs?
- Are words emphasised in a way that highlights the trade-offs among the
  different options?
- For the second and subsequent utterances, is emphasis used in a way that
  makes sense given the previous system utterances?
- Is the utterance clear and easy to understand, or garbled and difficult to
  understand?
Twelve mini-dialogues were used as stimuli, comprising 31 utterances in total. The
dialogues were constructed so as to contain a representative range of theme phrases,
with each mini-dialogue containing one utterance with a theme phrase. An example
dialogue, with target tunes for the APML voice, appears in Figure 13; the second system
utterance contains a theme phrase. The complete set of stimuli, including sound files, is
available on-line25 from the first author’s web page.
Fourteen native English speaking subjects participated in the study. The subjects
were all from the UK or USA and had no known hearing deficits. For participating in
the study, subjects were entered into a prize drawing.
24 http://www.webexp.info/.
25 http://www.ling.ohio-state.edu/∼mwhite/flights-stimuli/.
Figure 13
Example mini-dialogue (Dialogue 03).
4.1.2 Results. As shown in Figure 14, the APML voice received higher ratings than the
ALL voice, and both the APML and ALL voices scored much higher than the GEN
voice. Overall, the APML voice’s average rating surpassed that of the ALL voice by
0.77; its score of 5.83 was close to 6 on the rating scale, corresponding to “mostly
natural,” while the ALL voice’s score was 5.06, just above 5 on the scale, corresponding
to “somewhat natural.” The difference between the two voices was highly significant
(paired t-test, t433 = 10.20, p < 0.001). On the theme utterances, the difference between
the APML and ALL voices was even larger, at 0.91. With the APML voice, there was
no significant difference between the average ratings of the theme utterances and those
without theme phrases. In contrast, the ALL voice showed a trend towards the theme
utterances scoring worse than the remaining ones (t-test, 1-tailed, t432 = −1.39, p =
0.08), with the average of the theme utterances 0.19 lower than that of the all-rheme
utterances. The GEN voice did considerably worse (0.48) on the theme utterances
(t-test, 1-tailed, t432 = −3.06, p = 0.001).
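
For reference, the headline comparison is a paired t-test over the 434 rating pairs (14 listeners by 31 utterances); the sketch below uses randomly generated stand-in ratings, since the raw data are not reproduced here:

```python
# A sketch of the ratings analysis with stand-in data (the real analysis
# pairs 434 ratings per voice: 14 listeners x 31 utterances).
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)                     # simulated 1-7 ratings
apml = np.clip(rng.normal(5.83, 1.0, 434), 1, 7)
all_ = np.clip(rng.normal(5.06, 1.0, 434), 1, 7)

t, p = ttest_rel(apml, all_)                       # paired t-test, df = 433
print(f"t(433) = {t:.2f}, p = {p:.2g}")
```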
4.2 Expert Prosody Evaluation
4.2.1 Methodology. To more directly examine whether the APML voice yielded more
contextually appropriate prosody than the ALL voice, we asked an expert ToBI anno-
tator, who was unfamiliar with the experimental hypotheses under investigation, to
annotate the perceived tunes in the synthesized stimuli for these two voices. In cases
of uncertainty, multiple annotations were given. The stimuli from the ALL voice were
always presented first, so that listening to the tune from the APML item would not bias
our annotator against the corresponding ALL item. Because there is not necessarily a
unique tune that is acceptable for the context, we also asked our annotator to indicate
the closest acceptable tune by indicating which accents and boundaries would need to
change in order to yield a tune appropriate for the context, in much the same fashion
as in our prosody prediction evaluation (Section 2.7).26 We then counted the number of
accent or boundary corrections at different levels of specificity.

Figure 14
Mean ratings for each voice. Theme utterances are the subset containing a theme phrase.
(Error bars show standard errors.)

Figure 15
Example annotation and f0 showing tune corrections (Item 05-2).
An example annotation with tune corrections appears in Figure 15. This utterance
is the second one in Dialogue 05, where the user prefers to fly BMI; accordingly, the
utterance contains the theme phrase (with target tune) the only BMIL+H∗ flight L-H%.
With the APML version of the utterance, the annotator perceived a L+H* accent on BMI
and a L-H% boundary on de-accented flight, as desired. Additionally, the annotator
perceived a H* accent on only, which was not part of the target tune. The tune was
nevertheless considered completely acceptable, as the Edits tier is identical to the Tones
tier. By contrast, with the ALL version of the utterance, BMI was less prominent than
only and flight, and accordingly it received no pitch accent, whereas only and flight
received L+H* and H* accents, respectively; in addition, a H- boundary was annotated
on only, and a boundary that was uncertain between L- and L-L% was annotated on
26 The target tunes were not presented together with the synthesized stimuli, to avoid influencing the tunes
perceived by our expert.
flight. This tune for the theme phrase was not considered acceptable for the context: As
the Edits tier shows, the lack of an accent on BMI, and the presence of an accent on
flight, were corrected by our annotator, as was the H- minor phrase boundary on only.
This choice makes sense for the context, as BMI is what distinguishes this option from
the one suggested in the first utterance, while flight is given information at this point in
the dialogue. Indeed, it is difficult to come up with an interpretation of onlyL+H∗ BMI,
with no accent on BMI, though this tune might make sense if the question of whether
the flight was code-shared with another airline was a salient one in the context.

Table 5
Number of prosodic corrections of different types in all utterances and theme utterances for the
two voices. Items in bold are significantly different at p = 0.05 or less by a one-tailed Fisher’s
Exact Test.

                          All Utts            Theme Utts
                       APML     ALL          APML     ALL
Total corrections        22      49            11      26
Accents                  10      33             4      16
  Presence                6      20             2       8
  L+H*                    4      13             2       8
Boundaries               12      16             7      10
  Presence                7       6             3       4
4.2.2 Results. The results of the expert prosody evaluation appear in Table 5. Across all 31
utterances, there were 49 accent or boundary corrections for the ALL voice, versus just
22 for the APML voice, out of 688 total accent and boundary choices, a highly significant
difference (p < 0.001, Fisher’s Exact Test, 1-tailed). With the 12 theme utterances, there
were 26 corrections for the ALL voice, versus 11 for the APML voice, out of 220 total
choices (p = 0.008, FET). Looking at the pitch accents, the difference in the number
of accent corrections was significant in each case (p < 0.001, FET, and p = 0.004, FET,
respectively), as was the number of corrections involving the presence or absence of a
pitch accent (p = 0.004, FET, and p = 0.05, FET, respectively) and the number involving
L+H* accents (p = 0.02, FET, and p = 0.05, FET, respectively).27 Looking at the bound-
aries, we may observe that with the ALL voice, the majority of corrections were with
accents, whereas with the APML voice, the majority were with boundaries. Although
the APML voice had fewer boundary corrections, the difference was not significant.
As noted in the discussion of Figure 15, the perceived tunes with the APML voice
did not always exactly match the target tunes, sometimes in ways that our expert
annotator considered acceptable. More commonly, however, mismatches between the
perceived and target tunes were not judged to be acceptable. In fact, of the 22 correc-
tions for the APML voice, 19 would have been acceptable had the target tune been
successfully achieved; that is, 19 of the 22 corrections changed a perceived accent or
boundary to the one in the target tune. In 12 of these 19 cases, our annotator perceived
an accent or boundary where the target tune had none. There were also five other cases
27 A correction was counted as involving the presence or absence of a pitch accent if the word was
perceived as having an accent when it was deemed that it should have none, or was perceived as having
no accent when it was deemed that it should have one; L+H* accent and boundary presence corrections
were counted in the same fashion.
where the target was a L-L% boundary, but a rising boundary was perceived instead.
The remaining cases involved accent type mismatches.

Figure 16
Mean ratings of items with 0–4 prosodic corrections for the APML and ALL voices combined.
(Error bars show standard errors.)
To examine the relationship between the number of corrections indicated by our
expert ToBI annotator and the ratings in the perception experiment, we grouped the
APML and ALL utterances by the number of corrections and calculated the mean
ratings for each group. The results appear in Figure 16, which shows the expected
relationship between mean ratings and the number of corrections, with ratings go-
ing down as the number of corrections goes up. The pattern is less clear with two to
four corrections, where there are fewer tokens. One-tailed t-tests show that utterances
with zero corrections were rated significantly higher than utterances with one to four
corrections (t39 = 2.96, t35 = 5.14, t30 = 3.03, t28 = 5.35, respectively; p < 0.01 in each
case); utterances with one correction were also significantly higher than those with four
corrections (t17 = 2.30, p = 0.02).
We also examined the relationship between the listener ratings and the number
of errors noted by our expert annotator through a multiple regression analysis. The
regressors were the number of accent or boundary presence errors and the number of
remaining accent or boundary errors, involving only a difference in type. The regression
equation was Rating = 5.85 − 0.41 × PresenceErrors − 0.28 × TypeOnlyErrors, which ac-
counted for 32% of the variance and was highly significant (F2,59 = 15.3, p < 0.001). As
expected, ratings were negatively related to accent and boundary errors, with presence
errors showing a greater impact than type-only errors. Both the effect of presence errors
(t59 = −4.34, p < 0.001) and the effect of type-only errors (t59 = −2.65, p = 0.01) were
significant. Because both kinds of errors had a significant impact on ratings, the results
suggest that it is worthwhile to make use of fine-grained ToBI categories where feasible,
as we discuss further in Section 6.
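
The shape of this regression analysis can be sketched with statsmodels; the arrays below are placeholders for the 62 per-item means (31 utterances by 2 voices) and the corresponding expert error counts:

```python
# A sketch of the regression over per-item mean ratings (statsmodels
# assumed available; placeholder arrays stand in for the 62 item means).
import numpy as np
import statsmodels.api as sm

presence = np.array([0, 1, 2, 0, 1, 3])    # presence/absence errors per item
type_only = np.array([1, 0, 2, 0, 1, 0])   # type-only errors per item
rating = np.array([5.6, 5.3, 4.1, 5.9, 5.0, 4.5])

X = sm.add_constant(np.column_stack([presence, type_only]))
fit = sm.OLS(rating, X).fit()
print(fit.params)               # [intercept, presence coef, type-only coef]
print(fit.rsquared, fit.f_pvalue)
```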
4.3 Objective Measures
4.3.1 Methodology. In addition to the expert prosody evaluation, we also measured the
f0 values and durations of the adjectives and nouns in the theme phrases, to see whether
they differed significantly between the APML and ALL voices. We also measured the
drop in f0 between the adjective, which should receive a contrastive pitch accent, and
the noun, which should be deaccented. Note that the f0 drop is a potentially more fine-
grained measure than pitch accent corrections, because the accents could be considered
acceptable even though the tune is less distinct in some cases.
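
The measure itself is simple to compute once per-word f0 tracks are available; a sketch with assumed inputs follows (the extraction of f0 from the synthesized audio is not shown):

```python
# A sketch (assumed inputs) of the f0-drop measure and its comparison:
# max f0 on the accented adjective minus min f0 on the deaccented noun,
# one value per theme phrase, compared with a two-sample t-test (df = 22).
import numpy as np
from scipy.stats import ttest_ind

def f0_drop(adj_f0, noun_f0):
    """adj_f0, noun_f0: arrays of voiced-frame f0 values (Hz) per word."""
    return float(np.max(adj_f0) - np.min(noun_f0))

# placeholder per-phrase drops for the 12 theme items in each voice
rng = np.random.default_rng(1)
apml_drops = 90 + rng.normal(0, 25, 12)
all_drops = 50 + rng.normal(0, 25, 12)

t, p_two = ttest_ind(apml_drops, all_drops)
print(f"t(22) = {t:.2f}, one-tailed p = {p_two / 2:.3f}")
```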
An example in which this occurs appears in Figure 17, which shows the second
utterance of Dialogue 03, seen earlier in Figure 13, for the APML and ALL voices.

Figure 17
Example annotation and f0 showing difference in f0 drop in the theme phrases (Item 03-2).

Here,
the APML version was annotated with the target accents, namely L+H* for Lufthansa
and none for flight, whereas the ALL version was annotated with a L+H* !H* pattern,
which was also considered acceptable. However, with the APML version the pitch drops
from an f0 value of 293 Hz to 177 Hz, whereas with the ALL version the pitch only drops
from 260 Hz to 209 Hz. As a result, the theme tune is much less distinctive in the ALL
version, in a way that could impact the ease with which the theme phrase’s discourse
function is identified.
4.3.2 Results. Figure 18 shows the mean f0 values across all 12 theme phrases for the
adjective, which should receive contrastive emphasis, and the head noun, which should
be reduced. As the figure shows, the pattern seen in Figure 17 was borne out on the
whole in the stimuli with theme phrases. In particular, the theme phrases for the APML
voice had an f0 drop of 90 Hz on average from the adjective to the noun, whereas the
ALL voice only had an f0 drop of 50 Hz, a significant difference (t-test, 1-tailed, t22 =
2.60, p = 0.008). The durations of the adjectives and nouns, however, did not show a
significant difference. In addition to the f0 drop, a significant difference was also found
with the f0 max on the adjective and the f0 min on the noun (t-tests, 1-tailed, t28 = 1.70,
p = 0.05 and t22 = −1.86, p = 0.04, respectively). Finally, a relatively high correlation (r =
0.78) was also found between the difference in f0 drop and the difference in the ratings
of the theme utterances for the two voices (t10 = 3.94, p = 0.001), suggesting that the less
distinctive theme tunes produced by the ALL voice were perceived as less natural than
the ones produced by the APML voice.
4.4 Discussion
The perception experiment confirmed our hypothesis that listeners would prefer the
APML voice, used with contextually appropriate intonation markup, over the ALL
and GEN control voices. Both on the complete set of utterances, as well as the subset
containing theme phrases, the APML voice was rated substantially higher on average
than the ALL voice, and much higher than the GEN voice. Note that the ALL voice was
also rated much higher on average than the GEN voice—which is not surprising, given
that it has access to the same limited domain data as the APML voice—showing that
the ALL voice serves as a tough baseline to beat. The expert prosody evaluation and the
objective measures also confirmed the superiority of the APML voice, in particular on
the theme utterances.
We also found evidence in favor of our hypothesis that the preference for the APML
voice would be larger for the utterances containing theme phrases—where the intona-
tion is more marked than it is in the all-rheme utterances—as the difference between
the mean ratings of the APML and ALL voices was larger on the theme utterances than
on the remaining ones. Additionally, with the APML voice, there was no significant
difference between the mean ratings of the theme utterances and those without theme
phrases, whereas the ALL voice showed a trend towards the theme utterances scoring
worse than the all-rheme ones, and the GEN voice clearly did considerably worse on
the theme utterances. One small surprise was that with the ALL voice, the difference
between the mean ratings of the theme and all-rheme utterances did not reach the
standard level of statistical significance. Of course, it could be that with a larger sample
size, a more significant difference would be found. However, it is undoubtedly the case
that the ALL voice dropped off less on the theme utterances than did the GEN voice,
and the reason is almost certainly that the limited domain data has good coverage of
the theme phrases, and thus the ALL voice often does reasonably well on the theme
utterances even without explicit prosodic control. What is perhaps more remarkable
is that the ALL voice did not do better on the all-rheme utterances, as can be seen in
the number of expert corrections listed in Table 5 for all the utterances, which go well
beyond those in the theme utterance subset. That the ALL voice had more than double
the number of corrections as the APML voice on both the complete set of utterances
and the subset containing theme phrases shows that the prosodic specifications were
important throughout.

Figure 18
Mean f0 values for emphasized and reduced words in theme phrases for the APML and ALL
voices. (Error bars show standard errors.)
Finally, we may observe that although the APML voice had fewer boundary correc-
tions than did the ALL voice, the difference did not reach significance, suggesting that
there is room for improvement in how boundaries are handled in the APML voice. In
particular, this result suggests that edge tones, and intermediate phrase boundaries in
particular, should affect the selection of units non-locally, as theoretically their effect on
the pitch contour spreads back to the last pitch accent. In fact, it may well be that because
the speech synthesis system only models prosodic effects locally, essentially at the
syllable level, and does not take utterance-level structures such as tune into account, a
ceiling has been reached for both accent and phrasing performance. Representing global
prosodic structures to ensure prosodic coherence will be one of the major challenges for
future generations of speech synthesis systems.
5. Related Work
The FLIGHTS system combines and extends earlier approaches to user-tailored genera-
tion in spoken dialogue. A distinguishing feature of FLIGHTS is that it adapts its output
according to user preferences at all levels of the generation process, from the selection
of content to linguistic realization and the prosody targeted in speech synthesis.
The most similar system to ours is MATCH (Walker et al. 2004), which employs sim-
pler content planning strategies and does not explicitly point out the trade-offs among
options. MATCH also uses simple templates for realization, and does not attempt to
control intonation. Carenini and Moore’s (2000, 2006) system is also closely related, but
it does not make comparisons, and generates text rather than speech. Carberry et al.’s
(1999) system likewise employs additive decision models in recommending courses,
though their focus is on dynamically acquiring a model of the student’s preferences, and
the system is limited to recommending a single option considered better than the user’s
current one. In addition, the system only addresses the problem of selecting positive
attributes to justify the recommendation, and does not plan and prosodically realize the
positive and negative attributes of multiple suggested options.
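For reference, the additive decision models mentioned here score each option o as a
weighted sum of single-attribute value functions (the notation is ours, for illustration):

    u(o) = \sum_{i=1}^{n} w_i \, v_i(x_i(o)), \qquad \sum_{i=1}^{n} w_i = 1

Here x_i(o) is option o's value on attribute i, v_i maps that value onto a desirability
scale, and the weights w_i encode the relative importance the user model assigns to each
attribute.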
These systems all employ a user model to select a small set of good options, and to
identify the attributes that justify their desirability, in order to present a summary, com-
parison, or recommendation to the user. Evaluation showed that tailoring recommen-
dations and comparisons to the user increased argument effectiveness and improved
user satisfaction (Walker et al. 2004). Thus, the user-model-based (UM) approach is an
appropriate strategy for spoken dialogue systems when there are a small number of
options to present, either because the number of options is limited or because users can
supply sufficient constraints to winnow down a large set before querying the database
of options.
Other researchers have argued that it is important to allow users to browse the data
for a number of reasons: (1) if there are many options that share attribute values, they
will be very close in score when ranked using the UM-based approach; (2) users may
not be able to provide constraints until they hear more information about the space of
options; and (3) the UM-based approach does not give users an overview of the option
space, and this may reduce their confidence that they have been told about the best
option(s) (Demberg and Moore 2006).
Polifroni, Chung, and Seneff (2003) proposed a “summarize and refine” (SR) ap-
proach, in which the system structures a large number of options into a small number
of clusters that share attributes. The system then summarizes the clusters based on their
attributes, implicitly prompting the user to provide additional constraints. The system
produces summaries such as
I have found 983 restaurants. Most of them are located in Boston and Cambridge. There are 32
choices for cuisine. I also have information about price range.
which help the user get an overview of the option space. Chung (2004) extended this
approach by proposing a constraint relaxation strategy for coping with queries that
are too restrictive to be satisfied by any option. Pon-Barry, Weng, and Varges (2006)
found that fewer dialogue turns were necessary when the system proactively suggested
refinements and relaxations.
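To make the summarize step concrete, the following is a minimal sketch of the SR idea,
with hypothetical option records, attribute names, and thresholds of our own devising,
not Polifroni, Chung, and Seneff's actual implementation:

```python
# Minimal sketch of the "summarize and refine" (SR) summarize step.
# Option records and attribute names are hypothetical illustrations.
from collections import Counter

def summarize(options, attributes):
    """For each attribute, report the dominant value among the options,
    or the number of alternatives, implicitly prompting refinement."""
    lines = [f"I have found {len(options)} options."]
    for attr in attributes:
        counts = Counter(o[attr] for o in options if attr in o)
        if not counts:
            continue
        top, freq = counts.most_common(1)[0]
        if freq > len(options) / 2:
            lines.append(f"Most of them have {attr} {top}.")
        else:
            lines.append(f"There are {len(counts)} choices for {attr}.")
    return " ".join(lines)

restaurants = [{"city": "Boston", "cuisine": "Thai"},
               {"city": "Boston", "cuisine": "Italian"},
               {"city": "Cambridge", "cuisine": "Indian"}]
print(summarize(restaurants, ["city", "cuisine"]))
# I have found 3 options. Most of them have city Boston.
# There are 3 choices for cuisine.
```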
However, as argued in Demberg and Moore (2006), there are several limitations
to the SR approach. First, many turns may be required during the refinement process.
Second, if there is no optimal solution, exploration of trade-offs is difficult. Finally, the
attributes on which the data has been clustered may be irrelevant for the specific user.
Demberg and Moore (2006) subsequently developed the user-model-based summarize
and refine approach (UMSR) to combine the benefits of the UM and SR approaches, by
integrating user modeling with automated clustering. When there are more than a small
number of relevant options, the UMSR approach builds a cluster-based tree structure
which orders the options to allow for stepwise refinement. The effectiveness of the tree
structure, which directs the dialogue flow, is optimized by taking the user’s preferences
into account. Trade-offs between alternative options are presented explicitly to give
the user a better overview of the option space and lead the user to a more informed
choice. To give the user confidence that they are being presented with all relevant
options, a brief account of the remaining (irrelevant) options is also provided. Results of
a laboratory experiment comparing the SR and UMSR approaches demonstrated that (1)
participants preferred UMSR, (2) UMSR presentations are as easy to understand as those
of SR, (3) UMSR increases overall user satisfaction, (4) UMSR significantly improves the
user’s overview of the available options, and (5) UMSR increases users’ confidence in
having heard about all relevant options. Although the UMSR approach has not been
implemented in the FLIGHTS system, it could be used when there are a large number
of available options to winnow them down to a handful of relevant ones, which would
then be compared following the approach described in this article.
As regards our work on intonation, Prevost's (1995) generator has, as stated in the
introduction, directly informed our approach to information structure and prosody;
however, his system does not make use of quantitative user models, and only describes single
options. Theune (2002) likewise follows Prevost’s approach in her system, refining the
way contrast is determined in assigning pitch accents. Theune et al. (2001) show that a
system employing syntactic templates and a rule-based prosody assignment algorithm
leads to more natural synthesis (of Dutch); unlike FLIGHTS though, their D2S system
does not employ a user preference model or a notion of theme phrase, and does not
distinguish different types of pitch accents. Also closely related is Kruijff-Korbayová
et al.'s (2003) information-state based dialogue system, in which the authors explore a
similar approach to using information structure across dialogue turns; however, their
system does not make use of a user model, and employs template-based realization
with much simpler sentence structures. Kruijff-Korbayová et al. likewise present an
evaluation indicating that the contextual appropriateness of spoken output (in Ger-
man) improves when intonation is assigned on the basis of information structure. In
comparison with our evaluation, theirs examines improvements over a general purpose
text-to-speech voice with default intonation, rather than a limited domain voice, which
provides a much higher baseline in terms of the naturalness of the resulting intonation.
Less closely related is most work on machine-learning approaches to prosody
prediction in text-to-speech (TTS) systems (Hirschberg 1993; Hirschberg and Prieto
1996; Taylor and Black 1998; Dusterhoff, Black, and Taylor 1999; Brenier et al. 2006)
and concept-to-speech (CTS) systems (Hitzeman et al. 1998, 1999; Pan, McKeown, and
Hirschberg 2002). These approaches have typically aimed to develop generic models
of prosody prediction, by training classifiers for accents and boundaries that make
use of a considerable variety of features. For example, in predicting accent placement
(but not type), Hitzeman et al.’s CTS system makes use of rhetorical relations such as
list and contrast, along with the reference type of NPs and whether they represent
first mentions of an entity in the discourse. In Pan et al.’s more comprehensive study,
their system predicts accent placement (but not type), break indices, and edge tones
based on features extracted from the SURGE realizer (Elhadad 1993); deep semantic
and discourse features, including semantic type, semantic abnormality, and given/new
status; and surface features, such as part of speech and word informativeness. However,
neither of these CTS approaches makes use of the theme/rheme distinction, or the
notion of kontrast that stems from Rooth’s (1992) work on alternative sets, both of
which are crucial to Steedman’s (2000a) theory of how information structure constrains
prosodic choices. More recently, Brenier et al. have shown that the ratio of accented to
unaccented tokens of a word in spontaneous speech is a surprisingly effective feature in
predicting pitch accents; they also argued that using information status and contrast is
unlikely to improve upon prominence prediction based only on surface features, since
these manually labeled features did not yield substantial improvements in their decision
tree models. Again, however, their approach does not make use of the theme/rheme
distinction, and does not attempt to predict pitch accent type or edge tones; in addition,
they report frequent errors on auxiliaries and negatives (e.g., no), which we have found
to be important for highlighting trade-offs prosodically.
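As one concrete reading of this finding, a word's accent ratio can be computed as the
proportion of its corpus tokens that bear a pitch accent. The sketch below assumes a
hypothetical corpus of (word, accented) pairs; it is meant only to illustrate the
feature, not Brenier et al.'s system:

```python
# Sketch of the accent-ratio feature: the proportion of a word's tokens
# in an annotated corpus that carry a pitch accent. The corpus format is
# a hypothetical illustration.
from collections import defaultdict

def accent_ratios(tokens):
    """tokens: iterable of (word, is_accented) pairs."""
    accented = defaultdict(int)
    total = defaultdict(int)
    for word, is_accented in tokens:
        total[word] += 1
        accented[word] += int(is_accented)
    return {w: accented[w] / total[w] for w in total}

corpus = [("flight", True), ("flight", True), ("direct", True),
          ("the", False), ("the", False), ("the", True)]
ratios = accent_ratios(corpus)
# ratios["flight"] == 1.0 (always accented); ratios["the"] == 1/3
```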
In contrast to these approaches, we have emphasized the generation and synthesis
of sharply distinctive theme and rheme tunes in the limited domain of a dialogue
system, using a hybrid rule-based and data-driven approach. In particular, whereas
Hitzeman et al. (1998, 1999) and Pan, McKeown, and Hirschberg (2002) make use of
individual classifiers for prosodic realization decisions—with no means of tying the
decisions of these classifiers together—our approach instead uses rules and constraints
in the grammar to specify a space of possible realizations, through which the realizer
searches to find a sequence of words, pitch accents, and edge tones that maximizes the
probability assigned by an n-gram model for the domain.
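The selection step just described can be reduced to the following minimal sketch:
candidate realizations interleave words with pitch accent and edge tone symbols, and the
realizer keeps the candidate the language model scores highest. The candidate sequences
and the bigram table here are hypothetical; the actual system searches OpenCCG's
realization chart with an n-gram model trained on domain data:

```python
import math

# Bigram score over mixed word/prosody symbol sequences; unseen bigrams
# get a floor log probability, standing in for proper smoothing.
def logprob(seq, bigrams, floor=-10.0):
    return sum(bigrams.get((a, b), floor) for a, b in zip(seq, seq[1:]))

# Among the realizations licensed by the grammar, keep the one the
# language model scores highest.
def best_realization(candidates, bigrams):
    return max(candidates, key=lambda seq: logprob(seq, bigrams))

# Hypothetical candidates: words interleaved with accent/edge-tone symbols.
candidates = [
    ["it", "has", "direct", "H*", "flights", "LL%"],
    ["it", "has", "H*", "direct", "flights", "LL%"],
]
# Hypothetical bigram log probabilities estimated from domain data.
bigrams = {("it", "has"): math.log(0.6), ("has", "direct"): math.log(0.3),
           ("direct", "H*"): math.log(0.4), ("H*", "flights"): math.log(0.2),
           ("flights", "LL%"): math.log(0.5)}

print(best_realization(candidates, bigrams))
# ['it', 'has', 'direct', 'H*', 'flights', 'LL%']
```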
In an approach that is more similar in spirit to ours, Bulyko and Ostendorf (2002)
likewise aim to reproduce distinctive intonational patterns in a limited domain. How-
ever, unlike our approach, theirs makes use of simple templates for generating para-
phrases, as their focus is on how deferring the final choice of wording and prosodic
realization to their synthesizer enables them to achieve more natural sounding synthetic
speech. Following on the work described in this article, Nakatsu and White (2006)
present a discriminative approach to realization ranking based on predicted synthesis
quality that is directly compatible with the FLIGHTS system.
Turning to our synthesis evaluation, we note that debate over the standardization of
speech synthesis evaluation continues, with the Blizzard Challenge (Black and Tokuda
2005; Fraser and King 2007) proving to be a useful forum for discussing and performing
evaluation across different synthesis platforms. Mayo, Clark, and King (2005) have pro-
posed to evaluate speech synthesis evaluation from a perceptual viewpoint to discover
exactly what subjects pay attention to, in order to ensure that evaluation actually is
evaluating what we think it is. The authors found through the use of multidimensional
scaling (MDS) techniques that, when asked to make judgments on the naturalness of
synthetic speech, subjects made judgments relating to a number of different dimensions,
including both segmental quality and prosody. Subsequently, Clark et al. (2007) found
correlations between results in MDS spaces and standard mean opinion score (MOS)
tests, but as the MOS tests did not correspond to single dimensions in the MDS space,
they suggested that it may be possible to design more informative evaluations by
asking subjects to specifically rate each factor of interest (e.g., prosody), where each
factor relates to one dimension in the MDS space. However, as no specific method is
suggested to guarantee reliable prosodic judgments from naive listeners, we have left
this question for future research, opting instead to augment the listener ratings gathered
in our perception experiment with an expert prosody evaluation and an f0 analysis of
the theme phrases.
6. Conclusions and Discussion
In this article, we have described an approach to presenting user-tailored information
in spoken dialogues which for the first time brings together multi-attribute decision
models, strategic content planning, surface realization which incorporates prosodic fea-
tures, and unit selection synthesis that takes the resulting prosodic markup into account.
Based on the user model, the system selects the most important subset of the available
options to mention and the attributes that are most relevant to choosing between
them. To convey the trade-offs among these options, the system employs a novel
presentation strategy
which makes it straightforward to determine information structure and the contents of
referring expressions. During surface realization with OpenCCG, the prosodic structure
is derived from the information structure in a way that allows phrase boundaries to be
determined in a flexible, data-driven fashion; in an expert evaluation, this approach
yielded significantly higher acceptability than baseline prosody prediction models. We
hypothesize that
the resulting descriptions are both memorable and easy for users to understand. As a
step towards verifying this hypothesis, we have presented an experiment which shows
that listeners perceive a unit selection voice that makes use of the prosodic markup
as significantly more natural than either of two baseline synthetic voices. Through
an expert evaluation and f0 analysis, we have also confirmed the superiority of the
generator-driven intonation and its contribution to listeners’ ratings. In future work, we
intend to examine the impact of our generation and synthesis methods on memorability
or other task-oriented measures.
The present study provides evidence that it is worthwhile to investigate methods
of developing general purpose synthesizers that accept prosodic specifications in their
input. The main reason that we did not use such a synthesizer in our evaluation is that
in order to build a general purpose Festival APML voice, suitable APML markup would
be required for the 2,000–3,000 utterances that make up the core unit selection database.
As these utterances are outside of the FLIGHTS domain (and thus not generated by
the NLG system), it would not be possible with current technology to provide accurate
APML markup for these utterances. Given the difficulty of automatically annotating
general texts with APML, it may be worth considering a simplified version of the
markup for the database. For example, a system which marks the location of pri-
mary phrasal stress, other pitch accents, and a simple categorization of overall tune
(wh-question, yes/no-question, statement, etc.) could be used to annotate the speech
database. This could be achieved with an accent detector and a very simple parser to
determine tune type. The APML specification on the input could easily be mapped to
information equivalent to the database annotation by simple rules. Reasonable quality
synthesis could then be achieved without the database being fully parsed and anno-
tated. Additionally, if there is a portion of in-domain data in the database where full
annotation is available, it could be used directly when those units are searched.
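As a sketch of what such a simplified annotation might look like, the following keeps
only accent positions, the primary phrasal stress, and a coarse tune category; the input
format and the tune heuristics are our own illustrative assumptions, not an actual APML
processor:

```python
# Sketch of a simplified prosodic annotation: accent locations, primary
# (nuclear) stress, and a coarse tune category. The input format and the
# tune heuristics are hypothetical illustrations.
def simplify(words, accents, final_tone):
    """words: list of tokens; accents: {word index: accent type, e.g. 'H*'};
    final_tone: utterance-final edge tone, e.g. 'LL%' or 'HH%'."""
    positions = sorted(accents)
    nuclear = positions[-1] if positions else None  # last accent as primary stress
    if words and words[0].lower() in {"who", "what", "when", "where", "which", "how"}:
        tune = "wh-question"
    elif final_tone.endswith("H%"):
        tune = "yes/no-question"
    else:
        tune = "statement"
    return {"accents": positions, "nuclear": nuclear, "tune": tune}

print(simplify(["does", "it", "arrive", "at", "four"], {2: "H*", 4: "H*"}, "HH%"))
# {'accents': [2, 4], 'nuclear': 4, 'tune': 'yes/no-question'}
```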
An interesting unresolved question is the extent to which more generic prosody
prediction models, along the lines of Hitzeman et al. (1999) and Pan, McKeown, and
Hirschberg (2002)—which make no use of such information structural notions as theme,
rheme, and kontrast (cf. Section 2.5)—could be trained to produce tunes as distinctive
as those we have targeted. We suspect that such models would have trouble doing so,
given data sparsity issues and the fact that machine-learned classification models tend
to discover general trends, rather than aiming to reproduce aspects of specific examples,
which may contain important but rare events. At the same time, it remains for us to
investigate whether our hybrid rule-based and data-driven approach can be generalized
to be as flexible and widely applicable as these machine-learned models, while retaining
its ability to express contrasts intelligibly. In so doing, we expect information structural
constraints in the grammar to continue to play an important role.
Acknowledgments
We thank Rachel Baker, Steve Conway, Mary
Ellen Foster, Kallirroi Georgila, Oliver
Lemon, Colin Matheson, Neide Franca
Rocha, and Mark Steedman for their
contributions to the FLIGHTS system; Eric
Fosler-Lussier and Craige Roberts for helpful
discussion; and Julie McGory for providing
the expert ToBI annotations. We also thank
the anonymous reviewers for helpful
comments and suggestions. This work was
supported in part by EPSRC grant
GR/R02450/01 and an Arts & Humanities
Innovation grant from The Ohio State
University.
References
Aylett, Matthew. 2005. Merging data driven
and rule based prosodic models for unit
selection TTS. In 5th ISCA Speech Synthesis
Workshop, pages 55–59, Pittsburgh, PA.
Baker, Rachel Elizabeth. 2003. Using unit
selection to synthesise contextually
appropriate intonation in limited domain
synthesis. Master’s thesis, Department of
Linguistics, University of Edinburgh.
Baldridge, Jason and Geert-Jan Kruijff. 2002.
Coupling CCG and Hybrid Logic
Dependency Semantics. In Proceedings of
40th Annual Meeting of the Association for
Computational Linguistics, pages 319–326,
Philadelphia, PA.
Bangalore, Srinivas and Owen Rambow.
2000. Exploiting a probabilistic hierarchical
model for generation. In Proceedings of
COLING-00, pages 42–48, Saarbrücken,
Germany.
Bilmes, Jeff and Katrin Kirchhoff. 2003.
Factored language models and general
parallelized backoff. In Proceedings of
HLT-03, pages 4–6, Edmonton, Canada.
Black, Alan W. and Keiichi Tokuda. 2005.
The blizzard challenge—2005: Evaluating
corpus-based speech synthesis on
common datasets. In Interspeech 2005,
pages 77–80, Lisbon.
Blackburn, Patrick. 2000. Representation,
reasoning, and relational structures: A
hybrid logic manifesto. Logic Journal of the
IGPL, 8(3):339–365.
Bos, Johan, Ewan Klein, Oliver Lemon,
and Tetsushi Oka. 2003. DIPPER:
Description and formalisation of an
information-state update dialogue system
architecture. In 4th SIGdial Workshop on
Discourse and Dialogue, pages 115–124,
Sapporo.
Brenier, Jason, Ani Nenkova, Anubha
Kothari, Laura Whitton, David Beaver,
and Dan Jurafsky. 2006. The (non)utility
of linguistic features for predicting
prominence in spontaneous speech.
In IEEE/ACL 2006 Workshop on Spoken
Language Technology, pages 54–57, Palm
Beach, Aruba.
Bulyko, Ivan and Mari Ostendorf. 2002.
Efficient integrated response generation
from multiple targets using weighted finite
state transducers. Computer Speech and
Language, 16:533–550.
Calhoun, Sasha, Malvina Nissim, Mark
Steedman, and Jason Brenier. 2005. A
framework for annotating information
structure in discourse. In Proceedings of the
ACL-05 Workshop on Frontiers in Corpus
Annotation II: Pie in the Sky, pages 45–52,
Ann Arbor, Michigan.
Carberry, Sandra, Jennifer Chu-Carroll, and
Stephanie Elzer. 1999. Constructing and
utilizing a model of user preferences in
collaborative consultation dialogues.
Computational Intelligence Journal,
15(3):185–217.
Carenini, Giuseppe and Johanna D. Moore.
2000. A strategy for generating evaluative
arguments. In Proceedings of INLG-00,
pages 47–54, Mitzpe Ramon.
Carenini, Giuseppe and Johanna D. Moore.
2001. An empirical study of the influence
of user tailoring on evaluative argument
effectiveness. In Proceedings of IJCAI-01,
pages 1307–1314, Seattle, WA.
Carenini, Giuseppe and Johanna D. Moore.
2006. Generating and evaluating
evaluative arguments. Artificial Intelligence,
170:925–952.
Demberg, Vera and Johanna D. Moore. 2006.
Information presentation in spoken
dialogue systems. In Proceedings of the 11th
Conference of the European Chapter of the
Association for Computational Linguistics
(EACL-06), pages 65–72, Trento, Italy.
Dusterhoff, Kurt E., Alan W. Black, and
Paul A. Taylor. 1999. Using decision trees
within the tilt intonation model to
predict f0 contours. In Eurospeech-99,
pages 1627–1630, Budapest, Hungary.
Edwards, W. and F. H. Barron. 1994.
SMARTS and SMARTER: Improved
simple methods for multiattribute utility
measurement. Organizational Behavior and
Human Decision Processes, 60:306–325.
Carroll, John, Ann Copestake, Dan
Flickinger, and Victor Poznański. 1999.
An efficient chart generator for (semi-)
lexicalist grammars. In Proceedings of
EWNLG-99, pages 86–95, Toulouse, France.
Carroll, John and Stefan Oepen. 2005. High
efficiency realization for a wide-coverage
unification grammar. In Proceedings of
IJCNLP-05, pages 165–176, Jeju Island,
Korea.
Chung, Grace. 2004. Developing a flexible
spoken dialog system using simulation.
In Proceedings of ACL ’04, pages 63–70,
Barcelona.
Clark, Robert A. J. and Simon King. 2006.
Joint prosodic and segmental unit selection
speech synthesis. In Proceedings of
Interspeech 2006, pages 1312–1315,
Pittsburgh, PA.
Clark, Robert A. J., Monika Podsiadlo, Mark
Fraser, Catherine Mayo, and Simon King.
2007. Statistical analysis of the Blizzard
Challenge 2007 listening test results. In
Proceedings of Blizzard Workshop (in
Proceedings of the 6th ISCA Workshop on
Speech Synthesis), pages 1–6, Bonn,
Germany.
Clark, Robert A. J., Korin Richmond, and
Simon King. 2007. Multisyn: Open-domain
unit selection for the Festival speech
synthesis system. Speech Communication,
49(4):317–330.
Currie, K. and A. Tate. 1991. O-Plan: The
open planning architecture. Artificial
Intelligence, 52:49–86.
de Carolis, Berardina, Catherine Pelachaud,
Isabella Poggi, and Mark Steedman. 2004.
APML, a Mark-up Language for
Believable Behavior Generation. In
H. Prendinger and M. Ishizuka, editors,
Life-like Characters. Tools, Affective Functions
and Applications. Springer, Berlin,
pages 65–85.
Elhadad, Michael. 1993. Using Argumentation
to Control Lexical Choice: A Functional
Unification Implementation. Ph.D. thesis,
Columbia University.
Foster, Mary Ellen and Michael White. 2004.
Techniques for Text Planning with XSLT.
In Proceedings of the 4th NLPXML Workshop,
pages 1–8, Barcelona, Spain.
Fraser, Mark and Simon King. 2007. The
Blizzard Challenge 2007. In Proceedings of
Blizzard Workshop (in Proceedings of the
6th ISCA Workshop on Speech Synthesis),
pages 7–12, Bonn, Germany.
Ginzburg, Jonathan. 1996. Interrogatives:
Questions, facts and dialogue. In
Shalom Lappin, editor, The Handbook of
Contemporary Semantic Theory. Blackwell,
Oxford, pages 385–422.
Hirschberg, Julia. 1993. Pitch accent in
context: Predicting intonational
prominence from text. Artificial Intelligence,
63:305–340.
Hirschberg, Julia and Pilar Prieto. 1996.
Training intonational phrasing rules
automatically for English and Spanish
text-to-speech. Speech Communication,
18:281–290.
Hitzeman, Janet, Alan W. Black, Chris
Mellish, Jon Oberlander, Massimo Poesio,
and Paul Taylor. 1999. An annotation
scheme for concept-to-speech synthesis.
In Proceedings of EWNLG-99, pages 59–66,
Toulouse.
Hitzeman, Janet, Alan W. Black, Chris
Mellish, Jon Oberlander, and Paul Taylor.
1998. On the use of automatically
generated discourse-level information in a
concept-to-speech synthesis system. In
Proceedings of ICSLP-98, pages 2763–2768,
Sydney, Australia.
Kay, Martin. 1996. Chart generation. In
Proceedings of ACL-96, pages 200–204,
Santa Cruz, USA.
Knight, Kevin and Vasileios
Hatzivassiloglou. 1995. Two-level,
many-paths generation. In Proceedings of
ACL-95, pages 252–260, Cambridge, MA.
Kruijff, Geert-Jan M. 2003. Binding across
boundaries. In Geert-Jan M. Kruijff and
Richard T. Oehrle, editors, Resource-
Sensitivity in Binding and Anaphora. Kluwer
Academic Publishers, pages 123–158,
Dordrecht, The Netherlands.
Kruijff-Korbayová, Ivana, Stina Ericsson,
Kepa J. Rodríguez, and Elena Karagjosova.
2003. Producing contextually appropriate
intonation in an information-state based
dialogue system. In Proceedings of EACL-03,
pages 227–234, Budapest, Hungary.
Langkilde, Irene. 2000. Forest-based
statistical sentence generation. In
Proceedings of NAACL-00, pages 170–177,
Seattle, Washington.
Langkilde-Geary, Irene. 2002. An empirical
verification of coverage and correctness
for a general-purpose sentence generator.
In Proceedings of INLG-02, pages 17–24,
New York, NY.
Linden, Greg, Steve Hanks, and Neal Lesh.
1997. Interactive assessment of user
preference models: The automated
travel assistant. In Proceedings of User
Modeling ’97, pages 67–78, Chia Laguna,
Sardinia, Italy.
Marsi, Erwin. 2004. Optionality in evaluating
prosody prediction. In Proceedings of the
5th ISCA Speech Synthesis Workshop,
pages 13–18, Pittsburgh, PA.
Martin, D. L., A. J. Cheyer, and D. B. Moran.
1999. The Open Agent Architecture: A
framework for building distributed
software systems. Applied Artificial
Intelligence, 13(1):91–128.
Mayo, C., R. A. J. Clark, and S. King. 2005.
Multidimensional scaling of listener
responses to synthetic speech. In
Interspeech 2005, pages 1725–1728, Lisbon.
Moore, Johanna, Mary Ellen Foster, Oliver
Lemon, and Michael White. 2004.
Generating tailored, comparative
descriptions in spoken dialogue. In
Proceedings of FLAIRS-04, pages 917–922,
Miami Beach, USA.
Nakatsu, Crystal and Michael White. 2006.
Learning to say it well: Reranking
realizations by predicted synthesis quality.
In Proceedings of COLING-ACL ’06,
pages 1113–1120, Sydney, Australia.
Oh, Alice H. and Alexander I. Rudnicky.
2002. Stochastic natural language
generation for spoken dialog systems.
Computer, Speech & Language,
16(3/4):387–407.
Pan, Shimei, Kathleen McKeown, and Julia
Hirschberg. 2002. Exploring features from
natural language generation for prosody
modeling. Computer Speech and Language,
16:457–490.
Papineni, Kishore, Salim Roukos, Todd
Ward, and Wei-Jing Zhu. 2001. Bleu: A
method for Automatic Evaluation of
Machine Translation. Technical Report
RC22176, IBM.
Pierrehumbert, Janet. 1980. The Phonology
and Phonetics of English Intonation. Ph.D.
thesis, MIT.
Polifroni, Joseph, Grace Chung, and
Stephanie Seneff. 2003. Towards
automatic generation of mixed-initiative
dialogue systems from Web content.
In Proceedings of Eurospeech ’03,
pages 193–196, Geneva.
Pon-Barry, Heather, Fuliang Weng, and
Sebastian Varges. 2006. Evaluation of
content presentation strategies for an
in-car spoken dialogue system. In
Proceedings of Interspeech 2006,
pages 1930–1933, Pittsburgh, PA.
Prevost, Scott. 1995. A Semantics of Contrast
and Information Structure for Specifying
Intonation in Spoken Language Generation.
Ph.D. thesis, University of Pennsylvania.
Ratnaparkhi, Adwait. 2002. Trainable
approaches to surface natural language
generation and their application to
conversational dialog systems. Computer,
Speech & Language, 16(3/4):435–455.
Reiter, Ehud and Robert Dale. 2000. Building
Natural Language Generation Systems.
Cambridge University Press, Cambridge.
Ristad, Eric S. 1995. A Natural Law of
Succession. Technical Report
CS-TR-495-95, Princeton University.
Roberts, Craige. 1996. Information
structure: Towards an integrated
formal theory of pragmatics. Ohio State
University Working Papers in Linguistics,
49:91–136.
Moore, Robert C. 2002. A complete, efficient
sentence-realization algorithm for
unification grammar. In Proceedings of
INLG-02, pages 41–48, New York, NY.
Rocha, Neide Franca. 2004. Evaluating
prosodic markup in a spoken dialogue
system. Master's thesis, Department of
Linguistics, University of Edinburgh.
Rooth, Mats. 1992. A theory of focus
interpretation. Natural Language Semantics,
1:75–116.
Shemtov, Hadar. 1997. Ambiguity Management
in Natural Language Generation. Ph.D.
thesis, Stanford University.
Silverman, K., M. Beckman, J. Pitrelli,
M. Ostendorf, C. Wightman, P. Price,
J. Pierrehumbert, and J. Hirschberg. 1992.
ToBI: A standard for labeling English
prosody. In Proceedings of ICSLP-92, volume 2, pages 867–870,
Banff, Canada.
Snover, Matthew, Bonnie Dorr, Richard
Schwartz, Linnea Micciulla, and John
Makhoul. 2006. A study of translation edit
rate with targeted human annotation. In
Proceedings of the Association for Machine
Translation in the Americas (AMTA-06),
pages 223–231, Cambridge, MA.
Steedman, Mark. 2000a. Information
structure and the syntax-phonology
interface. Linguistic Inquiry,
31(4):649–689.
Steedman, Mark. 2000b. The Syntactic Process.
MIT Press, Cambridge, MA.
Steedman, Mark. 2004. Using APML to
specify intonation. Magicster Project
Deliverable 2.5. University of Edinburgh.
Available at http://www.ltg.ed.ac.uk/
magicster/deliverables/annex2.5/
apml-howto.pdf
Steedman, Mark. 2006. Information-
structural semantics for English
intonation. In Chungmin Lee, Matt
Gordon, and Daniel Büring, editors,
Topic and Focus: Cross-Linguistic
Perspectives on Meaning and Intonation.
Springer, Dordrecht, pages 245–264.
Stolcke, Andreas. 2002. SRILM — An
extensible language modeling toolkit. In
Proceedings of ICSLP-02, pages 901–904,
Denver, Colorado.
Taylor, Paul and Alan Black. 1998. Assigning
phrase breaks from part-of-speech
sequences. Computer Speech and Language,
12:99–117.
Taylor, P., A. Black, and R. Caley. 1998. The
architecture of the Festival speech
synthesis system. In Third International
Workshop on Speech Synthesis,
pages 147–151, Sydney.
Theune, Mariët. 2002. Contrast in
concept-to-speech generation. Computer
Speech and Language, 16:491–531.
Theune, Mariët, Esther Klabbers, Jan-Roelof
de Pijper, Emiel Krahmer, and Jan Odijk.
2001. From data to speech: A general
approach. Natural Language Engineering,
7(1):47–86.
Vallduví, Enric and Maria Vilkuna. 1998. On
rheme and kontrast. In Peter Culicover
and Louise McNally, editors, Syntax and
Semantics, Vol. 29: The Limits of Syntax.
Academic Press, San Diego, CA,
pages 79–108.
van Deemter, Kees, Emiel Krahmer, and
Mariët Theune. 2005. Real versus
template-based natural language
generation: A false opposition?
Computational Linguistics, 31(1):15–23.
van Santen, Jan, Alexander Kain, Esther
Klabbers, and Taniya Mishra. 2005.
Synthesis of prosody using multi-level
unit sequences. Speech Communication,
46(3–4):365–375.
Walker, M. A., S. Whittaker, A. Stent,
P. Maloor, J. D. Moore, M. Johnston, and
G. Vasireddy. 2002. Speech-plans:
Generating evaluative responses in spoken
dialogue. In Proceedings of INLG ’02,
pages 73–80, New York, NY.
Walker, M. A., S. J. Whittaker, A. Stent,
P. Maloor, J. D. Moore, M. Johnston, and
G. Vasireddy. 2004. Generation and
evaluation of user-tailored responses in
multimodal dialogue. Cognitive Science,
28:811–840.
Walker, Marilyn A., Rebecca Passonneau,
and Julie E. Boland. 2001. Quantitative
and qualitative evaluation of DARPA
Communicator spoken dialogue systems.
In Proceedings of ACL-01, pages 515–522,
Toulouse, France.
Walker, Marilyn A., Owen C. Rambow, and
Monica Rogati. 2002. Training a sentence
planner for spoken dialogue using
boosting. Computer Speech and Language,
16:409–433.
White, Michael. 2004. Reining in CCG Chart
Realization. In Proceedings of INLG-04,
pages 182–191, Brockenhurst, UK.
White, Michael. 2006a. CCG chart realization
from disjunctive inputs. In Proceedings of
INLG-06, pages 12–19, Sydney, Australia.
White, Michael. 2006b. Efficient realization of
coordinate structures in Combinatory
Categorial Grammar. Research on Language
& Computation, 4(1):39–75.